| Well, this theory was motivated by the fact that the instructions could
| be decoded in 3 clocks but could not all be executed in less than 4
| clocks. This eventually leads to situations where there are 3 integer
| instructions that could be executed in parallel. Now because the whole
| thing is so out of order oriented, it means the scheduler may execute any
| two of the three in parallel (and not more since there are only two
| integer units.)
|
| It's not hard to see that there are scheduling choices that might starve
| some instructions while getting very far ahead with others. Now once the
| reservation station gets full of the starved instructions, the CPU starts
| losing clocks because it cannot feed the other instructions into the
| reservation station.

I think you are right. The two integer execution units are the
bottleneck here.

| But Colin's discovery about the alignment issue puts a bit of a different
| spin on this. I suspect that branch target alignment must also be a
| factor since the two instruction fetch stages (which are supposed to
| eliminate alignment issues) probably cannot be pre-fed deeply enough on a
| branch.

True. I have spent some time investigating alignment effects; you can find
the results in my manual. The loop should take 3 or 4 clocks to fetch and
decode, depending on the alignment.
The alignment effects can produce two different steady state conditions
depending on initial conditions, i.e. where the instruction fetch blocks
begin. These effects are undocumented.
The decoded uops go through the RAT stage 3 by 3. This gives a possibility
of 3 different steady states depending on which uops go together in the
beginning. This pattern is maintained as long as the queue between decoder
and RAT is not empty.
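
To illustrate the 3 by 3 grouping, here is a minimal sketch in Python. It is a toy model, not a description of the real hardware: it assumes the RAT takes exactly three uops every clock with no bubbles, and the names rat_groups, steady_state and count_steady_states, as well as the loop lengths used, are hypothetical and chosen only for illustration.

```python
from math import gcd


def rat_groups(loop_len, start_offset, n_groups, width=3):
    """Return the first n_groups groups of uop indices entering the RAT.

    Uops are numbered 0..loop_len-1 and repeat on every loop iteration.
    start_offset is the index of the first uop in the first group
    (an assumed stand-in for "which uops go together in the beginning").
    """
    groups = []
    pos = start_offset
    for _ in range(n_groups):
        groups.append(tuple((pos + i) % loop_len for i in range(width)))
        pos += width
    return groups


def steady_state(loop_len, start_offset, width=3):
    """Return the set of groups that keeps repeating for this start offset."""
    # The grouping pattern repeats after this many groups.
    period = loop_len // gcd(loop_len, width)
    return frozenset(rat_groups(loop_len, start_offset, period, width))


def count_steady_states(loop_len, width=3):
    """Count the distinct repeating patterns over all start offsets."""
    return len({steady_state(loop_len, s, width) for s in range(loop_len)})


if __name__ == "__main__":
    for offset in range(3):
        print(offset, rat_groups(6, offset, 2))
    print(count_steady_states(6), count_steady_states(4))
```

In this model a loop of six uops has three distinct repeating patterns, one per starting offset, while a loop of four uops cycles through every grouping and so has only one. The real pattern also depends on the decoder and on the queue staying non-empty, which the sketch ignores.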