This is an abbreviated copy of my initial eMail response to Colin.
| [...]
| This code should take exactly 4 clock cycles. However, depending upon
| the alignment of the loop, it takes between 4.00 and 5.32 clock cycles.
| To be specific:
| [...]
Read Intel's optimization guides carefully! The alignment of the code and
the alignment of branch targets both play a role. Intel never claimed that
their estimated clock counts would hold for anything but "perfect code" with
regard to all these side issues.
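To make the alignment point concrete, here is a rough sketch in Python. It assumes the core fetches code in aligned 16-byte blocks, which is a simplification. The addresses and the 12-byte loop length are made up for illustration and are not taken from Colin's code. It only counts how many fetch blocks a loop body overlaps; it does not predict cycle counts.

```python
FETCH_BLOCK = 16  # bytes; assumed size of one aligned instruction-fetch block


def fetch_blocks_touched(start, length, block=FETCH_BLOCK):
    """Count the aligned blocks overlapped by the bytes [start, start + length)."""
    if length <= 0:
        return 0
    first = start // block
    last = (start + length - 1) // block
    return last - first + 1


def padding_to_align(start, block=FETCH_BLOCK):
    """Bytes of padding needed to move start up to the next block boundary."""
    return (-start) % block


# Hypothetical 12-byte loop at two different start addresses.
aligned = fetch_blocks_touched(0x1000, 12)
straddling = fetch_blocks_touched(0x100A, 12)
pad = padding_to_align(0x100A)
```

With these made-up numbers the same loop fits in one block when it starts at 0x1000, but overlaps two when it starts at 0x100A. Six bytes of padding would move it to the next boundary.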
| Furthermore, if the cpuid is taken out, the times vary depending upon
| what code is executed before.
Of course. Because CPUID is serializing, it ensures that all the units have
finished their work, so your code won't be executed in parallel with any
code from before the CPUID. Without it, earlier instructions can still be in
flight, which makes the core schedule your code differently.
| I posted to comp.asm.lang.x86 about this a few weeks ago, and, although
| no-one there could explain it, one person (Yves Gallot) noticed that
| adding the line
|
| mov edx,mem_var
|
| causes the code to revert to taking 4 clocks (although this was before I
| looked into the different alignments, so this might be a red herring).
The Pentium Pro works optimally when instructions decode in the 4-1-1
micro-op pattern all the time. Again, check Intel's optimization guide,
#242816-003, and maybe also give their VTune program a trial.
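As a rough illustration of the 4-1-1 rule, the Python sketch below counts decode cycles for a list of per-instruction micro-op counts. It is a simplified model, and the function name and inputs are hypothetical. It assumes that decoder 0 accepts an instruction of up to 4 micro-ops, that decoders 1 and 2 accept only single-micro-op instructions, and that decoding is strictly in order. It ignores instruction fetch, instruction length, and instructions of more than 4 micro-ops.

```python
def decode_cycles(uop_counts):
    """Cycles needed to decode instructions in order under a 4-1-1 template.

    uop_counts lists the micro-ops per instruction in program order.
    Assumption: every instruction has between 1 and 4 micro-ops.
    """
    cycles = 0
    i = 0
    n = len(uop_counts)
    while i < n:
        if not 1 <= uop_counts[i] <= 4:
            raise ValueError("this model only handles 1 to 4 micro-ops")
        i += 1  # decoder 0 takes the next instruction, whatever its size
        for _ in range(2):  # decoders 1 and 2 take single-micro-op instructions only
            if i < n and uop_counts[i] == 1:
                i += 1
            else:
                break
        cycles += 1
    return cycles


# The same six instructions in two different orders.
well_ordered = decode_cycles([4, 1, 1, 4, 1, 1])
badly_ordered = decode_cycles([1, 1, 4, 1, 1, 4])
```

In this model the first ordering decodes in 2 cycles and the second in 3, because a multi-micro-op instruction that arrives at decoder 1 or 2 has to wait for decoder 0 in the next cycle.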
| PS. Again, it may be a red herring, but all the code samples I've
| found so far which exhibit this behaviour have an overabundance of
| integer instructions. Paul Hsieh speculated that a necessary condition
| might be the decoders running faster than the execution units.
That certainly is a possibility.
--
CL