Name: Paul Hsieh
eMail: qed@pobox.com
Date: June 28, 1998 at 03:19:22
Subject: Re: Possible P6 bug
In Reply To: Re: Possible P6 bug by Christian Ludloff on June 25, 1998 at 12:05:39
Text: | This is an abbreviated copy of my initial eMail response to Colin.
|
|| [...]
|| This code should take exactly 4 clock cycles. However, depending upon
|| the alignment of the loop, it takes between 4.00 and 5.32 clock cycles.
|| To be specific:
|| [...]
|
| Read Intel's optimization guides carefully! The alignment of the code, and
| the alignment of branch target play a role. Intel never claimed that their
| estimated clock cycles would be correct for anything but "perfect code" in
| regards to all these side issues.

:o) Well, Intel's documentation says a lot of interesting things (and
leaves a lot of interesting things unsaid, too.)

|| Furthermore, if the cpuid is taken out, the times vary depending upon
|| what code is executed before.
|
| Of course. Because CPUID is serializing, it will ensure that all the units
| have finished their stuff. So your code won't be executed in parallel with
| any code from before the CPUID. Which makes the core schedule differently.

Given the way it's used in the code, I assume Colin realized this. He is
making the point that the loop's steady state can be heavily influenced
by the initial state of the pipeline.

In theory everything should eventually flush out of the reservation
station and some sort of repeating steady state dependent on only the
contents of the loop should arise. However, Colin's investigations
pretty much prove that this is not the case.

|| I posted to comp.lang.asm.x86 about this a few weeks ago, and, although
|| no-one there could explain it, one person (Yves Gallot) noticed that
|| adding the line
||
|| mov edx,mem_var
||
|| causes the code to revert to taking 4 clocks (although this was before I
|| looked into the different alignments, so this might be a red herring).
|
| The Pentium Pro works optimal, when instructions decode to 4-1-1 micro-OPs
| all the time.

All of the instructions used are of the "1" variety, so in this case this
is not an issue.

I've noticed that this "trick" of adding bogus load instructions can work
in many loops that I have tried (they have to do with my job, so I'd
rather not discuss the details.)

| [...] Again, check with their Optimization guide, #242816-003, and
| maybe also give their VTune program a trial.

Actually Intel has other tools which you can find (at this moment in time)
at:

http://developer.intel.com/drg/mmx/AppNotes/PERFMON.HTM
http://developer.intel.com/drg/pentiumII/appnotes/p6perfnt.htm

Or you could figure it out yourself by using some of the MSRs that track
things like "the number of RAT resource conflicts" or something like
that. The RAT (Register Alias Table) is a bit of a mystery to me
(and to most, I think, given Intel's weak documentation.) I think this is
where all the register renaming and register loading goes on.

|| PS. Again, it may be a red herring, but all the code samples I've
|| found so far which exhibit this behaviour have an overabundance of
|| integer instructions. Paul Hsieh speculated that a necessary condition
|| might be the decoders running faster than the execution units.
|
| That certainly is a possibility.

Well, this theory was motivated by the fact that the instructions could
be decoded in 3 clocks but could not all be executed in less than 4
clocks. This eventually leads to situations where there are 3 integer
instructions that could be executed in parallel. Now because the whole
thing is so out of order oriented, it means the scheduler may execute any
two of the three in parallel (and not more since there are only two
integer units.)

It's not hard to see that there are scheduling choices that might starve
some instructions while getting very far ahead with others. Now once the
reservation station gets full of the starved instructions, the CPU starts
losing clocks because it cannot feed the other instructions into the
reservation station.

Putting in the bogus load instruction slows the decoder down so that it
does not run into a situation where it has too many integer instructions
to choose from to dispatch to the execution units per clock.
At least that was my theory.
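
The theory above can be sketched as a toy model (Python, purely
illustrative). It only shows the reservation station filling up and the
decoder being blocked when decode outruns execution; it does not model
dependencies, ports, the real scheduling policy, or lost clocks. The name
rs_backlog, the 3-per-clock decode rate, the 2-per-clock integer execution
rate and the 20-entry reservation station are assumptions chosen for the
example, not figures established in this thread.

```python
def rs_backlog(decode_per_clock, exec_per_clock, rs_size=20, clocks=30):
    """Toy model: count clocks where the decoder is blocked by a full RS.

    Returns (occupancy after each clock, number of clocks in which the
    decoder could not deliver its full rate). All parameters are
    illustrative assumptions, not measured P6 behaviour.
    """
    occupancy = 0        # instructions waiting in the reservation station
    blocked_clocks = 0   # clocks where the decoder had to hold back
    history = []
    for _ in range(clocks):
        # The execution units drain what they can this clock.
        occupancy -= min(occupancy, exec_per_clock)
        # The decoder delivers as much as fits into the free entries.
        delivered = min(decode_per_clock, rs_size - occupancy)
        if delivered < decode_per_clock:
            blocked_clocks += 1
        occupancy += delivered
        history.append(occupancy)
    return history, blocked_clocks


if __name__ == "__main__":
    # Decoders outrunning two integer units: the backlog grows until full.
    fast_history, fast_blocked = rs_backlog(decode_per_clock=3, exec_per_clock=2)
    # Decoder slowed down (crudely modelled as a lower decode rate).
    slow_history, slow_blocked = rs_backlog(decode_per_clock=2, exec_per_clock=2)
    print(max(fast_history), fast_blocked)
    print(max(slow_history), slow_blocked)
```

In this model the faster decoder fills all the entries within a couple of
dozen clocks and is blocked from then on, while the slowed-down decoder
never builds a backlog. Whether the real scheduler behaves anything like
this is exactly what remains unproven.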

But Colin's discovery about the alignment issue puts a bit of a different
spin on this. I suspect that branch target alignment must also be a
factor, since the two instruction fetch stages (which are supposed to
eliminate alignment issues) probably cannot be pre-fed deeply enough on a
branch.

What I mean is that after a taken branch, the second prefetch stage may
not receive enough bytes to hand to the decoder, depending on the
alignment of the branch target, simply because that stage is too many
clocks away from the processing of the branch itself.
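
As a rough illustration of why the alignment of the branch target could
matter, the sketch below computes how many useful bytes remain in the
fetch block that contains the target. It assumes the fetch unit delivers
aligned 16-byte blocks, which is an assumption made for this example and
not something established in this thread; the name bytes_from_target and
the example addresses are made up.

```python
FETCH_BLOCK = 16  # assumed size of an aligned instruction fetch block, in bytes


def bytes_from_target(target_addr, block=FETCH_BLOCK):
    """Bytes between the branch target and the end of its aligned fetch block."""
    return block - (target_addr % block)


if __name__ == "__main__":
    # Hypothetical branch target addresses at different alignments.
    for addr in (0x1000, 0x1008, 0x100F):
        print(hex(addr), bytes_from_target(addr))
```

Under that assumption, a target near the end of a block leaves the
decoders with very few bytes until the next block arrives, which is the
kind of effect speculated about above.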

So I really don't think this is a P6 bug, just a side effect of their
design trade-offs. The problem with all of this is that these are just
theories. Intel hasn't written enough clear documentation to figure any
of this out for sure. But I hope this discussion leads to some good
analysis of the P6 from a performance point of view.



