discussion forum
message
Name: Colin Percival
eMail: Colin_Percival@sfu.ca
Date: June 25, 1998 at 12:02:25
Subject: Possible P6 bug
Text:
I have posted this on behalf of Colin. He suggested doing so, but was unable to post it himself using Lynx. I am looking into setting up an eMail alias that would route messages into the discussion forum.
-- CL

---------------------------------------------------------------------

About a month ago I started to work on optimizing some of my code for the P6 core. I quickly found, however, that my code was taking very strange numbers of clock cycles. The following loop is the simplest example I have found which exhibits this:

        mov eax,0
        cpuid
        mov ecx,10000
    l1: add eax,eax
        add ebx,ebx
        add eax,eax
        add ebx,ebx
        add eax,eax
        add ebx,ebx
        dec ecx
        jne l1

This code should take exactly 4 clock cycles per iteration. However, depending upon the alignment of the loop, it takes between 4.00 and 5.32 clock cycles. To be specific:

     0 mod 16    5.29 cycles
     1 mod 16    5.31 cycles
     2 mod 16    5.00 cycles
     3 mod 16    4.00 cycles
     4 mod 16    5.31 cycles
     5 mod 16    4.00 cycles
     6 mod 16    5.31 cycles
     7 mod 16    4.00 cycles
     8 mod 16    4.93 cycles
     9 mod 16    5.28 cycles
    10 mod 16    5.31 cycles
    11 mod 16    5.33 cycles
    12 mod 16    5.26 cycles
    13 mod 16    5.32 cycles
    14 mod 16    5.32 cycles
    15 mod 16    5.29 cycles

Furthermore, if the cpuid is taken out, the times vary depending upon what code was executed before the loop.

I posted to comp.lang.asm.x86 about this a few weeks ago, and, although no one there could explain it, one person (Yves Gallot) noticed that adding the line

        mov edx,mem_var

causes the code to revert to taking 4 clocks (although this was before I looked into the different alignments, so this might be a red herring). I have also communicated with Intel, but they are also mystified.

Do you have any idea why this is happening, or how to go about working it out?

Thanks,
Colin Percival

PS. Again, it may be a red herring, but all the code samples I've found so far which exhibit this behaviour have an overabundance of integer instructions. Paul Hsieh speculated that a necessary condition might be the decoders running faster than the execution units.

PPS. If you think it might help, please post this to your forum. Unfortunately, Lynx doesn't handle it very well.
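
For anyone who wants to work with the alignment figures in this message, they can be tabulated to show which offsets reach the expected 4 cycles and what the slower ones cost over the 10000 iterations. This is only a rough sketch in Python and is not part of the original measurements: the function names and the 0.005-cycle matching tolerance are assumptions made for illustration, and the data is simply copied from the table in the message.

```python
# Measured cycles per loop iteration, keyed by (loop start address mod 16).
# Values are copied from the table in the message; nothing here is re-measured.
CYCLES_BY_ALIGNMENT = {
    0: 5.29, 1: 5.31, 2: 5.00, 3: 4.00,
    4: 5.31, 5: 4.00, 6: 5.31, 7: 4.00,
    8: 4.93, 9: 5.28, 10: 5.31, 11: 5.33,
    12: 5.26, 13: 5.32, 14: 5.32, 15: 5.29,
}

EXPECTED_CYCLES = 4.00  # per-iteration figure the message expects
ITERATIONS = 10000      # loop count from "mov ecx,10000"


def fast_alignments(table, expected=EXPECTED_CYCLES, tolerance=0.005):
    """Offsets whose measured time matches the expected time.

    The tolerance is an assumed rounding allowance, not a measured value.
    """
    return sorted(a for a, c in table.items() if abs(c - expected) <= tolerance)


def slowdown_percent(cycles, expected=EXPECTED_CYCLES):
    """Slowdown relative to the expected per-iteration time, in percent."""
    return 100.0 * (cycles - expected) / expected


def extra_cycles(cycles, expected=EXPECTED_CYCLES, iterations=ITERATIONS):
    """Extra cycles spent over the whole loop, rounded to a whole cycle."""
    return round((cycles - expected) * iterations)


if __name__ == "__main__":
    print("offsets reaching the expected time:", fast_alignments(CYCLES_BY_ALIGNMENT))
    for offset in sorted(CYCLES_BY_ALIGNMENT):
        cycles = CYCLES_BY_ALIGNMENT[offset]
        print(f"{offset:2d} mod 16: {cycles:.2f} cycles, "
              f"{slowdown_percent(cycles):5.1f}% slower, "
              f"{extra_cycles(cycles)} extra cycles over the loop")
```

Laid out this way, it is easy to see that only three of the sixteen offsets reach the expected time, and that they are all odd offsets in the lower half of the 16-byte block. That describes the data only and does not explain the cause.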