Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!husc6!mit-eddie!uw-beaver!tektronix!tekcrl!tekgvs!toma
From: toma@tekgvs.TEK.COM (Thomas Almy)
Newsgroups: comp.arch,comp.sys.intel
Subject: Re: 386 pipeline details
Message-ID: <2271@tekgvs.TEK.COM>
Date: Fri, 8-May-87 11:05:14 EDT
Article-I.D.: tekgvs.2271
Posted: Fri May 8 11:05:14 1987
Date-Received: Sun, 10-May-87 11:36:03 EDT
References: <648@mipos3.UUCP>
Reply-To: toma@tekgvs.UUCP (Thomas Almy)
Organization: Tektronix, Inc., Beaverton, OR.
Lines: 75
Summary: there must be more to it
Xref: mnetor comp.arch:1240 comp.sys.intel:233

In article <648@mipos3.UUCP> kds@mipos3.UUCP (Ken Shoemaker ~) writes:
>A week or so ago, Geoff Kuenning posted a note to comp.sys.intel asking about
>how the 386 pipeline was organized.  I thought that that information would
>be interesting to people in comp.arch, also, so here it is:
[...]

I really appreciated this information; I have been confused, stumped, and
mystified by the operation of the 386 since getting one.  The documentation
sure left a lot to be desired.

>
>> Similarly, there is no information that indicates how instruction fetches
>> interact with operand accesses.  Nor is there a specification of how
>> the instruction-decode pipeline reloads after branches; from the manual
>> one would conclude that you can speed up this:
[...]
>> by inserting a "nop" at the target, because the "m" variable in the branch
>> execution time would be reduced from 4 or 5 to 1.  This seems highly
>> unlikely to be the actual case.
>
>you should be able to figure out how long it actually takes to crack an
>instruction from what was presented above.  You are right in thinking
>that inserting a nop won't speed it up.  You should also be able to
>figure out that the more valid data you can get into the processor after
>a branch, the faster the cracker can get started, so if you put all
>your branch targets on 32-bit boundaries, you can get the instructions
>started executing faster after the branch.

Yet I discovered that what Geoff had said was true to a degree!  In a loop
where the first instruction is fetched from memory, the loop performance
will increase if the start of the loop is offset from the 32-bit boundary!
And inserting a NOP in some cases did increase performance!  I also found
that the memory interleaving causes inconsistent results depending on the
position of the instructions relative to the position of the data!
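To put numbers on the alignment effect, the fragment below is the kind of
loop pair one could assemble and time.  This is only a sketch in MASM-style
syntax; the labels, the ECX iteration count, and the ALIGN/NOP padding are
illustrative and not the code I actually ran:

    ; Version A: branch target on a 32-bit boundary
            align   4
    top_a:  mov     eax,ebx         ; first instruction fetched at the target
            dec     ecx
            jnz     top_a           ; taken branch until ECX reaches zero

    ; Version B: same loop, target pushed one byte off the boundary
            align   4
            nop
    top_b:  mov     eax,ebx
            dec     ecx
            jnz     top_b

Running each version with the same large count in ECX and comparing the
totals shows which placement wins on a given memory system; as noted above,
the offset version sometimes came out ahead on my board.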
>
>> So first a warning: the instruction times in the 386 manual are for
>> "ideal" examples only, and real code will almost always take longer than
>> the manual indicates.
>
>"Your mileage may vary."  Seriously, the numbers are valid and are given
>as the number of clocks it takes the microcode to execute the sequence for
>the instruction, assuming zero wait-state memory, plus the number of clocks
>it takes to get the next instruction ready to execute in the case of jumps.
>
>I hope this is what you want.  As you see, on the 386, the details of the
>pipeline aren't nearly as important to know in determining how to make
>certain instruction sequences run fast.  In your example, the main
>contribution is the speed of the external memory system, which has nothing
>to do with the pipeline details.

I am afraid I have to agree with Geoff's feelings about needing to measure
the time for code.  Consider this test, performed by a friend of mine on his
386 system (running in 286 protected mode) and verified by me (running in
386 protected mode).

Both systems are Intel 386/AT motherboards with one wait state plus
interleave.  My system has paging enabled.  The exercise involves filling a
64K segment with as many copies of an instruction as will fit and then
executing it.  Thus the pipeline will be kept full, and a steady-state
instruction time can be calculated (in my case, interrupts were disabled as
well).

The two-byte instruction "MOV EAX,EBX" (AX,BX in my friend's case) took 2.1
clocks on average.  The manual says 2 clocks.  What is consuming the extra
0.1 clock?  Even with the wait state, the instruction queue should always be
full (3 clocks for each instruction-word fetch, plus 1 for paging? = 4
clocks, but each 32-bit fetch brings in two of these instructions).

The next instruction tested was "CLD", a single-byte status-flag-clearing
instruction.  The manual again says 2 clocks, but we both measured EXACTLY
3 clocks.  Unless the manual is misprinted, there is even less reason for
this.
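For anyone who wants to repeat the measurement, the body of the test is just
a macro-expanded run of the instruction under test.  The fragment below is a
sketch in MASM-style syntax, not the code we actually ran; the REPS count,
the label, and the timer and segment setup (omitted here) are illustrative:

    REPS    equ     32000           ; 32000 * 2 bytes = 64000 bytes of code

    test_mov:
            rept    REPS
            mov     eax,ebx         ; the two-byte register move under test
            endm
            ret                     ; time the call from outside

Dividing the elapsed CPU clocks by REPS gives the average time per
instruction.  For the sake of the arithmetic only, on a 16 MHz part 32000
copies at the manual's 2 clocks apiece would take 4.0 ms, while the measured
2.1 clocks apiece works out to about 4.2 ms, a difference that is easy to
resolve.

Tom Almy
Tektronix, Inc.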