Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!husc6!mit-eddie!uw-beaver!tektronix!tekcrl!tekgvs!toma
From: toma@tekgvs.TEK.COM (Thomas Almy)
Newsgroups: comp.arch,comp.sys.intel
Subject: Re: 386 pipeline details
Message-ID: <2271@tekgvs.TEK.COM>
Date: Fri, 8-May-87 11:05:14 EDT
Article-I.D.: tekgvs.2271
Posted: Fri May 8 11:05:14 1987
Date-Received: Sun, 10-May-87 11:36:03 EDT
References: <648@mipos3.UUCP>
Reply-To: toma@tekgvs.UUCP (Thomas Almy)
Organization: Tektronix, Inc., Beaverton, OR.
Lines: 75
Summary: there must be more to it
Xref: mnetor comp.arch:1240 comp.sys.intel:233

In article <648@mipos3.UUCP> kds@mipos3.UUCP (Ken Shoemaker ~) writes:
>A week or so ago, Geoff Kuenning posted a note to comp.sys.intel asking about
>how the 386 pipeline was organized.  I thought that that information would
>be interesting to people in comp.arch, also, so here it is:
[...]

I really appreciated this information; I have been confused, stumped, and
mystified by the operation of the 386 since getting one.  The documentation
sure left a lot to be desired.

>
>> Similarly, there is no information that indicates how instruction fetches
>> interact with operand accesses.  Nor is there a specification of how
>> the instruction-decode pipeline reloads after branches; from the manual
>> one would conclude that you can speed up this:
[...]
>> by inserting a "nop" at the target, because the "m" variable in the branch
>> execution time would be reduced from 4 or 5 to 1.  This seems highly
>> unlikely to be the actual case.
>
>you should be able to figure out how long it actually takes to crack an
>instruction from what was presented above.  You are right in thinking
>that inserting a nop won't speed it up.  You should also be able to
>figure out that the more valid data you can get into the processor after
>a branch, the faster the cracker can get started, so if you put all
>your branch targets on 32-bit boundaries, you can get the instructions
>started executing faster after the branch.

Yet I discovered that what Geoff had said was true to a degree!  In a loop
where the first instruction is fetched from memory, the loop performance
will increase if the start of the loop is offset from the 32-bit boundary!
And inserting a NOP in some cases did increase performance!  I also found
that the memory interleaving causes inconsistent results depending on the
position of the instructions relative to the position of the data!
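To put numbers on the alignment effect, the fragment below is the kind of
loop pair one could assemble and time.  This is only a sketch in MASM-style
syntax; the labels, the ECX iteration count, and the ALIGN/NOP padding are
illustrative and not the code I actually ran:

    ; Version A: branch target on a 32-bit boundary
            align   4
    top_a:  mov     eax,ebx         ; first instruction fetched at the target
            dec     ecx
            jnz     top_a           ; taken branch until ECX reaches zero

    ; Version B: same loop, target pushed one byte off the boundary
            align   4
            nop
    top_b:  mov     eax,ebx
            dec     ecx
            jnz     top_b

Running each version with the same large count in ECX and comparing the
totals shows which placement wins on a given memory system; as noted above,
the offset version sometimes came out ahead on my board.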
>
>> So first a warning: the instruction times in the 386 manual are for
>> "ideal" examples only, and real code will almost always take longer than
>> the manual indicates.
>
>"Your mileage may vary."  Seriously, the numbers are valid and are given
>as the number of clocks it takes the microcode to execute the sequence for
>the instruction, assuming zero wait-state memory, plus the number of clocks
>it takes to get the next instruction ready to execute in the case of jumps.
>
>I hope this is what you want.  As you see, on the 386, the details of the
>pipeline aren't nearly as important to know in determining how to make
>certain instruction sequences run fast.  In your example, the main
>contribution is the speed of the external memory system, which has nothing
>to do with the pipeline details.

I am afraid I have to agree with Geoff's feelings about needing to measure
the time for code.  Consider this test, performed by a friend of mine on his
386 system (running in 286 protected mode) and verified by me (running in
386 protected mode).

Both systems are Intel 386/AT motherboards with one wait state plus
interleave.  My system has paging enabled.  The exercise involves filling a
64K segment with as many copies of an instruction as will fit and then
executing it.  Thus the pipeline will be kept full, and a steady-state
instruction time can be calculated (in my case, interrupts were disabled as
well).

The two-byte instruction "MOV EAX,EBX" (AX,BX in my friend's case) took 2.1
clocks on average.  The manual says 2 clocks.  What is consuming the extra
0.1 clock?  Even with the wait state, the instruction queue should always be
full (3 clocks for each instruction-word fetch, plus 1 for paging? = 4
clocks, but each 32-bit fetch brings in two of these instructions).

The next instruction tested was "CLD", a single-byte status-flag-clearing
instruction.  The manual again says 2 clocks, but we both measured EXACTLY
3 clocks.  Unless the manual is misprinted, there is even less reason for
this.
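For anyone who wants to repeat the measurement, the body of the test is just
a macro-expanded run of the instruction under test.  The fragment below is a
sketch in MASM-style syntax, not the code we actually ran; the REPS count,
the label, and the timer and segment setup (omitted here) are illustrative:

    REPS    equ     32000           ; 32000 * 2 bytes = 64000 bytes of code

    test_mov:
            rept    REPS
            mov     eax,ebx         ; the two-byte register move under test
            endm
            ret                     ; time the call from outside

Dividing the elapsed CPU clocks by REPS gives the average time per
instruction.  For the sake of the arithmetic only, on a 16 MHz part 32000
copies at the manual's 2 clocks apiece would take 4.0 ms, while the measured
2.1 clocks apiece works out to about 4.2 ms, a difference that is easy to
resolve.

Tom Almy
Tektronix, Inc.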