Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.2 9/5/84; site terak.UUCP
Path: utzoo!utcs!lsuc!pesnta!hplabs!hao!noao!terak!doug
From: doug@terak.UUCP (Doug Pardee)
Newsgroups: net.works,net.micro.16k
Subject: Re: 32032 UNIX
Message-ID: <369@terak.UUCP>
Date: Tue, 12-Feb-85 12:07:28 EST
Article-I.D.: terak.369
Posted: Tue Feb 12 12:07:28 1985
Date-Received: Thu, 14-Feb-85 19:01:50 EST
References: <357@topaz.ARPA> <320@terak.UUCP> <278@petrus.UUCP> <5040@utzoo.UUCP> <2347@nsc.UUCP>
Organization: Terak Corporation, Scottsdale, AZ, USA
Lines: 85
Xref: utcs net.works:885 net.micro.16k:203

> >People I trust tell me that the 32016's performance deteriorates
> >*SHARPLY* when wait states are introduced -- it's much worse than
> >you would expect, and in particular it's not linear in the number of
> >wait states.
> 
> In general, programs with lighter bus use show smaller degradation with wait
> states and smaller ratios of NS32032 to NS32016 execution speed.

Wait states are a punch aimed at the 32000's glass jaw -- instruction
prefetch.

For those not completely conversant:  the 32000 series CPUs use
instruction prefetching to try to keep the 8 bytes following the
_current_ instruction already loaded into the CPU.  These bytes
are always the ones located sequentially after the current
instruction.

There are two undesirable side effects which can occur.  The most
obvious occurs when a branch is taken -- the prefetch cycles were
a waste of time, and the new instructions have to be fetched.  Worse,
if the CPU had just started a prefetch cycle when the branch was
recognized, it has to wait for that cycle to complete before the
branch can be executed.  Wait states make this more likely to happen,
and make the delay longer when it does.

Remembering that programs spend most of their time in loops, and that
a loop requires at least one branch every time through, this effect
is magnified considerably.  That goes double for concocted benchmark
programs, where the contents of the loop tend to be trivial, leaving
the branching as the major time consumer.
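
A back-of-the-envelope model of that branch penalty, as a sketch only.
The function names are made up, and every number fed to them (including
the length of a basic bus cycle) is an assumption for illustration, not
a measured 32016 figure.

```c
/* Toy model: clocks per loop pass when the loop-closing branch
 * throws away the prefetched bytes. All parameters are hypothetical. */

/* Clocks for one bus cycle: an assumed base length plus wait states. */
static int bus_cycle_clocks(int base_clocks, int wait_states)
{
    return base_clocks + wait_states;
}

/* exec_clocks:    clocks of useful work inside the loop body
 * refetch_cycles: bus cycles needed to fetch the branch target
 * stalled:        1 if a prefetch was in flight when the branch was
 *                 recognized and has to finish first, else 0 */
static int loop_pass_clocks(int exec_clocks, int refetch_cycles,
                            int stalled, int base_clocks,
                            int wait_states)
{
    int cycle = bus_cycle_clocks(base_clocks, wait_states);

    return exec_clocks + (refetch_cycles + stalled) * cycle;
}
```

With 10 clocks of real work, 2 refetch cycles, a stalled prefetch and
an assumed 4-clock bus cycle, this gives 22 clocks per pass at zero
wait states and 28 at two.  The loop body did not change; the branch
overhead went from 12 clocks to 18.  The model leaves out the fact
that wait states also make the stall itself more likely.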

A second aspect of the 32000 series enters in here as well -- unlike
the 68000, instructions are not required to start on word boundaries.
If the branch destination is an "odd" address, the CPU requires
yet another memory cycle, plus any wait states.  Compilers for
high-level languages like "C" don't pay any attention to this little
detail, so tight loops can suffer just because the top of the loop is
on an odd-byte boundary.
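
The odd-address cost is plain word-alignment arithmetic.  A minimal
sketch, assuming a 16-bit bus that moves one aligned word per memory
cycle; `word_cycles` is a hypothetical helper, not anything from the
32000 tool chain.

```c
/* Number of 16-bit bus cycles needed to read `len` bytes starting at
 * byte address `addr`, assuming each cycle transfers one aligned
 * 16-bit word. Hypothetical helper, for illustration only. */
static unsigned long word_cycles(unsigned long addr, unsigned long len)
{
    unsigned long first_word, last_word;

    if (len == 0)
        return 0;
    first_word = addr / 2;              /* word holding the first byte */
    last_word = (addr + len - 1) / 2;   /* word holding the last byte  */
    return last_word - first_word + 1;
}
```

Fetching 4 bytes of instruction from address 0x1000 takes 2 cycles
under this model; the same 4 bytes starting at 0x1001 take 3.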

The other side effect is less obvious.  The instruction prefetch cycle
can also obstruct access to the operands of the current instruction.
Again, wait states increase the likelihood of this happening, and
make the delay more serious as well.

This contention, in turn, is made more likely by the use of high-level
languages like "C".  Unlike the competition's CPUs, the 32000 series
allows essentially all operations to be performed memory-to-memory,
without needing a register as an intermediate.  The compilers use
this feature extensively, with the result that operands require
memory access much more often than they do in the equivalent 32000
assembler code or in (e.g.) 68000 "C" code.

Important note:  this presumes that if the compiler had been forced
to bring the operands into registers, and leave the result in a
register, it could have done some optimization and re-used those
registers.  Without that re-use it is obvious, is it not, that a
simple "Load A, Add B, Store B" is necessarily going to be slower
than "Add A to B"?
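
To make the register re-use point concrete, here is a sketch in "C".
The names are invented, and whether the second version really comes
out faster depends entirely on what a given compiler does with it.

```c
/* Two ways to sum an array. This only illustrates the register
 * re-use argument; it is not a measurement of any compiler. */

long total;   /* external: updated in memory on every pass */

void sum_in_memory(const int *v, int n)
{
    int i;

    total = 0;
    for (i = 0; i < n; i++)
        total += v[i];      /* memory-to-memory add each time through */
}

long sum_in_register(const int *v, int n)
{
    register long acc = 0;  /* hint that acc should live in a register */
    register int i;

    for (i = 0; i < n; i++)
        acc += v[i];        /* only v[i] has to come from memory */
    return acc;
}
```

The second form pays for one load of each element and re-uses the
accumulator register on every pass, which is the re-use the note
above presumes.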

And to compound the problem even further:  the 32000 series is set
up to use "indirect addressing" fairly heavily, and the compilers
really use it a bunch.  Especially the "C" compiler, which uses
indirect addressing to implement pointer variables.

But wait, there's more (this is starting to sound like a TV mail-order
ad!).  Most "C" programmers seem to like to use "external" variables
rather than parameters.  On the 32000 series, parameters are accessed
just as easily as ordinary variables, but externals are a *double-
indirect*!  For a 32016 to get just the *address* of an external
item, it has to do four (4) memory cycles.  And if that item is a
pointer variable, "C" will require yet another two memory cycles
before it even has the *address* of the data.
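
The cycle counts above can be reproduced with a little arithmetic.  A
sketch, assuming 32-bit addresses that are aligned in memory and a
bus that moves its full width on each cycle; both function names are
hypothetical.

```c
/* Bus cycles to fetch one 32-bit address over a bus of the given
 * width in bits, assuming the address is aligned. */
static int address_fetch_cycles(int bus_bits)
{
    return 32 / bus_bits;
}

/* Bus cycles spent before the CPU holds the address of the data,
 * given how many addresses must be read from memory along the way. */
static int cycles_to_reach(int indirections, int bus_bits)
{
    return indirections * address_fetch_cycles(bus_bits);
}
```

A double-indirect on a 16-bit bus is cycles_to_reach(2, 16), or 4
cycles; going through a pointer variable as well makes it
cycles_to_reach(3, 16), or 6.  On a 32-bit bus the same model gives
2 and 3.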

All of this indirect address and operand fetching puts quite a load
on the memory system, and prefetching represents serious competition
for memory cycles.  If that prefetching turns out to have been
unnecessary because of a branch, the performance suffers more than
the number of wait states would imply.

So if you want your 32000 system to hum along, don't use wait states,
keep looping and branching to a minimum, program in assembler, and if
you simply *must* program in "C" avoid external variables and use
register variables (especially for pointer variables).  Oh, BTW, the
MMU adds one wait state of its own.
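
For the "C" part of that advice, a sketch of the same string copy
written both ways.  The names are made up, and "register" is only a
hint, so what actually lands in a register is up to the compiler.

```c
/* The same string copy written two ways. Names are hypothetical. */

char *src_ptr;   /* external pointer variables */
char *dst_ptr;

void copy_via_externals(void)
{
    /* a simple compiler may go back to memory for src_ptr and
     * dst_ptr on every pass through the loop */
    while ((*dst_ptr++ = *src_ptr++) != '\0')
        ;
}

void copy_via_registers(char *dst, const char *src)
{
    register char *d = dst;         /* pointers arrive as parameters */
    register const char *s = src;   /* and are hinted into registers */

    while ((*d++ = *s++) != '\0')
        ;
}
```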
-- 
Doug Pardee -- Terak Corp. -- !{hao,ihnp4,decvax}!noao!terak!doug