Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!utgpu!water!watmath!clyde!rutgers!ames!oliveb!pyramid!prls!mips!hansen
From: hansen@mips.UUCP
Newsgroups: comp.arch
Subject: Branch Target Cache vs. Instruction Cache
Message-ID: <397@dumbo.UUCP>
Date: Fri, 15-May-87 22:00:50 EDT
Article-I.D.: dumbo.397
Posted: Fri May 15 22:00:50 1987
Date-Received: Sat, 16-May-87 21:27:29 EDT
References: <3810030@nucsrl.UUCP> <491@necis.UUCP> <3530@spool.WISC.EDU> <767@apple.UUCP>
Lines: 140

Let me start by apologizing for the tone of my last message - I'll try to be
less caustic and more informative this time.

In article <767@apple.UUCP>, bcase@apple.UUCP (Brian Case) writes:
 (about how a branch target cache performs comparably to an instruction
    cache, in a real system environment)
> it can perform comparably because, logically, part of the instruction cache
> is in the external rams.  the branch target cache is only there to cover the
> latency of the initial access in that external ram (how many times do we
> have to say this?).  so, in essence, the Am29000 branch target cache can
> cache 32 loops of *any* size.  this is most certainly not true of traditional
> instruction caches.  on the other hand, this is not necessarily a very
> important attribute; the point is that a branch target cache can perform,
> and in fact does perform, as well as an instruction cache.  i feel that it
> should be i (or someone defending the am29000) who should be asking *you*
> to defend a traditional instruction cache relative to the branch target
> cache (always assuming an external burst mode memory for the branch target
> cache, of course.  the burst mode memory is a critical part of the concept.).

My point was that the four words of information in the BTC are not sufficient
to cover the latency of DRAM accesses in a real system environment. I would
freely acknowledge that it _is_ sufficient if the next-level store is SRAM -
however, in that case, the BTC is a _supplement_ to the SRAM-based
instruction cache. In fact, the system that AMD simulated for their
performance figures has an external cache.
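
To put rough numbers on that, here is a small illustrative calculation in C.
The DRAM first-access time below is an assumption picked purely for the sake
of the example, not a figure from AMD or anyone else:

    /* Illustrative sketch: how much of the first-access latency of a
     * DRAM-based memory a four-word BTC entry can hide.  The clock rate
     * is the 25 MHz under discussion; the DRAM first-access time is an
     * assumed, purely illustrative figure.
     */
    #include <stdio.h>

    int main(void)
    {
        int cycle_ns        = 40;   /* 25 MHz processor clock */
        int dram_initial_ns = 200;  /* assumed first-access latency, incl. arbitration */
        int btc_words       = 4;    /* target-path instructions held per BTC entry */

        int initial_cycles = (dram_initial_ns + cycle_ns - 1) / cycle_ns;  /* round up */
        int stall_cycles   = initial_cycles > btc_words ? initial_cycles - btc_words : 0;

        printf("first access: %d cycles, BTC covers %d, residual stall: %d\n",
               initial_cycles, btc_words, stall_cycles);
        return 0;
    }

If the first access takes longer than four cycles, or the memory cannot then
deliver one instruction per cycle, the processor stalls no matter how large
the loop is.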

I quote from the McFarling and Hennessy paper "Reducing the Cost of
Branches" (13th Annual Int. Symp. on Computer Architecture, June 1986) the
following statement: "In our simulations, we noticed that a direct mapped
BTB and an instruction cache of the same size had about the same hit ratio."

As to defending a traditional instruction cache relative to the branch
target cache, this is an apples-and-oranges comparison. The instruction cache
leaves main memory bandwidth free for other purposes, such as data
references, memory refresh, and I/O traffic. What I can compare is the
effective branch latency of the two approaches.  Drawing again on the
results of the paper above: the authors compute an average branch time of 1.53
cycles for something similar to the MIPS R2k branch scheme (labelled in the
paper as the "Fast Compare" scheme). I would add 0.06 to that value to
account for the differences (explained below).  They don't have an exact
match for the AMD branch target cache, but do have results for a 64-entry
branch target buffer that generally has shorter penalties for mis-predicted
branches and buffer misses, which requires 1.38 cycles per branch.  However,
the AMD branch scheme also doesn't permit branches on conditions other than
the sign bit of a register, so we have to add time for the comparison
operations (.36 cycles for compare equal/not equal, and .08 for full
arithmetic compares), so the real total is 1.38+.36+.08 = 1.82 cycles per
branch. (Numbers are from the paper, and assume an 80% hit ratio in the BTC)
In summary:

Branch Scheme                            Average cycles per branch
==================================================================
Fast Compare (R2k)                                  1.59
64-entry BTC / sign compare (29k)                   1.82
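
To make the arithmetic behind the table explicit, here is the same sum as a
few lines of C; every component cost is taken from the paper as quoted above,
and only the packaging is mine:

    /* Average cycles per branch, assembled from the McFarling/Hennessy
     * component costs quoted above.
     */
    #include <stdio.h>

    int main(void)
    {
        /* "Fast Compare" scheme (similar to the MIPS R2k branches) */
        double fast_compare = 1.53 + 0.06;   /* paper figure + Micro-TLB cost */

        /* 64-entry branch target buffer with sign-only compares (29k-like) */
        double btc = 1.38 + 0.36 + 0.08;     /* BTB + eq/ne compares + arithmetic compares */

        printf("Fast Compare: %.2f cycles/branch\n", fast_compare);
        printf("64-entry BTC: %.2f cycles/branch\n", btc);
        return 0;
    }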

Selecting between these schemes, holding all else constant, gives the fast
compare scheme about a 2% edge in overall performance over the BTC scheme.
However, McFarling & Hennessy give this warning about the fast compare
scheme: "...the timing of the simple compare is a concern, because it must
complete in time to change the instruction address going out in the next
cycle. This could easily end up as the critical path controlling instruction
cache access and, hence, cycle time." They are right; the timing was a
concern, and we had to put in special hardware to perform the fast compare
and to quickly translate branch addresses to physical addresses.  It is this
translation hardware (the Micro-TLB) that adds the 0.06 cycles per branch
(about 1% of a cycle per instruction) mentioned above.  However, the
critical paths are on-chip ones, and can be expected to scale with the
technology.
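
For those who want to see where the "about 2%" comes from, the per-branch
difference just gets weighted by the dynamic branch frequency. The branch
frequency and non-branch CPI below are assumptions of mine for illustration,
not figures from the paper:

    /* Rough sketch of how 1.59 vs. 1.82 cycles per branch turns into an
     * overall edge.  Branch frequency and non-branch CPI are assumed.
     */
    #include <stdio.h>

    int main(void)
    {
        double branch_freq   = 0.15;  /* assumed fraction of instructions that branch */
        double nonbranch_cpi = 1.5;   /* assumed CPI of everything else (incl. cache misses) */

        double cpi_fast = (1.0 - branch_freq) * nonbranch_cpi + branch_freq * 1.59;
        double cpi_btc  = (1.0 - branch_freq) * nonbranch_cpi + branch_freq * 1.82;

        printf("fast-compare edge: %.1f%%\n", 100.0 * (cpi_btc / cpi_fast - 1.0));
        return 0;
    }

With these assumptions the edge comes out near 2%; a lower branch frequency
or a higher base CPI pulls it down, and vice versa.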

> >I guess the AMD folks are excited about their new child, but when reality
> >sets in, and you try to build a real UNIX-based computer system out of this
> >fine controller part, you'll be sorely disappointed if you expected 17 MIPS
> >at 25 MHz with a nibble-mode DRAM as your primary memory system.  Based on
> 
> *not* nibble-mode, video dram; there is a big difference.
> 
> >their benchmarks (puzzle, dhrystone, and a tiny piece of linpack; sorry, but
> >that's all they've got to compare with...) with an external cache, external
> >cache control hardware, and a fast main memory system, a 29000 at 25 MHz that
> >you can build next year doesn't perform any faster than a MIPS R2000 system
> >at 16.7 MHz that you can build now.
> 
> i will be one of the first to say that mipsco has a good part.  you guys have
> done a great job in the sense that you have gotten great performance at lower
> clock rates.  but what happens to your bus at 25, 30, 35 mhz?  i'm not saying
> that you are guys are going to fail to fix things, but let's not be casting
> stones!  how do you know that amd will be sorely dissapointed in its
> expectations of 17 mips at 25 mhz?  there are accurate simulations to support
> just this data!  i am not trying to say you are wrong, but don't just make
> the statement, back it up with solid data or at least some observations you
> have made about problems with the architecture/implementation!  please,
> follow the lead of your co-worker john mashey.

I agree - there is a difference between nibble-mode and video DRAM, though
both of these and page-mode DRAM were stated as working perfectly well with a
4-word BTC and no I-cache. Now, the benchmark data is hard to interpret - the
benchmarks we can compare with AMD's all run faster on our 16.7 MHz part than
on AMD's 25 MHz part.  However, these are small benchmarks that do not stress
the cache and memory subsystems of either design, and I would welcome the
appearance of standard benchmarks that are more representative of
applications' performance.  For example, the DoDuc benchmark would be a big
step forward over most existing benchmarks, as it contains a mix of
both integer and floating-point code. I think John Mashey & I have
already discussed differences in the architecture of the two machines that can
account for the difference in benchmark performance, and I'm not out to
repeat that discussion, but to help draw what conclusions can be made
from the benchmark data available to date.

It will take time and considerable further development before AMD can supply
performance data from large benchmarks run under real system conditions,
and I understand that we'll have to make do with what we've got.  However,
the only benchmark of potentially meaningful size from AMD (sipasm) performs
at substantially less than 17 (780-relative) MIPS with an external cache as
well as burst-mode memory, and as I understand it, these results do not take
multiprogramming cache effects or finite memory write speed into account.

I'm at a disadvantage in quoting performance data for the AMD part, but the
branch target cache miss rate on the larger benchmarks is in the 50% range,
is it not? That would mean that much of the branch behaviour of these larger
programs is not simple loops, and the analogy of the AMD 29k BTC holding 32
loops of any size isn't really valid. What looks bad for the AMD part is
that branches that miss in the BTC get a four-cycle delay. (Please correct
me if I'm wrong on this, but I'm assuming that the 4 words in each BTC entry
reflect that it will take at least 4 cycles to start up again after a BTC
miss.) At the risk of contradicting statements made earlier in this message,
that would put the average branch time at about 3.5 cycles.
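
For what it's worth, here is one way the 3.5-cycle figure can be
reconstructed. The model is my own guess: the 50% miss rate and the
four-cycle restart penalty are as discussed above, and the compare costs
are carried over from the paper's numbers used earlier:

    /* One possible reconstruction of ~3.5 cycles per branch for a BTC
     * with a 50% miss rate and a 4-cycle restart penalty.  The model is
     * a guess; the compare costs are the paper figures used earlier.
     */
    #include <stdio.h>

    int main(void)
    {
        double miss_rate    = 0.50;        /* BTC miss rate on larger programs */
        double miss_penalty = 4.0;         /* assumed cycles to restart after a miss */
        double base_branch  = 1.0;         /* the branch instruction itself */
        double compare_cost = 0.36 + 0.08; /* eq/ne and arithmetic compares (paper) */

        double avg = base_branch + miss_rate * miss_penalty + compare_cost;
        printf("average cycles per branch: %.2f\n", avg);
        return 0;
    }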

As to what MIPS will do to make external caches work at 25, 30, 35 MHz,
I'm afraid this isn't the right forum to be discussing our future
products. We have publicly stated that we will improve the performance
of our products at a rate that doubles it every eighteen months,
and our current product plans are running ahead of that rate.
If I could ask, what happens to the Am29k bus at 40 to 55 MHz?

Regards,
Craig
-- 
Craig Hansen
Manager, Architecture Development
MIPS Computer Systems, Inc.
...decwrl!mips!hansen