Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!utgpu!water!watmath!clyde!rutgers!ames!oliveb!pyramid!prls!mips!hansen
From: hansen@mips.UUCP
Newsgroups: comp.arch
Subject: Branch Target Cache vs. Instruction Cache
Message-ID: <397@dumbo.UUCP>
Date: Fri, 15-May-87 22:00:50 EDT
Article-I.D.: dumbo.397
Posted: Fri May 15 22:00:50 1987
Date-Received: Sat, 16-May-87 21:27:29 EDT
References: <3810030@nucsrl.UUCP> <491@necis.UUCP> <3530@spool.WISC.EDU> <767@apple.UUCP>
Lines: 140

Let me start by apologizing for the tone of my last message - I'll try to
be less caustic and more informative this time.

In article <767@apple.UUCP>, bcase@apple.UUCP (Brian Case) writes:
(about how a branch target cache performs comparably to an instruction
cache, in a real system environment)

> it can perform comparably because, logically, part of the instruction cache
> is in the external rams. the branch target cache is only there to cover the
> latency of the initial access in that external ram (how many times do we
> have to say this?). so, in essence, the Am29000 branch target cache can
> cache 32 loops of *any* size. this is most certainly not true of traditional
> instruction caches. on the other hand, this is not necessarily a very
> important attribute; the point is that a branch target cache can perform,
> and in fact does perform, as well as an instruction cache. i feel that it
> should be i (or someone defending the am29000) who should be asking *you*
> to defend a traditional instruction cache relative to the branch target
> cache (always assuming an external burst mode memory for the branch target
> cache, of course. the burst mode memory is a critical part of the concept.).

My point was that the four words of information in the BTC are not
sufficient to cover the latency of DRAM accesses in a real system
environment. I freely acknowledge that it _is_ sufficient if the next
level of store is SRAM - but in that case, the BTC is a _supplement_ to
the SRAM-based instruction cache. In fact, the system simulated by AMD in
their performance simulation has an external cache. I quote from the
McFarling and Hennessy paper "Reducing the cost of branches" (13th Annual
Int. Symp. on Computer Architecture, June 1986) the following statement:
"In our simulations, we noticed that a direct mapped BTB and an
instruction cache of the same size had about the same hit ratio."

As to defending a traditional instruction cache relative to the branch
target cache, that is an apples-and-oranges comparison. The instruction
cache leaves main memory bandwidth free for other purposes, such as data
references, memory refresh, and I/O traffic.

What I can compare is the effective branch latency of the two approaches.
Drawing again on the results of the paper above, they compute an average
branch time of 1.53 cycles for something similar to the MIPS R2k branch
scheme (labelled in the paper as the "Fast Compare" scheme). I would add
0.06 to that value to account for the differences (explained below). They
don't have an exact match for the AMD branch target cache, but they do
have results for a 64-entry branch target buffer that generally has
shorter penalties for mis-predicted branches and buffer misses, which
requires 1.38 cycles per branch. However, the AMD branch scheme also
doesn't permit branches on conditions other than the sign bit of a
register, so we have to add time for the comparison operations (0.36
cycles for compare equal/not equal, and 0.08 for full arithmetic
compares), so the real total is 1.38 + 0.36 + 0.08 = 1.82 cycles per
branch. (Numbers are from the paper, and assume an 80% hit ratio in the
BTC.)

In summary:

    Branch Scheme                       average cycles per branch
    =========================================================
    Fast Compare (R2k)                          1.59
    64-entry BTC/sign compare (29k)             1.82

Selecting between these schemes, holding all else constant, gives the
fast compare scheme about a 2% edge in overall performance over the BTC
scheme.
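For anyone who wants to check the bookkeeping, here is a small
back-of-the-envelope sketch in C of the arithmetic above. The per-branch
components are the ones quoted from the paper plus the adjustments
described in the text; the 15% branch frequency and the 1.25
cycles-per-instruction for everything other than branches are purely
illustrative assumptions of mine, used only to turn the per-branch
difference into an overall figure.

    /* Back-of-the-envelope check of the cycles-per-branch figures above.
       The per-branch components come from the McFarling & Hennessy paper
       plus the adjustments described in the text; the branch frequency
       and the non-branch CPI are illustrative assumptions only. */
    #include <stdio.h>

    int main(void)
    {
        /* Fast Compare (R2k): the paper's 1.53 cycles plus the 0.06-cycle
           adjustment described in the text. */
        double fast_compare = 1.53 + 0.06;

        /* 64-entry BTC, sign-bit branches only (29k): the paper's 1.38
           cycles plus 0.36 for compare eq/ne and 0.08 for full
           arithmetic compares. */
        double btc_sign = 1.38 + 0.36 + 0.08;

        /* Illustrative assumptions, not measurements: roughly 15% of
           instructions are branches, and the rest of the machine
           contributes 1.25 cycles per instruction. */
        double branch_freq = 0.15;
        double other_cpi   = 1.25;

        double cpi_fast = other_cpi + branch_freq * fast_compare;
        double cpi_btc  = other_cpi + branch_freq * btc_sign;

        printf("Fast Compare:  %.2f cycles/branch\n", fast_compare);
        printf("BTC + compare: %.2f cycles/branch\n", btc_sign);
        printf("edge for fast compare: about %.1f%%\n",
               100.0 * (cpi_btc / cpi_fast - 1.0));
        return 0;
    }

With those assumed values the difference works out to a bit over 2%,
which is all the figure above is meant to convey.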
However, McFarling & Hennessy give this warning about the fast compare
scheme: "...the timing of the simple compare is a concern, because it
must complete in time to change the instruction address going out in the
next cycle. This could easily end up as the critical path controlling
instruction cache access and, hence, cycle time."

They are right; the timing was a concern, and we had to put in special
hardware to perform the fast compare and to quickly translate branch
addresses to physical addresses. It is this translation hardware (the
Micro-TLB) that adds about 1% of a cycle per instruction, or 6% of a
cycle per branch. However, the critical paths are on-chip ones, and can
be expected to scale with the technology.

> >I guess the AMD folks are excited about their new child, but when reality
> >sets in, and you try to build a real UNIX-based computer system out of this
> >fine controller part, you'll be sorely disappointed if you expected 17 MIPS
> >at 25 MHz with a nibble-mode DRAM as your primary memory system. Based on
>
> *not* nibble-mode, video dram; there is a big difference.
>
> >their benchmarks (puzzle, dhrystone, and a tiny piece of linpack; sorry, but
> >that's all they've got to compare with...) with an external cache, external
> >cache control hardware, and a fast main memory system, a 29000 at 25 MHz that
> >you can build next year doesn't perform any faster than a MIPS R2000 system
> >at 16.7 MHz that you can build now.
>
> i will be one of the first to say that mipsco has a good part. you guys have
> done a great job in the sense that you have gotten great performance at lower
> clock rates. but what happens to your bus at 25, 30, 35 mhz? i'm not saying
> that you are guys are going to fail to fix things, but let's not be casting
> stones! how do you know that amd will be sorely dissapointed in its
> expectations of 17 mips at 25 mhz? there are accurate simulations to support
> just this data! i am not trying to say you are wrong, but don't just make
> the statement, back it up with solid data or at least some observations you
> have made about problems with the architecture/implementation! please,
> follow the lead of your co-worker john mashey.

I agree - there is a difference between nibble-mode and video DRAM,
though both of these and page-mode DRAM were stated as working perfectly
well with a 4-word BTC and no I-cache.

The benchmark data is hard to interpret - the benchmarks we can compare
with AMD's all run faster on our part at 16.7 MHz than on AMD's part at
25 MHz. However, these are small benchmarks that do not stress the cache
and memory subsystems of either design, and I would welcome the
appearance of standard benchmarks that are more representative of
applications' performance. For example, the DoDuc benchmark would be a
big step forward over most existing benchmarks, and it contains a mix of
both integer and floating-point code.

I think John Mashey and I have already discussed differences in the
architecture of the two machines that can account for the difference in
benchmark performance, and I'm not out to repeat that discussion, but to
help draw what conclusions can be made from the benchmark data available
to date. It will take time and considerable further development before
AMD can supply performance data from running large benchmarks under real
system conditions, and I understand that we'll have to make do with what
we've got. However, the only benchmark of potentially meaningful size
from AMD (sipasm) performs at substantially less than 17 (780-relative)
MIPS with an external cache as well as burst-mode memory, and as I
understand it, these results do not take multiprogramming cache effects
and finite memory write speed into account.

I'm at a disadvantage in quoting performance data for the AMD part, but
the branch target cache miss rate on the larger benchmarks is in the 50%
range, is it not? That would mean that much of the branch behaviour of
these larger programs is not simple loops, and the analogy of the AMD
29k BTC holding 32 loops of any size isn't really valid. What looks bad
for the AMD part is that branches that miss in the BTC get a four-cycle
delay. (Please correct me if I'm wrong on this, but I'm assuming that
the 4 words in each BTC entry reflect that it will take at least 4
cycles to start up again after a BTC miss.) At the risk of contradicting
a statement made earlier in this message, that would put the average
branch time at about 3.5 cycles.
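To show where a number in that neighborhood can come from, here is a
minimal sketch of the weighted average, again in C. The 4-cycle restart
delay, the roughly 50% miss rate, and the compare-operation costs are
the figures discussed above; the 1-cycle cost for a branch that hits in
the BTC is my own assumption, so treat the result as a ballpark
reconstruction rather than AMD's arithmetic.

    /* Rough reconstruction of the ~3.5-cycle estimate; the 1-cycle hit
       cost is an assumption, the other figures are quoted above. */
    #include <stdio.h>

    int main(void)
    {
        double hit_cost     = 1.0;         /* assumed cost of a BTC hit        */
        double miss_penalty = 4.0;         /* restart delay after a BTC miss   */
        double hit_rate     = 0.50;        /* ~50% miss rate on large programs */
        double compare_cost = 0.36 + 0.08; /* separate compare operations      */

        double avg = hit_rate * hit_cost
                   + (1.0 - hit_rate) * (hit_cost + miss_penalty)
                   + compare_cost;

        printf("average cycles per branch: %.2f\n", avg); /* roughly 3.4 */
        return 0;
    }

With these values the average comes out near 3.4 cycles, in the same
ballpark as the 3.5-cycle figure above.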
As to what MIPS will do to make external caches work at 25, 30, 35 MHz,
I'm afraid this isn't the right forum to be discussing our future
products. We have publicly stated that we will double the performance of
our products every eighteen months, and our current product plans are
running ahead of that rate. If I could ask, what happens to the Am29k
bus at 40 to 55 MHz?

Regards,
Craig
--
Craig Hansen
Manager, Architecture Development
MIPS Computer Systems, Inc.
...decwrl!mips!hansen