Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!utgpu!water!watmath!clyde!rutgers!ames!amdahl!drivax!socha
From: socha@drivax.UUCP
Newsgroups: comp.arch
Subject: Re: AM29000 Booleans
Message-ID: <1512@drivax.UUCP>
Date: Fri, 8-May-87 14:05:32 EDT
Article-I.D.: drivax.1512
Posted: Fri May  8 14:05:32 1987
Date-Received: Sun, 10-May-87 05:36:16 EDT
References: <1270@aw.sei.cmu.edu> <138@neptune.AMD.COM> <3540@spool.WISC.EDU> <16587@amdcad.AMD.COM>
Reply-To: socha@drivax.UUCP (Henri J. Socha (x6251))
Organization: Digital Research, Monterey
Lines: 91



In article <16587@amdcad.AMD.COM>, tim@amdcad.AMD.COM (Tim Olson) writes:
>In article <3540@spool.WISC.EDU>, lm@cottage.WISC.EDU (Larry McVoy) writes:
 
>> Could you please show the assembly language generated for the following,
>> 
>> {	int x;
>> 	
>> 	if (x)
>> 	    ;
>> 	else
>> 	    ;
>> 
>> 	if (!x)
>> 	    ;
>> 	else
>> 	    ;
>> }

Well, to add some fuel to the fire about how a machine's performance can
depend on the smarts in the compiler used, I hand-modified the example code
given in the referenced article.  The changed code is shown below.
The changes were limited to taking better advantage of the delayed branching:
I only rearranged and removed some code.

---------------------------------------
	.global	_main
_main:
	.align

;	if (x)
	cpneq	gr72,gr70,0		<-- sets the msb of gr72 if gr70 != 0
	jmpf	gr72,$16		<-- jumps if the msb of gr72 is not set
;	else
;	    y = 1;
	const	gr71,1			<-- wasted assignment in the fall-through case

; THEN	    y = 0;
	const	gr71,0			<-- remember those delayed branches!
$16:
;	if (!x)
	cpeq	gr72,gr70,0
	jmpf	gr72,$18
;	else
;	    y = 1;
	const	gr71,1

; THEN	    y = 0;
	const	gr71,0
$18:
	jmpi	lr00
	nop
>	-- Tim Olson
>	Advanced Micro Devices

The savings were:
	nop
	jmp	$17			<-- remember those delayed branches!
	nop
	jmp	$19

And that is not so shabby for such a small programme, especially if the
hardware is smart enough to recognize that the jumped-to address is probably
already in the pipeline.

Now, don't try to say that the simple case of 1 instruction after the branch
doesn't occur (or occurs rarely).  I've seen it often enough to make me
think that this is a worthwhile optimization by the compiler.
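
At the source level, the rearrangement above amounts to roughly the
following.  This is only a sketch in C of the idea, not compiler output;
the function names branchy and delay_slot_style are made up for
illustration.  The unconditional assignment plays the part of the
instruction sitting in the delay slot after jmpf.

```c
#include <assert.h>

/* Straightforward form: one assignment on each path.  A naive compiler
   needs a conditional jump plus an unconditional jump around the else. */
int branchy(int x)
{
    int y;
    if (x)
        y = 0;
    else
        y = 1;
    return y;
}

/* Rearranged form: the "else" assignment always runs (as it would in the
   branch delay slot) and the "then" assignment overwrites it if needed. */
int delay_slot_style(int x)
{
    int y = 1;      /* always executed, like the instruction after jmpf */
    if (x)
        y = 0;      /* fall-through case overwrites the wasted assignment */
    return y;
}

/* Checks that both forms agree for a few sample inputs. */
void check_forms_agree(void)
{
    int x;
    for (x = -2; x <= 2; x++)
        assert(branchy(x) == delay_slot_style(x));
}
```

The trick only works when the always-executed assignment is harmless on
the path that does not want it, which is the case for a simple constant
load like this one.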

BTW I like the fact that non-arithmetic instructions DO NOT change (affect)
the condition code (status) register. This can open up other optimizations.
But I can't understand why the following sequence is needed for multiply.

	mul	gr72,gr71,#0		;step 1
	repeat	30			; (not in assembler? my addition)
	mul	gr72,gr71,gr72		;steps 2 to 31 (that's right, 30 words!)

	mull	gr72,gr71,gr72		; step 32

The above is from the Am29000 User's Manual, page 7-10, for a 32-bit integer
multiply.  It takes 32 instructions to do it (divide is similar).
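
For anyone wondering where a count of 32 comes from, here is a sketch in C
of a plain shift-and-add multiply that retires one multiplier bit per step.
It is the generic textbook scheme under my own naming (mul_step and
mul32_by_steps are hypothetical), not a description of how the Am29000 mul
and mull steps are actually defined, and it returns only the low 32 bits of
the product.

```c
#include <assert.h>
#include <stdint.h>

/* One multiply step: if the low bit of the multiplier is set, add the
   multiplicand into the accumulator, then line up the next bit. */
static void mul_step(uint32_t *acc, uint32_t *multiplicand,
                     uint32_t *multiplier)
{
    if (*multiplier & 1u)
        *acc += *multiplicand;   /* unsigned arithmetic wraps mod 2^32 */
    *multiplicand <<= 1;
    *multiplier >>= 1;
}

/* 32-bit multiply done as 32 identical steps, one per multiplier bit. */
uint32_t mul32_by_steps(uint32_t a, uint32_t b)
{
    uint32_t acc = 0;
    int step;

    for (step = 0; step < 32; step++)
        mul_step(&acc, &a, &b);

    assert(b == 0);              /* all 32 multiplier bits were consumed */
    return acc;
}
```

For example, mul32_by_steps(6, 7) gives 42.  One step per bit of a 32-bit
operand is what makes the instruction count come out to 32.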

Now, I can understand the advantages of a RISC processor but this is going
too far.  Should I put 32 instructions in the processing stream each time
I need to multiply two numbers?  Should I use a subroutine?
Seems to me a perfect time/space tradeoff decision.  But what are the costs?
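
The space side of that tradeoff can at least be put in rough numbers.  The
sketch below is in C, and everything except the 32-word sequence length is
an assumption: call_words (words spent at each call site) and return_words
(words the shared routine spends returning) are hypothetical parameters,
not figures from the manual.

```c
#include <assert.h>

#define MUL_WORDS 32UL  /* words in the multiply sequence quoted above */

/* Code size if the sequence is expanded at every multiply site. */
unsigned long inline_words(unsigned long sites)
{
    return sites * MUL_WORDS;
}

/* Code size with one shared routine.  call_words and return_words are
   hypothetical per-site and per-routine overheads. */
unsigned long subroutine_words(unsigned long sites,
                               unsigned long call_words,
                               unsigned long return_words)
{
    return MUL_WORDS + return_words + sites * call_words;
}

/* Smallest number of sites at which the shared routine is no larger
   than inlining.  Only meaningful when a call is shorter than the
   sequence it replaces. */
unsigned long break_even_sites(unsigned long call_words,
                               unsigned long return_words)
{
    unsigned long num, den;

    assert(call_words < MUL_WORDS);
    num = MUL_WORDS + return_words;
    den = MUL_WORDS - call_words;
    return (num + den - 1) / den;    /* ceiling division */
}
```

With an assumed 2 words per call and 2 per return, the subroutine already
wins on space at the second multiply site.  The time side goes the other
way: every call pays the call and return on top of the 32 steps.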

-- 
UUCP:...!amdahl!drivax!socha                                      WAT Iron'75
"Everything should be made as simple as possible but not simpler."  A. Einstein