














Main points of this past exam are: Mips Rating, Yield Runtime, Floating Point, Lower-Performance Version, Original Processor, Mips Rating, Original Cost, New Processor, Parallel Prefix, Possible Speedup
University of California, Berkeley
College of Engineering
Computer Science Division, EECS
Spring 2001, John Kubiatowicz
March 1, 2001
CS152 Computer Architecture and Engineering
Your Name:
SID Number:
Discussion Section:
Problem Possible Score
1 20
2 20
3 30
4 30
Total
Problem 1c: What is the CPI and MIPS rating of the new processor?
Problem 1d: What is the original cost per (working) processor?
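[Worked equations garbled in extraction. Recoverable structure: compute dies/wafer and dieYield, then dieCost = waferCost / (dies/wafer × dieYield). Surviving numeric fragments: 36, 12, 0.27.]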
Problem 1e: What is the new cost per (working) processor?
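[Worked equations garbled in extraction. Recoverable structure: compute dies/wafer and dieYield, then dieCost = waferCost / (dies/wafer × dieYield). Surviving numeric fragments: 56, 10, 0.36.]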
Problem 1f: Assume that we are considering the other direction of improving the original processor by increasing the speed of floating point. What is the best possible speedup that we could get, and what would the CPI and MIPS rating be of the new processor?
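The easiest thing to do is use Amdahl's law:

speedup = 1 / ((1 - f) + f/n)

The best possible speedup comes as n goes to infinity (i.e. speeding up floating-point really well). In this case, f is the fraction of time normally spent on floating point, so the f/n term vanishes:

Max speedup = (1 - 0.319)^-1 ≈ 1.47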
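The Amdahl's law calculation above can be checked numerically. The following is a minimal sketch; the function names are illustrative and not part of the exam, and the value f = 0.319 is taken from the "Max speedup" line.

```python
def amdahl_speedup(f, n):
    """Overall speedup when a fraction f of the run time is sped up by a factor n."""
    return 1.0 / ((1.0 - f) + f / n)


def amdahl_limit(f):
    """Best possible speedup: the limit of amdahl_speedup(f, n) as n grows without bound."""
    return 1.0 / (1.0 - f)


# Fraction of time spent in floating point, as used in the solution above.
F_FLOATING_POINT = 0.319
best_case = amdahl_limit(F_FLOATING_POINT)
```

Evaluating `amdahl_speedup` with a very large `n` approaches `amdahl_limit`, which is why the solution only needs the (1 - f) term.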
Assume the following characteristics for NAND gates: Input load: 120fF, Internal delay: TPlh=0.3ns, TPhl=0.6ns, Load-Dependent delay: TPlhf=0.0020ns/fF, TPhlf=0.0021ns/fF
Problem 2a: Suppose that we construct an XOR, as follows:
Compute the standard parameters for the linear delay models for this complex gate, assuming the parameters given above for the NAND gate. Assume that a wire doubles the input capacitance of the gate that it is attached to:
A Input Capacitance: 240fF
B Input Capacitance: 240fF
Load-dependent Delays:
TPAYlhf: 0.0020 ns/fF
TPAYhlf: 0.0021 ns/fF
TPBYlhf: 0.0020 ns/fF
TPBYhlf: 0.0021 ns/fF
Maximum Internal delays for A⇒Y: TPAYlh:
Critical path goes through 3 gates. Low-to-high on output implies high-to-low on inputs to last gate, which implies low-to-high on input A. Note that the two internal nodes are driven, so we multiply capacitance by 2:
TPAYlh = 0.3ns+(2)(240fF)(0.0020ns/fF) + 0.6ns + (2)(120fF)(0.0021ns/fF) + 0.3ns = 2.664ns
TPAYhl:
High-to-low on output implies low-to-high on inputs to last gate, which implies high-to-low on input A.
TPAYhl = 0.6ns + (2)(240fF)(0.0021ns/fF) + 0.3ns + (2)(120fF)(0.0020ns/fF) + 0.6ns = 2.988ns
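As a quick check of the two sums above, here is a small sketch that evaluates the same terms from the NAND parameters given in the problem. The constant and function names are illustrative only.

```python
# NAND parameters quoted in the problem statement.
T_LH, T_HL = 0.3, 0.6          # internal delays, ns
K_LH, K_HL = 0.0020, 0.0021    # load-dependent delays, ns/fF


def xor_internal_delays():
    """Return (TPAYlh, TPAYhl) in ns along the 3-gate critical path.

    Each internal node load is doubled for the wire, as the problem states:
    (2)(240 fF) after the first gate and (2)(120 fF) after the second.
    """
    # Output rises: first gate rises, second gate falls, last gate rises.
    tp_ay_lh = T_LH + 2 * 240 * K_LH + T_HL + 2 * 120 * K_HL + T_LH
    # Output falls: first gate falls, second gate rises, last gate falls.
    tp_ay_hl = T_HL + 2 * 240 * K_HL + T_LH + 2 * 120 * K_LH + T_HL
    return tp_ay_lh, tp_ay_hl
```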
Problem 2c: Now, put these 2-input blocks together to produce a 4-input block that takes I0, I1, I2, and I3, and Cdown and produces:
O0 = I0 ⊕ Cdown
O1 = I1 ⊕ I0 ⊕ Cdown
O2 = I2 ⊕ I1 ⊕ I0 ⊕ Cdown
O3 = I3 ⊕ I2 ⊕ I1 ⊕ I0 ⊕ Cdown
Cup = I3 ⊕ I2 ⊕ I1 ⊕ I0
Your goal is to minimize the output delay of each block.
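For reference, the input/output behaviour of the 4-input block can be written as a short functional model. This sketch only specifies what the block computes; it says nothing about how the XOR gates are arranged to minimize delay, and the name `prefix4` is an assumption for illustration.

```python
def prefix4(inputs, c_down):
    """Functional model of the 4-input XOR prefix block.

    inputs is [I0, I1, I2, I3] as 0/1 values.
    Returns ([O0, O1, O2, O3], Cup).
    """
    outputs = []
    running = c_down
    for bit in inputs:
        running ^= bit          # O_k = I_k xor ... xor I_0 xor Cdown
        outputs.append(running)
    c_up = inputs[0] ^ inputs[1] ^ inputs[2] ^ inputs[3]
    return outputs, c_up
```

Note that Cup deliberately excludes Cdown, which is what lets blocks be composed in a lookahead-style tree.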
Using only blocks from part 2b:
Compute the input capacitance for each input:
I0: 480fF, I1: 240fF, I2: 480fF, I3: 240fF, Cdown: 480fF
Identify the critical path of your circuit and compute the unloaded delay for this path.
Critical path from I0 to O3. Arrange so that two internal nodes go from high-to-low:
TPI0O3hl = 3 TPhlxor + 2 [TPhlfxor (2)(2)(240)] = 12.996 ns
TPI0O3lh = 2 TPhlxor + 2 [TPhlfxor (2)(2)(240)] + TPlhxor = 12.672 ns
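The two critical-path figures can be reproduced from the XOR parameters of Problem 2a (TPAYlh = 2.664 ns, TPAYhl = 2.988 ns computed from the terms given there, and a high-to-low load factor of 0.0021 ns/fF). A minimal sketch, with illustrative names:

```python
TP_LH_XOR = 2.664      # ns, unloaded XOR low-to-high delay from Problem 2a
TP_HL_XOR = 2.988      # ns, unloaded XOR high-to-low delay from Problem 2a
K_HL_XOR = 0.0021      # ns/fF, XOR high-to-low load-dependent delay


def block_critical_path_delays():
    """Return (TPI0O3hl, TPI0O3lh) in ns for the 4-input block."""
    # Two internal nodes, each loaded by (2)(2)(240 fF), both switching high-to-low.
    internal_load_delay = 2 * (K_HL_XOR * 2 * 2 * 240)
    tp_hl = 3 * TP_HL_XOR + internal_load_delay
    tp_lh = 2 * TP_HL_XOR + internal_load_delay + TP_LH_XOR
    return tp_hl, tp_lh
```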
[Figure: 4-input prefix block with inputs I3, I2, I1, I0 and Cdown, and outputs O3, O2, O1, O0 and Cup]
Problem 2d: Finally, show how the 4-input prefix circuit can be used as a building block to produce a 16-element prefix circuit that minimizes gate reuse and which has minimal delay. What is the critical path and how many XOR gates are in it?
Hint: this is very similar to a carry-lookahead adder.
The critical path is from I0 up through the central logic and back through the Cdown of the last stage to O14 or O15.
Adding this up, we get: 2 + 3 + 2 = 7 XOR gates
Problem 2e: How many XOR gates are in the critical path of a 64-bit parallel-prefix circuit?
This adds one more level of blocks. Tracing the first input to last output, we note that we have 2 for each level up, 3 for the top level, and 2 for each level down: 2 + 2 + 3 + 2 + 2 = 11 xor gates.
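The counting rule used above (2 XOR gates per level going up, 3 at the top level, 2 per level coming down) can be written as a small function. Extending it to sizes other than 16 and 64 is an extrapolation of the same rule, not something the exam states, and the function name is illustrative.

```python
def critical_path_xors(n_inputs):
    """XOR gates on the critical path of a prefix tree built from 4-input blocks.

    Assumes n_inputs is a power of 4 (4, 16, 64, ...).
    """
    levels = 0
    size = 1
    while size < n_inputs:
        size *= 4
        levels += 1
    if size != n_inputs or levels == 0:
        raise ValueError("n_inputs must be a power of 4, at least 4")
    lower_levels = levels - 1
    # 2 per level on the way up, 3 at the top, 2 per level on the way down.
    return 2 * lower_levels + 3 + 2 * lower_levels
```

For 16 inputs this gives 2 + 3 + 2 = 7, and for 64 inputs 2 + 2 + 3 + 2 + 2 = 11, matching the two answers above.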
[Figure: 16-element prefix circuit built from five 4-input blocks. Four blocks take the inputs and produce o15 through o0; a fifth central block connects their Cup and Cdown signals. Overall Cup and Cdown are shown at the ends.]
Recall how divide (in base 10) works. The following shows a division of 1 by 23:
Suppose we had a procedure that produced each of the digits (zeros) in the dividend, one at a time. Consider the remainders as integers from the current decimal point. So, for instance, we have the remainders 1, 10, 100, 80, 110, 180, etc. At each stage, we multiply by ten, add the incoming digit (zero in the example), then
This could be combined with the current remainder by multiplying the remainder by 10, adding the new digit (which is zero in this case), then seeing how many times the divisor goes into the result.
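The digit-at-a-time division described above can be sketched as follows. The function name and the list-based interface are assumptions for illustration; the exam's own version keeps the remainder in an array between calls instead.

```python
def stream_divide(dividend_digits, divisor):
    """Long division fed one dividend digit at a time.

    dividend_digits lists the decimal digits of the dividend, most
    significant first. Returns (quotient_digits, final_remainder).
    """
    remainder = 0
    quotient_digits = []
    for digit in dividend_digits:
        # Multiply the running remainder by ten and add the incoming digit.
        remainder = remainder * 10 + digit
        # See how many times the divisor goes into it; keep what is left over.
        quotient_digit, remainder = divmod(remainder, divisor)
        quotient_digits.append(quotient_digit)
    return quotient_digits, remainder


# 1 / 23: a leading 1 followed by ten zeros after the decimal point.
digits_of_one_over_23, _ = stream_divide([1] + [0] * 10, 23)
```

The first quotient digit is the ones place; the rest are the digits after the decimal point. Before each division the running value takes the values 1, 10, 100, 80, 110, 180, and so on, as listed in the text.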
Here is complete pseudo code for computing one of the streams (Note: we have fixed a couple of the typos):

Stream (digitnum, incoming, oddnum, sign, xsquared, termID, maxtermID) {
    ARemainder = A_REMARRAY[termID];
    ARemainder = ARemainder × 10 + incoming;

    ; This is a quotient/remainder operation
    (ADigit, ARemainder) = ARemainder / xsquared;
    A_REMARRAY[termID] = ARemainder;

    BRemainder = B_REMARRAY[termID];
    BRemainder = BRemainder × 10 + ADigit;
    (BDigit, BRemainder) = BRemainder / oddnum;
    B_REMARRAY[termID] = BRemainder;

    AddInDigit (BDigit, digitnum, sign);

    If ((termID == maxtermID) && (ADigit != 0)) {
        A_REMARRAY[termID+1] = 0;
        B_REMARRAY[termID+1] = 0;    /* This was missing originally */
        maxtermID++;
    }

    If (termID < maxtermID) {
        maxtermID = Stream (digitnum, ADigit, (oddnum+2), -sign, xsquared, (termID+1), maxtermID);
    }
    return maxtermID;    /* This was missing originally */
}
Problem 3a: Write MIPS assembly for this pseudo code. Make sure to adhere to MIPS conventions. Assume that A_REMARRAY[] and B_REMARRAY[] are word arrays that are addressed via constants (assume that you can use the la pseudo instruction to load their addresses into registers). Also, assume that there are 7 argument registers ($a0 - $a6) for the sake of this problem. Note that AddInDigit is a procedure call.
Stream:
    subiu $sp, $sp, 36          ; 7 args, 1 ret addr, 1 temp (ADigit)
    sw    $ra, 36($sp)          ; Save return address
    sw    $a0, 32($sp)          ; Save $a0
    <... etc ...>               ; Save $a1 - $a5
    sw    $a6, 8($sp)           ; Save $a6
    sll   $t0, $a5, 2           ; Convert termID to word index
    la    $t1, A_REMARRAY
    addu  $t1, $t1, $t0         ; address of ARemainder
    lw    $t2, 0($t1)           ; Get ARemainder
    mul   $t2, $t2, 10          ; x10 (pseudo-instruction)
    addu  $t2, $t2, $a1         ; Add in incoming digit
    divu  $t2, $a4              ; Divide by xsquared
    mfhi  $t2                   ; New remainder
    sw    $t2, 0($t1)           ; Save it into array
    mflo  $t3                   ; ADigit
    sw    $t3, 4($sp)           ; Save ADigit for later
    la    $t1, B_REMARRAY
    addu  $t1, $t1, $t0         ; address of BRemainder
    lw    $t2, 0($t1)           ; Get BRemainder
    mul   $t2, $t2, 10          ; x10 (pseudo-instruction)
    addu  $t2, $t2, $t3         ; Add in ADigit
    divu  $t2, $a2              ; Divide by oddnum
    mfhi  $t2                   ; New BRemainder
    sw    $t2, 0($t1)           ; Save back into array
    move  $a2, $a3              ; sign (third arg)
    move  $a1, $a0              ; digitnum (second arg)
    mflo  $a0                   ; Get BDigit (first arg)
    jal   AddInDigit
    lw    $a0, 32($sp)          ; Restore digitnum (arg 1)
    lw    $a1, 4($sp)           ; Restore ADigit to $a1 (incoming for next term)
    lw    $a2, 24($sp)          ; restore oddnum
    lw    $a3, 20($sp)          ; restore sign
    lw    $a4, 16($sp)          ; restore xsquared
    lw    $a5, 12($sp)          ; restore termID
    lw    $v0, 8($sp)           ; restore maxtermID (will return)
    bne   $a5, $v0, finalcheck  ; termID != maxtermID
    beq   $a1, $zero, finalcheck ; ADigit == 0 (temporaries do not survive the call)
    sll   $t0, $a5, 2           ; Recompute word index after the call
    la    $t1, A_REMARRAY
    addu  $t1, $t1, $t0         ; address of A_REMARRAY[termID]
    sw    $zero, 4($t1)         ; store zero at A_REMARRAY[termID+1]
    la    $t1, B_REMARRAY
    addu  $t1, $t1, $t0         ; address of B_REMARRAY[termID]
    sw    $zero, 4($t1)         ; store zero at B_REMARRAY[termID+1]
    addiu $v0, $v0, 1           ; maxtermID++

finalcheck:
    bge   $a5, $v0, return      ; Recurse only if termID < maxtermID (pseudo-op)
    addiu $a2, $a2, 2           ; oddnum+2
    subu  $a3, $zero, $a3       ; sign = -sign
    addiu $a5, $a5, 1           ; termID+1
    move  $a6, $v0              ; pass current maxtermID
    jal   Stream                ; result (new maxtermID) comes back in $v0

return:
    lw    $ra, 36($sp)
    addiu $sp, $sp, 36          ; restore stack
    jr    $ra                   ; return
Problem 3c: Explain the initialization of the A_REMVALUE[] and B_REMVALUE[] arrays if we were going to compute 4 arctan(1/5). What is the purpose of the termID and maxtermID parameters?

We are just going to fold the 4 into our calculations. If we let the 4 be part of the A0 computation, then every other term will be multiplied by 4 automatically (since A1 depends on A0, etc). Thus, we simply have an outer loop that produces the digits of 4/5 one at a time and feeds them to "stream". So, we will use A_REMVALUE[] and B_REMVALUE[] for all terms beyond the first one. Since each new remainder gets zeroed as it is needed, we merely have to set the first element of each array to zero. Thus, let A_REMVALUE[0] = 0 and B_REMVALUE[0] = 0.

The variable termID tracks which term of the series we are currently working on. Since the first term (the x term) is a little special (it is not derived from other terms by dividing by x^2), we will let termID=0 be the x^3/3 term, termID=1 be the x^5/5 term, etc. The maxtermID is the maximum term that we have produced nonzero values for up to now. Note that in the early stages of the computation, almost all terms are zero, hence we start with termID=maxtermID=0.
Problem 3d: Explain the initialization of the FINALVALUE array:
Each digit of the FINALVALUE array must be initialized to zero before it is used. Since we are walking through the "answer" one digit at a time, we can choose to initialize each digit just before we use it. (I.e. when we are working on the 10ths place, we don't care what is in the 100ths or 1000ths place, since we know to ignore it.)
Problem 3e: Write pseudo-code to compute 4 arctan(1/5) using stream(). Assume that the initialization in (3c) and (3d) is accomplished.
FINALVALUE[0] = 0                        ; Set ones place to zero
FINALVALUE[1] = 8                        ; This is 4/5
A_REMVALUE[0] = B_REMVALUE[0] = 0        ; Start with 1 term

; Handle first digit (10ths place)
maxtermID = stream(1, 8, 3, -1, 25, 0, 0)
for (digitnum = 2; true; digitnum = digitnum + 1) {
    FINALVALUE[digitnum] = 0;
    maxtermID = stream(digitnum, 0, 3, -1, 25, 0, maxtermID);
}
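To see the streams working together, here is a hedged Python sketch of the Stream pseudo code plus the driver loop, assuming the target is 4·arctan(1/5) (consistent with xsquared = 25 and the first digit 8). Several things are assumptions for illustration: the loop is bounded by a hypothetical `ndigits` parameter instead of running forever, AddInDigit is modelled as adding a signed digit into a list without carry handling, and the digits are combined into one scaled integer at the end.

```python
import math


def four_arctan_one_fifth(ndigits):
    """Return an integer approximating 4*arctan(1/5) * 10**ndigits."""
    a_rem = [0]                      # A remainders, one per term
    b_rem = [0]                      # B remainders, one per term
    final = [0] * (ndigits + 1)      # final[k] is the 10**-k place (may go negative)

    def add_in_digit(digit, digitnum, sign):
        final[digitnum] += sign * digit

    def stream(digitnum, incoming, oddnum, sign, xsquared, term_id, max_term_id):
        a_digit, a_rem[term_id] = divmod(a_rem[term_id] * 10 + incoming, xsquared)
        b_digit, b_rem[term_id] = divmod(b_rem[term_id] * 10 + a_digit, oddnum)
        add_in_digit(b_digit, digitnum, sign)
        if term_id == max_term_id and a_digit != 0:
            a_rem.append(0)          # start the next term with zero remainders
            b_rem.append(0)
            max_term_id += 1
        if term_id < max_term_id:
            max_term_id = stream(digitnum, a_digit, oddnum + 2, -sign,
                                 xsquared, term_id + 1, max_term_id)
        return max_term_id

    final[1] = 8                     # digits of 4/5 = 0.8000...
    max_term_id = stream(1, 8, 3, -1, 25, 0, 0)
    for digitnum in range(2, ndigits + 1):
        max_term_id = stream(digitnum, 0, 3, -1, 25, 0, max_term_id)

    # Resolve the signed digits into a single scaled integer.
    return sum(d * 10 ** (ndigits - k) for k, d in enumerate(final))


def reference_value():
    """Floating-point reference for comparison."""
    return 4 * math.atan(0.2)
```

Because every term is truncated rather than rounded, the last digit or two of the result can differ from the true value; a real implementation would compute a few guard digits.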
op | rs | rt | rd | shamt | funct = MEM[PC]
op | rs | rt | Imm16 = MEM[PC]

INST   Register Transfers
ADDU   R[rd] ← R[rs] + R[rt]; PC ← PC + 4
SUBU   R[rd] ← R[rs] - R[rt]; PC ← PC + 4
ORI    R[rt] ← R[rs] | zero_ext(Imm16); PC ← PC + 4
LW     R[rt] ← MEM[R[rs] + sign_ext(Imm16)]; PC ← PC + 4
SW     MEM[R[rs] + sign_ext(Imm16)] ← R[rt]; PC ← PC + 4
BEQ    if (R[rs] == R[rt]) then PC ← PC + 4 + (sign_ext(Imm16) || 00) else PC ← PC + 4
For your reference, here is the microcode for two of the 6 MIPS instructions:

Label     ALU   SRC1  SRC2     ALUDest  Memory  MemReg  PCWrite     Sequence
Fetch     Add   PC    4                 ReadPC  IR      ALU         Seq
Dispatch  Add   PC    ExtShft                                       Dispatch
RType     Func  rs    rt                                            Seq
                               rd-ALU                               Fetch
BEQ       Sub   rs    rt                                ALUoutCond  Fetch
jal
compmul $rd, $rs, $rt ⇒
    R[rd]   = (R[rs]×R[rt]) – (R[rs+1]×R[rt+1])
    R[rd+1] = (R[rs]×R[rt]) + (R[rs+1]×R[rt+1])
    PC ← PC + 4
This math was a typo. The real way to compute complex multiply is:
compmul $rd, $rs, $rt ⇒
    R[rd]   = (R[rs]×R[rt]) – (R[rs+1]×R[rt+1])
    R[rd+1] = (R[rs]×R[rt+1]) + (R[rs+1]×R[rt])
    PC ← PC + 4
We will give the solution with the original spec (for fairness).
Problem 4a: (2 pts) How wide are microinstructions in the original datapath (answer in bits and show some work!)?
2 + 1 + 3 + 2 + 2 + 1 + 2 + 2 = 15 bits wide
The trickiest part of this computation is the PC Write field. We have to remember to represent the “do nothing” option, which means that there are actually three different values for the PC Write field.
Problem 4b: (4 points) Draw a block diagram of a microcontroller that will support the new instructions (it will be slightly different than that required for the original instructions). Include sequencing hardware, the dispatch ROM, the microcode ROM, and decode blocks to turn the fields of the microcode into control signals. Make sure to show all of the control signals coming from somewhere. (Hint: The PCWr, PCWrCond, and PCSrc signals must come out of a block connected to the PCWrite field of the microinstruction).
2 points were given for drawing a decent microcontroller for the old datapath. 1 point was given if the branching (exception) mechanism was implemented with a mux. Another point was given for showing some new control signals (EPCWrite is the most notable).
3) Expand PCSrc mux to take in 0x80000080.
mfc0: 4 points
1) Cause is register 13 (01101) and EPC is register 14 (01110); their LSBs differ, so just use a mux with the LSB of $rt as the selector to choose between Cause and EPC. Any other values of $rt are don't-cares.
2) Expand MemtoReg mux to take in the CauseOrEPC.
Alternatively, some students expanded SRC1 to be able to have the value of CauseOrEPC, but this has the disadvantage that you need to create a way for SRC to be forced to zero, and mfc0 would then require 4 instead of 3 microinstructions.
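The LSB-selected mux from step 1 can be modelled in a couple of lines. This is only a behavioural sketch with illustrative names; it relies on Cause being coprocessor register 13 and EPC being register 14, and treats every other register number as a don't-care.

```python
def cause_or_epc(reg_num, cause, epc):
    """Model of the CauseOrEPC mux: select using only the LSB of the register number.

    13 = 0b01101 has LSB 1 and selects Cause; 14 = 0b01110 has LSB 0 and selects EPC.
    """
    return cause if (reg_num & 1) else epc
```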
compmul: 4 points
Correction: The math in the original test was wrong. The spec given on the exam was:
compmul $rd, $rs, $rt ⇒
    R[rd]   ← (R[rs]×R[rt]) – (R[rs+1]×R[rt+1])
    R[rd+1] ← (R[rs]×R[rt]) + (R[rs+1]×R[rt+1])
    PC ← PC + 4
But anyways, this error makes the problem a bit simpler, because with the buggy problem we need to calculate only two products instead of four, so this solution will go with the original instructions.
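The difference between the two specs is easy to see in a behavioural model. The sketch below uses illustrative function names and plain integers for the register contents: the spec as printed on the exam needs only two products, while a true complex multiply needs four.

```python
def compmul_exam_spec(rs, rs1, rt, rt1):
    """Spec as printed on the exam: two products, reused for both results."""
    p0 = rs * rt
    p1 = rs1 * rt1
    return p0 - p1, p0 + p1            # (R[rd], R[rd+1])


def compmul_true(rs, rs1, rt, rt1):
    """Actual complex multiply (rs + i*rs1) * (rt + i*rt1): four products."""
    real = rs * rt - rs1 * rt1
    imag = rs * rt1 + rs1 * rt
    return real, imag                  # (R[rd], R[rd+1])
```

For example, with (1 + 2i)(3 + 4i) the real parts agree, but the second result differs between the two versions.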
1) Add 32-bit multiplication capability. Either add the multiply operation to the ALU or put down a multiplier that takes in the same inputs as the ALU.
2) Add registers to store products.
You need at least two. Well, actually if a multiply-accumulate unit is used instead of a multiplier, you could go with just one, but that would make things complicated.
3) Expand ALUSelA and ALUSelB muxes to take in these products.
4) Add capability to read rs+1 and rt+1.
Some students did this with 5-bit adders and muxes. That’s fine, but you don’t need that much hardware because the registers are guaranteed to be even.
5) Add capability to read rd+1.
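The remark in step 4 about even registers can be illustrated briefly: when the register number is guaranteed to be even, adding 1 is the same as forcing the low bit to 1, so no 5-bit adder is needed. A minimal sketch with an illustrative function name:

```python
def partner_register(reg_num):
    """Return reg_num + 1 for an even register number by setting the low bit."""
    if reg_num & 1:
        raise ValueError("register number is assumed to be even")
    return reg_num | 1
```

In hardware this amounts to wiring the low bit of the register specifier to 1 (or selecting it with a 1-bit mux).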