











University of California, Berkeley
College of Engineering
Computer Science Division (EECS)

Spring 1999                                        John Kubiatowicz
April 21, 1999

CS152 Computer Architecture and Engineering
Your Name: Solution
SID Number:
Discussion Section:
Problem 1c: Suppose you have a 32-bit processor, with a virtual-memory page-size of 16K. The data cache is 32K in size with 32-byte cache blocks. Finally, your TLB has 4 entries. Assume that you wish to do TLB lookups in parallel with cache lookups.
Draw a block diagram of the data cache and TLB organization, showing a virtual address as input and both a physical address and data as output. Include cache hit and TLB hit output signals. Include as much information about the internals of the TLB and cache organization as possible. Include, among other things, all of the comparators in the system and any muxes as well. You can indicate RAM with a simple block, but make sure to label address widths and data widths. Make sure to use abstraction in your diagram so that we can understand it. Label the function of various blocks and the width of any buses.
Answer: The key observation is that, in order to do parallel access to TLB and cache data, you can only use the untranslated page-offset bits (VA[13:0]) to index the cache. The 32K cache with 32-byte blocks has 1024 blocks, but only 9 index bits fit below the 14-bit page offset, so the cache must be organized as 512 two-block sets (i.e. 2-way set associative). This is enough to draw our diagram:
[Block diagram (reconstructed from figure text): a 4-entry fully associative TLB in which each entry's VPN is compared against VA[31:14], an encoder selects the matching Physical Page Number, and the OR of the valid compares produces TLB Hit. In parallel, the low virtual-address bits (VA[13:4]) index the two banks of cache tags and cache data (2-way set associative); each way's valid bit and cache tag are compared against the physical page number, and the OR of the two compares produces Cache Hit.]
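To make the geometry concrete, here is a small sanity check (mine, not part of the original solution) that computes the number of sets and ways forced by keeping the cache index entirely within the 14-bit page offset; all numbers come from the problem statement:

#include <stdio.h>

int main(void) {
    int cache_size  = 32 * 1024;   /* bytes                                   */
    int block_size  = 32;          /* bytes -> 5 block-offset bits            */
    int offset_bits = 14;          /* log2(16K page) = untranslated VA bits   */
    int block_bits  = 5;           /* log2(block_size)                        */

    int sets = 1 << (offset_bits - block_bits);    /* 2^9 = 512 sets          */
    int ways = cache_size / (sets * block_size);   /* 32K / (512*32B) = 2     */

    printf("%d sets, %d-way set associative\n", sets, ways);
    return 0;
}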
Now, assume the following instruction mix: Loads: 20%, Stores: 15%, Integer: 29%, Floating-Point: 16%, Branches: 20%
Assume that you have a memory hierarchy consisting of 2 levels of cache, 1 level of DRAM, and a DISK. The following parameters are appropriate. Assume a 200MHz processor:
Component           Hit Time                     Miss Rate                  Block Size
First-Level Cache   1 cycle                      5% Data, 1% Instructions   32 bytes
Second-Level Cache  10 cycles + 1 cycle/64 bits  3%                         128 bytes
DRAM                100ns + 25ns/8 bytes         1%                         16K bytes
DISK                50ms + 20ns/byte             0%                         16K bytes
In addition, assume that there is a TLB which misses 0.1% of the time on data (doesn’t miss on instructions) and which has a fill penalty of 50 cycles.
Problem 1d: What is the average memory access time for Instructions? For Data?
AMAT     = HT_L1  + MR_L1 * AMAT_L2 + MR_TLB * MP_TLB
AMAT_L2  = HT_L2  + MR_L2 * AMAT_RAM
AMAT_RAM = HT_RAM + MR_RAM * AMAT_DISK

HT_L2   = 10 cycles + 1 cycle/64 bits * 32 bytes * 8 bits/byte = 14 cycles
HT_RAM  = 100ns + 25ns/8 bytes * 128 bytes = 500ns = 100 cycles
HT_DISK = 50ms + 20ns/byte * 16K bytes = 50.32768ms = 10.065536 * 10^6 cycles

AMAT_L2   = 3036.66 cycles
AMAT_inst = 31.36 cycles
AMAT_data = 152.88 cycles

Note: HT = hit time, MR = miss rate, MP = miss penalty, AMAT = Average Memory Access Time
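As a cross-check (mine, not the exam's), the numbers above can be plugged in directly; the 200MHz clock gives 5 ns per cycle:

#include <stdio.h>

int main(void) {
    double ht_l1   = 1.0;                                /* cycles             */
    double ht_l2   = 10.0 + (32.0 * 8.0) / 64.0;         /* 14 cycles          */
    double ht_ram  = (100.0 + 25.0 * 128.0 / 8.0) / 5.0; /* 500 ns = 100 cyc   */
    double ht_disk = (50e6 + 20.0 * 16384.0) / 5.0;      /* 50.32768 ms in cyc */

    double amat_ram  = ht_ram + 0.01 * ht_disk;          /* disk never misses  */
    double amat_l2   = ht_l2  + 0.03 * amat_ram;
    double amat_inst = ht_l1  + 0.01 * amat_l2;
    double amat_data = ht_l1  + 0.05 * amat_l2 + 0.001 * 50.0; /* + TLB term   */

    printf("AMAT_L2   = %.2f cycles\n", amat_l2);        /* ~3036.66           */
    printf("AMAT_inst = %.2f cycles\n", amat_inst);      /* ~31.4              */
    printf("AMAT_data = %.2f cycles\n", amat_data);      /* ~152.88            */
    return 0;
}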
Problem 2a: Let the ALU support multiplication. You cannot change or duplicate the memory component, or change or duplicate the ALU component, but are allowed to add muxes, registers, equality comparators, and random logic. Estimate the minimum number of cycles (on average) that you can hope to achieve in the inner "while" loop. Justify your answer by discussing the operations that must be performed on each iteration and showing a timing diagram for three iterations of the inner loop. Don't try to change the datapath yet. You will do that in (2b).
Answer (#1): Without migrating computation out of the inner loop, there are 7 cycles of operations, since each address takes 2 cycles to compute (not forgetting to add the base, e.g. poly1!!!), and there is a multiply, an add, and one decrement. Note that we moved the incrementing of indexdeg2 before the memory operations, since this takes care of the +1 part of the address computation. We must be careful to overlap memory operations, or we will have to take more cycles. Here is one iteration of the loop; the arrows show address computations:
[Timing diagram, cycles 1-7: indexdeg2++, the two-cycle address computations poly1 + indexdeg1<<2 and poly2 + indexdeg2<<2, indexdeg1--, multiply, and add, with the poly1[] and poly2[] reads overlapped with the address computations.]
Note that we must check the condition for the while in cycle 1, 2, or 3, since indexdeg2 has changed by cycle 3.
[Figure: the baseline multicycle MIPS datapath - PC, ideal memory (RAdr, WrAdr, Din, Dout; MemWr, IorD), Instruction Reg (IRWr), Mem Data Reg, Reg File (Ra, Rb, Rw; busA, busB, busW; RegWr; RegDst mux selecting Rt/Rd), extender (ExtOp) with <<2, ALU with ALU Control (ALUOp), ALUSelA/ALUSelB muxes (A, B, 4, Imm, Imm<<2 inputs), ALU Out register, Zero/NEG output, and PC write logic (PCWr, PCWrCond, PCSrc, MemtoReg); all buses are 32 bits.]
Answer(#2): Let's move the "+1" functionality out of the inner loop. We can hope for a 6-cycle inner loop. Here is one "cycle" of the loop, assuming that we have incremented poly1 and poly2 by one element (i.e. by 4) outside the inner loop:
[Timing diagram, cycles 1-6: poly1 + indexdeg1<<2, poly2 + indexdeg2<<2, indexdeg1--, indexdeg2++, multiply, add, with the poly1[] and poly2[] reads overlapped.]
Answer(#3): Alternatively, we could increment indexdeg1 outside the loop and leave poly1 and poly2 alone. Then, by creative incrementing and decrementing, we can avoid the extra +1 computation. This means that we need to check for the boundary condition "indexdeg2 ≤ degree2" before cycle 3 (when indexdeg2 changes) and check for the boundary condition "indexdeg1 ≥ 1" before cycle 5, or check indexdeg1 ≥ 0 on cycle 5 or 6:
[Timing diagram, cycles 1-6: poly1 + indexdeg1<<2, indexdeg2++, poly2 + indexdeg2<<2, indexdeg1--, multiply, add, with the poly1[] and poly2[] reads overlapped.]
Answer(#4): Finally, if we are really clever, we could reduce this down to 4 cycles by using pointers that have the base values already added in. In the following, assume that "point1 = poly1 + indexdeg1<<2" and that "point2 = poly2 + indexdeg2<<2". Also assume that indexdeg1 is ahead by 1 (i.e. point1 is ahead by 4) at the beginning of the loop. Then, in cycle 1 we check the boundary condition "point2 ≤ poly2+degree2<<4" and in cycle 3 or 4 we check the boundary condition "point1 ≥ poly1":
[Timing diagram, cycles 1-4: point2 += 4 (with the Cond2 check), point1 -= 4, multiply, add, with the point1[] and point2[] reads and the Cond1 check overlapped.]
Just to make clear what we have done, here would be modified code to reflect the last option:
point1 = poly1 + indexdeg1<<4 + 4;
point2 = poly2 + indexdeg2<<4;

/* (indexdeg1+indexdeg2) = resultdeg throughout loop */
accum = 0;
do {
    cond2 = (point2 == poly2);                /* Save this on cycle 1 */
    point2 = point2 + 4;
    accum = accum + (*point1) * (*point2);    /* The "*" means to find the value of the pointer */
    point1 = point1 - 4;
    cond1 = (point1 == poly1 + degree2<<4);
} until (cond1 or cond2);
Registers needed for this code: point1, point2, poly2, (poly1+degree2<<4), accum, cond2; Note also that we will multiply everything by 4 to make this work as well.
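For reference, here is what the inner loop computes, written as a self-contained C function over array indices (my reconstruction, ignoring the pointer and scheduling tricks above; the coefficient type and the +1 storage offset, with the degree kept in element 0, are assumptions based on the "+1" the answers mention):

long inner_loop(const long *poly1, const long *poly2,
                long indexdeg1, long indexdeg2, long degree2)
{
    long accum = 0;
    /* Walk the two index counters in opposite directions, accumulating the
     * products of paired coefficients (coefficient k stored at [k + 1]).    */
    while (indexdeg1 >= 0 && indexdeg2 <= degree2) {
        accum += poly1[indexdeg1 + 1] * poly2[indexdeg2 + 1];
        indexdeg1--;
        indexdeg2++;
    }
    return accum;
}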
Figure 2a: Additions to the ALU path for polynomial multiply. Note that the ALU input muxes in this figure only include new inputs. Assume that the other inputs are still there.
Figure 2b: Additions to the RegFile control to support polynomial multiply. The primary change is the addition of an extra address MUX to give us access to register RD. Also, added enable to A register.
Figure 2c: Additions to the Memory component to support polynomial multiply. Note that we have two new muxes to support writes of the accumulated value into Poly3. Also, we have added a temporary memory register.
[Figure 2a residue: the ALU with comparators ("=", "=", "=0") feeding a Cond latch, an additional ALUSelA input (ALUout, Regfile OutputA), additional ALUSelB inputs (ALUout, 4, MemDataReg1<<2, MemDataReg2<<2, 0), and an ALUOut mux writing the new registers indexdeg1, indexdeg2, degree1, degree2, resultdeg, accum, and poly3, each with its own enable (En).]
[Figure 2b residue: the Regfile with A and B output registers, an extra RA address mux selecting RD (InitPoly) or RS (~InitPoly), RB still driven by RT, plus the Poly In mux and Cond signal.]
[Figure 2c residue: the Ideal Memory with a PolyWrite mux choosing the write address (ALUout or Poly3), a PolyWrite mux choosing DIn (accum or the B register), and DOut feeding two memory data registers (MemDataReg1/2, each with an enable) that go to the ALU and the RegWr mux.]
Field Name        Values for Field   Function of Field with Specific Value
ALU               Add                ALU adds
                  Subt.              ALU subtracts
                  Func code          ALU does function code
                  Or                 ALU does logical OR
SRC1              PC                 1st ALU input = PC
                  rs                 1st ALU input = Reg[rs]
SRC2              4                  2nd ALU input = 4
                  Extend             2nd ALU input = sign ext. IR[15-0]
                  Extend0            2nd ALU input = zero ext. IR[15-0]
                  Extshft            2nd ALU input = sign ext., shift left IR[15-0]
                  rt                 2nd ALU input = Reg[rt]
Destination       rd ALU             Reg[rd] = ALUout
                  rt ALU             Reg[rt] = ALUout
                  rt Mem             Reg[rt] = Mem
Memory            Read PC            Read memory using PC
                  Read ALU           Read memory using ALU output
                  Write ALU          Write memory using ALU output
Memory register   IR                 IR = Mem
PC write          ALU                PC = ALU
                  ALUoutCond         If ALU Zero then PC = ALUout
Sequencing        Seq                Go to sequential microinstruction
                  Fetch              Go to the first microinstruction
                  Dispatch           Dispatch using ROM
Label    ALU    SRC1   SRC2      Dest.    Memory      Mem. Reg.   PC Write   Sequence
Fetch:   Add    PC     4                  Read PC     IR          ALU        Seq
         Add    PC     Extshft                                               Dispatch
Rtype:   Func   rs     rt                                                    Seq
                                 rd ALU                                      Fetch
Ori:     Or     rs     Extend0                                               Seq
                                 rt ALU                                      Fetch
Lw:      Add    rs     Extend                                                Seq
                                          Read ALU                           Seq
                                 rt MEM                                      Fetch
Sw:      Add    rs     Extend                                                Seq
                                          Write ALU                          Fetch
Beq:     Subt   rs     rt                                                    Fetch
Problem 2e: Finally, write microcode for the polynomial multiply instruction. (You are now an official CISC system designer!).
Note that we ended up calling this part of the problem “extra credit” and gave a few extra points to people who had started down the path.
Label   ALU   SRC1   SRC2   ALULatch   Dest.   Memory   Mem. Reg.   PC Write   Seq

/***** Fetch address of destination polynomial and put in register poly3 */
polymult:   InitPoly3   Seq

/***** Fetch poly1[0] and poly2[0] */
            Add poly1 0   Seq
            Add 0 poly2   rd ALU   MemData1   Seq
            rd ALU   MemData2   Seq

/***** Compute poly3[0] and initialize degree1 & degree2 with 4 * degree values */
            Add MemData1 MemData2   Seq
            Add MemData1<<2 0   degree1   wr Poly3   Seq
            Add 0 MemData2<<2   degree2   Seq

/***** Start resultdeg at end (i.e. degree3). Of course, like degree1 and degree2, this is actually *4 */
            Add degree2 degree1   resultdeg   Seq

/***** Initial value of poly3 is at end of result polynomial (since we are going to work backwards) */
            Add poly3 resultdeg   poly3   Seq
            Add poly3 4   poly3   Seq
/***** Compute indexdeg1 and indexdeg2. Next 4 lines are combination of max function and following subtract */
forloop:    Sub resultdeg degree1   indexdeg2   bneg forloop1
            Add 0 degree1   indexdeg1   jump forloop2
forloop1:   Add resultdeg 0   indexdeg1   Seq
            Add 0 0   indexdeg2   Seq
/***** Initialize accum variable for inner loop. */
forloop2:   Add 0 0   accum   Seq

/***** Let indexdeg1 be ahead by one iteration (to avoid extra +1 in [] - see discussion in 2b) */
            Add 4 indexdeg1   indexdeg1   Seq

/***** Next 6 microinstructions are the inner loop */
whileloop:  Add poly1 indexdeg1   LatchCond1   Seq
            Add indexdeg2 4   indexdeg2   rd ALU   MemData1   Seq
            Add indexdeg2 poly2   Seq
            Sub indexdeg1 4   indexdeg1   rd ALU   MemData2   Seq
            Mul MemData1 MemData2   Seq
            Add accum ALUout   accum   bwhile whileloop

/***** End of while loop. Write back result to poly3, update resultdeg and poly3 pointer */
endfor:     Sub resultdeg 4   resultdeg   wr Poly3   Seq
            Sub poly3 4   poly3   bfor forloop

/***** Last dummy instruction is just for fetching */
            fetch
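For readers who prefer C, here is my reading of what the whole polymult microinstruction sequence computes. The memory layout (element 0 holds the degree, elements 1..degree+1 hold the coefficients) and the types are assumptions, not something the exam states explicitly:

void polymult(const long *poly1, const long *poly2, long *poly3)
{
    long degree1   = poly1[0];
    long degree2   = poly2[0];
    long resultdeg = degree1 + degree2;
    poly3[0] = resultdeg;                 /* degree of the result            */

    /* The microcode walks the result coefficients from the end backwards.   */
    for (long rd = resultdeg; rd >= 0; rd--) {
        /* forloop: indexdeg2 = max(rd - degree1, 0), indexdeg1 = rd - indexdeg2 */
        long indexdeg2 = (rd > degree1) ? (rd - degree1) : 0;
        long indexdeg1 = rd - indexdeg2;
        long accum = 0;
        /* whileloop: accumulate products of the paired coefficients         */
        while (indexdeg1 >= 0 && indexdeg2 <= degree2) {
            accum += poly1[indexdeg1 + 1] * poly2[indexdeg2 + 1];
            indexdeg1--;
            indexdeg2++;
        }
        poly3[rd + 1] = accum;            /* endfor: write back, move on     */
    }
}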
For the following problem, assume an in-order, MIPS-style pipelined architecture with up to 4 cycles in the EX stage, but full forwarding for operations that take less than 4 cycles. Assume the following number of execution cycles are required:
a. Floating-point multiply: 4 cycles
b. Floating-point addition: 2 cycles
c. Integer operations: 1 cycle
Assume as well that there is one branch delay slot, that there is no delay between integer operations and dependent branch instructions, and that the load-use latency (or number of load delay slots) is 2 cycles.
One possible pipeline that might behave this way could appear as follows:
[Pipeline diagram: IF, ID, EX1, EX2, EX3, EX4, MEM1, MEM2, WR stages, with a branch-resolution (BR) block.]
Now, given this pipeline, the following code computes a dot-product. Assume that r1 and r2 contain addresses of arrays of floating-point numbers, and that r3 contains the length of the arrays (in elements). Assume that $f4 is initialized to zero. Then, the dot product can be computed as follows:
dotprod: lw   $f5, 0($r1)          ; load element from first array
         lw   $f6, 0($r2)          ; load element from second array
         muls $f7, $f5, $f6        ; multiply elements
         adds $f4, $f4, $f7        ; add elements to accumulator in $f4
         addi $r1, $r1, 4          ; advance pointers
         addi $r2, $r2, 4
         addi $r3, $r3, -1         ; decrement element count
         bne  $r3, $zero, dotprod  ; loop
         nop                       ; do nothing (branch delay slot)
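For reference, a plain-C version of the loop this assembly implements (function and parameter names are mine):

float dotprod(const float *a, const float *b, int n)
{
    float sum = 0.0f;               /* accumulator, like $f4 above            */
    for (int i = 0; i < n; i++) {   /* r3 counts elements down to zero        */
        sum += a[i] * b[i];         /* lw / lw / muls / adds per iteration    */
    }
    return sum;
}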
Problem 3a: How many cycles on average does each iteration take, without rearranging the code?
Ans: 14 cycles: 9 instructions + 2 stall cycles before “muls” + 3 stall cycles before “adds”.
Problem 3b: Rearrange the code so that it gets as few cycles per iteration as possible (don't unroll the loop). Show the scheduled code. How many cycles per iteration does it get now?
dotprod: lw   $f5, 0($r1)          ; load element from first array
         lw   $f6, 0($r2)          ; load element from second array
         addi $r1, $r1, 4          ; advance pointers
         addi $r2, $r2, 4
         muls $f7, $f5, $f6        ; multiply elements
         addi $r3, $r3, -1         ; decrement element count
         bne  $r3, $zero, dotprod  ; loop
         adds $f4, $f4, $f7        ; add elements to accumulator in $f4 (delay slot)
Now this gets 9 cycles/iteration: 8 instructions + one stall cycle before “adds”.
Problem 4a: There are three different types of data hazards, RAW, WAR, and WAW. Define them, giving a short code sequence to illustrate each, and describe how a 5-stage pipeline removes them:
a) RAW: Read After Write - the first instruction has not yet written its result to the register file, but the next instruction is trying to read that register.
   add  $1, $2, $3
   addi $4, $1, 50
   Fix: Forward data from the pipeline as needed (a sketch of the forwarding condition follows item (c) below).
b) WAR: Write After Read - first instruction is reading from a register that the next instruction has somehow already modified.
   add $1, $2, $3   # assume really long fetch or something
   add $2, $5, $6   # assume writes really fast
   Fix: Make each stage equal length in time and read the registers early and write late in the pipeline.
c) WAW: Write After Write - a later instruction writes to a register before the former instruction has modified it.
   add $1, $2, $3
   add $1, $4, $5
   Fix: Only modify the register file in the WB stage.
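As a concrete illustration of the forwarding fix in (a), here is the textbook-style EX-hazard test; the field names are my own, not the exam's:

#include <stdbool.h>

bool forward_from_exmem(bool exmem_regwrite, int exmem_rd, int idex_rs)
{
    /* Forward the ALU result sitting in EX/MEM if it targets a real register
     * that the instruction now in EX wants to read.                          */
    return exmem_regwrite && (exmem_rd != 0) && (exmem_rd == idex_rs);
}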
Problem 4b: What are control hazards? Name and explain two different techniques for getting rid of them.
Control hazards arise when the outcome (and target) of a branch or jump is not known early enough to fetch the correct next instruction.
(1) Waiting always fixes the problem, i.e. stall and bubble the pipeline.
(2) Branch prediction is another, more sophisticated solution.
(3) Another similar solution is to execute both branches.
(4) Changing the software model is also a valid solution, e.g. the branch delay slot.
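As one possible concrete form of technique (2), here is a minimal 2-bit saturating-counter predictor sketch; the table size, PC hashing, and initialization are assumptions:

#include <stdbool.h>

#define BHT_SIZE 1024
static unsigned char bht[BHT_SIZE];      /* 0-1: predict not taken, 2-3: taken */

bool predict_taken(unsigned pc) {
    return bht[(pc >> 2) % BHT_SIZE] >= 2;
}

void train(unsigned pc, bool taken) {
    unsigned char *c = &bht[(pc >> 2) % BHT_SIZE];
    if (taken)  { if (*c < 3) (*c)++; }  /* strengthen toward "taken"          */
    else        { if (*c > 0) (*c)--; }  /* strengthen toward "not taken"      */
}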
Problem 4c: Come up with two reasons why designers don't make 100-stage pipelines. Are there circumstances in which such a pipeline might make sense?
A 100-stage pipeline incurs far too many data and control hazards, requiring far too much hardware overhead to fix those hazards. Having so many stages also means having lots of pipeline registers, which means more area, a bigger clock network, more power dissipation, etc. Such a pipe may be feasible if there are very few dependencies and very little decision making, such as in stream-based processing like multimedia.
Problem 4d: What are precise exceptions and why are they important?
An exception is precise when all instructions following the one that caused the exception have not affected (modified) the state of the machine, and all previous instructions have finished completely (i.e. written back to the register file, etc.). Precise exceptions are important because they make returning from the exception easier to manage, and make it easier to figure out which exception caused the problem and what the problem is.
Problem 4e: Explain how to achieve precise exceptions in a standard 5-stage pipeline. Be explicit.
A precise exception is achieved by keeping track of which instruction caused the exception and waiting until the write-back stage to take the exception. This ensures that all instructions following and including the instruction that caused the exception do not modify the state of the system, and that all instructions before the exception-causing instruction have completed. To handle precise exceptions, extra hardware such as the Exception PC (EPC) and Cause registers is needed.
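A minimal sketch of this scheme, with assumed field names (not from the exam): stages only flag the fault, and the trap is taken at write-back.

#include <stdbool.h>

typedef struct {
    unsigned pc;            /* PC of the instruction (becomes EPC)             */
    bool     exception;     /* has an earlier stage flagged a fault?           */
    unsigned cause;         /* first cause recorded for this instruction       */
} PipeReg;

unsigned epc, cause_reg;    /* the EPC and Cause registers mentioned above     */

/* Any stage that detects a fault marks the instruction instead of trapping.   */
void flag_exception(PipeReg *p, unsigned cause) {
    if (!p->exception) { p->exception = true; p->cause = cause; }
}

/* Only WB acts on the flag: everything older has completed, and nothing
 * younger has written architectural state yet.                                */
bool writeback_check(const PipeReg *p) {
    if (p->exception) {
        epc = p->pc;
        cause_reg = p->cause;
        return true;        /* caller squashes younger instructions, jumps to the handler */
    }
    return false;
}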
Problem 4e: Figure 2 shows the basic components of a Tomasulo architecture. This architecture replaces the normal 5-stages of execution with 4 stages: Fetch, Issue, Execute, and Writeback. Explain what happens to an instruction in each of them (be as complete as you can):
a) Fetch: Fetch instructions from memory in program order and place them into the Instruction Queue.
b) Issue: Get the next instruction from the Instruction Queue and send it to the appropriate reservation station, replacing registers with values or tags (if the value is the pending result of an instruction in some reservation station).
c) Execute: Dispatch instructions queued in reservation stations to execution units when their register (tag) values are available. Mark the reservation station as available.
d) Writeback: Broadcast the result on the CDB (Common Data Bus). Any instructions waiting for the result will grab the value. This also updates the values in the register file.
[Figure 2 (Tomasulo architecture): instruction queue, FP registers, an integer unit with three Int reservation stations, a floating-point unit with two Float reservation stations, load buffers (From Mem) and store buffers (To Mem), all connected by the Common Data Bus (CDB).]
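To make the issue/writeback bookkeeping concrete, here is a minimal sketch in C with assumed structures and names (the exam gives no code for this); it only models how an operand is captured as either a value or a tag at issue, and how a CDB broadcast resolves waiting tags:

#include <stdbool.h>

#define NUM_RS   6
#define NUM_REGS 32

typedef struct {
    bool busy;
    bool ready1, ready2;    /* operand available?                              */
    long val1, val2;        /* operand values (valid when readyN is true)      */
    int  tag1, tag2;        /* producing RS index (valid when !readyN)         */
} ReservationStation;

typedef struct {
    bool busy;              /* a pending instruction will write this register  */
    int  tag;               /* reservation station that will produce the value */
    long value;
} RegStatus;

ReservationStation rs[NUM_RS];
RegStatus reg[NUM_REGS];

/* Issue: copy the value if the register is not busy, otherwise copy its tag.  */
void capture_operand(int r, bool *ready, long *val, int *tag) {
    if (!reg[r].busy) { *ready = true;  *val = reg[r].value; }
    else              { *ready = false; *tag = reg[r].tag;   }
}

/* Writeback: broadcast (tag, result) on the CDB; waiting reservation stations
 * and the register file pick the value up.                                    */
void cdb_broadcast(int tag, long result) {
    for (int i = 0; i < NUM_RS; i++) {
        if (!rs[i].busy) continue;
        if (!rs[i].ready1 && rs[i].tag1 == tag) { rs[i].val1 = result; rs[i].ready1 = true; }
        if (!rs[i].ready2 && rs[i].tag2 == tag) { rs[i].val2 = result; rs[i].ready2 = true; }
    }
    for (int r = 0; r < NUM_REGS; r++)
        if (reg[r].busy && reg[r].tag == tag) { reg[r].busy = false; reg[r].value = result; }
}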
Problem 4i: Finally, the Tomasulo algorithm has one interesting "bug" in it. Consider the situation where one instruction uses a value from another one. Suppose the dependent instruction is issued on the same cycle that the instruction it depends on is in writeback.
   add $r1, $r2, $r3   ← The result is broadcast
   ...
   add $r4, $r1, $r1   ← This one is being issued
What is the problem? Can you fix it easily?
Problem: As the second instruction is being issued with tags from the register file, the first instruction is finishing up and will remove its tag (because $r1 is no longer busy). Thus, the second instruction will be waiting for a tag that no longer exists at the end of the cycle, and deadlocks.
Solution: As in the MIPS register file, do the write in the first half of the cycle and the read in the second half of the cycle. This way, the second instruction will just get the actual value of $r1, instead of a tag that points to $r1.
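A tiny illustration of that fix (names assumed): within a simulated cycle, perform the CDB write-back update before the issue-time read of the register status, so the issuing instruction sees the value rather than the soon-to-vanish tag.

#include <stdbool.h>

typedef struct { bool busy; int tag; long value; } RegStatus;

/* First half of the cycle: the result for `tag` arrives and clears the tag.   */
void writeback_half(RegStatus *r, int tag, long result) {
    if (r->busy && r->tag == tag) { r->busy = false; r->value = result; }
}

/* Second half: the issuing instruction reads the register status afterwards,
 * so it captures the broadcast value instead of the stale tag.                */
void issue_half(const RegStatus *r, bool *ready, long *val, int *tag_out) {
    if (!r->busy) { *ready = true;  *val = r->value; }
    else          { *ready = false; *tag_out = r->tag; }
}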