CS152 Spring ’99 Midterm II                                      Page 1

University of California, Berkeley
College of Engineering
Computer Science Division  EECS

Spring 1999                                          John Kubiatowicz

Midterm II
Solutions
April 21, 1999
CS152 Computer Architecture and Engineering

Your Name: Solution
SID Number:
Discussion Section:

Problem   Possible   Score
1         20
2         30
3         25
4         25
Total     100


[ This page left for π ]

Problem 1c: Suppose you have a 32-bit processor, with a virtual-memory page-size of 16K. The data cache is 32K in size with 32-byte cache blocks. Finally, your TLB has 4 entries. Assume that you wish to do TLB lookups in parallel with cache lookups.

Draw a block diagram of the data cache and TLB organization, showing a virtual address as input and both a physical address and data as output. Include cache hit and TLB hit output signals. Include as much information about the internals of the TLB and cache organization as possible. Include, among other things, all of the comparators in the system and any muxes as well. You can indicate RAM as with a simple block, but make sure to label address widths and data widths. Make sure to use abstraction in your diagram so that we can understand it. Label the function of various blocks and the width of any buses.

Answer: The key observation is that, in order to access the TLB and the cache data in parallel, the cache index plus block offset must fit within the page offset (14 bits for 16K pages). A 32K direct-mapped cache with 32-byte blocks would need 15 bits, so we need 32K/16K = 2 ways (i.e. 2-way set associative). This is enough to draw our diagram:

[Block diagram: the data cache is 2-way set associative; each way holds Valid, Cache Tag, and Cache Data per cache block. VA[13:4] indexes the cache (the low virtual-address bits, below the page offset), and the two tag comparators feed an OR gate producing Cache Hit and driving the mux that selects the Cache Block. In parallel, the 4-entry fully associative TLB compares the virtual page number VA[31:14] against the VPN of every entry; the four comparator outputs feed an OR gate (TLB Hit) and an encoder that steers the matching PPN through a mux as the Physical Page Number, which is compared against the cache tags.]
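The associativity argument can be checked with a quick sketch (not part of the exam; the constant names are illustrative):

```python
# Verify the "index + offset must fit in the page offset" constraint that
# forces 2-way set associativity for these parameters.
import math

PAGE_SIZE = 16 * 1024        # 16K pages -> 14-bit page offset
CACHE_SIZE = 32 * 1024       # 32K data cache
BLOCK_SIZE = 32              # 32-byte blocks -> 5-bit block offset

page_offset_bits = int(math.log2(PAGE_SIZE))      # 14
block_offset_bits = int(math.log2(BLOCK_SIZE))    # 5

def index_plus_offset_bits(ways):
    """Virtual-address bits needed to index the cache for a given associativity."""
    sets = CACHE_SIZE // (ways * BLOCK_SIZE)
    return int(math.log2(sets)) + block_offset_bits

print(index_plus_offset_bits(1))  # 15 bits: exceeds the 14-bit page offset
print(index_plus_offset_bits(2))  # 14 bits: fits, so 2-way works
```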

Now, assume the following instruction mix: Loads: 20%, Stores: 15%, Integer: 29%, Floating-Point: 16%, Branches: 20%.

Assume that you have a memory hierarchy consisting of 2 levels of cache, 1 level of DRAM, and a disk, with the following parameters. Assume a 200MHz processor:

Component           Hit Time                     Miss Rate                  Block Size
First-Level Cache   1 cycle                      5% Data, 1% Instructions   32 bytes
Second-Level Cache  10 cycles + 1 cycle/64 bits  3%                         128 bytes
DRAM                100ns + 25ns/8 bytes         1%                         16K bytes
DISK                50ms + 20ns/byte             0%                         16K bytes

In addition, assume that there is a TLB which misses 0.1% of the time on data (doesn’t miss on instructions) and which has a fill penalty of 50 cycles.

Problem 1d: What is the average memory access time for Instructions? For Data?

AMAT = HT_L1 + MR_L1 * AMAT_L2 + MR_TLB * MP_TLB
AMAT_L2 = HT_L2 + MR_L2 * AMAT_RAM
AMAT_RAM = HT_RAM + MR_RAM * AMAT_DISK

HT_L2 = 10 cycles + (1 cycle/64 bits) * 32 bytes * 8 bits/byte = 14 cycles
HT_RAM = 100ns + (25ns/8 bytes) * 128 bytes = 500ns = 100 cycles
HT_DISK = 50ms + (20ns/byte) * 16K bytes = 50.32768ms = 10,065,536 cycles

AMAT_L2 = 3036.66 cycles
AMAT_inst = 31.37 cycles
AMAT_data = 152.88 cycles

(Notation: HT = hit time, MR = miss rate, MP = miss penalty, AMAT = Average Memory Access Time.)
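These numbers can be reproduced with a short script (a sketch, not part of the exam; the variable names are mine, and a 200MHz clock gives 5ns per cycle):

```python
# Reproduce the AMAT arithmetic from the solution above.
NS_PER_CYCLE = 5.0                          # 200 MHz processor

ht_L1 = 1                                   # cycles
ht_L2 = 10 + (32 * 8) // 64                 # refill a 32-byte L1 block at 1 cycle/64 bits -> 14
ht_ram = (100 + (25 / 8) * 128) / NS_PER_CYCLE      # refill a 128-byte L2 block -> 100 cycles
ht_disk = (50e6 + 20 * 16 * 1024) / NS_PER_CYCLE    # refill a 16KB page -> 10,065,536 cycles

amat_ram = ht_ram + 0.01 * ht_disk          # 1% DRAM miss rate
amat_L2 = ht_L2 + 0.03 * amat_ram           # 3% L2 miss rate -> ~3036.66 cycles
amat_inst = ht_L1 + 0.01 * amat_L2          # 1% instruction miss rate -> ~31.37 cycles
amat_data = ht_L1 + 0.05 * amat_L2 + 0.001 * 50   # 5% data misses + TLB term -> ~152.88 cycles
print(round(amat_L2, 2), round(amat_inst, 2), round(amat_data, 2))
```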

Figure 1: The Multicycle Data Path

Problem 2a: Let the ALU support multiplication. You cannot change or duplicate the memory component, or change or duplicate the ALU component, but you are allowed to add muxes, registers, equality comparators, and random logic. Estimate the minimum number of cycles (on average) that you can hope to achieve in the inner "while" loop. Justify your answer by discussing the operations that must be performed on each iteration and showing a timing diagram for three iterations of the inner loop. Don't try to change the datapath yet; you will do that in (2b).

Problem 2 has many possible solutions. We will discuss some of them here. From 2b onwards, we will illustrate one possible solution.

Answer (#1): Without migrating computation out of the inner loop, there are 7 cycles of operations, since each address takes 2 cycles to compute (not forgetting to add the base, e.g. poly1!), and there is a multiply, an add, and one decrement. Note that we moved the incrementing of indexdeg2 before the memory operations, since this takes care of the "+1" part of the address computation. We must be careful to overlap memory operations, or we will have to take more cycles. Here is one iteration of the loop (each address computation takes two cycles, and the loads overlap them):

Cycle 1: indexdeg1 << 2
Cycle 2: poly1 + (indexdeg1 << 2); indexdeg2++
Cycle 3: load poly1[]; indexdeg2 << 2
Cycle 4: poly2 + (indexdeg2 << 2)
Cycle 5: load poly2[]; indexdeg1--
Cycle 6: multiply
Cycle 7: add

Note that we must check the while condition in cycle 1, 2, or 3, since indexdeg2 has changed by cycle 3.

[Figure 1 diagram: the multicycle datapath. The PC (PCWr, PCWrCond, PCSrc) feeds an IorD mux into the Ideal Memory (RAdr, WrAdr, Din, Dout, MemWr); Dout feeds the Instruction Reg (IRWr) and the Mem Data Reg. The Reg File (Ra, Rb, Rw; busA, busB, busW; RegWr) takes Rs/Rt through a mux and Rt/Rd through the RegDst mux, with MemtoReg selecting the write-back data. The A and B registers feed the ALU through the ALUSelA mux and the four-input ALUSelB mux (B, 4, the sign-extended Imm16 via Extend/ExtOp, and Imm << 2), under ALU Control (ALUOp); the ALU produces Zero/NEG and writes ALU Out. Buses are 32 bits wide except the 5-bit register specifiers.]

Answer (#2): Let's move the "+1" functionality out of the inner loop. We can hope for a 6-cycle inner loop. Here is one iteration of the loop, assuming that we have incremented poly1 and poly2 by one element (i.e. by 4) outside the inner loop:

Cycle 1: indexdeg1 << 2
Cycle 2: poly1 + (indexdeg1 << 2); indexdeg2 << 2
Cycle 3: load poly1[]; poly2 + (indexdeg2 << 2); indexdeg1--
Cycle 4: load poly2[]; indexdeg2++
Cycle 5: multiply
Cycle 6: add

Answer (#3): Alternatively, we could increment indexdeg1 outside the loop and leave poly1 and poly2 alone. Then, by creative incrementing and decrementing, we can avoid the extra +1 computation. This means that we need to check the boundary condition "indexdeg2 ≥ degree2" before cycle 3 (when indexdeg2 changes), and check the boundary condition "indexdeg1 ≤ 1" before cycle 5, or check "indexdeg1 ≤ 0" on cycle 5 or 6:

Cycle 1: indexdeg1 << 2; indexdeg2++
Cycle 2: poly1 + (indexdeg1 << 2); indexdeg2 << 2
Cycle 3: load poly1[]; poly2 + (indexdeg2 << 2); indexdeg1--
Cycle 4: load poly2[]
Cycle 5: multiply
Cycle 6: add

Answer (#4): Finally, if we are really clever, we could reduce this down to 4 cycles by using pointers that have the base values already added in. In the following, assume that "point1 = poly1 + indexdeg1<<2" and that "point2 = poly2 + indexdeg2<<2". Also assume that indexdeg1 is ahead by 1 (i.e. point1 is ahead by 4) at the beginning of the loop. Then, in cycle 1 we check the boundary condition "point2 ≥ poly2 + degree2<<4" and in cycle 3 or 4 we check the boundary condition "point1 ≤ poly1":

Cycle 1: point2 += 4; Cond2 check
Cycle 2: point1 -= 4; load point1[], point2[]
Cycle 3: multiply; Cond1 check
Cycle 4: add

Just to make clear what we have done, here is the modified code reflecting the last option:

point1 = poly1 + indexdeg1<<4 + 4;
point2 = poly2 + indexdeg2<<4;

/* (indexdeg1 + indexdeg2) == resultdeg throughout loop */
accum = 0;
do {
    cond2 = (point2 == poly2);                /* Save this on cycle 1 */
    point2 = point2 + 4;
    accum = accum + (*point1) * (*point2);    /* the "*" means pointer dereference */
    point1 = point1 - 4;
    cond1 = (point1 == poly1 + degree2<<4);
} until (cond1 or cond2);

Registers needed for this code: point1, point2, poly2, (poly1 + degree2<<4), accum, cond2. Note also that everything is multiplied by 4 (the element size) to make this work.
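For reference, the pointer walk above can be sketched in Python (illustrative only; array indexing replaces the byte-offset pointers, so the shift-by-element-size scaling disappears):

```python
# One coefficient of a polynomial product: walk one index down while the
# other walks up, accumulating products, exactly as the inner loop does.
def convolve_coeff(poly1, poly2, resultdeg):
    """Sum of poly1[i1] * poly2[i2] over all pairs with i1 + i2 == resultdeg."""
    accum = 0
    i1 = min(resultdeg, len(poly1) - 1)   # start indexdeg1 as high as possible
    i2 = resultdeg - i1
    while i1 >= 0 and i2 < len(poly2):
        accum += poly1[i1] * poly2[i2]    # multiply, then add into accumulator
        i1 -= 1                            # point1 walks down...
        i2 += 1                            # ...while point2 walks up
    return accum

# (1 + 2x) * (3 + 4x) = 3 + 10x + 8x^2
p1, p2 = [1, 2], [3, 4]
print([convolve_coeff(p1, p2, d) for d in range(3)])  # [3, 10, 8]
```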

Figure 2a: Additions to the ALU path for polynomial multiply. Note that the ALU input muxes in this figure only include new inputs. Assume that the other inputs are still there.

Figure 2b: Additions to the RegFile control to support polynomial multiply. The primary change is the addition of an extra address MUX to give us access to register RD. Also, added enable to A register.

Figure 2c: Additions to the Memory component to support polynomial multiply. Note that we have two new muxes to support writes of the accumulated value into Poly3. Also, we have added a temporary memory register.

[Figures 2a-2c diagrams. 2a (ALU path): "=", "=", and "=0" comparators on the ALU feeding a Cond latch, with additional ALUSELA inputs (MemDataReg1 << 2, MemDataReg2 << 2, ALUout) and additional ALUSELB inputs (ALUout, 4, 0). 2b (RegFile): an InitPoly-controlled mux selecting RS or RD onto read address RA, RT on RB, and an enable on the A register. 2c (Memory): the Ideal Memory with PolyWrite muxes on WrAddr (ALUout / Poly3) and DIn (accum / B register); DOut feeds two enabled temporary registers MemDataReg1 and MemDataReg2 going to the ALU and the RegWr mux. New enabled registers: indexdeg1, indexdeg2, degree1, degree2, resultdeg, poly3, accum.]

Table 1: Symbolic Definitions for Microcode

Field Name        Values for Field   Function of Field with Specific Value
ALU               Add                ALU adds
                  Subt.              ALU subtracts
                  Func code          ALU does function code
                  Or                 ALU does logical OR
SRC1              PC                 1st ALU input = PC
                  rs                 1st ALU input = Reg[rs]
SRC2              4                  2nd ALU input = 4
                  Extend             2nd ALU input = sign ext. IR[15-0]
                  Extend0            2nd ALU input = zero ext. IR[15-0]
                  Extshft            2nd ALU input = sign ext., shift left IR[15-0]
                  rt                 2nd ALU input = Reg[rt]
destination       rd ALU             Reg[rd] = ALUout
                  rt ALU             Reg[rt] = ALUout
                  rt Mem             Reg[rt] = Mem
Memory            Read PC            Read memory using PC
                  Read ALU           Read memory using ALU output
                  Write ALU          Write memory using ALU output
Memory register   IR                 IR = Mem
PC write          ALU                PC = ALU
                  ALUoutCond         If ALU Zero then PC = ALUout
Sequencing        Seq                Go to next sequential microinstruction
                  Fetch              Go to the first microinstruction
                  Dispatch           Dispatch using ROM

Table 2: Microcode for Simple Instructions

Label   ALU   SRC1  SRC2      Dest.    Memory      Mem. Reg.  PC Write    Sequencing
Fetch:  Add   PC    4                  Read PC     IR         ALU         Seq
        Add   PC    Extshft                                               Dispatch
Rtype:  Func  rs    rt                                                    Seq
                              rd ALU                                      Fetch
Ori:    Or    rs    Extend0                                               Seq
                              rt ALU                                      Fetch
Lw:     Add   rs    Extend                                                Seq
                                       Read ALU                           Seq
                              rt Mem                                      Fetch
Sw:     Add   rs    Extend                                                Seq
                                       Write ALU                          Fetch
Beq:    Subt  rs    rt                                        ALUoutCond  Fetch

Problem 2e: Finally, write microcode for the polynomial multiply instruction. (You are now an official CISC system designer!).

Note that we ended up calling this part of the problem “extra credit” and gave a few extra points to people who had started down the path.

Label      ALU  SRC1         SRC2         ALULatch    Dest.      Memory     Mem. Reg.  Sequencing

/* Fetch address of destination polynomial and put it in register poly3 */
polymult:                                                        InitPoly3             Seq

/* Fetch poly1[0] and poly2[0] */
           Add  poly1        0                                                         Seq
           Add  0            poly2                               Read ALU   MemData1   Seq
                                                                 Read ALU   MemData2   Seq

/* Compute poly3[0] and initialize degree1 & degree2 with 4 * degree values */
           Add  MemData1     MemData2                                                  Seq
           Add  MemData1<<2  0                        degree1    Wr Poly3              Seq
           Add  0            MemData2<<2              degree2                          Seq

/* Start resultdeg at the end (i.e. degree3); like degree1 and degree2, this is actually *4 */
           Add  degree2      degree1                  resultdeg                        Seq

/* Initial value of poly3 is at the end of the result polynomial (since we work backwards) */
           Add  poly3        resultdeg                poly3                            Seq
           Add  poly3        4                        poly3                            Seq

/* Compute indexdeg1 and indexdeg2; the next 4 lines combine a max function with the following subtract */
forloop:   Sub  resultdeg    degree1                  indexdeg2                        bneg forloop1
           Add  0            degree1                  indexdeg1                        jump forloop2
forloop1:  Add  resultdeg    0                        indexdeg1                        Seq
           Add  0            0                        indexdeg2                        Seq

/* Initialize accum variable for the inner loop */
forloop2:  Add  0            0                        accum                            Seq

/* Let indexdeg1 be ahead by one iteration (to avoid the extra +1 in []; see discussion in 2b) */
           Add  4            indexdeg1                indexdeg1                        Seq

/* Next 6 microinstructions are the inner loop */
whileloop: Add  poly1        indexdeg1    LatchCond1                                   Seq
           Add  indexdeg2    4                        indexdeg2  Read ALU   MemData1   Seq
           Add  indexdeg2    poly2                                                     Seq
           Sub  indexdeg1    4                        indexdeg1  Read ALU   MemData2   Seq
           Mul  MemData1     MemData2                                                  Seq
           Add  accum        ALUout                   accum                            bwhile whileloop

/* End of the while loop: write back the result to poly3, update resultdeg and the poly3 pointer */
endfor:    Sub  resultdeg    4                        resultdeg  Wr Poly3              Seq
           Sub  poly3        4                        poly3                            bfor forloop

/* Last dummy microinstruction is just for fetching */
                                                                                       Fetch

Problem 3: Speeding up the Loops

For the following problem, assume an in-order, MIPS-style pipelined architecture with up to 4 cycles in the EX stage, but full forwarding for operations that take less than 4 cycles. Assume the following number of execution cycles are required:

a. Floating-point multiply: 4 cycles b. Floating-point addition: 2 cycles c. Integer operations: 1 cycle

Assume as well that there is one branch delay slot, that there is no delay between integer operations and dependent branch instructions, and that the load-use latency (or number of load delay slots) is 2 cycles.

One possible pipeline that might behave this way could appear as follows:

IF   ID/BR   EX1   EX2   EX3   EX4   MEM1   MEM2   WR

Now, given this pipeline, the following code computes a dot-product. Assume that r1 and r2 contain the addresses of arrays of floating-point numbers, and that r3 contains the length of the arrays (in elements). Assume that f4 is initialized to zero. Then, the dot product can be computed as follows:

dotprod: lw   $f5, 0($r1)            ; load element from first array
         lw   $f6, 0($r2)            ; load element from second array
         muls $f7, $f5, $f6          ; multiply elements
         adds $f4, $f4, $f7          ; add element to accumulator in $f4
         addi $r1, $r1, 4            ; advance pointers
         addi $r2, $r2, 4
         addi $r3, $r3, -1           ; decrement element count
         bne  $r3, $zero, dotprod    ; loop
         nop                         ; do nothing (branch delay slot)

Problem 3a: How many cycles on average does each iteration take, without rearranging the code?

Ans: 14 cycles: 9 instructions + 2 stall cycles before “muls” + 3 stall cycles before “adds”.

Problem 3b: Rearrange the code so that it gets as few cycles per iteration as possible (don't unroll the loop). Show the scheduled code. How many cycles per iteration does it get now?

dotprod: lw   $f5, 0($r1)            ; load element from first array
         lw   $f6, 0($r2)            ; load element from second array
         addi $r1, $r1, 4            ; advance pointers
         addi $r2, $r2, 4
         muls $f7, $f5, $f6          ; multiply elements
         addi $r3, $r3, -1           ; decrement element count
         bne  $r3, $zero, dotprod    ; loop
         adds $f4, $f4, $f7          ; add to accumulator (branch delay slot)

Now this gets 9 cycles/iteration: 8 instructions + one stall cycle before “adds”.
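The stall counts in 3a and 3b can be checked mechanically. This is a sketch, not part of the exam: base-register dependences of the loads are ignored, and the latency table encodes the stated rules (2 load delay slots, 3 bubbles after the 4-cycle multiply, 1 after the 2-cycle add, full forwarding otherwise):

```python
# Count cycles per iteration = instructions + stall cycles for an in-order pipeline.
LATENCY = {"lw": 2, "muls": 3, "adds": 1, "addi": 0, "bne": 0, "nop": 0}

def cycles_per_iteration(code):
    """code: list of (op, dest, sources); returns instructions + stalls."""
    ready = {}              # register -> earliest issue slot that may consume it
    slot, stalls = 0, 0
    for op, dest, srcs in code:
        need = max((ready.get(s, 0) for s in srcs), default=0)
        stalls += max(0, need - slot)       # bubbles inserted before this op
        slot = max(slot, need) + 1
        if dest:
            ready[dest] = slot + LATENCY[op]
    return len(code) + stalls

original = [("lw", "$f5", []), ("lw", "$f6", []),
            ("muls", "$f7", ["$f5", "$f6"]), ("adds", "$f4", ["$f4", "$f7"]),
            ("addi", "$r1", ["$r1"]), ("addi", "$r2", ["$r2"]),
            ("addi", "$r3", ["$r3"]), ("bne", None, ["$r3"]), ("nop", None, [])]
scheduled = [("lw", "$f5", []), ("lw", "$f6", []),
             ("addi", "$r1", ["$r1"]), ("addi", "$r2", ["$r2"]),
             ("muls", "$f7", ["$f5", "$f6"]), ("addi", "$r3", ["$r3"]),
             ("bne", None, ["$r3"]), ("adds", "$f4", ["$f4", "$f7"])]
print(cycles_per_iteration(original), cycles_per_iteration(scheduled))  # 14 9
```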

Problem 4: Hazards and Advanced Pipelining

Problem 4a: There are three different types of data hazards, RAW, WAR, and WAW. Define them, giving a short code sequence to illustrate each, and describe how a 5-stage pipeline removes them:

a) RAW: Read After Write. The first instruction has not yet written its result to the register file, but the next instruction tries to read that register.
   add  $1, $2, $3
   addi $4, $1, 50
Fix: Forward data from the pipeline as needed.

b) WAR: Write After Read. A later instruction writes a register before an earlier instruction has read it.
   add $1, $2, $3   # assume a really long fetch or something
   add $2, $5, $6   # assume this writes really fast
Fix: Make each stage equal in length, and read registers early and write them late in the pipeline.

c) WAW: Write After Write. A later instruction writes a register before the earlier instruction has written it.
   add $1, $2, $3
   add $1, $4, $5
Fix: Only modify the register file in the WB stage.
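The three definitions can be captured in a small classifier (a sketch; the (dest, sources) encoding of an instruction is my own):

```python
# Classify the hazards between two instructions, each given as (dest, sources).
def hazards(first, second):
    d1, s1 = first
    d2, s2 = second
    found = set()
    if d1 and d1 in s2:
        found.add("RAW")   # second reads what first writes
    if d2 and d2 in s1:
        found.add("WAR")   # second writes what first reads
    if d1 and d1 == d2:
        found.add("WAW")   # both write the same register
    return found

print(hazards(("$1", ["$2", "$3"]), ("$4", ["$1", "$1"])))  # {'RAW'}
print(hazards(("$1", ["$2", "$3"]), ("$2", ["$5", "$6"])))  # {'WAR'}
print(hazards(("$1", ["$2", "$3"]), ("$1", ["$4", "$5"])))  # {'WAW'}
```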

Problem 4b: What are control hazards? Name and explain two different techniques for getting rid of them. Control hazards arise when the pipeline cannot determine which instruction to fetch next until a branch resolves. (1) Waiting always fixes the problem, i.e. stall and bubble the pipeline. (2) Branch prediction is a more sophisticated solution. (3) A similar solution is to execute down both paths and discard the wrong one. (4) Changing the software model is also a valid solution, e.g. the branch delay slot.

Problem 4c: Come up with two reasons why designers don't make 100-stage pipelines. Are there circumstances in which such a pipeline might make sense? A 100-stage pipeline incurs far too many data and control hazards, requiring far too much hardware to resolve them. Having so many stages also means many pipeline registers, which means more area, a bigger clock network, more power dissipation, etc. Such a pipeline may be feasible when there are very few dependencies and very little decision making, as in stream-based processing such as multimedia.

Problem 4d: What are precise exceptions and why are they important? An exception is precise when no instruction after the one that caused it has modified the state of the machine, and every instruction before it has completed (i.e. written back to the register file, etc.). Precise exceptions are important because they make returning from the exception easier to manage, and make it easier to determine which instruction caused the exception and what the problem is.

Problem 4e: Explain how to achieve precise exceptions in a standard 5-stage pipeline. Be explicit. A precise exception is achieved by keeping track of which instruction caused the exception and waiting until the writeback stage to raise it. This ensures that the faulting instruction and all instructions after it do not modify the state of the system, while all instructions before it complete. To handle precise exceptions, extra hardware such as the Exception PC (EPC) and Cause registers is needed.

Figure 2: A basic Tomasulo architecture

Problem 4e: Figure 2 shows the basic components of a Tomasulo architecture. This architecture replaces the normal 5-stages of execution with 4 stages: Fetch, Issue, Execute, and Writeback. Explain what happens to an instruction in each of them (be as complete as you can):

a) Fetch: Fetch instructions from memory in program order and place them in the Instruction Queue. b) Issue: Take the next instruction from the Instruction Queue and send it to the appropriate reservation station, replacing its register operands with values, or with tags if a value is the pending result of an instruction in some reservation station. c) Execute: Dispatch instructions queued in reservation stations to execution units when their register (tag) values are available, and mark those reservation stations as available. d) Writeback: Broadcast the result on the CDB (Common Data Bus). Any instructions waiting for the result grab the value, and the register file is updated as well.
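The tag-or-value substitution in the Issue step can be sketched as follows (a toy illustration; `issue`, `reg_tags`, and the station names are invented):

```python
# Issue-stage renaming: each source operand becomes a concrete value if the
# register is ready, or the tag of the reservation station producing it.
def issue(srcs, reg_values, reg_tags):
    """Return operand list: ('val', x) if ready, ('tag', rs) if pending."""
    return [("tag", reg_tags[r]) if r in reg_tags else ("val", reg_values[r])
            for r in srcs]

vals = {"$r2": 10, "$r3": 32}
tags = {"$r1": "RS7"}           # $r1 is still being computed by station RS7
print(issue(["$r2", "$r1"], vals, tags))  # [('val', 10), ('tag', 'RS7')]
```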

[Figure 2 labels: Instruction Queue; Load buffers (x6) and Store buffers (x3); Integer units (Int x3); Floating-point units (Float x2); FP Registers; Common Data Bus (CDB); From Mem; To Mem.]

Problem 4i: Finally, the Tomasulo algorithm has one interesting "bug" in it. Consider the situation where one instruction uses a value from another. Suppose the dependent instruction is issued on the same cycle that the instruction it depends on is in writeback.

add $r1, $r2, $r3   <- the result is being broadcast
...
add $r4, $r1, $r1   <- this one is being issued

What is the problem? Can you fix it easily? Problem: As the second instruction is issued with tags from the register file, the first instruction is finishing up and removing its tag (because $r1 is no longer busy). The second instruction will then be waiting for a tag that no longer exists at the end of the cycle, and it deadlocks.

Solution: As in the MIPS register file, write in the first half of the cycle and read in the second half. That way the second instruction gets the actual value of $r1 instead of a stale tag.
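The fix can be illustrated with a toy register file (all names here are invented; the point is that the writeback lands before the same-cycle read):

```python
# Writeback broadcasts in the first half of the cycle; issue reads in the
# second half, so an issuing instruction sees the value, not a stale tag.
class RegFile:
    def __init__(self):
        self.value = {}   # register -> value
        self.tag = {}     # register -> producing reservation station, if pending

    def writeback(self, reg, tag, val):       # first half of the cycle
        if self.tag.get(reg) == tag:
            del self.tag[reg]                 # clear the tag being satisfied
        self.value[reg] = val

    def read(self, reg):                      # second half of the same cycle
        return ("tag", self.tag[reg]) if reg in self.tag else ("val", self.value[reg])

rf = RegFile()
rf.tag["$r1"] = "RS3"           # $r1 pending from reservation station RS3
rf.writeback("$r1", "RS3", 7)   # first add completes this cycle...
print(rf.read("$r1"))           # ('val', 7): issue sees the value, no deadlock
```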