
University of California, Berkeley
College of Engineering
Computer Science Division (EECS)

Spring 1999 John Kubiatowicz

Midterm II

April 21, 1999
CS152 Computer Architecture and Engineering

Your Name: Solution

SID Number:

Discussion Section:

Problem    Possible    Score
1          20
2          30
3          25
4          25
Total      100

[ This page left for π ]

Problem 1c: Suppose you have a 32-bit processor, with a virtual-memory page-size of 16K. The data cache is 32K in size with 32-byte cache blocks. Finally, your TLB has 4 entries. Assume that you wish to do TLB lookups in parallel with cache lookups.

Draw a block diagram of the data cache and TLB organization, showing a virtual address as input and both a physical address and data as output. Include cache-hit and TLB-hit output signals. Include as much information about the internals of the TLB and cache organization as possible, among other things all of the comparators and muxes in the system. You can indicate RAM with a simple block, but be sure to label address widths and data widths. Use abstraction in your diagram so that we can understand it, and label the function of each block and the width of any buses.
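A quick sketch of the address arithmetic behind this question (an editorial aid, not part of the original exam; the 2-way organization used below is one common way to make the parallel lookup work, not necessarily the intended answer):

    #include <stdio.h>

    /* Integer log2 for power-of-two sizes. */
    static int ilog2(unsigned x) { int n = 0; while (x >>= 1) n++; return n; }

    int main(void) {
        const int addr_bits  = 32;        /* 32-bit virtual addresses */
        const int page_size  = 16 * 1024; /* 16K pages                */
        const int cache_size = 32 * 1024; /* 32K data cache           */
        const int block_size = 32;        /* 32-byte blocks           */

        int page_offset = ilog2(page_size);              /* 14 bits, untranslated        */
        int vpn_bits    = addr_bits - page_offset;       /* 18 bits into the 4-entry TLB */
        int blk_offset  = ilog2(block_size);             /*  5 bits                      */
        int dm_index    = ilog2(cache_size / block_size);     /* 10 bits if direct-mapped */
        int w2_index    = ilog2(cache_size / 2 / block_size); /*  9 bits if 2-way         */

        printf("page offset %d, VPN %d, block offset %d\n",
               page_offset, vpn_bits, blk_offset);
        /* Direct-mapped: index+offset = 15 bits > 14-bit page offset, so the
         * cache index would need a translated bit.  Two-way set-associative:
         * index+offset = 14 bits, which fits entirely inside the page offset,
         * so the cache and TLB lookups can proceed in parallel.             */
        printf("direct-mapped index+offset = %d, 2-way index+offset = %d\n",
               dm_index + blk_offset, w2_index + blk_offset);
        return 0;
    }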

Now, assume the following instruction mix: Loads: 20%, Stores: 15%, Integer: 29%, Floating-Point: 16%, Branches: 20%.

Assume that you have a memory hierarchy consisting of two levels of cache, one level of DRAM, and a DISK. The following parameters are appropriate. Assume a 200 MHz processor:

Component            Hit Time                       Miss Rate                    Block Size
First-Level Cache    1 cycle                        5% data / 1% instructions    32 bytes
Second-Level Cache   10 cycles + 1 cycle/64 bits    3%                           128 bytes
DRAM                 100 ns + 25 ns/8 bytes         1%                           16K bytes
DISK                 50 ms + 20 ns/byte             0%                           16K bytes

In addition, assume that there is a TLB which misses 0.1% of the time on data (doesn’t miss on instructions) and which has a fill penalty of 50 cycles.

Problem 1d: What is the average memory access time for Instructions? For Data?
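One way to organize the arithmetic for this question (an editorial sketch, not the official solution: it assumes each miss is serviced by the next level down, that penalties include the block-transfer times from the table, and a 5 ns cycle at 200 MHz; it prints results both with and without the disk term, since whether page faults belong in the average depends on the intended model):

    #include <stdio.h>

    int main(void) {
        const double ns_per_cycle = 5.0;               /* 200 MHz clock */

        /* First-level cache: 1-cycle hit, 32-byte blocks.                */
        const double l1_hit    = 1.0;
        const double l1_miss_i = 0.01;                 /* 1% instructions */
        const double l1_miss_d = 0.05;                 /* 5% data         */

        /* Second level: 10 cycles + 1 cycle per 64 bits; a 32-byte L1
         * block is 4 x 64 bits, so an L1 miss that hits in L2 costs 14.  */
        const double l2_time = 10.0 + 32.0 / 8.0;
        const double l2_miss = 0.03;

        /* DRAM: 100 ns + 25 ns per 8 bytes for a 128-byte L2 block.      */
        const double dram_time = (100.0 + (128.0 / 8.0) * 25.0) / ns_per_cycle;
        const double dram_miss = 0.01;

        /* Disk: 50 ms + 20 ns/byte for a 16 KB page.                     */
        const double disk_time = (50e6 + 20.0 * 16 * 1024) / ns_per_cycle;

        /* TLB: 0.1% misses on data only, 50-cycle fill.                  */
        const double tlb_term = 0.001 * 50.0;

        double penalty_no_disk = l2_time + l2_miss * dram_time;
        double penalty_disk    = l2_time + l2_miss * (dram_time + dram_miss * disk_time);

        printf("L1 miss penalty: %.2f cycles (no disk), %.2f cycles (with disk)\n",
               penalty_no_disk, penalty_disk);
        printf("AMAT instructions: %.2f / %.2f cycles\n",
               l1_hit + l1_miss_i * penalty_no_disk,
               l1_hit + l1_miss_i * penalty_disk);
        printf("AMAT data:         %.2f / %.2f cycles\n",
               l1_hit + l1_miss_d * penalty_no_disk + tlb_term,
               l1_hit + l1_miss_d * penalty_disk + tlb_term);
        return 0;
    }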

Figure 1: The Multicycle Data Path

Problem 2a: Assume that the ALU supports multiplication. You cannot change or duplicate the memory component, or change or duplicate the ALU component, but you are allowed to add muxes, registers, equality comparators, and random logic. Estimate the minimum number of cycles (on average) that you can hope to achieve in the inner "while" loop. Justify your answer by discussing the operations that must be performed on each iteration and showing a timing diagram for three iterations of the inner loop. Don't try to change the datapath yet; you will do that in (2b).

(Hints: You can recognize the last iteration of the "while" loop by checking for "=0" and "=degree2", since the loop will always be executed at least once. Also, make sure you understand where each of the computations comes from; there may be ways of moving them out of the inner loop!)
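The pseudocode that this problem refers to is not reproduced in this excerpt. Purely as an illustration of what a schoolbook polynomial-multiply inner loop has to do each iteration (two loads, a multiply, an accumulate, two index updates, and the loop-exit tests), here is a hypothetical C rendering; the names indexdeg1, indexdeg2, degree1, and degree2 follow the hint, but the code itself is an assumption, not the exam's listing:

    /* Hypothetical sketch: coefficient k of the product is the sum over all
     * valid pairs (i, k-i) of poly1[i] * poly2[k-i].  indexdeg1 walks down
     * poly1 while indexdeg2 walks up poly2, so their sum stays equal to k;
     * the inner loop exits on the "=0" / "=degree2" tests from the hint and
     * always runs at least once.                                            */
    void polymult(const int *poly1, int degree1,
                  const int *poly2, int degree2,
                  int *result)   /* room for degree1 + degree2 + 1 coefficients */
    {
        for (int k = 0; k <= degree1 + degree2; k++) {
            /* The hint notes these starting indices need one ALU op plus muxes. */
            int indexdeg1 = (k <= degree1) ? k : degree1;
            int indexdeg2 = k - indexdeg1;
            int acc = 0;
            while (1) {
                acc += poly1[indexdeg1] * poly2[indexdeg2]; /* 2 loads, mul, add */
                if (indexdeg1 == 0 || indexdeg2 == degree2) /* loop-exit tests   */
                    break;
                indexdeg1--;                                /* index updates     */
                indexdeg2++;
            }
            result[k] = acc;                                /* one store per k   */
        }
    }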

[Figure 1 shows the multicycle datapath: Ideal Memory (RAdr, WrAdr, Din, Dout, MemWr, IorD mux), PC (PCWr, PCWrCond, PCSrc), Instruction Reg (IRWr), Reg File (Ra, Rb, Rw, busA, busB, busW, RegWr, RegDst mux selecting Rt/Rd, Rs/Rt read ports), A and B registers, Extend unit (ExtOp) and << 2 on the 16-bit immediate, ALU with ALU Control (ALUOp, ALUSelA, ALUSelB muxes, Zero/NEG outputs), ALU Out register, Mem Data Reg, and MemtoReg mux; all buses are 32 bits wide.]

Problem 2b: Assume that our new instruction is specified as follows:

    polymult $r3, $r1, $r2

where this is an R-type instruction. Registers r1 and r2 hold pointers to the source polynomials, and r3 holds a pointer to memory for the destination polynomial. Assume that there is enough memory at the location specified by r3 to hold any result, and that the registers should not be changed during execution.

Change the datapath to support polynomial multiply at the same rate in the inner loop as specified in (2a). As before, you cannot change or duplicate the memory component, or change or duplicate the ALU component, but you are allowed to add muxes, registers, equality comparators, and random logic. Be explicit, and try to minimize the hardware and the total number of cycles for the complete operation as much as possible. Show all new control points. (Note: the computation of the initial values of indexdeg1 and indexdeg2 can be done with one ALU operation and some muxes!)

Assume that we are going to microcode this instruction. For your reference, Tables 1 and 2 list the symbolic names that we have given to fields of the microinstructions, as well as the microcoded versions of some of the simple instructions.

Problem 2c: First, how does the sequencer box have to change in order to support this instruction? Draw a block diagram showing the MicroPC, the logic around it, and the ROM.

Problem 2d: Next, make changes to Table 1 to reflect your new hardware. Make sure that you are clear about what you are adding/changing.

Problem 2e: Finally, write microcode for the polynomial multiply instruction. (You are now an official CISC system designer!).

Problem 3c: Unroll the given loop once, and schedule it to completely avoid stalls. Show your code. How many cycles per iteration does it get now?
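The loop from Problem 3 is not included in this excerpt. As a generic illustration of the transformation only (the array, names, and body below are invented, not the exam's code), unrolling once in C produces two independent copies of the body per trip, which a scheduler can then interleave to fill load and multiply delay slots:

    /* Illustrative only: a generic loop body a[i] = a[i] * s + b[i],
     * unrolled once.  The two copies use independent loads and multiplies,
     * which gives the scheduler work to place in what would otherwise be
     * stall (delay) slots.                                                */
    void scale_add(int *a, const int *b, int s, int n) {
        int i;
        for (i = 0; i + 1 < n; i += 2) {
            int t0 = a[i]     * s;    /* copy 0: load, mul             */
            int t1 = a[i + 1] * s;    /* copy 1: independent load, mul */
            a[i]     = t0 + b[i];     /* copy 0: add, store            */
            a[i + 1] = t1 + b[i + 1]; /* copy 1: add, store            */
        }
        if (i < n)                    /* odd trip count: finish the leftover element */
            a[i] = a[i] * s + b[i];
    }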

Problem 3d: If you were to unroll the loop 8 times, how many cycles per iteration would this achieve? (Hint: you do not need to actually perform the unrolling, but justify your answer.)

Problem 3e: Now, assume that you want to design a new processor that is more deeply pipelined, i.e., one that has larger latencies for all of the operations. Maximize the latencies of instructions that the loop can tolerate by rewriting the loop with software pipelining. Do not unroll the loop (i.e., there will be only 8 instructions). Only show code for the loop; you can ignore any startup or cleanup instructions outside the loop. (Hint: this code will overlap 3 different iterations of the loop.)
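Again as a generic, hypothetical illustration (not the exam's loop): a software-pipelined form keeps three iterations in flight per trip, storing the result of iteration i, multiplying the value loaded for iteration i+1, and loading the value for iteration i+2, which is what lets each operation tolerate a long latency:

    /* Illustrative only: software-pipelined form of a[i] = a[i] * s.
     * Each trip of the steady-state loop touches three different
     * iterations; prologue and epilogue code fills and drains the
     * pipeline (the exam lets you ignore those).                     */
    void scale(int *a, int s, int n) {
        if (n < 3) {                       /* too short to pipeline: do it plainly */
            for (int i = 0; i < n; i++)
                a[i] *= s;
            return;
        }
        int loaded = a[1];                 /* prologue: load for iteration 1     */
        int result = a[0] * s;             /* prologue: multiply for iteration 0 */
        for (int i = 0; i + 2 < n; i++) {
            int next = a[i + 2];           /* load for iteration i+2     */
            a[i] = result;                 /* store for iteration i      */
            result = loaded * s;           /* multiply for iteration i+1 */
            loaded = next;
        }
        a[n - 2] = result;                 /* epilogue: drain the last two iterations */
        a[n - 1] = loaded * s;
    }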

Problem 3f: For the software-pipelined version of the loop, assuming that the loop runs without stalls, what is

  • the maximum execution latency for muls?
  • the maximum execution latency for adds?
  • the maximum load-use latency (delay slots) for lw?

Problem 3g: Assuming that most of the power in your original processor was consumed in the execute stages, is the new processor likely to consume more, the same, or less power than the original? Why?

Problem 4: Hazards and Advanced Pipelining

This problem brings together a number of different elements of pipelining.

Problem 4a: There are three different types of data hazards, RAW, WAR, and WAW. Define them, giving a short code sequence to illustrate each, and describe how a 5-stage pipeline removes them:

a) RAW:

b) WAR:

c) WAW:

Problem 4b: What are control hazards? Name and explain two different techniques for getting rid of them.

Problem 4c: What are precise exceptions and why are they important?

Problem 4d: Explain how to achieve precise exceptions in a standard 5-stage pipeline. Be explicit.

Problem 4f: Explain how the Tomasulo architecture handles the three different types of data hazards:

Problem 4g: Assume that you have a long chain of dependent instructions, such as the following:

    add $r1, $r2, $r
    add $r3, $r1, $r
    add $r7, $r3, $r

Also assume that the integer execution unit takes one cycle for adds. What CPI would you achieve for this sequence with the basic Tomasulo architecture, assuming that each of the stages from (4f) is non-overlapped and takes a complete cycle?

Problem 4h: Assume that associative matching on the CDB is a slow enough operation that it takes much of a cycle. How can you still get a throughput of one instruction per cycle for long dependent chains of operations such as given in (4g)? Only well-thought-out answers will get credit.

Problem 4i: Finally, the Tomasulo algorithm has one interesting "bug" in it. Consider the situation where one instruction uses a value from another one. Suppose the dependent instruction is issued in the same cycle that the instruction it depends on is in writeback:

    add $r1, $r2, $r3   ← The result is broadcast
    ...
    add $r4, $r1, $r1   ← This one is being issued

What is the problem? Can you fix it easily?