









Main points of this past exam are: Memory Hierarchy, Memory Read, Read References, Cache Holds, Initially Empty, Memory References, Compulsory, Conflict, Capacity, 32-Bit Processor
University of California, Berkeley
College of Engineering
Computer Science Division, EECS
Spring 1999, John Kubiatowicz
April 21, 1999
CS152 Computer Architecture and Engineering
Your Name: Solution
SID Number:
Discussion Section:
Problem 1c: Suppose you have a 32-bit processor, with a virtual-memory page-size of 16K. The data cache is 32K in size with 32-byte cache blocks. Finally, your TLB has 4 entries. Assume that you wish to do TLB lookups in parallel with cache lookups.
Draw a block diagram of the data cache and TLB organization, showing a virtual address as input and both a physical address and data as output. Include cache hit and TLB hit output signals. Include as much information about the internals of the TLB and cache organization as possible. Include, among other things, all of the comparators in the system and any muxes as well. You can indicate RAM with a simple block, but make sure to label address widths and data widths. Make sure to use abstraction in your diagram so that we can understand it. Label the function of the various blocks and the width of any buses.
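The field widths needed for such a diagram follow from the sizes given in Problem 1c. The sketch below is one way to derive them, not the official solution. It assumes a 32-bit physical address and that "parallel lookup" means the cache index plus block offset must fit inside the page offset; the function name `vipt_geometry` is hypothetical.

```python
def log2_exact(n):
    """Return log2(n) for a power of two, without floating point."""
    assert n > 0 and n & (n - 1) == 0, "expected a power of two"
    return n.bit_length() - 1


def vipt_geometry(addr_bits=32, page_bytes=16 * 1024,
                  cache_bytes=32 * 1024, block_bytes=32):
    """Derive field widths for a cache that is indexed in parallel with the TLB.

    Assumption: physical addresses are also addr_bits wide.
    """
    page_offset_bits = log2_exact(page_bytes)
    block_offset_bits = log2_exact(block_bytes)
    # Each way may cover at most one page, so that the index uses only
    # untranslated (page-offset) bits.
    ways = max(1, cache_bytes // page_bytes)
    sets = cache_bytes // (block_bytes * ways)
    index_bits = log2_exact(sets)
    return {
        "page_offset_bits": page_offset_bits,
        "vpn_bits": addr_bits - page_offset_bits,
        "block_offset_bits": block_offset_bits,
        "ways": ways,
        "index_bits": index_bits,
        "tag_bits": addr_bits - index_bits - block_offset_bits,
    }


geometry = vipt_geometry()
for name, value in geometry.items():
    print(f"{name}: {value}")
```

Under these assumptions the cache needs one tag comparator per way, and a fully associative 4-entry TLB (also an assumption) would need four comparators on the virtual page number.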
Now, assume the following instruction mix: Loads: 20%, Stores: 15%, Integer: 29%, Floating-Point: 16%, Branches: 20%.
Assume that you have a memory hierarchy consisting of two levels of cache, one level of DRAM, and a disk. The following parameters apply. Assume a 200 MHz processor:
Component            Hit Time                       Miss Rate                   Block Size
First-Level Cache    1 cycle                        5% Data, 1% Instructions    32 bytes
Second-Level Cache   10 cycles + 1 cycle/64 bits    3%                          128 bytes
DRAM                 100 ns + 25 ns/8 bytes         1%                          16K bytes
DISK                 50 ms + 20 ns/byte             0%                          16K bytes
In addition, assume that there is a TLB which misses 0.1% of the time on data (doesn’t miss on instructions) and which has a fill penalty of 50 cycles.
Problem 1d: What is the average memory access time for Instructions? For Data?
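The average memory access time can be set up as a nested expression, one level of the hierarchy at a time. The sketch below shows one possible reading of the parameter table and is not the official solution. It assumes that the miss rates are local to each level, that the third and fourth table rows describe DRAM and disk, that each level transfers one block of the level above it (32 bytes into the first-level cache, 128 bytes into the second-level cache, 16K bytes into DRAM), and that the TLB fill penalty is simply added to the data access time.

```python
CYCLE_NS = 5.0  # one cycle of a 200 MHz processor


def to_cycles(ns):
    """Convert nanoseconds to processor cycles."""
    return ns / CYCLE_NS


# Hit times in cycles (transfer sizes are assumptions, see the text above).
L1_HIT = 1
L2_HIT = 10 + (32 * 8) // 64                 # 32-byte block, 64 bits per cycle
DRAM_HIT = to_cycles(100 + 25 * (128 // 8))  # 128-byte block, 8 bytes per 25 ns
DISK_HIT = to_cycles(50e6 + 20 * 16 * 1024)  # 16K-byte page, 20 ns per byte

L2_MISS = 0.03
DRAM_MISS = 0.01


def amat(l1_miss, tlb_miss=0.0, tlb_penalty=50):
    """Average memory access time in cycles for a given first-level miss rate."""
    dram_time = DRAM_HIT + DRAM_MISS * DISK_HIT
    l2_time = L2_HIT + L2_MISS * dram_time
    return tlb_miss * tlb_penalty + L1_HIT + l1_miss * l2_time


amat_instructions = amat(l1_miss=0.01)
amat_data = amat(l1_miss=0.05, tlb_miss=0.001)
print(f"instructions: {amat_instructions:.2f} cycles")
print(f"data:         {amat_data:.2f} cycles")
```

Changing any of the stated assumptions (for example, treating the miss rates as global rather than local) changes the result, so the structure of the expression matters more than the particular numbers it prints.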
Problem 2a: Let the ALU support multiplication. You cannot change or duplicate the memory component, or change or duplicate the ALU component, but you are allowed to add muxes, registers, equality comparators, and random logic. Estimate the minimum number of cycles (on average) that you can hope to achieve in the inner “while” loop. Justify your answer by discussing the operations that must be performed on each iteration and showing a timing diagram for three iterations of the inner loop. Don’t try to change the datapath yet. You will do that in (2b).
(Hints: You can recognize the last loop of the “while” condition by checking for “=0” and “=degree2”, since this loop will always be executed at least once. Also, make sure you understand where each of the computations comes from; there may be ways of moving them out of the inner loop!)
[Figure: multicycle datapath. Labeled blocks: Ideal Memory (RAdr, WrAdr, Din, Dout), Instruction Reg, Mem Data Reg, Reg File (Ra, Rb, Rw, busA, busB, busW), A, B, ALU, ALU Control, ALU Out, PC, Extend, << 2. Control signals: MemWr, IRWr, RegWr, RegDst, MemtoReg, PCWr, PCWrCond, PCSrc, IorD, ALUSelA, ALUSelB, ALUOp, ExtOp. Status outputs: Zero, NEG.]
Problem 2b: Assume that our new instruction is specified as follows:
polymult $r3, $r1, $r
Where this is an R-TYPE instruction. Here, registers r1 and r2 hold pointers to the source polynomials, and r3 holds a pointer to memory for the destination polynomial. Let’s assume that there is enough memory at the location specified by r3 to hold any result. Assume also that the registers should not be changed during execution.
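As a software reference for what such an instruction has to compute, the sketch below multiplies two polynomials stored as coefficient lists. It is not the loop from the exam, which is not reproduced here: the list representation (lowest degree first) and the name `poly_mult` are assumptions made for illustration only.

```python
def poly_mult(p1, p2):
    """Multiply two polynomials given as coefficient lists, lowest degree first."""
    if not p1 or not p2:
        return []
    result = [0] * (len(p1) + len(p2) - 1)
    for i, a in enumerate(p1):
        for j, b in enumerate(p2):
            # Each inner step reads one partial sum, does one multiply
            # and one add, and writes the partial sum back.
            result[i + j] += a * b
    return result


# (1 + 2x) * (3 + 4x) = 3 + 10x + 8x^2
print(poly_mult([1, 2], [3, 4]))
```

Counting the memory reads, the multiply, the add, and the memory write in the inner step gives a feel for why a single memory port and a single ALU bound the cycles per iteration.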
Change the datapath to support polynomial multiply with the same rate in the inner loop as specified in (2a). As before, you cannot change or duplicate the memory component, or change or duplicate the ALU component, but you are allowed to add muxes, registers, equality comparators, and random logic. Be explicit and try to minimize the hardware and the total number of cycles for the complete operation as much as possible. Show all new control points. (Note: the computation of initial values of indexdeg1 and indexdeg2 can be done with one ALU operation and some muxes!)
Assume that we are going to microcode this instruction. For your reference, Tables 1 and 2 list the symbolic names that we have given to fields of the microinstructions, as well as the microcoded versions of some of the simple instructions.
Problem 2c: First, how does the sequencer box have to change in order to support this instruction? Draw a block diagram showing the MicroPC, the logic around it, and the ROM.
Problem 2d: Next, make changes to Table 1 to reflect your new hardware. Make sure that you are clear about what you are adding/changing.
Problem 2e: Finally, write microcode for the polynomial multiply instruction. (You are now an official CISC system designer!).
Problem 3c: Unroll the given loop once, and schedule it to completely avoid stalls. Show your code. How many cycles per iteration does it get now?
Problem 3d: If you were to unroll the loop 8 times, how many cycles per iteration would this achieve? (Hint: you do not need to actually perform the unrolling, but justify your answer.)
Problem 3e: Now, assume that you want to design a new processor that is more deeply pipelined, i.e. which has larger latencies for all of the operations. Maximize the latencies of instructions that the loop can tolerate by rewriting the loop with software pipelining. Do not unroll the loop (i.e. there will be only 8 instructions). Only show code for the loop; you can ignore any startup or cleanup instructions outside the loop. Hint: this code will overlap 3 different iterations of the loop.
Problem 3f: For the software-pipelined version of the loop, assuming that the loop runs without stalls, what is
Problem 3g: Assuming that most of the power in your original processor was consumed in the execute stages, is the new processor likely to consume more, the same, or less power than the original? Why?
This problem brings together a number of different elements of pipelining.
Problem 4a: There are three different types of data hazards, RAW, WAR, and WAW. Define them, giving a short code sequence to illustrate each, and describe how a 5-stage pipeline removes them:
a) RAW:
b) WAR:
c) WAW:
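The three hazard types can be told apart mechanically by comparing the destination and source registers of two instructions in program order. The sketch below is a minimal illustration rather than a model of any particular pipeline; the `Instr` type and the `hazards` function are hypothetical names.

```python
from typing import NamedTuple, Set, Tuple


class Instr(NamedTuple):
    dest: str               # register written by the instruction
    srcs: Tuple[str, ...]   # registers read by the instruction


def hazards(first: Instr, second: Instr) -> Set[str]:
    """Name the data hazards between two instructions in program order."""
    found = set()
    if first.dest in second.srcs:
        found.add("RAW")    # second reads what first writes
    if second.dest in first.srcs:
        found.add("WAR")    # second writes what first reads
    if first.dest == second.dest:
        found.add("WAW")    # both write the same register
    return found


add1 = Instr("r1", ("r2", "r3"))   # add $r1, $r2, $r3
add2 = Instr("r4", ("r1", "r1"))   # add $r4, $r1, $r1
print(hazards(add1, add2))
```

Whether a detected dependence actually causes a stall depends on the pipeline: the check above only identifies the dependence, not how the hardware resolves it.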
Problem 4b: What are control hazards? Name and explain two different techniques for getting rid of them.
Problem 4c: What are precise exceptions and why are they important?
Problem 4d: Explain how to achieve precise exceptions in a standard 5-stage pipeline. Be explicit.
Problem 4f: Explain how the Tomasulo architecture handles the three different types of data hazards:
Problem 4g: Assume that you have a long chain of dependent instructions, such as the following:
    add $r1, $r2, $r
    add $r3, $r1, $r
    add $r7, $r3, $r
    →
Also assume that the integer execution unit takes one cycle for adds. What CPI would you achieve for this sequence with the basic Tomasulo architecture, assuming that the stages from (4f) are non-overlapped and each takes a complete cycle?
Problem 4h: Assume that associative matching on the CDB is a slow enough operation that it takes much of a cycle. How can you still get a throughput of one instruction per cycle for long dependent chains of operations such as given in (4g)? Only well-thought-out answers will get credit.
Problem 4i: Finally, the Tomasulo algorithm has one interesting “bug” in it. Consider the situation where one instruction uses a value from another one. Suppose the dependent instruction is issued in the same cycle in which the instruction it depends on is in writeback:
    add $r1, $r2, $r3   ← The result is broadcast ...
    add $r4, $r1, $r1   ← This one is being issued
What is the problem? Can you fix it easily?