









This document presents an unoptimized MIPS assembly language implementation of 32-bit unsigned division and a multi-cycle, 32-bit x 32-bit unsigned multiplier with Booth encoding. It includes datapath diagrams, encoding tables, and assembly code for the divide and multiply procedures. The document also discusses potential improvements for the divide procedure, such as using signed operations and a logical shift in the final step.
University of California, Berkeley College of Engineering Computer Science Division EECS
Spring 1999 John Kubiatowicz
March 3, 1999 CS152 Computer Architecture and Engineering
Your Name:
SID Number:
Discussion Section:
Problem Possible Score
1 15
2 15
3 20
4 20
5 30
Total
Problem 1: Performance
Problem 1a: Name the three principal components of runtime that we discussed in class. How do they combine to yield runtime?
Three components: Instruction Count, CPI, and Clock Period (or Rate)
$$\text{Runtime} = \text{InstCount} \times \text{CPI} \times \text{ClockPeriod} = \frac{\text{InstCount} \times \text{CPI}}{\text{ClockRate}}$$
Now, you have analyzed a benchmark that runs on your company’s processor. This processor runs at 300MHz and has the following characteristics:
Instruction Type        Frequency (%)   Cycles
Arithmetic and logical  40              1
Load and Store          30              2
Branches                20              3
Floating Point          10              5
Your company is considering a cheaper, lower-performance version of the processor. Their plan is to remove some of the floating-point hardware to reduce the die size.
The wafer on which the chip is produced has a diameter of 10cm, a cost of $2000, and a defect rate of 1 / (cm^2 ). The manufacturing process has an 80% wafer yield and a value of 2 for α. Here are some equations that you may find useful:
The current processor has a die size of 12mm × 12mm. The new chip has a die size of 10mm × 10mm, and floating point instructions will take 12 cycles to execute.
Problem 1b : What is the CPI and MIPS rating of the original processor?
CPI = 0.40 x 1 + 0.30 x 2 + 0.20 x 3 + 0.10 x 5 = 2.1
MIPS = 300 MHz / 2.1 = 143 MIPS
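The CPI and MIPS figures follow from the instruction mix in the benchmark table and the 300 MHz clock. A minimal sketch of that arithmetic, with dictionary keys and function names chosen here purely for illustration:

```python
# Instruction mix from the benchmark table: name -> (frequency, cycles).
# The key names are illustrative labels, not part of the exam.
MIX = {
    "arith_logic": (0.40, 1),
    "load_store": (0.30, 2),
    "branch": (0.20, 3),
    "floating_point": (0.10, 5),
}
CLOCK_MHZ = 300.0


def average_cpi(mix):
    """Weighted average of cycles per instruction."""
    return sum(freq * cycles for freq, cycles in mix.values())


def mips_rating(clock_mhz, cpi):
    """Millions of instructions per second = clock rate (MHz) / CPI."""
    return clock_mhz / cpi


cpi = average_cpi(MIX)
mips = mips_rating(CLOCK_MHZ, cpi)
print(f"CPI = {cpi:.2f}, MIPS = {mips:.0f}")
```

The same two functions can be reused for the cheaper processor by changing the floating-point entry to 12 cycles.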
Problem 2: Delay For a Full Adder
A key component of an ALU is a full adder. A symbol for a full adder is:
[Figure: full adder symbol with inputs A, B, and Cin and outputs S and Cout]
Problem 2a: Implement a full adder using as few 2-input AND, OR, and XOR gates as possible. Keep in mind that the Carry In signal may arrive much later than the A or B inputs. Thus, optimize your design (if possible) to have as few gates between Carry In and the two outputs as possible:
Assume the following characteristics for the gates:
AND: Input load: 150fF, Propagation delay: TPlh=0.2ns, TPhl=0.5ns, Load-Dependent delay: TPlhf=.0020ns, TPhlf=.0021ns
OR: Input load: 100fF, Propagation delay: TPlh=0.5ns, TPhl=0.2ns, Load-Dependent delay: TPlhf=.0020ns, TPhlf=.0021ns
XOR: Input load: 200fF, Propagation delay: TPlh=.8ns, TPhl=.8ns, Load-Dependent delay: TPlhf=.0040ns, TPhlf=.0042ns
Problem 2b: Compute the input load for each of the 3 inputs to your full adder:
Input Load_A = (150 + 100 + 200) fF = 450 fF
Input Load_B = (150 + 100 + 200) fF = 450 fF
Input Load_Cin = (150 + 200) fF = 350 fF
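The loads above are consistent with a design in which A and B each drive one AND, one OR, and one XOR gate, while Cin drives one AND and one XOR. The sketch below assumes that design (S = (A xor B) xor Cin, Cout = (A and B) or (Cin and (A or B))); the gate arrangement is inferred from the load figures, not taken from the exam's drawing, and all names are illustrative.

```python
from itertools import product

# Input load per gate type, in fF, from the gate characteristics.
GATE_LOAD_FF = {"AND": 150, "OR": 100, "XOR": 200}

# Assumed fan-out of each full-adder input (inferred from the stated loads).
FANOUT = {
    "A": ["AND", "OR", "XOR"],
    "B": ["AND", "OR", "XOR"],
    "Cin": ["AND", "XOR"],
}


def full_adder(a, b, cin):
    """Six 2-input gates; Cin is one gate from S and two gates from Cout."""
    p = a ^ b            # XOR of the early inputs
    s = p ^ cin          # Cin -> S passes through a single XOR
    g = a & b            # carry generated by A and B alone
    t = a | b            # carry can propagate if either input is 1
    cout = g | (cin & t) # Cin -> Cout passes through AND then OR
    return s, cout


def input_load(name):
    """Total capacitive load seen by one input, in fF."""
    return sum(GATE_LOAD_FF[g] for g in FANOUT[name])


for a, b, cin in product((0, 1), repeat=3):
    s, cout = full_adder(a, b, cin)
    assert s + 2 * cout == a + b + cin

print({name: input_load(name) for name in FANOUT})
```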
Problem 2c: Identify two critical paths from the inputs to the Sum and the Carry Out signal. Compute the propagation delays for these critical paths based on the information given above. (You will have 2 numbers for each of these two paths):
slowest for the XOR gate, we will choose the value of the Cin signal so that the XOR gate tying A and B together goes from high to low:
Critical path for Carry Out signal is also from A or B:
Problem 2d: Compute the Load Dependent delay for your two outputs.
This is easy: it is just equal to the LDD of the output gate:
TPlhf_SUM = 0.0040, TPhlf_SUM = 0.0042 (the output gate is an XOR)
TPlhf_Cout = 0.0020, TPhlf_Cout = 0.0021 (the output gate is an OR)
Problem 3c: Assume that you have a MIPS processor that is missing the divide instruction. Implement the above divide operation as a procedure. Assume dividend and divisor are in $a0 and $a1 respectively, and that remainder and quotient are returned in registers $v0 and $v1 respectively. You can use ROL64 as a pseudo-instruction that takes 3 registers. Don’t use any other pseudo-instructions, however. Make sure to adhere to MIPS register conventions, and optimize the loop as much as possible.
Solution:
Notes on solution (by line number):
Here is a faster inner loop (I believe this is the fastest, but I could certainly be wrong). The trick is to use the result bits from the comparison in line 6 directly to form the quotient. Note that they are inverted from what we want. So, we just use them and invert all the bits of the quotient after we are done. This saves 32 x 2 - 2 = 62 cycles.
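The inverted-quotient trick can be illustrated with a small behavioral model. This is a sketch under stated assumptions, not the exam's MIPS code: `hi` and `lo` stand in for the 64-bit remainder/quotient pair that ROL64 shifts, `less` models the result of the unsigned comparison, and the function name is hypothetical.

```python
MASK32 = 0xFFFFFFFF


def divide_unsigned(dividend, divisor):
    """Shift-and-subtract division; returns (remainder, quotient).

    Each comparison result (1 when the partial remainder is too small)
    is shifted into the quotient as-is. Those bits are the complement
    of the true quotient bits, so one inversion at the end fixes them.
    """
    hi = 0                    # partial remainder
    lo = dividend & MASK32    # dividend bits, replaced by quotient bits
    for _ in range(32):
        # Shift the hi:lo pair left by one bit.
        hi = (hi << 1) | (lo >> 31)
        lo = (lo << 1) & MASK32
        less = 1 if hi < divisor else 0   # comparison result bit
        if not less:
            hi -= divisor                 # subtraction succeeds
        lo |= less                        # store the inverted quotient bit
    return hi, (~lo) & MASK32             # invert all 32 bits once


print(divide_unsigned(100, 7))
```

Storing the comparison bit directly removes the per-iteration work of setting the quotient bit, at the price of a single inversion after the loop.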
Problem 3d: What is the “CPI” of your divide procedure (i.e. what is the total number of cycles to perform a divide)? Assume that each MIPS instruction takes 1 cycle.
Need to consider (1) initialization code, (2) 32 x the loop code, and (3) closing code. Remember that ROL is 5 cycles (5 instructions). Some people who neglected to move their branch condition to the end of the loop forgot that there would be one last execution of the branch when count = 0. Assume worst-case delay (the maximum time). To do this, we find the largest time that we might take in the inner loop.
For our first solution:
CPI = 8 + 32 x 12 + 4 = 396 cycles/divide
For the second:
CPI = 8 + 32 x 10 + 6 = 334 cycles/divide
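The two totals are the same three-term sum with different per-iteration and closing costs. A short check of the arithmetic, using a helper name invented for this sketch:

```python
def divide_cycles(init, loop_body, closing, iterations=32):
    """Total cycles = initialization + iterations * loop body + closing."""
    return init + iterations * loop_body + closing


first = divide_cycles(init=8, loop_body=12, closing=4)
second = divide_cycles(init=8, loop_body=10, closing=6)
print(first, second, first - second)
```

The difference between the two totals matches the 62-cycle saving quoted for the faster inner loop.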
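[Figure: modified datapath. Next-PC logic: a MUX controlled by nPC_Sel selects among PC+4 (32-bit adder), the branch target (32-bit adder on PC+4 and the extended IMM), and the jump target PC[31:28] | INST[25:0] | 00. Register file (32 32-bit registers): the RegDest MUX selects Rt (0), Rd (1), or 31 (2) as the write register; the MemtoReg MUX selects the output of the ALU (0), the output of memory (1), or PC+4 (2) to drive busW.]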
Problem 4a : Describe/sketch the modifications needed to the datapath for each instruction. Try to add as little hardware as possible. Make sure that you are very clear about your changes. Also assume that the original datapath had only enough functionality to implement the original 5 instructions:
ADDIU : No change to data path
XOR : Must enhance the ALU to include the XOR functionality.
JAL : Must enhance data path to (1) permit proper update to PC and (2) write back PC+4 to register file (into the $ra register). There are many ways to do this. We will show one. Note that the bottom 26 bits of the instruction must be combined with the top 4 bits of the current PC and 2 zero bits to form the new PC for a JAL:
NewPC = OldPC[31:28] || Instruction[25:0] || 00.
Also, notice that we now have more possibilities for three of the control signals: nPC_SEL, RegDest, and MemToReg.
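The NewPC concatenation can be written out with masks and shifts. A minimal sketch that follows the formula as stated here (top 4 bits of the old PC, 26-bit instruction field, two zero bits); the function names and the sample instruction word are illustrative.

```python
def jal_target(old_pc, instruction):
    """NewPC = OldPC[31:28] || Instruction[25:0] || 00."""
    upper = old_pc & 0xF0000000          # keep PC[31:28]
    index = instruction & 0x03FFFFFF     # keep INST[25:0]
    return upper | (index << 2)          # append two zero bits


def jal_link(old_pc):
    """Value written to $ra (register 31): PC + 4, wrapped to 32 bits."""
    return (old_pc + 4) & 0xFFFFFFFF


pc = 0x10000008
inst = 0x0C000004   # hypothetical JAL word: opcode 3, 26-bit index 4
print(hex(jal_target(pc, inst)), hex(jal_link(pc)))
```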
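BGEZAL : Same modifications to register destination logic and MemToReg mux as above. Also, to check for the condition, it turns out that we only need to examine the sign bit (bit 31) of the RegA output bus from the register file:
[Figure: bit 31 of BusA is tapped to produce the GEZ signal]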
Problem 4b : Specify control for each of the new instructions. Make sure to include values (possibly “x”) for all of the control points.
Note that we expected you to include all of the relevant control signals in your design. Further, note that the two lines for BGEZAL (taken/not taken) are different. So, if you included only one line there, you needed to make sure that it was clear how signals depended on GEZ.
Instr   GEZ  nPC_sel  RegWr  RegDest  ALUctrl  ExtOp   ALUsrc  MemWr  MemToReg
ADDIU   X    0        1      0        Add      SignEx  1       0      0
XOR     X    0        1      1        XOR      X       0       0      0
JAL     X    2        1      2        X        X       X       0      2
BGEZAL  0    0        0      X        X        X       X       0      X
BGEZAL  1    1        1      2        X        X       X       0      2
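One way to read the control table is as a lookup keyed by instruction and, for BGEZAL only, by the GEZ condition. The sketch below transcribes the table into a dictionary; the Python names are hypothetical, and "X" marks a don't-care value.

```python
FIELDS = ("nPC_sel", "RegWr", "RegDest", "ALUctrl",
          "ExtOp", "ALUsrc", "MemWr", "MemToReg")

# (instruction, GEZ) -> control values; GEZ is None when it is a don't-care.
CONTROL = {
    ("ADDIU", None): (0, 1, 0, "Add", "SignEx", 1, 0, 0),
    ("XOR", None): (0, 1, 1, "XOR", "X", 0, 0, 0),
    ("JAL", None): (2, 1, 2, "X", "X", "X", 0, 2),
    ("BGEZAL", 0): (0, 0, "X", "X", "X", "X", 0, "X"),
    ("BGEZAL", 1): (1, 1, 2, "X", "X", "X", 0, 2),
}


def control_for(instr, gez=None):
    """Return the control signals for one instruction as a dict."""
    key = (instr, gez) if instr == "BGEZAL" else (instr, None)
    return dict(zip(FIELDS, CONTROL[key]))


print(control_for("JAL"))
print(control_for("BGEZAL", gez=1))
```

Note that MemWr is 0 in every row: a don't-care there would risk corrupting memory.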
[ Problem 5 continued ] Single-bit Booth encoding results from noticing that a sequence of ones can be represented by two non-zero values at the endpoints:
The encoding uses three symbols, namely: $1$, $0$, and $\bar{1}$. (The $\bar{1}$ stands for “-1”.) A more complicated example of Booth encoding, used on a two’s-complement number, is the following:
To perform Booth encoding, we build a circuit that is able to recognize the beginning, middle, and end of a string of ones. When encoding a string of bits, we start at the far right. For each new bit, we base our decision of how to encode it on both the current bit and the previous bit (to the right).
Problem 5c: Write a table describing this encoding. It should have 2 input bits (current and previous) and should output a 2-bit value which is the two’s complement value of the encoded digit (representing $1$, $0$, or $\bar{1}$):
Answer was given in class:
Cur  Prev  Out
0    0     00
0    1     01
1    0     11
1    1     00
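The table can be exercised by encoding a number from the rightmost bit, with the initial "previous bit" taken as 0, and checking that the signed digits sum back to the original two's complement value. A minimal sketch with illustrative function names:

```python
# (current bit, previous bit) -> 2-bit two's complement encoded digit.
BOOTH_TABLE = {(0, 0): 0b00, (0, 1): 0b01, (1, 0): 0b11, (1, 1): 0b00}


def decode2(out):
    """Interpret a 2-bit two's complement value: 00 -> 0, 01 -> 1, 11 -> -1."""
    return out - 4 if out & 0b10 else out


def booth_digits(value, width=32):
    """Booth digits of a two's complement number, least significant first."""
    bits = value & ((1 << width) - 1)
    prev = 0                      # bit to the right of bit 0 is taken as 0
    digits = []
    for i in range(width):
        cur = (bits >> i) & 1
        digits.append(decode2(BOOTH_TABLE[(cur, prev)]))
        prev = cur
    return digits


def booth_value(digits):
    """Sum of digit * 2**i, which recovers the signed value."""
    return sum(d * (1 << i) for i, d in enumerate(digits))


print(booth_digits(6, width=4))   # 0110 encodes as +1 0 -1 0 (MSB first)
```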
Problem 5d: Modify your datapath to do 32x32 bit signed multiplication by Booth-encoding the multiplier (the operand which is shifted during multiplication). Draw the Booth-encoder as a black-box in your design that takes two bits as input and produces a 2-bit, two’s complement encoding on the output. Assume that you have a 32-bit ALU that can either add or subtract. (Hint: Be careful about the sign-bit of the result during shifts. Also, be careful about the initial value of the “previous bit”.) Explain how your algorithm is different from the previous multiplier.
The neat thing about this type of Booth encoding is that you don’t have to change the algorithm (controller). Just use the “sign” bit of the Booth-encoded signal to control the ALU. Call the low bit of the Booth output “LO[0]” (even though it isn’t really), and POOF, the algorithm is identical.
Note that you DO have to be careful about the sign bit. For fully correct operation, you have to assume that the adder will overflow and need to reconstruct the sign bit. To fix this, we conceptually sign extend the two 32-bit values to 33-bit values. Then, to build a 33-bit adder, we use our 32-bit adder in combination with an extra 3-input xor-gate (which does the “sum” portion of the top bit; see Problem 2). This way we know that we won’t get an overflow, and will have the correct 33rd bit to shift in when we shift. Note that we accepted solutions which simply sign-extended by wrapping the high-bit of the HI register back during shifting. Neglecting sign-extension was not ok, however.
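The add/subtract-then-shift loop can be modeled in a few lines. This is a behavioral sketch, not the exam's datapath: `hi` is kept as an unbounded signed Python integer, which plays the role of the conceptually sign-extended 33-bit value, and the function name is hypothetical.

```python
MASK32 = 0xFFFFFFFF
# (current bit, previous bit) -> 2-bit encoding: ENC[1] = sign, ENC[0] = non-zero.
BOOTH_TABLE = {(0, 0): 0b00, (0, 1): 0b01, (1, 0): 0b11, (1, 1): 0b00}


def booth_multiply(multiplicand, multiplier):
    """Signed 32x32 -> 64-bit multiply, one multiplier bit per step."""
    hi = 0                        # signed partial product (upper half)
    lo = multiplier & MASK32      # multiplier, replaced by low product bits
    prev = 0                      # initial "previous bit" must be 0
    for _ in range(32):
        cur = lo & 1
        enc = BOOTH_TABLE[(cur, prev)]
        if enc & 0b01:            # ENC[0] stands in for "LO[0]"
            if enc & 0b10:        # ENC[1] selects subtract
                hi -= multiplicand
            else:
                hi += multiplicand
        prev = cur
        # Arithmetic right shift of the hi:lo pair; Python's >> on a
        # negative int shifts in copies of the sign, like the 33rd bit.
        lo = (lo >> 1) | ((hi & 1) << 31)
        hi >>= 1
    return (hi << 32) + lo


print(booth_multiply(-7, 9))
```

The case -2^31 times -2^31 makes `hi` reach 2^31 before the final shift, which does not fit in a signed 32-bit value; that is the overflow the extra top bit protects against.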
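[Figure: Booth-encoded multiplier datapath for Problem 5d. Labeled blocks: 32-bit ALU with Sub/Add control, HI register (32 bits), LO register (32 bits), Multiplicand Register, Booth encoder with inputs LO[0] and Prev and outputs ENC[1] and ENC[0] (ENC[0] is used as "LO[0]" by the Control Logic). Labeled signals: LoadHI, ClearHI, LoadLO, LoadMp, ShiftAll, Cout, HI[31], Multi[31], Input Multiplicand, Input Multiplier, Result[HI], Result[LO].]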
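[Figure: two-bit Booth multiplier datapath for Problem 5g. Labeled blocks: 34-bit ALU with Sub/Add control, HI register (16x2 bits, plus 2 extra bits), LO register (16x2 bits), Multiplicand Register, 32=>34 sign extenders, a 34-bit 2-input MUX selecting between the multiplicand and the multiplicand shifted left by one (controlled by ENC[1]), Booth encoder with inputs LO[1:0] and Prev (the previous LO[1]) and outputs ENC[2], ENC[1], and ENC[0] (ENC[0] is used as "LO[0]" by the Control Logic). Labeled signals: LoadHI, ClearHI, LoadLO, LoadMp, ShiftAll, Input Multiplicand, Input Multiplier, Result[HI], Result[LO].]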
Problem 5g [Extra Credit]: Draw a datapath that does signed multiplication, two bits at a time and include the multiplication algorithm. Draw the two-bit Booth encoder as a black box which implements output from the table in problem 5f. Make sure that you describe how the 5 possible output symbols (i.e. $\bar{2}$, $\bar{1}$, $0$, $1$, and $2$) are encoded (hint: two’s complement is not the best solution here). As before, assume that you have a 32-bit ALU that can add and subtract:
Code        Enc2  Enc1  Enc0
$2$         0     1     1
$1$         0     0     1
$0$         0     0     0
$\bar{1}$   1     0     1
$\bar{2}$   1     1     1
This solution is remarkably similar to the other two solutions. Notice that we have chosen an encoding for the Booth symbols that is similar to sign/magnitude. Thus, the sign bit is Enc2, which goes directly to select Add/Subtract. The next bit (Enc1) indicates whether or not we should multiply the multiplicand by 2. Finally, the last bit indicates “non-zero”, and is once again equivalent to “LO[0]”. If we had gone with complete sign-magnitude (also ok), we would “or” together Enc0/Enc1 to get our replacement for “LO[0]”. Except for the fact that this solution shifts only 16 times, the control state machine is identical to that used in the previous two problems.
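The three-bit encoding can be sketched as a function from a two-bit group of the multiplier plus the previous bit. The digit rule used below (-2 x high bit + low bit + previous bit) is the standard radix-4 Booth recoding and is an assumption here, since the problem 5f table is not reproduced; the function name is illustrative.

```python
def booth2_encode(hi_bit, lo_bit, prev_bit):
    """Return (digit, (Enc2, Enc1, Enc0)) for one two-bit group.

    Enc2 = sign (1 selects subtract), Enc1 = use multiplicand x 2,
    Enc0 = non-zero (plays the role of "LO[0]").
    """
    digit = -2 * hi_bit + lo_bit + prev_bit   # always in {-2, -1, 0, 1, 2}
    enc2 = 1 if digit < 0 else 0
    enc1 = 1 if abs(digit) == 2 else 0
    enc0 = 1 if digit != 0 else 0
    return digit, (enc2, enc1, enc0)


for bits in [(0, 1, 1), (0, 0, 1), (0, 0, 0), (1, 1, 0), (1, 0, 0)]:
    print(bits, booth2_encode(*bits))
```

With this encoding each output bit drives exactly one control point, so no decoding logic sits between the encoder and the datapath.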
The key changes to the datapath are two-fold:
1. We are now multiplying the multiplicand by $\bar{2}$, $\bar{1}$, $0$, $1$, or $2$. To do this, we need to either shift by one or not. This is the mux just under the multiplicand register.
2. We have formatted our high and low registers so that we can shift pairs of bits at once. Notice that “ShiftAll” now causes two matched shift registers (one containing even bits, the other containing odd ones) to shift. This is like shifting by two in the previous solution.
Notice that a number of people tried to handle multiplication by two by first shifting by one, adding, then shifting by one again. While this will work, it destroys the advantage in speed gained by radix-4 Booth encoding (i.e. taking half the number of cycles to multiply).
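A behavioral model shows how the two-bit version finishes in 16 add/shift steps, with the multiply-by-two done by a left shift of the multiplicand rather than by extra iterations. This is a sketch under the same assumptions as the standard radix-4 Booth recoding; `hi` is an unbounded signed Python integer standing in for the 34-bit path, and the function name is hypothetical.

```python
MASK32 = 0xFFFFFFFF


def booth_radix4_multiply(multiplicand, multiplier):
    """Signed 32x32 -> 64-bit multiply, two multiplier bits per step."""
    hi = 0
    lo = multiplier & MASK32
    prev = 0                                  # initial previous bit is 0
    for _ in range(16):
        b0 = lo & 1
        b1 = (lo >> 1) & 1
        digit = -2 * b1 + b0 + prev           # one of -2, -1, 0, 1, 2
        enc2 = digit < 0                      # sign: select subtract
        enc1 = abs(digit) == 2                # select multiplicand x 2
        enc0 = digit != 0                     # non-zero: load HI
        operand = multiplicand << 1 if enc1 else multiplicand
        if enc0:
            hi = hi - operand if enc2 else hi + operand
        prev = b1
        # Shift the hi:lo pair right by two bits, keeping the sign of hi.
        lo = (lo >> 2) | ((hi & 3) << 30)
        hi >>= 2
    return (hi << 32) + lo


print(booth_radix4_multiply(-7, 9))
```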
Finally, to make this really fast, you would want to combine the logic for “LoadHI” with the logic for “ShiftAll”, so that you could do them at the same time. This would make the multiplier truly run in 16 cycles. We didn’t expect you to come up with this.