Unoptimized Divide and Multiply Procedures in MIPS Assembly Language, Exams of Computer Architecture and Organization

An unoptimized MIPS assembly language implementation of 32-bit unsigned division and a multi-cycle, 32-bit x 32-bit multiplier with Booth encoding. It includes datapath diagrams, encoding tables, and assembly code for the divide and multiply procedures. The document also discusses pitfalls in the divide procedure, such as why signed operations fail on unsigned values and why the final step needs a logical shift.

Typology: Exams

2012/2013

Uploaded on 04/02/2013

shashikanth_0p3

University of California, Berkeley
College of Engineering
Computer Science Division, EECS

Spring 1999, John Kubiatowicz

Midterm I

SOLUTIONS

March 3, 1999
CS152 Computer Architecture and Engineering

Your Name:

SID Number:

Discussion Section:

Problem   Possible   Score
1         15
2         15
3         20
4         20
5         30
Total

Problem 1: Performance

Problem 1a: Name the three principal components of runtime that we discussed in class. How do they combine to yield runtime?

Three components: Instruction Count, CPI, and Clock Period (or Rate)

$$\text{Runtime} = \text{InstCount} \times \text{CPI} \times \text{ClockPeriod} = \frac{\text{InstCount} \times \text{CPI}}{\text{ClockRate}}$$

Now, you have analyzed a benchmark that runs on your company’s processor. This processor runs at 300MHz and has the following characteristics:

Instruction Type          Frequency (%)   Cycles
Arithmetic and logical    40              1
Load and Store            30              2
Branches                  20              3
Floating Point            10              5

Your company is considering a cheaper, lower-performance version of the processor. Their plan is to remove some of the floating-point hardware to reduce the die size.

The wafer on which the chip is produced has a diameter of 10 cm, a cost of $2000, and a defect rate of 1 per cm^2. The manufacturing process has an 80% wafer yield and a value of 2 for α. Here are some equations that you may find useful:

The current processor has a die size of 12mm × 12mm. The new chip has a die size of 10mm × 10mm, and floating point instructions will take 12 cycles to execute.

Problem 1b: What is the CPI and MIPS rating of the original processor?

CPI = .4 × 1 + .3 × 2 + .2 × 3 + .1 × 5 = 2.1 cycles/inst

MIPS = 300 MHz / 2.1 cycles/inst = 143 MIPS
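The arithmetic in Problem 1b can be checked with a short Python sketch. The names `weighted_cpi`, `mips_rating`, and `with_fp_cycles` are illustrative and not part of the exam; the instruction mix, the 300 MHz clock, and the 12-cycle floating-point figure for the cheaper chip are taken from the problem statement.

```python
# Instruction mix from the problem statement: (type, frequency, cycles).
ORIGINAL_MIX = [
    ("Arithmetic and logical", 0.40, 1),
    ("Load and Store", 0.30, 2),
    ("Branches", 0.20, 3),
    ("Floating Point", 0.10, 5),
]


def weighted_cpi(mix):
    """Average cycles per instruction, weighted by instruction frequency."""
    return sum(freq * cycles for _, freq, cycles in mix)


def mips_rating(clock_mhz, cpi):
    """Millions of instructions per second for a given clock and CPI."""
    return clock_mhz / cpi


def with_fp_cycles(mix, fp_cycles):
    """Copy of the mix with a different floating-point latency (hypothetical helper)."""
    return [
        (name, freq, fp_cycles if name == "Floating Point" else cycles)
        for name, freq, cycles in mix
    ]


if __name__ == "__main__":
    for label, mix in (
        ("original", ORIGINAL_MIX),
        ("12-cycle floating point", with_fp_cycles(ORIGINAL_MIX, 12)),
    ):
        cpi = weighted_cpi(mix)
        print(f"{label}: CPI = {cpi:.2f}, MIPS = {mips_rating(300, cpi):.1f}")
```

The second configuration is included only to show how the same formula applies to the cheaper chip described above.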

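Equations provided with Problem 1:

$$\text{dies/wafer} = \frac{\pi \times (\text{wafer diameter}/2)^2}{\text{die area}} - \frac{\pi \times \text{wafer diameter}}{\sqrt{2 \times \text{die area}}}$$

$$\text{die yield} = \text{wafer yield} \times \left(1 + \frac{\text{defects per unit area} \times \text{die area}}{\alpha}\right)^{-\alpha}$$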
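As a rough illustration of how the dies-per-wafer and die-yield equations are applied, the Python sketch below plugs in the values given in the problem (10 cm wafer, 1 defect per cm^2, 80% wafer yield, α = 2) for both die sizes. The function names are hypothetical and all lengths are assumed to be converted to centimeters first.

```python
import math


def dies_per_wafer(wafer_diameter_cm, die_area_cm2):
    """Gross dies per wafer: wafer area over die area, minus an edge-loss term."""
    wafer_area = math.pi * (wafer_diameter_cm / 2) ** 2
    edge_loss = math.pi * wafer_diameter_cm / math.sqrt(2 * die_area_cm2)
    return wafer_area / die_area_cm2 - edge_loss


def die_yield(wafer_yield, defects_per_cm2, die_area_cm2, alpha):
    """Fraction of good dies according to the yield equation."""
    return wafer_yield * (1 + defects_per_cm2 * die_area_cm2 / alpha) ** (-alpha)


if __name__ == "__main__":
    for label, side_cm in (("current 12mm x 12mm die", 1.2), ("new 10mm x 10mm die", 1.0)):
        area = side_cm * side_cm
        print(
            f"{label}: {dies_per_wafer(10, area):.2f} dies/wafer, "
            f"yield {die_yield(0.80, 1.0, area, 2):.4f}"
        )
```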

Problem 2: Delay For a Full Adder

A key component of an ALU is a full adder. A symbol for a full adder is:

[Figure: full adder symbol with inputs A, B, and Cin and outputs S and Cout]

Problem 2a: Implement a full adder using as few 2-input AND, OR, and XOR gates as possible. Keep in mind that the Carry In signal may arrive much later than the A or B inputs. Thus, optimize your design (if possible) to have as few gates between Carry In and the two outputs as possible:

[Figure: solution circuit with inputs A, B, and Cin and outputs S and Cout]

Assume the following characteristics for the gates:

Gate   Input load   TPlh     TPhl     TPlhf          TPhlf
AND    150 fF       0.2 ns   0.5 ns   0.0020 ns/fF   0.0021 ns/fF
OR     100 fF       0.5 ns   0.2 ns   0.0020 ns/fF   0.0021 ns/fF
XOR    200 fF       0.8 ns   0.8 ns   0.0040 ns/fF   0.0042 ns/fF

(TPlh and TPhl are propagation delays; TPlhf and TPhlf are load-dependent delays.)

Problem 2b: Compute the input load for each of the 3 inputs to your full adder:

Input Load A = (150 + 100 + 200) fF = 450 fF
Input Load B = (150 + 100 + 200) fF = 450 fF
Input Load Cin = (150 + 200) fF = 350 fF
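The solution circuit itself is not reproduced in this text, so the Python sketch below assumes one gate network that is consistent with the input loads above: A and B each drive one AND, one OR, and one XOR, while Cin drives one AND and one XOR. The wiring in `full_adder` is therefore an inference, not a copy of the original figure.

```python
from itertools import product

# Input load per gate type in fF, from the gate characteristics.
GATE_LOAD_FF = {"AND": 150, "OR": 100, "XOR": 200}

# Gates driven by each adder input in the assumed circuit.
FANOUT = {
    "A": ["AND", "OR", "XOR"],
    "B": ["AND", "OR", "XOR"],
    "Cin": ["AND", "XOR"],
}

INPUT_LOAD_FF = {pin: sum(GATE_LOAD_FF[g] for g in gates) for pin, gates in FANOUT.items()}


def full_adder(a, b, cin):
    """Six 2-input gates; Cin passes through one gate to S and two gates to Cout."""
    s = (a ^ b) ^ cin          # XOR, XOR
    cout = (a & b) | ((a | b) & cin)  # AND, OR, AND, OR
    return s, cout


if __name__ == "__main__":
    for a, b, cin in product((0, 1), repeat=3):
        print(a, b, cin, full_adder(a, b, cin))
    print(INPUT_LOAD_FF)
```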

Problem 2c: Identify two critical paths from the inputs to the Sum and the Carry Out signal. Compute the propagation delays for these critical paths based on the information given above. (You will have 2 numbers for each of these two paths):

Critical path to Sum is from either A or B to Sum. Since the High ⇒ Low transition is slowest for the XOR gate, we will choose the value of the Cin signal so that the XOR gate tying A and B together goes from high to low:

TPlh = 0.8 + 0.0042 × 200 + 0.8 = 2.44 ns

TPhl = 0.8 + 0.0042 × 200 + 0.8 = 2.44 ns

Critical path for Carry Out signal is also from A or B:

TPlh = 0.5ns + 0.0020 × 150 + 0.2ns + .0020 × 100 + .5ns= 1.7 ns

TPhl = 0.2 ns +0.0021 × 150 + 0.5ns + .0021 × 100 + .2ns= 1.425 ns
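The path delays above all follow one pattern: each gate contributes its intrinsic delay plus its load-dependent delay times the capacitance it drives, and the final gate is treated as unloaded. A small Python sketch of that bookkeeping, with stage lists transcribed from the sums above (the name `path_delay` is illustrative):

```python
def path_delay(stages):
    """Sum of (intrinsic ns + ns/fF * load fF) over the gates on a path."""
    return sum(intrinsic + ldd * load for intrinsic, ldd, load in stages)


# Sum path: first XOR (falling, driving the 200 fF second XOR), then the output XOR.
SUM_PATH = [(0.8, 0.0042, 200), (0.8, 0.0, 0)]

# Carry path, rising: OR driving an AND (150 fF), AND driving an OR (100 fF), output OR.
COUT_PATH_LH = [(0.5, 0.0020, 150), (0.2, 0.0020, 100), (0.5, 0.0, 0)]

# Carry path, falling: same gates with their high-to-low numbers.
COUT_PATH_HL = [(0.2, 0.0021, 150), (0.5, 0.0021, 100), (0.2, 0.0, 0)]

if __name__ == "__main__":
    print(f"Sum:       {path_delay(SUM_PATH):.3f} ns")
    print(f"Cout TPlh: {path_delay(COUT_PATH_LH):.3f} ns")
    print(f"Cout TPhl: {path_delay(COUT_PATH_HL):.3f} ns")
```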

Problem 2d: Compute the Load Dependent delay for your two outputs.

This is easy: it is just equal to the LDD of the output gate:

TPlhf SUM = 0.0040 ns/fF (output gate is an XOR)
TPhlf SUM = 0.0042 ns/fF
TPlhf Cout = 0.0020 ns/fF (output gate is an OR)
TPhlf Cout = 0.0021 ns/fF

Problem 3c: Assume that you have a MIPS processor that is missing the divide instruction. Implement the above divide operation as a procedure. Assume dividend and divisor are in $a0 and $a1 respectively, and that remainder and quotient are returned in registers $v0 and $v1 respectively. You can use ROL64 as a pseudo-instruction that takes 3 registers. Don’t use any other pseudo-instructions, however. Make sure to adhere to MIPS register conventions, and optimize the loop as much as possible.

Solution:

  1.  divide:  ori   $t4, $zero, 32       ;Initialize count
  2.           ori   $v0, $zero, 0        ;Rem = 0
  3.           add   $v1, $a0, $zero      ;Quotient = Dividend
  4.           ROL64 $v0, $v1, $zero
  5.  divloop: addi  $t4, $t4, -1         ;Decrement count
  6.           sltu  $t5, $v0, $a1        ;Check: Rem < Divisor?
  7.           bne   $t5, $zero, nosub    ;Yes. No subtract
  8.           subu  $v0, $v0, $a1        ;Rem=Rem-divisor (divisor is in $a1)
  9.           ori   $t6, $zero, 1        ;Bit to shift (temp)
  10.          j     dorol
  11. nosub:   ori   $t6, $zero, 0        ;Bit to shift (temp)
  12. dorol:   ROL64 $v0, $v1, $t6        ;Do the ROL operation
  13.          bne   $t4, $zero, divloop  ;Loop if count nonzero
  14.          srl   $v0, $v0, 1          ;Final shift of remainder
  15.          bne   $v1, $zero, exit     ;Check quotient=0
  16.          add   $v0, $a0, $zero      ;Ah. Let rem=dividend
  17. exit:    jr    $ra

Notes on solution (by line number):

  • This inner loop is 12 cycles, which is the average good solution (accepted as answer).
  • We did require that you be as minimal as possible on the inner loop.
  • Note that we have moved the loop check to the end of the loop (line 13), saving a cycle in the loop.
  • Note that signed ops (such as slt) at line 6 don’t work for 32-bit unsigned values. Similarly, subtracting and checking the “sign” of the result doesn’t work either. The simplest way to understand this is to consider the case when the quotient has its high bit set. Then, even though remainder = 0, we will think we should subtract!
  • It is important to have a logical shift in line 14, since you don’t want to simply copy the high bit. Again, that is because this is an unsigned operation.
  • Further note on closing lines 15 and 16 (which we didn’t require): This is a special case required to get the correct answer on those cases for which the remainder has its high bit set. Since our last action (line 14) is a logical shift right, there is no way to get such a remainder. Thus, something is clearly broken up to line 14. Fortunately, we know that remainder < divisor. So, if the remainder has its high bit set, so does the divisor. However, since dividend = quotient × divisor + remainder, we see that we can’t get a 32-bit result on the left if both the quotient × divisor and remainder terms have their high bits set. This means that: remainder has high bit set ⇒ quotient = 0. Thus, the easy fix is to check for quotient = 0, and in that case set remainder = dividend (always works).
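The procedure and the special case in lines 15 and 16 can be exercised with a small Python model. This is a sketch under one assumption: ROL64 is modeled as shifting the 64-bit pair {first register, second register} left by one and inserting the low bit of the third register at bit 0, since its definition is not included in this text. The helper names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def rol64(hi, lo, bit):
    """Assumed ROL64: shift {hi, lo} left by one and insert `bit` at bit 0."""
    combined = ((((hi & MASK32) << 32) | (lo & MASK32)) << 1) | (bit & 1)
    return (combined >> 32) & MASK32, combined & MASK32


def divide(dividend, divisor):
    """Return (remainder, quotient), mirroring the 17-line procedure."""
    rem, quot = rol64(0, dividend & MASK32, 0)   # lines 2-4
    for _ in range(32):                          # lines 5-13
        if rem < divisor:                        # sltu: unsigned compare
            bit = 0
        else:
            rem = (rem - divisor) & MASK32       # subu
            bit = 1
        rem, quot = rol64(rem, quot, bit)
    rem >>= 1                                    # line 14: logical shift right
    if quot == 0:                                # lines 15-16: high-bit remainder fix
        rem = dividend & MASK32
    return rem, quot


if __name__ == "__main__":
    for a, b in ((7, 2), (100, 7), (0x80000000, 0xFFFFFFFF), (0xFFFFFFFF, 0x80000001)):
        rem, quot = divide(a, b)
        print(f"{a:#x} / {b:#x}: quotient {quot:#x}, remainder {rem:#x}")
```

The third pair is a case where the remainder has its high bit set, which is the situation the quotient = 0 check exists for.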

Here is a faster inner loop (I believe this is the fastest, but I could certainly be wrong). The trick is to use the result bits from the comparison in line 6 directly to form the quotient. Note that they are inverted from what we want. So, we just use them and invert all the bits of the quotient after we are done. This saves 32 x 2 - 2 = 62 cycles.

  1.  divide:  ori   $t4, $zero, 32       ;Initialize count
  2.           ori   $v0, $zero, 0        ;Rem = 0
  3.           add   $v1, $a0, $zero      ;Quotient = Dividend
  4.           ROL64 $v0, $v1, $zero
  5.  divloop: addi  $t4, $t4, -1         ;Decrement count
  6.           sltu  $t5, $v0, $a1        ;Check: Rem < Divisor?
  7.           bne   $t5, $zero, dorol    ;Yes. No subtract
  8.           subu  $v0, $v0, $a1        ;Rem=Rem-divisor (divisor is in $a1)
  9.  dorol:   ROL64 $v0, $v1, $t5        ;Note: t5 is inverted!
  10.          bne   $t4, $zero, divloop  ;Loop if count nonzero
  11.          addi  $t6, $zero, -1       ;All 1s in $t6
  12.          xor   $v1, $v1, $t6        ;Invert bits in quotient
  13.          srl   $v0, $v0, 1          ;Final shift of remainder
  14.          bne   $v1, $zero, exit     ;Check quotient=0
  15.          add   $v0, $a0, $zero      ;Ah. Let rem=dividend
  16. exit:    jr    $ra
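To see why shifting in the raw comparison bit and inverting at the end gives the same quotient, here is a Python model of the faster loop. As before, the ROL64 behavior (shift the 64-bit pair left by one, insert the low bit of the third operand) is an assumption, and the function names are illustrative.

```python
MASK32 = 0xFFFFFFFF


def rol64(hi, lo, bit):
    """Assumed ROL64: shift {hi, lo} left by one and insert `bit` at bit 0."""
    combined = ((((hi & MASK32) << 32) | (lo & MASK32)) << 1) | (bit & 1)
    return (combined >> 32) & MASK32, combined & MASK32


def divide_inverted(dividend, divisor):
    """Return (remainder, quotient) using the inverted-bit trick of the faster loop."""
    rem, quot = rol64(0, dividend & MASK32, 0)
    for _ in range(32):
        less = 1 if rem < divisor else 0      # sltu result: 1 means "no subtract"
        if not less:
            rem = (rem - divisor) & MASK32
        rem, quot = rol64(rem, quot, less)    # shift in the inverted quotient bit
    quot ^= MASK32                            # xor with all ones restores the quotient
    rem >>= 1
    if quot == 0:
        rem = dividend & MASK32
    return rem, quot


if __name__ == "__main__":
    print(divide_inverted(100, 7))
```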

Problem 3d: What is the “CPI” of your divide procedure (i.e. what is the total number of cycles to perform a divide)? Assume that each MIPS instruction takes 1 cycle.

Need to consider (1) initialization code, (2) 32 x loop code, and (3) closing code. Remember that ROL64 is 5 cycles (5 instructions). Some people who neglected to move their branch condition to the end of the loop forgot that there would be one last execution of the branch when count = 0. Assume worst-case delay (the maximum time). To do this, we find the largest time that we might take in the inner loop.

For our first solution:

CPI = 8 + 32 x 12 + 4 = 396 cycles/divide

For the second:

CPI = 8 + 32 x 10 + 6 = 334 cycles/divide
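The two totals can be rebuilt from instruction counts. The Python sketch below counts the ordinary instructions in each section of the listings, adds 5 cycles for every ROL64, and takes the worst-case path through the loop; the helper name `total_cycles` is illustrative.

```python
ROL64_CYCLES = 5  # the solution states that ROL64 expands to 5 instructions


def total_cycles(init, loop, closing, iterations=32):
    """Cycles for one divide: setup, worst-case loop body times 32, and cleanup."""
    return init + iterations * loop + closing


# First solution: 3 setup instructions + ROL64; worst-case loop path is
# addi, sltu, bne, subu, ori, j, bne + ROL64; closing is srl, bne, add, jr.
FIRST = total_cycles(3 + ROL64_CYCLES, 7 + ROL64_CYCLES, 4)

# Faster solution: loop path is addi, sltu, bne, subu, bne + ROL64;
# closing adds the addi/xor pair that inverts the quotient.
SECOND = total_cycles(3 + ROL64_CYCLES, 5 + ROL64_CYCLES, 6)

if __name__ == "__main__":
    print(FIRST, SECOND, FIRST - SECOND)
```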

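[Figure: modified single-cycle datapath. Recoverable labels: PC and PC Extend; two 32-bit adders with inputs 4 and IMM; a jump target formed as PC[31:28] | INST[25:0] | 00; an nPC_Sel MUX producing the next PC; a register file of 32 32-bit registers with write bus busW; a RegDest MUX choosing among Rt, Rd, and 31; and a MemtoReg MUX choosing among the output of the ALU, the output of memory, and PC+4.]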

Problem 4a: Describe/sketch the modifications needed to the datapath for each instruction. Try to add as little hardware as possible. Make sure that you are very clear about your changes. Also assume that the original datapath had only enough functionality to implement the original 5 instructions:

ADDIU: No change to data path.

XOR: Must enhance the ALU to include the XOR functionality.

JAL: Must enhance data path to (1) permit proper update to PC and (2) write back PC+4 to register file (into the $ra register). There are many ways to do this. We will show one. Note that the bottom 26 bits of the instruction must be combined with the top 4 bits of the current PC and 2 zero bits to form the new PC for a JAL:

NewPC = OldPC[31:28] || Instruction[25:0] || 00.

Also, notice that we now have more possibilities for three of the control signals: nPC_SEL, RegDest, and MemToReg.
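The concatenation in the NewPC formula can be written as a pair of masks and a shift. The Python sketch below follows the formula exactly as stated (top 4 bits of the old PC, 26 instruction bits, two zero bits); the function name and the sample values are illustrative.

```python
def jal_target(old_pc, instruction):
    """NewPC = OldPC[31:28] || Instruction[25:0] || 00."""
    top_bits = old_pc & 0xF0000000            # OldPC[31:28], kept in place
    index = instruction & 0x03FFFFFF          # Instruction[25:0]
    return top_bits | (index << 2)            # two zero bits appended on the right


def jal_link(old_pc):
    """Value written to $ra (register 31): PC + 4."""
    return (old_pc + 4) & 0xFFFFFFFF


if __name__ == "__main__":
    pc, inst = 0x00400000, 0x0C100010         # opcode 000011 with a 26-bit index
    print(hex(jal_target(pc, inst)), hex(jal_link(pc)))
```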

BGEZAL: Same modifications to register destination logic and MemToReg mux as above. Also, to check for the condition, it turns out that checking R[rs] ≥ 0 is equivalent to checking the MSB of the RegA output bus from the register file:

[Figure: the GEZ signal is derived from bit 31 of BusA]
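In two's complement a value is greater than or equal to zero exactly when its sign bit is 0, so GEZ is simply the inverse of BusA[31]. A minimal Python sketch, with hypothetical helper names, that compares this against an explicit signed interpretation:

```python
def gez(bus_a):
    """GEZ control input: true when bit 31 of the 32-bit BusA value is 0."""
    return ((bus_a >> 31) & 1) == 0


def as_signed32(value):
    """Interpret a 32-bit pattern as a two's complement integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


if __name__ == "__main__":
    for pattern in (0x00000000, 0x7FFFFFFF, 0x80000000, 0xFFFFFFFF):
        print(hex(pattern), as_signed32(pattern), gez(pattern))
```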

Problem 4b: Specify control for each of the new instructions. Make sure to include values (possibly “x”) for all of the control points.

Note that we expected you to include all of the relevant control signals in your design. Further, note that the two lines for BGEZAL (taken/not taken) are different. So, if you included only one line there, you needed to make sure that it was clear how signals depended on GEZ.

Instr    GEZ   nPC_sel   RegWr   RegDest   ALUctrl   ExtOp    ALUsrc   MemWr   MemToReg
ADDIU    X     0         1       0         Add       SignEx   1        0       0
XOR      X     0         1       1         XOR       X        0        0       0
JAL      X     2         1       2         X         X        X        0       2
BGEZAL   0     0         0       X         X         X        X        0       X
BGEZAL   1     1         1       2         X         X        X        0       2

[ Problem 5 continued ] Single-bit Booth encoding results from noticing that a sequence of ones can be represented by two non-zero values at the endpoints:

The encoding uses three symbols, namely: 1, 0, and 1̄. (The 1̄ stands for “-1”.) A more complicated example of Booth encoding, used on a two’s-complement number, is the following:

To perform Booth encoding, we build a circuit that is able to recognize the beginning, middle, and end of a string of ones. When encoding a string of bits, we start at the far right. For each new bit, we base our decision of how to encode it on both the current bit and the previous bit (to the right).

Problem 5c: Write a table describing this encoding. It should have 2 input bits (current and previous) and should output a 2-bit value which is the two’s complement value of the encoded digit (representing 1, 0, or 1̄):

Answer was given in class:

Cur   Prev   Out
0     0      00
0     1      01
1     0      11
1     1      00
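The table can be applied bit by bit, starting at the far right with a previous bit of 0. The Python sketch below does that and checks that the resulting digits, weighted by powers of two, add up to the signed value of the original number. The function names are hypothetical.

```python
# (cur, prev) -> 2-bit two's complement output, from the table above.
BOOTH_OUT = {(0, 0): 0b00, (0, 1): 0b01, (1, 0): 0b11, (1, 1): 0b00}


def decode2(out):
    """Interpret a 2-bit two's complement code as -1, 0, or +1."""
    return out - 4 if out & 0b10 else out


def booth_digits(value, bits=32):
    """Booth digits of a `bits`-wide pattern, least significant digit first."""
    prev = 0                       # the "previous bit" to the right of bit 0
    digits = []
    for i in range(bits):
        cur = (value >> i) & 1
        digits.append(decode2(BOOTH_OUT[(cur, prev)]))
        prev = cur
    return digits


def digits_value(digits):
    """Weighted sum of the digits: digit i has weight 2**i."""
    return sum(d * (1 << i) for i, d in enumerate(digits))


if __name__ == "__main__":
    print(booth_digits(0b0111, bits=4))   # a run of ones becomes +1 ... -1
    print(digits_value(booth_digits(0xFFFFFFFC)))
```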

Problem 5d: Modify your datapath to do 32x32 bit signed multiplication by Booth-encoding the multiplier (the operand which is shifted during multiplication). Draw the Booth-encoder as a black-box in your design that takes two bits as input and produces a 2-bit, two’s complement encoding on the output. Assume that you have a 32-bit ALU that can either add or subtract. (Hint: Be careful about the sign-bit of the result during shifts. Also, be careful about the initial value of the “previous bit”.) Explain how your algorithm is different from the previous multiplier.

The neat thing about this type of Booth encoding is that you don’t have to change the algorithm (controller). Just use the “sign” bit of the Booth-encoded signal to control the ALU. Call the low bit of the Booth output “LO[0]” (even though it isn’t really), and POOF, the algorithm is identical.

Note that you DO have to be careful about the sign bit. For fully correct operation, you have to assume that the adder will overflow and need to reconstruct the sign bit. To fix this, we conceptually sign extend the two 32-bit values to 33-bit values. Then, to build a 33-bit adder, we use our 32-bit adder in combination with an extra 3-input XOR gate (which does the “sum” portion of the top bit; see problem 2). This way we know that we won’t get an overflow, and will have the correct 33rd bit to shift in when we shift. Note that we accepted solutions which simply sign-extended by wrapping the high bit of the HI register back during shifting. Neglecting sign extension was not ok, however.
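A rough Python model of this multiplier is sketched below. It assumes the usual shift-right organization (HI starts at 0, LO holds the multiplier, the pair shifts right once per step), which matches the register names used here but is not spelled out in this text. Python integers stand in for the 33-bit adder output, so the sign is never lost before the shift. The function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF

# (cur, prev) -> 2-bit two's complement Booth output, from Problem 5c.
BOOTH_OUT = {(0, 0): 0b00, (0, 1): 0b01, (1, 0): 0b11, (1, 1): 0b00}


def as_signed32(value):
    """Interpret a 32-bit pattern as a two's complement integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def booth_multiply(multiplicand, multiplier):
    """Signed 32x32 multiply, one Booth digit per step; returns the signed 64-bit product."""
    m = as_signed32(multiplicand)
    hi = 0                          # HI register, kept as a signed Python integer
    lo = multiplier & MASK32        # LO register starts as the multiplier
    prev = 0                        # initial "previous bit" is 0
    for _ in range(32):
        cur = lo & 1
        enc = BOOTH_OUT[(cur, prev)]
        total = hi                  # conceptually a 33-bit value
        if enc & 0b01:              # low bit plays the role of "LO[0]": add or subtract
            total = hi - m if enc & 0b10 else hi + m   # sign bit selects subtract
        prev = cur
        lo = ((total & 1) << 31) | (lo >> 1)   # low bit of the sum shifts into LO
        hi = total >> 1                         # arithmetic shift keeps the 33rd bit
    return (hi << 32) | lo


if __name__ == "__main__":
    print(booth_multiply(3, -4), booth_multiply(-(1 << 31), -(1 << 31)))
```

The second call is the case where the intermediate sum needs the 33rd bit: subtracting the most negative multiplicand from 0 does not fit in 32 signed bits.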

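[Figure: datapath for the single-bit Booth multiplier of Problem 5d. Recoverable labels: 32-bit ALU with Sub/Add control; HI register (32 bits) and LO register (32 bits); Multiplicand Register; control signals LoadHI, ClearHI, LoadLO, ShiftAll, and LoadMp; Booth encoder with inputs LO[0] and Prev and outputs ENC[1] and ENC[0] (used as “LO[0]”); HI[31], Multi[31], and Cout feeding the bit that is shifted in; Control Logic; Input Multiplier; Input Multiplicand; Result[HI] and Result[LO].]

[Figure: datapath for the two-bits-at-a-time multiplier of Problem 5g. Recoverable labels: 34-bit ALU with Sub/Add control; HI and LO registers (16x2 bits each) with 2 extra bits; Multiplicand Register; 32=>34 sign extension; a <<1 shifter and a 34x2 MUX selecting the multiplicand or twice the multiplicand; Booth encoder with inputs LO[1:0] and Prev (LO[1]) and outputs ENC[2], ENC[1], and ENC[0] (used as “LO[0]”); control signals LoadHI, ClearHI, LoadLO, ShiftAll, and LoadMp; Control Logic; Input Multiplier; Input Multiplicand; Result[HI] and Result[LO].]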

Problem 5g [Extra Credit]: Draw a datapath that does signed multiplication, two bits at a time, and include the multiplication algorithm. Draw the two-bit Booth encoder as a black box which implements output from the table in problem 5f. Make sure that you describe how the 5 possible output symbols (i.e. 2, 1, 0, 1̄, and 2̄) are encoded (hint: two’s complement is not the best solution here). As before, assume that you have a 32-bit ALU that can add and subtract:

Code   Enc2   Enc1   Enc0
2      0      1      1
1      0      0      1
0      0      0      0
1̄      1      0      1
2̄      1      1      1

This solution is remarkably similar to the other two solutions. Notice that we have chosen an encoding for the Booth symbols that is similar to sign/magnitude. Thus, the sign bit is Enc2, which goes directly to select Add/Subtract. The next bit (Enc1) indicates whether or not we should multiply the multiplicand by 2. Finally, the last bit (Enc0) indicates “non-zero”, and is once again equivalent to “LO[0]”. If we had gone with complete sign-magnitude (also ok), we would “or” together Enc0/Enc1 to get our replacement for “LO[0]”. Except for the fact that this solution shifts only 16 times, the control state machine is identical to that used in the previous two problems.
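The role of each encoding bit can be made concrete with a small decoder. In the Python sketch below, `ENCODING` is transcribed from the table above and `decode` applies the three rules just described (Enc2 is the sign, Enc1 selects times two, Enc0 means non-zero); both names are illustrative.

```python
# Booth digit -> (Enc2, Enc1, Enc0), from the table above.
ENCODING = {
    2: (0, 1, 1),
    1: (0, 0, 1),
    0: (0, 0, 0),
    -1: (1, 0, 1),
    -2: (1, 1, 1),
}


def decode(enc2, enc1, enc0):
    """Recover the digit: Enc0 = non-zero, Enc1 = multiply by 2, Enc2 = subtract."""
    magnitude = enc0 * (2 if enc1 else 1)
    return -magnitude if enc2 else magnitude


if __name__ == "__main__":
    for digit, bits in ENCODING.items():
        print(digit, bits, decode(*bits))
```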

The key changes to the datapath are as follows:

1. We are now multiplying the multiplicand by 2, 1, 0, 1̄, or 2̄. To do this, we need to either shift by one or not. This is the mux just under the multiplicand register.

2. We have formatted our high and low registers so that we can shift pairs of bits at once. Notice that “ShiftAll” now causes two matched shift registers (one containing even bits, the other containing odd ones) to shift. This is like shifting by two in the previous solution.

3. Rather than dealing with a “hack” carry solution like the last one, we simply use a 34-bit adder (built with a 32-bit adder plus full adders if necessary). We sign-extend everything to 34 bits. We need 33 bits for normal operation, since a 32-bit value times 2 may actually be a 33-bit item. The 34th bit is used for exactly the same reason that we used the 33rd bit in the previous solution: to have the right thing to shift in.

Notice that a number of people tried to handle multiplication by two by first shifting by one, adding, then shifting by one again. While this will work, it destroys the advantage in speed gained by radix-4 Booth encoding (i.e. taking half the number of cycles to multiply).

Finally, to make this really fast, you would want to combine the logic for “LoadHI” with the logic for “ShiftAll”, so that you could do them at the same time. This would make the multiplier truly run in 16 cycles. We didn’t expect you to come up with this.
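The 16-step algorithm can be modeled in Python as a sketch. Two assumptions are made because the table from problem 5f is not included in this text: the digit for each bit pair is taken from the standard radix-4 Booth rule, digit = prev + low bit - 2 × high bit, and the registers are organized as a shift-right HI/LO pair. Python integers stand in for the 34-bit adder. All names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def as_signed32(value):
    """Interpret a 32-bit pattern as a two's complement integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def radix4_digit(pair, prev):
    """Assumed standard radix-4 Booth recoding of bits (b1, b0) with previous bit."""
    return prev + (pair & 1) - 2 * (pair >> 1)


def radix4_multiply(multiplicand, multiplier):
    """Signed 32x32 multiply, two multiplier bits per step (16 steps)."""
    m = as_signed32(multiplicand)
    hi = 0                              # HI register as a signed Python integer
    lo = multiplier & MASK32            # LO register starts as the multiplier
    prev = 0
    for _ in range(16):
        pair = lo & 0b11
        digit = radix4_digit(pair, prev)
        total = hi                      # conceptually a 34-bit value
        if digit != 0:                  # Enc0: non-zero digit
            operand = m << 1 if abs(digit) == 2 else m   # Enc1: mux picks 2x or 1x
            total = hi - operand if digit < 0 else hi + operand  # Enc2: subtract
        prev = pair >> 1                # high bit of the pair becomes the previous bit
        lo = ((total & 0b11) << 30) | (lo >> 2)   # two low sum bits shift into LO
        hi = total >> 2                            # arithmetic shift by two
    return (hi << 32) | lo


if __name__ == "__main__":
    print(radix4_multiply(3, -4), radix4_multiply(-7, 6))
```

Each step adds at most twice the multiplicand, which is why the text above asks for 34 bits rather than 33.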