
5 Solved Problems in Final Exam on Computer System Organization | CS 433, Exams of Computer Architecture and Organization

Material Type: Exam; Professor: Torrellas; Class: Computer System Organization; Subject: Computer Science; University: University of Illinois - Urbana-Champaign; Term: Spring 2006;

CS 433 Final Exam – May 11, 2006
Professor Sarita Adve
Time: 7:00-10:00pm, 3 hours
Please clearly print your full name, NetID and circle the appropriate category in the space provided
below. Failure to completely fill out this table will result in a ZERO grade.
Name SOLUTIONS
NetID
Category (circle one) 3 Credit Hours 4 Credit Hours
UG Grad(On-Campus) Grad(I2CS)
Instructions
1. You may only use class handouts from this semester’s offering, the course text (Computer
Architecture: A Quantitative Approach - 3rd Edition - by Hennessy and Patterson), your own
homework submissions for this course, and notes written or typed by yourself. No other materials are
allowed, including other books, notes prepared by others, or materials from previous offerings of this
course or from other universities.
2. Calculators are allowed. You may not use any other electronic devices.
3. Please do not turn in your loose scrap paper. Limit your answers to the space provided, if possible. If
not, write on the back of the same sheet. You may use the back of each sheet for scratch work.
4. In all cases, show your work. No credit will be given for numeric answers if there is no indication of
how the answer was derived. Partial credit will be given even if your final solution is incorrect,
provided you show the intermediate steps in getting the final solution.
5. If you believe a problem is incorrectly or incompletely specified, make a reasonable assumption
and solve the problem. The assumption should not result in a trivial solution. In all cases,
clearly state any assumptions you make in your answers.
6. This exam has 5 problems and 13 pages (including this one). All students should attempt all
problems. Please budget your time appropriately. Good luck!
Problem   Maximum Points   Points Received
1         14
2         20
3         5
4         8
5         18
Total     65


Problem 1 [14 points]

Consider a single processor system with the following specification:

  • Data cache size is 1 KB (for data) and is 4-way set-associative. Its block size is 16 bytes. It is physically indexed and physically tagged. It uses LRU for replacement within a set and a write-allocate, write-back policy for writes.
  • 2-way set-associative TLB with 32 total entries and an LRU replacement policy.
  • Physical addresses of 24 bits.
  • Virtual addresses of 32 bits.
  • Byte addressable memory.
  • Page size is 64 KB.

Part A [4 points]

For each field listed below, indicate the bits of the virtual address that correspond to it. Show your work.

The virtual page offset:

The virtual page number:

The TLB index:

The TLB tag:

Part B [5 points]

For each field listed below, indicate the bits of the physical address that correspond to it. Show your work.

The physical page offset:

The physical page number:

The cache block offset:

The cache index:

The cache tag:

Solution:

We can think of a physical address as (page frame number, page offset)

The physical page offset: Same as the virtual page offset, which is 16 bits. The last 16 bits of the physical address (15:0) correspond to this field.

The physical page number: The remaining bits of physical address are 24 – 16 = 8 bits. The first 8 bits of the physical address (23:16) correspond to this field.

We can also think of the physical address as (tag, index, block offset).

The cache block offset: The block size is 16 bytes, so 16 = 2^4. 4 bits are used for this field, so the last 4 bits of the physical address (3:0) correspond to this field.

The cache index: The cache is 4-way set-associative and the cache size is 1 KB, so we have 1 KB / (4 * 16 B) = 2^4 sets. We need 4 bits for this field, so bits 7:4 of the physical address correspond to it.

The cache tag: 24 – 4 index bits – 4 block offset bits = 16, so the first 16 bits of the physical address correspond to tag.

Grading: 1 point for each field. No point is given if no work is shown in deriving the answer.
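All of these field widths, including the virtual-address fields asked for in Part A, follow from powers of two in the given specification. A quick arithmetic check (added here as a sketch, not part of the original solution):

```python
def log2(n):
    # n is a power of two; bits needed to index n items
    return n.bit_length() - 1

VA_BITS, PA_BITS = 32, 24
PAGE, BLOCK, CACHE, WAYS = 64 * 1024, 16, 1024, 4
TLB_ENTRIES, TLB_WAYS = 32, 2

page_offset = log2(PAGE)                      # 16: bits 15:0 of VA and PA
vpn         = VA_BITS - page_offset           # 16: bits 31:16 of the VA
ppn         = PA_BITS - page_offset           # 8:  bits 23:16 of the PA
tlb_index   = log2(TLB_ENTRIES // TLB_WAYS)   # 4:  bits 19:16 of the VA
tlb_tag     = vpn - tlb_index                 # 12: bits 31:20 of the VA
block_off   = log2(BLOCK)                     # 4:  bits 3:0 of the PA
cache_index = log2(CACHE // (WAYS * BLOCK))   # 4:  bits 7:4 of the PA
cache_tag   = PA_BITS - cache_index - block_off   # 16: bits 23:8 of the PA

print(page_offset, vpn, ppn, tlb_index, tlb_tag,
      block_off, cache_index, cache_tag)
```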

Solution:

Virtual Address   Corresponding      Part of Physical Address   TLB hit?   Cache hit?
                  Physical Address   that indexes cache
FFFF ABCD         EF ABCD            C                          Yes        No
446C CEBA         1F CEBA            B                          Yes        No
48F8 ABCD         B2 ABCD            C                          No         No
446C CEAB         1F CEAB            A                          Yes        No

Grading: 0.25 point per entry. 1 point for all correct entries.
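The "Part of Physical Address that indexes cache" column is simply bits 7:4 of the physical address, i.e., its second-lowest hex digit. A quick check of the table's entries (added here as a sanity check, not part of the original solution):

```python
def cache_index(pa):
    # Cache index = bits 7:4 of the physical address (see Part B)
    return (pa >> 4) & 0xF

# (physical address, expected index nibble) from the table above
for pa, idx in [(0xEFABCD, 0xC), (0x1FCEBA, 0xB),
                (0xB2ABCD, 0xC), (0x1FCEAB, 0xA)]:
    assert cache_index(pa) == idx
print("all four table entries check out")
```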

Problem 2 [20 points]

Consider the following program:

int i, j;
double result[4][100], a[101][4];

for (i = 0; i < 4; i++) {
    for (j = 0; j < 100; j++)
        result[i][j] = a[j][0] * a[j+1][0] + 0.5;
}

Arrays result and a contain 8-byte double-precision floating-point elements.

Assume the following:

  • The program is running on a machine with an L1 data cache.
  • The cache is fully associative with 100 blocks and an LRU replacement policy. The block size is 16 bytes. It is write-through and no write-allocate.
  • Assume that only the accesses to the array locations generate loads to the data cache. The rest of the variables are all allocated in registers.
  • The arrays are stored in row major form.
  • The arrays start at cache line boundaries.
  • Initially, the data cache is empty.

Part A [4 points] Explain which loads to the L1 data cache result in misses for the above program. Give the total number of such misses and indicate which are capacity, conflict, and cold misses. Assume that the processor issues loads in the order in which they appear in the program.

Solution:

When j=0, a[0][0] and a[1][0] are accessed. Their misses bring in blocks with a[0][0] and a[0][1] (not used); and a[1][0] and a[1][1] (not used). When j=1, a[1][0] and a[2][0] are accessed. a[1][0] hits and a[2][0] misses.

In the loop where i=0, there will be 101 misses for a[0][0]…a[100][0].

Since the cache size is only large enough to store 100 blocks of a, when a[100][0] is accessed, the cache will bring in a[100][0] and a[100][1], and replace a[0][0] and a[0][1] (since it is LRU). When a[0][0] is accessed for i=1 and j=0, the cache will not find a[0][0] in the cache anymore, so we need to bring in a[0][0] and a[0][1], replacing a[1][0] and a[1][1]. When j=1, a[1][0] and a[2][0] are accessed, but a[1][0] is not in the cache anymore, so we need to bring a[1][0] and a[1][1], replacing a[2][0] and a[2][1] in the cache, and so on. In other words, there will be 101 misses for each i.

Given there are 4 iterations of i, 4*101 = 404 misses for array a. Total = 404 misses

The first 101 misses are cold misses while the rest are capacity misses.
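The miss count can be checked mechanically. Below is a small simulation of the fully associative LRU cache (a sketch added here, not part of the original solution). Only the loads to array a are simulated, since the write-through, no-write-allocate stores to result never allocate blocks:

```python
from collections import OrderedDict

BLOCK = 16        # block size in bytes
NUM_BLOCKS = 100  # fully associative, 100 blocks, LRU

def simulate(accesses):
    cache = OrderedDict()   # block number -> None, kept in LRU order
    seen = set()
    misses = cold = 0
    for addr in accesses:
        b = addr // BLOCK
        if b in cache:
            cache.move_to_end(b)          # refresh LRU position on a hit
        else:
            misses += 1
            if b not in seen:             # first-ever touch: cold miss
                cold += 1
                seen.add(b)
            if len(cache) == NUM_BLOCKS:
                cache.popitem(last=False) # evict the LRU block
            cache[b] = None
    return misses, cold

# Loads generated by the nested loop: a[j][0] then a[j+1][0].
# a is [101][4] doubles (8 B each, row-major), so &a[j][0] = j*32 bytes
# from the array base.
accesses = []
for i in range(4):
    for j in range(100):
        accesses += [j * 32, (j + 1) * 32]

misses, cold = simulate(accesses)
print(misses, cold)   # 404 total misses, 101 of them cold
```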

Part C [7 points]

Assume now we have a cache of infinite size. Repeat Part B. Explain the difference in the solutions for parts B and C.

Solution:

for (j = 0; j < 100; j++) {
    prefetch(a[j+8][0]);   /* a[j+1][0] for 7 iterations later */
    result[0][j] = a[j][0] * a[j+1][0] + 0.5;
}

for (i = 1; i < 4; i++) {
    for (j = 0; j < 100; j++) {
        result[i][j] = a[j][0] * a[j+1][0] + 0.5;
    }
}

In Part B, prefetches are required for the capacity misses in the loops with i = 1, 2, 3. In Part C, no such prefetches are required since there are no capacity misses; the iteration for i = 0 therefore needs to be peeled out from the rest of the loop.

Grading: 1 point for a reasonable explanation of the difference between the solutions to Parts B and C. 1 point for recognizing that only array a needs prefetching. 1 point for recognizing that iteration i = 0 needs to be separated from the rest of the loop. 1 point for correct placement and form of the prefetch instruction, even if the offset for the j index is wrong. 1 point for using the correct prefetching offset, a[j+8][0]. 2 points for code that minimizes unnecessary prefetches; 1 point if the code does not minimize them.
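With an infinite cache no block is ever evicted, so only cold misses remain, and all of them occur during the i = 0 pass; that is why only the peeled first loop needs prefetches. A quick count (a sketch, not part of the official solution):

```python
resident = set()              # infinite cache: blocks are never evicted
misses_per_i = [0, 0, 0, 0]
for i in range(4):
    for j in range(100):
        # loads of a[j][0] and a[j+1][0]; rows of a are 32 bytes apart
        for addr in (j * 32, (j + 1) * 32):
            block = addr // 16          # 16-byte blocks
            if block not in resident:   # cold miss on first touch only
                resident.add(block)
                misses_per_i[i] += 1
print(misses_per_i)   # [101, 0, 0, 0]
```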

Part D [4 points]

Now consider again the original code (i.e., without prefetches) with the original cache hierarchy (as in Part A). Besides prefetching, what other software-only technique can you use to avoid the capacity misses in the original code? Rewrite the original code to include this technique below.

Solution:

We can use loop interchange.

for (j = 0; j < 100; j++) {
    for (i = 0; i < 4; i++)
        result[i][j] = a[j][0] * a[j+1][0] + 0.5;
}

Grading: 1 point for recognizing that the accesses in array a don’t change for each iteration of i. 1 point for applying software-only technique to the code. 2 points for code that avoids capacity misses.
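As a check (not part of the original solution), a small simulation of the fully associative 100-block LRU cache contrasts the original loop order with the interchanged one; the interchange leaves only the 101 cold misses:

```python
from collections import OrderedDict

def misses(accesses, num_blocks=100, block=16):
    cache = OrderedDict()   # fully associative, kept in LRU order
    count = 0
    for addr in accesses:
        b = addr // block
        if b in cache:
            cache.move_to_end(b)           # refresh on hit
        else:
            count += 1
            if len(cache) == num_blocks:
                cache.popitem(last=False)  # evict LRU block
            cache[b] = None
    return count

# Loads of a[j][0] and a[j+1][0]; &a[j][0] = j*32 bytes from the base.
# Original order: i outer, j inner.
orig = [addr for i in range(4) for j in range(100)
        for addr in (j * 32, (j + 1) * 32)]
# Interchanged order: j outer, i inner.
swap = [addr for j in range(100) for i in range(4)
        for addr in (j * 32, (j + 1) * 32)]

print(misses(orig), misses(swap))   # 404 vs 101
```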

Problem 4 [8 points]

You are to implement a queue using an array in a multiprocessor system. The elements of the array can be accessed in parallel by multiple processors. You are to write two functions:

  • enqueue, which will add an element to the tail of the queue, and
  • dequeue, which will remove an element from the head of the queue. Assume the queue always has at least one element, i.e., a dequeue is never called on an empty queue. Also assume that the queue never gets full, i.e., the array is infinitely long and you don't have to worry about calling an enqueue on a full queue.

Part A [4 points]

Write the enqueue and dequeue functions using an atomic test&set instruction to achieve synchronization. Don’t worry about using test&test&set, but otherwise, write the most efficient code possible.

Add C-like pseudo-code to this stub:

int head;   /* index for the head of the queue */
int tail;   /* index for the tail of the queue */
int index;  /* current array index for enqueuing or dequeuing */

/* Assume the queue is never full and always has at least one element */

enqueue(item) {

    queue[index] = item;

}

dequeue() {

    item = queue[index];

    return item;
}
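The graded answer for this part is not included in this preview. One plausible shape, shown here only as an illustrative sketch: hold the lock just long enough to claim an index, then do the (slow) array access outside the critical section. Modeled in Python with a mock test&set rather than the C-like pseudo-code the exam expects:

```python
class MockLock:
    """Sequential mock of an atomic test&set flag; illustrates the
    locking pattern, not real parallelism."""
    def __init__(self):
        self.flag = 0
    def test_and_set(self):
        old, self.flag = self.flag, 1   # atomically read old value, set to 1
        return old
    def release(self):
        self.flag = 0

lock = MockLock()
queue, head, tail = {}, 0, 0

def enqueue(item):
    global tail
    while lock.test_and_set():     # spin until the lock is acquired
        pass
    index, tail = tail, tail + 1   # claim the next tail slot under the lock
    lock.release()                 # release before the array store
    queue[index] = item

def dequeue():
    global head
    while lock.test_and_set():
        pass
    index, head = head, head + 1   # claim the next head slot under the lock
    lock.release()
    return queue.pop(index)

for x in (1, 2, 3):
    enqueue(x)
print(dequeue(), dequeue(), dequeue())   # 1 2 3 (FIFO order)
```

Note that under real concurrency a dequeuer could claim a slot before the matching enqueuer has stored into it; the problem's assumption that the queue is never empty sidesteps this.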

Part B [4 points]

Repeat part (A) using the fetch&increment instruction instead of the test&set for synchronization.

int head;   /* index for the head of the queue */
int tail;   /* index for the tail of the queue */
int index;  /* current array index for enqueuing or dequeuing */

/* Assume the queue is never full and always has at least one element */

enqueue(item) {

    queue[index] = item;

}

dequeue() {

    item = queue[index];

    return item;
}
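Again, the official solution is not in this preview. A hedged sketch of the usual fetch&increment answer: no lock is needed at all, because each caller atomically claims a distinct slot. Modeled in Python with a mock fetch&increment register:

```python
class Counter:
    """Sequential mock of an atomic fetch&increment register."""
    def __init__(self):
        self.value = 0
    def fetch_and_increment(self):
        old = self.value    # atomically return the old value and bump by one
        self.value += 1
        return old

head, tail = Counter(), Counter()
queue = {}

def enqueue(item):
    index = tail.fetch_and_increment()   # claim the next tail slot
    queue[index] = item

def dequeue():
    index = head.fetch_and_increment()   # claim the next head slot
    return queue.pop(index)

for x in ("a", "b", "c"):
    enqueue(x)
print(dequeue(), dequeue(), dequeue())   # a b c (FIFO order)
```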

Problem 5 [18 points]

This question concerns a cache coherence protocol implemented in the DEC Firefly machine. This is a snooping update (as opposed to invalidate) cache coherence protocol and is implemented for a system where the processors are connected by a bus. In a snooping update protocol, when a cache modifies its data, it broadcasts the updated data on the bus using a bus update transaction, if necessary. All caches that have a copy of that data then update their own copies. This is in contrast to the invalidation protocol discussed in class where a cache invalidates its copy in response to another processor’s write request to a line. In the Firefly protocol, the copy in memory is also updated on a bus update.

With the Firefly protocol, a cache line can be in two possible states (other than invalid):

  • Dirty Exclusive (DE): The line is present only (exclusively) in this cache. The data in the cache is possibly updated, or dirty, i.e., it may be a more recent version than the copy in memory.
  • Clean Shared (CS): The line may be present in other caches (shared) too and memory and all those caches have the same (clean) copy.

All caches are write-allocate. A write-back policy is used if the line is in DE state. For lines in CS state, a write-through policy is used.

The bus has a special line called the Shared Line (SL), whose state is usually 0. When cache i performs a bus transaction for a specific cache line, all the caches that have the same line pull the Shared Line (SL) up to 1. If no other cache has the line, the Shared Line (SL) remains at 0. Cache i uses the state of the Shared Line (SL) when it performs a bus transaction to determine whether to change to an exclusive state or the shared state.

If a request is made to a block for which memory knows it has a clean copy, then memory will service that request. Otherwise, the appropriate cache will service the request and memory will also get updated.

Consider the following bus transactions:

  • BR: Bus Read – Place a read request on the bus.
  • BU: Bus Update – Update copies in memory and other caches with the same cache block.

Note: You are not required to consider Bus Writeback which may take place on a replacement.

Part A [12 points]

Fill out the following state transition table for a processor i performing a memory access. Show the next state for a block in the cache of processor i and any bus transaction performed by processor i. Each entry should be filled as:

Next State / Bus Transaction (e.g., CS / BR)
where Next State = CS, DE, or NIC (Not In Cache, i.e., a cache miss)
and Bus Transaction = BR, BU, or NT (No Transaction).
For Bus Reads, indicate who will provide the copy of the requested line.

                       SL is 0 when processor i        SL is 1 when processor i
                       does a bus transaction          does a bus transaction
Current state in       Read by       Write by          Read by       Write by
processor i            processor i   processor i       processor i   processor i

DE

CS

NIC
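The filled-in table is not part of this preview. The following is one plausible completion derived from the protocol rules stated above, offered as an illustrative sketch only, not the answer key. For Bus Reads, memory provides the copy when SL is 0 or when every holder is in CS (memory is clean); a DE holder provides the line, with memory also updated, otherwise.

```python
# One plausible filling of Part A's table, encoded as a function:
#   (current_state, op, sl_seen_on_bus) -> (next_state, bus_transactions)
# Hedged: derived from the problem's rules, not the official key.
# Hits generate no bus transaction, so SL is irrelevant for them ("NT").
def firefly_cpu_side(state, op, sl):
    if state == "DE":
        return ("DE", "NT")           # read or write hit; write-back, no traffic
    if state == "CS":
        if op == "read":
            return ("CS", "NT")       # read hit
        # Write hit: write-through -> broadcast BU; exclusive if no sharers.
        return ("DE", "BU") if sl == 0 else ("CS", "BU")
    # state == "NIC": a miss issues a BR (write-allocate covers write misses).
    if op == "read":
        return ("DE", "BR") if sl == 0 else ("CS", "BR")
    # Write miss: exclusive if alone; shared lines are write-through, so a BU
    # must follow the BR when other caches hold the line.
    return ("DE", "BR") if sl == 0 else ("CS", "BR, BU")

for key in [("DE", "read", 0), ("CS", "write", 1), ("NIC", "write", 1)]:
    print(key, "->", firefly_cpu_side(*key))
```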

Part B [6 points]

Fill out the following state transition table for the cache of processor i showing the next state for a block in the cache of processor i and the action taken by the cache when a bus transaction is initiated by another processor j. Each entry should be filled as:

Next State / Action (e.g., CS / UPDL)
where Next State = CS, DE, or NIC (Not In Cache, i.e., a cache miss)
and Action = PULLSL1 : Pull SL to 1
             UPDL    : Update block in cache i (i.e., one's own cache)
             PROVL   : Provide an updated block in response to a BR (and main
                       memory is also updated as part of this action)
             NA      : No Action

Note: If an entry is not possible (i.e., the system cannot be in such a state), write “Not possible” in that entry.

State in processor i   Bus Read by processor j   Bus Update by processor j

DE

CS

NIC