










Computer System Organization: Exam Solutions
Professor Torrellas, University of Illinois at Urbana-Champaign, Spring 2006
Please clearly print your full name, NetID and circle the appropriate category in the space provided below. Failure to completely fill out this table will result in a ZERO grade.
Name: SOLUTIONS        NetID: __________
Category (circle one): 3 Credit Hours / 4 Credit Hours / UG / Grad (On-Campus) / Grad (I2CS)
Instructions
Problem    Maximum Points    Received Points
1          14
2          20
3          5
4          8
5          18
Total      65
Problem 1 [14 points]
Consider a single processor system with the following specification (as reflected in the solutions below): 32-bit virtual addresses, 24-bit physical addresses, 64 KB pages, and a physically addressed, 4-way set-associative 1 KB cache with 16-byte blocks.
Part A [4 points]
For each field listed below, indicate the bits of the virtual address that correspond to it. Show your work.
The virtual page offset:
The virtual page number:
The TLB index:
The TLB tag:
Part B [5 points]
For each field listed below, indicate the bits of the physical address that correspond to it. Show your work.
The physical page offset:
The physical page number:
The cache block offset:
The cache index:
The cache tag:
Solution:
We can think of a physical address as (page frame number, page offset)
The physical page offset: Same as the virtual page offset, which is 16 bits. The last 16 bits of the physical address (bits 15:0) correspond to this field.
The physical page number: The remaining bits of the physical address are 24 - 16 = 8 bits. The first 8 bits of the physical address (bits 23:16) correspond to this field.
We can also think of the physical address as (tag, index, block offset).
The cache block offset: The block size is 16 bytes = 2^4 bytes, so 4 bits are used for this field; the last 4 bits of the physical address (bits 3:0) correspond to it.
The cache index: The cache is 4-way set-associative and the cache size is 1 KB, so there are 1 KB / (4 * 16 B) = 2^4 sets. We need 4 bits to indicate the set, so bits 7:4 of the physical address correspond to this field.
The cache tag: 24 - 4 index bits - 4 block offset bits = 16 bits, so the first 16 bits of the physical address (bits 23:8) correspond to the tag.
Grading: 1 point for each field. No point is given if no work is shown in deriving the answer.
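For concreteness, the field boundaries derived above can be checked with a few shifts and masks. The following minimal C sketch assumes nothing beyond the bit widths just derived; the example address EF ABCD is taken from the translation table below.

#include <stdio.h>

int main(void)
{
    unsigned pa = 0xEFABCD;              /* a 24-bit physical address */
    unsigned block_offset = pa & 0xF;    /* bits 3:0   */
    unsigned index = (pa >> 4) & 0xF;    /* bits 7:4   */
    unsigned tag = pa >> 8;              /* bits 23:8  */
    unsigned page_offset = pa & 0xFFFF;  /* bits 15:0  */
    unsigned ppn = pa >> 16;             /* bits 23:16 */

    /* Prints: offset=D index=C tag=EFAB page_offset=ABCD ppn=EF */
    printf("offset=%X index=%X tag=%X page_offset=%X ppn=%X\n",
           block_offset, index, tag, page_offset, ppn);
    return 0;
}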
Solution:
Virtual Address   Corresponding       Part of Physical Address   TLB hit?   Cache hit?
                  Physical Address    that indexes the cache
FFFF ABCD         EF ABCD             C                          Yes        No
446C CEBA         1F CEBA             B                          Yes        No
48F8 ABCD         B2 ABCD             C                          No         No
446C CEAB         1F CEAB             A                          Yes        No
Grading: 0.25 point per entry. 1 point for all correct entries.
Problem 2 [20 points]
Consider the following program:
int i, j;
double result[4][100], a[101][4];

for (i = 0; i < 4; i++) {
    for (j = 0; j < 100; j++)
        result[i][j] = a[j][0] * a[j+1][0] + 0.5;
}
Arrays result and a contain 8 byte double precision floating point elements.
Assume the following (as reflected in the solution below): the data cache holds 100 blocks of 16 bytes each (two doubles per block), is fully associative, uses LRU replacement, and is initially empty.
Part A [4 points] Explain which loads to the L1 data cache result in misses for the above program. Give the total number of such misses and indicate which are capacity, conflict, and cold misses. Assume that the processor issues loads in the order in which they appear in the program.
Solution:
When j=0, a[0][0] and a[1][0] are accessed. Their misses bring in blocks with a[0][0] and a[0][1] (not used); and a[1][0] and a[1][1] (not used). When j=1, a[1][0] and a[2][0] are accessed. a[1][0] hits and a[2][0] misses.
In the loop where i=0, there will be 101 misses for a[0][0]…a[100][0].
Since the cache size is only large enough to store 100 blocks of a, when a[100][0] is accessed, the cache will bring in a[100][0] and a[100][1], and replace a[0][0] and a[0][1] (since it is LRU). When a[0][0] is accessed for i=1 and j=0, the cache will not find a[0][0] in the cache anymore, so we need to bring in a[0][0] and a[0][1], replacing a[1][0] and a[1][1]. When j=1, a[1][0] and a[2][0] are accessed, but a[1][0] is not in the cache anymore, so we need to bring a[1][0] and a[1][1], replacing a[2][0] and a[2][1] in the cache, and so on. In other words, there will be 101 misses for each i.
Given there are 4 iterations of i, 4*101 = 404 misses for array a. Total = 404 misses
The first 101 misses are cold misses while the rest are capacity misses.
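The count can be checked mechanically. Below is a minimal C sketch of a cache simulator for this access stream; the 100-block capacity, 16-byte blocks, and LRU policy come from the solution above, while full associativity and ignoring the stores to result are assumptions made here.

#include <stdio.h>

#define CAPACITY 100                     /* cache holds 100 blocks */

static long cache[CAPACITY];             /* block ids; index 0 = most recently used */
static int used = 0;
static long misses = 0;

static void access_block(long block)
{
    int i;
    for (i = 0; i < used; i++) {
        if (cache[i] == block) {         /* hit: move block to the MRU position */
            for (; i > 0; i--)
                cache[i] = cache[i - 1];
            cache[0] = block;
            return;
        }
    }
    misses++;                            /* miss: insert at MRU, dropping the LRU block */
    if (used < CAPACITY)
        used++;
    for (i = used - 1; i > 0; i--)
        cache[i] = cache[i - 1];
    cache[0] = block;
}

int main(void)
{
    /* a[j][0] and a[j][1] share one 16-byte block, so the load of
     * a[j][0] touches a distinct block id j (each 32-byte row of a
     * spans two blocks, and only the first is ever loaded). */
    int i, j;
    for (i = 0; i < 4; i++)
        for (j = 0; j < 100; j++) {
            access_block(j);             /* load a[j][0]   */
            access_block(j + 1);         /* load a[j+1][0] */
        }
    printf("misses = %ld\n", misses);    /* prints: misses = 404 */
    return 0;
}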
Part C [7 points]
Assume now we have a cache of infinite size. Repeat Part B. Explain the difference in the solutions for parts B and C.
Solution:
for (j = 0; j < 100; j++) {
    prefetch(a[j+8][0]);    /* a[j+1][0] for 7 iterations later */
    result[0][j] = a[j][0] * a[j+1][0] + 0.5;
}
for (i = 1; i < 4; i++) {
    for (j = 0; j < 100; j++) {
        result[i][j] = a[j][0] * a[j+1][0] + 0.5;
    }
}
In Part B, prefetches are required for the capacity misses in the loops with i=1, 2, 3. In Part C, no such prefetches are required since there are no capacity misses. The iteration for i=0 therefore needs to be peeled out from the rest of the loop.
Grading: 1 point for reasonable explanations of the difference in the solutions for Parts B and C. 1 point for recognizing only array a needs prefetching. 1 point for recognizing iteration i=0 needs to be separated from the rest of the loop. 1 point for correct placement and form of the prefetch instruction, even if the offset for the j index is wrong. 1 point for using the correct prefetching offset, a[j+8][0]. 2 points for code that minimizes any unnecessary prefetches. 1 point if the code does not minimize unnecessary prefetches.
Part D [4 points]
Now consider again the original code (i.e., without prefetches) with the original cache hierarchy (as in Part A). Besides prefetching, what other software-only technique can you use to avoid the capacity misses in the original code? Rewrite the original code to include this technique below.
Solution:
We can use loop interchange: with the i loop innermost, the four uses of each block of a occur back-to-back, so only the 101 cold misses remain and all capacity misses are avoided.
for (j = 0; j < 100; j++) {
    for (i = 0; i < 4; i++)
        result[i][j] = a[j][0] * a[j+1][0] + 0.5;
}
Grading: 1 point for recognizing that the accesses in array a don’t change for each iteration of i. 1 point for applying software-only technique to the code. 2 points for code that avoids capacity misses.
Problem 4 [8 points]
You are to implement a queue using an array in a multiprocessor system. The elements of the array can be accessed in parallel by multiple processors. You are to write two functions: enqueue(item), which inserts an item at the tail of the queue, and dequeue(), which removes and returns the item at the head of the queue.
Part A [4 points]
Write the enqueue and dequeue functions using an atomic test&set instruction to achieve synchronization. Don’t worry about using test&test&set, but otherwise, write the most efficient code possible.
Add C-like pseudo-code to this stub:
int head;    /* index for the head of the queue */
int tail;    /* index for the tail of the queue */
int index;   /* current array index for enqueuing or dequeuing */
/* Assume the queue is never full and always has at least one element */
enqueue(item) {
    queue[index] = item;
}

dequeue() {
    item = queue[index];
    return item;
}
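The stub above is shown unfilled. A minimal sketch of one way to complete it with test&set follows. test_and_set() is assumed to atomically set its argument to 1 and return the previous value; the lock variables head_lock and tail_lock, and the treatment of index and item as per-processor locals, are choices made here. Separate head and tail locks let enqueues and dequeues proceed in parallel, and claiming the slot inside the lock lets the array access itself happen outside it.

int head_lock = 0;   /* hypothetical lock protecting head */
int tail_lock = 0;   /* hypothetical lock protecting tail */

enqueue(item) {
    while (test_and_set(&tail_lock))   /* spin until the lock is free */
        ;
    index = tail;                      /* claim the slot at the tail */
    tail = tail + 1;
    tail_lock = 0;                     /* release the lock */
    queue[index] = item;               /* fill the claimed slot outside the lock */
}

dequeue() {
    while (test_and_set(&head_lock))
        ;
    index = head;                      /* claim the slot at the head */
    head = head + 1;
    head_lock = 0;
    item = queue[index];               /* read the claimed slot outside the lock */
    return item;
}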
Part B [4 points]
Repeat part (A) using the fetch&increment instruction instead of the test&set for synchronization.
int head;    /* index for the head of the queue */
int tail;    /* index for the tail of the queue */
int index;   /* current array index for enqueuing or dequeuing */
/* Assume the queue is never full and always has at least one element */
enqueue(item) {
    queue[index] = item;
}

dequeue() {
    item = queue[index];
    return item;
}
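A minimal sketch using fetch&increment follows. fetch_and_increment() is assumed to atomically increment its argument and return the old value; because each processor thereby claims a distinct slot, no explicit lock is needed at all.

enqueue(item) {
    index = fetch_and_increment(&tail);  /* atomically claim the next tail slot */
    queue[index] = item;
}

dequeue() {
    index = fetch_and_increment(&head);  /* atomically claim the next head slot */
    item = queue[index];
    return item;
}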
Problem 5 [18 points]
This question concerns a cache coherence protocol implemented in the DEC Firefly machine. This is a snooping update (as opposed to invalidate) cache coherence protocol and is implemented for a system where the processors are connected by a bus. In a snooping update protocol, when a cache modifies its data, it broadcasts the updated data on the bus using a bus update transaction, if necessary. All caches that have a copy of that data then update their own copies. This is in contrast to the invalidation protocol discussed in class where a cache invalidates its copy in response to another processor’s write request to a line. In the Firefly protocol, the copy in memory is also updated on a bus update.
With the Firefly protocol, a cache line can be in two possible states (other than invalid): CS (Clean Shared) and DE (Dirty Exclusive).
All caches are write-allocate. A write-back policy is used if the line is in DE state. For lines in CS state, a write-through policy is used.
The bus has a special line called the Shared Line (SL), whose state is usually 0. When cache i performs a bus transaction for a specific cache line, all the caches that have a copy of the same line pull up the Shared Line (SL) to 1.
If a request is made to a block for which memory knows it has a clean copy, then memory will service that request. Otherwise, the appropriate cache will service the request and memory will also get updated.
Consider the following bus transactions: Bus Read (BR) and Bus Update (BU).
Note: You are not required to consider Bus Writeback which may take place on a replacement.
Part A [12 points]
Fill out the following state transition table for a processor i performing a memory access. Show the next state for a block in the cache of processor i and any bus transaction performed by processor i. Each entry should be filled as:
Next State / Bus Transaction (e.g., CS / BR),
where next state = CS, DE, or NIC (Not In Cache, i.e., a cache miss)
and bus transaction = BR, BU, or NT (No Transaction).
For Bus Reads, indicate who will provide the copy of the requested line.
                          SL is 0 if processor i            SL is 1 if processor i
                          does a bus transaction            does a bus transaction
Current state
in processor i            Read by i      Write by i         Read by i      Write by i
DE
CS
NIC
Part B [6 points]
Fill out the following state transition table for the cache of processor i showing the next state for a block in the cache of processor i and the action taken by the cache when a bus transaction is initiated by another processor j. Each entry should be filled as:
Next State / Action (e.g., CS / UPDL),
where next state = CS, DE, or NIC (Not In Cache, i.e., a cache miss)
and action is one of:
PULLSL1 : Pull SL to 1
UPDL : Update block in cache i (i.e., one's own cache)
PROVL : Provide an updated block in response to a BR (and main memory is also updated as part of this action)
NA : No Action
Note: If an entry is not possible (i.e., the system cannot be in such a state), write “Not possible” in that entry.
State in processor i    Bus Read by processor j    Bus Update by processor j
DE
CS
NIC
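As an illustration of the snooping behavior described above, here is a minimal C sketch of how cache i might react to a bus transaction by processor j. The state expansions (CS = Clean Shared, DE = Dirty Exclusive) and the specific transitions are one plausible reading of the protocol description, offered as a sketch rather than as the answer key.

#include <stdio.h>

typedef enum { DE, CS, NIC } State;
typedef enum { BR, BU } BusTx;

/* Returns the action taken by cache i and writes the next state. */
static const char *snoop(State s, BusTx tx, State *next)
{
    if (s == DE && tx == BR) {
        /* i holds the only (dirty) copy: provide it on the bus (PROVL);
         * memory is updated as part of this action, so i's copy is now
         * clean and shared. */
        *next = CS;
        return "PULLSL1 + PROVL";
    }
    if (s == DE && tx == BU) {
        /* DE means i holds the only copy, so no other cache can be
         * writing through to this line. */
        *next = DE;
        return "Not possible";
    }
    if (s == CS && tx == BR) {
        /* The copy is clean, so memory services the read; i only
         * signals that the line is shared. */
        *next = CS;
        return "PULLSL1";
    }
    if (s == CS && tx == BU) {
        /* Update protocol: fold j's new data into i's own copy. */
        *next = CS;
        return "PULLSL1 + UPDL";
    }
    *next = NIC;                       /* line not cached by i */
    return "NA";
}

int main(void)
{
    const char *state_name[] = { "DE", "CS", "NIC" };
    const char *tx_name[] = { "BR", "BU" };
    int s, t;
    for (s = DE; s <= NIC; s++)
        for (t = BR; t <= BU; t++) {
            State next;
            const char *action = snoop((State)s, (BusTx)t, &next);
            printf("%-3s on %s by j -> %-3s / %s\n",
                   state_name[s], tx_name[t], state_name[next], action);
        }
    return 0;
}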