Stanford University
Lecture #11: Monday, May 14th, 2001
Lecturer: Professor Bill Dally
Scribes: David Wang and Milind Kopikare
Reviewer: Kelly Shaw
The two core design issues discussed in lecture were buffers and switches. The tradeoffs involved in the choice of implementation for each have a significant effect on router latency, throughput, and cost.
1 Buffers

Buffers are required to store data elements (flits or packets) when a needed resource is not available. Common scenarios include:
- Blocked output port: If the packet's desired output port is blocked, the packet needs to be buffered.
- Output port contention: If multiple incoming packets from different ports wish to exit through the same output port, the losers of the allocation need to be buffered.
- Routing contention: If a structural hazard on the routing unit occurs between multiple arriving data elements, the losers need to be buffered.
Buffering can be organized in either a centralized or a distributed fashion.
1.1 Centralized Buffering

In the case of centralized buffering, a single buffer services all the inputs and outputs.
- Advantages: The central buffer serves as both a buffer and a switch, and its capacity is shared across all the input ports.
- Disadvantages: It is very hard to build a single shared memory with the bandwidth needed to support all ports at once. When the granularity of the data elements becomes very small, it is difficult to support the large number of reads and writes required for high throughput.
1.2 Distributed Buffering

Distributed buffering dedicates a separate buffer to each channel. It can be analyzed from the perspective of buffer placement (switch vs. inputs) and buffer allocation (static vs. dynamic).
1.2.1 Buffer Placement
The two possible locations to distribute the buffers are at the inputs or within the switches themselves as shown in figure 1.
Figure 1: Buffers within the Crosspoints of a Switch
Switch Placement
Given the crossbar in figure 1, buffers are placed at each cross-point. For example, if in1 is routed to out1, in1 stores its data elements in the buffer located at the cross-point between in1 and out1. This structure is equivalent to an output-queued switch, which can connect any input to any output without interference. This non-interference comes at a high cost in buffer count: as the number of inputs and outputs grows, the number of cross-point buffers grows as O(N^2).
Input Placement
A more cost-effective solution than distributing the buffers within the switch is to place the buffers for each channel at the input. This implementation choice results in potential blocking at the inputs.
1.2.2 Buffer Allocation
The allocation of the buffers may be either static or dynamic. In a statically allocated buffer, the location and quantity of memory dedicated to each channel are fixed, while in a dynamically allocated implementation, the locations and quantity of memory dedicated to each channel may vary.
Figure 2: Statically Allocated Buffers

Statically Allocated Buffers
An example of a RAM with four statically allocated first-in first-out (FIFO) buffers is illustrated in figure 2. Each buffer occupies 1/4 of the memory and has a head and a tail pointer. The head points to the next element to leave the buffer, and the tail points to where the next element should enter the buffer. The head and tail wrap around when they reach the boundary of the buffer; this is known as a circular queue. When the head catches up to the tail, the buffer is empty, and when the tail catches up to the head, the buffer is full (an extra bit or an occupancy count is needed to tell the two cases apart, since head equals tail in both). This works particularly well if the length of each buffer is a power of two, since the RAM address is formed by concatenating the buffer's index (the MSBs) with the head or tail pointer (the LSBs).
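The circular-queue bookkeeping is easy to express in software. Below is a minimal Python sketch, not from the lecture, of a RAM statically partitioned into four power-of-two-length FIFOs; all names and sizes are illustrative assumptions. An explicit per-queue count distinguishes full from empty.

```python
# Minimal sketch of a RAM statically partitioned into circular FIFOs.
# Illustrative only: names and sizes are assumptions, not from the lecture.

class StaticFIFOs:
    def __init__(self, num_queues=4, queue_len=4):
        assert queue_len & (queue_len - 1) == 0, "power-of-two length"
        self.ram = [None] * (num_queues * queue_len)
        self.qlen = queue_len
        self.head = [0] * num_queues   # next element to leave each queue
        self.tail = [0] * num_queues   # where the next element enters
        self.count = [0] * num_queues  # disambiguates full vs. empty

    def _addr(self, q, ptr):
        # MSBs select the queue's quarter of the RAM; LSBs are the pointer.
        return q * self.qlen + ptr

    def push(self, q, flit):
        if self.count[q] == self.qlen:          # tail caught up to head
            raise BufferError("queue %d full" % q)
        self.ram[self._addr(q, self.tail[q])] = flit
        self.tail[q] = (self.tail[q] + 1) % self.qlen  # wrap around
        self.count[q] += 1

    def pop(self, q):
        if self.count[q] == 0:                  # head caught up to tail
            raise BufferError("queue %d empty" % q)
        flit = self.ram[self._addr(q, self.head[q])]
        self.head[q] = (self.head[q] + 1) % self.qlen
        self.count[q] -= 1
        return flit
```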
2 Switches

2.1 Bus Switch

A bus structure is one of the simplest ways to realize a switch. In figure 7, a bus with N inputs, each with bandwidth b, is shown. To operate at full rate, de-serializing buffers are placed at the inputs, with tri-state buffers driving the bus. Each de-serializing buffer drives data onto the bus, of bandwidth Nb, 1/N of the time. On the receiving side, N serializing buffers read in bursts of Nb and stream them to the outputs at bandwidth b.
Advantage: Simple allocation. During the cycle that an output owns the bus, it can reach back and read any input; the desired allocation is always achievable. In other words, an output can read an input even if every other output wants to access the same input. This is not possible with a simple crossbar.
Disadvantages: The key disadvantage of a bus implementation is granularity. Imagine the case where the switch allocator assigns consecutive flits to different output ports. If switch allocation occurs at the flit level, the bus width cannot exceed the number of bits in a flit. Thus the following relationship must hold:

W_bus <= L_flit

where W_bus is the bus width in bits and L_flit is the flit length in bits. Since the bus must still carry an aggregate bandwidth of Nb, its operating frequency must then be at least Nb / L_flit.
Another disadvantage is the latency that the de-serialization and serialization buffers introduce.
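As a rough illustration of the bus discipline described above, here is a toy Python model, under assumed names, in which outputs take turns owning the bus and the owner may read a full burst from any input's de-serializing buffer.

```python
# Toy model of the bus switch: N ports of bandwidth b share a bus of
# bandwidth N*b. All names here are illustrative assumptions.

N = 4  # number of input and output ports

def bus_cycle(cycle, deser_bufs, choose_input):
    """One bus cycle. Output (cycle % N) owns the bus and reads one burst
    of N words from whichever input it chooses; no allocation is needed,
    even if every other output wants the same input on later cycles."""
    owner = cycle % N
    src = choose_input(owner, deser_bufs)       # the output's own policy
    burst = deser_bufs[src].pop(0) if deser_bufs[src] else None
    # The owner's serializing buffer then streams the burst out at rate b.
    return owner, burst

# Example: every output always chooses to read from input 0.
bufs = [[['w0', 'w1', 'w2', 'w3']], [], [], []]
print(bus_cycle(0, bufs, lambda out, b: 0))  # (0, ['w0', 'w1', 'w2', 'w3'])
```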
2.2 Crossbar Switch
Unlike the bus approach, a crossbar switch requires allocation, since the outputs cannot independently choose which inputs to connect to. Figure 5.1 exhibits a scenario in which input ports can be blocked. The inputs have the following output mapping:
Input Port Number    Desired Output Port Connection
1                    1 & 2
2                    2 & 3
3                    3 & 4
4                    4 & 4

Figure 5.1: Table illustrating blocking for a crossbar
Thus, a poor allocation may assign output ports to the input ports as shown in figure 5.2, resulting in input ports in2 and in4 being completely blocked. (This occurs if the allocator assigns 1→1, 1→2, 3→3, 3→4.)
Figure 5.2: Illustration of a possible poor allocation from figure 5.1
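The example can be checked concretely. The short Python snippet below, with assumed names, contrasts the poor allocation of figure 5.2 with a maximum matching; the table's "4 & 4" entry is read here as a single request for output 4.

```python
# Requests from figure 5.1: requests[i] is the set of outputs input i wants.
requests = {1: {1, 2}, 2: {2, 3}, 3: {3, 4}, 4: {4}}

# Poor allocation (figure 5.2): granting both of input 1's and both of
# input 3's requests leaves inputs 2 and 4 completely blocked.
poor = {1: {1, 2}, 3: {3, 4}}

# A maximum matching instead serves every input with one output each.
good = {1: 1, 2: 2, 3: 3, 4: 4}
assert all(out in requests[i] for i, out in good.items())
```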
2.3 Taking the Middle Road

A continuum exists between the two extremes of the simple crossbar and the bus. In figure 6, each input has two connection points into the crossbar, allowing a tradeoff between how much logic is put into the switch and how much into the switch allocator. For example, instead of having only one path from an input to an output, figure 6 illustrates how, with two input points into the crossbar, outputs 1 and 2 may read from the same input 1. Typically there is a nice middle point that resolves 99% of the simple allocations without the granularity problems a bus switch experiences. In addition, observe that increasing the number of input points into the crossbar makes it easier to achieve a maximum assignment.
Figure 6: Easing the Allocation task by adding additional input ports
3 Arbitration

One can think of arbitration as a box where up to N inputs send requests and a single resource sits at the other end. The problem of arbitration is deciding which requestor gets the resource. In the switching module of a router, each of the N inputs sends a request to every output for which it has a packet. Each output arbitrates among the requests it receives and sends one grant. The grants are one-hot, and grants are given only to inputs that sent a request.
One metric to measure the performance of an arbiter is its Fairness. There are various notions of Fairness. Some of them are:
- Weak Fairness: No input that has packets to send will be starved of bandwidth. Intuitively, if an input is patient enough, it will eventually be served. This is not always sufficient, especially in delay-sensitive applications where packets must be delivered within a certain delay bound.
- Strong Fairness: There is a rate bound on the grants. This notion of fairness is stricter in the sense that it strives to ensure that all flows get the same amount of bandwidth at all times.
- Weighted Fairness: Even strong fairness is not always desirable. Some packets may need to be sent immediately while others can afford to wait longer. This gives rise to weighted fairness: each flow of packets has a weight associated with it, and the rate of service to a flow is in proportion to its weight.
- FCFS Fairness: FCFS stands for First Come First Served. If packet A arrived before packet B, FCFS fairness demands that A be served before B. If A never gets
like an iterative circuit. The carry-in to input j is 1 if no input i < j has requested the output in that round. If input j also does not request the resource, the carry propagates out to the next input. Notice that with this setup, the carry-out to the next input does not depend on the grant given to the previous input, which helps avoid delay.
Figure 8: A Block Diagram of the Priority Encoder
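A behavioral Python sketch of this carry chain follows; it is illustrative, not from the lecture, and assumes input 0 has the highest priority. Note that the carry-out is computed from the request alone, not from the grant, matching the delay argument above.

```python
# Minimal sketch of the carry-chain (fixed-priority) arbiter. Input 0 has
# the highest priority; grants are one-hot. Names are assumptions.

def fixed_priority_arbiter(request):
    """request: list of bools. carry into input j is 1 iff no input < j
    has requested; input j is granted iff it requests and carry is in."""
    grants = []
    carry = True                      # nothing above input 0: carry-in = 1
    for req in request:
        grants.append(req and carry)  # grant = request AND carry-in
        carry = carry and not req     # carry passes over non-requestors;
                                      # note it never depends on the grant
    return grants

# Example: inputs 1 and 3 request; the highest-priority requestor (1) wins.
assert fixed_priority_arbiter([False, True, False, True]) == \
       [False, True, False, False]
```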
3.4 Round Robin Arbiter (RRA):
It is now easy to convert a fixed-priority arbiter into a weakly fair one; the result is called the Round Robin Arbiter. Here we wrap the linear chain of input requests around, and a priority mechanism, shown in figure 9, injects the carry. This priority has to be one-hot; it can be thought of as a token that rotates among all the inputs. When requestor j holds the priority token and has a packet to send, j gets the grant and all other requestors see a 0 on their carry-in. After the arbitration, requestor (j + 1) receives the token, and the cycle continues. The RRA is weakly fair because a requesting input is served at least once every N arbitrations.
Figure 9: A Logic Diagram of the Round Robin Arbiter.
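A behavioral Python sketch of the RRA, with assumed names, follows; it walks the wrapped carry chain starting from the token position and rotates the token past the winner.

```python
# Behavioral sketch of the round-robin arbiter: a fixed-priority chain
# wrapped around, with a one-hot token choosing where priority starts.

def round_robin_arbiter(request, token):
    """request: list of bools; token: index currently holding priority.
    Returns (one-hot grants, new token position)."""
    n = len(request)
    for offset in range(n):            # walk the wrapped chain from token
        j = (token + offset) % n
        if request[j]:
            grants = [False] * n
            grants[j] = True
            return grants, (j + 1) % n  # token moves past the winner
    return [False] * n, token           # no requests: token stays put

# Example: token at 2, inputs 0 and 3 requesting; input 3 wins.
grants, token = round_robin_arbiter([True, False, False, True], 2)
assert grants == [False, False, False, True] and token == 0
```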
3.5 Weighted RRA:

An extension of the RRA is a weighted version. If we want to serve Input 1's packets twice as often as Input 2's, we can do it in one of two ways (a sketch of the second scheme follows this list):

- Place two requests for Input 1 in the request chain. In that case, the second request goes somewhere after the first rather than immediately below it, to ensure a good distribution of fairness among all inputs.
- Keep a counter for each input, representing the number of grants the input should receive over a given period of time. The higher the weight, the higher the initial count: Input 1 could start with a counter value of 256 and Input 2 with 128. Each time a queue is served, its counter is decremented. When a counter reaches zero, its input can no longer be served until all other inputs have exhausted their counters as well. Clearly this scheme ensures that Input 1 is served twice as often as Input 2 over a long enough window. Note, however, that this scheme ensures weighted fairness only over that time constant; it cannot ensure weighted fairness over a smaller time window.
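Here is a short Python sketch of the counter-based scheme, reusing round_robin_arbiter from the section 3.4 sketch; the counter values and names are illustrative assumptions.

```python
# Sketch of the counter-based weighted scheme. Inputs whose counters are
# spent are masked out until everyone's counter is exhausted.

INITIAL_COUNTS = [256, 128]   # Input 0 weighted twice as heavily as Input 1

def weighted_arbiter_step(request, counters, token):
    if all(c == 0 for c in counters):        # everyone exhausted: reload
        counters[:] = INITIAL_COUNTS
    # Inputs with spent counters cannot be served this round.
    eligible = [r and c > 0 for r, c in zip(request, counters)]
    grants, token = round_robin_arbiter(eligible, token)
    for j, g in enumerate(grants):
        if g:
            counters[j] -= 1                 # one grant spends one credit
    return grants, token
```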
3.6 Matrix Arbiter:
The Matrix Arbiter implements a least-recently-served priority. It keeps a complete order among the requestors, maintained in matrix form where each row and each column corresponds to a requestor; here there are four requestors 0, 1, 2, 3. A 1 in row a, column b means requestor a beats requestor b in acquiring the resource. The implementation is extremely simple, with each matrix element stored in a flip-flop. The Matrix Arbiter's arbitration logic can be better understood from figure 10 below.
Figure 10: Initial Priority Sequence of the Matrix Arbiter.
In this case we only need 6 flip-flops, since only the elements above the diagonal are independent (the entries at (a, b) and (b, a) are complements of each other). The initial relative priority among the inputs contending for the single resource is {0, 1, 2, 3}. Now let's say input 1 is served. The new matrix will look like figure 11 below:
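The update rule can be made concrete with a small Python sketch, with assumed names: the winner is the requestor that beats every other requestor, and after winning it drops to the lowest priority.

```python
# Behavioral sketch of the matrix arbiter for four requestors. W[a][b] = 1
# means a beats b; only the 6 elements above the diagonal need flip-flops,
# since W[b][a] is the complement of W[a][b].

N = 4
# Initial total order {0,1,2,3}: requestor a beats every b > a.
W = [[1 if b > a else 0 for b in range(N)] for a in range(N)]

def matrix_arbiter(request):
    """Grant the requestor that beats all other requestors, then move it
    to the back of the order (least recently served rises to the top)."""
    for j in range(N):
        if request[j] and all(not (request[i] and W[i][j])
                              for i in range(N) if i != j):
            for i in range(N):          # winner now loses to everyone
                if i != j:
                    W[j][i], W[i][j] = 0, 1
            return j
    return None                          # no requests this round

# Requestors 1 and 2 request; 1 beats 2 initially, so 1 wins and then
# drops to lowest priority.
assert matrix_arbiter([0, 1, 1, 0]) == 1
```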
Figure 13: A FCFS (Queuing) Arbiter
3.8 Priority (Weighted) Fairness:

In addition to the two ways of ensuring weighted fairness discussed earlier, another obvious and probably simpler mechanism is to have different classes of priority. Each priority class gets its own arbiter, and the arbiters are connected in a sequence. If the High Priority (HP) arbiter grants a request, the Low Priority (LP) arbiters are inhibited; otherwise, the LP arbiters may grant a request. Thus if we are using RRA as the arbitration scheme, we simply have a High Priority RRA, and only if none of the inputs belonging to the HP RRA request a grant does control shift to the LP RRA. Note, however, that this scheme could permanently starve the inputs in the low-priority stage. In other words, the LP inputs get the scraps after the HP inputs are done with the main course. One way to avoid this is to ensure that every once in a while, the LP inputs get to steal a cycle from the HP inputs.
Figure 14: A sequence of HP and LP arbiters.
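The chaining is a few lines of Python, again reusing round_robin_arbiter from the section 3.4 sketch; all names are illustrative.

```python
# Sketch of chained priority classes: a grant from the high-priority
# arbiter inhibits the entire low-priority stage for that cycle.

def two_class_arbiter(hp_request, lp_request, hp_token, lp_token):
    hp_grants, hp_token = round_robin_arbiter(hp_request, hp_token)
    if any(hp_grants):
        # HP winner found: LP stage is inhibited. (To avoid starving the
        # LP inputs, one could periodically skip this inhibit and let the
        # LP stage steal a cycle.)
        return hp_grants, [False] * len(lp_request), hp_token, lp_token
    lp_grants, lp_token = round_robin_arbiter(lp_request, lp_token)
    return [False] * len(hp_request), lp_grants, hp_token, lp_token
```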
4 Allocators
An allocator is a box that takes bids from N inputs and connects at most one input to each of the N outputs, as depicted below in figure 15. The general problem of allocation is an instance of the bipartite matching problem, illustrated in figure 16: given a set of inputs and outputs, determine how to match the inputs to the outputs. One can view the problem as a matrix whose rows are the inputs and whose columns are the outputs; a mark in entry (i, j) means that input i has a request for output j. Output j can choose to grant only one of the inputs in its column, and among all the grants that arrive at an input, the input can accept only one. The matrix built from our request example is shown in figure 17, with at most one entry circled (matched) in each row and each column.
Figure 15: An Allocator
Figure 16: A Bipartite Matching of inputs to the outputs.
Figure 17: A Matrix of a possible allocation strategy.
Figure 20
Getting a perfect matching is very time consuming: it requires tricks like backtracking even after a match has been made, which is complicated. A compromise between complexity and performance is the class of algorithms called separable allocation.
4.1 Separable Allocation
This involves separating the input stage and the output stage of allocation. Suppose the output stage decides first: each output looks at all the input requests that arrive at it and picks a winner based on some policy (a winner is chosen in each column). These grants are then carried over to the input stage, and it is now each input's turn to decide which output's grant to accept (a winner is chosen in each row). Note that an input could potentially get a grant from more than one output.
It is possible that some input-output pairs remain unmatched at the end of this process. In that case, we simply repeat the whole process. It has been proven that, in the worst case, at most N such iterations are needed.
4.2 PIM (parallel iterative matching)
PIM is one such separable allocator; output arbitration is done first, followed by input arbitration. Each output (input) picks a winner among the possible requests (grants) at random. Multiple iterations of this process increase the number of matches. Figure 21 below shows how PIM improves with each iteration: in the first iteration, only Input 1 and Input 2 get matched; in the second iteration, Input 3 and Input 4 get matched as well.
Figure 21: Illustration of Iterative Steps in PIM
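A compact Python sketch of PIM follows; the matrix representation R and all names are illustrative assumptions. Each pass performs one random grant stage and one random accept stage over the still-unmatched ports.

```python
import random

# Sketch of PIM over a request matrix R (R[i][j] = 1 means input i
# requests output j), with random tie-breaking at both stages. Repeating
# until nothing new matches takes at most N passes in the worst case.

def pim(R, iterations=2):
    n = len(R)
    match = {}                                   # input -> output
    for _ in range(iterations):
        free_in = set(range(n)) - set(match)
        free_out = set(range(n)) - set(match.values())
        # Grant stage: each free output randomly grants one requesting
        # free input; grants[i] collects the outputs that granted input i.
        grants = {}
        for j in free_out:
            reqs = [i for i in free_in if R[i][j]]
            if reqs:
                grants.setdefault(random.choice(reqs), []).append(j)
        # Accept stage: each input randomly accepts one of its grants.
        for i, outs in grants.items():
            match[i] = random.choice(outs)
    return match

# Example: two inputs both want output 0; one match forms in iteration 1,
# and iteration 2 can match the loser to output 1 if it also requests it.
print(pim([[1, 1], [1, 0]]))
```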
4.3 SLIP

SLIP is an improvement over the PIM allocator in the sense that the outputs and inputs do not pick winners randomly, but rather in a deterministic round-robin fashion. Additionally, these round-robin arbiters advance their priority one step in the circular chain only if the winner they picked in that cycle actually got matched. SLIP performs very well compared to PIM when the request matrix is fairly full, and we get fairly high throughput on the very first iteration.
However, even SLIP can end up making a bad match and hence yield low throughput. In the next lecture, we will discuss a heuristic allocation scheme that tries to overcome the limitations of SLIP and PIM.