Systolic Combining Switch Designs

Susan R. Dickey^1

Courant Institute of Mathematical Sciences

New York University

May 17, 1994

^1 Supported by U.S. Department of Energy grant number DE-FG02-88ER25052.


For my father, John Wilson Dickey, 1916–


List of Figures

4.2 Packet formats for the interconnection network in a 16 × 16 NYU Ultracomputer prototype.
4.3 Design of systolic combining queue.
4.4 Schematic of a single cell of the combining queue in the forward path component.
4.5 Block diagram of a combining queue implementation.
4.6 Combining queue transitions for slot j.
4.7 Behavior of chute transfer signal with IN and OUT both moving, when combining a 4-packet message.
4.8 Behavior of chute transfer signal with IN moving and OUT not moving, when combining a 4-packet message.
4.9 Behavior of chute transfer signal with OUT moving and IN not moving, when combining a 4-packet message.
4.10 Logic to produce propagate and generate signals.
4.11 Multiple output Domino CMOS gate in carry chain.
4.12 Block diagram of a return path component.
4.13 Block diagram of a wait buffer.
4.14 Slot of a wait buffer holding a two-packet message.
4.15 Schematic of a wait buffer cell.
4.16 Type B switches, molasses and susy simulations, uniform traffic, memory cycle 2.
4.17 Type B switches, molasses and susy simulations, 0.5 percent hot spot, no combining, memory cycle 2.
4.18 Type B switches, molasses and susy simulations, 0.5 percent hot spot, combining, memory cycle 2.
4.19 Type B switches, molasses and susy simulations, 1 percent hot spot, no combining, memory cycle 2.
4.20 Type B switches, molasses and susy simulations, 1 percent hot spot, combining, memory cycle 2.
4.21 Type B switches, molasses and susy simulations, 5 percent hot spot, no combining, memory cycle 2.
4.22 Type B switches, molasses and susy simulations, 5 percent hot spot, combining, memory cycle 2.
4.23 Type B switches, molasses and susy simulations, 10 percent hot spot, no combining, memory cycle 2.
4.24 Type B switches, molasses and susy simulations, 10 percent hot spot, combining, memory cycle 2.
4.25 Type B switches, molasses and susy simulations, uniform traffic, memory cycle 4.
4.26 Type B switches, molasses and susy simulations, 0.5 percent hot spot, no combining, memory cycle 4.
4.27 Type B switches, molasses and susy simulations, 0.5 percent hot spot, combining, memory cycle 4.
4.28 Type B switches, molasses and susy simulations, 1 percent hot spot, no combining, memory cycle 4.
4.29 Type B switches, molasses and susy simulations, 1 percent hot spot, combining, memory cycle 4.
4.30 Type B switches, molasses and susy simulations, 5 percent hot spot, no combining, memory cycle 4.
4.31 Type B switches, molasses and susy simulations, 5 percent hot spot, combining, memory cycle 4.
4.32 Type B switches, molasses and susy simulations, 10 percent hot spot, no combining, memory cycle 4.
4.33 Type B switches, molasses and susy simulations, 10 percent hot spot, combining, memory cycle 4.
4.34 Type A and Type B networks, 1 percent hot spot, bandwidth and latency.
4.35 Type A and Type B networks, 10 percent hot spot, bandwidth and latency.


4.36 Type A and Type B networks, 10 percent hot spot, combining, 1024 PEs, latency as a function

Chapter 1

Introduction

Communication between hundreds or thousands of cooperating processors is the key problem in building a massively parallel processor. This thesis is concerned with the best way to design a fast VLSI switch to be used in the interconnection network of such a parallel processor. Such a switch should handle the "hot spot" problem as well as provide good performance for uniform traffic. The switch designs we consider alleviate the "hot spot" problem by adding extra logic to the switches to combine conventional loads and stores as well as fetch-and-Φ operations destined for the same memory location, according to the methods described in [57]. The goal of this work has been to analyze and design a switching component that is inexpensive compared to the cost of a processing node, yet provides the functionality necessary for high-bandwidth, low-latency network performance.

The theoretical peak performance of a highly parallel shared-memory multiprocessor may be less than that of a message-passing multicomputer of equal component count, in which all nodes contain a processing element as well as switching hardware. However, the actual performance achieved per processor on a large class of applications should be much higher in the shared-memory multiprocessor, because the dedicated hardware of the network switches provides greater bandwidth per processing element and handles communication in a more efficient way.

The first section of this introductory chapter outlines the contributions of this thesis. The second section discusses related research.

1.1 Contributions

The analyses and simulations reported in this thesis were carried out in support of the design and implementation of a switching component for the NYU Ultracomputer architecture. The results are generally concerned with the trade-off between overall performance and implementation cost. The different areas in which results have been obtained are described in the following subsections.

1.1.1 Performance analysis of different switch types

Chapter 2 analyzes the effect that the arrangement and arbitration of buffers and the degree of the crossbar may have on switch performance and cost.

Switches in interconnection networks for highly parallel shared-memory computer systems may be implemented with different internal buffer structures. For a 2 × 2 synchronous switch, previous studies [78, 116] have often assumed a switch composed of two queues, one at each output, each of which has unbounded size and may accept two inputs every clock cycle. We call this type of switch Type A; a k × k Type A switch has k queues, one at each output, each of which may accept k inputs per cycle. Hardware implementations may actually use simpler queue designs and will have bounded size. Two additional types of switch are analyzed, both using queues that may accept only one input at a time: for k × k switches, a Type B switch uses k^2 queues, one for each input/output pair; a Type C switch uses only k queues, one at each input. In both cases, a multiplexer blocks all but one queue if more than one queue desires the same output, making these models more difficult to analyze than the previous Type A model.

Figure 1.1: Three basic switch types, labeled (A), (B), and (C).

We have found maximum bandwidth, expected queue length, expected waiting time, and queue length distribution for the Type B and Type C 2 × 2 switches, with unbounded queue size and with queue size equal to 1. For 2 × 2 switches we have proved that the bandwidth per port of a Type C switch is limited to 75 percent. While the Type C switch is less expensive, Type A and B have considerably better performance.
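To make the Type B versus Type C comparison concrete, the following is a minimal Monte Carlo sketch (mine, not the thesis's analytical model or its molasses/susy simulators) of a saturated 2 × 2 switch with unbounded queues, uniformly random destinations, and random arbitration; under these assumptions it reproduces the roughly 75 percent per-port limit for Type C and shows Type B approaching full per-port bandwidth.

```python
import random

def type_c_saturation(cycles=200_000, seed=1):
    """2 x 2 Type C switch: one FIFO per *input*, saturated inputs, each output
    accepts at most one packet per cycle.  Returns packets delivered per output
    port per cycle."""
    rng = random.Random(seed)
    heads = [rng.randrange(2), rng.randrange(2)]   # destination of each input queue's head packet
    delivered = 0
    for _ in range(cycles):
        winners = []
        for out in (0, 1):
            contenders = [i for i in (0, 1) if heads[i] == out]
            if contenders:
                winners.append(rng.choice(contenders))   # arbitration among conflicting heads
                delivered += 1
        for i in winners:
            heads[i] = rng.randrange(2)   # saturated input: a fresh packet with a fresh destination
        # a losing queue keeps its head packet, and its destination, for the next cycle
    return delivered / (2 * cycles)

def type_b_saturation(cycles=200_000, seed=1):
    """2 x 2 Type B switch: one FIFO per (input, output) pair, saturated inputs,
    each output multiplexer serves one non-empty queue per cycle."""
    rng = random.Random(seed)
    q = [[0, 0], [0, 0]]   # q[i][j] = packets queued from input i to output j
    delivered = 0
    for _ in range(cycles):
        for i in (0, 1):                         # one new packet per input per cycle
            q[i][rng.randrange(2)] += 1
        for j in (0, 1):                         # each output drains one of its queues per cycle
            ready = [i for i in (0, 1) if q[i][j] > 0]
            if ready:
                q[rng.choice(ready)][j] -= 1
                delivered += 1
    return delivered / (2 * cycles)

if __name__ == "__main__":
    print("Type C per-port bandwidth:", round(type_c_saturation(), 3))   # about 0.75
    print("Type B per-port bandwidth:", round(type_b_saturation(), 3))   # approaches 1.0
```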

1.1.2 An efficient CMOS implementation of systolic queues

Chapter 3 describes an efficient CMOS implementation for systolic queue designs; the basic queue design is useful for buffered non-combining as well as combining switches. The timing constraints on blocking and unblocking for the systolic queue design originally developed by Snir and Solworth [130] are formalized. A proof is given that an implementation of this design using only two global control signals operates correctly under these timing constraints. A non-combining switch using this systolic queue design was fabricated by MOSIS in 3 micron CMOS and used in a 2-processor prototype for over a year. This implementation employed the NORA (no race) clocking methodology, using qualified clocks as the mechanism for distributing global control. NORA allows the use of compact CMOS circuits with high tolerance for clock skew. Qualified clocks provide a natural way to implement local data movement in a systolic design, but their use with NORA involves certain complications. A circuit to produce a qualified clock for use in the NORA methodology was developed. The circuit's maintenance of NORA assumptions, as well as the charge-sharing and noise problems that can arise, are described.

1.1.3 Cost and performance of an implemented combining switch

Chapter 4 describes the combining switch that we have implemented for use in the 16 × 16 processor/memory interconnection network of the NYU Ultracomputer prototype. A 12-processor configuration using these switches is currently operational. Packaging, message types and message formats are described. Details are given about the internal logic of the two component types used in the network. The forward path component includes a systolic combining queue and an ALU for combining; the return path component includes non-combining systolic queues, an associative wait buffer, and an ALU for decombining. A design usable in networks of size up to 256 × 256 has also been prepared for fabrication at a smaller feature size in a higher pin-count package; differences in the logic partitioning of the two designs are described.

Simulation results were used to compare the performance of the specific switch architecture and flow control method actually implemented with the performance predicted by analytical models and by simpler simulation models of queue behavior. The effective queue size of our systolic queues, compared to standard linear FIFOs, was determined through simulation. Performance differences between Type A and Type B combining switches were explored.

Hardware combining is not without cost, but our experience in implementing a combining switch indicates that the cost is much less than is widely believed. We describe the design choices made in implementing the switch to keep network bandwidth high and latency low. We compare the cost of a combining switch to that of a non-combining switch and discuss the scalability of the design we have implemented to large numbers of processors.







Figure 1.2: An 8 × 8 Omega network.

In [14], bandwidth or throughput in a shared memory multiprocessor is defined as the mean number of memory requests delivered by the network to the memory per cycle. We use the term bandwidth per processor to refer to the steady-state message generation rate that can be accepted from each processor by the interconnection network and memory system. Performance is said to be bandwidth-limited if the processors themselves would be capable of generating a higher rate. In this case, messages are blocked from entering the network or dropped because of conflicts for resources in the interconnection network or at memory. By latency we mean the total time from the cycle a processor issues a request to the network until the network delivers the response from memory. As a network becomes loaded close to its maximum bandwidth, latencies increase. Performance may be latency-limited in two ways:

1. The processor may be unable to generate new messages, due to an instruction or data interdependency, until the response to a previously sent message has been received. How frequently this occurs will depend on the processor, on the application, and on the sophistication of the compiler and cache technology and of the operating system.
2. The hardware in the processor-network interface may allow only a fixed maximum number of messages outstanding in the network at a time.

In either case, the result will be that the message generation rate may drop as the latency increases, not because messages are blocked from entering the network, but because the PE must wait for a response to some previous message before a new message can be generated.
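As a small illustration of the second case, suppose the interface allows at most W outstanding messages and the round-trip latency is T cycles (W and T are my notation, not the thesis's); then a Little's-law argument bounds the sustainable generation rate by

\lambda \;\le\; \frac{W}{T}

so doubling the latency halves the rate a PE can sustain, even when links into the network are idle.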

1.2.1 Interconnection network topology

Topology refers to the pattern of interconnections between the processors and other system elements. The distinction between a static (or direct) topology and a dynamic (or indirect) topology is a fundamental one, often associated in practice with differences in computational model. A direct network may be used to provide low-latency, high-bandwidth direct connections to neighboring nodes, for computations that can be structured to have such spatial locality, but when destinations are randomly distributed, bandwidth is limited compared to a multistage indirect network.

The cost of a network depends not only on the total number of wires and nodes in a network, but on this pattern of interconnections or topology. The number of direct connections or links to a node is the degree of the node. In practice, the pin-out of the chips or boards containing the processors and switches limits the number of links of a given width, so that the degree of the node is a major cost factor. The bisection width, the number of wires which must be cut to divide the network into two equal-size pieces, is a measure of wire density: the denser the wires, the larger the area (or the more layers of wiring) required for layout.

Dally [26] used bisection width as the major cost factor, comparing networks of equal bisection width. Under this metric, networks with low-degree nodes could have wider data paths, thus giving lower latency performance under low traffic loads than networks with higher-degree nodes. The product of degree and bisection width has also been used to characterize a network's cost [19].

Using the definitions in [145], a static topology has each switching point connected to a full processor, including memory, while in a dynamic network processing elements (PEs) and memory modules (MMs) are connected only at inputs and outputs of the switching network. This perhaps unintuitive terminology evolved from the usage in [45], where static means that the links between processors are dedicated, passive buses and dynamic indicates that the links can be reconfigured by setting active switching elements. Since current processor-at-a-node "static" architectures may include sophisticated switching hardware at each node (see, e.g., [22, 28]) and since architectures that do not have a processor at every node but do not have well-defined network inputs and outputs have been developed [6, 27], the terms direct and indirect network, as used in [124] and elsewhere, may be less misleading.

In a direct network, each switch is directly connected to a single processor; a processor and the switch it is connected to form a single node. (In a message-passing system like the Cosmic Cube [125], the "switch" may exist only as software running on the processor.) The most frequently considered direct networks are the family of k-ary n-cubes [124], in which k^n nodes are each labeled with an n-digit radix k number and connected to nodes that differ in only one radix k digit. Each node has degree 2n, twice the number of dimensions. A k-ary n-cube may also be described as an n-dimensional array with k elements in each direction and end-around connections. Two- and three-dimensional mesh topologies are examples of k-ary n-cubes without the end-around connections.

For direct networks with N nodes, following the notation in [3], if the degree of each node is d and the data path width has the same number of bits as a single message, the overall message generation rate per processor at steady state (when message generation is equal to message arrival) is

\lambda_{mg} = \frac{d\,\lambda_m}{D} \qquad (1.1)

where λ_m is the average utilization of each incoming link in steady state and D is the average number of hops to deliver a message (thus 1/D is the probability that an incoming message terminates). D can never be less than O(log_d N) for any topology [9] if messages are sent from one node to all other nodes with equal probability. Since λ_m must of course be less than 1, the steady-state message generation rate of direct networks, assuming uniform distribution of destinations, has an upper bound of O(d / log_d N); the corresponding upper bound on the usable bandwidth per node is d / log_d N times the capacity of a link.

Indirect networks contain switches that are not directly connected to a processor. Another way to describe this is to say that the network contains nodes without a processor at the node. Thus, for the same maximum node degree, these networks are richer in links per processor and can maintain a higher bandwidth per node. To find the maximum message generation rate per PE for such a network, consider that λ_mg is constrained by

N\,D\,\lambda_{mg} \le L \qquad (1.2)

where L is the total number of links in the system, since at steady state the total amount of traffic which can be generated each cycle on average can be no greater than the link capacity. The message generation rate must also be less than the degree of the connection of the processor node to the network. Clearly, depending on the specific topology and traffic pattern, not all of the L links may act to add usable bandwidth, but this constraint provides an upper bound.

The most commonly discussed and analyzed indirect networks are the multistage networks with a logarithmic number of stages, such as the Omega network [85], the rectangular SW-Banyan [51], square delta networks [110], the baseline network [146] and the indirect binary n-cube [113], all of which can be shown to be essentially equivalent [146] when the number of inputs and outputs to the network and the degree of the switches is the same. By analogy with the k-ary n-cube direct networks and the butterfly network [90], such networks are sometimes called k-ary n-fly networks [25], where k is the number of inputs and outputs for each switch and n is the number of stages in the network, and there are k^n inputs to and k^n outputs from the network. Networks in this class have an upper bound on the message generation rate at inputs to the network equal to the full capacity of the link.
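A short sketch of how Equations 1.1 and 1.2 turn into numbers; the parameter values and the link-counting convention (one link per stage boundary, including the input and output links, so a message in a k-ary n-fly crosses n + 1 of its (n + 1)·N links) are my illustrative assumptions, not figures from the thesis.

```python
import math

def direct_bound_per_node(d, N):
    """Upper bound on per-node message generation rate for a direct network:
    Eq. 1.1 with lambda_m < 1 and D >= log_d N gives lambda_mg < d / log_d N."""
    return d / math.log(N, d)

def indirect_bound_per_input(L, N, D):
    """Eq. 1.2 rearranged: lambda_mg <= L / (N * D)."""
    return L / (N * D)

N = 1024
# Direct network with degree-4 nodes: the bound d / log_d N is below link capacity.
print(direct_bound_per_node(d=4, N=N))                  # 0.8 messages per cycle

# 2-ary 10-fly (1024 inputs, 10 stages of 2x2 switches): every message crosses
# D = 11 links and there are L = 11 * N links, so the bound is full link capacity.
print(indirect_bound_per_input(L=11 * N, N=N, D=11))    # 1.0 messages per cycle
```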

Topology      No. of nodes   Bandwidth per PE   Average distance   No. of wires   Bisection width
CM-5          256            8                  7.3                16,384         2,
J-Machine     512            5.14               10.5               27,648         1,
NYU Ultra     256            32                 18                 147,456        8,
Tera          256            91.02              45                 1,048,576      16,

Table 1.2: Performance factors for sample parallel computers

Table 1.2 shows the same performance factors as Table 1.1 for the interconnection networks of particular parallel computer architectures, measuring cost in wires instead of links, and maximum bandwidth in bits per cycle rather than messages per cycle. The number of nodes was chosen to be comparable with the 256 processor nodes in the Tera architecture [6], which has an arrangement of nodes that is not easy to define for other sizes. Some of these architectures may be implemented by allowing bidirectional data transfer on a single wire in the same processor cycle [28]; such bidirectional wires are counted as two wires in the total number of wires. In the descriptions below, the degree of the node is the number of bidirectional links, and the link width is the number of bits that may be transmitted in one direction in one cycle.

The Tera network has 4096 nodes arranged in a 3D torus with sparse links. Each node is connected in only two of the three possible X, Y or Z directions, so nodes have degree 4 rather than 6. The links each transmit 64 data bits per cycle, plus an unspecified number of bits for control; only the wires for the data bits are counted here. Only 256 nodes are populated with processors; the rest are populated with memories and I/O processors or are available only to add bandwidth. The average distance reported in Table 1.2 is that for the 3D torus, doubled to allow a round trip to memory, and is pessimistic in the sense that there may be considerable opportunity for locality.

For the other three networks, the wires transmit both control and data. The NYU Ultracomputer actually has 39-bit data paths in the forward path network for a 256 PE machine [17], but only the 32 bits used for data packets are counted in the link width. Since 2 × 2 switches are used, nodes in the network have degree 4.

The J-Machine [108] is a 3D torus architecture with a processor at every node. For ease in computing the average number of hops, we wish the number of processors to be k^3 for some k, so a system with 8^3 processors was chosen. (The router hardware would allow an ensemble of up to 64K nodes.) Nodes have degree 6, with a link width of 9 bits.

Only the data network of the Connection Machine CM-5 [93] is shown. This processor-to-processor network has the topology of a 4-ary fat tree, implemented using nodes of degree 8, with a link width of 8 bits. In theory, the link bandwidth at each node in a fat tree increases at each level approaching the root, while the number of nodes at each level decreases proportionately; in practice, the number of nodes at each level in the tree remains the same, as does the link width, so that the "root" of the tree actually consists of the same number of nodes as all the leaves. The values reported in Table 1.2 were computed under the assumption that all parent connections are used at every level of the tree. To decrease wiring cost, not all the possible connections in the CM-5 network are actually used.

The variety of networks that have been chosen for parallel architectures illustrates that cost assessments in practice are not continuous, based on total quantity of wires or silicon area, but very much dependent on fitting into constraints. Design costs and the market for components used in the system may be more important in the short run than the quantity and utilization of hardware. However, in our research, we are interested in network architectures that provide scalable bandwidth as system size grows, and thus concentrate our attention on multistage buffered networks.
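As a check on the wire-counting convention just described (bidirectional wires counted twice), the sketch below recomputes the wire totals for the J-Machine and CM-5 rows of Table 1.2 from the node counts, degrees, and link widths given above; the torus-distance helper shows only the simple one-way k-ary n-cube average and does not reproduce the table's distance column, which folds in round trips and the sparse Tera link arrangement.

```python
def total_wires(nodes, degree, link_width_bits):
    """Wires = nodes * bidirectional links per node * bits per direction,
    with each bidirectional wire counted as two wires (the Table 1.2 convention)."""
    return nodes * degree * link_width_bits

def torus_one_way_avg_distance(k, n):
    """Average one-way hop count in a k-ary n-cube with end-around connections,
    averaging over all destinations including the source itself: n * k / 4 for k even."""
    ring = sum(min(d, k - d) for d in range(k)) / k   # average distance around one ring
    return n * ring

print(total_wires(nodes=512, degree=6, link_width_bits=9))   # 27,648  (J-Machine row)
print(total_wires(nodes=256, degree=8, link_width_bits=8))   # 16,384  (CM-5 row)
print(torus_one_way_avg_distance(k=8, n=3))                  # 6.0 one-way hops, 8-ary 3-cube
```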

1.2.2 Routing protocol

In SIMD systems, methods for routing permutations of data among the processors may be important for performance. For MIMD systems, we are primarily interested in routing in a renewal context, i.e., each PE repeatedly and independently generates messages to be routed to other PEs with some probability.

In such a context, distributed routing, with only local decisions, is more appropriate than global control of routing decisions. The k-ary n-fly networks have the property that exactly one path connects a given network input to a given network output, and that routing can be done locally by a simple scheme in which the output at the i-th stage in the network is selected by the i-th of n k-ary digits in a routing address. For the round-trip messages used in a shared memory system, local control on both the forward and return path would seem to require that both the destination and return addresses be transmitted with each message. However, for k-ary n-fly networks, only a single path descriptor field which contains an amalgam of the origin and destination addresses is required [65]. For 2 × 2 switches, this scheme works as follows: initially, the path descriptor field is set to the destination address. At each switch, the high-order address bit selects the port to which the message is to be routed. Each switch replaces this bit with the number of the input port on which the message arrived and rotates the address one bit so that the routing bit for the next switch will be the new high-order bit. When leaving stage j of an n-stage network, the low-order j bits will be the high-order j bits of the origin address and the high-order n − j bits will be the low-order n − j bits of the destination address. Thus, when the message reaches its destination, the path descriptor field may be reversed and used in the same way as the return address to the origin of the message.

Following the definition in [76], in a delta network the path descriptors associated with different paths leading to the same output node are identical, so that, if the inputs are processors and the outputs are memory modules, each processor uses the same routing tag for a given memory module. Bidelta networks have this property in the reverse direction as well; that is, each input also has a unique numeric identifier that can be used to route in the reverse direction from any output to any input. It was shown in [35] that any two n-stage bidelta networks composed of k × k switches are isomorphic. The bidelta property, besides providing a functional description of a unique network of a given size, is of practical value in a shared memory system since it provides a unique PE or MM number for each module that can be used for interrupts or other purposes from any location in the system. However, networks without the delta or bidelta properties can still use digit-controlled routing and create the return address as the message is routed to its destination, as long as the switches are connected in such a way that every output can be reached from every input. In such networks, each input may use a different address for each output of the network. In a shared memory system, a functional mapping or table lookup to get the correct memory module address would be required at each processor as a part of memory address translation. Figure 1.3 shows such a "non-delta" network that was considered for use in a 16-PE NYU Ultracomputer prototype because it allowed short wires on a backplane connecting 4 × 4 switch boards that were all wired identically [16].

For networks with more than one path from source to destination, adaptive routing may improve performance [48], at some cost in complicating logic at the switching nodes. We will not consider such networks here.
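The path descriptor manipulation described above is easy to state as code. The sketch below is my illustration for a hypothetical 3-stage network of 2 × 2 switches (the arrival-port values fed in are made up); it shows the high-order bit being consumed as the routing bit at each stage and the field ending up holding the sequence of input ports, which, reversed, serves as the return route.

```python
def route_step(descriptor, in_port, n):
    """One 2 x 2 switch's handling of an n-bit path descriptor: route on the
    high-order bit, overwrite that bit with the arrival-port number, then rotate
    the field left by one so the next stage's routing bit is high-order again.
    Returns (output_port, updated_descriptor)."""
    high = 1 << (n - 1)
    out_port = (descriptor >> (n - 1)) & 1
    replaced = (descriptor & (high - 1)) | (in_port << (n - 1))
    rotated = ((replaced << 1) & ((1 << n) - 1)) | (replaced >> (n - 1))
    return out_port, rotated

n = 3
descriptor = 0b101                                      # destination address 5 = 101
for stage, in_port in enumerate([1, 1, 0], start=1):    # illustrative arrival ports
    out_port, descriptor = route_step(descriptor, in_port, n)
    print(f"stage {stage}: out port {out_port}, descriptor {descriptor:0{n}b}")
# Output ports used: 1, 0, 1 (the destination bits, high-order first).
# Final descriptor 110 holds the arrival ports 1, 1, 0; reversing it (011)
# gives the tag used the same way on the return trip.
```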

1.2.3 Switching strategy

Maximum bandwidth and minimum latency are determined by the number of wires in the network, the way the wires are connected, and the number of hops a message must take, as discussed in section 1.2.1. The bandwidth and latency actually experienced are, however, strongly affected by the switching strategy adopted. Circuit switching provides a guaranteed low-latency transmission for a message once a circuit has been established, at the cost of tying up resources (links) that could be used by other messages and thus limiting bandwidth. Store-and-forward switching (often called message or packet switching) increases overall network throughput by releasing a link as soon as a message passes through it, at the expense of increasing latency for individual long messages by only using one link at a time. Cut-through techniques, which include the recently popular "wormhole routing," can combine some of the advantages of both techniques.

Circuit switching, the standard technique used for telephone switching, requires a set-up period during which each link used for the transmission must be visited. After the circuit has been set up, all links can be active simultaneously on a pipelined transmission. Circuit switching is profitably used when transmission times are typically much larger than the set-up time and has seen relatively little use in computer networks or interconnection networks for multiprocessors, where the set-up time is often substantial compared to the

than one packet. A variety of terms have been used to describe units of transmission in the context of cut-through switching. Our usage of the terms "message" and "packet," with packet as the sub-unit of message that can be transferred in a single network cycle, follows that of Kruskal and Snir in their analysis of what they called "pipelined message switching" in [77]. In the original paper [69] in which Kermani and Kleinrock define and analyze virtual cut-through, "message" and "packet" are used interchangeably to describe the unit of network transmission that contains routing information. In their paper, whenever the head of a message arrives at a node and the outgoing link is free, the message is said to "cut through" and begin transmission to the next node, without being buffered. If the outgoing link is busy at the time the head arrives, the message must be buffered. A partial cut is said to occur if a message that has been blocked and partially buffered is allowed to continue transmission as soon as the link is free, without waiting for the message to be completely buffered. No terminology is developed for sub-units of a message; instead the analysis is phrased in terms of the time it takes for the message to be transmitted across a link.

Dally and Seitz, in [30] and elsewhere, use the term wormhole routing to describe a flow control strategy used with cut-through switching that provides only minimal buffering. Instead of either providing enough buffering for a complete message or dropping messages when they are blocked (as, e.g., the BBN Butterfly [24]), only enough buffering is provided at each node to make routing decisions; blocked messages are held in place and tie up links leading back to the tail of the message. Dally and Seitz use "packet" for what we call "message," the set of bits being routed together to a destination, and have two terms for the sub-units of a message. Flit (flow control unit) refers to the smallest sub-unit of a message that a queue can accept or refuse, while phit (physical transfer unit) refers to the part that can be transmitted across a link in one cycle, corresponding to our use of the term "packet."

The switching designs we consider in later chapters have a flow control protocol that allows partial cuts in the sense that a blocked address packet may start up transmission before the final packet in the message has arrived at that node. Our protocol does not, however, allow the data packets of a message to be blocked once the address packet has been transmitted to the next node. So there is no sub-unit of our messages that exactly corresponds to a flit in the sense that it is used in describing wormhole routing, as a part of a message that can be both accepted and refused independently.

Abraham and Padmanabhan [3] analyze both "full cut-through," in which cuts are allowed only if the link is free when the head of the message arrives, and "partial cut-through," in which partial cuts are allowed as well.^1 They use "message" in the same sense that we do, but they use the term "nibble" to refer to what we call a "packet."

Analyzing the relative performance of cut-through switching and store-and-forward switching is a difficult problem, and a number of different approaches have been tried. Kermani and Kleinrock analyzed systems with both single and multiple channels per link and with noisy and noiseless channels, using assumptions of independence, Markovian distribution and balanced traffic to make the analysis tractable.
In Kermani and Kleinrock's analysis, the message must be completely buffered (partial cuts are not allowed). This assumption both makes the analysis simpler and is sensible for the data communications application which they were considering, where it is advantageous in the presence of noisy channels to buffer a blocked message in order to perform error checking. For single-channel links with noiseless channels, the difference between T_m, the average time to transmit a message under store-and-forward message switching, and T_c, the average time to transmit a message using virtual cut-through, is

T_m - T_c = (n_h - 1)(1 - \rho)(1 - \alpha)\, t_m \qquad (1.3)

where n_h is the average number of hops the message must travel, ρ is the average utilization of a link, α is the proportion of the message taken up by header, and t_m is the average time required to transmit a message over a single link. When messages are long compared to the capacity of the link, t_m will be large and virtual cut-through can significantly improve performance compared to store-and-forward. Furthermore, virtual cut-through also provides a substantial savings in storage when traffic is light. On the other hand, if the utilization ρ is high or the average number of hops is small, cut-through will not provide a large performance gain, even for long messages.
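A quick numerical reading of Equation 1.3, using made-up parameter values, just to show the two regimes described above:

```python
def cut_through_saving(n_h, rho, alpha, t_m):
    """Equation 1.3: expected transmission-time saving of virtual cut-through over
    store-and-forward when partial cuts are not allowed."""
    return (n_h - 1) * (1 - rho) * (1 - alpha) * t_m

# Long message (t_m = 20 cycles per link), 6 hops, 5 percent header overhead.
print(cut_through_saving(n_h=6, rho=0.1, alpha=0.05, t_m=20.0))  # 85.5: nearly (n_h - 1) * t_m
print(cut_through_saving(n_h=6, rho=0.9, alpha=0.05, t_m=20.0))  #  9.5: gain shrinks at high utilization
```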

^1 Since partial cut-through requires greater hardware capability and gives better performance than full cut-through, the terminology is somewhat unintuitive, since people normally expect "full" to be better than "partial."

In [77], Kruskal and Snir give the following approximate formula for network delay using cut-through switching with m-packet messages in a delta network with k × k switches:

T = \log_k N \left( t + m t\, \frac{m (1 - 1/k)\, p}{2 (1 - m p)} \right) + (m - 1)\, t \qquad (1.4)

where N is the number of network inputs and outputs, p is the number of messages per cycle, and t is the cycle time of a switch.^2 The first t in the expression being multiplied by the number of stages in the network represents the transition time for the head of a message from input to output, assumed to be a single cycle, and the second term represents queueing delay. The last term accounts for the "pipe-setting" delay. In [79], this formula was modified by a factor obtained from simulations to account for the changes in queueing delay at later stages. No analysis was done for the store-and-forward case.

In [3], a delay formula for cut-through was calculated by subtracting the benefit due to cut-through at each stage from the store-and-forward delay formula. For the full cut-through case, the benefit due to less waiting for all messages that arrive at the same time as a message that cuts through was given as

\frac{1 - p}{1 - a_0}\, p\, m\, t \qquad (1.5)

and an additional benefit for all messages that arrive at a queue during the time the output link is busy after a message that cuts through arrives is given as

\frac{1 - a_0 - a_1}{1 - a_0}\, p^2\, m\, t \qquad (1.6)

where a_i is the probability that i messages arrive at a queue on a given cycle, and the rest of the notation is as in Equation 1.4. Their analysis showed a 35 percent improvement for virtual cut-through over store-and-forward, with an additional 20 percent improvement when partial cuts are allowed, for a total improvement of 55 percent at moderate traffic levels, but the figures given are for long messages issued at low rates.

Dally's work [25, 31] analyzed the performance of wormhole routing compared to virtual cut-through with queueing under light traffic conditions for both indirect k-ary n-fly networks and direct k-ary n-cubes. His analysis for virtual cut-through with infinite queues in a k-ary n-fly network gave the following approximate formula for network delay

T = \log_k N \left( t + m t\, \frac{m (1 - 1/k)\, p}{2} \right) + (m - 1)\, t \qquad (1.7)

using the same notation as in Equation 1.4. Omitting the factor (1 − mp) from the denominator in the expression for queueing delay makes this approximation less accurate as traffic increases.

Wormhole routing has become very popular in systems where the offered load is low, because of its latency advantages over store-and-forward and its storage advantage over buffered cut-through techniques [107]. However, because messages retain channels when blocked, this technique is particularly susceptible to problems with deadlock in many network configurations. Sophisticated switching hardware may be required to solve this problem. In a recent study by Adve and Vernon [4], a closed queueing network model was developed for wormhole routing in k-ary n-cube networks using the non-adaptive deadlock-free routing scheme of Dally and Seitz [30]. Adve and Vernon found that, when processors are allowed to have multiple outstanding requests, system performance is bandwidth-limited rather than latency-limited and thus, since the bandwidth in k-ary n-cubes scales as N/k (see 1.1), this configuration does not scale well with increasing system size under uniform access patterns. With four outstanding requests per processor, at least 70 percent of each processor's traffic must be directed to its nearest neighbors for system performance to scale well. They also showed that the deadlock avoidance algorithm places asymmetric loads on the virtual channels, creating differences in efficiencies for processors at different points in the network.
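To show how Equations 1.4 and 1.7 diverge as load grows, here is a small evaluation under illustrative parameters of my choosing (a 1024-input network of 2 × 2 switches, 4-packet messages, unit cycle time); near saturation only Equation 1.4, which retains the (1 − mp) factor, shows the queueing term blowing up.

```python
import math

def delay_eq_1_4(N, k, m, p, t):
    """Kruskal and Snir's approximation (Eq. 1.4) for cut-through delay in a
    delta network of k x k switches, m-packet messages at rate p per cycle."""
    queueing = m * t * (m * (1 - 1 / k) * p) / (2 * (1 - m * p))
    return math.log(N, k) * (t + queueing) + (m - 1) * t

def delay_eq_1_7(N, k, m, p, t):
    """Dally's light-traffic approximation (Eq. 1.7): same form with the
    (1 - mp) factor omitted from the denominator."""
    queueing = m * t * (m * (1 - 1 / k) * p) / 2
    return math.log(N, k) * (t + queueing) + (m - 1) * t

for p in (0.05, 0.20):   # mp = 0.2 (light load) and mp = 0.8 (near saturation)
    print(p, round(delay_eq_1_4(1024, 2, 4, p, 1.0), 1),
             round(delay_eq_1_7(1024, 2, 4, p, 1.0), 1))
# 0.05 -> 15.5 vs 15.0 cycles; 0.20 -> 53.0 vs 21.0 cycles.
```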

^2 Kruskal and Snir distinguish between the transmission time and the cycle time, but we assume a system in which the cycle time is the maximum of these two values.