





Abstract. This study outlines a cost-effective multiprocessor architecture that takes into consideration the importance of hardware and software costs as well as delivered performance in the context of real applications. The proposed architecture (HPAM) is organized as a hierarchy of processors-and-memory (PAM) subsystems. Each PAM contains one or more processors and one or more memory modules. The following factors drive the HPAM design:
- application behavior: an important application behavior (locality of parallelism) is characterized and quantified for a set of benchmarks; two classes of applications that demand 100 TOPS computation rates are also characterized.
- cost-efficiency: a favorable comparative analysis of a 2-level HPAM and a conventional multiprocessor is done using empirical data; technology trends that support the desirability and viability of HPAM organizations are also discussed.
- ease-of-use: a flexible programming environment for HPAM is proposed; the scenarios addressed include automatic translation systems, library-based programming and performance-guided coding by expert programmers.
Important computer applications have been identified that will require at least 100 Teraflops (10^14 operations per second) computing speeds [St et. al 95]. A machine that aims at providing such a level of performance will require orders of magnitude more resources than current high-performance computers. As a consequence, cost-effectiveness and programmability are the overriding design issues for such a machine. The heterogeneous multiprocessor system discussed in this paper is organized as a hierarchy of processors-and-memory subsystems (HPAM). Our approach is to leverage as much as possible features of commodity microprocessors (current and expected in the future).
1. This research was partially funded by the National Science Foundation xxx.
In particular, the proposed approach exploits the analogy of the HPAM organization with conventional memory hierarchies, which microprocessors readily support. The HPAM approach meets the spirit of the following five lessons, learned from previous multiprocessor research and development, in a cost-effective manner:
- the cost of multiprocessors can be greatly reduced by reusing standard commodity parts and software; cost-performance analysis must reflect this reality.
- high software development costs deter potential customers of parallel processors; successful multiprocessor designs should present users with familiar programming environments and allow (user and system) software reuse.
- designers must always be aware of Amdahl's law; any serialization introduced in the system severely limits multiprocessor speedups.
- memory access and low-level communication latencies are fundamental limitations; while bandwidths can always be increased (in theory), latencies result from fundamental limits of physics and communication software overhead, and thus are to be avoided or hidden; in practice, costs can limit bandwidth which, in turn, may increase latencies.
- real applications with irregular control and data structures should be included in the suite of benchmarks used to evaluate multiprocessor designs.
We expand on the HPAM architecture in the following section. Issues related to application behavior, cost and performance are addressed in Sections 3 and 4, respectively. The programming environment of HPAM is described in Section 5. Section 6 includes future work and concluding remarks.
Figure 1 shows a high-level view of an HPAM. An HPAM consists of several levels of processors-and-memory (PAM) systems. A PAM system contains one
little communication being required between HPAM levels when they execute different parts of an application. The following two subsections discuss issues related to locality of programs with respect to the degree of parallelism and constraints imposed by petaflops applications.
3.1 Preliminary Findings
Locality of parallelism has been characterized and empirically studied in [BeFo 96]. An abbreviated discussion of this study follows. For this initial experiment, the hierarchy is restricted to two levels only. The first level consisted of a single fast processor. Three cases were considered for the second level (namely, 10, 100 and 1000 processors) in order to gain some insight into the locality behavior of different degrees of parallelism which might be present in multilevel machines.
For each of several benchmarks (discussed later in this section), the experiment consists of executing a program (on its associated benchmark data) on a uniprocessor. The program consists of a sequence of assembly-level instructions with execution control directives. A leader is one of the following control directives: {begin, end, doall, enddoall}. The leaders begin and end represent the beginning and the end of the program and are therefore unique. The leader doall(k) represents the start of a loop do I=a,b such that k = b - a + 1 and the loop is parallelizable. The leader enddoall represents the end of a parallelizable do loop.
A block b_i is the sequence of code between two consecutive leaders. That is,
b_i = ]L_i, L_{i+1}[, where L_i and L_{i+1} are leaders.   (1)
Furthermore, let the parallelism bp and the size bs of a block b_i be defined as follows:
bp(b_i) = 1                 if L_i = 'begin'
        = k * bp(b_{i-1})   if L_i = 'doall(k)'
        = bp(b_{i-2})       if L_i = 'enddoall'   (2)
bs(b_i): the number of assembly-level instructions executed between L_i and L_{i+1}.
The following example illustrates the definition of block parallelism and block size.
!begin
   ...
!doall(MK)
   do I = 1, MK
      lduw  [%fp + 0x48], %l
      lduw  [%l0 + 0], %l
      stw   %l0, [%fp - 0x10]
      or    %g0, 1, %l
      sethi %hi(0x45c00), %l
!enddoall
   ...
!end
Let b_x be the block enclosed between doall and enddoall. For this example, bp(b_x) = MK and bs(b_x) = 5. The block parallelism does not have to be known statically and can be evaluated at runtime.
Throughout the remainder of this paper, the term instruction is used to refer to assembly-level code instructions. Furthermore, all instructions are assumed to execute in the same amount of time. Under this assumption, the execution of a given program Pr is represented by the ordered sequence of blocks [b_1, ..., b_n] such that L_1 = 'begin' and L_{n+1} = 'end'. The block parallelism reflects the application degree of parallelism at different instances of the execution of the application, independently of the machine on which the application is executed.
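As an illustration of these definitions, the sketch below (hypothetical code, not the instrumentation used in the study) derives one (bp, bs) pair per block from a trace of leaders interleaved with instruction counts. It keeps a stack of enclosing parallelism values, which reduces to the bp(b_{i-2}) rule of Equation 2 when doall regions are not nested.

def blocks_from_trace(trace):
    """Return one (bp, bs) pair per block b_i, following Equations 1 and 2."""
    bp_stack = []        # parallelism of enclosing blocks, for matching enddoall
    blocks = []          # accumulated (bp, bs) pairs
    for event in trace:
        if isinstance(event, int):            # instruction count of current block
            bp, bs = blocks[-1]
            blocks[-1] = (bp, bs + event)
        elif event == 'begin':
            blocks.append((1, 0))             # bp(b_1) = 1
        elif event.startswith('doall('):
            k = int(event[len('doall('):-1])  # iteration count of the parallel loop
            bp_stack.append(blocks[-1][0])
            blocks.append((k * blocks[-1][0], 0))   # bp(b_i) = k * bp(b_{i-1})
        elif event == 'enddoall':
            blocks.append((bp_stack.pop(), 0))      # parallelism of enclosing block
        # 'end' closes the program; nothing to record
    return blocks

# A trace shaped like the assembly fragment above, with MK = 4 assumed:
print(blocks_from_trace(['begin', 12, 'doall(4)', 5, 'enddoall', 3, 'end']))
# -> [(1, 12), (4, 5), (1, 3)]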
For the machine degree of parallelism, let MDP be defined as follows:
MDP (machine degree of parallelism): the maximum number of assembly-level instructions that can execute concurrently in the second level.
With respect to a given MDP, a scalar window is defined as follows:
W^s_{ij} = [b_i, ..., b_j] such that
    (bp(b_{i-1}) >= MDP or i = 1), (bp(b_{j+1}) >= MDP or j = n),
    and bp(b_k) < MDP for i <= k <= j.   (3)
Similarly, a parallel window is defined as:
W^p_{ij} = [b_i, ..., b_j] such that
    (bp(b_{i-1}) < MDP or i = 1), (bp(b_{j+1}) < MDP or j = n),
    and bp(b_k) >= MDP for i <= k <= j.   (4)
Let Ep be an example program specified by the sequence [b_1, ..., b_4] such that (bp(b_1) = 1, bp(b_2) = 2, bp(b_3) = 10, bp(b_4) = 100) and (bs(b_1) = 30, bs(b_2) = 800, bs(b_3) = 10, bs(b_4) = 20). For MDP = 10, this example has two windows: W^s_{12} and W^p_{34}.
The size of a window is defined as:
ws(W^x_{ij}) = sum_{k=i..j} bs(b_k)                                       if x = s
ws(W^x_{ij}) = sum_{k=i..j} floor(bs(b_k)/bp(b_k)) * floor(bp(b_k)/MDP)   if x = p   (5)
The size of the window represents the amount of work done on each processor. For a scalar window, this amount is equivalent to the sum of all block sizes in the window. However, for parallel windows this work is distributed among more than one processor. The first term in the second summation of Equation 5 represents the amount of work done per processor given unlimited resources. For the example introduced above, ws(W^s_{12}) = 50 and ws(W^p_{34}) = 80.
For the program Pr and a given size h, the percent scalar (parallel) execution time with respect to the window size is defined as the ratio of the sum of the sizes of scalar (parallel) windows with window size = h to the sum of the sizes of all scalar (parallel) windows.
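A compact way to see these definitions in action is the sketch below (hypothetical helper, with illustrative block values rather than the example program Ep): it splits a block sequence into maximal scalar and parallel windows for a given MDP and evaluates ws as written in Equation 5.

from itertools import groupby

def windows(blocks, mdp):
    """blocks: list of (bp, bs) pairs. Returns (kind, ws) for each window."""
    result = []
    for is_parallel, run in groupby(blocks, key=lambda b: b[0] >= mdp):
        run = list(run)
        if is_parallel:   # Equation 5, x = p: floored per-processor work times processor batches
            ws = sum((bs // bp) * (bp // mdp) for bp, bs in run)
        else:             # Equation 5, x = s: total number of instructions in the window
            ws = sum(bs for _, bs in run)
        result.append(('parallel' if is_parallel else 'scalar', ws))
    return result

# Illustrative values only (not the paper's program Ep):
print(windows([(1, 30), (2, 20), (10, 100), (100, 800)], mdp=10))
# -> [('scalar', 50), ('parallel', 90)]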
As expressed in the following two principles, there are two interesting types of locality with respect to the degree of parallelism:
Principle 1: if a data item is referenced within a scalar (parallel) window of the program, it tends to be referenced again in the near future within a scalar (parallel) window (data temporal locality with respect to the degree of parallelism).
Principle 2: if an instruction being executed belongs to a scalar (parallel) window, the instructions executed in the near future tend also to belong to a scalar (parallel) window (instruction temporal locality with respect to the degree of parallelism).
We conducted an experiment to quantify and analyze parallelism locality in four benchmarks from the CMU benchmark suite (Airshed, t2, Radar and Stereo [Di et al. 94]) and five benchmarks from the Perfect Club suite (TRFD, FLO52, ARC2D, OCEAN and MDG [Be et al. 94]). For each program, parallelism was first detected by Polaris [Bl et al. 94] and the resulting programs were instrumented to output the size of each window. Each window was then classified as scalar or parallel for values of MDP of 10, 100 and 1000.
Figures 2 and 3 show the histograms of accumulated percent scalar and parallel execution times with respect to scalar and parallel window sizes, respectively, for the benchmark ARC2D. These scalar and parallel percentages indicate to what extent instruction temporal locality is present in a given application. If high percentages correspond to small windows, then the application exhibits poor instruction temporal locality. On the other hand, if high percentages correspond to large windows, then the application exhibits high instruction temporal locality.
Table 1 shows the window sizes which accumulated the highest percent scalar execution time and the highest percent parallel execution time for all the benchmarks. For each benchmark in Table 1, the first row indicates the highest percent scalar or parallel execution time and the second row is the size of the corresponding window. The results of this table show that most benchmarks exhibit high locality (e.g., for MDP = 100, 99.7% of t2 scalar execution time corresponds to a window of size 10^7 and 84.9% of its parallel execution time corresponds to a window of size 10^5). The exceptions are the benchmarks Airshed and MDG, which exhibit relatively poor locality.
For each of the benchmarks used previously, data-reference traces with respect to MDP were collected. An M-hit ratio (for mode hit) and an M-miss ratio are defined as follows:
M-hit ratio: fraction of the total number of scalar (parallel) data references in a given scalar (parallel) window for which their last reference was also in a scalar (parallel) window.
M-miss ratio: fraction of the total number of scalar (parallel) data references in a given scalar (parallel) window for which their last reference was in a parallel (scalar) window.
The trace collection in Table 2 was done over a large buffer (1,000,000 locations) in order to reduce capacity misses.
Bench-      MDP = 10           MDP = 100          MDP = 1000
marks       Sca.     Par.      Sca.     Par.      Sca.     Par.
CMU-Suite
Airshed     79.31    71.70     36.55    86.84     96.88    79.
            (10^2)   (10^2)    (10^1)   (10^3)    (10^9)   (10^3)
t2          99.72    84.93     99.70    84.90     99.52    84.
            (10^7)   (10^6)    (10^7)   (10^5)    (10^7)   (10^4)
Radar       71.36    41.12     71.21    42.61     100.00   69.
            (10^7)   (10^5)    (10^7)   (10^4)    (10^7)   (10^3)
Stereo      84.02    69.81     84.02    69.81     97.68    98.
            (10^3)   (10^5)    (10^3)   (10^4)    (10^3)   (10^6)
Perfect-Suite
TRFD        90.40    39.74     53.63    40.78     68.41    47.
            (10^7)   (10^7)    (10^7)   (10^6)    (10^5)   (10^5)
MDG         74.14    96.56     99.75    92.75     99.60    95.
            (10^2)   (10^1)    (10^8)   (10^3)    (10^8)   (10^2)
FLO52       61.54    54.62     87.34    51.32     71.05    68.
            (10^4)   (10^5)    (10^5)   (10^4)    (10^6)   (10^3)
ARC2D       90.21    57.22     55.06    58.37     54.91    59.
            (10^6)   (10^6)    (10^6)   (10^5)    (10^6)   (10^4)
OCEAN       64.95    57.65     54.44    67.40     53.57    73.
            (10^6)   (10^4)    (10^7)   (10^3)    (10^7)   (10^2)
Table 1: Scalar and parallel window sizes which accumulated the highest percent of scalar or parallel execution time for MDP = 10, 100 and 1000.
[Figure 2 plots percent scalar execution time against window size (log10) for ARC2D, with curves for MDP = 10, 100 and 1000.]
Figure 2: Histogram of the percent scalar execution time with respect to scalar window sizes for ARC2D.
The references that are not covered by the M-hit or M-miss ratios are due to capacity or cold-start misses. Table 2 shows a low M-miss ratio for almost all of the benchmarks. This indicates that the benchmarks also exhibit significant data temporal locality with respect to the degree of parallelism.
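To make the two ratios concrete, here is a minimal sketch (hypothetical helper, not the paper's tracer) that classifies each data reference by the kind of window its previous reference fell in. It aggregates over the whole trace and ignores capacity effects, so it only approximates the per-window definitions above.

def m_ratios(trace):
    """trace: iterable of (address, kind) with kind in {'scalar', 'parallel'}."""
    last_kind = {}                    # address -> window kind of its last reference
    hits = misses = cold = 0
    for addr, kind in trace:
        prev = last_kind.get(addr)
        if prev is None:
            cold += 1                 # first reference: neither M-hit nor M-miss
        elif prev == kind:
            hits += 1                 # re-referenced in the same kind of window
        else:
            misses += 1               # last reference was in the other kind of window
        last_kind[addr] = kind
    total = hits + misses + cold
    return hits / total, misses / total

trace = [(0x10, 'scalar'), (0x10, 'scalar'), (0x10, 'parallel'), (0x20, 'parallel')]
print(m_ratios(trace))
# -> (0.25, 0.25): half of the four references are cold starts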
3.2 Petaflops Applications
A conveniently programmable machine able to run at speeds of more than 100 TOPS will enable many important applications. Applications which require computation rates of more than 100 TOPS are hereon called "petaflops applications". The significance of this fact is best understood by describing the nature of a subset of applications that fall into two classes: time-constrained, fixed-size problems and large-scale, complex problems.
Time-constrained, fixed-size problems: Programs for time-constrained problems must execute within a given time due to system requirements, are
Seismic is another application from the SPEC benchmarks. It consists of an industrial application representative of modern seismic processing programs used in the search for oil and gas. Seismic is computationally complex and intensive. A typical execution can generate up to a terabyte of data and requires in excess of 10^18 floating-point operations. Each execution of Seismic consists of four phases: data generation, data stacking, time migration and depth migration. Phase one is reported not to require any communication, whereas phases two and three require a large amount of communication. Relative to phases two and three, phase four requires a small amount of communication. All the phases include both parallel and sequential sections.
The SPEChpc benchmarks target both current and future high-performance machines. They include scalable data sets. The largest data set currently included in the suite exhausts the resources of most available machines. These large data sets do not represent actual limitations of the codes; even larger sets are available from the code sponsors in SPEC.
Although our initial experiments used small benchmarks, we will use the petaflops applications introduced in this section to further study the different aspects of HPAM.
Given that petaflops machines will most likely use a large number of components, reducing the cost of these machines becomes a key design issue. In this section we report our preliminary findings about the cost-efficiency of HPAM machines and comment on current trends and future technology.
4.1 Preliminary Findings
We made a comparative analysis of a single-level multiprocessor with a two-level machine that results from adding a fast processor to the single level. In [AnPo 91] it was shown analytically that the two-level machine can have better cost-performance than the one-level machine. Our analysis [BeFo 96], with empirical data from the benchmarks introduced in Section 3.1 and actual hardware costs, indicates that the two-level organization is best in almost all the cases.
Table 3 shows the ratios of speedups of a two-level HPAM to a one-level HPAM. The second level of the two-level HPAM and the one-level HPAM have the same number and type of processors. The first level of the two-level HPAM is a fast uniprocessor. In [BeFo 96], it was shown that the two-level HPAM is more cost-efficient than the one-level HPAM if the ratio of their speedups is greater than 1.88, 1.09 and 1.01 for values of MDP = 10, 100 and 1000, respectively. The results of Table 3 show that these conditions are met for almost all the benchmarks and values of MDP. In some cases the gain factor is as high as
The intent of this example was to show that gains can be achieved in terms of speedup and cost-efficiency for 2-level HPAM-like machines using real processors and representative applications.
            MDP = 10   MDP = 100   MDP = 1000
CMU-Suite
Airshed       1.88       7.52        8.
t2            7.36       8.06        8.
Radar         7.81       8.14        8.
Stereo        2.92       6.62        8.
Perfect-Suite
TRFD          1.31       4.15        7.
MDG           5.36       8.14        8.
FLO52         1.02       4.65        8.
ARC2D         1.00       4.65        7.
OCEAN         6.95       7.13        7.
Table 3: Ratios of speedups of a two-level HPAM (level 1: fast uniprocessor; level 2: multiprocessor with MDP processors) to a one-level HPAM multiprocessor with MDP processors identical to the ones in the second level of the two-level HPAM. The results are shown for values of MDP = 10, 100 and 1000.
Additional improvement may be achieved by overlapping scalar and parallel computations.
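As a hedged illustration of the comparison above (the cost figures here are invented; the paper's actual hardware prices are not reproduced), treating cost-efficiency as speedup per unit cost makes the two-level machine preferable exactly when its speedup ratio exceeds its cost ratio:

def two_level_wins(speedup_ratio, cost_one_level, cost_fast_cpu):
    """speedup_ratio = speedup(two-level) / speedup(one-level)."""
    # Adding the fast uniprocessor raises the total cost by this factor.
    cost_ratio = (cost_one_level + cost_fast_cpu) / cost_one_level
    return speedup_ratio > cost_ratio

# Made-up costs chosen so the cost ratio equals 1.01, the MDP = 1000 threshold
# quoted above; the speedup ratios in the last column of Table 3 clear it easily.
print(two_level_wins(speedup_ratio=7.0, cost_one_level=500_000, cost_fast_cpu=5_000))
# -> True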
4.2 Current Trends and Future Technology
There are converging trends in the design of processors and memories that point to the future existence of chips that include both processors and memory. Examples include Processors-In-Memory (PIM) [Ko 94], Computational RAM [El et al. 92] and IRAM [Pa 95, Sh et.al 96, Sa 96]. Similar ideas were proposed as early as 1970 in [St 70]. The driving argument for these approaches is the fact that the integration of CPU and memory on the same chip brings benefits of lower latency and higher bandwidth in accessing memory that outweigh possible reductions in the complexity of the processor [Sa 96]. Since memory access latency is becoming a limiting performance factor [Jo 95, Wi 95, WuMc 95], it is reasonable to expect that future generations of commercial chips will increasingly follow this trend. In fact, there are already some examples of such chips [Sa 96, Sh et.al 96, AD 93].
Two additional trends can be observed in the work mentioned in the last paragraph. One is the inclusion of hardware support for multiprocessor architectures in the integrated CPU-memory chips [Sa 96]. This means that they can be used as the basic blocks for building PAMs. The other is the inclusion of CPU cores on DRAM devices [Sh et.al 96]. It is reasonable to expect that, within 10 years, it will be possible to effectively fabricate chips with several processors and memories (i.e., small PAMs) which can be combined in multichip modules to build (large) PAMs. Chips used to implement different PAM levels can either contain large complex CPUs and small memories or small simple CPUs (or fewer processors) and large memories. Since these chips will be used in general-purpose computing, their cost should be low enough for their use in large numbers in an HPAM machine. Furthermore, these chips can reuse existing CPU cores that are widely available as commodity parts and thus be less expensive than others that use custom designs.
Currently it is possible to build very efficient shared-memory multiprocessor systems with small numbers of processors. Within 10 years, given the above discussion and accepted predictions for semiconductor technology, it is reasonable to expect that moderate-size multiprocessors with shared memory can be integrated into one or a few chips in a cost-effective manner. However, very large shared-memory multiprocessors will continue to present design challenges and will require distributed shared memory implementations. These, in turn, will increase design cost, latencies and processing overheads that will render very large multiprocessors increasingly inefficient unless the requirement of shared memory is relaxed (and implemented in software). A very large one-level shared-memory machine capable of more than 100 Teraflops would either be too expensive (if we could build it) or would not be able to provide a friendly shared-memory environment with good performance for all levels of parallelism. The alternative of choice might be a distributed, message-passing machine. The HPAM approach attempts to provide very efficient, cost-effective shared-memory top levels and simpler-to-implement distributed-memory-based lower levels that are also cost-effective but perhaps harder to program. The nature of each level of the HPAM would also change naturally with the evolution of technology. Finally, the HPAM organization lends itself well to single-user mode or multi-user mode at different levels of HPAM. This would amortize costs across several users.
HPAM can support a programming environment that is usable by users with varying expertise in parallel processing (with proportionate performance returns) by making different levels of the hierarchy user-visible. It can leverage and allow coexistence of evolving practices in parallel programming, including optimizing compilation, data-parallel programming, message-passing and library-based approaches.
A programming environment for the HPAM machine can leverage many tools and software developed for shared-memory and distributed-memory machines. There are three possible scenarios for an HPAM programming environment. One scenario relies on automatic translation of a conventional user program into an HPAM program: the user sees only the top level of the HPAM and expects the system to run programs written in standard languages and select PAM levels automatically. The second scenario, a library-based programming approach, would allow users to compose their programs out of existing code blocks which are already optimized for the HPAM machine. The third scenario allows the user to specify how parts of a program should be executed and provide information that the HPAM system could use to allocate code to levels. The HPAM architecture is well-suited for all three scenarios.
The first scenario necessitates an automatic translation system which, ideally, is able to analyze program characteristics, the program data set, machine properties, the environment of the program execution and the dynamic machine load. It must be able to assign windows to hierarchy levels statically at compile time,
at runtime, and also facilitate dynamic migration. For window mappings, several tradeoffs have to be considered, such as parallelism versus execution speed of individual processors, inter-level versus intra-level communication costs, and the cost of migrating windows versus the load imbalance of fixed window assignments; a toy illustration of such a static mapping appears at the end of this section. The decision about window mappings can be made at compile time if sufficient information can be obtained from the source program. Decisions will be deferred to runtime when additional information on the actual data set and the current load of the machine is available.
Our implementation of the translation system will be based on the Polaris compilation system. An initial static algorithm will use Polaris' symbolic analysis capabilities to gather information from the program that leads to static mapping decisions. In a second step the compiler will be able to inject code into the program that evaluates the best mapping at runtime. Furthermore, Polaris includes an infrastructure for adaptive and speculative program analysis methods, which will be available for window definitions and mappings. These methods can use knowledge from the program execution history and execute program segments speculatively, with a possible backtracking step if the speculation failed.
The second programming scenario will facilitate the composition of programs from optimized libraries. To this end, a study and implementation of the algorithms used in the application programs in the form of library routines is needed. This involves identifying program segments that can be optimized as individual subroutines and characterizing their behavior on the HPAM architecture.
For the knowledgeable programmer, automatic translation tools are a starting point. In addition, performance instrumentation, analysis and visualization tools are needed. These tools will allow the programmer to observe the detailed program behavior and tune it to a given HPAM machine.
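The following sketch (hypothetical function and level figures, not the planned Polaris-based implementation) shows one of the simplest static mapping policies consistent with the discussion above: each window is sent to the shallowest HPAM level whose machine degree of parallelism covers the window's block parallelism, trading the fast individual processors at the top against the wider parallelism further down. A real mapper would also weigh window sizes, inter-level communication and migration costs.

def map_windows_to_levels(window_bps, level_mdp):
    """window_bps: maximum bp seen in each window.
    level_mdp: machine degree of parallelism of each HPAM level, top level first."""
    mapping = []
    for bp in window_bps:
        # first (fastest-processor) level that offers enough parallelism,
        # falling back to the bottom level for very wide windows
        level = next((i for i, mdp in enumerate(level_mdp) if mdp >= bp),
                     len(level_mdp) - 1)
        mapping.append(level)
    return mapping

# Assumed three-level HPAM: one fast CPU, a 100-way level, a 10000-way level.
print(map_windows_to_levels([1, 80, 5000, 100000], [1, 100, 10000]))
# -> [0, 1, 2, 2]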
This study presented an initial high-level description and motivation for a 100-Teraflops multilevel heterogeneous machine (HPAM). However, more detailed studies are needed to refine and analyze this design by carefully considering its technological feasibility along with the usability of the programming environment in the context of important representative applications. Future work includes understanding technology advances and how they affect an HPAM architecture. The challenge is to compare a traditional multiprocessor design based on the technology available in 10 years to an HPAM that consists of components that might span several generations of processors and memories. Furthermore, understanding how different parameters of HPAM (such as number of processors, size of memory, and memory latency per level) affect the execution of petaflops applications is a crucial component of our design strategy.