
























The Journal of Supercomputing, 1-31. © Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.
JAMES B. ARMSTRONG jba@srtc.com
Advanced Product Development, Sarnoff Real Time Corporation, 301B College Road East, Princeton, NJ 08543-5202, USA
MUTHUCUMARU MAHESWARAN maheswar@ecn.purdue.edu
MITCHELL D. THEYS theys@ecn.purdue.edu
HOWARD JAY SIEGEL hj@purdue.edu
MARK A. NICHOLS, AND KENNETH H. CASEY
Parallel Processing Laboratory, School of Electrical and Computer Engineering, 1285 Electrical Engineering Building, Purdue University, West Lafayette, IN 47907-1285, USA
Editor: Hamid R. Arabnia
Abstract. Performance of a parallel algorithm on a parallel machine depends not only on the time complexity of the algorithm, but also on how the underlying machine supports the fundamental operations used by the algorithm. This study analyzes various mappings of image correlation algorithms in SIMD, MIMD, and mixed-mode environments. Experiments were conducted on the Intel Paragon, MasPar MP-1, nCUBE 2, and PASM prototype. The machine features considered in this study include: modes of parallelism, communication/computation ratio, network topology and implementation, SIMD CU/PE overlap, and communication/computation overlap. Performance of an implementation can be enhanced by using algorithmic techniques that match the machine features. Some algorithmic techniques discussed here are additional communication versus redundant computation, data block transfers, and communication/computation overlap. The results presented are applicable to a large class of image processing tasks. Case studies, such as the one presented here, are a necessary step in developing software tools for mapping an application task onto a single parallel machine and for mapping the subtasks of an application task, or a set of independent application tasks, onto a heterogeneous suite of parallel machines.
Keywords: image correlation, Intel Paragon, MasPar MP-1, MIMD, mixed-mode, nCUBE 2, PASM prototype, scalability, SIMD.
Performance of a parallel algorithm on a parallel machine depends not only on the time complexity of the algorithm, but also on how the underlying machine supports the fundamental operations used by the algorithm. This research is an application-driven study of the trade-offs that exist when a parallel implementation is designed for a given task on a given target machine. The application considered, image correlation, is representative of a large class of window-based image processing techniques. In most low-level image processing tasks, the same set of instructions is applied to all of a two-dimensional array of picture elements (pixels) of an image, where each pixel is a grey-level value for that position in the image. Thus, most such image processing algorithms are data parallel in nature [33]. Image correlation (or image template matching) determines the "degree of similarity" between a template (i.e., a small image) and any area in the image with the same dimensions as the template.
Assume a distributed memory parallel machine with P PEs (a processing element consists of a processor and memory pair) connected via a logical mesh. The image is divided into P subimages that are distributed among the P PEs such that the PEs will need to interchange some of their pixels so the template can be fully matched at the "edges" of the subimages. In the dynamic complete sums algorithm, non-local pixels required by a PE are transferred only when they are first needed during the course of computation. As a variation of the dynamic complete sums algorithm, the complete sums algorithm performs all non-local pixel transfers before the start of any computation. In the partial sums algorithm, each PE performs all possible computations on its local data and then transfers its partial results to those PEs that require them. A mathematical study of the dynamic complete sums and partial sums algorithms is provided in [31]. The operations performed in these algorithms for image correlation are representative of a wide variety of window-based image processing tasks, such as image smoothing, image convolution, and 2-D median filtering. Consequently, the analyses presented here can be extended to a large class of data-parallel algorithms. Section 2 describes the computations involved in the image correlation process. In Section 3, a summary of related work is presented. The three algorithms for image correlation are explained in Section 4.
Due to certain trade-offs between the SIMD and MIMD modes of parallelism, some sequences of instructions are performed better in one mode than in the other [26, 30]. For example, because of the single synchronized instruction stream in an SIMD program, if-then-else clauses are serialized. This causes underutilization of PEs, because the PEs active for the "then" clause are disabled for the "else" clause and vice versa. However, due to the implicit synchronization in an SIMD program, less inter-PE communication overhead is incurred in SIMD execution. To take advantage of the benefits of both the SIMD and MIMD modes of parallelism, mixed-mode machines have been built. An SIMD/MIMD mixed-mode system can dynamically switch between the SIMD and MIMD modes of parallelism at instruction-level granularity with generally negligible overhead [30]. Examples of machines that have been built with mixed-mode capability are EXECUBE [15], MeshSP [11], OPSILA [6], PASM [27, 32], TRAC [18], and Triton/1 [22].
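The serialization of data conditionals in SIMD mode can be illustrated with a small C sketch (a schematic simulation of SIMD masking on a serial loop, not any machine's actual execution model):

/* SIMD-style data conditional (schematic).  Every PE steps through both
   clauses; a mask enables only the PEs for which the condition holds.
   In MIMD mode each PE would simply branch to one clause or the other. */
void simd_if_then_else(double *a, const int *cond, int n_pe)
{
    for (int pe = 0; pe < n_pe; pe++)      /* "then" pass: mask = cond  */
        if (cond[pe]) a[pe] = a[pe] * 2.0;
    for (int pe = 0; pe < n_pe; pe++)      /* "else" pass: mask = !cond */
        if (!cond[pe]) a[pe] = a[pe] + 1.0;
    /* elapsed time = time(then) + time(else), regardless of how many
       PEs take each clause, which is the underutilization described above */
}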
In this study, the three image correlation algorithms are implemented on four different parallel machines and the trade-offs are analyzed. The machines are: a commercial SIMD machine (MasPar MP-1 with 16K PEs), a commercial hypercube-based MIMD machine (nCUBE 2 with 64 PEs), a commercial mesh-based MIMD machine (Intel Paragon with 140 PEs), and an experimental mixed-mode prototype (PASM with 16 PEs).
The serial image correlation algorithm can be viewed as sliding the template across the input image in row-major order and computing ρ^2 for each r-by-c portion of the input image. The maximum ρ^2 value and its location are the desired output. Therefore, once a ρ^2 value is computed for a match position, it is compared to the current maximum ρ^2 value, and if it is greater, it replaces the maximum value. The ρ^2 value is undefined when either the template t or a portion of the input image y has the same value for every element. It is assumed that Stt does not equal 0. For those match positions where Syy is zero, the ρ^2 value is not computed, and thus the current maximum ρ^2 value and its position are not altered. Typically, uniform areas (i.e., where Syy is zero) correspond to background color or areas of no interest, and therefore most applications ignore (bypass) such areas.
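For reference, the quantities Stt, Sty, Syy, and ρ^2 used throughout are consistent with the standard sample-correlation forms (reconstructed here from the terms appearing in the text; the equation numbering follows the references to Equations (1)-(4) below):

    Stt = Σ_{l=0}^{r−1} Σ_{k=0}^{c−1} t[l,k]^2 − (Σ_{l=0}^{r−1} Σ_{k=0}^{c−1} t[l,k])^2 / (r·c)        (1)
    Sty = Σ_{l=0}^{r−1} Σ_{k=0}^{c−1} t[l,k]·y[l,k] − (Σ Σ t[l,k])·(Σ Σ y[l,k]) / (r·c)                (2)
    Syy = Σ_{l=0}^{r−1} Σ_{k=0}^{c−1} y[l,k]^2 − (Σ_{l=0}^{r−1} Σ_{k=0}^{c−1} y[l,k])^2 / (r·c)        (3)
    ρ^2 = Sty^2 / (Stt · Syy)                                                                           (4)

where y[l,k] denotes the image pixel under template element t[l,k] at the current match position.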
The execution time for the image correlation algorithm is dominated by the time to compute the Σ t[i,j]·y[i,j], Σ y[i,j], and Σ y[i,j]^2 values for all possible match positions. The Σ t[i,j] and Σ t[i,j]^2 values involve only the template elements and are computed once. The execution time of the image correlation algorithm is independent of the actual pixel values in the image. Therefore, in the experiments reported here, the image is initialized with randomly generated pixel values and the template is initialized by copying the pixel values from a randomly chosen template position in the image.
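To make the serial computation concrete, the following C sketch (an illustration consistent with the description above, not the authors' code; the array layout and function signature are ours) slides the template over the image, bypasses uniform areas, and tracks the maximum:

/* Serial image correlation sketch.  img is R-by-C, tmpl is r-by-c, both
   stored row-major.  Returns the best rho^2 (Equation (4)) through *best
   and its match position through (*bu, *bv). */
void correlate(const double *img, int R, int C,
               const double *tmpl, int r, int c,
               double *best, int *bu, int *bv)
{
    double st = 0.0, stt = 0.0;              /* template sums: computed once */
    for (int l = 0; l < r; l++)
        for (int k = 0; k < c; k++) {
            double t = tmpl[l*c + k];
            st += t;
            stt += t * t;
        }
    double Stt = stt - st*st/(r*c);          /* Equation (1); assumed nonzero */

    double bestp = -1.0;                     /* best rho' = Sty^2/Syy so far  */
    for (int u = 0; u <= R - r; u++)         /* slide the template row-major  */
        for (int v = 0; v <= C - c; v++) {
            double sy = 0.0, syy = 0.0, sty = 0.0;
            for (int l = 0; l < r; l++)
                for (int k = 0; k < c; k++) {
                    double y = img[(u + l)*C + (v + k)];
                    sy  += y;
                    syy += y * y;
                    sty += tmpl[l*c + k] * y;
                }
            double Syy = syy - sy*sy/(r*c);  /* Equation (3) */
            if (Syy == 0.0) continue;        /* uniform area: bypass          */
            double Sty = sty - st*sy/(r*c);  /* Equation (2) */
            double rho = Sty*Sty/Syy;        /* rho'; maximized in place of rho^2 */
            if (rho > bestp) { bestp = rho; *bu = u; *bv = v; }
        }
    *best = bestp / Stt;  /* Equation (4); assumes some non-uniform window */
}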
A square, C-by-C input image matrix I and a square, c-by-c template matrix t are considered here to compare the different algorithms presented in the reviewed papers. The serial image correlation algorithm has a complexity of O(C^2·c^2). Because template matching is an important task that is carried out in many image processing applications, a great deal of effort has been spent on developing efficient parallel algorithms for this task. Many previous papers have examined different algorithms to perform template matching and implemented these algorithms using different interconnection networks (e.g., [2, 7, 8, 16, 17, 19, 23, 24]). The work presented in our paper builds on the research described in the above papers and [31]. The major difference between the research presented here and the other papers is that the work presented here looks at the issues involved in mapping image correlation algorithms onto different machines with different modes of parallelism and different types of interconnection networks, and at how various implementations can exploit particular machine features. For these examples of related work, the differences from the research here are described further in this section.
In [23] and [24], two algorithms for image correlation on MIMD hypercube multiprocessors are described. One algorithm assumes a fine-grain MIMD hypercube (i.e., the cost of an interprocessor communication is comparable to the cost of a basic arithmetic instruction) and the other assumes a medium-grain hypercube. The work presented in [23] and [24] is different from our work because their goal was to find an efficient implementation on fine- and coarse-grained MIMD hypercubes, whereas the goal of the study presented here is to examine the issues involved in mapping the image correlation algorithms onto a variety of machines. In [7], two algorithms for SIMD hypercube computers are described, one for C^2·c^2 processors and another for C^2 processors. Because these algorithms require C^2 or more processors, they are different from the algorithms considered in this research. Also, a similar difference exists between the algorithm described in [8] and our work. Three different interconnection networks are used in [8], and an algorithm with P = C^2 is presented for each network that can solve the template matching problem in O(c^2) time. A generalized convolution algorithm for a mesh architecture is presented in [19]. This algorithm initially assumes P = C^2, and discusses P < C^2 without showing any actual algorithms. The approach in [19] differs from the one presented here because here algorithms are given for P < C^2. In [16], the authors designed "simple elegant parallel algorithms" for template matching using an SIMD hypercube. The first algorithm uses P = C^2 processors, the second algorithm uses P = C^2·c^2 processors, and the third algorithm uses P = O(C^2) processors. The approaches presented in [16] differ significantly from the one presented here because here P < C^2, and they do not embed a mesh in the hypercube. A mesh-connected array processor arrangement is used in [17] to implement a 2-D convolution scheme where P = C^2. The paper examines using diamond, rectangular, and round templates. The approach presented in [17] differs from the one presented here because here P < C^2, and here we are concerned with only rectangular templates. A parallel stereo correlation algorithm is used to study a reconfigurable multi-ring network (RMRN) in [2]. Stereo correlation is a statistical procedure that derives depth information from a pair of pictures of the same scene taken from different positions. The computation in stereo correlation is similar to that in the image correlation that is discussed in this paper. The work presented in [2] is different from that here because in [2] stereo correlation is used as an example to study the properties of the RMRN network, whereas here image correlation is used to study the trade-offs in mapping an algorithm onto different parallel machines.
4.1. Common Portion of the Parallel Algorithm Mappings
It is assumed that the P PEs are logically arranged as a √P-by-√P array of PEs (the physical interconnection of the PEs may not correspond to this). PE M, for 0 ≤ M < P, is in row m and column n of the logical array, where 0 ≤ m, n < √P and M = m·√P + n. It is assumed that √P is an integer and both R and C are multiples of √P. The input image is partitioned into P subimages (as shown in Figure 1) and each PE's subimage is initially dimensioned as an RS-by-CS matrix, where RS = R/√P and CS = C/√P. To accommodate the pixel data that is transferred into a PE from other PEs, each PE's initial subimage is extended by r − 1 rows and c − 1 columns.
Phase I consists of the parallel computation of a value for the term Stt using Equation (1). Because the equation requires template values and no input image values, Stt is computed only once. If the template's dimensions are such that r·c < P, only r·c of the PEs can participate. To start the parallel computation, each of the P PEs is assigned a template element; if r·c > P, some PEs are each assigned one or more of the remaining template elements. Next, each PE computes t^2 for each of the template elements it holds, followed by each PE computing the local sum of its t values and the local sum of its t^2 values. Finally, recursive doubling operations are used to obtain the global sums Σ t[i,j]^2 and Σ t[i,j], from which each PE can compute Stt via Equation (1).
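A recursive doubling global sum combines the P local values in log2(P) exchange steps. The C fragment below is a sketch; pe_exchange() is a hypothetical primitive standing in for the machine-specific inter-PE transfer routines, and P is assumed to be a power of two:

/* Recursive doubling global sum (sketch).  pe_exchange(partner, x) sends
   x to PE 'partner' and returns the value received from that partner;
   the actual communication call differs on each machine in this study. */
extern double pe_exchange(int partner, double x);

double global_sum(double local, int my_pe, int P)
{
    for (int d = 1; d < P; d <<= 1) {
        double remote = pe_exchange(my_pe ^ d, local);  /* swap with partner */
        local += remote;                                /* grow partial sum  */
    }
    return local;  /* after log2(P) steps, every PE holds the global sum */
}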
Equation (3) is used to calculate Syy, in parallel on all PEs, for each of the RS·CS distinct match positions. A brief description of how Syy is computed is necessary to understand the trade-offs discussed later. The terms Σ y[l,k] and Σ y[l,k]^2, referred to symbolically as ysum[i, j] and ysumsq[i, j], respectively, for match position[i, j], are computed for all match positions within each PE's RS-by-CS subimage based on the serial algorithm presented in [31]. To assist in the computation of ysum[i, j] and ysumsq[i, j], two single-array data structures, colsum and colsumsq, are used as intermediate storage. If subimage[i, j] represents the pixel at coordinate (i, j) of a PE's subimage, colsum[k] and colsumsq[k] are computed for the first row in each subimage via:

    colsum[k] = Σ_{i=0}^{r−1} subimage[i, k]   and   colsumsq[k] = Σ_{i=0}^{r−1} (subimage[i, k])^2        (5)

For subsequent rows, the column sums can be updated incrementally, and for those match positions whose windows extend beyond a PE's local subimage, the same column sums can be used with additional data transfers. For match position[i, 0], where 0 ≤ i < RS,

    ysum[i, 0] = Σ_{k=0}^{c−1} colsum[k]   and   ysumsq[i, 0] = Σ_{k=0}^{c−1} colsumsq[k]        (8)

For the remaining match positions in a row, only the data for one new column needs to be transferred. The data that is transferred differs between the dynamic complete and partial sums algorithms, as discussed in the next subsection. Once each PE has calculated ysum[i, j] and ysumsq[i, j], each PE can compute Syy for match position[i, j] via Equation (3).
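The column-sum bookkeeping can be sketched in C as follows (local pixels only; the incremental row update is our reading of the serial algorithm of [31], and CS_ext = CS + c − 1 denotes the width of the extended subimage):

/* Build colsum/colsumsq for the first window row (Equation (5)). */
void init_colsums(const double *sub, int CS_ext, int r,
                  double *colsum, double *colsumsq)
{
    for (int k = 0; k < CS_ext; k++) {
        colsum[k] = 0.0; colsumsq[k] = 0.0;
        for (int i = 0; i < r; i++) {
            double p = sub[i*CS_ext + k];
            colsum[k]   += p;
            colsumsq[k] += p * p;
        }
    }
}

/* ysum[i][0] and ysumsq[i][0] from the column sums (Equation (8)). */
void window_sums(const double *colsum, const double *colsumsq, int c,
                 double *ysum0, double *ysumsq0)
{
    *ysum0 = 0.0; *ysumsq0 = 0.0;
    for (int k = 0; k < c; k++) {
        *ysum0   += colsum[k];
        *ysumsq0 += colsumsq[k];
    }
}

/* Assumed incremental update when the window slides down one row:
   drop the pixel in row i, add the pixel in row i + r. */
void slide_down(double *colsum, double *colsumsq, const double *sub,
                int CS_ext, int i, int r)
{
    for (int k = 0; k < CS_ext; k++) {
        double out = sub[i*CS_ext + k], in = sub[(i + r)*CS_ext + k];
        colsum[k]   += in - out;
        colsumsq[k] += in*in - out*out;
    }
}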
Equation (2) is used to calculate Sty, in parallel on all PEs, for each of the RS·CS distinct match positions. The three terms Σ t[l,k], Σ y[l,k], and Σ t[l,k]·y[l,k] are computed for all match positions within each PE's RS-by-CS subimage. Of these three terms, Σ t[l,k] and Σ y[l,k] are computed as discussed above. The remaining term, Σ t[l,k]·y[l,k], is computed directly. Because the value of Stt does not change with the match position, the term ρ′ = Sty^2/Syy is computed, in parallel on all PEs, for each of the RS·CS distinct match positions. Therefore, because Stt is known to be non-negative, the match position that yields the maximum ρ′ value would yield the maximum ρ^2 value (Equation (4)). Additionally, for each match position, two data-dependent conditionals are performed. One conditional determines whether Syy is zero, and the other is used to determine if the new ρ′ value exceeds the current ρ′ maximum. Once each PE has found its local ρ′ maximum and its corresponding subimage location, a recursive doubling operation is used to find the global maximum and its corresponding input image location. To determine the location of the global maximum ρ′, a position indicator is used to encode the global coordinates into a single number to reduce inter-PE communications. For instance, if PE J's local maximum is at match position[i, j], then iglobal = ⌊J/√P⌋·RS + i and jglobal = (J mod √P)·CS + j, and the position indicator is pos = √P·CS·iglobal + jglobal. Each PE passes the position indicator along with its maximum ρ′ value to the recursive doubling routine. The routine's output is the ordered pair (ρ′, pos), where ρ′ is the global maximum and pos is its corresponding location. Finally, the coefficient of determination, ρ^2 = ρ′/Stt, and the coordinate (u, v) of the match are computed.
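The encoding and its inverse are a few lines of C (illustrative helpers; sqrtP = √P is assumed to be an integer):

/* Encode global match coordinates into a single position indicator and
   back.  sqrtP * CS is the global image width in match positions. */
long encode_pos(long i_global, long j_global, long sqrtP, long CS)
{
    return sqrtP * CS * i_global + j_global;
}

void decode_pos(long pos, long sqrtP, long CS,
                long *i_global, long *j_global)
{
    long width = sqrtP * CS;      /* match positions per global row */
    *i_global = pos / width;
    *j_global = pos % width;
}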
4.2. Unique Portions of the Mappings
4.2.1. Dynamic Complete Sums versus Complete Sums. The only difference between the dynamic complete sums and complete sums algorithms is when data transfers occur in Phase II. In both algorithms, the information transferred among PEs is pixels. In the complete sums algorithm, all non-local pixels are transferred before the template traverses the input image. The dynamic complete sums algorithm transfers the non-local pixels during the template traversal, only when they are first needed. Because the complete sums approach isolates the non-local pixel transfers from the match position computation, additional nested loops are used for cycling through the transfers, and this contributes additional loop overhead. However, by interleaving the transfers with computation in the dynamic complete sums approach, more overhead is associated with addressing the appropriate pixel to transfer and the location in which to place the received pixel. The overhead associated with both algorithms is shown in Section 5 to be offsetting. The isolation of the transfers in the complete sums approach also makes it possible to group pixels into block transfers. In the dynamic complete sums approach, by contrast, pixels are transferred one at a time and only when first needed, so no extraneous loops are generated, which minimizes loop overhead.
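The structural difference can be sketched as follows (schematic C with stub routines standing in for the real transfer and match-position work; the loop bounds are illustrative, not the authors' code):

/* Stubs standing in for the real per-pixel and per-position work. */
static void transfer_pixel(int i, int j) { (void)i; (void)j; }
static void compute_match(int i, int j)  { (void)i; (void)j; }
static int  needs_nonlocal(int i, int j) { (void)i; (void)j; return 0; }

/* Complete sums: a separate transfer phase with its own nested loops. */
void complete_sums_phase2(int RS, int CS, int r, int c)
{
    for (int i = 0; i < r - 1; i++)             /* extra loop overhead   */
        for (int j = 0; j < CS + c - 1; j++)
            transfer_pixel(RS + i, j);          /* prefetch border pixels */
    for (int i = 0; i < RS; i++)
        for (int j = 0; j < CS; j++)
            compute_match(i, j);                /* pure computation       */
}

/* Dynamic complete sums: transfers interleaved with the computation. */
void dynamic_complete_sums_phase2(int RS, int CS)
{
    for (int i = 0; i < RS; i++)
        for (int j = 0; j < CS; j++) {
            if (needs_nonlocal(i, j))
                transfer_pixel(i, j);  /* per-pixel addressing overhead */
            compute_match(i, j);
        }
}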
Figure 3. Pixel transfers from PE J + 1 to PE J. (In the figure, "+" is the pixel needed, "x" marks pixels transferred for an earlier row template position, and "o" marks pixels transferred for an earlier column template position in the same row.)
Next, consider match positions within PE J whose associated computations require the transfer of pixels from PE J + √P, i.e., each match position[i, j] for which the template extends past the bottom edge of PE J's subimage. For match position[i, 0], pixels held by PE J + √P and PE J + √P + 1 are needed. However, at this point in the algorithm, the pixels from PE J + √P + 1 have already been transferred to PE J + √P and stored as part of the subimage in PE J + √P. Thus, explicit communication with PE J + √P + 1 is not required. For each match position[i, j] where j = 0 (i.e., those match positions including column 0), it is necessary for c pixels to be transferred for the associated computations. For each match position[i, j] where j > 0 (i.e., those match positions not including column 0), only one new pixel needs to be transferred from PE J + √P, because all of the other pixels are transferred and stored during previous match position computations. Similar to Figure 3, the example shown in Figure 4 uses an r = 5 and c = 4 template.
4.2.2. Dynamic Complete Sums versus Partial Sums. The partial sums algorithm stands in contrast to the dynamic complete sums and complete sums algorithms in the number of transfers, the information transferred, and the amount of computation done in Phase II. Figure 2(b) shows the PEs from which PE J receives data in the partial sums algorithm. In general, the partial sums algorithm does fewer additions and multiplications but more transfers than the other two algorithms. At each match position, the sums Σ t[i,j]·y[i,j], Σ y[i,j], and Σ y[i,j]^2 are computed and used to compute Sty and Syy. In the dynamic complete sums algorithm, pixels were transferred to compute these sums for match positions whose computations required pixels not in the local PE memory.
Figure 4. Pixel transfers from PE J + √P to PE J. (In the figure, "+" is the pixel needed, "x" marks pixels transferred for an earlier row template position, and "o" marks pixels transferred for an earlier column template position in the same row.)
However, in the partial sums algorithm, each PE computes as much of these summations as possible with the pixels it contains in its local subimage. For those match positions whose associated computations require pixels from an adjacent PE, partial sums, rather than unprocessed pixels, are received from the adjacent PEs. Specifically, PE J receives a partial sum from PE J + 1 to complete the computations associated with match positions near the right edge of its subimage. Figure 5 depicts an example of this process for an r = 5 and c = 4 template. For each of the sums above, the pixels marked by x's are used by PE J + 1 to compute its partial sum, and those pixels denoted by +'s are used by PE J to compute its partial sum. PE J + 1 sends its x partial sum to PE J, which adds it to its + partial sum. In addition, for the Σ y[l,k] and Σ y[l,k]^2 sums, PE J + 1 uses its x partial sum (together with its pixels labeled o in Figure 5) to form its total sums for match position[i, 0] in PE J + 1.
Figure 5. Partial sums computed by PE J and PE J + 1 for an r = 5, c = 4 template. (In the figure, "+" marks pixels summed by PE J and used by PE J only; "x" marks pixels summed by PE J + 1 and used by both PE J and PE J + 1; "o" marks pixels summed by PE J + 1 and used by PE J + 1 only.)

In this way, PE J and PE J + 1 each obtain Σ_{l=0}^{r−1} Σ_{k=0}^{c−1} t[l,k]·y[l,k] and Σ_{l=0}^{r−1} Σ_{k=0}^{c−1} y[l,k] for their respective match positions.
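In outline, the boundary computation might look as follows in C (an illustrative sketch; the names and the column-split parameters are ours):

typedef struct { double ty, y, ysq; } PartialSums;  /* sums of t*y, y, y^2 */

/* The part of one window's three sums that lies in this PE's subimage.
   sub is the extended subimage (row width CS_ext), tmpl is r-by-c;
   (i0, j0) is the window origin locally, and this PE holds template
   columns k_first .. k_first + k_count - 1 of the window. */
PartialSums window_part(const double *sub, int CS_ext,
                        const double *tmpl, int r, int c,
                        int i0, int j0, int k_first, int k_count)
{
    PartialSums p = {0.0, 0.0, 0.0};
    for (int l = 0; l < r; l++)
        for (int k = 0; k < k_count; k++) {
            double y = sub[(i0 + l)*CS_ext + (j0 + k)];
            p.ty  += tmpl[l*c + (k_first + k)] * y;
            p.y   += y;
            p.ysq += y * y;
        }
    return p;
}

/* PE J: fold in the three partial sums received from PE J + 1, in place
   of receiving the raw boundary pixels themselves. */
PartialSums combine(PartialSums mine, PartialSums from_right)
{
    mine.ty  += from_right.ty;
    mine.y   += from_right.y;
    mine.ysq += from_right.ysq;
    return mine;
}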
In general, the portion of the three sums that is computed by PE J + 1 is sent to PE J as three partial sums per boundary match position. Thus, the total number of data items (partial sums) transferred by the partial sums algorithm is three times the number of data items (pixels) transferred by the dynamic complete sums algorithm [31]. However, because the dynamic complete sums algorithm performs some redundant computations when computing the sums Σ_{l} Σ_{k} y[l,k] and Σ_{l} Σ_{k} y[l,k]^2 for match positions at the "edge" of each subimage, the partial sums algorithm performs fewer additions and multiplications. Table 1 compares the algorithm complexities of the dynamic complete sums and partial sums algorithms for square images (R = C and RS = CS) and square templates (r = c).
Table 1. Operation count comparison between the dynamic complete sums and partial sums algorithms [31].

    Operation Count              Dynamic Complete Sums Algorithm    Partial Sums Algorithm
    Number of Additions          7C^2/P + 8Cc/√P                    7C^2/P + 8Cc/√P
    Number of Multiplications    (C/√P)…                            …
    Number of Value Transfers    …                                  …
For all these algorithms, by increasing the size of the input image while the template size and the number of PEs remain fixed, the amount of computation increases much faster than the amount of inter-PE communication. This is because the per-PE computation grows with the area of each subimage (O(C^2/P) match positions), whereas the inter-PE communication grows only with the subimage boundary (O(C/√P)).
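For illustration, using the recovered addition count from Table 1 with P = 16 and c = 8: doubling the image dimension from C = 256 to C = 512 quadruples the interior term 7C^2/P (7·256^2/16 = 28,672 to 7·512^2/16 = 114,688 additions) but only doubles the boundary term 8Cc/√P (8·256·8/4 = 4,096 to 8,192). Inter-PE communication, which scales with the subimage boundary, grows at this slower, linear rate.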
5.1. Overview of PASM
The PASM small-scale proof-of-concept prototype is a distributed memory, partitionable, mixed-mode machine with 16 (MC68000-based) PEs in the computational engine [27, 32]. In SIMD mode, it is assumed that a control unit (CU) broadcasts instructions and common data to the PEs. In MIMD mode, the PEs independently execute the programs located within their local memories. The inter-PE communication is performed via a circuit-switched "extra stage cube" multistage interconnection network [29]. The algorithms for PASM were coded using a combination of a C language compiler, AWK scripts (for pre- and post-processing), and library routines for data conditionals, inter-PE data transfers, and data transfers between the CU and PEs. The absolute execution times from the small-scale PASM prototype are very slow compared to current workstations; however, for this research, comparative times among different PASM implementations are the focus.
The SIMD CU in PASM includes a fetch unit (FU). In SIMD mode, the CU CPU initiates the parallel computation by instructing the FU to send blocks of SIMD code from the FU memory (which contains the SIMD code) to the FU queue. Once in the FU queue, each SIMD instruction is broadcast to all PEs. While the FU is enqueuing and broadcasting SIMD instructions to the PEs and the PEs are executing instructions, the CU CPU can be performing its own computations; this property is called CU/PE overlap [14]. Switching between SIMD mode and MIMD mode on PASM is handled by dividing the PEs' logical address space into an MIMD address space, where the PEs access their own local memory, and an SIMD address space, where the PE memory requests are satisfied by the FU broadcasting the SIMD instructions. Switching from SIMD to MIMD is implemented by broadcasting to the PEs a branch instruction to MIMD space, while switching from MIMD to SIMD is implemented by all PEs independently branching from MIMD to SIMD space. Changing execution modes changes the source of instructions to execute; all stored information (memory, registers, processor state, etc.) is unaffected. The algorithms were run on 16 PEs, but the results can be directly extrapolated to large-scale systems. The time to execute Phase II in SIMD mode would be exactly the same for a larger system as long as the subimages are RS-by-CS (which is a function of R, C, and P); in MIMD mode, the results would be approximately the same, aside from the added synchronization overhead when more PEs are used. The execution times of Phases I and III are generally negligible, O(log P), compared to Phase II. Therefore, the results are applicable to large-scale systems.
5.2. Comparison of Algorithms
Figure 6(a) shows template size versus total execution time (computation time plus communication time) for the SIMD and MIMD mode complete sums and dynamic complete sums mappings. Figure 6(b) compares the communication times of the two algorithms (for each mode). The complete sums algorithm requires additional nested loops to cycle through Phase II's inter-PE transfers. The loop overhead incurred by these additional nested loops in the complete sums algorithm is hidden in SIMD mode due to CU/PE overlap, but is apparent in MIMD mode. This is the reason why in MIMD mode the communication time for the dynamic complete sums algorithm is less than the communication time for the complete sums algorithm, while in SIMD mode the communication times are equal. For MIMD mode, the communication time difference between the complete sums and dynamic complete sums algorithms does not impact the total execution times of the two algorithms, because the communication time is negligible.

Figure 7(a) compares the performance of the SIMD and MIMD mode implementations of the partial sums and dynamic complete sums algorithms. Figure 7(b) compares the communication times of these two algorithms (for each mode). The total number of data items (partial sums) transferred by the partial sums algorithm is three times the number of data items (pixels) transferred by the dynamic complete sums algorithm. In MIMD mode, as with SIMD mode, once a path is established, it takes less time to transfer a data item between PEs than the time taken for an addition or multiplication operation. Thus, in MIMD mode the execution time difference (which is very small) reduces with increasing template sizes. The results presented in Figure 7 show that there is a difference in the comparative performance of the various algorithms because of the mode of parallelism used; i.e., the crossover points in the performance of the dynamic complete and partial sums algorithms differ in the SIMD and MIMD modes of parallelism. Thus, the scalability of an algorithm is affected by the mode of parallelism employed.

The SIMD/MIMD trade-off for communication synchronization does not have a significant impact on the total execution time of the image correlation algorithms considered in this work for the machine size and data sizes used. Despite the small impact on total execution time, communication overhead can be seen to be significantly greater in MIMD mode than in SIMD mode. Figure 7(b) shows the communication time for the partial sums algorithm in the SIMD and MIMD modes. As can be seen from the graph, the overhead associated with the MIMD transfers is approximately five times greater than that associated with SIMD transfers for the template sizes shown. This extra overhead is associated with the software protocols and synchronization required for the PEs to communicate in MIMD mode.

Figure 8 shows the performance of the SIMD mode and mixed-mode implementations of the complete sums, dynamic complete sums, and partial sums algorithms. For the mixed-mode implementation of the dynamic complete sums and partial sums algorithms, Phase I and Phase III are performed in SIMD mode to avoid synchronization overhead during inter-PE communication, and Phase II uses both SIMD mode and MIMD mode: MIMD mode for conditionals and SIMD mode for everything else. In the complete sums algorithm, the mixed-mode mapping performed Phase I, Phase III, and the transfers in Phase II in SIMD mode, but the rest of Phase II in MIMD mode. This mapping was chosen to measure the impact of CU/PE overlap in Phase II. From Figure 8(a), it can be observed that the execution time of the mixed-mode dynamic complete sums algorithm is 4% better than the execution time of the SIMD dynamic complete sums algorithm. This performance difference is due to the conditional masking scheme used in PASM to perform data-conditional statements in SIMD mode. As can be seen from Figure 8(b), the SIMD complete sums algorithm outperforms the mixed-mode complete sums algorithm by over 14%. Recall that the mixed-mode complete sums mapping executes the computations in Phase II in MIMD mode. Hence, the 14% performance difference between the SIMD and mixed-mode mappings is due to the amount of CU/PE overlap. This SIMD mode performance advantage is gained by having the CU execute the loop index increment and compare operations, as well as some array addressing computation, while the PEs execute the data-dependent operations and other array addressing computations. A quantitative analysis of CU/PE overlap is given in the next subsection.

From the analysis presented in this subsection, the SIMD algorithms perform better than the MIMD algorithms.
Figure 8. Varying image size on PASM with a 7-by-7 template for the SIMD mode and mixed-mode mappings; both panels plot execution time (seconds) against image dimension (= R = C, from 64 to 512). Panel (a) compares the execution times for the dynamic complete sums (DCS) and partial sums (PS) algorithms; panel (b) compares the execution times for the complete sums (CS) algorithm.
For the dynamic complete sums and partial sums algorithms, for r = c = 16 and RS = CS = 20, SIMD implementations perform more than 26% better (Figure 7(a)). In addition, the dynamic complete and partial sums mixed-mode algorithms are representative of an SIMD mode implementation of the respective algorithms with an improved data-conditional masking scheme, which would further improve SIMD performance. Only three data-conditional statements are necessary (comparing the ρ′ value with the current maximum ρ′ value, checking if Syy equals zero, and checking if PEs at the "edge" of the logical grid should process the match position); thus the advantage for MIMD mode owing to these conditionals is nominal. SIMD mode has the significant advantages of CU/PE computational overlap (approximately 14%) and more efficient inter-PE communication.

Of all the implementations tested, the mixed-mode partial sums algorithm had the best execution time. Recall that the mixed-mode partial sums algorithm executes the entire algorithm, except the data conditionals, in SIMD mode. The mixed-mode partial sums algorithm performs best for two reasons. One is the low communication/computation ratio of PASM, which makes it beneficial to perform additional communications instead of additional computations. The other is that the mixed-mode partial sums algorithm benefits from the SIMD CU/PE overlap, SIMD communications, and MIMD data-conditional constructs.
5.3. An Empirical Examination of the Effects of CU/PE Overlap
When CU/PE overlap occurs, the total execution time for a program is measured from the start of execution to the time when both the PEs and the CU have completed their execution. There may be an unequal amount of work on the CU and the PEs, causing one to become idle. This subsection uses simplified representative code segments to examine these effects. In the code segment of Figure 9, except for 8 μs spent receiving variable i from the CU, the PEs remain idle until the CU has finished enqueuing the instructions for the broadcast j statement; hence, for Figure 9, start = 50 μs.
For the code segment in Figure 10, CU > PE, and the CU is active except for the end time period during the last iteration of the for-loop. Hence, the overall execution time is CU + end. Specifically, assuming that execution time is measured beginning upon entry of the outer loop (i.e., not including the pointer initialization), CU = (74c + 90)r + 12 and PE = 50cr. If r = c = 7, then CU = 4268 μs and PE = 2450 μs.
CU                                                   PE
32  tptr = tbase + c*i;    /* increment row ptrs */
32  Iptr = Ibase + Cs*i;
10  tptr = tptr + 1;       /* increment column ptrs */
10  Iptr = Iptr + 1;
17  broadcast tptr;        /* send ptrs from */        8
17  broadcast Iptr;        /*   CU to PEs   */         8
 6  simdbegin              /* broadcast SIMD block */
      tysum = tysum + t[tptr]*Is[Iptr];               34
    simdend

Figure 10. SIMD pseudocode segment that overworks the CU (per-statement times in μs; CU times at left, PE times at right).
Consider the final iteration of the code segment given in Figure 10. The final PE activity begins when the CU has finished placing the broadcast j PE instructions on the instruction queue. After this point, the PEs require 8 μs to read and execute the broadcast i instructions (to receive i from the CU). Concurrently, the CU needs 6 μs to place the next block of SIMD instructions into the instruction queue. Once the PEs have completed the broadcast j instructions, they execute the next block of SIMD instructions. Hence, they remain active for another 34 μs. Meanwhile, once the CU finishes placing the next block of SIMD instructions in the instruction queue, it increments the inner loop control variable, j, and tests true for the end-of-loop test (this is the final iteration); likewise for the outer loop. The CU is then finished with the code segment. Thus, for Figure 10, end = 8 μs.
From the above results, the execution time of the code segment in Figure 9 is given by PE + start = 4858 + 50 = 4908 μs, and the execution time of the code segment in Figure 10 is given by CU + end = 4268 + 8 = 4276 μs. Consequently, by performing the array index calculations on the CU instead of on the PEs, execution time can be reduced by approximately 13%.
CU/PE overlap can be maximized by achieving a workload balance between the CU and the PEs. Figure 11 modifies the code in Figure 10 such that two pointer arithmetic statements are migrated from the CU to the PEs. Assuming that execution time is measured upon entry of the outer loop, CU = (20c + 124)r + 12 and PE = (54c + 16)r. If r = c = 7, CU = 1860 μs and PE = 2758 μs. The execution time of the code segment in Figure 11 is PE + start = 2758 + 112 = 2870 μs. Therefore, by optimizing CU/PE overlap, the execution time is reduced from 4908 μs (Figure 9) to 2870 μs (about 42%).
CU                                                   PE
32  tptr = tbase + c*i;    /* increment row ptrs */
32  Iptr = Ibase + Cs*i;
17  broadcast tptr;        /* send ptrs from */        8
17  broadcast Iptr;        /*   CU to PEs   */         8
 6  simdbegin              /* broadcast SIMD block */
      tptr = tptr + 1;     /* increment column ptrs */ 10
      Iptr = Iptr + 1;                                  10
      tysum = tysum + t[tptr]*Is[Iptr];                 34
    simdend

Figure 11. SIMD pseudocode segment optimizing CU/PE overlap (per-statement times in μs; CU times at left, PE times at right).
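The overlap arithmetic above can be checked mechanically; the following small C program (our own driver, using the per-segment formulas and the start/end constants quoted in the text) reproduces the totals:

#include <stdio.h>

/* Evaluate the CU and PE busy times (in microseconds) quoted in the text
   for the Figure 10 and Figure 11 code segments, and the overlapped totals. */
int main(void)
{
    int r = 7, c = 7;
    int cu10 = (74*c + 90)*r + 12, pe10 = 50*c*r;        /* Figure 10 */
    int cu11 = (20*c + 124)*r + 12, pe11 = (54*c + 16)*r; /* Figure 11 */
    printf("Fig. 10: CU=%d PE=%d total=CU+end=%d\n",  cu10, pe10, cu10 + 8);
    printf("Fig. 11: CU=%d PE=%d total=PE+start=%d\n", cu11, pe11, pe11 + 112);
    return 0;  /* prints 4268/2450/4276 and 1860/2758/2870 */
}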
If the loop in Figure 11 were executed in MIMD mode, the broadcast, simdbegin, and simdend instructions would not be needed. Therefore, the time to execute the loop in MIMD mode is (68c + 90)r + 12. For r = c = 7, this time is 3982 μs. Therefore, the SIMD mode version of the loop (2870 μs) is about 28% faster than the MIMD version, because the CU overlaps the loop control and pointer arithmetic with the computation. This is an enormous advantage of an SIMD implementation of this code segment. For the complete sums algorithm, the experimental results in Figure 8(b) contrast the total execution time of the SIMD implementation versus a mixed-mode implementation designed just to show the CU/PE overlap (performed