
























The Journal of Supercomputing, 1-31. © Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.
JAMES B. ARMSTRONG jba@srtc.com
Advanced Product Development, Sarnoff Real Time Corporation, 301B College Road East, Princeton, NJ 08543-5202, USA
MUTHUCUMARU MAHESWARAN maheswar@ecn.purdue.edu
MITCHELL D. THEYS theys@ecn.purdue.edu
HOWARD JAY SIEGEL hj@purdue.edu
MARK A. NICHOLS, AND KENNETH H. CASEY
Parallel Processing Laboratory, School of Electrical and Computer Engineering, 1285 Electrical Engineering Building, Purdue University, West Lafayette, IN 47907-1285, USA
Editor: Hamid R. Arabnia
Abstract. Performance of a parallel algorithm on a parallel machine depends not only on the time complexity of the algorithm, but also on how the underlying machine supports the fundamental operations used by the algorithm. This study analyzes various mappings of image correlation algorithms in SIMD, MIMD, and mixed-mode environments. Experiments were conducted on the Intel Paragon, MasPar MP-1, nCUBE 2, and PASM prototype. The machine features considered in this study include: modes of parallelism, communication/computation ratio, network topology and implementation, SIMD CU/PE overlap, and communication/computation overlap. Performance of an implementation can be enhanced by using algorithmic techniques that match the machine features. Some algorithmic techniques discussed here are additional communication versus redundant computation, data block transfers, and communication/computation overlap. The results presented are applicable to a large class of image processing tasks. Case studies, such as the one presented here, are a necessary step in developing software tools for mapping an application task onto a single parallel machine and for mapping the subtasks of an application task, or a set of independent application tasks, onto a heterogeneous suite of parallel machines.
Keywords: image correlation, Intel Paragon, MasPar MP-1, MIMD, mixed-mode, nCUBE 2, PASM prototype, scalability, SIMD.
Performance of a parallel algorithm on a parallel machine depends not only on the time complexity of the algorithm, but also on how the underlying machine supports the fundamental operations used by the algorithm. This research is an application-driven study of the trade-offs that exist when a parallel implementation is designed for a given task on a given target machine. The application considered, image correlation, is representative of a large class of window-based image processing techniques. In most low-level image processing tasks, the same set of instructions is applied to all of a two-dimensional array of picture elements (pixels) of an image, where each pixel is a grey-level value for that position in the image. Thus, most such image processing algorithms are data parallel in nature [33]. Image correlation (or image template matching) determines the "degree of similarity" between a template (i.e., a small image) and any area in the image with the same dimensions as the template.
Assume a distributed memory parallel machine with P PEs (a processing element consists of a processor and memory pair) connected via a logical mesh. The image is divided into P subimages that are distributed among the P PEs such that the PEs will need to interchange some of their pixels so the template can be fully matched at the "edges" of the subimages. In the dynamic complete sums algorithm, non-local pixels required by a PE are transferred only when they are first needed during the course of computation. As a variation of the dynamic complete sums algorithm, the complete sums algorithm performs all non-local pixel transfers before the start of any computation. In the partial sums algorithm, each PE performs all possible computations on its local data and then transfers its partial results to those PEs that require them. A mathematical study of the dynamic complete sums and partial sums algorithms is provided in [31]. The operations performed in these algorithms for image correlation are representative of a wide variety of window-based image processing tasks, such as image smoothing, image convolution, and 2-D median filtering. Consequently, the analyses presented here can be extended to a large class of data-parallel algorithms. Section 2 describes the computations involved in the image correlation process. In Section 3, a summary of related work is presented. The three algorithms for image correlation are explained in Section 4.
Due to certain trade-offs between the SIMD and MIMD modes of parallelism, some sequences of instructions are performed better in one mode than in the other [26, 30]. For example, because of the single synchronized instruction stream in an SIMD program, if-then-else clauses are serialized. This causes underutilization of PEs, because the PEs active for the "then" clause are disabled for the "else" clause and vice versa. However, due to the implicit synchronization in an SIMD program, less inter-PE communication overhead is incurred in SIMD execution. To take advantage of the benefits of both the SIMD and MIMD modes of parallelism, mixed-mode machines have been built. An SIMD/MIMD mixed-mode system can dynamically switch between the SIMD and MIMD modes of parallelism at instruction-level granularity with generally negligible overhead [30]. Examples of machines that have been built with mixed-mode capability are EXECUBE [15], MeshSP [11], OPSILA [6], PASM [27, 32], TRAC [18], and Triton/1 [22].
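The serialization of data conditionals in SIMD mode can be illustrated with a small C sketch (a schematic simulation of SIMD masking on a serial loop, not any machine's actual execution model):

/* SIMD-style data conditional (schematic).  Every PE steps through both
   clauses; a mask enables only the PEs for which the condition holds.
   In MIMD mode each PE would simply branch to one clause or the other. */
void simd_if_then_else(double *a, const int *cond, int n_pe)
{
    for (int pe = 0; pe < n_pe; pe++)      /* "then" pass: mask = cond  */
        if (cond[pe]) a[pe] = a[pe] * 2.0;
    for (int pe = 0; pe < n_pe; pe++)      /* "else" pass: mask = !cond */
        if (!cond[pe]) a[pe] = a[pe] + 1.0;
    /* elapsed time = time(then) + time(else), regardless of how many
       PEs take each clause, which is the underutilization described above */
}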
In this study, the three image correlation algorithms are implemented on four different parallel machines and the trade-offs are analyzed. The machines are: a commercial SIMD machine (MasPar MP-1 with 16K PEs), a commercial hypercube-based MIMD machine (nCUBE 2 with 64 PEs), a commercial mesh-based MIMD machine (Intel Paragon with 140 PEs), and an experimental mixed-mode prototype (PASM with 16 PEs).
The serial image correlation algorithm can be viewed as sliding the template across the input image in row-major order and computing ρ^2 for each r-by-c portion of the input image. The maximum ρ^2 value and its location are the desired output. Therefore, once a ρ^2 value is computed for a match position, it is compared to the current maximum ρ^2 value, and if it is greater, it replaces the maximum value. The ρ^2 value is undefined when either the template t or a portion of the input image y has the same value for every element. It is assumed that Stt does not equal 0. For those match positions where Syy is zero, the ρ^2 value is not computed, and thus the current maximum ρ^2 value and its position are not altered. Typically, uniform areas (i.e., where Syy is zero) correspond to background color or areas of no interest, and therefore most applications ignore (bypass) such areas.
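For reference, the quantities Stt, Sty, Syy, and ρ^2 used throughout are consistent with the standard sample-correlation forms (reconstructed here from the terms appearing in the text; the equation numbering follows the references to Equations (1)-(4) below):

    Stt = Σ_{l=0}^{r−1} Σ_{k=0}^{c−1} t[l,k]^2 − (Σ_{l=0}^{r−1} Σ_{k=0}^{c−1} t[l,k])^2 / (r·c)        (1)
    Sty = Σ_{l=0}^{r−1} Σ_{k=0}^{c−1} t[l,k]·y[l,k] − (Σ Σ t[l,k])·(Σ Σ y[l,k]) / (r·c)                (2)
    Syy = Σ_{l=0}^{r−1} Σ_{k=0}^{c−1} y[l,k]^2 − (Σ_{l=0}^{r−1} Σ_{k=0}^{c−1} y[l,k])^2 / (r·c)        (3)
    ρ^2 = Sty^2 / (Stt · Syy)                                                                           (4)

where y[l,k] denotes the image pixel under template element t[l,k] at the current match position.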
The execution time for the image correlation algorithm is dominated by the time to compute the Σ t[i,j]·y[i,j], Σ y[i,j], and Σ y[i,j]^2 values for all possible match positions. The Σ t[i,j] and Σ t[i,j]^2 values involve only the template elements and are computed once. The execution time of the image correlation algorithm is independent of the actual pixel values in the image. Therefore, in the experiments reported here, the image is initialized with randomly generated pixel values and the template is initialized by copying the pixel values from a randomly chosen template position in the image.
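To make the serial computation concrete, the following C sketch (an illustration consistent with the description above, not the authors' code; the array layout and function signature are ours) slides the template over the image, bypasses uniform areas, and tracks the maximum:

/* Serial image correlation sketch.  img is R-by-C, tmpl is r-by-c, both
   stored row-major.  Returns the best rho^2 (Equation (4)) through *best
   and its match position through (*bu, *bv). */
void correlate(const double *img, int R, int C,
               const double *tmpl, int r, int c,
               double *best, int *bu, int *bv)
{
    double st = 0.0, stt = 0.0;              /* template sums: computed once */
    for (int l = 0; l < r; l++)
        for (int k = 0; k < c; k++) {
            double t = tmpl[l*c + k];
            st += t;
            stt += t * t;
        }
    double Stt = stt - st*st/(r*c);          /* Equation (1); assumed nonzero */

    double bestp = -1.0;                     /* best rho' = Sty^2/Syy so far  */
    for (int u = 0; u <= R - r; u++)         /* slide the template row-major  */
        for (int v = 0; v <= C - c; v++) {
            double sy = 0.0, syy = 0.0, sty = 0.0;
            for (int l = 0; l < r; l++)
                for (int k = 0; k < c; k++) {
                    double y = img[(u + l)*C + (v + k)];
                    sy  += y;
                    syy += y * y;
                    sty += tmpl[l*c + k] * y;
                }
            double Syy = syy - sy*sy/(r*c);  /* Equation (3) */
            if (Syy == 0.0) continue;        /* uniform area: bypass          */
            double Sty = sty - st*sy/(r*c);  /* Equation (2) */
            double rho = Sty*Sty/Syy;        /* rho'; maximized in place of rho^2 */
            if (rho > bestp) { bestp = rho; *bu = u; *bv = v; }
        }
    *best = bestp / Stt;  /* Equation (4); assumes some non-uniform window */
}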
A square, C-by-C input image matrix I and a square, c-by-c template matrix t are considered here to compare the different algorithms presented in the reviewed papers. The serial image correlation algorithm has a complexity of O(C^2·c^2). Because template matching is an important task that is carried out in many image processing applications, a great deal of effort has been spent on developing efficient parallel algorithms for this task. Many previous papers have examined different algorithms to perform template matching and implemented these algorithms using different interconnection networks (e.g., [2, 7, 8, 16, 17, 19, 23, 24]). The work presented in our paper builds on the research described in the above papers and [31]. The major difference between the research presented here and the other papers is that the work presented here looks at the issues involved in mapping image correlation algorithms onto different machines with different modes of parallelism and different types of interconnection networks, and at how various implementations can exploit particular machine features. For these examples of related work, the differences from the research here are described further in this section.
In [23] and [24], two algorithms for image correlation on MIMD hypercube multiprocessors are described. One algorithm assumes a fine-grain MIMD hypercube (i.e., the cost of an interprocessor communication is comparable to the cost of a basic arithmetic instruction) and the other assumes a medium-grain hypercube. The work presented in [23] and [24] is different from our work because their goal was to find an efficient implementation on fine- and coarse-grained MIMD hypercubes, whereas the goal of the study presented here is to examine the issues involved in mapping the image correlation algorithms onto a variety of machines. In [7], two algorithms for SIMD hypercube computers are described, one for C^2·c^2 processors and another for C^2 processors. Because these algorithms require C^2 or more processors, they are different from the algorithms considered in this research. Also, a similar difference exists between the algorithm described in [8] and our work. Three different interconnection networks are used in [8], and an algorithm with P = C^2 is presented for each network that can solve the template matching problem in O(c^2) time. A generalized convolution algorithm for a mesh architecture is presented in [19]. This algorithm initially assumes P = C^2, and discusses P < C^2 without showing any actual algorithms. The approach in [19] differs from the one presented here because here algorithms are given for P < C^2. In [16], the authors designed "simple elegant parallel algorithms" for template matching using an SIMD hypercube. The first algorithm uses P = C^2 processors, the second algorithm uses P = C^2·c^2 processors, and the third algorithm uses P = O(C^2) processors. The approaches presented in [16] differ significantly from the one presented here because here P < C^2, and they do not embed a mesh in the hypercube. A mesh-connected array processor arrangement is used in [17] to implement a 2-D convolution scheme where P = C^2. The paper examines using diamond, rectangular, and round templates. The approach presented in [17] differs from the one presented here because here P < C^2, and here we are concerned with only rectangular templates. A parallel stereo correlation algorithm is used to study a reconfigurable multi-ring network (RMRN) in [2]. Stereo correlation is a statistical procedure that derives depth information from a pair of pictures of the same scene taken from different positions. The computation in stereo correlation is similar to that in the image correlation that is discussed in this paper. The work presented in [2] is different from that here because in [2] stereo correlation is used as an example to study the properties of the RMRN network, whereas here image correlation is used to study the trade-offs in mapping an algorithm onto different parallel machines.
4.1. Common Portion of the Parallel Algorithm Mappings
It is assumed that the P PEs are logically arranged as a √P-by-√P array of PEs (the physical interconnection of the PEs may not correspond to this). PE M, for 0 ≤ M < P, is in row m and column n of the logical array, where 0 ≤ m, n < √P and M = m·√P + n. It is assumed that √P is an integer and both R and C are multiples of √P. The input image is partitioned into P subimages (as shown in Figure 1) and each PE's subimage is initially dimensioned as an RS-by-CS matrix, where RS = R/√P and CS = C/√P. To accommodate the pixel data that is transferred into a PE from other PEs, each PE's initial subimage is extended by r − 1 rows and c − 1 columns.
Phase I consists of the parallel computation of a value for the term Stt using Equation (1). Because the equation requires template values and no input image values, Stt is computed only once. If the template's dimensions are such that r·c < P, only r·c of the PEs can participate. To start the parallel computation, each of the P PEs is assigned a template element; if r·c > P, some PEs are each assigned one or more of the remaining template elements. Next, each PE computes t^2 for each of the template elements it holds, followed by each PE computing the local sum of its t values and the local sum of its t^2 values. Finally, recursive doubling operations are used to obtain the global sums Σ t[i,j]^2 and Σ t[i,j], from which each PE can compute Stt via Equation (1).
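A recursive doubling global sum combines the P local values in log2(P) exchange steps. The C fragment below is a sketch; pe_exchange() is a hypothetical primitive standing in for the machine-specific inter-PE transfer routines, and P is assumed to be a power of two:

/* Recursive doubling global sum (sketch).  pe_exchange(partner, x) sends
   x to PE 'partner' and returns the value received from that partner;
   the actual communication call differs on each machine in this study. */
extern double pe_exchange(int partner, double x);

double global_sum(double local, int my_pe, int P)
{
    for (int d = 1; d < P; d <<= 1) {
        double remote = pe_exchange(my_pe ^ d, local);  /* swap with partner */
        local += remote;                                /* grow partial sum  */
    }
    return local;  /* after log2(P) steps, every PE holds the global sum */
}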
Equation (3) is used to calculate Syy, in parallel on all PEs, for each of the RS·CS distinct match positions. A brief description of how Syy is computed is necessary to understand the trade-offs discussed later. The terms Σ y[l,k] and Σ y[l,k]^2, referred to symbolically as ysum[i, j] and ysumsq[i, j], respectively, for match position[i, j], are computed for all match positions within each PE's RS-by-CS subimage based on the serial algorithm presented in [31]. To assist in the computation of ysum[i, j] and ysumsq[i, j], two single-array data structures, colsum and colsumsq, are used as intermediate storage. If subimage[i, j] represents the pixel at coordinate (i, j) of a PE's subimage, colsum[k] and colsumsq[k] are computed for the first row in each subimage via:

    colsum[k] = Σ_{i=0}^{r−1} subimage[i, k]   and   colsumsq[k] = Σ_{i=0}^{r−1} (subimage[i, k])^2        (5)

For subsequent rows, the column sums can be updated incrementally, and for those match positions whose windows extend beyond a PE's local subimage, the same column sums can be used with additional data transfers. For match position[i, 0], where 0 ≤ i < RS,

    ysum[i, 0] = Σ_{k=0}^{c−1} colsum[k]   and   ysumsq[i, 0] = Σ_{k=0}^{c−1} colsumsq[k]        (8)

For the remaining match positions in a row, only the data for one new column needs to be transferred. The data that is transferred differs between the dynamic complete and partial sums algorithms, as discussed in the next subsection. Once each PE has calculated ysum[i, j] and ysumsq[i, j], each PE can compute Syy for match position[i, j] via Equation (3).
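The column-sum bookkeeping can be sketched in C as follows (local pixels only; the incremental row update is our reading of the serial algorithm of [31], and CS_ext = CS + c − 1 denotes the width of the extended subimage):

/* Build colsum/colsumsq for the first window row (Equation (5)). */
void init_colsums(const double *sub, int CS_ext, int r,
                  double *colsum, double *colsumsq)
{
    for (int k = 0; k < CS_ext; k++) {
        colsum[k] = 0.0; colsumsq[k] = 0.0;
        for (int i = 0; i < r; i++) {
            double p = sub[i*CS_ext + k];
            colsum[k]   += p;
            colsumsq[k] += p * p;
        }
    }
}

/* ysum[i][0] and ysumsq[i][0] from the column sums (Equation (8)). */
void window_sums(const double *colsum, const double *colsumsq, int c,
                 double *ysum0, double *ysumsq0)
{
    *ysum0 = 0.0; *ysumsq0 = 0.0;
    for (int k = 0; k < c; k++) {
        *ysum0   += colsum[k];
        *ysumsq0 += colsumsq[k];
    }
}

/* Assumed incremental update when the window slides down one row:
   drop the pixel in row i, add the pixel in row i + r. */
void slide_down(double *colsum, double *colsumsq, const double *sub,
                int CS_ext, int i, int r)
{
    for (int k = 0; k < CS_ext; k++) {
        double out = sub[i*CS_ext + k], in = sub[(i + r)*CS_ext + k];
        colsum[k]   += in - out;
        colsumsq[k] += in*in - out*out;
    }
}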
Equation (2) is used to calculate Sty, in parallel on all PEs, for each of the RS·CS distinct match positions. The three terms Σ t[l,k], Σ y[l,k], and Σ t[l,k]·y[l,k] are computed for all match positions within each PE's RS-by-CS subimage. Of these three terms, Σ t[l,k] and Σ y[l,k] are computed as discussed above. The remaining term, Σ t[l,k]·y[l,k], is computed directly. Because the value of Stt does not change with the match position, the term ρ′ = Sty^2/Syy is computed, in parallel on all PEs, for each of the RS·CS distinct match positions. Therefore, because Stt is known to be non-negative, the match position that yields the maximum ρ′ value would yield the maximum ρ^2 value (Equation (4)). Additionally, for each match position, two data-dependent conditionals are performed. One conditional determines whether Syy is zero, and the other is used to determine if the new ρ′ value exceeds the current ρ′ maximum. Once each PE has found its local ρ′ maximum and its corresponding subimage location, a recursive doubling operation is used to find the global maximum and its corresponding input image location. To determine the location of the global maximum ρ′, a position indicator is used to encode the global coordinates into a single number to reduce inter-PE communications. For instance, if PE J's local maximum is at match position[i, j], then iglobal = ⌊J/√P⌋·RS + i and jglobal = (J mod √P)·CS + j, and the position indicator is pos = √P·CS·iglobal + jglobal. Each PE passes the position indicator along with its maximum ρ′ value to the recursive doubling routine. The routine's output is the ordered pair (ρ′, pos), where ρ′ is the global maximum and pos is its corresponding location. Finally, the coefficient of determination, ρ^2 = ρ′/Stt, and the coordinate (u, v) of the match are computed.
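The encoding and its inverse are a few lines of C (illustrative helpers; sqrtP = √P is assumed to be an integer):

/* Encode global match coordinates into a single position indicator and
   back.  sqrtP * CS is the global image width in match positions. */
long encode_pos(long i_global, long j_global, long sqrtP, long CS)
{
    return sqrtP * CS * i_global + j_global;
}

void decode_pos(long pos, long sqrtP, long CS,
                long *i_global, long *j_global)
{
    long width = sqrtP * CS;      /* match positions per global row */
    *i_global = pos / width;
    *j_global = pos % width;
}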
4.2. Unique Portions of the Mappings
4.2.1. Dynamic Complete Sums versus Complete Sums. The only difference between the dynamic complete sums and complete sums algorithms is when data transfers occur in Phase II. In both algorithms, the information transferred among PEs is pixels. In the complete sums algorithm, all non-local pixels are transferred before the template traverses the input image. The dynamic complete sums algorithm transfers the non-local pixels during the template traversal, only when they are first needed. Because the complete sums approach isolates the non-local pixel transfers from the match position computation, additional nested loops are used for cycling through the transfers, and this contributes additional loop overhead. However, by interleaving the transfers with computation in the dynamic complete sums approach, more overhead is associated with addressing the appropriate pixel to transfer and the location in which to place the received pixel. The overhead associated with both algorithms is shown in Section 5 to be offsetting. The isolation of the transfers in the complete sums approach also makes it possible to group pixels into block transfers. In the dynamic complete sums approach, by contrast, pixels are transferred one at a time and only when first needed, so no extraneous loops are generated, which minimizes loop overhead.
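The structural difference can be sketched as follows (schematic C with stub routines standing in for the real transfer and match-position work; the loop bounds are illustrative, not the authors' code):

/* Stubs standing in for the real per-pixel and per-position work. */
static void transfer_pixel(int i, int j) { (void)i; (void)j; }
static void compute_match(int i, int j)  { (void)i; (void)j; }
static int  needs_nonlocal(int i, int j) { (void)i; (void)j; return 0; }

/* Complete sums: a separate transfer phase with its own nested loops. */
void complete_sums_phase2(int RS, int CS, int r, int c)
{
    for (int i = 0; i < r - 1; i++)             /* extra loop overhead   */
        for (int j = 0; j < CS + c - 1; j++)
            transfer_pixel(RS + i, j);          /* prefetch border pixels */
    for (int i = 0; i < RS; i++)
        for (int j = 0; j < CS; j++)
            compute_match(i, j);                /* pure computation       */
}

/* Dynamic complete sums: transfers interleaved with the computation. */
void dynamic_complete_sums_phase2(int RS, int CS)
{
    for (int i = 0; i < RS; i++)
        for (int j = 0; j < CS; j++) {
            if (needs_nonlocal(i, j))
                transfer_pixel(i, j);  /* per-pixel addressing overhead */
            compute_match(i, j);
        }
}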
Figure 3. Pixel transfers from PE J + 1 to PE J. (In the figure, "+" is the pixel needed, "x" marks pixels transferred for an earlier row template position, and "o" marks pixels transferred for an earlier column template position in the same row.)
Next, consider match positions within PE J whose associated computations require the transfer of pixels from PE J + √P, i.e., each match position[i, j] for which the template extends past the bottom edge of PE J's subimage. For match position[i, 0], pixels held by PE J + √P and PE J + √P + 1 are needed. However, at this point in the algorithm, the pixels from PE J + √P + 1 have already been transferred to PE J + √P and stored as part of the subimage in PE J + √P. Thus, explicit communication with PE J + √P + 1 is not required. For each match position[i, j] where j = 0 (i.e., those match positions including column 0), it is necessary for c pixels to be transferred for the associated computations. For each match position[i, j] where j > 0 (i.e., those match positions not including column 0), only one new pixel needs to be transferred from PE J + √P, because all of the other pixels are transferred and stored during previous match position computations. Similar to Figure 3, the example shown in Figure 4 uses an r = 5 and c = 4 template.
4.2.2. Dynamic Complete Sums versus Partial Sums. The partial sums algorithm stands in contrast to the dynamic complete sums and complete sums algorithms in the number of transfers, the information transferred, and the amount of computation done in Phase II. Figure 2(b) shows the PEs from which PE J receives data in the partial sums algorithm. In general, the partial sums algorithm does fewer additions and multiplications but more transfers than the other two algorithms. At each match position, the sums Σ t[i,j]·y[i,j], Σ y[i,j], and Σ y[i,j]^2 are computed and used to compute Sty and Syy. In the dynamic complete sums algorithm, pixels were transferred to compute these sums for match positions whose computations required pixels not in the local PE memory.
Figure 4. Pixel transfers from PE J + √P to PE J. (In the figure, "+" is the pixel needed, "x" marks pixels transferred for an earlier row template position, and "o" marks pixels transferred for an earlier column template position in the same row.)
However, in the partial sums algorithm, each PE computes as much of these summations as possible with the pixels it contains in its local subimage. For those match positions whose associated computations require pixels from an adjacent PE, partial sums, rather than unprocessed pixels, are received from the adjacent PEs. Specifically, PE J receives a partial sum from PE J + 1 to complete the computations associated with match positions near the right edge of its subimage. Figure 5 depicts an example of this process for an r = 5 and c = 4 template. For each of the sums above, the pixels marked by x's are used by PE J + 1 to compute its partial sum, and those pixels denoted by +'s are used by PE J to compute its partial sum. PE J + 1 sends its x partial sum to PE J, which adds it to its + partial sum. In addition, for the Σ y[l,k] and Σ y[l,k]^2 sums, PE J + 1 uses its x partial sum (together with its pixels labeled o in Figure 5) to form its total sums for match position[i, 0] in PE J + 1.
Figure 5. Partial sums computed by PE J and PE J + 1 for an r = 5, c = 4 template. (In the figure, "+" marks pixels summed by PE J and used by PE J only; "x" marks pixels summed by PE J + 1 and used by both PE J and PE J + 1; "o" marks pixels summed by PE J + 1 and used by PE J + 1 only.)

In this way, PE J and PE J + 1 each obtain Σ_{l=0}^{r−1} Σ_{k=0}^{c−1} t[l,k]·y[l,k] and Σ_{l=0}^{r−1} Σ_{k=0}^{c−1} y[l,k] for their respective match positions.
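In outline, the boundary computation might look as follows in C (an illustrative sketch; the names and the column-split parameters are ours):

typedef struct { double ty, y, ysq; } PartialSums;  /* sums of t*y, y, y^2 */

/* The part of one window's three sums that lies in this PE's subimage.
   sub is the extended subimage (row width CS_ext), tmpl is r-by-c;
   (i0, j0) is the window origin locally, and this PE holds template
   columns k_first .. k_first + k_count - 1 of the window. */
PartialSums window_part(const double *sub, int CS_ext,
                        const double *tmpl, int r, int c,
                        int i0, int j0, int k_first, int k_count)
{
    PartialSums p = {0.0, 0.0, 0.0};
    for (int l = 0; l < r; l++)
        for (int k = 0; k < k_count; k++) {
            double y = sub[(i0 + l)*CS_ext + (j0 + k)];
            p.ty  += tmpl[l*c + (k_first + k)] * y;
            p.y   += y;
            p.ysq += y * y;
        }
    return p;
}

/* PE J: fold in the three partial sums received from PE J + 1, in place
   of receiving the raw boundary pixels themselves. */
PartialSums combine(PartialSums mine, PartialSums from_right)
{
    mine.ty  += from_right.ty;
    mine.y   += from_right.y;
    mine.ysq += from_right.ysq;
    return mine;
}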
In general, the portion of the three sums that is computed by PE J + 1 is sent to PE J as three partial sums per boundary match position. Thus, the total number of data items (partial sums) transferred by the partial sums algorithm is three times the number of data items (pixels) transferred by the dynamic complete sums algorithm [31]. However, because the dynamic complete sums algorithm performs some redundant computations when computing the sums Σ_{l} Σ_{k} y[l,k] and Σ_{l} Σ_{k} y[l,k]^2 for match positions at the "edge" of each subimage, the partial sums algorithm performs fewer additions and multiplications. Table 1 compares the algorithm complexities of the dynamic complete sums and partial sums algorithms for square images (R = C and RS = CS) and square templates (r = c).
Table 1. Operation count comparison between the dynamic complete sums and partial sums algorithms [31].

    Operation Count              Dynamic Complete Sums Algorithm    Partial Sums Algorithm
    Number of Additions          7C^2/P + 8Cc/√P                    7C^2/P + 8Cc/√P
    Number of Multiplications    (C/√P)…                            …
    Number of Value Transfers    …                                  …
For all these algorithms, by increasing the size of the input image while the template size and the number of PEs remain fixed, the amount of computation increases much faster than the amount of inter-PE communication. This is because the per-PE computation grows with the area of each subimage (O(C^2/P) match positions), whereas the inter-PE communication grows only with the subimage boundary (O(C/√P)).
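For illustration, using the recovered addition count from Table 1 with P = 16 and c = 8: doubling the image dimension from C = 256 to C = 512 quadruples the interior term 7C^2/P (7·256^2/16 = 28,672 to 7·512^2/16 = 114,688 additions) but only doubles the boundary term 8Cc/√P (8·256·8/4 = 4,096 to 8,192). Inter-PE communication, which scales with the subimage boundary, grows at this slower, linear rate.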
5.1. Overview of PASM
The PASM small-scale proof-of-concept prototype is a distributed memory, partitionable, mixed-mode machine with 16 (MC68000-based) PEs in the computational engine [27, 32]. In SIMD mode, it is assumed that a control unit (CU) broadcasts instructions and common data to the PEs. In MIMD mode, the PEs independently execute the programs located within their local memories. The inter-PE communication is performed via a circuit-switched "extra stage cube" multistage interconnection network [29]. The algorithms for PASM were coded using a combination of a C language compiler, AWK scripts (for pre- and post-processing), and library routines for data conditionals, inter-PE data transfers, and data transfers between the CU and PEs. The absolute execution times from the small-scale PASM prototype are very slow compared to current workstations; however, for this research, comparative times among different PASM implementations are the focus.
The SIMD CU in PASM includes a fetch unit (FU). In SIMD mode, the CU CPU initiates the parallel computation by instructing the FU to send blocks of SIMD code from the FU memory (which contains the SIMD code) to the FU queue. Once in the FU queue, each SIMD instruction is broadcast to all PEs. While the FU is enqueuing and broadcasting SIMD instructions to the PEs and the PEs are executing instructions, the CU CPU can be performing its own computations; this property is called CU/PE overlap [14]. Switching between SIMD mode and MIMD mode on PASM is handled by dividing the PEs' logical address space into an MIMD address space, where the PEs access their own local memory, and an SIMD address space, where the PE memory requests are satisfied by the FU broadcasting the SIMD instructions. Switching from SIMD to MIMD is implemented by broadcasting to the PEs a branch instruction to MIMD space, while switching from MIMD to SIMD is implemented by all PEs independently branching from MIMD to SIMD space. Changing execution modes changes the source of instructions to execute; all stored information (memory, registers, processor state, etc.) is unaffected. The algorithms were run on 16 PEs, but the results can be directly extrapolated to large-scale systems. The time to execute Phase II in SIMD mode would be exactly the same for a larger system as long as the subimages are RS-by-CS (which is a function of R, C, and P); in MIMD mode, the results would be approximately the same, aside from the added synchronization overhead when more PEs are used. The execution times of Phases I and III are generally negligible, O(log P), compared to Phase II. Therefore, the results are applicable to large-scale systems.
5.2. Comparison of Algorithms
Figure 6(a) shows template size versus total execution time (computation time plus communication time) for the SIMD and MIMD mode complete sums and dynamic complete sums mappings. Figure 6(b) compares the communication times of the two algorithms (for each mode). The complete sums algorithm requires additional nested loops to cycle through Phase II's inter-PE transfers. The loop overhead incurred by these additional nested loops in the complete sums algorithm is hidden in SIMD mode due to CU/PE overlap, but is apparent in MIMD mode. This is the reason why in MIMD mode the communication time for the dynamic complete sums algorithm is less than the communication time for the complete sums algorithm, while in SIMD mode the communication times are equal. For MIMD mode, the communication time difference between the complete sums and dynamic complete sums algorithms does not impact the total execution times of the two algorithms, because the communication time is negligible.

Figure 7(a) compares the performance of the SIMD and MIMD mode implementations of the partial sums and dynamic complete sums algorithms. Figure 7(b) compares the communication times of these two algorithms (for each mode). The total number of data items (partial sums) transferred by the partial sums algorithm is three times the number of data items (pixels) transferred by the dynamic complete sums algorithm. In MIMD mode, as with SIMD mode, once a path is established, it takes less time to transfer a data item between PEs than the time taken for an addition or multiplication operation. Thus, in MIMD mode the execution time difference (which is very small) reduces with increasing template sizes. The results presented in Figure 7 show that there is a difference in the comparative performance of the various algorithms because of the mode of parallelism used; i.e., the crossover points in the performance of the dynamic complete and partial sums algorithms differ in the SIMD and MIMD modes of parallelism. Thus, the scalability of an algorithm is affected by the mode of parallelism employed.

The SIMD/MIMD trade-off for communication synchronization does not have a significant impact on the total execution time of the image correlation algorithms considered in this work for the machine size and data sizes used. Despite the small impact on total execution time, communication overhead can be seen to be significantly greater in MIMD mode than in SIMD mode. Figure 7(b) shows the communication time for the partial sums algorithm in the SIMD and MIMD modes. As can be seen from the graph, the overhead associated with the MIMD transfers is approximately five times greater than that associated with SIMD transfers for the template sizes shown. This extra overhead is associated with the software protocols and synchronization required for the PEs to communicate in MIMD mode.

Figure 8 shows the performance of the SIMD mode and mixed-mode implementations of the complete sums, dynamic complete sums, and partial sums algorithms. For the mixed-mode implementation of the dynamic complete sums and partial sums algorithms, Phase I and Phase III are performed in SIMD mode to avoid synchronization overhead during inter-PE communication, and Phase II uses both SIMD mode and MIMD mode: MIMD mode for conditionals and SIMD mode for everything else. In the complete sums algorithm, the mixed-mode mapping performed Phase I, Phase III, and the transfers in Phase II in SIMD mode, but the rest of Phase II in MIMD mode. This mapping was chosen to measure the impact of CU/PE overlap in Phase II. From Figure 8(a), it can be observed that the execution time of the mixed-mode dynamic complete sums algorithm is 4% better than the execution time of the SIMD dynamic complete sums algorithm. This performance difference is due to the conditional masking scheme used in PASM to perform data-conditional statements in SIMD mode. As can be seen from Figure 8(b), the SIMD complete sums algorithm outperforms the mixed-mode complete sums algorithm by over 14%. Recall that the mixed-mode complete sums mapping executes the computations in Phase II in MIMD mode. Hence, the 14% performance difference between the SIMD and mixed-mode mappings is due to the amount of CU/PE overlap. This SIMD mode performance advantage is gained by having the CU execute the loop index increment and compare operations, as well as some array addressing computation, while the PEs execute the data-dependent operations and other array addressing computations. A quantitative analysis of CU/PE overlap is given in the next subsection.

From the analysis presented in this subsection, the SIMD algorithms perform better than the MIMD algorithms.
Figure 8. Varying image size on PASM with a 7-by-7 template for the SIMD mode and mixed-mode mappings; both panels plot execution time (seconds) against image dimension (= R = C, from 64 to 512). Panel (a) compares the execution times for the dynamic complete sums (DCS) and partial sums (PS) algorithms; panel (b) compares the execution times for the complete sums (CS) algorithm.
For the dynamic complete sums and partial sums algorithms, for r = c = 16 and RS = CS = 20, SIMD implementations perform more than 26% better (Figure 7(a)). In addition, the dynamic complete and partial sums mixed-mode algorithms are representative of an SIMD mode implementation of the respective algorithms with an improved data-conditional masking scheme, which would further improve SIMD performance. Only three data-conditional statements are necessary (comparing the ρ′ value with the current maximum ρ′ value, checking if Syy equals zero, and checking if PEs at the "edge" of the logical grid should process the match position); thus the advantage for MIMD mode owing to these conditionals is nominal. SIMD mode has the significant advantages of CU/PE computational overlap (approximately 14%) and more efficient inter-PE communication.

Of all the implementations tested, the mixed-mode partial sums algorithm had the best execution time. Recall that the mixed-mode partial sums algorithm executes the entire algorithm, except the data conditionals, in SIMD mode. The mixed-mode partial sums algorithm performs best for two reasons. One is the low communication/computation ratio of PASM, which makes it beneficial to perform additional communications instead of additional computations. The other is that the mixed-mode partial sums algorithm benefits from the SIMD CU/PE overlap, SIMD communications, and MIMD data-conditional constructs.
5.3. An Empirical Examination of the Effects of CU/PE Overlap
When CU/PE overlap occurs, the total execution time for a program is measured from the start of execution to the time when both the PEs and the CU have completed their execution. There may be an unequal amount of work on the CU and the PEs, causing one to become idle. This subsection uses simplified representative code segments to examine these effects. In the code segment of Figure 9, except for 8 μs spent receiving variable i from the CU, the PEs remain idle until the CU has finished enqueuing the instructions for the broadcast j statement; hence, for Figure 9, start = 50 μs.
For the code segment in Figure 10, CU > PE, and the CU is active except for the end time period during the last iteration of the for-loop. Hence, the overall execution time is CU + end. Specifically, assuming that execution time is measured beginning upon entry of the outer loop (i.e., not including the pointer initialization), CU = (74c + 90)r + 12 and PE = 50cr. If r = c = 7, then CU = 4268 μs and PE = 2450 μs.
CU                                                   PE
32  tptr = tbase + c*i;    /* increment row ptrs */
32  Iptr = Ibase + Cs*i;
10  tptr = tptr + 1;       /* increment column ptrs */
10  Iptr = Iptr + 1;
17  broadcast tptr;        /* send ptrs from */        8
17  broadcast Iptr;        /*   CU to PEs   */         8
 6  simdbegin              /* broadcast SIMD block */
      tysum = tysum + t[tptr]*Is[Iptr];               34
    simdend

Figure 10. SIMD pseudocode segment that overworks the CU (per-statement times in μs; CU times at left, PE times at right).
Consider the final iteration of the code segment given in Figure 10. The final PE activity begins when the CU has finished placing the broadcast j PE instructions on the instruction queue. After this point, the PEs require 8 μs to read and execute the broadcast i instructions (to receive i from the CU). Concurrently, the CU needs 6 μs to place the next block of SIMD instructions into the instruction queue. Once the PEs have completed the broadcast j instructions, they execute the next block of SIMD instructions. Hence, they remain active for another 34 μs. Meanwhile, once the CU finishes placing the next block of SIMD instructions in the instruction queue, it increments the inner loop control variable, j, and tests true for the end-of-loop test (this is the final iteration); likewise for the outer loop. The CU is then finished with the code segment. Thus, for Figure 10, end = 8 μs.
From the above results, the execution time of the code segment in Figure 9 is given by PE + start = 4858 + 50 = 4908 μs, and the execution time of the code segment in Figure 10 is given by CU + end = 4268 + 8 = 4276 μs. Consequently, by performing the array index calculations on the CU instead of on the PEs, execution time can be reduced by approximately 13%.
CU/PE overlap can be maximized by achieving a workload balance between the CU and the PEs. Figure 11 modifies the code in Figure 10 such that two pointer arithmetic statements are migrated from the CU to the PEs. Assuming that execution time is measured upon entry of the outer loop, CU = (20c + 124)r + 12 and PE = (54c + 16)r. If r = c = 7, CU = 1860 μs and PE = 2758 μs. The execution time of the code segment in Figure 11 is PE + start = 2758 + 112 = 2870 μs. Therefore, by optimizing CU/PE overlap, the execution time is reduced from 4908 μs (Figure 9) to 2870 μs (about 42%).
CU                                                   PE
32  tptr = tbase + c*i;    /* increment row ptrs */
32  Iptr = Ibase + Cs*i;
17  broadcast tptr;        /* send ptrs from */        8
17  broadcast Iptr;        /*   CU to PEs   */         8
 6  simdbegin              /* broadcast SIMD block */
      tptr = tptr + 1;     /* increment column ptrs */ 10
      Iptr = Iptr + 1;                                  10
      tysum = tysum + t[tptr]*Is[Iptr];                 34
    simdend

Figure 11. SIMD pseudocode segment optimizing CU/PE overlap (per-statement times in μs; CU times at left, PE times at right).
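The overlap arithmetic above can be checked mechanically; the following small C program (our own driver, using the per-segment formulas and the start/end constants quoted in the text) reproduces the totals:

#include <stdio.h>

/* Evaluate the CU and PE busy times (in microseconds) quoted in the text
   for the Figure 10 and Figure 11 code segments, and the overlapped totals. */
int main(void)
{
    int r = 7, c = 7;
    int cu10 = (74*c + 90)*r + 12, pe10 = 50*c*r;        /* Figure 10 */
    int cu11 = (20*c + 124)*r + 12, pe11 = (54*c + 16)*r; /* Figure 11 */
    printf("Fig. 10: CU=%d PE=%d total=CU+end=%d\n",  cu10, pe10, cu10 + 8);
    printf("Fig. 11: CU=%d PE=%d total=PE+start=%d\n", cu11, pe11, pe11 + 112);
    return 0;  /* prints 4268/2450/4276 and 1860/2758/2870 */
}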
If the loop in Figure 11 were executed in MIMD mode, the broadcast, simdbegin, and simdend instructions would not be needed. Therefore, the time to execute the loop in MIMD mode is (68c + 90)r + 12. For r = c = 7, this time is 3982 μs. Therefore, the SIMD mode version of the loop (2870 μs) is about 28% faster than the MIMD version, because the CU overlaps the loop control and pointer arithmetic with the computation. This is an enormous advantage of an SIMD implementation of this code segment. For the complete sums algorithm, the experimental results in Figure 8(b) contrast the total execution time of the SIMD implementation versus a mixed-mode implementation designed just to show the CU/PE overlap (performed