


















































Computer Architecture: Parallel Programs (Lecture Notes)
Parallel Computation/Program Issues

- Dependency Analysis:
  - Types of dependency
  - Dependency Graphs
  - Bernstein's Conditions of Parallelism
- Asymptotic Notations for Algorithm Complexity Analysis
- Parallel Random-Access Machine (PRAM)
  - Example: sum algorithm on P processor PRAM
- Network Model of Message-Passing Multicomputers
  - Example: Asynchronous Matrix Vector Product on a Ring
- Levels of Parallelism in Program Execution
- Hardware vs. Software Parallelism
- Parallel Task Grain Size
- Software Parallelism Types: Data vs. Functional Parallelism
- Example Motivating Problem with high levels of concurrency
- Limited Parallel Program Concurrency: Amdahl's Law
- Parallel Performance Metrics: Degree of Parallelism (DOP)
  - Concurrency Profile
- Steps in Creating a Parallel Program:
  - 1- Decomposition, 2- Assignment, 3- Orchestration, 4- Mapping
  - Program Partitioning Example (handout)
  - Static Multiprocessor Scheduling Example (handout)
Parallel Programs: Definitions

- A parallel program is comprised of a number of tasks running as threads (or processes) on a number of processing elements that cooperate/communicate as part of a single parallel computation (i.e. parallelism at the Thread Level Parallelism, TLP, level).
- Task:
  - An arbitrary piece of undecomposed work in a parallel computation.
  - Executed sequentially on a single processor; concurrency in a parallel computation is only across tasks.
- Parallel or Independent Tasks:
  - Tasks with no dependencies among them, which can therefore run in parallel on different processing elements.
- Parallel Task Grain Size: the amount of computation in a task.
- Process (thread):
  - Abstract program entity that performs the computations assigned to a task.
  - Processes communicate and synchronize to perform their tasks.
- Processor (or Processing Element):
  - Physical computing engine on which a process executes sequentially.
  - Processes virtualize the machine to the programmer: first write the program in terms of processes, then map processes to processors.
- Communication-to-Computation Ratio (C-to-C Ratio): represents the amount of communication between the tasks of a parallel program relative to the amount of computation. In general, for a parallel computation, a lower C-to-C ratio is desirable and usually indicates better parallel performance.

Note: the parallel execution time is determined by the processor with the maximum execution time, where each processor's time includes its computation, communication, and other parallelization overheads.
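As a concrete illustration of the last two points, the following is a minimal sketch (not part of the original notes): it computes the parallel execution time as the maximum per-processor time and an overall C-to-C ratio. The number of processing elements and all timing values are made-up assumptions.

```c
#include <stdio.h>

#define P 4  /* assumed number of processing elements */

int main(void) {
    /* hypothetical per-processor times, in arbitrary time units */
    double compute[P] = {10.0, 12.0,  9.0, 11.0};
    double comm[P]    = { 2.0,  1.0,  3.0,  2.5};

    double par_time = 0.0, total_comp = 0.0, total_comm = 0.0;
    for (int i = 0; i < P; i++) {
        double t = compute[i] + comm[i];   /* this PE's busy time            */
        if (t > par_time) par_time = t;    /* slowest PE sets parallel time  */
        total_comp += compute[i];
        total_comm += comm[i];
    }

    printf("Parallel execution time = %.1f\n", par_time);
    printf("C-to-C ratio            = %.2f\n", total_comm / total_comp);
    return 0;
}
```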
Dependency Analysis & Conditions of Parallelism

Algorithm/program task dependencies (here considered down to task = instruction; a task only executes on the one processor to which it has been mapped or allocated):

- Data Dependence (algorithm related):
  - True Data or Flow Dependence
  - Name Dependence (parallel program and programming model related):
    - Anti-dependence
    - Output (or write) dependence
- Control Dependence (algorithm related)
- Hardware/Architecture Resource Dependence (parallel architecture related)
Conditions of Parallelism: Data & Name Dependence

Assume task S2 follows task S1 in sequential program order. As part of the algorithm/computation:

1. True Data or Flow Dependence: S2 is data- (flow-) dependent on S1 if an output (result) of S1 is an input (operand) of S2. Denoted S1 ⎯→ S2 in task dependency graphs.
2. Anti-dependence: S2 is anti-dependent on S1 if S2 writes to a name (register or memory location) that S1 reads. A name dependence, denoted S1 ⎯→ S2 (anti-dependence arc) in dependency graphs.
3. Output (write) dependence: S2 is output-dependent on S1 if both S1 and S2 write to the same name. A name dependence, denoted S1 ⎯→ S2 (output-dependence arc) in task dependency graphs.
Name Dependence Classification: Anti-Dependence

- Assume task S2 follows task S1 in sequential program order.
- Task S1 reads one or more values from one or more names (registers or memory locations).
- Task S2 writes one or more values to the same names (the same registers or memory locations read by S1).
- Then task S2 is said to be anti-dependent on task S1.
- Changing the relative execution order of tasks S1, S2 in the parallel program violates this name dependence and may result in incorrect execution.

Task dependency graph representation: S1 (read) ⎯→ S2 (write), an anti-dependence arc; the name is a register or memory location, e.g. shared memory locations in a shared address space (SAS).

Question (program related): does anti-dependence matter for message passing?
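To make the definition concrete, here is a small hedged C sketch (not from the slides): S1 reads x, S2 then writes x, so S2 is anti-dependent on S1, and reversing their order would change the result; renaming S2's destination removes the name dependence.

```c
#include <stdio.h>

int main(void) {
    int x = 5;
    int y, x_renamed;

    /* Program order: S1 then S2.
       S1 reads the name x; S2 writes the same name x,
       so S2 is anti-dependent (write-after-read) on S1.
       If S2 were moved before S1, y would become 11 instead of 6. */
    y = x + 1;          /* S1: reads x  */
    x = 10;             /* S2: writes x */

    /* Renaming removes the anti-dependence: the write now targets a new
       name, so S1 and the renamed S2 could execute in either order (or
       in parallel), with later code using x_renamed instead of x.      */
    x_renamed = 10;

    printf("y = %d, x = %d, x_renamed = %d\n", y, x, x_renamed);
    return 0;
}
```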
Name Dependence Classification: Output (or Write) Dependence

- Assume task S2 follows task S1 in sequential program order.
- Both tasks S1, S2 write to the same name or names (the same registers or memory locations).
- Then task S2 is said to be output-dependent on task S1.
- Changing the relative execution order of tasks S1, S2 in the parallel program violates this name dependence and may result in incorrect execution.

Task dependency graph representation: S1 (write) ⎯→ S2 (write), an output-dependence arc; the name is a register or memory location, e.g. shared memory locations in a shared address space (SAS).

Question (program related): does output dependence matter for message passing?
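A matching hedged C sketch (again, not from the slides) for output dependence: S1 and S2 both write the name x, so their relative order determines the value seen by any later read of x; renaming one of the writes removes the dependence.

```c
#include <stdio.h>

int main(void) {
    int x;
    int x2;   /* renamed destination for S2 */

    /* S1 and S2 both write the name x: S2 is output-dependent
       (write-after-write) on S1.  Swapping them would leave x = 1
       instead of x = 2 for any later reader of x.                 */
    x = 1;            /* S1: writes x */
    x = 2;            /* S2: writes x */

    /* With S2's destination renamed to x2, the two writes no longer
       share a name and may execute in any order or in parallel.    */
    x2 = 2;

    printf("x = %d, x2 = %d\n", x, x2);
    return 0;
}
```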
Dependency Graph Example

MIPS code (each instruction is treated as a task; the last source register of instructions 1, 2 and 4 is not given):

    1:  ADD.D  F2, F1, F…
    2:  ADD.D  F4, F2, F…
    3:  ADD.D  F2, F2, F4
    4:  ADD.D  F4, F2, F…

Task dependencies (the task dependency graph has one node per instruction, 1 to 4, and an arc for each dependence below):

    True data (flow) dependences: (1,2), (1,3), (2,3), (3,4), i.e. 1 ⎯→ 2, 1 ⎯→ 3, 2 ⎯→ 3, 3 ⎯→ 4
    Output dependences: (1,3), (2,4), i.e. 1 ⎯→ 3, 2 ⎯→ 4
    Anti-dependences: (2,3), (3,4), i.e. 2 ⎯→ 3, 3 ⎯→ 4
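This classification can be mechanized. Below is a hedged C sketch (not part of the lecture) that encodes each instruction's source and destination registers as bit sets and derives the flow, anti, and output dependences for the four-instruction example above. The registers F0, F3 and F6 are arbitrary stand-ins for the source operands not shown in the listing; they take part in no dependence. The per-register scan keeps only "last writer" and "readers since last write", so it reports exactly the direct dependences listed on the slide rather than transitive ones.

```c
#include <stdio.h>
#include <stdint.h>

#define N 4            /* number of instructions */
#define NREGS 32       /* F0..F31                */
#define RBIT(r) (1u << (r))

int main(void) {
    /* Read (source) and write (destination) register sets per instruction.
       F0, F3 and F6 are assumed stand-ins for the operands not shown.     */
    uint32_t rd[N] = { RBIT(1) | RBIT(0),    /* 1: ADD.D F2, F1, F0 */
                       RBIT(2) | RBIT(3),    /* 2: ADD.D F4, F2, F3 */
                       RBIT(2) | RBIT(4),    /* 3: ADD.D F2, F2, F4 */
                       RBIT(2) | RBIT(6) };  /* 4: ADD.D F4, F2, F6 */
    uint32_t wr[N] = { RBIT(2), RBIT(4), RBIT(2), RBIT(4) };

    for (int r = 0; r < NREGS; r++) {
        int last_writer = -1;       /* most recent writer of F<r>         */
        uint32_t readers = 0;       /* instructions reading F<r> since it */
        for (int j = 0; j < N; j++) {
            if (rd[j] & RBIT(r)) {                     /* j reads F<r>  */
                if (last_writer >= 0)
                    printf("true (flow) dependence: %d -> %d on F%d\n",
                           last_writer + 1, j + 1, r);
                readers |= 1u << j;
            }
            if (wr[j] & RBIT(r)) {                     /* j writes F<r> */
                for (int i = 0; i < j; i++)
                    if (readers & (1u << i))
                        printf("anti-dependence:        %d -> %d on F%d\n",
                               i + 1, j + 1, r);
                if (last_writer >= 0)
                    printf("output dependence:      %d -> %d on F%d\n",
                           last_writer + 1, j + 1, r);
                last_writer = j;
                readers = 0;
            }
        }
    }
    return 0;
}
```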
Dependency Graph Example (from 551)

MIPS code (each instruction is treated as a task):

    1:  L.D    F0, 0(R1)
    2:  ADD.D  F4, F0, F2
    3:  S.D    F4, 0(R1)
    4:  L.D    F0, -8(R1)
    5:  ADD.D  F4, F0, F2
    6:  S.D    F4, -8(R1)

Task dependencies (one graph node per instruction, 1 to 6):

    True data (flow) dependences: (1,2), (2,3), (4,5), (5,6), i.e. 1 ⎯→ 2, 2 ⎯→ 3, 4 ⎯→ 5, 5 ⎯→ 6
    Output dependences: (1,4), (2,5), i.e. 1 ⎯→ 4, 2 ⎯→ 5
    Anti-dependences: (2,4), (3,5), i.e. 2 ⎯→ 4, 3 ⎯→ 5

Questions:
- Can instruction 4 (the second L.D) be moved just after instruction 1 (the first L.D)? If not, what dependencies are violated?
- Can instruction 3 (the first S.D) be moved just after instruction 4 (the second L.D)? How about moving 3 after 5 (the second ADD.D)? If not, what dependencies are violated?
Conditions of Parallelism

- Control Dependence: the order of execution of tasks cannot be fully determined before run time because it depends on the outcome of control statements (e.g. conditional branches).
- Resource Dependence: tasks compete for shared hardware resources: functional units (integer, floating point), memory areas, communication links, etc.
- Bernstein's Conditions of Parallelism: two processes P1, P2 with input (read) sets I1, I2 and output (results produced) sets O1, O2 can execute in parallel (denoted by P1 || P2) if:

    I1 ∩ O2 = ∅
    I2 ∩ O1 = ∅
    O1 ∩ O2 = ∅

  The first two conditions mean there is no flow (data) dependence and no anti-dependence between P1 and P2 (which condition rules out which depends on the order of P1, P2); the third condition means there is no output dependence.
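As a hedged illustration (not part of the original notes), the sketch below encodes two processes' input and output sets as bit sets over a handful of variable names and tests the three conditions; the two example processes and their statements are hypothetical.

```c
#include <stdio.h>
#include <stdint.h>

/* one bit per variable name; here: A=0, B=1, C=2, D=3 */
enum { A, B, C, D };
#define VAR(v) (1u << (v))

/* P1 || P2  iff  I1∩O2 = ∅,  I2∩O1 = ∅,  O1∩O2 = ∅ */
static int bernstein_parallel(uint32_t i1, uint32_t o1,
                              uint32_t i2, uint32_t o2) {
    return !(i1 & o2) && !(i2 & o1) && !(o1 & o2);
}

int main(void) {
    /* P1: C = A + B   (reads {A,B}, writes {C}) */
    uint32_t i1 = VAR(A) | VAR(B), o1 = VAR(C);
    /* P2: D = A * A   (reads {A},   writes {D}) */
    uint32_t i2 = VAR(A),          o2 = VAR(D);

    printf("P1 || P2 ? %s\n",
           bernstein_parallel(i1, o1, i2, o2) ? "yes" : "no");
    return 0;
}
```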
Bernstein's Conditions: An Example

For the following instructions P1, P2, P3, P4, P5:
- Each instruction requires one step to execute.
- Two adders are available.

    P1: C = D x E
    P2: M = G + C
    P3: A = B + C
    P4: C = L + M
    P5: F = G ÷ E

Using Bernstein's conditions after checking statement pairs:

    P1 || P5,  P2 || P3,  P2 || P5,  P3 || P5,  P4 || P5

Dependence graph: data dependences (solid lines), resource dependences (dashed lines).

Sequential execution takes five steps (one instruction per step). Parallel execution takes three steps, assuming two adders are available per step; expressed with parallel constructs: P1; Co-Begin P2, P3, P5 Co-End; P4.
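The pairwise check on this slide can be reproduced mechanically. The following hedged C sketch (not part of the notes) encodes the read and write sets of P1..P5 from the statements above and prints every pair satisfying Bernstein's conditions; it should report exactly the five parallel pairs listed.

```c
#include <stdio.h>
#include <stdint.h>

/* one bit per variable name appearing in P1..P5 */
enum { A, B, C, D, E, F, G, L, M };
#define VAR(v) (1u << (v))

int main(void) {
    /* P1: C = D x E,  P2: M = G + C,  P3: A = B + C,
       P4: C = L + M,  P5: F = G / E                   */
    uint32_t in[5]  = { VAR(D) | VAR(E), VAR(G) | VAR(C), VAR(B) | VAR(C),
                        VAR(L) | VAR(M), VAR(G) | VAR(E) };
    uint32_t out[5] = { VAR(C), VAR(M), VAR(A), VAR(C), VAR(F) };

    for (int i = 0; i < 5; i++)
        for (int j = i + 1; j < 5; j++) {
            /* Bernstein: Ii∩Oj = ∅, Ij∩Oi = ∅, Oi∩Oj = ∅ */
            if (!(in[i] & out[j]) && !(in[j] & out[i]) && !(out[i] & out[j]))
                printf("P%d || P%d\n", i + 1, j + 1);
        }
    return 0;
}
```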
Asymptotic Notations for Algorithm Analysis

- Asymptotic Lower Bound: Big Omega Notation, Ω
  Used in the analysis of the lower limit of algorithm performance.
  f(n) = Ω(g(n)) if there exist positive constants c, n0 such that |f(n)| ≥ c |g(n)| for all n > n0
  ⇒ i.e. g(n) is a lower bound on f(n)

- Asymptotic Tight Bound: Big Theta Notation, Θ (AKA tight bound)
  Used in finding a tight limit on algorithm performance.
  f(n) = Θ(g(n)) if there exist positive constants c1, c2, and n0 such that c1 |g(n)| ≤ |f(n)| ≤ c2 |g(n)| for all n > n0
  ⇒ i.e. g(n) is both an upper and a lower bound on f(n)
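A short worked example (not from the slide) showing how the Θ definition is applied to a concrete running time; the function and constants are chosen only for illustration.

```latex
% Show that f(n) = 3n^2 + 5n is Theta(n^2)
\[
  3n^2 \;\le\; 3n^2 + 5n \;\le\; 3n^2 + 5n^2 = 8n^2
  \qquad \text{for all } n > 1,
\]
\[
  \text{so with } c_1 = 3,\; c_2 = 8,\; n_0 = 1:\quad
  c_1\, n^2 \;\le\; f(n) \;\le\; c_2\, n^2
  \;\Rightarrow\; f(n) = \Theta(n^2).
\]
```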
Rate of Growth of Common Computing Time Functions

[Plot: growth of log n, n, n log n, n², 2ⁿ as functions of n]

O(1) < O(log n) < O(n) < O(n log n) < O(n²) < O(n³) < O(2ⁿ)
Theoretical Models of Parallel Computers
PRAM: An Idealized Shared-Memory Parallel Computer Model

- Parallel Random-Access Machine (PRAM):
  - p processor, global shared memory model.
  - Models idealized parallel shared-memory computers with zero synchronization, communication or memory access overhead.
  - Utilized in parallel algorithm development and in scalability and complexity analysis.
- PRAM variants (more realistic models than pure PRAM):
  - EREW-PRAM: simultaneous memory reads or writes to/from the same memory location are not allowed.
  - CREW-PRAM: simultaneous memory writes to the same location are not allowed. (Better to model SAS MIMD?)
  - ERCW-PRAM: simultaneous reads from the same memory location are not allowed.
  - CRCW-PRAM: concurrent reads or writes to/from the same memory location are allowed.

Why? Sometimes used to model SIMD since no memory is shared.
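The outline mentions a sum algorithm on a p-processor PRAM. As a hedged sketch (not the lecture's own code), the loop below simulates the usual O(log n) PRAM reduction on a single machine: in the step with stride s, each "processor" adds in the element a distance s away, halving the number of active positions each step. The array size and contents are arbitrary assumptions.

```c
#include <stdio.h>

#define N 8   /* n = number of elements, assumed a power of two */

int main(void) {
    int a[N] = {3, 1, 4, 1, 5, 9, 2, 6};   /* made-up input values */

    /* Simulated EREW-PRAM reduction: log2(N) synchronous steps.
       In the step with stride s, "processor" i (i a multiple of 2s)
       computes a[i] += a[i + s]; all such additions touch disjoint
       locations and would run in parallel on a real PRAM.           */
    for (int s = 1; s < N; s *= 2)
        for (int i = 0; i + s < N; i += 2 * s)
            a[i] += a[i + s];

    printf("sum = %d\n", a[0]);   /* 3+1+4+1+5+9+2+6 = 31 */
    return 0;
}
```

On an actual PRAM the inner loop would be performed by p = N/2 processors in one time step, giving the O(log n) parallel time that the complexity analysis in the outline refers to.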