Computation-to-Communication Ratio, Lecture Notes of Advanced Computer Architecture


CMPE655 - Shaaban
#1 lec # 3 Fall 2013   9-10-2013
Parallel Computation/Program Issues

• Dependency Analysis:
  – Types of dependency
  – Dependency Graphs
  – Bernstein's Conditions of Parallelism
• Asymptotic Notations for Algorithm Complexity Analysis
• Parallel Random-Access Machine (PRAM)
  – Example: sum algorithm on P processor PRAM
• Network Model of Message-Passing Multicomputers
  – Example: Asynchronous Matrix Vector Product on a Ring
• Levels of Parallelism in Program Execution
• Hardware Vs. Software Parallelism
• Parallel Task Grain Size
• Software Parallelism Types: Data Vs. Functional Parallelism
• Example Motivating Problem With high levels of concurrency
• Limited Parallel Program Concurrency: Amdahl's Law
• Parallel Performance Metrics: Degree of Parallelism (DOP)
  – Concurrency Profile + Average Parallelism
• Steps in Creating a Parallel Program:
  – 1- Decomposition, 2- Assignment, 3- Orchestration, 4- (Mapping + Scheduling)
  – Program Partitioning Example (handout)
  – Static Multiprocessor Scheduling Example (handout)

(PCA Chapter 2.1, 2.2)




Parallel Programs: Definitions

• A parallel program is comprised of a number of tasks running as threads (or processes) on a number of processing elements that cooperate/communicate as part of a single parallel computation (i.e. at the Thread Level Parallelism, TLP, level).
• Task:
  – Arbitrary piece of undecomposed work in a parallel computation.
  – Executed sequentially on a single processor; concurrency in a parallel computation is only across tasks.
• Parallel or Independent Tasks:
  – Tasks with no dependencies among them, which can therefore run in parallel on different processing elements.
• Parallel Task Grain Size: the amount of computation in a task.
• Process (thread):
  – Abstract program entity that performs the computations assigned to a task.
  – Processes communicate and synchronize to perform their tasks.
• Processor (or Processing Element):
  – Physical computing engine on which a process executes sequentially.
  – Processes virtualize the machine to the programmer: first write the program in terms of processes, then map processes to processors.
• Communication-to-Computation Ratio (C-to-C Ratio): represents the amount of communication between the tasks of a parallel program relative to their computation. In general, for a parallel computation, a lower C-to-C ratio is desirable and usually indicates better parallel performance.

(Figure on the slide: per-processor execution time broken into computation, communication, and other parallelization overheads; the processor with the maximum execution time determines the parallel execution time.)
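As a rough illustration (not from the slides), consider a hypothetical n x n grid computation block-partitioned among p processors: each task computes about n²/p points but only exchanges its block boundary of about 4n/√p points, so growing the per-task block lowers the C-to-C ratio. A minimal sketch of that estimate:

import math

def c_to_c_ratio(n, p):
    """Estimate the C-to-C ratio for a hypothetical n x n grid computation
    block-partitioned among p processors (assumptions: one unit of
    computation per grid point, one unit of communication per boundary point)."""
    computation = n * n / p               # points computed per task
    communication = 4 * n / math.sqrt(p)  # boundary points exchanged per task
    return communication / computation    # lower is better

# Larger per-task blocks (bigger n, fixed p) lower the ratio:
for n in (100, 1000, 10000):
    print(n, c_to_c_ratio(n, p=16))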


Dependency Analysis & Conditions of Parallelism

• Dependency analysis is concerned with detecting the presence and type of dependency between tasks that prevents them from being independent and from running in parallel on different processors. It can be applied to tasks of any grain size, down to task = instruction.
  – Represented graphically as task dependency graphs.
  – A task only executes on one processor, to which it has been mapped or allocated.
  – Task Grain Size: amount of computation in a task.
• Dependencies between tasks can be 1- algorithm/program related or 2- hardware resource/architecture related.
• Algorithm/Program Task Dependencies (algorithm and parallel program/programming model related):
  – Data Dependence:
    • True Data or Flow Dependence
    • Name Dependence:
      – Anti-dependence
      – Output (or write) dependence
  – Control Dependence
• Hardware/Architecture Resource Dependence (parallel architecture related)


Conditions of Parallelism: Data & Name Dependence

Assume task S2 follows task S1 in sequential program order (S1 .. S2).

1. True Data or Flow Dependence: Task S2 is data dependent on task S1 if an execution path exists from S1 to S2 and if at least one output variable of S1 feeds in as an input operand used by S2. Represented by S1 ⎯→ S2 in task dependency graphs.

2. Anti-dependence: Task S2 is antidependent on task S1 if S2 follows S1 in program order and if the output of S2 overlaps the input of S1. Represented by S1 ⎯→ S2 in dependency graphs.

3. Output dependence: Two tasks S1, S2 are output dependent if they produce the same output variables (or their outputs overlap). Represented by S1 ⎯→ S2 in task dependency graphs.

Anti-dependence and output dependence are name dependencies; all three arise as part of the algorithm/computation.
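All three definitions reduce to comparisons of read and write sets. A minimal sketch (not from the slides) that classifies the dependence of task S2 on task S1, where S2 follows S1 and each task is described by the names it reads and writes:

def classify_dependence(s1_reads, s1_writes, s2_reads, s2_writes):
    """Return the dependences of task S2 on task S1 (S2 follows S1)."""
    deps = []
    if s1_writes & s2_reads:
        deps.append("true (flow) dependence")  # S1 output feeds an S2 input
    if s1_reads & s2_writes:
        deps.append("anti-dependence")         # S2 output overlaps an S1 input
    if s1_writes & s2_writes:
        deps.append("output dependence")       # same output names
    return deps or ["independent"]

# S1: F2 = F1 + F0   S2: F4 = F2 + F3  ->  ['true (flow) dependence']
print(classify_dependence({"F1", "F0"}, {"F2"}, {"F2", "F3"}, {"F4"}))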


Name Dependence Classification: Anti-Dependence

• Assume task S2 follows task S1 in sequential program order.
• Task S1 reads one or more values from one or more names (registers or memory locations).
• Task S2 writes one or more values to the same names (same registers or memory locations read by S1).
  – Then task S2 is said to be anti-dependent on task S1.
• Changing the relative execution order of tasks S1, S2 in the parallel program violates this name dependence and may result in incorrect execution.

Task dependency graph representation: S1 ⎯→ S2 (anti-dependence), with S1 (Read) and S2 (Write) on the shared names (registers or memory locations), e.g. shared memory locations in a shared address space (SAS).

Does anti-dependence matter for message passing?

(Program related.)


Name Dependence Classification: Output (or Write) Dependence

• Assume task S2 follows task S1 in sequential program order.
• Both tasks S1, S2 write to the same name or names (same registers or memory locations).
  – Then task S2 is said to be output-dependent on task S1.
• Changing the relative execution order of tasks S1, S2 in the parallel program violates this name dependence and may result in incorrect execution.

Task dependency graph representation: S1 ⎯→ S2 (output dependence), with S1 (Write) and S2 (Write) to the shared names (registers or memory locations), e.g. shared memory locations in a shared address space (SAS).

Does output dependence matter for message passing?

(Program related.)
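Unlike true data dependence, both name dependences can be eliminated by renaming, since they are conflicts over names rather than values. A hypothetical sketch (variable names are illustrative only):

x, z = 1.0, 2.0    # hypothetical initial values

# Before renaming: S2 writes the name x that S1 reads (anti-dependence),
# and S3 rewrites the name y that S1 wrote (output dependence).
y = x + 1.0        # S1: reads x, writes y
x = 2.0 * z        # S2: writes x  -> S2 anti-dependent on S1
y = x + 3.0        # S3: writes y  -> S3 output-dependent on S1

# After renaming x -> x1 and y -> y1, both name dependences disappear;
# only the true dependence of S3' on S2' (through x1) remains.
y  = x + 1.0       # S1
x1 = 2.0 * z       # S2': no longer conflicts with S1
y1 = x1 + 3.0      # S3': flow-dependent on S2' only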


Dependency Graph Example

Here assume each instruction is treated as a task.

MIPS code:

1  ADD.D  F2, F1, F0
2  ADD.D  F4, F2, F3
3  ADD.D  F2, F2, F4
4  ADD.D  F4, F2, F6

True data dependence: (1, 2), (1, 3), (2, 3), (3, 4); i.e. 1 ⎯→ 2, 1 ⎯→ 3, 2 ⎯→ 3, 3 ⎯→ 4
Output dependence: (1, 3), (2, 4); i.e. 1 ⎯→ 3, 2 ⎯→ 4
Anti-dependence: (2, 3), (3, 4); i.e. 2 ⎯→ 3, 3 ⎯→ 4

(Task dependency graph drawn on the slide.)
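These pair lists can be recomputed mechanically from each instruction's destination and source registers. A small sketch (not part of the lecture; registers as in the reconstructed sequence above), tracking the last writer and the readers since that write:

code = [("F2", {"F1", "F0"}),   # 1: ADD.D F2, F1, F0
        ("F4", {"F2", "F3"}),   # 2: ADD.D F4, F2, F3
        ("F2", {"F2", "F4"}),   # 3: ADD.D F2, F2, F4
        ("F4", {"F2", "F6"})]   # 4: ADD.D F4, F2, F6

true_dep, anti_dep, out_dep = [], [], []
last_writer = {}                 # name -> most recent writer
readers = {}                     # name -> readers since that write
for j, (dest, srcs) in enumerate(code, 1):
    for r in srcs:               # flow: j reads what i last wrote
        if r in last_writer:
            true_dep.append((last_writer[r], j))
        readers.setdefault(r, []).append(j)
    if dest in last_writer:      # output: j rewrites what i wrote
        out_dep.append((last_writer[dest], j))
    for i in readers.get(dest, []):   # anti: j overwrites what i read
        if i != j:
            anti_dep.append((i, j))
    last_writer[dest] = j
    readers[dest] = []           # fresh value: reset its reader list

print("true:", sorted(true_dep))    # [(1, 2), (1, 3), (2, 3), (3, 4)]
print("anti:", sorted(anti_dep))    # [(2, 3), (3, 4)]
print("output:", sorted(out_dep))   # [(1, 3), (2, 4)]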


Dependency Graph Example (From 551)

Here assume each instruction is treated as a task.

MIPS code:

1  L.D    F0, 0(R1)
2  ADD.D  F4, F0, F2
3  S.D    F4, 0(R1)
4  L.D    F0, -8(R1)
5  ADD.D  F4, F0, F2
6  S.D    F4, -8(R1)

True data dependence: (1, 2), (2, 3), (4, 5), (5, 6); i.e. 1 ⎯→ 2, 2 ⎯→ 3, 4 ⎯→ 5, 5 ⎯→ 6
Output dependence: (1, 4), (2, 5); i.e. 1 ⎯→ 4, 2 ⎯→ 5
Anti-dependence: (2, 4), (3, 5); i.e. 2 ⎯→ 4, 3 ⎯→ 5

(Task dependency graph drawn on the slide.)

Can instruction 4 (second L.D) be moved just after instruction 1 (first L.D)? If not, what dependencies are violated?

Can instruction 3 (first S.D) be moved just after instruction 4 (second L.D)? How about moving 3 after 5 (the second ADD.D)? If not, what dependencies are violated?
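One way to check the reordering questions: a move is legal only if every dependence pair (i, j) above still keeps i before j. A hedged sketch reusing the slide's pair lists:

# Dependence pairs for the six instructions above, from the slide's lists.
pairs = {(1, 2), (2, 3), (4, 5), (5, 6),   # true (flow) dependences
         (1, 4), (2, 5),                    # output dependences
         (2, 4), (3, 5)}                    # anti-dependences

def legal(order):
    """A reordering is legal iff every pair (i, j) keeps i before j."""
    pos = {instr: k for k, instr in enumerate(order)}
    return all(pos[i] < pos[j] for i, j in pairs)

print(legal([1, 4, 2, 3, 5, 6]))  # False: moving 4 after 1 violates anti (2, 4)
print(legal([1, 2, 4, 3, 5, 6]))  # True: 3 and 4 share no dependence pair
print(legal([1, 2, 4, 5, 3, 6]))  # False: moving 3 after 5 violates anti (3, 5)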


Conditions of Parallelism

• Control Dependence:
  – Order of execution cannot be determined before runtime due to conditional statements.
• Resource Dependence:
  – Concerned with conflicts in using shared resources among parallel tasks, including:
    • Functional units (integer, floating point), memory areas, communication links, etc.
• Bernstein's Conditions of Parallelism: Two processes P1, P2 with input sets I1, I2 and output sets O1, O2 can execute in parallel (denoted by P1 || P2) if:

  I1 ∩ O2 = ∅
  I2 ∩ O1 = ∅    i.e. no flow (data) dependence or anti-dependence (which is which depends on the order of P1, P2)
  O1 ∩ O2 = ∅    i.e. no output dependence (on the results produced)
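The three conditions translate directly into set intersections. A minimal sketch (assuming each process is described by explicit input and output sets):

def bernstein_parallel(I1, O1, I2, O2):
    """P1 || P2 iff I1 ∩ O2, I2 ∩ O1 and O1 ∩ O2 are all empty."""
    return not (I1 & O2) and not (I2 & O1) and not (O1 & O2)

# Disjoint inputs and outputs -> the two processes may run in parallel:
print(bernstein_parallel({"a"}, {"b"}, {"c"}, {"d"}))  # True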


Bernstein's Conditions: An Example

• For the following instructions P1, P2, P3, P4, P5:
  – Each instruction requires one step to execute.
  – Two adders are available.

  P1: C = D x E
  P2: M = G + C
  P3: A = B + C
  P4: C = L + M
  P5: F = G ÷ E

Using Bernstein's conditions after checking statement pairs:

  P1 || P5, P2 || P3, P2 || P5, P3 || P5, P4 || P5

(Figures on the slide: sequential execution of P1..P5 in five steps versus parallel execution in three steps, assuming two adders are available per step; and the dependence graph, with data dependence drawn as solid lines and resource dependence as dashed lines. The parallel form brackets the independent statements with Co-Begin ... Co-End.)
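Running the bernstein_parallel sketch from the previous slide over all statement pairs reproduces the list above:

from itertools import combinations

def bernstein_parallel(I1, O1, I2, O2):
    return not (I1 & O2) and not (I2 & O1) and not (O1 & O2)

# (input set, output set) per statement, read off P1..P5 above
procs = {1: ({"D", "E"}, {"C"}),   # P1: C = D x E
         2: ({"G", "C"}, {"M"}),   # P2: M = G + C
         3: ({"B", "C"}, {"A"}),   # P3: A = B + C
         4: ({"L", "M"}, {"C"}),   # P4: C = L + M
         5: ({"G", "E"}, {"F"})}   # P5: F = G / E

for (a, (Ia, Oa)), (b, (Ib, Ob)) in combinations(procs.items(), 2):
    if bernstein_parallel(Ia, Oa, Ib, Ob):
        print(f"P{a} || P{b}")
# Prints exactly the five pairs on the slide:
# P1 || P5, P2 || P3, P2 || P5, P3 || P5, P4 || P5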


Asymptotic Notations for Algorithm Analysis

♦ Asymptotic Lower Bound: Big Omega Notation (Ω)
  Used in the analysis of the lower limit of algorithm performance.
  f(n) = Ω(g(n)) if there exist positive constants c, n0 such that 0 ≤ c |g(n)| ≤ |f(n)| for all n > n0,
  i.e. g(n) is a lower bound on f(n).

♦ Asymptotic Tight Bound: Big Theta Notation (Θ)
  Used in finding a tight limit on algorithm performance.
  f(n) = Θ(g(n)) if there exist positive constants c1, c2, and n0 such that 0 ≤ c1 |g(n)| ≤ |f(n)| ≤ c2 |g(n)| for all n > n0,
  i.e. g(n) is both an upper and a lower bound on f(n) (AKA tight bound).
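As a worked instance (an illustration, not from the slides), f(n) = 3n² + 2n is Θ(n²), with g(n) = n², c1 = 3, c2 = 5, and n0 = 1:

% f(n) = 3n^2 + 2n = \Theta(n^2): take g(n) = n^2, c_1 = 3, c_2 = 5, n_0 = 1
0 \;\le\; 3n^2 \;\le\; 3n^2 + 2n \;\le\; 3n^2 + 2n^2 \;=\; 5n^2
\qquad \text{for all } n > n_0 = 1,
\text{ hence } c_1 |g(n)| \le |f(n)| \le c_2 |g(n)|.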


Graphs of O, Ω, Θ

(Figure: three graphs. f(n) = O(g(n)), upper bound: f(n) stays below c·g(n) for n > n0. f(n) = Ω(g(n)), lower bound: f(n) stays above c·g(n) for n > n0. f(n) = Θ(g(n)), tight bound: f(n) stays between c1·g(n) and c2·g(n) for n > n0.)


Rate of Growth of Common Computing Time Functions

O(1) < O(log n) < O(n) < O(n log n) < O(n²) < O(n³) < O(2ⁿ)

(Figure: growth curves of log n, n, n log n, n², and 2ⁿ.)
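A quick sketch (not from the slides) that tabulates these functions makes the ordering visible even for modest n:

import math

# Tabulate the growth functions for a few problem sizes to see the ordering.
print(f"{'n':>4} {'log n':>7} {'n log n':>9} {'n^2':>7} {'n^3':>9} {'2^n':>15}")
for n in (10, 20, 40):
    print(f"{n:>4} {math.log2(n):>7.1f} {n * math.log2(n):>9.1f} "
          f"{n**2:>7} {n**3:>9} {2**n:>15}")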


Theoretical Models of Parallel Computers:
PRAM: An Idealized Shared-Memory Parallel Computer Model

• Parallel Random-Access Machine (PRAM):
  – p processor, global shared memory model.
  – Models idealized parallel shared-memory computers with zero synchronization, communication, or memory access overhead.
  – Utilized in parallel algorithm development and in scalability and complexity analysis.
• PRAM variants: more realistic models than pure PRAM:
  – EREW-PRAM: Simultaneous memory reads or writes to/from the same memory location are not allowed.
  – CREW-PRAM: Simultaneous memory writes to the same location are not allowed (better to model SAS MIMD?).
  – ERCW-PRAM: Simultaneous reads from the same memory location are not allowed (why?).
  – CRCW-PRAM: Concurrent reads or writes to/from the same memory location are allowed.

Sometimes used to model SIMD since no memory is shared.
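The outline's "sum algorithm on P processor PRAM" computes the sum of n values in O(log n) steps by pairwise (tree) reduction over shared memory. A minimal sequential simulation of that step structure (a sketch, assuming n is a power of two; not the lecture's own pseudocode):

def pram_sum(values):
    """Simulate the O(log n) PRAM tree-reduction sum.

    In step s (s = 1, 2, 4, ...), the processor responsible for
    position i adds in the partial sum s positions away; distinct
    processors touch distinct locations, so every step is
    exclusive-read / exclusive-write (EREW).
    """
    m = list(values)
    n = len(m)          # assumed a power of two for simplicity
    s = 1
    while s < n:
        # on a real PRAM the iterations of this loop run in parallel
        for i in range(0, n, 2 * s):
            m[i] += m[i + s]
        s *= 2
    return m[0]

print(pram_sum([3, 1, 4, 1, 5, 9, 2, 6]))  # 31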