
Parallel Processing Techniques: Multi-Core and SIMD Execution, Exercises of Logic

Parallel processing techniques using multi-core and SIMD (Single Instruction, Multiple Data) execution: using multiple cores to execute different instruction streams, and using SIMD units to process multiple data elements simultaneously. The document also discusses the benefits and costs of these techniques and provides examples of their implementation. From Carnegie Mellon University's Computer Science course 15-418/618, taught in the Fall of 2018.

Typology: Exercises

2021/2022

Uploaded on 09/12/2022 by ashnay


Parallel Computer Architecture and Programming
CMU 15-418/15-618, Fall 2018
Lecture 2:
A Modern Multi-Core
Processor
(Forms of parallelism + understanding latency and bandwidth)


Today

▪ Today we will talk about computer architecture

▪ Four key concepts about how modern computers work
  - Two concern parallel execution
  - Two concern challenges of accessing memory

▪ Understanding these architecture basics will help you
  - Understand and optimize the performance of your parallel programs
  - Gain intuition about what workloads might benefit from fast parallel machines

Example program

void sinx(int N, int terms, float* x, float* result)
{
    for (int i=0; i<N; i++)
    {
        float value = x[i];
        float numer = x[i] * x[i] * x[i];
        int denom = 6; // 3!
        int sign = -1;

        for (int j=1; j<=terms; j++)
        {
            value += sign * numer / denom;
            numer *= x[i] * x[i];
            denom *= (2*j+2) * (2*j+3);
            sign *= -1;
        }

        result[i] = value;
    }
}

Compute sin(x) using a Taylor expansion:
sin(x) = x - x^3/3! + x^5/5! - x^7/7! + ...
for each element of an array of N floating-point numbers.

Compile program

void sinx(int N, int terms, float* x, float* result)
{
    for (int i=0; i<N; i++)
    {
        float value = x[i];
        float numer = x[i] * x[i] * x[i];
        int denom = 6; // 3!
        int sign = -1;

        for (int j=1; j<=terms; j++)
        {
            value += sign * numer / denom;
            numer *= x[i] * x[i];
            denom *= (2*j+2) * (2*j+3);
            sign *= -1;
        }

        result[i] = value;
    }
}

The compiler translates the loop body into an instruction stream:

    ld   r0, addr[r1]      // load x[i]
    mul  r1, r0, r0
    mul  r1, r1, r0
    ...
    st   addr[r2], r0      // store result[i]

Execute program

My very simple processor: executes one instruction per clock.

[Diagram: a processor with a Fetch/Decode unit, an ALU (Execute), an Execution Context, and a PC pointing into the instruction stream below]

    ld   r0, addr[r1]      // load x[i]
    mul  r1, r0, r0
    mul  r1, r1, r0
    ...
    st   addr[r2], r0      // store result[i]


Superscalar processor

Recall from last class: instruction level parallelism (ILP).
Decode and execute two instructions per clock (if possible).

[Diagram: a processor with two fetch/decode units (Fetch/Decode 1, Fetch/Decode 2), two execution units (Exec 1, Exec 2), and a single shared Execution Context]

    ld   r0, addr[r1]      // load x[i]
    mul  r1, r0, r0
    mul  r1, r1, r0
    ...
    st   addr[r2], r0      // store result[i]

Note: No ILP exists in this region of the program (each instruction depends on the result of the previous one).

Aside: Pentium 4

Image credit: http://ixbtlabs.com/articles/pentium4/index.html

Processor: multi-core era

[Diagram: a single core with a Fetch/Decode unit, an ALU (Execute), and an Execution Context]

Idea #1: Use the increasing transistor count to add more cores to the processor, rather than using transistors to increase the sophistication of processor logic that accelerates a single instruction stream (e.g., out-of-order and speculative execution).

Two cores: compute two elements in parallel

[Diagram: two cores, each with its own Fetch/Decode unit, ALU (Execute), and Execution Context; one core runs the instruction stream for x[i] (producing result[i]) while the other runs the same stream for x[j] (producing result[j])]

    ld   r0, addr[r1]          ld   r0, addr[r1]
    mul  r1, r0, r0            mul  r1, r0, r0
    mul  r1, r1, r0            mul  r1, r1, r0
    ...                        ...
    st   addr[r2], r0          st   addr[r2], r0

Simpler cores: each core is slower at running a single instruction stream than our original "fancy" core (e.g., 0.75 times as fast). But there are now two cores: 2 × 0.75 = 1.5 (potential for speedup!)

Expressing parallelism using pthreads

void sinx(int N, int terms, float* x, float* result)
{
    for (int i=0; i<N; i++)
    {
        float value = x[i];
        float numer = x[i] * x[i] * x[i];
        int denom = 6; // 3!
        int sign = -1;

        for (int j=1; j<=terms; j++)
        {
            value += sign * numer / denom;
            numer *= x[i] * x[i];
            denom *= (2*j+2) * (2*j+3);
            sign *= -1;
        }

        result[i] = value;
    }
}

typedef struct {
    int N;
    int terms;
    float* x;
    float* result;
} my_args;

// pthreads requires a start routine with signature void* (*)(void*)
void* my_thread_start(void* thread_arg)
{
    my_args* thread_args = (my_args*)thread_arg;
    sinx(thread_args->N, thread_args->terms, thread_args->x, thread_args->result); // do work
    return NULL;
}

void parallel_sinx(int N, int terms, float* x, float* result)
{
    pthread_t thread_id;
    my_args args;

    args.N = N/2;
    args.terms = terms;
    args.x = x;
    args.result = result;

    pthread_create(&thread_id, NULL, my_thread_start, &args); // launch thread
    sinx(N - args.N, terms, x + args.N, result + args.N);     // do work
    pthread_join(thread_id, NULL);
}

Data-parallel expression

void sinx(int N, int terms, float* x, float* result)
{
    // declare independent loop iterations
    forall (int i from 0 to N-1)
    {
        float value = x[i];
        float numer = x[i] * x[i] * x[i];
        int denom = 6; // 3!
        int sign = -1;

        for (int j=1; j<=terms; j++)
        {
            value += sign * numer / denom;
            numer *= x[i] * x[i];
            denom *= (2*j+2) * (2*j+3);
            sign *= -1;
        }

        result[i] = value;
    }
}

(in our fictitious data-parallel language)

Loop iterations are declared by the programmer to be independent. With this information, you could imagine how a compiler might automatically generate parallel threaded code.

Sixteen cores: compute sixteen elements in parallel

Sixteen cores, sixteen simultaneous instruction streams


Multi-core examples

Intel "Coffee Lake" Core i7 hexa-core CPU (2017)
[Die photo: six CPU cores (Core 1 through Core 6) plus an integrated GPU]

NVIDIA GTX 1080 GPU: 20 replicated processing cores ("SM") (2016)