
Parallel Processing Techniques: Multi-Core and SIMD Execution, Exercises of Logic

Parallel processing techniques using multi-core and SIMD (Single Instruction, Multiple Data) execution: using multiple cores to execute different instruction streams, and using SIMD units to process multiple data elements simultaneously. The document also discusses the benefits and costs of these techniques and provides examples of their implementation. From Carnegie Mellon University's Computer Science course 15-418/618, taught in the Fall of 2018.

Typology: Exercises

2021/2022

Uploaded on 09/12/2022 by ashnay


Parallel Computer Architecture and Programming
CMU 15-418/15-618, Fall 2018
Lecture 2:
A Modern Multi-Core
Processor
(Forms of parallelism + understanding latency and bandwidth)


Today

▪ Today we will talk about computer architecture

▪ Four key concepts about how modern computers work
  - Two concern parallel execution
  - Two concern challenges of accessing memory

▪ Understanding these architecture basics will help you
  - Understand and optimize the performance of your parallel programs
  - Gain intuition about what workloads might benefit from fast parallel machines

Example program

void sinx(int N, int terms, float* x, float* result)
{
    for (int i=0; i<N; i++)
    {
        float value = x[i];
        float numer = x[i] * x[i] * x[i];
        int denom = 6; // 3!
        int sign = -1;

        for (int j=1; j<=terms; j++)
        {
            value += sign * numer / denom;
            numer *= x[i] * x[i];
            denom *= (2*j+2) * (2*j+3);
            sign *= -1;
        }

        result[i] = value;
    }
}

Compute sin(x) using a Taylor expansion:
sin(x) = x - x^3/3! + x^5/5! - x^7/7! + ...
for each element of an array of N floating-point numbers.

Compile program

void sinx(int N, int terms, float* x, float* result)
{
    for (int i=0; i<N; i++)
    {
        float value = x[i];
        float numer = x[i] * x[i] * x[i];
        int denom = 6; // 3!
        int sign = -1;

        for (int j=1; j<=terms; j++)
        {
            value += sign * numer / denom;
            numer *= x[i] * x[i];
            denom *= (2*j+2) * (2*j+3);
            sign *= -1;
        }

        result[i] = value;
    }
}

The compiler translates the loop body into an instruction stream:

    ld   r0, addr[r1]      // load x[i]
    mul  r1, r0, r0
    mul  r1, r1, r0
    ...
    st   addr[r2], r0      // store result[i]

Execute program

My very simple processor: executes one instruction per clock.

[Diagram: a processor with a Fetch/Decode unit, an ALU (Execute), an Execution Context, and a PC pointing into the instruction stream below]

    ld   r0, addr[r1]      // load x[i]
    mul  r1, r0, r0
    mul  r1, r1, r0
    ...
    st   addr[r2], r0      // store result[i]


Superscalar processor

Recall from last class: instruction level parallelism (ILP).
Decode and execute two instructions per clock (if possible).

[Diagram: a processor with two fetch/decode units (Fetch/Decode 1, Fetch/Decode 2), two execution units (Exec 1, Exec 2), and a single shared Execution Context]

    ld   r0, addr[r1]      // load x[i]
    mul  r1, r0, r0
    mul  r1, r1, r0
    ...
    st   addr[r2], r0      // store result[i]

Note: No ILP exists in this region of the program (each instruction depends on the result of the previous one).

Aside: Pentium 4

Image credit: http://ixbtlabs.com/articles/pentium4/index.html

Processor: multi-core era

[Diagram: a single core with a Fetch/Decode unit, an ALU (Execute), and an Execution Context]

Idea #1: Use the increasing transistor count to add more cores to the processor, rather than using transistors to increase the sophistication of processor logic that accelerates a single instruction stream (e.g., out-of-order and speculative execution).

Two cores: compute two elements in parallel

[Diagram: two cores, each with its own Fetch/Decode unit, ALU (Execute), and Execution Context; one core runs the instruction stream for x[i] (producing result[i]) while the other runs the same stream for x[j] (producing result[j])]

    ld   r0, addr[r1]          ld   r0, addr[r1]
    mul  r1, r0, r0            mul  r1, r0, r0
    mul  r1, r1, r0            mul  r1, r1, r0
    ...                        ...
    st   addr[r2], r0          st   addr[r2], r0

Simpler cores: each core is slower at running a single instruction stream than our original "fancy" core (e.g., 0.75 times as fast). But there are now two cores: 2 × 0.75 = 1.5 (potential for speedup!)

Expressing parallelism using pthreads

void sinx(int N, int terms, float* x, float* result)
{
    for (int i=0; i<N; i++)
    {
        float value = x[i];
        float numer = x[i] * x[i] * x[i];
        int denom = 6; // 3!
        int sign = -1;

        for (int j=1; j<=terms; j++)
        {
            value += sign * numer / denom;
            numer *= x[i] * x[i];
            denom *= (2*j+2) * (2*j+3);
            sign *= -1;
        }

        result[i] = value;
    }
}

typedef struct {
    int N;
    int terms;
    float* x;
    float* result;
} my_args;

// pthreads requires a start routine with signature void* (*)(void*)
void* my_thread_start(void* thread_arg)
{
    my_args* thread_args = (my_args*)thread_arg;
    sinx(thread_args->N, thread_args->terms, thread_args->x, thread_args->result); // do work
    return NULL;
}

void parallel_sinx(int N, int terms, float* x, float* result)
{
    pthread_t thread_id;
    my_args args;

    args.N = N/2;
    args.terms = terms;
    args.x = x;
    args.result = result;

    pthread_create(&thread_id, NULL, my_thread_start, &args); // launch thread
    sinx(N - args.N, terms, x + args.N, result + args.N);     // do work
    pthread_join(thread_id, NULL);
}

Data-parallel expression

void sinx(int N, int terms, float* x, float* result)
{
    // declare independent loop iterations
    forall (int i from 0 to N-1)
    {
        float value = x[i];
        float numer = x[i] * x[i] * x[i];
        int denom = 6; // 3!
        int sign = -1;

        for (int j=1; j<=terms; j++)
        {
            value += sign * numer / denom;
            numer *= x[i] * x[i];
            denom *= (2*j+2) * (2*j+3);
            sign *= -1;
        }

        result[i] = value;
    }
}

(in our fictitious data-parallel language)

Loop iterations are declared by the programmer to be independent. With this information, you could imagine how a compiler might automatically generate parallel threaded code.

Sixteen cores: compute sixteen elements in parallel

Sixteen cores, sixteen simultaneous instruction streams


Multi-core examples

Intel "Coffee Lake" Core i7 hexa-core CPU (2017)
[Die photo: six CPU cores (Core 1 through Core 6) plus an integrated GPU]

NVIDIA GTX 1080 GPU: 20 replicated processing cores ("SM") (2016)