Parallel processing techniques using multi-core and SIMD (Single Instruction, Multiple Data) execution: the concept of using multiple cores to execute different instruction streams, and the use of SIMD units to process multiple data elements simultaneously. The document also discusses the benefits and costs of these techniques and provides examples of their implementation. From Carnegie Mellon University's Computer Science course 15-418/618, taught in Fall 2018.
- Two of today's key concepts concern parallel execution
- Understanding them will help you understand and optimize the performance of your parallel programs
// Computes sin(x[i]) for each of N elements using a Taylor series:
// sin(x) = x - x^3/3! + x^5/5! - x^7/7! + ...
void sinx(int N, int terms, float* x, float* result)
{
    for (int i = 0; i < N; i++)
    {
        float value = x[i];
        float numer = x[i] * x[i] * x[i];
        int denom = 6;  // 3!
        int sign = -1;

        for (int j = 1; j <= terms; j++)
        {
            value += sign * numer / denom;
            numer *= x[i] * x[i];
            denom *= (2*j+2) * (2*j+3);
            sign *= -1;
        }

        result[i] = value;
    }
}
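For context, here is a minimal driver that calls sinx. This is my sketch, not from the slides; the array size and the choice of 10 Taylor terms are arbitrary.

#include <stdio.h>

int main(void)
{
    float x[4] = {0.0f, 0.5f, 1.0f, 1.5f};
    float result[4];

    sinx(4, 10, x, result);  // 10 Taylor terms per element (arbitrary choice)
    for (int i = 0; i < 4; i++)
        printf("sin(%f) ~= %f\n", x[i], result[i]);
    return 0;
}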
Compiled, the loop body becomes an instruction stream:

    ld   r0, addr[r1]
    mul  r1, r0, r0
    mul  r1, r1, r0
    ...
    st   addr[r2], r0

[Figure: a simple processor with one Fetch/Decode unit, one ALU (Execute), and one Execution Context, executing this stream one instruction per clock to turn x[i] into result[i]]
[Figure: superscalar execution: a processor with two Fetch/Decode units (Fetch/Decode 1 and 2), two execution units (Exec 1 and Exec 2), and one shared Execution Context, decoding and executing up to two independent instructions per clock from the same instruction stream to produce result[i]]

Image credit: http://ixbtlabs.com/articles/pentium4/index.html
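The hardware discovers this instruction-level parallelism on its own within a single stream; the programmer does nothing. As a rough illustration (mine, not from the slides) of what "independent instructions" means:

#include <stdio.h>

// p and q do not depend on each other, so a two-way superscalar core
// may issue both multiplies in the same clock. r depends on both, so
// it must wait for them to complete.
float ilp_demo(float a, float b)
{
    float p = a * a;  // independent of q
    float q = b * b;  // independent of p
    return p * q;     // dependent: serializes after p and q
}

int main(void)
{
    printf("%f\n", ilp_demo(2.0f, 3.0f));  // prints 36.000000
    return 0;
}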
Two cores: compute two elements in parallel

[Figure: two cores, each with its own Fetch/Decode unit, ALU (Execute), and Execution Context; core 1 runs the instruction stream on x[i] to produce result[i] while core 2 runs the same stream on x[j] to produce result[j]]
Simpler cores: each core is slower at running a single instruction stream than our original “fancy” core (e.g., 0.75 times as fast). But there are now two cores: 2 × 0.75 = 1.5 (potential for speedup!)
Expressing parallelism using pthreads
#include <pthread.h>

typedef struct {
    int N;
    int terms;
    float* x;
    float* result;
} my_args;

void* my_thread_start(void* thread_arg);  // forward declaration

void parallel_sinx(int N, int terms, float* x, float* result)
{
    pthread_t thread_id;
    my_args args;

    args.N = N/2;
    args.terms = terms;
    args.x = x;
    args.result = result;

    pthread_create(&thread_id, NULL, my_thread_start, &args);  // launch thread: first half
    sinx(N - args.N, terms, x + args.N, result + args.N);      // do work: second half
    pthread_join(thread_id, NULL);
}

void* my_thread_start(void* thread_arg)
{
    my_args* thread_args = (my_args*)thread_arg;
    sinx(thread_args->N, thread_args->terms, thread_args->x, thread_args->result);  // do work
    return NULL;
}
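The slide's version splits work across exactly two instruction streams. A natural extension (my sketch, not from the lecture; NUM_THREADS is an assumed tuning knob) launches one thread per chunk, reusing my_args and my_thread_start from above:

#define NUM_THREADS 4  // assumption: typically chosen to match the core count

void parallel_sinx_n(int N, int terms, float* x, float* result)
{
    pthread_t threads[NUM_THREADS];
    my_args args[NUM_THREADS];
    int chunk = N / NUM_THREADS;

    for (int t = 0; t < NUM_THREADS; t++) {
        // Last thread picks up the remainder when N is not evenly divisible.
        args[t].N      = (t == NUM_THREADS - 1) ? N - t*chunk : chunk;
        args[t].terms  = terms;
        args[t].x      = x + t*chunk;
        args[t].result = result + t*chunk;
        pthread_create(&threads[t], NULL, my_thread_start, &args[t]);
    }
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);
}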
void sinx(int N, int terms, float* x, float* result)
{
    // declare independent loop iterations
    forall (int i from 0 to N-1)
    {
        float value = x[i];
        float numer = x[i] * x[i] * x[i];
        int denom = 6;  // 3!
        int sign = -1;

        for (int j = 1; j <= terms; j++)
        {
            value += sign * numer / denom;
            numer *= x[i] * x[i];
            denom *= (2*j+2) * (2*j+3);
            sign *= -1;
        }

        result[i] = value;
    }
}
Loop iterations are declared by the programmer to be independent (in our fictitious data-parallel language). With this information, you could imagine how a compiler might automatically generate parallel threaded code.
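For a real-world analogue (my addition, not from the slides): OpenMP's parallel-for pragma expresses the same "iterations are independent" declaration in standard C, and the compiler and runtime generate the threaded code. Compile with -fopenmp on GCC/Clang.

#include <omp.h>

void sinx_omp(int N, int terms, float* x, float* result)
{
    // The pragma asserts that iterations are independent, so the
    // runtime may divide them among threads however it likes.
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        float value = x[i];
        float numer = x[i] * x[i] * x[i];
        int denom = 6;  // 3!
        int sign = -1;

        for (int j = 1; j <= terms; j++) {
            value += sign * numer / denom;
            numer *= x[i] * x[i];
            denom *= (2*j+2) * (2*j+3);
            sign *= -1;
        }
        result[i] = value;
    }
}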
Sixteen cores: compute sixteen elements in parallel, with sixteen simultaneous instruction streams
Intel “Coffee Lake” Core i7 hexa-core CPU (2017)
NVIDIA GTX 1080 GPU: 20 replicated processing cores (“SMs”) (2016)
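The document's summary also names SIMD execution: a single instruction operating on multiple data elements at once, within one core. As a hedged illustration (mine, not from the slides; requires an AVX-capable CPU and compiling with -mavx), AVX intrinsics process eight floats per instruction:

#include <immintrin.h>
#include <stdio.h>

// One AVX multiply instruction squares eight floats at once.
// Assumption: N is a multiple of 8.
void square8(int N, float* x, float* result)
{
    for (int i = 0; i < N; i += 8) {
        __m256 v  = _mm256_loadu_ps(x + i);   // load 8 floats
        __m256 sq = _mm256_mul_ps(v, v);      // 8 multiplies, 1 instruction
        _mm256_storeu_ps(result + i, sq);     // store 8 results
    }
}

int main(void)
{
    float x[8] = {1, 2, 3, 4, 5, 6, 7, 8}, r[8];
    square8(8, x, r);
    for (int i = 0; i < 8; i++)
        printf("%.0f ", r[i]);  // prints: 1 4 9 16 25 36 49 64
    printf("\n");
    return 0;
}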