TRANSCRIPT
![Page 1](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/1.jpg)
Final Exam
12:00 – 13:20, December 14 (Monday), 2009
#330110 (odd student id), #330118 (even student id)
Scope:
• Everything
Closed-book exam
Final exam scores will be posted on the lecture homepage
ICE3003: Computer Architecture | Fall 2009 | Jin-Soo Kim ([email protected])
![Page 2](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/2.jpg)
Parallel Programming
Jin-Soo Kim ([email protected])
Computer Systems Laboratory
Sungkyunkwan University
http://csl.skku.edu
![Page 3](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/3.jpg)
Challenges
Difficult to write parallel programs
• Most programmers think sequentially
• Performance vs. correctness tradeoffs
• Missing good parallel abstractions
Automatic parallelization by compilers
• Works with some applications: loop parallelism, reduction
• Unclear how to apply it to other complex applications
![Page 4](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/4.jpg)
Concurrent vs. Parallel
Concurrent program
• A program containing two or more processes (threads)
Parallel program
• A concurrent program in which each process (thread) executes on its own processor, and hence the processes (threads) execute in parallel
Concurrent programming has been around for a while, so why do we bother?
• Logical parallelism (GUIs, asynchronous events, …) vs. physical parallelism (for performance)
![Page 5](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/5.jpg)
Think Parallel or Perish
The free lunch is over!
![Page 6](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/6.jpg)
Parallel Prog. Models (1)
Shared address space programming model
• Single address space for all CPUs
• Communication through regular load/store (implicit)
• Synchronization using locks and barriers (explicit)
• Ease of programming
• Complex hardware for cache coherence
• Thread APIs (Pthreads, …)
• OpenMP
• Intel TBB (Threading Building Blocks)
![Page 7](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/7.jpg)
Parallel Prog. Models (2)
Message passing programming model
• Private address space per CPU
• Communication through message send/receive over a network interface (explicit)
• Synchronization using blocking messages (implicit)
• Need to program explicit communication
• Simple hardware (no cache-coherence support needed in hardware)
• RPC (Remote Procedure Calls)
• PVM (obsolete)
• MPI (de facto standard)
![Page 8](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/8.jpg)
Parallel Prog. Models (3)
Parallel programming models vs. parallel computer architectures

| Prog. Models \ Architectures | Shared-memory | Distributed-memory |
| --- | --- | --- |
| Single Address Space | Pthreads, OpenMP | CC-NUMA, Software DSM |
| Message Passing | Multi-processes | MPI |
![Page 9](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/9.jpg)
Speedup (1)
Amdahl's Law

    Speedup = 1 / ((1 - P) + P/n)

• S = (1 - P): the fraction of time spent executing the serial portion
• n: the number of processor cores
• A theoretical basis by which the speedup of parallel computations can be estimated.
• The theoretical upper limit of speedup is limited by the serial portion of the code.
![Page 10](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/10.jpg)
Speedup (2)
Compare two improvements, starting from a 20% serial portion (P = 0.8) on 8 cores (speedup 3.33):
• Reducing the serial portion (20% → 10%): speedup = 1 / (0.1 + 0.9/8) ≈ 4.71
• Doubling the number of cores (8 → 16): speedup = 1 / (0.2 + 0.8/16) = 4.0
Reducing the serial portion helps more than doubling the number of cores.
![Page 11](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/11.jpg)
Speedup (3)
Speedup in reality

    Practical Speedup = 1 / ((1 - P) + P/n + H(n))

• H(n): overhead
  – Thread management & scheduling
  – Communication & synchronization
  – Load balancing
  – Extra computation
  – Operating system overhead
For better speedup,
• Parallelize more (increase P)
• Parallelize effectively (reduce H(n))
![Page 12](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/12.jpg)
Calculating Pi (1)

    tan(π/4) = 1, so atan(1) = π/4

    ∫₀¹ 4 / (1 + x²) dx = 4 · (atan(1) - atan(0)) = 4 · (π/4 - 0) = π

(Figure: the curves y = atan(x) and y = 1 / (1 + x²) on [0, 1].)
![Page 13](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/13.jpg)
Calculating Pi (2)
Pi: Sequential version

```c
#define N    20000000
#define STEP (1.0 / (double) N)

double compute ()
{
    int i;
    double x;
    double sum = 0.0;

    for (i = 0; i < N; i++) {
        x = (i + 0.5) * STEP;
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * STEP;
}
```

Each term samples 4 / (1 + x²) at the interval midpoint x = (i + 0.5)/N, so the sum approximates π ≈ (1/N) · Σ (i = 0 .. N-1) 4 / (1 + ((i + 0.5)/N)²).
![Page 14](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/14.jpg)
Threads API (1)
Threads are the natural unit of execution for parallel programs on shared-memory hardware
• Windows threading API
• POSIX threads (Pthreads)
Threads have their own
• Execution context
• Stack
Threads share
• Address space (code, data, heap)
• Resources (PID, open files, etc.)
![Page 15](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/15.jpg)
Threads API (2)
Pi: Pthreads version

```c
#include <pthread.h>

#define N        20000000
#define STEP     (1.0 / (double) N)
#define NTHREADS 8

double pi = 0.0;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *compute (void *arg)
{
    int i;
    double x;
    double sum = 0.0;
    int id = (int) (long) arg;    /* thread index passed through the arg */

    for (i = id; i < N; i += NTHREADS) {
        x = (i + 0.5) * STEP;
        sum += 4.0 / (1.0 + x * x);
    }
    pthread_mutex_lock (&lock);   /* pi is shared: update under the lock */
    pi += (sum * STEP);
    pthread_mutex_unlock (&lock);
    return NULL;
}

int main ()
{
    int i;
    pthread_t tid[NTHREADS];

    for (i = 0; i < NTHREADS; i++)
        pthread_create (&tid[i], NULL, compute, (void *) (long) i);
    for (i = 0; i < NTHREADS; i++)
        pthread_join (tid[i], NULL);
    return 0;
}
```
![Page 16](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/16.jpg)
Threads API (3)
Pros
• Flexible and widely available
• The thread library gives you detailed control over the threads
• Performance can be good
Cons
• YOU have to take detailed control over the threads
• Relatively low level of abstraction
• No easy code migration path from a sequential program
• Lack of structure makes it error-prone
![Page 17](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/17.jpg)
OpenMP (1)
What is OpenMP?
• A set of compiler directives and library routines for portable parallel programming on shared-memory systems
  – Fortran 77, Fortran 90, C, and C++
  – Multi-vendor support, for both Unix and Windows
• Standardizes loop-level (data) parallelism
• Combines serial and parallel code in a single source
• Incremental approach to parallelization
• http://www.openmp.org, http://www.compunity.org
  – Standard documents, tutorials, sample codes
![Page 18](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/18.jpg)
OpenMP (2)
Fork-join model
• Master thread spawns a team of threads as needed
• Synchronize when leaving a parallel region (join)
• Only the master executes the sequential part
![Page 19](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/19.jpg)
OpenMP (3)
Parallel construct

```c
#pragma omp parallel
{
    task();   /* executed by every thread in the team */
}
```
![Page 20](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/20.jpg)
OpenMP (4)
Work-sharing construct: sections

```c
#pragma omp parallel sections
{
    #pragma omp section
    phase1();
    #pragma omp section
    phase2();
    #pragma omp section
    phase3();
}
```
![Page 21](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/21.jpg)
OpenMP (5)
Work-sharing construct: loop

```c
#pragma omp parallel for
for (i = 0; i < 12; i++)
    c[i] = a[i] + b[i];
```

Equivalently, a `for` construct inside a parallel region:

```c
#pragma omp parallel
{
    task ();
    #pragma omp for
    for (i = 0; i < 12; i++)
        c[i] = a[i] + b[i];
}
```
![Page 22](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/22.jpg)
OpenMP (6)
Pi: OpenMP version

```c
#ifdef _OPENMP
#include <omp.h>
#endif

#define N    20000000
#define STEP (1.0 / (double) N)

double compute ()
{
    int i;
    double x;
    double sum = 0.0;

#pragma omp parallel for private(x) reduction(+:sum)
    for (i = 0; i < N; i++) {
        x = (i + 0.5) * STEP;
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * STEP;
}

int main ()
{
    double pi;
    pi = compute();
}
```
![Page 23](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/23.jpg)
OpenMP (7)
Environment
• Dual quad-core Xeon (8 cores), 2.4 GHz
• Intel OpenMP compiler
(Figure: speedup of Pi (OpenMP) vs. the number of cores (1-8) for N = 2e6, N = 2e7, and N = 2e8, plotted against the ideal speedup.)
![Page 24](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/24.jpg)
OpenMP (8)
Pros
• Simple and portable
• Single source code for both serial and parallel versions
• Incremental parallelization
Cons
• Primarily for bounded loops over built-in types
  – Only for Fortran-style C/C++ programs
  – Pointer-chasing loops?
• Performance may be degraded due to
  – Synchronization overhead
  – Load imbalance
  – Thread management overhead
![Page 25](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/25.jpg)
MPI (1)
Message Passing Interface
• Library standard defined by a committee of vendors, implementers, and parallel programmers
• Used to create parallel programs based on message passing
• Available on almost all parallel machines in C/Fortran
• Two phases
  – MPI-1: traditional message passing
  – MPI-2: remote memory, parallel I/O, dynamic processes
• Tasks are typically created when the jobs are launched
• http://www.mpi-forum.org
![Page 26](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/26.jpg)
MPI (2)
Point-to-point communication
• Sending and receiving messages

Process 0 sends the four ints in A[4] to process 1 with tag 1000:

```c
MPI_Send (A, 4, MPI_INT, 1, 1000, MPI_COMM_WORLD);
```

Process 1 receives them into B[4]:

```c
MPI_Recv (B, 4, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
          MPI_COMM_WORLD, &status);
```
![Page 27](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/27.jpg)
MPI (3)
Collective communication
• Broadcast: the root's array is copied to every process's array
  – MPI_Bcast (array, 100, MPI_INT, 0, comm);
• Reduce: each process's localsum is combined (here, summed) into sum on the root
  – MPI_Reduce (&localsum, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, comm);
![Page 28](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/28.jpg)
MPI (4)
Pi: MPI version

```c
#include "mpi.h"
#include <stdio.h>

#define N    20000000
#define STEP (1.0 / (double) N)
#define f(x) (4.0 / (1.0 + (x)*(x)))

int main (int argc, char *argv[])
{
    int i, numprocs, myid, start, end;
    double sum = 0.0, mypi, pi;

    MPI_Init (&argc, &argv);
    MPI_Comm_size (MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank (MPI_COMM_WORLD, &myid);

    /* split the 1..N index range evenly across the processes */
    start = (N / numprocs) * myid + 1;
    end = (myid == numprocs - 1) ? N : (N / numprocs) * (myid + 1);

    for (i = start; i <= end; i++)
        sum += f(STEP * ((double) i - 0.5));
    mypi = STEP * sum;

    MPI_Reduce (&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Finalize ();
}
```
![Page 29](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/29.jpg)
MPI (5)
Pros
• Runs on either shared- or distributed-memory architectures
• Can be used on a wider range of problems
• Distributed-memory computers are less expensive than large shared-memory computers
Cons
• Requires more programming changes to go from a serial to a parallel version
• Can be harder to debug
• Performance is limited by the communication network between the nodes
![Page 30](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/30.jpg)
Parallel Benchmarks (1)
Linpack
• Matrix linear algebra
• Basis for measuring the "Top500 Supercomputing Sites" (http://www.top500.org)
SPECrate
• Parallel run of SPEC CPU programs
• Job-level (or task-level) parallelism
SPLASH
• Stanford Parallel Applications for Shared Memory
• Mix of kernels and applications
• Strong scaling: keep the problem size fixed
![Page 31](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/31.jpg)
Parallel Benchmarks (2)
NAS parallel benchmark suite
• By NASA Advanced Supercomputing (NAS)
• Computational fluid dynamics kernels
PARSEC suite
• Princeton Application Repository for Shared-Memory Computers
• Multithreaded applications using Pthreads and OpenMP
![Page 32](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/32.jpg)
Cache Coherence (1)
Examples
• CPU0 reads X (= 0 in memory) and so does CPU1; CPU0 then writes X = 1 in its cache. Should CPU1's next read of X return its stale cached 0 or the new 1?
  → Invalidate all copies before allowing a write to proceed.
• CPU0 writes X = 1 in its cache while CPU1 writes X = 2 in its cache; both copies are modified. Which value should be flushed to memory?
  → Disallow more than one modified copy.
![Page 33](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/33.jpg)
Cache Coherence (2)
Cache coherence protocol
• A set of rules to maintain the consistency of data stored in the local caches as well as in main memory
• Ensures multiple read-only copies and an exclusive modified copy
• Types
  – Snooping-based (or "snoopy")
  – Directory-based
• Write strategies
  – Invalidation-based
  – Update-based
![Page 34](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/34.jpg)
Cache Coherence (3)
Snooping protocol
• All cache controllers monitor (or snoop) on the bus
  – Send all requests for data to all processors
  – Processors snoop to see if they have a shared block
  – Requires broadcast, since caching info resides at the processors
  – Works well with a bus (natural broadcast)
  – Dominates for small-scale machines
• Cache coherence unit
  – The cache block (line) is the unit of management
  – False sharing is possible: two processors share the same cache line but not the actual word
  – Coherence miss: an invalidation can cause a miss for data read before
![Page 35](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/35.jpg)
Cache Coherence (4)
MESI protocol: invalidation-based
• Modified (M)
  – Cache has the only copy and the copy is modified
  – Memory is not up-to-date
• Exclusive (E)
  – Cache has the only copy and the copy is unmodified (clean)
  – Memory is up-to-date
• Shared (S)
  – Copies may exist in other caches and all are unmodified
  – Memory is up-to-date
• Invalid (I)
  – Not in cache
![Page 36](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/36.jpg)
Cache Coherence (5)
MESI protocol state machine (transitions listed as "event / bus action"):
• I → E: CPU Read / Bus Read, S-signal off
• I → S: CPU Read / Bus Read, S-signal on
• I → M: CPU Write / Bus ReadX
• E → M: CPU Write / - (no bus traffic; invalidate is not needed)
• S → M: CPU Write / Bus ReadX
• E → S: Bus Read / Bus S-signal on
• M → S: Bus Read / Bus S-signal on (flush)
• M → I: Bus ReadX / Bus WriteBack (flush)
• E → I, S → I: Bus ReadX / -
• M, E, S → same state: CPU Read / - (no bus traffic)
• M → M: CPU Write / - (no bus traffic)
If a cache miss occurs, the cache will write back the modified block.
![Page 37](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/37.jpg)
Cache Coherence (6)
Examples revisited
• CPU0 reads X: X = 0 (E). CPU1 reads X: both copies become X = 0 (S). CPU0 writes X = 1: CPU0 holds X = 1 (M) and CPU1's copy becomes (I). CPU1 reads X again: CPU0 flushes X, and both hold X = 1 (S).
  → Invalidate all copies before allowing a write to proceed.
• CPU0 writes X = 1: X = 1 (M) in CPU0. CPU1 writes X = 2: CPU0 flushes X and its copy becomes (I), while CPU1 holds X = 2 (M); the final flush writes X = 2 to memory.
  → Disallow more than one modified copy.
![Page 38](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/38.jpg)
Cache Coherence (7)
False sharing

```c
struct thread_stat {
    int  count;
    char pad[PADS];
} stat[8];

double compute ()
{
    int i;
    double x, sum = 0.0;

#pragma omp parallel for private(x) reduction(+:sum)
    for (i = 0; i < N; i++) {
        int id = omp_get_thread_num();
        stat[id].count++;          /* per-thread counter, but adjacent in memory */
        x = (i + 0.5) * STEP;
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * STEP;
}
```

The per-thread counters stat[0] … stat[7] sit next to each other, so without padding several of them share a cache line.
(Figure: speedup of Pi (OpenMP, N = 2e7) vs. the number of cores (1-8) for Pad = 0, 28, 60, and 124 bytes; larger padding avoids false sharing and scales better.)
![Page 39](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/39.jpg)
Cache-Aware Programming (1)
Pack data more tightly
• Usually, compilers do not pack structures

```c
struct Loose {        /* layout: s (pad) | i | c (pad) | p */
    short s;
    int i;
    char c;
    Foo* p;
};

struct Tight {        /* layout: p | i | s | c (pad) */
    Foo* p;
    int i;
    short s;
    char c;
};
```
![Page 40](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/40.jpg)
Cache-Aware Programming (2)
Work in the cache
• Reduce the working set
• Divide a problem into smaller subproblems
• Reorder steps in the code
Read-only data
• Separate read-only data from read-write data
• Annotate constants with the const keyword
• Read-only data are stored in separate sections
• If possible, separate read-mostly variables as well
![Page 41](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/41.jpg)
Cache-Aware Programming (3)
Read-write data
• Group read-write variables which are used together into a structure
• Move read-write variables which are often written to by different threads onto their own cache lines
  – Reduces false sharing
  – May require padding
• If a variable is used by multiple threads, but every use is independent, move it into Thread Local Storage (TLS)
![Page 42](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/42.jpg)
Cache-Aware Programming (4)
Lock variables
• Keep a lock and the data it protects on distinct cache lines.
• If a lock protects data that is frequently uncontended, try to keep the lock and the data on the same cache line.
![Page 43](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/43.jpg)
Cache-Aware Programming (5)
Thread affinity
• Avoid moving a thread from one core to another
  – Reduces context switching
  – Reduces cache misses
  – Reduces TLB misses
• Bind a process
  – sched_setaffinity() / sched_getaffinity()
• Bind a thread
  – pthread_setaffinity_np() / pthread_getaffinity_np()
  – pthread_attr_setaffinity_np() / pthread_attr_getaffinity_np()
![Page 44](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/44.jpg)
Summary: Why Hard?
Impediments to parallel computing
• Parallelization
  – Parallel algorithms which maximize concurrency
  – Eliminating the serial portion as much as possible
  – Lack of standardized APIs, environments, and tools
• Correct parallelization
  – Shared resource identification
  – Difficult to debug (data races, deadlock, …)
  – Memory consistency
• Effective parallelization
  – Communication & synchronization overhead
  – Problem decomposition, granularity, load imbalance, affinity, …
  – Architecture dependency