Open Multiprocessing
Dr. Bo Yuan
E-mail: [email protected]

Note on Parallel Programming
• An incorrect program may produce correct results.
– The order of execution of processes/threads is unpredictable.
– May depend on your luck!
• A program that always produces correct results may not make sense.
– The outputs of a program are just part of the story.
– Efficiency matters!

OpenMP
• An API for shared memory multiprocessing (parallel) programming in C, C++ and Fortran.
– Supports multiple platforms (processor architectures and operating systems).
– Higher level implementation (a block of code that should be executed in parallel).
• A method of parallelizing whereby a master thread forks a number of slave threads and a task is divided among them.
• Based on preprocessor directives (pragmas).
– Requires compiler support.
– omp.h
• References
– http://openmp.org/
– https://computing.llnl.gov/tutorials/openMP/
– http://supercomputingblog.com/openmp/

Hello, World!
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    void Hello(void);

    int main(int argc, char* argv[]) {
        /* Get the number of threads from command line */
        int thread_count=strtol(argv[1], NULL, 10);

    #   pragma omp parallel num_threads(thread_count)
        Hello();

        return 0;
    }

    void Hello(void) {
        int my_rank=omp_get_thread_num();
        int thread_count=omp_get_num_threads();

        printf("Hello from thread %d of %d\n", my_rank, thread_count);
    }

Definitions
    # pragma omp parallel [clauses]
    { code_block }
• clauses: text to modify the directive
• Thread Team = Master + Slaves
• There is an implicit barrier at the end of the code block.
Error Checking
    #ifdef _OPENMP
    # include <omp.h>
    #endif

    #ifdef _OPENMP
        int my_rank=omp_get_thread_num();
        int thread_count=omp_get_num_threads();
    #else
        int my_rank=0;
        int thread_count=1;
    #endif

The Trapezoidal Rule
    /* Input: a, b, n */
    h=(b-a)/n;
    approx=(f(a)+f(b))/2.0;
    for (i=1; i<=n-1; i++) {
        x_i=a+i*h;
        approx+=f(x_i);
    }
    approx=h*approx;
• Shared Memory: each thread (Thread 0, ..., Thread 2) adds its partial result to a shared variable.
• Shared Variables + unprotected updates = Race Condition, so the update goes in a critical section:
    # pragma omp critical
    global_result+=my_result;
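
The critical Directive
    # pragma omp critical
    y=f(x);
    ...
    double f(double x) {
    #   pragma omp critical
        z=g(x);
        ...
    }
Unnamed critical sections cannot be executed simultaneously! The critical section inside f waits for the one that called f: Deadlock.
    # pragma omp critical(one)
    y=f(x);
    ...
    double f(double x) {
    #   pragma omp critical(two)
        z=g(x);
        ...
    }
With different names, the two critical sections use different locks, so the nested call can proceed.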

The atomic Directive
    # pragma omp atomic
    x <op>=<expression>;
<op> can be one of the binary operators: +, *, -, /, &, ^, |, <<, >>
The statement can also have one of the forms: x++, ++x, x--, --x
• Higher performance than the critical directive.
• Only a single C assignment statement is protected.
• Only the load and store of x are protected.
• <expression> must not reference x.
An atomic directive and a critical directive do not exclude each other. The two statements below can be executed simultaneously!
    # pragma omp atomic
    x+=f(y);

    # pragma omp critical
    x=g(x);
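As a minimal sketch of the atomic directive protecting a single update of a shared variable, the function below counts multiples of d with a shared counter. The name count_multiples and its parameters are hypothetical, chosen only for illustration.

```c
#include <assert.h>

/* Hypothetical helper: counts the integers in [0, n) that are divisible by d.
   All threads increment the same shared counter; the atomic directive
   protects the load and store of count for each increment. */
int count_multiples(int n, int d) {
    int count = 0;
    int i;
    assert(d > 0);
#   pragma omp parallel for
    for (i = 0; i < n; i++) {
        if (i % d == 0) {
#           pragma omp atomic
            count++;
        }
    }
    return count;
}
```

Without the atomic directive the increments could interleave and some would be lost. In practice a reduction(+: count) clause would usually be faster for this pattern, since the threads then do not contend for one variable.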

Locks
    /* Executed by one thread */
    Initialize the lock data structure;
    ...
    /* Executed by multiple threads */
    Attempt to lock or set the lock data structure;
    Critical section;
    Unlock or unset the lock data structure;
    ...
    /* Executed by one thread */
    Destroy the lock data structure;

    void omp_init_lock(omp_lock_t* lock_p);
    void omp_set_lock(omp_lock_t* lock_p);
    void omp_unset_lock(omp_lock_t* lock_p);
    void omp_destroy_lock(omp_lock_t* lock_p);
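The four steps above (initialize, set, unset, destroy) can be sketched with the OpenMP lock functions as follows. The function locked_sum is a hypothetical example, and the _OPENMP guards are only there so the code also compiles without OpenMP support.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical example: sums the integers 1..n, protecting the shared
   total with an explicit lock instead of a critical directive. */
long locked_sum(int n) {
    long total = 0;
    int i;
#ifdef _OPENMP
    omp_lock_t lock;
    omp_init_lock(&lock);        /* executed by one thread */
#endif
    assert(n >= 0);
#   pragma omp parallel for
    for (i = 1; i <= n; i++) {
#ifdef _OPENMP
        omp_set_lock(&lock);     /* blocks until the lock is available */
#endif
        total += i;              /* critical section */
#ifdef _OPENMP
        omp_unset_lock(&lock);
#endif
    }
#ifdef _OPENMP
    omp_destroy_lock(&lock);     /* executed by one thread */
#endif
    return total;
}
```

A lock is tied to a data structure rather than to a block of code, so different objects can be given different locks.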

Trapezoidal Rule in OpenMP
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    void Trap(double a, double b, int n, double* global_result_p);

    int main(int argc, char* argv[]) {
        double global_result=0.0;
        double a, b;
        int n, thread_count;

        thread_count=strtol(argv[1], NULL, 10);
        printf("Enter a, b, and n\n");
        scanf("%lf %lf %d", &a, &b, &n);
    #   pragma omp parallel num_threads(thread_count)
        Trap(a, b, n, &global_result);

        printf("With n=%d trapezoids, our estimate\n", n);
        printf("of the integral from %f to %f = %.15e\n", a, b, global_result);
        return 0;
    }

Trapezoidal Rule in OpenMP
    void Trap(double a, double b, int n, double* global_result_p) {
        double h, x, my_result;
        double local_a, local_b;
        int i, local_n;
        int my_rank=omp_get_thread_num();
        int thread_count=omp_get_num_threads();

        h=(b-a)/n;
        local_n=n/thread_count;   /* assumes n is divisible by thread_count */
        local_a=a+my_rank*local_n*h;
        local_b=local_a+local_n*h;
        my_result=(f(local_a)+f(local_b))/2.0;
        for(i=1; i<=local_n-1; i++) {
            x=local_a+i*h;
            my_result+=f(x);
        }
        my_result=my_result*h;

    #   pragma omp critical
        *global_result_p+=my_result;
    }

Scope of Variables
Private Scope
• Only accessible by a single thread
• Declared in the code block
• Examples: my_rank, my_result, global_result_p
Shared Scope
• Accessible by all threads in a team
• Declared before a parallel directive
• Examples: a, b, n, global_result, thread_count, *global_result_p
In serial programming:
• Function-wide scope
• File-wide scope

Another Trap Function
    double Local_trap(double a, double b, int n);

Version 1: the call itself is inside the critical section, so the threads execute Local_trap one at a time.
    global_result=0.0;
    # pragma omp parallel num_threads(thread_count)
    {
    #   pragma omp critical
        global_result+=Local_trap(a, b, n);
    }

Version 2: only the update of the shared variable is inside the critical section.
    global_result=0.0;
    # pragma omp parallel num_threads(thread_count)
    {
        double my_result=0.0; /* Private */
        my_result=Local_trap(a, b, n);
    #   pragma omp critical
        global_result+=my_result;
    }

The Reduction Clause
• Reduction: A computation (binary operation) that repeatedly applies the same reduction operator (e.g., addition or multiplication) to a sequence of operands in order to get a single result.
• Note:
– The reduction variable itself is shared.
– A private variable is created for each thread in the team.
– The private variables are initialized to 0 for the addition operator.
    global_result=0.0;
    # pragma omp parallel num_threads(thread_count) \
        reduction(+: global_result)
    global_result+=Local_trap(a, b, n);
Syntax: reduction(<operator>: <variable list>)
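The slides declare Local_trap without defining it. The sketch below is one possible definition, modelled on the Trap function shown earlier, together with a hypothetical wrapper Trap_reduction that applies the reduction clause. The integrand f(x) = x*x is an assumption made purely for illustration, and n is assumed to be divisible by the number of threads.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Integrand assumed for illustration only. */
static double f(double x) { return x * x; }

/* Each thread integrates its own block of n/thread_count trapezoids.
   Assumes n is evenly divisible by the number of threads in the team. */
double Local_trap(double a, double b, int n) {
#ifdef _OPENMP
    int my_rank = omp_get_thread_num();
    int thread_count = omp_get_num_threads();
#else
    int my_rank = 0;
    int thread_count = 1;
#endif
    double h = (b - a) / n;
    int local_n = n / thread_count;
    double local_a = a + my_rank * local_n * h;
    double local_b = local_a + local_n * h;
    double my_result = (f(local_a) + f(local_b)) / 2.0;
    int i;
    for (i = 1; i <= local_n - 1; i++)
        my_result += f(local_a + i * h);
    return my_result * h;
}

/* Hypothetical wrapper: every thread adds its part to a private copy of
   global_result; the copies are summed when the parallel block ends. */
double Trap_reduction(double a, double b, int n, int thread_count) {
    double global_result = 0.0;
    assert(n > 0 && thread_count > 0);
#   pragma omp parallel num_threads(thread_count) reduction(+: global_result)
    global_result += Local_trap(a, b, n);
    return global_result;
}
```

For example, Trap_reduction(0.0, 3.0, 12000, 4) approximates the integral of x*x over [0, 3], whose exact value is 9.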

The parallel for Directive
Serial code:
    h=(b-a)/n;
    approx=(f(a)+f(b))/2.0;
    for (i=1; i<=n-1; i++) {
        approx+=f(a+i*h);
    }
    approx=h*approx;
Parallel code:
    h=(b-a)/n;
    approx=(f(a)+f(b))/2.0;
    # pragma omp parallel for num_threads(thread_count) \
        reduction(+: approx)
    for (i=1; i<=n-1; i++) {
        approx+=f(a+i*h);
    }
    approx=h*approx;
• The code block must be a for loop.
• Iterations of the for loop are divided among threads.
• approx is a reduction variable.
• i is a private variable.

The parallel for Directive
• Sounds like a truly wonderful approach to parallelizing serial programs.
• Does not work with while or do-while loops.
– How about converting them into for loops?
• The number of iterations must be determined in advance, which rules out loops such as:
    for (; ;) {
        ...
    }

    for (i=0; i<n; i++) {
        if (...) break;
        ...
    }
• In nested loops only the index of the parallelized loop is private by default. Here y is declared outside the loop and would be shared, so the clause private(y) is required:
    int x, y;
    # pragma omp parallel for num_threads(thread_count) private(y)
    for(x=0; x < width; x++) {
        for(y=0; y < height; y++) {
            finalImage[x][y] = f(x, y);
        }
    }
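
Estimating π
\[ \pi = 4\left[1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \cdots\right] = 4\sum_{k=0}^{\infty}\frac{(-1)^k}{2k+1} \]
Serial code:
    double factor=1.0;
    double sum=0.0;
    for(k=0; k<n; k++) {
        sum+=factor/(2*k+1);
        factor=-factor;
    }
    pi_approx=4.0*sum;
Does this parallel version work?
    double factor=1.0;
    double sum=0.0;
    # pragma omp parallel for \
        num_threads(thread_count) \
        reduction(+: sum)
    for(k=0; k<n; k++) {
        sum+=factor/(2*k+1);
        factor=-factor;
    }
    pi_approx=4.0*sum;
Loop-carried dependence: the value of factor in iteration k depends on iteration k-1.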

Estimating π
Compute factor from k alone to remove the loop-carried dependence:
    if(k%2 == 0)
        factor=1.0;
    else
        factor=-1.0;
    sum+=factor/(2*k+1);
or, equivalently:
    factor=(k%2 == 0)?1.0: -1.0;
    sum+=factor/(2*k+1);
Parallel version, with factor private to each thread:
    double factor=1.0;
    double sum=0.0;
    # pragma omp parallel for num_threads(thread_count) \
        reduction(+: sum) private(factor)
    for(k=0; k<n; k++) {
        if(k%2 == 0)
            factor=1.0;
        else
            factor=-1.0;
        sum+=factor/(2*k+1);
    }
    pi_approx=4.0*sum;

Scope Matters
    double factor=1.0;
    double sum=0.0;
    # pragma omp parallel for num_threads(thread_count) \
        default(none) reduction(+: sum) private(k, factor) shared(n)
    for(k=0; k<n; k++) {
        if(k%2 == 0)
            factor=1.0;
        else
            factor=-1.0;
        sum+=factor/(2*k+1);
    }
    pi_approx=4.0*sum;
• With the default(none) clause, we need to specify the scope of each variable that is used in the block and has been declared outside the block.
• The value of a variable with private scope is unspecified at the beginning (and after completion) of a parallel or parallel for block.
• The private copy of factor is therefore not initialized to 1.0; each iteration must assign it before use.
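The loop above can be wrapped into a function for experimenting with different values of n and thread_count. This is only a sketch: the name estimate_pi and its signature are hypothetical, not part of the original slides.

```c
#include <assert.h>

/* Hypothetical wrapper around the loop above: sums the first n terms of
   the series. factor is private and assigned in every iteration, so no
   iteration depends on another one. */
double estimate_pi(long n, int thread_count) {
    double factor;
    double sum = 0.0;
    long k;
    assert(n > 0 && thread_count > 0);
#   pragma omp parallel for num_threads(thread_count) \
        default(none) reduction(+: sum) private(k, factor) shared(n)
    for (k = 0; k < n; k++) {
        factor = (k % 2 == 0) ? 1.0 : -1.0;
        sum += factor / (2 * k + 1);
    }
    return 4.0 * sum;
}
```

The series converges slowly: the error after n terms is roughly 1/n, so one million terms give about five correct decimal places.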

Bubble Sort
    for (len=n; len>=2; len--)
        for (i=0; i<len-1; i++)
            if (a[i]>a[i+1]) {
                tmp=a[i];
                a[i]=a[i+1];
                a[i+1]=tmp;
            }
• Can we make it faster?
• Can we parallelize the outer loop?
• Can we parallelize the inner loop?

Odd-Even Sort
Phase   Array before (subscripts 0 1 2 3)   Array after
0       9 7 8 6                             7 9 6 8
1       7 9 6 8                             7 6 9 8
2       7 6 9 8                             6 7 8 9
3       6 7 8 9                             6 7 8 9
Any opportunities for parallelism?

Odd-Even Sort
    void Odd_even_sort (int a[], int n) {
        int phase, i, temp;
        for (phase=0; phase<n; phase++)
            if (phase%2 == 0) { /* Even phase */
                for (i=1; i<n; i+=2)
                    if (a[i-1]>a[i]) {
                        temp=a[i];
                        a[i]=a[i-1];
                        a[i-1]=temp;
                    }
            } else { /* Odd phase */
                for (i=1; i<n-1; i+=2)
                    if (a[i]>a[i+1]) {
                        temp=a[i];
                        a[i]=a[i+1];
                        a[i+1]=temp;
                    }
            }
    }

Odd-Even Sort in OpenMP
Version 1: a new team of threads is forked in every phase.
    for (phase=0; phase<n; phase++) {
        if (phase%2 == 0) { /* Even phase */
    #       pragma omp parallel for num_threads(thread_count) \
                default(none) shared(a, n) private(i, temp)
            for (i=1; i<n; i+=2)
                if (a[i-1]>a[i]) {
                    temp=a[i];
                    a[i]=a[i-1];
                    a[i-1]=temp;
                }
        } else { /* Odd phase */
    #       pragma omp parallel for num_threads(thread_count) \
                default(none) shared(a, n) private(i, temp)
            for (i=1; i<n-1; i+=2)
                if (a[i]>a[i+1]) {
                    temp=a[i];
                    a[i]=a[i+1];
                    a[i+1]=temp;
                }
        }
    }

Odd-Even Sort in OpenMP
Version 2: the team is forked once and reused by the for directives in every phase.
    # pragma omp parallel num_threads(thread_count) \
        default(none) shared(a, n) private(i, temp, phase)
    for (phase=0; phase<n; phase++) {
        if (phase%2 == 0) { /* Even phase */
    #       pragma omp for
            for (i=1; i<n; i+=2)
                if (a[i-1]>a[i]) {
                    temp=a[i];
                    a[i]=a[i-1];
                    a[i-1]=temp;
                }
        } else { /* Odd phase */
    #       pragma omp for
            for (i=1; i<n-1; i+=2)
                if (a[i]>a[i+1]) {
                    temp=a[i];
                    a[i]=a[i+1];
                    a[i+1]=temp;
                }
        }
    }

Data Partitioning
Iterations 0 to 8 divided among Threads 0 to 2:
Block
    Thread 0: 0, 1, 2
    Thread 1: 3, 4, 5
    Thread 2: 6, 7, 8
Cyclic
    Thread 0: 0, 3, 6
    Thread 1: 1, 4, 7
    Thread 2: 2, 5, 8
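The two partitions can be written as index formulas. The sketch below assumes one common convention for block partitioning (thread rank owns the iterations from rank*n/thread_count up to, but not including, (rank+1)*n/thread_count); the function names are hypothetical.

```c
#include <assert.h>

/* First iteration of the block owned by thread `rank` when n iterations
   are split into thread_count contiguous blocks. The block ends just
   before block_first(rank + 1, n, thread_count). */
int block_first(int rank, int n, int thread_count) {
    assert(n >= 0 && thread_count > 0 && rank >= 0 && rank <= thread_count);
    return (int)((long long)rank * n / thread_count);
}

/* Thread that owns iteration i under a cyclic partition. */
int cyclic_owner(int i, int thread_count) {
    assert(i >= 0 && thread_count > 0);
    return i % thread_count;
}
```

With 9 iterations and 3 threads, thread 1 owns iterations 3 to 5 under the block partition and iterations 1, 4 and 7 under the cyclic partition.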

Scheduling Loops
    double Z[N][N];
    …
    sum=0.0;
    for (i=0; i<N; i++)
        sum+=f(i);

    double f(int r) {
        int i;
        double val=0.0;

        for (i=r+1; i<N; i++) {
            val+=sin(Z[r][i]);
        }
        return val;
    }
Load Balancing: the cost of f(r) decreases as r increases, so a block partition gives the first thread the most work.

The schedule clause
    sum=0.0;
    # pragma omp parallel for num_threads(thread_count) \
        reduction(+:sum) schedule(static, 1)
    for (i=0; i<n; i++)
        sum+=f(i);
The second argument of schedule(static, chunksize) is the chunksize. For n=12, t=3:
schedule(static, 1)
    Thread 0: 0, 3, 6, 9
    Thread 1: 1, 4, 7, 10
    Thread 2: 2, 5, 8, 11
schedule(static, 2)
    Thread 0: 0, 1, 6, 7
    Thread 1: 2, 3, 8, 9
    Thread 2: 4, 5, 10, 11
schedule(static, 4)
    Thread 0: 0, 1, 2, 3
    Thread 1: 4, 5, 6, 7
    Thread 2: 8, 9, 10, 11
A block partition corresponds to schedule(static, total_iterations/thread_count).
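The assignments listed above follow from dealing out chunks of chunksize consecutive iterations to the threads in round-robin order. A small sketch of that rule, using a hypothetical helper name:

```c
#include <assert.h>

/* Thread that executes iteration i under schedule(static, chunksize)
   with thread_count threads: chunk number i/chunksize is assigned to
   the threads in round-robin order. */
int static_owner(int i, int chunksize, int thread_count) {
    assert(i >= 0 && chunksize > 0 && thread_count > 0);
    return (i / chunksize) % thread_count;
}
```

For n=12 and t=3 this gives, for example, iteration 7 to thread 1 with chunksize 1, iteration 6 to thread 0 with chunksize 2, and iteration 8 to thread 2 with chunksize 4.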

The dynamic and guided Types
• In a dynamic schedule:
– Iterations are broken into chunks of chunksize consecutive iterations.
– Default chunksize value: 1
– Each thread executes a chunk.
– When a thread finishes a chunk, it requests another one.
• In a guided schedule:
– Each thread executes a chunk.
– When a thread finishes a chunk, it requests another one.
– As chunks are completed, the size of the new chunks decreases.
– The size of a new chunk is approximately equal to the number of iterations remaining divided by the number of threads.
– The size of chunks decreases down to chunksize or 1 (default).

Example of guided Schedule
Thread   Chunk        Size of Chunk   Remaining Iterations
0        1-5000       5000            4999
1        5001-7500    2500            2499
1        7501-8750    1250            1249
1        8751-9375    625             624
0        9376-9687    312             312
1        9688-9843    156             156
0        9844-9921    78              78
1        9922-9960    39              39
1        9961-9980    20              19
1        9981-9990    10              9
1        9991-9995    5               4
0        9996-9997    2               2
1        9998-9998    1               1
0        9999-9999    1               0
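The chunk sizes in the table are consistent with taking the remaining iterations divided by the number of threads, rounded up, for 9999 iterations and 2 threads. The sketch below models that rule; it is an illustrative model with a hypothetical function name, and real OpenMP runtimes may size guided chunks somewhat differently.

```c
#include <assert.h>

/* Model of a guided schedule: each new chunk takes
   ceil(remaining / thread_count) iterations, but never fewer than
   min_chunk unless fewer iterations remain. Stores the chunk sizes in
   sizes[] (up to max_chunks entries) and returns the number of chunks. */
int guided_chunks(int n, int thread_count, int min_chunk,
                  int sizes[], int max_chunks) {
    int remaining = n;
    int count = 0;
    assert(n >= 0 && thread_count > 0 && min_chunk > 0);
    while (remaining > 0 && count < max_chunks) {
        int chunk = (remaining + thread_count - 1) / thread_count;
        if (chunk < min_chunk) chunk = min_chunk;
        if (chunk > remaining) chunk = remaining;
        sizes[count++] = chunk;
        remaining -= chunk;
    }
    return count;
}
```

Calling guided_chunks(9999, 2, 1, sizes, 64) yields the 14 chunk sizes of the table, from 5000 down to 1. Which thread receives each chunk depends on run-time timing and is not modelled here.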

Which schedule?
• The optimal schedule depends on:
– The type of problem
– The number of iterations
– The number of threads
• Overhead
– guided > dynamic > static
– If you are getting satisfactory results (e.g., close to the theoretical maximum speedup) without a schedule clause, go no further.
• The Cost of Iterations
– If it is roughly the same, use the default schedule.
– If it decreases or increases linearly as the loop executes, a static schedule with small chunksize values will be good.
– If it cannot be determined in advance, try to explore different options.

Performance Issue
Matrix-vector multiplication: y = A × x
    # pragma omp parallel for num_threads(thread_count) \
        default(none) private(i,j) shared(A, x, y, m, n)
    for(i=0; i<m; i++) {
        y[i]=0.0;
        for(j=0; j<n; j++)
            y[i]+=A[i][j]*x[j];
    }

Performance Issue
                      Matrix Dimension
Number of   8,000,000 x 8       8,000 x 8,000       8 x 8,000,000
Threads     Time   Efficiency   Time   Efficiency   Time   Efficiency
1           0.322  1.000        0.264  1.000        0.333  1.000
2           0.219  0.735        0.189  0.698        0.300  0.555
4           0.141  0.571        0.119  0.555        0.303  0.275
Causes: Cache Miss, False Sharing

Performance Issue
• 8,000,000-by-8
– y has 8,000,000 elements: potentially a large number of write misses.
• 8-by-8,000,000
– x has 8,000,000 elements: potentially a large number of read misses.
– y has 8 elements (8 doubles), which could be stored in the same cache line (64 bytes).
– Potentially serious false sharing effect for multiple processors.
• 8000-by-8000
– y has 8,000 elements (8,000 doubles).
– Thread 2: 4000 to 5999, Thread 3: 6000 to 7999
– Only a cache line such as {y[5996], y[5997], y[5998], y[5999], y[6000], y[6001], y[6002], y[6003]} is shared by two threads.
– A significant false sharing effect is highly unlikely.
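One common way to reduce the number of writes to the shared vector y, and with it the exposure to false sharing, is to accumulate each row in a private scalar and store it once. The sketch below is an illustration under that assumption, not the code that produced the timings above; it stores A row by row in a one-dimensional array so that the function is self-contained, and the name mat_vec is hypothetical.

```c
#include <assert.h>

/* y = A*x for an m-by-n matrix stored row by row in A[i*n + j].
   Each row is accumulated in the private variable row_sum, so every
   element of the shared array y is written exactly once. */
void mat_vec(const double A[], const double x[], double y[],
             int m, int n, int thread_count) {
    int i, j;
    assert(m >= 0 && n >= 0 && thread_count > 0);
#   pragma omp parallel for num_threads(thread_count) private(j)
    for (i = 0; i < m; i++) {
        double row_sum = 0.0;      /* private: declared inside the loop */
        for (j = 0; j < n; j++)
            row_sum += A[i*n + j] * x[j];
        y[i] = row_sum;            /* single write per row */
    }
}
```

Whether this helps in a particular case depends on the compiler and the hardware, so it is worth timing both versions.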

Thread Safety
• How to generate random numbers in C?
– First, call srand() with an integer seed.
– Second, call rand() to create a sequence of random numbers.
• Pseudorandom Number Generator (PRNG)
\[ X_{n+1} = (a X_n + c) \bmod m \]
• Is it thread safe?
– Can it be simultaneously executed by multiple threads without causing problems?
– Shared State: the current value X_n is shared by all callers.
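A generator of this form becomes safe to call from several threads when the state is passed in by the caller instead of being hidden in shared storage. The sketch below uses the widely seen constants a = 1103515245, c = 12345 and m = 2^31 purely as an illustrative assumption; lcg_next is a hypothetical name and is not the implementation of rand() in any particular library.

```c
#include <assert.h>

/* One step of X_{n+1} = (a*X_n + c) mod m with caller-owned state.
   Constants are assumed for illustration: a = 1103515245, c = 12345,
   m = 2^31. Unsigned overflow wraps modulo a power of two, which does
   not change the result modulo 2^31. */
unsigned long lcg_next(unsigned long *state) {
    assert(state != 0);
    *state = (1103515245UL * (*state) + 12345UL) % 2147483648UL;
    return *state;
}
```

Each thread keeps its own state variable (seeded differently, for example from its rank), so no two threads ever update the same X_n.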

Foster’s Methodology
• Partitioning
– Divide the computation and the data into small tasks.
– Identify tasks that can be executed in parallel.
• Communication
– Determine what communication needs to be carried out.
– Local Communication vs. Global Communication
• Agglomeration
– Group tasks into larger tasks.
– Reduce communication.
– Task Dependence
• Mapping
– Assign the composite tasks to processes/threads.

The n-body Problem
• To predict the motion of a group of objects that interact with each other gravitationally over a period of time.
– Inputs: Mass, Position and Velocity
• Astrophysicist
– The positions and velocities of a collection of stars
• Chemist
– The positions and velocities of a collection of molecules
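
Newton’s Law
Force on particle q due to particle k:
\[ f_{qk}(t) = -\frac{G m_q m_k}{\left|s_q(t)-s_k(t)\right|^3}\left[s_q(t)-s_k(t)\right] \]
Total force on particle q:
\[ F_q(t) = \sum_{k=0,\,k\neq q}^{n-1} f_{qk}(t) = -G m_q \sum_{k=0,\,k\neq q}^{n-1} \frac{m_k}{\left|s_q(t)-s_k(t)\right|^3}\left[s_q(t)-s_k(t)\right] \]
Acceleration of particle q:
\[ a_q(t) = \frac{F_q(t)}{m_q} = -G \sum_{k=0,\,k\neq q}^{n-1} \frac{m_k}{\left|s_q(t)-s_k(t)\right|^3}\left[s_q(t)-s_k(t)\right] \]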

The Basic Algorithm
    Get input data;
    for each timestep {
        if (timestep output)
            Print positions and velocities of particles;
        for each particle q
            Compute total force on q;
        for each particle q
            Compute position and velocity of q;
    }

Computing the total force on each particle:
    for each particle q {
        forces[q][0]=forces[q][1]=0;
        for each particle k!=q {
            x_diff=pos[q][0]-pos[k][0];
            y_diff=pos[q][1]-pos[k][1];
            dist=sqrt(x_diff*x_diff+y_diff*y_diff);
            dist_cubed=dist*dist*dist;
            forces[q][0]-=G*masses[q]*masses[k]/dist_cubed*x_diff;
            forces[q][1]-=G*masses[q]*masses[k]/dist_cubed*y_diff;
        }
    }

Newton’s 3rd Law of Motion
[Figure: pairs of equal and opposite forces between particles, f38 = -f83 and f58 = -f85; figure labels: n=12, q=8, r=3]

The Reduced Algorithm
    for each particle q
        forces[q][0]=forces[q][1]=0;

    for each particle q {
        for each particle k>q {
            x_diff=pos[q][0]-pos[k][0];
            y_diff=pos[q][1]-pos[k][1];
            dist=sqrt(x_diff*x_diff+y_diff*y_diff);
            dist_cubed=dist*dist*dist;
            force_qk[0]=-G*masses[q]*masses[k]/dist_cubed*x_diff;
            force_qk[1]=-G*masses[q]*masses[k]/dist_cubed*y_diff;

            forces[q][0]+=force_qk[0];
            forces[q][1]+=force_qk[1];
            forces[k][0]-=force_qk[0];
            forces[k][1]-=force_qk[1];
        }
    }

Euler Method
\[ y(t_0+\Delta t) \approx y(t_0) + y'(t_0)\,(t_0+\Delta t-t_0) = y(t_0) + \Delta t\, y'(t_0) \]
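As a worked illustration of the formula, the sketch below applies repeated Euler steps to an ordinary differential equation y' = f(t, y). The function names and the example right-hand side y' = y are assumptions chosen for illustration.

```c
#include <assert.h>

/* Advances y' = f(t, y) from (t0, y0) by `steps` Euler steps of size dt. */
double euler(double (*f)(double, double), double t0, double y0,
             double dt, int steps) {
    double t = t0;
    double y = y0;
    int i;
    assert(steps >= 0);
    for (i = 0; i < steps; i++) {
        y += dt * f(t, y);    /* y(t + dt) ~ y(t) + dt * y'(t) */
        t += dt;
    }
    return y;
}

/* Example right-hand side, assumed for illustration: y' = y. */
double growth(double t, double y) {
    (void)t;                  /* this equation does not depend on t */
    return y;
}
```

With y(0) = 1, dt = 0.001 and 1000 steps, euler(growth, 0.0, 1.0, 0.001, 1000) returns a value slightly below e, the exact solution at t = 1; halving dt roughly halves the error.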
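
Position and Velocity
\[ s_q(\Delta t) \approx s_q(0) + \Delta t\, s_q'(0) = s_q(0) + \Delta t\, v_q(0) \]
\[ v_q(\Delta t) \approx v_q(0) + \Delta t\, v_q'(0) = v_q(0) + \Delta t\, a_q(0) = v_q(0) + \Delta t\, \frac{F_q(0)}{m_q} \]
\[ s_q(2\Delta t) \approx s_q(\Delta t) + \Delta t\, s_q'(\Delta t) = s_q(\Delta t) + \Delta t\, v_q(\Delta t) \]
\[ v_q(2\Delta t) \approx v_q(\Delta t) + \Delta t\, v_q'(\Delta t) = v_q(\Delta t) + \Delta t\, a_q(\Delta t) = v_q(\Delta t) + \Delta t\, \frac{F_q(\Delta t)}{m_q} \]
    for each particle q {
        pos[q][0]+=delta_t*vel[q][0];
        pos[q][1]+=delta_t*vel[q][1];
        vel[q][0]+=delta_t*forces[q][0]/masses[q];
        vel[q][1]+=delta_t*forces[q][1]/masses[q];
    }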

Communications: Basic
[Figure: communication among the tasks s_q(t), v_q(t), F_q(t), s_r(t), v_r(t), F_r(t) at time t and the tasks s_q(t + Δt), v_q(t + Δt), F_q(t + Δt), s_r(t + Δt), v_r(t + Δt), F_r(t + Δt) at time t + Δt]

Agglomeration: Basic
[Figure: composite tasks (s_q, v_q, F_q) and (s_r, v_r, F_r) at times t and t + Δt; the two tasks exchange s_q and s_r]

Agglomeration: Reduced
[Figure: composite tasks (s_q, v_q, F_q) and (s_r, v_r, F_r) at times t and t + Δt, for q<r; the two tasks exchange f_qr and s_r]

Parallelizing the Basic Solver
    # pragma omp parallel
    for each timestep {
        if (timestep output) {
    #       pragma omp single nowait
            Print positions and velocities of particles;
        }
    #   pragma omp for
        for each particle q
            Compute total force on q;
    #   pragma omp for
        for each particle q
            Compute position and velocity of q;
    }
Race Conditions?

Parallelizing the Reduced Solver
    # pragma omp for
    for each particle q
        forces[q][0]=forces[q][1]=0;

    # pragma omp for
    for each particle q {
        for each particle k>q {
            x_diff=pos[q][0]-pos[k][0];
            y_diff=pos[q][1]-pos[k][1];
            dist=sqrt(x_diff*x_diff+y_diff*y_diff);
            dist_cubed=dist*dist*dist;
            force_qk[0]=-G*masses[q]*masses[k]/dist_cubed*x_diff;
            force_qk[1]=-G*masses[q]*masses[k]/dist_cubed*y_diff;

            forces[q][0]+=force_qk[0];
            forces[q][1]+=force_qk[1];
            forces[k][0]-=force_qk[0];
            forces[k][1]-=force_qk[1];
        }
    }

Does it work properly?
• Consider 2 threads and 4 particles.
• Thread 1 is assigned particle 0 and particle 1.
• Thread 2 is assigned particle 2 and particle 3.
• F3=-f03-f13-f23
• Who will calculate f03 and f13?
• Who will calculate f23?
• Any race conditions?

Thread Contributions
3 Threads, 6 Particles, Block Partition
                    Contributions of Threads
Thread   Particle   Thread 0                 Thread 1         Thread 2
0        0          f01+f02+f03+f04+f05      0                0
0        1          -f01+f12+f13+f14+f15     0                0
1        2          -f02-f12                 f23+f24+f25      0
1        3          -f03-f13                 -f23+f34+f35     0
2        4          -f04-f14                 -f24-f34         f45
2        5          -f05-f15                 -f25-f35         -f45

Thread Contributions
3 Threads, 6 Particles, Cyclic Partition
                    Contributions of Threads
Thread   Particle   Thread 0                 Thread 1             Thread 2
0        0          f01+f02+f03+f04+f05      0                    0
1        1          -f01                     f12+f13+f14+f15      0
2        2          -f02                     -f12                 f23+f24+f25
0        3          -f03+f34+f35             -f13                 -f23
1        4          -f04-f34                 -f14+f45             -f24
2        5          -f05-f35                 -f15-f45             -f25

First Phase
    # pragma omp for
    for each particle q {
        for each particle k>q {
            x_diff=pos[q][0]-pos[k][0];
            y_diff=pos[q][1]-pos[k][1];
            dist=sqrt(x_diff*x_diff+y_diff*y_diff);
            dist_cubed=dist*dist*dist;
            force_qk[0]=-G*masses[q]*masses[k]/dist_cubed*x_diff;
            force_qk[1]=-G*masses[q]*masses[k]/dist_cubed*y_diff;

            loc_forces[my_rank][q][0]+=force_qk[0];
            loc_forces[my_rank][q][1]+=force_qk[1];
            loc_forces[my_rank][k][0]-=force_qk[0];
            loc_forces[my_rank][k][1]-=force_qk[1];
        }
    }

Second Phase
    # pragma omp for
    for (q=0; q<n; q++) {
        forces[q][0]=forces[q][1]=0;
        for(thread=0; thread<thread_count; thread++) {
            forces[q][0]+=loc_forces[thread][q][0];
            forces[q][1]+=loc_forces[thread][q][1];
        }
    }
Race Conditions?
• In the first phase, each thread carries out the same calculations as before but the values are stored in its own array of forces (loc_forces).
• In the second phase, the thread that has been assigned particle q will add the contributions that have been computed by different threads.

Evaluating the OpenMP Codes
• In the reduced code:
– Loop 1: Initialization of the loc_forces array
– Loop 2: The first phase of the computation of forces
– Loop 3: The second phase of the computation of forces
– Loop 4: The updating of positions and velocities
• Which schedule should be used?
Run times:
Threads   Basic   Reduced,   Reduced,        Reduced,
                  Default    Forces Cyclic   All Cyclic
1         7.71    3.90       3.90            3.90
2         3.87    2.94       1.98            2.01
4         1.95    1.73       1.01            1.08
8         0.99    0.95       0.54            0.61

Review
• What are the major differences between MPI and OpenMP?
• What is the scope of a variable?
• What is a reduction variable?
• How to ensure mutual exclusion in a critical section?
• What are the common loop scheduling options?
• What is a thread safe function?
• What factors may affect the performance of an OpenMP program?