Open Multiprocessing

Dr. Bo Yuan

E-mail: [email protected]

Note on Parallel Programming

• An incorrect program may produce correct results.

– The order of execution of processes/threads is unpredictable.

– May depend on your luck!

• A program that always produces correct results may not make sense.

– The outputs of a program are just part of the story.

– Efficiency matters!

OpenMP

• An API for shared memory multiprocessing (parallel) programming in C, C++ and Fortran.

– Supports multiple platforms (processor architectures and operating systems).

– Higher level implementation (a block of code that should be executed in parallel).

• A method of parallelizing whereby a master thread forks a number of slave threads and a task is divided among them.

• Based on preprocessor directives (Pragma)

– Requires compiler support.

– omp.h

• References

– http://openmp.org/

– https://computing.llnl.gov/tutorials/openMP/

– http://supercomputingblog.com/openmp/

Hello, World!

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

void Hello(void);

int main(int argc, char* argv[]) {
   /* Get the number of threads from command line */
   int thread_count=strtol(argv[1], NULL, 10);

#  pragma omp parallel num_threads(thread_count)
   Hello();

   return 0;
}

void Hello(void) {
   int my_rank=omp_get_thread_num();
   int thread_count=omp_get_num_threads();

   printf("Hello from thread %d of %d\n", my_rank, thread_count);
}

Definitions

#  pragma omp parallel [clauses]
   { code_block }

• Clause: text to modify the directive.

• Thread team = master + slaves.

• There is an implicit barrier at the end of the code block.

Error Checking

#ifdef _OPENMP
#  include <omp.h>
#endif

#ifdef _OPENMP
   int my_rank=omp_get_thread_num();
   int thread_count=omp_get_num_threads();
#else
   int my_rank=0;
   int thread_count=1;
#endif
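The _OPENMP guard can be wrapped in a small helper so that the same source builds with or without OpenMP support. This is a minimal sketch; the names get_thread_info and hello_guarded are hypothetical and do not come from the slides or from OpenMP.

```c
#include <stdio.h>

#ifdef _OPENMP
#  include <omp.h>
#endif

/* Hypothetical helper: reports the calling thread's rank and the team size.
   Falls back to rank 0 of 1 when the compiler does not define _OPENMP. */
void get_thread_info(int* my_rank, int* thread_count) {
#ifdef _OPENMP
   *my_rank=omp_get_thread_num();
   *thread_count=omp_get_num_threads();
#else
   *my_rank=0;
   *thread_count=1;
#endif
}

/* Prints one greeting per thread in the team. */
void hello_guarded(void) {
#  pragma omp parallel
   {
      int my_rank, thread_count;
      get_thread_info(&my_rank, &thread_count);
      printf("Hello from thread %d of %d\n", my_rank, thread_count);
   }
}
```

Called outside a parallel block, get_thread_info reports rank 0 in a team of 1 in both builds, because only the initial thread is running there.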

The Trapezoidal Rule

/* Input: a, b, n */
h=(b-a)/n;
approx=(f(a)+f(b))/2.0;
for (i=1; i<=n-1; i++) {
   x_i=a+i*h;
   approx+=f(x_i);
}
approx=h*approx;

[Figure: trapezoids assigned to threads (Thread 0, Thread 2)]

#  pragma omp critical
   global_result+=my_result;

• Shared memory, shared variables, race condition: the update of the shared variable global_result must be protected.

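The critical Directive

#  pragma omp critical
   y=f(x);
   ...
double f(double x) {
#  pragma omp critical
   z=g(x);
   ...
}

• Unnamed critical sections are treated as one and the same section, so they cannot be executed simultaneously. The nested critical inside f() therefore leads to deadlock.

#  pragma omp critical(one)
   y=f(x);
   ...
double f(double x) {
#  pragma omp critical(two)
   z=g(x);
   ...
}

• Critical sections with different names do not block each other, which avoids the deadlock.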

The atomic Directive

#  pragma omp atomic
   x <op>=<expression>;

<op> can be one of the binary operators: +, *, -, /, &, ^, |, <<, >>

The protected statement can also be one of: x++, ++x, x--, --x

• Higher performance than the critical directive.

• Only a single C assignment statement is protected.

• Only the load and store of x are protected.

• <expression> must not reference x.

An atomic update and a critical section do not exclude each other:

#  pragma omp atomic
   x+=f(y);

#  pragma omp critical
   x=g(x);

Can be executed simultaneously!
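A minimal sketch of the atomic directive protecting a shared counter. The function name count_multiples_of_three and the counting task are assumptions chosen for illustration only.

```c
/* Counts the multiples of 3 in [0, n) using one shared counter. */
long count_multiples_of_three(long n) {
   long count=0;
   long i;

#  pragma omp parallel for
   for (i=0; i<n; i++) {
      if (i%3 == 0) {
         /* Only this single update of count is protected. */
#        pragma omp atomic
         count++;
      }
   }
   return count;
}
```

For a plain sum like this a reduction clause would usually be the better choice; atomic is shown here because the update has exactly the x++ form the directive supports.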

Locks

/* Executed by one thread */
Initialize the lock data structure;
...
/* Executed by multiple threads */
Attempt to lock or set the lock data structure;
Critical section;
Unlock or unset the lock data structure;
...
/* Executed by one thread */
Destroy the lock data structure;

void omp_init_lock(omp_lock_t* lock_p);
void omp_set_lock(omp_lock_t* lock_p);
void omp_unset_lock(omp_lock_t* lock_p);
void omp_destroy_lock(omp_lock_t* lock_p);
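The lock life cycle above can be sketched as follows. The function locked_sum is hypothetical, and the serial stand-ins in the #else branch are an assumption added only so that the sketch also builds without OpenMP support.

```c
#ifdef _OPENMP
#  include <omp.h>
#else
/* Serial stand-ins (assumption) so the sketch also builds without OpenMP. */
typedef int omp_lock_t;
static void omp_init_lock(omp_lock_t* lock_p) { *lock_p=0; }
static void omp_set_lock(omp_lock_t* lock_p) { *lock_p=1; }
static void omp_unset_lock(omp_lock_t* lock_p) { *lock_p=0; }
static void omp_destroy_lock(omp_lock_t* lock_p) { (void) lock_p; }
#endif

/* Adds the integers 1..n into a shared total, guarding each update with a lock. */
long locked_sum(int n) {
   long total=0;
   int i;
   omp_lock_t lock;

   omp_init_lock(&lock);          /* Executed by one thread */

#  pragma omp parallel for
   for (i=1; i<=n; i++) {
      omp_set_lock(&lock);        /* Blocks until the lock is available */
      total+=i;                   /* Critical section */
      omp_unset_lock(&lock);
   }

   omp_destroy_lock(&lock);       /* Executed by one thread */
   return total;
}
```

Locking on every iteration serializes the updates, so this only illustrates the API; it is not meant as an efficient way to compute a sum.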

Trapezoidal Rule in OpenMP

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

void Trap(double a, double b, int n, double* global_result_p);

int main(int argc, char* argv[]) {
   double global_result=0.0;
   double a, b;
   int n, thread_count;

   thread_count=strtol(argv[1], NULL, 10);
   printf("Enter a, b, and n\n");
   scanf("%lf %lf %d", &a, &b, &n);
#  pragma omp parallel num_threads(thread_count)
   Trap(a, b, n, &global_result);

   printf("With n=%d trapezoids, our estimate\n", n);
   printf("of the integral from %f to %f = %.15e\n", a, b, global_result);
   return 0;
}

Trapezoidal Rule in OpenMP

void Trap(double a, double b, int n, double* global_result_p) {
   double h, x, my_result;
   double local_a, local_b;
   int i, local_n;
   int my_rank=omp_get_thread_num();
   int thread_count=omp_get_num_threads();

   h=(b-a)/n;
   local_n=n/thread_count;
   local_a=a+my_rank*local_n*h;
   local_b=local_a+local_n*h;
   my_result=(f(local_a)+f(local_b))/2.0;
   for(i=1; i<=local_n-1; i++) {
      x=local_a+i*h;
      my_result+=f(x);
   }
   my_result=my_result*h;

#  pragma omp critical
   *global_result_p+=my_result;
}

Scope of Variables

Private Scope

• Only accessible by a single thread

• Declared in the code block

Shared Scope

• Accessible by all threads in a team

• Declared before a parallel directive

• a, b, n

• global_result

• thread_count

• my_rank

• my_result

• global_result_p

• *global_result_p

In serial programming:

• Function-wide scope

• File-wide scope

Another Trap Function

double Local_trap(double a, double b, int n);

global_result=0.0;
#  pragma omp parallel num_threads(thread_count)
   {
      /* Local_trap is called inside the critical section,
         so the threads execute it one at a time. */
#     pragma omp critical
      global_result+=Local_trap(a, b, n);
   }

global_result=0.0;
#  pragma omp parallel num_threads(thread_count)
   {
      double my_result=0.0;   /* Private */
      my_result=Local_trap(a, b, n);
      /* Only the update of the shared variable is serialized. */
#     pragma omp critical
      global_result+=my_result;
   }

The Reduction Clause

• Reduction: A computation (binary operation) that repeatedly applies the same reduction operator (e.g., addition or multiplication) to a sequence of operands in order to get a single result.

reduction(<operator>: <variable list>)

global_result=0.0;
#  pragma omp parallel num_threads(thread_count) \
      reduction(+: global_result)
   global_result+=Local_trap(a, b, n);

• Note:

– The reduction variable itself is shared.

– A private variable is created for each thread in the team.

– The private variables are initialized to 0 for the addition operator.

The parallel for Directive

h=(b-a)/n;
approx=(f(a)+f(b))/2.0;
for (i=1; i<=n-1; i++) {
   approx+=f(a+i*h);
}
approx=h*approx;

h=(b-a)/n;
approx=(f(a)+f(b))/2.0;
#  pragma omp parallel for num_threads(thread_count) \
      reduction(+: approx)
   for (i=1; i<=n-1; i++) {
      approx+=f(a+i*h);
   }
approx=h*approx;

• The code block must be a for loop.

• Iterations of the for loop are divided among threads.

• approx is a reduction variable.

• i is a private variable.
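The parallel for version can be packaged as a complete function. A minimal sketch, assuming the integrand f(x)=x*x (the slides leave f unspecified) and a hypothetical function name trap_parallel_for:

```c
/* Example integrand (assumption): f(x)=x^2. */
static double f(double x) {
   return x*x;
}

/* Trapezoidal-rule estimate of the integral of f over [a, b]
   using n trapezoids. */
double trap_parallel_for(double a, double b, int n) {
   double h=(b-a)/n;
   double approx=(f(a)+f(b))/2.0;
   int i;

   /* Each thread sums into a private copy of approx; the copies are
      added to the original value when the loop ends. */
#  pragma omp parallel for reduction(+: approx)
   for (i=1; i<=n-1; i++)
      approx+=f(a+i*h);

   return h*approx;
}
```

With a=0, b=3 and n=1024 the estimate should be close to the exact integral, 9.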

The parallel for Directive

• Sounds like a truly wonderful approach to parallelizing serial programs.

• Does not work with while or do-while loops.

– How about converting them into for loops?

• The number of iterations must be determined in advance.

for (; ;) {
   ...
}

for (i=0; i<n; i++) {
   if (...) break;
   ...
}

• In a nested loop, only the variable of the parallelized outer loop is private by default; the inner loop variable needs private(y).

int x, y;
#  pragma omp parallel for num_threads(thread_count) private(y)
   for(x=0; x < width; x++) {
      for(y=0; y < height; y++) {
         finalImage[x][y] = f(x, y);
      }
   }

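Estimating π

\[ \pi = 4\left[1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \cdots\right] = 4\sum_{k=0}^{\infty}\frac{(-1)^k}{2k+1} \]

double factor=1.0;
double sum=0.0;
for(k=0; k<n; k++) {
   sum+=factor/(2*k+1);
   factor=-factor;
}
pi_approx=4.0*sum;

Does the following parallel version work?

double factor=1.0;
double sum=0.0;
#  pragma omp parallel for \
      num_threads(thread_count) \
      reduction(+: sum)
   for(k=0; k<n; k++) {
      sum+=factor/(2*k+1);
      factor=-factor;
   }
pi_approx=4.0*sum;

• Loop-carried dependence: the value of factor in one iteration depends on the previous iteration.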

Estimating π

if(k%2 == 0)
   factor=1.0;
else
   factor=-1.0;
sum+=factor/(2*k+1);

factor=(k%2 == 0)?1.0: -1.0;
sum+=factor/(2*k+1);

double factor=1.0;
double sum=0.0;
#  pragma omp parallel for num_threads(thread_count) \
      reduction(+: sum) private(factor)
   for(k=0; k<n; k++) {
      if(k%2 == 0)
         factor=1.0;
      else
         factor=-1.0;
      sum+=factor/(2*k+1);
   }
pi_approx=4.0*sum;
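The corrected loop can be wrapped in a function for experimentation. A minimal sketch, with the hypothetical name estimate_pi; declaring factor inside the loop body makes it private without a private clause, which is a design choice of this sketch rather than the slides' form.

```c
/* Estimates pi from the first n terms of the alternating series. */
double estimate_pi(long n) {
   double sum=0.0;
   long k;

#  pragma omp parallel for reduction(+: sum)
   for (k=0; k<n; k++) {
      /* The sign is computed from k alone, so no iteration depends
         on a value produced by another iteration. */
      double factor=(k%2 == 0) ? 1.0 : -1.0;
      sum+=factor/(2*k+1);
   }
   return 4.0*sum;
}
```

The series converges slowly: the error after n terms is roughly 1/n, so about a million terms are needed for five correct decimal places.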

Scope Matters

double factor=1.0;
double sum=0.0;
#  pragma omp parallel for num_threads(thread_count) \
      default(none) reduction(+: sum) private(k, factor) shared(n)
   for(k=0; k<n; k++) {
      if(k%2 == 0)
         factor=1.0;
      else
         factor=-1.0;
      sum+=factor/(2*k+1);
   }
pi_approx=4.0*sum;

• With the default(none) clause, we need to specify the scope of each variable that we use in the block that has been declared outside the block.

• The value of a variable with private scope is unspecified at the beginning (and after completion) of a parallel or parallel for block.

• The initial value of the private factor is therefore unspecified; it is assigned inside the loop before it is used.

Bubble Sort

for (len=n; len>=2; len--)
   for (i=0; i<len-1; i++)
      if (a[i]>a[i+1]) {
         tmp=a[i];
         a[i]=a[i+1];
         a[i+1]=tmp;
      }

• Can we make it faster?

• Can we parallelize the outer loop?

• Can we parallelize the inner loop?

Odd-Even Sort

Phase   Subscript in Array
        0   1   2   3
0       9   7   8   6
        7   9   6   8
1       7   9   6   8
        7   6   9   8
2       7   6   9   8
        6   7   8   9
3       6   7   8   9
        6   7   8   9

Any opportunities for parallelism?

Odd-Even Sort

void Odd_even_sort (int a[], int n) {
   int phase, i, temp;
   for (phase=0; phase<n; phase++)
      if (phase%2 == 0) {   /* Even phase */
         for (i=1; i<n; i+=2)
            if (a[i-1]>a[i]) {
               temp=a[i];
               a[i]=a[i-1];
               a[i-1]=temp;
            }
      } else {              /* Odd phase */
         for (i=1; i<n-1; i+=2)
            if (a[i]>a[i+1]) {
               temp=a[i];
               a[i]=a[i+1];
               a[i+1]=temp;
            }
      }
}

Odd-Even Sort in OpenMP

for (phase=0; phase<n; phase++) {
   if (phase%2 == 0) {   /* Even phase */
#     pragma omp parallel for num_threads(thread_count) \
         default(none) shared(a, n) private(i, temp)
      for (i=1; i<n; i+=2)
         if (a[i-1]>a[i]) {
            temp=a[i];
            a[i]=a[i-1];
            a[i-1]=temp;
         }
   } else {              /* Odd phase */
#     pragma omp parallel for num_threads(thread_count) \
         default(none) shared(a, n) private(i, temp)
      for (i=1; i<n-1; i+=2)
         if (a[i]>a[i+1]) {
            temp=a[i];
            a[i]=a[i+1];
            a[i+1]=temp;
         }
   }
}

Odd-Even Sort in OpenMP

#  pragma omp parallel num_threads(thread_count) \
      default(none) shared(a, n) private(i, temp, phase)
   for (phase=0; phase<n; phase++) {
      if (phase%2 == 0) {   /* Even phase */
#        pragma omp for
         for (i=1; i<n; i+=2)
            if (a[i-1]>a[i]) {
               temp=a[i];
               a[i]=a[i-1];
               a[i-1]=temp;
            }
      } else {              /* Odd phase */
#        pragma omp for
         for (i=1; i<n-1; i+=2)
            if (a[i]>a[i+1]) {
               temp=a[i];
               a[i]=a[i+1];
               a[i+1]=temp;
            }
      }
   }
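For testing, the second version (threads forked once, omp for inside the phase loop) can be wrapped in a function. A minimal sketch; the name odd_even_sort is an assumption, and the num_threads clause is omitted so that the runtime picks the team size.

```c
/* Sorts a[0..n-1] in increasing order with odd-even transposition sort. */
void odd_even_sort(int a[], int n) {
   int phase, i, temp;

   /* The team is created once; every thread runs the phase loop. */
#  pragma omp parallel default(none) shared(a, n) private(i, temp, phase)
   for (phase=0; phase<n; phase++) {
      if (phase%2 == 0) {   /* Even phase: pairs (0,1), (2,3), ... */
#        pragma omp for
         for (i=1; i<n; i+=2)
            if (a[i-1]>a[i]) {
               temp=a[i];
               a[i]=a[i-1];
               a[i-1]=temp;
            }
      } else {              /* Odd phase: pairs (1,2), (3,4), ... */
#        pragma omp for
         for (i=1; i<n-1; i+=2)
            if (a[i]>a[i+1]) {
               temp=a[i];
               a[i]=a[i+1];
               a[i+1]=temp;
            }
      }
      /* The implicit barrier after each omp for keeps the phases in step. */
   }
}
```

Within one phase the compared pairs do not overlap, which is why the iterations of each inner loop can be divided among threads safely.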

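Data Partitioning

Block partitioning of iterations 0-8 among threads 0-2:

Thread 0: 0, 1, 2

Thread 1: 3, 4, 5

Thread 2: 6, 7, 8

Cyclic partitioning of iterations 0-8 among threads 0-2:

Thread 0: 0, 3, 6

Thread 1: 1, 4, 7

Thread 2: 2, 5, 8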

Scheduling Loops

double Z[N][N];
...
sum=0.0;
for (i=0; i<N; i++)
   sum+=f(i);

double f(int r) {
   int i;
   double val=0.0;

   for (i=r+1; i<N; i++) {
      val+=sin(Z[r][i]);
   }
   return val;
}

• Load balancing: the cost of f(i) decreases as i grows, so the iterations are not equally expensive.

The schedule clause

sum=0.0;
#  pragma omp parallel for num_threads(thread_count) \
      reduction(+:sum) schedule(static, 1)
   for (i=0; i<n; i++)
      sum+=f(i);

Assignment of iterations for n=12 and t=3 threads; the second argument of schedule is the chunksize.

schedule(static, 1)

Thread 0: 0, 3, 6, 9

Thread 1: 1, 4, 7, 10

Thread 2: 2, 5, 8, 11

schedule(static, 2)

Thread 0: 0, 1, 6, 7

Thread 1: 2, 3, 8, 9

Thread 2: 4, 5, 10, 11

schedule(static, 4)

Thread 0: 0, 1, 2, 3

Thread 1: 4, 5, 6, 7

Thread 2: 8, 9, 10, 11

Block partitioning: schedule(static, total_iterations/thread_count)
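The static assignments listed above follow a simple rule: chunks of chunksize consecutive iterations are dealt out to the threads in round-robin order. A minimal sketch of that rule as plain arithmetic; static_owner is a hypothetical helper, not an OpenMP function.

```c
/* Thread that executes iteration i under schedule(static, chunksize)
   with thread_count threads: chunk number i/chunksize is assigned
   to thread (i/chunksize) mod thread_count. */
int static_owner(int i, int chunksize, int thread_count) {
   return (i/chunksize)%thread_count;
}
```

For example, with chunksize 2 and 3 threads, iterations 6 and 7 form chunk 3, which wraps around to thread 0.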

The dynamic and guided Types

• In a dynamic schedule:

– Iterations are broken into chunks of chunksize consecutive iterations.

– Default chunksize value: 1

– Each thread executes a chunk.

– When a thread finishes a chunk, it requests another one.

• In a guided schedule:

– Each thread executes a chunk.

– When a thread finishes a chunk, it requests another one.

– As chunks are completed, the size of the new chunks decreases.

– The size of a new chunk is approximately equal to the number of iterations remaining divided by the number of threads.

– The size of chunks decreases down to chunksize or 1 (default).

Example of guided Schedule

Thread   Chunk        Size of Chunk   Remaining Iterations
0        1-5000       5000            4999
1        5001-7500    2500            2499
1        7501-8750    1250            1249
1        8751-9375    625             624
0        9376-9687    312             312
1        9688-9843    156             156
0        9844-9921    78              78
1        9922-9960    39              39
1        9961-9980    20              19
1        9981-9990    10              9
1        9991-9995    5               4
0        9996-9997    2               2
1        9998-9998    1               1
0        9999-9999    1               0
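The chunk sizes in the table are consistent with a simple model: each new chunk is the number of remaining iterations divided by the number of threads, rounded up. A minimal sketch of that model, assuming 9999 iterations and 2 threads for the table; guided_chunk_sizes is a hypothetical helper, and real OpenMP runtimes may compute chunk sizes differently.

```c
/* Fills sizes[] with the successive chunk sizes of a modelled guided
   schedule and returns the number of chunks (at most max_chunks).
   Model (assumption): chunk = ceil(remaining/thread_count), but never
   smaller than min_chunk unless fewer iterations remain. */
int guided_chunk_sizes(int iterations, int thread_count, int min_chunk,
                       int sizes[], int max_chunks) {
   int remaining=iterations;
   int count=0;

   while (remaining>0 && count<max_chunks) {
      int chunk=(remaining+thread_count-1)/thread_count;
      if (chunk<min_chunk) chunk=min_chunk;
      if (chunk>remaining) chunk=remaining;
      sizes[count++]=chunk;
      remaining-=chunk;
   }
   return count;
}
```

Which thread receives which chunk is not modelled here, since that depends on the order in which threads finish their work at run time.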

Which schedule?

• The optimal schedule depends on:

– The type of problem

– The number of iterations

– The number of threads

• Overhead

– guided>dynamic>static

– If you are getting satisfactory results (e.g., close to the theoretical maximum speedup) without a schedule clause, go no further.

• The Cost of Iterations

– If it is roughly the same, use the default schedule.

– If it decreases or increases linearly as the loop executes, a static schedule with small chunksize values will be good.

– If it cannot be determined in advance, try to explore different options.

Performance Issue

Matrix-vector multiplication: y = A x

#  pragma omp parallel for num_threads(thread_count) \
      default(none) private(i,j) shared(A, x, y, m, n)
   for(i=0; i<m; i++) {
      y[i]=0.0;
      for(j=0; j<n; j++)
         y[i]+=A[i][j]*x[j];
   }

             Matrix Dimension
Number of    8,000,000 x 8        8,000 x 8,000        8 x 8,000,000
Threads      Time   Efficiency    Time   Efficiency    Time   Efficiency
1            0.322  1.000         0.264  1.000         0.333  1.000
2            0.219  0.735         0.189  0.698         0.300  0.555
4            0.141  0.571         0.119  0.555         0.303  0.275

• Cache Miss

• False Sharing

• 8,000,000-by-8

– y has 8,000,000 elements: potentially large number of write misses.

• 8-by-8,000,000

– x has 8,000,000 elements: potentially large number of read misses.

• 8-by-8,000,000

– y has 8 elements (8 doubles), which could be stored in the same cache line (64 bytes).

– Potentially serious false sharing effect for multiple processors

• 8000-by-8000

– y has 8,000 elements (8,000 doubles).

– Thread 2: 4000 to 5999; Thread 3: 6000 to 7999

– {y[5996], y[5997], y[5998], y[5999], y[6000], y[6001], y[6002], y[6003]}

– The effect of false sharing is highly unlikely.

Thread Safety

• How to generate random numbers in C?

– First, call srand() with an integer seed.

– Second, call rand() to create a sequence of random numbers.

• Pseudorandom Number Generator (PRNG)

\[ X_{n+1} = (aX_n + c) \bmod m \]

– Shared state: the generator keeps the previous value X_n between calls.

• Is it thread safe?

– Can it be simultaneously executed by multiple threads without causing problems?
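One common way to avoid shared generator state is to let each caller own its state and pass it in explicitly. A minimal sketch of the linear congruential recurrence in that style; the function name lcg_next and the constants a, c and m are assumptions chosen for illustration, not values taken from the slides.

```c
/* One step of X_{n+1} = (a*X_n + c) mod m.
   The state lives in the caller, so threads that use different
   state variables never interfere with each other. */
unsigned int lcg_next(unsigned int* state) {
   const unsigned long long a=1103515245ULL;   /* assumed multiplier */
   const unsigned long long c=12345ULL;        /* assumed increment */
   const unsigned long long m=2147483648ULL;   /* assumed modulus, 2^31 */

   *state=(unsigned int) ((a*(*state)+c)%m);
   return *state;
}
```

In a parallel block each thread would declare its own state variable, for example seeded from its rank, and call lcg_next on it.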

Foster’s Methodology

• Partitioning

– Divide the computation and the data into small tasks.

– Identify tasks that can be executed in parallel.

• Communication

– Determine what communication needs to be carried out.

– Local Communication vs. Global Communication

• Agglomeration

– Group tasks into larger tasks.

– Reduce communication.

– Task Dependence

• Mapping

– Assign the composite tasks to processes/threads.

The n-body Problem

• To predict the motion of a group of objects that interact with each other gravitationally over a period of time.

– Inputs: Mass, Position and Velocity

• Astrophysicist

– The positions and velocities of a collection of stars

• Chemist

– The positions and velocities of a collection of molecules

http://en.wikipedia.org/wiki/File:Orbit5.gif

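Newton’s Law

\[ \mathbf{f}_{qk}(t) = -\frac{G m_q m_k}{\left|\mathbf{s}_q(t)-\mathbf{s}_k(t)\right|^3}\left[\mathbf{s}_q(t)-\mathbf{s}_k(t)\right] \]

\[ \mathbf{F}_q(t) = \sum_{\substack{k=0 \\ k\neq q}}^{n-1}\mathbf{f}_{qk}(t) = -G m_q \sum_{\substack{k=0 \\ k\neq q}}^{n-1}\frac{m_k}{\left|\mathbf{s}_q(t)-\mathbf{s}_k(t)\right|^3}\left[\mathbf{s}_q(t)-\mathbf{s}_k(t)\right] \]

\[ \mathbf{a}_q(t) = -G \sum_{\substack{k=0 \\ k\neq q}}^{n-1}\frac{m_k}{\left|\mathbf{s}_q(t)-\mathbf{s}_k(t)\right|^3}\left[\mathbf{s}_q(t)-\mathbf{s}_k(t)\right] \]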

The Basic Algorithm

Get input data;
for each timestep {
   if (timestep output)
      Print positions and velocities of particles;
   for each particle q
      Compute total force on q;
   for each particle q
      Compute position and velocity of q;
}

for each particle q {
   forces[q][0]=forces[q][1]=0;
   for each particle k!=q {
      x_diff=pos[q][0]-pos[k][0];
      y_diff=pos[q][1]-pos[k][1];
      dist=sqrt(x_diff*x_diff+y_diff*y_diff);
      dist_cubed=dist*dist*dist;
      forces[q][0]-=G*masses[q]*masses[k]/dist_cubed*x_diff;
      forces[q][1]-=G*masses[q]*masses[k]/dist_cubed*y_diff;
   }
}

Newton’s 3rd Law of Motion

\[ \mathbf{f}_{kq} = -\mathbf{f}_{qk} \]

[Figure: n=12, q=8, r=3; force pairs f38/f83 and f58/f85]

The Reduced Algorithm

for each particle q
   forces[q][0]=forces[q][1]=0;

for each particle q {
   for each particle k>q {
      x_diff=pos[q][0]-pos[k][0];
      y_diff=pos[q][1]-pos[k][1];
      dist=sqrt(x_diff*x_diff+y_diff*y_diff);
      dist_cubed=dist*dist*dist;
      force_qk[0]=-G*masses[q]*masses[k]/dist_cubed*x_diff;
      force_qk[1]=-G*masses[q]*masses[k]/dist_cubed*y_diff;

      forces[q][0]+=force_qk[0];
      forces[q][1]+=force_qk[1];
      forces[k][0]-=force_qk[0];
      forces[k][1]-=force_qk[1];
   }
}

Euler Method

\[ y(t_0+\Delta t) \approx y(t_0) + y'(t_0)(t_0+\Delta t-t_0) = y(t_0) + \Delta t\, y'(t_0) \]

http://en.wikipedia.org/wiki/File:Leonhard_Euler_2.jpg

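Position and Velocity

\[ \mathbf{s}_q(\Delta t) \approx \mathbf{s}_q(0) + \Delta t\,\mathbf{s}_q'(0) = \mathbf{s}_q(0) + \Delta t\,\mathbf{v}_q(0) \]

\[ \mathbf{v}_q(\Delta t) \approx \mathbf{v}_q(0) + \Delta t\,\mathbf{v}_q'(0) = \mathbf{v}_q(0) + \Delta t\,\mathbf{a}_q(0) = \mathbf{v}_q(0) + \Delta t\,\frac{\mathbf{F}_q(0)}{m_q} \]

\[ \mathbf{s}_q(2\Delta t) \approx \mathbf{s}_q(\Delta t) + \Delta t\,\mathbf{s}_q'(\Delta t) = \mathbf{s}_q(\Delta t) + \Delta t\,\mathbf{v}_q(\Delta t) \]

\[ \mathbf{v}_q(2\Delta t) \approx \mathbf{v}_q(\Delta t) + \Delta t\,\mathbf{v}_q'(\Delta t) = \mathbf{v}_q(\Delta t) + \Delta t\,\mathbf{a}_q(\Delta t) = \mathbf{v}_q(\Delta t) + \Delta t\,\frac{\mathbf{F}_q(\Delta t)}{m_q} \]

for each particle q {
   pos[q][0]+=delta_t*vel[q][0];
   pos[q][1]+=delta_t*vel[q][1];
   vel[q][0]+=delta_t*forces[q][0]/masses[q];
   vel[q][1]+=delta_t*forces[q][1]/masses[q];
}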

Communications: Basic

[Figure: s_q(t), v_q(t), s_r(t), v_r(t) and F_q(t), F_r(t) at time t; s_q(t+Δt), v_q(t+Δt), s_r(t+Δt), v_r(t+Δt) and F_q(t+Δt), F_r(t+Δt) at time t+Δt]

Agglomeration: Basic

[Figure: at times t and t+Δt, the tasks for particles q and r each hold (s_q, v_q, F_q) and (s_r, v_r, F_r); the quantities exchanged are s_q and s_r]

Agglomeration: Reduced

[Figure: at times t and t+Δt, the tasks for particles q and r (q<r) each hold (s_q, v_q, F_q) and (s_r, v_r, F_r); the quantities exchanged are f_qr and s_r]

Parallelizing the Basic Solver

#  pragma omp parallel
   for each timestep {
      if (timestep output) {
#        pragma omp single nowait
         Print positions and velocities of particles;
      }
#     pragma omp for
      for each particle q
         Compute total force on q;
#     pragma omp for
      for each particle q
         Compute position and velocity of q;
   }

Race Conditions?

Parallelizing the Reduced Solver

#  pragma omp for
   for each particle q
      forces[q][0]=forces[q][1]=0;

#  pragma omp for
   for each particle q {
      for each particle k>q {
         x_diff=pos[q][0]-pos[k][0];
         y_diff=pos[q][1]-pos[k][1];
         dist=sqrt(x_diff*x_diff+y_diff*y_diff);
         dist_cubed=dist*dist*dist;
         force_qk[0]=-G*masses[q]*masses[k]/dist_cubed*x_diff;
         force_qk[1]=-G*masses[q]*masses[k]/dist_cubed*y_diff;

         forces[q][0]+=force_qk[0];
         forces[q][1]+=force_qk[1];
         forces[k][0]-=force_qk[0];
         forces[k][1]-=force_qk[1];
      }
   }

Does it work properly?

• Consider 2 threads and 4 particles.

• Thread 1 is assigned particle 0 and particle 1.

• Thread 2 is assigned particle 2 and particle 3.

• F3=-f03-f13-f23

• Who will calculate f03 and f13?

• Who will calculate f23?

• Any race conditions?

Thread Contributions

3 Threads, 6 Particles, Block Partition

                     Contributions of Threads
Thread   Particle    0                       1                 2
0        0           f01+f02+f03+f04+f05     0                 0
0        1           -f01+f12+f13+f14+f15    0                 0
1        2           -f02-f12                f23+f24+f25       0
1        3           -f03-f13                -f23+f34+f35      0
2        4           -f04-f14                -f24-f34          f45
2        5           -f05-f15                -f25-f35          -f45

Thread Contributions

3 Threads, 6 Particles, Cyclic Partition

                     Contributions of Threads
Thread   Particle    0                       1                 2
0        0           f01+f02+f03+f04+f05     0                 0
1        1           -f01                    f12+f13+f14+f15   0
2        2           -f02                    -f12              f23+f24+f25
0        3           -f03+f34+f35            -f13              -f23
1        4           -f04-f34                -f14+f45          -f24
2        5           -f05-f35                -f15-f45          -f25

First Phase

#  pragma omp for
   for each particle q {
      for each particle k>q {
         x_diff=pos[q][0]-pos[k][0];
         y_diff=pos[q][1]-pos[k][1];
         dist=sqrt(x_diff*x_diff+y_diff*y_diff);
         dist_cubed=dist*dist*dist;
         force_qk[0]=-G*masses[q]*masses[k]/dist_cubed*x_diff;
         force_qk[1]=-G*masses[q]*masses[k]/dist_cubed*y_diff;

         loc_forces[my_rank][q][0]+=force_qk[0];
         loc_forces[my_rank][q][1]+=force_qk[1];
         loc_forces[my_rank][k][0]-=force_qk[0];
         loc_forces[my_rank][k][1]-=force_qk[1];
      }
   }

Second Phase

#  pragma omp for
   for (q=0; q<n; q++) {
      forces[q][0]=forces[q][1]=0;
      for(thread=0; thread<thread_count; thread++) {
         forces[q][0]+=loc_forces[thread][q][0];
         forces[q][1]+=loc_forces[thread][q][1];
      }
   }

Race Conditions?

• In the first phase, each thread carries out the same calculations as before but the values are stored in its own array of forces (loc_forces).

• In the second phase, the thread that has been assigned particle q will add the contributions that have been computed by different threads.
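The two phases can be combined into one function. To keep the sketch short and free of math-library calls, it uses particles on a line (one coordinate instead of two); that simplification, the function name forces_two_phase_1d, the flat layout of loc_forces and the cyclic schedule of the first loop are all assumptions of this sketch.

```c
#include <stdlib.h>

#ifdef _OPENMP
#  include <omp.h>
#endif

#define G 6.673e-11   /* gravitational constant in SI units */

/* Two-phase reduced force computation for n particles on a line (1D).
   Returns 0 on success, -1 on invalid input or failed allocation. */
int forces_two_phase_1d(int n, const double pos[], const double masses[],
                        double forces[], int thread_count) {
   double* loc_forces;
   int q, k, thread;

   if (n<1 || thread_count<1) return -1;
   /* One zero-initialized row of n entries per thread. */
   loc_forces=calloc((size_t) thread_count*(size_t) n, sizeof(double));
   if (loc_forces == NULL) return -1;

#  pragma omp parallel num_threads(thread_count) private(q, k, thread)
   {
#ifdef _OPENMP
      int my_rank=omp_get_thread_num();
#else
      int my_rank=0;
#endif
      double* my_forces=loc_forces+(size_t) my_rank*(size_t) n;

      /* First phase: each thread accumulates into its own row. */
#     pragma omp for schedule(static, 1)
      for (q=0; q<n; q++) {
         for (k=q+1; k<n; k++) {
            double diff=pos[q]-pos[k];
            double dist=(diff<0.0) ? -diff : diff;
            double force_qk=-G*masses[q]*masses[k]/(dist*dist*dist)*diff;
            my_forces[q]+=force_qk;
            my_forces[k]-=force_qk;
         }
      }

      /* Second phase: the thread assigned particle q adds up the
         entries for q from every thread's row. */
#     pragma omp for
      for (q=0; q<n; q++) {
         forces[q]=0.0;
         for (thread=0; thread<thread_count; thread++)
            forces[q]+=loc_forces[(size_t) thread*(size_t) n+q];
      }
   }

   free(loc_forces);
   return 0;
}
```

The implicit barrier at the end of the first omp for ensures that every row is complete before the second phase reads it. Because each pair contributes equal and opposite forces, the entries of forces should sum to approximately zero.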

Evaluating the OpenMP Codes

• In the reduced code:

– Loop 1: Initialization of the loc_forces array

– Loop 2: The first phase of the computation of forces

– Loop 3: The second phase of the computation of forces

– Loop 4: The updating of positions and velocities

• Which schedule should be used?

Threads   Basic   Reduced,   Reduced,        Reduced,
                  Default    Forces Cyclic   All Cyclic
1         7.71    3.90       3.90            3.90
2         3.87    2.94       1.98            2.01
4         1.95    1.73       1.01            1.08
8         0.99    0.95       0.54            0.61

Review

• What are the major differences between MPI and OpenMP?

• What is the scope of a variable?

• What is a reduction variable?

• How to ensure mutual exclusion in a critical section?

• What are the common loop scheduling options?

• What is a thread safe function?

• What factors may affect the performance of an OpenMP program?
