Hamid Sarbazi-Azad - sharifce.sharif.edu/courses/93-94/2/ce215-1/resources/root/slides... (posted 08-May-2018)


Computational Mathematics

Department of Computer Engineering Sharif University of Technology e-mail: azad@sharif.edu

Hamid Sarbazi-Azad

OpenMP

Work-sharing

Instructor: PanteA Zardoshti

Computational Mathematics, OpenMP, Sharif University, Fall 2015

A worksharing construct distributes the execution of the associated region among the members of the team that encounters it.

Work-sharing

#pragma omp parallel for
for (i = 0; i < 100; i++)
    A[i] = A[i] + B;

A worksharing region has no barrier on entry; however, an implied barrier exists at the end of the worksharing region.

The OpenMP API defines the following worksharing constructs, and these are described in the sections that follow:

• loop

• sections

• single

Constructs

LOOP CONSTRUCT


The loop construct specifies that the iterations of one or more associated loops will be executed in parallel by threads in the team in the context of their implicit tasks.

The iterations are distributed across threads that already exist in the team executing the parallel region to which the loop region binds.

Loop Construct

#pragma omp for [clause[[,] clause] ... ]

for-loops
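As a short sketch (the function and array names are illustrative, not from the slides), an orphaned for directive binds to the team created by an enclosing parallel region; the split form below is equivalent to a combined parallel for:

```c
/* Sketch: a work-sharing "for" binds to the innermost enclosing
   parallel region. Splitting the directives as below is equivalent
   to a single "#pragma omp parallel for" on the loop. The code is
   also correct when compiled without OpenMP: the pragmas are ignored
   and the loop runs serially. */
void add_arrays(double *a, const double *b, int n)
{
    #pragma omp parallel            /* create the team of threads */
    {
        #pragma omp for             /* distribute iterations over the team */
        for (int i = 0; i < n; i++)
            a[i] = a[i] + b[i];
    }                               /* implied barrier at the end */
}
```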


where clause is one of the following:

• private(list)

• firstprivate(list)

• lastprivate(list)

• schedule(kind[, chunk_size])

• collapse(n)

• ordered

• nowait

• reduction(reduction-identifier: list)

Clauses


How does OpenMP schedule iterations?

Although the OpenMP standard does not specify how a loop should be partitioned, most compilers split the loop into N/p chunks by default (N = number of iterations, p = number of threads). This is called a static schedule (with chunk size N/p).

• For example, suppose we have a loop with 1000 iterations and 4 OpenMP threads. The loop is partitioned as follows: thread 0 gets iterations 1-250, thread 1 gets 251-500, thread 2 gets 501-750, and thread 3 gets 751-1000.

Schedule Clause

Static: schedule(static [,chunk])

• Blocks of iterations of size "chunk" are assigned to threads
• Round-robin distribution

Dynamic: schedule(dynamic [,chunk])

• Threads grab "chunk" iterations at a time
• When done with its iterations, a thread requests the next set

Guided: schedule(guided [,chunk])

• Dynamic schedule starting with large blocks
• The block size shrinks, but never below "chunk"

Runtime: schedule(runtime)

• The schedule type and chunk size are taken from the environment variable OMP_SCHEDULE
• Example of run-time specified scheduling: OMP_SCHEDULE="dynamic,2"

Schedule Clause (cont'd)
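The clauses above can be sketched in a small example (function and array names are illustrative, not from the slides). The schedule clause changes only how iterations are mapped to threads, never the result; schedule(runtime) would instead defer the choice to OMP_SCHEDULE:

```c
/* Sketch: static schedule with chunk size 4, so blocks of 4
   iterations are dealt out to the threads round-robin. The numerical
   result is identical for any schedule kind, and also when compiled
   without OpenMP. */
void scaled_add(double *c, const double *a, const double *b, int n)
{
    #pragma omp parallel for schedule(static, 4)
    for (int i = 0; i < n; i++)
        c[i] = 2.0 * a[i] + b[i];
}
```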

The Experiment

[Figure: timing results for the different schedule kinds; not recoverable from the transcript.]

Allows parallelization of perfectly nested loops without using nested parallelism

collapse clause on for/do loop indicates how many loops should be collapsed

Compiler forms a single loop and then parallelizes this

Collapse Clause

#pragma omp for collapse (2)
for (k = 1; k <= 100; k++)
    for (j = 1; j <= 200; j++)
        ...
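A runnable sketch of the same idea (the array name and loop body are illustrative): collapse(2) fuses the k and j loops into a single 100 x 200 iteration space before distributing it, so there is enough parallelism even when the outer trip count is small:

```c
#define ROWS 100
#define COLS 200

/* Sketch: the two loops are collapsed into one iteration space of
   ROWS*COLS iterations, which is then distributed over the team. */
void fill_table(double t[ROWS][COLS])
{
    #pragma omp parallel for collapse(2)
    for (int k = 0; k < ROWS; k++)
        for (int j = 0; j < COLS; j++)
            t[k][j] = (double)(k * COLS + j);
}
```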


The ordered region executes in sequential iteration order.

Since do_lots_of_work takes a lot of time, most of the parallel benefit is still realized.

ordered is helpful for debugging.

Ordered Clause

#pragma omp parallel for ordered
for (i = 0; i < nproc; i++) {
    do_lots_of_work(result[i]);
    #pragma omp ordered
    fprintf(fid, "%d %f\n", i, result[i]);
}


To minimize synchronization, some OpenMP pragmas support the optional nowait clause

If present, threads do not synchronize/wait at the end of that particular construct

Nowait Clause

#pragma omp for nowait
for (k = 1; k <= 100; k++)
    ...


#pragma omp parallel shared(n,a,b,c,x,y,z) private(f,i,scale)
{
    f = 1.0;                       /* statement is executed by all threads */

    #pragma omp for nowait         /* parallel loop (work is distributed) */
    for (i=0; i<n; i++)
        z[i] = x[i] + y[i];

    #pragma omp for nowait         /* parallel loop (work is distributed) */
    for (i=0; i<n; i++)
        a[i] = b[i] + c[i];

    ....

    #pragma omp barrier            /* synchronization */

    scale = sum(a,0,n) + sum(z,0,n) + f;   /* statement is executed by all threads */
} /*-- End of parallel region --*/

Example

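A runnable sketch of the same pattern (buffer size and names are illustrative; it assumes n <= 256): the two loops write disjoint arrays, so neither needs the other's implied barrier, but the explicit barrier is still required before the results are combined:

```c
/* Sketch: nowait drops the implied barrier after each work-sharing
   loop; the explicit barrier guarantees both z and a are complete
   before they are summed. Correct with or without OpenMP. */
double combine(const double *x, const double *y,
               const double *b, const double *c, int n)
{
    double z[256], a[256];          /* assumes n <= 256 */
    double f = 1.0, scale = 0.0;

    #pragma omp parallel shared(x, y, b, c, z, a, n)
    {
        #pragma omp for nowait      /* no barrier after this loop */
        for (int i = 0; i < n; i++)
            z[i] = x[i] + y[i];

        #pragma omp for nowait      /* no barrier here either */
        for (int i = 0; i < n; i++)
            a[i] = b[i] + c[i];

        #pragma omp barrier         /* z and a are complete past here */
    }

    for (int i = 0; i < n; i++)
        scale += a[i] + z[i];
    return scale + f;
}
```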

Barrier

[Diagram: three threads (Thread 1, Thread 2, Thread 3) reach a barrier at different times; each waits until all have arrived, then all proceed together to the next barrier.]

What do the waiting threads do in the meantime? Use OMP_WAIT_POLICY to control the behaviour of idle threads.

Suppose we run each of these two loops in parallel over i:

This may give us a wrong answer. Why?

Example

for (i=0; i < N; i++)
    a[i] = b[i] + c[i];

for (i=0; i < N; i++)
    d[i] = a[i] + b[i];


We need to have updated all of a[] first, before using a[].

All threads wait at the barrier point and only continue when all threads have reached it.

Example (cont'd)

for (i=0; i < N; i++)
    a[i] = b[i] + c[i];
/* implicit barrier: wait! */
for (i=0; i < N; i++)
    d[i] = a[i] + b[i];


SECTIONS CONSTRUCT


Independent sections of code can execute concurrently

Sections Construct

#pragma omp parallel sections [clause[[,] clause] ...]
{
    #pragma omp section
    phase1();
    #pragma omp section
    phase2();
    #pragma omp section
    phase3();
}


where clause is one of the following:

• private(list)

• firstprivate(list)

• lastprivate(list)

• nowait

• reduction(reduction-identifier: list)

Clauses


#pragma omp parallel default(none) shared(n,a,b,c,d) private(i)
{
    #pragma omp sections nowait
    {
        #pragma omp section             /* Section #1 */
        for (i=0; i<n-1; i++)
            b[i] = (a[i] + a[i+1])/2;

        #pragma omp section             /* Section #2 */
        for (i=0; i<n; i++)
            d[i] = 1.0/c[i];
    } /*-- End of sections --*/
} /*-- End of parallel region --*/

Example

[Diagram: within the parallel region, Section #1 and Section #2 execute concurrently over time.]


SINGLE CONSTRUCT


Denotes block of code to be executed by only one thread

Thread chosen is implementation dependent

Implicit barrier at end

Single Construct

#pragma omp parallel
{
    DoManyThings();
    #pragma omp single
    {
        ExchangeBoundaries();
    }                        /* threads wait here for single */
    DoManyMoreThings();
}


CRITICAL SECTION


float dot_prod(float* a, float* b, int N)
{
    float sum = 0.0;
    #pragma omp parallel for shared(sum)
    for (int i = 0; i < N; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

Critical Section

What is wrong? The update of sum is an unsynchronized read-modify-write of a shared variable: threads can overwrite each other's partial results, so updates are lost.


Defines a critical region on a structured block

Critical Construct

#pragma omp critical [(lock_name)]

float dot_prod(float* a, float* b, int N)
{
    float sum = 0.0;
    #pragma omp parallel for shared(sum)
    for (int i = 0; i < N; i++) {
        #pragma omp critical
        sum += a[i] * b[i];
    }
    return sum;
}

Naming the critical construct is optional, but may improve performance: all unnamed critical regions share a single global lock, while regions with different names do not block each other.
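A runnable sketch of the corrected pattern with a named critical region (the name dot_sum and the function name are illustrative). Note that critical serializes every update, so this is correct but slow; the reduction clause, introduced next, is usually the better tool for this pattern:

```c
/* Sketch: the named critical region makes the shared update safe,
   at the cost of serializing it. Also correct without OpenMP. */
float dot_prod_critical(const float *a, const float *b, int n)
{
    float sum = 0.0f;
    #pragma omp parallel for shared(sum)
    for (int i = 0; i < n; i++) {
        #pragma omp critical (dot_sum)   /* one thread at a time */
        sum += a[i] * b[i];
    }
    return sum;
}
```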


The variables in "list" must be shared in the enclosing parallel region.

Inside a parallel or work-sharing construct:
• A private copy of each list variable is created and initialized depending on the "op"
• These copies are updated locally by the threads
• At the end of the construct, the local copies are combined through "op" into a single value, which is then combined with the value of the original shared variable

Reduction Clause

reduction (op : list)


A local copy of sum is made for each thread; all local copies are added together and stored in the "global" variable.

Reduction Clause

float dot_prod(float* a, float* b, int N)
{
    float sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}


Reduction Clause (cont.)

Operators

• + Sum

• * Product

• & Bitwise and

• | Bitwise or

• ^ Bitwise exclusive or

• && Logical and

• || Logical or
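Several of the operators above can be combined in one loop. A sketch (function and variable names are illustrative, not from the slides):

```c
/* Sketch: three reductions over the same loop, using the +, *, and
   && operators from the table above. Each thread keeps private
   copies of sum, prod, and all_pos, combined at the end of the loop. */
void reduce_stats(const int *v, int n,
                  long *sum_out, long *prod_out, int *all_pos_out)
{
    long sum = 0, prod = 1;
    int all_pos = 1;
    #pragma omp parallel for reduction(+:sum) reduction(*:prod) \
                             reduction(&&:all_pos)
    for (int i = 0; i < n; i++) {
        sum     += v[i];
        prod    *= v[i];
        all_pos  = all_pos && (v[i] > 0);
    }
    *sum_out = sum;
    *prod_out = prod;
    *all_pos_out = all_pos;
}
```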


Special case of a critical section

Applies only to simple update of memory location

Atomic Construct

#pragma omp parallel for shared(x, y, index, n)
for (i = 0; i < n; i++) {
    #pragma omp atomic
    x[index[i]] += work1(i);
    y[i] += work2(i);
}
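A runnable sketch of the same pattern (names are illustrative): atomic protects only the indexed update, where different iterations may collide on the same element of x; the update of y[i] is touched by exactly one iteration and needs no protection:

```c
/* Sketch: the atomic construct covers the single statement that
   follows it, making the scatter-update to x safe. */
void scatter_add(double *x, double *y, const int *index,
                 const double *w1, const double *w2, int n)
{
    #pragma omp parallel for shared(x, y, index, w1, w2, n)
    for (int i = 0; i < n; i++) {
        #pragma omp atomic
        x[index[i]] += w1[i];    /* possible collisions across iterations */
        y[i] += w2[i];           /* each y[i] written by one iteration only */
    }
}
```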


void omp_init_lock(omp_lock_t * lock_p);

void omp_set_lock(omp_lock_t * lock_p);

void omp_unset_lock(omp_lock_t * lock_p);

void omp_destroy_lock(omp_lock_t * lock_p);

Lock Construct

Protect resources with locks.


omp_lock_t lck;
omp_init_lock(&lck);
#pragma omp parallel for
for (i = 0; i <= N; i++) {
    omp_set_lock(&lck);        /* wait here for your turn */
    result += w[i] * y[i];
    omp_unset_lock(&lck);      /* release the lock so the next thread gets a turn */
}
omp_destroy_lock(&lck);        /* free up storage when done */

Lock Construct
