Hamid Sarbazi-Azad - sharifce.sharif.edu/courses/93-94/2/ce215-1/resources/root/slides... (posted 08-May-2018)


Computational Mathematics

Department of Computer Engineering Sharif University of Technology e-mail: azad@sharif.edu

Hamid Sarbazi-Azad

OpenMP

Work-sharing

Instructor: PanteA Zardoshti

Computational Mathematics, OpenMP, Sharif University, Fall 2015

A worksharing construct distributes the execution of the associated region among the members of the team that encounters it.

Work-sharing

#pragma omp parallel for
for (i = 0; i < 100; i++)
    A[i] = A[i] + B;

A worksharing region has no barrier on entry; however, an implied barrier exists at the end of the worksharing region.

The OpenMP API defines the following worksharing constructs, and these are described in the sections that follow:

• loop

• sections

• single

Constructs

LOOP CONSTRUCT


The loop construct specifies that the iterations of one or more associated loops will be executed in parallel by threads in the team in the context of their implicit tasks.

The iterations are distributed across threads that already exist in the team executing the parallel region to which the loop region binds.

Loop Construct

#pragma omp for [clause[[,] clause] ... ]

for-loops
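As a short sketch (the function and array names are illustrative, not from the slides), an orphaned for directive binds to the team created by an enclosing parallel region; the split form below is equivalent to a combined parallel for:

```c
/* Sketch: a work-sharing "for" binds to the innermost enclosing
   parallel region. Splitting the directives as below is equivalent
   to a single "#pragma omp parallel for" on the loop. The code is
   also correct when compiled without OpenMP: the pragmas are ignored
   and the loop runs serially. */
void add_arrays(double *a, const double *b, int n)
{
    #pragma omp parallel            /* create the team of threads */
    {
        #pragma omp for             /* distribute iterations over the team */
        for (int i = 0; i < n; i++)
            a[i] = a[i] + b[i];
    }                               /* implied barrier at the end */
}
```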


where clause is one of the following:

• private(list)

• firstprivate(list)

• lastprivate(list)

• schedule(kind[, chunk_size])

• collapse(n)

• ordered

• nowait

• reduction(reduction-identifier: list)

Clauses


How does OpenMP schedule iterations?

Although the OpenMP standard does not specify how a loop should be partitioned, most compilers split the loop into N/p chunks by default (N = number of iterations, p = number of threads). This is called a static schedule (with chunk size N/p).

• For example, suppose we have a loop with 1000 iterations and 4 OpenMP threads. The loop is partitioned as follows: thread 0 gets iterations 1-250, thread 1 gets 251-500, thread 2 gets 501-750, and thread 3 gets 751-1000.

Schedule Clause

Static: schedule(static [,chunk])

• Blocks of iterations of size "chunk" are assigned to threads
• Round-robin distribution

Dynamic: schedule(dynamic [,chunk])

• Threads grab "chunk" iterations at a time
• When done with its iterations, a thread requests the next set

Guided: schedule(guided [,chunk])

• Dynamic schedule starting with large blocks
• The block size shrinks, but never below "chunk"

Runtime: schedule(runtime)

• The schedule type and chunk size are taken from the environment variable OMP_SCHEDULE
• Example of run-time specified scheduling: OMP_SCHEDULE="dynamic,2"

Schedule Clause (cont'd)
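The clauses above can be sketched in a small example (function and array names are illustrative, not from the slides). The schedule clause changes only how iterations are mapped to threads, never the result; schedule(runtime) would instead defer the choice to OMP_SCHEDULE:

```c
/* Sketch: static schedule with chunk size 4, so blocks of 4
   iterations are dealt out to the threads round-robin. The numerical
   result is identical for any schedule kind, and also when compiled
   without OpenMP. */
void scaled_add(double *c, const double *a, const double *b, int n)
{
    #pragma omp parallel for schedule(static, 4)
    for (int i = 0; i < n; i++)
        c[i] = 2.0 * a[i] + b[i];
}
```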

The Experiment

[Figure: timing results for the different schedule kinds; not recoverable from the transcript.]

Allows parallelization of perfectly nested loops without using nested parallelism

collapse clause on for/do loop indicates how many loops should be collapsed

Compiler forms a single loop and then parallelizes this

Collapse Clause

#pragma omp for collapse (2)
for (k = 1; k <= 100; k++)
    for (j = 1; j <= 200; j++)
        ...
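A runnable sketch of the same idea (the array name and loop body are illustrative): collapse(2) fuses the k and j loops into a single 100 x 200 iteration space before distributing it, so there is enough parallelism even when the outer trip count is small:

```c
#define ROWS 100
#define COLS 200

/* Sketch: the two loops are collapsed into one iteration space of
   ROWS*COLS iterations, which is then distributed over the team. */
void fill_table(double t[ROWS][COLS])
{
    #pragma omp parallel for collapse(2)
    for (int k = 0; k < ROWS; k++)
        for (int j = 0; j < COLS; j++)
            t[k][j] = (double)(k * COLS + j);
}
```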


The ordered region executes in sequential iteration order.

Since do_lots_of_work takes a lot of time, most of the parallel benefit is still realized.

ordered is helpful for debugging.

Ordered Clause

#pragma omp parallel for ordered
for (i = 0; i < nproc; i++) {
    do_lots_of_work(result[i]);
    #pragma omp ordered
    fprintf(fid, "%d %f\n", i, result[i]);
}


To minimize synchronization, some OpenMP pragmas support the optional nowait clause

If present, threads do not synchronize/wait at the end of that particular construct

Nowait Clause

#pragma omp for nowait
for (k = 1; k <= 100; k++)
    ...


#pragma omp parallel shared(n,a,b,c,x,y,z) private(f,i,scale)
{
    f = 1.0;                       /* statement is executed by all threads */

    #pragma omp for nowait         /* parallel loop (work is distributed) */
    for (i=0; i<n; i++)
        z[i] = x[i] + y[i];

    #pragma omp for nowait         /* parallel loop (work is distributed) */
    for (i=0; i<n; i++)
        a[i] = b[i] + c[i];

    ....

    #pragma omp barrier            /* synchronization */

    scale = sum(a,0,n) + sum(z,0,n) + f;   /* statement is executed by all threads */
} /*-- End of parallel region --*/

Example

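A runnable sketch of the same pattern (buffer size and names are illustrative; it assumes n <= 256): the two loops write disjoint arrays, so neither needs the other's implied barrier, but the explicit barrier is still required before the results are combined:

```c
/* Sketch: nowait drops the implied barrier after each work-sharing
   loop; the explicit barrier guarantees both z and a are complete
   before they are summed. Correct with or without OpenMP. */
double combine(const double *x, const double *y,
               const double *b, const double *c, int n)
{
    double z[256], a[256];          /* assumes n <= 256 */
    double f = 1.0, scale = 0.0;

    #pragma omp parallel shared(x, y, b, c, z, a, n)
    {
        #pragma omp for nowait      /* no barrier after this loop */
        for (int i = 0; i < n; i++)
            z[i] = x[i] + y[i];

        #pragma omp for nowait      /* no barrier here either */
        for (int i = 0; i < n; i++)
            a[i] = b[i] + c[i];

        #pragma omp barrier         /* z and a are complete past here */
    }

    for (int i = 0; i < n; i++)
        scale += a[i] + z[i];
    return scale + f;
}
```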

Barrier

[Diagram: three threads (Thread 1, Thread 2, Thread 3) reach a barrier at different times; each waits until all have arrived, then all proceed together to the next barrier.]

What do the waiting threads do in the meantime? Use OMP_WAIT_POLICY to control the behaviour of idle threads.

Suppose we run each of these two loops in parallel over i:

This may give us a wrong answer. Why?

Example

for (i=0; i < N; i++)
    a[i] = b[i] + c[i];

for (i=0; i < N; i++)
    d[i] = a[i] + b[i];


We need to have updated all of a[] first, before using a[].

All threads wait at the barrier point and only continue when all threads have reached it.

Example (cont'd)

for (i=0; i < N; i++)
    a[i] = b[i] + c[i];
/* implicit barrier: wait! */
for (i=0; i < N; i++)
    d[i] = a[i] + b[i];


SECTIONS CONSTRUCT


Independent sections of code can execute concurrently

Sections Construct

#pragma omp parallel sections [clause[[,] clause] ...]
{
    #pragma omp section
    phase1();
    #pragma omp section
    phase2();
    #pragma omp section
    phase3();
}


where clause is one of the following:

• private(list)

• firstprivate(list)

• lastprivate(list)

• nowait

• reduction(reduction-identifier: list)

Clauses


#pragma omp parallel default(none) shared(n,a,b,c,d) private(i)
{
    #pragma omp sections nowait
    {
        #pragma omp section             /* Section #1 */
        for (i=0; i<n-1; i++)
            b[i] = (a[i] + a[i+1])/2;

        #pragma omp section             /* Section #2 */
        for (i=0; i<n; i++)
            d[i] = 1.0/c[i];
    } /*-- End of sections --*/
} /*-- End of parallel region --*/

Example

[Diagram: within the parallel region, Section #1 and Section #2 execute concurrently over time.]


SINGLE CONSTRUCT


Denotes block of code to be executed by only one thread

Thread chosen is implementation dependent

Implicit barrier at end

Single Construct

#pragma omp parallel
{
    DoManyThings();
    #pragma omp single
    {
        ExchangeBoundaries();
    }                        /* threads wait here for single */
    DoManyMoreThings();
}


CRITICAL SECTION


float dot_prod(float* a, float* b, int N)
{
    float sum = 0.0;
    #pragma omp parallel for shared(sum)
    for (int i = 0; i < N; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

Critical Section

What is wrong? The update of sum is an unsynchronized read-modify-write of a shared variable: threads can overwrite each other's partial results, so updates are lost.


Defines a critical region on a structured block

Critical Construct

#pragma omp critical [(lock_name)]

float dot_prod(float* a, float* b, int N)
{
    float sum = 0.0;
    #pragma omp parallel for shared(sum)
    for (int i = 0; i < N; i++) {
        #pragma omp critical
        sum += a[i] * b[i];
    }
    return sum;
}

Naming the critical construct is optional, but may improve performance: all unnamed critical regions share a single global lock, while regions with different names do not block each other.
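A runnable sketch of the corrected pattern with a named critical region (the name dot_sum and the function name are illustrative). Note that critical serializes every update, so this is correct but slow; the reduction clause, introduced next, is usually the better tool for this pattern:

```c
/* Sketch: the named critical region makes the shared update safe,
   at the cost of serializing it. Also correct without OpenMP. */
float dot_prod_critical(const float *a, const float *b, int n)
{
    float sum = 0.0f;
    #pragma omp parallel for shared(sum)
    for (int i = 0; i < n; i++) {
        #pragma omp critical (dot_sum)   /* one thread at a time */
        sum += a[i] * b[i];
    }
    return sum;
}
```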


The variables in "list" must be shared in the enclosing parallel region.

Inside a parallel or work-sharing construct:
• A private copy of each list variable is created and initialized depending on the "op"
• These copies are updated locally by the threads
• At the end of the construct, the local copies are combined through "op" into a single value, which is then combined with the value of the original shared variable

Reduction Clause

reduction (op : list)


A local copy of sum is made for each thread; all local copies are added together and stored in the "global" variable.

Reduction Clause

float dot_prod(float* a, float* b, int N)
{
    float sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}


Reduction Clause (cont.)

Operators

• + Sum

• * Product

• & Bitwise and

• | Bitwise or

• ^ Bitwise exclusive or

• && Logical and

• || Logical or
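Several of the operators above can be combined in one loop. A sketch (function and variable names are illustrative, not from the slides):

```c
/* Sketch: three reductions over the same loop, using the +, *, and
   && operators from the table above. Each thread keeps private
   copies of sum, prod, and all_pos, combined at the end of the loop. */
void reduce_stats(const int *v, int n,
                  long *sum_out, long *prod_out, int *all_pos_out)
{
    long sum = 0, prod = 1;
    int all_pos = 1;
    #pragma omp parallel for reduction(+:sum) reduction(*:prod) \
                             reduction(&&:all_pos)
    for (int i = 0; i < n; i++) {
        sum     += v[i];
        prod    *= v[i];
        all_pos  = all_pos && (v[i] > 0);
    }
    *sum_out = sum;
    *prod_out = prod;
    *all_pos_out = all_pos;
}
```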


Special case of a critical section

Applies only to simple update of memory location

Atomic Construct

#pragma omp parallel for shared(x, y, index, n)
for (i = 0; i < n; i++) {
    #pragma omp atomic
    x[index[i]] += work1(i);
    y[i] += work2(i);
}
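A runnable sketch of the same pattern (names are illustrative): atomic protects only the indexed update, where different iterations may collide on the same element of x; the update of y[i] is touched by exactly one iteration and needs no protection:

```c
/* Sketch: the atomic construct covers the single statement that
   follows it, making the scatter-update to x safe. */
void scatter_add(double *x, double *y, const int *index,
                 const double *w1, const double *w2, int n)
{
    #pragma omp parallel for shared(x, y, index, w1, w2, n)
    for (int i = 0; i < n; i++) {
        #pragma omp atomic
        x[index[i]] += w1[i];    /* possible collisions across iterations */
        y[i] += w2[i];           /* each y[i] written by one iteration only */
    }
}
```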


void omp_init_lock(omp_lock_t * lock_p);

void omp_set_lock(omp_lock_t * lock_p);

void omp_unset_lock(omp_lock_t * lock_p);

void omp_destroy_lock(omp_lock_t * lock_p);

Lock Construct

Protect resources with locks.


omp_lock_t lck;
omp_init_lock(&lck);
#pragma omp parallel for
for (i = 0; i <= N; i++) {
    omp_set_lock(&lck);        /* wait here for your turn */
    result += w[i] * y[i];
    omp_unset_lock(&lck);      /* release the lock so the next thread gets a turn */
}
omp_destroy_lock(&lck);        /* free up storage when done */

Lock Construct
