Basic Machine Independent Performance Optimizations

Uploaded by emory-carter, 25-Dec-2015

– 1 –

Basic Machine Independent Performance Optimizations

Topics
Load balancing (review, already discussed), in the context of OpenMP notation
Performance optimizations by code restructuring, in the context of OpenMP notation

– 2 –

OpenMP Implementation Overview

OpenMP implementation: compiler + library.

Unlike Pthreads (purely a library).

– 3 –

OpenMP Example Usage (1 of 2)

[Diagram: annotated source → OpenMP compiler (controlled by a compiler switch) → either sequential program or parallel program]

– 4 –

OpenMP Example Usage (2 of 2)

If you give the sequential switch, pragmas are ignored.

If you give the parallel switch, pragmas are read and cause translation into a parallel program.

Ideally, one source for both the sequential and the parallel program (big maintenance plus).

– 5 –

OpenMP Directives

Parallelization directives: parallel for

Data environment directives: shared, private, threadprivate, reduction, etc.

– 6 –

OpenMP Notation: Parallel For

#pragma omp parallel for

A number of threads are spawned at entry.

Each thread is assigned a set of iterations of the loop and executes that code.

e.g., block or cyclic assignment of iterations to threads

Each thread waits at the end.

Very similar to fork/join synchronization.
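As a minimal sketch (hypothetical function and array names), a directive of this form parallelizes an independent loop; if the pragma is not enabled (the sequential switch), the same code simply runs serially:

```c
#include <assert.h>

/* Each thread is handed a subset of the n iterations; all threads
   wait at the implicit barrier at the end of the loop.  Without an
   OpenMP compiler switch the pragma is ignored and the loop runs
   sequentially, producing the same result. */
void vector_add(int n, const int *a, const int *b, int *c) {
    int i;
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```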

– 7 –

API Semantics

Master thread executes sequential code.

Master and slaves execute parallel code.

Note: very similar to the fork/join semantics of the Pthreads create/join primitives.

– 8 –

Scheduling of Iterations

Scheduling: assigning iterations to a thread.

OpenMP allows scheduling strategies, such as block, cyclic, etc.

– 9 –

Scheduling of Iterations: Specification

#pragma omp parallel for schedule(<sched>)

<sched> can be one of: block (default), cyclic
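In the OpenMP standard itself these strategies are spelled schedule(static) for block and schedule(static, 1) for cyclic. A small sketch (hypothetical function names) showing that the clause changes only the iteration-to-thread mapping, not the result:

```c
#include <assert.h>

/* "Block" is OpenMP's schedule(static): each thread gets one
   contiguous chunk of iterations. */
long sum_block(int n) {
    long s = 0;
    int i;
    #pragma omp parallel for schedule(static) reduction(+:s)
    for (i = 0; i < n; i++)
        s += i;
    return s;
}

/* "Cyclic" is schedule(static, 1): iterations are dealt out to
   threads round-robin, one at a time. */
long sum_cyclic(int n) {
    long s = 0;
    int i;
    #pragma omp parallel for schedule(static, 1) reduction(+:s)
    for (i = 0; i < n; i++)
        s += i;
    return s;
}
```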

– 10 –

Example

Multiplication of two matrices C = A x B, where the A matrix is upper-triangular (all elements below the diagonal are 0).

[Figure: matrix A with zeros below the diagonal]

– 11 –

Sequential Matrix Multiply Becomes

for( i=0; i<n; i++ )
  for( j=0; j<n; j++ ) {
    c[i][j] = 0.0;
    for( k=i; k<n; k++ )
      c[i][j] += a[i][k]*b[k][j];
  }

Load imbalance with block distribution.

– 12 –

OpenMP Matrix Multiply

#pragma omp parallel for schedule( cyclic )
for( i=0; i<n; i++ )
  for( j=0; j<n; j++ ) {
    c[i][j] = 0.0;
    for( k=i; k<n; k++ )
      c[i][j] += a[i][k]*b[k][j];
  }
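A runnable version of this loop, with a fixed N = 4 for illustration (in standard OpenMP the cyclic schedule is written schedule(static, 1)). Row i performs N - i inner-product iterations, so dealing rows out cyclically balances the load:

```c
#include <assert.h>
#define N 4

/* C = A x B with A upper-triangular: entries below the diagonal
   are 0, so the k-loop can start at i.  The cyclic row schedule
   counteracts the triangular load imbalance. */
void trimatmul(double a[N][N], double b[N][N], double c[N][N]) {
    int i, j, k;
    #pragma omp parallel for private(j, k) schedule(static, 1)
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            c[i][j] = 0.0;
            for (k = i; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
        }
}
```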

– 13 –

Code Restructuring Optimizations

Private variables

Loop reordering

Loop peeling

– 14 –

General Idea

Parallelism is limited by dependences.

Restructure code to eliminate or reduce dependences.

The compiler is usually not able to do this; good to know how to do it by hand.

– 15 –

Example 1: Dependence on Scalar

for( i=0; i<n; i++ ) {
  tmp = a[i];
  a[i] = b[i];
  b[i] = tmp;
}

Loop-carried dependence on tmp.

Easily fixed by privatizing tmp.

– 16 –

Fix: Scalar Privatization

f() {
  int tmp; /* local allocation on stack */
  for( i=from; i<to; i++ ) {
    tmp = a[i];
    a[i] = b[i];
    b[i] = tmp;
  }
}

Removes dependence on tmp.

– 17 –

Fix: Scalar Privatization in OpenMP

#pragma omp parallel for private( tmp )
for( i=0; i<n; i++ ) {
  tmp = a[i];
  a[i] = b[i];
  b[i] = tmp;
}

Removes dependence on tmp.
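Wrapped in a function for testing (hypothetical name), the privatized swap looks like this; each thread gets its own tmp, so iterations no longer interfere:

```c
#include <assert.h>

/* Swap a[] and b[] element-wise.  private(tmp) gives every thread
   its own copy of tmp, removing the loop-carried dependence. */
void swap_arrays(int n, int *a, int *b) {
    int i, tmp;
    #pragma omp parallel for private(tmp)
    for (i = 0; i < n; i++) {
        tmp = a[i];
        a[i] = b[i];
        b[i] = tmp;
    }
}
```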

– 18 –

Example 3: Induction Variable

for( i=0, index=0; i<n; i++ ) {
  index += i;
  a[i] = b[index];
}

Dependence on index.

Can be computed from the loop variable.

– 19 –

Fix: Induction Variable Elimination

#pragma omp parallel for
for( i=0; i<n; i++ ) {
  a[i] = b[i*(i+1)/2];
}

Dependence removed by computing the induction variable directly from i.
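The closed form works because after iteration i the running sum is 0 + 1 + ... + i = i*(i+1)/2. A small sketch (hypothetical names) checking the replacement against the sequential index values:

```c
#include <assert.h>

/* index after iteration i of the original loop is the i-th
   triangular number, so each iteration can compute it directly. */
int triangle(int i) { return i * (i + 1) / 2; }

void gather(int n, const int *b, int *a) {
    int i;
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        a[i] = b[triangle(i)];
}
```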

– 20 –

Example 4: Induction Variable

for( i=0, index=0; i<n; i++ ) {
  index += f(i);
  b[i] = g(a[index]);
}

Dependence on variable index, but no closed-form formula for its value.

– 21 –

Fix: Loop Splitting

for( i=0; i<n; i++ ) {
  index[i] = ( i == 0 ? 0 : index[i-1] ) + f(i);
}
#pragma omp parallel for
for( i=0; i<n; i++ ) {
  b[i] = g(a[index[i]]);
}

Loop splitting has removed the dependence: the first, sequential loop materializes the running sum into the array index[], and the second loop can then run in parallel.
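A runnable sketch with stand-in definitions for the lecture's opaque f and g: the sequential prefix-sum pass must accumulate index[i-1], after which every iteration of the second loop is independent:

```c
#include <assert.h>

static int f(int i) { return i % 3; }   /* stand-in for f() */
static int g(int x) { return 2 * x; }   /* stand-in for g() */

void split_loop(int n, const int *a, int *b, int *index) {
    int i;
    /* Phase 1, sequential: prefix sum index[i] = f(0)+...+f(i). */
    index[0] = f(0);
    for (i = 1; i < n; i++)
        index[i] = index[i - 1] + f(i);
    /* Phase 2, parallel: no loop-carried dependence left. */
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        b[i] = g(a[index[i]]);
}
```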

– 22 –

Example 5

for( k=0; k<n; k++ )
  for( i=0; i<n; i++ )
    for( j=0; j<n; j++ )
      a[i][j] += b[i][k] + c[k][j];

Dependence on a[i][j] prevents k-loop parallelization.

No dependences carried by the i- and j-loops.

– 23 –

Example 5 Parallelization

for( k=0; k<n; k++ )
  #pragma omp parallel for
  for( i=0; i<n; i++ )
    for( j=0; j<n; j++ )
      a[i][j] += b[i][k] + c[k][j];

We can do better by reordering the loops.

– 24 –

Optimization: Loop Reordering

#pragma omp parallel for
for( i=0; i<n; i++ )
  for( j=0; j<n; j++ )
    for( k=0; k<n; k++ )
      a[i][j] += b[i][k] + c[k][j];

Larger parallel pieces of work.
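With the i-loop outermost, the parallel region is entered once rather than once per k iteration, and each thread keeps the same rows of a[][] for the whole computation. A sketch (fixed N = 3, hypothetical function name) of the reordered update:

```c
#include <assert.h>
#define N 3

/* Same update as the k-outermost version: += over k is a sum, so
   the loops can be exchanged without changing the result. */
void update(double a[N][N], double b[N][N], double c[N][N]) {
    int i, j, k;
    #pragma omp parallel for private(j, k)
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                a[i][j] += b[i][k] + c[k][j];
}
```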

– 25 –

Example 6

#pragma omp parallel for
for( i=0; i<n; i++ )
  a[i] = b[i];
#pragma omp parallel for
for( i=0; i<n; i++ )
  c[i] = b[i]*b[i];

Make two parallel loops into one.

– 26 –

Optimization: Loop Fusion

#pragma omp parallel for
for( i=0; i<n; i++ ) {
  a[i] = b[i];
  c[i] = b[i]*b[i];
}

Reduces loop startup overhead.
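The fused loop as a testable function (hypothetical name; note that squaring in C is written b[i]*b[i], since ^ is bitwise XOR):

```c
#include <assert.h>

/* One parallel region instead of two halves the thread spawn/join
   overhead; both statements read only b[i], so fusing is legal. */
void fused(int n, const int *b, int *a, int *c) {
    int i;
    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        a[i] = b[i];
        c[i] = b[i] * b[i];
    }
}
```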

– 27 –

Example 7: While Loops

while( *a ) {
  process(a);
  a++;
}

The number of loop iterations is unknown.

– 28 –

Special Case of Loop Splitting

for( count=0, p=a; *p; count++, p++ );
#pragma omp parallel for
for( i=0; i<count; i++ )
  process( &a[i] );

First count the number of loop iterations, using the same termination test (*p) as the original while loop.

Then parallelize the loop.
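A runnable sketch over a C string, with upcase as a hypothetical stand-in for process(): the sequential pass counts the iterations, turning the while loop into a countable for-loop that OpenMP can split:

```c
#include <assert.h>

static void upcase(char *c) {           /* stand-in for process() */
    if (*c >= 'a' && *c <= 'z')
        *c = (char)(*c - 'a' + 'A');
}

void process_all(char *a) {
    int count, i;
    char *p;
    /* Pass 1, sequential: count iterations (same test as while(*a)). */
    for (count = 0, p = a; *p; count++, p++)
        ;
    /* Pass 2: a countable loop, parallelizable. */
    #pragma omp parallel for
    for (i = 0; i < count; i++)
        upcase(&a[i]);
}
```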

– 29 –

Example 8

for( i=0, wrap=n; i<n; i++ ) {
  b[i] = a[i] + a[wrap];
  wrap = i;
}

Dependence on wrap.

Only the first iteration causes the dependence.

– 30 –

Loop Peeling

b[0] = a[0] + a[n];
#pragma omp parallel for
for( i=1; i<n; i++ ) {
  b[i] = a[i] + a[i-1];
}
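Both versions side by side as testable functions (hypothetical names; note a[] must have n + 1 elements, since a[n] is read). Peeling hoists the one iteration that reads wrap's initial value, leaving a fully parallel remainder:

```c
#include <assert.h>

/* Original: only i == 0 reads wrap's initial value n; afterwards
   wrap is always i - 1. */
void wrap_orig(int n, const int *a, int *b) {
    int i, wrap = n;
    for (i = 0; i < n; i++) {
        b[i] = a[i] + a[wrap];
        wrap = i;
    }
}

/* Peeled: first iteration hoisted out, remaining loop parallel. */
void wrap_peeled(int n, const int *a, int *b) {
    int i;
    b[0] = a[0] + a[n];
    #pragma omp parallel for
    for (i = 1; i < n; i++)
        b[i] = a[i] + a[i - 1];
}
```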

– 31 –

Example 10

for( i=0; i<n; i++ )
  a[i+m] = a[i] + b[i];

Dependence if m<n.

– 32 –

Another Case of Loop Peeling

if( m >= n ) {
  #pragma omp parallel for
  for( i=0; i<n; i++ )
    a[i+m] = a[i] + b[i];
}
else {
  … cannot be parallelized
}

(The guard is m >= n because the loop writes a[m..n+m-1] while reading a[0..n-1]; the dependence exists only when m < n.)
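A guarded version as a testable function (hypothetical name; a[] must have at least n + m elements). The write range a[m..n+m-1] and read range a[0..n-1] are disjoint exactly when m >= n, so only that case runs in parallel:

```c
#include <assert.h>

void shift_add(int n, int m, int *a, const int *b) {
    int i;
    if (m >= n) {
        /* Read and write ranges disjoint: safe to parallelize. */
        #pragma omp parallel for
        for (i = 0; i < n; i++)
            a[i + m] = a[i] + b[i];
    } else {
        /* Overlapping ranges: loop-carried dependence, keep serial. */
        for (i = 0; i < n; i++)
            a[i + m] = a[i] + b[i];
    }
}
```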

– 33 –

Summary

Reorganize code such that:
dependences are removed or reduced
large pieces of parallel work emerge
loop bounds become known
…

Code can become messy … there is a point of diminishing returns.