TRANSCRIPT
– 1 –
Basic Machine Independent Performance Optimizations

Topics
Load balancing (review, already discussed)
In the context of OpenMP notation
Performance optimizations by code restructuring
In the context of OpenMP notation
– 2 –
OpenMP Implementation Overview
OpenMP implementation: compiler and library.
Unlike Pthreads (purely a library).
– 3 –
OpenMP Example Usage (1 of 2)
[Diagram: annotated source fed to the OpenMP compiler, producing either a sequential program or a parallel program depending on a compiler switch]
– 4 –
OpenMP Example Usage (2 of 2)OpenMP Example Usage (2 of 2)
If you give sequential switch,If you give sequential switch, pragmas are ignored.
If you give parallel switch,If you give parallel switch, pragmas are read, and cause translation into parallel program.
Ideally, one source for both sequential and parallel Ideally, one source for both sequential and parallel program (big maintenance plus).program (big maintenance plus).
– 5 –
OpenMP Directives
Parallelization directives: parallel for
Data environment directives: shared, private, threadprivate, reduction, etc.
– 6 –
OpenMP Notation: Parallel For

#pragma omp parallel for

A number of threads are spawned at entry.
Each thread is assigned a set of iterations of the loop and executes that code.
e.g., block or cyclic iteration assignment to threads
Each thread waits at the end.
Very similar to fork/join synchronization.
– 7 –
API Semantics
Master thread executes sequential code.
Master and slaves execute parallel code.
Note: very similar to fork-join semantics of Pthreads create/join primitives.
– 8 –
Scheduling of IterationsScheduling of Iterations
Scheduling: assigning iterations to a thread.Scheduling: assigning iterations to a thread.
OpenMP allows scheduling strategies, such as block, OpenMP allows scheduling strategies, such as block, cyclic, etc.cyclic, etc.
– 9 –
Scheduling of Iterations: Specification

#pragma omp parallel for schedule(<sched>)

<sched> can be one of
  block (default)
  cyclic
– 10 –
Example
Multiplication of two matrices C = A x B, where the A matrix is upper-triangular (all elements below the diagonal are 0).
[Figure: matrix A with zeros below the diagonal]
– 11 –
Sequential Matrix Multiply Becomes

for( i=0; i<n; i++ )
  for( j=0; j<n; j++ ) {
    c[i][j] = 0.0;
    for( k=i; k<n; k++ )
      c[i][j] += a[i][k]*b[k][j];
  }

Load imbalance with block distribution.
– 12 –
OpenMP Matrix Multiply

#pragma omp parallel for schedule( cyclic )
for( i=0; i<n; i++ )
  for( j=0; j<n; j++ ) {
    c[i][j] = 0.0;
    for( k=i; k<n; k++ )
      c[i][j] += a[i][k]*b[k][j];
  }
– 13 –
Code Restructuring Optimizations
Private variables
Loop reordering
Loop peeling
– 14 –
General Idea
Parallelism limited by dependences.
Restructure code to eliminate or reduce dependences.
Compiler usually not able to do this; good to know how to do it by hand.
– 15 –
Example 1: Dependency on a Scalar

for( i=0; i<n; i++ ) {
  tmp = a[i];
  a[i] = b[i];
  b[i] = tmp;
}

Loop-carried dependence on tmp.
Easily fixed by privatizing tmp.
– 16 –
Fix: Scalar Privatization

f() {
  int tmp; /* local allocation on stack */
  for( i=from; i<to; i++ ) {
    tmp = a[i];
    a[i] = b[i];
    b[i] = tmp;
  }
}

Removes dependence on tmp.
– 17 –
Fix: Scalar Privatization in OpenMP

#pragma omp parallel for private( tmp )
for( i=0; i<n; i++ ) {
  tmp = a[i];
  a[i] = b[i];
  b[i] = tmp;
}

Removes dependence on tmp.
– 18 –
Example 3: Induction Variable

for( i=0, index=0; i<n; i++ ) {
  index += i;
  a[i] = b[index];
}

Dependence on index.
Can be computed from the loop variable.
– 19 –
Fix: Induction Variable Elimination

#pragma omp parallel for
for( i=0; i<n; i++ ) {
  a[i] = b[i*(i+1)/2];
}

Dependence removed by computing the induction variable.
– 20 –
Example 4: Induction Variable

for( i=0, index=0; i<n; i++ ) {
  index += f(i);
  b[i] = g(a[index]);
}

Dependence on variable index, but no formula for its value.
– 21 –
Fix: Loop Splitting

for( i=0; i<n; i++ )  /* sequential: prefix sums of f */
  index[i] = (i == 0) ? f(0) : index[i-1] + f(i);
#pragma omp parallel for
for( i=0; i<n; i++ ) {
  b[i] = g(a[index[i]]);
}

Loop splitting has removed the dependence.
– 22 –
Example 5

for( k=0; k<n; k++ )
  for( i=0; i<n; i++ )
    for( j=0; j<n; j++ )
      a[i][j] += b[i][k] + c[k][j];

Dependence on a[i][j] prevents k-loop parallelization.
No dependences carried by the i- and j-loops.
– 23 –
Example 5 Parallelization

for( k=0; k<n; k++ )
  #pragma omp parallel for
  for( i=0; i<n; i++ )
    for( j=0; j<n; j++ )
      a[i][j] += b[i][k] + c[k][j];

We can do better by reordering the loops.
– 24 –
Optimization: Loop Reordering

#pragma omp parallel for
for( i=0; i<n; i++ )
  for( j=0; j<n; j++ )
    for( k=0; k<n; k++ )
      a[i][j] += b[i][k] + c[k][j];

Larger parallel pieces of work.
– 25 –
Example 6

#pragma omp parallel for
for( i=0; i<n; i++ )
  a[i] = b[i];
#pragma omp parallel for
for( i=0; i<n; i++ )
  c[i] = b[i]*b[i];

Make two parallel loops into one.
– 26 –
Optimization: Loop Fusion

#pragma omp parallel for
for( i=0; i<n; i++ ) {
  a[i] = b[i];
  c[i] = b[i]*b[i];
}

Reduces loop startup overhead.
– 27 –
Example 7: While Loops

while( *a ) {
  process( a );
  a++;
}

The number of loop iterations is unknown.
– 28 –
Special Case of Loop Splitting

for( count=0, p=a; *p; count++, p++ );  /* count iterations */
#pragma omp parallel for
for( i=0; i<count; i++ )
  process( &a[i] );

Count the number of loop iterations.
Then parallelize the loop.
– 29 –
Example 8

for( i=0, wrap=n; i<n; i++ ) {
  b[i] = a[i] + a[wrap];
  wrap = i;
}

Dependence on wrap.
Only the first iteration causes the dependence.
– 30 –
Loop Peeling

b[0] = a[0] + a[n];
#pragma omp parallel for
for( i=1; i<n; i++ ) {
  b[i] = a[i] + a[i-1];
}
– 31 –
Example 10

for( i=0; i<n; i++ )
  a[i+m] = a[i] + b[i];

Dependence if m<n.
– 32 –
Another Case of Loop Peeling

if( m >= n ) {
  #pragma omp parallel for
  for( i=0; i<n; i++ )
    a[i+m] = a[i] + b[i];
} else {
  … cannot be parallelized
}
– 33 –
Summary
Reorganize code such that
  dependences are removed or reduced
  large pieces of parallel work emerge
  loop bounds become known
  …
Code can become messy … there is a point of diminishing returns.