TRANSCRIPT
– 1 –
Basic Machine Independent Performance Optimizations

Topics
Load balancing (review, already discussed)
In the context of OpenMP notation
Performance optimizations by code restructuring
In the context of OpenMP notation
– 2 –
OpenMP Implementation Overview
OpenMP implementation: compiler and library.
Unlike Pthreads (purely a library).
– 3 –
OpenMP Example Usage (1 of 2)
[Diagram: annotated source fed to the OpenMP compiler, producing either a sequential program or a parallel program depending on a compiler switch]
– 4 –
OpenMP Example Usage (2 of 2)OpenMP Example Usage (2 of 2)
If you give sequential switch,If you give sequential switch, pragmas are ignored.
If you give parallel switch,If you give parallel switch, pragmas are read, and cause translation into parallel program.
Ideally, one source for both sequential and parallel Ideally, one source for both sequential and parallel program (big maintenance plus).program (big maintenance plus).
– 5 –
OpenMP Directives
Parallelization directives: parallel for
Data environment directives: shared, private, threadprivate, reduction, etc.
– 6 –
OpenMP Notation: Parallel For

#pragma omp parallel for

A number of threads are spawned at entry.
Each thread is assigned a set of iterations of the loop and executes that code.
e.g., block or cyclic iteration assignment to threads
Each thread waits at the end.
Very similar to fork/join synchronization.
– 7 –
API Semantics
Master thread executes sequential code.
Master and slaves execute parallel code.
Note: very similar to fork-join semantics of Pthreads create/join primitives.
– 8 –
Scheduling of IterationsScheduling of Iterations
Scheduling: assigning iterations to a thread.Scheduling: assigning iterations to a thread.
OpenMP allows scheduling strategies, such as block, OpenMP allows scheduling strategies, such as block, cyclic, etc.cyclic, etc.
– 9 –
Scheduling of Iterations: Specification

#pragma omp parallel for schedule(<sched>)

<sched> can be one of
  block (default)
  cyclic
– 10 –
Example
Multiplication of two matrices C = A x B, where the A matrix is upper-triangular (all elements below the diagonal are 0).
[Figure: matrix A with zeros below the diagonal]
– 11 –
Sequential Matrix Multiply Becomes

for( i=0; i<n; i++ )
  for( j=0; j<n; j++ ) {
    c[i][j] = 0.0;
    for( k=i; k<n; k++ )
      c[i][j] += a[i][k]*b[k][j];
  }

Load imbalance with block distribution.
– 12 –
OpenMP Matrix Multiply

#pragma omp parallel for schedule( cyclic )
for( i=0; i<n; i++ )
  for( j=0; j<n; j++ ) {
    c[i][j] = 0.0;
    for( k=i; k<n; k++ )
      c[i][j] += a[i][k]*b[k][j];
  }
– 13 –
Code Restructuring Optimizations
Private variables
Loop reordering
Loop peeling
– 14 –
General Idea
Parallelism limited by dependences.
Restructure code to eliminate or reduce dependences.
Compiler usually not able to do this; good to know how to do it by hand.
– 15 –
Example 1: Dependency on a Scalar

for( i=0; i<n; i++ ) {
  tmp = a[i];
  a[i] = b[i];
  b[i] = tmp;
}

Loop-carried dependence on tmp.
Easily fixed by privatizing tmp.
– 16 –
Fix: Scalar Privatization

f() {
  int tmp; /* local allocation on stack */
  for( i=from; i<to; i++ ) {
    tmp = a[i];
    a[i] = b[i];
    b[i] = tmp;
  }
}

Removes dependence on tmp.
– 17 –
Fix: Scalar Privatization in OpenMP

#pragma omp parallel for private( tmp )
for( i=0; i<n; i++ ) {
  tmp = a[i];
  a[i] = b[i];
  b[i] = tmp;
}

Removes dependence on tmp.
– 18 –
Example 3: Induction Variable

for( i=0, index=0; i<n; i++ ) {
  index += i;
  a[i] = b[index];
}

Dependence on index.
Can be computed from the loop variable.
– 19 –
Fix: Induction Variable Elimination

#pragma omp parallel for
for( i=0; i<n; i++ ) {
  a[i] = b[i*(i+1)/2];
}

Dependence removed by computing the induction variable.
– 20 –
Example 4: Induction Variable

for( i=0, index=0; i<n; i++ ) {
  index += f(i);
  b[i] = g(a[index]);
}

Dependence on variable index, but no formula for its value.
– 21 –
Fix: Loop Splitting

for( i=0; i<n; i++ )  /* sequential: prefix sums of f */
  index[i] = (i == 0) ? f(0) : index[i-1] + f(i);
#pragma omp parallel for
for( i=0; i<n; i++ ) {
  b[i] = g(a[index[i]]);
}

Loop splitting has removed the dependence.
– 22 –
Example 5

for( k=0; k<n; k++ )
  for( i=0; i<n; i++ )
    for( j=0; j<n; j++ )
      a[i][j] += b[i][k] + c[k][j];

Dependence on a[i][j] prevents k-loop parallelization.
No dependences carried by the i- and j-loops.
– 23 –
Example 5 Parallelization

for( k=0; k<n; k++ )
  #pragma omp parallel for
  for( i=0; i<n; i++ )
    for( j=0; j<n; j++ )
      a[i][j] += b[i][k] + c[k][j];

We can do better by reordering the loops.
– 24 –
Optimization: Loop Reordering

#pragma omp parallel for
for( i=0; i<n; i++ )
  for( j=0; j<n; j++ )
    for( k=0; k<n; k++ )
      a[i][j] += b[i][k] + c[k][j];

Larger parallel pieces of work.
– 25 –
Example 6

#pragma omp parallel for
for( i=0; i<n; i++ )
  a[i] = b[i];
#pragma omp parallel for
for( i=0; i<n; i++ )
  c[i] = b[i]*b[i];

Make two parallel loops into one.
– 26 –
Optimization: Loop Fusion

#pragma omp parallel for
for( i=0; i<n; i++ ) {
  a[i] = b[i];
  c[i] = b[i]*b[i];
}

Reduces loop startup overhead.
– 27 –
Example 7: While Loops

while( *a ) {
  process( a );
  a++;
}

The number of loop iterations is unknown.
– 28 –
Special Case of Loop Splitting

for( count=0, p=a; *p; count++, p++ );  /* count iterations */
#pragma omp parallel for
for( i=0; i<count; i++ )
  process( &a[i] );

Count the number of loop iterations.
Then parallelize the loop.
– 29 –
Example 8

for( i=0, wrap=n; i<n; i++ ) {
  b[i] = a[i] + a[wrap];
  wrap = i;
}

Dependence on wrap.
Only the first iteration causes the dependence.
– 30 –
Loop Peeling

b[0] = a[0] + a[n];
#pragma omp parallel for
for( i=1; i<n; i++ ) {
  b[i] = a[i] + a[i-1];
}
– 31 –
Example 10

for( i=0; i<n; i++ )
  a[i+m] = a[i] + b[i];

Dependence if m<n.
– 32 –
Another Case of Loop Peeling

if( m >= n ) {
  #pragma omp parallel for
  for( i=0; i<n; i++ )
    a[i+m] = a[i] + b[i];
} else {
  … cannot be parallelized
}
– 33 –
Summary
Reorganize code such that
  dependences are removed or reduced
  large pieces of parallel work emerge
  loop bounds become known
  …
Code can become messy … there is a point of diminishing returns.