OpenMP at a Glance - UT Southwestern
TRANSCRIPT
Useful reference
Book
– Peter S. Pacheco, An Introduction to Parallel Programming, 2011
– Victor Eijkhout, online version of the book "Parallel Programming in MPI and OpenMP" (http://pages.tacc.utexas.edu/~eijkhout/pcse/html/)
Online Resource
– Blaise Barney, LLNL, https://computing.llnl.gov/tutorials/openMP/
– Joel Yliluoma, https://bisqwit.iki.fi/story/howto/openmp/
Matrix Multiplication Example with OpenMP
Very limited modifications to the original code (4 lines of extra code).
Computation wall time is reduced with increasing parallelization.
[Figure: wall time vs. number of threads for C = A * B, where A and B are 3000 x 3000 matrices; measured with Intel Compiler 16.0.2 at -O3 on NucleusA040.]
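The timed code itself is not on the slide, but a minimal sketch of the parallelization it describes could look like this. The array names, the zero initialization, and the choice of which 4 lines count as "extra" (the header include, the pragma, and two omp_get_wtime() timing calls) are assumptions:

#include <stdio.h>
#include <omp.h>                    /* extra line 1: OpenMP header */

#define N 3000
static double A[N][N], B[N][N], C[N][N];

int main(void)
{
    int i, j, k;
    double t0 = omp_get_wtime();    /* extra line 2: start wall-clock timer */

    /* extra line 3: parallelize the outer loop; i is private automatically,
       j and k must be listed since they are declared outside the loop */
    #pragma omp parallel for private(j, k)
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            C[i][j] = 0.0;
            for (k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
        }

    printf("wall time: %f s\n", omp_get_wtime() - t0);  /* extra line 4 */
    return 0;
}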
What is OpenMP?
Shared memory programming: works on one node with shared memory.
BioHPC compute nodes have at least 32 cores to parallelize over.
OpenMP works with a shared memory system.
MPI works with a distributed memory system.
Image credit: Peter S. Pacheco, An Introduction to Parallel Programming, Morgan Kaufmann Publishers, 2011
What is OpenMP?
Not a new language, but an Application Programming Interface.
– Library, directives, environment variables
– Works with C, C++, Fortran
– Needs compiler support (GCC/Intel)
– Purpose: fully utilize computational resources to achieve results in a shorter time.
How does OpenMP work?
• A "fork-join" work scheme with master and slaves.

#include <omp.h>                      /* include the OpenMP library */
#pragma omp parallel                  /* the OpenMP compiler directive */
function_to_run()                     /* each thread runs the same code */
int nthreads = omp_get_num_threads(); /* each thread can get the number of threads in the team (set by OMP_NUM_THREADS or num_threads) */
int my_rank = omp_get_thread_num();   /* each thread has its own ID, called rank */

: : :  /* Serial code */  : : :
#pragma omp parallel num_threads(nthreads) [options]
{
    /* parallel region: executed by every thread in the team */
}
: : :  /* Serial code */  : : :
hello_world.c

#include <stdlib.h>
#include <stdio.h>
#include <omp.h>    /* OpenMP header library */

void Hello(void);

int main(int argc, char* argv[]) {
    int nthreads = strtol(argv[1], NULL, 10);  /* get # of threads from CLI */
#pragma omp parallel num_threads(nthreads)     /* the OpenMP directive */
    Hello();
    return 0;
}

void Hello(void) {
    int my_rank = omp_get_thread_num();    /* each slave gets its rank ID */
    int nthreads = omp_get_num_threads();  /* get # of threads in the team */
    printf("Hello from thread %d of %d\n", my_rank, nthreads);
}
hello_world.c

Compile using the GCC 4.8.5 shipped with the system (compute node/workstation):
$ gcc -o hello hello_world.c -fopenmp

Compile with the Intel compiler:
$ module load intel/16.0.2
$ icc -o hello hello_world.c -qopenmp

Other optional arguments: -O3, -Wall, or -w

To run:
$ ./hello 4
Hello from thread 0 of 4
Hello from thread 2 of 4
Hello from thread 1 of 4
Hello from thread 3 of 4

Note that the threads print in no fixed order: the scheduler decides when each thread runs, so the output order varies from run to run.
Calculate Pi with n = 32 and n = 10000000 terms

\pi = 4\left(1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \cdots\right) = 4\sum_{k=0}^{\infty} \frac{(-1)^k}{2k+1}
Scope of variable – private vs shared

Serial code version:

double sum = 0.0, factor = 1.0;
for (k = 0; k < n; k++) {
    factor = (k % 2 == 0) ? 1.0 : -1.0;
    sum += factor / (2 * k + 1);
}
sum *= 4.;

In the parallel region, each thread gets a copy of its private variables in its own stack/heap; variables declared before the region, as below, are shared by default.

double sum = 0., local_result = 0., factor = 0.;
int my_rank;
int nthreads = strtol(argv[1], NULL, 10);
#pragma omp parallel num_threads(nthreads)
{
    local_result = 0.;
    my_rank = omp_get_thread_num();
    factor = (my_rank % 2 == 0) ? 1.0 : -1.0;
    local_result = 4. * factor / (2 * my_rank + 1);
    sum += local_result;
}
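A minimal sketch of how the private clause changes the picture, reusing the variable names from the slide. Each variable listed in private gets a fresh per-thread copy, while sum stays shared (and therefore still needs the protection shown on the next slide):

double sum = 0., local_result = 0., factor = 0.;
int my_rank;

/* my_rank, factor, local_result become per-thread copies;
   sum remains shared across all threads */
#pragma omp parallel num_threads(nthreads) private(my_rank, factor, local_result)
{
    my_rank = omp_get_thread_num();
    factor = (my_rank % 2 == 0) ? 1.0 : -1.0;
    local_result = 4. * factor / (2 * my_rank + 1);
    sum += local_result;   /* still a race on the shared sum -- see next slide */
}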
critical vs reduction

Parallel threads often have local values that need to be summed/combined.
A race condition can occur when all threads write to one shared value.
The critical clause prevents the race condition, but it is not efficient.
The reduction clause is an efficient way to combine local results.

/* critical: threads perform the guarded update one at a time */
#pragma omp parallel
{
    my_rank = omp_get_thread_num();
    local_result = f(x, my_rank);
#pragma omp critical
    global_result += local_result;
}

/* reduction: each thread updates a private copy of global_result
   (initialized to 0 for +), and the copies are combined at the end */
#pragma omp parallel reduction(+:global_result)
{
    my_rank = omp_get_thread_num();
    local_result = f(x, my_rank);
    global_result += local_result;
}

Image retrieved from http://web.sfc.keio.ac.jp/~rdv/keio/sfc/teaching/architecture/computer-architecture-2018/lec13-parallel.html
parallel for clause

double sum = 0.0, factor = 1.;
int k, n;
/* factor is written on every iteration, so each thread needs its own
   copy; the loop index k is private automatically */
#pragma omp parallel for num_threads(nthreads) reduction(+:sum) private(factor)
for (k = 0; k < n; k++)
{
    factor = (k % 2 == 0) ? 1.0 : -1.0;
    sum += 4. * factor / (2 * k + 1);
}
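Putting the pieces together, a self-contained version of the Pi example might look like the sketch below; the file name pi.c, the fixed n, and the command-line handling (copied from the hello_world.c pattern) are assumptions:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char* argv[])
{
    int nthreads = strtol(argv[1], NULL, 10);  /* # of threads from CLI */
    long n = 10000000;                         /* # of series terms */
    long k;
    double factor, sum = 0.0;

    /* sum is combined via reduction; factor is per-thread scratch */
    #pragma omp parallel for num_threads(nthreads) \
            reduction(+:sum) private(factor)
    for (k = 0; k < n; k++) {
        factor = (k % 2 == 0) ? 1.0 : -1.0;
        sum += 4.0 * factor / (2 * k + 1);
    }

    printf("pi ~= %.15f with n = %ld terms\n", sum, n);
    return 0;
}

Compiled and run the same way as hello_world.c, e.g. $ gcc -o pi pi.c -fopenmp && ./pi 8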
parallel for clause

The parallel for needs a definite loop iteration description to distribute chunks:

for (i = start;  i < end | i <= end | i > end | i >= end;  i++ | ++i | i-- | --i | i += incr | i -= incr)

– Variable i must be an integer or pointer type.
– The expressions start, end, and incr must have compatible types.
– start, end, and incr must not change during execution of the loop.
– Variable i can only be modified by the increment expression in the for statement.

OpenMP's parallel for does not work with while loops; a while loop must first be rewritten into this canonical form, as sketched below.
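A hypothetical sketch of such a rewrite; process() is a placeholder for the loop body:

/* not parallelizable: OpenMP cannot see a definite iteration count */
i = 0;
while (i < n) {
    process(i);
    i++;
}

/* equivalent canonical form that parallel for accepts */
#pragma omp parallel for
for (i = 0; i < n; i++)
    process(i);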
schedule

schedule(<type> [, <chunk size>])

The schedule clause specifies how the chunks of the calculation are assigned to threads.
Benefit: align the computing with your data structure.

Image source: http://pages.tacc.utexas.edu/~eijkhout/pcse/html/omp-loop.html
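As a sketch of the syntax (the arrays, the work function f(), and the chunk size 4 are illustrative; the type can be static, dynamic, guided, auto, or runtime):

#include <omp.h>
#define N 1000
double a[N], b[N];
double f(int i) { return i * 0.5; }   /* placeholder work function */

void demo(void)
{
    int i;

    /* static: chunks of 4 iterations handed out round-robin up front;
       good when every iteration costs about the same */
    #pragma omp parallel for schedule(static, 4)
    for (i = 0; i < N; i++)
        a[i] = f(i);

    /* dynamic: a thread grabs the next chunk of 4 whenever it is idle;
       good when iteration costs vary a lot */
    #pragma omp parallel for schedule(dynamic, 4)
    for (i = 0; i < N; i++)
        b[i] = f(i);
}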
Cache and False sharing

When a shared variable (here the array y[]) is updated by all the threads in the parallel pool, false sharing can happen: with schedule(static, 1), neighboring elements of y[] are updated by different threads, and since neighbors sit on the same cache line, each write invalidates that line in the other threads' caches.

#pragma omp parallel for private(i, j) \
        schedule(static, 1)
for (i = 0; i < size; i++)
    for (j = 0; j < size; j++)
        y[i] += f(i, j);
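One common remedy (a sketch, not from the slide, reusing the slide's y, f, and size) is to accumulate into a private scalar and write each y[i] once, with a block schedule so neighboring elements belong to the same thread:

double tmp;  /* per-thread accumulator, listed as private below */
int i, j;

#pragma omp parallel for private(j, tmp) schedule(static)
for (i = 0; i < size; i++) {
    tmp = 0.0;
    for (j = 0; j < size; j++)
        tmp += f(i, j);   /* accumulate privately, off the shared array */
    y[i] = tmp;           /* a single write to y[i] per row */
}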
Parallelization performance

Speedup: how many times faster the parallel run finishes compared to the serial run.

S = \frac{T_{serial}}{T_{parallel}}

Efficiency: you will always have some overhead, so the speedup per core stays below 1.

E = \frac{S}{p} = \frac{T_{serial}}{p \, T_{parallel}}
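For example (made-up numbers for illustration): if the serial code takes T_{serial} = 80 s and p = 8 threads bring it down to T_{parallel} = 12.5 s, then

S = \frac{80}{12.5} = 6.4, \qquad E = \frac{6.4}{8} = 0.8

i.e. a 6.4x speedup at 80% efficiency; the missing 20% is parallelization overhead.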
Recapping

OpenMP is a shared memory programming API.
OpenMP works with C/C++/Fortran.
OpenMP works in a fork-join mode with master and slave threads.
Variables have scope in a parallel region; by default they are shared (except the for loop index).
Use the reduction clause to sum local values.
False sharing needs to be avoided.
Thanks for your attention!