Concurrent Programming with OpenMP
Parallel and Distributed Computing
Department of Computer Science and Engineering (DEI)
Instituto Superior Técnico
October 3, 2011
CPD (DEI / IST) Parallel and Distributed Computing – 7 2011-10-3 1 / 42
Outline
Shared Memory Concurrent Programming
Review of Operating Systems: PThreads
OpenMP
Parallel Clauses
Private / Shared Variables
Shared-Memory Systems
Uniform Memory Access (UMA) architecture, also known as
Symmetric Shared-Memory Multiprocessors (SMP)
[Diagram: four processors (P), each with its own cache, sharing main memory and I/O over a common interconnect]
Fork/Join Parallelism
“Cheap” creation and termination of tasks invites incremental parallelization: the process of converting a sequential program to a parallel program a little bit at a time.
initially only master thread is active
master thread executes sequential code
Fork: master thread creates or awakens additional threads to execute parallel code
Join: at the end of the parallel code, created threads die or are suspended

[Diagram: timeline showing the master thread repeatedly forking worker threads and joining them]
Fork/Join Parallelism
read(A, B);
x = initX(A, B);
y = initY(A, B);
z = initZ(A, B);
for(i = 0; i < N_ENTRIES; i++)
x[i] = compX(y[i], z[i]);
for(i = 1; i < N_ENTRIES; i++){
x[i] = solveX(x[i-1]);
z[i] = x[i] + y[i];
}
finalize1(&x, &y, &z);
finalize2(&x, &y, &z);
finalize3(&x, &y, &z);
Processes and Threads
[Diagram: Process A contains shared code, global data, system resources, environment, and interprocess communication; threads 1 through n each have their own private data and stack. Process B is a separate process.]
POSIX Threads (PThreads): Creation
int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                   void *(*start_routine)(void *), void *arg)
Example:
pthread_t pt_worker;
void *thread_function(void *args) { /* thread code */ }
pthread_create(&pt_worker, NULL,
               thread_function, (void *) thread_args);
PThreads: Termination and Synchronization
void pthread_exit(void *value_ptr)
int pthread_join(pthread_t thread, void **value_ptr)
PThread Example: Summing the Values in Matrix Rows
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <pthread.h>
#define N 4     /* example dimensions; any positive values work */
#define SIZE 8

int buffer[N][SIZE];
void *sum_row (void *ptr){
int index = 0, sum = 0;
int *b = (int *) ptr;
while (index < SIZE - 1)
sum += b[index++]; /* sum row*/
b[index] = sum; /* store sum in last col. */
pthread_exit(NULL);
}
int main(void){
int i,j;
pthread_t tid[N];
for(i = 0; i < N; i++)
for(j = 0; j < SIZE-1; j++)
buffer[i][j] = rand()%10;
for(i = 0; i < N; i++)
if(pthread_create(&tid[i], NULL, sum_row,
(void *) &(buffer[i])) != 0){
printf("Error creating thread, id=%d\n", i);
exit(-1);
}
else
printf ("Created thread w/ id %d\n", i);
for(i = 0; i < N; i++)
pthread_join(tid[i], NULL);
printf("All threads have concluded\n");
for(i = 0; i < N; i++){
for(j = 0; j < SIZE; j++)
printf(" %d ", buffer[i][j]);
printf ("Row %d \n", i);
}
exit(0);
}
PThreads: Synchronization
int pthread_mutex_init(pthread_mutex_t *mutex,
                       const pthread_mutexattr_t *attr);
int pthread_mutex_lock(pthread_mutex_t *mutex);
int pthread_mutex_unlock(pthread_mutex_t *mutex);
Example:
pthread_mutex_t count_lock;
pthread_mutex_init(&count_lock, NULL);
pthread_mutex_lock(&count_lock);
atomic_function();
pthread_mutex_unlock(&count_lock);
PThreads: Synchronization Example
int count;

void *sum_row(void *ptr){
    int index = 0, sum = 0;
    int *b = (int *) ptr;
    while(index < SIZE - 1)
        sum += b[index++]; /* sum row */
    b[index] = sum; /* store sum in last col. */
    count++;
    pthread_exit(NULL);
}
Problem?
PThreads: Synchronization Example
int count;
pthread_mutex_t count_lock;
void *sum_row(void *ptr){
int index = 0, sum = 0;
int *b = (int *) ptr;
while(index < SIZE - 1)
sum += b[index++]; /* sum row */
b[index] = sum; /* store sum in last col. */
pthread_mutex_lock(&count_lock);
count++;
pthread_mutex_unlock(&count_lock);
pthread_exit(NULL);
}
int main(void) { /*...*/
pthread_mutex_init(&count_lock, NULL);
}
OpenMP
What is OpenMP?
Open specification for Multi-Threaded, Shared Memory Parallelism
Standard Application Programming Interface (API):
Preprocessor (compiler) directives
Library Calls
Environment Variables
More info at www.openmp.org
OpenMP vs Threads
(Supposedly) Better than threads:
Simpler programming model
Separate a program into serial and parallel regions, rather than T concurrently-executing threads
Similar to threads:
Programmer must detect dependencies
Programmer must prevent data races
Parallel Programming Recipes
Threads:
1 Start with a parallel algorithm
2 Implement, keeping in mind:
Data races
Synchronization
Threading syntax
3 Test & Debug
4 Goto step 2
OpenMP:
1 Start with some algorithm
2 Implement serially, ignoring:
Data races
Synchronization
Threading syntax
3 Test & Debug
4 Automagically parallelize with relatively few annotations that specify parallelism and synchronization
OpenMP Directives
Parallelization directives:
parallel region
parallel for
parallel sections
task
Data environment directives:
shared, private, threadprivate, reduction, etc.
Synchronization directives:
barrier, critical
C / C++ Directives Format
#pragma omp directive-name [clause,...] \n
Case sensitive
Long directive lines may be continued on succeeding lines by escaping the newline character with a “\” at the end of the directive line
Always apply to the next statement, which must be a structured block. Examples:
#pragma omp ...
statement

#pragma omp ...
{ statement1; statement2; statement3; }
Parallel Region
#pragma omp parallel [clauses]
Creates N parallel threads
All execute subsequent block
All wait for each other at the end of executing the block
Barrier synchronization
How Many Threads?
The number of threads created is determined by, in order of precedence:
Use of the omp_set_num_threads() library function
Setting of the OMP_NUM_THREADS environment variable
Implementation default - usually the number of CPUs
Possible to query number of CPUs:
int omp_get_num_procs (void)
Parallel Region Example
int main() {
    printf("Serial Region 1\n");
    omp_set_num_threads(4);
    #pragma omp parallel
    {
        printf("Parallel Region\n");
    }
    printf("Serial Region 2\n");
}
Output?
Thread Count and Id API
#include <omp.h>
int omp_get_thread_num()
int omp_get_num_threads()
void omp_set_num_threads(int num)
Example:
#pragma omp parallel
{
    if( !omp_get_thread_num() )
        master();
    else
        slave();
}
Work Sharing Directives
Always occur within a parallel region
Divide the execution of the enclosed code region among the membersof the team
Do not create new threads
Two main directives are
parallel for
parallel sections
Parallel For
#pragma omp parallel
#pragma omp for [clauses]
for( ; ; ) { ... }
Each thread executes a subset of the iterations
All threads synchronize at the end of parallel for
Restrictions
No data dependencies between iterations
Program correctness must not depend upon which thread executes aparticular iteration
Paradigm of Data Parallelism.
Handy Shortcut
#pragma omp parallel
#pragma omp for
for ( ; ; ) { ... }
is equivalent to
#pragma omp parallel for
for ( ; ; ) { ... }
PThread Example Revisited
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <pthread.h>
#define N 4     /* example dimensions; any positive values work */
#define SIZE 8

int buffer[N][SIZE];
void *sum_row (void *ptr){
int index = 0, sum = 0;
int *b = (int *) ptr;
while (index < SIZE - 1)
sum += b[index++]; /* sum row*/
b[index] = sum; /* store sum in last col. */
pthread_exit(NULL);
}
int main(void){
int i,j;
pthread_t tid[N];
for(i = 0; i < N; i++)
for(j = 0; j < SIZE-1; j++)
buffer[i][j] = rand()%10;
for(i = 0; i < N; i++)
if(pthread_create(&tid[i], 0, sum_row,
(void *) &(buffer[i])) != 0){
printf("Error creating thread, id=%d\n", i);
exit(-1);
}
else
printf ("Created thread w/ id %d\n", i);
for(i = 0; i < N; i++)
pthread_join(tid[i], NULL);
printf("All threads have concluded\n");
for(i = 0; i < N; i++){
for(j = 0; j < SIZE; j++)
printf(" %d ", buffer[i][j]);
printf ("Row %d \n", i);
}
exit(0);
}
PThread Example Revisited
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <omp.h>
#define N 4     /* example dimensions; any positive values work */
#define SIZE 8

int buffer[N][SIZE];

void sum_row (void *ptr){
int index = 0, sum = 0;
int *b = (int *) ptr;
while (index < SIZE - 1)
sum += b[index++]; /* sum row*/
b[index] = sum; /* store sum in last col. */
}
int main(void){
int i,j;
for(i = 0; i < N; i++)
for(j = 0; j < SIZE-1; j++)
buffer[i][j] = rand()%10;
#pragma omp parallel for
for(i = 0; i < N; i++)
sum_row(buffer[i]);
printf("All threads have concluded\n");
for(i = 0; i < N; i++){
for(j = 0; j < SIZE; j++)
printf(" %d ", buffer[i][j]);
printf ("Row %d \n", i);
}
exit(0);
}
Multiple Work Sharing Directives
May occur within the same parallel region:
#pragma omp parallel
{
    #pragma omp for
    for( ; ; ) { ... }

    #pragma omp for
    for( ; ; ) { ... }
}
Implicit barrier at the end of each for.
Parallel Sections
Functional Parallelism: several blocks are executed in parallel
#pragma omp parallel
{
#pragma omp sections
{
#pragma omp section
{ a=...;
b=...; }
#pragma omp section /* <- delimiter! */
{ c=...;
d=...; }
#pragma omp section
{ e=...;
f=...; }
#pragma omp section
{ g=...;
h=...; }
} /*omp end sections*/
} /*omp end parallel*/
Handy Shortcut
#pragma omp parallel
#pragma omp sections
{ ... }
is equivalent to
#pragma omp parallel sections
{ ... }
OpenMP Memory Model
Concurrent programs access two types of data
Shared data, visible to all threads
Private data, visible to a single thread (often stack-allocated)
Threads:
Global variables are shared
Local variables are private
OpenMP:
All variables are by default shared.
Some exceptions:
the loop variable of a parallel for is private
stack (local) variables in called subroutines are private
By using data directives, some variables can be made private or given other special characteristics.
Private Variables
#pragma omp parallel for private( list )
Makes a private copy for each thread for each variable in the list.
No storage association with original object
All references are to the local object
Values are undefined on entry and exit
Also applies to other region and work-sharing directives.
Shared Variables
#pragma omp parallel for shared ( list )
Similarly, there is a shared data directive.
Shared variables exist in a single location and all threads can read and write them
It is the programmer’s responsibility to ensure that multiple threads properly access shared variables (will discuss synchronization next)
Example PThread vs OpenMP
PThreads:

// shared, globals
int n, *x, *y;

void loop() {
    int i; // private, stack
    for(i = 0; i < n; i++)
        x[i] += y[i];
}

OpenMP:

#pragma omp parallel \
    shared(n,x,y) private(i)
{
    #pragma omp for
    for(i = 0; i < n; i++)
        x[i] += y[i];
}
Example PThread vs OpenMP
PThreads:

// shared, globals
int n, *x, *y;

void loop() {
    int i; // private, stack
    for(i = 0; i < n; i++)
        x[i] += y[i];
}

OpenMP:

#pragma omp parallel for
for(i = 0; i < n; i++)
    x[i] += y[i];
Example of private Clause
for(i = 0; i < n; i++)
    for(j = 0; j < n; j++)
        a[i][j] = b[i][j] + c[i][j];
Make outer loop parallel, to reduce number of forks/joins.
Give each thread its own private copy of variable j.
#pragma omp parallel for private(j)
for(i = 0; i < n; i++)
    for(j = 0; j < n; j++)
        a[i][j] = b[i][j] + c[i][j];
firstprivate / lastprivate Clauses
As mentioned, values of private variables are undefined on entry and exit.
⇒ A private variable within a region has no storage association with thesame variable outside of the region
firstprivate (list)
Variables in list are initialized with the value the original variable had before entering the parallel construct

lastprivate (list)
The thread that executes the sequentially last iteration or section updates the value of the variables in list
Example of firstprivate / lastprivate Clauses
int main()
{
a = 1;
#pragma omp parallel for private(i), firstprivate(a), lastprivate(b)
for (i = 0; i < n; i++) {
...
b = a + i; /* a undefined, unless declared firstprivate */
...
}
a = b; /* b undefined, unless declared lastprivate */
}
threadprivate Variables
Private variables are private on a parallel region basis.
threadprivate variables are global variables that are private throughout the execution of the program.
#pragma omp threadprivate(x)
Initial data is undefined, unless copyin is used
copyin (list)
data of the master thread is copied to the threadprivate copies
Example of threadprivate Clause
#include <omp.h>
int a, b, i, tid;
float x;
#pragma omp threadprivate(a, x)
int main () {
printf("1st Parallel Region:\n");
#pragma omp parallel private(b,tid)
{
tid = omp_get_thread_num();
a = tid;
b = tid;
x = 1.1 * tid +1.0;
printf("Thread %d: a,b,x= %d %d %f\n",tid,a,b,x);
} /* end of parallel section */
printf("2nd Parallel Region:\n");
#pragma omp parallel private(tid)
{
tid = omp_get_thread_num();
printf("Thread %d: a,b,x = %d %d %f\n",tid,a,b,x);
} /* end of parallel section */
}
Review
Shared Memory Concurrent Programming
Review of Operating Systems: PThreads
OpenMP
Parallel Clauses
Private / Shared Variables