Parallel Processing (CS 667)
Lecture 5: Shared Memory Parallel Programming with OpenMP*
Jeremy R. Johnson
Introduction
• Objective: To further study the shared memory model of parallel programming and to introduce the OpenMP standard for shared memory parallel programming
• Topics
  – OpenMP vs. Pthreads
    • hello_pthreads.c
    • hello_openmp.c
  – Parallel regions and execution model
  – Data parallelism with loops
  – Shared vs. private variables
  – Scheduling and chunk size
  – Synchronization and reduction variables
  – Functional parallelism with parallel sections
  – Case studies
OpenMP
• Extension to FORTRAN, C/C++
  – Uses directives (comments in FORTRAN, pragmas in C/C++)
    • ignored without compiler support
  – Some library support required
• Shared memory model
  – parallel regions
  – loop level parallelism
  – implicit thread model
  – communication via shared address space
  – private vs. shared variables (declaration)
  – explicit synchronization via directives (e.g. critical)
  – library routines for returning thread information (e.g. omp_get_num_threads(), omp_get_thread_num())
  – environment variables used to provide system info (e.g. OMP_NUM_THREADS; see the example below)
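For example, the thread count can be chosen from the shell at run time, with no change to the source; a usage sketch (hello is the OpenMP program built on a later slide):

% OMP_NUM_THREADS=4 ./hello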
Benefits
• Provides incremental parallelism
• Small increase in code size
• Simpler model than message passing
• Easier to use than thread library
• With hardware and compiler support, parallelism can be exploited at a smaller granularity than with message passing
Further Information
• Adopted as a standard in 1997
  – Initiated by SGI
• www.openmp.org
• computing.llnl.gov/tutorials/openMP
• Chandra, Dagum, Kohr, Maydan, McDonald, and Menon, “Parallel Programming in OpenMP,” Morgan Kaufmann Publishers, 2001.
• Chapman, Jost, and Van der Pas, “Using OpenMP: Portable Shared Memory Parallel Programming,” The MIT Press, 2008.
Shared vs. Distributed Memory
[Figure: two memory organizations. Shared memory: processors P0, P1, ..., Pn connected to a single shared memory. Distributed memory: processors P0, P1, ..., Pn, each with a local memory M0, M1, ..., Mn, connected by an interconnection network.]
Shared Memory Programming Model
• Shared memory programming does not require physically shared memory, so long as there is support for logically shared memory (in either hardware or software)
• With logically shared memory, the cost of a memory access may differ depending on the physical location of the data
• UMA - uniform memory access
  – SMP - symmetric multiprocessor
  – typically memory connected to processors via a bus
• NUMA - non-uniform memory access
  – typically physically distributed memory connected via an interconnection network
Hello_openmp.c

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char **argv)
{
  int n;
  if (argc > 1) {
    n = atoi(argv[1]);
    omp_set_num_threads(n);
  }
  /* Outside a parallel region only the master thread is running,
     so this reports 1. */
  printf("Number of threads = %d\n", omp_get_num_threads());
  #pragma omp parallel
  {
    int id = omp_get_thread_num();
    printf("Hello World from %d\n", id);
    if (id == 0)
      printf("Number of threads = %d\n", omp_get_num_threads());
  }
  exit(0);
}
Compiling & Running Hello_openmp
% gcc -fopenmp hello_openmp.c -o hello
% ./hello 4
Number of threads = 1
Hello World from 1
Hello World from 0
Hello World from 3
Number of threads = 4
Hello World from 2
The order of the print statements is nondeterministic
Execution Model
[Figure: fork-join execution model. The master thread runs alone until it reaches a parallel region, where slave threads are created implicitly (fork); master and slave threads execute the region together; an implicit barrier synchronization (join) at the end of the region leaves only the master thread running.]
Explicit Barrier

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char **argv)
{
  int n;
  if (argc > 1) {
    n = atoi(argv[1]);
    omp_set_num_threads(n);
  }
  printf("Number of threads = %d\n", omp_get_num_threads());
  #pragma omp parallel
  {
    int id = omp_get_thread_num();
    printf("Hello World from %d\n", id);
    /* No thread proceeds past the barrier until all threads reach it. */
    #pragma omp barrier
    if (id == 0)
      printf("Number of threads = %d\n", omp_get_num_threads());
  }
  exit(0);
}
Output with Barrier
% ./hellob 4
Number of threads = 1
Hello World from 1
Hello World from 0
Hello World from 2
Hello World from 3
Number of threads = 4
The order of the “Hello World” print statements is nondeterministic; however, the “Number of threads” print statement always comes at the end
Hello_pthreads.c

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <errno.h>

#define MAXTHREADS 32

int main(int argc, char **argv)
{
  int error, i, n = 1;  /* default to one thread if no argument is given */
  void hello(int *pid);
  pthread_t tid[MAXTHREADS];
  int pid[MAXTHREADS];
  if (argc > 1) {
    n = atoi(argv[1]);
    if (n > MAXTHREADS) {
      printf("Too many threads\n");
      exit(1);
    }
    pthread_setconcurrency(n);
  }
  printf("Number of threads = %d\n", pthread_getconcurrency());
  for (i = 0; i < n; i++) {
    pid[i] = i;
    error = pthread_create(&tid[i], NULL, (void *(*)(void *))hello, &pid[i]);
  }
  for (i = 0; i < n; i++) {
    error = pthread_join(tid[i], NULL);
  }
  exit(0);
}
Hello_pthreads.c
void hello(int *pid)
{
  pthread_t tid;
  tid = pthread_self();
  printf("Hello World from %d (tid = %u)\n", *pid, (unsigned int)tid);
  if (*pid == 0)
    printf("Number of threads = %d\n", pthread_getconcurrency());
}
% gcc -pthread hello_pthreads.c -o hello
% ./hello 4
Number of threads = 4
Hello World from 0 (tid = 1832728912)
Hello World from 1 (tid = 1824336208)
Number of threads = 4
Hello World from 3 (tid = 1807550800)
Hello World from 2 (tid = 1815943504)
The order of the print statements is nondeterministic
Types of Parallelism
• Data parallelism
  – Threads execute the same instructions, but on different data
• Functional parallelism
  – Threads execute different instructions; they can read the same data but should write different data

[Figure: functional parallelism. Different threads execute different functions F1, F2, F3, F4.]
Parallel Loop
// Serial version
int a[1000], b[1000];

int main()
{
  int i;
  int N = 1000;
  for (i = 0; i < N; i++) {
    a[i] = i;
    b[i] = N - i;
  }
  for (i = 0; i < N; i++) {
    a[i] = a[i] + b[i];
  }
}

// OpenMP version
int a[1000], b[1000];

int main()
{
  int i;
  int N = 1000;
  // Serial initialization
  for (i = 0; i < N; i++) {
    a[i] = i;
    b[i] = N - i;
  }
  #pragma omp parallel for shared(a,b) private(i) schedule(static)
  for (i = 0; i < N; i++) {
    a[i] = a[i] + b[i];
  }
}
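To see whether the parallel loop actually pays off, it can be timed with omp_get_wtime(); a minimal sketch, assuming an illustrative array size larger than the slide's 1000:

#include <stdio.h>
#include <omp.h>

#define N 1000000   /* illustrative size */

static int a[N], b[N];

int main(void)
{
  int i;
  for (i = 0; i < N; i++) {    /* serial initialization */
    a[i] = i;
    b[i] = N - i;
  }
  double start = omp_get_wtime();
  #pragma omp parallel for shared(a,b) private(i) schedule(static)
  for (i = 0; i < N; i++) {
    a[i] = a[i] + b[i];
  }
  printf("time = %f seconds\n", omp_get_wtime() - start);
  return 0;
}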
Scheduling of Parallel Loop
[Figure: stripmining. The iterations of the vector addition a + b are divided among threads tid = 0, 1, 2, ..., Nthreads-1, each thread computing its share of the element-wise sums.]
Implementation of Parallel Loop
void vadd(int *id)
{
  int i;
  for (i = *id; i < N; i += numthreads) {
    a[i] = a[i] + b[i];
  }
}

for (i = 0; i < numthreads; i++) {
  id[i] = i;
  error = pthread_create(&tid[i], NULL, (void *(*)(void *))vadd, &id[i]);
}
for (i = 0; i < numthreads; i++) {
  error = pthread_join(tid[i], NULL);
}
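The slide shows only fragments; a self-contained version might look like the following sketch (the globals, array size, and thread count are assumptions added to make it compile):

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

#define N 1000            /* assumed array size */
#define NUMTHREADS 4      /* assumed thread count */

int a[N], b[N];
int numthreads = NUMTHREADS;

/* Each thread adds every numthreads-th element (cyclic distribution). */
void *vadd(void *arg)
{
  int id = *(int *)arg;
  int i;
  for (i = id; i < N; i += numthreads)
    a[i] = a[i] + b[i];
  return NULL;
}

int main(void)
{
  pthread_t tid[NUMTHREADS];
  int id[NUMTHREADS];
  int i;
  for (i = 0; i < N; i++) {
    a[i] = i;
    b[i] = N - i;
  }
  for (i = 0; i < numthreads; i++) {
    id[i] = i;
    pthread_create(&tid[i], NULL, vadd, &id[i]);
  }
  for (i = 0; i < numthreads; i++)
    pthread_join(tid[i], NULL);
  printf("a[0] = %d, a[N-1] = %d (both should be %d)\n", a[0], a[N-1], N);
  return 0;
}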
Scheduling Chunks of Parallel Loop
[Figure: scheduling chunks. The iterations over a and b are grouped into chunks of size CHUNK; chunk 0, chunk 1, chunk 2, ..., chunk Nthreads-1 are dealt to threads tid = 0, 1, 2, ..., and the assignment wraps around until all chunks are done.]
Implementation of Chunking
#pragma omp parallel for shared(a,b) private(i) schedule(static,CHUNK)
for (i = 0; i < N; i++) {
  a[i] = a[i] + b[i];
}

void vadd(int *id)
{
  int i, j;
  /* pthreads equivalent; assumes N is a multiple of CHUNK */
  for (i = (*id)*CHUNK; i < N; i += numthreads*CHUNK) {
    for (j = 0; j < CHUNK; j++)
      a[i+j] = a[i+j] + b[i+j];
  }
}
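Besides static scheduling, OpenMP offers dynamic and guided schedules that hand out chunks on demand; a sketch of the three variants on the same loop (N and CHUNK as in the examples above):

/* Chunks dealt to threads round-robin, fixed in advance. */
#pragma omp parallel for schedule(static, CHUNK)
for (i = 0; i < N; i++) a[i] = a[i] + b[i];

/* Each thread grabs the next free chunk when it finishes one;
   better load balance when iteration costs vary. */
#pragma omp parallel for schedule(dynamic, CHUNK)
for (i = 0; i < N; i++) a[i] = a[i] + b[i];

/* Like dynamic, but chunk sizes start large and shrink toward CHUNK. */
#pragma omp parallel for schedule(guided, CHUNK)
for (i = 0; i < N; i++) a[i] = a[i] + b[i];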
Race Condition
int x[10000000];

int main(int argc, char **argv)
{
  int sum = 0;
  …….
  omp_set_num_threads(numcounters);
  for (i = 0; i < numcounters*limit; i++)
    x[i] = 1;
  /* Unsynchronized updates to the shared variable sum race with one
     another, so the final value of sum is unpredictable. */
  #pragma omp parallel for schedule(static) private(i) shared(sum,x)
  for (i = 0; i < numcounters*limit; i++) {
    sum = sum + x[i];
    if (i == 0)
      printf("num threads = %d\n", omp_get_num_threads());
  }
Critical Sections
int x[10000000];

int main(int argc, char **argv)
{
  int sum = 0;
  …….
  #pragma omp parallel for schedule(static) private(i) shared(sum,x)
  for (i = 0; i < numcounters*limit; i++) {
    /* only one thread at a time may execute the named critical section */
    #pragma omp critical(sum)
    sum = sum + x[i];
  }
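A critical section inside the loop serializes every update; for a single scalar update like this one, OpenMP's atomic construct is a lighter-weight alternative (a sketch, not from the slides):

#pragma omp parallel for schedule(static) private(i) shared(sum,x)
for (i = 0; i < numcounters*limit; i++) {
  /* protects just this one read-modify-write of sum */
  #pragma omp atomic
  sum = sum + x[i];
}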
Reduction Variables
int x[10000000];

int main(int argc, char **argv)
{
  int sum = 0;
  …….
  /* each thread accumulates into a private copy of sum; the copies
     are combined with + at the end of the loop */
  #pragma omp parallel for schedule(static) private(i) shared(x) reduction(+:sum)
  for (i = 0; i < numcounters*limit; i++) {
    sum = sum + x[i];
  }
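A self-contained version of the reduction example might look like this sketch (numcounters and limit are elided on the slide; the values here are illustrative):

#include <stdio.h>
#include <omp.h>

int x[10000000];

int main(void)
{
  int i, sum = 0;
  int numcounters = 4;          /* illustrative thread count */
  int limit = 1000000;          /* illustrative iteration count */
  omp_set_num_threads(numcounters);
  for (i = 0; i < numcounters*limit; i++)
    x[i] = 1;
  #pragma omp parallel for schedule(static) private(i) shared(x) reduction(+:sum)
  for (i = 0; i < numcounters*limit; i++)
    sum = sum + x[i];
  printf("sum = %d (expected %d)\n", sum, numcounters*limit);
  return 0;
}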
Reduction
[Figure: reduction. Each thread computes a partial sum of x[]; the partial sums are then combined (+) into the total sum.]
Implementing Reduction
#pragma omp parallel shared(sum,x)
{
  int i;
  int localsum = 0;
  int id;
  id = omp_get_thread_num();
  /* each thread sums its share of x into a private localsum ... */
  for (i = id; i < numcounters*limit; i += numcounters) {
    localsum = localsum + x[i];
  }
  /* ... then adds it to the shared sum inside a critical section */
  #pragma omp critical(sum)
  sum = sum + localsum;
}
Functional Parallelism Example
#define N 1000  /* array size; the slide leaves N unspecified */

int main()
{
  int i;
  double a[N], b[N], c[N], d[N];
  // Parallel function execution
  #pragma omp parallel shared(a,b,c,d) private(i)
  {
    #pragma omp sections
    {
      // one thread computes the element-wise sums ...
      #pragma omp section
      for (i = 0; i < N; i++)
        c[i] = a[i] + b[i];
      // ... while another computes the element-wise products
      #pragma omp section
      for (i = 0; i < N; i++)
        d[i] = a[i] * b[i];
    }
  }
}
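When the parallel region contains nothing but the sections, the two directives can be combined into the standard shorthand #pragma omp parallel sections (a sketch, reusing the declarations from the example above):

#pragma omp parallel sections shared(a,b,c,d) private(i)
{
  #pragma omp section
  for (i = 0; i < N; i++)
    c[i] = a[i] + b[i];
  #pragma omp section
  for (i = 0; i < N; i++)
    d[i] = a[i] * b[i];
}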