Parallel Programming in C with MPI and OpenMP
Michael J. Quinn
Chapter 5
The Sieve of Eratosthenes
Chapter Objectives
- Analysis of block allocation schemes
- Function MPI_Bcast
- Performance enhancements
Outline
- Sequential algorithm
- Sources of parallelism
- Data decomposition options
- Parallel algorithm development, analysis
- MPI program
- Benchmarking
- Optimizations
Sequential Algorithm
(Figure: the integers 2 through 61 in a grid. Each sieving pass highlights the current prime and marks its remaining composite multiples: 2 marks the even numbers, 3 marks 9, 15, 21, ..., 5 marks 25, 35, 55, and 7 marks 49. The numbers left unmarked are the primes.)
Complexity: Θ(n ln ln n)
Pseudocode

1. Create list of unmarked natural numbers 2, 3, …, n
2. k ← 2
3. Repeat
   (a) Mark all multiples of k between k² and n
   (b) k ← smallest unmarked number > k
   until k² > n
4. The unmarked numbers are primes
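To make the pseudocode concrete, here is a minimal sequential implementation in C (a sketch for illustration, not the book's listing):

```c
#include <stdio.h>
#include <stdlib.h>

/* Sequential Sieve of Eratosthenes: count the primes <= n. */
int main (int argc, char *argv[])
{
   int n = (argc > 1) ? atoi (argv[1]) : 100;
   char *marked = (char *) calloc (n + 1, 1);  /* 0 = unmarked */
   int count = 0;

   for (int k = 2; k * k <= n; k++)            /* step 3: repeat until k^2 > n  */
      if (!marked[k])                          /* k is the next unmarked number */
         for (int j = k * k; j <= n; j += k)   /* step 3(a): mark multiples     */
            marked[j] = 1;

   for (int i = 2; i <= n; i++)                /* step 4: unmarked => prime     */
      if (!marked[i]) count++;
   printf ("%d primes are less than or equal to %d\n", count, n);
   free (marked);
   return 0;
}
```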
Sources of Parallelism
- Domain decomposition
  - Divide data into pieces
  - Associate computational steps with data
- One primitive task per array element
Making 3(a) Parallel
Mark all multiples of k between k² and n:

for all j where k² ≤ j ≤ n do
   if j mod k = 0 then
      mark j (it is not a prime)
   endif
endfor
Making 3(b) Parallel

Find smallest unmarked number > k:

- Min-reduction (to find smallest unmarked number > k)
- Broadcast (to get result to all tasks)
Agglomeration Goals
- Consolidate tasks
- Reduce communication cost
- Balance computations among processes
Data Decomposition Options
- Interleaved (cyclic)
  - Easy to determine “owner” of each index
  - Leads to load imbalance for this problem
- Block
  - Balances loads
  - More complicated to determine owner if n is not a multiple of p
Block Decomposition Options
- Want to balance workload when n is not a multiple of p
- Each process gets either ⌈n/p⌉ or ⌊n/p⌋ elements
- Seek simple expressions
  - Find low, high indices given an owner
  - Find owner given an index
Method #1
- Let r = n mod p
- If r = 0, all blocks have the same size
- Else
  - First r blocks have size ⌈n/p⌉
  - Remaining p - r blocks have size ⌊n/p⌋
Examples

- 17 elements divided among 7 processes: block sizes 3, 3, 3, 2, 2, 2, 2
- 17 elements divided among 5 processes: block sizes 4, 4, 3, 3, 3
- 17 elements divided among 3 processes: block sizes 6, 6, 5
Method #1 Calculations
- First element controlled by process i: i⌊n/p⌋ + min(i, r)
- Last element controlled by process i: (i + 1)⌊n/p⌋ + min(i + 1, r) - 1
- Process controlling element j: max(⌊j / (⌊n/p⌋ + 1)⌋, ⌊(j - r) / ⌊n/p⌋⌋)
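To make these formulas concrete, here is a small C sketch of Method #1 (the helper names are ours, introduced only for illustration; they assume n ≥ p > 0):

```c
/* Method #1: the first r = n mod p blocks get one extra element. */
int method1_low (int i, int p, int n)      /* first element of process i */
{
   int r = n % p;
   return i * (n / p) + (i < r ? i : r);   /* i*floor(n/p) + min(i,r) */
}

int method1_high (int i, int p, int n)     /* last element of process i */
{
   return method1_low (i + 1, p, n) - 1;
}

int method1_owner (int j, int p, int n)    /* process controlling element j */
{
   int r = n % p;
   int a = j / (n / p + 1);                /* exact inside the first r blocks    */
   int b = (j - r) / (n / p);              /* exact inside the remaining blocks  */
   return a > b ? a : b;                   /* max of the two estimates           */
}
```

For n = 17 and p = 5, for example, method1_low yields 0, 4, 8, 11, 14, matching the block sizes 4, 4, 3, 3, 3 above.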
Method #2
- Scatters larger blocks among processes
- First element controlled by process i: ⌊in/p⌋
- Last element controlled by process i: ⌊(i + 1)n/p⌋ - 1
- Process controlling element j: ⌊(p(j + 1) - 1)/n⌋
Examples

- 17 elements divided among 7 processes: block sizes 2, 2, 3, 2, 3, 2, 3
- 17 elements divided among 5 processes: block sizes 3, 3, 4, 3, 4
- 17 elements divided among 3 processes: block sizes 5, 6, 6
Comparing Methods
| Operation | Method 1 | Method 2 |
| --- | --- | --- |
| Low index | 4 | 2 |
| High index | 6 | 4 |
| Owner | 7 | 4 |

Operation counts assume the “floor” function costs no operations. Method 2, needing fewer operations for every expression, is our choice.
Pop Quiz
Illustrate how block decomposition method #2 would divide 13 elements among 5 processes.

⌊13·0/5⌋ = 0
⌊13·1/5⌋ = 2
⌊13·2/5⌋ = 5
⌊13·3/5⌋ = 7
⌊13·4/5⌋ = 10

The blocks therefore start at elements 0, 2, 5, 7, and 10, giving sizes 2, 3, 2, 3, 3.
Block Decomposition Macros

```c
#define BLOCK_LOW(id,p,n)      ((id)*(n)/(p))
#define BLOCK_HIGH(id,p,n)     (BLOCK_LOW((id)+1,p,n)-1)
#define BLOCK_SIZE(id,p,n)     (BLOCK_LOW((id)+1,p,n)-BLOCK_LOW(id,p,n))
#define BLOCK_OWNER(index,p,n) (((p)*((index)+1)-1)/(n))
```
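A quick way to check the macros is to print the decomposition they produce. This standalone sketch (the macros are repeated so it compiles on its own) reproduces the pop-quiz split of 13 elements among 5 processes:

```c
#include <stdio.h>

#define BLOCK_LOW(id,p,n)   ((id)*(n)/(p))
#define BLOCK_HIGH(id,p,n)  (BLOCK_LOW((id)+1,p,n)-1)
#define BLOCK_SIZE(id,p,n)  (BLOCK_LOW((id)+1,p,n)-BLOCK_LOW(id,p,n))

int main (void)
{
   int p = 5, n = 13;
   /* Prints blocks 0..1, 2..4, 5..6, 7..9, 10..12: sizes 2, 3, 2, 3, 3. */
   for (int id = 0; id < p; id++)
      printf ("process %d: elements %d..%d (%d elements)\n", id,
              BLOCK_LOW(id,p,n), BLOCK_HIGH(id,p,n), BLOCK_SIZE(id,p,n));
   return 0;
}
```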
Local vs. Global Indices
(Figure: the 13-element, 5-process decomposition from the pop quiz. Each process numbers its block with local indices L = 0, 1, ..., while the global indices G run 0 through 12: process 0 holds G 0-1, process 1 holds G 2-4, process 2 holds G 5-6, process 3 holds G 7-9, and process 4 holds G 10-12.)
Looping over Elements
Sequential program:

```c
for (i = 0; i < n; i++) {
   ...
}
```

Parallel program:

```c
size = BLOCK_SIZE (id,p,n);
for (i = 0; i < size; i++) {
   gi = i + BLOCK_LOW(id,p,n);
}
```

The loop index i on each process is local; gi recovers the global index, taking the place of the sequential program's loop index.
Decomposition Affects Implementation
- Largest prime used to sieve is √n
- First process has ⌊n/p⌋ elements
- It has all sieving primes if p < √n
- First process always broadcasts next sieving prime
- No reduction step needed
Fast Marking
Block decomposition allows the same marking as the sequential algorithm:

   j, j + k, j + 2k, j + 3k, …

instead of

   for all j in block:
      if j mod k = 0 then mark j (it is not a prime)
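A sketch of this fast marking for one block (our helper function, mirroring the logic that appears in Code (3/4) below; low_value is the first integer represented in the block and size is the block length):

```c
/* Mark the multiples of sieving prime k within one block of the sieve. */
void mark_block (char *marked, long low_value, long size, long k)
{
   long first;                             /* local index of first multiple */
   if (k * k > low_value)
      first = k * k - low_value;           /* block starts below k^2        */
   else if (low_value % k == 0)
      first = 0;                           /* block starts on a multiple    */
   else
      first = k - low_value % k;           /* round up to the next multiple */

   for (long i = first; i < size; i += k)  /* j, j+k, j+2k, ... locally     */
      marked[i] = 1;
}
```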
Parallel Algorithm Development

1. Create list of unmarked natural numbers 2, 3, …, n (each process creates its share of the list)
2. k ← 2 (each process does this)
3. Repeat
   (a) Mark all multiples of k between k² and n (each process marks its share of the list)
   (b) k ← smallest unmarked number > k (process 0 only)
   (c) Process 0 broadcasts k to rest of processes
   until k² > n
4. The unmarked numbers are primes
5. Reduction to determine number of primes
Function MPI_Bcast

```c
int MPI_Bcast (
   void        *buffer,    /* Addr of 1st element     */
   int          count,     /* # elements to broadcast */
   MPI_Datatype datatype,  /* Type of elements        */
   int          root,      /* ID of root process      */
   MPI_Comm     comm)      /* Communicator            */
```

Example call:

```c
MPI_Bcast (&k, 1, MPI_INT, 0, MPI_COMM_WORLD);
```
Task/Channel Graph

(Figure omitted: the task/channel graph for the parallel sieve.)
Analysis
- χ is the time needed to mark a cell
- Sequential execution time: χ n ln ln n
- Number of broadcasts: √n / ln √n
- Broadcast time: λ ⌈log p⌉, where λ is the message latency
- Expected execution time: χ n ln ln n / p + (√n / ln √n) λ ⌈log p⌉
Code (1/4)

```c
#include <mpi.h>
#include <math.h>
#include <stdio.h>
#include "MyMPI.h"
#define MIN(a,b) ((a)<(b)?(a):(b))

int main (int argc, char *argv[])
{
   ...
   MPI_Init (&argc, &argv);
   MPI_Barrier (MPI_COMM_WORLD);
   elapsed_time = -MPI_Wtime();
   MPI_Comm_rank (MPI_COMM_WORLD, &id);
   MPI_Comm_size (MPI_COMM_WORLD, &p);
   if (argc != 2) {
      if (!id) printf ("Command line: %s <m>\n", argv[0]);
      MPI_Finalize();
      exit (1);
   }
```
Code (2/4)

```c
   n = atoi (argv[1]);
   low_value  = 2 + BLOCK_LOW(id,p,n-1);
   high_value = 2 + BLOCK_HIGH(id,p,n-1);
   size = BLOCK_SIZE(id,p,n-1);
   proc0_size = (n-1)/p;
   if ((2 + proc0_size) < (int) sqrt((double) n)) {
      if (!id) printf ("Too many processes\n");
      MPI_Finalize();
      exit (1);
   }

   marked = (char *) malloc (size);
   if (marked == NULL) {
      printf ("Cannot allocate enough memory\n");
      MPI_Finalize();
      exit (1);
   }
```
Code (3/4)

```c
   for (i = 0; i < size; i++) marked[i] = 0;
   if (!id) index = 0;
   prime = 2;
   do {
      if (prime * prime > low_value)
         first = prime * prime - low_value;
      else {
         if (!(low_value % prime)) first = 0;
         else first = prime - (low_value % prime);
      }
      for (i = first; i < size; i += prime) marked[i] = 1;
      if (!id) {
         while (marked[++index]);
         prime = index + 2;
      }
      MPI_Bcast (&prime, 1, MPI_INT, 0, MPI_COMM_WORLD);
   } while (prime * prime <= n);
```
Code (4/4)

```c
   count = 0;
   for (i = 0; i < size; i++)
      if (!marked[i]) count++;
   MPI_Reduce (&count, &global_count, 1, MPI_INT, MPI_SUM,
               0, MPI_COMM_WORLD);
   elapsed_time += MPI_Wtime();
   if (!id) {
      printf ("%d primes are less than or equal to %d\n",
              global_count, n);
      printf ("Total elapsed time: %10.6f\n", elapsed_time);
   }
   MPI_Finalize ();
   return 0;
}
```
Benchmarking
- Execute sequential algorithm
- Determine χ = 85.47 nanosec
- Execute series of broadcasts
- Determine λ = 250 μsec
Execution Times (sec)
| Processors | Predicted (sec) | Actual (sec) |
| --- | --- | --- |
| 1 | 24.900 | 24.900 |
| 2 | 12.721 | 13.011 |
| 3 | 8.843 | 9.039 |
| 4 | 6.768 | 7.055 |
| 5 | 5.794 | 5.993 |
| 6 | 4.964 | 5.159 |
| 7 | 4.371 | 4.687 |
| 8 | 3.927 | 4.222 |
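As a check on the model, here is one worked instance (taking n = 10⁸, the problem size consistent with the 24.900 sec sequential time; that value is our inference, not stated on the slide). For p = 2:

χ n ln ln n / p = 24.900 / 2 = 12.450 sec
(√n / ln √n) λ ⌈log p⌉ ≈ (10⁴ / 9.21) × 250 μsec ≈ 0.271 sec

for a predicted total of about 12.721 sec, matching the table entry.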
Improvements
- Delete even integers (see the sketch after this list)
  - Cuts number of computations in half
  - Frees storage for larger values of n
- Each process finds own sieving primes
  - Replicates computation of primes up to √n
  - Eliminates broadcast step
- Reorganize loops
  - Increases cache hit rate
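For the first improvement, one possible index mapping (an illustrative sketch, not necessarily the book's exact code) stores only the odd values, with cell i representing the integer 2i + 3:

```c
#include <stdio.h>
#include <stdlib.h>

/* Sieve over odd values only: marked[i] represents 2*i + 3.
   Halves both the marking work and the storage. */
int main (void)
{
   int n = 100;
   int size = (n - 1) / 2;        /* cells for the odd values 3, 5, ..., n  */
   char *marked = (char *) calloc (size, 1);
   int count = (n >= 2);          /* the even prime 2 is counted separately */

   for (int k = 3; k * k <= n; k += 2)
      if (!marked[(k - 3) / 2])
         for (int j = k * k; j <= n; j += 2 * k)  /* odd multiples only */
            marked[(j - 3) / 2] = 1;

   for (int i = 0; i < size; i++)
      if (!marked[i]) count++;
   printf ("%d primes are less than or equal to %d\n", count, n);  /* 25 */
   free (marked);
   return 0;
}
```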
Reorganize Loops
(Figure: two loop orders compared. Sweeping the entire array once per sieving prime gives the lower cache hit rate; sweeping each cache-sized section with all sieving primes before moving on gives the higher one, as sketched below.)
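A hedged sketch of that reorganization (our formulation; marked, size, low_value, and a locally computed list primes[0..num_primes-1] of sieving primes are assumed from the surrounding program):

```c
#define CHUNK 16384                   /* tune to the data cache size */

/* Visit the block in cache-sized chunks; apply every sieving prime
   to a chunk while it is resident instead of re-reading the whole
   array once per prime. */
for (long base = 0; base < size; base += CHUNK) {
   long limit = (base + CHUNK < size) ? base + CHUNK : size;
   for (int t = 0; t < num_primes; t++) {
      long k = primes[t];
      long v = low_value + base;                  /* first value in chunk  */
      long start = (v % k) ? v + k - v % k : v;   /* next multiple of k    */
      if (start < k * k) start = k * k;           /* sieving begins at k^2 */
      for (long i = start - low_value; i < limit; i += k)
         marked[i] = 1;
   }
}
```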
Comparing 4 Versions

Execution times in seconds:

| Procs | Sieve 1 | Sieve 2 | Sieve 3 | Sieve 4 |
| --- | --- | --- | --- | --- |
| 1 | 24.900 | 12.237 | 12.466 | 2.543 |
| 2 | 12.721 | 6.609 | 6.378 | 1.330 |
| 3 | 8.843 | 5.019 | 4.272 | 0.901 |
| 4 | 6.768 | 4.072 | 3.201 | 0.679 |
| 5 | 5.794 | 3.652 | 2.559 | 0.543 |
| 6 | 4.964 | 3.270 | 2.127 | 0.456 |
| 7 | 4.371 | 3.059 | 1.820 | 0.391 |
| 8 | 3.927 | 2.856 | 1.585 | 0.342 |
The optimizations alone give roughly a 10-fold improvement on one processor (24.900 sec for Sieve 1 vs. 2.543 sec for Sieve 4); running Sieve 4 on 8 processes adds roughly a 7-fold speedup (2.543 sec down to 0.342 sec).
Summary
- Sieve of Eratosthenes: parallel design uses domain decomposition
- Compared two block distributions; chose the one with simpler formulas
- Introduced MPI_Bcast
- Optimizations reveal the importance of maximizing single-processor performance