parallel programming in c with mpi and openmp
DESCRIPTION
Parallel Programming in C with MPI and OpenMP. Michael J. Quinn. Chapter 11. Matrix Multiplication. Outline. Sequential algorithms Iterative, row-oriented Recursive, block-oriented Parallel algorithms Rowwise block striped decomposition Cannon’s algorithm. . =. - PowerPoint PPT PresentationTRANSCRIPT
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
Parallel Programmingin C with MPI and OpenMP
Michael J. QuinnMichael J. Quinn
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
Chapter 11
Matrix MultiplicationMatrix Multiplication
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
Outline
Sequential algorithmsSequential algorithms Iterative, row-orientedIterative, row-oriented Recursive, block-orientedRecursive, block-oriented
Parallel algorithmsParallel algorithms Rowwise block striped decompositionRowwise block striped decomposition Cannon’s algorithmCannon’s algorithm
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
=
Iterative, Row-oriented AlgorithmSeries of inner product (dot product) operations
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
Performance as n Increases
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
Reason:Matrix B Gets Too Big for Cache
Computing a row of C requiresaccessing every element of B
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
Block Matrix Multiplication
=
Replace scalar multiplicationwith matrix multiplication
Replace scalar addition with matrix addition
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
Recurse Until B Small Enough
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
Comparing Sequential Performance
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
First Parallel Algorithm
PartitioningPartitioning Divide matrices into rowsDivide matrices into rows Each primitive task has corresponding Each primitive task has corresponding
rows of three matrices rows of three matrices CommunicationCommunication
Each task must eventually see every row Each task must eventually see every row of Bof B
Organize tasks into a ringOrganize tasks into a ring
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
First Parallel Algorithm (cont.)
Agglomeration and mappingAgglomeration and mapping Fixed number of tasks, each requiring Fixed number of tasks, each requiring
same amount of computationsame amount of computation Regular communication among tasksRegular communication among tasks Strategy: Assign each process a Strategy: Assign each process a
contiguous group of rowscontiguous group of rows
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
Communication of B
A
A B CA
A B C
A
A B CA
A B C
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
Communication of B
A
A B CA
A B C
A
A B CA
A B C
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
Communication of B
A
A B CA
A B C
A
A B CA
A B C
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
Communication of B
A
A B CA
A B C
A
A B CA
A B C
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
Complexity Analysis
Algorithm has Algorithm has pp iterations iterations During each iteration a process multipliesDuring each iteration a process multiplies
((n n / / pp) ) ( (nn / / pp) block of A by () block of A by (n n / / pp) ) nn block of B: block of B: ((nn33 / / pp22))
Total computation time: Total computation time: ((nn33 / / pp)) Each process ends up passingEach process ends up passing
((pp-1)-1)nn22//p = p = ((nn22) elements of B) elements of B
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
Isoefficiency Analysis
Sequential algorithm: Sequential algorithm: ((nn33)) Parallel overhead: Parallel overhead: ((pnpn22))
Isoefficiency relation: Isoefficiency relation: nn33 CpnCpn22 n n CpCp
This system does not have good scalabilityThis system does not have good scalability
pCppCpCpM 222 //)(
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
Weakness of Algorithm 1
Blocks of B being manipulated have Blocks of B being manipulated have pp times times more columns than rowsmore columns than rows
Each process must access every element of Each process must access every element of matrix Bmatrix B
Ratio of computations per communication Ratio of computations per communication is poor: onlyis poor: only 2n / 2n / pp
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
Parallel Algorithm 2(Cannon’s Algorithm)
Associate a primitive task with each matrix Associate a primitive task with each matrix elementelement
Agglomerate tasks responsible for a square Agglomerate tasks responsible for a square (or nearly square) block of C(or nearly square) block of C
Computation-to-communication ratio rises Computation-to-communication ratio rises to to nn / / pp
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
Elements of A and B Needed to Compute a Process’s Portion of C
Algorithm 1
Cannon’sAlgorithm
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
Blocks Must Be Aligned
Before After
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
Blocks Need to Be Aligned
A00
B00
A01
B01
A02
B02
A03
B03
A10
B10
A11
B11
A12
B12
A13
B13
A20
B20
A21
B21
A22
B22
A23
B23
A30
B30
A31
B31
A32
B32
A33
B33
Each trianglerepresents a matrix block
Only same-colortriangles shouldbe multiplied
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
Rearrange Blocks
A00
B00
A01
B01
A02
B02
A03
B03
A10
B10
A11
B11
A12
B12
A13
B13
A20
B20
A21
B21
A22
B22
A23
B23
A30
B30
A31
B31
A32
B32
A33
B33
Block Aij cyclesleft i positions
Block Bij cyclesup j positions
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
Consider Process P1,2
B02
A10A11 A12
B12
A13
B22
B32 Step 1
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
Consider Process P1,2
B12
A11A12 A13
B22
A10
B32
B02 Step 2
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
Consider Process P1,2
B22
A12A13 A10
B32
A11
B02
B12 Step 3
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
Consider Process P1,2
B32
A13A10 A11
B02
A12
B12
B22 Step 4
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
Complexity Analysis
Algorithm has Algorithm has pp iterations iterations During each iteration process multiplies two During each iteration process multiplies two
((n n / / pp ) ) ( (nn / / pp ) matrices: ) matrices: ((nn3 3 / / p p 3/23/2)) Computational complexity: Computational complexity: ((nn3 3 / / pp)) During each iteration process sends and During each iteration process sends and
receives two blocks of size receives two blocks of size ((n n / / pp ) ) ( (nn / / pp ) )
Communication complexity: Communication complexity: ((nn22/ / pp))
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
Isoefficiency Analysis
Sequential algorithm: Sequential algorithm: ((nn33)) Parallel overhead: Parallel overhead: ((pnpn22))
Isoefficiency relation:Isoefficiency relation: n n33 C C pn pn22 n n C C p p
This system is highly scalableThis system is highly scalable
22 //)( CppCppCM
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
Summary Considered two sequential algorithmsConsidered two sequential algorithms
Iterative, row-oriented algorithmIterative, row-oriented algorithm Recursive, block-oriented algorithmRecursive, block-oriented algorithm Second has better cache hit rate as Second has better cache hit rate as nn increases increases
Developed two parallel algorithmsDeveloped two parallel algorithms First based on rowwise block striped decompositionFirst based on rowwise block striped decomposition Second based on checkerboard block decompositionSecond based on checkerboard block decomposition Second algorithm is scalable, while first is notSecond algorithm is scalable, while first is not