dense matrix algorithms cs 524 – high-performance computing
Post on 21-Dec-2015
218 views
TRANSCRIPT
Dense Matrix Algorithms
CS 524 – High-Performance Computing
CS 524 (Wi 2003/04)- Asim Karim @ LUMS 2
Definitions
p = number of processors (0 to p-1) n = dimension of array/matrix (0 to n-1) q = number of blocks along one dimension (0 to q-1) tc = computation time for one flop
ts = communication startup time
tw = communication transfer time per word
Interconnection network: crossbar switch with bi-directional links
CS 524 (Wi 2003/04)- Asim Karim @ LUMS 3
Uniform Striped Partitioning
CS 524 (Wi 2003/04)- Asim Karim @ LUMS 4
Checkerboard Partitioning
CS 524 (Wi 2003/04)- Asim Karim @ LUMS 5
Matrix Transpose (MT)
AT(i, j) = A(j, i) for all I and j Sequential run-time
do i = 0, n-1
do j = 0, n-1
B(i, j) = A(j, i)
end do
end do Run time is (n2 – n)/2 or n2/2
CS 524 (Wi 2003/04)- Asim Karim @ LUMS 6
MT - Checkerboard Partitioning (1)
CS 524 (Wi 2003/04)- Asim Karim @ LUMS 7
MT – Checkerboard Partitioning (2)
CS 524 (Wi 2003/04)- Asim Karim @ LUMS 8
MT – Striped Partitioning
CS 524 (Wi 2003/04)- Asim Karim @ LUMS 9
Matrix-Vector Multiplication (MVM)
MVM: y = Axdo i = 0, n-1
do j = 0, n-1
y(i) = y(i) + A(i, j)*x(j)
end do
end do
Sequential algorithm requires n2 multiplications and additions Assuming one flop takes tc time, sequential run time is 2tcn2
CS 524 (Wi 2003/04)- Asim Karim @ LUMS 10
Row-wise Striping – p = n (1)
CS 524 (Wi 2003/04)- Asim Karim @ LUMS 11
Row-wise Striping – p = n (2)
Data partitioning: Pi has row i of A and element i of x
Communication: Each processor broadcasts its element of x
Computation: Each processor perform n additions and multiplications
Parallel run time: Tp = 2ntc + p(ts + tw) = 2ntc + n(ts + tw)
Algorithm is cost-optimal as both parallel and serial cost is O(n2)
CS 524 (Wi 2003/04)- Asim Karim @ LUMS 12
Row-wise Striping – p < n
Data partitioning: Each processor has n/p rows of A and corresponding n/p elements of x
Communication: Each processor broadcasts its elements of x
Computation: Each processor perform n2/p additions and multiplications
Parallel run time: Tp = 2tcn2/p+ p[ts + (n/p)tw]
Algorithm is cost-optimal for p = O(n)
CS 524 (Wi 2003/04)- Asim Karim @ LUMS 13
Checkerboard Partitioning – p = n2 (1)
CS 524 (Wi 2003/04)- Asim Karim @ LUMS 14
Checkerboard Partitioning – p = n2 (2)
Data partitioning: Each processor has one element of A; only processors in last column have one element of x
Communication One element of x from last column to diagonal processor Broadcast from diagonal processor to all processors in
column Global sum of y from all processors in row to last processor
Computation: one multiplication + addition Parallel run time: Tp = 2tc + 3(ts + tw) Algorithm is cost-optimal as serial and parallel cost is
O(n2) For bus network, communication time is 3n(ts + tw);
system is not cost-optimal as cost is O(n3)
CS 524 (Wi 2003/04)- Asim Karim @ LUMS 15
Checkerboard Partitioning – p < n2
Data partitioning: Each processor has n/√p x n/√p elements of A; processors in last column have n/√p elements of x
Communication n/√p elements of x from last column to diagonal processor Broadcast from diagonal processor to all processors in
column Global sum of y from all processors in row to last processor
Computation: n2/p multiplications + additions Parallel run time: Tp = 2tcn2/p+ 3 (ts + tw n/√p)
Algorithm is cost-optimal only if p = O(n2)
CS 524 (Wi 2003/04)- Asim Karim @ LUMS 16
Matrix-Matrix Multiplication (MMM)
C = A x B, n x n square matrices Block matrix multiplication: algebraic operations on
sub-matrices or blocks of matrices. This view of MMM aids parallelization.
do i = 0, q-1
do j = 0, q-1
do k = 0, q-1
Ci,j = Ci,j + Ai,k x Bk,j
end do end do end do Number of multiplications + additions = n3 . Sequential
run time = 2tcn3
CS 524 (Wi 2003/04)- Asim Karim @ LUMS 17
Checkerboard Partitioning – q = √p
Data partitioning: Pi,j has Ai,j and Bi,j blocks of A and B of dimension n/√p x n/√p
Communication: Each processor broadcasts its submatrix Ai,j to all processors in row; each processor broadcasts its submatrix Bi,j to all processors in column
Computation: Each processor performs n*n/√p* n/√p = n3/p multiplications + additions
Parallel run time: Tp = 2tcn3/p + 2√p[ts + (n2/p)tw]
Algorithm is cost-optimal only if p = O(n2)
CS 524 (Wi 2003/04)- Asim Karim @ LUMS 18
Cannon’s Algorithm (1)
Memory-efficient version of the checkerboard partitioned block MMM At any time, each processor has one block of A and B Blocks are cycled after each computation in such a way that after √p
computations the multiplication is done for Ci,j
Initial distribution of matrices is same as checkerboard partitioning
Communication Initial: block Ai,j is moved left by i steps (with wraparound); block Bi,j is
moved up by j steps (with wraparound) Subsequent √p-1 : block Ai,j is moved left by one step; block Bi,j moved
up by one step (both with wraparound)
After √p computation and communication steps the multiplication is complete for Ci,j
CS 524 (Wi 2003/04)- Asim Karim @ LUMS 19
Cannon’s Algorithm (2)
CS 524 (Wi 2003/04)- Asim Karim @ LUMS 20
Cannon’s Algorithm (3)
CS 524 (Wi 2003/04)- Asim Karim @ LUMS 21
Cannon’s Algorithm (4)
Communication √p point-to-point communications of size n2/p along rows √p point-to-point communications of size n2/p along columns
Computation: over √p steps, each processors performs n3/p multiplications + additions
Parallel run time: Tp = 2tcn3/p + 2√p[ts + (n2/p)tw]
Algorithm is cost-optimal if p = O(n2)
CS 524 (Wi 2003/04)- Asim Karim @ LUMS 22
Fox’s Algorithm (1)
Another memory-efficient version of the checkerboard partitioned block MMM Initial distribution of matrices is same as checkerboard
partitioning At any time, each processor has one block of A and B
Steps (repeated √p times)1. Broadcast Ai,i to all processors in the row
2. Multiply block of A received with resident block of B
3. Send the block of B up one step (with wraparound)
4. Select block Ai,(j+1)mod√p and broadcast to all processors in row. Go to 2.
CS 524 (Wi 2003/04)- Asim Karim @ LUMS 23
Fox’s Algorithm (2)
CS 524 (Wi 2003/04)- Asim Karim @ LUMS 24
Fox’s Algorithm (3)
Communication √p broadcasts of size n2/p along rows √p point-to-point communications of size n2/p along columns
Computation: Each processor performs n3/p multiplications + additions
Parallel run time: Tp = 2tcn3/p + 2√p[ts + (n2/p)tw]
Algorithm is cost-optimal if p = O(n2)
CS 524 (Wi 2003/04)- Asim Karim @ LUMS 25
Solving a System of Linear Equations
System of linear equations, Ax = b A is dense n x n matrix of coefficients b is n x 1 vector of RHS values x is n x 1 vector of unknowns
Solving x is usually done in two stages First, Ax = b is reduced to Ux = y, where U is an unit upper
triangular matrix [U(i,j) = 0 if i > j; otherwise U(i,j) ≠ 0 and U(i,i) = 1 for 0 ≤ i < n]. This stage is called Gaussian elimination.
Second, the unknowns are solved in reverse order starting from x(n-1). This stage is called back-substitution.
CS 524 (Wi 2003/04)- Asim Karim @ LUMS 26
Gaussian Elimination (1)
do k = 0, n-1 do j = k+1, n-1 A(k, j) = A(k, j)/A(k, k) end do y(k) = b(k)/A(k, k) A(k, k) = 1 do i = k+1, n-1 do j = k+1, n-1 A(i, j) = A(i, j) – A(i, k)*A(k, j) end do b(i) = b(i) – A(i, k)*y(k) A(i, k) = 0 end doend do
CS 524 (Wi 2003/04)- Asim Karim @ LUMS 27
Gaussian Elimination (2)
Computations Approximately n2/2 divisions Approximately n3/3 – n2/2 multiplications + subtractions
Approx. sequential run time: Ts = 2tcn3/3
CS 524 (Wi 2003/04)- Asim Karim @ LUMS 28
Striped Partitioning – p = n (1)
Data partitioning: Each processor has one row of matrix A
Communication during k (outermost loop) broadcast of active part of kth (size: n–k–1) row to
processors k+1 to n-1
Computation during iteration k (outermost loop) n – k -1 divisions at processor Pk
n –k -1 multiplications + subtractions for processors Pi (k < i < n)
Parallel run time: Tp = (3/2)n(n-1)tc + nts + 0.5n(n-1)tw
Algorithm is not cost-optimal since serial and parallel costs are O(n3)
CS 524 (Wi 2003/04)- Asim Karim @ LUMS 29
Striped Partitioning – p = n (2)
CS 524 (Wi 2003/04)- Asim Karim @ LUMS 30
Striped Partitioning – p = n (3)
CS 524 (Wi 2003/04)- Asim Karim @ LUMS 31
Pipelined Version (Striped Partitioning)
In the non-pipelined or synchronous version, outer loop k is executed in order. When Pk is performing the division step, all other processors
are idle When performing the elimination step, only processors k+1
to n-1 are active; rest are idle In pipelined version, the division step, communication,
and elimination step are overlapped. Each processor: communicates, if it has data to
communicate; computes, if it has computations to be done; or waits, if none of these can be done.
Cost-optimal for linear array, mesh and hypercube interconnection networks that have directly-connected processors.
CS 524 (Wi 2003/04)- Asim Karim @ LUMS 32
Pipelined Version (2)
CS 524 (Wi 2003/04)- Asim Karim @ LUMS 33
Pipelined Version (3)
CS 524 (Wi 2003/04)- Asim Karim @ LUMS 34
Striped Partitioning – p < n (1)
CS 524 (Wi 2003/04)- Asim Karim @ LUMS 35
Striped Partitioning – p < n (2)
CS 524 (Wi 2003/04)- Asim Karim @ LUMS 36
Checkerboard Partitioning – p = n2 (1)
CS 524 (Wi 2003/04)- Asim Karim @ LUMS 37
Checkerboard Partitioning – p = n2 (2)
Data partitioning: Pi,j has element A(i, j) of matrix A Communication during iteration k (outermost loop)
Broadcast of A(k, k) to processor (k, k+1) to (k, n-1) in the kth row
Broadcast of modified A(i,k) along ith row for k ≤ i < n Broadcast of modified A(k,j) along jth column for k ≤ j < n
Computation during iteration k (outermost loop) One division at Pk,k
One multiplication + subtraction at processors Pi,j (k < i ,j< n) Parallel run time: Tp = (3/2)n(n-1)tc + n[ts + 0.5(n-1)tw] Algorithm is cost-optimal since serial and parallel costs
are O(n3)
CS 524 (Wi 2003/04)- Asim Karim @ LUMS 38
Back-Substitution
Solution of Ux = y, where U is unit upper triangular matrix
do k = n-1, 0 x(k) = y(k) do i = k-1, 0 y(i) = y(i) – x(k)*U(i,k) end doend do
Computation: approx. n2/2 multiplications + subtractions
Parallel algorithm is similar to that for the Gaussian elimination stage