
Dense Matrix Algorithms

CS 524 – High-Performance Computing

CS 524 (Wi 2003/04) - Asim Karim @ LUMS

Definitions

p = number of processors (0 to p-1)

n = dimension of array/matrix (0 to n-1)

q = number of blocks along one dimension (0 to q-1)

t_c = computation time for one flop

t_s = communication startup time

t_w = communication transfer time per word

Interconnection network: crossbar switch with bi-directional links


Uniform Striped Partitioning


Checkerboard Partitioning


Matrix Transpose (MT)

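A^T(i, j) = A(j, i) for all i and j

Sequential run time

do i = 0, n-1
  do j = 0, n-1
    B(i, j) = A(j, i)
  end do
end do

Run time is (n^2 - n)/2 exchanges, or approximately n^2/2, when the transpose is done in place by exchanging each off-diagonal pair once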
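The (n^2 - n)/2 figure quoted above matches an in-place transpose that exchanges every off-diagonal pair exactly once. A minimal Python sketch under that assumption follows; the function name transpose_in_place and the exchange counter are illustrative and not part of the slides.

```python
def transpose_in_place(a):
    """Transpose a square matrix (list of lists) in place.

    Returns the number of element exchanges performed.
    """
    n = len(a)
    exchanges = 0
    for i in range(n):
        for j in range(i + 1, n):      # entries strictly above the diagonal
            a[i][j], a[j][i] = a[j][i], a[i][j]
            exchanges += 1
    return exchanges


a = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
count = transpose_in_place(a)
print(a)      # [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
print(count)  # 3, which equals (n*n - n) // 2 for n = 3
```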


MT – Checkerboard Partitioning (1)


MT – Checkerboard Partitioning (2)


MT – Striped Partitioning


Matrix-Vector Multiplication (MVM)

MVM: y = Ax

do i = 0, n-1
  do j = 0, n-1
    y(i) = y(i) + A(i, j)*x(j)
  end do
end do

Sequential algorithm requires n^2 multiplications and additions

Assuming one flop takes t_c time, sequential run time is 2 t_c n^2


Row-wise Striping – p = n (1)


Row-wise Striping – p = n (2)

Data partitioning: P_i has row i of A and element i of x

Communication: Each processor broadcasts its element of x

Computation: Each processor performs n additions and multiplications

Parallel run time: T_p = 2 n t_c + p(t_s + t_w) = 2 n t_c + n(t_s + t_w)

Algorithm is cost-optimal as both parallel and serial costs are O(n^2)


Row-wise Striping – p < n

Data partitioning: Each processor has n/p rows of A and corresponding n/p elements of x

Communication: Each processor broadcasts its elements of x

Computation: Each processor performs n^2/p additions and multiplications

Parallel run time: T_p = 2 t_c n^2/p + p[t_s + (n/p) t_w]

Algorithm is cost-optimal for p = O(n)
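The row-wise striped scheme can be sketched as a single-process Python simulation. This is only an illustration under stated assumptions: p divides n, the p processors are modelled as list entries, and the names mvm_row_striped and tp_row_striped are hypothetical. The second function simply evaluates the run-time expression given above.

```python
def mvm_row_striped(A, x, p):
    """Simulate row-wise striped y = A x on p processes (p must divide n)."""
    n = len(A)
    assert n % p == 0, "sketch assumes p divides n"
    rows = n // p
    # Data partitioning: process r owns n/p consecutive rows of A
    # and the matching n/p elements of x.
    local_A = [A[r * rows:(r + 1) * rows] for r in range(p)]
    local_x = [x[r * rows:(r + 1) * rows] for r in range(p)]
    # Communication: every process broadcasts its slice of x, so each
    # process ends up holding the full vector.
    full_x = [v for piece in local_x for v in piece]
    # Computation: each process forms its own n/p entries of y.
    y = []
    for r in range(p):
        for row in local_A[r]:
            y.append(sum(a * b for a, b in zip(row, full_x)))
    return y


def tp_row_striped(n, p, tc, ts, tw):
    """Run-time model: 2 tc n^2/p + p [ts + (n/p) tw]."""
    return 2 * tc * n * n / p + p * (ts + (n / p) * tw)


A = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
x = [1, 0, -1, 2]
print(mvm_row_striped(A, x, 2))        # [6, 14, 22, 30]
print(tp_row_striped(4, 2, 1, 10, 1))  # 40.0
```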


Checkerboard Partitioning – p = n^2 (1)



Data partitioning: Each processor has one element of A; only processors in the last column have one element of x

Communication
  - One element of x from the last column to the diagonal processor
  - Broadcast from the diagonal processor to all processors in its column
  - Global sum of y from all processors in a row to the last processor

Computation: one multiplication + addition

Parallel run time: T_p = 2 t_c + 3(t_s + t_w)

Algorithm is cost-optimal as serial and parallel costs are both O(n^2)

For a bus network, communication time is 3n(t_s + t_w); the system is not cost-optimal as the cost is O(n^3)


Checkerboard Partitioning – p < n^2

Data partitioning: Each processor has an (n/√p) x (n/√p) block of A; processors in the last column have n/√p elements of x

Communication
  - n/√p elements of x from the last column to the diagonal processor
  - Broadcast from the diagonal processor to all processors in its column
  - Global sum of y from all processors in a row to the last processor

Computation: n^2/p multiplications + additions

Parallel run time: T_p = 2 t_c n^2/p + 3(t_s + t_w n/√p)

Algorithm is cost-optimal only if p = O(n^2)
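The three communication phases of the checkerboard scheme can be traced in a small single-process Python simulation. It is a sketch under assumptions: the grid is q x q with q = √p dividing n, processes are modelled as list indices, and the function name mvm_checkerboard is illustrative.

```python
def mvm_checkerboard(A, x, q):
    """Simulate checkerboard y = A x on a q x q process grid (q must divide n)."""
    n = len(A)
    assert n % q == 0, "sketch assumes q divides n"
    b = n // q
    # Block (i, j) of A lives on process (i, j).
    blk = [[[row[j * b:(j + 1) * b] for row in A[i * b:(i + 1) * b]]
            for j in range(q)] for i in range(q)]
    # Piece i of x starts on process (i, q-1) in the last column.
    x_last_col = [x[i * b:(i + 1) * b] for i in range(q)]
    # Phase 1: process (i, q-1) sends its piece to diagonal process (i, i).
    x_diag = [x_last_col[i] for i in range(q)]
    # Phase 2: diagonal process (j, j) broadcasts its piece down column j.
    x_col = [x_diag[j] for j in range(q)]
    # Phase 3: local block products, then a sum across each process row.
    y = []
    for i in range(q):
        partial = [0] * b
        for j in range(q):
            for r in range(b):
                partial[r] += sum(blk[i][j][r][c] * x_col[j][c] for c in range(b))
        y.extend(partial)   # piece i of y ends up on process (i, q-1)
    return y


A = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
x = [1, 0, -1, 2]
print(mvm_checkerboard(A, x, 2))  # [6, 14, 22, 30]
```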


Matrix-Matrix Multiplication (MMM)

C = A x B, where A, B and C are n x n square matrices

Block matrix multiplication: algebraic operations on sub-matrices or blocks of matrices. This view of MMM aids parallelization.

do i = 0, q-1
  do j = 0, q-1
    do k = 0, q-1
      C_{i,j} = C_{i,j} + A_{i,k} x B_{k,j}
    end do
  end do
end do

Number of multiplications + additions = n^3. Sequential run time = 2 t_c n^3
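The block loop above can be written out in plain Python to show that it produces the same product as the element-wise algorithm. This is a minimal sketch assuming q divides n; the function name block_matmul is illustrative.

```python
def block_matmul(A, B, q):
    """C = A x B computed block by block on a q x q blocking (q must divide n)."""
    n = len(A)
    assert n % q == 0, "sketch assumes q divides n"
    b = n // q
    C = [[0] * n for _ in range(n)]
    for i in range(q):
        for j in range(q):
            for k in range(q):
                # C_{i,j} = C_{i,j} + A_{i,k} x B_{k,j} on b x b blocks
                for r in range(i * b, (i + 1) * b):
                    for c in range(j * b, (j + 1) * b):
                        for t in range(k * b, (k + 1) * b):
                            C[r][c] += A[r][t] * B[t][c]
    return C


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(block_matmul(A, B, 2))  # [[19, 22], [43, 50]]
```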


Checkerboard Partitioning – q = √p

Data partitioning: P_{i,j} has blocks A_{i,j} and B_{i,j} of A and B, each of dimension (n/√p) x (n/√p)

Communication: Each processor broadcasts its submatrix A_{i,j} to all processors in its row; each processor broadcasts its submatrix B_{i,j} to all processors in its column

Computation: Each processor performs n x (n/√p) x (n/√p) = n^3/p multiplications + additions

Parallel run time: T_p = 2 t_c n^3/p + 2√p[t_s + (n^2/p) t_w]

Algorithm is cost-optimal only if p = O(n^2)


Cannon’s Algorithm (1)

Memory-efficient version of the checkerboard partitioned block MMM
  - At any time, each processor has one block of A and one block of B
  - Blocks are cycled after each computation in such a way that after √p computations the multiplication is done for C_{i,j}

Initial distribution of matrices is the same as in checkerboard partitioning

Communication
  - Initial: block A_{i,j} is moved left by i steps (with wraparound); block B_{i,j} is moved up by j steps (with wraparound)
  - Subsequent √p-1 steps: block A_{i,j} is moved left by one step; block B_{i,j} is moved up by one step (both with wraparound)

After √p computation and communication steps the multiplication is complete for C_{i,j}



Communication
  - √p point-to-point communications of size n^2/p along rows
  - √p point-to-point communications of size n^2/p along columns

Computation: over √p steps, each processor performs n^3/p multiplications + additions

Algorithm is cost-optimal if p = O(n^2)
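The alignment and shift pattern of Cannon's algorithm can be checked with a single-process Python simulation. This is a sketch under assumptions: the grid is q x q with q = √p dividing n, each grid cell stands in for one processor, and the helper names split, join, matmul_add and cannon are illustrative.

```python
def matmul_add(C, A, B):
    """C += A x B for small square blocks stored as lists of lists."""
    b = len(A)
    for r in range(b):
        for c in range(b):
            C[r][c] += sum(A[r][t] * B[t][c] for t in range(b))


def split(M, q):
    """Cut an n x n matrix into a q x q grid of (n/q) x (n/q) blocks."""
    b = len(M) // q
    return [[[row[j * b:(j + 1) * b] for row in M[i * b:(i + 1) * b]]
             for j in range(q)] for i in range(q)]


def join(blocks):
    """Reassemble a grid of blocks into one matrix."""
    return [sum((blk[r] for blk in block_row), [])
            for block_row in blocks for r in range(len(block_row[0]))]


def cannon(A, B, q):
    """Simulate Cannon's algorithm on a q x q grid (q must divide n)."""
    n = len(A)
    assert n % q == 0, "sketch assumes q divides n"
    b = n // q
    a, bb = split(A, q), split(B, q)
    # Initial alignment: A_{i,j} moves left by i, B_{i,j} moves up by j.
    a = [[a[i][(j + i) % q] for j in range(q)] for i in range(q)]
    bb = [[bb[(i + j) % q][j] for j in range(q)] for i in range(q)]
    c = [[[[0] * b for _ in range(b)] for _ in range(q)] for _ in range(q)]
    for step in range(q):
        # Every process multiplies its two resident blocks.
        for i in range(q):
            for j in range(q):
                matmul_add(c[i][j], a[i][j], bb[i][j])
        if step < q - 1:
            # Shift A left by one and B up by one, both with wraparound.
            a = [[a[i][(j + 1) % q] for j in range(q)] for i in range(q)]
            bb = [[bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return join(c)


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(cannon(A, B, 2))  # [[19, 22], [43, 50]]
```

Each simulated process holds exactly one block of A and one block of B at any time, which is the memory saving the slide refers to.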


Fox’s Algorithm (1)

Another memory-efficient version of the checkerboard partitioned block MMM
  - Initial distribution of matrices is the same as in checkerboard partitioning
  - At any time, each processor has one block of A and one block of B

Steps (repeated √p times)

1. Broadcast A_{i,i} to all processors in the row

2. Multiply the block of A received with the resident block of B

3. Send the block of B up one step (with wraparound)

4. Select block A_{i,(j+1) mod √p}, where j is the column index of the block broadcast in the previous step, and broadcast it to all processors in the row. Go to 2.



Communication
  - √p broadcasts of size n^2/p along rows
  - √p point-to-point communications of size n^2/p along columns

Computation: Each processor performs n^3/p multiplications + additions

Algorithm is cost-optimal if p = O(n^2)
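The broadcast-and-shift pattern of Fox's algorithm can be traced in the same style of single-process Python simulation. It is a sketch under assumptions: the grid is q x q with q = √p dividing n, the row broadcast is modelled by letting every process in row i read the selected block directly, and the helper names split, join, matmul_add and fox are illustrative.

```python
def matmul_add(C, A, B):
    """C += A x B for small square blocks stored as lists of lists."""
    b = len(A)
    for r in range(b):
        for c in range(b):
            C[r][c] += sum(A[r][t] * B[t][c] for t in range(b))


def split(M, q):
    """Cut an n x n matrix into a q x q grid of (n/q) x (n/q) blocks."""
    b = len(M) // q
    return [[[row[j * b:(j + 1) * b] for row in M[i * b:(i + 1) * b]]
             for j in range(q)] for i in range(q)]


def join(blocks):
    """Reassemble a grid of blocks into one matrix."""
    return [sum((blk[r] for blk in block_row), [])
            for block_row in blocks for r in range(len(block_row[0]))]


def fox(A, B, q):
    """Simulate Fox's algorithm on a q x q grid (q must divide n)."""
    n = len(A)
    assert n % q == 0, "sketch assumes q divides n"
    b = n // q
    a, bb = split(A, q), split(B, q)
    c = [[[[0] * b for _ in range(b)] for _ in range(q)] for _ in range(q)]
    for step in range(q):
        for i in range(q):
            # Row i broadcasts A_{i,k}; k starts at i and advances by one per step.
            k = (i + step) % q
            for j in range(q):
                matmul_add(c[i][j], a[i][k], bb[i][j])
        # Every process sends its B block up one step, with wraparound.
        bb = [[bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return join(c)


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(fox(A, B, 2))  # [[19, 22], [43, 50]]
```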


Solving a System of Linear Equations

System of linear equations, Ax = b
  - A is a dense n x n matrix of coefficients
  - b is an n x 1 vector of RHS values
  - x is an n x 1 vector of unknowns

Solving for x is usually done in two stages
  - First, Ax = b is reduced to Ux = y, where U is a unit upper triangular matrix [U(i,j) = 0 if i > j; otherwise U(i,j) ≠ 0 and U(i,i) = 1 for 0 ≤ i < n]. This stage is called Gaussian elimination.
  - Second, the unknowns are solved in reverse order starting from x(n-1). This stage is called back-substitution.


Gaussian Elimination (1)

do k = 0, n-1
  do j = k+1, n-1
    A(k, j) = A(k, j)/A(k, k)              ! division step
  end do
  y(k) = b(k)/A(k, k)
  A(k, k) = 1
  do i = k+1, n-1
    do j = k+1, n-1
      A(i, j) = A(i, j) - A(i, k)*A(k, j)  ! elimination step
    end do
    b(i) = b(i) - A(i, k)*y(k)
    A(i, k) = 0
  end do
end do


Gaussian Elimination (2)

Computations
  - Approximately n^2/2 divisions
  - Approximately n^3/3 - n^2/2 multiplications + subtractions

Approx. sequential run time: T_s = 2 t_c n^3/3
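The sequential elimination loop translates almost line for line into Python. The sketch below assumes, as the pseudocode does, that no pivot A(k, k) is zero (there is no pivoting); the function name gaussian_eliminate is illustrative.

```python
def gaussian_eliminate(A, b):
    """Reduce A x = b to U x = y in place, without pivoting.

    A is overwritten with the unit upper triangular U; returns (U, y).
    """
    n = len(A)
    y = [0.0] * n
    for k in range(n):
        pivot = A[k][k]                 # assumed nonzero (no pivoting here)
        for j in range(k + 1, n):
            A[k][j] = A[k][j] / pivot   # division step
        y[k] = b[k] / pivot
        A[k][k] = 1.0
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]   # elimination step
            b[i] -= A[i][k] * y[k]
            A[i][k] = 0.0
    return A, y


U, y = gaussian_eliminate([[2.0, 4.0], [1.0, 3.0]], [2.0, 2.0])
print(U)  # [[1.0, 2.0], [0.0, 1.0]]
print(y)  # [1.0, 1.0]
```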


Striped Partitioning – p = n (1)

Data partitioning: Each processor has one row of matrix A

Communication during iteration k (outermost loop)
  - Broadcast of the active part of the kth row (size: n-k-1) to processors k+1 to n-1

Computation during iteration k (outermost loop)
  - n-k-1 divisions at processor P_k
  - n-k-1 multiplications + subtractions at processors P_i (k < i < n)

Parallel run time: T_p = (3/2)n(n-1) t_c + n t_s + 0.5 n(n-1) t_w

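Algorithm is cost-optimal since serial and parallel costs are both O(n^3), although the parallel cost p T_p ≈ (3/2) n^3 t_c exceeds the serial run time 2 t_c n^3/3 by a constant factor because processors are idle for part of each iteration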
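The parallel run time quoted above can be recovered by summing the per-iteration costs listed for iteration k. The derivation below is a sketch that assumes a broadcast of m words costs t_s + m t_w on the crossbar, and that the division, broadcast and elimination steps of one iteration do not overlap.

```latex
\begin{align*}
T_p &= \sum_{k=0}^{n-1} \Bigl[\, \underbrace{(n-k-1)\,t_c}_{\text{division}}
      + \underbrace{t_s + (n-k-1)\,t_w}_{\text{broadcast}}
      + \underbrace{2(n-k-1)\,t_c}_{\text{elimination}} \Bigr] \\
    &= 3\,t_c \sum_{k=0}^{n-1}(n-k-1) \;+\; n\,t_s \;+\; t_w \sum_{k=0}^{n-1}(n-k-1) \\
    &= \tfrac{3}{2}\,n(n-1)\,t_c \;+\; n\,t_s \;+\; \tfrac{1}{2}\,n(n-1)\,t_w ,
\end{align*}
```

using \sum_{k=0}^{n-1}(n-k-1) = n(n-1)/2.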


Pipelined Version (Striped Partitioning)

In the non-pipelined or synchronous version, outer loop k is executed in order.
  - When P_k is performing the division step, all other processors are idle
  - When performing the elimination step, only processors k+1 to n-1 are active; the rest are idle

In the pipelined version, the division step, communication, and elimination step are overlapped.

Each processor: communicates, if it has data to communicate; computes, if it has computations to be done; or waits, if none of these can be done.

Cost-optimal for linear array, mesh and hypercube interconnection networks that have directly-connected processors.


Pipelined Version (2)


Pipelined Version (3)


Striped Partitioning – p < n (1)


Striped Partitioning – p < n (2)



Checkerboard Partitioning – p = n^2

Data partitioning: P_{i,j} has element A(i, j) of matrix A

Communication during iteration k (outermost loop)
  - Broadcast of A(k, k) to processors (k, k+1) to (k, n-1) in the kth row
  - Broadcast of modified A(i, k) along the ith row for k ≤ i < n
  - Broadcast of modified A(k, j) along the jth column for k ≤ j < n

Computation during iteration k (outermost loop)
  - One division at processors P_{k,j} (k < j < n)
  - One multiplication + subtraction at processors P_{i,j} (k < i, j < n)

Parallel run time: each iteration costs three flops in sequence (one division, then one multiplication + subtraction) and a constant number of single-word broadcasts, so T_p = 3n t_c + O(n)(t_s + t_w)

Algorithm is cost-optimal since serial and parallel costs are both O(n^3)


Back-Substitution

Solution of Ux = y, where U is a unit upper triangular matrix

do k = n-1, 0, -1
  x(k) = y(k)
  do i = k-1, 0, -1
    y(i) = y(i) - x(k)*U(i,k)
  end do
end do

Computation: approx. n^2/2 multiplications + subtractions

Parallel algorithm is similar to that for the Gaussian elimination stage
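The sequential back-substitution loop can be sketched in Python as follows. It assumes U is unit upper triangular, so no division is needed, and it overwrites y as the pseudocode does; the function name back_substitute is illustrative.

```python
def back_substitute(U, y):
    """Solve U x = y for a unit upper triangular U; y is overwritten."""
    n = len(U)
    x = [0.0] * n
    for k in range(n - 1, -1, -1):       # unknowns in reverse order
        x[k] = y[k]                      # U[k][k] is 1, so no division
        for i in range(k - 1, -1, -1):
            y[i] -= x[k] * U[i][k]       # remove x[k] from the rows above
    return x


x = back_substitute([[1.0, 2.0], [0.0, 1.0]], [1.0, 1.0])
print(x)  # [-1.0, 1.0]
```

The inputs here are the U and y obtained by eliminating the system 2 x0 + 4 x1 = 2, x0 + 3 x1 = 2, whose solution is x0 = -1, x1 = 1.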
