
High Performance Computing 1

Numerical Linear Algebra: An Introduction


Levels of multiplication

• vector-vector a[i]*b[i]

• matrix-vector A[i,j]*b[j]

• matrix-matrix A[i,k]*B[k,j]


Matrix-Matrix

for (i=0; i<n; i++)
  for (j=0; j<n; j++) {
    C[i,j]=0.0;
    for (k=0; k<n; k++)
      C[i,j]=C[i,j] + A[i,k]*B[k,j];
  }

Note: O(n^3) work


Block matrix

• Serial version - same pseudocode, but interpret i, j as indices of subblocks, and A*B means block matrix multiplication

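• Let n be a power of two and apply the block algorithm recursively to the subblocks. The recursion terminates with an explicit formula for the elementary 2x2 multiplications. This allows for parallelism, and the work can be reduced to O(n^2.8), as in Strassen's algorithm, which uses 7 instead of 8 block multiplications per level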
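The recursive idea can be sketched in a few lines. The slide does not name the algorithm; the sketch below assumes Strassen's seven-product scheme, which is the standard way to reach roughly O(n^2.8) work. The function names (`strassen`, `naive_matmul`, and the helpers) are illustrative, and matrices are plain lists of lists with n a power of two.

```python
def naive_matmul(A, B):
    """Reference triple loop, O(n^3) work."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def _add(X, Y):
    return [[x + y for x, y in zip(r, s)] for r, s in zip(X, Y)]

def _sub(X, Y):
    return [[x - y for x, y in zip(r, s)] for r, s in zip(X, Y)]

def _split(X):
    """Split X into four equal quadrants X11, X12, X21, X22."""
    h = len(X) // 2
    return ([r[:h] for r in X[:h]], [r[h:] for r in X[:h]],
            [r[:h] for r in X[h:]], [r[h:] for r in X[h:]])

def strassen(A, B):
    """Multiply n-by-n matrices, n a power of two, with 7 recursive products."""
    n = len(A)
    if n == 1:                      # recursion bottoms out at scalars
        return [[A[0][0] * B[0][0]]]
    A11, A12, A21, A22 = _split(A)
    B11, B12, B21, B22 = _split(B)
    # The seven block products (independent, so they could run in parallel).
    M1 = strassen(_add(A11, A22), _add(B11, B22))
    M2 = strassen(_add(A21, A22), B11)
    M3 = strassen(A11, _sub(B12, B22))
    M4 = strassen(A22, _sub(B21, B11))
    M5 = strassen(_add(A11, A12), B22)
    M6 = strassen(_sub(A21, A11), _add(B11, B12))
    M7 = strassen(_sub(A12, A22), _add(B21, B22))
    C11 = _add(_sub(_add(M1, M4), M5), M7)
    C12 = _add(M3, M5)
    C21 = _add(M2, M4)
    C22 = _add(_add(_sub(M1, M2), M3), M6)
    # Stitch the quadrants back together.
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bottom = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bottom
```

In practice the recursion would stop at a larger block size and hand off to an ordinary blocked multiply; recursing down to scalars is only for clarity.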


Pipeline method

• Pump data left to right and top to bottom

recv(&A,P[i,j-1]);

recv(&B,P[i-1,j]);

C=C+A*B

send(&A,P[i,j+1]);

send(&B,P[i+1, j]);
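To see how the data meet in the right place, the pipeline can be simulated serially. This is a sketch under an assumption the slide does not state: row i of A and column j of B are fed in with a delay of i and j steps respectively, so that A[i,k] and B[k,j] reach processor P[i,j] on the same step. The function name `systolic_matmul` is illustrative.

```python
def systolic_matmul(A, B):
    """Simulate an n-by-n grid of processors; P[i][j] accumulates C[i][j]."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    a_reg = [[0.0] * n for _ in range(n)]   # value of A held by each processor
    b_reg = [[0.0] * n for _ in range(n)]   # value of B held by each processor
    for t in range(3 * n - 2):              # steps until the last pair meets
        new_a = [[0.0] * n for _ in range(n)]
        new_b = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if j == 0:                  # left edge: feed row i, skewed by i
                    k = t - i
                    new_a[i][j] = A[i][k] if 0 <= k < n else 0.0
                else:                       # recv A from the left neighbour
                    new_a[i][j] = a_reg[i][j - 1]
                if i == 0:                  # top edge: feed column j, skewed by j
                    k = t - j
                    new_b[i][j] = B[k][j] if 0 <= k < n else 0.0
                else:                       # recv B from the neighbour above
                    new_b[i][j] = b_reg[i - 1][j]
                C[i][j] += new_a[i][j] * new_b[i][j]   # C = C + A*B
        a_reg, b_reg = new_a, new_b         # "send" happens at the step boundary
    return C
```

Each processor does one multiply-add per step and only talks to its neighbours, which is the property the send/recv pseudocode relies on.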


Pipeline method

         B[*,0]  B[*,1]  B[*,2]  B[*,3]
A[0,*]   C[0,0]  C[0,1]  C[0,2]  C[0,3]
A[1,*]   C[1,0]  C[1,1]  C[1,2]  C[1,3]
A[2,*]   C[2,0]  C[2,1]  C[2,2]  C[2,3]
A[3,*]   C[3,0]  C[3,1]  C[3,2]  C[3,3]

(Rows of A enter from the left; columns of B enter from the top.)


Pipeline method

• Similar method for matrix-vector multiplication. But you lose some of the cache reuse


A sense of speed – vector ops

Loop  Flops per pass  Operation per pass           Operation
1     2               v1(i)=v1(i)+a*v2(i)          update
2     8               v1(i)=v1(i)+Σ sk*vk(i)       4-fold vector update
3     1               v1(i)=v1(i)/v2(i)            divide
4     2               v1(i)=v1(i)+s*v2(ind(i))     update+gather
5     2               v1(i)=v2(i)-v3(i)*v1(i-1)    bidiagonal
6     2               s=s+v1(i)*v2(i)              inner product
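A few of the loops above written out, as a sketch with illustrative function names. Loops 1 and 4 differ only in the indirect access `v2[ind[i]]`; loop 5 carries a dependence on `v1[i-1]`, so its passes cannot run independently; loop 6 reduces to a scalar.

```python
def update(v1, a, v2):
    """Loop 1: v1(i) = v1(i) + a*v2(i), 2 flops per pass."""
    for i in range(len(v1)):
        v1[i] += a * v2[i]
    return v1

def update_gather(v1, s, v2, ind):
    """Loop 4: v1(i) = v1(i) + s*v2(ind(i)), indirect (gathered) read of v2."""
    for i in range(len(v1)):
        v1[i] += s * v2[ind[i]]
    return v1

def bidiagonal(v1, v2, v3):
    """Loop 5: v1(i) = v2(i) - v3(i)*v1(i-1); each pass needs the previous one."""
    for i in range(1, len(v1)):
        v1[i] = v2[i] - v3[i] * v1[i - 1]
    return v1

def inner_product(v1, v2):
    """Loop 6: s = s + v1(i)*v2(i), a reduction into one scalar."""
    s = 0.0
    for i in range(len(v1)):
        s += v1[i] * v2[i]
    return s
```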


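A sense of speed – vector ops

        J90 cft77            T90 1 processor cft77
        (100 nsec clock)     (2.2 nsec clock)
Loop    r_∞      n_1/2       r_∞      n_1/2
1       97       19          1428     159
2       163      29          1245     50
3       21       6           219      43
4       72       21          780      83
5       4.9      2           25       9
6       120      202         474      164

r_∞ is the asymptotic rate (Mflop/s) for very long vectors; n_1/2 is the vector length at which half of r_∞ is reached.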
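One way to read the two parameters is through Hockney's two-parameter model, assumed here since the slide does not spell it out: the achieved rate at vector length n is r(n) = r_∞ * n / (n + n_1/2). The function name `hockney_rate` is illustrative.

```python
def hockney_rate(n, r_inf, n_half):
    """Achieved rate at vector length n under the assumed Hockney model.

    r_inf  : asymptotic rate for very long vectors
    n_half : vector length at which half of r_inf is reached
    """
    return r_inf * n / (n + n_half)

# Loop 1 on the J90 (r_inf = 97, n_1/2 = 19): half speed at n = 19.
half_speed = hockney_rate(19, 97, 19)
# Loop 6 on the J90 (r_inf = 120, n_1/2 = 202) needs much longer vectors
# before it gets close to its asymptotic rate.
short_vector = hockney_rate(19, 120, 202)
```

A large n_1/2 means the loop only pays off on long vectors, which is why the inner product looks poor at small n despite a decent r_∞.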


Observations

• Simple do loops not effective

• Cache and memory hierarchy bottlenecks

• For better performance:

– combine loops

– minimize memory transfer


LINPACK

• library of subroutines for solving linear algebra problems

• example – LU decomposition and system solve (dgefa and dgesl, resp.)

• In turn, built on BLAS

• see netlib.org
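The factor-then-solve split can be illustrated with a small sketch. This is not the LINPACK interface: `lu_factor` and `lu_solve` are hypothetical pure-Python stand-ins that play the roles of dgefa (LU factorization with partial pivoting) and dgesl (solve using the stored factors).

```python
def lu_factor(A):
    """Factor a square matrix with partial pivoting (role of dgefa).

    Returns (LU, piv): L (unit diagonal, below) and U (on and above the
    diagonal) packed in one matrix, and piv[i] = original row now in row i.
    """
    n = len(A)
    LU = [list(map(float, row)) for row in A]
    piv = list(range(n))
    for k in range(n):
        # Pick the largest entry in column k as pivot.
        p = max(range(k, n), key=lambda r: abs(LU[r][k]))
        if LU[p][k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            LU[k], LU[p] = LU[p], LU[k]
            piv[k], piv[p] = piv[p], piv[k]
        for i in range(k + 1, n):
            LU[i][k] /= LU[k][k]                 # multiplier, stored in L
            for j in range(k + 1, n):
                LU[i][j] -= LU[i][k] * LU[k][j]  # update trailing submatrix
    return LU, piv

def lu_solve(LU, piv, b):
    """Solve A x = b from the factors (role of dgesl)."""
    n = len(LU)
    x = [float(b[piv[i]]) for i in range(n)]     # apply the row permutation
    for i in range(n):                           # forward substitution, L y = P b
        for j in range(i):
            x[i] -= LU[i][j] * x[j]
    for i in reversed(range(n)):                 # back substitution, U x = y
        for j in range(i + 1, n):
            x[i] -= LU[i][j] * x[j]
        x[i] /= LU[i][i]
    return x
```

Factoring once costs O(n^3); each additional right-hand side then costs only O(n^2), which is the reason for keeping the two steps separate.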


BLAS: Basic Linear Algebra Subprograms

• a library of subroutines designed to provide efficient computation of commonly used linear algebra operations, such as dot products, matrix-vector multiplies, and matrix-matrix multiplies.

• The naming convention is not unlike that of other libraries: the first letter indicates the precision, and the rest gives a hint (maybe) of what the routine does, e.g. SAXPY, DGEMM.

• The BLAS are divided into 3 levels: vector-vector, matrix-vector, and matrix-matrix. The biggest speed-up is usually in level 3.
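As a small illustration of the naming convention, the sketch below decodes the precision letter and gives a reference version of what SAXPY computes ("a times x plus y"). The names `blas_precision` and `saxpy` are illustrative Python, not calls into a BLAS library, and Python floats are double precision, so the "S" is not honoured here.

```python
# Standard BLAS precision prefixes.
PRECISION = {
    "S": "single real",
    "D": "double real",
    "C": "single complex",
    "Z": "double complex",
}

def blas_precision(name):
    """Return the precision implied by the first letter of a BLAS routine name."""
    return PRECISION[name[0].upper()]

def saxpy(a, x, y):
    """Reference semantics of the level 1 routine: returns a*x + y elementwise."""
    return [a * xi + yi for xi, yi in zip(x, y)]
```

So SAXPY is the single precision vector update and DGEMM is the double precision general matrix-matrix multiply.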


BLAS

• Level 1


BLAS

• Level 2


BLAS

• Level 3


How efficient is the BLAS?

Level  Routine  load/store    float ops  refs/ops
1      SAXPY    3N            2N         3/2
2      SGEMV    MN+N+2M       2MN        1/2
3      SGEMM    2MN+MK+KN     2MNK       2/N
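The last column follows directly from the first two. A quick sketch that evaluates the ratios from the table's formulas (the function names are illustrative):

```python
def saxpy_ratio(n):
    """Level 1: 3N loads/stores for 2N flops."""
    return (3 * n) / (2 * n)

def sgemv_ratio(m, n):
    """Level 2: MN + N + 2M loads/stores for 2MN flops."""
    return (m * n + n + 2 * m) / (2 * m * n)

def sgemm_ratio(m, n, k):
    """Level 3: 2MN + MK + KN loads/stores for 2MNK flops."""
    return (2 * m * n + m * k + k * n) / (2 * m * n * k)
```

For square problems (M = N = K) the level 3 ratio is 4N^2 / 2N^3 = 2/N, so the memory traffic per flop shrinks as the matrices grow, while for levels 1 and 2 it stays roughly constant.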


Matrix-vector

read x(1:n) into fast memory
read y(1:n) into fast memory
for i = 1:n
  read row i of A into fast memory
  for j = 1:n
    y(i) = y(i) + A(i,j)*x(j)
write y(1:n) back to slow memory


Matrix-vector

• m = # slow memory refs = n^2 + 3n

• f = # arithmetic ops = 2n^2

• q = f/m ~= 2

• Matrix-vector multiply is limited by slow memory speed
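The counts can be checked by instrumenting the matrix-vector algorithm. This is a sketch: the counters model "slow memory" traffic by hand, one reference per word moved, and the function name `matvec_counted` is illustrative.

```python
def matvec_counted(A, x, y):
    """Compute y + A*x while counting slow-memory references and flops."""
    n = len(x)
    refs = 0
    flops = 0
    fast_x = list(x)            # read x(1:n) into fast memory
    refs += n
    fast_y = list(y)            # read y(1:n) into fast memory
    refs += n
    for i in range(n):
        row = A[i]              # read row i of A into fast memory
        refs += n
        for j in range(n):
            fast_y[i] += row[j] * fast_x[j]
            flops += 2          # one multiply, one add
    refs += n                   # write y(1:n) back to slow memory
    return fast_y, refs, flops
```

For n = 1000 this gives m = 1,003,000 and f = 2,000,000, so q is about 1.99: every element of A is used exactly once, and no reordering of the loops can change that.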


Matrix-matrix


Matrix Multiply - unblocked

for i = 1 to n
  read row i of A into fast memory
  for j = 1 to n
    read C(i,j) into fast memory
    read column j of B into fast memory
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
    write C(i,j) back to slow memory


Matrix Multiply unblocked

Number of slow memory references on unblocked matrix multiply

m =   n^3      read each column of B n times
    + n^2      read each row of A once (row i is read once, for its value of i)
    + 2*n^2    read and write each element of C once
  =   n^3 + 3*n^2

So q = f/m = (2*n^3)/(n^3 + 3*n^2)
     ~= 2 for large n, no improvement over matrix-vector multiply


Matrix Multiply blocked

Consider A,B,C to be N by N matrices of b by b subblocks where b=n/N is called the blocksize

for i = 1 to N
  for j = 1 to N
    read block C(i,j) into fast memory
    for k = 1 to N
      read block A(i,k) into fast memory
      read block B(k,j) into fast memory
      C(i,j) = C(i,j) + A(i,k) * B(k,j) {do a matrix multiply on blocks}
    write block C(i,j) back to slow memory


Matrix Multiply blocked

Number of slow memory references on blocked matrix multiply

m =   N*n^2    read each block of B N^3 times (N^3 * n/N * n/N)
    + N*n^2    read each block of A N^3 times
    + 2*n^2    read and write each block of C
  =   (2*N + 2)*n^2

So q = f/m = 2*n^3 / ((2*N + 2)*n^2)
     ~= n/N = b for large n
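The count m = (2*N + 2)*n^2 can be checked by instrumenting the blocked algorithm. This is a sketch: slow-memory traffic is modelled by a hand-maintained counter, one reference per word in each block moved, and the function name `blocked_matmul` is illustrative.

```python
def blocked_matmul(A, B, b):
    """Blocked C = A*B for n-by-n lists of lists; returns (C, slow_refs)."""
    n = len(A)
    if n % b != 0:
        raise ValueError("block size b must divide n")
    N = n // b                                # number of blocks per dimension
    C = [[0.0] * n for _ in range(n)]
    refs = 0
    for I in range(N):
        for J in range(N):
            refs += b * b                     # read block C(I,J)
            for K in range(N):
                refs += 2 * b * b             # read blocks A(I,K) and B(K,J)
                # Multiply the two blocks entirely in "fast memory".
                for i in range(I * b, (I + 1) * b):
                    for j in range(J * b, (J + 1) * b):
                        s = C[i][j]
                        for k in range(K * b, (K + 1) * b):
                            s += A[i][k] * B[k][j]
                        C[i][j] = s
            refs += b * b                     # write block C(I,J) back
    return C, refs
```

With n = 4 and b = 2 the counter reports (2*2 + 2)*16 = 96 references; with b = 4 (one block) it drops to 64, showing how larger blocks cut the traffic for the same 2*n^3 flops.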


Matrix Multiply blocked

So we can improve performance by increasing the blocksize b

Can be much faster than matrix-vector multiply (q=2)

Limit: All three blocks from A,B,C must fit in fast memory (cache), so we cannot make these blocks arbitrarily large. If the fast memory holds M words, then 3*b^2 <= M, so q ~= b <= sqrt(M/3)
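The bound translates into a concrete block size once M is known. A minimal sketch; the 32 KiB cache and 8-byte words in the example are assumptions for illustration, not figures from the slides, and `max_block_size` is an illustrative name.

```python
import math

def max_block_size(fast_memory_words):
    """Largest integer b with 3*b^2 <= M, i.e. b = floor(sqrt(M/3))."""
    return math.isqrt(fast_memory_words // 3)

# Assumed example: a 32 KiB cache holding 8-byte doubles has M = 4096 words.
M = 32 * 1024 // 8
b = max_block_size(M)   # three b-by-b blocks of doubles fit in that cache
```

Since q grows only like sqrt(M), quadrupling the fast memory merely doubles the achievable ratio of flops to slow-memory references.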


More on BLAS

• Industry standard interface (evolving)

• Vendors, others supply optimized implementations

• History

– BLAS1 (1970s):

• vector operations: dot product, saxpy

• m=2*n, f=2*n, q ~1 or less

– BLAS2 (mid 1980s):

• matrix-vector operations

• m=n^2, f=2*n^2, q~2, less overhead

• somewhat faster than BLAS1


More on BLAS

– BLAS3 (late 1980s):

• matrix-matrix operations: matrix-matrix multiply, etc.

• m >= 4n^2, f=O(n^3), so q can possibly be as large as n, so BLAS3 is potentially much faster than BLAS2

• Good algorithms use BLAS3 when possible (LAPACK)

• www.netlib.org/blas, www.netlib.org/lapack


BLAS on an IBM RS6000/590

[Figure: speed of BLAS 3 (n-by-n matrix-matrix multiply) vs BLAS 2 (n-by-n matrix-vector multiply) vs BLAS 1 (saxpy of n vectors), with curves labeled Peak, BLAS 3, BLAS 2, and BLAS 1. Peak speed = 266 Mflops.]
