high performance computing 1 numerical linear algebra an introduction
TRANSCRIPT
High Performance Computing 1
Numerical Linear AlgebraAn Introduction
High Performance Computing 1
Levels of multiplication
• vector-vector a[i]*b[i]
• matrix-vector A[i,j]*b[j]
• matrix-matrix A[I,k]*B[k,j]
High Performance Computing 1
Matrix-Matrix
for (i=0; i<n; i++) for (j=0; j<n; j++) { C[i,j]=0.0; for (k=0; k<n; k++) C[i,j]=C[i,j] + A[i,k]*B[k,j]; } Note:O(n3) work
High Performance Computing 1
Block matrix
• Serial version - same pseudocode, but interpret i, j as indices of subblocks, and A*B means block matrix multiplication
• Let n be a power of two, and generate a recursive algorithm. Terminates with an explicit formula for the elementary 2X2 multiplications. Allows for parallelism. Can get O(n2.8) work
High Performance Computing 1
Pipeline method
• Pump data left to right and top to bottom
recv(&A,P[i,j-1]);
recv(&B,P[i-1,j]);
C=C+A*B
send(&A,P[i,j+1]);
send(&B,P[i+1, j]);
High Performance Computing 1
Pipeline method
C[0,0]
C[3,1] C[3,3]C[3,2]
C[1,0]
C[2,0]
C[3,0]
C[1,1]
C[0,3]C[0,2]C[0,1]
C[1,2]
C[2,1] C[2,2]
C[1,3]
C[2,3]
A[0,*]
A[1,*]
A[2,*]
A[3,*]
B[*,0] B[*,1] B[*,2] B[*,3]
High Performance Computing 1
Pipeline method
• Similar method for matrix-vector multiplication. But you lose some of the cache reuse
High Performance Computing 1
A sense of speed – vector opsLoop Flops per pass Operation per pass operation
1 2 v1(i)=v1(i)+a*v2(i) update
2 8 v1(i)=v1(i)+Σ sk*vk(i) 4-fold vector update
3 1 v1(i)=v1(i)/v2(i) divide
4 2 v1(i)=v1(i)+s*v2(ind(i)) update+gather
5 2 v1(i)=v2(i)-v3(i)*v1(i-1) Bidiagonal
6 2 s=s+v1(i)*v2(i) Inner product
High Performance Computing 1
A sense of speed – vector opsLoop J90 cft77
(100 nsec clock)
r_∞ n1/2
T90 1 processor cft77
(2.2 nsec clock)
r_∞ n1/2
1 97 19 1428 159
2 163 29 1245 50
3 21 6 219 43
4 72 21 780 83
5 4.9 2 25 9
6 120 202 474 164
High Performance Computing 1
Observations
• Simple do loops not effective
• Cache and memory hierarchy bottlenecks
• For better performance,– combine loops– minimize memory transfer
High Performance Computing 1
LINPACK
• library of subroutines to solve linear algebra
• example – LU decomposition and system solve (dgefa and dgesl, resp.)
• In turn, built on BLAS
• see netlib.org
High Performance Computing 1
BLAS Basic Linear Algebra Subprograms • a library of subroutines designed to provide
efficient computation of commonly-used linear algebra routines, like dot products, matrix-vector multiplies, and matrix-matrix multiplies.
• The naming convention is not unlike other libraries - the fist letter indicates precision, the rest gives a hint (maybe) of what the routine does, e.g. SAXPY, DGEMM.
• The BLAS are divided into 3 levels: vector-vector, matrix-vector, and matrix-matrix. The biggest speed-up usually in level 3.
High Performance Computing 1
BLAS
• Level 1
High Performance Computing 1
BLAS
• Level 2
High Performance Computing 1
BLAS
• Level 3
High Performance Computing 1
How efficient is the BLAS?
load/store float ops refs/ops
level 1
SAXPY 3N 2N 3/2
level 2
SGEMV MN+N+2M 2MN 1/2
level 3
SGEMM 2MN+MK+KN 2MNK 2/N
High Performance Computing 1
Matrix-vector
read x(1:n) into fast memory
read y(1:n) into fast memory
for i = 1:n
read row i of A into fast memory
for j = 1:n
y(i) = y(i) + A(i,j)*x(j)
write y(1:n) back to slow memory
High Performance Computing 1
Matrix-vector
• m=# slow memory refs = n^2 +3n
• f=# arithmetic ops = 2n^2
• q=f/m ~2
• Mat-vec multiple limited by slow memory
High Performance Computing 1
Matrix-matrix
High Performance Computing 1
Matrix Multiply - unblockedfor i = 1 to n
read row i of A into fast memory
for j = 1 to n
read C(i,j) into fast memory
read column j of B into fast memory
for k = 1 to n
C(i,j) = C(i,j) + A(i,k) * B(k,j)
write C(i,j) back to slow memory
*
High Performance Computing 1
Matrix Multiply unblockedNumber of slow memory references on unblocked matrix multiply
m = n^3 read each column of B n times
+ n^2 read each column of A once for each i
+ 2*n^2 read and write each element of C once
= n^3 + 3*n^2
So q = f/m = (2*n^3)/(n^3 + 3*n^2)
~= 2 for large n, no improvement over matrix-vector multiply
High Performance Computing 1
Matrix Multiply blockedConsider A,B,C to be N by N matrices of b by b subblocks where b=n/N is
called the blocksize
for i = 1 to N
for j = 1 to N
read block C(i,j) into fast memory
for k = 1 to N
read block A(i,k) into fast memory
read block B(k,j) into fast memory
C(i,j) = C(i,j) + A(i,k) * B(k,j) {do a matrix multiply on blocks}
write block C(i,j) back to slow memory
*
High Performance Computing 1
Matrix Multiply blockedNumber of slow memory references on blocked
matrix multiply
m = N*n^2 read each block of B N^3 times (N^3 * n/N * n/N)
+ N*n^2 read each block of A N^3 times
+ 2*n^2 read and write each block of C
= (2*N + 2)*n^2
So q = f/m = 2*n^3 / ((2*N + 2)*n^2)
~= n/N = b for large n
High Performance Computing 1
Matrix Multiply blocked
So we can improve performance by increasing the blocksize b
Can be much faster than matrix-vector multiplty (q=2)
Limit: All three blocks from A,B,C must fit in fast memory (cache), so we
cannot make these blocks arbitrarily large: 3*b^2 <= M, so q ~= b <= sqrt(M/3)
High Performance Computing 1
More on BLAS• Industry standard interface(evolving)
• Vendors, others supply optimized implementations
• History
–BLAS1 (1970s):
• vector operations: dot product, saxpy
• m=2*n, f=2*n, q ~1 or less
–BLAS2 (mid 1980s)
• matrix-vector operations
• m=n^2, f=2*n^2, q~2, less overhead
• somewhat faster than BLAS1
High Performance Computing 1
More on BLAS–BLAS3 (late 1980s)
• matrix-matrix operations: matrix matrix multiply, etc
• m >= 4n^2, f=O(n^3), so q can possibly be as large as n, so BLAS3 is potentially much faster than BLAS2
• Good algorithms used BLAS3 when possible (LAPACK)
• www.netlib.org/blas, www.netlib.org/lapack
High Performance Computing 1
BLAS on an IBM RS6000/590
BLAS 3
BLAS 2BLAS 1
BLAS 3 (n-by-n matrix matrix multiply) vs BLAS 2 (n-by-n matrix vector multiply) vs BLAS 1 (saxpy of n vectors)
Peak speed = 266 Mflops
Peak