Linear Algebra on GPUs Vasily Volkov


Page 1:

Linear Algebra on GPUs

Vasily Volkov

Page 2:

GPU Architecture Features

• SIMD architecture
– Don't be confused by the scalar ISA, which is only a programming model
• We use a vector thread programming model instead (a minimal sketch follows this list)
– Similar to SSE/Cell SPE/vector units, not superscalar units
– Native vector length is 32; can simulate larger vectors (thread blocks)
• Multithreading using register windows
– A context switch never flushes registers to memory
– If more threads are launched than fit, some do not start until others finish
• Huge register files, small high-latency caches
– A fundamental difference from CPUs; similar to vector processors
– The cache is there to save memory bandwidth, not to reduce latency
– Use register blocking, not cache blocking
• Large startup costs (≈5 µs)
– Not good for fine-grained computations
– Use global barriers instead? (≈1.5 µs)
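To make the vector-thread view concrete, here is a minimal sketch (ours, not from the talk): each thread block acts as one vector thread, and the block size plays the role of the vector length, so a 64-thread block simulates a vector twice the native width of 32. The kernel name and launch configuration are illustrative.

```cuda
// Minimal sketch of the vector-thread program model (illustrative):
// a 64-thread block behaves like one vector thread of length 64,
// i.e. a simulated vector twice the native SIMD width of 32.
__global__ void saxpy_vector_thread(int n, float alpha,
                                    const float *x, float *y)
{
    // blockIdx.x selects the vector thread; threadIdx.x is the lane
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = alpha * x[i] + y[i];  // one elementwise op per lane
}

// Usage: saxpy_vector_thread<<<(n + 63) / 64, 64>>>(n, 2.0f, d_x, d_y);
```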

Page 3:

Relation to Other Multicores

Processor       Core2 SSE, 2.66GHz   Cell SPEs    G80/G84
Gflop/s/core    21.3                 25.6         19-23
Vector length   4                    4            32
# of cores      2-4                  8            4-16

• All offer both multithreading and SIMD capabilities
• Use CUDA to program all of them?

Page 4:

Pointer Chase Benchmark, 8800GTX

Run k = A[k] in a loop in a scalar thread (sketched below); the loop is latency bound, and the larger latency at a cache miss reveals the cache sizes:

• 8-word cache line
• L1: 8 × 5kB, each 20-way associative
• L2: 6 × 32kB, each 4-way associative
• 512kB memory pages
• TLB: 16 entries, fully associative

8800GTX/8800GTS/8600GTS/FX5600: different numbers of similar caches
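A hedged sketch of such a benchmark, under our own assumptions about setup (the array is pre-initialized on the host so that following A[k] produces the desired stride and working-set size; the kernel runs in a single thread):

```cuda
// Pointer-chase sketch: one scalar thread runs k = A[k] in a loop.
// The dependent loads serialize, so time/iteration ~= load latency,
// which jumps when the working set spills a cache level or the TLB.
__global__ void pointer_chase(const unsigned *A, int iterations,
                              unsigned *result)
{
    unsigned k = 0;
    for (int i = 0; i < iterations; i++)
        k = A[k];   // each load depends on the previous one
    *result = k;    // keep the compiler from eliminating the loop
}

// Usage: pointer_chase<<<1, 1>>>(d_A, 1 << 20, d_result);
```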

Page 5:

GPU Memory System (8800GTX)

Page 6:

Matrix-Matrix Multiply, C = C + AB

• 64×16 blocks in C, rank-4 updates
– Ensures that our code is compute-bound
• Each vector thread computes one block of C
– ijk/jik form; other choices produce a race condition in C
• Keep A's and C's blocks in vector registers
– Similarly done on the IBM 3090 VF and the Cray X1
• Keep B's block in local memory
– Others keep it in scalar registers (n/a on a GPU) or in caches (slow)
• Use compiler options to enforce a tight register budget
– To better hide memory latencies by multithreading
• Use prefetching to hide latencies even better
– Performance is then not bound by the latency or bandwidth of reading A's and B's blocks!
– It is bound by instruction issue and local memory ops (230 Gflop/s)

(A simplified sketch of this blocking follows the list.)
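The sketch below is a simplified reconstruction of the blocking structure just described, not the shipped kernel: it uses rank-1 updates inside a 16-step k-loop instead of rank-4 updates with prefetching, and it assumes column-major matrices whose dimensions are multiples of the block sizes.

```cuda
// Simplified reconstruction of the 64x16 register-blocked SGEMM
// (illustrative; the real kernel adds rank-4 updates and prefetching).
// Launch with one 64-thread block per 64x16 block of C:
//   dim3 grid(m / 64, n / 16);
//   sgemm_64x16<<<grid, 64>>>(A, B, C, m, n, k);
__global__ void sgemm_64x16(const float *A, const float *B, float *C,
                            int m, int n, int k)
{
    // Each thread owns one row of a 64x16 block of C, kept in registers.
    const int row  = blockIdx.x * 64 + threadIdx.x;
    const int col0 = blockIdx.y * 16;

    float c[16] = {0.0f};        // C block: register blocking
    __shared__ float b[16][16];  // B block: staged in local memory

    for (int kk = 0; kk < k; kk += 16) {
        // 64 threads cooperatively load a 16x16 block of B
        for (int i = threadIdx.x; i < 256; i += 64)
            b[i % 16][i / 16] = B[(kk + i % 16) + (col0 + i / 16) * k];
        __syncthreads();

        // 16 rank-1 updates; A is streamed through registers
        for (int j = 0; j < 16; j++) {
            float a = A[row + (kk + j) * m];
            for (int i = 0; i < 16; i++)
                c[i] += a * b[j][i];
        }
        __syncthreads();
    }

    // Write back the accumulated block: C = C + A*B
    for (int i = 0; i < 16; i++)
        C[row + (col0 + i) * m] += c[i];
}
```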

Page 7:

Performance of SGEMM

CUBLAS 1.1:
• keeps square blocks of A and B in local memory
• uses long vectors (poor instruction mix)
• exposing too much data parallelism may cost you

Our SGEMM is now in CUBLAS 2.0 beta.

Page 8:

SGEMM, 8800GTX, k = 1024

Constant work per vector thread (a function of k). The optimized version achieves better load balancing by computing partial sums.

Page 9:

Panel Factorization

CPU: runtime on Core2 Duo 2.66GHz, Intel MKL 10.0 (includes CPU-GPU transfer!)

GPU: estimated for the 8800GTX as

Time ≈ (number of kernel launches) × 5 µs + (bandwidth required) / (75 GB/s)
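As a worked instance of this estimate, with hypothetical counts (the actual launch and traffic figures for a panel are not given on the slide):

```latex
% Hypothetical numbers, for illustration only: a panel factorization
% issuing 200 kernel launches and moving 50 MB of data
T \approx 200 \times 5\,\mu\text{s}
        + \frac{50\,\text{MB}}{75\,\text{GB/s}}
  = 1\,\text{ms} + 0.67\,\text{ms}
  \approx 1.7\,\text{ms}
```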

Page 10:

Design of Matrix Factorizations

• Right-looking scheme = most parallelism = best on 16-core GPUs
• Crout scheme = least bandwidth = best on a 4-core GPU and if using CUBLAS 1.1
• Left-looking = half of the work in triangular solve = limited parallelism = inefficient
• 2-level blocking
– Both levels are right-looking + premature exit from the finer level to keep …
– Up to 6% speedup only, at large matrices (n ≈ 10,000)
• Block size on the GPU is 64 (same as in matrix multiply)
– Autotuning in QR (up to 7% speedup)
• Row-major layout on the GPU in LU decomposition
– Since gathers with a large stride are inefficient
– Requires transposition at every CPU-GPU transfer
– >2x speedup!
• Panel factorization on the CPU overlapped with BLAS3 on the GPU (use lookahead)
• Multiply by the inverse (GEMM) instead of a triangular solve (TRSM); see the sketch after this list
– TRSM vs. GEMM is 13 Gflop/s vs. 160 Gflop/s when the matrix is 64×64
– The parallelism in TRSM is not embarrassing enough
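A hedged sketch of the multiply-by-inverse trick using the legacy CUBLAS API of that era. The helper name, the scratch buffer, and the assumption that inv(L11) of the 64×64 triangular diagonal block was precomputed during the panel factorization are ours, not from the talk:

```cuda
#include <cublas.h>

// Replace the 64-wide triangular solve U12 = inv(L11) * A12 (cublasStrsm)
// with a GEMM against a precomputed inverse of the 64x64 diagonal block:
// roughly 160 Gflop/s instead of the ~13 Gflop/s the narrow TRSM reaches.
void solve_by_inverse(int n,               // columns in the block row
                      const float *d_Linv, // 64x64 inverse of L11 (precomputed)
                      float *d_B,          // 64 x n block row, in/out
                      float *d_work)       // 64 x n scratch buffer
{
    // d_work = inv(L11) * d_B  (column-major, legacy CUBLAS call)
    cublasSgemm('N', 'N', 64, n, 64,
                1.0f, d_Linv, 64, d_B, 64,
                0.0f, d_work, 64);
    // copy the result back over the block row
    cublasScopy(64 * n, d_work, 1, d_B, 1);
}
```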

Page 11:

Test Platform

• GPU
– Core2 Duo 2.67GHz + GeForce 8800 GTX
– Core2 Duo 2.67GHz + two GeForce 8800 GTX
• CPU
– Core2 Duo 2.67GHz
– Core2 Quad 2.4GHz

Page 12:

Summary of Performance

Page 13:

Speedups vs. CPU

Page 14:

Summary of Speedups

          8800GTX    Core2 Duo            Core2 Quad
          Gflop/s    Gflop/s   speedup    Gflop/s   speedup
Cholesky  183        23.0      7.4        34.9      5.5
LU        179        22.7      7.7        59.2      3.0
QR        192        22.5      8.3        44.8      4.3
SGEMM     206        25.9      8.0        69.8      3.0

Page 15:

Speedup using 2 GPUs

Using column-cyclic layout

Page 16:

Breakdown of runtime (LU)

Page 17:

What if we omit one optimization in LU?

Page 18:

Other Work Done

• Tridiagonal eigenvalue solver (bisection); see the Sturm-count sketch after this list
– Most work: factorize A − σI = LDLᵀ and count the signs in D (compute bound)
– Done for many shifts σ in parallel; traditionally vectorized
– If more parallelism is needed, do multisection instead of bisection
  • But it increases the total flop count
– The rest is difficult to parallelize and is not worth it
– Our solution:
  • Run the vectorized loops on the GPU, the rest (least work) on the CPU
  • Autotune to decide the optimal redundancy and when to involve the CPU
  • Use features of IEEE arithmetic to save another 15-30% of runtime
  • Up to 156x faster than LAPACK on a Pentium 4
• Tridiagonal eigenvector solver (inverse iteration)
– Most work: solve (A − σI)x_{k+1} = x_k for fixed σ (bandwidth bound)
– Factorize A − σI = LDLᵀ once and keep only D; reconstruct L on demand
  • Reconstruction is overlapped with memory access; still bandwidth bound
– Don't pivot; recompute with safe code on the CPU if this fails
– Up to 120x faster than LAPACK on a Core2 Duo so far
– More complicated when eigenvalues are clustered
• Stencil computation (7-point on a 3D grid)
– Blocks in registers and local memory
– Bandwidth-bound; runs at up to 66% of pin bandwidth
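A hedged sketch of the Sturm count that dominates the bisection solver: each thread factors A − σI = LDLᵀ for one shift and counts the negative pivots, which equals the number of eigenvalues below σ. The kernel and array names are ours; the comment about IEEE arithmetic matches the slide's note about exploiting it.

```cuda
// Sturm count for a symmetric tridiagonal matrix with diagonal d[] and
// off-diagonal e[]: one thread per shift sigma[i] (the vectorized loop
// that runs on the GPU). count[i] = #eigenvalues below sigma[i].
__global__ void sturm_count(const float *d, const float *e, int n,
                            const float *sigma, int *count, int nshifts)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nshifts) return;

    float s  = sigma[i];
    float di = d[0] - s;          // first pivot of the LDL^T factorization
    int neg  = (di < 0.0f);
    for (int j = 1; j < n; j++) {
        // d_j = (a_j - sigma) - e_{j-1}^2 / d_{j-1}
        // IEEE arithmetic: a zero pivot yields +/-inf here, and the
        // recurrence still produces correct signs, so no branch is needed
        di = (d[j] - s) - e[j - 1] * e[j - 1] / di;
        neg += (di < 0.0f);
    }
    count[i] = neg;
}
```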

Page 19:

Future Work

• Analysis of architecture
– Find the best parallels in past architectures to reuse their methods
– Catch up with newer GPUs
– More micro-benchmarking to get better performance models
• More scientific kernels
– CUFFT is ≈50 Gflop/s; can do better (e.g., by not computing sin/cos in the inner loop)
• More LAPACK
– Two-sided factorizations used in eigensolvers and the SVD
  • LAPACK does 50% of the work in BLAS1/BLAS2
  • A mostly-BLAS3 algorithm is known, but it requires more flops if eigenvectors are needed
  • May use divide-and-conquer instead
– MRRR (an improved inverse iteration algorithm, also rich in parallelism)
– Non-symmetric eigensolvers such as QR iterations
  • Currently fine-grained; can do better?
– Iterative refinement for the eigenvalue problem?
• ScaLAPACK (distributed-memory LAPACK)
– One-sided factorizations on a GPU cluster