The Spatter Benchmark
Or: Benchmarking and Modeling Sparse Memory Accesses for Heterogeneous Systems
Patrick Lavin, Jeffrey Young, Jason Riedy, Rich Vuduc
Purpose
• Dense memory access is well understood, but it is difficult to predict how a memory system will respond in irregular scenarios
• Indirection, poor spatial and temporal locality
• Spatter lets us view performance changes across architectures, so that we can better understand their differences
• You can't understand what you can't measure
Spatter
• Memory benchmark based on scatter/gather kernels
• Scatter: Y[j[:]] = X[:]
• Gather: Y[:] = X[i[:]]
• SG: Y[j[:]] = X[i[:]]
• Designed to model sparse data movement in applications like SuperLU and kernels like SpGEMM
• Includes effects of indirection and random or sparse access
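The three kernels can be sketched in plain Python (an illustrative sketch only; Spatter actually runs these as OpenMP/CUDA/OpenCL kernels, and the function names here are ours):

```python
def gather(src, idx):
    """Gather: dest[k] = src[idx[k]] for all k."""
    return [src[i] for i in idx]

def scatter(dest, idx, src):
    """Scatter: dest[idx[k]] = src[k] for all k."""
    for k, j in enumerate(idx):
        dest[j] = src[k]

def scatter_gather(dest, j_idx, src, i_idx):
    """SG: dest[j[k]] = src[i[k]] for all k."""
    for k in range(len(i_idx)):
        dest[j_idx[k]] = src[i_idx[k]]
```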
Gather
[Figure: a Gather operation — entries of SRC selected by the index buffer idx are copied into contiguous DEST]
Configuration
• Backend - OpenMP, OpenCL, or CUDA
• Work per thread (work item)
• CUDA (or OpenCL) block size
• Buffer sizes, cache-ability, access stride
Examples
Examples - CSR SpMV
• Gather elements of x, then do a dot product with data in A.
[Figure: y = A x]

for i in range(nrows):
    indices = row[i] : row[i+1]
    gather(tmp, x, col[indices])
    y[i] = dot_prod(val[indices], tmp)
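The CSR pseudocode can be made runnable; a minimal pure-Python sketch (the function name `csr_spmv` is ours, not from the slides):

```python
def csr_spmv(val, col, row, x):
    """y = A @ x for A in CSR form (values, column indices, row pointers).

    For each row, gather the needed entries of x into a contiguous
    buffer, then take a dense dot product with the row's values.
    """
    nrows = len(row) - 1
    y = [0.0] * nrows
    for i in range(nrows):
        lo, hi = row[i], row[i + 1]
        tmp = [x[col[k]] for k in range(lo, hi)]           # gather
        y[i] = sum(val[k] * tmp[k - lo] for k in range(lo, hi))
    return y
```

For example, A = [[1, 0], [2, 3]] in CSR is val=[1, 2, 3], col=[0, 0, 1], row=[0, 1, 3].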
Examples - CSC SpMV
• Scale a column of A by the corresponding value in x, then scatter-accumulate into y.
[Figure: y = A x]

for i in range(ncols):
    indices = col[i] : col[i+1]
    tmp = vector_scale(val[indices], x[i])
    scatter_accum(y, row[indices], tmp)
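The column-oriented loop can likewise be sketched in pure Python (the name `csc_spmv` is ours):

```python
def csc_spmv(val, row, col, x, nrows):
    """y = A @ x for A in CSC form (values, row indices, column pointers).

    Scale each column of A by x[i], then scatter-accumulate into y.
    """
    y = [0.0] * nrows
    for i in range(len(col) - 1):
        for k in range(col[i], col[i + 1]):
            y[row[k]] += val[k] * x[i]    # scatter-accumulate
    return y
```

For the same A = [[1, 0], [2, 3]], the CSC form is val=[1, 2, 3], row=[0, 1, 1], col=[0, 2, 3], and the result matches the CSR version.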
Examples - SpGEMM
• Scatter-accumulate columns of A corresponding to non-zero entries in a column of B into a dense SPA buffer. Gather SPA into C.
[Figure: C = A B, with dense SPA buffer]

Algorithm from Buluç and Gilbert, "Parallel Sparse Matrix-Matrix Multiplication and Indexing: Implementation and Experiments", https://doi.org/10.1137/110848244

for j in range(ncols):
    SPA = 0  # dense accumulation buffer
    for non-zero B(k,j):
        scatter_accum(SPA, A(:,k) * B(k,j))
    gather(C.val, SPA)
    gather(C.row, which(SPA))
    C.col[j+1] = C.col[j] + nnz(SPA)
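The SPA loop can be sketched for CSC inputs (a pure-Python illustration of the Buluç-Gilbert column-by-column scheme; all names are ours):

```python
def spgemm_csc(Aval, Arow, Acol, Bval, Brow, Bcol, nrows):
    """C = A @ B with A and B in CSC form, using a dense SPA per column of B."""
    Cval, Crow, Ccol = [], [], [0]
    for j in range(len(Bcol) - 1):
        spa = [0.0] * nrows                     # dense accumulation buffer
        for kk in range(Bcol[j], Bcol[j + 1]):  # non-zeros B(k, j)
            k, b = Brow[kk], Bval[kk]
            for aa in range(Acol[k], Acol[k + 1]):
                spa[Arow[aa]] += Aval[aa] * b   # scatter-accumulate A(:,k)*B(k,j)
        for r, v in enumerate(spa):             # gather SPA's nonzeros into C
            if v != 0.0:
                Cval.append(v)
                Crow.append(r)
        Ccol.append(len(Cval))
    return Cval, Crow, Ccol
```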
Examples - Vectorization
• Some forms of vectorization may naturally lead to Gather/Scatter operations

Column-major, scalar:
for j in range(N):
    for i in range(4):
        out[j] += data[i,j]

Vectorized:
for (j = 0; j < N; j += 8):
    for i in range(4):
        gather_accum_stride(temp, j+i, 8, 4)  # gather 8 elements, gap size 4
    out[j:j+8] = temp
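The access pattern the vectorized loop generates can be shown directly (a sketch; `gather_stride` is our name for the flattened-index view of the slide's `gather_accum_stride`):

```python
def gather_stride(flat, base, n, stride):
    """Gather n elements from a flat buffer starting at index `base`,
    stepping `stride` elements between picks — the pattern produced by
    a vectorized reduction over column-major data.
    """
    return [flat[base + k * stride] for k in range(n)]
```

For a 4-row column-major matrix stored flat, row i of columns j..j+7 is `gather_stride(flat, j*4 + i, 8, 4)`.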
Example - SuperLU
• SuperLU spends a large portion of its runtime just scattering data
[Chart: normalized execution time (0-100%) for matrices ND24K, BBMAT, and H2O, broken into SCATTER, DGEMM, and REST. Chart credit: Piyush Sao]
Benchmark Output
[Figure: sample Spatter benchmark output]
Performance Exploration
[Figure: six panels of effective bandwidth (% of BabelStream, 0-70) vs. work per thread (1-64), one curve per sparsity (1, 2, 4, ..., 128): Gather and Scatter bandwidth on Tesla P100 (CUDA backend), BDW (OpenMP backend), and Power8 (OpenMP backend)]
Uniform Stride
[Figure: effective bandwidth (% of BabelStream, 0-80) vs. access sparsity (1-128) for the GATHER and SCATTER kernels with linear access, across device-backends BDW-OMP, GV100-CUDA, K40c-CUDA, KNL-OMP, P100-CUDA, Power8-OMP, SNB-OMP, ThunderX2-OMP, and Titan Xp-CUDA; panels grouped by CPU vs. GPU]
Random Access
[Figure: effective bandwidth (% of BabelStream, 0-30) vs. access sparsity (1-128) for the GATHER kernel with uniform random access; GPU panels (K40c-CUDA, P100-CUDA, Titan Xp-CUDA) and CPU panels (BDW-OMP, KNL-OMP, Power8-OMP, SNB-OMP, ThunderX2-OMP)]
Energy Efficiency
[Figure: energy efficiency (Bytes/Joule, 0-1500) vs. access sparsity (1-128) for the GATHER and SCATTER kernels with linear (uniform-stride) access, across the same device-backends; panels grouped by CPU vs. GPU]
What's Next?
• Partner with industry to run on upcoming systems
• Evaluate slightly more complex synthetic traces
  • Mostly stride-1
  • Write collisions
• Gather/Scatter traces from real (DOE) mini-apps
• Measure impact of vector length (SVE and AVX) on generated code (and therefore cache performance)
• Cilk backend for EMU, FPGA-specific OpenCL backend
• More general kernels, with accumulation and a length buffer
• Simplify to present a STREAM-like result
More Info
• Spatter.io
  • Documentation
  • Guide to easily plot your GPU against ours
• ArXiv pre-print
  • "Spatter: A Benchmark Suite for Evaluating Sparse Access Patterns"
  • https://arxiv.org/abs/1811.03743
• Code
  • https://github.com/hpcgarage/spatter
Acknowledgements
This material is based upon work supported by the National Science Foundation under Award #1710371.
The Spatter Benchmark (spatter.io)
Or: Benchmarking and Modeling Sparse Memory Accesses for Heterogeneous Systems
Patrick Lavin, Jeffrey Young, Jason Riedy, Rich Vuduc