The Spatter Benchmark
Or: Benchmarking and Modeling Sparse Memory Accesses for Heterogeneous Systems
Patrick Lavin, Jeffrey Young, Jason Riedy, Rich Vuduc
Purpose
• Dense memory access is well understood, but it is difficult to predict how a memory system will respond in irregular scenarios
• Indirection, poor spatial and temporal locality
• Spatter lets us view performance changes across architectures, so that we can better understand their differences
• You can't understand what you can't measure
Spatter
• Memory benchmark based on scatter/gather kernels
• Scatter: Y[j[:]] = X[:]
• Gather: Y[:] = X[i[:]]
• SG: Y[j[:]] = X[i[:]]
• Designed to model sparse data movement in applications like SuperLU and kernels like SpGEMM
• Includes effects of indirection and random or sparse access
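The three kernels can be sketched in plain Python (an illustrative sketch only; Spatter actually runs these as OpenMP/CUDA/OpenCL kernels, and the function names here are ours):

```python
def gather(src, idx):
    """Gather: dest[k] = src[idx[k]] for all k."""
    return [src[i] for i in idx]

def scatter(dest, idx, src):
    """Scatter: dest[idx[k]] = src[k] for all k."""
    for k, j in enumerate(idx):
        dest[j] = src[k]

def scatter_gather(dest, j_idx, src, i_idx):
    """SG: dest[j[k]] = src[i[k]] for all k."""
    for k in range(len(i_idx)):
        dest[j_idx[k]] = src[i_idx[k]]
```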
Gather
[Figure: a Gather operation — entries of SRC selected by the index buffer idx are copied into contiguous DEST]
Configuration
• Backend - OpenMP, OpenCL, or CUDA
• Work per thread (work item)
• CUDA (or OpenCL) block size
• Buffer sizes, cache-ability, access stride
Examples
Examples - CSR SpMV
• Gather elements of x, then do a dot product with data in A.
[Figure: y = A x]

for i in range(nrows):
    indices = row[i] : row[i+1]
    gather(tmp, x, col[indices])
    y[i] = dot_prod(val[indices], tmp)
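The CSR pseudocode can be made runnable; a minimal pure-Python sketch (the function name `csr_spmv` is ours, not from the slides):

```python
def csr_spmv(val, col, row, x):
    """y = A @ x for A in CSR form (values, column indices, row pointers).

    For each row, gather the needed entries of x into a contiguous
    buffer, then take a dense dot product with the row's values.
    """
    nrows = len(row) - 1
    y = [0.0] * nrows
    for i in range(nrows):
        lo, hi = row[i], row[i + 1]
        tmp = [x[col[k]] for k in range(lo, hi)]           # gather
        y[i] = sum(val[k] * tmp[k - lo] for k in range(lo, hi))
    return y
```

For example, A = [[1, 0], [2, 3]] in CSR is val=[1, 2, 3], col=[0, 0, 1], row=[0, 1, 3].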
Examples - CSC SpMV
• Scale a column of A by the corresponding value in x, then scatter-accumulate into y.
[Figure: y = A x]

for i in range(ncols):
    indices = col[i] : col[i+1]
    tmp = vector_scale(val[indices], x[i])
    scatter_accum(y, row[indices], tmp)
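The column-oriented loop can likewise be sketched in pure Python (the name `csc_spmv` is ours):

```python
def csc_spmv(val, row, col, x, nrows):
    """y = A @ x for A in CSC form (values, row indices, column pointers).

    Scale each column of A by x[i], then scatter-accumulate into y.
    """
    y = [0.0] * nrows
    for i in range(len(col) - 1):
        for k in range(col[i], col[i + 1]):
            y[row[k]] += val[k] * x[i]    # scatter-accumulate
    return y
```

For the same A = [[1, 0], [2, 3]], the CSC form is val=[1, 2, 3], row=[0, 1, 1], col=[0, 2, 3], and the result matches the CSR version.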
Examples - SpGEMM
• Scatter-accumulate columns of A corresponding to non-zero entries in a column of B into a dense SPA buffer. Gather SPA into C.
[Figure: C = A B, with dense SPA buffer]

Algorithm from Buluç and Gilbert, "Parallel Sparse Matrix-Matrix Multiplication and Indexing: Implementation and Experiments", https://doi.org/10.1137/110848244

for j in range(ncols):
    SPA = 0  # dense accumulation buffer
    for non-zero B(k,j):
        scatter_accum(SPA, A(:,k) * B(k,j))
    gather(C.val, SPA)
    gather(C.row, which(SPA))
    C.col[j+1] = C.col[j] + nnz(SPA)
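The SPA loop can be sketched for CSC inputs (a pure-Python illustration of the Buluç-Gilbert column-by-column scheme; all names are ours):

```python
def spgemm_csc(Aval, Arow, Acol, Bval, Brow, Bcol, nrows):
    """C = A @ B with A and B in CSC form, using a dense SPA per column of B."""
    Cval, Crow, Ccol = [], [], [0]
    for j in range(len(Bcol) - 1):
        spa = [0.0] * nrows                     # dense accumulation buffer
        for kk in range(Bcol[j], Bcol[j + 1]):  # non-zeros B(k, j)
            k, b = Brow[kk], Bval[kk]
            for aa in range(Acol[k], Acol[k + 1]):
                spa[Arow[aa]] += Aval[aa] * b   # scatter-accumulate A(:,k)*B(k,j)
        for r, v in enumerate(spa):             # gather SPA's nonzeros into C
            if v != 0.0:
                Cval.append(v)
                Crow.append(r)
        Ccol.append(len(Cval))
    return Cval, Crow, Ccol
```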
Examples - Vectorization
• Some forms of vectorization may naturally lead to Gather/Scatter operations

Column-major, scalar:
for j in range(N):
    for i in range(4):
        out[j] += data[i,j]

Vectorized:
for (j = 0; j < N; j += 8):
    for i in range(4):
        gather_accum_stride(temp, j+i, 8, 4)  # gather 8 elements, gap size 4
    out[j:j+8] = temp
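The access pattern the vectorized loop generates can be shown directly (a sketch; `gather_stride` is our name for the flattened-index view of the slide's `gather_accum_stride`):

```python
def gather_stride(flat, base, n, stride):
    """Gather n elements from a flat buffer starting at index `base`,
    stepping `stride` elements between picks — the pattern produced by
    a vectorized reduction over column-major data.
    """
    return [flat[base + k * stride] for k in range(n)]
```

For a 4-row column-major matrix stored flat, row i of columns j..j+7 is `gather_stride(flat, j*4 + i, 8, 4)`.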
Example - SuperLU
• SuperLU spends a large portion of its runtime just scattering data
[Chart: normalized execution time (0-100%) for matrices ND24K, BBMAT, and H2O, broken into SCATTER, DGEMM, and REST. Chart credit: Piyush Sao]
Benchmark Output
[Figure: sample Spatter benchmark output]
Performance Exploration
[Figure: six panels of effective bandwidth (% of BabelStream, 0-70) vs. work per thread (1-64), one curve per sparsity (1, 2, 4, ..., 128): Gather and Scatter bandwidth on Tesla P100 (CUDA backend), BDW (OpenMP backend), and Power8 (OpenMP backend)]
Uniform Stride
[Figure: effective bandwidth (% of BabelStream, 0-80) vs. access sparsity (1-128) for the GATHER and SCATTER kernels with linear access, across device-backends BDW-OMP, GV100-CUDA, K40c-CUDA, KNL-OMP, P100-CUDA, Power8-OMP, SNB-OMP, ThunderX2-OMP, and Titan Xp-CUDA; panels grouped by CPU vs. GPU]
Random Access
[Figure: effective bandwidth (% of BabelStream, 0-30) vs. access sparsity (1-128) for the GATHER kernel with uniform random access; GPU panels (K40c-CUDA, P100-CUDA, Titan Xp-CUDA) and CPU panels (BDW-OMP, KNL-OMP, Power8-OMP, SNB-OMP, ThunderX2-OMP)]
Energy Efficiency
[Figure: energy efficiency (Bytes/Joule, 0-1500) vs. access sparsity (1-128) for the GATHER and SCATTER kernels with linear (uniform-stride) access, across the same device-backends; panels grouped by CPU vs. GPU]
What's Next?
• Partner with industry to run on upcoming systems
• Evaluate slightly more complex synthetic traces
  • Mostly stride-1
  • Write collisions
• Gather/Scatter traces from real (DOE) mini-apps
• Measure impact of vector length (SVE and AVX) on generated code (and therefore cache performance)
• Cilk backend for EMU, FPGA-specific OpenCL backend
• More general kernels, with accumulation and a length buffer
• Simplify to present a STREAM-like result
More Info
• Spatter.io
  • Documentation
  • Guide to easily plot your GPU against ours
• ArXiv pre-print
  • "Spatter: A Benchmark Suite for Evaluating Sparse Access Patterns"
  • https://arxiv.org/abs/1811.03743
• Code
  • https://github.com/hpcgarage/spatter
Acknowledgements
This material is based upon work supported by the National Science Foundation under Award #1710371.
The Spatter Benchmark (spatter.io)
Or: Benchmarking and Modeling Sparse Memory Accesses for Heterogeneous Systems
Patrick Lavin, Jeffrey Young, Jason Riedy, Rich Vuduc