Tuning Sparse Matrix Vector Multiplication for multi-core SMPs
(details in paper at SC07)
Samuel Williams [1,2], Richard Vuduc [3], Leonid Oliker [1,2], John Shalf [2], Katherine Yelick [1,2], James Demmel [1,2]
[1] University of California, Berkeley; [2] Lawrence Berkeley National Laboratory; [3] Georgia Institute of Technology
Overview
Multicore is the de facto performance solution for the next decade
Examined the Sparse Matrix Vector Multiplication (SpMV) kernel: an important HPC kernel that is memory intensive and challenging for multicore
Present two auto-tuned threaded implementations: a Pthreads, cache-based implementation and a Cell local-store-based implementation
Benchmarked performance across 4 diverse multicore architectures: Intel Xeon (Clovertown), AMD Opteron, Sun Niagara2, IBM Cell Broadband Engine
Compare with the leading MPI implementation (PETSc) using an auto-tuned serial kernel (OSKI)
Show Cell delivers good performance and efficiency, while Niagara2 delivers good performance and productivity
Sparse Matrix Vector Multiplication
Sparse matrix: most entries are 0.0, so there is a performance advantage in storing and operating only on the nonzeros, but this requires significant metadata
Evaluate y = Ax, where A is a sparse matrix and x & y are dense vectors
Challenges: difficult to exploit ILP (bad for superscalar) and DLP (bad for SIMD), irregular memory access to the source vector, difficult to load balance, and very low computational intensity (often >6 bytes/flop)
[Figure: sparse matrix A times dense vector x producing dense vector y]
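For reference, a minimal serial SpMV kernel in CSR (compressed sparse row, the baseline format used later) looks roughly like the following. This is a sketch under the usual CSR conventions; the array names are illustrative, not taken from the paper's code:

    /* y = A*x for A in CSR: rowptr has n+1 entries, colind/values hold the nonzeros. */
    void spmv_csr(int n, const int *rowptr, const int *colind,
                  const double *values, const double *x, double *y)
    {
        for (int r = 0; r < n; r++) {
            double sum = 0.0;
            for (int k = rowptr[r]; k < rowptr[r+1]; k++)
                sum += values[k] * x[colind[k]];   /* irregular access to the source vector */
            y[r] = sum;
        }
    }

Each nonzero costs two flops (a multiply-add) against roughly 12 bytes of matrix data (an 8-byte value plus a 4-byte index), which is the source of the >6 bytes/flop figure above.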
Test Suite: Dataset (Matrices) and Multicore SMPs
Matrices Used
Pruned the original SPARSITY suite down to 14 matrices, none of which should fit in cache; subdivided them into 4 categories; rank ranges from 2K to 1M
Matrices: Dense, Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbor, QCD, FEM/Ship, Economics, Epidemiology, FEM/Accelerator, Circuit, Webbase, LP
Categories: a 2K x 2K dense matrix stored in sparse format; well structured (sorted by nonzeros/row); poorly structured hodgepodge; extreme aspect ratio (linear programming)
Multicore SMP Systems
[Block diagrams of the four test systems:]
Intel Clovertown: two sockets; each socket holds two Core2 pairs, each pair sharing a 4MB L2; front-side buses (10.6 GB/s each) into a chipset with 4x64b controllers to fully buffered DRAM; 21.3 GB/s read, 10.6 GB/s write
AMD Opteron: two dual-core sockets; 1MB victim cache per core; a memory controller per socket to DDR2 DRAM (10.6 GB/s each); sockets coupled by HyperTransport (4 GB/s each direction)
Sun Niagara2: 8 multithreaded UltraSPARC cores (8K D$ and an FPU each) connected through a crossbar switch (179 GB/s fill, 90 GB/s write-thru) to a 4MB, 16-way shared L2; 4x128b FBDIMM memory controllers to fully buffered DRAM; 42.7 GB/s read, 21.3 GB/s write
IBM Cell Blade: two Cell chips, each with a PPE (512K L2) and 8 SPEs (256K local store and an MFC each) on an EIB ring network (<<20 GB/s each direction); XDR DRAM at 25.6 GB/s per chip; chips coupled through the BIF
Multicore SMP Systems (memory hierarchy)
[Same four block diagrams. The Clovertown, Opteron, and Niagara2 have conventional cache-based memory hierarchies; the Cell blade has a disjoint local-store memory hierarchy.]
Multicore SMP Systems (2 implementations)
[Same four block diagrams. The cache+Pthreads SpMV implementation targets the Clovertown, Opteron, and Niagara2; the local-store SpMV implementation targets the Cell blade.]
Multicore SMP Systems (cache)
[Same four block diagrams, annotated with aggregate on-chip capacity: Clovertown 16MB (the vectors fit), Opteron 4MB, Niagara2 4MB, Cell 4MB (local store).]
Multicore SMP Systems (peak flops)
[Same four block diagrams, annotated with double-precision peak: Clovertown 75 Gflop/s (TLP + ILP + SIMD), Opteron 17 Gflop/s (TLP + ILP), Niagara2 11 Gflop/s (TLP), Cell 29 Gflop/s (TLP + ILP + SIMD).]
Multicore SMP Systems (peak read bandwidth)
[Same four block diagrams, annotated with peak read bandwidth: Clovertown 21 GB/s, Opteron 21 GB/s, Niagara2 43 GB/s, Cell 51 GB/s.]
Multicore SMP Systems (NUMA)
[Same four block diagrams. The Clovertown and Niagara2 are uniform memory access; the Opteron and Cell blade are non-uniform memory access.]
Naïve Implementation for Cache-based Machines
Performance across the suite; a median performance number is also included
Vanilla C Performance
[Charts: GFlop/s (0 to 1.0) across the matrix suite, plus the median, for Intel Clovertown, AMD Opteron, and Sun Niagara2.]
Vanilla C implementation; matrix stored in CSR (compressed sparse row)
Explored compiler options; only the best is presented here
An x86 core delivers >10x the performance of a Niagara2 thread
Pthread Implementation for Cache-Based Machines
SPMD, strong scaling
Optimized for multicore/threading
A variety of shared-memory programming models are acceptable (not just Pthreads)
More colors = more optimizations = more work
Naïve Parallel Performance
[Charts: GFlop/s (0 to 3.0) across the suite for Intel Clovertown, AMD Opteron, and Sun Niagara2; naïve Pthreads vs. naïve single thread.]
Simple row parallelization
Pthreads (didn't have to be)
A 1-socket Niagara2 delivers roughly 2x-5x the performance of the 2-socket x86 machines
Naïve Parallel Performance
[Same charts, annotated with scaling relative to one thread: Intel Clovertown, 8x cores = 1.9x performance; AMD Opteron, 4x cores = 1.5x performance; Sun Niagara2, 64x threads = 41x performance.]
Naïve Parallel Performance
[Same charts, annotated with hardware efficiency: Intel Clovertown, 1.4% of peak flops, 29% of bandwidth; AMD Opteron, 4% of peak flops, 20% of bandwidth; Sun Niagara2, 25% of peak flops, 39% of bandwidth.]
Case for Auto-tuning
How do we deliver good performance across all these architectures and all matrices without exhaustively hand-optimizing every combination?
Auto-tuning (in general):
Write a Perl script that generates all possible optimizations
Heuristically or exhaustively search the optimizations
Existing SpMV solution: OSKI (developed at UCB)
This work: tuning geared toward multicore/multithreading; generates SSE/SIMD intrinsics, prefetching, loop transformations, alternate data structures, etc.; a "prototype for parallel OSKI"
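One way to picture the search half of such an auto-tuner: time every generated kernel variant on the target matrix and keep the fastest. A minimal sketch; the variant table, trial count, and shared function signature are assumptions for illustration, not the paper's actual harness:

    #include <time.h>

    /* Every generated kernel variant is assumed to share this signature. */
    typedef void (*spmv_fn)(int n, const int *rowptr, const int *colind,
                            const double *values, const double *x, double *y);

    /* Return the fastest variant for this matrix by exhaustive timing. */
    spmv_fn pick_best(spmv_fn *variants, int nvariants, int n,
                      const int *rowptr, const int *colind,
                      const double *values, const double *x, double *y)
    {
        spmv_fn best = variants[0];
        double best_t = 1e30;
        for (int v = 0; v < nvariants; v++) {
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (int trial = 0; trial < 10; trial++)   /* average over a few trials */
                variants[v](n, rowptr, colind, values, x, y);
            clock_gettime(CLOCK_MONOTONIC, &t1);
            double dt = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
            if (dt < best_t) { best_t = dt; best = variants[v]; }
        }
        return best;
    }

A heuristic search would simply prune which entries appear in the variant table.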
Performance (+NUMA & SW Prefetching)
[Charts: GFlop/s (0 to 3.0) across the suite for Intel Clovertown, AMD Opteron, and Sun Niagara2; stacked as naïve single thread, naïve Pthreads, +NUMA/affinity, +software prefetching.]
NUMA-aware allocation
Tag prefetches with the appropriate temporal locality
Tune for the optimal prefetch distance
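As an illustration of the prefetching being tuned here, the inner CSR loop can issue software prefetches a fixed distance ahead of the nonzero stream. This is a sketch using the GCC builtin; the distance PF_DIST and the locality hint are exactly the parameters a tuner would search over:

    /* CSR inner loop with software prefetching of the nonzero stream.
     * PF_DIST (in elements) is tuned per platform; the third argument to
     * __builtin_prefetch is the temporal-locality hint (0 = non-temporal). */
    #define PF_DIST 64

    static double row_dot_prefetch(int start, int end, const int *colind,
                                   const double *values, const double *x)
    {
        double sum = 0.0;
        for (int k = start; k < end; k++) {
            __builtin_prefetch(&values[k + PF_DIST], 0, 0);
            __builtin_prefetch(&colind[k + PF_DIST], 0, 0);
            sum += values[k] * x[colind[k]];
        }
        return sum;
    }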
Matrix Compression
For memory-bound kernels, minimizing memory traffic should maximize performance
Compress the metadata; exploit structure to eliminate metadata
Heuristic: select the compression that minimizes the matrix size: power-of-2 register blocking, CSR/COO format, 16b/32b indices, etc.
Side effect: the matrix may be compressed to the point where it fits entirely in cache
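As one concrete example of the index compression, a 2x2 register-blocked format with 16-bit block-column indices roughly halves index traffic relative to 32-bit CSR. This is a sketch; the struct and field names are illustrative, and 16-bit indices assume each block's column range fits in 64K block columns (e.g., within a cache block):

    #include <stdint.h>

    /* 2x2 register-blocked sparse matrix with 16-bit block-column indices;
     * each stored block holds 4 values, zero-padded where necessary. */
    typedef struct {
        int       nbrows;    /* number of 2-row block rows */
        int      *browptr;   /* nbrows+1 offsets into bcol/vals */
        uint16_t *bcol;      /* block-column index (column/2) */
        double   *vals;      /* 4 values per block: r0c0 r0c1 r1c0 r1c1 */
    } bcsr22_t;

    void spmv_bcsr22(const bcsr22_t *A, const double *x, double *y)
    {
        for (int br = 0; br < A->nbrows; br++) {
            double y0 = 0.0, y1 = 0.0;
            for (int k = A->browptr[br]; k < A->browptr[br + 1]; k++) {
                const double *v  = &A->vals[4 * k];
                const double *xb = &x[2 * (int)A->bcol[k]];
                y0 += v[0] * xb[0] + v[1] * xb[1];
                y1 += v[2] * xb[0] + v[3] * xb[1];
            }
            y[2 * br]     = y0;
            y[2 * br + 1] = y1;
        }
    }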
Performance (+matrix compression)
[Charts: GFlop/s (0 to 6.0) across the suite for Intel Clovertown, AMD Opteron, and Sun Niagara2; stacked with +matrix compression added.]
Very significant on some matrices; missed the boat on others
Performance (+cache & TLB blocking)
[Charts: same format, with +cache/TLB blocking added to the stack.]
Cache blocking geared towards sparse accesses
Performance (more DIMMs, firmware fix, …)
[Charts: same format, with +more DIMMs / rank configuration added to the stack, for Intel Clovertown, AMD Opteron, and Sun Niagara2.]
Performance (more DIMMs, firmware fix, …)
[Same charts, annotated with hardware efficiency after all optimizations: Intel Clovertown, 4% of peak flops, 52% of bandwidth; AMD Opteron, 20% of peak flops, 66% of bandwidth; Sun Niagara2, 52% of peak flops, 54% of bandwidth.]
Performance (more DIMMs, firmware fix, …)
[Same charts, annotated with the optimizations essential on each machine: Intel Clovertown, 3 (Parallelize, Prefetch, Compress); AMD Opteron, 4 (Parallelize, NUMA, Prefetch, Compress); Sun Niagara2, 2 (Parallelize, Compress).]
Cell Implementation
Comments; Performance
Cell Implementation
No vanilla C implementation (aside from the PPE); in some cases, what were optional optimizations on cache-based machines are requirements for correctness on Cell
Even SIMDized double precision is extremely weak, and scalar double precision is unbearable; the minimum register blocking is 2x1 (SIMDizable), which can increase memory traffic by 66%
Optional cache blocking becomes required local-store blocking: spatial and temporal locality is captured by software when the matrix is optimized; in essence, the high bits of the column indices are grouped into DMA lists; branchless implementation
Despite the following performance numbers, Cell was still handicapped by double precision
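To make the local-store flavor concrete, the SPE code streams the matrix through the local store with double-buffered DMA, overlapping the transfer of the next chunk with computation on the current one. A conceptual sketch using the standard MFC intrinsics; the chunk size and layout are illustrative, not the paper's actual data structures:

    #include <stdint.h>
    #include <spu_mfcio.h>

    #define CHUNK_BYTES 16384

    static char buf[2][CHUNK_BYTES] __attribute__((aligned(128)));

    /* Walk nchunks contiguous chunks of matrix data starting at effective
     * address ea, double-buffering the DMA transfers. */
    void process_chunks(uint64_t ea, int nchunks)
    {
        int cur = 0;
        mfc_get(buf[cur], ea, CHUNK_BYTES, cur, 0, 0);          /* prime the pipeline */
        for (int i = 0; i < nchunks; i++) {
            int nxt = cur ^ 1;
            if (i + 1 < nchunks)                                /* start the next transfer */
                mfc_get(buf[nxt], ea + (uint64_t)(i + 1) * CHUNK_BYTES,
                        CHUNK_BYTES, nxt, 0, 0);
            mfc_write_tag_mask(1 << cur);                       /* wait for the current chunk */
            mfc_read_tag_status_all();
            /* ... multiply the nonzeros in buf[cur] against the local-store
                   resident slice of the source vector ... */
            cur = nxt;
        }
    }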
Performance (all)
[Charts: GFlop/s (0 to 12.0) across the suite for Intel Clovertown, AMD Opteron, Sun Niagara2, and IBM Cell Broadband Engine; Cell is shown for 1, 8, and 16 SPEs.]
Hardware Efficiency
[Same charts, annotated with efficiency: Intel Clovertown, 4% of peak flops, 52% of bandwidth; AMD Opteron, 20% of peak flops, 66% of bandwidth; Sun Niagara2, 52% of peak flops, 54% of bandwidth; IBM Cell Broadband Engine, 39% of peak flops, 89% of bandwidth.]
Productivity
[Same charts, annotated with the optimizations essential on each machine: Intel Clovertown, 3 (Parallelize, Prefetch, Compress); AMD Opteron, 4 (Parallelize, NUMA, Prefetch, Compress); Sun Niagara2, 2 (Parallelize, Compress); IBM Cell Broadband Engine, 5 (Parallelize, NUMA, DMA, Compress, Cache/local-store Block).]
Parallel Efficiency
[Plots: parallel efficiency vs. core/thread count for Intel Clovertown (1-8 cores), AMD Opteron (1-4 cores), Sun Niagara2 (up to 64 threads), and IBM Cell Broadband Engine (1-16 SPEs).]
Parallel Efficiency (breakdown)
[Same plots, with efficiency broken down into multicore, multisocket, and multithreading (MT) contributions.]
Multicore MPI Implementation
This is the default approach to programming multicore.
Multicore MPI Implementation
Used PETSc with shared-memory MPICH
Used OSKI (developed at UCB) to optimize each thread, i.e., a good auto-tuned shared-memory MPI implementation
[Charts: GFlop/s (0 to 4.0) across the suite for Intel Clovertown and AMD Opteron, comparing MPI (auto-tuned), Pthreads (auto-tuned), and the naïve single thread.]
Summary
Median Performance & Efficiency
[Charts: median GFlop/s and median power efficiency (MFlop/s/Watt) for Intel Clovertown, AMD Opteron, Sun Niagara2, and IBM Cell, comparing naïve, PETSc+OSKI (auto-tuned), and Pthreads (auto-tuned).]
A 1-socket Niagara2 was consistently better than the 2-socket x86 machines (more potential bandwidth)
Cell delivered by far the best performance (better utilization)
Used a digital power meter to measure sustained system power
FBDIMM is power hungry (12W/DIMM): the Clovertown (330W) and Niagara2 (350W) systems draw the most power
Summary
Paradoxically, the most complex/advanced architectures required the most tuning and delivered the lowest performance
Niagara2 delivered both very good performance and productivity
Cell delivered very good performance and efficiency: ~90% of memory bandwidth (most get ~50%), high power efficiency, easily understood performance; extra traffic = lower performance (future work can address this)
Our multicore-specific auto-tuned implementation significantly outperformed an auto-tuned MPI implementation
It exploited: matrix compression geared towards multicore rather than a single core, NUMA, and prefetching
Acknowledgments
UC Berkeley: RADLab cluster (Opterons), PSI cluster (Clovertowns)
Sun Microsystems: Niagara2 access
Forschungszentrum Jülich: Cell blade cluster access
Questions?
Backup Slides
Parallelization
Matrix partitioned by rows and balanced by the number of nonzeros (a sketch of such a partitioner follows below)
SPMD-like approach: a barrier() is called before and after the SpMV kernel
Each submatrix is stored separately in CSR
Load balancing can be challenging
Number of threads explored in powers of 2 (in paper)
[Figure: row-partitioned A multiplied by x to produce y]
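A minimal sketch of partitioning rows so that each thread gets roughly the same number of nonzeros (array names are illustrative; the paper's balancer may differ in detail):

    /* row_start has nthreads+1 entries; thread t owns rows
     * row_start[t] .. row_start[t+1]-1. */
    void partition_by_nnz(int n, const int *rowptr, int nthreads, int *row_start)
    {
        long long total  = rowptr[n];
        long long target = (total + nthreads - 1) / nthreads;
        int t = 0;
        row_start[0] = 0;
        for (int r = 0; r < n && t + 1 < nthreads; r++) {
            /* close thread t's range once it has reached its share of nonzeros */
            if ((long long)rowptr[r + 1] >= (long long)(t + 1) * target)
                row_start[++t] = r + 1;
        }
        while (t < nthreads)
            row_start[++t] = n;   /* remaining ranges end at the last row */
    }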
Exploiting NUMA, Affinity
Bandwidth on the Opteron (and Cell) can vary substantially based on the placement of data
Bind each submatrix and the thread that processes it together
Explored libnuma, Linux, and Solaris routines
Adjacent blocks bound to adjacent cores
[Figure: Opteron with DDR2 DRAM shown for three cases: a single thread; multiple threads using one memory controller; multiple threads using both memory controllers.]
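On Linux, one plausible way to realize this binding is libnuma for placement plus a CPU-affinity call for the thread. This is a sketch with error handling omitted and node/core numbering left to the caller; the paper also used Solaris routines:

    #define _GNU_SOURCE
    #include <stddef.h>
    #include <numa.h>
    #include <pthread.h>
    #include <sched.h>

    /* Allocate this thread's submatrix on its local NUMA node and pin the
     * calling thread to a core on that node. */
    void *alloc_and_bind(size_t bytes, int node, int core)
    {
        void *buf = numa_alloc_onnode(bytes, node);   /* submatrix lands on 'node' */

        cpu_set_t cpus;
        CPU_ZERO(&cpus);
        CPU_SET(core, &cpus);                         /* pin the thread to 'core' */
        pthread_setaffinity_np(pthread_self(), sizeof(cpus), &cpus);

        return buf;
    }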
Cache and TLB Blocking
Accesses to the matrix and destination vector are streaming, but access to the source vector can be random
Reorganize the matrix (and thus the access pattern) to maximize reuse; applies equally to TLB blocking (caching PTEs)
Heuristic: block the destination, then keep adding more columns as long as the number of source-vector cache lines (or pages) touched is less than the cache (or TLB) capacity; apply all previous optimizations individually to each cache block (a sketch of the counting step follows below)
Search: neither, cache, cache & TLB
Better locality at the expense of confusing the hardware prefetchers
[Figure: cache-blocked A multiplied by x to produce y]
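A sketch of the line-counting part of that heuristic: grow the column block while tracking how many distinct source-vector cache lines the block's nonzeros touch, and stop once the budget (cache lines, or pages for the TLB variant) is exceeded. The bitmap and constants here are illustrative assumptions, not the paper's code:

    #include <string.h>

    #define LINE_DOUBLES 8   /* 64-byte lines hold 8 doubles */

    /* Return the column width reached before a block of nonzeros (sorted by
     * column) touches more than max_lines distinct source-vector cache lines.
     * 'touched' is a caller-provided bitmap of ncols/LINE_DOUBLES+1 bytes. */
    int grow_col_block(int ncols, int nnz, const int *colind,
                       int max_lines, unsigned char *touched)
    {
        memset(touched, 0, (size_t)(ncols / LINE_DOUBLES + 1));
        int lines = 0, width = 0;
        for (int k = 0; k < nnz; k++) {
            int line = colind[k] / LINE_DOUBLES;
            if (!touched[line]) {
                touched[line] = 1;
                if (++lines > max_lines)
                    return width;          /* budget exceeded: stop growing here */
            }
            if (colind[k] + 1 > width)
                width = colind[k] + 1;
        }
        return width;                      /* the whole block fits in the budget */
    }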
Banks, Ranks, and DIMMs
In this SPMD approach, as the number of threads increases, so too does the number of concurrent streams to memory
Most memory controllers have a finite capability to reorder requests (DMA can avoid or minimize this)
Addressing/bank conflicts become increasingly likely
Adding more DIMMs and configuring ranks can help; the Clovertown system was already fully populated