Tuning Sparse Matrix Vector Multiplication for multi-core SMPs
(details in paper at SC07)
Samuel Williams [1,2], Richard Vuduc [3], Leonid Oliker [1,2], John Shalf [2], Katherine Yelick [1,2], James Demmel [1,2]
[1] University of California, Berkeley; [2] Lawrence Berkeley National Laboratory; [3] Georgia Institute of Technology
Overview
Multicore is the de facto performance solution for the next decade
Examined the Sparse Matrix Vector Multiplication (SpMV) kernel: an important HPC kernel that is memory intensive and challenging for multicore
Present two auto-tuned threaded implementations: a Pthreads, cache-based implementation and a Cell local-store-based implementation
Benchmarked performance across 4 diverse multicore architectures: Intel Xeon (Clovertown), AMD Opteron, Sun Niagara2, IBM Cell Broadband Engine
Compare with the leading MPI implementation (PETSc) using an auto-tuned serial kernel (OSKI)
Show Cell delivers good performance and efficiency, while Niagara2 delivers good performance and productivity
Sparse Matrix Vector Multiplication
Sparse matrix: most entries are 0.0, so there is a performance advantage in storing and operating only on the nonzeros, but this requires significant metadata
Evaluate y = Ax, where A is a sparse matrix and x & y are dense vectors
Challenges: difficult to exploit ILP (bad for superscalar) and DLP (bad for SIMD), irregular memory access to the source vector, difficult to load balance, and very low computational intensity (often >6 bytes/flop)
[Figure: sparse matrix A times dense vector x producing dense vector y]
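For reference, a minimal serial SpMV kernel in CSR (compressed sparse row, the baseline format used later) looks roughly like the following. This is a sketch under the usual CSR conventions; the array names are illustrative, not taken from the paper's code:

    /* y = A*x for A in CSR: rowptr has n+1 entries, colind/values hold the nonzeros. */
    void spmv_csr(int n, const int *rowptr, const int *colind,
                  const double *values, const double *x, double *y)
    {
        for (int r = 0; r < n; r++) {
            double sum = 0.0;
            for (int k = rowptr[r]; k < rowptr[r+1]; k++)
                sum += values[k] * x[colind[k]];   /* irregular access to the source vector */
            y[r] = sum;
        }
    }

Each nonzero costs two flops (a multiply-add) against roughly 12 bytes of matrix data (an 8-byte value plus a 4-byte index), which is the source of the >6 bytes/flop figure above.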
Test Suite: Dataset (Matrices) and Multicore SMPs
Matrices Used
Pruned the original SPARSITY suite down to 14 matrices, none of which should fit in cache; subdivided them into 4 categories; rank ranges from 2K to 1M
Matrices: Dense, Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbor, QCD, FEM/Ship, Economics, Epidemiology, FEM/Accelerator, Circuit, Webbase, LP
Categories: a 2K x 2K dense matrix stored in sparse format; well structured (sorted by nonzeros/row); poorly structured hodgepodge; extreme aspect ratio (linear programming)
Multicore SMP Systems
[Block diagrams of the four test systems:]
Intel Clovertown: two sockets; each socket holds two Core2 pairs, each pair sharing a 4MB L2; front-side buses (10.6 GB/s each) into a chipset with 4x64b controllers to fully buffered DRAM; 21.3 GB/s read, 10.6 GB/s write
AMD Opteron: two dual-core sockets; 1MB victim cache per core; a memory controller per socket to DDR2 DRAM (10.6 GB/s each); sockets coupled by HyperTransport (4 GB/s each direction)
Sun Niagara2: 8 multithreaded UltraSPARC cores (8K D$ and an FPU each) connected through a crossbar switch (179 GB/s fill, 90 GB/s write-thru) to a 4MB, 16-way shared L2; 4x128b FBDIMM memory controllers to fully buffered DRAM; 42.7 GB/s read, 21.3 GB/s write
IBM Cell Blade: two Cell chips, each with a PPE (512K L2) and 8 SPEs (256K local store and an MFC each) on an EIB ring network (<<20 GB/s each direction); XDR DRAM at 25.6 GB/s per chip; chips coupled through the BIF
Multicore SMP Systems (memory hierarchy)
[Same four block diagrams. The Clovertown, Opteron, and Niagara2 have conventional cache-based memory hierarchies; the Cell blade has a disjoint local-store memory hierarchy.]
Multicore SMP Systems (2 implementations)
[Same four block diagrams. The cache+Pthreads SpMV implementation targets the Clovertown, Opteron, and Niagara2; the local-store SpMV implementation targets the Cell blade.]
Multicore SMP Systems (cache)
[Same four block diagrams, annotated with aggregate on-chip capacity: Clovertown 16MB (the vectors fit), Opteron 4MB, Niagara2 4MB, Cell 4MB (local store).]
Multicore SMP Systems (peak flops)
[Same four block diagrams, annotated with double-precision peak: Clovertown 75 Gflop/s (TLP + ILP + SIMD), Opteron 17 Gflop/s (TLP + ILP), Niagara2 11 Gflop/s (TLP), Cell 29 Gflop/s (TLP + ILP + SIMD).]
Multicore SMP Systems (peak read bandwidth)
[Same four block diagrams, annotated with peak read bandwidth: Clovertown 21 GB/s, Opteron 21 GB/s, Niagara2 43 GB/s, Cell 51 GB/s.]
Multicore SMP Systems (NUMA)
[Same four block diagrams. The Clovertown and Niagara2 are uniform memory access; the Opteron and Cell blade are non-uniform memory access.]
Naïve Implementation for Cache-based Machines
Performance across the suite; a median performance number is also included
Vanilla C Performance
[Charts: GFlop/s (0 to 1.0) across the matrix suite, plus the median, for Intel Clovertown, AMD Opteron, and Sun Niagara2.]
Vanilla C implementation; matrix stored in CSR (compressed sparse row)
Explored compiler options; only the best is presented here
An x86 core delivers >10x the performance of a Niagara2 thread
Pthread Implementation for Cache-Based Machines
SPMD, strong scaling
Optimized for multicore/threading
A variety of shared-memory programming models are acceptable (not just Pthreads)
More colors = more optimizations = more work
Naïve Parallel Performance
[Charts: GFlop/s (0 to 3.0) across the suite for Intel Clovertown, AMD Opteron, and Sun Niagara2; naïve Pthreads vs. naïve single thread.]
Simple row parallelization
Pthreads (didn't have to be)
A 1-socket Niagara2 delivers roughly 2x-5x the performance of the 2-socket x86 machines
Naïve Parallel Performance
[Same charts, annotated with scaling relative to one thread: Intel Clovertown, 8x cores = 1.9x performance; AMD Opteron, 4x cores = 1.5x performance; Sun Niagara2, 64x threads = 41x performance.]
Naïve Parallel Performance
[Same charts, annotated with hardware efficiency: Intel Clovertown, 1.4% of peak flops, 29% of bandwidth; AMD Opteron, 4% of peak flops, 20% of bandwidth; Sun Niagara2, 25% of peak flops, 39% of bandwidth.]
Case for Auto-tuning
How do we deliver good performance across all these architectures and all matrices without exhaustively hand-optimizing every combination?
Auto-tuning (in general):
Write a Perl script that generates all possible optimizations
Heuristically or exhaustively search the optimizations
Existing SpMV solution: OSKI (developed at UCB)
This work: tuning geared toward multicore/multithreading; generates SSE/SIMD intrinsics, prefetching, loop transformations, alternate data structures, etc.; a "prototype for parallel OSKI"
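One way to picture the search half of such an auto-tuner: time every generated kernel variant on the target matrix and keep the fastest. A minimal sketch; the variant table, trial count, and shared function signature are assumptions for illustration, not the paper's actual harness:

    #include <time.h>

    /* Every generated kernel variant is assumed to share this signature. */
    typedef void (*spmv_fn)(int n, const int *rowptr, const int *colind,
                            const double *values, const double *x, double *y);

    /* Return the fastest variant for this matrix by exhaustive timing. */
    spmv_fn pick_best(spmv_fn *variants, int nvariants, int n,
                      const int *rowptr, const int *colind,
                      const double *values, const double *x, double *y)
    {
        spmv_fn best = variants[0];
        double best_t = 1e30;
        for (int v = 0; v < nvariants; v++) {
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (int trial = 0; trial < 10; trial++)   /* average over a few trials */
                variants[v](n, rowptr, colind, values, x, y);
            clock_gettime(CLOCK_MONOTONIC, &t1);
            double dt = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
            if (dt < best_t) { best_t = dt; best = variants[v]; }
        }
        return best;
    }

A heuristic search would simply prune which entries appear in the variant table.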
Performance (+NUMA & SW Prefetching)
[Charts: GFlop/s (0 to 3.0) across the suite for Intel Clovertown, AMD Opteron, and Sun Niagara2; stacked as naïve single thread, naïve Pthreads, +NUMA/affinity, +software prefetching.]
NUMA-aware allocation
Tag prefetches with the appropriate temporal locality
Tune for the optimal prefetch distance
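As an illustration of the prefetching being tuned here, the inner CSR loop can issue software prefetches a fixed distance ahead of the nonzero stream. This is a sketch using the GCC builtin; the distance PF_DIST and the locality hint are exactly the parameters a tuner would search over:

    /* CSR inner loop with software prefetching of the nonzero stream.
     * PF_DIST (in elements) is tuned per platform; the third argument to
     * __builtin_prefetch is the temporal-locality hint (0 = non-temporal). */
    #define PF_DIST 64

    static double row_dot_prefetch(int start, int end, const int *colind,
                                   const double *values, const double *x)
    {
        double sum = 0.0;
        for (int k = start; k < end; k++) {
            __builtin_prefetch(&values[k + PF_DIST], 0, 0);
            __builtin_prefetch(&colind[k + PF_DIST], 0, 0);
            sum += values[k] * x[colind[k]];
        }
        return sum;
    }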
Matrix Compression
For memory-bound kernels, minimizing memory traffic should maximize performance
Compress the metadata; exploit structure to eliminate metadata
Heuristic: select the compression that minimizes the matrix size: power-of-2 register blocking, CSR/COO format, 16b/32b indices, etc.
Side effect: the matrix may be compressed to the point where it fits entirely in cache
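As one concrete example of the index compression, a 2x2 register-blocked format with 16-bit block-column indices roughly halves index traffic relative to 32-bit CSR. This is a sketch; the struct and field names are illustrative, and 16-bit indices assume each block's column range fits in 64K block columns (e.g., within a cache block):

    #include <stdint.h>

    /* 2x2 register-blocked sparse matrix with 16-bit block-column indices;
     * each stored block holds 4 values, zero-padded where necessary. */
    typedef struct {
        int       nbrows;    /* number of 2-row block rows */
        int      *browptr;   /* nbrows+1 offsets into bcol/vals */
        uint16_t *bcol;      /* block-column index (column/2) */
        double   *vals;      /* 4 values per block: r0c0 r0c1 r1c0 r1c1 */
    } bcsr22_t;

    void spmv_bcsr22(const bcsr22_t *A, const double *x, double *y)
    {
        for (int br = 0; br < A->nbrows; br++) {
            double y0 = 0.0, y1 = 0.0;
            for (int k = A->browptr[br]; k < A->browptr[br + 1]; k++) {
                const double *v  = &A->vals[4 * k];
                const double *xb = &x[2 * (int)A->bcol[k]];
                y0 += v[0] * xb[0] + v[1] * xb[1];
                y1 += v[2] * xb[0] + v[3] * xb[1];
            }
            y[2 * br]     = y0;
            y[2 * br + 1] = y1;
        }
    }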
Performance (+matrix compression)
[Charts: GFlop/s (0 to 6.0) across the suite for Intel Clovertown, AMD Opteron, and Sun Niagara2; stacked with +matrix compression added.]
Very significant on some matrices; missed the boat on others
Performance (+cache & TLB blocking)
[Charts: same format, with +cache/TLB blocking added to the stack.]
Cache blocking geared towards sparse accesses
Performance (more DIMMs, firmware fix, …)
[Charts: same format, with +more DIMMs / rank configuration added to the stack, for Intel Clovertown, AMD Opteron, and Sun Niagara2.]
Performance (more DIMMs, firmware fix, …)
[Same charts, annotated with hardware efficiency after all optimizations: Intel Clovertown, 4% of peak flops, 52% of bandwidth; AMD Opteron, 20% of peak flops, 66% of bandwidth; Sun Niagara2, 52% of peak flops, 54% of bandwidth.]
Performance (more DIMMs, firmware fix, …)
[Same charts, annotated with the optimizations essential on each machine: Intel Clovertown, 3 (Parallelize, Prefetch, Compress); AMD Opteron, 4 (Parallelize, NUMA, Prefetch, Compress); Sun Niagara2, 2 (Parallelize, Compress).]
Cell Implementation
Comments; Performance
Cell Implementation
No vanilla C implementation (aside from the PPE); in some cases, what were optional optimizations on cache-based machines are requirements for correctness on Cell
Even SIMDized double precision is extremely weak, and scalar double precision is unbearable; the minimum register blocking is 2x1 (SIMDizable), which can increase memory traffic by 66%
Optional cache blocking becomes required local-store blocking: spatial and temporal locality is captured by software when the matrix is optimized; in essence, the high bits of the column indices are grouped into DMA lists; branchless implementation
Despite the following performance numbers, Cell was still handicapped by double precision
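To make the local-store flavor concrete, the SPE code streams the matrix through the local store with double-buffered DMA, overlapping the transfer of the next chunk with computation on the current one. A conceptual sketch using the standard MFC intrinsics; the chunk size and layout are illustrative, not the paper's actual data structures:

    #include <stdint.h>
    #include <spu_mfcio.h>

    #define CHUNK_BYTES 16384

    static char buf[2][CHUNK_BYTES] __attribute__((aligned(128)));

    /* Walk nchunks contiguous chunks of matrix data starting at effective
     * address ea, double-buffering the DMA transfers. */
    void process_chunks(uint64_t ea, int nchunks)
    {
        int cur = 0;
        mfc_get(buf[cur], ea, CHUNK_BYTES, cur, 0, 0);          /* prime the pipeline */
        for (int i = 0; i < nchunks; i++) {
            int nxt = cur ^ 1;
            if (i + 1 < nchunks)                                /* start the next transfer */
                mfc_get(buf[nxt], ea + (uint64_t)(i + 1) * CHUNK_BYTES,
                        CHUNK_BYTES, nxt, 0, 0);
            mfc_write_tag_mask(1 << cur);                       /* wait for the current chunk */
            mfc_read_tag_status_all();
            /* ... multiply the nonzeros in buf[cur] against the local-store
                   resident slice of the source vector ... */
            cur = nxt;
        }
    }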
Performance (all)
[Charts: GFlop/s (0 to 12.0) across the suite for Intel Clovertown, AMD Opteron, Sun Niagara2, and IBM Cell Broadband Engine; Cell is shown for 1, 8, and 16 SPEs.]
Hardware Efficiency
[Same charts, annotated with efficiency: Intel Clovertown, 4% of peak flops, 52% of bandwidth; AMD Opteron, 20% of peak flops, 66% of bandwidth; Sun Niagara2, 52% of peak flops, 54% of bandwidth; IBM Cell Broadband Engine, 39% of peak flops, 89% of bandwidth.]
Productivity
[Same charts, annotated with the optimizations essential on each machine: Intel Clovertown, 3 (Parallelize, Prefetch, Compress); AMD Opteron, 4 (Parallelize, NUMA, Prefetch, Compress); Sun Niagara2, 2 (Parallelize, Compress); IBM Cell Broadband Engine, 5 (Parallelize, NUMA, DMA, Compress, Cache/local-store Block).]
Parallel Efficiency
[Plots: parallel efficiency vs. core/thread count for Intel Clovertown (1-8 cores), AMD Opteron (1-4 cores), Sun Niagara2 (up to 64 threads), and IBM Cell Broadband Engine (1-16 SPEs).]
Parallel Efficiency (breakdown)
[Same plots, with efficiency broken down into multicore, multisocket, and multithreading (MT) contributions.]
Multicore MPI Implementation
This is the default approach to programming multicore.
Multicore MPI Implementation
Used PETSc with shared-memory MPICH
Used OSKI (developed at UCB) to optimize each thread, i.e., a good auto-tuned shared-memory MPI implementation
[Charts: GFlop/s (0 to 4.0) across the suite for Intel Clovertown and AMD Opteron, comparing MPI (auto-tuned), Pthreads (auto-tuned), and the naïve single thread.]
Summary
Median Performance & Efficiency
[Charts: median GFlop/s and median power efficiency (MFlop/s/Watt) for Intel Clovertown, AMD Opteron, Sun Niagara2, and IBM Cell, comparing naïve, PETSc+OSKI (auto-tuned), and Pthreads (auto-tuned).]
A 1-socket Niagara2 was consistently better than the 2-socket x86 machines (more potential bandwidth)
Cell delivered by far the best performance (better utilization)
Used a digital power meter to measure sustained system power
FBDIMM is power hungry (12W/DIMM): the Clovertown (330W) and Niagara2 (350W) systems draw the most power
Summary
Paradoxically, the most complex/advanced architectures required the most tuning and delivered the lowest performance
Niagara2 delivered both very good performance and productivity
Cell delivered very good performance and efficiency: ~90% of memory bandwidth (most get ~50%), high power efficiency, easily understood performance; extra traffic = lower performance (future work can address this)
Our multicore-specific auto-tuned implementation significantly outperformed an auto-tuned MPI implementation
It exploited: matrix compression geared towards multicore rather than a single core, NUMA, and prefetching
Acknowledgments
UC Berkeley: RADLab cluster (Opterons), PSI cluster (Clovertowns)
Sun Microsystems: Niagara2 access
Forschungszentrum Jülich: Cell blade cluster access
Questions?
Backup Slides
Parallelization
Matrix partitioned by rows and balanced by the number of nonzeros (a sketch of such a partitioner follows below)
SPMD-like approach: a barrier() is called before and after the SpMV kernel
Each submatrix is stored separately in CSR
Load balancing can be challenging
Number of threads explored in powers of 2 (in paper)
[Figure: row-partitioned A multiplied by x to produce y]
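A minimal sketch of partitioning rows so that each thread gets roughly the same number of nonzeros (array names are illustrative; the paper's balancer may differ in detail):

    /* row_start has nthreads+1 entries; thread t owns rows
     * row_start[t] .. row_start[t+1]-1. */
    void partition_by_nnz(int n, const int *rowptr, int nthreads, int *row_start)
    {
        long long total  = rowptr[n];
        long long target = (total + nthreads - 1) / nthreads;
        int t = 0;
        row_start[0] = 0;
        for (int r = 0; r < n && t + 1 < nthreads; r++) {
            /* close thread t's range once it has reached its share of nonzeros */
            if ((long long)rowptr[r + 1] >= (long long)(t + 1) * target)
                row_start[++t] = r + 1;
        }
        while (t < nthreads)
            row_start[++t] = n;   /* remaining ranges end at the last row */
    }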
Exploiting NUMA, Affinity
Bandwidth on the Opteron (and Cell) can vary substantially based on the placement of data
Bind each submatrix and the thread that processes it together
Explored libnuma, Linux, and Solaris routines
Adjacent blocks bound to adjacent cores
[Figure: Opteron with DDR2 DRAM shown for three cases: a single thread; multiple threads using one memory controller; multiple threads using both memory controllers.]
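On Linux, one plausible way to realize this binding is libnuma for placement plus a CPU-affinity call for the thread. This is a sketch with error handling omitted and node/core numbering left to the caller; the paper also used Solaris routines:

    #define _GNU_SOURCE
    #include <stddef.h>
    #include <numa.h>
    #include <pthread.h>
    #include <sched.h>

    /* Allocate this thread's submatrix on its local NUMA node and pin the
     * calling thread to a core on that node. */
    void *alloc_and_bind(size_t bytes, int node, int core)
    {
        void *buf = numa_alloc_onnode(bytes, node);   /* submatrix lands on 'node' */

        cpu_set_t cpus;
        CPU_ZERO(&cpus);
        CPU_SET(core, &cpus);                         /* pin the thread to 'core' */
        pthread_setaffinity_np(pthread_self(), sizeof(cpus), &cpus);

        return buf;
    }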
Cache and TLB Blocking
Accesses to the matrix and destination vector are streaming, but access to the source vector can be random
Reorganize the matrix (and thus the access pattern) to maximize reuse; applies equally to TLB blocking (caching PTEs)
Heuristic: block the destination, then keep adding more columns as long as the number of source-vector cache lines (or pages) touched is less than the cache (or TLB) capacity; apply all previous optimizations individually to each cache block (a sketch of the counting step follows below)
Search: neither, cache, cache & TLB
Better locality at the expense of confusing the hardware prefetchers
[Figure: cache-blocked A multiplied by x to produce y]
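A sketch of the line-counting part of that heuristic: grow the column block while tracking how many distinct source-vector cache lines the block's nonzeros touch, and stop once the budget (cache lines, or pages for the TLB variant) is exceeded. The bitmap and constants here are illustrative assumptions, not the paper's code:

    #include <string.h>

    #define LINE_DOUBLES 8   /* 64-byte lines hold 8 doubles */

    /* Return the column width reached before a block of nonzeros (sorted by
     * column) touches more than max_lines distinct source-vector cache lines.
     * 'touched' is a caller-provided bitmap of ncols/LINE_DOUBLES+1 bytes. */
    int grow_col_block(int ncols, int nnz, const int *colind,
                       int max_lines, unsigned char *touched)
    {
        memset(touched, 0, (size_t)(ncols / LINE_DOUBLES + 1));
        int lines = 0, width = 0;
        for (int k = 0; k < nnz; k++) {
            int line = colind[k] / LINE_DOUBLES;
            if (!touched[line]) {
                touched[line] = 1;
                if (++lines > max_lines)
                    return width;          /* budget exceeded: stop growing here */
            }
            if (colind[k] + 1 > width)
                width = colind[k] + 1;
        }
        return width;                      /* the whole block fits in the budget */
    }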
Banks, Ranks, and DIMMs
In this SPMD approach, as the number of threads increases, so too does the number of concurrent streams to memory
Most memory controllers have a finite capability to reorder requests (DMA can avoid or minimize this)
Addressing/bank conflicts become increasingly likely
Adding more DIMMs and configuring ranks can help; the Clovertown system was already fully populated