VSC Seminar
Modern Many-Core Architectures for Supercomputing
Karl Rupp
https://karlrupp.net/
https://github.com/karlrupp/slides/
formerly: Institute for Microelectronics, TU Wien
now: Freelance Scientist
December 11, 2015
2
Overview
Hardware
Graphics Processing Units
Many Integrated Cores (Xeon Phi family)
Software
CUDA
OpenCL
Parallel Primitives
Performance Modeling
Identifying Bottlenecks
Modeling by Example
3
GPU Overview
Computing Architecture Schematic
(Schematic: CPU and GPU connected via PCI Express)
PCI Express: 8 GB/s, ~1 us latency
GPU: 1000 GFLOPs SP, 250 GFLOPs DP, memory bandwidth 8x20 GB/s
CPU: 100 GFLOPs SP, 50 GFLOPs DP, memory bandwidth 2x12 GB/s
Good for large FLOP-intensive tasks, high memory bandwidth
PCI-Express can be a bottleneck
> 10-fold speedups (usually) not backed by hardware
4
GPU Overview
(Schematic: the host connects to the GPU via PCI-Express at 10 GB/s with 1-10 us latency; on the GPU, each workgroup has its own shared memory, and all workgroups access GPU main memory at 200 GB/s with 10-100 ns latency)
Details
Workgroups consist of 32-64 hardware threads
Up to 24 hardware workgroups
Shared memory small: approx. 32-64 KB
5
GPU Overview
Reminder: AVX
One instruction for all elements of a vector register
op(r1-8)
Single Instruction Multiple Threads (SIMT)
One instruction for all threads in workgroup
Each thread has separate registers
Efficient if all threads execute the same instruction
op(r1) op(r2) op(r3) op(r4) op(r5) op(r6) op(r7) op(r8)
6
GPU Overview
GDDR5
Optimized for throughput
Channel width: multiple of 32 bits
High bus width: 256 bits, 384 bits
Structured Memory Access
Memory controllers use 32/64/128 byte transactions
Partial transactions degrade effective bandwidth
128 Byte 128 Byte 128 Byte
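To make the cost of partial transactions concrete, the following sketch counts how many 128-byte segments a group of threads touches when each thread reads one 8-byte double. The function name `segments_touched`, the 128-byte-aligned array base, and the choice of 32 threads in the usage note are illustrative assumptions, not hardware specifications.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Counts the distinct 128-byte segments touched when `threads` threads
// each read one 8-byte double at element index offset + t * stride.
// Assumption: the array base address is aligned to 128 bytes.
inline std::size_t segments_touched(std::size_t threads, std::size_t offset, std::size_t stride)
{
  assert(threads > 0);
  const std::size_t segment_bytes = 128;
  std::set<std::size_t> segments;
  for (std::size_t t = 0; t < threads; ++t) {
    const std::size_t byte_address = (offset + t * stride) * sizeof(double);
    segments.insert(byte_address / segment_bytes);
  }
  return segments.size();
}
```

With 32 threads, contiguous aligned access touches 2 segments, an offset of one element touches 3, and a stride of 16 elements touches 32 segments for the same 256 bytes of useful data.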
7
GPU Overview
Host-Device Communication
PCI-Express v2: 8 GB/sec max
PCI-Express v3: 16 GB/sec max
Latency: about 10 µs
(Plots: transfer time in seconds and bandwidth in GB/sec versus message size in bytes for K20m, W9100, A10-5800K, and Xeon Phi)
8
Many Integrated Core Architecture
Background
Many-core architecture by Intel
“Just recompile” existing OpenMP-enabled code
Current: Knights Corner (Q4 2012)
Upcoming: Knights Landing (Q1 2016)
9
Many Integrated Core Architecture
Knights Corner: Technical Details
High memory bandwidth: 320 GB/sec ideal, 160 GB/sec real
High compute power: 1 TFLOP/sec (fp64)
Fixed main memory: 8 or 16 GB
Custom operating system on board
(Plot: STREAM Benchmark Results, bandwidth in GB/sec versus number of threads for E5-2670 v3 (Haswell), E5-2650 v2 (Ivy Bridge), E5-2620 (Sandy Bridge), and Xeon Phi 7120)
10
Many Integrated Core Architecture
Knights Corner: Problems
Very low single-thread performance
“Just recompile” almost never enough
PCI-Express communication slower than for GPUs
Even Intel now recommends Haswell-Xeons over Knights Corner
http://xkcd.com/386/
11
Many Integrated Core Architecture
KNL: Compute
72 cores
Energy-efficient IA cores
3x single-thread performance vs. KNC
Binary compatibility with Xeon line
3+ TFLOPS in double precision
KNL: On-Package Memory
16 GB
5x bandwidth vs. DDR4
5x power efficiency vs. DDR4
12
Many Integrated Core Architecture
KNL Schematic
36 tiles interconnected by 2D mesh
Each tile: 2 cores, 2 vector units per core, 1 MB L2 cache
(Schematic: package with a mesh of tiles, MCDRAM modules, and two groups of 3 DDR4 channels)
13
Many Integrated Core Architecture
HBM: Cache Mode
No code changes necessary
Cache misses expensive
HBM: Flat Mode
MCDRAM mapped to physical address space
Exposed as NUMA node
Accessed through memkind library or numactl
HBM: Hybrid Mode
Combination of the above two
Get the best and worst of two worlds
14
Overview
Software: CUDA
15
CUDA
About
Initial release in 2007
Proprietary programming model by NVIDIA
C++ with extensions
Proprietary compiler extracts GPU kernels
Software Ecosystem
Vendor-tuned libraries: cuBLAS, cuSparse, cuSolver, cuFFT, etc.
Python bindings: pyCUDA
Community projects: CUSP, MAGMA, ViennaCL, etc.
16
CUDA
Programming in CUDA
void work(double *x, double *y, double *z, int N)
{
  #pragma omp parallel
  {
    int thread_id = omp_get_thread_num();
    for (size_t i=thread_id; i<N; i += omp_get_num_threads())
      z[i] = x[i] + y[i];
  }
}
int main(int argc, char **argv)
{
  int N = atoi(argv[1]);
  double *x = malloc(N*sizeof(double));
  ...
  work(x, y, z, N); // call kernel
  ...
  free(x);
}
17
CUDA
Programming in CUDA
__global__ void work(double *x, double *y, double *z, int N)
{
  int thread_id = blockIdx.x*blockDim.x + threadIdx.x;
  for (size_t i=thread_id; i<N; i += blockDim.x * gridDim.x)
    z[i] = x[i] + y[i];
}
int main(int argc, char **argv)
{
  int N = atoi(argv[1]);
  double *x = (double*)malloc(N*sizeof(double));
  double *gpu_x;
  cudaMalloc(&gpu_x, N*sizeof(double));
  cudaMemcpy(gpu_x, x, N*sizeof(double), cudaMemcpyHostToDevice);
  ...
  work<<<128, 256>>>(gpu_x, gpu_y, gpu_z, N); // call kernel with device pointers
  ...
  cudaMemcpy(z, gpu_z, N*sizeof(double), cudaMemcpyDeviceToHost); // copy result back to host
  ...
  free(x);
}
19
CUDA
Thread Control (1D)
Local ID in block: threadIdx.x
Threads per block: blockDim.x
ID of block: blockIdx.x
No. of blocks: gridDim.x
Recommended Default Values
Typical block size: 256 or 512
Typical number of blocks: 256
At least 10 000 logical threads recommended
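As a rough illustration of how these four quantities combine, the sketch below emulates the grid-stride loop used in the kernels on the CPU and records how often each index is visited. The function name `grid_stride_visits` is hypothetical, and the sequential loops only stand in for threads that run concurrently on the GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU emulation of a 1D grid-stride loop: iterates over every logical thread
// and counts how often each index i in [0, N) is processed.
inline std::vector<int> grid_stride_visits(std::size_t N, std::size_t grid_dim, std::size_t block_dim)
{
  assert(grid_dim > 0 && block_dim > 0);
  std::vector<int> visits(N, 0);
  for (std::size_t block_idx = 0; block_idx < grid_dim; ++block_idx) {
    for (std::size_t thread_idx = 0; thread_idx < block_dim; ++thread_idx) {
      // Same formula as blockIdx.x*blockDim.x + threadIdx.x in the kernels.
      const std::size_t thread_id = block_idx * block_dim + thread_idx;
      for (std::size_t i = thread_id; i < N; i += block_dim * grid_dim)
        ++visits[i];
    }
  }
  return visits;
}
```

Every index is visited exactly once regardless of whether N is larger or smaller than the number of logical threads, which is why the launch configuration can stay fixed while N varies.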
20
CUDA Example
Offset Memory Access
__global__
void work(double *x, double *y, double *z, int N, int k)
{
  int thread_id = blockIdx.x*blockDim.x + threadIdx.x;
  for (size_t i=thread_id; i<N; i += blockDim.x * gridDim.x)
    z[i+k] = x[i+k] + y[i+k];
}
(Plot: Memory Bandwidth with Offsets - Double Precision, bandwidth in GB/sec versus offset from 0 to 30 for NVIDIA GeForce GTX 470 (Fermi), NVIDIA Tesla K20m (Kepler), and NVIDIA GTX 750 Ti (Maxwell))
21
CUDA Example
Strided Memory Access
__global__
void work(double *x, double *y, double *z, int N, int k)
{
  int thread_id = blockIdx.x*blockDim.x + threadIdx.x;
  for (size_t i=thread_id; i<N; i += blockDim.x * gridDim.x)
    z[i*k] = x[i*k] + y[i*k];
}
(Plot: Memory Bandwidth with Strides - Double Precision, bandwidth in GB/sec versus stride from 5 to 30 for NVIDIA GeForce GTX 470 (Fermi), NVIDIA Tesla K20m (Kepler), and NVIDIA GTX 750 Ti (Maxwell))
22
CUDA Example
Strided Memory Access
Array of structs problematic
typedef struct particle
{
  double pos_x; double pos_y; double pos_z;
  double vel_x; double vel_y; double vel_z;
  double mass;
} Particle;
__global__
void increase_mass(Particle *particles, int N)
{
  int thread_id = blockIdx.x*blockDim.x + threadIdx.x;
  for (int i=thread_id; i<N; i += blockDim.x * gridDim.x)
    particles[i].mass *= 2.0;
}
23
CUDA Example
Strided Memory Access
Workaround: Structure of Arrays
typedef struct particles
{
  double *pos_x; double *pos_y; double *pos_z;
  double *vel_x; double *vel_y; double *vel_z;
  double *mass;
} Particles;
__global__
void increase_mass(Particles particles, int N) // struct of device pointers, passed by value
{
  int thread_id = blockIdx.x*blockDim.x + threadIdx.x;
  for (int i=thread_id; i<N; i += blockDim.x * gridDim.x)
    particles.mass[i] *= 2.0;
}
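The difference between the two layouts can be seen from the byte distance between the masses of consecutive particles. The following host-side sketch uses hypothetical type names `ParticleAoS` and `ParticlesSoA` and `std::vector` storage in place of device pointers; it is meant to illustrate the strides, not to reproduce the kernels.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array of structs: one record per particle.
struct ParticleAoS {
  double pos_x, pos_y, pos_z;
  double vel_x, vel_y, vel_z;
  double mass;
};

// Structure of arrays: one contiguous array per field.
struct ParticlesSoA {
  std::vector<double> pos_x, pos_y, pos_z;
  std::vector<double> vel_x, vel_y, vel_z;
  std::vector<double> mass;
};

// Byte distance between the mass entries of particle i and particle i+1.
inline std::size_t mass_stride_aos() { return sizeof(ParticleAoS); }
inline std::size_t mass_stride_soa() { return sizeof(double); }

// Doubling all masses touches every seventh double in the AoS layout ...
inline void increase_mass(std::vector<ParticleAoS> &particles)
{
  for (std::size_t i = 0; i < particles.size(); ++i)
    particles[i].mass *= 2.0;
}

// ... but one contiguous array in the SoA layout.
inline void increase_mass(ParticlesSoA &particles)
{
  assert(!particles.mass.empty());
  for (std::size_t i = 0; i < particles.mass.size(); ++i)
    particles.mass[i] *= 2.0;
}
```

In the AoS layout the mass update is a strided access with a stride of seven doubles, so most of each memory transaction carries positions and velocities that are never used.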
24
Overview
Software: OpenCL
25
History of OpenCL
Prior to 2008
OpenCL developed by Apple Inc.
2008
OpenCL working group formed at Khronos Group
OpenCL specification 1.0 released
2010
OpenCL 1.1 (multi-device, subbuffer manipulation)
2011
OpenCL 1.2 (device partitioning)
2013
OpenCL 2.0 (shared virtual memory, SPIR, etc.)
26
OpenCL
Similar to CUDA
Kernel language is a subset of C
Explicit memory management, host-device transfers
Memory model: local, shared, global
Different from CUDA
Support by many vendors
No compiler-wrapper, only a shared library
Kernel compilation usually at runtime
27
OpenCL Platform Model
(Diagram: Platform 1 contains Context 1; the context holds a device list (Device 1, Device 2, Device 3), Memory Buffers 1 to k, Program 1 with Kernels 1,1 to 1,n, Program 2 with Kernels 2,1 to 2,m, and Command Queues 1 and 2)
28
OpenCL
OpenCL Thread Control (1D) vs. CUDA
                     OpenCL               CUDA
Local ID in block:   get_local_id(0)      threadIdx.x
Threads per block:   get_local_size(0)    blockDim.x
ID of block:         get_group_id(0)      blockIdx.x
No. of blocks:       get_num_groups(0)    gridDim.x
Global thread ID:    get_global_id(0)
No. of threads:      get_global_size(0)
29
Overview
Software: Parallel Primitives
30
Parallel Primitives
Reductions
Use N values to compute 1 result value
Examples: Dot-products, vector norms, etc.
Reductions with Few Threads
Decompose N into chunks for each thread
Compute chunks in parallel
Merge results with single thread
Result
31
Parallel Primitives
Reductions with Many Threads
Decompose N into chunks for each workgroup
Use fast on-chip synchronization within each workgroup
Sum result for each workgroup separately
32
Parallel Primitives
Reductions with Many Threads
Values:            10   1   8  -1   0  -2   3   5  -2  -3   2   7   0  11   0   3
Step 1, Stride 8:   8  -2  10   6   0   9   3   8  -2  -3   2   7   0  11   0   3
Step 2, Stride 4:   8   7  13  14   0   9   3   8  -2  -3   2   7   0  11   0   3
Step 3, Stride 2:  21  21  13  14   0   9   3   8  -2  -3   2   7   0  11   0   3
Step 4, Stride 1:  42  21  13  14   0   9   3   8  -2  -3   2   7   0  11   0   3
shared_m[threadIdx.x] = thread_sum;
for (int stride = blockDim.x/2; stride>0; stride/=2)
{
  __syncthreads();
  if (threadIdx.x < stride)
    shared_m[threadIdx.x] += shared_m[threadIdx.x+stride];
}
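The table above can be reproduced with a small CPU emulation of the kernel loop. This is a sketch under the assumption that the workgroup size is a power of two; the function name `tree_reduce` is hypothetical, and the inner loop runs sequentially where the GPU threads would run concurrently between barriers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU emulation of the shared-memory tree reduction for one workgroup.
// shared_m.size() plays the role of blockDim.x and must be a power of two.
inline double tree_reduce(std::vector<double> shared_m)
{
  const std::size_t block_dim = shared_m.size();
  assert(block_dim > 0 && (block_dim & (block_dim - 1)) == 0);
  for (std::size_t stride = block_dim / 2; stride > 0; stride /= 2) {
    // Within one step, thread t reads index t + stride and writes index t,
    // so the sequential loop gives the same result as the concurrent threads.
    for (std::size_t thread_idx = 0; thread_idx < stride; ++thread_idx)
      shared_m[thread_idx] += shared_m[thread_idx + stride];
  }
  return shared_m[0];   // the workgroup result ends up in the first entry
}
```

For the 16 values of the table the emulation returns 42 after four steps, compared with 15 additions in a serial loop.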
33
Parallel Primitives
Prefix Sum
Inclusive: Determine $y_i = \sum_{k=1}^{i} x_k$
Exclusive: Determine $y_i = \sum_{k=1}^{i-1} x_k$, $y_1 = 0$
Example
x: 4, 3, 6, 5, 4, 7, 4, 4, 4
y: 4, 7, 13, 18, 22, 29, 33, 37, 41 (inclusive)
y: 0, 4, 7, 13, 18, 22, 29, 33, 37 (exclusive)
Applications
Sparse matrix setup
Graph algorithms
(Figure: graph with nodes numbered 1 to 9)
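A serial reference for both variants helps when checking a parallel implementation. The sketch below uses `std::partial_sum` for the inclusive scan and a running sum for the exclusive scan; the helper names, including `csr_row_pointers` for the sparse matrix setup application, are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Inclusive prefix sum: y[i] = x[0] + ... + x[i].
inline std::vector<int> inclusive_scan_of(const std::vector<int> &x)
{
  std::vector<int> y(x.size());
  std::partial_sum(x.begin(), x.end(), y.begin());
  return y;
}

// Exclusive prefix sum: y[i] = x[0] + ... + x[i-1], with y[0] = 0.
inline std::vector<int> exclusive_scan_of(const std::vector<int> &x)
{
  std::vector<int> y(x.size());
  int running_sum = 0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    y[i] = running_sum;
    running_sum += x[i];
  }
  return y;
}

// Sparse matrix setup: CSR row pointers from the number of nonzeros per row.
inline std::vector<int> csr_row_pointers(const std::vector<int> &nnz_per_row)
{
  int total = 0;
  for (std::size_t i = 0; i < nnz_per_row.size(); ++i) {
    assert(nnz_per_row[i] >= 0);
    total += nnz_per_row[i];
  }
  std::vector<int> row_ptr = exclusive_scan_of(nnz_per_row);
  row_ptr.push_back(total);   // one-past-the-end entry
  return row_ptr;
}
```

The exclusive variant is the one needed for CSR setup, because the entry point of a row is the number of nonzeros in all preceding rows.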
34
Parallel Primitives
Prefix Sum Implementation
Values:            0   2   1   6   3   1   2   5
Step 1, Stride 1:  0   2   3   7   9   4   3   7
Step 2, Stride 2:  0   2   3   9  12  11  12  11
Step 3, Stride 4:  0   2   3   9  12  13  15  20
for (int stride = 1; stride < blockDim.x; stride *= 2)
{
  __syncthreads();
  shared_buffer[threadIdx.x] = my_value;
  __syncthreads();
  if (threadIdx.x >= stride)
    my_value += shared_buffer[threadIdx.x - stride];
}
__syncthreads();
shared_buffer[threadIdx.x] = my_value;
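The three steps of the table can be checked with a CPU emulation of this loop. In the sketch below, the vector `my_value` holds the register value of each thread and the copy into `shared_buffer` stands in for the write between the two barriers; the function name `workgroup_inclusive_scan` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU emulation of the inclusive scan within one workgroup.
// my_value[t] is the per-thread register of thread t; its size is blockDim.x.
inline std::vector<int> workgroup_inclusive_scan(std::vector<int> my_value)
{
  const std::size_t block_dim = my_value.size();
  assert(block_dim > 0);
  std::vector<int> shared_buffer(block_dim);
  for (std::size_t stride = 1; stride < block_dim; stride *= 2) {
    // All threads publish their current value (done between two barriers on the GPU).
    shared_buffer = my_value;
    // Threads with index >= stride add the value published 'stride' positions to the left.
    for (std::size_t thread_idx = stride; thread_idx < block_dim; ++thread_idx)
      my_value[thread_idx] += shared_buffer[thread_idx - stride];
  }
  return my_value;
}
```

This variant performs on the order of N log N additions rather than N, trading extra arithmetic for only log2(N) synchronized steps.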
35
Parallel Primitives
ILU - Basic Idea
Factor sparse matrix A ≈ LU
L and U sparse, triangular
ILU0: Pattern of L, U equal to A
ILUT: Keep k elements per row
Solver Cycle Phase
Residual correction LUx = z
Forward solve Ly = z
Backward solve Ux = y
Little parallelism in general
(Figure: sparsity pattern of an example sparse matrix)
36
Parallel Primitives
ILU Level Scheduling
Build dependency graph
Substitute as many entries as possible simultaneously
Trade-off: Each step vs. multiple steps in a single kernel
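One way to build the dependency information is to assign each row a level: a row can be substituted as soon as all rows it depends on are done, and rows on the same level can be processed simultaneously. The sketch below assumes a lower-triangular factor L stored in CSR format (`row_ptr`, `col_idx`); the function name `forward_solve_levels` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Level of each row for the forward solve Ly = z with sparse lower-triangular L (CSR).
// Row i depends on every row j < i for which L(i,j) is nonzero.
// Rows with equal level have no mutual dependencies.
inline std::vector<int> forward_solve_levels(const std::vector<int> &row_ptr,
                                             const std::vector<int> &col_idx)
{
  assert(!row_ptr.empty());
  const std::size_t n = row_ptr.size() - 1;
  std::vector<int> level(n, 0);
  for (std::size_t i = 0; i < n; ++i) {
    for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
      const std::size_t j = static_cast<std::size_t>(col_idx[k]);
      if (j < i)   // strictly lower-triangular entry: row i waits for row j
        level[i] = std::max(level[i], level[j] + 1);
    }
  }
  return level;
}
```

For a 2x2 grid with a 5-point stencil the levels are 0, 1, 1, 2, so the two middle unknowns can be substituted at the same time. The number of levels is the number of sequential steps, which is where the trade-off between one kernel launch per step and a single kernel with internal synchronization arises.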
37
Parallel Primitives
ILU Interpretation on Structured Grids
2d finite-difference discretization
Substitution whenever all neighbors with smaller index computed
Works particularly well in 3d
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
(Figures: three further builds of this slide show the same grid with alternative numberings of the unknowns)
38
Parallel Primitives
Other Parallel Primitives
Sort
Gather and Scatter
Load to shared memory and work there
etc.
GPU-Accelerated Software Libraries
Linear Algebra: ViennaCL, MAGMA, CUSP, VexCL, ...
Solvers: ViennaCL, MAGMA, cuSolver, Paralution, clAMG, ...
FFT: cuFFT, clFFT, FFTW, ...
Primitives: VexCL, Boost.Compute, ...
Machine Learning: Caffe, cuDNN, ...
39
Overview
Performance Modeling: Bottlenecks
40
Bottleneck Potpourri
Arithmetic Intensity
Number of FLOPs per Byte
FLOP-limited: AI larger than 1-10
Memory-limited: AI smaller than 1-10
(Plot: Floating Point Operations per Byte, Double Precision, by end of year from 2007 to 2014, for Intel CPUs (Xeon X5482 to Xeon E5-2699 v3), NVIDIA GPUs (Tesla C1060 to Tesla K40), AMD GPUs (Radeon HD 3870 to FirePro W9100), and Intel MIC (Xeon Phi X7120X))
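The distinction between FLOP-limited and memory-limited kernels can be turned into a small estimate. The sketch below uses a roofline-style bound, which is a modeling technique brought in here for illustration and not named on the slide; the function names and the hardware numbers in the usage note are assumptions.

```cpp
#include <algorithm>
#include <cassert>

// Roofline-style bound: a kernel cannot exceed the peak FLOP rate, nor the rate
// at which memory can feed it (arithmetic intensity times bandwidth).
inline double attainable_gflops(double flop_per_byte, double peak_gflops, double bandwidth_gb_per_sec)
{
  assert(flop_per_byte >= 0.0 && peak_gflops > 0.0 && bandwidth_gb_per_sec > 0.0);
  return std::min(peak_gflops, flop_per_byte * bandwidth_gb_per_sec);
}

// A kernel is memory-limited if its arithmetic intensity is below peak/bandwidth.
inline bool is_memory_limited(double flop_per_byte, double peak_gflops, double bandwidth_gb_per_sec)
{
  return flop_per_byte * bandwidth_gb_per_sec < peak_gflops;
}
```

Assuming 1000 GFLOP/sec peak and 160 GB/sec of memory bandwidth, a kernel with 1 FLOP per 24 bytes is bounded at roughly 6.7 GFLOP/sec, far below peak.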
41
Bottleneck Potpourri
Latency
Bottleneck in strong scaling limit
Ultimate limit for time stepping
Latency - Sources
Network latency (Ethernet ∼ 20 µs, Infiniband ∼ 5 µs)
PCI-Express latency (Kernel launches, ∼ 10 µs)
Thread synchronization (barriers, locks, ∼ 1-10 µs)
Memory latency (∼ 100 ns)
(Plot: Time per CG Iteration - NVIDIA K20m, execution time in seconds, BLAS-based, CSR)
42
Bottleneck Potpourri
Load Imbalance
Total execution time determined by slowest thread
Focus on making the slowest thread fast
Easy for static data structures (e.g. dense matrices)
Hard for dynamic data structures (e.g. sparse matrices)
43
Bottleneck Potpourri
Amdahl’s Law
$T_{total} = T_{serial} + T_{parallel} / \#processors$
Speed-up limited by serial portion of an algorithm
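The limit can be made explicit by normalizing the single-processor time to 1, so that the serial part takes a fraction s and the parallel part 1 - s. The sketch below evaluates the resulting speed-up; the function names are hypothetical.

```cpp
#include <cassert>

// Total time relative to the single-processor run, following
// T_total = T_serial + T_parallel / #processors with T_serial = s and T_parallel = 1 - s.
inline double amdahl_time(double serial_fraction, double processors)
{
  assert(serial_fraction >= 0.0 && serial_fraction <= 1.0 && processors >= 1.0);
  return serial_fraction + (1.0 - serial_fraction) / processors;
}

// Speed-up over the single-processor run; bounded by 1 / serial_fraction.
inline double amdahl_speedup(double serial_fraction, double processors)
{
  return 1.0 / amdahl_time(serial_fraction, processors);
}
```

With a serial fraction of 10 percent, the speed-up stays below 10 no matter how many processors are used.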
44
Overview
Performance Modeling: Examples
45
Performance Modeling: Vector Addition
Vector Addition
x = y + z with N elements each
1 FLOP per 24 bytes in double precision
Limited by memory bandwidth ⇒ $T(N) \stackrel{?}{\approx} 3 \times 8 \times N / \mathrm{Bandwidth} + \mathrm{Latency}$
(Plot: Execution Time for x = y + z in Double Precision versus vector size, NVIDIA Tesla K20m compared with the bandwidth limit and the bandwidth + latency limit)
(Plot: Memory Bandwidth for x = y + z in Double Precision, bandwidth in GB/sec versus vector size, NVIDIA Tesla K20m)
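The model for the vector addition is simple enough to evaluate directly. The sketch below computes the predicted time and the resulting effective bandwidth; the function names and the bandwidth and latency values in the usage note are illustrative assumptions, not measurements.

```cpp
#include <cassert>

// Predicted time in seconds for x = y + z with n doubles per vector:
// three vectors of 8 bytes per entry are moved, plus a fixed latency.
inline double vector_add_time(double n, double bandwidth_bytes_per_sec, double latency_sec)
{
  assert(n >= 0.0 && bandwidth_bytes_per_sec > 0.0 && latency_sec >= 0.0);
  return 3.0 * 8.0 * n / bandwidth_bytes_per_sec + latency_sec;
}

// Effective bandwidth in bytes per second implied by the model.
inline double vector_add_effective_bandwidth(double n, double bandwidth_bytes_per_sec, double latency_sec)
{
  return 3.0 * 8.0 * n / vector_add_time(n, bandwidth_bytes_per_sec, latency_sec);
}
```

Assuming 120 GB/sec and 10 µs of latency, a vector of 100 entries reaches well under 1 GB/sec effective bandwidth, whereas a vector of one million entries comes within about five percent of the assumed peak.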
46
Performance Modeling: Matrix-Matrix Products
(Plot: GFLOP/s versus matrix rows/columns for INTEL Core i7 960 with MKL 11.0.2 and ViennaCL, NVIDIA GTX 470 with CUBLAS 5.0 and ViennaCL, and AMD HD7970 with clAmdBlas 1.10 and ViennaCL)
47
Performance Modeling: Conjugate Gradients
Pseudocode
Choose $x_0$
$p_0 = r_0 = b - A x_0$
For i = 0 until convergence
1. Compute and store $A p_i$
2. Compute $\langle p_i, A p_i \rangle$
3. $\alpha_i = \langle r_i, r_i \rangle / \langle p_i, A p_i \rangle$
4. $x_{i+1} = x_i + \alpha_i p_i$
5. $r_{i+1} = r_i - \alpha_i A p_i$
6. Compute $\langle r_{i+1}, r_{i+1} \rangle$
7. $\beta_i = \langle r_{i+1}, r_{i+1} \rangle / \langle r_i, r_i \rangle$
8. $p_{i+1} = r_{i+1} + \beta_i p_i$
EndFor
BLAS-based Implementation
-
SpMV, AXPY
For i = 0 until convergence
1. SpMV ← No caching of $A p_i$
2. DOT ← Global sync!
3. -
4. AXPY
5. AXPY ← No caching of $r_{i+1}$
6. DOT ← Global sync!
7. -
8. AXPY
EndFor
48
Performance Modeling: Conjugate Gradients
(Plot: Time per CG Iteration - NVIDIA K20m, execution time in seconds, BLAS-based, CSR, 2D Finite Difference Discretization)
49
Performance Modeling: Conjugate Gradients
Performance Modeling
6 Kernel Launches (plus two for reductions)
Two device to host data reads from dot products
Model SpMV as seven vector accesses (5-point stencil)
$T(N) = 8 \times 10^{-6} + 2 \times 2 \times 10^{-6} + (7 + 2 + 3 + 3 + 2 + 3) \times 8 \times N / \mathrm{Bandwidth}$
(Plot: Time per CG Iteration - NVIDIA K20m, execution time in seconds, BLAS-based, CSR compared with the bandwidth + latency limit, 2D Finite Difference Discretization)
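The model for one BLAS-based CG iteration can be evaluated as follows. The constants are taken from the formula on this slide; the function name and the bandwidth value in the usage note are assumptions.

```cpp
#include <cassert>

// Predicted time in seconds for one BLAS-based CG iteration with n unknowns.
inline double cg_iteration_time(double n, double bandwidth_bytes_per_sec)
{
  assert(n >= 0.0 && bandwidth_bytes_per_sec > 0.0);
  const double kernel_launches = 8.0 * 1e-6;        // 6 kernels plus 2 for the reductions
  const double host_reads      = 2.0 * 2.0 * 1e-6;  // two dot-product results read by the host
  // Vector accesses: SpMV 7, DOT 2, AXPY 3, AXPY 3, DOT 2, AXPY 3.
  const double vector_accesses = 7.0 + 2.0 + 3.0 + 3.0 + 2.0 + 3.0;
  return kernel_launches + host_reads + vector_accesses * 8.0 * n / bandwidth_bytes_per_sec;
}
```

For small systems the prediction is dominated by the fixed 12 µs of latency; assuming 160 GB/sec, the bandwidth term only catches up at around ten thousand unknowns.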
50
Performance Modeling: Conjugate Gradient Optimizations
Optimization: Rearrange the algorithm
Remove unnecessary reads
Remove unnecessary synchronizations
Use custom kernels instead of standard BLAS
51
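Performance Modeling: Conjugate Gradients
Standard CG
Choose $x_0$
$p_0 = r_0 = b - A x_0$
For i = 0 until convergence
1. Compute and store $A p_i$
2. Compute $\langle p_i, A p_i \rangle$
3. $\alpha_i = \langle r_i, r_i \rangle / \langle p_i, A p_i \rangle$
4. $x_{i+1} = x_i + \alpha_i p_i$
5. $r_{i+1} = r_i - \alpha_i A p_i$
6. Compute $\langle r_{i+1}, r_{i+1} \rangle$
7. $\beta_i = \langle r_{i+1}, r_{i+1} \rangle / \langle r_i, r_i \rangle$
8. $p_{i+1} = r_{i+1} + \beta_i p_i$
EndFor
Pipelined CG
Choose $x_0$
$p_0 = r_0 = b - A x_0$
For i = 1 until convergence
1. i = 1: Compute $\alpha_0$, $\beta_0$, $A p_0$
2. $x_i = x_{i-1} + \alpha_{i-1} p_{i-1}$
3. $r_i = r_{i-1} - \alpha_{i-1} A p_{i-1}$
4. $p_i = r_i + \beta_{i-1} p_{i-1}$
5. Compute and store $A p_i$
6. Compute $\langle A p_i, A p_i \rangle$, $\langle p_i, A p_i \rangle$, $\langle r_i, r_i \rangle$
7. $\alpha_i = \langle r_i, r_i \rangle / \langle p_i, A p_i \rangle$
8. $\beta_i = (\alpha_i^2 \langle A p_i, A p_i \rangle - \langle r_i, r_i \rangle) / \langle r_i, r_i \rangle$
EndFor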
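The pipelined variant can be tried out on the CPU with a few lines of code. The sketch below follows the pipelined steps with x_0 = 0 and a small dense matrix; the dense storage, the type aliases, the function names, and the stopping test on the residual norm are assumptions made for illustration, whereas a GPU implementation would use sparse storage and fused kernels.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef std::vector<double> Vec;
typedef std::vector<Vec> Mat;   // dense row-major matrix, for illustration only

inline double dot(const Vec &a, const Vec &b)
{
  double sum = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) sum += a[i] * b[i];
  return sum;
}

inline Vec matvec(const Mat &A, const Vec &p)
{
  Vec result(A.size(), 0.0);
  for (std::size_t i = 0; i < A.size(); ++i) result[i] = dot(A[i], p);
  return result;
}

// Pipelined CG for symmetric positive definite A, starting from x0 = 0.
inline Vec pipelined_cg(const Mat &A, const Vec &b, double tol, int max_iters)
{
  assert(A.size() == b.size());
  const std::size_t n = b.size();
  Vec x(n, 0.0), r = b, p = b;            // x0 = 0, hence p0 = r0 = b
  Vec Ap = matvec(A, p);
  double rr = dot(r, r);
  if (rr <= tol * tol) return x;
  double alpha = rr / dot(p, Ap);                           // alpha_0
  double beta = (alpha * alpha * dot(Ap, Ap) - rr) / rr;    // beta_0
  for (int it = 1; it <= max_iters; ++it) {
    // Steps 2-4 in a single pass over the vectors (one fused kernel on the GPU).
    for (std::size_t j = 0; j < n; ++j) {
      x[j] += alpha * p[j];
      r[j] -= alpha * Ap[j];              // uses A p_{i-1} from the previous iteration
      p[j] = r[j] + beta * p[j];
    }
    Ap = matvec(A, p);                    // step 5
    // Step 6: the three inner products can share one reduction and one synchronization.
    const double ApAp = dot(Ap, Ap);
    const double pAp = dot(p, Ap);
    rr = dot(r, r);
    if (rr <= tol * tol) break;
    alpha = rr / pAp;                                       // step 7
    beta = (alpha * alpha * ApAp - rr) / rr;                // step 8
  }
  return x;
}
```

Compared with standard CG, the update of p no longer waits for a separate inner product of the new residual, which removes one of the two global synchronizations per iteration. The price is the subtraction in the formula for beta, which may be more sensitive to rounding errors than the standard quotient.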
52
Performance Modeling: Conjugate Gradients
(Plot: Time per CG Iteration - NVIDIA K20m, execution time in seconds, BLAS-based and pipelined implementations compared with the bandwidth + latency limit, 2D Finite Difference Discretization)
53
Performance Modeling: Conjugate Gradients
Benefits of Pipelining also for Large Matrices
(Bar charts: relative execution time in percent for the matrices pwtk, shipsec1, consph, cant, and pdb1HYS on Tesla C2050, Tesla K20m, FirePro W9000, and FirePro W9100, comparing ViennaCL, PARALUTION, MAGMA, and CUSP; two of the four libraries are marked NA on the FirePro devices)
54
Performance Modeling: Sparse Matrix Transpose
Sparse Matrix Transposition
Compute $B = A^T$
A and B using sparse storage schemes
Row-Wise Datastructures
std::vector<std::map<int, double> >
std::vector<boost::flat_map<int, double> >
Compressed Sparse Row
55
Performance Modeling: Sparse Matrix Transpose
Simplest Case: STL
for (int i=0; i<N; ++i)
  for (auto col = A[i].begin(); col != A[i].end(); ++col)
    B[col->first][i] = col->second; // map entry: first = column index, second = value
More Work: CSR
Compute number of nonzeros per row in B
Allocate data arrays for B
Exclusive scan to find entry points per row
Populate B
Performance Modeling
Read and write each nonzero entry at least once
Low arithmetic intensity: Bandwidth limited!
$T(N) = 2 \times (4 + 8) \times \mathrm{nnz}(N) / \mathrm{Bandwidth}$
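The four CSR steps listed above can be sketched as follows. The struct `CsrMatrix` and the function name `transpose` are hypothetical, and the code is a serial reference rather than a tuned implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct CsrMatrix {
  std::size_t rows, cols;
  std::vector<int> row_ptr;      // rows + 1 entries
  std::vector<int> col_idx;      // one 4-byte index per nonzero
  std::vector<double> values;    // one 8-byte value per nonzero
};

inline CsrMatrix transpose(const CsrMatrix &A)
{
  assert(A.row_ptr.size() == A.rows + 1);
  const std::size_t nnz = A.col_idx.size();
  CsrMatrix B;
  B.rows = A.cols;
  B.cols = A.rows;
  // Step 1: number of nonzeros per row of B equals nonzeros per column of A.
  std::vector<int> count(B.rows, 0);
  for (std::size_t k = 0; k < nnz; ++k) ++count[A.col_idx[k]];
  // Step 2: allocate the data arrays of B.
  B.row_ptr.assign(B.rows + 1, 0);
  B.col_idx.resize(nnz);
  B.values.resize(nnz);
  // Step 3: exclusive scan of the counts gives the entry point of each row.
  for (std::size_t i = 0; i < B.rows; ++i) B.row_ptr[i + 1] = B.row_ptr[i] + count[i];
  // Step 4: populate B, tracking the next free slot in each row.
  std::vector<int> next(B.row_ptr.begin(), B.row_ptr.end() - 1);
  for (std::size_t i = 0; i < A.rows; ++i) {
    for (int k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k) {
      const int dest = next[A.col_idx[k]]++;
      B.col_idx[dest] = static_cast<int>(i);
      B.values[dest] = A.values[k];
    }
  }
  return B;
}
```

Each nonzero is read once and written once as a 4-byte index plus an 8-byte value, which is where the factor 2 × (4 + 8) in the model comes from. The counting pass and the row pointers add further traffic, so the model is a lower bound.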
56
Performance Modeling: Sparse Matrix Transpose
(Plot: Sparse Matrix Transposition, execution time versus matrix size for STL->STL, STL->CSR, CSR->STL, CSR->Flat, and CSR->CSR compared with the bandwidth limit)
57
Summary
Many-Core Architectures
GPUs: 100s and 1000s of threads
MIC: 61 cores (KNC), 72 cores (KNL)
2-3x enhancements (perf/Watt) for some applications
Amdahl’s Law limits general applicability
Recommendations
(Re-)Use software libraries
Profile and model before you optimize
Pay attention to memory transfers