modern many-core architectures for...

VSC Seminar

Modern Many-Core Architecturesfor Supercomputing

Karl Rupp

https://karlrupp.net/https://github.com/karlrupp/slides/

formerly:Institute for Microelectronics, TU Wien

now:Freelance Scientist

December 11, 2015

https://karlrupp.net/

https://github.com/karlrupp/slides/

2

OverviewOverview

Hardware

Graphics Processing Units

Many Integrated Cores (Xeon Phi family)

Software

CUDA

OpenCL

Parallel Primitives

Performance Modeling

Identifying Bottlenecks

Modeling by Example

3

GPU OverviewGPU Overview

Computing Architecture Schematic

PCI Express

8x20 GB/s2x12 GB/s

CPU GPU

8 GB/s, ~1us Latency

1000 GFLOPs SP 250 GFLOPs DP

100 GFLOPs SP 50 GFLOPs DP

Good for large FLOP-intensive tasks, high memory bandwidth

PCI-Express can be a bottleneck

� 10-fold speedups (usually) not backed by hardware

4


PCI−Express

10 GB/s1−10 us

Workgroup Workgroup Workgroup

Workgroup Workgroup Workgroup

GPU

Host200 GB/s10−100 ns

GPU Main Memory

Shared Memory

Shared Memory

Shared Memory Shared Memory

Shared MemoryShared Memory

Details

Workgroups consist of 32-64 hardware threads

Up to 24 hardware workgroups

Shared memory small: approx. 32-64 KB

5


Reminder: AVX

One instruction for all elements of a vector register

op(r1-8)

Single Instruction Multiple Threads (SIMT)

One instruction for all threads in workgroup

Each thread has separate registers

Efficient if all threads execute the same instruction

op(r1) op(r2) op(r3) op(r4) op(r5) op(r6) op(r7) op(r8)

6


GDDR5

Optimized for throughput

Channel width: multiple of 32 bits

High bus width: 256 bits, 384 bits

Structured Memory Access

Memory controllers use 32/64/128 byte transactions

Partial transactions degrade effective bandwidth

128 Byte 128 Byte 128 Byte

7


Host-Device Communication

PCI-Express v2: 8 GB/sec max

PCI-Express v3: 16 GB/sec max

Latency: about 10 µs

10-6

10-5

10-4

10-3

10-2

101

102

103

104

105

106

107

Tim

e (

se

c)

Message Size (Bytes)

K20mW9100

A10-5800KXeon Phi

10-4

10-3

10-2

10-1

100

101

101

102

103

104

105

106

107

Ba

nd

wid

th (

GB

/se

c)

Message Size (Bytes)

K20mW9100

A10-5800KXeon Phi

8

Many Integrated Core ArchitectureMany Integrated Core Architecture

Background

Many-core architecture by Intel

“Just recompile” existing OpenMP-enabled code

Current: Knights Corner (Q4 2012)

Upcoming: Knights Landing (Q1 2016)

9


Knight Corner: Technical Details

High memory bandwidth: 320 GB/sec ideal, 160 GB/sec real

High compute power: 1 TFLOP/sec (fp64)

Fixed main memory: 8 or 16 GB

Custom operating system on board

0

20

40

60

80

100

120

140

160

1 10 100

Bandwidth (GB/sec)

Threads

STREAM Benchmark Results

E5-2670 v3 (Haswell) E5-2650 v2 (Ivy Bridge) E5-2620 (Sandy Bridge)Xeon Phi 7120

10


Knights Corner: Problems

Very low single-thread performance

“Just recompile” almost never enough

PCI-Express communication slower than for GPUs

Even Intel now recommends Haswell-Xeons over Knights Corner

http://xkcd.com/386/

11


KNL: Compute

72 cores

Energy-efficient IA cores

3x single-thread performance vs. KNC

Binary compatibility with Xeon line

3+ TFLOPS in double precision

KNL: On-Package Memory

16 GB

5x bandwidth vs. DDR4

5x power efficiency vs. DDR4

12


KNL Schematic

36 tiles interconnected by 2D mesh

Each tile: 2 cores, 2 vector units per core, 1 MB L2 cache

Tile

3

DDR4

CHANNELS

3

DDR4

CHANNELS

MCDRAM MCDRAM

MCDRAM MCDRAM

Package

13


HBM: Cache Mode

No code changes necessary

Cache misses expensive

HBM: Flat Mode

MCDRAM mapped to physical address space

Exposed as NUMA node

Accessed through memkind library or numactl

HBM: Hybrid Mode

Combination of the above two

Get the best and worst of two worlds

14

OverviewOverview

Software: CUDA

15

CUDACUDA

About

Initial release in 2007

Proprietary programming model by NVIDIA

C++ with extensions

Proprietary compiler extracts GPU kernels

Software Ecosystem

Vendor-tuned libraries: cuBLAS, cuSparse, cuSolver, cuFFT, etc.

Python bindings: pyCUDA

Community projects: CUSP, MAGMA, ViennaCL, etc.

16

CUDACUDA

Programming in CUDA

void work(double *x, double *y, double *z, int N){#pragma omp parallel

{ int thread_id = omp_get_thread_num();for (size_t i=thread_id; i<N; i += omp_get_num_threads())

z[i] = x[i] + y[i];} }

int main(int argc, char **argv){int N = atoi(argv[1]);double *x = malloc(N*sizeof(double));...

...work(x, y, z, N); // call kernel...

free(x);}

17

CUDACUDA

Programming in CUDA

__global__ void work(double *x, double *y, double *z, int N){

int thread_id = blockIdx.x*blockDim.x + threadIdx.x;for (size_t i=thread_id; i<N; i += blockDim.x * gridDim.x)

z[i] = x[i] + y[i];}

int main(int argc, char **argv){int N = atoi(argv[1]);double *x = malloc(N*sizeof(double));cudaMalloc(&gpu_x, N*sizeof(double));cudaMemcpy(gpu_x, x, N*8, cudaMemcpyHostToDevice);...work<<<128, 256>>>(x, y, z, N); // call kernel...cudaMemcpy(gpu_x, x, N*8, cudaMemcpyDeviceToHost);...free(x);

}

18

CUDACUDA

x

y

19

CUDACUDA

Thread Control (1D)

Local ID in block: threadIdx.x

Threads per block: blockDim.x

ID of block: blockIdx.x

No. of blocks: gridDim.x

Recommended Default Values

Typical block size: 256 or 512

Typical number of blocks: 256

At least 10 000 logical threads recommended

20

CUDA ExampleCUDA Example

Offset Memory Access

__global__void work(double *x, double *y, double *z, int N, int k){int thread_id = blockIdx.x*blockDim.x + threadIdx.x;for (size_t i=thread_id; i<N; i += blockDim.x * gridDim.x)

z[i+k] = x[i+k] + y[i+k];}

0

20

40

60

80

100

120

140

0 5 10 15 20 25 30

Bandw

idth

(G

B/s

ec)

Offset

Memory Bandwidth with Offsets - Double Precision

NVIDIA GeForce GTX 470 (Fermi)NVIDIA Tesla K20m (Kepler)NVIDIA GTX 750 Ti (Maxwell)

21


Strided Memory Access

__global__void work(double *x, double *y, double *z, int N, int k){int thread_id = blockIdx.x*blockDim.x + threadIdx.x;for (size_t i=thread_id; i<N; i += blockDim.x * gridDim.x)

z[i*k] = x[i*k] + y[i*k];}

0

20

40

60

80

100

120

140

5 10 15 20 25 30

Bandw

idth

(G

B/s

ec)

Stride

Memory Bandwidth with Strides - Double Precision

NVIDIA GeForce GTX 470 (Fermi)NVIDIA Tesla K20m (Kepler)NVIDIA GTX 750 Ti (Maxwell)

22



Array of structs problematic

typedef struct particle{double pos_x; double pos_y; double pos_z;double vel_x; double vel_y; double vel_z;double mass;

} Particle;

__global__void increase_mass(Particle *particles, int N){int thread_id = blockIdx.x*blockDim.x + threadIdx.x;for (int i=thread_id; i<N; i += blockDim.x * gridDim.x)

particles[i].mass *= 2.0;}

23



Workaround: Structure of Arrays

typedef struct particles{double *pos_x; double *pos_y; double *pos_z;double *vel_x; double *vel_y; double *vel_z;double *mass;

} Particle;

__global__void increase_mass(Particle *particles, int N){int thread_id = blockIdx.x*blockDim.x + threadIdx.x;for (int i=thread_id; i<N; i += blockDim.x * gridDim.x)

particles.mass[i] *= 2.0;}

24

OverviewOverview

Software: OpenCL

25

History of OpenCLHistory of OpenCL

Prior to 2008

OpenCL developed by Apple Inc.

2008

OpenCL working group formed at Khronos Group

OpenCL specification 1.0 released

2010

OpenCL 1.1 (multi-device, subbuffer manipulation)

2011

OpenCL 1.2 (device partitioning)

2013

OpenCL 2.0 (shared virtual memory, SPIR, etc.)

26

OpenCLOpenCL

Similar to CUDA

Kernel language is a subset of C

Explicit memory management, host-device transfers

Memory model: local, shared, global

Different from CUDA

Support by many vendors

No compiler-wrapper, only a shared library

Kernel compilation usually at runtime

27

OpenCL Platform ModelOpenCL Platform Model

Platform 1

Context 1

Kernel 2,1

Kernel 2,2

Kernel 1,n

Kernel 1,2

Kernel 1,1

Kernel 2,m

Memory Buffer 1

Memory Buffer 2

Device 1 Device 2

Device List

Memory Buffer k

Command Queue 2

Command Queue 1Program 2

Device 3

Program 1

28

OpenCLOpenCL

OpenCL Thread Control (1D) vs. CUDA

Local ID in block: get_local_id(0)

Threads per block: get_local_size(0)

ID of block: get_group_id()

No. of blocks: get_num_groups()

Global thread ID: get_global_id()

No. of threads: get_global_size()

threadIdx.x

blockDim.x

blockIdx.x

gridDim.x

29

OverviewOverview

Software: Parallel Primitives

30

Parallel PrimitivesParallel Primitives

Reductions

Use N values to compute 1 result value

Examples: Dot-products, vector norms, etc.

Reductions with Few Threads

Decompose N into chunks for each thread

Compute chunks in parallel

Merge results with single thread

Result

31


Reductions with Many Threads

Decompose N into chunks for each workgroup

Use fast on-chip synchronization within each workgroup

Sum result for each workgroup separately

z

y

a

32


Reductions with Many Threads

10 1 8 -1 0 -2 3 5 -2 -3 2 7 0 11 0 3Values

8 -2 10 6 0 9 3 8 -2 -3 2 7 0 11 0 3Step 1, Stride

8 7 13 14 0 9 3 8 -2 -3 2 7 0 11 0 3Step 2, Stride

21 21 13 14 0 9 3 8 -2 -3 2 7 0 11 0 3Step 3, Stride

42 21 13 14 0 9 3 8 -2 -3 2 7 0 11 0 3Step 4, Stride

shared_m[threadIdx.x] = thread_sum;for (int stride = blockDim.x/2; stride>0; stride/=2) {__syncthreads();if (threadIdx.x < stride)

shared_m[threadIdx.x] += shared_m[threadIdx.x+stride];}

33


Prefix Sum

Inclusive: Determine yi =∑i

k=1 xk

Exclusive: Determine yi =∑i−1

k=1 xk, y1 = 0

Example

x: 4, 3, 6, 5, 4, 7, 4, 4, 4

y: 4, 7, 13, 18, 22, 29, 33, 37, 41 (inclusive)

y: 0, 4, 7, 13, 18, 22, 29, 33, 37 (exclusive)

Applications

Sparse matrix setup

Graph algorithms

1

23

4

5

6

7

8

9

4

3

5

6 4

7

44

4

34


Prefix Sum Implementation

Values 0 2 1 6 3 1 2 5

0 2 3 7 9 4 3 7Step 1, Stride 1

0 2 3 9 12 11 12 11Step 2, Stride 2

0 2 3 9 12 13 15 20Step 3, Stride 4

for(int stride = 1; stride < blockDim.x; stride *= 2){__syncthreads();shared_buffer[threadIdx.x] = my_value;__syncthreads();if (threadIdx.x >= stride)

my_value += shared_buffer[threadIdx.x - stride];}__syncthreads();shared_buffer[threadIdx.x] = my_value;

35


ILU - Basic Idea

Factor sparse matrix A ≈ LUL and U sparse, triangular

ILU0: Pattern of L, U equal to AILUT: Keep k elements per row

Solver Cycle PhaseResidual correction LUx = z

Forward solve Ly = z

Backward solve Ux = y

Little parallelism in general

5 × × × × ×× 3 ×× × 4 ×× × 5 × × ×

× 5 × × ×× × × 6 × ×× × 3

× × 4 ×× × × 4

36


ILU Level Scheduling

Build dependency graph

Substitute as many entries as possible simultaneously

Trade-off: Each step vs. multiple steps in a single kernel

37


ILU Interpretation on Structured Grids

2d finite-difference discretization

Substitution whenever all neighbors with smaller index computed

Works particularly well in 3d

1 2 3 4 5

6 7 8 9 10

11 12 13 14 15

16 17 18 19 20

37






1 2 3 4 5

8

11 12 13 14 15

18

10 9 7 6

16171920

37






1

15

20

2

3

4

5

6

7

8

9

10

11

12

13

14

16

17

18

19

37






1 3

4 5

6 7 8

10

11 2 12

13 14 15

16 17

917 18 20

38


Other Parallel Primitives

Sort

Gather and Scatter

Load to shared memory and work there

etc.

GPU-Accelerated Software Libraries

Linear Algebra: ViennaCL, MAGMA, CUSP, VexCL, ...

Solvers: ViennaCL, MAGMA, cuSolver, Paralution, clAMG, ...

FFT: cuFFT, clFFT, FFTW, ...

Primitives: VexCL, Boost.Compute, ...

Machine Learning: Caffe, cuDNN, ...

39

OverviewOverview

Performance Modeling: Bottlenecks

40

Bottleneck PotpourriBottleneck Potpourri

Arithmetic Intensity

Number of FLOPs per Byte

FLOP-limited: AI larger than 1-10

Memory-limited: AI smaller than 1-10

1

10

2007 2008 2009 2010 2011 2012 2013 2014

FLO

P p

er

Byte

End of Year

Floating Point Operations per Byte, Double Precision

Radeon HD 3870

Radeon HD 4870

Radeon HD 5870

Radeon HD 6970

Radeon HD 6970

Radeon HD 7970 GHz Ed.

Radeon HD 8970

FirePro W9100

Tesla C1060

Tesla C1060

Xeon X5680

Xeon X5690

Tesla K20

Tesla K20X

Tesla K40

CPUs, Intel

Xeon X5482

Xeon X5492

Xeon W5590

Tesla C2050

Tesla C2090Xeon E5-2690

Xeon E5-2697 v2

Xeon E5-2699 v3

GPUs, NVIDIA

GPUs, AMD

MIC, Intel

Xeon Phi X7120X

41


Latency

Bottleneck in strong scaling limit

Ultimate limit for time stepping

Latency - Sources

Network latency (Ethernet ∼ 20µs, Infiniband ∼ 5µs)

PCI-Express latency (Kernel launches, ∼ 10µs)

Thread synchronization (barriers, locks, ∼ 1− 10µs)

Memory latency (∼ 100ns)

10-5

10-4

10-3

10-2

101

102

103

104

105

106

107

Execution T

ime (

sec)

Time per CG Iteration - NVIDIA K20m

BLAS-based, CSR

42


Load Imbalance

Total execution time determined by slowest thread

Focus on making the slowest thread fast

Easy for static data structures (e.g. dense matrices)

Hard for dynamic data structures (e.g. sparse matrices)

43


Amdahl’s Law

Ttotal = Tserial + Tparallel/#processors

Speed-up limited by serial portion of an algorithm

44

OverviewOverview

Performance Modeling: Examples

45

Performance Modeling: Vector AdditionPerformance Modeling: Vector Addition

Vector Addition

x = y + z with N elements each

1 FLOP per 24 byte in double precision

Limited by memory bandwidth⇒ T2(N)?≈ 3× 8× N/Bandwidth + Latency

10-6

10-5

10-4

10-3

102

103

104

105

106

107

Ba

nd

wid

th (

GB

/se

c)

Vector Size

Execution Time for x = y + z in Double Precision

NVIDIA Tesla K20mBandwidth LimitBandwidth + Latency Limit

45

Performance Modeling: Vector AdditionPerformance Modeling: Vector Addition

Vector Addition

x = y + z with N elements each

1 FLOP per 24 byte in double precision

Limited by memory bandwidth⇒ T2(N)?≈ 3× 8× N/Bandwidth + Latency

0

20

40

60

80

100

120

140

102

103

104

105

106

107

Ba

nd

wid

th (

GB

/se

c)

Vector Size

Memory Bandwidth for x = y + z in Double Precision

NVIDIA Tesla K20m

46

Performance Modeling: Matrix-Matrix ProductsPerformance Modeling: Matrix-Matrix Products

102

103

102

103

104

GF

LO

P/s

Matrix Rows/Columns

INTEL Core i7 960 - MKL 11.0.2

INTEL Core i7 960 - ViennaCL

NVIDIA GTX 470 - CUBLAS 5.0

NVIDIA GTX 470 - ViennaCL

AMD HD7970 - clAmdBlas 1.10

AMD HD7970 - ViennaCL

47

Performance Modeling: Conjugate GradientsPerformance Modeling: Conjugate Gradients

Pseudocode

Choose x0

p0 = r0 = b− Ax0

For i = 0 until convergence

1. Compute and store Api

2. Compute 〈pi,Api〉3. αi = 〈ri, ri〉/〈pi,Api〉4. xi+1 = xi + αipi

5. ri+1 = ri − αiApi

6. Compute 〈ri+1, ri+1〉7. βi = 〈ri+1, ri+1〉/〈ri, ri〉8. pi+1 = ri+1 + βipi

EndFor

BLAS-based Implementation

-SpMV, AXPYFor i = 0 until convergence

1. SpMV← No caching of Api

2. DOT← Global sync!

3. -

4. AXPY

5. AXPY← No caching of ri+1

6. DOT← Global sync!

7. -

8. AXPY

EndFor

48


10-5

10-4

10-3

10-2

101

102

103

104

105

106

107

Exe

cu

tio

n T

ime

(se

c)


BLAS-based, CSR

(2D Finite Difference Discretization)

49



6 Kernel Launches (plus two for reductions)

Two device to host data reads from dot products

Model SpMV as seven vector accesses (5-point stencil)

T(N) = 8× 10−6 + 2× 2× 10−6 + (7 + 2 + 3 + 3 + 2 + 3)× 8× x/Bandwidth

10-5

10-4

10-3

10-2

101

102

103

104

105

106

107

Execution T

ime (

sec)


BLAS-based, CSRBandwidth + Latency Limit


50

Performance Modeling: Conjugate Gradient OptimizationsPerformance Modeling: Conjugate Gradient Optimizations

Optimization: Rearrange the algorithm

Remove unnecessary reads

Remove unnecessary synchronizations

Use custom kernels instead of standard BLAS

51


Standard CG

Choose x0

p0 = r0 = b− Ax0



2. Compute 〈pi,Api〉3. αi = 〈ri, ri〉/〈pi,Api〉4. xi+1 = xi + αipi

5. ri+1 = ri − αiApi

6. Compute 〈ri+1, ri+1〉7. βi = 〈ri+1, ri+1〉/〈ri, ri〉8. pi+1 = ri+1 + βipi

EndFor

Pipelined CG

Choose x0

p0 = r0 = b− Ax0


1. i = 1: Compute α0, β0, Ap0

2. xi = xi−1 + αi−1pi−1

3. ri = ri−1 − αi−1Api

4. pi = ri + βi−1pi−1


6. Compute 〈Api,Api〉, 〈pi,Api〉, 〈ri, ri〉7. αi = 〈ri, ri〉/〈pi,Api〉8. βi = (α2

i 〈Api,Api〉 − 〈ri, ri〉)/〈ri, ri〉EndFor

52


10-5

10-4

10-3

10-2

101

102

103

104

105

106

107

Exe

cu

tio

n T

ime

(se

c)


BLAS-basedBandwidth + Latency LimitPipelined


53


Benefits of Pipelining also for Large Matrices

0 50 100 150 200

0 50 100 150 200

Rel. Execution Time (%)

pwtk

shipsec1

consph

cant

pdb1HYS

3.35

5.29

2.57

2.04

2.76

1.83

1.92

1.68

2.02

1.34

1.38

1.14

1.59

0.97

1.05

0.85

1.46

1.49

0.98

0.98

Tesla C2050

0 50 100 150 200

0 50 100 150 200


2.89

3.33

1.86

1.42

2.17

1.13

1.46

1.14

1.86

0.93

0.99

0.82

1.56

0.70

0.78

0.61

1.52

0.98

0.74

0.71

Tesla K20m

0 50 100 150 200

0 50 100 150 200


NA

NA

2.18

1.24

NA

NA

1.30

0.95

NA

NA

1.00

0.64

NA

NA

0.80

0.44

NA

NA

0.78

0.75

FirePro W9000

0 50 100 150 200

0 50 100 150 200


NA

NA

1.98

1.19

NA

NA

1.18

0.94

NA

NA

0.96

0.63

NA

NA

0.74

0.44

NA

NA

0.78

0.77

FirePro W9100

ViennaCL PARALUTION MAGMA CUSP

54

Performance Modeling: Sparse Matrix TransposePerformance Modeling: Sparse Matrix Transpose

Sparse Matrix Transposition

Compute B = AT

A and B using sparse storage schemes

Row-Wise Datastructures

std::vector<std::map<int, double> >

std::vector<boost::flat_map<int, double> >

Compressed Sparse Row

55


Simplest Case: STL

for (int i=0; i<N; ++i)for (auto col = A[i].begin(); col != A[i].end(); ++col)

B[col->index][i] = col->value

More Work: CSR

Compute number of nonzeros per row in B

Allocate data arrays for B

Exclusive scan to find entry points per row

Populate B


Read and write each nonzero entry at least once

Low arithmetic intensity: Bandwidth limited!

T(N) = 2× (4 + 8)× nnz(N)/Bandwidth

56


10-6

10-5

10-4

10-3

10-2

10-1

100

101

102

103

104

105

106

107

Execution T

ime

Matrix Size

Sparse Matrix Transposition

STL->STLSTL->CSRCSR->STLCSR->FlatCSR->CSRBandwidth Limit

57

SummarySummary

Many-Core Architectures

GPUs: 100s and 1000s of threads

MIC: 61 cores (KNC), 72 cores (KNL)

2-3x enhancements (perf/Watt) for some applications

Amdahl’s Law limits general applicability

Recommendations

(Re-)Use software libraries

Profile and model before you optimize

Pay attention to memory transfers

modern many-core architectures for...

Documents