
VSC Seminar

Modern Many-Core Architectures for Supercomputing

Karl Rupp

https://karlrupp.net/
https://github.com/karlrupp/slides/

formerly: Institute for Microelectronics, TU Wien

now: Freelance Scientist

December 11, 2015

2

Overview

Hardware

Graphics Processing Units

Many Integrated Cores (Xeon Phi family)

Software

CUDA

OpenCL

Parallel Primitives

Performance Modeling

Identifying Bottlenecks

Modeling by Example

3

GPU Overview

Computing Architecture Schematic

[Schematic: CPU (~100 GFLOPs SP, ~50 GFLOPs DP, memory at 2x12 GB/s) and GPU (~1000 GFLOPs SP, ~250 GFLOPs DP, memory at 8x20 GB/s) connected via PCI Express at 8 GB/s, ~1 us latency]

Good for large FLOP-intensive tasks, high memory bandwidth

PCI-Express can be a bottleneck

10-fold speedups (usually) not backed by hardware

4

GPU Overview

[Schematic: host connected to the GPU via PCI-Express (10 GB/s, 1-10 us); GPU main memory (200 GB/s, 10-100 ns) feeds multiple workgroups, each with its own shared memory]

Details

Workgroups consist of 32-64 hardware threads

Up to 24 hardware workgroups

Shared memory small: approx. 32-64 KB

5

GPU Overview

Reminder: AVX

One instruction for all elements of a vector register

op(r1-8)

Single Instruction Multiple Threads (SIMT)

One instruction for all threads in workgroup

Each thread has separate registers

Efficient if all threads execute the same instruction

op(r1) op(r2) op(r3) op(r4) op(r5) op(r6) op(r7) op(r8)
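An illustrative kernel (not from the slides) contrasting a uniform with a divergent branch:

// Hypothetical example: within a workgroup, (a) is executed in lockstep,
// while in (b) the even/odd branches are serialized one after the other.
__global__ void simt_example(double *x, int N)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i >= N) return;

  // (a) uniform: every thread executes the same instruction
  x[i] = 2.0 * x[i];

  // (b) divergent: two different instruction streams within the workgroup
  if (threadIdx.x % 2 == 0)
    x[i] += 1.0;
  else
    x[i] -= 1.0;
}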

6

GPU Overview

GDDR5

Optimized for throughput

Channel width: multiple of 32 bits

High bus width: 256 bits, 384 bits

Structured Memory Access

Memory controllers use 32/64/128 byte transactions

Partial transactions degrade effective bandwidth

[Figure: memory access split into consecutive 128-byte transactions]

7

GPU Overview

Host-Device Communication

PCI-Express v2: 8 GB/sec max

PCI-Express v3: 16 GB/sec max

Latency: about 10 µs

[Figure: host-device transfer time (sec) and effective bandwidth (GB/sec) vs. message size (bytes) for Tesla K20m, FirePro W9100, A10-5800K, and Xeon Phi]

8

Many Integrated Core Architecture

Background

Many-core architecture by Intel

“Just recompile” existing OpenMP-enabled code

Current: Knights Corner (Q4 2012)

Upcoming: Knights Landing (Q1 2016)

9

Many Integrated Core Architecture

Knights Corner: Technical Details

High memory bandwidth: 320 GB/sec ideal, 160 GB/sec real

High compute power: 1 TFLOP/sec (fp64)

Fixed main memory: 8 or 16 GB

Custom operating system on board

[Figure: STREAM benchmark bandwidth (GB/sec) vs. number of threads for E5-2670 v3 (Haswell), E5-2650 v2 (Ivy Bridge), E5-2620 (Sandy Bridge), and Xeon Phi 7120]

10

Many Integrated Core Architecture

Knights Corner: Problems

Very low single-thread performance

“Just recompile” almost never enough

PCI-Express communication slower than for GPUs

Even Intel now recommends Haswell-Xeons over Knights Corner

http://xkcd.com/386/

11

Many Integrated Core Architecture

KNL: Compute

72 cores

Energy-efficient IA cores

3x single-thread performance vs. KNC

Binary compatibility with Xeon line

3+ TFLOPS in double precision

KNL: On-Package Memory

16 GB

5x bandwidth vs. DDR4

5x power efficiency vs. DDR4

12

Many Integrated Core Architecture

KNL Schematic

36 tiles interconnected by 2D mesh

Each tile: 2 cores, 2 vector units per core, 1 MB L2 cache

[Schematic: KNL package with tiles interconnected by the 2D mesh, 2x3 DDR4 channels, and MCDRAM stacks on package]

13

Many Integrated Core Architecture

HBM: Cache Mode

No code changes necessary

Cache misses expensive

HBM: Flat Mode

MCDRAM mapped to physical address space

Exposed as NUMA node

Accessed through the memkind library or numactl (see the sketch after this list)

HBM: Hybrid Mode

Combination of the above two

Get the best and worst of two worlds
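A minimal flat-mode sketch (assuming the hbwmalloc interface of the memkind library, linked with -lmemkind; not part of the slides). Alternatively, an unmodified binary can be bound to the MCDRAM NUMA node with numactl --membind.

/* Hedged sketch: allocate one array in MCDRAM (flat mode) via memkind's
   hbwmalloc interface, falling back to DDR4 if no HBM node is available. */
#include <hbwmalloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  size_t N = 1000000;
  int use_hbw = (hbw_check_available() == 0);  /* 0 means HBM is usable */

  double *x = use_hbw ? (double*)hbw_malloc(N * sizeof(double))
                      : (double*)malloc(N * sizeof(double));
  if (!x) return EXIT_FAILURE;

  for (size_t i = 0; i < N; ++i)  /* bandwidth-bound work now hits MCDRAM */
    x[i] = 1.0;

  printf("x[0] = %g\n", x[0]);
  if (use_hbw) hbw_free(x); else free(x);
  return EXIT_SUCCESS;
}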

14

Overview

Software: CUDA

15

CUDA

About

Initial release in 2007

Proprietary programming model by NVIDIA

C++ with extensions

Proprietary compiler extracts GPU kernels

Software Ecosystem

Vendor-tuned libraries: cuBLAS, cuSparse, cuSolver, cuFFT, etc.

Python bindings: pyCUDA

Community projects: CUSP, MAGMA, ViennaCL, etc.

16

CUDA

Programming in CUDA

void work(double *x, double *y, double *z, int N)
{
  #pragma omp parallel
  {
    int thread_id = omp_get_thread_num();
    for (size_t i=thread_id; i<N; i += omp_get_num_threads())
      z[i] = x[i] + y[i];
  }
}

int main(int argc, char **argv)
{
  int N = atoi(argv[1]);
  double *x = malloc(N*sizeof(double));
  ...
  work(x, y, z, N); // call kernel
  ...
  free(x);
}

17

CUDA

Programming in CUDA

__global__ void work(double *x, double *y, double *z, int N)
{
  int thread_id = blockIdx.x*blockDim.x + threadIdx.x;
  for (size_t i=thread_id; i<N; i += blockDim.x * gridDim.x)
    z[i] = x[i] + y[i];
}

int main(int argc, char **argv)
{
  int N = atoi(argv[1]);
  double *x = malloc(N*sizeof(double));
  double *gpu_x; cudaMalloc(&gpu_x, N*sizeof(double));
  cudaMemcpy(gpu_x, x, N*sizeof(double), cudaMemcpyHostToDevice);
  ...
  work<<<128, 256>>>(gpu_x, gpu_y, gpu_z, N); // call kernel
  ...
  cudaMemcpy(x, gpu_x, N*sizeof(double), cudaMemcpyDeviceToHost);
  ...
  free(x);
}

18

CUDA

[Figure: vectors x and y]

19

CUDA

Thread Control (1D)

Local ID in block: threadIdx.x

Threads per block: blockDim.x

ID of block: blockIdx.x

No. of blocks: gridDim.x

Recommended Default Values

Typical block size: 256 or 512

Typical number of blocks: 256

At least 10 000 logical threads recommended
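A brief illustrative fragment (not from the slides), reusing the work() kernel and device arrays from the previous pages, that applies these defaults:

// Hypothetical launch fragment: 256 threads per block, grid size derived
// from N but capped at 256 blocks; the loop inside the kernel covers the rest.
int threads = 256;
int blocks  = (N + threads - 1) / threads;
if (blocks > 256) blocks = 256;     // still tens of thousands of logical threads for large N
work<<<blocks, threads>>>(gpu_x, gpu_y, gpu_z, N);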

20

CUDA Example

Offset Memory Access

__global__ void work(double *x, double *y, double *z, int N, int k)
{
  int thread_id = blockIdx.x*blockDim.x + threadIdx.x;
  for (size_t i=thread_id; i<N; i += blockDim.x * gridDim.x)
    z[i+k] = x[i+k] + y[i+k];
}

[Figure: memory bandwidth (GB/sec) vs. offset, double precision, for NVIDIA GeForce GTX 470 (Fermi), Tesla K20m (Kepler), and GTX 750 Ti (Maxwell)]

21

CUDA Example

Strided Memory Access

__global__ void work(double *x, double *y, double *z, int N, int k)
{
  int thread_id = blockIdx.x*blockDim.x + threadIdx.x;
  for (size_t i=thread_id; i<N; i += blockDim.x * gridDim.x)
    z[i*k] = x[i*k] + y[i*k];
}

[Figure: memory bandwidth (GB/sec) vs. stride, double precision, for NVIDIA GeForce GTX 470 (Fermi), Tesla K20m (Kepler), and GTX 750 Ti (Maxwell)]

22

CUDA Example

Strided Memory Access

Array of structs problematic

typedef struct particle
{
  double pos_x; double pos_y; double pos_z;
  double vel_x; double vel_y; double vel_z;
  double mass;
} Particle;

__global__ void increase_mass(Particle *particles, int N)
{
  int thread_id = blockIdx.x*blockDim.x + threadIdx.x;
  for (int i=thread_id; i<N; i += blockDim.x * gridDim.x)
    particles[i].mass *= 2.0;
}

23

CUDA Example

Strided Memory Access

Workaround: Structure of Arrays

typedef struct particles
{
  double *pos_x; double *pos_y; double *pos_z;
  double *vel_x; double *vel_y; double *vel_z;
  double *mass;
} Particle;

__global__ void increase_mass(Particle particles, int N)
{
  int thread_id = blockIdx.x*blockDim.x + threadIdx.x;
  for (int i=thread_id; i<N; i += blockDim.x * gridDim.x)
    particles.mass[i] *= 2.0;
}

24

Overview

Software: OpenCL

25

History of OpenCL

Prior to 2008

OpenCL developed by Apple Inc.

2008

OpenCL working group formed at Khronos Group

OpenCL specification 1.0 released

2010

OpenCL 1.1 (multi-device, subbuffer manipulation)

2011

OpenCL 1.2 (device partitioning)

2013

OpenCL 2.0 (shared virtual memory, SPIR, etc.)

26

OpenCL

Similar to CUDA

Kernel language is a subset of C

Explicit memory management, host-device transfers

Memory model: local, shared, global

Different from CUDA

Support by many vendors

No compiler-wrapper, only a shared library

Kernel compilation usually at runtime
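A minimal host-code sketch (not from the slides; assumes the standard OpenCL 1.x C API and omits all error handling): the kernel source is a plain string compiled for the selected device at runtime with clBuildProgram().

#include <CL/cl.h>

// Kernel source as a string, compiled at runtime
static const char *src =
  "#pragma OPENCL EXTENSION cl_khr_fp64 : enable                     \n"
  "__kernel void work(__global double *x, __global double *y,        \n"
  "                   __global double *z, int N) {                   \n"
  "  for (int i = get_global_id(0); i < N; i += get_global_size(0))  \n"
  "    z[i] = x[i] + y[i];                                           \n"
  "}                                                                 \n";

int main(void)
{
  cl_int err; cl_platform_id platform; cl_device_id device;
  clGetPlatformIDs(1, &platform, NULL);
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);
  cl_context       ctx   = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
  cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

  // Runtime compilation for the selected device
  cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
  clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
  cl_kernel kernel = clCreateKernel(prog, "work", &err);

  int N = 1000;
  size_t bytes = N * sizeof(double);
  cl_mem x = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);
  cl_mem y = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);
  cl_mem z = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);
  // (clEnqueueWriteBuffer() calls to fill x and y omitted)

  clSetKernelArg(kernel, 0, sizeof(cl_mem), &x);
  clSetKernelArg(kernel, 1, sizeof(cl_mem), &y);
  clSetKernelArg(kernel, 2, sizeof(cl_mem), &z);
  clSetKernelArg(kernel, 3, sizeof(int),    &N);

  size_t global = 256 * 256, local = 256;   // cf. the CUDA defaults above
  clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);
  clFinish(queue);
  return 0;
}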

27

OpenCL Platform Model

[Schematic: a platform provides a context holding a device list (Device 1-3), memory buffers, and programs with their kernels; each device is driven through its own command queue]

28

OpenCL

OpenCL Thread Control (1D) vs. CUDA

Local ID in block:  get_local_id(0)    <->  threadIdx.x
Threads per block:  get_local_size(0)  <->  blockDim.x
ID of block:        get_group_id(0)    <->  blockIdx.x
No. of blocks:      get_num_groups(0)  <->  gridDim.x
Global thread ID:   get_global_id(0)   <->  blockIdx.x*blockDim.x + threadIdx.x
No. of threads:     get_global_size(0) <->  blockDim.x*gridDim.x

29

Overview

Software: Parallel Primitives

30

Parallel Primitives

Reductions

Use N values to compute 1 result value

Examples: Dot-products, vector norms, etc.

Reductions with Few Threads

Decompose N into chunks for each thread

Compute chunks in parallel

Merge results with single thread


31

Parallel Primitives

Reductions with Many Threads

Decompose N into chunks for each workgroup

Use fast on-chip synchronization within each workgroup

Sum result for each workgroup separately


32

Parallel Primitives

Reductions with Many Threads

Values:           10  1  8 -1  0 -2  3  5 -2 -3  2  7  0 11  0  3
Step 1, Stride 8:  8 -2 10  6  0  9  3  8 -2 -3  2  7  0 11  0  3
Step 2, Stride 4:  8  7 13 14  0  9  3  8 -2 -3  2  7  0 11  0  3
Step 3, Stride 2: 21 21 13 14  0  9  3  8 -2 -3  2  7  0 11  0  3
Step 4, Stride 1: 42 21 13 14  0  9  3  8 -2 -3  2  7  0 11  0  3

shared_m[threadIdx.x] = thread_sum;
for (int stride = blockDim.x/2; stride>0; stride/=2)
{
  __syncthreads();
  if (threadIdx.x < stride)
    shared_m[threadIdx.x] += shared_m[threadIdx.x+stride];
}
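A complete kernel along these lines (an illustrative sketch, not from the slides): each workgroup reduces its chunk in shared memory and writes one partial result, which is then summed on the host or by a second kernel launch.

__global__ void dot_product(const double *x, const double *y,
                            double *partial, int N)
{
  __shared__ double shared_m[256];                  // assumes blockDim.x == 256

  double thread_sum = 0;
  for (int i = blockIdx.x*blockDim.x + threadIdx.x; i < N; i += blockDim.x * gridDim.x)
    thread_sum += x[i] * y[i];                      // per-thread chunk

  shared_m[threadIdx.x] = thread_sum;               // in-workgroup reduction as above
  for (int stride = blockDim.x/2; stride>0; stride/=2)
  {
    __syncthreads();
    if (threadIdx.x < stride)
      shared_m[threadIdx.x] += shared_m[threadIdx.x+stride];
  }

  if (threadIdx.x == 0)
    partial[blockIdx.x] = shared_m[0];              // one value per workgroup
}
// e.g. dot_product<<<256, 256>>>(gpu_x, gpu_y, gpu_partial, N);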

33

Parallel Primitives

Prefix Sum

Inclusive: Determine $y_i = \sum_{k=1}^{i} x_k$

Exclusive: Determine $y_i = \sum_{k=1}^{i-1} x_k$, $y_1 = 0$

Example

x: 4, 3, 6, 5, 4, 7, 4, 4, 4

y: 4, 7, 13, 18, 22, 29, 33, 37, 41 (inclusive)

y: 0, 4, 7, 13, 18, 22, 29, 33, 37 (exclusive)

Applications

Sparse matrix setup

Graph algorithms


34

Parallel Primitives

Prefix Sum Implementation

Values:            0  2  1  6  3  1  2  5
Step 1, Stride 1:  0  2  3  7  9  4  3  7
Step 2, Stride 2:  0  2  3  9 12 11 12 11
Step 3, Stride 4:  0  2  3  9 12 13 15 20

for (int stride = 1; stride < blockDim.x; stride *= 2)
{
  __syncthreads();
  shared_buffer[threadIdx.x] = my_value;
  __syncthreads();
  if (threadIdx.x >= stride)
    my_value += shared_buffer[threadIdx.x - stride];
}
__syncthreads();
shared_buffer[threadIdx.x] = my_value;

35

Parallel Primitives

ILU - Basic Idea

Factor sparse matrix A ≈ LU
L and U sparse, triangular

ILU0: Pattern of L, U equal to A
ILUT: Keep k elements per row

Solver Cycle Phase
Residual correction: LUx = z
Forward solve: Ly = z
Backward solve: Ux = y

Little parallelism in general

36

Parallel Primitives

ILU Level Scheduling

Build dependency graph

Substitute as many entries as possible simultaneously

Trade-off: Each step vs. multiple steps in a single kernel
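A small sketch of the level computation (illustrative only, not from the slides; L is assumed to be stored in CSR arrays L_rows and L_cols): rows that end up with the same level have no mutual dependencies and can be substituted in one parallel step.

#include <vector>
#include <algorithm>

// Assign each row of a sparse lower-triangular factor L to a level:
// a row's level is one more than the largest level it depends on.
std::vector<int> compute_levels(int N, std::vector<int> const & L_rows,
                                       std::vector<int> const & L_cols)
{
  std::vector<int> level(N, 0);
  for (int i = 0; i < N; ++i)
    for (int j = L_rows[i]; j < L_rows[i+1]; ++j)
      if (L_cols[j] < i)                       // dependency on an earlier unknown
        level[i] = std::max(level[i], level[L_cols[j]] + 1);
  return level;                                // equal levels -> one parallel substitution step
}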

37

Parallel Primitives

ILU Interpretation on Structured Grids

2d finite-difference discretization

Substitution whenever all neighbors with smaller index computed

Works particularly well in 3d

1 2 3 4 5

6 7 8 9 10

11 12 13 14 15

16 17 18 19 20

[Figures: the same 4x5 grid repeated with alternative node numberings]

38

Parallel Primitives

Other Parallel Primitives

Sort

Gather and Scatter

Load to shared memory and work there

etc.

GPU-Accelerated Software Libraries

Linear Algebra: ViennaCL, MAGMA, CUSP, VexCL, ...

Solvers: ViennaCL, MAGMA, cuSolver, Paralution, clAMG, ...

FFT: cuFFT, clFFT, FFTW, ...

Primitives: VexCL, Boost.Compute, ...

Machine Learning: Caffe, cuDNN, ...

39

Overview

Performance Modeling: Bottlenecks

40

Bottleneck Potpourri

Arithmetic Intensity

Number of FLOPs per Byte

FLOP-limited: AI above roughly 1-10 (the exact crossover depends on the device)

Memory-limited: AI below roughly 1-10
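Two worked examples (added here for illustration):

\text{Vector addition } x = y + z:\quad
\mathrm{AI} = \frac{1\ \text{FLOP}}{3 \times 8\ \text{bytes}} \approx 0.04
\quad\Rightarrow\quad \text{memory-limited}

\text{Dense matrix-matrix product } (N \times N):\quad
\mathrm{AI} \approx \frac{2N^3\ \text{FLOPs}}{3 \times 8 N^2\ \text{bytes}} = \frac{N}{12}
\quad\Rightarrow\quad \text{FLOP-limited for large } N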

[Figure: double-precision FLOP-per-byte ratio by end of year, 2007-2014, for Intel CPUs (Xeon X5482 through E5-2699 v3), NVIDIA GPUs (Tesla C1060 through K40), AMD GPUs (Radeon HD 3870 through FirePro W9100), and Intel MIC (Xeon Phi 7120X)]

41

Bottleneck Potpourri

Latency

Bottleneck in strong scaling limit

Ultimate limit for time stepping

Latency - Sources

Network latency (Ethernet ~20 µs, InfiniBand ~5 µs)

PCI-Express latency (kernel launches, ~10 µs)

Thread synchronization (barriers, locks, ~1-10 µs)

Memory latency (~100 ns)

[Figure: execution time (sec) per CG iteration vs. problem size on NVIDIA K20m, BLAS-based CSR implementation]

42

Bottleneck Potpourri

Load Imbalance

Total execution time determined by slowest thread

Focus on making the slowest thread fast

Easy for static data structures (e.g. dense matrices)

Hard for dynamic data structures (e.g. sparse matrices)

43

Bottleneck Potpourri

Amdahl’s Law

T_total = T_serial + T_parallel / #processors

Speed-up limited by serial portion of an algorithm
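A worked bound (added for illustration), with serial fraction $s = T_\text{serial}/T_\text{total}$:

S(p) = \frac{T_\text{total}}{T_\text{serial} + T_\text{parallel}/p} \;\le\; \frac{1}{s},
\qquad \text{e.g. } s = 5\% \;\Rightarrow\; S \le 20 \text{ regardless of the number of processors}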

44

Overview

Performance Modeling: Examples

45

Performance Modeling: Vector Addition

Vector Addition

x = y + z with N elements each

1 FLOP per 24 bytes in double precision

Limited by memory bandwidth ⇒ T(N) ≈ 3 × 8 × N / Bandwidth + Latency
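Plugging in rough numbers for a Tesla K20m (assuming ~150 GB/sec sustained bandwidth and ~10 µs of launch latency, both assumptions for illustration):

T(10^7) \approx \frac{3 \times 8 \times 10^7\ \text{bytes}}{150\ \text{GB/s}} + 10^{-5}\ \text{s}
        \approx 1.6 \times 10^{-3}\ \text{s},
\qquad
T(10^3) \approx 1.6 \times 10^{-7}\ \text{s} + 10^{-5}\ \text{s} \approx 10^{-5}\ \text{s (latency-dominated)}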

[Figure: execution time for x = y + z (double precision) vs. vector size on NVIDIA Tesla K20m, compared with the bandwidth limit and the bandwidth + latency limit]

45

Performance Modeling: Vector Addition

[Figure: measured memory bandwidth (GB/sec) for x = y + z (double precision) vs. vector size on NVIDIA Tesla K20m]

46

Performance Modeling: Matrix-Matrix Products

[Figure: dense matrix-matrix product performance (GFLOP/s) vs. matrix rows/columns for INTEL Core i7 960 (MKL 11.0.2 and ViennaCL), NVIDIA GTX 470 (CUBLAS 5.0 and ViennaCL), and AMD HD7970 (clAmdBlas 1.10 and ViennaCL)]

47

Performance Modeling: Conjugate Gradients

Pseudocode

Choose x_0
p_0 = r_0 = b - A x_0
For i = 0 until convergence
  1. Compute and store A p_i
  2. Compute 〈p_i, A p_i〉
  3. α_i = 〈r_i, r_i〉 / 〈p_i, A p_i〉
  4. x_{i+1} = x_i + α_i p_i
  5. r_{i+1} = r_i - α_i A p_i
  6. Compute 〈r_{i+1}, r_{i+1}〉
  7. β_i = 〈r_{i+1}, r_{i+1}〉 / 〈r_i, r_i〉
  8. p_{i+1} = r_{i+1} + β_i p_i
EndFor

BLAS-based Implementation (SpMV, AXPY, DOT)

For i = 0 until convergence
  1. SpMV  ← no caching of A p_i
  2. DOT   ← global sync!
  3. -
  4. AXPY
  5. AXPY  ← no caching of r_{i+1}
  6. DOT   ← global sync!
  7. -
  8. AXPY
EndFor

48

Performance Modeling: Conjugate Gradients

[Figure: execution time (sec) per CG iteration vs. problem size on NVIDIA K20m, BLAS-based CSR (2D finite difference discretization)]

49

Performance Modeling: Conjugate Gradients

Performance Modeling

6 Kernel Launches (plus two for reductions)

Two device to host data reads from dot products

Model SpMV as seven vector accesses (5-point stencil)

T(N) = 8 × 10^-6 + 2 × 2 × 10^-6 + (7 + 2 + 3 + 3 + 2 + 3) × 8 × N / Bandwidth
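Evaluated for, e.g., N = 10^6 unknowns and an assumed sustained bandwidth of 150 GB/sec (illustrative numbers only):

T(10^6) \approx 1.2 \times 10^{-5}\ \text{s} + \frac{20 \times 8 \times 10^6\ \text{bytes}}{150\ \text{GB/s}}
        \approx 1.2 \times 10^{-5}\ \text{s} + 1.1 \times 10^{-3}\ \text{s}
        \approx 1.1 \times 10^{-3}\ \text{s}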

[Figure: execution time per CG iteration vs. problem size on NVIDIA K20m: BLAS-based CSR vs. the bandwidth + latency limit (2D finite difference discretization)]

50

Performance Modeling: Conjugate Gradient Optimizations

Optimization: Rearrange the algorithm

Remove unnecessary reads

Remove unnecessary synchronizations

Use custom kernels instead of standard BLAS

51

Performance Modeling: Conjugate Gradients

Standard CG

Choose x_0
p_0 = r_0 = b - A x_0
For i = 0 until convergence
  1. Compute and store A p_i
  2. Compute 〈p_i, A p_i〉
  3. α_i = 〈r_i, r_i〉 / 〈p_i, A p_i〉
  4. x_{i+1} = x_i + α_i p_i
  5. r_{i+1} = r_i - α_i A p_i
  6. Compute 〈r_{i+1}, r_{i+1}〉
  7. β_i = 〈r_{i+1}, r_{i+1}〉 / 〈r_i, r_i〉
  8. p_{i+1} = r_{i+1} + β_i p_i
EndFor

Pipelined CG

Choose x_0
p_0 = r_0 = b - A x_0
For i = 1 until convergence
  1. i = 1: Compute α_0, β_0, A p_0
  2. x_i = x_{i-1} + α_{i-1} p_{i-1}
  3. r_i = r_{i-1} - α_{i-1} A p_{i-1}
  4. p_i = r_i + β_{i-1} p_{i-1}
  5. Compute and store A p_i
  6. Compute 〈A p_i, A p_i〉, 〈p_i, A p_i〉, 〈r_i, r_i〉
  7. α_i = 〈r_i, r_i〉 / 〈p_i, A p_i〉
  8. β_i = (α_i^2 〈A p_i, A p_i〉 - 〈r_i, r_i〉) / 〈r_i, r_i〉
EndFor

52

Performance Modeling: Conjugate Gradients

[Figure: execution time (sec) per CG iteration vs. problem size on NVIDIA K20m: BLAS-based, pipelined, and the bandwidth + latency limit (2D finite difference discretization)]

53

Performance Modeling: Conjugate Gradients

Benefits of Pipelining also for Large Matrices

[Figure: relative execution time (%) per CG iteration for the matrices pwtk, shipsec1, consph, cant, and pdb1HYS on Tesla C2050, Tesla K20m, FirePro W9000, and FirePro W9100, comparing ViennaCL, PARALUTION, MAGMA, and CUSP]

54

Performance Modeling: Sparse Matrix Transpose

Sparse Matrix Transposition

Compute B = A^T

A and B using sparse storage schemes

Row-Wise Datastructures

std::vector<std::map<int, double> >

std::vector<boost::flat_map<int, double> >

Compressed Sparse Row

55

Performance Modeling: Sparse Matrix Transpose

Simplest Case: STL

for (int i=0; i<N; ++i)
  for (auto col = A[i].begin(); col != A[i].end(); ++col)
    B[col->first][i] = col->second;

More Work: CSR

Compute number of nonzeros per row in B

Allocate data arrays for B

Exclusive scan to find entry points per row

Populate B
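A serial sketch of those four steps (illustrative only, not from the slides; A_rows, A_cols, A_vals are assumed CSR arrays of the N x M matrix A):

#include <vector>

void csr_transpose(int N, int M,                       // A is N x M
                   std::vector<int> const &A_rows,     // size N+1
                   std::vector<int> const &A_cols,     // size nnz
                   std::vector<double> const &A_vals,  // size nnz
                   std::vector<int> &B_rows, std::vector<int> &B_cols,
                   std::vector<double> &B_vals)
{
  int nnz = A_rows[N];

  // Step 1: count nonzeros per row of B (= per column of A)
  B_rows.assign(M + 1, 0);
  for (int i = 0; i < nnz; ++i)
    B_rows[A_cols[i] + 1] += 1;

  // Steps 2/3: allocate data arrays and scan to find the entry point of each row
  for (int row = 0; row < M; ++row)
    B_rows[row + 1] += B_rows[row];
  B_cols.resize(nnz);
  B_vals.resize(nnz);

  // Step 4: populate B, using a copy of the offsets as write cursors
  std::vector<int> cursor(B_rows.begin(), B_rows.end() - 1);
  for (int i = 0; i < N; ++i)
    for (int j = A_rows[i]; j < A_rows[i + 1]; ++j)
    {
      int pos = cursor[A_cols[j]]++;
      B_cols[pos] = i;
      B_vals[pos] = A_vals[j];
    }
}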

Performance Modeling

Read and write each nonzero entry at least once

Low arithmetic intensity: Bandwidth limited!

T(N) = 2 × (4 + 8) × nnz(N) / Bandwidth

56

Performance Modeling: Sparse Matrix Transpose

[Figure: sparse matrix transposition time vs. matrix size for STL→STL, STL→CSR, CSR→STL, CSR→Flat, and CSR→CSR, compared with the bandwidth limit]

57

Summary

Many-Core Architectures

GPUs: 100s and 1000s of threads

MIC: 61 cores (KNC), 72 cores (KNL)

2-3x enhancements (perf/Watt) for some applications

Amdahl’s Law limits general applicability

Recommendations

(Re-)Use software libraries

Profile and model before you optimize

Pay attention to memory transfers