
© NVIDIA Corporation 2012

GPU Acceleration

Julien Demouth


Outline

CUDA Refresher

Kepler Architecture: SMX

CUDA 5 Features: CUDA Dynamic Parallelism, Hyper-Q, Device-Code Linker


GPU PROGRAMMING REFRESHER


Heterogeneous Computing Platform

[Diagram: Host (CPU) and Device (GPU) connected by the PCI-E bus]
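A minimal refresher sketch of what this split means in code (buffer name and size are illustrative, not from the slide): host and device have separate memories, and data crosses the PCI-E bus via explicit copies.

#include <cuda_runtime.h>
#include <stdlib.h>

int main()
{
    const size_t bytes = 1 << 20;
    float *h_buf = (float*)malloc(bytes);   // host (CPU) memory
    float *d_buf;
    cudaMalloc(&d_buf, bytes);              // device (GPU) memory

    // Explicit transfers over the PCI-E bus.
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    // ... kernel launches operating on d_buf would go here ...
    cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_buf);
    free(h_buf);
    return 0;
}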


3 Ways to Accelerate Applications

Applications can be accelerated in three ways:

Libraries: "drop-in" acceleration
OpenACC Directives: easily accelerate applications
Programming Languages: maximum flexibility


GPU Accelerated Libraries

NVIDIA cuBLAS, NVIDIA cuFFT, NVIDIA cuRAND, NVIDIA cuSPARSE, NVIDIA NPP
Vector signal image processing
GPU-accelerated linear algebra
Matrix algebra on GPU and multicore
Sparse linear algebra
C++ STL features for CUDA
Building-block algorithms for CUDA
IMSL Library


OpenACC Directives

Program myscience
   ... serial code ...
   !$acc kernels
   do k = 1,n1
      do i = 1,n2
         ... parallel code ...
      enddo
   enddo
   !$acc end kernels
   ...
End Program myscience

Your original Fortran or C code, plus simple compiler hints (the OpenACC directives). The compiler parallelizes the hinted region: the serial code runs on the CPU, the parallel region runs on the GPU. Works on many-core GPUs and multicore CPUs.
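A minimal sketch of the same idea in C (array names and the flattened indexing are illustrative, not from the slide): the kernels directive asks the compiler to offload the loop nest.

/* Illustrative C version of the hinted loop nest above. */
#pragma acc kernels
for (int k = 0; k < n1; ++k)
    for (int i = 0; i < n2; ++i)
        y[k*n2 + i] = a * x[k*n2 + i];   /* parallel code */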


CUDA C/C++

Standard C code:

void saxpy_serial(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

// Invoke serial SAXPY kernel
saxpy_serial(n, 2.0, x, y);

Parallel CUDA C code:

__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}

// Invoke parallel SAXPY kernel with 256 threads/block
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);


Opening the CUDA Platform with LLVM

The CUDA compiler source is now available, built on the open-source LLVM compiler.

The SDK includes specification documentation, examples, and a verifier.

This makes it possible for anyone to add CUDA support for new languages and processors.

Learn more at

http://developer.nvidia.com/cuda-source

[Diagram: CUDA C, C++, and Fortran front ends feed the LLVM compiler for CUDA, which targets NVIDIA GPUs and x86 CPUs; the open platform enables new language support and new processor support]


GPUs are Mainstream

Edu/Research: Chinese Academy of Sciences, Max Planck Institute
Government: Air Force Research Laboratory, Naval Research Laboratory
Oil & Gas
Life Sciences: Mass General Hospital
Finance
Manufacturing


KEPLER ARCHITECTURE


The Kepler GPU

Efficiency

Programmability

Performance


Tesla K10 (available now): 3x single precision, 1.8x memory bandwidth. For image, signal, and seismic workloads.

Tesla K20 (available Q4 2012): 3x double precision, plus Hyper-Q and Dynamic Parallelism. For CFD, FEA, finance, and physics workloads.


Kepler GK110 Block Diagram

Architecture: 7.1B transistors, 15 SMX units, > 1 TFLOP FP64, 1.5 MB L2 cache, 384-bit GDDR5


SMX Balance of Resources

Resource                    Kepler GK110 vs. Fermi
Floating-point throughput   2-3x
Max blocks per SMX          2x
Max threads per SMX         1.3x
Register file bandwidth     2x
Register file capacity      2x
Shared memory bandwidth     2x
Shared memory capacity      1x


New ISA Encoding: 255 Registers per Thread

Fermi limit: 63 registers per thread. This was a common Fermi performance limiter, leading to excessive spilling.

Kepler: up to 255 registers per thread. Especially helpful for FP64 apps.

Example: a QUDA QCD fp64 sample runs 5.3x faster; the spills are eliminated by the extra registers.
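Register allocation is steered by the compiler; a hedged sketch of the knobs that interact with the new ceiling (kernel name and the cap of 256 are illustrative): nvcc's -maxrregcount flag sets a per-thread register cap, and __launch_bounds__ promises a maximum block size so the compiler can spend registers more freely.

// Illustrative kernel: __launch_bounds__(256) tells the compiler there
// will be at most 256 threads per block, letting it allocate more
// registers per thread (up to 255 on sm_35) before spilling.
__global__ void __launch_bounds__(256)
heavy_fp64_kernel(double *out, const double *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * in[i] + in[i];   // FP64 work benefits most
}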


New High-Performance SMX Instructions

SHFL (shuffle): intra-warp data exchange

Compiler-generated, high-performance instructions: bit shift, bit rotate, fp32 division, read-only cache loads

ATOM: broader functionality, faster
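A minimal sketch of what SHFL enables (using the CUDA 5 era __shfl_xor intrinsic for sm_30+; later CUDA releases renamed it __shfl_xor_sync with a mask argument): a warp-wide sum with no shared memory.

// Butterfly reduction across the 32 lanes of a warp using SHFL;
// no shared memory and no __syncthreads() needed.
__device__ float warp_reduce_sum(float val)
{
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_xor(val, offset);   // exchange with lane ^ offset
    return val;   // every lane now holds the warp-wide sum
}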


Texture Cache Unlocked

Kepler adds a new path for compute: it avoids the texture unit, allows a global address to be fetched and cached, and eliminates texture setup.

Why use it? It is a separate pipeline from shared/L1, it has the highest miss bandwidth, and it is flexible (e.g. unaligned accesses are allowed).

The compiler manages it automatically; "const __restrict__" indicates eligibility.

[Diagram: the SMX issues reads through four Tex units into the read-only data cache, backed by L2]
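A hedged sketch of the eligibility hint (kernel and parameter names are illustrative): marking the input pointer const and __restrict__ lets the compiler route its loads through the read-only data cache.

__global__ void scale(float * __restrict__ out,
                      const float * __restrict__ in,  // eligible for the
                      float a, int n)                 // read-only cache
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a * in[i];
}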


CUDA DYNAMIC PARALLELISM


Improving Programmability

Dynamic Parallelism helps with:

Occupancy
Simplifying the CPU/GPU divide
Library calls from kernels
Batching to help fill the GPU
Dynamic load balancing
Data-dependent execution
Recursive parallel algorithms


What Does It Mean?

[Diagram: two CPU/GPU timelines. Without dynamic parallelism the GPU is a co-processor, with the CPU issuing every kernel; with dynamic parallelism the GPU generates its own work, autonomously]


Nested Launch

Can create new grids at run time

Composable: a kernel can only launch into its own position in the stream

Can yield until the new work completes, then resume using the result (i.e. join)

A launch by any thread is visible to the thread's whole block

[Diagram: the CPU launches grids A, B, C; grid B launches child grids X, Y, Z]

CPU code:

int main() {
    float *data;
    setup(data);
    A <<< ... >>> (data);
    B <<< ... >>> (data);
    C <<< ... >>> (data);
    cudaThreadSynchronize();
    return 0;
}

CDP nested launch (kernel B launches X, Y, and Z, then joins):

__global__ void B(float *data) {
    do_stuff(data);
    X <<< ..., stream >>> (data);
    Y <<< ..., stream >>> (data);
    Z <<< ..., stream >>> (data);
    cudaStreamSynchronize(stream);
    do_more_stuff(data);
}
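The slide does not show where stream comes from; a hedged sketch of the device-side stream setup the CUDA device runtime requires (assuming sm_35 and compilation with -rdc=true and -lcudadevrt; the launch configuration is illustrative):

// Forward declaration of the child kernel from the slide.
__global__ void X(float *data);

__global__ void B(float *data)
{
    // Streams created in device code must be non-blocking.
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
    X<<<128, 256, 0, stream>>>(data);   // illustrative configuration
    cudaStreamDestroy(stream);
}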


HYPER-Q


Fermi Concurrency

Fermi allows 16-way concurrency: up to 16 grids can run at once. But CUDA streams multiplex into a single hardware work queue, so overlap occurs only at stream edges.

Three streams: A -- B -- C, P -- Q -- R, X -- Y -- Z

Single hardware work queue: A--B--C P--Q--R X--Y--Z


Kepler Improved Concurrency

Kepler allows 32-way concurrency: one work queue per stream gives concurrency at the full-stream level, with no inter-stream dependencies.

Three streams: A -- B -- C, P -- Q -- R, X -- Y -- Z

Multiple hardware work queues:
A--B--C
P--Q--R
X--Y--Z
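A minimal sketch of the pattern Hyper-Q rewards (kernel, sizes, and stream count are illustrative): independent grids issued into separate streams, which Kepler can feed from separate hardware queues.

#include <cuda_runtime.h>

__global__ void work(float *data, int n)   // illustrative kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main()
{
    const int n = 1 << 20;
    float *buf[3];
    cudaStream_t streams[3];
    for (int s = 0; s < 3; ++s) {
        cudaMalloc(&buf[s], n * sizeof(float));
        cudaStreamCreate(&streams[s]);
        // Independent grids in independent streams: on Kepler these can
        // run concurrently; on Fermi they may serialize in the one queue.
        work<<<(n + 255) / 256, 256, 0, streams[s]>>>(buf[s], n);
    }
    cudaDeviceSynchronize();
    for (int s = 0; s < 3; ++s) {
        cudaStreamDestroy(streams[s]);
        cudaFree(buf[s]);
    }
    return 0;
}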


HYPER-Q + MPI PROXY


Hyper-Q + Proxy Enable Better Utilization

Multiple MPI ranks can execute concurrently on the GPU, so you can use as many MPI ranks as in the CPU-only case.

Particularly interesting for strong scaling.


Hyper-Q + Proxy from the User’s Perspective

No code modifications are needed: a proxy server runs on the host, and all processes communicate with the GPU via the proxy.

Ideal for applications with: MPI-everywhere designs, non-negligible CPU work, or code partially migrated to the GPU.

Launched with:

nvidia-cuda-proxy-control -d


GPU DIRECT


Kepler Enables Full NVIDIA GPUDirect™

[Diagram: two servers connected over the network. Each server has a CPU with system memory, two GPUs with their own GDDR5 memory, and a network card, all attached via PCI-e; full GPUDirect lets the GPUs and the network card transfer data directly]


CUDA 5.0


Nsight Eclipse Edition


CUDA Compilation

nvcc separates host and device code. The device code translates into a device-specific binary (.cubin) and device-independent assembly (.ptx), and is then embedded in the host object data.

[Diagram: foo.cu splits into host code (foo.c) and device code (foo.ptx, foo.cubin); the PTX/CUBIN is embedded in foo.o, and at run time a.out loads the embedded foo.cubin directly or JIT-compiles the PTX]
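A hedged sketch of how to inspect the two device-code forms yourself (the file name foo.cu is assumed; -ptx and -cubin are standard nvcc options):

nvcc -arch=sm_20 -ptx   foo.cu -o foo.ptx     # device-independent assembly
nvcc -arch=sm_20 -cubin foo.cu -o foo.cubin   # device-specific binary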


CUDA 5 Introduces Device Code Linker

[Diagram: foo.cu and bar.cu each compile into host code (foo.c, bar.c) plus embedded device code (PTX/CUBIN) in foo.o and bar.o. The new device linker links the device code across object files, then the host linker produces a.out; the embedded cubins can still be JIT-compiled from PTX at run time]


Device Linker Invocation

CUDA 5 introduces an optional link step for device code. Link against the device-runtime library for dynamic parallelism. Currently the link occurs at the cubin level (PTX is not supported).

nvcc -arch=sm_20 -dc a.cu b.cu
nvcc -arch=sm_20 -dlink a.o b.o -o link.o
g++ a.o b.o link.o -L<path> -lcudart

With dynamic parallelism (sm_35):

nvcc -arch=sm_35 -dc a.cu b.cu
nvcc -arch=sm_35 -dlink a.o b.o -lcudadevrt -o link.o
g++ a.o b.o link.o -L<path> -lcudadevrt -lcudart
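A minimal sketch of why the device link step exists (file and function names are illustrative): a __device__ function defined in one translation unit and called from another can only be resolved by the device linker.

// a.cu -- device function defined in one translation unit
__device__ float scale(float x) { return 2.0f * x; }

// b.cu -- a kernel in another translation unit calls it; without
// -dc and -dlink, nvcc cannot resolve the cross-file device symbol
extern __device__ float scale(float x);

__global__ void apply(float *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] = scale(v[i]);
}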


EXASCALE PATH


Echelon Team


Power is the Problem

1. Data movement dominates power
2. Optimize the storage hierarchy
3. Tailor memory to the application


The High Cost of Data Movement

Approximate energy costs in 28 nm silicon technology (20 mm chip):

64-bit DP operation: 20 pJ
256-bit access to an 8 kB SRAM: 50 pJ
256-bit on-chip buses: 26 pJ to 256 pJ, up to 1 nJ depending on distance
Efficient off-chip link: 500 pJ
DRAM read/write: 16 nJ

Fetching operands costs more than computing on them: a DRAM access (16 nJ) is roughly 800x the cost of a DP operation (20 pJ).


Echelon Architecture (1/2)

Lane: DFMAs, 20 GFLOPS
SM: 8 lanes, 160 GFLOPS, with a switch and L1$
Chip: 128 SMs, 20.48 TFLOPS, plus 8 latency processors; 1024 SRAM banks of 256 KB each, connected by a NoC to the memory controllers and network interface
Node MCM: 20 TFLOPS + 256 GB; GPU chip (20 TF DP, 256 MB on-chip) with 1.4 TB/s DRAM bandwidth from DRAM stacks, 150 GB/s network bandwidth, and NV memory


Echelon Architecture (2/2)

Module: 4 nodes
Cabinet: 128 nodes, 2.56 PF, 50 KW, with central router module(s) and Dragonfly interconnect
System: 400 cabinets, 1 EF, 20 MW, Dragonfly interconnect