CEA - Kepler, CUDA 5 and Beyond (HPC)
TRANSCRIPT
© NVIDIA Corporation 2012
Outline
- CUDA Refresher
- Kepler Architecture: SMX
- CUDA 5 Features: CUDA Dynamic Parallelism, Hyper-Q, Device-Code Linker
3 Ways to Accelerate Applications
- Libraries: "drop-in" acceleration
- OpenACC Directives: easily accelerate applications
- Programming Languages: maximum flexibility
GPU Accelerated Libraries
- NVIDIA cuBLAS: GPU-accelerated linear algebra
- NVIDIA cuFFT
- NVIDIA cuRAND
- NVIDIA cuSPARSE: sparse linear algebra
- NVIDIA NPP: vector, signal, and image processing
- Matrix algebra on GPU and multicore
- C++ STL features for CUDA
- Building-block algorithms for CUDA
- IMSL Library
OpenACC Directives

    Program myscience
      ... serial code ...
    !$acc kernels
      do k = 1, n1
        do i = 1, n2
          ... parallel code ...
        enddo
      enddo
    !$acc end kernels
      ...
    End Program myscience

Your original Fortran or C code, plus simple compiler hints (OpenACC directives). The compiler parallelizes the hinted regions, and the same code runs on many-core GPUs and multicore CPUs.
CUDA C/C++

Standard C code:

    void saxpy_serial(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }

    // Invoke serial SAXPY kernel
    saxpy_serial(n, 2.0, x, y);

Parallel C code:

    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n) y[i] = a*x[i] + y[i];
    }

    // Invoke parallel SAXPY kernel with 256 threads/block
    int nblocks = (n + 255) / 256;
    saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
Opening the CUDA Platform with LLVM

The CUDA compiler source is now available with the open-source LLVM compiler. The SDK includes specification documentation, examples, and a verifier, giving anyone the ability to add CUDA support for new languages (beyond CUDA C, C++, and Fortran) and new processors (beyond NVIDIA GPUs and x86 CPUs).

Learn more at http://developer.nvidia.com/cuda-source
GPUs are Mainstream
- Edu/Research: Chinese Academy of Sciences, Max Planck Institute
- Government: Air Force Research Laboratory, Naval Research Laboratory
- Oil & Gas
- Life Sciences: Mass General Hospital
- Finance
- Manufacturing
Tesla K10 (available now)
- 3x single precision
- 1.8x memory bandwidth
- Image, signal, seismic

Tesla K20 (available Q4 2012)
- 3x double precision
- Hyper-Q, dynamic parallelism
- CFD, FEA, finance, physics
Kepler GK110 Block Diagram

Architecture: 7.1B transistors, 15 SMX units, > 1 TFLOP FP64, 1.5 MB L2 cache, 384-bit GDDR5
SMX Balance of Resources

    Resource                    Kepler GK110 vs Fermi
    Floating-point throughput   2-3x
    Max blocks per SMX          2x
    Max threads per SMX         1.3x
    Register file bandwidth     2x
    Register file capacity      2x
    Shared memory bandwidth     2x
    Shared memory capacity      1x
New ISA Encoding: 255 Registers per Thread

Fermi limit: 63 registers per thread, a common Fermi performance limiter that leads to excessive spilling.

Kepler: up to 255 registers per thread, especially helpful for FP64 apps. Example: a QUDA (lattice QCD) fp64 sample runs 5.3x faster because the extra registers eliminate spills.
New High-Performance SMX Instructions

SHFL (shuffle): intra-warp data exchange.
ATOM: broader functionality, faster.

Compiler-generated, high-performance instructions:
- bit shift
- bit rotate
- fp32 division
- read-only cache
Texture Cache Unlocked

Kepler adds a new compute path through the texture cache that avoids the texture unit: a global address can be fetched and cached with no texture setup. The read-only data cache sits between the SMX and L2.

Why use it?
- Separate pipeline from shared/L1
- Highest miss bandwidth
- Flexible, e.g. unaligned accesses

It is managed automatically by the compiler; "const __restrict__" on a pointer parameter indicates eligibility.
Improving Programmability

Dynamic parallelism enables:
- A simpler CPU/GPU divide
- Library calls from kernels
- Batching to help fill the GPU (occupancy)
- Dynamic load balancing
- Data-dependent execution
- Recursive parallel algorithms
What Does It Mean?

Without dynamic parallelism, the GPU is a co-processor: the CPU drives every launch. With it, the GPU is autonomous and dynamically parallel: it launches its own work.
CDP Nested Launch

- Can create new grids at run-time
- Composable: can only launch into own position in stream
- Can yield until new work completes, then resume using the result (i.e. join)
- A launch by any thread is visible to the thread's whole block

CPU code launches grids A, B, C:

    int main() {
        float *data;
        setup(data);
        A <<< ... >>> (data);
        B <<< ... >>> (data);
        C <<< ... >>> (data);
        cudaThreadSynchronize();
        return 0;
    }

Kernel B launches nested grids X, Y, Z from the device:

    __global__ void B(float *data) {
        do_stuff(data);
        X <<< ..., stream >>> (data);
        Y <<< ..., stream >>> (data);
        Z <<< ..., stream >>> (data);
        cudaStreamSynchronize(stream);
        do_more_stuff(data);
    }
Fermi Concurrency

Fermi allows 16-way concurrency: up to 16 grids can run at once. But CUDA streams multiplex into a single hardware work queue, so overlap happens only at stream edges:

    Stream 1: A -- B -- C
    Stream 2: P -- Q -- R
    Stream 3: X -- Y -- Z

    Hardware work queue: A--B--C  P--Q--R  X--Y--Z
Kepler Improved Concurrency

Kepler allows 32-way concurrency, with one hardware work queue per stream: concurrency at the full-stream level, with no inter-stream dependencies:

    Stream 1: A -- B -- C
    Stream 2: P -- Q -- R
    Stream 3: X -- Y -- Z

    Multiple hardware work queues: A--B--C | P--Q--R | X--Y--Z (concurrent)
Hyper-Q + Proxy Enable Better Utilization

Multiple MPI ranks can execute concurrently on one GPU, so you can use as many MPI ranks as in the CPU-only case. This is particularly interesting for strong scaling.
Hyper-Q + Proxy from the User's Perspective

No code modifications needed: a proxy server runs on the host, and all processes communicate with the GPU via the proxy. Ideal for applications with:
- MPI-everywhere designs
- Non-negligible CPU work
- Code partially migrated to the GPU

    nvidia-cuda-proxy-control -d
Kepler Enables Full NVIDIA GPUDirect™

Diagram: two servers connected over a network. In each server, two GPUs (each with its own GDDR5 memory), the CPU with system memory, and the network card all sit on PCI-e. Full GPUDirect lets the network card move data directly to and from GPU memory, without staging through system memory.
CUDA Compilation

nvcc separates host and device code. Device code translates into a device-specific binary (.cubin) and device-independent assembly (.ptx), and both are embedded in the host object file:

    foo.cu --> foo.ptx --> foo.cubin --> foo.o  (PTX/CUBIN embedded)
    foo.c  --> foo.o                            (host code)
    link   --> a.out                            (embedded foo.cubin; PTX enables JIT)
CUDA 5 Introduces Device Code Linker

Device code from multiple files can now be linked before the host link:

    foo.cu --> foo.ptx --> foo.cubin --> foo.o  (PTX/CUBIN embedded)
    bar.cu --> bar.ptx --> bar.cubin --> bar.o  (PTX/CUBIN embedded)
    foo.o + bar.o --> device linker --> host linker --> a.out
                                        (foo.cubin and bar.cubin, JIT-able via PTX)
Device Linker Invocation

An optional link step is introduced for device code. Link in the device-runtime library for dynamic parallelism. Currently, the link occurs at the cubin level (PTX is not supported).

    nvcc -arch=sm_20 -dc a.cu b.cu
    nvcc -arch=sm_20 -dlink a.o b.o -o link.o
    g++ a.o b.o link.o -L<path> -lcudart

    nvcc -arch=sm_35 -dc a.cu b.cu
    nvcc -arch=sm_35 -dlink a.o b.o -lcudadevrt -o link.o
    g++ a.o b.o link.o -L<path> -lcudadevrt -lcudart
Power is the Problem

1. Data movement dominates power
2. Optimize the storage hierarchy
3. Tailor memory to the application
The High Cost of Data Movement

Energy per operation (28 nm silicon technology, 20 mm die):
- 64-bit DP operation: 20 pJ
- 256-bit access to an 8 kB SRAM: 50 pJ
- 256-bit buses: 26 pJ for a short on-chip hop, 256 pJ across the die
- Off-chip link: 1 nJ (500 pJ for an efficient off-chip link)
- DRAM read/write: 16 nJ

Fetching operands costs more than computing on them.
Echelon Architecture (1/2)

- Lane: DFMAs, 20 GFLOPS
- SM: 8 lanes (processors P x 8, switch, L1$), 160 GFLOPS
- Chip: 128 SMs + 8 latency processors (LP), 20.48 TFLOPS; NoC; 1024 SRAM banks of 256 KB each; memory controllers (MC) and network interface (NI)
- Node MCM: 20 TFLOPS + 256 GB; GPU chip (20 TF DP, 256 MB on-chip), DRAM stacks (1.4 TB/s DRAM bandwidth), NV memory, 150 GB/s network bandwidth
Echelon Architecture (2/2)

- Module: 4 nodes
- Cabinet: 128 nodes, 2.56 PF, 50 KW; central router module(s), Dragonfly interconnect
- System: 400 cabinets, 1 EF, 20 MW; Dragonfly interconnect