CEA - Kepler, CUDA 5 and Beyond (HPC)
TRANSCRIPT
© NVIDIA Corporation 2012
Outline
- CUDA Refresher
- Kepler Architecture: SMX
- CUDA 5 Features: CUDA Dynamic Parallelism, Hyper-Q, Device-Code Linker
3 Ways to Accelerate Applications
- Libraries: "drop-in" acceleration
- OpenACC Directives: easily accelerate applications
- Programming Languages: maximum flexibility
GPU Accelerated Libraries
- NVIDIA cuBLAS: GPU-accelerated linear algebra
- NVIDIA cuFFT
- NVIDIA cuRAND
- NVIDIA cuSPARSE: sparse linear algebra
- NVIDIA NPP: vector, signal, and image processing
- Matrix algebra on GPU and multicore
- C++ STL features for CUDA
- Building-block algorithms for CUDA
- IMSL Library
OpenACC Directives

    Program myscience
      ... serial code ...
    !$acc kernels
      do k = 1, n1
        do i = 1, n2
          ... parallel code ...
        enddo
      enddo
    !$acc end kernels
      ...
    End Program myscience

Your original Fortran or C code, plus simple compiler hints (OpenACC directives). The compiler parallelizes the hinted regions, and the same code runs on many-core GPUs and multicore CPUs.
CUDA C/C++

Standard C code:

    void saxpy_serial(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }

    // Invoke serial SAXPY kernel
    saxpy_serial(n, 2.0, x, y);

Parallel C code:

    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n) y[i] = a*x[i] + y[i];
    }

    // Invoke parallel SAXPY kernel with 256 threads/block
    int nblocks = (n + 255) / 256;
    saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
Opening the CUDA Platform with LLVM

The CUDA compiler source is now available with the open-source LLVM compiler. The SDK includes specification documentation, examples, and a verifier, giving anyone the ability to add CUDA support for new languages (beyond CUDA C, C++, and Fortran) and new processors (beyond NVIDIA GPUs and x86 CPUs).

Learn more at http://developer.nvidia.com/cuda-source
GPUs are Mainstream
- Edu/Research: Chinese Academy of Sciences, Max Planck Institute
- Government: Air Force Research Laboratory, Naval Research Laboratory
- Oil & Gas
- Life Sciences: Mass General Hospital
- Finance
- Manufacturing
Tesla K10 (available now)
- 3x single precision
- 1.8x memory bandwidth
- Image, signal, seismic

Tesla K20 (available Q4 2012)
- 3x double precision
- Hyper-Q, dynamic parallelism
- CFD, FEA, finance, physics
Kepler GK110 Block Diagram

Architecture: 7.1B transistors, 15 SMX units, > 1 TFLOP FP64, 1.5 MB L2 cache, 384-bit GDDR5
SMX Balance of Resources

    Resource                    Kepler GK110 vs Fermi
    Floating-point throughput   2-3x
    Max blocks per SMX          2x
    Max threads per SMX         1.3x
    Register file bandwidth     2x
    Register file capacity      2x
    Shared memory bandwidth     2x
    Shared memory capacity      1x
New ISA Encoding: 255 Registers per Thread

Fermi limit: 63 registers per thread, a common Fermi performance limiter that leads to excessive spilling.

Kepler: up to 255 registers per thread, especially helpful for FP64 apps. Example: a QUDA (lattice QCD) fp64 sample runs 5.3x faster because the extra registers eliminate spills.
New High-Performance SMX Instructions

SHFL (shuffle): intra-warp data exchange.
ATOM: broader functionality, faster.

Compiler-generated, high-performance instructions:
- bit shift
- bit rotate
- fp32 division
- read-only cache
Texture Cache Unlocked

Kepler adds a new compute path through the texture cache that avoids the texture unit: a global address can be fetched and cached with no texture setup. The read-only data cache sits between the SMX and L2.

Why use it?
- Separate pipeline from shared/L1
- Highest miss bandwidth
- Flexible, e.g. unaligned accesses

It is managed automatically by the compiler; "const __restrict__" on a pointer parameter indicates eligibility.
Improving Programmability

Dynamic parallelism enables:
- A simpler CPU/GPU divide
- Library calls from kernels
- Batching to help fill the GPU (occupancy)
- Dynamic load balancing
- Data-dependent execution
- Recursive parallel algorithms
What Does It Mean?

Without dynamic parallelism, the GPU is a co-processor: the CPU drives every launch. With it, the GPU is autonomous and dynamically parallel: it launches its own work.
CDP Nested Launch

- Can create new grids at run-time
- Composable: can only launch into own position in stream
- Can yield until new work completes, then resume using the result (i.e. join)
- A launch by any thread is visible to the thread's whole block

CPU code launches grids A, B, C:

    int main() {
        float *data;
        setup(data);
        A <<< ... >>> (data);
        B <<< ... >>> (data);
        C <<< ... >>> (data);
        cudaThreadSynchronize();
        return 0;
    }

Kernel B launches nested grids X, Y, Z from the device:

    __global__ void B(float *data) {
        do_stuff(data);
        X <<< ..., stream >>> (data);
        Y <<< ..., stream >>> (data);
        Z <<< ..., stream >>> (data);
        cudaStreamSynchronize(stream);
        do_more_stuff(data);
    }
Fermi Concurrency

Fermi allows 16-way concurrency: up to 16 grids can run at once. But CUDA streams multiplex into a single hardware work queue, so overlap happens only at stream edges:

    Stream 1: A -- B -- C
    Stream 2: P -- Q -- R
    Stream 3: X -- Y -- Z

    Hardware work queue: A--B--C  P--Q--R  X--Y--Z
Kepler Improved Concurrency

Kepler allows 32-way concurrency, with one hardware work queue per stream: concurrency at the full-stream level, with no inter-stream dependencies:

    Stream 1: A -- B -- C
    Stream 2: P -- Q -- R
    Stream 3: X -- Y -- Z

    Multiple hardware work queues: A--B--C | P--Q--R | X--Y--Z (concurrent)
Hyper-Q + Proxy Enable Better Utilization

Multiple MPI ranks can execute concurrently on one GPU, so you can use as many MPI ranks as in the CPU-only case. This is particularly interesting for strong scaling.
Hyper-Q + Proxy from the User's Perspective

No code modifications needed: a proxy server runs on the host, and all processes communicate with the GPU via the proxy. Ideal for applications with:
- MPI-everywhere designs
- Non-negligible CPU work
- Code partially migrated to the GPU

    nvidia-cuda-proxy-control -d
Kepler Enables Full NVIDIA GPUDirect™

Diagram: two servers connected over a network. In each server, two GPUs (each with its own GDDR5 memory), the CPU with system memory, and the network card all sit on PCI-e. Full GPUDirect lets the network card move data directly to and from GPU memory, without staging through system memory.
CUDA Compilation

nvcc separates host and device code. Device code translates into a device-specific binary (.cubin) and device-independent assembly (.ptx), and both are embedded in the host object file:

    foo.cu --> foo.ptx --> foo.cubin --> foo.o  (PTX/CUBIN embedded)
    foo.c  --> foo.o                            (host code)
    link   --> a.out                            (embedded foo.cubin; PTX enables JIT)
CUDA 5 Introduces Device Code Linker

Device code from multiple files can now be linked before the host link:

    foo.cu --> foo.ptx --> foo.cubin --> foo.o  (PTX/CUBIN embedded)
    bar.cu --> bar.ptx --> bar.cubin --> bar.o  (PTX/CUBIN embedded)
    foo.o + bar.o --> device linker --> host linker --> a.out
                                        (foo.cubin and bar.cubin, JIT-able via PTX)
Device Linker Invocation

An optional link step is introduced for device code. Link in the device-runtime library for dynamic parallelism. Currently, the link occurs at the cubin level (PTX is not supported).

    nvcc -arch=sm_20 -dc a.cu b.cu
    nvcc -arch=sm_20 -dlink a.o b.o -o link.o
    g++ a.o b.o link.o -L<path> -lcudart

    nvcc -arch=sm_35 -dc a.cu b.cu
    nvcc -arch=sm_35 -dlink a.o b.o -lcudadevrt -o link.o
    g++ a.o b.o link.o -L<path> -lcudadevrt -lcudart
Power is the Problem

1. Data movement dominates power
2. Optimize the storage hierarchy
3. Tailor memory to the application
The High Cost of Data Movement

Energy per operation (28 nm silicon technology, 20 mm die):
- 64-bit DP operation: 20 pJ
- 256-bit access to an 8 kB SRAM: 50 pJ
- 256-bit buses: 26 pJ for a short on-chip hop, 256 pJ across the die
- Off-chip link: 1 nJ (500 pJ for an efficient off-chip link)
- DRAM read/write: 16 nJ

Fetching operands costs more than computing on them.
Echelon Architecture (1/2)

- Lane: DFMAs, 20 GFLOPS
- SM: 8 lanes (processors P x 8, switch, L1$), 160 GFLOPS
- Chip: 128 SMs + 8 latency processors (LP), 20.48 TFLOPS; NoC; 1024 SRAM banks of 256 KB each; memory controllers (MC) and network interface (NI)
- Node MCM: 20 TFLOPS + 256 GB; GPU chip (20 TF DP, 256 MB on-chip), DRAM stacks (1.4 TB/s DRAM bandwidth), NV memory, 150 GB/s network bandwidth
Echelon Architecture (2/2)

- Module: 4 nodes
- Cabinet: 128 nodes, 2.56 PF, 50 KW; central router module(s), Dragonfly interconnect
- System: 400 cabinets, 1 EF, 20 MW; Dragonfly interconnect