Introduction to GPU Computing

Peter Messmer (messmer@txcorp.com)

Tech-X Corporation, 5621 Arapahoe Ave., Boulder, CO 80303, http://www.txcorp.com
Tech-X UK Ltd, Daresbury Innovation Centre, Keckwick Lane, Daresbury, Cheshire WA4 4FS, UK, http://www.txcorp.co.uk
Tech-X GmbH, Claridenstrasse 25, CH-8027 Zurich, Switzerland, http://www.txcorp.ch

NIMROD Meeting, Aug 12, 2010
Why scientific computing on GPUs?
GPUs: Massively Parallel Floating-Point Co-Processors
[Figure: schematic comparing a 6-core Westmere CPU with a Fermi GPU built from many multiprocessors (MPs)]
The Large Memory Bandwidth on GPUs can Benefit Many Scientific Computing Applications
[Figure: CPU with cache and host memory vs. GPU with its own memory; memory bandwidth comparison, 2008 and 2009]
The Exascale "Swim Lanes": GPU-like Accelerators are one Candidate for Exascale Systems

• Complex multicore: maintain complex cores and replicate (x86, Power7)
• SoC / embedded: use many simpler, lower-power cores from the embedded space (BlueGene, GreenFlash)
• GPU / accelerators: use highly specialized processors adapted from gaming (NVIDIA Fermi, Cell)

Based on Jeff Broughton (NERSC), Open Fabric Alliance, 3/2010
The Exascale "Swim Lanes": What looks different now may converge in the future

[Figure: parallelism vs. programmability; CPUs evolve from multi-threading to multi-core to many-core, GPUs from fixed function to partially to fully programmable, possibly converging]

Based on Jeff Broughton (NERSC), Open Fabric Alliance, 3/2010, after Justin Rattner, Intel, ISC 2008
Software Development Environment for GPUs: CUDA and OpenCL

• Early GPGPU efforts were heroic
  – Graphics APIs (OpenGL, DirectX) are no natural fit for scientific computing
• CUDA: Compute Unified Device Architecture (http://www.nvidia.com/cuda)
  – Supported on all modern NVIDIA GPUs (notebook GPUs, high-end GPUs, mobile devices)
• OpenCL (http://www.khronos.org/opencl)
  – Open standard, targeting NVIDIA and AMD/ATI GPUs, Cell, multicore x86, ...
• Single source for CPU and GPU
  – Host code: C, C++, Fortran (PGI)
  – GPU code: C++ with extensions, Fortran with extensions
  – nvcc: NVIDIA CUDA compiler; PGI compiler for CUDA Fortran
• Runtime libraries (a minimal data-transfer sketch follows this slide)
  – Data transfer, kernel launch, ...
  – BLAS, FFT libraries
• Simplified GPU development, but still "close to the metal"!
• NEXUS: Visual Studio plug-in for GPU development
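As a minimal sketch of the runtime-library calls mentioned above (allocation and host-device data transfer), assuming illustrative names and sizes that are not from the slides:

#include <cuda_runtime.h>
#include <vector>

int main()
{
    const int n = 1 << 20;
    std::vector<float> host(n, 1.0f);

    // Allocate a buffer in GPU memory and copy the host data over.
    float *dev = nullptr;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // ... launch kernels or call BLAS/FFT libraries on 'dev' here ...

    // Copy results back and release the GPU buffer.
    cudaMemcpy(host.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return 0;
}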
GPULib: High-Productivity GPU Computing

• Bindings for IDL (ITT Vis), MATLAB (MathWorks), C, Fortran
• Rich set of data-parallel kernels
• Extensible with proprietary kernels
• Seamless integration into the host language
• Explicit or implicit management of address spaces
• Interface to Tech-X's FastDL for multi-GPU / distributed-memory processing

http://gpulib.txcorp.com (free for non-commercial use)

Messmer, Mullowney, Granger, "GPULib: GPU computing in High-Level Languages", Computing in Science and Engineering, 10(5), 80, 2008.
GPU Parallelism is Hierarchical (CUDA Terminology)

• Entire grid executes the same code ("kernel")
• Threads execute "independent" serial paths through the kernel
[Figure: a grid is composed of blocks, and each block is composed of threads]
                    Threads within same block     Threads in different blocks
Synchronize         Yes                           No
Communicate         Cheap ("shared memory")       Expensive
Execution order     Chunks of 32 ("warp")         Not defined
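To illustrate the table above, here is a minimal sketch (not from the slides) of a per-block sum: threads of the same block communicate through shared memory and synchronize with __syncthreads(); no synchronization across blocks is attempted.

// Each block sums its own 256-element chunk of 'in' into one entry of 'out'.
// Launch with 256 threads per block.
__global__ void blockSum(const float *in, float *out, int n)
{
    __shared__ float buf[256];               // "cheap" communication within a block
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    buf[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                         // barrier: only valid within a block

    // Tree reduction inside the block.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = buf[0];            // one partial sum per block
}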
SAXPY on GPU

[Code figure with callouts: the SAXPY function is visible from both CPU and GPU (a "kernel"); variables are declared in the GPU address space; the thread grid is defined with 64 threads per block]
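The code figure for this slide did not survive extraction; the following is a minimal SAXPY sketch matching the callouts above (variable names are illustrative, not the original slide code):

// y = a*x + y, one element per thread.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)                                       // guard against overrun
        y[i] = a * x[i] + y[i];
}

// Host side: d_x and d_y are arrays in the GPU address space (cudaMalloc'd),
// launched on a grid of 64-thread blocks covering all n elements:
//   int threadsPerBlock = 64;
//   int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
//   saxpy<<<blocks, threadsPerBlock>>>(n, 2.0f, d_x, d_y);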
The Art of GPU Programming: Avoid the Cliffs

• Reduce host-GPU transfers to a minimum
  – Currently the only way to do MPI
• Avoid thread divergence
  – A warp is executed by a single SIMD instruction
  – Each thread can execute its own code path, BUT all threads in a warp execute all paths taken (predicated execution)
  – Optimal performance when all threads execute the same instructions (see the sketch after this list)
• Provide sufficient parallelism
  – Instruction and memory latency are hidden by executing another warp => up to 1536 concurrent threads per multiprocessor
  – A C2050 with 14 MPs requires 24k concurrently active threads (!!)
• Carefully craft memory access
  – Memory access should be warp-aligned for best performance
  – Less important on Fermi than on older devices due to the L2 cache
  – Per-thread random access is possible, but expensive

Measure, Think, Code ... (Repeat)
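As an illustration of the divergence point above (not slide code): in the first kernel, threads of the same warp take different branches, so the warp executes both paths; in the second, the branch is uniform per warp, so only one path is executed. Both assume the block size is a multiple of 32.

// Divergent: even and odd threads of the same warp take different branches,
// so the warp executes both paths (predicated execution).
__global__ void divergent(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (i % 2 == 0)
            x[i] = x[i] * 2.0f;
        else
            x[i] = x[i] + 1.0f;
    }
}

// Warp-uniform: all 32 threads of a warp evaluate the branch the same way,
// so each warp executes only one of the two paths.
__global__ void uniform(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if ((i / 32) % 2 == 0)
            x[i] = x[i] * 2.0f;
        else
            x[i] = x[i] + 1.0f;
    }
}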
Amdahl's Law, the key to successful optimization

P = parallelizable part of the code
(1-P) = serial part of the code

Parallel speedup on N processors (Amdahl): S(N) = 1 / ((1-P) + P/N)
Maximum achievable speedup (N -> infinity): S_max = 1 / (1-P)

• The same argument applies to optimization: use the "optimizable" part
O = part of the code that benefits from optimization
(1-O) = part that does not benefit from optimization

Speedup achieved by optimizing that part by a factor s: S = 1 / ((1-O) + O/s)
Maximum achievable speedup: S_max = 1 / (1-O)

In order to get a significant speedup, O has to be a very large fraction of the overall time.
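For example (illustrative numbers, not from the slides): if O = 0.9 of the runtime benefits from the GPU and that part is sped up by a factor of 10, the overall speedup is 1 / (0.1 + 0.9/10) ≈ 5.3x, and even an infinitely fast kernel would be capped at 1 / (1-O) = 10x.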
NIMROD on GPUs: Three Kernels

• Large number of small 1D Fourier transforms
  => Just use cuFFT
• Dense linear algebra on many small matrices
  => Just use cuBLAS
• Sparse linear solves
  => Just implement CSR on the GPU
Large Number of small 1D FFTs

Readily available
• cuFFT for 1D, 2D, and 3D FFTs on the GPU
• Follows the FFTW API (plan generation, execution)
• Batched 1D FFTs (a minimal sketch follows this slide)
=> Might work for the given problem if M is really big

Problem
• Unknown performance of a large batch of small-N transforms

Solution
• Hand-crafted FFT that fits into a single block?
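A minimal sketch of a batched 1D transform with cuFFT as described above; the transform length n and batch count are placeholders.

#include <cufft.h>
#include <cuda_runtime.h>

// Perform 'batch' independent complex-to-complex 1D FFTs of length 'n'.
// d_data must hold n*batch cufftComplex values back-to-back in GPU memory.
void batchedFFT(cufftComplex *d_data, int n, int batch)
{
    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, batch);             // one plan covering all transforms
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);   // in-place forward FFTs
    cufftDestroy(plan);
}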
GPULib Acceleration of a Range-Azimuth Processor

• 20000x6500 range-azimuth dataset (single-precision complex)
• Batched 1D FFTs

[Figure: runtime comparison, CPU (2.4 GHz Core2Duo E4600) vs. CPU+GPU (2.4 GHz Core2Duo E4600 + Tesla C1060)]
Dense Linear Algebra on Small Matrices

Readily available
• cuBLAS for BLAS operations on the GPU

Problem
• Kernel invocation overhead for cuBLAS too large
• Only one matrix at a time (with some effort, possibly 16 matrices)
• Even worse for LAPACK-type operations

Solution
• If sufficiently small, one block per matrix (up to ~32x32 in double precision); see the sketch after this slide
• Use vectorizable solvers (e.g. cyclic reduction)
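A minimal sketch of the "one block per matrix" idea (not the actual Tech-X kernel): each block handles one small dense matrix-vector product, with one thread per row.

// Batched dense matrix-vector multiply for many small N x N matrices:
// block b computes y_b = A_b * x_b, one thread per row.
// Assumes N <= blockDim.x (e.g. N up to ~32 in double precision).
__global__ void batchedMatVec(const double *A,   // batch * N * N, row-major
                              const double *x,   // batch * N
                              double *y,         // batch * N
                              int N)
{
    int b   = blockIdx.x;             // which matrix
    int row = threadIdx.x;            // which row of that matrix
    if (row < N) {
        const double *Ab = A + (size_t)b * N * N;
        const double *xb = x + (size_t)b * N;
        double sum = 0.0;
        for (int col = 0; col < N; ++col)
            sum += Ab[row * N + col] * xb[col];
        y[(size_t)b * N + row] = sum;
    }
}

// Launch: batchedMatVec<<<numMatrices, N>>>(dA, dx, dy, N);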
GPU Acceleration of Option Pricing (solving lots of small 1D diffusion equations)

• Large ensemble of 1D diffusion equations
• Tri- or pentadiagonal matrices (sparse!)
• 360,000 systems with 100-300 unknowns
• Reference implementation in MATLAB
• Tech-X custom kernel:
  – N=300: 182 s CPU -> 1.6 s GPU
  – N=100: 45 s CPU -> 0.7 s GPU

[Figure: runtimes for N=100 and N=300 on CPU, GPU (incl. transfer), and GPU (excl. transfer)]
Sparse Linear Solver

Readily available
• Some solver libraries are moving to the GPU (Trilinos)

Problem
• CSR format uses lots of indirection
• Poor performance due to non-coalesced memory access

[Figure: CSR storage (non-zero values plus indices) vs. diagonal storage (main diagonal and off-diagonal bands)]

Solution
• Use the diagonal format if possible
• Simple stride-1 memory access (see the sketch after this slide)
• Accept multiplication by 0
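A minimal sketch of a matrix-vector product in diagonal (DIA) storage, as suggested above; each thread handles one row and walks the stored diagonals with stride-1 access across the warp. Names and layout are illustrative.

// y = A*x with A in diagonal (DIA) format:
// diags[d*n + row] holds A(row, row + offsets[d]); entries falling outside
// the matrix are stored as 0, so the "multiply by 0" cost is simply accepted.
__global__ void spmvDia(const double *diags, const int *offsets,
                        int n, int ndiags,
                        const double *x, double *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n) {
        double sum = 0.0;
        for (int d = 0; d < ndiags; ++d) {
            int col = row + offsets[d];
            double val = diags[d * n + row];   // stride-1 (coalesced) across threads
            if (col >= 0 && col < n)
                sum += val * x[col];
        }
        y[row] = sum;
    }
}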
Kernel Execution Time on a Memory-Bandwidth-Limited Problem, z = x + y

Kernel Execution Time on Transcendentals, y = exp(x)
Conclusions / Summary

• GPUs offer tremendous potential to accelerate scientific applications
• GPUs are one of the swim lanes towards exascale
• Newer generations getting easier to program
• Still a few cliffs:
  – Host-GPU transfer
  – Lots of parallelism
  – Careful memory access
  – Thread divergence
• NIMROD may require some custom development
• We're happy to help!