Introduction to GPU Computing

Peter Messmer (messmer@txcorp.com)

Tech-X Corporation, 5621 Arapahoe Ave., Boulder, CO 80303, http://www.txcorp.com
Tech-X UK Ltd, Daresbury Innovation Centre, Keckwick Lane, Daresbury, Cheshire WA4 4FS, UK, http://www.txcorp.co.uk
Tech-X GmbH, Claridenstrasse 25, CH-8027 Zurich, Switzerland, http://www.txcorp.ch

NIMROD Meeting, Aug 12, 2010
Why scientific computing on GPUs?
GPUs: Massively Parallel Floating-Point Co-Processors
[Figure: schematic comparing a 6-core Westmere CPU with a Fermi GPU built from many multiprocessors (MPs)]
The Large Memory Bandwidth on GPUs can Benefit Many Scientific Computing Applications
[Figure: CPU with cache and host memory vs. GPU with its own memory; memory bandwidth comparison, 2008 and 2009]
The Exascale "Swim Lanes": GPU-like Accelerators are one Candidate for Exascale Systems

• Complex multicore: maintain complex cores and replicate (x86, Power7)
• SoC / embedded: use many simpler, lower-power cores from the embedded space (BlueGene, GreenFlash)
• GPU / accelerators: use highly specialized processors adapted from gaming (NVIDIA Fermi, Cell)

Based on Jeff Broughton (NERSC), Open Fabric Alliance, 3/2010
The Exascale "Swim Lanes": What looks different now may converge in the future

[Figure: parallelism vs. programmability; CPUs evolve from multi-threading to multi-core to many-core, GPUs from fixed function to partially to fully programmable, possibly converging]

Based on Jeff Broughton (NERSC), Open Fabric Alliance, 3/2010, after Justin Rattner, Intel, ISC 2008
Software Development Environment for GPUs: CUDA and OpenCL

• Early GPGPU efforts were heroic
  – Graphics APIs (OpenGL, DirectX) are no natural fit for scientific computing
• CUDA: Compute Unified Device Architecture (http://www.nvidia.com/cuda)
  – Supported on all modern NVIDIA GPUs (notebook GPUs, high-end GPUs, mobile devices)
• OpenCL (http://www.khronos.org/opencl)
  – Open standard, targeting NVIDIA and AMD/ATI GPUs, Cell, multicore x86, ...
• Single source for CPU and GPU
  – Host code: C, C++, Fortran (PGI)
  – GPU code: C++ with extensions, Fortran with extensions
  – nvcc: NVIDIA CUDA compiler; PGI compiler for CUDA Fortran
• Runtime libraries (a minimal data-transfer sketch follows this slide)
  – Data transfer, kernel launch, ...
  – BLAS, FFT libraries
• Simplified GPU development, but still "close to the metal"!
• NEXUS: Visual Studio plug-in for GPU development
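As a minimal sketch of the runtime-library calls mentioned above (allocation and host-device data transfer), assuming illustrative names and sizes that are not from the slides:

#include <cuda_runtime.h>
#include <vector>

int main()
{
    const int n = 1 << 20;
    std::vector<float> host(n, 1.0f);

    // Allocate a buffer in GPU memory and copy the host data over.
    float *dev = nullptr;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // ... launch kernels or call BLAS/FFT libraries on 'dev' here ...

    // Copy results back and release the GPU buffer.
    cudaMemcpy(host.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return 0;
}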
GPULib: High-Productivity GPU Computing

• Bindings for IDL (ITT Vis), MATLAB (MathWorks), C, Fortran
• Rich set of data-parallel kernels
• Extensible with proprietary kernels
• Seamless integration into the host language
• Explicit or implicit management of address spaces
• Interface to Tech-X's FastDL for multi-GPU / distributed-memory processing

http://gpulib.txcorp.com (free for non-commercial use)

Messmer, Mullowney, Granger, "GPULib: GPU computing in High-Level Languages", Computing in Science and Engineering, 10(5), 80, 2008.
GPU Parallelism is Hierarchical (CUDA Terminology)

• Entire grid executes the same code ("kernel")
• Threads execute "independent" serial paths through the kernel
[Figure: a grid is composed of blocks, and each block is composed of threads]
                    Threads within same block     Threads in different blocks
Synchronize         Yes                           No
Communicate         Cheap ("shared memory")       Expensive
Execution order     Chunks of 32 ("warp")         Not defined
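To illustrate the table above, here is a minimal sketch (not from the slides) of a per-block sum: threads of the same block communicate through shared memory and synchronize with __syncthreads(); no synchronization across blocks is attempted.

// Each block sums its own 256-element chunk of 'in' into one entry of 'out'.
// Launch with 256 threads per block.
__global__ void blockSum(const float *in, float *out, int n)
{
    __shared__ float buf[256];               // "cheap" communication within a block
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    buf[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                         // barrier: only valid within a block

    // Tree reduction inside the block.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = buf[0];            // one partial sum per block
}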
SAXPY on GPU

[Code figure with callouts: the SAXPY function is visible from both CPU and GPU (a "kernel"); variables are declared in the GPU address space; the thread grid is defined with 64 threads per block]
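The code figure for this slide did not survive extraction; the following is a minimal SAXPY sketch matching the callouts above (variable names are illustrative, not the original slide code):

// y = a*x + y, one element per thread.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)                                       // guard against overrun
        y[i] = a * x[i] + y[i];
}

// Host side: d_x and d_y are arrays in the GPU address space (cudaMalloc'd),
// launched on a grid of 64-thread blocks covering all n elements:
//   int threadsPerBlock = 64;
//   int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
//   saxpy<<<blocks, threadsPerBlock>>>(n, 2.0f, d_x, d_y);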
The Art of GPU Programming: Avoid the Cliffs

• Reduce host-GPU transfers to a minimum
  – Currently the only way to do MPI
• Avoid thread divergence
  – A warp is executed by a single SIMD instruction
  – Each thread can execute its own code path, BUT all threads in a warp execute all paths taken (predicated execution)
  – Optimal performance when all threads execute the same instructions (see the sketch after this list)
• Provide sufficient parallelism
  – Instruction and memory latency are hidden by executing another warp => up to 1536 concurrent threads per multiprocessor
  – A C2050 with 14 MPs requires 24k concurrently active threads (!!)
• Carefully craft memory access
  – Memory access should be warp-aligned for best performance
  – Less important on Fermi than on older devices due to the L2 cache
  – Per-thread random access is possible, but expensive

Measure, Think, Code ... (Repeat)
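As an illustration of the divergence point above (not slide code): in the first kernel, threads of the same warp take different branches, so the warp executes both paths; in the second, the branch is uniform per warp, so only one path is executed. Both assume the block size is a multiple of 32.

// Divergent: even and odd threads of the same warp take different branches,
// so the warp executes both paths (predicated execution).
__global__ void divergent(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (i % 2 == 0)
            x[i] = x[i] * 2.0f;
        else
            x[i] = x[i] + 1.0f;
    }
}

// Warp-uniform: all 32 threads of a warp evaluate the branch the same way,
// so each warp executes only one of the two paths.
__global__ void uniform(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if ((i / 32) % 2 == 0)
            x[i] = x[i] * 2.0f;
        else
            x[i] = x[i] + 1.0f;
    }
}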
Amdahl's Law, the key to successful optimization

P = parallelizable part of the code
(1-P) = serial part of the code

Parallel speedup on N processors (Amdahl): S(N) = 1 / ((1-P) + P/N)
Maximum achievable speedup (N -> infinity): S_max = 1 / (1-P)

• The same argument applies to optimization: use the "optimizable" part
O = part of the code that benefits from optimization
(1-O) = part that does not benefit from optimization

Speedup achieved by optimizing that part by a factor s: S = 1 / ((1-O) + O/s)
Maximum achievable speedup: S_max = 1 / (1-O)

In order to get a significant speedup, O has to be a very large fraction of the overall time.
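For example (illustrative numbers, not from the slides): if O = 0.9 of the runtime benefits from the GPU and that part is sped up by a factor of 10, the overall speedup is 1 / (0.1 + 0.9/10) ≈ 5.3x, and even an infinitely fast kernel would be capped at 1 / (1-O) = 10x.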
NIMROD on GPUs: Three Kernels

• Large number of small 1D Fourier transforms
  => Just use cuFFT
• Dense linear algebra on many small matrices
  => Just use cuBLAS
• Sparse linear solves
  => Just implement CSR on the GPU
Large Number of small 1D FFTs

Readily available
• cuFFT for 1D, 2D, and 3D FFTs on the GPU
• Follows the FFTW API (plan generation, execution)
• Batched 1D FFTs (a minimal sketch follows this slide)
=> Might work for the given problem if M is really big

Problem
• Unknown performance of a large batch of small-N transforms

Solution
• Hand-crafted FFT that fits into a single block?
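A minimal sketch of a batched 1D transform with cuFFT as described above; the transform length n and batch count are placeholders.

#include <cufft.h>
#include <cuda_runtime.h>

// Perform 'batch' independent complex-to-complex 1D FFTs of length 'n'.
// d_data must hold n*batch cufftComplex values back-to-back in GPU memory.
void batchedFFT(cufftComplex *d_data, int n, int batch)
{
    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, batch);             // one plan covering all transforms
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);   // in-place forward FFTs
    cufftDestroy(plan);
}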
GPULib Acceleration of a Range-Azimuth Processor

• 20000x6500 range-azimuth dataset (single-precision complex)
• Batched 1D FFTs

[Figure: runtime comparison, CPU (2.4 GHz Core2Duo E4600) vs. CPU+GPU (2.4 GHz Core2Duo E4600 + Tesla C1060)]
Dense Linear Algebra on Small Matrices

Readily available
• cuBLAS for BLAS operations on the GPU

Problem
• Kernel invocation overhead for cuBLAS too large
• Only one matrix at a time (with some effort, possibly 16 matrices)
• Even worse for LAPACK-type operations

Solution
• If sufficiently small, one block per matrix (up to ~32x32 in double precision); see the sketch after this slide
• Use vectorizable solvers (e.g. cyclic reduction)
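A minimal sketch of the "one block per matrix" idea (not the actual Tech-X kernel): each block handles one small dense matrix-vector product, with one thread per row.

// Batched dense matrix-vector multiply for many small N x N matrices:
// block b computes y_b = A_b * x_b, one thread per row.
// Assumes N <= blockDim.x (e.g. N up to ~32 in double precision).
__global__ void batchedMatVec(const double *A,   // batch * N * N, row-major
                              const double *x,   // batch * N
                              double *y,         // batch * N
                              int N)
{
    int b   = blockIdx.x;             // which matrix
    int row = threadIdx.x;            // which row of that matrix
    if (row < N) {
        const double *Ab = A + (size_t)b * N * N;
        const double *xb = x + (size_t)b * N;
        double sum = 0.0;
        for (int col = 0; col < N; ++col)
            sum += Ab[row * N + col] * xb[col];
        y[(size_t)b * N + row] = sum;
    }
}

// Launch: batchedMatVec<<<numMatrices, N>>>(dA, dx, dy, N);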
GPU Acceleration of Option Pricing (solving lots of small 1D diffusion equations)

• Large ensemble of 1D diffusion equations
• Tri- or pentadiagonal matrices (sparse!)
• 360,000 systems with 100-300 unknowns
• Reference implementation in MATLAB
• Tech-X custom kernel:
  – N=300: 182 s CPU -> 1.6 s GPU
  – N=100: 45 s CPU -> 0.7 s GPU

[Figure: runtimes for N=100 and N=300 on CPU, GPU (incl. transfer), and GPU (excl. transfer)]
Sparse Linear Solver

Readily available
• Some solver libraries are moving to the GPU (Trilinos)

Problem
• CSR format uses lots of indirection
• Poor performance due to non-coalesced memory access

[Figure: CSR storage (non-zero values plus indices) vs. diagonal storage (main diagonal and off-diagonal bands)]

Solution
• Use the diagonal format if possible
• Simple stride-1 memory access (see the sketch after this slide)
• Accept multiplication by 0
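A minimal sketch of a matrix-vector product in diagonal (DIA) storage, as suggested above; each thread handles one row and walks the stored diagonals with stride-1 access across the warp. Names and layout are illustrative.

// y = A*x with A in diagonal (DIA) format:
// diags[d*n + row] holds A(row, row + offsets[d]); entries falling outside
// the matrix are stored as 0, so the "multiply by 0" cost is simply accepted.
__global__ void spmvDia(const double *diags, const int *offsets,
                        int n, int ndiags,
                        const double *x, double *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n) {
        double sum = 0.0;
        for (int d = 0; d < ndiags; ++d) {
            int col = row + offsets[d];
            double val = diags[d * n + row];   // stride-1 (coalesced) across threads
            if (col >= 0 && col < n)
                sum += val * x[col];
        }
        y[row] = sum;
    }
}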
Kernel Execution Time on a Memory-Bandwidth-Limited Problem, z = x + y

Kernel Execution Time on Transcendentals, y = exp(x)
Conclusions / Summary

• GPUs offer tremendous potential to accelerate scientific applications
• GPUs are one of the swim lanes towards exascale
• Newer generations getting easier to program
• Still a few cliffs:
  – Host-GPU transfer
  – Lots of parallelism
  – Careful memory access
  – Thread divergence
• NIMROD may require some custom development
• We're happy to help!