
Page 1: Introduction to GPU Computing

Introduction to GPU Computing

Peter Messmer
messmer@txcorp.com

Tech-X Corporation
5621 Arapahoe Ave., Boulder, CO 80303
http://www.txcorp.com

Tech-X UK Ltd
Daresbury Innovation Centre
Keckwick Lane
Daresbury
Cheshire WA4 4FS, UK
http://www.txcorp.co.uk

Tech-X GmbH
Claridenstrasse 25
CH-8027 Zurich
Switzerland
http://www.txcorp.ch

NIMROD Meeting, Aug 12, 2010

Page 2: Introduction to GPU Computing

Why scientific computing on GPUs?

Page 3: Introduction to GPU Computing

GPUs: Massively Parallel Floating-Point Co-Processors

[Figure: block diagrams comparing a 6-core Westmere CPU with a Fermi GPU built from many multiprocessors (MPs)]

Page 4: Introduction to GPU Computing

The Large Memory Bandwidth on GPUs can Benefit Many Scientific Computing Applications

[Figure: architecture sketch of a CPU with cache and host memory next to a GPU with its own GPU memory, with 2008/2009 bandwidth figures]

Page 5: Introduction to GPU Computing

The Exascale “Swim Lanes”: GPU-like Accelerators are one Candidate for Exascale Systems

• Complex Multicore: maintain complex cores, and replicate (x86, Power7)
• SoC / Embedded: use many simpler, lower-power cores from the embedded space (BlueGene, GreenFlash)
• GPU / Accelerators: use highly specialized processors adapted from gaming (NVIDIA Fermi, Cell)

Based on Jeff Broughton (NERSC), Open Fabric Alliance, 3/2010

Page 6: Introduction to GPU Computing

The Exascale “Swim Lanes”: What Looks Different Now May Converge in the Future

[Figure: parallelism vs. programmability chart; CPUs evolve from multi-threading to multi-core to many-core, GPUs from fixed function through partially programmable to fully programmable, with question marks where the two paths may converge]

Based on Jeff Broughton (NERSC), Open Fabric Alliance, 3/2010, after Justin Rattner, Intel, ISC 2008

Page 7: Introduction to GPU Computing

Software Development Environment for GPUs: CUDA and OpenCL

• Early GPGPU efforts were heroic
  – Graphics APIs (OpenGL, DirectX) are no natural fit for scientific computing
• CUDA: Compute Unified Device Architecture (http://www.nvidia.com/cuda)
  – Supported on all modern NVIDIA GPUs (notebook GPUs, high-end GPUs, mobile devices)
• OpenCL (http://www.khronos.org/opencl)
  – Open standard, targeting NVIDIA and AMD/ATI GPUs, Cell, multicore x86, ...
• Single source for CPU and GPU
  – Host code: C, C++, Fortran (PGI)
  – GPU code: C++ with extensions, Fortran with extensions
  – nvcc: NVIDIA CUDA compiler
  – PGI compiler for CUDA Fortran
• Runtime libraries
  – Data transfer, kernel launch, ...
  – BLAS, FFT libraries
• Simplified GPU development, but still “close to the metal”!
• NEXUS: Visual Studio plug-in for GPU development

Page 8: Introduction to GPU Computing

GPULib: High-Productivity GPU Computing

• Bindings for IDL (ITT Vis), MATLAB (MathWorks), C, Fortran
• Rich set of data-parallel kernels
• Extensible with proprietary kernels
• Seamless integration into the host language
• Explicit or implicit management of address spaces
• Interface to Tech-X’s FastDL for multi-GPU/distributed-memory processing

http://gpulib.txcorp.com (free for non-commercial use)

Messmer, Mullowney, Granger, “GPULib: GPU Computing in High-Level Languages”, Computers in Science and Engineering, 10(5), 80, 2008.

Page 9: Introduction to GPU Computing

GPU Parallelism is Hierarchical (CUDA Terminology)

• Entire grid executes the same code (the “kernel”)
• Threads execute “independent” serial paths through the kernel

[Figure: a grid made up of blocks, each block made up of threads]

                 Threads within same block    Threads in different blocks
Synchronize      Yes                          No
Communicate      Cheap (“shared memory”)      Expensive
Execution order  Chunks of 32 (“warp”)        Not defined
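As an aside, a minimal CUDA fragment (assumed for illustration, not from the slides) showing what the table means in code: threads of one block exchange data through fast shared memory and synchronize with __syncthreads(), while no such barrier exists between blocks:

```cuda
// Hypothetical example: reverse 64-element tiles in place. Threads of a
// block communicate through shared memory and synchronize with
// __syncthreads(); there is no equivalent barrier across blocks.
__global__ void block_reverse(float *data)
{
    __shared__ float tile[64];           // visible to all threads of this block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = data[i];         // each thread stores one element
    __syncthreads();                     // block-wide barrier

    // read an element written by another thread of the same block
    data[i] = tile[blockDim.x - 1 - threadIdx.x];
}

// Launch with 64 threads per block, e.g.: block_reverse<<<n / 64, 64>>>(d_data);
```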

Page 10: Introduction to GPU Computing

SAXPY on GPU

[Code figure, annotated:]
• Function visible from both CPU and GPU (the “kernel”)
• Declare variables in the GPU address space
• Define the thread grid, 64 threads per block
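The slide’s code listing did not survive extraction. Below is a minimal CUDA sketch, consistent with the three annotations above (a __global__ kernel, device-side arrays, a grid of 64-thread blocks), of what such a SAXPY example typically looks like; variable names and sizes are assumptions:

```cuda
#include <cuda_runtime.h>

// Kernel: declared on the host, executed on the GPU
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard against overrun
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    const int n = 1 << 20;              // problem size (assumed)
    float *x, *y;                       // pointers into the GPU address space
    cudaMalloc(&x, n * sizeof(float));  // allocate in GPU memory
    cudaMalloc(&y, n * sizeof(float));
    // ... fill x and y via cudaMemcpy from host arrays ...

    int threads = 64;                           // 64 threads per block, as on the slide
    int blocks  = (n + threads - 1) / threads;  // enough blocks to cover n
    saxpy<<<blocks, threads>>>(n, 2.0f, x, y);  // launch the thread grid

    cudaDeviceSynchronize();
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```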

Page 11: Introduction to GPU Computing

The Art of GPU Programming: Avoid the Cliffs

Reduce host-GPU transfers to a minimum
• Host memory is currently the only path for MPI communication

Avoid thread divergence (see the sketch after this slide)
• A warp is “executed by a single SIMD instruction”
• Each thread can execute its own code path, BUT all threads in a warp execute all paths taken (predicated execution)
• Optimal performance when all threads execute the same instructions

Provide sufficient parallelism
• Instruction and memory latency are hidden by executing another warp
  => up to 1536 concurrent threads per multiprocessor
• A C2050 with 14 MPs requires 24k concurrently active threads (!!)

Carefully craft memory accesses
• Memory accesses should be warp-aligned for best performance
• Less important on Fermi than on older devices, thanks to the L2 cache
• Per-thread random access is possible, but expensive
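To make the predicated-execution point concrete, here is a hypothetical pair of kernels (not from the slides; the names and the even/odd split are invented for illustration). Both compute the same result, but in the first every warp must execute both branches, while in the second whole warps take a single path:

```cuda
// Divergent: even and odd threads of the same warp take different
// branches, so the warp executes both paths under predication.
__global__ void divergent(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0)
        out[i] = in[i] * 2.0f;
    else
        out[i] = in[i] + 1.0f;
}

// Uniform: all 32 threads of a warp take the same branch, so each
// warp executes only one path.
__global__ void uniform(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((i / 32) % 2 == 0)
        out[i] = in[i] * 2.0f;
    else
        out[i] = in[i] + 1.0f;
}
```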

Page 12: Introduction to GPU Computing

Measure, Think, Code .... (Repeat)

Amdahl’s Law, the key to successful optimization:

P = parallelizable part of the code
(1 - P) = serial part of the code

Parallel speedup on N processors (Amdahl): S(N) = 1 / ((1 - P) + P/N)

Maximum achievable speedup: S(∞) = 1 / (1 - P)

For optimization, use the “optimizable” part instead:

O = part of the code that benefits from optimization
(1 - O) = part that does not benefit from optimization

Speedup achieved by optimizing that part by a factor s: S = 1 / ((1 - O) + O/s)

Maximum achievable speedup: 1 / (1 - O)

In order to get a significant speedup, O has to be a very large fraction of the overall time.
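As a worked example (numbers assumed for illustration): if O = 0.9 of the runtime benefits from the GPU and that part is sped up by a factor s = 10, then

$$S = \frac{1}{(1 - O) + O/s} = \frac{1}{0.1 + 0.09} \approx 5.3,$$

so even a 10x kernel speedup yields barely a 5.3x overall gain, which is why host-GPU transfers and remaining serial code dominate so quickly.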

Page 13: Introduction to GPU Computing

NIMROD on GPUs: Three Kernels

• Large number of small 1D Fourier transforms
  => Just use cuFFT
• Dense linear algebra on many small matrices
  => Just use cuBLAS
• Sparse linear solves
  => Just implement CSR on the GPU

Page 14: Introduction to GPU Computing

Large Number of Small 1D FFTs

Readily available
• cuFFT for 1D, 2D, and 3D FFTs on the GPU
• Follows the FFTW API (plan generation, execution)
• Batched 1D FFTs (see the sketch below)
=> Might work for the given problem if M is really big

Problem
• Unknown performance on a large batch of small-N transforms

Solution
• Hand-crafted FFT that fits into a single block?
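A minimal sketch of the batched cuFFT interface the slide refers to, following the FFTW-style plan/execute model; N and M here are placeholders for illustration, not the actual NIMROD sizes:

```cuda
#include <cuda_runtime.h>
#include <cufft.h>

int main(void)
{
    const int N = 128;    // transform length (assumed)
    const int M = 10000;  // number of transforms in the batch (assumed)

    cufftComplex *data;
    cudaMalloc(&data, sizeof(cufftComplex) * N * M);
    // ... copy the M input signals into 'data' ...

    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, M);           // one plan for all M FFTs
    cufftExecC2C(plan, data, data, CUFFT_FORWARD); // in-place batched execution

    cudaDeviceSynchronize();
    cufftDestroy(plan);
    cudaFree(data);
    return 0;
}
```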

Page 15: Introduction to GPU Computing

GPULib Acceleration of a Range-Azimuth Processor

[Figure: runtime comparison on a 20000x6500 range-azimuth dataset (single-precision complex): CPU (2.4 GHz Core2Duo E4600) vs. CPU+GPU (same CPU plus a Tesla C1060), with the batched 1D FFT stages annotated]

Page 16: Introduction to GPU Computing

Dense Linear Algebra on Small Matrices

Readily available
• cuBLAS for BLAS operations on the GPU

Problem
• Kernel invocation overhead for cuBLAS is too large
• Only one matrix at a time (with some effort, possibly 16 matrices)
• Even worse for LAPACK-type operations

Solution
• If sufficiently small, one block per matrix (up to ~32x32 in double precision)
• Use vectorizable solvers (e.g. cyclic reduction)
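A hypothetical sketch of the “one block per matrix” idea, illustrated here with a batched matrix-vector product rather than the author’s solver; names and the size N are assumptions:

```cuda
#define N 32  // matrix dimension; small enough for one block per matrix

// Each block handles one NxN matrix, with one thread per row, so M
// independent small problems run concurrently in a single launch.
__global__ void batched_matvec(const double *A,  // M matrices, NxN each
                               const double *x,  // M input vectors
                               double *y)        // M output vectors
{
    const double *Am = A + blockIdx.x * N * N;  // this block's matrix
    const double *xm = x + blockIdx.x * N;      // this block's vector
    double *ym       = y + blockIdx.x * N;

    int row = threadIdx.x;                      // one thread per row
    double sum = 0.0;
    for (int col = 0; col < N; ++col)
        sum += Am[row * N + col] * xm[col];
    ym[row] = sum;
}

// Launch: batched_matvec<<<M, N>>>(A, x, y);  // M blocks, N threads each
```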

Page 17: Introduction to GPU Computing

GPU Acceleration of Option Pricing (solving lots of small 1D diffusion equations)

• Large ensemble of 1D diffusion equations
• Tri- or pentadiagonal matrices (sparse!)
• 360,000 systems with 100-300 unknowns
• Reference implementation in MATLAB
• Tech-X custom kernel:
  N=300: 182 s CPU -> 1.6 s GPU
  N=100: 45 s CPU -> 0.7 s GPU

[Figure: runtime bars for N=100 and N=300 on the CPU, the GPU including transfers, and the GPU excluding transfers]

Page 18: Introduction to GPU Computing

Sparse Linear Solver

Readily available
• Some solver libraries are moving to the GPU (Trilinos)

Problem
• The CSR format uses lots of indirection (non-zero values plus index arrays)
• Poor performance due to non-coalesced memory access

Solution (see the sketch below)
• Use the diagonal format if possible: the main diagonal and each off-diagonal are stored as dense vectors
• Simple stride-1 memory access
• Accept multiplications by 0
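A minimal sketch (assumed, not the author’s kernel) of a sparse matrix-vector product in diagonal (DIA) storage, showing the stride-1 access and the “multiply by 0” padding the slide describes:

```cuda
// Each diagonal is stored as a dense column of length n, so consecutive
// rows (and hence consecutive threads of a warp) read consecutive
// addresses. Entries padded outside the matrix are stored as 0 and
// simply multiplied through.
__global__ void dia_spmv(int n, int ndiags,
                         const int *offsets,   // offset of each stored diagonal
                         const double *diags,  // n x ndiags, one diagonal per column
                         const double *x, double *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;

    double sum = 0.0;
    for (int d = 0; d < ndiags; ++d) {
        int col = row + offsets[d];
        if (col >= 0 && col < n)                // out-of-range entries are zero padding
            sum += diags[d * n + row] * x[col]; // stride-1 across the warp
    }
    y[row] = sum;
}
```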

Page 19: Introduction to GPU Computing

Kernel Execution Time on a Memory-Bandwidth-Limited Problem, z = x + y

Page 20: Introduction to GPU Computing

Kernel Execution Time on Transcendentals, y = exp(x)

Page 21: Introduction to GPU Computing

Conclusions/Summary

• GPUs offer tremendous potential to accelerate scientific applications
• GPUs are one of the swim lanes towards exascale
• Newer generations are getting easier to program
• Still a few cliffs:
  • Host-GPU transfer
  • Lots of parallelism required
  • Careful memory access
  • Thread divergence
• NIMROD may require some custom development
• We’re happy to help!