gpus: engines for future high- performance computing
TRANSCRIPT
GPUs: Engines for Future High-Performance Computing
John OwensJohn OwensAssistant Professor, Electrical and ComputerAssistant Professor, Electrical and Computer
EngineeringEngineeringInstitute for Data Analysis and VisualizationInstitute for Data Analysis and Visualization
University of California, DavisUniversity of California, Davis
GPUs as Compute Engines10 years ago:10 years ago:•• Graphics done in softwareGraphics done in software5 years ago:5 years ago:•• FFull graphics pipelineull graphics pipelineToday:Today:•• 40x geometry, 13x fill 40x geometry, 13x fill vsvs. 5 yrs ago. 5 yrs ago•• Programmable!Programmable!
Programmable, data parallelProgrammable, data parallelprocessing on every desktopprocessing on every desktop
Enormous opportunity to change theEnormous opportunity to change theway commodity computing is done!way commodity computing is done!
GPU
The Rendering Pipeline
Application
Rasterization
Geometry
Composite
Compute 3D geometryMake calls to graphics API
Transform geometry from 3Dto 2D
Generate fragments from 2Dgeometry
Combine fragments into image
GPU
The Programmable Rendering Pipeline
Application
Rasterization
(Fragment)
Geometry
(Vertex)
Composite
Compute 3D geometryMake calls to graphics API
Transform geometry from 3Dto 2D; vertex programs
Generate fragments from 2Dgeometry; fragment programs
Combine fragments into image
Triangle Setup
L2 Tex
Shader Instruction Dispatch
Fragment Crossbar
MemoryPartition
MemoryPartition
MemoryPartition
MemoryPartition
Z-Cull
NVIDIA GeForce 6800 3D Pipeline
Courtesy Nick Triantos, NVIDIA
Vertex
Fragment
Composite
Courtesy Naga Govindaraju
GPU
CPU
Long-Term Trend: CPU vs. GPU
GPU
Recent GPU Performance TrendsG
FLO
PS
32-bit FP multiplies per second
NVIDIA NV30, 35, 40
ATI R300, 360, 420
Pentium 4
July 01 Jan 02 July 02 Jan 03 July 03 Jan 04
Courtesy Pat Hanrahan/David Luebke
$294
$385
$335
25 GB/s
6 GB/s
36 GB/s
Why Are GPUs Fast?Characteristics of computation permit efficientCharacteristics of computation permit efficient
hardware implementationshardware implementations• High amount of parallelism …
• … exploited by graphics hardware
• High latency tolerance and feed-forward dataflow …
• … allow very deep pipelines
• … allow optimization for bandwidth not latency
Simple controlSimple control• Restrictive programming model
Competition between vendorsCompetition between vendorsWhat about programmability? Effect onWhat about programmability? Effect on
performance?performance? How hard to program?How hard to program?
Programming a GPU for GP Programs
•• Run a SIMD programRun a SIMD programover each fragmentover each fragment
•• ““GatherGather”” is permitted is permittedfrom texture memoryfrom texture memory
•• Resulting buffer canResulting buffer canbebe treated as texturetreated as textureon next passon next pass
•• Draw a screen-sizedDraw a screen-sizedquadquad
GPU Programming is Hard
Must think in graphics metaphorsMust think in graphics metaphorsRequires parallel programmingRequires parallel programming (CPU-GPU,(CPU-GPU,
task, data, instruction)task, data, instruction)Restrictive programming models andRestrictive programming models and
instruction setsinstruction setsPrimitive toolsPrimitive toolsRapidly changing interfacesRapidly changing interfaces
Challenge: Programming Systems
CPUCPUScalarScalar
STL, GNU SL, MPI, STL, GNU SL, MPI, ……C, Fortran, C, Fortran, ……
gccgcc, vendor-specific, , vendor-specific, ……gdbgdb, , vtunevtune, Purify, , Purify, ……
LotsLots applicationsapplications
ProgrammingModel
High-LevelAbstractions/
Libraries
Low-LevelLanguages Compilers
Performance Analysis Tools
GPUGPUStream?Stream?
--GLSL, Cg, HLSL, GLSL, Cg, HLSL, ……Vendor-specificVendor-specific
ShadesmithShadesmith, , NVPerfHUDNVPerfHUDNoneNone
kernelskernels
Docs
Brook: General-Purpose Streaming Language
Stream programming modelStream programming model• Treats GPU as streaming coprocessor
• Streams enforce data parallel computing
• Kernels encourage arithmetic intensity
• Streams and kernels explicitly specified
C with stream extensionsC with stream extensionsOpen-source: www.Open-source: www.sfsf.net/projects/brook/.net/projects/brook/Ian Buck et al., Ian Buck et al., ““Brook for Brook for GPUsGPUs: Stream: Stream
Computing on Graphics HardwareComputing on Graphics Hardware””,,Siggraph 2004Siggraph 2004
Challenge: GPU-to-Host Bandwidth
•• PCI-E optimizes GPU-to-CPU bandwidthPCI-E optimizes GPU-to-CPU bandwidth• 16-lane card: 8 GB/s
• Scalable in future
•• Major vendors support PCI-E cards nowMajor vendors support PCI-E cards now•• Multiple Multiple GPUs GPUs supported per CPU - opportunity!supported per CPU - opportunity!
• Cheap and upgradable
GPUs lack band-width to the host,so we won’t use it!
No one uses hostbandwidth, so wewon’t optimize it!
Challenge: Mobile/embedded market
Why?Why?• UI, messaging/screen savers, navigation,
gaming (location based)
Typical specs (cell-phone class):Typical specs (cell-phone class):• 200-800k gates, ~100 MHz, ~100 mW
• 1-10M vtx/s, 100+M frags/s
WhatWhat’’s important?s important?• Visual quality
• Power-efficient (ops/W)
• Avoid memory accesses, unified shaders …
• Low cost
Challenge: Power
Desktop:Desktop:• Double-width cards
• Workstation powersupplies; draw powerfrom motherboard
Mobile:Mobile:• Batteries improving 5-
10% per year
• Ops/W most important
www.coolingzone.com
Current GPGPU Research
Image processing Image processing [[Johnson/Frank/VaidyaJohnson/Frank/Vaidya,,LLNL]LLNL]
Alternate graphics pipelines Alternate graphics pipelines [Purcell,[Purcell,Carr, Carr, CoombeCoombe]]
Visual simulation Visual simulation [Harris[Harris]]
Volume rendering Volume rendering [[KnissKniss, , KrKrügerüger]]
Level set computation Level set computation [[LefohnLefohn, , StrzodkaStrzodka]]
Numerical methods Numerical methods [[BolzBolz, , KrKrügerüger,,StrzodkaStrzodka]]
Molecular dynamics Molecular dynamics [Buck[Buck]]
Databases Databases [Sun, [Sun, GovindarajuGovindaraju]]
……
Grand Challenges
Architecture: Increase featuresArchitecture: Increase features andandperformance without sacrificing coreperformance without sacrificing coremissionmission
Interfaces: Abstractions, APIs, programmingInterfaces: Abstractions, APIs, programmingmodels, languagesmodels, languages• Many approaches needed
• Goal: C programs compiling to dynamically-balanced CPU-GPU clusters
• Academic and research community
Applications: Killer app needed!Applications: Killer app needed!
Acknowledgements
Craig Lund: Mercury Computer SystemsCraig Lund: Mercury Computer SystemsJeremy Jeremy KepnerKepner: Lincoln Labs: Lincoln LabsNick Nick TriantosTriantos, Craig Kolb: NVIDIA, Craig Kolb: NVIDIAMark Segal: ATIMark Segal: ATIKari Kari PulliPulli: Nokia: NokiaAaron Aaron LefohnLefohn: UC Davis: UC DavisIan Buck: StanfordIan Buck: StanfordFunding: DOE Office of Science, Los AlamosFunding: DOE Office of Science, Los Alamos
National Laboratory, National Laboratory, ChevronTexacoChevronTexaco, UC, UCMICRO, UC DavisMICRO, UC Davis
For more information …
GPGPU home: http://www.gpgpu.org/GPGPU home: http://www.gpgpu.org/• Mark Harris, UNC/NVIDIA
GPU GemsGPU Gems (Addison-Wesley)(Addison-Wesley)• Vol 1: 2004; Vol 2: 2005
Conferences: Siggraph, Graphics Hardware,Conferences: Siggraph, Graphics Hardware,GPGP22
• Course notes: Siggraph ‘04, IEEE Visualization ‘04
University research: Caltech, CMU, University research: Caltech, CMU, DuisbergDuisberg,,Illinois, Purdue, Stanford, SUNYIllinois, Purdue, Stanford, SUNYStonybrookStonybrook, Texas, TU , Texas, TU MMününchchenen, Utah,, Utah,UBC, UC Davis, UNC, Virginia, WaterlooUBC, UC Davis, UNC, Virginia, Waterloo