gpus: engines for future high- performance computing

GPUs: Engines for Future High-Performance Computing

John OwensJohn OwensAssistant Professor, Electrical and ComputerAssistant Professor, Electrical and Computer

EngineeringEngineeringInstitute for Data Analysis and VisualizationInstitute for Data Analysis and Visualization

University of California, DavisUniversity of California, Davis

GPUs as Compute Engines10 years ago:10 years ago:•• Graphics done in softwareGraphics done in software5 years ago:5 years ago:•• FFull graphics pipelineull graphics pipelineToday:Today:•• 40x geometry, 13x fill 40x geometry, 13x fill vsvs. 5 yrs ago. 5 yrs ago•• Programmable!Programmable!

Programmable, data parallelProgrammable, data parallelprocessing on every desktopprocessing on every desktop

Enormous opportunity to change theEnormous opportunity to change theway commodity computing is done!way commodity computing is done!

GPU

The Rendering Pipeline

Application

Rasterization

Geometry

Composite

Compute 3D geometryMake calls to graphics API

Transform geometry from 3Dto 2D

Generate fragments from 2Dgeometry

Combine fragments into image

GPU

The Programmable Rendering Pipeline

Application

Rasterization

(Fragment)

Geometry

(Vertex)

Composite

Compute 3D geometryMake calls to graphics API

Transform geometry from 3Dto 2D; vertex programs

Generate fragments from 2Dgeometry; fragment programs

Combine fragments into image

Triangle Setup

L2 Tex

Shader Instruction Dispatch

Fragment Crossbar

MemoryPartition

MemoryPartition

MemoryPartition

MemoryPartition

Z-Cull

NVIDIA GeForce 6800 3D Pipeline

Courtesy Nick Triantos, NVIDIA

Vertex

Fragment

Composite

Courtesy Naga Govindaraju

GPU

CPU

Long-Term Trend: CPU vs. GPU

GPU

Recent GPU Performance TrendsG

FLO

PS

32-bit FP multiplies per second

NVIDIA NV30, 35, 40

ATI R300, 360, 420

Pentium 4

July 01 Jan 02 July 02 Jan 03 July 03 Jan 04

Courtesy Pat Hanrahan/David Luebke

$294

$385

$335

25 GB/s

6 GB/s

36 GB/s

Why Are GPUs Fast?Characteristics of computation permit efficientCharacteristics of computation permit efficient

hardware implementationshardware implementations• High amount of parallelism …

• … exploited by graphics hardware

• High latency tolerance and feed-forward dataflow …

• … allow very deep pipelines

• … allow optimization for bandwidth not latency

Simple controlSimple control• Restrictive programming model

Competition between vendorsCompetition between vendorsWhat about programmability? Effect onWhat about programmability? Effect on

performance?performance? How hard to program?How hard to program?

Programming a GPU for GP Programs

•• Run a SIMD programRun a SIMD programover each fragmentover each fragment

•• ““GatherGather”” is permitted is permittedfrom texture memoryfrom texture memory

•• Resulting buffer canResulting buffer canbebe treated as texturetreated as textureon next passon next pass

•• Draw a screen-sizedDraw a screen-sizedquadquad

GPU Programming is Hard

Must think in graphics metaphorsMust think in graphics metaphorsRequires parallel programmingRequires parallel programming (CPU-GPU,(CPU-GPU,

task, data, instruction)task, data, instruction)Restrictive programming models andRestrictive programming models and

instruction setsinstruction setsPrimitive toolsPrimitive toolsRapidly changing interfacesRapidly changing interfaces

Challenge: Programming Systems

CPUCPUScalarScalar

STL, GNU SL, MPI, STL, GNU SL, MPI, ……C, Fortran, C, Fortran, ……

gccgcc, vendor-specific, , vendor-specific, ……gdbgdb, , vtunevtune, Purify, , Purify, ……

LotsLots applicationsapplications

ProgrammingModel

High-LevelAbstractions/

Libraries

Low-LevelLanguages Compilers

Performance Analysis Tools

GPUGPUStream?Stream?

--GLSL, Cg, HLSL, GLSL, Cg, HLSL, ……Vendor-specificVendor-specific

ShadesmithShadesmith, , NVPerfHUDNVPerfHUDNoneNone

kernelskernels

Docs

Brook: General-Purpose Streaming Language

Stream programming modelStream programming model• Treats GPU as streaming coprocessor

• Streams enforce data parallel computing

• Kernels encourage arithmetic intensity

• Streams and kernels explicitly specified

C with stream extensionsC with stream extensionsOpen-source: www.Open-source: www.sfsf.net/projects/brook/.net/projects/brook/Ian Buck et al., Ian Buck et al., ““Brook for Brook for GPUsGPUs: Stream: Stream

Computing on Graphics HardwareComputing on Graphics Hardware””,,Siggraph 2004Siggraph 2004

Challenge: GPU-to-Host Bandwidth

•• PCI-E optimizes GPU-to-CPU bandwidthPCI-E optimizes GPU-to-CPU bandwidth• 16-lane card: 8 GB/s

• Scalable in future

•• Major vendors support PCI-E cards nowMajor vendors support PCI-E cards now•• Multiple Multiple GPUs GPUs supported per CPU - opportunity!supported per CPU - opportunity!

• Cheap and upgradable

GPUs lack band-width to the host,so we won’t use it!

No one uses hostbandwidth, so wewon’t optimize it!

Challenge: Mobile/embedded market

Why?Why?• UI, messaging/screen savers, navigation,

gaming (location based)

Typical specs (cell-phone class):Typical specs (cell-phone class):• 200-800k gates, ~100 MHz, ~100 mW

• 1-10M vtx/s, 100+M frags/s

WhatWhat’’s important?s important?• Visual quality

• Power-efficient (ops/W)

• Avoid memory accesses, unified shaders …

• Low cost

Challenge: Power

Desktop:Desktop:• Double-width cards

• Workstation powersupplies; draw powerfrom motherboard

Mobile:Mobile:• Batteries improving 5-

10% per year

• Ops/W most important

www.coolingzone.com

Current GPGPU Research

Image processing Image processing [[Johnson/Frank/VaidyaJohnson/Frank/Vaidya,,LLNL]LLNL]

Alternate graphics pipelines Alternate graphics pipelines [Purcell,[Purcell,Carr, Carr, CoombeCoombe]]

Visual simulation Visual simulation [Harris[Harris]]

Volume rendering Volume rendering [[KnissKniss, , KrKrügerüger]]

Level set computation Level set computation [[LefohnLefohn, , StrzodkaStrzodka]]

Numerical methods Numerical methods [[BolzBolz, , KrKrügerüger,,StrzodkaStrzodka]]

Molecular dynamics Molecular dynamics [Buck[Buck]]

Databases Databases [Sun, [Sun, GovindarajuGovindaraju]]

……

Grand Challenges

Architecture: Increase featuresArchitecture: Increase features andandperformance without sacrificing coreperformance without sacrificing coremissionmission

Interfaces: Abstractions, APIs, programmingInterfaces: Abstractions, APIs, programmingmodels, languagesmodels, languages• Many approaches needed

• Goal: C programs compiling to dynamically-balanced CPU-GPU clusters

• Academic and research community

Applications: Killer app needed!Applications: Killer app needed!

Acknowledgements

Craig Lund: Mercury Computer SystemsCraig Lund: Mercury Computer SystemsJeremy Jeremy KepnerKepner: Lincoln Labs: Lincoln LabsNick Nick TriantosTriantos, Craig Kolb: NVIDIA, Craig Kolb: NVIDIAMark Segal: ATIMark Segal: ATIKari Kari PulliPulli: Nokia: NokiaAaron Aaron LefohnLefohn: UC Davis: UC DavisIan Buck: StanfordIan Buck: StanfordFunding: DOE Office of Science, Los AlamosFunding: DOE Office of Science, Los Alamos

National Laboratory, National Laboratory, ChevronTexacoChevronTexaco, UC, UCMICRO, UC DavisMICRO, UC Davis

For more information …

GPGPU home: http://www.gpgpu.org/GPGPU home: http://www.gpgpu.org/• Mark Harris, UNC/NVIDIA

GPU GemsGPU Gems (Addison-Wesley)(Addison-Wesley)• Vol 1: 2004; Vol 2: 2005

Conferences: Siggraph, Graphics Hardware,Conferences: Siggraph, Graphics Hardware,GPGP22

• Course notes: Siggraph ‘04, IEEE Visualization ‘04

University research: Caltech, CMU, University research: Caltech, CMU, DuisbergDuisberg,,Illinois, Purdue, Stanford, SUNYIllinois, Purdue, Stanford, SUNYStonybrookStonybrook, Texas, TU , Texas, TU MMününchchenen, Utah,, Utah,UBC, UC Davis, UNC, Virginia, WaterlooUBC, UC Davis, UNC, Virginia, Waterloo

gpus: engines for future high- performance computing

Documents