general-purpose computation on graphics hardware david luebke university of virginia

General-Purpose Computation on General-Purpose Computation on Graphics HardwareGraphics Hardware

David Luebke University of Virginia

Course IntroductionCourse Introduction

• The GPU on commodity video cards has evolved into an extremely flexible and powerful processor– Programmability– Precision– Power

• We are interested in how to harness that power for general-purpose computation

Motivation: Computational PowerMotivation: Computational Power

• GPUs are fast…– 3 GHz Pentium4 theoretical: 6 GFLOPS, 5.96 GB/sec peak– GeForceFX 5900 observed: 20 GFLOPS, 25.3 GB/sec peak– GeForce 6800 Ultra observed: 53 GFLOPS, 35.2 GB/sec peak– GeForce 8800 GTX: estimated at 520 GFLOPS, 86.4 GB/sec

peak (reference here)• That’s almost 100 times faster than a 3 Ghz Pentium4!

• GPUs are getting faster, faster– CPUs: annual growth 1.5× decade growth 60× – GPUs: annual growth > 2.0× decade growth > 1000

Courtesy Kurt Akeley,Ian Buck & Tim Purcell, GPU Gems (see course notes)

Motivation:Computational PowerMotivation:Computational Power

Courtesy Naga Govindaraju

GPU

CPU

Motivation:Computational PowerMotivation:Computational Power

Courtesy Ian Buck

GPU

GFL

OPS

multiplies per second

NVIDIA NV30, 35, 40

ATI R300, 360, 420

Pentium 4

July 01 Jan 02 July 02 Jan 03 July 03 Jan 04

An Aside: Computational PowerAn Aside: Computational Power

• Why are GPUs getting faster so fast?– Arithmetic intensity: the specialized nature of GPUs makes it

easier to use additional transistors for computation not cache– Economics: multi-billion dollar video game market is a

pressure cooker that drives innovation

Motivation:Flexible and preciseMotivation:Flexible and precise

• Modern GPUs are deeply programmable– Programmable pixel, vertex, video engines– Solidifying high-level language support

• Modern GPUs support high precision– 32 bit floating point throughout the pipeline– High enough for many (not all) applications

Motivation:The Potential of GPGPUMotivation:The Potential of GPGPU

• The power and flexibility of GPUs makes them an attractive platform for general-purpose computation

• Example applications range from in-game physics simulation to conventional computational science

• Goal: make the inexpensive power of the GPU available to developers as a sort of computational coprocessor

The Problem:Difficult To UseThe Problem:Difficult To Use

• GPUs designed for and driven by video games– Programming model is unusual & tied to computer graphics– Programming environment is tightly constrained

• Underlying architectures are:– Inherently parallel– Rapidly evolving (even in basic feature set!)– Largely secret

• Can’t simply “port” code written for the CPU!

GPU Fundamentals:The Graphics PipelineGPU Fundamentals:The Graphics Pipeline

• A simplified graphics pipeline– Note that pipe widths vary– Many caches, FIFOs, and so on not shown

GPUCPU

ApplicationApplication TransformTransform RasterizerRasterizer ShadeShade VideoMemory

(Textures)

VideoMemory

(Textures)VerticesVertices

(3D)(3D)Xformed,Xformed,

LitLitVerticesVertices

(2D)(2D)

FragmentsFragments(pre-pixels)(pre-pixels)

FinalFinalpixelspixels

(Color, Depth)(Color, Depth)

Graphics StateGraphics State

Render-to-textureRender-to-texture

GPU Fundamentals:The Modern Graphics PipelineGPU Fundamentals:The Modern Graphics Pipeline

• Programmable vertex processor!

• Programmable pixel processor!

GPUCPU

ApplicationApplication VertexProcessor

VertexProcessor RasterizerRasterizer Pixel

ProcessorPixel

ProcessorVideo

Memory(Textures)

VideoMemory

(Textures)VerticesVertices

(3D)(3D)Xformed,Xformed,

LitLitVerticesVertices

(2D)(2D)

FragmentsFragments(pre-pixels)(pre-pixels)

FinalFinalpixelspixels

(Color, Depth)(Color, Depth)

Graphics StateGraphics State

Render-to-textureRender-to-texture

VertexProcessor

VertexProcessor

FragmentProcessorFragmentProcessor

GPU Pipeline: TransformGPU Pipeline: Transform

• Vertex Processor (multiple operate in parallel)– Transform from “world space” to “image space”– Compute per-vertex lighting

GPU Pipeline: RasterizerGPU Pipeline: Rasterizer

• Rasterizer– Convert geometric rep. (vertex) to image rep. (fragment)

• Fragment = image fragment– Pixel + associated data: color, depth, stencil, etc.

– Interpolate per-vertex quantities across pixels

GPU Pipeline: ShadeGPU Pipeline: Shade

• Fragment Processors (multiple in parallel)– Compute a color for each pixel– Optionally read colors from textures (images)

Importance of Data ParallelismImportance of Data Parallelism

• GPU: Each vertex / fragment is independent– Temporary registers are zeroed– No static data– No read-modify-write buffers

• Data parallel processing– Best for ALU-heavy architectures: GPUs

• Multiple vertex & pixel pipelines– Hide memory latency (with more computation)

Courtesy of Ian Buck

Arithmetic IntensityArithmetic Intensity

• Lots of ops per word transferred

• GPGPU demands high arithmetic intensity for peak performance– Ex: solving systems of linear equations– Physically-based simulation on lattices– All-pairs shortest paths

Courtesy of Pat Hanrahan

Data Streams & KernelsData Streams & Kernels

• Streams– Collection of records requiring similar computation

• Vertex positions, Voxels, FEM cells, etc.– Provide data parallelism

• Kernels– Functions applied to each element in stream

• Transforms, PDE, …– No dependencies between stream elements

• Encourage high arithmetic intensity

Courtesy of Ian Buck

Example: Simulation GridExample: Simulation Grid

• Common GPGPU computation style– Textures represent computational grids = streams

• Many computations map to grids– Matrix algebra– Image & Volume processing– Physical simulation– Global Illumination

• ray tracing, photon mapping, radiosity

• Non-grid streams can be mapped to grids

Stream ComputationStream Computation

• Grid Simulation algorithm– Made up of steps– Each step updates entire grid– Must complete before next step can begin

• Grid is a stream, steps are kernels– Kernel applied to each stream element

Scatter vs. GatherScatter vs. Gather

• Grid communication– Grid cells share information

Computational Resources InventoryComputational Resources Inventory

• Programmable parallel processors– Vertex & Fragment pipelines

• Rasterizer– Mostly useful for interpolating addresses (texture coordinates)

and per-vertex constants

• Texture unit– Read-only memory interface

• Render to texture– Write-only memory interface

Vertex ProcessorVertex Processor

• Fully programmable (SIMD / MIMD)

• Processes 4-vectors (RGBA / XYZW)

• Capable of scatter but not gather– Can change the location of current vertex– Cannot read info from other vertices– Can only read a small constant memory

• Future hardware enables gather!– Vertex textures

Fragment ProcessorFragment Processor

• Fully programmable (SIMD)

• Processes 4-vectors (RGBA / XYZW)

• Random access memory read (textures)

• Capable of gather but not scatter– No random access memory writes – Output address fixed to a specific pixel

• Typically more useful than vertex processor– More fragment pipelines than vertex pipelines– RAM read– Direct output

CPU-GPU AnalogiesCPU-GPU Analogies

• CPU programming is familiar– GPU programming is graphics-centric

• Analogies can aid understanding


CPU GPU

Stream / Data Array = Texture

Memory Read = Texture Sample


Loop body / kernel / algorithm step = Fragment Program

CPU GPU

FeedbackFeedback

• Each algorithm step depend on the results of previous steps

• Each time step depends on the results of the previous time step


. .

.

Grid[i][j]= x; . . .

Array Write = Render to Texture

CPU GPU

GPU Simulation OverviewGPU Simulation Overview

• Analogies lead to implementation– Algorithm steps are fragment programs

• Computational “kernels”– Current state variables stored in textures– Feedback via render to texture

• One question: how do we invoke computation?

Invoking ComputationInvoking Computation

• Must invoke computation at each pixel– Just draw geometry!– Most common GPGPU invocation is a full-screen quad

Typical “Grid” ComputationTypical “Grid” Computation

• Initialize “view” (so that pixels:texels::1:1)

glMatrixMode(GL_MODELVIEW);glLoadIdentity();glMatrixMode(GL_PROJECTION);glLoadIdentity();glOrtho(0, 1, 0, 1, 0, 1);glViewport(0, 0, outTexResX, outTexResY);

• For each algorithm step:– Activate render-to-texture– Setup input textures, fragment program– Draw a full-screen quad (1x1)

Branching TechniquesBranching Techniques

• Fragment program branches can be expensive– No true fragment branching on NV3X & R3X0

• Better to move decisions up the pipeline– Replace with math– Occlusion Query– Domain decomposition– Z-cull– Pre-computation

Branching with OQBranching with OQ

• Use it for iteration terminationDo { // outer loop on CPU

BeginOcclusionQuery { // Render with fragment program that

// discards fragments that satisfy

// termination criteria} EndQuery

} While query returns > 0

• Can be used for subdivision techniques– Demo later

Domain DecompositionDomain Decomposition

• Avoid branches where outcome is fixed– One region is always true, another false– Separate FPs for each region, no branches

• Example: boundaries

Z-CullZ-Cull

• In early pass, modify depth buffer– Write depth=0 for pixels that should not be modified by later

passes– Write depth=1 for rest

• Subsequent passes– Enable depth test (GL_LESS)– Draw full-screen quad at z=0.5– Only pixels with previous depth=1 will be processed

• Can also use early stencil test

• Not available on NV3X– Depth replace disables ZCull

Pre-computationPre-computation

• Pre-compute anything that will not change every iteration!

• Example: arbitrary boundaries– When user draws boundaries, compute texture containing

boundary info for cells– Reuse that texture until boundaries modified– Combine with Z-cull for higher performance!

High-level Shading LanguagesHigh-level Shading LanguagesHigh-level Shading LanguagesHigh-level Shading Languages

• Cg, HLSL, & GLSlang– Cg:

• http://www.nvidia.com/cg

– HLSL:• http://msdn.microsoft.com/library/default.asp?url=/library/en-us/

directx9_c/directx/graphics/reference/highlevellanguageshaders.asp

– GLSlang: • http://www.3dlabs.com/support/developer/ogl2/whitepapers/

index.html

float3 cPlastic = Cd * (cAmbi + cDiff) + Cs*cSpec

….MULR R0.xyz, R0.xxx, R4.xyxx;MOVR R5.xyz, -R0.xyzz;DP3R R3.x, R0.xyzz, R3.xyzz;…

GPGPU LanguagesGPGPU Languages

• Why do we want them?

– Make programming GPUs easier!• Don’t need to know OpenGL, DirectX, or ATI/NV extensions• Simplify common operations• Focus on the algorithm, not on the implementation

• Sh

– University of Waterloo– http://libsh.sourceforge.net– http://www.cgl.uwaterloo.ca

• Brook

– Stanford University– http://brook.sourceforge.net– http://graphics.stanford.edu

Brook: general purpose streaming languageBrook: general purpose streaming language

• stream programming model– enforce data parallel computing

• streams– encourage arithmetic intensity

• kernels

• C with stream extensions

• GPU = streaming coprocessor

system outlinesystem outline

.br

Brook source files

brcc

source to source compiler

brt

Brook run-time library

Brook language

streamsBrook language

streams

• streams– collection of records requiring similar computation

• particle positions, voxels, FEM cell, …

float3 positions<200>;

float3 velocityfield<100,100,100>;

– similar to arrays, but…• index operations disallowed: position[i]• read/write stream operators

streamRead (positions, p_ptr);

streamWrite (velocityfield, v_ptr);

– encourage data parallelism

Brook language

kernelsBrook language

kernels

• kernels

– functions applied to streams• similar to for_all construct

kernel void foo (float a<>, float b<>, out float result<>) {

result = a + b;}

float a<100>;float b<100>;float c<100>;

foo(a,b,c);for (i=0; i<100; i++)

c[i] = a[i]+b[i];

Brook language


kernels

• kernels

– functions applied to streams• similar to for_all construct

kernel void foo (float a<>, float b<>, out float result<>) {

result = a + b;}

– no dependencies between stream elements– encourage high arithmetic intensity

Brook language


kernels

• kernels arguments– input/output streams– constant parameters– gather streams– iterator streams

kernel void foo (float a<>, float b<>, float t, float array[], iter float n<>, out float result<>) {

result = array[a] + t*b + n;}

float a<100>;float b<100>;float c<100>;float array<25>iter float n<100> = iter(0, 10);

foo(a,b,3.2f,array,n,c);

Brook language


kernels• ray triangle intersection

kernel void krnIntersectTriangle(Ray ray<>, Triangle tris[], RayState oldraystate<>, GridTrilist trilist[], out Hit candidatehit<>) { float idx, det, inv_det; float3 edge1, edge2, pvec, tvec, qvec; if(oldraystate.state.y > 0) { idx = trilist[oldraystate.state.w].trinum; edge1 = tris[idx].v1 - tris[idx].v0; edge2 = tris[idx].v2 - tris[idx].v0; pvec = cross(ray.d, edge2); det = dot(edge1, pvec); inv_det = 1.0f/det; tvec = ray.o - tris[idx].v0; candidatehit.data.y = dot( tvec, pvec ) * inv_det; qvec = cross( tvec, edge1 ); candidatehit.data.z = dot( ray.d, qvec ) * inv_det; candidatehit.data.x = dot( edge2, qvec ) * inv_det; candidatehit.data.w = idx; } else { candidatehit.data = float4(0,0,0,-1); }}

Brook language

reductionsBrook language

reductions

• reductions

– compute single value from a stream

reduce void sum (float a<>, reduce float r<>)

r += a;}

float a<100>;float r;

sum(a,r);r = a[0];for (int i=1; i<100; i++) r += a[i];

Brook language


reductions

• reductions – associative operations only

(a+b)+c = a+(b+c)• sum, multiply, max, min, OR, AND, XOR• matrix multiply

Brook language


reductions

• multi-dimension reductions

– stream “shape” differences resolved by reduce function

reduce void sum (float a<>, reduce float r<>)

r += a;}

float a<20>;float r<5>;

sum(a,r); for (int i=0; i<5; i++) r[i] = a[i*4]; for (int j=1; j<4; j++) r[i] += a[i*4 + j];

Brook language

stream repeat & strideBrook language

stream repeat & stride

• kernel arguments of different shape

– resolved by repeat and stride

kernel void foo (float a<>, float b<>, out float result<>);

float a<20>;float b<5>;float c<10>;

foo(a,b,c);

foo(a[0], b[0], c[0])foo(a[2], b[0], c[1])foo(a[4], b[1], c[2])foo(a[6], b[1], c[3])foo(a[8], b[2], c[4])foo(a[10], b[2], c[5])foo(a[12], b[3], c[6])foo(a[14], b[3], c[7])foo(a[16], b[4], c[8])foo(a[18], b[4], c[9])

Brook language

matrix vector multiplyBrook language

matrix vector multiply

kernel void mul (float a<>, float b<>, out float result<>)

{ result = a*b; }

reduce void sum (float a<>, reduce float result<>)

{ result += a; }

float matrix<20,10>;float vector<1, 10>;float tempmv<20,10>;float result<20, 1>;

mul(matrix,vector,tempmv);sum(tempmv,result);

MV

V

V

T=

Brook language

matrix vector multiplyBrook language

matrix vector multiply

kernel void mul (float a<>, float b<>, out float result<>)

{ result = a*b; }

reduce void sum (float a<>, reduce float result<>)

{ result += a; }

float matrix<20,10>;float vector<1, 10>;float tempmv<20,10>;float result<20, 1>;

mul(matrix,vector,tempmv);sum(tempmv,result);

RT sum

Running BrookRunning Brook

• Compiling .br filesBrook CG CompilerVersion: 0.2 Built: Apr 24 2004, 18:11:59brcc [-hvndktyAN] [-o prefix] [-w workspace] [-p shader ] foo.br -h help (print this message) -v verbose (print intermediate generated code) -n no codegen (just parse and reemit the input) -d debug (print cTool internal state) -k keep generated fragment program (in foo.cg) -t disable kernel call type checking -y emit code for ATI 4-output hardware -A enable address virtualization (experimental) -N deny support for kernels calling other kernels -o prefix prefix prepended to all output files -w workspace workspace size (16 - 2048, default 1024) -p shader cpu / ps20 / fp30 / cpumt (can specify multiple)

Running BrookRunning Brook

• BRT_RUNTIME selects platform

CPU Backend: BRT_RUNTIME = cpu

CPU Multithreaded Backend: BRT_RUNTIME = cpumt

NVIDIA NV30 Backend: BRT_RUNTIME = nv30

OpenGL ARB Backend: BRT_RUNTIME = arb

DirectX9 Backend: BRT_RUNTIME = dx9

RuntimeRuntime

• accessing stream data for graphics aps

– Brook runtime api available in c++ code– autogenerated .hpp files for brook code

brook::initialize( "dx9", (void*)device );

// Create streams

fluidStream0 = stream::create<float4>( kFluidSize, kFluidSize );

normalStream = stream::create<float3>( kFluidSize, kFluidSize );

// Get a handle to the texture being used by

// the normal stream as a backing store

normalTexture = (IDirect3DTexture9*)

normalStream->getIndexedFieldRenderData(0);

// Call the simulation kernel

simulationKernel( fluidStream0, fluidStream0, controlConstant,

fluidStream1 );

ApplicationsApplications

• Includes lots of sample applications– Ray-tracer– FFT– Image segmentation– Linear algebra

Brook performanceBrook performance

2-3x faster than CPU implementation

compared against 3GHz P4:• Intel Math Library• FFTW• Custom cached-blocked segment C code

GPUs still lose against SSEcache friendly code.Super-optimizations

• ATLAS• FFTW

ATI Radeon 9800 XT

NVIDIA GeForce 6800

Brook for GPUsBrook for GPUs

• Release v0.3 available on Sourceforge

• Project Page

– http://graphics.stanford.edu/projects/brook

• Source

– http://www.sourceforge.net/projects/brook

• Over 4K downloads!

• Brook for GPUs: Stream Computing on Graphics Hardware

– Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, Pat Hanrahan

Fly-fishing fly images from The English Fly Fishing Shop

GPGPU ExamplesGPGPU Examples

• Fluids

• Reaction/Diffusion

• Multigrid

• Tone mapping

general-purpose computation on graphics hardware david luebke university of virginia

Documents

computational power

texture slide

gpu pipeline

innovation slide

vertex processor multiple

gpu fundamentals

programmable pixel processor

flexibility of gpus