how gpus work - nvidia · ics processing unit (gpu) dedicated to providing a high-performance,...

96 Computer

H O W T H I N G S W O R K

I n the early 1990s, ubiquitousinteractive 3D graphics was stillthe stuff of science fiction. By theend of the decade, nearly everynew computer contained a graph-

ics processing unit (GPU) dedicated toproviding a high-performance, visu-ally rich, interactive 3D experience.

This dramatic shift was the in-evitable consequence of consumerdemand for videogames, advances inmanufacturing technology, and theexploitation of the inherent paral-lelism in the feed-forward graphicspipeline. Today, the raw computa-tional power of a GPU dwarfs that ofthe most powerful CPU, and the gap issteadily widening.

Furthermore, GPUs have movedaway from the traditional fixed-func-tion 3D graphics pipeline toward a flexible general-purpose compu-tational engine. Today, GPUs can implement many parallel algorithmsdirectly using graphics hardware.Well-suited algorithms that leverageall the underlying computationalhorsepower often achieve tremendousspeedups. Truly, the GPU is the firstwidely deployed commodity desktopparallel computer.

THE GRAPHICS PIPELINEThe task of any 3D graphics system

is to synthesize an image from adescription of a scene—60 times persecond for real-time graphics such asvideogames. This scene contains thegeometric primitives to be viewed aswell as descriptions of the lights illu-minating the scene, the way that eachobject reflects light, and the viewer’sposition and orientation.

GPU designers traditionally haveexpressed this image-synthesis processas a hardware pipeline of specializedstages. Here, we provide a high-leveloverview of the classic graphicspipeline; our goal is to highlight thoseaspects of the real-time rendering cal-culation that allow graphics applica-tion developers to exploit modernGPUs as general-purpose parallelcomputation engines.

Pipeline inputMost real-time graphics systems

assume that everything is made of tri-angles, and they first carve up any morecomplex shapes, such as quadrilateralsor curved surface patches, into trian-gles. The developer uses a computergraphics library (such as OpenGL or

Direct3D) to provide each triangle tothe graphics pipeline one vertex at atime; the GPU assembles vertices intotriangles as needed.

Model transformationsA GPU can specify each logical

object in a scene in its own locallydefined coordinate system, which isconvenient for objects that are natu-rally defined hierarchically. This con-venience comes at a price: beforerendering, the GPU must first trans-form all objects into a common coor-dinate system. To ensure that trianglesaren’t warped or twisted into curvedshapes, this transformation is limitedto simple affine operations such asrotations, translations, scalings, andthe like.

As the “Homogeneous Coordinates”sidebar explains, by representing eachvertex in homogeneous coordinates,the graphics system can perform theentire hierarchy of transformationssimultaneously with a single matrix-vector multiply. The need for efficienthardware to perform floating-pointvector arithmetic for millions of ver-tices each second has helped drive theGPU parallel-computing revolution.

The output of this stage of thepipeline is a stream of triangles, allexpressed in a common 3D coordinatesystem in which the viewer is locatedat the origin, and the direction of viewis aligned with the z-axis.

LightingOnce each triangle is in a global

coordinate system, the GPU can com-pute its color based on the lights in thescene. As an example, we describe thecalculations for a single-point lightsource (imagine a very small lightbulb).The GPU handles multiple lights bysumming the contributions of eachindividual light. The traditional graph-ics pipeline supports the Phong light-ing equation (B-T. Phong, “Illumina-tion for Computer-Generated Images,”Comm. ACM, June 1975, pp. 311-317), a phenomenological appearancemodel that approximates the look ofplastic. These materials combine a dulldiffuse base with a shiny specular high-

How GPUs Work David Luebke, NVIDIA ResearchGreg Humphreys, University of Virginia

GPUs have moved away from

the traditional fixed-function

3D graphics pipeline toward

a flexible general-purpose

computational engine.

r2How.qxp 23/1/07 12:44 PM Page 96

light. The Phong lighting equationgives the output color C = KdLi(N · L)+ KsLi(R · V)s.

Table 1 defines each term in theequation. The mathematics here isn’tas important as the computation’sstructure; to evaluate this equationefficiently, GPUs must again operatedirectly on vectors. In this case, werepeatedly evaluate the dot product oftwo vectors, performing a four-com-ponent multiply-and-add operation.

Camera simulationThe graphics pipeline next projects

each colored 3D triangle onto the vir-tual camera’s film plane. Like themodel transformations, the GPU doesthis using matrix-vector multiplication,again leveraging efficient vector opera-tions in hardware. This stage’s outputis a stream of triangles in screen coor-dinates, ready to be turned into pixels.

RasterizationEach visible screen-space triangle

overlaps some pixels on the display;determining these pixels is called ras-terization. GPU designers have incor-porated many rasterizatiom algo-rithms over the years, which all ex-ploit one crucial observation: Eachpixel can be treated independentlyfrom all other pixels. Therefore, themachine can handle all pixels in par-allel—indeed, some exotic machineshave had a processor for each pixel.This inherent independence has ledGPU designers to build increasinglyparallel sets of pipelines.

TexturingThe actual color of each pixel can

be taken directly from the lighting cal-culations, but for added realism,images called textures are oftendraped over the geometry to give theillusion of detail. GPUs store these tex-tures in high-speed memory, whicheach pixel calculation must access todetermine or modify that pixel’s color.

In practice, the GPU might requiremultiple texture accesses per pixel tomitigate visual artifacts that can resultwhen textures appear either smalleror larger on screen than their native

resolution. Because the access patternto texture memory is typically veryregular (nearby pixels tend to accessnearby texture image locations), spe-cialized cache designs help hide thelatency of memory accesses.

Hidden surfacesIn most scenes, some objects

obscure other objects. If each pixelwere simply written to display mem-ory, the most recently submitted tri-angle would appear to be in front.Thus, correct hidden surface removalwould require sorting all trianglesfrom back to front for each view, anexpensive operation that isn’t evenalways possible for all scenes.

All modern GPUs provide a depthbuffer, a region of memory that storesthe distance from each pixel to theviewer. Before writing to the display,the GPU compares a pixel’s distance tothe distance of the pixel that’s alreadypresent, and it updates the displaymemory only if the new pixel is closer.

THE GRAPHICS PIPELINE,EVOLVED

GPUs have evolved from a hardwiredimplementation of the graphics pipeline

to a programmable computational sub-strate that can support it. Fixed-func-tion units for transforming vertices andtexturing pixels have been subsumed bya unified grid of processors, or shaders,that can perform these tasks and muchmore. This evolution has taken placeover several generations by graduallyreplacing individual pipeline stages with increasingly programmable units.For example, the NVIDIA GeForce 3,launched in February 2001, introducedprogrammable vertex shaders. Theseshaders provide units that the pro-grammer can use for performingmatrix-vector multiplication, exponen-tiation, and square root calculations, as

February 2007 97

Homogeneous Coordinates

Points in three dimensions are typically represented as a triple (x,y,z). Incomputer graphics, however, it’s frequently useful to add a fourth coordinate,w, to the point representation. To convert a point to this new representation,we set w = 1. To recover the original point, we apply the transformation(x,y,z,w) —> (x/w, y/w, z/w).

Although at first glance this might seem like needless complexity, it has sev-eral significant advantages. As a simple example, we can use the otherwiseundefined point (x,y,z,0) to represent the direction vector (x,y,z). With this uni-fied representation for points and vectors in place, we can also perform severaluseful transformations such as simple matrix-vector multiplies that would oth-erwise be impossible. For example, the multiplication

can accomplish translation by an amount Dx, Dy, Dz.Furthermore, these matrices can encode useful nonlinear transformations

such as perspective foreshortening.

1 0 0 Δx

0 1 0 Δy

0 0 1 Δz

0 0 0 1

⎡

⎣

⎢⎢⎢⎢

⎤

⎦

⎥⎥⎥⎥

x

y

z

w

⎡

⎣

⎢⎢⎢⎢

⎤

⎦

⎥⎥⎥⎥

Table 1. Phong lighting equation terms.

Term Meaning

Kd Diffuse color Li Light color N Surface normal L Vector to light Ks Specular color R Reflected light vector V Vector to camera s “Shininess”

r2How.qxp 23/1/07 12:44 PM Page 97

well as a short default program thatuses these units to perform vertex trans-formation and lighting.

GeForce 3 also introduced limitedreconfigurability into pixel processing,

exposing the texturing hardware’sfunctionality as a set of register com-biners that could achieve novel visualeffects such as the “soap-bubble” lookdemonstrated in Figure 1. Subsequent

GPUs introduced increased flexibility,adding support for longer programs,more registers, and control-flow prim-itives such as branches, loops, andsubroutines.

The ATI Radeon 9700 (July 2002)and NVIDIA GeForce FX (January2003) replaced the often awkward reg-ister combiners with fully program-mable pixel shaders. NVIDIA’s latestchip, the GeForce 8800 (November2006), adds programmability to theprimitive assembly stage, allowingdevelopers to control how they con-struct triangles from transformed ver-tices. As Figure 2 shows, modernGPUs achieve stunning visual realism.

Increases in precision have accom-panied increases in programmability.The traditional graphics pipeline pro-vided only 8-bit integers per colorchannel, allowing values ranging from0 to 255. The ATI Radeon 9700increased the representable range ofcolor to 24-bit floating point, andNVIDIA’s GeForce FX followed withboth 16-bit and 32-bit floating point.Both vendors have announced plansto support 64-bit double-precisionfloating point in upcoming chips.

To keep up with the relentlessdemand for graphics performance,GPUs have aggressively embracedparallel design. GPUs have long usedfour-wide vector registers much likeIntel’s Streaming SIMD Extensions(SSE) instruction sets now provide onIntel CPUs. The number of such four-wide processors executing in parallelhas increased as well, from only fouron GeForce FX to 16 on GeForce6800 (April 2004) to 24 on GeForce7800 (May 2005). The GeForce 8800actually includes 128 scalar shaderprocessors that also run on a specialshader clock at 2.5 times the clockrate (relative to pixel output) of for-mer chips, so the computational per-formance might be considered equiv-alent to 128 ¥ 2.5/4 = 80 four-widepixel shaders.

UNIFIED SHADERSThe latest step in the evolution from

hardwired pipeline to flexible compu-tational fabric is the introduction of

98 Computer


Figure 1. Programmable shading.The introduction of programmable shading in 2001 led

to several visual effects not previously possible, such as this simulation of refractive

chromatic dispersion for a “soap bubble” effect.

Figure 2. Unprecedented visual realism. Modern GPUs can use programmable shading to

achieve near-cinematic realism, as this interactive demonstration shows, featuring

actress Adrianne Curry on an NVIDIA GeForce 8800 GTX.

r2How.qxp 23/1/07 12:44 PM Page 98

February 2007 99

extremely high arithmetic throughputand streaming memory bandwidthbut tolerates considerable latency inan individual computation since finalimages are only displayed every 16milliseconds. These workload charac-teristics have shaped the underlyingGPU architecture: Whereas CPUs areoptimized for low latency, GPUs areoptimized for high throughput.

The raw computational horsepowerof GPUs is staggering: A single GeForce8800 chip achieves a sustained 330 bil-lion floating-point operations per sec-ond (Gflops) on simple benchmarks(http://graphics.stanford.edu/projects/gpubench). The ever-increasing power,programmability, and precision ofGPUs have motivated a great deal ofresearch on general-purpose compu-tation on graphics hardware—GPGPUfor short. GPGPU researchers anddevelopers use the GPU as a compu-tational coprocessor rather than as animage-synthesis device.

The GPU’s specialized architectureisn’t well suited to every algorithm.Many applications are inherently ser-ial and are characterized by incoher-ent and unpredictable memory access.Nonetheless, many important prob-lems require significant computational

unified shaders. Unified shaders werefirst realized in the ATI Xenos chip forthe Xbox 360 game console, andNVIDIA introduced them to PCs withthe GeForce 8800 chip.

Instead of separate custom proces-sors for vertex shaders, geometryshaders, and pixel shaders, a unifiedshader architecture provides one largegrid of data-parallel floating-pointprocessors general enough to run allthese shader workloads. As Figure 3shows, vertices, triangles, and pixelsrecirculate through the grid ratherthan flowing through a pipeline withstages of fixed width.

This configuration leads to betteroverall utilization because demand forthe various shaders varies greatlybetween applications, and indeed evenwithin a single frame of one applica-tion. For example, a videogame mightbegin an image by using large trian-gles to draw the sky and distant ter-rain. This quickly saturates the pixelshaders in a traditional pipeline, whileleaving the vertex shaders mostly idle.One millisecond later, the game mightuse highly detailed geometry to drawintricate characters and objects. Thisbehavior will swamp the vertex shadersand leave the pixel shaders mostly idle.

These dramatic oscillations inresource demands in a single imagepresent a load-balancing nightmarefor the game designer and can alsovary unpredictably as the players’viewpoint and actions change. A uni-fied shader architecture, on the otherhand, can allocate a varying percent-age of its pool of processors to eachshader type.

For this example, a GeForce 8800might use 90 percent of its 128 proces-sors as pixel shaders and 10 percentas vertex shaders while drawing thesky, then reverse that ratio whendrawing a distant character’s geome-try. The net result is a flexible parallelarchitecture that improves GPU uti-lization and provides much greaterflexibility for game designers.

GPGPUThe highly parallel workload of

real-time computer graphics demands

resources, mapping well to the GPU’smany-core arithmetic intensity, orthey require streaming through largequantities of data, mapping well to theGPU’s streaming memory subsystem.

Porting a judiciously chosen algo-rithm to the GPU often producesspeedups of five to 20 times overmature, optimized CPU codes runningon state-of-the-art CPUs, and speed-ups of more than 100 times have beenreported for some algorithms thatmap especially well.

Notable GPGPU success storiesinclude Stanford University’s Folding@home project, which uses spare cyclesthat users around the world donate tostudy protein folding (http://folding.stanford.edu). A new GPU-acceleratedFolding@home client contributed28,000 Gflops in the month after itsOctober 2006 release—more than 18percent of the total Gflops that CPUclients contributed running on Micro-soft Windows since October 2000.

In another GPGPU success story,researchers at the University of NorthCarolina and Microsoft used GPU-based code to win the 2006 IndyPennySort category of the TeraSortcompetition, a sorting benchmarktesting price/performance for database

GPU

Vertexprograms

3D geometricprimitives

Geometryprograms

Programmable unified processors

Hidden surfaceremoval

Computeprograms

GPU memory (DRAM)Final image

Rasterization

Pixel programs

Figure 3. Graphics pipeline evolution.The NVIDIA GeForce 8800 GPU replaces the

traditional graphics pipeline with a unified shader architecture in which vertices,

triangles, and pixels recirculate through a set of programmable processors.The flexibility

and computational power of these processors invites their use for general-purpose com-

puting tasks.

r2How.qxp 23/1/07 12:44 PM Page 99

operations (http://gamma.cs.unc.edu/GPUTERASORT). Closer to home forthe GPU business, the HavokFX prod-uct uses GPGPU techniques to accel-erate tenfold the physics calculationsused to add realistic behavior to objects in computer games (www.havok.com).

M odern GPUs could be seen asthe first generation of com-modity data-parallel proces-

sors. Their tremendous computationalcapacity and rapid growth curve, faroutstripping traditional CPUs, high-light the advantages of domain-spe-cialized data-parallel computing.

We can expect increased program-mability and generality from future

GPU architectures, but not withoutlimit; neither vendors nor users wantto sacrifice the specialized architec-ture that made GPUs successful in thefirst place. Today, GPU developersneed new high-level programmingmodels for massively multithreadedparallel computation, a problem soonto impact multicore CPU vendors aswell.

Can GPU vendors, graphics devel-opers, and the GPGPU research com-munity build on their success withcommodity parallel computing totranscend their computer graphicsroots and develop the computationalidioms, techniques, and frameworksfor the desktop parallel computingenvironment of the future? ■

David Luebke is a research scientist at NVIDIA Research. Contact him [email protected].

Greg Humphreys is a faculty member inthe Computer Science Department at theUniversity of Virginia. Contact him [email protected].

100 Computer


Computer welcomes your submis-sions to this bimonthly column. Foradditional information, or tosuggest topics that you would liketo see explained, contact columneditor Alf Weaver at [email protected].

To submit a manuscript for peer review, see Computer’s author guidelines:

www.computer.org/computer/author.htm

Computer

magazine

looks ahead

to future

technologies

• Computer, the flagship publication of the IEEE ComputerSociety, publishes peer-reviewed technical content that covers all aspects of computer science, computerengineering, technology, and applications.

• Articles selected for publication in Computer are edited to enhance readability for the nearly 100,000 computingprofessionals who receive this monthly magazine.

• Readers depend on Computer to provide current, unbiased, thoroughly researched information on the newest directions in computing technology.

Welcomes Your Contribution

r2How.qxp 23/1/07 12:44 PM Page 100

how gpus work - nvidia · ics processing unit (gpu) dedicated to providing a high-performance,...

Documents