evolution of the modern graphics architectures with a focus on gpus | turing100@persistent

Evolution of Graphics Architectures

with a focus on GPUs

Sanjiv Satoor

Senior Manager, NVIDIA

First Generation – Wireframe

Vertex: transform, clip, and project

Rasterization: lines only

Pixel: no pixels! calligraphic display

Dates: prior to 1987
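The vertex stage listed on this slide (transform, clip, project) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular display system implemented it: the PERSPECTIVE matrix, the helper names, and the simple point-rejection test (standing in for true line clipping) are all hypothetical.

```python
# Hypothetical perspective matrix: camera at the origin looking down -z,
# so the clip-space w component becomes -z.
PERSPECTIVE = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, -1.0, 0.0],
]


def transform(matrix, vertex):
    """Multiply a 4x4 row-major matrix by a 4-component vertex."""
    return [sum(matrix[r][c] * vertex[c] for c in range(4)) for r in range(4)]


def inside_clip_volume(clip):
    """Accept a point only if -w <= x, y, z <= w with w > 0."""
    x, y, z, w = clip
    return w > 0 and all(-w <= component <= w for component in (x, y, z))


def project(clip):
    """Perspective divide: clip space to 2D image-plane coordinates."""
    x, y, _z, w = clip
    return (x / w, y / w)


def process_vertex(vertex):
    """Transform, clip (reject), and project one homogeneous vertex."""
    clip = transform(PERSPECTIVE, vertex)
    if not inside_clip_volume(clip):
        return None
    return project(clip)
```

With these assumptions, a vertex in front of the camera such as [1.0, 2.0, -4.0, 1.0] comes back as 2D coordinates, while a vertex behind the camera is rejected and yields None.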

Storage Tube Terminals

CRTs with analog charge “persistence”

Accumulate a detailed static image by writing points or line segments

Erase the stored image to start a new one

Early Framebuffers

By the mid-1970s one could afford framebuffers with a few bits per pixel at modest resolution

“A Random Access Video Frame Buffer”, Kajiya, Sutherland, Cheadle, 1975

Vector displays were still better for fine position detail

Framebuffers were used to emulate storage tube vector terminals on a raster display

Second Generation – Shaded Solids

Vertex: lighting

Rasterization: filled polygons

Pixel: depth buffer, color blending

Dates: 1987 - 1992
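The pixel operations named on this slide, depth buffering and color blending, can be illustrated with a small Python sketch. The function names, the "smaller z is closer" convention, and the source-alpha blend are assumptions chosen for the example; real hardware of the period supported other depth tests and blend modes.

```python
def make_buffers(width, height, clear_color=(0.0, 0.0, 0.0)):
    """Create a depth buffer cleared to 'infinitely far' and a color buffer."""
    depth = [[float("inf")] * width for _ in range(height)]
    color = [[clear_color] * width for _ in range(height)]
    return depth, color


def blend(src, dst, alpha):
    """Source-alpha blend: alpha * src + (1 - alpha) * dst, per channel."""
    return tuple(alpha * s + (1.0 - alpha) * d for s, d in zip(src, dst))


def write_fragment(depth, color, x, y, z, rgb, alpha=1.0):
    """Depth-test one fragment; on success update depth and blend its color."""
    if z >= depth[y][x]:
        return False  # hidden behind what is already stored
    depth[y][x] = z
    color[y][x] = blend(rgb, color[y][x], alpha)
    return True
```

A fragment that is farther away than the stored depth is discarded, so polygons can be drawn in any order and the nearest surface still wins.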

Third Generation – Texture Mapping

Vertex: more, faster

Rasterization: more, faster

Pixel: texture filtering, antialiasing

Dates: 1992 - 2001
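Texture filtering, listed above as the defining pixel feature of this generation, is commonly introduced through bilinear filtering. The sketch below assumes a texture stored as a 2D list of scalar texels, coordinates given in texel units, and clamp-to-edge addressing; these conventions are chosen for the example and are not taken from the slides.

```python
import math


def bilinear_sample(texture, u, v):
    """Sample texture[row][col] at fractional texel coordinates (u, v)."""
    height, width = len(texture), len(texture[0])
    # Clamp-to-edge addressing (an assumption for this sketch).
    u = min(max(u, 0.0), width - 1.0)
    v = min(max(v, 0.0), height - 1.0)
    x0, y0 = int(math.floor(u)), int(math.floor(v))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = u - x0, v - y0
    # Interpolate horizontally on the two rows, then vertically between them.
    top = (1.0 - fx) * texture[y0][x0] + fx * texture[y0][x1]
    bottom = (1.0 - fx) * texture[y1][x0] + fx * texture[y1][x1]
    return (1.0 - fy) * top + fy * bottom
```

Sampling exactly on a texel returns that texel; sampling between four texels returns their weighted average, which is what removes the blocky look of nearest-texel lookup.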

IRIS 3000 Graphics Cards

Geometry Engines & Rasterizer

4 bit / pixel Framebuffer (2 instances)

1990s

Desktop 3D workstations under $5000

Single-board, multi-chip graphics subsystems

Rise of 3D on the PC

40-company free-for-all until intense competition knocked out all but a few players

Many were “decelerators”, and easy to beat

Single-chip GPUs

Interesting hardware experimentation

PCs would take over the workstation business

Interesting consoles

3DO, Nintendo, Sega, Sony

Timeline axis: 1998, 1999, 2000, 2001, 2002, 2003, 2004

DirectX 5: Riva 128

DirectX 6: multitexturing; Riva TNT

DirectX 7: T&L, TextureStageState; GeForce 256

DirectX 8: SM 1.x; GeForce 3, Cg

DirectX 9: SM 2.0; GeForce FX

DirectX 9.0c: SM 3.0; GeForce 6

Games shown: Quake 3, Giants, Halo, Far Cry, UE3, Half-Life

All images © their respective owners

Moving toward programmability

RIVA 128: 3M transistors

GeForce 256: 23M transistors

GeForce 3: 60M transistors

GeForce FX: 250M transistors

GeForce 8800: 681M transistors

“Kepler”: 7B transistors

Timeline axis: 1995, 2000, 2001, 2003, 2006, 2012

Eras: fixed function, programmable shaders, CUDA

Evolution of GPUs

Copyright © NVIDIA Corporation 2006 Unreal © Epic

Image panels: No Lighting, Per-Vertex Lighting, Per-Pixel Lighting

Lush, Rich Worlds

Stunning Graphics Realism

Core of the Definitive Gaming Platform

Incredible Physics Effects

Hellgate: London © 2005-2006 Flagship Studios, Inc. Licensed by NAMCO BANDAI Games America, Inc.

Crysis © 2006 Crytek / Electronic Arts

Full Spectrum Warrior: Ten Hammers © 2006 Pandemic Studios, LLC. All rights reserved. © 2006 THQ Inc. All rights reserved.

Traditional Fixed-Function Graphics Pipeline

Pipeline blocks: memory interface, vertex processing, triangle setup, pixel processing, raster operations

Vertex processing: T&L evolved to vertex shading

Triangle setup: triangle, point, line setup

Pixel processing: flat shading, texturing, eventually pixel shading

Raster operations: blending, Z-buffering, antialiasing

Wider and faster over the years

Processor per function

Migration of functionality to GPU hardware

GeForce3/DX8 Pixel Shading Pipeline

Programmable Shaders: GeForceFX (2002)

Vertex and fragment operations specified in a small (macro) assembly language

User-specified mapping of input data to operations

Limited ability to use intermediate computed values to index input data (textures and vertex uniforms)

Diagram: Input 0, Input 1, Input 2 feed OP, which writes Temp 0, Temp 1, Temp 2

ADDR R0.xyz, eyePosition.xyzx, -f[TEX0].xyzx;

DP3R R0.w, R0.xyzx, R0.xyzx;

RSQR R0.w, R0.w;

MULR R0.xyz, R0.w, R0.xyzx;

ADDR R1.xyz, lightPosition.xyzx, -f[TEX0].xyzx;

DP3R R0.w, R1.xyzx, R1.xyzx;

RSQR R0.w, R0.w;

MADR R0.xyz, R0.w, R1.xyzx, R0.xyzx;

MULR R1.xyz, R0.w, R1.xyzx;

DP3R R0.w, R1.xyzx, f[TEX1].xyzx;

MAXR R0.w, R0.w, {0}.x;
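To make the assembly listing above easier to follow, here is a line-by-line Python transcription of the same arithmetic. It is a reading aid rather than the author's code. It assumes that f[TEX0] carries the surface position and f[TEX1] the surface normal, which the listing suggests but does not state; the helper names are hypothetical.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def add(a, b):
    return tuple(x + y for x, y in zip(a, b))


def scale(s, a):
    return tuple(s * x for x in a)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fragment_program(eye_position, light_position, tex0, tex1):
    """Mirror the eleven instructions; returns (R0.xyz, R0.w)."""
    r0 = sub(eye_position, tex0)       # ADDR: vector from surface to eye
    r0_w = dot(r0, r0)                 # DP3R: squared length
    r0_w = 1.0 / math.sqrt(r0_w)       # RSQR: reciprocal square root
    r0 = scale(r0_w, r0)               # MULR: normalized view vector
    r1 = sub(light_position, tex0)     # ADDR: vector from surface to light
    r0_w = dot(r1, r1)                 # DP3R: squared length
    r0_w = 1.0 / math.sqrt(r0_w)       # RSQR: reciprocal square root
    r0 = add(scale(r0_w, r1), r0)      # MADR: unit light + unit view (unnormalized half vector)
    r1 = scale(r0_w, r1)               # MULR: normalized light vector
    r0_w = dot(r1, tex1)               # DP3R: light vector dot assumed normal
    r0_w = max(r0_w, 0.0)              # MAXR: clamp the diffuse term at zero
    return r0, r0_w
```

Under these assumptions the listing normalizes the view and light vectors, forms their sum for a later specular term, and leaves a clamped diffuse factor in R0.w.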

GeForce 6 Architecture

Unified Hardware Shader Design

Block diagram: Host, Input Assembler, Setup / Rstr / ZCull, Vtx Thread Issue, Geom Thread Issue, Pixel Thread Issue, Thread Processor; 8 clusters, each with SP SP, TF, L1; 6 partitions, each with L2, FB

GeForce 8 Architecture

Build the architecture around the processor

Millions of triangles, millions of pixels

Why are so many parallel operations needed?

Pipeline steps: input triangle, tessellate, transform vertices, projection, rasterize, shade

Diagram labels: camera, image plane
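A rough count shows why the workload is so parallel. The figures below are hypothetical and not from the slides: a 1920 x 1080 image refreshed 60 times per second, with one shading invocation per pixel.

```python
def shading_invocations_per_second(width, height, frames_per_second,
                                   samples_per_pixel=1):
    """Independent pixel-shading invocations needed per second."""
    return width * height * samples_per_pixel * frames_per_second


# Hypothetical display parameters, chosen only for illustration.
pixels_per_frame = 1920 * 1080
per_second = shading_invocations_per_second(1920, 1080, 60)
print(pixels_per_frame, per_second)
```

Each of those invocations can run independently of the others, which is what makes the problem a natural fit for many simple processors working side by side.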

GPU = more computational horsepower and bandwidth per watt

CPU: few complex processors

Optimized for single-threaded performance

GPU: many simple processors with minimal overhead

Slow single-threaded performance but massive overall throughput

GPU Architecture

Efficiency

Programmability

Performance

GPU Architecture: Two Main Components

Streaming Multiprocessors (SMs)

Perform the actual computations

Each SM has its own:

Control units, registers, execution pipelines, caches

Global memory

Analogous to RAM in a CPU server

Accessible by both GPU and CPU

Currently up to 6 GB per GPU

Bandwidth currently up to 250 GB/s
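The two global-memory figures above can be related with simple arithmetic. The sketch assumes the peak bandwidth is sustained for the whole transfer, which real workloads rarely achieve; the variable names are illustrative.

```python
capacity_gb = 6.0            # from the slide: up to 6 GB per GPU
bandwidth_gb_per_s = 250.0   # from the slide: up to 250 GB/s

# Time to read every byte of global memory once at peak bandwidth.
sweep_seconds = capacity_gb / bandwidth_gb_per_s
# How many full passes over memory fit into one second.
sweeps_per_second = bandwidth_gb_per_s / capacity_gb
print(sweep_seconds, sweeps_per_second)
```

At peak rate a full pass over the 6 GB takes about 24 milliseconds, so a kernel can touch all of global memory only about 40 times per second.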

Block diagram: HOST I/F, GigaThread, L2, 6 DRAM I/F blocks

KEPLER

The Fastest, Most Efficient GPU Ever Built

Kepler GK110 Architecture

7.1B Transistors

14 SMX units

3.95 TFLOP FP32

1.31 TFLOP FP64

250 GB/sec memory bandwidth

2688 cores

PCI Express Gen3
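The GK110 figures on this slide can be cross-checked against each other. The sketch assumes each core retires one fused multiply-add per cycle, counted as two floating-point operations; that convention is an assumption here, and the clock rate it implies is derived, not quoted from the slide.

```python
SMX_UNITS = 14
CORES = 2688
FP32_FLOPS = 3.95e12
FP64_FLOPS = 1.31e12
BANDWIDTH_BYTES_PER_S = 250e9

# Assumption: one fused multiply-add per core per cycle = 2 FLOPs.
FLOPS_PER_CORE_PER_CYCLE = 2

cores_per_smx = CORES // SMX_UNITS
implied_clock_hz = FP32_FLOPS / (CORES * FLOPS_PER_CORE_PER_CYCLE)
fp64_to_fp32 = FP64_FLOPS / FP32_FLOPS
bytes_per_fp32_flop = BANDWIDTH_BYTES_PER_S / FP32_FLOPS
print(cores_per_smx, implied_clock_hz, fp64_to_fp32, bytes_per_fp32_flop)
```

Under that assumption the numbers work out to 192 cores per SMX, an implied clock of roughly 735 MHz, FP64 throughput at about one third of FP32, and about 0.06 bytes of memory bandwidth per FP32 operation.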

World’s #1 Supercomputer

With a peak performance of 27 petaflops, the Titan supercomputer at Oak Ridge National Laboratory is the world’s fastest. Its 18,688 GPUs provide 90% of the machine’s computing power.
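The 90% figure can be checked with the FP64 rate quoted on the GK110 slide. The sketch assumes every Titan GPU delivers that same 1.31 TFLOP peak and that the 27 petaflop total is a double-precision peak; both are assumptions made for the check.

```python
GPU_COUNT = 18688
FP64_TFLOPS_PER_GPU = 1.31   # assumption: same as the GK110 slide
SYSTEM_PEAK_PFLOPS = 27.0

gpu_pflops = GPU_COUNT * FP64_TFLOPS_PER_GPU / 1000.0
gpu_share = gpu_pflops / SYSTEM_PEAK_PFLOPS
print(gpu_pflops, gpu_share)
```

The GPUs come to roughly 24.5 petaflops, or about 91% of 27 petaflops, which is consistent with the 90% stated on the slide.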

The Graphics pipeline

Vertex and fragment processing are programmable

The programmer can write programs that are executed for every vertex as well as for every fragment

This allows fully customizable geometry and shading effects that go well beyond the generic look and feel of older 3D applications

Pipeline stages: host interface, vertex processing, triangle setup, pixel processing, memory interface
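The idea that the programmer supplies one program per vertex and one per fragment can be sketched as a toy pipeline in Python. Everything here is hypothetical: the draw_points function, the point-only "rasterizer", and the two example programs exist only to illustrate the division of labor and do not correspond to any real graphics API.

```python
import math


def draw_points(vertices, vertex_program, fragment_program, width, height):
    """Run user programs over points; returns {(x, y): color}."""
    framebuffer = {}
    for vertex in vertices:
        # The vertex program runs once per vertex.
        x, y, varying = vertex_program(vertex)
        px, py = math.floor(x), math.floor(y)
        # Trivial rasterizer: each on-screen point yields one fragment.
        if 0 <= px < width and 0 <= py < height:
            # The fragment program runs once per fragment.
            framebuffer[(px, py)] = fragment_program(varying)
    return framebuffer


def shift_right(vertex):
    """Example vertex program: move every vertex one pixel to the right."""
    x, y, color = vertex
    return x + 1, y, color


def invert(color):
    """Example fragment program: invert the interpolated color."""
    return tuple(1.0 - channel for channel in color)


image = draw_points(
    [(0, 0, (1.0, 0.0, 0.0)), (5, 5, (0.0, 0.0, 0.0))],
    shift_right, invert, width=4, height=4,
)
```

Swapping in different vertex or fragment functions changes the geometry or the shading without touching the fixed stages in between, which is the customization this slide describes.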

Thank you
