cs 354 gpu architecture
DESCRIPTION
CS 354 Computer Graphics; GPU Architecture; March 6, 2012TRANSCRIPT
CS 354GPU Architecture
Mark KilgardUniversity of TexasMarch 6, 2012
CS 354 2
Today’s material
In-class quiz Lecture topic
Architecture of Graphics Processing Units (GPUs)
Course work Homework #4 due today Review textbook reading
Chapter 5, 6, and 7
Project #2 on texturing, shading, & lighting is coming Remember: Midterm in-class on March 8
CS 354 3
My Office Hours
Tuesday, before class Painter (PAI) 5.35 8:45 a.m. to 9:15
Thursday, after class ACE 6.302 11:00 a.m. to 12
Randy’s office hours Monday & Wednesday 11 a.m. to 12:00 Painter (PAI) 5.33
CS 354 4
Last time, this time
Last lecture, we discussed Programmable shading Graphics hardware shading languages
This lecture How do GPUs work?
CS 354 5
Daily Quiz
1. Pick the best choice: Shade trees area) fractal trees with shadowsb) OpenGL commands c) hierarchical arrangements of shading computationsd) fractal patterns of all sorts
2. Name one general purpose programming language that GLSL borrows from.
3. Multiple choice: The GLSL standard has built-in data types fora) vectorsb) matricesc) texture samplersd) floating-point valuese) pointers to malloc’ed memoryf) a through eg) a through d
On a sheet of paper• Write your EID, name, and date• Write #1, #2, #3, #4 followed by its answer
CS 354 6
Key Trend in OpenGL Evolution
Direct3D follows the same trend Also reflects trend in GPU architecture
API and hardware co-evolving
Fixed-function Programmable
SimpleConfigurability
ComplexConfigurability
Shaders!
High-level languages
CS 354 7
Programming Shaders inside GPU
Multiple programmable domains within the GPU Can be programmed in high-level languages
Cg, HLSL, or OpenGL Shading Language (GLSL)
GeometryProgram
3D Applicationor Game
OpenGL API
GPUFront End
VertexAssembly
VertexShader
Clipping, Setup,and Rasterization
FragmentShader
Texture Fetch
RasterOperations
Framebuffer Access
Memory Interface
CPU – GPU Boundary
OpenGL 3.3
Attribute Fetch
PrimitiveAssembly
Parameter Buffer Readprogrammable
fixed-function
Legend
CS 354 8
Complex OpenGL Data Flow
CS 354 9
Six Years of GPU ArchitectureProduct New Features OpenGL
VersionDirect3D Version
2000 GeForce 256Hardware transform & lighting, configurable fixed-point shading, cube maps, texture compression, anisotropic texture filtering
1.3 DX7
2001 GeForce3
Programmable vertex transformation, 4 texture units, dependent textures, 3D textures, shadow maps, multisampling, occlusion queries
1.4 DX8
2002 GeForce4 Ti 4600 Early Z culling, dual-monitor 1.4 DX8.1
2003 GeForce FX
Vertex program branching, floating-point fragment programs, 16 texture units, limited floating-point textures, color and depth compression
1.5 DX9
2004 GeForce 6800 Ultra
Vertex textures, structured fragment branching, non-power-of-two textures, generalized floating-point textures, floating-point texture filtering and blending
2.0 DX9c
2005 GeForce 7800 GTX Transparency antialiasing 2.0 DX9c
CS 354 10
0
200
400
600
800
1,000
1,200
1,400
GeForce2GTS
GeForce3 GeForce4 Ti4600
GeForce FX GeForce6800 Ultra
GeForce7800 GTX
GeForce PeakVertex Processing Trends
Vertex units 1 1 2 3 6 8
Mill
ions
of
vert
ices
per
sec
ond
rate for trivial 4x4vertex transform
exceeds peaksetup rates—allowsexcess vertex processing
CS 354 11
0
20
40
60
80
100
120
140
160
180
200
GeForce2GTS
GeForce3 GeForce4 Ti4600
GeForce FX GeForce6800 Ultra
GeForce7800 GTX
Rawbandwidth
Effective rawbandwidthwithcompression
Expon.(Effective rawbandwidthwithcompression)Expon. (Rawbandwidth)
GeForce PeakMemory Bandwidth Trends
128-bit interface 256-bit interface
Gig
abyt
es p
er s
econ
d
CS 354 12
Effective GPUMemory Bandwidth
Compression schemes Lossless depth and color (when multisampling)
compression Lossy texture compression (S3TC / DXTC) Typically assumes 4:1 compression
Avoidance useless work Early killing of fragments (Z cull) Avoiding useless blending and texture fetches
Very clever memory controller designs Combining memory accesses for improved coherency Caches for texture fetches
CS 354 13
0
200
400
600
800
1,000
1,200
1,400
Coreclock
Memoryclock
GeForce Core and Memory Clock RatesGeForce Core and Memory Clock Rates
DDR memorytransition—memory rates double physical clock rate
Meg
aher
tz (
Mhz
)
CS 354 14
0
50
100
150
200
250
300
GeForce2GTS
GeForce3 GeForce4 Ti4600
GeForce FX GeForce6800 Ultra
GeForce7800 GTX
GeForce PeakTriangle Setup Trends
Mill
ions
of
tria
ngle
s pe
r se
cond
assumes 50%face culling
CS 354 15
0
2,000
4,000
6,000
8,000
10,000
12,000
GeForce2GTS
GeForce3 GeForce4 Ti4600
GeForce FX GeForce6800 Ultra
GeForce7800 GTX
GeForce PeakTexture Fetch Trends
Mill
ions
of
text
ure
fetc
hes
per
seco
nd
Texture units 2×4 2×4 2×4 2×4 16 24
assuming no texture cache misses
CS 354 16
0
2,000
4,000
6,000
8,000
10,000
12,000
14,000
16,000
18,000
GeForce2GTS
GeForce3 GeForce4 Ti4600
GeForce FX GeForce6800 Ultra
GeForce7800 GTX
GeForce PeakDepth/Stencil-only Fill
assuming no read-modify-write
Mill
ions
of
dept
h/st
enci
l pix
el u
pdat
espe
r se
cond
Raster Op units 4 4 4 4+4 16+16 16+16
double speeddepth-stencil only
CS 354 17
0
50
100
150
200
250
300
350
400
450
Riva ZX RivaTNT2
GeForce2GTS
GeForce3 GeForce4Ti 4600
GeForceFX
GeForce6800Ultra
GeForce7800 GTX
GeForce Transistor Count and Semiconductor Process
Mill
ions
of
tran
sist
ors
Process (µm) 0.35 0.22 0.18 0.18 0.15 0.13 0.13 0.11
CS 354 18
16
GeForceFX 5900
GeForce6800 Ultra
Vertex
Fragment
2nd TextureFetch
3 6
4+4
Raster Color
Raster Depth
4+4 16+16
GeForce7800 GTX
HardwareUnit
8
24
16+16
CS 354 19
GeForce 7800 GTXBoard Details
256MB/256-bit DDR3 600 MHz8 pieces of 8Mx3216x PCI-Express
SLI Connector
DVI x 2
sVideoTV Out
Single slot cooling
CS 354 20
GeForce 7800 GTXGPU Details
302 million transistors430 MHz core clock256-bit memory interface
Notable Functionality• Non-power-of-two textures with mipmaps• Floating-point (fp16) blending and filtering• sRGB color space texture filtering and
frame buffer blending• Vertex textures• 16x anisotropic texture filtering• Dynamic vertex and fragment branching• Double-rate depth/stencil-only rendering• Early depth/stencil culling• Transparency antialiasing
CS 354 21
Triangle Setup/RasterTriangle Setup/Raster
Shader Instruction DispatchShader Instruction Dispatch
Fragment CrossbarFragment Crossbar
MemoryPartitionMemoryPartition
MemoryPartitionMemoryPartition
MemoryPartitionMemoryPartition
MemoryPartitionMemoryPartition
Z-CullZ-Cull
8 Vertex Engines
24 Fragment Shaders
16 Raster Operation Pipelines
GeForce 7800 GTX ParallelismGeForce 7800 GTX Parallelism
CS 354 22
GeForce Graphics Pipeline
CPUVertexEngine Setup Raster
FragmentShader
Texture
RasterOps
FrameBuffer
Z Cull
Separate dedicated units
CS 354 23
GeForce Graphics PipelineVertex Engine
CPUVertexEngine Setup Raster
FragmentShader
Texture
RasterOps
FrameBuffer
Z Cull
Vertex pulling
Vector floating-point instructions
Dynamic branching
Vertex texture
Vertex stream frequency
CS 354 24
GeForce Graphics PipelineSetup
CPUVertexEngine Setup Raster
FragmentShader
Texture
RasterOps
FrameBuffer
Z Cull
Prepare triangle for rasterization
215M triangles/sec setup
CS 354 25
GeForce Graphics PipelineRaster
CPUVertexEngine Setup Raster
FragmentShader
Texture
RasterOps
FrameBuffer
Z Cull
Compute coverage
Points, lines, and triangles
Rotated grid multisampling
CS 354 26
GeForce Graphics PipelineZ Cull
CPUVertexEngine Setup Raster
FragmentShader
Texture
RasterOps
FrameBuffer
Z Cull
Discard fragments early based on Z
Up to 64 pixels/clock
Multisampled: 256 samples/clock
CS 354 27
GeForce Graphics PipelineFragment Shader
CPUVertexEngine Setup Raster
FragmentShader
Texture
RasterOps
FrameBuffer
Z Cull
User-programmed fragment coloring
Dynamic branching
Long shaders
Multiple render targets
fp16 and fp32 vectors
CS 354 28
GeForce Graphics PipelineTexture
CPUVertexEngine Setup Raster
FragmentShader
Texture
RasterOps
FrameBuffer
Z Cull
fp16 and sRGB filtering
16x anisotropic filtering
Non-power-of-two mipmapping
Shadow maps, cube maps, and 3D
Floating-point textures
CS 354 29
GeForce Graphics PipelineTexture
CPUVertexEngine Setup Raster
FragmentShader
Texture
RasterOps
FrameBuffer
Z Cull
2x and 4x multisampling
fp16 and sRGB blending
Multiple render targets
Color and depth compression
Double-speed depth/stencil only
CS 354 30
Single GeForce 7800Vertex Unit
FP32 Vector
Unit
To Setup
FP32 ScalarUnit
Viewport Processing
BranchUnit
VertexTextureFetch
Texture Cache
Primitive Assembly +Attribute Processing
Vertex Processing Engine• MIMD Architecture• Dual Issue• Low-penalty branching• Shader Model 3.0• 32 vector registers• 512 static instructions per
program• Indexed input and output
registers
Vertex Texture Fetch• Non-stalling• Up to 4 texture units
• Unlimited fetches• Mipmapping, no filtering
CS 354 31
Vertex Texturing Example
Flat tessellated mesh Displaced meshHeight field
texture
VertexProgram
CS 354 32
Vertex Textures for Dynamic Displacement Mapping
Images used with permission from Pacific Fighters. © 2004 Developed by 1C:Maddox Games. All rights reserved. © 2004 Ubi Soft Entertainment.
Without Vertex Textures With Vertex Textures
CS 354 33
Vertex Textures to Drive Particle Systems
Render-to-texture Simulation runs
in floating-pointframe buffer, alsousable as texture
Vertex textures Determines particle
location withvertex texturefetch
CS 354 34
Single GeForce 7800Fragment Shader Pipeline
Texture Processor
Texture Cache
BranchProcessor
FP32 ShaderUnit 1
FP32 ShaderUnit 2
Input Fragment Data
Output ShadedFragments
Fixed-functionFog Unit
TextureData
Texture Processor16 texture units1 texture fetch at full speedBilinear or tri-linear filtering16x anisotropic filteringFloating-point (fp16) texture filtering
Shader Unit 14 MULs + RCPDual IssueTexture address calculationFast fp16 normalizeFree: negate, abs, condition codes
Shader Unit 24 MADs or DP4 Dual IssueFree: negate, abs, condition codes
CS 354 35
Operations Per Fragment Shader Pass
ShaderShaderUnit 1Unit 1
TextureTexture
ShaderShaderUnit 2Unit 2
4 Components1 Op / component4 ops / fragment
per pass
4 Components1 Op / component4 Ops / fragment
per pass
1 Texture / fragment at full speed per pass
8 Operations / fragment per pass
or
CS 354 36
Fragment ShaderComponent Co-issue
Use 4 components various ways RGBA all together RGB and A RG and GB
Both shader units Two operations
per shader unitR G B A
Operation 3 Operation 4
R G B A
Operation 1 Operation 2
ShaderUnit 1
ShaderUnit 2
CS 354 37
Single GeForce 7800Raster Operations Pipeline
Input Shaded
Fragment Data
Pixel CrossbarInterconnect
Multisample Antialiasing
DepthCompression
DepthRaster
Operations
ColorCompression
ColorRaster
Operations
Frame Buffer PartitionMemory
Functionality• OpenEXR
floating-point blending
• sRGB blending
• 4x rotated grid multisampling
• Lossless color and depth compression
• Multiple render targets
CS 354 38
GeForce 7800Transparency Antialiasing
Conventional 4x antialiasingwith alpha tested context
Transparency antialiasingwith alpha tested context
CS 354 39
Scalable Link Interface (SLI)
Gang two GeForce 6600, 6800, or 7800 graphics boards together Can almost double your performance
Two 6800 Ultraspictured
SLIConnector
CS 354 40
SLI Rendering Modes
Split Frame Rendering (SFR) One GPU renders top of screen; other renders the bottom Scales fragment processing but not vertex processing
Alternate Frame Rendering (AFR) Scales both vertex and fragment processing Adds frame latency Rendering must be free of CPU synchronization
SLI Antialiasing: SLI8x and SLI16x Better antialiasing quality rather than performance Each card renders with slightly different sub-pixel offset
CS 354 41
PC Graphics Hardware Evolution
1997 2000 2005 2010
RIVA 128RIVA 1283M 3M
transistorstransistors
GeForceGeForce®® 25625623M 23M
transistorstransistors
GeForce GeForce FXFX
125M 125M transistorstransistors
GeForce GeForce 88008800681M 681M
transistorstransistorsGeForce GeForce 3 3
60M 60M transistortransistor
ss
GeForce GeForce 580 GTX580 GTX
3B3B transistorstransistors
Viable economics: 650 million GeForce GPUs since Viable economics: 650 million GeForce GPUs since 19991999
1,000x complexity since 19951,000x complexity since 1995
Moore’s Law at workMoore’s Law at work
CS 354 42
Current High-end “Fermi” GPU
Current high-end graphics card 512 graphics “cores” 1.5Gb memory System power: 600W OpenGL 4.2 / DirectX 11
functionality
CS 354 43
High-level “Fermi” Architecture GF100 Four Graphics
Processor Clusters (GPCs) Each is self-
contained graphics pipeline
Smaller chips have fewer GPcs
Shared L2 cache 6 Memory
Controllers 1.5 Gb
CS 354 44
Inside EachGraphics Processing Cluster
Raster engine Four SMs
Streaming Multiprocessor
Texture fetch resources
Tessellation and vertex processing resources Polymorph
Engine
CS 354 45
StreamingMultiprocessor (SM)
Multi-processor execution unit 32 scalar processor
cores Warp is a unit of
thread execution of up to 32 threads
Two workloads Graphics
Vertex shader Tessellation Geometry shader Fragment shader
Compute
CS 354 46
PrimitiveProgram
OpenGL Pipeline Programmable Domains run on Unified Hardware
Unified Streaming Processor Array (SPA) architecture means same capabilities for all domains Plus tessellation + compute (not shown below)
GPUFront End
VertexAssembly
VertexProgram
,Clipping, Setup,
and Rasterization
FragmentProgram
Texture Fetch
RasterOperations
Framebuffer Access
Memory Interface
Attribute Fetch
PrimitiveAssembly
Parameter Buffer Read
Can beunifiedhardware!
CS 354 47
Dual Warp Scheduling
32 threads launch!
CS 354 48
Shader or CUDA Core,Same Unit but Two Personalities
Execution unit Scalar floating-point Scalar integer
CS 354 49
Levels of Caching in Fermi GPU
12 KB L1 Texture cache Per texture unit
SM 64 K cache Split into dedicated 16K or 48K
Load/Store cache Shared memory 48K or 16K
L2 unifies texture cache, raster operation cache, and internal buffering in prior generation 768 K Read / write Fully coherent
CS 354 50
Cache Use Strategiesin Fermi GPU
Pipeline stages can communicate efficiently through GPU’s L1 and L2 caches Buffering between stages stays all on chip Only vertex, texel, and pixel read/writes need to go to DRAM
CS 354 51
Vertex and Tessellation Processing Tasks
Fixed-function graphics engines Pull attributes and assemble vertex Manage tessellation control and domain shader evaluation Viewport transform Attribute setup of plane equations for rasterization Stream out vertices into buffers
CS 354 52
Rasterization Tasks
Turns primitives into fragments Computes edge equations Two-stage rasterization
Coarse raster finds tiles the primitive could be in Fine raster evaluates sample positions within tiles
Zcull efficiently eliminates occluded fragments
CS 354 53
Input Mesh
From Metro 2033, © THQ and 4A Games
Base Input Mesh
CS 354 54
From Metro 2033, © THQ and 4A Games
Apply Phong Tessellation
CS 354 55
Apply Displacement Mapping
From Metro 2033, © THQ and 4A Games
Add Displacement Mapping
CS 354 56
Workstations2 to 4 Tesla
GPUs
Integrated CPU-GPU Servers & Blades
OEM CPU Server +Compute 1U
GPUs as Compute Nodes
Architecture of GPU has evolved into a high-performance, high-bandwidth compute node
Small form factorCompute
CS 354 57
Compute Programming Model
Cooperative Thread Array (CTA) Single Program, Multiple Data Organized in 1D, 2D, or 3D
Programming APIs CUDA, OpenCL, DirectCompute
APIs + language = parallel processing system
OpenGL or Direct3D through shaders Cg, HLSL, GLSL
CS 354 58
Now in World’s Fastest Supercomputers
Tianhe-1A
2.507 Petaflop
7168 Tesla M2050 GPUs
National Supercomputing Center in Tianjin
CS 354 59
Opposite direction:Consumer mobile devices
CS 354 60
Low-power MobileSystem on a Chip (SoC)
Complete system on a chip 4 ARM cores Integrated graphics
OpenGL ES 2.0
Power <1W
CS 354 61
Mid-term Next Class
Mid-term Similar in format to the homeworks 15% of your final grade Arrive on time Open textbook. Open notes, including lecture slides. Calculators allowed/encouraged. No smart phones, no computers, no Internet access. Show your work to justify your answer and provide a basis for partial
credit. What to study
All material in lecture slides Review in-class quiz questions Study homeworks Responsible for textbook readings
Have a relaxing spring break Next lecture: Shadows Come back to Project 2