A SIMD-efficient 14 Instruction Shader Program
for High-Throughput
Microtriangle Rasterization
Jordi Roca · Victor Moya · Carlos Gonzalez · Vicente Escandell · Albert Murciego
Agustin Fernandez, Computer Architecture Department, UPC
Roger Espasa, Intel
1
Micropolygons & Tessellation: The future trend in interactive 3D rendering for improved Level-Of-Detail
2
I'm presenting today…
• An alternative GPU rasterization pipeline to efficiently process microtriangles.
• Our approach processes several microtriangles in parallel using GPU shader threads:
– Scalable throughput is guaranteed in the next GPU generations.
3
Outline
• The Rasterization of Microtriangles.
• Parallel Rasterization in GPU Shaders.
• Problems & Solutions.
• Performance Results.
• Conclusion.
4
The triangle rasterizer job
V0 (1,0)
V1 (13,4)
V2 (2,7)
Input: screen-projected vertex coordinates: {(1,0),(13,4),(2,7)}
Output: covered fragments: {(1,0),(1,1),(2,1),(3,1), …}
[Figure: triangle V0 V1 V2 drawn against the screen X and Y axes.]
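The rasterizer job above can be sketched in a few lines of Python. This is only an illustration, not the 14-instruction shader program from the talk: the function names are hypothetical, samples are assumed to sit at pixel centers (x + 0.5, y + 0.5), and samples lying exactly on an edge are treated as outside because no tie-break rule is applied.

```python
from math import ceil, floor

def edge_cross(ax, ay, bx, by, px, py):
    """Cross product (B - A) x (P - A); its sign says which side of edge AB holds P."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)

def rasterize(v0, v1, v2):
    """Return the pixels (x, y) whose centers lie strictly inside the triangle."""
    area = edge_cross(*v0, *v1, *v2)  # twice the signed area
    if area == 0:
        return []  # a degenerate triangle covers nothing
    xs = (v0[0], v1[0], v2[0])
    ys = (v0[1], v1[1], v2[1])
    covered = []
    for y in range(floor(min(ys)), ceil(max(ys)) + 1):
        for x in range(floor(min(xs)), ceil(max(xs)) + 1):
            cx, cy = x + 0.5, y + 0.5  # assumed sample position: pixel center
            crosses = (
                edge_cross(*v0, *v1, cx, cy),
                edge_cross(*v1, *v2, cx, cy),
                edge_cross(*v2, *v0, cx, cy),
            )
            # Inside when all three cross products share the sign of the area.
            if all(c * area > 0 for c in crosses):
                covered.append((x, y))
    return covered

fragments = rasterize((1, 0), (13, 4), (2, 7))
print(fragments[:4])  # [(1, 0), (1, 1), (2, 1), (3, 1)]
```

Each pixel is tested independently of its neighbours, which is what lets many pixels be evaluated in parallel.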
5
Three Triangle Rasterization Approaches
E(x,y) : ax + by + c = 0
Intel Larrabee (software-based, SIMD-16)
Tile Scan
NVIDIA GPUs '96 - today
Recursive
Edge Equations (setup + traversal)
Pomegranate '00
Scan Lines
• Hard to parallelize
• Software renderers
Fatahalian K., '09 and THIS WORK
X-Products
• More efficient for very small triangles.
• Independent per-pixel computation.
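A small Python sketch of how the two formulations relate. The edge equation E(x,y) = ax + by + c needs a per-edge setup step to compute a, b and c, while the cross-product evaluates the same quantity directly from the vertices for each pixel. The function names are hypothetical, and the coefficient convention (a = ay - by, b = bx - ax, c = ax*by - bx*ay) is one common choice, assumed here rather than taken from the slides.

```python
def edge_setup(ax, ay, bx, by):
    """One-time setup per edge A->B: coefficients of E(x, y) = a*x + b*y + c."""
    a = ay - by
    b = bx - ax
    c = ax * by - bx * ay
    return a, b, c

def edge_eval(coeffs, x, y):
    """Per-pixel evaluation once the setup has been paid for."""
    a, b, c = coeffs
    return a * x + b * y + c

def x_product(ax, ay, bx, by, x, y):
    """Per-pixel cross product (B - A) x (P - A); no setup step required."""
    return (bx - ax) * (y - ay) - (by - ay) * (x - ax)

coeffs = edge_setup(1, 0, 13, 4)  # edge V0 -> V1 of the example triangle
print(coeffs)                                   # (-4, 12, 4)
print(edge_eval(coeffs, 1.5, 0.5))              # 4.0
print(x_product(1, 0, 13, 4, 1.5, 0.5))         # 4.0
```

Both paths give the same value; the difference is whether the work is paid once per triangle (setup) or once per pixel (cross-product), which matters when a triangle covers only one or two pixels.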
6
Setup equations or X-products?
• The high cost of triangle setup is not amortized for ≤ 2-pixel triangles.
• Cross-products are more efficient for very small triangles.
[Figure: Rasterizer efficiency (ops per pixel, lower = better). Y axis: floating-point operations (0 to 80); X axis: triangle size (1 to 12); series: Edge Equations, X-products.]
7
The GPU's bottleneck in microtriangles: a single Setup unit!
• Typical 2009 GPU rates: 1 tri : 32 pix / clock
• But the microtriangle ratio is 1 tri : ≈1 pix
• The single Setup unit starves the Pixel Pipeline (Shader/ZStencil/Color)
• Need more microtriangle throughput … Can shader units help?
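The starvation argument can be put into numbers with a deliberately simple steady-state model. The function name and the model itself are assumptions made for illustration; only the 1 tri : 32 pix per clock rates come from the slide.

```python
def pixel_pipeline_utilization(tris_per_clock, pix_per_clock, pix_per_tri):
    """Fraction of the pixel pipeline kept busy when setup is the only producer."""
    produced = tris_per_clock * pix_per_tri  # pixels emitted per clock by setup
    return min(1.0, produced / pix_per_clock)

print(pixel_pipeline_utilization(1, 32, 1))   # 0.03125
print(pixel_pipeline_utilization(1, 32, 32))  # 1.0
```

Under this model a stream of 1-pixel microtriangles keeps roughly 3% of a 32-pixel-per-clock pipeline busy, while 32-pixel triangles would saturate it.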
[Figure: Utilization of the different GPU units while rendering a stream of 1-pixel microtriangles. Y axis: unit utilization (0% to 100%); X axis: time (Kcycles, 1 to 11); series: rstz/setup, memory, color, Z/stencil, shader.]
8
How could we increase the throughput for microtriangles?
• Option 1: Replicate the Triangle Setup unit N times
– Increases area
– Does not scale to a very large number of microtriangles
• Option 2: Use the shader units to render microtriangles
– THIS WORK.
– No area cost
– Large triangles still use the existing triangle setup unit
– Scales in the future as
• Microtriangles become more frequent
• Future GPUs offer more shader cores
9
Proposed Microtriangle pipeline
• Selectable by the API user.
[Figure: two pipeline diagrams side by side.
Standard DX10 Pipeline (for normal triangles): Input Assembler, Vertex Shader, Geometry Shader, Stream Output, Rasterizer/Interpolator, Pixel Shader, Output Merger.
Microtriangle Pipeline: Input Assembler, Vertex Shader, Geometry Shader, Stream Output, Triangle Bound, Rasterize & Shade Pixels (Rasterize, Interpolate, Pixel Shader), Output Merger.
Memory resources shown in both diagrams: Vertex Buffer, Index Buffer, Texture, Depth/Stencil, Render Target.]
10
Outline
• The Rasterization of Microtriangles.
• Parallel Rasterization in GPU Shaders.
• Problems & solutions.
• Performance Results.
• Conclusion.
11
Parallel Rasterization in GPU Shaders
1. Fill shader vector groups with fragments within the bounding boxes of n input microtriangles.
2. Run the rasterization program on multiple fragments followed by the original API fragment shader.
3. Reorder shaded fragments and do Z Test.
[Figure: shader thread stages, in order: Thread Entry, Rasterization, Z Interpolation, Attribute Interpolation, Original DirectX Fragment Shader, Thread Exit.
Example fragment values after Z Interpolation: Z = 3, 5, 7, 9, 1, 1, 2.
Example fragment values after Attribute Interpolation (Z, S, T): (3, 1, 0), (5, 0, 1), (7, 0, 0), (9, 1, 1), (1, 0, 0), (1, 1, 1), (2, 1, 0).]
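Step 1 above can be sketched as a packing routine. This is a hypothetical software model, not the hardware described in the talk: the function names, the (triangle id, x, y) fragment layout, the default group width of 16 and the use of None for idle lanes are all assumptions.

```python
from math import floor

def bounding_box_fragments(tri_id, tri):
    """All pixels in the pixel-accurate bounding box of one microtriangle."""
    xs = [v[0] for v in tri]
    ys = [v[1] for v in tri]
    return [
        (tri_id, x, y)
        for y in range(floor(min(ys)), floor(max(ys)) + 1)
        for x in range(floor(min(xs)), floor(max(xs)) + 1)
    ]

def fill_vector_groups(triangles, width=16):
    """Pack candidate fragments from several microtriangles into fixed-width groups."""
    fragments = []
    for tri_id, tri in enumerate(triangles):
        fragments.extend(bounding_box_fragments(tri_id, tri))
    groups = []
    for start in range(0, len(fragments), width):
        group = fragments[start:start + width]
        group += [None] * (width - len(group))  # pad idle lanes in the last group
        groups.append(group)
    return groups

tris = [
    [(0.2, 0.2), (1.8, 0.3), (0.4, 1.7)],  # bounding box spans 2x2 pixels
    [(5.1, 5.1), (5.9, 5.2), (5.3, 5.8)],  # bounding box spans 1 pixel
]
for group in fill_vector_groups(tris, width=4):
    print(group)
```

Every lane then runs the same rasterization program on its own fragment; lanes whose fragment fails the coverage test do no useful shading, which is why bounding-box density matters.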
12
The required features of our rasterization program:
• Consistent rasterization (no cracks or repeated pixels):
– Fixed-point arithmetic.
– Tie-break rule for adjacent edges.
• Full support of modern GPU aspects:
– Z interpolation:
• Perspective
• Orthogonal
– Attribute interpolation:
• Flat
• Non-perspective correct
• Perspective correct
• Centroid
– Face culling:
• Front/Back/Front&Back
– MSAA:
• x2, x4, x6, x8
• Customizable patterns
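As an illustration of the interpolation modes listed above, the sketch below derives barycentric weights from the same cross-products used for the coverage test and applies them in non-perspective and perspective-correct form. It is the textbook formulation, not the authors' shader code; the function names and the per-vertex 1/w input are assumptions, and the triangle is assumed to be non-degenerate.

```python
def cross(a, b, p):
    """Cross product (B - A) x (P - A) for 2D points given as (x, y) tuples."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def barycentric(v0, v1, v2, p):
    """Barycentric weights of p; each weight is the cross product of the opposite edge."""
    w0 = cross(v1, v2, p)
    w1 = cross(v2, v0, p)
    w2 = cross(v0, v1, p)
    area = w0 + w1 + w2  # equals twice the signed triangle area
    return w0 / area, w1 / area, w2 / area

def interpolate(weights, values, inv_w=None):
    """Linear interpolation, or perspective-correct when per-vertex 1/w is given."""
    if inv_w is None:
        return sum(b * a for b, a in zip(weights, values))
    num = sum(b * a * iw for b, a, iw in zip(weights, values, inv_w))
    den = sum(b * iw for b, iw in zip(weights, inv_w))
    return num / den

weights = barycentric((0, 0), (4, 0), (0, 4), (1, 1))
print(weights)                                                # (0.5, 0.25, 0.25)
print(interpolate(weights, (3, 5, 7)))                        # 4.5
print(interpolate(weights, (3, 5, 7), inv_w=(1, 0.5, 0.5)))   # 4.0
```

Flat interpolation would skip the weights and take one vertex's value for the whole triangle.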
13
Outline
• The Rasterization of Microtriangles.
• Parallel Rasterization in GPU Shaders.
• Problems & Solutions.
• Performance Results.
• Conclusion.
14
Shading of sparse vectors: Bounding Box optimization pre-pass
• Increases the density of microtriangle vectors by 20 to 45%.
• Entirely culls subpixel microtriangles (55% culling ratio).
• Simple hardware (four comparators, four adders) performs this optimization inside the Triangle Bound unit.
[Figure: pixel-accurate BB versus subpixel-accurate BB. Annotations: "Can shrink these BB sides"; "The gap tells us those pixels will never really be hit!"]
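A minimal sketch of the idea behind the subpixel-accurate bounding box, assuming a single sample at each pixel center (no MSAA). The function names are hypothetical, and this software version only mirrors the effect of the optimization; it does not model the comparator and adder hardware mentioned on the slide.

```python
from math import ceil, floor

def pixel_accurate_bbox(tri):
    """Bounding box snapped outward to whole pixels: (x_first, y_first, x_last, y_last)."""
    xs = [v[0] for v in tri]
    ys = [v[1] for v in tri]
    return floor(min(xs)), floor(min(ys)), floor(max(xs)), floor(max(ys))

def subpixel_accurate_bbox(tri):
    """Keep only pixels whose center (x + 0.5, y + 0.5) lies inside the box.

    Returns None when no pixel center can be hit, so the triangle can be culled.
    """
    xs = [v[0] for v in tri]
    ys = [v[1] for v in tri]
    x_first, x_last = ceil(min(xs) - 0.5), floor(max(xs) - 0.5)
    y_first, y_last = ceil(min(ys) - 0.5), floor(max(ys) - 0.5)
    if x_first > x_last or y_first > y_last:
        return None
    return x_first, y_first, x_last, y_last

between_centers = [(0.6, 0.6), (1.4, 0.7), (0.8, 1.4)]
print(pixel_accurate_bbox(between_centers))     # (0, 0, 1, 1)
print(subpixel_accurate_bbox(between_centers))  # None

one_column = [(0.2, 0.2), (1.4, 0.3), (0.4, 1.8)]
print(subpixel_accurate_bbox(one_column))       # (0, 0, 0, 1)
```

In the first case four candidate fragments become zero; in the second, the box shrinks from 2x2 to one column, so fewer idle lanes enter the shader vectors.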
15
Avoid cracks or repeated pixels: Use of Fixed-Point arithmetic
• The rasterization program must ensure that each pixel is hit by exactly one microtriangle in the mesh (no cracks, no repeated pixels).
– Extended the shader ISA with FXMUL and FXMAD fixed-point instructions, which provide consistent cross-product results across microtriangles.
[Figure: a lit mesh of adjacent microtriangles, rendered with 32-bit floating point and with 24.8-bit fixed point.]
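Why fixed point gives consistent results can be seen in a short sketch. Python integers stand in for the 24.8 format here; the helper names are hypothetical and the code does not model the actual FXMUL or FXMAD instruction semantics. With exact integer arithmetic, the cross-product for an edge shared by two adjacent microtriangles is an exact negation when evaluated from either side, so a sample cannot be claimed by both triangles or by neither (a sample exactly on the edge, where the result is zero, is left to the tie-break rule).

```python
FRAC_BITS = 8  # 24.8 format: 8 fractional bits

def to_fixed(value):
    """Quantize a coordinate to 1/256 of a pixel and store it as an integer."""
    return round(value * (1 << FRAC_BITS))

def fx_point(x, y):
    return to_fixed(x), to_fixed(y)

def fx_cross(a, b, p):
    """Exact integer cross product (B - A) x (P - A); result has 16 fractional bits."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

a = fx_point(1.25, 0.5)
b = fx_point(3.75, 2.0)
p = fx_point(2.5, 1.5)

side_ab = fx_cross(a, b, p)  # edge as seen by one triangle
side_ba = fx_cross(b, a, p)  # same edge as seen by its neighbour
print(side_ab, side_ba)                 # 40960 -40960
print(side_ab / (1 << 2 * FRAC_BITS))   # 0.625
```

With floating point the two evaluations may round differently, which is where cracks or doubly hit pixels can come from.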
16
Outline
• The Rasterization of Microtriangles.
• Parallel Rasterization in GPU Shaders.
• Problems & Solutions.
• Performance Results.
• Conclusion.
17
Great microtriangle throughput scaling
• Rendering performance for 1/2-pixel and 1/8-pixel-size microtriangle meshes scales up 1.3X to 4X with 16 shader cores, with respect to the traditional GPU rasterizer unit.
• The better scaling of the 1/8 size (blue) is due to the effectiveness of the Bounding Box optimization.
18
Outline
• The Rasterization of Microtriangles.
• Parallel Rasterization in GPU Shaders.
• Problems & Solutions.
• Performance Results.
• Conclusion.
19
Conclusions
• Near-term 3D rendering demands a microtriangle pipeline to efficiently process tessellated surfaces.
• Current GPU rasterizers are not intended for microtriangles:
– Designed for high pixel rates on triangles larger than ~10 pixels.
– Poor microtriangle throughput to feed the pixel pipeline.
– Replication inefficiently increases area: bad scalability.
• We propose to rasterize microtriangles in GPU shaders.
– The largest and most scalable resource in today's GPUs.
– Using the more efficient X-products instead of edge setup.
– As an alternative pipeline selectable by the API user.
• Problems and solutions:
– Shading of sparse vectors: Bounding Box optimization pre-pass.
– No cracks or repeated pixels by using Fixed-Point operations.
20
Thank you!
Q&A
21
BACKUP
22
Enable Early Z Optimization
[Figure: shader thread flow diagrams comparing Late Z Test and Early Z Test. Stages shown: Thread Entry, Rasterization, Z Interpolation, Attribute Interpolation, Original DirectX Fragment Shader, Thread Exit. Additional labels in the Early Z diagrams: Sleep Thread, Z test request, Z test result.]
23