
A SIMD-efficient 14 Instruction Shader Program for High-Throughput Microtriangle Rasterization

Jordi Roca · Victor Moya · Carlos Gonzalez · Vicente Escandell · Albert Murciego

Agustin Fernandez, Computer Architecture Department, UPC

Roger Espasa, Intel

1

Micropolygons & Tessellation: The future trend in interactive 3D rendering for improved Level-Of-Detail

2

I'm presenting today…

• An alternative GPU rasterization pipeline to efficiently process microtriangles.

• Our approach processes several microtriangles in parallel using GPU shader threads:

– Scalable throughput is guaranteed in the next GPU generations.

3

Outline

• The Rasterization of Microtriangles.

• Parallel Rasterization in GPU Shaders.

• Problems & Solutions.

• Performance Results.

• Conclusion.

4

The triangle rasterizer job

Input: screen-projected vertex coordinates: {(1,0),(13,4),(2,7)}

Output: covered fragments: {(1,0),(1,1),(2,1),(3,1), …}

[Figure: triangle with vertices V0 (1,0), V1 (13,4), V2 (2,7) drawn on the pixel grid, with X and Y axes.]
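The input/output pair on this slide can be reproduced with a minimal brute-force sketch. It is not the 14-instruction shader program from the talk. It assumes integer vertex coordinates, sampling at pixel centers (x + 0.5, y + 0.5), and a strict inside test with no tie-break rule; the function names are hypothetical. Coordinates are doubled so that pixel centers stay integers.

```python
def edge(ax, ay, bx, by, px, py):
    """Cross product of (b - a) and (p - a); its sign tells on which side of edge a->b the point p lies."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def covered_fragments(v0, v1, v2):
    """Return the pixels whose centers lie strictly inside the triangle, in row-major order.

    Assumptions (not stated on the slide): integer vertices, pixel-center
    sampling, strict inside test (pixels exactly on an edge are skipped).
    """
    # Work in half-pixel units so that the center of pixel (x, y) is (2x+1, 2y+1).
    (x0, y0), (x1, y1), (x2, y2) = [(2 * x, 2 * y) for x, y in (v0, v1, v2)]
    area = edge(x0, y0, x1, y1, x2, y2)
    if area == 0:
        return []  # degenerate triangle covers nothing
    sign = 1 if area > 0 else -1  # accept both windings

    xs = (v0[0], v1[0], v2[0])
    ys = (v0[1], v1[1], v2[1])
    fragments = []
    # Visit every pixel of the bounding box and test it independently.
    for y in range(min(ys), max(ys) + 1):
        for x in range(min(xs), max(xs) + 1):
            cx, cy = 2 * x + 1, 2 * y + 1
            w0 = sign * edge(x0, y0, x1, y1, cx, cy)
            w1 = sign * edge(x1, y1, x2, y2, cx, cy)
            w2 = sign * edge(x2, y2, x0, y0, cx, cy)
            if w0 > 0 and w1 > 0 and w2 > 0:
                fragments.append((x, y))
    return fragments


first = covered_fragments((1, 0), (13, 4), (2, 7))[:4]
```

Under these assumptions the first four fragments are (1,0), (1,1), (2,1), (3,1), which agrees with the listing on the slide. Pixel (2,0) is skipped because its center falls exactly on edge V0-V1.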

5

Three Triangle Rasterization Approaches

• Scan Lines
– Hard to parallelize
– Software renderers

• Edge Equations (setup + traversal): E(x,y) : ax + by + c = 0
– Traversal variants: Tile Scan, Recursive
– Systems named on the slide: NVIDIA GPUs ('96 - today), Intel Larrabee (software-based, SIMD-16), Pomegranate '00

• X-Products: Fatahalian K., '09 and THIS WORK
– More efficient for very small triangles.
– Independent per-pixel computation.
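The two arithmetic approaches compute the same quantity. A minimal sketch of the relationship follows, assuming one common sign convention for E(x,y) (the slides do not state one); the function names are hypothetical. Setup precomputes a, b, c once per edge and each pixel then costs two multiply-adds, whereas the cross product is recomputed per pixel from the vertices with no setup step.

```python
def edge_setup(p0, p1):
    """One-time per-edge setup: coefficients of E(x, y) = a*x + b*y + c."""
    (x0, y0), (x1, y1) = p0, p1
    a = y0 - y1
    b = x1 - x0
    c = x0 * y1 - x1 * y0
    return a, b, c


def edge_eval(coeffs, x, y):
    """Per-pixel evaluation after setup."""
    a, b, c = coeffs
    return a * x + b * y + c


def cross_product(p0, p1, x, y):
    """Per-pixel evaluation without setup: (p1 - p0) x ((x, y) - p0)."""
    (x0, y0), (x1, y1) = p0, p1
    return (x1 - x0) * (y - y0) - (y1 - y0) * (x - x0)


# Edge V0-V1 of the example triangle, evaluated at V2 = (2, 7).
coeffs = edge_setup((1, 0), (13, 4))
same = edge_eval(coeffs, 2, 7) == cross_product((1, 0), (13, 4), 2, 7)
```

Both forms give 80 at (2,7) for this edge; expanding the cross product term by term yields exactly a*x + b*y + c with the coefficients above.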

6

Setup equations or X-products?

• The high cost of triangle setup is not amortized for ≤ 2-pixel triangles.

• Cross-products are more efficient for very small triangles.

[Figure: Rasterizer efficiency (ops per pixel, lower = better). X axis: triangle size (1 to 12). Y axis: floating-point operations (0 to 80). Series: Edge Equations, X-products.]

7

The GPU's bottleneck in microtriangles: a single Setup unit!

• Typical 2009-GPU rates: 1tri : 32pix /clock

• But the microtriangle ratio is 1tri : ≈1pix

• The Single Setup unit starves the Pixel Pipeline (Shader/ZStencil/Color)

• Need more microtriangle throughput … Can shader units help?
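The starvation argument is simple arithmetic. The sketch below is a back-of-the-envelope model, not a simulator result: it assumes the pixel pipeline is limited only by how many fragments the setup unit can hand it per clock, and the function name is hypothetical.

```python
def pixel_pipeline_utilization(tris_per_clock, pixels_per_tri, peak_pixels_per_clock):
    """Fraction of the pixel pipeline's peak rate that a setup-limited stream can feed."""
    produced = tris_per_clock * pixels_per_tri  # fragments generated per clock
    return min(1.0, produced / peak_pixels_per_clock)


# Rates quoted on the slide: 1 triangle and 32 pixels per clock.
large = pixel_pipeline_utilization(1, 32, 32)  # 32-pixel triangles keep the pipeline full
micro = pixel_pipeline_utilization(1, 1, 32)   # ~1-pixel triangles feed 1/32 of the peak
```

With one triangle per clock and about one pixel per triangle, the pixel pipeline receives roughly 3% of the fragments it could process.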

[Figure: Utilization of the different GPU units rendering a stream of 1-pixel-size microtriangles. X axis: Time (Kcycles), 1 to 11. Y axis: Unit utilization, 0% to 100%. Series: rstz/setup, memory, color, Z/stencil, shader.]

8

How could we increase the throughput for microtriangles?

• Option 1: Replicate the Triangle Setup unit N times
– Increases area
– Does not scale to a very large number of microtriangles

• Option 2: Use the shader units to render microtriangles (THIS WORK)
– No area cost
– Large triangles still use the existing triangle setup unit
– Scales in the future as
   • Microtriangles are more frequent
   • Future GPUs offer more shader cores

9

Proposed Microtriangle pipeline

• Selectable by the API user.

[Figure: two pipelines side by side. Both share the same front end: Input Assembler (Vertex Buffer, Index Buffer), Vertex Shader (Texture), Geometry Shader (Texture), Stream Output.
Standard DX10 Pipeline (for normal triangles): Rasterizer/Interpolator, Pixel Shader (Texture), Output Merger (Depth/Stencil, Render Target).
Microtriangle Pipeline: Triangle Bound, Rasterize & Shade Pixels (Rasterize, Interpolate, Pixel Shader, with Texture), Output Merger (Depth/Stencil, Render Target).]

10


Parallel Rasterization in GPU Shaders

1. Fill shader vector groups with fragments within the bounding boxes of n input microtriangles.

2. Run the rasterization program on multiple fragments, followed by the original API fragment shader.

3. Reorder shaded fragments and do Z Test.

[Figure: shader thread flow: Thread Entry, Rasterization, Z Interpolation, Attribute Interpolation, Original DirectX Fragment Shader, Thread Exit. Example per-fragment values after Z interpolation: Z = 3, 5, 7, 9, 1, 1, 2. After attribute interpolation, (Z, S, T) = (3,1,0), (5,0,1), (7,0,0), (9,1,1), (1,0,0), (1,1,1), (2,1,0).]
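Step 1 can be sketched as a simple packing routine. This is an illustration under assumptions, not the hardware described in the talk: bounding boxes are given as inclusive pixel ranges, the vector width is a free parameter, and the names `fill_vector_groups` and `Lane` are hypothetical. Each lane carries the triangle index and pixel position it has to test, so lanes of the same vector can belong to different microtriangles.

```python
from typing import List, Optional, Tuple

# One SIMD lane: (triangle index, pixel x, pixel y), or None for an idle lane.
Lane = Optional[Tuple[int, int, int]]


def fill_vector_groups(bboxes: List[Tuple[int, int, int, int]], width: int) -> List[List[Lane]]:
    """Pack the candidate fragments of several microtriangles into vectors of `width` lanes.

    bboxes[i] = (x_min, y_min, x_max, y_max), inclusive, for triangle i.
    """
    # Enumerate every pixel of every bounding box, triangle by triangle.
    lanes: List[Lane] = [
        (tri, x, y)
        for tri, (x0, y0, x1, y1) in enumerate(bboxes)
        for y in range(y0, y1 + 1)
        for x in range(x0, x1 + 1)
    ]
    groups = []
    for start in range(0, len(lanes), width):
        group = lanes[start:start + width]
        group += [None] * (width - len(group))  # pad the last vector with idle lanes
        groups.append(group)
    return groups


groups = fill_vector_groups([(0, 0, 1, 0), (5, 5, 5, 5), (2, 2, 3, 3)], width=4)
```

Here three microtriangles with 2, 1 and 4 candidate pixels fill two 4-wide vectors, with one idle lane at the end. Lanes whose pixel later fails the coverage test become wasted work, which is why tighter bounding boxes raise vector density.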

12

The required features of our rasterization program:

• Consistent rasterization (no cracks or repeated pixels):
– Fixed-point arithmetic.
– Tie break rule for adjacent edges.

• Full support of modern GPU aspects:
– Z interpolation:
   • Perspective
   • Orthogonal
– Attribute interpolation:
   • Flat
   • Non-perspective correct
   • Perspective correct
   • Centroid
– Face culling:
   • Front/Back/Front&Back
– MSAA:
   • x2, x4, x6, x8
   • Customizable patterns
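The need for a tie-break rule shows up as soon as a pixel center lies exactly on an edge shared by two triangles. The slides do not say which rule the program uses; the sketch below assumes the common top-left convention (y grows downwards) purely for illustration, and all names are hypothetical. Coordinates are doubled so pixel centers are integers.

```python
def edge(a, b, p):
    """Cross product of (b - a) and (p - a)."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def is_top_left(a, b):
    """For a triangle with positive area under edge(): top edges run right, left edges run up."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    return (dy == 0 and dx > 0) or dy < 0


def covered(tri, width, height, rule="top-left"):
    """Pixels of a width x height grid covered by tri (integer vertices).

    rule: "top-left" (assumed tie-break), "inclusive" (ties always hit),
    or "exclusive" (ties never hit).
    """
    v = [(2 * x, 2 * y) for x, y in tri]  # half-pixel units
    area = edge(v[0], v[1], v[2])
    if area == 0:
        return set()
    if area < 0:
        v[1], v[2] = v[2], v[1]  # normalize winding
    edges = ((v[0], v[1]), (v[1], v[2]), (v[2], v[0]))
    hits = set()
    for y in range(height):
        for x in range(width):
            center = (2 * x + 1, 2 * y + 1)
            inside = True
            for a, b in edges:
                e = edge(a, b, center)
                tie_ok = rule == "inclusive" or (rule == "top-left" and is_top_left(a, b))
                if e < 0 or (e == 0 and not tie_ok):
                    inside = False
                    break
            if inside:
                hits.add((x, y))
    return hits


# Two triangles sharing the diagonal of a 4x4-pixel square.
t1 = [(0, 0), (4, 0), (4, 4)]
t2 = [(0, 0), (4, 4), (0, 4)]
```

With the tie-break rule each of the 16 pixels is hit exactly once. Treating ties as always inside hits the four diagonal pixels twice (repeated pixels); treating them as always outside leaves those four pixels uncovered (a crack).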

13


Shading of sparse vectors: Bounding Box optimization pre-pass

• Increases the density of microtriangle vectors by 20 to 45%.

• Culls entirely subpixel microtriangles (55% culling ratio).

• Simple hardware (four comparators, four adders) performs this optimization inside the Triangle Bound unit.

[Figure: Pixel-accurate BB compared with Subpixel-accurate BB. Annotations: "Can shrink these BB sides"; "The gap tells us those pixels will never really be hit!"]
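The bounding-box shrink can be sketched in a few integer operations. This is an assumed formulation, not the Triangle Bound unit's actual logic: vertices are in 24.8 fixed point, pixels are sampled at their centers (x + 0.5, y + 0.5), and the function names are hypothetical. A pixel column can only be hit if its center lies between the triangle's minimum and maximum x, and likewise for rows.

```python
FRAC_BITS = 8            # 24.8 fixed point: 8 fractional bits
HALF = 1 << (FRAC_BITS - 1)  # 0.5 in fixed point


def pixel_accurate_bb(xs, ys):
    """Bounding box snapped outwards to whole pixels (floor of min and max)."""
    return (min(xs) >> FRAC_BITS, min(ys) >> FRAC_BITS,
            max(xs) >> FRAC_BITS, max(ys) >> FRAC_BITS)


def subpixel_accurate_bb(xs, ys):
    """Bounding box containing only pixels whose center is inside the vertex extent.

    Returns None when no pixel center falls inside, so the triangle can be culled.
    """
    # First pixel whose center (x + 0.5) is >= min: ceil(min - 0.5).
    x_first = (min(xs) + HALF - 1) >> FRAC_BITS
    y_first = (min(ys) + HALF - 1) >> FRAC_BITS
    # Last pixel whose center (x + 0.5) is <= max: floor(max - 0.5).
    x_last = (max(xs) - HALF) >> FRAC_BITS
    y_last = (max(ys) - HALF) >> FRAC_BITS
    if x_first > x_last or y_first > y_last:
        return None  # the gap contains no sample point: cull
    return (x_first, y_first, x_last, y_last)


# Triangle spanning x in [200/256, 435/256], y in [102/256, 154/256].
xs, ys = [200, 435, 300], [102, 154, 130]
loose = pixel_accurate_bb(xs, ys)
tight = subpixel_accurate_bb(xs, ys)
```

For this triangle the pixel-accurate box is two pixels wide, (0,0) to (1,0), while the subpixel-accurate box keeps only pixel (1,0), because the center of pixel 0 at x = 0.5 lies left of the triangle.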

15

Avoid cracks or repeated pixels: Use of Fixed-Point arithmetic

• The rasterization program must ensure that each single pixel is hit by exactly one microtriangle in the mesh (no cracks, no repeated pixels).
– Extended the shader ISA with FXMUL and FXMAD fixed-point instructions, which provide consistent cross-product results across microtriangles.

[Figure: a lit mesh of adjacent microtriangles, rendered with 32-bit floating point and with 24.8-bit fixed point.]
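Why fixed point gives consistent results can be illustrated with a small emulation. The exact semantics of FXMUL and FXMAD are not given in the slides; the helpers `fxmul` and `fxmad` below are hypothetical stand-ins that assume 24.8 inputs and an exact product with 16 fractional bits, and they ignore register width and overflow. The point is that integer products and sums are exact, so the two triangles sharing an edge compute values that are exact negatives of each other.

```python
FRAC_BITS = 8  # 24.8 fixed point


def to_fixed(value):
    """Quantize a real coordinate to 24.8 fixed point (nearest step of 1/256)."""
    return int(round(value * (1 << FRAC_BITS)))


def fxmul(a, b):
    """Hypothetical FXMUL: exact product of two 24.8 values (16 fractional bits)."""
    return a * b


def fxmad(a, b, c):
    """Hypothetical FXMAD: exact a * b + c, with c already holding 16 fractional bits."""
    return a * b + c


def edge_fx(a, b, p):
    """Cross product of (b - a) and (p - a) on fixed-point points, using fxmul and fxmad."""
    return fxmad(b[0] - a[0], p[1] - a[1], -fxmul(b[1] - a[1], p[0] - a[0]))


# A shared edge seen from both neighbouring triangles (opposite directions).
a = (to_fixed(0.30), to_fixed(0.10))
b = (to_fixed(2.70), to_fixed(1.90))
p = (to_fixed(1.5), to_fixed(1.0))  # a pixel center exactly on the edge
forward = edge_fx(a, b, p)
backward = edge_fx(b, a, p)
```

Because `forward == -backward` holds exactly for every point, a sample is never judged inside both neighbours or outside both because of rounding; only exact ties remain, and those are resolved by the tie-break rule.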

16


Great microtriangle throughput scaling

• Rendering of 1/2-pixel and 1/8-pixel-size microtriangle meshes speeds up by 1.3X to 4X with 16 shader cores, relative to the traditional GPU rasterizer unit.

• The better scaling of the 1/8 size (blue) is due to the effectiveness of the Bounding Box optimization.

18


Conclusions

• Near-term 3D rendering demands a microtriangle pipeline to efficiently process tessellated surfaces.

• Current GPU rasterizers are not intended for microtriangles:
– Designed for high pixel rates on triangles larger than ~10 pixels.
– Poor microtriangle throughput to feed the pixel pipeline.
– Replication inefficiently increases area: bad scalability.

• We propose to rasterize microtriangles in GPU shaders.
– The largest and most scalable resource in today's GPUs.
– Using the more efficient X-products instead of edge setup.
– As an alternative pipeline selectable by the API user.

• Problems and solutions:
– Shading of sparse vectors: Bounding Box optimization pre-pass.
– No cracks or repeated pixels by using Fixed-Point operations.

20

Thank you! Q&A

21

BACKUP

22

Enable Early Z Optimization

[Figure: three shader thread flows compared.
(a) Late Z Test: Thread Entry, Rasterization, Z Interpolation, Attribute Interpolation, Original DirectX Fragment Shader, Thread Exit, then Late Z Test.
(b) Early Z Test between two threads: Thread Entry, Rasterization, Z Interpolation, Thread Exit; Early Z Test; Thread Entry, Original DirectX Fragment Shader, Thread Exit.
(c) Early Z Test with a sleeping thread: Thread Entry, Rasterization, Z Interpolation, Sleep Thread (Z test request sent to the Early Z Test, Z test result returned), Original DirectX Fragment Shader, Thread Exit.]

23
