Grass, Fur and All Things Hairy - AMD at GDC14

Grass, Fur and all things hairy
Nicolas Thibieroz, Gaming Engineering Manager, AMD
Karl Hillesland, Senior Research Engineer, AMD

Next-gen Grass, Fur and Hair
●The time for next-gen quality is now
●Tomb Raider pioneered next-gen hair
  ● Even on PS4/XB1
●Users expect this level of quality for next-gen titles
●You need to start thinking about this
●This talk is about making high-quality fur, grass and hair run at real-time performance

TressFX applied to Grass, Fur and Hair
●Variations of the same technique can be used for all those applications
●In all cases the core principles of next-gen quality are still needed:
  ● Compute simulations
  ● Anti-aliasing
  ● Transparency
  ● Volumetric self-shadowing
  ● A good lighting model

Forward Rendering Pipeline – a refresher
●Consists of three steps:
  ● Hair simulation
  ● Shade and store fragments into buffers
  ● Fetch shaded fragments, sort and render

Per-Pixel Linked Lists
●Head UAV
  ● Each pixel location has a “head pointer” to a linked list in the PPLL UAV
●PPLL UAV
  ● As new fragments are rendered, they are added to the next open location in the PPLL (using UAV counter)
  ● A link is created to the fragment pointed to by the head pointer
  ● Head pointer then points to the new fragment

    // Retrieve current pixel count and increase counter
    uint uPixelCount = LinkedListUAV.IncrementCounter();
    uint uOldStartOffset;

    // Exchange indices in LinkedListHead texture corresponding to pixel location
    InterlockedExchange(LinkedListHeadUAV[address], uPixelCount, uOldStartOffset);

    // Append new element at the end of the Fragment and Link Buffer
    Element.uNext = uOldStartOffset;
    LinkedListUAV[uPixelCount] = Element;

[Diagram: Head UAV entries pointing into the PPLL UAV]
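The insertion steps above can be mirrored on the CPU to make the data flow easier to follow. The following is a minimal Python sketch, not the TressFX shader code: the class name, the `END_OF_LIST` sentinel and the explicit overflow check are assumptions made for illustration, and the atomic operations are reduced to ordinary single-threaded assignments.

```python
END_OF_LIST = 0xFFFFFFFF  # assumed sentinel meaning "no next fragment"


class PerPixelLinkedLists:
    """Single-threaded stand-in for the Head UAV + PPLL UAV pair."""

    def __init__(self, width, height, max_nodes):
        self.width = width
        self.head = [END_OF_LIST] * (width * height)  # Head UAV: one entry per pixel
        self.nodes = [None] * max_nodes               # PPLL UAV: shared node pool
        self.counter = 0                              # UAV counter

    def insert(self, x, y, depth, data):
        """Store one fragment; returns False when the node pool is full."""
        if self.counter >= len(self.nodes):
            return False                  # out of nodes: this fragment goes missing
        slot = self.counter               # IncrementCounter()
        self.counter += 1
        address = y * self.width + x
        old_head = self.head[address]     # InterlockedExchange(head, slot, old_head)
        self.head[address] = slot
        self.nodes[slot] = (depth, data, old_head)  # new node links to the old head
        return True

    def fragments(self, x, y):
        """Walk the list for one pixel, newest fragment first."""
        out = []
        index = self.head[y * self.width + x]
        while index != END_OF_LIST:
            depth, data, index = self.nodes[index]
            out.append((depth, data))
        return out
```

Because each new node becomes the head, a pixel's list comes back in reverse submission order, which is why a separate sorting step is needed before blending.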

Forward Rendering Pipeline – a refresher
Hair Simulation
[Diagram: input geometry (model space) and simulation parameters feed three compute shader (CS) passes, which write post-simulation geometry to a UAV (world space)]

Forward Rendering Pipeline – a refresher
Shade and Store Fragments into Buffers
[Diagram labels: world space; extrusion from line segments to non-indexed triangles; homogeneous clip space; coverage; lighting; shadows; next; null RT; stencil; Head UAV; PPLL UAV]

Forward Rendering Pipeline – a refresher
Fetch Shaded Fragments, Sort and Render
[Diagram labels: full screen quad; stencil; Head UAV; PPLL UAV; fragment sorting and manual blending; render target]

Forward Rendering Performance
●Main cost in forward rendering mode is in the shading part
  ● All fragments are lit and shadowed before being stored
  ● PPLL storing is typically not the bottleneck!
●Don’t need maximum quality on all fragments
  ● “tail” fragments need only “good enough” quality
●Solution: Use shader LOD

Forward vs Deferred Rendering Pipeline

Deferred rendering pipeline
●Hair simulation
●Store fragment properties into buffers
●Fetch fragment properties, sort, shade and render
  ● Full shading on K-frontmost fragments
  ● “Tail” fragments are shaded with a simpler light equation and shadowing algorithm

Forward rendering pipeline
●Hair simulation
●Full shading and store fragments into buffers
●Fetch shaded fragments, sort and render

Deferred Rendering Pipeline
Hair Simulation – unchanged!
[Diagram: input geometry (model space) and simulation parameters feed three compute shader (CS) passes, which write post-simulation geometry to a UAV (world space)]

Deferred Rendering Pipeline – a refresher
Store Fragment Properties into Buffers
[Diagram labels: world space; index buffer; indexed triangle list; homogeneous clip space; coverage; tangent; null RT; stencil; Head UAV; PPLL UAV]

Deferred Rendering Pipeline
Fetch Fragments, Sort, Shade and Render
[Diagram labels: full screen quad; stencil; Head UAV; PPLL UAV; lighting; shadows; render target]
● K frontmost fragments: full shading, sorting and manual blending
● Tail fragments: cheap shading, no sorting and manual blending

Deferred Rendering Shading LOD Optimization
●Deferred approach allows a reduction in shading cost (“Shader LOD”)
  ● Only sort and shade K frontmost fragments at high quality
  ● “Simple” shading and out-of-order rendering on tail fragments
  ● Single-tap shadowing on tail fragments
●Very little quality difference compared to full shading
  ● But much better performance!

Technique                    Cost
Out of order, no shading     1.31 ms
Out of order, shading        2.80 ms
Forward PPLL, shading        3.38 ms
Deferred PPLL, shading       2.13 ms

Fur model with ~130,000 fur strands, running on AMD Radeon 7970 @ 1080p

Callouts on the table:
● Shading cost is ~1.5 ms
● PPLL cost is ~0.58 ms
● Full quality shading forced on for all fragments
● Shading LOD
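The two cost callouts appear to be simple differences between rows of the timing table. The short Python check below reproduces that arithmetic from the quoted timings; the pairing of callouts to rows is an inference from the numbers, not something the slides state explicitly.

```python
# Timings quoted in the table, in milliseconds.
costs_ms = {
    "Out of order, no shading": 1.31,
    "Out of order, shading": 2.80,
    "Forward PPLL, shading": 3.38,
    "Deferred PPLL, shading": 2.13,
}

# Shading cost: same blending path, with and without shading.
shading_ms = costs_ms["Out of order, shading"] - costs_ms["Out of order, no shading"]

# PPLL cost: sorted linked-list path versus out-of-order blending, both fully shaded.
ppll_ms = costs_ms["Forward PPLL, shading"] - costs_ms["Out of order, shading"]

# Saving from deferring shading and applying shading LOD.
deferred_saving_ms = costs_ms["Forward PPLL, shading"] - costs_ms["Deferred PPLL, shading"]

print(round(shading_ms, 2), round(ppll_ms, 2), round(deferred_saving_ms, 2))
```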

Geometry Optimizations
●A great portion of time was spent in the GPU front-end
  ● 920,000 line segments for fur model
●Expansion from line segments to triangles was done in GS and then VS with Draw()
  ● Each segment would create a quad (two triangles) with 6 vertices

Draw() method
[Figure: line segments and expanded quads]
Triangle list = { ( 0, 1, 2 ), ( 3, 4, 5 ), ( 6, 7, 8 ), ( 9, 10, 11 ), ( … ) };

DrawIndexed() method
[Figure: line segments and expanded quads]
Indexed triangle list = { ( 0, 1, 2 ), ( 2, 1, 3 ), ( 2, 3, 4 ), ( 4, 3, 5 ), ( … ) };

●Offline creation of index buffer plus DrawIndexed() maximizes post vertex cache use!
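The indexed triangle list shown for DrawIndexed() follows a regular pattern, so the index buffer can be generated offline. Below is a minimal Python sketch of that generation for a single strand, assuming each strand vertex is expanded into two consecutive quad-edge vertices; the function names and the `base_vertex` parameter are hypothetical.

```python
def build_strand_indices(num_segments, base_vertex=0):
    """Indices for one strand whose line segments are expanded into quads.

    Vertices 2*s and 2*s + 1 are assumed to be the two edge vertices at
    strand point s, so adjacent quads share an edge.
    """
    indices = []
    for s in range(num_segments):
        v = base_vertex + 2 * s
        indices += [v, v + 1, v + 2,      # first triangle of the quad
                    v + 2, v + 1, v + 3]  # second triangle, sharing two vertices
    return indices


def vertices_shaded_draw(num_segments):
    """Non-indexed Draw(): six vertices are processed per segment."""
    return 6 * num_segments


def vertices_unique_indexed(num_segments):
    """DrawIndexed(): two unique vertices per strand point."""
    return 2 * (num_segments + 1)
```

For a strand of two segments this yields `[0, 1, 2, 2, 1, 3, 2, 3, 4, 4, 3, 5]`, matching the list on the slide, and the number of unique vertices drops from 12 to 6, which is what lets the post-transform vertex cache help.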

Distance-based LOD System Optimization
●Input line segments have a random order
●Just render fewer (but thicker) fragments when far away!
●Needs shading adjustments to ensure smooth quality transitions
●Increase alpha threshold for fragment inclusion when far away
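One way to read "fewer but thicker" is as a pair of distance-driven factors: the share of strands drawn, and a thickness multiplier that compensates for the missing strands. The Python sketch below illustrates that idea only; the linear falloff, the `near`, `far` and `min_fraction` parameters and the inverse thickness rule are all assumptions, not values from the talk.

```python
def strand_lod(distance, near=5.0, far=50.0, min_fraction=0.25):
    """Return (fraction of strands to draw, thickness multiplier)."""
    t = min(max((distance - near) / (far - near), 0.0), 1.0)
    fraction = 1.0 - t * (1.0 - min_fraction)  # fewer strands when far away
    thickness_scale = 1.0 / fraction           # thicker strands keep coverage roughly constant
    return fraction, thickness_scale


def strands_to_draw(total_strands, fraction):
    """With randomly ordered input, drawing the first N strands is a uniform subsample."""
    return max(1, int(total_strands * fraction))
```

The random ordering of the input segments is what makes the simple "draw the first N" rule reasonable: any prefix of the buffer covers the model evenly.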

Other Optimizations
●PPLL Head UAV uses a RWTexture2D instead of a Buffer
  ● Results in more efficient caching for UAV accesses
●Avoid GPR indexing for sorting
  ● Sorting K frontmost fragments required an array of General Purpose Registers with random indexing into it
  ● Used an ALU-based indexing approach to improve performance
●TO DO: compute shader simulation optimizations
  ● Currently a set of multiple compute shaders
  ● Looking at combining some of these, optimizing shaders and output formats

Per-Pixel Linked Lists UAV Memory Considerations
●How much memory is needed?
  ● Guesstimate for a given usage model
  ● Max (hair pixels x average overdraw) fragments
●What happens when I run out?
  ● Missing fragments
●What can be done about it?
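The sizing rule above (hair pixels times average overdraw) turns into a byte count once a node size is fixed. The Python sketch below uses the 12-byte node size quoted later in the talk; the 4-byte head entry and the example inputs are assumptions chosen only to show the arithmetic.

```python
PPLL_NODE_BYTES = 12   # depth, data, next
HEAD_ENTRY_BYTES = 4   # assumption: one 32-bit head pointer per pixel


def ppll_pool_bytes(hair_pixels, average_overdraw):
    """Node pool size for the guesstimated worst case."""
    return hair_pixels * average_overdraw * PPLL_NODE_BYTES


def head_bytes(width, height):
    """Head UAV size: one entry per screen pixel."""
    return width * height * HEAD_ENTRY_BYTES


def fragments_dropped(fragments_rendered, pool_nodes):
    """Fragments that go missing when the pool is smaller than the demand."""
    return max(0, fragments_rendered - pool_nodes)


# Example inputs (illustrative): 300k hair pixels with an average overdraw of 12.
print(ppll_pool_bytes(300_000, 12), head_bytes(1920, 1080))
```

The pool has to be sized for a guessed worst case, and any frame that exceeds the guess silently loses fragments.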

k-Buffer in Memory
[Diagram: PP Linked-List (PPLL) node pool holding all fragments ("How big?") compared with a k-buffer stored as a fixed-size array of k entries per pixel ("Simple Memory Bound")]

The Front k
Approximation to avoid massive sorting
●Only sort the front k fragments per-pixel
●Blend the rest out-of-order

If deferring for shader LOD … also
● Full quality shade on front k
● Cheap shade on rest

[Figure: depth complexity, 20 frags/pixel (ave), red = over 100]
k is 4, 8, 16

k-Buffer
Can’t know front k until all fragments processed

For each fragment in each pixel:
[Diagram: a new fragment is compared against the k-buffer entry at the index of furthest; the resulting tail fragment is blended into the tail color]

If new fragment in k:
1. Swap with furthest
2. Find new furthest
3. Blend with tail

If not in k:
1. Blend with tail
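The per-fragment update can be sketched for a single pixel as follows. This is a single-threaded Python illustration of the steps on the slides, not the shader implementation: the class name, the scalar colour, the premultiplied tail accumulation and the `resolve` step are assumptions added to make the example complete.

```python
class KBufferPixel:
    """Front-k buffer plus an out-of-order blended tail for one pixel."""

    def __init__(self, k):
        self.k = k
        self.entries = []      # up to k unsorted (depth, color, alpha) fragments
        self.furthest = -1     # index of the furthest entry
        self.tail_color = 0.0  # premultiplied colour of blended tail fragments
        self.tail_alpha = 0.0  # accumulated tail coverage

    def _find_furthest(self):
        self.furthest = max(range(len(self.entries)),
                            key=lambda i: self.entries[i][0])

    def _blend_tail(self, frag):
        _, color, alpha = frag
        self.tail_color = color * alpha + self.tail_color * (1.0 - alpha)
        self.tail_alpha = alpha + self.tail_alpha * (1.0 - alpha)

    def insert(self, depth, color, alpha):
        frag = (depth, color, alpha)
        if len(self.entries) < self.k:           # the first k simply fill the buffer
            self.entries.append(frag)
            self._find_furthest()
            return
        if depth < self.entries[self.furthest][0]:
            evicted = self.entries[self.furthest]
            self.entries[self.furthest] = frag   # 1. swap with furthest
            self._find_furthest()                # 2. find new furthest
            frag = evicted
        self._blend_tail(frag)                   # 3. blend with tail

    def front_depths(self):
        return sorted(e[0] for e in self.entries)

    def resolve(self, background=0.0):
        """Tail over background, then the sorted front k blended far to near."""
        result = self.tail_color + background * (1.0 - self.tail_alpha)
        for _, color, alpha in sorted(self.entries, reverse=True):
            result = color * alpha + result * (1.0 - alpha)
        return result
```

Only the k entries are ever sorted; every other fragment costs one comparison and one blend, regardless of how deep the pixel is.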

From PPLL to k-Buffer
For each pixel:
  Write frags to mem
For each fragment in each pixel:
  read fragment from mem
  update k-buffer (reg)
  blend tail fragment (reg)
Read k-buffer from mem
Sort and blend k-buffer (reg)
update k-buffer (mem)
blend tail fragment

k-Buffer
[Diagram: k-buffer entries laid out across the screen width, K=4, 8, 16 per pixel]
● 8 bytes each (depth and data)
● PPLL nodes were 12 bytes (depth, data, next)
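With a fixed k and 8-byte entries, the k-buffer footprint is bounded by screen size alone. The Python sketch below works through that bound, assuming the k-buffer covers every screen pixel and adding the 32-bit per-pixel mutex/count/index word described in the talk; the function names are hypothetical.

```python
KBUFFER_ENTRY_BYTES = 8   # depth and data
MUTEX_WORD_BYTES = 4      # one 32-bit mutex/count/index word per pixel


def kbuffer_bytes(width, height, k):
    """Fixed-size k-buffer covering the whole screen."""
    return width * height * k * KBUFFER_ENTRY_BYTES


def mutex_buffer_bytes(width, height):
    return width * height * MUTEX_WORD_BYTES


for k in (4, 8, 16):
    print(k, kbuffer_bytes(1920, 1080, k) + mutex_buffer_bytes(1920, 1080))
```

Unlike the linked-list pool, this number does not depend on how much hair is on screen, so it cannot overflow, but it is paid in full even when few pixels contain hair.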

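PPLL: 2nd Pass
[Diagram: new fragment, index of furthest, tail fragment blended into the tail color; the k-buffer is held in registers]

k-Buffer in Memory: 1st Pass
[Diagram: new fragment, index of furthest, tail fragment blended into the tail color by the blend unit; the k-buffer is held in memory alongside a per-pixel mutex, index, …]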

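Mutex/Count/Index Buffer
[Diagram: one 32-bit word per pixel across the screen width, with fields labelled mutex bit, initialized bit, max index (4 bits) and count (remainder); "High bit" marks the top of the word]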

Spinlock Mutex
    [allow_uav_condition]
    for (; i < MAX_LOOP_COUNT && !bStop; ++i)
    {
        uint oldID;
        InterlockedExchange(tRWMutex[vScreenAddress], RESERVED, oldID);
        if ((oldID & RESERVED) != RESERVED)
        {
            [[ … Do work ]]
            DeviceMemoryBarrier();                                   // Paranoia
            tRWMutex[vScreenAddress] = (new_max_id << 28) + INITED;  // Release
            bStop = true;
        } // end mutex check
    } // end spinlock loop
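The acquire, work, release pattern of the spinlock can be tried out on the CPU. The Python sketch below emulates the exchange-based protocol with a small helper class; `AtomicWord`, `try_update`, the bit chosen for `RESERVED` and the loop bound are all hypothetical, and the talk does not say what the shader does when the attempt limit is reached.

```python
import threading

RESERVED = 1 << 0      # hypothetical bit assignment for "locked"
MAX_LOOP_COUNT = 64    # hypothetical attempt limit


class AtomicWord:
    """Stand-in for one 32-bit UAV element with an atomic exchange."""

    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()

    def exchange(self, new_value):
        with self._guard:
            old, self._value = self._value, new_value
            return old

    def store(self, value):
        with self._guard:
            self._value = value

    def load(self):
        with self._guard:
            return self._value


def try_update(mutex, work, released_value=0):
    """Bounded spinlock: returns the attempt that succeeded, or 0 on failure."""
    for attempt in range(1, MAX_LOOP_COUNT + 1):
        old = mutex.exchange(RESERVED)       # try to take the lock
        if (old & RESERVED) != RESERVED:     # it was free, so we now own it
            work()                           # critical section
            mutex.store(released_value)      # release
            return attempt
    return 0                                 # gave up after MAX_LOOP_COUNT attempts
```

Writing `RESERVED` on every attempt is harmless while someone else holds the lock, since the word already has that bit set; only the owner's final store clears it.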

Find New Max Depth
    uint new_max_depth = u_inDepth;
    [unroll]
    for (int t = 0; t < KBUFFER_SIZE; t++)
    {
        uint element_depth = DEPTH(vScreenAddress, t);
        if (element_depth > new_max_depth)
        {
            new_max_depth = element_depth;
            new_max_id = t;
        }
    }

Generally more memory traffic than PPLL

Initialization: The first k
Options
●Clear k-buffer fullscreen (0,1)
●Clear k-buffer stenciled, 3rd pass
●Clear on first fragment
●Count
[Diagram: per-pixel word with high bit, max index (4 bits) and count (remainder)]

The first k
    InterlockedAdd(tRWMutex[vScreenAddress], 1, oldCount);

    [allow_uav_condition]
    if (oldCount < KBUFFER_SIZE)
    {
        DATA(vScreenAddress, oldCount) = u_inData;
        DEPTH(vScreenAddress, oldCount) = u_inDepth;
        return uint2(u_outDepth, u_outData);
    }
[Diagram: per-pixel word with high bit, max index (4 bits) and count (remainder)]

Models
[Figures: 2k polygons; ~20k hairs; ~130k hairs]

Stats
● 2-3.5 M fragments
● 200-300k pixels

Shading
● One point light & shadow
● 2 shifted specular lobes

Depth Complexity
[Figure legend: grey = 1, blue = 8, green = 50, red = 100+]

Contention
Max attempts per pixel, k=4
[Figure legend: dark blue = 1, aqua <= 4, bright aqua <= 8]

Performance
Time ratio to out-of-order blending
●Forward PPLL: 1.02 to 1.4
●Forward k-Buffer: 1.2 to 1.4
●Deferred PPLL: 0.7 to 0.9
●Deferred k-Buffer: 0.9 to 1.6

K-Buffer in Memory
●Simple memory bound
●Can be less memory
●Usually slower
  ● Increased memory traffic

Simulation

Hair Simulation
●Length Constraint
●Local Constraint
●Global Constraint
●Model Transform
●Collision Shapes
●External Forces (wind, gravity, etc.)

Fur Simulation
●Length Constraint
●Local Constraint
●Global Constraint
●Model Transform
●Collision Shapes
●External Forces (wind, gravity, etc.)

Grass Simulation
●Length Constraint
●Local Constraint (1D)
●Global Constraint
●Model Transform
●Collision Shapes
●External Forces (wind, gravity, etc.)

Constraint Method (iterative)
●Used for length, local and global constraints
●Length is most difficult to converge
  ● particularly under large movement
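To see why length constraints converge slowly under large movement, it helps to run an iterative solver on a badly stretched strand. The following Python sketch uses a generic position-based projection with a pinned root; it is an illustration of the iterative idea, not the TressFX simulation code, and the function names and weighting scheme are assumptions.

```python
import math


def segment_lengths(points):
    return [math.dist(points[i], points[i + 1]) for i in range(len(points) - 1)]


def enforce_length_constraints(points, rest_lengths, iterations=10):
    """Iteratively pull each segment back to its rest length.

    points[0] is the root and stays pinned; every other vertex shares the
    correction equally with its neighbour.
    """
    pts = [list(p) for p in points]
    for _ in range(iterations):
        for i in range(len(pts) - 1):
            a, b = pts[i], pts[i + 1]
            delta = [b[j] - a[j] for j in range(3)]
            dist = math.sqrt(sum(d * d for d in delta))
            if dist == 0.0:
                continue                      # degenerate segment: no direction to fix along
            stretch = (dist - rest_lengths[i]) / dist
            w_a = 0.0 if i == 0 else 0.5      # root does not move
            w_b = 1.0 - w_a
            for j in range(3):
                a[j] += w_a * stretch * delta[j]
                b[j] -= w_b * stretch * delta[j]
    return pts


# A strand stretched to twice its rest length, as after a large movement.
stretched = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (4.0, 0.0, 0.0), (6.0, 0.0, 0.0)]
rest = [1.0, 1.0, 1.0]
one_pass = enforce_length_constraints(stretched, rest, iterations=1)
many_passes = enforce_length_constraints(stretched, rest, iterations=50)
```

Each projection fixes one segment exactly but disturbs its neighbours, so a single pass leaves visible stretch and many passes are needed when the strand starts far from its rest shape.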

Tridiagonal Matrix Formulation
● Direct solve for length constraint
● Almost zero stretch
● Limited to smaller time steps (stability)
● Still cheap
● Leverages matrix structure of strands
● Two sweeps of strand

Reference: “Tridiagonal Matrix Formulation for Inextensible Hair Strand Simulation”, VRIPHYS, 2013
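A tridiagonal system can be solved directly with one forward sweep and one backward sweep, which is presumably what "two sweeps of strand" refers to. The Python sketch below is the standard Thomas algorithm for a generic tridiagonal system; how the cited paper builds the matrix and right-hand side from the strand is not reproduced here, and the argument layout is an assumption.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system with the Thomas algorithm.

    Row i reads: lower[i] * x[i-1] + diag[i] * x[i] + upper[i] * x[i+1] = rhs[i].
    lower[0] and upper[-1] are ignored. No pivoting is done, so the matrix is
    assumed to be well conditioned (for example, diagonally dominant).
    """
    n = len(diag)
    c = [0.0] * n   # modified super-diagonal
    d = [0.0] * n   # modified right-hand side

    # Sweep 1: forward elimination from the first row to the last.
    c[0] = upper[0] / diag[0] if n > 1 else 0.0
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom

    # Sweep 2: back substitution from the last row to the first.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


# Small example: a 3x3 system whose exact solution is [1, 1, 1].
solution = solve_tridiagonal([0.0, -1.0, -1.0], [2.0, 2.0, 2.0], [-1.0, -1.0, 0.0],
                             [1.0, 0.0, 1.0])
```

The cost is linear in the number of unknowns, which fits the "still cheap" claim for a direct solve along a strand.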

Summary
●Next-gen look is possible now!
●Deferred Rendering for shading LOD is fastest
●k-buffer in memory is an option for memory-constrained situations
●High-quality grass and fur simulation with compute

Upcoming TressFX 2 SDK sample update with fur scenario at http://developer.amd.com/tools-and-sdks/graphics-development/amd-radeon-sdk/

Questions?

Extras

Isoline Tessellation for hair/fur? 1/2
●Isoline tessellation has two tess factors
  ● First is line density (lines per invocation)
  ● Second is line detail (segments per line)
●In theory provides easy LOD system
  ● Variable line density and detail by increasing both tessellation factors based on distance
[Figure: Tess = (1,1), Tess = (2,1), Tess = (2,2), Tess = (2,3), Tess = (3,3)]

Isoline Tessellation for hair/fur? 2/2
●In practice isoline tessellation is not cost effective for this scenario
●Lines are always 1-pixel thick
  ● Need GS to extrude them into triangles for smooth edges
  ● Major impact on performance!
  ● Alternative is to enable MSAA
  ● Most engines are deferred so this causes a large performance impact
  ● No extrusion for smoothing edges and no MSAA = poor quality!
●Bottom line: a pure Vertex Shader solution is faster
  ● LOD benefit is easily done in VS (more on this later)
  ● Curvature is rarely a problem (dependent on vertices/strands at authoring time)

AA, Self-shadowing and Transparency
[Figure panels: Basic Rendering; Antialiasing; Antialiasing + Self Shadowing; Antialiasing + Self Shadowing + Transparency]
