Technology Behind AMD’s “Leo Demo”

Jay McKee, MTS Engineer, AMD

Why Forward Rendering?

● Complex materials
● Multiple light types
● Supports hardware anti-aliasing
● Efficient memory usage
● Supports transparency
● BUT, previously could not support a large number of lights

Forward+ Rendering

● Modified forward renderer. Add compute shader for light culling. Modify main light loop.

● Lighting and shading done in the same place, all information is preserved.

Forward+ Rendering (continued)

● No limits on parameters for lights and materials
  ● Omni
  ● Spot
  ● Cinematic (arbitrary falloffs, barndoor)
  ● BRDF per material instance

● Simple design, concentrate on rendering, not engine maintenance.

Important DX11 features

● Compute Shaders
● UAV support.

Compute Shaders

● In Leo demo we use two compute shaders:
  ● One for culling lights.
  ● Another for spawning Virtual Point Lights (VPLs) for indirect lighting.

● Culling 3,072 lights takes 1.7 ms on a high-end GPU.

UAVs

● Array(s) of scene light information.
● Array of u32 light indices for storing start/end lights per-tile.
● Array of material instance data

Algorithm summary

● Depth Pre-Pass
● Light Culling
  ● Screen divided into tiles. Launch compute shader per tile.
  ● Light info such as position, radius, direction, length passed to light culling compute shader.
  ● Light culling shader projects light bounds to screen-space tiles. Uses scene depth from z pre-pass for z testing against light volumes.
  ● Outputs to UAV describing per-tile light list start/end along with a large UAV of u32 array of light indices.
  ● Output UAVs are passed to main light shaders for looping through lights per-pixel.

Algorithm summary continued

● Render scene materials
  ● Base light accumulation function
    ● Use screen x, y location to determine tileID
    ● From tileID, get light start and end indices
    ● From start index to end index, loop
    ● Entry is index into light array.
    ● Accumulate light hitting pixel
    ● Returns total direct and indirect light hitting pixel.

Algorithm summary continued

● Material shader
  ● Decides what to do with total incoming light
  ● Passed into material’s BRDF for example
  ● Uses light accumulation building blocks

● Env. lighting, base light accumulation, BRDF, etc. are put together for final pixel color.

Light Culling Shader Details (1/3)

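// 1. prepare
float4 frustum[4];
float minZ, maxZ;
{
    // build the side planes of this tile's frustum
    ConstructFrustum( frustum );

    // per-thread reduction over the depth samples this thread covers
    minZ = thread_REDUCE(MIN, depth );
    maxZ = thread_REDUCE(MAX, depth );

    // reduction across the work group, result kept in LDS
    ldsMinZ = SIMD_REDUCE(MIN, minZ );
    ldsMaxZ = SIMD_REDUCE(MAX, maxZ );

    minZ = ldsMinZ;
    maxZ = ldsMaxZ;
}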
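The "prepare" step above reduces the depth pre-pass output to one minimum and one maximum depth per tile. As a rough CPU-side illustration (not the demo's compute shader), the same reduction can be sketched in Python. The function name `tile_depth_bounds`, the row-major depth list and the tile size used in the example are assumptions made for this sketch only.

```python
def tile_depth_bounds(depth, width, height, tile_res):
    """Return a list of (min_z, max_z) per tile for a row-major depth buffer.

    Sequential stand-in for the per-thread and per-work-group reductions
    in the slide; the names here are hypothetical.
    """
    num_tiles_x = (width + tile_res - 1) // tile_res
    num_tiles_y = (height + tile_res - 1) // tile_res
    bounds = [(float("inf"), float("-inf"))] * (num_tiles_x * num_tiles_y)
    for y in range(height):
        for x in range(width):
            z = depth[y * width + x]
            tile = (y // tile_res) * num_tiles_x + (x // tile_res)
            lo, hi = bounds[tile]
            bounds[tile] = (min(lo, z), max(hi, z))
    return bounds


# 4x2 depth buffer split into two 2x2 tiles
example_depth = [1, 2, 5, 6,
                 3, 4, 7, 8]
example_bounds = tile_depth_bounds(example_depth, width=4, height=2, tile_res=2)
```

On the GPU each tile's work group performs this reduction in parallel, which is why the slide keeps the result in LDS.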

Light Culling Shader Details (2/3)

__local u32 ldsNLights = 0;
__local u32 ldsLightBuffer[MAX];

// 2. overlap check, accumulate in LDS
for(int i=threadIdx; i<nLights; i+=WG_SIZE)
{
    Light light = fetchAndTransform( lightBuffer[ i ] );
    if( overlaps( light, frustum ) && overlaps ( light, minZ, maxZ ) )
    {
        AtomicAppend( ldsLightBuffer, i );
    }
}
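The overlap check above combines a tile-frustum test with a depth-range test. A minimal Python sketch of that idea follows. It makes a simplifying assumption that is not in the slides: each light is already projected to a screen-space circle, so the frustum test becomes a circle-versus-rectangle test. The tuple layout and the function names are hypothetical.

```python
def overlaps_depth(view_z, radius, min_z, max_z):
    """True if the light's depth extent touches the tile's [min_z, max_z]."""
    return view_z + radius >= min_z and view_z - radius <= max_z


def overlaps_tile(cx, cy, screen_radius, tile_x, tile_y, tile_res):
    """Circle vs. axis-aligned tile rectangle, in pixels."""
    x0, y0 = tile_x * tile_res, tile_y * tile_res
    x1, y1 = x0 + tile_res, y0 + tile_res
    # closest point of the rectangle to the circle centre
    nx = min(max(cx, x0), x1)
    ny = min(max(cy, y0), y1)
    dx, dy = cx - nx, cy - ny
    return dx * dx + dy * dy <= screen_radius * screen_radius


def cull_tile(lights, tile_x, tile_y, tile_res, min_z, max_z):
    """Return indices of lights that pass both tests for one tile.

    Each light is (screen_x, screen_y, screen_radius, view_z, radius).
    """
    return [
        i
        for i, (cx, cy, sr, vz, r) in enumerate(lights)
        if overlaps_tile(cx, cy, sr, tile_x, tile_y, tile_res)
        and overlaps_depth(vz, r, min_z, max_z)
    ]


example_lights = [
    (8, 8, 4, 3.0, 1.0),      # inside the tile, inside the depth range
    (100, 100, 4, 3.0, 1.0),  # far from the tile on screen
    (8, 8, 4, 10.0, 1.0),     # on the tile, but behind the depth range
    (20, 8, 5, 5.5, 1.0),     # touches the tile edge and the depth range
]
visible = cull_tile(example_lights, tile_x=0, tile_y=0, tile_res=16,
                    min_z=1.0, max_z=5.0)
```

The list comprehension plays the role of the strided loop plus `AtomicAppend`: on the GPU many threads test different lights at once and append survivors to the tile's LDS list.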

Light Culling Shader Details (3/3)

// 3. export to global
__local u32 ldsOffset;
if( threadIdx == 0 )
{
    // reserve space for this tile's lights in the global index buffer
    ldsOffset = AtomAdd( ldsNLights );
    globalLightStart[tileIdx] = ldsOffset;
    globalLightEnd[tileIdx] = ldsOffset + ldsNLights;
}

// copy the tile's light list from LDS to the reserved global range
for(int i=threadIdx; i< ldsNLights; i+=WG_SIZE)
{
    int dstIdx = ldsOffset + i;
    globalLightIndexBuffer[dstIdx] = ldsLightBuffer[i];
}
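The export step packs every tile's light list into one flat index buffer and records where each tile's range starts and ends. The Python sketch below illustrates that layout under the assumption that tiles are processed one after another, so a running length stands in for the atomic add in the shader. The function name `export_tiles` is hypothetical.

```python
def export_tiles(per_tile_lists):
    """Pack per-tile light index lists into flat start/end/index arrays."""
    light_start, light_end, index_buffer = [], [], []
    for tile_lights in per_tile_lists:
        offset = len(index_buffer)  # sequential stand-in for the atomic add
        light_start.append(offset)
        light_end.append(offset + len(tile_lights))
        index_buffer.extend(tile_lights)
    return light_start, light_end, index_buffer


# three tiles: two lights, no lights, one light
starts, ends, indices = export_tiles([[0, 3], [], [2]])
```

An empty tile gets equal start and end values, so the per-pixel light loop for that tile runs zero iterations.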

Light Accumulation Pseudo-code

// BaseLighting.inc
// THIS INC FILE IS ALL THE COMMON LIGHTING CODE

StructuredBuffer<float4> LightParams : register(u0);
StructuredBuffer<uint> LowerBoundLights : register(u1);
StructuredBuffer<uint> UpperBoundLights : register(u2);
StructuredBuffer<int2> LightIndexBuffer : register(u3);

uint GetTileIndex(float2 screenPos)
{
    float tileRes = (float)m_tileRes;
    // number of tiles per row, rounded up
    uint numCellsX = (m_width + m_tileRes - 1)/m_tileRes;
    uint tileIdx = floor(screenPos.x/tileRes)+floor(screenPos.y/tileRes)*numCellsX;
    return tileIdx;
}
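To make the arithmetic in `GetTileIndex` concrete, here is the same calculation as a small Python sketch. The 1280-pixel width and 8-pixel tile size are illustrative assumptions, not values stated for the Leo demo.

```python
def get_tile_index(screen_x, screen_y, width, tile_res):
    """Map a pixel position to a row-major tile index."""
    num_cells_x = (width + tile_res - 1) // tile_res  # tiles per row, rounded up
    return int(screen_x // tile_res) + int(screen_y // tile_res) * num_cells_x


# 1280 / 8 = 160 tiles per row; pixel (17.5, 9.5) is in column 2, row 1
tile = get_tile_index(17.5, 9.5, width=1280, tile_res=8)  # 2 + 1 * 160 = 162
```

The round-up in `num_cells_x` matters when the width is not a multiple of the tile size: a width of 1282 gives 161 tiles per row, not 160.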

Light Accumulation (2):

StartHLSL BaseLightLoopBegin
// THIS IS A MACRO, INCLUDED IN MATERIAL SHADERS

    uint tileIdx = GetTileIndex( pixelScreenPos );
    uint startIdx = LowerBoundLights[tileIdx];
    uint endIdx = UpperBoundLights[tileIdx];

    [loop]
    for ( uint lightListIdx = startIdx; lightListIdx < endIdx; lightListIdx++ )
    {
        int lightIdx = LightIndexBuffer[lightListIdx];

        // Set common light parameters
        float ndotl = max(0, dot(normal, lightVec));

        float3 directLight = 0;
        float3 indirectLight = 0;

Light Accumulation (3):

        if( lightIdx >= numDirectLightsThisFrame )
        {
            CalculateIndirectLight(lightIdx, indirectLight);
        }
        else
        {
            if( IsConeLight( lightIdx ) )   // <<== Can add more light types here
            {
                CalculateDirectSpotlight(lightIdx, directLight);
            }
            else
            {
                CalculateDirectSpherelight(lightIdx, directLight);
            }
        }

        float3 incomingLight = (directLight + indirectLight)*ndotl;
        float shadowTerm = CalcShadow();

EndHLSL

StartHLSL BaseLightLoopEnd
    }
EndHLSL
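The light loop macro walks one tile's range of the index buffer and treats any index at or above the direct-light count as an indirect light (VPL). A minimal Python sketch of that control flow is shown below. It assumes, purely for illustration, that each light is a `(direction, intensity)` pair with a scalar intensity, and it omits falloff, light types and shadows; none of these simplifications come from the demo.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def accumulate_light(tile_idx, light_start, light_end, index_buffer,
                     lights, num_direct, normal):
    """Return (direct, indirect) light totals for one pixel in a tile."""
    direct = 0.0
    indirect = 0.0
    for list_idx in range(light_start[tile_idx], light_end[tile_idx]):
        light_idx = index_buffer[list_idx]      # entry is an index into the light array
        direction, intensity = lights[light_idx]
        ndotl = max(0.0, dot(normal, direction))
        if light_idx >= num_direct:
            indirect += intensity * ndotl       # VPLs are stored after the direct lights
        else:
            direct += intensity * ndotl
    return direct, indirect


example_lights = [
    ((0.0, 0.0, 1.0), 2.0),   # direct, facing the surface
    ((0.0, 1.0, 0.0), 5.0),   # direct, grazing: contributes nothing
    ((0.0, 0.0, 1.0), 0.5),   # indirect (index >= num_direct)
]
totals = accumulate_light(0, [0], [3], [0, 1, 2], example_lights,
                          num_direct=2, normal=(0.0, 0.0, 1.0))
```

In the slides the material shader receives the per-light terms inside the loop; this sketch only sums them to show how the start/end indices and the direct/indirect split are used.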

Material Shader Template:

#include "BaseLighting.inc"

float4 PS ( PSInput i ) : SV_TARGET
{
    float3 totalDiffuse = 0;
    float3 totalSpec = GetEnvLighting();

    $include BaseLightLoopBegin

        // unique material code goes here!! Light accumulation on the pixel for a given light
        // we have total incoming light and direct/indirect light components as well as material params and shadow term
        // use these building blocks to integrate lighting terms

        totalDiffuse += GetDiffuse(incomingLight);
        totalSpec += CalcPhong(incomingLight);

    $include BaseLightLoopEnd

    float3 finalColor = totalDiffuse + totalSpec;
    return float4( finalColor, 1 );
}

Debug Mode Demo

Benchmark

3k dynamic lights, compute-based deferred vs. Forward+

[Figure: bar chart of frame time in ms (axis 0 to 20) for Forward+(L), Forward+(H), Deferred(L) and Deferred(H), each bar split into prepass, light processing and final shading.]

Takahiro Harada, Jay McKee, Jason C. Yang, Forward+: Bringing Deferred Lighting to the Next Level, Eurographics Short Paper (2012)

Depth Pre-Pass Critical

● Pixel overdraw cripples this technique so depth pre-pass is required.

● Depth pre-pass is a good opportunity to use MRT to generate other full-screen data needed for post-fx and other render fx (optional).

Other important points

● XBOX 360 has good bandwidth so given limitations on forward rendering, deferred makes a lot of sense.

● However, ALU computation is growing at a faster rate than bandwidth. It is becoming more and more feasible to just do the calculations rather than read/write so much data.

● Dynamic branching penalties not nearly as bad as before. As an optimization, compute shader can sort by light-type for example to minimize penalties.

● All that "light management" CPU side code to decide which lights hit each object for setting constant registers can be ditched!
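The point above about sorting by light type can be illustrated with a short Python sketch. It assumes each light has a type tag and that a tile's index list can be reordered freely; the names `sort_tile_by_type` and `count_type_switches` are hypothetical and the demo's actual sorting scheme is not described in the slides.

```python
def sort_tile_by_type(tile_light_indices, light_types):
    """Group a tile's light indices by type (stable within each type)."""
    return sorted(tile_light_indices, key=lambda i: light_types[i])


def count_type_switches(tile_light_indices, light_types):
    """Count how often consecutive lights in the list differ in type."""
    types = [light_types[i] for i in tile_light_indices]
    return sum(1 for a, b in zip(types, types[1:]) if a != b)


light_types = ["spot", "sphere", "spot", "sphere"]
unsorted_tile = [0, 1, 2, 3]
sorted_tile = sort_tile_by_type(unsorted_tile, light_types)
```

With the lights grouped, pixels in the tile take the same branch for long runs of the loop, which is the situation in which dynamic branching is cheapest.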

Summary

● Modified forward renderer that handles scenes with 1000s of lights.

● Hardware anti-aliasing (MSAA) “automatic”
● Bandwidth friendly.
● Makes the most of the GPU's ALU power (which is growing faster than bandwidth)

Thanks!

Contact: [email protected]@[email protected]

Leo Demo website: http://developer.amd.com/samples/demos/pages/AMDRadeonHD7900SeriesGraphicsReal-TimeDemos.aspx

Eurographics 2012: 'Forward+: Bringing Deferred Lighting to the Next Level'


http://developer.amd.com/Resources/documentation/samples/demos/pages/AMDRadeonHD7900SeriesGraphicsReal-TimeDemos.aspx


