dx11 performance gems

58

Upload: dee

Post on 26-Feb-2016

46 views

Category:

Documents


2 download

DESCRIPTION

DX11 Performance Gems. (Or: “DX11 – unpacking the box”) Jon Jansen - Developer Technology Engineer, NVIDIA. Topics Covered. Case study: Opacity Mapping Using tessellation to accelerate lighting effects Accelerating up-sampling with GatherRed () - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: DX11 Performance Gems
Page 2: DX11 Performance Gems

(Or: “DX11 – unpacking the box”)

Jon Jansen - Developer Technology Engineer, NVIDIA

DX11 Performance Gems

Page 3: DX11 Performance Gems

DX11 Performance Gems

Topics CoveredCase study: Opacity Mapping– Using tessellation to accelerate

lighting effects– Accelerating up-sampling with

GatherRed()– Playing nice with AA using

SV_SampleIndex– Read-only depth for soft particles

Page 4: DX11 Performance Gems

DX11 Performance Gems

Topics Covered (cont)Deferred Contexts

– How DX11 can help you pump the API harder than you thought possible

– Much viscera - very gory!

Page 5: DX11 Performance Gems

Not Covered‘Aesthetic’ tessellation

DirectCompute

!!! Great talks on these topics coming up !!!

DX11 Performance Gems

Page 6: DX11 Performance Gems

DEFERRED CONTEXTS

DX11 Performance Gems

Page 7: DX11 Performance Gems

Deferred Contexts• What are your options when your API

submission thread is a bottleneck?• What if submission could be done on multiple

threads, to take advantage of multi-core?– This is what Deferred Contexts solve in DX11

DX11 Performance Gems

Page 8: DX11 Performance Gems

• So why not just submit directly to an API from multiple threads and be done?

Thread 2:

Deferred Contexts

DX11 Performance Gems

D3D:

Thread 1:

Page 9: DX11 Performance Gems

• So why not just submit directly to an API from multiple threads and be done?

Thread 2:

Deferred Contexts

DX11 Performance Gems

D3D:

Thread 1:

Page 10: DX11 Performance Gems

• So why not just submit directly to an API from multiple threads and be done?

Thread 2:

Deferred Contexts

DX11 Performance Gems

D3D:

Thread 1:

Page 11: DX11 Performance Gems

• So why not just submit directly to an API from multiple threads and be done?

Thread 2:

Deferred Contexts

DX11 Performance Gems

D3D:

Thread 1:

Page 12: DX11 Performance Gems

• So why not just submit directly to an API from multiple threads and be done?

Thread 2:

Deferred Contexts

DX11 Performance Gems

D3D:

Thread 1:

WHOOPS!!

Page 13: DX11 Performance Gems

Deferred Contexts• A Deferred Context is a device-like interface

for building command-lists

DX11 Performance Gems

// Creation is very straightforwardID3D11DeviceContext* pDC = NULL;hr = pD3DDevice->CreateDeferredContext(0,&pDC);

Page 14: DX11 Performance Gems

Deferred Contexts• DX11 uses the same ID3D11DeviceContext

interface for ‘immediate’ API calls• Immediate context is the only way to finally

submit work to the GPU• Access it via ID3D11Device::GetImmediateContext()• ID3D11Device has no submission API

DX11 Performance Gems

Page 15: DX11 Performance Gems

DX11 Performance Gems

Thread 2:

DC1:

DC2:

Thread 1:

D3DIM

M D

C:

Render submission calls

Render submission calls

Page 16: DX11 Performance Gems

DX11 Performance Gems

Thread 2:

DC1:

DC2:

FinishCommandList()

FinishCommandList()

Thread 1:

D3DIM

M D

C:

Page 17: DX11 Performance Gems

DX11 Performance Gems

Thread 2:

DC1:

DC2:

FinishCommandList()

FinishCommandList()

Inter-thread sync, collate/order/buffer CL’s

Thread 1:

D3DIM

M D

C:

Page 18: DX11 Performance Gems

DX11 Performance Gems

Thread 2:

DC1:

DC2:

FinishCommandList()

FinishCommandList()

Thread 1:

D3DIM

M D

C:

Inter-thread sync, collate/order/buffer CL’s

Page 19: DX11 Performance Gems

DX11 Performance Gems

Thread 2:

DC1:

DC2:

FinishCommandList()

FinishCommandList()

Thread 1:

D3DIM

M D

C:

Inter-thread sync, collate/order/buffer CL’s

Start of new CL

Page 20: DX11 Performance Gems

...etc

DX11 Performance Gems

Thread 2:

DC1:

DC2:

FinishCommandList()

...etc

...etc

...etc

FinishCommandList()

FinishCommandList() FinishCommandList()

Inter-thread sync, collate/order/buffer CL’s

Thread 1:

D3DIM

M D

C:

Page 21: DX11 Performance Gems

...etc

DX11 Performance Gems

Thread 2:

DC1:

DC2:

‘RenderMain’ Thread

D3DFinishCommandList()

...etc

...etc

...etc

FinishCommandList()

FinishCommandList() FinishCommandList()

ExecuteCommandList()

Inter-thread sync, collate/order/buffer CL’s

Thread 1:

IMM

DC

:

Page 22: DX11 Performance Gems

...etc

DX11 Performance Gems

Thread 2:

DC1:

DC2:

‘RenderMain’ Thread

D3DFinishCommandList()

ExecuteCommandList()

...etc

...etc

...etc

FinishCommandList()

FinishCommandList() FinishCommandList()

ExecuteCommandList()

Inter-thread sync, collate/order/buffer CL’s

Thread 1:

IMM

DC

:

Page 23: DX11 Performance Gems

DX11 Performance Gems

...etcThread 2:

DC1:

DC2:

‘RenderMain’ Thread

D3DFinishCommandList()

ExecuteCommandList()

ExecuteCommandList()

ExecuteCommandList()

...etc

...etc

...etc

FinishCommandList()

FinishCommandList() FinishCommandList()

ExecuteCommandList()

Inter-thread sync, collate/order/buffer CL’s

Thread 1:

IMM

DC

:

Page 24: DX11 Performance Gems

Deferred Contexts• Flexible DX11 internals

– DX11 runtime has built-in implementation– BUT: the driver can take charge, and use its own

implementation• e.g. command lists could be built at a lower level,

moving more of the CPU work onto submission threads

DX11 Performance Gems

Page 25: DX11 Performance Gems

Deferred Contexts: Perf• Try to balance workload over contexts/threads

– but submission workloads are seldom predictable– granularity helps (if your submission threads are

able to pick up work dynamically)– if possible, do heavier submission workloads first– ~12 CL’s per core, ~1ms per CL is a good target

DX11 Performance Gems

Page 26: DX11 Performance Gems

Deferred Contexts: Perf• Target reasonable command list sizes

– think of # of draw calls in a command list much like # triangles in a draw call

– i.e. each list has overhead~equivalent to a few dozen API calls

DX11 Performance Gems

Page 27: DX11 Performance Gems

Deferred Contexts: Perf• Leave some free CPU time!

– having all threads busy can cause CPU saturation and prevent “server” thread from rendering*

– ‘busy’ includes busy-waits (i.e. polling)

*this is good general advice: never use more than N-1 CPU coresfor your game engine. Always leave one for the graphics driver

DX11 Performance Gems

Page 28: DX11 Performance Gems

Deferred Contexts: Perf• Mind your memory!

– each Map() call associates memory with the CL– releasing the CL is the only way to release the

memory– could get tight in a 2GB virtual address space!

DX11 Performance Gems

Page 29: DX11 Performance Gems

Deferred Contexts: wrapping up• DC’s + multi-core = pump the API HARD• Real-world specifics coming up in Dan’s talk

CALL TO ACTION: If you wish you could submit more batches and you’re not already using DC’s, then experiment!

DX11 Performance Gems

Page 30: DX11 Performance Gems

CASE STUDY: OPACITY MAPPING

DX11 Performance Gems

Page 31: DX11 Performance Gems

Case study: Opacity Mapping

DX11 Performance Gems

Page 32: DX11 Performance Gems

Case study: Opacity MappingGOALS:

– Plausible lighting for a game-typical particle system (16K largish translucent billboards)

– Self-shadowing (using opacity mapping)• Also receive shadows from opaque objects

– 3 light sources (all with shadows)

DX11 Performance Gems

Page 33: DX11 Performance Gems

Case study: Opacity Mapping

DX11 Performance Gems

+ opacity mapping*

=

*e.g. [Jansen & Bavoil, 2010]

Page 34: DX11 Performance Gems

Case study: Opacity Mapping• Brute force (per-pixel lighting/shadowing) is

not performant– 5 to 10 FPS on GTX 560 Ti* or HD 6950*

• Not surprising considering amount of overdraw...

DX11 Performance Gems

*1680x1050, 4xAA

Page 35: DX11 Performance Gems

Case study: Opacity Mapping

DX11 Performance Gems

Page 36: DX11 Performance Gems

Case study: Opacity Mapping– Vertex lighting? Faster, but shows significant delta

from ‘ground truth’ of per-pixel lighting...

DX11 Performance Gems

PS lighting VS lighting

5 to 10 FPS 60 to 65 FPS

Page 37: DX11 Performance Gems

Case study: Opacity Mapping• Use DX11 tessellation to calculate lighting at

an intermediate ‘sweet-spot’ rate in the DS• High-frequency components can remain at

per-pixel or per-sample rates, as required– opacity– visibility

DX11 Performance Gems

Page 38: DX11 Performance Gems

Case study: Opacity Mapping

DX11 Performance Gems

Surface placement

Light attenuationOpaque shadowsOpacity shadows

Visibility

VSrate

PSrate

samplerate

Texturing

PS lighting

Page 39: DX11 Performance Gems

Case study: Opacity MappingPS lighting DS lighting

DX11 Performance Gems

Surface placement

Light attenuationOpaque shadowsOpacity shadows

Visibility

DSrate

VSrate

PSrate

samplerate

VSrate

PSrate

samplerate

Texturing

Page 40: DX11 Performance Gems

Case study: Opacity Mapping

DX11 Performance Gems

DS lightingPS lighting (VS lighting)

5 to 10 FPS 60 to 65 FPS 40 to 45 FPS

Page 41: DX11 Performance Gems

Case study: Opacity Mapping• Adaptive tessellation gives best of both worlds

– VS-like calculation frequency– PS-like relationship with screen pixel frequency

• 1:15 works well in this case

• Applicable to any slowly-varying shading result– GI, other volumetric algos

DX11 Performance Gems

Page 42: DX11 Performance Gems

Case study: Opacity Mapping• Main bottleneck is fill-rate following tess-opt• So... render particles to low-res offscreen buffer*

– significant benefit, even with tess opt (1.2x to 1.5x for GTX 560 Ti / HD 6950)

– BUT: simple bilinear up-sampling from low-res can lead to artifacts at edges...

DX11 Performance Gems *[Cantlay, 2007]

Page 43: DX11 Performance Gems

DX11 Performance Gems

Case study: Opacity MappingGround truth (full res) Bilinear up-sample (half-res)

Page 44: DX11 Performance Gems

Case study: Opacity Mapping• Instead, we use nearest-depth up-sampling*

– conceptually similar to cross-bilateral filtering**– compares high-res depth with neighbouring low-

res depths– samples from closest matching neighbour at depth

discontinuities (bilinear otherwise)

DX11 Performance Gems

*[Bavoil, 2010] **[Eisemann & Durand, 2004] [Petschnigg et al, 2004]

Page 45: DX11 Performance Gems

Case study: Opacity Mapping

DX11 Performance Gems Near-Z

Far-Z

full-respixel

lo-resneighbours

Z10(@UV10)

ZFull(@UV)

if( abs(Z00-ZFull) < kDepthThreshold && abs(Z10-ZFull) < kDepthThreshold && abs(Z01-ZFull) < kDepthThreshold && abs(Z11-ZFull) < kDepthThreshold ){ return loResColTex.Sample(sBilin,UV);} else{ return loResColTex.Sample(sPoint,NearestUV);}

Z00(@UV00)

Page 46: DX11 Performance Gems

Case study: Opacity Mapping

DX11 Performance Gems Near-Z

Far-Z if( abs(Z00-ZFull) < kDepthThreshold && abs(Z10-ZFull) < kDepthThreshold && abs(Z01-ZFull) < kDepthThreshold && abs(Z11-ZFull) < kDepthThreshold ){ return loResColTex.Sample(sBilin,UV);} else{ return loResColTex.Sample(sPoint,NearestUV);}

NearestUV = UV10

Z10(@UV10)

ZFull(@UV)

Z00(@UV00)

Page 47: DX11 Performance Gems

DX11 Performance Gems

Case study: Opacity MappingGround truth (full res) Nearest-depth up-sample

Page 48: DX11 Performance Gems

Case study: Opacity Mapping• Use SM5 GatherRed() to efficiently fetch 2x2 low-res

depth neighbourhood in one go

DX11 Performance Gems

float4 zg = g_DepthTex.GatherRed(g_Sampler,UV);float z00 = zg.w; // w: floor(uv)float z10 = zg.z; // z: ceil(u),floor(v)float z01 = zg.x; // x: floor(u),ceil(v)float z11 = zg.y; // y: ceil(uv)

Page 49: DX11 Performance Gems

Case study: Opacity Mapping• Nearest-depth up-sampling plays nice with AA

when run per-sample– and surprisingly performant! (FPS hit < 5%)

DX11 Performance Gems

float4 UpsamplePS( VS_OUTPUT In,uint uSID :

SV_SampleIndex) : SV_Target

Page 50: DX11 Performance Gems

Case study: Opacity Mapping• Soft particles (depth-based alpha fade)

– requires read from scene depth– for < DX11, this used to mean...

• EITHER: sacrificing depth-test (along with any associated acceleration)

• OR: maintaining two depth surfaces (along with any copying required)

DX11 Performance Gems

Page 51: DX11 Performance Gems

DX11 solution:

Case study: Opacity Mapping

DX11 Performance Gems

depthtexture

‘traditional’DSV

depth textureSRVCreateShaderResourceView()

CreateDepthStencilView()

DX11 read-only DSV

CreateDepthStencilView()+ D3D11_DSV_READ_ONLY_DEPTH

NEW!!!in DX11

Page 52: DX11 Performance Gems

STEP 1: render opaque objects to depth texture

Case study: Opacity Mapping

DX11 Performance Gems

pDC->OMSetRenderTargets(...)

// Render opaque objects

depth textureSRV

DX11 read-only DSV

‘traditional’DSV

Page 53: DX11 Performance Gems

STEP 2: render soft particles with depth-test

Case study: Opacity Mapping

DX11 Performance Gems

depth textureSRV

DX11 read-only DSV

pDC->OMSetRenderTargets(...)pDC->PSSetShaderResources(...)// (Valid D3D state!)

// Render soft particles

‘traditional’DSV

Page 54: DX11 Performance Gems

Case study: Opacity Mapping‘Hard’ particles Soft particles

DX11 Performance Gems

Page 55: DX11 Performance Gems

Case study: wrapping up• 5x to 10x overall speedup• DX11 tessellation gave us most of it• But rendering at reduced-res alleviates fill-rate

and lets tessellation shine thru• GatherRed() and RO DSV also saved cycles

DX11 Performance Gems

Page 56: DX11 Performance Gems

Case study: wrapping up

CALL TO ACTION: Go light some particles!DX11 Performance Gems

Page 57: DX11 Performance Gems

End of tour!• Questions?

jjansen at nvidia dot com

DX11 Performance Gems

Page 58: DX11 Performance Gems

ReferencesBAVOIL, L. 2010. Modern Real-Time Rendering Techniques. From Future Game On

Conference, September 2010.CANTLAY, I. 2007. High-speed, off-screen particles. In GPU Gems 3. 513-528EISEMANN, E., & DURAND, F. 2004. Flash Photography Enhancement via Intrinsic

Relighting. In ACM Trans. Graph. (SIGGRAPH) 23, 3, 673–678.JANSEN, J., AND BAVOIL, L. 2010. Fourier opacity mapping. In Proceedings of the

Symposium on Interactive 3D Graphics and Games, 165–172.PETSCHNIGG, G., AGRAWALA, M., HOPPE, H., SZELISKI, R., COHEN, M., & TOYAMA, K.

2004. Digital photography with flash and no-flash image pairs. In ACM Trans. Graph. (SIGGRAPH) 23, 3, 664–672.

DX11 Performance Gems