![Page 1: Avoiding Catastrophic Performance Loss](https://reader033.vdocuments.us/reader033/viewer/2022052400/559b31ea1a28ab3e0a8b4855/html5/thumbnails/1.jpg)
Avoiding Catastrophic Performance Loss
Detecting CPU-GPU Sync Points
John McDonald, NVIDIA Corporation
![Page 2: Avoiding Catastrophic Performance Loss](https://reader033.vdocuments.us/reader033/viewer/2022052400/559b31ea1a28ab3e0a8b4855/html5/thumbnails/2.jpg)
Topics
● D3D/GL Driver Models
● Types of Sync Points
● How bad are they, really?
● Detection
● Repair
● Summary
![Page 3: Avoiding Catastrophic Performance Loss](https://reader033.vdocuments.us/reader033/viewer/2022052400/559b31ea1a28ab3e0a8b4855/html5/thumbnails/3.jpg)
D3D Driver Model
● Multithreaded
● Client Thread (Your Application + D3D Runtime)
● Server Thread (D3D Runtime [DDI] + Driver)
● GPU (??)
● Remains in user-mode for as long as possible
![Page 4: Avoiding Catastrophic Performance Loss](https://reader033.vdocuments.us/reader033/viewer/2022052400/559b31ea1a28ab3e0a8b4855/html5/thumbnails/4.jpg)
GL Driver
● Very similar
● Client thread (your application + GL entry points)
● Server thread (shelved data + expansion)
● GPU
● Again, very little time in Kernel Mode
![Page 5: Avoiding Catastrophic Performance Loss](https://reader033.vdocuments.us/reader033/viewer/2022052400/559b31ea1a28ab3e0a8b4855/html5/thumbnails/5.jpg)
Example Healthy Timeline
[Timeline diagram. Labels: Client, Driver, Runtime, GPU, Runtime (DDI). Legend: thread separator, component separator, state change, action method (draw, clear, etc.), Present.]
![Page 6: Avoiding Catastrophic Performance Loss](https://reader033.vdocuments.us/reader033/viewer/2022052400/559b31ea1a28ab3e0a8b4855/html5/thumbnails/6.jpg)
Types of Sync Points
● Driver Sync Point
● CPU-GPU Sync Point
● Can be Server->GPU
● Can be Client->GPU
![Page 7: Avoiding Catastrophic Performance Loss](https://reader033.vdocuments.us/reader033/viewer/2022052400/559b31ea1a28ab3e0a8b4855/html5/thumbnails/7.jpg)
Driver Sync Point
● Major concern in OpenGL
● Minor concern in D3D
● Caused when the Client thread needs information available only to the Server thread
● In GL, any function that returns a value
● In D3D, certain State-getting operations
![Page 8: Avoiding Catastrophic Performance Loss](https://reader033.vdocuments.us/reader033/viewer/2022052400/559b31ea1a28ab3e0a8b4855/html5/thumbnails/8.jpg)
Healthy Timeline
[Timeline diagram. Labels: Client, Driver, Runtime, GPU, Runtime (DDI). Legend: thread separator, component separator, state change, action method (draw, clear, etc.), Present.]
![Page 9: Avoiding Catastrophic Performance Loss](https://reader033.vdocuments.us/reader033/viewer/2022052400/559b31ea1a28ab3e0a8b4855/html5/thumbnails/9.jpg)
Driver Sync Point
[Timeline diagram. Labels: Client, Driver, Runtime, GPU, Runtime (DDI). Legend: thread separator, component separator, state change, action method (draw, clear, etc.), Driver Sync Point.]
![Page 10: Avoiding Catastrophic Performance Loss](https://reader033.vdocuments.us/reader033/viewer/2022052400/559b31ea1a28ab3e0a8b4855/html5/thumbnails/10.jpg)
CPU-GPU Sync Point: Defined
When an application-side operation requires GPU work to finish prior to the completion of the provoking operation, a CPU-GPU Sync Point has been introduced.
![Page 11: Avoiding Catastrophic Performance Loss](https://reader033.vdocuments.us/reader033/viewer/2022052400/559b31ea1a28ab3e0a8b4855/html5/thumbnails/11.jpg)
CPU-GPU Sync Point (cont’d)
● Primary causes are buffer updates and obtaining query results
● GPU readback
● e.g. ReadPixels
● Locking the Backbuffer
● Complete list of entry points in Appendix
![Page 12: Avoiding Catastrophic Performance Loss](https://reader033.vdocuments.us/reader033/viewer/2022052400/559b31ea1a28ab3e0a8b4855/html5/thumbnails/12.jpg)
CPU-GPU Sync Point Visualized
● Ideal frame time should be max(CPU time, GPU time)
● Sync points cause this to be CPU Time + GPU Time.
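The two formulas above can be checked with a few lines of arithmetic. The sketch below is only a model: the function names are made up for illustration, and it assumes a sync point serializes CPU and GPU work completely.

```cpp
#include <algorithm>

// Frame time when CPU and GPU work overlap fully (no sync points).
double pipelinedFrameMs(double cpuMs, double gpuMs) {
    return std::max(cpuMs, gpuMs);
}

// Frame time when a sync point forces CPU and GPU work to run back to back.
double serializedFrameMs(double cpuMs, double gpuMs) {
    return cpuMs + gpuMs;
}

// Frames per second for a given frame time in milliseconds.
double framesPerSecond(double frameMs) {
    return 1000.0 / frameMs;
}
```

With 10 ms of CPU work and 10 ms of GPU work, the pipelined frame takes 10 ms (100 fps) and the serialized frame takes 20 ms (50 fps). The loss is largest when the two sides are evenly balanced.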
![Page 13: Avoiding Catastrophic Performance Loss](https://reader033.vdocuments.us/reader033/viewer/2022052400/559b31ea1a28ab3e0a8b4855/html5/thumbnails/13.jpg)
Healthy Timeline
[Timeline diagram. Labels: Client, Driver, Runtime, GPU, Runtime (DDI). Legend: thread separator, component separator, state change, action method (draw, clear, etc.), Present.]
![Page 14: Avoiding Catastrophic Performance Loss](https://reader033.vdocuments.us/reader033/viewer/2022052400/559b31ea1a28ab3e0a8b4855/html5/thumbnails/14.jpg)
CPU-GPU (Server->GPU) Sync Point
[Timeline diagram. Labels: Client, Driver, Runtime, GPU, Runtime (DDI). Legend: thread separator, component separator, state change, action method (draw, clear, etc.), Server->GPU Sync Point.]
![Page 15: Avoiding Catastrophic Performance Loss](https://reader033.vdocuments.us/reader033/viewer/2022052400/559b31ea1a28ab3e0a8b4855/html5/thumbnails/15.jpg)
Healthy Timeline
[Timeline diagram. Labels: Client, Driver, Runtime, GPU, Runtime (DDI). Legend: thread separator, component separator, state change, action method (draw, clear, etc.), Present.]
![Page 16: Avoiding Catastrophic Performance Loss](https://reader033.vdocuments.us/reader033/viewer/2022052400/559b31ea1a28ab3e0a8b4855/html5/thumbnails/16.jpg)
CPU-GPU (Client->GPU) Sync Point
[Timeline diagram. Labels: Client, Driver, Runtime, GPU, Runtime (DDI). Legend: thread separator, component separator, state change, action method (draw, clear, etc.), Client->GPU Sync Point.]
![Page 17: Avoiding Catastrophic Performance Loss](https://reader033.vdocuments.us/reader033/viewer/2022052400/559b31ea1a28ab3e0a8b4855/html5/thumbnails/17.jpg)
How bad are they, really?
● One CPU-GPU Sync Point can halve your framerate.
● The more there are, the harder they are to detect
● They are hard to detect with sampling profilers—the time disappears into Kernel Time.
![Page 18: Avoiding Catastrophic Performance Loss](https://reader033.vdocuments.us/reader033/viewer/2022052400/559b31ea1a28ab3e0a8b4855/html5/thumbnails/18.jpg)
We get it. They suck. Now what?
● GPU Timestamp Queries to the rescue!
![Page 19: Avoiding Catastrophic Performance Loss](https://reader033.vdocuments.us/reader033/viewer/2022052400/559b31ea1a28ab3e0a8b4855/html5/thumbnails/19.jpg)
Finding CPU-GPU Sync Points
● For each entry point that could cause a CPU-GPU sync point…
● Wrap the call with two GPU Timestamp Queries (Don’t forget the Disjoint Query)
● Ideally: record a portion of the stack at the call site
● Also record CPU timestamps around the call
![Page 20: Avoiding Catastrophic Performance Loss](https://reader033.vdocuments.us/reader033/viewer/2022052400/559b31ea1a28ab3e0a8b4855/html5/thumbnails/20.jpg)
Finding Sync Points (cont’d)
● Later:
● Compute the elapsed time between the queries
● If it is small (< 10 ns), then no GPU kickoff was required
● If it’s larger, a GPU kickoff probably occurred—you’ve found a CPU-GPU Sync Point!
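The "later" step could look roughly like the sketch below. It assumes the two timestamp values and the disjoint data have already been read back. `DisjointData` is a hypothetical stand-in for what the API returns (in D3D11, `D3D11_QUERY_DATA_TIMESTAMP_DISJOINT` carries a `Frequency` and a `Disjoint` field), and the function names are illustrative.

```cpp
#include <cstdint>

// Hypothetical mirror of the disjoint query result: timestamp frequency in
// ticks per second, and a flag saying the interval cannot be trusted.
struct DisjointData {
    std::uint64_t frequency;
    bool disjoint;
};

// Convert two GPU timestamp readings into elapsed nanoseconds.
// Returns a negative value when the interval is unusable.
double elapsedGpuNs(std::uint64_t before, std::uint64_t after,
                    const DisjointData& d) {
    if (d.disjoint || d.frequency == 0 || after < before) {
        return -1.0;
    }
    return static_cast<double>(after - before) * 1e9 /
           static_cast<double>(d.frequency);
}

// Threshold from the slide: under 10 ns means no GPU kickoff was required.
bool gpuKickoffLikely(double elapsedNs) {
    return elapsedNs >= 10.0;
}
```

Callers should discard negative results rather than classify them, since a disjoint interval says nothing about whether a kickoff happened.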
![Page 21: Avoiding Catastrophic Performance Loss](https://reader033.vdocuments.us/reader033/viewer/2022052400/559b31ea1a28ab3e0a8b4855/html5/thumbnails/21.jpg)
Code! (Original)
ctx->Map(...);
![Page 22: Avoiding Catastrophic Performance Loss](https://reader033.vdocuments.us/reader033/viewer/2022052400/559b31ea1a28ab3e0a8b4855/html5/thumbnails/22.jpg)
Code! (New)
ctx->Begin(pDisjoint);
ctx->End(pTimestampBefore);
double earlier = timer::now();
ctx->Map(...);
double cpuElapsed = timer::now() - earlier;
ctx->End(pTimestampAfter);
ctx->End(pDisjoint);
stack = getStackRecord();
gSPChecker->Register(pDisjoint, pTimestampBefore, pTimestampAfter,
                     stack, cpuElapsed);
![Page 23: Avoiding Catastrophic Performance Loss](https://reader033.vdocuments.us/reader033/viewer/2022052400/559b31ea1a28ab3e0a8b4855/html5/thumbnails/23.jpg)
Four Possibilities
| CPU Elapsed | GPU Elapsed | Meaning |
| --- | --- | --- |
| Low | ~None (<10 ns) | No problem! |
| High | ~None (<10 ns) | Possible Driver Sync (Bad) |
| Low | Low* (~1 us) | Possible Server->GPU Sync (Worse) |
| High | Low* (~1 us) | Possible Client->GPU Sync (Ugh) |

* Let’s talk about this in a bit
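The four cases map naturally onto a small classifier. This is a sketch, not the speaker's checker. The 10 ns GPU threshold comes from the slides, but the cutoff for a "High" CPU time is not given there, so `cpuHighNs` is an assumed value to tune per application. All names are hypothetical.

```cpp
enum class SyncKind { None, DriverSync, ServerToGpuSync, ClientToGpuSync };

struct Thresholds {
    double gpuNoneNs = 10.0;      // from the slides: under 10 ns is "~None"
    double cpuHighNs = 100000.0;  // assumption: 0.1 ms or more counts as "High"
};

// Classify one wrapped call from its CPU and GPU elapsed times (nanoseconds).
SyncKind classify(double cpuElapsedNs, double gpuElapsedNs,
                  const Thresholds& t = Thresholds{}) {
    const bool cpuHigh = cpuElapsedNs >= t.cpuHighNs;
    const bool gpuBusy = gpuElapsedNs >= t.gpuNoneNs;
    if (!gpuBusy) {
        // No GPU kickoff: either fine, or the client waited on the server thread.
        return cpuHigh ? SyncKind::DriverSync : SyncKind::None;
    }
    // GPU kickoff seen: who paid for the wait depends on the CPU time.
    return cpuHigh ? SyncKind::ClientToGpuSync : SyncKind::ServerToGpuSync;
}
```

Every result other than `None` is only a "possible" sync point, as in the table, so it is a candidate for inspection rather than proof.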
![Page 24: Avoiding Catastrophic Performance Loss](https://reader033.vdocuments.us/reader033/viewer/2022052400/559b31ea1a28ab3e0a8b4855/html5/thumbnails/24.jpg)
No problem!
[Timeline diagram. Labels: Client, Driver, Runtime, GPU, Runtime (DDI). Legend: thread separator, component separator, state change, action method (draw, clear, etc.), Present, queries, well-behaved Map, CPU timestamp.]
![Page 25: Avoiding Catastrophic Performance Loss](https://reader033.vdocuments.us/reader033/viewer/2022052400/559b31ea1a28ab3e0a8b4855/html5/thumbnails/25.jpg)
Client->GPU Sync Point - detected
[Timeline diagram. Labels: Client, Driver, Runtime, GPU, Runtime (DDI). Legend: thread separator, component separator, state change, action method (draw, clear, etc.), queries, CPU-GPU Sync Point, CPU timestamp.]
![Page 26: Avoiding Catastrophic Performance Loss](https://reader033.vdocuments.us/reader033/viewer/2022052400/559b31ea1a28ab3e0a8b4855/html5/thumbnails/26.jpg)
Low elapsed GPU?
● GPU is fed commands in FIFO order
● Likely only command caught is WFI
● Which is ~1,000 clocks, or ~1 us or more.
● Subject for future improvements
![Page 27: Avoiding Catastrophic Performance Loss](https://reader033.vdocuments.us/reader033/viewer/2022052400/559b31ea1a28ab3e0a8b4855/html5/thumbnails/27.jpg)
Split push buffer?
● Two calls right next to each other may wind up in different pushbuffer fragments
● And different GPU kickoffs
● This doesn’t hurt our scheme—Timestamp queries occur after “all results of previous commands are realized.”
● This means the timestamp is from the end of the pipeline—not the beginning.
![Page 28: Avoiding Catastrophic Performance Loss](https://reader033.vdocuments.us/reader033/viewer/2022052400/559b31ea1a28ab3e0a8b4855/html5/thumbnails/28.jpg)
Split Pushbuffer (cont’d)
● Shouldn’t be an issue unless you are CPU-bound and barely using the GPU
● Workarounds. Only report violators that either:
● Have large elapsed GPU times (>1 us); or
● Show up repeatedly when you hash the call stack.
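One way to apply both workarounds is sketched below. It assumes the captured stack is available as a list of return addresses. The FNV-1a hash and the `RepeatFilter` name are illustrative choices, and the repeat count is an assumed tuning knob.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// FNV-1a (64-bit) over the bytes of each return address in a captured stack.
std::uint64_t hashStack(const std::vector<std::uintptr_t>& frames) {
    std::uint64_t h = 14695981039346656037ull;  // FNV offset basis
    for (std::uintptr_t f : frames) {
        for (std::size_t i = 0; i < sizeof(f); ++i) {
            h ^= static_cast<std::uint64_t>((f >> (8 * i)) & 0xff);
            h *= 1099511628211ull;  // FNV prime
        }
    }
    return h;
}

class RepeatFilter {
public:
    explicit RepeatFilter(int minHits) : minHits_(minHits) {}

    // Record one suspected sync point. Returns true when it should be
    // reported: the GPU time is large, or this call site keeps showing up.
    bool record(const std::vector<std::uintptr_t>& frames, double gpuElapsedNs) {
        if (gpuElapsedNs > 1000.0) {
            return true;  // more than 1 us, per the slide
        }
        return ++hits_[hashStack(frames)] >= minHits_;
    }

private:
    int minHits_;
    std::unordered_map<std::uint64_t, int> hits_;
};
```

A one-off hit from a split pushbuffer stays below the repeat count and is never reported, while a call site that syncs every frame crosses it quickly.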
![Page 29: Avoiding Catastrophic Performance Loss](https://reader033.vdocuments.us/reader033/viewer/2022052400/559b31ea1a28ab3e0a8b4855/html5/thumbnails/29.jpg)
Fixing CPU-GPU Sync Points
● Adjust flags
● E.g. D3D9, never lock a default buffer with Flags=0
● Be wary of using nearly all GPU memory
● May not be enough room for DISCARD operations
● Spin-locking on query results—that’s definitely a CPU-GPU Sync Point, regardless of API.
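As an illustration of "adjust flags", the sketch below shows a common append pattern for dynamic buffers: write into unused space without overwriting, and discard only when the buffer is full. This is an assumption about typical usage, not something the slides spell out. `MapMode` is a hypothetical stand-in for the API flags (for example `D3D11_MAP_WRITE_NO_OVERWRITE` and `D3D11_MAP_WRITE_DISCARD` in D3D11).

```cpp
#include <cstddef>

// Hypothetical stand-ins for the API's map/lock flags.
enum class MapMode { Discard, NoOverwrite };

struct Allocation {
    MapMode mode;        // flag to map with
    std::size_t offset;  // byte offset to write at
};

class DynamicBufferCursor {
public:
    explicit DynamicBufferCursor(std::size_t capacity)
        : capacity_(capacity), next_(0) {}

    // Pick the region for the next `size` bytes and the flag to map with.
    // Assumes size <= capacity.
    Allocation allocate(std::size_t size) {
        if (next_ + size > capacity_) {
            // Out of room: request fresh storage and restart at the front.
            next_ = size;
            return {MapMode::Discard, 0};
        }
        // Unused region: no draw submitted so far reads these bytes.
        const std::size_t offset = next_;
        next_ += size;
        return {MapMode::NoOverwrite, offset};
    }

private:
    std::size_t capacity_;
    std::size_t next_;
};
```

Discards happen only once per wrap, which also limits how often the driver has to find memory for a fresh copy of the buffer.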
![Page 30: Avoiding Catastrophic Performance Loss](https://reader033.vdocuments.us/reader033/viewer/2022052400/559b31ea1a28ab3e0a8b4855/html5/thumbnails/30.jpg)
Fixing CPU-GPU Sync Points (cont’d)
● Use NO_OVERWRITE in combination with GPU fences (or event queries) to ensure safe, contention-free updates
● Defer Query resolution until at least one frame later
● Use PBOs to do asynchronous readbacks
● And wait “awhile” before mapping.
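Deferring query resolution can be sketched as a small queue that polls old queries and never spins on them. The poll callback is a hypothetical wrapper around whatever non-blocking "get data" call the API offers, and the class and parameter names are illustrative.

```cpp
#include <cstdint>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

// A query issued on some frame, plus a non-blocking poll callback that
// returns true and fills `result` once the data is available.
struct PendingQuery {
    std::uint64_t issuedFrame;
    std::function<bool(std::uint64_t& result)> tryGetData;
};

class DeferredQueries {
public:
    explicit DeferredQueries(std::uint64_t minLatencyFrames)
        : minLatency_(minLatencyFrames) {}

    void issue(std::uint64_t frame, std::function<bool(std::uint64_t&)> poll) {
        pending_.push_back({frame, std::move(poll)});
    }

    // Resolve, in issue order, every query that is old enough and ready.
    // Stops at the first one that is too recent or not ready yet.
    std::vector<std::uint64_t> collect(std::uint64_t currentFrame) {
        std::vector<std::uint64_t> results;
        while (!pending_.empty()) {
            PendingQuery& q = pending_.front();
            if (currentFrame < q.issuedFrame + minLatency_) break;  // too recent
            std::uint64_t value = 0;
            if (!q.tryGetData(value)) break;  // not ready: retry next frame
            results.push_back(value);
            pending_.pop_front();
        }
        return results;
    }

private:
    std::uint64_t minLatency_;
    std::deque<PendingQuery> pending_;
};
```

Calling `collect` once per frame means the consumer sees results a frame or more late, which is the price of never stalling on the GPU.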
![Page 31: Avoiding Catastrophic Performance Loss](https://reader033.vdocuments.us/reader033/viewer/2022052400/559b31ea1a28ab3e0a8b4855/html5/thumbnails/31.jpg)
Summary
CPU-GPU Sync Points. Not even one.
![Page 32: Avoiding Catastrophic Performance Loss](https://reader033.vdocuments.us/reader033/viewer/2022052400/559b31ea1a28ab3e0a8b4855/html5/thumbnails/32.jpg)
Questions
● jmcdonald at nvidia dot com
![Page 33: Avoiding Catastrophic Performance Loss](https://reader033.vdocuments.us/reader033/viewer/2022052400/559b31ea1a28ab3e0a8b4855/html5/thumbnails/33.jpg)
Appendix
![Page 34: Avoiding Catastrophic Performance Loss](https://reader033.vdocuments.us/reader033/viewer/2022052400/559b31ea1a28ab3e0a8b4855/html5/thumbnails/34.jpg)
GPU Timestamp Queries
● Tells you the GPU time at which preceding operations have completed, including writes to the FB.
● Two timestamp queries adjacent in the pushbuffer will have an elapsed time of 1/(Clock Frequency). (Very, very small).
![Page 35: Avoiding Catastrophic Performance Loss](https://reader033.vdocuments.us/reader033/viewer/2022052400/559b31ea1a28ab3e0a8b4855/html5/thumbnails/35.jpg)
Problematic D3D9 Entry Points
● Create*^
● IDirect3DQuery9::GetData
● *::Lock
● *::LockRect
● Present
^ Rare, but possible
![Page 36: Avoiding Catastrophic Performance Loss](https://reader033.vdocuments.us/reader033/viewer/2022052400/559b31ea1a28ab3e0a8b4855/html5/thumbnails/36.jpg)
Problematic D3D11 Entry Points
● ID3D11Device::CreateBuffer*^
● ID3D11Device::CreateTexture*^
● ID3D11DeviceContext::Map
● ID3D11DeviceContext::GetData
● IDXGISwapChain::Present
^ Rare, but possible
![Page 37: Avoiding Catastrophic Performance Loss](https://reader033.vdocuments.us/reader033/viewer/2022052400/559b31ea1a28ab3e0a8b4855/html5/thumbnails/37.jpg)
Problematic GL Entry Points
● glBufferData^
● glBufferSubData^
● glClientWaitSync
● glFinish
^ Rare, but possible
![Page 38: Avoiding Catastrophic Performance Loss](https://reader033.vdocuments.us/reader033/viewer/2022052400/559b31ea1a28ab3e0a8b4855/html5/thumbnails/38.jpg)
Problematic GL Entry Points (cont’d)
● glGetQueryObject*
● glMap*
● glTexImage*^
● glTexSubImage*^
● SwapBuffers
^ Rare, but possible