![Page 1: Are we done with hardware ray tracing? · Are we done with hardware ray ... •animation + BVH refit: ~5.8 ms •Procedural and dynamically tessellated geometry typically perform](https://reader036.vdocuments.us/reader036/viewer/2022071017/5fd05d361e08334ee63fd555/html5/thumbnails/1.jpg)
Are we done with hardware ray tracing?Or …
How can we make real-time raytracing more pervasive?Holger Gruen – Principal engineer
XPU Technology and Research group
Intel Corporation
![Page 2: Are we done with hardware ray tracing? · Are we done with hardware ray ... •animation + BVH refit: ~5.8 ms •Procedural and dynamically tessellated geometry typically perform](https://reader036.vdocuments.us/reader036/viewer/2022071017/5fd05d361e08334ee63fd555/html5/thumbnails/2.jpg)
Ray tracing hardware is great!• It solves real-time problems rasterization can‘t solve efficiently
• Recent DXR games have shown a variety of fantastic ray traced effects
• There is also great progress with regards to denoising
• Yet, ray traced rendering is more than just tracing rays
• BVH management (issues wrt build/refit/streaming)
• Hit point shading (SIMD utilization issues)
⇒Current AAA games need to limit ray tracing
• This talk is about open problems that need to be overcome
• Beyond what just making GPUs faster will give us
• e.g. like we have overcome traversal performance problems
![Page 3: Are we done with hardware ray tracing? · Are we done with hardware ray ... •animation + BVH refit: ~5.8 ms •Procedural and dynamically tessellated geometry typically perform](https://reader036.vdocuments.us/reader036/viewer/2022071017/5fd05d361e08334ee63fd555/html5/thumbnails/3.jpg)
Rasterization can deal with ...• 1000s of uniquely on-the fly skinned characters
• An animated forest with alpha-tested foliage
• Highly programmable on-the-fly per instance deformations (UE Kite demo)
• (DX12) mesh shaders that generate dynamic geometry on-the-fly
• Dynamic geometry that gets tessellated on-the-fly
• Massive amounts of virtual geometry that gets streamed and decompressed on-the-fly (see e.g. UE5 demo video)
![Page 4: Are we done with hardware ray tracing? · Are we done with hardware ray ... •animation + BVH refit: ~5.8 ms •Procedural and dynamically tessellated geometry typically perform](https://reader036.vdocuments.us/reader036/viewer/2022071017/5fd05d361e08334ee63fd555/html5/thumbnails/4.jpg)
A simple example (1080p, high end GPU*):
• A test scene with ~19M triangles
• 100 uniquely skinned characters
• 100 uniquely animated trees
• Gbuffer rasterization: ~2.7 ms
• Raytracing cost: ~8.6 ms
• Compute Shader animation: ~2.5 ms
• BLAS updates: ~3.2 ms
• Primary rays: ~2.9 ms (1 ray/pixel)
But what about raytracing?
*NVIDIA® GeForce® RTX® 2080Ti
![Page 5: Are we done with hardware ray tracing? · Are we done with hardware ray ... •animation + BVH refit: ~5.8 ms •Procedural and dynamically tessellated geometry typically perform](https://reader036.vdocuments.us/reader036/viewer/2022071017/5fd05d361e08334ee63fd555/html5/thumbnails/5.jpg)
BVH related issues
![Page 6: Are we done with hardware ray tracing? · Are we done with hardware ray ... •animation + BVH refit: ~5.8 ms •Procedural and dynamically tessellated geometry typically perform](https://reader036.vdocuments.us/reader036/viewer/2022071017/5fd05d361e08334ee63fd555/html5/thumbnails/6.jpg)
BVH building/refitting costs• BVH updates don‘t yet scale like streamed rasterized geometry
• 225 skinned characters (20k triangles each)
• Rasterization: ~1.8 ms
• Animation + BVH refit: ~6 ms
![Page 7: Are we done with hardware ray tracing? · Are we done with hardware ray ... •animation + BVH refit: ~5.8 ms •Procedural and dynamically tessellated geometry typically perform](https://reader036.vdocuments.us/reader036/viewer/2022071017/5fd05d361e08334ee63fd555/html5/thumbnails/7.jpg)
BVH building/refitting costs• BVH updates don‘t yet scale like streamed rasterized geometry
• 100 animated (non-rigid) trees (160k triangles each)• Rasterization: ~2.4 ms • animation + BVH refit: ~5.8 ms
• Procedural and dynamically tessellated geometry typically perform worse
![Page 8: Are we done with hardware ray tracing? · Are we done with hardware ray ... •animation + BVH refit: ~5.8 ms •Procedural and dynamically tessellated geometry typically perform](https://reader036.vdocuments.us/reader036/viewer/2022071017/5fd05d361e08334ee63fd555/html5/thumbnails/8.jpg)
High BVH Memory Footprints• Modern games push a lot of dynamic/procedural geometry!
• A dynamic BVH consumes ~60-80 bytes per triangle
⇒225 uniquely skinned characters (20k tris each) consume ~280MB
⇒100 uniquely animated trees (160k tris each) consume ~980MB
![Page 9: Are we done with hardware ray tracing? · Are we done with hardware ray ... •animation + BVH refit: ~5.8 ms •Procedural and dynamically tessellated geometry typically perform](https://reader036.vdocuments.us/reader036/viewer/2022071017/5fd05d361e08334ee63fd555/html5/thumbnails/9.jpg)
High BVH Memory Footprints• Static BVHs still use about 30-50 bytes per prim
⇒Raytraced effects can access most of the scene geometry due to secondary rays
⇒The whole scene may need to be in the BVH (pathtracing)
• Near future: UE5 scenes rumored to have billions of triangles
⇒Needs very aggressive view dependent culling
⇒Is current BVH storage/build/update/streaming technology up to this task?
⇒Current AAA games limit BVH complexity and as a result ray tracing
![Page 10: Are we done with hardware ray tracing? · Are we done with hardware ray ... •animation + BVH refit: ~5.8 ms •Procedural and dynamically tessellated geometry typically perform](https://reader036.vdocuments.us/reader036/viewer/2022071017/5fd05d361e08334ee63fd555/html5/thumbnails/10.jpg)
Potential solutions to the above BVH issuesReduce memory footprints:
• Hardware support for lossy geometry compression (BVH+Traversal)?
Reduce build/refit costs:
• Hardware accelerated builds?
• Support for lazy builds?• Similar to procedural texture problem:
‘AMFS: Adaptive Multi-Frequency Shading for Future Graphics Processors’, P. Clarberg, R. Toth, J. Hasselgren,J. Nilsson, T. Akenine-Möller
Reduce memory footprint & build/refit costs:
• Support lazy&caches hardware builds for transient dynamic or procedural pieces of geometry?
Lazy builds:
Only update nodes
when they are visited
by rays
![Page 11: Are we done with hardware ray tracing? · Are we done with hardware ray ... •animation + BVH refit: ~5.8 ms •Procedural and dynamically tessellated geometry typically perform](https://reader036.vdocuments.us/reader036/viewer/2022071017/5fd05d361e08334ee63fd555/html5/thumbnails/11.jpg)
Coherency & SIMD utilizationissues
![Page 12: Are we done with hardware ray tracing? · Are we done with hardware ray ... •animation + BVH refit: ~5.8 ms •Procedural and dynamically tessellated geometry typically perform](https://reader036.vdocuments.us/reader036/viewer/2022071017/5fd05d361e08334ee63fd555/html5/thumbnails/12.jpg)
• Divergent ray traversal duration/steps for rays⇒All SIMD lanes blocked until the ray with the highest # of traversal steps returns
• BVH traversal divergence⇒Can be hidden from shaders by HW traversal units
⇒Of course divergent traversal still taxes the memory hierarchy
• Shader path divergence⇒SIMD lanes may need to execute different hit shaders
• Divergent resource access in hit shaders⇒SIMD lanes may need to fetch from diverging resources
Most common coherency & SIMD utilization issues
Lane 0 Lane 1 Lane 2 Lane 3
Assumed SIMD4 GPU
with 4 lanes per wave
Please note that
traversal has way
higher latency then
Texture sampling!
![Page 13: Are we done with hardware ray tracing? · Are we done with hardware ray ... •animation + BVH refit: ~5.8 ms •Procedural and dynamically tessellated geometry typically perform](https://reader036.vdocuments.us/reader036/viewer/2022071017/5fd05d361e08334ee63fd555/html5/thumbnails/13.jpg)
Most common coherency & SIMD utilization• Divergent ray traversal duration/steps for rays
⇒All SIMD lanes blocked until the ray with the highest # of traversal steps returns
• BVH traversal divergence⇒Can be hidden from shaders by HW traversal units
⇒Of course divergent traversal still taxes the memory hierarchy
• Shader path divergence⇒SIMD lanes may need to execute different hit shaders
• Divergent resource access in hit shaders⇒SIMD lanes may need to fetch from diverging resources
0
0.5
1
1.5
2
2.5
3
3.5
4
4.5
5
1
11
21
31
41
51
61
71
81
91
101
111
121
131
141
151
161
171
181
191
201
211
221
231
241
251
Traversal Steps
% Rays
0
20
40
60
80
100
120
140
1 2 4 8 16 32
SIMD Wave Sizes
Avg. highest number of traversal step in a SIMD wave
Camera rays from a DXR games
![Page 14: Are we done with hardware ray tracing? · Are we done with hardware ray ... •animation + BVH refit: ~5.8 ms •Procedural and dynamically tessellated geometry typically perform](https://reader036.vdocuments.us/reader036/viewer/2022071017/5fd05d361e08334ee63fd555/html5/thumbnails/14.jpg)
Most common coherency & SIMD utilization• Divergent ray traversal duration/steps for rays
⇒All SIMD lanes blocked until the ray with the highest # of traversal steps return
• BVH traversal divergence⇒Can be hidden from shaders by HW traversal units
⇒Of course divergent traversal still taxes the memory hierarchy
• Shader path divergence⇒SIMD lanes may need to execute different hit shaders
• Divergent resource access in hit shaders⇒SIMD lanes may need to fetch from diverging resources
AO rays from a DXR games
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1
11
21
31
41
51
61
71
81
91
101
111
121
131
141
151
161
171
181
191
201
211
221
231
241
251
Traversal steps
% Rays
0
20
40
60
80
100
120
140
160
180
200
1 2 4 8 16 32
SIMD Wave size
Avg. highest number of traversal steps in a SIMD wave
![Page 15: Are we done with hardware ray tracing? · Are we done with hardware ray ... •animation + BVH refit: ~5.8 ms •Procedural and dynamically tessellated geometry typically perform](https://reader036.vdocuments.us/reader036/viewer/2022071017/5fd05d361e08334ee63fd555/html5/thumbnails/15.jpg)
Most common coherency & SIMD utilization issues• Divergent ray traversal duration/steps for rays
⇒All SIMD lanes blocked until the longest ray traversal path is done
• Shader path divergence⇒SIMD lanes may need to execute different hit/material shaders
• Divergent resource access in hit shaders⇒SIMD lanes may need to fetch from diverging resources
• BVH traversal divergence⇒Can be hidden from shaders by HW traversal units
⇒Of course divergent traversal still taxes the memory hierarchy
Different colors
depict different hit
shaders
Lane 0 Lane 1 Lane 2 Lane 3
![Page 16: Are we done with hardware ray tracing? · Are we done with hardware ray ... •animation + BVH refit: ~5.8 ms •Procedural and dynamically tessellated geometry typically perform](https://reader036.vdocuments.us/reader036/viewer/2022071017/5fd05d361e08334ee63fd555/html5/thumbnails/16.jpg)
Most common coherency & SIMD utilization issues• Divergent ray traversal duration/steps for rays
⇒All SIMD lanes blocked until the longest ray traversal path is done
• Shader path divergence⇒SIMD lanes may need to execute different hit shaders
• Divergent textures⇒Same shaders, but SIMD lanes may need to fetch from diverging resources
• BVH traversal divergence⇒Can be hidden from shaders by HW traversal units
⇒Of course divergent traversal still taxes the memory hierarchy
Lane 0 Lane 1 Lane 2 Lane 3
![Page 17: Are we done with hardware ray tracing? · Are we done with hardware ray ... •animation + BVH refit: ~5.8 ms •Procedural and dynamically tessellated geometry typically perform](https://reader036.vdocuments.us/reader036/viewer/2022071017/5fd05d361e08334ee63fd555/html5/thumbnails/17.jpg)
How to increase coherency & SIMD utilization?Current solutions:
• Sort rays for more traversal coherency/ less SIMD latency• e.g. by origin + direction, Morton code, ...
• Sort hit points for coherent shading• e.g. by material ID, shading model, etc.
• Expensive for multiple bounces• Hit point streaming can consume cosiderable bandwidth
• High local shared memory footprints may limit your occupancy
![Page 18: Are we done with hardware ray tracing? · Are we done with hardware ray ... •animation + BVH refit: ~5.8 ms •Procedural and dynamically tessellated geometry typically perform](https://reader036.vdocuments.us/reader036/viewer/2022071017/5fd05d361e08334ee63fd555/html5/thumbnails/18.jpg)
How to increase coherency & SIMD utilization?Potential future solutions:
• Could we do asynchronous raytracing?
• Could hardware bundle coherent shading requests?
Different colors
depict different hit
shaders
Lane 0 Lane 1Wave 0 Lane 0 Lane 1 Wave 1
Lane 0 Lane 1 Lane 0 Lane 1
![Page 19: Are we done with hardware ray tracing? · Are we done with hardware ray ... •animation + BVH refit: ~5.8 ms •Procedural and dynamically tessellated geometry typically perform](https://reader036.vdocuments.us/reader036/viewer/2022071017/5fd05d361e08334ee63fd555/html5/thumbnails/19.jpg)
LOD management issuesCurrent raytracing hardware allows only limited LOD management
• Crossfading is possible using instance and ray masks (see recent NVIDIA blog)
• Anyhit() shaders allow more programmable fades but are slower
• Geomorphing through refits seems possible but is expensive
Potential future solutions:
• Fully implement a fast traversal shaders stage in hardware?
• See “Flexible Ray Traversal with an Extended Programming Model”by W. Lee, G. Liktor, K. Vaidyanathan
• Traversal shaders can do flexible LOD selection and more!
![Page 20: Are we done with hardware ray tracing? · Are we done with hardware ray ... •animation + BVH refit: ~5.8 ms •Procedural and dynamically tessellated geometry typically perform](https://reader036.vdocuments.us/reader036/viewer/2022071017/5fd05d361e08334ee63fd555/html5/thumbnails/20.jpg)
Alpha testing is comparably slow• Hardware traversal gets interrupted to run a shader that computes if a ray vs
triangle intersection is valid
• See our talk on Wednesday:
„Sub-triangle opacity masks for faster ray tracing of transparent objects”
![Page 21: Are we done with hardware ray tracing? · Are we done with hardware ray ... •animation + BVH refit: ~5.8 ms •Procedural and dynamically tessellated geometry typically perform](https://reader036.vdocuments.us/reader036/viewer/2022071017/5fd05d361e08334ee63fd555/html5/thumbnails/21.jpg)
RecognitionsThanks to
Karthik Vaidyanathan,
Carsten Benthin,
Joshua Barczak
and Gabor Liktor from Intel
who contributed to the above slides.
![Page 22: Are we done with hardware ray tracing? · Are we done with hardware ray ... •animation + BVH refit: ~5.8 ms •Procedural and dynamically tessellated geometry typically perform](https://reader036.vdocuments.us/reader036/viewer/2022071017/5fd05d361e08334ee63fd555/html5/thumbnails/22.jpg)
Q&A