An Optimized Diffusion Depth Of Field Solver (DDOF)
28th February 2011, AMD's Favorite Effects
Holger Gruen – AMD
Agenda
• Motivation
• Recap of a high-level explanation of DDOF
• Recap of earlier DDOF solvers
• A vanilla Cyclic Reduction (CR) DDOF solver
• A DX11-optimized CR solver for DDOF
• Results
Motivation
• The solver presented at GDC 2010 [RS2010] has some weaknesses
• It is a great implementation, but its memory requirements and runtime are too high for many game developers
• We are looking for a faster and more memory-efficient solver
Diffusion DOF recap 1
• DDOF is an enhanced way of blurring a picture that takes an arbitrary CoC (circle of confusion) at each pixel into account
• Interprets the input image as a heat distribution
• Uses the CoC at a pixel to derive a per-pixel heat conductivity
Diffusion DOF recap 2
• Blurring is done by time-stepping a differential equation that models the diffusion of heat
• The ADI (alternating direction implicit) method is used to arrive at a separable solution for stepping
• A tri-diagonal linear system needs to be solved for each row and then each column of the input
DDOF Tri-diagonal system

$$
\begin{pmatrix}
b_1 & c_1 &        &        & 0      \\
a_2 & b_2 & c_2    &        &        \\
    & a_3 & b_3    & c_3    &        \\
    &     & \ddots & \ddots & \ddots \\
0   &     &        & a_n    & b_n
\end{pmatrix}
\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{pmatrix}
=
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{pmatrix}
$$

• $x_i$: row/col of the input image
• $a_i, b_i, c_i$: derived from the CoC at each pixel of an input row/col
• $y_i$: resulting blurred row/col
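To make the roles of a, b, c, x and y concrete, here is a minimal Python sketch of how such a system could be assembled for one row. It assumes an implicit heat-diffusion step in the spirit of [Kass2006]; the conductivity formula (`min` of the two neighbouring CoCs, squared) and the function names are illustrative assumptions, not taken from the slides.

```python
def build_tridiagonal(coc):
    """Build a, b, c for one row from per-pixel CoC values (illustrative only)."""
    n = len(coc)
    # Assumed conductivity between pixel i and i+1: squared smaller CoC.
    beta = [min(coc[i], coc[i + 1]) ** 2 for i in range(n - 1)]
    a, b, c = [0.0] * n, [0.0] * n, [0.0] * n
    for i in range(n):
        left = beta[i - 1] if i > 0 else 0.0       # coupling to pixel i-1
        right = beta[i] if i < n - 1 else 0.0      # coupling to pixel i+1
        a[i] = -left
        c[i] = -right
        b[i] = 1.0 + left + right                  # each row sums to 1
    return a, b, c


def apply_tridiagonal(a, b, c, y):
    """Compute x = A * y for the tri-diagonal matrix given by a, b, c."""
    n = len(y)
    x = []
    for i in range(n):
        value = b[i] * y[i]
        if i > 0:
            value += a[i] * y[i - 1]
        if i < n - 1:
            value += c[i] * y[i + 1]
        x.append(value)
    return x
```

With this construction a pixel whose CoC is 0 gets b = 1 and no coupling to its neighbours, so it stays sharp, and a constant row is left unchanged by the blur.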
Solver recap 1
• The GDC2010 solver [RS2010] is a 'hybrid' solver
– Performs three PCR (parallel cyclic reduction) steps upfront
– Performs the serial 'Sweep' algorithm to solve the small resulting systems
– Check [ZCO2010] for details on other hybrid solvers
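For reference, the serial sweep can be sketched as follows. This assumes that 'Sweep' refers to the standard forward-elimination / back-substitution scheme often called the Thomas algorithm; the function name and list-based interface are illustrative, not from the original implementation.

```python
def thomas_solve(a, b, c, x):
    """Solve a tri-diagonal system serially; a[0] and c[-1] are assumed to be 0."""
    n = len(x)
    c_mod = [0.0] * n   # modified upper-diagonal coefficients
    x_mod = [0.0] * n   # modified right-hand side
    c_mod[0] = c[0] / b[0]
    x_mod[0] = x[0] / b[0]
    # Forward sweep: eliminate the lower diagonal row by row.
    for i in range(1, n):
        denom = b[i] - a[i] * c_mod[i - 1]
        c_mod[i] = c[i] / denom
        x_mod[i] = (x[i] - a[i] * x_mod[i - 1]) / denom
    # Backward sweep: substitute from the last unknown upwards.
    y = [0.0] * n
    y[-1] = x_mod[-1]
    for i in range(n - 2, -1, -1):
        y[i] = x_mod[i] - c_mod[i] * y[i + 1]
    return y
```

Each step depends on the previous one, which is why this part of a hybrid solver is inherently serial per row or column.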
Solver recap 2
• The GDC2010 solver [RS2010] has drawbacks
– It uses a large UAV as a RW scratch-pad to store the modified coefficients of the sweep algorithm
  • GPUs without a RW cache will suffer
– For high resolutions, three PCR steps produce tri-diagonal systems of substantial size
  • This means a serial (sweep) algorithm is run on a 'big' system
Solver recap 3
• Cyclic Reduction (CR) solver
– Used by [Kass2006] in the original DDOF paper
– Runs in two phases
  1. reduction phase
  2. backward substitution phase
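The two phases can be illustrated with a small CPU-side sketch in Python. It is a plain recursive formulation for a single row, written under the assumption that a[0] and c[-1] are zero; it is not the GPU implementation from the talk, and the function name is hypothetical.

```python
def cyclic_reduction_solve(a, b, c, x):
    """Solve a tri-diagonal system by cyclic reduction (a[0] == c[-1] == 0)."""
    n = len(x)
    if n == 1:
        return [x[0] / b[0]]

    # Reduction phase: keep the odd-indexed equations and eliminate
    # their even-indexed neighbours, which halves the system size.
    ra, rb, rc, rx = [], [], [], []
    for i in range(1, n, 2):
        has_right = i + 1 < n
        k1 = a[i] / b[i - 1]
        k2 = c[i] / b[i + 1] if has_right else 0.0
        ra.append(-a[i - 1] * k1)
        rb.append(b[i] - c[i - 1] * k1 - (a[i + 1] * k2 if has_right else 0.0))
        rc.append(-c[i + 1] * k2 if has_right else 0.0)
        rx.append(x[i] - x[i - 1] * k1 - (x[i + 1] * k2 if has_right else 0.0))

    odd_y = cyclic_reduction_solve(ra, rb, rc, rx)

    # Backward substitution phase: the odd unknowns are known, so each
    # even unknown follows directly from its own equation.
    y = [0.0] * n
    for j, i in enumerate(range(1, n, 2)):
        y[i] = odd_y[j]
    for i in range(0, n, 2):
        left = a[i] * y[i - 1] if i > 0 else 0.0
        right = c[i] * y[i + 1] if i + 1 < n else 0.0
        y[i] = (x[i] - left - right) / b[i]
    return y
```

Within one level, every reduced equation and every substituted unknown is independent of the others, which is what makes each level map to one parallel pass.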
Solver recap 4
• According to [ZCO2010]:
– The CR solver has the lowest computational complexity of all solvers
– It suffers from a lack of parallelism, though
  • At the end of the reduction phase
  • At the start of the backward substitution phase
Passes of a Vanilla CR Solver
[Diagram: pass chain of the solver]
• Input: image X
• Pass 1: construct abc from the CoC, which sets up the tri-diagonal system
• Reduction passes: reduce, reduce, … until the system has size 1
• Stop at size 1 and solve for the first y
• Substitution passes: substitute, substitute, … back up to full resolution
• Output: blurred image Y
Vanilla Solver Results
• Higher performance than reported in [Bavoil2010] (~6 ms vs. ~8 ms at 1600x1200)
• Memory footprint prohibitively high
– >200 MB at 1600x1200
• Need an answer to the lack-of-parallelism problem
– An answer is given in [ZCO2010]
Vanilla CR Solver
[Diagram: same pass chain as before: X → construct abc from CoC → reduce … → stop at size 1 and solve for the first y → substitute … → blurred image Y]
• Stopping at size 1 to solve for the first y is what kills parallelism
Keeping the parallelism high
[Diagram: X → construct abc from CoC → reduce … → stop at a reasonable size → solve for Y → substitute … → blurred image Y]
• Stop the reduction at a reasonable size instead of size 1
• Solve for Y at that resolution to have a big enough parallel workload (e.g. using PCR, see [ZCO2010])
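A rough way to see the effect of the stopping size is to count how many equations remain per reduction pass. The Python sketch below assumes a horizontal solve with one unit of parallel work per remaining equation and a plain 2-to-1 reduction per pass; the function name and the stop size of 64 used in the note are purely illustrative.

```python
def reduction_workloads(width, height, stop_size=1):
    """Equations left after each 2-to-1 reduction pass of a horizontal solve."""
    workloads = []
    n = width
    while n > stop_size:
        n //= 2                      # each pass halves the system per row
        workloads.append(n * height) # all rows are processed in parallel
    return workloads
```

For a 1600x1200 image, reducing all the way to size 1 takes 10 passes and ends with only 1200 equations in flight, whereas stopping once the system is no larger than 64 takes 5 passes and still leaves 60000 equations for the parallel solve.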
Memory Optimizations 1
[Diagram: pass chain with the surface format of each stage]
• Initially every surface in the chain is rgba32f: X, abc, the reduced levels and the substitution results
• X, its reduced levels and the substitution results can be stored as rgba16f, while abc stays rgba32f
• This saves a significant amount of memory
• We found no artifacts when going from rgba32f to rgba16f
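As a back-of-the-envelope check of the saving per surface, the following Python sketch computes the size of a single full-resolution surface. It only uses the standard texel sizes (16 bytes for rgba32f, 8 bytes for rgba16f) and makes no attempt to reproduce the solver's total footprint; the helper name is hypothetical.

```python
BYTES_RGBA32F = 16  # 4 channels x 32-bit float
BYTES_RGBA16F = 8   # 4 channels x 16-bit float


def surface_mib(width, height, bytes_per_pixel):
    """Size of one surface in MiB (1 MiB = 1024 * 1024 bytes)."""
    return width * height * bytes_per_pixel / (1024 * 1024)


full_precision = surface_mib(1600, 1200, BYTES_RGBA32F)  # about 29.3 MiB
half_precision = surface_mib(1600, 1200, BYTES_RGBA16F)  # about 14.6 MiB
```

Every surface switched from rgba32f to rgba16f therefore costs half as much, at every level of the chain.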
Memory Optimizations 2
[Diagram: pass chain without the full-resolution abc surface]
• Skip the abc construction pass and compute abc on the fly during the first reduction pass
• This again saves a significant amount of memory, as the full-resolution abc surface is the biggest surface used by the solver
Intermediate Results 1600x1200

| Solver | Time in ms (HD5870) | Time in ms (GTX480) | Memory in megabytes |
|---|---|---|---|
| GDC2010 hybrid solver on GTX480 | ~8.5 | 8.00 [Bavoil2010] | ~117 (guesstimate) |
| Standard solver (already skips high-res abc construction) | 3.66 | 3.33 | ~132 |
Memory Optimizations 3
[Diagram: pass chain with a reduce4 first pass and a substitute4 last pass]
• Skip the abc construction pass and compute abc during the first reduction pass
• Reduce 4-to-1 in a special first reduction pass
• Substitute 1-to-4 in a special substitution pass
• Yet again this saves a significant amount of memory
Intermediate Results 1600x1200

| Solver | Time in ms (HD5870) | Time in ms (GTX480) | Memory in megabytes |
|---|---|---|---|
| GDC2010 hybrid solver on GTX480 | ~8.5 | 8.00 [Bavoil2010] | ~117 (guesstimate) |
| Standard solver (already skips high-res abc construction) | 3.66 | 3.33 | ~132 |
| 4-to-1 reduction | 2.87 | 3.32 | ~73 |
DX11 Memory Optimizations 1
[Diagram: pass chain with reduce4 and substitute4 passes]
• Skip the abc construction pass and compute abc during the first reduction pass
• Reduce 4-to-1 in a special first reduction pass
• Substitute 1-to-4 in a special substitution pass
• Pack abc and X into one rgba_uint surface
Using SM5 for data packing
[Diagram: X (rgba16f) and abc (rgba32f) packed into four uint channels]
• One uint channel holds the x and y channels of X:
(f32tof16(X.x) + (f32tof16(X.y) << 16))
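The x/y packing can be emulated on the CPU to check the bit layout. The Python sketch below mimics the HLSL intrinsic `f32tof16` with the `struct` module's half-precision format; the helper names are illustrative, and the rounding of `struct` may differ from what a GPU does for values that are not exactly representable as half floats.

```python
import struct


def f32tof16(value):
    """Half-precision bit pattern of value, in the low 16 bits of an int."""
    return struct.unpack('<H', struct.pack('<e', value))[0]


def f16tof32(bits):
    """Float value of the half-precision pattern in the low 16 bits."""
    return struct.unpack('<e', struct.pack('<H', bits & 0xFFFF))[0]


def pack_xy(x, y):
    """x goes into the low 16 bits, y into the high 16 bits of one uint."""
    return (f32tof16(x) + (f32tof16(y) << 16)) & 0xFFFFFFFF


def unpack_xy(packed):
    """Inverse of pack_xy."""
    return f16tof32(packed & 0xFFFF), f16tof32(packed >> 16)
```

Because the two halves occupy disjoint bit ranges, the `+` in the packing expression behaves the same as a bitwise OR.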
Using SM5 for data packing
[Diagram: X (rgba16f) and abc (rgba32f) packed into four uint channels]
• One uint channel holds the higher 26 bits of abc.x and the lower 6 bits of X.z:
(asuint(abc.x) & 0xFFFFFFC0) | (f32tof16(X.z) & 0x3F)
• Steal the 6 lowest mantissa bits of abc.x to store some bits of X.z
Using SM5 for data packing
[Diagram: X (rgba16f) and abc (rgba32f) packed into four uint channels]
• One uint channel holds the higher 26 bits of abc.y and the central 6 bits of X.z (bits 6 to 11):
(asuint(abc.y) & 0xFFFFFFC0) | ((f32tof16(X.z) >> 6) & 0x3F)
• Steal the 6 lowest mantissa bits of abc.y to store some bits of X.z
SM5 Memory Optimizations 1
[Diagram: X (rgba16f) and abc (rgba32f) packed into four uint channels]
• One uint channel holds the higher 26 bits of abc.z and the highest 4 bits of X.z (bits 12 to 15):
(asuint(abc.z) & 0xFFFFFFC0) | ((f32tof16(X.z) >> 12) & 0x3F)
• Steal the 6 lowest mantissa bits of abc.z to store some bits of X.z
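Putting the packing expressions together, the whole scheme can be emulated on the CPU to check that X.z survives the round trip and that abc only loses its lowest mantissa bits. This Python sketch uses `struct` to stand in for the HLSL intrinsics `f32tof16` and `asuint`; the order of the four uints, the unpack routine and all helper names are assumptions made for illustration.

```python
import struct

KEEP_HIGH = 0xFFFFFFC0  # clears the 6 lowest mantissa bits of a float32
LOW6 = 0x3F             # selects 6 bits


def f32tof16(value):
    return struct.unpack('<H', struct.pack('<e', value))[0]


def f16tof32(bits):
    return struct.unpack('<e', struct.pack('<H', bits & 0xFFFF))[0]


def asuint(value):
    return struct.unpack('<I', struct.pack('<f', value))[0]


def asfloat(bits):
    return struct.unpack('<f', struct.pack('<I', bits))[0]


def pack(abc, X):
    """Pack three float32 coefficients and three half-float channels into 4 uints."""
    hz = f32tof16(X[2])
    return (
        f32tof16(X[0]) + (f32tof16(X[1]) << 16),
        (asuint(abc[0]) & KEEP_HIGH) | (hz & LOW6),
        (asuint(abc[1]) & KEEP_HIGH) | ((hz >> 6) & LOW6),
        (asuint(abc[2]) & KEEP_HIGH) | ((hz >> 12) & LOW6),
    )


def unpack(packed):
    """Recover (abc, X); abc comes back with its 6 lowest mantissa bits zeroed."""
    hz = ((packed[1] & LOW6)
          | ((packed[2] & LOW6) << 6)
          | ((packed[3] & LOW6) << 12))
    X = (f16tof32(packed[0] & 0xFFFF), f16tof32(packed[0] >> 16), f16tof32(hz))
    abc = tuple(asfloat(p & KEEP_HIGH) for p in packed[1:])
    return abc, X
```

Dropping 6 of the 23 mantissa bits leaves 17, so the relative error introduced into each of a, b and c by the truncation stays below 2^-17.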
Sample Screenshot
[Screenshot not preserved]
Abs(Packed-Unpacked) x 255.0f
[Difference image between the packed and the unpacked solver output, not preserved]
DX11 Memory Optimizations 2
• The solver does a horizontal and a vertical pass
• The chain of lower-res RTs needs to be there twice
– Horizontal reduction/substitution chain
– Vertical reduction/substitution chain
• How can DX11 help?
DX11 Memory Optimizations 2
• UAVs allow us to reuse data of the horizontal chain for the vertical chain
• A proof-of-concept implementation shows that this works nicely but impacts the runtime significantly
– ~40% lower fps
• Stayed with RTs as memory was already quite low
• Use only if you are really concerned about memory
Final Results 1600x1200

| Solver | Time in ms (HD5870) | Time in ms (GTX480) | Memory in megabytes |
|---|---|---|---|
| GDC2010 hybrid solver on GTX480 | ~8.5 | 8.00 [Bavoil2010] | ~117 (guesstimate) |
| Standard solver (already skips high-res abc construction) | 3.66 | 3.33 | ~132 |
| 4-to-1 reduction | 2.87 | 3.32 | ~73 |
| 4-to-1 reduction + SM5 packing | 2.75 | 3.14 | ~58 |
Future Work
• Look into CS (compute shader) acceleration of the solver
– 4-to-1 reduction pass
– 1-to-4 substitution pass
• Look into using heat diffusion for other effects
– e.g. motion blur
Conclusion
• The optimized CR solver is fast and memory-efficient
– Used in Dragon Age 2
– 4A Games is considering its use for new projects
– Detailed description in 'Game Engine Gems 2'
• Mail me ([email protected]) if you want access to the sources
References
• [Kass2006] "Interactive Depth of Field Using Simulated Diffusion on a GPU", Michael Kass, Pixar Animation Studios, Pixar Technical Memo #06-01
• [ZCO2010] "Fast Tridiagonal Solvers on the GPU", Y. Zhang, J. Cohen, J. D. Owens, PPoPP 2010
• [RS2010] "DX11 Effects in Metro 2033: The Last Refuge", A. Rege, O. Shishkovtsov, GDC 2010
• [Bavoil2010] "Modern Real-Time Rendering Techniques", L. Bavoil, FGO2010
Backup
Results 1920x1200

| Solver | Time in ms (HD5870) | Time in ms (GTX480) | Memory in megabytes |
|---|---|---|---|
| Standard solver (already skips high-res abc construction) | 4.31 | 4.03 | ~158 |
| 4-to-1 reduction | 3.36 | 4.02 | ~88 |
| 4-to-1 reduction + SM5 packing | 3.23 | 3.79 | ~70 |