GPU-Efficient Recursive Filtering and Summed-Area Tables
D. Nehab (IMPA), A. Maximo (IMPA), R. S. Lima (Digitok), H. Hoppe (Microsoft Research)
![Page 1: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.vdocuments.us/reader036/viewer/2022062410/56816552550346895dd7ca14/html5/thumbnails/1.jpg)
GPU-Efficient Recursive Filtering and Summed-Area Tables
D. Nehab¹, A. Maximo¹, R. S. Lima², H. Hoppe³
¹IMPA, ²Digitok, ³Microsoft Research
![Page 2: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.vdocuments.us/reader036/viewer/2022062410/56816552550346895dd7ca14/html5/thumbnails/2.jpg)
Recursive filters
• Linear, shift-invariant filters
• But use feedback from earlier outputs
[Figure: input, prologue, output]
![Page 3: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.vdocuments.us/reader036/viewer/2022062410/56816552550346895dd7ca14/html5/thumbnails/3.jpg)
Recursive filters
• Linear, shift-invariant filters
• But use feedback from earlier outputs
• Sequential dependency chain
[Figure: input, prologue, output]
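As a minimal sketch of the feedback and the sequential dependency chain, the following Python function applies a first-order causal recursive filter. The recurrence `y[i] = x[i] + a * y[i-1]`, the coefficient name `a`, and the function name are assumptions made for illustration; the slides do not fix a sign convention or a normalization. The prologue is the output value assumed to precede the first input sample.

```python
def causal_first_order(x, a, prologue=0.0):
    """Apply y[i] = x[i] + a * y[i-1], with y[-1] = prologue (assumed convention)."""
    y = []
    prev = prologue
    for v in x:
        # Each output needs the previous output: this is the dependency chain.
        prev = v + a * prev
        y.append(prev)
    return y


# Impulse response: the feedback makes one input sample affect all later outputs.
print(causal_first_order([1.0, 0.0, 0.0, 0.0], 0.5))  # [1.0, 0.5, 0.25, 0.125]
```

Because every iteration reads the value written by the one before it, the loop cannot be split across threads as written; only separate rows or columns are independent.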
![Page 4: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.vdocuments.us/reader036/viewer/2022062410/56816552550346895dd7ca14/html5/thumbnails/4.jpg)
Applications of recursive filtering
• B-Spline (or other) interpolation
[Figure: input, coefficients (recursive preprocessing step), interpolation (from coefficients)]
![Page 5: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.vdocuments.us/reader036/viewer/2022062410/56816552550346895dd7ca14/html5/thumbnails/5.jpg)
Applications of recursive filtering
• B-Spline (or other) interpolation
• Fast, wide, Gaussian-blur approximation
• Summed-area tables
[Figure: input, blurred (recursive filters)]
![Page 6: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.vdocuments.us/reader036/viewer/2022062410/56816552550346895dd7ca14/html5/thumbnails/6.jpg)
Causality and order
• Recursive filters can be causal or anticausal
• Causal goes forward, anticausal in reverse direction
• Filter order is simply the number r of feedbacks
[Figure: input, epilogue, output]
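The sketch below illustrates causality and order together, under assumed conventions: the feedback weights are passed as a list `a` of length r, the causal recurrence is `y[i] = x[i] + sum(a[k] * y[i-1-k])`, and the anticausal filter is the same recurrence run from the last sample to the first. The function names and the layout of the prologue and epilogue lists are hypothetical.

```python
def causal(x, a, prologue=None):
    """Order-r causal filter: y[i] = x[i] + sum_k a[k] * y[i-1-k].

    prologue holds the r outputs before the first sample, oldest first.
    """
    r = len(a)
    hist = list(prologue) if prologue is not None else [0.0] * r
    y = []
    for v in x:
        out = v + sum(a[k] * hist[-1 - k] for k in range(r))
        y.append(out)
        hist = hist[1:] + [out]  # keep only the r most recent outputs
    return y


def anticausal(x, a, epilogue=None):
    """Same recurrence in the reverse direction: z[i] = x[i] + sum_k a[k] * z[i+1+k].

    epilogue holds the r outputs after the last sample, nearest first.
    """
    ep = list(epilogue)[::-1] if epilogue is not None else None
    return causal(x[::-1], a, ep)[::-1]


a = [0.5, 0.25]  # two feedback weights, so the order is r = 2
print(causal([1.0, 0.0, 0.0, 0.0], a))      # [1.0, 0.5, 0.5, 0.375]
print(anticausal([0.0, 0.0, 0.0, 1.0], a))  # [0.375, 0.5, 0.5, 1.0]
```

Writing the anticausal pass as a reversed causal pass keeps a single recurrence to maintain; the only difference is which end of the sequence supplies the initial feedback values.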
![Page 7: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.vdocuments.us/reader036/viewer/2022062410/56816552550346895dd7ca14/html5/thumbnails/7.jpg)
Filter sequences and separability
• Often, sequences of recursive filters are needed
• Independent columns
  • Causal
  • Anticausal
• Independent rows
  • Causal
  • Anticausal
![Page 8: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.vdocuments.us/reader036/viewer/2022062410/56816552550346895dd7ca14/html5/thumbnails/8.jpg)
Algorithm RT
• The baseline algorithm
• Process columns in parallel, then rows in parallel
• Ruijters et al. 2010 “GPU prefilter […]”
[Figure: input, stages (column processing, row processing), output]
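The column-then-row structure can be sketched on the CPU as follows. This is an illustration of the processing order only, not the cited GPU implementation: it assumes a first-order filter with recurrence `y[i] = x[i] + a * y[i-1]`, zero prologues and epilogues, and hypothetical function names. On a GPU, each list comprehension over columns or rows would correspond to one thread per column or row.

```python
def causal(x, a):
    """y[i] = x[i] + a * y[i-1], zero prologue (assumed convention)."""
    y, prev = [], 0.0
    for v in x:
        prev = v + a * prev
        y.append(prev)
    return y


def anticausal(y, a):
    """z[i] = y[i] + a * z[i+1], zero epilogue (assumed convention)."""
    return causal(y[::-1], a)[::-1]


def filter_columns(img, a):
    # Every column is independent of the others: one parallel task per column.
    cols = [anticausal(causal(list(c), a), a) for c in zip(*img)]
    return [list(r) for r in zip(*cols)]


def filter_rows(img, a):
    # Every row is independent of the others: one parallel task per row.
    return [anticausal(causal(list(r), a), a) for r in img]


def separable_filter(img, a):
    """Column processing followed by row processing."""
    return filter_rows(filter_columns(img, a), a)


print(separable_filter([[1.0, 0.0]], 0.5))  # [[1.25, 0.5]]
```

With this layout an h×w image exposes only w independent tasks in the column stage and h in the row stage.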
![Page 9: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.vdocuments.us/reader036/viewer/2022062410/56816552550346895dd7ca14/html5/thumbnails/9.jpg)
First-order filter benchmarks
• Alg. RT is the baseline implementation
  • Ruijters et al. 2010 “GPU prefilter […]”
[Plot: Cubic B-Spline Interpolation (GeForce GTX 480); throughput (GiP/s) vs. input size (pixels), 64² to 4096²; curve: RT]
[Table: Alg., Step Complexity, Max. # of Threads, Used Bandwidth; row: RT]
![Page 10: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.vdocuments.us/reader036/viewer/2022062410/56816552550346895dd7ca14/html5/thumbnails/10.jpg)
Optimization roadmap
• Modern GPUs have several hundred cores
• Latency-hiding requires many times more tasks
• Images are not large enough: must parallelize further
[Table: Alg., Step Complexity, Max. # of Threads, Used Bandwidth; row: RT]
![Page 11: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.vdocuments.us/reader036/viewer/2022062410/56816552550346895dd7ca14/html5/thumbnails/11.jpg)
Increasing parallelism
• Similar to parallel prefix-sum algorithms
  • Sengupta et al. 2007 “Scan primitives for GPU computing”
  • Dotsenko et al. 2008 “Fast scan algorithms […]”
• Compute and store incomplete prologues
• Fix incomplete prologues
  • Somewhat more complicated than a recursive invocation
• Use prologues to compute and store causal results
[Figure: input split into blocks, with ✗ marks on block boundaries]
![Page 12: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.vdocuments.us/reader036/viewer/2022062410/56816552550346895dd7ca14/html5/thumbnails/12.jpg)
Fixing incomplete prologues
[Figure: an incomplete prologue (✗) is corrected using superposition and linearity]
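A minimal sketch of how linearity and superposition fix incomplete prologues, for a first-order filter only. It assumes the recurrence `y[i] = x[i] + a * y[i-1]`; the function names and the three-stage split are illustrative, not the authors' kernels. The key fact used is that a prologue value p entering a block of n samples contributes `a**n * p` to that block's last output, so the true prologues follow from the incomplete ones without refiltering.

```python
def causal(x, a, prologue=0.0):
    """y[i] = x[i] + a * y[i-1], with y[-1] = prologue (assumed convention)."""
    y, prev = [], prologue
    for v in x:
        prev = v + a * prev
        y.append(prev)
    return y


def causal_blocked(x, a, block):
    blocks = [x[i:i + block] for i in range(0, len(x), block)]

    # Stage 1 (independent per block): filter with a zero prologue and keep
    # only the last output, the incomplete prologue for the next block.
    prologues = [causal(b, a)[-1] for b in blocks]

    # Stage 2 (short pass over block borders): fix the prologues.
    # Superposition: true last output = incomplete one + response to the
    # true prologue of the previous block, which is a**n times that prologue.
    for m in range(1, len(blocks)):
        prologues[m] += a ** len(blocks[m]) * prologues[m - 1]

    # Stage 3 (independent per block): filter again, now starting from the
    # corrected prologue of the preceding block, and store the results.
    out = []
    for m, b in enumerate(blocks):
        p = prologues[m - 1] if m > 0 else 0.0
        out.extend(causal(b, a, p))
    return out


print(causal_blocked([1.0, 0.0, 0.0, 0.0], 0.5, 2))  # [1.0, 0.5, 0.25, 0.125]
```

Stages 1 and 3 each touch every sample, so the input is read twice; only stage 2 is sequential, and it visits one value per block rather than one per sample.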
![Page 13: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.vdocuments.us/reader036/viewer/2022062410/56816552550346895dd7ca14/html5/thumbnails/13.jpg)
Algorithm 2
• Adds block parallelism
  • Sung et al. 1986 “Efficient […] recursive […]”, or
  • Blelloch 1990 “Prefix sums […]”
  • + tricks from GPU parallel scan algorithms
[Figure: input, stages (fix, fix, fix, fix), output]
![Page 14: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.vdocuments.us/reader036/viewer/2022062410/56816552550346895dd7ca14/html5/thumbnails/14.jpg)
First-order filter benchmarks
• Alg. RT is the baseline implementation
  • Ruijters et al. 2010 “GPU prefilter […]”
• Alg. 2 adds block parallelism & tricks
  • Sung et al. 1986 “Efficient […] recursive […]”
  • Blelloch 1990 “Prefix sums […]”
  • + tricks from GPU parallel scan algorithms
[Plot: Cubic B-Spline Interpolation (GeForce GTX 480); throughput (GiP/s) vs. input size (pixels), 64² to 4096²; curves: RT, 2]
[Table: Alg., Step Complexity, Max. # of Threads, Memory Bandwidth; rows: RT, 2]
![Page 15: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.vdocuments.us/reader036/viewer/2022062410/56816552550346895dd7ca14/html5/thumbnails/15.jpg)
Optimization roadmap
• Modern GPUs have several hundred cores
• Latency-hiding requires many times more tasks
• Images are not large enough: must parallelize further
• FLOP/IO ratio of recursive filters is too low
• Can use even more FLOPs but must reduce IO
• To do so, we introduce overlapping
[Table: Alg., Step Complexity, Max. # of Threads, Memory Bandwidth; rows: RT, 2]
![Page 16: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.vdocuments.us/reader036/viewer/2022062410/56816552550346895dd7ca14/html5/thumbnails/16.jpg)
Causal-anticausal overlapping
• Start anticausal processing before causal is done
  • Saves reading and writing causal results!
• Compute and store incomplete prologues & epilogues
• Fix incomplete prologues & twice-incomplete epilogues
  • Twice-incomplete epilogues are trickier
• Use them to compute and store anticausal results
![Page 17: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.vdocuments.us/reader036/viewer/2022062410/56816552550346895dd7ca14/html5/thumbnails/17.jpg)
Fixing twice-incomplete epilogues
• Repeatedly apply linearity and superposition
• Tedious derivation, simple result
[Equation labels: twice-incomplete epilogue, corrected prologue, corrected epilogue]
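The following sketch works through the epilogue fix for the simplest case: a first-order causal filter followed by a first-order anticausal filter with the same coefficient. The recurrences, names, and stage split are assumptions for illustration. An epilogue computed per block is "twice incomplete" because it ignores both the prologue entering the block's causal pass and the epilogue entering its anticausal pass; by linearity, each missing contribution can be added back as a constant times the corrected border value.

```python
def causal(x, a, prologue=0.0):
    """y[i] = x[i] + a * y[i-1], with y[-1] = prologue (assumed convention)."""
    y, prev = [], prologue
    for v in x:
        prev = v + a * prev
        y.append(prev)
    return y


def anticausal(y, a, epilogue=0.0):
    """z[i] = y[i] + a * z[i+1], with z[n] = epilogue (assumed convention)."""
    return causal(y[::-1], a, epilogue)[::-1]


def overlapped(x, a, block):
    blocks = [x[i:i + block] for i in range(0, len(x), block)]
    nb = len(blocks)

    # Stage 1 (per block): causal then anticausal with zero borders.
    ybar = [causal(b, a) for b in blocks]
    P = [yb[-1] for yb in ybar]                 # incomplete prologues
    E = [anticausal(yb, a)[0] for yb in ybar]   # twice-incomplete epilogues

    # Stage 2: fix prologues, first block to last.
    for m in range(1, nb):
        P[m] += a ** len(blocks[m]) * P[m - 1]

    # Stage 3: fix epilogues, last block to first.
    for m in range(nb - 1, -1, -1):
        n = len(blocks[m])
        p_prev = P[m - 1] if m > 0 else 0.0
        e_next = E[m + 1] if m + 1 < nb else 0.0
        # Effect of a unit prologue on the block's first anticausal output.
        k = anticausal(causal([0.0] * n, a, 1.0), a)[0]
        E[m] += k * p_prev + a ** n * e_next

    # Stage 4 (per block): both passes with corrected borders; the causal
    # result stays local to the block and is never written out.
    out = []
    for m, b in enumerate(blocks):
        p_prev = P[m - 1] if m > 0 else 0.0
        e_next = E[m + 1] if m + 1 < nb else 0.0
        out.extend(anticausal(causal(b, a, p_prev), a, e_next))
    return out


print(overlapped([1.0, 0.0], 0.5, 1))  # [1.25, 0.5]
```

The constant `k` depends only on the coefficient and the block length, so in practice it could be precomputed once rather than inside the loop.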
![Page 18: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.vdocuments.us/reader036/viewer/2022062410/56816552550346895dd7ca14/html5/thumbnails/18.jpg)
Algorithm 4
• Adds causal-anticausal overlapping
  • Eliminates reading and writing causal results
  • Both in column and in row processing
  • Modest increase in computation
[Figure: input, stages (fix both, fix both), output]
![Page 19: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.vdocuments.us/reader036/viewer/2022062410/56816552550346895dd7ca14/html5/thumbnails/19.jpg)
First-order filter benchmarks
• Alg. RT is the baseline implementation
  • Ruijters et al. 2010 “GPU prefilter […]”
• Alg. 2 adds block parallelism & tricks
  • Sung et al. 1986 “Efficient […] recursive […]”
  • Blelloch 1990 “Prefix sums […]”
  • + tricks from GPU parallel scan algorithms
• Alg. 4 adds causal-anticausal overlapping
  • Eliminates 4hw of IO
  • Modest increase in computation
[Plot: Cubic B-Spline Interpolation (GeForce GTX 480); throughput (GiP/s) vs. input size (pixels), 64² to 4096²; curves: RT, 2, 4]
[Table: Alg., Step Complexity, Max. # of Threads, Memory Bandwidth; rows: RT, 2, 4]
![Page 20: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.vdocuments.us/reader036/viewer/2022062410/56816552550346895dd7ca14/html5/thumbnails/20.jpg)
Algorithm 5
• Adds row-column overlapping
  • Eliminates reading and writing column results
  • Modest increase in computation
[Figure: input, stages (fix all!), output]
![Page 21: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.vdocuments.us/reader036/viewer/2022062410/56816552550346895dd7ca14/html5/thumbnails/21.jpg)
Start from input and global borders
![Page 22: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.vdocuments.us/reader036/viewer/2022062410/56816552550346895dd7ca14/html5/thumbnails/22.jpg)
Load blocks into shared memory
![Page 23: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.vdocuments.us/reader036/viewer/2022062410/56816552550346895dd7ca14/html5/thumbnails/23.jpg)
Compute & store incomplete borders
![Page 24: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.vdocuments.us/reader036/viewer/2022062410/56816552550346895dd7ca14/html5/thumbnails/24.jpg)
Compute & store incomplete borders
![Page 25: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.vdocuments.us/reader036/viewer/2022062410/56816552550346895dd7ca14/html5/thumbnails/25.jpg)
Compute & store incomplete borders
![Page 26: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.vdocuments.us/reader036/viewer/2022062410/56816552550346895dd7ca14/html5/thumbnails/26.jpg)
Compute & store incomplete borders
![Page 27: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.vdocuments.us/reader036/viewer/2022062410/56816552550346895dd7ca14/html5/thumbnails/27.jpg)
Compute & store incomplete borders
![Page 28: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.vdocuments.us/reader036/viewer/2022062410/56816552550346895dd7ca14/html5/thumbnails/28.jpg)
Compute & store incomplete borders
![Page 29: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.vdocuments.us/reader036/viewer/2022062410/56816552550346895dd7ca14/html5/thumbnails/29.jpg)
Compute & store incomplete borders
![Page 30: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.vdocuments.us/reader036/viewer/2022062410/56816552550346895dd7ca14/html5/thumbnails/30.jpg)
Compute & store incomplete borders
![Page 31: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.vdocuments.us/reader036/viewer/2022062410/56816552550346895dd7ca14/html5/thumbnails/31.jpg)
All borders in global memory
![Page 32: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.vdocuments.us/reader036/viewer/2022062410/56816552550346895dd7ca14/html5/thumbnails/32.jpg)
Fix incomplete borders
![Page 33: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.vdocuments.us/reader036/viewer/2022062410/56816552550346895dd7ca14/html5/thumbnails/33.jpg)
Fix twice-incomplete borders
![Page 34: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.vdocuments.us/reader036/viewer/2022062410/56816552550346895dd7ca14/html5/thumbnails/34.jpg)
Fix thrice-incomplete borders
![Page 35: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.vdocuments.us/reader036/viewer/2022062410/56816552550346895dd7ca14/html5/thumbnails/35.jpg)
Fix four-times-incomplete borders
![Page 36: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.vdocuments.us/reader036/viewer/2022062410/56816552550346895dd7ca14/html5/thumbnails/36.jpg)
Done fixing all borders
![Page 37: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.vdocuments.us/reader036/viewer/2022062410/56816552550346895dd7ca14/html5/thumbnails/37.jpg)
Load blocks into shared memory
![Page 38: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.vdocuments.us/reader036/viewer/2022062410/56816552550346895dd7ca14/html5/thumbnails/38.jpg)
Finish causal columns
![Page 39: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.vdocuments.us/reader036/viewer/2022062410/56816552550346895dd7ca14/html5/thumbnails/39.jpg)
Finish anticausal columns
![Page 40: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.vdocuments.us/reader036/viewer/2022062410/56816552550346895dd7ca14/html5/thumbnails/40.jpg)
Finish causal rows
![Page 41: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.vdocuments.us/reader036/viewer/2022062410/56816552550346895dd7ca14/html5/thumbnails/41.jpg)
Finish anticausal rows
![Page 42: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.vdocuments.us/reader036/viewer/2022062410/56816552550346895dd7ca14/html5/thumbnails/42.jpg)
Store results to global memory
![Page 43: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.vdocuments.us/reader036/viewer/2022062410/56816552550346895dd7ca14/html5/thumbnails/43.jpg)
Done!
![Page 44: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.vdocuments.us/reader036/viewer/2022062410/56816552550346895dd7ca14/html5/thumbnails/44.jpg)
Row-column overlapping rules
• Fixing thrice-incomplete row-prologues
• Fixing four-times-incomplete row-epilogues
![Page 45: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.vdocuments.us/reader036/viewer/2022062410/56816552550346895dd7ca14/html5/thumbnails/45.jpg)
First-order filter benchmarks
• Alg. RT is the baseline implementation
  • Ruijters et al. 2010 “GPU prefilter […]”
• Alg. 2 adds block parallelism & tricks
  • Sung et al. 1986 “Efficient […] recursive […]”
  • Blelloch 1990 “Prefix sums […]”
  • + tricks from GPU parallel scan algorithms
• Alg. 4 adds causal-anticausal overlapping
  • Eliminates 4hw of IO
  • Modest increase in computation
• Alg. 5 adds row-column overlapping
  • Eliminates additional 2hw of IO
  • Modest increase in computation
[Plot: Cubic B-Spline Interpolation (GeForce GTX 480); throughput (GiP/s) vs. input size (pixels), 64² to 4096²; curves: RT, 2, 4, 5]
[Table: Alg., Step Complexity, Max. # of Threads, Memory Bandwidth; rows: RT, 2, 4, 5]
![Page 46: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.vdocuments.us/reader036/viewer/2022062410/56816552550346895dd7ca14/html5/thumbnails/46.jpg)
Second-order filter benchmarks
• Alg. 4₂ uses causal-anticausal overlapping
• Alg. 5₂ adds row-column overlapping
  • Added complexity outweighs IO reduction
  • Balance will change (hardware, compiler, implementation)
[Plot: Quintic B-Spline Interpolation (GeForce GTX 480); throughput (GiP/s) vs. input size (pixels), 64² to 4096²; curves: 4₂, 5₂]
[Table: Alg., Step Complexity, Max. # of Threads, Memory Bandwidth; rows: 4₂, 5₂]
![Page 47: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.vdocuments.us/reader036/viewer/2022062410/56816552550346895dd7ca14/html5/thumbnails/47.jpg)
Gaussian blur results
• Overlapped recursive
  • 3rd order approximation
  • complexity
  • van Vliet et al. 1998 “Recursive Gaussian derivative filters”
  • Implemented as 5₁ fused with 4₂
• CUFFT is in frequency domain
  • complexity
• DIR is direct convolution
  • complexity
  • Podlozhnyuk 2007 whitepaper “Image convolution with CUDA”
• Recursive approximation is faster
  • Even for modest size images
  • Also modest standard-deviations
[Plot: Gaussian Blur (GeForce GTX 480); throughput (GiP/s) vs. input size (pixels), 64² to 4096²; curves: DIR 2.5, DIR 5, DIR 10, Overlapped Recursive, CUFFT]
![Page 48: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.vdocuments.us/reader036/viewer/2022062410/56816552550346895dd7ca14/html5/thumbnails/48.jpg)
Summed-area table benchmarks
• Harris et al 2008, GPU Gems 3
  • “Parallel prefix-scan […]”
  • Multi-scan + transpose + multi-scan
  • Implemented with CUDPP
• Hensley 2010, Gamefest
  • “High-quality depth of field”
  • Multi-wave method
  • Our improvements
    • + specialized row and column kernels
    • + save only incomplete borders
    • + fuse row and column stages
• Overlapped SAT
  • Row-column overlapping
[Plot: Summed-area Table (GeForce GTX 480); throughput (GiP/s) vs. input size (pixels), 64² to 4096²; curves: Harris et al [2008], Hensley [2010], Improved Hensley [2010], Overlapped SAT]
• First-order filter, unit coefficient, no anticausal component
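Since a summed-area table is a first-order causal filter with unit coefficient applied down columns and then along rows, the border-fixing idea can be sketched compactly for this case. The code below is a CPU illustration with hypothetical names (`sat_reference`, `sat_blocked`, `col_border`, `row_border`) and a block size `b` chosen by the caller; it is not the authors' CUDA implementation. Each block first stores only its incomplete borders, the borders are then fixed, and a final per-block pass produces the table.

```python
def prefix_sum(x):
    """Causal first-order filter with unit coefficient: a running sum."""
    out, acc = [], 0
    for v in x:
        acc += v
        out.append(acc)
    return out


def sat_reference(img):
    cols = [prefix_sum(c) for c in zip(*img)]   # causal pass down each column
    return [prefix_sum(r) for r in zip(*cols)]  # causal pass along each row


def sat_blocked(img, b):
    h, w = len(img), len(img[0])
    M, N = (h + b - 1) // b, (w + b - 1) // b
    rows = lambda m: range(m * b, min((m + 1) * b, h))
    cols = lambda n: range(n * b, min((n + 1) * b, w))

    # Stage 1 (per block): incomplete borders computed from the block alone.
    # col_border: last row of the block's column sums.
    # row_border: last column of the block-local summed-area table.
    col_border = [[[sum(img[i][j] for i in rows(m)) for j in cols(n)]
                   for n in range(N)] for m in range(M)]
    row_border = [[prefix_sum([sum(img[i][j] for j in cols(n)) for i in rows(m)])
                   for n in range(N)] for m in range(M)]

    # Stage 2: fix column borders from top to bottom.
    for m in range(1, M):
        for n in range(N):
            col_border[m][n] = [u + v for u, v in
                                zip(col_border[m][n], col_border[m - 1][n])]

    # Stage 3: fix row borders from left to right; each also needs the sum of
    # the corrected column border entering the block from above.
    for m in range(M):
        for n in range(N):
            top_sum = sum(col_border[m - 1][n]) if m > 0 else 0
            for li in range(len(row_border[m][n])):
                left = row_border[m][n - 1][li] if n > 0 else 0
                row_border[m][n][li] += left + top_sum

    # Stage 4 (per block): finish columns and rows from the corrected borders.
    sat = [[0] * w for _ in range(h)]
    for m in range(M):
        for n in range(N):
            colrun = list(col_border[m - 1][n]) if m > 0 else [0] * len(cols(n))
            for li, i in enumerate(rows(m)):
                run = row_border[m][n - 1][li] if n > 0 else 0
                for lj, j in enumerate(cols(n)):
                    colrun[lj] += img[i][j]
                    run += colrun[lj]
                    sat[i][j] = run
    return sat


img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(sat_blocked(img, 2))  # [[1, 3, 6], [5, 12, 21], [12, 27, 45]]
```

In this sketch the column-filtered image is never stored: stage 4 keeps the column sums in a per-block buffer and writes only the final table.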
![Page 49: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.vdocuments.us/reader036/viewer/2022062410/56816552550346895dd7ca14/html5/thumbnails/49.jpg)
Future work
• Volumetric processing
  • Overlapping should generalize
  • Not enough shared memory (yet?)
• CPU implementation
  • Blocking should increase L1 cache effectiveness
  • Is doubling amount of computation worth it?
• Solving general narrow-banded linear systems
  • Overlapping back- and forward-substitution
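To make the link to banded systems concrete, here is a small sketch for the first-order case, assuming the recurrences `y[i] = x[i] + a * y[i-1]` and `z[i] = y[i] + a * z[i+1]` with zero borders. Under that assumption the causal pass is a forward substitution and the anticausal pass a back substitution, so together they solve a tridiagonal system T z = x whose entries are written out in `tridiagonal_apply`. The names are hypothetical and this is one illustrative instance, not the general solver the slide refers to.

```python
def causal(x, a):
    """Forward substitution: solves y[i] - a * y[i-1] = x[i]."""
    y, prev = [], 0.0
    for v in x:
        prev = v + a * prev
        y.append(prev)
    return y


def anticausal(y, a):
    """Back substitution: solves z[i] - a * z[i+1] = y[i]."""
    return causal(y[::-1], a)[::-1]


def tridiagonal_apply(z, a):
    """Multiply z by T = (I - a*L)(I - a*U), with L and U the unit shifts.

    T has off-diagonal entries -a and diagonal 1 + a*a, except 1 in row 0.
    """
    n = len(z)
    out = []
    for i in range(n):
        v = (1.0 if i == 0 else 1.0 + a * a) * z[i]
        if i > 0:
            v -= a * z[i - 1]
        if i + 1 < n:
            v -= a * z[i + 1]
        out.append(v)
    return out


x = [1.0, 0.0]
z = anticausal(causal(x, 0.5), 0.5)
print(tridiagonal_apply(z, 0.5))  # [1.0, 0.0]
```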
![Page 50: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.vdocuments.us/reader036/viewer/2022062410/56816552550346895dd7ca14/html5/thumbnails/50.jpg)
Conclusions
• Recursive filters are useful in many applications
  • Cubic and quintic B-Spline interpolation
  • Gaussian-blur approximation
  • Summed-area table computation
• We introduced parallel algorithms for GPUs
  • Overlapping reduces IO requirements
  • Leads to faster algorithms
• Code is available from project page
  • Most is already there, rest is on the way
![Page 51: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.vdocuments.us/reader036/viewer/2022062410/56816552550346895dd7ca14/html5/thumbnails/51.jpg)
Questions?
• baseline: Alg. RT (0.5 GiP/s)
• + block parallelism: Alg. 2 (3 GiP/s)
• + causal-anticausal overlapping: Alg. 4 (5 GiP/s)
• + row-column overlapping: Alg. 5 (6 GiP/s)