gpu-efficient recursive filtering and summed-area tables

51
GPU-Efficient Recursive Filtering and Summed-Area Tables D. Nehab 1 A. Maximo 1 R. S. Lima 2 H. Hoppe 3 1 IMPA 2 Digitok 3 Microsoft Research

Upload: marlee

Post on 24-Feb-2016



TRANSCRIPT

Page 1: GPU-Efficient Recursive Filtering and Summed-Area Tables

GPU-Efficient Recursive Filtering and Summed-Area Tables

D. Nehab¹ A. Maximo¹ R. S. Lima² H. Hoppe³

¹IMPA ²Digitok ³Microsoft Research

Page 2: GPU-Efficient Recursive Filtering and Summed-Area Tables

Recursive filters

• Linear, shift-invariant filters
• But use feedback from earlier outputs

[Figure: input, prologue, and output sequences]

Page 3: GPU-Efficient Recursive Filtering and Summed-Area Tables

Recursive filters

• Linear, shift-invariant filters
• But use feedback from earlier outputs
• Sequential dependency chain

[Figure: input, prologue, and output sequences]
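The sequential dependency chain is easiest to see in a minimal first-order sketch. The sign convention y[k] = x[k] - a*y[k-1], the function name, and the scalar prologue argument are assumptions made for illustration; the slides do not spell them out.

```python
def causal_first_order(x, a, prologue=0.0):
    """Causal first-order recursive filter (assumed convention).

    y[k] = x[k] - a * y[k-1], where y[-1] is supplied by the prologue.
    Each output needs the previous output, so the loop cannot be
    parallelized naively.
    """
    y = []
    prev = prologue
    for xk in x:
        prev = xk - a * prev  # feedback from the earlier output
        y.append(prev)
    return y


# Impulse response with a = -0.5: every output feeds half of itself forward.
impulse = [1.0, 0.0, 0.0, 0.0]
response = causal_first_order(impulse, a=-0.5)
```

With a zero input, a nonzero prologue alone still produces output, which is what later makes prologue corrections possible by superposition.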

Page 4: GPU-Efficient Recursive Filtering and Summed-Area Tables

Applications of recursive filtering

• B-Spline (or other) interpolation

[Figure: input → coefficients (recursive preprocessing step) → interpolation (from coefficients)]
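As a rough sketch of what the recursive preprocessing step computes in the cubic case: the samples s[k] and coefficients c[k] satisfy s[k] = (c[k-1] + 4c[k] + c[k+1]) / 6, and this relation can be inverted with one causal and one anticausal first-order pass using the pole sqrt(3) - 2. The function names and the zero prologue and epilogue are simplifying assumptions; practical prefilters usually initialize the borders more carefully.

```python
import math

Z1 = math.sqrt(3.0) - 2.0  # pole of the cubic B-spline prefilter


def cubic_bspline_coefficients(samples):
    """Recursive prefilter sketch with zero prologue and epilogue (assumed)."""
    # Causal pass: y[k] = s[k] + Z1 * y[k-1]
    y, prev = [], 0.0
    for s in samples:
        prev = s + Z1 * prev
        y.append(prev)
    # Anticausal pass: c[k] = y[k] + Z1 * c[k+1]
    c, nxt = [0.0] * len(y), 0.0
    for k in range(len(y) - 1, -1, -1):
        nxt = y[k] + Z1 * nxt
        c[k] = nxt
    gain = -6.0 * Z1  # overall scale of the factorized inverse filter
    return [gain * v for v in c]


def resample_interior(c):
    """Evaluate the cubic B-spline at the interior integer positions."""
    return [(c[k - 1] + 4.0 * c[k] + c[k + 1]) / 6.0 for k in range(1, len(c) - 1)]


samples = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
coeffs = cubic_bspline_coefficients(samples)
recovered = resample_interior(coeffs)  # matches samples[1:-1] up to rounding
```

Only the interior samples are reproduced exactly here; the two border samples depend on the prologue and epilogue choice.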

Page 5: GPU-Efficient Recursive Filtering and Summed-Area Tables

Applications of recursive filtering

• B-Spline (or other) interpolation
• Fast, wide, Gaussian-blur approximation
• Summed-area tables

[Figure: input → recursive filters → blurred]

Page 6: GPU-Efficient Recursive Filtering and Summed-Area Tables

Causality and order

• Recursive filters can be causal or anticausal
  • Causal goes forward, anticausal in reverse direction
• Filter order is simply the number r of feedbacks

[Figure: input, epilogue, and output sequences]
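A small sketch of both directions for a filter of order r, where r is the number of feedback coefficients. The names `causal` and `anticausal`, the sign convention, and the way the prologue and epilogue are passed as lists are assumptions made for illustration.

```python
def causal(x, a, prologue):
    """Order-r causal filter: y[k] = x[k] - sum_i a[i] * y[k-1-i].

    a        : the r feedback coefficients (r = len(a) is the filter order)
    prologue : [y[-r], ..., y[-1]], the outputs "before" the input starts
    """
    r = len(a)
    y = list(prologue)
    for xk in x:
        # y[-1 - i] is the output produced i + 1 steps earlier
        y.append(xk - sum(a[i] * y[-1 - i] for i in range(r)))
    return y[r:]


def anticausal(x, a, epilogue):
    """Order-r anticausal filter: z[k] = x[k] - sum_i a[i] * z[k+1+i].

    epilogue : [z[n], ..., z[n+r-1]], the outputs "after" the input ends.
    Implemented as the causal filter run over the reversed sequence.
    """
    return causal(x[::-1], a, epilogue[::-1])[::-1]


forward = causal([1.0, 0.0, 0.0, 0.0], [-0.5], [0.0])
backward = anticausal([0.0, 0.0, 0.0, 1.0], [-0.5], [0.0])
# A second-order example: two feedbacks, here generating Fibonacci numbers.
second_order = causal([1.0, 0.0, 0.0, 0.0, 0.0], [-1.0, -1.0], [0.0, 0.0])
```

The anticausal response is the mirror image of the causal one, which is why a single implementation can serve both directions.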

Page 7: GPU-Efficient Recursive Filtering and Summed-Area Tables

Filter sequences and separability

• Often, sequences of recursive filters are needed
• Independent columns
  • Causal
  • Anticausal
• Independent rows
  • Causal
  • Anticausal
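The four-pass sequence can be sketched on a small image stored as a list of rows. This is a plain sequential illustration with zero prologues and epilogues and a single shared first-order coefficient, all of which are assumptions; the helper names are hypothetical.

```python
def causal(x, a):
    """y[k] = x[k] - a * y[k-1], zero prologue (assumed convention)."""
    y, prev = [], 0.0
    for xk in x:
        prev = xk - a * prev
        y.append(prev)
    return y


def anticausal(x, a):
    """Same filter applied in the reverse direction, zero epilogue."""
    return causal(x[::-1], a)[::-1]


def transpose(img):
    return [list(col) for col in zip(*img)]


def filter_rows(img, a):
    # Rows are independent of each other: causal pass, then anticausal pass.
    return [anticausal(causal(row, a), a) for row in img]


def filter_image(img, a):
    # Columns first (as rows of the transposed image), then rows.
    columns_done = transpose(filter_rows(transpose(img), a))
    return filter_rows(columns_done, a)


image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
columns_first = filter_image(image, -0.5)
rows_first = transpose(filter_image(transpose(image), -0.5))
```

Because the filters are linear and act along independent axes, processing columns before rows gives the same result as rows before columns.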

Page 8: GPU-Efficient Recursive Filtering and Summed-Area Tables

Algorithm RT

• The baseline algorithm
• Process columns in parallel, then rows in parallel
• Ruijters et al. 2010 “GPU prefilter […]”

[Figure: stages from input to output: column processing, then row processing]

Page 9: GPU-Efficient Recursive Filtering and Summed-Area Tables

First-order filter benchmarks

• Alg. RT is the baseline implementation
  • Ruijters et al. 2010 “GPU prefilter […]”

[Plot: Cubic B-Spline Interpolation (GeForce GTX 480). Throughput (GiP/s), 1 to 7, versus input size (pixels), 64² to 4096². Curve: RT]

[Table: Alg. | Step Complexity | Max. # of Threads | Used Bandwidth. Row: RT]

Page 10: GPU-Efficient Recursive Filtering and Summed-Area Tables

Optimization roadmap

• Modern GPUs have several hundred cores
• Latency-hiding requires many times more tasks
• Images are not large enough: must parallelize further

[Table: Alg. | Step Complexity | Max. # of Threads | Used Bandwidth. Row: RT]

Page 11: GPU-Efficient Recursive Filtering and Summed-Area Tables

Increasing parallelism

• Similar to parallel prefix-sum algorithms
  • Sengupta et al. 2007 “Scan primitives for GPU computing”
  • Dotsenko et al. 2008 “Fast scan algorithms […]”
• Compute and store incomplete prologues
• Fix incomplete prologues
  • Somewhat more complicated than a recursive invocation
• Use prologues to compute and store causal results

[Figure: input split into blocks, with incomplete prologues marked ✗]

Page 12: GPU-Efficient Recursive Filtering and Summed-Area Tables

Fixing incomplete prologues

[Figure: correcting an incomplete prologue using superposition and linearity]
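The three steps (compute incomplete prologues, fix them, then compute the results) can be sketched for a first-order causal filter. This is a sequential Python stand-in for what would be per-block parallel work on the GPU, not the authors' kernel; the function names and the sign convention y[k] = x[k] - a*y[k-1] are assumptions.

```python
def causal_block(x, a, prologue):
    """y[k] = x[k] - a * y[k-1], with y[-1] = prologue (assumed convention)."""
    y, prev = [], prologue
    for xk in x:
        prev = xk - a * prev
        y.append(prev)
    return y


def blocked_causal(x, a, b, prologue=0.0):
    blocks = [x[i:i + b] for i in range(0, len(x), b)]

    # Stage 1 (independent per block): filter with a zero prologue and keep
    # only the last output, the incomplete prologue of the next block.
    incomplete = [causal_block(blk, a, 0.0)[-1] for blk in blocks]

    # Stage 2 (short sequential pass over blocks): by superposition, the true
    # prologue is the incomplete one plus the response to the previous true
    # prologue; by linearity, that response is (-a)**len(block) times it.
    prologues = [prologue]
    for blk, tail in zip(blocks[:-1], incomplete[:-1]):
        prologues.append(tail + (-a) ** len(blk) * prologues[-1])

    # Stage 3 (independent per block): filter again with the fixed prologues.
    out = []
    for blk, p in zip(blocks, prologues):
        out.extend(causal_block(blk, a, p))
    return out


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
reference = causal_block(signal, -0.5, 0.0)
blocked = blocked_causal(signal, -0.5, 3)  # same values as reference
```

Only one value per block crosses block boundaries in stage 2, which is why that sequential pass is cheap compared with stages 1 and 3.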

Page 13: GPU-Efficient Recursive Filtering and Summed-Area Tables

Algorithm 2

• Adds block parallelism
  • Sung et al. 1986 “Efficient […] recursive […]”, or
  • Blelloch 1990 “Prefix sums […]”
  • + tricks from GPU parallel scan algorithms

[Figure: stages from input to output, with four "fix" steps]

Page 14: GPU-Efficient Recursive Filtering and Summed-Area Tables

First-order filter benchmarks

• Alg. RT is the baseline implementation
  • Ruijters et al. 2010 “GPU prefilter […]”
• Alg. 2 adds block parallelism & tricks
  • Sung et al. 1986 “Efficient […] recursive […]”
  • Blelloch 1990 “Prefix sums […]”
  • + tricks from GPU parallel scan algorithms

[Plot: Cubic B-Spline Interpolation (GeForce GTX 480). Throughput (GiP/s), 1 to 7, versus input size (pixels), 64² to 4096². Curves: RT, 2]

[Table: Alg. | Step Complexity | Max. # of Threads | Memory Bandwidth. Rows: 2, RT]

Page 15: GPU-Efficient Recursive Filtering and Summed-Area Tables

Optimization roadmap

• Modern GPUs have several hundred cores
• Latency-hiding requires many times more tasks
• Images are not large enough: must parallelize further
• FLOP/IO ratio of recursive filters is too low
• Can use even more FLOPs but must reduce IO
• To do so, we introduce overlapping

[Table: Alg. | Step Complexity | Max. # of Threads | Memory Bandwidth. Rows: 2, RT]

Page 16: GPU-Efficient Recursive Filtering and Summed-Area Tables

Causal-anticausal overlapping

• Start anticausal processing before causal is done
  • Saves reading and writing causal results!
• Compute and store incomplete prologues & epilogues
• Fix incomplete prologues & twice-incomplete epilogues
  • Twice-incomplete epilogues are trickier
• Use them to compute and store anticausal results

Page 17: GPU-Efficient Recursive Filtering and Summed-Area Tables

Fixing twice-incomplete epilogues

• Repeatedly apply linearity and superposition
• Tedious derivation, simple result

[Figure: twice-incomplete epilogue, corrected prologue, corrected epilogue]
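One way to see why the result is simple: for a first-order causal pass followed by an anticausal pass, a block's first anticausal output is linear in three things, namely the block's input, its incoming prologue, and its incoming epilogue. The sketch below works under that observation only. It is a sequential Python illustration, not the authors' derivation or kernel, and it assumes the same coefficient for both passes, zero global prologue and epilogue, and hypothetical function names.

```python
def causal(x, a, prologue):
    """y[k] = x[k] - a * y[k-1], with y[-1] = prologue."""
    y, prev = [], prologue
    for xk in x:
        prev = xk - a * prev
        y.append(prev)
    return y


def anticausal(y, a, epilogue):
    """z[k] = y[k] - a * z[k+1], with z[n] = epilogue."""
    return causal(y[::-1], a, epilogue)[::-1]


def unit_responses(length, a):
    """Border responses of a zero-input block to a unit prologue or epilogue."""
    zeros = [0.0] * length
    y_unit = causal(zeros, a, 1.0)
    p_to_p = y_unit[-1]                     # prologue -> next prologue
    p_to_e = anticausal(y_unit, a, 0.0)[0]  # prologue -> this block's head
    e_to_e = anticausal(zeros, a, 1.0)[0]   # epilogue -> this block's head
    return p_to_p, p_to_e, e_to_e


def overlapped_filter(x, a, b):
    blocks = [x[i:i + b] for i in range(0, len(x), b)]
    n = len(blocks)
    if n == 0:
        return []
    # Stage 1 (per block): both passes with zero borders; keep only the
    # incomplete prologue and the twice-incomplete epilogue.
    p_bar, e_bar = [], []
    for blk in blocks:
        y_bar = causal(blk, a, 0.0)
        p_bar.append(y_bar[-1])
        e_bar.append(anticausal(y_bar, a, 0.0)[0])
    resp = [unit_responses(len(blk), a) for blk in blocks]
    # Stage 2a (forward over blocks): fix prologues. p[m] enters block m.
    p = [0.0] * n
    for m in range(1, n):
        p[m] = p_bar[m - 1] + resp[m - 1][0] * p[m - 1]
    # Stage 2b (backward over blocks): fix epilogues. e[m] enters block m and
    # needs both the fixed prologue and the fixed epilogue of block m + 1.
    e = [0.0] * n
    for m in range(n - 2, -1, -1):
        nxt = m + 1
        e[m] = e_bar[nxt] + resp[nxt][1] * p[nxt] + resp[nxt][2] * e[nxt]
    # Stage 3 (per block): causal then anticausal without storing the
    # causal result outside the block.
    out = []
    for blk, pm, em in zip(blocks, p, e):
        out.extend(anticausal(causal(blk, a, pm), a, em))
    return out


signal = [1.0, 2.0, 0.0, -1.0, 3.0, 5.0, 4.0]
reference = anticausal(causal(signal, -0.5, 0.0), -0.5, 0.0)
overlapped = overlapped_filter(signal, -0.5, 3)  # same values as reference
```

The causal output of each block never leaves the block in stage 3, which is the saving in reads and writes that the overlapping is after.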

Page 18: GPU-Efficient Recursive Filtering and Summed-Area Tables

Algorithm 4

• Adds causal-anticausal overlapping
  • Eliminates reading and writing causal results
  • Both in column and in row processing
  • Modest increase in computation

[Figure: stages from input to output, with two "fix both" steps]

Page 19: GPU-Efficient Recursive Filtering and Summed-Area Tables

First-order filter benchmarks

• Alg. RT is the baseline implementation
  • Ruijters et al. 2010 “GPU prefilter […]”
• Alg. 2 adds block parallelism & tricks
  • Sung et al. 1986 “Efficient […] recursive […]”
  • Blelloch 1990 “Prefix sums […]”
  • + tricks from GPU parallel scan algorithms
• Alg. 4 adds causal-anticausal overlapping
  • Eliminates 4hw of IO
  • Modest increase in computation

[Plot: Cubic B-Spline Interpolation (GeForce GTX 480). Throughput (GiP/s), 1 to 7, versus input size (pixels), 64² to 4096². Curves: RT, 2, 4]

[Table: Alg. | Step Complexity | Max. # of Threads | Memory Bandwidth. Rows: 4, 2, RT]

Page 20: GPU-Efficient Recursive Filtering and Summed-Area Tables

Algorithm 5

• Adds row-column overlapping
  • Eliminates reading and writing column results
  • Modest increase in computation

[Figure: stages from input to output, with a single "fix all!" step]

Page 21: GPU-Efficient Recursive Filtering and Summed-Area Tables

Start from input and global borders

Page 22: GPU-Efficient Recursive Filtering and Summed-Area Tables

Load blocks into shared memory

Page 23: GPU-Efficient Recursive Filtering and Summed-Area Tables

Compute & store incomplete borders

Page 24: GPU-Efficient Recursive Filtering and Summed-Area Tables

Compute & store incomplete borders

Page 25: GPU-Efficient Recursive Filtering and Summed-Area Tables

Compute & store incomplete borders

Page 26: GPU-Efficient Recursive Filtering and Summed-Area Tables

Compute & store incomplete borders

Page 27: GPU-Efficient Recursive Filtering and Summed-Area Tables

Compute & store incomplete borders

Page 28: GPU-Efficient Recursive Filtering and Summed-Area Tables

Compute & store incomplete borders

Page 29: GPU-Efficient Recursive Filtering and Summed-Area Tables

Compute & store incomplete borders

Page 30: GPU-Efficient Recursive Filtering and Summed-Area Tables

Compute & store incomplete borders

Page 31: GPU-Efficient Recursive Filtering and Summed-Area Tables

All borders in global memory

Page 32: GPU-Efficient Recursive Filtering and Summed-Area Tables

Fix incomplete borders

Page 33: GPU-Efficient Recursive Filtering and Summed-Area Tables

Fix twice-incomplete borders

Page 34: GPU-Efficient Recursive Filtering and Summed-Area Tables

Fix thrice-incomplete borders

Page 35: GPU-Efficient Recursive Filtering and Summed-Area Tables

Fix four-times-incomplete borders

Page 36: GPU-Efficient Recursive Filtering and Summed-Area Tables

Done fixing all borders

Page 37: GPU-Efficient Recursive Filtering and Summed-Area Tables

Load blocks into shared memory

Page 38: GPU-Efficient Recursive Filtering and Summed-Area Tables

Finish causal columns

Page 39: GPU-Efficient Recursive Filtering and Summed-Area Tables

Finish anticausal columns

Page 40: GPU-Efficient Recursive Filtering and Summed-Area Tables

Finish causal rows

Page 41: GPU-Efficient Recursive Filtering and Summed-Area Tables

Finish anticausal rows

Page 42: GPU-Efficient Recursive Filtering and Summed-Area Tables

Store results to global memory

Page 43: GPU-Efficient Recursive Filtering and Summed-Area Tables

Done!

Page 44: GPU-Efficient Recursive Filtering and Summed-Area Tables

Row-column overlapping rules

• Fixing thrice-incomplete row-prologues
• Fixing four-times-incomplete row-epilogues

Page 45: GPU-Efficient Recursive Filtering and Summed-Area Tables

First-order filter benchmarks

• Alg. RT is the baseline implementation
  • Ruijters et al. 2010 “GPU prefilter […]”
• Alg. 2 adds block parallelism & tricks
  • Sung et al. 1986 “Efficient […] recursive […]”
  • Blelloch 1990 “Prefix sums […]”
  • + tricks from GPU parallel scan algorithms
• Alg. 4 adds causal-anticausal overlapping
  • Eliminates 4hw of IO
  • Modest increase in computation
• Alg. 5 adds row-column overlapping
  • Eliminates additional 2hw of IO
  • Modest increase in computation

[Plot: Cubic B-Spline Interpolation (GeForce GTX 480). Throughput (GiP/s), 1 to 7, versus input size (pixels), 64² to 4096². Curves: RT, 2, 4, 5]

[Table: Alg. | Step Complexity | Max. # of Threads | Memory Bandwidth. Rows: 5, 4, 2, RT]

Page 46: GPU-Efficient Recursive Filtering and Summed-Area Tables

Second-order filter benchmarks

• Alg. 4₂ uses causal-anticausal overlapping
• Alg. 5₂ adds row-column overlapping
  • Added complexity outweighs IO reduction
  • Balance will change (hardware, compiler, implementation)

[Plot: Quintic B-Spline Interpolation (GeForce GTX 480). Throughput (GiP/s), 1 to 5, versus input size (pixels), 64² to 4096². Curves: 5₂, 4₂]

[Table: Alg. | Step Complexity | Max. # of Threads | Memory Bandwidth. Rows: 4₂, 5₂]

Page 47: GPU-Efficient Recursive Filtering and Summed-Area Tables

Gaussian blur results

• Overlapped recursive
  • 3rd order approximation
  • complexity
  • van Vliet et al. 1998 “Recursive Gaussian derivative filters”
  • Implemented as 5₁ fused with 4₂
• CUFFT is in frequency domain
  • complexity
• DIR is direct convolution
  • complexity
  • Podlozhnyuk 2007 whitepaper “Image convolution with CUDA”
• Recursive approximation is faster
  • Even for modest size images
  • Also modest standard-deviations

[Plot: Gaussian Blur (GeForce GTX 480). Throughput (GiP/s), 1 to 4, versus input size (pixels), 64² to 4096². Curves: DIR 2.5, DIR 5, DIR 10, Overlapped Recursive, CUFFT]

Page 48: GPU-Efficient Recursive Filtering and Summed-Area Tables

Summed-area table benchmarks

• First-order filter, unit coefficient, no anticausal component
• Harris et al 2008, GPU Gems 3
  • “Parallel prefix-scan […]”
  • Multi-scan + transpose + multi-scan
  • Implemented with CUDPP
• Hensley 2010, Gamefest
  • “High-quality depth of field”
  • Multi-wave method
  • Our improvements
    + specialized row and column kernels
    + save only incomplete borders
    + fuse row and column stages
• Overlapped SAT
  • Row-column overlapping

[Plot: Summed-area Table (GeForce GTX 480). Throughput (GiP/s), 1 to 9, versus input size (pixels), 64² to 4096². Curves: Harris et al [2008], Hensley [2010], Improved Hensley [2010], Overlapped SAT]
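Viewed as a first-order causal filter with unit feedback and no anticausal pass, a summed-area table is a running sum along rows followed by a running sum along columns. The sketch below is a plain sequential illustration; the function names and the inclusive box convention are assumptions.

```python
from itertools import accumulate


def summed_area_table(img):
    """sat[i][j] = sum of img[0..i][0..j], via two causal running sums."""
    rows = [list(accumulate(row)) for row in img]          # causal pass along rows
    cols = [list(accumulate(col)) for col in zip(*rows)]   # causal pass along columns
    return [list(r) for r in zip(*cols)]                   # back to row-major order


def box_sum(sat, top, left, bottom, right):
    """Sum of img[top..bottom][left..right] (inclusive) from four lookups."""
    total = sat[bottom][right]
    if top > 0:
        total -= sat[top - 1][right]
    if left > 0:
        total -= sat[bottom][left - 1]
    if top > 0 and left > 0:
        total += sat[top - 1][left - 1]
    return total


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
sat = summed_area_table(image)
lower_right = box_sum(sat, 1, 1, 2, 2)  # 5 + 6 + 8 + 9
```

Once the table is built, any axis-aligned box sum costs at most four lookups regardless of the box size.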

Page 49: GPU-Efficient Recursive Filtering and Summed-Area Tables

Future work

• Volumetric processing
  • Overlapping should generalize
  • Not enough shared memory (yet?)
• CPU implementation
  • Blocking should increase L1 cache effectiveness
  • Is doubling amount of computation worth it?
• Solving general narrow-banded linear systems
  • Overlapping back- and forward-substitution

Page 50: GPU-Efficient Recursive Filtering and Summed-Area Tables

Conclusions

• Recursive filters are useful in many applications
  • Cubic and quintic B-Spline interpolation
  • Gaussian-blur approximation
  • Summed-area table computation
• We introduced parallel algorithms for GPUs
  • Overlapping reduces IO requirements
  • Leads to faster algorithms
• Code is available from project page
  • Most is already there, rest is on the way

Page 51: GPU-Efficient Recursive Filtering and Summed-Area Tables

Questions?

baseline

Alg. RT (0.5 GiP/s)

+ block parallelism

Alg. 2 (3 GiP/s)

+ causal-anticausal overlapping

Alg. 4 (5 GiP/s)

+ row-column overlapping

Alg. 5 (6 GiP/s)