GPU-Efficient Recursive Filtering and Summed-Area Tables
Jeremiah van Oosten, Reinier van Oeveren
Introduction
Related Works
◦ Prefix Sums and Scans
◦ Recursive Filtering
◦ Summed-Area Tables
Problem Definition
Parallelization Strategies
◦ Baseline (Algorithm RT)
◦ Block Notation
◦ Inter-block Parallelism
◦ Kernel Fusion (Algorithm 2)
Overlapping
◦ Causal-Anticausal Overlapping (Algorithms 3 & 4)
◦ Row-Column Causal-Anticausal Overlapping (Algorithm 5)
Summed-Area Tables
◦ Overlapped Summed-Area Tables (Algorithm SAT)
Results
Conclusion
Table of Contents
Introduction
Linear filtering is commonly used to blur, sharpen or down-sample images.
A direct implementation evaluating a filter of support d on an h × w image has a cost of O(hwd).
Introduction
The cost of the image filter can be reduced by using a recursive filter, in which case previously computed results are used to compute the current value.
The cost is then reduced to O(hwr), where r is the number of recursive feedbacks.
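As a concrete illustration (not taken from the slides), here is a minimal NumPy sketch of a first-order causal recursive filter applied down the columns of an image; the function name and coefficient value are our own choices. The inner loop does O(r) work per pixel with r = 1, so the total cost is O(hwr) regardless of the filter's effective support:

```python
import numpy as np

def causal_first_order(x, a1):
    """Apply y[i] = x[i] - a1 * y[i-1] down each column (order r = 1).

    Cost is O(h*w*r), independent of the filter's effective support.
    """
    y = np.empty_like(x, dtype=float)
    y[0] = x[0]
    for i in range(1, x.shape[0]):
        y[i] = x[i] - a1 * y[i - 1]
    return y

img = np.ones((4, 3))
# With a1 = -0.5 each output row adds half of the previous output row.
out = causal_first_order(img, a1=-0.5)
```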
Introduction
At each step, the filter produces an output element by a linear combination of the input element and previously computed output elements.
Recursive Filters
O(hwr)
Continue…
Applications of recursive filters:
◦ Low-pass filtering, e.g. Gaussian kernels
◦ Inverse convolution
◦ Summed-area tables
Recursive Filters
(Figure: input image and blurred result produced by recursive filters.)
Recursive filters can be causal or anticausal (or non-causal).
Causal filters operate on previous values.
Anticausal filters operate on “future” values.
Causality
Continue…
z = R(y, e), where
z_i = y_i − ∑_{k=1}^{r} a′_k z_{i+k}
It is often required to perform a sequence of recursive image filters.
Filter Sequences
(Diagram: input X, column prologue P and epilogue E, row prologue P′ and epilogue E′, intermediates Y, Z, U, and output V.)
Independent columns:
◦ Causal: Y = F(P, X)
◦ Anticausal: Z = R(Y, E)
Independent rows:
◦ Causal: U = F^τ(P′, Z)
◦ Anticausal: V = R^τ(U, E′)
The naïve approach to solving the sequence of recursive filters does not sufficiently utilize the processing cores of the GPU.
The latest GPU from NVIDIA has 2,688 shader cores. Processing even large images (2048×2048) will not make full use of all available cores.
Under-utilization of the GPU cores does not allow for latency hiding. We need a way to make better use of the GPU without increasing IO.
Maximizing Parallelism
In the paper “GPU-Efficient Recursive Filtering and Summed-Area Tables”, Diego Nehab et al. introduce a new algorithmic framework to reduce memory bandwidth by overlapping computation over the full sequence of recursive filters.
Overlapping
Partition the image into 2D blocks of size .
Block Partitioning
Related Works
A prefix sum is a simple case of a first-order recursive filter. A scan generalizes the recurrence using an arbitrary binary associative operator.
Parallel prefix sums and scans are important building blocks for numerous algorithms [Iverson 1962; Stone 1971; Blelloch 1989; Sengupta et al. 2007].
An optimized implementation comes with the CUDPP library [2011].
Prefix Sums and Scans
A generalization of the prefix sum using a weighted combination of prior outputs.
This can be implemented as a scan operation with redefined basic operators.
Ruijters and Thévenaz [2010] exploit parallelism across the rows and columns of the input.
Recursive Filtering
Sung and Mitra [1986] use block parallelism and split the computation into two parts:
◦ One computation based only on the block data, assuming zero initial conditions.
◦ One computation based only on the initial conditions, assuming zero block data.
Recursive Filtering
Summed-area tables enable averaging rectangular regions of pixels with a constant number of reads.
Summed-Area Tables
(Figure: a rectangle of size width × height with four corner reads: +LR, −LL, −UR, +UL.)
avg = (LR − LL − UR + UL) / (width · height)
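The four-corner lookup can be sketched with NumPy prefix sums (our own illustration; `box_sum` is a hypothetical helper, and the subtracted reads are taken one pixel outside the rectangle):

```python
import numpy as np

def summed_area_table(img):
    # Prefix sums down the columns, then across the rows.
    return img.cumsum(axis=0).cumsum(axis=1)

def box_sum(sat, r0, c0, r1, c1):
    """Sum of img[r0:r1+1, c0:c1+1] from at most four table reads:
    LR - LL - UR + UL, with the LL/UR/UL reads just outside the box."""
    total = sat[r1, c1]                   # +LR
    if r0 > 0:
        total -= sat[r0 - 1, c1]          # -UR
    if c0 > 0:
        total -= sat[r1, c0 - 1]          # -LL
    if r0 > 0 and c0 > 0:
        total += sat[r0 - 1, c0 - 1]      # +UL
    return total

img = np.arange(16.0).reshape(4, 4)
sat = summed_area_table(img)
# Average over the 2x2 box img[1:3, 1:3].
avg = box_sum(sat, 1, 1, 2, 2) / (2 * 2)
```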
The paper titled “Fast Summed-Area Table Generation…” by Justin Hensley et al. (2005) describes a method called recursive doubling, which requires multiple passes over the input image (a 256×256 image requires 16 passes to compute).
Summed-Area Tables
(Figure: recursive-doubling passes alternating between Image A and Image B.)
In 2010, Justin Hensley extended his 2005 implementation to compute shaders, taking more samples per pass and storing intermediate results in shared memory. A 256×256 image then requires only 4 passes when reading 16 samples per pass.
Summed-Area Tables
Problem Definition
Causal recursive filters of order r are characterized by a set of feedback coefficients a_k in the following manner.
Given a prologue vector p and an input vector x of any size, the filter produces the output:
Problem Definition
y = F(p, x)
such that the output y has the same size as the input x.
Causal recursive filters depend on a prologue vector p.
Problem Definition
y_i = x_i − ∑_{k=1}^{r} a_k y_{i−k}
Similarly for the anticausal filter: given an input vector y and an epilogue vector e, the output vector z = R(y, e) is defined by:
z_i = y_i − ∑_{k=1}^{r} a′_k z_{i+k}
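The two recurrences can be prototyped directly. This is a NumPy sketch under our own naming (`F`, `R` mirror the slide notation): the prologue p supplies y_{−r}…y_{−1} and the epilogue e supplies z_{n}…z_{n+r−1}:

```python
import numpy as np

def F(p, x, a):
    """Causal filter: y[i] = x[i] - sum_{k=1..r} a[k] * y[i-k]."""
    r = len(a)
    buf = np.concatenate([np.asarray(p, float), np.zeros(len(x))])
    for i in range(len(x)):
        buf[r + i] = x[i] - sum(a[k] * buf[r + i - 1 - k] for k in range(r))
    return buf[r:]

def R(y, e, ap):
    """Anticausal filter: z[i] = y[i] - sum_{k=1..r} ap[k] * z[i+k]."""
    r = len(ap)
    buf = np.concatenate([np.zeros(len(y)), np.asarray(e, float)])
    for i in range(len(y) - 1, -1, -1):
        buf[i] = y[i] - sum(ap[k] * buf[i + 1 + k] for k in range(r))
    return buf[:len(y)]

# With a_1 = -1 the causal filter degenerates to a running (prefix) sum,
# and the anticausal filter to a reverse running sum.
ys = F([0.0], [1.0, 2.0, 3.0], a=[-1.0])
zs = R([1.0, 1.0, 1.0], [0.0], ap=[-1.0])
```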
For row processing, we define an extended causal filter F^τ and anticausal filter R^τ.
Problem Definition
u = F^τ(p′, z)
v = R^τ(u, e′)
With these definitions, we are able to formulate the problem of applying the full sequence of four recursive filters (down, up, right, left).
Problem Definition
(Diagram: input X, column prologue P and epilogue E, row prologue P′ and epilogue E′, intermediates Y, Z, U, and output V.)
Independent columns:
◦ Causal: Y = F(P, X)
◦ Anticausal: Z = R(Y, E)
Independent rows:
◦ Causal: U = F^τ(P′, Z)
◦ Anticausal: V = R^τ(U, E′)
The goal is to implement this algorithm on the GPU to make full use of all available resources:
◦ Maximize occupancy by splitting the problem up to make use of all cores.
◦ Reduce I/O to global memory.
We must break the dependency chain in order to increase task parallelism.
Primary design goal: Increase the amount of parallelism without increasing memory I/O.
Problem Definition
Prior Parallelization Strategies
Baseline algorithm ‘RT’
Block notation
Inter-block parallelism
Kernel fusion
Prior Parallelization strategies
Algorithm RT (Ruijters & Thévenaz): independent row and column processing.
Step RT1: In parallel for each column in X, apply F sequentially and store Y.
Step RT2: In parallel for each column in Y, apply R sequentially and store Z.
Step RT3: In parallel for each row in Z, apply F^τ sequentially and store U.
Step RT4: In parallel for each row in U, apply R^τ sequentially and store V.
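The four passes can be mimicked on the CPU for a first-order filter pair. This is a NumPy sketch of our own (zero prologues/epilogues, coefficients `a1`/`a1p` chosen for illustration); on the GPU, each column loop and each row loop body runs in its own thread:

```python
import numpy as np

def rt(x, a1, a1p):
    """Algorithm-RT-style sequence of four passes, order r = 1."""
    y = x.astype(float).copy()
    for i in range(1, y.shape[0]):           # RT1: causal, down each column
        y[i] -= a1 * y[i - 1]
    for i in range(y.shape[0] - 2, -1, -1):  # RT2: anticausal, up each column
        y[i] -= a1p * y[i + 1]
    for j in range(1, y.shape[1]):           # RT3: causal, along each row
        y[:, j] -= a1 * y[:, j - 1]
    for j in range(y.shape[1] - 2, -1, -1):  # RT4: anticausal, along each row
        y[:, j] -= a1p * y[:, j + 1]
    return y

# With a1 = a1p = -1 every pass is a (forward or reverse) running sum.
out = rt(np.ones((2, 2)), a1=-1.0, a1p=-1.0)
```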
(Diagram: Algorithm RT, input → stages → output; column processing followed by row processing.)
Completion takes 4r(…) steps; bandwidth usage in total is ….
(Legend: … = streaming multiprocessors; … = number of cores per processor; … = width of the input image; … = height of the input image; … = order of the applied filter.)
Algorithm RT performance
Partition the input image into b×b blocks, where b = the number of threads in a warp (= 32).
What means what?
◦ B_m(X) = block of matrix X with index m
◦ P_m(Y) = column-prologue submatrix
◦ E_m(Z) = column-epilogue submatrix
For rows we have (similar) transposed operators: P^τ and E^τ.
Block notation (1)
Block notation (1 cont’d)
Tail and head operators select prologue- and epilogue-shaped submatrices from a block.
Block notation (2)
Result: a blocked version of the problem definition.
Block notation (3)
Superposition (based on linearity)
Effects of the input and prologue/epilogue on the output can be computed independently
Some useful key properties (1)
Express the filters as matrix products. For any r, I_r denotes the identity matrix.
Some useful key properties (2)
A_{FP} and its counterparts are precomputed matrices that depend only on the feedback coefficients of filters F and R, respectively. Details are in the paper.
Perform block computation independently
Inter-block parallelism (1)
B_m(y) = F(P_{m−1}(y), B_m(x))
(output block; the prologue is the tail of the previous output block)
By superposition:
B_m(y) = F(0, B_m(x)) + F(P_{m−1}(y), 0)
Inter-block parallelism (2): B_m(y) = F(0, B_m(x)) + F(P_{m−1}(y), 0)
The first term, F(0, B_m(x)), is the incomplete causal output; the second term can be written as a matrix product:
F(P_{m−1}(y), 0) = F(I_r, 0) · P_{m−1}(y) = A_{FP} · P_{m−1}(y)
Inter-block parallelism (3)
1.1 In parallel for all m, compute and store each incomplete block result.
1.2 Sequentially for each m, compute and store the prologue P_{m−1}(y) according to (1), using the previously computed values.
1.3 In parallel for all m, compute and store the output block B_m(y) using (2) and the previously computed P_{m−1}(y).
Recall (2): B_m(y) = F(0, B_m(x)) + F(P_{m−1}(y), 0).
Algorithm 1
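For a first-order prefix sum (a_1 = −1, r = 1) the three stages of Algorithm 1 reduce to the familiar blocked scan. A NumPy sketch of our own (`b` is the block length; stages 1.1 and 1.3 would run in parallel on the GPU, while 1.2 is the short sequential fix-up):

```python
import numpy as np

def blocked_prefix_sum(x, b):
    blocks = x.reshape(-1, b)
    # 1.1: incomplete per-block scans (parallel across blocks on a GPU).
    local = blocks.cumsum(axis=1)
    # 1.2: sequential fix-up of the one-element "prologue" (the carry).
    carry = np.zeros(blocks.shape[0])
    carry[1:] = np.cumsum(local[:-1, -1])
    # 1.3: superposition -- add each block's completed prologue (parallel).
    return (local + carry[:, None]).ravel()

out = blocked_prefix_sum(np.arange(1.0, 9.0), b=4)
```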
Inter-block parallelism (4): processing all rows and columns using causal and anticausal filter pairs requires 4 successive applications of Algorithm 1.
There are many independent tasks, which hides memory access latency.
However, the memory bandwidth usage is now significantly more than Algorithm RT. This can be solved with kernel fusion.
Original idea: Kirk & Hwu [2010].
Use the output of one kernel as input for the next without going through global memory.
A fused kernel contains code from both kernels but keeps intermediate results in shared memory.
Kernel fusion (1)
Use Algorithm 1 for all filters, then fuse:
◦ Fuse the last stage of the first filter with the first stage of the second.
◦ Fuse the last stage of the second with the first stage of the third.
◦ Fuse the last stage of the third with the first stage of the fourth.
Kernel fusion (2)
We aimed for bandwidth reduction. Did it work? Comparing Algorithm 1 with Algorithm 2: yes, it did!
Kernel fusion (3), Algorithm 2
(Diagram: Algorithm 2, input → stages → output, with four “fix” passes.)
* For the full algorithm in text, please see the paper.
Further I/O reduction is still possible by recomputing intermediate results instead of storing them in memory.
More bandwidth reduction: good.
Number of steps increases: ≈ bad*.
Bandwidth usage is less than Algorithm RT(!) but involves more computation. *Future hardware may tip the balance in favor of more computation.
Kernel fusion (4)
Overlapping
Overlapping is introduced to reduce IO to global memory.
It is possible to work with twice-incomplete anticausal epilogues, computed directly from the incomplete causal output block.
This is called causal-anticausal overlapping.
Causal-Anticausal Overlapping
Recall that we can express the filter so that the input and the prologue or epilogue can be computed independently and later added together.
Causal-Anticausal Overlapping
F(p, x) = F(0, x) + F(p, 0)
R(y, e) = R(y, 0) + R(0, e)
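This identity is just linearity, and it is easy to check numerically. A NumPy sketch of our own, with arbitrary order-2 coefficients and `F` following the causal definition given earlier:

```python
import numpy as np

def F(p, x, a):
    """Causal filter: y[i] = x[i] - sum_{k=1..r} a[k]*y[i-k]; p holds y[-r..-1]."""
    r = len(a)
    buf = np.concatenate([np.asarray(p, float), np.zeros(len(x))])
    for i in range(len(x)):
        buf[r + i] = x[i] - sum(a[k] * buf[r + i - 1 - k] for k in range(r))
    return buf[r:]

rng = np.random.default_rng(0)
p = rng.standard_normal(2)        # prologue, r = 2
x = rng.standard_normal(8)        # input
a = [0.4, -0.2]                   # illustrative feedback coefficients

# Superposition: effects of input and prologue computed independently.
lhs = F(p, x, a)
rhs = F(np.zeros(2), x, a) + F(p, np.zeros(8), a)
```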
Using the previous properties, we can split the dependency chains of anticausal epilogues.
Causal-Anticausal Overlapping
Which can be further simplified to:
Causal-Anticausal Overlapping
Where the twice-incomplete is such that
Causal-Anticausal Overlapping
Each twice-incomplete epilogue depends only on the corresponding input block and therefore they can all be computed in parallel already in the first pass. As a byproduct of that same pass, we can compute and store the that will be needed to obtain . With , we can compute all in the following pass.
1. In parallel for all , compute and store and .
2. Sequentially for each , compute and store the using the previously computed .
3. Sequentially for each , compute and store using the previously computed and .
4. In parallel for all , compute each causal output block using the previously computed . Then compute and store each anticausal output block using the previously computed .
Algorithm 3 (Row & Column)
Algorithm 3 computes rows and columns in separate passes. Fusing these two stages results in Algorithm 4.
Algorithm 4
1. In parallel for all and , compute and store the and .
2. Sequentially for each , but in parallel for each , compute and store the using the previously computed .
3. Sequentially for each , but in parallel for each , compute and store the using the previously computed and .
4. In parallel for all and , compute using the previously computed . Then compute and store the using the previously computed . Finally, compute and store both and .
5. Sequentially for each , but in parallel for each , compute and store the from .
6. Sequentially for each , but in parallel for each , compute and store each using the previously computed and .
7. In parallel for all and , compute using the previously computed and . Then compute and store the using the previously computed .
Algorithm 4
Adds causal-anticausal overlapping:
◦ Eliminates reading and writing causal results, both in column and in row processing
◦ Modest increase in computation
(Diagram: Algorithm 4, input → stages → output, with two “fix both” passes.)
There is still one source of inefficiency in Algorithm 4: we wait until the complete block is available in stage 4 before computing the incomplete and twice-incomplete results.
We can overlap row and column computations and work with thrice-incomplete transposed epilogues obtained directly during stage 1 of Algorithm 4.
Row-Column Causal-Anticausal Overlapping
Below is the formula for completing the thrice-incomplete transposed prologues:
Row-Column Causal-Anticausal Overlapping
The thrice-incomplete satisfies
Row-Column Causal-Anticausal Overlapping
To complete the four-times-incomplete transposed epilogues of :
1. In parallel for all and , compute and store each , , , and .
2. In parallel for all , sequentially for each , compute and store the using the previously computed .
3. In parallel for all , sequentially for each , compute and store using the previously computed and .
4. In parallel for all , sequentially for each , compute and store using the previously computed , , and .
5. In parallel for all , sequentially for each , compute and store using the previously computed , , , and .
6. In parallel for all and , successively compute , , , and using the previously computed , , , and . Store .
Algorithm 5
Adds row-column overlapping:
◦ Eliminates reading and writing column results
◦ Modest increase in computation
(Diagram: Algorithm 5, input → stages → output, with a single “fix all” pass.)
Start from input and global borders
Load blocks into shared memory
Compute & store incomplete borders
All borders in global memory
Fix incomplete borders
Fix twice-incomplete borders
Fix thrice-incomplete borders
Fix four-times-incomplete borders
Done fixing all borders
Load blocks into shared memory
Finish causal columns
Finish anticausal columns
Finish causal rows
Finish anticausal rows
Store results to global memory
Done!
Overlapped Summed-Area Tables
A summed-area table is obtained using prefix sums over columns and rows.
The prefix-sum filter is a special case: a first-order causal recursive filter with feedback coefficient a_1 = −1.
We can directly apply overlapping to optimize the computation of summed-area tables.
Overlapped Summed-Area Tables
In blocked form, the problem is to obtain the output V from the input X, where the blocks satisfy the relations:
Overlapped Summed-Area Tables
B_{m,n}(Y) = S(P_{m−1,n}(Y), B_{m,n}(X))
B_{m,n}(V) = S^τ(P^τ_{m,n−1}(V), B_{m,n}(Y))
Using the strategy developed for causal-anticausal overlapping, computing Y and V with overlapping becomes easy.
In the first stage, we compute the incomplete output blocks directly from the input:
Overlapped Summed-Area Tables
B̂_{m,n}(Y) = S(0, B_{m,n}(X))
B̂_{m,n}(V) = S^τ(0, B̂_{m,n}(Y))
(the hat denotes an incomplete result)
We store only the incomplete prologues P̂_{m,n}(Y) and P̂^τ_{m,n}(V). Then we complete them using:
Overlapped Summed-Area Tables
P_{m,n}(Y) = P_{m−1,n}(Y) + P̂_{m,n}(Y)
P^τ_{m,n}(V) = P^τ_{m,n−1}(V) + s(P_{m−1,n}(Y)) + P̂^τ_{m,n}(V)
The scalar s(v) denotes the sum of all entries in vector v.
1. In parallel for all and , compute and store the and .
2. Sequentially for each , but in parallel for each , compute and store the using the previously computed. Compute and store .
3. Sequentially for each , but in parallel for each , compute and store the using the previously computed , and .
4. In parallel for all and , compute then compute and store using the previously computed and .
Algorithm SAT
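The whole pipeline can be prototyped in NumPy. This is our own sketch, not the CUDA implementation: the Python loops over blocks stand in for parallel kernel launches, `b` is the block size, and the names (`overlapped_sat`, `p_y`, `p_v`) are ours:

```python
import numpy as np

def overlapped_sat(x, b):
    h, w = x.shape
    M, N = h // b, w // b
    blk = x.reshape(M, b, N, b).transpose(0, 2, 1, 3)  # (M, N, b, b)
    # S.1: per-block column scans; keep only the incomplete prologues.
    ycol = blk.cumsum(axis=2)                # incomplete column scan
    p_y = ycol[:, :, -1, :]                  # incomplete P(Y), shape (M, N, b)
    p_v = ycol.cumsum(axis=3)[:, :, :, -1]   # incomplete P^T(V), shape (M, N, b)
    # S.2: complete the column prologues down each block-column.
    Py = np.zeros((M + 1, N, b))
    for m in range(M):
        Py[m + 1] = Py[m] + p_y[m]
    # S.3: complete the row prologues across each block-row, adding the
    # scalar s(P(Y)) -- the sum of the column prologue entering the block.
    Pv = np.zeros((M, N + 1, b))
    for n in range(N):
        Pv[:, n + 1] = Pv[:, n] + Py[:M, n].sum(axis=1)[:, None] + p_v[:, n]
    # S.4: rebuild every block from the input plus its completed prologues.
    out = np.empty_like(blk)
    for m in range(M):
        for n in range(N):
            y = blk[m, n].cumsum(axis=0) + Py[m, n][None, :]
            out[m, n] = y.cumsum(axis=1) + Pv[m, n][:, None]
    return out.transpose(0, 2, 1, 3).reshape(h, w)

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 8))
sat = overlapped_sat(x, b=4)
```

Only the thin prologue strips ever cross the block boundary, which is the point of the overlapped formulation: full intermediate images are never stored.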
S.1 Reads the input then computes and stores the incomplete prologues (red) and (blue).
Algorithm SAT
S.2 Completes the prologues (red) and computes scalars (yellow).
Algorithm SAT
S.3 Completes prologues
Algorithm SAT
S.4 Reads the input and completed prologues, then computes and stores the final summed-area table.
Algorithm SAT
Results
First-order filter benchmarks:
• Alg. RT is the baseline implementation (Ruijters et al. 2010, “GPU prefilter […]”)
• Alg. 2 adds block parallelism & tricks (Sung et al. 1986, “Efficient […] recursive […]”; Blelloch 1990, “Prefix sums […]”; + tricks from GPU parallel scan algorithms)
• Alg. 4 adds causal-anticausal overlapping: eliminates 4hw of IO, modest increase in computation
• Alg. 5 adds row-column overlapping: eliminates an additional 2hw of IO, modest increase in computation
(Table: step complexity, maximum number of threads, and memory bandwidth for Algorithms RT, 2, 4, and 5; see the paper.)
(Chart: throughput in GiP/s versus input size, 64² to 4096² pixels, for Algorithms RT, 2, 4, and 5.)
Cubic B-Spline Interpolation (GeForce GTX 480)
Summed-area table benchmarks:
• Harris et al. 2008, GPU Gems 3, “Parallel prefix-scan […]”: multi-scan + transpose + multi-scan, implemented with CUDPP
• Hensley 2010, Gamefest, “High-quality depth of field”: multi-wave method
• Our improvements: specialized row and column kernels; save only incomplete borders; fuse row and column stages
• Overlapped SAT: row-column overlapping
(Chart: throughput in GiP/s versus input size, 64² to 4096² pixels.)
Summed-area Table (GeForce GTX 480)
Legend: Harris et al. [2008]; Hensley [2010]; Improved Hensley [2010]; Overlapped SAT
• First-order filter, unit coefficient, no anticausal component
Conclusion
The paper describes an efficient algorithmic framework that reduces memory bandwidth over a sequence of recursive filters.
It splits the input into blocks that are processed in parallel on the GPU.
Overlapping the causal, anticausal, row, and column filters reduces IO to global memory, which leads to substantial performance gains.
Conclusion
Difficult to understand theoretically
Complex implementation
Drawbacks
Questions?
baseline: Alg. RT (0.5 GiP/s)
+ block parallelism: Alg. 2 (3 GiP/s)
+ causal-anticausal overlapping: Alg. 4 (5 GiP/s)
+ row-column overlapping: Alg. 5 (6 GiP/s)