GPU-Efficient Recursive Filtering and Summed-Area Tables
Jeremiah van Oosten, Reinier van Oeveren
Introduction
Related Works
◦ Prefix Sums and Scans
◦ Recursive Filtering
◦ Summed-Area Tables
Problem Definition
Parallelization Strategies
◦ Baseline (Algorithm RT)
◦ Block Notation
◦ Inter-block Parallelism
◦ Kernel Fusion (Algorithm 2)
Overlapping
◦ Causal-Anticausal Overlapping (Algorithms 3 & 4)
◦ Row-Column Causal-Anticausal Overlapping (Algorithm 5)
Summed-Area Tables
◦ Overlapped Summed-Area Tables (Algorithm SAT)
Results
Conclusion
Table of Contents
Introduction
Linear filtering is commonly used to blur, sharpen or down-sample images.
A direct implementation evaluating a filter of support d on an h × w image has a cost of O(hwd).
Introduction
The cost of the image filter can be reduced by using a recursive filter, in which case previously computed results are used to compute the current value.
The cost is then reduced to O(hwr), where r is the number of recursive feedbacks.
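As a concrete illustration (not taken from the slides), here is a minimal NumPy sketch of a first-order causal recursive filter applied down the columns of an image; the function name and coefficient value are our own choices. The inner loop does O(r) work per pixel with r = 1, so the total cost is O(hwr) regardless of the filter's effective support:

```python
import numpy as np

def causal_first_order(x, a1):
    """Apply y[i] = x[i] - a1 * y[i-1] down each column (order r = 1).

    Cost is O(h*w*r), independent of the filter's effective support.
    """
    y = np.empty_like(x, dtype=float)
    y[0] = x[0]
    for i in range(1, x.shape[0]):
        y[i] = x[i] - a1 * y[i - 1]
    return y

img = np.ones((4, 3))
# With a1 = -0.5 each output row adds half of the previous output row.
out = causal_first_order(img, a1=-0.5)
```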
Introduction
At each step, the filter produces an output element by a linear combination of the input element and previously computed output elements.
Recursive Filters
O(hwr)
Continue…
Applications of recursive filters:
◦ Low-pass filtering, e.g. Gaussian kernels
◦ Inverse convolution
◦ Summed-area tables
Recursive Filters
(Figure: input image and blurred result produced by recursive filters.)
Recursive filters can be causal or anticausal (or non-causal).
Causal filters operate on previous values.
Anticausal filters operate on “future” values.
Causality
Continue…
z = R(y, e), where
z_i = y_i − ∑_{k=1}^{r} a′_k z_{i+k}
It is often required to perform a sequence of recursive image filters.
Filter Sequences
(Diagram: input X, column prologue P and epilogue E, row prologue P′ and epilogue E′, intermediates Y, Z, U, and output V.)
Independent columns:
◦ Causal: Y = F(P, X)
◦ Anticausal: Z = R(Y, E)
Independent rows:
◦ Causal: U = F^τ(P′, Z)
◦ Anticausal: V = R^τ(U, E′)
The naïve approach to solving the sequence of recursive filters does not sufficiently utilize the processing cores of the GPU.
The latest GPU from NVIDIA has 2,688 shader cores. Processing even large images (2048×2048) will not make full use of all available cores.
Under-utilization of the GPU cores does not allow for latency hiding. We need a way to make better use of the GPU without increasing IO.
Maximizing Parallelism
In the paper “GPU-Efficient Recursive Filtering and Summed-Area Tables”, Diego Nehab et al. introduce a new algorithmic framework to reduce memory bandwidth by overlapping computation over the full sequence of recursive filters.
Overlapping
Partition the image into 2D blocks of size .
Block Partitioning
Related Works
A prefix sum is a simple case of a first-order recursive filter. A scan generalizes the recurrence using an arbitrary binary associative operator.
Parallel prefix sums and scans are important building blocks for numerous algorithms [Iverson 1962; Stone 1971; Blelloch 1989; Sengupta et al. 2007].
An optimized implementation comes with the CUDPP library [2011].
Prefix Sums and Scans
A generalization of the prefix sum using a weighted combination of prior outputs.
This can be implemented as a scan operation with redefined basic operators.
Ruijters and Thévenaz [2010] exploit parallelism across the rows and columns of the input.
Recursive Filtering
Sung and Mitra [1986] use block parallelism and split the computation into two parts:
◦ One computation based only on the block data, assuming zero initial conditions.
◦ One computation based only on the initial conditions, assuming zero block data.
Recursive Filtering
Summed-area tables enable averaging rectangular regions of pixels with a constant number of reads.
Summed-Area Tables
(Figure: a rectangle of size width × height with four corner reads: +LR, −LL, −UR, +UL.)
avg = (LR − LL − UR + UL) / (width · height)
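The four-corner lookup can be sketched with NumPy prefix sums (our own illustration; `box_sum` is a hypothetical helper, and the subtracted reads are taken one pixel outside the rectangle):

```python
import numpy as np

def summed_area_table(img):
    # Prefix sums down the columns, then across the rows.
    return img.cumsum(axis=0).cumsum(axis=1)

def box_sum(sat, r0, c0, r1, c1):
    """Sum of img[r0:r1+1, c0:c1+1] from at most four table reads:
    LR - LL - UR + UL, with the LL/UR/UL reads just outside the box."""
    total = sat[r1, c1]                   # +LR
    if r0 > 0:
        total -= sat[r0 - 1, c1]          # -UR
    if c0 > 0:
        total -= sat[r1, c0 - 1]          # -LL
    if r0 > 0 and c0 > 0:
        total += sat[r0 - 1, c0 - 1]      # +UL
    return total

img = np.arange(16.0).reshape(4, 4)
sat = summed_area_table(img)
# Average over the 2x2 box img[1:3, 1:3].
avg = box_sum(sat, 1, 1, 2, 2) / (2 * 2)
```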
The paper titled “Fast Summed-Area Table Generation…” by Justin Hensley et al. (2005) describes a method called recursive doubling, which requires multiple passes over the input image (a 256×256 image requires 16 passes to compute).
Summed-Area Tables
(Figure: recursive-doubling passes alternating between Image A and Image B.)
In 2010, Justin Hensley extended his 2005 implementation to compute shaders, taking more samples per pass and storing intermediate results in shared memory. A 256×256 image then requires only 4 passes when reading 16 samples per pass.
Summed-Area Tables
Problem Definition
Causal recursive filters of order r are characterized by a set of feedback coefficients a_k in the following manner.
Given a prologue vector p and an input vector x of any size, the filter produces the output:
Problem Definition
y = F(p, x)
such that the output y has the same size as the input x.
Causal recursive filters depend on a prologue vector p.
Problem Definition
y_i = x_i − ∑_{k=1}^{r} a_k y_{i−k}
Similarly for the anticausal filter: given an input vector y and an epilogue vector e, the output vector z = R(y, e) is defined by:
z_i = y_i − ∑_{k=1}^{r} a′_k z_{i+k}
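The two recurrences can be prototyped directly. This is a NumPy sketch under our own naming (`F`, `R` mirror the slide notation): the prologue p supplies y_{−r}…y_{−1} and the epilogue e supplies z_{n}…z_{n+r−1}:

```python
import numpy as np

def F(p, x, a):
    """Causal filter: y[i] = x[i] - sum_{k=1..r} a[k] * y[i-k]."""
    r = len(a)
    buf = np.concatenate([np.asarray(p, float), np.zeros(len(x))])
    for i in range(len(x)):
        buf[r + i] = x[i] - sum(a[k] * buf[r + i - 1 - k] for k in range(r))
    return buf[r:]

def R(y, e, ap):
    """Anticausal filter: z[i] = y[i] - sum_{k=1..r} ap[k] * z[i+k]."""
    r = len(ap)
    buf = np.concatenate([np.zeros(len(y)), np.asarray(e, float)])
    for i in range(len(y) - 1, -1, -1):
        buf[i] = y[i] - sum(ap[k] * buf[i + 1 + k] for k in range(r))
    return buf[:len(y)]

# With a_1 = -1 the causal filter degenerates to a running (prefix) sum,
# and the anticausal filter to a reverse running sum.
ys = F([0.0], [1.0, 2.0, 3.0], a=[-1.0])
zs = R([1.0, 1.0, 1.0], [0.0], ap=[-1.0])
```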
For row processing, we define an extended causal filter F^τ and anticausal filter R^τ.
Problem Definition
u = F^τ(p′, z)
v = R^τ(u, e′)
With these definitions, we are able to formulate the problem of applying the full sequence of four recursive filters (down, up, right, left).
Problem Definition
(Diagram: input X, column prologue P and epilogue E, row prologue P′ and epilogue E′, intermediates Y, Z, U, and output V.)
Independent columns:
◦ Causal: Y = F(P, X)
◦ Anticausal: Z = R(Y, E)
Independent rows:
◦ Causal: U = F^τ(P′, Z)
◦ Anticausal: V = R^τ(U, E′)
The goal is to implement this algorithm on the GPU to make full use of all available resources:
◦ Maximize occupancy by splitting the problem up to make use of all cores.
◦ Reduce I/O to global memory.
We must break the dependency chain in order to increase task parallelism.
Primary design goal: Increase the amount of parallelism without increasing memory I/O.
Problem Definition
Prior Parallelization Strategies
Baseline algorithm ‘RT’
Block notation
Inter-block parallelism
Kernel fusion
Prior Parallelization strategies
Algorithm RT (Ruijters & Thévenaz): independent row and column processing.
Step RT1: In parallel for each column in X, apply F sequentially and store Y.
Step RT2: In parallel for each column in Y, apply R sequentially and store Z.
Step RT3: In parallel for each row in Z, apply F^τ sequentially and store U.
Step RT4: In parallel for each row in U, apply R^τ sequentially and store V.
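The four passes can be mimicked on the CPU for a first-order filter pair. This is a NumPy sketch of our own (zero prologues/epilogues, coefficients `a1`/`a1p` chosen for illustration); on the GPU, each column loop and each row loop body runs in its own thread:

```python
import numpy as np

def rt(x, a1, a1p):
    """Algorithm-RT-style sequence of four passes, order r = 1."""
    y = x.astype(float).copy()
    for i in range(1, y.shape[0]):           # RT1: causal, down each column
        y[i] -= a1 * y[i - 1]
    for i in range(y.shape[0] - 2, -1, -1):  # RT2: anticausal, up each column
        y[i] -= a1p * y[i + 1]
    for j in range(1, y.shape[1]):           # RT3: causal, along each row
        y[:, j] -= a1 * y[:, j - 1]
    for j in range(y.shape[1] - 2, -1, -1):  # RT4: anticausal, along each row
        y[:, j] -= a1p * y[:, j + 1]
    return y

# With a1 = a1p = -1 every pass is a (forward or reverse) running sum.
out = rt(np.ones((2, 2)), a1=-1.0, a1p=-1.0)
```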
(Diagram: Algorithm RT, input → stages → output; column processing followed by row processing.)
Completion takes 4r(…) steps; bandwidth usage in total is ….
(Legend: … = streaming multiprocessors; … = number of cores per processor; … = width of the input image; … = height of the input image; … = order of the applied filter.)
Algorithm RT performance
Partition the input image into b×b blocks, where b = the number of threads in a warp (= 32).
What means what?
◦ B_m(X) = block of matrix X with index m
◦ P_m(Y) = column-prologue submatrix
◦ E_m(Z) = column-epilogue submatrix
For rows we have (similar) transposed operators: P^τ and E^τ.
Block notation (1)
Block notation (1 cont’d)
Tail and head operators select prologue- and epilogue-shaped submatrices from a block.
Block notation (2)
Result: a blocked version of the problem definition.
Block notation (3)
Superposition (based on linearity)
Effects of the input and prologue/epilogue on the output can be computed independently
Some useful key properties (1)
Express the filters as matrix products. For any r, I_r denotes the identity matrix.
Some useful key properties (2)
A_{FP} and its counterparts are precomputed matrices that depend only on the feedback coefficients of filters F and R, respectively. Details are in the paper.
Perform block computation independently
Inter-block parallelism (1)
B_m(y) = F(P_{m−1}(y), B_m(x))
(output block; the prologue is the tail of the previous output block)
By superposition:
B_m(y) = F(0, B_m(x)) + F(P_{m−1}(y), 0)
Inter-block parallelism (2): B_m(y) = F(0, B_m(x)) + F(P_{m−1}(y), 0)
The first term, F(0, B_m(x)), is the incomplete causal output; the second term can be written as a matrix product:
F(P_{m−1}(y), 0) = F(I_r, 0) · P_{m−1}(y) = A_{FP} · P_{m−1}(y)
Inter-block parallelism (3)
1.1 In parallel for all m, compute and store each incomplete block result.
1.2 Sequentially for each m, compute and store the prologue P_{m−1}(y) according to (1), using the previously computed values.
1.3 In parallel for all m, compute and store the output block B_m(y) using (2) and the previously computed P_{m−1}(y).
Recall (2): B_m(y) = F(0, B_m(x)) + F(P_{m−1}(y), 0).
Algorithm 1
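For a first-order prefix sum (a_1 = −1, r = 1) the three stages of Algorithm 1 reduce to the familiar blocked scan. A NumPy sketch of our own (`b` is the block length; stages 1.1 and 1.3 would run in parallel on the GPU, while 1.2 is the short sequential fix-up):

```python
import numpy as np

def blocked_prefix_sum(x, b):
    blocks = x.reshape(-1, b)
    # 1.1: incomplete per-block scans (parallel across blocks on a GPU).
    local = blocks.cumsum(axis=1)
    # 1.2: sequential fix-up of the one-element "prologue" (the carry).
    carry = np.zeros(blocks.shape[0])
    carry[1:] = np.cumsum(local[:-1, -1])
    # 1.3: superposition -- add each block's completed prologue (parallel).
    return (local + carry[:, None]).ravel()

out = blocked_prefix_sum(np.arange(1.0, 9.0), b=4)
```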
Inter-block parallelism (4): processing all rows and columns using causal and anticausal filter pairs requires 4 successive applications of Algorithm 1.
There are many independent tasks, which hides memory access latency.
However, the memory bandwidth usage is now significantly more than Algorithm RT. This can be solved with kernel fusion.
Original idea: Kirk & Hwu [2010].
Use the output of one kernel as input for the next without going through global memory.
A fused kernel contains code from both kernels but keeps intermediate results in shared memory.
Kernel fusion (1)
Use Algorithm 1 for all filters, then fuse:
◦ Fuse the last stage of the first filter with the first stage of the second.
◦ Fuse the last stage of the second with the first stage of the third.
◦ Fuse the last stage of the third with the first stage of the fourth.
Kernel fusion (2)
We aimed for bandwidth reduction. Did it work? Comparing Algorithm 1 with Algorithm 2: yes, it did!
Kernel fusion (3), Algorithm 2
(Diagram: Algorithm 2, input → stages → output, with four “fix” passes.)
* For the full algorithm in text, please see the paper.
Further I/O reduction is still possible by recomputing intermediate results instead of storing them in memory.
More bandwidth reduction: good.
Number of steps increases: ≈ bad*.
Bandwidth usage is less than Algorithm RT(!) but involves more computation. *Future hardware may tip the balance in favor of more computation.
Kernel fusion (4)
Overlapping
Overlapping is introduced to reduce IO to global memory.
It is possible to work with twice-incomplete anticausal epilogues, computed directly from the incomplete causal output block.
This is called causal-anticausal overlapping.
Causal-Anticausal Overlapping
Recall that we can express the filter so that the input and the prologue or epilogue can be computed independently and later added together.
Causal-Anticausal Overlapping
F(p, x) = F(0, x) + F(p, 0)
R(y, e) = R(y, 0) + R(0, e)
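This identity is just linearity, and it is easy to check numerically. A NumPy sketch of our own, with arbitrary order-2 coefficients and `F` following the causal definition given earlier:

```python
import numpy as np

def F(p, x, a):
    """Causal filter: y[i] = x[i] - sum_{k=1..r} a[k]*y[i-k]; p holds y[-r..-1]."""
    r = len(a)
    buf = np.concatenate([np.asarray(p, float), np.zeros(len(x))])
    for i in range(len(x)):
        buf[r + i] = x[i] - sum(a[k] * buf[r + i - 1 - k] for k in range(r))
    return buf[r:]

rng = np.random.default_rng(0)
p = rng.standard_normal(2)        # prologue, r = 2
x = rng.standard_normal(8)        # input
a = [0.4, -0.2]                   # illustrative feedback coefficients

# Superposition: effects of input and prologue computed independently.
lhs = F(p, x, a)
rhs = F(np.zeros(2), x, a) + F(p, np.zeros(8), a)
```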
Using the previous properties, we can split the dependency chains of anticausal epilogues.
Causal-Anticausal Overlapping
Which can be further simplified to:
Causal-Anticausal Overlapping
Where the twice-incomplete is such that
Causal-Anticausal Overlapping
Each twice-incomplete epilogue depends only on the corresponding input block and therefore they can all be computed in parallel already in the first pass. As a byproduct of that same pass, we can compute and store the that will be needed to obtain . With , we can compute all in the following pass.
1. In parallel for all , compute and store and .
2. Sequentially for each , compute and store the using the previously computed .
3. Sequentially for each , compute and store using the previously computed and .
4. In parallel for all , compute each causal output block using the previously computed . Then compute and store each anticausal output block using the previously computed .
Algorithm 3 (Row & Column)
Algorithm 3 computes rows and columns in separate passes. Fusing these two stages results in Algorithm 4.
Algorithm 4
1. In parallel for all and , compute and store the and .
2. Sequentially for each , but in parallel for each , compute and store the using the previously computed .
3. Sequentially for each , but in parallel for each , compute and store the using the previously computed and .
4. In parallel for all and , compute using the previously computed . Then compute and store the using the previously computed . Finally, compute and store both and .
5. Sequentially for each , but in parallel for each , compute and store the from .
6. Sequentially for each , but in parallel for each , compute and store each using the previously computed and .
7. In parallel for all and , compute using the previously computed and . Then compute and store the using the previously computed .
Algorithm 4
Adds causal-anticausal overlapping:
◦ Eliminates reading and writing causal results, both in column and in row processing
◦ Modest increase in computation
(Diagram: Algorithm 4, input → stages → output, with two “fix both” passes.)
There is still one source of inefficiency in Algorithm 4: we wait until the complete block is available in stage 4 before computing the incomplete and twice-incomplete results.
We can overlap row and column computations and work with thrice-incomplete transposed epilogues obtained directly during stage 1 of Algorithm 4.
Row-Column Causal-Anticausal Overlapping
Below is the formula for completing the thrice-incomplete transposed prologues:
Row-Column Causal-Anticausal Overlapping
The thrice-incomplete satisfies
Row-Column Causal-Anticausal Overlapping
To complete the four-times-incomplete transposed epilogues of :
1. In parallel for all and , compute and store each , , , and .
2. In parallel for all , sequentially for each , compute and store the using the previously computed .
3. In parallel for all , sequentially for each , compute and store using the previously computed and .
4. In parallel for all , sequentially for each , compute and store using the previously computed , , and .
5. In parallel for all , sequentially for each , compute and store using the previously computed , , , and .
6. In parallel for all and , successively compute , , , and using the previously computed , , , and . Store .
Algorithm 5
Adds row-column overlapping:
◦ Eliminates reading and writing column results
◦ Modest increase in computation
(Diagram: Algorithm 5, input → stages → output, with a single “fix all” pass.)
Start from input and global borders
Load blocks into shared memory
Compute & store incomplete borders
All borders in global memory
Fix incomplete borders
Fix twice-incomplete borders
Fix thrice-incomplete borders
Fix four-times-incomplete borders
Done fixing all borders
Load blocks into shared memory
Finish causal columns
Finish anticausal columns
Finish causal rows
Finish anticausal rows
Store results to global memory
Done!
Overlapped Summed-Area Tables
A summed-area table is obtained using prefix sums over columns and rows.
The prefix-sum filter is a special case: a first-order causal recursive filter with feedback coefficient a_1 = −1.
We can directly apply overlapping to optimize the computation of summed-area tables.
Overlapped Summed-Area Tables
In blocked form, the problem is to obtain the output V from the input X, where the blocks satisfy the relations:
Overlapped Summed-Area Tables
B_{m,n}(Y) = S(P_{m−1,n}(Y), B_{m,n}(X))
B_{m,n}(V) = S^τ(P^τ_{m,n−1}(V), B_{m,n}(Y))
Using the strategy developed for causal-anticausal overlapping, computing Y and V with overlapping becomes easy.
In the first stage, we compute the incomplete output blocks directly from the input:
Overlapped Summed-Area Tables
B̂_{m,n}(Y) = S(0, B_{m,n}(X))
B̂_{m,n}(V) = S^τ(0, B̂_{m,n}(Y))
(the hat denotes an incomplete result)
We store only the incomplete prologues P̂_{m,n}(Y) and P̂^τ_{m,n}(V). Then we complete them using:
Overlapped Summed-Area Tables
P_{m,n}(Y) = P_{m−1,n}(Y) + P̂_{m,n}(Y)
P^τ_{m,n}(V) = P^τ_{m,n−1}(V) + s(P_{m−1,n}(Y)) + P̂^τ_{m,n}(V)
The scalar s(v) denotes the sum of all entries in vector v.
1. In parallel for all and , compute and store the and .
2. Sequentially for each , but in parallel for each , compute and store the using the previously computed. Compute and store .
3. Sequentially for each , but in parallel for each , compute and store the using the previously computed , and .
4. In parallel for all and , compute then compute and store using the previously computed and .
Algorithm SAT
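The whole pipeline can be prototyped in NumPy. This is our own sketch, not the CUDA implementation: the Python loops over blocks stand in for parallel kernel launches, `b` is the block size, and the names (`overlapped_sat`, `p_y`, `p_v`) are ours:

```python
import numpy as np

def overlapped_sat(x, b):
    h, w = x.shape
    M, N = h // b, w // b
    blk = x.reshape(M, b, N, b).transpose(0, 2, 1, 3)  # (M, N, b, b)
    # S.1: per-block column scans; keep only the incomplete prologues.
    ycol = blk.cumsum(axis=2)                # incomplete column scan
    p_y = ycol[:, :, -1, :]                  # incomplete P(Y), shape (M, N, b)
    p_v = ycol.cumsum(axis=3)[:, :, :, -1]   # incomplete P^T(V), shape (M, N, b)
    # S.2: complete the column prologues down each block-column.
    Py = np.zeros((M + 1, N, b))
    for m in range(M):
        Py[m + 1] = Py[m] + p_y[m]
    # S.3: complete the row prologues across each block-row, adding the
    # scalar s(P(Y)) -- the sum of the column prologue entering the block.
    Pv = np.zeros((M, N + 1, b))
    for n in range(N):
        Pv[:, n + 1] = Pv[:, n] + Py[:M, n].sum(axis=1)[:, None] + p_v[:, n]
    # S.4: rebuild every block from the input plus its completed prologues.
    out = np.empty_like(blk)
    for m in range(M):
        for n in range(N):
            y = blk[m, n].cumsum(axis=0) + Py[m, n][None, :]
            out[m, n] = y.cumsum(axis=1) + Pv[m, n][:, None]
    return out.transpose(0, 2, 1, 3).reshape(h, w)

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 8))
sat = overlapped_sat(x, b=4)
```

Only the thin prologue strips ever cross the block boundary, which is the point of the overlapped formulation: full intermediate images are never stored.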
S.1 Reads the input then computes and stores the incomplete prologues (red) and (blue).
Algorithm SAT
S.2 Completes the prologues (red) and computes scalars (yellow).
Algorithm SAT
S.3 Completes prologues
Algorithm SAT
S.4 Reads the input and completed prologues, then computes and stores the final summed-area table.
Algorithm SAT
Results
First-order filter benchmarks:
• Alg. RT is the baseline implementation (Ruijters et al. 2010, “GPU prefilter […]”)
• Alg. 2 adds block parallelism & tricks (Sung et al. 1986, “Efficient […] recursive […]”; Blelloch 1990, “Prefix sums […]”; + tricks from GPU parallel scan algorithms)
• Alg. 4 adds causal-anticausal overlapping: eliminates 4hw of IO, modest increase in computation
• Alg. 5 adds row-column overlapping: eliminates an additional 2hw of IO, modest increase in computation
(Table: step complexity, maximum number of threads, and memory bandwidth for Algorithms RT, 2, 4, and 5; see the paper.)
(Chart: throughput in GiP/s versus input size, 64² to 4096² pixels, for Algorithms RT, 2, 4, and 5.)
Cubic B-Spline Interpolation (GeForce GTX 480)
Summed-area table benchmarks:
• Harris et al. 2008, GPU Gems 3, “Parallel prefix-scan […]”: multi-scan + transpose + multi-scan, implemented with CUDPP
• Hensley 2010, Gamefest, “High-quality depth of field”: multi-wave method
• Our improvements: specialized row and column kernels; save only incomplete borders; fuse row and column stages
• Overlapped SAT: row-column overlapping
(Chart: throughput in GiP/s versus input size, 64² to 4096² pixels.)
Summed-area Table (GeForce GTX 480)
Legend: Harris et al. [2008]; Hensley [2010]; Improved Hensley [2010]; Overlapped SAT
• First-order filter, unit coefficient, no anticausal component
Conclusion
The paper describes an efficient algorithmic framework that reduces memory bandwidth over a sequence of recursive filters.
It splits the input into blocks that are processed in parallel on the GPU.
Overlapping the causal, anticausal, row, and column filters reduces IO to global memory, which leads to substantial performance gains.
Conclusion
Difficult to understand theoretically
Complex implementation
Drawbacks
Questions?
baseline: Alg. RT (0.5 GiP/s)
+ block parallelism: Alg. 2 (3 GiP/s)
+ causal-anticausal overlapping: Alg. 4 (5 GiP/s)
+ row-column overlapping: Alg. 5 (6 GiP/s)