a hardware-friendly bilateral solver for real-time virtual ...amrita/slides/hpg17.pdf · bilateral...

A Hardware-Friendly Bilateral Solver for Real-Time Virtual-Reality VideoAmrita Mazumdar Armin AlaghiJonathan T. BarronDavid GallupLuis CezeMark OskinSteven M. Seitz

University of Washington

Google

University of Washington

1

virtual reality video with omnidirectional stereo (ODS)

2

the Google Jump camera rig can capture ODS video easily

16 GoPros x 4K camera feed 3.6 GB/s raw video

3

4


Anderson et al., SIGGRAPH Asia 2016

5



processing video from Google Jump is slow

1 hour of video

10 hours on 1000 cores

6 Anderson et al., SIGGRAPH Asia 2016

Google Jump pipeline breakdown

sensor download to viewer

pre-processing alignment optical

flowcompositing

7 Anderson et al., SIGGRAPH Asia 2016

download to viewer


flowcompositing

12% 69% 17%2%

the bilateral solver dominates processing

time

8

Google Jump pipeline breakdown


sensor

The bilateral solver produces an image that is smooth and accurate.

9

input pair (from two cameras)

blocky flow field upsample into noisy flow field

transform to bilateral grid and

solve

output result: smooth flow field


this work: a hardware-friendly

bilateral solver (HFBS)

The bilateral solver is hard to parallelize

second-order global optimization

global communication prevents aggressive parallelization

high-dimensional, sparse matrices

sparsity results in significant divergence on GPUs

why not a dense grid? too large to store on-chip

11

Barron Poole 2016 HFBS (our work)

✅ includes color grayscale only

dense matrix too big to fit in memory ✅ dense matrix fits in memory

global communication required ✅ local communication only

iterative bistochastization before solving ✅ partial, non-iterative bistochastization

HFBS is easier to parallelizedetailed

formulation in paper

HFBS demonstrates imperceptible accuracy loss

task: Ferstl et al., ICCV 2013, data: Middlebury stereo dataset

input image

noisy depth map

Barron Poole 2016

HFBS (this work)13

algorithm optimizations make it easier to implement bilateral solver in parallel

hardware

14

15

plan: exploit this parallelism with a custom hardware accelerator

algorithm optimizations make it easier to implement bilateral solver in parallel

hardware

Mapping HFBS to hardware

16

download to viewer


flowcompositingsensor

17

load video pair

construct bilateral grid per pair

perform hardware-friendly bilateral

solver

slice out solution into output images

CPU FPGA

Mapping HFBS to hardwaredownload to viewer


flowcompositingsensor

microarchitecture

18

CPU

main memory AXI memory interface

HFBS controller

z-axis memory

controller

z-axis memory bank

z-axis memory bank

z-axis memory bank

bilateral filter worker



memory access selector fixed-point

datapath

custom memory layout

Floating-point resource requirements limit hardware parallelism

float64 32-bit fixed 64-bit fixed 47-bit fixed

DSPs per worker 18 1 16 4

Maximum # workers 379 6840 427 1710

Error (MSE) - 8.3 x 10-4 7.16 x 10-13 6.69 x 10-7

19

Fixed-point datapath conversion

20

Erro

r (M

SE re

lativ

e to

floa

t64)

1E-12

1E-10

1E-08

1E-06

1E-04

1E-02

Decimal Precision (Fraction of Bitwidth)40% 50% 60% 70% 80% 90%

Max Error

32 64 47Bitwidth

z-axis slicing for bilateral grid memory layout

21

x:0,y:0,r:255,g:172,b:0 x:0,y:1,r:255,g:172,b:0

.

.

.

.

. x:100,y:100,r:255,g:172,b:0

z = 0

Evaluation

23

Evaluation

download to viewer


flowcompositing

12% 69% 17%2%

Does HFBS improve runtime? How does parallelization affect power?

sensor

Experimental SetupCPU: Intel Xeon E5-2620

GPU: NVIDIA GTX 1080 Ti

FPGA: Xilinx Virtex Ultrascale+

Baseline: Barron Poole et al. 2016 (CPU only)

256 iterations of optimization

Varied bilateral grid vertices count ⇒ 4 KB - 1.8 GB grid sizes

24

HFBS is faster and more scalable than prior work.

25

log

Run

time

(ms)

0.01

1

100

10000

log Bilateral Grid Vertices1,000 100,000 10,000,000

Prior Work (CPU) CPU GPU FPGA

26

30 FPS and better

log

Run

time

(ms)

0.01

1

100

10000

log Bilateral Grid Vertices1,000 100,000 10,000,000

Prior Work (CPU) CPU GPU FPGA

HFBS is faster and more scalable than prior work.

HFBS-FPGA is more power-efficient than other platforms

Ops

/ W

att I

mpr

ovem

ent

0

10

20

30

40

Power-efficiency relative to prior work

30.72x

2.12x0.45x1.00x

Prior Work CPU GPU FPGA

27

building a VR video camera rig with HFBS

29

this work full system

HFBS-FPGA consumes much less power than a GPU for the same task

30

16 GPUs = 4,560 Wfull system16 FPGAs = 400 W

HFBS makes real-time VR video more feasible with FPGAs

offloaded to cloud

31

on-node with FPGAs

sensor download to viewer


flowcompositing

to conclude

fast, parallel implementation of bilateral solving with little accuracy loss

fixed-point datatypes and a custom bilateral-grid memory layout for improved FPGA performance

hardware-software codesign to reduce latency and improve quality for future VR applications

32

parallel algorithm for bilateral solvingFPGA architecture50x faster, 30x more power-efficient

33

A Hardware-Friendly Bilateral Solver for Real-Time Virtual Reality Video

a hardware-friendly bilateral solver for real-time virtual ...amrita/slides/hpg17.pdf · bilateral...

Documents