a hardware-friendly bilateral solver for real-time virtual ...amrita/slides/hpg17.pdf · bilateral...

A Hardware-Friendly Bilateral Solver for Real-Time Virtual-Reality VideoAmrita Mazumdar Armin AlaghiJonathan T. BarronDavid GallupLuis CezeMark OskinSteven M. Seitz

University of Washington

Google

University of Washington

virtual reality video with omnidirectional stereo (ODS)

the Google Jump camera rig can capture ODS video easily

16 GoPros x 4K camera feed 3.6 GB/s raw video

Anderson et al., SIGGRAPH Asia 2016

processing video from Google Jump is slow

1 hour of video

10 hours on 1000 cores

6 Anderson et al., SIGGRAPH Asia 2016

Google Jump pipeline breakdown

sensor download to viewer

pre-processing alignment optical

flowcompositing

7 Anderson et al., SIGGRAPH Asia 2016

download to viewer

flowcompositing

12% 69% 17%2%

the bilateral solver dominates processing

Google Jump pipeline breakdown

sensor

The bilateral solver produces an image that is smooth and accurate.

input pair (from two cameras)

blocky flow field upsample into noisy flow field

transform to bilateral grid and

output result: smooth flow field

this work: a hardware-friendly

bilateral solver (HFBS)

The bilateral solver is hard to parallelize

second-order global optimization

global communication prevents aggressive parallelization

high-dimensional, sparse matrices

sparsity results in significant divergence on GPUs

why not a dense grid? too large to store on-chip

Barron Poole 2016 HFBS (our work)

✅ includes color grayscale only

dense matrix too big to fit in memory ✅ dense matrix fits in memory

global communication required ✅ local communication only

iterative bistochastization before solving ✅ partial, non-iterative bistochastization

HFBS is easier to parallelizedetailed

formulation in paper

HFBS demonstrates imperceptible accuracy loss

task: Ferstl et al., ICCV 2013, data: Middlebury stereo dataset

input image

noisy depth map

Barron Poole 2016

HFBS (this work)13

algorithm optimizations make it easier to implement bilateral solver in parallel

hardware

plan: exploit this parallelism with a custom hardware accelerator

algorithm optimizations make it easier to implement bilateral solver in parallel

hardware

Mapping HFBS to hardware

download to viewer

flowcompositingsensor

load video pair

construct bilateral grid per pair

perform hardware-friendly bilateral

solver

slice out solution into output images

CPU FPGA

Mapping HFBS to hardwaredownload to viewer

flowcompositingsensor

microarchitecture

main memory AXI memory interface

HFBS controller

z-axis memory

controller

z-axis memory bank

bilateral filter worker

memory access selector fixed-point

datapath

custom memory layout

Floating-point resource requirements limit hardware parallelism

float64 32-bit fixed 64-bit fixed 47-bit fixed

DSPs per worker 18 1 16 4

Maximum # workers 379 6840 427 1710

Error (MSE) - 8.3 x 10-4 7.16 x 10-13 6.69 x 10-7

Fixed-point datapath conversion

Decimal Precision (Fraction of Bitwidth)40% 50% 60% 70% 80% 90%

Max Error

32 64 47Bitwidth

z-axis slicing for bilateral grid memory layout

x:0,y:0,r:255,g:172,b:0 x:0,y:1,r:255,g:172,b:0

. x:100,y:100,r:255,g:172,b:0

Evaluation

download to viewer

flowcompositing

12% 69% 17%2%

Does HFBS improve runtime? How does parallelization affect power?

sensor

Experimental SetupCPU: Intel Xeon E5-2620

GPU: NVIDIA GTX 1080 Ti

FPGA: Xilinx Virtex Ultrascale+

Baseline: Barron Poole et al. 2016 (CPU only)

256 iterations of optimization

Varied bilateral grid vertices count ⇒ 4 KB - 1.8 GB grid sizes

HFBS is faster and more scalable than prior work.

log Bilateral Grid Vertices1,000 100,000 10,000,000

Prior Work (CPU) CPU GPU FPGA

30 FPS and better

log Bilateral Grid Vertices1,000 100,000 10,000,000

Prior Work (CPU) CPU GPU FPGA

HFBS is faster and more scalable than prior work.

HFBS-FPGA is more power-efficient than other platforms

Power-efficiency relative to prior work

30.72x

2.12x0.45x1.00x

Prior Work CPU GPU FPGA

building a VR video camera rig with HFBS

this work full system

HFBS-FPGA consumes much less power than a GPU for the same task

16 GPUs = 4,560 Wfull system16 FPGAs = 400 W

HFBS makes real-time VR video more feasible with FPGAs

offloaded to cloud

on-node with FPGAs

sensor download to viewer

flowcompositing

to conclude

fast, parallel implementation of bilateral solving with little accuracy loss

fixed-point datatypes and a custom bilateral-grid memory layout for improved FPGA performance

hardware-software codesign to reduce latency and improve quality for future VR applications

parallel algorithm for bilateral solvingFPGA architecture50x faster, 30x more power-efficient

A Hardware-Friendly Bilateral Solver for Real-Time Virtual Reality Video

a hardware-friendly bilateral solver for real-time virtual ...amrita/slides/hpg17.pdf · bilateral...

Documents

siggraph paper reading 2011

gpu to web siggraph asia

automatic solver configuration and solver portfolios

siggraph 2003, san diego

siggraph '82

siggraph 2007 course notes practical least-squares for...

siggraph realtime radiosity architecture

opencl intro siggraph asia

course 26 siggraph 2006

progressive meshes (siggraph ’96)

siggraph asia course notes

opencl spir bof - siggraph 2014

siggraph 2016 vulkan and nvidia: the...

siggraph 2006

lightspeed siggraph talk

opencl bof - siggraph 2014

giga voxels siggraph talk

mobile crossplatformchallenges siggraph

webcl bof - siggraph 2014

isometric siggraph 2004