a hardware-friendly bilateral solver for real-time virtual ...amrita/slides/hpg17.pdf · bilateral...

Post on 30-Jul-2020

5 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

A Hardware-Friendly Bilateral Solver for Real-Time Virtual-Reality VideoAmrita Mazumdar Armin AlaghiJonathan T. BarronDavid GallupLuis CezeMark OskinSteven M. Seitz

University of Washington

Google

University of Washington

1

virtual reality video with omnidirectional stereo (ODS)

2

the Google Jump camera rig can capture ODS video easily

16 GoPros x 4K camera feed 3.6 GB/s raw video

3

4

the Google Jump camera rig can capture ODS video easily

Anderson et al., SIGGRAPH Asia 2016

5

the Google Jump camera rig can capture ODS video easily

Anderson et al., SIGGRAPH Asia 2016

processing video from Google Jump is slow

1 hour of video

10 hours on 1000 cores

6 Anderson et al., SIGGRAPH Asia 2016

Google Jump pipeline breakdown

sensor download to viewer

pre-processing alignment optical

flowcompositing

7 Anderson et al., SIGGRAPH Asia 2016

download to viewer

pre-processing alignment optical

flowcompositing

12% 69% 17%2%

the bilateral solver dominates processing

time

8

Google Jump pipeline breakdown

Anderson et al., SIGGRAPH Asia 2016

sensor

The bilateral solver produces an image that is smooth and accurate.

9

input pair (from two cameras)

blocky flow field upsample into noisy flow field

transform to bilateral grid and

solve

output result: smooth flow field

Anderson et al., SIGGRAPH Asia 2016

this work: a hardware-friendly

bilateral solver (HFBS)

The bilateral solver is hard to parallelize

second-order global optimization

global communication prevents aggressive parallelization

high-dimensional, sparse matrices

sparsity results in significant divergence on GPUs

why not a dense grid? too large to store on-chip

11

Barron Poole 2016 HFBS (our work)

✅ includes color grayscale only

dense matrix too big to fit in memory ✅ dense matrix fits in memory

global communication required ✅ local communication only

iterative bistochastization before solving ✅ partial, non-iterative bistochastization

HFBS is easier to parallelizedetailed

formulation in paper

HFBS demonstrates imperceptible accuracy loss

task: Ferstl et al., ICCV 2013, data: Middlebury stereo dataset

input image

noisy depth map

Barron Poole 2016

HFBS (this work)13

algorithm optimizations make it easier to implement bilateral solver in parallel

hardware

14

15

plan: exploit this parallelism with a custom hardware accelerator

algorithm optimizations make it easier to implement bilateral solver in parallel

hardware

Mapping HFBS to hardware

16

download to viewer

pre-processing alignment optical

flowcompositingsensor

17

load video pair

construct bilateral grid per pair

perform hardware-friendly bilateral

solver

slice out solution into output images

CPU FPGA

Mapping HFBS to hardwaredownload to viewer

pre-processing alignment optical

flowcompositingsensor

microarchitecture

18

CPU

main memory AXI memory interface

HFBS controller

z-axis memory

controller

z-axis memory bank

z-axis memory bank

z-axis memory bank

bilateral filter worker

bilateral filter worker

bilateral filter worker

memory access selector fixed-point

datapath

custom memory layout

Floating-point resource requirements limit hardware parallelism

float64 32-bit fixed 64-bit fixed 47-bit fixed

DSPs per worker 18 1 16 4

Maximum # workers 379 6840 427 1710

Error (MSE) - 8.3 x 10-4 7.16 x 10-13 6.69 x 10-7

19

Fixed-point datapath conversion

20

Erro

r (M

SE re

lativ

e to

floa

t64)

1E-12

1E-10

1E-08

1E-06

1E-04

1E-02

Decimal Precision (Fraction of Bitwidth)40% 50% 60% 70% 80% 90%

Max Error

32 64 47Bitwidth

z-axis slicing for bilateral grid memory layout

21

x:0,y:0,r:255,g:172,b:0 x:0,y:1,r:255,g:172,b:0

.

.

.

.

. x:100,y:100,r:255,g:172,b:0

z = 0

Evaluation

23

Evaluation

download to viewer

pre-processing alignment optical

flowcompositing

12% 69% 17%2%

Does HFBS improve runtime? How does parallelization affect power?

sensor

Experimental SetupCPU: Intel Xeon E5-2620

GPU: NVIDIA GTX 1080 Ti

FPGA: Xilinx Virtex Ultrascale+

Baseline: Barron Poole et al. 2016 (CPU only)

256 iterations of optimization

Varied bilateral grid vertices count ⇒ 4 KB - 1.8 GB grid sizes

24

HFBS is faster and more scalable than prior work.

25

log

Run

time

(ms)

0.01

1

100

10000

log Bilateral Grid Vertices1,000 100,000 10,000,000

Prior Work (CPU) CPU GPU FPGA

26

30 FPS and better

log

Run

time

(ms)

0.01

1

100

10000

log Bilateral Grid Vertices1,000 100,000 10,000,000

Prior Work (CPU) CPU GPU FPGA

HFBS is faster and more scalable than prior work.

HFBS-FPGA is more power-efficient than other platforms

Ops

/ W

att I

mpr

ovem

ent

0

10

20

30

40

Power-efficiency relative to prior work

30.72x

2.12x0.45x1.00x

Prior Work CPU GPU FPGA

27

building a VR video camera rig with HFBS

29

this work full system

 HFBS-FPGA consumes much less power than a GPU for the same task

30

16 GPUs = 4,560 Wfull system16 FPGAs = 400 W

HFBS makes real-time VR video more feasible with FPGAs

offloaded to cloud

31

on-node with FPGAs

sensor download to viewer

pre-processing alignment optical

flowcompositing

to conclude

fast, parallel implementation of bilateral solving with little accuracy loss

fixed-point datatypes and a custom bilateral-grid memory layout for improved FPGA performance

hardware-software codesign to reduce latency and improve quality for future VR applications

32

parallel algorithm for bilateral solvingFPGA architecture50x faster, 30x more power-efficient

33

A Hardware-Friendly Bilateral Solver for Real-Time Virtual Reality Video

top related