a hardware-friendly bilateral solver for real-time virtual ...amrita/slides/hpg17.pdf · bilateral...
TRANSCRIPT
A Hardware-Friendly Bilateral Solver for Real-Time Virtual-Reality VideoAmrita Mazumdar Armin AlaghiJonathan T. BarronDavid GallupLuis CezeMark OskinSteven M. Seitz
University of Washington
University of Washington
1
virtual reality video with omnidirectional stereo (ODS)
2
the Google Jump camera rig can capture ODS video easily
16 GoPros x 4K camera feed 3.6 GB/s raw video
3
4
the Google Jump camera rig can capture ODS video easily
Anderson et al., SIGGRAPH Asia 2016
5
the Google Jump camera rig can capture ODS video easily
Anderson et al., SIGGRAPH Asia 2016
processing video from Google Jump is slow
1 hour of video
10 hours on 1000 cores
6 Anderson et al., SIGGRAPH Asia 2016
Google Jump pipeline breakdown
sensor download to viewer
pre-processing alignment optical
flowcompositing
7 Anderson et al., SIGGRAPH Asia 2016
download to viewer
pre-processing alignment optical
flowcompositing
12% 69% 17%2%
the bilateral solver dominates processing
time
8
Google Jump pipeline breakdown
Anderson et al., SIGGRAPH Asia 2016
sensor
The bilateral solver produces an image that is smooth and accurate.
9
input pair (from two cameras)
blocky flow field upsample into noisy flow field
transform to bilateral grid and
solve
output result: smooth flow field
Anderson et al., SIGGRAPH Asia 2016
this work: a hardware-friendly
bilateral solver (HFBS)
The bilateral solver is hard to parallelize
second-order global optimization
global communication prevents aggressive parallelization
high-dimensional, sparse matrices
sparsity results in significant divergence on GPUs
why not a dense grid? too large to store on-chip
11
Barron Poole 2016 HFBS (our work)
✅ includes color grayscale only
dense matrix too big to fit in memory ✅ dense matrix fits in memory
global communication required ✅ local communication only
iterative bistochastization before solving ✅ partial, non-iterative bistochastization
HFBS is easier to parallelizedetailed
formulation in paper
HFBS demonstrates imperceptible accuracy loss
task: Ferstl et al., ICCV 2013, data: Middlebury stereo dataset
input image
noisy depth map
Barron Poole 2016
HFBS (this work)13
algorithm optimizations make it easier to implement bilateral solver in parallel
hardware
14
15
plan: exploit this parallelism with a custom hardware accelerator
algorithm optimizations make it easier to implement bilateral solver in parallel
hardware
Mapping HFBS to hardware
16
download to viewer
pre-processing alignment optical
flowcompositingsensor
17
load video pair
construct bilateral grid per pair
perform hardware-friendly bilateral
solver
slice out solution into output images
CPU FPGA
Mapping HFBS to hardwaredownload to viewer
pre-processing alignment optical
flowcompositingsensor
microarchitecture
18
CPU
main memory AXI memory interface
HFBS controller
z-axis memory
controller
z-axis memory bank
z-axis memory bank
z-axis memory bank
bilateral filter worker
bilateral filter worker
bilateral filter worker
memory access selector fixed-point
datapath
custom memory layout
Floating-point resource requirements limit hardware parallelism
float64 32-bit fixed 64-bit fixed 47-bit fixed
DSPs per worker 18 1 16 4
Maximum # workers 379 6840 427 1710
Error (MSE) - 8.3 x 10-4 7.16 x 10-13 6.69 x 10-7
19
Fixed-point datapath conversion
20
Erro
r (M
SE re
lativ
e to
floa
t64)
1E-12
1E-10
1E-08
1E-06
1E-04
1E-02
Decimal Precision (Fraction of Bitwidth)40% 50% 60% 70% 80% 90%
Max Error
32 64 47Bitwidth
z-axis slicing for bilateral grid memory layout
21
x:0,y:0,r:255,g:172,b:0 x:0,y:1,r:255,g:172,b:0
.
.
.
.
. x:100,y:100,r:255,g:172,b:0
z = 0
Evaluation
23
Evaluation
download to viewer
pre-processing alignment optical
flowcompositing
12% 69% 17%2%
Does HFBS improve runtime? How does parallelization affect power?
sensor
Experimental SetupCPU: Intel Xeon E5-2620
GPU: NVIDIA GTX 1080 Ti
FPGA: Xilinx Virtex Ultrascale+
Baseline: Barron Poole et al. 2016 (CPU only)
256 iterations of optimization
Varied bilateral grid vertices count ⇒ 4 KB - 1.8 GB grid sizes
24
HFBS is faster and more scalable than prior work.
25
log
Run
time
(ms)
0.01
1
100
10000
log Bilateral Grid Vertices1,000 100,000 10,000,000
Prior Work (CPU) CPU GPU FPGA
26
30 FPS and better
log
Run
time
(ms)
0.01
1
100
10000
log Bilateral Grid Vertices1,000 100,000 10,000,000
Prior Work (CPU) CPU GPU FPGA
HFBS is faster and more scalable than prior work.
HFBS-FPGA is more power-efficient than other platforms
Ops
/ W
att I
mpr
ovem
ent
0
10
20
30
40
Power-efficiency relative to prior work
30.72x
2.12x0.45x1.00x
Prior Work CPU GPU FPGA
27
building a VR video camera rig with HFBS
29
this work full system
HFBS-FPGA consumes much less power than a GPU for the same task
30
16 GPUs = 4,560 Wfull system16 FPGAs = 400 W
HFBS makes real-time VR video more feasible with FPGAs
offloaded to cloud
31
on-node with FPGAs
sensor download to viewer
pre-processing alignment optical
flowcompositing
to conclude
fast, parallel implementation of bilateral solving with little accuracy loss
fixed-point datatypes and a custom bilateral-grid memory layout for improved FPGA performance
hardware-software codesign to reduce latency and improve quality for future VR applications
32
parallel algorithm for bilateral solvingFPGA architecture50x faster, 30x more power-efficient
33
A Hardware-Friendly Bilateral Solver for Real-Time Virtual Reality Video