TRANSCRIPT
Autonomous Vision Group
Max Planck Institute for Intelligent Systems
OctNetFusion: Learning Depth Fusion from Data
Gernot Riegler¹, Ali Osman Ulusoy²,³, Horst Bischof¹, Andreas Geiger²,⁴
¹Graz University of Technology  ²MPI for Intelligent Systems Tübingen  ³Microsoft  ⁴ETH Zürich
Motivation
[Figure: TSDF fusion of a scene under three conditions: 32 views without noise, 2 views without noise, and 32 views with noise]
OctNet
Use a grid of shallow octrees [4].
[Figure: a shallow octree encoded as bit strings, one bit per cell (1 = split), e.g. root bit 1 with child bit strings 01010001, 01000101, 00010000, 01010000]
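As a sketch of this encoding: each shallow octree has depth 3, covers 8³ voxels, and is stored as a 73-bit string (1 root bit, 8 depth-1 bits, 64 depth-2 bits), where a set bit marks a cell that splits into 8 children. The snippet below is a hypothetical Python illustration, not the authors' implementation:

```python
# Hypothetical illustration of the 73-bit shallow-octree encoding
# (1 root bit + 8 depth-1 bits + 64 depth-2 bits); a set bit means
# "this cell is split into 8 children".

def children(i):
    """Indices of the 8 children of tree cell i (root is index 0)."""
    return range(8 * i + 1, 8 * i + 9)

def is_leaf(bits, i):
    """A cell is a leaf if it is not split; depth-3 cells are always leaves."""
    return i >= 73 or not bits[i]

# Example matching the figure: the root is split, and its children
# follow the bit pattern 01010001 (three of them are split again).
bits = [False] * 73
bits[0] = True
for c, b in zip(children(0), "01010001"):
    bits[c] = (b == "1")

print([int(is_leaf(bits, c)) for c in children(0)])  # [1, 0, 1, 0, 1, 1, 1, 0]
```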
Convolution
$$O_\text{out}[i,j,k] = \operatorname{pool}_{(\bar i,\bar j,\bar k)\,\in\,\Omega[i,j,k]}\left(\sum_{l=0}^{L-1}\sum_{m=0}^{M-1}\sum_{n=0}^{N-1} W_{l,m,n}\cdot O_\text{in}\!\left[\bar i+l-\left\lfloor\tfrac{L}{2}\right\rfloor,\;\bar j+m-\left\lfloor\tfrac{M}{2}\right\rfloor,\;\bar k+n-\left\lfloor\tfrac{N}{2}\right\rfloor\right]\right)$$
[Figure: convolving an octree cell with a 3×3 filter (weights 0.125, 0.250, 0.125 / 0.000, 0.000, 0.000 / 0.125, 0.250, 0.125), followed by pooling]
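In words: convolve at the finest voxel resolution, then aggregate the responses over all voxels Ω[i,j,k] that belong to the same octree cell. A minimal numpy sketch on a dense stand-in grid follows; the function name and the per-voxel cell_id array are hypothetical, average pooling for pool_voxels is an assumption, and the efficient octree implementation is the one in [4]:

```python
# Dense stand-in sketch of the octree convolution above (hypothetical;
# the real implementation [4] never materializes the dense grid).
import numpy as np
from scipy.ndimage import convolve

def octree_conv(O_in, W, cell_id):
    """O_in: dense voxel grid, W: LxMxN filter, cell_id: per-voxel cell label."""
    dense = convolve(O_in, W, mode="constant")   # the triple sum over l, m, n
    O_out = np.empty_like(dense)
    for c in np.unique(cell_id):
        mask = cell_id == c
        O_out[mask] = dense[mask].mean()         # pool over Omega[i, j, k]
    return O_out
```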
Pooling
$$O_\text{out}[i,j,k] = \begin{cases} O_\text{in}[2i,2j,2k] & \text{if } \operatorname{cell\_width}(2i,2j,2k) > 1\\[4pt] \displaystyle\max_{l,m,n\,\in\,\{0,1\}} O_\text{in}[2i+l,\,2j+m,\,2k+n] & \text{else} \end{cases}$$
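That is, cells wider than one voxel are simply carried one level up the hierarchy, and only unit cells are max-pooled over their 2×2×2 neighborhood. A dense-array sketch of the rule (the per-voxel cell_width array is a hypothetical stand-in for the octree data structure):

```python
# Dense-array sketch of the pooling rule above (hypothetical stand-in;
# cell_width[x, y, z] gives the octree cell size at that voxel).
import numpy as np

def octree_pool(O_in, cell_width):
    D, H, W = O_in.shape
    O_out = np.empty((D // 2, H // 2, W // 2))
    for i in range(D // 2):
        for j in range(H // 2):
            for k in range(W // 2):
                if cell_width[2*i, 2*j, 2*k] > 1:
                    # large cells are copied, not pooled
                    O_out[i, j, k] = O_in[2*i, 2*j, 2*k]
                else:
                    # unit cells: max over the 2x2x2 block
                    O_out[i, j, k] = O_in[2*i:2*i+2, 2*j:2*j+2, 2*k:2*k+2].max()
    return O_out
```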
OctNet Unpooling
[Figure: naïve vs. guided unpooling of an octree]
OctNetFusion
Architecture
[Figure: coarse-to-fine architecture: one encoder-decoder module per resolution (64³, 128³, 256³), each supervised by its own loss L; the prediction of each module is passed on, together with the input at the next resolution, to the next, finer module]
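A hedged PyTorch sketch of this coarse-to-fine data flow on dense voxel grids; the module internals, the ℓ1 per-scale loss, and trilinear upsampling between scales are assumptions:

```python
# Sketch of the coarse-to-fine pipeline (dense tensors as octree stand-ins).
import torch
import torch.nn.functional as F

def coarse_to_fine(inputs, modules, targets):
    """inputs/targets: (B, C, D, H, W) volumes at 64^3, 128^3, 256^3;
    modules: one encoder-decoder network per scale."""
    losses, prev = [], None
    for x, module, t in zip(inputs, modules, targets):
        if prev is not None:
            # feed the coarser prediction to the next, finer scale
            prev = F.interpolate(prev, scale_factor=2,
                                 mode="trilinear", align_corners=False)
            x = torch.cat([x, prev], dim=1)
        prev = module(x)                    # encoder-decoder at this scale
        losses.append(F.l1_loss(prev, t))   # per-scale loss L
    return prev, losses
```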
Encoder-Decoder Module
Layer notation: Conv(#in, #out, kernel, stride), likewise Pool and Unpool.

Input, Features → Concat
→ Conv(1, 16, 3×3×3, 1) → Conv(16, 32, 3×3×3, 1) → Pool(32, 32, 2×2×2, 2)
→ Conv(32, 32, 3×3×3, 1) → Conv(32, 64, 3×3×3, 1) → Pool(64, 64, 2×2×2, 2)
→ Conv(64, 64, 3×3×3, 1) → Conv(64, 64, 3×3×3, 1) → Conv(64, 64, 3×3×3, 1)
→ Unpool(64, 64, 2×2×2, 2) → Concat (skip) → Conv(128, 32, 3×3×3, 1) → Conv(32, 32, 3×3×3, 1)
→ Unpool(32, 32, 2×2×2, 2) → Concat (skip) → Conv(64, 16, 3×3×3, 1) → Conv(16, 16, 3×3×3, 1)
→ Structure, Features → Reconstruction
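As a dense PyTorch stand-in for this module; the paper executes these layers on the grid-octree structure of [4], and the ReLU placement here is an assumption since activations are not shown in the figure:

```python
# Dense stand-in for the encoder-decoder module (channel sizes follow the
# figure; octree-aware conv/pool/unpool are replaced by their dense cousins).
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv3d(1, 16, 3, 1, 1), nn.ReLU(),
                                  nn.Conv3d(16, 32, 3, 1, 1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv3d(32, 32, 3, 1, 1), nn.ReLU(),
                                  nn.Conv3d(32, 64, 3, 1, 1), nn.ReLU())
        self.mid  = nn.Sequential(nn.Conv3d(64, 64, 3, 1, 1), nn.ReLU(),
                                  nn.Conv3d(64, 64, 3, 1, 1), nn.ReLU(),
                                  nn.Conv3d(64, 64, 3, 1, 1), nn.ReLU())
        self.dec2 = nn.Sequential(nn.Conv3d(128, 32, 3, 1, 1), nn.ReLU(),
                                  nn.Conv3d(32, 32, 3, 1, 1), nn.ReLU())
        self.dec1 = nn.Sequential(nn.Conv3d(64, 16, 3, 1, 1), nn.ReLU(),
                                  nn.Conv3d(16, 16, 3, 1, 1), nn.ReLU())

    def forward(self, x):
        f1 = self.enc1(x)                             # 32 channels
        f2 = self.enc2(F.max_pool3d(f1, 2))           # 64 channels, 1/2 res
        m = self.mid(F.max_pool3d(f2, 2))             # 64 channels, 1/4 res
        u2 = F.interpolate(m, scale_factor=2)         # unpool
        d2 = self.dec2(torch.cat([u2, f2], dim=1))    # concat skip: 64+64=128
        u1 = F.interpolate(d2, scale_factor=2)        # unpool
        return self.dec1(torch.cat([u1, f1], dim=1))  # concat skip: 32+32=64
```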
Structure Module
Two branches operate on the feature map O of size n × D×H×W:
Feature branch: O → Unpool(n, n, 2×2×2, 2) → P (n × 2D×2H×2W) → Split → Q (n × 2D×2H×2W)
Reconstruction branch: O → Conv(n, 1, 3×3×3, 1) → R (1 × D×H×W) → loss L (intermediate reconstruction)
The split mask derived from R determines the resulting octree structure at the finer resolution.
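A minimal dense-tensor sketch of this module; deriving the split mask by thresholding |R| with a hypothetical parameter tau is an assumption about how the derived split mask is computed:

```python
# Dense-tensor sketch of the structure module (threshold tau is hypothetical).
import torch
import torch.nn as nn
import torch.nn.functional as F

class StructureModule(nn.Module):
    def __init__(self, n, tau=1.0):
        super().__init__()
        self.recon = nn.Conv3d(n, 1, 3, 1, 1)  # Conv(n, 1, 3x3x3, 1)
        self.tau = tau

    def forward(self, O):
        R = self.recon(O)                       # intermediate reconstruction R
        P = F.interpolate(O, scale_factor=2)    # Unpool(n, n, 2x2x2, 2) -> P
        split = R.abs() < self.tau              # split cells near the surface
        mask = F.interpolate(split.float(), scale_factor=2)
        Q = P * mask                            # Split: refine only masked cells
        return Q, R, split                      # R is supervised by loss L
```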
Volumetric Shape Completion
Voxlets Dataset [3]
Method              IoU     Precision   Recall
Zheng et al.* [6]   0.528   0.773       0.630
Firman et al.* [3]  0.585   0.793       0.658
Firman et al. [3]   0.550   0.734       0.705
Ours                0.650   0.834       0.756
[Figure: qualitative shape completion results: Zheng et al., Firman et al., Ours, Ground-Truth]
Volumetric Depth Fusion
ModelNet: Input Encoding
MAD (mm)   VolFus   TV-L1   Occ     TDF + Occ   TSDF    TSDF Hist
64³        4.136    3.899   2.095   1.987       1.710   1.715
128³       2.058    1.690   0.955   0.961       0.838   0.736
256³       1.020    0.778   0.410   0.408       0.383   0.337
[Figure: qualitative fusion results: VolFus [2], TV-L1 [5], Ours, Ground-Truth]
ModelNet: Number of Input Views
MAD (mm)   views=1                    views=2                   views=4                  views=6
           VolFus   TV-L1    Ours    VolFus   TV-L1   Ours     VolFus  TV-L1   Ours    VolFus  TV-L1   Ours
64³        59.295   48.345   7.855   15.626   13.267  2.755    4.136   3.899   1.715   3.171   2.905   1.484
128³       29.795   26.525   3.853   7.850    6.999   1.333    2.058   1.690   0.736   1.648   1.445   0.661
256³       14.919   14.529   1.927   3.929    3.537   0.616    1.020   0.778   0.337   0.842   0.644   0.360
ModelNet: Varying Input Noise
MAD (mm)   σ = 0.00                  σ = 0.01                  σ = 0.02                 σ = 0.03
           VolFus   TV-L1   Ours    VolFus   TV-L1   Ours     VolFus  TV-L1   Ours    VolFus  TV-L1   Ours
64³        3.020    3.272   1.647   3.439    3.454   1.487    4.136   3.899   1.715   4.852   4.413   1.938
128³       1.330    1.396   0.744   1.647    1.543   0.676    2.058   1.690   0.736   2.420   1.850   0.804
256³       0.621    0.637   0.319   0.819    0.697   0.321    1.020   0.778   0.429   1.188   0.858   0.402
[Figure: qualitative fusion results at σ = 0.0 and σ = 0.03: VolFus [2], TV-L1 [5], Ours, Ground-Truth]
Kinect Object Scans [1]
MAD (mm)   views=10                     views=20
           VolFus    TV-L1    Ours      VolFus   TV-L1    Ours
64³        103.855   25.976   22.540    72.631   22.081   18.422
128³       58.802    12.839   11.827    41.631   11.924   9.637
256³       31.707    5.372    4.806     22.555   5.195    4.110
References
[1] S. Choi et al. "A Large Dataset of Object Scans". In: arXiv.org 1602.02481 (2016).
[2] B. Curless and M. Levoy. "A Volumetric Method for Building Complex Models from Range Images". In: SIGGRAPH. 1996.
[3] M. Firman et al. "Structured Prediction of Unobserved Voxels From a Single Depth Image". In: CVPR. 2016.
[4] G. Riegler et al. "OctNet: Learning Deep 3D Representations at High Resolutions". In: CVPR. 2017.
[5] C. Zach et al. "A Globally Optimal Algorithm for Robust TV-L1 Range Image Integration". In: ICCV. 2007.
[6] B. Zheng et al. "Beyond Point Clouds: Scene Understanding by Reasoning Geometry and Physics". In: CVPR. 2013.