TRANSCRIPT
Autonomous Vision Group
Max Planck Institute for Intelligent Systems
OctNetFusion: Learning Depth Fusion from Data
Gernot Riegler¹, Ali Osman Ulusoy²,³, Horst Bischof¹, Andreas Geiger²,⁴
¹Graz University of Technology  ²MPI for Intelligent Systems Tübingen  ³Microsoft  ⁴ETH Zürich
Motivation
[Figure: TSDF fusion of a scene under three conditions: 32 views without noise, 2 views without noise, and 32 views with noise]
OctNet
Use a grid of shallow octrees [4].
[Figure: a shallow octree encoded as bit strings, one bit per cell (1 = split), e.g. root bit 1 with child bit strings 01010001, 01000101, 00010000, 01010000]
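As a sketch of this encoding: each shallow octree has depth 3, covers 8³ voxels, and is stored as a 73-bit string (1 root bit, 8 depth-1 bits, 64 depth-2 bits), where a set bit marks a cell that splits into 8 children. The snippet below is a hypothetical Python illustration, not the authors' implementation:

```python
# Hypothetical illustration of the 73-bit shallow-octree encoding
# (1 root bit + 8 depth-1 bits + 64 depth-2 bits); a set bit means
# "this cell is split into 8 children".

def children(i):
    """Indices of the 8 children of tree cell i (root is index 0)."""
    return range(8 * i + 1, 8 * i + 9)

def is_leaf(bits, i):
    """A cell is a leaf if it is not split; depth-3 cells are always leaves."""
    return i >= 73 or not bits[i]

# Example matching the figure: the root is split, and its children
# follow the bit pattern 01010001 (three of them are split again).
bits = [False] * 73
bits[0] = True
for c, b in zip(children(0), "01010001"):
    bits[c] = (b == "1")

print([int(is_leaf(bits, c)) for c in children(0)])  # [1, 0, 1, 0, 1, 1, 1, 0]
```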
Convolution
$$O_\text{out}[i,j,k] = \operatorname{pool}_{(\bar i,\bar j,\bar k)\,\in\,\Omega[i,j,k]}\left(\sum_{l=0}^{L-1}\sum_{m=0}^{M-1}\sum_{n=0}^{N-1} W_{l,m,n}\cdot O_\text{in}\!\left[\bar i+l-\left\lfloor\tfrac{L}{2}\right\rfloor,\;\bar j+m-\left\lfloor\tfrac{M}{2}\right\rfloor,\;\bar k+n-\left\lfloor\tfrac{N}{2}\right\rfloor\right]\right)$$
[Figure: convolving an octree cell with a 3×3 filter (weights 0.125, 0.250, 0.125 / 0.000, 0.000, 0.000 / 0.125, 0.250, 0.125), followed by pooling]
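In words: convolve at the finest voxel resolution, then aggregate the responses over all voxels Ω[i,j,k] that belong to the same octree cell. A minimal numpy sketch on a dense stand-in grid follows; the function name and the per-voxel cell_id array are hypothetical, average pooling for pool_voxels is an assumption, and the efficient octree implementation is the one in [4]:

```python
# Dense stand-in sketch of the octree convolution above (hypothetical;
# the real implementation [4] never materializes the dense grid).
import numpy as np
from scipy.ndimage import convolve

def octree_conv(O_in, W, cell_id):
    """O_in: dense voxel grid, W: LxMxN filter, cell_id: per-voxel cell label."""
    dense = convolve(O_in, W, mode="constant")   # the triple sum over l, m, n
    O_out = np.empty_like(dense)
    for c in np.unique(cell_id):
        mask = cell_id == c
        O_out[mask] = dense[mask].mean()         # pool over Omega[i, j, k]
    return O_out
```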
Pooling
$$O_\text{out}[i,j,k] = \begin{cases} O_\text{in}[2i,2j,2k] & \text{if } \operatorname{cell\_width}(2i,2j,2k) > 1\\[4pt] \displaystyle\max_{l,m,n\,\in\,\{0,1\}} O_\text{in}[2i+l,\,2j+m,\,2k+n] & \text{else} \end{cases}$$
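That is, cells wider than one voxel are simply carried one level up the hierarchy, and only unit cells are max-pooled over their 2×2×2 neighborhood. A dense-array sketch of the rule (the per-voxel cell_width array is a hypothetical stand-in for the octree data structure):

```python
# Dense-array sketch of the pooling rule above (hypothetical stand-in;
# cell_width[x, y, z] gives the octree cell size at that voxel).
import numpy as np

def octree_pool(O_in, cell_width):
    D, H, W = O_in.shape
    O_out = np.empty((D // 2, H // 2, W // 2))
    for i in range(D // 2):
        for j in range(H // 2):
            for k in range(W // 2):
                if cell_width[2*i, 2*j, 2*k] > 1:
                    # large cells are copied, not pooled
                    O_out[i, j, k] = O_in[2*i, 2*j, 2*k]
                else:
                    # unit cells: max over the 2x2x2 block
                    O_out[i, j, k] = O_in[2*i:2*i+2, 2*j:2*j+2, 2*k:2*k+2].max()
    return O_out
```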
OctNet Unpooling
[Figure: naïve vs. guided unpooling of an octree]
OctNetFusion
Architecture
[Figure: coarse-to-fine architecture: one encoder-decoder module per resolution (64³, 128³, 256³), each supervised by its own loss L; the prediction of each module is passed on, together with the input at the next resolution, to the next, finer module]
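A hedged PyTorch sketch of this coarse-to-fine data flow on dense voxel grids; the module internals, the ℓ1 per-scale loss, and trilinear upsampling between scales are assumptions:

```python
# Sketch of the coarse-to-fine pipeline (dense tensors as octree stand-ins).
import torch
import torch.nn.functional as F

def coarse_to_fine(inputs, modules, targets):
    """inputs/targets: (B, C, D, H, W) volumes at 64^3, 128^3, 256^3;
    modules: one encoder-decoder network per scale."""
    losses, prev = [], None
    for x, module, t in zip(inputs, modules, targets):
        if prev is not None:
            # feed the coarser prediction to the next, finer scale
            prev = F.interpolate(prev, scale_factor=2,
                                 mode="trilinear", align_corners=False)
            x = torch.cat([x, prev], dim=1)
        prev = module(x)                    # encoder-decoder at this scale
        losses.append(F.l1_loss(prev, t))   # per-scale loss L
    return prev, losses
```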
Encoder-Decoder Module
Layer notation: Conv(#in, #out, kernel, stride), likewise Pool and Unpool.

Input, Features → Concat
→ Conv(1, 16, 3×3×3, 1) → Conv(16, 32, 3×3×3, 1) → Pool(32, 32, 2×2×2, 2)
→ Conv(32, 32, 3×3×3, 1) → Conv(32, 64, 3×3×3, 1) → Pool(64, 64, 2×2×2, 2)
→ Conv(64, 64, 3×3×3, 1) → Conv(64, 64, 3×3×3, 1) → Conv(64, 64, 3×3×3, 1)
→ Unpool(64, 64, 2×2×2, 2) → Concat (skip) → Conv(128, 32, 3×3×3, 1) → Conv(32, 32, 3×3×3, 1)
→ Unpool(32, 32, 2×2×2, 2) → Concat (skip) → Conv(64, 16, 3×3×3, 1) → Conv(16, 16, 3×3×3, 1)
→ Structure, Features → Reconstruction
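As a dense PyTorch stand-in for this module; the paper executes these layers on the grid-octree structure of [4], and the ReLU placement here is an assumption since activations are not shown in the figure:

```python
# Dense stand-in for the encoder-decoder module (channel sizes follow the
# figure; octree-aware conv/pool/unpool are replaced by their dense cousins).
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv3d(1, 16, 3, 1, 1), nn.ReLU(),
                                  nn.Conv3d(16, 32, 3, 1, 1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv3d(32, 32, 3, 1, 1), nn.ReLU(),
                                  nn.Conv3d(32, 64, 3, 1, 1), nn.ReLU())
        self.mid  = nn.Sequential(nn.Conv3d(64, 64, 3, 1, 1), nn.ReLU(),
                                  nn.Conv3d(64, 64, 3, 1, 1), nn.ReLU(),
                                  nn.Conv3d(64, 64, 3, 1, 1), nn.ReLU())
        self.dec2 = nn.Sequential(nn.Conv3d(128, 32, 3, 1, 1), nn.ReLU(),
                                  nn.Conv3d(32, 32, 3, 1, 1), nn.ReLU())
        self.dec1 = nn.Sequential(nn.Conv3d(64, 16, 3, 1, 1), nn.ReLU(),
                                  nn.Conv3d(16, 16, 3, 1, 1), nn.ReLU())

    def forward(self, x):
        f1 = self.enc1(x)                             # 32 channels
        f2 = self.enc2(F.max_pool3d(f1, 2))           # 64 channels, 1/2 res
        m = self.mid(F.max_pool3d(f2, 2))             # 64 channels, 1/4 res
        u2 = F.interpolate(m, scale_factor=2)         # unpool
        d2 = self.dec2(torch.cat([u2, f2], dim=1))    # concat skip: 64+64=128
        u1 = F.interpolate(d2, scale_factor=2)        # unpool
        return self.dec1(torch.cat([u1, f1], dim=1))  # concat skip: 32+32=64
```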
Structure Module
Two branches operate on the feature map O of size n × D×H×W:
Feature branch: O → Unpool(n, n, 2×2×2, 2) → P (n × 2D×2H×2W) → Split → Q (n × 2D×2H×2W)
Reconstruction branch: O → Conv(n, 1, 3×3×3, 1) → R (1 × D×H×W) → loss L (intermediate reconstruction)
The split mask derived from R determines the resulting octree structure at the finer resolution.
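A minimal dense-tensor sketch of this module; deriving the split mask by thresholding |R| with a hypothetical parameter tau is an assumption about how the derived split mask is computed:

```python
# Dense-tensor sketch of the structure module (threshold tau is hypothetical).
import torch
import torch.nn as nn
import torch.nn.functional as F

class StructureModule(nn.Module):
    def __init__(self, n, tau=1.0):
        super().__init__()
        self.recon = nn.Conv3d(n, 1, 3, 1, 1)  # Conv(n, 1, 3x3x3, 1)
        self.tau = tau

    def forward(self, O):
        R = self.recon(O)                       # intermediate reconstruction R
        P = F.interpolate(O, scale_factor=2)    # Unpool(n, n, 2x2x2, 2) -> P
        split = R.abs() < self.tau              # split cells near the surface
        mask = F.interpolate(split.float(), scale_factor=2)
        Q = P * mask                            # Split: refine only masked cells
        return Q, R, split                      # R is supervised by loss L
```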
Volumetric Shape Completion
Voxlets Dataset [3]
Method              IoU     Precision   Recall
Zheng et al.* [6]   0.528   0.773       0.630
Firman et al.* [3]  0.585   0.793       0.658
Firman et al. [3]   0.550   0.734       0.705
Ours                0.650   0.834       0.756
[Figure: qualitative shape completion results: Zheng et al., Firman et al., Ours, Ground-Truth]
Volumetric Depth Fusion
ModelNet: Input Encoding
MAD (mm)   VolFus   TV-L1   Occ     TDF + Occ   TSDF    TSDF Hist
64³        4.136    3.899   2.095   1.987       1.710   1.715
128³       2.058    1.690   0.955   0.961       0.838   0.736
256³       1.020    0.778   0.410   0.408       0.383   0.337
[Figure: qualitative fusion results: VolFus [2], TV-L1 [5], Ours, Ground-Truth]
ModelNet: Number of Input Views
MAD (mm)   views=1                    views=2                   views=4                  views=6
           VolFus   TV-L1    Ours    VolFus   TV-L1   Ours     VolFus  TV-L1   Ours    VolFus  TV-L1   Ours
64³        59.295   48.345   7.855   15.626   13.267  2.755    4.136   3.899   1.715   3.171   2.905   1.484
128³       29.795   26.525   3.853   7.850    6.999   1.333    2.058   1.690   0.736   1.648   1.445   0.661
256³       14.919   14.529   1.927   3.929    3.537   0.616    1.020   0.778   0.337   0.842   0.644   0.360
ModelNet: Varying Input Noise
MAD (mm)   σ = 0.00                  σ = 0.01                  σ = 0.02                 σ = 0.03
           VolFus   TV-L1   Ours    VolFus   TV-L1   Ours     VolFus  TV-L1   Ours    VolFus  TV-L1   Ours
64³        3.020    3.272   1.647   3.439    3.454   1.487    4.136   3.899   1.715   4.852   4.413   1.938
128³       1.330    1.396   0.744   1.647    1.543   0.676    2.058   1.690   0.736   2.420   1.850   0.804
256³       0.621    0.637   0.319   0.819    0.697   0.321    1.020   0.778   0.429   1.188   0.858   0.402
[Figure: qualitative fusion results at σ = 0.0 and σ = 0.03: VolFus [2], TV-L1 [5], Ours, Ground-Truth]
Kinect Object Scans [1]
MAD (mm)   views=10                     views=20
           VolFus    TV-L1    Ours      VolFus   TV-L1    Ours
64³        103.855   25.976   22.540    72.631   22.081   18.422
128³       58.802    12.839   11.827    41.631   11.924   9.637
256³       31.707    5.372    4.806     22.555   5.195    4.110
References
[1] S. Choi et al. "A Large Dataset of Object Scans". In: arXiv.org 1602.02481 (2016).
[2] B. Curless and M. Levoy. "A Volumetric Method for Building Complex Models from Range Images". In: SIGGRAPH. 1996.
[3] M. Firman et al. "Structured Prediction of Unobserved Voxels From a Single Depth Image". In: CVPR. 2016.
[4] G. Riegler et al. "OctNet: Learning Deep 3D Representations at High Resolutions". In: CVPR. 2017.
[5] C. Zach et al. "A Globally Optimal Algorithm for Robust TV-L1 Range Image Integration". In: ICCV. 2007.
[6] B. Zheng et al. "Beyond Point Clouds: Scene Understanding by Reasoning Geometry and Physics". In: CVPR. 2013.