learning-based localization · 2020. 3. 31. · posenet discussion 29 [1]“posenet: a...
TRANSCRIPT
Learning-based Localization Eric Brachmann
ECCV 2018 Tutorial on Visual Localization - Feature-based vs. Learned Approaches
Torsten Sattler, Eric Brachmann
Roadmap
• Machine Learning Basics [10min]
  • Convolutional Neural Networks
  • Random Forests
• Camera Pose Regression [15min]
  • Pose Parametrization
  • Pose Loss Functions
  • Results
• Scene Coordinate Regression [35min]
  • Scene Coordinate Regression Forests
  • Scene Coordinate Regression CNNs
  • End-to-End Learning
  • Results
Machine Learning
$\mathbf{y} = f(I, \mathbf{w})$
[Figure: a Training Set of examples labeled Monday, Wednesday, Saturday, Sunday; $f$ predicts the label $\mathbf{y}$ for an input $I$]
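Convolutional Neural Networks
[Figure: an image $I$ is processed by stacked filters, built up over several slides]
• Filter weights $\mathbf{w}_1 = \begin{pmatrix} 0.2 & -0.3 \\ 0.7 & 0.1 \end{pmatrix}$ give the response $f_1(I, \mathbf{w}_1)$
• Filter weights $\mathbf{w}_2 = \begin{pmatrix} 0.6 & -0.4 \\ -0.1 & 0.0 \end{pmatrix}$ give the response $f_2(f_1(I, \mathbf{w}_1), \mathbf{w}_2)$
[Plot: $f(x)$ over $x$]
• Output (Monday, Wednesday, Saturday, Sunday): $\mathbf{y} = f_3(f_2(f_1(I, \mathbf{w}_1), \mathbf{w}_2), \mathbf{w}_3)$
• Loss $\ell(\mathbf{y})$ with respect to the training label (Sunday)
• Gradient by the chain rule: $\frac{\partial \ell}{\partial \mathbf{w}_1} = \frac{\partial \ell}{\partial f_3} \frac{\partial f_3}{\partial f_2} \frac{\partial f_2}{\partial f_1} \frac{\partial f_1}{\partial \mathbf{w}_1}$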
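The chain rule for $\partial \ell / \partial \mathbf{w}_1$ can be checked numerically on a toy composition. The following is a minimal sketch, assuming scalar layers `f1`, `f2`, `f3` and a squared-error loss; these specific functions are hypothetical stand-ins for the convolutional layers and do not come from the slides.

```python
import math

def f1(I, w1):
    # First layer: a scalar linear map (stand-in for a convolution).
    return w1 * I

def f2(a, w2):
    # Second layer: linear map followed by a tanh non-linearity.
    return math.tanh(w2 * a)

def f3(b, w3):
    # Output layer: linear map.
    return w3 * b

def loss(y, target):
    # Squared error between prediction and target.
    return (y - target) ** 2

def forward(I, w1, w2, w3, target):
    return loss(f3(f2(f1(I, w1), w2), w3), target)

def grad_w1(I, w1, w2, w3, target):
    # Multiply the local derivatives from the loss back to w1.
    a = f1(I, w1)
    b = f2(a, w2)
    y = f3(b, w3)
    dl_df3 = 2.0 * (y - target)
    df3_df2 = w3
    df2_df1 = w2 * (1.0 - b ** 2)  # derivative of tanh(w2 * a) w.r.t. a
    df1_dw1 = I
    return dl_df3 * df3_df2 * df2_df1 * df1_dw1
```

Comparing `grad_w1` against a central finite difference of `forward` is a quick way to confirm that the product of local derivatives matches the gradient of the whole composition.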
Convolutional Neural Networks
[Figure: network with weights $\mathbf{w}$, from Long et al., “Fully convolutional networks for semantic segmentation”, CVPR15]
[Figure: network with weights $\mathbf{w}$, 12 Scenes Dataset: http://graphics.stanford.edu/projects/reloc/]
[Figure from Kokkinos, “UberNet: Training a Universal Convolutional Neural Network for Low-, Mid-, and High-Level Vision using Diverse Datasets and Limited Memory”, CVPR17]
Convolutional Neural Networks
• Advantages:
• Expensive but Parallelizable
• Powerful due to End-To-End Training
• Disadvantages:
• Interpretability
• Training Data
[Diagram: Image $I$ → Feature Extraction (hand-crafted or learned) → Classification (hand-crafted or learned) → Output]
[Diagram: Image $I$ → Learned → Output]
Random Forests
$\mathbf{y} = f(I, \mathbf{w})$
Feature responses: $g_\text{awake}(\mathbf{w}_1) = 0.5$, $g_\text{smile}(\mathbf{w}_2) = -0.7$, $g_\text{hat}(\mathbf{w}_3) = 0.0$
[Figure: decision tree with the split tests $g_\text{awake} > 0$ and $g_\text{smile} > 0$ and the leaves Sunday, Saturday, Monday]
[Figure: tree training, built up over several slides; the training set (Monday, Wednesday, Saturday, Sunday) is split first by $g_\text{awake} > 0$, then by $g_\text{smile} > 0$]
Random Forests
[Shotton et al., “Real-Time Human Pose Recognition in Parts from Single Depth Images”, CVPR11]
Random Forests
• Advantages
• Fast
• Good Performance
• Need less training data
• Disadvantages
• No end-to-end training*
• Hand-crafted features*
• Interpretability
* Not in [Kontschieder et al., “Deep Neural Decision Forests”, ICCV15]
Rule 34 (of computer science after 2012): “There is a paper about it, no exceptions.”
(Andrej Karpathy, t-SNE embeddings of IMAGENET)
Camera Pose Regression
Input: RGB Image → Feed-Forward CNN → Output: Pose $\hat{\mathbf{h}} = (r_1, r_2, r_3, t_1, t_2, t_3)$
Annotated Training Data
[7 Scenes Dataset: https://www.microsoft.com/en-us/research/project/rgb-d-dataset-7-scenes/]
Pose Parametrization
$\hat{\mathbf{h}} \in SE(3)$ with $SE(3) = \left\{ \begin{pmatrix} R & \mathbf{t} \\ 0 & 1 \end{pmatrix} \,\middle|\, R \in SO(3),\ \mathbf{t} \in \mathbb{R}^3 \right\}$
$SO(3) = \{ R \in \mathbb{R}^{3 \times 3} \mid R R^T = I,\ \det R = 1 \}$
$R = \begin{pmatrix} r_1 & r_2 & r_3 \\ r_4 & r_5 & r_6 \\ r_7 & r_8 & r_9 \end{pmatrix}$
More about the wonderful world of SO(3) in e.g. [Hartley et al., “Rotation Averaging”, IJCV13]
• Rotation Matrix: $R$. Enforce/Map: $R R^T = I$, $\det R = 1$
• Unit Quaternion, e.g. [1]: $\mathbf{q} = (c, v_1, v_2, v_3)$. Enforce/Map: $\|\mathbf{q}\| = 1$
• Log Unit Quaternion, e.g. [3]: $\log \mathbf{q} = \tfrac{1}{2} \log R$ [4]. Enforce/Map: none
• Axis-Angle, e.g. [2]: $\log R = \theta \hat{\mathbf{u}} = (u_1', u_2', u_3')$. Enforce/Map: none
[1] Kendall et al., “PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization”, ICCV15
[2] Brachmann et al., “DSAC - Differentiable RANSAC for Camera Localization”, CVPR17
[3] Brahmbhatt et al., “Geometry-Aware Learning of Maps for Camera Localization”, CVPR18
[4] Sola, “Quaternion kinematics for the error-state Kalman filter”, 2017
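The rotation parametrizations listed on this slide can be converted into each other. The following is a minimal NumPy sketch, assuming the quaternion layout $(c, v_1, v_2, v_3)$ with the scalar first; the function names are illustrative and not taken from any of the cited code bases.

```python
import numpy as np

def axis_angle_to_matrix(r):
    """Rodrigues' formula: axis-angle vector r = theta * u -> rotation matrix."""
    theta = np.linalg.norm(r)
    if theta < 1e-12:
        return np.eye(3)
    u = r / theta
    K = np.array([[0.0, -u[2], u[1]],
                  [u[2], 0.0, -u[0]],
                  [-u[1], u[0], 0.0]])  # cross-product matrix of u
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def axis_angle_to_quaternion(r):
    """Axis-angle vector -> unit quaternion (c, v1, v2, v3)."""
    theta = np.linalg.norm(r)
    if theta < 1e-12:
        return np.array([1.0, 0.0, 0.0, 0.0])
    u = r / theta
    return np.concatenate(([np.cos(theta / 2.0)], np.sin(theta / 2.0) * u))

def normalize_quaternion(q):
    """Map an unconstrained 4-vector (e.g. a raw network output) to ||q|| = 1."""
    return q / np.linalg.norm(q)

def quaternion_log(q):
    """Log of a unit quaternion: (theta / 2) * u, half the axis-angle vector."""
    c, v = q[0], q[1:]
    n = np.linalg.norm(v)
    if n < 1e-12:
        return np.zeros(3)
    return np.arctan2(n, c) * v / n
```

The axis-angle and log quaternion forms need no constraint, which is why the slide lists no Enforce/Map step for them, whereas a raw 4-vector has to be normalized before it can be read as a rotation.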
Pose Loss Functions
$\mathbf{h} = (R, \mathbf{t})$, $\ell(\mathbf{t}, \mathbf{t}^*) = \|\mathbf{t} - \mathbf{t}^*\|$
How to measure rotation error?
• Quaternion Distance [2]: $\|\mathbf{q} - \mathbf{q}^*\|$
• Angle-Axis Distance, Log Quaternion Distance [3]: $\|\log R - \log R^*\|$
• Angular Distance [1]: $\theta(R, R^*) = \|\log(R^* R^T)\|$
[Figure: angle $\theta$ between $R$ and $R^*$; unit quaternions $\mathbf{q}$ and $\mathbf{q}^*$ with the angle $\theta$ and the distance $\|\mathbf{q} - \mathbf{q}^*\|$]
$\|\log R - \log R^*\| \neq \|\log TR - \log TR^*\|$ [4]
[1] Brachmann et al., “DSAC - Differentiable RANSAC for Camera Localization”, CVPR17
[2] Kendall et al., “PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization”, ICCV15
[3] Brahmbhatt et al., “Geometry-Aware Learning of Maps for Camera Localization”, CVPR18
[4] Hartley et al., “Rotation Averaging”, IJCV13
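The three rotation errors on this slide can be sketched in NumPy as follows. This is an illustrative sketch, not code from the cited papers: `rot_z` is a hypothetical helper for building test rotations, `rotation_log` is not robust near $\theta = \pi$, and the quaternion distance ignores that $\mathbf{q}$ and $-\mathbf{q}$ describe the same rotation.

```python
import numpy as np

def rot_z(angle):
    """Hypothetical helper: rotation about the z-axis by `angle` radians."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rotation_log(R):
    """Axis-angle vector theta * u of R (assumes theta is not close to pi)."""
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    if theta < 1e-12:
        return np.zeros(3)
    axis = np.array([R[2, 1] - R[1, 2],
                     R[0, 2] - R[2, 0],
                     R[1, 0] - R[0, 1]]) / (2.0 * np.sin(theta))
    return theta * axis

def quaternion_distance(q, q_star):
    """||q - q*|| for unit quaternions (sign ambiguity not handled)."""
    return np.linalg.norm(q - q_star)

def axis_angle_distance(R, R_star):
    """||log R - log R*||, computed in the axis-angle vector space."""
    return np.linalg.norm(rotation_log(R) - rotation_log(R_star))

def angular_distance(R, R_star):
    """theta(R, R*) = ||log(R* R^T)||, the angle of the relative rotation."""
    return np.linalg.norm(rotation_log(R_star @ R.T))
```

For two rotations about the same axis, such as `rot_z(0.3)` and `rot_z(0.5)`, the axis-angle distance and the angular distance agree (0.2 rad); they differ in general, which is the point of the inequality attributed to [4].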
Pose Loss Functions
How to combine rotation error and translation error?
• Hand-Tuned [2]: $\ell_\beta(\mathbf{h}, \mathbf{h}^*) = \ell(\mathbf{t}, \mathbf{t}^*) + \beta\, \ell(R, R^*)$
• Self-Tuned [3]: $\ell_{\sigma^2}(\mathbf{h}, \mathbf{h}^*) = \ell(\mathbf{t}, \mathbf{t}^*) \exp(-s_t) + s_t + \ell(R, R^*) \exp(-s_R) + s_R$, with $s = \log \sigma^2$
• Implicit/Metric [1]: $\ell(\mathbf{h}, \mathbf{h}^*) = \ell(\mathbf{t}, \mathbf{t}^*) + \ell(R, R^*)$ or $\max(\ell(\mathbf{t}, \mathbf{t}^*), \ell(R, R^*))$
[1] Brachmann et al., “DSAC - Differentiable RANSAC for Camera Localization”, CVPR17
[2] Kendall et al., “PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization”, ICCV15
[3] Kendall and Cipolla, “Geometric Loss Functions for Camera Pose Regression with Deep Learning”, CVPR 2017
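The combination schemes differ only in how the two scalar errors are weighted. A minimal sketch, assuming the translation error `l_t` and rotation error `l_R` have already been computed as plain floats; the function names are illustrative.

```python
import math

def loss_beta(l_t, l_R, beta):
    # Hand-tuned: beta is a fixed hyperparameter balancing the two terms.
    return l_t + beta * l_R

def loss_sigma(l_t, l_R, s_t, s_R):
    # Self-tuned: s_t = log(sigma_t^2) and s_R = log(sigma_R^2) are free
    # parameters; the "+ s" terms keep them from growing without bound.
    return l_t * math.exp(-s_t) + s_t + l_R * math.exp(-s_R) + s_R

def loss_metric_sum(l_t, l_R):
    # Implicit/metric: both errors are expressed in comparable units.
    return l_t + l_R

def loss_metric_max(l_t, l_R):
    # Implicit/metric variant: the larger of the two errors dominates.
    return max(l_t, l_R)
```

With `s_t = s_R = 0` the self-tuned loss reduces to the plain sum, so the log-variances can be read as learned replacements for the hand-tuned $\beta$.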
Pose Loss Functions
How to combine rotation error and translation error?
Measure the reprojection error [1]: $\ell_\pi(\mathbf{h}, \mathbf{h}^*) = \sum_{\mathbf{v} \in \mathcal{M}} \|\pi(\mathbf{h}, \mathbf{v}) - \pi(\mathbf{h}^*, \mathbf{v})\|$
[1] Kendall and Cipolla, “Geometric Loss Functions for Camera Pose Regression with Deep Learning”, CVPR 2017
PoseNet Discussion
Results on the 7 Scenes Dataset [https://www.microsoft.com/en-us/research/project/rgb-d-dataset-7-scenes/]:
2015
• PoseNet [1]: ~45cm, 10° (GoogLeNet, Quaternions, $\ell_\beta$)
2017
• Spatial LSTM [2]: ~31cm, 10° (GoogLeNet+LSTM, Quaternions, $\ell_\beta$)
• PoseNet [3]: ~23cm, 8° (GoogLeNet, Quaternions, $\ell_{\sigma^2}$)
• PoseNet [3]: ~23cm, 8° (GoogLeNet, Quaternions, $\ell_\pi$)
2018
• PoseNet [4]: ~23cm, 8° (ResNet, Quaternions, $\ell_\pi$)
• PoseNet [4]: ~22cm, 8° (ResNet, log Quat., $\ell_{\sigma^2}$)
• MapNet+ [4]: ~19cm, 7° (ResNet, log Quat., $\ell_{\sigma^2} + \ell_\text{Rel}$)
For comparison, Sparse Features [5]: ~5cm, 2°
Advantages PoseNet:
• Simple
• End-To-End Learning
• Fast
[1] “PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization”, Kendall et al., ICCV 2015
[2] “Image-Based Localization with Spatial LSTMs”, Walch et al., ICCV 2017
[3] “Geometric Loss Functions for Camera Pose Regression with Deep Learning”, Kendall and Cipolla, CVPR 2017
[4] “Geometry-Aware Learning of Maps for Camera Localization”, Brahmbhatt et al., CVPR 2018
[5] “Efficient & Effective Prioritized Matching for Large-Scale Image-Based Localization”, Sattler et al., PAMI 2017
Side Comment: Visual Odometry
• Pose Regression seems to work very well for estimating relative poses [1]
• Additional Input: Pose estimate of last frame (≈ Tracking)
• Accuracy: ~5cm, 4°
[1] “Deep Auxiliary Learning For Visual Localization And Odometry”, Valada et al., ICRA 2018
Feature-Based Pipeline
Image $I$ → Feature Extraction → Feature Matching → Estimate Pose → 6D Pose $\hat{\mathbf{h}}$
• Feature Matching yields correspondences between Image Coordinates $\mathbf{p}_i$ and Scene Coordinates $\mathbf{y}_i$
• Estimate Pose: $\hat{\mathbf{h}} = \text{PnP}(\{\mathbf{p}_i, \mathbf{y}_i\})$
• In practice inside RANSAC: Estimate Poses (PnP), Score Poses (Inliers)
[7 Scenes Dataset: https://www.microsoft.com/en-us/research/project/rgb-d-dataset-7-scenes/]
Task Knowledge: Local Invariance
[Figure: a Training Image with known pose $\hat{\mathbf{h}}$ and Test Images with unknown poses (?)]
[7 Scenes Dataset: https://www.microsoft.com/en-us/research/project/rgb-d-dataset-7-scenes/]
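Task Knowledge: The Camera Model
$\mathbf{p}_i = C \mathbf{h} \mathbf{y}_i$
• $\mathbf{p}_i$: Image Coordinate
• $C$: Camera Matrix
• $\mathbf{h}$: Pose
• $\mathbf{y}_i$: Scene Coordinate
[Figure: Scene Coordinate System and Camera Coordinate System; the scene coordinate $\mathbf{y}_i$ projects through $C$ to the image coordinate $\mathbf{p}_i$]
Estimate Pose: $\hat{\mathbf{h}} = \text{PnP}(\{\mathbf{p}_i, \mathbf{y}_i\})$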
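The camera model can be written out in homogeneous coordinates. A minimal sketch, assuming a $3 \times 4$ camera matrix `C`, a $4 \times 4$ pose `h` that maps scene coordinates into the camera frame, and a pinhole camera without lens distortion; the numeric values in the usage note are made up for illustration.

```python
import numpy as np

def project_scene_coordinate(C, h, y):
    """Apply p ~ C h y and dehomogenize to a 2D image coordinate.

    C: (3, 4) camera matrix, h: (4, 4) pose (scene -> camera), y: (3,) point.
    """
    y_h = np.append(y, 1.0)      # homogeneous scene coordinate
    p_h = C @ h @ y_h            # homogeneous image coordinate
    return p_h[:2] / p_h[2]      # divide by depth
```

For example, with focal length 500 and principal point (320, 240), a pose that shifts the scene by 1 along the optical axis maps the scene point (0.2, -0.1, 1.0) to depth 2 and therefore to pixel (370, 215). PnP solves the inverse problem: given several pairs $(\mathbf{p}_i, \mathbf{y}_i)$, it recovers the pose.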
Scene Coordinate Regression [Sho13]
Image $I$ (RGB-D) → Scene Coordinate Regression → RANSAC → 6D Pose $\hat{\mathbf{h}}$
• Scene Coordinate Regression: predicts $\mathbf{y}_i \in \mathbb{R}^3$ for $\mathbf{p}_i \in \mathbb{R}^2$
• RANSAC: Estimate Poses (Kabsch), Score Poses (Inliers)
[Sho13] “Scene Coordinate Regression Forests for Camera Relocalization in RGB-D Images”, Shotton et al., CVPR’13
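Scene Coordinate Regression Forest
Split test at a node: $f(\mathbf{p}) < \tau$
Features: Depth Adapted Pixel Differences
$f_{DA\text{-}RGB}(\mathbf{p}) = I_{RGB}\left(\mathbf{p} + \frac{\boldsymbol{\delta}_1}{I_D(\mathbf{p})}, c_1\right) - I_{RGB}\left(\mathbf{p} + \frac{\boldsymbol{\delta}_2}{I_D(\mathbf{p})}, c_2\right)$
$f_{DA\text{-}D}(\mathbf{p}) = I_D\left(\mathbf{p} + \frac{\boldsymbol{\delta}_1}{I_D(\mathbf{p})}\right) - I_D\left(\mathbf{p} + \frac{\boldsymbol{\delta}_2}{I_D(\mathbf{p})}\right)$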
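Split Score: Reduction of Spatial Variance
$V(S_L, S_R) = \frac{1}{|S_L|} \sum_{\mathbf{y} \in S_L} \|\mathbf{y} - \bar{\mathbf{y}}_L\| + \frac{1}{|S_R|} \sum_{\mathbf{y} \in S_R} \|\mathbf{y} - \bar{\mathbf{y}}_R\|$
[Figure: a node splits its samples into $S_L$ and $S_R$]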
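The depth-adapted depth feature and the split test $f(\mathbf{p}) < \tau$ can be sketched as follows. This is a minimal sketch under stated assumptions: pixels are indexed as (row, column), probe positions are rounded to the nearest pixel and clamped at the image border, and valid (non-zero) depth is assumed at $\mathbf{p}$; none of these details are specified on the slide.

```python
import numpy as np

def depth_adapted_depth_feature(depth, p, delta1, delta2):
    """f_DA-D(p) = I_D(p + delta1 / I_D(p)) - I_D(p + delta2 / I_D(p)).

    depth: (H, W) depth image, p: (row, col), delta1/delta2: 2D offsets.
    Dividing the offsets by the depth at p makes the probe pattern
    cover roughly the same scene extent regardless of distance.
    """
    d = depth[p[0], p[1]]
    h, w = depth.shape

    def probe(delta):
        q = np.round(np.asarray(p) + np.asarray(delta) / d).astype(int)
        q = np.clip(q, [0, 0], [h - 1, w - 1])  # assumption: clamp at border
        return depth[q[0], q[1]]

    return probe(delta1) - probe(delta2)

def goes_left(depth, p, delta1, delta2, tau):
    """Split test at a tree node: send the pixel left if f(p) < tau."""
    return depth_adapted_depth_feature(depth, p, delta1, delta2) < tau
```

Each tree node stores its own `delta1`, `delta2` and `tau`; during training these are chosen to score well under the split objective.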
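Scene Coordinate Regression Forest
[Figure: forest prediction for an input $X$ with per-tree outputs $f_S$; Mean-Shift yields the prediction $\mathbf{y}_\text{pred}$]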
Scene Coordinate Regression Results

Input   Method            % Pose Err. < 5cm, 5°
RGB     Sparse Features   38.6%
RGB     [Bra16]           55.2%
RGB-D   [Sho13]           72.6%
RGB-D   [Val15]           91.3%
RGB-D   [Cav17]           90.7%

[Figure for [Val15]: input $X$, outputs $f_S$, prediction $\mathbf{y}_\text{pred}$ with distribution $P(\mathbf{y})$]
[Sho13] “Scene Coordinate Regression Forests for Camera Relocalization in RGB-D Images”, Shotton et al., CVPR’13
[Val15] “Exploiting Uncertainty in Regression Forests for Accurate Camera Relocalization”, Valentin et al., CVPR’15
[Cav17] “On-the-Fly Adaptation of Regression Forests for Online Camera Localization”, Cavallari et al., CVPR’17
[Bra16] “Uncertainty-Driven 6D Pose Estimation of Objects and Scenes from a Single RGB Image”, Brachmann et al., CVPR’16
Scene Coordinate Regression Forests
• Advantages:
• Fast
• Good Results
• Training On-The-Fly
• Disadvantages:
• Needs 3D Model for Training
• No End-To-End Training
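$V(S_L, S_R) = \frac{1}{|S_L|} \sum_{\mathbf{y} \in S_L} \|\mathbf{y} - \bar{\mathbf{y}}_L\| + \frac{1}{|S_R|} \sum_{\mathbf{y} \in S_R} \|\mathbf{y} - \bar{\mathbf{y}}_R\|$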
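Scene Coordinate Regression with a CNN
Image $I$ → Scene Coordinate Regression (CNN) → RANSAC → 6D Pose $\hat{\mathbf{h}}$
• Scene Coordinate Regression (CNN): predicts $\mathbf{y}_i \in \mathbb{R}^3$ for $\mathbf{p}_i \in \mathbb{R}^2$
• RANSAC: Estimate Poses (PnP), Score Poses (Inliers)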
Random Forest vs. CNN
[Figure: Ground Truth, Forest Prediction and CNN Prediction; Pose Estimation Succeeds (< 5cm, 5°) vs. Pose Estimation Fails (> 5cm, 5°)]

Method          % Obj. Coord. Error < 10cm   % Pose Err. < 5cm, 5°
Random Forest   ~15%                         55.2%
CNN             ~25%                         17.2%

[Plot: Frequency over Object Coordinate Error (mm), 0 to 1000, for Random Forest and CNN; annotations: Inlier Threshold, Inlier Peaks, Forest Outliers]
Optimized: $\ell_{L2}(\mathbf{y}, \mathbf{y}^*) = \|\mathbf{y} - \mathbf{y}^*\|^2$
Scene Coordinate Loss Functions
[Plots: $\ell_{L2}$, $\ell_{L1}$, $\ell_\text{Huber}$, $\ell_\text{Tukey}$]

Method          % Obj. Coord. Error < 10cm   % Pose Err. < 5cm, 5°
Random Forest   ~15%                         55.2%
CNN (L2)        ~25%                         17.2%
CNN (L1)        ~45%                         55.9%
CNN (Huber)     ~45%                         54.4%
CNN (Tukey)     ~45%                         52.1%

[Plot: Frequency over Object Coordinate Error (mm), 0 to 1000, for Random Forest, CNN (L2) and CNN (L1); annotation: Inlier Threshold]
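The four losses compared above differ in how strongly they penalize large residuals. A minimal NumPy sketch, assuming the losses act on the residual norm $r = \|\mathbf{y} - \mathbf{y}^*\|$ and using the common textbook forms of Huber and Tukey; the thresholds `delta` and `c` are illustrative defaults, not values from the slides.

```python
import numpy as np

def l2_loss(r):
    # Quadratic: outliers with large r dominate the total loss.
    return np.asarray(r, dtype=float) ** 2

def l1_loss(r):
    # Linear: every residual contributes in proportion to its size.
    return np.abs(np.asarray(r, dtype=float))

def huber_loss(r, delta=1.0):
    # Quadratic near zero, linear beyond delta.
    a = np.abs(np.asarray(r, dtype=float))
    return np.where(a <= delta, 0.5 * a ** 2, delta * (a - 0.5 * delta))

def tukey_loss(r, c=1.0):
    # Saturates at c^2 / 6, so residuals beyond c stop contributing gradient.
    a = np.abs(np.asarray(r, dtype=float))
    inside = (c ** 2 / 6.0) * (1.0 - (1.0 - (a / c) ** 2) ** 3)
    return np.where(a <= c, inside, c ** 2 / 6.0)
```

Since pose estimation with RANSAC only needs enough predictions below the inlier threshold, a loss that tolerates some large errors can raise the inlier share, which matches the jump from ~25% to ~45% in the table.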
Scene Coordinate Regression Nets
• Advantages:
  • Fast
  • Better Results
• Disadvantages:
  • Needs 3D Model for Training
  • End-To-End Training?
For comparison, Scene Coordinate Regression Forests:
• Advantages:
  • Fast
  • Good Results
• Disadvantages:
  • Needs 3D Model for Training
  • No End-To-End Training
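Even Better Loss Function?
Image $I$ → Scene Coordinate Regression (CNN) → RANSAC → 6D Pose $\hat{\mathbf{h}}$
• RANSAC: Estimate Poses (PnP), Score Poses (Inliers)
• Optimized: $\ell_{L1}(\mathbf{y}, \mathbf{y}^*) = \|\mathbf{y} - \mathbf{y}^*\|$ on the scene coordinates vs. $\ell(\hat{\mathbf{h}}, \mathbf{h}^*)$ on the final pose
• DSAC – Differentiable RANSAC [Bra17]
[Bra17] “DSAC – Differentiable RANSAC for Camera Localization”, Brachmann et al., CVPR 2017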
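What is RANSAC?
[Figure: data points $\mathbf{y}_i$ and a model $\mathbf{h} = f(\{\mathbf{y}_i\})$; hypotheses $\mathbf{h}_1$, $\mathbf{h}_2$, $\mathbf{h}_3$]
1) Sample multiple hypotheses $\mathbf{h}_j$: $\mathbf{h}_1 = f(\mathbf{y}_1, \mathbf{y}_4)$, $\mathbf{h}_2 = f(\mathbf{y}_2, \mathbf{y}_3)$, $\mathbf{h}_3 = f(\mathbf{y}_1, \mathbf{y}_5)$
2) Score each hypothesis $s(\mathbf{h}_j)$: $s(\mathbf{h}_1) = 2$, $s(\mathbf{h}_2) = 2$, $s(\mathbf{h}_3) = 4$
3) Take best one (and refine): $\hat{\mathbf{h}} = \arg\max_{\mathbf{h}_j} s(\mathbf{h}_j)$, here $\hat{\mathbf{h}} = \mathbf{h}_3$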
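The three RANSAC steps can be sketched for a toy problem. The example below assumes 2D line fitting, where each hypothesis is computed from two sampled points and scored by its inlier count; the line representation, the threshold `tau` and the number of hypotheses are illustrative choices, and the refinement step is left out.

```python
import numpy as np

def fit_line(p, q):
    """Hypothesis from a minimal set: line through p and q as (n, d), n . x = d."""
    direction = q - p
    n = np.array([-direction[1], direction[0]])
    n = n / np.linalg.norm(n)
    return n, float(n @ p)

def inlier_count(line, points, tau):
    """Score: number of points closer than tau to the line."""
    n, d = line
    return int(np.sum(np.abs(points @ n - d) < tau))

def ransac_line(points, num_hypotheses=50, tau=0.1, seed=0):
    rng = np.random.default_rng(seed)
    best, best_score = None, -1
    for _ in range(num_hypotheses):
        # 1) Sample a hypothesis from two distinct random points.
        i, j = rng.choice(len(points), size=2, replace=False)
        h = fit_line(points[i], points[j])
        # 2) Score the hypothesis.
        s = inlier_count(h, points, tau)
        # 3) Keep the best one (argmax over scores).
        if s > best_score:
            best, best_score = h, s
    return best, best_score
```

In camera localization the points become 2D-3D correspondences, `fit_line` becomes a PnP solver on a minimal set, and the distance to the line becomes the reprojection error.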
Our Pipeline
Image $I$ → Scene Coordinate Regression (CNN) → RANSAC → 6D Pose $\hat{\mathbf{h}}$
• RANSAC: Estimate Poses (PnP), Score Poses
[Figure: Input RGB → Scene Coordinate Regression (weights $\mathbf{w}$) → Hypothesis Sampling ($\mathbf{h}_1, \mathbf{h}_2, \mathbf{h}_3, \mathbf{h}_4$) → Scoring ($s(\mathbf{h}_1), \dots, s(\mathbf{h}_4)$; shown: Reprojection Errors of $\mathbf{h}_2$) → Hypothesis Selection → Result $\hat{\mathbf{h}}$]
Gradient for end-to-end training: $\frac{\partial}{\partial \mathbf{w}} \ell(\hat{\mathbf{h}}, \mathbf{h}^*)$
End-to-End Learning: How?
• argmax Selection: $\mathbf{h}_\text{AM} = \arg\max_{\mathbf{h}_j} s(\mathbf{h}_j)$ (non-differentiable, hard decision)
• Soft argmax Selection: $\mathbf{h}_\text{SoftAM} = \sum_j \frac{\exp(s(\mathbf{h}_j))}{\sum_k \exp(s(\mathbf{h}_k))} \mathbf{h}_j$ (differentiable, soft decision)
• Probabilistic Selection: $\mathbf{h}_\text{DSAC} = \mathbf{h}_j$, where $j \sim \frac{\exp(s(\mathbf{h}_j))}{\sum_k \exp(s(\mathbf{h}_k))}$ (differentiable, hard decision)
[Figure: pipeline with weights $\mathbf{w}$, hypotheses $\mathbf{h}_1, \dots, \mathbf{h}_4$ with scores $s(\mathbf{h}_1), \dots, s(\mathbf{h}_4)$, Reprojection Errors of $\mathbf{h}_2$, and result $\hat{\mathbf{h}}$]
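The three selection rules can be compared side by side. A minimal NumPy sketch, assuming each hypothesis is stored as a row vector of pose parameters; averaging pose vectors in `select_soft_argmax` is only meaningful for parametrizations where a weighted mean makes sense, which is an assumption of this sketch.

```python
import numpy as np

def softmax(scores):
    """Selection probabilities exp(s_j) / sum_k exp(s_k), computed stably."""
    z = np.asarray(scores, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def select_argmax(hyps, scores):
    # Hard decision; the index choice has no useful gradient w.r.t. the scores.
    return hyps[int(np.argmax(scores))]

def select_soft_argmax(hyps, scores):
    # Soft decision: probability-weighted average of all hypotheses.
    return softmax(scores) @ np.asarray(hyps, dtype=float)

def select_dsac(hyps, scores, rng):
    # Hard decision, but sampled: index j is drawn from the softmax distribution.
    j = rng.choice(len(hyps), p=softmax(scores))
    return hyps[j], int(j)
```

The soft argmax can return a blend of hypotheses that matches none of them, while the probabilistic selection always returns one actual hypothesis and is trained through the expected loss.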
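DSAC – Differentiable RANSAC
Probabilistic Selection: $\mathbf{h}_\text{DSAC} = \mathbf{h}_j$, where $j \sim \frac{\exp(s(\mathbf{h}_j))}{\sum_k \exp(s(\mathbf{h}_k))} = P(j \mid \mathbf{w})$
Differentiation via the expected task loss:
$\frac{\partial}{\partial \mathbf{w}} \mathbb{E}_{j \sim P(j \mid \mathbf{w})}\left[\ell(\mathbf{h}_j, \mathbf{h}^*)\right] = \mathbb{E}_{j \sim P(j \mid \mathbf{w})}\left[\ell(\mathbf{h}_j, \mathbf{h}^*) \frac{\partial}{\partial \mathbf{w}} \log P(j \mid \mathbf{w}) + \frac{\partial}{\partial \mathbf{w}} \ell(\mathbf{h}_j, \mathbf{h}^*)\right]$
• First term: derivative of the selection probability
• Second term: derivative of the task loss
[Figure: the learnable parameters $\mathbf{w}$ appear in the Scene Coordinate Regression and Scoring stages]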
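The gradient of the expected task loss follows from the product rule and the identity $\frac{\partial P}{\partial \mathbf{w}} = P \frac{\partial}{\partial \mathbf{w}} \log P$. A short derivation, assuming a finite set of hypotheses and writing $\ell_j$ for $\ell(\mathbf{h}_j, \mathbf{h}^*)$ (a shorthand introduced here, not used on the slide):

```latex
\begin{align*}
\frac{\partial}{\partial \mathbf{w}} \mathbb{E}_{j \sim P(j \mid \mathbf{w})}\left[\ell_j\right]
  &= \frac{\partial}{\partial \mathbf{w}} \sum_j P(j \mid \mathbf{w})\, \ell_j \\
  &= \sum_j \left[ \ell_j \frac{\partial P(j \mid \mathbf{w})}{\partial \mathbf{w}}
     + P(j \mid \mathbf{w}) \frac{\partial \ell_j}{\partial \mathbf{w}} \right] \\
  &= \sum_j P(j \mid \mathbf{w}) \left[ \ell_j \frac{\partial}{\partial \mathbf{w}} \log P(j \mid \mathbf{w})
     + \frac{\partial \ell_j}{\partial \mathbf{w}} \right] \\
  &= \mathbb{E}_{j \sim P(j \mid \mathbf{w})}\left[ \ell_j \frac{\partial}{\partial \mathbf{w}} \log P(j \mid \mathbf{w})
     + \frac{\partial \ell_j}{\partial \mathbf{w}} \right]
\end{align*}
```

Because the result is again an expectation over $j$, it can be approximated by sampling hypotheses from $P(j \mid \mathbf{w})$.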
Accuracy on 7-Scenes [Sho13]

Method                      Error < 5cm, 5°
Sparse Features [Sho13]     38.6%
Brachmann et al. [Bra16]    55.2%
Ours [Bra17], RANSAC        61.0% (not end-to-end)
Ours [Bra17], Soft argmax   57.8% (end-to-end)
Ours [Bra17], DSAC          66.2% (end-to-end)

[Sho13] “Scene Coordinate Regression Forests for Camera Relocalization in RGB-D Images”, Shotton et al., CVPR’13
[Bra16] “Uncertainty-Driven 6D Pose Estimation of Objects and Scenes from a Single RGB Image”, Brachmann et al., CVPR’16
[Bra17] “Learning to Predict Dense Correspondences For 6D Pose Estimation”, Brachmann, Thesis, 2017
Side Comment: Learning Scores
• [Sho13] Inlier Count: $s(\mathbf{h}) = \sum_i \mathbb{1}(\tau - r_i(\mathbf{h}, \mathbf{w}))$, not differentiable
• [Bra18] Soft Inlier Count: $s(\mathbf{h}) = \sum_i \text{sig}(\tau - \beta r_i(\mathbf{h}, \mathbf{w}))$, differentiable
• [Bra17] learned $s(\mathbf{h})$ (Score Regression): hard to regularize, overfits
[Figure: Training Set (2 Images Total) and a Test Image; Estimated Camera Poses and 3D Model Overlay for Estimation with Learned Score [Bra17] and Estimation with Soft Inlier Count [Bra18]]
[Sho13] “Scene Coordinate Regression Forests for Camera Relocalization in RGB-D Images”, Shotton et al., CVPR’13
[Bra17] “DSAC – Differentiable RANSAC for Camera Localization”, Brachmann et al., CVPR’17
[Bra18] “Learning Less is More - 6D Camera Localization via 3D Surface Regression”, Brachmann and Rother, CVPR’18
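The difference between the hard and the soft inlier count is easy to see in code. A minimal NumPy sketch that follows the two formulas as written on the slide, assuming the reprojection residuals $r_i$ are already given as an array; the values of `tau` and `beta` in the usage note are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def inlier_count(residuals, tau):
    """Hard count: step function of (tau - r_i); piecewise constant in r_i."""
    r = np.asarray(residuals, dtype=float)
    return float(np.sum(tau - r > 0))

def soft_inlier_count(residuals, tau, beta):
    """Soft count: sigmoid of (tau - beta * r_i); smooth in r_i."""
    r = np.asarray(residuals, dtype=float)
    return float(np.sum(sigmoid(tau - beta * r)))
```

For residuals `[1, 5, 20]` and `tau = 10` the hard count is 2 and stays 2 under small changes of the residuals, so it provides no gradient. The soft count changes smoothly with every residual, which is what allows gradients to flow from the score back to the scene coordinates.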
Accuracy on 7-Scenes [Sho13]

Method                      Error < 5cm, 5°
Sparse Features [Sho13]     38.6%
Brachmann et al. [Bra16]    55.2%
Ours [Bra17], RANSAC        61.0% (not end-to-end)
Ours [Bra17], Soft argmax   57.8% (end-to-end)
Ours [Bra17], DSAC          66.2% (end-to-end)
DSAC++ [Bra18]              76.1% (end-to-end)

[Sho13] “Scene Coordinate Regression Forests for Camera Relocalization in RGB-D Images”, Shotton et al., CVPR’13
[Bra16] “Uncertainty-Driven 6D Pose Estimation of Objects and Scenes from a Single RGB Image”, Brachmann et al., CVPR’16
[Bra17] “Learning to Predict Dense Correspondences For 6D Pose Estimation”, Brachmann, Thesis, 2017
[Bra18] “Learning Less is More - 6D Camera Localization via 3D Surface Regression”, Brachmann and Rother, CVPR’18
Scene Coordinate Regression Nets
• Advantages:
  • Fast
  • End-To-End Training
  • Best Results
• Disadvantages:
  • Needs 3D Model for Training
For comparison, Scene Coordinate Regression Forests:
• Advantages:
  • Fast
  • Good Results
• Disadvantages:
  • Needs 3D Model for Training
  • No End-To-End Training
Training without a 3D Model
• End-to-end training needs no 3D model but good initialization
• [Bra17] Initialization: Minimize 3D distance to ground truth scene coordinates: $\min \sum_i \|\mathbf{y}_i(\mathbf{w}) - \mathbf{y}_i^*\|$
• [Bra18] Initialization: Minimize 2D reprojection error: $\min \sum_i \|\pi(\mathbf{y}_i(\mathbf{w}), \mathbf{h}^*) - \mathbf{p}_i\|$
[Bra17] “DSAC – Differentiable RANSAC for Camera Localization”, Brachmann et al., CVPR’17
[Bra18] “Learning Less is More - 6D Camera Localization via 3D Surface Regression”, Brachmann and Rother, CVPR’18
Accuracy on 7-Scenes [Sho13]

Method                      Error < 5cm, 5°
Sparse Features [Sho13]     38.6%
Brachmann et al. [Bra16]    55.2%
Ours [Bra17], RANSAC        61.0% (not end-to-end)
Ours [Bra17], Soft argmax   57.8% (end-to-end)
Ours [Bra17], DSAC          66.2% (end-to-end)
DSAC++ [Bra18]              76.1% (end-to-end, with 3D model)
DSAC++ (w/o M) [Bra18]      60.4% (end-to-end, w/o 3D model)

[Sho13] “Scene Coordinate Regression Forests for Camera Relocalization in RGB-D Images”, Shotton et al., CVPR’13
[Bra16] “Uncertainty-Driven 6D Pose Estimation of Objects and Scenes from a Single RGB Image”, Brachmann et al., CVPR’16
[Bra17] “Learning to Predict Dense Correspondences For 6D Pose Estimation”, Brachmann, Thesis, 2017
[Bra18] “Learning Less is More - 6D Camera Localization via 3D Surface Regression”, Brachmann and Rother, CVPR’18
Conclusion
Pose Regression
• fast
• coarse estimates
Scene Coordinate Regression
• with RFs: fast and accurate
• with CNNs: best accuracy, no 3D models
Learning pose estimation works! High accuracy possible.
Learning pose estimation is fun! Interesting mixture of learning and geometry.
Code of many methods online:
• PoseNet: http://mi.eng.cam.ac.uk/projects/relocalisation/
• MapNet: https://research.nvidia.com/publication/2018-06_Geometry-Aware-Learning-of
• SCoRF [Bra16]: https://hci.iwr.uni-heidelberg.de/vislearn/research/scene-understanding/pose-estimation/#CVPR16
• VLocNet: http://deeploc.cs.uni-freiburg.de/
• DSAC: https://github.com/cvlab-dresden/DSAC
• DSAC++: https://github.com/vislearn/LessMore