Robust Learning-based Image Matching
IMW CVPR 2019
Dawei Sun Zixin Luo Jiahui Zhang
00 Outline
An image matching pipeline: 1) local keypoint detection, 2) local keypoint description, 3) sparse matching.
Part 1
ContextDesc: a learning-based local descriptor
(ContextDesc: Local Descriptor Augmentation with Cross-Modality Context, CVPR'19)
00 Outline
Part 2
A learning-based inlier classification and fundamental matrix estimation method
(In submission)
01 Motivation
[Figure: reference patches (Ref.), nearest-neighbor matches (NN), and ground-truth matches (GT)]
The results are becoming saturated on standard benchmarks.
*Samples are from the HPatches dataset.
01 Motivation
[Figure: reference patches, nearest-neighbor matches, and ground-truth matches; some nearest-neighbor matches are similar but incorrect]
*GeoDesc is used for feature description.
01 Motivation
[Figure: reference patches, nearest-neighbor matches, and ground-truth matches]
Are patches locally distinguishable by visual appearance alone, e.g., under repetitive patterns? Often they are indistinguishable.
01 Motivation
More visual context!
• How to construct a network that represents both local details and richer context?
• Construct feature pyramids? Too costly for this low-level task.
01 Motivation
*Keypoints are derived from SIFT.
Keypoint distribution reveals meaningful scene structure
01 Motivation
Coarse matches can be established even without color information.
Keypoint distribution reveals meaningful scene structure: keypoints are designed to be repeatable in the same underlying scene.
01 Motivation
Encoding geometric context from the keypoint distribution of an individual image
• Keypoints are irregular and unordered – how to construct a proper encoder?
• Keypoints are not perfectly repeatable – how to acquire strong invariance to different image variations?
01 Motivation
Visual context
• Incorporate high-level visual information.
• Resort to the regional representation often used in image retrieval (one forward pass over the entire image).
Geometric context
• Geometric cues from the keypoint distribution.
• Resort to a PointNet-like architecture to process 2D point sets.
01 Motivation
A unified framework: cross-modality local descriptor augmentation
Based on off-the-shelf descriptors…
02 Methods
[Architecture overview]
Preparation:
• Patch sampler: crops K×32×32 patches from the H×W×3 image at the K×4 keypoints.
• Local feature extractor: produces K×128 raw local features from the patches.
• Regional feature extractor: a ResNet-50 maps the image to H/32 × W/32 × 2048 regional features.
Augmentation:
• Geometric context encoder: processes the K×2 keypoint locations with MLPs w/ context normalization (CN); a matchability predictor (residual units 0–3 w/ CN, plus a perceptron with tanh, K×1 output) is attached, yielding K×128 geometric context.
• Visual context encoder: interpolates the regional features on the sampling grid at the keypoints and applies an MLP, yielding K×128 visual context.
Aggregation:
• Raw local features, geometric context, and visual context are combined via channel-wise concatenation (C) and element-wise summation into l2-normalized K×128 descriptors, trained with a quad loss and an N-pair loss.
02 Methods – Preparation
Inputs, based on off-the-shelf components:
• Keypoint locations (K×2), e.g., from SIFT.
• Raw local features (K×128), e.g., from GeoDesc, computed on K×32×32 patches.
• Regional features (H/32 × W/32 × 2048), from an off-the-shelf image retrieval model (ResNet-50).
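As a rough illustration of the patch sampler, the sketch below crops fixed-size patches at integer keypoint locations with numpy. The function `sample_patches` and its arguments are hypothetical; the actual sampler also accounts for the scale and orientation carried in the K×4 keypoint parameterization.

```python
import numpy as np

def sample_patches(image, keypoints, size=32):
    """Crop size x size patches centered at integer keypoint locations.

    A simplified sketch: only the (x, y) part of each keypoint is used here.
    """
    half = size // 2
    # Pad so patches near the border stay in bounds.
    padded = np.pad(image, ((half, half), (half, half)), mode="reflect")
    patches = []
    for x, y in keypoints.astype(int):
        patches.append(padded[y:y + size, x:x + size])
    return np.stack(patches)  # K x 32 x 32

img = np.random.rand(240, 320)                    # grayscale H x W image
kpts = np.array([[10.0, 20.0], [300.0, 100.0]])   # K x 2 (x, y) locations
patches = sample_patches(img, kpts)
print(patches.shape)  # (2, 32, 32)
```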
02 Methods – Augmentation: geometric context encoder
A variant of PointNet: maps the K×2 keypoint locations to K×128 geometric context.
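The key property of a PointNet-style encoder is that one shared per-point transform is applied to every keypoint, so the output is independent of input ordering. A minimal sketch, with hypothetical layer sizes (2 → 64 → 128) mirroring the K×2 → K×128 shapes above:

```python
import numpy as np

rng = np.random.default_rng(0)

def shared_mlp(points, weights, biases):
    """PointNet-style shared MLP: the same per-point transform is applied
    to every point, making the encoder permutation-equivariant."""
    x = points
    for W, b in zip(weights, biases):
        x = np.maximum(x @ W + b, 0.0)  # per-point linear layer + ReLU
    return x

K = 100
kpts = rng.random((K, 2))               # K x 2 normalized keypoint locations
dims = [2, 64, 128]                     # hypothetical layer sizes
Ws = [rng.standard_normal((dims[i], dims[i + 1])) * 0.1 for i in range(2)]
bs = [np.zeros(d) for d in dims[1:]]

geo_context = shared_mlp(kpts, Ws, bs)  # K x 128 geometric context
print(geo_context.shape)  # (100, 128)
```

Permuting the input points permutes the output rows identically, which is why irregular, unordered keypoint sets pose no problem for this encoder.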
02 Methods – Augmentation: matchability predictor
The matchability predictor (residual units 0–3 w/ context normalization, followed by a perceptron with tanh) maps the K×128 raw local features to a K×1 matchability score in [-1.0, +1.0] per keypoint; concatenated with the K×2 keypoint locations, this yields the K×3 input to the geometric context encoder.
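Context normalization (CN), introduced by Yi et al. (CVPR'18), normalizes each feature channel across all K points of one image: effectively instance normalization over the point axis, injecting global context while preserving permutation equivariance. A minimal sketch:

```python
import numpy as np

def context_norm(features, eps=1e-8):
    """Context normalization: per-channel normalization across the whole
    point set (instance normalization over the K axis)."""
    mean = features.mean(axis=0, keepdims=True)   # 1 x C
    std = features.std(axis=0, keepdims=True)     # 1 x C
    return (features - mean) / (std + eps)

rng = np.random.default_rng(1)
feats = rng.standard_normal((50, 128))            # K x C point features
normed = context_norm(feats)

# Each channel now has zero mean (and near-unit variance) across the points,
# so every point's feature is expressed relative to the whole set.
print(normed.shape)  # (50, 128)
```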
02 Methods – Augmentation: visual context encoder
The regional features (H/32 × W/32 × 2048) are interpolated on the sampling grid at the K×2 keypoint locations, then mapped by an MLP to K×128 visual context.
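Because the regional feature map is 32× coarser than the image, each keypoint falls at a sub-pixel location in it; the visual context is read off by bilinear interpolation. A sketch with a hypothetical `interpolate_features` helper (keypoints given directly in feature-map coordinates):

```python
import numpy as np

def interpolate_features(feat_map, keypoints):
    """Bilinearly sample a coarse feature map (h x w x C) at sub-pixel
    keypoint locations (x, y) given in feature-map coordinates."""
    h, w, _ = feat_map.shape
    x = np.clip(keypoints[:, 0], 0, w - 1 - 1e-6)
    y = np.clip(keypoints[:, 1], 0, h - 1 - 1e-6)
    x0, y0 = x.astype(int), y.astype(int)
    x1, y1 = np.minimum(x0 + 1, w - 1), np.minimum(y0 + 1, h - 1)
    wx, wy = x - x0, y - y0
    top = (1 - wx)[:, None] * feat_map[y0, x0] + wx[:, None] * feat_map[y0, x1]
    bot = (1 - wx)[:, None] * feat_map[y1, x0] + wx[:, None] * feat_map[y1, x1]
    return (1 - wy)[:, None] * top + wy[:, None] * bot   # K x C

rng = np.random.default_rng(2)
regional = rng.random((8, 10, 2048))          # H/32 x W/32 x 2048 features
kpts = np.array([[3.5, 2.25], [0.0, 0.0]])    # K x 2, feature-map coordinates
vis = interpolate_features(regional, kpts)
print(vis.shape)  # (2, 2048)
```

Sampling at exact grid positions (like the second keypoint) returns the stored feature unchanged.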
02 Methods – Aggregation
The three K×128 streams (raw local feature, geometric context, visual context) are fused by element-wise summation and l2-normalized into the final 128-d augmented descriptor, trained with a quad loss and an N-pair loss.
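The fusion step above can be sketched in a few lines; because the summation keeps the descriptor at 128-d, the augmented descriptor stays drop-in compatible with the raw one (the values here are random placeholders for the three streams):

```python
import numpy as np

def l2_normalize(x, eps=1e-10):
    """Row-wise l2 normalization of a K x D descriptor matrix."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

rng = np.random.default_rng(3)
K = 100
raw = rng.standard_normal((K, 128))   # raw local features
geo = rng.standard_normal((K, 128))   # geometric context
vis = rng.standard_normal((K, 128))   # visual context

# Element-wise summation fuses the streams without growing the dimension.
augmented = l2_normalize(raw + geo + vis)
print(augmented.shape)  # (100, 128)
```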
03 Ablations
• Improvements from visual context
• Improvements from geometric context
• Improvements from cross-modality context
04 Evaluations
SUN3D: indoor scenes
04 Evaluations
YFCC: outdoor scenes
04 Evaluations
Oxford dataset: different image variations
04 Evaluations
3D reconstruction benchmark
04 Evaluations
[Qualitative matches under illumination change, scale or rotation change, perspective change, and indoor scenes; rows: SIFT, GeoDesc, Ours]
04 Evaluations
With pose accuracy as the evaluation metric: consistent improvement, but less significant.
After obtaining sufficient matches, what is the next bottleneck for improving image matching?
05 Sparse matching
1) Establish putative matches (nearest-neighbor search/FLANN)
2) Outlier rejection (ratio test/mutual check/GMS)
3) Geometry computation (5-point/8-point algorithm with RANSAC)
4) Non-linear optimization for refinement
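Steps 1–2 above can be sketched with a nearest-neighbor search plus Lowe's ratio test, which keeps a match only if the best candidate is clearly closer than the runner-up. A minimal numpy version (the function name and the 0.8 threshold are illustrative, not the pipeline's exact settings):

```python
import numpy as np

def ratio_test_matches(desc1, desc2, ratio=0.8):
    """Nearest-neighbor matching with Lowe's ratio test: accept a match only
    if the nearest neighbor is clearly closer than the second nearest."""
    # Pairwise Euclidean distances, N1 x N2.
    d = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=2)
    nn = np.argsort(d, axis=1)
    rows = np.arange(len(d))
    best, second = d[rows, nn[:, 0]], d[rows, nn[:, 1]]
    keep = best < ratio * second
    return np.stack([np.nonzero(keep)[0], nn[keep, 0]], axis=1)  # index pairs

rng = np.random.default_rng(4)
desc2 = rng.standard_normal((200, 128))
desc1 = desc2[:50] + 0.05 * rng.standard_normal((50, 128))  # noisy copies
matches = ratio_test_matches(desc1, desc2)
print(matches.shape)  # (50, 2): each noisy copy matches its source
```

A real pipeline would replace the brute-force distance matrix with FLANN for speed.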
05 Sparse matching
1) Establish putative matches (nearest-neighbor search/FLANN)
2) Outlier rejection (ratio test/mutual check/GMS)
3) Geometry computation (5-point/8-point algorithm with RANSAC)
4) Non-linear optimization for refinement
The later steps can be made learning-based.*
*Yi et al.: Learning to find good correspondences, CVPR'18.
05 Sparse matching
Given putative matches N×4, where each row vector denotes a correspondence (x, y, x′, y′) of an image pair,
the network predicts an N×1 probability vector indicating whether each correspondence is an inlier.
Only inlier matches (and their confidences) are used for computing the two-view geometry.
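One way the predicted confidences can enter the geometry computation is by weighting the rows of the 8-point algorithm's linear system, so likely outliers contribute little to the fundamental matrix. The sketch below is a simplified illustration (coordinate normalization, as in the standard normalized 8-point algorithm, is omitted), not the exact formulation of the submitted method:

```python
import numpy as np

def weighted_eight_point(pts1, pts2, w):
    """Weighted 8-point algorithm: each correspondence's epipolar-constraint
    row is scaled by its predicted inlier weight before solving by SVD."""
    x, y = pts1[:, 0], pts1[:, 1]
    xp, yp = pts2[:, 0], pts2[:, 1]
    # Rows of the constraint x2^T F x1 = 0, with F flattened row-major.
    A = np.stack([xp * x, xp * y, xp, yp * x, yp * y, yp, x, y,
                  np.ones_like(x)], axis=1)
    A = w[:, None] * A                       # down-weight likely outliers
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    # Enforce the rank-2 constraint on F.
    U, S, Vt = np.linalg.svd(F)
    S[2] = 0.0
    return U @ np.diag(S) @ Vt

rng = np.random.default_rng(5)
pts1 = rng.random((20, 2))
pts2 = rng.random((20, 2))
w = rng.random(20)                           # e.g., predicted inlier scores
F = weighted_eight_point(pts1, pts2, w)
print(F.shape)  # (3, 3)
```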
05 Sparse matching
[Qualitative comparison: no outlier rejection vs. mutual check vs. the proposed learning-based method]
Why is it important?
05 Sparse matching
Previous method*
• Adopts a PointNet-like architecture.
• Applies context normalization (instance normalization) on the entire point set to capture global context.
What it misses: local context, e.g., piece-wise smoothness (as exploited by the GMS matcher).
*Yi et al.: Learning to find good correspondences, CVPR'18.
*Bian et al.: GMS: Grid-based Motion Statistics for Fast, Ultra-robust Feature Correspondence, CVPR'17.
05 Sparse matching
Previous method
• Adopts a PointNet-like architecture.
• Applies context normalization (instance normalization) on the entire point set to capture global context.
Proposed
• Learns to establish neighboring relations on unordered, non-Euclidean correspondence sets.
• Builds a hierarchical architecture to capture both global and local context.
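One simple way to establish neighboring relations on an unordered correspondence set (not necessarily the submitted method's exact construction) is k-nearest-neighbor grouping in the 4D correspondence space, which gives each correspondence a local neighborhood on which context such as piece-wise smoothness can be assessed:

```python
import numpy as np

def knn_groups(corr, k=8):
    """Group each correspondence with its k nearest neighbors in the 4D
    correspondence space (x, y, x', y'), yielding local neighborhoods."""
    # Pairwise distances between correspondences, N x N.
    d = np.linalg.norm(corr[:, None, :] - corr[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)               # exclude self-matches
    idx = np.argsort(d, axis=1)[:, :k]        # N x k neighbor indices
    return corr[idx]                          # N x k x 4 local neighborhoods

rng = np.random.default_rng(6)
corr = rng.random((100, 4))                   # N putative correspondences
groups = knn_groups(corr)
print(groups.shape)  # (100, 8, 4)
```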
05 Sparse matching
[Figure: qualitative comparison between the previous method and the proposed method]
06 Future work
Evaluation metrics
• HPatches (patch verification/matching/retrieval): does it reflect performance in real applications? [1]
• Two-view image matching on pose recovery accuracy: does it involve other variables, such as RANSAC-based algorithms? [2]
• 3D reconstruction metrics: do they involve even more variables, such as image retrieval, SfM, or bundle adjustment? [3]
• The ground truth is often obtained from SfM with a traditional matching pipeline.
Keypoint detection
• Challenging for learning-based methods – is there a clear definition to use as supervision?
• Pixel-wise, even sub-pixel, accuracy – how to preserve low-level details after multiple convolutions?
• Combine the advantages of both human priors and learned priors.
[1] Balntas et al.: HPatches: A benchmark and evaluation of handcrafted and learned local descriptors, CVPR'17.
[2] Yi et al.: Learning to find good correspondences, CVPR'18.
[3] Schönberger et al.: Comparative Evaluation of Hand-Crafted and Learned Local Features, CVPR'17.
Thanks!
Code available at: https://github.com/lzx551402/contextdesc