
Crafting and learning for image matching

presented by Dmytro Mishkin

joint work with

Anastasia Mishchuk, Milan Pultar, Filip Radenovic, Daniel Barath, Michal Perdoch, Jiri Matas

2019.07.06, Odesa, EECVC

What is image matching?

• The task is to find correspondences between pixels in two images and/or the geometric relation between the camera poses

• A special version of it is also known as “wide baseline stereo”:

• large change in viewpoint, illumination, time & occlusions, modality


Where is image matching useful? 3D reconstruction & SLAM

R. Mur-Artal, and J. D. Tardós. ORB-SLAM2: an Open-Source SLAM System for Monocular, Stereo and RGB-D Cameras, arXiv 2016

J. L. Schönberger and J.-M. Frahm, Structure-from-Motion Revisited, CVPR 2016 (COLMAP)


Daniel DeTone, Tomasz Malisiewicz, Andrew Rabinovich (Magic Leap SLAM)

Image retrieval 2.0: F. Radenovic, series of works


Google Landmark Retrieval challenge 2019 winner

Where is image matching useful? Image retrieval

What is NOT the topic of my talk

• Semantic correspondences (the object/scene is not the same)


Images from Aberman et al. “Neural Best-Buddies: Sparse Cross-Domain Correspondence”, SIGGRAPH 2018

What is NOT the topic of my talk

• Short baseline stereo (wait for Anastasiia Mishchuk's talk at 17:20)

• Optimization methods for stereo; see talks from previous EECVCs:

Alexander Shekhovtsov (2017) and Tolga Birdal (2018)


Wide baseline stereo pipeline


[Pipeline diagram: Detector → Measurement region selector → Descriptor → Matching → Geometric verification (RANSAC)]

Image credit: Andrea Vedaldi, ICCVW 2017

Single feature visualization

Toy example for illustration: matching with OpenCV SIFT

Try yourself: https://github.com/ducha-aiki/matching-strategies-comparison


Recovered 1st-to-2nd image projection, ground-truth 1st-to-2nd projection, and inlier correspondences

Geometric verification (RANSAC)



Homography: planar surface/static camera


Image credit: forums.fast.ai

Planar surface or static camera → use homography
Image with a dominant plane → use homography
Not sure what to use? → try homography first.

Fundamental matrix: general two-view case


• General two-view geometry in a static scene: a corresponding point lies somewhere on a line in the other image; where on the line depends on the (unknown) depth

• Weaker constraint than homography

• Still rigid (no motion in scene assumed)

Image credit: https://en.wikipedia.org/wiki/Epipolar_geometry

RANSAC: fitting the data with gross outliers

• What is it?


Image credit: https://scipy-cookbook.readthedocs.io/items/RANSAC.html

OpenCV functions: cv2.findHomography(), cv2.findFundamentalMat()

We will soon publish a Python package which is 2-5 times faster and has additional tricks inside

https://github.com/ducha-aiki/pyransac (save this link,the repo is private now, will clean-up and open next week)
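To make the loop concrete, here is a minimal numpy sketch of vanilla RANSAC on the scipy-cookbook-style line-fitting toy problem (illustrative only: not the CMP/pyransac implementation, and `ransac_line` with all its constants is made up for this example):

```python
import numpy as np

def ransac_line(pts, iters=500, thresh=0.1, seed=0):
    """Vanilla RANSAC: fit y = a*x + b to 2D points with gross outliers."""
    rng = np.random.default_rng(seed)
    best_model, best_inl = None, np.zeros(len(pts), dtype=bool)
    for _ in range(iters):
        i, j = rng.choice(len(pts), size=2, replace=False)
        (x1, y1), (x2, y2) = pts[i], pts[j]
        if x1 == x2:                      # degenerate minimal sample
            continue
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        resid = np.abs(pts[:, 1] - (a * pts[:, 0] + b))
        inl = resid < thresh
        if inl.sum() > best_inl.sum():    # keep the model with most inliers
            best_model, best_inl = (a, b), inl
    return best_model, best_inl

rng = np.random.default_rng(42)
x = rng.uniform(0.0, 1.0, 80)
good = np.stack([x, 2.0 * x + 1.0], axis=1)   # 80 points on y = 2x + 1
bad = rng.uniform(-5.0, 5.0, (20, 2))         # 20 gross outliers
pts = np.vstack([good, bad])
(a, b), inliers = ransac_line(pts)
```

A least-squares fit on the same data would be pulled away by the outliers; the RANSAC loop recovers the line because the minimal sample that scores best touches only inliers.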

Pitfalls and solutions: homography

• Wrong geometry case because of dirty correspondences


OpenCV finds 31 wrong inliers in 0.018s.

CMP RANSAC finds 6 wrong inliers in 0.004s.

See the same pattern in img1: 3 correspondences on a line + a group. RANSAC-H is prone to such cases

Pitfalls and solutions: homography

• Solution: 2-way error metric


CMP RANSAC + transfer check: finds 48 correct inliers in 0.005s.

H(2→1) = H(1→2)⁻¹

Check that the number of inliers is consistent in the opposite direction

Python CMP RANSAC package will be available soon

D. Mishkin, J. Matas and M. Perdoch. MODS: Fast and Robust Method for Two-View Matching, CVIU 2015
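The two-way transfer check can be sketched in numpy (a toy similarity transform and made-up helper names, not the actual CMP RANSAC code):

```python
import numpy as np

def reproj_err(H, src, dst):
    """Map src points through homography H and measure distance to dst."""
    src_h = np.hstack([src, np.ones((len(src), 1))])
    proj = src_h @ H.T
    proj = proj[:, :2] / proj[:, 2:3]
    return np.linalg.norm(proj - dst, axis=1)

def two_way_inliers(H12, pts1, pts2, thresh=2.0):
    """Transfer check: a correspondence counts as an inlier only if it is
    consistent in BOTH directions, using H(2->1) = inv(H(1->2))."""
    fwd = reproj_err(H12, pts1, pts2) < thresh
    bwd = reproj_err(np.linalg.inv(H12), pts2, pts1) < thresh
    return fwd & bwd

# Toy scene: scale-by-2 + shift, four true matches and one gross mismatch
H = np.array([[2.0, 0.0, 10.0],
              [0.0, 2.0, -5.0],
              [0.0, 0.0, 1.0]])
pts1 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [3.0, 3.0]])
pts2 = np.array([[10.0, -5.0], [12.0, -5.0], [10.0, -3.0], [12.0, -3.0], [0.0, 0.0]])
mask = two_way_inliers(H, pts1, pts2)
```

The mismatched last pair fails the check in both directions and is dropped.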

Pitfalls and solutions: fundamental matrix

• F is too permissive (point to line)


Pitfalls and solutions: fundamental matrix

• LAF-check: remember that a local feature is an oriented circle or ellipse, not just a point.

• Check whether additional points on the circle are consistent with the geometry


D. Mishkin, J. Matas and M. Perdoch. MODS: Fast and Robust Method for Two-View Matching, CVIU 2015
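A minimal sketch of the idea, using a toy fundamental matrix for a purely sideways camera shift (so epipolar lines are the image rows); `laf_check` and the 3-point frames are illustrative, not the MODS API:

```python
import numpy as np

def epi_dist(F, x1, x2):
    """Distance of point x2 from the epipolar line F @ x1 (points are (x, y))."""
    l = F @ np.append(x1, 1.0)
    return abs(l @ np.append(x2, 1.0)) / np.hypot(l[0], l[1])

def laf_check(F, laf1, laf2, thresh=1.0):
    """LAF-check: a local affine frame gives THREE point pairs (the center
    plus two points on the ellipse); require ALL of them to satisfy the
    epipolar constraint, not just the center."""
    return all(epi_dist(F, p1, p2) < thresh for p1, p2 in zip(laf1, laf2))

# F for a sideways translation: corresponding points must share the same y
F = np.array([[0.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]])
laf1 = np.array([[5.0, 5.0], [7.0, 5.0], [5.0, 7.0]])
laf2_good = laf1 + np.array([3.0, 0.0])                     # consistent shift
laf2_bad = np.array([[8.0, 5.0], [8.0, 7.0], [6.0, 5.0]])   # center OK, frame wrong
```

The bad frame passes a plain point-wise epipolar test on its center but is rejected once the two extra LAF points are checked.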

Matching strategies



Nearest neighbor (1NN) strategy


Features from img1 are matched to features from img2

You can see that it is asymmetric and allows “many-to-one” matches

Nearest neighbor (NN) strategy


OpenCV RANSAC failed to find a good model with NN matching. Found 1st image projection: blue, ground truth: green, inlier correspondences: yellow

Mutual nearest neighbor (MNN) strategy


Features from img1 are matched to features from img2. Only cross-consistent (mutual NN) matches are retained.

Mutual nearest neighbor (MNN) strategy


OpenCV RANSAC failed to find a good model with MNN matching. No one-to-many connections, but still bad. Found 1st image projection: blue, ground truth: green, inlier correspondences: yellow
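In numpy, mutual NN matching is a few lines (brute-force distances on toy descriptors; a sketch, not the OpenCV matcher):

```python
import numpy as np

def mutual_nn(d1, d2):
    """Mutual nearest-neighbour matching of descriptor arrays.
    Keeps pair (i, j) only if j is the NN of d1[i] in d2 AND
    i is the NN of d2[j] in d1 (no many-to-one matches survive)."""
    dist = np.linalg.norm(d1[:, None, :] - d2[None, :, :], axis=2)
    nn12 = dist.argmin(axis=1)           # best img2 match per img1 feature
    nn21 = dist.argmin(axis=0)           # best img1 match per img2 feature
    idx1 = np.arange(len(d1))
    keep = nn21[nn12] == idx1            # cross-consistency check
    return np.stack([idx1[keep], nn12[keep]], axis=1)

# Toy descriptors: d1[0]<->d2[0], d1[1]<->d2[1]; d1[2] has no true match
d1 = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
d2 = np.array([[0.0, 0.1], [1.0, 0.1]])
matches = mutual_nn(d1, d2)
```

Plain 1NN would also pair the unmatched feature d1[2] with d2[1] (a many-to-one match); the mutual check removes it.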

Second nearest neighbor ratio (SNN) strategy


[Diagram: 1stNN vs 2ndNN distances; 1stNN/2ndNN > 0.8 → drop, 1stNN/2ndNN < 0.8 → keep]

Features from img1 are matched to features from img2; we look for the 2 nearest neighbors:

- If both are too similar (1stNN/2ndNN ratio > 0.8) → discard

- If the 1st NN is much closer (1stNN/2ndNN ratio ≤ 0.8) → keep

D. Lowe, Distinctive Image Features from Scale-Invariant Keypoints, IJCV 2004
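A numpy sketch of the ratio test (toy descriptors, threshold 0.8 as on the slide):

```python
import numpy as np

def snn_matches(d1, d2, ratio=0.8):
    """Second-nearest-neighbour (SNN / Lowe) ratio test: keep feature i's
    nearest neighbour in d2 only if dist(1stNN) / dist(2ndNN) <= ratio."""
    dist = np.linalg.norm(d1[:, None, :] - d2[None, :, :], axis=2)
    order = dist.argsort(axis=1)
    rows = np.arange(len(d1))
    r = dist[rows, order[:, 0]] / dist[rows, order[:, 1]]
    keep = r <= ratio
    return np.stack([rows[keep], order[keep, 0]], axis=1)

# d1[0] has one clearly best match; d1[1] is ambiguous (two near-equal NNs)
d1 = np.array([[0.0, 0.0], [2.0, 2.0]])
d2 = np.array([[0.1, 0.0], [1.0, 0.0], [2.0, 2.1], [2.1, 2.0]])
matches = snn_matches(d1, d2)
```

The ambiguous feature d1[1] is discarded because its two nearest neighbours are almost equally close (ratio ≈ 1).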



OpenCV RANSAC found a roughly correct model with SNN matching. Found 1st image projection: blue, ground truth: green, inlier correspondences: yellow

Second nearest neighbor ratio (SNN) strategy

1st geometrically inconsistent nearest neighbor ratio (FGINN) strategy


The SNN ratio is cool, but what about symmetric structures or too-closely detected features? The ratio test will kill them. Solution: use a 2nd nearest neighbor which is far enough from the 1st nearest.

Mishkin et al., “MODS: Fast and Robust Method for Two-View Matching”, CVIU 2015

1st geometrically inconsistent nearest neighbor ratio (FGINN) strategy


Mishkin et al., “MODS: Fast and Robust Method for Two-View Matching”, CVIU 2015
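A numpy sketch of FGINN (brute force; `min_dist` is an illustrative spatial threshold, not necessarily the MODS default):

```python
import numpy as np

def fginn_matches(d1, d2, kps2, ratio=0.8, min_dist=10.0):
    """FGINN: like the SNN ratio test, but the 'second nearest' used in the
    ratio is the first GEOMETRICALLY INCONSISTENT neighbour, i.e. the
    closest descriptor whose keypoint lies at least min_dist pixels from
    the 1st NN's keypoint. Near-duplicate detections of the same point
    then no longer kill the match."""
    dist = np.linalg.norm(d1[:, None, :] - d2[None, :, :], axis=2)
    matches = []
    for i in range(len(d1)):
        order = dist[i].argsort()
        nn1 = order[0]
        far = [j for j in order[1:]
               if np.linalg.norm(kps2[j] - kps2[nn1]) >= min_dist]
        if not far or dist[i, nn1] / dist[i, far[0]] <= ratio:
            matches.append((i, int(nn1)))
    return matches

# d2[0] and d2[1]: a double detection of the same point (close keypoints,
# near-identical descriptors). Plain SNN ratio 0.10/0.12 > 0.8 would drop it.
d1 = np.array([[0.0, 0.0]])
d2 = np.array([[0.10, 0.0], [0.12, 0.0], [5.0, 5.0]])
kps2 = np.array([[100.0, 100.0], [101.0, 100.0], [300.0, 300.0]])
matches = fginn_matches(d1, d2, kps2)
```

FGINN skips the near-duplicate neighbour when computing the ratio, so the correct match survives.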

SNN vs FGINN

Mishkin et al., “MODS: Fast and Robust Method for Two-View Matching”, CVIU 2015

SNN: roughly correct

FGINN: more correspondences, better geometry found

Symmetrical FGINN


Recall that FGINN is still asymmetric: matching (Img1 → Img2) ≠ (Img2 → Img1)

We can do both (Img1 → Img2) and (Img2 → Img1)

and keep all FGINNs (union)

or only cross-consistent FGINNs

Learned filtering strategy (CVPR 2018)

Yi et al., Learning to Find Good Correspondences, https://arxiv.org/abs/1711.05971

Input: matches (x1, y1, x2, y2) [N × 4]. Output: scores [N × 1].

Evaluation on IMW2019 data


CVPR 2019 competition

https://image-matching-workshop.github.io/

Evaluation

Stereo: features ⇨ matching ⇨ OpenCV RANSAC ⇨ pose estimation

[Plot: participants vs. organizers]

Metric: number of camera poses recovered precisely enough (mAP @ 15°)

Evaluation on IMW2019 data


• NN: are you kidding? Never use it alone
• SNN is simple and good
• FGINN is always a bit better
• Symmetric FGINN rocks
• Learning is not that powerful (yet?)

Descriptor: HardNet (NIPS 2017)

Mishchuk et al., Working hard to know your neighbor's margins: Local descriptor learning loss, NIPS 2017



• Q1: How to find correct correspondences?

• Q2: How to filter out features that do not have a correspondence?

• A1: Nearest neighbor by descriptor distance

• A2: Threshold the second-to-first nearest ratio (SNN)

HardNet: let's use it for training CNNs!


Classical way to select good correspondences


Architecture: deep, VGGNet style

• Adopted from the previous state-of-the-art L2Net descriptor, Tian et al. (CVPR 2017)

• Vanilla CNN: Convolution + BatchNorm + ReLU


Sampling: positives: random, negatives: hard-in-batch

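The hardest-in-batch sampling can be sketched in numpy (a toy version of the HardNet triplet margin loss; real training works on L2-normalized 128-D CNN descriptors, and the 2-D vectors here are purely illustrative):

```python
import numpy as np

def hardnet_loss(anchors, positives, margin=1.0):
    """Hardest-in-batch triplet margin loss, HardNet-style.
    anchors[i] and positives[i] describe the same 3D point; for row i the
    negative distance is the smallest distance to any NON-matching
    descriptor, searched over both the row and the column of the
    pairwise distance matrix."""
    d = np.linalg.norm(anchors[:, None, :] - positives[None, :, :], axis=2)
    pos = np.diag(d)                          # matching-pair distances
    off = d + 1e6 * np.eye(len(d))            # mask out matching pairs
    hardest_neg = np.minimum(off.min(axis=1), off.min(axis=0))
    return float(np.mean(np.maximum(0.0, margin + pos - hardest_neg)))

# Well-separated batch: negatives are far, so the loss is zero
a = np.array([[0.0, 0.0], [10.0, 0.0]])
p = np.array([[0.0, 1.0], [10.0, 1.0]])
loss_easy = hardnet_loss(a, p)

# Confusable batch: the two points look alike, the loss becomes positive
a2 = np.array([[0.0, 0.0], [0.5, 0.0]])
p2 = np.array([[0.0, 1.0], [0.5, 1.0]])
loss_hard = hardnet_loss(a2, p2)
```

Only the confusable batch produces gradient, which is exactly why hardest-in-batch mining is effective.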

HardNet vs SIFT descriptor

Mishkin et al., “MODS: Fast and Robust Method for Two-View Matching”, CVIU 2015

SIFT: 71 inliers

HardNet: 121 inliers


Results: HPatches

1.5 to 2 times better than RootSIFT:


GeoDesc: same architecture, with a special loss and sampling utilizing 3D reconstruction data.


HardNet training scales well with bigger datasets

Luo et al., ECCV 2018

Descriptor: creating the dataset (CVWW 2019)
Leveraging Outdoor Webcams for Local Descriptor Learning
Milan Pultar, Dmytro Mishkin, Jiří Matas



● Brown dataset ● HPatches ● PhotoSynth ● GL3D

Existing datasets for local descriptor learning


● 1,128 million images
● Almost 30K webcams
● Continuously growing
● Cameras placed all over the world
● ~10 TB of data

● Each camera in one directory
○ Split further into folders by year
○ Image timestamps in GMT
○ GPS info not always available

Archive of Many Outdoor Scenes (AMOS)


[Directory layout: Camera 1001, Camera 1002 → Images-2010, Images-2011, Images-2013]

AMOS views


Good cameras

Bad cameras

Pipeline of AMOS Patches


Camera selection

● Randomly choose 20 images from each camera
● Test each image using the criteria
● Keep the camera if 14/20 images pass

-> 474 cameras

Sky segmentation in the wild: An empirical study. R. P. Mihail et al., 2012 (https://github.com/kuangliu/torchcv)

Appearance clustering

● Solves data redundancy

● Use fc6 layer of ImageNet-pretrained AlexNet

● Run K-means in the AlexNet output space

● Choose K=120 most representative images

(by looking at the corresponding outputs)

-> 474 cameras, each with 120 images

ImageNet classification with deep convolutional neural networks. A. Krizhevsky et al., 2012
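The selection step can be sketched with plain Lloyd k-means in numpy (a stand-in for clustering AlexNet fc6 outputs with K=120; `kmeans_representatives` and its deterministic init are illustrative):

```python
import numpy as np

def kmeans_representatives(feats, k, iters=20):
    """Cluster feature vectors with plain Lloyd k-means and return, for
    each cluster, the index of the item closest to its centroid, i.e. the
    k 'most representative' items (deterministic spread-out init)."""
    init = np.linspace(0, len(feats) - 1, k).astype(int)
    centers = feats[init].astype(float).copy()
    for _ in range(iters):
        d = np.linalg.norm(feats[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)                 # assign to nearest center
        for c in range(k):
            if np.any(labels == c):
                centers[c] = feats[labels == c].mean(axis=0)
    d = np.linalg.norm(feats[:, None] - centers[None], axis=2)
    return sorted(int(d[:, c].argmin()) for c in range(k))

# Two tight clusters of 'images'; k=2 picks one representative from each
feats = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                  [10.0, 10.0], [10.1, 10.0], [10.0, 10.1]])
reps = kmeans_representatives(feats, 2)
```

Redundant near-duplicate frames collapse into one cluster, and only the item closest to each centroid is kept.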

Viewpoint reclustering

● Solves switching of cameras between views
● Uses MODS (image matching) in a greedy algorithm:
1. Pick a reference image
2. Find matching pairs
3. Create a new view; exclude its images from the original sequence
4. If the original sequence is not empty: repeat
● Keep the biggest view from each camera, 50 images each (if available)

-> 273 views
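The greedy loop above can be sketched directly; `is_match(a, b)` is a stand-in predicate for running MODS on an image pair and checking that matching succeeds:

```python
def greedy_views(images, is_match):
    """Greedy view clustering sketch. Pick a reference image, pull in
    everything that matches it, form a view, remove those images from the
    pool, and repeat until nothing is left."""
    remaining = list(images)
    views = []
    while remaining:
        ref = remaining[0]
        view = [ref] + [im for im in remaining[1:] if is_match(ref, im)]
        views.append(view)
        remaining = [im for im in remaining if im not in view]
    return views

# Toy stand-in: images whose names share a prefix belong to the same view
same_scene = lambda a, b: a[0] == b[0]
views = greedy_views(["a1", "a2", "b1", "a3", "b2"], same_scene)
```

Each pass peels one view off the sequence, so the loop terminates after as many passes as there are views.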

Registration

● Results are still not satisfactory. Why?

○ MODS often outputs a homography valid only for a small part of the image

○ Need for final manual check

-> Use GDB-ICP
● In each view:
○ Run registration on pairs of images
○ If a single failure -> remove the whole view

-> 151 registered views

Manual pruning

● Several problems not detected so far:
○ Dynamic scenes
○ Cloud-dominated scenes
○ Views with very similar content

-> 27 registered views, 50 images each

Patch selection


● Apply masks (crop out text etc.)

● Sampling of centers (response function)

● Random rotation (any angle)

● Random scale

Experiments: (de)registration

● Displace each patch randomly
○ Observe the influence on precision
● Result:
○ Precise registration is important

● mAP - mean average precision


HPatches matching task, full split

Experiments: batch composition

● We use the hard-in-batch triplet margin loss
● The composition of a batch influences precision
● Idea: choose a subset of views as the source for patches
● Intuition: tough pairs often come from the same image

-> Improvement


HPatches matching task, full split

Evaluation

● New state of the art in matching under illumination changes (to the best of our knowledge)
● Outperforms the recently proposed HardNetPS in the full split
● We propose the AMOS Patches test split for evaluating robustness to lighting and season-related conditions


Measurement region selector: orientation



Which patch should we describe?


Detector: x, y, scale
Should we rotate the patch?
Should we deform the patch?

Handcrafted: dominant orientation

Learned orientation: CNN

Yi et al., Learning to Assign Orientations to Feature Points, CVPR 2016

D. Lowe, Distinctive Image Features from Scale-Invariant Keypoints, IJCV 2004

If images are upright for sure: don't detect orientation


DoG + HardNet matches + FGINN union + RANSAC. Found 1st image projection: blue, ground truth: green, inlier correspondences: yellow

Dominant gradient orientation: 123 inliers

Learned orientation: 140 inliers

Constant orientation: 181 inliers

AffNet (ECCV 2018): measurement region selector



AffNet: learning measurement region

Mishkin et al., Repeatability Is Not Enough: Learning Affine Regions via Discriminability, ECCV 2018

Does AffNet help? Yes, if the problem is hard


FGINN union + RANSAC. Found 1st image projection: blue, ground truth: green, inlier correspondences: yellow

DoG + HardNet 2.0: 123 inliers

DoG + AffNet + HardNet 2.0: 165 inliers


• Find the affine shape that maximizes the difference between positive and hardest-in-batch negative examples

• Positive-only learning (Yi et al., CVPR 2015) leads to degenerate ellipses

• Triplet margin (HardNet): unstable for training affine shape

AffNet: learning measurement region


Local feature detector



The detector is often the failure point of the whole process

• Yet we still use 10-20-year-old methods like SIFT or FAST, because nothing significantly better for practical purposes has been proposed

• So let's stick to the basics


Stylianou et al., WACV 2015. Characterizing Feature Matching Performance Over Long Time Periods

SIFT is the DoG detector + SIFT descriptor

• Really, there is no such thing as a “SIFT detector”.

• But everyone got so used to calling DoG “SIFT”


The DoG filter is a simple blob template
https://docs.opencv.org/3.4.3/da/df5/tutorial_py_sift_intro.html

Gaussian scale space: a “stack of gradually smoothed versions” of the original image

Detections on synthetic image

ORB: FAST detector + BRIEF descriptor


FAST is a corner detector based on the segment test
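The segment test is easy to sketch (a toy version; the real FAST implementation speeds this up with a machine-learned decision tree over the 16 circle pixels):

```python
import numpy as np

# The 16-pixel Bresenham circle of radius 3 used by FAST ((dx, dy) offsets)
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]

def fast_corner(img, y, x, t=20, n=12):
    """FAST segment test: (y, x) is a corner if at least n CONTIGUOUS circle
    pixels are all brighter than center+t or all darker than center-t
    (the circle is doubled to handle wrap-around runs)."""
    c = int(img[y, x])
    ring = np.array([int(img[y + dy, x + dx]) for dx, dy in CIRCLE])
    for passed in (ring > c + t, ring < c - t):
        run = best = 0
        for v in np.concatenate([passed, passed]):   # wrap around
            run = run + 1 if v else 0
            best = max(best, run)
        if best >= n:
            return True
    return False

img = np.zeros((9, 9), dtype=np.uint8)
img[4, 4] = 200                      # isolated bright dot: detected as a corner
dot = fast_corner(img, 4, 4)

edge = np.zeros((9, 9), dtype=np.uint8)
edge[:, 4:] = 100                    # straight edge: correctly rejected
on_edge = fast_corner(edge, 4, 4)
```

A straight edge only yields a short darker/brighter run on the circle, which is exactly why the n=12 contiguity requirement suppresses edge responses.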

Joint detectors and descriptors

SuperPoint (CVPRW 2017), DELF (ICCV 2017), D2-Net (CVPR 2019)



SuperPoint

DeTone et al., SuperPoint: Self-Supervised Interest Point Detection and Description, CVPRW 2017

DELF

Noh et al., Large-Scale Image Retrieval with Attentive Deep Local Features, ICCV 2017

“Attention” as weighting for global descriptor

D2Net

Dusmanu et al., D2-Net: A Trainable CNN for Joint Description and Detection of Local Features, CVPR 2019

Comparison on toy example


SuperPoint: 51 inliers

DoG + HardNet: 123 inliers

D2-Net: 26 inliers, incorrect geometry

All things together


SIFT + SNN match + OpenCV RANSAC: 27 inliers

SIFT + NoOri + HardNet + FGINN union match + CMP RANSAC: 179 inliers

I really need to match this

• View synthesis: MODS


MODS (controller and preprocessor)

MODS handles angular viewpoint differences up to:
• 85° for planar scenes
• 30° for structured scenes

D. Mishkin, J. Matas and M. Perdoch. MODS: Fast and Robust Method for Two-View Matching, CVIU 2015

[MODS loop: affine view synthesis → detect & describe (Det1-Desc1, Det2-Desc2) → match → RANSAC; matched? done, otherwise try more view synthesis]

• If you DO NOT need correspondences & camera pose → DO NOT use local features. Use global descriptor (ResNet101 GeM) + fast search (faiss)

• Step 0: try OpenCV SIFT

• Use a proper RANSAC (private now, will be cleaned up and opened next week)

• Matching → use FGINN in two-way mode

• Need to be faster → ORB.

• Need to be more robust → use SIFT + HardNet 2.0

• Custom data → train on your own dataset

• Even more robust → use SIFT + AffNet + HardNet 2.0

• If images are upright, DO NOT DETECT the ORIENTATION

• Landmark data → DELF


Thank you for your attention

ducha_aiki

ducha-aiki
