

Neural-Guided RANSAC: Learning Where to Sample Model Hypotheses

Eric Brachmann and Carsten Rother
Visual Learning Lab
Heidelberg University (HCI/IWR)
http://vislearn.de

Abstract

We present Neural-Guided RANSAC (NG-RANSAC), an extension to the classic RANSAC algorithm from robust optimization. NG-RANSAC uses prior information to improve model hypothesis search, increasing the chance of finding outlier-free minimal sets. Previous works use heuristic side-information like hand-crafted descriptor distance to guide hypothesis search. In contrast, we learn hypothesis search in a principled fashion that lets us optimize an arbitrary task loss during training, leading to large improvements on classic computer vision tasks. We present two further extensions to NG-RANSAC. Firstly, using the inlier count itself as training signal allows us to train neural guidance in a self-supervised fashion. Secondly, we combine neural guidance with differentiable RANSAC to build neural networks which focus on certain parts of the input data and make the output predictions as good as possible. We evaluate NG-RANSAC on a wide array of computer vision tasks, namely estimation of epipolar geometry, horizon line estimation and camera re-localization. We achieve superior or competitive results compared to state-of-the-art robust estimators, including very recent, learned ones.

1. Introduction

Despite its simplicity and age, Random Sample Consensus (RANSAC) [12] remains an important method for robust optimization, and is a vital component of many state-of-the-art vision pipelines [39, 40, 29, 6]. RANSAC allows accurate estimation of model parameters from a set of observations of which some are outliers. To this end, RANSAC iteratively chooses random sub-sets of observations, so-called minimal sets, to create model hypotheses. Hypotheses are ranked according to their consensus with all observations, and the top-ranked hypothesis is returned as the final estimate.
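As a point of reference for what follows, here is a minimal, hypothetical sketch of the vanilla RANSAC loop just described; `solver` (the minimal solver), `residual` and `threshold` are placeholders to be supplied by the task:

```python
import numpy as np

def ransac(Y, solver, residual, threshold, M=1000, N=8):
    """Minimal sketch of vanilla RANSAC. Y is an array of observations;
    solver maps a minimal set of N observations to model parameters h;
    residual measures the consensus of each observation with h."""
    best_h, best_inliers = None, -1
    for _ in range(M):
        # Choose a random minimal set of observations...
        idx = np.random.choice(len(Y), size=N, replace=False)
        h = solver(Y[idx])
        # ...and rank the hypothesis by its inlier count.
        inliers = np.sum(residual(Y, h) < threshold)
        if inliers > best_inliers:
            best_h, best_inliers = h, inliers
    return best_h
```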

The main limitation of RANSAC is its poor performance in domains with many outliers. As the ratio of outliers increases, RANSAC requires exponentially many iterations to find an outlier-free minimal set. Implementations of RANSAC therefore often restrict the maximum number of iterations, and return the best model found so far [7].
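The exponential growth can be quantified with the standard RANSAC analysis (not spelled out in the paper): with outlier ratio ε and minimal set size N, a single minimal set is outlier-free with probability (1−ε)^N, so reaching an outlier-free set with probability p requires

$$M \geq \frac{\log(1-p)}{\log\left(1 - (1-\varepsilon)^N\right)}$$

iterations. For the example of Fig. 1 (ε = 0.88, N = 5 with the 5-point algorithm), p = 0.99 already demands roughly 185,000 iterations.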

[Figure 1 panels: SIFT Correspondences, RANSAC Result, Neural Guidance (probability), NG-RANSAC Result.]

Figure 1. RANSAC vs. NG-RANSAC. We extract 2000 SIFT correspondences between two images. With an outlier rate of 88%, RANSAC fails to find the correct relative transformation (green correct and red wrong matches). We use a neural network to predict a probability distribution over correspondences. Over 90% of the probability mass falls onto 239 correspondences with an outlier rate of 33%. NG-RANSAC samples minimal sets according to this distribution, and finds the correct transformation up to an angular error of less than 1°.

In this work, we combine RANSAC with a neural network that predicts a weight for each observation. The weights ultimately guide the sampling of minimal sets. We call the resulting algorithm Neural-Guided RANSAC (NG-RANSAC). A comparison of our method with vanilla RANSAC can be seen in Fig. 1.


When developing NG-RANSAC, we took inspiration from recent work on learned robust estimators [56, 36]. In particular, Yi et al. [56] train a neural network to classify observations as outliers or inliers, fitting final model parameters only to the latter. Although designed to replace RANSAC, their method achieves best results when combined with RANSAC at test time, where it removes any outliers that the neural network might have missed. This motivates us to train the neural network in conjunction with RANSAC in a principled fashion, rather than imposing it afterwards.

Instead of interpreting the neural network output as soft inlier labels for a robust model fit, we let the output weights guide RANSAC hypothesis sampling. Intuitively, the neural network should learn to decrease weights for outliers, and increase them for inliers. This paradigm gives the neural network substantial flexibility: due to the robustness of RANSAC, a certain misclassification rate has no negative effect on the final fitting accuracy. The distinction between inliers and outliers, as well as which misclassifications are tolerable, is solely guided by the minimization of the task loss function during training. Furthermore, our formulation of NG-RANSAC facilitates training with any (non-differentiable) task loss function, and any (non-differentiable) model parameter solver, making it broadly applicable. For example, when fitting essential matrices, we may use the 5-point algorithm rather than the (differentiable) 8-point algorithm which other learned robust estimators rely on [56, 36]. The flexibility in choosing the task loss also allows us to train NG-RANSAC self-supervised by using maximization of the inlier count as training objective.

The idea of using guided sampling in RANSAC is not new. Tordoff and Murray first proposed to guide the hypothesis search of MLESAC [48] using side-information [47]. They formulated a prior probability of sparse feature matches being valid based on matching scores. While this has a positive effect on RANSAC performance in some applications, feature matching scores, or other hand-crafted heuristics, were clearly not designed to guide hypothesis search. In particular, calibration of such ad-hoc measures can be difficult, as the reliance on over-confident but wrong prior probabilities can yield situations where the same few observations are sampled repeatedly. This fact was recognized by Chum and Matas who proposed PROSAC [9], a variant of RANSAC that uses side-information only to change the order in which RANSAC draws minimal sets. In the worst case, if the side-information is not useful at all, their method degenerates to vanilla RANSAC. NG-RANSAC takes a different approach in (i) learning the weights to guide hypothesis search rather than using hand-crafted heuristics, and (ii) integrating RANSAC itself in the training process, which leads to self-calibration of the predicted weights.

Recently, Brachmann et al. proposed differentiable RANSAC (DSAC) to learn a camera re-localization pipeline [4]. Unfortunately, we cannot directly use DSAC to learn hypothesis sampling since DSAC is only differentiable w.r.t. observations, not sampling weights. However, NG-RANSAC applies a similar trick also used to make DSAC differentiable, namely the optimization of the expected task loss during training. While we do not rely on DSAC, neural guidance can be used in conjunction with DSAC (NG-DSAC) to train neural networks that predict observations and observation confidences at the same time. We summarize our main contributions:

• We present NG-RANSAC, a formulation of RANSAC with learned guidance of hypothesis sampling. We can use any (non-differentiable) task loss, and any (non-differentiable) minimal solver for training.

• Choosing the inlier count itself as training objective facilitates self-supervised learning of NG-RANSAC.

• We use NG-RANSAC to estimate epipolar geometry of image pairs from sparse correspondences, where it surpasses competing robust estimators.

• We combine neural guidance with differentiable RANSAC (NG-DSAC) to train neural networks that make accurate predictions for parts of the input, while neglecting other parts. These models achieve competitive results for horizon line estimation, and state-of-the-art results for camera re-localization.

2. Related Work

RANSAC was introduced in 1981 by Fischler and Bolles [12]. Since then it has been extended in various ways, see e.g. the survey by Raguram et al. [35]. Combining some of the most promising improvements, Raguram et al. created the Universal RANSAC (USAC) framework [34] which represents the state of the art of classic RANSAC variants. USAC includes guided hypothesis sampling according to PROSAC [9], more accurate model fitting according to Locally Optimized RANSAC [11], and more efficient hypothesis verification according to Optimal Randomized RANSAC [10]. Many of the improvements proposed for RANSAC could also be applied to NG-RANSAC since we do not require any differentiability of such add-ons. We only impose restrictions on how to generate hypotheses, namely according to a learned probability distribution.

RANSAC is not often used in recent machine learning-heavy vision pipelines. Notable exceptions include geometric problems like object instance pose estimation [3, 5, 21], and camera re-localization [41, 51, 28, 8, 46] where RANSAC is coupled with decision forests or neural networks that predict image-to-object correspondences. However, in most of these works, RANSAC is not part of the training process because of its non-differentiability. DSAC [4, 6] overcomes this limitation by making the hypothesis selection a probabilistic action, which facilitates optimization of the expected task loss during training.


However, DSAC is limited in which derivatives can be calculated. DSAC allows differentiation w.r.t. observations. For example, we can use it to calculate the gradient of image coordinates for a sparse correspondence. However, DSAC does not model observation selection, and hence we cannot use it to optimize a matching probability. By showing how to learn neural guidance, we close this gap. The combination with DSAC enables the full flexibility of learning both observations and their selection probability.

Besides DSAC, a differentiable robust estimator, there has recently been some work on learning robust estimators. We discussed the work of Yi et al. [56] in the introduction. Ranftl and Koltun [36] take a similar but iterative approach, reminiscent of Iteratively Reweighted Least Squares (IRLS), for fundamental matrix estimation. In each iteration, a neural network predicts observation weights for a weighted model fit, taking into account the residuals of the last iteration. Both [56] and [36] have shown considerable improvements w.r.t. vanilla RANSAC but require differentiable minimal solvers and task loss functions. NG-RANSAC outperforms both approaches, and is more flexible when it comes to defining the training objective. This flexibility also enables us to train NG-RANSAC in a self-supervised fashion, which is possible with neither [56] nor [36].

3. Method

Preliminaries. We address the problem of fitting model parameters h to a set of observations y ∈ Y that are contaminated by noise and outliers. For example, h could be a fundamental matrix that describes the epipolar geometry of an image pair [16], and Y could be the set of SIFT correspondences [27] we extract for the image pair. To calculate model parameters from the observations, we utilize a solver f, for example the 8-point algorithm [15]. However, calculating h from all observations will result in a poor estimate due to outliers. Instead, we can calculate h from a small subset (minimal set) of observations with cardinality N: h = f(y_1, ..., y_N). For example, N = 8 for a fundamental matrix when using the 8-point algorithm. RANSAC [12] is an algorithm to choose an outlier-free minimal set from Y such that the resulting estimate h is accurate. To this end, RANSAC randomly chooses M minimal sets to create a pool of model hypotheses H = (h_1, ..., h_M).

RANSAC includes a strategy to adaptively choose M, based on an online estimate of the outlier ratio [12]. The strategy guarantees that an outlier-free set will be sampled with a user-defined probability. For tasks with large outlier ratios, M calculated like this can be exponentially large, and is usually clamped to a maximum value [7]. For notational simplicity, we take the perspective of a fixed M but do not restrict the use of an early-stopping strategy in practice.

RANSAC chooses a model hypothesis as the final estimate ĥ according to a scoring function s:

$$\hat{\mathbf{h}} = \operatorname*{argmax}_{\mathbf{h} \in \mathcal{H}} s(\mathbf{h}, \mathcal{Y}). \tag{1}$$

The scoring function measures the consensus of a hypothesis w.r.t. all observations, and is traditionally implemented as inlier counting [12].

Neural Guidance. RANSAC chooses observations uniformly at random to create the hypothesis pool H. We instead aim to sample observations according to a learned distribution, parametrized by a neural network with parameters w. That is, we select observations according to y ∼ p(y;w). Note that p(y;w) is a categorical distribution over the discrete set of observations Y, not a continuous distribution in observation space. We wish to learn the parameters w in a way that increases the chance of selecting outlier-free minimal sets, which will result in accurate estimates ĥ. We sample a hypothesis pool H according to p(H;w) by sampling observations and minimal sets independently, i.e.

$$p(\mathcal{H};\mathbf{w}) = \prod_{j=1}^{M} p(\mathbf{h}_j;\mathbf{w}), \quad \text{with} \quad p(\mathbf{h};\mathbf{w}) = \prod_{i=1}^{N} p(\mathbf{y}_i;\mathbf{w}). \tag{2}$$
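In code, sampling a pool according to Eq. 2 amounts to drawing each minimal set from the categorical distribution predicted by the network. A minimal sketch, assuming `probs` holds the normalized p(y;w) (function name is ours):

```python
import numpy as np

def sample_pool(probs, M, N):
    """Hypothetical sketch of sampling a hypothesis pool per Eq. 2:
    each of the M minimal sets draws its N observations independently
    from the categorical distribution p(y; w)."""
    pool = np.stack([np.random.choice(len(probs), size=N, p=probs)
                     for _ in range(M)])
    # log p(H; w) factorizes over all sampled observations (Eq. 2);
    # this term is what the gradient in Eq. 4 will need.
    log_p_pool = np.sum(np.log(probs[pool]))
    return pool, log_p_pool
```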

From a pool H, we estimate model parameters ĥ with RANSAC according to Eq. 1. For training, we assume that we can measure the quality of the estimate with a task loss function ℓ(ĥ). The task loss can be calculated w.r.t. a ground truth model h*, or self-supervised, e.g. by using the inlier count of the final estimate: ℓ(ĥ) = −s(ĥ, Y). We wish to learn the distribution p(H;w) in a way that we receive a small task loss with high probability. Inspired by DSAC [4], we define our training objective as the minimization of the expected task loss:

$$\mathcal{L}(\mathbf{w}) = \mathbb{E}_{\mathcal{H} \sim p(\mathcal{H};\mathbf{w})} \left[ \ell(\hat{\mathbf{h}}) \right]. \tag{3}$$

We compute the gradients of the expected task loss w.r.t. the network parameters as

$$\frac{\partial}{\partial \mathbf{w}} \mathcal{L}(\mathbf{w}) = \mathbb{E}_{\mathcal{H}} \left[ \ell(\hat{\mathbf{h}}) \frac{\partial}{\partial \mathbf{w}} \log p(\mathcal{H};\mathbf{w}) \right]. \tag{4}$$

Integrating over all possible hypothesis pools to calculate the expectation is infeasible. Therefore, we approximate the gradients by drawing K samples H_k ∼ p(H;w):

$$\frac{\partial}{\partial \mathbf{w}} \mathcal{L}(\mathbf{w}) \approx \frac{1}{K} \sum_{k=1}^{K} \left[ \ell(\hat{\mathbf{h}}) \frac{\partial}{\partial \mathbf{w}} \log p(\mathcal{H}_k;\mathbf{w}) \right]. \tag{5}$$

Note that gradients of the task loss function ℓ do not appear in the expression above. Therefore, differentiability of the task loss ℓ, the robust solver (i.e. RANSAC) or the minimal solver f is not required. These components merely generate a training signal for steering the sampling probability p(H;w) in a good direction. Due to the approximation by sampling, the gradient variance of Eq. 5 can be high. We apply a standard variance reduction technique from reinforcement learning by subtracting a baseline b [45]:

$$\frac{\partial}{\partial \mathbf{w}} \mathcal{L}(\mathbf{w}) \approx \frac{1}{K} \sum_{k=1}^{K} \left[ \left[ \ell(\hat{\mathbf{h}}) - b \right] \frac{\partial}{\partial \mathbf{w}} \log p(\mathcal{H}_k;\mathbf{w}) \right]. \tag{6}$$
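A sketch of one training step built on this estimator, using a REINFORCE-style surrogate loss whose gradient matches Eq. 6. All names are hypothetical; `run_ransac` and `task_loss` are treated as non-differentiable black boxes, and a softmax stands in for the network's actual output normalization:

```python
import torch

def ng_ransac_training_step(network, x, y, run_ransac, task_loss,
                            optimizer, K=4, M=16, N=5):
    logits = network(x)                       # one logit per observation
    probs = torch.softmax(logits, dim=0)      # categorical p(y; w)

    losses, log_p_pools = [], []
    for _ in range(K):
        # Sample a pool H_k of M minimal sets with N observations each
        # (independent draws, Eq. 2).
        idx = torch.multinomial(probs, M * N, replacement=True).view(M, N)
        h_hat = run_ransac(y, idx)            # non-differentiable estimate
        losses.append(task_loss(h_hat))       # plain float, no gradient
        # log p(H_k; w): sum of log-probabilities of sampled observations.
        log_p_pools.append(torch.log(probs[idx]).sum())

    losses = torch.tensor(losses)
    baseline = losses.mean()                  # variance reduction, b in Eq. 6
    # REINFORCE surrogate whose gradient equals the estimator in Eq. 6.
    surrogate = ((losses - baseline) * torch.stack(log_p_pools)).mean()
    optimizer.zero_grad()
    surrogate.backward()
    optimizer.step()
    return losses.mean().item()
```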

We found a simple baseline in the form of the average loss per image sufficient, i.e. b = ℓ̄. Subtracting the baseline will move the probability distribution towards hypothesis pools with lower-than-average loss for each training example.

Combination with DSAC. Brachmann et al. [4] proposed a RANSAC-based pipeline where a neural network with parameters w predicts observations y(w) ∈ Y(w). End-to-end training of the pipeline, and therefore learning the observations y(w), is possible by turning the argmax hypothesis selection of RANSAC (cf. Eq. 1) into a probabilistic action:

$$\hat{\mathbf{h}}_{\text{DSAC}} = \mathbf{h}_j \sim p(j|\mathcal{H}) = \frac{\exp s(\mathbf{h}_j, \mathcal{Y}(\mathbf{w}))}{\sum_{k=1}^{M} \exp s(\mathbf{h}_k, \mathcal{Y}(\mathbf{w}))}. \tag{7}$$

This differentiable variant of RANSAC (DSAC) chooses a hypothesis randomly according to a distribution calculated from hypothesis scores. The training objective aims at learning network parameters such that hypotheses with low task loss are chosen with high probability:

$$\mathcal{L}_{\text{DSAC}}(\mathbf{w}) = \mathbb{E}_{j \sim p(j)} \left[ \ell(\mathbf{h}_j) \right]. \tag{8}$$

In the following, we extend the formulation of DSAC with neural guidance (NG-DSAC). We let the neural network predict observations y(w) and, additionally, a probability associated with each observation, p(y;w). Intuitively, the neural network can express a confidence in its own predictions through this probability. This can be useful if a certain input for the neural network contains no information about the desired model h. In this case, the observation prediction y(w) is necessarily an outlier, and the best the neural network can do is to label it as such by assigning a low probability. We combine the training objectives of NG-RANSAC (Eq. 3) and DSAC (Eq. 8), which yields:

$$\mathcal{L}_{\text{NG-DSAC}}(\mathbf{w}) = \mathbb{E}_{\mathcal{H} \sim p(\mathcal{H};\mathbf{w})} \, \mathbb{E}_{j \sim p(j|\mathcal{H})} \left[ \ell(\mathbf{h}_j) \right], \tag{9}$$

where we again construct p(H;w) from the individual p(y;w)'s according to Eq. 2. The training objective of NG-DSAC consists of two expectations. Firstly, the expectation w.r.t. sampling a hypothesis pool according to the probabilities predicted by the neural network. Secondly, the expectation w.r.t. sampling a final estimate from the pool according to the scoring function. As in NG-RANSAC, we approximate the first expectation via sampling, as integrating over all possible hypothesis pools is infeasible. The second expectation we can calculate analytically, as in DSAC, since it integrates over the discrete set of hypotheses h_j in a given pool H. Similar to Eq. 6, we give the approximate gradients ∂/∂w L(w) of NG-DSAC as:

$$\frac{\partial}{\partial \mathbf{w}} \mathcal{L}(\mathbf{w}) \approx \frac{1}{K} \sum_{k=1}^{K} \left[ \left[ \mathbb{E}_j[\ell] - b \right] \frac{\partial}{\partial \mathbf{w}} \log p(\mathcal{H}_k;\mathbf{w}) + \frac{\partial}{\partial \mathbf{w}} \mathbb{E}_j[\ell] \right], \tag{10}$$

where we use E_j[ℓ] as a stand-in for E_{j∼p(j|H_k)}[ℓ(h_j)]. The calculation of gradients for NG-DSAC requires the derivative of the task loss (note the last part of Eq. 10) because E_j[ℓ] depends on parameters w via the observations y(w). Therefore, training NG-DSAC requires a differentiable task loss function ℓ, a differentiable scoring function s, and a differentiable minimal solver f. Note that we inherit these restrictions from DSAC. In return, NG-DSAC allows for learning observations and observation confidences at the same time.
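The analytic inner expectation is a differentiable softmax-weighted sum; a minimal sketch under the same assumptions as above (function name is ours):

```python
import torch

def expected_task_loss(scores, losses):
    """Hypothetical sketch of the analytic inner expectation (Eqs. 7-9).

    scores : tensor of hypothesis scores s(h_j, Y(w)), differentiable
    losses : tensor of task losses l(h_j), differentiable via y(w)
    """
    # p(j|H) is a softmax over hypothesis scores (Eq. 7); the expectation
    # over the discrete pool is a differentiable weighted sum (Eq. 8).
    p = torch.softmax(scores, dim=0)
    return (p * losses).sum()
```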

4. Experiments

We evaluate neural guidance on multiple classic computer vision tasks. Firstly, we apply NG-RANSAC to estimating epipolar geometry of image pairs in the form of essential matrices and fundamental matrices. Secondly, we apply NG-DSAC to horizon line estimation and camera re-localization. We present the main experimental results here, and refer to the appendix for details about network architectures, hyper-parameters and more qualitative results. Our implementation is based on PyTorch [32], and we will make the code publicly available for all the tasks discussed below.

4.1. Essential Matrix Estimation

Epipolar geometry describes the geometry of two images that observe the same scene [16]. In particular, two image points x and x′ in the left and right image corresponding to the same 3D point satisfy x′ᵀFx = 0, where the 3×3 matrix F denotes the fundamental matrix. We can estimate F uniquely (but only up to scale) from 8 correspondences, or from 7 correspondences with multiple solutions [16]. The essential matrix E is a special case of the fundamental matrix when the calibration parameters K and K′ of both cameras are known: E = K′ᵀFK. The essential matrix can be estimated from 5 correspondences [31]. Decomposing the essential matrix allows recovering the relative pose between the observing cameras, which is a central step in image-based 3D reconstruction [40]. As such, estimating the fundamental or essential matrix of an image pair is a classic and well-researched problem in computer vision.


[Figure 2 bar charts: AUC at 5°, 10° and 20° thresholds for the Outdoor and Indoor settings. a) Methods without side-information: DeMoN, GMS + RANSAC, SIFT + InClass + RANSAC, LIFT + InClass + RANSAC, SIFT + RANSAC, SIFT + NG-RANSAC. b) With side-information: SIFT + Ratio + RANSAC, RootSIFT + Ratio + RANSAC, SIFT + NG-RANSAC (+SI), SIFT + Ratio + NG-RANSAC (+SI), RootSIFT + Ratio + NG-RANSAC (+SI). c) Self-supervised: SIFT + NG-RANSAC (+SI), RootSIFT + Ratio + NG-RANSAC (+SI).]

Figure 2. Essential Matrix Estimation. We calculate the relative pose between outdoor and indoor image pairs via the essential matrix. We measure the AUC of the cumulative angular error up to a threshold of 5°, 10° or 20°. a) We use no side-information about the sparse correspondences. b) We use side-information in the form of descriptor distance ratios between the best and second-best match. We use it to filter correspondences with a threshold of 0.8 (+Ratio), and as an additional input for our network (+SI). c) We train NG-RANSAC in a self-supervised fashion by using the inlier count as training objective.

In the following, we firstly evaluate NG-RANSAC for the calibrated case and estimate essential matrices from SIFT correspondences [27]. For the sake of comparability with the recent, learned robust estimator of Yi et al. [56], we adhere closely to their evaluation setup, and compare to their results.

Datasets. Yi et al. [56] evaluate their approach in outdoor as well as indoor settings. For the outdoor datasets, they select five scenes from the structure-from-motion (SfM) dataset of [19]: Buckingham, Notredame, Sacre Coeur, St. Peter's and Reichstag. They pick two additional scenes from [44]: Fountain and Herzjesu. They reconstruct each scene using a SfM tool [53] to obtain 'ground truth' camera poses, and co-visibility constraints for selecting image pairs. For indoor scenes, Yi et al. choose 16 sequences from the SUN3D dataset [54] which readily comes with ground truth poses captured by KinectFusion [30]. See Appendix A for a listing of all scenes. Indoor scenarios are typically very challenging for sparse feature-based approaches because of texture-less surfaces and repetitive elements (see Fig. 1 for an example). Yi et al. train their best model using one outdoor scene (St. Peter's) and one indoor scene (Brown 1), and test on all remaining sequences (6 outdoor, 15 indoor). Yi et al. kindly provided us with their exact data splits, and we use their setup. Note that training and testing are performed on completely separate scenes, i.e. the neural network has to generalize to unknown environments.

Evaluation Metric. Via the essential matrix, we recover the relative camera pose up to scale, and compare to the ground truth pose as follows. We measure the angular error between the pose rotations, as well as the angular error between the pose translation vectors, in degrees. We take the maximum of the two values as the final angular error. We calculate the cumulative error curve for each test sequence, and compute the area under the curve (AUC) up to a threshold of 5°, 10° or 20°. Finally, we report the average AUC over all test sequences (but separately for the indoor and outdoor setting).
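A sketch of this metric, assuming the relative pose has already been recovered (e.g. by decomposing the essential matrix); resolving the sign ambiguity of the translation direction via the absolute dot product is an assumption of ours, not stated in the paper:

```python
import numpy as np

def pose_error_deg(R_est, t_est, R_gt, t_gt):
    """Hypothetical sketch of the evaluation metric: the maximum of the
    rotational and translational angular errors, in degrees."""
    # Angle of the residual rotation R_est^T R_gt, from its trace.
    cos_r = np.clip((np.trace(R_est.T @ R_gt) - 1.0) / 2.0, -1.0, 1.0)
    err_r = np.degrees(np.arccos(cos_r))
    # Angle between the unit-normalized translation directions;
    # translation is only recovered up to scale (and sign) from E.
    t1 = t_est / np.linalg.norm(t_est)
    t2 = t_gt / np.linalg.norm(t_gt)
    err_t = np.degrees(np.arccos(np.clip(abs(t1 @ t2), -1.0, 1.0)))
    return max(err_r, err_t)
```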

Implementation. Yi et al. train a neural network to classify a set of sparse correspondences into inliers and outliers. They represent each correspondence as a 4D vector combining the 2D coordinates in the left and right image. Their network is inspired by PointNet [33], and processes each correspondence independently by a series of multilayer perceptrons (MLPs). Global context is infused by using instance normalization [49] in-between layers. We re-build this architecture in PyTorch, and train it according to NG-RANSAC (Eq. 3). That is, the network predicts weights to guide RANSAC sampling instead of inlier class labels. We use the angular error between the estimated relative pose and the ground truth pose as task loss ℓ. As minimal solver f, we use the 5-point algorithm [31]. To speed up training, we initialize the network by learning to predict the distance of each correspondence to the ground truth epipolar line, see Appendix A for details. We initialize for 75k iterations, and train according to Eq. 3 for 25k iterations. We optimize using Adam [23] with a learning rate of 10⁻⁵. For each training image pair, we extract 2000 SIFT correspondences, and sample K = 4 hypothesis pools with M = 16 hypotheses. We use a low number of hypotheses during training to obtain variation when sampling pools. For testing, we increase the number of hypotheses to M = 1000. We use an inlier threshold of 10⁻³, assuming image coordinates normalized by the camera calibration parameters.

Results. We compare NG-RANSAC to the inlier classification (InClass) of Yi et al. [56]. They use their approach with SIFT as well as LIFT [55] features. The former works better for outdoor scenes, the latter works better for indoor scenes. We also include results for DeMoN [50], a learned SfM pipeline, and GMS [2], a semi-dense approach using ORB features [38]. See Fig. 2 a) for results. RANSAC achieves poor results for indoor and outdoor scenes across all thresholds, scoring as the weakest method among all competitors. Coupling it with neural guidance (NG-RANSAC) elevates it to the leading position with a comfortable margin. See also Fig. 3 for qualitative results.


[Figure 3 panels: Indoor, Outdoor, Kitti; rows: RANSAC result, Neural Guidance, NG-RANSAC result, each annotated with its error metrics.]

Figure 3. Qualitative Results. We compare fitted models for RANSAC and NG-RANSAC. For the indoor and outdoor image pairs, we fit essential matrices, and for the Kitti image pair we fit the fundamental matrix. We draw final model inliers in green if they adhere to the ground truth model, and red otherwise. We also measure the quality of each estimate, see the main text for details on the metrics.

NG-RANSAC outperforms InClass of Yi et al. [56] despite some similarities. Both use the same network architecture, are based on SIFT correspondences, and both use RANSAC at test time. Yi et al. [56] train using a hybrid classification-regression loss based on the 8-point algorithm, and ultimately compare essential matrices using the squared error. Therefore, their training objective is very different from the evaluation procedure. During evaluation, they use RANSAC with the 5-point algorithm on top of their inlier predictions, and measure the angular error. NG-RANSAC incorporates all these components in its training procedure, and therefore optimizes the correct objective.

Using Side-Information. The evaluation procedure defined by Yi et al. [56] is well suited to test a robust estimator in high-outlier domains. However, it underestimates what classical approaches can achieve on these datasets. Lowe's ratio criterion [27] is often deployed to drastically reduce the amount of outlier correspondences. It is based on the distance ratio of the best and second-best SIFT match. Matches with a ratio above a threshold (we use 0.8) are removed before running RANSAC. We denote the ratio criterion as +Ratio in Fig. 2 b), and observe a drastic improvement when combined with SIFT and RANSAC. This classic approach outperforms all learned methods of Fig. 2 a) and is competitive with NG-RANSAC (when not using any side-information). We can push performance even further by applying RootSIFT normalization to the SIFT descriptors [1]. By training NG-RANSAC on ratio-filtered RootSIFT correspondences, using distance ratios as additional network input (denoted by +SI), we achieve the best accuracy.
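For illustration, the ratio criterion and RootSIFT normalization can be sketched with OpenCV and NumPy as follows (function names are ours):

```python
import cv2
import numpy as np

def ratio_filtered_matches(desc1, desc2, ratio=0.8):
    """Hypothetical sketch of Lowe's ratio criterion: keep a match only
    if its descriptor distance is clearly smaller than that of the
    second-best candidate."""
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = matcher.knnMatch(desc1, desc2, k=2)
    return [m for m, n in matches if m.distance < ratio * n.distance]

def rootsift(desc, eps=1e-7):
    # RootSIFT [1]: L1-normalize each descriptor, then take the
    # element-wise square root (Hellinger kernel trick).
    desc = desc / (np.linalg.norm(desc, ord=1, axis=1, keepdims=True) + eps)
    return np.sqrt(desc)
```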

Self-supervised Learning. We can train NG-RANSAC self-supervised when we define a task loss ℓ that assesses the quality of an estimate independent of a ground truth model h*. A natural choice is the inlier count of the final estimate. We found the inlier count to be a very stable training signal, even in the beginning of training, such that we require no special initialization of the network. We report results of self-supervised NG-RANSAC in Fig. 2 c). It outperforms all competitors, but achieves slightly worse accuracy than NG-RANSAC trained with supervision. A supervised task loss allows NG-RANSAC to adapt more precisely to the evaluation measure used at test time. For the datasets used so far, the process of image pairing uses co-visibility information, and therefore a form of supervision. In the next section, we learn NG-RANSAC fully self-supervised by using the ordering of sequential data to assemble image pairs.

4.2. Fundamental Matrix Estimation

We apply NG-RANSAC to fundamental matrix estimation, comparing it to the learned robust estimator of Ranftl and Koltun [36], denoted Deep F-Mat. They propose an iterative procedure where a neural network estimates observation weights for a robust model fit. The residuals of the last iteration are an additional input to the network in the next iteration. The network architecture is similar to the one used in [56]. Correspondences are represented as 4D vectors, and the descriptor matching ratio serves as an additional input. Each observation is processed by a series of MLPs with instance normalization interleaved. Deep F-Mat was published very recently, and the code is not yet available. We therefore follow the evaluation procedure described in [36] and compare to their results.

Datasets. Ranftl and Koltun [36] evaluate their method on various datasets that involve custom reconstructions which are not publicly available. Therefore, we compare to their method on the Kitti dataset [14], which is publicly available. Ranftl and Koltun [36] train their method on sequences 00-05 of the Kitti odometry benchmark, and test on sequences 06-10. They form image pairs by taking subsequent images within a sequence. For each pair, they extract SIFT correspondences and apply Lowe's ratio criterion [27] with a threshold of 0.8.


                  Training Objective   % Inliers   F-score   Mean   Median
RANSAC            -                    21.85       13.84     0.35   0.32
USAC [34]         -                    21.43       13.90     0.35   0.32
Deep F-Mat [36]   Mean                 24.61       14.65     0.32   0.29
NG-RANSAC         Mean                 25.05       14.76     0.32   0.29
NG-RANSAC         F-score              24.13       14.72     0.33   0.31
NG-RANSAC         %Inliers             25.12       14.74     0.32   0.29

Figure 4. Fundamental Matrix Estimation. We measure the average percentage of inliers of the estimated model, the alignment of estimated inliers and ground truth inliers (F-score), and the mean and median distance of estimated inliers to ground truth epilines. For NG-RANSAC, we compare the performance after training with different objectives. Note that %Inliers is a self-supervised training objective.

Evaluation Metric. Ranftl and Koltun [36] evaluate using multiple metrics. They measure the percentage of inlier correspondences of the final model. They calculate the F-score over correspondences, where true positives are inliers of both the ground truth model and the estimated model. The F-score measures the alignment of the estimated and true fundamental matrix in image space. Both metrics use an inlier threshold of 0.1px. Finally, they calculate the mean and median epipolar error of inlier correspondences w.r.t. the ground truth model, using an inlier threshold of 1px.

Implementation. We cannot use the architecture of Deep F-Mat, which is designed for iterative application. Therefore, we re-use the architecture of Yi et al. [56] from the previous section for NG-RANSAC (also see Appendix B for details). We adhere to the training setup described in Sec. 4.1 with the following changes. We observed faster training convergence on Kitti, so we omit the initialization stage, and directly optimize the expected task loss (Eq. 3) for 300k iterations. Since Ranftl and Koltun [36] evaluate using multiple metrics, the choice of the task loss function ℓ is not clear. Hence, we train multiple variants with different objectives (%Inliers, F-score and Mean error) and report the corresponding results. As minimal solver f, we use the 7-point algorithm, a RANSAC threshold of 0.1px, and we draw K = 8 hypothesis pools per training image with M = 16 hypotheses each.

Results. We report results in Fig. 4, where we compare NG-RANSAC with RANSAC, USAC [34] and Deep F-Mat. Note that USAC also uses guided sampling based on matching ratios according to the PROSAC strategy [9]. NG-RANSAC outperforms the classical approaches RANSAC and USAC. NG-RANSAC also performs slightly better than Deep F-Mat. We observe that the choice of the training objective has a small but significant influence on the evaluation. All metrics are highly correlated, and optimizing a metric in training generally also achieves good (but not necessarily best) accuracy for this metric at test time. Interestingly, optimizing the inlier count during training performs competitively, despite being a self-supervised objective. Fig. 3 shows a qualitative result on Kitti.
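A sketch of the F-score metric as described above, assuming precomputed epipolar errors of all correspondences w.r.t. the estimated and ground truth models (function name is ours):

```python
import numpy as np

def fscore(err_est, err_gt, thresh=0.1):
    """Hypothetical sketch of the F-score metric: true positives are
    correspondences that are inliers of both the estimated and the
    ground truth fundamental matrix (threshold in px)."""
    est_in = err_est < thresh   # inliers of the estimated model
    gt_in = err_gt < thresh     # inliers of the ground truth model
    tp = np.sum(est_in & gt_in)
    precision = tp / max(np.sum(est_in), 1)
    recall = tp / max(np.sum(gt_in), 1)
    return 2 * precision * recall / max(precision + recall, 1e-8)
```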

                      AUC (%)
Simon et al. [42]     54.4
Kluger et al. [24]    57.3
Zhai et al. [57]      58.2
Workman et al. [52]   71.2
DSAC                  74.1
NG-DSAC               75.2
SLNet [25]            82.3

Figure 5. Horizon Line Estimation. Left: AUC on the HLW dataset. Right: Qualitative results. We draw the ground truth horizon in green and the estimate in blue. Dots mark the observations predicted by NG-DSAC, and the dot colors mark their confidence (dark = low). Note that the horizon can be outside the image.

4.3. Horizon Lines

We fit a parametric model, the horizon line, to a single image. The horizon can serve as a cue in image understanding [52] or for image editing [25]. Traditionally, this task is solved via vanishing point detection and geometric reasoning [37, 24, 57, 42], often assuming a Manhattan or Atlanta world. We take a simpler approach and use a general purpose CNN that predicts a set of 64 2D points based on the image, to which we fit a line with RANSAC, see Fig. 5. The network has two output branches (see Appendix C for details) predicting (i) the 2D points y(w) ∈ Y(w), and (ii) probabilities p(y;w) for guided sampling.

Dataset. We evaluate on the HLW dataset [52], which is a collection of SfM datasets with annotated horizon lines. Test and training images partly show the same scenes, and the horizon line can be outside the image area.

Evaluation Metric. As is common practice on HLW, we measure the maximum distance between the estimated horizon and the ground truth within the image, normalized by image height. We calculate the AUC of the cumulative error curve up to a threshold of 0.25.

Implementation. We train using the NG-DSAC objective (Eq. 9) from scratch for 300k iterations. As task loss ℓ, we use the normalized maximum distance between the estimated and true horizon. For hypothesis scoring s, we use a soft inlier count [6]. We train using Adam [23] with a learning rate of 10⁻⁴. For each training image, we draw K = 2 hypothesis pools with M = 16 hypotheses. We also draw 16 hypotheses at test time. We compare to DSAC, which we train similarly but with the probability branch disabled.

Results. We report results in Fig. 5. DSAC and NG-DSAC achieve competitive accuracy on this dataset, ranking among the top methods. NG-DSAC has a small but significant advantage over DSAC alone. Our method is only surpassed by SLNet [25], an architecture designed to find semantic lines in images. SLNet generates a large number of random candidate lines, selects a candidate via classification, and refines it with a predicted offset. We could couple SLNet with neural guidance for informed candidate sampling. Unfortunately, the code of SLNet is not online and the authors did not respond to inquiries.


[Figure 6 panels: a) Development of Neural Guidance Throughout Training (RGB, Iteration 0, 100k, 200k; rows: Great Court, Old Hospital, St M. Church; color bar: sampling weight 0 to 1). b) Learned Representation (Great Court): DSAC++ vs. NG-DSAC++.]

Figure 6. Neural Guidance for Camera Re-localization. a) We show how the predicted sampling probabilities of NG-DSAC++ develop during end-to-end training. b) We visualize the internal representation of the neural network. We predict scene coordinates for each training image, plotting them with their RGB color. For DSAC++ we choose training pixels randomly, for NG-DSAC++ we choose randomly among the top 1000 pixels per training image according to the predicted distribution.

                DSAC++ [6] (VGGNet)   DSAC++ (ResNet)   NG-DSAC++ (ResNet)
Great Court     40.3cm                40.3cm            35.0cm
Kings College   17.7cm                13.0cm            12.6cm
Old Hospital    19.6cm                22.4cm            21.9cm
Shop Facade     5.7cm                 5.7cm             5.6cm
St M. Church    12.5cm                9.9cm             9.8cm

Figure 7. Camera Re-Localization. We report the median position error for Cambridge Landmarks [22]. DSAC++ (ResNet) is our re-implementation of [6] with an improved network architecture.

4.4. Camera Re-Localization

We estimate the absolute 6D camera pose (position and orientation) w.r.t. a known scene from a single RGB image.

Dataset. We evaluate on the Cambridge Landmarks [22] dataset. It is comprised of RGB images depicting five landmark buildings¹ in Cambridge, UK. Ground truth poses were generated by running a SfM pipeline.

Evaluation Metric. We measure the median translational error of estimated poses for each scene².

Implementation. We build on the publicly available DSAC++ pipeline [6], which is a scene coordinate regression method [41]. A neural network predicts for each image pixel a 3D coordinate in scene space. We recover the pose from the 2D-3D correspondences using a perspective-n-point solver [13] within a RANSAC loop. The DSAC++ pipeline implements geometric pose optimization in a fully differentiable way, which facilitates end-to-end training. We re-implement the neural network integration of DSAC++ with PyTorch (the original uses LUA/Torch). We also update the network architecture of DSAC++ by using a ResNet [18] instead of a VGGNet [43].
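DSAC++ implements this step in its own differentiable form; purely for illustration, a non-differentiable sketch of pose fitting from 2D-3D correspondences could use OpenCV (parameter values here are illustrative, not those of the pipeline):

```python
import cv2
import numpy as np

def estimate_pose(coords_2d, coords_3d, K):
    """Hypothetical sketch: fit a 6D camera pose to predicted 2D-3D
    scene coordinate correspondences with a perspective-n-point
    solver inside a RANSAC loop."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        coords_3d.astype(np.float64),   # predicted scene coordinates
        coords_2d.astype(np.float64),   # corresponding pixel locations
        K, None,                        # camera intrinsics, no distortion
        reprojectionError=10.0,
        flags=cv2.SOLVEPNP_P3P)         # minimal P3P solver per hypothesis
    return rvec, tvec, inliers
```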

¹ We omitted one additional scene (Street). Like other learning-based methods before [6], we failed to achieve sensible results for this scene. By visual inspection, the corresponding SfM reconstruction seems to be of poor quality, which potentially harms the training process.

² The median rotational accuracies are between 0.2° and 0.3° for all scenes, and hardly vary between methods.

As with horizon line estimation, we add a second output branch to the network for estimating a probability distribution over scene coordinate predictions for guided RANSAC sampling. We denote this extended architecture NG-DSAC++. We adhere to the training procedure and hyperparameters of DSAC++ (see Appendix D) but optimize the NG-DSAC objective (Eq. 9) during end-to-end training. As task loss ℓ, we use the average of the rotational and translational error w.r.t. the ground truth pose. We sample K = 2 hypothesis pools with M = 16 hypotheses per training image, and increase the number of hypotheses to M = 256 for testing.

Results. We report our quantitative results in Fig. 7. Firstly, we observe a significant improvement for most scenes when using DSAC++ with a ResNet architecture. Secondly, comparing DSAC++ with NG-DSAC++, we notice a small to moderate, but consistent, improvement in accuracy. The advantage of using neural guidance is largest for the Great Court scene, which features large ambiguous grass areas, and large areas of sky visible in many images. NG-DSAC++ learns to ignore such areas, see the visualization in Fig. 6 a). The network learns to mask these areas solely guided by the task loss during training, as the network fails to predict accurate scene coordinates for them. In Fig. 6 b), we visualize the internal representation learned by DSAC++ and NG-DSAC++ for one scene. The representation of DSAC++ is very noisy, as it tries to optimize geometric constraints for sky and grass pixels. NG-DSAC++ learns a cleaner representation by focusing entirely on buildings.

5. Conclusion

We have presented NG-RANSAC, a robust estimator using guided hypothesis sampling according to learned probabilities. For training, we can incorporate non-differentiable task loss functions and non-differentiable minimal solvers. Using the inlier count as training objective allows us to also train NG-RANSAC self-supervised. We applied NG-RANSAC to multiple classic computer vision tasks and observe a consistent improvement w.r.t. RANSAC alone.


Acknowledgements: This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 647769). The computations were performed on an HPC Cluster at the Center for Information Services and High Performance Computing (ZIH) at TU Dresden.

A. Essential Matrix Estimation

List of Scenes Used for Training and Testing.

Training:

• St. Peter's (Outdoor)
• brown bm 3 - brown bm 3 (Indoor)

Testing (Outdoor):

• Buckingham
• Notre Dame
• Sacre Coeur
• Reichstag
• Fountain
• HerzJesu

Testing (Indoor):

• brown cogsci 2 - brown cogsci 2
• brown cogsci 6 - brown cogsci 6
• brown cogsci 8 - brown cogsci 8
• brown cs 3 - brown cs3
• brown cs 7 - brown cs7
• harvard c4 - hv c4 1
• harvard c10 - hv c10 2
• harvard corridor lounge - hv lounge1 2
• harvard robotics lab - hv s1 2
• hotel florence jx - florence hotel stair room all
• mit 32 g725 - g725 1
• mit 46 6conf - bcs floor6 conf 1
• mit 46 6lounge - bcs floor6 long
• mit w85g - g 0
• mit w85h - h2 1

Network Architecture. As mentioned in the main paper, we replicated the architecture of Yi et al. [56] for our experiments on epipolar geometry (estimating essential and fundamental matrices). For a schematic overview see Fig. 8. The network takes a set of feature correspondences as input, and predicts as output a weight for each correspondence, which we use to guide RANSAC hypothesis sampling. The network consists of a series of multilayer perceptrons (MLPs) that process each correspondence independently. We implement the MLPs with 1×1 convolutions. The network infuses global context via instance normalization layers [49], and accelerates training via batch normalization [20]. The main body of the network is comprised of 12 blocks with skip connections [18]. Each block consists of two linear layers, each followed by instance normalization, batch normalization and a ReLU activation [17]. We apply a Sigmoid activation to the last layer, and normalize by dividing by the sum of outputs.³
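A hypothetical PyTorch sketch of one such block, following this description (128 channels as in Fig. 8):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Sketch of one of the twelve blocks: two 1x1 convolutions
    (per-correspondence MLPs) with instance and batch normalization,
    wrapped in a skip connection."""
    def __init__(self, channels=128):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(channels, channels, 1),
            nn.InstanceNorm2d(channels),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1),
            nn.InstanceNorm2d(channels),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        # Each 1x1 conv processes every correspondence independently,
        # while instance norm infuses global context across the set;
        # the skip connection adds the input back to the output.
        return x + self.layers(x)
```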

Initialization Procedure. We initialize our network in the following way. We define a target sampling distribution g(y;E*) using the ground truth essential matrix E* given for each training pair. Intuitively, the target distribution should return a high probability when a correspondence y is aligned with the ground truth essential matrix E*, and a low probability otherwise. We assume that correspondence y is a 4D vector containing two 2D image coordinates x and x′ (3D in homogeneous coordinates). We define the epipolar error of a correspondence w.r.t. essential matrix E:

$$d(\mathbf{y}, E) = \frac{(\mathbf{x}'^\top E \mathbf{x})^2}{[E\mathbf{x}]_0^2 + [E\mathbf{x}]_1^2 + [E^\top \mathbf{x}']_0^2 + [E^\top \mathbf{x}']_1^2}, \tag{11}$$

where [·]_i returns the i-th entry of a vector. Using the epipolar error, we define the target sampling distribution:

$$g(\mathbf{y}; E^*) = \frac{1}{2\pi\sigma^2} \exp\left( \frac{-d(\mathbf{y}, E^*)}{2\sigma^2} \right). \tag{12}$$

Parameter σ controls the softness of the target distribution, and we use σ = 10⁻³, which corresponds to the inlier threshold we use for RANSAC. To initialize our network, we minimize the KL divergence between the network prediction p(y;w) and the target distribution g(y;E*). We initialize for 75k iterations using Adam [23] with a learning rate of 10⁻³ and a batch size of 32.

Implementation Details. For the following components we rely on the implementations provided by OpenCV [7]: the 5-point algorithm [31], epipolar error, SIFT features [27], feature matching, and essential matrix decomposition. We extract 2000 features per input image, which yields 2000 correspondences per image pair after matching. When applying Lowe's ratio criterion [27] for filtering, and hence reducing the number of correspondences, we randomly duplicate correspondences to restore the number of 2000. We minimize the expected task loss using Adam [23] with a learning rate of 10⁻⁵ and a batch size of 32. We choose hyperparameters based on the validation error of the Reichstag scene. We observe that the magnitude of the validation error corresponds well to the magnitude of the training error, i.e. a validation set would not be strictly required.

Qualitative Results. We present additional qualitative results for indoor and outdoor scenarios in Fig. 9. We compare results of RANSAC and NG-RANSAC, also visualizing neural guidance as predicted by our network. We obtain these results in the high-outlier setup, i.e. without using Lowe's ratio criterion and without using side-information as additional network input.
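For illustration, the epipolar error (Eq. 11) and the target distribution (Eq. 12) used by the initialization above can be sketched as follows (function names are ours):

```python
import numpy as np

def epipolar_error(x1, x2, E):
    """Hypothetical sketch of the epipolar error (Eq. 11).
    x1, x2: homogeneous image coordinates (3,) in the left/right image."""
    Ex1 = E @ x1
    Etx2 = E.T @ x2
    num = (x2 @ E @ x1) ** 2
    den = Ex1[0] ** 2 + Ex1[1] ** 2 + Etx2[0] ** 2 + Etx2[1] ** 2
    return num / den

def target_distribution(x1, x2, E_gt, sigma=1e-3):
    # Target sampling probability for initialization (Eq. 12).
    d = epipolar_error(x1, x2, E_gt)
    return np.exp(-d / (2 * sigma ** 2)) / (2 * np.pi * sigma ** 2)
```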

³ The original architecture of Yi et al. [56] uses a slightly different output processing due to using the output as weights for a robust model fit. They use a ReLU activation followed by a tanh activation.


[Figure 8 diagram: input feature correspondences (5 × 2000 × 1; 2 × 2D coordinates + 1D side-information, optional; dummy spatial dimension) → 1×1 convolutions (C128, S1) → 12 residual blocks of 1×1 convolutions (C128, S1) with instance and batch normalization → 1×1 convolution (C1, S1) → Sigmoid → normalization → sampling weights p(y;w) (1 × 2000 × 1).]

Figure 8. NG-RANSAC Network Architecture for F/E-matrix Estimation. The network takes a set of feature correspondences as input and predicts as output a weight for each correspondence. The network consists of linear layers interleaved with instance normalization [49], batch normalization [20] and ReLUs [17]. The arc with a plus marks a skip connection for each of the twelve blocks [18]. This architecture was proposed by Yi et al. [56].


B. Fundamental Matrix Estimation

Implementation Details. We reuse the architecture of Fig. 8. We normalize the image coordinates of feature matches before passing them to the network: we subtract the mean coordinate and divide by the coordinate standard deviation, where mean and standard deviation are calculated over the training set. Ranftl and Koltun [36] fit the final fundamental matrix to the top 20 weighted correspondences as predicted by their network. Similarly, we re-fit the final fundamental matrix to the largest inlier set found by NG-RANSAC. This refinement step results in a small but noticeable increase in accuracy. For the following components we rely on the implementations provided by OpenCV [7]: the 7-point algorithm, epipolar error, SIFT features [27] and feature matching.

Qualitative Results. We present additional qualitative results for the Kitti dataset [14] in Fig. 10. We compare results of RANSAC and NG-RANSAC, also visualizing neural guidance as predicted by our network.

C. Horizon Lines

Network Architecture. We provide a schematic of our network architecture for horizon line estimation in Fig. 11. The network takes a 256 × 256px RGB image as input. We re-scale images of arbitrary aspect ratio such that the long side is 256px, and symmetrically zero-pad the short side to 256px. The network has two output branches. The first branch predicts a set of 8 × 8 = 64 2D points, our observations y(w), to which we fit the horizon line. We apply a Sigmoid and re-scale output points to [-1.5, 1.5] in relative image coordinates to support horizon lines outside the image area. We implement the network in a fully convolutional way [26], i.e. each output point is predicted for a patch, or restricted receptive field, of the input image. Therefore, we shift the coordinate of each output point to the center of its associated patch.

The second output branch predicts sampling probabilities p(y;w) for each output point. We apply a Sigmoid to the output of the second branch, and normalize by dividing by the sum of outputs. During training, we block the gradients of the second output branch when back-propagating to the base network. The sampling gradients have larger variance and magnitude than the observation gradients of the first branch, especially in the beginning of training. This has a negative effect on the convergence of the network as a whole. Intuitively, we want to give priority to the observation predictions because they determine the accuracy of the final model parameters. The sampling prediction should address deficiencies in the observation predictions without influencing them too much. The gradient blockade ensures these properties.

Implementation Details. We use a differentiable soft inlier count [6] as scoring function, i.e.:

$$s(\mathbf{h}, \mathcal{Y}) = \alpha \sum_{\mathbf{y} \in \mathcal{Y}} 1 - \operatorname{sig}\left[ \beta\, d(\mathbf{y}, \mathbf{h}) - \beta \tau \right], \tag{13}$$

where d(y,h) denotes the point-line distance between observation y and line hypothesis h. Hyperparameter α determines the softness of the scoring distribution in DSAC, β determines the softness of the Sigmoid, and τ is the inlier threshold. We use α = 0.1, β = 100 and τ = 0.05. We choose hyperparameters based on a grid search for the minimal training loss.
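A minimal sketch of this scoring function, assuming a tensor of precomputed point-line distances (function name is ours):

```python
import torch

def soft_inlier_count(d, alpha=0.1, beta=100.0, tau=0.05):
    """Hypothetical sketch of the soft inlier count (Eq. 13): a
    differentiable relaxation of hard inlier counting.
    d: tensor of point-line distances d(y, h) for all observations."""
    return alpha * torch.sum(1.0 - torch.sigmoid(beta * d - beta * tau))
```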



Figure 9. Qualitative Results for Essential Matrix Estimation. We compare results of RANSAC and NG-RANSAC. Below each result, we give the angular error between estimated and true translation vectors, and between estimated and true rotation matrices. We draw correspondences in green if they adhere to the ground truth essential matrix with an inlier threshold of 10⁻³, and red otherwise.

As discussed in the main paper, we use the normalized maximum distance between a line hypothesis and the ground truth horizon in the image as task loss ℓ. This can lead to stability issues when we sample line hypotheses with a very steep slope. Therefore, we clamp the task loss to a maximum of 1, i.e. the normalized image height.

As mentioned before, some images in the HLW dataset [52] have their horizon outside the image. Some of these images contain virtually no visual cue where the horizon exactly lies.



Figure 10. Qualitative Results for Fundamental Matrix Estimation. We compare results of RANSAC and NG-RANSAC. Below each result, we give the percentage of inliers of the final model, the F-score which measures the alignment of estimated and true fundamental matrix, and the mean epipolar error of estimated inlier correspondences w.r.t. the ground truth fundamental matrix. We draw correspondences in green if they adhere to the ground truth fundamental matrix with an inlier threshold of 0.1px, and red otherwise.


[Figure 11 diagram: 256 × 256px zero-padded RGB input → stack of 3×3 convolutions (C32 to C256, strides 1 and 2) with a skip connection → two heads of 1×1 convolutions (C512): Output 1, 2D points y(w) (2 × 8 × 8); Output 2, sampling weights p(y;w) (1 × 8 × 8); each head ends in a Sigmoid and normalization, with a gradient blockade on the second branch.]

Figure 11. NG-DSAC Network Architecture for Horizon Line Estimation. The network takes an RGB image as input and predicts as output a set of 2D points and corresponding sampling weights. The network consists of convolution layers interleaved with batch normalization [20] and ReLUs [17]. The arc with a plus marks a skip connection [18]. We use the gradient blockade during training to prevent direct influence of the sampling prediction (second branch) on learning the observations (first branch).

Therefore, we find it beneficial to use a robust variant of the task loss ℓ′ that limits the influence of such outliers. We use:

$$\ell' = \begin{cases} \ell & \text{if } \ell < 0.25 \\ 0.25\sqrt{\ell} & \text{otherwise} \end{cases}, \tag{14}$$

i.e. we use the square root of the task loss after a magnitude of 0.25, which is the magnitude up to which the AUC is calculated when evaluating on HLW [52].

Qualitative Results. We present additional qualitative results for the HLW dataset [52] in Fig. 12.

D. Camera Re-Localization

Network Architecture. We provide a schematic of our network architecture for camera re-localization in Fig. 13. The network is an FCN [26] that takes an RGB image as input, and predicts dense outputs, sub-sampled by a factor of 8. The network has two output branches. The first branch predicts 3D scene coordinates [41], our observations y(w), to which we fit the 6D camera pose. The second output branch predicts sampling probabilities p(y;w) for the scene coordinates. We apply a Sigmoid to the output of the second branch, and normalize by dividing by the sum of outputs. During training, we block the gradients of the second output branch when back-propagating to the base network. The sampling gradients have larger variance and magnitude than the observation gradients of the first branch, especially in the beginning of training. This has a negative effect on the convergence of the network as a whole. Intuitively, we want to give priority to the scene coordinate prediction because it determines the accuracy of the pose estimate. The sampling prediction should address deficiencies in the scene coordinate predictions without influencing them too much. The gradient blockade ensures these properties.

Implementation details. We follow the three-stage training procedure proposed by Brachmann and Rother for DSAC++ [6].

Firstly, we optimize the distance between predicted and ground truth scene coordinates. We obtain ground truth scene coordinates by rendering the sparse reconstructions given in the Cambridge Landmarks dataset [22]. We ignore pixels with no corresponding 3D point in the reconstruction. Since the reconstructions contain outlier 3D points, we use the following robust distance:

$$d(\mathbf{y}, \mathbf{y}^*) = \begin{cases} \|\mathbf{y} - \mathbf{y}^*\|_2 & \|\mathbf{y} - \mathbf{y}^*\|_2 < 10 \\ 10\sqrt{\|\mathbf{y} - \mathbf{y}^*\|_2} & \text{otherwise,} \end{cases} \qquad (15)$$

i.e. we use the Euclidean distance up to a threshold of 10m, after which we use the square root of the Euclidean distance. We train the first stage for 500k iterations using Adam [23] with a learning rate of $10^{-4}$ and a batch size of 1 image.
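Under the same caveat as before, Eq. 15 together with the ignored pixels can be sketched as follows (the tensor shapes and the valid_mask argument are our assumptions):

```python
import torch

def robust_coord_loss(pred, gt, valid_mask, threshold=10.0):
    # pred, gt: (3, H, W) scene coordinates in meters;
    # valid_mask: (H, W) bool, False where the reconstruction has
    # no corresponding 3D point (those pixels are ignored).
    dist = torch.norm(pred - gt, dim=0)
    # Eq. 15: Euclidean distance up to 10m, square-root-dampened beyond.
    dist = torch.where(dist < threshold, dist, threshold * torch.sqrt(dist))
    return dist[valid_mask].mean()
```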

Secondly, we optimize the reprojection error of the scene coordinate predictions w.r.t. the ground truth camera pose. Similar to the first stage, we use a robust distance function with a threshold of 10px, after which we use the square root of the reprojection error. We train the second stage for 300k iterations using Adam [23] with a learning rate of $10^{-4}$ and a batch size of 1 image.
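The second-stage objective can be sketched analogously; the projection below assumes scene coordinates, pixel positions, and the ground truth pose and intrinsics are available as explicit tensors (all names and shapes are ours):

```python
import torch

def robust_reprojection_loss(coords, pixels, R, t, K, threshold=10.0):
    # coords: (N, 3) predicted scene coordinates; pixels: (N, 2) pixel
    # positions they were predicted for; R (3, 3), t (3,), K (3, 3):
    # ground truth rotation, translation, and camera intrinsics.
    cam = coords @ R.T + t             # scene points in the camera frame
    proj = cam @ K.T                   # homogeneous image coordinates
    proj = proj[:, :2] / proj[:, 2:3]  # perspective division
    err = torch.norm(proj - pixels, dim=1)
    # Same robust dampening as in stage one, with a 10px threshold.
    err = torch.where(err < threshold, err, threshold * torch.sqrt(err))
    return err.mean()
```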


Figure 12. Qualitative Results for Horizon Line Estimation. Next to each input image, we show the estimated horizon line in blue and the true horizon line in green. We also show the observation points predicted by our network, colored by their sampling weight (dark = low).

[Figure 13 diagram. Input: RGB image (480 × 720px). Layers, given as filter size – channels – stride: 3x3–C32–S1, 3x3–C64–S2, 3x3–C128–S2, 3x3–C256–S2, 3x3–C256–S1, 1x1–C256–S1, 3x3–C256–S1, 3x3–C512–S1, 1x1–C512–S1, 3x3–C512–S1, 3x3–C512–S1, 1x1–C512–S1, 3x3–C512–S1, 1x1–C512–S1. Weight branch (behind the gradient blockade): 1x1–C512–S1, 1x1–C1–S1, Sigmoid, normalization → Output 2: sampling weights p(y;w) (1 × 60 × 107). Coordinate branch: 1x1–C512–S1, 1x1–C512–S1, 1x1–C3–S1 → Output 1: scene coordinates y(w) (3 × 60 × 107).]

Figure 13. NG-DSAC++ Network Architecture for Camera Re-Localization. The network takes an RGB image as input and predicts as output dense scene coordinates and corresponding sampling weights. The network consists of convolution layers followed by ReLUs [17]. An arc with a plus marks a skip connection [18]. We use the gradient blockade during training to prevent direct influence of the sampling prediction (second branch) on learning the scene coordinates (first branch).

Thirdly, we optimize the expected task loss according to the NG-DSAC objective as explained in the main paper. As task loss we use $\ell = \angle(\boldsymbol{\theta}, \boldsymbol{\theta}^*) + \|\mathbf{t} - \mathbf{t}^*\|_2$. We measure the angle between the estimated camera rotation $\boldsymbol{\theta}$ and the ground truth rotation $\boldsymbol{\theta}^*$ in degrees, and the distance between the estimated camera position $\mathbf{t}$ and the ground truth position $\mathbf{t}^*$ in meters. As with horizon line estimation (see previous section), we use a soft inlier count as hypothesis scoring function with hyperparameters $\alpha = 10$, $\beta = 0.5$ and $\tau = 10$. We train the third stage for 200k iterations using Adam [23] with a learning rate of $10^{-6}$ and a batch size of 1 image.
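Assuming rotations are given as 3×3 matrices (the parametrization behind the $\boldsymbol{\theta}$ notation is not spelled out here), a minimal PyTorch sketch of this task loss could look as follows; all names are ours:

```python
import torch

def pose_task_loss(R_est, t_est, R_gt, t_gt):
    # Rotation part: angle of the relative rotation, in degrees.
    cos_angle = ((R_est.T @ R_gt).trace() - 1.0) / 2.0
    angle = torch.rad2deg(torch.acos(cos_angle.clamp(-1.0, 1.0)))
    # Translation part: distance between camera positions, in meters.
    return angle + torch.norm(t_est - t_gt)
```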

Learned 3D Representations. We visualize the internal 3D scene representations learned by DSAC++ and NG-DSAC++ in Fig. 14 for two more scenes.

[Figure 14 panels: DSAC++ and NG-DSAC++ reconstructions for the scenes St. Marys Church and Kings College.]

Figure 14. Learned 3D Representations. We visualize the internal representation of the neural network. We predict scene coordinates for each training image, plotting them with their RGB color. For DSAC++ we choose training pixels randomly; for NG-DSAC++ we choose randomly among the top 1000 pixels per training image according to the predicted distribution.

References

[1] R. Arandjelovic. Three things everyone should know to improve object retrieval. In CVPR, 2012.
[2] J. Bian, W.-Y. Lin, Y. Matsushita, S.-K. Yeung, T. D. Nguyen, and M.-M. Cheng. GMS: Grid-based motion statistics for fast, ultra-robust feature correspondence. In CVPR, 2017.
[3] E. Brachmann, A. Krull, F. Michel, S. Gumhold, J. Shotton, and C. Rother. Learning 6D object pose estimation using 3D object coordinates. In ECCV, 2014.
[4] E. Brachmann, A. Krull, S. Nowozin, J. Shotton, F. Michel, S. Gumhold, and C. Rother. DSAC - Differentiable RANSAC for camera localization. In CVPR, 2017.
[5] E. Brachmann, F. Michel, A. Krull, M. Y. Yang, S. Gumhold, and C. Rother. Uncertainty-driven 6D pose estimation of objects and scenes from a single RGB image. In CVPR, 2016.

[6] E. Brachmann and C. Rother. Learning less is more - 6D camera localization via 3D surface regression. In CVPR, 2018.
[7] G. Bradski. OpenCV. Dr. Dobb's Journal of Software Tools, 2000.
[8] T. Cavallari, S. Golodetz, N. A. Lord, J. Valentin, L. Di Stefano, and P. H. Torr. On-the-fly adaptation of regression forests for online camera relocalisation. In CVPR, 2017.
[9] O. Chum and J. Matas. Matching with PROSAC - Progressive sample consensus. In CVPR, 2005.
[10] O. Chum and J. Matas. Optimal randomized RANSAC. TPAMI, 2008.
[11] O. Chum, J. Matas, and J. Kittler. Locally optimized RANSAC. In DAGM, 2003.
[12] M. A. Fischler and R. C. Bolles. Random Sample Consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 1981.
[13] X.-S. Gao, X.-R. Hou, J. Tang, and H.-F. Cheng. Complete solution classification for the perspective-three-point problem. TPAMI, 2003.
[14] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR, 2012.
[15] R. I. Hartley. In defense of the eight-point algorithm. TPAMI, 1997.
[16] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2004.
[17] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.
[18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[19] J. Heinly, J. L. Schonberger, E. Dunn, and J.-M. Frahm. Reconstructing the World* in Six Days *(As Captured by the Yahoo 100 Million Image Dataset). In CVPR, 2015.
[20] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[21] O. H. Jafari, S. K. Mustikovela, K. Pertsch, E. Brachmann, and C. Rother. iPose: Instance-aware 6D pose estimation of partly occluded objects. 2018.
[22] A. Kendall, M. Grimes, and R. Cipolla. PoseNet: A convolutional network for real-time 6-DoF camera relocalization. In ICCV, 2015.
[23] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[24] F. Kluger, H. Ackermann, M. Y. Yang, and B. Rosenhahn. Deep learning for vanishing point detection using an inverse gnomonic projection. In GCPR, 2017.
[25] J.-T. Lee, H.-U. Kim, C. Lee, and C.-S. Kim. Semantic line detection and its applications. In ICCV, 2017.
[26] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[27] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.
[28] D. Massiceti, A. Krull, E. Brachmann, C. Rother, and P. H. S. Torr. Random forests versus neural networks - What's best for camera localization? In ICRA, 2017.
[29] R. Mur-Artal and J. D. Tardos. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. T-RO, 2017.
[30] R. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon. KinectFusion: Real-time dense surface mapping and tracking. In ISMAR, 2011.
[31] D. Nister. An efficient solution to the five-point relative pose problem. TPAMI, 2004.
[32] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.
[33] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In CVPR, 2017.


[34] R. Raguram, O. Chum, M. Pollefeys, J. Matas, and J.-M. Frahm. USAC: A universal framework for random sample consensus. TPAMI, 2013.
[35] R. Raguram, J.-M. Frahm, and M. Pollefeys. A comparative analysis of RANSAC techniques leading to adaptive real-time random sample consensus. In ECCV, 2008.
[36] R. Ranftl and V. Koltun. Deep fundamental matrix estimation. In ECCV, 2018.
[37] C. Rother. A new approach for vanishing point detection in architectural environments. In BMVC, 2002.
[38] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski. ORB: An efficient alternative to SIFT or SURF. In ICCV, 2011.
[39] T. Sattler, B. Leibe, and L. Kobbelt. Efficient & effective prioritized matching for large-scale image-based localization. TPAMI, 2016.
[40] J. L. Schonberger and J.-M. Frahm. Structure-from-Motion Revisited. In CVPR, 2016.
[41] J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, and A. Fitzgibbon. Scene coordinate regression forests for camera relocalization in RGB-D images. In CVPR, 2013.
[42] G. Simon, A. Fond, and M.-O. Berger. A-contrario horizon-first vanishing point detection using second-order grouping laws. In ECCV, 2018.
[43] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, 2014.
[44] C. Strecha, W. von Hansen, L. J. V. Gool, P. Fua, and U. Thoennessen. On benchmarking camera calibration and multi-view stereo for high resolution imagery. In CVPR, 2008.
[45] R. S. Sutton and A. G. Barto. Introduction to Reinforcement Learning. MIT Press, 1998.
[46] H. Taira, M. Okutomi, T. Sattler, M. Cimpoi, M. Pollefeys, J. Sivic, T. Pajdla, and A. Torii. InLoc: Indoor visual localization with dense matching and view synthesis. In CVPR, 2018.
[47] B. Tordoff and D. W. Murray. Guided sampling and consensus for motion estimation. In ECCV, 2002.
[48] P. H. S. Torr and A. Zisserman. MLESAC: A new robust estimator with application to estimating image geometry. CVIU, 2000.
[49] D. Ulyanov, A. Vedaldi, and V. S. Lempitsky. Instance normalization: The missing ingredient for fast stylization. CoRR, 2016.
[50] B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox. DeMoN: Depth and motion network for learning monocular stereo. In CVPR, 2017.
[51] J. Valentin, M. Nießner, J. Shotton, A. Fitzgibbon, S. Izadi, and P. H. S. Torr. Exploiting uncertainty in regression forests for accurate camera relocalization. In CVPR, 2015.
[52] S. Workman, M. Zhai, and N. Jacobs. Horizon lines in the wild. In BMVC, 2016.
[53] C. Wu. Towards linear-time incremental structure from motion. In 3DV, 2013.
[54] J. Xiao, A. Owens, and A. Torralba. SUN3D: A database of big spaces reconstructed using SfM and object labels. In ICCV, 2013.
[55] K. M. Yi, E. Trulls, V. Lepetit, and P. Fua. LIFT: Learned invariant feature transform. In ECCV, 2016.
[56] K. M. Yi, E. Trulls, Y. Ono, V. Lepetit, M. Salzmann, and P. Fua. Learning to find good correspondences. In CVPR, 2018.
[57] M. Zhai, S. Workman, and N. Jacobs. Detecting vanishing points using global image context in a non-Manhattan world. In CVPR, 2016.
