
Large-Scale Visual Odometry for Rough Terrain

Kurt Konolige, Motilal Agrawal, and Joan Solà

SRI International, 333 Ravenswood Ave, Menlo Park, CA 94025
konolige,agrawal,[email protected]

Summary. Motion estimation from stereo imagery, sometimes called visual odometry, is a well-known process. However, it is difficult to achieve good performance using standard techniques. We present the results of several years of work on an integrated system to localize a mobile robot in rough outdoor terrain using visual odometry, with an increasing degree of precision. We discuss issues that are important for real-time, high-precision performance: choice of features, matching strategies, incremental bundle adjustment, and filtering with inertial measurement sensors. Using data with ground truth from an RTK GPS system, we show experimentally that our algorithms can track motion, in off-road terrain, over distances of 10 km, with an error of less than 10 m (0.1%).

1 Introduction

Estimating motion from an image sensor is an appealing idea: a low-cost, compact and self-contained device based on vision could be an important component of many navigation systems. One can imagine household robots that keep track of where they are, or automatic vehicles that navigate on- and off-road terrain, using vision-based motion estimation to fuse their sensor readings in a coherent way. These tasks are currently entrusted to various types of sensors, primarily GPS and inertial measurement units (IMUs), that can be costly in high-precision applications, or prone to error: GPS does not work indoors or under tree canopy, and low-cost IMUs quickly degrade unless corrected. Visual motion estimation can be a method for estimating motion in its own right, and also a complement to these more traditional methods.

We are interested in very precise motion estimation over courses that are many kilometers in length, in natural terrain. Vehicle dynamics and outdoor scenery can make the problem of matching images very challenging. Figure 1 shows two successive images from an outdoor scene, with Harris corners extracted. Note the lack of distinctive corner features that are normally present in indoor and urban scenes; most corners in the first image are not found in the second.

0 Support provided by DARPA’s Learning Applied to Ground Robots project.

In this paper, we present a state-of-the-art system for realtime VO on rough terrain using stereo images (it also works well in indoor and urban scenes). The system derives from recent research by the authors and others on high-precision VO. The most important techniques are:

• Stable feature detection. Precise VO depends on features that can be tracked over longer sequences. Standard features (Harris [11], FAST [19], SIFT [14]) can exhibit poor performance over some offroad sequences. We present a new multiscale feature (named CenSurE [3]) that has improved stability over both outdoor and indoor sequences, and is inexpensive to compute.

• Incremental sparse bundle adjustment. Sparse bundle adjustment (SBA) [9, 24] is a nonlinear batch optimization over camera poses and tracked features. Recent experiments have shown that an incremental form of SBA can reduce the error in VO by a large factor [9, 16, 23].

• IMU integration. Fusing information from an IMU can lower the growth of angular errors in the VO estimate [22]. Even low-cost, low-precision IMUs can significantly increase performance, especially in the tilt and roll axes, where gravity gives a global normal vector.

Our main contribution is a realtime system that, by using these techniques, attains precise localization in rough outdoor terrain. A typical result is less than 0.1% maximum error over a 9 km trajectory, with match failure on just 0.2% of the visual frames, on a vehicle running up to 5 m/s.

2 Visual Odometry Overview

Visual odometry estimates a continuous camera trajectory by examining the changes motion induces on the images. There are two main classes of techniques.

Fig. 1. Harris corner features in two consecutive outdoor frames. Only three matched points survive a motion consistency test.


Dense motion algorithms, also known as optical flow, track the image motion of brightness patterns over the full image [5]. The computed flow fields are typically useful for obstacle avoidance or other low-level behaviors, and it is difficult to relate flow fields to global geometry. One notable exception is the recent work on stereo flow [6], which uses standard structure from motion (SfM) techniques to relate all the pixels of two successive images. This research is similar to that described in this paper, but uses dense pixel comparisons rather than features to estimate motion.

Feature tracking methods track a small number of features from image to image. The use of features reduces the complexity of dealing with image data and makes realtime performance more realistic. Broadly speaking, there are two approaches to estimating motion. Causal systems associate 3D landmarks with features, and track the camera frame with respect to these features, typically in a Kalman filter framework [7, 8, 21]. The filter is composed of the set of landmarks and the last pose of the camera. These methods have not yet proven capable of precise trajectory tracking over long distances, because of the computational cost of using large numbers of features in the filter, and because the linearization demanded by the filter can lead to suboptimal estimates for motion.

SfM systems are a type of feature-tracking method that use structure-from-motion methods to estimate the relative position of two or more camera frames, based on matching features [1, 2, 4, 16, 18]. It is well known that the relative pose (up to scale) of two internally calibrated camera frames can be estimated from 5 matching features [17], and that two calibrated stereo frames require just 3 points [12]. More matched features lead to greater precision in the estimate; typically hundreds of points can be used to give a high accuracy. Figure 2 shows the basic idea of estimating 3D point positions and camera frames simultaneously. The precision of the estimate can be increased by keeping some small number of recent frames in a bundle adjustment [9, 16, 23].

Ours is an SfM system, and is most similar to the recent work of Mouragnon et al. [16] and Sunderhauf et al. [23]. The main difference is the precision we obtain by the introduction of a new, more stable feature, and the integration of an IMU to maintain global pose consistency. We also define error statistics for VO in a correct way, and present results for a vehicle navigating over several kilometers of rough terrain.

Fig. 2. Stereo frames and 3D points. SfM VO estimates the pose of the frames and the positions of the 3D points at the same time, using image projections.

2.1 A Precise Visual Odometry System

Consider the problem of determining the trajectory of a vehicle in unknown outdoor terrain. The vehicle has a stereo camera whose intrinsic parameters and relative pose are known, as well as an IMU with 3-axis accelerometers and gyroscopes. Our goal is to precisely determine the global orientation and position of the vehicle frame at every stereo frame. The system operates incrementally; for each new frame, it does the following.

1. Extract features from the left image.
2. Perform dense stereo to get corresponding positions in the right image.
3. Match to features in previous left image using ZNCC.
4. Form consensus estimate of motion using RANSAC on three points.
5. Bundle adjust most recent N frames.
6. Fuse result with IMU data.
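The listing below is a minimal sketch of this per-frame loop, with hypothetical helper names (extract_features, ransac_motion, and so on) standing in for the components detailed in Sections 2.2-2.5; it only shows the data flow and the sliding SBA window, not a working implementation of each step.

```python
from collections import deque

class StereoVO:
    """Skeleton of the per-frame pipeline; every helper below is a placeholder
    for the corresponding component described in Sections 2.2-2.5."""

    def __init__(self, window_size=9, fixed_frames=6):
        self.window = deque(maxlen=window_size)   # recent frames kept for SBA
        self.fixed = fixed_frames                 # oldest frames held constant

    def process_frame(self, left, right, imu_reading):
        feats = self.extract_features(left)                 # 1. features (CenSurE)
        feats = self.add_stereo_depth(feats, right)         # 2. dense stereo -> 3D
        matches = self.match_to_previous(feats)             # 3. ZNCC matching
        motion = self.ransac_motion(matches)                # 4. 3-point RANSAC
        self.window.append((feats, motion))
        pose = self.bundle_adjust(self.window, self.fixed)  # 5. incremental SBA
        return self.fuse_with_imu(pose, imu_reading)        # 6. EKF fusion

    # Placeholders -- the real system supplies these.
    def extract_features(self, img): raise NotImplementedError
    def add_stereo_depth(self, feats, right): raise NotImplementedError
    def match_to_previous(self, feats): raise NotImplementedError
    def ransac_motion(self, matches): raise NotImplementedError
    def bundle_adjust(self, window, fixed): raise NotImplementedError
    def fuse_with_imu(self, pose, imu): raise NotImplementedError
```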

2.2 Features and Tracking

Distinctive features are extracted from each new frame, then matched to the previous frame by finding the best zero-mean normalized cross-correlation (ZNCC) score against all features within a search area. ZNCC does not account for viewpoint changes, and more sophisticated methods (affine transform [20], SIFT descriptors [14]) could be used, at the expense of increased computation.
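As a concrete illustration, a ZNCC score between two image patches can be computed as below (a straightforward sketch, not the paper's implementation); for each feature, the matcher would keep the candidate in the search area with the best score.

```python
import numpy as np

def zncc(patch_a, patch_b):
    """Zero-mean normalized cross-correlation of two equal-sized patches.
    Returns a score in [-1, 1]; larger means a better match."""
    a = patch_a.astype(np.float64).ravel()
    b = patch_b.astype(np.float64).ravel()
    a -= a.mean()
    b -= b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:               # textureless patch: nothing to correlate
        return 0.0
    return float(a @ b / denom)

# Example: identical patches score 1.0, inverted patches score -1.0.
p = np.random.default_rng(0).random((11, 11))
print(zncc(p, p), zncc(p, 1.0 - p))
```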

Restricting the search area can increase the probability of a good match. A model of the camera motion (e.g., vehicle dynamics and kinematics for a mounted camera) can predict where to search for features. Let C be the predicted 3D motion of frame j+1 relative to frame j, with covariance Q. If q_i is a 3D point in the coordinates of camera frame j, then its projection q_{i,j+1} and its covariance on camera frame j+1 are given by:

q_{i,j+1} = K T_C q_i,   (1)
Q_{i,j+1} = J Q J^T.     (2)

Here K is the camera calibration matrix, T_C is the homogeneous transform derived from C, and J is the Jacobian of q_{i,j+1} with respect to C. In our experiments, we use the predicted angular motion from an IMU, when available, to center the search area using (1), and keep a constant search radius.
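A small numerical sketch of equations (1)-(2) follows; the calibration matrix, the roll/pitch/yaw-plus-translation motion parameterization, and the covariance values are illustrative assumptions, and the Jacobian J is evaluated by finite differences rather than analytically.

```python
import numpy as np

def rotation_from_euler(rx, ry, rz):
    """Rotation matrix from roll, pitch, yaw (radians), composed as Rz @ Ry @ Rx."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def project(C, q, K):
    """Pixel projection of the 3D point q (frame j coords) into frame j+1,
    i.e. K T_C q of equation (1), with C = [roll, pitch, yaw, tx, ty, tz]."""
    R, t = rotation_from_euler(*C[:3]), C[3:]
    p = K @ (R @ q + t)
    return p[:2] / p[2]

def predicted_search_region(C, Q, q, K, eps=1e-6):
    """Search center (1) and its pixel covariance J Q J^T (2), with J the
    finite-difference Jacobian of the projection with respect to the motion C."""
    u0 = project(C, q, K)
    J = np.zeros((2, 6))
    for k in range(6):
        dC = C.copy()
        dC[k] += eps
        J[:, k] = (project(dC, q, K) - u0) / eps
    return u0, J @ Q @ J.T

# Illustrative values: a narrow-FOV calibration, a small mostly-forward motion,
# and a point about 10 m in front of the camera.
K = np.array([[800.0, 0.0, 256.0], [0.0, 800.0, 192.0], [0.0, 0.0, 1.0]])
C = np.array([0.0, 0.01, 0.0, 0.0, 0.0, 0.5])           # predicted motion
Q = np.diag([1e-4, 1e-4, 1e-4, 1e-2, 1e-2, 1e-2])       # assumed motion covariance
center, cov = predicted_search_region(C, Q, np.array([1.0, 0.5, 10.0]), K)
```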

From these uncertain matches, we recover a consensus pose estimate using a RANSAC method [10]. Several thousand relative pose hypotheses are generated by randomly selecting three matched non-colinear features, and then scored using pixel reprojection errors (1). If the motion estimate is small and the percentage of inliers is large enough, we discard the frame, since composing such small motions increases error. Figure 3 shows a set of points that are tracked across several key frames.
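The sketch below illustrates the consensus step under simplifying assumptions: stereo gives 3D positions of the matched features in both frames, each three-point hypothesis is solved with a closed-form rigid alignment (Kabsch/SVD rather than the minimal solver of [12]), and hypotheses are scored by pixel reprojection error as in (1). A final refinement on all inliers and the degenerate (collinear) sample check are omitted for brevity.

```python
import numpy as np

def rigid_from_points(P, Q):
    """Rigid motion (R, t) aligning 3xN points P (frame j) onto Q (frame j+1)
    by the SVD/Kabsch method; with N = 3 non-colinear points this serves as
    the minimal three-point hypothesis."""
    cp, cq = P.mean(axis=1, keepdims=True), Q.mean(axis=1, keepdims=True)
    U, _, Vt = np.linalg.svd((P - cp) @ (Q - cq).T)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    return R, (cq - R @ cp).ravel()

def reprojection_error(R, t, pts_j, pix_j1, K):
    """Pixel error of frame-j points reprojected into frame j+1, cf. eq. (1)."""
    p = K @ (R @ pts_j + t[:, None])
    return np.linalg.norm(p[:2] / p[2] - pix_j1, axis=0)

def ransac_motion(pts_j, pts_j1, pix_j1, K, iters=2000, thresh=2.0, seed=0):
    """Sample 3 matches, solve the rigid motion, score by reprojection error,
    and keep the hypothesis with the largest consensus set."""
    rng = np.random.default_rng(seed)
    n = pts_j.shape[1]
    best_R, best_t, best_inliers = None, None, np.zeros(n, dtype=bool)
    for _ in range(iters):
        idx = rng.choice(n, size=3, replace=False)
        R, t = rigid_from_points(pts_j[:, idx], pts_j1[:, idx])
        inliers = reprojection_error(R, t, pts_j, pix_j1, K) < thresh
        if inliers.sum() > best_inliers.sum():
            best_R, best_t, best_inliers = R, t, inliers
    return best_R, best_t, best_inliers
```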

2.3 Center Surround Extrema (CenSurE) Features

The biggest difficulty in VO is the data association problem: correctly identifying which features in successive frames are the projection of a common scene point. It is important that the features be stable under changes in lighting and viewpoint, distinctive, and fast to compute. Typically corner features such as Harris [11] or the more recent FAST [19] features are used. Multiscale features such as SIFT [14] attempt to find the best scale for features, giving even more viewpoint independence. In natural outdoor scenes, corner features can be difficult to find. Figure 1 shows Harris features in a grassy area of the Ft. Carson dataset (see Section 3 for a description). Note that relatively few points are stable across the images, and the maximum consistent consensus match is only 3 points.

The problem seems to be that corner features are small and vanish across scale or variations of texture in outdoor scenes. Instead, we use a center-surround feature, either a dark area surrounded by a light one, or vice versa. This feature is given by the normalized Laplacian of Gaussian (LOG) function:

σ² ∇²G(σ),   (3)

where G(σ) is the Gaussian of the image with a scale of σ. Scale-space extrema of (3) are more stable than Harris or other gradient features [15].

We calculate the LOG approximately using simple center-surround Haar wavelets [13] at different scales. Figure 3 (right) shows a generic center-surround wavelet of block size n that approximates the LOG; the value H(x, y) is 1 at the light squares, and -8 (to account for the different number of light and dark pixels) at the dark ones. Convolution is done by multiplication and summing, and then normalized by the area of the wavelet:

Fig. 3. Left: CenSurE features tracked over several frames. Right: CenSurE kernel of block size n.


(3n)^{-2} Σ_{x,y} H(x, y) I(x, y),   (4)

which approximates the normalized LOG. These features are very simple to compute using integral image techniques [25], requiring just 7 operations per convolution, regardless of the wavelet size. We use a set of 6 scales, with block sizes n = [1, 3, 5, 7, 9, 11]. The scales cover 3-1/2 octaves, although the scale differences are not uniform. Once the center-surround responses are computed at each position and scale, we find the extrema by comparing each point in the 3D image-scale space with its 26 neighbors in scale and position. With CenSurE features, a consensus match can be found for the outdoor images (Figure 4).
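The following sketch computes a single-scale center-surround response with an integral image, in the spirit of (4); the exact kernel layout and border handling of the actual CenSurE implementation [3] are simplified here, assuming the inner n×n box is weighted against the surrounding 3n×3n box and normalized by the filter area.

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero first row/column, so any box sum costs
    four lookups regardless of box size."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = np.cumsum(np.cumsum(img.astype(np.float64), axis=0), axis=1)
    return ii

def box_sum(ii, y0, x0, y1, x1):
    """Sum of img[y0:y1, x0:x1] using the integral image."""
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]

def center_surround_response(img, n):
    """Center-surround response at block size n: the outer 3n x 3n box minus
    9 times the inner n x n box (a zero-mean kernel), normalized by (3n)^2."""
    ii = integral_image(img)
    h, w = img.shape
    r, R = n // 2, (3 * n) // 2
    out = np.zeros((h, w))
    for y in range(R, h - R):
        for x in range(R, w - R):
            outer = box_sum(ii, y - R, x - R, y + R + 1, x + R + 1)
            inner = box_sum(ii, y - r, x - r, y + r + 1, x + r + 1)
            out[y, x] = (outer - 9.0 * inner) / (3 * n) ** 2
    return out

# Responses at several of the paper's block sizes; scale-space extrema would
# then be found by comparing each (x, y, scale) sample with its 26 neighbors.
img = np.random.default_rng(1).random((64, 64))
responses = {n: center_surround_response(img, n) for n in (3, 5, 7)}
```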

While the basic idea of CenSurE features is similar to that of SIFT, the implementation is extremely efficient, comparable to Harris or FAST detection [3]¹. We compared the matching ability of the different features over the 47K images of the Little Bit dataset (see Section 3). We tested this in two ways: the number of failed matches between successive frames, and the average length of a feature track (Table 1, second row). For VO, it is especially important to have low failure rates in matching successive images, and CenSurE failed on just 78 images out of the 47K image set (0.17%). The majority of these failures occurred when the cameras were facing the sky and almost all of the image was uniform. We also compared the performance of these features on a short out-and-back trajectory of 150 m (each direction) with good scene texture and slow motion, so there were no frame matching failures. Table 2 compares the loop closure error in meters (first row) and as a percentage (second row) for different features. Again CenSurE gives the best performance, with the lowest loop closure error.

        Harris   FAST   SIFT   CenSurE
Fail    0.53%    2.3%   2.6%   0.17%
Length  3.0      3.1    3.4    3.8

Table 1. Matching statistics for the Little Bit dataset

      Harris   FAST    SIFT    CenSurE
Err   4.65     12.75   14.77   2.92
%     1.55%    4.25%   4.92%   0.97%

Table 2. Loop closure error for different features

Fig. 4. Matched CenSurE points across two consecutive frames. 94 features are matched, with 44 consensus inliers.

¹ For 512x384 images: FAST 8 ms, Harris 9 ms, CenSurE 15 ms, SIFT 138 ms.


2.4 Incremental Pose Estimation

Estimating the most recent N frame poses and the tracked points can be posed as a nonlinear minimization problem. Measurement equations relate the points q_i and frame poses C_j to the projections q_{ij}, according to (1). They also describe IMU measurements of gravity normal and yaw angle changes:

g_j = h_g(C_j)   (5)
Δψ_{j-1,j} = h_{Δψ}(C_{j-1}, C_j)   (6)

The function h_g(C) returns the deviation of the frame C in pitch and roll from gravity normal; h_{Δψ}(C_{j-1}, C_j) is just the yaw angle difference between the two frames. Given a set of measurements (q_{ij}, g_j, Δψ_{j-1,j}), the ML estimate of the variables C_j, q_i is given by minimizing the squared error sum

min_{C_{j>n}, q_i}  Σ_{i,j} ε_v^T Q_v^{-1} ε_v  +  Σ_{j-1,j} ε_{Δψ}^T Q_{Δψ}^{-1} ε_{Δψ}  +  Σ_j ε_g^T Q_g^{-1} ε_g   (7)

Here the error terms are the differences between the measured and computed values for a given C_j, q_i. Note that we have only specified minimization over C_j for j > n, which holds the oldest n < N frames constant to anchor the bundle adjustment. For the experiments of this paper, we used N = 9 and n = 6.

The IMU terms in (7) were introduced to show the optimal way to incorporate these measurements. In our current experiments, we instead use a simple post-filter that operates only on the latest frame (Section 2.5). Considering just the first term, (7) can be linearized and solved by Taylor expansion, yielding the normal equation

J_q^T(x) Q^{-1} J_q(x) dx = -J_q^T(x) Q^{-1} q(x),   (8)

where x are the initial values for the parameters C_{j>n}, q_i; q(x) is the vector of all the q_{ij} stacked together; Q^{-1} is the block-diagonal of the Q_v^{-1}; and J_q is the Jacobian of the measurements. After (8) is solved, a new x' = x + dx is computed, and the process is iterated until convergence. Various schemes to guarantee convergence can be utilized, typically by adding a factor to the diagonal, as in the Levenberg-Marquardt algorithm.

The system (8) is very large, since the number of features is typically hundreds per frame. Sparse bundle adjustment solves it efficiently by taking advantage of the sparse structure of the Jacobian. In our experiments (Section 3), SBA can decrease the error in VO by a factor of 2 to 5; this is the first large-scale experiment to quantify error reductions in VO from SBA (see [23, 16] for smaller experiments).
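A dense, generic version of this iteration is sketched below on a toy least-squares problem; it only illustrates the normal-equation update of (8) with a small Levenberg-Marquardt-style damping term, whereas SBA additionally exploits the block sparsity of the Jacobian to stay tractable with hundreds of points per frame.

```python
import numpy as np

def gauss_newton(residual, x0, Q_inv, iters=20, damping=1e-6, eps=1e-6):
    """Damped Gauss-Newton: repeatedly solve
    (J^T Q^-1 J + damping * I) dx = -J^T Q^-1 r  and update x <- x + dx."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(iters):
        r = residual(x)
        J = np.zeros((r.size, x.size))
        for k in range(x.size):                    # finite-difference Jacobian
            dx = x.copy()
            dx[k] += eps
            J[:, k] = (residual(dx) - r) / eps
        A = J.T @ Q_inv @ J + damping * np.eye(x.size)
        step = np.linalg.solve(A, -J.T @ Q_inv @ r)
        x += step
        if np.linalg.norm(step) < 1e-10:
            break
    return x

# Toy problem: recover a 2D offset from noisy point observations.
rng = np.random.default_rng(0)
pts = rng.random((20, 2))
meas = pts + np.array([1.5, -0.7]) + 0.01 * rng.standard_normal((20, 2))
res = lambda t: (pts + t - meas).ravel()
print(gauss_newton(res, np.zeros(2), np.eye(40) / 0.01 ** 2))
```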


2.5 VO and IMU fusion using EKF

In this paper, we use VO to provide incremental pose estimation and the IMU as both an inclinometer (absolute roll and pitch) and an angular rate sensor (for incremental yaw). Translation estimates from VO are not corrected by the IMU but will benefit from the improved angular precision (see the discussion on drifts in Section 2). We fuse VO and IMU via loose coupling, in which each positioning subsystem is taken as a pose estimator and treated as a black box. Pose information from each device is fused in a second stage with simple, small-sized EKF procedures. Loose coupling is suboptimal in the sense that the existing cross-correlations between internal states of different devices are discarded. However, we show that long-term VO accuracy can be dramatically improved by just using the simplest loosely coupled fusion with IMU information.

EKF formulation

The EKF formulation is quite straightforward. To ensure continuity and differentiability properties of the state vector, the vehicle's orientation is encoded with a quaternion, and the state of the vehicle is represented by the 7-vector pose X = [x, y, z, a, b, c, d]^T. Its Gaussian character is specified by the couple {X̄, P} so that p(X) = N(X - X̄, P), which is initialized to the null value {X̄_0 = [0, 0, 0, 1, 0, 0, 0]^T, P_0 = 0_{7×7}}. We call k the filter time index, which corresponds to the last SBA frame N. We systematically enforce quaternion normalization and Euler gimbal-lock corrections whenever necessary.

We start with EKF motion prediction from VO. At each iteration of the filter, we take the VO incremental motion Gaussian estimate {C_Δ, Q_Δ}, from frame N-1 to N, and use it for prediction via a standard EKF.
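A minimal version of this prediction step is sketched below, assuming a scalar-first quaternion in the state, a VO increment given as a translation plus quaternion in the previous vehicle frame, and Jacobians evaluated by finite differences; the real filter also applies the normalization and gimbal-lock corrections noted above.

```python
import numpy as np

def quat_mult(q, r):
    """Hamilton product of scalar-first quaternions [w, x, y, z]."""
    w1, x1, y1, z1 = q
    w2, x2, y2, z2 = r
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def quat_rotate(q, v):
    """Rotate vector v by quaternion q (computes q v q*)."""
    conj = q * np.array([1.0, -1.0, -1.0, -1.0])
    return quat_mult(quat_mult(q, np.concatenate(([0.0], v))), conj)[1:]

def ekf_predict(X, P, dX, Q_delta, eps=1e-6):
    """Prediction: compose the pose X = [p, q] with the VO increment
    dX = [dp, dq]; propagate P with finite-difference Jacobians F and G."""
    def f(X, dX):
        q = X[3:] / np.linalg.norm(X[3:])
        dq = dX[3:] / np.linalg.norm(dX[3:])
        q_new = quat_mult(q, dq)
        return np.concatenate((X[:3] + quat_rotate(q, dX[:3]),
                               q_new / np.linalg.norm(q_new)))
    Xn = f(X, dX)
    F = np.zeros((7, 7))
    G = np.zeros((7, 7))
    for k in range(7):
        e = np.zeros(7)
        e[k] = eps
        F[:, k] = (f(X + e, dX) - Xn) / eps
        G[:, k] = (f(X, dX + e) - Xn) / eps
    return Xn, F @ P @ F.T + G @ Q_delta @ G.T
```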

Second, we correct the absolute gravity normal by using the IMU as an inclinometer. Our IMU provides processed (not raw) information in the form of a vehicle pose C^IMU_k = [x_k, y_k, z_k, φ_k, θ_k, ψ_k]^T (position, and orientation in Euler angles). Accelerometers inside the IMU sense the accelerations together with the gravity vector. Because accelerations have zero mean in the long term, these readings provide absolute information about the gravity direction. The gravity vector readings in the vehicle frame only depend on the roll φ_k and pitch θ_k angles, so we define the measurement equation (5) to be:

g_k = h_g(C^IMU_k) = [φ_k, θ_k]^T.   (9)

Its uncertainty G = diag(σ_g², σ_g²) is given a large σ_g of around 0.5 rad to account for unknown accelerations. The Gaussian couple {g_k, G} is then fed into the filter with a regular EKF update.
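A sketch of this inclinometer update follows; the roll/pitch extraction from the quaternion and the generic numerically linearized EKF update are assumptions of this illustration, not the paper's exact formulation.

```python
import numpy as np

def roll_pitch(q):
    """Roll and pitch (radians) of a scalar-first quaternion [w, x, y, z]."""
    w, x, y, z = q / np.linalg.norm(q)
    roll = np.arctan2(2.0 * (w * x + y * z), 1.0 - 2.0 * (x * x + y * y))
    pitch = np.arcsin(np.clip(2.0 * (w * y - x * z), -1.0, 1.0))
    return np.array([roll, pitch])

def ekf_update(X, P, z, h, R, eps=1e-6):
    """Generic EKF update with a numerically linearized measurement h(X)."""
    zhat = h(X)
    H = np.zeros((zhat.size, X.size))
    for k in range(X.size):
        e = np.zeros(X.size)
        e[k] = eps
        H[:, k] = (h(X + e) - zhat) / eps
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    Xn = X + K @ (z - zhat)
    Xn[3:] /= np.linalg.norm(Xn[3:])          # re-normalize the quaternion
    return Xn, (np.eye(X.size) - K @ H) @ P

# Gravity-normal correction, cf. (9): treat the IMU roll/pitch as a direct but
# very noisy measurement with sigma_g of about 0.5 rad.
# X, P = ekf_update(X, P, imu_roll_pitch,
#                   lambda X: roll_pitch(X[3:]), (0.5 ** 2) * np.eye(2))
```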

Finally, we correct relative yaw increments by exploiting the angular rate readings of the IMU. The measurement equation (6) for the yaw increment in the IMU is trivial:


Δψ_{k-1,k} = h_{Δψ}(C^IMU_{k-1}, C^IMU_k) = ψ_k - ψ_{k-1}   (10)

This yaw increment is added to the last filter estimate X_{k-1} (stored at the last iteration) to obtain a yaw measurement that is relative in nature yet absolute in form, hence adequate for a classical EKF update:

y_k = h_ψ(X_{k-1}) + Δψ_{k-1,k}   (11)

where the observation function h_ψ(X) provides the yaw angle of the pose X. Its noise variance is Y_k = σ²_{Δψ} Δt, with σ_{Δψ} the angular random walk characteristic of the IMU (in rad/√s units), and Δt the time lapse in seconds from (k-1) to k. The Gaussian couple {y_k, Y_k} is then fed into the filter with a regular EKF update.
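Reusing the ekf_update helper above, the yaw correction of (10)-(11) can be sketched as follows; the yaw extraction from the quaternion, the sigma_dpsi and dt symbols, and the commented call are illustrative assumptions.

```python
import numpy as np

def yaw(q):
    """Yaw (radians) of a scalar-first quaternion [w, x, y, z]."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.arctan2(2.0 * (w * z + x * y), 1.0 - 2.0 * (y * y + z * z))

# Relative-yaw correction, cf. (10)-(11): add the IMU yaw increment to the yaw
# of the previous filter estimate X_prev to form a pseudo-absolute measurement,
# with variance sigma_dpsi**2 * dt (angular random walk over the elapsed time).
# y_k = np.array([yaw(X_prev[3:]) + dpsi_imu])
# R_k = np.array([[sigma_dpsi ** 2 * dt]])
# X, P = ekf_update(X, P, y_k, lambda X: np.array([yaw(X[3:])]), R_k)
```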

3 Experiments

We are fortunate in having large outdoor datasets with frame-registered ground truth from RTK GPS, which is accurate to several cm in XY and 10 cm in Z. For these datasets, the camera FOV is 35 deg, the baseline is 50 cm, and the frame rate is 10 Hz (512x384 images), so there is often large image motion. We took datasets from Little Bit (9 km trajectory, 47K frames) in Pennsylvania, and Ft. Carson (4 km, 20K frames) in Colorado, to get variety in imagery. The Ft. Carson dataset is more difficult for matching, with larger motions and less textured images. In the experiments, we use only CenSurE features, which failed the fewest times (0.17% for Little Bit, 4.0% for Ft. Carson).

3.1 VO Error Analysis and Calibration

We analyze VO-only drifts in order to provide useful odometry-like error statistics. The odometry model consists of translation and angular errors over displacement (m/√m and deg/√m), and angular error over rotation (deg/√deg); we do not compute the latter because of lack of data. From the 9 km run of the Little Bit data, we select 100 sections 200 m long, and integrate position and angular drifts. Averaging then reveals both random walk drifts and deterministic drifts or biases (Fig. 5 and summary in Table 3). We observe a) a nearly ideal random walk behavior of the angular drifts; and b) a linear growth of the mean position drifts. This linear growth comes from systematic biases that can be identified with uncalibrated parameters of the visual system. X drift (longitudinal direction) indicates a scale bias that can be assigned to an imperfect calibration of the stereo baseline. Y and Z drifts (transversal directions) indicate pan and tilt deviations of the stereo head. The angles can be computed by taking the arc-tangent of the drifts over the total distance. In Table 4 we show the performance improvement from SBA; note the significant improvement in error values.
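The averaging itself is simple to reproduce; below is a sketch under the assumption that the integrated drift of one state has been sampled along each 200 m section, together with the arc-tangent conversion from transversal drift to an alignment angle mentioned above.

```python
import numpy as np

def drift_statistics(segments):
    """Drift statistics over many fixed-length (e.g. 200 m) sections.
    `segments` has shape (num_sections, num_samples): the integrated error of
    one state along each section. A mean growing linearly with distance points
    to a calibration bias; an STD growing like sqrt(distance) points to
    random-walk noise."""
    segs = np.asarray(segments, dtype=float)
    return segs.mean(axis=0), segs.std(axis=0), np.sqrt((segs ** 2).mean(axis=0))

# Example from Table 3: a transversal drift of 11.1 mm per meter of travel
# corresponds to a stereo pan misalignment of about arctan(0.0111) = 0.64 deg.
print(np.degrees(np.arctan(0.0111)))
```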


Fig. 5. Integrated errors for Roll (left) and X (right) estimates. All 200 m long trajectories (gray), their mean (blue), STD (green, shown at ±1σ from mean) and RMS (red). The couples {Pitch, Y} and {Yaw, Z} behave similarly.

state   mean     STD      derived calibration
        deg/m    deg/√m
φ       -0.002   0.065    white noise: no info
θ        0.002   0.196    white noise: no info
ψ       -0.003   0.145    white noise: no info
        mm/m     mm/√m
x       -5.76    69       -0.58% baseline factor
y      -11.1     59        0.63° stereo pan
z       11.9     34        0.68° stereo tilt

Table 3. VO errors (CenSurE, SBA, no IMU)

state   no SBA   SBA
φ       0.191    0.065
θ       0.243    0.196
ψ       0.230    0.145
x       99       69
y       73       59
z       44       34

Table 4. STD values for two VO methods (in {deg, mm}/√m)

3.2 Trajectories

The VO angular errors contribute nonlinearly to trajectory error. On the two datasets, we compared RMS and max XYZ trajectory errors. In the case of matching failure, we substituted IMU data for the angles, and set the distance to the previous value. In Table 5, the effects of bundle adjustment and IMU filtering are compared. Figure 6 shows the resultant trajectories.

In both datasets, IMU filtering plays the largest role in bringing down error rates. This is not surprising, since angular drift leads to large errors over distance. Even with a noisy IMU, global gravity normal will keep Z errors low. The extent of XY errors depends on how much the IMU yaw angle drifts over the trajectory; in our case, a navigation-grade IMU has 1 deg/hr of drift. Noisier IMU yaw data would lead to higher XY errors.

The secondary effect is from SBA. With or without IMU filtering, SBA can lower error rates by half or more, especially in the Ft. Carson dataset, where the matching is less certain.

Table 5. Trajectory error statistics, in meters and percent of trajectory

                              RMS error in XYZ     Max error in XYZ
Little Bit   VO No SBA        97.41 (1.0%)         295.77 (3.2%)
             VO SBA           45.74 (0.49%)        137.76 (1.5%)
             VO No SBA + IMU   7.83 (0.08%)         13.89 (0.15%)
             VO SBA + IMU      4.09 (0.04%)          7.06 (0.08%)
Ft Carson    VO No SBA       263.70 (6.9%)         526.34 (13.8%)
             VO SBA          101.43 (2.7%)         176.99 (4.6%)
             VO No SBA + IMU  19.38 (0.50%)         28.72 (0.75%)
             VO SBA + IMU     13.90 (0.36%)         20.48 (0.54%)

Fig. 6. Trajectories from the Little Bit (left) and Ft. Carson (right) datasets, showing RTK GPS ground truth, no SBA, SBA, and IMU-fused estimates.

4 Conclusion

We have described a functioning stereo VO system for precise position estimation in outdoor natural terrain. Using a novel scale-space feature, sparse bundle adjustment, and filtering against IMU angle and gravity information, we obtain an extremely precise positioning system for rough terrain travel. The system is currently being integrated on a large autonomous outdoor vehicle, to operate as an alternative to GPS/IMU positioning. The error statistics of the VO/IMU system described here are better than those of standard (non-RTK) GPS/IMU systems over ranges up to 10 km.

Currently the system runs at about 10 Hz on a 2 GHz Pentium, with the following timings: feature extraction (15 ms), dense stereo (25 ms), tracking (8 ms), RANSAC estimation (18 ms), SBA (30 ms), and IMU filtering (0 ms). We are porting to a dual-core system, and expect to run at 20 Hz.

One area of improvement would be to integrate a full model of the IMU, including forward accelerations, into the minimization (7). For datasets like Ft. Carson, where there are significant problems with frame-to-frame matching, the extra information could help to lower the global error.

References

1. M. Agrawal and K. Konolige. Real-time localization in outdoor environments using stereo vision and inexpensive GPS. In ICPR, August 2006.

2. M. Agrawal and K. Konolige. Rough terrain visual odometry. In Proc. International Conference on Advanced Robotics (ICAR), August 2007.


3. M. Agrawal and K. Konolige. CenSurE: Center surround extremas for realtime feature detection and matching. In preparation.

4. M. Agrawal, K. Konolige, and R. Bolles. Localization and mapping for autonomous navigation in outdoor terrains: A stereo vision approach. In WACV, February 2007.

5. S. S. Beauchemin and J. L. Barron. The computation of optical flow. ACM Computing Surveys, 27(3), 1995.

6. A. Comport, E. Malis, and P. Rives. Accurate quadrifocal tracking for robust 3D visual odometry. In ICRA, 2007.

7. A. Davison. Real-time simultaneous localisation and mapping with a single camera. In ICCV, pages 1403-1410, 2003.

8. A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse. MonoSLAM: Real-time single camera SLAM. IEEE PAMI, 29(6), 2007.

9. C. Engels, H. Stewenius, and D. Nister. Bundle adjustment rules. Photogrammetric Computer Vision, September 2006.

10. M. Fischler and R. Bolles. Random sample consensus: A paradigm for model fitting with application to image analysis and automated cartography. Commun. ACM, 24:381-395, 1981.

11. C. Harris and M. Stephens. A combined corner and edge detector. In Alvey Vision Conference, pages 147-151, 1988.

12. R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2000.

13. R. Lienhart and J. Maydt. An extended set of Haar-like features for rapid object detection. In IEEE Conference on Image Processing (ICIP), 2002.

14. D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91-110, 2004.

15. K. Mikolajczyk and C. Schmid. An affine invariant interest point detector. In European Conference on Computer Vision (ECCV), 2002.

16. E. Mouragnon, M. Lhuillier, M. Dhome, F. Dekeyser, and P. Sayd. Real time localization and 3D reconstruction. In CVPR, volume 1, pages 363-370, June 2006.

17. D. Nister. An efficient solution to the five-point relative pose problem. IEEE PAMI, 26(6):756-770, June 2004.

18. D. Nister, O. Naroditsky, and J. Bergen. Visual odometry. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, June 2004.

19. E. Rosten and T. Drummond. Machine learning for high-speed corner detection. In European Conference on Computer Vision, volume 1, 2006.

20. J. Shi and C. Tomasi. Good features to track. In Proc. Computer Vision and Pattern Recognition (CVPR), 1994.

21. J. Solà, M. Devy, A. Monin, and T. Lemaire. Undelayed initialization in bearing only SLAM. In ICRA, 2005.

22. D. Strelow and S. Singh. Motion estimation from image and inertial measurements. International Journal of Robotics Research, 23(12), 2004.

23. N. Sunderhauf, K. Konolige, S. Lacroix, and P. Protzel. Visual odometry using sparse bundle adjustment on an autonomous outdoor vehicle. In Tagungsband Autonome Mobile Systeme. Springer Verlag, 2005.

24. B. Triggs, P. F. McLauchlan, R. I. Hartley, and A. W. Fitzgibbon. Bundle adjustment: A modern synthesis. In Vision Algorithms: Theory and Practice, LNCS, pages 298-375. Springer Verlag, 2000.

25. P. Viola and M. Jones. Robust real-time face detection. In ICCV, 2001.