
IMPROVING GPS LOCALIZATION WITH VISION AND INERTIAL SENSING

Adel Fakih, Samantha Ng, and John Zelek
University of Waterloo

Inertial sensors and cameras are two sources of odometry that can provide a means of continuous localization when the GPS signal is blocked because of jamming or interference. Furthermore, they augment the GPS localization with orientation (heading) information. Although there have been many approaches to fuse GPS with vision and inertial measurements, most of these approaches are still subject to many conditions, assumptions, and specific applications that do not guarantee the most accurate localization from inferred measurements of the available sensors at all times. We propose a comprehensive framework to fuse GPS intelligently with vision and inertial data by automatically, and adaptively, selecting the best combination of sensors and the best set of parameters. Our framework, based on multiple hypotheses, is a roadmap for real time systems that provide high accuracy, outlier robustness, and scalability when mapping a large number of 3D points.

1. Introduction

Mobile mapping vehicles, as well as pedestrians, equipped with GPS are often subject to satellite shading and signal dropout when operating along tree-lined streets, in wooded areas, and environments where tall buildings can block reception. Vehicle mounted GPS positioning is further hindered by structures which cause satellites to go in and out of view, limiting a GPS receiver's ability to provide continuous and accurate position information. In addition, multi-path problems encountered as a result of signals bouncing off reflective surfaces often cause GPS systems to be unable to provide reliable positioning information in this type of area. One way to provide uninterrupted measurements of the position, roll, pitch, and true heading of moving vehicles is to integrate precision GPS with advanced inertial technology, similar to what planes and rockets use; however, such technology is unfeasible for the automotive or pedestrian sector due to the expense.

As humans, we are very successful in navigating, localizing, and finding our way with our eyes and inertial system (vestibular system). In this investigation, we mimic this ability with sensor fusion algorithms that combine the results of computer vision algorithms with the outputs of GPS and inexpensive inertial sensor measurements. Our system is fast, scalable, and flexible; it selectively fuses measurements from the given sensors in a robust, probabilistic, multi-hypothesis approach that always gives the best possible localization and mapping. We also include a magnetometer (i.e. compass) in the system apparatus. The magnetometer, providing information about absolute orientation, enhances the ability of the system to avoid large drift after losing the GPS signal. Our system is not tied to any specific image feature detector and tracker. Therefore, any feature detector can be used to detect features in the images and any tracker (or matcher) can be used to track those features between images. Our system is resilient to the problems and failures of the feature tracker; firstly, it uses multiple hypotheses and, secondly, it uses additional information from the other sensors. We provide preliminary results to prove the validity and feasibility of our proposed approach.

GEOMATICA Vol. 63, No. 4, 2009, pp. 383 to 396


Adel Fakih, [email protected]
Samantha Ng, [email protected]
John Zelek, [email protected]


This paper is organized into six sections. Section 1 is this introduction. Section 2 provides an overview of previous work on fusing GPS with either vision or inertial sensors; it constitutes a preamble for the presentation of the proposed framework. Section 3 introduces the notations and equations to be used in the framework. Section 4 is a detailed explanation of the proposed approach. Section 5 presents some of our results and Section 6 is our set of conclusions.

2. Background

The Global Positioning System (GPS) has long been used in combination with inertial sensors for localization and mapping [Grewal et al. 2007]. Together, GPS and inertial measurement units (IMUs) naturally complement each other. GPS compensates for drift in inertial measurements, while inertial measurements are available at higher sampling rates and in a wider range of environments. Many approaches for combining GPS and inertial measurements have been introduced in recent years [Grewal et al. 2007; Li et al. 2006; Mohamed and Schwarz 1999; Sasiadek et al. 2000]. These approaches range from loose coupling to ultra-tight coupling and employ different estimation techniques such as the conventional Kalman Filter, the adaptive Kalman Filter, and the fuzzy adaptive Kalman filter. Although inertial sensors can operate in more environments (e.g. indoors) when compared to GPS, they can only compensate for loss of the GPS signal for a short period of time and will drift quickly from the true trajectory.

More recently, localization methods use vision with filter-based approaches [Chiuso et al. 2002; Davison et al. 2007; Eade and Drummond 2006] or with odometry-based approaches [Yang and Pollefeys 2003; Nister et al. 2006; Konolige et al. 2007]. However, there are two main problems with vision systems. First, there is positional drift: drift that occurs due to the disappearance of old features and the appearance of new features in filter-based approaches, or drift that occurs due to the accumulation of errors, especially during rotations, in odometry-based approaches. The drift problem is sometimes addressed using loop-closure, as in SLAM (Simultaneous Localization and Mapping) [Davison et al. 2007; Eade and Drummond 2006], although it is not generalizable to all scenarios. The second problem with vision is its high dependence on the quality of the feature detection and tracking components, which are sensitive to lighting conditions and occlusions. When tracking fails, the vision-based localization fails catastrophically. In addition, the effect of feature outliers is a serious problem for vision-based solutions. Vision has been used previously with GPS applications mainly to help in road detection and following [Bai and Wang 2008; Goldbeck et al. 2000; Xia et al. 2006; Hu and Uchimura 2006], or in the context of augmented reality applications [Carceroni et al. 2006; Ramachandran et al. 2007; Forlani et al. 2005], where GPS aids vision in localization and mapping.

To address the problems of positional drift and feature detection quality, many authors propose to combine vision and inertial measurements together [Jones et al. 2007; Chikuma et al. 2004; Qian et al. 2001; Roumeliotis et al. 2002; Konolige et al. 2007]. Generally, the combination of vision and inertial measurements gives more accurate results than either one on its own, provided that uncertainties in both measurements are well-modeled. We showed these results in the context of trying to determine the camera velocity from an optical flow field and from the measurements of a gyroscope using an optimal non-linear minimization [Fakih and Zelek 2008]. As shown in Figures 1 and 2, we demonstrated that, with a good understanding of measurement uncertainties in both sensors, we can usually get a combined estimate that is better than the estimates obtained from each of the measurements by itself.
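To make the idea of uncertainty-weighted combination concrete, the following sketch fuses a gyroscope reading and a vision-derived rotational velocity estimate by inverse-covariance weighting. This is only an illustrative stand-in for the optimal non-linear minimization of [Fakih and Zelek 2008]; the variable names and covariance values are assumptions for the example.

```python
import numpy as np

def fuse_rotational_velocity(w_vis, P_vis, w_gyro, P_gyro):
    """Inverse-covariance (information form) fusion of two estimates of the
    rotational velocity. Both inputs are 3-vectors with 3x3 covariances."""
    I_vis = np.linalg.inv(P_vis)
    I_gyro = np.linalg.inv(P_gyro)
    P_fused = np.linalg.inv(I_vis + I_gyro)                  # fused covariance
    w_fused = P_fused @ (I_vis @ w_vis + I_gyro @ w_gyro)    # fused mean
    return w_fused, P_fused

# Illustrative numbers: a noisy vision estimate and a tighter gyro reading.
w_vis = np.array([0.10, -0.02, 0.05])
w_gyro = np.array([0.12, -0.01, 0.04])
P_vis = np.diag([1e-2, 1e-2, 1e-2])
P_gyro = np.diag([1e-4, 1e-4, 1e-4])
print(fuse_rotational_velocity(w_vis, P_vis, w_gyro, P_gyro))
```

When the assumed covariances reflect the true sensor noise, the fused estimate is pulled towards the more reliable sensor; with badly mis-specified covariances, the same formula produces the degraded behaviour described next.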

However, if we have an inaccurate assumption of measurement uncertainty, which may be the likely case for visual measurements, the combination could lead to results that are worse than the estimates obtained from the individual measurements. This is shown in Figure 3, where incorrect uncertainties were assumed in the combination and led to results with higher error than the gyro errors.

From these results, we conclude that although the fusion of vision and inertial measurements is a good strategy in many cases, in some cases it might not be the right thing to do. Accordingly, a good system should be able to fuse data when the fusion is efficient, and it should be able to rely on individual sensors when measurements become very noisy or unknown uncertainties are suspected. One approach to resolve a similar problem is based on using multiple Kalman filters [Drolet et al. 2000]; a bank of Kalman filters is used, one of which is running at all times. Whenever new sensor measurements become available, the suitable filter is selected from the bank of available filters. The Adaptive Kalman Filter [Gustafsson and Firm 2000] is another way to resolve this problem. The adaptive filter tries to estimate the noise covariance in the measurements; however, this is not suitable for vision, since features are short-lived and, hence, not practical for estimating their noise covariance.

Figure 1: Rotational velocity error on an indoor sequence.

3. Methodology Preliminaries

We wish to determine the motion of a moving platform from the readings of a set of sensors mounted on the platform. Figure 4 shows our platform with three types of sensors.

We use a Trimble ProXH GPS receiver, a Point Grey Flea2 firewire camera, and a Memsense nIMU inertial unit, which provides measurements of the rotational velocity, the translational acceleration, and the components of the Earth's magnetic field along the three axes of the unit. The pose of the platform is represented by a rotation matrix R ∈ SO(3) (the Lie group of rotations in R^3) and a translation vector T = [T_N, T_E, T_D]^T expressed with respect to the inertial frame, which we chose to be a local North East Down (NED) frame. The velocity of the moving platform is represented by a rotational velocity vector ω = [ω_x, ω_y, ω_z]^T and a translational velocity V = [V_x, V_y, V_z]^T expressed in the body frame, which we chose to be coincident with the camera frame (i.e. oriented so that the z axis coincides with the optical axis of the camera). The motion evolves according to:

T(t+1) = e^{[ω(t)]×} T(t) + V(t),    (1)
R(t+1) = e^{[ω(t)]×} R(t),    (2)

where [ω]× is the skew-symmetric matrix defined as

[ω]× := [ 0, -ω_z, ω_y ; ω_z, 0, -ω_x ; -ω_y, ω_x, 0 ].

For filtering purposes, we represent R by the vector of its exponential coordinates Ω = [Ω_x, Ω_y, Ω_z]^T, with R = e^{[Ω]×} and Ω = Log_SO(3)(R).

Figure 2: Results on two outdoor sequences. (a) Rotational velocity error. (b) Rotational velocity error.
Figure 3: Example of fusing gyro and vision when the assumed uncertainties of the two sensors are inaccurate. The rotational error in the fused estimate is worse than the gyro estimate.
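As a minimal sketch (not the authors' implementation), the discrete motion model of equations (1) and (2) can be propagated with the matrix exponential of the skew-symmetric matrix [ω]×:

```python
import numpy as np
from scipy.linalg import expm

def skew(w):
    """Return [w]x, the skew-symmetric matrix such that [w]x @ v = w x v."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def propagate_pose(R, T, w, V):
    """One step of equations (1)-(2): apply the incremental rotation exp([w]x)
    and add the translational velocity."""
    E = expm(skew(w))
    return E @ R, E @ T + V

# Illustrative step: small yaw rate, forward motion along the optical axis.
R, T = np.eye(3), np.zeros(3)
R, T = propagate_pose(R, T, w=np.array([0.0, 0.0, 0.01]), V=np.array([0.0, 0.0, 0.1]))
```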


The measurements that the platform receives from the sensors are: the absolute position T_g of the GPS receiver unit in the Earth Centered Earth Fixed (ECEF) frame; the accelerometer's measurement of the translational acceleration a_ins in the IMU frame, which is a sum of the true acceleration of the IMU and the gravity of the earth g; the gyroscope's measurement of the rotational velocity ω_ins in the IMU frame; and the magnetometer's measurement R_mag of the orientation of the unit with respect to the NED frame. To determine the origin and orient our system so that the camera frame coincides with the NED frame, we leave the platform at rest for 5 seconds and use the average of the GPS, magnetometer, and accelerometer measurements over these first few seconds. The camera provides the positions of salient features that are tracked in the images. The 3D positions of the world points corresponding to the tracked features are added to the parameters to be estimated for two reasons: (1) the simultaneous estimation of these 3D points with the motion adds to the accuracy of the vision system, and (2) these points constitute a map that can be used for mapping applications. The 3D position of a world feature is represented in the inertial frame by the vector X = [X, Y, Z]^T and in the camera frame by X_c = [X_c, Y_c, Z_c]^T. At time t, the position of the feature point X with respect to the camera frame is:

X_c(t) = R(t) X(t) + T(t).    (3)

Assuming a perspective projection model, X_c projects to the point u = [u, v, f]^T on the image plane:

[u(t), v(t), f]^T = proj(X_c) := [f X_c/Z_c, f Y_c/Z_c, f]^T.    (4)

f is the focal length. A calibrated Euclidean framework is assumed and, without loss of generality, f is taken as 1. In this paper, visual features are tracked using a pyramidal implementation of the Lucas-Kanade feature tracker [Bouguet 2000]. Figure 5 shows a test sequence with the features tracked between two of the frames.

Figure 4: Setup.
Figure 5: A frame from Sequence 2 and the directions of the optical flow used to track the features. Mean feature displacement is four pixels per frame.
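The front end referred to above can be any detector/tracker pair. As one possible (assumed) realization, the sketch below uses OpenCV's pyramidal Lucas-Kanade tracker between two consecutive frames; the file names and parameters are placeholders.

```python
import cv2

# Placeholder frames; any pair of consecutive grayscale images will do.
prev_img = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)
next_img = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)

# Detect salient features in the first frame (any detector could be used here).
prev_pts = cv2.goodFeaturesToTrack(prev_img, maxCorners=200,
                                   qualityLevel=0.01, minDistance=7)

# Pyramidal Lucas-Kanade tracking into the next frame.
next_pts, status, err = cv2.calcOpticalFlowPyrLK(prev_img, next_img, prev_pts, None,
                                                 winSize=(21, 21), maxLevel=3)

# Keep only the successfully tracked features.
tracked = next_pts[status.ravel() == 1]
print(len(tracked), "features tracked")
```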

4. Methodology

4.1 Meta Level Overview

We wish to determine the position and velocity of the platform accurately, at every time step, regardless of the environment (e.g. indoor, outdoor), the natural conditions (e.g. bright, dark), or the type of motions (e.g. fast, slow, jerky). Using a probabilistic formulation, we want to determine, at every time step t, the probability distribution of the state vector S = [Ω, T, ω, V, a, X_1, ..., X_N] given all the measurements [T_GPS, ω_ins, a_ins, Ω_mag] (i.e. inertial) and U = [u_1, ..., u_N] (i.e. visual features) from 0 to t. The notation [·, ·] is used to represent the concatenation of vectors on top of each other. We denote the desired distribution as p(S(t) | U(0:t), T_g(0:t), ω_ins(0:t), V_ins(0:t), Ω_mag(0:t)).


A traditional way to do the estimation of such systems is by using the Extended Kalman Filter (EKF). However, the EKF has two main problems in this context; firstly, it is computationally intractable if the number of 3D features tracked by the vision system is high and, secondly, it assumes that all the uncertainties are Gaussian and perfectly known. To address these problems, Rao-Blackwellized Particle Filters (RBPF) have been introduced. The first generation of these filters [Montemerlo 2003] reduced the computational complexity by capitalizing on the important characteristic that features are independent given the camera motion. Then, a particle filter can be used in which each particle is composed of a motion sample and N low-dimensional EKFs, one for each of the N features. However, there was still a need for further reduction of the number of particles and improvement of the convergence properties of the filter. A drastic improvement [Eade and Drummond 2006; Montemerlo et al. 2003; Sim et al. 2007] over the simple Rao-Blackwellized particle filter was found in using the optimal importance function, which is the distribution of the estimates given not only the state at the previous time-step but also the most recent observation. This ensures that samples are propagated in an efficient way, which reduces the number of wasted samples. However, in the mentioned approaches, the way samples are drawn from this importance function is not optimal in the sense of providing particles that lie in the regions of highest probability of the posterior. This is not optimal because the function does not use the uncertainty of a motion sample during the update process and, therefore, it updates the feature locations based on the predicted motion, which might not be very accurate if the motion is changing rapidly. To deal with that, some researchers [Karlsson et al. 2005; Sim et al. 2007; Elinas et al. 2006] use mixed proposal distributions that combine hypotheses from the motion and observation models.

The approach we present in this paper borrows some principles from the Rao-Blackwellized particle filter; we divide the state vector into two parts, where the second one is estimated analytically, conditioned on the first part. However, our approach is intrinsically different from the RBPF. First, we use hypotheses that are much more general than the particles in the Rao-Blackwellized particle filter. Our hypotheses represent not only a motion estimate, but also a set of sensors to be used. Hence, we have multiple types of hypotheses. In the current paper, we limit the types of hypotheses to three: vision only hypotheses, vision-inertial hypotheses, and inertial only hypotheses. The approach is general enough to allow for the easy incorporation of any additional sensor. Furthermore, our hypotheses do not represent an exact probability distribution over the 3D parameters; however, they can be seen as a mixture-of-Gaussians representation of this distribution. Another major difference between our approach and the Rao-Blackwellized approach is that in our approach, instead of dividing the state vector into a motion part and a feature locations part, we divide it into a first part that consists of the camera motion and a small number K of random 3D points, and another part containing the remaining 3D points conditioned on the first part. The K features that are associated with the first part are chosen randomly for every hypothesis by drawing putative sets from the available tracked features when the hypothesis is created. This new division, as we will show, has very important implications for the efficiency and the mechanism of the filter operations.
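A small sketch of this random partition of the state (the randperm selection of the K features that join the motion in the first part); the function and array names are illustrative only.

```python
import numpy as np

def partition_state(n_features, K=8, rng=None):
    """Randomly pick K feature indices for Part 1 (estimated jointly with the
    motion); the remaining features form Part 2, each with its own small EKF
    conditioned on Part 1."""
    rng = rng or np.random.default_rng()
    perm = rng.permutation(n_features)   # randperm(N)
    return perm[:K], perm[K:]

s1_idx, s2_idx = partition_state(n_features=50, K=8)
```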

As we mentioned, the state vector in our approach is divided into a part S_1 containing the motion plus a small number K of features, and another part S_2 consisting of the remaining features. K typically should be a small number (> 5). A relatively large number leads to better results but increases the computational expense. In this paper we settle on K = 8:

S = [S_1, S_2],
S_1 = [Ω(t), T(t), ω(t), V(t), X_{L(l)}(t)], L = randperm(N), l ∈ [1, K],
S_2 = [X_i(t)], i = 1, ..., N, i ≠ L(l),    (5)

where randperm(N) is a random permutation of the numbers from 1 to N. The elements (3D feature locations) in the second part S_2 of S are independent given S_1. The desired probability distribution can be written, therefore, as follows (n.b. dropping the time indices):

p(S | U, T_g, ω_ins, Ω_mag) = p(S_1, S_2 | U_1, U_2, T_g, ω_ins, V_ins, Ω_mag),    (6)

where U_1 are the image projections of the features in S_1 and U_2 are the image projections of the features in S_2. The rationale of dividing the state vector in the presented way is that we can now approximate the desired distribution as follows:



p(S_1, S_2 | U_1, U_2, T_g, ω_ins, V_ins, Ω_mag) = p(S_1 | U_1, T_g, ω_ins, V_ins, Ω_mag) p(S_2 | S_1, U_2),    (7)

since S_2 is independent of U_1 given S_1. The approximation comes from the fact that S_1 is now dependent only on the first part of the visual measurements, U_1, which is a set of only K features, instead of being dependent on all the features (U_1 + U_2). However, we can deal with that by forming multiple hypotheses in which the features in U_1 are different. This is somewhat similar to the RANSAC paradigm [Fischler and Bolles 1981] and it provides robustness to outliers in the tracking of the features.

Now every one of those hypotheses can be efficiently estimated, since estimating S_1 can be done in a low dimensional Extended Kalman Filter (see Section 4.2), and then every 3D feature in S_2 can be estimated in a 3-dimensional EKF conditioned on S_1 (see Section 4.3). This way of estimation is a drastic improvement over the RBPF, since S_2 relies in its update process on a motion in S_1 that is already updated and not only predicted from the previous time instant.

As mentioned before, although the combination of inertial measurements with vision generally provides estimates that are better than those provided by either one of the sensors, in some situations, if one of the sensors' measurements is really poor or if the uncertainties in the measurements are not well known, the combination is not fruitful. For this reason, we keep three kinds of hypotheses:

• Hypotheses using vision without inertial sensors, corresponding to the distribution p(S_1 | U_1, T_gps) p(S_2 | S_1, U_2). The part S_1 of this hypothesis is estimated only from the U_1 data associated with it, and then S_2 is estimated from the U_2 data using the motion in S_1.

• Hypotheses using both vision and inertial sensors (plus magnetometer), corresponding to the distribution p(S_1 | U_1, T_g, ω_ins, V_ins, Ω_mag) p(S_2 | S_1, U_2). The part S_1 of this hypothesis is estimated from the U_1 data associated with it and from the inertial measurements; then S_2 is estimated from the U_2 data using the motion in S_1.

• And one hypothesis corresponding to inertial measurements (plus magnetometer) without vision: p(S_1 | T_g, ω_ins, V_ins, Ω_mag) p(S_2 | S_1, U). In this case, S_1 contains only motion estimates while all the features are in S_2. S_1, in this hypothesis, is estimated using an Adaptive Kalman Filter (AKF) [Gustafsson and Firm 2000], which also estimates the noise covariance Σ_ins in the inertial measurements. This is done to deal with situations where Σ_ins is not known or changes with time. The estimate of Σ_ins from this hypothesis is fed to the other hypotheses.

Figure 6 shows the three types of particles in our system. Figures 7, 8, and 9 depict the estimation process of each type of hypothesis.

At every time step, the estimation of all the hypotheses can be done in a GPS mode or in a non-GPS mode depending on the availability of the GPS signal. Also, a procedure is designed to determine the best hypotheses (see Section 4.4). Figure 10 shows the mean visual reprojection errors of 20 hypotheses over 160 frames and the chosen hypothesis.

Figure 6: Particles used in the filter.
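One way the three hypothesis types could be organized in code is sketched below; the field names and types are assumptions for illustration rather than the authors' data structures.

```python
from dataclasses import dataclass, field
from enum import Enum
import numpy as np

class HypothesisType(Enum):
    VISION = 1           # p(S1 | U1, Tgps) p(S2 | S1, U2)
    VISION_INERTIAL = 2  # p(S1 | U1, Tg, w_ins, V_ins, O_mag) p(S2 | S1, U2)
    INERTIAL = 3         # adaptive KF on inertial data; all features kept in S2

@dataclass
class Hypothesis:
    kind: HypothesisType
    s1_feature_idx: np.ndarray              # the K randomly chosen features in S1
    s1_mean: np.ndarray                     # motion (+ K features) estimate
    s1_cov: np.ndarray
    s2_filters: list = field(default_factory=list)  # one small EKF per remaining feature
    weight: float = 1.0                     # fitness weight w(j, t), Section 4.4
```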


4.2 Estimation of Part 1 for Every Hypothesis

Three types of dynamical systems can be defined, one for each of our hypotheses. The equations in these systems are a subset of the following system of equations:

(8.1) X_{L(l)}(t+1) = X_{L(l)}(t), l = 1, ..., K
(8.2) T(t+1) = e^{[ω(t)]×} T(t) + V(t)
(8.3) Ω(t+1) = Log_SO(3)( e^{[ω(t)]×} e^{[Ω(t)]×} )
(8.4) V(t+1) = V(t) + a(t)
(8.5) ω(t+1) = ω(t) + a(t)
(8.6) a(t+1) = a(t)
(8.7) u_i(t) = proj( R(t) X_i(t) + T(t) ) + n_vis,i(t), i = 1, ..., N
(8.8) T_GPS(t) = C T(t) + B + n_GPS
(8.9) a_ins(t) = R_ci^T a(t) + R^T(t) g + R(t) T_ci + a_bias + n_ins,acc
(8.10) ω_ins(t) = R_ci^T ω(t) + ω_bias + n_ins,gyro
(8.11) Ω_mag(t) = Ω(t) + n_mag

with X_i(0) = X_0i, T(0) = T_0, Ω(0) = Ω_0, V(0) = V_0, ω(0) = ω_0. Ω = Log_SO(3)(R) is the vector of the exponential coordinates of R, i.e. R = exp([Ω]×). n_vis,i ~ N(0, Σ_vis,i(t)), where Σ_vis,i(t) is the covariance matrix representing the uncertainty on the position of the feature; it is obtained from the feature tracker. C is a matrix representing the transformation from the NED reference frame to the ECEF reference frame, and B is a vector from the earth center to the origin of the NED frame. R_ci is the rotation between the body frame and the nIMU frame, and T_ci is the vector from the nIMU frame to the body frame. n_GPS ~ N(0, Σ_GPS) is the noise from uncertainty in the GPS measurement; it can be obtained from the Dilution of Precision (DOP) values provided by the GPS unit. n_ins,acc ~ N(0, Σ_ins,acc) and n_ins,gyro ~ N(0, Σ_ins,gyro) are the noise from uncertainties in the accelerometers and gyroscopes of the inertial sensor, which are estimated adaptively using an Adaptive Kalman Filter in the inertial hypothesis. a_bias and ω_bias are the biases for the accelerometers and gyroscopes given in the hardware specifications. n_mag ~ N(0, Σ_mag) is the noise from uncertainty in the magnetometer measurement and is also estimated adaptively.

Figure 7: Estimation of a vision hypothesis: Part 1 is estimated using an EKF whose measurements are the projections of the features in Part 1 and the GPS fix (when available). Then every feature in Part 2 is estimated using an individual EKF based on the motion in Part 1.
Figure 8: Estimation of a vision-inertial hypothesis: Part 1 is estimated using an EKF whose measurements are the projections of the K features in Part 1, the inertial measurements, and the GPS fix (when available). Then every feature in Part 2 is estimated using an individual EKF based on the motion in Part 1.
Figure 9: Estimation of an inertial hypothesis: Part 1 of the hypothesis is estimated using an adaptive KF [Gustafsson and Firm 2000] whose measurements are the inertial measurements and the GPS fix (when available). Then every feature in Part 2 is estimated using an individual EKF based on the motion in Part 1 and whose measurement is the projection of the considered feature.

The vision hypotheses will have equations 8.9 and 8.10 missing. Inertial hypotheses will have equations 8.1 and 8.7 missing. In the non-GPS mode, all hypotheses will have equation 8.8 missing. Note that the K features in equation 8.1 are chosen randomly from the N total features and are different for every hypothesis. There are two reasons for which GPS measurements are not available at all times. The first reason is that the GPS provides measurements at 1 Hz, whereas vision and inertial measurements are much more frequent (e.g. 30 Hz). The second reason is that the GPS receiver may not be in line of sight of enough satellites to determine its position (e.g. in tunnels or buildings). Unfortunately, the scale and reference frame are not observable with vision (i.e. visual features do not provide measurements of absolute scale). Thus, when the GPS signal disappears, we have to fix the scale and reference frame for vision-type hypotheses. To do this, whenever the GPS measurement is not available, we replace equation 8.1 with

Z_{L(l)}(t+1) = Z_{L(l)}(t), l = 4, ..., 8,
x_{L(l)}(t+1) = x_{L(l)}(t), l = 4, ..., 8,

where X_{L(l)} = [Z_{L(l)} x_{L(l)}, Z_{L(l)} y_{L(l)}, Z_{L(l)}]^T.

4.2.1 EKF Formulation

For each hypothesis S_1^j, we have a system of the form (n.b. dropping the index j):

S_1(t) = f(S_1(t-1)) + n_s(t), n_s(t) ~ N(0, Σ_s),
m(t) = h(S_1(t)) + n_m(t), n_m(t) ~ N(0, Σ_m),    (9)

where f is a subset of the system of equations in (8), dependent on the hypothesis type, and h is the system of equations 8.6, 8.7 and 8.10.

Therefore, to update S_1^j from t-1 to t, we follow the EKF approach described in [Chiuso et al. 2002] and use the standard notations of dynamical systems filtering, where x(t|t-1) refers to an estimate of x predicted from the previous estimate and x(t|t) refers to an estimate of x that uses the previous estimate and the most recent measurement. The filtering equations are:

Prediction

S_1(t|t-1) = f(S_1(t-1|t-1)),
P(t|t-1) = F(t-1) P(t-1|t-1) F^T(t-1) + Σ_s.    (10)

Update

S_1(t|t) = S_1(t|t-1) + A(t) [ m(t) - h(S_1(t|t-1)) ],
P(t|t) = Γ(t) P(t|t-1) Γ^T(t) + A(t) Σ_m A^T(t).    (11)

Gain

Λ(t) = H(t) P(t|t-1) H^T(t) + Σ_m,
A(t) = P(t|t-1) H^T(t) Λ^{-1}(t),
Γ(t) = I - A(t) H(t).    (12)

Linearization

F(t) = ∂f/∂S_1 evaluated at S_1(t-1|t-1),
H(t) = ∂h/∂S_1 evaluated at S_1(t|t-1).    (13)

For calculation details of the above matrices refer to [Chiuso et al. 2002].

Figure 10: Mean reprojection errors of Sequence 2 of the State Vector at frame 160 on all the previous frames.
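Equations (10) to (13) are one cycle of a standard EKF; a generic sketch of that cycle is shown below, with the models f and h and their Jacobians passed in as callables. It mirrors the structure of the equations, not the authors' code.

```python
import numpy as np

def ekf_step(s, P, m, f, h, F_jac, H_jac, Sigma_s, Sigma_m):
    """One prediction/update cycle for S1 (equations 10-13)."""
    # Prediction (10)
    s_pred = f(s)
    F = F_jac(s)                               # F(t) = df/dS1 at S1(t-1|t-1)
    P_pred = F @ P @ F.T + Sigma_s
    # Linearization (13) and gain (12)
    H = H_jac(s_pred)                          # H(t) = dh/dS1 at S1(t|t-1)
    Lam = H @ P_pred @ H.T + Sigma_m
    A = P_pred @ H.T @ np.linalg.inv(Lam)
    Gam = np.eye(len(s)) - A @ H
    # Update (11)
    s_new = s_pred + A @ (m - h(s_pred))
    P_new = Gam @ P_pred @ Gam.T + A @ Sigma_m @ A.T
    return s_new, P_new

# Tiny usage example: a scalar constant state observed directly.
f = h = lambda s: s
F_jac = H_jac = lambda s: np.eye(1)
s, P = np.zeros(1), np.eye(1)
s, P = ekf_step(s, P, np.array([1.2]), f, h, F_jac, H_jac, 0.01 * np.eye(1), 0.1 * np.eye(1))
```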


4.3 Estimation of Part 2

For every feature X_i^j of S_2^j, we have the following dynamical system:

X_i^j(t) = X_i^j(t-1) + n_x(t), n_x(t) ~ N(0, Σ_x),
u_i(t) = proj( R^j(t|t) X_i^j(t) + T^j(t|t) ) + n_vis,i(t), n_vis,i(t) ~ N(0, Σ_vis,i).    (14)

Note that in the above equation, the measurement model is formulated in terms of R^j(t|t), T^j(t|t) and not in terms of R^j(t|t-1), T^j(t|t-1). This is one of the major advantages of our approach. Let Q be the covariance matrix of [Ω^j(t), T^j(t)]. Let G_m be the linearization of the measurement equation with respect to R(t) and T(t) at (R^j(t), T^j(t)):

G_m^j(t) = ∂ proj( R(t) X_i^j(t) + T(t) ) / ∂(R(t), T(t)), evaluated at (R^j(t), T^j(t)).    (15)

Then X_i^j can be updated using equations similar to those in Section 4.2.1. However, instead of using Σ_vis,i, we replace it by Σ_vis,i + G_m,i^j Q G_m,i^jT to take into consideration the effect of the uncertainty in the camera motion.

4.4 Evaluating the Hypotheses

At every time instant, we rank each of the hypotheses based on their fitness to the available measurements.

4.4.1 Weights from Vision Measurements

Ideally, to test a given hypothesis j at time t using the visual measurements, we have to determine its re-projection error e_vis(j,t) on all the previous frames. The reprojection of the ith 3D point of the jth hypothesis at time t on the kth frame and its covariance matrix are given by:

u_p,i^j(t,k) = proj( R^j(k) X_i^j(t) + T^j(k) ),
Σ_p,i^j(t,k) = G_m,i^j(k) Q(k) G_m,i^jT(k) + G_x,i^j(t,k) P_x(t) G_x,i^jT(t,k),    (16)

where

G_x,i^j(t,k) = ∂ proj( R^j(k) X_i(t) + T^j(k) ) / ∂X_i(t), evaluated at X_i^j,    (17)

and P_x is the covariance of X_i^j(t). The reprojection error is then

e_vis(j,t) = 1/(t+1) Σ_{k=0..t} Σ_{i=1..N} ρ( (u_p,i^j(t,k) - u_i)^T (Σ_p,i^j(t,k) + Σ_vis,i(k))^{-1} (u_p,i^j(t,k) - u_i) ),    (18)

where ρ is a robust function which we simply define as ρ(e) = e if 0 ≤ e < th and ρ(e) = th if e ≥ th, where th is a threshold. The errors are expressed in terms of mean pixel error per feature. However, this would require us to keep track of all the image projections and to recompute the reprojection errors of the state vector on all the frames, at every time step, which cannot be done in real time. Therefore, we use an approximation ê_vis of this error, which can be computed recursively at every frame as follows:

ê_vis(j,t) = ê_vis(j,t-1) t/(t+1) + [ Σ_{i=1..N} ρ( (u_p,i^j(t,t) - u_i)^T (Σ_p,i^j(t,t) + Σ_vis(t))^{-1} (u_p,i^j(t,t) - u_i) ) ] / (t+1).    (19)

In this case, instead of computing the reprojection error of the state vector S(t) on all the previous frames k, k = 0, ..., t, we are computing for every frame k the reprojection error of S(k) on this frame, which would be equivalent to the following:

ê_vis(j,t) = 1/(t+1) Σ_{k=0..t} Σ_{i=1..N} ρ( (u_p,i^j(k,k) - u_i)^T (Σ_p,i^j(k,k) + Σ_vis,i(k))^{-1} (u_p,i^j(k,k) - u_i) ).

A weight for the hypothesis, based on the error ê_vis(j,t), is then defined as follows:

w_vis(j,t) = exp(-ê_vis(j,t)).    (20)

The weights from all the hypotheses are then normalized to sum to one.

4.4.2 Inertial Weights

Inertial weights are measured by the deviations of V and ω from the inertial measurements. This error consists of two parts: one for the accelerometer and one for the gyros. We define the mean and covariance matrices of the translational and rotational velocities of the jth hypothesis in the IMU reference frame:

a_p,ins^j(t) = R_ci^T a^j(t) + R^jT(t) g + R^j(t) T_ci + a_bias,
Σ_p,ins,acc^j(t) = R_ci^T P_a^j(t) R_ci,
ω_p,ins^j(t) = R_ci^T ω^j(t) + ω_bias,
Σ_p,ins,gyro^j(t) = R_ci^T P_ω^j(t) R_ci,    (21)

where P_a^j(t) and P_ω^j(t) are the covariance matrices of a and ω in the hypothesis j. The accelerometer and gyro errors are accumulated as:

e_acc(j,t) = e_acc(j,t-1) + [ (a_p,ins^j(t) - a_ins(t))^T (Σ_p,ins,acc^j(t) + Σ_ins,acc(t))^{-1} (a_p,ins^j(t) - a_ins(t)) ] / (t+1),
e_gyro(j,t) = e_gyro(j,t-1) + [ (ω_p,ins^j(t) - ω_ins(t))^T Σ_ins,gyro^{-1} (ω_p,ins^j(t) - ω_ins(t)) ] / (t+1).    (22)

The weights corresponding to ê_acc(j,t) and ê_gyro(j,t) are calculated as follows:

w_acc(j,t) = exp(-ê_acc(j,t)),
w_gyro(j,t) = exp(-ê_gyro(j,t)).    (23)

The gyro and accelerometer weights of the hypotheses are then normalized to 1.

4.4.3 Magnetometer Weights

The magnetometer weights are measured by the deviation of Ω from the magnetometer measurement:

e_mag(j,t) = e_mag(j,t-1) + [ (Ω^j(t) - Ω_mag(t))^T (Σ_mag + P_Ω^j(t))^{-1} (Ω^j(t) - Ω_mag(t)) ] / (t+1),    (24)

where P_Ω^j(t) is the covariance of Ω in the hypothesis j. The weights are computed as:

w_mag(j,t) = exp(-ê_mag(j,t)),    (25)

and then normalized to 1 over all the considered hypotheses.

4.4.4 GPS Weights

The GPS weights are measured by the deviation of T from the GPS measurement:

e_GPS(j,t) = e_GPS(j,t-1) + [ (C T^j(t) + B - T_GPS(t))^T (C P_T^j(t) C^T + Σ_GPS)^{-1} (C T^j(t) + B - T_GPS(t)) ] / (t+1),    (26)

where P_T^j(t) is the covariance of T in the hypothesis j. The weights are computed as:

w_GPS(j,t) = exp(-ê_GPS(j,t)),    (27)

and then normalized to 1 over all the considered hypotheses.

4.4.5 Combining Weights

To compute an aggregate weight for each hypothesis, we simply treat the weights from each type of measurement as probabilities and multiply them:

w(j,t) = w_vis(j,t) w_acc(j,t) w_gyro(j,t) w_mag(j,t) w_GPS(j,t).    (28)

Note that when the visual measurements are not good, all the reprojection errors are very high. The robust function ρ then leads, almost, to equal visual weights for the hypotheses, and then w(j,t) will not be affected by the vision weights.
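A compact sketch of the weight computation of Section 4.4 (equations 20, 23, 25, 27 and 28), assuming the per-sensor errors ê(j,t) have already been accumulated for all hypotheses; the numbers are illustrative.

```python
import numpy as np

def sensor_weights(errors):
    """Per-sensor weights w(j,t) = exp(-e(j,t)), normalized over hypotheses."""
    w = np.exp(-np.asarray(errors, dtype=float))
    return w / w.sum()

def combined_weights(e_vis, e_acc, e_gyro, e_mag, e_gps):
    """Aggregate weight per hypothesis: product of the per-sensor weights (eq. 28)."""
    w = (sensor_weights(e_vis) * sensor_weights(e_acc) * sensor_weights(e_gyro)
         * sensor_weights(e_mag) * sensor_weights(e_gps))
    return w / w.sum()

# Illustrative errors for three hypotheses; the second one fits the data best.
w = combined_weights(e_vis=[2.0, 0.5, 3.0], e_acc=[1.0, 0.8, 1.2],
                     e_gyro=[0.5, 0.4, 0.9], e_mag=[0.3, 0.2, 0.4],
                     e_gps=[1.5, 0.6, 2.0])
best = int(np.argmax(w))   # index of the currently best hypothesis
```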

4.5 System Initialization and Operation

Upon starting our system, a number of features are detected in the first image and then tracked through subsequent frames. We randomly sample N_s putative sets of K = 8 features each. Every set is used to generate a hypothesis (vision or vision-inertial). A hypothesis is generated by initializing an EKF in which the initial translation is set to 0, the rotation with respect to the NED frame is determined from the accelerometer and magnetometer at rest, and the translational and rotational velocities are set to zero. After a number of frames, when the EKFs are assumed to have converged, for every hypothesis we triangulate the 3D positions of the features in its Part 2 using the motion in its Part 1. The covariance of every 3D feature is determined using the unscented transform [Julier and Uhlmann 1997], the covariance of the motion in Part 1, and the covariance of its image feature. The triangulated 3D features and their covariances are used as an initialization of the individual EKFs in Part 2. A similar procedure is used to insert new features. When a new feature is detected (n.b. in our system we keep the number of visible features constant), it is tracked for a certain number of frames (e.g. from t1 to t2). Then, for each of the available hypotheses, the 3D location of this feature is determined by triangulation and used as an initial value for the EKFs of this feature in every hypothesis.
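The triangulation used to initialize new 3D features can be illustrated with a standard linear (DLT) two-view triangulation, sketched below under the assumption of normalized image coordinates (f = 1). It is a generic formulation, not the authors' exact procedure, which additionally propagates covariances through the unscented transform.

```python
import numpy as np

def triangulate_point(R1, T1, u1, R2, T2, u2):
    """Linear (DLT) triangulation of one 3D point from two views.
    Each view has pose (R, T) with x_cam = R X + T, and a normalized
    image observation u = (u, v)."""
    P1 = np.hstack([R1, T1.reshape(3, 1)])   # 3x4 projection matrix, view 1
    P2 = np.hstack([R2, T2.reshape(3, 1)])   # 3x4 projection matrix, view 2
    A = np.vstack([u1[0] * P1[2] - P1[0],
                   u1[1] * P1[2] - P1[1],
                   u2[0] * P2[2] - P2[0],
                   u2[1] * P2[2] - P2[1]])
    _, _, Vt = np.linalg.svd(A)
    X_h = Vt[-1]
    return X_h[:3] / X_h[3]                  # inhomogeneous 3D point

# Illustrative check with a known point and a small lateral baseline.
R1, T1 = np.eye(3), np.zeros(3)
R2, T2 = np.eye(3), np.array([-0.2, 0.0, 0.0])
X_true = np.array([0.5, 0.2, 4.0])
u1 = (R1 @ X_true + T1)[:2] / (R1 @ X_true + T1)[2]
u2 = (R2 @ X_true + T2)[:2] / (R2 @ X_true + T2)[2]
X_est = triangulate_point(R1, T1, u1, R2, T2, u2)   # ~ [0.5, 0.2, 4.0]
```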


When one of the K features in the part S_1 of a given hypothesis is not tracked (e.g. disappears or gets occluded), it is replaced by another feature from the part S_2 of the hypothesis; this latter feature is chosen as the one with the lowest variance. The replacement is done by eliminating the occluded, or disappearing, feature from the state vector and substituting its mean by the mean of the new feature. In addition, the lines and rows in the covariance matrix corresponding to the old feature are set to zero and only the part corresponding to the autocovariance of the old feature is replaced by the covariance of the new feature.

When the weight of a given hypothesis becomes smaller than some threshold th1 and if none of its individual weights is higher than another threshold th2, the hypothesis is discarded and replaced by a new hypothesis generated from one of the good hypotheses. The new hypothesis S_new is generated by copying the old hypothesis S^j and then interchanging the feature having the largest variance in S_1^new with the feature having the lowest variance in S_2^new.

5. Experimental Results with the Vision System

As a proof of concept for our hypothesis-based approach, we tested the system using vision only hypotheses. We calibrated a Flea2 camera, from Point Grey, using a checkerboard pattern and used the camera to capture several sequences of images. In some of the sequences, the camera is fixed on a pan-tilt unit; and in other sequences, the camera is moved by hand. We report, hereafter, the results on three sequences. We compare the reprojection error of the state vector estimated by our system, in all the considered frames, against the error of the full covariance EKF [Chiuso et al. 2002] and of the optimal nonlinear offline optimization (i.e. bundle adjustment) [Triggs et al. 2000]. The bundle adjustment is a Gauss-Newton batch estimation of the camera poses at all the frames and the positions of all the considered points by minimizing the reprojection errors of the 3D parameters in all the images. The reprojection errors in a given frame are determined by projecting the estimates of the 3D points on this frame using the corresponding camera pose estimate, then taking the difference between the obtained projections and the detected feature positions. We use these reprojection errors to evaluate different estimators. If the errors in the image projections are Gaussian, the estimator corresponding to the smallest reprojection error would be the Maximum Likelihood estimator; otherwise, in the non-Gaussian case, it would be the BLUE estimator (Best Linear Unbiased Estimator). Figures 11, 12, and 13 show the reprojection errors of the state vectors at time 160, of all the (20) considered hypotheses on all the frames, from time 10 till 160, for the three sequences respectively. The blue solid lines correspond to the different hypotheses; the green dash-dotted line corresponds to the optimal bundle adjustment solution; the dashed red line corresponds to the chosen hypothesis; and the dotted black line corresponds to the full EKF solution. The chosen hypothesis in our system always corresponds to one of the best hypotheses; it has a reprojection error much lower than the EKF error and it is also very close to the error of the optimal bundle adjustment. Figure 14 shows the estimated positions of the scene landmarks, where the diameter of the circles is proportional to the depth of the feature. The larger the diameter of a circle, the farther the feature is from the camera center. A qualitative examination of these features shows that their obtained depth is in accord with the distance to the cameras and that the 3D positions are obtained to good accuracy. The results presented, thus far, show the effectiveness and efficiency of our multi-hypothesis approach. Since the kernel of every hypothesis is the EKF filter of its Part 1, and based on previous results of vision-inertial EKFs [Jones et al. 2007], we expect the same performance of our approach, with added accuracy, when we include inertial measurements in the tests.

Figure 11: Mean reprojection errors of Sequence 1 of the State Vectors at frame 160 on all the previous frames.


6. Conclusion

We proposed an efficient system to fuse GPS with vision and inertial measurements for localization and mapping. The system maintains multiple hypotheses that are more general than the particles in a Particle Filter and allow our system much more flexibility in selecting the set of measurements with the set of parameters that give the most accurate results. The system is independent of the image feature tracking algorithm, while the multiple hypotheses scheme provides robustness to the potential failures of the tracker. We showed preliminary results that outlined the superiority of our framework versus the traditional EKF in terms of giving the best possible localization and mapping. Our framework can easily be generalized so that it is scalable to different networks of sensors. Future work will focus on an extensive experimental evaluation of our approach with very long sequences in addition to various indoor and outdoor conditions. We will rigorously test our system's robustness to extended loss of GPS data, as well as confirm the superior computational efficiency and scalability implied by the derivation of our probabilistic framework when compared to other sensor fusion algorithms.

References

Bai, L. and Y. Wang. 2008. Fusing image, GPS and GIS for road tracking using multiple condensation particle filters. In 2008 IEEE Intelligent Vehicles Symposium, pp. 162–167.

Bouguet, J. 2000. Pyramidal implementation of the Lucas Kanade feature tracker: description of the algorithm. Intel Corporation, Microprocessor Research Labs.

Carceroni, R., A. Kumar, and K. Daniilidis. 2006. Structure from motion with known camera positions. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pp. 477–484.

Chikuma, M., Z. Hu, and K. Uchimura. 2004. Fusion of vision, GPS and 3D-gyro in solving camera global registration problem for vision-based road navigation. IEIC Technical Report (Institute of Electronics, Information and Communication Engineers), 103(643):71–76.

Chiuso, A., P. Favaro, H. Jin, and S. Soatto. 2000. 3-D motion and structure from 2-D motion causally integrated over time: Implementation. In ECCV '00: Proceedings of the 6th European Conference on Computer Vision, Part II, pp. 734–750, London, U.K., Springer-Verlag.

Chiuso, A., P. Favaro, H. Jin, and S. Soatto. 2002. Structure from motion causally integrated over time. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 523–535.


Figure 12: Mean reprojection errors of Sequence 2 of the State Vectors at frame 160 on all the previous frames.

Figure 13: Mean reprojection errors of Sequence 3 of the State Vectors at frame 160 on all the previous frames.

Figure 14: Recovered 3D points. The diameter of the circles is proportional to the distance from the camera.


Davison, A., I. Reid, N. Molton, and O. Stasse. 2007. MonoSLAM: Real-time single camera SLAM. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1052–1067.

Drolet, L., F. Michaud, and J. Cote. 2000. Adaptable sensor fusion using multiple Kalman filters. In 2000 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2000), Proceedings, volume 2.

Eade, E. and T. Drummond. 2006. Scalable monocular SLAM. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1.

Elinas, P., R. Sim, and J. Little. 2006. SLAM: Stereo vision SLAM using the Rao-Blackwellised particle filter and a novel mixture proposal distribution. pp. 1564–1570.

Fakih, A. and J. Zelek. 2008. On the benefits of using gyroscope measurements with structure from motion. In ECCV Workshop on Multi-camera and Multi-modal Sensor Fusion (M2SFA2 2008), Oct.

Fischler, M.A. and R.C. Bolles. 1981. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24:381–395.

Forlani, G., R. Roncella, and F. Remondino. 2005. Structure and motion reconstruction of short mobile mapping image sequences. In VII Conference on Optical 3D Measurement Techniques, pp. 265–274.

Goldbeck, J., B. Huertgen, S. Ernst, and L. Kelch. 2000. Lane following combining vision and DGPS. Image and Vision Computing, 18(5):425–433.

Grewal, M., L. Weill, and A. Andrews. 2007. Global positioning systems, inertial navigation, and integration. Wiley-Interscience.

Gustafsson, F. and K. Firm. 2000. Adaptive filtering and change detection. Wiley, New York.

Hu, Z. and K. Uchimura. 2006. Fusion of vision, GPS and 3D gyro data in solving camera registration problem for direct visual navigation. International Journal of ITS Research, 4(1):3–12.

Jones, E., A. Vedaldi, and S. Soatto. 2007. Inertial structure from motion with autocalibration. In Proceedings of the ICCV Workshop on Dynamical Vision.

Julier, S. and J. Uhlmann. 1997. A new extension of the Kalman filter to nonlinear systems. 3.

Karlsson, N., E. di Bernardo, J. Ostrowski, L. Goncalves, P. Pirjanian, and M. Munich. 2005. The vSLAM algorithm for robust localization and mapping. pp. 24–29.

Konolige, K., M. Agrawal, and J. Sola. 2007. Large scale visual odometry for rough terrain. In Proc. International Symposium on Robotics Research.

Li, Y., J. Wang, C. Rizos, P. Mumford, and W. Ding. 2006. Low-cost tightly coupled GPS/INS integration based on a nonlinear Kalman filtering design. In Proceedings of ION National Technical Meeting, pp. 18–20.

Mohamed, A. and K. Schwarz. 1999. Adaptive Kalman filtering for INS/GPS. Journal of Geodesy, 73(4):193–203.

Montemerlo, M. 2003. FastSLAM: A factored solution to the simultaneous localization and mapping problem with unknown data association. Ph.D. thesis, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, July.

Montemerlo, M., S. Thrun, D. Koller, and B. Wegbreit. 2003. FastSLAM 2.0: An improved particle filtering algorithm for simultaneous localization and mapping that provably converges. 18:1151–1156.

Nister, D., O. Naroditsky, and J. Bergen. 2006. Visual odometry for ground vehicle applications. Journal of Field Robotics, 23(1):3–20.

Point Grey Research, Inc. http://www.ptgrey.com.

Qian, G., R. Chellappa, and Q. Zheng. 2001. Robust structure from motion estimation using inertial data. Journal of the Optical Society of America A, 18:2982–2997.

Ramachandran, M., A. Veeraraghavan, and R. Chellappa. 2007. Fast bilinear SfM with side information. In IEEE 11th International Conference on Computer Vision (ICCV 2007), pp. 1–8.

Roumeliotis, S., A. Johnson, and J. Montgomery. 2002. Augmenting inertial navigation with image-based motion estimation. In IEEE International Conference on Robotics and Automation (ICRA '02), Proceedings, volume 4.

Sasiadek, J., Q. Wang, and M. Zeremba. 2000. Fuzzy adaptive Kalman filtering for INS/GPS data fusion. In Intelligent Control, 2000. Proceedings of the 2000 IEEE International Symposium on, pp. 181–186.

Sim, R., P. Elinas, and J. J. Little. 2007. A study of the Rao-Blackwellised particle filter for efficient and accurate vision-based SLAM. Int. J. Comput. Vision, 74(3):303–318.

Triggs, B., P. McLauchlan, R. Hartley, and A. Fitzgibbon. 2000. Bundle adjustment: A modern synthesis. Vision Algorithms: Theory and Practice, 1883:298–372.

Xia, T., M. Yang, and R. Yang. 2006. Vision based global localization for intelligent vehicles. In 2006 IEEE Intelligent Vehicles Symposium, pp. 571–576.

Yang, R. and M. Pollefeys. 2003. Multi-resolution real-time stereo on commodity graphics hardware. In 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Proceedings, volume 1.

Authors:

Adel Fakih is a Ph.D. student in Systems Design Engineering at the University of Waterloo. His research interests are in visual motion analysis, structure from motion, visual navigation and tracking human limbs in video sequences. He holds a Masters degree in Computer and Communications Engineering from the American University of Beirut, in Lebanon. His masters' research was about the detection and extraction of vertebrae from spinal X-ray images. He did his undergraduate studies at the Lebanese University where he graduated with a Diploma in Electrical and Electronics Engineering.

Samantha Ng is currently pursuing a M.Sc. degree at the University of Waterloo. Her main research area is computer vision and human pose estimation. Samantha received her B.Sc. from the University of Waterloo, specializing in mechatronics, statistics, and management science.

John Zelek is Associate Professor in Systems Design Engineering at the University of Waterloo, with expertise in the area of intelligent mechatronic control systems that interface with humans; specifically, (1) wearable sensory substitution and assistive devices; (2) probabilistic visual and tactile perception; (3) wearable haptic devices including their design, synthesis and analysis; and (4) human-robot interaction. Zelek has co-founded two start-up companies. He was awarded the best paper award at the international IEEE/IAPRS Computer and Robot Vision conference. He was also awarded best paper at the 1995 IEEE SMC conference, and his student won best student paper at the 2002 Vision Interface conference. He was awarded a 2006 and 2008 Distinguished Performance Award from the Faculty of Engineering at the University of Waterloo. He was also awarded the 2004 Young Investigator Award by the Canadian Image Processing & Pattern Recognition Society for his work in robotic vision. His research has been reported by the media in the Globe & Mail, Discovery Channel, CBC, CTV, and many other venues such as Science & Vie Junior Hors Série (a renowned young science magazine in France).
