


Department of Computer Science
George Mason University
Technical Report Series

4400 University Drive MS#4A5, Fairfax, VA 22030-4444 USA
http://cs.gmu.edu/
703-993-1530

Location Recognition, Global Localization and Relative Positioning Based on Scale-Invariant Keypoints

Jana Kosecka and Xiaolong Yang∗

Technical Report GMU-CS-TR-2004-2

Abstract

The localization capability of a mobile robot is central to basic navigation and map-building tasks. We describe a probabilistic environment model which facilitates a global localization scheme by means of location recognition. In the exploration stage the environment is partitioned into a class of locations, each characterized by a set of scale-invariant keypoints. The descriptors associated with these keypoints can be robustly matched despite changes in contrast, scale and viewpoint. We demonstrate the efficacy of these features for location recognition, where given a new view the most likely location from which this view came is determined. Misclassifications due to dynamic changes in the environment or inherent appearance ambiguities are overcome by exploiting neighborhood relationships captured by a Hidden Markov Model. We report the recognition performance of this approach on an indoor environment consisting of eighteen locations and discuss the suitability of the approach for a more general class of recognition problems. Once the most likely location has been determined, we show how to compute the relative pose between the representative view and the current view.

1 Introduction and Related Work

The two main instances of the mobile robot localization problem are the continuous pose maintenance problem and global localization, also known as the 'robot kidnapping' problem. While a successful solution to the localization problem requires addressing both, here we concentrate only on the global localization aspect. The problem of vision-based global localization shares many

∗This work is supported by NSF IIS-0118732.

aspects of object recognition and hence is amenable to similar methodologies. While several instances of vision-based localization have been successfully solved in smaller-scale environments [1, 2, 3, 4], the applicability of these methods to large, dynamically changing environments poses additional challenges and calls for alternative models. The methods for localization vary in the choice of features and the environment model. The two main components of the environment model are the descriptors chosen to represent an image and the representation of changes in image appearance as a function of viewpoint. As in the case of object recognition, both global and local image descriptors have been considered. Global image descriptors consider the entire image as a point in a high-dimensional space and model the changes in appearance as a function of viewpoint using subspace methods [5]. Given the subspace representation, the pose of the camera was typically obtained by a spline interpolation method, exploiting the continuity of the mapping between the object appearance and the continuously changing viewpoint. Robust versions of these methods have been applied to robot localization using omnidirectional cameras [1]. Alternative global representations proposed in the past include responses to banks of filters [6], multi-dimensional histograms [7, 8] or orientation histograms [9]. These types of global image descriptors integrate the spatial image information and enable classification of views into coarser classes (e.g. corridors, open areas), yielding only qualitative localization. In the case of local methods, the image is represented in terms of localized image regions which can be reliably detected. Representative local image descriptors include affine or rotationally invariant features [10, 11] or local Fourier transforms of salient image regions [12]. Due to the locality of


these image features, the recognition is naturally less prone to large amounts of clutter and occlusions. In the case of both global and local methods, a sparser set of descriptors was typically obtained by principal component analysis or various clustering techniques.

Our approach is motivated by the recent advances in object recognition using local scale-invariant features proposed by [10] and adopts a strategy of localization by means of location recognition. The image sequence acquired by the robot during exploration is first partitioned into individual locations. The locations correspond to regions of the space across which the features can be matched successfully. Each location is represented by a set of model views and their associated scale-invariant features. In the first localization stage, the current view is classified as belonging to one of the locations using a standard voting approach. In the second stage we exploit the knowledge of neighborhood relationships between individual locations, captured by a Hidden Markov Model (HMM), and demonstrate an improvement in the overall recognition rate. The main contribution of the presented work is the instantiation of the Hidden Markov Model in the context of this problem and the resulting improvement in the overall recognition rate. This step is essential particularly in the case of large-scale environments, which often contain uninformative regions that violate the continuity of the mapping between the environment appearance and the camera pose. In such cases, imposing a discrete structure on the space of continuous observations enables us to overcome these difficulties while maintaining a high recognition rate.

2 Scale-Invariant Features

The use of local features and their associated descriptors in the context of object recognition has been demonstrated successfully by several researchers in the past [13, 14, 15]. In this paper we examine the effectiveness of scale-invariant (SIFT) features proposed by D. Lowe [10]. The SIFT features correspond to highly distinguishable image locations which can be detected efficiently and have been shown to be stable across wide variations of viewpoint and scale. Such image locations are detected by searching for peaks in the image D(x, y, σ), which is obtained by taking a difference of two neighboring images in the scale space

$$D(x, y, \sigma) = (G(x, y, k\sigma) - G(x, y, \sigma)) * I(x, y) = L(x, y, k\sigma) - L(x, y, \sigma). \quad (1)$$

The image scale space $L(x, y, \sigma)$ is first built by convolving the image with a Gaussian kernel with varying $\sigma$, such that at a particular $\sigma$, $L(x, y, \sigma) = G(x, y, \sigma) * I(x, y)$. Candidate feature locations are obtained by

Figure 1: The circle center represents the keypoint's location and the radius the keypoint's scale.

searching for local maxima and minima of $D(x, y, \sigma)$. In the second stage, detected peaks with low contrast or poor localization are discarded. A more detailed discussion of enforcing separation between the features, sampling of the scale space and improving feature localization can be found in [10, 16]. Once a location and scale have been assigned to a candidate keypoint, its dominant orientation is computed by determining the peaks in the orientation histogram of its local neighborhood, weighted by the gradient magnitude. The keypoint descriptor is then formed by computing local orientation histograms (with 8-bin resolution) for each element of a 4 × 4 grid overlaid over a 16 × 16 neighborhood of the point. This yields a 128-dimensional feature vector which is normalized to unit length in order to reduce sensitivity to image contrast and brightness changes in the matching stage. Figure 1 shows the keypoints found in example images from our environment. In the reported experiments the number of features detected in an image of size 480 × 640 varies between 10 and 1000. In many instances this relatively low number of keypoints is due to the fact that in indoor environments many images contain only a small number of textured regions. Note that the detected SIFT features correspond to distinguishable image regions and include both point features and regions along line segments.
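As an illustration of Eq. (1), the following Python sketch builds a small difference-of-Gaussian stack and keeps its local extrema as candidate keypoints. This is our own illustration rather than the implementation used in the report (it assumes NumPy and SciPy, and the function name, number of scales and contrast threshold are arbitrary choices); the orientation and descriptor computation described above is omitted.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter, minimum_filter

def dog_keypoint_candidates(image, sigma=1.6, k=2 ** 0.5, num_scales=5,
                            contrast_thresh=0.03):
    """Candidate keypoints as local extrema of D(x, y, sigma) from Eq. (1).

    image: 2D float array scaled to [0, 1]. The constants are illustrative,
    not the values used in the report.
    """
    # Gaussian scale space L(x, y, sigma) = G(x, y, sigma) * I(x, y).
    L = [gaussian_filter(image, sigma * k ** i) for i in range(num_scales)]
    # Difference-of-Gaussian images D = L(k * sigma) - L(sigma), Eq. (1).
    D = np.stack([L[i + 1] - L[i] for i in range(num_scales - 1)])

    # A pixel is a candidate if it is an extremum of its 3x3x3 neighborhood
    # in (scale, row, col) and has sufficient contrast.
    is_max = D == maximum_filter(D, size=3)
    is_min = D == minimum_filter(D, size=3)
    strong = np.abs(D) > contrast_thresh
    scales, rows, cols = np.nonzero((is_max | is_min) & strong)

    # Low-contrast or poorly localized peaks would be pruned further [10, 16].
    return [(r, c, sigma * k ** s) for s, r, c in zip(scales, rows, cols)]
```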

3 Environment Model

The environment model which we will use to test our localization method is obtained in the exploration stage. Given a temporally sub-sampled sequence acquired during the exploration (images were taken approximately every 2-3 meters), the sequence is partitioned into 18 different locations. The exploration route can be seen in Figure 2. Different locations in our model correspond to hallways, sections of corridors and meeting rooms approached at different headings. In the current experiment, the environment is mostly comprised of a network of rectangular corridors and hallways which are typically traversed with four possible headings (N, S, W, E). The


Figure 2: The map of the fourth floor of our building. The arrows correspond to the heading of the robot and the labels represent individual locations.

Figure 3: The number of keypoints matched between consecutive views for the sequence comprised of 18 locations (a snapshot was taken every 2-3 meters) captured by a digital camera (left); the number of keypoints matched between the first and the i-th view for a video sequence comprised of 3 locations (right).

deviations from these headings can be handled as long as there is sufficient overlap between the model views acquired during the exploration and the current views. In case the current view cannot be matched successfully, a new location is added to the model. The number of views per location varies between 8 and 20 depending on the appearance variation within the location. The transitions between locations occur either at places where navigation decisions have to be made or where the appearance of the location changes suddenly. The transitions between individual locations are determined based on the number of features which can be successfully matched between successive frames. These are depicted in Figure 3 for a sequence captured by a still digital camera along a path which visited all eighteen locations (some of them twice), and for a video sub-sequence along a path which visited three locations. The transitions between individual locations are marked by the peaks in the graph, corresponding to new locations. In order to obtain a more compact representation of each location, a number of representative views is chosen per location, each characterized by a set of SIFT features. The sparsity of the model is directly related to the capability of matching SIFT features in the presence of larger variations in scale. The number of representative views varied between one and four per location and was obtained

by regular sampling of the views belonging to individual locations. Examples of representative views associated with individual locations are depicted in Figure ??.
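The partitioning step described in this section can be sketched as follows. This is a hypothetical reconstruction assuming NumPy: the function name and the fixed match-count threshold are ours, whereas the report reads the transitions off the match-count curve of Figure 3; the regular sampling of representative views follows the description above.

```python
import numpy as np

def partition_into_locations(match_counts, min_matches=15, views_per_location=4):
    """Split an exploration sequence into locations (a sketch of Section 3).

    match_counts[i] is the number of SIFT keypoints matched between frames
    i and i + 1; `min_matches` is a hypothetical threshold standing in for
    the transition criterion read off the curve in Figure 3.
    """
    boundaries = [0]
    for i, count in enumerate(match_counts):
        # A sudden drop in matches marks a transition to a new location.
        if count < min_matches:
            boundaries.append(i + 1)
    boundaries.append(len(match_counts) + 1)

    locations = []
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        frames = list(range(start, end))
        # Representative views obtained by regular sampling of the frames.
        idx = np.linspace(0, len(frames) - 1, min(views_per_location, len(frames)))
        locations.append({"frames": frames,
                          "representatives": [frames[int(i)] for i in idx]})
    return locations
```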

4 Location recognition

The environment model obtained in the previous section consists of a database of model views¹. The $i$-th location in the model, with $i = 1, \ldots, N$, is represented by $n$ views $I^i_1, \ldots, I^i_n$ with $n \in \{1, 2, 3, 4\}$, and each view is represented by a set of SIFT features $\{S_k(I^i_j)\}$, where $k$ is the number of features. In the initial stage we tested location recognition using a simple voting scheme. Given a new query image $Q$ and its associated keypoints $\{S_l(Q)\}$, a set of corresponding keypoints between $Q$ and each model view $I^i_j$, $\{C(Q, I^i_j)\}$, is first computed. The correspondence is determined by matching each keypoint in $\{S_l(Q)\}$ against the database of $\{S_k(I^i_j)\}$ keypoints and choosing the nearest neighbor based on the Euclidean distance between the two descriptors. We only consider point matches with high discrimination capability, whose nearest neighbor is at least 0.6 times closer than the second nearest neighbor. A more detailed justification of this threshold can be found in [10]. In the subsequent voting scheme we determine the location whose keypoints were most frequently classified as nearest neighbors. The location the query image $Q$ came from is then determined based on the number of successfully matched points among all model views

$$C(i) = \max_j |\{C(Q, I^i_j)\}| \quad \text{and} \quad [l, num] = \max_i C(i),$$

where $l$ is the index of the location with the maximum number $num$ of matched keypoints. Table 1 shows the location recognition results as a function of the number of representative views per location on the training sequence of 250 views and two test sequences of 134 and 130 images each.
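A minimal sketch of the nearest-neighbor matching with the 0.6 ratio test and the subsequent voting, assuming NumPy and 128-dimensional SIFT descriptors; the function names are our own, and the ratio threshold follows [10] as cited above.

```python
import numpy as np

def match_with_ratio_test(query_desc, model_desc, ratio=0.6):
    """Nearest-neighbor matching of SIFT descriptors with the ratio test.

    A query keypoint is accepted only if its nearest model descriptor is at
    least `ratio` times closer than the second nearest (Section 4, [10]).
    query_desc: (m, 128) array, model_desc: (n, 128) array with n >= 2.
    """
    matches = []
    for qi, q in enumerate(query_desc):
        d = np.linalg.norm(model_desc - q, axis=1)
        first, second = np.argsort(d)[:2]
        if d[first] < ratio * d[second]:
            matches.append((qi, int(first)))
    return matches

def vote_for_location(query_desc, model_views):
    """model_views[i] is a non-empty list of descriptor arrays for location i.

    Returns (l, num): the most likely location index and its match count,
    i.e. C(i) = max_j |C(Q, I_j^i)| and l = argmax_i C(i).
    """
    C = [max(len(match_with_ratio_test(query_desc, view)) for view in views)
         for views in model_views]
    best = int(np.argmax(C))
    return best, C[best]
```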

¹It is our intention to attain a representation of a location in terms of views (as opposed to some abstract features) in order to facilitate relative positioning tasks in the later metric localization stage.


# of views   Seq. #1 (250)   Seq. #2 (134)   Seq. #3 (130)
one          84%             46%             44%
two          97.6%           68%             66%
four         100%            82%             83%

Table 1: Recognition performance for one training and two test sequences, in terms of the percentage of correctly classified views, as a function of the number of representative views.

L4 train / L4 test          L6 train / L6 test

Figure 4: Changes in the appearance of locations L4 and L6 between training and testing. In the left image pair the bookshelf was replaced by a table and couch, and in the right pair recycling bins were removed.

All three sequences were sparse, with images taken 2-3 meters apart. The two test sequences were taken on different days and at different times of day, exhibiting larger deviations from the path traversed during training. Despite the large number of representative views per location, the relatively poor performance on the second and third test sequences was due to several changes in the environment between the training and testing stages. In 5 out of 18 locations several objects were moved or misplaced. Examples of dynamic changes can be seen in Figure 4. The poorer performance due to dynamic changes is not surprising, since the most discriminative SIFT features often belong to objects, some of which are not inherent to particular locations. In the next section we describe how to resolve these issues by modelling the neighborhood relationships between individual locations.

5 Modelling spatial relationships between locations

We propose to resolve these difficulties by incorporating additional knowledge about neighborhood relationships between individual locations. The rationale behind

this choice is that despite the presence of ambiguities in the recognition of individual views, the temporal context should be instrumental in resolving them. The use of temporal context is motivated by the work of [17], which addresses the place recognition problem in the context of a wearable computing application. The temporal context is determined by the spatial relationships between individual locations and is modelled by a Hidden Markov Model (HMM). In this model the states correspond to individual locations and the transition function determines the probability of transition from one state to another. Since the locations cannot be observed directly, each location is characterized by an observation likelihood. The most likely location is at each instant of time obtained by maximizing the conditional probability $P(L_t = l_i \mid o_{1:t})$ of being at location $l_i$ at time $t$ given the available observations up to time $t$. The location likelihood can be estimated recursively using the following formula

$$P(L_t = l_i \mid o_{1:t}) \propto p(o_t \mid L_t = l_i)\, P(L_t = l_i \mid o_{1:t-1}) \quad (2)$$

where $p(o_t \mid L_t = l_i)$ is the observation likelihood, characterizing how likely the observation $o_t$ at time $t$ is to come from location $l_i$. The choice of observation likelihood depends on the available observations and the matching criterion. When local descriptors are used as observations, several such choices have been proposed in the context of probabilistic approaches to object recognition [18, 19]. The proposed likelihood functions properly account for the density and spatial arrangement of features and improve the overall recognition rate. In the context of global image descriptors, locations were modelled in terms of Gaussian mixtures in [17]. Since the location recognition problem is notably simpler than the object recognition problem, with occlusions and clutter not being as prominent, we used a simpler form of the likelihood function. The conditional probability $p(o_t \mid L_t = l_i)$ that a query image $Q_t$ at time $t$, characterized by an observation $o_t = \{S_l(Q_t)\}$, came from a certain location is directly related to the cardinality of the correspondence set $C(i)$, normalized by the total number of matched points across all locations

$$p(o_t \mid L_t = l_i) = \frac{C(i)}{\sum_j C(j)}.$$

In order to explicitly incorporate the location neighborhood relationships, the second term of equation (2) can be further decomposed as

$$P(L_t = l_i \mid o_{1:t-1}) = \sum_{j=1}^{N} A(i, j)\, P(L_{t-1} = l_j \mid o_{1:t-1}) \quad (3)$$

where $N$ is the total number of locations and $A(i, j) = P(L_t = l_i \mid L_{t-1} = l_j)$ is the probability of the two locations being adjacent.


Seq. 2 with and without HMM          Seq. 3 with and without HMM

Figure 5: Classification results for Sequence 2 and Sequence 3 with and without considering the spatial relationships. The black circles correspond to the labels of the most likely locations.

In the presence of a transition between two locations the corresponding entry of $A$ was assigned a unit value, and in the final stage all rows of the matrix were normalized. The results of location recognition employing this model are shown in Figure 5. The recognition rate for Sequence 2 was 96.3% and for Sequence 3 it was 95.4%. The location label assigned to each image is the one with the highest probability. While in both cases some images were misclassified, the overall recognition rates are an improvement over the rates reported in Table 1. Despite the classification errors in Sequence 2, the order of visited locations was correctly determined. For Sequence 3, where we introduced some intentional deviations between the paths taken during training and testing, the classification of location 14 was incorrect. The effect of the HMM can be examined by making all the probabilities in the transition matrix $A$ uniform, essentially neglecting the knowledge of location neighborhood relationships. The assigned location labels for this case are shown in the right column of Figure 5, with noticeably degraded recognition performance.
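The recursive estimate of Eqs. (2)-(3) is a standard HMM forward update. Below is a minimal sketch assuming NumPy, a transition matrix A built from the location adjacency (Table 3) with A[i, j] = P(L_t = l_i | L_{t-1} = l_j), and the per-location match counts C(i) as the observation statistic; the function name is ours and smoothing of zero counts is omitted.

```python
import numpy as np

def hmm_location_update(prior, A, match_counts):
    """One step of the recursive location estimate of Eqs. (2)-(3).

    prior:        P(L_{t-1} = l_j | o_{1:t-1}) as a length-N vector.
    A:            transition matrix with A[i, j] = P(L_t = l_i | L_{t-1} = l_j).
    match_counts: C(i) per location; the observation likelihood is taken as
                  p(o_t | L_t = l_i) = C(i) / sum_j C(j), assuming at least
                  one keypoint was matched somewhere.
    """
    C = np.asarray(match_counts, dtype=float)
    obs_likelihood = C / C.sum()        # p(o_t | L_t = l_i)
    predicted = A @ prior               # Eq. (3): sum_j A(i, j) P(L_{t-1} = l_j | o_{1:t-1})
    posterior = obs_likelihood * predicted
    return posterior / posterior.sum()  # Eq. (2), normalized over locations

# Example: filter a test sequence starting from a uniform prior and label
# each frame with the most probable location.
# N = A.shape[0]; belief = np.ones(N) / N
# for counts in per_frame_match_counts:
#     belief = hmm_location_update(belief, A, counts)
#     label = int(np.argmax(belief))
```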

6 Pose Estimation

Once the most likely location and the best matched view have been found, we can compute the relative displacement between the current view and the model view. Prior to computation of the displacement, the matches are refined in order to establish exact correspondences between matched

Figure 6: Correspondences between the two views and associated scale ratios of corresponding points. Note that the scale ratios of correspondences 2, 3 and 7 are noticeably above the median value, indicating an inconsistent match.

keypoints. In the first stage we use the ratio of intrinsic scale between two corresponding keypoints to reject mismatches. The ratios of all matches are computed, and those matches whose ratio lies above a threshold related to the median ratio of all matches are discarded first. This enables us to discard matches caused by repetitive structures in the environment. See Figure 6 for an example. From Table 2 we can see that the scale ratio

index   scale1   scale2   scale ratio
1       4.0700   9.8300   0.4140
2       2.4100   3.2500   0.7415
3       1.7800   2.8500   0.6246
4       1.2300   3.2400   0.3796
5       1.2300   2.7300   0.4505
6       1.1400   2.5100   0.4542
7       1.0500   1.0000   1.0500
8       1.2600   2.4600   0.5122

Table 2: Scales of individual keypoints in two views and scale ratios of corresponding matches.

(scale1/scale2) values for matches 2, 3 and 7 are noticeably larger than the median scale ratio. Additional match outliers are rejected in the process of robust computation of the relative displacement between the current (query) image and the representative view.
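A sketch of this scale-ratio test, assuming NumPy; the numeric tolerance is an illustrative choice, since the report only states that matches whose ratio lies noticeably above the median are discarded.

```python
import numpy as np

def reject_by_scale_ratio(match_scales, rel_thresh=1.3):
    """Discard matches whose intrinsic-scale ratio lies well above the median.

    match_scales: list of (scale_in_query, scale_in_model) pairs for matched
    keypoints, e.g. the (scale1, scale2) columns of Table 2. `rel_thresh` is
    an illustrative tolerance relative to the median ratio.
    """
    ratios = np.array([s1 / s2 for s1, s2 in match_scales])
    median = np.median(ratios)
    keep = ratios <= rel_thresh * median
    return [i for i, ok in enumerate(keep) if ok]
```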

The current view and the matched model view are related by a rigid body displacement $g = (R, T)$ represented by a rotation $R \in SO(3)$ and translation $T = [t_x, t_y, t_z]^T \in \mathbb{R}^3$. Provided that the camera is calibrated, $g$ can be estimated from the epipolar geometry between the two views. This recovery problem can be further simplified by taking into account the fact that the motion of the robot is restricted to a plane. Here we outline an algorithm for this special case and demonstrate how to recover the displacement in the case of unknown focal length. The case of general motion and unknown focal length was studied by [20], and a solution for the planar motion case has been proposed by [21] in the context of uncalibrated stereo. Here we demonstrate a slightly different, more concise solution to the problem. Consider the perspective camera projection model, where the 3D coordinates of a point $X = [X, Y, Z]^T$ are related to their image projections $x = [x, y, 1]^T$ by an unknown scale $\lambda$; $\lambda x = X$. In case the camera is calibrated, the two views of the scene are related by $\lambda_2 x_2 = R \lambda_1 x_1 + T$, where $(R, T) \in SE(3)$ is a rigid body transformation and $\lambda_1$ and $\lambda_2$ are the unknown depths with respect to the individual camera frames. After elimination of the unknown scales from the above equation, the relationship between the two views is captured by the so-called epipolar constraint

$$x_2^T \widehat{T} R\, x_1 = x_2^T E\, x_1 = 0, \quad (4)$$

where $E = \widehat{T} R$ is the essential matrix². In the case of planar motion, assuming translation in the $x$-$z$ plane and rotation around the $y$-axis by an angle $\theta$, the essential matrix has the following sparse form

$$E = \begin{bmatrix} 0 & -t_z & 0 \\ t_z c_\theta + t_x s_\theta & 0 & t_z s_\theta - t_x c_\theta \\ 0 & t_x & 0 \end{bmatrix} \quad (5)$$

where $s_\theta$ ($c_\theta$) denotes $\sin\theta$ ($\cos\theta$) respectively. Given at least four point correspondences, the elements of the essential matrix $[e_1, e_2, e_3, e_4]^T$ can be obtained as a least-squares solution of a system of homogeneous equations of the form (4). Once the essential matrix $E$ has been recovered, the four different solutions for $\theta$ and $T = \pm[t_x, 0, t_z]$ can be obtained (using basic trigonometry) directly from the parametrization of the essential matrix (5). The physically correct solution is then obtained using the positive depth constraint.
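The linear estimation and decomposition just outlined can be sketched as follows, assuming NumPy; the function names and the way the four homogeneous constraints are stacked are our own choices. Each calibrated correspondence contributes one linear equation in the four unknown entries of the sparse essential matrix of Eq. (5); the least-squares solution is the right singular vector of the smallest singular value, and one of the four (θ, T) solutions is read off directly, leaving out the positive depth check.

```python
import numpy as np

def planar_essential_from_matches(x1, x2):
    """Least-squares estimate of the sparse planar essential matrix of Eq. (5).

    x1, x2: (n, 2) arrays of calibrated image coordinates of matched points,
    n >= 4. The unknowns e = [e1, e2, e3, e4] parametrize
    E = [[0, e1, 0], [e2, 0, e3], [0, e4, 0]].
    """
    y1, y2 = x1[:, 1], x2[:, 1]
    # Each correspondence gives one homogeneous equation x2^T E x1 = 0:
    #   e1 * x2 * y1 + e2 * x1 * y2 + e3 * y2 + e4 * y1 = 0.
    A = np.column_stack([x2[:, 0] * y1, x1[:, 0] * y2, y2, y1])
    _, _, Vt = np.linalg.svd(A)
    e = Vt[-1]                          # null vector = least-squares solution
    E = np.array([[0, e[0], 0], [e[1], 0, e[2]], [0, e[3], 0]])
    return E, e

def decompose_planar_essential(e):
    """One of the four (theta, T) solutions implied by Eq. (5); the physically
    correct one is selected afterwards with the positive depth constraint."""
    e1, e2, e3, e4 = e
    tz, tx = -e1, e4                    # from E[0, 1] = -t_z and E[2, 1] = t_x
    c = (tz * e2 - tx * e3) / (tz ** 2 + tx ** 2)
    s = (tx * e2 + tz * e3) / (tz ** 2 + tx ** 2)
    return np.arctan2(s, c), np.array([tx, 0.0, tz])
```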

In the case of unknown focal length the two views are related by the so-called fundamental matrix $F$,

$$\bar{x}_2^T F\, \bar{x}_1 = 0 \quad \text{with} \quad x = K^{-1}\bar{x}, \quad (6)$$

where $\bar{x}$ denotes the pixel (uncalibrated) image coordinates. The fundamental matrix $F$ is, in this special planar, partially calibrated case, related to the essential matrix $E$ as follows

$$F = K^{-T} E\, K^{-1} \quad \text{with} \quad K = \begin{bmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{bmatrix}, \quad (7)$$

²$\widehat{T}$ denotes the $3 \times 3$ skew-symmetric matrix associated with the vector $T$.

Location 1          Location 2

Figure 7: Relative positioning experiments with respect to the representative views. Bottom: query views along the path between the first view and the representative view for two different locations. Top: recovered motions for the two locations.

where $f$ is the unknown focal length; the remaining intrinsic parameters are assumed to be known. In the planar motion case the matrix $F = \begin{bmatrix} 0 & f_1 & 0 \\ f_2 & 0 & f_3 \\ 0 & f_4 & 0 \end{bmatrix}$ can be recovered from the homogeneous constraints of the form (6) given a minimum of four matched points. Given the planar motion, the focal length can be computed directly from the entries of the fundamental matrix as

$$f = \sqrt{\frac{f_1^2 - f_3^2}{f_2^2 - f_4^2}}. \quad (8)$$

For more details of the derivation see [22]. Once $f$ is computed, the relative displacement between the views can be obtained by the method outlined for the calibrated case. Additional care has to be taken to ensure that the detected matches do not come from a degenerate configuration. We have used the RANSAC algorithm for robust estimation of the pose between the two views, with a slightly modified sampling strategy. Figure 7 shows two examples of relative positioning with respect to two different representative views. The focal length estimates obtained for these examples are $f = 624.33$ and $f = 545.30$. The relative camera pose for the individual views is represented in the figure by a coordinate frame.
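A possible RANSAC wrapper around the four-point linear solver sketched above (it reuses the hypothetical planar_essential_from_matches and decompose_planar_essential functions); the iteration count, algebraic-error threshold and uniform sampling used here are illustrative and are not the modified sampling strategy mentioned in the report.

```python
import numpy as np

def ransac_planar_pose(x1, x2, iters=500, thresh=1e-3, rng=None):
    """Robust planar pose estimation: RANSAC around the 4-point linear solver."""
    rng = rng or np.random.default_rng(0)
    x1h = np.column_stack([x1, np.ones(len(x1))])
    x2h = np.column_stack([x2, np.ones(len(x2))])
    best_inliers = np.array([], dtype=int)
    for _ in range(iters):
        sample = rng.choice(len(x1), size=4, replace=False)
        E, _ = planar_essential_from_matches(x1[sample], x2[sample])
        # Algebraic epipolar residual |x2^T E x1| for every correspondence.
        residuals = np.abs(np.einsum('ij,jk,ik->i', x2h, E, x1h))
        inliers = np.nonzero(residuals < thresh)[0]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    # Refit on the inlier set (fall back to all points if support is too small)
    # and decompose into the planar rotation angle and translation.
    support = best_inliers if len(best_inliers) >= 4 else np.arange(len(x1))
    _, e = planar_essential_from_matches(x1[support], x2[support])
    theta, T = decompose_planar_essential(e)
    return theta, T, support
```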


7 Conclusions and Future Work

We have demonstrated the suitability and the discrimination capability of scale-invariant SIFT features in the context of location recognition and global localization tasks. Although matching and location recognition can be accomplished using an efficient and simple voting scheme, the recognition rate is affected by dynamic changes in the environment and by inherent ambiguities in the appearance of individual locations. We have shown that these difficulties can be partially resolved by exploiting the neighborhood relationships between locations, captured by a Hidden Markov Model.

Since the notion of a location is not defined precisely and is merely inferred in the learning stage, the presented method enables only qualitative global localization in terms of individual locations. Following the global localization, we compute the relative pose of the robot with respect to the closest reference view [23] found in the matching stage. This enables us to achieve metric localization with respect to the reference view, which can be followed by relative positioning tasks. More extensive experiments, as well as integration with the exploration and navigation strategies on board a mobile robot platform, are currently underway.

8 Acknowledgements

The authors would like to thank D. Lowe for making available the code for detection of SIFT features.

9 Appendix

9.1 Environment Model Acquisition

The following appendix describes the process of acquisition of the model in the exploration stage. While at the moment the exploration stage and the navigation/localization stage are separate, we are in the process of integrating the two so as to enable simultaneous map building and localization. Given the sequence taken in the exploration stage, we start with an initially empty model and add new locations, while matching new locations against the existing locations in the database. In case a new location is encountered, the representative views of the location and their associated SIFT features are added to the model and the transition matrix is updated accordingly; a minimal sketch of this loop is given below. Tables 3, 4 and 5 show the results of the model construction stage. From Table 4 we can see that frames 1 to 10 belong to location L1, frames 123-165 also belong to location L1, etc. The representative views of the individual locations are listed in Table 5. In the transition matrix table, A[i][j] = 1 represents the existence of a transition from location Li to location Lj.
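A minimal sketch of this incremental model construction loop, assuming NumPy and a hypothetical match_count(view, location) helper that returns the number of SIFT keypoints of a view matched against the representative views stored for a location; the acceptance threshold and the dictionary-based model layout are our own choices.

```python
import numpy as np

def build_model(exploration_views, match_count, min_matches=15, reps_per_location=4):
    """Incrementally build the location model described in Section 9.1 (a sketch).

    exploration_views: views (e.g. SIFT descriptor sets) in temporal order.
    match_count(view, location): hypothetical helper returning how many
    keypoints of `view` match the representative views stored for `location`.
    """
    locations, route = [], []             # growing model and visited-location sequence
    for view in exploration_views:
        scores = [match_count(view, loc) for loc in locations]
        if scores and max(scores) >= min_matches:
            label = int(np.argmax(scores))    # view matches an existing location
        else:
            label = len(locations)            # new location encountered
            locations.append({"frames": []})
        locations[label]["frames"].append(view)
        route.append(label)

    # Transition matrix: A[i][j] = 1 if a transition from L_i to L_j was observed.
    N = len(locations)
    A = np.eye(N)
    for prev, cur in zip(route[:-1], route[1:]):
        A[prev, cur] = 1
    # Representative views by regular sampling of each location's frames.
    for loc in locations:
        idx = np.linspace(0, len(loc["frames"]) - 1,
                          min(reps_per_location, len(loc["frames"])))
        loc["representatives"] = [loc["frames"][int(i)] for i in idx]
    return locations, A
```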

1 1 0 0 0 0 0 0 1 0 0 0 0 0
0 1 1 0 0 0 0 0 0 0 0 0 0 0
0 0 1 1 0 0 0 0 0 0 0 0 0 0
0 0 0 1 1 0 0 0 0 0 0 0 0 0
0 0 0 0 1 1 0 0 0 0 0 1 0 0
0 0 0 0 0 1 1 0 0 0 0 0 0 0
0 0 0 0 0 0 1 1 0 0 0 0 0 0
1 0 0 0 0 0 0 1 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 1 0 0 0 0
0 0 0 0 0 0 0 0 0 1 1 0 0 0
0 0 0 0 1 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0 0 0 0 1 1 0
0 0 0 0 0 0 0 0 0 0 0 0 1 1
1 0 0 0 0 0 0 0 0 0 0 0 0 1

Table 3: Transition matrix.

begin frame #   end frame #   location label
1               10            L1
11              23            L2
24              35            L3
36              50            L4
51              91            L5
92              101           L6
102             111           L7
112             122           L8
123             165           L1
166             174           L9
175             188           L10
189             200           L11
201             226           L5
227             239           L12
240             249           L13
250             258           L14
259             290           L1

Table 4: Exploration route.

References

[1] Artac, M., Jogan, M., Leonardis, A.: Mobile robot localization using an incremental eigenspace model. In: IEEE Conference on Robotics and Automation. (2002) 1025–1030

[2] Gaspar, J., Winters, N., Santos-Victor, J.: Vision-based navigation and environmental representations with an omnidirectional camera. IEEE Transactions on Robotics and Automation (2000) 777–789

[3] Davison, A., Murray, D.: Simultaneous localization and map building using active vision. IEEE Transactions on PAMI 24 (2002) 865–880


Location   view #1   view #2   view #3   view #4
L1         123       135       147       160
L2         11        15        19        21
L3         24        27        32        33
L4         36        40        46        48
L5         51        64        78        89
L6         92        95        98        99
L7         102       105       108       109
L8         112       115       119       120
L9         166       168       172       172
L10        175       179       184       186
L11        189       192       197       198
L12        227       231       235       237
L13        240       243       246       247
L14        250       252       256       256

Table 5: Representative views for individual locations.

[4] Se, S., Lowe, D., Little, J.: Global localization using distinctive visual features. In: Proc. of International Conference on Robots and Systems. (2002) 153–158

[5] Nayar, S., Nene, S., Murase, H.: Subspace methods for robot vision. IEEE Transactions on Robotics and Automation (6) 750–758

[6] Torralba, A., Sinha, P.: Recognizing indoor scenes. MIT AI Memo (2001)

[7] Schiele, B., Crowley, J.L.: Object recognition using multidimensional receptive field histograms. International Journal of Computer Vision (2000)

[8] Aoki, H., Schiele, B., Pentland, A.: Recognizing places using image sequences. In: Conference on Perceptual User Interfaces, San Francisco (1998)

[9] Kosecka, J., Zhou, L., Barber, P., Duric, Z.: Qualitative image based localization in indoor environments. In: IEEE Proceedings of CVPR. (2003) 3–8

[10] Lowe, D.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision (2004) to appear

[11] Wolf, J., Burgard, W., Burkhardt, H.: Using an image retrieval system for vision-based mobile robot localization. In: Proc. of the International Conference on Image and Video Retrieval (CIVR). (2003)

[12] Sim, R., Dudek, G.: Learning environmental features for pose estimation. Image and Vision Computing 19 (2001) 733–739

[13] Schmid, C., Mohr, R.: Local grayvalue invariants for image retrieval. IEEE Transactions on PAMI 19 (1997) 530–534

[14] Selinger, A., Nelson, R.: A perceptual grouping hierarchy for appearance-based 3D object recognition. Computer Vision and Image Understanding (76) 83–92

[15] Lowe, D.: Object recognition from local scale invariant features. In: International Conference on Computer Vision. (1999) 1150–1157

[16] Brown, M., Lowe, D.: Invariant features from interest point groups. In: Proceedings of BMVC, Cardiff, Wales. (2002) 656–665

[17] Torralba, A., Murphy, K., Freeman, W., Rubin, M.: Context-based vision system for place and object recognition. In: International Conference on Computer Vision. (2003)

[18] Pope, A., Lowe, D.: Probabilistic models of appearance for 3-D object recognition. International Journal of Computer Vision 40 (2000) 149–167

[19] Schmid, C.: A structured probabilistic model for recognition. In: Proceedings of CVPR, Kauai, Hawaii. (1999) 485–490

[20] Sturm, P.: On focal length calibration from two views. In: Conference on CVPR, Kauai. Volume 2. (2001) 145–150

[21] Brooks, M., Agapito, L.D., Huynh, D., Baumela, L.: Towards robust metric reconstruction via a dynamic uncalibrated stereo head. Image and Vision Computing 16 (1998) 989–1002

[22] Kosecka, J., Yang, X.: Global localization and relative positioning based on scale-invariant keypoints

[23] Yang, X., Kosecka, J.: Experiments in location recognition using scale-invariant SIFT features. Technical Report GMU-TR-2004-2, George Mason University (2004)
