
3D Motion Tracking of a Mobile Robot in a Natural Environment

P. Saeedi, P. Lawrence, D. Lowe
Department of Electrical and Computer Engineering, Department of Computer Science
University of British Columbia, Vancouver, BC, V6T 1Z4, Canada

{[email protected]}

Abstract

This paper presents a vision-based tracking system suitable for autonomous robot vehicle guidance. The system includes a head with three on-board CCD cameras, which can be mounted anywhere on a mobile vehicle. By processing consecutive trinocular sets of precisely aligned and rectified images, the local 3D trajectory of the vehicle in an unstructured environment can be tracked. First, a 3D representation of stable features in the image scene is generated using a stereo algorithm. Second, motion is estimated by tracking matched features over time. The motion equation with 6 DOF is then solved using an iterative least squares fit algorithm. Finally, a Kalman filter implementation is used to optimize the world representation of scene features.

1 Introduction

The problem of motion tracking for mobile robots has been studied extensively, resulting in a variety of methods. These methods vary based upon the sensor, the environment, and the user's prior knowledge of the environment. Many of these systems are developed for indoor structured environments, or they are based on the recognition of predefined known landmarks. Most of these systems, however, are limited to 2D planar motions [5] [9] [10].

At the University of British Columbia we are investigating the problem of 3D motion estimation (pose estimation) of mobile robots in unknown environments. We assume that we have no prior knowledge of the environment and that there are no specific landmarks in the scene. Further, the scene is assumed to consist mostly of rigid objects; if a few small objects are moving, the system still relies on the static information. The motion of the robot is also assumed to be limited in acceleration, which allows the feature search techniques to work on a small and predictable range of possible matches.

Harris [4] has described a system that solves for full 6-DOF motion from a monocular camera, but it suffers from difficulties with initialization and accuracy; we are able to overcome these problems by integrating stereo and motion solutions. Our approach consists of several phases that are executed sequentially.

I. Feature Extraction: the extraction of meaningful features from the scene that can be tracked over a sequence of frames, or over time.

II. Stereo Vision: the creation of a 3D representation of the extracted features within the scene.

III. Feature Tracking: the identification and tracking of identical features over time.

IV. Motion Estimation: the calculation of camera motion relative to the tracked features in an absolute reference frame.

V. Position Refinement: the refinement of the 3D locations of world features by combining individual measurements over a sequence of estimations.

Each of these sub-tasks is described in more detail in the following sections, followed by experimental results and conclusions.

2 Feature Detection

Choosing the type of feature is very important and has a strong impact on the real-time performance of the system. In systems based upon landmarks or models, it is likely that at times no landmark is visible, so the motion estimation will not be accurate for some percentage of the time. Choosing simple features within the scene increases the reliability of the solution and enables the system to find an accurate motion estimate most of the time, unless the scene is very uniform. We have chosen to work with corners, because they are discrete and partially invariant to scale and rotational changes.

The Harris and Stephens corner detector [3], a modified version of the Moravec [8] corner detector, is implemented. Their method involves shifting a circular patch of the image in different directions. If the patch includes a corner, then shifting along any direction results in large changes. Therefore a corner can be detected when the minimum change produced by any of the shifts is large enough:

E(x, y) = \sum_{u,v} W_{u,v} \left| I_{x+u,\,y+v} - I_{u,v} \right|^2    (1)

Here I_{u,v} denotes the image intensity value at point (u, v), and x and y give the shift amount of the circular window W_{u,v}:

W_{u,v} = e^{-\frac{u^2 + v^2}{\sigma^2}}    (2)

With the assumption of small displacements, the shifted intensity in Equation 1 can be approximated by a first-order Taylor expansion, which reduces E to the quadratic form

E(x, y) = [x \; y] \, M \begin{bmatrix} x \\ y \end{bmatrix}, \quad \text{where } M = \begin{bmatrix} A & C \\ C & B \end{bmatrix}    (3)

where

A = X^2 \otimes W, \quad B = Y^2 \otimes W, \quad C = (XY) \otimes W    (4)

and

X \approx \frac{\partial I}{\partial x}, \quad Y \approx \frac{\partial I}{\partial y}    (5)

The quality of the corner is then measured from a corner response R:

R = \operatorname{Det}(M) - K \, (\operatorname{Tr}(M))^2    (6)

A quick look at R shows that the response function is very small within a uniform region, negative in edge regions, and positive in corner regions. The value K in the response function corresponds to the maximum ratio of the eigenvalues of M for which the response function remains positive.

Figure 1 shows the result of corner detection on a sample image.

Figure 1: A corner detection sample result.
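For concreteness, the response computation in Equations 1 through 6 can be sketched in a few lines of NumPy/SciPy; the smoothing scale sigma and the constant K = 0.04 below are illustrative choices, not values taken from the paper.

import numpy as np
from scipy.ndimage import gaussian_filter

def harris_response(image, sigma=1.0, k=0.04):
    """Corner response R (Equation 6) over a grayscale float image."""
    # Gradient approximations X ~ dI/dx and Y ~ dI/dy (Equation 5).
    dy, dx = np.gradient(image.astype(float))
    # Elements of M smoothed by the Gaussian window W (Equations 2 and 4).
    a = gaussian_filter(dx * dx, sigma)
    b = gaussian_filter(dy * dy, sigma)
    c = gaussian_filter(dx * dy, sigma)
    # R = Det(M) - K * Tr(M)^2: large and positive at corners, negative at
    # edges, near zero in uniform regions.
    return (a * b - c * c) - k * (a + b) ** 2

Corners are then taken as local maxima of R above a threshold.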

3 Stereo Vision

Constructing the depth of the features is possible using a stereo algorithm. At each point in time, our CCD camera system, Triclops [11], captures a set of three images which are precisely aligned horizontally and vertically. These rectified images are used to construct a sparse depth map for the corner features. This solution was chosen for the real-time performance of the system, since the number of corners is much lower than the number of pixels in each image.

Figure 2: Depth construction by stereo images.

As shown in Figure 2, the depth z of point O can be computed from the displacement of the corresponding projected points on the stereo images, d_2 - d_1:

z = \frac{f B}{d_2 - d_1}    (7)

Here f is the focal length of the camera, B is the baseline, or separation of the cameras, and C_1 and C_2 are the camera centers.
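Equation 7 translates directly into code; the following sketch uses placeholder numbers, since the focal length and baseline of the Triclops head are not stated here.

def depth_from_disparity(f_pixels, baseline, d1, d2):
    """Depth z of a point from the displacement of its projections (Eq. 7)."""
    return f_pixels * baseline / (d2 - d1)

# Illustrative values only, not the actual camera parameters.
z = depth_from_disparity(f_pixels=400.0, baseline=0.10, d1=0.0, d2=29.0)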

Although constructing the depth is possible with just two stereo images, the use of three images enhances the accuracy of the depth and motion estimation by eliminating invalid match candidates. Figure 3 shows a set of captured images; corresponding corners are marked with identical numbers. Knowing the focal length, the camera separation, and the displacement for point 9 (29 pixels), the depth is computed to be 0.98 m.

Figure 3: Stereo matching result for a trinocular set of images.

4 Feature Tracking

In this section, corresponding 3D features are tracked from one frame (at time t) to the next (at time t + \Delta t). Systems with more complicated features or landmarks usually track the landmark through different frames, since detecting the landmark or model from scratch may take more time. In our approach it is not possible to track identical corners from frame to frame without detecting them in each new set of images. Therefore, for each corner a simple search routine is applied to find all possible match candidates in the vicinity of the predicted position in the next image frame. A similarity metric, the Normalized Sum of Squared Differences, is then computed to measure the similarity of each pair of match candidates [2]:

S = \frac{\sum_{x=-M/2}^{M/2} \sum_{y=-N/2}^{N/2} \left( (I_1 - \bar{I}_1) - (I_2 - \bar{I}_2) \right)^2}{\sqrt{\sum_{x=-M/2}^{M/2} \sum_{y=-N/2}^{N/2} (I_1 - \bar{I}_1)^2 \; \sum_{x=-M/2}^{M/2} \sum_{y=-N/2}^{N/2} (I_2 - \bar{I}_2)^2}}

where I_1 and I_2 denote the image intensities of the two patches, with average values \bar{I}_1 and \bar{I}_2.

The two corners within corresponding image search regions with the best similarity score S are considered to be identical features. Figure 4 shows some identical features that are tracked over two frames.

Figure 4: Corresponding features are related and tracked in two consecutive images.
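A sketch of this similarity computation for two equally sized patches follows; note that, as defined above, S is a normalized squared difference, so the best match is the one that minimizes it.

import numpy as np

def nssd(patch1, patch2):
    """Normalized Sum of Squared Differences between two M-by-N patches."""
    a = patch1.astype(float) - patch1.mean()
    b = patch2.astype(float) - patch2.mean()
    num = np.sum((a - b) ** 2)
    den = np.sqrt(np.sum(a ** 2) * np.sum(b ** 2))
    # A uniform patch gives den == 0; such patches are not corners and are
    # excluded by the detector before matching.
    return num / den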

5 Motion Estimation

Having a set of corresponding corners between two consecutive images, motion estimation becomes the problem of optimizing a 3D transformation that projects the world corners, constructed from the first image, onto the second image. Although the 3D construction of 2D features is a non-linear function, the problem of motion estimation is still well-behaved. This is because any 3D motion is composed of rotations and translations:

- Rotations are functions of the cosine of the rotation angles.

- Translation toward or away from the camera introduces a perspective distortion as a function of the inverse of the distance from the camera.

- Translation parallel to the image plane is almost linear.

Therefore, the problem of 3D motion estimation is a promising candidate for the application of Newton's method, which is based on the assumption that the function is locally linear. To minimize the probability of converging to a false local minimum, we look for outliers and eliminate them during the iteration process.

In this method, at each iteration a correction vector x is computed and subtracted from the current estimate, resulting in a new estimate. If P^{(i)} is the vector of image coordinates (u, v) at iteration i, then

P^{(i+1)} = P^{(i)} - x    (8)


Given a vector of error measurements between the world 3D features and their projections, we find the x that minimizes this error.

The effect of each element of the correction vector x_j on the error measurement e_i is the partial derivative of the error with respect to that parameter, multiplied by the correction; this follows from the main assumption, the local linearity of the function:

Jx = e, \quad \text{where } J_{ij} = \frac{\partial e_i}{\partial x_j}    (9)

J is called the Jacobian matrix, and e_i denotes the error between the predicted location of the object and the actual position of the match found, in image coordinates. Each row of this matrix equation states that one measured error e_i should be equal to the sum of all the changes in that error resulting from the parameter corrections [7]. Since Equation 9 is usually over-determined, no exact solution exists, so we find a vector x that minimizes the 2-norm of the residual:

\min_x \| Jx - e \|_2    (10)

Equation 10 has the same solution as the normal equation,

x = [J^T J]^{-1} J^T e    (11)

Therefore, in each iteration of Newton's method, we simply solve the normal equation (Equation 11) for x using LU decomposition [12].
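Each iteration thus reduces to a single linear solve. In the sketch below, NumPy's solver performs the LU decomposition internally, standing in for the Numerical Recipes routine [12].

import numpy as np

def newton_step(jacobian, errors):
    """Correction vector x from the normal equation (Equation 11)."""
    jtj = jacobian.T @ jacobian
    # Solving J^T J x = J^T e is equivalent to x = [J^T J]^{-1} J^T e
    # but avoids forming the explicit inverse.
    return np.linalg.solve(jtj, jacobian.T @ errors)

The estimate is then updated as in Equation 8, with outlier measurements dropped between iterations.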

The most computationally expensive aspect of implementing Newton's method is calculating the partial derivatives that form the Jacobian matrix. The partial derivatives with respect to the translation parameters can be most easily calculated by first reparametrizing the projection equations [6]. If the vector of motion parameters is (D_x, D_y, D_z, \phi_x, \phi_y, \phi_z), then the new location of the projected point (x, y, z) in the subsequent image is

(u, v) = \left( \frac{f(x + D_x)}{z + D_z}, \; \frac{f(y + D_y)}{z + D_z} \right)    (12)

D_x, D_y and D_z denote the incremental translations, and \phi_x, \phi_y and \phi_z are the rotational increments about the x, y and z axes. The partial derivatives in the Jacobian matrix of Equation 9 are calculated from

\frac{\partial u}{\partial D_x} = 1, \quad \frac{\partial u}{\partial D_y} = 0, \quad \frac{\partial u}{\partial D_z} = \frac{f x}{(z + D_z)^2}    (13)

\frac{\partial u}{\partial \phi_x} = \frac{f}{z + D_z} \frac{\partial x}{\partial \phi_x} - \frac{f x}{(z + D_z)^2} \frac{\partial z}{\partial \phi_x}    (14)

\frac{\partial u}{\partial \phi_y} = \frac{f}{z + D_z} \frac{\partial x}{\partial \phi_y} - \frac{f x}{(z + D_z)^2} \frac{\partial z}{\partial \phi_y}    (15)

\frac{\partial u}{\partial \phi_z} = \frac{f}{z + D_z} \frac{\partial x}{\partial \phi_z} - \frac{f x}{(z + D_z)^2} \frac{\partial z}{\partial \phi_z}    (16)

The partial derivatives of x, y and z with respect to the counterclockwise rotation parameters \phi (in radians) can be found in Table 1. This table shows how easily and efficiently we can compute the Jacobian matrix elements in Equation 9.

Table 1: The partial derivatives table.

            x     y     z
  \phi_x    0    -z     y
  \phi_y    z     0    -x
  \phi_z   -y     x     0
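As a sketch, one Jacobian row for the u coordinate of a single feature can be assembled by transcribing Equations 13 through 16 and Table 1 as printed; this is a direct reading of the formulas above, not an independent derivation.

def jacobian_row_u(x, y, z, f, dz):
    """Partials of u w.r.t. (Dx, Dy, Dz, phi_x, phi_y, phi_z)."""
    s = z + dz
    # Translation derivatives (Equation 13, as printed).
    row = [1.0, 0.0, f * x / s**2]
    # Table 1: dx/dphi and dz/dphi for rotations about x, y and z.
    dx_dphi = (0.0, z, -y)
    dz_dphi = (y, -x, 0.0)
    # Rotation derivatives (Equations 14-16).
    for dxp, dzp in zip(dx_dphi, dz_dphi):
        row.append(f / s * dxp - f * x / s**2 * dzp)
    return row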

6 Position Refinement

For many systems, each motion estimation from an individual sample or set of samples contains a significant amount of random error. If there is no significant systematic error, then these errors can be reduced by filtering the location information [4].

In our system, each frame within which a feature is detected gives an additional measurement of the location of that feature. The Kalman filter provides a means to combine these noisy measurements to form a continuous estimate of the current location of the feature. Each point in the world space is associated with a Kalman filter, which is updated using new motion information. This process increases the accuracy of the location information of the feature points in the world space. In our system we implemented a Kalman filter in a similar fashion to Shapiro's method [13].

This formulation is recursive: the least squares estimate of the world feature position w and its covariance C are given by

C_i^{-1} = C_{i-1}^{-1} + A_i^T V_i^{-1} A_i    (17)

w_i = w_{i-1} + k_i (b_i - A_i w_{i-1})    (18)

k_i = C_i A_i^T V_i^{-1}    (19)


Here C_i is the uncertainty in the estimate w_i of the world position of the feature at frame i, k_i is the filter gain, b_i is the current measurement of the feature, V_i is the covariance matrix of the measurement errors, and A_i is the identity matrix.
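One update of a single feature's filter, following Equations 17 through 19, can be sketched as below; the explicit inverses mirror the equations rather than a numerically tuned implementation.

import numpy as np

def kalman_update(w_prev, c_prev, b, v):
    """Recursive least squares update of a feature's world position."""
    a = np.eye(len(w_prev))                  # A_i is the identity matrix
    v_inv = np.linalg.inv(v)
    c = np.linalg.inv(np.linalg.inv(c_prev) + a.T @ v_inv @ a)  # Eq. 17
    k = c @ a.T @ v_inv                                         # Eq. 19
    w = w_prev + k @ (b - a @ w_prev)                           # Eq. 18
    return w, c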

Obviously, the error vector for any given measurement depends on the relative accuracy of that measurement. Our corner detector's accuracy, which is tied to the accuracy of our stereo system and was originally 2 pixels, is improved by fitting a sub-sample estimator [1]. This is a simple quadratic estimator that locates the corner to within a fraction of a pixel: it uses the neighboring intensity values and fits a second-order curve through the corner and its two neighbors. Since the depth construction is very sensitive to noise, this estimator improves the accuracy of the depth construction significantly.
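A minimal sketch of such a three-point quadratic estimator is shown below; this is the standard parabola peak-interpolation formula for samples at unit spacing, offered as one plausible reading of the estimator, not code from the paper.

def subpixel_peak_offset(left, center, right):
    """Offset in (-0.5, 0.5) of the extremum of a parabola fitted to
    three neighboring samples of the corner response."""
    denom = left - 2.0 * center + right
    if denom == 0.0:   # flat neighborhood; no refinement possible
        return 0.0
    return 0.5 * (left - right) / denom

Applied along each image axis, this locates the corner to sub-pixel precision before the depth is triangulated.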

7 Experimental Results

To show the performance of the system, a motion estimation experiment was carried out along a known path by processing 87 image frames. Motion is determined while the robot moves from a starting point A toward an ending point H. In order to compare the estimated motion at different positions with the real values, a number of reference points are considered. The positions of these reference points are measured in the world. Using these measurements, we can observe and analyze the accuracy of the estimation.

This experiment solves for all 6 degrees of freedom, although the physical experiment environment included changes in only 3 of the parameters: the depth z, the lateral position x, and the depth axis orientation \phi_y. Figure 5 shows the comparison of the real path and the estimated path found by the system.

A study of the results shows that the depth construction has 15% error at point B. This error is reduced to 10% when the system has moved to position C. While moving toward points D, E and F, the z estimation error increases to 18%. This behavior can be explained by studying the experiment's environment. As can be seen in Figure 6, at the beginning of the experiment most of the objects within the scene are located far from the camera system. This is the main source of the inaccuracy in the z parameter. Distant features have particularly poor depth information, since they are obtained at limited resolution. This large error in depth effectively displaces these points and can cause error in the localization of the mobile system as it tries to solve the least squares fit using data that has a very systematic type of error in depth.

Figure 5: The actual and estimated path, plotted as Z [cm] versus X [cm], with reference points A through H marked. The plot shows the map of the path between point A and point H, the system-estimated motion from point A to point H, and the depth axis orientation from point A to point H.

As the camera moves closer to objects, the depth construction becomes more accurate. As soon as the camera starts rotating, the error in the z parameter of the estimate increases, and it continues to do so until the end of the curve. This reaction can be explained by noting that distant points behave under rotation very much as they do under a sideways translation. This causes the least squares fit to find a solution with a translation or rotation that is much larger than the actual movement; in other words, these points effectively couple the rotational and translational motion. This source of error can be reduced by using wide-angle optics on the camera, as points imaged from widely separated angles clearly distinguish rotation from translation.

The results also demonstrate that the system estimates the orientation with very high accuracy: the error in the worst case is no more than 3% over a 90 degree rotation.

8 Conclusions

In this paper we described a feature-based 3D trajectory tracking system for the control of a mobile robot equipped with a camera system. The system reduces the computational cost effectively by processing corner features of the scene. Also, using a stereo algorithm to construct world features has enabled us to perform 3D motion estimation using 2D images. Sub-pixel interpolation and Kalman filtering have improved the accuracy of the system compared to 2D visual tracking systems. We have been unable to find reported real-time performance for other 3D motion tracking systems based on 2D images. The performance of our system is 1.1 seconds per motion estimation on an Intel® Pentium® 150 MHz processor, and we believe this can be improved to video rate in the future.

Figure 6: Sample images taken at reference points A through H.

9 Acknowledgments

This work was supported by the Canadian IRIS/PRECARN Network of Centers of Excellence.

References

[1] V.N. Dvornychenko. Bounds on (deterministic) correlation functions with application to registration. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 206-216, 1983.

[2] Pascal Fua. A parallel stereo algorithm that produces dense depth maps and preserves image features. Machine Vision and Applications, Springer-Verlag, 1993.

[3] C. Harris and M. Stephens. A combined corner and edge detector. Proceedings of the 4th Alvey Vision Conference, pages 147-151, 1988.

[4] Chris Harris. Tracking with rigid models. Active Vision, MIT Press, Cambridge, 1992.

[5] Y. Kim. Localization of a mobile robot using a laser range finder in a hierarchical navigation system. IEEE International Conference on Robotics and Automation, 1993.

[6] David G. Lowe. Three-dimensional object recognition from single two-dimensional images. Artificial Intelligence, Elsevier Science Publishers B.V. (North-Holland), 1987.

[7] David G. Lowe. Fitting parameterized three-dimensional models to images. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 441-450, 1991.

[8] H. Moravec. Obstacle avoidance and navigation in the real world by a seeing robot rover. Technical report, Robotics Institute, Carnegie Mellon University, 1980.

[9] A.J. Munoz and J. Gonzalez. Two-dimensional landmark-based position estimation from a single image. IEEE International Conference on Robotics and Automation, pages 3709-3714, 1998.

[10] R. Murrieta-Cid, M. Briot, and N. Vandapel. Landmark identification and tracking in natural environment. IEEE International Conference on Robotics and Automation, pages 179-184, 1998.

[11] Point Grey Research, Inc. Triclops Stereo Vision System. Technical report, Department of Computer Science, University of British Columbia, Vancouver, www.ptgrey.com, 1997.

[12] W.H. Press, S.A. Teukolsky, W.T. Vetterling, and B.P. Flannery. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, 1992.

[13] L.S. Shapiro, A. Zisserman, and M. Brady. 3D motion recovery via affine epipolar geometry. International Journal of Computer Vision, vol. 16, pages 147-182, 1995.
