
Junghyun Kwon
School of Electrical Engineering and Computer Science, Seoul National University, Seoul 151-742, Korea
[email protected]

Frank C. Park
School of Mechanical and Aerospace Engineering, Seoul National University, Seoul 151-742, Korea
[email protected]

Visual Tracking via Particle Filtering on the Affine Group

Abstract

We present a particle filtering algorithm for visual tracking, in which the state equations for the object motion evolve on the two-dimensional affine group. We first formulate, in a coordinate-invariant and geometrically meaningful way, particle filtering on the affine group that allows for combined state–covariance estimation. Measurement likelihoods are also calculated from the image covariance descriptors using incremental principal geodesic analysis, a generalization of principal component analysis to curved spaces. Comparative visual tracking studies demonstrate the increased robustness of our tracking algorithm.

KEY WORDS—visual tracking, particle filtering, affine group, Lie group

The International Journal of Robotics Research, Vol. 29, No. 2–3, February/March 2010, pp. 198–217. DOI: 10.1177/0278364909345167. © The Author(s), 2010.

1. Introduction

Since Isard and Blake (1998) first showed that particle filtering (see, e.g., Gordon et al. (1993), Doucet et al. (2001), and Arulampalam et al. (2002) and the references therein for an introduction to and survey of recent developments in particle filtering) can be effectively applied to visual tracking, numerous particle filtering-based visual trackers have been proposed. What makes the visual tracking problem challenging, particularly from the point of view of robustness, are sudden and frequent changes in, e.g., pose and facial expressions, lighting, and the background environment (resulting from occlusions, for example).

Not surprisingly, much of the recent effort in particle filtering-based tracking has been devoted to attempting to confront these challenges directly. Lee et al. (2005) apply an offline learning algorithm using training data for different viewpoints, whereas Jepson et al. (2003), Zhou et al. (2004), and Wang et al. (2007) estimate the appearance model parameters while adapting to appearance changes in an online manner. Incremental subspace learning methods have also been widely applied: Ross et al. (2008) develop a principal component-based tracker with an online incremental principal component update feature, Li et al. (2007) extend the incremental principal component method to higher-dimensional tensor data, and Mei et al. (2007) propose an incremental kernel as a non-linear counterpart to incremental principal components. The image covariance descriptor introduced by Tuzel et al. (2006) has also been applied with some success to object detection (Tuzel et al. 2007) and tracking (Porikli et al. 2006). Other notable contributions in particle filtering-based visual tracking include Perez et al. (2004), Rui and Chen (2001), Khan et al. (2004), and Chang and Ansari (2005).

While tracked objects are typically represented by an image region surrounding the object (e.g., an ellipse or rectangle), it has been pointed out recently that information about any shape deformations of the object image region is also highly useful. Zhou et al. (2004) and Lee et al. (2005), for example, argue that such information can lead to better performance in,

e.g., human–robot interaction, video surveillance, and other applications that require simultaneous recognition and tracking. Under mild assumptions, such an image region transformation can be effectively realized as a two-dimensional affine transformation. A number of recent approaches to particle filtering-based visual tracking explicitly make use of the two-dimensional affine transformation as the underlying state space, e.g., Zhou et al. (2004), Li et al. (2007), and Ross et al. (2008).

As is well known, the set of two-dimensional affine transformations is not a vector space, but rather a curved space that also has the structure of a Lie group (the two-dimensional affine group Aff(2)). Existing tracking algorithms for the most part employ a set of local coordinates to parameterize Aff(2) in vector form, and formulate the state and measurement equations as non-linear equations evolving on a vector space. The advantage of doing so is that existing vector space particle filtering algorithms can be readily applied to the problem at hand.

Under usual circumstances it is not unreasonable to expect that such local coordinate-based tracking methods will work reliably. There are of course many possible choices of local coordinates on Aff(2), and the performance of any local coordinate-based tracker will inevitably depend on the particular coordinates used. However, as long as the tracking scenario is reasonably well behaved, and the (typically) additive noise models formulated in the chosen coordinates are locally meaningful, local coordinate-based algorithms should work for the most part.

Realistic visual tracking scenarios, however, often involve extreme changes in the background environment, lighting conditions, and other external disturbances that venture to the limits of the local coordinates' domain of validity. As local coordinates typically cannot globally parameterize a curved space in a homogeneous fashion (consider, for example, the severe distortion that occurs at the poles in the standard two-dimensional latitudinal–longitudinal maps of the Earth), local coordinate-based trackers that fail to take into account the intrinsic geometry of Aff(2) will have at best uneven, and at worst highly unreliable, tracking performance. Partly as a consequence of this failure to take the underlying geometry into account, human intervention in the form of, e.g., adjusting the covariance of noise models and other various filtering parameters, often becomes necessary for reliable performance over a realistic range of operating regimes.

Performing filtering on Aff(2) correctly, therefore, requires a geometrical formulation of the state and measurement equations. Chiuso and Soatto (2000), Srivastava (2000), and Srivastava et al. (2002) have investigated particle filtering on Lie groups for the case of linear state dynamics and measurements, with particular focus given to the orthogonal and Euclidean groups. In previous work (Kwon et al. 2007) we have developed a particle filtering algorithm for general non-linear stochastic systems evolving on the Euclidean group, and more general matrix Lie groups, that also allows for simultaneous state and covariance parameter estimation; the latter is essentially a geometric extension of the kernel smoothing with shrinkage method of Liu and West (2001) that correctly accounts for the geometric structure of the Lie group state space and P(n), the space of symmetric positive-definite matrices.

In this paper we take the general matrix Lie group filtering framework of Kwon et al. (2007) and the affine matrix formulation of visual tracking as our common point of departure, and develop a particle-filtering-based visual tracking algorithm that correctly accounts for the underlying geometry. In addition to working out the details of the general Lie group particle filtering algorithm for the case of the affine group (e.g., formulas for the sample mean on Aff(2) and P(n)), tracking robustness is also enhanced via the following additional features:

• We construct and apply an auto-regressive (AR) state dynamics on the affine group. While past work has employed random walk models for their simplicity (whose disadvantages can be overcome to some extent by using a larger number of particles), we show that an AR state dynamics offers a reasonable compromise between model simplicity and flexibility, and is a highly useful means of improving tracking performance.

• We apply the combined state–covariance estimation framework for Lie group filtering (Kwon et al. 2007) to the visual tracking problem. Experience suggests that setting appropriate covariance values in the state particle propagation is crucial to good tracking, and the combined state–covariance estimation eliminates the need for the user to determine appropriate covariance values a priori. The computational efficiency of the state–covariance estimation procedure is further enhanced by use of the log-Euclidean metric (Arsigny et al. 2007), thereby eliminating the optimization procedure required to calculate the intrinsic sample mean on P(n).

• Measurement likelihoods are calculated by applying principal geodesic analysis (PGA; essentially a generalization of principal component analysis (PCA) to Riemannian manifolds) to the image covariance descriptors in P(n) (Fletcher et al. 2004; Pennec et al. 2006; Fletcher and Joshi 2007).

Our experimental results convincingly show that by drawing together a diverse collection of geometric tools—some developed in previous work by the authors, others newly developed as geometric generalizations of existing vector space ideas—tracking robustness can be enhanced considerably. Apart from providing the algorithmic details of our visual tracker, which as we show below involves numerous geometrical subtleties of practical relevance, our main contribution is a convincing demonstration that, when combined in the


Fig. 1. The state representation for our visual tracking framework. The initial state $X_0$ is set to be the $3 \times 3$ identity matrix. During tracking, $X_k$ represents the two-dimensional affine transformation of the object image region with respect to the initial two-dimensional coordinates.

right way, geometrical methods, despite their increased computational complexity, can indeed have a significant impact on improving tracking performance.

This paper is organized as follows. We begin in Section 2 with some geometric preliminaries on the affine group, including a discussion of general particle filtering on the affine group and a demonstration of how purely local coordinate-based methods can break down. Section 3 provides a detailed account of our Aff(2) particle-filtering framework for visual tracking: modeling the state dynamics as an AR process on Aff(2), combined state–covariance estimation, calculating measurement likelihoods with incremental PGA and the log-Euclidean metric on P(n), and formulas for the sample mean on Aff(2) and P(n). Experimental results are given in Section 4, while Section 5 concludes with a summary.¹

2. Particle Filtering on the Affine Group

2.1. State Dynamics

The object image region is initially assumed as shown in the left panel of Figure 1. The affine transformation of a point $p = (p_x, p_y)^\top \in \mathbb{R}^2$ in the object image region is realized by an invertible linear transformation $G \in \mathbb{R}^{2 \times 2}$, followed by a translation $t = (t_x, t_y)^\top \in \mathbb{R}^2$, i.e., $p' = Gp + t$. In homogeneous coordinates, such a transformation can be equivalently represented as

$$
\begin{pmatrix} p' \\ 1 \end{pmatrix} = \begin{pmatrix} G & t \\ 0 & 1 \end{pmatrix} \begin{pmatrix} p \\ 1 \end{pmatrix}. \qquad (1)
$$

¹ Preliminary versions of this work have been described by Kwon and Park (2008). Also, upon publication of our results, we became aware that Li et al. (2008) also report using the log-Euclidean metric on the image covariance descriptor for incremental subspace learning.

The homogeneous matrix representation $\left(\begin{smallmatrix} G & t \\ 0 & 1 \end{smallmatrix}\right)$ of the two-dimensional affine transformation can be identified with the matrix Lie group of two-dimensional affine transformations, denoted by Aff(2).

Here Aff(2) is the semi-direct product of GL(2) and $\mathbb{R}^2$, where GL(2) denotes the general linear group of $2 \times 2$ non-singular matrices. Its Lie algebra aff(2) consists of matrices of the form $\left(\begin{smallmatrix} M & r \\ 0 & 0 \end{smallmatrix}\right)$, where $M \in gl(2)$ (here $gl(2)$, which is simply the space of real $2 \times 2$ matrices, denotes the Lie algebra of GL(2)) and $r \in \mathbb{R}^2$. The basis elements $E_i$ of aff(2) are chosen to be closed under the Lie bracket operation:

$$
E_1 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{pmatrix}, \quad
E_2 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & 0 \end{pmatrix}, \quad
E_3 = \begin{pmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix},
$$

$$
E_4 = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}, \quad
E_5 = \begin{pmatrix} 0 & 0 & 1 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}, \quad
E_6 = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{pmatrix}. \qquad (2)
$$

Figure 2 illustrates the geometric transformations induced by each basis element $E_i$. A general affine transformation is the result of a combination of the transformation modes shown in Figure 2.

We now consider left-invariant stochastic state equations on the affine group of the form

$$
dX = X \cdot A(X, t)\, dt + X \sum_{i=1}^{m} b_i E_i \, dw_i, \qquad (3)
$$

where $X \in \mathrm{Aff}(2)$ is the state, the map $A: \mathrm{Aff}(2) \to \mathrm{aff}(2)$ is possibly non-linear and time-varying, the $b_i$ are scalar constants, and the $dw_i \in \mathbb{R}$ denote independent Wiener process noise. The exponential Euler discretization of (3) leads to

$$
X_k = X_{k-1} \cdot \exp\!\left( A(X, t)\, \Delta t + \sum_{i=1}^{6} E_i \sqrt{\Delta t}\, w_{k,i} \right), \qquad (4)
$$

where $w_k = (w_{k,1}, \ldots, w_{k,6})^\top$ is six-dimensional zero-mean Gaussian noise with a specified covariance matrix $S \in \mathbb{R}^{6 \times 6}$. The discrete version of the measurement equation is expressed as

$$
y_k = g(X_k) + n_k, \qquad (5)
$$

where $g: \mathrm{Aff}(2) \to \mathbb{R}^{N_y}$ is a non-linear function and $n_k$ is zero-mean Gaussian noise with covariance $R \in \mathbb{R}^{N_y \times N_y}$.


Fig. 2. The geometric transformation modes induced by each Lie algebra basis element of Aff(2).

Given a set of measurements $y_{0:k} = \{y_0, \ldots, y_k\}$, our particle filtering algorithm relies on importance sampling to sequentially estimate $p(X_{0:k} \mid y_{0:k})$ or its marginal distribution $p(X_k \mid y_{0:k})$ (Doucet et al. 2001). In this paper we use the sampling importance resampling (SIR) filter, where the state prediction density $p(X_k \mid X_{k-1})$ is employed as the importance function. In this case the importance weight associated with each particle is set proportional to the measurement likelihood $p(y_k \mid X_k)$. Sampling from $p(X_k \mid X_{k-1})$ can be realized efficiently via (4), and the measurement likelihood $p(y_k \mid X_k)$ can be calculated based on (5). When tracking is initiated, all of the particles $\{X_0^{(1)}, \ldots, X_0^{(N)}\}$ are initialized as $3 \times 3$ identity matrices as depicted in Figure 1.
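As an illustration, the following is a minimal Python/NumPy sketch of sampling from the prediction density via the exponential-Euler step (4) with zero drift (a random walk on Aff(2)); the helper names are ours, and the covariance values are the illustrative ones of Eq. (31), not a definitive implementation.

```python
import numpy as np
from scipy.linalg import expm

# Lie algebra basis of aff(2) from Eq. (2).
E = np.zeros((6, 3, 3))
E[0][:2, :2] = np.eye(2)                            # isotropic scaling
E[1][:2, :2] = np.diag([1.0, -1.0])                 # aspect-ratio change
E[2][:2, :2] = np.array([[0.0, -1.0], [1.0, 0.0]])  # rotation
E[3][:2, :2] = np.array([[0.0, 1.0], [1.0, 0.0]])   # skew
E[4][0, 2] = 1.0                                    # x-translation
E[5][1, 2] = 1.0                                    # y-translation

def propagate(X_prev, S, dt, rng):
    """Draw X_k ~ p(X_k | X_{k-1}) via Eq. (4) with A(X, t) = 0."""
    w = rng.multivariate_normal(np.zeros(6), S)        # w_k ~ N(0, S)
    xi = np.einsum('i,ijk->jk', w, E) * np.sqrt(dt)    # sum_i E_i sqrt(dt) w_{k,i}
    return X_prev @ expm(xi)

# Usage: initialize all particles at the identity and propagate one step.
rng = np.random.default_rng(0)
S = np.diag([0.2, 0.02, 0.4, 0.02, 10.0, 10.0]) ** 2   # e.g., the values of Eq. (31)
particles = [propagate(np.eye(3), S, 1.0 / 30, rng) for _ in range(500)]
```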

2.2. Sample Mean on the Affine Group

In a typical particle-filtering framework, the optimal state estimator $\hat{X}_k$ is taken to be the weighted sample mean of the particles $\{X_k^{(1)}, \ldots, X_k^{(N)}\}$ distributed according to $p(X_k \mid y_{0:k})$. Since the sample mean of the translation part $t \in \mathbb{R}^2$ is well defined, we concentrate on the remaining GL(2) part. Given a set of GL(2) elements $\{G_1, \ldots, G_N\}$, the intrinsic mean $\bar{G}$ is defined as

$$
\bar{G} = \arg\min_{G \in GL(2)} \sum_{i=1}^{N} d(G, G_i)^2, \qquad (6)
$$

where $d(\cdot,\cdot)$ represents the geodesic distance between two GL(2) elements. Determining the minimal geodesic between two arbitrary elements of GL(2) generally requires solving a two-point boundary value problem as described by Lee et al. (2007). For computational reasons, we instead choose to approximate the sample mean using the fact that near the identity, minimal geodesics are given by the left and right translations of the one-parameter subgroups of the form $e^{tM}$, $M \in gl(2)$, $t \in \mathbb{R}$. In our case, where the particles are resampled according to their associated weights, all of the affine transformations represented by the particles can be expected to be close to each other.

We therefore approximate the sample mean of $\{G_k^{(1)}, \ldots, G_k^{(N)}\}$, the GL(2) components of the resampled particles $X_k^{(i)}$, as

$$
\bar{G}_k = G_{k,\max} \cdot \exp(\bar{M}_k), \qquad (7)
$$

$$
\bar{M}_k = \frac{1}{N} \sum_{i=1}^{N} \log\!\left( G_{k,\max}^{-1} G_k^{(i)} \right), \qquad (8)
$$

where $G_{k,\max}$ denotes the GL(2) part of the particle possessing the greatest weight before resampling. The sample mean of $\{X_k^{(1)}, \ldots, X_k^{(N)}\}$ can then be readily obtained as $\left(\begin{smallmatrix} \bar{G}_k & \bar{t}_k \\ 0 & 1 \end{smallmatrix}\right)$, where $\bar{t}_k \in \mathbb{R}^2$ is the arithmetic mean of the translation parts.
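A minimal sketch of this approximate mean, under the paper's assumption that the resampled GL(2) parts lie close to one another (the function name is ours; small imaginary round-off from the matrix logarithm is discarded):

```python
import numpy as np
from scipy.linalg import expm, logm

def affine_sample_mean(particles, G_max):
    """Approximate sample mean of Eqs. (7)-(8).

    particles: resampled 3x3 Aff(2) matrices.
    G_max: GL(2) part of the highest-weight particle before resampling."""
    G_max_inv = np.linalg.inv(G_max)
    # Eq. (8): average the logs of the relative GL(2) transformations.
    M_bar = sum(logm(G_max_inv @ X[:2, :2]).real
                for X in particles) / len(particles)
    X_bar = np.eye(3)
    X_bar[:2, :2] = G_max @ expm(M_bar)                            # Eq. (7)
    X_bar[:2, 2] = np.mean([X[:2, 2] for X in particles], axis=0)  # mean of t
    return X_bar
```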

2.3. Pitfalls of Purely Local Coordinate Formulations

This section illustrates in some detail the pitfalls of purely local coordinate-based particle filtering algorithms that do not properly take into account the geometry of the affine group. Zhou et al. (2004) defined the state $x \in \mathbb{R}^6$ as the direct Euclidean embedding of Aff(2), i.e., $x = (a_1, \ldots, a_6)^\top$, where the $a_i$ constitute an affine transformation matrix of the form

$$
\begin{pmatrix} a_1 & a_2 & a_5 \\ a_3 & a_4 & a_6 \\ 0 & 0 & 1 \end{pmatrix}.
$$

In Li et al. (2007) and Ross et al. (2008), $G \in GL(2)$ is decomposed via singular value decomposition (SVD) as $G = U D V^\top$, where $U$ and $V$ are unitary matrices and $D = \left(\begin{smallmatrix} s_1 & 0 \\ 0 & s_2 \end{smallmatrix}\right)$. Then, as shown by Hartley and Zisserman (2000), $G$ can be represented as $G = (U V^\top)(V D V^\top) = \mathrm{Rot}(\theta_1)\,\mathrm{Rot}(-\theta_2)\, D\, \mathrm{Rot}(\theta_2)$, where $\mathrm{Rot}(\theta)$ represents a two-dimensional rotation and $\theta_1$ and $\theta_2$ are rotation angles. Ross et al. (2008) and Li et al. (2007) then define the state $x \in \mathbb{R}^6$ as $x = (\theta_1, \theta_2, s_1, \phi, t_x, t_y)^\top$, where $\phi = s_2/s_1$.

In the above cases, the state equation is obtained by discretizing a stochastic differential equation of the form

$$
dx = f(x, t)\, dt + F(x, t)\, d\nu, \qquad (9)
$$

where $f: \mathbb{R}^6 \to \mathbb{R}^6$ and $F: \mathbb{R}^6 \to \mathbb{R}^{6 \times m}$ are possibly time-varying non-linear functions and $d\nu$ is $m$-dimensional Wiener process noise. To illustrate the potential pitfalls of applying existing particle filtering algorithms to (discretized versions of) Equation (9) without a proper geometric accounting of the affine group, consider two initial object image regions as shown in Figure 3(a). Observe that each side of the object image region corresponding to $X_0$ is twice as long as the side corresponding to $X_0'$. At time $k-1$, both transformed image regions appear identical as shown in Figure 3(b), and the states


Fig. 3. There are two different states $X$ and $X'$ representing the affine transformations of two different initial object image regions as shown in (a). At time $k-1$, both transformed image regions represented by $X_{k-1}$ and $X'_{k-1}$ appear the same as shown in (b). Using the Euclidean embedding to represent $X$ and $X'$ in vector form as $x$ and $x'$, the same perturbation to $x_{k-1}$ and $x'_{k-1}$ results in different degrees of shape deformation as shown in (c). In contrast, the degrees of shape deformation induced by the same perturbation to the different states $X_{k-1}$ and $X'_{k-1}$ are the same as shown in (d).

$X_{k-1}$ and $X'_{k-1}$ corresponding to the transformed image regions are respectively given by

$$
X_{k-1} = \begin{pmatrix} 1 & 0 & 10 \\ 0 & 1 & 10 \\ 0 & 0 & 1 \end{pmatrix}, \qquad
X'_{k-1} = \begin{pmatrix} 2 & 0 & 10 \\ 0 & 2 & 10 \\ 0 & 0 & 1 \end{pmatrix}. \qquad (10)
$$

Using the Euclidean embedding, the states $X_{k-1}$ and $X'_{k-1}$ can be represented in vector form as $x_{k-1} = (1, 0, 0, 1, 10, 10)^\top$ and $x'_{k-1} = (2, 0, 0, 2, 10, 10)^\top$, respectively. Assuming for simplicity that the state transition is induced via a random walk, i.e., the state equation is given by $x_k = x_{k-1} + \nu_k$, where $\nu_k$ is six-dimensional Gaussian noise, the states at time $k$ perturbed by the same random noise $\nu_k = (0.2, 0.2, 0.2, 0.2, 2, 2)^\top$ become

$$
x_k = x_{k-1} + \nu_k = (1.2, 0.2, 0.2, 1.2, 12, 12)^\top, \qquad (11)
$$

$$
x'_k = x'_{k-1} + \nu_k = (2.2, 0.2, 0.2, 2.2, 12, 12)^\top. \qquad (12)
$$

Figure 3(c) shows the image regions transformed by each perturbed state ($x_k$ by the dashed line and $x'_k$ by the dotted line). It can be verified that the degrees of shape deformation induced by $x_k$ and $x'_k$ depend on $x_{k-1}$ and $x'_{k-1}$, in spite of the identical perturbation $\nu_k$ applied to both. This phenomenon is clearly undesirable from the perspective of visual tracking.

In contrast, expressing random walk state transitions in terms of our geometric framework, the state equation becomes

$$
X_k = X_{k-1} \cdot \exp\!\left( \sum_{i=1}^{6} E_i \sqrt{\Delta t}\, w_{k,i} \right). \qquad (13)
$$


Figure 3(d) shows the transformed image regions for each state perturbed by the same random noise $w_{k,i}$, each corresponding to the same transformation induced by $x_k$ ($X_k$ by the dashed line and $X'_k$ by the dotted line). We can see that for the same state perturbation, the shapes of the transformed image regions are exactly the same for both $X_{k-1}$ and $X'_{k-1}$. The degree of shape deformation induced by a specific perturbation is always the same regardless of the state.
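This coordinate dependence is easy to check numerically. The following small demo (with noise values of our own choosing; an illustration, not the paper's experiment) contrasts the two updates: the Euclidean-embedding update deforms the two regions differently, while the group update (13) deforms both identically.

```python
import numpy as np
from scipy.linalg import expm

X1 = np.array([[1.0, 0.0, 10.0], [0.0, 1.0, 10.0], [0.0, 0.0, 1.0]])  # Eq. (10)
X2 = np.array([[2.0, 0.0, 10.0], [0.0, 2.0, 10.0], [0.0, 0.0, 1.0]])

# Euclidean embedding: add the same noise to the entries (a1, ..., a4);
# the translation entries are omitted since only shape is at issue here.
dG = np.full((2, 2), 0.2)
for X in (X1, X2):
    G_new = X[:2, :2] + dG
    # Relative shape deformation depends on the current state:
    print(np.linalg.inv(X[:2, :2]) @ G_new)        # differs between X1 and X2

# Geometric update of Eq. (13): right-multiply by the same group perturbation.
xi = np.array([[0.1, -0.05, 2.0],
               [0.05, 0.1, 2.0],
               [0.0, 0.0, 0.0]])                   # an element of aff(2)
for X in (X1, X2):
    X_new = X @ expm(xi)
    # Relative shape deformation is the same regardless of the state:
    print(np.linalg.inv(X[:2, :2]) @ X_new[:2, :2])  # identical for X1 and X2
```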

In Figure 3(d), it is worth noting that the locations of the image regions transformed by $X_k$ and $X'_k$ are different, in contrast to their identically transformed shapes. This can be understood by simply examining the result of multiplying two affine transformations $Q_1$ and $Q_2$:

$$
Q_1 Q_2 = \begin{pmatrix} G_1 & t_1 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} G_2 & t_2 \\ 0 & 1 \end{pmatrix}
= \begin{pmatrix} G_1 G_2 & G_1 t_2 + t_1 \\ 0 & 1 \end{pmatrix}. \qquad (14)
$$

Considering $Q_1$ and $Q_2$ to be respectively $X_{k-1}$ and $\exp\!\left( \sum_{i=1}^{6} E_i \sqrt{\Delta t}\, w_{k,i} \right)$ in (13), it can be recognized that the translation of the image region transformed by $X_k$ depends on $X_{k-1}$, even for the same state perturbation.

We now argue that this phenomenon is in fact an advantage of our geometric framework. Figure 4 physically illustrates the situation discussed so far. Initially, two cubes of different sizes are placed at the same three-dimensional position. Here $X'_{k-1}$ represents the affine transformation of the image region of the small cube, which is moved half the distance toward the camera from its initial three-dimensional position. Similarly, $X_{k-1}$ represents the affine transformation of the image region of the large cube, which is twice the size of the small cube. The large cube is moved from its initial three-dimensional position while maintaining the same distance to the camera, and its image region (represented by $X_{k-1}$) is the same as that for the small cube (represented by $X'_{k-1}$).

In this scenario, the perturbation to $X_{k-1}$ and $X'_{k-1}$ can be physically understood as three-dimensional object rotation and translation over some fixed time interval. Since the small cube is twice as close to the camera as the large cube, the image translation induced by the three-dimensional translation of the small cube is twice as great as that induced by the same three-dimensional translation of the large cube. In contrast, the degrees of shape deformation induced by the same three-dimensional rotations of the two cubes are necessarily the same. The example shown in Figure 3(d) exactly matches this physical interpretation. From this perspective, our geometric framework can be considered a natural representation of image transformations induced by object movement in three-dimensional space.

Fig. 4. A physical interpretation of the example of Figure 3, inwhich the same perturbation produces identical shape defor-mations but different translations.

For the SVD-based local coordinate representation, the same problem occurring in the Euclidean embedding case is also inevitable (with the possible exception of $\mathrm{Rot}(\theta_1)$ being factored out to represent the object rotation). Another drawback of the SVD-based approach is that the decomposition is not unique. For example, suppose

$$
x_k = \left( \frac{\pi}{8},\; -\frac{\pi}{12},\; 1.2,\; 0.8,\; 10,\; 10 \right)^{\!\top}.
$$

The affine transformation matrix represented by $x_k$ becomes

$$
\begin{pmatrix} 1.1168 & -0.3106 & 10 \\ 0.2485 & 0.9273 & 10 \\ 0 & 0 & 1 \end{pmatrix}. \qquad (15)
$$

Conversely, if we represent the above matrix as $x_k$ via the SVD, the result is

$$
x_k = \left( \frac{\pi}{8},\; \frac{5\pi}{12},\; 0.96,\; 1.25,\; 10,\; 10 \right)^{\!\top}.
$$

We can see that $\theta_2$ is changed to $\theta_2 \pm \pi/2$ (with appropriate changes made to $s_1$ and $\phi$). This is triggered by the coordinate change needed to transform the unitary matrices obtained via the SVD into valid rotation matrices. As a result tracking performance may be diminished, since particles perturbed by different amounts of random noise may represent the same affine transformation.

3. Visual Tracking Framework

We now apply the affine particle filtering framework described in the previous section to the visual tracking problem. Particular focus will be given to features that are specific to visual tracking, including the choice of an AR process on Aff(2) for the state dynamics, combined state–covariance estimation on Aff(2) and P(n), and calculating measurement likelihoods using PGA on the image covariance descriptors.


3.1. AR State Dynamics

Tracking performance depends to a large extent on the accuracy of the state dynamics model. The simplest state dynamics model is a random walk. Within our framework, a random walk model implies that the drift $A(X, t)$ in (4) is set to zero, as in (13). While the random walk model has been shown to be effective for a surprisingly large class of simple and well-behaved application scenarios, tracking performance can be further enhanced if the particle propagation is guided by appropriate state dynamics beyond the random walk.

Balancing model flexibility with simplicity, in our framework we choose to model the state dynamics via a first-order AR process. The standard vector space formulation of an AR process of order $p$ can be expressed as

$$
x_k = \sum_{i=1}^{p} a_i x_{k-i} + \nu_k, \qquad (16)
$$

where the $a_i$ are the AR process parameters. In our case, where the state space is Aff(2), the AR process must be defined in a different way that respects the geometry of Aff(2). The definition of AR processes on general Riemannian manifolds has been addressed by Xavier and Manton (2006); extending this formulation to the affine group (and to general matrix Lie groups in a straightforward way), the state transition equation is expressed as a first-order AR process on Aff(2) of the form

$$
X_k = X_{k-1} \cdot \exp\!\left( A_{k-1} + \sum_{i=1}^{6} E_i \sqrt{\Delta t}\, w_{k,i} \right), \qquad (17)
$$

$$
A_{k-1} = a \log\!\left( X_{k-2}^{-1} X_{k-1} \right), \qquad (18)
$$

where $a$ is the AR process parameter. For typical tracking applications (where video frame rates will typically range between 30 and 60 frames per second, and object speeds are not excessively fast), it can be assumed that $X_{k-2}^{-1} X_{k-1}$ is close to the identity. Thus, $\log(X_{k-2}^{-1} X_{k-1})$ can be defined uniquely. Note that it is not possible to compute $A_{k-1}$ immediately from each particle at $k-1$ because the particles are resampled at every time step. Therefore, in practice, the state is augmented as $(X_k, A_k)$.
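A minimal sketch of one AR(1) propagation step following Eqs. (17)-(18); the basis array E is the one constructed in the earlier propagation sketch, and the function name is ours.

```python
import numpy as np
from scipy.linalg import expm, logm

def ar_propagate(X_prev, A_prev, a, S, dt, rng, E):
    """One first-order AR step on Aff(2); E holds the aff(2) basis of Eq. (2)."""
    w = rng.multivariate_normal(np.zeros(6), S)
    noise = np.einsum('i,ijk->jk', w, E) * np.sqrt(dt)
    X_new = X_prev @ expm(A_prev + noise)                  # Eq. (17)
    A_new = a * logm(np.linalg.inv(X_prev) @ X_new).real   # Eq. (18), next drift
    return X_new, A_new
```

Carrying A_new along with X_new is exactly the state augmentation (X_k, A_k) described above, since resampling destroys the particle histories that would otherwise be needed to recompute it.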

3.2. Tracking with Unknown Covariance

Visual tracking performance also depends to a great extent on the value of the covariance $S$ associated with the Wiener process in (17). In general, however, it is difficult to select appropriate values of $S$ a priori. Instead of trying to find the appropriate value of $S$, we adopt the approach described by Kwon et al. (2007), further extending the combined state-parameter estimation method of Liu and West (2001) to P(n), the space of $n \times n$ symmetric positive-definite matrices (recall that covariance matrices are elements of P(n)).

If there are unknown parameters $\theta$ in the state equation, the state $X_k$ can be augmented with $\theta_k$ as $(X_k, \theta_k)$ to estimate $\theta_k$ simultaneously with $X_k$. To address the degeneracy problem owing to the non-existence of dynamics for the parameters, it is commonplace to add an artificial dynamics to the parameter particles $\theta_k^{(i)}$ as $\theta_k^{(i)} \sim N(\theta_{k-1}^{(i)}, W_{k-1})$, with the covariance $W_{k-1}$ decreasing with time. In this case "over-dispersion" or "loss of information" (Liu and West 2001) occurs, since the covariance of the parameter particles is increased by the artificial noise. To keep the covariance of the parameter particles unchanged even with the artificial noise, Liu and West (2001) proposed kernel smoothing with shrinkage, where each parameter particle $\theta_k^{(i)}$ is randomly sampled from a normal kernel of the form $N(a\theta_{k-1}^{(i)} + (1-a)\bar{\theta}_{k-1},\; h^2 \Sigma_{\theta,k-1})$ (here $\bar{\theta}_{k-1}$ and $\Sigma_{\theta,k-1}$ are respectively the mean and covariance of the $\theta_{k-1}^{(i)}$); by setting $a = \sqrt{1 - h^2}$, the "over-dispersion" is trivially corrected. A discount factor $\delta$ with typical values between 0.95 and 0.99 is used as the control factor, with $a = (3\delta - 1)/(2\delta)$ (Liu and West 2001).

Extending the combined state-parameter estimation approach of Liu and West (2001) to arbitrary covariances $S$ requires methods for finding minimal geodesics between two arbitrary elements of P(n), computing the sample mean and sample covariance on P(n), and performing Gaussian random sampling on P(n). In this regard the log-Euclidean metric recently proposed by Arsigny et al. (2007) offers a computationally efficient way to implement the combined state–covariance estimation.

3.2.1. Log-Euclidean Metric on P(n)

Since P(n) is convex, the arithmetic mean of P(n) samples always lies within P(n). However, the arithmetic mean on P(n) has a number of undesirable properties, e.g., singular values are not preserved (Pennec et al. 2006; Fletcher and Joshi 2007). Thus, we focus on the intrinsic mean of P(n) samples $\{P_1, \ldots, P_N\}$, defined as

$$
\bar{P} = \arg\min_{P \in P(n)} \sum_{i=1}^{N} d(P, P_i)^2, \qquad (19)
$$

where $d(\cdot,\cdot)$ represents a geodesic distance between two P(n) elements.

To compute the geodesic distance, an appropriate metric for P(n) must first be identified. The affine-invariant metric, invariant under the GL(n) group action, proposed by Pennec et al. (2006) and Fletcher and Joshi (2007) can be regarded as a natural choice, but requires an optimization procedure to compute the sample mean on P(n). The log-Euclidean metric of Arsigny et al. (2007) preserves many of the natural properties of the affine-invariant metric while being computationally trivial. Its basic formula is straightforward: the distance between $P_1$ and $P_2$ is given by

$$
d(P_1, P_2) = \left\| \log(P_2) - \log(P_1) \right\|, \qquad (20)
$$

where $\|\cdot\|$ represents the standard Euclidean vector norm. The geodesic between $P_1$ and $P_2$ is simply expressed as

$$
\gamma(t) = \exp\!\left( (1-t)\log(P_1) + t\log(P_2) \right). \qquad (21)
$$

With the log-Euclidean distance equation (20), the log-Euclidean mean of P(n) elements $\{P_1, \ldots, P_N\}$ is obtained in closed form as

$$
\bar{P} = \exp\!\left( \frac{1}{N} \sum_{i=1}^{N} \log(P_i) \right). \qquad (22)
$$

The sample covariance $\Sigma_{\bar{P}}$ can then be calculated as

$$
\Sigma_{\bar{P}} = \frac{1}{N-1} \sum_{i=1}^{N} Z_i Z_i^\top, \qquad (23)
$$

where $Z_i = \log(P_i) - \log(\bar{P})$ in column vector form.
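A minimal sketch of Eqs. (22)-(23), assuming "column vector form" means a fixed half-vectorization of Sym(n) (any fixed vectorization would do; helper names are ours):

```python
import numpy as np
from scipy.linalg import expm, logm

def vech(A):
    """Half-vectorize a symmetric matrix (one possible 'column vector form')."""
    return A[np.triu_indices(A.shape[0])]

def log_euclidean_mean(Ps):
    """Closed-form log-Euclidean mean of P(n) samples, Eq. (22)."""
    return expm(sum(logm(P).real for P in Ps) / len(Ps))

def log_euclidean_cov(Ps, P_bar):
    """Sample covariance in the tangent space at the mean, Eq. (23)."""
    L_bar = logm(P_bar).real
    Z = np.stack([vech(logm(P).real - L_bar) for P in Ps])
    return Z.T @ Z / (len(Ps) - 1)
```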

3.2.2. Kernel Smoothing with Shrinkage for Covariance Estimation

Kernel smoothing with shrinkage as proposed by Liu and West (2001) can now be performed for covariance parameter estimation. First, the sample mean $\bar{\theta}_{k-1}$ and covariance $\Sigma_{\theta,k-1}$ of the covariance particles $\theta_{k-1}^{(i)}$ can be easily calculated from (22) and (23). The kernel mean $a\theta_{k-1}^{(i)} + (1-a)\bar{\theta}_{k-1}$ can be understood as the corresponding point on the geodesic connecting $\theta_{k-1}^{(i)}$ and $\bar{\theta}_{k-1}$, with $\gamma(0) = \theta_{k-1}^{(i)}$ and $\gamma(1) = \bar{\theta}_{k-1}$. With the log-Euclidean geodesic equation (21), $a\theta_{k-1}^{(i)} + (1-a)\bar{\theta}_{k-1}$ corresponds to $\exp\!\left( a\log(\theta_{k-1}^{(i)}) + (1-a)\log(\bar{\theta}_{k-1}) \right)$. Finally, Gaussian random sampling with mean $\mu$ and covariance $\Sigma$ can be realized as follows: (i) perform the Cholesky decomposition of the covariance $\Sigma$ as $\Sigma = CC^\top$; (ii) generate a zero-mean, unit-variance Gaussian random vector $z \in \mathbb{R}^{n(n+1)/2}$ and form $Z = Cz$; and (iii) compute $\exp(\log(\mu) + [Z])$, where $[Z]$ denotes $Z$ reshaped as an element of Sym(n), the space of symmetric matrices.
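Putting these pieces together, the following sketch draws one kernel-smoothed covariance particle on P(n) via steps (i)-(iii); vech/unvech and the function name are illustrative helpers of ours, and Sigma_theta is assumed to be a positive-definite covariance on the half-vectorized tangent space (regularize if it is rank-deficient).

```python
import numpy as np
from scipy.linalg import expm, logm

def vech(A):
    """Half-vectorize a symmetric matrix."""
    return A[np.triu_indices(A.shape[0])]

def unvech(v, n):
    """Inverse of vech: rebuild the symmetric matrix [Z]."""
    A = np.zeros((n, n))
    A[np.triu_indices(n)] = v
    return A + np.triu(A, 1).T

def draw_covariance_particle(theta_i, theta_bar, Sigma_theta, delta, rng):
    """theta_i, theta_bar: P(n) elements; Sigma_theta: tangent-space covariance;
    delta: discount factor between 0.95 and 0.99."""
    a = (3 * delta - 1) / (2 * delta)
    h2 = 1 - a * a
    n = theta_i.shape[0]
    # Kernel mean: a point on the log-Euclidean geodesic, Eq. (21).
    mean_log = a * logm(theta_i).real + (1 - a) * logm(theta_bar).real
    # Gaussian sampling on P(n): (i) Cholesky, (ii) tangent draw, (iii) exp.
    C = np.linalg.cholesky(h2 * Sigma_theta)
    Z = unvech(C @ rng.standard_normal(n * (n + 1) // 2), n)
    return expm(mean_log + Z)
```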

3.3. Robust Measurement Likelihood Calculation

3.3.1. Image Covariance Descriptor for Object Image Representation

Fig. 5. For our measurement process, the image covariance descriptor is employed to represent the object image region. The image region determined by $X_k$ is transformed to a $W \times H$ image patch, and then normalized to have unit variance and zero mean. The image covariance descriptor is calculated from the normalized $W \times H$ image patch $T_{X_k}$.

Our visual tracking framework assumes that tracking is initiated when the object is detected automatically, and that its rectangular image template from the initial frame is given. To represent the object image template, we use the image covariance descriptor recently proposed by Tuzel et al. (2006), which has been applied with some success to both object tracking (Porikli et al. 2006) and detection (Tuzel et al. 2007).

The object image template is first transformed to a $W \times H$ image patch, and then normalized to have unit variance and zero mean (to minimize the effect of illumination changes); to every pixel in the normalized $W \times H$ image patch $T_{X_0}$, a feature vector $f_i \in \mathbb{R}^{10}$ is assigned as

$$
f_i = \left( p_x,\; p_y,\; p_r,\; p_\theta,\; I,\; |I_{p_x}|,\; |I_{p_y}|,\; \tan^{-1}\!\frac{I_{p_y}}{I_{p_x}},\; |I_{p_x p_x}|,\; |I_{p_y p_y}| \right)^{\!\top}, \qquad (24)
$$

where $(p_x, p_y)$ and $(p_r, p_\theta)$ respectively represent the Cartesian and polar coordinates of each pixel, $I$ denotes the pixel intensity, and $I_{p_x}$, $I_{p_x p_x}$, $I_{p_y}$, $I_{p_y p_y}$ are the first- and second-order image derivatives with respect to the Cartesian coordinates. The image covariance descriptor $C_{X_0} \in \mathbb{R}^{10 \times 10}$ for the object template is then given by

$$
C_{X_0} = \frac{1}{WH - 1} \sum_{i=1}^{WH} (f_i - \bar{f})(f_i - \bar{f})^\top, \qquad (25)
$$

where $\bar{f}$ is the mean value of the $f_i$.

Representing the object template in the form of the image covariance descriptor rather than the raw image has the following advantages: (i) not only the intensity information but also the spatial information can be considered simultaneously; (ii) multiple features beyond the image intensity, e.g., color information, edge-like information embedded in the image derivatives, and images from different modalities such as infrared, can easily be fused; (iii) the dimension of the image covariance descriptor is quite low compared with the raw image; and (iv) the effect of outlier pixels is smoothed out during the covariance calculation.
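A minimal sketch of the descriptor computation of Eqs. (24)-(25) on a normalized patch, using simple finite differences for the image derivatives; the origin convention for the pixel coordinates and the arctan2 form of the orientation feature (a numerically safe variant of the ratio in (24)) are our choices:

```python
import numpy as np

def covariance_descriptor(T):
    """T: normalized (zero-mean, unit-variance) H x W intensity patch."""
    H, W = T.shape
    py, px = np.mgrid[0:H, 0:W].astype(float)   # Cartesian pixel coordinates
    Iy, Ix = np.gradient(T)                     # first-order derivatives
    Iyy, _ = np.gradient(Iy)                    # second-order derivatives
    _, Ixx = np.gradient(Ix)
    pr = np.hypot(px, py)                       # polar coordinates
    ptheta = np.arctan2(py, px)
    F = np.stack([px, py, pr, ptheta, T,
                  np.abs(Ix), np.abs(Iy),
                  np.arctan2(np.abs(Iy), np.abs(Ix)),  # edge orientation term
                  np.abs(Ixx), np.abs(Iyy)],
                 axis=-1).reshape(-1, 10)
    return np.cov(F, rowvar=False)              # 10 x 10, denominator WH - 1
```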

During tracking, as depicted in Figure 5, the image covariance descriptor $C_{X_k}$ is also calculated from $T_{X_k}$, the normalized $W \times H$ image patch transformed from the image region determined by $X_k$. Since the image covariance descriptor is an element of P(n), the similarity between $C_{X_0}$ and $C_{X_k}$ can be understood as the distance between $C_{X_0}$ and $C_{X_k}$, which is simply determined via the log-Euclidean metric of (20) as $\left\| \log(C_{X_k}) - \log(C_{X_0}) \right\|$.

3.3.2. PGA-based Measurement Likelihood

As tracking proceeds, the object image covariance descriptors $C_{\hat{X}_k}$ corresponding to the estimated states $\hat{X}_k$ can be collected. The object appearance generally undergoes gradual change, induced by various internal and external causes such as changes in the object pose, facial expressions, and lighting conditions. For purposes of robust measurement likelihood calculation, it would be helpful to use in some appropriate fashion the image covariance descriptors $\{C_{\hat{X}_1}, C_{\hat{X}_2}, \ldots\}$ collected up to the present time. One possible way is to use subspace methods such as PCA, which has been used with wide success in various recognition and tracking applications. In our case, where the object image covariance is not a vector but a symmetric positive-definite matrix, PCA for vector-valued data cannot be applied directly.

Fletcher et al. (2004) generalize PCA to manifold-valued data in a coordinate-invariant way, and refer to it as PGA. The idea of PGA is to apply standard PCA in the tangent space at the intrinsic mean as follows: (i) compute the intrinsic mean of the manifold-valued samples; (ii) compute the sample covariance of the manifold-valued samples in the tangent space at the computed intrinsic mean; (iii) find the eigenvectors and eigenvalues of the sample covariance.

Recently, PGA on P(n) has been presented with the affine-invariant metric by Fletcher and Joshi (2007). However, applying PGA as described by Fletcher and Joshi (2007) to our case has a number of disadvantages, the most notable being that the affine-invariant mean is obtained via a gradient descent optimization procedure. In contrast, the log-Euclidean mean of P(n) samples can be calculated in fixed time without iteration; PGA can thus be applied to the image covariance descriptors efficiently. In short, PGA of P(n) elements $\{P_1, \ldots, P_N\}$ is equivalent to obtaining the eigenvalues and eigenvectors of the sample covariance obtained using (23), with the sample mean obtained using (22). Furthermore, the incremental PCA of Ross et al. (2008) can also be applied directly to the image covariance descriptors without any algorithmic modification, as long as we retain the matrix logarithms of the image covariance descriptors in column vector form. (For space reasons the incremental PCA algorithm of Ross et al. (2008) is not recounted here.)
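Under the log-Euclidean metric, PGA therefore reduces to ordinary PCA of the vectorized matrix logarithms. A batch sketch is below (the incremental PCA of Ross et al. (2008) would replace the eigendecomposition step without further change); the full ravel is used so that Euclidean norms match the Frobenius norm in (20).

```python
import numpy as np
from scipy.linalg import logm

def pga_log_euclidean(Cs, M):
    """Batch PGA of P(n) descriptors Cs; returns the mean log-matrix (flattened),
    the top-M principal directions (columns), and their eigenvalues."""
    logs = np.stack([logm(C).real.ravel() for C in Cs])
    mean_log = logs.mean(axis=0)            # log of the mean in Eq. (22)
    Z = logs - mean_log
    cov = Z.T @ Z / (len(Cs) - 1)           # tangent-space covariance, Eq. (23)
    lam, U = np.linalg.eigh(cov)            # eigenvalues in ascending order
    return mean_log, U[:, -M:][:, ::-1], lam[-M:][::-1]
```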

For the measurement likelihood calculation with the incrementally updated mean and principal eigenvectors of the object image covariance descriptors, the probabilistic approach of Moghaddam and Pentland (1997) can be adopted. Given the mean $\bar{C}$ of the collected image covariance descriptors $\{C_{X_0}, C_{\hat{X}_1}, C_{\hat{X}_2}, \ldots\}$ and the first $M$ principal eigenvectors $u_i$, $i = 1, \ldots, M$, the residual error $e_r^2$ for the image covariance $C_{X_k}$ is defined as

$$
e_r^2 = \left\| \log(C_{X_k}) - \log(\bar{C}) \right\|^2 - \sum_{i=1}^{M} c_i^2, \qquad (26)
$$

where the $c_i$ are the projection coefficients of $\log(C_{X_k}) - \log(\bar{C})$ onto each principal eigenvector, i.e., $c_i = u_i^\top \left( \log(C_{X_k}) - \log(\bar{C}) \right)$. The residual error $e_r^2$ is understood to be the "distance-from-feature-space" (DFFS), while the "distance-in-feature-space" (DIFS) is defined as the Mahalanobis distance $\sum_{i=1}^{M} c_i^2 / \lambda_i$, where the $\lambda_i$ are the eigenvalues corresponding to the $u_i$ (Moghaddam and Pentland 1997). Therefore, the measurement equation (5) can be explicitly expressed as

$$
y_k = \begin{pmatrix} e_r \\ \sqrt{\sum_{i=1}^{M} c_i^2 / \lambda_i} \end{pmatrix} + n_k, \qquad (27)
$$

where $n_k$ is zero-mean Gaussian noise with covariance $R = \mathrm{Diag}(\sigma_{\mathrm{PGA}}^2, 1) \in \mathbb{R}^{2 \times 2}$. The measurement likelihood $p(y_k \mid X_k)$ is then calculated from

$$
p(y_k \mid X_k) \propto \exp\!\left( -\frac{1}{2} y_k^\top R^{-1} y_k \right). \qquad (28)
$$

The collected object images $\{T_{\hat{X}_1}, T_{\hat{X}_2}, \ldots\}$ can also be used for similarity comparison purposes; we simply append the square root of the sum of squared differences (SSD) between $T_{X_k}$ and the mean intensity image of $\{T_{X_0}, T_{\hat{X}_1}, T_{\hat{X}_2}, \ldots\}$ to $y_k$ to further increase the robustness of the measurement likelihood calculation. The measurement equation is finally given by

$$
y_k = \begin{pmatrix} e_r \\ \sqrt{\sum_{i=1}^{M} c_i^2 / \lambda_i} \\ \left\| T_{X_k} - \bar{T} \right\| \end{pmatrix} + n_k, \qquad (29)
$$

where $\bar{T}$ represents the incrementally updated object mean intensity image, and $n_k$ is Gaussian noise with $R = \mathrm{Diag}(\sigma_{\mathrm{PGA}}^2, 1, \sigma_{\mathrm{SSD}}^2) \in \mathbb{R}^{3 \times 3}$. Until the minimal number of object image covariance descriptors sufficient to perform PGA has been collected, i.e., during the very initial phase of tracking where the object appearance change is not intense, the measurement is simply the square root of the SSD between $T_{X_0}$ and $T_{X_k}$, i.e.,

$$
y_k = \left\| T_{X_k} - T_{X_0} \right\| + n_k, \qquad (30)
$$

where $n_k$ is sampled from $N(0, R)$, $R = \sigma_{\mathrm{SSD}}^2$.
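Before turning to the full algorithm, here is a minimal sketch of the likelihood of Eqs. (26)-(29), taking the PGA quantities from the previous sketch; the sigma parameters are illustrative tuning values of ours.

```python
import numpy as np
from scipy.linalg import logm

def likelihood(C_k, T_k, mean_log, U, lam, T_bar, sigma_pga, sigma_ssd):
    """Measurement likelihood p(y_k | X_k) of Eq. (28) with y_k from Eq. (29)."""
    d = logm(C_k).real.ravel() - mean_log
    c = U.T @ d                              # projection coefficients c_i
    e_r = np.sqrt(max(d @ d - c @ c, 0.0))   # DFFS, Eq. (26)
    difs = np.sqrt(np.sum(c * c / lam))      # DIFS (Mahalanobis distance)
    ssd = np.linalg.norm(T_k - T_bar)        # template term of Eq. (29)
    y = np.array([e_r, difs, ssd])
    R_inv = np.diag([1.0 / sigma_pga**2, 1.0, 1.0 / sigma_ssd**2])
    return np.exp(-0.5 * y @ R_inv @ y)      # Eq. (28), up to normalization
```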

We now present the visual tracking algorithm describedthus far.


Algorithm

1. Initialization

(a) Set $k = 0$.
(b) Set the initial phase period $k_{\mathrm{init}}$ and the update period $k_{\mathrm{update}}$ for the incremental PGA.
(c) Set $\delta$ between 0.95 and 0.99 for the covariance estimation.
(d) Set the number of particles $N$.
(e) For $i = 1, \ldots, N$:
   • Set $X_0^{(i)} = I$, $A_0^{(i)} = 0$.
   • Draw the covariance parameters $\theta_0^{(i)}$ for $S$ from $p(\theta_0)$.

2. Importance sampling step

(a) Set $k = k + 1$.
(b) Compute the mean $\bar{\theta}_{k-1}$ and covariance $\Sigma_{\theta,k-1}$ of $\{\theta_{k-1}^{(1)}, \ldots, \theta_{k-1}^{(N)}\}$ from (22) and (23).
(c) For $i = 1, \ldots, N$, draw $\tilde{X}_k^{(i)} \sim p(X_k \mid X_{k-1}^{(i)}, \theta_k^{(i)})$, i.e.:
   i. Determine the kernel mean $a\theta_{k-1}^{(i)} + (1-a)\bar{\theta}_{k-1}$ from (21), where $a = (3\delta - 1)/(2\delta)$.
   ii. Draw $\theta_k^{(i)}$ from $N(a\theta_{k-1}^{(i)} + (1-a)\bar{\theta}_{k-1},\; h^2 \Sigma_{\theta,k-1})$, where $h^2 = 1 - a^2$.
   iii. Generate the Gaussian $w_k$ from $N(0, \theta_k^{(i)})$, and propagate $X_{k-1}^{(i)}$ to $\tilde{X}_k^{(i)}$ with $A_{k-1}^{(i)}$ via (17).
   iv. Compute $\tilde{A}_k^{(i)}$ with (18).
(d) For $i = 1, \ldots, N$, calculate the importance weights $\omega_k^{(i)} = p(y_k \mid \tilde{X}_k^{(i)})$ with (30) if $k \le k_{\mathrm{init}}$, or with (29) otherwise.
(e) For $i = 1, \ldots, N$, normalize the importance weights: $\tilde{\omega}_k^{(i)} = \omega_k^{(i)} \left( \sum_{j=1}^{N} \omega_k^{(j)} \right)^{-1}$.

3. Selection step (resampling)

(a) Resample from $\{\tilde{X}_k^{(1)}, \ldots, \tilde{X}_k^{(N)}\}$, $\{\tilde{A}_k^{(1)}, \ldots, \tilde{A}_k^{(N)}\}$, and $\{\theta_k^{(1)}, \ldots, \theta_k^{(N)}\}$ with probability proportional to $\tilde{\omega}_k^{(i)}$ to produce independent and identically distributed (i.i.d.) random samples $\{X_k^{(1)}, \ldots, X_k^{(N)}\}$, $\{A_k^{(1)}, \ldots, A_k^{(N)}\}$, and $\{\theta_k^{(1)}, \ldots, \theta_k^{(N)}\}$.
(b) For $i = 1, \ldots, N$, set $\omega_k^{(i)} = \tilde{\omega}_k^{(i)} = 1/N$.
(c) If $k > k_{\mathrm{init}}$ and $\mathrm{mod}(k, k_{\mathrm{update}}) = 0$:
   i. Update $\bar{C}$, $u_i$, and $\lambda_i$ using the incremental PGA algorithm.
   ii. Update $\bar{T}$, the mean of the collected object intensity images.

4. Go to importance sampling step.

Fig. 6. Images from the video clip used for Experiment 1. The cube image undergoes relatively large-scale changes.

4. Experiments

In this section, we demonstrate the feasibility of our proposed visual tracking algorithm via various experiments, comparing the performance of our geometric tracker with existing local coordinate-based trackers. In the first set of experiments, we track a cube moving under a large-scale change. The second set of experiments is performed with four widely used test video clips (Jepson et al. 2003; Ross et al. 2008); these test clips are particularly challenging because of the sudden and extreme appearance changes resulting from, e.g., temporal occlusions, changes in the object pose, and abruptly varying lighting conditions.

4.1. Experiment 1: A Moving Cube

The video clip used in this experiment contains images of a cube moving under rather mild tracking conditions, i.e., a fixed camera and an indoor environment. As shown in Figure 6, the scale of the cube image undergoes relatively large changes. To focus on verifying to what extent tracking performance depends on the state representation, we run our tracker without many of the features proposed in Section 3, i.e., the AR state dynamics, covariance parameter estimation, and incremental PGA of the image covariance descriptors. The state transition is instead generated by a random walk with pre-specified covariance, with Equation (30) taken as the measurement.


4.1.1. Implementation Details

The covariance of the Wiener process noise for the cube video clip is set to

$$
S = \mathrm{diag}\!\left( 0.2^2,\; 0.02^2,\; 0.4^2,\; 0.02^2,\; 10^2,\; 10^2 \right), \qquad (31)
$$

with the discrete time interval $\Delta t = \frac{1}{30}$; with this choice the cube video clip is sampled at 30 frames per second (fps). The Wiener process noise $w_k \in \mathbb{R}^6$ is sampled from $N(0, S)$ and fed into (17) with $A_{k-1} = 0$.

For comparison, we also apply the local coordinate representation-based trackers using the Euclidean embedding and the SVD. For the Euclidean embedding-based tracker, the state transition is set to $x_k = x_{k-1} + \nu_k$, where

$$
\nu_k = \begin{pmatrix} w_{k,1} + w_{k,2} \\ -w_{k,3} + w_{k,4} \\ w_{k,3} + w_{k,4} \\ w_{k,1} - w_{k,2} \\ w_{k,5} \\ w_{k,6} \end{pmatrix}, \qquad (32)
$$

with $w_k = (w_{k,1}, \ldots, w_{k,6})^\top \in \mathbb{R}^6$ sampled from $N(0, S\Delta t)$; the above can be justified by noting that the following approximation is valid for sufficiently small $w_k$:

$$
\exp\!\left( \sum_{i=1}^{6} w_{k,i} E_i \right) \approx
\begin{pmatrix} 1 + w_{k,1} + w_{k,2} & -w_{k,3} + w_{k,4} & w_{k,5} \\ w_{k,3} + w_{k,4} & 1 + w_{k,1} - w_{k,2} & w_{k,6} \\ 0 & 0 & 1 \end{pmatrix}. \qquad (33)
$$
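This first-order agreement is easy to verify numerically; a quick check with small noise values of our own choosing:

```python
import numpy as np
from scipy.linalg import expm

w = np.array([0.01, -0.02, 0.015, 0.005, 0.05, -0.03])  # small sample noise
xi = np.array([[w[0] + w[1], -w[2] + w[3], w[4]],
               [w[2] + w[3],  w[0] - w[1], w[5]],
               [0.0, 0.0, 0.0]])                         # sum_i w_i E_i
# The difference is second order in |w|, so it should be tiny:
print(np.max(np.abs(expm(xi) - (np.eye(3) + xi))))
```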

Setting the state covariance for the SVD-based tracker to be equivalent to that of our geometric tracker proves to be somewhat problematic, because the transformations induced by $E_1$, $E_2$, and $E_4$ are involved together in $s_1$, $\phi$, and $\theta_2$. Instead we have experimentally found that the state transition via the Wiener process with

$$
S = \mathrm{diag}\!\left( 0.4^2,\; 0.02^2,\; 0.2^2,\; 0.02^2,\; 10^2,\; 10^2 \right) \qquad (34)
$$

is almost equivalent to our case around the identity. The state equation for the SVD-based tracker thus becomes $x_k = x_{k-1} + \nu_k$, with $\nu_k \sim N(0, S\Delta t)$.

As mentioned earlier in Section 2.3, for local coordinate representation-based trackers, larger covariance values are needed to ensure successful tracking when the GL(2) component of the state is farther from the identity. Therefore, we run the two local coordinate representation-based trackers with various covariance values and various numbers of particles.

To quantitatively compare tracking performance, the two-dimensional pixel coordinates of the four corner points of the cube image are extracted manually and used as ground truth data. Tracking performance is judged based on the difference between the ground truth data and the corner point coordinates estimated by each tracker. The experiments are performed using Matlab R2008a running on an Intel Core 2 Quad 2.4 GHz processor with 3 GB memory.

Table 1. Tracking Results for the Cube Video Clip

Case  Tracker  Cov.  N      Err.     N. Err.  Time
1     Ours     S     200    5.4058   2.8038   0.2035
2     Ours     S     500    5.3078   2.7919   0.5027
3     Ours     S     1,000  5.0648   2.6634   1.047
4     L1       S     500    32.2654  14.5886  0.266
5     L1       S     1,000  15.3742  6.9193   0.531
6     L1       2S    1,000  6.6966   3.4259   0.542
7     L1       2S    2,000  6.7121   3.4618   1.086
8     L1       3S    1,000  6.0824   3.1889   0.539
9     L1       3S    2,000  5.6238   2.9250   1.084
10    L2       S     500    6.3431   3.4203   0.281
11    L2       S     1,000  5.9576   3.1621   0.563
12    L2       2S    1,000  6.0190   3.2449   0.561
13    L2       2S    2,000  5.8082   3.1127   1.142

4.1.2. Results

Tracking results for the cube video clip are summarized in Table 1. In the table, "L1" and "L2" respectively denote the Euclidean embedding-based and SVD-based trackers, while "Cov.", "N", and "Time" respectively denote the covariance of the Wiener process noise, the number of particles, and the average computation time per frame in seconds. "Err." is the average of $e_i$, calculated as

$$
e_i = \sqrt{ \sum_{j=1}^{4} \left\| p_{i,j} - \hat{p}_{i,j} \right\|^2 }, \qquad i = 1, \ldots, N_f, \qquad (35)
$$

where $N_f$ is the total number of video frames, and $p_{i,j}$ and $\hat{p}_{i,j}$ respectively represent the coordinates of the four corner points corresponding to the estimated state and the ground truth data.
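For concreteness, the per-frame error of Eq. (35) is simply (array names are illustrative):

```python
import numpy as np

def frame_error(p_est, p_true):
    """Eq. (35): p_est, p_true are (4, 2) arrays of corner coordinates."""
    return np.sqrt(np.sum((p_est - p_true) ** 2))
```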


Fig. 7. A plot of the scale change of the cube image.

Since $e_i$ is clearly affected by the cube image scale, we calculate another error $\tilde{e}_i$, normalized by the scale of the cube image calculated from the ground truth data, as shown in Figure 7. "N. Err." in Table 1 is the average of $\tilde{e}_i$. Figures 8 and 9 show respectively the $e_i$ and $\tilde{e}_i$ produced by each tracker.

As shown in Table 1, the cases corresponding to our tracker, i.e., cases "1", "2", and "3", yield the lowest tracking errors among all of the cases (based on comparing "Err." and "N. Err." for cases "1", "2", and "3" with the others). Of particular note is that even with a small number of particles, as in case "1", satisfactory tracking performance is achieved. The plots of $\tilde{e}_i$ in Figure 9(a) imply tracking errors that are more or less invariant to scale changes (with the exception of some frames around the 127th frame, where an abrupt scale change occurs accompanied by relatively large rotation and position changes).

As suggested by the large error values for cases "4" and "5" in Table 1, tracking failures occur for the Euclidean embedding-based tracker when the covariance S is set identical to ours. From Figures 8(b) and 9(b), we can see that the tracking errors after the 100th frame are greater than those in the earlier phase, owing to the large-scale change. One implication is that this choice of S is appropriate only when the GL(2) component of the state is close to the identity.

As shown in Table 1 for cases "6", "7", "8", and "9", the tracking accuracy of the Euclidean embedding-based tracker improves with increased numbers of particles (1,000 and 2,000) and covariance values (2S and 3S). However, it should be noted that the tracking errors for the large numbers of particles are still greater than those of case "1", where only 200 particles are used for our tracker.

For the SVD-based tracker, we can see from Table 1 that the tracking errors for cases "10" and "11", with the covariance S equivalent to our case, are much smaller than those of the Euclidean embedding-based tracker with the covariance S (cases "4" and "5"). This superiority arises from the fact that the rotation component is factored out in the SVD-based representation, which is not the case for the Euclidean embedding. Although the tracking accuracy of the SVD-based tracker is slightly improved by increasing the number of particles (2,000) and the covariance values (2S) in case "13", it is still worse than that of our tracker. The tracking accuracy with 3S was not noticeably improved, and the results for such cases are not included here.

As given in Table 1, the computational complexity of our tracker is about twice that of the local coordinate representation-based trackers. The tracking results for cases "1", "4", and "10" can also be seen in Extension 1.

Table 2. Characteristics of the Video Clips used in Experiment 2

                    Dudek     David   Sylvester   Trellis
Environment         Indoor    Indoor  Indoor      Outdoor
Facial Change       Large     Large   No          Small
Pose Change         Moderate  Large   Very large  Large
Lighting Change     Moderate  Large   Large       Very large
Temporal Occlusion  Yes       No      No          No
Overall Difficulty  4         3       2           1

4.2. Experiment 2: Video Clips from Previous Literature

We now test the tracking performance of our algorithm using some well-known video clips from the literature: the "Dudek" sequence from Jepson et al. (2003) and the "David", "Sylvester", and "Trellis" sequences from Ross et al. (2008). For comparison purposes all of the video clips are resampled to 15 fps. Further aspects of the test video clips are summarized in Table 2.

4.2.1. Implementation Details

For the prior distributions of the unknown parameters, i.e., $p(\theta_0)$, a uniform distribution between possible intervals is suggested by Higuchi (2001). Since in our case the parameter space is P(n), we first sample Sym(n) using a uniform distribution between −0.25 and 0.25. The initialized covariance particles are then obtained as the matrix exponentials of the sampled Sym(n) particles; this initialization procedure is consistent with the log-Euclidean metric used for P(n). The control factor $\delta$ is set to 0.99.

Fig. 8. The plots of the tracking errors $e_i$ for each frame. (a) Tracking errors for cases "1" (solid), "2" (dashed), and "3" (dotted). (b) Tracking errors for cases "4" (solid) and "5" (dashed). (c) Tracking errors for cases "6" (solid), "7" (dashed), "8" (dotted), and "9" (dash-dot). (d) Tracking errors for cases "10" (solid), "11" (dashed), "12" (dotted), and "13" (dash-dot).

Since the magnitude of the noise for the translational part is quite different from that for the GL(2) part, the random noise generated with the covariance particles must be appropriately scaled. Furthermore, the geometric transformations induced by $E_2$ and $E_4$ in (2) are not significant in the test video clips. Therefore, we scale the random noise generated with the covariance particles by the scale parameters $(0.05, 0.001, 0.05, 0.001, 15, 15)^\top$, ordered according to the Lie algebra basis elements of Aff(2) in (2).

We set the initial phase period to the first 15 frames, and the update period for the incremental PGA to 5 frames. We used 500 particles for the experiments, with an AR parameter value of 0.5. These experiments are also performed in Matlab R2008a running on an Intel Core 2 Quad 2.4 GHz processor with 3 GB memory.

4.2.2. Results

Tracking results for the “Dudek” sequence, shown in the first column of Figure 10, appear to be quite satisfactory. Note that changes in the object appearance (in this case due to changes in the object pose, facial expressions, lighting conditions, and the wearing and removal of glasses) and object scale occur gradually. Owing to the inherent stochastic nature of particle filtering, the tracker can recover from the drift induced by the temporary object occlusion, as shown in the 103rd and 110th frames.

Fig. 9. The plots of the tracking errors ē_i normalized by the scales of the cube image for each frame. (a) Tracking errors for cases “1” (solid), “2” (dashed), and “3” (dotted). (b) Tracking errors for cases “4” (solid) and “5” (dashed). (c) Tracking errors for cases “6” (solid), “7” (dashed), “8” (dotted), and “9” (dash-dot). (d) Tracking errors for cases “10” (solid), “11” (dashed), “12” (dotted), and “13” (dash-dot).

The “David” sequence appears more challenging than the “Dudek” sequence because of the abrupt change in lighting conditions, together with a long-term object pose change. However, as shown in the second column of Figure 10, our tracker tracks the object quite well over the entire frame sequence. The tracker maintains its focus around the object while the object changes its pose around the 160th frame, and continues to successfully track the object when the frontal face of the object reappears after the 177th frame. Tracking results for the remainder of the “David” sequence after the 177th frame are also successful despite the facial expression changes, the removal and wearing of glasses, and the changes in lighting shown in the 226th, 299th, and 429th frames.

For the “Sylvester” sequence, our proposed tracker also yields quite satisfactory tracking results, as shown in the third column of Figure 10. Around the 309th frame, the tracker temporarily loses the object because of the abrupt pose change. However, as shown in the 322nd and 437th frames, our tracker does not lose the object permanently, and quickly recovers from any temporary drift whenever the frontal face of the object reappears.

Fig. 10. Tracking results for the four video clips using our proposed tracker. The first, second, third, and fourth columns represent the tracking results respectively for the “Dudek”, “David”, “Sylvester”, and “Trellis” sequences.

The “Trellis” sequence, possibly the most challenging among the four test video clips, involves an extreme change in lighting that occurs at the same time as a long-term object pose change. Again, the tracking results shown in the fourth column of Figure 10 are highly encouraging: as shown in the 63rd, 150th, and 268th frames, our tracker is able to track the object in spite of the extreme change in lighting. After the long-term object pose change around the 332nd frame, the tracker continues to track the object when the frontal object face reappears, despite the abrupt changes in lighting that occur at the 371st and 442nd frames. Tracking results of our tracker can also be seen in Extension 1.

Fig. 11. A comparison between the tracking results of our proposed tracker (solid rectangle) and an intensity-based tracker (dashed rectangle) using the incrementally updated object mean image T̄. The first, second, third, and fourth columns represent the tracking results respectively for the “Dudek”, “David”, “Sylvester”, and “Trellis” sequences.

4.2.3. Comparison with an Intensity-based Tracker

To measure the tracking benefits of measurement likelihood calculation using incremental PGA of the image covariance descriptors, we run our tracker using only the incrementally updated object mean image T̄, i.e. the measurement equation is now given by

y_k = T(X_k) = T̄ + n_k,    (36)

where n_k ∼ N(0, R) and R = σ²_SSD. A comparison between our tracker (solid rectangle) and the simple intensity-based tracker (dashed rectangle) is shown in Figure 11.
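A minimal sketch of such an intensity-based likelihood, assuming the Gaussian SSD model of (36) together with the zero-mean, unit-variance template normalization mentioned below; the function name and the numerical guard are ours.

```python
import numpy as np

def intensity_likelihood(patch, mean_image, sigma_ssd):
    """Gaussian SSD likelihood for the intensity-only baseline,
    p(y_k | X_k) ∝ exp(-SSD / (2 * sigma_ssd**2)), comparing the image
    patch warped by the state X_k against the incrementally updated
    mean image. Both images are normalized to zero mean and unit
    variance to reduce sensitivity to illumination changes."""
    def normalize(im):
        im = np.asarray(im, dtype=float)
        return (im - im.mean()) / (im.std() + 1e-12)  # guard against flat patches
    ssd = np.sum((normalize(patch) - normalize(mean_image)) ** 2)
    return np.exp(-ssd / (2.0 * sigma_ssd ** 2))
```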

For the “Dudek” sequence, the intensity-based tracker yields surprisingly good tracking results, as shown in the first column of Figure 11. We presume that these can be attributed to the fact that the changes in lighting and object pose are relatively moderate, and that the object template image has been normalized to have unit variance and zero mean so as to minimize the effects of illumination changes. However, at some frames (such as around the 374th and 472nd frames), the tracking accuracy is noticeably worse than that of our proposed tracker.

Tracking results for the simple intensity-based tracker are not as good for the remaining three sequences. As is evident from the second column of Figure 11, the intensity-based tracker eventually loses the object in the “David” sequence after the long-term object pose change around the 160th frame, although the tracking results are satisfactory up to this point. Tracking failures also occur for the “Sylvester” and “Trellis” sequences. As shown in the third column of Figure 11, the tracker permanently loses the object after the 231st frame, when extreme changes in the object pose and lighting occur simultaneously. Around the 182nd frame, when an extreme change in lighting takes place, the tracking accuracy is worse than that of our tracker. For the “Trellis” sequence, as shown in the fourth column of Figure 11, the tracker loses the object at a very early stage because of the extreme change in lighting. The tracking result comparison shown in Figure 11 is also given in Extension 1.

4.2.4. Tracking Without the AR State Dynamics

To verify the effect of the state dynamics using the AR process on Aff(2), we run our proposed tracker again without the AR state dynamics, i.e. we set a = 0 in (18). A comparison of tracking results with (solid rectangle) and without (dashed rectangle) the AR state dynamics is shown in Figure 12.
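The exact form of (18) is given earlier in the paper; as a rough illustration only, the sketch below assumes a first-order AR process on a matrix Lie group in the spirit of Xavier and Manton (2006), in which setting a = 0 recovers the random walk.

```python
import numpy as np
from scipy.linalg import expm, logm

def ar_step(X_prev, X_prev2, a, noise_alg):
    """One AR propagation step on a matrix Lie group (illustrative only;
    the paper's equation (18) defines the actual Aff(2) AR process).
    noise_alg is a sampled Lie-algebra element (a 3x3 matrix for aff(2)).
    Setting a = 0 reduces this to the random walk X_k = X_{k-1} exp(w_k)."""
    # "Velocity" of the previous step, pulled back to the Lie algebra;
    # the real part is taken assuming the relative transform is near the
    # identity so the principal matrix logarithm is real.
    delta = np.real(logm(np.linalg.inv(X_prev2) @ X_prev))
    return X_prev @ expm(a * delta + noise_alg)
```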

For the “Dudek” sequence, shown in the first column of Figure 12, the tracker fails to recover quickly from the drift induced by the temporary occlusion, as shown in the 117th and 187th frames, and adjusts only slowly to the fast object scale changes, as shown in the 504th frame. As shown in the second column of Figure 12, the tracker also adjusts to the object pose change around the 160th frame of the “David” sequence, though much more slowly than the tracker with the AR state dynamics.

Fig. 12. A comparison of the tracking results for our proposed tracker with (solid rectangle) and without (dashed rectangle) the AR state dynamics. The first, second, third, and fourth columns represent the tracking results respectively for the “Dudek”, “David”, “Sylvester”, and “Trellis” sequences.

Without the AR state dynamics, tracking results for the “Sylvester” (third column of Figure 12) and “Trellis” (fourth column of Figure 12) sequences are considerably worse. For the “Sylvester” sequence, the tracking accuracy is comparable to that of our tracker with the AR state dynamics before the tracker loses the object around the 304th frame. Unlike our tracker with the AR state dynamics, the tracker cannot recover quickly from the temporary drift, even when the frontal face of the object reappears in the 335th frame. For the “Trellis” sequence, without the AR state dynamics the tracker eventually loses the object at an early stage of the sequence. The tracking result comparison shown in Figure 12 is also available in Extension 1.

4.2.5. Comparison with IVT Particle Filtering Tracker

Finally, we compare our tracker’s performance with that of the IVT tracker of Ross et al. (2008). Tracking results of the IVT tracker are obtained by directly running the Matlab code available from Ross (2008) without any parameter modification. For the “Dudek” and “David” sequences, the initial object regions used for our tracker are different from those for the IVT tracker. During the experiments, we found that the IVT tracker’s performance for the “Dudek” sequence is sensitive to the initial object region. Thus, for a fair comparison, we run both trackers with the same initial object image region suggested by Ross (2008). The comparison between our tracker (solid rectangle) and the IVT tracker (dashed rectangle) is shown in Figure 13.

For the “Dudek” sequence, both trackers accurately track the object over the entire sequence, as shown in the first column of Figure 13. On the whole, our tracker’s performance is comparable to that of the IVT tracker: around the 373rd frame the IVT tracker outperforms our tracker, while the opposite is the case around the 476th frame. Our tracker outperforms the IVT tracker for the “David” sequence, as shown in the second column of Figure 13. Although both trackers never lose the object, our tracker displays superior tracking performance around the 78th and 243rd frames, as judged against the initial object image region.

The better performance of our tracker is demonstrated more clearly for the “Sylvester” (third column of Figure 13) and “Trellis” (fourth column of Figure 13) sequences. For the “Sylvester” sequence, both trackers lose the object around the 305th frame as a result of the abrupt object pose change. As tracking proceeds, the IVT tracker cannot recover from the drift and eventually loses the object, unlike our tracker. As shown in the 180th frame, the tracking accuracy of the IVT tracker is also worse than that of our tracker.

For the “Trellis” sequence, both trackers accurately track the object through the initial lighting condition change. After the long-term object pose change around the 320th frame, the IVT tracker eventually loses the object, while our tracker continues to track the object until the end of the sequence. The tracking result comparison of Figure 13 can also be seen in Extension 1.

Fig. 13. A comparison of tracking results for our proposed tracker (solid rectangle) and the state-of-the-art particle filtering-based tracker of Ross et al. (2008) (dashed rectangle). The first, second, third, and fourth columns represent the tracking results respectively for the “Dudek”, “David”, “Sylvester”, and “Trellis” sequences.

4.3. Discussion

The improved performance of our geometric tracker over purely local coordinate-based trackers has been quantitatively shown via experiments with the cube video clip. As argued earlier in Section 2.3, our geometric tracker more naturally represents the image transformations induced by three-dimensional object movement; we suspect that the difference in tracking accuracy can be traced to the fact that each Lie algebra basis element of our tracker naturally represents a distinct geometric transformation mode, as shown in Figure 2, which is not the case for the local coordinate-based trackers.

We have also qualitatively demonstrated our geometric tracker’s improved performance through a series of experiments involving benchmark test video clips. A simple intensity-based measurement likelihood was sufficient when the changes in lighting were not abrupt, as in the “Dudek” sequence. However, for abrupt changes in lighting such as those found in the “David”, “Sylvester”, and “Trellis” sequences, intensity-based measurement likelihoods were not robust. In contrast, such changes in lighting could be dealt with appropriately by the incremental PGA of the image covariance descriptors. Another advantage of our tracker is its ability to recover quickly from large and abrupt object pose changes whenever the frontal object image reappears.
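Concretely, under the log-Euclidean metric of Arsigny et al. (2007), the distance between two covariance descriptors reduces to a Euclidean distance between their matrix logarithms, which is what allows incremental PGA to operate like an incremental PCA in a flat space. A small sketch (the function name is ours):

```python
import numpy as np
from scipy.linalg import logm

def log_euclidean_distance(P1, P2):
    """Log-Euclidean distance between SPD covariance descriptors,
    d(P1, P2) = || logm(P1) - logm(P2) ||_F (Arsigny et al. 2007).
    Under this metric the descriptors live in the flat space of their
    matrix logarithms, reducing incremental PGA of the covariance
    descriptors to an incremental PCA on the logarithms."""
    return np.linalg.norm(logm(P1) - logm(P2), ord="fro")
```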

The advantages of using an Aff(2) AR process for the state dynamics have also been demonstrated. In principle, a random walk model is appropriate for most visual tracking applications, particularly if a large number of particles, and large covariance values for particle propagation, are used. In practice, however, the number of particles one can effectively use is limited by the requirement that visual tracking be performed online. Including appropriate state dynamics is one means of compensating for this limitation, and our studies suggest that the Aff(2) AR process is quite adequate for our purposes.

We have also shown that our tracker outperforms the IVT tracker, especially for the “Sylvester” and “Trellis” sequences. The aforementioned tracking failures of the IVT tracker for the “Sylvester” and “Trellis” sequences have already been reported by Ross et al. (2008). For the “Sylvester” sequence, note that our tracker’s performance using the video clip resampled at 15 fps is better than that of the IVT tracker using the original video clip at 30 fps. From Extension 1, we can see that the IVT tracker’s results are prone to chatter, whereas ours tend to be smoother. This comes from the fact that the state estimate of the IVT tracker is given by the particle whose weight is greatest; we believe it is better to estimate the state using the sample mean of the particles rather than the maximum-weight particle, as sketched below. We have also found that the tracking performance of our tracker is not sensitive to the initial object image region; it is instructive to compare our tracking results for the “Dudek” and “David” sequences using our own initial object image regions (Figure 10 and Extension 1) with those using the initial object image region suggested by Ross et al. (2008) (Figure 13 and Extension 1).
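As a rough illustration of the sample-mean point, the following is a generic weighted sample mean on a matrix Lie group computed by fixed-point iteration in the Lie algebra. The paper derives its own sample-mean formulas for Aff(2), so this is a sketch of the general idea rather than the authors' exact procedure.

```python
import numpy as np
from scipy.linalg import expm, logm

def weighted_mean_aff2(particles, weights, iters=10):
    """Weighted sample mean of Aff(2) particles (3x3 matrices) via the
    usual fixed-point iteration in the Lie algebra; weights are assumed
    normalized to sum to one. This is the kind of estimate advocated
    here over simply taking the maximum-weight particle."""
    mu = particles[int(np.argmax(weights))]   # start from the best particle
    for _ in range(iters):
        # Average the particles' log-coordinates around the current mean;
        # the real part is taken assuming the offsets are near the identity.
        xi = sum(w * np.real(logm(np.linalg.inv(mu) @ X))
                 for w, X in zip(weights, particles))
        mu = mu @ expm(xi)
    return mu
```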

Fig. 14. The estimated covariance elements for the “Sylvester” sequence. The solid and dashed lines respectively denote the diagonal and off-diagonal elements of the estimated covariance.

For the covariance estimation, note that the settings for the four test video clips are identical, i.e. the covariance particles are initialized from the matrix exponential of Sym(n) samples drawn from the uniform distribution between −0.25 and 0.25, and the sampled noise is scaled with the same scale parameters. We cannot assert that the covariance estimated via our combined state–covariance estimation approach is highly accurate for any given specific situation. However, an important benefit is that difficult and time-consuming noise parameter tuning for specific scenarios can be avoided. One case of covariance estimation is shown in Figure 14. With a sufficiently large number of particles, more general and larger affine transformations can be tracked by using larger values of the noise scale parameters.

The average computation time per frame of our proposed visual tracker is about 1.8 seconds with 500 particles. Although this indicates that the current speed of our tracker is far from online, we expect that further speedup, up to rates of 15 fps, is possible (the current implementation is in Matlab, with little consideration given to speed at this stage).

5. Conclusions

We have presented a visual tracking framework based on a coordinate-invariant, geometrically well-defined particle filtering algorithm developed for Aff(2). To increase robustness, the tracker adopts the following additional features: (i) state dynamics given by an AR process defined on Aff(2); (ii) combined state–covariance estimation; and (iii) robust measurement likelihood calculation based on incremental PGA of the image covariance descriptors. The log-Euclidean metric on P(n) plays a fundamental role in both the covariance estimation and the incremental PGA of the image covariance descriptors. The proposed visual tracking framework has been tested via various comparative experiments, with considerably improved tracking performance relative to current state-of-the-art visual trackers.

Possibilities for future work include extending our visual tracking framework to multiple object tracking in more realistic situations, and dealing with long-term object occlusions. Both problems can be approached in the context of the data association problem.

Acknowledgements

The authors are grateful to Professor Kyoung Mu Lee for his comments and suggestions that helped to improve the manuscript. This research was supported in part by funding from CBMS-KRF-2007-412-J03001, KIST-CIR, the Intelligent Autonomous Manipulation Research Center, ROSAEC, and IAMD-SNU.

Appendix: Index to Multimedia Extensions

The multimedia extension page is found at http://www.ijrr.org

Table of Multimedia Extensions

Extension Type Description

1    Video    Moving pictures corresponding to all of the tracking results shown in Section 4

References

Arsigny, V., Fillard, P., Pennec, X. and Ayache, N. (2007). Geometric means in a novel vector space structure on symmetric positive-definite matrices. SIAM Journal on Matrix Analysis and Applications, 29(1): 328–347.

Arulampalam, M., Maskell, S., Gordon, N. and Clapp, T. (2002). A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing, 50(2): 174–188.

Chang, C. and Ansari, R. (2005). Kernel particle filter for visual tracking. IEEE Signal Processing Letters, 12(3): 242–245.

Chiuso, A. and Soatto, S. (2000). Monte Carlo filtering on Lie groups. Proceedings of the IEEE Conference on Decision and Control, pp. 304–309.

Doucet, A., Freitas, N. and Gordon, N. (eds) (2001). Sequential Monte Carlo Methods in Practice. Berlin, Springer.

Fletcher, P., Lu, C., Pizer, S. and Joshi, S. (2004). Principal geodesic analysis for the study of nonlinear statistics of shape. IEEE Transactions on Medical Imaging, 23(8): 995–1005.

Fletcher, P. and Joshi, S. (2007). Riemannian geometry for the statistical analysis of diffusion tensor data. Signal Processing, 87: 250–262.

Gordon, N., Salmond, D. and Smith, A. (1993). Novel approach to nonlinear/non-Gaussian Bayesian state estimation. Radar and Signal Processing, IEE Proceedings F, 140(2): 107–113.

Hartley, R. and Zisserman, A. (2000). Multiple View Geometry in Computer Vision. Cambridge, Cambridge University Press.

Higuchi, T. (2001). Self-organizing time series model. Sequential Monte Carlo Methods in Practice, Doucet, A., Freitas, N. and Gordon, N. (eds). Berlin, Springer, pp. 429–444.

Isard, M. and Blake, A. (1998). Condensation–conditional density propagation for visual tracking. International Journal of Computer Vision, 29(1): 5–28.

Jepson, A., Fleet, D. and El-Maraghi, T. (2003). Robust online appearance models for visual tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(10): 1296–1311.

Khan, Z., Balch, T. and Dellaert, F. (2004). A Rao–Blackwellized particle filter for eigentracking. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

Kwon, J., Choi, M., Park, F. and Chun, C. (2007). Particle filtering on the Euclidean group: framework and applications. Robotica, 25(6): 725–737.

Kwon, J. and Park, F. (2008). Visual tracking via particle filtering on the affine group. Proceedings of the IEEE International Conference on Information and Automation, pp. 997–1002.

Lee, K.-C., Ho, J., Yang, M.-H. and Kriegman, D. (2005). Visual tracking and recognition using probabilistic appearance manifolds. Computer Vision and Image Understanding, 99: 303–331.

Lee, S., Choi, M., Kim, H. and Park, F. (2007). Geometric direct search algorithms for image registration. IEEE Transactions on Image Processing, 16(9): 2215–2224.

Li, X., Hu, W., Zhang, Z., Zhang, X. and Luo, G. (2007). Robust visual tracking based on incremental tensor subspace learning. Proceedings of the IEEE International Conference on Computer Vision.

Li, X., Hu, W., Zhang, Z., Zhang, X., Zhu, M. and Cheng, J. (2008). Visual tracking via incremental log-Euclidean Riemannian subspace learning. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

Liu, J. and West, M. (2001). Combined parameter and state estimation in simulation-based filtering. Sequential Monte Carlo Methods in Practice, Doucet, A., Freitas, N. and Gordon, N. (eds). Berlin, Springer, pp. 197–223.

Mei, X., Zhou, S. and Porikli, F. (2007). Probabilistic visual tracking via robust template matching and incremental subspace update. Proceedings of the IEEE International Conference on Multimedia and Expo, pp. 1818–1821.

Moghaddam, B. and Pentland, A. (1997). Probabilistic visual learning for object representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7): 696–710.

Pennec, X., Fillard, P. and Ayache, N. (2006). A Riemannian framework for tensor computing. International Journal of Computer Vision, 66(1): 41–66.

Perez, P., Vermaak, J. and Blake, A. (2004). Data fusion for visual tracking with particles. Proceedings of the IEEE, 92(3): 495–513.

Porikli, F., Tuzel, O. and Meer, P. (2006). Covariance tracking using model update based on Lie algebra. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

Ross, D. (2008). IVT tracker. http://www.cs.toronto.edu/~dross/ivt.

Ross, D., Lim, J., Lin, R.-S. and Yang, M.-H. (2008). Incremental learning for robust visual tracking. International Journal of Computer Vision, 77(1–3): 125–141.

Rui, Y. and Chen, Y. (2001). Better proposal distributions: object tracking using unscented particle filter. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

Srivastava, A. (2000). Bayesian filtering for tracking pose and location of rigid targets. Proceedings of SPIE AeroSense.

Srivastava, A., Grenander, U., Jensen, G. and Miller, M. (2002). Jump-diffusion Markov processes on orthogonal groups for object pose estimation. Journal of Statistical Planning and Inference, 103: 15–37.

Tuzel, O., Porikli, F. and Meer, P. (2006). Region covariance: a fast descriptor for detection and classification. Proceedings of the European Conference on Computer Vision.

Tuzel, O., Porikli, F. and Meer, P. (2007). Human detection via classification on Riemannian manifolds. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

Wang, H., Suter, D., Schindler, K. and Shen, C. (2007). Adaptive object tracking based on an effective appearance filter. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(9): 1661–1667.

Xavier, J. and Manton, J. (2006). On the generalization of AR processes to Riemannian manifolds. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing.

Zhou, S., Chellappa, R. and Moghaddam, B. (2004). Visual tracking and recognition using appearance-adaptive models in particle filters. IEEE Transactions on Image Processing, 13(11): 1491–1506.