

arXiv e-print

An On-line Variational Bayesian Model for Multi-Person Tracking from Cluttered Scenes*

Sileye Ba · Xavier Alameda-Pineda · Alessio Xompero · Radu Horaud

Abstract Object tracking is a ubiquitous problem that appears in many applications such as remote sensing, audio processing, computer vision, human-machine interfaces, human-robot interaction, etc. Although thoroughly investigated in computer vision, tracking a time-varying number of persons remains a challenging open problem. In this paper, we propose an on-line variational Bayesian model for multi-person tracking from cluttered visual observations provided by person detectors. The paper makes the following contributions: a Bayesian framework for tracking an unknown and varying number of persons; a variational expectation-maximization (VEM) algorithm with closed-form expressions for the posterior distributions and model parameter estimation; a method capable of exploiting observations from multiple detectors, thus enabling multimodal fusion; and built-in object-birth and object-visibility processes that handle person appearance and disappearance. The method is evaluated on standard datasets and shows competitive and encouraging results with respect to state-of-the-art methods.

Keywords Multi-person tracking · Bayesian tracking · variational expectation-maximization · causal inference · person detection

1 Introduction

The problem of tracking a varying number of objects is ubiquitous in a number of fields such as remote sensing, computer vision, human-computer interaction, human-robot interaction, etc. While several off-line multi-object tracking methods are available, on-line multi-person tracking is still extremely challenging [1]. In this paper we propose an on-line tracking method based on the tracking-by-detection paradigm. Indeed, the latter has been made possible by the development of efficient part detectors which provide observations of the objects of interest [2]. Moreover, one advantage of tracking by detection is the existence of linear mappings linking the kinematic states of the tracked objects with the observations issued from the detectors. This is possible because part detectors efficiently model non-linear effects and thus provide image observations that are related to the kinematic state of an object in a linear fashion.

In addition to the difficulties associated with single-object tracking (occlusions, self-occlusions, variabilities in image appearance, unpredictable temporal behavior, etc.), tracking a varying and unknown number of objects makes the problem more complex and more difficult, for the following reasons:

* Support from the EU ERC Advanced Grant VHIA (#34113) is greatly acknowledged.

S. Ba · A. Xompero · R. Horaud, INRIA Grenoble Rhône-Alpes, Montbonnot Saint-Martin, France

Xavier Alameda-Pineda, University of Trento, Trento, Italy



(i) the observations coming from part detectors need to be associated with one of the tracked objects; (ii) the number of objects is not known in advance and hence must be estimated, which includes the process of discarding detection errors; (iii) when many objects are present the dimension of the state space is large, and hence the tracker has to handle a large number of hidden-state parameters; (iv) the number of objects varies over time, and one has to deal with hidden states of varying dimensionality, from zero when there is no visible object to a large number of detected objects (note that in this case, if a Bayesian setting is being considered, as is often the case, the exact recursive filtering solution is intractable); and finally, (v) a crucial difference between single- and multi-object tracking is the mutual occlusion between objects.

Previously proposed methodological frameworks for multi-target tracking can be divided into three groups. Firstly, in the trans-dimensional Markov chain model [3], the dimensionality of the hidden state space is part of the state variable. This allows a variable number of objects to be tracked by jointly estimating the number of objects and their kinematic states. In a computer vision scenario, [4,5] exploited this framework for tracking a varying number of objects. The main drawback is that the states are inferred by means of reversible-jump Markov chain Monte Carlo sampling, which is computationally very expensive [6]. Secondly, a random finite set multi-target tracking formulation was proposed [7,8,9]. Initially used in radar applications [7], in this framework the targets are modeled as realizations of a random finite set composed of an unknown number of elements. Because an exact solution to this model is intractable, an approximation known as the probability hypothesis density (PHD) filter was proposed [10]. Further sampling-based approximations of the PHD filter were subsequently proposed, e.g., [11,12,13]. These were exploited in [14] for tracking a time-varying number of active speakers using auditory cues and in [15] for multi-target tracking using visual observations. Thirdly, conditional random fields (CRF) were also chosen to address multi-target tracking [16,17,18]. In contrast to the trans-dimensional MCMC model or the random set filtering framework, the CRF framework is a probabilistic model able to cope with complex conditional probability distributions. In this case, tracking is cast into an energy minimization problem.

In this paper we propose an on-line Bayesian framework for tracking an unknown and varying number of persons. By on-line we mean a causal filtering method that only uses past images. This is in strong contrast with off-line trackers that use both past and future images. In order to overcome the intractability of the recursive Bayesian filter, we propose a variational Bayesian approximation of the hidden-state filtering distribution. In addition, a clutter target is defined so that spurious observations, namely detector failures, are associated with this target and do not contaminate the filtering process. Importantly, the proposed framework leads to closed-form expressions for the posterior distributions of the hidden variables and for the model parameters, thus yielding an intrinsically efficient filtering procedure implemented via a variational EM (VEM) algorithm. Furthermore, our formalism allows us to integrate, in a principled fashion, heterogeneous observations coming from various detectors, e.g., face, upper-body, silhouette, etc. Remarkably, objects that come in and out of the field of view, namely object appearance and disappearance, are handled by object-birth and object-visibility processes. In particular, we replace the classical death process with a visibility process which puts to sleep the tracks associated with persons that are no longer visible. The main advantage is that these tracks can be awakened as soon as new observations match their appearance with high confidence.

The proposed method is thoroughly evaluated on two annotated datasets: (i) the cocktail-party dataset, which is composed of two sequences involving up to three persons recorded with a close-view camera, and (ii) the MOT Challenge dataset, recorded with far-view cameras. The conducted experiments demonstrate that our method competes fairly well with state-of-the-art methods. It should be emphasized that the latter use off-line non-causal smoothing methods, in which both past and future temporal information is exploited, while our method uses causal filtering which only considers past data. The paper contributions are summarized below:

– A Bayesian framework for tracking an unknown and varying number of persons;
– A VEM algorithm with closed-form expressions for the posterior distributions and model parameter estimation;
– A method capable of exploiting observations from multiple detectors, thus enabling multimodal fusion; and
– Object-birth and object-visibility processes that handle person appearance and disappearance.

The remainder of this paper is organized as follows. Section 2 reviews previous work relevant to our method. Section 3 details the proposed Bayesian model, and a variational solution is presented in Section 4. In Section 5, we describe the birth and visibility processes that handle an unknown and varying number of persons. Section 6 describes the results of experiments and benchmarks that assess the quality of the proposed method. Finally, Section 7 draws some conclusions.

2 Related Work

Generally speaking, object tracking is the temporal estimation of the object's kinematic state. In the context of image-based tracking, the object state is a parametrization of its localization in the (2D) image plane. In computer vision, object tracking has been thoroughly investigated [19]. Objects of interest can be people, faces, hands, vehicles, etc. Depending on the number of objects to be tracked, tracking can be classified into single-object tracking, fixed-number multi-object tracking, and varying-number multi-object tracking.

Methods for single-object tracking consider only one object and usually involve an initialization step, a state update step, and a reinitialization step allowing recovery from failures. Practical initialization steps are based on generic object detectors that scan the input image in order to find the object of interest [20,21]. Object detectors can be used for the reinitialization step as well. However, using generic object detectors is problematic when other objects of the same kind as the tracked object are present in the visual scene. In order to resolve such ambiguities, different complementary appearance models have been proposed, such as object templates, color appearance models, edges (image gradients) and texture, e.g., Gabor features and histograms of gradient orientations (HOG). Regarding the update step, the current state can be estimated from previous states and observations with either deterministic [22] or probabilistic methods [23].

Even if it is still a challenging problem, tracking a single object is very limited in scope. The computer vision community rapidly turned its attention towards fixed-number multi-object tracking [24]. Additional difficulties are encountered when tracking multiple objects. Firstly, there is an increase in the dimensionality of the tracking state, as the multi-object state dimensionality is the single-object state dimensionality multiplied by the number of tracked objects. Secondly, associations between observations and objects are required. However, when the number of objects and the number of observations are large, the association problem is combinatorial [25,26]. Thirdly, because of the presence of multiple targets, tracking methods have to be robust to mutual occlusions.

In most practical applications, the number of objects to be tracked is not only unknown, but also varies over time. Importantly, tracking a time-varying number of objects requires an efficient mechanism to add new objects entering the field of view, and to remove objects that have moved away. In a probabilistic setting, these mechanisms are based on birth and death processes. Efficient multi-object algorithms have to be developed within principled methodologies able to handle hidden states of varying dimensionality. The most popular methods are based on the trans-dimensional Markov chain [3,4,5], on random finite sets [9,14,15], and more recently, on conditional random fields (CRF) [27,17,18]. Less popular but successful methodologies include the Bayesian multiple-blob tracker of [28], the boosted particle filter for multi-target tracking of [29], and the Rao-Blackwellized filter for multiple-object tracking [30].

Currently available multi-object tracking methods suffer from different drawbacks. CRF-based approaches are naturally non-causal, that is, they use both past and future information. Therefore, even if they have shown high robustness to clutter, they are only suitable for off-line applications where smoothing (as opposed to filtering) techniques can be used. Random finite sets and PHD filtering techniques are time-consuming and provide non-associated tracks. In other words, these techniques require an external method in order to associate observations and tracks with objects. Finally, even if MCMC techniques are able to associate tracks with objects using only causal information, they are extremely complex from a computational point of view, and their performance is very sensitive to the sampling procedure being used. In contrast, the variational Bayesian framework we propose associates


Fig. 1 Examples of detections used as observations by the proposed person tracker: upper-body, face (a), and full-body (b) detections. Notice that one of the faces was not detected and that there is a false full-body detection in the background.

tracks to previously seen objects and creates new tracks in a unified framework that filters past observations in an intrinsically efficient way, since all the steps of the algorithm are expressed in closed form. Hence the proposed method robustly and efficiently tracks a varying and unknown number of persons from a combination of image detectors.

3 Variational Bayesian Multiple-Person Tracking

3.1 Notations and Definitions

We start by introducing our notations. Vectors and matrices are in bold ($\mathbf{A}$, $\mathbf{a}$), scalars are in italic ($A$, $a$), and sets are in calligraphic ($\mathcal{A}$). In general, random variables are denoted with upper-case letters ($\mathbf{A}$, $A$) and their realizations with lower-case letters ($\mathbf{a}$, $a$).

Since the objective is to track multiple persons whose number may vary over time, we assume that there is a maximum number of people, denoted by $N$, that may simultaneously be present in the scene. This parameter is necessary in order to cast the problem at hand into a finite-dimensional state space; consequently $N$ can be arbitrarily large. We model a track $n$ at time $t$ with the binary existence variable $e_{tn}$, and let $e_{tn} = 1$ if the person is already present and $e_{tn} = 0$ otherwise. The vectorization of the existence variables at time $t$ is denoted by $E_t = \{e_{tn}\}_{n=1}^{N}$ and their sum, namely the number of tracked persons at $t$, is denoted by $N_t = \sum_{n=1}^{N} e_{tn}$.

The kinematic state of person $n$ is a random vector $\mathbf{X}_{tn} = (\mathbf{L}_{tn}^\top, \mathbf{U}_{tn}^\top)^\top$, where $\mathbf{L}_{tn} \in \mathbb{R}^4$ is the person location, i.e., 2D image position, width and height, and $\mathbf{U}_{tn} \in \mathbb{R}^2$ is the person velocity in the image plane. The multi-person state random vector is denoted by $\mathbf{X}_t = (\mathbf{X}_{t1}^\top, \ldots, \mathbf{X}_{tN}^\top)^\top \in \mathbb{R}^{6N}$. Importantly, the kinematic state is described by a set of hidden variables which must be robustly estimated.

In order to ease the challenging task of tracking multiple persons with a single static camera, we assume the existence of $I$ detectors, each of them providing $K_t^i$ localization observations at each time $t$, with $i \in \{1, \ldots, I\}$. Fig. 1 provides examples of face and upper-body detections (Fig. 1(a)) and of full-body detections (Fig. 1(b)). The $k$-th localization observation gathered by the $i$-th detector at time $t$ is denoted by $\mathbf{y}_{tk}^i \in \mathbb{R}^4$, and represents the location (2D position, width, height) of a person in the image. The set of observations provided by detector $i$ at time $t$ is denoted by $\mathcal{Y}_t^i = \{\mathbf{y}_{tk}^i\}_{k=1}^{K_t^i}$, and the set of observations provided by all the detectors at time $t$ is denoted by $\mathcal{Y}_t = \{\mathcal{Y}_t^i\}_{i=1}^{I}$. Associated with each localization detection $\mathbf{y}_{tk}^i$ there is a photometric description of the person's appearance, denoted by $\mathbf{h}_{tk}^i$. This photometric observation is extracted from the bounding box of $\mathbf{y}_{tk}^i$.


Altogether, the localization and photometric observations constitute the raw observations $\mathbf{o}_{tk}^i = \{\mathbf{y}_{tk}^i, \mathbf{h}_{tk}^i\}$ used by our tracker. Analogous definitions hold for $\mathcal{H}_t^i$, $\mathcal{H}_t$, $\mathcal{O}_t^i$ and $\mathcal{O}_t$.

We also define an observation-to-person assignment (hidden) variable $Z_{tk}^i$ associated with each observation $\mathbf{o}_{tk}^i$. Formally, $Z_{tk}^i$ is a categorical variable taking values in the set $\{1, \ldots, N\}$: $Z_{tk}^i = n$ means that $\mathbf{o}_{tk}^i$ is associated with person $n$. The sets $\mathcal{Z}_t^i$ and $\mathcal{Z}_t$ are defined analogously to $\mathcal{Y}_t^i$ and $\mathcal{Y}_t$. These assignment variables can also be used to handle false detections. Indeed, it is common that a detection corresponds to some clutter instead of a person. We cope with these false detections by defining a clutter target. In practice, the index $n = 0$ is assigned to this clutter target, which is always visible, i.e., $e_{t0} = 1$ for all $t$. Hence, the set of possible values for $Z_{tk}^i$ is extended to $\{0\} \cup \{1, \ldots, N\}$, and $Z_{tk}^i = 0$ means that observation $\mathbf{o}_{tk}^i$ has been generated by clutter and not by a person.
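To make this notation concrete, the following minimal Python sketch shows the data structures the tracker operates on; the class names are illustrative and not part of the paper.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    """Raw observation o^i_tk = {y^i_tk, h^i_tk} from detector i at time t."""
    y: np.ndarray    # localization: 2D position, width, height (a vector in R^4)
    h: np.ndarray    # appearance: color histogram extracted from the bounding box
    detector: int    # detector index i in {1, ..., I}

@dataclass
class Track:
    """Per-person quantities; the index n = 0 is reserved for the clutter target."""
    n: int               # person index
    e: int               # existence variable e_tn (1 if the person is present)
    mu: np.ndarray       # posterior mean of the kinematic state (a vector in R^6)
    Gamma: np.ndarray    # posterior covariance of the kinematic state (6 x 6)
    h_ref: np.ndarray    # reference appearance histogram h_n
```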

3.2 The Proposed Bayesian Multi-Person Tracking Model

The on-line multi-person tracking problem is cast into the estimation of the filtering distribution of the hidden variables given the observations, $p(\mathcal{Z}_t, \mathbf{X}_t | \mathcal{O}_{1:t}, E_{1:t})$, where $\mathcal{O}_{1:t} = \{\mathcal{O}_1, \ldots, \mathcal{O}_t\}$. Bayes' rule allows us to factorize the filtering distribution in the following way:

$$p(\mathcal{Z}_t, \mathbf{X}_t | \mathcal{O}_{1:t}, E_{1:t}) = \frac{p(\mathcal{O}_t | \mathcal{Z}_t, \mathbf{X}_t, \mathcal{O}_{1:t-1}, E_{1:t})\, p(\mathcal{Z}_t, \mathbf{X}_t | \mathcal{O}_{1:t-1}, E_{1:t})}{p(\mathcal{O}_t | \mathcal{O}_{1:t-1}, E_{1:t})}. \quad (1)$$

Importantly, we assume that the observations at time $t$ only depend on the hidden and existence variables at time $t$. Therefore the filtering distribution (1) rewrites as:

$$p(\mathcal{Z}_t, \mathbf{X}_t | \mathcal{O}_{1:t}, E_{1:t}) = \frac{p(\mathcal{O}_t | \mathcal{Z}_t, \mathbf{X}_t, E_t)\, p(\mathcal{Z}_t | E_t)\, p(\mathbf{X}_t | \mathcal{O}_{1:t-1}, E_{1:t})}{p(\mathcal{O}_t | \mathcal{O}_{1:t-1}, E_{1:t})}. \quad (2)$$

The denominator of (2) only involves observed variables, and therefore its evaluation is not necessary as long as one can normalize the expression arising from the numerator. Hence we focus on the three terms of the latter, namely the observation model, the observation-to-person assignment model, and the predictive distribution, respectively.

3.2.1 The Observation Model

The joint observation likelihood can be factorized as:

$$p(\mathcal{O}_t | \mathcal{Z}_t, \mathbf{X}_t, E_t) = \prod_{i=1}^{I} \prod_{k=1}^{K_t^i} p(\mathbf{o}_{tk}^i | Z_{tk}^i, \mathbf{X}_t, E_t). \quad (3)$$

In addition, we make the reasonable assumption that, while localization observations depend both on the assignment variable and the kinematic state, the appearance observations only depend on the assignment variable, that is, the person identity, but not on his/her kinematic state. We also assume the localization and appearance observations to be independent given the hidden variables. Consequently, the observation likelihood of a single joint observation can be factorized as:

$$p(\mathbf{o}_{tk}^i | Z_{tk}^i, \mathbf{X}_t, E_t) = p(\mathbf{y}_{tk}^i, \mathbf{h}_{tk}^i | Z_{tk}^i, \mathbf{X}_t, E_t) = p(\mathbf{y}_{tk}^i | Z_{tk}^i, \mathbf{X}_t, E_t)\, p(\mathbf{h}_{tk}^i | Z_{tk}^i, E_t). \quad (4)$$

The localization observation model is defined depending on whether the observation is generated by clutter or by a person:

– If the observation is generated by clutter, namely $Z_{tk}^i = 0$, the variable $\mathbf{y}_{tk}^i$ follows a uniform distribution with probability density function $u(\mathbf{y}_{tk}^i)$;
– If the observation is generated by person $n$, namely $Z_{tk}^i = n$, the variable $\mathbf{y}_{tk}^i$ follows a Gaussian distribution with mean $\mathbf{P}^i \mathbf{X}_{tn}$ and covariance $\boldsymbol{\Sigma}^i$: $\mathbf{y}_{tk}^i \sim g(\mathbf{y}_{tk}^i; \mathbf{P}^i \mathbf{X}_{tn}, \boldsymbol{\Sigma}^i)$.


The linear operator $\mathbf{P}^i$ maps the kinematic state vectors onto the $i$-th space of observations. For example, when $\mathbf{X}_{tn}$ represents the upper-body kinematic state (upper-body localization and velocity) and $\mathbf{y}_{tk}^i$ represents the upper-body localization observation, $\mathbf{P}^i$ is a projection which, when applied to a state vector, only retains the localization components of the state vector. When $\mathbf{y}_{tk}^i$ is a face localization observation, the operator $\mathbf{P}^i$ is the composition of the previous projection and an affine transformation mapping an upper-body bounding box onto its corresponding face bounding box. Finally, the full observation model is compactly defined by

$$p(\mathbf{y}_{tk}^i | Z_{tk}^i = n, \mathbf{X}_t, E_t) = u(\mathbf{y}_{tk}^i)^{1-e_{tn}} \left( u(\mathbf{y}_{tk}^i)^{\delta_{0n}}\, g(\mathbf{y}_{tk}^i; \mathbf{P}^i \mathbf{X}_{tn}, \boldsymbol{\Sigma}^i)^{1-\delta_{0n}} \right)^{e_{tn}}, \quad (5)$$

where $\delta_{ij}$ stands for the Kronecker delta.

The appearance observation model is also defined depending on whether the observation is clutter or not. When the observation is generated by clutter, the appearance observation follows a uniform distribution with density function $u(\mathbf{h}_{tk}^i)$. When the observation is generated by person $n$, the appearance observation follows a Bhattacharyya distribution with density defined as

$$b(\mathbf{h}_{tk}^i; \mathbf{h}_n) = \frac{1}{W_\lambda} \exp\left(-\lambda\, d_B(\mathbf{h}_{tk}^i, \mathbf{h}_n)\right),$$

where $\lambda$ is a positive skewness parameter, $d_B(\cdot, \cdot)$ is the Bhattacharyya distance between histograms, and $\mathbf{h}_n$ is the $n$-th person's reference appearance model.$^1$ This gives the following compact appearance observation model:

$$p(\mathbf{h}_{tk}^i | Z_{tk}^i = n, \mathbf{X}_t, E_t) = u(\mathbf{h}_{tk}^i)^{1-e_{tn}} \left( u(\mathbf{h}_{tk}^i)^{\delta_{0n}}\, b(\mathbf{h}_{tk}^i; \mathbf{h}_n)^{1-\delta_{0n}} \right)^{e_{tn}}. \quad (6)$$
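For illustration, the appearance likelihood can be evaluated as in the sketch below, assuming $\ell_1$-normalized histograms and the common definition $d_B(\mathbf{h}_1, \mathbf{h}_2) = -\log \sum_k \sqrt{h_{1k} h_{2k}}$ (the paper does not spell out which variant it uses); the normalization constant $W_\lambda$ is dropped since it cancels when the assignment posteriors (17) are normalized.

```python
import numpy as np

def bhattacharyya_distance(h1, h2, eps=1e-12):
    """d_B(h1, h2) = -log BC(h1, h2) for l1-normalized histograms h1, h2."""
    bc = np.sum(np.sqrt(h1 * h2))        # Bhattacharyya coefficient, in [0, 1]
    return -np.log(bc + eps)

def appearance_likelihood(h, h_ref, lam=1.0):
    """Unnormalized b(h; h_ref) = exp(-lam * d_B(h, h_ref)); the constant
    W_lam is omitted because it cancels in the normalized posterior (17)."""
    return np.exp(-lam * bhattacharyya_distance(h, h_ref))
```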

3.2.2 The Observation-to-Person Assignment Model

The joint distribution of the assignment variables factorizes as:

$$p(\mathcal{Z}_t | E_t) = \prod_{i=1}^{I} \prod_{k=1}^{K_t^i} p(Z_{tk}^i | E_t). \quad (7)$$

Given the existence variables $E_t$, the assignment variables $Z_{tk}^i$ are assumed to follow multinomial distributions defined as:

$$p(Z_{tk}^i = n | E_t) = e_{tn} a_{tkn}^i \quad \text{with} \quad \sum_{n=0}^{N} e_{tn} a_{tkn}^i = 1, \quad (8)$$

where $a_{tkn}^i$ represents the probability of assigning observation $k$ from detector $i$ to person $n$. Because $e_{tn}$ takes the value 1 only for actual persons, the probability of assigning an observation to a non-existing person is zero.

3.2.3 The Predictive Distribution

The kinematic state predictive distribution models the probability distribution of the kinematic state at time $t$ given the observations up to time $t-1$ and the existence variables, $p(\mathbf{X}_t | \mathcal{O}_{1:t-1}, E_{1:t})$. The main component of the predictive distribution is the kinematic state dynamics.

To define the kinematic state dynamics we make two hypotheses. Firstly, we assume that the kinematic state dynamics follow a first-order Markov chain, meaning that the state $\mathbf{X}_t$ only depends on the state $\mathbf{X}_{t-1}$. Secondly, we assume that the person locations do not influence each other's dynamics, meaning that there is one first-order Markov chain for each person. Formally, this can be written as:

$$p(\mathbf{X}_t | \mathbf{X}_{1:t-1}, E_{1:t}) = p(\mathbf{X}_t | \mathbf{X}_{t-1}, E_t) = \prod_{n=1}^{N} p(\mathbf{X}_{tn} | \mathbf{X}_{t-1,n}, e_{tn}). \quad (9)$$

$^1$ It should be noticed that the normalization constant $W_\lambda = \int_{\sum_k h_k = 1} \exp(-\lambda\, d_B(\mathbf{h}, \mathbf{h}_n))\, d\mathbf{h}$ can be exactly computed only for histograms with dimension lower than 3. In practice, $W_\lambda$ is approximated using Monte Carlo integration.


The immediate consequence of this choice is that the posterior distribution can be computed as:

$$p(\mathbf{X}_t | \mathcal{O}_{1:t-1}, E_{1:t}) = \prod_{n=1}^{N} \int p(\mathbf{X}_{tn} | \mathbf{x}_{t-1,n}, e_{tn})\, p(\mathbf{x}_{t-1,n} | \mathcal{O}_{1:t-1}, E_{1:t-1})\, d\mathbf{x}_{t-1,n}. \quad (10)$$

For the model to be complete, $p(\mathbf{X}_{tn} | \mathbf{X}_{t-1,n}, e_{tn})$ needs to be defined. The temporal evolution of the kinematic state $\mathbf{X}_{tn}$ is defined as:

$$p(\mathbf{X}_{tn} = \mathbf{x}_{tn} | \mathbf{X}_{t-1,n} = \mathbf{x}_{t-1,n}, e_{tn}) = u(\mathbf{x}_{tn})^{1-e_{tn}}\, g(\mathbf{x}_{tn}; \mathbf{D}\mathbf{x}_{t-1,n}, \boldsymbol{\Lambda}_n)^{e_{tn}}, \quad (11)$$

where $u(\mathbf{x}_{tn})$ is a uniform distribution over the motion state space, $g$ is a Gaussian probability density function, $\mathbf{D}$ is the dynamics transition operator, and $\boldsymbol{\Lambda}_n$ is a covariance matrix accounting for uncertainties of the state dynamics. The transition operator is defined as:

$$\mathbf{D} = \begin{bmatrix} \mathbf{I}_4 & \begin{bmatrix} \mathbf{I}_2 \\ \mathbf{0}_2 \end{bmatrix} \\ \mathbf{0}_{2 \times 4} & \mathbf{I}_2 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix}.$$

In other words, the dynamical model for an existing person $n$ is either a Gaussian with mean vector $\mathbf{D}\mathbf{X}_{t-1,n}$ and covariance matrix $\boldsymbol{\Lambda}_n$, or a uniform distribution if person $n$ does not exist. The complete set of model parameters is denoted by $\Theta = \left\{ \{A_{t'}\}_{t'=1}^{t}, \{\boldsymbol{\Sigma}^i\}_{i=1}^{I}, \{\boldsymbol{\Lambda}_n\}_{n=1}^{N} \right\}$, where $A_t = \{a_{tkn}^i\}_{i,k,n=1}^{I, K_t^i, N}$ is the set of probabilities used to define the distribution of the observation-to-person assignment (8).
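For concreteness, the transition operator $\mathbf{D}$ and the velocity-discarding projection described in Section 3.2.1 can be written as below; this is a sketch, and the affine face-to-upper-body mapping is omitted.

```python
import numpy as np

# State x = (L, U) with L = (px, py, w, h) and U = (vx, vy).
D = np.eye(6)
D[0, 4] = 1.0    # px <- px + vx: constant-velocity position update
D[1, 5] = 1.0    # py <- py + vy

# P keeps the localization L = (px, py, w, h) and drops the velocity U,
# mapping the 6-D kinematic state onto a 4-D localization observation.
P = np.hstack([np.eye(4), np.zeros((4, 2))])

x = np.array([10.0, 20.0, 50.0, 80.0, 1.5, -0.5])
assert np.allclose(P @ (D @ x), [11.5, 19.5, 50.0, 80.0])
```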

4 Variational Inference

Because of the combinatorial nature of the observation-to-person assignment problem, a direct optimization of the filtering distribution (2) with respect to the hidden variables is intractable. We propose to overcome this problem via a variational inference method. The principle of variational inference is to approximate the intractable filtering distribution $p(\mathcal{Z}_t, \mathbf{X}_t | \mathcal{O}_{1:t}, E_{1:t})$ by a tractable distribution $q(\mathcal{Z}_t, \mathbf{X}_t)$. To solve the multi-person tracking problem, we define the approximating distribution as:

$$q(\mathcal{Z}_t, \mathbf{X}_t) = \prod_{i=1}^{I} \prod_{k=0}^{K_t^i} q(Z_{tk}^i) \prod_{n=0}^{N} q(\mathbf{X}_{tn}) \approx p(\mathcal{Z}_t, \mathbf{X}_t | \mathcal{O}_{1:t}, E_{1:t}). \quad (12)$$

According to the variational Bayesian formulation [31,32], given the observations and the parameters at the previous iteration $\Theta^\circ$, the optimal approximation $q$ can be computed as follows:

$$\log q(Z_{tk}^i) = \mathbb{E}_{q(\mathcal{Z}_t \setminus Z_{tk}^i,\, \mathbf{X}_t)} \left\{ \log p(\mathcal{Z}_t, \mathbf{X}_t | \mathcal{O}_{1:t}, e_{1:t}, \Theta^\circ) \right\}, \quad (13)$$

$$\log q(\mathbf{X}_{tn}) = \mathbb{E}_{q(\mathcal{Z}_t,\, \mathbf{X}_t \setminus \mathbf{X}_{tn})} \left\{ \log p(\mathcal{Z}_t, \mathbf{X}_t | \mathcal{O}_{1:t}, e_{1:t}, \Theta^\circ) \right\}. \quad (14)$$

Once the posterior distribution over the hidden variables is approximated, the optimal parameters are estimated using $\Theta = \arg\max_\Theta J(\Theta, \Theta^\circ)$, with $J$ defined as:

$$J(\Theta, \Theta^\circ) = \mathbb{E}_{q(\mathcal{Z}_t, \mathbf{X}_t)} \left\{ \log p(\mathcal{Z}_t, \mathbf{X}_t, \mathcal{O}_{1:t} | e_{1:t}, \Theta, \Theta^\circ) \right\}. \quad (15)$$

To summarize, the proposed solution for multi-person tracking is an on-line variational EM algorithm. Indeed, the variational approximation (12) leads to a variational EM in which the E-step consists of computing (13) and (14) and the M-step consists of maximizing the expected complete-data log-likelihood (15). The expectation and maximization steps of the algorithm are now detailed.


4.1 E-Z-Step

The estimation of $q(Z_{tk}^i)$ is carried out by developing the expectation (13). The full derivation can be found in A.2, and yields the following formula:

$$q(Z_{tk}^i = n) = e_{tn} \alpha_{tkn}^i, \quad (16)$$

where

$$\alpha_{tkn}^i = \frac{\varepsilon_{tkn}^i}{\sum_{m=0}^{N} e_{tm} \varepsilon_{tkm}^i}, \quad (17)$$

and $\varepsilon_{tkn}^i$ is defined as:

$$\varepsilon_{tkn}^i = \begin{cases} a_{tk0}^i\, u(\mathbf{y}_{tk}^i)\, u(\mathbf{h}_{tk}^i) & n = 0, \\ a_{tkn}^i\, g(\mathbf{y}_{tk}^i; \mathbf{P}^i \boldsymbol{\mu}_{tn}, \boldsymbol{\Sigma}^i)\, e^{-\frac{1}{2}\mathrm{Tr}\left( \mathbf{P}^{i\top} (\boldsymbol{\Sigma}^i)^{-1} \mathbf{P}^i \boldsymbol{\Gamma}_{tn} \right)}\, b(\mathbf{h}_{tk}^i; \mathbf{h}_n) & n \neq 0, \end{cases} \quad (18)$$

where $\mathrm{Tr}(\cdot)$ is the trace operator and $\boldsymbol{\mu}_{tn}$ and $\boldsymbol{\Gamma}_{tn}$ are defined by (20) and (21) below. Intuitively, this approximation shows that the assignment of an observation to a person is based on the spatial proximity between the observation localization and the person localization, and on the similarity between the observation's appearance and the person's reference appearance.
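A direct transcription of (16)–(18) for a single observation and a single detector is sketched below. It reuses `bhattacharyya_distance` from the sketch of Section 3.2.1; the uniform densities `u_y` and `u_h` are constants supplied by the caller, and `e` has length $N+1$ with `e[0] = 1` for the clutter target.

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_z_step(y, h, a, e, mu, Gamma, P, Sigma, h_ref, u_y, u_h, lam=1.0):
    """Responsibilities alpha_{tkn} of (17) for one observation (y, h).
    a: assignment probabilities a_{tk,0..N}; mu, Gamma, h_ref: lists over persons."""
    N = len(mu)
    eps = np.zeros(N + 1)
    eps[0] = a[0] * u_y * u_h                      # clutter hypothesis, n = 0 in (18)
    Sigma_inv = np.linalg.inv(Sigma)
    for n in range(1, N + 1):
        g = multivariate_normal.pdf(y, mean=P @ mu[n - 1], cov=Sigma)
        trace = np.exp(-0.5 * np.trace(P.T @ Sigma_inv @ P @ Gamma[n - 1]))
        b = np.exp(-lam * bhattacharyya_distance(h, h_ref[n - 1]))
        eps[n] = a[n] * g * trace * b              # person hypothesis in (18)
    eps = e * eps                                  # zero out non-existing persons
    return eps / eps.sum()                         # normalization of (17)
```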

4.2 E-X-Step

The estimation of $q(\mathbf{X}_{tn})$ is derived from (14). Similarly to the previous posterior distribution, the mathematical derivations are provided in A.3, and boil down to the following formula:

$$q(\mathbf{X}_{tn}) = u(\mathbf{X}_{tn})^{1-e_{tn}}\, g(\mathbf{X}_{tn}; \boldsymbol{\mu}_{tn}, \boldsymbol{\Gamma}_{tn})^{e_{tn}}, \quad (19)$$

where the mean vector $\boldsymbol{\mu}_{tn}$ and the covariance matrix $\boldsymbol{\Gamma}_{tn}$ are given by

$$\boldsymbol{\Gamma}_{tn} = \left( \sum_{i=1}^{I} \sum_{k=0}^{K_t^i} \alpha_{tkn}^i \left( \mathbf{P}^{i\top} (\boldsymbol{\Sigma}^i)^{-1} \mathbf{P}^i \right) + \left( \mathbf{D} \boldsymbol{\Gamma}_{t-1,n} \mathbf{D}^\top + \boldsymbol{\Lambda}_n \right)^{-1} \right)^{-1}, \quad (20)$$

$$\boldsymbol{\mu}_{tn} = \boldsymbol{\Gamma}_{tn} \left( \sum_{i=1}^{I} \sum_{k=0}^{K_t^i} \alpha_{tkn}^i\, \mathbf{P}^{i\top} (\boldsymbol{\Sigma}^i)^{-1} \mathbf{y}_{tk}^i + \left( \mathbf{D} \boldsymbol{\Gamma}_{t-1,n} \mathbf{D}^\top + \boldsymbol{\Lambda}_n \right)^{-1} \mathbf{D} \boldsymbol{\mu}_{t-1,n} \right). \quad (21)$$

We note that the variational approximation of the kinematic-state distribution resembles the Kalman filter solution of a linear dynamical system, with one main difference: in our solution (20) and (21), the observations are weighted by $\alpha_{tkn}^i$ when computing the mean vectors and covariance matrices.
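Equations (20)–(21) amount to an information-form Kalman update in which the observations enter with soft assignment weights; a minimal per-person sketch for a single detector follows.

```python
import numpy as np

def e_x_step(Y, alpha, mu_prev, Gamma_prev, D, P, Sigma, Lambda):
    """Variational update (20)-(21) for one person: returns (mu_tn, Gamma_tn).
    Y: (K, 4) array of observations y_tk; alpha: (K,) responsibilities alpha_tkn."""
    Sigma_inv = np.linalg.inv(Sigma)
    pred_cov = D @ Gamma_prev @ D.T + Lambda            # predicted state covariance
    pred_info = np.linalg.inv(pred_cov)
    # (20): posterior precision = predicted precision + weighted observation info.
    Gamma = np.linalg.inv(pred_info + alpha.sum() * (P.T @ Sigma_inv @ P))
    # (21): posterior mean combines weighted observations and the prediction.
    weighted = P.T @ Sigma_inv @ (alpha[:, None] * Y).sum(axis=0)
    mu = Gamma @ (weighted + pred_info @ (D @ mu_prev))
    return mu, Gamma
```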

4.3 M-steps

Once the posterior distributions of the hidden variables are estimated, the optimal parameter values can be estimated by maximizing $J$ defined by (15). The M-step estimates the assignment variable distribution parameters $a_{tkn}^i$, the observation model covariance matrices $\boldsymbol{\Sigma}^i$, and the dynamical model covariance matrices $\boldsymbol{\Lambda}_n$.

The M-step for the observation assignment parameters $a$ is carried out by maximizing the cost function

$$J(A_t) = \sum_{i=1}^{I} \sum_{k=0}^{K_t^i} \sum_{n=0}^{N} e_{tn} \alpha_{tkn}^i \log\left( e_{tn} a_{tkn}^i \right), \quad (22)$$

subject to the constraints $\sum_{n=0}^{N} e_{tn} a_{tkn}^i = 1$. Using a Lagrange multiplier, the solution is straightforward: $a_{tkn}^i = \alpha_{tkn}^i$.


The M-step for the observation covariances corresponds to the estimation of $\boldsymbol{\Sigma}^i$. This is done by maximizing

$$J(\boldsymbol{\Sigma}^i) = \sum_{k=1}^{K_t^i} \sum_{n=1}^{N} e_{tn} \alpha_{tkn}^i \log g(\mathbf{y}_{tk}^i; \mathbf{P}^i \mathbf{X}_{tn}, \boldsymbol{\Sigma}^i)$$

with respect to $\boldsymbol{\Sigma}^i$. Differentiating $J(\boldsymbol{\Sigma}^i)$ with respect to $\boldsymbol{\Sigma}^i$ and equating to zero gives:

$$\boldsymbol{\Sigma}^i = \frac{1}{K_t^i N} \sum_{k=1}^{K_t^i} \sum_{n=1}^{N} e_{tn} \alpha_{tkn}^i \left( \mathbf{P}^i \boldsymbol{\Gamma}_{tn} \mathbf{P}^{i\top} + (\mathbf{y}_{tk}^i - \mathbf{P}^i \boldsymbol{\mu}_{tn})(\mathbf{y}_{tk}^i - \mathbf{P}^i \boldsymbol{\mu}_{tn})^\top \right). \quad (23)$$

The M-step for the kinematic state dynamics covariances corresponds to the estimation of $\boldsymbol{\Lambda}_n$ for a fixed $n$. This is done by maximizing the cost function

$$J(\boldsymbol{\Lambda}_n) = \mathbb{E}_{q(\mathbf{X}_{tn}|e_{tn})}\left[ \log g(\mathbf{X}_{tn}; \mathbf{D}\boldsymbol{\mu}_{t-1,n}, \mathbf{D}\boldsymbol{\Gamma}_{t-1,n}\mathbf{D}^\top + \boldsymbol{\Lambda}_n)^{e_{tn}} \right].$$

Equating the differential of the cost to zero gives:

$$\boldsymbol{\Lambda}_n = \mathbf{D}\boldsymbol{\Gamma}_{t-1,n}\mathbf{D}^\top + \boldsymbol{\Gamma}_{tn} + (\boldsymbol{\mu}_{tn} - \mathbf{D}\boldsymbol{\mu}_{t-1,n})(\boldsymbol{\mu}_{tn} - \mathbf{D}\boldsymbol{\mu}_{t-1,n})^\top. \quad (24)$$

It is worth noticing that, in the current filtering application, the formulas for $\boldsymbol{\Sigma}^i$ and $\boldsymbol{\Lambda}_n$ are instantaneous, i.e., they are estimated only from the observations at time $t$. The information at time $t$ is usually insufficient to obtain stable values for these matrices. Even if estimating $\boldsymbol{\Sigma}^i$ and $\boldsymbol{\Lambda}_n$ is suitable in a parameter learning scenario where the tracks are provided, we noticed that in practical tracking scenarios, where the tracks are unknown, this does not yield stable results. Suitable priors on the temporal dynamics of the covariance parameters would be required. Therefore, in this paper we assume that the observation and dynamical model covariance matrices are fixed, and during tracking we only estimate the assignment variable parameters.

5 Track-Birth and Person-Visibility Processes

Tracking a time-varying number of targets requires procedures to create tracks when new targets enter the scene and to delete tracks when targets leave the visual scene. In this paper, we propose a birth process based on a statistical test, which creates new tracks, and a visibility process based on a hidden Markov model, which handles targets that leave the scene.

5.1 Track-Birth Process

The principle of the track-birth process is to search for consistent trajectories in the history of observations associated with clutter. Intuitively, the two hypotheses "the considered observation sequence is generated by a person not being tracked" and "the considered observation sequence is generated by clutter" are confronted.

The model of the hypothesis "the considered observation sequence is generated by a person not being tracked" is based on the observation and dynamical models defined in (5) and (11). If there is a not-yet-tracked person $n$ generating the considered observation sequence $\{\mathbf{y}_{tk_0}, \ldots, \mathbf{y}_{t-L,k_L}\}$,$^2$ then the observation likelihood is $p(\mathbf{y}_{t-l,k_l} | \mathbf{x}_{t-l,n}) = g(\mathbf{y}_{t-l,k_l}; \mathbf{P}\mathbf{x}_{t-l,n}, \boldsymbol{\Sigma})$ and the person trajectory is governed by the dynamical model $p(\mathbf{x}_{tn} | \mathbf{x}_{t-1,n}) = g(\mathbf{x}_{tn}; \mathbf{D}\mathbf{x}_{t-1,n}, \boldsymbol{\Lambda}_n)$. Since there is no prior knowledge about the starting point of the track, we assume a "flat" Gaussian distribution over $\mathbf{x}_{t-L,n}$, namely

$^2$ In practice we consider $L = 2$; however, the derivations can be conducted for arbitrary values of $L$.


$p_b(\mathbf{x}_{t-L,n}) = g(\mathbf{x}_{t-L,n}; \mathbf{m}_b, \boldsymbol{\Gamma}_b)$, which is approximately equivalent to a uniform distribution over the image. Consequently, the joint observation distribution writes:

$$\tau_0 = p(\mathbf{y}_{tk_0}, \ldots, \mathbf{y}_{t-L,k_L}) = \int p(\mathbf{y}_{tk_0}, \ldots, \mathbf{y}_{t-L,k_L}, \mathbf{x}_{t:t-L,n})\, d\mathbf{x}_{t:t-L,n} = \int \prod_{l=0}^{L} p(\mathbf{y}_{t-l,k_l} | \mathbf{x}_{t-l,n}) \times \prod_{l=0}^{L-1} p(\mathbf{x}_{t-l,n} | \mathbf{x}_{t-l-1,n}) \times p_b(\mathbf{x}_{t-L,n})\, d\mathbf{x}_{t:t-L,n}, \quad (25)$$

which can be seen as the marginal of a multivariate Gaussian distribution. Therefore, the joint observation distribution $p(\mathbf{y}_{tk_0}, \mathbf{y}_{t-1,k_1}, \ldots, \mathbf{y}_{t-L,k_L})$ is also Gaussian and can be explicitly computed.

The model of the hypothesis "the considered observation sequence is generated by clutter" is based on the observation model given in (5). When the considered observation sequence $\{\mathbf{y}_{tk_0}, \ldots, \mathbf{y}_{t-L,k_L}\}$ is generated by clutter, the observations are independent and identically uniformly distributed. In this case, the joint observation likelihood is

$$\tau_1 = p(\mathbf{y}_{tk_0}, \ldots, \mathbf{y}_{t-L,k_L}) = \prod_{l=0}^{L} u(\mathbf{y}_{t-l,k_l}). \quad (26)$$

Finally, our birth process is as follows: for all $\mathbf{y}_{tk_0}$ such that $\tau_0 > \tau_1$, a new target is added by setting $e_{tn} = 1$ and $q(\mathbf{x}_{tn}) = g(\mathbf{x}_{tn}; \boldsymbol{\mu}_{tn}, \boldsymbol{\Gamma}_{tn})$, with $\boldsymbol{\mu}_{tn} = [\mathbf{y}_{tk_0}^\top, \mathbf{0}_2^\top]^\top$ and $\boldsymbol{\Gamma}_{tn}$ set to the value of a birth covariance matrix (see (19)). Also, the reference appearance model for the new target is defined as $\mathbf{h}_{tn} = \mathbf{h}_{tk_0}$.

5.2 Person-Visibility Process

A tracked person is said to be visible at time $t$ whenever there are observations associated with that person; otherwise the person is considered not visible. Instead of deleting tracks, as classical death processes would do, our model marks tracks without associated observations as sleeping. In this way, we keep the possibility of awakening such sleeping tracks when their appearance model strongly matches the appearance of an observation.

We denote the $n$-th person's (binary) visibility variable by $V_{tn}$, meaning that the person is visible at time $t$ if $V_{tn} = 1$ and not visible otherwise. We assume the existence of a transition model for the hidden visibility variable $V_{tn}$. More precisely, the temporal evolution of the visibility state is governed by a symmetric transition matrix, i.e., $p(V_{tn} = j | V_{t-1,n} = i) = \pi_v^{\delta_{ij}} (1 - \pi_v)^{1 - \delta_{ij}}$, where $\pi_v$ is the probability of remaining in the same state. To enforce temporal smoothness, the probability of remaining in the same state is taken to be higher than the probability of switching to the other state.

The goal now is to estimate the visibility of all the persons. For this purpose we define the visibility observations as $\nu_{tn} = \sum_{i=1}^{I} \sum_{k=1}^{K_t^i} e_{tn} a_{tkn}^i$, which is 0 when no observation is associated with person $n$. In practice, we need to filter the visibility state variables $V_{tn}$ using the visibility observations $\nu_{tn}$. In other words, we need to estimate the filtering distribution $p(V_{tn} | \nu_{1:t,n}, e_{1:t,n})$, which can be written as:

$$p(V_{tn} = v_{tn} | \nu_{1:t,n}, e_{1:t,n}) = \frac{p(\nu_{tn} | v_{tn}, e_{tn}) \sum_{v_{t-1,n}} p(v_{tn} | v_{t-1,n})\, p(v_{t-1,n} | \nu_{1:t-1,n}, e_{1:t-1,n})}{p(\nu_{tn} | \nu_{1:t-1,n}, e_{1:t,n})}, \quad (27)$$

where the denominator corresponds to integrating the numerator over $v_{tn}$. In order to fully specify the model, we define the visibility observation likelihood as:

$$p(\nu_{tn} | v_{tn}, e_{tn}) = \left( e^{-\lambda \nu_{tn}} \right)^{v_{tn}} \left( 1 - e^{-\lambda \nu_{tn}} \right)^{1 - v_{tn}}. \quad (28)$$

Importantly, at each frame, the filtering distribution can be explicitly computed.
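The recursion (27) is the forward pass of a two-state HMM; a minimal sketch follows, with the two emission likelihood values of (28) supplied by the caller and an illustrative self-transition probability that is not taken from the paper.

```python
def visibility_update(p_prev, lik_visible, lik_hidden, pi_v=0.8):
    """One step of (27). p_prev: p(V_{t-1,n} = 1 | past observations);
    lik_visible, lik_hidden: p(nu_tn | V_tn = 1) and p(nu_tn | V_tn = 0)."""
    pred_v = pi_v * p_prev + (1.0 - pi_v) * (1.0 - p_prev)   # predict V_tn = 1
    pred_h = (1.0 - pi_v) * p_prev + pi_v * (1.0 - p_prev)   # predict V_tn = 0
    num = lik_visible * pred_v
    return num / (num + lik_hidden * pred_h)                 # normalized posterior
```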


Fig. 2 Region splitting for color histogram computation: the upper-body case (Fig. 2(a)) and the full-body case (Fig. 2(b)).

6 Experiments

6.1 Evaluation Protocol

We experimentally assess the performance of the proposed model using two datasets. The Cocktail-Party Dataset consists of two videos recorded with a close-view camera (see Fig. 1(a)), in which only the people's upper bodies are visible: CPD-3 is a recording of 3 persons over 853 frames, and CPD-2 is a recording of 2 persons over 495 frames, both involving occlusions. The second dataset is a subset of the MOT Challenge Dataset [33], available at http://motchallenge.net/. From the MOT Challenge dataset, we selected two videos, TUD-Stadtmitte (9 persons, 179 frames) and PETS09-S2L1 (18 persons, 795 frames), because they are classically used for benchmarking multi-person tracking models [18,17]. TUD-Stadtmitte shows closely viewed full-body pedestrians, while PETS09-S2L1 shows far-viewed full-body pedestrians (see Fig. 1(b)). All recordings were made with a static camera.

Because multi-person tracking intrinsically involves track creation, track deletion, target identity maintenance, and localization, evaluating multi-person tracking models is a non-trivial task. Many metrics have been proposed [34,35]; we follow the widely accepted CLEAR 2006 multi-person tracking evaluation metrics [35]. On the one hand, the multi-object tracking accuracy (MOTA) combines false positives (FP), missed targets (FN), and identity switches (ID). On the other hand, the multi-object tracking precision (MOTP) measures the alignment of the tracker output bounding boxes with the ground truth. We also provide tracking precision (Pr) and recall (Rc).
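For reference, these CLEAR MOT scores are commonly computed as

$$\text{MOTA} = 1 - \frac{\sum_t \left( \text{FN}_t + \text{FP}_t + \text{ID}_t \right)}{\sum_t g_t}, \qquad \text{MOTP} = \frac{\sum_{t,i} d_{t,i}}{\sum_t c_t},$$

where $g_t$ is the number of ground-truth targets in frame $t$, $c_t$ the number of matched target-hypothesis pairs, and $d_{t,i}$ the bounding-box overlap of match $i$ in frame $t$ [35].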

6.2 Evaluation on the Cocktail Party Dataset

In the cocktail party dataset our model exploits upper-body detections obtained using [20] and face detections obtained using [21]. Therefore, we have two types of observations: upper body "u" and face "f". The hidden state corresponds to the position and velocity of the upper body. The observation operator $\mathbf{P}^u$ (see Section 3.2.1) for the upper-body observations simply removes the velocity components of the hidden state. The observation operator $\mathbf{P}^f$ for the face observations combines a projection removing the velocity components and an affine transformation computed from the upper-body and face bounding boxes. The appearance observations are concatenations of joint hue-saturation (HS) color histograms of the torso, split into three different regions, plus the head region, as shown in Fig. 2(a).

Table 1 shows the performance of the model on the two sequences of the Cocktail Party Dataset. Four variants of the proposed model are compared, using (i) upper-body detections and color histograms (UC), (ii) face detections and color histograms (FC), (iii) upper-body detections, face detections, and color histograms (FUC), and (iv) upper-body and face detections (FU).


Table 1 Performance measures (in %) on the two sequences of the Cocktail Party Dataset.

Sequence   Method   Rc     Pr     MOTA   MOTP
CPD-3      UC       93.6   99.6   91.8   86.8
           FC       62.8   98.4   59.7   68.4
           FUC      92.6   99.7   90.1   82.9
           FU       86.6   100    84.3   82.4
CPD-2      UC       70.7   99.4   64.3   85.8
           FC       90.1   94.6   76.0   76.7
           FUC      95.2   96.2   80.0   82.9
           FU       92.5   96.3   80.1   81.9

Fig. 3 Sample tracking results on CPD-3. The green bounding boxes represent the face detections and the yellow bounding boxes represent the upper-body detections. The red bounding boxes display the tracking results.

The results in Table 1 show that, for the sequence CPD-3, the model exploiting only upper-body observations (UC) performs better than the model exploiting face detection observations (FC). The main reason is that, in this sequence, people are mostly recorded in profile views, and face detectors are known to be sub-optimal for profile faces. Despite the integration of non-optimal face localization observations, the model combining upper-body detections, face detections, and color observations (FUC) achieves a multi-object tracking accuracy (MOTA) similar to the model using only upper-body and color observations. For the second sequence (CPD-2), where people mostly exhibit near-frontal faces, the model using face detection observations (FC) has a higher MOTA measure than the model using upper-body and color information. In this case, the combination of upper body, face, and color performs better than the models using a single type of localization observation (upper body or face). Table 1 also shows that the model using color (FUC) performs better than the model which does not use color (FU). For CPD-3, the model exploiting color has a higher MOTA measure. For CPD-2, the two models achieve similar MOTA; however, the model using color has higher recall. In all cases, model FUC is always the best or the second best.

To give a qualitative flavor of the tracking performance, Figure 3 shows sample results achieved by the proposed model (FUC) on CPD-3. These images show that the model is able to correctly initialize new tracks, identify occluded people as no longer visible, and recover their identities after occlusion. Tracking results are provided as supplementary material.

Figure 4 shows the estimated target visibility probabilities (see Section 5.2) for sequence CPD-3, with sample tracking images given in Figure 3. The person visibilities show that the tracking of persons 1 and 2 starts at the beginning of the sequence, and that person 3 arrives at frame 600. Also, person 1 is occluded between frames 400 and 450 (see the fourth image in the first row, and the first image in the second row of Figure 3).


Fig. 4 Estimated visibility probabilities for tracked persons in sequence CPD-3. Every row displays the corresponding target's visibility probability for every time frame. Yellow represents a very high probability (close to 1), and blue represents a very low probability.

Table 2 Performance measures (in %) on the two sequences of the MOT Challenge Dataset.

Sequence         Method   Rc     Pr     MOTA   MOTP
TUD-Stadtmitte   B        72.2   81.7   54.8   65.4
                 BC       70.9   82.5   53.5   65.1
                 [17]     84.7   86.7   71.5   65.5
PETS09-S2L1      B        90.1   86.2   74.9   71.8
                 BC       90.2   87.6   76.7   71.8
                 [17]     92.4   98.4   90.6   80.2
                 [27]     -      -      83     69.5

6.3 Evaluation on the MOT Challenge Dataset

In order to fairly evaluate different tracking methods, the organizers of the MOT Challenge provide pedestrian detections obtained using [36]. This removes any bias due to the particular pedestrian detector. For the MOT Challenge dataset, we model a single person's kinematic state as the full-body bounding box and its velocity. In this case, the observation operator $\mathbf{P}$ simply removes the velocity information, keeping only the bounding box position and size. The appearance observations are the concatenation of the joint HS histograms of the head, torso and legs areas (see Figure 2(b)).

As we previously did with the Cocktail Party Dataset, we evaluate our model using only body localization observations (B) and using body localization and color appearance observations jointly (BC). In addition, we also compare the proposed model to two tracking models, proposed by Milan et al. in [17] and by Bae and Yoon in [27]. Importantly, the direct comparison of our model to these two state-of-the-art methods must be done with care. Indeed, we only use information from the past (filtering) while the other two methods use past as well as future information (smoothing). Therefore, we expect these two models to outperform the proposed one. However, the most prominent advantage of filtering methods over smoothing methods is that, while the latter are inherently unsuitable for on-line processing, the former are naturally appropriate for on-line tasks, since they only use causal information.

Table 2 reports the performance of these methods on the two selected sequences of the MOT Challenge. In this table, the results on TUD-Stadtmitte show similar performance for our model with and without appearance information. Therefore, color information is not very informative in this sequence. On PETS09-S2L1, our model using color achieves a better MOTA measure, precision, and recall, showing the benefit of integrating color into the model. As expected, Milan et al. and Bae and Yoon outperform the proposed model. However, the non-causal nature of their methods makes them unsuitable for on-line tracking tasks, where the observations must be processed as they are received.

Figure 5 presents sample results for the PETS09-S2L1 sequence. In addition, videos presenting the results on the MOT Challenge Dataset are provided as supplementary material. These results show temporally consistent tracks. Occasionally, person identity switches may occur when two people cross. Remarkably, because the proposed tracking model is allowed to reuse the identities of persons visible in the past, a new person entering the scene may be associated with someone who left the scene if their appearances are very similar.


Fig. 5 Tracking results on PETS09-S2L1. Green boxes represent observations and red bounding boxes represent tracking outputs associated with person identities. Green and red bounding boxes may overlap.


7 Conclusions

This paper presented an on-line variational Bayesian model for tracking a time-varying number of persons from cluttered visual observations. To the best of our knowledge, this is the first variational Bayesian model for tracking an unknown and time-varying number of targets. We proposed birth and visibility processes to handle persons entering and leaving the scene. The proposed model was evaluated on two datasets, showing competitive results with respect to state-of-the-art multi-person tracking models. Remarkably, even though in the conducted experiments we only use color information for the visual appearance, our framework is versatile enough to accommodate other visual cues such as texture, HOG descriptors or motion information (i.e., optical flow).

In future research, we will extend the visual tracker to incorporate auditory cues. We will consider jointly tracking people's kinematic states and their speaking status (active/inactive). The framework proposed in this paper will make it possible to exploit voice activity detection and sound source localization as audio observations for tracking active speakers. When using audio information, robust voice descriptors, and their blending with the tracking model, will be investigated. We will also extend the proposed formalism to moving cameras, where the internal state of the system will have to be tracked as well. This case is of particular interest in applications such as pedestrian tracking for self-driving cars or humanoid robot perception.

References

1. W. Luo, J. Xing, X. Zhang, W. Zhao, T.-K. Kim, Multiple object tracking: a review, in: arXiv:1409.761, 2015.
2. M. Andriluka, S. Roth, B. Schiele, People-tracking-by-detection and people-detection-by-tracking, in: Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2008, pp. 1–8.
3. P. J. Green, Trans-dimensional Markov chain Monte Carlo, in: Oxford Statistical Science Series, 2003, pp. 179–198.
4. Z. Khan, T. Balch, F. Dellaert, An MCMC-based particle filter for tracking multiple interacting targets, in: European Conference on Computer Vision (ECCV), Springer, 2004, pp. 279–290.
5. K. Smith, D. Gatica-Perez, J.-M. Odobez, Using particles to track varying numbers of interacting people, in: Computer Vision and Pattern Recognition (CVPR), IEEE, 2005, pp. 962–969.
6. P. J. Green, Reversible jump Markov chain Monte Carlo computation and Bayesian model determination, in: Biometrika, 1995, pp. 711–732.
7. R. P. S. Mahler, Multisource multitarget filtering: a unified approach, in: Aerospace/Defense Sensing and Controls, International Society for Optics and Photonics, 1998, pp. 296–307.
8. R. P. S. Mahler, Statistics 101 for multisensor multitarget data fusion, in: A&E Systems Magazine, IEEE, 2004, pp. 53–64.
9. R. P. S. Mahler, Statistics 102 for multisensor multitarget data fusion, in: Selected Topics on Signal Processing, IEEE, 2013, pp. 53–64.
10. R. P. S. Mahler, A theoretical foundation for the Stein-Winter "probability hypothesis density (PHD)" multitarget tracking approach, Army Research Office, Alexandria, VA, Tech. report, 2000.
11. H. Sidenbladh, Multi-target particle filtering for the probability hypothesis density, in: International Conference on Information Fusion, IEEE, 2003, pp. 800–806.
12. D. Clark, J. Bell, Convergence results for the particle PHD filter, in: Transactions on Signal Processing, IEEE, 2006, pp. 2652–2661.
13. B.-N. Vo, S. Singh, A. Doucet, Random finite sets and sequential Monte Carlo methods in multi-target tracking, in: International Radar Conference, IEEE, 2003, pp. 486–491.
14. W. K. Ma, B. N. Vo, S. S. Singh, Tracking an unknown time-varying number of speakers using TDOA measurements: a random finite set approach, in: Transactions on Signal Processing, IEEE, 2006, pp. 3291–3304.
15. E. Maggio, M. Taj, A. Cavallaro, Efficient multitarget visual tracking using random finite sets, in: Transactions on Circuits and Systems for Video Technology, IEEE, 2008, pp. 1016–1027.
16. B. Yang, R. Nevatia, An online learned CRF model for multi-target tracking, in: Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2012, pp. 2034–2041.
17. A. Milan, S. Roth, K. Schindler, Continuous energy minimization for multitarget tracking, in: Transactions on Pattern Analysis and Machine Intelligence, IEEE, 2014, pp. 58–72.
18. A. Heili, A. Lopez-Mendez, J.-M. Odobez, Exploiting long-term connectivity and visual motion in CRF-based multi-person tracking, in: Transactions on Image Processing, IEEE, 2014, pp. 3040–3056.
19. A. Yilmaz, O. Javed, M. Shah, Object tracking: a survey, in: Computing Surveys, ACM, 2006, p. 13.
20. P. Felzenszwalb, R. Girshick, D. McAllester, D. Ramanan, Object detection with discriminatively trained part based models, in: Transactions on Pattern Analysis and Machine Intelligence, IEEE, 2010, pp. 1627–1645.
21. X. Zhu, D. Ramanan, Face detection, pose estimation, and landmark localization in the wild, in: Computer Vision and Pattern Recognition (CVPR), IEEE, 2012, pp. 2879–2886.
22. D. Comaniciu, P. Meer, Mean shift: a robust approach toward feature space analysis, in: Transactions on Pattern Analysis and Machine Intelligence, IEEE, 2002, pp. 603–619.
23. S. Arulampalam, S. Maskell, N. Gordon, T. Clapp, A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking, in: Transactions on Signal Processing, IEEE, 2002, pp. 174–188.
24. S. Särkkä, A. Vehtari, J. Lampinen, Rao-Blackwellized Monte Carlo data association for multiple target tracking, in: 7th International Conference on Information Fusion, IEEE, 2004, pp. 583–590.
25. Y. Yan, A. Kostin, W. Christmas, J. Kittler, A novel data association algorithm for object tracking in clutter with application to tennis video analysis, in: Computer Vision and Pattern Recognition (CVPR), IEEE, 2006, pp. 634–641.
26. Y. Bar-Shalom, F. Daum, J. Huang, The probabilistic data association filter, in: Control Systems Magazine, IEEE, 2009, pp. 82–100.
27. S.-W. Bae, K.-J. Yoon, Robust online multi-object tracking based on tracklet confidence and online discriminative appearance learning, in: Computer Vision and Pattern Recognition (CVPR), IEEE, 2014, pp. 1218–1225.
28. M. Isard, J. MacCormick, BraMBLe: a Bayesian multiple-blob tracker, in: International Conference on Computer Vision (ICCV), IEEE, 2001, pp. 34–41.
29. K. Okuma, A. Taleghani, N. De Freitas, J. J. Little, D. G. Lowe, A boosted particle filter: multitarget detection and tracking, in: European Conference on Computer Vision (ECCV), Springer, 2004, pp. 28–39.
30. S. Särkkä, A. Vehtari, J. Lampinen, Rao-Blackwellized particle filter for multiple target tracking, in: Information Fusion, Elsevier, 2007, pp. 2–15.
31. V. Smidl, A. Quinn, The Variational Bayes Method in Signal Processing, 1st Edition, Springer, 2006.
32. C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics), 1st Edition, Springer, 2007.
33. L. Leal-Taixe, A. Milan, I. Reid, S. Roth, K. Schindler, MOTChallenge 2015: towards a benchmark for multi-target tracking, in: arXiv:1504.01942, 2015.
34. K. Smith, D. Gatica-Perez, J.-M. Odobez, S. Ba, Evaluating multi-object tracking, in: CVPR Workshop on Empirical Evaluation Methods in Computer Vision (EEMCV), IEEE, 2005, pp. 36–36.
35. R. Stiefelhagen, K. Bernardin, R. Bowers, J. S. Garofolo, D. Mostefa, P. Soundararajan, The CLEAR 2006 evaluation, in: First International Workshop on Classification of Events, Activities and Relationships (CLEAR 2006), Springer, 2005, pp. 1–44.
36. P. Dollar, R. Appel, S. Belongie, P. Perona, Fast feature pyramids for object detection, in: Transactions on Pattern Analysis and Machine Intelligence, IEEE, 2014, pp. 1532–1545.

A Derivation of the Variational Formulation

A.1 Filtering Distribution Approximation

The goal of this section is to derive an approximation of the hidden-state filtering distribution $p(\mathcal{Z}_t, \mathbf{X}_t | \mathcal{O}_{1:t}, e_{1:t})$, given the variational approximating distribution $q(\mathcal{Z}_{t-1}, \mathbf{X}_{t-1})$ at $t-1$. Using Bayes' rule, the filtering distribution can be written as

$$p(\mathcal{Z}_t, \mathbf{X}_t | \mathcal{O}_{1:t}, e_{1:t}) = \frac{p(\mathcal{O}_t | \mathcal{Z}_t, \mathbf{X}_t, e_t)\, p(\mathcal{Z}_t, \mathbf{X}_t | \mathcal{O}_{1:t-1}, e_{1:t})}{p(\mathcal{O}_t | \mathcal{O}_{1:t-1}, e_{1:t})}. \quad (29)$$

It is composed of three terms: the likelihood $p(\mathcal{O}_t | \mathcal{Z}_t, \mathbf{X}_t, e_t)$, the predictive distribution $p(\mathcal{Z}_t, \mathbf{X}_t | \mathcal{O}_{1:t-1}, e_{1:t})$, and the normalization factor $p(\mathcal{O}_t | \mathcal{O}_{1:t-1}, e_{1:t})$, which is independent of the hidden variables. The likelihood can be expanded as:

$$p(\mathcal{O}_t | \mathcal{Z}_t, \mathbf{X}_t, e_t) = \prod_{i=1}^{I} \prod_{k \le K_t^i} \prod_{n=0}^{N} p(\mathbf{o}_{tk}^i | Z_{tk}^i = n, \mathbf{X}_t, e_t)^{\delta_n(Z_{tk}^i)}, \quad (30)$$

where $\delta_n$ is the Dirac delta function, and $p(\mathbf{o}_{tk}^i | Z_{tk}^i = n, \mathbf{X}_t, e_t)$ is the individual observation likelihood defined in (5) and (6).

The predictive distribution factorizes as

$$p(\mathcal{Z}_t, \mathbf{X}_t | \mathcal{O}_{1:t-1}, e_{1:t}) = p(\mathcal{Z}_t | e_t)\, p(\mathbf{X}_t | \mathcal{O}_{1:t-1}, e_{1:t}).$$

Exploiting its multinomial nature, the assignment variable distribution $p(\mathcal{Z}_t | e_t)$ can be fully expanded as:

$$p(\mathcal{Z}_t | e_t) = \prod_{i=1}^{I} \prod_{k \le K_t^i} \prod_{n=0}^{N} p(Z_{tk}^i = n | e_t)^{\delta_n(Z_{tk}^i)}. \quad (31)$$

Using the motion state dynamics definition $p(\mathbf{x}_{tn} | \mathbf{x}_{t-1,n}, e_{tn})$ and the variational approximation of the previous-time motion state filtering distribution, $q(\mathbf{x}_{t-1,n} | e_{t-1}) = p(\mathbf{x}_{t-1,n} | \mathcal{O}_{1:t-1}, e_{1:t-1})$, defined in (19), the motion state predictive distribution $p(\mathbf{X}_t = \mathbf{x}_t | \mathcal{O}_{1:t-1}, e_{1:t})$ can be approximated by

$$p(\mathbf{X}_t = \mathbf{x}_t | \mathcal{O}_{1:t-1}, e_{1:t}) = \prod_{n=1}^{N} \int p(\mathbf{x}_{tn} | \mathbf{x}_{t-1,n}, e_{tn})\, p(\mathbf{x}_{t-1,n} | \mathcal{O}_{1:t-1}, e_{1:t-1})\, d\mathbf{x}_{t-1,n} \approx \prod_{n=1}^{N} u(\mathbf{x}_{tn})^{1-e_{tn}}\, g(\mathbf{x}_{tn}; \mathbf{D}\boldsymbol{\mu}_{t-1,n}, \mathbf{D}\boldsymbol{\Gamma}_{t-1,n}\mathbf{D}^\top + \boldsymbol{\Lambda}_n)^{e_{tn}}. \quad (32)$$


Equations (30), (31), and (32) define the numerator of the tracking filtering distribution (29). The logarithm of this filtering distribution is used by the proposed variational EM algorithm.

A.2 Derivation of the E-Z-Step

The E-Z-step corresponds to the estimation of $q(Z_{tk}^i | e_t)$ given by (13), which, from the log of the filtering distribution, can be written as:

$$\log q(Z_{tk}^i | e_t) = \sum_{n=0}^{N} \delta_n(Z_{tk}^i)\, \mathbb{E}_{q(\mathbf{X}_{tn} | e_t)}\left[ \log\left( p(\mathbf{y}_{tk}^i, \mathbf{h}_{tk}^i | Z_{tk}^i = n, \mathbf{X}_t, e_t)\, p(Z_{tk}^i = n | e_t) \right) \right] + C, \quad (33)$$

where $C$ gathers terms that are constant with respect to the variable of interest, $Z_{tk}^i$ in this case. By substituting $p(\mathbf{y}_{tk}^i, \mathbf{h}_{tk}^i | Z_{tk}^i = n, \mathbf{X}_t, e_t)$ and $p(Z_{tk}^i = n | e_t)$ with their expressions (5), (6), and (8), and by introducing the notations

$$\varepsilon_{tk0}^i = u(\mathbf{y}_{tk}^i)\, u(\mathbf{h}_{tk}^i),$$

$$\varepsilon_{tkn}^i = g(\mathbf{y}_{tk}^i; \mathbf{P}\boldsymbol{\mu}_{tn}, \boldsymbol{\Sigma}^i)\, \exp\left( -\frac{1}{2} \mathrm{Tr}\left( \mathbf{P}^\top (\boldsymbol{\Sigma}^i)^{-1} \mathbf{P} \boldsymbol{\Gamma}_{tn} \right) \right)\, b(\mathbf{h}_{tk}^i, \mathbf{h}_n),$$

and after some algebraic derivations, the distribution of interest can be written as the following multinomial distribution:

$$q(Z_{tk}^i = n | e_t) = \alpha_{tkn}^i = \frac{e_{tn} a_{tkn}^i \varepsilon_{tkn}^i}{\sum_{m=0}^{N} e_{tm} a_{tkm}^i \varepsilon_{tkm}^i}. \quad (34)$$

A.3 Derivation of the E-X-Step

The E-step for the motion state variables consists of the estimation of $q(\mathbf{X}_{tn} | e_{tn})$ using the relation $\log q(\mathbf{X}_{tn} | e_{tn}) = \mathbb{E}_{q(\mathcal{Z}_t, \mathbf{X}_t \setminus \mathbf{X}_{tn} | e_t)}[\log p(\mathcal{Z}_t, \mathbf{X}_t | \mathcal{O}_{1:t}, e_{1:t})]$, which can be expanded as

$$\log q(\mathbf{X}_{tn} | e_t) = \sum_{i=1}^{I} \sum_{k=0}^{K_t^i} \mathbb{E}_{q(Z_{tk}^i | e_t)}[\delta_n(Z_{tk}^i)]\, \log g(\mathbf{y}_{tk}^i; \mathbf{P}\mathbf{X}_{tn}, \boldsymbol{\Sigma}^i)^{e_{tn}} + \log\left( u(\mathbf{X}_{tn})^{1-e_{tn}}\, g(\mathbf{X}_{tn}; \mathbf{D}\boldsymbol{\mu}_{t-1,n}, \mathbf{D}\boldsymbol{\Gamma}_{t-1,n}\mathbf{D}^\top + \boldsymbol{\Lambda}_n)^{e_{tn}} \right) + C,$$

where, as above, $C$ gathers constant terms. After some algebraic derivation one obtains $q(\mathbf{X}_{tn} | e_{tn}) = u(\mathbf{X}_{tn})^{1-e_{tn}}\, g(\mathbf{X}_{tn}; \boldsymbol{\mu}_{tn}, \boldsymbol{\Gamma}_{tn})^{e_{tn}}$, where the mean and covariance of the Gaussian distribution are given by (20) and (21).