
Online Multi-target Tracking using Recurrent Neural Networks

Anton Milan¹, Seyed Hamid Rezatofighi¹, Anthony Dick¹, Konrad Schindler², Ian Reid¹

¹ School of Computer Science, The University of Adelaide, Australia
² Photogrammetry and Remote Sensing Group, ETH Zurich

Abstract. We present a novel approach to online multi-target tracking based on recurrent neural networks (RNNs). Tracking multiple objects in real-world scenes involves many challenges, including a) an a-priori unknown and time-varying number of targets, b) a continuous state estimation of all present targets, and c) a discrete combinatorial problem of data association. Most previous methods involve complex models that require tedious tuning of parameters. Here, we propose for the first time a full end-to-end learning approach for online multi-target tracking based on deep learning. Existing deep learning methods are not designed for the above challenges and cannot be trivially applied to the task. Our solution addresses all of the above points in a principled way. Experiments on both synthetic and real data show competitive results obtained at ≈300 Hz on a standard CPU, and pave the way towards future research in this direction.

1 Introduction

Tracking multiple targets in unconstrained environments is an extremely challenging task. Even after several decades of research [1–6], it is still far from reaching the accuracy of human labelling (cf. [7]). The task itself consists of locating all targets of interest in a video sequence and maintaining their identity over time. One of the obvious questions that arises immediately is how to model the vast variety of data present in arbitrary videos that may include different view points or camera motion, various lighting conditions or levels of occlusion, a varying number of targets, etc.

Tracking-by-detection has emerged as one of the most successful strategies to tackle this challenge. Here, all “unused” data that is available in a video sequence is discarded and reduced to just a few single measurements per frame. This can be achieved by background subtraction [8] for static cameras, or object detections [9, 10] in a more general case. The task is then to associate each measurement to a corresponding target, i.e. to address the problem of data association. Moreover, due to clutter and an unknown number of targets, the option to discard a measurement as a false alarm and a strategy to initiate new targets as well as terminate existing ones must be addressed.

arXiv:1604.03635v1 [cs.CV] 13 Apr 2016


[Figure 1: schematic over the time steps t−1, t and t+1, showing prediction, update and data association.]

Fig. 1. A schematic illustration of our architecture. We use RNNs for temporal prediction and update as well as track management. The combinatorial problem of data association is solved via LSTMs for each frame.

With the recent rise of deep learning, there has been surprisingly little work related to multi-target tracking. We presume that this is due to several reasons. First, when dealing with a huge number of parameters, deep models require huge amounts of training data, which is not yet available in the case of multi-target tracking. Second, both the data and the desired solution can be quite versatile. One is faced with both discrete and continuous variables, unknown cardinality for input and output, and variable lengths of video sequences. One interesting exception in this direction is the recent work of Ondruska and Posner [11] that introduces deep recurrent neural networks to the task of state estimation. Although this work shows promising results, it only demonstrates its efficacy on simulated data with near-perfect sensor measurements, a known number of targets, and smooth, linear motion. Its applicability to real-world data is therefore rather limited, because the difficulty of tracking an unknown number of targets in the presence of measurement failure and clutter is bypassed entirely.

With this paper, we make an important step towards fully end-to-end model learning for online tracking of multiple targets in realistic scenarios. Our main contributions are as follows:

1. Inspired by the well-studied Bayesian filtering idea, we present a recurrent neural network capable of performing all multi-target tracking tasks including prediction, data association, state update as well as initiation and termination of targets within a unified network structure (Fig. 1). One of the main advantages of this approach is that it is completely model-free, i.e. it does not require any prior knowledge about target dynamics, clutter distributions, etc. It can therefore capture linear (cf. Kalman filter), non-linear (cf. particle filter), and higher-order dependencies.

2. We further show that a model for the challenging combinatorial problem of data association including birth and death of targets can be learned entirely from data. This time-varying cardinality component demonstrates that it is possible to utilise RNNs not only to predict sequences with fixed-sized input and output vectors, but in fact to infer unordered sets with unknown cardinality.


3. Permutation of real data, sampling from generative models, as well as a physically-based system all serve to generate arbitrary amounts of synthetic data, which are used for training.

4. Qualitative and quantitative experiments on simulated and real data show very promising results, confirming the potential of this approach. We firmly believe that it will inspire other researchers to extend this idea and to further advance its performance.

2 Related Work

Multi-object tracking. A multitude of sophisticated models have been developed in the past to capture the complexity of the problem at hand. Early work includes the multiple hypothesis tracker (MHT) [1] and joint probabilistic data association (JPDA) [2]. Both were developed in the realm of radar and sonar tracking but were considered too slow for computer vision applications for a long time. With the advances in computational power, they have found their way back and have recently been re-introduced in conjunction with novel appearance models [12], or suitable approximation methods [13]. A large amount of work has focused on simplified models that could be solved to (near) global optimality [3, 4, 14–19]. Here, the problem is cast as a linear program and solved via relaxation [3, 18], shortest-path [14, 16] or max-flow algorithms [4]. Conversely, more complex cost functions have been considered in [20–24], but without any theoretical bounds on optimality. The optimization techniques range from quadratic boolean programming [20], over customized alpha-expansion [22] to greedy constraint propagation [24].

Deep learning. Early ideas of biologically inspired learning systems date back many decades [25]. Later, convolutional neural networks (also known as CNNs) and the back propagation algorithm were developed and mainly applied to hand-written digit recognition [26]. However, despite their effectiveness on certain tasks, they could hardly compete with other well-established approaches. This was mainly due to their major limitation of requiring huge amounts of training data in order not to overfit the high number of parameters. With faster multi-processor hardware and with a sudden increase in labelled data, CNNs have become increasingly popular, initiated by a recent breakthrough on the task of image classification [27]. CNNs achieve state-of-the-art results in many applications [28–30] but are restrictive in their output format. Conversely, recurrent neural networks (RNNs) [31] include a loop between the input and the output. This not only makes it possible to simulate a memory effect, but also allows for mapping input sequences to arbitrary output sequences, as long as the sequence alignment and the input and output dimensions are known in advance.

Our work is inspired by the recent success of recurrent neural nets (RNNs) and their application to language modeling [32–34]. However, it is not straightforward to apply the same strategies to the problem of multi-target tracking for numerous reasons. First, the state space is multi-dimensional. Instead of predicting one character or one word, at each time step the state of all targets has to be considered at once. Second, the state consists of both continuous and discrete variables. The former represents the actual location (and possibly further properties such as velocities) of targets, while a discrete representation is required to resolve data association. Further discrete indicator variables may also be used to infer certain target states like the track state, the occlusion level, etc. Third, the desired number of outputs (e.g. targets) varies over time. In this paper, we introduce a method for addressing all these issues and demonstrate how RNNs can be used for end-to-end learning of multi-target tracking systems.

3 Background

3.1 Recurrent Neural Networks

Recurrent neural nets (RNNs) are a particular type of deep architecture that has been applied to numerous tasks including machine translation [35], handwriting recognition and synthesis [36], speech recognition [37], image caption generation [32, 33], object detection [38] and many more. Their distinct power lies in their ability to capture sequential dependencies, providing a mechanism that can be interpreted as a memory capability.

Broadly speaking, RNNs work in a sequential manner, where a prediction is made at each time step, given the previous state and possibly an additional input. The core of an RNN is its hidden state $h \in \mathbb{R}^n$ of size $n$ that acts as the main control mechanism for predicting the output, one step at a time. In general, RNNs may have multiple layers $l = 1, \ldots, L$. We will denote $h^l_t$ as the hidden state at time $t$ on layer $l$. $h^0$ can be thought of as the input layer, holding the input vector, while $h^L$ holds the final embedded representation used to produce the desired output $y_t$. The hidden state for a particular layer $l$ and time $t$ is computed as

$$h^l_t = \tanh W^l \left(h^{l-1}_t, h^l_{t-1}\right)^\top, \qquad (1)$$

where $W$ is a matrix of learnable parameters. Note that $W^l \in \mathbb{R}^{n \times 2n}$ varies on each layer of the network, but does not depend on $t$.
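As a minimal sketch of Eq. (1), the snippet below applies one single-layer RNN step repeatedly using only the Python standard library. The function name `rnn_step`, the random initialisation range and the toy one-hot inputs are illustrative assumptions, not part of the described implementation.

```python
import math
import random


def rnn_step(W, h_below, h_prev):
    """One step of Eq. (1): h = tanh(W [h_below; h_prev])."""
    v = h_below + h_prev  # concatenation of the two vectors, length 2n
    return [math.tanh(sum(w * x for w, x in zip(row, v))) for row in W]


n = 3
rng = random.Random(0)
# W has shape n x 2n; the uniform range is an arbitrary illustrative choice.
W = [[rng.uniform(-0.5, 0.5) for _ in range(2 * n)] for _ in range(n)]

h = [0.0] * n  # initial hidden state
inputs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
for x in inputs:
    # The same W is reused at every time step, as stated after Eq. (1).
    h = rnn_step(W, x, h)
print(h)
```

Because of the tanh non-linearity, every component of the hidden state stays strictly inside (-1, 1) regardless of the input scale.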

RNNs of this form have several drawbacks. First, the simple coupling of the hidden state in Eq. (1) to its previous neighbours in space and time is not strong enough to capture long-term dependencies. Second, one is faced with the problem of so-called vanishing gradients, where individual activations become saturated such that the gradient does not propagate sufficiently through time. To remedy this, two types of extensions have been proposed: the long short-term memory (LSTM) [39] and the gated recurrent unit (GRU) [40]. In this work, we only consider the former for our purpose. Next to the hidden state, the LSTM unit also keeps an embedded representation of the state $c$ that acts as a memory. A gated mechanism controls how much of the previous state should be “forgotten” or replaced by the new input (see Fig. 2, right, for an illustration). More formally, the hidden representations are computed as

$$h^l_t = o \odot \tanh\left(c^l_t\right) \qquad (2)$$

and

$$c^l_t = f \odot c^l_{t-1} + i \odot g, \qquad (3)$$

respectively, where $\odot$ represents element-wise multiplication. The input, output and forget gates are all vectors of size $n$ and model the memory update in a binary fashion using a sigmoid function:

$$i, o, f = \sigma\left[W^l \left(h^{l-1}_t, h^l_{t-1}\right)^\top\right], \qquad (4)$$

with a separate weight matrix $W^l$ for each gate.

$$g = \tanh\left[W^l \left(h^{l-1}_t, h^l_{t-1}\right)^\top\right] \qquad (5)$$

is the new candidate to replace the current memory, similar to the original RNN formulation from Eq. (1). The complete weight matrix for each layer of an LSTM is $W^l \in \mathbb{R}^{4n \times 2n}$.
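The LSTM update of Eqs. (2)–(5) can be sketched in a few lines of plain Python. The order in which the four gate blocks are stacked inside the 4n × 2n matrix is an assumption made here for illustration, and bias terms are omitted because the equations above do not include them.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def lstm_step(W, h_below, h_prev, c_prev):
    """One LSTM step following Eqs. (2)-(5); W has shape 4n x 2n."""
    n = len(h_prev)
    a = matvec(W, h_below + h_prev)            # pre-activations, length 4n
    i = [sigmoid(v) for v in a[0:n]]           # input gate, Eq. (4)
    o = [sigmoid(v) for v in a[n:2 * n]]       # output gate, Eq. (4)
    f = [sigmoid(v) for v in a[2 * n:3 * n]]   # forget gate, Eq. (4)
    g = [math.tanh(v) for v in a[3 * n:4 * n]] # candidate memory, Eq. (5)
    # Eq. (3): keep part of the old memory and add part of the candidate.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    # Eq. (2): expose a gated, squashed view of the memory.
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


# With an all-zero W every gate equals 0.5 and the candidate equals 0,
# so the memory is simply halved.
h_demo, c_demo = lstm_step([[0.0, 0.0]] * 4, [1.0], [0.0], [2.0])
print(h_demo, c_demo)
```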

3.2 Bayesian Filtering

In this section, we will briefly review the classical Bayesian filtering approach for target tracking. Here, the goal is to estimate the true state $x$ from noisy measurements $z$. Under the Markov assumption, the state distribution at time $t$ given all past measurements is estimated recursively as

$$p(x_t \mid z_{1:t}) \propto p(z_t \mid x_t) \int p(x_t \mid x_{t-1})\, p(x_{t-1} \mid z_{1:t-1})\, dx_{t-1}, \qquad (6)$$

where $p(z_t \mid x_t)$ is the last observation likelihood and $p(x_t \mid x_{t-1})$ the state transition probability. Typically, Eq. (6) is evaluated in two steps: a prediction step that evaluates the state dynamics, and an update step that corrects the belief about the state based on the current measurements. Two of the most widely used techniques for solving the above equation are the Kalman filter [41] and the particle filter [42]. The former performs exact state estimation under linear and Gaussian assumptions for the state and measurement models, while the latter approximates arbitrary distributions using sequential importance sampling.
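To make the two-step evaluation of Eq. (6) concrete, here is a minimal scalar Kalman filter, assuming a random-walk state model with process noise variance `q` and measurement noise variance `r`. All numeric values and function names are illustrative and not taken from the experiments.

```python
def kalman_predict(x, P, q):
    """Prediction step: the state stays put, the uncertainty grows by q."""
    return x, P + q


def kalman_update(x_pred, P_pred, z, r):
    """Update step: blend the prediction with measurement z."""
    K = P_pred / (P_pred + r)      # Kalman gain
    x = x_pred + K * (z - x_pred)  # corrected state estimate
    P = (1.0 - K) * P_pred         # reduced uncertainty
    return x, P


x, P = 0.0, 1.0  # vague initial belief
for z in [1.1, 0.9, 1.05, 0.95]:  # noisy measurements of a value near 1
    x, P = kalman_predict(x, P, q=0.01)
    x, P = kalman_update(x, P, z, r=0.1)
print(x, P)
```

After a few measurements the estimate settles near the measured value and the variance shrinks well below its initial value, which is the behaviour the prediction/update RNN described later is meant to learn from data rather than from a fixed model.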

When dealing with multiple targets, one is faced with two additional challenges. 1) Before the state update can be performed, it is crucial to determine which measurements are associated with which targets. A number of algorithms have been proposed to address this problem of data association including simple nearest neighbours or similar greedy techniques, and sophisticated probabilistic approaches like JPDA (see [43] for an overview). 2) To allow for a time-varying number of targets, it is necessary to provide a mechanism to spawn new targets that enter the scene, and remove existing ones that disappear indefinitely. Like data association, this task is non-trivial, since each unassigned measurement can potentially be either the start of a new trajectory or a false alarm. Conversely, a missing measurement for a certain target could mean that the target has disappeared, or that the detector has failed. To address this challenge, online tracking approaches typically base their decisions about births and deaths of tracks on heuristics that consider the number of consecutive measurement errors.


4 Our Approach

We will now describe our approach to cast the classical Bayesian state estimation, data association as well as track initiation and termination tasks as a recurrent neural net, allowing for full end-to-end learning of the model.

4.1 Preliminaries and Notation

We begin by defining $x_t \in \mathbb{R}^{N \cdot D}$ as the vector containing the states for all targets at one time instance. In our setting, the targets are represented by their bounding box coordinates $(x, y, w, h)$, such that $D = 4$. Note that it is conceptually straightforward to extend the state to an arbitrary dimension, e.g. to incorporate velocity, acceleration or an appearance model. $N$ is the maximum number of targets that can be represented (or tracked) simultaneously in one particular frame and $x^i_t$ refers to the state of the $i$th target. $N$ is what we call the network's order and captures the spatial dependencies between targets. In its most general form with $N = 1$, all targets are assumed to move independently. Similar to the state vector above, $z_t \in \mathbb{R}^{M \cdot D}$ is the vector of all measurements in one frame, where $M$ is the maximum number of detections per frame. It is important to point out that there is no limit on the overall number of targets that our model can handle.

The assignment probability matrix $A \in [0, 1]^{N \times (M+1)}$ represents for each target (row) the distribution of assigning individual measurements to that target, i.e. $A_{ij} = p(i \text{ assigned to } j)$ and $\forall i: \sum_j A_{ij} = 1$. Note that an extra column in $A$ is needed to incorporate the case that a measurement is missing. Finally, $\mathcal{E} \in [0, 1]^N$ is an indicator vector that represents the existence probability of a target and is necessary to deal with an unknown and time-varying number of targets. We will use a tilde ($\sim$) to explicitly denote the ground truth variables.

4.2 Multi-target Tracking with RNNs

As motivated above, we decompose the problem at hand into two major blocks: state prediction and update as well as track management on one side, and data association on the other. This strategy has several advantages. First, one can isolate and debug individual components effectively. Second, the framework becomes modular, making it easy to replace each module or to add new ones. Third, it enables one to (pre)train every block separately, which not only significantly speeds up the learning process but can in fact be necessary for joint training. We will now describe both building blocks in detail.

4.3 Target Motion

[Figure 2: left panel with inputs h_t, x_t, z_{t+1}, A_{t+1}, E_t and outputs E_{t+1}, E*_{t+1}, x_{t+1}, x*_{t+1}, h_{t+1}; right panel with inputs h_i, C_{t+1}, c_i and outputs h_{i+1}, A^i_{t+1}, c_{i+1}. Legend: learnable parameters, element-wise operations, dot product, softmax, sigmoid.]

Fig. 2. Left: An RNN-based architecture for state prediction, state update, and target existence probability estimation. Right: An LSTM-based model for data association.

Let us first turn to state prediction and update. We rely on a temporal RNN depicted in Fig. 2 (left) to learn the temporal dynamic model of $N$ targets jointly as well as an indicator to determine births and deaths of targets (see next section). At time $t$, the RNN outputs four values¹ for the next time step: a vector $x^*_{t+1} \in \mathbb{R}^{N \cdot D}$ of predicted states for all targets, a vector $x_{t+1} \in \mathbb{R}^{N \cdot D}$ of all updated states, a vector $\mathcal{E}_{t+1} \in (0, 1)^N$ of probabilities indicating for each target how likely it is a real trajectory, and $\mathcal{E}^*_{t+1}$, which is the absolute difference to $\mathcal{E}_t$. This decision is computed based on the current state $x_t$ and existence probabilities $\mathcal{E}_t$ as well as the measurements $z_{t+1}$ and data association $A_{t+1}$ in the following frame. This building block has three primary objectives:

1. Prediction: Learn a complex dynamic model for a pre-defined number of targets.

2. Update: Learn to correct the state distribution, given target-to-measurement assignments.

3. Birth / death: Learn to identify track initiation and termination based on the state, the measurements and the data association.

The prediction $x^*_{t+1}$ for the next frame depends solely on the current state $x_t$ and the network's hidden state $h_t$. Once the data association $A_{t+1}$ for the following frame is available (see Sec. 4.5), the state is updated according to assignment probabilities. To that end, all measurements and the predicted state are concatenated to form $x = [z_{t+1}; x^*_{t+1}]$ weighted by the assignment probabilities $A_{t+1}$. At the same time, the track existence probability $\mathcal{E}_{t+1}$ is computed.

Loss. A loss or objective is required by any machine learning algorithm to compute the goodness-of-fit of the model, i.e. how close the prediction corresponds to the true solution. It is a continuous function, typically chosen such that minimising the loss maximises the performance of the given task. In our case, we are therefore interested in a loss that correlates with the tracking performance. This poses at least two challenges. First, measuring the performance of multi-target tracking is far from trivial [22] and moreover highly dependent on the particular application. For example, in vehicle assistance systems it is absolutely crucial to maintain the highest precision and recall to avoid accidents and to maintain robustness to false positives. On the other hand, in sports analysis it becomes more important to avoid ID switches between different players. One of the most widely accepted metrics is the multi-object tracking accuracy (MOTA) that combines these three error types and gives a reasonable assessment of the overall performance. Ideally, one would train an algorithm directly on the desired performance measure. This, however, poses a second challenge. The MOTA computation involves a complex algorithm with non-differentiable, zero-gradient components that cannot easily be incorporated into an analytical loss function. Hence, we propose the following loss that satisfies our needs:

¹ We omit the RNN's hidden state at this point for clarity.

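$$\mathcal{L}(x^*, x, \mathcal{E}, \tilde{x}, \tilde{\mathcal{E}}) = \underbrace{\frac{\lambda}{ND} \sum \|x^* - \tilde{x}\|^2}_{\text{prediction}} + \underbrace{\frac{\kappa}{ND} \|x - \tilde{x}\|^2}_{\text{update}} + \underbrace{\nu \mathcal{L}_{\mathcal{E}} + \xi \mathcal{E}^*}_{\text{birth/death + reg.}}, \qquad (7)$$

where $x^*$, $x$, and $\mathcal{E}$ are the predicted values, and $\tilde{x}$ and $\tilde{\mathcal{E}}$ are the true values, respectively. Note that we omit the time index here for better readability. In practice the loss for one training sample is averaged over all frames in the sequence.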

The loss consists of four components. Let us first concentrate on the first two, assuming for now that the number of targets is fixed. Intuitively, we aim to learn a network that predicts trajectories that are close to the ground truth tracks. This should hold for both predicting the target's motion in the absence of any measurements, as well as correcting the track in light of new measurements. To that end, we minimise the mean squared error (MSE) between state predictions and state update and the ground truth.

4.4 Initiation and Termination

Tracking multiple targets in real-world situations is challenged by the fact that targets can appear and disappear in the area of interest. This aspect must not be ignored. We propose to capture the time-varying number of targets by an additional variable $\mathcal{E} \in (0, 1)^N$ that mimics the probability that a target exists ($\mathcal{E} = 1$) or not ($\mathcal{E} = 0$) at one particular time instance.

Loss. The last two terms of the loss in Eq. (7) guide the learning to predict the existence of each target at any given time. This is necessary to allow for target initiation and termination. Here, we employ the widely used binary cross entropy (BCE) loss

$$\mathcal{L}_{\mathcal{E}}(\mathcal{E}, \tilde{\mathcal{E}}) = \tilde{\mathcal{E}} \log \mathcal{E} + (1 - \tilde{\mathcal{E}}) \log(1 - \mathcal{E}) \qquad (8)$$

that approximates the probability of the existence for each target. Note that the true values $\tilde{\mathcal{E}}$ here correspond to a box function over time (cf. Fig. 3, left). When using the BCE loss alone, the RNN learns to make rather hard decisions, which results in track termination at each frame when a measurement is missing. To remedy this, we propose to add a smoothness prior $\mathcal{E}^*$ that essentially minimises the absolute difference between two consecutive values for $\mathcal{E}$.

[Figure 3: three plots of the existence probability E (from 0 to 1) over time, between the start time t_s and the end time t_e. Panels: Ground Truth, No smoothness, With smoothness.]

Fig. 3. The effect of the pairwise smoothness prior on the existence probability. See Sec. 4.4 for details.
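The four terms of Eq. (7) can be put together in a short sketch for a single frame. Several choices below are assumptions made for illustration only: the weights default to 1, the cross entropy of Eq. (8) is used with the conventional leading minus sign and averaged over the N targets so that a lower value is better, and a small `eps` guards the logarithms.

```python
import math


def tracking_loss(x_pred, x_upd, x_true, e, e_true, e_prev,
                  lam=1.0, kappa=1.0, nu=1.0, xi=1.0, D=4):
    """Single-frame sketch of Eq. (7); state vectors have length N * D."""
    N = len(e)
    nd = N * D
    # Prediction and update terms: scaled squared errors to the ground truth.
    pred = lam / nd * sum((a - b) ** 2 for a, b in zip(x_pred, x_true))
    upd = kappa / nd * sum((a - b) ** 2 for a, b in zip(x_upd, x_true))
    # Existence term: negated, averaged binary cross entropy (assumption).
    eps = 1e-12
    bce = -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
               for p, t in zip(e, e_true)) / N
    # Smoothness prior: absolute change of the existence probability.
    smooth = sum(abs(p - q) for p, q in zip(e, e_prev)) / N
    return pred + upd + nu * bce + xi * smooth


truth = [0.1, 0.2, 0.05, 0.1, -0.3, 0.0, 0.05, 0.1]  # N = 2 boxes, D = 4
perfect = tracking_loss(truth, truth, truth, [1.0, 0.0], [1.0, 0.0], [1.0, 0.0])
noisy = tracking_loss([v + 0.1 for v in truth], truth, truth,
                      [0.6, 0.4], [1.0, 0.0], [1.0, 0.0])
print(perfect, noisy)
```

A perfect prediction yields a loss of essentially zero, while a shifted prediction with uncertain, changing existence probabilities is penalised by every term.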

4.5 Data Association

Arguably, the data association, i.e. the task to uniquely classify the corresponding measurement for each target, is the most challenging component of tracking multiple targets. Greedy solutions are efficient, but do not yield good results in general, especially in crowded scenes with clutter and occlusions. Approaches like JPDA are at the other end of the spectrum. They consider all possible assignment hypotheses jointly, which results in an NP-hard combinatorial problem. Hence, in practice, efficient approximations must be used.

In this section, we describe an LSTM-based architecture that is able to learn to solve this task entirely from training data. This is somewhat surprising for multiple reasons. First, joint data association is in general a highly complex, discrete combinatorial problem. Second, most solutions in the output space are merely permutations of each other w.r.t. the input features. Finally, any possible assignment should meet the one-to-one constraint to prevent the same measurement from being assigned to multiple targets. We believe that the LSTM's non-linear transformations and its strong memory component are the main driving force that allows for all these challenges to be learned effectively.

To support this claim, we demonstrate the capability of LSTM-based data association on the example of replicating the linear assignment problem. Our model is illustrated in Figures 1 and 2 (right). The main idea is to exploit the LSTM's temporal step-by-step functionality to predict the assignment for each target one target at a time. The input at each step, next to the hidden state $h$ and the cell state $c$, is the entire feature vector. For our purpose, we use the pairwise-distance matrix $C \in \mathbb{R}^{N \times M}$, where $C_{ij} = \|x^i - z^j\|_2$ is the Euclidean distance between the predicted state of target $i$ and measurement $j$. Note that it is straightforward to extend the feature vector to incorporate appearance or any other similarity information. The output that we are interested in is then a vector of probabilities $A^i$ for one target and all available measurements, obtained by applying a softmax layer with normalisation to the predicted values. Here, $A^i$ denotes the $i$th row of $A$.

Loss. To measure the misassignment cost, we employ the widely used negative log-likelihood loss

$$\mathcal{L}(A^i, \tilde{a}) = -\log(A_{i\tilde{a}}), \qquad (9)$$

where $\tilde{a}$ is the correct assignment and $A_{ij}$ is the target $i$ to measurement $j$ assignment probability, as described in Sec. 4.1.
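The inputs and the loss of the data association block can be illustrated without any learning. The sketch below builds the pairwise-distance matrix C, turns each row into a probability vector with a softmax, and evaluates Eq. (9). The softmax over negated distances and the constant `MISS_COST` for the extra "measurement missing" column are hand-crafted stand-ins for the learned LSTM output and are assumptions of this example.

```python
import math


def pairwise_distances(targets, measurements):
    """C[i][j] is the Euclidean distance between target i and measurement j."""
    return [[math.dist(x, z) for z in measurements] for x in targets]


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def nll(A_i, a_true):
    """Negative log-likelihood of the correct assignment, Eq. (9)."""
    return -math.log(A_i[a_true])


targets = [(0.0, 0.0), (10.0, 0.0)]        # predicted target positions
measurements = [(9.5, 0.2), (0.3, -0.1)]   # detections in arbitrary order
C = pairwise_distances(targets, measurements)

MISS_COST = 5.0  # hypothetical score for the "measurement missing" column
A = [softmax([-c for c in row] + [-MISS_COST]) for row in C]
best = [row.index(max(row)) for row in A]
print(best, nll(A[0], 1))
```

Each row of `A` has M + 1 entries that sum to one, matching the definition of the assignment probability matrix in Sec. 4.1.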

4.6 Training Data

It is well known that deep architectures require vast amounts of training data to avoid overfitting the model. Huge labelled datasets like ImageNet [44] or Microsoft COCO [45] have enabled deep learning methods to unfold their potential on tasks like image classification or pixel labelling. Unfortunately, mainly due to the very tedious and time-consuming task of video annotation, only a very limited amount of labelled data for pedestrian tracking is publicly available today. We therefore resort to synthetic data augmentation, which we use alongside real data. To that end, we propose three different methods that enable us to generate unlimited amounts of unique training samples: 1) by randomly perturbing real data; 2) by sampling from a simple generative trajectory model learned from real data; and 3) by generating physically motivated 3D-world projections.

Perturbation of real data. As the first concept, we adapt previously known techniques (cf. e.g. [9]) for data augmentation. Here, the data is altered in the following ways:

– Mirroring. Annotated trajectories are flipped horizontally and in time with probability 0.5. This corresponds to mirroring the image and running the video backwards, respectively. Note that we refrain from vertical flipping as it may lead to unrealistic data.

– Translation. The data is translated randomly by [0−20]% of the image size in both x and y dimensions.

– Rotation. The trajectories are rotated randomly by [−20, 20] degrees around the image centre. Both translation and rotation simulate small pan and roll movements of the sensor.

Any trajectory parts that appear outside the image after the transformation is applied are discarded.
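The perturbations listed above can be sketched as follows. For brevity a trajectory is represented here as a list of (x, y) box centres in pixel coordinates, and the translation is drawn from 0 to 20% of the image size in the positive direction; both are simplifying assumptions, as are all function names.

```python
import math
import random


def mirror(traj, width):
    """Flip horizontally, i.e. mirror the image."""
    return [(width - x, y) for x, y in traj]


def reverse_time(traj):
    """Run the video backwards."""
    return traj[::-1]


def translate(traj, dx, dy):
    return [(x + dx, y + dy) for x, y in traj]


def rotate(traj, degrees, width, height):
    """Rotate around the image centre."""
    cx, cy = width / 2.0, height / 2.0
    a = math.radians(degrees)
    ca, sa = math.cos(a), math.sin(a)
    return [(cx + ca * (x - cx) - sa * (y - cy),
             cy + sa * (x - cx) + ca * (y - cy)) for x, y in traj]


def clip(traj, width, height):
    """Discard trajectory parts that fall outside the image."""
    return [(x, y) for x, y in traj if 0 <= x <= width and 0 <= y <= height]


def augment(traj, width, height, rng):
    if rng.random() < 0.5:
        traj = mirror(traj, width)
    if rng.random() < 0.5:
        traj = reverse_time(traj)
    traj = translate(traj, rng.uniform(0, 0.2) * width,
                     rng.uniform(0, 0.2) * height)
    traj = rotate(traj, rng.uniform(-20, 20), width, height)
    return clip(traj, width, height)


track = [(100.0 + 5 * k, 200.0) for k in range(20)]
sample = augment(track, 640, 480, random.Random(1))
print(len(sample))
```

Calling `augment` repeatedly with different random seeds produces a new variant of the same annotated trajectory each time.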

A generative model. The second way to obtain training data is to sample from a generative model. To that end, we first learn a trajectory model from each training sequence. For simplicity, we only estimate the mean and the variance of two features: the start location $x_1$ and the average velocity $v$ from all annotated trajectories in that sequence. For each training sample we then generate up to $N$ tracks by sampling from a normal distribution with the learned parameters.
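A minimal sketch of such a generative model is given below. It assumes 2D point trajectories, treats the x and y dimensions as independent normal distributions, and extrapolates each sampled track with constant velocity; these simplifications and the function names `fit` and `sample_track` are illustrative assumptions.

```python
import random
import statistics


def fit(trajs):
    """Estimate mean and standard deviation of start location and velocity."""
    starts = [t[0] for t in trajs]
    vels = [((t[-1][0] - t[0][0]) / (len(t) - 1),
             (t[-1][1] - t[0][1]) / (len(t) - 1)) for t in trajs]

    def stats(vals):
        # One (mean, std) pair per coordinate dimension.
        return [(statistics.mean(d), statistics.pstdev(d)) for d in zip(*vals)]

    return stats(starts), stats(vels)


def sample_track(model, length, rng):
    """Draw a start location and a velocity, then roll the track forward."""
    start_stats, vel_stats = model
    x0 = [rng.gauss(m, s) for m, s in start_stats]
    v = [rng.gauss(m, s) for m, s in vel_stats]
    return [(x0[0] + v[0] * k, x0[1] + v[1] * k) for k in range(length)]


annotated = [
    [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0)],
    [(1.0, 0.5), (2.5, 2.0), (4.0, 3.5)],
]
model = fit(annotated)
rng = random.Random(0)
tracks = [sample_track(model, 20, rng) for _ in range(5)]
print(tracks[0][:2])
```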


[Figure 4: three panels: Top view, 3D World, Image view.]

Fig. 4. An overview of our physically-based framework for synthetic data generation. Tracks are generated based on physical priors in a 3D scene and then projected to a synthetic view, where a realistic set of detections (dashed boxes) is obtained.

Physically-based trajectory generation. In the third method, the data is generated by simulating real-world motion and cameras in the following way. All tracks are first generated on a virtual ground plane, as seen from the bird's eye view (cf. Fig. 4). Human motion patterns typically do not involve complex manoeuvring so that reasonable motion can be synthesised by combining simple parametric dynamic models such as constant velocity, constant turn and Brownian motion. For this experiment, we generate the motion of each person recursively over time using the constant velocity dynamics with acceleration noise model. The height of each person is randomly sampled from a normal distribution with µ = 170 cm and σ = 20 cm.

To generate realistic sequences, we use a perspective camera model with random radial location and rotational angles with respect to polar coordinates in the 3D world. The camera parameters and rotation and translation matrices are estimated such that the camera is facing toward the scene and the centres of the scene and the image plane are coincident whilst covering the entire tracking area. The trajectories of the targets leaving the scene are terminated and we allow new targets to be added into the scene during the sequence.

For each target, we generate a noisy detection according to a random probability of detection. To simulate the performance of a detector, locations and heights of targets are perturbed with multivariate Gaussian noise whose covariance is adjusted to be proportional to the target's height in the image. In addition, a target is deemed missed if more than fifty percent of its bounding box is occluded by other targets that are closer to the camera. Clutter must also be added to the detection list. To this end, a number of false detections per frame are generated according to a Poisson distribution with a predefined mean. Then their locations and heights are sampled from a uniform distribution in the image domain.
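The detection simulation for one frame can be sketched as below. The sketch makes several simplifying assumptions that are not from the text: noise is drawn independently per coordinate instead of from a full multivariate Gaussian, the occlusion test is left out, clutter boxes use a fixed 1:2 aspect ratio, and all default parameter values are invented for illustration.

```python
import math
import random


def poisson(rng, mean):
    """Sample a Poisson-distributed count (Knuth's method, small means)."""
    limit, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def simulate_detections(boxes, rng, p_detect=0.9, noise=0.05,
                        clutter_mean=2.0, width=640, height=480):
    """Turn ground truth boxes (x, y, w, h) into a noisy detection list."""
    dets = []
    for (x, y, w, h) in boxes:
        if rng.random() > p_detect:
            continue  # missed detection
        sigma = noise * h  # noise grows with the target's height in the image
        dets.append((x + rng.gauss(0, sigma), y + rng.gauss(0, sigma),
                     w, max(1.0, h + rng.gauss(0, sigma))))
    # Clutter: a Poisson number of false alarms, uniform over the image.
    for _ in range(poisson(rng, clutter_mean)):
        h = rng.uniform(20, height / 2)
        dets.append((rng.uniform(0, width), rng.uniform(0, height), h / 2, h))
    return dets


rng = random.Random(0)
frame = simulate_detections([(100, 200, 40, 80), (300, 220, 50, 100)], rng)
print(len(frame))
```

Setting `p_detect=1.0`, `noise=0.0` and `clutter_mean=0.0` returns the ground truth boxes unchanged, which is a convenient sanity check.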


[Figure 5: three plots of MOTA [%] (axis range 7.5 to 9) against RNN size (0 to 600), learning rate (×10⁻⁵, 0 to 6) and λ (prediction loss, 0 to 2).]

Fig. 5. Influence of three exemplar hyper-parameters on the overall performance on the MOTChallenge benchmark, measured by MOTA. The optimal parameter is marked with a red circle. Note that this graph shows the performance of our prediction/update RNN block for only one target (N = 1), which explains the relatively low MOTA.

5 Implementation Details

We implemented our framework in Lua and Torch7. Both our entire code base as well as pre-trained models are publicly available.²

Finding correct hyper-parameters for deep architectures still remains a non-trivial task [46]. We follow some of the best practices found in the literature [34, 46], such as setting the initial weights for the forget gates higher (1 in our case), and also employ a standard grid search to find a best setting for the present task (see Fig. 5 for three examples). In this section we will point out some of the most important parameters and implementation details. For a more in-depth coverage, we refer the reader to the supplemental material and to the source code.

Network size. The RNN for state estimation and track management is trained with one layer and 300 hidden units. The data association is a more complex task, requiring more representation power. To that end, the LSTM module employed to learn the data association consists of two layers and 500 hidden units.

Optimisation. We use RMSprop [47] to minimise the loss. The learning rate is set initially to 0.0003 and is decreased by 5% every 20,000 iterations. We set the maximum number of iterations to 200,000, which is enough to reach convergence. The training of both modules takes approximately 30 hours on a CPU. With a more accurate implementation and the use of GPUs we believe that training can be sped up significantly.

Data. The RNN is trained with about 100,000 20-frame long sequences. The data is divided into mini-batches of 10 samples per batch and normalised to the range [−0.5, 0.5], w.r.t. the image dimensions. We experimented with the more popular zero-mean and unit-variance data normalisation but found that the fixed one based on the image size yields superior performance.
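The learning rate schedule and the data normalisation described above can be written down in a few lines. Two interpretations are assumed here: the 5% decrease is applied multiplicatively to the current rate, and a pixel coordinate is mapped to [−0.5, 0.5] by dividing by the image size and subtracting 0.5.

```python
def learning_rate(iteration, base=0.0003, decay=0.05, every=20000):
    """Step-wise decay: multiply by (1 - decay) after each block of iterations."""
    return base * (1.0 - decay) ** (iteration // every)


def normalise(value, size):
    """Map a pixel coordinate in [0, size] to the range [-0.5, 0.5]."""
    return value / size - 0.5


for it in (0, 20000, 100000, 200000):
    print(it, learning_rate(it))
print(normalise(320, 640), normalise(0, 480), normalise(480, 480))
```

Under the multiplicative reading, the rate after the full 200,000 iterations is the initial rate times 0.95 to the power of 10, i.e. roughly 60% of its starting value.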

6 Experiments

To demonstrate the functionality of our approach, we first perform experiments on simulated data. Fig. 6 shows an example of the tracking results on synthetic data. Here, five targets with random birth and death times are generated in a rather cluttered environment. The initiation / termination indicators are illustrated in the bottom row.

Fig. 6. Results of our tracking method on three 20-frame long synthetic sequences in cluttered environments. Top: Ground truth (x-coordinate vs. frame number). Middle: Our reconstructed trajectories. Bottom: The existence probability E for each target. Note the delayed initiation and termination, e.g. for the top-most track (beige) in the second example. This is an inherent limitation of any purely online approach that cannot be avoided.

² https://bitbucket.org/amilan/rnntracking

We further test our approach on real-world data, using the MOTChallenge benchmark [7]. This pedestrian tracking dataset is a collection of 22 video sequences (11/11 for training and testing, respectively), with a relatively high variation in target motion, camera motion, viewing angle and person density. The evaluation is performed on a server using unpublished ground truth.

Baseline comparison. We first compare the proposed end-to-end learning approach to three baselines. The results on the training set are reported in Tab. 1. The first baseline (Kalman-HA) employs a combination of a Kalman filter with bipartite matching solved via the Hungarian algorithm. Tracks are initiated at each unassigned measurement and terminated as soon as a measurement is missed. This baseline is the only one that fully fulfils the online state estimation without any heuristics, time delay or post-processing. The second baseline (Kalman-HA2) uses the same tracking and data association approach, but employs a set of heuristics to remove false tracks in an additional post-processing step. Finally, JPDAm is the full joint probabilistic data association approach, recently proposed in [13], including post-processing. We show the results of two variants of our method: one with a learned motion model and Hungarian data association, and one with full end-to-end learning for all components, using RNNs and LSTMs. Our learned model performs favourably compared to the purely online solution (Kalman-HA) and is even able to keep up with similar approaches but without any heuristic or delay in the output.

Table 1. Tracking results on the MOTChallenge training dataset. *Denotes offline post-processing.

| Method | Rcll | Prcn | MT | ML | FP | FN | IDs | FM | MOTA | MOTP |
|---|---|---|---|---|---|---|---|---|---|---|
| Kalman-HA | 28.5 | 79.0 | 32 | 334 | 3,031 | 28,520 | 685 | 837 | 19.2 | 69.9 |
| Kalman-HA2* | 28.3 | 83.4 | 39 | 354 | 2,245 | 28,626 | 105 | 342 | 22.4 | 69.4 |
| JPDAm* | 30.6 | 81.7 | 38 | 348 | 2,728 | 27,707 | 109 | 380 | 23.5 | 69.0 |
| RNN HA | 37.8 | 75.2 | 50 | 267 | 4,984 | 24,832 | 518 | 963 | 24.0 | 68.7 |
| RNN LSTM | 37.1 | 73.5 | 50 | 260 | 5,327 | 25,094 | 572 | 983 | 22.3 | 69.0 |

Table 2. Tracking results on the MOTChallenge test dataset. *Denotes an offline (or delayed) method.

| Method | MOTA | MOTP | FAR | MT% | ML% | FP | FN | IDs | Frag. | FPS |
|---|---|---|---|---|---|---|---|---|---|---|
| MDP [48] | 30.3% | 71.3% | 1.7 | 13.0 | 38.4 | 9,717 | 32,422 | 680 | 1,500 | 1.1 |
| JPDAm* [13] | 23.8% | 68.2% | 1.1 | 5.0 | 58.1 | 6,373 | 40,084 | 365 | 869 | 32.6 |
| TC ODAL [49] | 15.1% | 70.5% | 2.2 | 3.2 | 55.8 | 12,970 | 38,538 | 637 | 1,716 | 1.7 |
| RNN LSTM | 19.0% | 71.0% | 2.0 | 5.5 | 45.6 | 11,578 | 36,706 | 1,490 | 2,081 | 165.2 |
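The relation between the columns of Tab. 1 can be checked with a short calculation. The sketch below uses the standard CLEAR MOT definition, MOTA = 1 − (FP + FN + IDs) / GT. The number of ground-truth boxes GT is not listed in the table, so it is estimated here from recall and the false negatives; this estimate, and the function names, are assumptions made for illustration.

```python
def estimate_num_gt(fn, recall_pct):
    """Estimate the number of ground-truth boxes from FN and recall.

    Recall = TP / GT and FN = GT - TP, hence GT = FN / (1 - Recall).
    """
    return fn / (1.0 - recall_pct / 100.0)


def mota(fp, fn, ids, num_gt):
    """CLEAR MOT accuracy in percent: 1 - (FP + FN + IDs) / GT."""
    return 100.0 * (1.0 - (fp + fn + ids) / num_gt)


# (Rcll, FP, FN, IDs, reported MOTA), copied from Tab. 1.
rows = {
    "Kalman-HA": (28.5, 3031, 28520, 685, 19.2),
    "RNN LSTM": (37.1, 5327, 25094, 572, 22.3),
}

for name, (rcll, fp, fn, ids, reported) in rows.items():
    num_gt = estimate_num_gt(fn, rcll)
    recomputed = mota(fp, fn, ids, num_gt)
    print(f"{name}: recomputed MOTA {recomputed:.2f}, reported {reported}")
```

Because recall is rounded to one decimal place, the recomputed values can only be expected to match the reported ones approximately. The calculation also makes explicit that, for these rows, missed targets (FN) dominate the error count.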

Benchmark results. Next, we show our results on the benchmark test set in Tab. 2 next to three online methods. The current leaderboard lists over 50 different trackers, with the top ones reaching almost 50% MOTA. Even though the evaluation is performed by the benchmark organisers, there are still considerable differences between the various submissions that are worth pointing out. First, all top-ranked submissions use their own set of detections and not those provided in the dataset. While a better detector typically improves the tracking result, the direct comparison of the tracking method becomes rather meaningless. Therefore, we prefer to use the provided detections to guarantee a fair setting. Second, most methods perform so-called offline tracking, i.e. the solution is inferred either using the entire video sequence, or by peeking a few frames into the future, thus returning the tracking solution with a certain time delay. This is in contrast to our method, which aims to strictly compute and fix the solution with each incoming frame, before moving to the next one. Finally, it is important to note that many current methods use target appearance or other image features like optic flow [6] to improve the data association. Our method does not utilise any visual features and relies solely on geometric locations provided by the detector. We acknowledge the usefulness of such features for pedestrian tracking, but these are often not available in other applications, such as cell or animal tracking. We therefore refrain from including them at this point and leave this for future work.

Overall, our approach does not quite reach the accuracy of the state of the art in online tracking [48], but is two orders of magnitude faster. Fig. 7 shows some example frames from the test set.


Fig. 7. Our RNN tracking results on selected MOTChallenge sequences including ADL-Rundle-3 (first row), TUD-Crossing (second row) and PETS S2.L2 (bottom).

7 Discussion and Future Work

We presented an approach to address the challenging problem of data association and trajectory estimation within a neural network setting. To the best of our knowledge, this is the first approach that employs end-to-end training for multi-target tracking. We showed that an RNN-based network can be utilised to learn complex motion models for multiple targets jointly. The second, somewhat surprising finding is that an LSTM network is able to learn a globally optimal assignment, which is a non-trivial task for such an architecture. We firmly believe that by incorporating appearance and by learning a more robust association strategy, the results can be improved significantly.

References

1. Reid, D.B.: An algorithm for tracking multiple targets. IEEE Transactions on Automatic Control 24(6) (December 1979) 843–854

2. Fortmann, T.E., Bar-Shalom, Y., Scheffe, M.: Multi-target tracking using joint probabilistic data association. In: 19th IEEE Conference on Decision and Control including the Symposium on Adaptive Processes. Volume 19. (December 1980) 807–812

3. Jiang, H., Fels, S., Little, J.J.: A linear programming approach for multiple object tracking. In: CVPR 2007

4. Zhang, L., Li, Y., Nevatia, R.: Global data association for multi-object tracking using network flows. In: CVPR 2008

5. Andriyenko, A., Schindler, K., Roth, S.: Discrete-continuous optimization for multi-target tracking. In: CVPR 2012. 1926–1933

6. Choi, W.: Near-online multi-target tracking with aggregated local flow descriptor. In: ICCV 2015


7. Leal-Taixe, L., Milan, A., Reid, I., Roth, S., Schindler, K.: MOTChallenge 2015: Towards a benchmark for multi-target tracking. arXiv:1504.01942 [cs] (April 2015)

8. Stauffer, C., Grimson, W.E.L.: Adaptive background mixture models for real-time tracking. In: CVPR 1999. (1999) 2246–2252

9. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR 2005. 886–893

10. Dollar, P., Appel, R., Belongie, S., Perona, P.: Fast feature pyramids for object detection. IEEE T. Pattern Anal. Mach. Intell. 36(8) (2014) 1532–1545

11. Ondruska, P., Posner, I.: Deep tracking: Seeing beyond seeing using recurrent neural networks. In: The Thirtieth AAAI Conference on Artificial Intelligence (AAAI), Phoenix, Arizona USA (February 2016)

12. Kim, C., Li, F., Ciptadi, A., Rehg, J.M.: Multiple hypothesis tracking revisited: Blending in modern appearance model. In: ICCV 2015

13. Rezatofighi, H.S., Milan, A., Zhang, Z., Shi, Q., Dick, A., Reid, I.: Joint probabilistic data association revisited. In: ICCV 2015

14. Berclaz, J., Fleuret, F., Turetken, E., Fua, P.: Multiple object tracking using k-shortest paths optimization. IEEE T. Pattern Anal. Mach. Intell. 33(9) (September 2011) 1806–1819

15. Ben Shitrit, H., Berclaz, J., Fleuret, F., Fua, P.: Tracking multiple people under global appearance constraints. In: ICCV 2011

16. Pirsiavash, H., Ramanan, D., Fowlkes, C.C.: Globally-optimal greedy algorithms for tracking a variable number of objects. In: CVPR 2011

17. Henriques, J.a., Caseiro, R., Batista, J.: Globally optimal solution to multi-object tracking with merged measurements. In: ICCV 2011

18. Andriyenko, A., Schindler, K.: Globally optimal multi-target tracking on a hexagonal lattice. In: ECCV 2010. Volume 1. 466–479

19. Butt, A.A., Collins, R.T.: Multi-target tracking by Lagrangian relaxation to min-cost network flow. In: CVPR 2013

20. Leibe, B., Schindler, K., Van Gool, L.: Coupled detection and trajectory estimation for multi-object tracking. In: ICCV 2007

21. Milan, A., Roth, S., Schindler, K.: Continuous energy minimization for multitarget tracking. IEEE T. Pattern Anal. Mach. Intell. 36(1) (2014) 58–72

22. Milan, A., Schindler, K., Roth, S.: Detection- and trajectory-level exclusion in multiple object tracking. In: CVPR 2013

23. Brendel, W., Amer, M.R., Todorovic, S.: Multiobject tracking as maximum weight independent set. In: CVPR 2011

24. Chen, S., Fern, A., Todorovic, S.: Multi-object tracking via constrained sequential labeling. In: CVPR 2014

25. Ivakhnenko, A.G., Lapa, V.G.: Cybernetic predicting devices. (1966) 250

26. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11) (November 1998) 2278–2324

27. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: NIPS*2012, Curran Associates, Inc. 1097–1105

28. Wang, T., Wu, D., Coates, A., Ng, A.: End-to-end text recognition with convolutional neural networks. In: 2012 21st International Conference on Pattern Recognition (ICPR). (November 2012) 3304–3308

29. Eigen, D., Fergus, R.: Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. arXiv


30. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: CVPR 2014

31. Goller, C., Kuchler, A.: Learning task-dependent distributed representations by backpropagation through structure. In: ICNN, IEEE (1996) 347–352

32. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: A neural image caption generator. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (June 2015)

33. Karpathy, A., Li, F.F.: Deep visual-semantic alignments for generating image descriptions. In: CVPR 2015. Volume abs/1412.2306.

34. Karpathy, A., Johnson, J., Li, F.F.: Visualizing and understanding recurrent networks. arXiv:1506.02078 [cs] (June 2015)

35. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: NIPS 2014. (2014) 3104–3112

36. Graves, A.: Generating sequences with recurrent neural networks. arXiv:1308.0850 [cs] (August 2013)

37. Graves, A., Mohamed, A.r., Hinton, G.E.: Speech recognition with deep recurrent neural networks. In: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2013, Vancouver, BC, Canada, May 26-31, 2013. (2013) 6645–6649

38. Stewart, R., Andriluka, M.: End-to-end people detection in crowded scenes. arXiv:1506.04878 [cs] (June 2015)

39. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8) (November 1997) 1735–1780

40. Cho, K., van Merrienboer, B., Bahdanau, D., Bengio, Y.: On the properties of neural machine translation: Encoder-decoder approaches. arXiv:1409.1259 [cs, stat] (September 2014)

41. Kalman, R.E.: A new approach to linear filtering and prediction problems. Transactions of the ASME–Journal of Basic Engineering 82(Series D) (1960) 35–45

42. Doucet, A., Godsill, S., Andrieu, C.: On sequential monte carlo sampling methods for bayesian filtering. Statistics and Computing 10(3) (2000) 197–208

43. Bar-Shalom, Y., Fortmann, T.E.: Tracking and Data Association. Academic Press (1988)

44. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge. (2014)

45. Lin, T.Y., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Hays, J., Perona, P., Ramanan, D., Zitnick, C.L., Dollar, P.: Microsoft COCO: Common objects in context. arXiv:1405.0312 [cs] (May 2014)

46. Greff, K., Srivastava, R.K., Koutník, J., Steunebrink, B.R., Schmidhuber, J.: LSTM: A search space odyssey. arXiv:1503.04069 [cs] (March 2015)

47. Tieleman, T., Hinton, G.: RMSprop: Divide the gradient by a running average of its recent magnitude. Coursera: Neural networks for machine learning (2012)

48. Xiang, Y., Alahi, A., Savarese, S.: Learning to track: Online multi-object tracking by decision making. In: International Conference on Computer Vision (ICCV). (2015) 4705–4713

49. Bae, S.H., Yoon, K.J.: Robust online multi-object tracking based on tracklet confidence and online discriminative appearance learning. In: CVPR 2014
