

Hindawi Publishing Corporation
EURASIP Journal on Image and Video Processing
Volume 2008, Article ID 246309, 10 pages
doi:10.1155/2008/246309

Research Article
Evaluating Multiple Object Tracking Performance: The CLEAR MOT Metrics

Keni Bernardin and Rainer Stiefelhagen

Interactive Systems Lab, Institut für Theoretische Informatik, Universität Karlsruhe, 76131 Karlsruhe, Germany

Correspondence should be addressed to Keni Bernardin, [email protected]

Received 2 November 2007; Accepted 23 April 2008

Recommended by Carlo Regazzoni

Simultaneous tracking of multiple persons in real-world environments is an active research field and several approaches have been proposed, based on a variety of features and algorithms. Recently, there has been a growing interest in organizing systematic evaluations to compare the various techniques. Unfortunately, the lack of common metrics for measuring the performance of multiple object trackers still makes it hard to compare their results. In this work, we introduce two intuitive and general metrics to allow for objective comparison of tracker characteristics, focusing on their precision in estimating object locations, their accuracy in recognizing object configurations, and their ability to consistently label objects over time. These metrics have been extensively used in two large-scale international evaluations, the 2006 and 2007 CLEAR evaluations, to measure and compare the performance of multiple object trackers for a wide variety of tracking tasks. Selected performance results are presented and the advantages and drawbacks of the presented metrics are discussed based on the experience gained during the evaluations.

Copyright © 2008 K. Bernardin and R. Stiefelhagen. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

The audio-visual tracking of multiple persons is a very active research field with applications in many domains. These range from video surveillance and automatic indexing to intelligent interactive environments. Especially in the last case, a robust person tracking module can serve as a powerful building block to support other techniques, such as gesture recognizers, face or speaker identifiers, head pose estimators [1], and scene analysis tools. In the last few years, more and more approaches have been presented to tackle the problems posed by unconstrained, natural environments and bring person trackers out of the laboratory environment and into real-world scenarios.

In recent years, there has also been a growing interest in performing systematic evaluations of such tracking approaches with common databases and metrics. Examples are the CHIL [2] and AMI [3] projects, funded by the EU, the U.S. VACE project [4], the French ETISEO [5] project, the U.K. Home Office iLIDS project [6], the CAVIAR [7] and CREDS [8] projects, and a growing number of workshops (e.g., PETS [9], EEMCV [10], and more recently CLEAR [11]). However, although benchmarking is rather straightforward for single object trackers, there is still no general agreement on a principled evaluation procedure using a common set of objective and intuitive metrics for measuring the performance of multiple object trackers.

Li et al. in [12] investigate the problem of evaluating systems for the tracking of football players from multiple camera images. Annotated ground truth for a set of visible players is compared to the tracker output and 3 measures are introduced to evaluate the spatial and temporal accuracy of the result. Two of the measures, however, are rather specific to the football tracking problem, and the more general measure, the "identity tracking performance," does not consider some of the basic types of errors made by multiple target trackers, such as false positive tracks or localization errors in terms of distance or overlap. This limits the application of the presented metric to specific types of trackers or scenarios.

Nghiem et al. in [13] present a more general framework for evaluation, which covers the requirements of a broad range of visual tracking tasks. The presented metrics aim at allowing systematic performance analysis using large amounts of benchmark data. However, a high number of different metrics (8 in total) are presented to evaluate object detection, localization, and tracking performance, with many dependencies between separate metrics, such that one metric can often only be interpreted in combination with one or more others. This is, for example, the case for the "tracking time" and "object ID persistence/confusion" metrics. Further, many of the proposed metrics are still designed with purely visual tracking tasks in mind.

Because of the lack of commonly agreed on and generally applicable metrics, it is not uncommon to find tracking approaches presented without quantitative evaluation, while many others are evaluated using varying sets of more or less custom measures (e.g., [14–18]). To remedy this, this paper proposes a thorough procedure to detect the basic types of errors produced by multiple object trackers and introduces two novel metrics, the multiple object tracking precision (MOTP) and the multiple object tracking accuracy (MOTA), which intuitively express a tracker's overall strengths and are suitable for use in general performance evaluations.

Perhaps the work that most closely relates to ours is that of Smith et al. in [19], which also attempts to define an objective procedure to measure multiple object tracker performance. However, key differences to our contribution exist: again, a large number of metrics are introduced: 5 for measuring object configuration errors, and 4 for measuring inconsistencies in object labeling over time. Some of the measures are defined in a dual way for trackers and for objects (e.g., MT/MO, FIT/FIO, TP/OP). This makes it difficult to gain a clear and direct understanding of the tracker's overall performance. Moreover, under certain conditions, some of these measures can behave in a nonintuitive fashion (such as the CD, as the authors state, or the FP and FN, as we will demonstrate later). In comparison, we introduce just 2 overall performance measures that allow a clear and intuitive insight into the main tracker characteristics: its precision in estimating object positions, its ability to determine the number of objects and their configuration, and its skill at keeping consistent tracks over time.

In addition to the theoretical framework, we present actual results obtained in two international evaluation workshops, which can be seen as field tests of the proposed metrics. These evaluation workshops, the classification of events, activities, and relationships (CLEAR) workshops, were held in spring 2006 and 2007 and featured a variety of tracking tasks, including visual 3D person tracking using multiple camera views, 2D face tracking, 2D person and vehicle tracking, acoustic speaker tracking using microphone arrays, and even audio-visual person tracking. For all these tracking tasks, each with its own specificities and requirements, the MOTP and MOTA metrics introduced here, or slight variants thereof, were employed. The experience gained during the course of the CLEAR evaluations is presented and discussed as a means to better understand the expressiveness and usefulness, but also the weaknesses, of the MOT metrics.

The remainder of the paper is organized as follows. Section 2 presents the new metrics, the MOTP and the MOTA, and a detailed procedure for their computation. Section 3 briefly introduces the CLEAR tracking tasks and their various requirements. In Section 4, sample results are shown and the usefulness of the metrics is discussed. Finally, Section 5 gives a summary and a conclusion.

2. PERFORMANCE METRICS FOR MULTIPLE OBJECT TRACKING

To allow a better understanding of the proposed metrics, we first explain what qualities we expect from an ideal multiple object tracker. It should at all points in time find the correct number of objects present and estimate the position of each object as precisely as possible (note that properties such as the contour, orientation, or speed of objects are not explicitly considered here). It should also keep consistent track of each object over time: each object should be assigned a unique track ID which stays constant throughout the sequence (even after temporary occlusion, etc.). This leads to the following design criteria for performance metrics.

(i) They should allow one to judge a tracker's precision in determining exact object locations.

(ii) They should reflect its ability to consistently track object configurations through time, that is, to correctly trace object trajectories, producing exactly one trajectory per object.

Additionally, we expect useful metrics

(i) to have as few free parameters, adjustable thresholds, and so forth, as possible, to help make evaluations straightforward and keep results comparable;

(ii) to be clear, easily understandable, and behave according to human intuition, especially in the occurrence of multiple errors of different types or of an uneven distribution of errors throughout the sequence;

(iii) to be general enough to allow comparison of most types of trackers (2D, 3D trackers, object centroid trackers, or object area trackers, etc.);

(iv) to be few in number and yet expressive, so they may be used, for example, in large evaluations where many systems are being compared.

Based on the above criteria, we propose a procedure for the systematic and objective evaluation of a tracker's characteristics. Assuming that for every time frame $t$, a multiple object tracker outputs a set of hypotheses $\{h_1, \ldots, h_m\}$ for a set of visible objects $\{o_1, \ldots, o_n\}$, the evaluation procedure comprises the following steps.

For each time frame $t$,

(i) establish the best possible correspondence between hypotheses $h_j$ and objects $o_i$,

(ii) for each found correspondence, compute the error in the object's position estimation,

(iii) accumulate all correspondence errors:

(a) count all objects for which no hypothesis was output as misses,

Figure 1: Mapping tracker hypotheses to objects. In the easiest case, matching the closest object-hypothesis pairs for each time frame $t$ is sufficient.

(b) count all tracker hypotheses for which no real object exists as false positives,

(c) count all occurrences where the tracking hypothesis for an object changed compared to previous frames as mismatch errors. This could happen, for example, when two or more objects are swapped as they pass close to each other, or when an object track is reinitialized with a different track ID, after it was previously lost because of occlusion.

Then, the tracking performance can be intuitively expressed in two numbers: the "tracking precision," which expresses how well exact positions of persons are estimated, and the "tracking accuracy," which shows how many mistakes the tracker made in terms of misses, false positives, mismatches, failures to recover tracks, and so forth. These measures will be explained in detail in the latter part of this section.

2.1. Establishing correspondences between objects and tracker hypotheses

As explained above, the first step in evaluating the performance of a multiple object tracker is finding a continuous mapping between the sequence of object hypotheses $\{h_1, \ldots, h_m\}$ output by the tracker in each frame and the real objects $\{o_1, \ldots, o_n\}$. This is illustrated in Figure 1. Naively, one would match the closest object-hypothesis pairs and treat all remaining objects as misses and all remaining hypotheses as false positives. A few important points need to be considered, though, which make the procedure less straightforward.

2.1.1. Valid correspondences

First of all, the correspondence between an object $o_i$ and a hypothesis $h_j$ should not be made if their distance $\mathrm{dist}_{i,j}$ exceeds a certain threshold $T$. There is a certain conceptual boundary beyond which we can no longer speak of an error in position estimation, but should rather argue that the tracker has missed the object and is tracking something else. This is illustrated in Figure 2(a). For object area trackers (i.e., trackers that also estimate the size of objects or the area occupied by them), distance could be expressed in terms of the overlap between object and hypothesis, for example, as in [14], and the threshold $T$ could be set to zero overlap. For object centroid trackers, one could simply use the Euclidean distance, in 2D image coordinates or in real 3D world coordinates, between object centers and hypotheses, and the threshold could be, for example, the average width of a person in pixels or cm. The optimal setting for $T$ therefore depends on the application task, the size of the objects involved, and the distance measure used, and cannot be defined for the general case (while a task-specific, data-driven computation of $T$ may be possible in some cases, this was not further investigated here; for the evaluations presented in Sections 3 and 4, empirical determination based on task knowledge proved sufficient). In the following, we refer to correspondences as valid if $\mathrm{dist}_{i,j} < T$.
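To make the two distance conventions concrete, the following is a minimal sketch, assuming point positions for centroid trackers and axis-aligned boxes (x1, y1, x2, y2) for area trackers; the function names and the box format are illustrative, not part of the metric definition. In the overlap formulation, a threshold of "zero overlap" corresponds to $T = 1.0$, since the distance reaches 1.0 exactly when the boxes no longer intersect.

```python
import math

def euclidean_distance(obj_pos, hyp_pos):
    """Distance between object and hypothesis centroids (2D or 3D points)."""
    return math.dist(obj_pos, hyp_pos)

def overlap_distance(obj_box, hyp_box):
    """1 - intersection-over-union for axis-aligned boxes (x1, y1, x2, y2)."""
    ix = max(0.0, min(obj_box[2], hyp_box[2]) - max(obj_box[0], hyp_box[0]))
    iy = max(0.0, min(obj_box[3], hyp_box[3]) - max(obj_box[1], hyp_box[1]))
    inter = ix * iy
    area_o = (obj_box[2] - obj_box[0]) * (obj_box[3] - obj_box[1])
    area_h = (hyp_box[2] - hyp_box[0]) * (hyp_box[3] - hyp_box[1])
    return 1.0 - inter / (area_o + area_h - inter)

def is_valid(dist, T):
    """A correspondence is only admissible if dist < T."""
    return dist < T
```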

2.1.2. Consistent tracking over time

Second, to measure the tracker's ability to label objects consistently, one has to detect when conflicting correspondences have been made for an object over time. Figure 2(b) illustrates the problem. Here, one track was mistakenly assigned to 3 different objects over the course of time. A mismatch can occur when objects come close to each other and the tracker wrongfully swaps their identities. It can also occur when a track was lost and reinitialized with a different identity. One way to measure such errors could be to decide on a "best" mapping $(o_i, h_j)$ for every object $o_i$ and hypothesis $h_j$, for example, based on the initial correspondence made for $o_i$, or the correspondence $(o_i, h_j)$ most frequently made in the whole sequence. One would then count all correspondences where this mapping is violated as errors. In some cases, this kind of measure can, however, become nonintuitive. As shown in Figure 2(c), if, for example, the identity of object $o_i$ is swapped just once in the course of the tracking sequence, the time frame at which the swap occurs drastically influences the value output by such an error measure.

This is why we follow a different approach: only count mismatch errors once, at the time frames where a change in object-hypothesis mappings is made, and consider the correspondences in intermediate segments as correct. Especially in cases where many objects are being tracked and mismatches are frequent, this gives us a more intuitive and expressive error measure. To detect when a mismatch error occurs, a list of object-hypothesis mappings is constructed. Let $M_t = \{(o_i, h_j)\}$ be the set of mappings made up to time $t$ and let $M_0 = \{\cdot\}$. Then, if a new correspondence is made at time $t+1$ between $o_i$ and $h_k$ which contradicts a mapping $(o_i, h_j)$ in $M_t$, a mismatch error is counted and $(o_i, h_j)$ is replaced by $(o_i, h_k)$ in $M_{t+1}$.

The so constructed mapping list $M_t$ can now help to establish optimal correspondences between objects and hypotheses at time $t+1$, when multiple valid choices exist. Figure 2(d) shows such a case.

Figure 2: Optimal correspondences and error measures. (a) When the distance between $o_1$ and $h_1$ exceeds a certain threshold $T$, one can no longer make a correspondence. Instead, $o_1$ is considered missed and $h_1$ becomes a false positive. (b) Mismatched tracks. Here, $h_2$ is first mapped to $o_2$. After a few frames, though, $o_1$ and $o_2$ cross paths and $h_2$ follows the wrong object. Later, it wrongfully swaps again to $o_3$. (c) Problems when using a sequence-level "best" object-hypothesis mapping based on most frequently made correspondences. In the first case, $o_1$ is tracked just 2 frames by $h_1$, before the track is taken over by $h_2$. In the second case, $h_1$ tracks $o_1$ for almost half the sequence. In both cases, a "best" mapping would pair $h_2$ and $o_1$. This, however, leads to counting 2 mismatch errors for case 1 and 4 errors for case 2, although in both cases only one error of the same kind was made. (d) Correct reinitialization of a track. At time $t$, $o_1$ is tracked by $h_1$. At $t+1$, the track is lost. At $t+2$, two valid hypotheses exist. The correspondence is made with $h_1$, although $h_2$ is closer to $o_1$, based on the knowledge of previous mappings up to time $t+1$.

When it is not clear which hypothesis to match to an object $o_i$, priority is given to $h_o$ with $(o_i, h_o) \in M_t$, as this is most likely the correct track. Other hypotheses are considered false positives, and could have occurred because the tracker outputs several hypotheses for $o_i$, or because a hypothesis that previously tracked another object accidentally crossed over to $o_i$.

2.1.3. Mapping procedure

Having clarified all the design choices behind our strategy for constructing object-hypothesis correspondences, we summarize the procedure as follows.

Let $M_0 = \{\cdot\}$. For every time frame $t$, consider the following.

(1) For every mapping $(o_i, h_j)$ in $M_{t-1}$, verify if it is still valid. If object $o_i$ is still visible and tracker hypothesis $h_j$ still exists at time $t$, and if their distance does not exceed the threshold $T$, make the correspondence between $o_i$ and $h_j$ for frame $t$.

(2) For all objects for which no correspondence was made yet, try to find a matching hypothesis. Allow only one-to-one matches, and only pairs for which the distance does not exceed $T$. The matching should be made in a way that minimizes the total object-hypothesis distance error for the concerned objects. This is a minimum weight assignment problem, and is solved using Munkres' algorithm [20] with polynomial runtime complexity. If a correspondence $(o_i, h_k)$ is made that contradicts a mapping $(o_i, h_j)$ in $M_{t-1}$, replace $(o_i, h_j)$ with $(o_i, h_k)$ in $M_t$. Count this as a mismatch error and let $mme_t$ be the number of mismatch errors for frame $t$.

(3) After the first two steps, a complete set of matching pairs for the current time frame is known. Let $c_t$ be the number of matches found for time $t$. For each of these matches, calculate the distance $d_t^i$ between the object $o_i$ and its corresponding hypothesis.

(4) All remaining hypotheses are considered false positives. Similarly, all remaining objects are considered misses. Let $fp_t$ and $m_t$ be the number of false positives and misses, respectively, for frame $t$. Let also $g_t$ be the number of objects present at time $t$.

(5) Repeat the procedure from step 1 for the next time frame. Note that since for the initial frame the set of mappings $M_0$ is empty, all correspondences made are initial and no mismatch errors occur.

In this way, a continuous mapping between objects and tracker hypotheses is defined and all tracking errors are accounted for.
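As a concrete illustration of steps (1)–(5), the following is a minimal Python sketch of the per-frame matching loop, not the original evaluation tool. It assumes each frame is given as a pair of dicts mapping object IDs and hypothesis IDs to positions; `dist_fn` and `T` are the distance measure and threshold from Section 2.1.1, and SciPy's `linear_sum_assignment` provides the Munkres (Hungarian) method of step (2). One simplification: pairs whose assigned distance exceeds `T` are discarded after the optimization, whereas a strict implementation would exclude them from the assignment itself.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Munkres / Hungarian method

def evaluate_sequence(frames, dist_fn, T):
    """Accumulate CLEAR MOT error counts over a sequence.

    frames: iterable of (objects, hypotheses) pairs, each a dict {ID: position}.
    Returns the totals needed for MOTP and MOTA (see Section 2.2).
    """
    mappings = {}  # M_t: object ID -> hypothesis ID, carried across frames
    matches = misses = false_pos = mismatches = 0
    total_dist = 0.0
    total_objects = 0

    for objects, hypotheses in frames:
        corr = {}
        # Step (1): keep still-valid mappings from previous frames.
        for o, h in mappings.items():
            if o in objects and h in hypotheses and dist_fn(objects[o], hypotheses[h]) < T:
                corr[o] = h
        # Step (2): match remaining objects and hypotheses, minimizing the
        # total distance; count contradicted mappings as mismatch errors.
        free_o = [o for o in objects if o not in corr]
        free_h = [h for h in hypotheses if h not in corr.values()]
        if free_o and free_h:
            cost = np.array([[dist_fn(objects[o], hypotheses[h]) for h in free_h]
                             for o in free_o])
            for r, c in zip(*linear_sum_assignment(cost)):
                if cost[r, c] < T:  # only correspondences within the threshold
                    o, h = free_o[r], free_h[c]
                    if o in mappings and mappings[o] != h:
                        mismatches += 1  # contradicts an earlier mapping
                    corr[o] = h
        mappings.update(corr)  # becomes M_{t+1}
        # Steps (3)-(4): accumulate distances, misses, and false positives.
        matches += len(corr)
        total_dist += sum(dist_fn(objects[o], hypotheses[h]) for o, h in corr.items())
        misses += len(objects) - len(corr)
        false_pos += len(hypotheses) - len(corr)
        total_objects += len(objects)

    return total_dist, matches, misses, false_pos, mismatches, total_objects
```

The returned totals feed directly into the metric definitions of the next section.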

2.2. Performance metrics

Based on the matching strategy described above, two very intuitive metrics can be defined.

(1) The multiple object tracking precision (MOTP):

\[ \text{MOTP} = \frac{\sum_{i,t} d_t^i}{\sum_t c_t}. \tag{1} \]

It is the total error in estimated position for matched object-hypothesis pairs over all frames, averaged over the total number of matches made. It shows the ability of the tracker to estimate precise object positions, independent of its skill at recognizing object configurations, keeping consistent trajectories, and so forth.

(2) The multiple object tracking accuracy (MOTA):

\[ \text{MOTA} = 1 - \frac{\sum_t \left( m_t + fp_t + mme_t \right)}{\sum_t g_t}, \tag{2} \]

where $m_t$, $fp_t$, and $mme_t$ are the number of misses, of false positives, and of mismatches, respectively, for time $t$. The MOTA can be seen as derived from 3 error ratios:

\[ \overline{m} = \frac{\sum_t m_t}{\sum_t g_t}, \tag{3} \]

the ratio of misses in the sequence, computed over the total number of objects present in all frames,

\[ \overline{fp} = \frac{\sum_t fp_t}{\sum_t g_t}, \tag{4} \]

the ratio of false positives, and

\[ \overline{mme} = \frac{\sum_t mme_t}{\sum_t g_t}, \tag{5} \]

the ratio of mismatches.

Figure 3: Computing error ratios. Assume a sequence length of 8 frames. For frames $t_1$ to $t_4$, 4 objects $o_1, \ldots, o_4$ are visible, but none is being tracked. For frames $t_5$ to $t_8$, only $o_4$ remains visible, and is being consistently tracked by $h_1$. In each frame $t_1, \ldots, t_4$, 4 objects are missed, resulting in a 100% miss rate. In each frame $t_5, \ldots, t_8$, the miss rate is 0%. Averaging these frame-level error rates yields a global result of $\frac{1}{8}(4 \cdot 100\% + 4 \cdot 0\%) = 50\%$ miss rate. On the other hand, summing up all errors first and computing a global ratio yields the far more intuitive result of 16 misses / 20 objects = 80%.

Summing up over the different error ratios gives us the total error rate $E_{tot}$, and $1 - E_{tot}$ is the resulting tracking accuracy. The MOTA accounts for all object configuration errors made by the tracker, false positives, misses, and mismatches, over all frames. It is similar to metrics widely used in other domains (such as the word error rate (WER), commonly used in speech recognition) and gives a very intuitive measure of the tracker's performance at detecting objects and keeping their trajectories, independent of the precision with which the object locations are estimated.
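Given the totals accumulated by a matching procedure such as the sketch in Section 2.1.3, both metrics reduce to simple ratios; a minimal sketch (the counter names follow the earlier sketch and are illustrative):

```python
def motp_mota(total_dist, matches, misses, false_pos, mismatches, total_objects):
    """The CLEAR MOT metrics from accumulated totals, following (1) and (2)."""
    motp = total_dist / matches if matches else float("nan")  # undefined without matches
    mota = 1.0 - (misses + false_pos + mismatches) / total_objects
    return motp, mota
```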

Remark on computing averages: note that for both MOTP and MOTA, it is important to first sum up all errors across frames before a final average or ratio can be computed. The reason is that computing ratios $r_t$ for each frame $t$ independently before calculating a global average $\frac{1}{n} \sum_t r_t$ for all $n$ frames (such as, e.g., for the FP and FN measures in [19]) can lead to nonintuitive results. This is illustrated in Figure 3. Although the tracker consistently missed most objects in the sequence, computing ratios independently per frame and then averaging would still yield only a 50% miss rate. Summing up all misses first and computing a single global ratio, on the other hand, produces a more intuitive result of an 80% miss rate.
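The difference between the two conventions can be checked directly with the numbers from Figure 3; a small sketch:

```python
# Figure 3: 8 frames; in frames 1-4, 4 objects are visible and all are missed;
# in frames 5-8, a single object is visible and perfectly tracked.
misses_per_frame = [4, 4, 4, 4, 0, 0, 0, 0]
objects_per_frame = [4, 4, 4, 4, 1, 1, 1, 1]

# Per-frame ratios averaged afterwards (the nonintuitive convention):
per_frame_avg = sum(m / g for m, g in zip(misses_per_frame, objects_per_frame)) / 8
# All errors summed first, then one global ratio (the MOT convention):
global_ratio = sum(misses_per_frame) / sum(objects_per_frame)

print(per_frame_avg)  # 0.5 -> 50% miss rate
print(global_ratio)   # 0.8 -> 80% miss rate
```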

3. TRACKING EVALUATIONS IN CLEAR

The theoretical framework presented here for the evaluation of multiple object trackers was applied in two large evaluation workshops. The classification of events, activities, and relationships (CLEAR) workshops [11] were organized in a collaboration between the European CHIL project, the U.S. VACE project, and the National Institute of Standards and Technology (NIST) [21] (as well as the AMI project, in 2007), and were held in the springs of 2006 and 2007. They represent the first international evaluations of their kind, using large databases of annotated multimodal data, and aimed to provide a platform for researchers to benchmark systems for acoustic and visual tracking, identification, activity analysis, event recognition, and so forth, using common task definitions, datasets, tools, and metrics. They featured a variety of tasks related to the tracking of humans or other objects in natural, unconstrained indoor and outdoor scenarios, and presented new challenges to systems for the fusion of multimodal and multisensory data. A complete description of the CLEAR evaluation workshops, the participating systems, and the achieved results can be found in [22, 23].

The authors wish to make the point here that these evaluations represent a systematic, large-scale effort using hours of annotated data and a substantial number of participating systems, and can therefore be seen as a true practical test of the usefulness of the MOT metrics. The experience from these workshops was that the MOT metrics were indeed applicable to a wide range of tracking tasks, made it easy to gain insights into tracker strengths and weaknesses and to compare overall system performances, and helped researchers publish and convey performance results that are objective, intuitive, and easy to interpret. In the following, the various CLEAR tracking tasks are briefly presented, highlighting the differences and specificities that make them interesting from the point of view of the requirements posed to evaluation metrics. While in 2006 there were still some exceptions, in 2007 all tasks related to tracking, for single or multiple objects, and for all modalities, were evaluated using the MOT metrics.

3.1. 3D visual person tracking

The CLEAR 2006 and 2007 evaluations featured a 3D person tracking task, in which the objective was to determine the location on the ground plane of persons in a scene. The scenario was that of small meetings or seminars, and several camera views were available to help determine 3D locations. Both the tracking of single persons (the lecturer in front of an audience) and of multiple persons (all seminar participants) were attempted. The specifications of this task posed quite a challenge for the design of appropriate performance metrics: measures such as track merges and splits, usually found in the field of 2D image-based tracking, had little meaning in the 3D multicamera tracking scenario. On the other hand, errors in location estimation had to be carefully distinguished from false positives and false track associations. Tracker performances were to be intuitively comparable for sequences with large differences in the number of ground truth objects, and thus varying levels of difficulty. In the end, the requirements of the 3D person tracking task drove much of the design choices behind the MOT metrics. For this task, error calculations were made using the Euclidean distance between hypothesized and labeled person positions on the ground plane, and the correspondence threshold was set to 50 cm.

Figure 4 shows example scenes from the seminar database used for 3D person tracking.

3.2. 2D face tracking

The face tracking task was to be evaluated on two different databases: one featuring single views of the scene and one featuring multiple views to help better resolve problems of detection and track verification. In both cases, the objective was to detect and track faces in each separate view, estimating not only their position in the image, but also their extension, that is, the exact area covered by them. Although in the 2006 evaluation a variety of separate measures were used, in the 2007 evaluation the same MOT metrics as in the 3D person tracking task, with only slight variations, were successfully applied. In this case, the overlap between hypothesized and labeled face bounding boxes in the image was used as the distance measure, and the distance error threshold was set to zero overlap.

Figure 5 shows examples of face tracker outputs on the CLEAR seminar database.

3.3. 2D person and vehicle tracking

Just as in the face tracking task, the 2D view-based tracking of persons and vehicles was also evaluated on different sets of databases representing outdoor traffic scenes, using only slight variants of the MOT metrics. Here also, bounding box overlap was used as the distance measure.

Figure 6 shows a scene from the CLEAR vehicle tracking database.

3.4. 3D acoustic and multimodal person tracking

The task of 3D person tracking in seminar or meeting scenarios also featured an acoustic subtask, where tracking was to be achieved using the information from distributed microphone networks, and a multimodal subtask, where the combination of multiple camera and multiple microphone inputs was available. It is noteworthy here that the MOT measures could be applied with success to the domain of acoustic source localization, where overall performance is traditionally measured using rather different error metrics and is decomposed into speech segmentation performance and localization performance. Here, the miss and false positive errors in the MOTA measure accounted for segmentation errors, whereas the MOTP expressed localization precision. In contrast to visual tracking, mismatches were not considered in the MOTA calculation, as acoustic trackers were not expected to distinguish the identities of speakers, and the resulting variant, the A-MOTA, was used for system comparisons. In both the acoustic and multimodal subtasks, systems were expected to pinpoint the 3D location of active speakers, and the distance measure used was the Euclidean distance on the ground plane, with the threshold set to 50 cm.
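Since only the mismatch term is dropped, this variant is a one-line change to the accuracy computation; a sketch, reusing the illustrative counters from Section 2:

```python
def a_mota(misses, false_pos, total_objects):
    """A-MOTA: tracking accuracy without the mismatch term (acoustic subtask)."""
    return 1.0 - (misses + false_pos) / total_objects
```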


Figure 4: Scenes from the CLEAR seminar database used in 3D person tracking (sites: ITC-irst, UKA, AIT, IBM, UPC).

Figure 5: Scenes from the CLEAR seminar database used for face detection and tracking (sites: UKA, AIT).

Figure 6: Sample from the CLEAR vehicle tracking database (i-LIDS dataset [6]).

4. EVALUATION RESULTS

This section gives a brief overview of the evaluation results from select CLEAR tracking tasks. The results serve to demonstrate the effectiveness of the proposed MOT metrics and act as a basis for discussion of inherent advantages, drawbacks, and lessons learned during the workshops. For a more detailed presentation, the reader is referred to [22, 23].

Figure 7 shows the results for the CLEAR 2007 visual 3D person tracking task. A total of seven tracking systems with varying characteristics participated. Looking at the first column, the MOTP scores, one finds that all systems performed fairly well, with average localization errors under 20 cm. This can be seen as quite low, considering the area occupied on average by a person and the fact that the ground truth itself, representing the projections to the ground plane of head centroids, was only labeled to 5–8 cm accuracy. However, one must keep in mind that the fixed threshold of 50 cm, beyond which an object is considered as missed completely by the tracker, prevents the MOTP from rising too high. Even in the case of a uniform distribution of localization errors, the MOTP value would be 25 cm. This shows us that, considering the predefined threshold, System E is actually not very precise at estimating person coordinates, and that System B, on the other hand, is extremely precise when compared to ground truth uncertainties.

Site/system   MOTP     Miss rate   False pos. rate   Mismatches   MOTA
System A      92 mm    30.86%      6.99%             1139         59.66%
System B      91 mm    32.78%      5.25%             1103         59.56%
System C      141 mm   20.66%      18.58%            518          59.62%
System D      155 mm   15.09%      14.5%             378          69.58%
System E      222 mm   23.74%      20.24%            490          54.94%
System F      168 mm   27.74%      40.19%            720          30.49%
System G      147 mm   13.07%      7.78%             361          78.36%

Figure 7: Results for the CLEAR'07 3D multiple person tracking visual subtask.

More importantly still, it shows us that the correspondence threshold $T$ strongly influences the behavior of the MOTP and MOTA measures. Theoretically, a threshold of $T = \infty$ means that all correspondences stay valid once made, no matter how large the distance between object and track hypothesis becomes. This reduces the impact of the MOTA to measuring the correct detection of the number of objects, and disregards all track swaps, stray track errors, and so forth, resulting in a MOTP measure that is likewise unusable. On the other hand, as $T$ approaches 0, all tracked objects will eventually be considered as missed, and the MOTP and MOTA measures lose their meaning. As a consequence, the single correspondence threshold $T$ must be carefully chosen based on the application and evaluation goals at hand. For the CLEAR 3D person tracking task, the margin was intuitively set to 50 cm, which produced reasonable results, but the question of determining the optimal threshold, perhaps automatically in a data-driven way, is still left unanswered.

The rightmost column in Figure 7, the MOTA measure, proved somewhat more interesting for overall performance comparisons, at least in the case of 3D person tracking, as it was not bounded to a reasonable range, as the MOTP was. There was far more room for errors in accuracy in the complex multitarget scenarios under evaluation. The best and worst overall systems, G and F, reached 78% and 30% accuracy, respectively. Systems A, B, C, and E, on the other hand, produced very similar numbers, although they used quite different features and algorithms. While the MOTA measure is useful for making such broad high-level comparisons, it was felt that the intermediate miss, false positive, and mismatch error measures, which contribute to the overall score, helped to gain a better understanding of tracker failure modes, and it was decided to publish them alongside the MOTP and MOTA measures. This was useful, for example, for comparing the strengths of systems B and C, which had a similar overall score.

Notice that in contrast to misses and false positives, for the 2007 CLEAR 3D person tracking task, mismatches were presented as absolute numbers, that is, as the total number of errors made in all test sequences. This is due to an imbalance, which was already noted during the 2006 evaluations, and for which no definite solution has been found as of yet: for a fairly reasonable tracking system and the scenarios under consideration, the number of mismatch errors made in a sequence of several minutes labeled at 1-second intervals is in no proportion to the number of ground truth objects, or, for example, to the number of miss errors incurred if only one of many objects is missed for a portion of the sequence. This typically resulted in mismatch error ratios of often less than 2%, in contrast to 20–40% for misses or false positives, which considerably reduced the impact of faulty track labeling on the overall MOTA score. Of course, one could argue that this is an intuitive result because track labeling is a lesser problem compared to the correct detection and tracking of multiple objects, but in the end the relative importance of separate error measures is purely dependent on the application. To keep the presentation of results as objective as possible, absolute mismatch errors were presented here, but the consensus from the evaluation workshops was that according more weight to track labeling errors was desirable, for example, in the form of trajectory-based error measures, which could help move away from frame-based miss and false positive errors, and thus reduce the imbalance.

Figure 8 shows the results for the visual 3D single person tracking task, evaluated in 2006. As the tracking of a single object can be seen as a special case of multiple object tracking, the MOT metrics could be applied in the same way. Again, one can find at a glance the best performing system in terms of tracking accuracy, System D with 91% accuracy, by looking at the MOTA values. One can also quickly discern that, overall, systems performed better on the less challenging single person scenario. The MOTP column tells us that System B was remarkable, among all others, in that it estimated target locations down to 8.8 cm precision. Just as in the previous case, more detailed components of the tracking error were presented in addition to the MOTA. In contrast to multiple person tracking, mismatch errors play no role (or should not play any) in the single person case. Also, as a lecturer was always present and visible in the considered scenarios, false positives could only come from gross localization errors, which is why only a detailed analysis of the miss errors was given.

Site/system   MOTP     Miss rate (dist > T)   Miss rate (no hypo)   MOTA
System A      246 mm   88.75%                 2.28%                 −79.78%
System B      88 mm    5.73%                  2.57%                 85.96%
System C      168 mm   15.29%                 3.65%                 65.44%
System D      132 mm   4.34%                  0.09%                 91.23%
System E      127 mm   14.32%                 0%                    71.36%
System F      161 mm   9.64%                  0.04%                 80.67%
System G      207 mm   12.21%                 0.06%                 75.52%

Figure 8: Results for the CLEAR'06 3D single person tracking visual subtask.

For better understanding, the misses were broken down into misses resulting from failures to detect the person of interest (miss rate (no hypo)) and misses resulting from localization errors exceeding the 50 cm threshold (miss rate (dist > T)). In the latter case, as a consequence of the metric definition, every miss error was automatically accompanied by a false positive error, although these were not presented separately for conciseness.

This effect, which is much more clearly observable in the single object case, can be perceived as penalizing a tracker twice for gross localization errors (one miss penalty and one false positive penalty). This effect is, however, intentional and desirable for the following reason: intelligent trackers that use mechanisms such as track confidence measures to avoid outputting a track hypothesis when their location estimation is poor are rewarded compared to trackers which continuously output erroneous hypotheses. It can be argued that a tracker which fails to detect a lecturer for half of a sequence performs better than a tracker which consistently tracks the empty blackboard for the same duration of time. This brings us to a noteworthy point: just as much as the types of tracker errors (misses, false positives, distance errors, etc.) that are used to derive performance measures, precisely how these errors are counted, that is, the procedure for their computation over temporal sequences, plays a major role in the behavior and expressiveness of the resulting metric.

Figure 9 shows the results for the 2007 3D person tracking acoustic subtask. According to the task definition, mismatch errors played no role, and just as in the visual single person case, components of the MOTA score were broken down into miss and false positive errors resulting from faulty segmentation (true miss rate, true false pos. rate) and those resulting from gross localization errors (loc. error rate). One can easily make out System G as the overall best performing system, both in terms of MOTP and MOTA, with performance varying greatly from system to system. Figure 9 demonstrates the usefulness of having just one or two overall performance measures when large numbers of systems are involved, in order to gain a high-level overview before going into a deeper analysis of their strengths and weaknesses.

Figure 10, finally, shows the results for the 2007 face tracking task on the CLEAR seminar database. The main difference to the previously presented tasks lies in the fact that 2D image tracking of the face area is performed, and the distance error between ground truth objects and tracker hypotheses is expressed in terms of the overlap of the respective bounding boxes.


Site/system   MOTP     True miss rate   True false pos. rate   Loc. error rate   A-MOTA
System A      257 mm   35.3%            11.06%                 26.09%            1.45%
System B      256 mm   0%               22.01%                 41.6%             −5.22%
System C      208 mm   11.2%            7.08%                  18.27%            45.18%
System D      223 mm   11.17%           7.11%                  29.17%            23.39%
System E      210 mm   0.7%             21.04%                 23.94%            30.37%
System F      152 mm   0%               22.04%                 14.96%            48.04%
System G      140 mm   8.08%            12.26%                 12.52%            54.63%
System H      168 mm   25.35%           8.46%                  12.51%            41.17%

Figure 9: Results for the CLEAR'07 3D person tracking acoustic subtask.

Site/system   MOTP (overlap)   Miss rate   False pos. rate   Mismatch rate   MOTA
System A      0.66             42.54%      22.1%             2.29%           33.07%
System B      0.68             19.85%      10.31%            1.03%           68.81%

Figure 10: Results for the CLEAR'07 face tracking task.

This is reflected in the MOTP column. As the task required the simultaneous tracking of multiple faces, all types of errors, misses, false positives, and mismatches, were of relevance and were presented along with the overall MOTA score. From the numbers, one can derive that although systems A and B were fairly equal at estimating face extensions once they were found, System B clearly outperformed System A when it comes to detecting and keeping track of these faces in the first place. This case again serves to demonstrate how the MOT measures can be applied, with slight modifications but using the same general framework, for the evaluation of various types of trackers with different domain-specific requirements, operating in a wide range of scenarios.

5. SUMMARY AND CONCLUSION

In order to systematically assess and compare the performance of different systems for multiple object tracking, metrics which reflect the quality and main characteristics of such systems are needed. Unfortunately, no agreement on a set of commonly applicable metrics has yet been reached.

In this paper, we have proposed two novel metrics for the evaluation of multiple object tracking systems. The proposed metrics, the multiple object tracking precision (MOTP) and the multiple object tracking accuracy (MOTA), are applicable to a wide range of tracking tasks and allow for objective comparison of the main characteristics of tracking systems, such as their precision in localizing objects, their accuracy in recognizing object configurations, and their ability to consistently track objects over time.

We have tested the usefulness and expressiveness of the proposed metrics experimentally, in a series of international evaluation workshops. The 2006 and 2007 CLEAR workshops hosted a variety of tracking tasks for which a large number of systems were benchmarked and compared. The results of the evaluation show that the proposed metrics indeed reflect the strengths and weaknesses of the various evaluated systems in an intuitive and meaningful way, allow for easy comparison of overall performance, and are applicable to a variety of scenarios.

ACKNOWLEDGMENT

The work presented here was partly funded by the European Union (EU) under the integrated project CHIL, Computers in the Human Interaction Loop (Grant no. IST-506909).

REFERENCES

[1] M. Voit, K. Nickel, and R. Stiefelhagen, "Multi-view head pose estimation using neural networks," in Proceedings of the 2nd Workshop on Face Processing in Video (FPiV '05), in association with the 2nd IEEE Canadian Conference on Computer and Robot Vision (CRV '05), pp. 347–352, Victoria, Canada, May 2005.

[2] CHIL—Computers In the Human Interaction Loop, http://chil.server.de/.

[3] AMI—Augmented Multiparty Interaction, http://www.amiproject.org/.

[4] VACE—Video Analysis and Content Extraction, http://www.informedia.cs.cmu.edu/arda/vaceII.html.

[5] ETISEO—Video Understanding Evaluation, http://www.silogic.fr/etiseo/.

[6] The i-LIDS dataset, http://scienceandresearch.homeoffice.gov.uk/hosdb/cctv-imaging-technology/video-based-detection-systems/i-lids/.

[7] CAVIAR—Context Aware Vision using Image-based Active Recognition, http://homepages.inf.ed.ac.uk/rbf/CAVIAR/.

[8] F. Ziliani, S. Velastin, F. Porikli, et al., "Performance evaluation of event detection solutions: the CREDS experience," in Proceedings of the IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS '05), pp. 201–206, Como, Italy, September 2005.

[9] PETS—Performance Evaluation of Tracking and Surveillance, http://www.cbsr.ia.ac.cn/conferences/VS-PETS-2005/.

[10] EEMCV—Empirical Evaluation Methods in Computer Vision, http://www.cs.colostate.edu/eemcv2005/.

[11] CLEAR—Classification of Events, Activities and Relationships, http://www.clear-evaluation.org/.

[12] Y. Li, A. Dore, and J. Orwell, "Evaluating the performance of systems for tracking football players and ball," in Proceedings of the IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS '05), pp. 632–637, Como, Italy, September 2005.

[13] A. T. Nghiem, F. Bremond, M. Thonnat, and V. Valentin, "ETISEO, performance evaluation for video surveillance systems," in Proceedings of the IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS '07), pp. 476–481, London, UK, September 2007.

[14] R. Y. Khalaf and S. S. Intille, "Improving multiple people tracking using temporal consistency," MIT Department of Architecture House_n Project Technical Report, Massachusetts Institute of Technology, Cambridge, Mass, USA, 2001.

[15] A. Mittal and L. S. Davis, "M2Tracker: a multi-view approach to segmenting and tracking people in a cluttered scene using region-based stereo," in Proceedings of the 7th European Conference on Computer Vision (ECCV '02), vol. 2350 of Lecture Notes in Computer Science, pp. 18–33, Copenhagen, Denmark, May 2002.

[16] N. Checka, K. Wilson, V. Rangarajan, and T. Darrell, "A probabilistic framework for multi-modal multi-person tracking," in Proceedings of the IEEE Workshop on Multi-Object Tracking (WOMOT '03), Madison, Wis, USA, June 2003.

[17] K. Nickel, T. Gehrig, R. Stiefelhagen, and J. McDonough, "A joint particle filter for audio-visual speaker tracking," in Proceedings of the 7th International Conference on Multimodal Interfaces (ICMI '05), pp. 61–68, Trento, Italy, October 2005.

[18] H. Tao, H. Sawhney, and R. Kumar, "A sampling algorithm for tracking multiple objects," in Proceedings of the International Workshop on Vision Algorithms (ICCV '99), pp. 53–68, Corfu, Greece, September 1999.

[19] K. Smith, D. Gatica-Perez, J. Odobez, and S. Ba, "Evaluating multi-object tracking," in Proceedings of the IEEE Workshop on Empirical Evaluation Methods in Computer Vision (EEMCV '05), vol. 3, p. 36, San Diego, Calif, USA, June 2005.

[20] J. Munkres, "Algorithms for the assignment and transportation problems," Journal of the Society for Industrial and Applied Mathematics, vol. 5, no. 1, pp. 32–38, 1957.

[21] NIST—National Institute of Standards and Technology, http://www.nist.gov/.

[22] R. Stiefelhagen and J. Garofolo, Eds., Multimodal Technologies for Perception of Humans: First International Evaluation Workshop on Classification of Events, Activities and Relationships, CLEAR 2006, vol. 4122 of Lecture Notes in Computer Science, Springer, Berlin, Germany.

[23] R. Stiefelhagen, J. Fiscus, and R. Bowers, Eds., Multimodal Technologies for Perception of Humans, Joint Proceedings of the CLEAR 2007 and RT 2007 Evaluation Workshops, vol. 4625 of Lecture Notes in Computer Science, Springer, Berlin, Germany.