
A Human-Machine Collaborative Approach to Tracking Human Movement in Multi-Camera Video

Philip DeCamp
MIT Media Lab

20 Ames Street, E15-441
Cambridge, Massachusetts 02139

Deb Roy
MIT Media Lab

20 Ames Street, E15-488
Cambridge, Massachusetts 02139

ABSTRACT

Although the availability of large video corpora is on the rise, the value of these datasets remains largely untapped due to the difficulty of analyzing their contents. Automatic video analyses produce low to medium accuracy for all but the simplest analysis tasks, while manual approaches are prohibitively expensive. In the tradeoff between accuracy and cost, human-machine collaborative systems that synergistically combine the two approaches may achieve far greater accuracy than automatic methods at far less cost than manual annotation. This paper presents TrackMarks, a system for annotating the location and identity of people and objects in large corpora of multi-camera video. TrackMarks incorporates a user interface that enables a human annotator to create, review, and edit video annotations, as well as tracking agents that respond fluidly to the user's actions, processing video automatically where possible and making efficient use of available computing resources. In evaluation, TrackMarks is shown to improve the speed of a multi-object tracking task by an order of magnitude over manual annotation while retaining similarly high accuracy.

Categories and Subject Descriptors

H.3.4 [Information Storage and Retrieval]: Systems and Software; H.5.2 [Information Interfaces and Presentations]: User Interfaces

General Terms

Performance, Human Factors, Design

Keywords

video annotation, multiple camera, object tracking, human-machine collaboration

1. INTRODUCTION

The ubiquity of digital video cameras, coupled with the plummeting cost of storage and computer processing, enables new forms of human behavioral analysis that promise to transform forensics, behavioral and social psychology, ethnography, and beyond. Unfortunately, the ability of state-of-the-art automatic algorithms to reliably analyze fine-grained human activity in video is severely limited in all but the most controlled contexts. Object tracking is one of the most active and well-developed areas in computer vision, yet existing systems have significant difficulty processing video that contains adverse lighting conditions, occlusions, or multiple targets in close proximity. In the 2007 CLEAR Evaluation on the Classification of Events, Activities, and Relationships [10], of the six systems applied to tracking persons in surveillance video, the highest accuracy achieved was 55.1% as computed by the MOTA metric [2]. While higher-accuracy systems may exist and progress will clearly continue to be made, the performance gap between human and machine visual tracking for many classes of video is likely to persist for the foreseeable future.

While automated processing may be sufficient for some applications, the focus of this paper is to investigate the use of tracking algorithms for applications that demand a level of accuracy beyond the capability of fully automatic analysis. In practice today, when video data is available but automatic analysis is not an option, manual analysis is the only recourse. Typically, the tools for manual annotation of video are so labor intensive that they bar use in all but the most resource-rich situations. Our aim is to design human-machine collaborative systems that capitalize on the complementary strengths of video analysis algorithms and the deep visual capabilities of human oversight to yield video tracking capabilities at an accuracy-cost tradeoff that is not achievable by full automation or purely manual methods alone.

The human-machine collaborative approach to information technology was first clearly articulated by J. C. R. Licklider [6] as a close and fluid interaction between human and computer that leverages the respective strengths of each. He envisioned a symbiotic relationship in which computers could perform the "formulative tasks" of finding, organizing, and revealing patterns within large volumes of data, paving the way for human interpretation and judgment. As computers have proliferated over the past fifty years, instances of what Licklider might consider collaborative systems have become pervasive, from the computerized braking and suspension in modern cars that help the driver stay on the road to the spell checkers that review each word as the user types it. While human-machine collaboration is now implicit in many fields, the conceptual framework still provides insight when addressing the problem of video annotation. Given that humans can perform visual tasks with great accuracy, and that computers can process video with great efficiency, finding an effective bridge between the two may yield an annotation system that performs with much greater efficiency than manual annotation while making few concessions to accuracy.

In this paper, we present TrackMarks, a human-computer collaborative system designed to annotate large collections of multi-camera video recordings. Specifically, this paper describes TrackMarks as applied to annotating person identity and location, although the system may be adapted to other video analysis tasks. When using TrackMarks, the human annotator begins the process by providing one or more manual annotations on single frames of video. The system then attempts to extend the user annotations into tracklets (partial track segments), filling in any sections of the data that are not completely annotated. To organize this process, the system maintains a prioritized list of annotation jobs and dynamically assigns these jobs to tracking agents (computer processes that perform tracking). A single user can thus trigger a large number of parallel tracking processes. As the system generates tracklets, the user may shift to a verification and correction role. To support this role, the system provides interactive visualization of tracklets as they are being generated, tightly integrated with the ability to issue corrections or additional annotations. With TrackMarks, users can fluidly interleave annotation, verification, and correction.

The development of TrackMarks was motivated by the video analysis challenges posed by the Human Speechome Project (HSP) [8]. The project is an attempt to study child language development through the use of very large collections of longitudinal, densely sampled, audio-video recordings. In a pilot data collection, 11 ceiling-mounted fish-eye lens mega-pixel cameras and 14 boundary-layer microphones were installed in the home of a child. From the child's birth to three years of age, approximately 90,000 hours of 14 frames-per-second video recordings were collected, capturing roughly 70% of the child's waking experience at home during this period. In theory, given such a corpus, a scientific researcher should be able to find and analyze events of interest to identify patterns of physical and social activity, compute aggregate statistics on aspects of behavior, or validate computational models of behavior and development. All of these tasks pose significant problems involving audio-video indexing, retrieval, and analysis. TrackMarks is an attempt to dramatically reduce the cost of preprocessing such video collections so that they may be used for scientific investigation.

There are existing systems that perform video annotation in a collaborative manner. Perhaps the one most similar to our own is described by Agarwala et al. [1]. They describe a system for rotoscoping in which the user annotates object contours for two keyframes, the system interpolates the contours for the frames in between, and the user can then review and correct the generated annotations. This system provides a simpler interaction in which the system performs one tracking job at a time and the user reviews the results when the job is completed. In contrast, TrackMarks focuses on running multiple trackers at once and enabling the user to interact with the tracking processes while they are running.

Other approaches to tracking have included "two-stage tracking" systems, where the first stage involves the generation of high-confidence tracklets, and the second stage determines how to aggregate the tracklets into longer tracks. In [9], the tracklets are combined automatically with a global optimization algorithm. In [5], Ivanov et al. describe a system that performs tracklet aggregation interactively with a human operator, which they refer to as "human guided tracking." In this system, tracklets are first generated from motion sensor data and the system aggregates the tracklets from multiple sensors as fully as it can. When the system cannot identify the correct tracklet sequence with sufficient confidence, such as when two objects come too close together to be disambiguated from the motion data, it presents the user with relevant video recordings and the user indicates which tracklets belong to a given target. While these systems may require far less manual labor than TrackMarks, they rely more on automatic tracking and require high-confidence tracklets. TrackMarks has been developed to address worst-case scenarios, in which sections of data may need to be annotated precisely, frame by frame. The more automatic systems still inform the future directions of TrackMarks, and possibilities for combining more automatic approaches will be discussed in Section 4.1.

2. THE TRACKMARKS SYSTEM

2.1 Design Goals

The purpose of TrackMarks is to identify and track multiple people in recordings taken from multiple cameras placed in connected visual regions, but the system was also designed around several additional objectives:

• Make annotation as efficient as possible while allowing the operator to achieve an arbitrary level of accuracy.

• Annotate camera handovers (when a person moves out of view of one camera and into view of another), occlusions, and absences. (Absent events are defined as cases in which a known target is not in view of any camera.)

• Provide an interface that is responsive enough to support fluid interaction between the operator and system. It is important that the automatic processes not impede the operator's ability to navigate and annotate the video. This requires that the response time for frequent operations remain under a few hundred milliseconds.

• Manage video collections larger than 100 TB. This affects database management, but also, for large corpora, it is unlikely that a present-day computer will be able to access all of the data locally. When video data must be accessed over a network, it poses additional problems in maintaining responsiveness.

2.2 User Interface Overview

Figure 1 shows a screenshot of the TrackMarks interface. The top middle panel shows a typical frame of video in full mega-pixel resolution. In this image, two bounding boxes have been superimposed on the video, indicating the positions of a child and an adult. The colors of the boxes indicate the identity of each person. Video navigation is performed primarily with a jog-shuttle controller.

Figure 1: TrackMarks Interface

The top left panel displays video thumbnails from the other cameras in the house. The user selects the video stream to view by clicking one of these thumbnails.

The bottom panel shows a timeline visualization of the annotations that resembles a subway map. This component summarizes the annotations that have been made, providing the user with a method of identifying and accessing portions of the data that require annotation. The horizontal axis represents time, covering approximately 30 minutes of data in this example. The blue vertical bar indicates the user's position in the video stream. The timeline is divided into horizontal "channels," each representing one camera, demarcated by the thin black lines. The top channel, colored gray, represents the "absent" channel that is used to indicate that a target is not present in any of the recordings. Finally, the thick, colored lines represent the tracklets. As with the bounding boxes superimposed on the video, the tracklets are colored to indicate person identity. The vertical placement of a tracklet indicates the channel, so when a tracklet line makes a vertical jump to another channel, it indicates that the target moved to a different camera at that point in time. As the system generates annotations, the timeline map adds or extends these tracklet lines and the user can monitor the progress of the system. Note that each bounding box shown on the video frame corresponds to a thin slice from one of the tracklet lines on the timeline view.

2.3 Track Representation

Track data is represented in a hierarchical structure with three levels: track points, track segments, and tracklets. At the lowest level, track points correspond to an annotation associated with a single frame of video. For the instance of the system described in this paper, all track points consist of bounding boxes. Track points are grouped into track segments, which represent a set of track points for a contiguous sequence of video frames from a single camera. Track segments may also indicate that a target is occluded or absent for an interval of time, and that no track points are available. Adjoining track segments are grouped into tracklets. Tracklets specify the identity of the target and combine the track data for that target across multiple cameras for a continuous time interval.
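To make the three-level hierarchy concrete, one possible in-memory representation is sketched below (our illustration, not the authors' implementation; the class and field names are hypothetical):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TrackPoint:
    """Bounding-box annotation for a single video frame."""
    frame: int
    x: float
    y: float
    width: float
    height: float

@dataclass
class TrackSegment:
    """Contiguous run of frames from a single camera.

    If `occluded` or `absent` is set, the segment carries no track points
    and simply marks an interval in which the target is not visible."""
    camera_id: int
    start_frame: int
    end_frame: int
    points: List[TrackPoint] = field(default_factory=list)
    occluded: bool = False
    absent: bool = False

@dataclass
class Tracklet:
    """Track data for one identified target over a continuous time interval,
    possibly spanning several cameras via adjoining segments."""
    target_id: int
    key_frame: int                     # the user-created key point (Section 2.3)
    segments: List[TrackSegment] = field(default_factory=list)
```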

Several constraints are placed on the track data to simplify the system. First, only one annotation may be created for a given target in a given time frame. It is not possible to indicate that a target simultaneously occupies more than one location or camera, and multiple tracklets for a given target may not overlap. This precludes the use of multiple hypothesis tracking algorithms or annotating multiple views of the same object captured by different cameras, but it greatly simplifies interaction with the system because the user does not need to review multiple video streams when annotating a given target.

Second, each tracklet has a single key point, which is usually a track point created by the user. The tracklet originates from the key point and extends forward and backward in time from that point. The purpose of the key point is to simplify tracklet editing. When deleting a track point from a tracklet, if the track point is defined after the key point, the portion of the tracklet defined after the deleted point is assumed to be no longer valid and the right side of the tracklet is trimmed.
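A minimal sketch of this trimming rule, operating on a flat frame-to-box mapping rather than the full segment hierarchy (our simplification; the symmetric left-side trim and the handling of deleting the key point itself are our assumptions):

```python
def delete_point(boxes: dict, key_frame: int, frame: int) -> dict:
    """Delete the track point at `frame` from a tracklet.

    `boxes` maps frame index -> bounding box. If the deleted point lies
    after the tracklet's key point, everything to its right is assumed
    invalid and is trimmed; symmetrically (our assumption), deleting a
    point before the key point trims the left side.
    """
    if frame > key_frame:
        return {f: b for f, b in boxes.items() if f < frame}
    elif frame < key_frame:
        return {f: b for f, b in boxes.items() if f > frame}
    else:
        # Assumption: deleting the key point invalidates the whole tracklet.
        return {}
```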

2.4 Annotation Process

This section outlines the annotation process from the view of the human annotator. To begin the process, the user selects an assignment to work on, where the assignment defines an objective for the user and a portion of data to process. The user browses the video and locates a target. Target identification is performed manually. The jog-shuttle controller used to navigate the video has nine buttons that are mapped to the most frequently occurring targets. The user may quickly select the identity by pressing the corresponding target button, or, more slowly, may invoke a popup menu that contains a complete list of targets as well as an option to define new targets. Given the overhead position of the cameras, it is sometimes necessary to browse through a portion of the data before an identification may be made. After identification, the user uses the mouse to draw a bounding box that roughly encompasses the object as it appears in the video. If the annotation appears correct, the user commits the annotation.

When the user commits an annotation, it creates a new track point as well as a new tracklet that consists of only that point. By default, TrackMarks automatically attempts to extend the tracklet bidirectionally. This process is described in Section 2.5. Camera handover is performed manually, and the user must create annotations at time frames in which a target enters or leaves a room. Usually, when annotating a target entering a room, the user defines a tracklet that should be extended forward in time, but not backward, because the target was not previously in view of that camera. For this reason, it is also possible for the user to specify that a tracklet be extended only forward, only backward, or not at all.

While tracking is being performed, the operator may skim the video stream and verify the generated annotations. If a mistake is found, the user may click on the incorrect track point and delete it, causing the tracklet to be trimmed back to that point, or may choose to delete the tracklet altogether. More commonly, the user may draw a new bounding box around the target and commit the correction, causing the incorrect annotation to be deleted and restarting the tracking process for the target at that frame.

In addition to making position annotations, the user may indicate that a target is occluded or absent. Occlusions and absences are indicated in an identical process, and both are referred to as occlusion annotations. Unlike the position annotations that are associated with a single frame, occlusion annotations may cover an arbitrary period of video. Rather than require the user to find both the beginning and end of the annotation, which might require searching through a great deal of video and disrupt the user's workflow, the system lets the user mark each end of the annotation separately. Assuming that the user is navigating forward through a video stream and identifies the first time frame where a target becomes occluded, the annotator selects the target identity and presses a button on the jog-shuttle controller that indicates the selected target is occluded from that frame onward. This creates a new tracklet that starts from the user's time frame and extends as far as possible without overlapping existing annotations for the same target. When the user reaches the later time frame where the target becomes unoccluded, he can create a new annotation at that frame, effectively trimming the original occlusion annotation. In the timeline view shown in Figure 1, occlusion tracklets are distinguished from normal tracklets by using hollow lines.

2.5 Tracking Agents

While there are many possibilities for improving annotation efficiency through faster hardware or improved tracking algorithms, our work focuses on improving performance through structured collaboration. Having an efficient tracker implementation is still important, but large gains can be had by scheduling the tracking processes intelligently to minimize redundant operations, to provide rapid feedback to the user so that errors are corrected quickly, and to prevent the sizable computational resources required for object tracking from interfering with the interface. The key optimization made by TrackMarks, then, is the tracking agent subsystem that manages and executes tracking tasks.

The tracking process may be broken into four steps, in which tracking jobs are defined, prioritized, assigned, and executed. This is not a linear process; jobs may be revised or cancelled at any time. The tracking agent subsystem that handles this process consists of four levels: a job creator, a job delegator, job executors, and trackers.

The job creator receives all requests to expand tracklets and defines a job for each request. For convenience, the backward time direction is referred to as left, and the forward direction as right. Each job is associated with a tracklet r_i, where the purpose of the job is to add annotations that expand the tracklet in either the left or right direction, d_i ∈ {0, 1}. The left and right edges of the tracklet occur at times t_{i,0} and t_{i,1}, respectively, and the job is complete when the edge t_{i,d_i} is extended to a goal time, t_{i,d_i} = t̃_i.

The main parameter to compute is the goal time. t̃_i is selected such that the tracklet is extended as far as possible without overlapping existing tracklets with the same target. Assuming tracklet r_i is being extended to the right (d_i = 1), the job creator finds the first tracklet r_j that exists to the right of r_i and shares the same target. If there is no such r_j, then t̃_i may be set to the end of the available video stream. If r_j does exist, then t̃_i is set to the left edge of that tracklet, t̃_i ← t_{j,0}.

In the case where r_j is being extended backwards by another job, the goal times of both jobs are set to the time halfway between r_i and r_j, splitting the remaining load between the two jobs: t̃_i, t̃_j ← 0.5(t_{i,1} + t_{j,0}).

Tracklets may be created and altered while tracking occurs. The job creator is notified of all such events and recomputes the goal times t̃ for all jobs that might be affected. If the operator deletes an annotation that was created by a job that is still being processed, that job has failed and is immediately terminated.
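The goal-time rule can be summarized in a few lines of code (our sketch; the TrackletSpan fields and the extending_left/extending_right flags are hypothetical stand-ins for the system's bookkeeping):

```python
from dataclasses import dataclass

@dataclass
class TrackletSpan:
    left: float                      # t_{i,0}: left edge time
    right: float                     # t_{i,1}: right edge time
    extending_left: bool = False     # a backward job is active on this tracklet
    extending_right: bool = False    # a forward job is active on this tracklet

def compute_goal_time(tr: TrackletSpan, direction: int, same_target: list,
                      video_start: float, video_end: float) -> float:
    """Goal time for a job extending `tr` (direction 1 = right, 0 = left).

    The tracklet is extended as far as possible without overlapping another
    tracklet of the same target; if the neighbor is being extended toward
    us, the remaining gap is split evenly between the two jobs.
    """
    if direction == 1:
        neighbors = [t for t in same_target if t is not tr and t.left >= tr.right]
        if not neighbors:
            return video_end
        nb = min(neighbors, key=lambda t: t.left)
        return 0.5 * (tr.right + nb.left) if nb.extending_left else nb.left
    else:
        neighbors = [t for t in same_target if t is not tr and t.right <= tr.left]
        if not neighbors:
            return video_start
        nb = max(neighbors, key=lambda t: t.right)
        return 0.5 * (nb.right + tr.left) if nb.extending_right else nb.right
```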

All existing jobs are given to the job delegator, which schedules jobs for execution. In the case of TrackMarks, the time required to perform person tracking is a small fraction of the time required for retrieving and decoding the video. Consequently, the greatest gains in efficiency result from tracking as many targets as possible for each frame of video retrieved. The job delegator exploits this by mapping all jobs into job sets that may be executed simultaneously by processing a single contiguous segment of video, prioritizing each job set, and assigning the highest priority sets to the available executors.

The process of grouping jobs into sets can be formulated as a constraint satisfaction problem. Each job b_i is associated with a segment of video to process, defined by a camera c_i and a time interval s_i. For forward jobs (d_i = 1), the interval is s_i = [t_{i,1}, t̃_i]; for backward jobs (d_i = 0), s_i = [t̃_i, t_{i,0}]. Each job set B must meet three constraints:

• Forward and backward tracking jobs cannot be combined, nor can jobs that access video streams generated from different cameras: ∀ b_i, b_j ∈ B, d_i = d_j ∧ c_i = c_j.

• Jobs with non-overlapping intervals cannot be processed together. The union of the time intervals for all jobs in a set, ∪_{b_i ∈ B} s_i, must itself be a contiguous interval.

• A job cannot belong to a set if the tracklet edge being extended by that job is too distant from the other jobs in the set: ∀ b_i ∈ B, ∃ b_j ∈ B, j ≠ i, such that |t_{i,d_i} − t_{j,d_j}| < τ, where τ is a chosen parameter.

When a job b_i is first created, the delegator attempts to locate an existing job set B to which the job may be added without violating any constraints. If no such set can be found, a new set is created containing only that job. Each time a job is added or otherwise altered, the constraints are reapplied, causing job sets to be split and joined as needed. The algorithm used to solve this constraint problem involves implementation details and is of less interest than the formulation of the problem itself.
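The three membership constraints can be expressed as a single predicate (our formulation; the Job fields mirror the notation above: d for direction, c for camera, interval for s_i, and edge for the tracklet edge being extended):

```python
from typing import List, NamedTuple, Tuple

class Job(NamedTuple):
    d: int                            # direction: 1 = forward, 0 = backward
    c: int                            # camera id
    interval: Tuple[float, float]     # s_i = [start, end] of video to process
    edge: float                       # time of the tracklet edge being extended

def can_join(job: Job, job_set: List[Job], tau: float) -> bool:
    """True if `job` may be added to `job_set` without violating the
    constraints of Section 2.5 (assumes `job_set` already satisfies them)."""
    if not job_set:
        return True
    # 1. Same direction and same camera for every pair of jobs.
    if any(j.d != job.d or j.c != job.c for j in job_set):
        return False
    # 2. The union of intervals must remain one contiguous interval:
    #    the new interval must touch the hull of the existing intervals.
    lo = min(j.interval[0] for j in job_set)
    hi = max(j.interval[1] for j in job_set)
    if job.interval[1] < lo or job.interval[0] > hi:
        return False
    # 3. The extended edge must be within tau of at least one other job.
    return any(abs(job.edge - j.edge) < tau for j in job_set)
```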

Figure 2: Tracking Agent Subsystem. The job creator manages individual tracking jobs. The delegator organizes the jobs into sets. Each executor processes one set of jobs, feeding video frames to one tracker per job.

Each time a job set is altered, the job delegator recomputes a priority for that set. The priority is based first on processing job sets such that they are likely to reach another job set, allowing them to merge and thus reduce computational load. Second, higher priority is given to job sets that are processing data at earlier time frames, encouraging tracking to be performed from left to right. This criterion helps to reduce the fragmentation of tracklets and reduces the amount of time the user expends reviewing video. Other criteria might also be incorporated at this stage; for example, giving greater priority to the most recently created jobs might help to place trackers near the locations that the user is annotating, improving the perceived responsiveness of the system.
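One way to express this ordering is as a sort key (our sketch; will_reach_another_set and start_time are hypothetical attributes of a job-set object):

```python
def priority_key(job_set):
    """Sort key for job sets: sets likely to reach (and merge with) another
    set come first; ties are broken by earlier start time, encouraging
    left-to-right tracking and reducing tracklet fragmentation."""
    return (0 if job_set.will_reach_another_set else 1, job_set.start_time)

# Usage: job_sets.sort(key=priority_key)  # then assign the top n sets to executors
```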

After the job sets are prioritized, the top n sets are assigned to job executors. The parameter n thus determines the number of tracking threads running at any point in time, and the portion of resources allocated to tracking. Each executor runs in an isolated thread, executing all jobs in its set b_i ∈ B simultaneously. The executor initializes a tracker for each job. Job sets may be altered while the executor is running, and not all jobs in the set are necessarily aligned. For forward tracking job sets (d_i = 1 for all b_i ∈ B), the executor always retrieves the earliest frame of video still to be processed by any job, t_f = min_{b_i ∈ B} t_{i,1}, passing that frame to each tracker that has not already processed it. The resulting behavior is that the executor may sometimes "jump backward" in the video stream and reprocess a segment of video until the earliest job catches up to the others.
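A simplified, single-threaded sketch of the executor's frame-feeding loop for a forward job set (our illustration; decode_frame and the job/tracker fields are hypothetical):

```python
def run_forward_job_set(jobs, decode_frame):
    """Execute a set of forward-tracking jobs over one video stream.

    Each job carries a `tracker`, a current right edge `t_right` (frame
    index), and a goal frame `t_goal`. Every iteration decodes the earliest
    frame needed by any unfinished job, so a lagging job causes the executor
    to "jump backward" and reprocess video until it catches up.
    """
    while True:
        active = [j for j in jobs if j.t_right < j.t_goal]
        if not active:
            break
        # t_f = min over active jobs of the current right edge (Section 2.5).
        t_f = min(j.t_right for j in active)
        frame = decode_frame(t_f + 1)          # next frame after that edge
        for job in active:
            if job.t_right <= t_f:             # has not processed this frame yet
                point = job.tracker.track(frame)
                job.append_point(point)
                job.t_right = t_f + 1
```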

The trackers are conceived as passive components that produce one track point, indicating the location of one target, for each video frame provided. The position tracker used is built on the mean-shift tracking algorithm [4]. Mean-shift tracking is performed by adjusting the size and location of a bounding box to maintain a consistent distribution of color features within the box. This algorithm was chosen partly for its relatively fast speed. However, it was also chosen over motion-based algorithms because it was predicted that people in a home environment would spend much more time sitting in place, as compared to people in a public space, who might spend more time traveling. Additionally, when tracking a child being held by a caregiver, the two targets may be separated more reliably by color than by motion. While the implementation relies more strongly on color features, motion features were not discounted entirely, and the tracking algorithm does incorporate a fast foreground-background segmentation step to improve performance for targets in motion.
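For reference, a minimal color-histogram mean-shift tracker in the spirit of [4] can be assembled with OpenCV. This is a generic illustration rather than the authors' implementation: it omits the foreground-background segmentation step described above, and plain mean shift keeps the window size fixed (adapting the box size as described would require an extension such as CAMShift):

```python
import cv2
import numpy as np

class MeanShiftTracker:
    """Tracks one target by shifting a window to follow its hue histogram."""

    def __init__(self, frame, box):
        x, y, w, h = box
        self.window = (x, y, w, h)
        roi = frame[y:y + h, x:x + w]
        hsv = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
        # Ignore dark or desaturated pixels when building the color model.
        mask = cv2.inRange(hsv, np.array((0., 60., 32.)), np.array((180., 255., 255.)))
        self.hist = cv2.calcHist([hsv], [0], mask, [180], [0, 180])
        cv2.normalize(self.hist, self.hist, 0, 255, cv2.NORM_MINMAX)
        self.term = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)

    def track(self, frame):
        """Return the updated bounding box (x, y, w, h) for the new frame."""
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        back_proj = cv2.calcBackProject([hsv], [0], self.hist, [0, 180], 1)
        _, self.window = cv2.meanShift(back_proj, self.window, self.term)
        return self.window
```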

The end result of the tracking agent system is a structured but fluid interaction between the user and system. As the user creates annotations, TrackMarks propagates them to create a set of tracklets that cover the full range of data for all targets. As it performs tracking, TrackMarks attempts to reduce the fragmentation of the data, combines tracking jobs when possible to greatly reduce the computation time required, and continuously indicates tracking progress in the user interface. When the user makes a correction, the tracking agents remove any jobs that are no longer useful and create new jobs to propagate the correction. The tracking agents afford the user flexibility in annotating the data. In difficult cases where the tracking fails often, the user primarily focuses on reviewing and correcting annotations. In other instances, the user may skim through the video, provide a few annotations, and return later to review the results. In the worst cases, in which tracking fails completely, the user still has the ability to annotate manually.

3. EVALUATION

To test the efficiency of TrackMarks, three hours of multi-channel video recordings were selected from the Speechome corpus for transcription. The selected data was broken into six 30-minute assignments. To select assignments of interest that contained high levels of activity, the data was selected from the 681 hours of data for which speech transcripts were available, allowing the activity levels of the assignments to be estimated by examining the amount of speech activity. The data from this subset was collected over a 16-month period, during which a child and family were recorded while the child was between 9 and 24 months of age. The 1362 assignments were ordered in a list according to a combination of activity level and distribution of collection dates, and a sequence of six assignments was selected from near the top of that list (positions 18–24). Figure 3 shows several video frames taken from the transcribed data.

Annotators were given the same objective for all assignments: to fully annotate the position and identity of the child for the entire assignment, and to annotate all other persons within sight of the child's position, including persons visible to the child through open doorways. The annotators were instructed to fix all incorrect annotations. Annotations (drawn as bounding boxes) were deemed correct if the box contained at least half of the target object and if it would not be possible to reduce the area of the bounding box by half without also removing part of the object. Targets were marked occluded if less than half of the target object was visible, and absent if the target was not present in any rooms being recorded.

Figure 3: Example Annotations

                      mean     min      max      std
A1 time (minutes)     64.1     31.0     106      27.1
A2 time (minutes)     44.2     22.5     62.7     14.8
Strict IAA MOTA       88.1%    77.4%    98.7%    7.11%
IAA MOTA              97.3%    86.3%    99.9%    5.40%
IAA MOTP              12.3 px  8.33 px  17.2 px  3.52 px
Object Count          47,482   15,381   71,320   20,277
Objects Per Frame     1.88     0.610    2.83     0.805
Manual Annotations    316      181      466      117
Forced Errors         65.7     38.0     90.0     18.9
Unforced Errors       250      143      376      98.4

Table 1: Annotation Trial Results


The evaluation was performed by two annotators. Annotator A1 (the first author) had previously annotated approximately five hours of video with TrackMarks. Annotator A2 was given two warmup assignments that are omitted from the evaluation. Results are shown in Table 1.

The first two rows present the annotation time required by the two annotators, A1 and A2. The speed of the annotators fell between 28.3% and 133% of the speed required to annotate the data in realtime, with an overall mean of 55.4%. That is, 108 minutes of labor was required on average to annotate 60 minutes of recordings.

Rows three through five present the inter-annotator agreement computed with the CLEAR MOT metrics [10]. The MOT metrics provide a standardized method for computing the accuracy (MOTA) and precision (MOTP) of multi-object tracking systems. The MOTA score indicates the percentage of correct matches, taking into account the number of misses, false positives, and target mismatches. The MOTP score indicates the average error across all correct matches. These metrics are designed to generalize to a wide range of tracking tasks, and require an application-specific error function to compute the error between two data points in a given time frame. The error function used for this evaluation considers two data points, bounding boxes p_1 and p_2, to be a match if p_1 intersects p_2. If p_1 and p_2 match, then the error between them is defined as the Euclidean distance between their respective centroids. In standard applications of the MOT metrics, one set of tracks must be provided as ground truth and another set for evaluation. To adapt the metrics for inter-annotator agreement, we compute the scores twice, each time using the data provided by a different annotator as the ground truth, and average the results. Additionally, because only the child target was required to be annotated completely and there was greater flexibility for annotating the other targets, we compute inter-annotator agreement only on the tracks associated with the child target.
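The per-frame match test and error function used in this evaluation can be written directly (our sketch; boxes as (x, y, w, h) tuples; the full MOTA/MOTP bookkeeping of the CLEAR metrics is omitted):

```python
def boxes_match(p1, p2):
    """Two annotations match if their bounding boxes intersect."""
    x1, y1, w1, h1 = p1
    x2, y2, w2, h2 = p2
    return (x1 < x2 + w2 and x2 < x1 + w1 and
            y1 < y2 + h2 and y2 < y1 + h1)

def match_error(p1, p2):
    """Error between matched boxes: Euclidean distance of their centroids."""
    x1, y1, w1, h1 = p1
    x2, y2, w2, h2 = p2
    cx1, cy1 = x1 + w1 / 2.0, y1 + h1 / 2.0
    cx2, cy2 = x2 + w2 / 2.0, y2 + h2 / 2.0
    return ((cx1 - cx2) ** 2 + (cy1 - cy2) ** 2) ** 0.5
```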

The Strict IAA MOTA figures, computed as defined above, show the accuracy of one annotator compared to the other; the average was 88.1%. The primary source of disagreement was in the determination of whether a target was visible, occluded, or absent. While the evaluation task defined that an object could be marked occluded if less than half of the object was visible, this rule proved difficult to apply in practice. For example, for approximately 15 minutes of the evaluation data, the child target was in a crib and was only visible through the slats of the crib. Other portions of data contained the child repeatedly covering and uncovering his head with a blanket in both walking and lying positions, making it unclear whether the blanket should be considered an occlusion or clothing. Most of all, if the automatic tracking was performing well despite significant occlusions, annotators did not consistently mark the occlusion, since doing so would mean deleting otherwise valid position annotations. Another source of disagreement was in determining the exact point of camera handoff, since the system forces the annotator to associate a given target with at most one camera at a time.

To get another sense of accuracy, the second IAA MOTA figures were computed on only the time frames where both annotators agreed that the child target was visible. Furthermore, the accuracy was not computed for time frames that occurred within five seconds of a camera handover. The resulting accuracy of 97.3% shows that the majority of the disagreement was caused by the added ambiguity of explicitly marking occlusions and camera handovers. On average, this process culled 39% of the annotations, which might be expected for a child target that is often being held or occluded by the environment, but it indicates that more stringent guidelines are needed to annotate multi-camera video consistently.

The IAA MOTP shows that the average distance between annotation centroids was 12.3 pixels, approximately 1.3% of the width of the 960 by 960 pixel video frames. Because the video was recorded with a fish-eye lens, the distance represented by 12.3 pixels varies considerably depending on the pixel position, room geometry, and pose of the target. Because precision is computed only between correct matches, the precision scores computed for the strict agreement deviated very little from those reported.

Referring back to Table 1, the bottom five rows contain statistics for the assignment tasks, all of which are averaged across the results from both annotators. Object Count refers to the total number of objects annotated for the entire assignment, not counting occluded and absent annotations. Objects Per Frame indicates the average number of objects occurring in each time frame. Because the cameras used were not synchronized at a per-frame level, the number of time frames in one half hour was approximated as 25,000.

Manual Annotations indicates the total number of manual annotations created by the annotator for an assignment. Each manual annotation may be thought of as a system error, in which the system was unable to produce a correct annotation and required human intervention. We define two kinds of errors: forced and unforced. Forced errors occur on camera handovers and on object transitions between visible, occluded, and absent. These errors are considered forced because the system is not designed to identify these events, and even with perfect tracking they would still require manual intervention. The number of forced errors, then, is the theoretical minimum number of manual annotations required if perfect, single-camera tracking were available. Unforced errors occur when the automatic tracking fails on a frame containing a visible target and requires manual correction. These are errors that might be reduced with a better tracker. In this evaluation, annotators were instructed to achieve accuracy and speed, not to minimize the number of annotations made. Consequently, the number of unforced errors can only be roughly estimated by subtracting the number of forced errors from the total number of annotations. Although the unforced error figures are less certain than the others, they suggest that the amount of human intervention could be greatly reduced by more accurate tracking.

In the second evaluation, we compare the efficiency of TrackMarks to a fully manual approach. A simple system was constructed that provides an extremely streamlined procedure for manual annotation: the system presents one frame of video to the annotator, the annotator locates and draws a bounding box around the child target, and the system presents the next frame of video. The video frames are presented in sequence, so the primary limit was the speed of physically drawing the bounding box. In one ten-minute trial, an annotator (the first author), working at a rapid but comfortable pace, produced annotations for 846 objects. In the evaluation of TrackMarks, each 30-minute assignment contained an average of 47,482 objects, suggesting that the minimum time needed to annotate such an assignment manually would be 561 minutes, about 10.4 times as much human labor as required by TrackMarks. While there are many possibilities for expediting manual annotation, it is an inherently tedious task, both physically and mentally. A more comprehensive analysis would likely need to include estimates of the amount of rest annotators require between sessions, and adjustments to the pay needed to retain annotators. Even in the absence of such an evaluation, we strongly believe that manual approaches will be limited to a fraction of the efficiency of human-machine collaboration, and that the difference will grow rapidly as technologies improve.
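For reference, these figures follow from the counts above, taking the mean assignment time across both annotators from Table 1 (our reconstruction of the arithmetic, which reproduces the reported values):

\[
47{,}482 \times \frac{10\ \text{min}}{846} \approx 561\ \text{min},
\qquad
\frac{561\ \text{min}}{(64.1 + 44.2)/2\ \text{min}} \approx 10.4 .
\]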

Last, we compare the accuracy of TrackMarks to that of a fully automated tracker. The automatic tracker evaluated was implemented with SwisTrack [7], an open source software package containing many commonly used tracker components. Out of several configurations, we achieved the best results by detecting objects using motion templates [3] and combining the objects into tracks with a nearest-neighbor tracker. 140 hours of single-channel video taken from the Speechome corpus were processed with the automatic tracker. The half-hour block that contained the greatest number of identified objects was selected and annotated using TrackMarks. The MOT metrics were again used to determine accuracy and precision. In this evaluation, the TrackMarks annotations were used as ground truth. The annotations produced by TrackMarks consist of bounding boxes m_i around an object, while the tracks from the automatic tracker consist of points n_i that represent object centroids. Because of this discrepancy, an error function was used that differs from that used in the first evaluation. If m_i is a truth annotation and n_i is a test annotation occurring in the same time frame, then n_i is considered a match to m_i if n_i is contained by m_i scaled by 1.5 about its center, and the error between m_i and n_i is the Euclidean distance between n_i and the center of m_i. The performance of the automatic tracker is shown in Table 2.
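This modified match and error rule can be sketched as follows (our illustration; boxes as (x, y, w, h) tuples, points as (x, y)):

```python
import math

def point_matches_box(point, box, scale=1.5):
    """A detection point matches a truth box if it falls inside the box
    scaled by `scale` about its center."""
    px, py = point
    x, y, w, h = box
    cx, cy = x + w / 2.0, y + h / 2.0
    half_w, half_h = (w * scale) / 2.0, (h * scale) / 2.0
    return abs(px - cx) <= half_w and abs(py - cy) <= half_h

def point_error(point, box):
    """Error for a matched pair: distance from the point to the box center."""
    px, py = point
    x, y, w, h = box
    cx, cy = x + w / 2.0, y + h / 2.0
    return math.hypot(px - cx, py - cy)
```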

MOTA             48.0%
MOTP             32.8 px
Objects          67,231
Matches          37,323
Misses           29,908
False Positives  4,634
Mismatches       470

Table 2: Automatic Tracker Performance

The performance of the automatic tracker was comparable to that of other multi-object tracking systems applied to similar tasks [10]. The overwhelming cause of errors was misses, at 29,908, followed by false positives at 4,634, and then mismatches at 470, where mismatches are defined as instances in which a generated tracklet switched from following one target to another. The MOTA achieved by the automatic tracker, 48.0%, is 49.3 percentage points lower than the inter-annotator MOTA achieved by TrackMarks, and 40.1 percentage points lower than the strict inter-annotator MOTA. The average error between matches was 32.8 pixels, compared to 12.3 pixels for the inter-annotator precision of TrackMarks.

4. DISCUSSION

For tracking tasks that require low to moderate accuracy, automatic tracking is still the most cost-effective solution. However, high error rates severely limit the usefulness and possible applications of video analysis. At the other extreme, manual annotation requires a great deal of labor, limiting this approach to projects that can absorb the high cost of human annotators.

We have established TrackMarks as a viable approach to annotation that achieves a balance between cost and accuracy, creating new possibilities for video analysis. TrackMarks provides a level of accuracy far closer to that of manual annotation than of automatic tracking, while providing efficiency that is an order of magnitude greater than fully manual processing. Furthermore, TrackMarks makes it efficient to explicitly mark object occlusions and absences, providing an additional layer of information that can be used for analysis when more descriptive annotations cannot be created. Indeed, this feature was an important objective for the behavioral analysis research for which TrackMarks was developed.

While we have focused on person tracking as an example of video processing that can benefit from human-machine collaboration, we believe this approach will become increasingly valuable as video analysis becomes feasible for a growing number of applications. Even as object tracking algorithms improve, new applications will arise that make use of increasingly detailed information, such as head or hand pose. Researchers and analysts will require human vision faculties to process video for the foreseeable future, which may establish an important niche for human-machine collaborative systems like TrackMarks.

4.1 Future Directions

This research has focused primarily on the interaction between the computer and human, and has tried to maximize the usefulness of a relatively modest object tracker. The most obvious improvements involve increasing the accuracy and speed of the tracker processes to reduce the time and effort required of the human operator. Trackers that produce reliable confidence scores introduce a whole range of possibilities for more intelligent tracklet editing. Identifying segments of data that have a high probability of errors provides a way to bring the human's attention where it is needed most, reducing the time spent on reviewing data. Confidence also provides methods for automatically joining or splitting tracklets. In the current implementation, removing one incorrect annotation causes the entire tracklet to be trimmed to that point, the assumption being that it is more efficient to reprocess the segment of video than to require the user to find all subsequent errors in the tracklet. In many cases, cutting the tracklet based on confidence may prevent redundant processing.

However, there may be much larger gains to be found by focusing on the higher-level tracking logic. This includes introducing automatic person detection, enabling the system to perform more tracking operations without manual initialization. In the case where the camera positions are well known, this leads naturally to automatic camera handover. As tracking performance improves, much of the processing could be performed offline, before the human annotator looks at the data, similar to the two-stage tracking systems referred to in Section 1. In the case where the video contains longitudinal recordings of a limited number of targets, more detailed target models may be learned that help classify identity based on both appearance and behavioral routines.

5. ACKNOWLEDGMENTS

We thank Yuri Ivanov for his ideas and his help in framing this paper; Kleovoulos Tsourides for lending his time to annotate data; and Stefanie Tellex for providing an object tracker and data. This research was funded in part by a grant from the U.S. Office of Naval Research.

6. REFERENCES

[1] A. Agarwala, A. Hertzmann, D. H. Salesin, and S. M. Seitz. Keyframe-based tracking for rotoscoping and animation. ACM Trans. Graph., 23(3):584–591, 2004.

[2] K. Bernardin and R. Stiefelhagen. Evaluating multiple object tracking performance: the CLEAR MOT metrics. J. Image Video Process., 2008(3):1–10, 2008.

[3] G. Bradski and A. Kaehler. Learning OpenCV: Computer Vision with the OpenCV Library. O'Reilly, Cambridge, MA, 2008.

[4] D. Comaniciu, V. Ramesh, and P. Meer. Kernel-based object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2003.

[5] Y. Ivanov, A. Sorokin, C. Wren, and I. Kaur. Tracking people in mixed modality systems. In SPIE Conference on Visual Communications and Image Processing (VCIP), volume 6508 of Proc. SPIE, January 2007.

[6] J. C. R. Licklider. Man-computer symbiosis. IRE Transactions on Human Factors in Electronics, HFE-1:4–11, March 1960.

[7] T. Lochmatter, P. Roduit, C. Cianci, N. Correll, J. Jacot, and A. Martinoli. SwisTrack - a flexible open source tracking software for multi-agent systems. In IEEE/RSJ 2008 International Conference on Intelligent Robots and Systems (IROS 2008), pages 4004–4010. IEEE, 2008.

[8] D. Roy, R. Patel, P. DeCamp, R. Kubat, M. Fleischman, B. Roy, N. Mavridis, S. Tellex, A. Salata, J. Guinness, M. Levit, and P. Gorniak. The Human Speechome Project. In P. Vogt et al., editors, Symbol Grounding and Beyond: Proceedings of the Third International Workshop on the Emergence and Evolution of Linguistic Communication, pages 192–196. Springer, 2006.

[9] V. Singh, B. Wu, and R. Nevatia. Pedestrian tracking by associating tracklets using detection residuals. In Motion and Video Computing, pages 1–8, January 2008.

[10] R. Stiefelhagen, K. Bernardin, R. Bowers, R. T. Rose, M. Michel, and J. Garofolo. The CLEAR 2007 Evaluation. Springer-Verlag, Berlin, Heidelberg, 2008.