
Image and Vision Computing 30 (2012) 966–977

Multiple human tracking in high-density crowds

Irshad Ali, Matthew N. Dailey
Computer Science and Information Management Program, Asian Institute of Technology (AIT), Pathumthani, Thailand

Article history: Received 19 January 2012; Received in revised form 26 July 2012; Accepted 22 August 2012

Keywords: Head detection; Pedestrian tracking; Crowd tracking; Particle filters; 3D object tracking; 3D head plane estimation; Human detection; Least-squares plane estimation; AdaBoost detection cascade

Abstract

In this paper, we introduce a fully automatic algorithm to detect and track multiple humans in high-density crowds in the presence of extreme occlusion. Typical approaches such as background modeling and body part-based pedestrian detection fail when most of the scene is in motion and most body parts of most of the pedestrians are occluded. To overcome this problem, we integrate human detection and tracking into a single framework and introduce a confirmation-by-classification method for tracking that associates detections with tracks, tracks humans through occlusions, and eliminates false positive tracks. We use a Viola and Jones AdaBoost detection cascade, a particle filter for tracking, and color histograms for appearance modeling. To further reduce false detections due to dense features and shadows, we introduce a method for estimation and utilization of a 3D head plane that reduces false positives while preserving high detection rates. The algorithm learns the head plane from observations of human heads incrementally, without any a priori extrinsic camera calibration information, and only begins to utilize the head plane once confidence in the parameter estimates is sufficiently high. In an experimental evaluation, we show that confirmation-by-classification and head plane estimation together enable the construction of an excellent pedestrian tracker for dense crowds.

© 2012 Elsevier B.V. All rights reserved.

1. Introduction

As public concern about crime and terrorist activity increases, the importance of security is growing, and video surveillance systems are increasingly widespread tools for monitoring, management, and law enforcement in public areas. Since it is difficult for human operators to monitor surveillance cameras continuously, there is strong interest in automated analysis of video surveillance data. Some of the important problems include pedestrian tracking, behavior understanding, anomaly detection, and unattended baggage detection. In this paper, we focus on pedestrian tracking.

Automatic pedestrian detection and tracking is a well-studied problem in computer vision research, but the solutions proposed thus far are only able to track a few people. Inter-object occlusion, self-occlusion, reflections, and shadows are some of the factors making automatic detection and tracking of people in crowds difficult. The pedestrian tracking problem is especially difficult when the task is to monitor and manage a large crowd in gathering areas such as airports and train stations. See the example shown in Fig. 1. There has been a great deal of progress in recent years, but still, most state-of-the-art systems are inapplicable to large crowd management situations because they rely on either background modeling [1–5], body part detection [3,6], or body shape models [7,8,1]. These techniques are not applicable to heavily crowded scenes in which the majority of the scene is in motion (rendering background modeling useless) and most human bodies are partially or fully occluded. Under these conditions, we believe that the head is the only body part that can be robustly detected and tracked. In this paper we therefore present a method for tracking pedestrians that detects and tracks heads rather than full bodies. The main contributions of our work are as follows:

1. We combine a head detector and particle filter to track multiple people in high-density crowds.

2. We introduce a method for estimation and utilization of a head plane parallel to the ground plane at the expected human height that is extracted automatically from observations from a single, uncalibrated camera. The head plane is estimated incrementally, and when the confidence in the estimate is sufficiently high, we use it to reject likely false detections produced by the head detector.

3. We introduce a confirmation-by-classification method for tracking that associates detections with tracks over an image and handles occlusions in a single step.

Our system assumes a single static uncalibrated camera placed at a sufficient height so that the heads of people traversing the scene can be observed. For detection we use a standard Viola and Jones Haar-like AdaBoost cascade [9], but the detector could be replaced generically with any real-time detector capable of detecting heads in crowds.


Fig. 1. A sample frame from the Mochit station dataset.

For tracking we use a particle filter [10,11] for each head that incorporates a simple motion model and a color histogram-based appearance model.

The main difficulty in using a generic object detector for human tracking is that the detector's output is unreliable; all detectors make errors. We have a tradeoff between detection rates and false positive rates: when we try to increase the detection rate, in most cases we also increase the false positive rate. However, we can alleviate this dilemma when scene constraints are available; detections inconsistent with scene constraints can be rejected without affecting the true detection rate. One such constraint is 3D scene information. We propose a technique that neither assumes known scene geometry nor computes interdependencies between objects. We merely assume the existence of a head plane that is parallel to the ground plane at the average human height. Nearly all human heads in a crowded scene will appear within a meter or two of this head plane. If the relationship between the camera and the head plane is known, and the camera's intrinsic parameters are known, we can predict the approximate size of a head's projection into the image plane, and we can use this information to reject inconsistent candidate trajectories or only search for heads at appropriate scales for each position in the image. To find the head plane, we run our head detector over one or more images of a scene at multiple scales, compute the 3D position of each head based on an assumed real-world head size and the camera's intrinsics, and then we find the head plane using robust nonlinear least squares.

When occlusion is not a problem, constrained head detection works fairly well, and we can use the detector to guide the frame-to-frame tracker using simple rules for data association and elimination of false tracks due to false alarms in the detector. However, when partial or full occlusions are frequent, data association becomes critical, and simple matching algorithms no longer work. False detections often misguide tracks, and tracked heads are frequently lost due to occlusion. To address these issues, we introduce a confirmation-by-classification method that performs data association and occlusion handling in a single step. On each frame, we first use the detector to confirm the tracking prediction result for each live trajectory, then we eliminate live trajectories that have not been confirmed for some number of frames. This process allows us to minimize the number of false positive trajectories without losing track of heads that are occluded for short periods of time.

In an experimental evaluation, we find that the proposed method provides for effective tracking of large numbers of people in a crowd. Using the automatically-extracted 3D head plane information improves accuracy, reducing false positive rates while preserving high detection rates. To our knowledge, this is the largest-scale individual human tracking experiment performed thus far, and the results are extremely encouraging. In future work, with further algorithmic improvements and runtime optimization, we hope to achieve robust, real time pedestrian tracking for even larger crowds.

The paper is organized as follows: in Section 2, we provide a brief survey of related work. Section 3 describes the detection and tracking algorithms in detail. In Section 4, we describe an experimental evaluation of the algorithm. Section 5 concludes the paper.

    2. Related work

In this section, we provide a summary of related work. While space limitations make it impossible to provide a complete survey, we identify the major trends in the research on tracking pedestrians in crowds.

In crowds, the head is the most reliably visible part of the human body. Many researchers have attempted to detect pedestrians through head detection. Zhao et al. [1,12] detect heads from foreground boundaries, intensity edges, and foreground residues (foreground regions with previously detected object regions removed). Wu and Nevatia [2] detect humans using body part detection. They train their detector on examples of heads and shoulders as well as other body parts. These methods use background modeling, so while they are effective for isolated pedestrians or small groups of people, they fail in high density crowds. For a broad view of pedestrian detection methods, see the recent survey by Dollár et al. [13]. To attain robust head detection in high density crowds, we use a Viola and Jones AdaBoost cascade classifier using Haar-like features [9,14]. We train the AdaBoost cascade offline, then, at runtime, we use the classifier as a detector, running a sliding window over the image at the specific range of scales expected for the scene.

For tracking, we use a particle filter [11]. The particle filter or sequential Monte Carlo method was introduced to the computer vision community by Isard and Blake [10,15] and is well known to enable robust object tracking (see e.g. [16–21]). In this paper, we use the object detector to guide the tracker. To track heads from frame to frame, we use the standard approach in which the uncertainty about an object's state (position) is represented as a set of weighted particles, each particle representing one possible state. The filter propagates particles from one frame to the next using a motion model, computes a weight for each propagated particle using a sensor or appearance model, then resamples the particles according to their weights. The initial distribution for the filter is centered on the location of the object the first time it is detected.

Building on recent advances in object detection, many researchers have proposed tracking methods that utilize object detection. These algorithms use appearance, size, and motion information to measure similarities between detections and trajectories. Many solutions exist for the problem of data association between detections and trajectories. In the joint probabilistic data association filter approach [22], joint posterior association probabilities are computed for multiple targets in Poisson clutter, and the best possible assignment is made on each time step. Reid [23] generates a set of data-association hypotheses to account for all possible origins of every measurement over several time steps. The well-known Hungarian algorithm [24] can also be used for optimal matching between all possible pairs of detections and live tracker trajectories. Very recent work [16,25] uses an appearance-based classifier at each time step to solve the data association problem. In dense crowds, where appearance ambiguity is high, it is difficult to use global appearance-based data association between detections and trajectories; spatial locality constraints need to be exploited. In this work, we combine head detection with particle filters to perform spatially-constrained data association. We use the particle filter to constrain the search for a detection for each trajectory.

Rodriguez et al. [26] first combine crowd density estimates with individual person detections and minimize an energy function to jointly optimize the estimates of the density and locations of individual people in the crowd. In a second part, the authors use the scene geometry method proposed by Hoiem et al. [27,28]. They select a few detected heads and compute vanishing lines to estimate the camera height. After getting the camera height, the authors estimate the 3D locations of detected heads and compare each 3D location with the average human height to reject head detections inconsistent with the scene geometry. Rodriguez et al. only estimate the head plane in order to estimate the camera height, whereas we estimate the head plane incrementally from a series of head detections, then use the head plane to reject detections once sufficient confidence in the head plane estimate is achieved. The Rodriguez et al. method compares detections to a fixed average human height to reject inconsistent heads. Our method is more adaptive and is only used if sufficient confidence in the head plane estimate is achieved.

The multiple-view approach takes input from two or more cameras, with or without overlapping fields of view. Several research groups use the multiple-view approach to track people [29–32]. In most cases, evidence from all cameras is merged to avoid the occlusion problem. The multiple-view approach can reduce the ambiguity inherent in a single camera view and can help solve the problem of partial or full occlusion in dense crowds, but in many real environments, it is not feasible.

Several research groups have proposed the use of 3D information for segmentation, for occlusion reasoning, and to recover 3D trajectories. Lv, Zhao and Nevatia [33] propose a method for auto calibration from a video of a walking human. First, they detect the human's head and legs at leg-crossing phases using background subtraction and temporal analysis of the object shape. Then they locate the head and feet positions from the principal axis of the human blob. Finally, they find vanishing points and calibrate the camera. The method is based on background modeling and shape analysis and is effective for isolated pedestrians or small groups of people, but these techniques fail in high density crowds. Zhao, Nevatia, and Wu [1] use a known ground plane for segmentation and to reduce inter-object occlusion; Rosales and Sclaroff [34] recover 3D trajectories using an extended Kalman filter (EKF); Leibe, Schindler, and van Gool [35] formulate object detection and trajectory estimation as a coupled optimization problem on a known ground plane. Hoiem, Efros, and Hebert [27,28] first estimate rough surface geometry in the scene and then use this information to adjust the probability of finding a pedestrian at a given image location. In other words, they estimate possible object locations before applying an object detector to the image. Their algorithm is based on recovery of surface geometry and camera height. They use a publicly available executable to produce confidence maps for three main classes: ground, vertical, and sky, and five subclasses of vertical: planar surfaces facing left, center, and right, and non-planar solid and porous surfaces. To recover the camera height, the authors use manually labeled training images and compute a maximum likelihood estimate of the camera height based on the labeled horizon and the height distributions of cars and people in the scene. Finally, they put the object into perspective, modeling the location and scale in the image. They model interdependencies between objects, surface orientations, and the camera viewpoint. Our algorithm works directly from the object detector's results, without any assumptions about scene geometry or interdependencies between objects. We first apply the head detector to the image and estimate a head plane parallel to the ground plane at the expected human height directly from detected heads without any other information. The head plane estimate is updated incrementally, and when the confidence in the estimate is sufficiently high, we use it to reject false detections produced by the head detector.

Ge, Collins and Ruback [36] use a hierarchical clustering algorithm to divide low and medium density crowds into small groups of individuals traveling together. To discover the group structure, the authors detect and track the moving individuals in the scene. The method has very interesting applications such as abnormal event detection in crowds and discovering pathways in crowds.

Preliminary reports on our pedestrian tracker have previously appeared in two conference papers. In the first [37], we introduce the confirmation-by-classification method, and in the second [38], we introduce automatic identification of the head plane from a single image. In the current paper, we have improved the particle filter for tracking, added robust incremental estimation of the head plane with a stopping criterion, and performed an extensive empirical evaluation of the method on several publicly available data sets. We compare our results to the state of the art on the same data sets.

    3. Human head detection and tracking

Here we provide a summary of our head detection and tracking algorithm in pseudocode, then give the details of each of the main components of the system.

    3.1. Summary

1. Acquire input crowd video V.
2. In the first frame v_0 of V, detect heads. Let x_{i,0} = (x_i, y_i), i ∈ 1…N, be the 2D positions of the centers and let h_i, i ∈ 1…N, be the heights of the detection windows for the detected heads.
3. For each detected head i, compute the approximate 3D location X_i = (X_i, Y_i, Z_i) corresponding to x_{i,0} and h_i.
4. Find the 3D plane Π = (a, b, c, d) best fitting the 3D locations X_i using RANSAC, then refine the estimate of Π using Levenberg–Marquardt to minimize the sum squared difference between observed and predicted head heights.
5. From the error covariance matrix for the parameters of Π, find the volume V of the error ellipsoid as an indicator of the uncertainty in the head plane.
6. Initialize trajectories T_j, j ∈ 1…N, with initial positions x_{j,0}.
7. Initialize the occlusion count O_j for each trajectory j to 0.
8. Initialize the appearance model (color histogram) h_{j,0} for each trajectory from the region around x_{j,0}.
9. For each subsequent frame v_i of the input video,
   (a) For each existing trajectory T_j:
       i. use the motion model to predict the distribution p(x_{j,i} | x_{j,i-1}) over locations for head j in frame i, creating a set of candidate particles x_{j,i}^{(k)}, k ∈ 1…K.
       ii. compute the color histogram h_{j,i}^{(k)} and likelihood p(h_{j,i}^{(k)} | x_{j,i}^{(k)}, h_{j,i-1}) for each particle k using the appearance model.
       iii. resample the particles according to their likelihood. Let k_j* be the index of the most likely particle for trajectory j.
   (b) Perform confirmation by classification:
       i. Run the head detector on frame v_i and get 3D locations for each detection.
       ii. If the head plane uncertainty V is greater than threshold, add the new observations and 3D locations, reestimate Π, and recalculate V (Steps 4–5).
       iii. If the head plane uncertainty V is less than threshold, use the 3D positions and the current estimate of Π to filter out detections too far from the head plane. Let x_l, l ∈ 1…M, be the 2D positions of the centers of the new detections after filtering.
       iv. For each trajectory T_j, find the detection x_l nearest to x_{j,i}^{(k_j*)} within some distance C. If found, consider the location classified as a head and reset O_j to 0; otherwise, increment O_j. In our experiments, we set C to 75% of the width (in pixels) of head j.
       v. Initialize a new trajectory for each detection not associated with a trajectory in the previous step.
       vi. Delete each trajectory T_j that has occlusion count O_j greater than a threshold and history length |T_j| less than the track survival threshold.
       vii. Deactivate each trajectory T_j with occlusion count O_j greater than threshold and history length |T_j| greater than or equal to the track survival threshold.

3.2. Detection

For object detection, although more promising algorithms have recently appeared [39,40], we currently use a standard Viola and Jones AdaBoost cascade [9,14] trained on Haar-like features offline with a few thousand example heads and negative images. At runtime, we use the classifier as a detector, running a sliding window over the image at the specific range of scales expected for the scene.
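As a concrete illustration, a trained cascade can be run as a sliding-window, multi-scale detector with OpenCV in a few lines. This is a minimal sketch, not the authors' code: the cascade file heads.xml is a hypothetical path, and the size range would be tuned per scene.

```python
# Sketch: running a trained Haar cascade as a head detector with OpenCV.
# "heads.xml" is a hypothetical path to a cascade trained as in Section 4.5.
import cv2

cascade = cv2.CascadeClassifier("heads.xml")

def detect_heads(frame, min_size=(16, 16), max_size=(64, 64)):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # detectMultiScale slides the detection window over an image pyramid,
    # restricted to the scale range expected for the scene
    return cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=3,
                                    minSize=min_size, maxSize=max_size)
```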

Our approach to head plane estimation is based on a few assumptions. We assume a pinhole camera with known focal length and that all human heads are approximately the same size in the real world. We further assume that the heads visible in a crowded scene will lie close to a plane that is parallel to the ground at the average height of the humans in the scene. Based on these assumptions, we can compute the approximate 3D position of a detected head using the size of the detection window, then estimate the plane best fitting the data. Once sufficient confidence in the plane is obtained, we can then reject detections corresponding to 3D positions too far from the head plane. See Fig. 2 for a schematic diagram.

Fig. 2. Flow of the incremental head plane estimation algorithm.

We use a robust incremental estimation method. First, we detect heads in the first frame and obtain a robust estimate of the best head plane using RANSAC. Second, we refine the estimate by minimizing the squared difference between measured and predicted head heights using the Levenberg–Marquardt nonlinear least squares algorithm. Using the normalized error covariance matrix, we compute the volume of the error ellipsoid, which indicates the uncertainty in the estimated plane's parameters. On subsequent frames, we add any new detections and reestimate the plane until the volume of the error ellipsoid is below a threshold. We determine the threshold experimentally. Details of the method are given below.

3.2.1. 3D head position estimation from a 2D detection

Given the approximate actual height h_0 of human heads, a 2D head detection at image position x_i = (x_i, y_i) with height h_i, and the camera focal length f, we can compute the approximate 3D location X_i = (X_i, Y_i, Z_i) of the candidate head in the camera coordinate system as follows:

\[ Z_i = \frac{h_0}{h_i}\, f \qquad (1) \]

\[ X_i = \frac{Z_i}{f}\, x_i = \frac{h_0}{h_i}\, x_i \qquad (2) \]

\[ Y_i = \frac{Z_i}{f}\, y_i = \frac{h_0}{h_i}\, y_i \qquad (3) \]
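The back-projection of Eqs. (1)–(3) is a one-liner per coordinate. Below is a minimal sketch; the real-world head height of 0.24 m is an illustrative assumption (the paper does not report the value it uses), and the coordinates are taken relative to the principal point.

```python
import numpy as np

def head_3d_position(x, y, h_px, f, h0=0.24):
    """Back-project a detected head to 3D camera coordinates (Eqs. 1-3).

    x, y  : detection center in pixels, relative to the principal point
    h_px  : height of the detection window in pixels
    f     : focal length in pixels
    h0    : assumed real-world head height in meters (illustrative value)
    """
    Z = (h0 / h_px) * f    # Eq. (1): depth from apparent size
    X = (h0 / h_px) * x    # Eq. (2)
    Y = (h0 / h_px) * y    # Eq. (3)
    return np.array([X, Y, Z])

# Example: a 20-pixel-high head at offset (100, -50) with f = 800 px
print(head_3d_position(100.0, -50.0, 20.0, 800.0))  # -> [ 1.2 -0.6  9.6]
```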

3.2.2. Linear head plane estimation

After obtaining a set of 3D locations X = {X_i}, i = 1…n, of possible heads, we compute the parameters Π = (a, b, c, d) of the plane

\[ aX + bY + cZ + d = 0 \]

minimizing the objective function

\[ q(\Pi) = \sum_{i=1}^{n} (a X_i + b Y_i + c Z_i + d)^2. \qquad (4) \]

Since the set of 3D locations X will in general contain outliers due to errors in head height estimates and false detections from the detector, before performing the above minimization, we eliminate outliers using RANSAC [41]. On each iteration of RANSAC, we sample three points from X, compute the corresponding plane, find the consensus set for that plane, and retain the largest consensus set. The number of iterations is calculated adaptively based on the size of the largest consensus set. We set the number of iterations k to the minimum number of iterations required to guarantee, with a small probability of failure, that the model with the largest consensus set has been found. The probability of selecting three inliers from X at least once is p = 1 − (1 − w³)^k, where w is the probability of selecting an inlier in a single sample. We initialize k to infinity, then on each iteration, we recalculate w using the size of the largest consensus set found so far and then find the number of iterations k needed to achieve a success rate of p.

3.2.3. Nonlinear head plane refinement

The linear estimate of the head plane computed in the previous section minimizes an algebraic objective function (Eq. (4)) that does not take into account the fact that head detections close to the camera are more accurately localized in 3D than head detections far away from the camera.

In this step, we refine the linear estimate of the head plane to minimize the objective function

\[ q(\Pi) = \sum_{i=1}^{n} (h_i - \hat{h}_i)^2, \qquad (5) \]

where h_i is the height of the detection window for head i and ĥ_i is the predicted height of head i based on the 2D location (x_i, y_i) of its detection and the plane Π. To calculate ĥ_i, we find the ray through the camera center C = (0, 0, 0) passing through (x_i, y_i), find the intersection X̂_i = (X̂_i, Ŷ_i, Ẑ_i)ᵀ of that ray with the head plane Π, then calculate the expected height of an object with height h_0 at X̂_i when projected into the image.

To find the intersection of the ray with the plane, given the camera matrix

\[ K = \begin{bmatrix} f & 0 & c_x \\ 0 & f & c_y \\ 0 & 0 & 1 \end{bmatrix} \]

containing the focal length f and principal point (c_x, c_y), we find an arbitrary point

\[ X'_i = K^{-1} \begin{bmatrix} x_i \\ y_i \\ 1 \end{bmatrix} \]

on the ray, then find the scalar λ such that

\[ \Pi^\top \begin{bmatrix} \lambda X'_i \\ 1 \end{bmatrix} = 0. \]

Finally, we calculate X̂_i = λ X'_i and ĥ_i = (h_0 / Ẑ_i) f.

We use the Lourakis implementation of the Levenberg–Marquardt nonlinear least squares algorithm [42] to find the plane Π minimizing the objective function of Eq. (5) and obtain an error covariance matrix Q for the elements of Π.

3.2.4. Incremental head plane estimation

Based on the parameter vector Π and covariance matrix Q obtained as described in the previous section, we compute the volume of the error ellipsoid to indicate the uncertainty in the estimated plane's parameters. Since the uncertainty in the plane's orientation only depends weakly on the distance of the camera to the head plane, whereas the uncertainty in the plane's distance to the camera depends strongly on that distance, we only consider the uncertainty in the plane's (normalized) orientation, ignoring the uncertainty in the distance to the plane. Let λ_1, λ_2, and λ_3 be the eigenvalues of the upper 3×3 submatrix of Q representing the uncertainty in the plane orientation parameters a, b, and c. The radii of the normalized plane orientation error ellipsoid are

\[ r_i = \frac{\sqrt{\lambda_i}}{\sqrt{a^2 + b^2 + c^2}} \]

for i = 1, 2, 3. The volume of the plane orientation error ellipsoid is then

\[ V = \frac{4}{3} \pi r_1 r_2 r_3. \qquad (6) \]

V quantifies the uncertainty of the estimated head plane's orientation. During tracking, for each frame, we add any newly detected heads, reestimate the head plane, and recalculate V. If it is less than threshold, we stop the process and use the head plane to filter subsequent detections. We determine the threshold empirically.
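A short sketch of Eq. (6); it assumes the 4×4 covariance Q orders the parameters as (a, b, c, d), which is an assumption about the representation, not something the paper states.

```python
import numpy as np

def orientation_uncertainty_volume(plane, Q):
    """Volume of the normalized plane-orientation error ellipsoid (Eq. 6).
    plane is (a, b, c, d); Q is the 4x4 covariance of the plane parameters."""
    a, b, c, _ = plane
    lam = np.linalg.eigvalsh(np.asarray(Q)[:3, :3])  # eigenvalues, upper 3x3
    r = np.sqrt(np.clip(lam, 0.0, None)) / np.sqrt(a*a + b*b + c*c)
    return 4.0 / 3.0 * np.pi * r[0] * r[1] * r[2]
```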

    3.3. Particle filter

We use particle filters [10,11] to track heads. The particle filter is well known to enable robust object tracking (see e.g. [16–21]). We use the standard approach in which the uncertainty about an object's state (position) is represented as a set of weighted particles, each particle representing one possible state. Our method automatically initializes separate filters for each new trajectory. The initial distribution of the particles is centered on the location of the object the first time it is detected. The filters propagate particles from frame i−1 to frame i using a motion model, then compute weights for each propagated particle using a sensor or appearance model. Here are the steps in more detail:

1. Predict: we predict p(x_{j,i} | x_{j,i-1}), a distribution over head j's position in frame i given our belief in its position in frame i−1. The motion model is described in the next section.

2. Measure: for each propagated particle k, we measure the likelihood p(h_{j,i}^{(k)} | x_{j,i}^{(k)}, h_{j,i-1}) using a color histogram-based appearance model. After computing the likelihood of each particle, we treat the likelihoods as weights, normalizing them to sum to 1.

3. Resample: we resample the particles to avoid degenerate weights, obtaining a new set of equally-weighted particles. We use sequential importance resampling (SIR) [11], sketched below.

3.3.1. Motion model

We use a second-order auto-regressive dynamical model to predict the 2D position in the current frame based on the 2D positions in the past two frames. In particular, we assume the simple second-order linear autoregressive model

\[ x_{j,i} = 2 x_{j,i-1} - x_{j,i-2} + \varepsilon_i \]

in which ε_i is distributed as a circular Gaussian.
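Sampling candidate particles from this model is straightforward. In the sketch below, the noise scale sigma is an assumed value (the paper does not report it), while the 20-particle count matches Section 4.6.

```python
import numpy as np

def predict_particles(x_prev1, x_prev2, n_particles=20, sigma=3.0, rng=None):
    """Sample candidate 2D positions from x_i = 2 x_{i-1} - x_{i-2} + eps,
    with eps a circular (isotropic) Gaussian; sigma in pixels is assumed."""
    rng = rng or np.random.default_rng()
    mean = 2.0 * np.asarray(x_prev1) - np.asarray(x_prev2)  # constant velocity
    return mean + rng.normal(0.0, sigma, size=(n_particles, 2))
```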

3.3.2. Appearance model

Our appearance model uses color histograms to compute particle likelihoods. We use the simple method of quantizing to fixed-width bins. We use 30 bins for hue and 32 bins for saturation. Learning optimized bins or using a more sophisticated appearance model based on local histograms along with other information such as spatial or structural information would most likely improve our tracking performance, but the simple method works well in our experiments.

Whenever we create a new track, we compute a color histogram h_j for the detection window in HSV space and save it for comparison with histograms extracted from future frames. To extract the histogram from a detection window, we use a circular mask to remove the corners of the window.

We use the Bhattacharyya similarity coefficient between model histogram h_j and observed histogram h^{(k)} to compute a particle's likelihood as follows, assuming n bins in each histogram:

\[ p(h \mid x, h') = e^{-d(h, h')}, \qquad (7) \]

where

\[ d(h, h') = 1 - \sum_{b=1}^{n} \sqrt{h_b h'_b} \]

and h_b and h'_b denote bin b of h and h', respectively.

When we track an object for a long time, its appearance will change, so we update the track histogram for every frame in which the track is confirmed.

    3.4. Confirmation by classification

    To reduce tracking errors, we introduce a simple confirmation-by-classification method, described in detail in this section.

3.4.1. Recovery from misses

Many researchers, for example Breitenstein et al. [16], initialize trackers for only those detections appearing in a zone along the image border. In high density crowds, this assumption is invalid, so we may miss many heads. Due to occlusion and appearance variation, we may not detect all heads in the first frame or when they initially appear. To solve this problem, in each image, we search for new heads in all regions of the image not predicted by the motion model for a previously tracked head. Any newly detected head within some distance C of the predicted position of a previously tracked head is assumed to be associated with the existing trajectory and ignored. If the distance is greater than C, we create a new trajectory for that detection. We currently set C to be 75% of the width of the detection window.

3.4.2. Data association

With detection-based tracking, it is difficult to decide which detection should guide which track. Most researchers compute a similarity matrix between new detections in the frame and existing trajectories using color, size, position, and motion features, then find an optimal assignment. These solutions work well in many cases, but in high density crowds in which the majority of the scene is in motion and most of the humans' bodies are partially or fully occluded, they tend to introduce tracking errors such as ID switches. In this work, we use the particle filter to guide the search for a detection for each trajectory. For each trajectory T_j, we search for a detection at location x_{j,i}^{(k*)} within some distance C, where x_{j,i}^{(k*)} is the position of the most likely particle for trajectory j. We currently set C to be 75% of the width of the detection window. If a detection is found, we consider the location classified as a head, mark the trajectory as confirmed in this frame, and associate the detection with the current track; if no detection is found, we consider the trajectory occluded in this frame. We use confirmation and occlusion information to reduce tracking errors. Details are given in the next section.

3.4.3. Occlusion count

Occlusion handling and inconsistent false track rejection are the main challenges for any tracking algorithm. To handle these problems, we introduce a simple occlusion count scheme. When head j is first detected and its trajectory is initialized, we set the occlusion count O_j = 0. After updating the head's position in frame i, we confirm the estimated position through detection as described in the previous section. On each frame, the occlusion count of each trajectory not confirmed through classification is incremented, and the occlusion count of each confirmed trajectory is reset to 0. An example of the increment and reset process is shown in Fig. 3(a).

Fig. 3. Occlusion count scheme. (a) Short occlusion. The occlusion count does not reach the deactivation threshold, so the head is successfully tracked through the occlusion. (b) Long occlusion. The occlusion count reaches the deactivation threshold, so the track is deactivated. (c) Possible reactivation of a deactivated trajectory. Newly confirmed trajectories are compared with deactivated trajectories, and if the appearance match is sufficiently strong, the deactivated track is reactivated from the position of the matching new trajectory.

The details of the algorithm are as follows (a code sketch of the bookkeeping follows the list):

1. Eliminate false tracks: shadows and other non-head objects in the scene tend to produce transient false detections that could lead to tracking errors. In order to prevent these false detections from being tracked by the appearance-based tracker through time, we use the head detector to confirm the estimated head position for each trajectory and eliminate any new trajectory not confirmed for some number of frames. A trajectory is considered transient until it is confirmed in several frames.

2. Short occlusions: to handle short occlusions during tracking, we keep track of the occlusion count for each trajectory. If the head is confirmed before the occlusion count reaches the deactivation threshold, we consider the head successfully tracked through the occlusion. An example is shown in Fig. 3(a).

3. Long occlusions: when an occlusion in a crowded scene is long, it is often impossible to recover, due to appearance changes and uncertainty in tracking. However, if the object's appearance when it reappears is sufficiently similar to its appearance before the occlusion, it can be restored. We use the occlusion count and number of confirmations to handle long occlusions through deactivation and reactivation. When an occlusion count reaches the deactivation threshold, we deactivate the trajectory. An example is shown in Fig. 3(b). Subsequently, when a new trajectory is confirmed by detections in several consecutive frames, we consider it a candidate continuation of existing deactivated trajectories. An example is shown in Fig. 3(c). Whenever a newly confirmed trajectory matches a deactivated trajectory sufficiently strongly, the deactivated trajectory is reactivated from the position of the new trajectory.
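A minimal sketch of this bookkeeping under stated assumptions: the threshold values and the Track container are illustrative, since the paper determines its thresholds empirically and does not report them.

```python
from dataclasses import dataclass, field

DEACTIVATE_AT = 10   # assumed deactivation threshold (frames)
SURVIVAL_LEN = 5     # assumed track survival threshold (frames)

@dataclass
class Track:
    history: list = field(default_factory=list)  # confirmed positions so far
    occlusion_count: int = 0
    active: bool = True

def update_occlusion(track, confirmed):
    """Apply the per-frame confirm/occlude rules of Section 3.4.3."""
    track.occlusion_count = 0 if confirmed else track.occlusion_count + 1
    if track.occlusion_count > DEACTIVATE_AT:
        if len(track.history) < SURVIVAL_LEN:
            return "delete"         # transient false track: eliminate
        track.active = False
        return "deactivate"         # candidate for later reactivation
    return "keep"
```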

    In this section, we provide details of a series of experiments to eval-uate our algorithm. Firstwe describe the training and test data sets. Sec-ond, we describe some of the important implementation details. Third,we describe the evaluation metrics we use. Finally we provide resultsand discussion for the detection and tracking evaluations.eactivation threshold, so the head is successfully tracked through the occlusion. (b) Longc) Possible reactivation of a deactivated trajectory. Newly confirmed trajectories are com-ctivated track is reactivated from the position of the matching new trajectory.

Fig. 4. Positive head samples used to train the head detector offline. (a) Example images collected for training. (b) Example scaled images.

4.1. Training data

To train the Viola and Jones Haar-like AdaBoost cascade detector, we cropped 5396 heads (examples are shown in Fig. 4) from videos collected at various locations and scaled each to 16×16 pixels. We also collected 4187 negative images not containing human heads.

For the CAVIAR experiments described in the next section, since many of the heads appearing in the CAVIAR test set are very small, we created a second training set by scaling the same 5396 positive examples to a size of 10×10.

    4.2. Test data

There is no generally accepted dataset available for crowd tracking. Most researchers use their own datasets to evaluate their algorithms. In this work, for the experimental evaluation, we have created, to the best of our knowledge, the most challenging existing dataset specifically for tracking people in high density crowds; the dataset with ground truth information is publicly available for download at http://www.cs.ait.ac.th/vgl/irshad/.

We captured the video at 640×480 pixels and 30 frames per second at the Mochit light rail station in Bangkok, Thailand. A sample frame is shown in Fig. 1. We then hand labeled the locations of all heads present in every frame of the video. For the ground truth data format, we followed the annotation guidelines for the Video Analysis and Content Extraction (VACE-II) workshop [43]. This means that for each head present in each frame, we record the bounding box, a unique ID that is consistent across frames, and a flag indicating whether the head is occluded or not. We labeled a total of 700 frames containing a total of 28,430 heads, for an average of 40.6 heads/frame.

Fig. 5. Error in head plane estimation. We plot the volume of the ellipsoid computed using the covariance matrix of the nonlinear head plane estimation based on 3D positions of heads. The graph shows the error in each frame after adding the newly detected heads in the frame.

For comparison with existing state of the art research, we also test our algorithm on the well-known indoor Context-Aware Vision using Image-based Active Recognition (CAVIAR) [44] dataset. Since our evaluation requires ground truth positions of each pedestrian's head in each frame, we selected the four sequences from the shopping center corridor view for which the needed information is available. The sequences contain a total of 6315 frames, the frame size is 384×288, and the sequences were captured at 25 frames per second. There are a total of 26,950 heads over all four sequences, with an average of 4.27 heads per frame.

Our algorithm is designed to track heads in high density crowds. The performance of any tracking algorithm will depend upon the density of the crowd. In order to characterize this relationship, we introduce a simple crowd density measure

\[ D = \frac{\sum_i P_i}{N}, \qquad (8) \]

where P_i is the number of pixels in pedestrian i's bounding box and N is the total number of pixels in the image.
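Eq. (8) in code (a sketch; overlapping boxes are double-counted, exactly as in the formula):

```python
def crowd_density(boxes, frame_shape):
    """Per-frame crowd density of Eq. (8).
    boxes: iterable of (x, y, w, h) pedestrian bounding boxes
    frame_shape: (height, width) of the frame in pixels"""
    covered = sum(w * h for (_, _, w, h) in boxes)   # sum of P_i
    return covered / float(frame_shape[0] * frame_shape[1])
```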

According to this measure, the highest per-frame crowd density in our Mochit test sequence is 0.63, whereas the highest per-frame crowd density in CAVIAR is 0.27. The crowd density in the Mochit sequence is higher than that in any publicly-available pedestrian tracking video database.

Table 1
Detection results for a single image with and without head plane estimation.

                                 GT   Hits   Misses   FP
Without head plane estimation    34   31     3        35
With head plane estimation       34   30     4        12


To directly evaluate the accuracy of our head plane estimation method, we use the People Tracking sequence (S2.L1) of the Performance Evaluation of Tracking and Surveillance (PETS) [45] dataset. In PETS, camera calibration information is given for every camera. In this sequence there are a total of eight views. In views 1, 2, 3, and 4, the head sizes are very small; we excluded these views because our method has a limit on the minimum head size that can be detected. We trained a new head detector specifically on the PETS training data (which is separate from the people tracking test sequence), ran the tracking and head plane estimation algorithm on views 5, 6, 7, and 8, then compared our estimated head plane with the actual ground plane information that comes with the dataset.

    4.3. Implementation details

We implemented the system in C++ with OpenCV without any special code optimization. The system attempts to track anything head-like in the scene, whether moving or not, since it does not rely on any background modeling. We detect heads and create initial trajectories based on the first frame, and then we track heads from frame to frame. Further implementation details are given in the following sections.

4.3.1. Trajectory initialization and termination

We use the head detector to find heads in the first frame and create initial trajectories. As previously mentioned, rather than detect heads only in the border region of the image, we detect all heads in every frame. We first try to associate new heads with existing trajectories; when this fails for a new head detection, a new trajectory is initialized from the current frame. Any head trajectory in the exit zone (close to the image border) for which the motion model predicts a location outside the frame is eliminated.

4.3.2. Identity management

It is also important to assign and maintain object IDs automatically during tracking. We assign a unique ID to each trajectory during initialization, then maintain the ID during tracking. Trajectories that are temporarily lost due to occlusion are reassigned the same ID on recovery to avoid identity changes. During long occlusions, when a track is not confirmed for several frames, we deactivate that track and search for new matching detections in subsequent frames. If a match is found, the track is reactivated from the position of the new detection and reassigned the same ID.

4.3.3. Incremental head plane estimation

As previously discussed, we incrementally estimate the head plane. During tracking, we collect detected heads cumulatively over each frame and perform head plane estimation. As discussed in Section 3.2.4, to determine when to start using the estimated plane to filter detections, we use the volume of the normalized plane orientation error ellipsoid (see Eq. (6)) as a measure of the uncertainty in the estimate. Fig. 5 shows how the error ellipsoid volume evolves over time on the Mochit data set as heads detected in subsequent frames are added to the data set. We stop head plane estimation when the error ellipsoid volume is less than 0.0003.

As a quantitative evaluation of the head plane estimation method, we trained a new head detector on the PETS training set, then ran our system on views 5–8 from the PETS People Tracking sequence (S2.L1). The estimation error was 305 mm, 433 mm, 355 mm, and 280 mm for the orthogonal distance between the plane and the camera center and 25°, 20°, 17°, and 16° for the orientation. This indicates that the method is quite effective at unsupervised estimation of the head plane.

    4.4. Evaluation metrics

In this section, we describe the methods we use to evaluate the tracking algorithm. Unfortunately there are no commonly-used metrics for human detection and tracking in crowded scenes. We adopt measures similar to those proposed by Nevatia et al. [2,46,47] for tracking pedestrians in sparse scenes. In their work, there are different definitions for ID switch and trajectory fragmentation errors. Wu and Nevatia [2] define ID switches as identity exchanges between a pair of result trajectories, while Li, Huang, and Nevatia [47] define an ID switch as a tracked trajectory changing its matched GT ID. We adopt the definitions of ID switch and fragment errors proposed by Li, Huang, and Nevatia [47]. If a trajectory ID is changed but not exchanged, we count it as one ID switch, similarly for fragments. This definition is more strict and leads to higher numbers of ID switch and fragment errors, but it is well defined.

Bernardin and Stiefelhagen have proposed an alternative set of metrics, the CLEAR MOT metrics [48], for multiple object tracking performance. Their multiple object tracking precision (MOTP) and multiple object tracking accuracy (MOTA) measures are not suitable for crowd tracking because they integrate multiple factors into one scalar-valued measure. Kasturi and colleagues [51] have proposed a framework to evaluate face, text and vehicle detection and tracking in video. Their methods are not suitable for crowd tracking for the same reason.

Fig. 6. Detection results for one frame of the Mochit test video. Rectangles indicate candidate head positions. (a) Without 3D head plane estimation. (b) With 3D head plane estimation.

Table 2
Overall detection results with and without head plane estimation.

                                 GT       Hits     Misses   FP
Without head plane estimation    24,605   19,277   5328     7053
With head plane estimation       24,605   17,950   6655     3328

Table 3
Tracking results with and without head plane estimation for the Mochit station dataset. Total number of trajectories is 74.

                      MT%    PT%    ML%   Frag   IDS   FAT
Without head plane    67.6   28.4   4.0   43     20    41
With head plane       70.3   25.7   4.0   46     17    27


    We specifically use the following evaluation criteria:

1. Ground truth (GT): number of ground truth trajectories.
2. Mostly tracked (MT): number of trajectories that are successfully tracked for more than 80% of their length (tracked length divided by the ground truth track length).
3. Partially tracked (PT): number of trajectories that are successfully tracked in 20%–80% of the ground truth frames.
4. Mostly lost (ML): number of trajectories that are successfully tracked for less than 20% of the ground truth frames.
5. Fragments (Frag): number of times that a ground truth trajectory is interrupted in the tracking results.
6. ID switches (IDS): number of times the system-assigned ID changes over all ground truth trajectories.
7. False trajectories (FAT): number of system trajectories that do not correspond to ground truth trajectories.
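A small sketch of the MT/PT/ML classification, assuming per-trajectory counts of tracked and ground truth frames are already available:

```python
def coverage_categories(tracked_frames, gt_frames):
    """Classify ground truth trajectories by the fraction of their frames
    that were tracked: MT > 80%, PT 20-80%, ML < 20% (Section 4.4)."""
    mt = pt = ml = 0
    for t, g in zip(tracked_frames, gt_frames):
        ratio = t / float(g)
        if ratio > 0.8:
            mt += 1
        elif ratio >= 0.2:
            pt += 1
        else:
            ml += 1
    return mt, pt, ml

# Example: three GT tracks of 100 frames, tracked for 90, 50, and 10 frames
print(coverage_categories([90, 50, 10], [100, 100, 100]))  # -> (1, 1, 1)
```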

    4.5. Detection results

We trained our head detection cascade using the OpenCV haartraining utility. We set the number of training stages to 20, the minimum hit rate per stage to 0.995, and the maximum false alarm rate per stage to 0.5. The training process required about 16 h on a 2.8 GHz Intel Pentium 4 with 4 GB of RAM.

To test the system's raw head detection performance with and without head plane estimation on a single image, we ran our head detector on an arbitrary single frame extracted from the Mochit test data sequence. The results are summarized in Table 1 and visualized in Fig. 6.

There are a total of 34 visible ground truth (GT) heads in the frame. Using the head plane to reject detections inconsistent with the scene geometry reduces the number of false positives (FP) from 35 to 12 and only reduces the number of detections (hits) from 31 to 30. The results show that the head plane estimation method is very useful for filtering false detections.

To test the system's head detection performance with and without head plane estimation on the whole sequence, we ran our head detector on each frame extracted from the Mochit test data sequence and compared with the head locations reported by our tracking algorithm. The results are summarized in Table 2.

There are a total of 24,605 visible ground truth (GT) heads in the sequence of 700 frames. Our algorithm reduces the number of false positives (FP) from 7053 to 3328 and only reduces the number of detections (hits) from 19,277 to 17,950.

    4.6. Tracking results

In the Mochit station dataset [49], there are an average of 40.6 individuals per frame over the 700 hand-labeled ground truth frames, for a total of 28,430 heads, and a total of 74 individual ground truth trajectories. We used 20 particles per head. Tracking results with and without head plane estimation are shown in Table 3. The head plane estimation method improves accuracy slightly, but more importantly, it reduces the false positive rate while preserving a high rate of successful tracking. For a frame size of 640×480, the processing time was approximately 1.4 s per frame, with or without head plane estimation, on a 3.2 GHz Intel Core i5 with 4 GB RAM. Fig. 7 shows tracking results for several frames of the Mochit test video.

In the four selected sequences from the CAVIAR dataset for which the ground truth pedestrian head information is available, there are a total of 43 ground truth trajectories, with an average of 4.27 individuals per frame or 26,950 heads total over the 6315 hand-labeled ground truth frames.

Fig. 7. Sample tracking results for the Mochit test video. Blue rectangles indicate estimated head positions; red rectangles indicate ground truth head positions.

Fig. 8. Sample tracking results on the CAVIAR dataset. Blue rectangles indicate estimated head positions; red rectangles indicate ground truth head positions.


Since our detector cannot detect heads smaller than 15×15 pixels reliably, in the main evaluation we exclude ground truth heads smaller than 15×15. However, to enable direct comparison to existing full-body pedestrian detection and tracking methods, we also provide results including the untracked small heads as errors. We again used 20 particles per head. Tracking results with and without small heads and with and without head plane estimation are shown in Table 4. For a frame size of 384×288, the processing time was approximately 0.350 s per frame on the same 3.2 GHz Intel Core i5 with 4 GB RAM. Fig. 8 shows tracking results for several frames of the CAVIAR test set.

Our head tracking algorithm is designed especially for high density crowds. We do not expect it to work as well as full-body tracking algorithms on sparse data where the full body is in most cases visible. Although it is difficult to draw any strong conclusion from the data in Table 4, since none of the reported work is using precisely the same subset of the CAVIAR data, we tentatively conclude that our method gives comparable performance to the state of the art, even though it is using less information.

    Although the researchers whose work is summarized in Table 4 have not made their code public to enable direct comparison on the Mochit test set, we would expect that any method relying on full-body or body-part-based tracking would perform much more poorly than our method on that data set.

    5. Conclusion

    Tracking people in high-density crowds such as the one shown in Fig. 1 is a real challenge and is still an open problem. In this paper, we introduce a fully automatic algorithm to detect and track multiple humans in high-density crowds in the presence of extreme occlusion. We integrate human detection and tracking into a single framework and introduce a confirmation-by-classification method to estimate confidence in a tracked trajectory, track humans through occlusions, and eliminate false positive tracks. We find that confirmation by classification dramatically reduces tracking errors such as ID switches and fragments.
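
    In outline, confirmation by classification amounts to re-running the head classifier at each track's predicted location every frame. The sketch below is a schematic of that control flow, with a hypothetical classifier callable and an illustrative occlusion allowance rather than the exact constants of our implementation.

        MAX_MISSES = 10   # illustrative occlusion allowance

        class Track:
            def __init__(self, box):
                self.box = box      # current head location estimate
                self.misses = 0     # consecutive unconfirmed frames
                self.alive = True

        def confirm_by_classification(track, classifier, frame):
            # A positive classifier response at the tracked location confirms
            # the track; repeated failures mark it as a likely false positive.
            if classifier(frame, track.box):
                track.misses = 0
            else:
                track.misses += 1   # possibly occluded; tolerate for a while
                if track.misses > MAX_MISSES:
                    track.alive = False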

    The main difficulty in using a generic object detector for human tracking is that the detector's output is unreliable; all detectors make errors. To further reduce false detections due to dense features and shadows, we present an algorithm using an estimate of the 3D head plane to reduce false positive head detections and improve pedestrian tracking accuracy in crowds. The method is straightforward, makes reasonable assumptions, and does not require any knowledge of camera extrinsics. Based on the projective geometry of the pinhole camera and an assumed approximate head size, we compute 3D locations of candidate head detections. We then fit a plane to the set of detections and reject detections inconsistent with the estimated scene geometry. The algorithm learns the head plane from observations of human heads incrementally, and only begins to utilize the head plane once confidence in the parameter estimates is sufficiently high.
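
    A minimal sketch of the two geometric steps just described, assuming known camera intrinsics (focal lengths fx, fy and principal point cx, cy) and an assumed metric head width; the 0.18 m value is illustrative, and the nonlinear refinement of Section 3.2.3 is omitted here.

        import numpy as np

        HEAD_WIDTH = 0.18   # assumed real head width in meters (illustrative)

        def backproject(u, v, w_px, fx, fy, cx, cy):
            # Pinhole back-projection: the assumed metric head width and the
            # detected pixel width w_px determine the depth Z.
            Z = fx * HEAD_WIDTH / w_px
            return np.array([(u - cx) * Z / fx, (v - cy) * Z / fy, Z])

        def fit_head_plane(points):
            # Total least-squares plane: the normal is the singular vector of
            # the centered points with the smallest singular value.
            centroid = points.mean(axis=0)
            _, _, vt = np.linalg.svd(points - centroid)
            n = vt[-1]            # plane normal
            d = n @ centroid      # plane: n . X = d
            return n, d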

    Table 4
    Tracking results for the CAVIAR dataset.

    Method                                 GT   MT%   PT%   ML%   Frag  IDS  FAT
    Zhao, Nevatia, and Wu [1]             227  62.1    -    5.3   89a   22a   27
    Wu and Nevatia [2]                    189  74.1    -    4.2   40a   19a    4
    Xing, Ai and Lao [50]b                140  84.3  12.1   3.6   24    14     -
    Li, Huang and Nevatia [47]            143  84.6  14.0   1.4   17    11     -
    Ali and Daileyc                        33  75.8  24.2   0.0   10     1    21
    Ali and Dailey (without head plane)    33  75.8  24.2   0.0   14     2    34
    Ali and Dailey (all heads)             43  16.3  58.1  25.6   11     1    21

    a The Frag and IDS definitions are less strict than ours, giving lower numbers of fragments and ID switches.

    b Does not count people less than 24 pixels wide.
    c Does not count heads less than 15 pixels wide.

    We find that together, the confirmation-by-classification and head plane estimation methods enable the construction of an excellent pedestrian tracker for dense crowds. In future work, with further algorithmic improvements and runtime optimization, we hope to achieve robust, real-time pedestrian tracking for even larger crowds.

    Acknowledgments

    This research was supported by graduate fellowships from the Higher Education Commission of Pakistan (HEC) and the Asian Institute of Technology (AIT) to Irshad Ali. We are grateful to Shashi Gharti for help with ground truth labeling software. We thank Faisal Bukhari and Waheed Iqbal for valuable discussions related to this work.

    Appendix A. Supplementary data

    Supplementary data to this article can be found online at http://dx.doi.org/10.1016/j.imavis.2012.08.013.

    References

    [1] T. Zhao, R. Nevatia, B. Wu, Segmentation and tracking of multiple humans in crowded environments, IEEE Trans. Pattern Anal. Mach. Intell. (PAMI) 30 (7) (2008) 1198–1211.

    [2] B. Wu, R. Nevatia, Detection and tracking of multiple, partially occluded humans by Bayesian combination of edgelet based part detectors, Int. J. Comput. Vision (IJCV) 75 (2) (2007) 247–266.

    [3] B. Wu, R. Nevatia, Y. Li, Segmentation of multiple, partially occluded objects by grouping, merging, assigning part detection responses, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.

    [4] S.M. Khan, M. Shah, A multiview approach to tracking people in crowded scenes using a planar homography constraint, in: European Conference on Computer Vision (ECCV), 2006.

    [5] J. Berclaz, F. Fleuret, P. Fua, Robust people tracking with global trajectory optimization, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2006.

    [6] M. Andriluka, S. Roth, B. Schiele, People-tracking-by-detection and people-detection-by-tracking, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008, pp. 1–8.

    [7] D. Ramanan, D.A. Forsyth, A. Zisserman, Tracking people by learning their appearance, IEEE Trans. Pattern Anal. Mach. Intell. (PAMI) 29 (1) (2007) 65–81.

    [8] T. Zhao, R. Nevatia, Tracking multiple humans in crowded environment, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 2, 2004.

    [9] P. Viola, M. Jones, Robust real time object detection, Int. J. Comput. Vision (IJCV) 57 (2001) 137–154.

    [10] M. Isard, A. Blake, A mixed-state condensation tracker with automatic model-switching, in: IEEE International Conference on Computer Vision (ICCV), 1998, pp. 107–112.

    [11] A. Doucet, N. de Freitas, N. Gordon, Sequential Monte Carlo Methods in Practice, Springer, New York, 2001.

    [12] T. Zhao, R. Nevatia, Tracking multiple humans in complex situations, IEEE Trans. Pattern Anal. Mach. Intell. (PAMI) 26 (9) (2004) 1208–1221.

    [13] P. Dollar, C. Wojek, B. Schiele, P. Perona, Pedestrian detection: an evaluation of the state of the art, IEEE Trans. Pattern Anal. Mach. Intell. (PAMI) 34 (2012) 743–761.

    [14] P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2001, pp. 511–518.

    [15] M. Isard, A. Blake, CONDENSATION – conditional density propagation for visual tracking, Int. J. Comput. Vision (IJCV) 29 (1998) 5–28.

    [16] M.D. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier, L.V. Gool, Online multi-person tracking-by-detection from a single, uncalibrated camera, IEEE Trans. Pattern Anal. Mach. Intell. (PAMI) 33 (9) (2011) 1820–1833.

    [17] H.-G. Kang, D. Kim, Real-time multiple people tracking using competitive condensation, Pattern Recognit. 38 (2005) 1045–1058.

    [18] S.V. Martínez, J. Knebel, J. Thiran, Multi-object tracking using the particle filter algorithm on the top-view plan, in: European Signal Processing Conference (EUSIPCO), 2004.

    [19] J. Vermaak, A. Doucet, P. Perez, Maintaining multi-modality through mixture tracking, in: IEEE International Conference on Computer Vision (ICCV), 2003.

    [20] Z. Khan, T. Balch, F. Dellaert, MCMC-based particle filtering for tracking a variable number of interacting targets, IEEE Trans. Pattern Anal. Mach. Intell. (PAMI) 27 (2005) 1805–1819.

    [21] K. Okuma, A. Taleghani, N.D. Freitas, J.J. Little, D.G. Lowe, A boosted particle filter: multitarget detection and tracking, in: European Conference on Computer Vision (ECCV), 2004.

    [22] C. Rasmussen, G.D. Hager, Probabilistic data association methods for tracking complex visual objects, IEEE Trans. Pattern Anal. Mach. Intell. (PAMI) 23 (6) (2001) 560–576.

    [23] D.B. Reid, An algorithm for tracking multiple targets, IEEE Trans. Autom. Control 24 (6) (1979) 843–854.

    [24] H.W. Kuhn, The Hungarian method for the assignment problem, Nav. Res. Logist. Q. 2 (1955) 83–87.

    [25] C.-H. Kuo, C. Huang, R. Nevatia, Multi-target tracking by on-line learned discriminative appearance models, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.

    [26] M. Rodriguez, I. Laptev, J. Sivic, J.-Y. Audibert, Density-aware person detection and tracking in crowds, in: IEEE International Conference on Computer Vision (ICCV), 2011.

    [27] D. Hoiem, A. Efros, M. Hebert, Putting objects into perspective, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2006, pp. 2137–2144.

    [28] D. Hoiem, A. Efros, M. Hebert, Putting objects into perspective, Int. J. Comput. Vision (IJCV) 80 (1) (2008) 3–15.

    [29] F. Fleuret, J. Berclaz, R. Lengagne, P. Fua, Multicamera people tracking with a probabilistic occupancy map, IEEE Trans. Pattern Anal. Mach. Intell. (PAMI) 30 (2008) 267–282.

    [30] A. Mittal, L.S. Davis, M2Tracker: a multi-view approach to segmenting and tracking people in a cluttered scene, Int. J. Comput. Vision (IJCV) 51 (2003) 189–203.

    [31] R. Eshel, Y. Moses, Homography based multiple camera detection and tracking of people in a dense crowd, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.

    [32] T. Zhao, M. Aggarwal, R. Kumar, H. Sawhney, Real-time wide area multi-camera stereo tracking, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2005.

    [33] F. Lv, T. Zhao, R. Nevatia, Camera calibration from video of a walking human, IEEE Trans. Pattern Anal. Mach. Intell. (PAMI) 28 (9) (2006) 1513–1518.

    [34] R. Rosales, S. Sclaroff, 3D trajectory recovery for tracking multiple objects and trajectory guided recognition of actions, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1999.

    [35] B. Leibe, K. Schindler, L.V. Gool, Coupled detection and trajectory estimation for multi-object tracking, in: IEEE International Conference on Computer Vision (ICCV), 2007, pp. 1–8.

    [36] W. Ge, R.T. Collins, R.B. Ruback, Vision-based analysis of small groups in pedestrian crowds, IEEE Trans. Pattern Anal. Mach. Intell. (PAMI) 34 (5) (2011) 1003–1016.

    [37] I. Ali, M.N. Dailey, Multiple human tracking in high-density crowds, in: Advanced Concepts for Intelligent Vision Systems (ACIVS), Vol. LNCS 5807, 2009, pp. 540–549.

    [38] I. Ali, M.N. Dailey, Head plane estimation improves the accuracy of pedestrian tracking in dense crowds, in: International Conference on Control, Automation, Robotics and Vision (ICARCV), 2010, pp. 2054–2059, http://dx.doi.org/10.1109/ICARCV.2010.5707425.

    [39] B. Leibe, A. Leonardis, B. Schiele, Robust object detection with interleaved categorization and segmentation, Int. J. Comput. Vision (IJCV) 77 (2008) 259–289.

    [40] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2005.

    [41] M.A. Fischler, R.C. Bolles, Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography, Commun. ACM 24 (6) (1981) 381–395.

    [42] M. Lourakis, levmar: Levenberg–Marquardt nonlinear least squares algorithms in C/C++, available at http://www.ics.forth.gr/lourakis/levmar/, Jul. 2004.

    [43] H. Raju, S. Prasad, Annotation guidelines for video analysis and content extraction (VACE-II), available at http://isl.ira.uka.de/clear07/downloads/, 2006.

    [44] The CAVIAR data set, available at http://homepages.inf.ed.ac.uk/rbf/CAVIAR/, 2011.

    [45] PETS benchmark data, available at http://www.cvg.rdg.ac.uk/PETS2009/a.html, 2009.

    [46] C.-H. Kuo, C. Huang, R. Nevatia, Multi-target tracking by on-line learned discriminative appearance models, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010, pp. 685–692.

    [47] Y. Li, C. Huang, R. Nevatia, Learning to associate: hybrid boosted multi-target tracker for crowded scene, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 2953–2960.

    [48] K. Bernardin, R. Stiefelhagen, Evaluating multiple object tracking performance: the CLEAR MOT metrics, EURASIP J. Image Video Process. 2008 (2008) 1–10.

    [49] Mochit station dataset, available at http://www.cs.ait.ac.th/vgl/irshad/, 2009.

    [50] J. Xing, H. Ai, S. Lao, Multi-object tracking through occlusions by local tracklets filtering and global tracklets association with detection responses, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 1200–1207.

    [51] R. Kasturi, D. Goldgof, P. Soundararajan, V. Manohar, J. Garofolo, R. Bowers, M. Boonstra, V. Korzhova, J. Zhang, Framework for performance evaluation of face, text, and vehicle detection and tracking in video: data, metrics, and protocol, IEEE Trans. Pattern Anal. Mach. Intell. (PAMI) 31 (2) (2009) 319–336.

