This article appeared in a journal published by Elsevier. The attached
copy is furnished to the author for internal non-commercial research
and education use, including for instruction at the author's institution
and sharing with colleagues.
Other uses, including reproduction and distribution, or selling or
licensing copies, or posting to personal, institutional or third party websites are prohibited.
In most cases authors are permitted to post their version of the
article (e.g. in Word or TeX form) to their personal website or
institutional repository. Authors requiring further information
regarding Elsevier's archiving and manuscript policies are
encouraged to visit:
http://www.elsevier.com/copyright
Scene Aware Detection and Block Assignment Tracking in Crowded Scenes
Genquan Duan a,⁎, Haizhou Ai a, Junliang Xing a, Song Cao b, Shihong Lao c
a Computer Science and Technology Department, Tsinghua University, Beijing, China
b Electronic Engineering Department, Tsinghua University, Beijing, China
c Development Center, OMRON Social Solutions Co., LTD, Kyoto, Japan
a r t i c l e i n f o
Article history:
Received 18 July 2011
Received in revised form 7 February 2012
Accepted 10 February 2012
Keywords:
Visual surveillance
Object detection
Object tracking
Particle filter

a b s t r a c t
How far can human detection and tracking go in real world crowded scenes? Many algorithms often fail in
such scenes due to frequent and severe occlusions as well as viewpoint changes. In order to handle these
difficulties, we propose Scene Aware Detection (SAD) and Block Assignment Tracking (BAT), which incorporate
several available scene models (e.g. background, layout, ground plane and camera models). The SAD is
proposed for accurate detection through utilizing 1) a camera model to deal with viewpoint changes by
rectifying sub-images, 2) a structural filter approach to handle occlusions based on a feature sharing
mechanism in which a three-level hierarchical structure is built for humans, and 3) foregrounds for pruning
negative and false positive samples and merging intermediate detection results. Many detection or appearance
based tracking systems are prone to errors in occluded scenes because of failures of detectors and interactions
of multiple objects. In contrast, the BAT formulates tracking as a block assignment process, where blocks with
the same label form the appearance of one object. In the BAT, we model objects on two levels: the ensemble
level, to measure how much a region looks like an object by discriminative models, and the block level, to
measure how much it looks like a particular target by appearance and motion models. The main advantage of
the BAT is that it can track an object even when all the part detectors fail, as long as the object has assigned
blocks. Extensive experiments in many challenging real world scenes demonstrate the efficiency and
effectiveness of our approach.
© 2012 Elsevier B.V. All rights reserved.
1. Introduction
Human detection and tracking are classic problems in computer
vision, with applications in visual surveillance, driver-assistance systems
and traffic management, and have achieved significant progress
recently. Many existing detection and tracking methods, however,
encounter great challenges from radial distortions, illumination variations,
viewpoint changes and occlusions, all of which are quite common
in real world scenes.
The goal of our work is to cope with these difficulties to detect and
track multiple humans in surveillance scenes using a single stationary camera. Many detection and tracking systems developed so far
assume that the viewpoint is frontal, that a person enters the scene without
occlusions, that a person appears or disappears at some special locations, that a
person will remain in the scene for a given number of frames, or that the
human flow is gentle. In this paper, we present a robust detection
and tracking system attempting to minimize such constraining
assumptions, which is able to handle the following difficulties: 1) occlusion,
when multiple persons enter and move through the scene in crowds;
2) relatively unconstrained camera viewpoints, rotations and heights;
3) relatively unconstrained human motions, appearances and positions
with respect to the camera; 4) humans appearing for only a
small number of frames; and 5) relatively slowly moving humans.
We only assume that humans stand on the ground plane
in the scene, and ignore those below this ground plane or standing in
other places such as rooftops, windows or the sky. This is a very
reasonable assumption which is applicable in most surveillance scenes.
We innovate in both detection and tracking for scenes with occlusions
and viewpoint changes. Our main contributions include the following two
aspects.
• A Scene Aware Detection for accurate detection. Specifically, it
includes: (1) a simple but efficient learning algorithm that uses
foregrounds to prune negative and false positive samples; (2) a
structural filter approach to detect occluded humans in a feature
sharing mechanism; and (3) a foreground aware merging strategy
to explain foregrounds by detection results.
• A Block Assignment Tracking for robust tracking, where tracking is
formulated as a block assignment process and the objects are modeled
on different levels, i.e. the block level and the ensemble level.
Blocks with the same label form the appearance of one object,
from which robust appearance and motion models can be established.
Its main advantage is that it can track an object even when
all the part detectors fail, as long as the object has assigned blocks.
Image and Vision Computing 30 (2012) 292–305
This paper has been recommended for acceptance by Xiaogang Wang.
⁎ Corresponding author. Tel.: +86 10 62795495; fax: +86 10 62795871.
E-mail addresses: dgq08@mails.tsinghua.edu.cn (G. Duan), ahz@mail.tsinghua.edu.cn (H. Ai), xjl07@mails.tsinghua.edu.cn (J. Xing), cao-s08@mails.tsinghua.edu.cn (S. Cao), lao@ari.ncl.omron.co.jp (S. Lao).
0262-8856/$ – see front matter © 2012 Elsevier B.V. All rights reserved.
doi:10.1016/j.imavis.2012.02.008
The rest of this paper is organized as follows. Related work is discussed
in the next section. Our system is overviewed in Section 3.
Scene Aware Detection is presented in Section 4. Block Assignment
Tracking is described in Section 5. Experimental results on many
challenging real world datasets are provided along with some discussions
in Section 6. Conclusions and future work are given in Section 7.
2. Related work
There is a great deal of work in the literature on object detection,
such as faces [1] and pedestrians [2–4], and on multiple target tracking,
such as vehicles [5] and humans [6–8]. Here we first review some
robust detection methods that cope with occlusions and viewpoint
changes, and then discuss some detection related and detection
free tracking algorithms.
2.1. Robust detection
2.1.1. Occlusion handling
Using multiple part detectors, Wu et al. [2] proposed a Bayesian
approach for combination, while Huang et al. [3] introduced a dynamic search. Wang et al. [9] proposed a global-part occlusion handling
method, where an occlusion likelihood map was first produced from HOG
feature responses and then segmented by a mean shift approach.
2.1.2. Viewpoint change handling
Due to changes of viewpoint, human appearances and poses
vary a lot. To address this difficulty, Li et al. [10] detected objects in
rectified sub-images with a learned frontal viewpoint detector. Another
approach is to learn one powerful detector for all possible viewpoints,
such as [11,12]. Duan et al. [11] first clustered the complex multiple
viewpoint samples into several sub-categories and then learned a
classifier for each sub-category. Felzenszwalb et al. [12] proposed a more
efficient model, the Deformable Part based Model, in which a root filter
and several part models are learned for each object category, which
can detect objects with some pose changes.
2.1.3. Integration with other models
Beleznai et al. [13] used local shape descriptors to infer human
locations in images of absolute difference from a background
model. Hoiem et al. [14] and Huang et al. [15] utilized a scene geometric
model to restrict object locations and a ground plane model to
restrict objects' heights at particular locations.
2.2. Robust tracking
2.2.1. Detection free tracking
Some techniques assume that objects enter the scene at some
specific location [5], or appear in the scene without occlusions [5,16] for a
period of time that allows object models to be built up while they are isolated. Some techniques (e.g. [5,6]) depend on accurate segmentation
of moving foreground objects from a background color intensity
model, where Kamijo et al. [5] segmented foreground blocks into vehicles
using spatial–temporal information, and Zhao et al. [6] developed
a tracker based on a human shape model. All of them rely on the inherent
assumption that there will be a significant difference in color intensity
between foreground and background. Unfortunately,
background modeling suffers from many problems, such as
inaccuracy, noise sensitivity, and weakness to shadow. Similar assumptions
are made in [17–20], where the authors extracted features, e.g. intensity,
colors, edges, contours and feature points, and used them to establish
correspondences between model images and target images. Moreover,
shape based approaches [6,21] encounter challenges when
body parts are not isolated, which may cause significant occlusions, and appearance based ones [16] often fail when several objects get
close together, as such algorithms fail to allocate pixels to
the correct object. In order to overcome some of these problems,
Kelly et al. [22] used 3D stereo information to detect pedestrians via
a 3D clustering process and track them by a weighted maximum
cardinality matching scheme.
2.2.2. Detection related tracking
2.2.2.1. Detection based tracking. With the fast development of object
detection techniques, object detectors play an important role in
many tracking algorithms. Some tracking algorithms use detection
as their observation model. One of the most successful techniques is
the particle filter [23]. The particle filter is based on Sequential Monte Carlo
sampling, and has gained much attention because of its simplicity,
generality, and extensibility in a wide range of challenging
applications. Xing et al. [7] combined multiple part detectors with a particle filter
to track multiple objects with occlusions. Another line of work is
to associate detection results of video frames locally [24] or globally
[8,15,25–27]. Wu et al. [24] associated detection results in two consecutive
frames. Jiang et al. [25] adapted Linear Programming for association,
while Zhang et al. [26] used min-cost flow. Andriluka et al. [8]
tailored the Viterbi algorithm to link detection results, combining the
advantages of both detection and tracking. Huang et al. [15] presented
a three-level hierarchical association approach where they
obtained short tracks and long tracks at the low and middle
levels separately, and refined the final trajectories with the estimated
scene knowledge at the high level. Pirsiavash et al. [27] proposed globally
optimal greedy algorithms to estimate the number of tracks and their
birth and death states in a cost function. Global association based
tracking methods can theoretically obtain a global optimum, since
the results of all frames are available before tracking. However,
the cost of heavy computation and temporal delay limits them in
real time applications.
2.2.2.2. Online learning. Avidan [28] trained an ensemble of weak
classifiers online to distinguish between the object and the background.
Grabner et al. [29] described an online boosting algorithm for real-
time tracking, which was very adaptive but prone to drift. To limit the
drifting problem, Grabner et al. [30] introduced a semi-supervised
learning algorithm using unlabeled data explored in a principled manner,
while Babenko et al. [31] proposed online Multiple Instance
Learning, using one positive bag consisting of several image patches
to update a learned classifier. However, manual initialization and the
focus on single object tracking prevent their application in our scenes
of interest.
3. System overview
We propose to detect and track multiple humans in surveillance
scenes with occlusions and viewpoint changes using a single stationary
camera, by taking advantage of some available scene models (e.g. background, camera, layout and ground plane models). We believe
that the models we use are generic and applicable to a wide variety
of situations. The models used are listed as follows.
(a) A camera model to rectify an image with large viewpoint
changes into a frontal viewpoint;
(b) A background model to direct the system's attention to the regions
showing difference from the background;
(c) A layout model to restrict objects in the scene;
(d) A ground plane model to restrict objects to standing on the ground.
The whole system is overviewed in Fig. 1, which mainly includes
two components, Scene Aware Detection and Block Assignment Tracking.
The three key factors of the SAD are foreground aware pruning to
prune negative and false positive samples, a structural filter approach based on our previous work [4] to detect occluded objects, and
foreground aware merging to explain foregrounds by detection results. The
BAT formulates tracking as a block assignment process, which can track
an object even when all the part detectors fail, as long as the object has
assigned blocks. The BAT proceeds as follows. It first maintains spatial
and temporal consistency at the block level (Block Tracking), then
precisely estimates locations and sizes of objects at the ensemble
level using appearance, motion and discriminative models (Ensemble
Tracking), and finally assigns blocks so that blocks with the
same label look like a part of a human, by combining both previous results
(Ensemble to Block Assignment). In our implementation, we split each frame
into 8×8 blocks; a typical 640×480 image thus contains 80×60 =
4800 blocks. A block is called a foreground block if the number of
foreground pixels it contains is larger than 20% of its total. Similar
to [5], the BAT takes foreground blocks into account and ignores
background ones.
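The block-splitting convention above (8×8 blocks, with a block counted as foreground when more than 20% of its pixels are foreground) can be sketched as follows. The function name and arguments are illustrative, and the binary foreground mask is assumed to come from the background model:

```python
import numpy as np

def foreground_blocks(fg_mask, block=8, min_ratio=0.2):
    """Split a binary foreground mask (H, W) into block x block cells and
    return a boolean grid marking cells whose foreground coverage
    exceeds min_ratio (the 20% rule described above)."""
    h, w = fg_mask.shape
    gh, gw = h // block, w // block
    # Crop to a whole number of blocks, then reshape into a cell grid.
    cells = fg_mask[:gh * block, :gw * block].reshape(gh, block, gw, block)
    coverage = cells.mean(axis=(1, 3))   # fraction of fg pixels per cell
    return coverage > min_ratio

# A 640x480 frame yields an 80x60 grid, i.e. 4800 blocks.
mask = np.zeros((480, 640), dtype=np.float32)
mask[100:140, 200:260] = 1.0             # a synthetic foreground region
grid = foreground_blocks(mask)
print(grid.shape, int(grid.sum()))
```

The resulting boolean grid is the set of foreground blocks that the BAT considers; background blocks are simply ignored downstream.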
The BAT solves a particular segmentation problem, coarser than pixel
level segmentation but finer than bounding boxes, as illustrated in
Fig. 2. Pixel level segmentations are defined to achieve the most accurate
results, but they are somewhat prone to errors under occlusions and
particularly for non-rigid objects like humans with viewpoint changes, as
their contours are disturbed and vary drastically. These restrictions
prevent such methods from being applied in the scenes we focus on.
Bounding boxes may take extra (non-object or other-object) pixels
into account and miss some real object pixels. These drawbacks also exist
in the BAT but are relatively more moderate, since the BAT considers
foreground blocks and ignores background ones. More importantly,
the BAT can hence build more robust appearance and motion models for
objects from these blocks than bounding boxes can.
4. Scene aware detection
4.1. Scene models
The background model is widely used in many tracking systems. In
order to establish a background model robust to noise, motion and
illumination variations, we employ the lifespan background modeling
algorithm of our previous work [32], where short, middle and long
lifespan models are online adaptively built and updated in a collaborative
manner.
Fig. 1. System overview. Round rectangle box: inputs and outputs. Rectangle box: procedure. Solid arrow: data flow. Double-line arrow: extra input models. The key factors of our
system are marked in bold.
Fig. 2. Comparison of BAT, bounding boxes and pixel level segmentation on one object. (a) an image; (b) the foreground image; (c) ideal pixel level segmentation labeled
manually; (d) bounding boxes with extra pixels (left) and missed pixels (right); and (e) BAT with extra blocks (left) and missed blocks (right). Please see Section 3 for more discussion.
Camera models are utilized to handle viewpoint changes in detection.
We follow the method of [10], which first detects objects in sub-
images rectified from a changed viewpoint to a frontal viewpoint,
and then projects the detection results back into the original image. This
kind of method is able to take advantage of detectors learned for a frontal viewpoint and avoids the more difficult training over multiple
viewpoint samples. During detection, the sampling in 3D space is
projected into the image coordinates as shown in Fig. 3(d) (bottom).
Moreover, there is no need to do such rectification for frontal viewpoint
scenes. To speed up detection in these scenes, we assume a
linear mapping from the 2D coordinate (x, y) to the human height Lh,
c1·x + c2·y + c3 = Lh, where c1, c2 and c3 are unknown parameters that can be
estimated through a RANSAC style algorithm like [33]. During detection,
the sampling in 2D space is a scanning window process restrained
by the linear mapping as shown in Fig. 3(d) (top). Please
refer to [33,10] for details.
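The linear height mapping c1·x + c2·y + c3 = Lh can be estimated from a handful of (x, y, height) calibration samples. The sketch below uses a deliberately simplified RANSAC-style loop in the spirit of [33]; the function name, thresholds and synthetic data are all illustrative:

```python
import numpy as np

def fit_height_map(points, heights, iters=200, tol=3.0, seed=0):
    """Estimate (c1, c2, c3) in c1*x + c2*y + c3 = Lh from noisy
    (x, y, Lh) samples with a small RANSAC loop, refitting on the
    inliers of the best minimal hypothesis (a simplified stand-in for
    the RANSAC-style estimation referenced in the text)."""
    rng = np.random.default_rng(seed)
    A = np.column_stack([points[:, 0], points[:, 1], np.ones(len(points))])
    best_inliers = None
    for _ in range(iters):
        idx = rng.choice(len(points), size=3, replace=False)
        try:
            c = np.linalg.solve(A[idx], heights[idx])
        except np.linalg.LinAlgError:
            continue  # degenerate (collinear) minimal sample
        residuals = np.abs(A @ c - heights)
        inliers = residuals < tol
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Least-squares refit on the inlier set.
    c, *_ = np.linalg.lstsq(A[best_inliers], heights[best_inliers], rcond=None)
    return c

# Synthetic scene: height shrinks as objects move up the image (smaller y).
xy = np.array([[50, 400], [300, 420], [600, 380],
               [200, 150], [500, 180], [100, 300]], float)
lh = 0.02 * xy[:, 0] + 0.3 * xy[:, 1] + 10.0
c1, c2, c3 = fit_height_map(xy, lh)
print(round(c1, 3), round(c2, 3), round(c3, 1))
```

During detection, the fitted (c1, c2, c3) then gives the expected human height, and hence the window size, at every scanning position.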
Layout models can be easily marked out for stationary scenes such
as Fig. 3(c). We assume that humans stand on the ground plane in the
layout. After integrating these two models with the linear mapping or
camera model mentioned earlier, we can obtain the sampled searching
points and corresponding human heights in scenes as illustrated
in Fig. 3(d).
4.2. Foreground aware pruning (FAP)
This step prunes negative and false positive samples by foregrounds
as shown in Fig. 4(a). We cast this pruning problem as a 2-class
classification problem on binary images, and design a simple
discriminative learning algorithm under the boosting framework
[1]. The aim is to mine some features to learn a fast and effective
pruning detector.
Our features are based on the zeroth moment of a region RG, M(RG), where M(RG) = Σ_{(x,y)∈RG} IB(x,y) in a binary image IB. Each feature r is a sub-region of IB as shown in black in Fig. 4(c). The feature value is calculated as

f(r, IB) = (M(r) − M(IB∖r)) / |IB|   (1)

where |IB| is the total number of pixels in IB. We restrict r to be a rectangle,
and hence Eq. (1) can be calculated efficiently through an integral
image without generating image pyramids as in [1].
Positive samples for the pruning can be obtained by manual labeling,
as shown in Fig. 4(b). However, collecting negative samples is
impractical for two reasons. One is that negative samples
can take any form, which makes manual labeling too time consuming.
The other is that when applying the pruning detector,
negative samples themselves are always inaccurate because of noise
in background modeling; it is thus likely that parts of real objects
are missing from the foreground and some background is included in
objects. In fact, negative samples are not necessary because 1) a small
amount of negative samples may cause overfitting, and 2) a large
amount of negative samples might make the pruning detectors very
Fig. 3. Models in detection: (a) original images; (b) foregrounds; (c) scene layouts; (d) some searching points in red, with lines whose lengths indicate the corresponding human
heights; (e) cropped sub-images and their foregrounds; and (f) detection results projected as quadrangles in the original images. The top and bottom rows show a common frontal
viewpoint scene and a changed viewpoint one, respectively. Note that, in the latter case, camera models are adopted to handle the difficulty of viewpoint changes.
Fig. 4. Foreground pruning. (a) Typical pruned negative and false positive examples. (b) Whole body positive masks, from which the other part positive masks can be generated.
(c) The five features used.
complex and thus inefficient at pruning negative and false positive
samples. Motivated by the above, the pruning classifiers are learned
with positive samples only. The classifier on feature r is determined as

h_r(IB) = 1 if f(r, IB) − T_r > 0; 0 otherwise   (2)

where T_r = min_{x_i^B} f(r, x_i^B) − ε, ε is a small positive value (10^−2), and x_i^B is a positive
sample. In consideration of the inaccuracy of background modeling,
positive samples are perturbed by moving up to 3 pixels left or right, or
2 pixels up or down.
This pruning should be fast and effective. Instead of automatically
selecting good features from a large feature pool as in [1], we simply
design several features as shown in Fig. 4(c). All classifiers learned
on these features are combined into one strong detector,
in which their order is not constrained. A searching window is then
considered only if its corresponding foreground passes this strong detector.
For an n × m image, the pre-processing of an integral image costs
O(nm) time and space. Each feature can then be calculated in O(1) time, so a bunch of classifiers costs approximately constant
time. Its effectiveness is evaluated in the experiments.
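Eqs. (1)–(2) and the integral-image evaluation above can be sketched as follows. The rectangle features, mask sizes and helper names are illustrative stand-ins, not the paper's actual configuration:

```python
import numpy as np

def integral(img):
    """Summed-area table with a zero top row/left column, so any
    rectangle sum is four lookups."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def rect_sum(ii, x0, y0, x1, y1):
    # Sum of img[y0:y1, x0:x1] in O(1) time from the integral image.
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]

def feature(ii, rect, total_pixels):
    """f(r, IB) = (M(r) - M(IB \\ r)) / |IB| using the zeroth moment M,
    i.e. Eq. (1) above (notation reconstructed from the text)."""
    m_r = rect_sum(ii, *rect)
    m_rest = ii[-1, -1] - m_r
    return (m_r - m_rest) / total_pixels

def learn_thresholds(positives, rects, eps=1e-2):
    """Positive-only training: T_r is the minimum feature value over the
    positive masks, relaxed by a small eps, as in Eq. (2)."""
    thresholds = {}
    for rect in rects:
        vals = [feature(integral(p), rect, p.size) for p in positives]
        thresholds[rect] = min(vals) - eps
    return thresholds

def passes(mask, rects, thresholds):
    ii = integral(mask)
    return all(feature(ii, r, mask.size) > thresholds[r] for r in rects)

# Toy 16x32 "whole body" masks: a centered vertical bar of foreground.
pos = [np.zeros((32, 16), int) for _ in range(4)]
for p in pos:
    p[4:30, 4:12] = 1
rects = [(4, 4, 12, 30), (0, 0, 16, 8)]   # (x0, y0, x1, y1) sub-regions
th = learn_thresholds(pos, rects)
print(passes(pos[0], rects, th), passes(np.zeros((32, 16), int), rects, th))
```

A candidate window's foreground must pass every per-feature threshold; a window with no foreground support is rejected in roughly constant time.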
4.3. Structural filter approach
The detection is based on our previous work [4,34]. We proposed
to learn an Integral Structural Filter (ISF) detector in [4] to detect
humans with occlusions and articulated poses in a feature sharing
mechanism. We build a three-level hierarchical model for humans:
words, sentences and paragraphs, where words are the most basic
units, sentences are meaningful sub-structures and paragraphs
are the appearance statuses (e.g., head–shoulder, upper-body, left-
part, right-part and whole-body in occluded scenes). An example is
shown in Fig. 5. We integrate the detectors for the three levels through
inference from word to sentence, from sentence to paragraph and from
word to paragraph. All detectors for structures (words, sentences and
paragraphs) are based on the Real AdaBoost algorithm and Associated
Pairing Comparison Features (APCFs) [34]. APCF describes invariance
of color and gradient of an object to some extent, and contains two
essential elements, Pairing Comparison of Color (PCC) and Pairing
Comparison of Gradient (PCG). A PCC (or PCG) is a Boolean color (or
gradient) comparison of two granules, where a granule is a square
window patch. Please refer to [4,34] for more details.
4.4. Foreground aware merging (FAM)
We now discuss the merging strategy applied after obtaining all detection
results. Different from previous approaches (e.g. [2,3]) which stick to
detection results, we integrate foreground information into the post-
processing. We consider objects one by one after extending them to
the whole body, through adding and deleting operations defined on visible and invisible parts of objects. To reduce the computational
complexity, the two operations are based on blocks as defined in
Section 3.
A hypothesis h is a detection response. We denote the block set and
foreground block set of h by Bh and Fh respectively. For a hypothesis set
H, we have BH = ∪_{h∈H} Bh and FH = ∪_{h∈H} Fh correspondingly. The score of adding h into H is defined as

sc_add(h) = (|F_{H∪{h}}| − |F_H|) / (|B_{H∪{h}}| − |B_H|), if |F_{H∪{h}}| − |F_H| > T_M·|F_h| and h ∉ H; 0 otherwise.   (3)

h can be added if sc_add(h) > T_add. T_M is a threshold. The score of deleting
h from H is defined as

sc_del(h) = (|F_H| − |F_{H∖{h}}|) / (|B_H| − |B_{H∖{h}}|), if |F_H| − |F_{H∖{h}}| > T_M·|F_h| and h ∈ H; 0 otherwise.   (4)

h can be deleted if sc_del(h) < T_del. T_add and T_del are empirical parameters.
The smaller T_add, the more objects are added; the larger T_del, the more objects are
deleted. In the implementation, we adopt a greedy strategy that first uses
the adding operation to find possible hypotheses and then the deleting
operation to remove bad ones. Although the strategy is very
simple, it yields promising detection results in the experiments.
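The greedy adding/deleting strategy of Eqs. (3)–(4) might look like the following sketch, where each hypothesis carries precomputed block and foreground-block sets; the dictionary keys, ordering heuristic and threshold values are illustrative, not the paper's settings:

```python
def greedy_merge(hypotheses, t_add=0.5, t_del=0.3, t_m=0.2):
    """Greedy foreground-aware merging sketch in the spirit of
    Eqs. (3)-(4): each hypothesis h carries h['blocks'] and h['fg'],
    its block set and foreground-block set."""
    H, B, F = [], set(), set()

    def sc_add(h):
        gain_f = len(F | h['fg']) - len(F)
        gain_b = len(B | h['blocks']) - len(B)
        if gain_b == 0 or gain_f <= t_m * len(h['fg']):
            return 0.0  # h explains too little new foreground
        return gain_f / gain_b

    # Adding pass: accept hypotheses that explain enough new foreground.
    for h in sorted(hypotheses, key=lambda h: -len(h['fg'])):
        if sc_add(h) > t_add:
            H.append(h)
            B |= h['blocks']
            F |= h['fg']

    def sc_del(h):
        others_f = set().union(set(), *(g['fg'] for g in H if g is not h))
        others_b = set().union(set(), *(g['blocks'] for g in H if g is not h))
        loss_f, loss_b = len(F) - len(others_f), len(B) - len(others_b)
        if loss_b == 0 or loss_f <= t_m * len(h['fg']):
            return 0.0
        return loss_f / loss_b

    # Deleting pass: drop hypotheses whose own foreground support is weak.
    return [h for h in H if sc_del(h) >= t_del]

h1 = {'id': 1, 'blocks': set(range(1, 11)), 'fg': set(range(1, 9))}
h2 = {'id': 2, 'blocks': set(range(5, 15)), 'fg': set(range(5, 13))}
h3 = {'id': 3, 'blocks': set(range(1, 11)), 'fg': set(range(1, 9))}  # duplicate
kept = greedy_merge([h1, h2, h3])
print([h['id'] for h in kept])
```

The duplicate hypothesis explains no new foreground, so its adding score collapses to zero and it is never accepted, while both genuine hypotheses survive the deleting pass.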
5. Block Assignment Tracking
The previous section mainly discussed accurately locating objects
in scenes with occlusions and viewpoint changes. In this section,
we concentrate on robustly tracking them. In what follows, we first
derive the formulation of our block assignment tracking problem,
and then present our solution.
Fig. 5. The hierarchical structure of a pedestrian [4].
5.1. Problem formulation
Denoting object state sequences from frame 1 to frame T as S_{1:T} =
{S_1, …, S_T} and the corresponding observation sequences collected
from the frame data as O_{1:T} = {O_1, …, O_T}, a tracking problem can
be formulated as solving the following MAP (maximum a posteriori)
problem:

S*_t = argmax_{S_t} p(S_t | O_{1:t}).   (5)
Generally, an object state can be modeled as the location and size
of the object at the ensemble level as in [7], or as a set of blocks forming
its appearance as in [5]. Tracking at the ensemble level is effective
when objects are isolated. However, it tends to produce errors when
objects interact with each other, since ensemble observations can be
ambiguous or missing because of occlusions. When objects are well
initialized, tracking at the block level is effective even with heavy
occlusions, as it mainly considers block persistence in the spatial and
temporal domains. But it cannot guarantee that a segmented region
looks like an object part; in fact, a region might contain no object or several.
Moreover, it has no explicit correcting mechanism to rectify errors that arise during initialization and tracking. In order to
combine their merits and get rid of their restrictions, we propose to
model object states on both the ensemble and block levels as S_t = {Z_t, V_t},
where Z_t = {z_{t,k}}_{k=1}^{K} is the ensemble level state of all K objects and
V_t = {v_{t,i}}_{i=1}^{N} is the block level state of all N blocks. v_{t,i} is the label for
block b_{t,i}, indicating that b_{t,i} belongs to object z_{t,v_{t,i}} if v_{t,i} ≥ 0, or to the background if v_{t,i} < 0. All blocks with the same label form the appearance
of an object, while the ensembles describe the coarse shapes of objects
and cover some blocks assigned to them, as illustrated in Fig. 6. Therefore,
we modify Eq. (5) and formulate our problem as

(Z*_t, V*_t) = argmax_{Z_t, V_t} p(Z_t, V_t | O_{1:t}, V_{t−1}).   (6)
Compared to Eq. (5), V_{t−1} on the right side of Eq. (6) takes the previous
assignment into account. However, the optimization of Eq. (6) is
not tractable because V_{1:t} and Z_t are closely intertwined at time t. The inference
between V_t and V_{t−1} should preserve the spatial and temporal persistence
of block assignments. Meanwhile, Z_t encourages blocks with
the same label in V_t to look like an object. Moreover, V_{1:t−1} can provide
robust appearance and motion models of objects for inferring Z_t. To
make the optimization tractable, we propose to split Eq. (6) into three
steps. The first step obtains an intermediate assignment Ṽ_t through
inference at the block level over two sequential frames, ignoring Z_t:

Ṽ_t = argmax_{Ṽ_t} p(Ṽ_t | V_{t−1}, O_{t−1:t}).   (7)
This step preserves the persistence of block assignments in the spatial and temporal domains. The second step then focuses on inferring Z_t
with the aid of the robust appearance and motion models of objects
estimated from V_{1:t−1}:

Z*_t = argmax_{Z_t} p(Z_t | O_{1:t}).   (8)
Afterwards, the third step achieves the final assignment by
combining the previous results Ṽ_t and Z*_t:

V*_t = argmax_{V_t} p(V_t | Z*_t, O_t, Ṽ_t).   (9)
The third step builds on the other two, making
blocks with the same label look like a part of some object and potentially
rectifying possible errors from initialization and tracking. Integrating
these three steps into Eq. (6), we obtain

p(Z_t, V_t | O_{1:t}, V_{t−1}) ∝ p(Ṽ_t | V_{t−1}, O_{t−1:t}) · p(Z_t | O_{1:t}) · p(V_t | Z_t, O_t, Ṽ_t).   (10)

Therefore, Eq. (10) can be efficiently solved by the max-product
algorithm. These three steps are further explained in the next section.
This completes our problem formulation. Since the last
step assigns blocks at each time, we term the method Block Assignment
Tracking. Compared to [5,7], our formulation provides a simple way
to integrate block and ensemble level information.
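The three-step factorization can be illustrated with a deliberately simplified, runnable sketch. The three stand-in models below only show the data flow of Eqs. (7)–(9); they are not the actual energies and models of Section 5.2:

```python
import numpy as np

def bat_step(labels_prev, motions, detections, grid_shape):
    """One toy BAT iteration factored as in Eq. (10): (1) block tracking
    shifts each object's blocks by its motion (~V_t); (2) ensemble
    tracking takes object centers (here simply from detections, Z_t);
    (3) assignment relabels each candidate block to the nearest
    ensemble (V_t). All three models are drastic simplifications."""
    gh, gw = grid_shape
    # Step 1: propagate previous labels by per-object block motion.
    labels_tmp = -np.ones(grid_shape, int)
    for (r, c), k in np.ndenumerate(labels_prev):
        if k >= 0:
            dr, dc = motions[k]
            rr, cc = r + dr, c + dc
            if 0 <= rr < gh and 0 <= cc < gw:
                labels_tmp[rr, cc] = k
    # Step 2: ensemble centers in (row, col) block coordinates.
    centers = np.array(detections, float)
    # Step 3: assign every tentatively labeled block to its nearest center.
    labels = -np.ones(grid_shape, int)
    for (r, c), k in np.ndenumerate(labels_tmp):
        if k >= 0:
            d = np.linalg.norm(centers - np.array([r, c]), axis=1)
            labels[r, c] = int(np.argmin(d))
    return labels

prev = -np.ones((6, 6), int)
prev[1:3, 1:3] = 0                      # object 0 occupies a 2x2 block patch
lab = bat_step(prev, {0: (1, 1)}, [(2.5, 2.5)], (6, 6))
print(sorted(map(tuple, np.argwhere(lab == 0).tolist())))
```

Even if the detector contributes nothing at step 2 (e.g. all part detectors fail), step 1 still carries the object's blocks forward, which is the property the text emphasizes.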
5.2. Solution
In this subsection, we present the details of the three steps in Eq. (10),
namely Block Tracking, Ensemble Tracking and Ensemble to Block
Assignment. At the end, we give a summary of our
tracking algorithm.
5.2.1. Block Tracking
This step predicts an intermediate result by taking advantage of
label, color, and shape constraints. Inspired by the similar
problem in [5], we define

−ln p(Ṽ_t | V_{t−1}, O_{t−1:t}) = Σ_{k=0}^{K} Σ_{i=1}^{N} λ_i(b_{t,i}, z_{t,k} | V_{t−1}, O_{t−1:t}) · δ(v_{t,i}, k) + Σ_{i=1}^{N} γ_i(b_{t,i}, b_{t,j_1^i}, …, b_{t,j_l^i} | V_{t−1}, O_{t−1:t})   (11)

where λ_i is the penalty if b_{t,i} is assigned to z_{t,k}, γ_i is the penalty when b_{t,i} and its neighbors are assigned to different objects, δ(i, j) is a Kronecker
function equaling 1 if i = j and 0 otherwise, l = |N_{b_{t,i}}|, and N_{b_{t,i}} are the 8-
neighbor blocks of b_{t,i}. The observations here are image
sequences, and object states are updated straightforwardly from their
previous states as z_{t,k} = z_{t−1,k} + rz_{t,k} by their motions rz_{t,k} = (rz_{t,k}^x, rz_{t,k}^y). The motion of an object is represented by the most frequent motion
Fig. 6. Tracking problem formulation. Left: original image. Middle: foreground block image. Right: an assignment where blocks with the same color (label) form the appearance of one
object and the quadrangles indicate the coarse shapes of objects.
among all its blocks, where the motion of one block is obtained by
block matching. We now give the definitions of λ_i and γ_i:

λ_i(b_{t,i}, z_{t,k} | V_{t−1}, O_{t−1:t}) = a·DL_{t,i,k} + b·DM_{t,i,k} + c·DA_{t,i,k}.   (12)

DL_{t,i,k} is a rough shape constraint restricting the spread of block labels. As object shapes are quadrangles, we need to eliminate the effects of scale
along the axes and of rotation in the 2D plane. Our idea is to use a normalization
matrix x̃ = [1/W, 0; 0, 1/H] · [cos θ, −sin θ; sin θ, cos θ], where [W, H]^T is
the minimum detection size and θ is the angle between the object and
the vertical. Let the centers of b_{t,i} and z_{t,k} be x_{t,i} = (x_{t,i}, y_{t,i})^T
and x_{z_{t,k}} = (x_{z_{t,k}}, y_{z_{t,k}})^T respectively. We define DL_{t,i,k} = exp(‖x̃·(x_{t,i} − x_{z_{t,k}})‖²).
DM_{t,i,k} is a temporal constraint on label consistency, defined as
DM_{t,i,k} = (M_{t,i,k} − M_i)², where M_{t,i,k} is the number of pixels in the overlap
between z_{t−1,k} and the region obtained by moving b_{t,i} by −rz_{t,k}, and M_i is the total number of pixels in a block.
DA_{t,i,k} is a color constraint, which measures temporal color coherence. Letting I_t be the gray scale frame at time t, we define

DA_{t,i,k} = Σ_{0≤dx<8} Σ_{0≤dy<8} |I_t(x+dx, y+dy) − I_{t−1}(x+dx−rz_{t,k}^x, y+dy−rz_{t,k}^y)|.   (13)
γ_i is the spatial constraint on label consistency:

γ_i(b_{t,i}, b_{t,j_1^i}, …, b_{t,j_l^i} | V_{t−1}, O_{t−1:t}) = d·Σ_{k=1}^{K} (N_{i,k} − N_k)² + g·Σ_{n=1}^{l} ‖r_{t,i} − r_{t,j_n^i}‖².   (14)

Similar to [5], we set a = 1, b = 1, c = 0.125, d = 0.00000025 and
g = 0.5, and adopt the Gibbs sampler algorithm [35] to solve Eq. (11).
Please refer to [5,35] for more details.
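A sketch of the unary penalty of Eq. (12) for a single block follows. Since the printed formulas are partly garbled, the exact forms of DL, DM and DA here are best-effort reconstructions (DM is additionally normalized by the block area for numerical convenience), and all parameter names are illustrative:

```python
import numpy as np

def block_penalty(block_xy, obj_xy, obj_motion, frame, frame_prev,
                  overlap_pixels, W=24, H=58, theta=0.0,
                  a=1.0, b=1.0, c=0.125, block=8):
    """Unary penalty lambda_i = a*DL + b*DM + c*DA in the spirit of
    Eq. (12), with shape (DL), label-consistency (DM) and color (DA)
    terms; the weights follow the values quoted in the text."""
    # DL: block-to-object center distance, normalized by the minimum
    # detection size after removing in-plane rotation.
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    S = np.diag([1.0 / W, 1.0 / H])
    d = S @ R @ (np.asarray(block_xy, float) - np.asarray(obj_xy, float))
    DL = np.exp(d @ d)
    # DM: how far the motion-compensated overlap falls short of a full block.
    DM = ((overlap_pixels - block * block) / (block * block)) ** 2
    # DA: sum of absolute gray-level differences under the object's motion.
    x, y = block_xy
    dx, dy = obj_motion
    cur = frame[y:y + block, x:x + block].astype(float)
    prv = frame_prev[y - dy:y - dy + block, x - dx:x - dx + block].astype(float)
    DA = np.abs(cur - prv).sum()
    return a * DL + b * DM + c * DA

# A block that moves exactly with its object incurs only the DL term.
frame_prev = np.tile(np.arange(32), (32, 1))
frame = np.roll(frame_prev, (1, 1), axis=(0, 1))
pen = block_penalty((8, 8), (8, 8), (1, 1), frame, frame_prev, overlap_pixels=64)
print(round(float(pen), 3))
```

In the full algorithm this penalty is evaluated for every (block, object) pair and combined with the neighborhood term γ_i inside the Gibbs sampling loop.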
5.2.2. Ensemble Tracking
This step estimates object locations accurately at the ensemble level and offers the potential to amend possible errors in initialization and tracking, as discussed earlier. The errors are not notable over a short time ($N_E$ frames for simplicity), but will be magnified vastly as time passes. For the former situation, object states updated by their motions are adequate. For the latter situation, we refer to the update step of a sequential Bayesian estimation problem:
$$p(Z_t \mid O_{1:t}) \propto L(O_t \mid Z_t)\, p(Z_t \mid O_{1:t-1}) \quad (15)$$
in which $p(Z_t \mid O_{1:t-1})$ is the prediction step
$$p(Z_t \mid O_{1:t-1}) = \int D(Z_t \mid Z_{t-1})\, p(Z_{t-1} \mid O_{1:t-1})\, dZ_{t-1} \quad (16)$$
where $L(O_t \mid Z_t)$ is the likelihood of the observation and $D(Z_t \mid Z_{t-1})$ is the dynamic model of the system, which is modeled as a first-order Gaussian by considering object motions.

In order to approximate the filtering distribution, the Particle Filter (PF) approach [23] uses a set of weighted particles. Its direct extension to multiple object tracking models objects as unrelated. However, this may cause ID switches when tracking adjacent objects, because observations are ambiguous to assign to objects. Differently, we do not distinguish particles generated from different objects. Fig. 7 compares the two strategies. Formally, we extend [23] by
$$p(Z_t \mid O_{1:t}) \approx \sum_{n=1}^{N_p} \pi^n_{t,k}\, \delta_{z^n_t}(z_t) \quad (17)$$
in which $N_p$ is the total number of particles, and $\delta_z(\cdot)$ denotes the delta-Dirac function at position $z$. The $n$th particle is denoted as $p^n = (x^n_t, s^n_t, H^n_t, \{\pi^n_{t,k}\}_1^K)$, where $x^n_t = (x^n_t, y^n_t)$ is the location, $s^n_t$ is the scale, $H^n_t$ is the appearance model, and $\pi^n_{t,k}$ is the weight for $z_{t,k}$. Motivated by the successes of [7,16], we define
$$\pi^n_{t,k} = \begin{cases} \lambda\, \pi^{n,D}_{t,k} + (1-\lambda)\, \pi^{n,G}_{t,k}, & \left\| x^n_t - x_{z_{t,k}} \right\|^2 \le \beta, \\ 0, & \text{otherwise} \end{cases} \quad (18)$$
where $\pi^{n,D}_{t,k}$ is a discriminative weight modeled using the detector confidence and $\pi^{n,G}_{t,k}$ is an appearance weight measured from an online learned appearance model. $\lambda$ is a parameter ($\lambda = 0.5$ here) and $\beta$ is a distance threshold. The appearance models for particles or objects come from pixels in foreground blocks. We utilize the HSV color space, and the number of bins for each channel is set to 16.
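The gated weight of Eq. (18) can be sketched as follows. The precomputed detector-confidence and appearance weights, and the numeric distance threshold, are assumptions for illustration.

```python
import numpy as np

def particle_weight(particle_xy, object_xy, w_det, w_app,
                    lam=0.5, beta2=400.0):
    """Sketch of the gated particle weight of Eq. (18).
    w_det (detector-confidence weight) and w_app (appearance weight
    from an online model) are assumed precomputed in [0, 1];
    beta2 is a hypothetical squared distance threshold."""
    d2 = float(np.sum((np.asarray(particle_xy, float)
                       - np.asarray(object_xy, float)) ** 2))
    if d2 > beta2:   # particle too far from this object: no support
        return 0.0
    return lam * w_det + (1.0 - lam) * w_app
```

The hard gate is what implements the strategy of Fig. 7(c): a particle contributes to every object within range, and to none beyond it.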
But objects may sometimes get lost during tracking. If an object cannot gather enough support from particles (its accumulated weight $\sum_n \pi^n_{t,k}$ falls below a threshold), it is lost and buffered for possible matching against newly detected objects. We perform object detection (SAD) every $N_F$ frames to find new objects. If a lost object cannot be matched within $T_W$ frames, it will be discarded.
5.2.3. Ensemble to Block Assignment
This step achieves the final result from the intermediate assignment $\tilde{V}_t$ and the estimated object state $Z_t$. This is a multi-label problem, which can easily be converted to a 2-label problem by adding objects one by one, and then solved by graph cut algorithms. Suppose the object map $V_t$ is obtained after adding objects $z_{t,1} \sim z_{t,k-1}$ and object $z_{t,k}$ is to be added. The target is then to minimize the following energy function each time:
$$E_k = \sum_{i=1}^{N} \left( \Phi_i(b_{t,i}, z_{t,k}) + \sum_{b_{t,j} \in N(b_{t,i})} \Psi_{i,j}(b_{t,i}, b_{t,j}) \right). \quad (19)$$
Fig. 7. Comparison of sampling strategies. (a) A scene with six persons. (b) PF [23] models objects as unrelated. (c) In our strategy, particles from different objects are not distinct, but those far away from the concerned object are ignored (e.g., only particles from objects D, C and E contribute to D).
The unary item $\Phi_i$ encodes the data likelihood, imposing penalties for assigning block $b_{t,i}$ to object $z_{t,k}$. We consider the shape model and the prior knowledge in
$$\Phi_i(b_{t,i}, z_{t,k}) = \kappa(x_{t,i}, x_{z_{t,k}})\; \eta^{-n_i}\, (1 - \tilde{v}_{t,i,k}) \quad (20)$$
where $\kappa(\cdot,\cdot)$ is a kernel function defined as $\kappa(x_{t,i}, x_{z_{t,k}}) = DL_{t,i,k}$ and $\eta$ is an occlusion factor. Let $n_i$ be the number of objects that occlude $z_{t,k}$ in block $b_{t,i}$, where an object is occluded by others if they overlap and its $y$-axis value is larger. Intuitively, the larger $n_i$, the lower $\Phi_i$. We have $\eta \ge 1$ (set to 1.25 in our experiments).
The pairwise item $\Psi_{i,j}$ encourages spatial coherence and imposes penalties when $b_{t,i}$ and $b_{t,j}$ are assigned different labels. As a sub-modular energy function can be solved by graph cut algorithms, we adopt the Potts model for simplicity:
$$\Psi_{i,j} = \begin{cases} \mu \exp\left( -\dfrac{\chi^2(A_{t,i}, A_{t,j})}{\sigma_A} \right) + (1-\mu) \exp\left( -\dfrac{\left\| r_{t,i} - r_{t,j} \right\|^2}{\sigma_r} \right), & v_{t,i} \ne v_{t,j}, \\ 0, & \text{otherwise} \end{cases} \quad (21)$$
where $A_{t,l}$ and $r_{t,l}$ are the appearance and motion of $b_{t,l}$ ($l = i, j$). $\mu$ is a parameter ($\mu = 0.5$ here); $\sigma_A$ and $\sigma_r$ are normalization factors. Here the appearances of blocks are modeled as 4-bin histograms in gray images. $\sigma_A$ is set to the number of pixels in a block (64 here). Supposing the maximum motion of a block is the block size (8×8), we set $\sigma_r = 8^2 + 8^2 = 128$.
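A minimal sketch of this pairwise term follows. The chi-square histogram distance for the appearance part is our reading of the garbled original and should be treated as an assumption; the cost is high when neighboring blocks look and move alike, so the graph cut is discouraged from separating them.

```python
import numpy as np

def pairwise_potts(hist_i, hist_j, motion_i, motion_j, label_i, label_j,
                   mu=0.5, sigma_a=64.0, sigma_r=128.0):
    """Sketch of the Potts pairwise term of Eq. (21): zero cost for
    equal labels; otherwise an appearance/motion similarity cost."""
    if label_i == label_j:
        return 0.0
    hi, hj = np.asarray(hist_i, float), np.asarray(hist_j, float)
    # chi-square distance between the 4-bin gray histograms (assumed form)
    denom = np.maximum(hi + hj, 1e-12)
    chi2 = float(np.sum((hi - hj) ** 2 / denom))
    # squared motion difference between the two blocks
    r2 = float(np.sum((np.asarray(motion_i, float)
                       - np.asarray(motion_j, float)) ** 2))
    return mu * np.exp(-chi2 / sigma_a) + (1.0 - mu) * np.exp(-r2 / sigma_r)
```

Two identical-looking, identically moving neighbors with different labels get the maximum cost of 1, i.e. the strongest penalty for cutting between them.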
After achieving the final assignment, we then update the appearance models of objects. Intuitively speaking, if an object is occluded by others, meaning that some of its overlapped foreground blocks are not assigned to it, the update ratio should be small: the more occlusion, the smaller the update ratio. Based on this, we define the update ratio as $\alpha = 0.5\, N_k / N_a$, where $N_k$ is the number of blocks assigned to $z_{t,k}$ and $N_a$ is the total number of blocks overlapped by $z_{t,k}$. Given the previous and current appearance models for $z_{t,k}$, $A_p$ and $A_c$, the update is described as $A = (1-\alpha) A_p + \alpha A_c$.
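The occlusion-aware update rule can be written directly from these definitions (the vector representation of the appearance model is an assumption):

```python
import numpy as np

def update_appearance(prev_model, curr_model, blocks_assigned, blocks_overlapped):
    """Occlusion-aware appearance update: alpha = 0.5 * Nk / Na, then
    A = (1 - alpha) * Ap + alpha * Ac. Heavily occluded objects (few
    of their overlapped blocks assigned to them) therefore update slowly."""
    alpha = 0.5 * blocks_assigned / blocks_overlapped
    return ((1.0 - alpha) * np.asarray(prev_model, float)
            + alpha * np.asarray(curr_model, float))
```

Note that even a fully visible object blends in at most half of the current observation per frame, which damps drift from any single noisy frame.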
Until now, we have described the three key components of the BAT. For easy reference, the entire procedure of the BAT is summarized in Fig. 8.
6. Experiments

In this section, we carry out extensive experiments to evaluate our proposed detection and tracking system. We first describe the training and testing datasets, then list some detection and tracking metrics for evaluation, then evaluate the performance of our system, and finally offer some discussion.
6.1. Datasets

We have labeled 2470 positive masks of 24×58, as shown in Fig. 4(b), for training the foreground pruning detector. We have also collected 18,474 whole-body positive samples of 24×58 for learning object detectors, as shown in Fig. 9. The positive masks and samples of the other parts can be generated from those of the whole body using the definitions in Fig. 5.
We use a large variety of challenging test datasets with different situations of occlusions and viewpoints for evaluation, as summarized in Fig. 10. Occlusions and viewpoint changes in these real-world datasets make them valuable for evaluating detection and tracking systems. As the viewpoint in CAVIAR1 is frontal, learned detectors can be applied directly. But since the viewpoints in CAVIAR2, PETS2007 and our dataset are tilted, we utilize camera models to cope with them.

Fig. 8. The algorithm of our system.
In our experiments, we aim at improving both detection and tracking performance with off-line discriminative models. Therefore, the test datasets are totally independent from the training set, and we apply the generally trained detectors to all test sequences without retraining them specifically for a certain scene.
6.2. Metrics

We use False Positives Per Image (FPPI) for detection evaluation. When the intersection between a detection response and a ground-truth box is larger than 50% of their union, we consider it a successful detection. Only one detection per annotation is counted as correct.
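This matching criterion is a standard intersection-over-union test; a minimal sketch, with boxes as (x1, y1, x2, y2) tuples:

```python
def is_correct_detection(det, gt, thresh=0.5):
    """True when the intersection of a detection box and a ground-truth
    box exceeds `thresh` times their union (the 50% criterion above)."""
    ix1, iy1 = max(det[0], gt[0]), max(det[1], gt[1])
    ix2, iy2 = min(det[2], gt[2]), min(det[3], gt[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((det[2] - det[0]) * (det[3] - det[1])
             + (gt[2] - gt[0]) * (gt[3] - gt[1]) - inter)
    return inter > thresh * union
```

In a full evaluation each annotation would additionally be matched to at most one detection, so duplicates count as false positives.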
For multi-object tracking, there is no single established protocol; we follow two existing sets of metrics. The metrics of [36] count the number of mostly tracked (MT), partially tracked (PT) and mostly lost (ML) trajectories, as well as the number of track fragmentations (FM) and identity switches (IDS). The CLEAR metrics [37] calculate the Multiple Object Tracking Accuracy (MOTA), which takes into account false positives, missed targets and identity switches, and the Multiple Object Tracking Precision (MOTP), which measures the precision with which objects are located using the intersection of the estimated region with the ground-truth region.
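A minimal sketch of MOTA/MOTP from per-frame counts; the matching between hypotheses and ground truth is assumed to have been done already, and the per-frame dictionary layout is hypothetical.

```python
def clear_mot(frames):
    """Minimal MOTA/MOTP sketch under the CLEAR metrics [37].
    `frames` is a list of per-frame dicts with already-matched counts:
    fp (false positives), miss (missed targets), ids (identity
    switches), gt (ground-truth objects), and per-match overlap
    scores in [0, 1]."""
    fp = sum(f["fp"] for f in frames)
    miss = sum(f["miss"] for f in frames)
    ids = sum(f["ids"] for f in frames)
    gt = sum(f["gt"] for f in frames)
    overlaps = [o for f in frames for o in f["overlaps"]]
    mota = 1.0 - (fp + miss + ids) / gt           # error rate over all GT
    motp = sum(overlaps) / len(overlaps) if overlaps else 0.0
    return mota, motp
```

MOTA aggregates the three error types against the total ground-truth count, while MOTP only averages localization quality over the matched pairs.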
6.3. Performance evaluations

6.3.1. Detection evaluations
In this subsection, we concentrate on evaluating the performance of the key components of our SAD: foreground aware pruning (FAP), the Structural Filter approach (ISF) and foreground aware merging (FAM). Since the number of available frames in the test datasets is quite huge, we select only 200 representative frames from each test dataset for evaluation.
6.3.1.1. Efficiency of FAP. The aim of FAP is to efficiently prune negatives and false positives. Table 1 shows the pruned window proportions and saved times on these datasets with default detection parameters. In Table 1, we can see that about 79% to 94.4% of windows are pruned, which yields plenty of time saving (0.29 s to 4.6 s). Since there are lots of people in our dataset, its pruned proportion is lower than those of the other datasets. Compared to CAVIAR1, the other three datasets need to rectify sub-images, and thus they cost much more time than CAVIAR1. However, as there are only a few (<4) persons in CAVIAR2, the time cost is not as large as on S02 and our dataset. This experiment sufficiently demonstrates the efficiency of FAP.
6.3.1.2. Efficiency of SAD. We choose two state-of-the-art works [12,38] for comparison with our SAD. ACF [38] has achieved good performance for pedestrian detection and is a strong competitor for frontal-viewpoint detection. ACF is learned on the same training dataset as our ISF for a fair comparison. Since there are no publicly available detectors for multiple viewpoints of humans, we use the Deformable Part Model (DPM) [12] as a baseline, which is well known for detecting objects with large variations. The original DPM detector is provided by the authors and trained on Pascal VOC 2008. For a fair comparison, we also train a new DPM detector on the same training dataset as our ISF. To distinguish them, we denote them as DPM1 and DPM2 respectively.
In the following, for concise descriptions, we let MAP be the Bayesian method in [2] and NAIVE be the simplest strategy of combining detection results by nearby locations. Note that except for CAVIAR1, the other three test datasets need rectification with camera models. The methods using camera models are indicated by CAM. The ROC curves are shown in Fig. 11.

Fig. 9. Positive samples for the whole body.

Fig. 10. Test datasets. The CAVIAR dataset can be downloaded from http://homepages.inf.ed.ac.uk/rbf/CAVIAR/. PETS2007 can be downloaded from http://www.cvg.rdg.ac.uk/PETS2007/. Humans in CAVIAR2 are too small, and therefore we double the original video size (384×288).

Table 1
Evaluations of foreground aware pruning. $N_H$ is the average number of humans. $P_{PW}$ is the average proportion of pruned windows among all scanned windows. $T$ is the time cost without foreground aware pruning and $t$ is the time saved when using it.

           CAVIAR1   S02      Our dataset   CAVIAR2
$N_H$      6         9        11            4
$P_{PW}$   94.4%     90%      79%           86%
$t$ (ms)   700       4600     1200          290
$T$ (ms)   1210      10,400   7560          650
6.3.1.2.1. Improvements of FAP. Compared to ISF+NAIVE, ISF+NAIVE+FAP improves the detection rate by about 3% on CAVIAR1. Compared to ISF+NAIVE+CAM, ISF+NAIVE+CAM+FAP improves the performance by about 4% on S02, 4% on our dataset and 1% on CAVIAR2. Similar performance improvements are achieved in ISF+MAP+CAM+FAP. From the experiments in Fig. 11 and Table 1, we can see that FAP not only works well for pruning but also improves detection performance.
6.3.1.2.2. Improvements of ISF and scene models. ISF+MAP performs better than or comparably to ACF+MAP on CAVIAR1, S02 and our dataset, demonstrating that ISF can detect occluded humans in scenes without large viewpoint changes. ISF (ISF+MAP and ISF+NAIVE) is better than DPM (DPM1 and DPM2), for which there may be two main reasons: 1) the ability of the deformable part based model is limited on strongly labeled samples like our training dataset, and 2) the weak feature in DPM uses only gradient information, while the weak feature in ISF combines both color and gradient information, which is more discriminative for pedestrian detection. Note that, as DPM2 is more focused on pedestrians than DPM1, it performs better than DPM1 on S02, and comparably to DPM1 on CAVIAR1 and our dataset. But all these detectors fail on CAVIAR2 because of large viewpoint changes, which can be better handled by camera models. Compared with ISF+NAIVE, ISF+NAIVE+CAM improves the performance by about 3% on S02 and 26% on our dataset, and it works well on CAVIAR2. Similar improvements are achieved in ISF+MAP+CAM. As the viewpoint of CAVIAR1 is frontal, the linear mapping from 2D coordinates to human height is used. In the experiment, we find that the linear mapping does not reduce detection performance, while it speeds up detection by about 0.6 s on average compared to using ISF alone.

6.3.1.2.3. Improvements of FAM. We replace the post-processing
method with FAM to show further performance improvements. Compared to ISF+MAP+FAP, our approach (ISF+FAM+FAP) improves the detection rate by about 11% on CAVIAR1. Compared to ISF+MAP+FAP+CAM, our approach (ISF+FAM+FAP+CAM) improves the detection rate by about 16% on our dataset and 14% on S02. As MAP adds objects in y-descending order, which does not hold in scenes with large viewpoint changes, it does not work well on CAVIAR2 and is sometimes even worse than NAIVE. In contrast, our approach still works well in such scenes and achieves a 52% detection rate at FPPI=0.1 on CAVIAR2. We also observe another interesting phenomenon: the curves of our approach are much steeper than the others, indicating that we can detect more objects with fewer false samples. This is mainly because of the pruned false positive samples and the scene models used. We zoom in on the curves of Fig. 11(b) and (c) to illustrate more details in Fig. 11(e) and (f) respectively.
6.3.1.2.4. Summary. These experiments have shown the effectiveness of the key components (FAP, ISF and FAM) of our SAD in occluded and viewpoint-changed scenes. Therefore, our SAD as a whole outperforms many state-of-the-art detection algorithms such as [12,38]. But the speed is not satisfactory: detection costs on average about 0.51 s, 5.8 s, 6.36 s and 0.36 s on CAVIAR1, our dataset, S02 and CAVIAR2 respectively. Because of changed viewpoints and heavy occlusions, it costs much more time on our dataset and S02 than on CAVIAR1 and CAVIAR2. For further speedup and performance improvements, we recommend our proposed BAT, which is evaluated in the next subsection.
6.3.2. Tracking evaluations
In this section, we report the tracking performance of our BAT on all test datasets based on the SAD results, without retraining detectors for specific scenes. For concise descriptions, we let our BAT with and without camera models be BAT+3D and BAT+2D respectively.
6.3.2.1. Algorithms for comparisons. We compare our approach with several state-of-the-art tracking algorithms [26,24,15,7,36,27]. We utilize the implementation¹ of [27] to carry out experiments ourselves. In this implementation, the authors do not use appearances after detecting objects; therefore it obtains relatively more fragments and ID switches, as well as missed detections. We improve its performance by (1) utilizing background modeling to remove false positive samples, (2) building appearance models for detected objects to associate them, and (3) adjusting some parameters to achieve better tracking results. After this improvement, it can track more humans, but there are still too many fragments and ID switches. Thus, we only use it for comparisons on the following metrics: MT, PT, ML and MOTP.

Fig. 11. Evaluation of our SAD compared to DPM [12] and ACF [38]. (a), (b), (c) and (d) compare our approach with several state-of-the-art works on CAVIAR1, S02, our dataset and CAVIAR2 respectively. (e) and (f) zoom in on our approach on S02 and our dataset respectively to illustrate more details.

¹ http://www.ics.uci.edu/~dramanan/.
Besides these state-of-the-art algorithms, we also use two simplified versions of our BAT as baselines to demonstrate the improvement of combining both block and ensemble information. One baseline only uses Ensemble Tracking, shortened as BAT(ET). BAT(ET)+2D, where camera models are not used in detection, is similar to [7]. BAT(ET)+3D, where camera models are used in detection, is a better way to show the improvement of BAT+3D over using ensemble information alone. The other baseline only uses Block Tracking, shortened as BAT(BT). Objects can be well initialized in CAVIAR2 because of little occlusion, but not in CAVIAR1, our dataset and S02 because of severe and frequent occlusions. Therefore, BAT(BT)+3D, in which camera models are used in detection, is a fair comparison with BAT+3D on CAVIAR2.
6.3.2.2. Quantitative results. The obtained results are shown in Figs. 12 and 13.

6.3.2.2.1. CAVIAR1. We compare our BAT with [26,24,15,7,36] in Fig. 12. Among them, our method achieves the highest MT. Our FM and IDS are a little higher than those of [36], mainly because we handle sequences online, while [36] used all detection results to obtain a global optimization. The MOTA and MOTP of our approach are both better than those of [7], showing the efficiency of combining block and ensemble information. In general, CAVIAR1 is relatively easy for many tracking systems; the further test datasets are more challenging.
Fig. 12. Quantitative results on CAVIAR1. *The fragment and IDS numbers in [26,7] are obtained with looser evaluation metrics.

Fig. 13. Quantitative results of our method on our dataset, S02 and CAVIAR2.

6.3.2.2.2. Our dataset and S02. We compare our approach with [7,27] on these two datasets in Fig. 13 (top) and (middle). As described in Section 6.3.1, many state-of-the-art detection algorithms do not perform as well as our detection approach in scenes with heavy occlusions and slightly changed viewpoints. The detection processes in [7,27] lose many humans on our dataset and S02, which reduces their tracking performance. In contrast, our SAD can detect many humans, and our BAT generally performs much better and more stably than [7,27].
BAT(ET)+3D can track more objects than [7,27], but it obtains many fragments and ID switches. Compared to BAT(ET)+3D, BAT+3D achieves better performance: it obtains higher MT/PT/MOTP/MOTA and lower FM and IDS. This improvement shows that combining block and ensemble information is superior to using ensemble information alone for tracking. Compared to BAT+2D, the improvement of BAT+3D lies mainly in the use of camera models, because of the slightly changed viewpoints. Furthermore, BAT+3D is always better than BAT+2D in MT/PT/ML, but not in the other metrics. Part of the reason is that the ground truths are labeled as rectangles, while the tracked humans of BAT+3D are quadrangles. However, because our dataset is much more crowded than S02, there are still many partially tracked objects.
6.3.2.2.3. CAVIAR2. Because of the extremely large viewpoint changes, the methods that do not use camera models (such as [7,27] and BAT+2D) fail totally on this dataset. As far as we know, there are no publicly available implementations that deal with tracking multiple humans in such scenes. Thus, we compare our BAT+3D with BAT(BT)+3D and BAT(ET)+3D in Fig. 13 (bottom). Compared to BAT(ET)+3D, BAT(BT)+3D achieves higher MT, MOTA and MOTP, but more IDS and FM. Our BAT+3D integrates both of their advantages and achieves better performance: it improves MT by 13.8% and MOTA by 7.2%, and reduces PT by 18.2%, compared with the second best.

Fig. 14. Tracking results. (a) and (b) compare [7] (top) and our approach (bottom) on OneStopMoveEnter1cor of CAVIAR1 and on S02. (c) and (d) illustrate sample results in our dataset and in Meet_crowd of CAVIAR2 respectively. The layouts of (a) and (b) are already shown in Fig. 3(c). The layouts of (c) and (d) are illustrated at the far left. More descriptions can be found in Section 6.3.2.

6.3.2.2.4. Summary. As described earlier, the application of Block Tracking is limited, because it requires good initializations, and achieving good initializations is difficult in occluded scenes. Comparing BAT+2D with [7] on CAVIAR1 and BAT+3D with BAT(ET)+3D on the other three datasets (our dataset, S02 and CAVIAR2), we can easily conclude that combining block and ensemble information improves tracking performance. From the experiments in Figs. 12 and 13, we can see that our proposed detection and tracking system works robustly in scenes with heavy occlusions and viewpoint changes.
6.3.2.3. Sample results. Fig. 14 demonstrates some tracking results from our tracking algorithm, where the green and red arrows point to instances of IDS, the purple dotted ellipses indicate missed or lost targets, and the blue arrow points to false alarms. Panels (a) and (b) illustrate scenes with targets walking against a crowd; we compare [7] (top) with our approach (bottom). Our method can consistently track these objects, while [7] experiences several instances of IDS, lost targets and false alarms. Panel (c) features a subway scene with many people walking, where the occlusions are very severe and the viewpoint is slightly changed; our tracker succeeds in tracking many of them. Panel (d) shows a scene with several people walking where the viewpoint is extremely changed; our tracker tracks them successfully throughout the sequence.
6.4. Discussions

6.4.1. Parameters
There are some parameters in the SAD and BAT, as listed in Fig. 15 with corresponding descriptions and default values. The effects of some key parameters on our framework are as follows. For SAD, the parameters $T_M$, $T_{add}$ and $T_{del}$ directly impact the post-processing of detection. A smaller $T_M$ indicates a higher probability of adding or deleting a detected response each time. A smaller $T_{add}$ will add more objects and a larger $T_{del}$ will delete more objects. For BAT, $N_O$ and a temporal-consistency threshold are the key parameters. A larger $N_O$ (i.e., more particles) can improve the performance, but costs more time. A larger threshold considers more consistency in videos, which can improve tracking performance when detection is not so accurate, especially on CAVIAR2 because of its large viewpoint changes. Therefore, we set most parameters by default, and they are relatively robust in the experiments, except that we set $N_O=300$ and the threshold to 5 for CAVIAR2.
6.4.2. Processing speeds
The entire system is implemented in a single thread in C++, without special code optimization or use of GPU processing. On a workstation with an Intel Core(TM)2 2.33 GHz CPU and 2 GB of memory, we achieve real-time processing speeds of 2.7 to 15 fps (depending on the video size, the number of objects and the viewpoint change), as shown in Fig. 16 in comparison with detection only. The current bottleneck is the detection stage. As not all speedup possibilities have been explored yet, the current run-time raises hope that online experiments in real-world applications will not be too far away.
6.4.3. Failure cases
Objects are initialized by detection in our system, so failures of detection (e.g., missed detections and false alarms) cannot be avoided in the tracking process. If an initialized object is not accurate, such as object 8 in Frame 20 of Fig. 14(c), it drifts easily and tends to be lost. In particular, camera models greatly impact detection in viewpoint-changing scenes; badly estimated camera parameters will lead to unexpected detection results. Besides, our system cannot handle the near-vertical viewpoint where the camera is right over the top of objects, since it is impossible to recover the objects' frontal viewpoint in this situation, as pointed out in [10].
Fig. 15. Default parameters.

7. Conclusion

In this paper, we propose a robust system for multi-object detection and tracking in surveillance scenes with occlusions and viewpoint changes. Our SAD achieves robust detection through: (1) camera models to cope with viewpoint changes; (2) the structural filter approach to handle occlusions; and (3) foreground aware pruning and foreground aware merging with the aid of scene models. Our BAT, which formulates tracking as a block assignment process, can track objects robustly even when all the part detectors fail, as long as the object has assigned blocks. Its key factors are: (1) Block Tracking to maintain the spatial and temporal consistency of labels; (2) Ensemble Tracking to precisely estimate the locations and sizes of objects; and (3) Ensemble to Block Assignment to keep the blocks with the same label looking like part of a human.
Although our method tracks remarkably well even through occlusions and viewpoint changes, one unavoidable drawback is fuzzy object boundaries. To overcome this, we can learn and extract discriminative patches to represent and track objects. Another drawback is that the tracking results jitter, which can be amended by estimating object trajectories. For detection improvement, we can use online algorithms to make the offline, general detectors adaptive to a fixed scene. Although the current system only considers humans, the proposed mechanism can easily be extended to other kinds of objects. Based on the detection and tracking results, some high-level analysis of object behaviors becomes possible. Furthermore, we hope to make our approach applicable to real-world needs.
Acknowledgements

This work is supported in part by the National Science Foundation of China under grant No. 61075026 and the National Basic Research Program of China under grant No. 2011CB302203. Mr. Shihong Lao is partially supported by the R&D Program for Implementation of Anti-Crime and Anti-Terrorism Technologies for a Safe and Secure Society, Special Coordination Fund for Promoting Science and Technology of MEXT, the Japanese Government.
Appendix A. Supplementary data

Supplementary data to this article can be found online at doi:10.1016/j.imavis.2012.02.008.
References
[1] P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogn., Kauai, HI, USA, 2001, pp. I-511–I-518.
[2] B. Wu, R. Nevatia, Detection of multiple, partially occluded humans in a single image by Bayesian combination of edgelet part detectors, in: Proc. IEEE Int. Conf. Comput. Vis., Beijing, China, 2005, pp. 90–97.
[3] C. Huang, R. Nevatia, High performance object detection by collaborative learning of joint ranking of granules features, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogn., San Francisco, California, USA, 2010, pp. 41–48.
[4] G. Duan, H. Ai, S. Lao, A structural filter approach to human detection, in: Proc. Eur. Conf. Comput. Vis., Crete, Greece, 2010, pp. 238–251.
[5] S. Kamijo, Y. Matsushita, K. Ikeuchi, M. Sakauchi, Traffic monitoring and accident detection at intersections, IEEE Trans. Intell. Transp. Syst. 1 (2000) 108–118.
[6] T. Zhao, R. Nevatia, Tracking multiple humans in complex situations, IEEE Trans. Pattern Anal. Mach. Intell. 26 (2004) 1208–1221.
[7] J. Xing, H. Ai, S. Lao, Multi-object tracking through occlusions by local tracklets filtering and global tracklets association with detection responses, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogn., Miami, FL, USA, 2009, pp. 1200–1207.
[8] M. Andriluka, S. Roth, B. Schiele, People-tracking-by-detection and people-detection-by-tracking, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogn., Anchorage, Alaska, USA, 2008, pp. 1–8.
[9] X. Wang, T.X. Han, S. Yan, An HOG-LBP human detector with partial occlusion handling, in: Proc. IEEE Int. Conf. Comput. Vis., Kyoto, Japan, 2009, pp. 32–39.
[10] Y. Li, B. Wu, R. Nevatia, Human detection by searching in 3D space using camera and scene knowledge, in: Proc. IEEE Int. Conf. Image Process., Tampa, Florida, USA, 2008, pp. 1–5.
[11] G. Duan, H. Ai, S. Lao, Human detection in video over large viewpoint changes, in: Proc. Asian Conf. Comput. Vis., Queenstown, New Zealand, 2010, pp. 683–696.
[12] P. Felzenszwalb, D. McAllester, D. Ramanan, A discriminatively trained, multiscale, deformable part model, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogn., Anchorage, Alaska, USA, 2008, pp. 1–8.
[13] C. Beleznai, H. Bischof, Fast human detection in crowded scenes by contour integration and local shape estimation, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogn., Miami, FL, USA, 2009, pp. 2246–2253.
[14] D. Hoiem, A.A. Efros, M. Hebert, Putting objects in perspective, Int. J. Comput. Vis. 80 (2008) 3–15.
[15] C. Huang, B. Wu, R. Nevatia, Robust object tracking by hierarchical association of detection responses, in: Proc. Eur. Conf. Comput. Vis., Marseille, France, 2008, pp. 788–801.
[16] A. Senior, Tracking with probabilistic appearance models, in: ECCV Workshop on Performance Evaluation of Tracking and Surveillance Systems, Copenhagen, Denmark, 2002, pp. 48–55.
[17] D. Comaniciu, V. Ramesh, P. Meer, Kernel-based object tracking, IEEE Trans. Pattern Anal. Mach. Intell. 25 (2003) 564–577.
[18] P. Fieguth, D. Terzopoulos, Color-based tracking of heads and other mobile objects at video frame rates, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogn., San Juan, Puerto Rico, 1997, pp. 21–27.
[19] M. Isard, A. Blake, Contour tracking by stochastic propagation of conditional density, in: Proc. Eur. Conf. Comput. Vis., Cambridge, UK, 1996, pp. 343–356.
[20] J.C. Clarke, A. Zisserman, Detection and tracking of independent motion, Image Vis. Comput. 14 (1996) 565–572.
[21] M.D. Rodriguez, M. Shah, Detecting and segmenting humans in crowded scenes, in: Proc. IEEE Int. Conf. Multimed., Augsburg, Germany, 2007, pp. 353–356.
[22] P. Kelly, N.E. O'Connor, A.F. Smeaton, Robust pedestrian detection and tracking in crowded scenes, Image Vis. Comput. 27 (2009) 1445–1458.
[23] M. Isard, A. Blake, Condensation: conditional density propagation for visual tracking, Int. J. Comput. Vis. 28 (1998) 5–28.
[24] B. Wu, R. Nevatia, Detection and tracking of multiple, partially occluded humans by Bayesian combination of edgelet based part detectors, Int. J. Comput. Vis. 75 (2007) 247–266.
[25] H. Jiang, S. Fels, J.J. Little, A linear programming approach for multiple object tracking, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogn., Minneapolis, MN, USA, 2007, pp. 1–8.
[26] L. Zhang, Y. Li, R. Nevatia, Global data association for multi-object tracking using network flows, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogn., Anchorage, Alaska, USA, 2008, pp. 1–8.
[27] H. Pirsiavash, D. Ramanan, C.C. Fowlkes, Globally-optimal greedy algorithms for tracking a variable number of objects, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogn., Colorado Springs, CO, USA, 2011, pp. 1201–1208.
[28] S. Avidan, Ensemble tracking, IEEE Trans. Pattern Anal. Mach. Intell. 29 (2007) 261–271.
[29] H. Grabner, M. Grabner, H. Bischof, Real-time tracking via on-line boosting, in: British Machine Vision Conference, Edinburgh, UK, 2006.
[30] H. Grabner, C. Leistner, H. Bischof, Semi-supervised on-line boosting for robust tracking, in: Proc. Eur. Conf. Comput. Vis., Marseille, France, 2008, pp. 234–247.
[31] B. Babenko, M.-H. Yang, S. Belongie, Visual tracking with online multiple instance learning, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogn., Miami, FL, USA, 2009, pp. 983–990.
[32] J. Xing, L. Liu, H. Ai, Background subtraction through multiple life span modeling, in: Proc. IEEE Int. Conf. Image Process., Brussels, Belgium, 2011.
[33] B. Wu, R. Nevatia, Y. Li, Segmentation of multiple, partially occluded objects by grouping, merging, assigning part detection responses, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogn., Anchorage, Alaska, USA, 2008, pp. 1–8.
[34] G. Duan, C. Huang, H. Ai, S. Lao, Boosting associated pairing comparison features for pedestrian detection, in: Proc. IEEE Workshop on Visual Surveillance, Kyoto, Japan, 2009, pp. 1097–1104.
[35] S. Geman, D. Geman, Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images, IEEE Trans. Pattern Anal. Mach. Intell. 6 (1984) 721–741.
[36] Y. Li, C. Huang, R. Nevatia, Learning to associate: hybrid boosted multi-target tracker for crowded scene, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogn., Miami, FL, USA, 2009, pp. 2953–2960.
[37] K. Bernardin, R. Stiefelhagen, Evaluating multiple object tracking performance: the CLEAR MOT metrics, J. Image Video Process. 2008 (2008).
[38] W. Gao, H. Ai, S. Lao, Adaptive contour features in oriented granular space for human detection and segmentation, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogn., Miami, FL, USA, 2009, pp. 1786–1793.
Fig. 16. Speed comparisons of detection and tracking (ms).