IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY
Globality–Locality-Based Consistent Discriminant Feature Ensemble for Multicamera Tracking
Kanishka Nithin and François Brémond
Abstract— Spatiotemporal data association and fusion is a well-known NP-hard problem even for a small number of cameras and frames. Although it is difficult to make tractable, solving it is pivotal for tracking in a multicamera network. Most approaches model association maladaptively toward the properties and contents of the video, and hence they produce suboptimal associations, and association errors propagate over time to adversely affect fusion. In this paper, we present an online multicamera multitarget tracking framework that performs adaptive tracklet correspondence by analyzing and understanding the contents and properties of the video. Unlike other methods that work only on synchronous videos, our approach uses dynamic time warping to establish correspondence even if the videos have a linear or nonlinear time-asynchronous relationship. Association is a two-stage process based on geometric and appearance descriptor spaces ranked by their inter- and intra-camera consistency and discriminancy. Fusion is reinforced by weighting the associated tracklets with a confidence score calculated from the reliability of the individual camera tracklets. Our robust ranking and election learning algorithm dynamically selects appropriate features for any given video. Our method establishes that, given the right ensemble of features, even computationally efficient optimization yields better tracking accuracy over time and provides faster convergence suitable for real-time application. For evaluation on RGB, we benchmark on multiple sequences of PETS 2009 and achieve performance on par with the state of the art. For evaluation on RGB-D, we built a new data set.

Index Terms— XXXXX.
I. INTRODUCTION

THE goal of this paper is to: 1) provide a real-time solution with good accuracy for estimating the states of multiple targets in a multicamera environment and 2) conserve the identities of targets and produce unfragmented long trajectories under variations in appearance and motion over time. In spite of the number of existing solutions, real-time multitarget tracking across a multicamera network with reasonable overlap is still considered one of the most challenging and unsolved computer vision problems. This is mainly due to the placement of cameras, time-asynchronous cameras, multicamera calibration, distortions, parallelism, fuzzy data association, and fusion across a network of cameras. Despite these challenges, multicamera systems are crucial because they obtain complementary visual information about the same scene, thereby helping to overcome the traditional deficits of single-camera object tracking and improving higher vision tasks such as activity recognition and surveillance.

Manuscript received December 16, 2015; revised April 27, 2016 and August 10, 2016; accepted September 22, 2016. This work was supported in part by the Agence Nationale de la Recherche under Grant ANR-13-SECU-0005-01 of the Project MOVEMENT and in part by the Program COSG 2013. This paper was recommended by Associate Editor H. Yao. The authors are with INRIA Sophia Antipolis Méditerranée, 06902 Sophia Antipolis, France (e-mail: kanishka-nithin.dhandapani@inria.fr; francois.bremond@inria.fr). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TCSVT.2016.2615538
Offline and global association methods usually require detection and tracking results for the entire sequence prior to data association. This leads to high computation due to iterative associations across multiple cameras to obtain a globally optimized tracklet association and fusion; therefore, they are difficult to apply in real-time applications. Global approaches are also more exposed to local optima than online methods. Our method instead performs online associations and fusion based on an optimal frame buffer containing the information gathered up to the present frame. Hence, our approach reduces the ambiguity of global associations and produces performance competitive with the state of the art while being suitable for real-time applications. As a byproduct, the shortcomings of online frame-buffer-based tracking are implicitly overcome by the multicamera system setup.
Unlike some of the works mentioned in Section II, the proposed online multicamera tracklet association is designed around two key criteria: inter- and intra-camera consistency and discriminability of trajectory features. Our method incrementally learns and updates the discriminative appearance model belonging to each trajectory and ranks the models based on the consistency and discriminancy of the candidate tracklets. We also use 3D projected geometric information in conjunction with long-term appearance features for efficient data association even in challenging situations.
In our approach, we use planar homography to establish a 3D common referential between cameras, onto which the 3D points of each tracklet from all cameras are projected. The dynamic time warping (DTW) algorithm is used to find a one-to-one frame mapping between linearly or nonlinearly time-asynchronous cameras. DTW also selects candidate tracklets for association. Tracklet association is modeled as a sequence of complete bipartite graphs. The association score for each pair of tracklets is calculated as an ensemble of geometric and appearance features weighted by a globality–locality consistent discriminant score (GLCDS). GLCDS is learnt as an estimate of a discriminancy-weighted consistency score. The discriminancy of an individual feature is calculated as the Fisher score of that feature over the entire set of tracklets. The consistency of each feature is calculated as the deviation of that feature over a distribution belonging to the tracklet under consideration. Fusion is performed using a confidence-score-based adaptive weighting method. This enables correct and consistent trajectory association and fusion even if the individual trajectories contain inherent noise, occlusions, and false positives.

1051-8215 © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Our method has the following advantages.

1) We integrate measures that account for the properties, nature, and contents of the video for online feature selection and combination. The method automatically elects the best feature ensemble based on the video contents and properties.

2) Real-time state-of-the-art approaches are lacking in multicamera tracking, which is attributed to the heavy optimizers used in such approaches. Our method reduces the reliance on such heavy optimizers by concentrating on feature engineering. It produces performance comparable to the state of the art in real time by avoiding computationally expensive optimization, metrics, and data-gathering (fusion) strategies, thus also significantly improving the scalability of the network.

3) Our cost function allows us to efficiently model multilevel relationships among tracklets, such as the spread of global, local, and motion features used in our method.

4) Our approach leverages depth information, when available, to complement RGB data, overcoming shortcomings of RGB cameras and other issues such as privacy.
The remainder of this paper is organized as follows. In Section II, we review significant previous work and how our method differs from it. Section III outlines the proposed system architecture. In Section IV, we review the multicamera synchronization and multiview geometry used in our approach. Next, in Section V, we formulate the trajectory association problem, and Section VI describes the calculation of trajectory similarity metrics. Section VII covers the consistency and discriminancy of cross-view tracklets, the GLCDS calculation, and the online learnt feature ensemble. Trajectory fusion, the experimental results, and the conclusion follow in the final sections.
II. RELATED WORK

In recent years, comparatively few multicamera data association and tracking approaches have been proposed, and most recent multicamera approaches have concentrated mainly on offline processing. In general, approaches can be categorized by: 1) fusion time, either early fusion [2] or late fusion [3], and 2) the search space, greedy, i.e., temporally local (online), or global optimization with a longer temporal stride (offline) [4], [5].
The approach in [1] extends the work of [6] to jointly model multicamera reconstruction and global temporal data association using MAP estimation. They use a global min-cost flow graph for tracking across multiple cameras. Berclaz et al. [6] base detection on a probabilistic occupancy map. They also use a flow-graph-based method for solving both mono-camera and multicamera setups within a restricted and predetermined area of interest. The drawback of the min-cost flow graphs that currently hold the state of the art is that they are not real time: the complexity grows with the number of cameras in the network, since combinations of observations from multiple cameras increase exponentially, and the costs need to be predefined. Min-flow graphs also cannot work with higher order motion models, as their cost function cannot be factored into a product or sum over edges of adjacent nodes. Reference [19] solves the association problem by first forming 3D hypotheses from multicamera object detection fusion and then solving temporal data association. The drawback is the unnecessary overhead of splitting the problem into two separate problems: 3D reconstruction fusion at a central server, and assigning the reconstructed fusion back to the 3D tracklets established by the individual sensors.
Evans et al. [7] use an early fusion strategy for detection inspired by [2] and extend it to multicamera tracking and object size estimation in a multicamera environment. Their approach leverages multiview information at the early (detection) stage of the pipeline to remove ghosts. Since the synergy map they use for ghost suppression also suppresses objects existing in the previous frame, they cannot perform tracking by associating detections moment to moment. Multivariate optimization is performed on the object size together with the probable location of the object in the next frame. The objective function involves both the object size estimate and tracking information, so the solution may be suboptimal and is not real time. The nature of their ghost suppression method, which involves intricate assumptions such as line-of-view assumptions from camera to object, makes it difficult to track objects in cluttered or crowded environments.
Anjum et al. [8] presented an unsupervised inter-camera trajectory correspondence algorithm. For the association step, they propose a hybrid approach: project the trajectories from each camera view to the ground plane in order to find associations among trajectories, and then make image-plane reprojections of the matched trajectories. These methods rely entirely on the goodness of the homography; even the smallest margin of calibration error accumulates during the initial projections and reprojections. Thus, these methods are susceptible to introducing errors that end up as association errors. Sheikh et al. [9] proposed a target association algorithm that addresses the problem of associating trajectories across multiple moving airborne cameras, with the constraint that at least one object is seen simultaneously by every pair of cameras for at least five frames. Since this method uses object centroids as feature points to recover the homography and later uses RANSAC to find the best subset of such points for correspondence, it works well in sparse environments but may fail in dense ones. Their approach also assumes that all the objects to be tracked are on a common ground plane well aligned with all the cameras in the network.
To address the shortcomings of the methods discussed above, we propose a framework that synthesizes local feature-level information into the global object level based on consistent discriminant election and weighting for multitarget tracking.
Fig. 1. Pipeline of our approach.
III. PROPOSED METHODOLOGY

The general system architecture and pipeline can be seen in Fig. 1. Multiple worker threads process the cameras of the network simultaneously. All threads perform detection and tracking as independent worker nodes. After a buffer time, these threads synchronize and push their data to the master thread, where all key multicamera work is done: building online tracklet appearance models, local features, 3D projection, the online learnt feature ensemble, association, and fusion.
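The worker/master pattern described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the camera count, buffer length, and the stand-in tracklet tuples are all hypothetical, and detection/tracking are stubbed out.

```python
import threading
import queue

NUM_CAMERAS = 3     # hypothetical 3-camera network
BUFFER_FRAMES = 30  # hypothetical per-camera buffer before synchronization

master_queue = queue.Queue()
barrier = threading.Barrier(NUM_CAMERAS)  # workers synchronize here

def camera_worker(cam_id, frames):
    """Worker node: per-camera detection and tracking (stubbed)."""
    buffer = []
    for f in frames:
        buffer.append(("tracklet", cam_id, f))   # stand-in for tracker output
        if len(buffer) == BUFFER_FRAMES:
            barrier.wait()                        # wait for the other workers
            master_queue.put((cam_id, buffer))    # push buffered data to master
            buffer = []

threads = [threading.Thread(target=camera_worker,
                            args=(c, range(BUFFER_FRAMES)))
           for c in range(NUM_CAMERAS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Master thread: collect one buffer per camera; association and fusion
# would happen here on the synchronized buffers.
buffers = {cam: data for cam, data in
           (master_queue.get() for _ in range(NUM_CAMERAS))}
```

The barrier enforces the "buffer time" synchronization point: no worker hands data to the master until every camera has filled its buffer.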
IV. MULTICAMERA SYNCHRONIZATION AND MULTIVIEW GEOMETRY

The elementary and most key settings for our multicamera tracking system are as follows.

1) The cameras in the network need to be time synchronized with respect to a reference camera $C_{ref}$. By reference camera, we mean a chosen camera onto which the geometric data from the other cameras in the network are projected.

2) Individual camera calibration for projection onto a 3D world $W_{C_k}$ belonging to that camera.

3) Multiview homography that establishes a mapping between the world $W_{C_k}$ of camera $k$ and the world $W_{ref}$ of the reference camera.
Most previous approaches assume that the cameras are time synchronized, but we also handle the case of linear and nonlinear asynchronization between the cameras. If the cameras are linearly asynchronous, we need to map each frame of camera $C_k$ to the corresponding frame of the reference camera $C_{ref}$. We accomplish this task using linear regression. Given a set of values, the linear regression model assumes that the relation between the dependent variable $F^{C_k}$ and the variable $T^{C_k}$ is linear, where $F^{C_k}$ are frames from camera $k$ and $T^{C_k}$ are timestamps from camera $k$. The relation between both variables can be approximated as linear:

$$F^{C_k} = t_0^{C_k} + \mathrm{slope}^{C_k} \times T^{C_k} \quad (1)$$

where $C_k$ is the $k$th camera. For simplicity, we assume a constant $t_0^{C_k} = 0$.

In order to find a relation between the videos, we can equate the timestamps of both cameras, $T^{C_k} = T^{C_{ref}}$:

$$T^{C_k} = \frac{F^{C_k}}{\mathrm{slope}^{C_k}} = T^{C_{ref}} = \frac{F^{C_{ref}}}{\mathrm{slope}^{C_{ref}}}. \quad (2)$$

Once we know the parameters $\mathrm{slope}^{C_k}$ and $\mathrm{slope}^{C_{ref}}$, we can map from the frames of one camera to the other. This parameter can be obtained from

$$\mathrm{slope}^{C_k} = \frac{\Delta F^{C_k}}{\Delta T^{C_k}}. \quad (3)$$

Then the camera with the lower frame rate is taken as reference, and the synchronization for camera $C_k$ is calculated as

$$F^{C_k} = \frac{\mathrm{slope}^{C_k}}{\mathrm{slope}^{C_{ref}}} \times F^{C_{ref}}. \quad (4)$$
If the cameras are nonlinearly asynchronized, we use DTW to establish an approximate frame-to-frame correspondence between them. Here, DTW also doubles as a dynamic programming approach to speed up the process of finding the geometric similarity between the tracklets that need to be associated. More details on DTW and this process are given in Section VI-A.

A moving person viewed from different points of view results in different trajectories. The estimation of the homography between these views is the key to establishing associations between them. Our multiview calibration is based on planar homography. Points projected onto the 3D world $W_{C_k}$ from the $k$th view may be related to the corresponding image points in the 3D world $W_{ref}$ of the reference view using a planar homography. The idea is to project the trajectory points from all cameras under consideration onto a common referential world; in our case, the common referential is the reference camera coordinate system. Given a point $X$ in the $k$th view, the problem consists in finding
Fig. 2. Projective transformation of tracklet points belonging to $Tr_1^{k-1}$, $Tr_2^{k-1}$ between images from camera $k-1$ and the image plane of the reference camera, using the homography $H$ induced by that plane. The same applies to camera $k+1$, and so on.

Fig. 3. Corresponding projections on the reference image plane. Left: reference image plane. Blue lines represent the projection of points from a nonreference camera to the image plane belonging to the reference camera.
the corresponding point $X'$ in the reference view. The relation between the first and the second view is given by

$$X' = H_\pi \cdot X. \quad (5)$$

Once the homography between the views is found, we can project the trajectories from one camera view to the other, as shown in Figs. 2 and 3.
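The projection in (5) amounts to a matrix product in homogeneous coordinates followed by dehomogenization. A small sketch; the homography matrix below is hypothetical (a pure pixel translation) and stands in for one estimated by calibration.

```python
import numpy as np

def project_points(H, pts):
    """Apply a 3x3 planar homography H to Nx2 image points (Eq. 5),
    returning the dehomogenized Nx2 points in the reference view."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])  # to homogeneous coords
    mapped = pts_h @ H.T                              # X' = H . X for each row
    return mapped[:, :2] / mapped[:, 2:3]             # divide out the scale w

# Hypothetical homography: a pure translation by (10, 5) pixels.
H = np.array([[1.0, 0.0, 10.0],
              [0.0, 1.0,  5.0],
              [0.0, 0.0,  1.0]])
trajectory = np.array([[100.0, 200.0],
                       [102.0, 203.0]])
projected = project_points(H, trajectory)  # trajectory in reference-camera plane
```

In the full pipeline, each tracklet from each nonreference camera is pushed through its own $H$ so that all trajectories live in the reference coordinate system before association.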
V. MULTIVIEW TRAJECTORY ASSOCIATION

The generalized maximum/minimum clique problem, or K-partite matching problem, in which the clique with maximum score or minimum cost must be found, is NP-hard, as shown in [20]. Since there is no polynomial-time solution to this problem, we break it down by reducing it to a sequence of bipartite matching problems between the reference camera and each other camera $C_k$ in the network. Suppose we have $K$ cameras $\{C_{ref}, C_1, C_2, \ldots, C_k\}$; we reproject all the trajectories from cameras $\{C_1, C_2, \ldots, C_k\}$ to the reference camera $C_{ref}$ and perform trajectory association and similarity calculation on $C_{ref}$. The associated tracklets between the reference camera and the $k$th camera are accumulated until the tracklet associations for all $\{C_{ref}, C_k\}$ pairs are solved. Once all the tracklet associations from each camera pair are available, fusion is done in the reference camera $C_{ref}$. Proceeding this way yields an estimate of the optimal solution of an NP-hard problem in polynomial time.
Fig. 4. Two tracklets in common subintervals between two cameras in the time interval $[t_A, t_B]$.
The association problem in general concerns establishing correspondences between pairwise similar trajectories that come from different overlapping cameras. The association, or correspondence, may be modeled as a sequence of bipartite graph matching problems in which each set $S_k$ contains the trajectories belonging to camera $k$. For example, for a reference camera $C_{ref}$ and any other overlapping camera $C_k$, the sets of trajectories $S_{ref}$ and $S_k$ are defined.

A bipartite graph is a graph $G$ in which the vertex set $V$ can be divided into two disjoint subsets $S_{ref}$ and $S_k$ such that every edge $e \in E$ has one end point in $S_{ref}$ and the other end point in $S_k$. Each object being tracked is denoted by $TO_i$ in the resulting observation (i.e., a track point) of the multitarget tracking algorithm. The tracked objects have been synchronized in terms of frame number $F$, and they have 2D space coordinates $(x, y)$. Thus

$$TO_t = (F, (x, y))_t.$$

Let $TO_i$ represent the $i$th tracked object that belongs to the trajectory $Tr_i^{C_k}$ observed in camera $C_k$, where $k = l, r$. Thus, each trajectory is composed of a time sequence of 3D points of physical objects:

$$Tr_i^{C_k} = \{TO_0^i, TO_1^i, \ldots, TO_t^i, \ldots, TO_{n_i}^i\} \quad (6)$$

where $n_i$ is the length of the trajectory. Consequently, each camera has a set of $N$ and $M$ trajectories belonging to the sets $S_{ref}$ and $S_k$:

$$S_{ref} = \{Tr_0^{C_{ref}}, Tr_1^{C_{ref}}, Tr_2^{C_{ref}}, \ldots, Tr_N^{C_{ref}}\} \quad (7)$$

$$S_k = \{Tr_0^{C_k}, Tr_1^{C_k}, Tr_2^{C_k}, \ldots, Tr_M^{C_k}\}. \quad (8)$$

We abstract the trajectory association problem across multiple cameras as follows. Each trajectory $Tr_j^{C_k}$ is a node of the bipartite graph that belongs to the set $S_k$ linked with camera $C_k$. A hypothesized association between two trajectories is represented by an edge in the bipartite graph. The goal is to find the best match in the graph.
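Once edge costs between the two trajectory sets are available, finding the best match in the bipartite graph is a classic assignment problem. The paper does not name its solver; one standard choice, shown here as a sketch, is the Hungarian algorithm with a hypothetical cost matrix (e.g., DTW distances, lower = more similar).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical pairwise association costs between trajectories of S_ref
# (rows) and S_k (columns); each entry is an edge cost in the bipartite graph.
cost = np.array([[0.1, 0.9, 0.8],
                 [0.7, 0.2, 0.9],
                 [0.8, 0.8, 0.3]])

# Hungarian algorithm: selects one edge per node minimizing the total cost.
rows, cols = linear_sum_assignment(cost)
matches = list(zip(rows.tolist(), cols.tolist()))
```

Solving one such assignment per $\{C_{ref}, C_k\}$ pair, and accumulating the results, realizes the polynomial-time reduction of the K-partite problem described above.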
A. Time Overlapping Trajectories

For each hypothetical association, we first filter out the associations of trajectories that do not overlap in time. In the case of time overlapping trajectories, we take the intersecting time interval between them, that is, the lowest and the highest time values shared by both trajectories, to obtain a new time interval in which both trajectories are contained. In the example of Fig. 4, we have two trajectories $Tr_i^{C_l} \in S_l$ with $0 < i < N$ and $Tr_j^{C_r} \in S_r$ with $0 < j < M$, and the resulting overlapping time interval is $\Delta t = [Tr^{C_l}(t_0), Tr^{C_r}(t_f)]$. In order to apply DTW, we need trajectories of the same size so that they can be compared frame by frame. The gaps or missing points (due to missed detections or occlusions) are completed with local linear interpolation and smoothing over the mentioned time interval $\Delta t$.
B. Linear Interpolation and Smoothing

Object detection is not perfect due to occlusions, visibility, crowd density, and camera placement; thus, a linear interpolation is applied in order to obtain a more complete trajectory. We assume that a person follows uniform linear motion between the previous and the next frame. Based on that, a linear interpolation is performed in order to correct missed detections of time length equal to $\Delta$ frame(s) at a time. In our experiments, we heuristically limit the use of interpolation to $\Delta = 4$; more than four missing detections are treated as the disappearance of the object. To perform this correction, the position of the person in the current frame is estimated as

$$Tr_i^{C_k}(t) = Tr_i^{C_k}(t-1) - \frac{Tr_i^{C_k}(t-1) - Tr_i^{C_k}(t+\Delta)}{\Delta} \quad (9)$$

where $\Delta$ is the difference between the frame numbers of the previous and the next available detections, $Tr_i^{C_k}(t)$ is the position of the tracked object at time $t$, $Tr_i^{C_k}(t-1)$ is its position at time $(t-1)$, $Tr_i^{C_k}(t+\Delta)$ is its position at time $(t+\Delta)$, and $C_k$ is the camera number.

The 2D space of the trajectories belonging to the $k$th camera is projected onto the 2D space of the reference camera in order to compare and find similar trajectories. During this task, some noise can arise; to deal with it, we smooth the trajectory for better results. At this point, we are almost ready to compute the trajectory similarity; however, the common tracklets between both trajectories still need to be found.
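Applying the constant-velocity assumption of (9) repeatedly fills an entire gap. A minimal sketch with a hypothetical gap; the detections and the gap length $\Delta = 4$ are illustrative only.

```python
def interpolate_gap(p_prev, p_next, delta):
    """Fill a gap of (delta - 1) missed detections between p_prev at
    time t-1 and p_next at time t+delta, assuming uniform linear motion
    as in Eq. 9. Returns the interpolated 2D positions in time order."""
    step = ((p_next[0] - p_prev[0]) / delta,
            (p_next[1] - p_prev[1]) / delta)
    return [(p_prev[0] + step[0] * i, p_prev[1] + step[1] * i)
            for i in range(1, delta)]

# Hypothetical gap: last detection at (0, 0), next detection 4 frames
# later at (8, 4); three intermediate positions are reconstructed.
filled = interpolate_gap((0.0, 0.0), (8.0, 4.0), 4)
# -> [(2.0, 1.0), (4.0, 2.0), (6.0, 3.0)]
```

Gaps longer than $\Delta = 4$ frames would, per the heuristic above, be treated as the object disappearing rather than interpolated.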
C. Find Tracklets in Common Subintervals

Fig. 4 gives a graphic illustration of two overlapping trajectories in the time interval $[t_A, t_B]$. The x and y axes correspond to the geometric space, i.e., the geometric $x, y$ coordinates, and the t-axis corresponds to time. The two trajectories have two tracklets in the subintervals $[t_0, t_1], [t_2, t_3] \subset [t_A, t_B]$ that belong to both trajectories $Tr_i^{C_l}$ and $Tr_j^{C_r}$, two tracklets in the subintervals $[t_3, t_4], [t_5, t_6] \subset [t_A, t_B]$ that belong only to $Tr_j^{C_r}$, and, finally, one tracklet in $[t_1, t_2] \subset [t_A, t_B]$ that belongs only to $Tr_i^{C_l}$.

A trajectory similarity algorithm is then applied to every pair of tracklets in common subintervals of the two trajectories. It is important to note that at this point the tracklets have the same length and have been synchronized.
VI. TRAJECTORY SIMILARITY CALCULATION

The comparison of two temporal sequences (e.g., trajectories) invariant to time and speed, and the measurement of their similarity, is done using DTW. There are several trajectory similarity measures in the state of the art; two similarity models drew our attention: the longest common subsequence described in [10] and DTW introduced in [11]. Among these, we choose the latter as it offers enhanced robustness, particularly in the presence of noisy data. As our goal is to associate trajectories, we need a local measure for trajectory comparison, which is obtained using DTW.
A. Time-Invariant Tracklet Alignment and Similarity

DTW is a distance measure for quantifying the similarity between two temporal sequences that may vary in time or speed. A DTW-based similarity measure works well between cameras having both linear and nonlinear FPS mappings. As a first step in DTW, we place the trajectories in a grid in order to compare them, and initialize every element to $\infty$ (representing infinite distance). Each element of the grid is given by $d(Tr_i^{C_l}(t_i), Tr_j^{C_r}(t_j))$, the Euclidean distance representing the alignment between the two trajectory points $Tr_i^{C_l}(t_i)$ and $Tr_j^{C_r}(t_j)$, $\forall t_i \in [0 \ldots n], \forall t_j \in [0 \ldots n]$, where $n$ is the length of the shortest trajectory.

Many paths connecting the beginning and the ending points of the grid can be constructed. The goal of DTW is to find the optimal path that minimizes the global accumulated Euclidean distance between both trajectories of size $n$:

$$D(Tr_i^{C_l}, Tr_j^{C_r}) = \min\left[\sum_{t_i, t_j = 1}^{N} d(Tr_i^{C_l}(t_i), Tr_j^{C_r}(t_j))\right] \quad (10)$$

$$D(n, m) = d(Tr_i^{C_l}(n), Tr_j^{C_r}(m)) + \min\left\{\begin{array}{l} D(n-1, m) \\ D(n-1, m-1) \\ D(n, m-1) \end{array}\right\}. \quad (11)$$

The warping path point predecessor of $D(n, m)$, denoted by $\alpha$, is selected as the neighbor that gives the smallest accumulated distance:

$$\alpha(t+1) = \min\left\{\begin{array}{l} D(n-1, m) \\ D(n-1, m-1) \\ D(n, m-1) \end{array}\right\}. \quad (12)$$

Finally, the optimal warping path is the sequence of accumulated distances from the first element of each trajectory to the last:

$$\alpha = \alpha(t_0), \alpha(t_1), \ldots, \alpha(t_i), \ldots, \alpha(t_N). \quad (13)$$

We can see in Fig. 5 that the tracklets are very similar from frame 65 to 82, but afterward they start to diverge. The closer the optimal path stays to the diagonal, the better the two sequences match.
Fig. 5. DTW results for tracklet 1 of a two-trajectory comparison. The frames are shown on X and Y. The optimal path is represented in green, and the DTW result is shown in red.

We exploit the invariance of DTW to time misalignment in time-series sequences while aligning the tracklets from different cameras. We use this property of DTW to infer a statistic that helps us approximate the nonlinear mapping between time-asynchronous cameras in the network. We process the shape of the DTW warping path (shown in red in Fig. 5) to retrieve the complementary frame pairs belonging to the warping path; in other words, we decode the DTW warping path in terms of frames. The extracted complementary pairs act as a one-to-one frame mapping between the cameras under consideration.
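The recursion in (10)-(12) and the warping-path decoding can be sketched as follows. This is a textbook DTW implementation, not the authors' code; the two tracklets are hypothetical, with the second sampled at twice the rate to mimic asynchronous cameras.

```python
import math

INF = float("inf")

def dtw(traj_a, traj_b):
    """Classic DTW (Eqs. 10-11) between two 2D trajectories. Returns the
    accumulated distance and the warping path, whose index pairs act as
    the frame-to-frame mapping described in the text."""
    n, m = len(traj_a), len(traj_b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]   # grid initialized to infinity
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(traj_a[i - 1], traj_b[j - 1])  # Euclidean cell cost
            D[i][j] = d + min(D[i - 1][j], D[i - 1][j - 1], D[i][j - 1])
    # Backtrack from (n, m), choosing the smallest-cost predecessor (Eq. 12).
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        i, j = min([(i - 1, j), (i - 1, j - 1), (i, j - 1)],
                   key=lambda p: D[p[0]][p[1]])
    return D[n][m], path[::-1]

# Hypothetical tracklets: same straight path, second sampled twice as often.
a = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
b = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (1.5, 0.0), (2.0, 0.0)]
cost, path = dtw(a, b)
```

The returned `path` pairs each frame index of the slower tracklet with frame indices of the faster one, which is exactly the decoded frame mapping used to synchronize nonlinearly asynchronous cameras.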
VII. ONLINE LEARNT GLOBALITY–LOCALITY FEATURE ENSEMBLE

The core idea of our approach is the ranking and selection of global–local features to form an ensemble that is crucial for tracklet association while giving good inter-camera discriminability between tracklets. Using only local association information produces shorter, fragmented fused trajectories; it may even cause the fusion to drift when one of the cameras has many occlusions, as it is based on frame-to-frame information. Using only global information leads to more iterative associations, as global information induces more confusion, and associations become unreliable when there are many distortions between cameras. Thus, it is important to strike a balance between these sources of information while extracting the most consistent and discriminant of them for calculating associations. This helps to compensate for the limitations of each feature on a given video.

It is well known that feature combinations capture more underlying semantics than single-feature patterns. However, using less influential pattern combinations may not improve the performance of a tracker, mainly due to the limited discriminability of individual features. Trajectory similarity is calculated in a two-stage (local and global) approach. An ensemble of local and global features is used to determine the similarity score. The electing weights that decide the ensemble are learnt online based on the consistency and maximum discriminability of the feature distributions.
A. Local Tracklet Similarity

At the local stage, importance is given to local frame-to-frame geometric information. From the DTW results, we calculate statistics such as proximity.

Proximity as Euclidean Distance Mean: From the DTW results, we calculate the normalized mean pixel Euclidean distance for each trajectory comparison and each edge of the bipartite graph. To normalize the DTW results, we divide by the maximum possible distance between both trajectories, that is, the size of the image:

$$\mathrm{EDM} = D(Tr_i^{C_l}, Tr_j^{C_r}) / n. \quad (14)$$
B. Global Tracklet Similarity 467
At global stage, information pertaining to overall appear- 468
ance of the object throughout the tracklet is taken into account 469
for determining the similarity between tracklets. Feature pat- 470
terns used for determining an overall appearance score are 471
updated online regularly for the entire trajectory. A global 472
matching score (GMS) quantified from features below rep- 473
resents global tracklet similarity. 474
Global Matching Score: Appearance-based cues have 475
played a vital role in tracklet association rule mining. Given a 476
set of appearance cues, we create an ensemble of high-quality 477
ones for effective discrimination between tracklet association 478
candidate matches. We extend mono-camera tracklet reliability 479
descriptor work in [12] to suit our approach. We use k=7 cues 480
for our work. 481
1) 2D Shape Ratio (k = 1) and 2D Area (k = 2): The shape ratio and area of an object are obtained from its bounding box and, within a temporal window, are immune to lighting and contrast changes. Thus, they are good cues to use.
2) Color Histogram (k = 3) and Dominant Color (k = 4): The color histogram is a normalized RGB histogram of the pixels inside the bounding box of a moving object. The dominant color descriptor takes into consideration only the important colors of the object.
3) Color Covariance Descriptor (k = 5): The color covariance descriptor is a covariance matrix that characterizes the appearance of regions in an image and is invariant to size and identical shifting of color values. Therefore, the color covariance descriptor resists illumination changes.
4) Motion Descriptor (k = 6): Depending on the context, a constant velocity model or a Brownian model is used to describe motion, represented by a Gaussian distribution. It is useful when objects have a similar appearance.
5) Occlusion (k = 7): Occlusions significantly degrade the performance of a tracking algorithm, so we progressively analyze occlusion by exploiting the spatiotemporal context and the overlap information between the tracked object and other objects.
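The geometric and color cues above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the 8-level channel quantization, the tuple-keyed histogram, and the box format (x1, y1, x2, y2) are our assumptions.

```python
from collections import Counter

def shape_ratio(box):
    """2D shape ratio (k = 1): height/width of the bounding box."""
    x1, y1, x2, y2 = box
    return (y2 - y1) / (x2 - x1)

def area(box):
    """2D area (k = 2) of the bounding box, in pixels."""
    x1, y1, x2, y2 = box
    return (x2 - x1) * (y2 - y1)

def color_histogram(pixels, bins=8):
    """Normalized RGB color histogram (k = 3) over the pixels inside the
    bounding box; each channel is quantized into `bins` levels and the
    histogram sums to 1."""
    counts = Counter()
    for r, g, b in pixels:
        counts[(r * bins // 256, g * bins // 256, b * bins // 256)] += 1
    total = len(pixels)
    return {level: n / total for level, n in counts.items()}

def dominant_color(pixels, bins=8):
    """Dominant color descriptor (k = 4): the most frequent quantized color."""
    hist = color_histogram(pixels, bins)
    return max(hist, key=hist.get)
```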
We define tracklet Tr_p as an overlapping tracklet of tracklet Tr_i if Tr_p has at least one frame of overlap with Tr_i (temporal overlap) and the 2D distance between the two tracklets is below a predefined threshold (spatial overlap). We define tracklet Tr_j as a candidate matching tracklet of tracklet Tr_i if it satisfies a temporal constraint, namely that the last object detection of Tr_i must appear earlier than the first object detection of Tr_j, and a spatial constraint, namely that the last object detection of Tr_i can reach the first object detection of Tr_j after a number of frames of potential misdetection at the current frame rate.
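These two definitions reduce to simple predicates. A sketch under our own assumptions: tracklets are dicts with "start"/"end" frame indices, the 2D distance is passed in as a function, and the spatial reachability constraint is approximated by a maximum frame gap.

```python
def temporally_overlaps(tr_i, tr_p):
    """Tr_p temporally overlaps Tr_i if they share at least one frame."""
    return tr_i["start"] <= tr_p["end"] and tr_p["start"] <= tr_i["end"]

def spatially_overlaps(tr_i, tr_p, dist_2d, threshold):
    """Spatial overlap: 2D distance between the tracklets below a threshold."""
    return dist_2d(tr_i, tr_p) < threshold

def is_candidate_match(tr_i, tr_j, max_gap):
    """Tr_j is a candidate match of Tr_i if Tr_i ends before Tr_j starts
    (temporal constraint) and the gap stays within the allowed number of
    frames of potential misdetection (proxy for spatial reachability)."""
    return tr_i["end"] < tr_j["start"] and tr_j["start"] - tr_i["end"] <= max_gap
```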
To ensure reliable tracklet association, [12] weights the discriminative appearance and motion model descriptors and generates a GMS. The GMS of tracklet Tr_i with each tracklet in its matching candidate list (Tr_j) is

GMS(Tr_i^{C_l}, Tr_j^{C_r}) = \frac{\sum_{k=1}^{6} w_{ij}^{k} \cdot DS_k(Tr_i^{C_l}, Tr_j^{C_r})}{\sum_{k=1}^{6} w_{ij}^{k}}    (15)

where w_{ij}^{k} are the corresponding weights of each feature descriptor DS_k(Tr_i^{C_l}, Tr_j^{C_r}), calculated online by modeling them as directly proportional to the descriptor similarity of a tracklet with its matching candidate and inversely proportional to the descriptor similarity of the other overlapping tracklets.

If (Tr_i, Tr_j) are matching candidates and (Tr_i, Tr_p) are other overlapping tracklets, their discriminative descriptor weight is calculated as

w_{ij}^{k} = \zeta \, [DS_k(Tr_i, Tr_j) - X(DS_k(Tr_i, Tr_p)) - 1]    (16)

where ζ = 10 was determined experimentally and X is the median of the similarities between tracklets (Tr_i, Tr_p). The advantage of the median is that its value is not affected by a few extremely large or small values. The discriminative weight for the motion cue alone is calculated as

w_{ij}^{6} = 0.5 - 0.5 \max_{k=1...5}(w_{ij}^{k}).    (17)
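Equations (15)-(17) translate directly into code. A minimal sketch (the list-based calling convention is ours; ζ = 10 is the paper's experimentally determined value):

```python
def gms(ds, w):
    """Global matching score (15): weighted mean of the k = 1..6 descriptor
    similarities DS_k under the discriminative weights w_ij^k."""
    return sum(wk * dk for wk, dk in zip(w, ds)) / sum(w)

def median(xs):
    s = sorted(xs)
    n = len(s)
    return s[n // 2] if n % 2 else 0.5 * (s[n // 2 - 1] + s[n // 2])

def discriminative_weight(ds_match, ds_overlapping, zeta=10.0):
    """Weight (16): similarity to the matching candidate minus the median
    similarity X over the other overlapping tracklets, scaled by zeta."""
    return zeta * (ds_match - median(ds_overlapping) - 1.0)

def motion_weight(appearance_weights):
    """Weight (17) for the motion cue alone, from the five appearance weights."""
    return 0.5 - 0.5 * max(appearance_weights)
```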
C. Globality–Locality Consistent Discriminant Score

A cost matrix A is built to represent the cost of association between two tracklets (Tr_i^{C_l}, Tr_j^{C_r}). Each element of this association cost matrix is a GLCDS-weighted sum of the Euclidean distance and the GMS between the two trajectories. An entry in the association cost matrix A is defined as

A(Tr_i^{C_l}, Tr_j^{C_r}) = \lambda_m(Tr_i) \cdot EDM(Tr_i^{C_l}, Tr_j^{C_r}) + (1 - \lambda_m(Tr_i)) \cdot GMS(Tr_i^{C_l}, Tr_j^{C_r})    (18)

where λ_m is the GLCDS, learned to obtain an appropriate ensemble feature combination and discussed further in Section VI-D.

Now the bipartite graph is complete, and the weight W_ij of each edge e ∈ E in G = (V; E) is A(Tr_i^{C_l}, Tr_j^{C_r}) given by (18).

λ_m decides the tradeoff between local information extracted from frames and global appearance information from tracklets. The learnt weight enables better feature selection and combination to enhance inter-tracklet discrimination and to cope with intra-tracklet variations. In this approach, local geometric and global appearance feature patterns complement each other and are impactful in situations where the data set involves significant appearance changes across object pose, illumination, viewing angle, and camera parameters.
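Building the cost matrix of (18) is a one-liner per entry. A sketch (matrix layout and argument names are ours; EDM values and GMS values are assumed precomputed):

```python
def association_cost(lam_i, edm_ij, gms_ij):
    """Entry (18) of cost matrix A: GLCDS-weighted sum of the Euclidean
    distance EDM (local geometry) and the GMS (global appearance)."""
    return lam_i * edm_ij + (1.0 - lam_i) * gms_ij

def build_cost_matrix(lam, edm, gms):
    """A[i][j] for tracklet i in camera Cl against tracklet j in camera Cr;
    lam[i] is the learned GLCDS weight lambda_m(Tr_i)."""
    return [[association_cost(lam[i], edm[i][j], gms[i][j])
             for j in range(len(edm[0]))] for i in range(len(edm))]
```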
1) Color Calibration Across Cameras: To calculate the consistency and discriminative power of tracklet features across cameras, we need to color calibrate the cameras to account for color distortion between them. Therefore, as a preprocessing step before validating discriminability and consistency, we perform histogram specification and histogram matching, i.e., we project and transform the histogram of any camera C_k onto the histogram of the reference camera C_ref. The level of color distortion remaining after specification is validated by comparing the transformed histogram with the reference histogram using correlation-based histogram matching.
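Histogram specification maps each intensity level of camera C_k to the reference level with the nearest cumulative frequency. A minimal single-channel sketch (the nearest-CDF mapping rule and list-based histograms are our assumptions; a full implementation would apply this per channel, e.g. with OpenCV):

```python
def cdf(hist):
    """Cumulative distribution of a histogram, normalized to [0, 1]."""
    total = sum(hist)
    acc, out = 0, []
    for h in hist:
        acc += h
        out.append(acc / total)
    return out

def histogram_specification(src_hist, ref_hist):
    """Map each level of camera Ck onto the level of the reference camera
    C_ref whose cumulative frequency is closest; returns the level mapping."""
    src_cdf, ref_cdf = cdf(src_hist), cdf(ref_hist)
    mapping = []
    for s in src_cdf:
        # nearest reference level in CDF value
        j = min(range(len(ref_cdf)), key=lambda r: abs(ref_cdf[r] - s))
        mapping.append(j)
    return mapping
```

With identical histograms the mapping is the identity; a camera whose mass sits in dark levels gets remapped toward the reference's bright levels.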
Even if the appearance model of a tracklet is discriminative, it makes sense to weight it highly only if the features in the model are consistent, and vice versa. Thus, λ_m is calculated as an estimate of the discriminant-score-weighted consistency of the individual features.
D. Discriminative Power of Tracklet Features

The discriminative power of the GMS features is calculated as the mean of the normalized Fisher scores of the individual GMS tracklet features. The Fisher score is a quantitative measure popularly used in statistics for numerically solving maximum likelihood problems. In computer vision, the Fisher score is used to rank the best set of features such that, in the space spanned by the selected features, the distances between data points of different classes are as large as possible while the distances between data points of the same class are small. Reference [13] uses the Fisher score to compare one feature subset with another in order to find the most discriminating set of feature instances. Reference [14] uses the Fisher score for online selection of the most discriminative set of tracking features. Since ours is a multicamera setup, we need to adapt the Fisher score to prevent certain undesirable scenarios from affecting the final discriminant score. The constraints we lay on the Fisher score are as follows.
1) In a multicamera tracking problem, the discriminating power of tracklet features should be measured across cameras, not within a camera. Thus, in (19), instead of calculating the mean over all tracklets in both cameras, we calculate the mean only over the camera containing the candidate matching tracklets.
2) The online descriptor weight w_f of the f-th feature, obtained while calculating the GMS, specifies the robustness of that feature. While calculating the mean and the variance of the f-th feature of the i-th tracklet, we use w_f to weight that mean and variance, specifying the influence of the feature on the Fisher score.
Let k be the set of all features; the individual Fisher score for any feature f_k, ∀k ∈ [1 ... |k|], is calculated as

\delta(f_k) = \frac{\sum_{i=1}^{N} w_{f_k} (\mu_{i f_k} - \mu_{f_k}^{C_r})^2}{\sum_{i=1}^{N} w_{f_k} \, \rho^2_{i f_k}}    (19)

where μ_{i f_k} and ρ_{i f_k} are the mean and the variance of the k-th GMS feature of the i-th tracklet, N is the number of tracklets in camera C_l, w_{f_k} is the descriptor similarity weight of the k-th feature, and μ_{f_k}^{C_r} is the mean of the k-th GMS feature over all candidate tracklets belonging to the complementary camera C_r.

The normalized Fisher score for the k-th GMS feature is calculated as

\delta'(f_k) = \frac{\delta(f_k)}{\sum_{z=1}^{|F|} \delta(f_z)}.    (20)
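The adapted Fisher score (19) and its normalization (20) can be sketched as follows, assuming the per-tracklet means, variances, weights, and the cross-camera mean are already available as lists (the calling convention is ours):

```python
def fisher_score(mu, var, w, mu_ref):
    """Adapted Fisher score (19) of one GMS feature: weighted scatter of the
    per-tracklet means mu[i] in camera Cl around the mean mu_ref of the
    candidate tracklets in the other camera Cr, over the weighted variances."""
    num = sum(wi * (mi - mu_ref) ** 2 for wi, mi in zip(w, mu))
    den = sum(wi * vi for wi, vi in zip(w, var))
    return num / den

def normalize_scores(scores):
    """Normalized Fisher scores (20): each score divided by the sum of all."""
    total = sum(scores)
    return [s / total for s in scores]
```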
Fig. 6. Associations of each trajectory after Hungarian algorithm.
1) Consistent Discriminancy of Tracklet Features: An individual consistency score is obtained for each feature f_k in the GMS metric over the entire tracklet (Tr_i) as

\upsilon(f_k, Tr_i) = \sqrt{\frac{\sum_{t=0}^{n_k} (f_k(TO_i^t) - \bar{f}_k(Tr_i))^2}{n_k}}    (21)

where f_k(TO_i^t) is the k-th feature extracted from the i-th tracked object TO_i at time t, \bar{f}_k(Tr_i) is the mean of the k-th feature over the trajectory of tracked object TO_i, and n_k is the total number of detections.

The normalized individual consistency score υ'(f_k, Tr_i) of the k-th feature is calculated as

\upsilon'(f_k, Tr_i) = \frac{\upsilon(f_k)}{\sum_{z=1}^{|F|} \upsilon(f_z)}.    (22)

The GLCDS of the features over an entire tracklet Tr_i is calculated as the square root of the mean of the discriminancy-weighted squared consistency scores of the individual features:

\lambda_m(Tr_i) = \sqrt{\frac{\delta'(f_1) \cdot \upsilon'(f_1, Tr_i)^2 + \dots + \delta'(f_{|F|}) \cdot \upsilon'(f_{|F|}, Tr_i)^2}{|F|}}.    (23)
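Equations (21) and (23) combine into a short computation. A sketch with list inputs (our convention); note that (21) is the root-mean-square deviation of a feature along the tracklet, so lower values mean a more consistent feature:

```python
import math

def consistency(values):
    """Per-feature consistency (21): RMS deviation of the feature's values
    along the tracklet from their trajectory mean."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / n)

def glcds(delta_norm, upsilon_norm):
    """GLCDS lambda_m (23): square root of the mean of the normalized Fisher
    scores delta' times the squared normalized consistency scores upsilon'."""
    f = len(delta_norm)
    return math.sqrt(sum(d * u ** 2 for d, u in zip(delta_norm, upsilon_norm)) / f)
```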
E. Hungarian Algorithm

The task at hand is finding the maximum matching of G. Formally, a maximum matching is a matching with the largest possible number of edges; it is globally optimal. The goal is to find an optimal assignment, i.e., the maximum matching in G. We apply the Hungarian algorithm defined in [15] to the cost matrix built from the A_ij values. After applying the Hungarian algorithm to matrix A, we obtain the maximum matching shown in Fig. 6. The red lines indicate the associations established between tracklets across cameras as a result of the Hungarian algorithm.
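For illustration only, the optimal one-to-one assignment can be found by brute force on a small square cost matrix; this is not the Hungarian algorithm of [15], which computes the same optimum in O(n^3) (e.g. scipy.optimize.linear_sum_assignment). Shown as cost minimization; maximizing a weight matrix is the same problem with negated entries.

```python
from itertools import permutations

def min_cost_assignment(A):
    """Exhaustive optimal assignment on a small square cost matrix A:
    returns (perm, cost) where row i is assigned to column perm[i]."""
    n = len(A)
    best_cost, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        cost = sum(A[i][perm[i]] for i in range(n))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_perm, best_cost
```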
VIII. TRAJECTORY FUSION

The trajectory confidence score R_TO can be intuitively interpreted as how well the fusion of tracklets from individual cameras matches the original trajectory of the target. We calculate individual tracklet confidence based on the following.
1) Length: Long trajectories are more reliable; therefore, trajectories below a handpicked minimum length are considered unreliable.
2) Geometric Coherence Score: Assuming that the variation of tracklet features follows a Gaussian distribution, the coherence score is calculated as follows. From (6), TO_i^t is the position of object TO_i at time t and TO_i^{t-1} is its previous position. The coherence score is defined as

\frac{1}{\sqrt{2\pi\sigma_i^2}} \, e^{-\frac{(d_i - \mu_i)^2}{2\sigma_i^2}}    (24)

where d_i is the 2D distance between TO_i^t and TO_i^{t-1}, and μ_i and σ_i are, respectively, the mean and standard deviation of the frame-to-frame distance distribution formed by the set of positions of object TO_i.
3) Appearance Coherence Score: Similar to the geometric coherence score, but here we account for an array of appearance features; d_i represents the distance between the feature descriptors at TO_i^t and TO_i^{t-1}.
The confidence score R_TO of a tracklet is the mean of all the above coherence scores.
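The Gaussian coherence score (24) and the confidence aggregation can be sketched directly (the function names are ours; the same `gaussian_coherence` serves both the geometric and the appearance case, with d as the appropriate distance):

```python
import math

def gaussian_coherence(d, mu, sigma):
    """Coherence score (24): Gaussian likelihood of the frame-to-frame
    distance d under the tracklet's distance model N(mu, sigma^2)."""
    return math.exp(-(d - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def confidence(coherence_scores):
    """Tracklet confidence R_TO: mean of the coherence scores."""
    return sum(coherence_scores) / len(coherence_scores)
```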
As part of the fusion task, a merged trajectory is built from the information coming from both views. To fuse two trajectories coming from two different cameras at a time t, e.g., Tr_i ∈ S_l with 0 < i < N and Tr_j ∈ S_r with 0 < j < N, into a global one Tr_{Gi,Gj}, we apply an adaptive weighting method:

Tr_{Gi,Gj}(t) =
  \psi_1 Tr_i^{C_l}(t) + \psi_2 Tr_j^{C_r}(t)   if Tr_i^{C_l}(t) and Tr_j^{C_r}(t) overlap at time t
  Tr_i^{C_l}(t)                                  if only Tr_i^{C_l}(t) exists at time t
  Tr_j^{C_r}(t)                                  if only Tr_j^{C_r}(t) exists at time t    (25)

where ψ_1 and ψ_2 are the weights calculated as in (26). Each tracked object has a reliability attribute R_TO with values in [0, 1], and the weighting function is defined in terms of its R_TO value as

\psi_1 = \frac{R_{TO_i}}{R_{TO_i} + R_{TO_j}}, \qquad \psi_2 = \frac{R_{TO_j}}{R_{TO_i} + R_{TO_j}}    (26)

where R_{TO_i} and R_{TO_j} are the reliability attributes of the tracked object from cameras C_l and C_r, respectively.
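The three-case fusion rule (25) with reliability weights (26) can be sketched as follows. For brevity, positions are scalars and trajectories are dicts mapping a frame index to a position (a real implementation would fuse 2D or 3D points componentwise):

```python
def fuse(tr_l, tr_r, r_l, r_r):
    """Fused trajectory (25): reliability-weighted average (26) where both
    views have a position at time t, otherwise the single available one.
    tr_l/tr_r: dict frame -> position; r_l/r_r: reliabilities R_TO in [0, 1]."""
    psi1 = r_l / (r_l + r_r)
    psi2 = r_r / (r_l + r_r)
    fused = {}
    for t in sorted(set(tr_l) | set(tr_r)):
        if t in tr_l and t in tr_r:
            fused[t] = psi1 * tr_l[t] + psi2 * tr_r[t]
        elif t in tr_l:
            fused[t] = tr_l[t]
        else:
            fused[t] = tr_r[t]
    return fused
```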
The fused trajectory is not smooth. To obtain a smoother one, we apply a simple moving average (also called a moving mean).
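The smoothing step is a plain centered moving average; the shrinking window at the trajectory ends is our choice to preserve the trajectory length:

```python
def moving_average(xs, window=3):
    """Centered simple moving average over the fused positions; the window
    shrinks near the ends so the output has the same length as the input."""
    half = window // 2
    out = []
    for i in range(len(xs)):
        lo, hi = max(0, i - half), min(len(xs), i + half + 1)
        out.append(sum(xs[lo:hi]) / (hi - lo))
    return out
```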
IX. EVALUATION

Our RGB approach is evaluated on the publicly available PETS2009 data set [16]. We evaluate on View 1, View 3, View 5, and View 7 of the S2.L1 scenario. There is one static occlusion in View 1, namely, a pole with a display board, and View 3 is quite challenging as a tree occupies a significant area on the right side of the video. There is also substantial color tone variation between the views, making it hard for color-based cues. For this reason, most methods avoid this combination of views. To show the effectiveness of GLCDS, we take up this challenging combination, as it more closely resembles a real-world scenario.
TABLE I
RESULT COMPARISON IN PERCENTAGE. THE BEST CONFIGURATION OF OUR SYSTEM IS MARKED WITH THE BLUE BACKGROUND
For evaluating our work, we use the following metrics: the CLEAR [17] metrics, namely, multiple object tracking accuracy (MOTA) and multiple object tracking precision (MOTP); identity switches (IDS); track fragments; and mostly tracked (MT), partly tracked (PT), and mostly lost (ML) from [18].
Table I summarizes the comparison between our method and other multicamera approaches on the PETS2009 data set. Unlike other methods, which use heavy computation and optimization to obtain the best results at the expense of real-time performance, our objective was to make the algorithm run in real time with minimal sacrifice in accuracy. This is achieved because our method uses a computationally efficient, uncomplicated optimization technique with dynamic feature ranking and election for an effective ensemble. We use a buffer size of 20 frames in a temporal sliding-window pattern so that association and fusion can be performed online.
We experiment with four different system configurations:
1) C1: without online learnt feature ensemble selection (GLCDS based);
2) C2: without online learnt tracklet appearance models;
3) C3: without locality-based features;
4) C4: the full configuration.
The evaluation results of each configuration (C1–C4) show how much impact each part has on the proposed method. C4 is our entire system with the full configuration and is expected to yield the best performance. From Table I, we can see that the absence of GLCDS and online appearance models introduced the only ML entry in the pool of configurations, underscoring the significance of the online learnt feature ensemble. Configurations C1 and C2 produce IDS, stressing the impact of online appearance models on the framework. Since Views 5 and 7 give a closer view of the overlapping area, appearance features from these views play a vital role. C4 produces reliable long trajectories, thereby improving fragmentation, ML, and PT, and also suppresses IDS. We can see that our method surpasses the state of the art in IDS and produces broadly similar results on the other metrics while remaining a real-time online approach.
For evaluating on RGB-D data, we select five videos from a private data set in which participants with Alzheimer's disease, aged over 65 years, were recruited by the memory center of a collaborating hospital. The clinical protocol asks the participants to undertake a set of physical tasks and instrumental activities of daily living in a hospital observation room furnished with home appliances. The experimental recordings use two RGB-D cameras (Kinect) with 640 × 480 pixel resolution and nonlinear time synchronization between them. Each pair of videos offers two different views of the scene, lateral and frontal, with at most two people per view. A sample frame from the videos is shown in Fig. 7.
Fig. 7. One of the frames during the evaluation of the RGB-D video.
TABLE II
COMPARISON OF MULTICAMERA RGB-D TRACKER VERSUS MONO-CAMERA RGB TRACKER
In our data set, the doctor's trajectory is cut several times because of occlusions; sometimes he appears in one camera and sometimes in the other. The merged trajectory keeps the information from both cameras, handling the occlusions well. In these videos, mono-camera tracking gives poor results for the doctor in the right camera and even worse results for the patient in the left camera. With the multicamera approach, however, we combine the best results from each camera into a global one, so that finally the two tracked objects appearing in the scene are tracked well. Our multicamera results improve the mono-camera trajectories significantly, as shown in Table II. These experiments reveal that our framework is robust in rectifying the challenges of conventional mono-camera tracking and produces consistent trajectories with no IDS.
Our approach was benchmarked on a view combination that closely resembles the real world yet is purposefully ignored by other methods, and it improved on the state of the art while remaining a real-time approach.
System Implementation

As shown in Fig. 1, our system is implemented with parallel programming, handling the multiple cameras in a network as multiple threads. The time efficiency of the multicamera master thread is appreciable, as it takes the same time as the turnaround time of the individual worker threads. Each worker node's local geometric information is projected onto the reference camera's world. Local feature extraction, association, and fusion are all done in the reconstructed reference world and then projected back onto the reference camera's image plane for evaluation and visualization. Therefore, theoretically, there is no bound on the number of cameras that can run in our framework, as the model is elastic and extensible, although hardware capability may be a bottleneck.
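The master/worker thread pattern described above can be sketched as follows. This is a schematic only: the per-camera processing is a placeholder, and the queue-based hand-off to the master is our assumption about the implementation.

```python
import threading
import queue

def camera_worker(cam_id, frames, out_q):
    """Worker thread: per-camera detection, tracking, and projection into the
    reference world would run here; results are pushed to the master queue."""
    for t, frame in enumerate(frames):
        out_q.put((cam_id, t, frame))  # placeholder for local tracklet data

def run_network(cameras):
    """Master thread: spawns one worker per camera and collects all results;
    its turnaround time is roughly that of the slowest worker."""
    out_q = queue.Queue()
    threads = [threading.Thread(target=camera_worker, args=(cid, frames, out_q))
               for cid, frames in cameras.items()]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    results = []
    while not out_q.empty():
        results.append(out_q.get())
    return results
```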
X. CONCLUSION

We introduced a multicamera, multitarget, multimodality online tracking framework that associates and fuses trajectories on the grounds of an online learned, consistent, and discriminant global–local feature ensemble. Our approach's backbone is feature engineering, and its performance on the data sets demonstrates the importance of dynamically selecting and ranking features that capture and wholly represent the video's properties and contents. As a result of our work, we were able to build optimally long, complete trajectories by linking and fusing data based on confidence and reliability scores calculated at the individual-camera level. Using this framework, we achieve highly parallel and effective real-time performance, which is absent in the state-of-the-art methods. Our approach outperforms some existing multicamera trackers and is comparable with the state of the art on benchmark data sets. Even when coupled with uncomplicated optimizations to speed up the algorithm, the final results show the impact of engineered feature embeddings and their selection on accuracy and real-time performance.
REFERENCES

[1] M. Hofmann, D. Wolf, and G. Rigoll, "Hypergraphs for joint multi-view reconstruction and multi-object tracking," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 3650–3657.
[2] M. Evans, L. Li, and J. Ferryman, "Suppression of detection ghosts in homography based pedestrian detection," in Proc. IEEE 9th Int. Conf. Adv. Video Signal-Based Surveill., Sep. 2012, pp. 31–36.
[3] M. Taj and A. Cavallaro, "Distributed and decentralized multicamera tracking," IEEE Signal Process. Mag., vol. 28, no. 3, pp. 46–58, May 2011.
[4] L. Zhang, Y. Li, and R. Nevatia, "Global data association for multi-object tracking using network flows," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2008, pp. 1–8.
[5] H. Pirsiavash, D. Ramanan, and C. C. Fowlkes, "Globally-optimal greedy algorithms for tracking a variable number of objects," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2011, pp. 1201–1208.
[6] J. Berclaz, F. Fleuret, and P. Fua, "Multiple object tracking using flow linear programming," in Proc. 12th IEEE Int. Workshop Perform. Eval. Tracking Surveill. (PETS-Winter), Dec. 2009, pp. 1–8.
[7] M. Evans, C. J. Osborne, and J. Ferryman, "Multicamera object detection and tracking with object size estimation," in Proc. 10th IEEE Int. Conf. Adv. Video Signal Based Surveill. (AVSS), Aug. 2013, pp. 177–182.
[8] N. Anjum and A. Cavallaro, "Trajectory association and fusion across partially overlapping cameras," in Proc. 6th IEEE Int. Conf. Adv. Video Signal Based Surveill. (AVSS), Sep. 2009, pp. 201–206.
[9] Y. A. Sheikh and M. Shah, "Trajectory association across multiple airborne cameras," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 2, pp. 361–367, Feb. 2008.
[10] L. Bergroth, H. Hakonen, and T. Raita, "A survey of longest common subsequence algorithms," in Proc. IEEE 7th Int. Symp. String Process. Inf. Retr. (SPIRE), Sep. 2000, pp. 39–48.
[11] A. Kassidas, J. F. MacGregor, and P. A. Taylor, "Synchronization of batch trajectories using dynamic time warping," AIChE J., vol. 44, no. 4, pp. 864–875, 1998.
[12] D. P. Chau, F. Bremond, and M. Thonnat, "A multi-feature tracking algorithm enabling adaptation to context variations," in Proc. 4th Int. Conf. Imag. Crime Detection Prevention (ICDP), Nov. 2011, pp. 1–6.
[13] Q. Gu, Z. Li, and J. Han, "Generalized Fisher score for feature selection," 2012. [Online]. Available: https://arxiv.org/abs/1202.3725
[14] R. T. Collins, Y. Liu, and M. Leordeanu, "Online selection of discriminative tracking features," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 10, pp. 1631–1643, Oct. 2005.
[15] H. W. Kuhn, "The Hungarian method for the assignment problem," Naval Res. Logistics Quart., vol. 2, nos. 1–2, pp. 83–97, 1955.
[16] J. Ferryman and A. Ellis, "PETS2010: Dataset and challenge," in Proc. 7th IEEE Int. Conf. Adv. Video Signal Based Surveill. (AVSS), Aug./Sep. 2010, pp. 143–150.
[17] K. Bernardin and R. Stiefelhagen, "Evaluating multiple object tracking performance: The CLEAR MOT metrics," EURASIP J. Image Video Process., vol. 2008, p. 246309, Dec. 2008.
[18] C.-H. Kuo and R. Nevatia, "How does person identity recognition help multi-person tracking?" in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2011, pp. 1217–1224.
[19] Z. Wu, N. I. Hristov, T. L. Hedrick, T. H. Kunz, and M. Betke, "Tracking a large number of objects from multiple views," in Proc. IEEE 12th Int. Conf. Comput. Vis., Sep./Oct. 2009, pp. 1546–1553.
[20] C. Feremans, M. Labbé, and G. Laporte, "Generalized network design problems," Eur. J. Oper. Res., vol. 148, no. 1, pp. 1–13, 2003.
Kanishka Nithin received the master's degree in computer vision and image processing from Amrita School of Engineering, India, in 2015. He has been with the STARS team, INRIA Sophia Antipolis, Sophia Antipolis, France, since 2015, where he is currently a Pre-Ph.D. Researcher involved in multicamera video surveillance. His research interests include semantic video analytics, object recognition, visual SLAM, and deep learning.

François Brémond received the Ph.D. degree in video understanding from INRIA Sophia Antipolis, Sophia Antipolis, France, in 1997 and the HDR degree in scene understanding from the University of Nice Sophia Antipolis, Nice, France, in 2007. He was a Post-Doctoral Researcher with the University of Southern California, Los Angeles, CA, USA, where he was involved in the interpretation of videos taken from unmanned airborne vehicles. He created the STARS team in 2012. He is currently a Research Director with INRIA Sophia Antipolis. He has been conducting research in video understanding at Sophia Antipolis since 1993. He has authored or co-authored over 140 scientific papers published in international journals and conferences on video understanding.

Dr. Brémond is a Handling Editor of MVA and a Reviewer for several international journals, such as Computer Vision and Image Understanding, International Journal of Pattern Recognition and Artificial Intelligence, International Journal of Human-Computer Studies, IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, Artificial Intelligence Journal, and EURASIP Journal on Advances in Signal Processing, and for conferences such as CVPR, ICCV, AVSS, VS, and ICVS. He has supervised or co-supervised 13 Ph.D. theses. He is an EC INFSO and French ANR expert for reviewing projects.