Buch et al. Vehicle Localisation and Classification in Urban CCTV Streams ITS World Congress 2009
VEHICLE LOCALISATION AND CLASSIFICATION
IN URBAN CCTV STREAMS
Norbert Buch 1, Mark Cracknell 2, James Orwell 1 and Sergio A. Velastin 1
1. Kingston University, Penrhyn Road, Kingston upon Thames, KT1 2EE, United Kingdom
{norbert.buch, james.orwell, sergio.velastin}@kingston.ac.uk
2. Transport for London, Palestra, 197 Blackfriars Road, London, SE1 8NJ, United Kingdom
ABSTRACT
This paper presents an introduction to video analysis tools for urban traffic management.
Based on a review of the limitations of current systems, a framework for localising and
classifying vehicles in real-world coordinates is introduced as part of a project at Transport
for London. Vehicle detection is performed using either motion silhouettes or 3DHOG (3D
extended Histograms of Oriented Gradients). The latter is more robust in urban environments.
Qualitative and quantitative evaluation of the proposed systems is provided with an outlook
on further development potential.
KEYWORDS
vehicle detection, road user classification, pedestrian, urban traffic, visual surveillance, video
analysis, computer vision
INTRODUCTION
Intelligent image detection systems are part of a centralised approach to modern-day traffic
management. This has arisen from the need for more cost-effective and efficient monitoring
of traffic. Traffic monitoring CCTV tends to be unique in that it involves high camera
numbers, operates in the public domain and uses long transmission paths (up to 40 km). With
1200 cameras and over 100 monitors it is not feasible to continuously monitor every CCTV
camera installed within Transport for London’s (TfL) network. In fact, it has been shown that
manual monitoring over time significantly reduces the accuracy of detection. Therefore, the
development of a technology that provides automatic and relevant real-time alerts to Traffic
Co-ordinators can have an immediate and long term impact on traffic management through
the implementation of responsive traffic strategies.
In early 2006, TfL launched the Image Recognition and Incident Detection (IRID) project.
This project was tasked with reviewing the current image processing market and assessing
how well it met TfL's detection requirements. Testing was carried out on the following
criteria: Congestion, Stopped Vehicles, Banned Turns, Vehicle Counting, Subway Monitoring
and Bus Detection [6,7].
Results from this testing showed good performance in Congestion detection (80% precision)
but poor performance in tracking-based detection (~20% precision), clearly exposing
limitations in capability. These limitations led directly to the creation of a research relationship
with Kingston University. The aims of this project are the localisation and subsequent
Figure 1 Framework for 3D localisation and classification and models used
classification of vehicles and pedestrians in camera views specific to the urban environment
of Transport for London. Estimating the vehicle position in real-world coordinates (on road
maps) is also beneficial for traffic enforcement applications. The conventional concept of
using background modelling to generate a motion mask is used only as a baseline.
Motion masks suffer from noise due to lighting changes, camera shake, rain, etc. and
particularly from occlusion due to low camera angle. All these effects are inherent to urban
traffic scenes. The project objective was to move beyond motion estimation and use other
visual means to localise and classify cars. A texture based classifier (3DHOG) was introduced
to overcome the mentioned problems.
3D FRAMEWORK FOR VEHICLE CLASSIFICATION AND LOCALISATION
We developed a framework to localise vehicles either based on the motion mask or based on
the texture in the image. The framework is shown in Figure 1. The detector uses background
estimation with a Gaussian Mixture Model (GMM) [11] to model the static part of the scene.
This model allows several appearances of a pixel to be considered static, as shown in Figure 2,
which enables rejection of repetitively moving background objects such as trees. By taking the
difference between the background and a new frame, an initial foreground mask is estimated.
This mask is refined by a shadow removal algorithm, which classifies areas as shadow if
they are slightly darker than the background but have the same colour. From the resulting
binary foreground mask, closed contours are extracted.
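As an illustration, the detector stage can be sketched as follows. This is a minimal single-background sketch (the actual system uses a multi-image GMM background model [11]); the function name, thresholds and ratio bounds are illustrative assumptions:

```python
import numpy as np

def foreground_mask(frame, background, diff_thresh=30.0,
                    shadow_ratio=(0.5, 0.9), colour_tol=0.1):
    """Simplified detector stage: background difference plus shadow removal.

    frame, background: float arrays of shape (H, W, 3) with values in [0, 255].
    A pixel enters the initial foreground mask if it differs from the
    background; it is then reclassified as shadow (and removed) if it is
    uniformly darker than the background while keeping roughly the same
    colour, i.e. similar per-channel brightness ratios.
    """
    frame = np.asarray(frame, float)
    background = np.asarray(background, float)

    # Initial foreground mask from the background difference.
    diff = np.abs(frame - background).sum(axis=2)
    initial = diff > diff_thresh

    # Shadow test: per-channel brightness ratio between frame and background.
    ratio = frame / np.maximum(background, 1e-6)
    mean_ratio = ratio.mean(axis=2)
    darker = (mean_ratio > shadow_ratio[0]) & (mean_ratio < shadow_ratio[1])
    same_colour = (ratio.max(axis=2) - ratio.min(axis=2)) < colour_tol
    shadow = initial & darker & same_colour

    # Refined binary foreground mask; closed contours are extracted from
    # this mask in the next step of the pipeline.
    return initial & ~shadow
```

On the resulting binary mask, a contour-following routine then produces the closed contours passed on to the classifier.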
The contours provided by the detector are used to initialise hypotheses for vehicle locations
on the ground plane. The hypotheses are verified by the classifier by comparing existing
vehicle models with the input image to give a match measure. The models can either be the
projected silhouettes or appearance data gathered during a training phase. Finding the
maximum match in terms of location and vehicle type gives the final estimation. The
detection of a single frame can be tracked over time using a Kalman filter, which will produce
tracks (trajectories) for vehicles. Further details about the framework can be found in Buch et
al. [2,4] and for the tracker in [5]. As visual output, the localised vehicles are marked up on
the camera view and on a map of the area. Refer to Figure 2 for an example frame from TfL
with the corresponding map. The thin dark red boxes represent the regions of interest where
vehicle detection is performed. These could be set to bus lane areas to detect unpermitted
vehicles in those lanes.
Figure 2 a) Detector stage. b) Localised vehicles with superimposed wire frames and
ground plane location. The coloured lines are the tracks of individual vehicles.
RESULTS USING MOTION SILHOUETTES
The localisation framework was first tested with a baseline approach using the overlap of
motion silhouettes and model silhouettes as the match measure. Figure 3 illustrates the
operation, where a score is defined as the ratio between the intersection and union of both
silhouettes. The
3D location of vehicles is found by generating a hypotheses grid (green crosses in Figure 3)
around the back projected silhouette centroid (red cross in Figure 3). A score surface is
generated for this grid and the hypothesis with the highest score is selected as location and
class for the detected vehicle.
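The grid search over hypotheses can be sketched as below. The `toy_render` function is a hypothetical stand-in for the 2D projection of the calibrated 3D wire-frame models used in the real system:

```python
import numpy as np

def overlap_score(motion, model):
    """Match measure: ratio of intersection to union of two binary silhouettes."""
    inter = np.logical_and(motion, model).sum()
    union = np.logical_or(motion, model).sum()
    return inter / union if union else 0.0

def best_hypothesis(motion, render, centres, classes):
    """Score every (ground-plane centre, vehicle class) hypothesis and keep the
    maximum. render(centre, cls) projects the model silhouette for one
    hypothesis into the image."""
    return max(((overlap_score(motion, render(c, k)), c, k)
                for c in centres for k in classes),
               key=lambda t: t[0])

def toy_render(centre, cls):
    """Hypothetical stand-in for the wire-frame projection: a square
    silhouette whose size depends on the vehicle class."""
    size = {'car': 4, 'bus': 8}[cls]
    sil = np.zeros((20, 20), bool)
    r, c = centre
    sil[r:r + size, c:c + size] = True
    return sil

# Toy demonstration: the motion silhouette is a 4x4 'car' at (5, 5).
motion = toy_render((5, 5), 'car')
score, centre, cls = best_hypothesis(
    motion, toy_render,
    centres=[(4, 4), (5, 5), (6, 6)], classes=['car', 'bus'])
```

The hypothesis with the highest score simultaneously yields the ground-plane position and the class label.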
Figure 3 Classifier based on motion silhouette overlap with the model. The score surface
plots the match measure [%] against the cross-section position [m] for each vehicle hypothesis.
a) Full system (ground truth rows vs detection columns)

             pedestrian  bike  car/taxi  van  bus/lorry   FN   count  overlap
pedestrian      .71       .02     .02     0       0       .24    241    .57
bike            .49       .47      0      0       0       .04     45    .6
car/taxi        .01       .03     .85    .02     .02      .07    371    .66
van             .02        0      .1     .84     .03      .02     63    .69
bus/lorry        0         0       0     .02     .98       0      62    .72
FP              .1        .02     .04    .08     .03       0

b) Classifier only (ground truth rows vs detection columns)

             pedestrian  bike  car/taxi  van  bus/lorry  count
pedestrian      .94       .02     .03    .01     .01       182
bike            .51       .49      0      0       0         43
car/taxi        .01       .03     .92    .02     .03       344
van             .02        0      .1     .85     .03        62
bus/lorry        0         0       0     .02     .98        62

c) Overall performance

Symbol          Value
Recall R        79.5%
Precision P     83.9%
Classifier PC   89.8%
Detector RD     88.6%
Detector PD     93.5%
GT Overlap      0.64

Table 1 a) Confusion matrix of the full system including the detector. b) Confusion matrix of
the classifier only. c) Overall performance of the motion silhouette based system.
Classification performance
The system was extensively tested with video footage from Transport for London and from
the i-LIDS datasets [1]. The latter is a benchmarking dataset provided by the UK Home
Office to imaging research institutions. In Table 1, we present classification results on 1 hour
of video footage from the parked car scenario. Good overall classification performance is
demonstrated, with some confusion between bikes and pedestrians. This is due to the similar
size of those two types of road users. The localisation performance is demonstrated by the
bounding box overlap between the wire frame and the ground truth reported in Table 1. The
value of 64% overlap is good, considering that the wire frame rather than the motion
silhouette is used for the bounding box estimation. A detailed explanation of the performance
measures can be found in [1] and of their application in [2,4].
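As a pointer to how such figures relate to the confusion matrices, per-class recall and precision can be read off a raw-count matrix. This is a minimal sketch for illustration, not the exact measure definitions of [1]:

```python
import numpy as np

def recall_precision(confusion):
    """Per-class recall and precision from a raw-count confusion matrix
    with ground truth on the rows and detections on the columns."""
    c = np.asarray(confusion, float)
    recall = np.diag(c) / c.sum(axis=1)     # correct / ground-truth count per class
    precision = np.diag(c) / c.sum(axis=0)  # correct / detection count per class
    return recall, precision
```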
Tracking performance
Tracking is performed on the ground plane of the scene, which simplifies behaviour analysis
like bus lane monitoring. We use the standard formulation of the Kalman filter for a constant
velocity model of vehicles. The object tracking performance is demonstrated by comparing
our tracker with a baseline tracker (OpenCV blob tracker [10]). The OpenCV tracker uses an
adaptive mixture of Gaussians for background estimation, connected component analysis for
data association and Kalman filtering for tracking blob position and size. The data used is
i-LIDS [1], as for the classification performance evaluation above.
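The constant velocity Kalman filter can be sketched as follows; the time step and noise covariances are illustrative assumptions rather than the tuned values of the deployed tracker:

```python
import numpy as np

dt = 1.0  # frame interval; illustrative
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], float)   # constant velocity state transition
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], float)   # only the ground-plane position is observed
Q = np.eye(4) * 0.01                  # process noise (illustrative)
R = np.eye(2) * 0.1                   # measurement noise (illustrative)

def kalman_step(x, P, z):
    """One predict/update cycle; state x = [x, y, vx, vy], measurement z = (x, y)."""
    x = F @ x                          # predict state
    P = F @ P @ F.T + Q                # predict covariance
    y = z - H @ x                      # innovation
    S = H @ P @ H.T + R                # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)     # Kalman gain
    x = x + K @ y                      # updated state
    P = (np.eye(4) - K @ H) @ P        # updated covariance
    return x, P
```

Feeding the per-frame ground-plane detections into `kalman_step` yields smoothed positions and velocity estimates, which form the vehicle tracks.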
We propose a rich set of metrics such as Correct Detected Tracks, False Detected Tracks and
Track Detection Failure to provide a general overview of the system’s performance. Track
Fragmentation shows whether the temporal and spatial coherence of tracks is established. ID
Change is useful to test the data association module of the system. Latency indicates how
quickly the system can respond to an object entering the camera view, and Track
Completeness how completely the object has been tracked. Metrics such as Track Distance
Error and
Closeness of Tracks indicate the accuracy of estimating the position, the spatial and the
temporal extent of the objects respectively. More details about this evaluation framework can
be found in [5] and Yin et al. [13]. The proposed system detected 94% of the ground truth
tracks compared to 88% for the baseline. Our system has half the track detection failures of
the baseline. Please refer to Table 2 for a complete set of metrics and Figure 4
for visual tracking examples.
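Two of the simpler metrics can be written down directly. The definitive definitions follow [5] and Yin et al. [13], so these helpers are only an approximation for illustration:

```python
def track_completeness(gt_frames, sys_frames):
    """Fraction of the ground-truth track's frames covered by the system track."""
    gt = set(gt_frames)
    return len(gt & set(sys_frames)) / len(gt)

def bbox_overlap(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2);
    averaging this over a track gives a closeness-style score."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0
```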
Table 2 Tracking results of the motion silhouette classifier

Metric                                     Proposed tracker   OpenCV blob tracker
Number of ground truth tracks              100                100
Number of system tracks                    144                203
Correct detected tracks                    94                 88
Track detection failure                    6                  12
False detected tracks                      27                 90
Latency [frames]                           5                  5
Track fragmentation                        8                  18
Average track completeness [time]          64%                55%
ID change                                  10                 3
Average track closeness [bbox overlap]     54%                35%
Standard deviation of closeness            20%                13%
Average distance error [pixels]            22                 21
Standard deviation of distance error       19                 15

Figure 4 Example tracking results from the i-LIDS data

BEYOND MOTION: 3DHOG
The limitations of motion silhouettes inspired the use of texture to detect vehicles by their
appearance. Good results have been reported elsewhere for patch based approaches in object
recognition [9] and for pedestrian detection with histograms of oriented gradients (HOG) [8].
We introduce a novel concept by applying the HOG descriptor to image patches defined in 3D
model space. Full details on this 3DHOG framework can be found in Buch et al. [3]. This
feature approach substitutes the overlap match measure in the block diagram with a training
Figure 5 a) 3D model with marked interest points. b) Input camera image frame.
c) Extracted and normalised image patches displayed in 3D space. d) 3DHOG gradient
features generated from the visible image patches

based classification step. Figure 5 shows the model with the patch centres defined relative to
the model, together with the extracted patches. Affine transformations are used to generate
these scale-normalised patches. A descriptor is generated from every patch, consisting of
either gradient histograms, a frequency spectrum or a simple image histogram. A data driven
appearance model is learned, whereby a single Gaussian distribution is fitted to the
descriptors of every interest point. During system operation, new descriptors are generated
for every hypothesis (2D projection block in Figure 1), and the distance between the learned
descriptors and the newly seen descriptors defines the match measure. The remainder of the
framework remains identical to the earlier description.
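A simplified sketch of this matching scheme is given below: a plain gradient-orientation histogram stands in for the full HOG descriptor, and a single diagonal Gaussian is fitted per interest point. All names and parameters are illustrative assumptions:

```python
import numpy as np

def grad_histogram(patch, bins=8):
    """Gradient-orientation histogram of a scale-normalised grey patch
    (a simplified stand-in for the HOG descriptor used by 3DHOG)."""
    gy, gx = np.gradient(np.asarray(patch, float))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % np.pi                 # unsigned orientation
    hist, _ = np.histogram(ang, bins=bins, range=(0.0, np.pi), weights=mag)
    return hist / (hist.sum() + 1e-9)

class InterestPointModel:
    """Single Gaussian per interest point, learned from training descriptors."""
    def fit(self, descriptors):
        d = np.asarray(descriptors)
        self.mean = d.mean(axis=0)
        self.var = d.var(axis=0) + 1e-6              # diagonal covariance
        return self

    def match(self, descriptor):
        """Match measure: negative variance-normalised squared distance to
        the learned mean (higher is better), so hypotheses can be ranked
        by taking a maximum."""
        return -float(((descriptor - self.mean) ** 2 / self.var).sum())
```

A model trained on patches of one appearance (e.g. a vertical edge) will score such patches higher than patches of a different appearance, which is what lets the classifier rank hypotheses.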
Performance comparison
For evaluation on the same dataset as in the last section, a recall of 81% is achieved at a
precision of 82%. These results indicate similar performance to the motion based system,
while providing more robustness against the usual noise sources in the urban environment.
By using texture and appearance rather than motion, the 3DHOG classifier can deal with
cases where the motion silhouette is significantly distorted or similar for different classes.
There are illustrative examples in Figure 6. The case of oversized silhouettes due to shadows
and lighting changes is rectified and a pedestrian is correctly detected. Another common
problem is saturation of some areas in the camera. The example shows a very small silhouette
for a white van, which was missed from the motion foreground due to the saturation in the
same area. The 3DHOG classifier successfully detects and classifies the van. For objects of
similar size like pedestrians and bicycles, the classifier can distinguish correctly based on the
appearance.
CONCLUSIONS
We presented a review of commercial video analytics systems tested by Transport for
London. The findings and the progress of the subsequent project with Kingston University
are reported. An improved computer vision system is obtained by introducing a 3D
localisation framework for vehicles. Good results are demonstrated for motion silhouette
based vehicle classification. An extension to texture based classification is given by moving
beyond the concept of motion estimation. This extends the concept of HOG by introducing
the novel 3DHOG descriptor, which uses a “3D surface window” for vehicle classification.
This classifier demonstrated superior performance for challenging cases where motion silhouettes are
incorrect. The frame-to-frame detections are used as input to a Kalman filter to generate
trajectories of road users. Future work will focus on further evaluation of the proposed
systems under diverse weather and operating conditions.

Figure 6 Comparison between the 3DHOG classifier (left) and the motion silhouette
classifier (right). The noisy motion foreground (blue outline) is misclassified on the right. In
contrast, the 3DHOG classifier correctly classifies the pedestrian inside the shadow and the
van in the saturated area.
ACKNOWLEDGEMENTS
This work is sponsored by and conducted in cooperation with the Directorate for Traffic
Operations at Transport for London.
REFERENCES
[1] Home Office Scientific Development Branch. Imagery Library for Intelligent Detection
Systems (i-LIDS). http://scienceandresearch.homeoffice.gov.uk/hosdb/cctv-imaging-
technology/video-based-detection-systems/i-lids/ [accessed 19 December 2008].
[2] Norbert Buch, James Orwell, and Sergio A. Velastin. Detection and classification of
vehicles for urban traffic scenes. In International Conference on Visual Information
Engineering VIE08, pages 182–187. IET, July 2008.
[3] Norbert Buch, James Orwell, and Sergio A. Velastin. 3D extended histogram of
oriented gradients (3DHOG) for classification of road users in urban scenes. In British
Machine Vision Conference BMVC 2009, London, September 2009.
[4] Norbert Buch, James Orwell, and Sergio A. Velastin. Urban road user detection and
classification using 3d wire frame models. IET Computer Vision [accepted], 2009.
Buch et al. Vehicle Localisation and Classification in Urban CCTV Streams ITS World Congress 2009
8
[5] Norbert Buch, Fei Yin, James Orwell, Dimitrios Makris, and Sergio A. Velastin. Urban
vehicle tracking using a combined 3d model detector and classifier. In 13th
International Conference on Knowledge-Based and Intelligent Information &
Engineering Systems KES2009, Santiago, Chile, September 2009. LNCS Springer.
[6] Mark Cracknell. Image detection in the real world – interactive session. In Intelligent
Transportation Systems ITS ’07 Aalborg, 2007.
[7] Mark Cracknell. Image detection in the real world – a progress update. In Intelligent
Transportation Systems World Congress ITS WC 2008 – New York, 2008.
[8] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In
Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society
Conference on, volume 1, pages 886–893, 2005.
[9] Bastian Leibe, Ales Leonardis, and Bernt Schiele. Combined object categorization and
segmentation with an implicit shape model. In ECCV’04 Workshop on Statistical
Learning in Computer Vision, pages 17–32, May 2004.
[10] OpenCV. Open source computer vision library.
http://sourceforge.net/projects/opencvlibrary [accessed 19 December 2008].
[11] C. Stauffer and W.E.L. Grimson. Adaptive background mixture models for real-time
tracking. In Computer Vision and Pattern Recognition, 1999. IEEE Computer Society
Conference on, volume 2, pages 246–252, June 1999.
[12] Roger Y. Tsai. An efficient and accurate camera calibration technique for 3d machine
vision. In Proc. Int. Conf. on Computer Vision and Pattern Recognition (CVPR),
pages 364–374, 1986.
[13] Fei Yin, Dimitrios Makris, and Sergio A. Velastin. Performance evaluation of object
tracking algorithms. In 10th IEEE International Workshop on Performance Evaluation
of Tracking and Surveillance, PETS'07, Rio de Janeiro, October 2007.