Buch et al. Vehicle Localisation and Classification in Urban CCTV Streams ITS World Congress 2009
VEHICLE LOCALISATION AND CLASSIFICATION
IN URBAN CCTV STREAMS
Norbert Buch 1, Mark Cracknell 2, James Orwell 1 and Sergio A. Velastin 1
1. Kingston University, Penrhyn Road, Kingston upon Thames, KT1 2EE, United Kingdom
{norbert.buch, james.orwell, sergio.velastin}@kingston.ac.uk
2. Transport for London, Palestra, 197 Blackfriars Road, London, SE1 8NJ, United Kingdom
ABSTRACT
This paper presents an introduction to video analysis tools for urban traffic management.
Based on a review of the limitations of current systems, a framework for localising and
classifying vehicles in real-world coordinates is introduced as part of a project at Transport
for London. Vehicle detection is performed using either motion silhouettes or 3DHOG (3D
extended Histograms of Oriented Gradients). The latter is more robust in urban environments.
Qualitative and quantitative evaluation of the proposed systems is provided with an outlook
on further development potential.
KEYWORDS
vehicle detection, road user classification, pedestrian, urban traffic, visual surveillance, video
analysis, computer vision
INTRODUCTION
Intelligent image detection systems are part of a centralised approach to modern-day traffic
management. This has arisen from the need for more cost-effective and efficient monitoring
of traffic. Traffic monitoring CCTV tends to be unique in that it involves high camera
numbers, operates in the public domain and uses long transmission paths (up to 40 km). With
1200 cameras and over 100 monitors it is not feasible to continuously monitor every CCTV
camera installed within Transport for London’s (TfL) network. In fact, it has been shown that
manual monitoring over time significantly reduces the accuracy of detection. Therefore, the
development of a technology that provides automatic and relevant real-time alerts to Traffic
Co-ordinators can have an immediate and long term impact on traffic management through
the implementation of responsive traffic strategies.
In early 2006, TfL launched the Image Recognition and Incident Detection (IRID) project.
This project was tasked with reviewing the current image processing market and assessing
how well it met TfL's detection requirements. Testing was carried out on the following
criteria: Congestion, Stopped Vehicles, Banned Turns, Vehicle Counting, Subway Monitoring
and Bus Detection [6,7].
Results from this testing showed good performance in Congestion detection (80% precision)
but poor performance in tracking-based detection (~20% precision), clearly exposing
limitations in capability. These limitations led directly to the creation of a research relationship
with Kingston University. The aims of this project are the localisation and subsequent
Figure 1 Framework for 3D localisation and classification and models used
classification of vehicles and pedestrians in camera views specific to the urban environment
of Transport for London. Estimating the vehicle position in real-world coordinates (on road
maps) is also beneficial for traffic enforcement applications. The conventional concept of
using background modelling to generate a motion mask is used only as a baseline.
Motion masks suffer from noise due to lighting changes, camera shake, rain, etc. and
particularly from occlusion due to low camera angle. All these effects are inherent to urban
traffic scenes. The project objective was to move beyond motion estimation and use other
visual means to localise and classify cars. A texture based classifier (3DHOG) was introduced
to overcome the mentioned problems.
3D FRAMEWORK FOR VEHICLE CLASSIFICATION AND LOCALISATION
We developed a framework to localise vehicles either based on the motion mask or based on
the texture in the image. The framework is shown in Figure 1. The detector uses background
estimation with a Gaussian Mixture Model (GMM) [11] to model the static part of the scene.
This model allows several appearances of a pixel to be considered static, as shown in Figure 2,
which enables rejection of repetitively moving background objects such as trees. By taking the
difference between the background and a new frame, an initial foreground mask is estimated.
This mask is refined by a shadow removal algorithm, which classifies areas as shadow if
they are slightly darker than the background but have the same colour. From the resulting
binary foreground mask, closed contours are extracted.
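As an illustration, the detector stage can be sketched as follows. This is a minimal single-background sketch (the actual system uses a multi-image GMM background model [11]); the function name, thresholds and ratio bounds are illustrative assumptions:

```python
import numpy as np

def foreground_mask(frame, background, diff_thresh=30.0,
                    shadow_ratio=(0.5, 0.9), colour_tol=0.1):
    """Simplified detector stage: background difference plus shadow removal.

    frame, background: float arrays of shape (H, W, 3) with values in [0, 255].
    A pixel enters the initial foreground mask if it differs from the
    background; it is then reclassified as shadow (and removed) if it is
    uniformly darker than the background while keeping roughly the same
    colour, i.e. similar per-channel brightness ratios.
    """
    frame = np.asarray(frame, float)
    background = np.asarray(background, float)

    # Initial foreground mask from the background difference.
    diff = np.abs(frame - background).sum(axis=2)
    initial = diff > diff_thresh

    # Shadow test: per-channel brightness ratio between frame and background.
    ratio = frame / np.maximum(background, 1e-6)
    mean_ratio = ratio.mean(axis=2)
    darker = (mean_ratio > shadow_ratio[0]) & (mean_ratio < shadow_ratio[1])
    same_colour = (ratio.max(axis=2) - ratio.min(axis=2)) < colour_tol
    shadow = initial & darker & same_colour

    # Refined binary foreground mask; closed contours are extracted from
    # this mask in the next step of the pipeline.
    return initial & ~shadow
```

On the resulting binary mask, a contour-following routine then produces the closed contours passed on to the classifier.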
The contours provided by the detector are used to initialise hypotheses for vehicle locations
on the ground plane. The hypotheses are verified by the classifier by comparing existing
vehicle models with the input image to give a match measure. The models can either be the
projected silhouettes or appearance data gathered during a training phase. Finding the
maximum match in terms of location and vehicle type gives the final estimation. The
detection of a single frame can be tracked over time using a Kalman filter, which will produce
tracks (trajectories) for vehicles. Further details about the framework can be found in Buch et
al. [2,4] and for the tracker in [5]. As visual output, the localised vehicles are marked up on
the camera view and on a map of the area. Refer to Figure 2 for an example frame from TfL
with the corresponding map. The thin dark red boxes represent the regions of interest where
vehicle detection is performed. These could be set to bus lane areas to detect unpermitted
vehicles in those lanes.
Figure 2 a) Detector stage. b) Localised vehicles with superimposed wire frames and
ground plane location. The coloured lines are the tracks of individual vehicles.
RESULTS USING MOTION SILHOUETTES
The localisation framework was first tested with a baseline approach using the overlap of
motion silhouettes and model silhouettes as the match measure. Figure 3 illustrates the
operation, where a score is defined as the ratio between the intersection and union of both
silhouettes. The
3D location of vehicles is found by generating a hypotheses grid (green crosses in Figure 3)
around the back projected silhouette centroid (red cross in Figure 3). A score surface is
generated for this grid and the hypothesis with the highest score is selected as location and
class for the detected vehicle.
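The grid search over hypotheses can be sketched as below. The `toy_render` function is a hypothetical stand-in for the 2D projection of the calibrated 3D wire-frame models used in the real system:

```python
import numpy as np

def overlap_score(motion, model):
    """Match measure: ratio of intersection to union of two binary silhouettes."""
    inter = np.logical_and(motion, model).sum()
    union = np.logical_or(motion, model).sum()
    return inter / union if union else 0.0

def best_hypothesis(motion, render, centres, classes):
    """Score every (ground-plane centre, vehicle class) hypothesis and keep the
    maximum. render(centre, cls) projects the model silhouette for one
    hypothesis into the image."""
    return max(((overlap_score(motion, render(c, k)), c, k)
                for c in centres for k in classes),
               key=lambda t: t[0])

def toy_render(centre, cls):
    """Hypothetical stand-in for the wire-frame projection: a square
    silhouette whose size depends on the vehicle class."""
    size = {'car': 4, 'bus': 8}[cls]
    sil = np.zeros((20, 20), bool)
    r, c = centre
    sil[r:r + size, c:c + size] = True
    return sil

# Toy demonstration: the motion silhouette is a 4x4 'car' at (5, 5).
motion = toy_render((5, 5), 'car')
score, centre, cls = best_hypothesis(
    motion, toy_render,
    centres=[(4, 4), (5, 5), (6, 6)], classes=['car', 'bus'])
```

The hypothesis with the highest score simultaneously yields the ground-plane position and the class label.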
Figure 3 Classifier based on motion silhouette overlap with the model. The score surface
plots the match measure [%] against the cross-section position [m] for each vehicle hypothesis.
a) Full system (ground truth rows vs detection columns)

             pedestrian  bike  car/taxi  van  bus/lorry   FN   count  overlap
pedestrian      .71       .02     .02     0       0       .24    241    .57
bike            .49       .47      0      0       0       .04     45    .6
car/taxi        .01       .03     .85    .02     .02      .07    371    .66
van             .02        0      .1     .84     .03      .02     63    .69
bus/lorry        0         0       0     .02     .98       0      62    .72
FP              .1        .02     .04    .08     .03       0

b) Classifier only (ground truth rows vs detection columns)

             pedestrian  bike  car/taxi  van  bus/lorry  count
pedestrian      .94       .02     .03    .01     .01       182
bike            .51       .49      0      0       0         43
car/taxi        .01       .03     .92    .02     .03       344
van             .02        0      .1     .85     .03        62
bus/lorry        0         0       0     .02     .98        62

c) Overall performance

Symbol          Value
Recall R        79.5%
Precision P     83.9%
Classifier PC   89.8%
Detector RD     88.6%
Detector PD     93.5%
GT Overlap      0.64

Table 1 a) Confusion matrix of the full system including the detector. b) Confusion matrix of
the classifier only. c) Overall performance of the motion silhouette based system.
Classification performance
The system was extensively tested with video footage from Transport for London and from
the i-LIDS datasets [1]. The latter is a benchmarking dataset provided by the UK Home
Office to imaging research institutions. In Table 1, we present classification results on 1 hour
of video footage from the parked car scenario. Good overall classification performance is
demonstrated, with some confusion between bikes and pedestrians. This is due to the similar
size of those two types of road users. The localisation performance is demonstrated by the
bounding box overlap between the wire frame and the ground truth reported in Table 1. The
value of 64% overlap is good, considering that the wire frame rather than the motion
silhouette is used for the bounding box estimation. A detailed explanation of the performance
measures can be found in [1] and of their application in [2,4].
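As a pointer to how such figures relate to the confusion matrices, per-class recall and precision can be read off a raw-count matrix. This is a minimal sketch for illustration, not the exact measure definitions of [1]:

```python
import numpy as np

def recall_precision(confusion):
    """Per-class recall and precision from a raw-count confusion matrix
    with ground truth on the rows and detections on the columns."""
    c = np.asarray(confusion, float)
    recall = np.diag(c) / c.sum(axis=1)     # correct / ground-truth count per class
    precision = np.diag(c) / c.sum(axis=0)  # correct / detection count per class
    return recall, precision
```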
Tracking performance
Tracking is performed on the ground plane of the scene, which simplifies behaviour analysis
like bus lane monitoring. We use the standard formulation of the Kalman filter for a constant
velocity model of vehicles. The object tracking performance is demonstrated by comparing
our tracker with a baseline tracker (OpenCV blob tracker [10]). The OpenCV tracker uses an
adaptive mixture of Gaussians for background estimation, connected component analysis for
data association and Kalman filtering for tracking blob position and size. The data used is
i-LIDS [1], as for the classification performance evaluation above.
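The constant velocity Kalman filter can be sketched as follows; the time step and noise covariances are illustrative assumptions rather than the tuned values of the deployed tracker:

```python
import numpy as np

dt = 1.0  # frame interval; illustrative
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], float)   # constant velocity state transition
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], float)   # only the ground-plane position is observed
Q = np.eye(4) * 0.01                  # process noise (illustrative)
R = np.eye(2) * 0.1                   # measurement noise (illustrative)

def kalman_step(x, P, z):
    """One predict/update cycle; state x = [x, y, vx, vy], measurement z = (x, y)."""
    x = F @ x                          # predict state
    P = F @ P @ F.T + Q                # predict covariance
    y = z - H @ x                      # innovation
    S = H @ P @ H.T + R                # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)     # Kalman gain
    x = x + K @ y                      # updated state
    P = (np.eye(4) - K @ H) @ P        # updated covariance
    return x, P
```

Feeding the per-frame ground-plane detections into `kalman_step` yields smoothed positions and velocity estimates, which form the vehicle tracks.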
We propose a rich set of metrics such as Correct Detected Tracks, False Detected Tracks and
Track Detection Failure to provide a general overview of the system’s performance. Track
Fragmentation shows whether the temporal and spatial coherence of tracks is established. ID
Change is useful to test the data association module of the system. Latency indicates how
quickly the system can respond to an object entering the camera view, and Track
Completeness how completely the object has been tracked. Metrics such as Track Distance
Error and
Closeness of Tracks indicate the accuracy of estimating the position, the spatial and the
temporal extent of the objects respectively. More details about this evaluation framework can
be found in [5] and Yin et al. [13]. The proposed system detected 94% of the ground truth
tracks compared to 88% for the baseline. Our system has half the track detection failures of
the baseline. Please refer to Table 2 for a complete set of metrics and Figure 4
for visual tracking examples.
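Two of the simpler metrics can be written down directly. The definitive definitions follow [5] and Yin et al. [13], so these helpers are only an approximation for illustration:

```python
def track_completeness(gt_frames, sys_frames):
    """Fraction of the ground-truth track's frames covered by the system track."""
    gt = set(gt_frames)
    return len(gt & set(sys_frames)) / len(gt)

def bbox_overlap(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2);
    averaging this over a track gives a closeness-style score."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0
```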
Table 2 Tracking results of the motion silhouette classifier

Metric                                     Proposed tracker   OpenCV blob tracker
Number of ground truth tracks              100                100
Number of system tracks                    144                203
Correct detected tracks                    94                 88
Track detection failure                    6                  12
False detected tracks                      27                 90
Latency [frames]                           5                  5
Track fragmentation                        8                  18
Average track completeness [time]          64%                55%
ID change                                  10                 3
Average track closeness [bbox overlap]     54%                35%
Standard deviation of closeness            20%                13%
Average distance error [pixels]            22                 21
Standard deviation of distance error       19                 15

Figure 4 Example tracking results from the i-LIDS data

BEYOND MOTION: 3DHOG
The limitations of motion silhouettes inspired the use of texture to detect vehicles by their
appearance. Good results have been reported elsewhere for patch based approaches in object
recognition [9] and for pedestrian detection with histograms of oriented gradients (HOG) [8].
We introduce a novel concept by applying the HOG descriptor to image patches defined in 3D
model space. Full details on this 3DHOG framework can be found in Buch et al. [3]. This
feature approach substitutes the overlap match measure in the block diagram with a training
Figure 5 a) 3D model with marked interest points. b) Input camera image frame.
c) Extracted and normalised image patches displayed in 3D space. d) 3DHOG gradient
features generated from the visible image patches

based classification step. Figure 5 shows the model with the patch centres defined relative to
the model, together with the extracted patches. Affine transformations are used to generate
these scale-normalised patches. A descriptor is generated from every patch, consisting of
either gradient histograms, a frequency spectrum or a simple image histogram. A data driven
appearance model is learned, whereby a single Gaussian distribution is fitted to the
descriptors of every interest point. During system operation, new descriptors are generated
for every hypothesis (2D projection block in Figure 1), and the distance between the learned
descriptors and the newly seen descriptors defines the match measure. The remainder of the
framework remains identical to the earlier description.
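A simplified sketch of this matching scheme is given below: a plain gradient-orientation histogram stands in for the full HOG descriptor, and a single diagonal Gaussian is fitted per interest point. All names and parameters are illustrative assumptions:

```python
import numpy as np

def grad_histogram(patch, bins=8):
    """Gradient-orientation histogram of a scale-normalised grey patch
    (a simplified stand-in for the HOG descriptor used by 3DHOG)."""
    gy, gx = np.gradient(np.asarray(patch, float))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % np.pi                 # unsigned orientation
    hist, _ = np.histogram(ang, bins=bins, range=(0.0, np.pi), weights=mag)
    return hist / (hist.sum() + 1e-9)

class InterestPointModel:
    """Single Gaussian per interest point, learned from training descriptors."""
    def fit(self, descriptors):
        d = np.asarray(descriptors)
        self.mean = d.mean(axis=0)
        self.var = d.var(axis=0) + 1e-6              # diagonal covariance
        return self

    def match(self, descriptor):
        """Match measure: negative variance-normalised squared distance to
        the learned mean (higher is better), so hypotheses can be ranked
        by taking a maximum."""
        return -float(((descriptor - self.mean) ** 2 / self.var).sum())
```

A model trained on patches of one appearance (e.g. a vertical edge) will score such patches higher than patches of a different appearance, which is what lets the classifier rank hypotheses.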
Performance comparison
For evaluation on the same dataset as in the last section, a recall of 81% is achieved at a
precision of 82%. These results indicate similar performance to the motion based system,
while providing more robustness against the usual noise sources in the urban environment.
By using texture and appearance rather than motion, the 3DHOG classifier can deal with
cases where the motion silhouette is significantly distorted or similar for different classes.
There are illustrative examples in Figure 6. The case of oversized silhouettes due to shadows
and lighting changes is rectified and a pedestrian is correctly detected. Another common
problem is saturation of some areas in the camera. The example shows a very small silhouette
for a white van, which was missed from the motion foreground due to the saturation in the
same area. The 3DHOG classifier successfully detects and classifies the van. For objects of
similar size like pedestrians and bicycles, the classifier can distinguish correctly based on the
appearance.
CONCLUSIONS
We presented a review of commercial video analytics systems tested by Transport for
London. The findings and the progress of the subsequent project with Kingston University
are reported. An improved computer vision system is obtained by introducing a 3D
localisation framework for vehicles. Good results are demonstrated for motion silhouette
based vehicle classification. An extension to texture based classification is given by moving
beyond the concept of motion estimation. This extends the concept of HOG by introducing
the novel 3DHOG descriptor, which uses a “3D surface window” for vehicle classification.
This classifier demonstrated superior performance for challenging cases where motion silhouettes are
incorrect. The frame-to-frame detections are used as input to a Kalman filter to generate
trajectories of road users. Future work will focus on further evaluation of the proposed
systems under diverse weather and operating conditions.

Figure 6 Comparison between the 3DHOG classifier (left) and the motion silhouette
classifier (right). The noisy motion foreground (blue outline) is misclassified on the right. In
contrast, the 3DHOG classifier correctly classifies the pedestrian inside the shadow and the
van in the saturated area.
ACKNOWLEDGEMENTS
This work is sponsored by and conducted in cooperation with the Directorate for Traffic
Operations at Transport for London.
REFERENCES
[1] Home Office Scientific Development Branch. Imagery Library for Intelligent Detection
Systems (i-LIDS). http://scienceandresearch.homeoffice.gov.uk/hosdb/cctv-imaging-
technology/video-based-detection-systems/i-lids/ [accessed 19 December 2008].
[2] Norbert Buch, James Orwell, and Sergio A. Velastin. Detection and classification of
vehicles for urban traffic scenes. In International Conference on Visual Information
Engineering VIE08, pages 182–187. IET, July 2008.
[3] Norbert Buch, James Orwell, and Sergio A. Velastin. 3D extended histogram of
oriented gradients (3DHOG) for classification of road users in urban scenes. In British
Machine Vision Conference BMVC 2009, London, September 2009.
[4] Norbert Buch, James Orwell, and Sergio A. Velastin. Urban road user detection and
classification using 3d wire frame models. IET Computer Vision [accepted], 2009.
Buch et al. Vehicle Localisation and Classification in Urban CCTV Streams ITS World Congress 2009
8
[5] Norbert Buch, Fei Yin, James Orwell, Dimitrios Makris, and Sergio A. Velastin. Urban
vehicle tracking using a combined 3d model detector and classifier. In 13th
International Conference on Knowledge-Based and Intelligent Information &
Engineering Systems KES2009, Santiago, Chile, September 2009. LNCS Springer.
[6] Mark Cracknell. Image detection in the real world – interactive session. In Intelligent
Transportation Systems ITS ’07 Aalborg, 2007.
[7] Mark Cracknell. Image detection in the real world – a progress update. In Intelligent
Transportation Systems World Congress ITS WC 2008 – New York, 2008.
[8] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In
Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society
Conference on, volume 1, pages 886–893, 2005.
[9] Bastian Leibe, Ales Leonardis, and Bernt Schiele. Combined object categorization and
segmentation with an implicit shape model. In ECCV’04 Workshop on Statistical
Learning in Computer Vision, pages 17–32, May 2004.
[10] OpenCV. Open source computer vision library.
http://sourceforge.net/projects/opencvlibrary [accessed 19 December 2008].
[11] C. Stauffer and W.E.L. Grimson. Adaptive background mixture models for real-time
tracking. In Computer Vision and Pattern Recognition, 1999. IEEE Computer Society
Conference on, volume 2, pages 246–252, June 1999.
[12] Roger Y. Tsai. An efficient and accurate camera calibration technique for 3d machine
vision. In Proc. Int. Conf. on Computer Vision and Pattern Recognition (CVPR),
pages 364–374, 1986.
[13] Fei Yin, Dimitrios Makris, and Sergio A. Velastin. Performance evaluation of object
tracking algorithms. In 10th IEEE International Workshop on Performance Evaluation
of Tracking and Surveillance, PETS'07, Rio de Janeiro, October 2007.