[ieee 2009 international conference on application of information and communication technologies...

Multi-camera based Human Tracking with Non-Overlapping Fields of View

Muhammad Owais Mehmood

NED University of Engineering & Technology University Road, Karachi - 75270, Pakistan

Email: [email protected]

Abstract- This paper presents approach for an automated surveillance system which performs human detection and tracking across multiple non-overlapping cameras. At single camera level, motion based segmentation is achieved using an accurate optical flow estimation technique. Feature matching and region-based shape descriptors are used for object tracking and classification. The proposed approach extends these features across fields of view (FOV) of multiple non-overlapping cameras in order to obtain correct inter-camera correspondences. Performance evaluation of the system was done in different scenarios which verify that the proposed method can efficiently tackle the complexity of video motion tracking with high processing speed.

I. INTRODUCTION

Visual surveillance systems are important tools used for monitoring the behavior of objects and people to ensure that they are within the norms. Surveillance finds applications in monitoring of traffic, crowds, airports, etc – generally helping in matters of safety as well as crime. Using computer vision techniques, surveillance systems are now able to draw together more information as having automation increases the response time and efficiency by reducing load on the human operator.

The nature of video surveillance activities determine single-camera single-location tracking as insufficient and require multi-camera surveillance systems with automatic calibration, detection of objects, and tracking with correct inter-camera correspondences. This further reduces the human operator’s load by operating on a FOV which is larger than that provided by a single-camera.

This paper describes the development of an automated visual surveillance system which has the ability to detect and track human beings using multiple non-overlapping cameras. Optical flow, feature matching and shape descriptors are combined in single camera approach and then extended globally across multiple cameras. Experiments on various datasets show that this approach gives accurate results with high computational performance and maintains robustness under varying illumination and noise conditions.

The paper is organized as follows: Section II describes related work followed by single-camera surveillance in Section III. Section IV provides framework for the multi-camera surveillance system. Results of experiments on various data sets are in Section V and Section VI concludes the paper.

II. RELATED WORK

Articulated object detection and tracking have received immense attention in Computer Vision. Grabner et al. [1] answer the question ‘Is Pedestrian Tracking Really a Hard Task’ by classifying detection as inherently difficult due to the variability in the appearance of persons (e.g., clothing, pose in the scene, illumination) and in the background (e.g., clutter, occlusions, moving objects).

Shio and Sklansky [2] proposed to use correlation techniques for detecting motion of human beings; a simple method but it demanded a great deal of computational power. Lim et al. [3] located the human face by analyzing skin color and using CAMSHIFT algorithm; a motorized webcam further enhanced the tracking. However, this approach is limited only to the facial tracking with dependency on the pan-tilt control of image acquisition device.

Solutions have also been attempted using methods like optical flow [5] and Kalman filtering. Kalman filtering method is widely found in object tracking literature as in [6]. Optical flow is used for motion based segmentation and tracking. E.R. Davies [22] and Intel’s study of video surveillance systems [12] provide an overview of these and other key algorithms in Computer Vision and video surveillance. According to [6] linear Kalman filtering is sufficient for many problems in controlled indoor and outdoor environments but for approximating non-Gaussian probability distributions, Particle filter is used and it has wide range of applications including object tracking in Computer Vision [12]. Watada and Musaand in [19] proposed a method for human tracking using foot step direction and applied particle filtering to detect and track human movement. This security system was able to track several human motions simultaneously. Gustafsson et al. [20] discussed positioning, navigating and tracking using Particle Filter with focus on automotive and airborne applications and found particle filter to show improvements over linearization approaches. Jung et al. [4] described robust detection of objects with ego-motion compensation; adaptive particle filter and EM algorithm estimated the position of moving objects.

Tracking with a single camera requires correspondence of frames over time. However, multiple-camera tracking is a correspondence problem between tracks of objects seen from different viewpoints at the same time instant [7]. Several different techniques like 3D reconstruction methods (Shape-from-Silhouette) ([10],[11]), alignment approaches and feature matching techniques [7] are used for solving multi-camera calibration and correspondence issues. In [8], [27], an

978-1-4244-4740-4/09/$25.00 ©2009 IEEE

automated visual surveillance system KNIGHT performed correspondence between multiple cameras without requiring calibration. Ribeiro et al. provided an overview for Human Activity Recognition by addressing problems of feature selection, data modeling, and classifier structure and also provided techniques for achieving real-time performances [9].

The system in this paper uses improved techniques of optical flow for motion-based segmentation. Feature matching and region-based shape description approaches are used for object tracking and classification. This implementation is then expanded across multiple non-overlapping cameras for multi-person multi-camera tracking. This paper also focuses on a multi-camera video surveillance system with improved computational performance.

III. SINGLE-CAMERA SURVEILLANCE

Single camera system consists of preprocessing, optical

flow calculations, tracking, classification and activity analysis modules. This section discusses each module in detail.

A. Development Tools

HALCON is a machine vision library that comes with an integrated development environment called HDevelop. It has an Image acquisition module which provides interfaces to many frame grabbers and industrial cameras (analog, Camera Link, USB 2.0, IEEE 1394, and GigE) [13]. Thus, the system can be interfaced with many different kinds of cameras.

All this has been implemented on Microsoft Windows Vista operating system running on 2.23 GHz Intel Core2Duo (Laptop) processor with 4GB memory.

B. Pre-processing

Pre-processing is done to minimize the effect of varying illumination and noise. 1) Varying Illumination: The varying illumination problem is minimized by using the illuminate function which enhances contrast. Very dark parts of the image are illuminated more strongly, while very light ones are darkened. The pre-processed new intensity value is given by

(( )* )new round val mean factor orig (1) where orig is the original intensity value, mean is the corresponding intensity value of the low pass filtered image and val equals the median value. As in [15], factor of 0.7 and filter of size 101x101 is used. 2) Noise Reduction: The image frame is smoothed by a Gaussian filter. This also reduces aliasing in case downsizing is required for high resolution videos. C. Optical Flow Calculations

Optical flow field is a velocity field that is the projection of three-dimensional motion of scene as a two-dimensional signal and is used to extract information about the movement between two consecutive images of a monocular image sequence.

Fig. 1. System Architecture

Different methods for calculating the optical flow have been

studied. In this paper, the method proposed by Brox et al. is used [14]. This approach is flow-driven, robust, isotropic, and uses a gradient constancy term. Experiments in [14] show that the accuracy attained by Brox et al. method is significant and sometimes even twice as high as the best value known so far (e.g. as compared to [16]). Moreover, the method is robust under a considerable amount of noise. The energy functional of the Brox et al. algorithm can be written as

data smoothE E E (2)

where is the regularization parameter known as FlowSmoothness and

2 2( ( ) ( ) ( ) ( ) )dataE I x w I x I x w I x d x (3)

2 23 3( )smoothE dxu v (4)

In dataE , is the gradient constancy weight known as

GradientConstancy. FlowSmoothness and GradientConstancy constitute the model parameters of Brox et al. algorithm.

HALCON reference manual [15] and experiments with different videos show that if GradientConstancy is 5 then the system works well and therefore it is fixed. Flowsmoothness is used to smooth the computed optical flow field. This value has to be adjusted in different conditions as shown in Table I. It can be observed that the greater the resolution, the greater the FlowSmoothness and that the value is less in outdoor environments.

Fig. 2. Single camera tracking on a PETS 2009 video

TABLE I FLOWSMOOTHNESS VALUES IN DIFFERENT CONDITIONS

Video Details Resolution FlowSmoothness

Corridor/Outdoor Environment 160 x 120 5

Indoor Environment #1 320 x 240 20

Indoor Environment #2 320 x 240 20

Outdoor Environment #1 320 x 240 10

Outdoor Environment #2 640 x 480 20

After this, Thresholding is employed for segmentation due

to which pixels exhibiting movement are extracted as regions. D. Object Tracking

Tracking of object motion is complex because many objects are moving simultaneously and independently, hence we need to apply motion-based constraints on each individual object. 1) Mean Intensity Feature: Optical flow and segmentation provides us with regions. If R is a region, p a pixel from R with the intensity value g(p) and F the plane (F = |R|), then the feature mean is defined as

( )

p in R

g pMean

F (5)

Let n be a frame of the video and R1 is the region in this

frame obtained by segmentation; similarly, n+1 is the next consecutive video frame and R2 is the region in this frame. Let Tp be a threshold given as

( ) ( 1)

2p

M n M nT

(6) where |M(n)| is the mean of nth frame and |M(n+1)| is the

mean of n+1th frame. Regions R1 and R2 are identical if the means M(n) and

M(n+1) vary from Tp within a variation of ±4 units. In this case, regions R1and R2 are identified as a single, identical and unique object and is bounded by a distinct color bounding box throughout the video. This is repeated for all the frames and regions in the video. 2) Object Entry/Exit: Object entry/exit from FOV of the camera is used as feature to detect new objects or their removal from the sequence of frames. This results in increased efficiency when combined with appearance based feature discussed above. 3) Occlusion Detection: Occlusion is detected by the use of Object Entry/Exit method presented by Khan et al. in [7] for the Multi-camera system; this concept is worked out for handling occlusion. If an object enters or leaves the scene, not from the side of FOV but from the middle (which is not possible physically), then it is meant to be an occlusion. The

limitation of this technique is that the occluded group has to be separated at least once (a remedy to this is presented in Section III-E).

As in [7], if an ‘actual’ object entry/exit must occur not from side of FOV but from somewhere else e.g. due to a person emerging from stairs or behind the door then this case has to be accounted for separately (the area where door or stairs are, is marked as an object entry/exit point).

E. Object Classification

Object classification is done according to the region-based shape description features to determine whether an object is a human being or not. The simplest and most natural property of a region is its area [17]. This simplicity is exploited in the design as the operator specifies the minimum and maximum area, which is expected to be a human being. This approach also helps in separating the individual human beings from groups (remedy to the occlusion detection limitation mentioned in section III-D) and remove unwanted objects (such as leaves). Fig. 4 shows that the person is correctly detected and tracked even when partly occluded. Another region-based descriptor used is Rectangularity. If an object (not it’s bounding box) is found to be perfectly or nearly rectangular then it is discarded as a human being can’t be perfectly rectangular.

Thus, the controlling parameters for single camera system are FlowSmoothness, minimum and maximum area. By specifying these three parameters correctly, human beings can be accurately tracked.

F. Activity Analysis

Ribeiro et al. mention a set of features based on the estimates of optical flow which can be used for human activity recognition [9]. Flow norm feature has been used to perform activity analysis. From [9], it can be seen that if target pixels are defined as

2( ) ( , ) | ( , ) target( )P t x y x y t (7)

Then, the flow norm is given as:

2 2( , , ) ( , , ) ( , , ) ( , , ) , ( , ) ( )x yf t x y F t x y f t x y f t x y x y P t (8)

Fig. 3. Person being tracked in an outdoor setup video obtained from ViSOR(Video Surveillance Online Repository). Flownorm, FPS (frames per second) and frame number being processed is displayed.

Fig. 4. Two persons being tracked in the corridors of Electronic Engineering Department, NED University; the person in green box is partly occluded, yet correctly tracked in the low-resolution video.

Since optical flow converts 3D information into 2D, it gives us an uncalibrated estimate about object’s speed and its proximity to the camera. This can be used to estimate the person’s activity. Detection of an object in restricted area and a person counter routine is also implemented. Fig. 3 shows flow norm in the processed frame.

IV. FEATURE MATCHING ACROSS MULTIPLE NON-

OVERLAPPING CAMERAS

A multi-camera system widens the scope of surveillance as it increases the coverage area and solves issues like occlusion; but, for this, the system must take into account the inter-camera correspondences. Multi-camera multiple-object tracking did not receive much attention in computer vision until 2003 [7] and a large amount of work has been done on this assuming overlapping views [21].

The multi-camera approach in this paper is based on the feature-matching algorithms and focuses on cameras with non-overlapping FOVs. In [21] camera setup for business sequence is given in which inter-camera travel time can be estimated and locations of exits and entrances across cameras are correlated. Scenario like Fig. 5 is much more complex as there are numerous, highly uncorrelated entry/exit points of each camera which also makes the inter-camera travel time prediction more difficult. Therefore, approaches like in [21] will fail because the system cannot be modeled by assumptions of correlated object entry/exits and inter-camera travel times. Therefore, we use feature matching approach; for this features discussed in section III are established globally across all the cameras.

Feature matching approaches are affected if disparity between the views is large. In order to make system robust to this, similar cameras are used, which overcomes the photometric variations like contrast, color balance, etc. Lightning variations, which affect mean intensity feature, are compensated by the use of region-based shape description features and to ensure this works, all the cameras are viewing the same ground plane. Fig. 5 shows the results of this approach; the system is able to correctly track the person, in red bounding box, in nine frames across three cameras and the person with green bounding box is tracked in three frames across two different cameras. Similarly, Fig. 6 demonstrates non-overlapping multiple-camera surveillance in the indoor scenario, in which a person is correctly tracked across two

Camera 1 Camera 2 Camera 3



Fig. 5. Multiple-camera surveillance performed on three non-overlapping cameras installed at different departments of NED University. Person with red bounding box is correctly assigned in all cameras; similarly person in green box is visible in three frames of two cameras and accurately tracked.

cameras. Results in both figures prove that the system is accurate in varying environments and conditions.

If large numbers of people are present then experiments showed that data obtained by mean intensity values can become enormous and cluttered; in this case, HALCON’s

Camera 1 Camera 2

Camera 1 Camera 2

Fig. 6. Multiple-camera surveillance; two non-overlapping indoor video sequences from IBM’s “Performance Evaluation of Surveillance systems” project.

correlation-based matching is preferred which is based on gray-values and fast normalized cross-correlation. Correlation is preferred because defocus, texture, rotation, slightly deformed shapes and linear illumination changes can be coped with. Fast correlation is also one of the simplest methods which can be used for matching the objects across different frames and views. This problem can also be solved by combining our appearance based solution with an alignment based solution as in [8].

V. RESULTS

Single camera system was tested with nine videos, about 14,000 frames. Figures 2–6 show the performance of single camera approach in indoor and outdoor scenarios taken with different cameras. Videos from IBM[23], PETS[24], ViSOR[25] and security surveillance setup at NED University have been evaluated. The system performed successfully, sometimes the track was lost in a few frames, but the system successfully recovered itself every time.

In order to evaluate the performance of multi-camera algorithm, a non-overlapping system with three cameras was setup at different departments of NED University as illustrated in Fig. 5. As discussed in the previous section that this surveillance scenario was complex due to a large number of highly uncorrelated entry/exit points whose inter-camera travel time prediction was very difficult. However, our algorithm performed accurate tracking. In order to evaluate the system indoors, two independent single camera videos were taken from IBM’s “Performance Evaluation of Surveillance systems” project and were input to the multi-camera system; with results shown in Fig. 6. The system works perfectly with one person, but, problems were faced in tracking when more than one person enters the FOVs of different cameras. However, solutions are presented to this in the previous section.

Table II provides data on the performance evaluation of the proposed system. MATLAB’s people tracking demo available in Video and Image Processing Blockset [26], featuring Kalman filter tracking, is used as a comparison with the proposed system on the same test machine with same test videos. Results are mentioned in Table II for single-camera and two-camera streams processed by the algorithm presented in this paper and results of the single-camera surveillance done by the MATLAB demo are also given. Performance analysis mentioned in the related works ([4],[8],[18],[27],[28]) and their comparison with the proposed system is summarized in Table III. The results in Table II and III show that the proposed system achieves excellent computational performance and it is possible to process the saved feed of ten cameras on the test machine. Work is also being pursued, based on this high-performance approach, on a distributed system that can process with such efficiency in real-time.

Thus, the proposed solution works perfectly in single-camera setup and a basic multi-camera visual surveillance system has been designed for cameras with non-overlapping FOVs having a large number of uncorrelated object entry/exit

TABLE II PERFORMANCE EVALUATION OF THE PROPOSED SYSTEM

points. It is also proved that the proposed system achieves this with a high processing speed.

VI. CONCLUSION

This paper presented an approach for automated visual surveillance for single and multi camera systems. Optical flow, feature matching and shape descriptors are combined to detect people and track them efficiently. The feature matching approach and shape descriptions are then extended globally to multiple non-overlapping cameras to attain correct inter-camera correspondences. Experiments on indoor and outdoor sequences with different cameras proved that the approach is simple, fast and accurate.

To advance the system in the future, several components can be worked upon, for example, ability to track multiple-people in crowded scenes and handling severe occlusions.

TABLE III

PERFORMANCE COMPARISON WITH OTHER SYSTEMS

System Details Processing rate (FPS)

Jung et al.[4] Embedded computer (Pentium III 1 GHz) Single-

camera – 320x240 resolution 5

Masoud et al. [18] Video & Vector Processors, Single-camera 30

KNIGHT [8],[27] 3 cameras each on a Pentium 2.0GHz machine 10

Black et al. [28] (Multi-camera) 5

Proposed System Intel Core2Duo 2.23GHz

(2 cameras) – 320x240 resolution 30

Proposed System Intel Core2Duo 2.23GHz

(Single-camera) – 320x240 resolution 51

Video Details, Resolution, Frame Rate

PROPOSED SYSTEM 1 Stream

Processing Rate (FPS)

PROPOSED SYSTEM 2 Streams

Processing Rate (FPS)

MATLABDEMO

1 Stream Processing

Rate (FPS)

Indoor Video, 320x240, 15fps 51 30 15

Outdoor Video, 344x277, 25fps 39 28 15



This can be achieved by using particle filter which is powerful in approximating non-Gaussian probability distributions. Human detection and activity recognition can be further enhanced by selecting more features and developing classifiers accordingly.

The proposed system achieves high computational speeds and this advantage can be further extended to a grid computing environment which can further reduce the processing time and achieve high-speed real-time multi-camera surveillance with a large numbers of cameras.

ACKNOWLEDGMENT

Author would like to thank Muhammad Zeeshan Zia (Researcher, Munich University of Technology) for his continuous support and ideas for this project.

REFERENCES

[1] Grabner Helmut, Roth Peter M. and Bischof Horst, “Is Pedestrian

Detection Really a Hard Task?”, IEEE Workshop on Performance Evaluation of Tracking and Surveillance, 2007.

[2] A. Shio and J. Sklansky, “Segmentation of people in motion,” Proc. IEEE Workshop Visual Motion, Oct. 1991, pp. 325–332.

[3] Resmana Lim, Thiang and Iwan Nyoto S., “Face tracking system using a web camera”, Conference on Electrical, Electronic, Communication, Information (CECI 2003), 17 June 2003.

[4] Boyoon Jung and Gaurav S. Sukhatme, “Detecting Moving Objects using a Single Camera on a Mobile Robot in an Outdoor Environment,” International Conference on Intelligent Autonomous Systems, Amsterdam, The Netherlands , March (2004) , pp. 980-987.

[5] B. Maurin, O. Masoud, and N. Papanikolopoulos, “Monitoring Crowded Traffic Scenes,” Proc. of the IEEE 5th Int. Conf. On Intelligent Transportation Systems (ITSC 2002), Singapore, September 3–6, 2002, pp 19-24.

[6] R. Bodor, B. Jackson and N.P. Papanikolopoulos, "Vision-Based Human Tracking and Activity Recognition," Proc. of the 11th Mediterranean Conf. on Control and Automation, Jun. 2003.

[7] Sohaib Khan and Mubarak Shah, “Consistent labeling of tracked objects in multiple cameras with overlapping fields of view”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 10, Oct. 2003, pp. 1355 – 1360.

[8] Omar Javed and Mubarak Shah, “KNIGHTM: A Multi-Camera Surveillance System”, ONDCP International Technology Symposium, 2003.

[9] P. Ribeiro and J. Santos-Victor, “Human Activities Recognition from Video: modeling, feature selection and classification architecture”, Proc. Workshop on Human Activity Recognition and Modelling (HAREM 2005 - in conjunction with BMVC 2005), Oxford, Sept 2005, pp. 61-70.

[10] Kong Man Cheung, Simon Baker, and Takeo Kanade, “Shape-From-Silhouette of Articulated Objects and its Use for Human Body Kinematics Estimation and Motion Capture”, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June, 2003.

[11] Wong, K-Y.K. and Cipolla, R, “Structure and motion from silhouettes”, 8th IEEE International Conference on Computer Vision (ICCV'02), July 2001, Vancouver, Cananda.

[12] Trista P. Chen, Horst Haussecker, Alexandar Bovyrin, Roman Belenov, Konstantin Rodyushkin, Alexander Kuranov and Victor Eruhimov, “Computer Vision Workload Analysis: Case Study of Video Surveillance Systems”, Intel Technology Journal, vol. 9, no. 2, May, 19, 2005.

[13] HALCON 8.0 Broucher, the Power of Machine Vision, Germany, 2008. [14] Thomas Brox, Andr´es Bruhn, Nils Papenberg, and Joachim Weickert,

“High Accuracy Optical Flow Estimation Based on a Theory for Warping”, In Proc. 8th European Conference on Computer Vision, Springer LNCS 3024, T. Pajdla and J. Matas (Eds.), vol. 4, Prague, Czech Republic, May 2004, pp. 25-36.

[15] HALCON Reference Manual Version 8.0.2, the Power of Machine Vision, Germany, 2008.

[16] Fonseca, A.; Mayron, L. M.; Socek, D. and Marques, O., “Design and Implementation of an Optical Flow-Based Autonomous Video Surveillance System”, EuroIMSA 2008, March 17-19, 2008, Innsbruck, Austria.

[17] Milan Sonka, Vaclav Hlavac and Roger Boyle, Image Processing: Analysis and Machine Vision, 2nd ed., CL-Engineering, 1998, pp. 254-258.

[18] Osama Masoud and Nikolaos Papanikolopoulos,"A Novel Method for Tracking and Counting Pedestrians in Real-Time Using a Single Camera", IEEE Transactions on Vehicular Technology,Vol. 50, Issue 5, 2001, pp. 1267-1278.

[19] Junzo Watada and Zalili Binti Musaand, “Tracking Human Motions for Security System”, SICE Annual Conference 2008, August 20-22, 2008.

[20] F. Gustafsson, F. Gunnarsson, N. Bergman, U. Forssell, J. Jansson, R. Karlsson and P-J. Nordlund, “Particle filters for positioning, navigation, and tracking”, IEEE Transactions on Signal Processing, vol. 50, no. 2, Feb. 2002, pp. 425 – 437.

[21] Omer Javed, Zeeshan Rasheed, Khurram Shafique and Mubarak Shah, “Tracking across multiple cameras with disjoint views", Proceedings of the Ninth IEEE International Conference on Computer Vision, vol. 2, Oct. 2003, pp. 952-957.

[22] E. R. Davies, Machine Vision: Theory, Algorithms, Practicalities, 3rd ed., Morgan Kaufmann, 2004, pp. 524-541.

[23] IBM, Performance Evaluation of Surveillance Systems Video Sequences. [Online]. Available: http://www.research.ibm.com/peoplevision/performanceevaluation.html

[24] Eleventh IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, Benchmark Data on People Tracking. [Online]. Available: http://www.cvg.rdg.ac.uk/PETS2009/a.html#s2

[25] University of Modena and Reggio Emilia, Italy, Video Surveillance Online Repository’s Video Database on single camera surveillance in outdoor Unimore D.I.I. setup. [Online]. Available: http://www.openvisor.org/video_videosInCategory.asp?idcategory=1

[26] Video and Image Processing Blockset, MATLAB, People Tracking Program. [Online]. Available: http://www.mathworks.com/products/demos/videoimage/PeopleTracking/viptrackpeople.html

[27] Mubarak Shah, Omar Javed and Khurram Shafique, "Automated Visual Surveillance in Realistic Scenarios," IEEE MultiMedia, vol. 14, no. 1, pp. 30-39, Jan.-Mar. 2007, doi:10.1109/MMUL.2007.3

[28] Black, J.James and Ellis, T. Tim, "Multi Camera Image Tracking", IVC, vol. 24, no. 11, 1 November 2006, pp. 1256-1267.

[ieee 2009 international conference on application of information and communication technologies...

Documents