

Designing Video Surveillance Systems as Services

R. Cucchiara and A. Prati and R. Vezzani

Abstract This chapter briefly describes the research activities carried out at Imagelab in Modena, Italy, on video surveillance. Among the several projects, two are detailed. They share the paradigm of offering advanced services: in the first case through the development of a SaaS (Software as a Service) platform for video surveillance, in the other through the inclusion in standard video surveillance systems of advanced video analysis functionalities, such as people re-identification in multi-camera systems.

1 Introduction

Imagelab is a research lab located at the Information Engineering Department of the University of Modena and Reggio Emilia. It was founded in 1999 and has grown on public and private funding, thanks to the rich local territory. Currently, it comprises 5 permanent members (Simone Calderara, Rita Cucchiara, Costantino Grana, Andrea Prati and Roberto Vezzani), 4 PhD students, 4 postdoctoral researchers and an honorary affiliate (Massimo Piccardi, University of Technology, Sydney). Imagelab's research activity covers several topics, ranging from image processing and transmission, to video and image analysis, to scene understanding, to industrial vision systems, to real-time systems for security and surveillance, to medical imaging and to visual digital libraries. Please refer to the website http://www.imagelab.unimore.it for further details on projects, events and

R. Cucchiara and R. Vezzani: D.I.I. - University of Modena and Reggio Emilia, Modena, Italy, e-mail: {rita.cucchiara, roberto.vezzani}@unimore.it

A. Prati: Di.S.M.I. - University of Modena and Reggio Emilia, Reggio Emilia, Italy, e-mail: {andrea.prati}@unimore.it



for browsing the more than 250 publications in international journals, books and conferences.

In this chapter, however, the main focus will be on the research carried out at Imagelab on topics related to video surveillance. The current research is funded by several international and national projects; among others, we would cite:

• International project BESAFE (Behavioral lEarning in Surveilled Areas with Feature Extraction), funded by the NATO "Science for Peace" programme (in collaboration with the Hebrew University of Jerusalem) (2008-2011). The project goal is to extract visual features and to exploit machine learning algorithms to analyze people's behaviors, with specific applications to terrorism; in this regard, Imagelab has developed an innovative approach for people trajectory analysis based on circular statistics methods;

• Regional project on the PRRIITT programme with the company Bridge.129 SpA (2010-2011) for developing a system for the security of construction working sites. The project goals are to identify intruders in the scene and to detect potential threats to security; in this regard, Imagelab is conducting two main streams of research: one devoted to innovative and effective ways of detecting people in complex scenarios; the other focused on the integration of RFID technologies with video cameras to detect intruders;

• European project THIS (Transport Hub Intelligent video System) (2009-2011), funded by the JLS-DG of the European Union, and coordinated by Imagelab;

• Regional project ViSERaS (Video surveillance in Emilia Romagna as a Service) (2010-2011), funded by Lepida SpA and Regione Emilia-Romagna.

While the first two will not be further detailed here (because they are almost finished and more application-oriented, respectively), the last two will be further described in the remainder of this chapter.

2 ViSERaS - Video surveillance in Emilia Romagna as a Service

The ViSERaS project is a 6-month project (Dec. 2010-June 2011) funded by Lepida SpA within the framework of the research carried out with Regione Emilia-Romagna. In addition to Imagelab, the project also involves IBM Italia SpA, Vitrociset SpA and CSP Scarl (Centro Studi Piemonte). The objective of the project is to develop an integrated system of video management and video analysis "as a service", where public administrations (in particular, small municipalities of the Emilia-Romagna region) can obtain from Lepida, in addition to high-bandwidth connectivity, also the management, storage and analysis of a large number of cameras, through a SaaS (Software as a Service) architecture.

The evolution of technology has made it possible to draw valuable information from video surveillance and correlate it with data extracted from the numerous databases held by police and other public institutions, in order to obtain predictive analyses on


the phenomena of accidents and crime, thus allowing the authorities to improve their action and intervention response.

Despite the many projects and commercial companies related to video surveillance, in the Emilia-Romagna region there is no single system at the metropolitan level. For this reason, the ViSERaS project proposes not only the virtualization of resources, typical of cloud computing, but also the sharing of a number of potential video surveillance services, which can then be optimized, customized and selected for the different regional situations, improving the overall performance and managing the complexity of the system.

The offered services will span different levels of complexity, ranging from simple consultancy support and configuration, to services such as remote data storage, remote control of system configuration (including control of PTZ cameras), maintenance of data security using watermarking techniques, video analysis and video services to mine the stored data, and so on.

Fig. 1 Scheme of the architecture of the ViSERaS project.

The general architecture of the platform envisioned in this project (Fig. 1) is based on IP technology, which is known to offer many benefits, such as reliability, scalability, open architecture, third-party integration, etcetera. The platform will consist of a central unit (Video Management), which will handle the video streaming, including video recording and the archiving of metadata in a database. This will be the brain of the system, through which all the videos circulate, and it will be accessible from outside through a graphical interface based on web technology. The choice of a web interface makes the system inherently flexible, theoretically allowing any PC


connected to the network, regardless of operating system, including smart phones and mobile devices, to use the video streams, both live and archived, and to fully configure the system. Moreover, the Video Management subsystem may be distributed geographically or located in one central location for storage and analysis.

One of the most important requirements of the project was to employ, especially for the Video Management system, open source solutions. Among the many available nowadays, our choice fell on Zoneminder (http://www.zoneminder.com), because it exhibits several interesting features, such as:

• it supports various types of cameras, including PTZ;
• it is built on standard tools: C++, Perl and PHP;
• it uses the high-performance MySQL database;
• it is composed of architecture-independent processes for high-performance video capture and analysis, implementing high redundancy against failure;
• it supports live video in MPEG, multi-part JPEG and stills.

Despite its positive features, Zoneminder is an ongoing project which still presents many bugs and is continuously updated and patched. In addition, it is not conceived to automatically handle scalability issues, which are instead of crucial importance for the ViSERaS project. In fact, the system must be capable of offering a good response to different types of customers, ranging from small towns (with only a few cameras, 1 to 5), to large cities (with tens or even hundreds of cameras), to federated entities (such as federations of neighboring towns, or police departments handling several towns or areas). This requires developing a software architecture that can easily and safely adapt its resources (mainly computational) depending on the number of cameras to be managed.

Different solutions have been studied, including the use of a supercomputer, or the use of virtual machines (MV in Fig. 2), which can either be allocated to the same physical machine (MF in Fig. 2) or spread over multiple physical machines.
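As an illustration of this elasticity, a first-cut sizing rule might map the camera count to a number of virtual machines. The capacity of 8 cameras per VM below is purely an assumption for the sketch, not a figure from the project:

```python
def vms_needed(num_cameras, cams_per_vm=8):
    """Ceiling division: how many Zoneminder VMs to allocate for a
    customer, given an assumed per-VM camera capacity."""
    return max(1, -(-num_cameras // cams_per_vm))

# A small town (1-5 cameras) fits on a single VM; a large city needs more.
```

In the real platform such a decision would also weigh CPU load, storage and failover requirements, not just the camera count.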

In addition, it is very important that the different customers are handled uniformly, especially in terms of the web interface. As a consequence, a middleware layer must be developed to transparently redirect requests to the Zoneminder instance(s) of the different virtual machines. In other words, both the standard (small and large cities) and the federated users must log onto the same page (Zoneminder Middleware - ZMM - login): if a standard user logs in, it is redirected to the corresponding Zoneminder instance(s) (which are mapped one-to-one to virtual machines, on either one or multiple physical machines); if a federated user logs in, multiple queries are issued to different Zoneminder instances, potentially also distributed geographically. In both cases, the views of multiple cameras must be perceptually the same, making the redirection by the middleware transparent. A sample snapshot of the developed views for federated users is shown in Fig. 3.
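A minimal sketch of how such a middleware could dispatch logins. The user registry, instance names and user types below are invented for illustration and are not the actual ZMM implementation:

```python
STANDARD = "standard"
FEDERATED = "federated"

# Hypothetical registry: each standard user maps to exactly one Zoneminder
# instance (one VM); a federated user fans out to all member instances.
USERS = {
    "town_a": {"type": STANDARD, "instances": ["zm-vm-01"]},
    "police_dept": {"type": FEDERATED,
                    "instances": ["zm-vm-01", "zm-vm-02", "zm-vm-07"]},
}

def route_request(user, query):
    """Return the (instance, query) pairs the middleware issues."""
    profile = USERS[user]
    if profile["type"] == STANDARD:
        # Plain redirect: the user is served by its own instance.
        return [(profile["instances"][0], query)]
    # Federated login: replicate the query to every member instance,
    # which may be geographically distributed.
    return [(inst, query) for inst in profile["instances"]]
```

Because both user types go through the same entry point, the views returned to the browser can be rendered identically regardless of how many instances answered.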

Finally, the project has a secondary objective, which is to also include some Video Analysis functionalities, such as motion detection, object tracking, trajectory analysis, license plate recognition, abandoned package detection, and many others. Without digging into details, this module will be developed by merging existing solutions


Fig. 2 Simplified middleware scheme of the ViSERaS video management system.

Fig. 3 An example of the Video Management interface based on Zoneminder.

provided by both IBM and Imagelab itself. This merge will be made possible by using DirectShow(TM) technology. The overall architecture of the IBM Video Analysis module is shown in Fig. 4.

3 THIS - Transport Hub Intelligent video Systems

The THIS project addresses automatic behavioral analysis through video processing, focused on crowded scenarios such as transportation hubs. A system performing human behavioral analysis, distinguishing what is usual from what is not, would fill the gap and provide a reactive, and hopefully pro-active, control task, preventing terroristic attacks or crime situations in public places. In order to learn what is normal or not, we propose to use statistical inference enriched with contextual information. For example, "starting to run" in an exit zone could be abnormal and suspicious, but it becomes normal if the person is trying to reach a closing gate.
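The gate example can be captured by a toy contextual rule, kept deliberately simple here; the project itself relies on statistical inference rather than hand-written rules like this one:

```python
def is_suspicious(action, zone, context):
    """Context-aware toy rule: running in an exit zone is flagged,
    unless the context explains it (e.g. a gate about to close)."""
    if action == "running" and zone == "exit":
        return not context.get("gate_closing", False)
    return False
```

The point of the sketch is that the same observation ("running", "exit") flips its label depending on contextual state, which is exactly what a learned model must capture.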


Fig. 4 Overall tentative architecture of the Video Analysis module by IBM (Courtesy of IBM).

We propose to apply the "learn-and-predict" paradigm by modeling the normal activity in the hub with tools for automatic and semi-automatic classification and annotation, applying innovative methods of people tracking in crowds, and statistical patterns of activity recognition.

The new tools will be integrated into existing solutions and included in available CCTV systems, without the need to redesign the video surveillance systems already installed in different scenarios, e.g. airports, harbors or railway stations.

The system should be modular and easy to configure and update. In fact, the surveillance of wide areas with several connected cameras integrated in the same automatic system is no longer a chimera, but modular, scalable and flexible architectures are mandatory to manage them.

In addition to the proposal of new specialized methods and techniques, one of the main issues faced during the project is the architectural design of distributed surveillance systems; in particular, we propose an integrated framework suitable for research purposes [5]. Exploiting a computer architecture analogy, a three-layer tracking system forms the foundation of the framework, coping with the integration of both overlapping and non-overlapping cameras. Then, a static service-oriented architecture is adopted to collect and manage the plethora of high-level modules, such as face detection and recognition, posture and action classification, and so on. Finally, the overall architecture is controlled by an event-driven communication infrastructure, which assures the scalability and flexibility of the system. A schema of the complete framework is depicted in Fig. 5.
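The event-driven layer can be sketched as a tiny publish/subscribe bus. The topic names and handler protocol below are illustrative assumptions, not the actual API of the framework in [5]:

```python
class EventBus:
    """Minimal publish/subscribe bus, sketching an event-driven
    communication infrastructure between surveillance modules."""

    def __init__(self):
        self._subscribers = {}

    def subscribe(self, topic, handler):
        self._subscribers.setdefault(topic, []).append(handler)

    def publish(self, topic, event):
        # Deliver the event to every module subscribed to the topic.
        for handler in self._subscribers.get(topic, []):
            handler(event)

# e.g. a hypothetical face-recognition module reacting to new tracks:
bus = EventBus()
bus.subscribe("track.new", lambda ev: print("analyze track", ev["id"]))
bus.publish("track.new", {"id": 7})
```

Decoupling producers (trackers) from consumers (high-level modules) through topics is what lets such a system scale by adding modules without touching the tracking layer.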

Moreover, the main results achieved by Imagelab during the first year of the project mainly regard:


Fig. 5 SOA-EDA research platform [5]

• Background initialization and modeling. We proposed a new and fast technique for background estimation from cluttered image sequences. Most of the background initialization approaches developed so far collect a number of initial frames and then require a slow estimation step, which introduces a delay whenever it is applied. Conversely, the proposed technique redistributes the computational load among all the frames by means of a patch-by-patch preprocessing, which makes the overall algorithm more suitable for real-time applications [2].

• AD-HOC - Single-camera people tracking. AD-HOC (Appearance Driven Human tracking with Occlusion Classification) [6] is a complete framework for multiple people tracking in video surveillance applications in the presence of large occlusions. The appearance-based approach allows the estimation of the pixel-wise shape of each tracked person even during occlusions. This peculiarity can be very useful for higher-level processes, such as action recognition or event detection. A first step predicts the position of all the objects in the new frame, while a MAP framework provides a solution for the best placement. A second step associates each candidate foreground pixel to an object according to mutual object position and color similarity. A novel definition of non-visible regions accounts for the parts of the objects that are not detected in the current frame, classifying them as dynamic, scene or apparent occlusions.

• Online action classification. Hidden Markov Models (HMMs) have been widely used for action recognition, since they allow to easily model the temporal evolution of a single numeric feature or of a set of numeric features extracted from the data. The selection of the feature set and of the related emission probability function are the key issues to be defined. In particular, if the training set is not sufficiently large, a manual or automatic feature selection and reduction is mandatory. The method developed during the project models the emission probability function as a Mixture of Gaussians, and the feature set is obtained from the projection histograms of the foreground mask. Imagelab participated as a THIS partner in the ICPR 2010 contest on action recognition [4].

Fig. 6 (a) a human 3D model, (b) average silhouettes used for the model creation, (c) our simplified human model, (d) the vertices sampling used in our tests

• People re-identification with 3D body models. People's appearance is the most useful source of information when we need to match images of people captured by spatially or temporally disjoint cameras, i.e., when geometrical relations are not available. Even if partially solved using region-based features, one of the main limitations of the available solutions for people re-identification is the dependency on the point of view: for example, the specific location of characteristic patterns is usually lost and cannot be used for people matching. Thus, we propose to create a simplified 3D body model which allows us to map appearance features to their 3D location on the body model [1]. More details on this innovative approach are reported in the next subsection.
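The projection-histogram features used as HMM observations in the action-classification result above can be sketched as follows; the normalization below is an assumption for illustration, not necessarily the one used in [4]:

```python
def projection_histograms(mask):
    """Row-wise and column-wise projection histograms of a binary
    foreground mask (a list of rows of 0/1 values), normalized by
    the total foreground area."""
    rows = [sum(row) for row in mask]        # horizontal projection
    cols = [sum(col) for col in zip(*mask)]  # vertical projection
    total = float(sum(rows)) or 1.0          # avoid division by zero
    return ([v / total for v in rows], [v / total for v in cols])
```

Concatenating the two normalized histograms gives a compact, silhouette-shape descriptor per frame, suitable as the observation vector of an HMM.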

3.1 People re-identification using 3D body models

First of all, we defined a generic 3D model. Differently from motion capture or action recognition systems, we are not interested in the precise location and pose of each body part, but we need to correctly map and store the person's appearance. Instead of an articulated body model (as in Fig. 6(a)), we propose a new monolithic 3D model, called "Sarc3D". The model construction has been driven by real data: side, frontal and top views of generic people were extracted from various surveillance videos; then, for each view, an average silhouette was computed and used for the creation of a graphical 3D body model (see Fig. 6(b)), producing a sarcophagus-like (Fig. 6(c)) body hull. The final body model is a set of vertices regularly sampled from the sarcophagus surface. The number of sampled vertices can be selected according to the required resolution. In our tests on real surveillance setups, we used from 153 to 628 vertices (Fig. 6(d)). Other sampling densities of the same surface have been tested, but the selected ones outperformed the others in specificity and precision tests, and are a good trade-off between speed and efficacy. As anticipated, the model does not feature limbs or other specific details, and even if its final aspect is somewhat unrealistic, it allows an easier alignment and, most importantly, the mapping of visual features onto specific body points.

As a representative signature, we created an instance of the generic model for each detected person, characterized by a scale factor (to cope with different body builds) and relating appearance information (i.e., color and texture) to each vertex.

The 3D placement of the model in the real scene is obtained from the output of the AD-HOC tracking system previously described. Assuming a vertical standing position of the person, the challenging problem to solve is the estimation of his/her horizontal orientation. To this aim, we consider that people move forward, and thus we exploit the trajectory on the ground plane to obtain a first approximation using a sliding-window approach. In Fig. 7(a) and 7(b), a sample frame and the corresponding model placement and orientation are shown. In particular, the sample positions used for the curve fitting and orientation estimation are highlighted.
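A minimal version of the trajectory-based orientation estimate might look like this, assuming the person faces the walking direction; the actual system fits a curve over the sliding window, whereas this sketch simply takes the chord of the recent positions:

```python
import math

def heading_from_trajectory(points, window=5):
    """Approximate the horizontal body orientation from the last few
    ground-plane positions (x, y): the angle of the chord joining the
    oldest and newest points in the sliding window, in radians."""
    recent = points[-window:]
    dx = recent[-1][0] - recent[0][0]
    dy = recent[-1][1] - recent[0][1]
    return math.atan2(dy, dx)
```

Averaging or fitting over a window, rather than using the last two points only, smooths the jitter of frame-by-frame position estimates.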

Given the 3D placement and orientation, the appearance part of the model can be recovered from the 2D frames by projecting each vertex onto the camera image plane. The steps are depicted in Fig. 7(c).
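Under a simple pinhole camera model, the projection of a model vertex expressed in camera coordinates reduces to the following; the calibration values used in the example are illustrative, not those of the project setup:

```python
def project_vertex(vertex, focal, center):
    """Pinhole projection of a 3D vertex (camera coordinates, z > 0)
    onto the image plane; the color sampled at (u, v) then initializes
    or updates the vertex appearance."""
    x, y, z = vertex
    u = center[0] + focal * x / z
    v = center[1] + focal * y / z
    return (u, v)

# A vertex 4 units in front of the camera, offset right and down:
print(project_vertex((1.0, 2.0, 4.0), focal=100.0, center=(320.0, 240.0)))
# → (345.0, 290.0)
```

Vertices whose projection falls outside the person's silhouette are simply skipped, so occluded or out-of-view body parts never pollute the signature.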


Fig. 7 (a) A frame from a video, (b) automatic 3D positioning and orientation, (c) initialization of the 3D model of a person: model-to-image alignment, projection of the model vertices onto the image plane, vertex initialization or update

If multiple cameras are available, or if the short-term tracking system provides more detections of the same object, the 3D model can integrate all the available frames. For each of them, after the alignment step, a new feature vector is computed for each vertex successfully projected inside the silhouette of the person. Figure 8 shows some sample models created from one or two images.
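Per-vertex integration over multiple frames can be as simple as a running mean of the observed feature. The scalar feature below is purely for illustration; the actual system stores richer descriptors such as local color histograms:

```python
def update_vertex(vertex, observation):
    """Fold a new appearance observation into a vertex descriptor as a
    running mean over all frames where the vertex was visible."""
    n = vertex["count"]
    vertex["value"] = (vertex["value"] * n + observation) / (n + 1)
    vertex["count"] = n + 1
    return vertex
```

Keeping a per-vertex observation count also records which parts of the body have actually been seen, which matters when comparing two partially observed models.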

Since one of the main applications of the Sarc3D model is people re-identification, we defined the distance measure between two models as the mean distance over all the corresponding vertices.
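With scalar per-vertex features, this measure reduces to the sketch below; marking never-observed vertices as None and skipping them is an assumption about how unseen body parts are handled:

```python
def model_distance(model_a, model_b):
    """Mean per-vertex distance between two model signatures, given as
    index-aligned lists of scalar vertex features (None = vertex never
    observed). Only vertices observed in both models are compared."""
    pairs = [(a, b) for a, b in zip(model_a, model_b)
             if a is not None and b is not None]
    if not pairs:
        return float("inf")  # no common evidence: treat as maximally far
    return sum(abs(a - b) for a, b in pairs) / len(pairs)
```

Matching identities then amounts to a nearest-neighbor search with this distance over the gallery of stored models.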

Fig. 8 Various models created and the corresponding source images

Many experiments have been carried out, on real videos and on our benchmarking suite. Since its introduction, ViPER [3] has been the reference dataset for re-identification problems. Unfortunately, it contains only two images for each target. Thus, we propose a suitable benchmark dataset with 50 people, consisting of short video clips captured with a calibrated camera. The annotated dataset is composed of four views for each person, 200 snapshots in total.

As mentioned above, we assumed to have a sufficiently accurate tracking system, which gives the 2D foreground images used for the model alignment. However, the proposed method is reliable and robust enough even in the case of approximate alignments. The use of a generic sarcophagus-like model and of local color histograms, instead of detailed 3D models and point-wise colors, goes precisely in this direction and allows coping with small alignment errors.

Acknowledgements The THIS project is supported by the Prevention, Preparedness and Consequence Management of Terrorism and other Security-related Risks Programme, European Commission - Directorate-General Justice, Freedom and Security. The ViSERaS project is funded by the company Lepida SpA.

References

1. Baltieri, D., Vezzani, R., Cucchiara, R.: 3D body model construction and matching for real-time people re-identification. In: Proceedings of the Eurographics Italian Chapter Conference 2010 (EG-IT 2010). Genova, Italy (2010)

2. Baltieri, D., Vezzani, R., Cucchiara, R.: Fast background initialization with recursive Hadamard transform. In: Proceedings of the 7th IEEE International Conference on Advanced Video and Signal-Based Surveillance, pp. 165–171. Boston, USA (2010)

3. Gray, D., Brennan, S., Tao, H.: Evaluating appearance models for recognition, reacquisition, and tracking. In: Proc. of PETS 2007 (2007)

4. Vezzani, R., Baltieri, D., Cucchiara, R.: HMM based action recognition with projection histogram features. In: Proceedings of the ICPR 2010 Contests, pp. 286–293. Istanbul, Turkey (2010)

5. Vezzani, R., Cucchiara, R.: Event driven software architecture for multi-camera and distributed surveillance research systems. In: Proceedings of the First IEEE Workshop on Camera Networks - CVPRW, pp. 1–8. San Francisco (2010)

6. Vezzani, R., Grana, C., Cucchiara, R.: Probabilistic people tracking with appearance models and occlusion classification: the AD-HOC system. Pattern Recognition Letters 32(6), 867–877 (2011)