
STUDENT UNDERGRADUATE RESEARCH AWARD(SURA), 2009

Progress report for the project titled

Intelligent Video Surveillance System- B

Submitted by
Saurabh Gupta

(Entry No. 2007CS10185)

Under the guidance of:
Professor Subhashis Banerjee

Computer Science and Engineering

Department of Computer Science and Engineering
Indian Institute of Technology Delhi

October 2009


Certificate

This is to certify that the project titled Intelligent Video Surveillance System- B is a bonafide work done by Saurabh Gupta (Entry: 2007CS10185) as part of the Summer Undergraduate Research Award, 2009 in the Department of Computer Science and Engineering at Indian Institute of Technology, Delhi. This project was carried out by him under my guidance and has not been submitted elsewhere.

Professor Subhashis Banerjee
Department of Computer Science and Engineering
Indian Institute of Technology, Delhi
New Delhi, India


Acknowledgement

I am eternally grateful and highly indebted to Professor Subhashis Banerjee for giving me this opportunity to work under him. His fraternal guidance, keen interest and educative discussions have been a cornerstone for the success of this project. His involvement with the project was a source of great motivation and ideas. I would also like to thank the Industrial Research and Development Unit for giving me this unique learning opportunity. I also express my sincere gratitude to the Department of Computer Science and Engineering for providing me with all the necessary facilities required for the completion of this project.

This project is part of a larger project done collectively by Ankit Sagwal, Ankit Narang and me. I thank them for keeping up my enthusiasm and for providing me with valuable suggestions. I also thank Ayesha Choudhary for the valuable discussions I have had with her and for the critical test data she provided. I also express my humble gratitude towards my parents and family for their unwavering trust in me and my abilities.

Saurabh Gupta


Contents

1 Introduction
  1.1 Motivation
    1.1.1 Why video surveillance?
    1.1.2 Why unsupervised?
  1.2 Objectives
  1.3 Exact Problem Statement
  1.4 Related Work

2 Theoretical Background
  2.1 What we mean by unsupervised and why is it possible?
  2.2 Background Subtraction
  2.3 Latent Semantic Indexing (LSI)
  2.4 Probabilistic Latent Semantic Analysis (pLSA)

3 Implementation
  3.1 Basic Technique
    3.1.1 How we apply pLSA in the context of videos?
    3.1.2 Choice of Words
  3.2 Approach followed
    3.2.1 Clustering of a given set of videos
  3.3 Other Activities undertaken

4 Results
  4.1 Single person activity analysis and classification
  4.2 Multi person activity classification

5 Future Work
  5.1 Unusual activity flagging for single person video
  5.2 Multi person activity analysis


1 Introduction

1.1 Motivation

1.1.1 Why video surveillance?

Security and surveillance are important issues in today's world. The recent acts of terrorism have highlighted the urgent need for efficient surveillance. Contemporary surveillance systems use digital video recording (DVR) cameras which play host to multiple channels. The major drawback with this model is that it requires continuous manual monitoring, which is infeasible because of factors like human fatigue and the cost of manual labour. Moreover, it is virtually impossible to search through recordings for important past events, since that would require a playback of the entire video footage. Hence, there is indeed a need for an automated video surveillance system which can detect unusual activities on its own.

1.1.2 Why unsupervised?

A system which needs to be programmed according to the location it is deployed in would incur large initial overheads at installation time. This overhead includes enumerating the kinds of activities that would happen in the area and then coming up with a Finite State Machine (FSM) model which accurately captures routine activities and flags non-routine ones. Clearly, this overhead is large and makes a programmed approach unsuitable for large scale deployment. Hence there is a need for an unsupervised video surveillance system which is able to learn routine activities on its own from unlabelled learning data. A system with self learning ability would be easy to deploy and would make large scale monitoring possible.

1.2 Objectives

The objectives for this project are as follows:

1. To design a model for unsupervised classification of videos of single person activities, and hence detect unusual activities.

2. To extend the approach for single person video classification to multi person activity classification.

1.3 Exact Problem Statement

Treating videos as documents, come up with an appropriate choice of words describing each video document which, on analysis using pLSA (probabilistic Latent Semantic Analysis), logically clusters the videos based on the activity happening in them.


1.4 Related Work

1. Unsupervised image classification

One of the first successful attempts at unsupervised classification in the area of computer vision was image classification. Sivic et al., in the paper titled Discovering Object Categories in Image Collections ([8]), used pLSA for unsupervised image classification. The concept was simple: features such as vector-quantized SIFT descriptors computed on affine covariant regions served as visual words describing the images. This choice of words, on being subjected to clustering using pLSA, classified images on the basis of the objects contained in them.

2. Attempts at supervised video classification

There have been numerous attempts at supervised learning based on programmed models of activities (as described in [6], [3]). These are fairly successful at identifying the particular activities they are programmed for, but are not deployable in a general scenario. They suffer from the disadvantage of non-scalability in deployment and installation (as mentioned in the motivation for unsupervised surveillance).

There have been other attempts at unsupervised activity classification (as described in [7], [4]). But these have primarily targeted specific attributes of activities, such as only trajectory or only shape. They provide a very sound model for unsupervised learning of the specific attribute they are designed for, but fail to give a generic model for overall activity analysis.

3. Unsupervised video analysis using pLSA and video epitomes for a single object

pLSA has been successfully used for unsupervised video analysis ([1]). This work used epitome subtraction to obtain space time patches as words for the video document, and catered only to single person activity analysis. Moreover, the use of video epitomes made it extremely computationally expensive, making online deployment difficult. The current project is essentially an extension of this work to multi person activity analysis, in a manner that is feasible for online deployment.

2 Theoretical Background

2.1 What we mean by unsupervised and why is it possible?

The notion of unsupervised learning means learning and deducing information from unclassified raw learning videos, and then using this information to classify any new query video into one of the classes that have been learnt.


In the context of this project, this implies looking at a large set of unclassified videos, learning patterns from this data set, and then classifying any new video that is provided as a query. The learning process is such that videos with similar activities are grouped together into the same class. A query video is put into the class with which it shares most of its features.

Any routine activity that comes up as a query will easily map to one of the learnt classes. An unusual activity, on the contrary, would not classify cleanly into any one cluster. This distinction can then be used to classify activities as usual or unusual.

Hence, the only supervision involved is feeding in learning data from which usual activity patterns are learnt; whatever is not usual gets classified as unusual, making such an unsupervised scheme for classification possible.

2.2 Background Subtraction

For any kind of computer vision processing it is important, and often the first step, to separate out objects from the image being viewed. This process is called background subtraction.

More formally, it is the process of removing static background features from an image. The features of an image that are not part of the background are called foreground objects. For example, if a camera is fixed at a traffic light, the vehicles serve as foreground objects, and features like the zebra crossing, lamp posts, etc. serve as background features. Consider, for example, the following original image frame and the corresponding foreground frame obtained after background subtraction.

Figure 1: On the left is the original frame; on the right, the foreground frame obtained after background subtraction.

Any background subtraction method needs to learn the background features in the current setting. This training can be done in various ways. The simplest is to let the system store a large number of frames (say 1000) and take a pixel-wise average of all of them. This average serves as the background for subsequent frames. The problem with such an approach is that once the training phase is over, the system fixes the background image and will not incorporate into the background any other stationary objects later introduced into the scene. Hence, a better approach is to carry on background training as the system runs.

An improved background model, which better caters to lighting and intensity changes in the scene along with dynamic learning, was proposed by Chris Stauffer and W. E. L. Grimson. This adaptive background mixture model represents each pixel as a mixture of Gaussians (in terms of RGB intensity) and measures each new frame's pixels against the background pixel Gaussians ([9]). This approach is more robust than the previous one. A further improved background modeling technique was proposed by Brendan Klare and Sudeep Sarkar (in [5]). They followed a similar approach of modeling pixels as Gaussians, but instead of Gaussians based only on RGB intensities, they used 13 Gaussians to represent one pixel. This technique handles illumination changes in the scene very efficiently and also adapts rapidly if the background features are altered.

In the rest of the report we refer to foreground frames. By this we mean a binary image that is white (or 1) wherever there is a foreground object and black (or 0) everywhere else.
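As a concrete illustration of the simple frame-averaging scheme described above, the following Python/NumPy sketch learns a background by pixel-wise averaging and thresholds the difference to produce a binary foreground frame. The function names and the threshold value are illustrative assumptions, not the project's actual implementation (which used OpenCV's GMM-based module).

```python
import numpy as np

def learn_background(frames):
    # Average a stack of training frames pixel-wise to estimate the static background.
    return np.mean(frames, axis=0)

def foreground_mask(frame, background, thresh=30):
    # A pixel is foreground if it differs from the background by more than `thresh`.
    diff = np.abs(frame.astype(np.int32) - background.astype(np.int32))
    return (diff > thresh).astype(np.uint8)  # binary image: 1 = foreground, 0 = background

# Toy example: a static all-black 4x4 scene with one bright "object" in the new frame.
train = np.zeros((10, 4, 4), dtype=np.uint8)   # 10 training frames
frame = train[0].copy()
frame[1, 2] = 200                               # a bright foreground pixel
mask = foreground_mask(frame, learn_background(train))
```

As the report notes, a fixed average like this cannot absorb newly stationary objects; the adaptive mixture models of [9] and [5] update the background continuously instead.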

2.3 Latent Semantic Indexing (LSI)

Latent Semantic Indexing (LSI) is an indexing and retrieval method that uses a mathematical technique called Singular Value Decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text. LSI is based on the principle that words used in the same contexts tend to have similar meanings. A key feature of LSI is its ability to extract the conceptual content of a body of text by establishing associations between terms that occur in similar contexts. It is called Latent Semantic Indexing because of its ability to correlate semantically related terms that are latent in a collection of text.

This technique considers a document as a bag of words: it takes into account only the occurrence of words in a document, not their relative ordering. Hence a set of d documents over w words can be described as a d × w matrix A, with entry A_{i,j} giving the number of occurrences of word j in document i. LSI performs a Singular Value Decomposition on this document × word matrix to obtain a document-concept vector matrix U, a singular values matrix Σ, and a word-concept vector matrix V, which satisfy

A = U Σ V^T

U U^T = I

V V^T = I


A rank-k approximation is obtained by keeping only the k largest singular values from Σ and only the first k columns of U and V:

A_k = U_k Σ_k V_k^T    (1)

This is the rank-k approximation of A with the least 2-norm deviation from A over all rank-k approximations. The effect is to map the original sparse space to a more meaningful space of lower dimension, with the documents located at coordinates given by U_k Σ_k. The similarity of two documents is now measured by their closeness in this reduced k-dimensional space. Moreover, this dimension reduction effectively removes noise from the data.
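The rank-k truncation of equation (1) is easy to reproduce with a standard SVD routine. The NumPy sketch below uses a made-up 4 × 5 document × word count matrix (documents 0-1 and 2-3 use disjoint vocabularies) and computes both A_k and the reduced-space document coordinates U_k Σ_k; documents sharing vocabulary end up close together in the reduced space.

```python
import numpy as np

# Hypothetical document x word count matrix: two topic blocks with disjoint words.
A = np.array([[2, 1, 0, 0, 0],
              [1, 2, 0, 0, 0],
              [0, 0, 1, 2, 1],
              [0, 0, 2, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # keep only the top-k singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-k approximation in the 2-norm
coords = U[:, :k] * s[:k]                      # documents as points in the reduced space
```

In the reduced space, documents 0 and 1 collapse onto (nearly) the same point, while documents from the other topic block land far away, which is exactly the closeness notion used for similarity above.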

LSA exploits the observation that similar words occur in similar documents and similar documents contain similar words. Hence, it is easily able to learn synonymy (different words with similar meanings, from their simultaneous occurrence in documents) and polysemy (words with multiple meanings, from their occurrence in multiple different contexts) in a completely unsupervised manner.

2.4 Probabilistic Latent Semantic Analysis (pLSA)

Probabilistic Latent Semantic Analysis is a statistical technique for the analysis of two-mode and co-occurrence data, with applications in information retrieval and filtering, natural language processing, machine learning from text, and related areas. Compared to standard Latent Semantic Analysis, which stems from linear algebra and performs a Singular Value Decomposition of co-occurrence tables, this method is based on a mixture decomposition derived from a latent class model. This results in a more principled approach with a solid foundation in statistics.

Corresponding to coordinates in a reduced vector space in LSA, here we have the likelihood of a document belonging to a particular latent topic, and of how well a particular topic is explained by a given word. From probability theory we have

P(d, w) = P(d) Σ_{z∈Z} P(w|z) P(z|d)

The EM algorithm described in [2] computes P(w|z) and P(z|d) for a given number k of latent classes. The term P(z|d) describes how well document d explains the hidden topic z, and the term P(w|z) describes to what extent word w contributes to explaining topic z.

Hence, we obtain clusters wherein similar documents have a high probability of explaining the same hidden topic.
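A minimal sketch of the EM iteration for pLSA, following the update equations of [2]. The variable names, the fixed iteration count, and the toy count matrix in the test are illustrative assumptions, not the project's actual code.

```python
import numpy as np

def plsa(N, K, iters=100, seed=0):
    # N: (documents x words) count matrix; K: number of latent topics z.
    rng = np.random.default_rng(seed)
    D, W = N.shape
    Pw_z = rng.random((K, W)); Pw_z /= Pw_z.sum(axis=1, keepdims=True)   # P(w|z)
    Pz_d = rng.random((D, K)); Pz_d /= Pz_d.sum(axis=1, keepdims=True)   # P(z|d)
    for _ in range(iters):
        # E-step: P(z|d,w) is proportional to P(w|z) P(z|d)
        Pz_dw = Pz_d[:, :, None] * Pw_z[None, :, :]           # shape (D, K, W)
        Pz_dw /= Pz_dw.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate P(w|z) and P(z|d) from expected counts n(d,w) P(z|d,w)
        C = N[:, None, :] * Pz_dw
        Pw_z = C.sum(axis=0); Pw_z /= Pw_z.sum(axis=1, keepdims=True) + 1e-12
        Pz_d = C.sum(axis=2); Pz_d /= Pz_d.sum(axis=1, keepdims=True) + 1e-12
    return Pw_z, Pz_d
```

On a count matrix whose documents fall into two disjoint vocabulary blocks, the learnt P(z|d) assigns documents from the same block to the same dominant topic.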

Besides having all the features that LSI offers, it has the additional benefit of a stronger statistical foundation.

3 Implementation

3.1 Basic Technique

3.1.1 How we apply pLSA in the context of videos?

The basic technique we follow is to treat videos as text documents. We extract features from each video clip, which are treated as the words describing this video document. Hence we obtain a set of words, and a set of documents containing these words.

Next we apply pLSA to these documents in exactly the same way as is done for text documents, and obtain clusters. Any new video is then classified into one of the learnt clusters.

3.1.2 Choice of Words

The most important aspect for this scheme to work well is the choice of words describing the videos. The chosen words must capture sufficient information about the video; moreover, they must capture the information needed to produce the clusters we are looking for.

The choice of words we work with is space time patches. Consider the space time volume made by consecutive foreground frames (of, say, a 20 frame clip) stacked along the z-axis (the first image frame parallel to the XY plane at z = 0, the second at z = 1, and so on). Let (x, y, t) denote a point in this space, where (x, y) is a location in frame t of the 20 frame clip.

A patch P_{dx,dy,dt} in this space time volume is the region of volume dx × dy × dt contained in {(a, b, c) : a ∈ (x, x + dx), b ∈ (y, y + dy), c ∈ (t, t + dt)}. A patch P_{dx,dy,dt} is characterized completely by its starting coordinate (x, y, t).

A foreground patch is a patch having more than a fraction d of its pixels as part of the foreground. The space time volume of foreground frames will contain such foreground patches.

The words describing a video document are all the foreground patches contained in the space time volume of this video.

If S_D is the set of words for document D, then

S_D = {(x, y, t) : (x, y, t) is a foreground patch}    (2)
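Under the definitions above, extracting the word set S_D from a stack of binary foreground frames can be sketched as follows. This is a NumPy illustration; the patch dimensions, stride, and threshold fraction d are illustrative defaults, not the project's exact values.

```python
import numpy as np

def patch_words(volume, dx=5, dy=5, dt=2, d=0.5):
    # volume: binary (T, H, W) stack of foreground frames (1 = foreground pixel).
    # A word (x, y, t) is emitted when the dx*dy*dt patch anchored there has
    # more than a fraction d of its pixels in the foreground.
    T, H, W = volume.shape
    words = []
    for t in range(0, T - dt + 1, dt):
        for y in range(0, H - dy + 1, dy):
            for x in range(0, W - dx + 1, dx):
                patch = volume[t:t+dt, y:y+dy, x:x+dx]
                if patch.mean() > d:
                    words.append((x, y, t))
    return words
```

For a toy volume with a single fully-foreground block in the first two frames, only the patch anchored at the origin qualifies as a word.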

This choice of words was arrived at after trying out many different possibilities. It captures the information we require for clustering videos by the activities contained in them: the spatial information is carried by the x and y coordinates, and the flow of the action is captured by the frame number.

Different activities produce different patterns of (x, y, t) patches, while similar activities produce similar ones. Two instances of the same activity may not share exactly the same pattern, but over a large training data set a document will share patches with documents containing a similar activity, and hence get clustered with the group of documents containing that activity.

Thus this choice of words is able to give us the kind of clustering that we are looking for.

3.2 Approach followed

In principle, there are two phases in a generic machine learning solution: first, learning from a given data set, and then answering queries based on that learning.

Hence, even in the context of pLSA there is a learning phase and a classification phase (where queries are actually answered). The learning phase learns classes and their characteristics, and the classification phase gives the likelihood of a new document belonging to each of the classes learnt. If the likelihoods do not suggest that it belongs to any of the learnt classes, it is flagged as an unusual activity. There is sufficient theory for classifying a novel video into the already learnt classes, as described in [1].

So, for the purposes of this project we focus only on obtaining these clusters for a broad set of settings. The kind of clustering obtained depends largely on the choice of words used to capture the information contained in the document; hence an appropriate choice of words is essential for obtaining good clusters. We focus our effort on obtaining a set of words that yields a good logical clustering.

We describe below the process we followed to obtain clusters from a given set of video documents.

3.2.1 Clustering of a given set of videos

1. Dividing the video stream into short clips (our video documents)

The video stream is obtained as a regular sequence of frames. The entire video is divided into short 20 frame clips, each of which makes up one document. So the entire video gives us a set of documents which we need to cluster into classes based on their similarity.

2. Background subtraction to obtain foreground blobs (in OpenCV)

We then perform background subtraction on each clip to obtain a foreground frame corresponding to each video frame. The background subtraction module used is based on the Gaussian Mixture Model (as presented in [9]), which is noise resistant and adapts well to changes in the background. We use the standard OpenCV implementation of this algorithm with appropriate learning rate parameters. Also, we apply the GMM algorithm not on the raw RGB image but on a normalized RGB image, as this reduces the effects of shadows and intensity changes.
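The normalized RGB step in item 2 can be sketched as a simple chromaticity normalization (a NumPy illustration; the exact normalization used in the project may differ):

```python
import numpy as np

def normalize_rgb(frame):
    # Chromaticity normalization: divide each channel by the pixel's total intensity.
    # Shadows mostly scale intensity rather than colour, so they largely cancel out.
    frame = frame.astype(np.float64)
    total = frame.sum(axis=2, keepdims=True) + 1e-6   # avoid division by zero
    return frame / total
```

A pixel and its half-intensity "shadowed" version normalize to (almost) the same chromaticity vector, which is why the GMM sees fewer spurious foreground pixels on this representation.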

3. Obtaining words from foreground images (in Matlab)

As described above, a volume of size dx × dy × dt located at (x, y, t), with x, y ∈ {0, 5, 10, 15, ...} and t ∈ {0, 2, 4, ...}, having more than a fraction d of its pixels as foreground pixels is treated as a patch, and the word (x, y, t) is said to belong to this video document. This search is done over the entire space of (x, y, t), and valid words are added to the document.

4. Obtaining a dictionary of words contained in this document set (in Matlab)

Once the words describing each individual document are identified for all the training videos, we cluster this set of words to come up with a dictionary of the top k words common to all documents. The universe of words that we have defined is large and captures only a limited amount of spatial closeness in this (x, y, t) space. Clustering the words into k classes groups similar words (those relatively nearby, as indicated by their (x, y, t) coordinates) and captures common features across different documents.

Consider for example the set of words (5,10,1), (5,15,1), (10,10,1), (40,10,2), (45,15,2) and (50,10,3). Clustering this set to obtain the top 2 words would map (5,10,1), (5,15,1) and (10,10,1) to word 1, and (40,10,2), (45,15,2) and (50,10,3) to word 2. So two different documents containing the words (5,10,1) and (5,15,1) (which are very similar but not exactly the same) would be said to contain the same word, word 1. This captures common features across different documents.

pLSA treats all words as unrelated to one another (in fact, it tries to learn the relationships between them). Hence, without this step two documents would be classified as the same only if they were exactly the same. Even very similar documents would be classified as different, since the words they contain are similar but not identical (and pLSA sees them as completely different words).

This clustering was done using the kmeans function in Matlab.
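The dictionary-building step can be sketched as follows, with a small self-contained k-means in Python standing in for Matlab's kmeans. The farthest-point seeding is an assumption made here for determinism, not part of the original pipeline.

```python
import numpy as np

def build_dictionary(raw_words, k=2, iters=10):
    # raw_words: (x, y, t) triples pooled from all documents.
    pts = np.asarray(raw_words, dtype=float)
    # Deterministic seeding: start from the first point, then repeatedly add
    # the point farthest from all chosen centers (a k-means++-like init).
    centers = [pts[0]]
    for _ in range(k - 1):
        dist = np.min([((pts - c) ** 2).sum(-1) for c in centers], axis=0)
        centers.append(pts[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(iters):  # Lloyd iterations
        labels = np.argmin(((pts[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.array([pts[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return centers

def map_to_dictionary(word, centers):
    # Replace a raw (x, y, t) word by the index of its nearest dictionary word.
    return int(np.argmin(((np.asarray(word, float) - centers) ** 2).sum(-1)))
```

Run on the example word set from the text, the three left-hand words map to one dictionary index and the three right-hand words to the other, exactly the grouping described above.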

5. Generation of the document × word matrix (in Matlab)

Once the dictionary of words is obtained, the raw words in each document are mapped to one of the k common words from the dictionary, and a document × word matrix is obtained (the (i, j) entry in the matrix indicates the number of occurrences of word j in document i). This matrix is much the same as the one obtained for the classification of literary documents (as used by any web search engine).
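Building the document × word matrix from the mapped words is then just counting (a minimal Python sketch; the function name is illustrative):

```python
import numpy as np

def count_matrix(docs, k):
    # docs: list of documents, each a list of dictionary word indices in [0, k).
    # Returns the document x word matrix N with N[i, j] = occurrences of word j in doc i.
    N = np.zeros((len(docs), k), dtype=int)
    for i, doc in enumerate(docs):
        for w in doc:
            N[i, w] += 1
    return N
```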

6. Running pLSA on this document × word matrix to obtain video clusters (in Matlab)

The document × word matrix thus obtained is processed through pLSA to obtain n clusters. The algorithm gives us the probability of each document and each word belonging to a particular cluster. Similar documents and words are the ones that have a high probability of belonging to the same cluster.

3.3 Other Activities undertaken

Some other activities were undertaken in the course of the project to obtain insight into the techniques being used for learning. These are briefly described below.

1. Book title experiment - classification of books based on LSA and pLSA

In the process of understanding how exactly pLSA and LSA work, I carried out an experiment to classify books based on their titles. For this experiment, I collected the titles of many books on 3 different topics in computer science (from the department library). I then selected words from the titles to construct a dictionary (essentially eliminating function words like of, for, and, the, etc.), and a document × word matrix was obtained. When subjected to LSA and pLSA (independently), this clustered the books into 3 clusters, putting books on the same topic into exactly one cluster. I observed how these techniques were able to eliminate noise and correlate words with similar meanings based on their co-occurrence in documents. The coding for this was done in Perl and Matlab.

2. Evaluating pLSA performance on multi topic documents

When we come to multi person activity clustering, we need pLSA to learn from videos containing more than one activity. To evaluate how pLSA treats multi topic documents, I conducted an experiment giving pLSA documents pertaining to 2 topics. Some documents discussed only one of the topics, while some discussed both. On a 2-way clustering, pLSA returns clusters such that cluster 1 pertains to topic 1 and cluster 2 to topic 2, with the documents discussing both topics split in likelihood between the two clusters.

4 Results

4.1 Single person activity analysis and classification

The space time patch words work well for single person activity analysis. They are able to cluster videos by the activity contained in them, and bring out common trajectories and common areas of activity in an unsupervised manner. As an example, the clusters obtained on a training video in the Bharti Building corridor are shown in Figures 2-7.


Figure 2: Class 1 - Activity in the near end of corridor.

Figure 3: Class 2 - Cluster of activities in the far end of corridor.

Figure 4: Class 3 - Cluster of activities around the door.

Figure 5: Class 4 - Cluster of trajectories along the right edge of corridor.

Figure 6: Class 5 - Cluster of trajectories along the center of the corridor.

Figure 7: Class 6 - Cluster of trajectories along the left edge of the corridor.


4.2 Multi person activity classification

The space time patch approach on multi person videos forms classes such that each class corresponds to a single person activity happening in the video. Videos with a single activity in them are assigned to the corresponding class. Videos in which more than one activity is happening are given likelihoods of belonging to the classes of the single person activities happening in them.

pLSA is able to learn the single person activities happening in multi person videos, based on the observation that the words describing a particular single person activity occur as a group in different multi person videos (in much the same way as topic discovery happens in sets containing multi topic documents).

Consider for example the classes obtained on a learning video in the coffee shop area near the library.


Figure 8: Class 1 - Trajectories along the path in mid right.

Figure 9: Class 2 - Trajectories along the bottom edge.

Figure 10: Class 3 - Trajectories from the bottom left to the center of frame.

Figure 11: Class 4 - People turning from main building side to coffee shop.

Figure 12: Class 5 - Trajectories from left top corner to right top corner and vice versa.

Figure 13: A video with likelihoods for classes 1, 3 and 5.


5 Future Work

5.1 Unusual activity flagging for single person video

We have only learnt classes of videos from a given learning data set. Though novel video classification has been studied thoroughly in [1], we have not explicitly implemented it here. We intend to implement novel video classification based on this choice of words, and to start flagging unusual activities.

5.2 Multi person activity analysis

Though we have presented multi person activity classification using pLSA, we need to thoroughly investigate the suitability of this choice of words for multi person activity analysis.

Moreover, as mentioned, the likelihood of a video with multiple activities gets split among the classes describing the activities contained in it. This may pose a problem in flagging an unusual activity, because we rely on a confused match of the unusual word sequence with the already learnt classes. Hence, we need to investigate how to differentiate between videos with multiple activities and videos with unusual activities.

References

[1] Ayesha Choudhary, Manish Pal, Subhashis Banerjee, and Santanu Chaudhury. Unusual activity analysis using video epitomes and pLSA. In ICVGIP, pages 390-397, 2008.

[2] Thomas Hofmann. Probabilistic latent semantic analysis. In Proc. of Uncertainty in Artificial Intelligence, UAI'99, pages 289-296, 1999.

[3] S. Hongeng and R. Nevatia. Multi-agent event recognition. In Proceedings of the Eighth IEEE International Conference on Computer Vision (ICCV 2001), volume 2, pages 84-91, 2001.

[4] Fan Jiang, Ying Wu, and Aggelos K. Katsaggelos. A dynamic hierarchical clustering method for trajectory-based unusual video event detection. IEEE Trans. Image Process., 18:907-913, 2009.

[5] B. Klare and S. Sarkar. Background subtraction in varying illuminations using an ensemble based on an enlarged feature set. In Computer Vision and Pattern Recognition Workshops, pages 66-73, 2009.

[6] Dhruv Mahajan, Nipun Kwatra, Sumit Jain, Prem Kalra, and Subhashis Banerjee. A framework for activity recognition and detection of unusual activities. In ICVGIP, pages 15-21, 2004.

[7] C. Piciarelli, G.L. Foresti, and L. Snidaro. Trajectory clustering and its applications for video surveillance. In Advanced Video and Signal Based Surveillance, IEEE Conference on, pages 40-45, 2005.

[8] J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman. Discovering object categories in image collections. In Proceedings of the International Conference on Computer Vision, 2005.

[9] Chris Stauffer and W.E.L. Grimson. Adaptive background mixture models for real-time tracking. In CVPR, Fort Collins, CO, 1999.
