
Content Based Lecture Video Retrieval Using Speech and Video Text Information

Haojin Yang and Christoph Meinel, Member, IEEE

Abstract—In the last decade e-lecturing has become more and more popular. The amount of lecture video data on the World Wide Web (WWW) is growing rapidly. Therefore, a more efficient method for video retrieval in the WWW or within large lecture video archives is urgently needed. This paper presents an approach for automated video indexing and video search in large lecture video archives. First of all, we apply automatic video segmentation and key-frame detection to offer a visual guideline for video content navigation. Subsequently, we extract textual metadata by applying video Optical Character Recognition (OCR) technology on key-frames and Automatic Speech Recognition (ASR) on lecture audio tracks. The OCR and ASR transcripts as well as detected slide text line types are adopted for keyword extraction, by which both video- and segment-level keywords are extracted for content-based video browsing and search. The performance and the effectiveness of the proposed indexing functionalities are proven by evaluation.

Index Terms—Lecture videos, automatic video indexing, content-based video search, lecture video archives


1 INTRODUCTION

DIGITAL video has become a popular storage and exchange medium due to the rapid development in recording technology, improved video compression techniques and high-speed networks in the last few years. Therefore, audiovisual recordings are used more and more frequently in e-lecturing systems. A number of universities and research institutions are taking the opportunity to record their lectures and publish them online for students to access independent of time and location. As a result, there has been a huge increase in the amount of multimedia data on the Web. Therefore, for a user it is nearly impossible to find desired videos without a search function within a video archive. Even when the user has found related video data, it is still difficult most of the time for him to judge whether a video is useful by only glancing at the title and other global metadata, which are often brief and high level. Moreover, the requested information may be covered in only a few minutes, so the user might want to find the piece of information he requires without viewing the complete video. The problem thus becomes how to retrieve the appropriate information in a large lecture video archive more efficiently. Most video retrieval and video search systems such as YouTube, Bing and Vimeo rely on available textual metadata such as title, genre, person, and a brief description. Generally, this kind of metadata has to be created by a human to ensure high quality, but the creation step is rather time and cost consuming. Furthermore, the manually provided metadata is typically brief, high level and subjective. Therefore, beyond the current approaches, the next generation of video retrieval systems applies automatically generated metadata obtained through video analysis technologies. Much more content-based metadata can thus be generated, which leads to two research questions in the e-lecturing context:

- Can those metadata assist the learner in searching required lecture content more efficiently?

- If so, how can we extract the important metadata from lecture videos and provide hints to the user?

Based on these questions, we postulated the following hypothesis:

Hypothesis 1. The relevant metadata can be automatically gathered from lecture videos by using appropriate analysis techniques. They can help a user to find and to understand lecture contents more efficiently, and the learning effectiveness can thus be improved.

Traditional video retrieval based on visual feature extraction cannot be simply applied to lecture recordings because of the homogeneous scene composition of lecture videos. Fig. 1a shows an exemplary lecture video recorded in an outdated format produced by a single video camera. Varying factors may lower the quality of this format. For example, motion changes of the camera may affect the size, shape and brightness of the slide; the slide can be partially obstructed when the speaker moves in front of it; and any change of camera focus (switching between the speaker view and the slide view) may also affect the subsequent slide detection process.

Nowadays people tend to produce lecture videos in a multi-scene format (cf. Fig. 1b), in which the speaker and his presentation are displayed synchronously.

The authors are with the Hasso-Plattner-Institute for Software Systems Engineering GmbH (HPI), PO Box 900460, D-14440 Potsdam, Germany. E-mail: {haojin.yang, meinel}@hpi.uni-potsdam.de.

Manuscript received 5 Aug. 2013; revised 29 Nov. 2013; accepted 16 Jan. 2014. Date of publication 26 Feb. 2014; date of current version 2 July 2014. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TLT.2014.2307305




This can be achieved either by displaying a single video of the speaker and a synchronized slide file, or by applying a state-of-the-art lecture recording system such as the tele-Teaching Anywhere Solution Kit (tele-TASK).1

Fig. 1b illustrates an example of such a system, which delivers the two main parts of the lecture: the main scene of the lecturer, which is recorded by a video camera, and a second scene that captures the desktop of the speaker's computer (his presentation) during the lecture through a frame-grabber tool. The key benefit of the latter for a lecturer is flexibility. For the indexing, no extra synchronization between video and slide files is required, and we do not need to take care of the slide format. The main drawback is that the video analysis methods may introduce errors. Our research work mainly focuses on lecture videos produced using this screen-grabbing method. Since the two videos are synchronized automatically during the recording process, the temporal scope of a complete unique slide can be considered as a lecture segment. This way, segmenting two-scene lecture videos can be achieved by only processing the slide video stream, which contains most of the visual text metadata. The extracted slide frames can provide a visual guideline for video content navigation.

Text is a high-level semantic feature which has often been used for content-based information retrieval. In lecture videos, texts from lecture slides serve as an outline for the lecture and are very important for understanding. Therefore, after segmenting a video file into a set of key frames (all the unique slides with complete contents), the text detection procedure is executed on each key frame, and the extracted text objects are further used in text recognition and slide structure analysis processes. In particular, the extracted structural metadata can enable more flexible video browsing and video search functions.

Speech is one of the most important carriers of information in video lectures. Therefore, it is of distinct advantage if this information can be applied for automatic lecture video indexing. Unfortunately, most of the existing lecture speech recognition systems in the reviewed work cannot achieve a sufficient recognition result; the Word Error Rates (WERs) reported in [1], [2], [3], [4], [5] and [6] are approximately 40-85 percent. The poor recognition results not only limit the usability of the speech transcript, but also affect the efficiency of the further indexing process. In our research, we intend to continuously improve the ASR results for German lectures by building new speech training data based on an open-source ASR tool. However, in the open-source context there is a lack of methods for generating the German phonetic dictionary automatically, which is one of the most important parts of any ASR software. Therefore, we developed an automated procedure in order to fill this gap.

A large amount of textual metadata is created by using the OCR and ASR methods, which opens up the content of lecture videos. To enable reasonable access for the user, representative keywords are further extracted from the OCR and ASR results. For content-based video search, the search indices are created from different information resources, including manual annotations, OCR and ASR keywords, global metadata, etc. Here, the varying recognition accuracy of different analysis engines might result in solidity and consistency problems, which have not been considered in most related work (cf. Section 2.2). Therefore, we propose a new method for ranking keywords extracted from various information resources by using an extended Term Frequency Inverse Document Frequency (TFIDF) score [7]. The ranked keywords from both segment- and video-level can directly be used for video content browsing and video search. Furthermore, video similarity can be calculated by using the Cosine Similarity Measure [8] based on the extracted keywords.

In summary, the major contributions of this paper are the following:

- We extract metadata from visual as well as audio resources of lecture videos automatically by applying appropriate analysis techniques. For evaluation purposes we developed several automatic indexing functionalities in a large lecture video portal, which can guide both visually- and text-oriented users to navigate within lecture videos. We conducted a user study intended to verify the research hypothesis and to investigate the usability and the effectiveness of the proposed video indexing features.

- For visual analysis, we propose a new method for slide video segmentation and apply video OCR to gather text metadata. Furthermore, the lecture outline is extracted from OCR transcripts by using stroke width and geometric information. A more flexible search function has been developed based on the structured video text.

Fig. 1. (a) An example of the outdated lecture video format. (b) An exemplary lecture video. Video 1 shows the professor giving his lecture, whereas his presentation is played in video 2.

1. The tele-TASK system was initially designed in 2002 at the University of Trier. Today, 2,000 people (unique visits) around the world visit the tele-TASK lecture video portal (www.tele-task.de) weekly, with more than 4,800 lectures and 14,000 podcasts available free of charge via the internet.




- We propose a solution for automatic German phonetic dictionary generation, which fills a gap in the open-source ASR domain. The dictionary software and the compiled speech corpus are provided for further research use.

- In order to overcome the solidity and consistency problems of a content-based video search system, we propose a keyword ranking method for multimodal information resources. In order to evaluate its usability, we implemented this approach in a large lecture video portal.

- The developed video analysis methods have been evaluated by using compiled test data sets as well as open benchmarks. All compiled test sets are publicly available from our website for further research use.

The rest of the paper is organized as follows: Section 2 reviews related work in the lecture video retrieval and content-based video search domain. Section 3 describes our automatic video indexing methods. A content-based lecture video search engine using multimodal information resources is introduced in Section 4, and the user study is presented in Section 5. Finally, Section 6 concludes the paper with an outlook on future work.

2 RELATED WORK

Information retrieval in the multimedia-based learning domain is an active and multidisciplinary research area. Video texts, spoken language, community tagging, manual annotations, video actions, or gestures of speakers can act as the source to open up the content of lectures.

2.1 Lecture Video Retrieval

Wang et al. proposed an approach for lecture video indexing based on automated video segmentation and OCR analysis [9]. The segmentation algorithm in their work is based on the differential ratio of text and background regions; using thresholds, they attempt to capture the slide transitions. The final segmentation results are determined by synchronizing detected slide key-frames with related text books, where the text similarity between them is calculated as an indicator. Grcar et al. introduced VideoLectures.net in [10], which is a digital archive for multimedia presentations. Similar to [9], the authors also apply a synchronization process between the recorded lecture video and the slide file, which has to be provided by the presenters. Our system contrasts with these two approaches since it directly analyzes the video and is thus independent of any hardware or presentation technology. A constrained slide format and the synchronization with an external document are not required. Furthermore, since animated content evolvement is often applied in slides but has not been considered in [9] and [10], their systems might not work robustly when those effects occur in the lecture video. In [9], the final segmentation result is strongly dependent on the quality of the OCR result; it might be less efficient and imply redundancies when only a poor OCR result is obtained.

Tuna et al. presented their approach for lecture video indexing and search in [11]. They segment lecture videos into key frames by using global frame differencing metrics. Then standard OCR software is applied for gathering textual metadata from the slide streams, for which they utilize some image transformation techniques to improve the OCR result. They developed a new video player in which the indexing, search and captioning processes are integrated. Similar to [9], the global differencing metrics used cannot give a sufficient segmentation result when animations or content build-ups are used in the slides; in that case, many redundant segments will be created. Moreover, the applied image transformations might still not be efficient enough for recognizing frames with complex content and background distributions. Making use of text detection and segmentation procedures could achieve much better results than applying image transformations.

Jeong et al. proposed a lecture video segmentation method using the Scale Invariant Feature Transform (SIFT) feature and an adaptive threshold in [12]. In their work, the SIFT feature is applied to measure slides with similar content, and an adaptive threshold selection algorithm is used to detect slide transitions. In their evaluation, this approach achieved promising results for processing one-scene lecture videos as illustrated in Fig. 1a.

Recently, collaborative tagging has become a popular functionality in lecture video portals. Sack and Waitelonis [13] and Moritz et al. [14] apply tagging data for lecture video retrieval and video search. Beyond keyword-based tagging, Yu et al. proposed an approach to annotate lecture video resources by using Linked Data. Their framework enables users to semantically annotate videos using vocabularies defined in the Linked Data cloud. Those semantically linked educational resources are then further adopted in the video browsing and video recommendation procedures. However, the effort and cost of such user annotation-based approaches cannot satisfy the requirements for processing the rapidly increasing amount of web video data; here, automatic analysis is no doubt much more suitable. Nevertheless, using Linked Data to further automatically annotate the extracted textual metadata opens a future research direction.

ASR provides speech-to-text information on spoken language, which is thus well suited for content-based lecture video retrieval. The studies described in [5] and [15] are based on out-of-the-box commercial speech recognition software. With such commercial software, an adaptation process is often required to achieve satisfying results for a special working domain, but custom extension is rarely possible. The authors of [1] and [6] focus on English speech recognition for Technology Entertainment and Design (TED) lecture videos and webcasts. In their system, the training dictionary is created manually, which is thus hard to extend or optimize periodically. Glass et al. proposed a solution for improving ASR results of English lectures by collecting new speech data from the rough lecture audio data [3]. Inspired by their work, we developed an approach for creating speech data from German lecture videos. Haubold and Kender focus on multi-speaker presentation videos; in their work, speaker changes can be detected by applying a speech analysis method.



A topic phrase extraction algorithm for highly imperfect ASR results (WER ≈ 75 percent) has been proposed, which uses lecture course-related sources such as text books and slide files [4].

Overall, most of these lecture speech recognition systems have a low recognition rate; the WERs of audio lectures are approximately 40-85 percent. The poor recognition results limit the further indexing efficiency. Therefore, how to continuously improve ASR accuracy for lecture videos is still an unsolved problem.

Speaker-gesture-based information retrieval for lecture videos has been studied in [16]. The author equipped the lecture speaker with special gloves that enable the automatic detection and evaluation of gestures. The experimental results show that 12 percent of the lecture topic boundaries were correctly detected using speaker gestures. However, those gesture features are highly dependent on the characteristics of speakers and topics. They might be of limited use in large lecture video archives with massive amounts of speakers.

2.2 Content-Based Video Search

Several content-based video search engines have been proposed recently. Adcock et al. proposed a lecture webcast search system [17], in which they applied a slide frame segmenter to extract lecture slide images. The system retrieved more than 37,000 lecture videos from different resources such as YouTube, Berkeley Webcast, etc. The search indices are created based on the global metadata obtained from the video hosting website and texts extracted from slide videos by using a standard OCR engine. Since they do not apply text detection and text segmentation processes, the OCR recognition accuracy of their approach is lower than our system's. Furthermore, by applying the text detection process we are able to extract structured text lines such as title, subtitle, key-point, etc., which enables a more flexible search function. In the CONTENTUS project [18], a content-based semantic multimedia retrieval system has been developed. After the digitization of media data, several analysis techniques, e.g., OCR, ASR, video segmentation, automated speaker recognition, etc., are applied for metadata generation. An entity recognition algorithm and an open knowledge base are used to extract entities from the textual metadata. As mentioned before, when searching through recognition results with a degree of confidence, we have to deal with the solidity and consistency problem; however, the reviewed content-based video search systems did not consider this issue.

3 AUTOMATED LECTURE VIDEO INDEXING

In this section we present four analysis processes for retrieving relevant metadata from the two main parts of a lecture video, namely the visual screen and the audio track. From the visual screen we first detect the slide transitions and extract each unique slide frame, whose temporal scope is considered as a video segment. Then video OCR analysis is performed to retrieve textual metadata from the slide frames. Based on the OCR results, we propose a novel solution for lecture outline extraction using stroke width and geometric information of the detected text lines. For the speech-to-text analysis we applied the open-source ASR software CMU Sphinx.2 To build the acoustic and language models, we collected speech training data from open-source corpora and our lecture videos. As already mentioned, the open-source context lacks a method for creating the German phonetic dictionary automatically. We thus developed a solution to fill this gap and made it available for further research use.

3.1 Slide Video Segmentation

Video browsing can be achieved by segmenting a video into representative key frames. The selected key frames can provide a visual guideline for navigation in the lecture video portal. Moreover, video segmentation and key-frame selection are also often adopted as preprocessing for other analysis tasks such as video OCR, visual concept detection, etc.

Choosing a suitable segmentation method is based on the definition of a “video segment” and usually depends on the genre of the video. In the lecture video domain, the video sequence of an individual lecture topic or subtopic is often considered as a video segment. This can be roughly determined by analyzing the temporal scope of lecture slides. Many approaches (e.g., [17], [11]) make use of global pixel-level differencing metrics for capturing slide transitions. A drawback of this kind of approach is that the salt-and-pepper noise of the video signal can affect the segmentation accuracy. After observing the content of lecture slides, we realized that the major content, e.g., text lines, figures, tables, etc., can be considered as Connected Components (CCs). We therefore propose to use CCs instead of pixels as the basic element for the differencing analysis; we call this the component-level differencing metric. This way we are able to control the valid size of a CC, so that salt-and-pepper noise can be excluded from the differencing process. For creating CCs from binary images, we apply the algorithm from [19], which demonstrated an excellent performance advantage. Another benefit of our segmentation method is its robustness to animated content and progressive build-ups used within lecture slides: only the most complete unique slides are captured as video segments, whereas those effects affect most of the lecture video segmentation methods mentioned in Section 2.
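As an illustration of the component-level differencing metric, the following sketch (assuming OpenCV as the image-processing library, which the paper does not specify) builds Canny edge maps of two adjacent frames, forms their differential image and counts connected components above a minimum area, so that salt-and-pepper noise is rejected; all function names and numeric values are illustrative, not the authors' implementation.

```python
import cv2

def cc_difference(frame_a, frame_b, min_cc_area=30):
    """Component-level differencing between two adjacent video frames.

    Returns the number of connected components in the differential edge
    map whose area exceeds min_cc_area; small components (salt-and-pepper
    noise) are ignored. Thresholds here are illustrative.
    """
    gray_a = cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY)

    # Canny edge maps of both frames and their pixel differential image.
    edges_a = cv2.Canny(gray_a, 100, 200)
    edges_b = cv2.Canny(gray_b, 100, 200)
    diff = cv2.absdiff(edges_a, edges_b)

    # Connected component analysis on the differential image.
    n_labels, _, stats, _ = cv2.connectedComponentsWithStats(diff, connectivity=8)

    # Label 0 is the background; count only components of a valid size.
    return sum(1 for i in range(1, n_labels)
               if stats[i, cv2.CC_STAT_AREA] >= min_cc_area)

# A new segment is captured when the count exceeds the threshold Ts1 (cf. below):
# is_new_segment = cc_difference(prev_frame, cur_frame) > 20
```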

Our segmentation method consists of two steps (cf. Fig. 3):

- In the first step, the entire slide video is analyzed. We try to capture every knowledge change between adjacent frames, for which we established an analysis interval of three seconds by taking both accuracy and efficiency into account. This means that segments with a duration shorter than three seconds may be discarded by our system; since there are very few topic segments shorter than three seconds, this setting is not critical. We then create Canny edge maps for adjacent frames and build the pixel differential image from the edge maps. The CC analysis is subsequently performed on this differential image, and the number of CCs is used as a threshold for the segmentation: a new segment is captured if the number exceeds Ts1.

2. http://cmusphinx.sourceforge.net/.



We establish a relatively small Ts1 to ensure that each newly emerging knowledge point will be captured. Obviously, this first segmentation step is sensitive to animations and progressive build-ups, and its result is thus too redundant for video indexing. Hence, the process continues with a second segmentation step based on the frames obtained in the first step.

- In the second segmentation step the actual slide transitions are captured. The title and content regions of a slide frame are first defined. We established the content distribution of commonly used slide styles by analyzing a large amount of lecture videos in our database. Let Rt and Rc denote the title and content regions, which account for 23 and 70 percent of the entire frame height, respectively. In Rt we apply the CC-based differencing described above with a small threshold value of 1 for capturing slide transitions; here, any small change within the title region may indicate a slide transition (for instance, two slides often differ from each other only in a single chapter number). If no difference is found in Rt, we try to detect the first and the last bounding-box object in Rc vertically and perform the CC-based differencing within the object regions of two adjacent frames (cf. Fig. 2). If the difference values of both object regions between adjacent frames exceed the threshold Ts2, a slide transition is captured. For detecting content progressive build-ups horizontally, the process can be repeated in a similar manner. In our experiments, Ts1 = 20 and Ts2 = 5 have proven to serve best for our training data; the exact setting of these parameters is not critical.

- Since the proposed method is designed for segmenting slide videos, it might not be suitable when videos of varying genres are embedded in the slides and played during the presentation. To solve this problem we have extended the original algorithm by using a Support Vector Machine (SVM) classifier and image intensity histogram features, with the Radial Basis Function (RBF) as kernel (a minimal sketch of such a classifier follows this list). In order to make a comparison, we also applied the Histogram of Oriented Gradients (HOG) feature [20] in the experiment; we adapted the HOG feature with eight gradient directions, whereas the local region size was set to 64 × 64. Moreover, with this extension our video segmentation method is also suitable for processing one-screen lecture videos with frequent switches between slide and speaker scenes.
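A minimal sketch of such a slide/non-slide frame classifier, assuming OpenCV and scikit-learn (libraries the paper does not name) and an illustrative parameter grid; it computes the normalized 256-bin grayscale intensity histogram described in Section 3.1.1 and trains an RBF-kernel SVM via grid search.

```python
import cv2
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

def intensity_histogram(frame_bgr, norm_factor=1000.0):
    """256-bin grayscale intensity histogram, scaled so the bins sum to norm_factor."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    hist = cv2.calcHist([gray], [0], None, [256], [0, 256]).flatten()
    return hist / hist.sum() * norm_factor

def train_slide_classifier(features, labels):
    """Train an RBF-kernel SVM on histogram features (labels: 1 = slide, 0 = non-slide).
    The parameter grid is illustrative; the paper only states that grid search was used."""
    grid = GridSearchCV(SVC(kernel="rbf"),
                        {"C": [1, 10, 100], "gamma": [1e-4, 1e-3, 1e-2]})
    grid.fit(features, labels)
    return grid.best_estimator_
```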

3.1.1 Experimental Result

To evaluate our slide-video segmentation method, we compiled a lecture video test data set consisting of 20 randomly selected videos from different speakers with various layouts and font styles. Each unique slide with complete contents is regarded as a correct match. The overall number of detected segments is 860, of which 816 are correct. The achieved segmentation recall and precision are 98 and 95 percent for the test data set, respectively. Since this experiment is designed to evaluate the slide segmentation algorithm, videos of various genres embedded within the slides are not considered. The detailed evaluation results as well as the URL of each test video can be found at [21].

To evaluate the slide-image classifier, 2,597 slide frames and 5,224 non-slide frames have been collected in total for training.

Fig. 2. We detect the first and the last object line in Rc vertically and perform the CC differencing metric on those line sequences from adjacent frames. In frames (a) and (b), the same text line (top) can be found, whereas in frames (b) and (c) the CC-based differences of both text lines exceed the threshold Ts2. A slide transition is thus found between frames (b) and (c).

Fig. 3. Lecture video segmentation workflow. Step 1: adjacent frames are compared with each other by applying the CC analysis on their differential edge maps. Step 2: slide transitions are captured by performing title and content region analysis.



All slide frames are collected from our lecture video database. To create a non-slide frame set with varying image genres, we collected an additional 3,000 images from flickr.3 Moreover, about 2,000 non-slide video frames were collected from the lecture video database. The test set consists of 240 slide frames and 233 non-slide frames, which differ from the training set.

In the classifier training, the SVM parameters were determined by using a grid-search function. To calculate the image intensity histogram, 256 histogram bins were initially created corresponding to the 256 image grayscale values. Then the normalized histogram values were applied to train the SVM classifier. We evaluated the normalization factor (cf. Fig. 5), which has proven to serve best when set to 1,000.4

The comparison results of the two features are illustrated in Table 1. Both features achieved a good recall rate for recognizing slide frames. However, compared with the HOG feature, the intensity histogram feature showed a considerable improvement in precision and F1 measure.

3.2 Video OCR for Lecture Videos

Texts in the lecture slides are closely related to the lecture content and can thus provide important information for the retrieval task. In our framework, we developed a novel video OCR system for gathering video text.

For text detection, we developed a new localization-verification scheme. In the detection stage, an edge-based multi-scale text detector is used to quickly localize candidate text regions with a low rejection rate. For the subsequent text area verification, an image entropy-based adaptive refinement algorithm not only serves to reject false positives that expose low edge density, but also further splits most text and non-text regions into separate blocks. Then Stroke Width Transform (SWT) [22]-based verification procedures are applied to remove the non-text blocks. Since the SWT verifier is not able to correctly identify special non-text patterns such as spheres, window blocks, or garden fences, we adopted an additional SVM classifier to sort out these non-text patterns in order to further improve the detection accuracy. For text segmentation and recognition, we developed a novel binarization approach in which we utilize image skeleton and edge maps to identify the text pixels.

The proposed method consists of three main steps: text gradient direction analysis, seed pixel selection, and seed-region growing. After the seed-region growing process, the video text images are converted into a suitable format for standard OCR engines. A subsequent spell-checking process further sorts out incorrect words from the recognition results. Our video OCR system has been evaluated on several test data sets. In particular, using the online evaluation of the open benchmark ICDAR 2011 competition test sets for born-digital images [23], our text detection method achieved second place, and our text binarization and word recognition methods achieved first place in the corresponding ranking lists on their website (last checked 08/2013). An in-depth discussion of the developed video OCR approach and the detailed evaluation results can be found in [24]. By applying the open-source print OCR engine tesseract-ocr,5 we correctly recognized 92 percent of all characters and 85 percent of all words in lecture video images. The compiled test data set, including 180 lecture video frames and the respective manual annotations, is available at [21]. Figs. 4a and 4b demonstrate some exemplary text detection and binarization results of our methods.
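As a minimal sketch of the final recognition and spell-filtering step (assuming the pytesseract binding for tesseract-ocr and a plain word-set dictionary; the detection, verification and binarization stages described above are not reproduced here):

```python
import pytesseract
from PIL import Image

def recognize_slide_text(binarized_frame_path, dictionary):
    """Run tesseract on an already binarized key frame and keep only words
    that pass a simple dictionary-based spell check.

    `dictionary` is assumed to be a set of valid lower-cased words.
    """
    raw_text = pytesseract.image_to_string(Image.open(binarized_frame_path))
    words = raw_text.split()
    return [w for w in words if w.strip(".,;:!?()").lower() in dictionary]
```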

3.3 Slide Structure Analysis

Generally, the title, subtitle and key-point content of a lecture slide has more significance than the normal slide text, as it summarizes the slide. Due to this fact, we classify the type of text lines recognized from slide frames by using geometrical information and the stroke width feature. The benefits of the structure analysis method can be summarized as follows:

- The lecture outline can be extracted using the classified text lines. It provides a fast overview of a lecture video, and each outline item with its time stamp can in turn be adopted for video browsing (e.g., Fig. 7). The usability of the outline feature has been evaluated in the user study described in Section 5.

- The structure of text lines can reflect their different significance. This information is valuable for a search/indexing engine: similar to web search engines, which make use of the explicitly pre-defined HTML structure (HTML tags) for calculating the weight of texts in web pages, our method further opens up the video content and enables the search engine to give more accurate and flexible search results based on the structured video text.

Fig. 4. (a) Exemplary text detection results. (b) Text binarization results.

3. www.flickr.com.
4. The norm function normalizes the histogram bins such that the sum of the bins is scaled to equal the factor.
5. https://code.google.com/p/tesseract-ocr.




The process begins with a title line identification procedure. A text line is considered as a candidate title line when it is located in the upper third of the frame, has more than three characters, is one of the three highest text lines and has the uppermost vertical position. The corresponding text line objects are then labeled as title objects, and we repeat the process on the remaining text objects in the same manner. Further detected title lines must have a similar height (tolerance up to 10 px) and stroke width value (tolerance up to 5 px) as the first one. For our purposes, we allow up to three title lines to be detected for each slide frame. All non-title text line objects are further classified into three classes: content text, key-point and footline. The classification is based on the height and the average stroke width of the text line object, as follows:

\[
\text{type}(t) =
\begin{cases}
\text{key-point} & \text{if } s_t > s_{mean} \wedge h_t > h_{mean} \\
\text{footline} & \text{if } s_t < s_{mean} \wedge h_t < h_{mean} \wedge y_t = y_{max} \\
\text{content text} & \text{otherwise,}
\end{cases}
\]

where $s_{mean}$ and $h_{mean}$ denote the average stroke width and the average text line height of a slide frame, $s_t$ and $h_t$ denote the stroke width and the height of text line $t$, and $y_{max}$ denotes the maximum vertical position of a text line object (measured from the top-left corner of the image).
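A minimal sketch of this line-type classification, using a hypothetical TextLine structure (field names are illustrative; the real system works on the text line objects produced by the OCR stage and additionally merges up to three title lines per slide):

```python
from dataclasses import dataclass

@dataclass
class TextLine:
    text: str
    y: int            # vertical position, measured from the top-left corner
    height: int       # text line height in pixels
    stroke_width: float

def classify_lines(lines, frame_height):
    """Assign title / key-point / footline / content text labels following
    the rules above; returns a list of (line, label) pairs."""
    s_mean = sum(l.stroke_width for l in lines) / len(lines)
    h_mean = sum(l.height for l in lines) / len(lines)
    y_max = max(l.y for l in lines)

    # Title candidate: upper third of the frame, more than three characters,
    # among the three highest lines, with the uppermost vertical position.
    candidates = [l for l in lines if l.y < frame_height / 3 and len(l.text) > 3]
    candidates = sorted(candidates, key=lambda l: -l.height)[:3]
    title = min(candidates, key=lambda l: l.y) if candidates else None

    labels = []
    for l in lines:
        if l is title:
            labels.append((l, "title"))
        elif l.stroke_width > s_mean and l.height > h_mean:
            labels.append((l, "key-point"))
        elif l.stroke_width < s_mean and l.height < h_mean and l.y == y_max:
            labels.append((l, "footline"))
        else:
            labels.append((l, "content text"))
    return labels
```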

To further extract the lecture outline, we first apply a spell checker to sort out text line objects which do not satisfy the following conditions:

- a valid text line object must have more than three characters,

- a valid text line object must contain at least one noun,

- the textual character count of a valid text line object must be more than 50 percent of the entire string length.

The remaining objects are labeled as outline objects. Subsequently, we merge all title lines within the same slide according to their position; other text line objects from this slide are considered as subitems of the title line. Then we merge text line objects from adjacent frames with similar title content (90 percent content overlap), where the similarity is measured by calculating the amount of identical characters and identical words. After the merging process all duplicated text lines are removed. Finally, the lecture outline is created by assigning all valid outline objects to a consecutive structure according to their order of occurrence.
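The filtering and merging steps can be sketched as follows (a rough illustration: the noun test is delegated to a part-of-speech tagger that is not shown, and the character-overlap similarity is only an approximation of the character-and-word comparison used in the paper):

```python
def is_valid_outline_line(text, contains_noun):
    """Apply the three conditions above; `contains_noun` is assumed to be a
    callable backed by a part-of-speech tagger."""
    if len(text) <= 3 or not contains_noun(text):
        return False
    # Textual (alphabetic) characters must make up more than half of the string.
    return sum(c.isalpha() for c in text) > 0.5 * len(text)

def title_overlap(a, b):
    """Approximate content overlap of two title strings, used to merge
    outline entries from adjacent frames (threshold around 0.9)."""
    common = sum(min(a.count(ch), b.count(ch)) for ch in set(a))
    return common / max(len(a), len(b), 1)
```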

3.3.1 Experimental Result

We evaluated the lecture outline extraction method by selecting 180 segmented unique slide frames from the test videos used in Section 3.1.1. In our experiment, both the outline type classification accuracy and the word recognition accuracy were considered; the accuracy of the outline extraction therefore depends on the OCR recognition result. We defined the recall and precision metrics as follows:

\[
\text{Recall} = \frac{\text{number of correctly retrieved outline words}}{\text{number of all outline words in ground truth}},
\]
\[
\text{Precision} = \frac{\text{number of correctly retrieved outline words}}{\text{number of retrieved outline words}}.
\]

Table 2 shows the evaluation results for the title and key-point extraction, respectively.

Although the extraction rate of key-points still leaves room for improvement, the extracted titles are already well suited for automatic video indexing. Furthermore, the outline extraction accuracy can be further improved by providing better OCR results.

Fig. 5. Evaluation results for the normalization factor of the image intensity histogram feature.

TABLE 1
Feature Comparison Result

Fig. 6. Workflow for extending our speech corpus.



3.4 ASR for German Lecture Videos

In addition to video OCR, ASR can provide speech-to-text information from lecture videos, which offers the chance to dramatically improve the quantity of automatically generated metadata. However, as mentioned, most lecture speech recognition systems cannot achieve a sufficient recognition rate, and a model-adaptation process is often required. Furthermore, in the open-source context there is a lack of methods for generating the German phonetic dictionary automatically. Therefore, we developed a solution intended to fill this gap. We decided to build acoustic models for our special use case by applying the CMU Sphinx Toolkit6 and the German Speech Corpus by Voxforge7 as a baseline. Unlike other approaches that collect speech data by dictation in a quiet environment, we gathered hours of speech data from real lecture videos and created corresponding transcripts. This way the real teaching environment is involved in the training process. For the language model training we applied text corpora collected from a German daily news corpus (radio programs, 1996-2000), Wortschatz Leipzig8 and the audio transcripts of the collected speech corpora.

Fig. 6 depicts the workflow for creating the speech training data. First of all, the recorded audio file is segmented into smaller pieces and improper segments are sorted out. For each remaining segment the spoken text is transcribed manually and added to the transcript file automatically. As an intermediate step, a list of all words used in the transcript file is created. In order to obtain the phonetic dictionary, the pronunciation of each word has to be represented phonetically.

A recorded lecture audio stream yields approximately 90 minutes of speech data, which is far too long to be processed by the ASR trainer or the speech decoder at once; shorter speech segments are thus required. Manually collecting appropriate speech segments and transcripts is rather time consuming and costly, and a number of steps have to be performed to acquire high-quality input data for an ASR trainer tool. Our current approach is to fully automate the segmentation (Fig. 6(2)) and partly automate the selection (Fig. 6(3)) without suffering from quality drawbacks like dropped word endings. The fundamental steps can be described as follows: we first compute the absolute values of the input samples to get a loudness curve, which is then downsampled (e.g., by a factor of 100). Then a blur filter (e.g., radius = 3) is applied to eliminate outliers, and a loudness threshold determines which areas are considered quiet. Non-quiet areas up to a specified maximum length (5 seconds has proven to serve best for our training data) serve as potential speech utterances. Finally, we retrieve the speech utterances from the original audio stream and save them into files.
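A sketch of this segmentation step with NumPy, under the assumption that the audio samples are already available as a mono float array normalized to [-1, 1]; the threshold value and the exact way over-long non-quiet areas are handled are illustrative choices, not the paper's implementation.

```python
import numpy as np

def find_utterances(samples, sample_rate, downsample=100,
                    blur_radius=3, loudness_threshold=0.02, max_len_s=5.0):
    """Split a lecture audio track into candidate speech utterances.

    Steps: absolute values -> downsampled loudness curve -> blur filter ->
    loudness threshold -> non-quiet runs, capped at max_len_s seconds.
    Returns (start, end) pairs as sample indices in the original stream.
    """
    loudness = np.abs(samples)[::downsample]                  # downsampled loudness curve
    kernel = np.ones(2 * blur_radius + 1) / (2 * blur_radius + 1)
    loudness = np.convolve(loudness, kernel, mode="same")     # blur to remove outliers
    active = loudness > loudness_threshold

    max_len = int(max_len_s * sample_rate / downsample)
    utterances, start = [], None
    for i, on in enumerate(np.append(active, False)):         # trailing False closes the last run
        if on and start is None:
            start = i
        elif not on and start is not None:
            for s in range(start, i, max_len):                # cap runs at max_len
                utterances.append((s * downsample, min(s + max_len, i) * downsample))
            start = None
    return utterances
```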

From the experiments we have learned that all speech segments used for the acoustic model training must meet certain quality requirements. Experience has shown that approximately 50 percent of the generated audio segments have to be sorted out due to one of the following reasons:

- the segment contains acoustic noise created by objects and humans in the environment around the speaker, e.g., doors closing, chairs moving, students talking,

- the lecturer mispronounces some words, so that they are completely invalid from an objective point of view,

- the speaker's language is clear, but the segmentation algorithm cuts off parts of a spoken word so that it becomes invalid.

Therefore, the classification of audio segments into good or bad quality cannot yet be solved automatically, as the term “good quality” is very subjective and strongly depends on one's personal perception. Nevertheless, the proposed segmentation method can perform a preselection that speeds up the manual transcription process significantly. The experimental results show that the WER decreased by about 19 percent when adding 7.2 hours of speech data from our lecture videos to the training set (cf. Table 3).

3.4.1 Creation of the German Phonetic Dictionary

The phonetic dictionary is an essential part of any ASR software.

TABLE 2
Evaluation Results of Lecture Outline Extraction

Fig. 7. Video browsing using extracted titles and key-points from OCR transcripts; the highlighted key-points in bold font style were extracted by using their stroke width feature.

6. cmusphinx.sourceforge.net/.
7. www.voxforge.org/.
8. corpora.informatik.uni-leipzig.de/download.html.



For each word that appears in the transcripts, it defines one or more phonetic representations. As our speech corpus is growing continuously, the extension and maintenance of the dictionary becomes a recurring task, so an automatic generator is highly desirable. Unfortunately, such a tool is lacking in the open-source context. Therefore, we have built a phonetics generator using a customized phonetic alphabet which contains 45 phonemes used in German pronunciation. We provide this tool and the compiled speech training data for further research use.

The generation algorithm operates on three layers. On the word level, we have defined a so-called exemption dictionary, which in particular contains foreign words whose pronunciation does not follow German rules, e.g., byte, phänomen. First of all, a check is made as to whether the input word can be found in the exemption dictionary. If this is the case, the phonetic representation is completely read from this dictionary and no further steps have to be performed. Otherwise, we scale down to the syllable level by applying an external hyphenation algorithm.9

On the syllable level, we examine the result of the hyphenation algorithm, e.g., com-pu-ter for the input word computer. For many single syllables (and also pairs of syllables), a syllable mapping describes the corresponding phonetic representation, e.g., au f for the German prefix auf and n i: d er for the disyllabic German prefix nieder. If there is such a representation for the input syllable, it is added to the phonetic result immediately. Otherwise, we have to split the syllable further into its characters and proceed on the character level.

On the character level, a set of German pronunciation rules is applied to determine how the current single character is pronounced, taking into account the character type (consonant or vowel), neighboring characters, the relative position inside the containing syllable, the absolute position inside the whole word, etc. First and foremost, a heuristic checks whether the current and the next 1-2 characters can be pronounced natively. If this is not the case, or the word is only one character long, the characters are pronounced as if they were spelled letter by letter, as, e.g., the abbreviations abc (for alphabet) and ZDF (a German TV channel). In the next step, we determine the character type (consonant or vowel) in order to apply the correct pronunciation rules. The conditions of each of these rules are verified until one is true or all conditions have been verified and proven to be false. If the latter is the case, the standard pronunciation is applied, which assumes closed vowels.
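A compressed sketch of the three-layer generation algorithm; the exemption dictionary, the syllable mapping and the character-level rules shown here are tiny illustrative stand-ins for the real resources, and the hyphenation step simply stands in for the external libhyphenate algorithm.

```python
# Tiny illustrative stand-ins for the real resources.
EXEMPTION_DICT = {"byte": "b ai t"}                      # word level: non-German pronunciations
SYLLABLE_MAP = {"auf": "au f", "nie der": "n i: d er"}   # syllable level (also syllable pairs)

def hyphenate(word):
    """Stand-in for the external hyphenation algorithm (libhyphenate)."""
    return [word]

def char_rules(syllable):
    """Very rough stand-in for the German character-level rules, which
    inspect character type, neighbors and position; here we just emit
    one placeholder phoneme per character."""
    return " ".join(syllable)

def to_phonetics(word):
    """Word level -> syllable level -> character level."""
    word = word.lower()
    if word in EXEMPTION_DICT:                           # word level
        return EXEMPTION_DICT[word]
    phones = []
    for syllable in hyphenate(word):                     # syllable level
        if syllable in SYLLABLE_MAP:
            phones.append(SYLLABLE_MAP[syllable])
        else:
            phones.append(char_rules(syllable))          # character-level fallback
    return " ".join(phones)
```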

An evaluation with 20,000 words from the transcripts shows that 98.1 percent of all input words were processed correctly without any manual amendment. The 1.9 percent incorrect words mostly have an English pronunciation; they are corrected manually and subsequently added to the exemption dictionary.

Besides the striking advantage of saving a lot of the time and effort needed for dictionary maintenance, Table 4 shows that our automatically created dictionary does not result in a worse WER compared with the Voxforge dictionary. The speech corpus used for the evaluation is a smaller version of the German Voxforge Corpus, which contains 4.5 hours of audio data from 13 different speakers and is directly available from the Voxforge website including a ready-to-use dictionary. Replacing this dictionary with our automatically generated one results in a slightly better recognition result.

4 VIDEO CONTENT BROWSING AND VIDEO SEARCH

In this section we demonstrate several video content browsing features and discuss our video search method in a large lecture video portal.

4.1 Browsing with Lecture Keyframes and Extracted Lecture Outline

Fig. 8 shows the visualization of slide key frames in our lecture video portal. The segments are visualized in the form of a timeline within the video player. This feature is intended to give a fast hint for navigation. If the user wants to read the slide content in detail, a slide gallery is provided underneath the video player. By clicking on the thumbnails or on the timeline units, the video navigates to the beginning of the corresponding segment.

Fig. 7 demonstrates an exemplary lecture outline (structure), where the title “Race Conditions (4/4)” and the highlighted key-points “FTP Server” and “Race Conditions with Interrupts” in bold font style have been successfully detected by using their geometric information and the stroke width feature, respectively. By clicking on the outline items the video jumps to the corresponding time position. Furthermore, the user can use the standard text search function of a web browser to search through the outline content.

4.2 Keyword Extraction and Video Search

Lecture content-based metadata can be gathered by using OCR and ASR tools.

TABLE 3
Current Progress of Our Acoustic Model Training

The WER results have been measured on a random set of 708 test segments from seven lecturers.

TABLE 4
Comparison Result with the Voxforge Dictionary

Fig. 8. Visualization of segmented lecture slides in the video player.
9. http://swolter.sdf1.org/software/libhyphenate.html.



However, the recognition results of automatic analysis engines are often error prone, and a large amount of irrelevant words is also generated. Therefore, we extract keywords from the raw recognition results. Keywords can summarize a document and are widely used for information retrieval in digital libraries. In this work, only nouns and numbers are considered as keyword candidates, and the top n of them are regarded as keywords. Segment-level as well as video-level keywords are extracted from different information resources such as the OCR and ASR transcripts. For extracting segment-level keywords, we consider each individual lecture video as a document corpus and each video segment as a single document, whereas for obtaining video-level keywords, all lecture videos in the database are processed and each video is considered as a single document.

To extract segment-level keywords, we first assign each ASR and OCR word to the appropriate video segment according to its time stamp. Then we extract nouns from the transcripts by using the Stanford part-of-speech tagger [25], and a stemming algorithm is subsequently utilized to capture nouns with variant forms. To remove spelling mistakes produced by the OCR engine, we perform a dictionary-based filtering process.
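A hedged sketch of this candidate extraction step; NLTK's tagger (English by default, so a German model would be needed in practice) and the Snowball stemmer are used here only as stand-ins for the Stanford part-of-speech tagger and the stemming algorithm used in the paper, and `dictionary` is assumed to be a set of valid stems for the OCR spell filtering.

```python
import nltk
from nltk.stem.snowball import SnowballStemmer

# Requires: nltk.download("averaged_perceptron_tagger")
stemmer = SnowballStemmer("german")

def keyword_candidates(segment_words, dictionary):
    """Keep nouns and numbers, stem them, and drop out-of-dictionary stems
    to filter OCR spelling mistakes."""
    candidates = []
    for word, tag in nltk.pos_tag(segment_words):
        if tag.startswith("NN") or tag == "CD":      # nouns and numbers
            stem = stemmer.stem(word.lower())
            if tag == "CD" or stem in dictionary:
                candidates.append(stem)
    return candidates
```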

We calculate a weighting factor for each remaining keyword by extending the standard TFIDF score [26]. In general, the TFIDF algorithm ranks keywords only according to their statistical frequencies; it cannot represent the location information of keywords, which might be important for ranking keywords extracted from web pages or lecture slides. Therefore, we defined a new formula for calculating the TFIDF score, as shown by Eq. (1):

\[
\mathit{tfidf}_{seg\text{-}internal}(kw) = \frac{1}{N}\left(\mathit{tfidf}_{ocr}\cdot\frac{1}{n_{type}}\sum_{i=1}^{n_{type}} w_i \;+\; \mathit{tfidf}_{asr}\cdot w_{asr}\right), \qquad (1)
\]

where $kw$ is the current keyword, $\mathit{tfidf}_{ocr}$ and $\mathit{tfidf}_{asr}$ denote its TFIDF scores computed from the OCR and ASR resources respectively, $w$ is the weighting factor for the various resources, $n_{type}$ denotes the number of OCR text line types, and $N$ is the number of available information resources in which the current keyword can be found, i.e., for which the corresponding TFIDF score is not 0.

Since OCR text lines are classified into four types in our system (cf. Section 3.3), we can calculate the corresponding weighting factor for each type and for each information resource by using their confidence scores. Eq. (2) depicts the formula

\[
w_i = \frac{m}{s_i} \quad (i = 1 \ldots n), \qquad (2)
\]

where the parameter $m$ is set to 1 in our system, and $s_i$ can be calculated from the corresponding recognition accuracy of the analysis engine, as shown by Eq. (3):

\[
s_i = 1 - \mathit{Accuracy}_i \quad (i = 1 \ldots n). \qquad (3)
\]
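Read together, Eqs. (1)-(3) can be applied as in the following sketch, where the accuracy values passed in are illustrative placeholders for the confidence scores of the individual analysis engines and text line types:

```python
def resource_weight(accuracy, m=1.0):
    """Eqs. (2) and (3): weight derived from the recognition accuracy of an
    analysis engine (or of an OCR text line type)."""
    return m / (1.0 - accuracy)

def segment_level_score(tfidf_ocr, tfidf_asr, ocr_type_accuracies, asr_accuracy):
    """Eq. (1): combined segment-level score of one keyword.

    tfidf_ocr / tfidf_asr are its standard TFIDF scores in the OCR and ASR
    transcripts of the segment (0 if it does not occur there).
    """
    n_resources = sum(1 for score in (tfidf_ocr, tfidf_asr) if score > 0)
    if n_resources == 0:
        return 0.0
    mean_ocr_weight = sum(resource_weight(a) for a in ocr_type_accuracies) / len(ocr_type_accuracies)
    return (tfidf_ocr * mean_ocr_weight
            + tfidf_asr * resource_weight(asr_accuracy)) / n_resources

# Example call with made-up accuracies for the four OCR line types and the ASR engine:
# score = segment_level_score(0.12, 0.08, [0.95, 0.90, 0.85, 0.80], 0.60)
```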

Fig. 9 shows the web GUI of the segment-level keyword browsing and search function in our lecture video portal. A scatter chart is used for visualizing the keywords, as shown in Fig. 9(1); the horizontal axis of the scatter chart represents the video time, and the red and light-blue plots represent the keywords extracted from the ASR and OCR results, respectively. Underneath the chart a timeline represents the linear structure of the video (Fig. 9(3)), whereas the weight of the current keyword for the corresponding segments is indicated by another colored timeline (Fig. 9(2)), where stronger color gradients refer to a higher keyword weight in the corresponding segment. The user can zoom in and out to the appropriate segment interval by clicking on a colored timeline unit or via mouse selection in the diagram area. When hovering over the gray timeline, the recommended keywords are displayed for each segment (Fig. 9(4)). By clicking on a keyword, its appearance is highlighted in the scatter chart. The user can further click on a plot point in the chart, and the video then navigates to the position where the word is spoken or appears in the slide.

Fig. 9. Segment-level keyword browsing and keyword search function in our lecture video portal.



As already mentioned, to build a content-based video search engine using multimodal information resources, we have to deal with solidity and consistency problems. Those information resources might be generated either by a human or by an analysis engine, and in the latter case different analysis engines may have different confidence scores. Therefore, during the ranking process we should consider both the statistical features of the keywords and their confidence scores. We have thus defined a formula for computing the video-level TFIDF score, as shown by Eq. (4):

tfidf_{vid\text{-}level}(kw) = \frac{1}{N} \sum_{i=1}^{n} tfidf_i \cdot w_i,   (4)

where tfidf_i and w_i denote the TFIDF score and the corresponding weighting factor for each information resource, and N is the number of available information resources in which the current keyword can be found.

In our case, video search indices can currently be built from three information resources: manually created global video metadata (lecturer name, course description, etc.), ASR words, and OCR words. As described for Eq. (1), the OCR text lines are in turn classified into several types, and the formula is extended as

tfidf_{vid\text{-}level}(kw) = tfidf_g \cdot w_g + \frac{1}{N} \left( tfidf_{ocr} \cdot \frac{1}{n_{type}} \sum_{i=1}^{n_{type}} w_i + tfidf_{asr} \cdot w_{asr} \right),   (5)

where tfidf_g and w_g denote the TFIDF score and the weighting factor for the global video metadata.
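A sketch of Eq. (5) follows, reusing the line_type_weight helper from the earlier sketch; again, the names are assumptions for illustration only.

def tfidf_vid_level(tfidf_global, w_global, tfidf_ocr, tfidf_asr,
                    ocr_type_weights, w_asr):
    """Eq. (5): video-level score combining global metadata, OCR and ASR."""
    n_available = sum(1 for score in (tfidf_ocr, tfidf_asr) if score != 0)
    ocr_part = tfidf_ocr * (sum(ocr_type_weights) / len(ocr_type_weights))
    asr_part = tfidf_asr * w_asr
    combined = (ocr_part + asr_part) / n_available if n_available else 0.0
    return tfidf_global * w_global + combined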

The ranked video-level keywords are used for the content-based video search in our lecture video portal. Fig. 10 shows some exemplary search results, in which segmented lecture slides, search hits, and keyword weights are additionally presented to the user when hovering over the corresponding timeline unit. Clicking on the unit plays the video segment in a pop-up player.

The video similarity can further be computed by using a vector space model and the cosine similarity measure. Table 5 shows an exemplary keyword-video matrix A_{kw×v}; its columns and rows correspond to video and keyword indices, respectively. The value of each matrix element is the calculated tfidf_{vid-level} score of keyword kw_i in video v_j.

Let us assume that each column of A denotes a vector d_j which corresponds to a video v_j. Here, the dimension of d_j is the number of selected keywords. Let q denote the query vector which corresponds to another video; the similarity between the two videos can then be calculated using the cosine similarity measure, according to Eq. (6):

sim(d_j, q) = \frac{\sum_{i=1}^{m} a_{ij} \, q_i}{\sqrt{\sum_{i=1}^{m} a_{ij}^2} \; \sqrt{\sum_{i=1}^{m} q_i^2}}.   (6)

Furthermore, the TFIDF score for the inter-video segment comparison can be derived by combining tfidf_{seg-internal} and tfidf_{vid-level}. Using this score we are able to implement a segment-based lecture fragment search/recommendation system.
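As a small illustration of Eq. (6), the cosine similarity between two columns of the keyword-video matrix can be computed as follows (a sketch, not the portal's actual code).

import numpy as np

def video_similarity(d_j, q):
    """d_j, q: 1-D arrays of tfidf_vid_level scores over the same keyword list."""
    denom = np.linalg.norm(d_j) * np.linalg.norm(q)
    return float(np.dot(d_j, q) / denom) if denom else 0.0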

5 INITIAL USER STUDY

We have evaluated the proposed indexing features in the form of a user study intended to verify research Hypothesis 1 with the following derived questions:

• Which video indexing tools are helpful for finding a specific lecture topic within a lecture video, and
• how quickly and how accurately can this be achieved?
• Can learning effectiveness be improved by using video indexing tools in a lecture video portal?

The indexing tools involved in this evaluation include the automatically extracted slides, which are provided in a timeline format, the lecture outline extracted from the slides and enhanced with direct links into the video, as well as the keyword browsing/search functionality.

Fig. 10. Content-based video search in the lecture video portal.

TABLE 5: Keyword-Video Matrix A


Twelve participants were recruited for the user study. They were bachelor or first-semester master students at our institute in the field of "IT Systems Engineering". In order to fulfill the goal, two tasks were conducted:

• Task 1: search for the temporal scope of a specific lecture topic within a one-hour lecture video by using one of the following setups:
  - video with a seekable player
  - key-frames and video
  - lecture outline and video
  - keywords and video
  - all available indexing tools and video
• Task 2: watch a complete topic segment of 10 minutes. For one video the participants were allowed to use all of the indexing tools, and for the second video they were only allowed to use the video player without additional tools. After watching a video the students had to take a short exam so that we could measure learning effectiveness.

After a preliminary introduction, each student had one hour to fulfill the requested tasks. We applied a within-subject design for the experiment, intended to compare the outcomes in speed and accuracy of the various setups. Therefore, each participant was asked to perform all tasks. For each of the setups we prepared different videos. All the videos were chosen carefully with similarity in the complexity of the topic and the findability of the required information in mind. The videos are part of computer science studies and were made in the subjects' native language. We randomly selected the order of the tasks and used a counterbalanced order to assign videos to each setup. This was intended to prevent the subjects from becoming tired and their learning thereby being affected over time.

None of the participants had worked with the video indexing tools before the user study. But even though the tools were new to them, every single subject was able to use them to their advantage after a short introduction. We found that the requested information was found faster and more accurately than searching without the video indexing tools whenever the students used key-frames, the lecture outline, or all of the available features (cf. Fig. 11 and Table 6). The only tool showing a slower processing speed was the keywords. In our understanding this is caused by the necessity to scroll the website down from the video player to the keyword GUI. Nevertheless, it achieved the best accuracy results, as shown in Table 6.

In the second task, we prepared three multiple-choice questions and a free-text question for each test video. The total score for one video is around 12 points. We designed three methods for scoring the results, as sketched after the list below:

• W1—for multiple-choice questions, +1 for each correct answer, −2 for each incorrect/missed answer; +2 for each correct point in the free-text answer.

• W2—for multiple-choice questions, +1 for each correct answer, −1 for each incorrect answer, 0 for missed answers; +2 for each correct point in the free-text answer.

• Normalized W2—since the maximal number of correct answers for the multiple-choice questions differs for each test video, we additionally built the normalized result of W2.
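The scoring rules can be written down directly; the sketch below is a hypothetical rendering of W1, W2, and normalized W2, with the data layout chosen purely for illustration.

def score_w1(answers, free_text_points):
    """answers: list of 'correct' | 'incorrect' | 'missed' multiple-choice outcomes."""
    mc = sum(+1 if a == 'correct' else -2 for a in answers)
    return mc + 2 * free_text_points

def score_w2(answers, free_text_points):
    mc = sum(+1 if a == 'correct' else (-1 if a == 'incorrect' else 0) for a in answers)
    return mc + 2 * free_text_points

def score_w2_normalized(answers, free_text_points, max_score):
    # normalize by the per-video maximum, since it differs between test videos
    return score_w2(answers, free_text_points) / max_score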

Fig. 12 shows that learning effectiveness can be improved measurably by using the video indexing tools; the mean and standard deviation of the scoring results are given in Table 7. Almost all participants used the key-frame and outline features in this task for reviewing knowledge points, while the keyword feature was rarely used during the watching process. The results of the post-test questionnaire indicate that all participants would like to see the indexing tools used in lecture video archives.

Fig. 11. Processing speed evaluation results of task 1. The bar chart depicts the mean processing time and standard deviation of each setup.

TABLE 6: Processing Accuracy Evaluation Results of Task 1

Fig. 12. Results of the learning effectiveness evaluation (task 2), where the bar chart presents the accumulated score for each setup of the exam.

TABLE 7: Mean and Standard Deviation Results of Task 2


The preliminary results of the user study show a trend which corroborates Hypothesis 1. However, the results are not statistically significant and a larger number of test persons is needed to ensure reliability.

6 CONCLUSION AND FUTURE WORK

In this paper, we presented an approach for content-based lecture video indexing and retrieval in large lecture video archives. In order to verify the research hypothesis, we apply the visual as well as the audio resources of lecture videos to extract content-based metadata automatically. Several novel indexing features have been developed in a large lecture video portal by using those metadata, and a user study has been conducted.

As future work, a usability and utility study of the video search function in our lecture video portal will be conducted. Automated annotation of OCR and ASR results using Linked Open Data resources offers the opportunity to significantly enhance the amount of linked educational resources. More efficient search and recommendation methods could therefore be developed for lecture video archives.

REFERENCES

[1] E. Leeuwis, M. Federico, and M. Cettolo, "Language modeling and transcription of the TED corpus lectures," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2003, pp. 232–235.
[2] D. Lee and G. G. Lee, "A Korean spoken document retrieval system for lecture search," in Proc. ACM SIGIR Workshop Searching Spontaneous Conversational Speech, 2008.
[3] J. Glass, T. J. Hazen, L. Hetherington, and C. Wang, "Analysis and processing of lecture audio data: Preliminary investigations," in Proc. HLT-NAACL Workshop Interdisciplinary Approaches Speech Indexing Retrieval, 2004, pp. 9–12.
[4] A. Haubold and J. R. Kender, "Augmented segmentation and visualization for presentation videos," in Proc. 13th Annu. ACM Int. Conf. Multimedia, 2005, pp. 51–60.
[5] W. Hürst, T. Kreuzer, and M. Wiesenhütter, "A qualitative study towards using large vocabulary automatic speech recognition to index recorded presentations for search and access over the web," in Proc. IADIS Int. Conf. WWW/Internet, 2002, pp. 135–143.
[6] C. Munteanu, G. Penn, R. Baecker, and Y. C. Zhang, "Automatic speech recognition for webcasts: How good is good enough and what to do when it isn't," in Proc. 8th Int. Conf. Multimodal Interfaces, 2006.
[7] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval," Inf. Process. Manage., vol. 24, no. 5, pp. 513–523, 1988.
[8] G. Salton, A. Wong, and C. S. Yang, "A vector space model for automatic indexing," Commun. ACM, vol. 18, no. 11, pp. 613–620, Nov. 1975. [Online]. Available: http://doi.acm.org/10.1145/361219.361220
[9] T.-C. Pong, F. Wang, and C.-W. Ngo, "Structuring low-quality videotaped lectures for cross-reference browsing by video text analysis," J. Pattern Recog., vol. 41, no. 10, pp. 3257–3269, 2008.
[10] M. Grcar, D. Mladenic, and P. Kese, "Semi-automatic categorization of videos on videolectures.net," in Proc. Eur. Conf. Mach. Learn. Knowl. Discovery Databases, 2009, pp. 730–733.
[11] T. Tuna, J. Subhlok, L. Barker, V. Varghese, O. Johnson, and S. Shah, "Development and evaluation of indexed captioned searchable videos for STEM coursework," in Proc. 43rd ACM Tech. Symp. Comput. Sci. Educ., 2012, pp. 129–134. [Online]. Available: http://doi.acm.org/10.1145/2157136.2157177
[12] H. J. Jeong, T.-E. Kim, and M. H. Kim, "An accurate lecture video segmentation method by using SIFT and adaptive threshold," in Proc. 10th Int. Conf. Advances Mobile Comput., 2012, pp. 285–288. [Online]. Available: http://doi.acm.org/10.1145/2428955.2429011
[13] H. Sack and J. Waitelonis, "Integrating social tagging and document annotation for content-based search in multimedia data," in Proc. 1st Semantic Authoring Annotation Workshop, 2006.
[14] C. Meinel, F. Moritz, and M. Siebert, "Community tagging in tele-teaching environments," in Proc. 2nd Int. Conf. e-Educ., e-Bus., e-Manage. e-Learn., 2011.
[15] S. Repp, A. Gross, and C. Meinel, "Browsing within lecture videos based on the chain index of speech transcription," IEEE Trans. Learn. Technol., vol. 1, no. 3, pp. 145–156, Jul. 2008.
[16] J. Eisenstein, R. Barzilay, and R. Davis, "Turning lectures into comic books using linguistically salient gestures," in Proc. 22nd Nat. Conf. Artif. Intell., 2007, vol. 1, pp. 877–882. [Online]. Available: http://dl.acm.org/citation.cfm?id=1619645.1619786
[17] J. Adcock, M. Cooper, L. Denoue, and H. Pirsiavash, "TalkMiner: A lecture webcast search engine," in Proc. ACM Int. Conf. Multimedia, 2010, pp. 241–250.
[18] J. Nandzik, B. Litz, N. Flores Herr, A. Löhden, I. Konya, D. Baum, A. Bergholz, D. Schönfuß, C. Fey, J. Osterhoff, J. Waitelonis, H. Sack, R. Köhler, and P. Ndjiki-Nya, "Contentus—Technologies for next generation multimedia libraries," Multimedia Tools Appl., pp. 1–43, 2012. [Online]. Available: http://dx.doi.org/10.1007/s11042-011-0971-2
[19] F. Chang, C.-J. Chen, and C.-J. Lu, "A linear-time component-labeling algorithm using contour tracing technique," Comput. Vis. Image Understanding, vol. 93, no. 2, pp. 206–220, Jan. 2004.
[20] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2005, vol. 1, pp. 886–893.
[21] Ground truth data, 2013. [Online]. Available: http://www.yanghaojin.com/research/videoOCR.html
[22] B. Epshtein, E. Ofek, and Y. Wexler, "Detecting text in natural scenes with stroke width transform," in Proc. Int. Conf. Comput. Vis. Pattern Recog., 2010, pp. 2963–2970.
[23] D. Karatzas, S. R. Mestre, J. Mas, F. Nourbakhsh, and P. P. Roy, "ICDAR 2011 robust reading competition: Challenge 1: Reading text in born-digital images (web and email)," in Proc. Int. Conf. Doc. Anal. Recog., Sep. 2011, pp. 1485–1490.
[24] H. Yang, B. Quehl, and H. Sack, "A framework for improved video text detection and recognition," Multimedia Tools Appl., pp. 1–29, 2012. [Online]. Available: http://dx.doi.org/10.1007/s11042-012-1250-6
[25] K. Toutanova, D. Klein, C. Manning, and Y. Singer, "Feature-rich part-of-speech tagging with a cyclic dependency network," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Human Lang. Technol., 2003, pp. 252–259.
[26] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval," Inf. Process. Manage., vol. 24, pp. 513–523, 1988.

Haojin Yang received the diploma engineering degree from the Technical University Ilmenau, Germany, in 2008. In 2013, he received the doctorate degree from the Hasso-Plattner-Institute for IT-Systems Engineering (HPI), University of Potsdam. His current research interests revolve around multimedia analysis, information retrieval, semantic technology, and content-based video search technology. He is involved in the development of the Web-University project and the enhancement of the tele-TASK video recording system.

Christoph Meinel studied mathematics and computer science at Humboldt University, Berlin. He received the doctorate degree in 1981 and was habilitated in 1988. After visiting positions at the University of Paderborn and the Max-Planck-Institute for Computer Science in Saarbrücken, he became a full professor of computer science at the University of Trier. He is currently the president and CEO of the Hasso-Plattner-Institute for IT-Systems Engineering at the University of Potsdam, where he is a full professor of computer science with a chair in Internet technology and systems. He is a member of acatech, the German "National Academy of Science and Engineering," and numerous scientific committees and supervisory boards. His research focuses on IT-security engineering, teleteaching, and telemedicine. He has published more than 400 papers in high-profile scientific journals and at international conferences. He is a member of the IEEE.


1168 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 16, NO. 4, APRIL 2007

Semantic-Based Surveillance Video Retrieval
Weiming Hu, Dan Xie, Zhouyu Fu, Wenrong Zeng, and Steve Maybank, Senior Member, IEEE

Abstract—Visual surveillance produces large amounts of video data. Effective indexing and retrieval from surveillance video databases are very important. Although there are many ways to represent the content of video clips in current video retrieval algorithms, there still exists a semantic gap between users and retrieval systems. Visual surveillance systems supply a platform for investigating semantic-based video retrieval. In this paper, a semantic-based video retrieval framework for visual surveillance is proposed. A cluster-based tracking algorithm is developed to acquire motion trajectories. The trajectories are then clustered hierarchically using the spatial and temporal information, to learn activity models. A hierarchical structure of semantic indexing and retrieval of object activities, where each individual activity automatically inherits all the semantic descriptions of the activity model to which it belongs, is proposed for accessing video clips and individual objects at the semantic level. The proposed retrieval framework supports various queries including queries by keywords, multiple object queries, and queries by sketch. For multiple object queries, succession and simultaneity restrictions, together with depth and breadth first orders, are considered. For sketch-based queries, a method for matching trajectories drawn by users to spatial trajectories is proposed. The effectiveness and efficiency of our framework are tested in a crowded traffic scene.

Index Terms—Activity models, semantic-based, video retrieval, visual surveillance.

I. INTRODUCTION

CAMERA systems for surveillance are widely used and produce large amounts of video data which are stored for future or immediate use. In this context, effective indexing and retrieval from surveillance video databases are very important. Surveillance videos are rich in motion information, which stands out as the most important cue to identify the dynamic content of videos. Extraction, storage, and analysis of motion information in videos are critical in content-based surveillance video retrieval.

Video retrieval using motion information has been investigated in recent years [1]–[6], [25], [26]. The existing approaches emphasize the decomposition and approximation of motion trajectories, where semantic information about object motions is not reflected. Such approaches to video retrieval can answer sketch-based or example-based queries, but cannot answer semantic-based queries. Although manual annotation of video data makes it possible to implement semantic-based video retrieval, the complete manual annotation of large video databases is too expensive.

Manuscript received August 7, 2006; revised October 27, 2006. This work was supported in part by the NSFC under Grant 60520120099. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Til Aach.

W. Hu, D. Xie, Z. Fu, and W. Zeng are with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100080, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]).

S. Maybank is with the School of Computer Science and Information Systems, Birkbeck College, London WC1E 7HX, U.K. (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIP.2006.891352

In visual surveillance systems, the cameras are mainly fixed, and object motions in the monitored scenes tend to be regular. These facts make visual surveillance data very suitable for investigating semantic-based video retrieval. Visual surveillance systems need the support of video retrieval techniques and at the same time supply a platform for investigating semantic-based video retrieval.

A. Related Work

There are two main aspects of semantic-based video retrieval for visual surveillance: motion-based video retrieval and the semantic description of object motions.

a) Motion-based video retrieval: Some pioneering video retrieval systems using motion information have been developed. Chang et al. [1] propose an online video retrieval system supporting automatic object-based indexing and spatiotemporal queries. The system includes algorithms for automated video-object segmentation and tracking. Real-time video editing techniques are used to respond to user queries. Bashir et al. [2] apply principal component analysis (PCA) to reduce the dimensionality of trajectory data. A two-level PCA operation with coarse-to-fine retrieval is used to identify trajectories close to a query trajectory. Sahouria and Zakhor [3] propose an object trajectory-based system for video indexing, where the indexing is based on Haar wavelet transform coefficients. Chen and Chang [4] use wavelet decomposition to segment each trajectory and produce an index based on velocity features. Jung et al. [5] base their motion model on polynomial curve fitting. The motion model is used as an indexing key for accessing individual objects. They also propose an efficient way of indexing and retrieval based on object-specific features. Dagtas et al. [6] present models that use the object motion information to characterize events to allow subsequent video retrieval. Scale-invariant algorithms for spatiotemporal search have been developed. Chikashi et al. [25] propose a trajectory extraction, encoding, and query method based on the spatiotemporal relationships between objects. Dimitrova and Golshani [26] propose a motion description method based on the motion compensation component of the MPEG video encoding scheme. Trajectories of blocks are acquired by computing their forward and backward motion vectors, and they are used as feature vectors for the retrieval of videos.

The above approaches emphasize the reduction of the dimension of the trajectory data in order to acquire a set of parameters to represent trajectory features. A semantic gap exists between users and retrieval systems because there is no mapping between the parameters of the trajectories and the semantic meanings of object motions.

Fig. 1. Our framework for surveillance video retrieval.

b) Semantic Description of Motions: In recent years, the semantic description of object motions has been investigated. Haag and Nagel [27] describe a system for incremental recognition of traffic situations. They use fuzzy metric temporal logic (FMTL) as an inference tool to handle uncertainty and temporal aspects of action recognition. Mohnhaupt and Neumann [28] establish a "3-D scene description sequence," which includes traffic data such as directions, positions, and times of vehicles, etc. Bell and Pau [29] develop an object-oriented logic program system for image interpretation and apply it to vehicle recognition in real scenes. Huang et al. [30] and Buxton et al. [31] use a Bayesian belief network and inference engine in sequences of highway traffic scenes to produce high-level concepts like "car changing lane" and "car stalled." Remagnino et al. [32] present an event-based visual surveillance system for monitoring vehicles and pedestrians, which supplies verbal descriptions of dynamic activities in 3-D scenes. Liu et al. [33] interpret dynamic-object interactions in temporal image sequences using fuzzy sets. The descriptions of objects are handled by fuzzy logic and fuzzy measures. Kojima et al. [34] propose a method for describing human activities in video images based on a hierarchy of concepts related to actions. Thonnat and Rota [35] propose a method for representing human activities using scenarios, which are translated into text by filling entries in a template of natural language sentences. Ayers and Shah [36] propose a method for generating textual descriptions of human activities in video images using state transition models.

Current approaches for the semantic description of object motions mainly depend on the manual definition of object motion models. There is a strong demand for automatic methods for learning object motion models.

B. Our Work

A novel framework for semantic-based surveillance video retrieval is proposed in this paper. The aim is to bridge the semantic gap between users and video retrieval systems. Fig. 1 gives an overview of our framework. Video sequences taken from digital cameras are input to the system. Objects in the scene are tracked and object trajectories are extracted. The spectral algorithm is used to cluster trajectories to learn object activity models. Semantic descriptions are added to the activity models to form semantic activity models, and semantic indexes are constructed for video databases. Keywords-based queries, multiple object queries, and sketch-based queries are supported.

Our framework is original in the following ways.

• Visual surveillance and video retrieval are combined to explore semantic-based video retrieval through learning semantic object activity models.

• We learn object activity models from motion trajectories automatically by clustering trajectories. A method for learning activity models is proposed. Trajectories are hierarchically clustered using spectral clustering. Spatial information is used to cluster all trajectories into categories, and trajectories in each of the categories are further clustered into subcategories using temporal information. Each cluster of trajectories corresponds to an activity model. Each activity model represents the semantic content of the corresponding category of activities.

• We propose a hierarchical structure of semantic indexing and retrieval of object activities. The activity models are used as indexing keys for accessing individual objects. Each individual activity automatically inherits all the semantic descriptions of the activity model to which it belongs. In our framework, only the semantic description item for an activity model is given manually; the other items for the activity model and all items for an activity are obtained automatically. So, compared with current video retrieval approaches with complete manual annotation, our approach greatly reduces the workload of manual annotation.

• Based on our hierarchical structure of semantic indexing and retrieval, keywords-based queries, multiple object queries, and sketch-based queries are supported. For multiple object queries, the temporal restrictions "succession" and "simultaneity" are considered. For sketch-based queries, a robust algorithm for matching trajectories drawn by users to spatial trajectories is proposed.

• We test our framework experimentally in a crowded traffic scene. The experimental results show the robustness of the tracking algorithm, the efficiency of the algorithm for learning activity models, and the encouraging performance of the algorithms for video retrieval.

This paper is organized as follows. Section II briefly introduces the method for tracking multiple objects. Section III describes the learning of activity models. Section IV presents the mechanism of semantic indexing and retrieval. Section V illustrates experimental results. The last section summarizes the paper.

II. TRACKING

Although the motivation of the paper is to investigate the learning of semantic activity models and semantic-based video retrieval, these important tasks are based on the tracking of moving objects. In this paper, we focus on real traffic scenes where there are many vehicles and mutual occlusions between them. Although some algorithms [5], [7] have been proposed to track multiple objects, most of them fail due to the complexity of the motions. In this section, we only briefly introduce our cluster-based algorithm for tracking multiple objects. Readers may refer to our previous paper [16] for more details.

Our tracking algorithm is based on the principle that a moving object is associated with a cluster of pixels in the feature space and the distributions of the clusters change little between consecutive frames [8], [9]. Foreground pixels are identified using background subtraction, and features of foreground pixels are extracted and clustered. Each cluster centroid corresponds to a moving object or a part of a moving object. The clusters in the current frame contribute to the initialization of the cluster centroids of the foreground pixels in the subsequent frame. A cluster centroid in the subsequent frame and the cluster centroid in the current frame that participates in its initialization both correspond to the same object or the same part of an object. Clusters can change dynamically: they can be created or erased. A prediction algorithm is used to ensure that each cluster centroid correctly corresponds to a moving object in spite of the occurrence of partial occlusions.

A. Acquisition of Pixel Features

Foreground pixels are acquired by a self-adaptable background update model. Each foreground pixel is described with a feature vector containing its coordinates, velocity, and color in the color space

(1)

Compared with the coordinate and color features, the velocity features are difficult to acquire. In the algorithm, velocities are estimated using an optical flow algorithm, which is an improved one inspired by [10] and [11]. We only estimate the optical flow for foreground pixels. This greatly reduces the computational cost, compared with methods which estimate the optical flow for all pixels in the current frame.

B. Clustering of Foreground Pixels

After the pixel features are acquired, we apply the fuzzy c-means algorithm [12]–[14] to cluster the foreground pixels.

In order to improve the speed of the clustering process, regions of foreground pixels are averaged. The image plane is partitioned into square regions of equal size. The feature vector of each region is the mean of the feature vectors of the foreground pixels in the region, and it is associated with a weight, which is the number of foreground pixels in this region.

The sample feature vectors are clustered using the weighted fuzzy c-means algorithm, which is modified to support weighted feature vectors. Given the number of sample feature vectors, the dimension of the sample feature vectors, and the number of cluster centroids, the Euclidean distance from a sample feature vector to a cluster centroid is calculated by

(2)

The fuzzy membership of a sample feature vector to a cluster centroid is computed by

(3)

The value of each cluster centroid vector is updated iteratively by

(4)

where the iteration index counts the update steps. In every frame except the first, the initial values of the cluster centroid vectors are derived from the previous frame; in the first frame, they are selected randomly.
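Since the bodies of Eqs. (2)-(4) are not legible in this copy, the following is only a hedged sketch of a weighted fuzzy c-means step, using the standard update rules with a fuzzifier of 2 and per-region weights; it should not be read as the authors' exact formulation.

import numpy as np

def weighted_fcm(X, weights, centroids, n_iter=20, eps=1e-9):
    """X: (n, d) region feature vectors; weights: (n,) foreground pixel counts;
    centroids: (c, d) initial centroids (taken from the previous frame)."""
    for _ in range(n_iter):
        # distances from each sample to each centroid (cf. Eq. (2))
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2) + eps
        # fuzzy memberships with fuzzifier m = 2 (cf. Eq. (3))
        inv = 1.0 / dist**2
        u = inv / inv.sum(axis=1, keepdims=True)
        # weighted centroid update (cf. Eq. (4))
        um = (u**2) * weights[:, None]
        centroids = um.T @ X / um.sum(axis=0)[:, None]
    return centroids, u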

C. Prediction of Cluster Centroids

For a cluster centroid in the current frame, a prediction algorithm is used to produce an initial value for the corresponding cluster centroid in the next frame. This prediction ensures that each centroid in the current frame corresponds to a correct foreground pixel cluster in the next frame. As a result, the linkages between moving objects and corresponding cluster centroids are maintained in spite of the occurrence of partial occlusions.

In our work, many objects need to be tracked simultaneously, so there is a high computational cost. For this reason, we keep the computational cost of the tracking algorithm as small as possible. A fast prediction algorithm, the double exponential smoothing-based prediction algorithm [15], is selected. The running speed of the double exponential smoothing-based prediction algorithm is faster than that of the Kalman filter and the extended Kalman filter [15].

D. Modeling of Trajectories

In order to describe the entire trajectories of moving objects, cluster centroids should be created or erased as objects enter or leave the scene. This is called dynamic growing of cluster centroids. More details on the dynamic growing of cluster centroids are given in [16].

Once a video clip is processed, centroid trajectories are grouped into object trajectories. Centroid trajectories with similar motions are merged into one object trajectory: for two centroid trajectories that exist over the same sequence of frames, we calculate the distance between the centroids in each frame. If these distances are approximately constant and small, the two centroid trajectories are merged. The merged trajectory is the mean of these centroid trajectories.
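A small sketch of this merging rule follows; the distance thresholds are illustrative assumptions, not values from the paper.

import numpy as np

def should_merge(traj_a, traj_b, max_mean_dist=15.0, max_dist_std=3.0):
    """traj_a, traj_b: (n_frames, 2) centroid positions over the same frames."""
    d = np.linalg.norm(traj_a - traj_b, axis=1)   # per-frame distances
    return d.mean() < max_mean_dist and d.std() < max_dist_std

def merge(traj_a, traj_b):
    return (traj_a + traj_b) / 2.0   # merged trajectory is the mean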

III. SEMANTIC ACTIVITY MODELS

Activity models are learned from the trajectories obtained by tracking. We define the activity of an object as the total of all information on the motion of the object from the time the object enters the scene until it either leaves the scene or remains stationary for so long that it is treated as part of the background for motion detection, or from the time it starts moving again until it either leaves the scene or once more remains stationary for long enough to be treated as part of the background. We only consider the activity of each individual object; the interactions between multiple objects are not described. An activity is described by a spatiotemporal trajectory, which contains not only the spatial information represented by the spatial trajectory but also the temporal information represented by the velocities at each sample point of the trajectory; the spatial trajectory contains only the coordinate subvectors of the feature vectors.

An activity model describes a category of activities with similar semantic meanings. Activity models are also described by spatiotemporal trajectories. Given sufficient motion trajectories (generally thousands of trajectories accumulated over hours of tracking), activity models can be learned through trajectory clustering. An activity model, represented by a trajectory cluster center, is the representation of a category of activities that have common motion features and semantic meanings (how the activity models relate to the semantic meanings is introduced in Sections III-B and IV-A). In this paper, a spectral clustering-based approach is proposed to cluster trajectories to learn activity models.

A. Trajectory Clustering

Clustering of motion trajectories is a relatively new research topic. While early work models trajectories in indirect ways and does not cluster trajectories directly, more recent work handles trajectories in a more direct way by using the whole trajectory for clustering [17]. Current approaches include vector quantization [18], self-organizing feature mapping [19], joint co-occurrence statistics [20], fuzzy c-means clustering [21], agglomerative clustering [22], agglomerative complete-link hierarchical clustering [37], and deterministic annealing pairwise clustering [37], etc.

Spectral clustering is a novel class of clustering algorithms which operate on the matrix of similarities between pairs of input feature vectors instead of on the input feature vectors themselves, and seek an optimal partition of the graph whose vertices represent feature vectors and whose edges represent the similarities. Different versions of spectral clustering exist. In this paper, the one presented in [23], based on multiway normalized graph cutting, is used.

In order to handle a large number of object trajectories, we hierarchically cluster trajectories according to different features. We consider the following two points.

• Activities observed in different road lanes or routes are assigned to different activity models, i.e., lane information is included in the activity models. Therefore, we cluster sample trajectories into different lanes or routes using the spatial information of the trajectories.

• Objects which pass along the same lane or route may have different activities, producing different activity models. For example, a vehicle may move rapidly or slowly, or follow an irregular pathway through a lane or route. These activities are treated as different, although they occur in the same lane or route. Therefore, temporal information is used to further cluster the trajectories which have been assigned to the same lane or route.

a) Spatial-based clustering: In spatial-based clustering, only the spatial parts of the trajectories are used. Each spatial trajectory is linearly interpolated to ensure that all trajectories have the same number of points; the resulting set contains one interpolated spatial trajectory per sample trajectory.

We construct the matrix of similarities between the trajectories in this set based on the pairwise distances between them. For two trajectories, the average coordinate distance between the corresponding points on them is calculated by

(5)

Then, the similarity between the two trajectories is measured by

(6)

where the scaling parameter controls the decay of similarity as distance increases. We perform a simple correlation test to select an appropriate value of the scaling parameter by increasing it in log scale and tracking the correlation of the similarity matrix between adjacent scales; the value is chosen just before the correlation converges.

It should be mentioned that the Hausdorff distance is not adopted here, as it does not preserve sequential information; for example, the Hausdorff distance does not distinguish between two vehicles following the same path but heading in opposite directions.
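Because the bodies of Eqs. (5) and (6) are not legible in this copy, the sketch below assumes the average point-wise Euclidean distance and a Gaussian-style kernel; treat the kernel form and the function names as assumptions.

import numpy as np

def avg_point_distance(t_a, t_b):
    """t_a, t_b: (n_points, 2) spatial trajectories resampled to equal length."""
    return np.linalg.norm(t_a - t_b, axis=1).mean()

def similarity_matrix(trajectories, sigma):
    n = len(trajectories)
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = avg_point_distance(trajectories[i], trajectories[j])
            S[i, j] = S[j, i] = np.exp(-(d / sigma) ** 2)   # similarity decays with distance
    return S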

We cluster the data set into groups using the following procedure.

Step 1: Compute the similarity matrix for the data set by (6).
Step 2: Construct the normalized matrix [23]

(7)

where the normalization uses a diagonal matrix whose i-th diagonal element is the sum of all elements in the i-th row of the similarity matrix.
Step 3: Apply the eigenvalue decomposition to the normalized matrix to find the eigenvectors corresponding to the largest eigenvalues. The eigenvectors are represented as column vectors.
Step 4: Form a new matrix by stacking these eigenvectors in columns, and normalize each row of the matrix to unit length.
Step 5: Cluster the row vectors of this matrix using the fuzzy c-means algorithm, treating each row as a new feature vector corresponding to a trajectory in the original data set.

The eigen-transformation is the key to spectral clustering. Data points which belong to different clusters but are close to each other in the original data space become well separated in the transformed eigenspace. We can find nearly orthogonal row vectors of the new matrix as initial centers for the fuzzy c-means clustering in Step 5. To do this, we select the first row vector randomly, and then select as the second row vector the one with the smallest correlation value with the first row vector. The next row vector, with the smallest sum of correlation values with the already selected row vectors, is selected, and so on, until as many rows as clusters have been selected.
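The following is a hedged sketch of Steps 1-5. Because Eq. (7) is not legible here, the usual symmetric normalization of the similarity matrix is assumed; the clustering of the embedded rows can then be done with any fuzzy c-means routine (for example, the weighted_fcm sketch above with unit weights).

import numpy as np

def spectral_embed(S, k):
    """S: (n, n) similarity matrix; k: number of clusters."""
    d = S.sum(axis=1) + 1e-12
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = D_inv_sqrt @ S @ D_inv_sqrt                   # Step 2 (assumed normalization)
    eigvals, eigvecs = np.linalg.eigh(L)              # Step 3
    V = eigvecs[:, np.argsort(eigvals)[::-1][:k]]     # k eigenvectors of largest eigenvalues
    V = V / np.linalg.norm(V, axis=1, keepdims=True)  # Step 4: row-normalize
    return V                                          # Step 5: cluster the rows of V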

To represent trajectory clusters more efficiently, each cluster of trajectories is represented by a template trajectory, defined as the trajectory with the minimal sum of squared distances to all trajectories in the cluster. We compute the covariance of each point in each template trajectory using the corresponding points in all the sample trajectories corresponding to this template trajectory. Then the envelope of each trajectory cluster is obtained.

The validity of the clustering results is evaluated with a tightness and separation coefficient

(8)

which uses the fuzzy membership of each sample to each cluster. The coefficient is the ratio of the mean squared distance between each input sample and its corresponding template trajectory to the minimum squared distance between any two template trajectories. The clustering result should make the distance between any two template trajectories as large as possible, and the distance between an input sample and its corresponding template trajectory as small as possible. We estimate the number of clusters by checking whether the coefficient can be reduced as the number of clusters is varied.
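Read literally from this description (the body of Eq. (8) is not legible), a tightness and separation coefficient could be sketched as follows; the membership weighting in the numerator is an assumption.

import numpy as np

def tightness_separation(dist_to_templates, memberships, template_dists):
    """dist_to_templates, memberships: (n_samples, n_clusters) arrays;
    template_dists: (n_clusters, n_clusters) distances between template trajectories."""
    n = dist_to_templates.shape[0]
    tightness = (memberships * dist_to_templates**2).sum() / n
    iu = np.triu_indices_from(template_dists, k=1)
    separation = (template_dists[iu]**2).min()        # closest pair of templates
    return tightness / separation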

It is hard to achieve accurate clustering results by a single application of spatial-based clustering to the set of all sample trajectories, whose number is very large, as current clustering approaches cannot accurately cluster large and complicated data which contain many clusters. Hence, we adopt a "divide and conquer" strategy for hierarchically clustering spatial trajectories. In the first layer of spatial-based clustering, the set of trajectories is divided into a small number of large clusters, in which dominant paths and routes may be distinguished. Then, a second layer of spatial-based clustering is applied to cluster the trajectories in each cluster acquired by the first layer of spatial-based clustering. Thus, finer clustering is achieved as a result of the second layer of spatial-based clustering; for example, lanes in a dominant path can be separated, as illustrated in Fig. 2. After the above two layers of spatial-based clustering, different clusters of trajectories representing different lanes or routes of vehicle movements are extracted.

Fig. 2. Hierarchical spatial-based clustering.

b) Temporal-based clustering: In temporal-based clustering, temporal information is used to further cluster trajectories which have been clustered into a lane or a route by spatial-based clustering. Spatiotemporal trajectories, rather than spatial trajectories, are required in temporal-based clustering.

For any two spatiotemporal trajectories, assume that the first contains no more sampling points than the second. The average distance between corresponding points on the two trajectories is calculated by

(9)

The first part of the right-hand side of the equation measures the spatiotemporal similarity between the first trajectory and the subtrajectory composed of the leading points of the second trajectory. The second part approximately measures the spatiotemporal similarity between the first trajectory and the subtrajectory composed of the remaining points of the second trajectory, using the distances from those remaining points to the last point of the first trajectory. In the equation, this greedy alignment is used to take advantage of the temporal features of the trajectories for temporal-based clustering. If both trajectories contain the same number of points, the second part of the equation is omitted.

The procedure of temporal-based clustering is similar to that of spatial-based clustering, except for the measure of the similarity between trajectories.

B. Semantic Activity Models

By clustering trajectories, a set of activity models is learned, where each activity model corresponds to a template trajectory. Our approach for learning activity models maps activities with similar semantic meanings to the same activity model. Thus, each constructed activity model represents the semantic content of the corresponding category of activities. Semantic meanings of activity models are endowed manually with keywords (see Section IV-A for more details). So, our activity models are semantic-based ones.

TABLE I: DATA STRUCTURE OF ACTIVITY DESCRIPTORS

TABLE II: DATA STRUCTURE OF ACTIVITY MODEL DESCRIPTORS

IV. SEMANTIC INDEXING AND RETRIEVAL

The learned activity models are used as indexing keys for accessing the individual activities of objects in video databases.

A. Semantic Indexing

In our retrieval framework, the basic units for retrieval are individual activities. In order to effectively describe object activities, an "activity descriptor" is constructed for each activity. Table I gives the data structure of activity descriptors. As shown in Table I, an object activity descriptor contains the identity of the activity, the identity of the video clip showing the activity, the frame numbers of the birth and death of the activity, the trajectory representing the activity, and the static information of the object such as color, size, etc. For each activity model, an "activity model descriptor" is also constructed. Table II shows the data structure of activity model descriptors. As shown in Table II, an activity model descriptor contains the identity of the activity model, a list of activities which belong to the activity model, the corresponding template trajectory, and the semantic descriptions. The activity descriptors and the activity model descriptors, together with the original video clips, are stored in databases and are used for indexing and retrieval, as the resumptive representation of the original video clips.
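As an illustration only, the two descriptors could be rendered as the data classes below; the field names paraphrase the running text and Tables I and II rather than reproducing the paper's exact schema.

from dataclasses import dataclass, field
from typing import List

@dataclass
class ActivityDescriptor:
    activity_id: int
    clip_id: int              # video clip showing the activity
    birth_frame: int
    death_frame: int
    trajectory: list          # spatiotemporal trajectory of the activity
    obj_color: str            # static object information
    obj_size: float

@dataclass
class ActivityModelDescriptor:
    model_id: int
    activity_ids: List[int] = field(default_factory=list)   # member activities
    template_trajectory: list = field(default_factory=list)
    semantic_keywords: List[str] = field(default_factory=list)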

As mentioned before, an activity model represents the common motion features of a class of similar activities; in our approach, keywords that describe these common motion features are assigned to the activity model manually. The keywords are added to the semantic description item in the activity model descriptor, in order to achieve indexing at the semantic level. Fig. 3 illustrates our hierarchical structure of semantic indexing of object activities. In the structure, each individual activity automatically inherits all the concepts of the activity model to which it belongs. With this structure of indexing, only the semantic description item of an activity model, which is a compressed form of a class of activities, is given manually. The other items in an activity model descriptor and all items in an activity descriptor are obtained automatically. So, compared with current video retrieval approaches with complete manual annotation, our approach greatly reduces the workload of manual annotation.

It is noted that the semantic meanings of speed (i.e., "normal speed," "high speed," and "low speed") can be acquired statistically. The mean and variance of the speed values in the sample trajectories are calculated: "normal speed" lies within a band around the mean, "low speed" lies below this band, and "high speed" lies above it. So, the keywords of the speed information can be automatically added to the semantic description item of each activity model.
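A hedged sketch of this labeling follows; the band is assumed to be one standard deviation around the mean, which the text implies but does not state exactly.

import numpy as np

def speed_keyword(speed, sample_speeds):
    mu, sigma = np.mean(sample_speeds), np.std(sample_speeds)
    if speed < mu - sigma:      # below the band around the mean
        return "low speed"
    if speed > mu + sigma:      # above the band
        return "high speed"
    return "normal speed"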

In order to adapt to the changes of object activities in a complex scene, the activity models are adjusted dynamically over time. When a new video clip is input to a database, the activities of the moving objects are extracted. For each of these activities, the best matching activity model is found. Then it is checked whether this activity can be classified into the best matching activity model. If the distance from the spatiotemporal trajectory of this activity to the template trajectory of the activity model is small enough, the activity is classified to the activity model and added to the activity list of this activity model. Otherwise, this activity is not classified to an existing activity model; instead it is treated as a temporary abnormal activity and added to the abnormal activity model. As shown in Fig. 3, the abnormal activity model is a predefined model in the framework. It is used as a temporary store for all temporary abnormal activities. The spectral algorithm is used periodically to cluster the activities in the abnormal activity model. If there is a cluster which contains enough activities, the activities in this cluster are considered to be normal. A new activity model is then established for this cluster and added to the collection of existing activity models. With the above strategy, our system has the ability to capture the slow and long-term changes of object activities in a complex scene. Thus, the adaptability and robustness of the video retrieval framework are increased.

Fig. 3. Hierarchical structure of semantic indexing.

B. Applicable Query Types

In our framework, the following responses are made to queries from users: the object activities best matching the queries and their associated video clips are found, the birth and death frames of each of the activities are located, and the subvideo between these two frames is supplied to users for browsing. Queries by keywords, multiple object queries, and queries by sketch are supported.

a) Query by keywords: The response to a keyword-based query such as "a blue car ran from south to north at a high speed" has two stages. First, a keyword matching process is used to find the activity models whose semantic descriptions well match the concepts of "ran from south to north" and "high speed." Second, the activity lists of these activity models are searched for the activities whose color feature matches "blue." Activities found in this way form the response to the query.

In the process of keyword matching, a similarity measurement in vector spaces [24] is used to determine the degree of matching between a query and an activity model. It is assumed that an activity model contains a set of keywords, each endowed with a weight, and that there is a set of keywords in the query sentence(s), each also endowed with a weight. Then, the degree of matching between the activity model and the query sentence(s) is

(10)

It should be mentioned that users can input their queries using natural language, but the retrieval process is implemented by keyword matching. Of course, this may cause errors in the retrieval results.
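The matching degree of Eq. (10) is not legible in this copy; the sketch below assumes a cosine-style similarity between the two weighted keyword sets, consistent with the cited vector-space measurement [24], but the exact formula is an assumption.

from math import sqrt

def matching_degree(model_keywords, query_keywords):
    """Both arguments: dict mapping keyword -> weight."""
    common = set(model_keywords) & set(query_keywords)
    num = sum(model_keywords[k] * query_keywords[k] for k in common)
    den = sqrt(sum(w * w for w in model_keywords.values())) * \
          sqrt(sum(w * w for w in query_keywords.values()))
    return num / den if den else 0.0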

The quality of the retrieval results is measured by precision-recall curves, which are generated by increasing the size of the return list and computing the corresponding precision values.

b) Multiple object queries: In our video retrieval framework, multiple object queries are also supported. Given the descriptions of the motions of two or more objects and a temporal restriction between occurrences of these objects, the best matching activities of multiple objects are searched for and output under the temporal restriction.

In this paper, the following two temporal restrictions are considered.

• "Succession": There is succession between occurrences of activities (especially succession of beginning time).
• "Simultaneity": There is time overlap between occurrences of activities.

For instance, we input a query with two objects:
• Object 1: "A white car turned left."
• Object 2: "A red car turned left."

If the restriction of succession of beginning time is used, this query means to find objects 1 and 2, where the start of the activity of object 1 is earlier than that of object 2. If the "simultaneity" restriction is used, this query means that there is time overlap between the occurrences of the activities of objects 1 and 2.

For multiple object queries, the precision-recall curves are affected by the permutation order of the retrieval results, caused by errors in trajectory clustering. Fig. 4 shows an example of the permutation order of two-object retrieval results. In Fig. 4, each letter in a rectangle represents a cluster of trajectories, and each number in a circle represents a trajectory in the cluster represented by the rectangle linked to it by a solid line. It is assumed that trajectory 1 in cluster A together with either of trajectories 1 and 2 in the other cluster satisfies the query, and that trajectory 2 in cluster A together with either of trajectories 1 and 2 in the other cluster also satisfies the query. There are two ways to order the retrieval results.

• Depth-first: The retrieval results are output in depth-first order over the structure in Fig. 4. If it is a clustering mistake that trajectory 1 or 2 is clustered to cluster A, there may be a big drop in precision in the precision-recall curve.

• Breadth-first: The retrieval results are output in breadth-first order over the structure in Fig. 4. If it is an error that trajectory 1 or 2 is in cluster A, there may be two drops of precision in the precision-recall curve, but the depth of the two drops is smaller than that produced by depth-first ordering. The precision-recall curve is improved overall, compared with that produced by depth-first ordering.

Fig. 4. Permutation order of retrieval results.

c) Query by sketch: There are queries that cannot be expressed by keywords, such as "a vehicle moved in this way." A sketch-based query is designed to handle such query requirements. A drawing interface allows users to draw motion trajectories which approximately represent the activities that they expect to query. Sketches drawn by users are matched to the spatial trajectories associated with activity models. The best matching activity models are found. Then, the activity lists of these activity models are searched for the best matching activities.

In this paper, a method for matching trajectories drawn by users to the spatial template trajectories of activity models (or to the spatial trajectories of activities) is proposed. Let one trajectory be drawn by the user and let the other be the spatial template trajectory of an activity model. In order to measure the spatial similarity between the two trajectories, we re-sample the drawn trajectory to obtain a trajectory with the same number of points as the template trajectory, scale the re-sampled trajectory to the same length as the template trajectory, and then translate the scaled trajectory to fit the template trajectory as closely as possible.

• Re-sampling: On the drawn trajectory, we sample as many points as the template trajectory contains; the sampled points are linked to form a trajectory that represents the drawn one, where each sampled point is prorated along the corresponding line segment of the drawn trajectory.

• Scaling: The re-sampled trajectory is scaled by the ratio of the lengths of the two trajectories, so that it has the same length as the template trajectory.

• Translation: The scaled trajectory is translated to fit the template trajectory. The sum of squares of the distances between corresponding points of the translated trajectory and the template trajectory is

(11)

The minimum of this sum is reached when its partial derivatives with respect to the translation are zero:

(12)

The solution to (12) is

(13)

Then, the spatial similarity between the two trajectories is measured by the minimal sum of squared distances, with the translation given by (13). This spatial similarity is used to match spatial trajectories for sketch-based queries.
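A hedged sketch of this re-sample / scale / translate matching follows. Since Eqs. (11)-(13) are not legible here, the sketch uses the standard least-squares result that the optimal translation aligns the two centroids; names and array layouts are illustrative.

import numpy as np

def polyline_length(traj):
    return np.linalg.norm(np.diff(traj, axis=0), axis=1).sum()

def resample(traj, n_points):
    """traj: (m, 2) polyline; returns n_points prorated along its arc length."""
    seg = np.linalg.norm(np.diff(traj, axis=0), axis=1)
    cum = np.concatenate([[0.0], np.cumsum(seg)])
    targets = np.linspace(0.0, cum[-1], n_points)
    return np.column_stack([np.interp(targets, cum, traj[:, k]) for k in range(2)])

def sketch_dissimilarity(drawn, template):
    """Smaller return value = better match between drawn sketch and template."""
    r = resample(drawn, len(template))                        # re-sampling step
    r = r * (polyline_length(template) / polyline_length(r))  # scaling step
    r = r + (template.mean(axis=0) - r.mean(axis=0))          # translation: align centroids
    return float(((r - template) ** 2).sum())                 # residual sum of squares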

V. EXPERIMENTAL RESULTS

In order to verify the efficiency and accuracy of our video retrieval framework, different video sequences from a crowded traffic scene were tested. First, the performance of the tracking algorithm is introduced briefly. The results of learning activity models are then illustrated. Finally, the results of video retrieval are demonstrated.

A. Tracking

The experimental results of our clustering-based tracking are shown in Fig. 5. In Fig. 5(a), an example of pixel segmentation and clustering in one frame is shown. The image size is fixed to 320 × 240 pixels. The extracted trajectories are illustrated in Fig. 5(b), represented by black lines. In the tested video sequences, occlusions between moving vehicles, and between moving vehicles and static street attachments such as street lamps and trees, occur frequently. The correctness of the tracking results is checked by our own visual judgment. For the tested video sequences, 1216 motion trajectories were produced with our tracking algorithm. Of these, 1184 motion trajectories were seen to be correct. So, the correct rate of tracking for the tested video sequences is 97.4%. This indicates that the proposed tracking algorithm is suitable for traffic surveillance. The major reasons why 2.6% of the trajectories are unacceptable are local disturbances from groups of pedestrians and long-lasting occlusions between moving vehicles. Our tracker runs at a speed of 5–10 frames per second on a P4 1.8-GHz computer with a moderate number of vehicles present in the scene. More details on the performance of the tracking algorithm are given in [16].

B. Learning of Activity Models

The results of learning activity models are displayed belowin both quantitative form and visual form. First, we statisti-cally compare the spectral approach with the fuzzy -means ap-proach. Then, some activity learning results are visually shown.

In the experiments, we use the tightness and separation co-efficient in (8) to evaluate the performance of clustering.Smaller values of indicate superior performance. For bothfuzzy means clustering and spectral clustering, we examine thevalues of in the first layer of spatial-based clustering, as thenumber of clusters is changed. The test is repeated for ten timesfor a given number of clusters. The results are shown in Fig. 6.In Fig. 6, only a single line is used to show the results of spectralclustering, because the results are the same for all times we re-peat the test. However, in the fuzzy -means clustering, the valueof always varies when the test is repeated. The three linesshow respectively the minimal, average and maximal values of

. Large variances are observed between the results acquiredin different testing times. From Fig. 6, we can see that thevalues of spectral clustering are much smaller than those of thefuzzy -means clustering. So, spectral clustering not only out-performs the fuzzy -means clustering by one order of magni-

Page 22: 142 IEEE TRANSACTIONS ON LEARNING ... - baskent.edu.tr

1176 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 16, NO. 4, APRIL 2007

Fig. 5. Results of tracking: (a) Pixel segmentation and clustering in one frame; (b) trajectories.

Fig. 6. Comparison between fuzzy c-means clustering and spectral clustering.

The values of the coefficient also tell us how many clusters we should select for spectral clustering. From Fig. 6, we find a leap in the coefficient as the number of clusters increases from 17 to 18. This suggests that the proper number of clusters for the first layer of spatial-based spectral clustering is 17.
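As an illustration of this model-selection step, the sketch below uses a crisp Xie-Beni-style tightness-and-separation index as a stand-in for the coefficient in (8) and picks the number of clusters just before the largest leap of the index; scikit-learn's SpectralClustering and the trajectory-feature matrix X are assumptions, not the paper's implementation.

import numpy as np
from sklearn.cluster import SpectralClustering

def tightness_separation(X, labels):
    """Crisp Xie-Beni-style index: compactness / separation (smaller is better)."""
    ids = np.unique(labels)
    centers = np.array([X[labels == c].mean(axis=0) for c in ids])
    compact = sum(((X[labels == c] - centers[i]) ** 2).sum() for i, c in enumerate(ids))
    separation = min(((centers[i] - centers[j]) ** 2).sum()
                     for i in range(len(ids)) for j in range(i + 1, len(ids)))
    return compact / (len(X) * separation)

def select_num_clusters(X, k_range=range(2, 25)):
    """Return the k after which the index leaps upward most strongly."""
    scores = {}
    for k in k_range:
        labels = SpectralClustering(n_clusters=k, affinity="nearest_neighbors",
                                    random_state=0).fit_predict(X)
        scores[k] = tightness_separation(X, labels)
    ks = sorted(scores)
    leaps = {ks[i]: scores[ks[i + 1]] - scores[ks[i]] for i in range(len(ks) - 1)}
    return max(leaps, key=leaps.get), scores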

Fig. 7 shows the results of the second layer of spatial-based clustering. Fig. 7(a) illustrates the best result of the fuzzy c-means clustering in ten runs, and Fig. 7(b) the result of spectral clustering. In Fig. 7(a) and (b), each cluster is represented by its template trajectory overlaid on the original trajectories. As shown in Fig. 7(a), some lanes or routes are not separated by the fuzzy c-means clustering. However, as shown in Fig. 7(b), all lanes and routes are correctly separated by the spectral clustering.

Fig. 8 shows some contrasts between the results of the first and second layers of spatial-based spectral clustering. Some dominant paths of vehicle motion extracted using the first layer of spatial-based spectral clustering are shown in Fig. 8(a), represented in envelope form. There are more extracted dominant paths than displayed here; we just choose the six most frequent paths for better visualization. Moreover, to avoid overlapping, the envelopes shown here are narrower than the actual ones. Fig. 8(b) shows some results of the second layer of spatial-based spectral clustering with the envelope representation, where each of the two straight dominant paths shown in Fig. 8(a) is further divided into three lanes.

Fig. 9 shows the 76 activity models finally learned. The spatial-temporal template trajectories corresponding to these activity models are represented by white lines. As shown in Fig. 9, the learned activity models are consistent with the sample trajectories, so the results can be treated as satisfactory.

In our experiment, 156 keywords are assigned manually to these activity models. We only need to manually organize the predicates (containing appropriate verbs) to describe the 76 activity models. It is not necessary to annotate the 1216 activities.

C. Retrieval

In the following, we illustrate visually and statistically the retrieval results for keywords-based queries, multiple object queries, and sketch-based queries.

1) Keywords-Based Retrieval: With the learned activity models, we can retrieve videos at the semantic level. Fig. 10 shows an example of keyword-based video retrieval. If a user's query contains only the keyword "turned left," the vehicle activities shown in Fig. 10(a)–(e) are all matched. The video clips which show these activities are retrieved, and the associated frames are located and supplied to the user for browsing. When a query from a user contains more descriptive keywords such as "turned left and then ran toward west," the activities shown in Fig. 10(a)–(d) are eliminated because their semantic descriptions do not match, and the activities shown in Fig. 10(e) are the retrieval results. If static information about objects is further added to the query, for example "a white car" is specified, then by matching against the Obj_Color item in the activity descriptors, the first, second, and third vehicle activities shown in Fig. 10(e) are eliminated. Only the fourth activity shown in Fig. 10(e) and the video clip which shows this activity are retrieved. From this hierarchical query example, we see that our video retrieval can retrieve activities at different semantic levels.
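A minimal sketch of how such a hierarchical keyword query might be narrowed against the activity descriptors is given below; the Obj_Color item and the inherited semantic description follow the text, while the class name, matching rule, and field layout are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class ActivityDescriptor:
    activity_id: int
    semantics: str   # description inherited from the learned activity model,
                     # e.g. "turned left and then ran toward west"
    obj_color: str   # the Obj_Color item, e.g. "white"
    clip: tuple      # (first frame, last frame) of the associated video clip

def matches(keywords, color, d):
    """An activity matches when every query keyword occurs in its semantic
    description and, if a color is specified, its Obj_Color agrees."""
    return all(k in d.semantics for k in keywords) and (color is None or d.obj_color == color)

def retrieve(index, keywords, color=None):
    """Progressively richer queries shrink the result set, as in Fig. 10:
    ["turned left"] -> (a)-(e); add "ran toward west" -> (e); additionally
    specify color "white" -> a single activity and its video clip."""
    return [d for d in index if matches(keywords, color, d)]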

For the query "a red car turned left," the keyword "turned left" is matched to the semantic description item of the corresponding activity models, and the keyword "red" is matched to the Obj_Color item in the corresponding activities. There are 40 activities found, of which 35 satisfy the query. Fig. 11 shows the precision-recall curve for this retrieval, where the size of the return list is increased in steps of length 5. From the precision-recall curve, we can see that the precision stays above 80%. The retrieval result overall satisfies the requirement of the query.
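The precision-recall points reported here can be generated as follows, assuming a ranked return list of activity IDs and a ground-truth set of relevant IDs (both hypothetical names); this mirrors the description of Fig. 11, with the list size growing in fixed steps.

def precision_recall_points(ranked_ids, relevant_ids, step=5):
    """Precision and recall as the size of the return list grows by 'step'."""
    relevant = set(relevant_ids)
    points = []
    for size in range(step, len(ranked_ids) + 1, step):
        hits = sum(1 for r in ranked_ids[:size] if r in relevant)
        points.append((hits / size, hits / len(relevant)))  # (precision, recall)
    return points

# For "a red car turned left": 40 activities returned, 35 of them relevant,
# so the curve is sampled at list sizes 5, 10, ..., 40.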



Fig. 7. Results of second layer of spatial-based clustering: (a) fuzzy c-means; (b) spectral clustering.

Fig. 8. Contrast between first and second layers of spatial-based spectral clustering: (a) first layer; (b) second layer.

Fig. 9. Results of learning activity models.

For the query "a white car ran from south to north by the right lane," the keywords "ran from south to north by the right lane" are matched to the semantic description item of the corresponding activity models, and the keyword "white" is matched to the Obj_Color item in the corresponding activities. There are 284 activities found, of which 280 satisfy the query. Fig. 12 shows the precision-recall curve for this retrieval, where the size of the return list is increased in steps of length 10. The precision-recall curve is nearly flat at a high precision, and the retrieval result is good.

2) Multiple Object Query: For multiple object queries, the "succession" and "simultaneity" restrictions are considered in our experiments. For each restriction, retrieval results are ordered in depth-first and breadth-first ways.

a) "Succession" restriction: We consider the following two-object query: "a car ran from south to north" and "a car turned right," where the restriction of succession of beginning time is used. There are 29 pairs of activities found, of which 22 satisfy the query. Fig. 13 shows the precision-recall curve for this retrieval with the depth-first permutation order (introduced in Section IV-B1b), where the size of the return list is increased in steps of length 3. The precision values are all above 50%, but the curve shows a big drop at the second point. The reason for the drop is that a trajectory satisfying the first object query is assigned to an incorrect cluster by the trajectory clustering algorithm. Fig. 14 shows the precision-recall curve with the breadth-first permutation order. From Fig. 14, we can see that the curve becomes smoother and the precision values are all above 60%; the precision-recall curve is overall improved.
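Since Section IV-B1b is not reproduced here, the sketch below shows one plausible reading of the two result orderings for a two-object query under the succession restriction: depth-first exhausts all partners of one first-object activity before moving on, while breadth-first takes one pair per first-object activity in turn. Activities are represented as hypothetical dicts with "id" and "begin" fields; replacing the begin-time test with a temporal-overlap test would give the simultaneity restriction.

from collections import defaultdict
from itertools import product

def succession_pairs(acts_a, acts_b):
    """All (a, b) pairs of matched activities where b begins after a."""
    return [(a, b) for a, b in product(acts_a, acts_b) if a["begin"] < b["begin"]]

def depth_first_order(pairs):
    """List every partner of one first-object activity before the next one."""
    return sorted(pairs, key=lambda ab: (ab[0]["id"], ab[1]["begin"]))

def breadth_first_order(pairs):
    """Round-robin over first-object activities, one pair from each in turn."""
    groups = defaultdict(list)
    for a, b in sorted(pairs, key=lambda ab: (ab[0]["id"], ab[1]["begin"])):
        groups[a["id"]].append((a, b))
    ordered, queues = [], list(groups.values())
    while any(queues):
        for q in queues:
            if q:
                ordered.append(q.pop(0))
    return ordered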

For the query with three objects, "a car ran from east to south," "a car turned right," and "a car ran from south to north," with the succession restriction, 99 triples of activities are found, of which 89 satisfy the query. Figs. 15 and 16 show the precision-recall curves for the retrievals with the depth-first and breadth-first permutation orders respectively, where the size of the return list is increased in steps of length 9. For the curve with the depth-first order, the precision values are overall above 80%, except the first value, which lies between 70% and 80%. For the curve with the breadth-first order, all the precision values are above 80%. So, the retrieval result with the breadth-first order is better than that with the depth-first order.

b) "Simultaneity" restriction: We replace the "succession" restriction in the above two-object queries with the "simultaneity" restriction. For the two-object query, there are 14 matched pairs of activities, of which 12 satisfy the query. Figs. 17 and 18 show the precision-recall curves with



Fig. 10. Results of keyword-based queries: (a) "Turned left and then ran toward north." (b) "Turned left and then turned round toward south." (c) "Turned left and then turned round toward north." (d) "Turned left; turned round toward north; and then parked." (e) "Turned left and then ran toward west."

the depth-first and breadth-first orders, respectively, where the size of the return list is increased in unit steps. The lowest precision in both curves is 50%; however, the precision values in the curve with the breadth-first order are overall higher than those with the depth-first order.

Fig. 11. Precision-recall curve for the keywords-based query 1.

Fig. 12. Precision-recall curve for the keywords-based query 2.

Fig. 13. Precision-recall curve for two-object query with succession restriction and depth-first order.

For the three-object query with the simultaneity restriction, there are only two matched triples of activities, both of which satisfy the query. Fig. 19 shows the precision-recall curve. The precision of the retrieval is 100%.

3) Sketch-Based Query: Fig. 20 shows a trajectory drawn by a user. For this query, 153 activities are found, of which 152 satisfy the query. Fig. 21 shows the precision-recall curve for this retrieval, where the size of the return list is increased in steps of length 10. Fig. 22 shows another trajectory drawn by a user. For this query, five activities are found, all of which satisfy the query. Fig. 23 shows the precision-recall curve for this retrieval, where the size of the return list is increased in unit steps. From these two examples, we can see that the retrieval results



Fig. 14. Precision-recall curve for two-object query with succession restriction and breadth-first order.

Fig. 15. Precision-recall curve for three-object query with succession restriction and depth-first order.

Fig. 16. Precision-recall curve for three-object query with succession restriction and breadth-first order.

are satisfactory for the sketch-based queries, and the algorithm for matching spatial trajectories is robust.

VI. CONCLUSION

In this paper, a semantic-based retrieval framework for traffic video sequences has been proposed. A clustering-based tracking algorithm is used to obtain trajectories. Based on hierarchical spectral clustering of the trajectories, we obtain a set of activity models to which semantic meanings are given. With semantic indexing, our retrieval framework provides a query interface at the semantic level. Keywords-based queries, multiple object

Fig. 17. Precision-recall curve for two-object query with simultaneity restriction and depth-first order.

Fig. 18. Precision-recall curve for two-object query with simultaneity restriction and breadth-first order.

Fig. 19. Precision-recall curve for three-object query with simultaneity restriction.

Fig. 20. Trajectory 1 drawn by a user.

queries, and sketch-based queries are supported. In our framework, trajectory analysis and video retrieval are combined



Fig. 21. Precision-recall curve for Fig. 20.

Fig. 22. Trajectory 2 drawn by a user.

Fig. 23. Precision-recall curve for Fig. 22.

effectively, and the workload of manual annotation is greatly reduced. The framework has been experimentally tested in a crowded traffic scene, with good results.

REFERENCES

[1] S. F. Chang, W. Chen, H. J. Meng, H. Sundaram, and D. Zhong, "A fully automated content-based video search engine supporting spatiotemporal queries," IEEE Trans. Circuits Syst. Video Technol., vol. 8, no. 5, pp. 602–615, Sep. 1998.

[2] F. I. Bashir, A. A. Khokhar, and D. Schonfeld, "Segmented trajectory based indexing and retrieval of video data," in Proc. Int. Conf. Image Processing, Sep. 2003, vol. 3, pp. 623–626.

[3] E. Sahouria and A. Zakhor, "Motion indexing of video," in Proc. Int. Conf. Image Processing, Santa Barbara, CA, Oct. 1997, vol. 2, pp. 526–529.

[4] W. Chen and S. F. Chang, "Motion trajectory matching of video objects," Proc. SPIE, vol. 3972, pp. 544–553, Jan. 2000.

[5] Y. K. Jung, K. W. Lee, and Y. S. Ho, "Content-based event retrieval using semantic scene interpretation for automated traffic surveillance," IEEE Trans. Intell. Transport. Syst., vol. 2, no. 3, pp. 151–163, Sep. 2001.

[6] S. Dagtas, W. A. Khatib, A. Ghafoor, and R. L. Kashyap, "Models for motion-based video indexing and retrieval," IEEE Trans. Image Process., vol. 9, no. 1, pp. 88–101, Jan. 2000.

[7] S. Kamijo, Y. Matsushita, K. Ikeuchi, and M. Sakauchi, "Traffic monitoring and accident detection at intersections," IEEE Trans. Intell. Transport. Syst., vol. 1, no. 2, pp. 108–118, Jun. 2000.

[8] B. Heisele, U. Kressel, and W. Ritter, "Tracking non-rigid, moving objects based on color cluster flow," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Jun. 1997, pp. 257–260.

[9] A. E. C. Pece, "From cluster tracking to people counting," in Proc. IEEE Workshop Performance Evaluation of Tracking and Surveillance, Copenhagen, Denmark, Jun. 2002, pp. 9–17.

[10] R. Cutler and M. Turk, "View-based interpretation of real-time optical flow for gesture recognition," in Proc. IEEE Int. Conf. Automatic Face and Gesture Recognition, 1998, pp. 416–421.

[11] B. Maurin, O. Masoud, and N. Papanikolopoulos, "Monitoring crowded traffic scenes," in Proc. IEEE Int. Conf. Intelligent Transportation Systems, Singapore, 2002, pp. 19–24.

[12] J. F. Colen and T. Hutcheson, "Reducing the time complexity of the fuzzy c-means algorithm," IEEE Trans. Fuzzy Syst., vol. 2, no. 2, pp. 263–267, Apr. 2002.

[13] N. R. Pal and J. C. Bezdek, "Complexity reduction for large image processing," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 32, no. 5, pp. 598–611, Oct. 2002.

[14] S. Eschrich, J. Ke, L. O. Hall, and D. B. Goldgof, "Fast accurate fuzzy clustering through data reduction," IEEE Trans. Fuzzy Syst., vol. 11, no. 2, pp. 262–270, Apr. 2003.

[15] J. Joseph and J. LaViola, "Double exponential smoothing: An alternative to Kalman filter-based predictive tracking," in Proc. Immersive Projection Technology and Virtual Environments, 2003, pp. 199–206.

[16] W. Hu, X. Xiao, Z. Fu, D. Xie, T. Tan, and S. Maybank, "A system for learning statistical motion patterns," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 9, pp. 1450–1464, Sep. 2006.

[17] I. N. Junejo, O. Javed, and M. Shah, "Multi feature path modeling for video surveillance," in Proc. Int. Conf. Pattern Recognition, Cambridge, U.K., 2004, vol. II, pp. 716–719.

[18] N. Johnson and D. Hogg, "Learning the distribution of object trajectories for event recognition," Image Vis. Comput., vol. 14, no. 8, pp. 609–615, 1996.

[19] J. Owens and A. Hunter, "Application of the self-organizing map to trajectory classification," in Proc. IEEE Int. Workshop Visual Surveillance, 2000, pp. 77–83.

[20] C. Stauffer and W. E. L. Grimson, "Learning patterns of activity using real-time tracking," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 8, pp. 747–757, Aug. 2000.

[21] W. M. Hu, D. Xie, T. N. Tan, and S. Maybank, "Learning activity patterns using fuzzy self-organizing neural network," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 34, no. 3, pp. 1618–1626, Jun. 2004.

[22] D. Makris and T. Ellis, "Path detection in video surveillance," Image Vis. Comput., vol. 20, no. 12, pp. 895–903, 2002.

[23] A. Y. Ng, M. I. Jordan, and Y. Weiss, "On spectral clustering: Analysis and an algorithm," in Advances in Neural Information Processing Systems, T. Dietterich, S. Becker, and Z. Ghahramani, Eds. Vancouver, BC, Canada: MIT Press, 2001, vol. 14, pp. 849–856.

[24] S. Simone and J. Ramesh, "Integrated browsing and querying for image databases," IEEE Multimedia, vol. 7, no. 3, pp. 26–39, Jul. 2000.

[25] Y. Chikashi, N. Yoshihiro, and T. Katsumi, "Querying video data by spatio-temporal relationships of moving object traces," in Proc. 6th IFIP 2.6 Working Conf. Visual Database Systems, Brisbane, Australia, May 2002, pp. 357–371.

[26] N. Dimitrova and F. Golshani, "Rx for semantic video database retrieval," in Proc. 2nd ACM Int. Conf. Multimedia, San Francisco, CA, 1994, pp. 219–226.

[27] M. Haag and H.-H. Nagel, "Incremental recognition of traffic situations from video image sequences," Image Vis. Comput., vol. 18, no. 2, pp. 137–153, Jan. 2000.

[28] M. Mohnhaupt and B. Neumann, "On the use of motion concepts for top-down control in traffic scenes," in Proc. Eur. Conf. Computer Vision, Antibes, France, 1990, pp. 598–600.

[29] B. Bell and L. F. Pau, "Context knowledge and search control issues in object-oriented prolog-based image understanding," Pattern Recognit. Lett., vol. 13, no. 4, pp. 279–290, 1992.

[30] T. Huang, D. Koller, J. Malik, G. Ogasawara, B. Rao, S. Russell, and J. Weber, "Automatic symbolic traffic scene analysis using belief networks," in Proc. 12th Nat. Conf. Artificial Intelligence, Seattle, WA, 1994, pp. 966–972.



[31] H. Buxton and S. G. Gong, "Visual surveillance in a dynamic and uncertain world," Artif. Intell., vol. 78, no. 1–2, pp. 431–459, Oct. 1995.

[32] P. Remagnino, T. N. Tan, and K. D. Baker, "Multi-agent visual surveillance of dynamic scenes," Image Vis. Comput., vol. 16, no. 8, pp. 529–532, 1998.

[33] Z. Q. Liu, L. T. Bruton, J. C. Bezdek, J. M. Keller, S. Dance, N. R. Bartley, and C. Zhang, "Dynamic image sequence analysis using fuzzy measures," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 31, no. 4, pp. 557–571, Aug. 2001.

[34] A. Kojima, T. Tamura, and K. Fukunaga, "Natural language description of human activities from video images based on concept hierarchy of actions," Int. J. Comput. Vis., vol. 50, no. 2, pp. 171–184, 2002.

[35] M. Thonnat and N. Rota, "Image understanding for visual surveillance applications," in Proc. Int. Workshop Cooperative Distributed Vision, 1999, pp. 51–82.

[36] D. Ayer and M. Shah, "Monitoring human behavior from video taken in an office environment," Image Vis. Comput., vol. 19, no. 12, pp. 833–846, 2001.

[37] D. Vasquez and T. Fraichard, "Motion prediction for moving objects: A statistical approach," in Proc. IEEE Int. Conf. Robotics Automation, Apr. 26–May 1, 2004, vol. 4, pp. 3931–3936.

Weiming Hu received the Ph.D. degree from the Department of Computer Science and Engineering, Zhejiang University, China.

From April 1998 to March 2000, he was a Postdoctoral Research Fellow with the Institute of Computer Science and Technology and Founder of the Research and Design Center, Peking University, China. Since April 2000, he has been with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing. Currently, he is a Professor and a Ph.D. Student Supervisor in the laboratory. His research interests are in visual surveillance, neural networks, filtering of Internet objectionable information, retrieval of multimedia, and understanding of Internet behaviors. He has published more than 80 papers in national and international journals and international conferences.

Dan Xie received the B.S. degree in automatic control and the M.S. degree from the Beijing University of Aeronautics and Astronautics (BUAA), Beijing, China, in 2001 and 2004, respectively. He is currently pursuing the Ph.D. degree in the Computer Science Department, University of Massachusetts, Amherst.

His current research interests include pattern recognition, machine learning, and neural networks.

Zhouyu Fu received the B.Sc. degree in information engineering from Zhejiang University, China, in 2001, and the M.Sc. degree in pattern recognition from the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, in 2004. He is currently pursuing the Ph.D. degree at the Department of Information Engineering, Research School of Information Sciences and Engineering, Australian National University.

His research interests include pattern recognition and computer vision.

Wenrong Zeng received the B.S. degree from the Department of Electronics, Peking University, China, in July 2006. She is currently pursuing the M.S. degree at the Institute of Semiconductors, Chinese Academy of Sciences, Beijing.

Her research interests include computer vision, pattern recognition, and artificial neural networks.

Steve Maybank (SM'05) received the B.A. degree in mathematics from King's College, Cambridge, U.K., in 1976, and the Ph.D. degree in computer science from Birkbeck College, University of London, London, U.K., in 1988.

He joined the Pattern Recognition Group at Marconi Command and Control Systems, Frimley, U.K., in 1980, and then joined the GEC Hirst Research Centre, Wembley, U.K., in 1989. From 1993 to 1995, he was a Royal Society/EPSRC Industrial Fellow in the Department of Engineering Science, University of Oxford, Oxford, U.K. In 1995, he joined the University of Reading, Reading, U.K., as a Lecturer in the Department of Computer Science. In 2004, he became a Professor at the School of Computer Science and Information Systems, Birkbeck College. His research interests include the geometry of multiple images, camera calibration, visual surveillance, information geometry, and the applications of statistics to computer vision.