
RC21444 (96156) 18 November 1998 Computer Science

IBM Research Report

Multi-Search of Video Segments Indexed by Time-Aligned Annotations of Video Content

Anni Coden, Norman Haas, Robert Mack

IBM Research Division
Thomas J. Watson Research Center

P.O. Box 704
Yorktown Heights, NY 10598

Research Division
Almaden - Austin - Beijing - Haifa - T. J. Watson - Tokyo - Zurich

LIMITED DISTRIBUTION NOTICE: This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties). Copies may be requested from IBM T. J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY 10598 USA (email: [email protected]). Some reports are available on the internet at http://domino.watson.ibm.com/library/CyberDig.nsf/home .


Abstract:

This paper describes an approach to the design and implementation of a video content management system that enables the efficient indexing, storing, searching and browsing of video segments defined by annotation of video content attributes. The annotation values represent the durations of independently indexed and potentially overlapping attributes of the video content, such as camera motion, shot distance and face count, and associated information, such as spoken words in the soundtrack. These values define segments of video of any granularity, from entire programs down to arbitrarily short segments. Boolean searches on these multiple indexed criteria can be performed, and temporal intervals corresponding to video segments can be computed that reflect the union and intersection of search results meeting these Boolean conditions. We propose a relational database model for storing indexed attributes which is flexible because it does not index or store any fixed temporal segmentation of the video. Rather, annotation values can be stored in any way appropriate for an open-ended set of annotation methods representing attributes of video content. The system is part of a project to exploit a set of automatic annotation methods for the visual attributes of video content, and to develop a component of a complete studio for broadcasting digital HDTV (High Definition Television).

Keywords: Media Content Management, Video Database, Query System

Introduction

Digital video asset management systems are becoming more pervasive as analog video tape archives are converted to digital format and new videos are produced in digital format. Key video management tasks are ingest, annotation or indexing, database modeling, and search and browsing of stored content. Commercial systems are beginning to develop technological solutions for these video management tasks. A key challenge is annotation: developing search indices of descriptors for video content. Aguierre-Smith and Davenport put the challenge this way: “[The organization of a] video database system is a challenging information management problem because the way that a particular sequence of images is described effects [how] a maker will retrieve it and incorporate it into a movie” [1].

However, the nature and type of metadata being stored is changing rapidly. Historically, “card catalog” metadata were associated with a video, such as title and production date. Such metadata attributes are descriptions of the video as a whole. Video asset management systems being developed in research projects and in some commercial systems (examples include ISLIP [2], based on the Informedia research prototype [3], AVID [4], VIRAGE [5], and EXCALIBUR [6]) are beginning to go beyond limited sets of metadata attributes and index video content at a video segment granularity (time unit) considerably finer than traditional program clips. With the advance of automatic indexing of advanced content-related features in video programs (e.g., detecting faces or other objects, the motion of these objects, etc.), it is becoming increasingly necessary to index metadata of different types and relate them to varying granularities in the database: to “index into the clip” [7]. The types of annotations can be open-ended, they can overlap (e.g., multiple visual events can occur within the same segment of video), and the duration of the annotation features can vary widely depending on the content of the video. This requirement adds complexity to the database storing the metadata and to the corresponding processes of querying and browsing.

What does it mean to retrieve a segment of video based on a query specification that refers to multiple annotation criteria? How can we enable users to browse video content in a way that closely relates to the annotation criteria specified in a query specification? We address these issues in the rest of the paper.

In the system we describe here, video indexing is done by storing start and stop times of video segments corresponding to the value of a particular annotation feature (e.g., the number of faces in a segment of video). A given segment of video can contain overlapping segments corresponding to overlapping annotation values. In our system, the operations of indexing, searching and browsing can be defined over these segments in flexible ways: e.g., indexing may store segment specifications for a set of annotation features. A query specification may return a set of computed segments corresponding to some Boolean combination of annotation values. Browsing can consist of playback of these computed result segments, but also allows expanding the limits of playback to the larger video segments containing the result segments, or contracting the limits to play back smaller segments contained in the result segment. Our system stores and operates on segment specifications and, as we will show, provides great flexibility in searching and browsing video.

Annotating Video Content

The full spectrum of subject matter and format of videos that might occur in an unconstrained collection of video is quite diverse. Consider what is typically presented each day on television broadcast networks: news and weather reports, dramas, sports, sitcoms, concerts, wildlife documentaries and movies. The content of these programs varies widely in the amount of visual action, the importance of the spoken material to the visual image, the abundance or absence of humans, the presence of superimposed text, etc. In addition to program content, which typically runs for tens of minutes, a collection would include commercials and station promos, which might be at most one minute in length, and as short as ten seconds or less. The length of shots would vary widely over this material. Also, an unconstrained collection would include produced, edited and raw footage. These types of video differ greatly, in that raw footage tends to have very much longer shots, to contain segments in which there is no action or no meaningful action, and not to have closed-caption text. The challenge is to develop an approach to annotation that is open-ended, increasingly content-driven, and takes into account the temporal duration in which annotation criteria apply within the video.

Some promising work has been done on how to build systems that do represent content-driven segmentation and annotation of videos.

The stratification system of Aguierre-Smith and Davenport [1] uses a set of predefined keywords and free text to describe parts of a video. The notion is that the keywords provide a context for the free text. For example, a keyword may say “house” or “garden”, whereas the free text associated with a frame may specify “someone eating”. Searching for the word “eating” will find two separate segments; the keywords give the context - eating in the house or in the garden. The free text is associated with a single frame, whereas a set of frames is associated with each keyword. Multiple keywords can be associated with a set of frames. All the data is stored in ASCII files and annotations are purely text-based.

Such a system, however, does not lend itself to the inclusion of other types of function-valued metadata; in particular, a value cannot be associated with a metadata attribute. For instance, it is not clear how to extend the system to store the information that there are ‘three’ faces visible in a particular sequence of frames. The system allows for only one free text description per frame and hence, as the author points out, “simple annotations depend on the linearity of the medium and random access does not give contextual information to the search results.” A content management system has to allow for all types of data to be stored and searched, and for the results to be viewed in a context appropriate for the application.

Davis [8] makes another case for video stream-based annotations. In particular, he points out that “different segmentations are necessary as video content/features continue across a given segmentation or exist within a segmentation boundary. Stream based annotations generate new segmentations by virtue of their unions, intersections, overlaps, etc. In short, in addressing the challenges of representing video for large archives what we need are representations which make clips, not representations of clips.” After laying this groundwork, his system starts with a fixed segmentation based on grouping sets of frames into shots represented as icons, which are used as building blocks for a visual language geared towards understanding the semantics and syntax of video. Similarly to the stratification system, a video in this system is described by a fixed (but potentially large) set of words or icons. A drawback is that adding new annotations implies extending the language and all the functions operating on it.

Commercial systems ([2], [4], [5], [6]) similarly are based on storing annotations that imply fixed segmentation criteria for the video, and on retrieving this fixed segmentation regardless of the annotation criteria specified in the query specification. Furthermore, the time segments are defined at ingest time and cannot be changed later in the process.

Multisearch and Browse Prototype

Our prototype system does not require fixed segmentation and can accommodate different types of annotation values: annotations asserting the values of metadata attributes can be made using any scalar or vector data type as appropriate - text, number, or graph, to name the most common ones.

For instance, the following attributes are presently associated with a video: keywords, free text, motion direction, motion magnitude, face count, face size, shot distance, OCR text, and speech transcription. For each annotation, the exact interval of time over which a metadata attribute and its value is true is recorded, and the value itself is associated with this time interval. Clearly, the time intervals recorded are of varying length. Hence, multiple annotations can overlap within any given segment of video; Figure 1 illustrates this capability in schematic form.

Figure 1 Multiple overlapping annotations time-aligned with video.

If appropriate, spatial data can also be stored in conjunction with an annotation. For instance, the rectangle in which three faces appear may be known and could be recorded. One can impose an x-y coordinate system on a video in which the x coordinate coincides with the time line, and hence only the y-coordinate needs to be stored.
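
As an illustration only (the actual prototype schema is not reproduced here), such a time-aligned annotation can be pictured as a single stored tuple carrying the video identifier, the attribute keyword, its value, the in and out points, and an optional vertical extent. All names in the following Java sketch are hypothetical.

// Hypothetical sketch of one stored annotation tuple; names are illustrative,
// not the actual prototype schema. The value is kept as a string for simplicity;
// in practice it may be a number, text, or a vector such as a histogram.
// yTop/yBottom give an optional vertical extent: the horizontal extent coincides
// with the time line, so only the y-range needs to be stored (null when absent).
record AnnotationRow(String videoId, String keyword, String value,
                     long inFrame, long outFrame,
                     Integer yTop, Integer yBottom) { }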

A simple query involves a single attribute of the video, such as one or more free text terms (in associated text) or a value of motion magnitude. For example: find all the video segments where the word ‘impeachment’ is mentioned, or find all the video segments where the camera motion is ‘fast’. A query can also be composed of one or more attributes (attribute values) in any well-formed Boolean combination, for instance: “Find the video segments where (“impeachment” is spoken AND you see one face in close-up) OR there are two people moving fast”. The result of such a “multisearch query” is the set of time segments in which the query is satisfied. The start and stop times of these segments must be determined by computation.
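
The following Java sketch illustrates one possible way such a well-formed Boolean combination could be represented as an expression tree over attribute constraints; the class names, attribute keywords and operators are invented for illustration and are not the prototype’s actual query language.

// Illustrative expression tree for a multisearch query; all names are hypothetical
// and do not reflect the prototype's actual query language implementation.
interface QueryNode { }

// Leaf node: a single attribute constraint, e.g. spokenText contains "impeachment".
final class Constraint implements QueryNode {
    final String keyword;   // annotation attribute, e.g. "spokenText", "faceCount"
    final String op;        // relational operator, e.g. "contains", "=", "<"
    final String value;     // constraint value
    Constraint(String keyword, String op, String value) {
        this.keyword = keyword; this.op = op; this.value = value;
    }
}

// Interior nodes combine sub-queries with Boolean operators.
final class And implements QueryNode {
    final QueryNode[] children;
    And(QueryNode... children) { this.children = children; }
}
final class Or implements QueryNode {
    final QueryNode[] children;
    Or(QueryNode... children) { this.children = children; }
}

// The example query from the text:
// ("impeachment" is spoken AND one face is seen in close-up) OR two people moving fast
class QueryExample {
    static QueryNode build() {
        return new Or(
            new And(new Constraint("spokenText", "contains", "impeachment"),
                    new Constraint("faceCount", "=", "1"),
                    new Constraint("shotDistance", "=", "close-up")),
            new And(new Constraint("faceCount", "=", "2"),
                    new Constraint("motionMagnitude", "=", "fast")));
    }
}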

Our approach to content annotation is to use modular “annotation operators”, which are independent of one another (although the time intervals where their values are true might overlap). At present, the descriptors we can detect and index are: camera motion (pan, zoom), face count (0 - 6, many), shot length (close-up, long shot and several intermediate lengths, not to be confused with the shot’s temporal duration), spoken text (speech recognition), closed caption text (human transcription, carried external to the image), and open captioning (text encoded as pixels in the image; usually superimposed titling, such as the score in a sporting match). In principle, there could be any number of such descriptors.

The prototype does not predetermine the unit of segmentation (granularity), other than the inherent limit of a single frame. Rather, segmentation is determined automatically, based on the video content, as a consequence of applying the annotation operators. When a new video is annotated, annotation descriptors scan the video and “measure” a value for each frame. This value might be a scalar, a character string or a vector (such as a histogram). When the value changes at some frame, the annotation operator emits a

“keyword = value (in point, out point)”

assertion, where the keyword is unique to that operator, and “in point” and “out point” are the first frame and last frame of the interval over which that value held. For example, the “zoom” annotation descriptor can automatically detect within a video when the camera is performing a zoom operation, and extract for each zoom operation the start and end frames and its value (e.g., ‘zoom in’ or ‘zoom out’). Thus segmentation boundaries are a function of the value of the annotation feature in the video content itself.
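
The following Java sketch, with assumed names (AnnotationOperator, measureFrame, etc.), illustrates this run-length style of emission: one assertion per maximal run of frames over which the measured value is unchanged. It is a simplified illustration, not the prototype’s code.

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the run-length style emission described above; the
// AnnotationOperator class and measureFrame() method are assumed names.
abstract class AnnotationOperator {
    abstract String keyword();                 // keyword unique to this operator
    abstract String measureFrame(int frame);   // measured value for a single frame

    // Scan frames [0, frameCount) and emit one assertion per maximal run of
    // frames over which the measured value stays the same.
    List<String> annotate(int frameCount) {
        List<String> assertions = new ArrayList<>();
        if (frameCount == 0) return assertions;
        String current = measureFrame(0);
        int inPoint = 0;
        for (int f = 1; f < frameCount; f++) {
            String v = measureFrame(f);
            if (!v.equals(current)) {
                assertions.add(keyword() + " = " + current + " (" + inPoint + ", " + (f - 1) + ")");
                current = v;
                inPoint = f;
            }
        }
        assertions.add(keyword() + " = " + current + " (" + inPoint + ", " + (frameCount - 1) + ")");
        return assertions;
    }
}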

The search and browse prototype also annotates text associated with video, either as caption text, recognized speech, or manually entered descriptions. Caption text associated with video provides topic-related “news story” segments, whose duration in the video can be determined by extracting story delimiters embedded in the closed-captioning, using a caption text annotation operator. An example is a typical newscast of television evening news, which is comprised of “stories.” The closed caption text contains special markers “>>>” which denote news story breaks (these are manually inserted by professional transcribers). The prototype can also annotate the duration of recognized speech text associated with the video. In this case, there are no markers delimiting topic or “story” segmentations. Several options exist for recording the time at which a word is spoken. One option is to “time-align” indexed terms: i.e., track the actual time each word is spoken. However, for purposes of indexing the text content of recognized speech it is convenient to have the notion of a “document” collection. Hence, we record sequences of words (typically on the order of 100), and index each 100-word chunk as a “document” [9]. The video segment corresponding to the document is defined to be the interval from when the first word of the document is uttered until the last word is uttered. For a variety of reasons (speed, reliability of ranking, especially when the search key is multiple words), we group words into blocks (typically on the order of 100) and search them first, then search the ten best matching blocks for occurrences of the individual words. In addition to indexing these “documents”, we also record the time interval for each word, to provide time-aligned browsing of video by text terms. For example, suppose we are searching a sports program aired after “September 1, 1997” where the announcer shouts “goal”, where the camera zooms in when this word is uttered, and where three faces are visible. After performing the search, the user interface can show the sports event by playing the video, display the announcer’s comments in text format with the word “goal” highlighted, and allow the user to jump from one “goal” to another in the text and see a close-up of the event in the video at the same time.
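
As a simplified illustration of this “document” construction (assumed names; not the prototype’s indexing code), recognized words and their spoken times can be grouped into 100-word blocks whose video interval runs from the time of the first word to the time of the last word:

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: group (word, time) pairs from speech recognition into
// ~100-word "documents"; each document's interval runs from the time the first
// word is uttered to the time the last word is uttered.
class SpeechDocumentBuilder {
    static final int WORDS_PER_DOCUMENT = 100;

    record TimedWord(String word, double seconds) { }
    record SpeechDocument(List<TimedWord> words, double startSeconds, double endSeconds) { }

    static List<SpeechDocument> build(List<TimedWord> transcript) {
        List<SpeechDocument> documents = new ArrayList<>();
        for (int i = 0; i < transcript.size(); i += WORDS_PER_DOCUMENT) {
            List<TimedWord> chunk =
                List.copyOf(transcript.subList(i, Math.min(i + WORDS_PER_DOCUMENT, transcript.size())));
            documents.add(new SpeechDocument(
                chunk,
                chunk.get(0).seconds(),
                chunk.get(chunk.size() - 1).seconds()));
        }
        return documents;
    }
}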


Figure 2 Multisearch on segments: combining results from Boolean combinations of search criteria.

This example indicates the “multi-search” capability of our system. This means that a user can pose Boolean combinations of different types of indexed criteria, and the system will combine component results to return a single result set. Furthermore, the system will compute the exact time intervals in which the events occurred. In the next section, we explain multi-search more formally. Multi-search shows quite clearly how the flexible granularity of recording the content-related metadata comes into play. In our previous example, the knowledge of the exact time when “goal” was shouted enables us to highlight the corresponding text and play the corresponding video. However, the time interval where the camera was zooming could differ (e.g., it might be longer or shorter relative to the segment where “goal” was shouted).

Figure 2 shows an example of a Boolean query, the component result segments, and a computed intersection.


The system can compute the exact time interval in which all the specified events happen (“goal” is shouted, the camera is zooming in, and three faces are shown). Clearly, the data does not need to be loaded in a specific way to be able to carry out this precise calculation. Furthermore, our system is not only able to play the relevant video segment; users can also choose to view only the frames where the word “goal” is shouted, or view the whole video segment which contains the retrieved segment where zooming occurs, or view only the video segments where both events occur. This enables flexible nonlinear browsing by end users (the user interface is discussed below). The “granularity” of the browsing can be different from the granularity of the “querying”, reflecting a typical desire of a user to change the specification as the application shifts from querying mode to browsing mode.

User Interface for Search and Browse

It is worth describing an end-user's view of these annotation methods, and the search and browsing capabilities they enable. Figure 3 shows a screen shot of a query input form for a prototype graphical user interface (GUI) which enables a user to pose queries using multiple search criteria, including caption text and speech terms, camera motion attributes, face attributes, and program-level metadata (e.g., title and play date). Using a fixed form to build query specifications limits the type of Boolean queries that can be made. Consequently, we also implemented a query language in which (expert) users can specify a Boolean query of arbitrary complexity using parentheses and a query syntax (an example of a query using this language is shown in Figure 2).


The results of a search are shown in Figure 4. For each result, the relevant text and some key frames for the first few shots comprising the result segment are shown. The system also allows for “drilling down” into the content of a selected search result segment, viewing all the information associated with the segment, as shown in Figure 5. For each result segment, end users can view closed caption and recognized speech text, video content, a shot story board with key frames, etc. In addition, various VCR-style controls allow users to navigate through the video program containing the result segment, according to various segmentation criteria that may not have been part of the original query specification. Examples include shot boundaries, story boundaries (based on caption text markers), frames, seconds, etc. This implies that all the index feature values for the result segment are available to the browser, even if these values were not specified in the original search specification. In effect, end users are able to browse the results in terms of any indexed value expressed in the segment. The display panel of a selected search result segment becomes a starting point for flexible and broad-based browsing functions.

Figure 3 Query input form for text terms, motion direction and magnitude, and program-level attributes.


Query Model and Processing

Underlying the query and browse application in Figures 2, 3, 4 and 5 is a query model, including query processing methods, designed to handle the content-oriented, time-based annotations we have described. A query can be an arbitrary Boolean combination of constraints, including parentheses and negation, together with more advanced operations related to combining results, each of which we discuss in turn.

Queries in our prototype are conjunctions of inequality constraints (including equalities) of the form “keyword <relational operator> value”, where <relational operator> could include “=” and annotation operator-specific vector distance metrics along the lines of “distance metric N(value, V0) < k”, for some constant vector value V0 and scalar value k. In addition, a “like” operator is supported (in SQL this corresponds to a substring match operator). Values can be specified exactly, but wild cards are also supported. For text, the constraints are of the form “contains lexical unit 1, ..., lexical unit N”, where a lexical unit is a single word or a multi-word (proper name, phrase, etc.). Furthermore, a text query can have the form “contains a Boolean expression of words and/or phrases”. These constraints can be augmented with a set of linguistic attributes such as language, morphology and synonyms, to mention a few.

Figure 4 Search result page, showing storyboard keyframes for video result segments and caption text excerpts.


For example, a query could be of the form: determine the time intervals in videos where the genre is sports, the announcer talks about goals or scores, the score is greater than three, and the camera zooms in.

The search result set produced by each constraint is a list of (interval, ranking) pairs, where each (temporal) interval is a (video ID, in point, out point) triple. The answer set for a complete query is obtained by computing, for each video, the intersections/unions of the intervals. The following pairwise interval relationships can occur [10]:

- two intervals may be identical
- one may lie entirely within the other
- they partially overlap, or
- they are completely disjoint.

Finally, the intersection/union of the surviving answer set intervals is computed, to return the answer. Recall that Figure 2 showed an example of a Boolean query, the component result segments, and a computed intersection. The query can be an arbitrary Boolean expression, including parentheses, of the above-mentioned constraints, and hence the intersection or union is defined according to the query specification. Furthermore, other time-based operations can be performed, which we discuss in the next section.

Figure 5 Detailed content view of a specific result video segment, with video playback controls.

In the case of unions, arising from disjunction of constraints, the following pairwise interval relationships can occur [10]:

- two intervals may be identical
- one may lie entirely within the other
- they partially overlap, or
- they abut one another.

In any of these cases, the two intervals can be combined into one. Finally, the union of the surviving answer intervals is returned as the query result.
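
The following Java sketch illustrates these two interval computations under the simplifying assumptions that rankings are ignored and that all intervals belong to a single video (in the prototype the computation is performed per video ID); names are invented for illustration.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the interval arithmetic described above; rankings and
// video IDs are omitted, i.e. all intervals are assumed to belong to one video.
class IntervalOps {
    record Interval(long in, long out) { }   // in point and out point, inclusive

    // Conjunction: keep only the portions of time covered by both operands.
    // Disjoint pairs contribute nothing; identical, nested and partially
    // overlapping pairs contribute their common sub-interval.
    static List<Interval> intersect(List<Interval> a, List<Interval> b) {
        List<Interval> result = new ArrayList<>();
        for (Interval x : a) {
            for (Interval y : b) {
                long in = Math.max(x.in(), y.in());
                long out = Math.min(x.out(), y.out());
                if (in <= out) result.add(new Interval(in, out));
            }
        }
        return result;
    }

    // Disjunction: combine identical, nested, overlapping and abutting intervals
    // into single intervals. The input is the concatenation of the operand lists.
    static List<Interval> union(List<Interval> intervals) {
        List<Interval> sorted = new ArrayList<>(intervals);
        sorted.sort(Comparator.comparingLong(Interval::in));
        List<Interval> merged = new ArrayList<>();
        for (Interval cur : sorted) {
            if (!merged.isEmpty() && cur.in() <= merged.get(merged.size() - 1).out() + 1) {
                Interval last = merged.remove(merged.size() - 1);
                merged.add(new Interval(last.in(), Math.max(last.out(), cur.out())));
            } else {
                merged.add(cur);
            }
        }
        return merged;
    }
}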

The value of this approach is the ability to handle a diversity of source material, where any attempt to define a segmentation based roughly on a generic time constant or a fraction of the total length of the video would not generally be appropriate. Units of segmentation that would be well-suited to some queries would almost certainly be inappropriate for others. Also, the in- and out-points for different annotation operators are, in general, completely different from each other, and consequently would generate a sparse set of annotations. Our database schema allows for very efficient storage of such sparse sets.

Also, the in- and out-points in the result set of a particular query are a function of the query itself. Full flexibility is available for applications to apply whatever granularity of segmentation is appropriate for user queries or for the content indexing needed. For instance, some commercially available systems use shot transitions as their segmentation of video, and results are returned at the granularity of a scene. Our system automatically indexes scene transitions, and an application can easily map the results into “scenes” if so desired. This feature, called rounding up, is discussed in more detail in the next section.

Additional Operations on Time-Based Segment Specifications

Our system contains special features which are useful in manipulating search results for purposes of helping users find relevant results, or relating video segments to a larger video context from which the results are derived.

Combining Multiple Rankings. Searching video segments by text associated with the video (e.g., caption text) returns ranked results from a full text index. Multiple search engines may return additional rank-ordered result sets, e.g., a search on recognized speech text (which can be indexed separately from caption text). In these cases, the system needs to compute a combined rank for a combined result set. Different considerations have to be taken into account in this combination process, such as the different types of content, user preferences or the reliability of the search. For example, the results of a text search against the written transcript are 100% accurate, whereas a search against the recognized speech transcript does not in general achieve the same accuracy. Hence, when combining these two ranks, more weight should be given to the rank returned from the search against the written transcript.
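
The paper does not prescribe a specific combination formula; purely as an illustration, one simple possibility is a weighted sum of per-source scores with a larger weight on the more reliable source, as in the following hypothetical Java sketch.

import java.util.HashMap;
import java.util.Map;

// Illustration only: one simple way to combine scores from two search sources,
// giving more weight to the more reliable one (e.g., caption text vs. recognized
// speech). The weighted-sum rule and the weights are assumptions, not the
// prototype's actual combination method.
class RankCombiner {
    static Map<String, Double> combine(Map<String, Double> captionScores,
                                       Map<String, Double> speechScores,
                                       double captionWeight, double speechWeight) {
        Map<String, Double> combined = new HashMap<>();
        captionScores.forEach((segmentId, score) ->
            combined.merge(segmentId, captionWeight * score, Double::sum));
        speechScores.forEach((segmentId, score) ->
            combined.merge(segmentId, speechWeight * score, Double::sum));
        return combined;   // e.g. combine(caption, speech, 0.7, 0.3)
    }
}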

Segment Padding. In some cases, the constraint-satisfying interval in a search result is very short, e.g., corresponding to utterances of individual words. We provide a user-settable “padding factor” time constant, which is used to extend both the start and the end of each final answer set interval. Padding can be done at different times: one can pad the result sets of each subquery (e.g., the results from a text search, the results from a parametric search), or one can pad the final result set after combining the result sets of the subqueries. Our system allows for all these methods of padding. Padding intervals may be application specific, and values may be defined by default settings or exposed to end users who choose values based on their search requirements.
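
A minimal Java sketch of the padding step, under assumed names and with the padding factor expressed in frames (the prototype’s actual units and clamping behaviour are not specified here):

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: extend each answer-set interval by a user-settable
// padding factor at both ends, clamped to the bounds of the video.
class SegmentPadding {
    record Interval(long in, long out) { }

    static List<Interval> pad(List<Interval> results, long padding, long lastFrame) {
        List<Interval> padded = new ArrayList<>();
        for (Interval i : results) {
            padded.add(new Interval(Math.max(0, i.in() - padding),
                                    Math.min(lastFrame, i.out() + padding)));
        }
        return padded;
    }
}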

Rounding Up. A related feature is rounding up. In some cases, such as news stories, users may wish to resort to a predefined semantic partitioning of the videos, which would be captured as an annotation feature. They might want, for example, to view the complete story containing a result segment. Our scheme allows for overriding result segment boundaries computed for Boolean queries along the following lines: take each time interval T in the result set and determine the smallest predefined time interval T1 which contains T in its entirety. (Note that T1 could be a combination of several adjoining predefined time intervals.) This feature exposes the strength of our flexible segmentation. Any fixed segmentation (like the one defined by higher-level segment criteria such as news story boundaries) can be imposed after the result set for a query has been defined, and hence the user is not restricted to seeing the results in terms of the segmentations produced by the original indexing methods.
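
A minimal Java sketch of this rounding-up rule, again with assumed names, given a predefined, sorted, non-overlapping partitioning of the video (e.g., story intervals):

import java.util.List;

// Hypothetical sketch: round a result interval T up to the smallest combination
// of adjoining predefined intervals (e.g., story segments) that contains it.
// The predefined intervals are assumed to be sorted, non-overlapping, and to
// cover the video.
class RoundingUp {
    record Interval(long in, long out) { }

    static Interval roundUp(Interval t, List<Interval> predefined) {
        long in = t.in();
        long out = t.out();
        for (Interval p : predefined) {
            if (p.in() <= t.in() && t.in() <= p.out()) in = p.in();     // segment containing T's start
            if (p.in() <= t.out() && t.out() <= p.out()) out = p.out(); // segment containing T's end
        }
        return new Interval(in, out);
    }
}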

Conclusions

In this paper we presented a video content management system which is designed for storing and retrieving digital videos and the metadata associated with them. Our system focuses on representing and storing the temporal duration of the metadata, using methods that result in efficient storage, efficient retrieval, and flexible searching and browsing. Each metadata annotation determines the segment granularity at which it is to be represented in the database index. Additional time-based operations allow an application (user) to determine the granularity at which the video data is presented as a search result and when result sets are browsed and played back.

Acknowledgments

We would particularly like to acknowledge Michael Concannon’s programming contributions to making the end user interface prototype a reality. We also thank the other members of the project team who developed the annotation methods referred to in this paper: Chitra Dorai and Jonathan Connell (and their IBM Research manager Ruud Bolle).

Our system is being developed as part of a cooperative, multicompany effort to design and implement an architecture for a complete studio for producing and broadcasting high-definition television (HDTV) using MPEG-2 compressed bitstreams. The specification, known as “Digital Studio Command and Control” (DS-CC), is a distributed objects model, implemented in CORBA.

It has been submitted to the Society of Motion Picture and Television Engineers (SMPTE) for recommendation as a broadcasting industry standard. The prototype system described here and the underlying server support are implemented in Java as DS-CC objects.

The work was funded in part by the National Institute of Standards and Technology, Advanced Technology Program (NIST/ATP), under contract No. 70NANB5H1174. We are also grateful for the support of the NIST program, in particular the NIST Technical Program Manager, David Hermreck, and the IBM Research Principal Investigator, James Janniello, who leads the studio infrastructure aspects of the NIST ATP project.

References

[1] Aguierre Smith, T.G. and Davenport, G., “The Stratification System: A Design Environment for Random Access Video”, Network and Operating Systems Support for Digital Audio and Video, Third International Workshop Proceedings, La Jolla, CA, USA, 12-13 Nov. 1992, pp. 250-261, Springer-Verlag, 1993.

[2] ISLIP Media, Inc., “www.islip.com”, Pittsburgh, PA 15312, USA.

[3] Christel, M., Stevens, S., Wactlar, H., “Informedia Digital Video Library”, in Proceedings of the ACM Multimedia ’94 Conference, New York: ACM, pp. 480-481, October 1994.

[4] “Avid’s Open Media Management Initiative”, Content Watch, Vol. 1, No. 5, 1998.

[5] Virage Inc., “www.virage.com”, San Mateo, CA, USA.

[6] Excalibur Technologies Corporation, “www.excalibur.com”, Vienna, VA 22182, USA.

[7] R. M. Bolle, B.-L. Yeo, M. M. Yeung, “Video Query: Research Directions”, IBM Journal of Research and Development, Vol. 42, No. 2, March 1998.


[8] M. Davis, “Media Streams: An Iconic Visual Language for Video Representation”, Proceedings of the 1993 Symposium on Visual Languages, pp. 196-202, 1993.

[9] S. Dharanipragada and S. Roukos, “Experimental Results in Audio-Indexing”, Proceedings of the 1997 Speech Recognition Workshop, Chantilly, Virginia, February 1997.

[10] James F. Allen, “Maintaining Knowledge about Temporal Intervals”, Communications of the ACM, Vol. 26, No. 11, 1983.

[11] Juliano, M., and Pendyala, K., “Use of Media Asset Management Systems in Broadcast News Environments”, whitepaper in NAB Multimedia World Journal, 1997.