Applications of Video-Content Analysis and Retrieval

Nevenka Dimitrova, Philips Research
Hong-Jiang Zhang, Microsoft Research
Behzad Shahraray, AT&T Labs Research
Ibrahim Sezan, Sharp Laboratories of America
Thomas Huang, University of Illinois at Urbana–Champaign
Avideh Zakhor, University of California at Berkeley

IEEE MultiMedia, July–September 2002

Managing multimedia data requires more than collecting the data into storage archives and delivering it via networks to homes or offices. We survey technologies and applications for video-content analysis and retrieval. We also give specific examples.

The advances in data capture, storage, and communication technologies have made vast amounts of video data available to consumer and enterprise applications. However, interacting with multimedia data, and video in particular, requires more than connecting with data banks and delivering data via networks to customers' homes or offices. We still have limited tools and applications to describe, organize, and manage video data. The fundamental approach is to index video data and make it a structured medium. Manually generating video content description is time consuming—and thus more costly—to the point that it's almost impossible. Moreover, when available, it's subjective, inaccurate, and incomplete.

This conundrum has attracted researchers from various disciplines, each with their own algorithms and systems. In addition, the MPEG group recently issued MPEG-7 as a standard to provide a normative framework for multimedia content description. In contrast, however, there are few convincing stories we can tell about successful applications of the research results. It seems that the excitement enjoyed by many researchers from both academia and industry has yet to generate significant impact in the marketplace. Is there any significant application that can benefit from our research? Can we solve the video retrieval problem as we originally claimed? Are we falling into the same hype as artificial intelligence (AI) once did? We believe the answer is no. However, we need to reexamine our research strategies and methodologies, and most importantly, users' need for technologies for content-based video retrieval.

To address these questions, we held a panel discussion chaired by Hong-Jiang Zhang at the International Workshop on Very Low Bit-Rate Video Coding (VLBV 98), with researchers in the video coding and content analysis field. This article summarizes the evolving views from the panelists and audience, as well as the continued online discussion regarding state-of-the-art technologies, directions, and important applications for research on content-based video retrieval.

Content-based video retrieval

We perceive a video program as a document. Video indexing should be analogous to text document indexing, where we perform a structural analysis to decompose a document into paragraphs, sentences, and words before building indices. When someone authors a book, they create a table of contents for browsing the content's order and a semantic index of keywords and phrases for searching by content. Similarly, to facilitate fast and accurate content access to video data, we should segment a video document into shots and scenes to compose a table of contents, and we should extract keyframes or key sequences as index entries for scenes or stories. Therefore, the core research in content-based video retrieval is developing technologies to automatically parse video, audio, and text to identify meaningful composition structure and to extract and represent content attributes of any video sources.

A typical scheme of video-content analysis and indexing, as proposed by many researchers, involves four primary processes: feature extraction, structure analysis, abstraction, and indexing. Each process poses many challenging research problems. In what follows, we briefly review these research issues and the algorithms developed so far to address them.

Feature extraction for content analysis

A critical process in content-based video indexing is feature extraction, which we show in Figure 1. The effectiveness of an indexing scheme depends on the effectiveness of attributes in content representation. However, we can't easily map readily extractable video features (such as color, texture, shape, structure, layout, and motion) into semantic concepts (such as indoor and outdoor, people, or car-racing scenes). In the audio domain, features (such as pitch, energy, and bandwidth) can enable audio segmentation and classification.

[Figure 1. Process diagram for video-content analysis and indexing.]

Although visual content is a major source of information in a video program, an effective strategy in video-content analysis is to use attributes extractable from multimedia sources. Much valuable information is also carried in other media components, such as text (superimposed on the images or included as closed captions), audio, and speech that accompany the pictorial component. A combined and cooperative analysis of these components would be far more effective in characterizing a video program for both consumer and professional applications. The Informedia system,1 AT&T's Pictorial Transcripts system,2–5 and Video Scout6 are examples of such approaches.
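To make the gap between extractable features and semantic concepts concrete, here is a minimal sketch of two such low-level features (our own illustration, not code from any of the cited systems; it assumes OpenCV and NumPy are available). Nothing in these numbers says "indoor scene" or "car race"; bridging that gap is the research problem.

```python
import numpy as np
import cv2  # OpenCV, assumed available for this illustration

def color_histogram(frame_bgr, bins=8):
    """Quantized 3D color histogram -- a typical low-level visual feature."""
    hist = cv2.calcHist([frame_bgr], [0, 1, 2], None,
                        [bins, bins, bins], [0, 256, 0, 256, 0, 256])
    hist = hist.flatten()
    return hist / hist.sum()  # normalize so frames of different sizes compare

def short_time_energy(samples, window=1024):
    """Per-window energy of an audio track -- a typical low-level aural feature."""
    n = len(samples) // window
    frames = np.asarray(samples[:n * window], dtype=np.float64).reshape(n, window)
    return (frames ** 2).mean(axis=1)
```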

Structure analysis

Video structure parsing is the next step in overall video-content analysis and is the process of extracting the temporal structural information of video sequences or programs. This process lets us organize video data according to their temporal structures and relations and thus build a table of contents. It involves detecting temporal boundaries and identifying meaningful segments of video. Many effective and robust algorithms for video parsing have been developed7–11 for segmenting a video program into its temporal composition bricks. Ideally, these composition bricks should be categorized in a hierarchy similar to film storyboards. The top level consists of sequences or stories, which are composed of sets of scenes. Scenes are further partitioned into shots. Each shot contains a sequence of frames recorded contiguously and representing a continuous action in time or space. With such structural information, we can automatically build a video program's table of contents.12

An important step in video structure parsing is segmenting the video into individual scenes. From a narrative point of view, a scene consists of a series of consecutive shots grouped together because they're shot in the same location or because they share some thematic content. Detecting these video scenes is analogous to paragraphing in text document parsing, but it requires a higher level of content analysis. There are two approaches to automatically recognizing program sequences: one based on film production rules,13 the other based on a priori program models.10 Both have had limited success because scenes or stories in video are only logical layers of representation based on subjective semantics, and no universal definition and rigid structure exists for scenes and stories.

In contrast, shots are the actual physical basic layers in video, whose boundaries are determined by editing points or where the camera switches on or off. Fortunately, analogous to words or sentences in text documents, shots are a good choice as the basic unit for video-content indexing, and they provide the basis for constructing a video table of contents. Shot boundary detection algorithms that rely only on visual information contained in the video frames can segment the video into frames with similar visual contents. Grouping the shots into semantically meaningful segments such as stories, however, usually isn't possible without incorporating information from the video program's other components. Multimodal processing algorithms involving the processing of not only the video frames, but also the text, audio, and speech components that accompany them, have proven effective in achieving this goal.12
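As a minimal illustration of visual-only shot boundary detection (a generic histogram-difference scheme of our own, not any particular algorithm from references 7–11), reusing the color_histogram helper sketched earlier:

```python
import numpy as np

def detect_shot_boundaries(frames, threshold=0.5):
    """Flag a cut wherever consecutive frame histograms differ sharply.

    `frames` is an iterable of BGR images; `color_histogram` is the helper
    from the earlier sketch. The threshold is illustrative and would be
    tuned per corpus; gradual transitions need more elaborate tests.
    """
    boundaries, prev_hist = [], None
    for i, frame in enumerate(frames):
        hist = color_histogram(frame)
        # L1 distance between normalized histograms lies in [0, 2]
        if prev_hist is not None and np.abs(hist - prev_hist).sum() > threshold:
            boundaries.append(i)
        prev_hist = hist
    return boundaries
```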

Video abstraction

Video abstraction is the process of creating a presentation of the visual information landscape or structure of video, which should be much shorter than the original video. This abstraction process is similar to the extraction of keywords or summaries in text document processing. That is, we need to extract a subset of video data from the original video, such as keyframes or highlights, as entries for shots, scenes, or stories. Abstraction is especially important given the vast amount of data for a video program of even a few minutes' duration. The result forms the basis not only for video content representation but also for content-based video browsing. Combining the structure information extracted from video parsing and the keyframes extracted in video abstraction, we can build a visual table of contents of a video program.

Several terms and corresponding methods exist for abstracting video content, including skimming, highlights, and summary. A video skim is a condensed representation of the video containing keywords, frames, and visual and audio sequences. Highlights normally involve detecting important events in the video. A summary preserves important structural and semantic information in a short version of the video represented via key audio, video, frames, and/or segments.

Keyframes play an important role in the video abstraction process. Keyframes are still images, extracted from original video data, that best represent the content of shots in an abstract manner. Often we use keyframes to supplement the text of a video log.2 The representational power of a set of keyframes depends on how they're chosen from all frames of a sequence. Not all image frames within a sequence are equally descriptive, and the challenge is how to automatically determine which frames are most representative. An even more challenging task is to detect a hierarchical set of keyframes such that a subset at a given level represents a certain granularity of video content, which is critical for content-based video browsing. Researchers have developed many effective algorithms,14 although robust keyframe extraction remains a challenging research topic.
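A simple baseline for keyframe selection (our own sketch, far short of the hierarchical schemes the text calls for) is to take the frame closest to a shot's average feature vector, again reusing the color_histogram helper from the earlier sketch:

```python
import numpy as np

def pick_keyframe(shot_frames):
    """Return the index of the frame nearest the shot's mean color histogram."""
    hists = np.stack([color_histogram(f) for f in shot_frames])
    centroid = hists.mean(axis=0)
    # The frame with the smallest L1 distance to the centroid is "most typical."
    return int(np.abs(hists - centroid).sum(axis=1).argmin())
```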

Automatically extracting video highlights is an even more challenging research topic, because it requires more high-level content analysis. Rui et al.15 have developed a way to reduce the usual three-hour baseball game—which normally includes long and uneventful close-ups—to 10 minutes of the most exciting highlights from each inning. They use audio features such as excited speech and baseball hits. The algorithm that determines the exciting parts weighs the probabilities for the different features using a support vector machine (SVM).
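Rui et al. don't reproduce their implementation in this article; the sketch below only conveys the general shape of such an SVM-based highlight classifier using scikit-learn, with made-up feature names and toy training data.

```python
import numpy as np
from sklearn.svm import SVC

# Illustrative training data: one row per candidate segment, with columns such
# as [excited-speech score, bat-hit score, loudness]; label 1 marks exciting plays.
train_x = np.array([[0.9, 0.8, 0.7], [0.1, 0.0, 0.2], [0.8, 0.9, 0.9],
                    [0.2, 0.1, 0.3], [0.7, 0.6, 0.8], [0.3, 0.2, 0.1]])
train_y = np.array([1, 0, 1, 0, 1, 0])
clf = SVC(kernel="rbf").fit(train_x, train_y)

def select_highlights(segments, features):
    """Keep the segments the classifier labels as exciting."""
    labels = clf.predict(np.asarray(features))
    return [seg for seg, lab in zip(segments, labels) if lab == 1]
```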

Li et al.16,17 have recently reported another sports-related technology, in which they propose algorithms for automatically detecting all segments containing interesting events of a particular game. Interesting events are specific to a particular sport. The proposed event detection algorithms use two types of prior knowledge to extract semantics from broadcast sports video: domain and production knowledge. Domain knowledge in sports means the definition of key events that are important for a particular sport, such as a play in American football, a hit in baseball, and a goal and its setup in soccer. Production knowledge refers to the techniques used to produce the broadcast video. These techniques help viewers follow the game in an entertaining, informative, and captivating manner. They direct viewers' attention via well-accustomed production patterns that viewers expect, such as scene transitions after plays, replays, dynamic broadcaster logos sandwiching replay segments, certain camera angles that are sport or event specific, and scoreboard overlays.

The event detection algorithms in Li et al.16,17 represent these two types of knowledge in terms of rule bases, where rules are expressed in terms of low-level visual and aural features that are automatically computed from the media. They automatically detect all key events in baseball, American football, and sumo wrestling broadcast programs. Their algorithms16 detect every play in a baseball broadcast video and every bout in a sumo match broadcast. One algorithm17 automatically detects every play in an American football broadcast video. Plays include segments of the game where the ball is put into play and actively played, including pitches, hits, base steals, and home runs in baseball, and running or passing plays and field goals in football. By detecting an event, we mean detecting the start and end points of video segments containing the event. These algorithms use the methods Pan et al.18,19 discuss for automatically detecting replay segments.
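The rule bases themselves aren't reproduced in the article; the following toy sketch (all cue names are our own illustration) shows how domain and production rules can drive a simple start/end scan over preclassified shots:

```python
def detect_plays(shots):
    """Emit (start, end) shot-index pairs for plays in a baseball broadcast.

    Each element of `shots` is a dict of precomputed cues -- illustrative keys:
    'camera_view' (domain rule: plays start from the pitching view) and
    'is_replay'/'is_closeup' (production rules: plays end at cut-aways).
    """
    plays, start = [], None
    for i, shot in enumerate(shots):
        if start is None and shot["camera_view"] == "pitch":
            start = i                      # domain rule fires: play begins
        elif start is not None and (shot["is_replay"] or shot["is_closeup"]):
            plays.append((start, i))       # production rule fires: play ends
            start = None
    return plays
```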

Li et al.20 presented a prototype system demonstrating the results of these technologies (referred to as High-Impact Sports). The prototype system used an MPEG-7-compliant XML description format for the event segments and an MPEG-7 browser that provided novel user interface paradigms offering summarized viewing or play-by-play nonlinear navigation. A play summary typically provides three to six times compaction in the viewing time for baseball and American football, depending on the particular game. The compaction ratio for sumo wrestling can be as high as 20 times.

It should be noted that the High-Impact Sports technology, unlike the approach in Rui et al.,15 detects every single play without necessarily attempting to prioritize the events. It therefore isn't limited to generating short highlights. It serves the needs of a sports production studio or an avid sports fan who may want to see all the plays (or prefers to be in control of selecting the exciting plays). It facilitates efficient digital media management in a production environment. For sports fans, this technology provides the opportunity to catch a missed game during its regular broadcast and to consume even more sports.

A successful skimming approach involves using information from multiple sources, including sound, speech, transcript, and video image analysis. The Informedia project1 is a good example of this approach; it automatically skims documentary and news videos with textual transcriptions by first abstracting the text using classical text skimming techniques and then looking for the corresponding parts in the video. This method creates a skim video, which represents a short synopsis of the original. The goal was to integrate language and image understanding techniques for video skimming by extracting significant information, such as specific objects, audio keywords, and relevant video structure. The resulting skim video is much shorter, where compaction is as high as 20 to 1, and yet retains the original segment's essential content. Another example21 combines audio, video, speech, and text to process TV news programs. This approach results in the segmentation of the program into individual stories. The system selects a few representative images and keywords to represent each story's contents. The textual information provided by closed captions or derived from the audio track using speech recognition plays an important role in this process. When transcriptions of the program are generated by automatic speech recognition (ASR), satisfactory results may not be achievable using a keyword-driven approach for videos whose soundtrack contains more than just speech, such as movies. Sundaram and Chang22 proposed a solution to the skimming problem based on two important questions:

❚ What's the relationship between the visual complexity of a shot and its comprehension time?

❚ How does syntactical structure in the video affect its comprehension?

They introduced a framework for determining visual skims and formulated the problem of skim generation as a general utility maximization problem with constraints. This is an important step because it allows for a principled way to impose additional constraints and make trade-offs between them.
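In generic form (our notation, not Sundaram and Chang's), choosing which shots to keep in a skim can be posed as:

```latex
\max_{x_1,\dots,x_n \in \{0,1\}} \; \sum_{i=1}^{n} u_i x_i
\qquad \text{subject to} \qquad \sum_{i=1}^{n} d_i x_i \le T ,
```

where x_i indicates whether shot i is kept, u_i is its estimated comprehension utility, d_i its duration, and T the target skim length; syntactic constraints (for instance, a minimum shot duration for comprehensibility) enter as additional inequalities.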

Summarization is another challenging topic because of the need to explore the content's structure. Liu and Kender23 explore the types of scene changes that signal transitions between semantic units in the domain of documentaries. Their approach was to look for evidence of shot composition rules by means of hidden Markov models. They found that the best approach is one that trains the HMM with labeled subsequences that have approximately equal elapsed time, rather than subsequences with an equal number of shots or subsequences with shots aligned to some semantic event. Agnihotri et al.24 proposed summarization of video programs using the transcript. Their process involves cue extraction, categorization, classification, and a summarizer. Given a paragraph, the categorization process finds the underlying topic. Each of the 20 categories is an aggregation of a set of keywords related to that particular class. The summarizer exploits the underlying temporal structure and domain knowledge as well as textual cues in the transcript.

Indexing for retrieval and browsing

The structural and content attributes extracted in the feature extraction, video parsing, and abstraction processes, or the attributes that are entered manually, are often referred to as metadata. Based on these attributes, we can build video indices and the table of contents through, for instance, a clustering process that classifies sequences or shots into different visual categories or an indexing structure. As in many other database systems, we need schemes and tools to use the indices and content metadata to query, search, and browse large video databases. Researchers have developed numerous schemes and tools for video indexing and query. However, robust and effective tools tested by thorough experimental evaluation with large data sets are still lacking. Therefore, in the majority of cases, retrieving or searching video databases by keywords or phrases will be the mode of operation. In some cases, we can retrieve with reasonable performance by content similarity defined by low-level visual features of, for instance, keyframes and example-based queries.

Often in queries of video clips, we want to quantify queries based on particular attributes that involve objects and subregions within the viewable image. Some support for automated object extraction could help us with these queries. The object-oriented compression scheme standardized by MPEG-4 provides an ideal data representation for supporting such indexing and retrieval schemes. It will also simplify the task of video structure parsing and keyframe extraction, because many of the necessary content features (such as object motion) are readily available. Zhang et al.25 proposed a framework to use such content information in video content representation, abstraction, indexing, and browsing. Similarly, Chang et al.26 have applied object-based representation in indexing and retrieving video clips.

Although we tend to think of indexing as supporting fast retrieval of video clips, browsing is equally significant for video data, because the volume of video data requires techniques that present an information landscape or structure to give a quick overview. By browsing, we mean a casual and quick access of content. The visual table of contents built from structure information and keyframes provides an ideal representation for content-based video browsing. Browsing tools built on such a representation are especially useful, given that it still isn't feasible to automatically build semantic content-based indexing of video programs. However, browsing shouldn't be viewed as only a compromised tool for video indexing. Rather, it's an effective alternative and a complementary step for searching video data. Often users want quick access to relevant video data, although the process may initially lack any specific goal or focus. Browsing may suitably address those needs. Furthermore, browsing is intimately related to and essential for video retrieval. It can help formulate queries, making it easier for the user to just ask around in the process of figuring out the most appropriate query to pose. Applications of such browsing tools include video editing and composition, where we often browse through a large number of relevant video clips before determining the final cut list.

We can also use MPEG-7 for multimedia indexing. We discuss this further in the "MPEG-7" sidebar.

Sidebar: MPEG-7

Having realized the importance of content management, the Moving Pictures Expert Group (MPEG) started the standardization activity on content description. Formally called the Multimedia Content Description Interface, MPEG-7 provides a standardized description of various types of multimedia information, as well as descriptions of user preferences and usage history pertaining to multimedia information. The normative part of the standard focuses on a framework for encoding the descriptors and description schemes. The standard doesn't comprise the extraction of descriptors (features) or specify search engines that will use the descriptions. Instead, the standard enables the exchange of content between different content providers along the media value chain. In addition, it enables the development of applications that will use the MPEG-7 descriptions without specific ties to a single content provider. More information about MPEG-7 is available at the MPEG homepage (http://mpeg.telecomitalialab.com/).

MPEG-7 became an international standard in December 2001. This will affect the availability of additional information accompanying the images and video segments for many applications, and it will help focus research on content analysis topics that aren't covered by the MPEG-7 description schemes.

Application models: The user's perspective and research methodologies

We're facing a barrier similar to the one that the AI research community faced for many years: machine understanding of visual content. This is one of the major reasons that we haven't seen many convincing stories about successful applications of our research results, especially in the marketplace. In reviewing the past success in developing algorithms for video structure parsing, abstraction, and content analysis, and in examining our research strategies and methodologies, we need to address several issues.

First, although a technology's success will ultimately be judged by its usefulness in the targeted applications, we should distinguish between the long-term research objective and short-term applications. Long-term research will result in general solutions to many applications. On the other hand, short-term applications will educate users about the potential of this technology while providing focus for some hard technical problems for the research community. The ultimate goal of the long-term research is to provide a link between the extracted low-level features and the high-level semantic description that humans perceive without significant effort. Nevertheless, many working systems exist that serve as proof of the effectiveness of even partial and domain-specific content descriptors for selective retrieval of visual information based on low-level feature extraction.

In designing these applications and products in the area of multimedia content analysis, we must keep the user in mind at all times, because users at different levels view a technology's usefulness differently. We can broadly classify users into two extremes:

❚ nontechnical consumers and

❚ trained, technical, professional corporate users who regularly use the products.

The requirements for each of these classes are different, so we should use different technologies to address their needs. Professional and consumer applications of video indexing technologies can both absorb functionalities that stem from single-modality processing and from the extraction of even low-level features.

For technology-savvy users working with content analysis, indexing, searching, and authoring tools on a daily basis, it makes sense to design systems that require more user sophistication. For example, major news agencies and TV broadcasters own large video archives. If we develop automated indexing, analysis, and search products for these applications, it's conceivable to have trained individuals retrieve and access required multimedia information, much the same way as today's trained professional librarians and information specialists retrieve information based on textual data. Under these circumstances, we can expect the operator or user to search images via textures, color histograms, or other low-level feature analysis that we wouldn't expect a typical consumer to be able or willing to cope with.

At the other extreme, for consumers, the products and applications need to be extremely simple to be viable in the marketplace. As an example, increasing consumer access to electronic imaging devices such as still digital cameras and digital camcorders has resulted in an explosion in the volume of data being generated. For consumers to annotate, index, handle, process, and access their data, products must be designed with simple yet useful functionalities. This would preclude searching techniques that are, for example, based on color histograms. Instead, consumers might want to find all the pictures in which Uncle Joe is with the baby by defining, once and for all, who Uncle Joe and the baby are pictorially. Clearly, from a technical point of view, this is harder to solve than searching color histograms. At the moment, no technically robust solutions exist for this problem. Generally speaking, the low-level features that most indexing and query systems are based on might prove to be nonviable in the consumer market in their raw form. We should conceal algorithms for low-level feature extraction from the user. For example, a video segmentation and filtering algorithm should only give the final visual table of contents of home video, with simple interaction for quick overview and access. The intermediate step—setting thresholds—is prohibitively beyond the consumer's horizon.

Professional and educational applications

Professional activities that involve generating or using large volumes of video and multimedia data are prime candidates for taking advantage of video-content analysis techniques. Here we discuss several such applications.

Automated authoring of Web content

Media organizations and TV broadcasting companies have shown considerable interest in presenting their information on the Web. A survey conducted in 1998 by the Pew Research Center for the People and the Press indicated that the number of Americans who obtained their news on the Internet was growing at an astonishing rate. This survey indicated that 36 million people got their news online at least once a week. This number had more than tripled in a two-year period. (The full survey results are available at http://people-press.org/reports/.)

The process of generating Web-accessible content usually involves using one of several existing Web-authoring tools to manually compose documents consisting of text, images, and possibly audio and video clips. This process usually consumes considerable amounts of time. When other representations of the same content are already composed for presentation in video form, we can use such presentations to repurpose the content for the Web, thereby reducing work. Analysis of the already composed video content through image and video understanding, speech transcription, and linguistic processing can serve to create alternative presentations of the information suitable for the Web. We can automatically convert large video archives to digital libraries. We can also automatically augment these Web presentations with related and supplementary information and thereby create a richer source of information than the original video programs. An example of such an automated authoring system is the Pictorial Transcripts system.2–5

Pictorial Transcripts uses video and text analysis techniques to convert closed-captioned video programs to Hypertext Markup Language (HTML) presentations with still frames containing the visual information accompanied by text derived from the closed captions. A content-based sampling method14 reduces the video frames to a small set of images that represent the visual contents of each scene in a compact way. This sampling process is based on detecting cuts and gradual transitions, as well as a quantitative analysis of the camera operations. Linguistic analysis of closed-caption text refines the text, generates textual indices, and creates links to supplementary information. Figure 2 shows a sample screen from Pictorial Transcripts that uses the output of the content-based sampling and the processed text from the closed captions. We can easily retrieve such a compact presentation over low-bandwidth communications networks. When more bandwidth is available, the presentation can include audio and video information. In this case, the still images and text serve as a pictorial and textual index into the audio and video media components. Users can search and browse a digital video library (DVL) created in this way using pictorial information as well as textual information extracted from closed captions, or recognized speech, to retrieve selective pieces of video from a large archive (see Figure 3).

[Figure 2. A snapshot from the Pictorial Transcripts system. Image courtesy of NBC News.]

[Figure 3. A snapshot from the DVL system at AT&T. Image courtesy of NBC News.]

The AT&T DVL system employs additional media processing techniques to improve the organization and presentation of the video information. The system can use speech processing to correct misalignment between the audio track and the closed-caption text. When closed-caption text isn't available, it can employ a large-vocabulary automatic speech recognizer (LVASR) to generate a transcript of the program from the audio track (see Figure 4). The quality of the automatically generated transcripts is determined by several factors, such as the quality of speech, background noise, vocabulary size, and language models. Although we can obtain high-quality results under favorable conditions, the accuracy of such automatically generated transcripts is generally below that of manually generated ones and therefore isn't suitable for direct presentation to users. Nevertheless, these automatically generated transcripts provide a viable alternative to the closed-caption text for information retrieval purposes. When sufficient bandwidth is available, the system can deliver video presentations (such as TV programs) with the same quality as the original productions. The real-time transport protocol (RTP) and the specific payload types defined by the Internet Engineering Task Force (IETF) have already made it possible to deliver high-quality MPEG-2 encoded video over IP networks. In the short term, this will only be feasible over private local IP networks. In the long term, however, this will let us create searchable and browsable TV.6,12

[Figure 4. Search based on automatic speech recognition. Image courtesy of C-SPAN.]

Searching and browsing large video archives

Another professional application of automated media content analysis is in organizing and indexing large volumes of video data to facilitate efficient and effective use of these resources for internal use. Major news agencies and TV broadcasters own large archives of video that have been accumulated over many years. Besides the producers, others outside the organization use the footage from these archives to meet various needs. These large archives usually exist on numerous different storage media, ranging from black-and-white film to magnetic-tape formats.

Traditionally, the indexing information used to organize these large archives has been limited to titles, dates, and human-generated synopses. We generally use this information to select video programs possibly relevant to the application at hand. We ultimately discern the material's relevance by viewing the candidate programs linearly or nonlinearly. Converting these large archives into digital form is a first step in facilitating the search process. This in itself is a major improvement over the old methods. We must address several practical issues, however, to make such an undertaking feasible and economical. These large video libraries create a unique opportunity for using intelligent media analysis techniques to create advanced searching and browsing techniques to find relevant information quickly and inexpensively. Intelligent video segmentation and sampling techniques can reduce the visual contents of the video program to a small number of static images. We can browse these images to spot information, use image similarity searches to find shots with similar content, and use motion analysis to categorize the video segments. Higher-level analysis can extract information relevant to the presence of humans or objects in the video. Audio event detection and speech detection can extract additional information to help the user find segments of interest.

Despite the limitations in the robust and efficient extraction of information from the constituent media streams, existing methods are effective in reducing manual labor.27

Easy access to educational material

The availability of large multimedia libraries that we can efficiently search has a strong impact on education. Students and educators can expand their access to educational material. The Telecommunications Act of 1996 acknowledged the significance of this; it has special provisions for providing Internet access to schools and public libraries. This holds the promise of turning small libraries that contain a small number of books and multimedia sources into ones with immediate access to every book, audio program, video program, and other multimedia educational material. It also gives students access to large data resources without even leaving the classroom.

Indexing and archiving multimedia presentations

Intelligently indexing multimedia presentations is another area where content-based analysis can play a major role. Existing video compression and transmission standards have made it possible to transmit presentations to remote sites. We can then store these presentations for on-demand replay. The different media components of a presentation can be processed to characterize and index it. Such processing could include analyzing the speaker's gestures, detecting slide transitions, extracting textual information by performing optical character recognition (OCR) on the slides, speech recognition, speaker identification and discrimination, and audio event detection. The information extracted by this processing generates powerful indexing capabilities that enable content-based retrieval of different segments of a presentation. Users can search an archive of presentations to find information about a topic.

Figure 5 shows a sample screen from a technical presentation with the speaker in the left window. The slides synchronize with the talk; this is done using a specialized scene change detection algorithm that finds the slide transitions. An offline-generated transcript of the talk synchronizes with the video using speech processing and supports searching and jumping to the correct points in the talk (as the search results window shows at the bottom of Figure 5).

[Figure 5. Sample screen of the system for indexing and archiving multimedia presentations.]

Indexing and archiving multimedia collaborative sessions

Multimedia collaborative systems can also benefit from effective multimedia understanding and indexing techniques. Communication networks give people the ability to work together despite geographic distances. Multimedia collaborative sessions involve the real-time exchange of visual, textual, and auditory information. The information retained is often limited to the collaboration's end result and doesn't include the steps that were taken or discussions that took place. We can set up archiving systems to store all the information together with relevant synchronization information. Content-based analysis and indexing of these archives based on multiple information streams enable the retrieval of segments of the collaborative process. Such a process lets users access not only the end result but also the process that led to those results. When the communication links used for the collaborative session are established by a conferencing bridge, we can use the available data in the indexing process, thereby reducing the processing required to identify each stream's source.

Consumer domain applications

Video-content analysis research is geared toward large video archives. However, the widest audience for video-content analysis is consumers. We all have video content pouring in through broadcast TV and cable. Also, as consumers, we own unlabeled home video and recorded tapes. To capture the consumer's perspective, the management of video information in the home entertainment area will require sophisticated yet feasible techniques for analyzing, filtering, and browsing video by content.

The methods dealing with video information in the consumer domain will have different requirements. In large archives, we store data in files, and we can access it repeatedly and more slowly than in real time. Therefore, the algorithms for content extraction can operate at rates slower than 30 fps. In consumer devices, however, the content may be available only during real-time display (recording or playback). Consequently, we can analyze video data only in real time. In large archives, we assume that the workstation for video management has considerable power (and possibly hardware support) to run video-content analysis algorithms. In the consumer domain, recording and display devices are impoverished from the point of view of information processing—devices normally have limited memory and processor power. Therefore, the algorithms must run with all these constraints in real time. In addition, these algorithms have different kinds of accessory information available. Large video archives will probably have information about the author, actors, and storyboard. In the consumer domain, however, we can probably expect metadata to be available in the broadcast stream or in the electronic program guide. Researchers are developing standards in this area, such as digital video broadcasting service information (DVB-SI) and MPEG-7, which will add descriptions and accessory data to the stored and streamed content.

Consumer devices can use many additional features for video cataloging, advanced video control, personalization, profiling, and time-saving functions. Information filtering functions for converging PCs and TVs will add value to video applications beyond digital capture, playback, and interconnect. We can use these consumer applications in video editing, cataloging applications, enhanced access, and filtering applications.

Video overview and access

An example of a video home library application that performs VHS tape cataloging functions is Video Indexing for True Access and Multimedia Information Navigation (Vitamin).28 In this system, the video-content analysis process extracts visual information, which it then archives and presents to the user as a visual table of contents for the analyzed video. The purpose is to later use this information for retrieval purposes in a master index. This prototype has an archival and a retrieval module to perform these two functions.

During archiving, the system performs abrupt scene change and static scene detection based on a comparison of the discrete cosine transform (DCT) coefficients of subsequent frames (MPEG-1 and MPEG-2). However, from the user's perspective, not all the keyframes are important or necessary to convey the video's visual contents. We apply a keyframe filtering method that reduces the number of keyframes by filtering out noisy, blurry, unicolor, and repetitive frames. For example, in a dialogue scene, it's likely that both speakers will be shown several times, requiring two frames to represent the dialogue scene.
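As a rough sketch of the idea (our simplification, not the Vitamin implementation): the DC coefficient of each 8x8 DCT block is proportional to the block's mean intensity, so comparing per-block means of successive frames approximates comparing their DCT DC coefficients cheaply.

```python
import numpy as np

def dc_image(gray):
    """Mean of each 8x8 block -- proportional to the block's DCT DC coefficient."""
    h, w = gray.shape
    blocks = gray[:h - h % 8, :w - w % 8].astype(np.float64)
    return blocks.reshape(h // 8, 8, w // 8, 8).mean(axis=(1, 3))

def is_abrupt_change(prev_gray, gray, threshold=20.0):
    """Declare a scene change when the DC images differ strongly on average.

    Both arguments are grayscale frames of equal size; the threshold is
    illustrative and would be tuned for the target material.
    """
    return np.abs(dc_image(gray) - dc_image(prev_gray)).mean() > threshold
```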

During the keyframe selection process, we use frame signatures to compare keyframes and to detect a particular frame's content. A keyframe signature representation is derived for each grouping of similarly valued DCT blocks in a frame. By using different thresholds, we can control the number of filtered keyframes. The signatures are also used for spotting certain patterns, which might correspond to objects of interest. Selected keyframes are structured in a temporal hierarchy, which is flattened to aid optimal retrieval, even from slow storage devices. The system presents this hierarchy to the user in a visual table of contents (see Figure 6). The user can browse and navigate through the visual index and fast-forward or rewind to a certain point on the videotape or MPEG file.

[Figure 6. A snapshot from the visual table of contents user interface in a home library application.]

Video content filtering

In the consumer domain, some products already perform video-content analysis and filtering functions—for example, VCRs with the automatic commercial skip feature. Most systems work by detecting black frames and changes in activity.

When people watch a TV program, such as Seinfeld, they immediately recognize the nonprogram segments in the broadcast. This is because they recognize the broadcast's switch in context. Some characteristics of commercials are therefore naturally different from those of the TV program. Such special characteristics include rapid scene changes, repetitive nature, use of text with different sizes, transitional monochrome frames into and out of the commercial break, and absence of the station logo. Based on these characteristics, suitable methods for advertisement isolation are cut rate, average keyframe distance, black frame, static frame rate, similar frame distance, text location detection, logo detection, audio analysis, and memorizing (autolearning) advertisements. The commercial breaks are usually preceded and terminated by a series of black frames.

We can often detect commercials by determining when a high number of cuts per minute occurs in conjunction with the identification of the black-frame series. Specifically, by using the relative times between cuts, we can determine the cut density. However, action movies may also have prolonged scenes with a large number of cuts per minute. For more reliable commercial isolation, we can analyze the total distribution of the cuts and black frames in the source video in an attempt to deduce their frequency and length.

We analyze the keyframes for color uniformity and similarity to previously selected keyframes. We use the DCT coefficients to create a frame signature for the keyframe clustering process. During this process, we use the frame signatures to compare keyframes and detect a particular frame's content. In addition, we analyze the video cut rate and determine the duration of the cut-rate change. We also identify the black frame boundaries and analyze the total distribution of the cut-rate change in the program to deduce the frequency and length of the cuts. This can help us determine the likelihood of commercial segments.

Figure 7 shows the keyframes extracted from a commercial break. The vertical lines at the bottom represent the cuts, and a high density of cuts per unit time (cut rate) may represent a commercial break, as Figure 7 depicts. The cut rate alone produces many false positives. To reduce the number of false positives, we can examine the falsely flagged sections for the presence or absence of text. All the false positives in this example had a significant amount of text, which varied significantly in its position on the TV screen as well as in size. The type of text present in some of these areas was scene text (for example, text on cars, helicopters, or police vehicles), buildings with names, or a product logo.

[Figure 7. Keyframes from an area with a high density of cuts representing a commercial break.]

For consumer applications, we must detect commercials on a constrained platform.29 We've also developed methods that use features produced during MPEG compression. We developed an algorithm that uses features as triggers and verifiers to detect commercials. Our algorithm first looks for a triggering event, such as a black frame, to mark a potential commercial start. Once a potential start is found, it uses other characteristic features to verify the commercial break. We've achieved a recall of 93 percent and a precision of 95 percent when station logos and trailers are excluded, and a precision of 99 percent when station logos and trailers are regarded as part of the commercial break.
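The exact triggers and verifiers aren't listed here; this schematic sketch (key names and thresholds are our own illustration) shows only the trigger-then-verify control flow:

```python
def detect_commercial_breaks(seconds, min_cut_rate=12.0, window=60):
    """Trigger on a black-frame series, then verify with a dense cut rate.

    `seconds` is a per-second list of dicts with illustrative keys:
    'black' (black frames seen in that second) and 'cuts' (cut count).
    Returns (start, end) offsets, in seconds, of likely commercial breaks.
    """
    breaks, t, n = [], 0, len(seconds)
    while t < n:
        if seconds[t]["black"]:                        # trigger event
            end = min(t + window, n)
            cuts_per_min = 60.0 * sum(s["cuts"] for s in seconds[t:end]) / (end - t)
            if cuts_per_min >= min_cut_rate:           # verifier: dense cutting
                breaks.append((t, end))
                t = end
                continue
        t += 1
    return breaks
```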

Enhanced access to broadcast video

Intelligent access and enhanced search tools for broadcast content are an important area where content-based analysis can make a strong contribution. The existing high-definition TV, video compression, and transmission standards have made it possible to transmit a large number of high-quality TV channels to the consumer over broadcast networks and the Internet. However, the only available tools to grapple with the growing number of TV channels are the scrolling program guide (at least in the US) and the old-fashioned paper guide. The program guide is envisioned under the analog paradigm of passively receiving entertainment. The scrolling isn't interactive and is valid for only a limited time. Within the Society of Motion Picture and Television Engineers (SMPTE), TV-Anytime, MPEG, and DVB, there are ongoing efforts to provide auxiliary information with the regular broadcast stream. Current personal video recorders on the market use an electronic program guide for interactive selection of programs to watch or store. There could be layers of additional personalization of content where new video-content analysis algorithms could be employed. The combined information extracted by this video-content processing will generate powerful indexing capabilities that would enable content-based retrieval of different segments of TV programs online or offline.

We developed Video Scout, a content-based retrieval system for personalizing TV at a subprogram level.6 Users make content requests in their user profiles, and then Scout begins recording TV programs. In addition, Scout actually watches the TV programs it records and personalizes program segments. Scout analyzes the visual, audio, and transcript data to segment and index the programs. When viewing full programs, users see a high-level overview as well as topic-specific starting points. (For example, users can quickly find and play Dolly Parton's musical performance within an episode of Late Night with David Letterman.) In addition, users can access video segments organized by topic (such as finding all the segments on Philips Electronics that Scout has recorded from various financial news programs). We divide Scout's interface into two sections—program guide and TV magnets. The program guide lets users interact with whole TV programs that they can segment in different ways. TV magnets (see Figure 8) let users access their profiles and video clips organized by topic. Users navigate the interface on a TV screen using a remote control.

[Figure 8. Content magnets attracting story segments in the content-based personal video recorder Video Scout application.]

The archiving module employs a three-layered, multimodal integration framework to segment, analyze, characterize, and classify segments. The multimodal segmentation and indexing incorporates a Bayesian framework that integrates information from the audio, visual, and transcript (closed-caption) domains. This framework uses three layers to process low-, mid-, and high-level multimedia information. The retrieval module relies on users' personal preferences to deliver both full programs and video segments. In addition to using electronic program guide metadata and a user profile, Scout lets users request specific topics within a program. For example, users can request the video clip of the US President speaking from a half-hour news program. The high-level layer generates semantic information about TV program topics used during retrieval.
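Scout's three-layer framework isn't specified in detail here; a naive Bayes-style fusion (our own toy formulation, not Scout's actual model) conveys the flavor of combining evidence from the three domains:

```python
import numpy as np

def fuse_modalities(prior, audio_lik, visual_lik, transcript_lik):
    """Combine per-topic likelihoods from three modalities into a posterior.

    Each likelihood argument holds P(observation | topic) over candidate
    topics; `prior` holds P(topic). The modalities are assumed conditionally
    independent given the topic -- the usual simplifying assumption.
    """
    joint = prior * audio_lik * visual_lik * transcript_lik
    return joint / joint.sum()

# Illustrative: three candidate topics; the transcript strongly favors the third.
posterior = fuse_modalities(np.full(3, 1 / 3),
                            np.array([0.2, 0.5, 0.3]),
                            np.array([0.3, 0.3, 0.4]),
                            np.array([0.1, 0.1, 0.8]))
```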

Advanced access to broadcast video must take into account user preferences captured in a user profile. Ferman et al.30 proposed automatic user profiling and filtering agents. The profiling agent, based on fuzzy reasoning, automatically generates the user's profile on the basis of usage history. Given program description metadata, the filtering agent filters programs on the basis of the user's profile. The agents developed by Ferman et al.30 can generate MPEG-7 and TV-Anytime compliant usage history and user preference descriptions, as well as filter programs that are described by MPEG-7 and TV-Anytime compliant descriptions.

Conclusions

To keep things in perspective, it's important to distinguish between research activities, experiments, and real applications that have made, or are likely to make, the transition from research labs into the real world. Researchers and technologists, who are constantly reminded of the level of difficulty in meeting certain technological challenges, are more likely to be excited by new technologies that may not yet be ready for use. The targeted users are the ultimate judges of the technology's usefulness in meeting their needs. On the other hand, accepting new ways of doing things often involves a change in mindset. Users accustomed to performing a certain task with existing tools and methods might have a tendency to resist new tools and methods. For example, most people who are accustomed to text-based information retrieval techniques may not feel as comfortable with the notion of performing image and video searches by nonlinguistic queries. Well-designed prototype applications can help bring about the necessary change in mindset for users to accept these applications. Such prototypes also serve the purpose of making technologists aware of users' requirements and preferences.

References
1. M.G. Brown et al., "Automatic Content-Based Retrieval of Broadcast News," Proc. 3rd Int'l Conf. Multimedia (ACM Multimedia 95), ACM Press, New York, 1995, pp. 35-43.
2. B. Shahraray and D.C. Gibbon, "Automatic Generation of Pictorial Transcripts," Proc. SPIE Conf. Multimedia Computing and Networking 1995, SPIE Press, Bellingham, Wash., 1995, pp. 512-518.
3. B. Shahraray and D.C. Gibbon, "Automated Authoring of Hypermedia Documents of Video Programs," Proc. 3rd Int'l Conf. Multimedia (ACM Multimedia 95), ACM Press, New York, 1995, pp. 401-409.
4. B. Shahraray and D.C. Gibbon, "Efficient Archiving and Content-Based Retrieval of Video Information on the Web," Proc. AAAI Symp. Intelligent Integration and Use of Text, Image, Video, and Audio Corpora, 1997, pp. 133-136.
5. B. Shahraray, "Multimedia Information Retrieval Using Pictorial Transcripts," Handbook of Multimedia Computing, B. Furht, ed., CRC Press, Boca Raton, Fla., 1999, pp. 345-359.
6. R.S. Jasinschi et al., "Integrated Multimedia Processing for Topic Segmentation and Classification," Proc. IEEE Int'l Conf. Image Processing (ICIP 2001), IEEE CS Press, Los Alamitos, Calif., 2001.
7. H.J. Zhang et al., "Video Parsing, Retrieval, and Browsing: An Integrated and Content-Based Solution," Proc. 3rd Int'l Conf. Multimedia (ACM Multimedia 95), ACM Press, New York, 1995, pp. 15-24.
8. R. Zabih, K. Mai, and J. Miller, "A Robust Method for Detecting Cuts and Dissolves in Video Sequences," Proc. 3rd Int'l Conf. Multimedia (ACM Multimedia 95), ACM Press, New York, 1995.
9. H.J. Zhang et al., "Video Parsing Using Compressed Data," Proc. SPIE 94 Image and Video Processing II, SPIE Press, Bellingham, Wash., 1994, pp. 142-149.
10. D. Swanberg, C.-F. Shu, and R. Jain, "Knowledge-Guided Parsing in Video Databases," Proc. SPIE Conf. Storage and Retrieval for Image and Video Databases, SPIE Press, Bellingham, Wash., 1993.
11. H.J. Zhang et al., "Automatic Parsing and Indexing of News Video," Multimedia Systems, vol. 2, no. 6, 1995, pp. 256-265.
12. Q. Huang et al., "Automated Generation of News Content Hierarchy by Integrating Audio, Video, and Text Information," Proc. IEEE Int'l Conf. Acoustics, Speech, and Signal Processing (ICASSP 99), IEEE CS Press, Los Alamitos, Calif., 1999.
13. P. Aigrain and P. Joly, "The Automatic Real-Time Analysis of Film Editing and Transition Effects and Its Applications," Computers & Graphics, vol. 18, no. 1, Jan./Feb. 1994, pp. 93-103.
14. B. Shahraray, "Scene Change Detection and Content-Based Sampling of Video Sequences," Proc. IS&T/SPIE Symp. Digital Video Compression: Algorithms and Technologies, vol. 2419, SPIE Press, Bellingham, Wash., 1995, pp. 2-13.
15. Y. Rui, A. Gupta, and A. Acero, "Automatically Extracting Highlights for TV Baseball Programs," Proc. 8th Int'l Conf. Multimedia (ACM Multimedia 2000), ACM Press, New York, 2000, pp. 105-115.
16. B. Li and M.I. Sezan, "Event Detection and Summarization in American Football Broadcast Video," Proc. IS&T/SPIE Conf. Storage and Retrieval for Media Databases, SPIE Press, Bellingham, Wash., 2002, pp. 202-215.
17. B. Li and M.I. Sezan, "Event Detection and Summarization in Sports Video," Proc. IEEE Workshop on Content-Based Access to Video and Image Libraries, IEEE CS Press, Los Alamitos, Calif., 2001, CD-ROM.
18. H. Pan, B. Li, and M.I. Sezan, "Automatic Detection of Replay Segments in Broadcast Sports Programs by Detection of Logos in Scene Transitions," Proc. IEEE Int'l Conf. Acoustics, Speech, and Signal Processing (ICASSP 2002), IEEE CS Press, Los Alamitos, Calif., 2002, CD-ROM.
19. H. Pan, P. van Beek, and M.I. Sezan, "Detection of Slow-Motion Replay Segments in Sports Video for Highlights Generation," Proc. IEEE Int'l Conf. Acoustics, Speech, and Signal Processing (ICASSP 2001), IEEE CS Press, Los Alamitos, Calif., 2001, CD-ROM.
20. B. Li et al., "Sports Program Summarization," IEEE Computer Vision and Pattern Recognition (CVPR) Conf. Demonstration Session, IEEE CS Press, Los Alamitos, Calif., 2001, CD-ROM.
21. D.C. Gibbon and J. Segen, "Video Based Detection of People from Motion," Proc. Virtual Reality Systems Conf., 1993.
22. H. Sundaram and S.-F. Chang, "Constrained Utility Maximization for Generating Visual Skims," Proc. IEEE Workshop on Content-Based Access of Image and Video Libraries (CBAIVL 2001), IEEE CS Press, Los Alamitos, Calif., 2001.
23. T. Liu and J.R. Kender, "A Hidden Markov Model Approach to the Structure of Documentaries," Proc. Computer Vision and Pattern Recognition: Workshop on Content-Based Access of Image and Video Libraries (CVPR 00), IEEE CS Press, Los Alamitos, Calif., 2000.
24. L. Agnihotri et al., "Summarization of Video Programs Based on Closed Captioning," Proc. SPIE Conf. Storage and Retrieval in Media Databases, SPIE Press, Bellingham, Wash., 2001, pp. 599-607.
25. H.J. Zhang, J.Y.A. Wang, and Y. Altunbasak, "Content-Based Video Retrieval and Compression: A Unified Solution," Proc. IEEE Int'l Conf. Image Processing, IEEE CS Press, Los Alamitos, Calif., 1997.
26. S.F. Chang et al., "VideoQ: An Automated Content-Based Video Search System Using Visual Cues," Proc. 5th Int'l Conf. Multimedia (ACM Multimedia 97), ACM Press, New York, 1997, pp. 313-324.
27. D. Gibbon et al., "Browsing and Retrieval of Full Broadcast-Quality Video," Proc. Packet Video Conf., 1999, CD-ROM.
28. N. Dimitrova, T. McGee, and H. Elenbaas, "Video Keyframe Extraction and Filtering: A Keyframe Is Not a Keyframe to Everyone," Proc. ACM Conf. Knowledge and Information Management, ACM Press, New York, 1997, pp. 113-120.
29. N. Dimitrova et al., "Real-Time Commercial Detection Using MPEG Features," to appear in Proc. 9th Int'l Conf. Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU 2002), July 2002.
30. A.M. Ferman et al., "Content-Based Filtering and Personalization Using Structured Metadata," to appear in Proc. Joint Conf. Digital Libraries, 2002.

Nevenka Dimitrova is a research staff member at Philips Research. Her main research interests are in content information management, digital TV, content synthesis, video content navigation and retrieval, MPEG-7, and advanced multimedia systems. She has a BS in mathematics and computer science from the University of Kiril and Metodij, Skopje, Macedonia, and an MS and PhD in computer science from Arizona State University. She is on the editorial boards of IEEE MultiMedia, ACM Multimedia Systems Journal, ACM Transactions on Information Systems, and IEEE Communications Society (an online magazine). She is a program chair of ACM Multimedia 2002 in the area of content processing.

Hong-Jiang Zhang is a senior researcher and assistant managing director at Microsoft Research Asia. He has a BS from Zhengzhou University, China, and a PhD from the Technical University of Denmark, both in electrical engineering. He is a senior IEEE member and an ACM member. He currently serves on the editorial boards of five professional journals and a dozen committees of international conferences.

Behzad Shahraray is a division manager at AT&T Labs Research, where he heads the Multimedia Processing Research Department. His work focuses on multimedia indexing, multimedia data mining, content-based video sampling, and automated authoring of searchable and browsable multimedia content. He has MS degrees in electrical engineering and in computer, information, and control engineering, as well as a PhD in electrical engineering, from the University of Michigan, Ann Arbor. He's an IEEE and ACM member. He serves on the editorial board of the International Journal of Multimedia Tools and Applications.

Ibrahim Sezan is the director of the Information Systems Technologies Department at Sharp Laboratories of America. His research focuses on audio-visual content understanding and content summarization, automatic user profiling and content filtering, human visual system models, visually optimized information display algorithms for flat-panel displays, and smart algorithms for cameras. He has BS degrees in electrical engineering and mathematics from Bogazici University, Istanbul, Turkey. He has an MS in physics from the Stevens Institute of Technology, Hoboken, New Jersey, and a PhD in electrical, computer, and systems engineering from Rensselaer Polytechnic Institute, Troy, New York.

Thomas Huang is the William L. Everitt Distinguished Professor of Electrical and Computer Engineering at the University of Illinois at Urbana–Champaign (UIUC). He also serves at UIUC as a research professor at the Coordinated Science Laboratory, and he's the head of the Image Formation and Processing Group at the Beckman Institute for Advanced Science and Technology. He has a BS in electrical engineering from National Taiwan University, Taipei, Taiwan, and an MS and ScD in electrical engineering from the Massachusetts Institute of Technology.

Avideh Zakhor is a professor in the Department of Electrical Engineering and Computer Sciences at the University of California at Berkeley. Her research interests include image and video processing, compression, and communication. She has a BS from the California Institute of Technology, Pasadena, and an MS and PhD from the Massachusetts Institute of Technology, all in electrical engineering. She is an IEEE Fellow.

Readers may contact Nevenka Dimitrova at Philips Research, 345 Scarborough Rd., Briarcliff Manor, NY 10510, email [email protected].

For further information on this or any other computing topic, please visit our Digital Library at http://computer.org/publications/dlib.
