Chapter 1

Large-Scale Analysis for Interactive Media Consumption

David Gibbon1, Andrea Basso1, Lee Begeja2, Zhu Liu1, Bernard Renger2, Behzad Shahraray1, Eric Zavesky1

1 AT&T Labs - Research, 200 Laurel Avenue South, Middletown, NJ, USA
2 AT&T Labs - Research, 180 Park Avenue, Florham Park, NJ, USA

1.1 Introduction

Over the years the fidelity and quantity of TV content have steadily increased, but consumers are still experiencing considerable difficulties in finding the content matching their personal interests. New mobile and IP consumption environments have emerged with the promise of ubiquitous delivery of desired content, but in many cases available content descriptions in the form of electronic program guides lack sufficient detail, and cumbersome human interfaces yield a less than positive user experience. Creating metadata through detailed manual annotation of TV content is costly and, in many cases, this metadata may be lost in the content life-cycle as assets are repurposed for multiple distribution channels. Content organization can be daunting when considering domains ranging from breaking news contributions, local or government channels, live sports, music videos, and documentaries up through dramatic series and feature films. As the line between TV content and Internet content continues to blur, more and more long tail content will appear on TV, and the ability to automatically generate metadata for it becomes paramount. Research results from several disciplines must be brought together to address the complex challenge of cost-effectively augmenting existing content descriptions to facilitate content personalization and adaptation for users, given today's range of content consumption contexts.

This chapter presents system architectures for processing large volumes of video efficiently; practical, state-of-the-art solutions for TV content analysis and metadata generation; and potential applications that utilize this metadata in effective and enabling ways. A brief


synopsis of chapter highlights is included below to serve as guidance for the reader, as several topics are given a deeper discussion.

• System Architecture - A flexible system architecture supports a range of applications, including both on-demand retrieval of assets ingested as files and real-time processing of IP multicast MPEG-2 transport streams. Transcoding and segmentation are performed at both the ingest and delivery phases to meet the system design parameters of optimizing storage and supporting content repurposing for a range of user devices, including desktop browsers, mobile phones, tablets, and set top boxes.

• Media Analysis - Starting from high-level program structure discovery, programs are segmented at several scales using the detection of similar segments with shot boundary detection, mid-level semantic classifiers of images, speech recognition, speaker segmentation, face detection, near duplicate detection, and clustering methods. These elementary segmentations are then combined to perform anchorperson detection and multimodal news story segmentation for easier program navigation and indexing.

• Clients and Applications - Standards-compliant metadata (e.g., EPG, PSIP, MediaRSS, MPEG-7) is ingested, augmented, and made available to client applications. Over 200,000 TV programs have been indexed by the proposed system, and media processing results are available in XML form for each asset. Indexing systems with web service interfaces provide rapid access to this detailed metadata for content collections to facilitate the creation of highly dynamic applications, from complex analytical scenarios to mobile retrieval environments.


Figure 1.1: High-level system architecture briefly describing all stages of content analysis.

The remainder of this chapter is organized into three main sections as follows: section 1.2 describes systems and architectures to support content processing at scale, section 1.3 discusses techniques for processing TV content, and section 1.4 outlines a few applications that are enabled by TV content processing, ranging from those targeted to expert users for detailed content analysis to those intended for novice users for entertainment applications utilizing multiple device environments.

1.2 System Architectures for Content Processing

The increasing relevance of automatic content processing methods as a key component of multimedia services requires the development of sophisticated, large-scale content processing system architectures. Such architectures, which rely on well-known Service Oriented Architecture (SOA) concepts with the extensions specific to content processing, are necessary to process large amounts of content from real-time feeds or existing archives and to analyze and publish content rapidly for immediate search. While the transactional nature of SOA is very effective for some of the tasks in a content processing architecture, other media-intensive workflows are characterized by a series of specific requirements. For example, the individual content processing modules can range from computationally very light and of a stateless transactional nature (e.g., converting a metadata format) to extremely computationally intensive (e.g., asset transcoding), and may take a considerable amount of time. In addition, asset management, large data transfer, and large data storage are generally involved, requiring the separation of the metadata and media processing paths for efficiency. Furthermore, specific consideration needs to be given to maintaining coherence of media formats and profiles. Finally, content security, including content watermarking and DRM, must be taken into account for every step of the media workflow. In the discussed architectures, the workflows are media-aware to address these issues.

A high-level system architecture is presented in Figure 1.1. The architecture separates the media-specific path (indicated by solid arrows in Figure 1.1) from the metadata path (dashed arrows in Figure 1.1). Metadata that is still embedded in the media is separated out at the ingestion phase. Two separate communication buses, one for media and one for metadata, allow for proper process pipelining in order to maximize efficiency and meet requirements for a variety of applications. The system architecture in Figure 1.1 is distributed and reconfigurable and can process live ingested media streams and static media collections. This architecture handles media acquisition, content processing, indexing, and publishing for a variety of heterogeneous devices. The ensemble of modules is orchestrated by a task flow management component that exploits parallelism whenever possible. A central scheduler manages load balancing among the servers and has the capability of task prioritization. Each module is exposed as a web service, implemented with industry-standard interoperable middleware components. For more information, readers can consult a detailed description of the architecture in [16].
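To make the scheduler's role concrete, the following is a minimal sketch (in Python) of a prioritizing scheduler that assigns the most urgent pending task to the least-loaded node. The node names, task fields, and priority values are hypothetical; the production component described in [16] is considerably more elaborate.

import heapq
import itertools

class TaskScheduler:
    """Dispatches analysis tasks to the least-loaded node, highest priority first."""

    def __init__(self, nodes):
        self.load = {node: 0 for node in nodes}   # node -> tasks currently assigned
        self.queue = []                           # heap of (priority, seq, task)
        self.seq = itertools.count()              # tie-breaker for stable ordering

    def submit(self, task, priority=10):
        # Lower numbers run first, e.g. live ingest segments before archive backfill.
        heapq.heappush(self.queue, (priority, next(self.seq), task))

    def dispatch(self):
        """Assign the most urgent pending task to the least-loaded node."""
        if not self.queue:
            return None
        _, _, task = heapq.heappop(self.queue)
        node = min(self.load, key=self.load.get)
        self.load[node] += 1
        return node, task

scheduler = TaskScheduler(nodes=["node-a", "node-b", "node-c"])
scheduler.submit({"asset": "news_0415.ts", "step": "transcode"}, priority=1)
scheduler.submit({"asset": "archive_1999.mpg", "step": "shot_detection"}, priority=20)
print(scheduler.dispatch())   # the urgent live transcode is dispatched first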

1.2.1 Web Services Model

With the increasing availability of high-bandwidth connectivity, opportunities for content analysis services have surfaced in desktop, mobile, and set-top environments alike. This is a formidable challenge because each environment may have unique demands for content acquisition, representation, retrieval, and delivery. An increasingly popular way to bridge these environments is using web-based services following either a SOAP-based or a REST-based design model that leverages lightweight data formats like MediaRSS (XML) or JSON. Web services allow generic access for both acquisition (the Internet, a camera, a DTV feed, a desktop computer, etc.) and consumption (a mobile phone, a television, etc.) devices to offload heavy resource and complex computational requirements onto a remote location, as illustrated in Figure 1.2. Web services naturally fit into "cloud" computing or storage architectures, where network connectivity, throughput, and some security demands are assumed to be satisfied by service providers as a prerequisite of their offering.


Figure 1.2: Typical web-service configuration, offering a generic interface to many platforms and offloading indexing, analysis, and storage requirements to a decentralized location.

As illustrated above, there are a few core components for content analysis systems exposed via web services. First, the most critical (and only visible) component is a services controller that acts as a middle layer, mapping requests from different clients into functionality requests for underlying systems. This middle layer exposes a simple yet intuitive API that adds additional security provisioning to internal system functions and is capable of routing a request to any number of distributed resources that are hidden within the web service. These capabilities correctly vet access to internal resources and allow a single web service to efficiently distribute requests according to system load. Second, a set of computers (either physical or virtual), referred to as nodes, capable of executing analysis functions on a piece of content are located within the web service. Nodes can be added in an ad-hoc fashion to execute one or many of the processing functions required. For example, some nodes with a large local storage capacity and a fast processor may be ideal for video analysis, whereas other nodes with a large memory capacity may be better suited for complex numerical analysis. Next, an indexing database that stores metadata and low-level features for a set of content is kept within the web service and is accessible only to other internal systems. Similar to the internal computation nodes, this indexing database can also be distributed to better accommodate different request loads if needed. Finally, to complete a media web service, a reliable content storage and delivery system is needed. The only responsibilities of this system are to store incoming content and queries and to deliver that content in a format that the requesting client can interpret. As the number of devices, operating systems, and content codecs fluctuates across clients, it is critical to have a content delivery system capable of satisfying all playback requests. These systems can transcode content to all possible formats required (i.e., decode to a raw format and re-encode to a target format) either preemptively off-line or on-line in a live, adaptive streaming fashion. Although on-line transcoding no longer requires elaborate systems with costly dedicated hardware, the cost of digital storage is also continually declining. Thus, a system designer must be cognizant of the expected number of users and requests for his or her application before choosing one configuration.
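As a concrete illustration of the services controller role, the sketch below routes a validated client request to the least-loaded internal node offering the requested function. The node URLs, function names, and authorization check are hypothetical placeholders, not the interfaces of the deployed system.

ANALYSIS_NODES = {
    "shot_detection": ["http://10.0.0.11/sbd", "http://10.0.0.12/sbd"],
    "speech_to_text": ["http://10.0.0.21/asr"],
}
node_load = {url: 0 for urls in ANALYSIS_NODES.values() for url in urls}

def is_authorized(api_key):
    return api_key == "demo-key"                 # placeholder credential check

def route_request(function, asset_id, api_key):
    """Map an external API call onto an internal processing node."""
    if not is_authorized(api_key):               # vet access before exposing internals
        raise PermissionError("invalid API key")
    if function not in ANALYSIS_NODES:
        raise ValueError("unknown analysis function: " + function)
    # Pick the least-loaded node offering this function.
    node = min(ANALYSIS_NODES[function], key=node_load.get)
    node_load[node] += 1
    return {"node": node, "asset": asset_id, "function": function}

print(route_request("shot_detection", "asset-123", "demo-key"))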

1.2.2 VoD Ingest Model

In more traditional architectures such as VoD, content is analyzed in a distributed manner from static repositories. In these architectures, content ingestion and processing are two separate phases that include a large amount of intermediate storage and potentially involve content transcoding. Such architectures are not subject to stringent latency constraints and in general do not have streaming or real-time requirements. However, with the increased need for content analysis, and in particular the volume of content derived from broadcast or multicast sources that needs to be processed, real-time and streaming requirements need to be factored in, leading to more complex system architectures.

1.2.3 Linear & Continuous Ingest Model

We define the linear and continuous ingestion model as a framework in which metadata and content are ingested continuously in time in a linear fashion and in which the results of content processing are readily available for consumption. Every element of a linear and continuous ingestion architecture must be designed to minimize end-to-end latency. Relevant content analysis results corresponding to a point in the ingested media stream need to be available for consumption within a bounded time interval for applications such as content-based search, advanced content-based services, content monitoring, etc. Media processing profiles are defined to match content analysis loads with available processing resources. The architecture dynamically adapts from one profile to another in a graceful manner. As an example, transcoding parameters can be changed dynamically, or the accuracy of speech-to-text indexing can be traded off against execution time. For real-time processing, content storage is minimized if not eliminated completely, and content must be processed in a real-time manner with minimal buffering and state. Unlike the VoD ingest case, the continuous ingest model precludes any media processing that uses multiple passes with large windows. In order to meet some of these requirements the usage of specialized processing software and hardware nodes may be required. Such nodes may require demultiplexed elementary streams from the original content, or encapsulation format conversion may be required in some cases. This may be implemented by a preprocessing service that prepares the media prior to invoking the main processing services.

1.2.4 Role of Standards

Taking a modular approach to media processing architecture design allows individual components to be optimized more easily and simplifies reconfigurability to support a wide range of applications. Well-defined standards for data representation are critical for successful system operation, and open standards enable interoperability among research groups and industry vendors. This applies not only at the transport and basic data marshaling level through the use of such standards as TCP/IP, HTTP, XML, and REST, but also up through the application layers as well. Of course, supporting a range of media encoding specifications and media container formats is a requirement for any media processing system. Beyond this, standardizing the representation of the results of media analysis in addition to basic content descriptions enables content creators and media consumption applications to interoperate harmoniously.

Electronic Program Guide Metadata

The role of TV content analysis may be viewed as that of extracting enhanced content descriptions that augment available high-level descriptions. Although other external sources may be available, such as less dynamic Internet knowledge repositories, generally the most reliable descriptions are provided by the Electronic Program Guides (EPG) that accompany TV programming offered over the air, cable, or IP networks. These enhanced descriptions increase the quality of experience for users, for example by improving content discovery through providing more data to search and personalization systems. Standards are critical for the exchange of content descriptions among the content analysis subsystem and other service components, as well as for the interoperability between different systems. EPG data is typically managed in relational databases but exchanged in XML format. Various methods have been devised to efficiently deliver EPG metadata via multicast communications [6]. Program descriptions and scheduled broadcast event information can also be delivered in an encapsulated form with the content. For example, terrestrial broadcasts in the U.S. use the ATSC Program and System Information Protocol to deliver program guide information to receivers [13]. The data model is most easily represented in an XML schema, which facilitates reuse of data types and extension to support the representation of detailed content analysis results harmoniously. For instance, the Alliance for Telecommunications Industry Solutions / IPTV Interoperability Forum (ATIS IIF) EPG specification [30] incorporates schemas from TV-Anytime [2] as well as MPEG-7 [20] and includes, for example, classification schemes for role codes defined by the European Broadcasting Union (EBU) [3]. In addition to the global program descriptions, TV-Anytime allows for the specification of metadata describing segments of a program which, for example, could be used for subtopics in a documentary. Going further, a recent effort in MPEG-7 [30] defines a profile for the representation of automated content processing results. This will be a natural extension to the existing EPG standards, which already use MPEG-7 data types.

Representation of Automatically Extracted Metadata

The current evolution and performance of content analysis tools and their imminent mass-scale market adoption stress the importance of a standardized representation of automatically extracted metadata. In this context, the EBU P/SCAIE metadata group [29][1] has designed a new MPEG-7 audiovisual profile for the description of complex multimedia content entities. This profile accommodates a comprehensive structural description of the content, including audio and visual feature descriptions obtained via automatic metadata extraction. The profile also defines a set of semantic constraints on the selected tools, which resolve ambiguities in modeling the description and support system interoperability. The description tools in this profile can be used to describe the results of various kinds of media analyses with visual and audio low-level features. Consequently, the information resulting from the media analyses, such as shot/scene detection, face recognition/tracking, speech recognition, copy detection, and summarization, can be used in a wide range of applications such as archive management, news, new services for new media, and many academic projects handling large-scale video content. Citing a practical use-case discussed earlier in this chapter, copy repetition detection is implemented through AudioVisual Segment cross-referencing (through the Relation element of AudioVisual Segment), and by the use of the appropriate term chosen from the Segment Relation Classification Scheme to specify the kind of copy/transformation. For more detail, the reader can refer to the examples reported in [1].

Media Encoding and Delivery

While several media encoding formats may be used for content archival, and new formats are constantly being developed, there has been some convergence towards H.264 for a wide range of video encoding applications. H.264 is often used in conjunction with an MPEG-2 transport stream for delivery on IPTV networks, or is encapsulated in an MPEG-4 file format for web applications. Individual implementations must balance design tradeoffs when selecting particular encoding parameters; for example, HTTP adaptive streaming and precise random access may suggest a reference frame structure with frequent intra-coded frames, but at the cost of increased storage and bandwidth requirements.

On-Demand Metadata

Global metadata in the form of an electronic program guide is used to describe the scheduled broadcast time and program-level information about programs. The on-demand consumption paradigm is gaining popularity in the TV realm, and web media has been predominantly on-demand for many years. While some of the same program description data-types may be used for both EPG and VoD (or CoD, for content on demand) applications, in practice other standards for CoD have emerged, such as the Cable Labs VoD metadata specifications [5] in the broadcast space or MediaRSS [4] and its variants in the web space. These data models not only include program descriptions, but also support publishing functions such as specifying content purchase options and periods of availability.

1.3 Media Analysis

At the core of every content processing system is a battery of techniques for media analysis. To address the many different forms of media, like still images, audio podcasts, live or broadcast video, and highly edited content, media analysis is broken down into different stages that all produce bits of information, called metadata, that can be easily transmitted and stored within a processing architecture. Automated media analysis methods can generate detailed metadata that augments existing manually created content descriptions to enable a range of video services. Algorithms may operate on individual media streams, including audio, video, or text streams from subtitles or closed captions. Multimodal processing techniques process media streams collectively to improve accuracy. This section presents several methods that have applicability to a broad range of TV content genres. Most of the media processing functions discussed here have been the subject of research study for a number of years, and interested readers are encouraged to consult the references for a more in-depth treatment. However, to convey solutions to a few issues encountered in media processing algorithms in this domain, some sections are explored in more detail.

1.3.1 Media Segmentation


Figure 1.3: Typical TV program structure.

Dividing media streams and programs into smaller segments facilitates media retrieval, browsing, and content adaptation for mobile device applications. For some content sources, such as TV news programs, a hierarchical program structure can be extracted [19] as shown in Figure 1.3. This program structure, whether determined automatically by content analysis or manually by user annotation, enables other content analysis routines to more precisely understand and process the different syntactic structures of a TV program. For example, non-news programs (i.e., soap operas and sitcoms) may have story segments like an introduction, action, and conclusion instead of topic segments, but both of these segmentations provide strong cues for the underlying syntax of a program. While some program analysis systems may require a detailed syntactical understanding of a program's structure, the methods described here focus on high-performance detection of elementary syntax elements, such as shots and commercial breaks.

Shot Boundary Detection

In a video sequence, a visual shot corresponds to the act of turning a camera on and off of a scene, person, or object, and results in a group of adjacent frames whose content is homogeneous. Shot boundary detection facilitates further video content analysis, and it is an important component in video indexing, query, and browsing systems. Due to fast global or local motion, camera operations, complex lighting schemes, camera instability, and combinations of a variety of video editing effects, shot boundary detection (SBD) is still a challenging task. One of the most established platforms for shot boundary detection was the SBD task evaluated yearly in TRECVID from 2001 to 2007. TRECVID is a workshop hosted by the National Institute of Standards and Technology (NIST), organized to objectively evaluate international academic and industrial research labs in numerous video analysis tasks [35]. Interested readers can find the state of the art approaches reported in these workshops.

Figure 1.4: Overview of the shot boundary detection system.

One approach for shot boundary detection and classification [25] is illustrated in Figure 1.4. A set of independent detectors, targeting the most common types of shot boundaries, is designed: the figure shows the detectors for cut, fade in, fade out, fast dissolve (less than 5 frames), dissolve, and subshot boundaries introduced by global motion. Subshot detection is valuable for providing accurate representations of video in cases such as long camera pans, where the visual contents have changed significantly within a single shot. Essentially, each detector is a finite state machine (FSM), which may have a different number of states to detect the target transition pattern and locate the transition boundaries. The modular design of the system allows for the easy addition of further detectors to tackle other types of transitions that may be introduced during the production process. A support vector machine (SVM) based transition verification method is employed in several detectors, including the cut, fast dissolve, and dissolve detectors. Finally, the results of all detectors are fused together based on the priorities of these detectors. The detectors utilize two types of visual features: intra-frame and inter-frame features. The intra-frame features are extracted within a single frame, and they include color histogram, edge, and related statistical features. The inter-frame features rely on the current frame and one previous frame, and capture the motion-compensated intensity matching errors and histogram changes. For more detailed information about the implementation of these detectors, please refer to [25].
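As a simplified illustration of the FSM idea (not the detectors of [25]), the sketch below declares an abrupt cut when a large inter-frame dissimilarity spike is followed by a return to stable frames; the thresholds and the dissimilarity measure are hypothetical, and a real detector would add the SVM verification step described above.

def detect_cuts(frame_distances, high=0.5, low=0.1):
    """Return frame indices where an abrupt cut is declared.

    frame_distances: per-frame dissimilarity to the previous frame
    (e.g. a histogram distance), one value per frame.
    """
    state, cuts = "STABLE", []
    for i, d in enumerate(frame_distances):
        if state == "STABLE":
            if d > high:                 # large jump: candidate cut
                state, candidate = "CANDIDATE", i
        elif state == "CANDIDATE":
            if d < low:                  # activity settles: accept the cut
                cuts.append(candidate)
                state = "STABLE"
            elif d > high:               # sustained change: keep the latest jump
                candidate = i
    return cuts

# Toy usage: a single abrupt change around frame 3.
print(detect_cuts([0.02, 0.03, 0.04, 0.9, 0.05, 0.02]))   # -> [3]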

1.3.2 Audio Processing

Most systems utilizing audio processing for TV content indexing focus on the speech within a program. While this narrow scope may seem unusual, it is rare for musical scores, specific sounds (i.e., sound effects like laugh tracks), and environmental audio to significantly contribute to the indexing of TV content because of their brief and heterogeneous nature. However, some visual techniques, discussed later in section 1.3.4, like semantic concepts and duplicate detection, can be aided by the addition of generic audio detection methods.

Speaker Segmentation and Clustering

Speaker segmentation is important for automatic speech recognition and audio content analysis. For example, with information about different speaker segments, automatic speech recognition systems can dynamically adapt the utilized models and parameters to different speakers to improve overall recognition accuracy. Additionally, information about different speaker segments provides useful cues for indexing and browsing audio content. These cues can be taken advantage of to organize and index TV content. The overall system for speaker segmentation and clustering is depicted in Figure 1.5. The algorithm uses mel-frequency cepstral coefficients (MFCC) and Gaussian mixture models (GMM) to model the acoustic characteristics of speakers. The Bayesian information criterion (BIC) is adopted to locate the speaker boundaries and determine the number of speakers [11]. The kernel, indicated by the round-cornered rectangle in the middle of the figure, proceeds iteratively. At each iteration, the BIC gain induced by splitting the segments of one speaker into two speakers is computed, and the speaker whose split produces the maximum BIC gain is found. If that gain is positive, then the number of speakers is increased for the next iteration. If not, the iteration terminates and the process is complete. The speaker boundaries and speaker models are refined iteratively while speaker segments are split (see the embedded dashed line rectangle), until the speaker labels converge.
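The BIC split test can be illustrated with the classic single-Gaussian formulation below: a positive value favors modeling two groups of MFCC frames as two speakers rather than one. This is a simplified, hedged sketch with a synthetic example, not the GMM-based implementation of [11].

import numpy as np

def bic_split_gain(X1, X2, penalty_weight=1.0):
    """Positive return value suggests X1 and X2 belong to different speakers."""
    X = np.vstack([X1, X2])
    n, d = X.shape
    n1, n2 = len(X1), len(X2)
    logdet = lambda M: np.linalg.slogdet(np.cov(M, rowvar=False))[1]
    # Penalty term: free parameters of one Gaussian (mean plus covariance).
    p = 0.5 * (d + 0.5 * d * (d + 1))
    return (0.5 * (n * logdet(X) - n1 * logdet(X1) - n2 * logdet(X2))
            - penalty_weight * p * np.log(n))

rng = np.random.default_rng(0)
spk_a = rng.normal(0.0, 1.0, (500, 13))    # synthetic MFCC frames, speaker A
spk_b = rng.normal(2.0, 1.0, (500, 13))    # synthetic MFCC frames, speaker B
print(bic_split_gain(spk_a, spk_b) > 0)                 # expected: True
print(bic_split_gain(spk_a[:250], spk_a[250:]) > 0)     # expected: False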

 

Figure 1.5: Speaker segmentation and clustering algorithm.


1.3.3 Closed Caption Processing

While not all content descriptors that are supplied with or extracted from content are in textual form, linguistic information plays a major role in indexing, retrieval, and adaptation of content. Closed captioning that is provided with most TV programs is a major source of this information. Because of some issues related to proper timing and formatting of closed captions, effective content searching and adaptation requires additional processing of the raw captioning information. These requirements are addressed by properly aligning the captions with the audio information by employing speech recognition, and by restoring the proper case through linguistic processing.

A significant portion of TV content is closed captioned in real-time, as it is being broadcast. This is, of course, necessary for live events such as sports or breaking news programs, but since this mode of captioning is less time consuming to produce than off-line captioning, it is used for content from other genres as well. While real-time captioning provides an invaluable service to hearing impaired viewers and broadens the audience to include non-fluent listeners and public multi-monitor viewing contexts, the process inherently introduces significant variable latency (e.g., 3-10 seconds) and paraphrasing.

Optimizing the user experience for content-based retrieval of TV content and the performance of multimedia processing algorithms requires accurate synchronization of the closed caption text with the other media components. This can be achieved using speech processing, either using edit-distance minimization on the one-best word hypothesis from large vocabulary automatic speech recognition or by using forced alignment methods similar to those used in acoustic modeling. Systems using either of these approaches must be robust to the presence of background music and to mismatches with the closed caption text arising from paraphrasing, and must deal with program segments for which no captioning is available.
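The first of these approaches can be sketched compactly: match caption words against the one-best ASR hypothesis (which carries timestamps) and copy the time of each matched word back to the caption. The sketch below uses Python's difflib for the matching step; the word lists and times are illustrative, and unmatched words (paraphrased text) simply receive no timestamp.

from difflib import SequenceMatcher

def align_captions(caption_words, asr_words, asr_times):
    """Return (word, time) pairs; unmatched caption words get no timestamp."""
    aligned = {i: None for i in range(len(caption_words))}
    matcher = SequenceMatcher(None,
                              [w.lower() for w in caption_words],
                              [w.lower() for w in asr_words])
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            aligned[block.a + k] = asr_times[block.b + k]
    return [(w, aligned[i]) for i, w in enumerate(caption_words)]

captions = ["Hundreds", "of", "families", "have", "found", "themselves"]
asr_hyp  = ["hundreds", "of", "family",   "have", "found", "themselves"]
asr_time = [12.1, 12.4, 12.6, 12.9, 13.1, 13.4]      # seconds from stream start
print(align_captions(captions, asr_hyp, asr_time))   # "families" stays untimed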

Web Mining for Language Modeling

When adapting content for viewing on a range of display devices, correct text capitalization is an important factor in determining the quality of the transcripts obtained from closed captioned text [10][6]. An N-gram language model generated using a large corpus of AP newswire data provides a baseline model for case restoration. Keeping this data up-to-date requires timely discovery and mining of recently published documents to learn new information and incorporate it into the models, and Web resources described by RSS are a good source of such documents. However, as the sources vary widely in terms of content formatting, processing is required to extract the relevant textual components and to detect breaks at sentence boundaries. The result of the processing (Figure 1.6) shows that readability is greatly improved.
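To make the case restoration step concrete, the following is a deliberately reduced sketch that chooses, for each word, the most frequent surface form observed in a mined corpus. A real system would use higher-order n-gram context built from newswire and RSS-discovered documents; the two-sentence corpus here is purely illustrative.

from collections import Counter, defaultdict

def build_case_model(corpus_sentences):
    forms = defaultdict(Counter)
    for sentence in corpus_sentences:
        for word in sentence.split():
            forms[word.lower()][word] += 1
    # Most frequent surface form per lowercase word.
    return {w: c.most_common(1)[0][0] for w, c in forms.items()}

def restore_case(cc_text, case_model):
    out = [case_model.get(w.lower(), w.lower()) for w in cc_text.split()]
    out[0] = out[0][:1].upper() + out[0][1:]          # sentence-initial capital
    return " ".join(out)

model = build_case_model(["Hundreds of families arrived in New York City today .",
                          "They are getting a heartfelt Big Apple welcome ."])
print(restore_case("HUNDREDS OF FAMILIES HAVE FOUND THEMSELVES IN NEW YORK CITY", model))
# -> "Hundreds of families have found themselves in New York City"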

Original CC Text:  HUNDREDS OF FAMILIES HAVE FOUND THEMSELVES IN NEW YORK CITY WHERE THEY'RE GETTING A HEARTFELT BIG APPLE WELCOME.
Case restored CC:  Hundreds of families have found themselves in New York City where they're getting a heartfelt Big Apple welcome.

Figure 1.6: Closed Caption Case Restoration.

1.3.4 Image Processing

While audio processing largely helps to index content with speech, multimedia content from TV news programs, documentaries, or home videos often lacks rich descriptions of the actual scenes and subjects within an image. Fortunately, a number of techniques have matured to generate a mid-level representation of content including general scene information, detect the presence of people or repeating characters, and even locate similar frames and video segments within a single program that can help to identify repeated thematic information.

Semantic Concepts

One problem encountered when indexing video content is the inability of existing metadata to adequately describe that content. Often, the way a computer indexes content (i.e., using low-level features) and the way a human would describe it (i.e., using high-level textual keywords) are quite different. This problem is referred to as the semantic gap, and is due to computational limitations in content representation [36]. Mid-level semantic concepts, often realized as machine-learned visual classifiers, are one increasingly popular way to resolve this disparity because they combine knowledge from a set of machine features and human labels over a common dataset. Additionally, mid-level semantic concepts can aid in indexing content that does not include title and description information or an audio track, which respectively provide textual keywords and transcripts from speech recognition.

The core stages traditionally employed when using semantic classifiers are as follows.

1. Define the lexicon or names of concepts - As research in the area of formally defined concept classifications has progressed, an increasing number of ontologies have been developed that define concepts and categories based on language [12], popular concepts for consumer media [27], and generic concepts common in broadcast news [23]. In the presented work, the latter option was chosen and the Large-Scale Concept Ontology for Media (LSCOM) definitions are used to describe a library of concepts, because each offers an additional piece of mid-level information that can be used to index the target content. It should be noted, however, that the detection performance for some of these concepts was poor, which was most commonly attributable to infrequency of training data or greatly varying appearance of the subjects.

2. Obtain human labels and machine features - After defining a set of concepts, both human-provided labels and machine features are required. Although other techniques have sought to leverage social tags from popular photo sharing sites [12], the presented system utilized the relevant and non-relevant labels from 374 LSCOM [23] and 100 MediaMill [37] concept definitions that were provided for 160 hours of multi-lingual broadcast news from the TRECVID2005 development set [35]. To reduce the annotation required to achieve a given level of accuracy, active learning may be employed in an interactive and iterative manner. Alternatively, other sources of labeling such as social tagging may be leveraged, but here methods such as consensus labeling are required to reduce labeling error.


3. Train machine models for classification - The final step is to train a set of classifiers; in this system one support vector machine (SVM) was trained for each feature and concept pair. Because optimal low-level image features can vary dramatically, features were derived from prior work: three global features (grid color moments, Gabor texture, and edge direction histogram [7]) and one local feature (keypoints soft-quantized into a bag-of-words representation [41]). To produce a single concept score, each SVM is evaluated, normalized with a sigmoid, and then averaged across features (a minimal sketch of this scoring step follows the list).
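The following is a minimal, hedged sketch of the per-concept scoring step from item 3, using scikit-learn in place of the system's own SVM implementation. The feature names, dimensions, and synthetic training data are illustrative only.

import numpy as np
from sklearn.svm import SVC

def train_concept_models(feature_sets, labels):
    """feature_sets: dict feature_name -> (n_samples, dim) array for one concept."""
    return {name: SVC(kernel="rbf", gamma="scale").fit(X, labels)
            for name, X in feature_sets.items()}

def concept_score(models, features_for_one_keyframe):
    scores = []
    for name, model in models.items():
        margin = model.decision_function([features_for_one_keyframe[name]])[0]
        scores.append(1.0 / (1.0 + np.exp(-margin)))     # sigmoid normalization
    return float(np.mean(scores))                        # fuse across features

# Toy usage: two hypothetical features ("color", "texture") for one concept.
rng = np.random.default_rng(1)
y = np.array([1] * 20 + [0] * 20)
feats = {"color": rng.normal(y[:, None], 1.0, (40, 9)),
         "texture": rng.normal(y[:, None], 1.0, (40, 24))}
models = train_concept_models(feats, y)
print(concept_score(models, {"color": np.ones(9), "texture": np.ones(24)}))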

concept filter      precision @ 40    precision @ 100
baseball only            0.050             0.020
or athlete               0.175             0.250
or sports                0.375             0.340
not soccer               0.525             0.440
or running               0.600             0.540
or grandstand            0.600             0.420

(a) baseball only    (b) or athlete or sports    (c) not soccer or running

Figure 1.7: For a system utilizing semantic concepts as filters in an interactive search process, precision scores for different result set sizes and visual illustrations of the results at each stage, with relevant images marked with a white square in the lower right corner. Performance continues to improve until a point (here, the addition of grandstand) where the noisy nature of the concept classifiers begins to overpower useful relevance scores.

One example usage of semantic classifiers is to aid interactive search in a filtering fashion [41]. In this use-case, the user starts with a single textual or concept-based query and then refines the query by expanding it to include another concept (a logical OR operation) or exclude a concept (a logical NOT operation). This system was evaluated over 314 hours of TV content and demonstrated a gradual increase in precision up to a point, as illustrated in Figure 1.7. After this critical point (which varies per query), the noisy nature of the semantic classifiers begins to degrade performance.

Face Detection

Detecting the presence of people in video can provide valuable information for determining the structure of the content and is useful in creating visual summaries. For example, in news programs, detecting the presence of the anchorperson and reporters can be used as a part of the process of segmenting the content into individual news stories. This facilitates retrieval or repurposing of content by converting long form content into shorter units for consumption on mobile devices. Applications may discard repeated images of the anchorperson in order to provide more informative visual summaries containing images representing news events filmed on location. Face detection is the task of detecting human faces in arbitrary images or videos. In the simplest case, face detection in video can be performed by treating the video as the sequence of still images (frames) from which it is composed and by performing face detection on each frame. Such an approach may be suitable for applications that not only require faces to be detected, but also need to track the motion of the individual faces between consecutive frames.

The Viola-Jones algorithm [39] has been widely used for detecting faces in an image, while more recent work involves on-line learning of faces and their identification in real-time [33]. Earlier methods include fast template matching using iterative dynamic programming combined with tracking [26], neural network based techniques [32], and the sparse network of linear functions technique of Roth et al. [31].
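As a brief illustration of the frame-by-frame approach, the sketch below applies OpenCV's Viola-Jones style cascade classifier to sampled frames of a video file. The sampling interval, file name, and detector parameters are illustrative choices, not those of the systems described in this chapter.

import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def faces_in_video(path, sample_every=30):
    """Return {frame_index: [(x, y, w, h), ...]} for sampled frames."""
    capture, detections, index = cv2.VideoCapture(path), {}, 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % sample_every == 0:          # sample frames to limit cost
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            boxes = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
            detections[index] = [tuple(b) for b in boxes]
        index += 1
    capture.release()
    return detections

# Usage (the path is illustrative): faces_in_video("news_broadcast.mp4")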

Face Clustering

The state of the art in face detection and identification is now mature enough to act as a complementary piece of semantic information extracted during the analysis process. Face identification can provide valuable non-linguistic information that can be used to retrieve video segments in which a particular person appears, such as finding popular characters, locating the principal cast of a video, finding multiple characters appearing together, and even finding characters in particular outfits. When highly accurate identification is not possible, or no prior information is available about the faces that are present in a video program, information related to the appearance of the same faces in different segments has proven highly effective in deriving higher level semantic information. In user generated videos, one can search for specific points in videos where certain family members appear or search for segments of videos that contain certain combinations of family members.

Several face clustering algorithms have been proposed in the literature. Antonopoulos et al. [8] propose using a dissimilarity matrix from pre-existing face clusters; Fitzgibbon and Zisserman [14] use joint manifold distance; Tao and Tan [38] propose dividing face sequences by pose and then applying additional constraints obtained from domain knowledge. Other methods have been proposed in [40][24][18][31].

Figure 1.8: Icon and torso regions are defined in terms of the face location and size.

The approach discussed below uses hierarchical agglomerative clustering (HAC) to generate clusters of faces from a video or multimedia presentation. It employs a hybrid technique that uses Eigenfaces, face, and torso information during the clustering process. This provides richer and more stable results than any of the single methods alone [9]. To cluster the detected faces, the face and the torso region below the face are used. Patterns of clothing in the torso regions are trivially detected (an offset from the face) and they are more differentiable among different persons. A weighting is applied to the torso with the assumption that the same person within one video (e.g., a TV news program) wears the same clothes. In total, features are weighted from two regions: an icon region and a torso region, as shown in Figure 1.8.

Features from these three regions are extracted to measure the dissimilarity among faces. For the icon and torso regions, several features are computed: color moments in Luv color space (9 features), Gabor texture features (mean and standard deviation) in a combination of three scales and four directions (24 features), and the edge direction histogram in 16 bins (17 features, including one bin for non-edge pixels). These 50-dimensional low-level features can effectively represent the color and texture patterns in these regions. For face region i, its icon and torso features are denoted by I_i and T_i respectively. In this approach, the distance between any two face, icon, or torso pairs is simply the Euclidean distance FD(i, j), ID(i, j), or TD(i, j). However, to emphasize the uniqueness of each person's face region, face features are first mapped into a basis derived from an Eigen analysis of faces, inspired by the Eigenface concept [14]. The average face and the M Eigenfaces are denoted by Ψ and u_n, where n = 1, ..., M. For face F_i, its Eigenface components ω^i = (ω^i_n, n = 1, ..., M) are computed by ω^i_n = u_n^T (F_i - Ψ). All faces from one video are analyzed as a set to compute Ψ and u_n (where M is preset to 16) after resizing face regions to 50x50. Finally, the Euclidean face distance is computed on the vectors of ω components that correspond to the two faces. Although not analyzed here, for more generic (but possibly less discriminable) features, Eigenface models can also be learned from a larger corpus.
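A compact numpy sketch of this Eigenface projection and distance computation is given below, assuming the faces have already been detected, resized to 50x50, and flattened; the random input is only a stand-in for real face regions.

import numpy as np

def fit_eigenfaces(faces, m=16):
    """faces: (n, 50*50) array of resized, flattened face regions."""
    psi = faces.mean(axis=0)                              # average face Psi
    _, _, vt = np.linalg.svd(faces - psi, full_matrices=False)
    return psi, vt[:m]                                    # m eigenfaces u_n

def eigenface_components(face, psi, eigenfaces):
    return eigenfaces @ (face - psi)                      # omega_n = u_n^T (F - Psi)

def face_distance(face_i, face_j, psi, eigenfaces):
    wi = eigenface_components(face_i, psi, eigenfaces)
    wj = eigenface_components(face_j, psi, eigenfaces)
    return float(np.linalg.norm(wi - wj))                 # Euclidean distance FD(i, j)

# Toy usage with random 50x50 "faces"; real inputs come from the face detector.
faces = np.random.default_rng(0).random((200, 2500))
psi, U = fit_eigenfaces(faces, m=16)
print(face_distance(faces[0], faces[1], psi, U))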

Duplicate Detection

Duplicate detection is the process of detecting images or segments from specific content that may be included in other content. Duplicate image and video detection is essential for copyright monitoring, business intelligence, and advertisement monitoring. In addition to these applications, however, duplicate detection can serve as a powerful mechanism that enables the discovery of content of interest from large content repositories based on a short sample. At a high level, there are two distinct approaches to duplicate detection. The first approach is watermarking, which embeds some kind of watermark in the original image. For this method to be effective, the embedded watermark needs to be robust to a number of transformations that are applied during processing in order to survive such transformations. Examples of such transformations include re-encoding, scaling, and changes in intensity or color. The watermark also needs to be imperceptible so that it does not hurt the visual quality of the original image or video. The second approach involves detecting the duplicated image or video purely based on the content itself. This approach is applicable in a wider range of applications since it does not require the initial step of embedding a watermark in the content. Due to this distinct advantage, this chapter focuses on the latter approach.

Many methods have been proposed for near duplicate detection, and they can be generally grouped into two categories: global visual feature based and local visual feature based. The first category relies on visual features that are extracted from the entire image, including color, texture, edge, etc. [28]. These features usually reflect the global characteristics of the image, and the feature dimension is low. The advantage of this category is its high efficiency, yet the disadvantage is its low robustness. Simple image transformations, for example Gamma correction or the insertion of big patterns, can easily devastate the performance. The second category relies on salient local feature points that can be repeatedly detected after severe image transformations, for example heavy re-encoding and rotation. The commonly adopted visual features include the Scale-Invariant Feature Transform (SIFT), Speeded Up Robust Features (SURF), and Maximally Stable Extremal Regions (MSER) [22]. Usually thousands of such features are detected in an image, and the overall feature dimension is high. Methods in this category usually deliver high detection performance, and they can also find sub-image duplicates. The obvious disadvantage is that they are hard to scale. One remedy for the scalability issue is to adopt the bag of visual words approach [34], which converts the high dimensional visual features into quantized labels and treats them as words in a document. Traditional information retrieval techniques can then be straightforwardly applied. Such methods usually provide satisfactory performance with much less computational complexity. While the near duplicate detection method presented here is in the first category, it has been augmented by a grid mechanism that improves the reliability of general global visual feature based approaches.

 Figure 1.9: Near duplicate image detection algorithm.

Figure 1.9 is a high-level representation of one possible duplicate image detection algorithm implementation. Each image is partitioned into a 4x4 grid, and within each grid cell a set of global features is computed [16], including the color moments in Luv color space [10], the Gabor texture features (mean and standard deviation) in a combination of three scales and four directions, and the edge direction histogram in 16 bins. For efficient image comparison across the entire video collection, the Locality-Sensitive Hashing (LSH) approach was employed [7]. For each image, 64 hashing values are computed based on the global image features. The LSH values of all reference images are saved in a database, and the LSH values of each test image are used to query the database. Figure 1.10 presents a prototype system for finding near duplicate images. All keyframes extracted from the "NBC Nightly News" and the "ABC World News Tonight" programs in 2007 have been indexed; in total, there are about 400,000 keyframes. The panel on the left hand side shows the user interface for browsing all keyframes in a certain TV program. The user can select any of them and search for near duplicates in the database. In this example, the third keyframe in the first row is chosen as the query keyframe, and the duplicate detection results are listed in the panel on the right hand side. The query time is about 2 seconds, which is indicative of this approach's efficiency.
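The hashing step can be illustrated with the small sketch below, which uses a sign-of-random-projection LSH family over the concatenated per-grid global features; this is only an illustration of the bucketing idea, and the deployed system's hash family and parameters follow [7] and differ in detail.

import numpy as np
from collections import defaultdict

class NearDuplicateIndex:
    def __init__(self, dim, n_tables=8, bits_per_table=8, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = [rng.normal(size=(bits_per_table, dim)) for _ in range(n_tables)]
        self.tables = [defaultdict(list) for _ in range(n_tables)]

    def _keys(self, feature):
        # One hash key per table: the sign pattern of the random projections.
        return [tuple((p @ feature > 0).astype(int)) for p in self.planes]

    def add(self, image_id, feature):
        for table, key in zip(self.tables, self._keys(feature)):
            table[key].append(image_id)

    def query(self, feature):
        candidates = set()
        for table, key in zip(self.tables, self._keys(feature)):
            candidates.update(table[key])      # only colliding buckets are scanned
        return candidates

# Toy usage: 800-dimensional features (16 grid cells x 50 features) per keyframe.
rng = np.random.default_rng(1)
index = NearDuplicateIndex(dim=800)
reference = rng.random(800)
index.add("keyframe_0001", reference)
print(index.query(reference + rng.normal(0, 0.01, 800)))   # likely {'keyframe_0001'}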

Page 17: Large-Scale Analysis for Interactive Media Consumption

1.4. CLIENTS & APPLICATIONS 17

 

Figure 1.10: Near duplicate keyframe detection system.

1.4 Clients & Applications

Previous sections in this chapter focused on the discussion of core content analysis methods and system architectures to support (i.e., store and index) the information from those methods. With the discussed tools and system architecture as a foundation, content in various formats can be analyzed in either an asynchronous or a live fashion and delivered to an application over a standards-compliant web services interface. This section provides a brief look at a few example applications and their implementations, which use different quantities of the produced metadata for a range of tasks from generic mash-up applications to user-centric mobile retrieval.

1.4.1 Retrieval Services

Following the design shown in Figure 1.2 as a template, web services interfaces to the query, analysis, and delivery systems were created using the Python language and the CherryPy software library¹. These services permit the viewing of content captured from broadcast television and analyzed by a set of distributed nodes in a real-time fashion. Services executing analysis routines for shot segmentation, speech recognition, semantic concept classification, duplicate image detection, and face recognition are performed in parallel by a set of nodes that can independently be enabled or disabled to provide a richer user experience that can be replayed in mobile, desktop, and set-top environments. This client personalization is possible because each of the underlying analysis routines is exposed with a generic web services interface that is client agnostic. This system was also deployed to simultaneously record, analyze, and index metadata (using the SOLR Lucene environment²) for four broadcast television channels and for generic user-generated content (i.e., personal images, home videos, generic Internet content), each on one high-performance computer. Verifying the design guidelines above, multiple content analysis tasks were consolidated onto a single physical resource because the web service is capable of distributing each analysis task of any form to a number of ad-hoc processing nodes. Finally, subsequent web service requests for semantic classification were made asynchronously to process a decade of content that was captured before this analysis technique was even available. This capability is possible because each analysis request is atomic, mimicking web-based REST requests that allow stateless operations to occur non-deterministically.

¹ CherryPy - http://www.cherrypy.org/
² SOLR - http://lucene.apache.org/solr/
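To make the web service layer concrete, the following is a minimal sketch of a CherryPy endpoint in the spirit of these retrieval services; the URL path, parameters, and the stubbed index lookup are hypothetical placeholders rather than the deployed API.

import json
import cherrypy

def query_metadata_index(q, channel, max_results):
    # Placeholder for the SOLR/Lucene lookup; a real implementation would issue
    # an HTTP request to the indexing service and map its response to programs.
    return [{"asset_id": "demo-asset", "title": "sample program", "score": 1.0}]

class RetrievalService:
    @cherrypy.expose
    def search(self, q="", channel=None, max_results=20):
        """Client-agnostic search entry point returning JSON."""
        results = query_metadata_index(q, channel, int(max_results))
        cherrypy.response.headers["Content-Type"] = "application/json"
        return json.dumps({"query": q, "results": results}).encode("utf-8")

if __name__ == "__main__":
    cherrypy.quickstart(RetrievalService(), "/")   # serves /search?q=...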

1.4.2 Content Analytics

The media archive and associated extracted metadata can serve a wide range of applications including entertainment, broadcast monitoring, and complex analytics. Two analytical interfaces are discussed below, demonstrating a comprehensive playback view and a program-aggregated view for spotting content anomalies.

Figure 1.11 depicts a user interface for presenting and interacting with the metadata that has been extracted from a TV program. A search component facilitates content selection, and once a particular asset is selected, the metadata that is extracted is displayed on an interactive timeline. Other visualizations, such as the results of face detection and clustering, are made available by utilizing a tabbed structure on the web page. An "analytics" tab shows summary statistics derived from the media processing so that the selected asset can be compared with other assets along a number of dimensions (e.g., shots per minute, percentage of shots with people, words spoken per minute, etc.). The video window allows the replay of H.264 encoded video at HD and SD resolutions that are streamed using RTMP and can be rendered using Adobe Flash. The same streams may also be rendered in HTML 5 browsers that support this video format. Closed captions are encoded in the W3C Timed Text Markup Language (TTML) [15]. The extracted metadata is available for export to various standard formats including MediaRSS and MPEG-7.

In a second illustration, in Figure 1.12, the automatically detected commercial segments from two popular evening talk shows are visualized. The top and bottom bar charts depict the location and duration of commercial segments in the program "Late Night with Jimmy Fallon" and the program "The Tonight Show With Jay Leno", respectively. This simple metadata visualization demonstrates that there is a high amount of variance in the start and duration of commercial segments during the "Jimmy Fallon" show compared to the relatively stable commercial segment position and duration during "Jay Leno". Although some plot locations may be attributed to detector errors, the likely explanation for these differences is the requirement for the content provider to provide relatively specific time slots for both local and nationwide advertising campaigns. While this example focuses on commercial data, almost any other metadata field can be quickly visualized to facilitate the easy discovery of content anomalies and patterns over varying timescales.



[Figure 1.11 screenshot callouts: keyframe and segment summary; intuitive textual query interface; random-access HD video playback; interactive metadata timeline for closed captions, shot segments, commercial breaks, and more.]

Figure 1.11: Comprehensive analytics view demonstrating full search, playback, and inspection capabilities.

1.4.3 Mobile and Multiscreen Video Retrieval

In the previous section, interfaces intended for professional analysis for actionable informa-tion from either a single asset or a collection of assets was describe. Now a system for TVcontent analysis can be used to create compelling user experiences for content discovery andconsumption for novice users on mobile devices. In a mobile scenario, the primary challengesarise from the limited user interface capabilities of the mobile device. Similarly, in a TVviewing usage context for entertainment purposes, a keyboard and mouse-based interfaceis not appropriate. Here, the assumption is made that the systems engineering challengesof securely delivering appropriately transcoded media streams to a range of devices havebeen addressed. For the applications that are envisioned, video delivery subsystem supportrandom access and stream handoff are required, but these capabilities are emerging in themarketplace via IPTV services or for on-demand content at least, with best effort IP systems(over the top). With these capabilities, mobile users can preview or watch video on a mobiledevice (mobile phone or tablet) and then send the video to the big screen in the living roomfor shared viewing. To address the above mentioned challenges, the use of spoken naturallanguage queries as part of a multimodal interface to allow users to search for previouslyrecorded TV content on mobile devices is proposed. The architecture detected in Figures1.13 and 1.14 is utilized and extended with additional capabilities to handle spoken querieswith interactive response requirements. Turning to research in multiscreen video retrievaland consumption, a prototype system (called iMIRACLE) was designed and implementedthat allows the users to search for previously recorded video. Once the desired video has


[Figure 1.12 panels: Automatically Detected Commercial Segments, WNBC (Channel 4, New York City), July 2010 to December 2010; "Late Night With Jimmy Fallon" (airing at 12:37am) and "The Tonight Show With Jay Leno" (airing at 11:35pm); x-axis: running time (in minutes, 0 to 60); y-axis: air date (Jul 1 to Dec 1)]

Figure 1.12: Visualization of detected commercial segments aligned to program running time, from air dates spanning six months in 2010, and used to detect content anomalies.

Once the desired video has been retrieved, it can be played on the mobile device or sent to another display screen, such as a set-top box (STB) connected to a TV monitor.

The iMIRACLE architecture is shown in Figure 1.13. Broadcast feeds (ATSC/MPEG-2) are ingested and processed by the content/media analysis components described in Section 1.3. During processing, closed caption extraction and alignment, metadata (title, station, genre, airdate, etc.) extraction, indexing, scene change detection, and content-based sampling techniques are executed. The extracted metadata and closed caption data are used to build new speech models daily for a cloud-based automatic speech recognition (ASR) engine (here, AT&T WATSON [17]) and natural language understanding (NLU) [21]. The speech models are hierarchical language models (HLMs) that parse the components of a multi-constraint speech query. A top-level speech grammar is used in conjunction with five sub-models, where each sub-model handles a different constraint (title, station, genre, time, and content). The metadata is used to create the sub-models, and the closed captions are used to create the content sub-model.
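The following Python sketch illustrates, under assumed program record fields, how such per-constraint text corpora might be assembled each day from metadata and closed captions; it is a simplified stand-in for the actual WATSON model-building process rather than a description of it.

# Hypothetical sketch of the daily corpus assembly for the constraint sub-models.
# The program record fields are assumptions; this is not the actual WATSON
# model-building pipeline.
from collections import defaultdict

CONSTRAINTS = ("title", "station", "genre", "time", "content")

def build_submodel_corpora(programs):
    # programs: iterable of dicts such as
    # {"title": ..., "station": ..., "genre": ..., "airdate": ..., "captions": [...]}
    corpora = defaultdict(list)
    for prog in programs:
        corpora["title"].append(prog["title"])
        corpora["station"].append(prog["station"])
        corpora["genre"].append(prog["genre"])
        corpora["time"].append(str(prog["airdate"]))
        # Closed-caption text feeds the open-vocabulary content sub-model.
        corpora["content"].extend(prog["captions"])
    return {name: "\n".join(corpora[name]) for name in CONSTRAINTS}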

A prototype system was constructed and evaluated using an Apple iPad as the mobile device. A mobile user might speak "home and garden shows on ABC mentioning Italian villa". First, the audio is streamed to the cloud, where the WATSON engine performs the ASR and NLU functions. The speech recognition output is "home and garden shows on a b c mentioning Italian villa", and the natural language understanding teases apart the different constraints of the query. In this case, the genre is "home and garden", the station is "ABC", and the content search term is "Italian villa". The system then uses this information to form a search query with one parameter for each constraint.
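A minimal sketch of this query formation step is shown below; it assumes the parsed constraint slots are mapped onto fielded parameters of an Apache Solr index, and the field names and endpoint are hypothetical rather than those of the deployed system.

# Hypothetical sketch: mapping NLU constraint slots onto a fielded Solr query.
# The index field names and endpoint are assumptions for illustration.
from urllib.parse import urlencode

def build_query_url(slots, solr_base="http://localhost:8983/solr/programs/select"):
    clauses = []
    for slot, field in (("genre", "genre"), ("station", "station"),
                        ("title", "title"), ("content", "caption_text")):
        if slot in slots:
            clauses.append('%s:"%s"' % (field, slots[slot]))
    params = {"q": " AND ".join(clauses) or "*:*", "wt": "json", "rows": 20}
    return solr_base + "?" + urlencode(params)

# The spoken query parsed above becomes:
print(build_query_url({"genre": "home and garden", "station": "ABC",
                       "content": "Italian villa"}))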


[Figure 1.13 components: broadcast feed ingestion; content analysis; media archive (audio/video content and metadata); speech models; WATSON speech recognition; query processing; delivery adaptation; iMIRACLE UI/client (speech application, media player) on mobile devices and set-top boxes]

Figure 1.13: iMIRACLE Architecture supports mobile and IPTV clients.

The mobile device then receives, from the query processing component, a list of TV programs that match the search criteria, as shown in Figure 1.14(a). This example demonstrates the benefits of using speech to create search queries with multiple constraints. Empirically, users have noted that speaking queries is faster than typing them, and the use of natural language during query formulation facilitates the combination of multiple constraints with little effort. Figure 1.14(a) depicts the results of this query, consisting of four programs that meet the user's search criteria; the user can select any of these programs to view more details. At this point, the user can browse the content thumbnails and associated text and initiate the replay of the video with a single touch on one of the thumbnails, as shown in Figure 1.14(b). The delivery adaptation module in Figure 1.13 streams the H.264 encoded video to the mobile device using HTTP Live Streaming. Alternatively, the user can choose to view the content on the TV, in which case the delivery adaptation module streams a higher quality H.264 encoded version of the video to the STB using RTSP. Thus, the mobile user can preview the video on the mobile device or can throw the video to the TV STB.
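The following Python sketch summarizes this delivery adaptation decision under illustrative assumptions about profile names and stream URLs; it is not the deployed service interface.

# Hypothetical sketch of the delivery adaptation decision: choose a protocol
# and rendition for the requested screen. Profile names and URL patterns are
# illustrative, not the deployed service's interface.
def select_stream(asset_id, target):
    if target == "mobile":
        # Adaptive HTTP Live Streaming rendition for phones and tablets.
        return {"protocol": "HLS",
                "url": "http://streams.example.com/%s/index.m3u8" % asset_id}
    if target == "stb":
        # Higher quality H.264 stream delivered to the set-top box via RTSP.
        return {"protocol": "RTSP",
                "url": "rtsp://streams.example.com/%s/hd.sdp" % asset_id}
    raise ValueError("unknown target device: %s" % target)

print(select_stream("talkshow_2010-07-01", "mobile"))
print(select_stream("talkshow_2010-07-01", "stb"))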

As a final example of how TV content analysis can create personalized interfaces that facilitate content consumption, an application intended for tablet devices is demonstrated. Today's mobile devices support touch screen interfaces and include highly capable graphics that, when combined with high speed data networking, can render compelling rich media interfaces for content exploration and selection. Figure 1.15 shows a screen capture from a mobile device rendering results in response to a user-specified topic of interest (in this case, "Yankees"). Rather than developing a customized user interface, a standards-compliant image browser3 is used to provide rapid perspective transformations with simulated momentum, yielding an intuitive user interface. The application seamlessly submits queries to the media retrieval services, which deliver personalized thumbnail sets and video metadata represented in MediaRSS format. The video content can be selected, previewed, and consumed on the mobile device, or displayed at high resolution on a connected TV. This example demonstrates the value of processing TV content to create enhanced video service capabilities that allow viewers to quickly navigate to content of interest.

3Cooliris - http://www.cooliris.com
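To illustrate the MediaRSS representation used by this application, the following Python sketch emits a single MediaRSS item carrying a video URL and a thumbnail set for the image browser. The element names follow the Yahoo! MediaRSS specification [4], while the title and URLs are invented for the example.

# Illustrative sketch: emit one MediaRSS item with a video URL and thumbnails.
# Element names follow the Yahoo! MediaRSS specification [4]; the URLs and
# title are invented for the example.
import xml.etree.ElementTree as ET

MRSS_NS = "http://search.yahoo.com/mrss/"
ET.register_namespace("media", MRSS_NS)

def media_rss_item(title, video_url, thumb_urls):
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    content = ET.SubElement(item, "{%s}content" % MRSS_NS,
                            url=video_url, type="video/mp4")
    for url in thumb_urls:
        ET.SubElement(content, "{%s}thumbnail" % MRSS_NS, url=url)
    return item

item = media_rss_item("Yankees highlights",
                      "http://media.example.com/clip123.mp4",
                      ["http://media.example.com/clip123/shot%03d.jpg" % i
                       for i in range(3)])
print(ET.tostring(item, encoding="unicode"))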


Figure 1.14: Interfaces demonstrating programs matching a speech query (a) and the ability to randomly access automatically detected topic segments of any program (b).


1.5 Conclusions

In this chapter, some of the state of the art methods for TV content analysis were presented. Starting from a holistic analysis of incoming program data, this chapter iterated over the key components of a content analysis engine: automatic shot segmentation, audio analysis for speech, alignment of caption data, semantic classification of scenes, and near-duplicate detection in a large-scale database. Although the purpose of each discussed method may differ, the underlying goal of producing useful, distinctive metadata for describing content is common throughout all components. Additionally, because the algorithms are fully implemented and deployed in continuously running systems, one can assert that these methods have been optimized for scalability and easy integration in existing IPTV services. The system architecture proposed in this chapter not only embodies these existing services, but has been created such that it easily accommodates unforeseen future approaches in an intuitive, modular fashion. Wherever possible, all methods have also been organized and orchestrated in a scalable, efficient, and media-aware Service-oriented Architecture (SoA). This important distinction has allowed the proposed architectures to both preserve and leverage existing metadata capabilities (e.g., closed caption extraction from analog or digital signals) while simultaneously adapting ingestion routines to accommodate new codecs and media formats. Finally, as a real-world evaluation of the architecture and underlying metadata, several service concepts as well as heterogeneous clients supporting multiscreen content consumption scenarios and services have been described.


Figure 1.15: Graphical, highly interactive browsing of TV content on a mobile device.


References

[1] ISO/IEC JTC/SC/WG, ISO/IEC 15938-9:2005/PDAM 1, Information technology – Multimedia content description interface – Part 9: Profiles and levels, Amendment 1: Extensions to profiles and levels. 2005.
[2] ETSI TS 102 822-3-1 V1.4.1 (2007-11), Broadcast and On-line Services: Search, select and rightful use of content on personal storage systems (TV-Anytime); Part 3: Metadata; Sub-part 1: Phase 1 Metadata schemas. 2007.
[3] EBU. Study of content analysis-based automatic information extraction in production, first call for technologies. June 2009.
[4] Yahoo! MediaRSS specification 1.5.0, September 2009.
[5] CableLabs Content 3.0 specification version 1.1, MD-SP-CONTENTv3.0-I01-100812. August 2010.
[6] O.M. Alliance. Service guide for mobile broadcast services. Candidate Version, 1, 2008.
[7] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. 2006.
[8] P. Antonopoulos, N. Nikolaidis, and I. Pitas. Hierarchical face clustering using SIFT image features. In Computational Intelligence in Image and Signal Processing, 2007 (CIISP 2007), IEEE Symposium on, pages 325–329. IEEE, 2007.
[9] L. Begeja and Z. Liu. Searching and browsing video in face space. In Multimedia, 2009 (ISM'09), 11th IEEE International Symposium on, pages 336–341. IEEE, 2009.
[10] C. Chelba and A. Acero. Adaptation of maximum entropy capitalizer: Little data can help a lot. Computer Speech & Language, 20(4):382–399, 2006.
[11] S.S. Chen and P.S. Gopalakrishnan. Speaker, environment and channel change detection and clustering via the Bayesian information criterion. 1998.
[12] J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. 2009.
[13] M. Eyer. PSIP: Program & System Information Protocol. 2002.
[14] A.W. Fitzgibbon and A. Zisserman. Joint manifold distance: a new approach to appearance based clustering. 2003.
[15] G. Adams, editor. Timed Text Markup Language (TTML) 1.0. November 2010.
[16] D. Gibbon and Z. Liu. Large scale content analysis engine. In Proceedings of the First ACM Workshop on Large-Scale Multimedia Retrieval and Mining, pages 97–104. ACM, 2009.
[17] V. Goffin, C. Allauzen, E. Bocchieri, D. Hakkani-Tur, A. Ljolje, S. Parthasarathy, M. Rahim, G. Riccardi, and M. Saraclar. The AT&T WATSON speech recognizer. In Proceedings of ICASSP, pages 1033–1036, 2005.
[18] P. Huang, Y. Wang, and M. Shao. A new method for multi-view face clustering in video sequence. In Data Mining Workshops, 2008 (ICDMW'08), IEEE International Conference on, pages 869–873. IEEE, 2008.
[19] Qian Huang, Zhu Liu, Aaron Rosenberg, David Gibbon, and Behzad Shahraray. Automated generation of news content hierarchy by integrating audio, video, and text information. In Proc. of Acoustics, Speech and Signal Processing, volume 6, pages 3025–3028, 1999.
[20] International Standards Organization (ISO). Information technology – Multimedia content description interface – Part 5: Multimedia description schemes, ISO/IEC 15938-5:2003 (MPEG-7). 2003.
[21] M. Johnston and S. Bangalore. Finite-state multimodal integration and understanding. Natural Language Engineering, 11(02):159–187, 2005.
[22] Y. Ke, R. Sukthankar, and L. Huston. Efficient near-duplicate detection and sub-image retrieval. In ACM Multimedia. ACM, October 2004.
[23] L. Kennedy, A. Hauptmann, M. Naphade, A.H.J.R. Smith, and S.F. Chang. LSCOM lexicon definitions and annotations version 1.0. In DTO Challenge Workshop on Large Scale Concept Ontology for Multimedia, 2006.
[24] Z. Li and X. Tang. Bayesian face recognition using support vector machine and face clustering. 2004.
[25] Z. Liu, D. Gibbon, E. Zavesky, B. Shahraray, and P. Haffner. AT&T Research at TRECVID 2006. In Notebook Paper, NIST TRECVID Workshop, Gaithersburg, MD, 2006.
[26] Z. Liu and Y. Wang. Face detection and tracking in video using dynamic programming. In Image Processing, 2000 International Conference on, volume 1, pages 53–56. IEEE, 2000.
[27] A. Loui, J. Luo, S.F. Chang, D. Ellis, W. Jiang, L. Kennedy, K. Lee, and A. Yanagawa. Kodak's consumer video benchmark data set: concept definition and annotation. In Proceedings of the International Workshop on Multimedia Information Retrieval, pages 245–254. ACM, 2007.
[28] Y. Maret, F. Dufaux, and T. Ebrahimi. Image replica detection based on support vector classifier. In Optical Information Systems III, volume 5909, pages 173–181. SPIE, 2005.
[29] Alberto Messina, Robbie De Sutter, Werner Bailer, Masanori Sano, Jean-Pierre Evain, Patrick Ndjiki-Nya, Antje Linnemann, Birgit Schröter, and Andrea Basso. Some changes to WD on MPEG-7 audiovisual description profile (AVDP). ISO/IEC JTC1/SC29/WG11 MPEG2010/M18150, October 2010.
[30] E. Mikoczy, D. Sivchenko, B. Xu, and J.I. Moreno. IPTV systems, standards and architectures: Part II, IPTV services over IMS: Architecture and standardization. Communications Magazine, IEEE, 46(5):128–135, 2008.
[31] D. Roth, M. Yang, and N. Ahuja. A SNoW-based face detector. 2000.
[32] H.A. Rowley, S. Baluja, and T. Kanade. Neural network-based face detection. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 20(1):23–38, 1998.
[33] J. Sivic, M. Everingham, and A. Zisserman. Who are you? Learning person specific classifiers from video. In Computer Vision and Pattern Recognition, 2009 (CVPR 2009), IEEE Conference on, pages 1145–1152. IEEE, 2009.
[34] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In Intl. Conf. on Computer Vision, pages 1470–1477, January 2003.
[35] A.F. Smeaton, P. Over, and W. Kraaij. Evaluation campaigns and TRECVid. In Proceedings of the 8th ACM International Workshop on Multimedia Information Retrieval, pages 321–330. ACM, 2006.
[36] A.W.M. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain. Content-based image retrieval at the end of the early years. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 22(12):1349–1380, 2000.
[37] C. Snoek, M. Worring, D. Koelma, and A. Smeulders. Learned lexicon-driven interactive video retrieval. Image and Video Retrieval, pages 11–20, 2006.
[38] J. Tao and Y.P. Tan. Face clustering in videos using constraint propagation. In Circuits and Systems, 2008 (ISCAS 2008), IEEE International Symposium on, pages 3246–3249. IEEE, 2008.
[39] P. Viola and M. Jones. Robust real-time object detection. International Journal of Computer Vision, 57(2):137–154, 2002.
[40] N. Vretos, V. Solachidis, and I. Pitas. A mutual information based face clustering algorithm for movies. In 2006 IEEE International Conference on Multimedia and Expo, pages 1013–1016. IEEE, 2006.
[41] E. Zavesky, Z. Liu, D. Gibbon, and B. Shahraray. Searching videos in visual semantic spaces.