audio-visual content indexing, filtering, and adaptationsfchang/papers/talk-vlbv01... ·...

10/2001 S.-F. Chang Columbia U. 1

Audio-Visual Content Indexing, Filtering, and Adaptation

Shih-Fu Chang

Digital Video and Multimedia Group
ADVENT University-Industry Consortium

Columbia University
10/12/2001

http://www.ee.columbia.edu/dvmm


[Diagram labels: Content Archives, WWW; document, image/video; search, organization, author, presentation; filtering & adaptation (preference); push, pull]

Challenge: tools/systems for searching and filtering

Important Issues:
• Information overload, limited user attention, time-sensitive
• Heterogeneous platforms

Example Products:
AltaVista, Google, DoCoMo/IBM, NewsTake, TiVo/Philips, PC DVR


MPEG-7 and issues

Feature Extraction

MPEG-7 Description

Search/Filtering Application

ISO/IEC 15938 “Multimedia Content Description Interface”

Issues:
• Active research in analysis and indexing
• System issues: delivery, compression
• What applications? …


Re-examine Content Production/Usage Chain

[Diagram labels: Production & post-production; A-V Data; Viewer. Analysis: feature extraction, pattern recognition, speech/text recognition, indexing, text, metadata. Interaction: textual search, similarity matching, structure browsing, event/object search, summary.]


Types of Content Indexing

Reverse engineering
• Production structure parsing (shot, camera)

Metadata preservation
• Format, Production, Credits, etc.

Feature measurement
• audio-visual, spatio-temporal objects and features

Content recognition and organization
• Semantic meaning description
• Scene grouping and skimming


When do we need automated tools

Conditions for high impact
• Metadata that’s not available from production
• Work that humans are not good at (e.g., low-level computation)
• Large volume, low individual value
• Time sensitive
• Acceptable performance

Less Promising Areas
• Content from digital production tools with metadata
• Prime content that can afford manual solutions

Case 1: Live Sports Video Filtering and Navigation

Linear/Passive Video

Interactive Video
• Highlights
• Pitches
• Runs
• By Player
• By Time
• Set your Own
• Save
• Delete
• Edit

(Images excluded)

• Time-sensitive interest
• Massive production and audience
• Time compressibility
• Temporal structure and production rules

Beyond Ringer and Clip Download

Time-sensitive short video messages suiting personal needs

Services:

• Messaging
• Localized information
• Media/game download
• Multi-person games
• TV phone
• On-site purchase

(Image showing sports video on mobile handset excluded)


Regular Structure and Views in Sports Video

[Hierarchy diagram, top to bottom: Entire Tennis Video; Set; Game; Serve; Elementary Shots]

Other shots: Commercials, Close-up, Crowd ……


Regular set of views:
• Pitching
• Catcher
• Close-up pitcher
• Base hit
• First base
• Full field

(Images excluded)

Important issues:
• canonical view: beginning of recurrent semantic unit
• view transition pattern: types of events


Real-Time Sports Video Parsing

• Detect and classify recurrent semantic units
• Real-time processing: simple features, compressed-domain processing
• Combine global-level and object-level classification

(Chang, Zhong, Kumar ’01)


Detecting recurrent semantic units: Serve

[Pipeline diagram: Compressed video → Shot Detection → Canonical View Detection (Adaptive Color Filter) → Object-level verification (color, lines, player). Frame inputs labeled: every I/P frame; key frames per shot; down-sampled I/P frames.]

(Images showing different views of tennis serve excluded)

(Chang, Zhong, Kumar ’00)


Learning spatio-temporal rules

Level 1: Object (Pitching)
Level 2: Object-part (Batter, Ground, Pitcher)
Level 3: Perceptual Area (Grass, Sand)
Level 4: Region (Regions)

(Image showing baseball pitch excluded)


Systematic learning of object rules

[Diagram: Visual Apprentice (with A. Jaimes)]

Initial input: training images, scene model

Automated segmentation, tagging

Visual Apprentice:
• classifiers
• features
• rules

Output: Classifiers and rules

(Jaimes and Chang ’99)


Application: Time-Sensitive Mobile Video

[Diagram: Content Adaptive Streaming]

Server or Gateway: Live Video → Event Detection / Structure Analysis → Content Adaptation → Encode/Mux → Buffer

Streaming

Client: Buffer → Demux/Decoder → User


Content Adaptive Streaming

• important segments: video mode
• non-important segments: still frame + audio + captions

[Plot: accumulated rate vs. time (t0 to t1); curves for channel input and output; labels: channel rate, adaptive video rate]
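As a rough illustration of the trade-off on this slide, the Python sketch below gives non-important segments a low fixed rate and spends the remaining channel budget on important segments, then tracks how much data backs up in the sender buffer. The `Segment` class, the function names, and all numbers are assumptions made for illustration; the talk does not specify an allocation rule.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    duration: float   # seconds
    important: bool   # True: video mode; False: still frame + audio + captions

def video_mode_rate(segments, channel_rate, still_rate):
    """Rate left for important segments so the average equals channel_rate."""
    total = sum(s.duration for s in segments)
    low = sum(s.duration for s in segments if not s.important)
    high = total - low
    if high == 0:
        raise ValueError("no important segments to allocate to")
    return (channel_rate * total - still_rate * low) / high

def peak_backlog(segments, channel_rate, video_rate, still_rate):
    """Largest amount of data waiting in the sender buffer (rate x seconds)."""
    backlog = peak = 0.0
    for s in segments:
        rate = video_rate if s.important else still_rate
        # The buffer fills when the encoder runs above the channel rate
        # and drains (never below zero) when it runs below.
        backlog = max(0.0, backlog + (rate - channel_rate) * s.duration)
        peak = max(peak, backlog)
    return peak
```

For a hypothetical 10 s important segment followed by 30 s of non-important content on a 64 kbps channel with a 16 kbps still mode, `video_mode_rate` gives 208 kbps for the important segment and `peak_backlog` gives 1440 kbit of buffering.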


Recurrent semantic units in soccer video?

[Timeline of play and break segments, with boundaries at 0.0, 38.3, 49.3, 82.8, 95.7, 133.6, 163.2, and 167.6 sec.]

• Game: a sequence of play and break segments
• Play/break: no canonical views/events
• Sporadic events (start and end)
• Shot boundary ≠ play boundary

Heuristic Model: Features → Views → Structure

(Image excluded)

Views: global, zoom-in, close-up

[Diagram: View Classification and Play-Break Temporal Segmentation. Learning phase: sampled frames, threshold adaptation, grass color classifier (unsupervised). Decision phase: view classification (unsupervised), segmentation rules, structure info.]
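The view classification step can be sketched in Python as a grass-pixel ratio followed by two thresholds. This is a minimal sketch under stated assumptions: the hue range for grass and both thresholds are hypothetical fixed values, whereas the slide indicates that the grass color classifier and thresholds are adapted without supervision for each game.

```python
def grass_ratio(hues, lo=60.0, hi=150.0):
    """Fraction of pixels whose hue (degrees) falls in an assumed grass range."""
    if not hues:
        return 0.0
    return sum(lo <= h <= hi for h in hues) / len(hues)

def classify_view(ratio, t_global=0.6, t_zoom=0.25):
    """Map a grass ratio to one of the three views named on the slide."""
    if ratio >= t_global:
        return "global"      # mostly field visible
    if ratio >= t_zoom:
        return "zoom-in"     # field partly visible
    return "close-up"        # little or no field visible
```

A frame is passed in as a flat list of per-pixel hue values, for example `classify_view(grass_ratio(frame_hues))`.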


Modeling Transitions with HMM

[Diagram: two-state transition model with states Break and Play]

[Diagram labels: segment features (color, motion, objects); train HMM families; Play-Break transition model; test data; ML detection + dynamic programming; linear regression]

Play-Break classification rate over different games: 87%, 86%, 85%, 73%
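The dynamic programming step can be illustrated with a small Viterbi decoder over the two labels. This is a sketch only: it assumes each segment already has a log-likelihood under a play model and a break model (for instance, from the trained HMM families), and the transition and initial probabilities are illustrative values, not taken from the talk.

```python
import math

def viterbi(loglik, log_trans, log_init):
    """Most likely label sequence.

    loglik:    list of {state: log-likelihood of the segment under that state}
    log_trans: {prev_state: {next_state: log transition probability}}
    log_init:  {state: log initial probability}
    """
    if not loglik:
        return []
    states = list(log_init)
    score = {s: log_init[s] + loglik[0][s] for s in states}
    back = []
    for obs in loglik[1:]:
        prev, score, ptr = score, {}, {}
        for s in states:
            best = max(states, key=lambda p: prev[p] + log_trans[p][s])
            score[s] = prev[best] + log_trans[best][s] + obs[s]
            ptr[s] = best
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]
```

With sticky transitions (for example, a 0.9 probability of staying in the same state), a single segment that weakly favors "break" inside a run of "play" segments is smoothed back to "play".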


Case 2:Long Programs with Loose Structures

[Figure: shot sequence with a transient segment; timestamps 2:20, 4:04, 5:22; images of shots excluded]

Goal: Scene/Topic Segmentation, Browsing, Preview

Challenge: Diverse production styles and views


Domain Consideration

Film:
• Does not satisfy impact conditions
• Not much value except
  • archived collection in libraries
  • less popular selection in PVR/STB
• Challenging, rich styles and media
• Fun


Problems and Approaches

• Explore production models
• Explore multimedia integration
• Incorporate structural and special cues
• Explore viewer’s perceptual models

Goal: Computable features/theories + models

Scene structures and summaries


Production Constraints Video Scenes

Camera placement [180° rule]

overlapping background causes chromatic consistency

Lighting/chromaticity continuity


Audition Psychology Audio Scenes

• Unrelated sounds seldom begin/end at the same time.
• Sounds from the same source change properties smoothly and gradually over time.
• Changes in acoustic states affect all components of the sound (e.g., walking away from a ringing bell).
• Human perception uses long-term grouping (e.g., a series of footsteps).

A. Bregman ‘90: Auditory Scene Analysis

An audio scene change is said to occur when the majority of the sound sources change.


Explore Audio-visual association

Complementary audio-visual structures (first hour of Blade Runner)

• video: 28 scenes
• audio only: 33 scenes
• audio/video agreement: 24
• video change only: silent or transient scenes
• audio change only: mood/state change

[Figure legend: singleton scene boundaries; video scene boundary; audio scene boundary]

Sundaram & Chang ICASSP ‘00


Mixing audio-visual scenes

Video scene boundary

Audio scene boundary

[Chart (Sense & Sensibility): shares of 65%, 21%, 10%, and 4% across video and audio scene boundary categories]


Film: Common Scene Structures

4 common scene types in film

Progressive or Coherent: same location, similar camera takes, consistent audio


time

(Demo)


Common Scene Types (2)

Transient: different locations, dissimilar camera takes, evolving audio


MTV-type: fast, dissimilar shots, consistent audio

(Demo)


Common Scene Types (3)

Dialog: repeating structures, consistent audio



Computable Scene Detection Architecture

[Diagram labels: data; shot detection, audio features; silence, structure; video-scenes, audio-scenes; fusion; computable scenes]


The Viewer/Listener Model

[Diagram: time axis showing memory and attention span; t_o: decision instant]

• Memory (M): net amount of data remaining in the viewer’s memory, e.g., 32 sec.
• Attention span (AS): the most recent data occupying the user’s attention, e.g., 16 sec.
• Measure the group, long-term coherence between M and AS

(Kender and Yeo 98, Sundaram and Chang 00)

Video Scene Segmentation

[Figure labels: T_m, T_as, shot key-frames, f_a, f_b, Δt]

Group coherence based on viewer memory model:

$$R(a, b) = \bigl(1 - d(a, b)\bigr) \cdot f_a \cdot f_b \cdot \bigl(1 - \Delta t / T_m\bigr)$$

$$C(t_o) = \frac{\sum_{a \in T_{as}} \sum_{b \in \{T_m \setminus T_{as}\}} R(a, b)}{C_{\max}(t_o)}$$
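The recall and coherence measures can be sketched in Python as follows. Several details are assumptions rather than statements from the slide: d(a, b) is taken as a normalized L1 distance between key-frame histograms, f_a and f_b as shot duration divided by T_m, Δt as the gap between shot midpoints, and C_max as the value obtained when every distance is zero. Shots are assigned to the attention span or the rest of memory by their start time, which is a simplification.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Shot:
    hist: List[float]  # normalized key-frame color histogram (sums to 1)
    start: float       # seconds
    end: float         # seconds

def hist_distance(h1, h2):
    """Normalized L1 distance in [0, 1] (assumed form of d(a, b))."""
    return 0.5 * sum(abs(x - y) for x, y in zip(h1, h2))

def recall(a, b, t_m, d):
    """R(a, b) = (1 - d) * f_a * f_b * (1 - dt / T_m)."""
    f_a = (a.end - a.start) / t_m
    f_b = (b.end - b.start) / t_m
    dt = abs((a.start + a.end) / 2 - (b.start + b.end) / 2)
    return (1.0 - d) * f_a * f_b * max(0.0, 1.0 - dt / t_m)

def coherence(shots, t_o, t_m=32.0, t_as=16.0):
    """Normalized coherence between the attention span and the rest of memory."""
    attention = [s for s in shots if t_o <= s.start < t_o + t_as]
    rest = [s for s in shots if t_o - (t_m - t_as) <= s.start < t_o]
    total = sum(recall(a, b, t_m, hist_distance(a.hist, b.hist))
                for a in attention for b in rest)
    c_max = sum(recall(a, b, t_m, 0.0) for a in attention for b in rest)
    return total / c_max if c_max > 0 else 0.0
```

Under these assumptions, coherence is 1 when the shots on both sides of t_o look identical and 0 when they share no histogram mass, so a low value is a candidate video scene boundary.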

Audio Scene Segmentation

An audio scene is a collection of sources.

[Diagram labels: compute multi-scale features (11 speech/music features); component decomposition (trend, sinusoidal, noise); model correlation decay slope; detect; majority vote; change?]

(Sundaram and Chang ’00)


Detecting Dialog Structure

Cyclic autocorrelation:

$$\Delta(n) \equiv \frac{1}{N} \sum_{i=0}^{N-1} d\bigl(o_i, o_{\mathrm{mod}(i+n,\,N)}\bigr)$$

[Plot: horizontal axis 0 to 11, annotated "frequency = 2"; images excluded]

Accuracy: 80%/94% - 91%/100% precision recall
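The cyclic autocorrelation used for dialog detection can be written directly in Python. In this sketch the shot representation and the distance function `dist` are left to the caller; the scalar features in the usage note are made up for illustration.

```python
def cyclic_autocorrelation(shots, n, dist):
    """Mean distance between each shot o_i and the shot n positions later,
    wrapping around cyclically: (1/N) * sum_i dist(o_i, o_{(i+n) mod N})."""
    count = len(shots)
    if count == 0:
        raise ValueError("need at least one shot")
    return sum(dist(shots[i], shots[(i + n) % count]) for i in range(count)) / count
```

For an alternating A-B-A-B pattern such as `[0, 1, 0, 1, 0, 1]` with `dist = lambda a, b: abs(a - b)`, the value at lag 1 is high and the value at lag 2 drops to zero, which is the periodic signature of a dialog.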

Aligning Scene Boundaries

[Diagram: audio boundaries (source correlation) and video boundaries (group coherence) aligned within ambiguity windows; legend: audio boundary, video boundary, windows]


Fusion Rules

Synchronized a-scene and v-scene → scene

Exceptions:
• Ignore scenes within structures
• Ignore (weak v-scene, weak a-scene)
• Add (strong v-scene, silence)
• Require tighter synchronization if (weak v, normal a)

[Diagram: numbered examples (1, 2, 3, 5) with legend: silence, structure, A-scene boundary, V-scene boundary, weak V-scene boundary]
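A simplified Python sketch of these fusion rules follows. It is an illustration under assumptions: boundaries carry a strength label ("weak", "normal", or "strong"), the ambiguity window is a hypothetical 2 seconds, "tighter synchronization" is modeled as half the window, and the rule about ignoring scenes within structures is left out.

```python
def fuse_boundaries(video, audio, silences, window=2.0):
    """video, audio: lists of (time_sec, strength); silences: list of times.
    Returns the video boundary times accepted as scene boundaries."""
    scenes = []
    for v_time, v_str in video:
        nearest = min(audio, key=lambda a: abs(a[0] - v_time), default=None)
        if nearest is not None:
            a_time, a_str = nearest
            if v_str == "weak" and a_str == "weak":
                limit = None            # ignore (weak v-scene, weak a-scene)
            elif v_str == "weak":
                limit = window / 2      # tighter synchronization for a weak v-scene
            else:
                limit = window
            if limit is not None and abs(a_time - v_time) <= limit:
                scenes.append(v_time)   # synchronized a-scene and v-scene
                continue
        # add (strong v-scene, silence)
        if v_str == "strong" and any(abs(s - v_time) <= window for s in silences):
            scenes.append(v_time)
    return scenes
```

For example, a normal video boundary at 10 s with an audio boundary at 11 s is accepted, while a weak video boundary at 30 s with an audio boundary at 31.5 s is rejected because it fails the tighter window.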


Open Issues

Video
• Use higher-level cognitive models
• Currently focus on group coherence between groups
• Syntax structure within scenes not considered
• Potential use of information theory and temporal modeling

Audio
• Source separation (music, speech, noise, etc.)

Scene-level organization
• Visualization, summarization, matching


Scene Skimming

Find the skim with the highest utility, satisfying the production syntax constraints

Reduce to a short-time preview (syntax preserved)


Utility of a shot

Explore Perceptual Model and Scene Syntax

Image excluded

Image excluded

• Is comprehension time related to the visual spatio-temporal complexity of the shot?
• The presence of detail robs a shot of its screen time.

(a) (b)

Shot complexity and comprehension time

Represent shot by its key-frame

• A shot is selected at random
• The subject was asked to correctly answer four questions in minimum time: Who, What, When, Where

“Why” was not asked


Time-complexity relationship

Plot of average time vs. complexity shows two bounds:

$$U_b(c) = 2.40\,c + 1.11$$

$$L_b(c) = 0.61\,c + 0.68$$

[Plot: time (0 to 4.5) vs. complexity (0 to 1), with upper bound Ub and lower bound Lb]

Comprehension time increases with complexity


Shot condensation

[Plot: time (0 to 4.5) vs. complexity (0 to 1) with bounds Ub and Lb; the original shot is reduced to the upper bound]
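The two bounds and the condensation step can be expressed in a few lines of Python. The linear coefficients are the ones on the slide (U_b(c) = 2.40c + 1.11, L_b(c) = 0.61c + 0.68); the function names and the assumption that complexity is normalized to [0, 1] and time is in seconds are illustrative.

```python
def upper_bound(c):
    """Upper comprehension-time bound U_b(c) for complexity c in [0, 1]."""
    return 2.40 * c + 1.11

def lower_bound(c):
    """Lower comprehension-time bound L_b(c) for complexity c in [0, 1]."""
    return 0.61 * c + 0.68

def condense(duration, c):
    """Trim a shot to the upper bound; shots already shorter are left alone."""
    return min(duration, upper_bound(c))
```

A 4.5-second shot with complexity 0.5 would be reduced to about 2.31 seconds, while a 1-second shot of the same complexity keeps its length.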


Film syntax

The specific arrangement of shots so as to bring out their mutual relationship. [Sharff 82]

• Minimum number of shots in a scene
• The particular ordering of the shots
• The specific duration of the shots, to direct viewer attention
• Changing the scale of the shots

Film makers think in terms of phrases of shots and not individual shots.


The progressive phrase

“Two well chosen shots will create expectations of the development of narrative; the third well-chosen shot will resolve those expectations.” [Sharff 82]

Hence, a phrase (a group of shots) must have at least three shots.

Maximal compression: eliminate all the dark shots.


The dialog

“Depicting a conversation between m people requires 3m shots.” [Sharff 82]

Hence, a dialog between two people must have at least six shots.

Maximal compression: eliminate all the dark shots.


Utility Optimization Framework

Objective function for each skim: utility of skim, rhythm penalty, dropped-shot penalty

Find the skim with the minimal cost, satisfying the syntax constraints (syntax preserved)
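A toy version of this constrained search can be sketched in Python. It is not the optimization from the talk: the cost here is only the negated total utility plus a per-shot drop penalty, the rhythm penalty is omitted, shot durations are kept fixed instead of being condensed, and the search is exhaustive. The minimum shot counts follow the syntax rules quoted earlier (three shots for a phrase, 3m for a dialog between m people).

```python
from itertools import combinations

def min_shots(scene_type, speakers=2):
    """Minimum shots required by the syntax rules for a scene type."""
    return 3 * speakers if scene_type == "dialog" else 3

def best_skim(shots, target_duration, required, drop_penalty=0.1):
    """shots: list of (duration_sec, utility). Returns indices of kept shots,
    in original order, or None if no subset satisfies the constraints."""
    best, best_cost = None, float("inf")
    n = len(shots)
    for k in range(required, n + 1):
        for keep in combinations(range(n), k):
            if sum(shots[i][0] for i in keep) > target_duration:
                continue  # violates the time budget
            cost = -sum(shots[i][1] for i in keep) + drop_penalty * (n - k)
            if cost < best_cost:
                best, best_cost = keep, cost
    return best
```

Because `combinations` yields indices in increasing order, the original ordering of the shots is preserved in the skim.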


Scene Skimming

Original-1, Skim-1: 10:3 time reduction

Original-2, Skim-2: 5:1 time reduction

Echo Video Digital Library & Remote Medicine

[Diagram labels: Echo Video; View Segmentation; Key Frame, Heart Beat summary; Domain Knowledge; View Recognizer; Transition Model; Object Annotation; Diagnosis Reports; Adaptive network delivery; Content search]

(Ebadollahi and Chang ’00)


Conclusions

Analysis and interaction: Production model, Media integration, Viewer model

Applications: Skimming, Navigation, Time-sensitive filtering, Adaptive delivery


Acknowledgements

• Live Sports Filtering: Di Zhong and Raj Kumar
• Soccer Video Parsing: Lexing Xie, Peng Xu, Ajay Divakaran, Anthony Vetro, Huifang Sun
• Scene Segmentation and Skimming: Hari Sundaram
• Medical Video: Shahram Ebadollahi


More Information

Columbia Digital Video/Multimedia Group


ADVENT Industry/University Consortium
http://www.ee.columbia.edu/advent

http://www.ee.columbia.edu/~sfchang


