audio-visual content indexing, filtering, and adaptationsfchang/papers/talk-vlbv01... ·...
TRANSCRIPT
10/2001 S.-F. Chang Columbia U. 1
Audio-Visual Content Indexing, Filtering, and Adaptation
Shih-Fu Chang
Digital Video and Multimedia GroupADVENT University-Industry Consortium
Columbia University10/12/2001
http://www.ee.columbia.edu/dvmm
10/2001 S.-F. Chang Columbia U. 2
ContentArchives
WWW
documentimage/video
search organizationauthor presentation
Filtering & adaptation(preference)
Challenge: tools/systems for searching and filtering
push pull
— Information Overload, Limited User Attention, Time-SensitiveImportant Issues:
— Heterogeneous PlatformsExample Products:
AltaVista, Google, DoCoMo/IBM, NewsTake, TiVo/Philips, PC DVR
10/2001 S.-F. Chang Columbia U. 3
MPEG-7 and issues
Feature Extraction
MPEG-7 Description
Search/Filtering Application
ISO/IEC 15938 “Multimedia Content Description Interface”
Issues:Active research in analysis and indexingSystem issues: delivery, compressionWhat applications? …
10/2001 S.-F. Chang Columbia U. 4
Re-examine Content Production/Usage Chain
textualsearch
feature extraction similarity
matching
Production & post-production A-V Data Viewer
indexing
patternrecognition
text
speech/textrecognition
structurebrowsing
event/object search
summary
metadata
Analysis Interaction
10/2001 S.-F. Chang Columbia U. 5
Types of Content Indexing
Reverse engineeringProduction structure parsing (shot, camera)
Metadata preservationFormat, Production, Credits, etc.
Feature measurementaudio-visual, spatio-temporal objects and features
Content recognition and organizationSemantic meaning descriptionScene grouping and skimming
10/2001 S.-F. Chang Columbia U. 6
When do we need automated tools
Conditions for high impactMetadata that’s not available from productionWork that humans are not good at (e.g., low-level computation)Large volume, low individual valueTime sensitiveAcceptable performance
Less Promising AreasContent from digital production tools with metadataPrime content that can afford manual solutions
Linear/Passive Video Interactive Video• Highlights• Pitches• Runs• By Player• By Time• Set your Own• Save• Delete• Edit
Case 1: Live Sports Video Filtering and Navigatio
(Images excluded)(images excluded)
n
Time sensitive interestMassive production and audienceTime compressibilityTemporal structure and production rules
Beyond Ringer and Clip Download
Time-sensitive short video messages suiting personal needs
Services:
•Messaging•Localized information•Media/game download•Multi-person games•TV phone•On-site purchase
Image showing sports video on mobile handset
excluded
10/2001 S.-F. Chang Columbia U. 9
Regular Structure and Views in Sports Video
Entire Tennis Video
Set
Game
……
……
……
……
Serve
Elementary Shots
CommercialsCloseup, Crowd ……
10/2001 S.-F. Chang Columbia U. 10
Regular set of views CatcherPitching Close-up pitcher
(Images excluded) (Images excluded)(Images excluded)
Base hit First base Full field
(Images excluded) (Images excluded)(Images excluded)
Important issues: canonical view beginning of recurrent semantic unitview transition pattern types of events
10/2001 S.-F. Chang Columbia U. 11
Real-Time Sports Video Parsing
Detect and classify recurrent semantic unitsReal-time processing
simple features compressed-domain processing
Combine global-level and object-level classification
(Chang, Zhong, Kumar ’01)
10/2001 S.-F. Chang Columbia U. 12
Detecting recurrent semantic units: Serve
Every I/P frames Key frames/shot Down-sampled I/P frames
Color, lines, player
Images showing different views of tennis serve excluded
Canonical View Detection
Adaptive Color Filter
Object level verification
Compressed video
Shot Detection
(Chang, Zhong, Kumar ’00)
10/2001 S.-F. Chang Columbia U. 13
Learning spatio-temporal rules
Batter
Pitching
Ground Pitcher
Grass Sand
Level 1: Object
Level 2: Object-part
Level 3: Perceptual Area
Level 4: Region RegionsRegions Regions Regions
Image showingbaseball pitch excluded
10/2001 S.-F. Chang Columbia U. 14
Systematic learning of object rules
trainingimages
scenemodel Visual
Apprentice(with
A. Jaimes)
• classifiers• features• rules
Output:Classifiers and rules
Initial input
Automated segmentation
tagging
(Jaimes and Chang ‘99)
10/2001 S.-F. Chang Columbia U. 15
Application: Time-Sensitive Mobile Video
Content Adaptive Streaming
Streaming
Live Video
Event Detection Content
AdaptationEncode/
Mux Buffer
BufferDemux/Decoder
Structure Analysis
User
Server or Gateway
Client
10/2001 S.-F. Chang Columbia U. 16
Content Adaptive Streaming
time
important segments: video mode
non-important segments: still frame + audio + captions
channel rateadaptive
video rate
timet1t0
channelinputoutput
accumulatedrate
10/2001 S.-F. Chang Columbia U. 17
Recurrent semantic units in soccer video?
play 0.0(sec.)
break 38.3 49.3
82.8 95.7
133.6 163.2 167.6
Game a sequence of play and break segmentsPlay/break No canonical views/eventsSporadic events (start and end)Shot boundary ≠ play boundary
Heuristic Model:Features Views Structure
(Image Excluded)
global zoom-in close-up
View Classification Play-Break Temporal Segmentation
Threshold Adaptation
Grass Color Classifier
(unsupervised)
View classification (unsupervised)
Segmentation Rules
Decision Phase
Learning Phase
Structure info
Sampled frames
10/2001 S.-F. Chang Columbia U. 19
Modeling Transitions with HMM
Break Play
SegmentFeatures
(color, motion, objects)
TrainHMM families
Play-Break TransitionModel
(Dynamic Programming)ML Detection +
Linear RegressionTest Data
Play-Break Classification Rate:over different games87%, 86%, 85%, 73%
10/2001 S.-F. Chang Columbia U. 20
Case 2:Long Programs with Loose Structures
transient 4:04 5:222:20
Images of shots excluded
Goal:Scene/Topic Segmentation, Browsing, Preview
Challenge:Diverse production styles and views
10/2001 S.-F. Chang Columbia U. 21
Domain Consideration
Film: Does not satisfy impact conditionsNot much value except
archived collection in librariesless popular selection in PVR/STB
Challenging, rich styles and mediaFun
10/2001 S.-F. Chang Columbia U. 22
Problems and Approaches
Explore production models Explore multimedia integrationIncorporate structural and special cuesExplore viewer’s perceptual models
Goal:Computable features/theories + models
Scene structures and summaries
10/2001 S.-F. Chang Columbia U. 23
Production Constraints Video Scenes
Camera placement [180o rule]
overlapping background causes chromatic consistency
Lighting/chromaticity continuity
10/2001 S.-F. Chang Columbia U. 24
Audition Psychology Audio Scenes
Unrelated sounds seldom begin/end at the same time.Sounds from the same source change properties smoothly and gradually over time.Changes in acoustic states affect all components of the sound (e.g., walking away from a ringing bell)Human perception uses long term grouping (e.g., a series of footsteps).
A. Bregman ‘90: Auditory Scene Analysis
An audio scene change is said to occur when majority of the sound sources change.
10/2001 S.-F. Chang Columbia U. 25
Explore Audio-visual association
Complementary audio-visual structuresfirst hour of Blade Runner
video: 28 scenesaudio only: 33 scenes
audio/video agreement: 24video change only: silent or transient scenesaudio change only: mood/state change
singleton scene boundaries
video scene boundaryaudio scene boundary
Sundaram & Chang ICASSP ‘00
10/2001 S.-F. Chang Columbia U. 26
Mixing audio-visual scenes
Video scene boundary
Audio scene boundary
65%
21%
10%
4%sense &
sensibility
10/2001 S.-F. Chang Columbia U. 27
Film: Common Scene Structures
4 common scene types in filmProgressive or Coherent:same location, similar camera takes, consistent audio
Images of shots excluded
time
(Demo)
10/2001 S.-F. Chang Columbia U. 28
Common Scene Types (2)
Transient : different locations, dissimilar camera takes, evolving audio
Images of shots excluded
MTV-type: fast, dissimilar shots, consistent audio
(Demo)
10/2001 S.-F. Chang Columbia U. 29
Common Scene Types (3)
Dialog: repeating structures, consistent audio
Images of shots excluded
10/2001 S.-F. Chang Columbia U. 30
Computable Scene Detection Architecture
data
shot detectionaudio features
silence
structure
video-scenesAudio-scenes
fusion
computable scenes
10/2001 S.-F. Chang Columbia U. 31
The Viewer/Listener Model
memory to to: decision instant
time
attention span
Memory (M): Net amount of data remaining in viwer’smemory, e.g., 32 sec.Attention span (AS): The most recent data occupying user’s attention, e.g., 16 sec.Measure the group, long-term coherence between M and AS
(Kender and Yeo 98, Sundaram and Chang 00)
Video Scene SegmentationTm
Tasshot key-frames
fa fb∆ t
Group coherence based on viewer memory model1 1= − • • • − ∆( , ) ( ( , )) ( )a b mR a b d a b f f t T
∈ ∈
= ∑ ∑ max
{ \ }( ) ( , ) ( )
as m as
o oa T b T T
C t R a b C t
Audio Scene Segmentationcompute multi-scalefeatures
Detect modelcorrelationdecay slope
An audio sceneis a collection of
Componentdecomposition
sources
(11 speech/music features)
majorityvote
change?
trend
sinusoidal
noise
(sundaram and chang ’00)
10/2001 S.-F. Chang Columbia U. 34
Detecting Dialog Structure1
m o d ( , )0
1( ) ( , )N
i i n Ni
n d o oN
−
+=
∆ ≡ ∑cyclic autocorrelation
…
frequency = 2
0 1 2 3 4 5 6 7 8 9 10 11
(images excluded)
Accuracy: 80%/94% - 91%/100% precision recall
Aligning Scene Boundaries
ambiguity window
Audio: sourcecorrelation
Video: groupcoherence
aligned
: audio boundary: video boundary
: windows
10/2001 S.-F. Chang Columbia U. 36
Fusion Rules
Synchronized a-scene and v-scene scene
Exceptions:Ignore scenes within structuresIgnore (weak v-scene, weak a-scene)Add (strong v-scene, silence)Require tighter synchronization if (weak v, normal a)
1
2
3
5
silence
structure
A-scene boundary
V-scene boundary
Weak V-scene
boundary
10/2001 S.-F. Chang Columbia U. 37
Open Issues
VideoUse higher-level cognitive models
Currently focus on group coherence between groupsSyntax structure within scenes not consideredPotential use of information theory and temporal modeling
AudioSource separation (music, speech, noise, etc)
Scene-level organizationVisualization, summarization, matching
10/2001 S.-F. Chang Columbia U. 38
Scene Skimming
Find the skim with the highest utility, satisfying the
production syntax constraintssyntax
preserved
Reduce to a short-time preview
10/2001 S.-F. Chang Columbia U. 39
Utility of a shot
Explore Perceptual Model and Scene Syntax
Image excluded
Image excluded
Is comprehension time related to the visual spatio-temporal complexity of the shot ?The presence of detail robs a shot of its screen time.
(a) (b)
Shot complexity and comprehension time
Represent shot by its key-frame
A shot is selected at randomThe subject was asked to correctly answer four questions in minimum time:
Who WhatWhenWhere
“Why” was not asked
10/2001 S.-F. Chang Columbia U. 41
Time-complexity relationship
Plot of average time vs. complexity shows two bounds Ub
Lb
complexity →
time →
02.
54.
5
( ) 2.40 1.11( ) 0.61 0.68
b
b
U c cL c c
= += +
0 0.5 1
comprehension time increases with complexity
10/2001 S.-F. Chang Columbia U. 42
Shot condensation→
02.
5 reduce to upper bound
original shotUb
Lb
4.5
time
complexity →0 0.5 1
10/2001 S.-F. Chang Columbia U. 43
Film syntaxThe specific arrangement of shots so as to bring out their mutual relationship. [ sharff 82 ].
Minimum number of shots in a sceneThe particular ordering of the shotsThe specific duration of the shots, to direct viewer attentionChanging the scale of the shots
Film makers think in terms of phrases of shots and not individual shots.
10/2001 S.-F. Chang Columbia U. 44
The progressive phrase
“Two well chosen shots will create expectations of the development of narrative; the third well-chosen shot will resolve those expectations.”[ sharff 82 ].
Hence, a phrase (a group of shots) must at least have three shots.
Maximal compression:eliminate all the dark shots.
10/2001 S.-F. Chang Columbia U. 45
The dialog
“Depicting a conversation between m people requires 3m shots.” [ sharff 82 ].
Hence, a dialog must at least have six shots
Maximal compression:eliminate all the dark shots.
10/2001 S.-F. Chang Columbia U. 46
Utility Optimization Framework
Objective function
for each skim Utility of
SkimRhythm Penalty
Droppedshot penalty
Find the skim with the minimal cost, satisfying the syntax constraints
syntax preserved
10/2001 S.-F. Chang Columbia U. 47
Scene Skimming
Original-1 Skim-110:3 time reduction
Original-2 Skim-25:1 time reduction
Echo Video Digital Library & Remote Medicine
Echo Video
View Segmentation KeyFrame, Heart
Beat summary
Domain Knowledge
View Recognizer
Object Annotation
Diagnosis Reports
TransitionModel
Adaptive network delivery
Content search(Ebadollahi and Chang ‘00)
10/2001 S.-F. Chang Columbia U. 49
Conclusions
Production model Media Integration Viewer model
Analysis and interaction
SkimmingNavigation
Time-sensitive filteringAdaptive delivery
Applications
10/2001 S.-F. Chang Columbia U. 50
Acknowledgements
Live Sports Filtering: Di Zhong and Raj KumarSoccer Video Parsing:Lexing Xie, Peng Xu, Ajay Divakaran, Anthony Vetro, Huifang SunScene Segmentation and Skimming:Hari SundaramMedical Video:Shahram Ebadollahi
10/2001 S.-F. Chang Columbia U. 51
More Information
Columbia Digital Video/Multimedia Group
http://www.ee.columbia.edu/dvmm
ADVENT Industry/University Consortiumhttp://www.ee.columbia.edu/advent
http://www.ee.columbia.edu/~sfchang