Extracting the essence of video
Generating automatic video previews
Mauro Barbieri
Philips Research
November 27th 2007
2
Sturgeon’s law
“Ninety percent of everything is crud” (T. Sturgeon)
“Life is short”
“Life is too short for crud”
[J. Foote, “Kooks, Obsessives, Sturgeon’s Law, and the Real Meaning of Search”, IEEE MM 2005]
4
Problem statement
• Content offer explodes
• Free time to consume content does not increase
• We need means that help us make choices:
  – get insights into the video content
  – as simply as possible
  – while being entertained
5
Video preview
• Short video sequence composed of automatically-selected portions of the original video
• Gives concisely a reliable impression of a video (mood, feel, genre)
• Helps in deciding what to watch
• Not a teaser: represents the true content
7
Requirements elicitation
• Related literature on video summarization
• Film production literature
• Guided interviews
• ~30 requirements:
  – Duration
  – Exclusion
  – Continuity
  – Structural
  – Priority
  – Temporal order
  – Uniqueness
8
Approach – constrained optimization
• Preview = subset of original video
• Requirements are translated into:
  – Constraints that the subset must satisfy
  – Score functions
• Preview generation: optimization problem
  – Find the subset that satisfies all constraints and maximizes the objective function
10
Duration requirements
• Duration of the preview: $D_{\min} \le \sum_{p \in P} d(p) \le D_{\max}$
• Duration of the segments: $\forall p \in P:\ d_{\min} \le d(p)$
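The two duration constraints can be checked mechanically. The following is a minimal sketch, assuming a preview is represented as a list of (start, end) segments in seconds; the function name, the representation, and the numeric limits are illustrative assumptions, not values from the talk.

```python
def satisfies_duration_constraints(segments, d_min, D_min, D_max):
    """Check the segment-level and preview-level duration constraints.

    segments: list of (start, end) times in seconds (hypothetical representation).
    d_min: minimum duration of each individual segment.
    D_min, D_max: lower and upper bound on the total preview duration.
    """
    durations = [end - start for start, end in segments]
    # Every segment p must satisfy d_min <= d(p).
    if any(d < d_min for d in durations):
        return False
    # The summed duration must satisfy D_min <= sum d(p) <= D_max.
    return D_min <= sum(durations) <= D_max


# Three segments lasting 4 s, 6 s and 5 s (15 s in total); made-up numbers.
preview = [(10.0, 14.0), (62.5, 68.5), (300.0, 305.0)]
print(satisfies_duration_constraints(preview, d_min=3.0, D_min=10.0, D_max=20.0))
print(satisfies_duration_constraints(preview, d_min=5.0, D_min=10.0, D_max=20.0))
```

The second call fails only because the 4-second segment is shorter than the stricter per-segment minimum.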
11
Continuity requirements
• Visual continuity
  – No abrupt interruptions of action
• Speech continuity
  – Include only complete sentences
• Subtitles continuity
  – Display subtitles for a sufficient amount of time
12
Temporal segmentation
• Visual continuity: shot cut detection
[Figure: plots for five consecutive frames (Frame i–2, Frame i–1, Frame i, Frame i+1, Frame i+2) and a bar chart of the four distances between neighboring frames: Dist(Frame i–2, Frame i–1), Dist(Frame i–1, Frame i), Dist(Frame i, Frame i+1), Dist(Frame i+1, Frame i+2)]
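Shot cut detection by comparing neighboring frames can be sketched as follows. The slides do not say which distance is used, so this example assumes grayscale frames given as flat lists of 0-255 intensities, a histogram-based distance, and an arbitrary threshold; all names and values are illustrative.

```python
def histogram(frame, bins=16):
    """Normalized intensity histogram of a frame (flat list of 0-255 integers)."""
    counts = [0] * bins
    for value in frame:
        counts[min(value * bins // 256, bins - 1)] += 1
    total = len(frame)
    return [c / total for c in counts]


def frame_distance(frame_a, frame_b, bins=16):
    """Half the L1 distance between the two histograms; lies in [0, 1]."""
    ha, hb = histogram(frame_a, bins), histogram(frame_b, bins)
    return 0.5 * sum(abs(a - b) for a, b in zip(ha, hb))


def detect_cuts(frames, threshold=0.3):
    """Return every index i where a cut is declared between frames[i-1] and frames[i]."""
    return [
        i
        for i in range(1, len(frames))
        if frame_distance(frames[i - 1], frames[i]) > threshold
    ]


# Toy sequence: three dark frames followed by two bright frames.
dark = [20] * 64
bright = [230] * 64
frames = [dark, dark, dark, bright, bright]
print(detect_cuts(frames))  # [3]
```

A cut is reported at the frame where the distance to its predecessor exceeds the threshold, which corresponds to a single tall bar in a plot of consecutive-frame distances.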
13
Segment compensation
• Speech classifier
• Overlaid text detection
[Figure: video segments v1 and v2 with an overlapping audio segment a; after compensation, video segment v1’ with the overlapping audio segment a]
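One plausible reading of the figure is that a video segment is stretched so that an audio segment crossing its boundary (for example a spoken sentence) is fully contained. The sketch below implements that reading only; the function name and the (start, end) representation are assumptions, and the actual compensation rule may differ.

```python
def compensate(video_segment, audio_segments):
    """Extend a (start, end) video segment until it fully contains every
    audio segment that overlaps it. Repeats until no boundary moves, because
    an extension can create a new overlap with another audio segment."""
    start, end = video_segment
    changed = True
    while changed:
        changed = False
        for a_start, a_end in audio_segments:
            overlaps = a_start < end and a_end > start
            sticks_out = a_start < start or a_end > end
            if overlaps and sticks_out:
                start, end = min(start, a_start), max(end, a_end)
                changed = True
    return (start, end)


v1 = (0.0, 10.0)   # video segment
a = (8.5, 11.2)    # audio segment that crosses the end of v1
print(compensate(v1, [a]))  # (0.0, 11.2)
```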
15
Priority – fast understanding
• Objective function directly proportional to
– Sharpness
– Brightness
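The slides do not say how sharpness and brightness are measured. As a minimal sketch, assuming a grayscale image stored as a list of rows, brightness can be taken as the mean intensity and sharpness as the mean absolute Laplacian response, a common focus measure; both choices are assumptions made here for illustration.

```python
def brightness(image):
    """Mean intensity of a grayscale image (list of rows), scaled to [0, 1]."""
    pixels = [p for row in image for p in row]
    return sum(pixels) / (255.0 * len(pixels))


def sharpness(image):
    """Mean absolute 4-neighbor Laplacian over the interior pixels."""
    height, width = len(image), len(image[0])
    total, count = 0.0, 0
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            laplacian = (image[y - 1][x] + image[y + 1][x]
                         + image[y][x - 1] + image[y][x + 1]
                         - 4 * image[y][x])
            total += abs(laplacian)
            count += 1
    return total / count if count else 0.0


flat = [[128] * 4 for _ in range(4)]                                      # no detail
checker = [[255 if (x + y) % 2 else 0 for x in range(4)] for y in range(4)]
print(brightness(flat), sharpness(flat), sharpness(checker))
```

The uniform image has zero sharpness, while the checkerboard, which has the strongest possible local contrast, scores highest.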
17
Priority – people
• Objective function directly proportional to face size and position
• Viola-Jones face detector
19
Priority – action
• Objective function directly proportional to
  – Motion activity: standard deviation of motion vectors
  – Cut density: inverse shot duration
  – Loudness: average audio energy
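The three action cues can be written down directly. This is a minimal sketch under stated assumptions: motion vectors are (dx, dy) pairs and the standard deviation is taken over their magnitudes, shot duration is in seconds, and audio samples are floats. How the cues are weighted and combined is not given in the slides and is not shown here.

```python
import math
import statistics


def motion_activity(motion_vectors):
    """Standard deviation of the motion-vector magnitudes (assumed definition)."""
    magnitudes = [math.hypot(dx, dy) for dx, dy in motion_vectors]
    return statistics.pstdev(magnitudes)


def cut_density(shot_duration):
    """Inverse of the shot duration: shorter shots mean denser cuts."""
    return 1.0 / shot_duration


def loudness(samples):
    """Average audio energy: the mean of the squared sample values."""
    return sum(s * s for s in samples) / len(samples)


print(motion_activity([(3, 4), (0, 0)]))  # 2.5
print(cut_density(4.0))                   # 0.25
print(loudness([0.5, -0.5]))              # 0.25
```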
20
Priority requirements
• Fast understanding
• People
• Action
• Dialogues and speech
• Emotional moments
21
Priority – emotional moments
• Objective function directly proportional to
  – Face size
  – Start of music
  – Presence of subsequent advertisement
22
Priority requirements
• Fast understanding
• People
• Action
• Dialogues and speech
• Emotional moments
• Story clues
23
Priority – story clues
• Keyword extraction from textual subtitles
• Objective function directly proportional to number of keywords
• Penalty function for repeating keywords
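A keyword-based score with a repetition penalty could look like the sketch below. It assumes the keywords have already been extracted, that each selected segment comes with its subtitle text, and that a repeated keyword costs a fixed penalty; the function name, the penalty value, and the example subtitles are all hypothetical.

```python
def story_clue_score(segment_subtitles, keywords, repeat_penalty=0.5):
    """Add 1 for each keyword the first time it appears in the preview and
    subtract repeat_penalty each time a later segment repeats it."""
    seen = set()
    score = 0.0
    for text in segment_subtitles:
        words = {w.strip(".,!?:;\"'").lower() for w in text.split()}
        for keyword in sorted(words & keywords):
            if keyword in seen:
                score -= repeat_penalty
            else:
                score += 1.0
                seen.add(keyword)
    return score


keywords = {"interview", "kiss", "wedding"}
subtitles = [
    "Rachel has an interview tomorrow.",   # new keyword: interview
    "The interview ended with a kiss!",    # repeated: interview, new: kiss
    "Nothing relevant here.",
]
print(story_clue_score(subtitles, keywords))  # 1.5
```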
25
Objective function
$\mathrm{eval}(P) = e_1\,\pi(P) - e_2\,\rho(P) + e_3\,\eta(P) + e_4\,\omega(P) - e_5\,\varepsilon(P)$
• π(P): priority score
• ρ(P): redundancy score
• η(P): structure score
• ω(P): temporal order score
• ε(P): penalty term
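As a worked instance of the weighted sum, the sketch below evaluates eval(P) for one set of component scores. The dictionary keys, the weights, and the score values are made up for illustration; the slides do not give the actual weights.

```python
def evaluate(scores, weights):
    """eval(P) = e1*pi(P) - e2*rho(P) + e3*eta(P) + e4*omega(P) - e5*eps(P)."""
    e1, e2, e3, e4, e5 = weights
    return (e1 * scores["priority"]
            - e2 * scores["redundancy"]
            + e3 * scores["structure"]
            + e4 * scores["temporal_order"]
            - e5 * scores["penalty"])


# Hypothetical component scores for one candidate preview P.
scores = {
    "priority": 0.8,
    "redundancy": 0.25,
    "structure": 0.5,
    "temporal_order": 1.0,
    "penalty": 0.0,
}
print(evaluate(scores, weights=(1.0, 1.0, 1.0, 1.0, 1.0)))
```

With unit weights this gives 0.8 − 0.25 + 0.5 + 1.0 − 0.0 = 2.05, up to floating-point rounding: redundancy and the penalty term lower the score, while the other three terms raise it.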
26
Solution approach – optimization
• Local search: simulated annealing
• Start with a random initial solution and iterate to improve the solution
• At each iteration:
  – a better solution is always accepted
  – a worse solution is accepted with a certain probability that depends on how much worse it is and on the current temperature (the search gets pickier as it progresses)
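The acceptance rule described above can be sketched with a generic simulated-annealing loop. The toy problem (pick segments to maximize summed priority under a maximum total duration), the geometric cooling schedule, and all numbers are assumptions for illustration; they are not the objective function or parameters used in the talk.

```python
import math
import random

# Toy data: (duration in seconds, priority) per candidate segment; made up.
SEGMENTS = [(4, 0.9), (6, 0.2), (5, 0.7), (3, 0.4), (8, 0.8)]
D_MAX = 12


def score(selection):
    """Summed priority, minus a penalty for exceeding the duration limit."""
    duration = sum(SEGMENTS[i][0] for i in selection)
    priority = sum(SEGMENTS[i][1] for i in selection)
    return priority - max(0, duration - D_MAX)


def neighbor(selection, rng):
    """Toggle one randomly chosen segment in or out of the selection."""
    i = rng.randrange(len(SEGMENTS))
    return selection ^ frozenset([i])


def simulated_annealing(initial, t_start=1.0, t_end=0.001, cooling=0.95,
                        steps_per_temperature=50, seed=0):
    rng = random.Random(seed)
    current, current_score = initial, score(initial)
    best, best_score = current, current_score
    t = t_start
    while t > t_end:
        for _ in range(steps_per_temperature):
            candidate = neighbor(current, rng)
            candidate_score = score(candidate)
            delta = candidate_score - current_score
            # A better solution is always accepted; a worse one is accepted
            # with probability exp(delta / t), which shrinks as t decreases.
            if delta >= 0 or rng.random() < math.exp(delta / t):
                current, current_score = candidate, candidate_score
                if current_score > best_score:
                    best, best_score = current, current_score
        t *= cooling  # lower the temperature: the search gets pickier
    return best, best_score


best, best_score = simulated_annealing(frozenset())
print(sorted(best), best_score)
```

Tracking the best solution seen so far, rather than returning the final state, is a common safeguard, since a late uphill-to-downhill move could otherwise discard a good selection.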
27
Solution approach – optimization
• On a dataset of 30 video items:
  – Random: eval(P) = 0.03
  – Subsample: eval(P) = 0.04
  – Local search: eval(P) = 0.45
• According to our model local search performs “well”
• Are requirements sufficiently fulfilled?
• Are users satisfied with the previews’ quality?
28
User study – hypothesis
• Is the optimization-based approach providing a better overview of a video than subsampling?
• How good are automatic previews with respect to manually made ones?
Hypothesis (ordering by quality): subsample < optimization-based < manual (high quality)
29
Hypothesis
• Better overview with respect to:
A. Understandability of the individual segments
B. Transitions between segments
C. Amount of useful information
D. Correct representation of the atmosphere
E. Usefulness for choosing
30
Experiment design
• Direct rating
• Within subject
  – Each participant evaluates 3 preview versions (subsample, optimized, and manual)
• Calibration with good and bad previews
• Special design to reduce order effects (E. Stinstra, CMQ)
31
Test material
Title | Genre | Duration
007 The World is not Enough | action, adventure, thriller | 128 minutes
Friends, Seas. 5, Ep. 17, “The One with Rachel’s Inadvertent Kiss” | comedy, romance | 20 minutes
Master and Commander | action, adventure, drama, war | 138 minutes
The Nanny, Seas. 1, Ep. 0 | comedy | 24 minutes
Harry Potter and the Chamber of Secrets | adventure, family, fantasy, mystery | 161 minutes
Forrest Gump | comedy, drama, romance | 142 minutes
33
Participants
• Volunteers from High Tech Campus
• Gender: 20 females, 20 males
• Age: mean 28 (min 22, max 42)
• Language: 24 Dutch, 16 English
• All subjects interested in movies:
  – 23 watch more than 1 film per week
  – 15 watch 1-4 films per month
  – 2 watch fewer than 1 film per month
34
Results – mean scores
[Figure: bar chart of mean scores (scale 0–10) for manual, optimized, and subsample previews on: Audio transitions, Visual transitions, Informativeness, Atmosphere, Overall, Usefulness]
• Analysis of variance for each question
  – Main factors: algorithm and content
  – Dependent variable: score
  – Random factor: subject
  – Post-hoc Tukey test on algorithm
35
Other effects
• For each question another ANOVA
  – Main factors: algorithm and content
  – Dependent variable: score
  – Covariates:
    • Language (English, Dutch)
    • Gender (male, female)
    • Age (22-25, 26-30, 31-35, 36-65)
    • Film-fan (>1 film/week, 1-4 films/month, <1 film/month)
    • Liking the type of content (1-4 scale)
    • Knowing the content (no, yes or partially)
• No significant main effects for language, gender, age, liking or knowing the content
• One significant main effect for film-fan (F = 5.986, p = 0.015)
36
Most frequent comments
Comment | Manual | Optimized | Subsample
Not enough information on the story | 0 | 4 | 9
Good information on the story | 7 | 4 | 1
Gives away too much information | 3 | 2 | 1
Good impression of the atmosphere | 5 | 6 | 1
Bad impression of the atmosphere | 0 | 0 | 6
Presence of uninformative scenes | 0 | 0 | 7
Missing link between segments | 0 | 0 | 5
Too short segments | 6 | 0 | 3
37
Conclusions
• Method to automatically generate previews for browsing video archives
• Suitable to be implemented in consumer storage devices: e.g. DVR set-top boxes
• User study: quality of previews better than sub-sampling, but not as good as human-made
[Figure: quality scale, subsample < optimization-based < manual (high quality)]
39
Content analysis and understanding
• Decouple visual and audio segmentation
• Exploit better textual information (e.g. [Tsoneva et al. 2007])
40
[Tsoneva et al. 2007]
Keyword/Character Rank
[Figure: graph whose nodes are labeled with characters (Monica, Chandler, Ross, Joey, Rachel, Phoebe, Emily) and annotated with rank values (1.659, 0.714, 1.032, 0.15, 0.468, 1.122)]
$KR(Sc_1) = (1 - d) + d \left( \frac{KR(Sc_2)}{L(Sc_2)} + \frac{KR(Sc_3)}{L(Sc_3)} + \frac{KR(Sc_4)}{L(Sc_4)} \right)$
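The rank formula has the same shape as PageRank and can be evaluated by fixed-point iteration. The sketch below assumes that L(Sc) is the number of outgoing links of a node and that d = 0.85; neither is stated on the slide, and the three-node graph is made up.

```python
def keyword_rank(links, d=0.85, iterations=100):
    """Iterate KR(n) = (1 - d) + d * sum(KR(m) / L(m)) over all nodes m linking to n.

    links: dict mapping each node to the list of nodes it points to.
    L(m) is taken to be the number of outgoing links of m (an assumption).
    """
    rank = {node: 1.0 for node in links}
    for _ in range(iterations):
        new_rank = {}
        for node in links:
            incoming = sum(
                rank[other] / len(targets)
                for other, targets in links.items()
                if node in targets
            )
            new_rank[node] = (1 - d) + d * incoming
        rank = new_rank
    return rank


# Hypothetical graph: Sc1 and Sc2 link to each other, Sc3 links to Sc1.
links = {"Sc1": ["Sc2"], "Sc2": ["Sc1"], "Sc3": ["Sc1"]}
for node, value in sorted(keyword_rank(links).items()):
    print(node, round(value, 3))
```

A node without incoming links ends up with rank 1 − d, which is 0.15 for the assumed d = 0.85; that would be consistent with the 0.15 appearing among the rank values, though the slide does not confirm the value of d.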
41
Examples of film grammar
• Camera angle, e.g. “Dutch” angle used in emotionally charged scenes
• Color palette influence on mood: warm/cold hues, saturation, brightness and color energy
• Focus/defocus to attract viewers’ attention to a part of a frame, e.g. low depth of focus, rack focus
• Field of view: establish scene, show action, highlight emotional response of characters, etc.
[Figure: long shot, medium shot, close-up shot]
42
Augmentation
• Example: mix in a synthetic voice-over
[Diagram: Textual information → Text pre-processing → Speech synthesis → Mixer; AV content → Video preview generation → Mixer; Mixer → preview]
44
Acknowledgments
Ad Denissen, Albertine Visser, Dzevdet Burazerovic, Emile Aarts, Enno Ehlers, Erik Niessen, Erwin Stinstra, Fabian Ernst, Frank Crienen, Freddy Snijder, Gerhard Langelaar, Gerhard Mekenkamp, Hans Weda, Igor Paulussen, Jan Engel, Jan Korst, Jeroen Breebaart, Jettie Hoonhout, Lalitha Agnihotri, Maria Zapata Ferrer, Martin McKinney, Mauro Barbieri, Nevenka Dimitrova, Olaf Seupel, Peter Jakobs, Peter Sels, Rob van den Boomen, Robert van Uden, Ruud Wijnands, Wim Verhaegh
All participants in the user studies