Extracting the essence of video: Generating automatic video previews
Mauro Barbieri, Philips Research
November 27th, 2007


2

Sturgeon’s law

“Ninety percent of everything is crud” (T. Sturgeon)

“Life is short”

“Life is too short for crud”

[J. Foote, “Kooks, Obsessives, Sturgeon’s Law, and the Real Meaning of Search”, IEEE MM 2005]

3

Outline

• Problem statement
• Requirements
• Solution approach
• Validation
• Conclusions

4

Problem statement

• Content offer explodes

• Free time to consume content does not increase

• We need means that help in making choices:
– get insights into the video content
– as simply as possible
– while being entertained

5

Video preview

• Short video sequence composed of automatically-selected portions of the original video

• Concisely gives a reliable impression of a video (mood, feel, genre)

• Helps in deciding what to watch

• Not a teaser: represents the true content

6

Preview example

7

Requirements elicitation

• Related literature on video summarization
• Film production literature
• Guided interviews
• ~30 requirements:
– Duration
– Exclusion
– Continuity
– Structural
– Priority
– Temporal order
– Uniqueness

8

Approach – constrained optimization

• Preview = subset of original video

• Requirements are translated into:
– Constraints that the subset must satisfy
– Score functions

• Preview generation as an optimization problem
– Find the subset that satisfies all constraints and maximizes the objective function

9

Solution approach – overview

[Figure: processing pipeline. Raw AV content → Preparation → Candidate AV segments → Selection → Preview]

10

Duration requirements

• Duration of the preview: $D_{\min} \le \sum_{p \in P} d(p) \le D_{\max}$

• Duration of the segments: $\forall p \in P:\; d_{\min} \le d(p)$
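The two duration constraints can be checked directly on a candidate preview. The sketch below is only an illustration: it assumes each segment is a (start, end) pair in seconds, and the function and parameter names are hypothetical, not taken from the slides.

```python
from typing import List, Tuple

# Assumption: a segment is a (start, end) pair in seconds.
Segment = Tuple[float, float]


def duration(segment: Segment) -> float:
    """d(p): length of one segment in seconds."""
    start, end = segment
    return end - start


def satisfies_duration(preview: List[Segment],
                       d_min_total: float,
                       d_max_total: float,
                       d_min_segment: float) -> bool:
    """Check D_min <= sum of d(p) <= D_max, and d_min <= d(p) for every p in P."""
    total = sum(duration(p) for p in preview)
    total_ok = d_min_total <= total <= d_max_total
    segments_ok = all(duration(p) >= d_min_segment for p in preview)
    return total_ok and segments_ok
```

For example, a preview made of a 5-second and a 10-second segment passes with bounds of 10 to 30 seconds in total and at least 3 seconds per segment.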

11

Continuity requirements

• Visual continuity
– No abrupt interruptions of action

• Speech continuity
– Include only complete sentences

• Subtitles continuity
– Display subtitles for a sufficient amount of time

12

Temporal segmentation

• Visual continuity: shot cut detection

[Figure: histograms of five consecutive frames (Frame i–2, Frame i–1, Frame i, Frame i+1, Frame i+2) and a bar chart of the four distances between adjacent frames: Dist(Frame i–2, Frame i–1), Dist(Frame i–1, Frame i), Dist(Frame i, Frame i+1), Dist(Frame i+1, Frame i+2)]
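Shot cut detection from distances between consecutive frames can be sketched as follows. This is a minimal illustration under stated assumptions: frames are flat lists of 8-bit grey values, the distance is half the L1 distance between normalized histograms, and a cut is declared when the distance exceeds a fixed threshold. The slides do not specify the histogram type, the distance measure, or the threshold.

```python
from typing import List, Sequence


def histogram(frame: Sequence[int], bins: int = 4) -> List[float]:
    """Normalized grey-level histogram of a frame (values in 0..255)."""
    counts = [0] * bins
    for value in frame:
        counts[min(value * bins // 256, bins - 1)] += 1
    return [c / len(frame) for c in counts]


def dist(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Assumed distance: half the L1 distance, so the result lies in [0, 1]."""
    return 0.5 * sum(abs(a - b) for a, b in zip(h1, h2))


def detect_cuts(frames: Sequence[Sequence[int]],
                threshold: float = 0.5,
                bins: int = 4) -> List[int]:
    """Return indices i where Dist(Frame i-1, Frame i) exceeds the threshold."""
    hists = [histogram(f, bins) for f in frames]
    return [i for i in range(1, len(hists))
            if dist(hists[i - 1], hists[i]) > threshold]


if __name__ == "__main__":
    dark, bright = [10] * 16, [240] * 16
    print(detect_cuts([dark, dark, bright, bright]))  # cut reported at frame 2
```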

13

Segment compensation

• Speech classifier
• Overlaid text detection

[Figure: before compensation, video segments v1 and v2 with an overlapping audio segment a; after compensation, video segment v1’ with the same overlapping audio segment a]
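One plausible reading of segment compensation is that a video segment is stretched so that it fully contains any speech segment that crosses one of its boundaries. The sketch below implements only that reading; the extension rule and the names are assumptions, since the slide shows the idea as a figure and not as an algorithm.

```python
from typing import List, Tuple

# Assumption: segments are (start, end) pairs in seconds.
Segment = Tuple[float, float]


def compensate(video: Segment, speech: List[Segment]) -> Segment:
    """Extend a video segment to cover speech segments that straddle its boundaries.

    Single pass over the speech segments; speech lying fully inside or fully
    outside the video segment leaves it unchanged.
    """
    start, end = video
    for s_start, s_end in speech:
        if s_start < end < s_end:      # speech runs past the end of the segment
            end = s_end
        if s_start < start < s_end:    # speech begins before the segment starts
            start = s_start
    return (start, end)


if __name__ == "__main__":
    # v1 = (10, 20) and a speech segment a = (18, 23) give v1' = (10, 23).
    print(compensate((10, 20), [(18, 23)]))
```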

14

Priority requirements

• Fast understanding

15

Priority – fast understanding

• Objective function directly proportional to

– Sharpness

– Brightness
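As a rough illustration of the two measures, the sketch below computes brightness as the mean grey level and sharpness as the mean absolute difference between neighbouring pixels, both scaled to [0, 1]. These particular definitions are assumptions made for the example; the slides name the measures but do not define them.

```python
from typing import List

# Assumption: a frame is a 2D list of 8-bit grey values (rows of pixels).
Image = List[List[int]]


def brightness(img: Image) -> float:
    """Mean grey level, scaled to [0, 1]."""
    pixels = [v for row in img for v in row]
    return sum(pixels) / (255.0 * len(pixels))


def sharpness(img: Image) -> float:
    """Mean absolute difference between horizontal and vertical neighbours, in [0, 1]."""
    height, width = len(img), len(img[0])
    total, count = 0.0, 0
    for y in range(height):
        for x in range(width):
            if x + 1 < width:
                total += abs(img[y][x + 1] - img[y][x])
                count += 1
            if y + 1 < height:
                total += abs(img[y + 1][x] - img[y][x])
                count += 1
    return total / (255.0 * count) if count else 0.0
```

A uniform grey frame scores zero sharpness, while a pixel-level checkerboard scores the maximum of 1.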

16

Priority requirements

• Fast understanding

• People

17

Priority – people

• Objective function directly proportional to face size and position

• Viola-Jones face detector
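Given face bounding boxes from a detector such as Viola-Jones, a score that grows with face size and favours centred faces could look like the sketch below. The combination (relative area multiplied by a centrality factor, taking the best face per frame) is an assumption for illustration; the slides only state that the objective is proportional to face size and position.

```python
import math
from typing import List, Tuple

# Assumption: a detection is (x, y, width, height) in pixels, top-left origin.
Box = Tuple[float, float, float, float]


def face_score(faces: List[Box], frame_w: float, frame_h: float) -> float:
    """Score in [0, 1]: large, centred faces score highest; no faces scores 0."""
    half_diagonal = math.hypot(frame_w / 2, frame_h / 2)
    best = 0.0
    for x, y, w, h in faces:
        size = (w * h) / (frame_w * frame_h)             # relative face area
        cx, cy = x + w / 2, y + h / 2                    # face centre
        offset = math.hypot(cx - frame_w / 2, cy - frame_h / 2)
        centrality = 1.0 - offset / half_diagonal        # 1 at the frame centre
        best = max(best, size * centrality)
    return best
```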

18

Priority requirements

• Fast understanding

• People

• Action

19

Priority – action

• Objective function directly proportional to
– Motion activity: standard deviation of motion vectors
– Cut density: inverse shot duration
– Loudness: average audio energy
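The three action measures can be sketched as small functions. The example assumes motion activity is the standard deviation of motion vector magnitudes, audio energy is the mean squared sample value, and the three terms are combined with hypothetical weights; none of these details are given on the slide.

```python
import math
import statistics
from typing import Sequence, Tuple


def motion_activity(motion_vectors: Sequence[Tuple[float, float]]) -> float:
    """Standard deviation of motion vector magnitudes (assumed interpretation)."""
    magnitudes = [math.hypot(dx, dy) for dx, dy in motion_vectors]
    return statistics.pstdev(magnitudes) if magnitudes else 0.0


def cut_density(shot_duration: float) -> float:
    """Inverse shot duration: short shots give a high value."""
    return 1.0 / shot_duration


def loudness(samples: Sequence[float]) -> float:
    """Average audio energy, taken here as the mean squared sample value."""
    return sum(s * s for s in samples) / len(samples)


def action_score(motion_vectors: Sequence[Tuple[float, float]],
                 shot_duration: float,
                 samples: Sequence[float],
                 weights: Tuple[float, float, float] = (1.0, 1.0, 1.0)) -> float:
    """Weighted sum of the three measures; the weights are hypothetical."""
    w_motion, w_cut, w_loud = weights
    return (w_motion * motion_activity(motion_vectors)
            + w_cut * cut_density(shot_duration)
            + w_loud * loudness(samples))
```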

20

Priority requirements

• Fast understanding

• People

• Action

• Dialogues and speech

• Emotional moments

21

Priority – emotional moments

• Objective function directly proportional to
– Face size
– Start of music
– Presence of subsequent advertisement

22

Priority requirements

• Fast understanding

• People

• Action

• Dialogues and speech

• Emotional moments

• Story clues

23

Priority – story clues

• Keyword extraction from textual subtitles

• Objective function directly proportional to number of keywords

• Penalty function for repeating keywords
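A keyword-based score with a repetition penalty might be sketched as follows. The example assumes keywords have already been extracted from the subtitles of each selected segment, rewards each new keyword with one point, and subtracts a hypothetical fixed penalty for each keyword that an earlier segment already covered.

```python
from typing import List, Set


def story_clue_score(segment_keywords: List[Set[str]],
                     repeat_penalty: float = 0.5) -> float:
    """Reward new keywords, penalize keywords repeated across segments.

    segment_keywords holds one keyword set per selected segment, in preview
    order. The reward of 1 and the penalty value are illustrative assumptions.
    """
    seen: Set[str] = set()
    score = 0.0
    for keywords in segment_keywords:
        for keyword in keywords:
            if keyword in seen:
                score -= repeat_penalty
            else:
                score += 1.0
        seen |= keywords
    return score


if __name__ == "__main__":
    # Two new keywords, then one repeat: 2.0 - 0.5 = 1.5
    print(story_clue_score([{"ring", "wedding"}, {"ring"}]))
```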

24

Solution approach – overview

[Figure: processing pipeline. Raw AV content → Preparation → Candidate AV segments → Selection → Preview]

25

Objective function

$\mathrm{eval}(P) = e_1\,\pi(P) - e_2\,\rho(P) + e_3\,\eta(P) + e_4\,\omega(P) - e_5\,\varepsilon(P)$

• $\pi(P)$: priority score
• $\rho(P)$: redundancy score
• $\eta(P)$: structure score
• $\omega(P)$: temporal order score
• $\varepsilon(P)$: penalty term
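The objective is a signed weighted sum of the five terms: priority, structure, and temporal order add to the score, while redundancy and the penalty term subtract from it. A minimal sketch, assuming the five scores are already computed for a candidate preview and using placeholder weights of 1 (the actual weights are not given on the slide):

```python
from typing import Sequence


def evaluate(priority: float,
             redundancy: float,
             structure: float,
             temporal_order: float,
             penalty: float,
             weights: Sequence[float] = (1.0, 1.0, 1.0, 1.0, 1.0)) -> float:
    """eval(P) = e1*pi(P) - e2*rho(P) + e3*eta(P) + e4*omega(P) - e5*epsilon(P)."""
    e1, e2, e3, e4, e5 = weights
    return (e1 * priority
            - e2 * redundancy
            + e3 * structure
            + e4 * temporal_order
            - e5 * penalty)
```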

26

Solution approach – optimization

• Local search: simulated annealing

• Start with a random initial solution and iterate to improve the solution

• At each iteration:
– a better solution is always accepted
– a worse solution is accepted with a certain probability, based on how much worse the solution is and on the current temperature (it gets pickier as it progresses)
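The acceptance rule can be sketched with a generic simulated annealing loop. Everything specific here is an assumption for illustration: the exponential acceptance probability exp(delta / T), the geometric cooling schedule, and the toy problem in which a preview is a tuple of booleans marking the selected segments. The score used in the toy problem stands in for eval(P).

```python
import math
import random
from typing import Callable, Optional, Tuple

Selection = Tuple[bool, ...]  # assumption: True means the segment is in the preview


def simulated_annealing(initial: Selection,
                        neighbor: Callable[[Selection, random.Random], Selection],
                        score: Callable[[Selection], float],
                        t_start: float = 1.0,
                        t_end: float = 1e-3,
                        cooling: float = 0.95,
                        steps_per_t: int = 50,
                        rng: Optional[random.Random] = None):
    """Maximize score(); returns the best solution seen and its score."""
    rng = rng or random.Random(0)
    current, current_score = initial, score(initial)
    best, best_score = current, current_score
    t = t_start
    while t > t_end:
        for _ in range(steps_per_t):
            candidate = neighbor(current, rng)
            delta = score(candidate) - current_score
            # Better solutions are always accepted; worse ones with a
            # probability that shrinks as delta gets more negative and t drops.
            if delta >= 0 or rng.random() < math.exp(delta / t):
                current, current_score = candidate, current_score + delta
                if current_score > best_score:
                    best, best_score = current, current_score
        t *= cooling  # lower temperature: the search gets pickier
    return best, score(best)


# Toy problem (hypothetical data): pick segments with high priority
# without exceeding a maximum total duration.
DURATIONS = [4, 6, 5, 3, 7]
PRIORITIES = [0.9, 0.2, 0.7, 0.4, 0.8]
MAX_TOTAL = 12


def toy_score(sel: Selection) -> float:
    total = sum(d for d, s in zip(DURATIONS, sel) if s)
    gain = sum(p for p, s in zip(PRIORITIES, sel) if s)
    return gain - 10.0 * max(0, total - MAX_TOTAL)  # penalize overlong previews


def flip_one(sel: Selection, rng: random.Random) -> Selection:
    i = rng.randrange(len(sel))
    return sel[:i] + (not sel[i],) + sel[i + 1:]


if __name__ == "__main__":
    rng = random.Random(1)
    start = tuple(rng.random() < 0.5 for _ in DURATIONS)  # random initial solution
    print(simulated_annealing(start, flip_one, toy_score, rng=rng))
```

Tracking the best solution seen so far is a common practical addition, since the current solution can get worse while the temperature is still high.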

27

Solution approach – optimization

• On a dataset of 30 video items:
– Random: eval(P) = 0.03
– Subsample: eval(P) = 0.04
– Local search: eval(P) = 0.45

• According to our model, local search performs “well”

• Are requirements sufficiently fulfilled?
• Are users satisfied with the previews’ quality?

28

User study – hypothesis

• Is the optimization-based approach providing a better overview of a video than subsampling?

• How good are automatic previews with respect to manually made ones?

Hypothesis:

[Figure: quality scale with subsample, optimization-based, and manual previews placed in that order toward high quality]

29

Hypothesis

• Better overview with respect to:

A. Understandability of the individual segments
B. Transitions between segments
C. Amount of useful information
D. Correct representation of the atmosphere
E. Usefulness for choosing

30

Experiment design

• Direct rating
• Within subject
– Each participant evaluates 3 preview versions (subsample, optimized, and manual)
• Calibration with good and bad previews
• Special design to reduce order effects (E. Stinstra, CMQ)

31

Test material

Title | Genre | Duration
007 The World is not Enough | action, adventure, thriller | 128 minutes
Friends, Seas. 5, Ep. 17, “The One with Rachel’s Inadvertent Kiss” | comedy, romance | 20 minutes
Master and Commander | action, adventure, drama, war | 138 minutes
The Nanny, Seas. 1, Ep. 0 | comedy | 24 minutes
Harry Potter and the Chamber of Secrets | adventure, family, fantasy, mystery | 161 minutes
Forrest Gump | comedy, drama, romance | 142 minutes

32

Preview examples

• Example A
• Example B
• Example C

Manual, Subsample, Optimized

33

Participants

• Volunteers from High Tech Campus
• Gender: 20 females, 20 males
• Age: mean 28 (min 22, max 42)
• Language: 24 Dutch, 16 English
• All subjects interested in movies:
– 23 watch more than 1 film per week
– 15 watch 1-4 films per month
– 2 watch less than 1 film per month

34

Results – mean scores

[Figure: bar chart of mean scores on a 0–10 scale for manual, optimized, and subsample previews; categories: Audio transitions, Visual transitions, Informativeness, Atmosphere, Overall Usefulness]

• Analysis of variance for each question
– Main factors: algorithm and content
– Dependent variable: score
– Random factor: subject
– Post-hoc Tukey test on algorithm

35

Other effects

• For each question another ANOVA
– Main factors: algorithm and content
– Dependent variable: score
– Covariates:
  • Language (English, Dutch)
  • Gender (male, female)
  • Age (22-25, 26-30, 31-35, 36-65)
  • Film-fan (>1 film/week, 1-4 films/month, <1 film/month)
  • Liking the type of content (1-4 scale)
  • Knowing the content (no, yes or partially)

• No significant main effects for language, gender, age, liking or knowing the content

• One significant main effect for film-fan (F = 5.986, p = 0.015)

36

Most frequent comments

Comment | Manual | Optimized | Subsample
Not enough information on the story | 0 | 4 | 9
Good information on the story | 7 | 4 | 1
Gives away too much information | 3 | 2 | 1
Good impression of the atmosphere | 5 | 6 | 1
Bad impression of the atmosphere | 0 | 0 | 6
Presence of uninformative scenes | 0 | 0 | 7
Missing link between segments | 0 | 0 | 5
Too short segments | 6 | 0 | 3

37

Conclusions

• Method to automatically generate previews for browsing video archives

• Suitable to be implemented in consumer storage devices: e.g. DVR set-top boxes

• User study: quality of previews better than sub-sampling, but not as good as human-made

[Figure: quality scale with subsample, optimization-based, and manual previews placed in that order toward high quality]

38

Future work

• Content analysis and understanding

• Augmentation

• User-created content

39

Content analysis and understanding

• Decouple visual and audio segmentation

• Exploit better textual information (e.g. [Tsoneva et al. 2007])

40

[Tsoneva et al. 2007]

Keyword/Character Rank

[Figure: keyword/character rank example with character names (Monica, Chandler, Ross, Joey, Rachel, Phoebe, Emily) and rank values 1.659, 0.714, 1.032, 0.15, 0.468, 1.122]

$ScKR(1) = (1 - d) + d \cdot \left( \frac{ScKR(2)}{ScL(2)} + \frac{ScKR(3)}{ScL(3)} + \frac{ScKR(4)}{ScL(4)} \right)$

41

Examples of film grammar

• Camera angle, e.g. “Dutch” angle used in emotionally charged scenes

• Color palette influence on mood: warm/cold hues, saturation, brightness and color energy

• Focus/defocus to attract viewers’ attention to a part of a frame, e.g. low depth of focus, rack focus

• Field of view: establish scene, show action, highlight emotional response of characters, etc.

[Figure: long shot, medium shot, close-up shot]

42

Augmentation

• Example: mix in a synthetic voice-over

[Figure: Textual information → Text pre-processing → Speech synthesis → Mixer; AV content → Video preview generation → Mixer → preview]

43

44

Acknowledgments

Ad Denissen, Albertine Visser, Dzevdet Burazerovic, Emile Aarts, Enno Ehlers, Erik Niessen, Erwin Stinstra, Fabian Ernst, Frank Crienen, Freddy Snijder, Gerhard Langelaar, Gerhard Mekenkamp, Hans Weda, Igor Paulussen, Jan Engel, Jan Korst, Jeroen Breebaart, Jettie Hoonhout, Lalitha Agnihotri, Maria Zapata Ferrer, Martin McKinney, Mauro Barbieri, Nevenka Dimitrova, Olaf Seupel, Peter Jakobs, Peter Sels, Rob van den Boomen, Robert van Uden, Ruud Wijnands, Wim Verhaegh

All participants in the user studies