

Book2Movie: Aligning Video scenes with Book chapters

Makarand Tapaswi, Martin Bäuml, Rainer Stiefelhagen
Karlsruhe Institute of Technology, 76131 Karlsruhe, Germany
{makarand.tapaswi, baeuml, rainer.stiefelhagen}@kit.edu

Abstract

Film adaptations of novels often visually display in a few shots what is described in many pages of the source novel. In this paper we present a new problem: to align book chapters with video scenes. Such an alignment facilitates finding differences between the adaptation and the original source, and also acts as a basis for deriving rich descriptions from the novel for the video clips.

We propose an efficient method to compute an alignment between book chapters and video scenes using matching dialogs and character identities as cues. A major consideration is to allow the alignment to be non-sequential. Our suggested shortest-path based approach deals with non-sequential alignments and can be used to determine whether a video scene was part of the original book. We create a new data set involving two popular novel-to-film adaptations with widely varying properties and compare our method against other text-to-video alignment baselines. Using the alignment, we present a qualitative analysis of describing the video through rich narratives obtained from the novel.

1. Introduction

TV series and films are often adapted from novels [4]. In the fantasy genre, popular examples include the Lord of the Rings trilogy, the TV series Game of Thrones based on the novel series A Song of Ice and Fire, the Chronicles of Narnia and the Harry Potter series. There are many other examples from various genres: the Bourne series (action), The Hunger Games (adventure), the House of Cards TV series (drama), and a number of super hero movies (e.g. Batman, Spiderman), many of which are based on comic books.

We believe that such adaptations are a large untapped resource that can be used to simultaneously improve and understand the story semantics for both video and natural language. In recent years there is a rising interest in automatically generating meaningful captions for images [10, 11, 16, 30] and even user-generated videos [7, 21, 29]. For readers to visualize the story universe, novels often provide rich textual descriptions in the form of attributes or actions of the content that is depicted visually in the video (e.g., especially for characters and their surroundings). Due to the large number of such adaptations of descriptive text into video, such data can prove to be an excellent source for learning joint textual and visual models.

Determining how a specific part of the novel is adapted to the video domain is a very interesting problem in itself [9, 17, 31]. A classical example of this is to pin-point the differences between books and their movie adaptations¹. Differences are at various levels of detail and can range from appearance, presence or absence of characters, all the way up to modification of the sub-stories. As a start, we address the problem of finding differences at the scene level, more specifically on how chapters and scenes are ordered, and which scenes are not backed by a chapter from the book.

A more fine-grained linkage between novels and films can have direct commercial applications. Linking the two encourages growth in the consumption of both literary and audio-visual material [3].

In this work, we target the problem of aligning video scenes to specific chapters of the novel (Sec. 4). Fig. 1 presents an example of the ground truth alignment between book chapters and parts of the video, along with a discussion of some of the challenges in the alignment problem. We emphasize finding scenes that are not part of the novel and perform our evaluation with this goal in mind (Sec. 5).

There are many applications which emerge from the alignment. Sec. 6 discusses how vivid descriptions in a book can be used as a potential source of weak labels to improve video understanding. Similar to [25], an alignment between video and text enables text-based video retrieval, since searching for a part of the story in the video can be translated to searching in the text (novel). In contrast to [25], which used short plot synopses of a few sentences per video, novels provide a much larger pool of descriptions which can potentially match a text-based search. Novels often contain more information than what appears in a short video adaptation. For example, minor characters that appear

¹ There are websites which are dedicated to locating such differences, e.g. http://thatwasnotinthebook.com/



[Figure 1: ground-truth alignment chart. Lower axis: chapters 1–73 of A Song of Ice and Fire (Book 1); upper axis: episodes E01–E10 of Game of Thrones (Season 1).]

Figure 1. This figure is best viewed in color. We present the ground truth alignment of chapters from the first novel of A Song of Ice and Fire to the first season of the TV series Game of Thrones. Book chapters are depicted on the lower axis with tick spacing corresponding to the number of words in the chapter. As the novel follows a point-of-view narrative, each chapter is color coded based on the character. Chapters with a red background (see 8, 18, 42) are not represented in the video adaptation. The ten episodes (E01–E10) of season 1 corresponding to the first novel are plotted on the top axis. Each bar of color represents the location and duration of time in the video which corresponds to the specific chapter from the novel. A line joining the center of the chapter to the bar indicates the alignment. White spaces between the colored bars indicate that those scenes are not aligned to any part of the novel: almost 30% of all shots do not belong to any chapter. Another challenge is that chapters can be split and presented in multiple scenes (e.g., chapter 37). Note the complexity of the adaptation and how the video does not necessarily follow the book chapters sequentially. We will model these intricacies and use an efficient approach to find a good alignment.

in the video are often un-named; however, they can be associated with a name by leveraging the novel. Finally, the alignment, when performed on many films and novels together, can be used to study the task of adaptation itself and, for example, characterize different screenwriters and directors.

2. Related work

In recent years, videos and accompanying text are an increasingly popular subject of research. We briefly review and discuss related work in the direction of video analysis and understanding using various sources of text.

The automatic analysis of TV series and films has greatly benefited from external sources of textual information. Fan transcripts with subtitles are emerging as the de facto standard in automatic person identification [6, 8, 19, 24] and action recognition [5, 12]. They also see application in video summarization [28] and are used to understand the scene structure of videos [13]. There is also work on aligning videos to transcripts in the absence of subtitles [23].

The last year has seen a rise in joint analysis of text and video data. [22] presents a framework to make TV series data more accessible by providing audio-visual meta data along with three types of textual information – transcripts, episode outlines and video summaries. [25] introduces and investigates the use of plot synopses from Wikipedia to perform video retrieval. The alignment between sentences of the plot synopsis and shots of the video is leveraged to search for story events in TV series. A new source of text in the form of descriptive video service, used to help visually impaired people watch TV or films, is introduced by [20]. In the domain of autonomous driving videos, [14] parses textual queries as a semantic graph and uses bipartite graph matching to bridge the text-video gap, ultimately performing video retrieval. Very recently, an approach to generate textual descriptions for short video clips through the use of Convolutional and Recurrent Neural Networks is presented in [29]. While the video domains may be different, the improved vision and language models have sparked a clear interest in co-analyzing text and video data.

Adapting a novel to a film is an interesting problem in the performing arts literature. [9, 17, 31] present various guidelines on how a novel can be adapted to the screen or theater². Our proposed alignment is a first step towards automating the analysis of such adaptations at a large scale.

Previous works in video analysis almost exclusively deal with textual data which is derived from the video. This often causes the text to be rather sparse owing to limited human involvement in its production. One key difference in analyzing novels – as done in this work – is that the video is derived from the text (or, both are derived from the core

² http://www.kvenno.is/englishmovies/adaptation.htm is a nice summary in terms of the differences and freedom that different media may use while essentially telling the same story.


storyline). This usually means that the available textual description is much more complete, which has both advantages (more details) and disadvantages (not every part of the text needs to be present in the video). We think that this ultimately provides a powerful source for improving simultaneous semantic understanding of both text and video.

Closest to our alignment approach are two works which use dynamic programming (DP) to perform text-to-video alignment. Sankar et al. [23] align videos to transcripts without subtitles; and Tapaswi et al. [25] align sentences from the plot synopses to shots of the video. However, both models make two strong assumptions: (i) sentences and video shots appear sequentially; (ii) every shot is assigned to a sentence. This is often not the case for novel-to-film adaptations (see Fig. 1). We specifically address the problem of non-sequential alignments in this paper. To the best of our knowledge, this is also the first work to automatically analyze novels and films, to align book chapters with video scenes and, as a by-product, derive rich and unrestricted attributes for the visual content.

3. Data set and preprocessing

We first briefly describe our new data set followed by some necessary pre-processing steps.

3.1. A new Book2Movie data set

As this is the first work in this direction, we create a new data set comprising two novel-to-film/series adaptations:

• GOT: Season 1 of the TV series Game of Thrones corresponds to the first book of the A Song of Ice and Fire series, titled A Game of Thrones. Fig. 1 shows our annotated ground truth alignment for this data.

• HP: The first film and book of the Harry Potter series – Harry Potter and the Sorcerer's Stone. A figure similar to Fig. 1 is included in the supplementary material.

Our choice of data is motivated by multiple reasons. Firstly, we consider both TV series and film adaptations, which typically have different filming styles owing to the structure imposed by episodes. There is a large disparity in the sizes of the books (the GOT source novel is almost 4 times as big as HP), and this is also reflected in the total runtime of the respective videos (9h for GOT vs. 2h30m for HP).

Secondly, the stories are targeted towards different audiences and thus their language and content differ vastly. The first book of the Harry Potter series caters to children, while the same cannot be said about A Game of Thrones.

Finally, while both stories are from the fantasy and drama genre, they have their own unique worlds and different writing styles. GOT presents multiple sub-stories that take place in different locations in the world at the same time. This allows for more freedom in the adaptation of the sub-stories, creating a complex alignment pattern. On the other hand, HP is very character centric, with a large chunk of the story revolving around the main character (Harry). The story and the adaptation are thus relatively sequential.

Some statistics of the data set can be found in Sec. 5.

3.2. Scene detection

We consider shots as the basic unit of video processing and they are used to annotate the ground truth novel-to-film alignment. Due to the large number of shots that films contain (typically more than 2000), we further group shots together and use video scenes as the basis for the alignment.

We perform scene detection using the dynamic programming method proposed in [26]. The method optimizes the placement of scene boundaries such that shots within a scene are most similar in color and shot threads are not split.

We sacrifice granularity and induce minor errors owing to wrong scene boundary detection. However, the usage of scenes reduces the complexity of the alignment algorithm, and facilitates stronger cues obtained by averaging over multiple shots in the scene.

3.3. Film dialog parsing

We extract character dialogs from the video using subtitles which are included in the DVD. We convert subtitles into dialogs based on a simple, yet widely followed set of rules. For example, a two-line subtitle whose lines start with "–" indicates two different speakers. Similarly, we also group subtitles appearing consecutively (no time spacing between the subtitles) until the sentence is completed. These subtitle-dialogs are one video cue to perform the alignment.
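To make these rules concrete, the following is a minimal Python sketch of the subtitle grouping step. It is not the authors' implementation; the `Subtitle` structure and the exact gap threshold are assumptions based on the description above.

```python
from dataclasses import dataclass

@dataclass
class Subtitle:
    start: float  # seconds
    end: float    # seconds
    text: str     # possibly multi-line

def subtitles_to_dialogs(subs, max_gap=0.1):
    """Group subtitle entries into dialog strings using the rules above."""
    dialogs, current, prev_end = [], "", None
    for sub in subs:
        gap = 0.0 if prev_end is None else sub.start - prev_end
        prev_end = sub.end
        # a time gap or a completed sentence closes the current dialog
        if current and (gap > max_gap or current.rstrip().endswith((".", "!", "?"))):
            dialogs.append(current.strip())
            current = ""
        for line in sub.text.splitlines():
            line = line.strip()
            if not line:
                continue
            # a leading "-" marks a new speaker within the same subtitle
            if line.startswith("-") and current:
                dialogs.append(current.strip())
                current = ""
            current += " " + line.lstrip("- ")
    if current.strip():
        dialogs.append(current.strip())
    return dialogs
```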

3.4. Novel dialog parsing

We follow a hierarchical method for processing the entire text of the novel. We first divide the book into chapters and each chapter into paragraphs. For the alignment we restrict ourselves to the level of chapters.

Paragraphs are essentially of two types: (i) with dialogor (ii) with only narration.

Dialogs in paragraphs are indicated by the presence of quotation marks “ and ”. Each sentence in the dialog is treated separately. These text-based dialogs are used as an alignment cue along with the video dialogs.

The narrative paragraphs usually set the scene, describe the characters, and provide back stories. They are our major source of attribute labels (see Sec. 6). We process the entire book with part-of-speech tagging using the Stanford CoreNLP [2] toolbox. This provides us with the necessary adjective tags for extracting descriptions.
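The sketch below illustrates the split into quoted dialogs and narration, and the extraction of adjective tags. It uses NLTK as a stand-in tagger purely for illustration (the paper itself uses Stanford CoreNLP), and the quote-matching regular expression is an assumption.

```python
import re
import nltk  # stand-in POS tagger for this sketch; the paper uses Stanford CoreNLP

# one-time downloads (uncomment on first use):
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

QUOTE_RE = re.compile(r'[“"](.+?)[”"]', re.DOTALL)

def parse_paragraph(paragraph):
    """Split a paragraph into quoted dialog sentences and remaining narration."""
    dialogs = []
    for quoted in QUOTE_RE.findall(paragraph):
        dialogs.extend(nltk.sent_tokenize(quoted))  # each sentence treated separately
    narration = QUOTE_RE.sub(" ", paragraph)
    return dialogs, narration

def adjectives(narration):
    """Return adjective tokens (JJ* tags) from narrative text."""
    tokens = nltk.word_tokenize(narration)
    return [word for word, tag in nltk.pos_tag(tokens) if tag.startswith("JJ")]
```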

4. Aligning book chapters to video scenes

We propose a graph-based alignment approach to match chapters of a book to scenes of a video. Fig. 2 presents an overview of the cues used to perform the alignment.


Figure 2. Alignment cues: We present a small section of the GOT novel, chapter 2 (left) and compare it against the first episode of the adaptation (subtitles shown on the right). Names in the novel are highlighted with colors for which we expect to find corresponding face tracks in the video. We also demonstrate the dialog matching process of finding the longest common subsequence of words. The number of words and the score (inverted term frequency) are displayed for one such matching dialog. This figure is best viewed in color.

For a specific novel-to-film adaptation, let N_S be the number of scenes and N_C the number of chapters. Our goal is to find a chapter c*_s which corresponds to each scene s:

$$c^*_s = \arg\max_{c \in \{\emptyset, 1, \ldots, N_C\}} \phi(c, s), \qquad (1)$$

subject to certain story progression constraints (discussed later). All scenes which are not part of the original novel are assigned to the ∅ (null) chapter. φ(c, s) captures the similarity between chapter c and scene s.

4.1. Story characters

Characters play a very important role in any story. Similar to the alignment of videos to plot synopses [25], we propose to use name mentions in the text and face tracks in the video to find occurrences of characters.

Characters in videos. To obtain a list of names which appear in each scene, we first perform face detection and tracking. Following [27], we align subtitles and transcripts and assign names to "speaking faces". We consider all named characters as part of our cast list. As a feature, we use the recently proposed VF2 face track descriptor [18] (a Fisher vector encoding of dense SIFT features with video pooling). Using the weakly labeled data obtained from the transcripts, we train one-vs-all linear Support Vector Machine (SVM) classifiers for each character and perform multi-class classification on all tracks. Tracks which have a low confidence towards all character models (negative SVM scores) are classified as "unknown".

Sec. 5 presents a brief evaluation of face track recognition performance.

Finding name references in the novel. Using the list of characters obtained from the transcripts along with the proper-noun part-of-speech tags, finding name mentions in the book is fairly straightforward. However, complex stories – especially with multiple people from the same family – can contain ambiguities. For example, in Game of Thrones, we see that Eddard Stark is often referred to as Ned, or addressed with the title Lord Stark. However, as titles can be passed on, his son Robb is also referred to as Lord Stark in the same book. We weight the name mentions based on their types (ordered from highest to lowest): (i) full name; (ii) only first name; (iii) alias or titles; (iv) only last name. The actual search for names is performed by simple text matching of complete words.

Matching. For every scene and chapter, we count the number of occurrences of each character. We stack them as "histograms" of dimension N_P (number of people). For scenes, we count the number of face tracks Q_S ∈ R^(N_S × N_P), and for chapters we obtain weighted name mentions Q_C ∈ R^(N_C × N_P). We normalize the occurrence matrices such that each row sums to 1, and then compute the Euclidean distance between them. Finally, we normalize the distance and obtain our identity-based similarity score as

$$\phi_I(c, s) = \sqrt{2} - \left\lVert q_S(s) - q_C(c) \right\rVert_2, \qquad (2)$$

where q_S(s) stands for row s of matrix Q_S. We use this identity-based similarity measure φ_I ∈ R^(N_C × N_S) between every chapter c and scene s to perform the alignment.
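A compact numpy sketch of Eq. 2 is given below, assuming the two count matrices have already been built; the final normalization of the similarity matrix mentioned above is omitted for brevity.

```python
import numpy as np

def identity_similarity(Q_S, Q_C):
    """Identity cue phi_I (Eq. 2).

    Q_S: (N_S, N_P) face-track counts per scene.
    Q_C: (N_C, N_P) weighted name-mention counts per chapter.
    Returns phi_I with shape (N_C, N_S); higher means more similar.
    """
    # row-normalize the occurrence matrices (guard against empty rows)
    qS = Q_S / np.maximum(Q_S.sum(axis=1, keepdims=True), 1e-8)
    qC = Q_C / np.maximum(Q_C.sum(axis=1, keepdims=True), 1e-8)
    # pairwise Euclidean distances between chapter and scene histograms
    dist = np.linalg.norm(qC[:, None, :] - qS[None, :, :], axis=2)
    return np.sqrt(2.0) - dist
```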

4.2. Dialog matching

Finding an identical dialog in the novel and video adaptation is a strong hint towards the alignment. While matching dialogs between the novel and film sounds easy, often the adaptation changes the presentation style so that very few dialogs are reproduced verbatim. For example, in GOT, we have 12,992 dialogs in the novel and 6,842 in the video.


Figure 3. Illustration of the graph used for alignment. The background gradient indicates that we are likely to reach nodes in the lighter areas while nodes in the darker areas are harder to access.

Of these, only 308 pairs are a perfect match with 5 words or more, while our following method finds 1,358 pairs with high confidence.

We denote the set of dialogs in the novel and video by D_N and D_V respectively. Let W_N and W_V be the corresponding total number of words. We compute the term frequency [15] for each word w in the video dialogs as

$$\psi_V(w) = \#w / W_V, \qquad (3)$$

where #w is the number of occurrences of the word. A similar term frequency ψ_N(w) is computed for the novel.

To quantify the similarity between a pair of dialogs, n ∈ D_N from the novel and v ∈ D_V from the video, we find the longest common subsequence between them (see Fig. 2) and extract the list of matching words S_{n,v}. However, stop words such as of, the, for can produce false spikes in the raw count |S_{n,v}|. We counter this by incorporating inverted term frequencies and compute a similarity score between each dialog pair (n, v) as

$$\phi_D(n, v) = \sum_{w \in S_{n,v}} -\frac{\log \psi_N(w) + \log \psi_V(w)}{2}. \qquad (4)$$

For the alignment, we trace the dialogs to their respective book chapters and video scenes, and accumulate them:

$$\phi_D(c, s) = \sum_{n \in c_D} \sum_{v \in s_D} \phi_D(n, v), \qquad (5)$$

where c_D and s_D are the sets of dialogs in the chapter and scene respectively. We finally normalize the obtained dialog-based similarity matrix φ_D ∈ R^(N_C × N_S).
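A minimal sketch of the dialog cue (Eqs. 3–4) follows: term frequencies, a word-level longest common subsequence, and the inverse-term-frequency weighted score. The whitespace tokenization and the fallback for unseen words are simplifying assumptions, not the authors' exact implementation.

```python
import math
from collections import Counter

def term_freq(dialogs):
    """psi(w) = #w / W over a list of dialog strings (Eq. 3)."""
    words = [w for d in dialogs for w in d.lower().split()]
    counts = Counter(words)
    return {w: c / len(words) for w, c in counts.items()}

def lcs_words(a, b):
    """Longest common subsequence of words between two dialogs (textbook DP)."""
    a, b = a.lower().split(), b.lower().split()
    dp = [[[] for _ in range(len(b) + 1)] for _ in range(len(a) + 1)]
    for i, wa in enumerate(a):
        for j, wb in enumerate(b):
            if wa == wb:
                dp[i + 1][j + 1] = dp[i][j] + [wa]
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j], key=len)
    return dp[-1][-1]

def dialog_similarity(n, v, psi_N, psi_V):
    """phi_D(n, v): LCS score down-weighted by term frequency (Eq. 4)."""
    return sum(-(math.log(psi_N.get(w, 1.0)) + math.log(psi_V.get(w, 1.0))) / 2.0
               for w in lcs_words(n, v))
```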

4.3. Alignment as shortest path through a graph

We model the problem of aligning the chapters of a book to scenes of a video (Eq. 1) as finding the shortest path in a sparse directed acyclic graph (DAG). The edge distances of the graph are devised in a way that the shortest path through the graph corresponds to the best matching alignment. As we will see, this model allows us to capture all the non-sequential behavior of the alignment while easily incorporating information from the cues and providing an efficient solution. Fig. 3 illustrates the nodes of the graph along with a glimpse of how the edges are formed.

The main graph (with no frills) consists of N_C · N_S nodes ordered as a regular matrix (green nodes in Fig. 3), where rows represent chapters (indexed by c) and columns represent scenes (indexed by s). We create an edge from every node in column s to all nodes in column s + 1, resulting in a total of N_C² · (N_S − 1) edges. These edge transitions allow us to assign each scene to any chapter, and the overall alignment is simply the list of nodes visited on the shortest path.

Prior. Every story has a natural forward progression which an adaptation (usually) follows, i.e. early chapters of a novel are more likely to be presented earlier in the video while chapters towards the end come later. We initialize the edges of our graph with the following distances.

The local distance from a node (c, s) to any node in the next column (c′, s + 1) is modeled as a quadratic distance and deals with transitions, encouraging neighboring scenes to be assigned to the same or nearby chapters:

$$d_{(c,s) \to (c', s+1)} = \alpha + \frac{|c' - c|^2}{2 \cdot N_C}, \qquad (6)$$

where d_{n→n′} denotes the edge distance from node n to n′. α serves as a non-zero distance offset from which we will later subtract the similarity scores.

Another influence of the prior is a global likelihood of being at any node in the graph. To incorporate this, we multiply all incoming edges of node (c, s) by a Gaussian factor

$$g(c, s) = 2 - \exp\left( -\frac{(s - \mu_s)^2}{2 \cdot N_S^2} \right). \qquad (7)$$

Here µ_s = ⌈c · N_S / N_C⌉ and the 2 − exp(·) ensures the multiplying factor g ≥ 1. This results in larger distances to go towards nodes in the top-right or bottom-left corners.

Overall, the distance to come to any node (c, s) is influenced by where the edge originates (d_{n→(c,s)}, Eq. 6) and the location of the node itself (g(c, s), Eq. 7).

Unmapped scenes. Adaptations provide creative freedom and thus every scene need not always stem from a specific chapter of the novel. For example, in GOT about 30% of the shots do not have a corresponding part in the novel (cf. Fig. 1). To tackle this problem, we add an additional layer of N_S nodes to our graph (top layer of blue nodes in Fig. 3). Inclusion of these nodes allows us to maintain the column-wise edge structure while also providing the additional feature of unmapped scenes. The distance to arrive at such a node from any other node is initialized to α + µ_g, where µ_g is the average value of the global influence (Eq. 7).


External source/sink. Initializing the shortest path within the main graph would force the first scene to belong to the first chapter, and the last scene to the last chapter. However, this need not be true. To prevent this, we include two additional source and sink nodes (red nodes in Fig. 3). We create N_C additional edges each to transit from the source to column s = 1 and from column s = N_S to the sink. The distances of these edges are based on the global distance.

Using the alignment cues. The shortest path for the current setup of the graph, without external influence, assigns an equal number of scenes to every chapter. Due to its structure, we refer to this as the diagonal prior. We now use the identity (Eq. 2) and dialog matching (Eq. 4) cues to lower the distance of reaching a high-scoring node. For example, a node which has multiple matching dialogs and the same set of characters is very likely to depict the same story and be part of the correct scene-to-chapter alignment. Thus reducing the incoming distance to this node encourages the shortest path to go through it.

For every node (c, s) with a non-zero similarity score, we subtract the similarity score from all incoming edges, thus encouraging the shortest path to go through this node:

$$d_{n' \to (c,s)} = d_{n' \to (c,s)} - \sum_{\mathcal{M}} w_{\mathcal{M}} \, \phi_{\mathcal{M}}(c, s), \qquad (8)$$

where φ_M(·) is the similarity score of modality M with weight w_M > 0, and n′ = (·, s − 1) ranges over the nodes of the previous scene with an edge to (c, s). The sum of weights across all modalities is constrained as Σ_M w_M = 1. In our case, the modalities comprise dialog matching and character identities; however, such an approach allows the model to be easily extended with more cues.

When the similarity score between scene s and all chapters is low, it indicates that the scene might not be a part of the book. We thus modify the incoming distance to the node (∅, s) as follows:

$$d_{n' \to (\emptyset, s)} = d_{n' \to (\emptyset, s)} - \sum_{\mathcal{M}} w_{\mathcal{M}} \max\left( 0,\; 1 - \sum_{c} \phi_{\mathcal{M}}(c, s) \right). \qquad (9)$$

This encourages the shortest path to assign scene s to the null chapter class c*_s = ∅.

Implementation details. We use Dijkstra's algorithm to solve the shortest path problem. For GOT, our graph has about 27k nodes and 2M edges. Finding the shortest path is very efficient and takes less than 1 second. Our initialization distance parameter α is set to 1. The weights w_M are not very crucial and result in good alignment performance for a large range of values.
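The following is a deliberately simplified end-to-end sketch of the column-wise DAG and its shortest-path solution using scipy. It keeps the structure described above (one column per scene, one row per chapter plus a null row, an external source and sink, and cue scores subtracted from incoming edges), but folds the priors into a single quadratic term and a constant null-chapter score rather than reproducing Eqs. 6–9 exactly.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra

def align_scenes(phi, alpha=1.0, null_score=0.3):
    """Shortest-path alignment over a column-wise DAG (simplified sketch).

    phi : (N_C, N_S) combined similarity between chapters and scenes,
          e.g. w_id * phi_I + w_dlg * phi_D, roughly scaled to [0, 1].
    Returns one label per scene: a chapter index in [0, N_C), or N_C for
    the null ("not in the book") chapter.
    """
    n_c, n_s = phi.shape
    rows = n_c + 1                                  # chapter rows + null row
    node = lambda c, s: s * rows + c                # node index inside the DAG
    source, sink = n_s * rows, n_s * rows + 1

    # cost of arriving at node (c, s): offset minus its similarity, kept >= 0
    score = np.vstack([phi, np.full((1, n_s), null_score)])
    arrive = np.maximum(alpha - score, 0.0)

    src, dst, wts = [], [], []
    for s in range(n_s):
        preds = [(source, None)] if s == 0 else \
                [(node(c1, s - 1), c1) for c1 in range(rows)]
        for c2 in range(rows):
            for p, c1 in preds:
                # quadratic local prior between real chapters (Eq. 6 style)
                local = 0.0
                if c1 is not None and c1 < n_c and c2 < n_c:
                    local = (c2 - c1) ** 2 / (2.0 * n_c)
                src.append(p); dst.append(node(c2, s))
                wts.append(arrive[c2, s] + local + 1e-6)
    for c in range(rows):                           # last column to the sink
        src.append(node(c, n_s - 1)); dst.append(sink); wts.append(1e-6)

    n_nodes = n_s * rows + 2
    graph = csr_matrix((wts, (src, dst)), shape=(n_nodes, n_nodes))
    _, pred = dijkstra(graph, indices=source, return_predecessors=True)

    labels, cur = [], pred[sink]
    while cur != source:
        labels.append(int(cur) % rows)              # chapter row of the visited node
        cur = pred[cur]
    return labels[::-1]
```

With the actual edge distances of Eqs. 6–9 plugged in instead of the simplified ones, the same Dijkstra call yields the alignment described above.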

5. Evaluation

We now present an evaluation of our proposed alignment approach on the data set [1] described in Sec. 3.1.

VIDEO     duration    #scenes     #shots (#nobook)
GOT       8h 58m      369         9286 (2708)
HP        2h 32m      138         2548 (56)

BOOK      #chapters   #words      #adj, #verb
GOT       73          293k        17k, 59k
HP        17          78k         4k, 17k

Face-ID   #characters  #tracks (unknown)   id accuracy
GOT       95           11094 (2174)        67.6
HP        46           3777 (843)          72.3

Table 1. An overview of our data set. The table is divided into three sections related to information about the video, the book and the face identification scheme.

Data set statistics. Table 1 presents some statistics of the two novel-to-video adaptations. The GOT video adaptation and book are roughly four times larger than those of HP. A major difference between the two adaptations is the fraction of shots in the video which are not part of a book chapter. For GOT, the fraction of shots not in the book (#nobook) is much higher at 29.2%, compared to only 2.2% for HP. Both novels are a large resource of adjectives (#adj) and verbs (#verb).

Face ID performance. Table 1 (Face-ID) shows the overall face identification performance in terms of track-level accuracy (fraction of correctly labeled face tracks).

The face tracks involve a large number of people and many unknown characters, and our track accuracy of around 70% is on par with state-of-the-art performance on complex video data [18]. Additionally, GOT and HP are a new addition to the existing data sets on person identification and we will make the tracks and labels publicly available [1].

5.1. Ground truth and evaluation criterion

The ground truth alignment between book chapters and videos is performed at the shot level. This provides a fine-grained alignment independent of the specific scene detection algorithm. The scene detection only helps simplify the alignment problem by reducing the complexity of the graph.

We consider two metrics. First, we measure the alignment accuracy (acc) as the fraction of shots that are assigned to the correct chapter. This includes shot assignments to ∅. We also emphasize finding shots that are not part of the book (particularly for GOT). Finding these shots is treated as a detection problem, and we use precision (nb-pr) and recall (nb-rc) measures.
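For reference, here is a small sketch of how these three numbers can be computed from per-shot labels; the use of a single sentinel label for "not in the book" shots is an assumption.

```python
import numpy as np

def evaluate(pred, gt, null_label=-1):
    """Alignment accuracy plus no-book precision/recall at the shot level."""
    pred, gt = np.asarray(pred), np.asarray(gt)
    acc = float(np.mean(pred == gt))                 # includes assignments to null
    pred_nb, gt_nb = pred == null_label, gt == null_label
    tp = float(np.sum(pred_nb & gt_nb))              # correctly detected no-book shots
    nb_pr = tp / max(pred_nb.sum(), 1)
    nb_rc = tp / max(gt_nb.sum(), 1)
    return acc, nb_pr, nb_rc
```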

5.2. Book-to-video alignment performance

We present the alignment performance of our approach in Table 2, inspect the importance of the dialog and identity cues, and compare the method against multiple baselines.

As discussed in Sec. 3.2, for the purpose of alignment, we perform scene detection and group shots into scenes.


[Figure 4: alignment visualizations across episodes E01–E10 of Game of Thrones (Season 1) for ground truth, prior, DTW3, and prior + ids + dlgs.]

Figure 4. This figure is best viewed in color. We visualize the alignment as obtained from various methods. The ground truth alignment (row 1) is the same as presented in Fig. 1. Chapters are indicated by colors, and empty areas (white spaces) indicate that those shots are not part of the novel. We present the alignment performance of three methods: prior (row 2), DTW3 (row 3) and our approach prior + ids + dlgs (row 4). The color of the first sub-row for each method indicates the chapter to which every scene is assigned. Comparing vertically against the ground truth we can determine the accuracy of the alignment. For simplicity, the second sub-row of each method indicates whether the alignment is correct (green) or wrong (red).

method                GOT                     HP
                      acc    nb-pr   nb-rc    acc    nb-pr   nb-rc
scenes upper          95.1   97.9    86.4     96.7   40.0    7.1
prior                 12.4   –       –        19.0   –       –
prior + ids           55.3   52.8    48.7     80.4   0.0     0.0
prior + dlgs          73.1   55.8    74.2     86.2   20.0    3.6
ids + dlgs            66.5   71.7    20.9     77.4   0.0     0.0
prior + ids + dlgs    75.7   70.5    53.4     89.9   0.0     0.0
MAX [25]              54.9   –       –        73.3   –       –
MAX [25] + ∅          60.7   68.0    37.7     73.0   0.0     0.0
DTW3 [25]             44.7   –       –        94.8   –       –

Table 2. Alignment performance in terms of overall accuracy of shots being assigned book chapters (acc). The no-book precision (nb-pr) and no-book recall (nb-rc) are for shots which are not part of the book. See Sec. 5.2 for a detailed discussion.

However, errors in the scene boundary detection can lead to small alignment errors measured at the shot level. The method scenes upper is the best possible alignment performance given the scene boundaries, and we observe that we do not lose much in terms of overall accuracy.

Shortest-path based. As a baseline (prior), we evaluate the alignment obtained from our graph after initializing edge distances with the local and global priors. The alignment is an equal distribution of scenes among chapters. For both GOT and HP we observe poor performance, suggesting the difficulty of the task. The prior distances modified with the dialog matching cue (prior + dlgs) outperform the prior with character identities (prior + ids) quite significantly. An explanation for this is (i) inaccurate face track recognition; and (ii) similar names appearing in the text (e.g. Jon Arryn, Jon Umber and Jon Snow are three characters in GOT). However, the fusion of the two cues (prior + ids + dlgs) performs best. Note that we are quite successful in finding shots that are not part of the book. If we disable the global prior which provides structure to the distances (ids + dlgs), we obtain comparable performance, indicating that the cues are responsible for the alignment accuracy. In general, our ability to predict whether a shot belongs to the novel or not is good for GOT. However, for HP, as the no-book shots are very few (2.2%), they are difficult to discern.

The accuracy is robust against a large range of weights. For example, for GOT, sweeping w_id ∈ [0.01, 0.5] results in accuracy ranging from 71.5% to 69.9%, peaking in between.

Baselines. The MAX baseline [25] uses both cues, but treats scenes independently of one another. The method selects the highest scoring chapter for each scene in the similarity matrix φ(c, s), and performs worse than our method. When augmented with the null class (MAX + ∅) by scoring it similar to Eq. 9, we see an improvement for GOT, while HP performs slightly worse. Note that our proposed method prior + ids + dlgs is far better than MAX.

DTW3 [25] (similar to DP [23]) forces a scene s to be assigned to the same or subsequent chapter as scene s − 1. This restriction allows the method to perform well in the case of simple structures (HP), but fails completely in the case of complex alignments (GOT).

Qualitative analysis. Fig. 4 presents the alignment obtained by different methods on GOT. Notice how our method is quite successful at predicting scenes which are not part of the book (the white spaces in the first sub-row). Both prior and DTW3 have large chunks of erroneous scenes (second sub-row in red) due to long-range influences of the alignment. A drawback of our method is that it needs sufficient evidence to assign a scene to a chapter. This is especially seen in HP (analysis in the supplementary material).

6. Mining rich descriptions

A novel portrays the visual world through rich descriptions.


(a) (Ch11, P47, E02 12m23s): Arya was in her room, packing a polished ironwood chest that was bigger than she was. Nymeria was helping. Arya would only have to point, and the wolf would bound across the room, snatch up some wisp of silk in her jaws, and fetch it back.

(b) (Ch60, P66, E09 7m55s) Lord Walder was ninety, a wizened pink weasel with a bald spotted head, too gouty to stand unassisted. His newest wife, a pale frail girl of sixteen years, walked beside his litter when they carried him in. She was the eighth Lady Frey.

(c) (Ch12, P33, E01 51m21s) One egg was a deep green, with burnished bronze flecks that came and went depending on how Dany turned it. Another was pale cream streaked with gold. The last was black, as black as a midnight sea, yet alive with scarlet ripples and swirls.

(d) (Ch4, P23, M 14m23s) From an inside pocket of his black overcoat he pulled a slightly squashed box. Harry opened it with trembling fingers. Inside was a large, sticky chocolate cake with Happy Birthday Harry written on it in green icing.

(e) (Ch9, P194, M 1h02m45s) They were looking straight into the eyes of a monstrous dog, a dog that filled the whole space between ceiling and floor. It had three heads. Three pairs of rolling, mad eyes; three noses, twitching and quivering in their direction.

(f) (Ch10, P78-79, M 1h06m59s) Hermione rolled up the sleeves of her gown, flicked her wand, and said, "Wingardium Leviosa!" Their feather rose off the desk and hovered about four feet above their heads.

Figure 5. Examples of mined attributes ranging from events, scenes, and person and object descriptions. The first row corresponds to GOT, while the second shows examples from HP. The location of the description in the novel is abbreviated as chapter number (Ch) and paragraph number (P), and the location in the video as episode EXX or movie M in minutes and seconds. The caption text is verbatim from the novel. The various attributes are highlighted in the caption below every frame. This figure is best viewed on screen.

While there are many levels of such a presentation, we highlight and focus on a few key ones: (i) narrative text describing the scene or location; (ii) detailed character descriptions including face-related features, clothing, and also portrayal of their mental state; (iii) character interactions, their emotions and the usage of various communication verbs to express a sentiment.

As an application of the scene-to-chapter alignment, we present a sample of attributes that are easily obtained. Fig. 5 presents qualitative examples from our data set. The figure captions are verbatim paragraphs from the novel. For simplicity, we highlight the relevant attributes of the text which correspond to the part of the image. Many more such examples can be found in the supplementary material.

We extract these examples pseudo-automatically from a given scene-to-chapter alignment. Small regions (±3 paragraphs, ±5 shots) surrounding matched dialogs are used to find appropriate examples. Note that the video clip, i.e. the region of shots around the matching dialog, is part of the same story; however, we select one good frame for presentation purposes. Although this currently requires some manual intervention, generating pairs of images and rich descriptions is far easier than going through the entire book. Automating this task is an interesting area for future work.

7. Conclusion

With a rising interest in transforming novels to films or TV series, we present for the first time an automatic way to analyze novel-to-film adaptations. We create and annotate a data set involving one film and one season of a TV series. Our goal is to associate video scenes with book chapters; this is achieved by modeling the task as a shortest-path graph problem. Such a model allows for a non-sequential alignment and is able to handle scenes that do not correspond to parts of the book. We use information cues such as matching dialogs and character identities to bridge the gap between the text and video domains. A comparison against state-of-the-art plot synopsis alignment algorithms shows that we perform much better, especially in the case of complicated alignment structures. As an application of the alignment between book chapters and video scenes, we also present qualitative examples of extracting rich scene, character, and object descriptions from the novel.

Acknowledgments. This work was supported by the Deutsche Forschungsgemeinschaft (DFG – German Research Foundation) under contract no. STI-598/2-1. The views expressed herein are the authors' responsibility and do not necessarily reflect those of DFG.


References

[1] Book2Movie data download containing novel-to-film ground truth alignments, and face tracks + labels. http://cvhci.anthropomatik.kit.edu/projects/mma.
[2] Stanford CoreNLP. http://nlp.stanford.edu/software/. Retrieved 2014-11-14.
[3] 'Deathly Hallows' film breathes life into Harry Potter book sales. http://www.nielsen.com, Nov. 2010. Retrieved 2014-11-14.
[4] Where do highest-grossing screenplays come from? http://stephenfollows.com, Jan. 2014. Retrieved 2014-11-14.
[5] P. Bojanowski, F. Bach, I. Laptev, J. Ponce, C. Schmid, and J. Sivic. Finding Actors and Actions in Movies. In ICCV, 2013.
[6] T. Cour, B. Sapp, C. Jordan, and B. Taskar. Learning from ambiguously labeled images. In CVPR, 2009.
[7] P. Das, C. Xu, R. F. Doell, and J. J. Corso. A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching. In CVPR, 2013.
[8] M. Everingham, J. Sivic, and A. Zisserman. "Hello! My name is... Buffy" – Automatic Naming of Characters in TV Video. In BMVC, 2006.
[9] R. Giddings, K. Selby, and C. Wensley. Screening the novel: The theory and practice of literary dramatization. Macmillan, 1990.
[10] A. Karpathy and L. Fei-Fei. Deep Visual-Semantic Alignments for Generating Image Descriptions. In CVPR, 2015.
[11] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models. Transactions of the Association for Computational Linguistics, 2015.
[12] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning Realistic Human Actions from Movies. In CVPR, 2008.
[13] C. Liang, C. Xu, J. Cheng, and H. Lu. TVParser: An Automatic TV Video Parsing Method. In CVPR, 2011.
[14] D. Lin, S. Fidler, C. Kong, and R. Urtasun. Visual Semantic Search: Retrieving Videos via Complex Textual Queries. In CVPR, 2014.
[15] C. D. Manning, P. Raghavan, and H. Schütze. Scoring, term weighting, and the vector space model. In Introduction to Information Retrieval. Cambridge University Press, 2008.
[16] J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille. Explain Images with Multimodal Recurrent Neural Networks. In NIPS Deep Learning Workshop, 2014.
[17] B. McFarlane. Novel to Film: An Introduction to the Theory of Adaptation. Clarendon Press, Oxford, 1996.
[18] O. M. Parkhi, K. Simonyan, A. Vedaldi, and A. Zisserman. A Compact and Discriminative Face Track Descriptor. In CVPR, 2014.
[19] V. Ramanathan, A. Joulin, P. Liang, and L. Fei-Fei. Linking people in videos with "their" names using coreference resolution. In ECCV, 2014.
[20] A. Rohrbach, M. Rohrbach, N. Tandon, and B. Schiele. A Dataset for Movie Description. In CVPR, 2015.
[21] M. Rohrbach, W. Qiu, I. Titov, S. Thater, M. Pinkal, and B. Schiele. Translating Video Content to Natural Language Descriptions. In ICCV, 2013.
[22] A. Roy, C. Guinaudeau, H. Bredin, and C. Barras. TVD: A Reproducible and Multiply Aligned TV Series Dataset. In Language Resources and Evaluation Conference, 2014.
[23] P. Sankar, C. V. Jawahar, and A. Zisserman. Subtitle-free Movie to Script Alignment. In BMVC, 2009.
[24] J. Sivic, M. Everingham, and A. Zisserman. "Who are you?" – Learning person specific classifiers from video. In CVPR, 2009.
[25] M. Tapaswi, M. Bäuml, and R. Stiefelhagen. Story-based Video Retrieval in TV series using Plot Synopses. In ACM Intl. Conf. on Multimedia Retrieval (ICMR), 2014.
[26] M. Tapaswi, M. Bäuml, and R. Stiefelhagen. StoryGraphs: Visualizing Character Interactions as a Timeline. In CVPR, 2014.
[27] M. Tapaswi, M. Bäuml, and R. Stiefelhagen. Improved Weak Labels using Contextual Cues for Person Identification in Videos. In IEEE Intl. Conf. on Automatic Face and Gesture Recognition (FG), 2015.
[28] T. Tsoneva, M. Barbieri, and H. Weda. Automated summarization of narrative video on a semantic level. In Intl. Conf. on Semantic Computing, 2007.
[29] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko. Translating Videos to Natural Language Using Deep Recurrent Neural Networks. In North American Chapter of the Association for Computational Linguistics – Human Language Technologies (NAACL-HLT), 2015.
[30] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and Tell: A Neural Image Caption Generator. In CVPR, 2015.
[31] G. Wagner. The novel and the cinema. Fairleigh Dickinson University Press, 1975.