Book2Movie: Aligning Video Scenes with Book Chapters

Makarand Tapaswi, Martin Bäuml, Rainer Stiefelhagen
Karlsruhe Institute of Technology, 76131 Karlsruhe, Germany
{makarand.tapaswi, baeuml, rainer.stiefelhagen}@kit.edu
Abstract
Film adaptations of novels often visually display in a few shots what is described in many pages of the source novel. In this paper we present a new problem: to align book chapters with video scenes. Such an alignment facilitates finding differences between the adaptation and the original source, and also acts as a basis for deriving rich descriptions from the novel for the video clips.

We propose an efficient method to compute an alignment between book chapters and video scenes using matching dialogs and character identities as cues. A major consideration is to allow the alignment to be non-sequential. Our suggested shortest-path-based approach deals with non-sequential alignments and can be used to determine whether a video scene was part of the original book. We create a new data set involving two popular novel-to-film adaptations with widely varying properties and compare our method against other text-to-video alignment baselines. Using the alignment, we present a qualitative analysis of describing the video through rich narratives obtained from the novel.
1. Introduction

TV series and films are often adapted from novels [4]. In the fantasy genre, popular examples include the Lord of the Rings trilogy, the TV series Game of Thrones based on the novel series A Song of Ice and Fire, the Chronicles of Narnia and the Harry Potter series. There are many other examples from various genres: the Bourne series (action), The Hunger Games (adventure), the House of Cards TV series (drama), and a number of superhero movies (e.g., Batman, Spider-Man), many of which are based on comic books.
We believe that such adaptations are a large untapped resource that can be used to simultaneously improve and understand story semantics for both video and natural language. In recent years there has been rising interest in automatically generating meaningful captions for images [10, 11, 16, 30] and even user-generated videos [7, 21, 29]. To help readers visualize the story universe, novels often provide rich textual descriptions in the form of attributes or actions of the content that is depicted visually in the video (e.g., especially for characters and their surroundings). Due to the large number of such adaptations of descriptive text into video, such data can prove to be an excellent source for learning joint textual and visual models.
Determining how a specific part of the novel is adapted to the video domain is a very interesting problem in itself [9, 17, 31]. A classical example of this is to pinpoint the differences between books and their movie adaptations¹. Differences exist at various levels of detail and can range from appearance, presence or absence of characters, all the way up to modification of the sub-stories. As a start, we address the problem of finding differences at the scene level, more specifically how chapters and scenes are ordered, and which scenes are not backed by a chapter from the book.
A more fine-grained linkage between novels and films can have direct commercial applications. Linking the two encourages growth in the consumption of both literary and audio-visual material [3].
In this work, we target the problem of aligning video scenes to specific chapters of the novel (Sec. 4). Fig. 1 presents an example of the ground truth alignment between book chapters and parts of the video, along with a discussion of some of the challenges in the alignment problem. We emphasize finding scenes that are not part of the novel and perform our evaluation with this goal in mind (Sec. 5).
There are many applications which emerge from the alignment. Sec. 6 discusses how vivid descriptions in a book can be used as a potential source of weak labels to improve video understanding. Similar to [25], an alignment between video and text enables text-based video retrieval, since searching for a part of the story in the video can be translated to searching in the text (novel). In contrast to [25], which used short plot synopses of a few sentences per video, novels provide a much larger pool of descriptions which can potentially match a text-based search. Novels often contain more information than what appears in a short video adaptation. For example, minor characters that appear
¹There are websites which are dedicated to locating such differences, e.g., http://thatwasnotinthebook.com/
Figure 1. This figure is best viewed in color. We present the ground truth alignment of chapters from the first novel of A Song of Ice and Fire to the first season of the TV series Game of Thrones. Book chapters are depicted on the lower axis with tick spacing corresponding to the number of words in the chapter. As the novel follows a point-of-view narrative, each chapter is color coded based on the character. Chapters with a red background (see 8, 18, 42) are not represented in the video adaptation. The ten episodes (E01–E10) of season 1 corresponding to the first novel are plotted on the top axis. Each bar of color represents the location and duration of time in the video which corresponds to the specific chapter from the novel. A line joining the center of the chapter to the bar indicates the alignment. White spaces between the colored bars indicate that those scenes are not aligned to any part of the novel: almost 30% of all shots do not belong to any chapter. Another challenge is that chapters can be split and presented in multiple scenes (e.g., chapter 37). Note the complexity of the adaptation and how the video does not necessarily follow the book chapters sequentially. We will model these intricacies and use an efficient approach to find a good alignment.
in the video are often unnamed, but can be associated with a name by leveraging the novel. Finally, the alignment, when performed on many films and novels together, can be used to study the task of adaptation itself and, for example, to characterize different screenwriters and directors.
2. Related work

In recent years, videos and accompanying text have become an increasingly popular subject of research. We briefly review and discuss related work in the direction of video analysis and understanding using various sources of text.
The automatic analysis of TV series and films has greatly benefited from external sources of textual information. Fan transcripts with subtitles are emerging as the de facto standard in automatic person identification [6, 8, 19, 24] and action recognition [5, 12]. They also see application in video summarization [28] and are used to understand the scene structure of videos [13]. There is also work on aligning videos to transcripts in the absence of subtitles [23].
The last year has seen a rise in joint analysis of text and video data. [22] presents a framework to make TV series data more accessible by providing audio-visual metadata along with three types of textual information: transcripts, episode outlines and video summaries. [25] introduces and investigates the use of plot synopses from Wikipedia to perform video retrieval. The alignment between sentences of the plot synopsis and shots of the video is leveraged to search for story events in TV series. A new source of text in the form of the descriptive video service, used to help visually impaired people watch TV or films, is introduced by [20]. In the domain of autonomous driving videos, [14] parses textual queries as a semantic graph and uses bipartite graph matching to bridge the text-video gap, ultimately performing video retrieval. Very recently, an approach to generate textual descriptions for short video clips through the use of Convolutional and Recurrent Neural Networks was presented in [29]. While the video domains may be different, the improved vision and language models have sparked a clear interest in co-analyzing text and video data.
Adapting a novel to a film is an interesting problem in the performing arts literature. [9, 17, 31] present various guidelines on how a novel can be adapted to the screen or theater². Our proposed alignment is a first step towards automating the analysis of such adaptations at a large scale.
Previous works in video analysis almost exclusively deal with textual data which is derived from the video. This often causes the text to be rather sparse, owing to limited human involvement in its production. One key difference in analyzing novels – as done in this work – is that the video is derived from the text (or, both are derived from the core
²http://www.kvenno.is/englishmovies/adaptation.htm is a nice summary of the differences and freedom that different media may use while essentially telling the same story.
storyline). This usually means that the available textual description is much more complete, which has both advantages (more details) and disadvantages (not every part of the text needs to be present in the video). We think that this ultimately provides a powerful source for improving simultaneous semantic understanding of both text and video.
Closest to our alignment approach are two works which use dynamic programming (DP) to perform text-to-video alignment. Sankar et al. [23] align videos to transcripts without subtitles, and Tapaswi et al. [25] align sentences from plot synopses to shots of the video. However, both models make two strong assumptions: (i) sentences and video shots appear sequentially; (ii) every shot is assigned to a sentence. This is often not the case for novel-to-film adaptations (see Fig. 1). We specifically address the problem of non-sequential alignments in this paper. To the best of our knowledge, this is also the first work to automatically analyze novels and films, to align book chapters with video scenes, and, as a by-product, to derive rich and unrestricted attributes for the visual content.
3. Data set and preprocessing

We first briefly describe our new data set followed by some necessary pre-processing steps.
3.1. A new Book2Movie data set
As this is the first work in this direction, we create a new data set comprising two novel-to-film/series adaptations:

• GOT: Season 1 of the TV series Game of Thrones corresponds to the first book of the A Song of Ice and Fire series, titled A Game of Thrones. Fig. 1 shows our annotated ground truth alignment for this data.

• HP: The first film and book of the Harry Potter series, Harry Potter and the Sorcerer's Stone. A figure similar to Fig. 1 is included in the supplementary material.
Our choice of data is motivated by multiple reasons. Firstly, we consider both TV series and film adaptations, which typically have different filming styles owing to the structure imposed by episodes. There is a large disparity in the sizes of the books (the GOT source novel is almost 4 times as big as HP), and this is also reflected in the total runtime of the respective videos (9h for GOT vs. 2h30m for HP).
Secondly, the stories are targeted towards different audiences and thus their language and content differ vastly. The first book of the Harry Potter series caters to children, while the same cannot be said about A Game of Thrones.
Finally, while both stories are from the fantasy and drama genre, they have their own unique worlds and different writing styles. GOT presents multiple sub-stories that take place in different locations of its world at the same time. This allows for more freedom in the adaptation of the sub-stories, creating a complex alignment pattern. On the other hand, HP is very character-centric, with a large chunk of the story revolving around the main character (Harry). The story and the adaptation are thus relatively sequential.
Some statistics of the data set can be found in Sec. 5.
3.2. Scene detection
We consider shots as the basic unit of video processing, and they are used to annotate the ground truth novel-to-film alignment. Due to the large number of shots that films contain (typically more than 2000), we further group shots together and use video scenes as the basis for the alignment.
We perform scene detection using the dynamic programming method proposed in [26]. The method optimizes the placement of scene boundaries such that shots within a scene are most similar in color and shot threads are not split.
We sacrifice granularity and induce minor errors owing to wrong scene boundary detection. However, the usage of scenes reduces the complexity of the alignment algorithm and facilitates stronger cues obtained by averaging over multiple shots in the scene.
3.3. Film dialog parsing
We extract character dialogs from the video using subtitles which are included in the DVD. We convert subtitles into dialogs based on a simple, yet widely followed set of rules. For example, the lines of a two-line subtitle that start with "–" are spoken by different characters. Similarly, we also group subtitles appearing consecutively (no time spacing between the subtitles) until the sentence is completed. These subtitle-dialogs are one video cue to perform the alignment.
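As an illustration, the grouping rules above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' code; the function name, the subtitle dictionary format, and the end-of-sentence heuristic are assumptions made for illustration.

```python
def subtitles_to_dialogs(subtitles):
    """Group raw subtitle entries into per-character dialogs.

    `subtitles` is a list of dicts with 'start', 'end' (seconds) and
    'lines' (list of text lines). Rules sketched from the text above:
    a line starting with a dash marks a speaker change; consecutive
    subtitles with no time gap are merged until a sentence completes.
    """
    dialogs = []
    current = ""
    prev_end = None
    for sub in subtitles:
        for line in sub["lines"]:
            if line.startswith("-"):
                # a dash signals a new speaker: flush the current dialog
                if current:
                    dialogs.append(current.strip())
                current = line.lstrip("- ").strip()
            else:
                gap = 0.0 if prev_end is None else sub["start"] - prev_end
                # a time gap or a completed sentence also ends a dialog
                if current and (gap > 0.0 or current.endswith((".", "!", "?"))):
                    dialogs.append(current.strip())
                    current = line
                else:
                    current = (current + " " + line).strip()
        prev_end = sub["end"]
    if current:
        dialogs.append(current.strip())
    return dialogs
```

The resulting strings are the subtitle-dialogs that serve as the video-side cue.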
3.4. Novel dialog parsing
We follow a hierarchical method for processing the entire text of the novel. We first divide the book into chapters and each chapter into paragraphs. For the alignment, we restrict ourselves to the level of chapters.
Paragraphs are essentially of two types: (i) with dialogor (ii) with only narration.
Dialogs in paragraphs are indicated by the presence of quotation marks “ and ”. Each sentence in the dialog is treated separately. These text-based dialogs are used as an alignment cue along with the video dialogs.
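A minimal sketch of this quote extraction, assuming plain-text paragraphs; the function name, the regular expression, and the naive sentence splitter are illustrative choices, not the authors' implementation.

```python
import re

def extract_dialogs(paragraph):
    """Return the quoted dialog fragments of a novel paragraph.

    Dialog is delimited by curly quotes as in the text above; this
    sketch also accepts straight quotes. Each quoted span is split
    into sentences, which are treated separately as alignment cues.
    """
    quoted = re.findall(r'[“"](.+?)[”"]', paragraph)
    sentences = []
    for span in quoted:
        # naive split on sentence-terminating punctuation
        parts = re.split(r'(?<=[.!?])\s+', span.strip())
        sentences.extend(p for p in parts if p)
    return sentences
```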
The narrative paragraphs usually set the scene, describe the characters, and provide back stories. They are our major source of attribute labels (see Sec. 6). We process the entire book with part-of-speech tagging using the Stanford CoreNLP [2] toolbox. This provides us with the necessary adjective tags for extracting descriptions.
4. Aligning book chapters to video scenes

We propose a graph-based alignment approach to match chapters of a book to scenes of a video. Fig. 2 presents an overview of the cues used to perform the alignment.

Figure 2. Alignment cues: We present a small section of the GOT novel, chapter 2 (left) and compare it against the first episode of the adaptation (subtitles shown on the right). Names in the novel are highlighted with colors for which we expect to find corresponding face tracks in the video. We also demonstrate the dialog matching process of finding the longest common subsequence of words. The number of words and the score (inverted term frequency) are displayed for one such matching dialog. This figure is best viewed in color.

For a specific novel-to-film adaptation, let $N_S$ be the number of scenes and $N_C$ the number of chapters. Our goal is to find the chapter $c^*_s$ which corresponds to each scene $s$:

$$c^*_s = \arg\max_{c \in \{\emptyset, 1, \ldots, N_C\}} \phi(c, s), \qquad (1)$$

subject to certain story progression constraints (discussed later). All scenes which are not part of the original novel are assigned to the $\emptyset$ (null) chapter. $\phi(c, s)$ captures the similarity between chapter $c$ and scene $s$.
4.1. Story characters
Characters play a very important role in any story. Similar to the alignment of videos to plot synopses [25], we propose to use name mentions in the text and face tracks in the video to find occurrences of characters.
Characters in videos To obtain a list of names which appear in each scene, we first perform face detection and tracking. Following [27], we align subtitles and transcripts and assign names to "speaking faces". We consider all named characters as part of our cast list. As a feature, we use the recently proposed VF2 face track descriptor [18] (a Fisher vector encoding of dense SIFT features with video pooling). Using the weakly labeled data obtained from the transcripts, we train one-vs-all linear Support Vector Machine (SVM) classifiers for each character and perform multi-class classification on all tracks. Tracks which have a low confidence towards all character models (negative SVM scores) are classified as "unknown".
Sec. 5 presents a brief evaluation of face track recogni-tion performance.
Finding name references in the novel Using the list of characters obtained from the transcripts along with the proper noun part-of-speech tags, finding name mentions in the book is fairly straightforward. However, complex stories – especially with multiple people from the same family – can contain ambiguities. For example, in Game of Thrones, we see that Eddard Stark is often referred to as Ned, or addressed with the title Lord Stark. However, as titles can be passed on, his son Robb is also referred to as Lord Stark in the same book. We weight the name mentions based on their type (ordered from highest to lowest): (i) full name; (ii) only first name; (iii) alias or titles; and (iv) only last name. The actual search for names is performed by simple text matching of complete words.
Matching For every scene and chapter, we count the number of occurrences of each character. We stack them as "histograms" of dimension $N_P$ (the number of people). For scenes, we count the number of face tracks, $Q_S \in \mathbb{R}^{N_S \times N_P}$, and for chapters we obtain weighted name mentions, $Q_C \in \mathbb{R}^{N_C \times N_P}$. We normalize the occurrence matrices such that each row sums to 1, and then compute the Euclidean distance between them. Finally, we normalize the distance and obtain our identity-based similarity score as

$$\phi_I(c, s) = \sqrt{2} - \| q_S(s) - q_C(c) \|_2, \qquad (2)$$

where $q_S(s)$ denotes row $s$ of matrix $Q_S$. We use this identity-based similarity measure $\phi_I \in \mathbb{R}^{N_C \times N_S}$ between every chapter $c$ and scene $s$ to perform the alignment.
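The matching step can be sketched as follows for a single scene-chapter pair. This is an illustrative reading of Eq. 2 (the function name and the normalization helper are assumptions): with L1-normalized non-negative histograms the Euclidean distance lies in $[0, \sqrt{2}]$, so the similarity score is non-negative.

```python
import math

def identity_similarity(scene_hist, chapter_hist):
    """Identity cue phi_I between one scene and one chapter (Eq. 2).

    Inputs are per-character occurrence counts: face tracks for the
    scene, weighted name mentions for the chapter. Each histogram is
    normalized to sum to 1 before the Euclidean distance is taken.
    """
    def normalize(h):
        s = sum(h)
        return [x / s for x in h] if s > 0 else h

    q_s, q_c = normalize(scene_hist), normalize(chapter_hist)
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(q_s, q_c)))
    return math.sqrt(2) - dist
```

Identical character distributions score $\sqrt{2}$; fully disjoint casts score 0.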
4.2. Dialog matching
Finding an identical dialog in the novel and the video adaptation is a strong hint towards the alignment. While matching dialogs between the novel and film sounds easy, the adaptation often changes the presentation style so that very few dialogs are reproduced verbatim. For example, in GOT, we have 12,992 dialogs in the novel and 6,842 in the video.
Figure 3. Illustration of the graph used for alignment. The back-ground gradient indicates that we are likely to reach nodes in thelighter areas while nodes in the darker areas are harder to access.
Of these, only 308 pairs are a perfect match with 5 words or more, while our following method finds 1,358 pairs with high confidence.
We denote the sets of dialogs in the novel and video by $D_N$ and $D_V$ respectively. Let $W_N$ and $W_V$ be the corresponding total numbers of words. We compute the term frequency [15] for each word $w$ in the video dialogs as

$$\psi_V(w) = \#w / W_V, \qquad (3)$$

where $\#w$ is the number of occurrences of the word. A similar term frequency $\psi_N(w)$ is computed for the novel.
To quantify the similarity between a pair of dialogs, $n \in D_N$ from the novel and $v \in D_V$ from the video, we find the longest common subsequence between them (see Fig. 2) and extract the list of matching words $S_{n,v}$. However, stop words such as of, the, for can produce false spikes in the raw count $|S_{n,v}|$. We counter this by incorporating inverted term frequencies and compute a similarity score between each dialog pair $(n, v)$ as

$$\phi_D(n, v) = \sum_{w \in S_{n,v}} -\frac{\log \psi_N(w) + \log \psi_V(w)}{2}. \qquad (4)$$

For the alignment, we trace the dialogs to their respective book chapters and video scenes, and accumulate them:

$$\phi_D(c, s) = \sum_{n \in c_D} \sum_{v \in s_D} \phi_D(n, v), \qquad (5)$$

where $c_D$ and $s_D$ are the sets of dialogs in the chapter and scene respectively. We finally normalize the obtained dialog-based similarity matrix $\phi_D \in \mathbb{R}^{N_C \times N_S}$.
4.3. Alignment as shortest path through a graph
We model the problem of aligning the chapters of a book to scenes of a video (Eq. 1) as finding the shortest path in a sparse directed acyclic graph (DAG). The edge distances of the graph are devised in such a way that the shortest path through the graph corresponds to the best matching alignment. As we will see, this model allows us to capture all the non-sequential behavior of the alignment while easily incorporating information from the cues and providing an efficient solution. Fig. 3 illustrates the nodes of the graph along with a glimpse of how the edges are formed.

The main graph (with no frills) consists of $N_C \cdot N_S$ nodes ordered as a regular matrix (green nodes in Fig. 3), where rows represent chapters (indexed by $c$) and columns represent scenes (indexed by $s$). We create an edge from every node in column $s$ to all nodes in column $s+1$, resulting in a total of $N_C^2 \cdot (N_S - 1)$ edges. These edge transitions allow each scene to be assigned to any chapter, and the overall alignment is simply the list of nodes visited on the shortest path.
Prior Every story has a natural forward progression which an adaptation (usually) follows, i.e., early chapters of a novel are more likely to be presented earlier in the video while chapters towards the end come later. We initialize the edges of our graph with the following distances.

The local distance from a node $(c, s)$ to any node in the next column, $(c', s+1)$, is modeled as a quadratic distance and deals with transitions, encouraging neighboring scenes to be assigned to the same or nearby chapters:

$$d_{(c,s) \to (c',s+1)} = \alpha + \frac{|c' - c|^2}{2 N_C}, \qquad (6)$$

where $d_{n \to n'}$ denotes the edge distance from node $n$ to $n'$. $\alpha$ serves as a non-zero distance offset from which we will later subtract the similarity scores.
Another influence of the prior is a global likelihood of being at any node in the graph. To incorporate this, we multiply all incoming edges of node $(c, s)$ by a Gaussian factor

$$g(c, s) = 2 - \exp\left( -\frac{(s - \mu_s)^2}{2 N_S^2} \right), \qquad (7)$$

where $\mu_s = \lceil c \cdot N_S / N_C \rceil$ and the $2 - \exp(\cdot)$ form ensures the multiplying factor $g \geq 1$. This results in larger distances when moving towards nodes in the top-right or bottom-left corners.
Overall, the distance to reach any node $(c, s)$ is influenced by where the edge originates ($d_{n \to (c,s)}$, Eq. 6) and the location of the node itself ($g(c, s)$, Eq. 7).
Unmapped scenes Adaptations provide creative freedom, and thus every scene need not stem from a specific chapter of the novel. For example, in GOT about 30% of the shots do not have a corresponding part in the novel (c.f. Fig. 1). To tackle this problem, we add an additional layer of $N_S$ nodes to our graph (top layer of blue nodes in Fig. 3). The inclusion of these nodes allows us to maintain the column-wise edge structure while also providing the additional feature of unmapped scenes. The distance to arrive at such a node from any other node is initialized to $\alpha + \mu_g$, where $\mu_g$ is the average value of the global influence (Eq. 7).
External source/sink Initializing the shortest path within the main graph would force the first scene to belong to the first chapter, and the last scene to the last chapter. However, this need not be true. To prevent this, we include two additional source and sink nodes (red nodes in Fig. 3). We create $N_C$ additional edges each to transit from the source to column $s = 1$ and from column $s = N_S$ to the sink. The distances of these edges are based on the global distance.
Using the alignment cues The shortest path for the current setup of the graph, without external influence, assigns an equal number of scenes to every chapter. Due to its structure, we refer to this as the diagonal prior. We now use the identity (Eq. 2) and dialog matching (Eq. 4) cues to lower the distance of reaching a high-scoring node. For example, a node which has multiple matching dialogs and the same set of characters is very likely to depict the same story and be part of the correct scene-to-chapter alignment. Thus, reducing the incoming distance to this node encourages the shortest path to go through it.
For every node $(c, s)$ with a non-zero similarity score, we subtract the similarity score from all incoming edges, thus encouraging the shortest path to go through this node:

$$d_{n' \to (c,s)} = d_{n' \to (c,s)} - \sum_{M} w_M \, \phi_M(c, s), \qquad (8)$$

where $\phi_M(\cdot)$ is the similarity score of modality $M$ with weight $w_M > 0$, and $n' = (\cdot, s-1)$ ranges over the nodes of the previous scene column with an edge to $(c, s)$. The sum of weights across all modalities is constrained as $\sum_M w_M = 1$. In our case, the modalities comprise dialog matching and character identities; however, such an approach allows the model to be easily extended with more cues.
When the similarity score between scene $s$ and all chapters is low, it indicates that the scene might not be part of the book. We thus modify the incoming distance to the node $(\emptyset, s)$ as follows:

$$d_{n' \to (\emptyset,s)} = d_{n' \to (\emptyset,s)} - \sum_{M} w_M \max\left(0, \, 1 - \sum_{c} \phi_M(c, s)\right). \qquad (9)$$

This encourages the shortest path to assign the scene $s$ to the null chapter class, $c^*_s = \emptyset$.
Implementation details We use Dijkstra's algorithm to solve the shortest path problem. For GOT, our graph has about 27k nodes and 2M edges. Finding the shortest path is very efficient and takes less than 1 second. Our initialization distance parameter $\alpha$ is set to 1. The weights $w_M$ are not very crucial and yield good alignment performance over a large range of values.
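The construction above can be sketched as follows. The paper solves the graph with Dijkstra's algorithm; because the graph is a layered DAG whose edges only connect scene column $s$ to $s+1$, the column-by-column relaxation below finds the same shortest path. The source/sink nodes are omitted and some constants (e.g., the null-layer distance) are simplified, so treat this as an illustration of the idea rather than a faithful reimplementation.

```python
import math

def align(phi, alpha=1.0):
    """Sketch of the shortest-path alignment of Sec. 4.3.

    `phi` is the fused similarity matrix (NC chapters x NS scenes),
    i.e. the weighted sum over modalities. Row index NC plays the
    role of the null chapter for unmapped scenes.
    """
    NC, NS = len(phi), len(phi[0])
    NULL = NC

    def g(c, s):
        # global prior (Eq. 7): pulls the path toward the diagonal
        mu = (c + 1) * NS / NC
        return 2 - math.exp(-((s + 1 - mu) ** 2) / (2 * NS ** 2))

    def edge(c_prev, c, s):
        # distance of an edge into node (c, s): prior minus cue score
        if c == NULL:
            return alpha - max(0.0, 1.0 - sum(phi[cc][s] for cc in range(NC)))
        jump = 0 if c_prev in (None, NULL) else abs(c - c_prev) ** 2
        return (alpha + jump / (2 * NC)) * g(c, s) - phi[c][s]

    INF = float("inf")
    dist = [[INF] * NS for _ in range(NC + 1)]
    back = [[None] * NS for _ in range(NC + 1)]
    for c in range(NC + 1):
        dist[c][0] = edge(None, c, 0)
    # column-wise relaxation over the layered DAG
    for s in range(1, NS):
        for c in range(NC + 1):
            for cp in range(NC + 1):
                d = dist[cp][s - 1] + edge(cp, c, s)
                if d < dist[c][s]:
                    dist[c][s], back[c][s] = d, cp
    # backtrack: one chapter index per scene; NULL means "not in the book"
    c = min(range(NC + 1), key=lambda cc: dist[cc][NS - 1])
    path = [c]
    for s in range(NS - 1, 0, -1):
        c = back[c][s]
        path.append(c)
    return path[::-1]
```

A scene column with no chapter support makes the null-layer edge cheap, so the path detours through the null chapter exactly as Eq. 9 intends.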
5. Evaluation

We now present the evaluation of our proposed alignment approach on the data set [1] described in Sec. 3.1.
Table 1. An overview of our data set. The table is divided into three sections related to information about the video, the book and the face identification scheme.

VIDEO    duration  #scenes  #shots (#nobook)
GOT      8h 58m    369      9286 (2708)
HP       2h 32m    138      2548 (56)

BOOK     #chapters  #words  #adj, #verb
GOT      73         293k    17k, 59k
HP       17         78k     4k, 17k

Face-ID  #characters  #tracks (unknown)  id accuracy
GOT      95           11094 (2174)       67.6
HP       46           3777 (843)         72.3
Data set statistics Table 1 presents some statistics of the two novel-to-video adaptations. The GOT video adaptation and book are roughly four times larger than those of HP. A major difference between the two adaptations is the fraction of shots in the video which are not part of a book chapter. For GOT, the fraction of shots not in the book (#nobook) is much higher, at 29.2%, as compared to only 2.2% for HP. Both novels are a large resource of adjectives (#adj) and verbs (#verb).

Face ID performance Table 1 (Face-ID) shows the overall face identification performance in terms of track-level accuracy (the fraction of correctly labeled face tracks).
The face tracks involve a large number of people and many unknown characters, and our track accuracy of around 70% is on par with state-of-the-art performance on complex video data [18]. Additionally, GOT and HP are a new addition to the existing data sets on person identification, and we will make the tracks and labels publicly available [1].
5.1. Ground truth and evaluation criterion
The ground truth alignment between book chapters and videos is performed at the shot level. This provides a fine-grained alignment independent of the specific scene detection algorithm. The scene detection only helps simplify the alignment problem by reducing the complexity of the graph.
We consider two metrics. First, we measure the alignment accuracy (acc) as the fraction of shots that are assigned to the correct chapter. This includes shot assignments to $\emptyset$. We also emphasize finding shots that are not part of the book (particularly for GOT). Finding these shots is treated as a detection problem, and we use precision (nb-pr) and recall (nb-rc) measures.
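These two metrics can be computed with a short helper; the function name and the use of `None` as the null-chapter label are illustrative choices, not from the paper.

```python
def alignment_metrics(pred, gt, null=None):
    """Shot-level metrics of Sec. 5.1: overall accuracy (acc), plus
    precision (nb-pr) and recall (nb-rc) of detecting no-book shots.

    `pred` and `gt` are per-shot chapter labels; `null` marks shots
    assigned to no chapter.
    """
    acc = sum(p == g for p, g in zip(pred, gt)) / len(gt)
    # treat no-book shots as a detection problem
    tp = sum(p == null and g == null for p, g in zip(pred, gt))
    pred_null = sum(p == null for p in pred)
    gt_null = sum(g == null for g in gt)
    nb_pr = tp / pred_null if pred_null else 0.0
    nb_rc = tp / gt_null if gt_null else 0.0
    return acc, nb_pr, nb_rc
```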
5.2. Book-to-video alignment performance
We present the alignment performance of our approach in Table 2, inspect the importance of the dialog and identity cues, and compare the method against multiple baselines.
As discussed in Sec. 3.2, for the purpose of alignment,we perform scene detection and group shots into scenes.
Figure 4. This figure is best viewed in color. We visualize the alignment of Game of Thrones (Season 1), episodes E01–E10, as obtained from various methods. The ground truth alignment (row 1) is the same as presented in Fig. 1. Chapters are indicated by colors, and empty areas (white spaces) indicate that those shots are not part of the novel. We present the alignment performance of three methods: prior (row 2), DTW3 (row 3) and our approach prior + ids + dlgs (row 4). The color of the first sub-row for each method indicates the chapter to which every scene is assigned. Comparing vertically against the ground truth we can determine the accuracy of the alignment. For simplicity, the second sub-row of each method indicates whether the alignment is correct (green) or wrong (red).
Table 2. Alignment performance in terms of the overall accuracy of shots being assigned book chapters (acc). The no-book precision (nb-pr) and no-book recall (nb-rc) are for shots which are not part of the book. See Sec. 5.2 for a detailed discussion.

                        GOT                     HP
method              acc   nb-pr  nb-rc     acc   nb-pr  nb-rc
scenes upper        95.1  97.9   86.4      96.7  40.0   7.1
prior               12.4  –      –         19.0  –      –
prior + ids         55.3  52.8   48.7      80.4  0.0    0.0
prior + dlgs        73.1  55.8   74.2      86.2  20.0   3.6
ids + dlgs          66.5  71.7   20.9      77.4  0.0    0.0
prior + ids + dlgs  75.7  70.5   53.4      89.9  0.0    0.0
MAX [25]            54.9  –      –         73.3  –      –
MAX [25] + ∅        60.7  68.0   37.7      73.0  0.0    0.0
DTW3 [25]           44.7  –      –         94.8  –      –
However, errors in the scene boundary detection can lead to small alignment errors measured at the shot level. The method scenes upper is the best possible alignment performance given the scene boundaries, and we observe that we do not lose much in terms of overall accuracy.
Shortest-path based As a baseline (prior), we evaluate the alignment obtained from our graph after initializing edge distances with the local and global priors. The alignment is an equal distribution of scenes among chapters. For both GOT and HP we observe bad performance, suggesting the difficulty of the task. The prior distances modified with the dialog matching cue (prior + dlgs) outperform the prior with character identities (prior + ids) quite significantly. An explanation of this is (i) inaccurate face track recognition; and (ii) similar names appearing in the text (e.g., Jon Arryn, Jon Umber and Jon Snow are three characters in GOT). However, the fusion of the two cues, prior + ids + dlgs, performs best. Note that we are quite successful in finding shots that are not part of the book. If we disable the global prior which provides structure to the distances (ids + dlgs), we obtain comparable performance, indicating that the cues are responsible for the alignment accuracy. In general, our ability to predict whether a shot belongs to the novel or not is good for GOT. However, for HP, as the no-book shots are very few (2.2%), they are difficult to discern.
The accuracy is robust against a large range of weights. For example, for GOT, sweeping $w_{id} \in [0.01, 0.5]$ results in accuracy ranging from 71.5% to 69.9%, peaking in between.
Baselines The MAX baseline [25] uses both cues, but treats scenes independently of one another. The method selects the highest scoring chapter for each scene in the similarity matrix $\phi(c, s)$, and performs worse than our method. When augmented with the null class (MAX + ∅), scored similarly to Eq. 9, we see an improvement for GOT, while HP performs slightly worse. Note that our proposed method, prior + ids + dlgs, is far better than MAX.
DTW3 [25] (similar to DP [23]) forces a scene $s$ to be assigned to the same or a subsequent chapter as scene $s-1$. This restriction allows the method to perform well in the case of simple structures (HP), but it fails completely in the case of complex alignments (GOT).
Qualitative analysis Fig. 4 presents the alignment obtained by different methods on GOT. Notice how our method is quite successful at predicting scenes which are not part of the book (the white spaces in the first sub-row). Both prior and DTW3 have large chunks of erroneous scenes (second sub-row in red) due to long-range influences of the alignment. A drawback of our method is that it needs sufficient evidence to assign a scene to a chapter. This is especially visible for HP (analysis in the supplementary material).
6. Mining rich descriptions
A novel portrays the visual world through rich descriptions. While there are many levels of such a presentation
(a) (Ch11, P47, E02 12m23s): Arya was in her room, packing a polished ironwood chest that was bigger than she was. Nymeria was helping. Arya would only have to point, and the wolf would bound across the room, snatch up some wisp of silk in her jaws, and fetch it back.
(b) (Ch60, P66, E09 7m55s): Lord Walder was ninety, a wizened pink weasel with a bald spotted head, too gouty to stand unassisted. His newest wife, a pale frail girl of sixteen years, walked beside his litter when they carried him in. She was the eighth Lady Frey.
(c) (Ch12, P33, E01 51m21s): One egg was a deep green, with burnished bronze flecks that came and went depending on how Dany turned it. Another was pale cream streaked with gold. The last was black, as black as a midnight sea, yet alive with scarlet ripples and swirls.
(d) (Ch4, P23, M 14m23s): From an inside pocket of his black overcoat he pulled a slightly squashed box. Harry opened it with trembling fingers. Inside was a large, sticky chocolate cake with Happy Birthday Harry written on it in green icing.
(e) (Ch9, P194, M 1h02m45s): They were looking straight into the eyes of a monstrous dog, a dog that filled the whole space between ceiling and floor. It had three heads. Three pairs of rolling, mad eyes; three noses, twitching and quivering in their direction.
(f) (Ch10, P78-79, M 1h06m59s): Hermione rolled up the sleeves of her gown, flicked her wand, and said, "Wingardium Leviosa!" Their feather rose off the desk and hovered about four feet above their heads.
Figure 5. Examples of mined attributes ranging from events, scenes, and person and object descriptions. The first row corresponds to GOT, while the second shows examples from HP. The location of the description in the novel is abbreviated as chapter number (Ch) and paragraph number (P), and the location in the video as episode EXX or movie M in minutes and seconds. The caption text is verbatim from the novel. The various attributes are highlighted in the caption below every frame. This figure is best viewed on screen.
we highlight and focus on a few key ones: (i) narrative text describing the scene or location; (ii) detailed character descriptions including face-related features, clothing, and also portrayal of their mental state; (iii) character interactions, their emotions, and usage of various communication verbs to express a sentiment.
As an application of the scene-to-chapter alignment, we present a sample of attributes that are easily obtained. Fig. 5 presents qualitative examples from our data set. The figure captions are verbatim paragraphs from the novel. For simplicity, we highlight the relevant attributes of the text which correspond to the part of the image. Many more such examples can be found in the supplementary material.
We extract these examples pseudo-automatically from a given scene-to-chapter alignment. Small regions (±3 paragraphs, ±5 shots) surrounding matched dialogs are used to find appropriate examples. Note that the video clip, i.e., the region of shots around the matching dialog, is part of the same story; however, we select one good frame for presentation purposes. Although this currently requires some manual intervention, generating pairs of images and rich descriptions is far easier than going through the entire book. Automating this task is an interesting area for future work.
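The windowing step above can be sketched as follows. This is a hypothetical illustration of the ±3 paragraph / ±5 shot regions; the function name, the (paragraph, shot) index pairs, and the boundary counts are assumptions, not the paper's code.

```python
def description_windows(dialog_matches, n_paragraphs, n_shots,
                        para_radius=3, shot_radius=5):
    """Candidate text/video regions around matched dialogs.

    dialog_matches: list of (paragraph_index, shot_index) pairs for
    dialogs matched between the chapter and the scene.
    Returns, per match, a pair of inclusive index ranges
    ((p_lo, p_hi), (s_lo, s_hi)) clipped to the chapter/scene extents.
    """
    regions = []
    for p, s in dialog_matches:
        p_lo = max(0, p - para_radius)
        p_hi = min(n_paragraphs - 1, p + para_radius)
        s_lo = max(0, s - shot_radius)
        s_hi = min(n_shots - 1, s + shot_radius)
        regions.append(((p_lo, p_hi), (s_lo, s_hi)))
    return regions

# A dialog matched at paragraph 10 / shot 40 in a 100-paragraph
# chapter with 200 shots yields a small region on each side.
print(description_windows([(10, 40)], 100, 200))  # -> [((7, 13), (35, 45))]
```

The returned paragraph range is then scanned manually for descriptive text, and one representative frame is picked from the shot range.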
7. Conclusion
With a rising interest in transforming novels into films or TV series, we present for the first time an automatic way to analyze novel-to-film adaptations. We create and annotate a data set involving one film and one season of a TV series. Our goal is to associate video scenes with book chapters, which we achieve by modeling the task as a shortest-path graph problem. Such a model allows for a non-sequential alignment and is able to handle scenes that do not correspond to parts of the book. We use information cues such as matching dialogs and character identities to bridge the gap between the text and video domains. A comparison against state-of-the-art plot synopsis alignment algorithms shows that we perform much better, especially in the case of complicated alignment structures. As an application of the alignment between book chapters and video scenes, we also present qualitative examples of extracting rich scene, character, and object descriptions from the novel.
Acknowledgments This work was supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under contract no. STI-598/2-1. The views expressed herein are the authors' responsibility and do not necessarily reflect those of the DFG.
References
[1] Book2Movie data download containing novel-to-film ground truth alignments, and face tracks + labels. http://cvhci.anthropomatik.kit.edu/projects/mma.
[2] Stanford CoreNLP. http://nlp.stanford.edu/software/. Retrieved 2014-11-14.
[3] 'Deathly Hallows' film breathes life into Harry Potter book sales. http://www.nielsen.com, Nov. 2010. Retrieved 2014-11-14.
[4] Where do highest-grossing screenplays come from? http://stephenfollows.com, Jan. 2014. Retrieved 2014-11-14.
[5] P. Bojanowski, F. Bach, I. Laptev, J. Ponce, C. Schmid, and J. Sivic. Finding Actors and Actions in Movies. In ICCV, 2013.
[6] T. Cour, B. Sapp, C. Jordan, and B. Taskar. Learning from ambiguously labeled images. In CVPR, 2009.
[7] P. Das, C. Xu, R. F. Doell, and J. J. Corso. A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching. In CVPR, 2013.
[8] M. Everingham, J. Sivic, and A. Zisserman. "Hello! My name is... Buffy" – Automatic Naming of Characters in TV Video. In BMVC, 2006.
[9] R. Giddings, K. Selby, and C. Wensley. Screening the novel: The theory and practice of literary dramatization. Macmillan, 1990.
[10] A. Karpathy and L. Fei-Fei. Deep Visual-Semantic Alignments for Generating Image Descriptions. In CVPR, 2015.
[11] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models. Transactions of the Association for Computational Linguistics, 2015.
[12] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning Realistic Human Actions from Movies. In CVPR, 2008.
[13] C. Liang, C. Xu, J. Cheng, and H. Lu. TVParser: An Automatic TV Video Parsing Method. In CVPR, 2011.
[14] D. Lin, S. Fidler, C. Kong, and R. Urtasun. Visual Semantic Search: Retrieving Videos via Complex Textual Queries. In CVPR, 2014.
[15] C. D. Manning, P. Raghavan, and H. Schutze. Scoring, term weighting, and the vector space model. In Introduction to Information Retrieval. Cambridge University Press, 2008.
[16] J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille. Explain Images with Multimodal Recurrent Neural Networks. In NIPS Deep Learning Workshop, 2014.
[17] B. McFarlane. Novel to Film: An Introduction to the Theory of Adaptation. Clarendon Press, Oxford, 1996.
[18] O. M. Parkhi, K. Simonyan, A. Vedaldi, and A. Zisserman. A Compact and Discriminative Face Track Descriptor. In CVPR, 2014.
[19] V. Ramanathan, A. Joulin, P. Liang, and L. Fei-Fei. Linking people in videos with "their" names using coreference resolution. In ECCV, 2014.
[20] A. Rohrbach, M. Rohrbach, N. Tandon, and B. Schiele. A Dataset for Movie Description. In CVPR, 2015.
[21] M. Rohrbach, W. Qiu, I. Titov, S. Thater, M. Pinkal, and B. Schiele. Translating Video Content to Natural Language Descriptions. In ICCV, 2013.
[22] A. Roy, C. Guinaudeau, H. Bredin, and C. Barras. TVD: A Reproducible and Multiply Aligned TV Series Dataset. In Language Resources and Evaluation Conference, 2014.
[23] P. Sankar, C. V. Jawahar, and A. Zisserman. Subtitle-free Movie to Script Alignment. In BMVC, 2009.
[24] J. Sivic, M. Everingham, and A. Zisserman. "Who are you?" – Learning person specific classifiers from video. In CVPR, 2009.
[25] M. Tapaswi, M. Bauml, and R. Stiefelhagen. Story-based Video Retrieval in TV series using Plot Synopses. In ACM Intl. Conf. on Multimedia Retrieval (ICMR), 2014.
[26] M. Tapaswi, M. Bauml, and R. Stiefelhagen. StoryGraphs: Visualizing Character Interactions as a Timeline. In CVPR, 2014.
[27] M. Tapaswi, M. Bauml, and R. Stiefelhagen. Improved Weak Labels using Contextual Cues for Person Identification in Videos. In IEEE Intl. Conf. on Automatic Face and Gesture Recognition (FG), 2015.
[28] T. Tsoneva, M. Barbieri, and H. Weda. Automated summarization of narrative video on a semantic level. In Intl. Conf. on Semantic Computing, 2007.
[29] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko. Translating Videos to Natural Language Using Deep Recurrent Neural Networks. In North American Chapter of the Association for Computational Linguistics – Human Language Technologies (NAACL-HLT), 2015.
[30] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and Tell: A Neural Image Caption Generator. In CVPR, 2015.
[31] G. Wagner. The novel and the cinema. Fairleigh Dickinson University Press, 1975.