



Understanding Videos, Constructing Plots
Learning a Visually Grounded Storyline Model from Annotated Videos

Abhinav Gupta¹, Praveen Srinivasan², Jianbo Shi² and Larry S. Davis¹
¹University of Maryland, College Park, MD

²University of Pennsylvania, Philadelphia, PA
[email protected], [email protected], [email protected], [email protected]

Abstract

Analyzing videos of human activities involves not only recognizing actions (typically based on their appearances), but also determining the story/plot of the video. The storyline of a video describes causal relationships between actions. Beyond recognition of individual actions, discovering causal relationships helps to better understand the semantic meaning of the activities. We present an approach to learn a visually grounded storyline model of videos directly from weakly labeled data. The storyline model is represented as an AND-OR graph, a structure that can compactly encode storyline variation across videos. The edges in the AND-OR graph correspond to causal relationships which are represented in terms of spatio-temporal constraints. We formulate an Integer Programming framework for action recognition and storyline extraction using the storyline model and visual groundings learned from training data.

1. Introduction

Human actions are (typically) defined by their appearance/motion characteristics and the complex and structured causal dependencies that relate them. These causal dependencies define the goals and intentions of the agents. The storyline of a video includes the actions that occur in that video and the causal relationships [21] between them. A model that represents the set of storylines that can occur in a video corpus and the general causal relationships amongst actions in the video corpus is referred to as a "storyline model". Storyline models also indicate the agents likely to perform various actions and the visual appearance of actions. A storyline model can be regarded as a (stochastic) grammar, whose language (individual storylines) represents potential plausible "explanations" of new videos in a domain. For example, in analysing a collection of surveillance videos of a traffic intersection scene, a plausible (incomplete) storyline model is: When a traffic light turns green, traffic starts moving. If, while traffic is moving, a pedestrian walks into an intersection, then the traffic suddenly stops. Otherwise it stops when the signal turns red. Not only are the actions "turns green", "moving" and "walks" observable, there are causal relationships among the actions: traffic starts moving because a light turns green, but it stops because a pedestrian entered an intersection or the signal turned red. Beyond recognition of individual actions, understanding the causal relationships among them provides information about the semantic meaning of the activity in video - the entire set of actions is greater than the sum of the individual actions. The causal relationships are often represented in terms of spatio-temporal relationships between actions. These relationships provide semantic/spatio-temporal context useful for inference of the storyline and recognition of individual actions in subsequent, unannotated videos.

The representational mechanism of the storyline model is very important; traditional action recognition has heavily utilized graphical models, most commonly Dynamic Bayesian Networks (DBNs). However, the fixed structure of such models (often encoded by a domain expert) severely limits the storylines that can be represented by the model. At each time step, only a fixed set of actions and agents is available to model the video, which is not sufficient for situations in which the number of agents and actions varies. For example, in sports, sequences of actions are governed by the rules of the game and the goals of players/teams. These rules and goals represent a structure that extends beyond a simple fixed structure of recurring events. The set of possible or probable actions and/or agents at any given time may vary substantially. An important contribution of this work is the introduction of AND-OR graphs [22, 26] as a representation mechanism for storyline models. In addition, unlike approaches where human experts design graphical models, we learn the structure and parameters of the graph from weakly labeled videos using linguistic annotations and visual data. Simultaneous learning of storyline models and appearance models of actions constrains the learning process and leads to improved visual appearance models. Finally, we show that the storyline model can be used as a contextual model for inference of the storyline and recognition of actions in new videos.

Our approach to modeling and learning storyline models of actions from weakly labeled data is summarized in Figure 1. The storyline models are represented by AND-OR graphs, where selections are made at OR-nodes to generate storyline variations. For example, in the AND-OR graph shown in the figure, the 'pitching' OR-node has two children, 'hit' and 'miss', which represent two possibilities in the storyline, i.e., after pitching either a 'hit' or a 'miss' can occur.



Figure 1. Visually Grounded Storyline Model: Given annotated videos, we learn the storyline model and the visual grounding of each action. The optimization function for searching the storyline model has three terms: (1) simple structure; (2) connections based on simple conditional distributions; (3) provides explanations for visual and text data in the training set. The figure also shows how our AND-OR graph can encode the variations in storylines (three videos at the top with different storylines (bottom-right)), which is not possible with graphical models like DBNs.

The edges in the AND-OR graph represent causal relationships and are defined in terms of spatio-temporal constraints. For example, an edge from 'catch' to 'throw' indicates that 'throw' is causally dependent on 'catch' (a ball can be thrown only after it has been caught). This causal relationship can be defined in terms of time as $t_{catch} < t_{throw}$. The causal relationship also has a spatial constraint - someone typically throws to another agent at a different location.

Our goal is to learn the storyline model and the visual groundings of each action from the weakly labeled data - videos with captions. We exploit the fact that actions have temporal orderings and spatial relationships, and that many actions either "causally" influence or are "causally" dependent on other actions. Humans learn these "causal" relationships between different actions by utilizing sources of information including language, vision and direct experience (interaction with the world). In our approach, we utilize human-generated linguistic annotations of videos to support learning of storyline models.

2. Related Work

Existing datasets for learning action appearance models provide samples for a few classes and in controlled and simplified settings. Such datasets fail to generalize to actions with large intra-class variations and are unsuitable for learning contextual models due to their unnatural settings. There has been recent interest in utilizing large amounts of weakly labeled data, such as movies/TV shows in conjunction with scripts/subtitles. Approaches such as [6, 15] provide assignments of frames/faces to actions/names. Such approaches regard assignment and appearance learning as separate processes. Nitta et al. [20] present an approach to annotate sports videos by associating text to images based on previously specified knowledge of the game. In contrast, we simultaneously learn a storyline model of the video corpus and match tracked humans in the videos to action verbs (i.e., solving the segmentation and correspondence problems).

Our approach is motivated by work in image annotation, which typically models the joint distribution of images and keywords to learn keyword appearance models [1]. Similar models have been applied to video retrieval, where annotation words are actions instead of object names [7]. While such models exploit the co-occurrence of image features and keywords, they fail to exploit the overall structure in the video. Gupta et al. [12] presented an approach to simultaneously learn models of both nouns and prepositions from weakly labeled data. Visually grounded models of prepositions are used to learn a contextual model for improving labeling performance. However, spatial reasoning is performed independently for each image, even though some spatial relationships are not incidental and could be shared across most images in the dataset (for example, in all the images in the dataset the sun is above the water). In addition, the contextual model based on priors over possible relationship words restricts the clique size of the Bayesian network used for inference. It also requires a fully connected network, which can lead to intractable inference. In contrast, our approach learns a computationally tractable storyline model based on causal relationships that generally hold in the given video domain.


There has been significant research in using contextual models for action recognition [11, 23, 10, 2, 18]. Much of this work has focused on the use of graphical models such as Hidden Markov Models (HMMs) [24] and Dynamic Bayesian Networks (DBNs) [10] to model contextual relationships among actions. A drawback of these approaches is their fixed structure, defined by human experts. The set of probable actions and/or agents at any given time may vary greatly, so a fixed set of successor actions is insufficient. Our AND-OR graph storyline model can model contextual relationships (like graphical models) while simultaneously modeling variation in structure (like grammars [2]).

In computer vision, AND-OR graphs have been used to represent compositional patterns [26, 4]. Zhu and Mumford [26] used an AND-OR graph to represent a stochastic grammar of images. Zhu et al. [25] present an approach to learn AND-OR graphs for representing an object shape directly from weakly supervised data. Lin et al. [16] also used an AND-OR graph representation for modeling activity in an outdoor surveillance setting. While their approach assumes that hand-labeled annotations of spatio-temporal relationships and the AND-OR structure are provided, our approach learns the AND-OR graph structure and its parameters using text-based captions. Furthermore, [16] assumes a one-to-one correspondence between nodes and tracks, as compared to the one-to-many correspondence used in our approach.

Our work is similar in spirit to structure learning of Bayesian networks in [8], which proposed a structural-EM algorithm for combining the standard EM algorithm for optimizing parameters with a search over the Bayesian network structure. We also employ an iterative approach to search for parameters and structure. The structure search in [8] was over the space of possible edges given a fixed set of nodes. However, in our case, both the nodes and edges are unknown. This is because a node can occur more than once in a network, depending upon the context in which it occurs (see Figure 2). Therefore, the search space is much larger than the one considered in [8].

3. Storyline Model

We model the storyline of a collection of videos as an AND-OR graph $G = (V_{and}, V_{or}, E)$. The graph has two types of nodes: OR-nodes $V_{or}$ and AND-nodes $V_{and}$. Each OR-node $v \in V_{or}$ represents an action, described by its type and agent. Each action type has a visual appearance model which provides the visual grounding for OR-nodes. Each OR-node is connected to other OR-nodes either directly or via an AND-node. For example, in Fig. 1 (middle), the OR-node "Pitch" has two OR-children which represent two possibilities after a pitch (i.e., either the batter hits the ball ("Hit-Batter") or misses it ("Miss-Batter")). A path from an OR-node $v_i$ to an OR-node $v_j$ (directly or via an AND-node) represents the causal dependence of action $v_j$ upon action $v_i$. Here, AND-nodes are dummy nodes and are only used when an activity can causally influence two or more simultaneous activities. The causal relationships between two OR-nodes are defined by spatio-temporal constraints. For example, the causal relationship that 'hitting' depends on 'pitching' the ball can be defined temporally as $t_{pitch} < t_{hit}$ (hitting occurs after pitching) and spatially as the pitcher being at some distance $d'$ from the batter, $d(pitch, hit) \approx d'$. Figure 1 shows several examples (top) of videos whose actions are represented by AND-OR graphs. Note that the AND-OR graph can simultaneously capture both long and short duration storylines in a single structure (bottom-right).
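To make the representation concrete, the sketch below shows one possible in-memory encoding of such a storyline model. The class and field names (ORNode, ANDNode, CausalEdge) are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union

@dataclass
class ANDNode:
    """Dummy node: all of its children are causally influenced simultaneously."""
    children: List["ORNode"] = field(default_factory=list)

@dataclass
class ORNode:
    """An action node: action type, agent, and a visual appearance model."""
    action: str                                   # e.g. "hit"
    agent: str                                    # e.g. "batter"
    appearance: Optional[List[float]] = None      # appearance histogram (visual grounding)
    children: List[Union["ORNode", ANDNode]] = field(default_factory=list)

@dataclass
class CausalEdge:
    """Spatio-temporal constraint attached to a parent -> child causal link."""
    parent: ORNode
    child: ORNode
    temporal: str = "before"     # encodes t_parent < t_child
    expected_dist: float = 0.0   # expected spatial distance d' between the two actions

# A tiny fragment of the baseball storyline model of Figure 1:
pitch = ORNode("pitch", "pitcher")
hit, miss = ORNode("hit", "batter"), ORNode("miss", "batter")
pitch.children = [hit, miss]                      # OR choice: after a pitch, hit or miss
edges = [CausalEdge(pitch, hit), CausalEdge(pitch, miss)]
```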

4. Learning the Storyline Model

Our goal is to learn a visually grounded storyline model from weakly labeled data. Video annotations include the names of actions in the videos and some subset of the temporal and spatial relationships between those actions. These relationships are provided by both spatial and temporal prepositions such as "before", "left" and "above". Each temporal preposition is further modeled in terms of the relationships described in Allen's interval logic. As part of bottom-up processing, we assume that each video has a set of human tracks, some of which correspond to actions of interest. The feature vector that describes each track is based on appearance histograms of Spatio-Temporal Interest Points (STIPs) [14, 19] extracted from the videos.
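As an illustration of how a temporal preposition can be grounded in Allen-style interval relations, the sketch below checks a few such relations on track intervals. The specific preposition-to-relation mapping shown is an assumption for illustration, not the paper's exact table.

```python
def before(a, b):
    """Allen relation 'before': interval a ends before interval b starts."""
    return a[1] < b[0]

def meets(a, b):
    """Allen relation 'meets': a ends exactly when b starts."""
    return a[1] == b[0]

def overlaps(a, b):
    """Allen relation 'overlaps': a starts first and ends inside b."""
    return a[0] < b[0] < a[1] < b[1]

def during(a, b):
    """Allen relation 'during': a lies strictly inside b."""
    return b[0] < a[0] and a[1] < b[1]

# Hypothetical mapping from annotation prepositions to allowed Allen relations.
PREPOSITION_RELATIONS = {
    "before": [before, meets],
    "after": [lambda a, b: before(b, a), lambda a, b: meets(b, a)],
    "while": [overlaps, during],
}

def satisfies(preposition, interval_a, interval_b):
    """True if two (start_frame, end_frame) intervals satisfy the preposition."""
    return any(rel(interval_a, interval_b) for rel in PREPOSITION_RELATIONS[preposition])

# e.g. "pitching before hitting" with pitch frames 10-40 and hit frames 45-60:
assert satisfies("before", (10, 40), (45, 60))
```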

Establishing causal relationships between actions and learning groundings of actions involves solving a matching problem. We need to know which human tracks in the training videos match to which action verbs of the storyline in order to learn their appearance models and the storyline model of the videos. However, the matching of tracks to action verbs and the storyline extraction for a particular video depend on the structure of the storyline model, the appearances of actions and the causal relationships between them. This leads to yet another chicken-and-egg problem, and we employ a structural-EM-like iterative approach to simultaneously learn the storyline model and the appearance models of actions from collections of annotated videos. Formally, we want to learn the structure $G$ and parameters $\Theta = (\theta, A)$ of the storyline model ($\theta$: conditional distributions, $A$: appearance models), given the set of videos $(V_1, .., V_n)$ and their associated annotations $(L_1, .., L_n)$:

$$(G, \Theta) = \arg\max_{G', \Theta'} P(G', \Theta' \mid V_1 .. V_n, L_1 .. L_n) \propto \arg\max_{G', \Theta'} \prod_i \sum_{M^i, S^i} P(V_i, L_i \mid G', \Theta', M^i, S^i)\, P(G', \Theta')$$

• $S^i$: storyline for video $i$.
• $M^i$: matching of tracks to actions for video $i$.

We treat both $S$ and $M$ as missing data and formulate an EM approach. The prior, $P(G, \Theta)$, is based on a simple-structure term ($R(G)$) and a simple-conditional-distributions term ($D(G, \Theta)$), and the likelihood term is based on how well the storyline model generates storylines which can explain both the videos and their linguistic annotations ($C(G, \Theta)$).

Figure 2 summarizes our approach for learning visually grounded storyline models from training videos.



Figure 2. An overview of our approach: our storyline model $(G, \Theta)$ is initialized using videos and captions, and we propose an iterative procedure to improve its parameters $\Theta$ and its structure $G$.

Given an AND-OR graph structure at the beginning of an iteration, we fix the structure and iterate between learning the parameters/visual grounding of the AND-OR graph and the matching of tracks to action nodes (Sec. 4.1). In the hard E-step, we estimate the storyline and matchings for all training videos using the current $G, \Theta$. In the M-step, we update $\Theta$ using the estimated storylines and matchings for all videos. After convergence or a few iterations, we generate new graph proposals by local modifications to the original graph (Sec. 4.2) and select the modification that best represents the set of storylines for the videos (Sec. 4.3). This new storyline model is then used to re-initialize the iterative approach, which alternates between appearances and matchings. The new storyline model, which better captures the variation in storylines across the training videos, allows better interpretation. For example, in the figure, the "run-fielder" action after the "catch" was labeled as "throw" since the initial storyline model did not allow "run" after "catch". Subsequently, an updated storyline model allows "run" after "catch", and the assignments improve because of the expanded storyline.
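A minimal sketch of this outer loop is shown below. The callables passed in (parse_video, update_parameters, propose_structures, score_structure) are hypothetical stand-ins for the steps described above, not the authors' code.

```python
def learn_storyline_model(videos, captions, init_model, parse_video,
                          update_parameters, propose_structures, score_structure,
                          n_outer=3, n_inner=5):
    """Structural-EM-style learning loop sketched from Section 4 / Figure 2.

    init_model          - Sec. 4.4 initialization from captions
    parse_video         - Sec. 4.1 LP parsing: returns (storyline S, matching M)
    update_parameters   - M-step: re-estimate conditional distributions / appearances
    propose_structures  - Sec. 4.2 local modifications of the AND-OR graph
    score_structure     - Sec. 4.3 selection criteria
    """
    G, theta = init_model(videos, captions)
    parses = []

    for _ in range(n_outer):                      # structure search (outer loop)
        for _ in range(n_inner):                  # fixed structure: EM on parameters
            # hard E-step: storyline and track-action matching for every video
            parses = [parse_video(G, theta, v, c) for v, c in zip(videos, captions)]
            # M-step: update conditional distributions and appearance models
            theta = update_parameters(G, parses, videos)

        # propose local modifications to the graph and keep the best-scoring one
        candidates = propose_structures(G) + [G]
        G = max(candidates, key=lambda Gp: score_structure(Gp, parses, captions))

    return G, theta
```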

4.1. Parsing Videos

We now describe how an AND-OR storyline model is used to analyze, or parse, videos and obtain their storylines and matchings of human tracks to storyline actions. We provide a one-to-many matching formulation, where several human tracks can be matched to a single action. Matching tracks to actions also requires making a selection at each OR-node to pick one storyline out of the set of possible storylines. While there have been several heuristic inference algorithms for AND-OR graphs, we formulate an integer programming approach to obtain the storyline and matchings, and solve a relaxed version of the problem in the form of a linear program.

Given an AND-OR graph $G$, a valid instantiation $S$ (representing a storyline) of the AND-OR graph is a function $S : i \in V_{and} \cup V_{or} \rightarrow \{0, 1\}$ that obeys the following constraints: (1) At each OR-node $v_i$ there is an associated variable $S_i$ which represents whether the OR-node has been selected for a particular storyline or not. For example, in Fig. 3, 'hit' is a part of the storyline, therefore $S_3 = 1$, and 'miss' is not a part of the storyline, so $S_2 = 0$. (2) Since OR-children represent alternative possible storyline extensions, exactly one child can be selected at each OR-node. (3) An OR-node $i$ can be instantiated (i.e., $S_i = 1$) only when all the OR-nodes on the unique path from the root to node $i$ have been instantiated. For example, since the path from 'pitching' to 'catching' includes 'hitting', 'catching' can be part of a storyline if and only if 'hitting' is part of the storyline.
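A small sketch of these instantiation constraints is given below, assuming the graph is stored as parent/child links keyed by node id; the representation is illustrative, not the authors' code.

```python
def is_valid_instantiation(S, parent, or_children, root):
    """Check constraints (1)-(3) on a candidate storyline S.

    S            - dict: node id -> 0/1 selection variable
    parent       - dict: node id -> id of its unique parent OR-node (None for root)
    or_children  - dict: node id -> list of OR-children ids (alternative continuations)
    root         - id of the root OR-node
    """
    if S.get(root, 0) != 1:
        return False
    for i, selected in S.items():
        if selected:
            # (3) every OR-node on the path from the root to i must be selected
            p = parent[i]
            while p is not None:
                if not S.get(p, 0):
                    return False
                p = parent[p]
            # (2) exactly one OR-child is selected at a selected OR-node (if it has any)
            kids = or_children.get(i, [])
            if kids and sum(S.get(k, 0) for k in kids) != 1:
                return False
    return True

# Toy example mirroring Fig. 3: pitch -> {hit, miss}, hit -> catch
parent = {"pitch": None, "hit": "pitch", "miss": "pitch", "catch": "hit"}
kids = {"pitch": ["hit", "miss"], "hit": ["catch"]}
assert is_valid_instantiation({"pitch": 1, "hit": 1, "miss": 0, "catch": 1}, parent, kids, "pitch")
assert not is_valid_instantiation({"pitch": 1, "hit": 0, "miss": 0, "catch": 1}, parent, kids, "pitch")
```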

Given $T$ human tracks in a video, a matching of tracks and nodes is a mapping $M : i \in V_{or}, j \in \{1, ..., T+1\} \rightarrow \{0, 1\}$. $M_{ij} = 1$ indicates that the action at OR-node $i$ is matched with track $j$. Since some of the actions might not be visible due to occlusion and camera view, we add a dummy track which can be associated with any action with some penalty. Depending on the constraints imposed on $M$, different matchings between actions and tracks can be allowed: many-to-many, many-to-one, one-to-many, or one-to-one. We consider those mappings that associate one action to many tracks, which is represented by the constraint $\mathbf{1}^T M = \mathbf{1}$. Furthermore, no tracks should be matched to an OR-node that is not instantiated: $\forall i \in V_{or}, M_{ij} \leq S_i$.

Finally, to incorporate pairwise constraints (such as temporal ordering and spatial relationships) between the matches of two nodes $i$ and $k$, $M_{ij}$ and $M_{kl}$, we introduce variables $X : x_{ijkl} \in \{0, 1\}$; $x_{ijkl} = 1$ indicates that the action at node $i$ and track $j$ are matched, and the action at node $k$ and track $l$ are matched. Instead of enforcing the computationally difficult hard constraint $x_{ijkl} = M_{ij} \cdot M_{kl}$, we marginalize both sides over $l$ and represent the constraint as $\forall k, \sum_l x_{ijkl} = M_{ij}$.



Figure 3. Given the upper-left video, we show two possible parses and costs for each parsing cost function component. The left parse is worse since it explains fewer tracks (Explanation $E(M)$), matches actions to tracks with different appearance (Appearance $A(S, M)$), and violates spatial and temporal constraints (the pitch-pitcher action should be matched to a track which occurs before the miss-batter track, and the two tracks should appear in the usual pitcher-batter configuration) (Spatio-Temporal $T(X)$). The correct parse on the right scores well according to all three cost function components.

In parsing, we search for a "best" valid instantiation $S$ (representing a storyline from the storyline model) and a matching $M$ of tracks and actions (representing the visual grounding of nodes). The optimization function for selection is based on three terms: (1) select a storyline consisting of nodes for which well-matching tracks can be found; (2) select a storyline which can explain as many human tracks as possible; (3) the matching of nodes to tracks should not violate the spatio-temporal constraints defined by the storyline model (see Figure 3). The three terms that form the basis of the objective to be minimized, subject to the above constraints on $S$, $M$ and $X$, are:

Appearance Matching: The cost of a matching is based on the similarity of the appearances of instantiated nodes and corresponding tracks. This cost can be written as:

$$A(S, M) = \sum_i \Big\| \Big( \sum_j M_{ij}\, t_j \Big) - S_i A_i \Big\| \qquad (1)$$

where $t_j$ represents the appearance histogram of track $j$ and $A_i$ represents the appearance histogram model of the action at node $i$. In many-to-one matching, multiple tracks combine to match to a single action node. Therefore, the first term, $\sum_j M_{ij} t_j$, sums the appearance histograms of the human tracks that match to node $i$. This is then compared to the appearance model at node $i$ by measuring the L1 norm. Figure 3 shows an example of parsing with high (left parse) and low (right parse) matching costs. The left parse has a high matching cost since the track of a batter running is assigned to the pitching node, and the two are not similar in appearance.

Explanation Reward: Using only appearance matching would cause the optimization algorithm to prefer small storylines, since they require less matching. To remove this bias, we introduce a reward, $E$, for explaining as many of the STIPs as possible. We compute $E$ as:

$$E(M) = -\sum_j \min\Big( \sum_i M_{ij},\, 1 \Big) \|t_j\| \qquad (2)$$

This term computes the number of tracks/STIPs that have been assigned to a node in the AND-OR graph and therefore explained by the storyline model.

Spatio-Temporal Constraints: We also penalize matchings which violate the spatio-temporal constraints imposed by causal relationships. If $p_{ijkl}$ encodes the violation cost of having an incompatible pair of matches (node $i$ to track $j$ and node $k$ to track $l$), the term for the spatio-temporal violation cost is $T(X) = \sum_{ijkl} p_{ijkl}\, x_{ijkl}$.

This term prefers matchings that do not violate the spatio-temporal constraints imposed by the learned AND-OR graph. For example, the left parse in Figure 3 matches the 'pitching' and 'miss' actions to incorrect tracks, resulting in 'pitching' starting after 'batting' in the video, which is physically impossible. The tracks are also not in the typical pitcher-batter spatial configuration. Therefore, this matching has a high cost compared to the matching shown in the right parse.
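For concreteness, a minimal NumPy sketch of the three parsing cost terms is given below, evaluated for a candidate binary instantiation S, matching M and pairwise indicator X. The variable names and array shapes are assumptions made for illustration.

```python
import numpy as np

def appearance_cost(S, M, tracks, A):
    """Eq. (1): L1 distance between pooled track histograms and node appearance models.

    S      - (n,) 0/1 node instantiation
    M      - (n, T) 0/1 matching of nodes to (real) tracks
    tracks - (T, d) appearance histograms t_j of the tracks
    A      - (n, d) appearance histogram models A_i of the action nodes
    """
    pooled = M @ tracks                              # sum_j M_ij * t_j for every node i
    return np.abs(pooled - S[:, None] * A).sum()

def explanation_reward(M, tracks):
    """Eq. (2): negative reward for every track explained by some node."""
    assigned = np.minimum(M.sum(axis=0), 1.0)        # 1 if track j is matched to any node
    return -(assigned * np.abs(tracks).sum(axis=1)).sum()

def spatio_temporal_cost(X, P):
    """T(X) = sum_{ijkl} p_ijkl * x_ijkl for pairwise violation costs P."""
    return float((P * X).sum())

# Toy example: 2 nodes, 3 tracks, 4-bin histograms
rng = np.random.default_rng(0)
S = np.array([1, 1])
M = np.array([[1, 0, 0], [0, 1, 1]])
tracks, A = rng.random((3, 4)), rng.random((2, 4))
X, P = np.zeros((2, 3, 2, 3)), rng.random((2, 3, 2, 3))
total = appearance_cost(S, M, tracks, A) + explanation_reward(M, tracks) + spatio_temporal_cost(X, P)
```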

The above objective and constraints result in an Integer Program, which is NP-hard. We approximate the solution by relaxing the variables $S$, $M$ and $X$ to lie in $[0, 1]$. The result is a linear program, which can be solved very quickly. For the learning procedure, we have the annotated list of actions that occur in the video. We utilize these annotations to obtain a valid instantiation/storyline $S$ and then optimize the function over $M, X$ only. For inference, given a new video with no annotations, we simultaneously optimize the objective over $S, M, X$.

4.2. Generating New Storyline Model Proposals

After every few inner iterations of the algorithm, we search for a better storyline model to explain the matchings and causal relationships between actions. To do this, we generate new graph proposals based on local modifications to the AND-OR graph structure from the previous iteration.

The local modifications are: (1) deleting an edge and adding a new edge, (2) adding a new edge, and (3) adding a new node. The selection of edges to delete and add is random and based on an importance sampling procedure, where deletion of important edges is avoided and addition of important edges is preferred. The importance is defined on the basis of the likelihood that the head and tail of the edge are related by a causal relationship.
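A hedged sketch of this proposal step is shown below; the graph is represented as a set of directed edges with per-edge importance weights, and the sampling scheme (inverted weights for deletion) is an illustrative reading of the importance-sampling procedure rather than the authors' exact scheme. Node addition is omitted for brevity.

```python
import random

def propose_modifications(edges, candidate_edges, importance, n_proposals=5, seed=None):
    """Generate local modifications of an AND-OR graph's edge set (Sec. 4.2 sketch).

    edges           - list of (head, tail) edges currently in the graph
    candidate_edges - list of (head, tail) edges that could be added
    importance      - dict: edge -> likelihood that head and tail are causally related
    """
    rng = random.Random(seed)
    proposals = []
    for _ in range(n_proposals):
        new_edges = list(edges)
        move = rng.choice(["replace_edge", "add_edge"])

        if move == "replace_edge" and new_edges:
            # prefer deleting edges with LOW causal importance
            weights = [1.0 / (1e-6 + importance.get(e, 0.5)) for e in new_edges]
            new_edges.remove(rng.choices(new_edges, weights=weights, k=1)[0])

        addable = [e for e in candidate_edges if e not in new_edges]
        if addable:
            # prefer adding edges with HIGH causal importance
            weights = [importance.get(e, 0.5) for e in addable]
            new_edges.append(rng.choices(addable, weights=weights, k=1)[0])

        proposals.append(new_edges)
    return proposals

# Example: current chain pitch->hit->throw, candidate edges hit->run, run->throw
current = [("pitch", "hit"), ("hit", "throw")]
cands = [("hit", "run"), ("run", "throw")]
imp = {("pitch", "hit"): 0.9, ("hit", "throw"): 0.2, ("hit", "run"): 0.8, ("run", "throw"): 0.7}
props = propose_modifications(current, cands, imp, seed=0)
```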

4.3. Selecting the New Storyline Model

Each iteration selects the AND-OR graph from the set of modifications which best represents the storylines of the training videos. The criterion for selection is based on four different terms:

Track Matching Likelihood: The first criterion measures how well a proposal explains the matchings obtained in the previous parsing step. The matching of tracks to actions from the previous step is used to obtain the likelihood of an AND-OR graph generating such a matching. The likelihood of the $p$th graph proposal $G_r^p$ generating the pairwise matchings $X_{r-1}$ (at iteration $r-1$) is given by $\frac{1}{Z} \exp(-T_{G_r^p}(X_{r-1}))$. This likelihood is based on the third term from the parsing cost, but here the penalty terms are computed with respect to the individual graph proposals.

Annotation Likelihood: The AND-OR graph representing the storyline model should explain not only the matching of tracks to actions, but also the linguistic annotations associated with each video. The underlying idea is that the same storyline model is used to generate both the visual data and the linguistic annotations. The cost function measures how likely it is that an instantiation of the AND-OR graph storyline model accounts for the video's action annotations, and how well the constraints specified by the linguistic prepositions in the annotations are satisfied by the AND-OR graph constraints. For example, if the annotation for a training video includes 'pitching before hitting', a good AND-OR graph would not only generate a storyline including 'pitching' and 'hitting' but would also have a conditional distribution for the edge pitching $\rightarrow$ hitting such that $P(t_{hit} - t_{pitch} > 0 \mid \theta)$ is high.

Structure Complexity: If we only consider likelihoods based on linguistic and visual data, more complex graphs which represent large numbers of possibilities will always be preferred over simple graphs. Therefore, an important criterion for the selection of an AND-OR graph is that it should be simple. This provides a prior over the space of possible structures. We use a simplicity prior similar to [9], which prefers linear chains over non-linear structures.

Distribution Complexity: The complexity of an AND-OR graph depends not only on its graph structure, but also on the conditional distributions of children actions given parent actions. For an action $i$ (OR-node) in an AND-OR graph, we form a distribution over all possible successors, i.e., the sets of actions that could appear immediately after action $i$ in a storyline. The individual spatio-temporal conditional distributions between $i$ and its successors are combined into a single distribution over successors, and we compute the entropy of this combined distribution. The entropies of the successor distributions for all OR-nodes in the graph are averaged, providing a measure of the complexity of the conditional distributions contained in the AND-OR graph. Our cost prefers higher entropy distributions; empirically, we have found that this results in better ranking of structures. We can also draw intuition from work on maximum entropy Markov models [17], where higher entropy distributions are preferred when learning conditional distributions to prevent overfitting.
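A small sketch of this complexity measure is given below, assuming each OR-node stores a discrete probability distribution over its possible successors; the representation is illustrative.

```python
import numpy as np

def successor_entropy(p):
    """Shannon entropy (in nats) of a discrete successor distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0] / p.sum()
    return float(-(p * np.log(p)).sum())

def distribution_complexity(successor_dists):
    """Average successor-distribution entropy over all OR-nodes.

    successor_dists - dict: OR-node id -> probability vector over its successors.
    Higher values are preferred when ranking candidate structures (Sec. 4.3).
    """
    entropies = [successor_entropy(p) for p in successor_dists.values() if len(p) > 0]
    return float(np.mean(entropies)) if entropies else 0.0

# e.g. 'pitch' has two roughly equally likely successors, 'hit' has one
score = distribution_complexity({"pitch": [0.55, 0.45], "hit": [1.0]})
```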

4.4. Initializing the Search

For initialization, we need some plausible AND-OR causal graph to represent the storyline model, as well as appearance models of the actions. Establishing a causal sequence of actions from passive visual data is a difficult problem. While one can establish a statistical association between two variables $X$ and $Y$, inferring causality - whether $X \rightarrow Y$ or $Y \rightarrow X$ - is difficult. For initialization, we use the linguistic annotations of the videos. Based on psychological studies of causal learning, we use time as a cue to generate the initial storyline model [13]. If an action $A$ immediately precedes an action $B$, then $A$ is more likely to be the cause and $B$ is more likely to be the effect.

We initialize the AND-OR graph with the minimum number of nodes required to represent all the actions in the annotations of the training videos. Some actions may have more than one node due to their multiple occurrences in the same video and the different contexts in which the action occurs. For example, 'catch-fielder' can occur in a video under two different contexts: 'catching' in the outfield and 'catching' at a base are different and require different nodes in the AND-OR graph. Using Allen's interval temporal logic, we obtain weights for all possible edges in the graph, which are then selected in a greedy manner such that there is no cycle in the graph. Dummy AND-nodes are then inserted by predicting the likelihood of two activities occurring simultaneously in a video.
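The greedy, cycle-free edge selection can be sketched as follows. Edge weights are assumed to be precomputed from the temporal relations in the captions, and the reachability test is a simple illustrative implementation.

```python
def reachable(adj, src, dst):
    """True if dst can be reached from src in the directed graph adj (dict of sets)."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(adj.get(node, ()))
    return False

def greedy_acyclic_edges(weighted_edges):
    """Greedily add edges in decreasing weight order, skipping any that create a cycle.

    weighted_edges - list of (weight, head, tail); the weight is assumed to come from
    Allen-interval statistics over the captions ("A immediately precedes B", etc.).
    """
    adj, chosen = {}, []
    for w, head, tail in sorted(weighted_edges, reverse=True):
        if reachable(adj, tail, head):        # adding head->tail would close a cycle
            continue
        adj.setdefault(head, set()).add(tail)
        chosen.append((head, tail))
    return chosen

# Toy example: "pitch before hit" is strongly supported, "hit before pitch" weakly
edges = [(0.9, "pitch", "hit"), (0.1, "hit", "pitch"), (0.8, "hit", "run")]
print(greedy_acyclic_edges(edges))   # [('pitch', 'hit'), ('hit', 'run')]
```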

For the initialization of the appearance models, we use the approach proposed in [12]. Using spatio-temporal reasoning based on the prepositions and the co-occurrence of visual features, we obtain a one-to-one matching of tracks to actions which is used to learn the initial appearance models.

5. Experimental Evaluation

For our dataset, we manually chose video clips of a wide variety of individual plays from a set of baseball DVDs of the 2007 World Series and processed them as follows. We first detect humans using the human detector of [5]. Applied to each frame with a low detection threshold, the output of the detector is a set of detection windows which potentially contain humans. To create tracks, we perform agglomerative clustering of these detection windows over time, comparing windows in nearby frames according to the distance between their centroids and the similarity of their color histograms as measured by the chi-square distance. The resulting tracks can be improved by extending each track forwards and backwards in time using color histogram matching. STIPs that fall within the detection window of a track in a frame contribute to the track's appearance histogram.
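A simplified sketch of this track-building step is shown below; it links detections greedily frame to frame rather than performing full agglomerative clustering, and the distance thresholds are illustrative assumptions.

```python
import numpy as np

def chi_square(h1, h2, eps=1e-10):
    """Chi-square distance between two color histograms."""
    h1, h2 = np.asarray(h1, float), np.asarray(h2, float)
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def link_detections(detections, max_centroid_dist=40.0, max_hist_dist=0.3):
    """Greedy frame-to-frame linking of human detections into tracks.

    detections - list of (frame, centroid_xy, color_hist) tuples, sorted by frame.
    Returns a list of tracks, each a list of detections.
    """
    tracks = []
    for det in detections:
        frame, centroid, hist = det
        best, best_cost = None, None
        for track in tracks:
            pf, pc, ph = track[-1]
            if 0 < frame - pf <= 3:                       # only link temporally nearby frames
                cost = np.linalg.norm(np.array(centroid) - np.array(pc))
                if cost < max_centroid_dist and chi_square(hist, ph) < max_hist_dist:
                    if best_cost is None or cost < best_cost:
                        best, best_cost = track, cost
        if best is None:
            tracks.append([det])                          # start a new track
        else:
            best.append(det)
    return tracks

# e.g. two detections of the same person one frame apart, plus one far-away detection
h = np.ones(8) / 8
dets = [(0, (100, 50), h), (1, (103, 52), h), (1, (300, 60), h)]
tracks = link_detections(dets)   # two tracks: one of length 2, one of length 1
```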

Training: We trained the storyline model on 39 videos (individual baseball plays), consisting of approximately 8000 frames. The training videos contained both very short and very long plays. We evaluate the performance of our training algorithm in terms of the number of actions correctly matched to tracks. Figure 4 shows how this accuracy changed over the training process. The figure is divided into three colored blocks. Within each colored block, the structure of the storyline model remains the same and the approach iterates between parsing and parameter updates. At the end of each colored block, we update our storyline model and select a new storyline model which is then used to parse videos and estimate parameters. We can see that the accuracy rises significantly over the course of training, well above the initial baselines, validating our iterative approach to training. The improvement over Gupta et al. [12] is as much as 10%.

Figure 5(a) shows an example of how a parse for a video improves with iterations. Figure 5(b) shows an additional example of a video with its inferred storyline and matchings of actions to tracks; we can see that all but the run-fielder action are correctly matched.




Figure 4. Quantitative evaluation of training performance and how the storyline model changes with iterations. Within each colored block, the storyline model remains fixed and the algorithm iterates between parsing and parameter estimation. At the end of each colored block, the structure of the AND-OR graph is modified and a new structure is learned. The structural changes for the three iterations are shown.


Figure 5. (a) Improvement in assignments with iterations: in the first iteration, the true storyline for the video, pitching $\rightarrow$ hit $\rightarrow$ catching, is not a valid instantiation of the AND-OR graph. The closest plausible storyline involves activities such as run which have to be hallucinated in order to obtain assignments. However, as the storyline model improves in iteration 2, the true storyline becomes a valid storyline and is used for obtaining the assignments. (b) Another example of the assignments obtained in training. The assignments are shown by color coding; each track is associated with the node of the same color in the instantiated AND-OR graph.


Figure 6. Quantitative Evaluation of Labeling Accuracy

Storyline Extraction for New Videos: Our test set includes 42 videos from the same baseball corpus. Again, they ranged from very short and simple to longer and more complicated. We first evaluated the performance in terms of storyline extraction. Fig. 7 shows some qualitative examples of storyline extraction in terms of the instantiation of the AND-OR graph, the assignment of tracks to actions and the text that is generated from the storyline model. We use recall and precision values of action labeling to measure the performance of storyline extraction. We compare the performance of our approach to the baseline methods of Gupta et al. [12] and IBM Model 1 [3]. Figure 6 shows two bar plots, one for recall (left) and the other for precision (right). For the baseline methods, we show the average precision and recall values and compare against our method's performance (blocks of blue, red and green bars). Our method nearly doubles the precision of the baseline methods (0.8 vs. 0.4), and has a much higher recall (0.85 vs. 0.5 for [12] and 0.1 for [3]). It performs well over most of the actions, with the exception of the action Swing-Miss (low recall). We also evaluated the number of correct matchings obtained for the actions in the predicted storylines. Quantitatively, we obtained 70% correct assignments of tracks to actions.

We attribute the success of our approach to three factors: (1) An important reason for the improvement in training compared to Gupta et al. [12] is that they did not feed the contextual models learned at the end of their single iterative training loop back into re-learning the models of object appearance. (2) During inference, the coupling of actions via the AND-OR graph model provides a more structured model than simple context from co-occurrence statistics and binary relationship words can provide. (3) The one-to-many (action to track) matching framework used here is more powerful than the one-to-one framework in [12] and handles the problem of fragmented segmentation.

6. Conclusion

We proposed the use of a storyline model, which represents a set of actions and the causal relationships between those actions. Our contributions include: (1) Representation of the storyline model as an AND-OR graph whose compositional nature allows for compact encoding of substantial storyline variation across training videos.



Figure 7. Storyline extraction for new videos: we show the instantiation of the AND-OR graph obtained for each video and the story generated in text by our algorithm. The assignments of tracks to action nodes are shown by color coding.

(2) We also presented a method for learning storyline models from weakly annotated videos. (3) We formulated the parsing problem as a linear integer program. Our formulation permits one-to-many matching of actions to video tracks, addressing the problem of fragmented bottom-up segmentation. Experimental results show that harnessing the structure of videos helps in better assignment of tracks to actions during training. Furthermore, coupling actions into a structured model provides a richer contextual model that significantly outperformed two baselines that utilize priors based on co-occurrence and relationship words.

Acknowledgements: This research was funded by the US Government's VACE program, NSF-IIS-0447953 and NSF-IIS-0803538. Praveen Srinivasan was partially supported by an NSF Graduate Fellowship.

References

[1] K. Barnard, P. Duygulu, N. Freitas, D. Forsyth, D. Blei, and M. I. Jordan. Matching words and pictures. Journal of Machine Learning Research, pages 1107-1135, 2003.

[2] A. Bobick and Y. Ivanov. Action recognition using probabilistic parsing. CVPR, 1998.

[3] P. Brown, S. Pietra, V. Pietra, and R. Mercer. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 1993.

[4] H. Chen, Z. Jian, Z. Liu, and S. Zhu. Composite templates for cloth modeling and sketching. CVPR, 2006.

[5] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. CVPR, 2005.

[6] M. Everingham, J. Sivic, and A. Zisserman. Hello! My name is... Buffy - automatic naming of characters in TV video. BMVC, 2006.

[7] M. Fleischman and D. Roy. Situated models of meaning for sports video retrieval. Human Language Tech., 2007.

[8] N. Friedman. The Bayesian structural EM algorithm. UAI, 1998.

[9] N. Friedman and D. Koller. Being Bayesian about network structure: A Bayesian approach to structure discovery in Bayesian networks. Machine Learning, 2003.

[10] S. Gong and T. Xiang. Recognition of group activities using dynamic probabilistic networks. ICCV, 2003.

[11] A. Gupta and L. Davis. Objects in action: An approach for combining action understanding and object perception. CVPR, 2007.

[12] A. Gupta and L. Davis. Beyond nouns: Exploiting prepositions and comparative adjectives for learning visual classifiers. ECCV, 2008.

[13] D. Laganado and A. Solman. Time as a guide to cause. J. of Exper. Psychology: Learning, Memory & Cognition, 2006.

[14] I. Laptev. On space-time interest points. IJCV, 2005.

[15] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. CVPR, 2008.

[16] L. Lin, H. Gong, L. Li, and L. Wang. Semantic event representation and recognition using syntactic attribute graph grammar. PRL, 2009.

[17] A. McCallum, D. Freitag, and F. Pereira. Maximum entropy Markov models for information extraction and segmentation. ICML, 2000.

[18] C. Needham, P. Santos, R. Magee, V. Devin, D. Hogg, and A. Cohn. Protocols from perceptual observations. AI, 2005.

[19] J. Niebles, H. Wang, and L. Fei-Fei. Unsupervised learning of human action categories using spatio-temporal words. BMVC, 2006.

[20] N. Nitta, N. Babaguchi, and T. Kitahashi. Extracting actors, actions and events from sports video - a fundamental approach to story tracking. ICPR, 2000.

[21] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2000.

[22] J. Pearl. Heuristics: Intelligent Search Strategies for Computer Problem Solving. Addison-Wesley, 1984.

[23] S. Tran and L. Davis. Visual event modeling and recognition using Markov logic networks. ECCV, 2008.

[24] A. Wilson and A. Bobick. Parametric hidden Markov models for gesture recognition. PAMI, 1999.

[25] L. Zhu, C. Lin, H. Huang, Y. Chen, and A. Yuille. Unsupervised structure learning: Hierarchical recursive composition, suspicious coincidence, competitive exclusion. ECCV, 2008.

[26] S. Zhu and D. Mumford. A stochastic grammar of images. Foundations and Trends in Computer Graphics and Vision, 2006.