generating natural-language video descriptions using text-mined knowledge


Generating Natural-Language Video Descriptions Using Text-Mined Knowledge
Ray Mooney
Department of Computer Science
University of Texas at Austin

Joint work with Niveda Krishnamoorthy, Girish Malkarmenkar, Tanvi Motwani, Kate Saenko, and Sergio Guadarrama.

Integrating Language and Vision
Integrating natural language processing and computer vision is an important aspect of language grounding and has many applications.
NIPS-2011 Workshop on Integrating Language and Vision
NAACL-2013 Workshop on Vision and Language
CVPR-2013 Workshop on Language for Vision

Video Description Dataset (Chen & Dolan, ACL 2011)
2,089 YouTube videos with 122K multilingual descriptions.
Originally collected for paraphrase and machine translation examples.
Available at: http://www.cs.utexas.edu/users/ml/clamp/videoDescription/
Sample Video

Sample M-Turk Human Descriptions (average ~50 per video)
A MAN PLAYING WITH TWO DOGS
A man takes a walk in a field with his dogs.
A man training the dogs in a big field.
A person is walking his dogs.
A woman is walking her dogs.
A woman is walking with dogs in a field.
A woman is walking with four dogs outside.
A woman walks across a field with several dogs.
All dogs are going along with the woman.
dogs are playing
Dogs follow a man.
Several dogs follow a person.
some dog playing each other
Someone walking in a field with dogs.
very cute dogs
A MAN IS GOING WITH A DOG.
four dogs are walking with woman in field
the man and dogs walking the forest
Dogs are Walking with a Man.
The woman is walking her dogs.
A person is walking some dogs.
A man walks with his dogs in the field.
A man is walking dogs.
a dogs are running
A guy is training his dogs
A man is walking with dogs.
a men and some dog are running
A men walking with dogs.
A person is walking with dogs.
A woman is walking her dogs.
Somebody walking with his/her pets.
the man is playing with the dogs.
A guy training his dogs.
A lady is roaming in the field with his dogs.
A lady playing with her dogs.
A man and 4 dogs are walking through a field.
A man in a field playing with dogs.
A man is playing with dogs.

Our Video Description Task
Generate a short, declarative sentence describing a video in this corpus.
First generate a subject (S), verb (V), object (O) triplet for describing the video.

Next generate a grammatical sentence from this triplet.
A cat is playing with a ball.

A person is riding a motorbike.
SUBJECT   VERB   OBJECT
person    ride   motorbike
This can be done by correctly identifying the subject, the verb and the object in the video.

OBJECT DETECTIONS
cow         0.11
person      0.42
table       0.07
aeroplane   0.05
dog         0.15
motorbike   0.51
train       0.17
car         0.29

We start by running object detectors on each video frame.

SORTED OBJECT DETECTIONS
motorbike   0.51
person      0.42
car         0.29
aeroplane   0.05
These detections are then sorted by their confidence scores.

VERB DETECTIONS
hold    0.23
drink   0.11
move    0.34
dance   0.05
slice   0.13
climb   0.17
shoot   0.07
ride    0.19
Similarly, we run verb detectors trained on the spatio-temporal interest points in the video. These points, indicated by yellow circles, show us where interesting events occur.

SORTED VERB DETECTIONS
move    0.34
hold    0.23
ride    0.19
dance   0.05
The verb detections are also sorted by their confidence scores.
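The sorting step can be sketched in a few lines of Python. This is only an illustration using the confidence values from the slides; the function name `top_detections` and the cutoff `k` are assumptions, and how per-frame detections are aggregated into one score per label is not shown in the slides.

```python
from typing import Dict, List, Tuple


def top_detections(scores: Dict[str, float], k: int) -> List[Tuple[str, float]]:
    """Return the k highest-confidence (label, score) pairs, best first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


# Confidence values copied from the slides above.
object_scores = {"cow": 0.11, "person": 0.42, "table": 0.07, "aeroplane": 0.05,
                 "dog": 0.15, "motorbike": 0.51, "train": 0.17, "car": 0.29}
verb_scores = {"hold": 0.23, "drink": 0.11, "move": 0.34, "dance": 0.05,
               "slice": 0.13, "climb": 0.17, "shoot": 0.07, "ride": 0.19}

top_objects = top_detections(object_scores, k=3)
top_verbs = top_detections(verb_scores, k=3)
print(top_objects)  # [('motorbike', 0.51), ('person', 0.42), ('car', 0.29)]
print(top_verbs)    # [('move', 0.34), ('hold', 0.23), ('ride', 0.19)]
```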

Now that we have the sorted object and verb detections, we expand each verb with its most similar verbs to generate a larger set of potential words for describing the action.

EXPANDED VERBS
move:   move 1.0, walk 0.8, pass 0.8, ride 0.8
hold:   hold 1.0, keep 1.0
ride:   ride 1.0, go 0.8, move 0.8, walk 0.7
dance:  dance 1.0, turn 0.7, jump 0.7, hop 0.6
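A minimal sketch of the expansion step, assuming the similarity scores are available as a lookup table. The table below just copies the values from the slides; `SIMILAR_VERBS`, `expand_verbs`, and the tuple layout are hypothetical names chosen for illustration, and the 0.5 cutoff is the WUP threshold mentioned on the later Verb Expansion slide.

```python
from typing import Dict, List, Tuple

# Similarity scores copied from the slides; a real system would query a
# lexical resource instead of this hand-written (hypothetical) table.
SIMILAR_VERBS: Dict[str, Dict[str, float]] = {
    "move": {"walk": 0.8, "pass": 0.8, "ride": 0.8},
    "hold": {"keep": 1.0},
    "ride": {"go": 0.8, "move": 0.8, "walk": 0.7},
    "dance": {"turn": 0.7, "jump": 0.7, "hop": 0.6},
}

# (candidate verb, detected verb it came from, detector confidence, similarity)
Candidate = Tuple[str, str, float, float]


def expand_verbs(detected: List[Tuple[str, float]],
                 min_similarity: float = 0.5) -> List[Candidate]:
    """Add similar verbs to each detected verb, remembering where they came from."""
    candidates: List[Candidate] = []
    for verb, confidence in detected:
        # The detected verb itself is kept with similarity 1.0.
        candidates.append((verb, verb, confidence, 1.0))
        for similar, sim in SIMILAR_VERBS.get(verb, {}).items():
            if sim > min_similarity:
                candidates.append((similar, verb, confidence, sim))
    return candidates


expanded = expand_verbs([("move", 0.34), ("hold", 0.23)])
for candidate in expanded:
    print(candidate)
```

Keeping the originating verb and its detector confidence alongside each expanded verb makes it possible to discount expanded verbs by their similarity when scoring later.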

Pipeline so far: OBJECTS, VERBS, EXPANDED VERBS
Web-scale text corpora: GigaWord, BNC, ukWaC, WaCkypedia, GoogleNgrams

Get dependency parses, then extract the Subject-Verb-Object triplet:
A man rides a horse
det(man-2, A-1)
nsubj(rides-3, man-2)
root(ROOT-0, rides-3)
det(horse-5, a-4)
dobj(rides-3, horse-5)

Since vision detections can be noisy, especially when dealing with challenging YouTube videos, we need to incorporate some real-world knowledge that reflects the likelihood of various entities being the subject or object of a given activity. This can be learned from web-scale text corpora by mining for subject, verb, object triplets via dependency parses.
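To make the triplet extraction concrete, here is a small sketch that reads dependency relations in the textual form shown on the slide and pairs the `nsubj` and `dobj` dependents of the same verb. The function name `extract_svo` and the regular expression are assumptions for illustration; the sketch does not call a parser itself and does not lemmatize the verb.

```python
import re
from typing import Dict, List, Optional, Tuple

# Matches relations written like "nsubj(rides-3, man-2)".
RELATION = re.compile(r"(\w+)\((\S+)-(\d+), (\S+)-(\d+)\)")


def extract_svo(relations: List[str]) -> Optional[Tuple[str, str, str]]:
    """Return (subject, verb, object) for the first verb with both nsubj and dobj."""
    subjects: Dict[Tuple[str, str], str] = {}
    objects: Dict[Tuple[str, str], str] = {}
    for line in relations:
        match = RELATION.fullmatch(line.strip())
        if match is None:
            continue  # skip anything that is not a well-formed relation
        rel, head, head_idx, dep, _dep_idx = match.groups()
        key = (head, head_idx)  # the position index disambiguates repeated words
        if rel == "nsubj":
            subjects[key] = dep
        elif rel == "dobj":
            objects[key] = dep
    for key, subject in subjects.items():
        if key in objects:
            return (subject, key[0], objects[key])
    return None


parse = [
    "det(man-2, A-1)",
    "nsubj(rides-3, man-2)",
    "root(ROOT-0, rides-3)",
    "det(horse-5, a-4)",
    "dobj(rides-3, horse-5)",
]
print(extract_svo(parse))  # ('man', 'rides', 'horse')
```

In practice the verb would presumably be reduced to its stem ("ride") before counting, so that inflected forms share statistics.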

SVO Language Model

We then train a language model on these subject, verb, object triplets using Kneser-Ney smoothing. A language model assigns a probability to a word sequence, estimating how likely a person would be to say it. This S-V-O language model is now capable of ranking triplets by their real-world likelihood.
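The idea of ranking triplets by likelihood can be illustrated with a toy count-based model. Note that this sketch deliberately swaps in simple add-k smoothing in place of Kneser-Ney, which is more involved; the class name, the value of `k`, and the tiny training list are all made up for illustration and are not the system's actual model or data.

```python
import math
from collections import Counter
from typing import Iterable, Tuple

Triplet = Tuple[str, str, str]


class SmoothedSVOModel:
    """Scores P(S) * P(V | S) * P(O | S, V) with add-k smoothing (NOT Kneser-Ney)."""

    def __init__(self, triplets: Iterable[Triplet], k: float = 0.1) -> None:
        self.k = k
        self.s_counts: Counter = Counter()
        self.sv_counts: Counter = Counter()
        self.svo_counts: Counter = Counter()
        self.vocab = set()
        self.total = 0
        for s, v, o in triplets:
            self.s_counts[s] += 1
            self.sv_counts[(s, v)] += 1
            self.svo_counts[(s, v, o)] += 1
            self.vocab.update((s, v, o))
            self.total += 1

    def _smoothed(self, count: int, context_count: int) -> float:
        # Add-k estimate; unseen events get a small non-zero probability.
        return (count + self.k) / (context_count + self.k * len(self.vocab))

    def log_prob(self, s: str, v: str, o: str) -> float:
        p_s = self._smoothed(self.s_counts[s], self.total)
        p_v = self._smoothed(self.sv_counts[(s, v)], self.s_counts[s])
        p_o = self._smoothed(self.svo_counts[(s, v, o)], self.sv_counts[(s, v)])
        return math.log10(p_s) + math.log10(p_v) + math.log10(p_o)


# Hypothetical mined triplets, standing in for a web-scale corpus.
toy_triplets = ([("person", "ride", "motorbike")] * 3
                + [("person", "walk", "dog")] * 2
                + [("dog", "chase", "ball")])
model = SmoothedSVOModel(toy_triplets)
plausible = model.log_prob("person", "ride", "motorbike")
implausible = model.log_prob("motorbike", "run", "person")
print(plausible > implausible)  # True
```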

Regular Language Model

We also train a regular language model on the Google N-gram corpus.

CONTENT PLANNING: The best triplet can now be selected by combining the vision detection scores with the score from the language model.

SURFACE REALIZATION: In the final step we generate a set of candidate descriptions using a template-based approach on the Subject, Verb, Object triplet. The regular language model picks the best description for the video from these candidate descriptions.
Result: A person is riding a motorbike.

Object Detection
Used Felzenszwalb et al.'s (2008) pretrained deformable part models.
Covers 20 PASCAL VOC object categories:
Aeroplanes, Bicycles, Birds, Boats, Bottles, Buses, Cars, Cats, Chairs, Cows, Dining tables, Dogs, Horses, Motorbikes, People, Potted plants, Sheep, Sofas, Trains, TV/Monitors

Activity Detection Process
Parse video descriptions to find the majority verb stem for describing each training video.
Automatically create activity classes from the video training corpus by clustering these verbs.
Train a supervised activity classifier to recognize the discovered activity classes.

Automatically Discovering Activity Classes
Figure (flow): Video Clips -> NL Descriptions -> ~314 Verb Labels -> Hierarchical Clustering -> activity clusters
Sample NL descriptions for three video clips:
dance: A girl is dancing. / A young woman is dancing ritualistically. / Indian women are dancing in traditional costumes. / Indian women dancing for a crowd. / The ladies are dancing outside.
play: A puppy is playing in a tub of water. / A dog is playing with water in a small tub. / A dog is sitting in a basin of water and playing with the water. / A dog sits and plays in a tub of water.
cut: A man is cutting a piece of paper in half lengthwise using scissors. / A man cuts a piece of paper. / A man is cutting a piece of paper. / A man is cutting a paper by scissor. / A guy cuts paper. / A person doing something
Verb labels: play # throw # hit # dance # jump # cut # chop # slice # ..
Resulting clusters: cut, chop, slice; dance, jump; throw, hit; play

Automatically Discovering Activities and Producing Labeled Training Data
Hierarchical agglomerative clustering using the res (Resnik) metric from WordNet::Similarity (Pedersen et al.).
We cut the resulting hierarchy to obtain 58 activity clusters.
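The clustering step can be sketched with a small pure-Python agglomerative procedure. Everything numeric here is an assumption: the pairwise similarities are invented stand-ins for WordNet-based scores, average linkage is one possible choice (the slides do not say which linkage was used), and the threshold plays the role of "cutting" the hierarchy.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set

# Hypothetical pairwise similarities, invented for illustration only.
SIMILARITY: Dict[FrozenSet[str], float] = {
    frozenset(("cut", "chop")): 0.9,
    frozenset(("cut", "slice")): 0.9,
    frozenset(("chop", "slice")): 0.8,
    frozenset(("dance", "jump")): 0.7,
    frozenset(("throw", "hit")): 0.7,
}


def similarity(a: str, b: str) -> float:
    """Look up a pair's similarity; unlisted pairs get a low default."""
    return SIMILARITY.get(frozenset((a, b)), 0.1)


def average_link(c1: Set[str], c2: Set[str]) -> float:
    """Mean similarity over all cross-cluster verb pairs."""
    pairs = [(a, b) for a in c1 for b in c2]
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


def cluster_verbs(verbs: List[str], threshold: float) -> List[Set[str]]:
    """Merge the most similar clusters until none are at least `threshold` similar."""
    clusters: List[Set[str]] = [{v} for v in verbs]
    while len(clusters) > 1:
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda p: average_link(clusters[p[0]], clusters[p[1]]))
        if average_link(clusters[i], clusters[j]) < threshold:
            break  # stopping here corresponds to cutting the hierarchy
        clusters[i] |= clusters[j]
        del clusters[j]  # safe because i < j
    return clusters


verbs = ["play", "dance", "cut", "chop", "slice", "jump", "throw", "hit"]
clusters = cluster_verbs(verbs, threshold=0.5)
print(clusters)
```

With these made-up scores the procedure recovers the groupings shown on the slide (cut/chop/slice, dance/jump, throw/hit, and play on its own).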

Creating Labeled Activity Data
Figure: each training video's NL descriptions are mapped to the discovered activity cluster containing its verb.
A woman is riding a horse on the beach. / A woman is riding a horse. -> ride, walk, run, move, race
A woman is riding horse on a trail. / A woman is riding on a horse. -> ride, walk, run, move, race
A group of young girls are dancing on stage. / A group of girls perform a dance onstage. -> dance, jump
A girl is dancing. / A young woman is dancing ritualistically. -> dance, jump
A man is cutting a piece of paper in half lengthwise using scissors. / A man cuts a piece of paper. -> cut, chop, slice
Clusters shown: climb, fly; ride, walk, run, move, race; cut, chop, slice; dance, jump; play; throw, hit

Supervised Activity Recognition
Extract video features for Spatio-Temporal Interest Points (STIPs) (Laptev et al., CVPR-2008):
Histograms of Oriented Gradients (HoG)
Histograms of Optical Flow (HoF)
Use extracted features to train a Support Vector Machine (SVM) to classify videos.

Activity Recognizer using Video Features
Figure: a training video and its NL descriptions (A woman is riding horse in a beach. / A woman is riding on a horse.) receive the discovered activity label (ride, walk, run, move, race). An SVM is trained on STIP features and activity cluster labels.

Selecting SVO Just Using Vision (Baseline)
Top object detection from vision = Subject
Next highest object detection = Object
Top activity detection = Verb

Sample SVO Selection
Top object detections:
person: 0.67
motorbike: 0.56
dog: 0.11
Top activity detections:
ride: 0.41
keep_hold: 0.32
lift: 0.23
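The baseline rule is simple enough to write down directly. The sketch below applies it to the sample detections; the function name `vision_baseline_triplet` is a hypothetical choice for illustration.

```python
from typing import Dict, Tuple


def vision_baseline_triplet(object_scores: Dict[str, float],
                            activity_scores: Dict[str, float]) -> Tuple[str, str, str]:
    """Top object = subject, next object = object, top activity = verb."""
    ranked_objects = sorted(object_scores, key=object_scores.get, reverse=True)
    subject, obj = ranked_objects[0], ranked_objects[1]
    verb = max(activity_scores, key=activity_scores.get)
    return (subject, verb, obj)


# Sample detections copied from the slide.
sample_objects = {"person": 0.67, "motorbike": 0.56, "dog": 0.11}
sample_activities = {"ride": 0.41, "keep_hold": 0.32, "lift": 0.23}
baseline = vision_baseline_triplet(sample_objects, sample_activities)
print(baseline)  # ('person', 'ride', 'motorbike')
```

Because the rule depends only on the ranking of detector confidences, swapping the scores of the top two objects also swaps subject and object, which is why noisy detections hurt this baseline.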

Vision triplet: (person, ride, motorbike)

Evaluating SVO Triples
A ground-truth SVO for a test video is determined by picking the most common S, V, and O used to describe this video (as determined by dependency parsing).
Predicted S, V, and O are compared to ground truth using two metrics:
Binary: 1 or 0 for exact match or not
WUP: Compare predicted word to ground truth using the WUP semantic word similarity score from WordNet Similarity (0 ≤ WUP ≤ 1)

Experiment Design
Selected 235 potential test videos that contain VOC objects based on object names (or synonyms) appearing in their descriptions.
Used remaining 1,735 videos to discover activity clusters, keeping clusters with at least 9 videos.
Keep training and test videos whose verb is in the 58 discovered clusters.
1,596 training videos
185 test videos

Baseline SVO Results
Binary Accuracy
                  Subject   Verb     Object   All
Vision baseline   71.35%    8.65%    29.19%   1.62%
WUP Accuracy
                  Subject   Verb     Object   All
Vision baseline   87.76%    40.20%   61.18%   63.05%

Vision Detections are Faulty!
Top object detections:
motorbike: 0.67
person: 0.56
dog: 0.11
Top activity detections:
go_run_bowl_move: 0.41
ride: 0.32
lift: 0.23

Vision triplet: (motorbike, go_run_bowl_move, person)

Using Text-Mining to Determine SVO Plausibility
Build a probabilistic model to predict the real-world likelihood of a given SVO.
P(person,ride,motorbike) > P(motorbike,run,person)
Run the Stanford dependency parser on a large text corpus, and extract the S, V, and O for each sentence.
Train a trigram language model on this SVO data, using Kneser-Ney smoothing to back off to SV and VO bigrams.

Text Corpora
Corpora                         Size of text (# words)
British National Corpus (BNC)   100M
GigaWord                        1B
ukWaC                           2B
WaCkypedia_EN                   800M
GoogleNgrams                    10^12
Stanford dependency parses from the first 4 corpora used to build the SVO language model.

Full language model used for surface realization trained on GoogleNgrams using BerkeleyLM

SVO Language Model (sample scores)
person hit ball         -1.17
person ride motorcycle  -1.3
person walk dog         -2.18
person park bat         -4.76
car move bag            -5.47
car move motorcycle     -5.52

Verb Expansion
Given the poor performance of activity recognition, it is helpful to expand the set of verbs considered beyond those actually in the predicted activity clusters.
We also consider all verbs with a high WordNet WUP similarity (>0.5) to a word in the predicted clusters.

Sample Verb Expansion Using WUP
move: go 1.0, walk 0.8, pass 0.8, follow 0.8, fly 0.8, fall 0.8, come 0.8, ride 0.8, run 0.67, chase 0.67, approach 0.67

Integrated Scoring of SVOs
Consider the top n=5 detected objects and the top k=10 verb detections (plus their verb expansions) for a given test video.
Construct all possible SVO triples from these nouns and verbs.
Pick the best overall SVO using a metric that combines evidence from both vision and language.

Combining SVO Scores
Linearly interpolate vision and language-model scores.
Compute SVO vision score assuming independence of components and taking into account similarity of expanded verbs.
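The slide's formulas did not survive in this transcript, so the following is only one reading of the two statements above, not the authors' exact definition. It assumes the vision score is the product of the subject, verb, and object confidences times the expanded verb's similarity (independence), and that the final score is `w1 * vision + w2 * language`. The weight name `w2`, the default weights, the toy language-model table, and any score normalization are assumptions.

```python
from typing import Callable, Dict, List, Optional, Tuple

Triplet = Tuple[str, str, str]


def vision_score(p_subject: float, p_verb: float,
                 verb_similarity: float, p_object: float) -> float:
    """Independence assumption: multiply the component confidences."""
    return p_subject * p_verb * verb_similarity * p_object


def combined_score(vision: float, language: float, w1: float, w2: float) -> float:
    """Linear interpolation of the vision and language-model scores."""
    return w1 * vision + w2 * language


def best_triplet(objects: Dict[str, float],
                 verbs: List[Tuple[str, float, float]],
                 lm_score: Callable[[str, str, str], float],
                 w1: float = 0.5, w2: float = 0.5) -> Tuple[Optional[Triplet], float]:
    """Score every (S, V, O) built from the detections and return the best one.

    verbs holds (candidate verb, detector confidence, similarity to detected verb).
    Subject and object may be the same noun, as in "person,follow,person".
    """
    best: Optional[Triplet] = None
    best_score = float("-inf")
    for s, p_s in objects.items():
        for o, p_o in objects.items():
            for v, p_v, sim in verbs:
                score = combined_score(vision_score(p_s, p_v, sim, p_o),
                                       lm_score(s, v, o), w1, w2)
                if score > best_score:
                    best, best_score = (s, v, o), score
    return best, best_score


# Hypothetical language-model table; unseen triplets get a low default score.
TOY_LM = {("person", "ride", "motorbike"): -1.3}


def toy_lm(s: str, v: str, o: str) -> float:
    return TOY_LM.get((s, v, o), -6.0)


objects = {"motorbike": 0.67, "person": 0.56, "dog": 0.11}
verbs = [("move", 0.41, 1.0), ("ride", 0.32, 1.0)]
best, score = best_triplet(objects, verbs, toy_lm)
print(best)  # ('person', 'ride', 'motorbike')
```

Here the language-model evidence overrides the faulty vision ranking: the detectors prefer motorbike as subject, but the combined score selects the plausible triplet.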

Sample Reranked SVOs
Baseline vision triplet: motorbike, march, person
person,ride,motorcycle   -3.02
person,follow,person     -3.31
person,push,person       -3.35
person,move,person       -3.42
person,run,person        -3.50
person,come,person       -3.51
person,fall,person       -3.53
person,walk,person       -3.61
motorcycle,come,person   -3.63
person,pull,person       -3.65

Sample Reranked SVOs
Baseline vision triplet: person, move, dog
person,walk,dog          -3.35
person,follow,person     -3.35
dog,come,person          -3.46
person,move,person       -3.46
person,run,person        -3.52
person,come,person       -3.55
person,fall,person       -3.57
person,come,dog          -3.62
person,walk,person       -3.65
person,go,dog            -3.70

SVO Accuracy Results (w1 = 0)
Binary Accuracy
                             Subject   Activity   Object   All
Vision baseline              71.35%    8.65%      29.19%   1.62%
SVO LM (No Verb Expansion)   85.95%    16.22%     24.32%   11.35%
SVO LM (Verb Expansion)      85.95%    36.76%     33.51%   23.78%
WUP Accuracy
                             Subject   Activity   Object   All
Vision baseline              87.76%    40.20%     61.18%   63.05%
SVO LM (No Verb Expansion)   94.90%    63.54%     69.39%   75.94%
SVO LM (Verb Expansion)      94.90%    66.36%     72.74%   78.00%

Surface Realization: Template + Language Model
Input:
The best SVO triplet from the content planning stage
Best fitting preposition connecting the verb & object (mined from text corpora)

Template: Determiner + Subject + (conjugated Verb) + Preposition (optional) + Determiner + Object

Generate all sentences fitting this template and rank them using a Language Model trained on Google NGrams
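A rough sketch of template filling and ranking follows. The determiner list, the caller-supplied verb forms, and the tiny bigram table standing in for the Google N-gram language model are all assumptions made for illustration; the scores in the table are invented.

```python
from itertools import product
from typing import Callable, Dict, List, Optional, Tuple

DETERMINERS = ["a", "the"]  # assumed determiner set


def candidate_sentences(subject: str, verb_forms: List[str], obj: str,
                        preposition: Optional[str] = None) -> List[str]:
    """Fill Determiner + Subject + Verb + Preposition(optional) + Determiner + Object."""
    preps = [""] if preposition is None else ["", preposition]
    sentences = []
    for det_s, verb, prep, det_o in product(DETERMINERS, verb_forms, preps, DETERMINERS):
        words = [w for w in (det_s, subject, verb, prep, det_o, obj) if w]
        sentence = " ".join(words)
        sentences.append(sentence[0].upper() + sentence[1:] + ".")
    return sentences


def realize(subject: str, verb_forms: List[str], obj: str,
            score: Callable[[str], float], preposition: Optional[str] = None) -> str:
    """Return the candidate sentence that the scoring function ranks highest."""
    return max(candidate_sentences(subject, verb_forms, obj, preposition), key=score)


# Invented bigram scores standing in for a real language model.
TOY_BIGRAMS: Dict[Tuple[str, str], float] = {
    ("a", "person"): -1.0, ("the", "person"): -1.5,
    ("person", "is"): -0.5, ("is", "riding"): -0.7, ("person", "rides"): -2.0,
    ("riding", "a"): -0.8, ("riding", "the"): -1.2,
    ("rides", "a"): -0.9, ("rides", "the"): -1.3,
    ("a", "motorbike"): -1.0, ("the", "motorbike"): -1.4,
}


def toy_score(sentence: str) -> float:
    """Average bigram score, so longer candidates are not penalized for length."""
    tokens = sentence.lower().rstrip(".").split()
    bigrams = list(zip(tokens, tokens[1:]))
    return sum(TOY_BIGRAMS.get(b, -5.0) for b in bigrams) / len(bigrams)


best_sentence = realize("person", ["rides", "is riding"], "motorbike",
                        toy_score, preposition="on")
print(best_sentence)  # A person is riding a motorbike.
```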

Automatic Evaluation of Sentence Quality

Evaluate generated sentences using standard Machine Translation (MT) metrics.
Treat all human-provided descriptions as reference translations.

Human Evaluation of Descriptions
Asked 9 unique MTurk workers to evaluate descriptions of each test video.
Asked to choose between vision-baseline sentence, SVO-LM (VE) sentence, or neither.
Gold-standard item included in each HIT to exclude unreliable workers.
When preference expressed, 61.04% preferred SVO-LM (VE) sentence.
For 84 videos where the majority of judges had a clear preference, 65.48% preferred the SVO-LM (VE) sentence.

Examples where we outperform the baseline

Examples where we underperform the baseline

Discussion Points
Human judges seem to care more about correct objects than correct verbs, which helps explain why their preferences are not as pronounced as differences in SVO scores.
Novelty of YouTube videos (e.g. someone dragging a cat on the floor) mutes the impact of an SVO model learned from ordinary text.

Future Work
Larger-scale experiment using bigger sets of objects (ImageNet) and activities.
Ability to generate more complex sentences with adjectives, adverbs, multiple objects, and scenes.
Ability to generate multi-sentential descriptions of longer videos with multiple events.

Conclusions
Grounding language in vision is a fundamental problem with many applications.
We have developed a preliminary broad-scale video description system.
Mining common-sense knowledge (e.g. an SVO model) from large-scale parsed text improves performance across multiple evaluations.
Many directions for improving the complexity and coverage of both language and vision components.
