Copyright © 2004
Informedia Contexture: Analyzing and Synthesizing Video and Verbal Context for Intelligence Analysis Dialogues
Alex Hauptmann, Howard Wactlar
Carnegie Mellon University
Pittsburgh, USA
October 2004
Our Meaning of Contexture
Definition: The weaving or assembly of [multimedia] parts into a cohesive whole in order to provide a more complete picture or information structure [to both questions and answers]
• Interpreting and communicating an associated visual and verbal context to information
  • May contain language, imagery and gestures
  • May illuminate the meaning or significance
  • May explain its circumstances
  • More like a collegial expert response than an encyclopedic source
• Accelerating discovery by both system and analyst
• Understanding video perspectives
  • Subtle opinions, attitudes, biases
  • Both visual and textual rhetoric
• Continuously updating video biographs and event timelines
[Figure: video biograph timeline of Wesley Clark, with entries dated 3/18/1998, 7/2/1999, 9/11/2001, 9/19/2001, 3/19/2003 and 9/17/2003. Entry types: People Association, Interviewee, Monologue, Word Quotation, Visual Quotation. Outputs: biograph data, visualizations, synthesized video clips, contextual info.]
Entry captions:
• A word quotation of Retired Wesley Clark describing for the president in the …
• Being a CNN Military Analyst on military deployment in preparation for the wa…
• Being an interviewee in a news program as a presidential candidate for …
• Wesley Clark (without being mentioned in the video clip) sitting next to Madeleine … hearing on a resolution that would direct President Clinton to …
• A visual quotation of Wesley Clark (referred commander) describing President Milosevic’s intent in …
• Being a CNN Military Analyst on action in the Iraq War.
Conceptual Overview
Context Analysis
- Semantic relations on entities
- Harden with structured data
- Perspective interpretation
- Video ontology
Extract Semantic Data
- Scene classification
- Event detection
- Title/topic labeling
- Named entity extraction
- Verify entities with structured data
Generate Information Contexture
- Understand questions
- Provide context-rich answers
- Produce video biographs
- Enable context-based iterative QA process
Analyst
Analyst’s Profile & QA History
Multiple Multimedia Information Sources
- Structured Data
- Domestic Sources
- Foreign Sources
Scope of Work
• Extracting information from video sources for finding answers
• Applying broadcast TV news ontology (joint with USC)
• Understanding multimedia questions and their context
• Understanding the bias of source, topical, or rhetorical perspectives
• Integration and evaluation
• Incorporating video biographs and perspectives into answer contextures
• Contexture dialogue
• Learning from the analyst
Understanding Multimedia Questions
1. Find shots of Pope John Paul II.
2. Find shots of a rocket or missile taking off.
3. Find shots of the Tomb of the Unknown Soldier at Arlington National Cemetery.
4. Find shots of the front of the White House in the day-time with the fountain running.
Automatic Video Retrieval System
Multi-modal Query
Pope John Paul II
Video Library
Weighted Fusion of Similarity Rankings
Final Ranked List of Video Shots
…
Multiple Modality Video Analysis Experts
• Speech Trans.
• Video OCR
• Color Feature
• Semantic Class Filter
• Audio Feature
• Texture Feature
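The weighted fusion step named on this slide can be illustrated with a minimal sketch. It assumes each expert returns a similarity score per shot on a comparable scale and that fusion is a plain weighted sum; the function name, expert names, scores, and weights are hypothetical and not taken from the Informedia system.

```python
def fuse_rankings(expert_scores, weights):
    """Combine per-expert similarity scores into one ranked list of shots.

    expert_scores: {expert_name: {shot_id: similarity}} (hypothetical layout)
    weights:       {expert_name: weight}; experts without a weight count as 0
    """
    fused = {}
    for expert, scores in expert_scores.items():
        w = weights.get(expert, 0.0)
        for shot_id, similarity in scores.items():
            # A shot missing from an expert's list gets no contribution from it.
            fused[shot_id] = fused.get(shot_id, 0.0) + w * similarity
    # Highest fused score first; ties broken by shot id for determinism.
    return sorted(fused, key=lambda shot_id: (-fused[shot_id], shot_id))


# Illustrative scores and weights only.
expert_scores = {
    "speech_transcript": {"shot1": 0.9, "shot2": 0.2},
    "color_feature": {"shot2": 0.8, "shot3": 0.5},
}
weights = {"speech_transcript": 0.7, "color_feature": 0.3}
ranked = fuse_rankings(expert_scores, weights)
```

With these made-up numbers the text-heavy weighting places the shot found by the speech expert first; changing the weights changes the order, which is why the choice of weights matters.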
Finding the combination weights
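[Diagram: Offline, Training Queries and Training Data from the Video Library are used to Learn Weights. Online, a Query is run against the Video Library and the Similarity rankings from multiple experts are combined using the learned weights.]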
Finding the combination weights
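[Diagram: Offline, Training Queries are passed through Classify Queries, and Training Data from the Video Library is used to Learn Weights. Online, the Query is also passed through Classify Queries before the Similarity rankings from multiple experts are combined.]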
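One way to picture the offline step is the sketch below. It is an assumption, not the method confirmed by the slides: it fits one weight vector per query class with logistic regression trained by batch gradient ascent, where each training example is a list of expert scores for a (query, shot) pair plus a relevance label. All names and numbers are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def learn_weights(examples, n_experts, epochs=200, lr=0.5):
    """Fit combination weights by batch gradient ascent on the log-likelihood.

    examples: non-empty list of (expert_scores, label); label 1 = relevant.
    """
    weights = [0.0] * n_experts
    for _ in range(epochs):
        grad = [0.0] * n_experts
        for scores, label in examples:
            p = sigmoid(sum(w * s for w, s in zip(weights, scores)))
            for k in range(n_experts):
                grad[k] += (label - p) * scores[k]
        # Average the gradient over the examples and take one ascent step.
        weights = [w + lr * g / len(examples) for w, g in zip(weights, grad)]
    return weights


def learn_weights_by_class(training, n_experts):
    """training: {query_class: examples}; returns {query_class: weights}."""
    return {cls: learn_weights(ex, n_experts) for cls, ex in training.items()}


# Toy data: expert 0 separates relevant from non-relevant shots,
# expert 1 returns the same score for everything.
training = {
    "person": [([0.9, 0.5], 1), ([0.8, 0.5], 1), ([0.1, 0.5], 0), ([0.2, 0.5], 0)],
}
class_weights = learn_weights_by_class(training, n_experts=2)
```

On this toy data the informative expert ends up with the larger weight. Online, the weights for the query's class would be looked up and applied when fusing the expert rankings.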
Query Types for Video Retrieval
• Named person queries, possibly with constraints.
  “Find shots of Yasser Arafat”, “Find shots of Ronald Reagan speaking”.
• Named object queries for an object with a unique name.
  “Find shots of the Statue of Liberty”, “Find shots of the Mercedes logo”.
• General object queries for a type of object. They may be qualified.
  “Find shots of snow-covered mountains”, “Find shots of one or more cats”.
• Scene queries for multiple types of objects in certain relationships.
  “Find shots of roads with lots of vehicles”, “Find shots of people spending leisure time on the beach”.
Finding the Combination Weights for Merging Search Results
• Uniform, fixed weights for all queries
• Individual weightings for each query
  • Not enough known about each query
• Weightings for each of 4 query types
Text search usually does better and is more consistent than any other single search modality
Query Classification
Query X: Named-Entity extraction
• people name: Person Q
  “Find shots of Bill Clinton”
• organization or location name: Specific Object Q
  “Find shots of Capitol Hill”
• no proper name: POS tagging + NP chunking
  • multiple NPs: Scene Q
    “Find shots with (multiple pedestrians) and (multiple vehicles in motion)”
  • single NP: Syntactic parsing
    • nested NPs: Scene Q
      “Find shots of (a person diving into the water)”
    • no nested NP: General Object Q
      “Find shots of (one or more cats)”
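The decision tree on this slide can be sketched as a few rules. This is a simplified illustration under stated assumptions: the two small name sets stand in for a real named-entity extractor, and the noun phrases and the nested-NP flag are assumed to come from an upstream chunker and parser that are not shown. The function and variable names are hypothetical.

```python
# Toy gazetteers standing in for a real named-entity extractor (assumption).
PEOPLE = {"bill clinton", "yasser arafat", "ronald reagan"}
PLACES_AND_ORGS = {"capitol hill", "white house", "statue of liberty"}


def classify_query(query, noun_phrases, has_nested_np=False):
    """Return one of: 'person', 'specific object', 'scene', 'general object'.

    noun_phrases:  NP chunks found in the query (from a hypothetical chunker)
    has_nested_np: whether the single NP contains another NP (from a parser)
    """
    text = query.lower()
    # Step 1: named-entity lookup.
    if any(name in text for name in PEOPLE):
        return "person"
    if any(name in text for name in PLACES_AND_ORGS):
        return "specific object"
    # Step 2: no proper name, so count the noun phrases.
    if len(noun_phrases) > 1:
        return "scene"
    # Step 3: a single NP is a scene only if it nests another NP.
    if has_nested_np:
        return "scene"
    return "general object"


examples = [
    ("Find shots of Bill Clinton", ["Bill Clinton"], False),
    ("Find shots of Capitol Hill", ["Capitol Hill"], False),
    ("Find shots with multiple pedestrians and multiple vehicles in motion",
     ["multiple pedestrians", "multiple vehicles in motion"], False),
    ("Find shots of a person diving into the water",
     ["a person diving into the water"], True),
    ("Find shots of one or more cats", ["one or more cats"], False),
]
labels = [classify_query(q, nps, nested) for q, nps, nested in examples]
```

Each query receives exactly one label in this sketch, so a query that mixes a person and an object is still forced into a single class.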
Hierarchical Mixture of Experts
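[Diagram: the Query and the Video Shots feed Text Retrieval and Retrieval Experts 1 to n. Each expert k outputs a relevance estimate of the form P(R | q, s), and the estimates are combined in a weighted sum over k = 1, ..., n, with the Query Type as an input to the combination.]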
Performance of different weighting schemes
[Bar chart: MAP, PREC@10 and PREC@30 (y-axis from 0.1 to 0.35) for Text Only, Q-Uniform Oracle, Q-Type Single MoE and Q-Type Hier MoE.]
Performance of different weighting schemes
[Bar chart: MAP, PREC@10 and PREC@30 (y-axis from 0.1 to 0.4) for Text Only, Q-Uniform Oracle, Q-Type Single MoE, Q-Type Hier MoE and Q-Specific Oracle.]
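For readers unfamiliar with the measures on the chart axis, the sketch below shows the standard definitions of precision at k and (mean) average precision. It assumes average precision is normalized by the total number of relevant shots; the exact evaluation settings behind the chart are not stated on the slide, and the function names and data are hypothetical.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked shots that are relevant."""
    return sum(1 for shot in ranked[:k] if shot in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant shot appears."""
    hits = 0
    total = 0.0
    for rank, shot in enumerate(ranked, start=1):
        if shot in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs: list of (ranked_list, relevant_set), one entry per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy ranked list: relevant shots appear at ranks 1 and 3.
ranked = ["a", "b", "c", "d"]
relevant = {"a", "c"}
ap = average_precision(ranked, relevant)   # (1/1 + 2/3) / 2
p_at_2 = precision_at_k(ranked, relevant, 2)
```

MAP rewards placing relevant shots early across the whole list, while PREC@10 and PREC@30 only look at the first 10 or 30 results.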
Current Limitations
• Unable to assign multiple query types to one query
  • “Finding Bill Clinton speaking in front of a US flag” (person, object)
• Unable to capture the query-specific aspects
  • “Finding day-time scenes of the Federal Reserve Building, Washington DC”
Scope of Work
• Extracting information from video sources for finding answers
• Applying broadcast TV news ontology (joint with USC)
• Understanding multimedia questions and their context
• Understanding the bias of source, topical, or rhetorical perspectives
• Integration and evaluation
• Incorporating video biographs and perspectives into answer contextures
• Learning from the analyst
• Contexture dialogue
Labeling Every Face with a News Structure Model (NSM)
Sources of information:
• Audio transcripts + Named Entity extraction
• Overlaid text
• Speaker audio characteristics
• Temporal position of name w.r.t. video segment
• Temporal structure of news (“Grammar”)
• Constraints based on image similarity
• Constraints from speaker audio similarity
Baseline Algorithm
For each shot s:
• If transcript clues exist for anchor OR s is first shot in story: anchor name
• Otherwise, if transcript clues exist for reporter: reporter name(s) by distance
• Otherwise: news-subject name(s) by distance
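The baseline flowchart can be read as the short sketch below. It assumes "by distance" means temporal distance between a name's position in the transcript and the shot, which is a guess based on the "temporal position of name" cue listed earlier; the shot fields, helper names, and example values are all hypothetical.

```python
def nearest_names(candidates, shot_time, top_n=3):
    """candidates: list of (name, time_in_seconds); closest mentions first."""
    ordered = sorted(candidates, key=lambda c: abs(c[1] - shot_time))
    return [name for name, _ in ordered[:top_n]]


def baseline_label(shot):
    """Return (role, candidate_names) for one shot, following the flowchart."""
    if shot["anchor_clue"] or shot["first_in_story"]:
        return "anchor", [shot["anchor_name"]]
    if shot["reporter_clue"]:
        return "reporter", nearest_names(shot["reporter_names"], shot["time"])
    return "news-subject", nearest_names(shot["subject_names"], shot["time"])


# Hypothetical shot record; times are seconds from the start of the story.
shot = {
    "time": 10.0,
    "anchor_clue": False,
    "first_in_story": False,
    "anchor_name": "Elizabeth Vargas",
    "reporter_clue": True,
    "reporter_names": [("Bill Richardson", 40.0), ("David Ensor", 12.0)],
    "subject_names": [("Newt Gingrich", 11.0)],
}
role, names = baseline_label(shot)
```

Returning a ranked list of names rather than a single name makes it possible to score both top-1 and top-3 naming accuracy.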
Overlaid Text with Video OCR
Overlaid text
Rep. NEWT GINGRICH
VOCR text
rgp nev~j ginuhicij
Edit distance to names:
Bill Clinton (0.67)
Newt Gingrich (0.46)
David Ensor (0.72)
Saddam Hussein (0.78)
Elizabeth Vargas (0.88)
Bill Richardson (0.80)
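Matching noisy VOCR output against candidate names can be sketched with a normalized Levenshtein distance. The slide does not say how its distances were normalized, so this sketch assumes division by the longer string's length after lowercasing; its scores will not necessarily reproduce the values listed above. Function names are hypothetical.

```python
def levenshtein(a, b):
    """Classic edit distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def normalized_distance(ocr_text, name):
    """Edit distance divided by the longer length (assumed normalization)."""
    a, b = ocr_text.lower(), name.lower()
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def best_name(ocr_text, names):
    """Candidate name with the smallest normalized distance to the OCR text."""
    return min(names, key=lambda n: normalized_distance(ocr_text, n))


NAMES = ["Bill Clinton", "Newt Gingrich", "David Ensor",
         "Saddam Hussein", "Elizabeth Vargas", "Bill Richardson"]
match = best_name("rgp nev~j ginuhicij", NAMES)
```

Even with heavy OCR noise, the correct name stays closest because most of its characters survive in order.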
Detection of Anchors, Reporters and News-Subjects
[Figure: example video shots, each labeled as anchor, reporter, or news-subject.]
Image and Audio Similarity Constraints
[Figure: a sequence of shots 1 to 11 with their speaker IDs. Constraints from image similarity and constraints from speaker IDs link pairs of shots, with spans of 1, 2 or 3.]
Naming Accuracy of Different Approaches
MAP for (count)    Anchor (187)   Reporter (64)   News-Subject (125)   Overall (376)
Top-1  Baseline    0.834          0.359           0.256                0.561
Top-1  NSM         0.957          0.734           0.512                0.771
Top-3  Baseline    0.877          0.515           0.560                0.710
Top-3  NSM         0.983          0.922           0.752                0.896
Visual Gender Classification
[Pipeline diagram: Face detection, Original scale, Feature extraction (Haar wavelets), Boosting classifiers, Output (male or female), with example faces marked Correct or Error.]
Interface Showing the People Labeled with Names
Scope of Work
• Understanding the bias of source, topical, or rhetorical perspectives
• Contexture dialogue
• Extracting information from video sources for finding answers
• Applying broadcast TV news ontology (joint with USC)
• Understanding multimedia questions and their context
• Integration and evaluation
• Incorporating video biographs and perspectives into answer contextures
• Learning from the analyst
Finding Stories with Different Perspectives
Show length and shot type
22 sec
2 min 24 sec
12 sec
FOX and CNN news coverage of David Kay Report on the search for WMD in Iraq.
Perspective of broadcaster can be seen in text overlay
FOX and CNN news coverage of David Kay Report on the search for WMD in Iraq.
FOX uses faster cut rate and has more participation by the anchor
Scope of Work
• Integration and evaluation
• Contexture dialogue
• Extracting information from video sources for finding answers
• Applying broadcast TV news ontology (joint with USC)
• Understanding multimedia questions and their context
• Understanding the bias of source, topical, or rhetorical perspectives
• Incorporating video biographs and perspectives into answer contextures
• Learning from the analyst
Metrics-based Evaluations
NIST TRECVID 2004 Video Search Evaluation
• Submitted classification results for 10 different semantic features
  • Similar to a “Routing Task” for video clips
• Submitted Informedia system video search answers for
  • Interactive runs comparing expert/novice users
  • Interactive runs using either complete or only visual information
  • Automatic/Manual runs contrasting components of the system
Results to be announced later in October ...
Thank you
Carnegie Mellon University
Pittsburgh, PA
USA