Copyright © 2004

Informedia Contexture: Analyzing and Synthesizing Video and Verbal Context for Intelligence Analysis Dialogues

Alex Hauptmann, Howard Wactlar
Carnegie Mellon University

Pittsburgh, USA

October 2004


Our Meaning of Contexture

Definition: The weaving or assembly of [multimedia] parts into a cohesive whole in order to provide a more complete picture or information structure [to both questions and answers]

• Interpreting and communicating an associated visual and verbal context to information
  - May contain language, imagery and gestures
  - May illuminate the meaning or significance
  - May explain its circumstances
  - More like a collegial expert response than an encyclopedic source
• Accelerating discovery by both system and analyst
• Understanding video perspectives
  - Subtle opinions, attitudes, biases
  - Both visual and textual rhetoric
• Continuously updating video biographs and event timelines


[Figure: video biograph timeline for Wesley Clark. Events dated 3/18/1998, 7/2/1999, 9/11/2001, 9/19/2001, 3/19/2003, and 9/17/2003 are categorized as People Association, Interviewee, Monologue, Word Quotation, or Visual Quotation.]

• A word quotation of Retired Wesley Clark describing for the president in the ...
• Being a CNN Military Analyst on military deployment in preparation for the wa...
• Being an interviewee in a news program as a presidential candidate for ...
• Wesley Clark (without being mentioned in the video clip) sitting next to Madeleine ... hearing on a resolution that would direct President Clinton to ...
• A visual quotation of Wesley Clark (referred commander) describing President Milosevic’s intent in ...
• Being a CNN Military Analyst on action in the Iraq War.

Conceptual Overview

[Diagram: multiple multimedia information sources (domestic sources, foreign sources, and structured data) are processed by the three components below; the analyst, with a profile and QA history, receives biograph data, visualizations, synthesized video clips, and contextual info.]

• Extract Semantic Data
  - Scene classification
  - Event detection
  - Title/topic labeling
  - Named entity extraction
  - Verify entities with structured data
• Context Analysis
  - Semantic relations on entities
  - Harden with structured data
  - Perspective interpretation
  - Video ontology
• Generate Information Contexture
  - Understand questions
  - Provide context-rich answers
  - Produce video biographs
  - Enable context-based iterative QA process


Scope of Work

• Extracting information from video sources for finding answers
• Applying broadcast TV news ontology (joint with USC)
• Understanding multimedia questions and their context
• Understanding the bias of source, topical, or rhetorical perspectives
• Incorporating video biographs and perspectives into answer contextures
• Learning from the analyst
• Integration and evaluation
• Contexture dialogue


Understanding Multimedia Questions

1. Find shots of Pope John Paul II.

2. Find shots of a rocket or missile taking off.

3. Find shots of the Tomb of the Unknown Soldier at Arlington National Cemetery.

4. Find shots of the front of the White House in the day-time with the fountain running.


Automatic Video Retrieval System

[Diagram: a multi-modal query (for example, "Pope John Paul II") is run against the video library by multiple video analysis experts, one per modality; a weighted fusion of their similarity rankings produces the final ranked list of video shots.]

Multiple Modality Video Analysis Experts
• Speech Trans.
• Video OCR
• Color Feature
• Texture Feature
• Audio Feature
• Semantic Class Filter
• …
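
The weighted fusion step of this retrieval system can be illustrated with a minimal sketch. It assumes each expert returns a similarity score per shot on a comparable scale and that fusion is a plain weighted sum; the function name, expert names, scores, and weights below are hypothetical and are not taken from the Informedia system.

```python
from typing import Dict, List


def fuse_rankings(expert_scores: Dict[str, Dict[str, float]],
                  weights: Dict[str, float]) -> List[str]:
    """Return shot ids sorted by the weighted sum of per-expert scores.

    Assumption: a shot missing from an expert's output contributes 0
    for that expert, and an expert without a weight is ignored.
    """
    fused: Dict[str, float] = {}
    for expert, scores in expert_scores.items():
        w = weights.get(expert, 0.0)
        for shot, score in scores.items():
            fused[shot] = fused.get(shot, 0.0) + w * score
    # Highest fused score first; ties are broken by shot id for determinism.
    return sorted(fused, key=lambda shot: (-fused[shot], shot))


# Hypothetical scores from two of the modality experts.
example_scores = {
    "text": {"shot1": 0.9, "shot2": 0.2, "shot3": 0.4},
    "color": {"shot1": 0.1, "shot2": 0.8, "shot3": 0.5},
}
example_weights = {"text": 0.7, "color": 0.3}
ranked_shots = fuse_rankings(example_scores, example_weights)
```

With these made-up numbers the text expert dominates, which mirrors the deck's observation that text search tends to be the most reliable single modality.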


Finding the combination weights

• Offline: learn weights from training queries and training data drawn from the video library
• Online: apply the learned weights to the similarity rankings from multiple experts for a new query


Finding the combination weights

• Offline: classify the training queries, then learn weights from training queries and training data drawn from the video library
• Online: classify the query, then apply the learned weights to the similarity rankings from multiple experts


Query Types for Video Retrieval

• Named person queries, possibly with constraints.
  “Find shots of Yasser Arafat”, “Find shots of Ronald Reagan speaking”.
• Named object queries for an object with a unique name.
  “Find shots of the Statue of Liberty”, “Find shots of the Mercedes logo”.
• General object queries for a type of object. They may be qualified.
  “Find shots of snow-covered mountains”, “Find shots of one or more cats”.
• Scene queries for multiple types of objects in certain relationships.
  “Find shots of roads with lots of vehicles”, “Find shots of people spending leisure time on the beach”.


Finding the Combination Weights for Merging Search Results

• Uniform, fixed weights for all queries
• Individual weightings for each query
  - Not enough known about each query
• Weightings for each of 4 query types

Text search usually does better and is more consistent than any other single search modality


Query Classification

Query X → Named-entity extraction
• Person name → Person Q
  “Find shots of Bill Clinton”
• Organization or location name → Specific Object Q
  “Find shots of Capitol Hill”
• No proper name → POS tagging + NP chunking
  - Multiple NPs → Scene Q
    “Find shots with (multiple pedestrians) and (multiple vehicles in motion)”
  - Single NP → Syntactic parsing
    - Nested NPs → Scene Q
      “Find shots of (a person diving into the water)”
    - No nested NP → General Object Q
      “Find shots of (one or more cats)”
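
The decision tree on this slide can be sketched as a small rule-based classifier. This is only an illustration: it assumes the named-entity extraction, NP chunking, and parsing have already been run by some external tool, and it takes their results as plain fields. The `QueryAnalysis` type, its field names, and the label strings are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class QueryAnalysis:
    """Precomputed linguistic annotations for one query (assumed inputs)."""
    person_names: List[str] = field(default_factory=list)
    org_or_location_names: List[str] = field(default_factory=list)
    top_level_np_count: int = 0   # from POS tagging + NP chunking
    has_nested_np: bool = False   # from syntactic parsing


def classify_query(analysis: QueryAnalysis) -> str:
    """Follow the slide's tree: named entities first, then NP structure."""
    if analysis.person_names:
        return "person"
    if analysis.org_or_location_names:
        return "specific_object"
    if analysis.top_level_np_count > 1:
        return "scene"
    if analysis.has_nested_np:
        return "scene"
    return "general_object"


# Hand-annotated versions of the slide's example queries.
clinton = QueryAnalysis(person_names=["Bill Clinton"], top_level_np_count=1)
capitol = QueryAnalysis(org_or_location_names=["Capitol Hill"], top_level_np_count=1)
traffic = QueryAnalysis(top_level_np_count=2)
diving = QueryAnalysis(top_level_np_count=1, has_nested_np=True)
cats = QueryAnalysis(top_level_np_count=1)
```

Because each query receives exactly one label, a query such as a named person in front of a named object would be forced into a single type under this scheme.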


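Hierarchical Mixture of Experts

[Diagram: the query and video shots feed a text retrieval expert and retrieval experts 1 to n; their outputs are combined at a lower and an upper level, with weights λ^l and λ^u selected by the query type.]

Lower level:

$$f\Big(\sum_{k=1}^{n} \lambda_k^{l}\, P_k^{l}(R \mid q, s)\Big)$$

Upper level:

$$\sum_{k=1}^{2} \lambda_k^{u}\, P_k^{u}(R \mid q, s)$$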
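
A two-level combination of this kind can be sketched as follows. The slide does not spell out the details, so several things here are assumptions: that the lower level passes a weighted sum of the non-text expert outputs through a function f, taken here to be a logistic sigmoid; that the upper level mixes that value with the text retrieval output; and that both weight sets are looked up by query type. The function name, the weight table, and all numbers are hypothetical.

```python
import math
from typing import Callable, Dict, Sequence, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical per-query-type weights:
# (upper-level pair for [text, lower-level output], lower-level weights).
WEIGHTS: Dict[str, Tuple[Tuple[float, float], Sequence[float]]] = {
    "person": ((0.8, 0.2), (0.5, 0.5)),
    "scene": ((0.5, 0.5), (0.9, 0.1)),
}


def hier_moe_score(query_type: str,
                   text_prob: float,
                   expert_probs: Sequence[float],
                   f: Callable[[float], float] = sigmoid) -> float:
    """Score one shot with a two-level, query-type dependent combination."""
    upper_w, lower_w = WEIGHTS[query_type]
    if len(lower_w) != len(expert_probs):
        raise ValueError("one lower-level weight is needed per expert")
    # Lower level: combine the non-text experts, then apply f (assumed).
    lower = f(sum(w * p for w, p in zip(lower_w, expert_probs)))
    # Upper level: mix text retrieval with the lower-level output.
    return upper_w[0] * text_prob + upper_w[1] * lower
```

Passing an identity function for `f` reduces the sketch to a purely linear two-level mixture, which is convenient for checking the arithmetic by hand.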


Performance of different weighting schemes

[Bar chart: MAP, PREC@10, and PREC@30 (y-axis from 0.1 to 0.35) for Text Only, Q-Uniform Oracle, Q-Type Single MoE, and Q-Type Hier MoE.]


Performance of different weighting schemes

[Bar chart: MAP, PREC@10, and PREC@30 (y-axis from 0.1 to 0.4) for Text Only, Q-Uniform Oracle, Q-Type Single MoE, Q-Type Hier MoE, and Q-Specific Oracle.]
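
For readers unfamiliar with the three measures on the chart, the sketch below computes them in their standard form, assuming binary relevance judgments per shot. It is not the evaluation code used for these results; the function names and the toy data are illustrative only.

```python
from typing import Iterable, Sequence, Set, Tuple


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k returned shots that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for shot in ranked[:k] if shot in relevant) / k


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Mean of the precision values at each rank where a relevant shot appears,
    divided by the total number of relevant shots."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, shot in enumerate(ranked, start=1):
        if shot in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """MAP: average precision averaged over queries."""
    runs = list(runs)
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy example with two queries.
toy_runs = [
    (["s1", "s2", "s3", "s4"], {"s1", "s3"}),
    (["s2", "s1"], {"s1"}),
]
toy_map = mean_average_precision(toy_runs)
```

PREC@10 and PREC@30 correspond to `precision_at_k` with k set to 10 and 30.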


Current Limitations

• Unable to assign multiple query types to one query
  - “Finding Bill Clinton speaking in front of a US flag” (person, object)
• Unable to capture the query-specific aspects
  - “Finding day-time scenes of the Federal Reserve Building, Washington DC”


Labeling Every Face with a News Structure Model (NSM)

Sources of information:
• Audio transcripts + Named Entity extraction
• Overlaid text
• Speaker audio characteristics
• Temporal position of name w.r.t. video segment
• Temporal structure of news (“Grammar”)
• Constraints based on image similarity
• Constraints from speaker audio similarity


Baseline Algorithm

For each shot s:
• If transcript clues exist for anchor OR s is the first shot in the story: anchor name
• Otherwise, if transcript clues exist for reporter: reporter name(s) by distance
• Otherwise: news-subject name(s) by distance
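
The baseline decision flow for a single shot can be sketched as below. It assumes the transcript clues have already been detected and that "by distance" means ranking candidate names by how far they occur from the shot in time; the slide does not define the distance, so that reading, the function names, and the example values are all hypothetical.

```python
from typing import List, Sequence, Tuple

# (name, temporal distance between the name mention and the shot) - assumed form.
Candidate = Tuple[str, float]


def rank_by_distance(candidates: Sequence[Candidate]) -> List[str]:
    """Closest name first."""
    return [name for name, _ in sorted(candidates, key=lambda c: c[1])]


def baseline_names(anchor_name: str,
                   has_anchor_clue: bool,
                   is_first_shot_in_story: bool,
                   has_reporter_clue: bool,
                   reporter_candidates: Sequence[Candidate],
                   subject_candidates: Sequence[Candidate]) -> List[str]:
    """Return a ranked list of names for one shot, following the flowchart."""
    if has_anchor_clue or is_first_shot_in_story:
        return [anchor_name]
    if has_reporter_clue:
        return rank_by_distance(reporter_candidates)
    return rank_by_distance(subject_candidates)


# Hypothetical candidates for one shot.
reporters = [("David Ensor", 4.0)]
subjects = [("Bill Clinton", 12.5), ("Newt Gingrich", 1.5)]
```

Returning a ranked list rather than a single name is what makes Top-1 and Top-3 accuracy both measurable for such a baseline.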


Overlaid Text with Video OCR

Overlaid text

Rep. NEWT GINGRICH

VOCR text

rgp nev~j ginuhicij

Edit distance to names:

Bill Clinton (0.67)

Newt Gingrich (0.46)

David Ensor (0.72)

Saddam Hussein (0.78)

Elizabeth Vargas (0.88)

Bill Richardson (0.80)
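
Matching noisy VOCR output against candidate names can be illustrated with a normalized edit distance. This is a sketch, not the system's scoring: it uses case-insensitive Levenshtein distance divided by the longer string's length, which is an assumed normalization and does not reproduce the exact values shown on the slide. The function names are hypothetical.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string (assumed scaling)."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def best_match(ocr_text: str, names: Sequence[str]) -> str:
    """Candidate name with the smallest normalized distance to the OCR text."""
    return min(names, key=lambda name: normalized_distance(ocr_text, name))


candidates = ["Bill Clinton", "Newt Gingrich", "David Ensor",
              "Saddam Hussein", "Elizabeth Vargas", "Bill Richardson"]
closest = best_match("rgp nev~j ginuhicij", candidates)
```

Even with the heavily corrupted string from the slide, the closest candidate under this measure is the correct name, which is why a small candidate list taken from the transcript makes noisy overlay text usable.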


Detection of Anchors, Reporters and News-Subjects

[Figure: examples labeled anchor, reporter, or news-subject.]


Image and Audio Similarity Constraints

[Figure: shots 1 to 11 with their speaker IDs, linked by constraints from image similarity and constraints from speaker IDs over spans of 1, 2, and 3 shots.]


Naming Accuracy of Different Approaches

MAP for (count)    Anchor (187)   Reporter (64)   News-Subject (125)   Overall (376)
Top-1  Baseline    0.834          0.359           0.256                0.561
Top-1  NSM         0.957          0.734           0.512                0.771
Top-3  Baseline    0.877          0.515           0.560                0.710
Top-3  NSM         0.983          0.922           0.752                0.896

FOX and CNN news coverage of the David Kay Report on the search for WMD in Iraq.

• Perspective of broadcaster can be seen in text overlay
• FOX uses faster cut rate and has more participation by the anchor


Metrics-based Evaluations

NIST TRECVID 2004 Video Search Evaluation

• Submitted classification results for 10 different semantic features
  - Similar to a “Routing Task” for video clips
• Submitted Informedia system video search answers for
  - Interactive runs comparing expert/novice users
  - Interactive runs using either complete or only visual information
  - Automatic/Manual runs contrasting components of the system

Results to be announced later in October ...
