video pattern mining - eie.polyu.edu.hkeie.polyu.edu.hk/~cmsp/isimp04/file/hk-isimp-2004.pdfs.-f....

S.-F. Chang, Columbia U. 1

1S.F. Chang

Prof. Shih-Fu Chang

Digital Video and Multimedia LabColumbia University

October, 2004http://www.ee.columbia.edu/dvmm

Video Pattern Mining

ISIMP 2004

2S.F. Chang

Joint workLexing Xie, Lyndon Kennedy, Winston Hsu, Dong-Qing Zhang (Columbia U.)Ajay Divakaren and Huifan Sun (MERL)John R. Smith and Ching-Yung Lin (IBM)

Supported in part by MERLARDA VACE phase IIIBM-Columbia Semantrix Project


3S.F. Chang

Temporal Patterns Everywhere …

[Iyengar99][www.music-scores.com][Purdue Stat490B]

Internet traffic patterns

music motifs

QESVADKMGMGQSGVGALFN

QTKTAKDLGVYQSAINKAIH

QTELATKAGVKQQSIQLIEA

QTRAALMMGINRGTLRKKLK

λ Cro

λ Rep

434 Cro

Fis Protein

protein motifs

There are interesting patterns representing useful information in different domains.

4S.F. Chang

Example Patterns in Video

time

financial news, CNN

98-06-02

98-06-07

98-05-20

anchor interview text/graphics footage …

soccer video

play start interception attemptspass attempt at the goal break

Play level break

View levelbaseball


5S.F. Chang

Patterns Across News Sources

Story/event often re-occur within or across channelsRelated stories share intrinsic audio-visual patterns -- forming news thread Tendency how a story is repeated at different times and by different channels – forming higher-level patterns

Source 1

Source 2

Source n

Event 1 Event 2 Event nReal world ... ...

...

...

...

:

Event n

Thread 1 Thread 2 Event nThreads ... ...Thread n

Example‘Marines Extend Duty’

Chinese News

CNN News

6S.F. Chang

Types of Temporal PatternsDense and stochastic

Sparse and deterministic

Temporal association of eventsIf A occurs in channel 1, then B occurs within time T in channel 2, e.g. recurrent news topics

Others …

soccer video

play start interception attemptspass attempt at the goal breakFocus of the talk


7S.F. Chang

(Grand) Challenges for Pattern MiningThere are many patterns in video at different levels.

How do we discover them (semi)-automatically?Do they correspond to any semantic meanings?What are the underlying features/structures of each pattern?Which patterns are more predictable and thus detectable?

Why do these matter?Explore structures and knowledge in a domain without a priori manual definition

discover interesting, meaningful concepts & eventsdefine normal states and outliers

Facilitate scalability to new content and domains

8S.F. Chang

Challenge: Unsupervised Pattern Discovery

Given a new domain/corpus, discover patterns automaticallyE.g., News, consumer, surveillance, and personal life log

Technical Issues:Find appropriate spatio-temporal statistical models to represent and explain patternsFind locations in the video stream that fit such models

… …… timeIssues

What’s the adequate class of models?How to learn suitable model structures?How to select good features?

vs.

vs.


9S.F. Chang

Finding the right class of models …Lessons from prior worksupervised + unsupervised

Many video temporal patterns are characterized byDynamics and Features

0 2 4 6 8 10 120

0.1

0.2

0.3

0.4

20

40

60

(sec)

dominant color ratio

Motion intensity

play

ColorMotionObjects Audio…

DistinctiveFeature

Distributions

… Play BreakRecurrent Soccer patterns

Close-up

Wide Zoom-inDistinctiveTransitionDynamics

Distinctive patterns are characterized by state-dependent transitions and features.Like speech recognition, HMM and variants may serve as promising candidates.


11S.F. Chang

A simple test: HMM Parsing of Soccer Video

LabeledSegments

(color, motion, audio, objects)

TrainHMM familiesFor soccer

plays/breaks

learn high-level transition

prob.

(training phase)‘play’

‘break’

Model library

Test Data Feature

Sequence

Likelihood Estimationfor each Model

High-Level Play-BreakOptimal sequence

(Dynamic Programming)

(testing phase)

… Play Break

12S.F. Chang

Supervised HMM models are effective for parsing structural patterns in sports video

81.7%89.6%89.6%79.9%Espana

89.6%85.3%85.3%79.9%KoreaB

79.8%84.3%84.3%78.1%KoreaA

80.6%82.5%82.5%87.2%Argentina

EspanaKoreaBKoreaAArgentina

Training SetTest set

Sports video from various countriesEvaluation by Cross Validation

• The good classification provides preliminary evidence supporting HMM model and the A-V features Demo• Avg. Play-Break Classification Rate: 83.5% vs. 60% of blind guessing• Boundary timing accuracy: 62% within 3 seconds

(Xie et al ‘02)


13S.F. Chang

Temporal Pattern Mining with a Single Model: Hierarchical HMM

Intuitive Representation for Video PatternsPatterns occur at different levels following different transition models States in each level may correspond to different semantic concepts

time… ……

top-level states

running pitching

break

bottom-level states

bench close up

batteraudiencefield bird view

pitcher1st base

BaseballExample

time… ……

top-level states

Topic/Genre 1

Topic/Genre 2

bottom-level states

Interview,Financialdata

Reporteranchor Sports footage

NewsReporting

(Xie et al ‘02)

14S.F. Chang

[Clarkson et al ’99]Long personal videos captured by wearable

deviceColor histogram and MFCC features at 10Hzdiscovered pattern sequence has high

correlation with ground truth events

Applying HMM to unsupervised mining

A simplified structure without higher level state transition

[Naphade et al ’02]Films and talk shows Color and edge histogram, MFCC and energy features at 30Hzdiscover recurrent patterns of explosion and applause


15S.F. Chang

Hierarchical HMM

Dynamic Bayesian Network representation

Tree-Structured representation

Ft+1 bottom-level hidden states

top-level hidden states

observations

Gt

Ht

Yt

Ft

Gt+1

Ht+1

Yt+1 level-exiting

states

g1 g3

g2

h11

h12

h21 h22

h32

h31

[Fine, Singer, Tishby ‘98][K. Murphy, ’01] [Xie et al ’02]

Flexible control structure (bottom-up control with exit state)Extensible to multiple levels and distributionsEfficient inference technique available

Complexity O(D·T·QαD), α=1.5 to 2Application in unsupervised discovery has not been explored

Questions: how to find right model structures and feature sets?

16S.F. Chang

The Need for Model Selection

Different domains have different descriptive complexities.

talk show

news

soccer

?

?

?


17S.F. Chang

Model Selection with RJ-MCMCDefault HHMM

EM

Possible Model Operations

Split

MergeSwap

(move, state)=(split, 2-2)

new modelAccept proposal?

Prob. Thresh. for accepting change= (eBIC ratio)x(proposal ratio)xJ

next iteration

stop

[Green95][Andrieu99]

[Xie ICME03]

Optimum points:balance (data

fitness) + (model complexity)

18S.F. Chang

Which Features Shall We Use?

color histogram

edge histogram

MFCCzero-crossing

rate delta energy

nasdaqjenningstonight lawyerclinton

juriaccus

tf-idf

Spectral rolloff

pitch

keywords

Gaborwavelet

descriptors

zernikemoments

outdoors?

people?

LPC coeff.

vehicle?

motion estimates

… ……time

?

logtf-entropyface?

text


19S.F. Chang

Issues of Feature Selection

X1

X2

X3

X4

[Zhu et.al.’97][Koller,Sahami’96][Ellis,Bilmes’00]…

Criteria: (1) find the feature-feature or feature-concept relevance(2) eliminate any redundancy

Target concept Y1

32

4

Temporal samples not independent Temporal sequence

No target concept defined a priori

But new problems for temporal sequence mining:

Unsupervised

Standard process for Supervised learning:Identify a good subset of observations to improve model generalization performance and reduce computation.

[Xing, Jordan’01]

20S.F. Chang

A Divide and Conquer Approach:Feature Selection for Temporal Pattern Mining

Feature pool

[Koller’96] [Xing’01][Xie et al. ICIP’03]

Multiple consistentfeature sets

wrapper

Mutual information

1 2

3

Ranked feature sets with redundancy eliminated

filter

Feature sequences

State sequences

q1=“abaaabbb”q2=“BABBBAAA”I(q1,q2)=1

Markov Blanket

{X?, Xb} ⇒ q1=“abaaabbb”{Xb} ⇒ q1’=“abaaabbb”= q1

Eliminate X?


21S.F. Chang

Results: on Sports Videos

baseball

videos featuresHHMM

audio: energies, zero-crossing rate, spectral rolloff

visual: dominant color ratio, camera translation motion

soccer

patterns

HHMM top-level label sequence

vs. play/break?

22S.F. Chang

A Simple Test: Mining Baseball Videos

videosbaseball

HHMM + feature selection

dominant color ratiohorizontal motion (vertical motion)

(1)

(2)

(3) …

audio volume low-band energy

…

Ranking based on BIC

score

Semantics of patterns?

Correspond to play/break with 82.3% accuracy

(demo)

?


23S.F. Chang

Unsupervised Mining not less Effective than Supervised Learning

Fixed features {DCR, MI}, MPEG-7 Korean Soccer video

* DCR=‘dominant-color-ratio’, MI=‘motion-intensity’, Mx=‘horizontal-camera-pan’

Automatic selection of both model and features

75.2%

74.8%

82.3%

Correspondence w. Play/Break

75.2±1.3%

75.0±1.2%

73.1±1.1%

64.0±10.%

75.5±1.8%

Correspondence w. Play/Break

Other examples of video pattern discovery (e.g., Jojic and Frey)

Graphical Model

Nebojsa Jojic, Brendan Frey, ‘A Generative Model for 2.5D Vision: Estimating appearance, transformation, illumination, transparency and occlusion’, IJCV 2002

Hidden

Observed

Similar principle in using video generative modelsAssume latent spatial-temporal models and transformationsAV features are observations of the generation processUnsupervised inferencing


25S.F. Chang

Video mining methods seem promising in finding hidden patterns in video.But how to find the meanings of the patterns? Approach

fuse the metadata streams when available.Multi-modal fusion

26S.F. Chang

OutlineThe problemUnsupervised pattern discovery with HHMMFinding meaningful patterns

With text associationBy multi-modal fusion

Summary

politics goal!snow… …


27S.F. Chang

Towards Meaningful PatternsWhat are the meanings of the hidden patterns?Manual association feasible only if meanings are fewand known.Metadata come to the rescue.

time… ……

? ??

???

food

ASR/VOCR

iraq nasdaqclintonpeter jennings snowtonight juri comput damagaccuslawyerwin

28S.F. Chang

Associating Patterns with TextHHMMvideos AV features

and conceptsPatterns/tokensnewslabel-word associationwords (ASR/VOCR)


Co-Occurrence of HHMM Labels & Words

Likelihood ratio:independent

strong correlation

strong exclusion 0

1

1+

…clinton

damage

foodgrand

jury……nasdaq

reportsnow

tonightw

in…wordlabel

…

damagenasdaqdow

clintonfoodjennings compute

temparaturewin

... ...... Words

from English

HHMM labels from Language “X”

“correlation” between HHMM labels and wordsco-occurrence counts.

)(.,/),()|(,.)(/),()|(wCwqCqwCqCwqCwqC

==

Conditional prob.

Not very informative:no distinctive mapping!

Problems with the Co-occurrence StatisticsStory#News VideoHHMM labelASR token

q1 q2 q2w2 w2w1

1 2 3“true mapping

prob.”“observed

co-occurrence”

Applied to Image Annotation[Dyugulu et. al. 2002]

image ~ blobs {b1, …, bn}~ words {w1, …, wn}

Machine Translation[Brown’93]

Her dog is typing on my computer.

Son chien dactylographie sur mon ordinateur.

True mapping

dactylograph*

e=‘computer’

chien

ordinateur

sur

mon/ma

son/sa

f

Co-occurrence

q1w1


31S.F. Chang

Inferring the translation probability between visual tokens and words

Prob. of the hidden alignmentsAuxiliary Function for E-M

( ) ( )

1 1 1( ; ) ( | , , ) log[ ( ) ( | )]

n nM LNold old

nj nj ni nj nj nin j i

Q p a i w b p a i t w w b bθ θ θ= = =

= = = = =∑∑∑Sum over the true joint prob.

*find optimal argmax log ( | , )p w bθ

θ θ=

1 1 1log ( | , ) log( ( ) ( | ))

n nM LN

nj nj nin j i

p w b p a i t w w b bθ= = =

= = = =∑∑ ∑

32S.F. Chang

The problem: Co-occurrence “un-smoothing”.

Translation between AV Tokens & Words

Solve with EM [Brown’93]

independent

strong correlation

strong exclusion 0

1

1+

L=

(assume q’s are ind.)

(assume w’s are ind.)

Mappings become much more selective


33S.F. Chang

ExperimentsTRECVID2003 news

44 half-hour videos, ABC/CNNRaw audio-visual features: color, motion, speech12 semantic concepts for each shot [IBM-TREC’03](weather, people, sports, non-studio, nature-vegetation, outdoors, news-subject-face, female speech, airplane, vehicle, building, road)ASR transcript

HHMM on concept confidence scoresConstruct 10 HHMM models with automatically determined size and featuresMachine translation methods to find association between HHMM patterns and words

34S.F. Chang

Example Correspondences

indoor, news-subject-face,

building

people, non-studio-

setting

Visual Concept

Clinton-Jones (Recall 45%,

Precision 15%) Iraqi-weapon (Recall 25%,

Precision 15%)

murder, lewinski, congress, allege, jury, judge, clinton, preside, politics, saddam, lawyer, accuse, independent, monica, charge, …

(9,1)

El-nino Storm ’98

(recall 80%)

storm, rain, forecast, flood, coast, el, nino, administer, water, cost, weather, protect, starr, north, plane, …

(6,3)

ManualInspection

Translated WordsHHMM label

[Xie et al. ICIP’04]

Obtained with SVM classifiers

[IBM’03]

(m, q): model # m state # q

Lexicon obtained by shallow parsing of keywords from speech recognition output.

Automatic ManualInspection


35S.F. Chang

Can’t we find such patterns using conventional clustering?(HHMM vs. K-means comparison)

independent

strong correlation

strong exclusion 0

1

1+

L=

HHMM provides unambiguous association between discovered patterns and words

36S.F. Chang

OutlineThe problemUnsupervised pattern discovery with HHMMFinding meaningful patterns

With text associationBy multi-modal fusion

Summary

politics goal!snow… …

- Versatile- Multi-modal- Meaningful- Knowledge-adaptive


37S.F. Chang

Is MT the right paradigm?

Potential Problems:Words topical meanings

Temporal correspondence between words and pattern tokens may not exist.Audio-visual features and Text are processed separately and then fused at a late stage.

HHMMvideos features patterns

news

audio-visual label-word association

words from ASR

≠

38S.F. Chang

Change to Joint AVT Pattern Mining

Instead, both the AV pattern tokens and words should be used to jointly define hidden semantic states.A multi-modal data fusion problem [AVSR, multi-modal interaction]Aiming at recovering the semantics [TDT 1998-2004]

videos features multi-modal pattern

sequencenews

audio visualtext nasdaq

tonightclinton

juri

An early fusion engine


39S.F. Chang

Layer Dynamic Mixture Model for Produced Videos

time……

nasdaqjenningstonight clinton

juri

text

audio

visual

observationstokenization

PLSA

HMM

HHMM

meaningful patterns?

Extracting High-level Concepts from Text – pLSA

Use data-driven analysis to find concept association and latent semantic topicsUse the syntactic story structures of video to define co-occurrenceUse graphics model statistical inferencing to discover latent semantics

Not just counting occurrences

nasdaqjennings

tonight lawyerclintonjuri

accus

nasdaqjennings

tonight lawyerclintonjuri

accus

a

newshowever

t……storiesshots …

?latent

semantics y

news stories d

words x


Some semantic clusters from text pLSA

saddamiraqbaghdadweaponhusseinstrikesecure … …

goldolympics… …

jury lewinskistarrgrand accusation sexual independent water monicainvestigationpresident… …

temperaturerain coast snow el heavy northern stormforecasttornadopressureeastfloridaninogulfweather… …

downasdaqindustrialaveragewalljonesgaintrade… …

cancer increase secure temperaturetexasaccusation chance nasdaqpressure center … …

cancer africatemperaturemovie coast center heavy research rain strike … …

“financial”

“investigation”

“weather”“iraq”

“olympics”

“bad” clusters

42S.F. Chang

Layered Dynamic Mixture Model (LDMM)[Xie’TR04]

[Oliver’02][Stork’96][Nefian’02]

text

audio

visual

observations X

mid-level labels Y

time

Inference: EM

high-level clusters Z

What do the highest-level patterns represent?Conjecture: News Topics with salient multimedia and temporal cues!


43S.F. Chang

Linking Multi-Source Videos, Web Pages

Source # 1

ThreadingNews Stories

News Video Sources

Source # 2

Source # 3

Question: does the discovered thread correspond to distinct semantics?• no ground truth• tentatively use NIST TDT topics

44S.F. Chang

Fusion Results Evaluated by Text TDT Topics & Metrics

http://www.nist.gov/speech/tests/tdt/

1. Story Segmentation - Detect changes between topically cohesive sections 2. Topic Tracking - Keep track of stories similar to a set of example stories 3. Topic Detection - Build clusters of stories that discuss the same topic 4. First Story Detection - Detect if a story is the first story of a new, unknown topic 5. Link Detection - Detect whether or not two stories are topically linked

•“NIST TDT research develops algorithms for discovering and threading together topically related material in streams of datasuch as newswire and broadcast news in both English and Mandarin Chinese. “

• Current TDT use text or ASR only.

TDT Tasks


45S.F. Chang

Topic Detection and Trackinghttp://www.nist.gov/speech/tests/tdt/

Topic linking

?

The tracking of known topics -- classification;The detection of unknown topics – clustering;The detection of pairs of stories on the same topic (links).

[TDT04 evaluation plan v1.1]

46S.F. Chang

3 7 2931 37 48 56 62 66 72 76 860

1000

2000

3000

115 207

1014

0621 13

243

971 56 37 10

531 72 12 8 9 28 18

10 0 1 67 3 12

19 1 4 9 2 74 41 8 12

649 0 7 69 8

2354

179

7 1978

631

070

709

16 135

49 1 37 125

11 63 9 338

729 1 14

513 9 0 7 10 1

986

150

30 6 19 105 23

03 2 13

133 47 14 2 28 6 18 3 16 2 12

48 3 1 65 8 8 28 1 1 21 18 5 4

topics

Elections

Scandals Hearings

Legal Criminal Cases

Natural Disasters

Accidents

Acts of Violence or War

Science and Discovery

Financial

New Laws

SportsPolitical and Diplomatic Meetings

Celebrity Human Interest

Miscellaneous

0

20

40

60

80 ABCCNN

# of documents in the TDT text corpus over different topics

# of stories in trecvid2003 (151 videos)

Videos have different topical distribution from text documents


523 video stories in 30 frequent topics

1 30 Asian Economic Crisis 2 87 Monica Lewinsky Case 8 4 Casey Martin Sues PGA 12 4 Pope visits Cuba 13 16 1998 Winter Olympics 15 67 Current Conflict with Iraq 18 12 Bombing AL Clinic 19 9 Cable Car Crash 20 4 China Airlines Crash 21 9 Tornado in Florida 31 5 John Glenn 32 13 Sgt. Gene McKinney 41 7 Grossberg baby murder 43 6 Dr. Spock Dies 44 38 National Tobacco Settlement 47 28 Viagra Approval 48 20 Jonesboro shooting 56 7 James Earl Ray's Retrial? 65 15 Rats in Space! 70 39 "India, A Nuclear Power?" 71 24 Israeli-Palestinian Talks (London) 76 20 Anti-Suharto Violence 77 7 Unabomber 83 4 World AIDS Conference 84 6 Job incentives 86 24 GM Strike 87 18 NBA finals 89 4 Afghan Earthquake 91 9 German Train derails 96 12 Clinton-Jiang Debate

Topic # #stories topic title

Out of 151 videos(V, F1, F2, Test),4241 segments,3066 stories.

48S.F. Chang

ExperimentsNews videos [TRECVID2003]

ABC, CNN : 30 min x 151 video programsTraining/testing on same channels

PLSAevery storyASR word-stems tf-idftext

every shot22 conceptsvisual

1 seccamera translation (2-d)motion

1 sechistogram (15-d)colorHHMM

.5 secpitch, pause, audio-classaudio

bottom-level model

granularityfeature elementsmodality


Evaluate the MM patterns against TDT topics

Auto-story boundary, 32 clustersMinimum Cdet among the CNN videos for each topic

Topic (#stories)

0

0.2

0.4

0.6

0.8

1

1.2

AsianE(30)Lew

insky(87)suePGA(4)PopeCuba(4)W

O'98(16)

Iraq(67)Bom

bAC(12)

CCCrash(9)CACrash(4)TornadoFL(9)

JGlenn(5)GM

cKin(13)GBM

urder(7)dSdie(6)Tobacco(38)

Viagra(28)JShoot(20)RayTrial(7)RatSpace(15)InNuclear(39)IsPalTalk(24)ASViolence(20)

Unabomber(7)

AIDConf(4)

JobIncen(6)GM

Strike(24)N

BAFianl(18)

AfEaQuake(4)

GTraindR(9)CJDebate(12)

MM fusionText onlyvconcept1vconcept2vconcept3motioncoloraudio

better

multi-modal fusion able to detect certain topics more accurately than text aloneSuch topics tend to be rich in non-textual cues

Topic-ClusterMisalignment

Example: Cluster #28

Unique audio-visual + text terms + temporal transition• AV: graphics, music, anchor speech, subject etc• textual terms: most related to financial• temporal structure: statistically predictable transition patterns

Correspond to “Dollars & Sense, CNN” (demo)


51S.F. Chang

Example: Sports (demo)

Consistent audio-visual (color, motion, graphics), words, and temporal transitions.

52S.F. Chang

Example: Advertisements

• Audio-visual dominate in this thread. • Text not useful

(demo)


53S.F. Chang

Application in Multi-Source News Threading

Joint Work with IBM T.J. Watson ResearchSupported by ARDA VACE II program

Reconstructing Semantic Threads Across Multiple Video Broadcast News Sources Using Multi-Level Concept Modeling

ObjectivesExtraction of semantic threads through multi-level video content analysis (core- and semantic-level features, scene markers & similarities, commonality of content and structure)Establishment of viability of automatic concept detection as basis for video semantics understanding and higher-level analysis functions and user interactionsDevelopment of framework and technologies supporting extraction of thousands of concepts from news video

Novel ApproachHierarchical segmentation and threading at multiple granularities (key-frames, shots, scenes, episodes and stories) Dynamic probabilistic graphical modeling for detection of concepts to form “semantic basis” of objects, sites, actionsStatistical modeling & mining of spatio-temporal & semantic structures across multiple broadcast video news sources

PIs: IBM (John R. Smith); Columbia Univ. (Prof. Shih-Fu Chang)

Event 1 Event 2 Event n…

News Source 1

News Source 2

News Source m

News Source 3…

Thread 1 Thread nThread 2

Video Broadcast News – Content Acquisition & Production

Multi-level Video Analysis and Thread Extraction

Core-level

Fusion-level

Event-level

Thread-level

Visual Analysis

Video News Analysis Tools and System

Multi-level searching, filtering & interactionSemantics-based browsing Concept-based classification

Analyst


Multi-level News Video Content Analysis

Multi-level feature extraction and systematic development of higher-level access functions

Core-level

Fusion-level

Event-level

Thread-level

VideoOCR

RegionExtraction

VideoTexture

ConceptDetection

ContextExploitation

ReducingSupervision

dim1

dim2highlow

high

low

(car)

(sky)

Objects

Actions

Sites

Concepts

Outdoors IndoorsPerson

PeopleFace

NewsSubjectAnchor

Crowd

NewsMonolog

NewsDialog Studio

StructureModeling

StoryClustering

ThreadExtraction

EventModeling

StorySegmentation

Visual modalitiesNon-visual (speech, text, closed captions, etc.)

PIs: IBM (John R. Smith); Columbia Univ. (Prof. Shih-Fu Chang)

56S.F. Chang

Detection of Image Near Duplicate (IND)

With Dongqing Zhang


How to Link Video Threads with external Sources (e.g., web pages)?

Source # 1

ThreadingNews Stories

Search byexample

News Web Site 1

News Web Site 2

News Video Sources

Source # 2

Source # 3

58S.F. Chang

Image Near-Duplicate DetectionImage Near-Duplicate (IND): A pair of images in which one is close to the exact duplicate of the other, but differs slightly due to variations of content and camera parameters

Source # 1

News Story Thread Across Sources

Source # 2

Source # 3

Application of IND Detection:Threading news videosacross sources


59S.F. Chang

Part-based Approach for IND Detection

Part representation: we use interest point based representationPart detection: SUSAN corner detector –detecting corners using local geometries, such as curvature. Efficient algorithm compared with others, such as HarrisFeatures

Part: Color, spatial location, Gaborwavelet coefficients

Inter-part relation: Difference of spatial locations

Visual scene is composed of objects (parts) with spatial/attribute relations

Part-based representation by Attributed Relational Graph (ARG)

ARG

==??

ARGSimilarity

IND Detection becomesARG similarity problem

Graphical Model for ARG similarityUncertainty of Vertex Correspondence(due to occlusion, content change)

Variations of Attributes(due to lighting, object move)

ARG similarity is the likelihood ratio of the stochastic transformation process

VertexCorrespondence

Attribute Transformation

1

2

3

1

2

11x

32x

GraphS

GraphtX },...,,{ 321211 xxx=

Product Graph

Graph s

Graph t

H: Hypothesis

H=1 : two graphs are similar

Graphical Model of the Stochastic Process

H


61S.F. Chang

Inference and LearningLearning

Learning using node-level annotation

Positive Samples

Negative Samples

Learning using image-level annotation

Likelihood computation

1

2

3

1

2

11x

32x

GraphS

Grapht

Computing exact likelihood is intractable, so approximate it using Bethe approximation, leading to Loopy Brief Propagation in the product graph

Use E-M process:

Inference ParameterEstimation

Inference results(after thresholdingbeliefs)

Stochastic GraphEditing Process

IND vs. non-IND Likelihood Ratioa new similarity measure

Similarity by Likelihood Ratio

Learning

Similarity = P(Graph_t | Graph_s, Two_graph_is_NID)

P(Graph_t | Graph_s, Two_graph_is_not_NID)

• Compute Likelihood

,=Intractable ! So approximate it by using

Jensen’s lower bound

+ constant

1. Inference : Compute the approximate distribution by Loopy Belief Propagation

• Learning by node-level annotation

• Learning by image-level annotation

Positive Samples

Negative Samples

Learning is realized by Variational E-M

Learning is realized by Parameter Computation

q̂

Statistical Approach to Graph Matching

2. Learning : Estimate and by using E-M

YsH

Yt


IND Detection Performance (demo)

Detection Results

Search Scope

Probabilityof SuccessfulRetrieval

Retrieval Results

Precision

RecallStatistics of Image Near-Duplicate (IND)in TREC News Video Database

4147 News StoriesIn TREC-VID 2003

About 747 StoriesAnnotated by TDT2

At least 78 Storieshave INDs

Dataset for IND detection• 150 IND pairs, 300 non-duplicate images

• No multiple duplicate

Training set• 30 IND pairs, 60 non-duplicate images

• Node-level learning: 5 IND pairs, 5 non-duplicate images

64S.F. Chang

Conclusions

Patterns abound in multimedia dataMining facilitates auto discovery of salient or novel concepts scalable automatic knowledge discoveryCurrent results seen in

Multi-level temporal patterns mining through HHMMVideo generation modelsFusion between pattern labels and ASR metadata

Challenging IssuesMining of patterns of different types or complex patterns at higher levelsDetection of alerts and novel eventsEvaluation and Visualization

Theme: Media Pattern Mining for Semantics Discovery


65S.F. Chang

Conclusions (2)

Multi-source news video analysis and content exploitation offers technical challenges and application opportunities

Multimedia Video MiningCross-source linking using image near duplicateMulti-modal fusion for story segmentationCurrent work:

Large scale AV concept detectionStory topical threading

Closely related to evaluation tasks in TRECVIDA workshop for developing large-scale concept ontology (LSCOM)

Threading News Videos Across Sources Using Multi-modal Cues

66S.F. Chang

More InformationDigital Video| Multimedia Lab http://www.ee.columbia.edu/dvmmPapers

Winston Hsu, Lyndon Kennedy, Chih-Wei Huang, Shih-Fu Chang, Ching-Yung Lin, Giridharan Iyengar. News Video Story Segmentation using Fusion of Multi-Level Multi-modal Features in TRECVID 2003. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Montreal, Canada, May 2004. L. Xie, L. Kennedy, S.-F. Chang, A. Divakaran, H. Sun, C.-Y. Lin (2004). "Discovering Meaningful Multimedia Patterns with Audio-visual Concepts and Associated Text." IEEE International Conference on Image Processing (ICIP 2004), Singapore, October 2004. Dong-Qing Zhang and Shih-Fu Chang, "Detecting Image Near-Duplicate by Stochastic Attributed Relational Graph Matching with Learning", To appear in ACM Multimedia 2004 (ACM MM2004). Xie, S.-F. Chang, A. Divakaran and H. Sun (2003), Unsupervised Mining of Statistical Temporal Structures in Video, in Video Mining, Azreil Rosenfeld, David Doremann, Daniel Dementhon Eds, Kluwer Academic Publishers, 2003

video pattern mining - eie.polyu.edu.hkeie.polyu.edu.hk/~cmsp/isimp04/file/hk-isimp-2004.pdfs.-f....

Documents