Content-based Image and Video Retrieval
Lecture, Summer Semester 2010
Content-based Image and Video Analysis
-High-Level Feature Detection-
Hazım Kemal Ekenel, [email protected]
Rainer Stiefelhagen, [email protected]
CV-HCI Research Group: http://cvhci.ira.uka.de/
31.05.2010
Last week
Introduction to digital video processing
Shot boundary detection
A shot is a sequence of frames captured by a
single camera in a single continuous action.
A case study: TV Genre Classification
Estimating the type of the TV program using visual
+ aural + cognitive + structural cues
Content-based image and video retrieval 2
This week
Concept detection / high-level feature detection
Data collection & annotation
Concept lexicon
Evaluations
Sample systems: IBM, MediaMill, Columbia University
Visual features
Learning approaches
Fusion techniques
Issues & challenges
Video Retrieval
Search using content
Example I:
IBM IMARS Multimedia Analysis and Retrieval System:
http://mp7.watson.ibm.com/?ibm-download-now=View+demo
Video Retrieval
Example II:
University of Amsterdam MediaMill
http://www.science.uva.nl/research/mediamill/demo/index.php
Video Retrieval
Example III:
Uni-Karlsruhe, CVCHI research group
Levels of image/video retrieval
Level 1: Based on color, texture, shape features
Images are compared based on low-level features,
no semantics involved
Extensively researched; a feasible task
Level 2: Bring semantic meanings into the search
E.g., identifying human beings, horses, trees, beaches
Requires retrieval techniques of level 1
Level 3: Retrieval with abstract and subjective attributes
Find pictures of a particular birthday celebration
Find a picture of a happy beautiful woman
Requires retrieval techniques of level 2 and very complex logic
What is a concept/high-level feature?
Challenges
Changes in
• View angle,
• Scale,
• Color,
• Shape …
Large-Scale Concept Ontology
for Multimedia (LSCOM)
• Collaborative activity of three critical communities to create
a user-driven concept ontology for analysis of video
The three communities: users (analysts, broadcasters), ontology experts, and technical researchers, algorithm designers & system developers.
Sample Concepts
000 – Parade
Definition: Multiple units of marchers, devices, bands,
banners, or music.
001 – Exiting_Car
Definition: A car exiting from somewhere, such as a
highway, building, or parking lot.
002 – Handshaking
Definition: Two people shaking hands. Does not include
hugging or holding hands.
003 – Running
Definition: One or more people running.
004 – Airplane_Crash
Definition: Airplane crash site.
Sample Concepts
006 – Demonstration_Or_Protest
Definition: One or more people protesting. May or may not
have banners or signs.
007 – People_Crying
Definition: One or more people with visible tears.
008 – Airplane_Takeoff
Definition: Airplane heading down the runway for take off
(may have already left runway and be ascending).
009 – Airplane_Landing
Definition: Airplane descending or decelerating after making
contact with runway.
010 – Helicopter_Hovering
Definition: Helicopter in the air. May be moving or staying in
place.
Evaluation: TRECVID
High-level Feature Detection Task
Promote progress in content-based analysis,
detection, and retrieval in large amounts of
digital video
Combine multiple errorful sources of evidence
Achieve greater effectiveness, speed, and
usability
Confront systems with unfiltered data and
realistic tasks
Measure systems against human abilities
Evolution of TRECVID
TV2007 vs. TV2008 vs. TV2009 datasets
TRECVID 2009: selection of 10 new features
Participants suggested features that include:
Parts of natural scenes.
Child.
Sports.
Non-speech audio component.
People and objects in action.
Frequency in consumer video.
NIST's basic selection criteria: a feature
has to be moderately frequent,
has a clear definition,
is of use in searching, and
does not overlap with previously used topics/features.
20 features evaluated
The 10 marked with “*” are a subset of those tested in 2008
Frequency of hits varies by
feature (TRECVID 2009)
High-level feature detection task (1)
Goal: Build a benchmark collection for visual
concept detection methods
Secondary goals:
encourage generic (scalable) methods for detector
development
semantic annotation is important for
search/browsing
Video data collection:
News magazine, science news, news reports,
documentaries, educational programming and
archival video
High-level feature detection task (2)
NIST evaluated a set of features using a 50% random sample of the submission pools (inferred AP)
Four training types were allowed:
A: systems trained only on common TRECVID development collection data
(formerly B): systems trained only on common development collection data, but not (just) on the common annotation of it
C: systems not of type A
a, c: same as A and C, but with no training data specific to any Sound and Vision data
Evaluation
Each feature assumed to be binary: absent or present
for each master reference shot
Task: Find shots that contain a certain feature, rank
them according to confidence measure, submit the top
2000
NIST pooled and judged top results from all
submissions
Evaluated performance effectiveness by calculating
the inferred average precision of each feature result
Compared runs in terms of mean inferred average
precision across the feature results.
Problem definition
Given an n-dimensional feature vector xi extracted
from a shot i,
the aim is to obtain a measure that indicates
whether a semantic concept wj is present in shot i:
p(wj | xi)
Various visual feature extraction methods can be
used to obtain xi.
Several supervised machine learning approaches
can be used to learn the relation between wj and xi.
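As an illustrative sketch of this formulation (not any particular TRECVID system), a plain logistic-regression detector can map a feature vector xi to an estimate of p(wj | xi); the 2-D toy features below are made up:

```python
import math
import random

def train_logistic(examples, labels, lr=0.5, epochs=500):
    """Train a logistic-regression concept detector with plain SGD.
    examples: feature vectors xi; labels: 1 (concept present) / 0 (absent)."""
    n = len(examples[0])
    w = [0.0] * n
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # current estimate of p(wj | x)
            err = p - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err

    def p_concept(x):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        return 1.0 / (1.0 + math.exp(-z))
    return p_concept

# Toy 2-D features: "concept present" shots cluster around (1, 1).
random.seed(0)
pos = [[1 + random.gauss(0, 0.2), 1 + random.gauss(0, 0.2)] for _ in range(20)]
neg = [[random.gauss(0, 0.2), random.gauss(0, 0.2)] for _ in range(20)]
detector = train_logistic(pos + neg, [1] * 20 + [0] * 20)
print(detector([1.0, 1.0]))  # high probability: concept likely present
print(detector([0.0, 0.0]))  # low probability: concept likely absent
```

In a real system xi would be a visual descriptor (color histogram, SIFT bag-of-words, etc.) and the learner is typically an SVM or similar, as discussed later in the lecture.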
Concept Detection Process
Precision / Recall
Recall: Percentage of all relevant documents that are
retrieved
Precision: Percentage of retrieved documents that
are relevant
F/F1 measure: the harmonic mean of the two,
F1 = 2 * Precision * Recall / (Precision + Recall)
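These measures can be sketched directly; the document ids and numbers below are made up for illustration:

```python
def precision_recall_f1(retrieved, relevant):
    """Compute precision, recall, and F1 from sets of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

# 9 shots retrieved; 5 relevant shots exist, 3 of which were retrieved.
p, r, f = precision_recall_f1({1, 2, 3, 4, 5, 6, 7, 8, 9}, {1, 3, 5, 10, 11})
print(p, r, f)  # precision = 3/9, recall = 3/5, F1 = 3/7
```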
Precision vs. Recall
Precision-recall curves
Recall = 1 / 5
Precision = 1 / 1
[Figure: ranked matches for the query and the resulting point on the precision-recall curve]
Precision vs. Recall
Precision-recall curves
Recall = 2 / 5
Precision = 2 / 3
[Figure: ranked matches for the query and the resulting point on the precision-recall curve]
Precision vs. Recall
Precision-recall curves
Recall = 3 / 5
Precision = 3 / 5
[Figure: ranked matches for the query and the resulting point on the precision-recall curve]
Precision vs. Recall
Precision-recall curves
Recall = 4 / 5
Precision = 4 / 7
[Figure: ranked matches for the query and the resulting point on the precision-recall curve]
Precision vs. Recall
Precision-recall curves
Recall = 5 / 5
Precision = 5 / 9
[Figure: ranked matches for the query and the resulting point on the precision-recall curve]
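In the worked example above, the relevant results sit at ranks 1, 3, 5, 7, and 9 (with 5 relevant shots in total), so the five (recall, precision) points can be reproduced with a short sketch:

```python
def pr_points(relevance, total_relevant):
    """(recall, precision) after each relevant hit in a ranked list.
    relevance: 1/0 per rank, top result first."""
    points, hits = [], 0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            points.append((hits / total_relevant, hits / rank))
    return points

# Relevant results at ranks 1, 3, 5, 7, 9 (as in the slides):
for recall, precision in pr_points([1, 0, 1, 0, 1, 0, 1, 0, 1], 5):
    print(f"recall = {recall:.2f}, precision = {precision:.2f}")
```

This yields exactly the sequence on the slides: (1/5, 1/1), (2/5, 2/3), (3/5, 3/5), (4/5, 4/7), (5/5, 5/9).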
Inferred average precision (infAP)
Developed by Emine Yilmaz and Javed A. Aslam at
Northeastern University*
Estimates average precision well using a small
sample of judgments from the usual submission
pools
Experiments on TRECVID 2005 & 2006 & 2007
feature submissions confirmed quality of the estimate
in terms of actual scores and system ranking
* J. A. Aslam, V. Pavlu, and E. Yilmaz, "A Statistical Method for System Evaluation Using Incomplete Judgments," Proceedings of the 29th ACM SIGIR Conference, Seattle, 2006.
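The actual infAP estimator is defined in the paper above; as a point of reference, ordinary average precision over a fully judged ranked list is simply the mean of the precision values at the relevant hits. A minimal sketch:

```python
def average_precision(relevance, total_relevant):
    """Average precision for a ranked list with complete judgments.
    relevance: 1/0 per rank, top result first."""
    score, hits = 0.0, 0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            score += hits / rank  # precision at this relevant hit
    return score / total_relevant if total_relevant else 0.0

# Same ranked list as the precision/recall example (relevant at 1, 3, 5, 7, 9):
print(average_precision([1, 0, 1, 0, 1, 0, 1, 0, 1], 5))
```

infAP estimates this same quantity from only a random sample of the judgments, which is what makes judging a 50% sample of the submission pools workable.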
Top 10 runs in HLF (TRECVID 2009)
Trends in HLF task -2008
HLF task is accepted as an important building
block for search
More interest in category B and C
submissions (using web data, e.g., Flickr
images, YouTube)
Hardly any feature specific approaches
Large variety in classifier architectures and
choices of feature representations
Using salient/SIFT points as features becomes
more and more popular!
TRECVID 2009 Observations
Focus on robustness, merging many different
representations
Comparing fusion strategies
Efficiency improvements (e.g. GPU implementations)
Analysis of more than one keyframe per shot
Audio analysis
Using temporal context information
Analyzing motion information
Automatic extraction of Flickr training data
State of the art
[Figure: the typical concept detection pipeline.
Training: labeled examples → low-level feature extraction → supervised learner.
Testing: feature measurement → classification, yielding e.g. "outdoor, probability 0.95" or "airplane, probability 0.7"]
Points to Consider
High-level feature learning consists in training a
system from sets of positive and negative examples.
The system’s performance depends a lot on the
implementation choices and details. It also strongly
depends on the SIZE and QUALITY of the TRAINING
EXAMPLES.
While it is quite easy and cheap to get large amounts
of raw data, it is usually very costly to have them
annotated.
Annotations
Collaborative Annotation Effort
Sequential annotation interface
Parallel annotation interface
Keyframes
A set of frames that best represent the visual
content of the scene
Many techniques available for keyframe
extraction:
Take the first, middle or last frames
Temporal change
Clustering
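The first two techniques can be sketched as follows. Frames are represented here simply as per-frame feature vectors (e.g. color histograms), and the threshold is an illustrative choice, not a recommended value:

```python
def middle_keyframe(shot):
    """Simplest choice: take the middle frame of the shot."""
    return shot[len(shot) // 2]

def temporal_change_keyframes(shot, threshold=0.5):
    """Keep the first frame, then every frame whose distance from the
    last kept keyframe exceeds a threshold (temporal-change criterion)."""
    def dist(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))
    keyframes = [shot[0]]
    for frame in shot[1:]:
        if dist(frame, keyframes[-1]) > threshold:
            keyframes.append(frame)
    return keyframes

# Toy "shot": a 1-D feature per frame, with a visual change halfway through.
shot = [[0.0]] * 5 + [[1.0]] * 5
print(middle_keyframe(shot))                  # [1.0]
print(len(temporal_change_keyframes(shot)))   # 2 keyframes
```

Clustering-based selection works similarly: cluster the per-frame features and take the frame closest to each cluster center.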
Active Learning
Use an existing system and heuristics for
selecting the samples to annotate → requires
a classification score.
Annotate first or only the samples that are
expected to be the most informative for
system training → various strategies.
Get the same performance with fewer annotations
and/or get better performance with the same
annotation count
Active Learning Strategies
Random sampling
Uncertainty sampling: choose the most uncertain
samples
(samples whose predicted probability is closest to 0.5 are selected)
Relevance sampling: choose the most probable positive
samples
(samples whose predicted probability is closest to 1.0 are selected)
Choose the samples farthest from the already evaluated
ones
Combinations of these, e.g. choose samples from
amongst the most probable ones and amongst the
farthest from the already evaluated ones.
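Assuming each unlabeled sample already has a detector score p(wj | x), uncertainty and relevance sampling differ only in their sort key. A sketch (the sample ids and scores are made up):

```python
def uncertainty_sampling(scores, k):
    """Pick the k sample ids whose probability is closest to 0.5."""
    return sorted(scores, key=lambda i: abs(scores[i] - 0.5))[:k]

def relevance_sampling(scores, k):
    """Pick the k sample ids whose probability is closest to 1.0."""
    return sorted(scores, key=lambda i: 1.0 - scores[i])[:k]

# Classifier scores for five unlabeled samples:
scores = {"a": 0.05, "b": 0.48, "c": 0.65, "d": 0.92, "e": 0.99}
print(uncertainty_sampling(scores, 2))  # ['b', 'c']
print(relevance_sampling(scores, 2))    # ['e', 'd']
```

The selected samples are annotated, the detector is retrained, and the loop repeats until the annotation budget is spent.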
Evaluated Strategies
Random sampling: baseline
Relevance sampling is the best one when a small fraction
(less than 15%) of the dataset is annotated.
Uncertainty sampling is the best one when a medium to
large fraction (15% or more) of the dataset is annotated.
[Figure: MAP vs. amount of annotated data for the three strategies]
Active Learning Conclusions
The maximum performance is reached when 12 to 15% of
the whole dataset is annotated (for 36K samples).
The optimal fraction to annotate depends upon the size of
the training set: it roughly varies with the square root of
the training set size (25 to 30% for 9K samples).
Random sampling is better than linear scan.
Simulated active learning can improve system
performance even on fully annotated training sets.
Uncertainty sampling is more “precision oriented”.
Relevance sampling is more “recall oriented”.