
Page 1:

Content-based Image and Video Retrieval

Vorlesung, SS 2010

Content-based Image and Video Analysis

-High-Level Feature Detection-

Hazım Kemal Ekenel, [email protected]

Rainer Stiefelhagen, [email protected]

CV-HCI Research Group: http://cvhci.ira.uka.de/

31.05.2010

Page 2:

Last week

Introduction to digital video processing

Shot boundary detection

A shot is a sequence of frames captured by a single camera in a single continuous action.

A case study: TV Genre Classification

Estimating the type of a TV program using visual + aural + cognitive + structural cues


Page 3:

This week

Concept detection / high-level feature detection

Data collection & annotation

Concept lexicon

Evaluations

Sample systems: IBM, MediaMill, Columbia University

Visual features

Learning approaches

Fusion techniques

Issues & challenges


Page 4:

Video Retrieval


Search using content

Example I:

IBM IMARS Multimedia Analysis and Retrieval System:

http://mp7.watson.ibm.com/?ibm-download-now=View+demo

Page 5:

Video Retrieval

Example II:

University of Amsterdam MediaMill

http://www.science.uva.nl/research/mediamill/demo/index.php


Page 6:

Video Retrieval

Example III:

Uni Karlsruhe, CV-HCI research group


Page 7:

Levels of image/video retrieval

Level 1: Based on color, texture, shape features

Images are compared based on low-level features; no semantics involved
A lot of research has been done; it is a feasible task
Level 2: Bring semantic meanings into the search
E.g., identifying human beings, horses, trees, beaches

Requires retrieval techniques of level 1

Level 3: Retrieval with abstract and subjective attributes

Find pictures of a particular birthday celebration

Find a picture of a happy beautiful woman

Requires retrieval techniques of level 2 and very complex logic

Page 8:

What is a concept/high-level feature?


Page 9:

Challenges


Changes in

• View angle,

• Scale,

• Color,

• Shape …

Page 10:

Large-Scale Concept Ontology for Multimedia (LSCOM)


• Collaborative activity of three critical communities to create a user-driven concept ontology for analysis of video:
users (analysts, broadcasters), ontology experts, and technical researchers, algorithm designers & system developers

Page 11:

Sample Concepts

000 – Parade
Definition: Multiple units of marchers, devices, bands, banners or music.
001 – Exiting_Car
Definition: A car exiting from somewhere, such as a highway, building, or parking lot.
002 – Handshaking
Definition: Two people shaking hands. Does not include hugging or holding hands.

003 – Running

Definition: One or more people running.

004 – Airplane_Crash

Definition: Airplane crash site.


Page 12:

Sample Concepts

006 – Demonstration_Or_Protest
Definition: One or more people protesting. May or may not have banners or signs.
007 – People_Crying
Definition: One or more people with visible tears.
008 – Airplane_Takeoff
Definition: Airplane heading down the runway for takeoff (may have already left the runway and be ascending).
009 – Airplane_Landing
Definition: Airplane descending or decelerating after making contact with the runway.
010 – Helicopter_Hovering
Definition: Helicopter in the air. May be moving or staying in place.


Page 13:

Evaluation: TRECVID

High-level Feature Detection Task

Promote progress in content-based analysis, detection, and retrieval in large amounts of digital video
Combine multiple errorful sources of evidence
Achieve greater effectiveness, speed, and usability
Confront systems with unfiltered data and realistic tasks

Measure systems against human abilities


Page 14:

Evolution of TRECVID


Page 15:

TV2007 vs. TV2008 vs. TV2009 datasets


Page 16:

TV2009: selection of 10 new features

Participants suggested features that include:

Parts of natural scenes.

Child.

Sports.

Non-speech audio component.

People and objects in action.

Frequency in consumer video.

NIST's basic selection criteria:
Features have to be moderately frequent
Have a clear definition
Be of use in searching
No overlap with previously used topics/features


Page 17:

20 features evaluated

The 10 marked with “*” are a subset of those tested in 2008


Page 18:

Frequency of hits varies by feature (TRECVID 2009)


Page 19:

High-level feature detection task (1)

Goal: Build a benchmark collection for visual concept detection methods
Secondary goals:
encourage generic (scalable) methods for detector development
semantic annotation is important for search/browsing
Video data collection:
News magazines, science news, news reports, documentaries, educational programming and archival video


Page 20:

High-level feature detection task (2)

NIST evaluated a set of features using a 50% random sample of the submission pools (inferred AP).
Four training types were allowed:
A: Systems trained on only common TRECVID development collection data, OR (formerly B) systems trained on only common development collection data but not on (just) the common annotation of it
C: System is not of type A
a, c: same as A and C, but no training data specific to any Sound and Vision data has been used


Page 21:

Evaluation

Each feature is assumed to be binary: absent or present for each master reference shot
Task: Find shots that contain a certain feature, rank them according to a confidence measure, and submit the top 2000
NIST pooled and judged top results from all submissions
Evaluated performance effectiveness by calculating the inferred average precision of each feature result
Compared runs in terms of mean inferred average precision across the feature results.
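The ranking step described above can be illustrated with a small sketch; this is a simplified illustration only (not the official TRECVID run format), and the shot IDs and confidence scores are made up.

```python
# Sketch: rank shots by detector confidence and keep the top 2000 for one feature.
# Shot IDs and confidence scores below are synthetic placeholders.
scores = {f"shot{i}_1": 1.0 / (i + 1) for i in range(5000)}   # shot_id -> confidence

ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:2000]
for rank, (shot_id, confidence) in enumerate(ranked[:3], start=1):
    print(rank, shot_id, f"{confidence:.4f}")
```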


Page 22:

Problem definition

Given an n-dimensional feature vector x_i, extracted from a shot i, the aim is to obtain a measure which indicates whether semantic concept w_j is present in shot i.
Various visual feature extraction methods can be used to obtain x_i.
Several supervised machine learning approaches can be used to learn the relation between w_j and x_i: p(w_j | x_i)
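As a rough illustration of this formulation (not any particular TRECVID system), the sketch below trains a probabilistic binary classifier on synthetic feature vectors and reads off p(w_j | x_i); scikit-learn and the logistic-regression choice are assumptions made here for the example.

```python
# Sketch: estimate p(w_j | x_i) for one concept with a probabilistic classifier.
# Feature vectors and labels are synthetic stand-ins for real shot features/annotations.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 64))        # x_i: one 64-d feature vector per training shot
y_train = (X_train[:, 0] > 0).astype(int)   # 1 if concept w_j is annotated as present

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

X_test = rng.normal(size=(5, 64))           # feature vectors of unseen shots
p = clf.predict_proba(X_test)[:, 1]         # p(w_j | x_i) for each test shot
print(p)
```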


Page 23:

Concept Detection Process


Page 24:

Precision / Recall

Recall: Percentage of all relevant documents that are retrieved
Precision: Percentage of retrieved documents that are relevant
F/F1 measure: the harmonic mean of precision and recall, F1 = 2 · Precision · Recall / (Precision + Recall)
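A minimal sketch of these three measures for one retrieved set; the relevant and retrieved shot IDs below are made-up examples.

```python
# Sketch: precision, recall and F1 for a single retrieved set of shots.
relevant = {101, 102, 103, 104, 105}      # ground-truth relevant shots (example)
retrieved = [101, 240, 104]               # shots returned by the system (example)

tp = len(relevant.intersection(retrieved))
precision = tp / len(retrieved)           # retrieved shots that are relevant: 2/3
recall = tp / len(relevant)               # relevant shots that are retrieved: 2/5
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(precision, recall, f1)              # 0.666..., 0.4, 0.5
```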


Page 25:

Precision vs. Recall

Precision-recall curves

Recall = 1 / 5

Precision = 1 / 1

[Figure: precision-recall plot for the example query; the ranked matches are, in order, results 4, 7, 1, 5, 2, 8, 6, 3 and 9]

Page 26:

Precision vs. Recall

Precision-recall curves

Recall = 2 / 5

Precision = 2 / 3

[Figure: precision-recall plot and ranked matches, as on the previous slide]

Page 27:

Precision vs. Recall

Precision-recall curves

Recall = 3 / 5

Precision = 3 / 5

[Figure: precision-recall plot and ranked matches, as on the previous slide]

Page 28:

Precision vs. Recall

Precision-recall curves

Recall = 4 / 5

Precision = 4 / 7

[Figure: precision-recall plot and ranked matches, as on the previous slide]

Page 29:

Precision vs. Recall

Precision-recall curves

Recall = 5 / 5

Precision = 5 / 9

[Figure: precision-recall plot and ranked matches, as on the previous slide]
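The five (recall, precision) points on the preceding slides can be reproduced directly from the ranked list; the relevance pattern below (relevant results at ranks 1, 3, 5, 7 and 9) is inferred from those precision values.

```python
# Sketch: (recall, precision) after each relevant hit in a ranked list of 9 results.
# 1 = relevant, 0 = not relevant, in ranked order; 5 relevant shots exist overall.
relevance = [1, 0, 1, 0, 1, 0, 1, 0, 1]
num_relevant = 5

hits = 0
for rank, rel in enumerate(relevance, start=1):
    if rel:
        hits += 1
        print(f"recall = {hits}/{num_relevant}, precision = {hits}/{rank}")
# recall = 1/5, precision = 1/1
# recall = 2/5, precision = 2/3
# ... up to recall = 5/5, precision = 5/9
```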

Page 30:

Inferred average precision (infAP)

Developed by Emine Yilmaz and Javed A. Aslam at Northeastern University*
Estimates average precision well using a small sample of judgments from the usual submission pools
Experiments on TRECVID 2005, 2006 & 2007 feature submissions confirmed the quality of the estimate in terms of actual scores and system ranking


* J. A. Aslam, V. Pavlu and E. Yilmaz, "A Statistical Method for System Evaluation Using Incomplete Judgments," Proceedings of the 29th ACM SIGIR Conference, Seattle, 2006.
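For reference, the quantity that infAP estimates is ordinary average precision; the sketch below computes it for the ranked-list example used on the previous slides. The actual infAP estimator, which works from a random sample of judgments, is given in the cited paper and is not reproduced here.

```python
# Sketch: (non-inferred) average precision of a ranked result list.
# infAP estimates this value from a random sample of the judgment pool;
# see Aslam, Pavlu and Yilmaz (SIGIR 2006) for the actual estimator.
def average_precision(relevance, num_relevant):
    hits, precisions = 0, []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / num_relevant

print(average_precision([1, 0, 1, 0, 1, 0, 1, 0, 1], num_relevant=5))
# (1/1 + 2/3 + 3/5 + 4/7 + 5/9) / 5 ≈ 0.679
```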

Page 31:

Top 10 runs in HLF (TRECVID 2009)


Page 32:

Trends in the HLF task (2008)
The HLF task is accepted as an important building block for search
More interest in category B and C submissions (using web data, e.g. Flickr images, YouTube)
Hardly any feature-specific approaches
Large variety in classifier architectures and choices of feature representations
Using salient/SIFT points as features becomes more and more popular!


Page 33:

TRECVID 2009 Observations

Focus on robustness, merging many different representations

Comparing fusion strategies

Efficiency improvements (e.g. GPU implementations)

Analysis of more than one keyframe per shot

Audio analysis

Using temporal context information

Analyzing motion information

Automatic extraction of Flickr training data
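One of the observations above concerns comparing fusion strategies. As a hedged illustration (not any particular group's method), the sketch below does simple late fusion by averaging min-max-normalised scores of two hypothetical detectors; the score arrays and equal weights are made up.

```python
# Sketch: late fusion of two detectors' confidence scores by weighted averaging.
# The two score arrays stand in for e.g. a colour-histogram run and a SIFT-based run.
import numpy as np

scores_color = np.array([0.90, 0.20, 0.60, 0.10])   # one score per shot (synthetic)
scores_sift = np.array([0.70, 0.40, 0.80, 0.20])

def min_max(s):
    """Normalise a run to [0, 1] so the two score ranges are comparable."""
    return (s - s.min()) / (s.max() - s.min())

fused = 0.5 * min_max(scores_color) + 0.5 * min_max(scores_sift)
print(fused)
```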


Page 34:

State of the art


[Diagram: training: labeled examples → low-level feature extraction → supervised learner; testing: feature measurement → classification, e.g. Outdoor with probability 0.95, Airplane with probability 0.7]

Page 35:

Points to Consider

High-level feature learning consists of training a system from sets of positive and negative examples.
The system's performance depends heavily on the implementation choices and details. It also strongly depends on the SIZE and QUALITY of the TRAINING EXAMPLES.
While it is quite easy and cheap to get large amounts of raw data, it is usually very costly to have them annotated.


Page 36:

Annotations

Collaborative Annotation Effort


Sequential annotation interface
Parallel annotation interface

Page 37:

Keyframes

A set of frames that best represent the visual content of the scene
Many techniques available for keyframe extraction:

Take the first, middle or last frames

Temporal change

Clustering
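A possible sketch of the clustering option, assuming OpenCV and scikit-learn are available; "shot.mp4" is a hypothetical input path, and the use of colour histograms and k = 3 keyframes is an illustrative choice, not the method used in the lecture.

```python
# Sketch: keyframe selection by k-means clustering of frame colour histograms.
# "shot.mp4" is a hypothetical input file; k is an illustrative choice.
import cv2
import numpy as np
from sklearn.cluster import KMeans

cap = cv2.VideoCapture("shot.mp4")
frames, hists = [], []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    frames.append(frame)
    h = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8], [0, 256] * 3)
    hists.append(cv2.normalize(h, h).flatten())
cap.release()

hists = np.array(hists)
k = 3
km = KMeans(n_clusters=k, n_init=10).fit(hists)
# keyframe for each cluster = the frame whose histogram is closest to the cluster centre
keyframes = [frames[int(np.argmin(np.linalg.norm(hists - c, axis=1)))]
             for c in km.cluster_centers_]
```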


Page 38:

Active Learning

Use an existing system and heuristics for selecting the samples to annotate → requires a classification score.
Annotate first (or only) the samples that are expected to be the most informative for system training → various strategies.
Get the same performance with fewer annotations, and/or get better performance with the same annotation count.


Page 39:

Active Learning Strategies

Random sampling

Uncertainty sampling: choose the most uncertain samples, i.e. the samples whose probability is closest to 0.5.
Relevance sampling: choose the most probable positive samples, i.e. the samples whose probability is closest to 1.0.
Choose the samples farthest from the already evaluated ones.
Combinations of these, e.g. choose the samples amongst the most probable ones and amongst the farthest from the already evaluated ones.
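A minimal sketch of the first three strategies, assuming the current detector already yields a probability per unlabelled shot; the scores and batch size here are synthetic placeholders.

```python
# Sketch: choosing the next batch of shots to annotate from detector probabilities.
import numpy as np

rng = np.random.default_rng(0)
probs = rng.random(1000)   # p(concept | shot) for the unlabelled shots (synthetic)
batch = 20                 # annotation budget for this round

uncertain = np.argsort(np.abs(probs - 0.5))[:batch]   # uncertainty sampling: closest to 0.5
most_positive = np.argsort(-probs)[:batch]            # relevance sampling: closest to 1.0
random_pick = rng.choice(len(probs), size=batch, replace=False)   # random baseline
```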


Page 40:

Evaluated Strategies

Random sampling: baseline

Relevance sampling is the best one when a small fraction (less than 15%) of the dataset is annotated.
Uncertainty sampling is the best one when a medium to large fraction (15% or more) of the dataset is annotated.

[Plot: MAP vs. amount of annotated data for the different sampling strategies]

Page 41:

Active Learning Conclusions

The maximum performance is reached when 12 to 15% of the whole dataset is annotated (for 36K samples).
The optimal fraction to annotate depends upon the size of the training set: it roughly varies with the square root of the training set size (25 to 30% for 9K samples).
Random sampling is better than linear scan.
Simulated active learning can improve system performance even on fully annotated training sets.
Uncertainty sampling is more "precision oriented".
Relevance sampling is more "recall oriented".
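Reading the square-root rule above as "the fraction to annotate shrinks with the square root of the dataset size", the two operating points quoted on this slide are consistent with each other; a small check (the reference fraction of 13.5% is just the midpoint of the quoted 12 to 15% range):

```python
# Sketch: fraction_to_annotate ∝ 1 / sqrt(dataset size), anchored at ~13.5% for 36K samples.
import math

def fraction(n, ref_n=36_000, ref_fraction=0.135):
    return ref_fraction * math.sqrt(ref_n / n)

print(f"{fraction(36_000):.2f}")   # 0.14 -> within the 12-15% range quoted for 36K samples
print(f"{fraction(9_000):.2f}")    # 0.27 -> within the 25-30% range quoted for 9K samples
```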
