Content-based Image and Video Retrieval
Lecture, Summer Semester 2010
Content-based Image and Video Analysis
-High-Level Feature Detection-
Hazım Kemal Ekenel, [email protected]
Rainer Stiefelhagen, [email protected]
CV-HCI Research Group: http://cvhci.ira.uka.de/
31.05.2010
Last week
Introduction to digital video processing
Shot boundary detection
A shot is a sequence of frames captured by a
single camera in a single continuous action.
A case study: TV Genre Classification
Estimating the type of the TV program using visual
+ aural + cognitive + structural cues
Content-based image and video retrieval 2
This week
Concept detection / high-level feature detection
Data collection & annotation
Concept lexicon
Evaluations
Sample systems: IBM, MediaMill, Columbia University
Visual features
Learning approaches
Fusion techniques
Issues & challenges
Video Retrieval
Search using content
Example I:
IBM IMARS Multimedia Analysis and Retrieval System:
http://mp7.watson.ibm.com/?ibm-download-now=View+demo
Video Retrieval
Example II:
University of Amsterdam MediaMill
http://www.science.uva.nl/research/mediamill/demo/index.php
Video Retrieval
Example III:
Uni-Karlsruhe, CVCHI research group
Levels of image/video retrieval
Level 1: Based on color, texture, shape features
Images are compared based on low-level features,
no semantics involved
Extensively researched; a feasible task
Level 2: Bring semantic meanings into the search
E.g., identifying human beings, horses, trees, beaches
Requires retrieval techniques of level 1
Level 3: Retrieval with abstract and subjective attributes
Find pictures of a particular birthday celebration
Find a picture of a happy beautiful woman
Requires retrieval techniques of level 2 and very complex logic
What is a concept/high-level feature?
Challenges
Changes in
• View angle,
• Scale,
• Color,
• Shape …
Large-Scale Concept Ontology
for Multimedia (LSCOM)
• Collaborative activity of three critical communities to create
a user-driven concept ontology for analysis of video
The three communities: users (analysts, broadcasters), ontology experts, and technical researchers, algorithm designers & system developers.
Sample Concepts
000 – Parade
Definition: Multiple units of marchers, devices, bands,
banners, or music.
001 – Exiting_Car
Definition: A car exiting from somewhere, such as a
highway, building, or parking lot.
002 – Handshaking
Definition: Two people shaking hands. Does not include
hugging or holding hands.
003 – Running
Definition: One or more people running.
004 – Airplane_Crash
Definition: Airplane crash site.
Sample Concepts
006 – Demonstration_Or_Protest
Definition: One or more people protesting. May or may not
have banners or signs.
007 – People_Crying
Definition: One or more people with visible tears.
008 – Airplane_Takeoff
Definition: Airplane heading down the runway for take off
(may have already left runway and be ascending).
009 – Airplane_Landing
Definition: Airplane descending or decelerating after making
contact with runway.
010 – Helicopter_Hovering
Definition: Helicopter in the air. May be moving or staying in
place.
Evaluation: TRECVID
High-level Feature Detection Task
Promote progress in content-based analysis,
detection, and retrieval in large amounts of
digital video
Combine multiple errorful sources of evidence
Achieve greater effectiveness, speed, and
usability
Confront systems with unfiltered data and
realistic tasks
Measure systems against human abilities
Evolution of TRECVID
TV2007 vs. TV2008 vs. TV2009 datasets
TRECVID 2009: selection of 10 new features
Participants suggested features that include:
Parts of natural scenes.
Child.
Sports.
Non-speech audio component.
People and objects in action.
Frequency in consumer video.
NIST's basic selection criteria: a feature
has to be moderately frequent,
has a clear definition,
is of use in searching, and
does not overlap with previously used topics/features.
20 features evaluated
The 10 marked with “*” are a subset of those tested in 2008
Frequency of hits varies by
feature (TRECVID 2009)
High-level feature detection task (1)
Goal: Build a benchmark collection for visual
concept detection methods
Secondary goals:
encourage generic (scalable) methods for detector
development
semantic annotation is important for
search/browsing
Video data collection:
News magazine, science news, news reports,
documentaries, educational programming and
archival video
High-level feature detection task (2)
NIST evaluated a set of features using a 50% random sample of the submission pools (inferred AP)
Four training types were allowed:
A: systems trained only on common TRECVID development collection data
(formerly B): systems trained only on common development collection data, but not (just) on the common annotation of it
C: systems not of type A
a, c: same as A and C, but with no training data specific to any Sound and Vision data
Evaluation
Each feature assumed to be binary: absent or present
for each master reference shot
Task: Find shots that contain a certain feature, rank
them according to confidence measure, submit the top
2000
NIST pooled and judged top results from all
submissions
Evaluated performance effectiveness by calculating
the inferred average precision of each feature result
Compared runs in terms of mean inferred average
precision across the feature results.
Problem definition
Given an n-dimensional feature vector xi extracted
from a shot i,
the aim is to obtain a measure that indicates
whether a semantic concept wj is present in shot i:
p(wj | xi)
Various visual feature extraction methods can be
used to obtain xi.
Several supervised machine learning approaches
can be used to learn the relation between wj and xi.
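As an illustrative sketch of this formulation (not any particular TRECVID system), a plain logistic-regression detector can map a feature vector xi to an estimate of p(wj | xi); the 2-D toy features below are made up:

```python
import math
import random

def train_logistic(examples, labels, lr=0.5, epochs=500):
    """Train a logistic-regression concept detector with plain SGD.
    examples: feature vectors xi; labels: 1 (concept present) / 0 (absent)."""
    n = len(examples[0])
    w = [0.0] * n
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # current estimate of p(wj | x)
            err = p - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err

    def p_concept(x):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        return 1.0 / (1.0 + math.exp(-z))
    return p_concept

# Toy 2-D features: "concept present" shots cluster around (1, 1).
random.seed(0)
pos = [[1 + random.gauss(0, 0.2), 1 + random.gauss(0, 0.2)] for _ in range(20)]
neg = [[random.gauss(0, 0.2), random.gauss(0, 0.2)] for _ in range(20)]
detector = train_logistic(pos + neg, [1] * 20 + [0] * 20)
print(detector([1.0, 1.0]))  # high probability: concept likely present
print(detector([0.0, 0.0]))  # low probability: concept likely absent
```

In a real system xi would be a visual descriptor (color histogram, SIFT bag-of-words, etc.) and the learner is typically an SVM or similar, as discussed later in the lecture.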
Concept Detection Process
Precision / Recall
Recall: Percentage of all relevant documents that are
retrieved
Precision: Percentage of retrieved documents that
are relevant
F/F1 measure: the harmonic mean of the two,
F1 = 2 * Precision * Recall / (Precision + Recall)
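These measures can be sketched directly; the document ids and numbers below are made up for illustration:

```python
def precision_recall_f1(retrieved, relevant):
    """Compute precision, recall, and F1 from sets of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

# 9 shots retrieved; 5 relevant shots exist, 3 of which were retrieved.
p, r, f = precision_recall_f1({1, 2, 3, 4, 5, 6, 7, 8, 9}, {1, 3, 5, 10, 11})
print(p, r, f)  # precision = 3/9, recall = 3/5, F1 = 3/7
```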
Precision vs. Recall
Precision-recall curves
Recall = 1 / 5
Precision = 1 / 1
[Figure: ranked matches for the query and the resulting point on the precision-recall curve]
Precision vs. Recall
Precision-recall curves
Recall = 2 / 5
Precision = 2 / 3
[Figure: ranked matches for the query and the resulting point on the precision-recall curve]
Precision vs. Recall
Precision-recall curves
Recall = 3 / 5
Precision = 3 / 5
[Figure: ranked matches for the query and the resulting point on the precision-recall curve]
Precision vs. Recall
Precision-recall curves
Recall = 4 / 5
Precision = 4 / 7
[Figure: ranked matches for the query and the resulting point on the precision-recall curve]
Precision vs. Recall
Precision-recall curves
Recall = 5 / 5
Precision = 5 / 9
[Figure: ranked matches for the query and the resulting point on the precision-recall curve]
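In the worked example above, the relevant results sit at ranks 1, 3, 5, 7, and 9 (with 5 relevant shots in total), so the five (recall, precision) points can be reproduced with a short sketch:

```python
def pr_points(relevance, total_relevant):
    """(recall, precision) after each relevant hit in a ranked list.
    relevance: 1/0 per rank, top result first."""
    points, hits = [], 0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            points.append((hits / total_relevant, hits / rank))
    return points

# Relevant results at ranks 1, 3, 5, 7, 9 (as in the slides):
for recall, precision in pr_points([1, 0, 1, 0, 1, 0, 1, 0, 1], 5):
    print(f"recall = {recall:.2f}, precision = {precision:.2f}")
```

This yields exactly the sequence on the slides: (1/5, 1/1), (2/5, 2/3), (3/5, 3/5), (4/5, 4/7), (5/5, 5/9).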
Inferred average precision (infAP)
Developed by Emine Yilmaz and Javed A. Aslam at
Northeastern University*
Estimates average precision well using a small
sample of judgments from the usual submission
pools
Experiments on TRECVID 2005 & 2006 & 2007
feature submissions confirmed quality of the estimate
in terms of actual scores and system ranking
* J. A. Aslam, V. Pavlu, and E. Yilmaz, "A Statistical Method for System Evaluation Using Incomplete Judgments," Proceedings of the 29th ACM SIGIR Conference, Seattle, 2006.
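The actual infAP estimator is defined in the paper above; as a point of reference, ordinary average precision over a fully judged ranked list is simply the mean of the precision values at the relevant hits. A minimal sketch:

```python
def average_precision(relevance, total_relevant):
    """Average precision for a ranked list with complete judgments.
    relevance: 1/0 per rank, top result first."""
    score, hits = 0.0, 0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            score += hits / rank  # precision at this relevant hit
    return score / total_relevant if total_relevant else 0.0

# Same ranked list as the precision/recall example (relevant at 1, 3, 5, 7, 9):
print(average_precision([1, 0, 1, 0, 1, 0, 1, 0, 1], 5))
```

infAP estimates this same quantity from only a random sample of the judgments, which is what makes judging a 50% sample of the submission pools workable.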
Top 10 runs in HLF (TRECVID 2009)
Trends in HLF task -2008
HLF task is accepted as an important building
block for search
More interest in category B and C
submissions (using web data, e.g., Flickr
images, YouTube)
Hardly any feature specific approaches
Large variety in classifier architectures and
choices of feature representations
Using salient/SIFT points as features becomes
more and more popular!
TRECVID 2009 Observations
Focus on robustness, merging many different
representations
Comparing fusion strategies
Efficiency improvements (e.g. GPU implementations)
Analysis of more than one keyframe per shot
Audio analysis
Using temporal context information
Analyzing motion information
Automatic extraction of Flickr training data
State of the art
[Figure: the typical concept detection pipeline.
Training: labeled examples → low-level feature extraction → supervised learner.
Testing: feature measurement → classification, yielding e.g. "outdoor, probability 0.95" or "airplane, probability 0.7"]
Points to Consider
High-level feature learning consists in training a
system from sets of positive and negative examples.
The system’s performance depends a lot on the
implementation choices and details. It also strongly
depends on the SIZE and QUALITY of the TRAINING
EXAMPLES.
While it is quite easy and cheap to get large amounts
of raw data, it is usually very costly to have them
annotated.
Annotations
Collaborative Annotation Effort
Sequential annotation interface
Parallel annotation interface
Keyframes
A set of frames that best represent the visual
content of the scene
Many techniques available for keyframe
extraction:
Take the first, middle or last frames
Temporal change
Clustering
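The first two techniques can be sketched as follows. Frames are represented here simply as per-frame feature vectors (e.g. color histograms), and the threshold is an illustrative choice, not a recommended value:

```python
def middle_keyframe(shot):
    """Simplest choice: take the middle frame of the shot."""
    return shot[len(shot) // 2]

def temporal_change_keyframes(shot, threshold=0.5):
    """Keep the first frame, then every frame whose distance from the
    last kept keyframe exceeds a threshold (temporal-change criterion)."""
    def dist(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))
    keyframes = [shot[0]]
    for frame in shot[1:]:
        if dist(frame, keyframes[-1]) > threshold:
            keyframes.append(frame)
    return keyframes

# Toy "shot": a 1-D feature per frame, with a visual change halfway through.
shot = [[0.0]] * 5 + [[1.0]] * 5
print(middle_keyframe(shot))                  # [1.0]
print(len(temporal_change_keyframes(shot)))   # 2 keyframes
```

Clustering-based selection works similarly: cluster the per-frame features and take the frame closest to each cluster center.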
Active Learning
Use an existing system and heuristics for
selecting the samples to annotate → requires
a classification score.
Annotate first or only the samples that are
expected to be the most informative for
system training → various strategies.
Get the same performance with fewer annotations
and/or get better performance with the same
annotation count
Active Learning Strategies
Random sampling
Uncertainty sampling: choose the most uncertain
samples
(samples whose predicted probability is closest to 0.5 are selected)
Relevance sampling: choose the most probable positive
samples
(samples whose predicted probability is closest to 1.0 are selected)
Choose the samples farthest from the already evaluated
ones
Combinations of these, e.g. choose samples from
amongst the most probable ones and amongst the
farthest from the already evaluated ones.
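Assuming each unlabeled sample already has a detector score p(wj | x), uncertainty and relevance sampling differ only in their sort key. A sketch (the sample ids and scores are made up):

```python
def uncertainty_sampling(scores, k):
    """Pick the k sample ids whose probability is closest to 0.5."""
    return sorted(scores, key=lambda i: abs(scores[i] - 0.5))[:k]

def relevance_sampling(scores, k):
    """Pick the k sample ids whose probability is closest to 1.0."""
    return sorted(scores, key=lambda i: 1.0 - scores[i])[:k]

# Classifier scores for five unlabeled samples:
scores = {"a": 0.05, "b": 0.48, "c": 0.65, "d": 0.92, "e": 0.99}
print(uncertainty_sampling(scores, 2))  # ['b', 'c']
print(relevance_sampling(scores, 2))    # ['e', 'd']
```

The selected samples are annotated, the detector is retrained, and the loop repeats until the annotation budget is spent.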
Evaluated Strategies
Random sampling: baseline
Relevance sampling is the best one when a small fraction
(less than 15%) of the dataset is annotated.
Uncertainty sampling is the best one when a medium to
large fraction (15% or more) of the dataset is annotated.
[Figure: MAP vs. amount of annotated data for the three strategies]
Active Learning Conclusions
The maximum performance is reached when 12 to 15% of
the whole dataset is annotated (for 36K samples).
The optimal fraction to annotate depends upon the size of
the training set: it roughly varies with the square root of
the training set size (25 to 30% for 9K samples).
Random sampling is better than linear scan.
Simulated active learning can improve system
performance even on fully annotated training sets.
Uncertainty sampling is more “precision oriented”.
Relevance sampling is more “recall oriented”.