
Formulating Semantic Image Annotation as a Supervised Learning Problem
Gustavo Carneiro and Nuno Vasconcelos, CVPR '05
Presentation by: Douglas Turnbull, CSE Department, UCSD
Topic in Vision and Learning, November 3, 2005


TRANSCRIPT

Page 1:

Formulating Semantic Image Annotation as a Supervised Learning Problem

Gustavo Carneiro and Nuno Vasconcelos, CVPR '05

Presentation by: Douglas Turnbull

CSE Department, UCSD

Topic in Vision and Learning, November 3, 2005

Page 2:

What is Image Annotation?

Given an image, what are the words that describe the image?

Page 3:

What is Image Retrieval?

Given a database of images and a query string (e.g. words), what are the images that are described by the words?

Query String: “jet”

Page 4:

Problem: Image Annotation & Retrieval

Because both digital cameras and hard disk space are inexpensive, billions of consumers have the ability to create and store digital images.

There are already billions of digital images stored on personal computers and in commercial databases.

How do we store images in, and retrieve images from, a large database?

Speaker notes (dturnbul):
(250M Americans) × (5% own a digital camera) × (100 photos per year) = 1.25 billion photos per year.
A Google Image search on ".jpg" returns 11.8 million hits; a Yahoo Image search on ".jpg" returns 137 million hits.
Page 5:

Problem: Image Annotation & Retrieval

In general, people do not spend time labeling, organizing, or annotating their personal image collections.

Label:
• Images are often stored with the name produced by the digital camera: "DSC002861.jpg"
• When images are labeled, they are given vague names that rarely describe their content: "GoodTimes.jpg", "China05.jpg"

Organize:
• No standard scheme exists for filing images.
• Individuals use ad hoc methods: "Christmas2005Photos" and "Sailing_Photos".
• It is hard to merge image collections, since the taxonomies (e.g. directory hierarchies) differ from user to user.

Page 6:

Problem: Image Annotation & Retrieval

In general, people do not spend time labeling, organizing, or annotating their personal image collections.

Annotate:
• Explicit annotation: We rarely annotate our images with captions explicitly.
– An exception is when we create web galleries (e.g. wedding photos on www.KodakGallery.com).
• Implicit annotation: Sometimes we implicitly annotate images when we embed them in text, as is the case with webpages.
– Web-based search engines make use of this implicit annotation when they index images (e.g. Google Image Search, Picsearch).

Page 7:

Problem: Image Annotation & Retrieval

If we can't depend on human labeling, organization, or annotation, we will have to resort to "content-based image retrieval":
– We extract feature vectors from each image.
– Based on these feature vectors, we use statistical models to characterize the relationship between a query and the image features.

How do we specify a meaningful query to be able to navigate this image feature space?

Page 8:

Problem: Image Annotation & Retrieval

Content-Based Image Retrieval: How do we specify a query?

Query-by-sketch: Sketch a picture, extract features from the sketch, and use the features to find similar images in the database.

This requires that:
1. we have a good drawing interface handy,
2. everybody is able to draw, and
3. a quick sketch is able to capture the salient nature of the desired query.

Not a very feasible approach.

Page 9:

Problem: Image Annotation & Retrieval

Content-Based Image Retrieval: How do we specify a query?

Query-by-text: Input words into a statistical model that relates words to image features.

This requires that:
1. we have a keyboard,
2. we have a statistical model that can relate words to image features, and
3. words can be used to capture the salient nature of the desired query.

A number of research systems have been developed that find a relationship between content-based image features and text for the purpose of image annotation and retrieval.

– Mori, Takahashi, Oka (1999)
– Duygulu, Barnard, de Freitas (2002)
– Blei, Jordan (2003)
– Feng, Manmatha, Lavrenko (2004)

Speaker notes (dturnbul):
1. Keyboards are everywhere.
2. We will be discussing statistical models that relate words to images.
3. Captions are commonly used to convey information about the content of images.
Page 10:

Outline

Notation and Problem Statement

Three General Approaches to Image Annotation
1. Supervised One vs. All (OVA) Models
2. Unsupervised Models using Latent Variables
3. Supervised M-ary Model

Estimating P(image features|words)

Experimental Setup and Results

Automatic Music Annotation

Page 11:

Outline

Notation and Problem Statement

Three General Approaches to Image Annotation
1. Supervised One vs. All (OVA) Models
2. Unsupervised Models using Latent Variables
3. Supervised M-ary Model

Estimating P(image features|words)

Experimental Setup and Results

Automatic Music Annotation

Page 12:

Notation and Problem Statement

Page 13:

Notation and Problem Statement

xi = one feature vector extracted from an image region
x = {x1, x2, …} = the set of feature vectors for an image
wi = one word
w = {w1, w2, …} = the set of words in an image's caption

[Figure: an image with its caption, and the same image divided into regions]
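To make the notation concrete, here is a minimal Python sketch (names are my own, not from the slides) of one training example: a bag of region feature vectors x plus a set of caption words w.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class CaptionedImage:
    """One training example: region features x = {x1, x2, ...} and caption w."""
    features: np.ndarray  # shape (num_regions, feature_dim); row i is xi
    words: set[str]       # caption words, e.g. {"jet", "sky", "plane"}

# An image cut into 3 regions, each described by a 64-dimensional vector.
example = CaptionedImage(features=np.random.randn(3, 64),
                         words={"jet", "sky", "plane"})
```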

Page 14:

Notation and Problem Statement

Page 15:

Notation and Problem Statement


Page 16:

Notation and Problem Statement

Image Regions

Multiple Instance Learning: this region has no visual aspect of "jet", yet the whole image is labeled "jet".

Weak Labeling: this image depicts sky even though the caption does not contain "sky".

Page 17:

Outline

Notation and Problem Statement

Three General Approaches to Image Annotation
1. Supervised One vs. All (OVA) Models
2. Unsupervised Models using Latent Variables
3. Supervised M-ary Model

Estimating P(image features|words)

Experimental Setup and Results

Automatic Music Annotation

Page 18:

Supervised OVA Models

Early research posed the problem as a supervised learning problem: train a classifier for each semantic concept.

Binary Classification/Detection Problems:
• Holistic concepts: landscape/cityscape, indoor/outdoor scenes
• Object detection: horses, buildings, trees, etc.

Much of the early work focused on feature design and used existing models developed by the machine learning community (SVM, KNN, etc) for classification.

Page 19:

Supervised OVA Models

Page 20:

Supervised OVA Models

Pro:
• Easy to implement.
• Can design features and tune the learning algorithm for each classification task.
• Notion of optimal performance on each task.
• Data sets provide a basis of comparison (e.g. OCR data sets).

Con:
• Doesn't scale well with a large vocabulary.
• Requires training and using L classifiers.
• Hard to compare the posterior probabilities output by L separate classifiers.
• No natural ranking of keywords.
• Weak labeling is a problem: images not labeled with the keyword are placed in the negative set D0.

Page 21:

Unsupervised Models

The goal is to estimate the joint distribution P(x, w) of image features and words.

We introduce a latent (i.e. hidden) variable L that encodes S hidden states of the world, e.g. a "Sky" state or a "Jet" state.

A state defines a joint distribution of image features and keywords; e.g. P(x = (blue, white, fuzzy), w = ("sky", "cloud", "blue") | "Sky" state) will be high.

We can sum over the S states to find the joint distribution (written out below).

Learning is based on expectation maximization (EM):
1) E-step: update the strength of association between each image-caption pair and each state.
2) M-step: maximize the likelihood of the joint distribution for each state.

Annotation selects the most probable words under the joint distribution model.
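Assuming, as these models do, that image features and words are conditionally independent given the state, the marginalization the slide describes can be written as:

```latex
P_{X,W}(x, w) \;=\; \sum_{s=1}^{S} P_L(s)\, P_{X|L}(x \mid s)\, P_{W|L}(w \mid s)
```

Annotation then picks the words w that make this joint probability (equivalently, P(w | x)) largest.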

Page 22:

Unsupervised Models

Multiple-Bernoulli Relevance Model (MBRM) – Feng, Manmatha, Lavrenko (CVPR '04)

• The simplest unsupervised model, and it achieves the best results.
• Each of the D images in the training set is a "not-so-hidden" state.
• Assume conditional independence between image features and keywords given the state.

MBRM eliminates the need for EM, since we don't need to learn the strength of association between image-caption pairs and states.

Parameter estimation is straightforward:
– PX|L is estimated using a Gaussian kernel.
– PW|L reduces to counting.

The algorithm becomes essentially "smoothed k-nearest neighbor" (see the sketch below).
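As a rough illustration (a sketch of the idea, not the authors' implementation, and with unsmoothed word counts), each training image acts as a state: PX|L is a Gaussian kernel density around that image's region features, and PW|L is a Bernoulli indicator of whether the word appears in the caption.

```python
import numpy as np

def gaussian_kernel(x, centers, bandwidth=1.0):
    """Average of isotropic Gaussian kernels centered on one image's regions."""
    d2 = ((x[None, :] - centers) ** 2).sum(axis=1)
    norm = (2 * np.pi * bandwidth ** 2) ** (x.shape[0] / 2)
    return np.exp(-d2 / (2 * bandwidth ** 2)).mean() / norm

def mbrm_word_scores(x_regions, train_images, vocabulary):
    """Score each word w by summing P(x|state) * P(w|state) over all states.

    train_images: list of (region_features, caption_words) pairs, one per image.
    """
    scores = dict.fromkeys(vocabulary, 0.0)
    for centers, words in train_images:
        # P(x | state): product of kernel densities over the query's regions.
        p_x = np.prod([gaussian_kernel(x, centers) for x in x_regions])
        for w in words & set(vocabulary):
            scores[w] += p_x   # Bernoulli term is 1 when w is in the caption
    return scores
```

Ranking the vocabulary by these scores and keeping the top words yields an annotation; a real system would smooth the Bernoulli estimates rather than use raw presence/absence.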

Page 23:

Unsupervised Models

Pros:
• More scalable than Supervised OVA with respect to vocabulary size.
• Natural ranking of keywords.
• Weaker demands on the quality of labeling: robust to a weakly labeled dataset.

Cons:
• No guarantees of optimality, since keywords are not explicitly treated as classes:
– Annotation: what is a good annotation?
– Retrieval: what are the best images given a query string?

Page 24:

Supervised M-ary Model

Critical idea: Why introduce latent variables when a keyword directly represents a semantic class?

Define a random variable W which takes values in {1, …, L} such that W = i if x is labeled with keyword wi.

The class-conditional distributions PX|W(x|i) are estimated using the images that have keyword wi.

To annotate a new image with features x, the Bayes decision rule is invoked (sketched in code below):
i*(x) = argmax_i PX|W(x|i) PW(i)

Unlike Supervised OVA, which consists of solving L binary decision problems, we solve one decision problem with L classes.

The keywords compete to represent the image features.
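A minimal sketch of this decision rule (hypothetical names; each class-conditional density is assumed to expose a log-density function, e.g. from a fitted Gaussian mixture):

```python
import numpy as np

def annotate(x_regions, class_log_densities, class_priors, top_k=5):
    """M-ary annotation: rank keywords by the posterior P(W = i | x).

    class_log_densities: dict keyword -> callable giving log P(x_region | keyword).
    class_priors: dict keyword -> prior P(W = i).
    """
    log_post = {}
    for word, log_density in class_log_densities.items():
        # Regions are assumed independent given the class.
        log_px = sum(log_density(x) for x in x_regions)
        log_post[word] = log_px + np.log(class_priors[word])
    # All L keywords compete in a single decision problem; keep the top k.
    return sorted(log_post, key=log_post.get, reverse=True)[:top_k]
```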

Page 25:

Supervised M-ary Model

Pros:
• Natural ranking of keywords:
– Similar to the unsupervised models.
– Posterior probabilities are all relative to the same classification problem.
• Does not require training of non-class models:
– Non-class models are the Yi = 0 models in Supervised OVA.
– Robust to a weakly labeled data set, since images that contain a concept but are not labeled with the keyword do not adversely affect learning.
– Non-class models are the computational bottleneck.
• Learning the density estimates PX|W(x|i) is computationally equivalent to learning density estimates for each image in the MBRM model:
– Relies on the Mixture Hierarchies method (Vasconcelos '01).
• When the vocabulary size L is smaller than the training set size D, annotation is computationally more efficient than the most efficient unsupervised algorithm.

Page 26:

Outline

Notation and Problem Statement

Three General Approaches to Image Annotation
1. Supervised One vs. All (OVA) Models
2. Unsupervised Models using Latent Variables
3. Supervised M-ary Model

Estimating P(image features|words)

Experimental Setup and Results

Automatic Music Annotation

Page 27:

Density Estimation

For Supervised M-ary learning, we need to find the class-conditional density estimates PX|W(x|i) using a training data set Di:
– All the images in Di have been labeled with wi.

Two questions:
1) Given that a number of the image regions from images in Di will not exhibit visual properties that relate to wi, can we even estimate these densities? (e.g. an image labeled "jet" will have regions where only sky is present.)
2) What is the "best" way to estimate these densities?
– "best": the estimate can be calculated using a computationally efficient algorithm.
– "best": the estimate is accurate and general.

Page 28:

Density Estimation

Multiple Instance Learning: a bag of instances receives a label for the entire bag if one or more instances deserve that label.

This makes the data noisy, but with enough averaging we can get a good density estimate.

For example (simulated in the sketch below):
1. Suppose every image has three regions.
2. Every image annotated with "jet" has one region with jet-like features (e.g. mu = 20, sigma = 3).
3. The other two regions are uniformly distributed with mu ~ U(-100, 1000) and sigma ~ U(0.1, 10).
4. If we average 1000 images, the "jet" distribution emerges.
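A quick simulation of this example (my own sketch, matching the numbers above): averaging the per-image densities over 1000 noisy three-region images leaves a clear peak at the jet mean.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
grid = np.linspace(-50, 100, 1000)        # a 1-D feature axis
avg_density = np.zeros_like(grid)

n_images = 1000
for _ in range(n_images):
    regions = [norm(loc=20, scale=3)]     # one jet-like region per image
    for _ in range(2):                    # two noise regions per image
        regions.append(norm(loc=rng.uniform(-100, 1000),
                            scale=rng.uniform(0.1, 10)))
    image_density = sum(r.pdf(grid) for r in regions) / 3
    avg_density += image_density / n_images

# The averaged density peaks near 20: the "jet" distribution emerges.
print("peak at feature value:", grid[np.argmax(avg_density)])
```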

Page 29:

Density Estimation

For word wi, we have a training set Di of images, each of which is represented by a vector of feature vectors.

The authors discuss four methods of estimating PX|W(x|i):
1. Direct Estimation
2. Model Averaging
   1) Histograms
   2) Naïve Averaging
3. Mixture Hierarchies

Page 30:

Density Estimation

1) Direct Estimation
• Pool all feature vectors from all images in Di into a single distribution.
• Needs some heuristic smoothing, e.g. a Gaussian kernel (see the sketch below).
• Does not scale well with training set size or with the number of vectors per image.

[Figure: smoothed kNN density estimate plotted over Feature 1 vs. Feature 2]
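For concreteness, a sketch of direct estimation with Gaussian-kernel smoothing via scipy's gaussian_kde (one possible smoother; the slides do not name a library):

```python
import numpy as np
from scipy.stats import gaussian_kde

# Pool the region features of every image labeled with the keyword.
# gaussian_kde expects shape (feature_dim, n_samples).
rng = np.random.default_rng(1)
pooled = rng.normal(size=(2, 5000))      # stand-in for real 2-D features

density = gaussian_kde(pooled)           # Gaussian-kernel smoothing

# Evaluate PX|W(x | i) at a query region's feature vector.
x_query = np.array([[0.5], [-1.0]])
print("density at query:", density(x_query))
# Each evaluation touches all 5000 samples, which is why direct
# estimation does not scale with the training set size.
```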

Page 31:

Density Estimation

2) Model Averaging
• Each image l in Di represents an individual distribution.
• We average the image distributions to find one class distribution.

The paper mentions two techniques:
1) Histograms: partition the feature space and count.
• Data sparsity problems for high-dimensional feature vectors.
2) Naïve averaging using mixture models (see the sketch below).
• Slow annotation time, since there will be KD Gaussians if each of the D image mixtures has K components.

[Figures: smoothed kNN, histogram, and mixture density estimates plotted over Feature 1 vs. Feature 2]
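A minimal sketch of naïve averaging (hypothetical names; per-image models are Gaussian mixtures): the class density is just the mean of the per-image mixture densities, so every one of the KD components must be evaluated at annotation time.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_pdf(x, weights, means, covs):
    """Density of one image's K-component Gaussian mixture at point x."""
    return sum(w * multivariate_normal(m, c).pdf(x)
               for w, m, c in zip(weights, means, covs))

def naive_average_pdf(x, image_mixtures):
    """Class-conditional density: the average of the D per-image mixtures.

    image_mixtures: list of (weights, means, covs) tuples, one per image in Di.
    Every one of the K*D components is touched, hence the slow annotation time.
    """
    return np.mean([mixture_pdf(x, *m) for m in image_mixtures])
```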

Page 32:

Density Estimation

3) Mixture Hierarchies (Vasconcelos 2001)
• Each image l in Di is represented by an individual mixture of K Gaussian distributions.
• We combine "redundant" mixture components using EM (simplified sketch below):
– E-step: compute the weight between each of the KD image-level components and the T class-level components.
– M-step: maximize the parameters of the T components using the weights.
• The final distribution is one mixture of T Gaussians for each keyword wi, where T << KD.

[Figure: per-image mixtures l1, l2, l3, …, lDi combined into one class-level mixture for Di]
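Below is a simplified sketch of the hierarchical EM idea with diagonal covariances. I am paraphrasing the virtual-sample formulation (each image-level component is treated as pi_j * N virtual samples), so treat this as an illustration and consult Vasconcelos '01 for the exact update equations.

```python
import numpy as np

def combine_mixtures(pi, mu, var, T, n_virtual=100, n_iter=25, seed=0):
    """Fit a T-component class mixture to J = K*D image-level components.

    pi: (J,) component weights; mu, var: (J, d) means and diagonal variances.
    """
    rng = np.random.default_rng(seed)
    J, d = mu.shape
    n_j = pi * n_virtual                                # virtual sample counts
    t_mu = mu[rng.choice(J, size=T, replace=False)].copy()
    t_var = np.ones((T, d))
    t_pi = np.full(T, 1.0 / T)

    for _ in range(n_iter):
        # E-step: responsibility of class component t for image component j;
        # scored on the component mean plus a penalty for its variance.
        log_h = np.empty((J, T))
        for t in range(T):
            diff2 = (mu - t_mu[t]) ** 2
            log_g = -0.5 * np.sum(np.log(2 * np.pi * t_var[t])
                                  + (diff2 + var) / t_var[t], axis=1)
            log_h[:, t] = n_j * log_g + np.log(t_pi[t])
        log_h -= log_h.max(axis=1, keepdims=True)        # stabilize
        h = np.exp(log_h)
        h /= h.sum(axis=1, keepdims=True)

        # M-step: weighted updates; component j carries n_j virtual samples.
        w = h * n_j[:, None]
        mass = w.sum(axis=0) + 1e-12
        t_pi = mass / mass.sum()
        for t in range(T):
            wt = w[:, t] / mass[t]
            t_mu[t] = wt @ mu
            # Class variance absorbs both the image variances and the spread
            # of the image means around the class mean.
            t_var[t] = wt @ (var + (mu - t_mu[t]) ** 2)
    return t_pi, t_mu, t_var
```

The payoff is at annotation time: evaluating one T-component mixture per keyword instead of KD components per class.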

Page 33:

Outline

Notation and Problem Statement

Three General Approaches to Image Annotation
1. Supervised One vs. All (OVA) Models
2. Unsupervised Models using Latent Variables
3. Supervised M-ary Model

Estimating P(image features|words)

Experimental Setup and Results

Automatic Music Annotation

Page 34:

Experimental Setup

Corel Stock Photos data set:
– 5,000 images: 4,500 for training, 500 for testing.
– Captions of 1-5 words per image, drawn from a vocabulary of L = 371 keywords.

Image features (sketched in code below):
– Convert from RGB to the YBR color space.
– Compute an 8 x 8 discrete cosine transform (DCT) on each image region.
– The result is a 3 * 64 = 192-dimensional feature vector for each image region.
– The 64 low-frequency features are retained.
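A sketch of this feature pipeline in scipy (my reconstruction; the block layout and coefficient ordering are assumptions, not the paper's exact code):

```python
import numpy as np
from scipy.fft import dctn

def dct_features(block_ybr):
    """192-dim DCT feature for one 8x8 region with 3 color channels."""
    coeffs = [dctn(block_ybr[:, :, c], norm="ortho").ravel() for c in range(3)]
    return np.concatenate(coeffs)        # 3 * 64 = 192 dimensions

# Example on a random block; a real pipeline would slide an 8x8 window over
# the YBR image and then keep only the low-frequency coefficients.
block = np.random.rand(8, 8, 3)
print(dct_features(block).shape)         # (192,)
```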

Page 35:

Experimental Setup

Two (simplified) tasks:
– Annotation: given a new image, what are the best five words that describe the image?
– Retrieval: given a one-word query, what are the images that match the query?

Evaluation metrics:
– |wH| = number of images that have been annotated with w by humans.
– |wA| = number of images that have been automatically annotated with w.
– |wC| = number of images that have been automatically annotated with w AND were annotated with w by humans.

Recall = |wC| / |wH|
Precision = |wC| / |wA|
(see the code sketch below)

Mean recall and mean precision are averaged over all the words found in the test set.
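These per-word metrics are straightforward to state in code (a sketch with hypothetical names):

```python
def word_recall_precision(word, human_labels, auto_labels):
    """Per-word recall and precision over a test set.

    human_labels / auto_labels: dicts mapping image id -> set of words.
    """
    wH = {img for img, ws in human_labels.items() if word in ws}
    wA = {img for img, ws in auto_labels.items() if word in ws}
    wC = wH & wA
    recall = len(wC) / len(wH) if wH else 0.0
    precision = len(wC) / len(wA) if wA else 0.0
    return recall, precision
```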

Page 36:

Other Annotation Systems

1. Co-occurrence (1999) – Mori, Takahashi, Oka
Early work that clusters sub-images (block-based decomposition) and counts word frequencies for each cluster.

2. Translation (2002) – Duygulu, Barnard, de Freitas, Forsyth
– "Vocabulary of Blobs": automatic segmentation -> feature vectors -> clustering -> blobs.
– An image is made of blobs; words are associated with blobs -> new caption.
– "Blobs" are the latent states.

[Figure: block-based decomposition vs. automatic segmentation of an image]

Page 37:

Other Annotation Systems

3. CRM (2003) – Lavrenko, Manmatha, Jeon
– Continuous-space Relevance Model: a "smoothed kNN" algorithm.
– Image features are modeled using kernel-based densities (automatic image segmentation; color, shape, and texture features).
– Word features are modeled using a multinomial distribution.
– "Training images" are the latent states.

4. CRM-rect (2004) – Feng, Manmatha, Lavrenko
– Same as CRM, but using block-based decomposition rather than segmentation.

5. MBRM (2004) – Feng, Manmatha, Lavrenko
– Multiple-Bernoulli Relevance Model.
– Same as CRM-rect, but uses a multiple-Bernoulli distribution to model word features.
– Shifts emphasis to the presence of a word rather than the prominence of a word.

Page 38:

New Annotation Systems

6. CRM-rect-DCT (2005) – Carneiro, Vasconcelos
– CRM-rect with DCT features.

7. Mix-Hier (2005) – Carneiro, Vasconcelos
– Supervised M-ary learning.
– Density estimation using mixture hierarchies.
– DCT features.

Page 39:

Annotation Results

Examples of Image Annotations:

Page 40:

Annotation Results

Performance of the annotation systems on the Corel test set: 500 images, 260 keywords, 5 generated keywords per image. Recall = |wC| / |wH|, Precision = |wC| / |wA|.

Gain of 16% in recall at the same or better level of precision.

Gain of 12% in the number of words with positive recall (i.e. words found in both the human and the automatic annotations at least once).

Page 41:

Annotation Results

Annotation computation time:
– MBRM is O(TR), where T is the training set size.
– Mix-Hier is O(CR), where C is the size of the vocabulary.
– R is the number of image regions per image.
– Complexity is measured in seconds to annotate a new image.
Unlike MBRM, whose annotation time scales with the training set size, Mix-Hier's scales only with the vocabulary size.

Page 42:

Retrieval Results

First five ranked images for “mountain”, “pool”, “blooms”, and “tiger”

Page 43:

Retrieval Results

Mean average precision:
– For each word wi, find all na,i images that have been automatically annotated with word wi.
– Out of the na,i images, let nc,i be the number that were also annotated with wi by humans.
– The precision of wi is nc,i / na,i.
– With L words in the vocabulary, the mean average precision is the average of these per-word precisions (written out below).
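Following the slide's definition, the quantity can be written as (reconstructed; the slide's own equation did not survive the transcript):

```latex
\text{mean average precision} \;=\; \frac{1}{L} \sum_{i=1}^{L} \frac{n_{c,i}}{n_{a,i}}
```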

Mix-Hier does 40% better on words with positive recall.

Page 44:

Outline

Notation and Problem Statement

Three General Approaches to Image Annotation
1. Supervised One vs. All (OVA) Models
2. Unsupervised Models using Latent Variables
3. Supervised M-ary Model

Estimating P(image features|words)

Experimental Setup and Results

Automatic Music Annotation

Page 45:

Automatic Music Annotation

Annotation: Given a song, what are the words that describe the music?
– Automatic music reviews.

Retrieval: Given a text query, what are the songs that are best described by the query?
– Song recommendation, playlist generation, music retrieval.

Feature extraction involves applying filters to the digital audio signal; Fourier, wavelet, and gammatone are common filterbank transforms (see the sketch below).

Music may be "more difficult" to annotate, since music is inherently subjective:
– Music evokes different thoughts and feelings in different listeners.
– An individual's experience with music changes all the time.
– All music is art, unlike most digital images: the Corel data set consists of concrete "object" and "landscape" scenes, whereas a comparably subjective image dataset might focus on modern art (Pollock, Mondrian, Dali).
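As an illustration of a filterbank-style audio feature (my own minimal sketch using a short-time Fourier transform; the slides do not specify an implementation):

```python
import numpy as np

def stft_features(signal, frame_len=1024, hop=512):
    """Magnitude spectrogram: a simple Fourier filterbank over audio frames."""
    window = np.hanning(frame_len)
    frames = [signal[i:i + frame_len] * window
              for i in range(0, len(signal) - frame_len, hop)]
    # Each row is one frame's magnitude spectrum: a feature vector, roughly
    # analogous to the DCT vector extracted from one image region.
    return np.abs(np.fft.rfft(np.array(frames), axis=1))

# Example: one second of a 440 Hz tone sampled at 16 kHz.
t = np.arange(16000) / 16000.0
print(stft_features(np.sin(2 * np.pi * 440 * t)).shape)
```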

Page 46:

Automatic Music Annotation

Computer Hearing (aka Machine Listening, Computer Audition):

• Music is one subdomain of sound.
– Sound effects, human speech, animal vocalizations, and environmental sounds all represent other subdomains of sound.

• Annotation is one problem.
– Query-by-humming, audio monitoring, sound segmentation, and speech-to-text are examples of other Computer Hearing problems.

Page 47:

Automatic Music Annotation

Computer Hearing and Computer Vision are closely related:
1. Large public and private databases exist that are rapidly growing in size.
2. Digital medium:
• Sound is 2D: intensity (amplitude) and time, or frequency and magnitude.
• Sound is often represented in 3D: magnitude, time, and frequency.
• An image is 3D: two spatial dimensions plus an intensity (color).
• Video is 4D: two spatial dimensions, an intensity, and time.
3. Video is comprised of both images and sound.
4. Feature extraction techniques are similar: both apply filters to the digital medium.

Page 48:

Works Cited:

Carneiro, Vasconcelos. “Formulating Semantic Image Annotation as a Supervised Learning Problem” (CVPR ’05)

Vasconcelos. “Image Indexing with Mixture Hierarchies” (CVPR ’01)

Feng, Manmatha, Lavrenko. "Multiple Bernoulli Relevance Models for Image and Video Annotation" (CVPR '04)

Blei, Jordan. “Modeling Annotated Data” (SIGIR ’03)
