a bayesian hierarchical model for learning natural scene categories l. fei-fei and p. perona. cvpr...

A Bayesian Hierarchical Model for Learning Natural Scene Categories

L. Fei-Fei and P. Perona. CVPR 2005

Discovering objects and their location in images

J. Sivic, B. Russell, A. Efros, A. Zisserman and B. Freeman. ICCV 2005

Tomasz Malisiewicz

Advanced Machine Perception, February 2006

Graphical Models: Recent Trend in Machine Learning

Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec. 2005.

Outline

Goals of both vision papers

Techniques from statistical text modeling

- pLSA vs LDA

Scene Classification via LDA

Object Discovery via pLSA

Goal: Learn and Recognize Natural Scene Categories

Classify a scene without first extracting objects

Other techniques we know of:

- Global frequency (Oliva and Torralba)

- Texton Histogram (Renninger, Malik et al)

Goal: Discover Object Categories

Discover what objects are present in a collection of images in an unsupervised way

Find those same objects in novel images

Determine what local image features correspond to what objects; segmenting the image

Enter the world of Statistical Text Modeling

D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, January 2003.

Bag-of-words approaches: the order of words in a document can be neglected

Graphical Model Fun


Here we start the Text Slides

Bag-of-words

A document is a collection of M words

A corpus (collection of documents) is summarized in a term-document matrix
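A minimal sketch of how a term-document matrix could be built from raw text. The function name `term_document_matrix`, the whitespace tokenizer, and the toy documents are illustrative assumptions; rows are documents and columns are vocabulary words.

```python
from collections import Counter

def term_document_matrix(docs):
    """Return (vocab, matrix) where matrix[i][j] counts word j in document i."""
    # Vocabulary: every distinct whitespace-separated token, in sorted order.
    vocab = sorted({word for doc in docs for word in doc.split()})
    index = {word: j for j, word in enumerate(vocab)}
    matrix = []
    for doc in docs:
        counts = Counter(doc.split())
        row = [0] * len(vocab)
        for word, n in counts.items():
            row[index[word]] = n
        matrix.append(row)
    return vocab, matrix

# Toy corpus (hypothetical); word order inside each document is discarded.
docs = ["the cat sat", "the dog sat on the cat"]
vocab, N = term_document_matrix(docs)
```

Because only counts are kept, any reordering of the words in a document gives the same row, which is exactly the bag-of-words assumption.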

Object → Bag of ‘words’

1990: Latent Semantic Analysis (LSA)

Goal: map high-dimensional count vectors to a lower dimensional representation to reveal semantic relations between words

The lower dimensional space is called the latent semantic space

Dim( latent space ) = K

1990: Latent Semantic Analysis (LSA)

D = {d1,…,dN} N documents

W = {w1,…,wM} M words

Nij = #(di,wj) NxM co-occurrence term-document matrix

N×M = (N×K) × (K×K) × (K×M)

documents × words = (documents × topics) × (topics × topics) × (topics × words)

What did we just do?

N×M = (N×K) × (K×K) × (K×M)

documents × words = (documents × topics) × (topics × topics) × (topics × words)

Singular Value Decomposition

LSA summary

SVD on term-document matrix

Approximate N by setting all but the largest K singular values to zero

Produces rank-K optimal approximation to N in the L2-matrix or Frobenius norm sense
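A small sketch of the truncation step using NumPy's SVD. The helper name `rank_k_approximation` and the toy count matrix are assumptions made for illustration; the only library call relied on is `numpy.linalg.svd`.

```python
import numpy as np

def rank_k_approximation(N, K):
    """Keep the K largest singular values of N and zero out the rest."""
    U, s, Vt = np.linalg.svd(N, full_matrices=False)  # s is sorted descending
    s_trunc = s.copy()
    s_trunc[K:] = 0.0
    # (U * s_trunc) scales the columns of U, i.e. U @ diag(s_trunc).
    return (U * s_trunc) @ Vt, s

# Toy 4x4 term-document matrix (hypothetical counts).
N = np.array([[2.0, 0.0, 1.0, 0.0],
              [1.0, 0.0, 2.0, 0.0],
              [0.0, 3.0, 0.0, 1.0],
              [0.0, 1.0, 0.0, 2.0]])
N_hat, s = rank_k_approximation(N, K=2)
frobenius_error = np.linalg.norm(N - N_hat, "fro")
```

The Frobenius error of the approximation equals the square root of the sum of the squared discarded singular values, which is one way to check the truncation numerically.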

LSA and Polysemy

Polysemy: the ambiguity of an individual word or phrase that can be used (in different contexts) to express two or more different meanings

Under the LSA model, the coordinates of a word in latent space can be written as a linear superposition of the coordinates of the documents that contain the word

According to this superposition principle, LSA is unable to capture multiple senses of a word

Problems with LSA

LSA does not define a properly normalized probability distribution

No obvious interpretation of the directions in the latent space

From statistics, the utilization of L2 norm in LSA corresponds to a Gaussian Error assumption which is hard to justify in the context of count variables

Polysemy problem

pLSA to the rescue

Probabilistic Latent Semantic Analysis

pLSA relies on the likelihood function of multinomial sampling and aims at an explicit maximization of the predictive power of the model

P(w_i | d_j) = \sum_{k=1}^{K} P(w_i | z_k) P(z_k | d_j)

P(w_i | d_j): observed word distributions

P(w_i | z_k): word distributions per topic

P(z_k | d_j): topic distributions per document

Slide credit: Josef Sivic

pLSA to the rescue

Decomposition into Probabilities!

Maximize likelihood of data using EM.

Minimize KL divergence between empirical distribution and model

Observed counts of word i in document j

Learning the pLSA parameters


Unlike LSA, pLSA does not minimize any type of ‘squared deviation.’ The parameters are estimated in a probabilistically sound way.

EM for pLSA (training on a corpus)

E-step: compute posterior probabilities for the latent variables

M-step: maximize the expected complete data log-likelihood
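A minimal sketch of the two EM steps, assuming the decomposition P(w_i|d_j) = Σ_k P(w_i|z_k) P(z_k|d_j) and a count matrix `n[i, j]` holding the observed count of word i in document j. The function name `plsa_em`, the random initialization, the small `1e-12` guard, and the toy counts are illustrative assumptions.

```python
import numpy as np

def plsa_em(n, K, n_iter=50, seed=0):
    """Fit pLSA by EM. n[i, j] = observed count of word i in document j."""
    rng = np.random.default_rng(seed)
    M, D = n.shape
    p_w_z = rng.random((M, K))
    p_w_z /= p_w_z.sum(axis=0, keepdims=True)   # each column is P(w | z_k)
    p_z_d = rng.random((K, D))
    p_z_d /= p_z_d.sum(axis=0, keepdims=True)   # each column is P(z | d_j)
    log_liks = []
    for _ in range(n_iter):
        # E-step: posterior P(z_k | d_j, w_i), stored with shape (M, K, D).
        joint = p_w_z[:, :, None] * p_z_d[None, :, :]
        p_w_d = joint.sum(axis=1)                # model P(w_i | d_j)
        log_liks.append(float(np.sum(n * np.log(p_w_d + 1e-12))))
        post = joint / (p_w_d[:, None, :] + 1e-12)
        # M-step: re-estimate both factors from expected counts.
        weighted = n[:, None, :] * post
        p_w_z = weighted.sum(axis=2)
        p_w_z /= p_w_z.sum(axis=0, keepdims=True) + 1e-12
        p_z_d = weighted.sum(axis=0)
        p_z_d /= p_z_d.sum(axis=0, keepdims=True) + 1e-12
    return p_w_z, p_z_d, log_liks

# Toy counts (hypothetical): 4 words x 4 documents, two rough blocks.
n = np.array([[4.0, 3.0, 0.0, 0.0],
              [3.0, 5.0, 0.0, 1.0],
              [0.0, 0.0, 4.0, 3.0],
              [1.0, 0.0, 5.0, 4.0]])
p_w_z, p_z_d, log_liks = plsa_em(n, K=2)
```

The recorded log-likelihood should not decrease from one iteration to the next, which is a convenient sanity check on any EM implementation.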

Graphical View of pLSA

pLSA is a generative model

Select a document di with prob P(di)

Pick latent class zk with prob P(zk|di)

Generate word wj with prob P(wj|zk)

[Graphical model: d → z → w, where d and w are observed variables and z is a latent variable, drawn with plates]

How does pLSA deal with previously unseen documents?

“Folding-in” Heuristic

First train on the corpus to obtain P(w|z)

Now re-run the same training EM algorithm, but don’t re-estimate P(w|z), and let D = {d_unseen}
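A hedged sketch of folding-in for a single unseen document: the EM updates are run as in training, but the topic-word distributions stay fixed and only the document's topic mixture is re-estimated. The function name `fold_in`, the uniform initialization, and the toy numbers are assumptions.

```python
import numpy as np

def fold_in(n_new, p_w_z, n_iter=100):
    """Estimate P(z | d_unseen) with P(w | z) held fixed.

    n_new[i]    = count of word i in the unseen document
    p_w_z[i, k] = P(w_i | z_k) learned on the training corpus
    """
    K = p_w_z.shape[1]
    p_z_d = np.full(K, 1.0 / K)                 # start from a uniform mixture
    for _ in range(n_iter):
        # E-step: P(z_k | d_unseen, w_i) for every word in the vocabulary.
        joint = p_w_z * p_z_d
        post = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: update only the topic mixture of the new document.
        p_z_d = (n_new[:, None] * post).sum(axis=0)
        p_z_d /= p_z_d.sum()
    return p_z_d

# Toy topic-word table (hypothetical): 4 words, 2 topics, columns sum to 1.
p_w_z = np.array([[0.5, 0.0],
                  [0.4, 0.1],
                  [0.1, 0.4],
                  [0.0, 0.5]])
n_new = np.array([5.0, 4.0, 1.0, 0.0])          # mostly topic-0 words
p_z_unseen = fold_in(n_new, p_w_z)
```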

Problems with pLSA

Not a well-defined generative model of documents; d is a dummy index into the list of documents in the training set (as many values as documents)

No natural way to assign probability to a previously unseen document

Number of parameters to be estimated grows with size of training set

LDA to the rescue

Latent Dirichlet Allocation treats the topic mixture weights as a k-parameter hidden random variable and places a Dirichlet prior on the multinomial mixing weights

Dirichlet distribution is conjugate to the multinomial distribution (most natural prior to choose: the posterior distribution is also a Dirichlet!)
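A tiny worked example of the conjugacy claim, under the simplifying assumption that the topic assignments were observed (in LDA they are latent): a Dirichlet prior with parameters `alpha` combined with multinomial counts gives a Dirichlet posterior with parameters `alpha + counts`. The numbers are made up for illustration.

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])     # Dirichlet prior over 3 topics
counts = np.array([8.0, 1.0, 1.0])    # hypothetical observed topic counts

# Conjugate update: the posterior is Dirichlet(alpha + counts).
alpha_post = alpha + counts

# Mean of a Dirichlet is its parameter vector normalized to sum to one.
prior_mean = alpha / alpha.sum()
posterior_mean = alpha_post / alpha_post.sum()
```

The posterior mean moves from the uniform prior toward the empirical proportions, with the prior acting like one pseudo-count per topic.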

pLSA LDA

Corpus-Level parameters in LDA

Alpha and beta are corpus-level parameters that are sampled once in the process of generating a corpus (outside of the plates!)

Alpha and beta must be estimated before we can find the topic mixing proportions belonging to a previously unseen document

LDA

Getting rid of plates

[Figure: un-plated LDA graphical model, with topic nodes z1, z2, z3, …, zN and word nodes w1, w2, w3, …, wN drawn explicitly for each document]

Thanks to Jonathan Huang for the un-plated LDA graphic

Inference in LDA

Inference = estimation of document-level parameters

Intractable to compute, so we must employ approximate inference

Approximate Inference in LDA

Variational Methods: Use Jensen’s inequality to obtain a lower bound on the log likelihood that is indexed by a set of variational parameters

Optimal Variational Parameters (document-specific) are obtained by minimizing the KL divergence between the variational distribution and the true posterior

Variational distribution

Variational Methods are one way of doing this. Gibbs sampling (MCMC) is another way.
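A minimal sketch of the Gibbs sampling alternative, written as a collapsed Gibbs sampler. It assumes the smoothed variant of LDA with symmetric Dirichlet hyperparameters `alpha` (topic mixtures) and `eta` (topic-word distributions); the function name `lda_gibbs`, the hyperparameter values, and the toy corpus are illustrative assumptions and not taken from the slides.

```python
import numpy as np

def lda_gibbs(docs, V, K, alpha=0.1, eta=0.01, n_sweeps=50, seed=0):
    """Collapsed Gibbs sampling for LDA. docs is a list of word-id lists."""
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), K))   # topic counts per document
    n_kw = np.zeros((K, V))           # word counts per topic
    n_k = np.zeros(K)                 # total words per topic
    z = []
    for d, doc in enumerate(docs):    # random initial topic assignments
        z_d = rng.integers(K, size=len(doc))
        z.append(z_d)
        for w, k in zip(doc, z_d):
            n_dk[d, k] += 1
            n_kw[k, w] += 1
            n_k[k] += 1
    for _ in range(n_sweeps):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]           # remove the current assignment
                n_dk[d, k] -= 1
                n_kw[k, w] -= 1
                n_k[k] -= 1
                # Conditional P(z_i = k | all other assignments, words).
                p = (n_dk[d] + alpha) * (n_kw[:, w] + eta) / (n_k + V * eta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k           # add the new assignment back
                n_dk[d, k] += 1
                n_kw[k, w] += 1
                n_k[k] += 1
    theta = (n_dk + alpha) / (n_dk + alpha).sum(axis=1, keepdims=True)
    phi = (n_kw + eta) / (n_kw + eta).sum(axis=1, keepdims=True)
    return theta, phi

# Toy corpus (hypothetical): word ids in a vocabulary of size 4.
docs = [[0, 1, 0, 1, 0], [2, 3, 2, 3, 3], [0, 0, 1, 1], [3, 2, 2, 3]]
theta, phi = lda_gibbs(docs, V=4, K=2)
```

Here `theta` holds per-document topic proportions and `phi` holds per-topic word distributions, both estimated from the final sample only; averaging over several samples would be a natural refinement.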

Look at some P(w|z) produced by LDA

Show some pLSI and LDA results applied to text

An LDA project by Tomasz Malisiewicz and Jonathan Huang

Search for the word ‘drive’

pLSA and LDA applied to Images

How can one apply these techniques to the images?

Hierarchical Bayesian text models

Probabilistic Latent Semantic Analysis (pLSA), Hofmann, 2001

[Graphical model: nodes d, z, w; plates N and D]

Latent Dirichlet Allocation (LDA), Blei et al., 2001

[Graphical model: nodes c, z, w; plates N and D]

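Probabilistic Latent Semantic Analysis (pLSA)

[Graphical model: nodes d, z, w; plates N and D; example topic: “face”]

Sivic et al. ICCV 2005

Latent Dirichlet Allocation (LDA)

[Graphical model: nodes c, z, w; plates N and D; example theme: “beach”]

Fei-Fei et al. CVPR 2005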

A Bayesian Hierarchical Model for Learning Natural Scene Categories

School of Computer Science

start the first paper

Flow Chart: Quick Overview

How to Generate an Image?

Choose a scene (mountain, beach, …)

Given the scene, generate an intermediate probability vector over ‘themes’

For each word:

- Determine current theme from mixture of themes

- Draw a codeword from that theme
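The generative steps on this slide can be sketched as a short sampler. This is a hedged illustration: the names `scene_prior`, `theta` (one Dirichlet parameter vector per scene), and `beta` (one codeword distribution per theme) follow common LDA conventions and are assumptions, as are all the numbers.

```python
import numpy as np

def generate_image(scene_prior, theta, beta, n_patches, rng):
    """Sample one bag of codewords from the scene -> themes -> codewords model."""
    c = rng.choice(len(scene_prior), p=scene_prior)      # choose a scene
    pi = rng.dirichlet(theta[c])                         # theme mixture given the scene
    themes = rng.choice(len(pi), size=n_patches, p=pi)   # one theme per patch
    codewords = np.array([rng.choice(beta.shape[1], p=beta[k]) for k in themes])
    return c, pi, themes, codewords

# Hypothetical parameters: 2 scenes, 3 themes, 5 codewords.
scene_prior = np.array([0.5, 0.5])
theta = np.array([[5.0, 1.0, 1.0],
                  [1.0, 1.0, 5.0]])
beta = np.array([[0.6, 0.2, 0.1, 0.05, 0.05],
                 [0.1, 0.6, 0.1, 0.1, 0.1],
                 [0.05, 0.05, 0.1, 0.2, 0.6]])
rng = np.random.default_rng(0)
c, pi, themes, codewords = generate_image(scene_prior, theta, beta, 20, rng)
```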

How to Generate an Image?

Inference

How to make a decision on a novel image

Integrate over latent variables to get:

Approximate Variational Inference (not easy, but Gibbs sampling is supposed to be easier)
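For intuition only, the integral over the latent theme mixture can also be approximated by plain Monte Carlo sampling. This is a swapped-in technique, not the variational procedure named above, and it assumes a uniform prior over scenes. The names `theta`, `beta`, `log_likelihood_mc`, and `classify`, and all numbers, are hypothetical.

```python
import numpy as np

def log_likelihood_mc(codewords, theta_c, beta, n_samples, rng):
    """Monte Carlo estimate of log p(codewords | scene c).

    Averages prod_n sum_k pi_k * beta[k, x_n] over samples pi ~ Dirichlet(theta_c).
    """
    pis = rng.dirichlet(theta_c, size=n_samples)     # (S, K) theme mixtures
    per_patch = pis @ beta[:, codewords]             # (S, N) patch likelihoods
    log_p = np.log(per_patch).sum(axis=1)            # (S,) log-likelihood per sample
    m = log_p.max()                                  # log-mean-exp for stability
    return m + np.log(np.mean(np.exp(log_p - m)))

def classify(codewords, theta, beta, n_samples=2000, seed=0):
    """Pick the scene with the highest estimated likelihood (uniform scene prior)."""
    rng = np.random.default_rng(seed)
    scores = [log_likelihood_mc(codewords, t, beta, n_samples, rng) for t in theta]
    return int(np.argmax(scores))

# Hypothetical model: 2 scenes, 2 themes, 4 codewords.
theta = np.array([[10.0, 1.0], [1.0, 10.0]])
beta = np.array([[0.45, 0.45, 0.05, 0.05],
                 [0.05, 0.05, 0.45, 0.45]])
label_a = classify(np.array([0, 1, 0, 1, 1, 0, 0, 2]), theta, beta)
label_b = classify(np.array([3, 2, 3, 2, 2, 3, 1, 3]), theta, beta)
```

Naive sampling like this scales poorly with the number of patches and themes, which is one reason variational or Gibbs-based approximations are used in practice.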

Codebook

174 Local Image Patches

Detection:

- Evenly Sampled Grid

- Random Sampling

- Saliency Detector

- Lowe’s DoG Detector

Representation:

- Normalized 11x11 gray values

- 128-dim SIFT

Results: Average performance 64%

Confusion Matrix

100 training examples and 50 test examples

Rank statistic test: the probability that a test scene correctly belongs to one of the top N most probable categories
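A short sketch of how the rank statistic could be computed from per-category scores. The function name `rank_statistic` and the toy scores are assumptions; the real experiment's scores are not reproduced here.

```python
def rank_statistic(scores, true_labels, top_n):
    """Fraction of test scenes whose true category is among the top_n
    highest-scoring categories."""
    hits = 0
    for row, label in zip(scores, true_labels):
        # Category indices sorted from most to least probable.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:top_n]:
            hits += 1
    return hits / len(true_labels)

# Hypothetical scores for 3 test scenes over 3 categories.
scores = [[0.70, 0.20, 0.10],
          [0.35, 0.40, 0.25],
          [0.10, 0.30, 0.60]]
true_labels = [0, 0, 1]
top1 = rank_statistic(scores, true_labels, top_n=1)
top2 = rank_statistic(scores, true_labels, top_n=2)
```

With `top_n=1` this reduces to ordinary classification accuracy, and it can only grow as `top_n` increases.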

Results: The Distributions

Theme distribution

Codeword distribution

The peak at 174

Summary of detection and representation choices

SIFT outperforms pixel gray values

Sliding grid, which creates the largest number of patches, does best

Discovering objects and their location in images

School of Computer Science

start the second paper

Visual Words

Vector Quantized SIFT descriptors computed in regions

Regions come from elliptical shape adaptation around interest point, and from the maximally stable regions of Matas et al.

Both are elliptical regions at twice their detected scale

Building a Vocabulary

Vector quantization


K-means clustering of 300K regions to get about 1K clusters for each of Shape Adapted and Maximally Stable regions
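A minimal NumPy-only sketch of vector quantization: cluster descriptors with Lloyd's k-means, then map each descriptor to its nearest cluster center to get a visual-word histogram. The random 8-dimensional vectors stand in for real SIFT descriptors, and the function names and sizes are assumptions chosen to keep the example small.

```python
import numpy as np

def quantize(X, centers):
    """Index of the nearest center (visual word) for each descriptor."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm; returns the k cluster centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # copy of k data points
    for _ in range(n_iter):
        labels = quantize(X, centers)
        for j in range(k):
            if np.any(labels == j):          # leave empty clusters where they are
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def bag_of_words(X, centers):
    """Histogram of visual-word counts for one image's descriptors."""
    return np.bincount(quantize(X, centers), minlength=len(centers))

rng = np.random.default_rng(1)
descriptors = rng.random((200, 8))           # stand-in for SIFT descriptors
vocabulary = kmeans(descriptors, k=5)
histogram = bag_of_words(descriptors[:50], vocabulary)
```

The histogram is one column of the term-document matrix, with images playing the role of documents and cluster indices playing the role of words.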

pLSA Training

Sanity Check: Remember what quantities must be estimated?

Results #1: Topic Discovery

This is just the training stage

Obtain P(zk|dj) for each image, then classify image as containing object k according to the max of P(zk|dj) over k

4 object categories plus background

Results #1: Topic Discovery

Results #2: Classifying New Images

Object Categories learned on a corpus, then object categories found in new image

Remember the index d in the graphical model

Anybody remember how this is done?

How does pLSA deal with previously unseen documents?

“Folding-in” Heuristic

First train on the corpus to obtain P(w|z)

Now re-run the same training EM algorithm, but don’t re-estimate P(w|z), and let D = {d_unseen}

Results #2: Classifying New Images

Train on one set and test on another

Results #3: Segmentation

Localization and Segmentation of Object

For a word occurrence in a particular document we can examine the probability of different topics

Find words with P(zk|dj,wi) > .8
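A small sketch of the per-word topic posterior used for this thresholding, assuming the pLSA factors are available as arrays. By Bayes' rule, P(zk|dj,wi) is proportional to P(wi|zk) P(zk|dj). The function name `topic_posterior` and the toy numbers are illustrative.

```python
import numpy as np

def topic_posterior(p_w_z, p_z_d_j):
    """P(z_k | d_j, w_i) for every word i, for one document j.

    p_w_z[i, k] = P(w_i | z_k); p_z_d_j[k] = P(z_k | d_j).
    """
    joint = p_w_z * p_z_d_j                       # P(w_i | z_k) P(z_k | d_j)
    return joint / joint.sum(axis=1, keepdims=True)

# Hypothetical factors: 3 visual words, 2 topics.
p_w_z = np.array([[0.6, 0.1],
                  [0.3, 0.2],
                  [0.1, 0.7]])
p_z_d_j = np.array([0.5, 0.5])
post = topic_posterior(p_w_z, p_z_d_j)

# (word, topic) pairs confident enough to be used for segmentation.
selected = np.argwhere(post > 0.8)
```

In this toy case word 0 is assigned to topic 0 and word 2 to topic 1, while word 1 is too ambiguous to pass the threshold.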

Results #3: Segmentation

Note: words shown are not the most probable words for a topic, but instead they are words that have a high probability of occurring in a topic AND high probability of occurring in the image

Results #3: Segmentation and Doublets

Two class image dataset consisting of half the faces (218 images) and backgrounds (217 images)

A 4 topic pLSA model is learned for all training faces and training backgrounds with 3 fixed background topics, i.e. one (face) topic is learned in addition to the three fixed background topics

A doublet vocabulary is then formed from the top 100 visual words of the face topic. A second 4 topic pLSA model is then learned for the combined vocabulary of singlets and doublets with the background topics fixed.

Doublets

Efros: didn’t work as much as you’d think

Face Segmentation Scores

Singlets: 0.49, Doublets: 0.61

Conclusions

Showed how both papers use bag-of-words approaches

We’re now ready to become experts on generative models like pLSA and LDA

Graphical Model Fun! (Carlos Guestrin teaches Graphical Models)

Are you really into Graphical Models?

Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec. 2005.

References

A Bayesian Hierarchical Model for Learning Natural Scene Categories, Fei Fei Li et al

Describing Visual Scenes using Transformed Dirichlet Processes, Sudderth et al

Discovering objects and their location in images, Sivic et al

Latent Dirichlet Allocation, Blei et al

Unsupervised Learning by Probabilistic Latent Semantic Analysis, T. Hofmann
