
Computer vision: models, learning and inference
Chapter 20: Models for visual words

Please send errata to [email protected]

©2011 Simon J.D. Prince

Page 2: Visual words

• Most models treat data as continuous
• Likelihood based on normal distribution
• Visual words = discrete representation of image
• Likelihood based on categorical distribution
• Useful for difficult tasks such as scene recognition and object recognition

Page 3: Motivation: scene recognition

Page 4: Structure

• Computing visual words
• Bag of words model
• Latent Dirichlet allocation
• Single author-topic model
• Constellation model
• Scene model
• Applications

Page 5: Computing dictionary of visual words

1. For each of the I training images, select a set of J_i spatial locations, using either
   • interest points, or
   • a regular grid.

2. Compute a descriptor at each spatial location in each image.

3. Cluster all of these descriptor vectors into K groups using a method such as the K-means algorithm.

4. The means of the K clusters are used as the K prototype vectors in the dictionary (a code sketch follows).
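A minimal sketch of this procedure, assuming OpenCV for SIFT interest points and scikit-learn for K-means; the library choices and the image-path input are assumptions, not part of the slides:

```python
# Sketch: building a dictionary of visual words from training images.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def build_dictionary(image_paths, K=200):
    sift = cv2.SIFT_create()                     # descriptors at interest points
    all_descriptors = []
    for path in image_paths:                     # one of the I training images
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, desc = sift.detectAndCompute(img, None)   # J_i descriptors
        if desc is not None:
            all_descriptors.append(desc)
    stacked = np.vstack(all_descriptors)         # pool descriptors over all images
    kmeans = KMeans(n_clusters=K, n_init=10).fit(stacked)
    return kmeans.cluster_centers_               # the K prototype vectors
```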

Page 6: Encoding images as visual words

1. Select a set of J spatial locations in the image using the same method as for the dictionary.

2. Compute the descriptor at each of the J spatial locations.

3. Compare each descriptor to the set of K prototype descriptors in the dictionary.

4. Assign to this location a discrete index corresponding to the index of the closest word in the dictionary.

End result: a discrete feature index together with an x, y position for each location (a code sketch follows).
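A minimal sketch of the encoding step under the same assumptions as the previous sketch (OpenCV SIFT; `dictionary` is the K x D array of prototype vectors returned by `build_dictionary`):

```python
# Sketch: encoding one image as visual words plus positions.
import cv2
import numpy as np

def encode_image(path, dictionary):
    sift = cv2.SIFT_create()
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    keypoints, desc = sift.detectAndCompute(img, None)
    # squared distance from each of the J descriptors to each of the K prototypes
    d2 = ((desc[:, None, :] - dictionary[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)                          # index of the closest word
    positions = np.array([kp.pt for kp in keypoints])  # (x, y) for each location
    return words, positions
```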

Page 7: Structure

• Computing visual words
• Bag of words model
• Latent Dirichlet allocation
• Single author-topic model
• Constellation model
• Scene model
• Applications

Page 8: Bag of words model

Key idea:

• Abandon all spatial information.
• Represent the image just by the relative frequency (histogram) of words from the dictionary.

Likelihood of the word indices f_{1...J} for an image of category w = n:

Pr(f_{1...J} | w = n) = \prod_{j=1}^{J} Cat(f_j | \lambda_n)

where \lambda_n contains the categorical word probabilities for category n.
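A minimal sketch of forming the histogram from the word indices produced by the encoding step (NumPy assumed):

```python
import numpy as np

def word_histogram(words, K):
    # relative frequency of each dictionary word; spatial information discarded
    counts = np.bincount(words, minlength=K)
    return counts / counts.sum()
```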

Page 9: Bag of words

Page 10: Bag of words model

Learning (MAP solution):

Inference:
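A sketch of the standard forms for the categorical bag-of-words model, assuming a symmetric Dirichlet(\alpha) prior on each \lambda_n and writing N_{nk} for the number of times word k occurs across the training images of category n (the prior and notation are assumptions):

Learning (MAP solution):

\hat{\lambda}_{nk} = \frac{N_{nk} + \alpha - 1}{\sum_{k'} N_{nk'} + K(\alpha - 1)}

Inference (Bayes' rule over the N categories):

Pr(w = n | f_{1...J}) = \frac{Pr(w = n) \prod_{j=1}^{J} Cat(f_j | \hat{\lambda}_n)}{\sum_{m=1}^{N} Pr(w = m) \prod_{j=1}^{J} Cat(f_j | \hat{\lambda}_m)}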

Page 11: Bag of words for object recognition

Page 12: Problems with bag of words

Page 13: Structure

• Computing visual words
• Bag of words model
• Latent Dirichlet allocation
• Single author-topic model
• Constellation model
• Scene model
• Applications

Page 14: Latent Dirichlet allocation

• Describes relative frequency of visual words in a single image (no world term)
• Words not generated independently (connected by hidden variable)
• Analogy to text documents:
  – Each image contains a mixture of several topics (parts)
  – Each topic induces a distribution over words
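As a concrete illustration, a minimal sketch of sampling from the LDA generative process with toy sizes; M, K, J, alpha, and beta are illustrative assumptions, not the slide's notation:

```python
# Sketch: one draw from the LDA generative process.
import numpy as np

rng = np.random.default_rng(0)
M, K, J = 5, 200, 100          # parts (topics), dictionary size, words per image
alpha, beta = 1.0, 0.1
lam = rng.dirichlet(beta * np.ones(K), size=M)   # one word distribution per part

def generate_image():
    theta = rng.dirichlet(alpha * np.ones(M))    # this image's mixture of parts
    parts = rng.choice(M, size=J, p=theta)       # hidden part label for each word
    words = np.array([rng.choice(K, p=lam[m]) for m in parts])
    return parts, words
```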

Page 15: Latent Dirichlet allocation

Page 16: Latent Dirichlet allocation

Generative equations

Marginal distribution over features

Conjugate priors over parameters
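A sketch of the standard forms, assuming M parts, per-image part proportions \theta_i, and per-part word distributions \lambda_m (notation assumed):

Generative equations:

h_{ij} \sim Cat(\theta_i), \qquad f_{ij} \sim Cat(\lambda_{h_{ij}})

Marginal distribution over features (hidden part label summed out):

Pr(f_{ij} = k | \theta_i, \lambda_{1...M}) = \sum_{m=1}^{M} \theta_{im} \lambda_{mk}

Conjugate priors over parameters:

\theta_i \sim Dir(\alpha), \qquad \lambda_m \sim Dir(\beta)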

Page 17: Latent Dirichlet allocation

Page 18: Learning LDA model

• Part labels are hidden variables.
• If we knew them, it would be easy to estimate the parameters.
• How about the EM algorithm? Unfortunately, the parts within an image are not independent.

Page 19: Latent Dirichlet allocation

Page 20: Learning

Strategy:

1. Write an expression for the posterior distribution over part labels.
2. Draw samples from the posterior using MCMC.
3. Use the samples to estimate the parameters.

Page 21: 1. Posterior over part labels

The two terms in the numerator can be computed in closed form; the denominator is intractable.
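A sketch of the standard form by Bayes' rule (notation assumed):

Pr(h_{1...J} | f_{1...J}) = \frac{Pr(f_{1...J} | h_{1...J}) \, Pr(h_{1...J})}{\sum_{h'} Pr(f_{1...J} | h') \, Pr(h')}

The likelihood and prior in the numerator come out in closed form when the categorical parameters are integrated against their conjugate Dirichlet priors; the denominator sums over all M^J possible labelings, which is what makes it intractable.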

Page 22: 2. Draw samples from posterior

Gibbs sampling: fix all part labels except one and sample from the conditional distribution of that label. This conditional can be computed in closed form (a code sketch follows).
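A minimal sketch of one collapsed Gibbs sweep over the part labels, in the style of standard LDA samplers; the variable names, hyperparameters, and count-table layout are assumptions for illustration, not the slide's notation:

```python
# Sketch: one collapsed Gibbs sweep, resampling every part label in turn.
import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep(words, h, n_im, n_mk, n_m, alpha, beta, K):
    """words: list of 1-D int arrays, words[i][j] = word index f_ij
    h:     list of matching int arrays of current part labels
    n_im:  (I, M) counts of parts per image
    n_mk:  (M, K) counts of words per part
    n_m:   (M,)   total words per part; all counts updated in place
    """
    M = len(n_m)
    for i, f in enumerate(words):
        for j, k in enumerate(f):
            m = h[i][j]
            # remove word ij from the counts
            n_im[i, m] -= 1; n_mk[m, k] -= 1; n_m[m] -= 1
            # full conditional: prior term * likelihood term
            p = (n_im[i] + alpha) * (n_mk[:, k] + beta) / (n_m + K * beta)
            m = rng.choice(M, p=p / p.sum())
            h[i][j] = m
            # add the word back with its new label
            n_im[i, m] += 1; n_mk[m, k] += 1; n_m[m] += 1
```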

Page 23: 3. Use samples to estimate parameters

The samples substitute for the true part labels in the parameter update equations.
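A sketch of the standard smoothed estimates from one sample, writing n_{mk} for the number of words of type k assigned to part m and n_{im} for the number of words in image i assigned to part m (the hyperparameters \alpha, \beta are assumptions):

\hat{\lambda}_{mk} = \frac{n_{mk} + \beta}{n_m + K\beta}, \qquad \hat{\theta}_{im} = \frac{n_{im} + \alpha}{J_i + M\alpha}

These can be averaged over several Gibbs samples.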

Page 24: Structure

• Computing visual words
• Bag of words model
• Latent Dirichlet allocation
• Single author-topic model
• Constellation model
• Scene model
• Applications

Page 25: Single author-topic model

Page 26: Single author-topic model

Page 27: Learning

1. Posterior over part labels: the likelihood is the same as before; the prior becomes the category-specific form sketched below.
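One plausible reconstruction of this prior, following the single author-topic idea that the part proportions are tied to the image's single category rather than to the image itself (this exact form is an assumption):

Pr(h_{i1...iJ} | w_i = n) = \prod_{j=1}^{J} Cat(h_{ij} | \theta_n)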

Page 28: Learning

2. Draw samples from posterior

3. Use samples to estimate parameters

Page 29: Inference

Compute the posterior over categories, using the likelihood that the words in this image are due to category n (see below).
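A sketch of the standard form (notation assumed):

Pr(w = n | f_{1...J}) \propto Pr(w = n) \, Pr(f_{1...J} | w = n)

normalized by summing the right-hand side over all categories.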

Page 30: Structure

• Computing visual words
• Bag of words model
• Latent Dirichlet allocation
• Single author-topic model
• Constellation model
• Scene model
• Applications

Page 31: Problems with bag of words

Page 32: Constellation model

Page 33: Constellation model

Page 34: Learning

1. Posterior over part labels: the prior is the same as before; the likelihood becomes the position-dependent form sketched below.
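A plausible form for this likelihood, attaching a Gaussian position density to each part so that both the word identity and its image position x_{ij} inform the part label (the Gaussian form is an assumption in this sketch):

Pr(f_{ij}, x_{ij} | h_{ij} = m) = Cat(f_{ij} | \lambda_m) \, Norm_{x_{ij}}[\mu_m, \Sigma_m]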

Page 35: Learning

2. Draw samples from posterior

3. Use samples to estimate parameters

Part and word probabilities as before

Page 36: Inference

Compute the posterior over categories, using the likelihood that the words in this image are due to category n, exactly as on Page 29.

Page 37: Learning

Page 38: Structure

• Computing visual words
• Bag of words model
• Latent Dirichlet allocation
• Single author-topic model
• Constellation model
• Scene model
• Applications

Page 39: Problems with bag of words

Page 40: Scene model

Page 41: Scene model

Page 42: Structure

• Computing visual words
• Bag of words model
• Latent Dirichlet allocation
• Single author-topic model
• Constellation model
• Scene model
• Applications

Page 43: Video Google

Page 44: Action recognition

Spatio-temporal bag-of-words model: 91.8% classification accuracy.

Page 45: Action recognition