![Page 1: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/1.jpg)
Object Recognition as Machine Translation
Matching Words and Pictures
Heather Dunlop
16-721: Advanced Perception
April 17, 2006
![Page 2: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/2.jpg)
Machine Translation
• Altavista’s Babel Fish:
– There are three more weeks of classes!
– Il y a seulement trois semaines supplémentaires de classes!
– ¡Hay solamente tres más semanas de clases!
– Ci sono soltanto tre nuove settimane dei codici categoria!
– Es gibt nur drei weitere Wochen Kategorien!
![Page 3: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/3.jpg)
Statistical Machine Translation
• Statistically link words in one language to words in another
• Requires aligned bitext
– e.g. Hansard for Canadian parliament
![Page 4: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/4.jpg)
Statistical Machine Translation
• Assuming an unknown one-one correspondence between words, come up with a joint probability distribution linking words in the two languages
• Missing data problem: solution is EM
– Given the translation probabilities, estimate the correspondences
– Given the correspondences, estimate the translation probabilities
![Page 5: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/5.jpg)
Multimedia Translation
• Data:
– Words are associated with images, but correspondences are unknown
sun sea sky
sun sea sky
![Page 6: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/6.jpg)
Auto-Annotation
• Predicting words for the images
tiger grass cat
![Page 7: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/7.jpg)
Region Naming
• Can also be applied to object recognition
• Requires a large data set
![Page 8: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/8.jpg)
Browsing
![Page 9: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/9.jpg)
Auto-Illustration
Moby Dick
![Page 10: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/10.jpg)
Data Sets of Annotated Images
• Corel data set
• Museum image collections
• News photos (with captions)
![Page 11: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/11.jpg)
First Paper
Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary
by Pinar Duygulu, Kobus Barnard, Nando de Freitas, David Forsyth
– A simple model for annotation and correspondence
![Page 12: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/12.jpg)
Overview
![Page 13: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/13.jpg)
Input Representation
• Segment with Normalized Cuts:
• Only use regions larger than a threshold (typically 5-10 per image)
• Form vector representation of each region
• Cluster regions with k-means to form blob tokens
sun sky waves sea
word tokens
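The clustering step above can be sketched as follows. This is a minimal plain-Python k-means, not the authors' implementation; the two-dimensional toy feature vectors, the function names, and `k=2` are assumptions for illustration (the experiments later in the deck use 500 blobs).

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(centers, v):
    # Index of the center closest to vector v.
    return min(range(len(centers)), key=lambda k: squared_distance(centers[k], v))

def kmeans(vectors, k, iterations=20, seed=0):
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        labels = [nearest(centers, v) for v in vectors]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old center if a cluster goes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    labels = [nearest(centers, v) for v in vectors]
    return centers, labels

# Hypothetical region feature vectors: three "sky-like", three "sea-like".
regions = [[0.9, 0.1], [1.0, 0.0], [0.95, 0.05],
           [0.1, 0.9], [0.0, 1.0], [0.05, 0.95]]
centers, blob_tokens = kmeans(regions, k=2)
```

Each entry of `blob_tokens` is the cluster index of the corresponding region, which plays the role of its blob token.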
![Page 14: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/14.jpg)
Input Representation
• Represent each region with a feature vector
– Size: portion of the image covered by the region
– Position: coordinates of center of mass
– Color: avg. and std. dev. of (R,G,B), (L,a,b) and (r=R/(R+G+B), g=G/(R+G+B))
– Texture: avg. and variance of 16 filter responses
– Shape: area / perimeter², moment of inertia, region area / area of convex hull
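A partial sketch of such a feature vector, covering only size, position, RGB statistics and the (r, g) chromaticities. The pixel-list input format, the normalisation of position by image size, and the function name are assumptions; Lab color, texture and shape features are omitted.

```python
from statistics import mean, pstdev

def region_features(pixels, image_width, image_height):
    """pixels: list of (x, y, (R, G, B)) tuples belonging to one region."""
    # Size: fraction of the image covered by the region.
    size = len(pixels) / (image_width * image_height)
    # Position: center of mass, scaled by image size (an assumption).
    cx = mean(x for x, _, _ in pixels) / image_width
    cy = mean(y for _, y, _ in pixels) / image_height
    features = [size, cx, cy]
    # Color: average and standard deviation of each RGB channel.
    for channel in range(3):
        values = [rgb[channel] for _, _, rgb in pixels]
        features += [mean(values), pstdev(values)]
    # Chromaticity: r = R/(R+G+B), g = G/(R+G+B); black pixels map to 0.
    r_vals, g_vals = [], []
    for _, _, (R, G, B) in pixels:
        total = R + G + B
        r_vals.append(R / total if total else 0.0)
        g_vals.append(G / total if total else 0.0)
    features += [mean(r_vals), pstdev(r_vals), mean(g_vals), pstdev(g_vals)]
    return features

# Two pure-red pixels in the top row of a 2x2 image.
toy_region = [(0, 0, (255, 0, 0)), (1, 0, (255, 0, 0))]
feature_vector = region_features(toy_region, image_width=2, image_height=2)
```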
![Page 15: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/15.jpg)
Tokenization
![Page 16: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/16.jpg)
Assignments
• Each word is predicted with some probability by each blob
![Page 17: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/17.jpg)
Expectation Maximization
• Select word with highest probability to assign to each blob
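$$p(w \mid b) = \prod_{n=1}^{N} \prod_{j=1}^{M_n} \sum_{i=1}^{L_n} p(a_{nj} = i)\, t(w_{nj} \mid b_{ni})$$
– $p(a_{nj} = i)$: probability that blob $b_{ni}$ translates to word $w_{nj}$
– $t(w_{nj} \mid b_{ni})$: probability of obtaining word $w_{nj}$ given instance of blob $b_{ni}$
– $N$: # of images
– $M_n$: # of words
– $L_n$: # of blobs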
![Page 18: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/18.jpg)
Expectation Maximization
• Initialize to blob-word co-occurrences:
• Iterate:
– Given the translation probabilities, estimate the correspondences
– Given the correspondences, estimate the translation probabilities
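The loop above can be sketched in a few lines. This is a simplified illustration, not the paper's code: it assumes a uniform prior over which blob in an image produced each word, and the blob and word names in the toy data are hypothetical.

```python
from collections import defaultdict

def normalise(table):
    # Make each row sum to one, so table[b][w] reads as t(w | b).
    for row in table.values():
        total = sum(row.values())
        for w in row:
            row[w] /= total

def train_translation_table(images, iterations=10):
    """images: list of (blob_tokens, word_tokens) pairs."""
    # Initialise t(w | b) to blob-word co-occurrence counts.
    t = defaultdict(lambda: defaultdict(float))
    for blobs, words in images:
        for b in blobs:
            for w in words:
                t[b][w] += 1.0
    normalise(t)
    for _ in range(iterations):
        counts = defaultdict(lambda: defaultdict(float))
        for blobs, words in images:
            for w in words:
                # E-step: posterior over which blob in this image produced w.
                z = sum(t[b][w] for b in blobs)
                if z == 0.0:
                    continue
                for b in blobs:
                    counts[b][w] += t[b][w] / z
        # M-step: re-estimate t(w | b) from the expected counts.
        normalise(counts)
        t = counts
    return t

images = [
    (["b_sun", "b_sky"], ["sun", "sky"]),
    (["b_sky", "b_sea"], ["sky", "sea"]),
    (["b_sun", "b_sea"], ["sun", "sea"]),
]
table = train_translation_table(images)
```

No image says which blob goes with which word, yet each blob's probability mass concentrates on its matching word because that pairing is the only one consistent across images.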
![Page 19: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/19.jpg)
Word Prediction
• On a new image:
– Segment
– For each region:
  • Extract features
  • Find the corresponding blob token using nearest neighbor
  • Use the word posterior probabilities to predict words
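A minimal sketch of these prediction steps, assuming segmentation and feature extraction have already produced one feature vector per region. The blob centers, the table `t` of p(word | blob), and all names are hypothetical toy values.

```python
def nearest_blob(features, centers):
    """Return the name of the blob token whose center is closest to features."""
    def sqdist(name):
        return sum((x - y) ** 2 for x, y in zip(features, centers[name]))
    return min(centers, key=sqdist)

def predict_region_words(regions, centers, t):
    """For each region, return the most probable word and its probability."""
    predictions = []
    for features in regions:
        blob = nearest_blob(features, centers)
        word = max(t[blob], key=t[blob].get)
        predictions.append((word, t[blob][word]))
    return predictions

centers = {"b_sky": [0.1, 0.9], "b_grass": [0.8, 0.2]}
t = {
    "b_sky": {"sky": 0.7, "water": 0.3},
    "b_grass": {"grass": 0.6, "tiger": 0.4},
}
new_image_regions = [[0.0, 1.0], [0.9, 0.1]]
predictions = predict_region_words(new_image_regions, centers, t)
```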
![Page 20: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/20.jpg)
Refusing to Predict
• Require: p(word|blob) > threshold
– i.e. assign a null word to any blob whose best predicted word lies below the threshold
• Prunes vocabulary, so fit new lexicon
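The thresholding rule can be illustrated as below. The threshold value and the use of `None` for the null word are assumptions; refitting the lexicon on the pruned vocabulary is not shown.

```python
def word_or_null(word_probs, threshold):
    """Return the best word for a blob, or None (the null word) if it is too uncertain."""
    word = max(word_probs, key=word_probs.get)
    return word if word_probs[word] > threshold else None

confident_blob = {"sky": 0.7, "water": 0.3}
ambiguous_blob = {"cat": 0.3, "tiger": 0.25, "grass": 0.25, "rock": 0.2}

label_a = word_or_null(confident_blob, threshold=0.5)
label_b = word_or_null(ambiguous_blob, threshold=0.5)
```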
![Page 21: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/21.jpg)
Indistinguishable Words
• Visually indistinguishable:
– cat and tiger, train and locomotive
• Indistinguishable with our features:
– eagle and jet
• Entangled correspondence:
– polar – bear
– mare/foals – horse
• Solution: cluster similar words
– Obtain similarity matrix
– Compare words with symmetrised KL divergence
– Apply N-Cuts on matrix to get clusters
– Replace word with its cluster label
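A sketch of the word-comparison step only. It assumes each word is described by a distribution over blob tokens (the slide does not say which distributions are compared), adds a small `eps` to avoid division by zero, and leaves the N-Cuts clustering of the resulting matrix out.

```python
import math

def symmetrised_kl(p, q, eps=1e-9):
    """KL(p||q) + KL(q||p), which equals the sum of (p - q) * log(p / q).

    eps smooths zero entries; the smoothed values are not renormalised.
    """
    total = 0.0
    for key in set(p) | set(q):
        pk = p.get(key, 0.0) + eps
        qk = q.get(key, 0.0) + eps
        total += (pk - qk) * math.log(pk / qk)
    return total

# Hypothetical per-word distributions over blob tokens.
word_blob = {
    "cat": {"b1": 0.6, "b2": 0.4},
    "tiger": {"b1": 0.55, "b2": 0.45},
    "jet": {"b3": 1.0},
}
distance = {
    (u, v): symmetrised_kl(word_blob[u], word_blob[v])
    for u in word_blob for v in word_blob
}
```

Small distances (here cat and tiger) mark words that are candidates for merging into one cluster label.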
![Page 22: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/22.jpg)
Experiments
• Train with 4500 Corel images
– 4-5 words for each image
– 371 words in vocabulary
– 5-10 regions per image
– 500 blobs
• Test on 500 images
![Page 23: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/23.jpg)
Auto-Annotation
• Determine most likely word for each blob
• If probability of word is greater than some threshold, use in annotation
![Page 24: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/24.jpg)
Measuring Performance
• Do we predict the right words?
![Page 25: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/25.jpg)
Region Naming / Correspondence
![Page 26: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/26.jpg)
Measuring Performance
• Do we predict the right words?
• Are they on the right blobs?
• Difficult to measure because data set contains no correspondence information
• Must be done by hand on a smaller data set
• Not practical to count false negatives
![Page 27: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/27.jpg)
Successful Results
![Page 28: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/28.jpg)
Successful Results
![Page 29: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/29.jpg)
Unsuccessful Results
![Page 30: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/30.jpg)
Refusing to Predict
![Page 31: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/31.jpg)
Clustering
![Page 32: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/32.jpg)
Merging Regions
![Page 33: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/33.jpg)
Results
light bar = average number of times blob predicts word in correct place
dark bar = average number of times blob predicts word which is in the image
![Page 34: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/34.jpg)
Second paper
Matching Words and Pictures
by Kobus Barnard, Pinar Duygulu, Nando de Freitas, David Forsyth, David Blei, Michael I. Jordan
– Comparing lots of different models for annotation and correspondence
![Page 35: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/35.jpg)
Annotation Models
• Multi-modal hierarchical aspect models
• Mixture of multi-modal LDA
![Page 36: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/36.jpg)
Multi-Modal Hierarchical Aspect Model
cluster = a path from a leaf to the root
![Page 37: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/37.jpg)
Multi-Modal Hierarchical Aspect Model
• All observations are produced independently of one another
• I-0: as above
• I-1: cluster dependent level structure
– p(l|d) replaced with p(l|c,d)
• I-2: generative model
– p(l|d) replaced with p(l|c)
– allows prediction for documents not in training set
Equation annotations: document, observations, clusters, levels, normalization, Gaussian, frequency tables
![Page 38: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/38.jpg)
Multi-Modal Hierarchical Aspect Model
• Model fitting is done with EM
• Word prediction:
Equation annotation: set of observed blobs
![Page 39: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/39.jpg)
Mixture of Multi-Modal LDA
Graphical model labels: multinomial, Dirichlet, multinomial, multinomial, multivariate Gaussian, mixture component and hidden factor
![Page 40: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/40.jpg)
Mixture of Multi-Modal LDA
• Distribution parameters estimated with EM
• Word prediction:
Equation annotations: posterior over mixture components, posterior Dirichlet
![Page 41: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/41.jpg)
Correspondence Models
• Discrete translation
• Hierarchical clustering
• Linking word and region emission probabilities
• Paired word and region emission
![Page 42: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/42.jpg)
Discrete Translation
• Similar to first paper
• Use k-means to vector-quantize the set of features representing an image region
• Construct a joint probability table linking word tokens to blob tokens
• Data set doesn’t provide explicit correspondences
– Missing data problem => EM
![Page 43: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/43.jpg)
Hierarchical Clustering
• Again, using vector-quantized image regions
• Word prediction:
![Page 44: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/44.jpg)
Linking Word and Region Emission
• Words emitted conditioned on observed blobs
• D-0: as above (D for dependent)
• D-1: cluster dependent level distributions
– Replace p(l|c,d) with p(l|d)
• D-2: generative model
– Replace p(l|d) with p(l)
B U W
![Page 45: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/45.jpg)
Paired Word and Region Emission at Nodes
• Observed words and regions are emitted in pairs: D={(w,b)}
• C-0: as above (C for correspondence)
• C-1: cluster dependent level structure
– p(l|d) replaced with p(l|c,d)
• C-2: generative model
– p(l|d) replaced with p(l|c)
![Page 46: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/46.jpg)
Wow, That’s a Lot of models!
• Multi-modal hierarchical: I-0, I-1, I-2
• Multi-modal LDA
• Discrete translation
• Hierarchical clustering
• Linked word and region emission: D-0, D-1, D-2
• Paired word and region emission: C-0, C-1, C-2
• Count = 12
• Why so many?
![Page 47: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/47.jpg)
Evaluation Methods
• Annotation performance measures:
– KL divergence between predicted and target distributions:
– Word prediction measure:
  • n = # of words in image
  • r = # of words predicted correctly
  • # of words predicted is set to # of actual keywords
– Normalized classification score:
  • w = # of words predicted incorrectly
  • N = vocabulary size
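The formulas for the two count-based measures did not survive on the slide. The sketch below assumes the forms r/n for the word prediction measure and r/n − w/(N − n) for the normalized classification score, which fit the symbols defined above; treat both as assumptions rather than the slide's exact definitions. The word sets and vocabulary size are toy values.

```python
def annotation_scores(predicted, actual, vocabulary_size):
    """Return (word prediction measure, normalized classification score)."""
    n = len(actual)              # words in the image's annotation
    r = len(predicted & actual)  # words predicted correctly
    w = len(predicted - actual)  # words predicted incorrectly
    word_prediction = r / n
    # Assumed form: reward correct words, penalise wrong ones relative to
    # the number of words that could have been wrongly predicted.
    normalized = r / n - w / (vocabulary_size - n)
    return word_prediction, normalized

actual = {"tiger", "grass", "cat", "water"}
predicted = {"tiger", "cat", "sky", "sun"}  # as many words as actual keywords
pr_score, ns_score = annotation_scores(predicted, actual, vocabulary_size=104)
```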
![Page 48: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/48.jpg)
Results
• Methods using clustering are very reliant on having images that are close to the training data
• MoM-LDA has strong resistance to over-fitting
• D-0 (linked word and region emission) appears to give best results, taking all measures and data sets into consideration
![Page 49: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/49.jpg)
Successful Results
![Page 50: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/50.jpg)
Unsuccessful Results
good annotation, poor correspondence
complete failure
![Page 51: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/51.jpg)
N-cuts vs. Blobworld
Normalized Cuts
Blobworld
![Page 52: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/52.jpg)
N-cuts vs. Blobworld
![Page 53: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/53.jpg)
Browsing Results
Clustering by text only
Clustering by image features only
![Page 54: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/54.jpg)
Browsing Results
Clustering by both text and image features
![Page 55: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/55.jpg)
Search Results
• query: tiger, river
tiger, cat, water, grass tiger, cat, water, grass tiger, cat, grass, trees
tiger, cat, water, grass tiger, cat, grass, forest tiger, cat, water, grass
![Page 56: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/56.jpg)
Auto-Illustration Results
• Passage from Moby Dick:
– “The large importance attached to the harpooneer's vocation is evinced by the fact, that originally in the old Dutch Fishery, two centuries and more ago, the command of a whale-ship!…”
• Words extracted from the passage using natural language processing tools
– large importance attached fact old dutch century more command whale ship was per son was divided officer word means fat cutter time made days was general vessel whale hunting concern british title old dutch official present rank such more good american officer boat night watch ground command ship deck grand political sea men mast
![Page 57: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/57.jpg)
Auto-Illustration Results
• Top-ranked images retrieved using all extracted words:
![Page 58: Object Recognition as Machine Translation Matching Words and Pictures](https://reader035.vdocuments.us/reader035/viewer/2022062222/56815a72550346895dc7d621/html5/thumbnails/58.jpg)
Conclusions
• Lots of different models developed
– Hard to tell which is best
• Can be used with any set of features
• Numerous applications:
– Auto-annotation
– Region naming (aka object recognition)
– Browsing
– Searching
– Auto-illustration
• Improvements in translation from visual to semantic representations lead to improvements in image access