TRANSCRIPT
A Probabilistic Topic-Connection Model for Automatic Image Annotation
www.ischool.drexel.edu
Xin Chen1, Xiaohua Hu1, Zhongna Zhou2, Caimei Lu1, Gail Rosen3, Tingting He4, E.K. Park5
1College of Information Science and Technology, Drexel University, Philadelphia, PA, USA, 2Dept. of ECE at University of Missouri in Columbia, MO, USA, 3Dept. of ECE at Drexel University in Philadelphia, PA, USA, 4Dept. of Computer Science at Central China Normal University in Wuhan, China, 5CSI-CUNY in Staten Island, NY, USA
2010-11-9 2
Outline
Problem Statement & Research Questions
Review ‐ Background & Existing Techniques
Represent Image Content in the Feature Space
Topic Modeling
Developed Method and Evaluation
Developed Methods
Evaluation
Conclusions
Web images come with additional text information (Flickr.com)
Web images come with additional text information (Wikipedia page)
ImageNet (Fei-Fei, et al, CVPR ‘09)
An image ontology based on WordNet, which currently has:
• 13,000+ categories of visual concepts
• 10 million human-cleaned images (~700 images per category)
• Openly available (www.image-net.org)
Mapping an ImageNet synset to a Wikipedia page
ImageNet dataset: image synset "Chrysanthemum coronarium"
Wikipedia page obtained by URL matching
The mapping provides an ‘indirect’ link between image sets and textual information.
Objective: utilize available image-text pairs as prior knowledge for automatic image annotation
Manual image annotation is time‐consuming, laborious and expensive.
Image-text pairs provide insight into the correlation between image visual content and informative textual descriptions.
Breakthroughs in automatic image annotation will help organize the massive volume of digital images, promote the development and study of image storage and retrieval systems, and serve other applications such as online image-sharing.
Outline
Problem Statement & Research Questions
Review ‐ Background & Existing Techniques
Represent Image Content in the Feature Space
Topic Modeling
Developed Method and Evaluation
Developed Methods
Evaluation
Conclusions
Problem 1: the semantic gap
• "The major difficulty in content-based image retrieval is the 'semantic gap' between image features and the user." (Smeulders, A.W.M. et al., 2000)
A CBIR application called Retrievr, in which you can draw query images that are used to find matching Flickr images.
Problem 2: image appearance varies greatly within the same image category
Variations:
Similarity transform:
• Spatial layout change
• Scale change
• Rotations
• Blurring
• Illumination change
• …
Affine transform:
• Skewing
• Different scaling of axes
• …
Solution: Objects represented by parts and key-points
Bag-of-Features
Background & Existing Techniques
Assumption: the patterns of different object categories can be represented by different distributions of local structures.
Salient point detector
• Harris‐Laplace Detector (Mikolajczyk, 2004)
• DoG salient points detector (Lowe, 2004)
Region detector
• Kadir-Brady (KB) saliency detector (Kadir and Brady, 2001)
• Maximally Stable Extremal Regions - MSERs (Matas et al., 2002)
Quantification – local descriptors:
• SIFT (Lowe, 2004)
• Color‐SIFT (van de Weijer, 2006)
Outline
Problem Statement & Research Questions
Review ‐ Background & Existing Techniques
Represent Image Content in the Feature Space
• Represent Image by 'Key-points'
• Represent Image by ‘Parts’
Topic Modeling
Developed Method and Evaluation
Developed Methods
Evaluation
Conclusions
The Gaussian images and the difference-of-Gaussian images (Lowe, 2004)
Images are blurred by a 2D Gaussian function:
Adjacent Gaussian images are subtracted to produce the difference-of-Gaussian images.
For the next octave, the Gaussian images are down-sampled by a factor of 2 and the process is repeated.
G(x, y, σ) = (1 / (2πσ²)) · e^(−(x² + y²) / (2σ²))
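The octave construction described above can be sketched as follows. This is a minimal illustration of the scale-space idea, not Lowe's implementation; the function names `gaussian_kernel`, `blur` and `dog_octave` are my own.

```python
import numpy as np

def gaussian_kernel(sigma, radius=None):
    """1-D Gaussian kernel, normalized to sum to 1."""
    if radius is None:
        radius = int(3 * sigma + 0.5)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def blur(image, sigma):
    """Separable 2-D Gaussian blur via two 1-D convolutions."""
    k = gaussian_kernel(sigma)
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, image)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, rows)

def dog_octave(image, n_scales=5, sigma0=1.6):
    """One octave: progressively blurred images and their pairwise differences."""
    k = 2 ** (1.0 / (n_scales - 2))  # scale multiplier between adjacent levels
    gaussians = [blur(image, sigma0 * k ** i) for i in range(n_scales)]
    # Adjacent Gaussian images are subtracted to form the DoG images.
    dogs = [g2 - g1 for g1, g2 in zip(gaussians, gaussians[1:])]
    # The next octave starts from the image down-sampled by a factor of 2.
    next_base = gaussians[-1][::2, ::2]
    return gaussians, dogs, next_base
```

DoG extrema would then be detected by comparing each pixel to its 26 neighbours across the three adjacent DoG images.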
Difference-of-Gaussian (DoG) salient points detector (Lowe, 2004)
Original image Output of DoG salient point detection
The DoG salient point detector detects the scale‐space extreme points in the difference‐of‐Gaussian images and tends to extract blob‐like key points from images.
Scale Invariant Feature Transform (SIFT) salient point descriptor (Lowe, 2004)
Image patches containing salient points are rotated to a canonical orientation and divided into cells. Each cell is represented as an 8-dimensional feature vector of gradient magnitudes in eight orientations.
Compared to other descriptors, the SIFT descriptor is more robust and invariant to rotation and scale/luminance changes.
The SIFT descriptor of salient points (2×2 cells) (Lowe, 2004)
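The cell-histogram computation can be sketched as below. This is a simplified version that keeps only the per-cell orientation histograms of the 2×2-cell figure; Lowe's full SIFT additionally does keypoint detection, orientation assignment and trilinear interpolation. The name `sift_like_descriptor` is hypothetical.

```python
import numpy as np

def sift_like_descriptor(patch, n_cells=2, n_bins=8):
    """Histogram of gradient orientations per cell, weighted by magnitude."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)  # orientation in [0, 2*pi)
    h, w = patch.shape
    ch, cw = h // n_cells, w // n_cells
    desc = []
    for i in range(n_cells):
        for j in range(n_cells):
            m = mag[i * ch:(i + 1) * ch, j * cw:(j + 1) * cw].ravel()
            a = ang[i * ch:(i + 1) * ch, j * cw:(j + 1) * cw].ravel()
            # 8-bin orientation histogram for this cell.
            hist, _ = np.histogram(a, bins=n_bins, range=(0, 2 * np.pi), weights=m)
            desc.append(hist)
    desc = np.concatenate(desc)
    norm = np.linalg.norm(desc)
    # Unit-normalizing gives some invariance to luminance changes.
    return desc / norm if norm > 0 else desc
```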
Grouping similar local descriptors into visual words
Typically, the K-Means clustering algorithm is used to cluster the descriptors of extracted image patches into visual words and establish a code book of visual words for a specific image collection.
Code book of visual words (Sivic, 2003) and (Fei-Fei et al. 2005)
Each key-point is assigned the index of the cluster center closest to its descriptor.
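The codebook construction and key-point assignment can be sketched with plain Lloyd's K-means (a toy version; real systems use optimized libraries and a much larger k, and the function names here are mine):

```python
import numpy as np

def build_codebook(descriptors, k, n_iter=20, seed=0):
    """Lloyd's K-means over local descriptors; the centers become the visual words."""
    X = np.asarray(descriptors, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assign each descriptor to its nearest center (squared Euclidean distance).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(1)
        # Move each center to the mean of its assigned descriptors.
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return centers

def assign_visual_words(descriptors, centers):
    """Each key-point gets the index of the closest cluster center."""
    X = np.asarray(descriptors, dtype=float)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return d2.argmin(1)
```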
“Bag-of-Visual-Words”
•Since visual words repeatedly appear in images and carry some atomic meanings, they can be regarded as visual analog of text words. Each image can be represented as a “Bag-of-Visual-Words”, which is an unordered collection of visual words.
• Many effective text mining and information retrieval techniques (such as feature selection, stop-word removal and TF-IDF term weighting) can then be applied to the vector space model of visual words.
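Applying TF-IDF weighting to the visual-word vector space might look like the sketch below (a minimal version; the exact TF-IDF variant used in a given system may differ):

```python
import numpy as np

def tfidf(counts):
    """TF-IDF weighting of a (n_images, n_visual_words) frequency matrix."""
    # Term frequency: visual-word counts normalized by image length.
    tf = counts / np.maximum(counts.sum(1, keepdims=True), 1)
    # Document frequency: how many images contain each visual word.
    df = (counts > 0).sum(0)
    idf = np.log(counts.shape[0] / np.maximum(df, 1))
    return tf * idf
```

A visual word occurring in every image gets idf = log(1) = 0, so uninformative "visual stop words" are down-weighted automatically.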
A simple test: the 15-scenes benchmark
The 15-scenes benchmark dataset consists of 4485 images spread over 15 categories; each category contains 200 to 400 images, ranging from natural scenes to man-made environments.
Oliva & Torralba, 2001; Fei-Fei & Perona, 2005; Lazebnik et al., 2006
[Figure: precision-recall curves (precision vs. recall) for the forest, tallbuilding and coast categories under three settings (S., V., B.)]
Comparison of precision-recall with respect to different image categories
(Chen et al., PAKDD '09)
Top ranked image retrieval results
Top ranked retrieval results
Query image (Coast)
Image retrieval results (continued)
Top ranked retrieval results
Query image (Skyscrapers)
LDA model for the 'Bag-of-Visual-Words' (Fei-Fei et al. 2005)
Using the 'Bag-of-Visual-Words' representation, we can perform topic modeling on image documents in the same way as on text documents.
LDA model for visual words (Fei-Fei et al. 2005)
Codewords dictionary
Bag-of-Visual-Words
Outline
Problem Statement & Research Questions
Review ‐ Background & Existing Techniques
Represent Image Content in the Feature Space
• Represent Image by 'Key-points'
• Represent Image by ‘Parts’
Topic Modeling
Developed Method and Evaluation
Developed Methods
Evaluation
Conclusions
Maximally Stable Extremal Regions –MSERs (Matas et al. 2002)
MSER is a highly efficient region detector. The idea originates from thresholding in the image color/intensity space I: thresholding at level t yields a binary image E_t.
An extremal region is maximally stable when the area (or the boundary length) of the segment changes the least with respect to the threshold.
The set of MSERs is closed under continuous geometric transformations and is invariant to affine intensity changes.
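The thresholding step and the stability criterion can be sketched as follows. This shows only the intuition, not a full MSER implementation: a real detector tracks the areas of individual connected components as the threshold sweeps, while this toy version tracks only the total "on" area. Function names are mine.

```python
import numpy as np

def threshold_image(intensity, t):
    """Binary image E_t: pixels at or below threshold t are 'on'."""
    return (intensity <= t).astype(np.uint8)

def region_areas_over_thresholds(intensity, thresholds):
    """Track how the 'on' area changes as the threshold sweeps the intensity range.
    Regions whose area changes least with respect to t are 'maximally stable'."""
    areas = np.array([threshold_image(intensity, t).sum() for t in thresholds])
    # Relative area change between consecutive thresholds (lower = more stable).
    change = np.abs(np.diff(areas)) / np.maximum(areas[:-1], 1)
    return areas, change
```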
The appearance of object parts as a continuous space
Image Morphing
How to quantify the image parts in a continuous space?
Image patches containing salient parts are rotated to a canonical angle and adjusted to a uniform size (known as normalized patches).
Principal component analysis (PCA) is performed on the normalized patches to obtain a feature representation.
Finally, the appearance of each patch (an n × n matrix) is quantified as a feature vector of the first k (typically 20-50) principal components.
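The PCA quantification step can be sketched as below (assuming the patches are already normalized to a uniform size; the function name is mine):

```python
import numpy as np

def pca_patch_features(patches, k=20):
    """patches: (n, s, s) normalized patches; returns (n, k) PCA coefficients."""
    n = patches.shape[0]
    X = patches.reshape(n, -1).astype(float)  # flatten each s x s patch
    mean = X.mean(0)
    Xc = X - mean
    # Principal directions via SVD of the centered data; rows of Vt are components.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]
    # Project each patch onto the first k principal components.
    return Xc @ components.T, mean, components
```

A patch can be approximately reconstructed as `mean + coeffs @ components`, which is what the "reconstruction from the first 50 PCA components" figure illustrates.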
Adjusting image patches to uniform size
Selection of Principal Component Number
Reconstruction of first 50 PCA components
Major PCA components = 15
Original Pixel
Summary: image represented by key-points and parts
Represent image by SIFT descriptors and MSER features
SIFT(key‐points)
MSER(parts)
Outline
Problem Statement & Research Questions
Review ‐ Background & Existing Techniques
Represent Image Content in the Feature Space
Topic Modeling
Developed Method and Evaluation
Developed Methods
Evaluation
Conclusions
Topic Modeling - Intuition
Of all the sensory impressions proceeding to the brain, the visual experiences are the dominant ones. Our perception of the world around us is based essentially on the messages that reach the brain from our eyes. For a long time it was thought that the retinal image was transmitted point by point to visual centers in the brain; the cerebral cortex was a movie screen, so to speak, upon which the image in the eye was projected. Through the discoveries of Hubel and Wiesel we now know that behind the origin of the visual perception in the brain there is a considerably more complicated course of events. By following the visual impulses along their path to the various cell layers of the optical cortex, Hubel and Wiesel have been able to demonstrate that the message about the image falling on the retina undergoes a step-wise analysis in a system of nerve cells stored in columns. In this system each cell has its specific function and is responsible for a specific detail in the pattern of the retinal image.
sensory, brain, visual, perception, retinal, cerebral cortex, eye, cell, optical, nerve, image, Hubel, Wiesel
Intuition:
• Assume the data we see is generated by some parameterized random process.
• Learn the parameters that best explain the data.
• Use the model to predict (infer) new data, based on the data seen so far.
'Bag-of-Words' model for text documents
[Figure: the document-term matrix. A document d_i (e.g. "Texas Instruments said it has developed the first 32-bit computer chip designed specifically for artificial intelligence applications [...]") is mapped to one row of a D × W matrix of term frequencies, where D is the document collection and W the lexicon/vocabulary; the row records frequencies for terms such as "artificial", "intelligence", "interest" and "artifact".]
Notations
• Word: the basic unit; an item from a vocabulary indexed by {1, . . . , V}.
• Document: a sequence of N words, denoted by w = (w1, w2, . . . , wN).
• Collection: a total of D documents, denoted by C = {w1, w2, . . . , wD}.
• Topic: denoted by z; the total number is K. Each topic has its unique word distribution p(w|z).
Background & Existing Techniques of Generative Latent Topic Models
The Naïve Bayesian model
The probabilistic latent semantic indexing (PLSI) model
PLSI Model (Hoffman, 2001)
z* = argmax_z p(z | w) ∝ p(z) · p(w | z)

(word-topic decision: prior probability of topic z times the likelihood of word w given topic z)
Assumption:
Each document has a mixture of k topics.
Fitting the model involves:
Estimating the topic-specific word distributions p(wi|zk) and document-specific topic distributions p(zk|dj) from the corpus via maximum likelihood estimation (MLE).
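Given fitted distributions, the word-topic decision rule above can be sketched as follows (the function name and array layout are my own; fitting itself would be done via EM):

```python
import numpy as np

def word_topic_decision(w, p_z, p_w_given_z):
    """z* = argmax_z p(z|w), which is proportional to p(z) * p(w|z).

    p_z:         (K,) prior probability of each topic
    p_w_given_z: (K, V) topic-specific word distributions
    """
    scores = p_z * p_w_given_z[:, w]  # unnormalized posterior over topics
    return int(np.argmax(scores))
```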
Latent Dirichlet Allocation (LDA) Model (Blei, 2003)
In the PLSI model, the topic mixture probabilities p(zk|dj) for documents are fixed once the model is estimated. For a newly arriving document, the model needs to be re-estimated; thus it is not scalable.
The LDA model treats the probability of latent topics for each document p(z|d) and the conditional probability of words for each latent topic p(w|z) as latent random variables, which are subject to change when new documents arrive.
Usually, symmetric priors are used:

p(z | d) ~ Multi(θ_d),   θ_d ~ Dir(α)
p(w | z_j) ~ Multi(φ_j),  φ_j ~ Dir(β)

Generative process of the LDA model

α = {α_1, α_2, ..., α_T} = {0.1},   β = {β_1, β_2, ..., β_W} = {0.01}
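The generative process with these symmetric priors can be sketched as a toy corpus sampler (a minimal sketch; the function name and toy sizes are mine):

```python
import numpy as np

def generate_lda_corpus(n_docs, doc_len, n_topics, vocab_size,
                        alpha=0.1, beta=0.01, seed=0):
    """Sample a toy corpus from the LDA generative process with the
    symmetric priors on the slide (alpha = 0.1, beta = 0.01)."""
    rng = np.random.default_rng(seed)
    # phi_j ~ Dir(beta): one word distribution per topic.
    phi = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)
    docs = []
    for _ in range(n_docs):
        theta = rng.dirichlet(np.full(n_topics, alpha))  # theta_d ~ Dir(alpha)
        z = rng.choice(n_topics, size=doc_len, p=theta)  # topic per word position
        w = np.array([rng.choice(vocab_size, p=phi[t]) for t in z])
        docs.append(w)
    return docs, phi
```

With alpha = 0.1, each sampled theta_d concentrates on a few topics, which is exactly the "diversity" motivation given on the next slide.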
Explanation of Prior Settings
Why do we set α = {α_1, α_2, ..., α_T} = {0.1}?

This parameter setting makes the topic-modeling results more diverse: each document will in turn have its own preference for a small number of topics related to its content, instead of assigning equal probability to every latent topic.

p(z | d) ~ Multi(θ_d),   θ_d ~ Dir(α)

Documents as mixtures of topics, each with a different prior probability.
This extension of the original topic model is achieved by introducing a new branch for visual words, which associates the topics of visual words with those of caption words; hence it is called the Corr-LDA model. The prototype of the Corr-LDA model was introduced by (Blei, 2003).
The model is estimated via a Gibbs sampling Monte Carlo process (Griffiths, 2004), which involves iteratively estimating the posterior probability of topics from the current word-topic assignments, and adopting a Monte Carlo process to determine the word-topic assignments for the next round.
Extension of topic model for visual words and image captions
[Plate diagram: the Corr-LDA model (Blei, 2003). Two branches (entity type 1 and entity type 2) share the document-level topic proportions θ (with prior α): each branch samples topics z and words w from its own word distribution φ (with prior β), over D documents and T latent topics.]

Posterior probability for topics at each iteration:

p(z_i = j | w_i, z_-i, w_-i) ∝ (n_-i,j^wi + β) / (n_-i,j^(·) + Wβ) · (n_-i,j^di + α) / (n_-i,·^di + Tα)

where n_-i,j^wi is the number of times word w_i is assigned to topic j, n_-i,j^(·) the total number of words assigned to topic j, n_-i,j^di the number of words in document d_i assigned to topic j, and n_-i,·^di the total word count of d_i, all excluding the current position i.
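The sampling loop can be sketched as below. This is the single-modality collapsed Gibbs sampler of (Griffiths, 2004) for plain LDA, shown only to illustrate the update; the full Corr-LDA sampler adds the second (visual) branch. Function name and toy defaults are mine.

```python
import numpy as np

def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01,
              n_iter=50, seed=0):
    """Collapsed Gibbs sampling: resample each word's topic from the posterior,
    using counts that exclude the current assignment."""
    rng = np.random.default_rng(seed)
    n_wt = np.zeros((vocab_size, n_topics))  # word-topic counts
    n_dt = np.zeros((len(docs), n_topics))   # document-topic counts
    z = [rng.integers(n_topics, size=len(d)) for d in docs]
    for d, doc in enumerate(docs):           # initialize counts randomly
        for i, w in enumerate(doc):
            n_wt[w, z[d][i]] += 1
            n_dt[d, z[d][i]] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                n_wt[w, t] -= 1              # remove the current assignment
                n_dt[d, t] -= 1
                # Posterior over topics, up to a constant; the document-side
                # denominator is the same for all topics and so is dropped.
                p = ((n_wt[w] + beta) / (n_wt.sum(0) + vocab_size * beta)
                     * (n_dt[d] + alpha))
                t = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = t
                n_wt[w, t] += 1
                n_dt[d, t] += 1
    return z, n_wt, n_dt
```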
Problem with the Corr-LDA model in modeling images and text
Needle-leaf forest is composed largely of straight-trunked, conical trees with relatively short branches and small, narrow, needle-like leaves. These trees are conifers. Where evergreen, the needle-leaf forest provides continuous and deep shade to the ground, so that lower layers of vegetation are sparse or absent except for a thick carpet of mosses in many places. Species are few, and large tracts of forest consist almost entirely of but one or two species.
[Figure: document-level topic mixture composition. The needle-leaf forest document mixes several topics (Topic 1, Topic 2, Topic 3, Topic 5, ...) over words such as branch, species, leaf, tree, animal and ground.]
[Plate diagram: the Corr-LDA model, with a caption-word branch (z, w, φ with prior β) and a visual-word branch (y, v, ψ with prior β') sharing the topic proportions θ (prior α), over D documents and T topics.]
Outline
Problem Statement & Research Questions
Review ‐ Background & Existing Techniques
Represent Image Content in the Feature Space
Topic Modeling
Developed Method and Evaluation
Developed Methods
Evaluation
Conclusions
An improved topic model
Note: multi-word phrases extracted by Xtract (Smadja, 1993)
The Data Collection and Settings
The image dataset is acquired from ImageNet (http://www.image-net.org/). Specifically, we download the synsets under the "flower", "mammal" and "tree" subtrees. The synsets are mapped to Wikipedia pages describing the same concepts.
A rule-based method is used to identify the explanatory sections in the Wikipedia pages. Articles with insufficient text (<200 words) are filtered out. In total, we obtain text descriptions for 1452 synsets (330, 562 and 560 synsets for the subtrees "flower", "mammal" and "tree", respectively).
For each synset, we replicate the text description to each of its images. We then index the single words and multi-word phrases in the text descriptions, and extract visual-word features as well as MSER region features from the images (an average of 1095 visual words and 127 MSER regions per image).
ImageNet has a backbone hierarchical ontology structure from WordNet, in which each node involves a group of images that depict a particular concept, named a synonym set, or "synset".
Illustration of latent topics uncovered by the proposed model

Topic 84, top words (probability): flower (0.019254), orchid (0.012133), Amanda (0.00867), subgenera (0.006814), shape (0.006617), monophylet (0.006449), Masdevallia (0.004167), genera (0.003656), subgenu (0.003208), sever (0.003009), section (0.003009), genu (0.002962), tuft (0.002903), dura (0.002583), Klotzsch (0.002562), COLOMBIA (0.002558), subtrib (0.002537), epiphyt (0.002384), final (0.002314), botanist (0.002215)

Top phrases (probability): one flower (0.015733), orchid family (0.009458), several genus (0.008829), smooth leaf (0.007536), triangular flower (0.006662), temperate climate (0.006321), a flower (0.005409), specy ¨cm. (0.004575), horticultural trade (0.004265), e.g.m. (0.004012), reproductive structure (0.003869), division magnoliophyta (0.003179), biological function (0.003105), male sperm (0.003041), female ovum (0.002879), higher plant (0.002747), next generation (0.002676), primary mean (0.002664), reproductive organ (0.002459), selective pressure (0.002443)
Illustration of latent topics uncovered by the proposed model

Topic 116, top words (probability): Leopard (0.011636), Africa (0.0095), Panthera (0.007002), jaguar (0.00681), lion (0.005525), spot (0.005232), cat (0.004863), black (0.00485), cross (0.004607), Felida (0.004351), home (0.003937), hybrid (0.003923), India (0.003921), Uncia (0.003818), central (0.003755), normal (0.003571), exist (0.003102), parent (0.003069), climb (0.003063), habitat (0.003011)

Top phrases (probability): snow leopard (0.014342), black panther (0.014025), sri lanka (0.013044), male leopard (0.012733), genus panthera (0.012725), small spot (0.012723), mammal specy (0.012718), great diversity (0.012718), greek word (0.012718), southern asia (0.012718), Indian subcontinent (0.012718), rain forest (0.007444), short leg (0.005945), american continent (0.005864), berlin zoo (0.005864), forest area (0.005108), wide variation (0.004198), across (0.004079), abundant prey (0.003955), several specy (0.003925)
Outline
Problem Statement & Research Questions
Review ‐ Background & Existing Techniques
Represent Image Content in the Feature Space
Topic Modeling
Developed Method and Evaluation
Developed Methods
Evaluation
Conclusions
Likelihood comparison
Likelihood of visual components given the model:

p(v | y) = ∏_{j=1..T} ∫ p(v_j | y_j, ψ_j) p(ψ_j | β') dψ_j
         = [ Γ(Vβ') / Γ(β')^V ]^T · ∏_{j=1..T} [ ∏_v Γ(C_j^v + β') ] / Γ(C_j + Vβ')

where V is the visual vocabulary size, C_j^v the number of times visual word v is assigned to visual topic j, and C_j = Σ_v C_j^v.
Perplexity comparison (# of visual topics=1000)
Perplexity = exp[ − Σ_{d ∈ D_test} log p(w_d, p_d | v_d, r_d) / Σ_{d ∈ D_test} (N_d^w + N_d^p) ]

where w_d and p_d are the caption words and phrases of test document d, v_d and r_d its visual words and regions, and N_d^w and N_d^p the numbers of words and phrases in d.
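Given per-document log-likelihoods from a fitted model, the perplexity computation itself is simple (a sketch; obtaining the log-likelihoods requires the fitted model, and the function name is mine):

```python
import numpy as np

def perplexity(log_likelihoods, doc_lengths):
    """Perplexity = exp(-sum_d log p(d) / sum_d N_d); lower is better.

    log_likelihoods: per-document log p(words, phrases | model)
    doc_lengths:     per-document token counts (words + phrases)
    """
    return np.exp(-np.sum(log_likelihoods) / np.sum(doc_lengths))
```

As a sanity check, a model that assigns uniform probability 1/V to each token has perplexity exactly V.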
Annotation accuracy comparison
p(w_i | d_j) = Σ_{t=1..T} p(w_i | z = t) · p(z = t | d_j)
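Given the estimated distributions, ranking annotation words for an image document by this marginalization can be sketched as below (function names and array layout are mine; `p_w_given_z` is the (T topics × V words) topic-word matrix and `p_z_given_d` the document's topic mixture):

```python
import numpy as np

def annotation_scores(p_w_given_z, p_z_given_d):
    """p(w_i | d_j) = sum over t of p(w_i | z=t) * p(z=t | d_j)."""
    return p_w_given_z.T @ p_z_given_d  # (V,) word scores for one document

def top_annotations(p_w_given_z, p_z_given_d, vocab, n=5):
    """Return the n highest-scoring caption words with their probabilities."""
    scores = annotation_scores(p_w_given_z, p_z_given_d)
    order = np.argsort(scores)[::-1][:n]
    return [(vocab[i], float(scores[i])) for i in order]
```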
Outline
Problem Statement & Research Questions
Review ‐ Background & Existing Techniques
Represent Image Content in the Feature Space
Topic Modeling
Developed Method and Evaluation
Developed Methods
Evaluation
Conclusions
Conclusions
A probabilistic topic-connection model is proposed to deal with the problem of modeling images and associated text descriptions.
Specifically, new latent variables have been introduced to allow for more flexible sampling of word topics and visual topics, in which one word topic may connect to multiple visual topics.
The proposed model provides a better representation of the connection between latent semantic topics and latent image patterns, and thus achieves better performance in the task of automatic image annotation compared to the traditional Corr-LDA model.
Questions or Comments?
THANK YOU FOR COMING! ☺