TRANSCRIPT
A Probabilistic Topic-Connection Model for Automatic Image Annotation
www.ischool.drexel.edu
Xin Chen1, Xiaohua Hu1, Zhongna Zhou2, Caimei Lu1, Gail Rosen3, Tingting He4, E.K. Park5
1College of Information Science and Technology, Drexel University, Philadelphia, PA, USA, 2Dept. of ECE at University of Missouri in Columbia, MO, USA, 3Dept. of ECE at Drexel University in Philadelphia, PA, USA, 4Dept. of Computer Science at Central China Normal University in Wuhan, China, 5CSI-CUNY in Staten Island, NY, USA
2010-11-9 2
Outline
Problem Statement & Research Questions
Review ‐ Background & Existing Techniques
Represent Image Content in the Feature Space
Topic Modeling
Developed Method and Evaluation
Developed Methods
Evaluation
Conclusions
Web images come with additional text information (Flickr.com)
Web images come with additional text information (Wikipedia page)
ImageNet (Fei-Fei, et al, CVPR ‘09)
An image ontology based on WordNet, which currently has:
• 13,000+ categories of visual concepts
• 10 million human-cleaned images (~700 images per category)
• Openly available (www.image-net.org)
Mapping an ImageNet synset to a Wikipedia page
ImageNet dataset: image synset "Chrysanthemum coronarium"
Wikipedia page obtained by URL matching
The mapping provides an ‘indirect’ link between image sets and textual information.
Objective: utilize available image-text pairs as prior knowledge for automatic image annotation
Manual image annotation is time‐consuming, laborious and expensive.
Image-text pairs provide insight into the correlation between image visual content and informative textual descriptions.
Breakthroughs in automatic image annotation will help organize the massive volume of digital images, promote the development and study of image storage and retrieval systems, and serve other applications such as online image-sharing.
Outline
Problem Statement & Research Questions
Review ‐ Background & Existing Techniques
Represent Image Content in the Feature Space
Topic Modeling
Developed Method and Evaluation
Developed Methods
Evaluation
Conclusions
Problem 1: the semantic gap
• "The major difficulty in content-based image retrieval is the 'semantic gap' between image features and the user." (Smeulders, A.W.M. et al., 2000)
A CBIR application called Retrievr, in which you can draw query images that are used to find matching Flickr images.
Problem 2: image appearance varies greatly within the same image category
Variations:
Similarity transform:
• Spatial layout change
• Scale change
• Rotations
• Blurring
• Illumination change
• …
Affine transform:
• Skewing
• Different scaling of axes
• …
Solution: Objects represented by parts and key-points
Bag-of-Features
Background & Existing Techniques
Assumption: the patterns of different object categories can be represented by different distributions of local structures.
Salient point detector
• Harris‐Laplace Detector (Mikolajczyk, 2004)
• DoG salient points detector (Lowe, 2004)
Region detector
• Kadir-Brady (KB) saliency detector (Kadir and Brady, 2001)
• Maximally Stable Extremal Regions - MSERs (Matas et al., 2002)
Quantification – local descriptors:
• SIFT (Lowe, 2004)
• Color‐SIFT (van de Weijer, 2006)
Outline
Problem Statement & Research Questions
Review ‐ Background & Existing Techniques
Represent Image Content in the Feature Space
• Represent Image by 'Key-points'
• Represent Image by ‘Parts’
Topic Modeling
Developed Method and Evaluation
Developed Methods
Evaluation
Conclusions
The Gaussian images and the difference-of-Gaussian images (Lowe, 2004)
Images are blurred by a 2D Gaussian function:
Adjacent Gaussian images are subtracted to produce the difference-of-Gaussian images.
For the next octave, the Gaussian images are down-sampled by a factor of 2 and the process is repeated.
G(x, y, σ) = (1 / (2πσ²)) · e^(−(x² + y²) / (2σ²))
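The octave construction described above can be sketched as follows. This is a minimal illustration of the scale-space idea, not Lowe's implementation; the function names `gaussian_kernel`, `blur` and `dog_octave` are my own.

```python
import numpy as np

def gaussian_kernel(sigma, radius=None):
    """1-D Gaussian kernel, normalized to sum to 1."""
    if radius is None:
        radius = int(3 * sigma + 0.5)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def blur(image, sigma):
    """Separable 2-D Gaussian blur via two 1-D convolutions."""
    k = gaussian_kernel(sigma)
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, image)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, rows)

def dog_octave(image, n_scales=5, sigma0=1.6):
    """One octave: progressively blurred images and their pairwise differences."""
    k = 2 ** (1.0 / (n_scales - 2))  # scale multiplier between adjacent levels
    gaussians = [blur(image, sigma0 * k ** i) for i in range(n_scales)]
    # Adjacent Gaussian images are subtracted to form the DoG images.
    dogs = [g2 - g1 for g1, g2 in zip(gaussians, gaussians[1:])]
    # The next octave starts from the image down-sampled by a factor of 2.
    next_base = gaussians[-1][::2, ::2]
    return gaussians, dogs, next_base
```

DoG extrema would then be detected by comparing each pixel to its 26 neighbours across the three adjacent DoG images.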
Difference-of-Gaussian (DoG) salient points detector (Lowe, 2004)
Original image Output of DoG salient point detection
The DoG salient point detector detects the scale‐space extreme points in the difference‐of‐Gaussian images and tends to extract blob‐like key points from images.
Scale Invariant Feature Transform (SIFT) salient point descriptor (Lowe, 2004)
Image patches containing salient points are rotated to a canonical orientation and divided into cells. Each cell is represented as an 8-dimensional feature vector of gradient magnitudes in eight orientations.
Compared to other descriptors, the SIFT descriptor is more robust and invariant to rotation and scale/luminance changes.
The SIFT descriptor of salient points (2×2 cells) (Lowe, 2004)
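The cell-histogram computation can be sketched as below. This is a simplified version that keeps only the per-cell orientation histograms of the 2×2-cell figure; Lowe's full SIFT additionally does keypoint detection, orientation assignment and trilinear interpolation. The name `sift_like_descriptor` is hypothetical.

```python
import numpy as np

def sift_like_descriptor(patch, n_cells=2, n_bins=8):
    """Histogram of gradient orientations per cell, weighted by magnitude."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)  # orientation in [0, 2*pi)
    h, w = patch.shape
    ch, cw = h // n_cells, w // n_cells
    desc = []
    for i in range(n_cells):
        for j in range(n_cells):
            m = mag[i * ch:(i + 1) * ch, j * cw:(j + 1) * cw].ravel()
            a = ang[i * ch:(i + 1) * ch, j * cw:(j + 1) * cw].ravel()
            # 8-bin orientation histogram for this cell.
            hist, _ = np.histogram(a, bins=n_bins, range=(0, 2 * np.pi), weights=m)
            desc.append(hist)
    desc = np.concatenate(desc)
    norm = np.linalg.norm(desc)
    # Unit-normalizing gives some invariance to luminance changes.
    return desc / norm if norm > 0 else desc
```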
Grouping similar local descriptors into visual words
Typically, the K-Means clustering algorithm is used to cluster the descriptors of extracted image patches into visual words and establish a code book of visual words for a specific image collection.
Code book of visual words (Sivic, 2003) and (Fei-Fei et al. 2005)
Each key-point is assigned the index of the cluster center closest to its descriptor.
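The codebook construction and key-point assignment can be sketched with plain Lloyd's K-means (a toy version; real systems use optimized libraries and a much larger k, and the function names here are mine):

```python
import numpy as np

def build_codebook(descriptors, k, n_iter=20, seed=0):
    """Lloyd's K-means over local descriptors; the centers become the visual words."""
    X = np.asarray(descriptors, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assign each descriptor to its nearest center (squared Euclidean distance).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(1)
        # Move each center to the mean of its assigned descriptors.
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return centers

def assign_visual_words(descriptors, centers):
    """Each key-point gets the index of the closest cluster center."""
    X = np.asarray(descriptors, dtype=float)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return d2.argmin(1)
```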
“Bag-of-Visual-Words”
•Since visual words repeatedly appear in images and carry some atomic meanings, they can be regarded as visual analog of text words. Each image can be represented as a “Bag-of-Visual-Words”, which is an unordered collection of visual words.
• Many effective text mining and information retrieval techniques (such as feature selection, stop-word removal and TF-IDF term weighting) can then be applied to the vector space model of visual words.
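Applying TF-IDF weighting to the visual-word vector space might look like the sketch below (a minimal version; the exact TF-IDF variant used in a given system may differ):

```python
import numpy as np

def tfidf(counts):
    """TF-IDF weighting of a (n_images, n_visual_words) frequency matrix."""
    # Term frequency: visual-word counts normalized by image length.
    tf = counts / np.maximum(counts.sum(1, keepdims=True), 1)
    # Document frequency: how many images contain each visual word.
    df = (counts > 0).sum(0)
    idf = np.log(counts.shape[0] / np.maximum(df, 1))
    return tf * idf
```

A visual word occurring in every image gets idf = log(1) = 0, so uninformative "visual stop words" are down-weighted automatically.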
A simple test: the 15-scenes benchmark
The 15-scenes benchmark dataset consists of 4485 images spread over 15 categories; each category contains 200 to 400 images, ranging from natural scenes to man-made environments.
Oliva & Torralba, 2001; Fei-Fei & Perona, 2005; Lazebnik et al., 2006
[Figure: precision-recall curves (precision vs. recall) for the forest, tallbuilding and coast categories under three settings (S., V., B.)]
Comparison of precision-recall with respect to different image categories
(Chen et al., PAKDD '09)
Top ranked image retrieval results
Top ranked retrieval results
Query image (Coast)
Image retrieval results (continued)
Top ranked retrieval results
Query image (Skyscrapers)
LDA model for the 'Bag-of-Visual-Words' (Fei-Fei et al. 2005)
Using the 'Bag-of-Visual-Words' representation, we can perform topic modeling on image documents in the same way as on text documents.
LDA model for visual words (Fei-Fei et al. 2005)
Codewords dictionary
Bag-of-Visual-Words
Outline
Problem Statement & Research Questions
Review ‐ Background & Existing Techniques
Represent Image Content in the Feature Space
• Represent Image by 'Key-points'
• Represent Image by ‘Parts’
Topic Modeling
Developed Method and Evaluation
Developed Methods
Evaluation
Conclusions
Maximally Stable Extremal Regions –MSERs (Matas et al. 2002)
MSER is a highly efficient region detector. The idea originates from thresholding in the image color/intensity space I: thresholding at level t yields a binary image E_t.
An extremal region is maximally stable when the area (or the boundary length) of the segment changes the least with respect to the threshold.
The set of MSERs is closed under continuous geometric transformations and is invariant to affine intensity changes.
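The thresholding step and the stability criterion can be sketched as follows. This shows only the intuition, not a full MSER implementation: a real detector tracks the areas of individual connected components as the threshold sweeps, while this toy version tracks only the total "on" area. Function names are mine.

```python
import numpy as np

def threshold_image(intensity, t):
    """Binary image E_t: pixels at or below threshold t are 'on'."""
    return (intensity <= t).astype(np.uint8)

def region_areas_over_thresholds(intensity, thresholds):
    """Track how the 'on' area changes as the threshold sweeps the intensity range.
    Regions whose area changes least with respect to t are 'maximally stable'."""
    areas = np.array([threshold_image(intensity, t).sum() for t in thresholds])
    # Relative area change between consecutive thresholds (lower = more stable).
    change = np.abs(np.diff(areas)) / np.maximum(areas[:-1], 1)
    return areas, change
```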
The appearance of object parts as a continuous space
Image Morphing
How to quantify the image parts in a continuous space?
Image patches containing salient parts are rotated to a canonical angle and adjusted to a uniform size (known as normalized patches).
Principal component analysis (PCA) is performed on the normalized patches to obtain a feature representation.
Finally, the appearance of each patch (an n × n matrix) is quantified as a feature vector of the first k (typically 20-50) principal components.
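The PCA quantification step can be sketched as below (assuming the patches are already normalized to a uniform size; the function name is mine):

```python
import numpy as np

def pca_patch_features(patches, k=20):
    """patches: (n, s, s) normalized patches; returns (n, k) PCA coefficients."""
    n = patches.shape[0]
    X = patches.reshape(n, -1).astype(float)  # flatten each s x s patch
    mean = X.mean(0)
    Xc = X - mean
    # Principal directions via SVD of the centered data; rows of Vt are components.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]
    # Project each patch onto the first k principal components.
    return Xc @ components.T, mean, components
```

A patch can be approximately reconstructed as `mean + coeffs @ components`, which is what the "reconstruction from the first 50 PCA components" figure illustrates.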
Adjusting image patches to uniform size
Selection of Principal Component Number
Reconstruction of first 50 PCA components
Major PCA components = 15
Original Pixel
Summary: image represented by key-points and parts
Represent image by SIFT descriptors and MSER features
SIFT(key‐points)
MSER(parts)
Outline
Problem Statement & Research Questions
Review ‐ Background & Existing Techniques
Represent Image Content in the Feature Space
Topic Modeling
Developed Method and Evaluation
Developed Methods
Evaluation
Conclusions
Topic Modeling - Intuition
Of all the sensory impressions proceeding to the brain, the visual experiences are the dominant ones. Our perception of the world around us is based essentially on the messages that reach the brain from our eyes. For a long time it was thought that the retinal image was transmitted point by point to visual centers in the brain; the cerebral cortex was a movie screen, so to speak, upon which the image in the eye was projected. Through the discoveries of Hubel and Wiesel we now know that behind the origin of the visual perception in the brain there is a considerably more complicated course of events. By following the visual impulses along their path to the various cell layers of the optical cortex, Hubel and Wiesel have been able to demonstrate that the message about the image falling on the retina undergoes a step-wise analysis in a system of nerve cells stored in columns. In this system each cell has its specific function and is responsible for a specific detail in the pattern of the retinal image.
sensory, brain, visual, perception, retinal, cerebral cortex, eye, cell, optical, nerve, image, Hubel, Wiesel
Intuition:
• Assume the data we see is generated by some parameterized random process.
• Learn the parameters that best explain the data.
• Use the model to predict (infer) new data, based on the data seen so far.
'Bag-of-Words' model for text documents
[Figure: the document-term matrix. A document d_i (e.g. "Texas Instruments said it has developed the first 32-bit computer chip designed specifically for artificial intelligence applications [...]") is mapped to one row of a D × W matrix of term frequencies, where D is the document collection and W the lexicon/vocabulary; the row records frequencies for terms such as "artificial", "intelligence", "interest" and "artifact".]
Notations
• Word: the basic unit; an item from a vocabulary indexed by {1, . . . , V}.
• Document: a sequence of N words, denoted by w = (w1, w2, . . . , wN).
• Collection: a total of D documents, denoted by C = {w1, w2, . . . , wD}.
• Topic: denoted by z; the total number is K. Each topic has its unique word distribution p(w|z).
Background & Existing Techniques of Generative Latent Topic Models
The Naïve Bayesian model
The probabilistic latent semantic indexing (PLSI) model
PLSI Model (Hoffman, 2001)
z* = argmax_z p(z | w) ∝ p(z) · p(w | z)

(word-topic decision: prior probability of topic z times the likelihood of word w given topic z)
Assumption:
Each document has a mixture of k topics.
Fitting the model involves:
Estimating the topic-specific word distributions p(wi|zk) and document-specific topic distributions p(zk|dj) from the corpus via maximum likelihood estimation (MLE).
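Given fitted distributions, the word-topic decision rule above can be sketched as follows (the function name and array layout are my own; fitting itself would be done via EM):

```python
import numpy as np

def word_topic_decision(w, p_z, p_w_given_z):
    """z* = argmax_z p(z|w), which is proportional to p(z) * p(w|z).

    p_z:         (K,) prior probability of each topic
    p_w_given_z: (K, V) topic-specific word distributions
    """
    scores = p_z * p_w_given_z[:, w]  # unnormalized posterior over topics
    return int(np.argmax(scores))
```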
Latent Dirichlet Allocation (LDA) Model (Blei, 2003)
In the PLSI model, the topic mixture probabilities p(zk|dj) for documents are fixed once the model is estimated. For a newly arriving document, the model needs to be re-estimated; thus it is not scalable.
The LDA model treats the probability of latent topics for each document p(z|d) and the conditional probability of words for each latent topic p(w|z) as latent random variables, which are subject to change when new documents arrive.
Usually, symmetric priors are used:

p(z | d) ~ Multi(θ_d),   θ_d ~ Dir(α)
p(w | z_j) ~ Multi(φ_j),  φ_j ~ Dir(β)

Generative process of the LDA model

α = {α_1, α_2, ..., α_T} = {0.1},   β = {β_1, β_2, ..., β_W} = {0.01}
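The generative process with these symmetric priors can be sketched as a toy corpus sampler (a minimal sketch; the function name and toy sizes are mine):

```python
import numpy as np

def generate_lda_corpus(n_docs, doc_len, n_topics, vocab_size,
                        alpha=0.1, beta=0.01, seed=0):
    """Sample a toy corpus from the LDA generative process with the
    symmetric priors on the slide (alpha = 0.1, beta = 0.01)."""
    rng = np.random.default_rng(seed)
    # phi_j ~ Dir(beta): one word distribution per topic.
    phi = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)
    docs = []
    for _ in range(n_docs):
        theta = rng.dirichlet(np.full(n_topics, alpha))  # theta_d ~ Dir(alpha)
        z = rng.choice(n_topics, size=doc_len, p=theta)  # topic per word position
        w = np.array([rng.choice(vocab_size, p=phi[t]) for t in z])
        docs.append(w)
    return docs, phi
```

With alpha = 0.1, each sampled theta_d concentrates on a few topics, which is exactly the "diversity" motivation given on the next slide.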
Explanation of Prior Settings
Why do we set α = {α_1, α_2, ..., α_T} = {0.1}?

This parameter setting makes the topic-modeling results more diverse: each document will in turn have its own preference for a small number of topics related to its content, instead of assigning equal probability to every latent topic.

p(z | d) ~ Multi(θ_d),   θ_d ~ Dir(α)

Documents as mixtures of topics, each with a different prior probability.
This extension of the original topic model is achieved by introducing a new branch for visual words, which associates the topics of visual words with those of caption words; hence it is called the Corr-LDA model. The prototype of the Corr-LDA model was introduced by (Blei, 2003).
The model is estimated via a Gibbs sampling Monte Carlo process (Griffiths, 2004), which involves iteratively estimating the posterior probability of topics from the current word-topic assignments, and adopting a Monte Carlo process to determine the word-topic assignments for the next round.
Extension of topic model for visual words and image captions
[Plate diagram: the Corr-LDA model (Blei, 2003). Two branches (entity type 1 and entity type 2) share the document-level topic proportions θ (with prior α): each branch samples topics z and words w from its own word distribution φ (with prior β), over D documents and T latent topics.]

Posterior probability for topics at each iteration:

p(z_i = j | w_i, z_-i, w_-i) ∝ (n_-i,j^wi + β) / (n_-i,j^(·) + Wβ) · (n_-i,j^di + α) / (n_-i,·^di + Tα)

where n_-i,j^wi is the number of times word w_i is assigned to topic j, n_-i,j^(·) the total number of words assigned to topic j, n_-i,j^di the number of words in document d_i assigned to topic j, and n_-i,·^di the total word count of d_i, all excluding the current position i.
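The sampling loop can be sketched as below. This is the single-modality collapsed Gibbs sampler of (Griffiths, 2004) for plain LDA, shown only to illustrate the update; the full Corr-LDA sampler adds the second (visual) branch. Function name and toy defaults are mine.

```python
import numpy as np

def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01,
              n_iter=50, seed=0):
    """Collapsed Gibbs sampling: resample each word's topic from the posterior,
    using counts that exclude the current assignment."""
    rng = np.random.default_rng(seed)
    n_wt = np.zeros((vocab_size, n_topics))  # word-topic counts
    n_dt = np.zeros((len(docs), n_topics))   # document-topic counts
    z = [rng.integers(n_topics, size=len(d)) for d in docs]
    for d, doc in enumerate(docs):           # initialize counts randomly
        for i, w in enumerate(doc):
            n_wt[w, z[d][i]] += 1
            n_dt[d, z[d][i]] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                n_wt[w, t] -= 1              # remove the current assignment
                n_dt[d, t] -= 1
                # Posterior over topics, up to a constant; the document-side
                # denominator is the same for all topics and so is dropped.
                p = ((n_wt[w] + beta) / (n_wt.sum(0) + vocab_size * beta)
                     * (n_dt[d] + alpha))
                t = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = t
                n_wt[w, t] += 1
                n_dt[d, t] += 1
    return z, n_wt, n_dt
```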
Problem with the Corr-LDA model in modeling images and text
Needle-leaf forest is composed largely of straight-trunked, conical trees with relatively short branches and small, narrow, needle-like leaves. These trees are conifers. Where evergreen, the needle-leaf forest provides continuous and deep shade to the ground, so that lower layers of vegetation are sparse or absent except for a thick carpet of mosses in many places. Species are few, and large tracts of forest consist almost entirely of but one or two species.
[Figure: document-level topic mixture composition. The needle-leaf forest document mixes several topics (Topic 1, Topic 2, Topic 3, Topic 5, ...) over words such as branch, species, leaf, tree, animal and ground.]
[Plate diagram: the Corr-LDA model, with a caption-word branch (z, w, φ with prior β) and a visual-word branch (y, v, ψ with prior β') sharing the topic proportions θ (prior α), over D documents and T topics.]
Outline
Problem Statement & Research Questions
Review ‐ Background & Existing Techniques
Represent Image Content in the Feature Space
Topic Modeling
Developed Method and Evaluation
Developed Methods
Evaluation
Conclusions
An improved topic model
Note: multi-word phrases extracted by Xtract (Smadja, 1993)
The Data Collection and Settings
The image dataset is acquired from ImageNet (http://www.image-net.org/). Specifically, we download the synsets under the "flower", "mammal" and "tree" subtrees. The synsets are mapped to Wikipedia pages describing the same concepts.
A rule-based method is used to identify the explanatory sections in the Wikipedia pages. Articles with insufficient text (<200 words) are filtered out. In total, we obtain text descriptions for 1452 synsets (330, 562 and 560 synsets for the subtrees "flower", "mammal" and "tree", respectively).
For each synset, we replicate the text description to each of its images. We then index the single words and multi-word phrases in the text descriptions, and extract visual-word features as well as MSER region features from the images (an average of 1095 visual words and 127 MSER regions per image).
ImageNet has a backbone hierarchical ontology structure from WordNet, in which each node involves a group of images that depict a particular concept, named a synonym set, or "synset".
Illustration of latent topics uncovered by the proposed model

Topic 84, top words (probability): flower (0.019254), orchid (0.012133), Amanda (0.00867), subgenera (0.006814), shape (0.006617), monophylet (0.006449), Masdevallia (0.004167), genera (0.003656), subgenu (0.003208), sever (0.003009), section (0.003009), genu (0.002962), tuft (0.002903), dura (0.002583), Klotzsch (0.002562), COLOMBIA (0.002558), subtrib (0.002537), epiphyt (0.002384), final (0.002314), botanist (0.002215)

Top phrases (probability): one flower (0.015733), orchid family (0.009458), several genus (0.008829), smooth leaf (0.007536), triangular flower (0.006662), temperate climate (0.006321), a flower (0.005409), specy ¨cm. (0.004575), horticultural trade (0.004265), e.g.m. (0.004012), reproductive structure (0.003869), division magnoliophyta (0.003179), biological function (0.003105), male sperm (0.003041), female ovum (0.002879), higher plant (0.002747), next generation (0.002676), primary mean (0.002664), reproductive organ (0.002459), selective pressure (0.002443)
Illustration of latent topics uncovered by the proposed model

Topic 116, top words (probability): Leopard (0.011636), Africa (0.0095), Panthera (0.007002), jaguar (0.00681), lion (0.005525), spot (0.005232), cat (0.004863), black (0.00485), cross (0.004607), Felida (0.004351), home (0.003937), hybrid (0.003923), India (0.003921), Uncia (0.003818), central (0.003755), normal (0.003571), exist (0.003102), parent (0.003069), climb (0.003063), habitat (0.003011)

Top phrases (probability): snow leopard (0.014342), black panther (0.014025), sri lanka (0.013044), male leopard (0.012733), genus panthera (0.012725), small spot (0.012723), mammal specy (0.012718), great diversity (0.012718), greek word (0.012718), southern asia (0.012718), Indian subcontinent (0.012718), rain forest (0.007444), short leg (0.005945), american continent (0.005864), berlin zoo (0.005864), forest area (0.005108), wide variation (0.004198), across (0.004079), abundant prey (0.003955), several specy (0.003925)
Outline
Problem Statement & Research Questions
Review ‐ Background & Existing Techniques
Represent Image Content in the Feature Space
Topic Modeling
Developed Method and Evaluation
Developed Methods
Evaluation
Conclusions
Likelihood comparison
Likelihood of visual components given the model:

p(v | y) = ∏_{j=1..T} ∫ p(v_j | y_j, ψ_j) p(ψ_j | β') dψ_j
         = [ Γ(Vβ') / Γ(β')^V ]^T · ∏_{j=1..T} [ ∏_v Γ(C_j^v + β') ] / Γ(C_j + Vβ')

where V is the visual vocabulary size, C_j^v the number of times visual word v is assigned to visual topic j, and C_j = Σ_v C_j^v.
Perplexity comparison (# of visual topics=1000)
Perplexity = exp[ − Σ_{d ∈ D_test} log p(w_d, p_d | v_d, r_d) / Σ_{d ∈ D_test} (N_d^w + N_d^p) ]

where w_d and p_d are the caption words and phrases of test document d, v_d and r_d its visual words and regions, and N_d^w and N_d^p the numbers of words and phrases in d.
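Given per-document log-likelihoods from a fitted model, the perplexity computation itself is simple (a sketch; obtaining the log-likelihoods requires the fitted model, and the function name is mine):

```python
import numpy as np

def perplexity(log_likelihoods, doc_lengths):
    """Perplexity = exp(-sum_d log p(d) / sum_d N_d); lower is better.

    log_likelihoods: per-document log p(words, phrases | model)
    doc_lengths:     per-document token counts (words + phrases)
    """
    return np.exp(-np.sum(log_likelihoods) / np.sum(doc_lengths))
```

As a sanity check, a model that assigns uniform probability 1/V to each token has perplexity exactly V.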
Annotation accuracy comparison
p(w_i | d_j) = Σ_{t=1..T} p(w_i | z = t) · p(z = t | d_j)
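Given the estimated distributions, ranking annotation words for an image document by this marginalization can be sketched as below (function names and array layout are mine; `p_w_given_z` is the (T topics × V words) topic-word matrix and `p_z_given_d` the document's topic mixture):

```python
import numpy as np

def annotation_scores(p_w_given_z, p_z_given_d):
    """p(w_i | d_j) = sum over t of p(w_i | z=t) * p(z=t | d_j)."""
    return p_w_given_z.T @ p_z_given_d  # (V,) word scores for one document

def top_annotations(p_w_given_z, p_z_given_d, vocab, n=5):
    """Return the n highest-scoring caption words with their probabilities."""
    scores = annotation_scores(p_w_given_z, p_z_given_d)
    order = np.argsort(scores)[::-1][:n]
    return [(vocab[i], float(scores[i])) for i in order]
```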
Outline
Problem Statement & Research Questions
Review ‐ Background & Existing Techniques
Represent Image Content in the Feature Space
Topic Modeling
Developed Method and Evaluation
Developed Methods
Evaluation
Conclusions
Conclusions
A probabilistic topic-connection model is proposed to deal with the problem of modeling images and associated text descriptions.
Specifically, new latent variables have been introduced to allow for more flexible sampling of word topics and visual topics, in which one word topic may connect to multiple visual topics.
The proposed model provides a better representation of the connection between latent semantic topics and latent image patterns, and thus achieves better performance in the task of automatic image annotation compared to the traditional Corr-LDA model.
Questions or Comments?
THANK YOU FOR COMING! ☺