
Multimed Tools Appl
DOI 10.1007/s11042-010-0616-x

Incremental visual objects clustering with the growing vocabulary tree

Zhenyong Fu · Hongtao Lu · Wenbin Li

© Springer Science+Business Media, LLC 2010

Abstract With the bag-of-visual-words image representation, we can use text analysis methods, such as pLSA and LDA, to solve visual object clustering and classification problems. However, previous works only used a fixed visual vocabulary, formed by vector quantizing SIFT-like region descriptors, so the learned visual topic models are also based only on that fixed vocabulary. This paper presents a novel approach to cluster visual objects in an incremental manner. Given a new batch of images, we first expand the visual vocabulary to include the new visual words, then adjust the object clustering model to absorb these new words, and finally give the clustering result. We achieve our goal by adapting to the visual domain the incremental pLSA model previously used for text analysis. Experimental results demonstrate the feasibility and stability of the growing vocabulary tree and the clustering performance on images from seven categories in a dynamic environment.

Keywords Visual clustering · Bag-of-words · Incremental pLSA

Z. Fu (B) · H. Lu
Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
e-mail: [email protected]

H. Lu
e-mail: [email protected]

W. Li
Department of Diagnostic and Interventional Radiology, Affiliated Sixth People's Hospital, Shanghai Jiao Tong University, Shanghai, China
e-mail: [email protected]


1 Introduction

Clustering visual objects into different categories is one of the core problems in computer vision and has received much attention over the past decades. One of the main challenges of visual object clustering comes from the complex variations in real-world environments, such as backgrounds, viewpoints, illuminations, scales, and orientations. Due to advances in local appearance descriptors [13, 15, 16], images can be represented by a bag of visual words, built by quantizing local descriptors such as SIFT [13]. This greatly simplifies the analysis, since the data are represented by a co-occurrence matrix, a table of the counts of each visual word in each document. Under the bag-of-visual-words representation, text analysis techniques such as probabilistic Latent Semantic Analysis (pLSA) [8, 9] and Latent Dirichlet Allocation (LDA) [1] can then be adapted to the computer vision area.

In visual classification and clustering tasks, Fei-Fei and Perona use LDA to model natural scene categories [5]; Sivic et al. use the pLSA model to cluster visual objects in an unsupervised manner [20]. In these previous works, the visual vocabulary is typically built by applying a batch clustering technique to a subsample of the training data and performing vector quantization before training the topic models. These works based on probabilistic generative models all use a fixed visual vocabulary. This is a static representation and is not suitable for dynamic environments. Our experimental results show that clustering performance suffers when the visual vocabulary is trained on ill-tuned training data.

In this paper, we extend the unsupervised visual clustering method described in [20]. Compared to the static clustering environment in [20], we design our approach under a more dynamic assumption: an incremental visual objects clustering framework. As shown in Fig. 1, we do not get all the images at once as in the previous

Fig. 1 Illustration of the framework of the incremental visual objects clustering. This framework works in an incremental way, where the images are processed in batches. When a new batch of images arrives, the local descriptors from the new images are used to grow the visual vocabulary tree. We only grow the vocabulary tree at the leaf nodes. As shown in the figure, the blue node represents the growing node and the red nodes represent the new visual words. The clustering model is then updated with the new images and the new visual vocabulary tree. We can directly read the clustering results from the updated model


works [20], but receive one batch of images at a time; the images arrive in batches. This is a more natural setting because real-world problems are often dynamic and incremental, such as the collecting of personal photos or the image dataset of a search engine. Besides the dynamic property, our proposed method is unsupervised and does not need labels for the training samples. This is also preferable because labeling training data is difficult in practice. Yeh and Darrell develop an incremental learning method for visual classifiers based on the SMO training process [22]; their approach is supervised, whereas the method proposed in this paper is unsupervised. Li et al. propose an automatic image dataset collecting and supervised model learning approach in an incremental way [12]. They use the new images to re-trigger the Gibbs sampling process of a Dirichlet Process-based model. Similar work includes [19]. However, these works also use a static, fixed visual vocabulary that is trained at the beginning and kept unchanged throughout the incremental collecting process. In our framework, when a new batch of images arrives, we first grow the visual vocabulary. Our results show that the growing behavior differs greatly for different image distributions: for a uniform distribution of images from different categories, the vocabulary grows gently; for a non-uniform distribution, the vocabulary grows dramatically. Due to the increasing popularity of methods based on local appearance measures, effective building and searching of the visual vocabulary become very important. Our visual vocabulary structure is based on the scalable vocabulary tree [18]. This approach quantizes image descriptors using a hierarchical subspace division (hierarchical k-means clustering) to produce visual words. Compared to a flat structure, the tree structure dramatically reduces the cost of building and searching the visual vocabulary. The original vocabulary tree method is still static and fixed. We grow the vocabulary tree to include the new visual words, making it a dynamic vocabulary tree. After growing the previous vocabulary tree, we update the generative model in an incremental manner. Our approach is based on the incremental pLSA model [4], which can absorb the new visual words into the model and converges faster than batch retraining.

Based on the scalable vocabulary tree method, Yeh and Darrell propose an adaptive vocabulary forest for category learning [23]. But their category model is discriminative, not generative. Compared to our unsupervised manner, their classification model is supervised and needs image labels for the training samples. When the vocabulary tree changes, they only need to update the histogram pyramid [7, 10]. In a generative model, the visual word index in the vocabulary is a very important factor. When the vocabulary tree changes, the 'word-id' of a descriptor may also change under the closest-center rule. We call this phenomenon 'word shifting'. If the visual words shift too much, the dynamic vocabulary tree is unstable. The scalable vocabulary tree method, initially used to solve the object recognition problem, is a kind of indexing technology for local appearance descriptors [18]. Similar research works include [11, 17, 19, 21]. All of these works belong to methods of recognition or classification via matching and voting of local descriptors, in which effective indexing of descriptors is most important. They compare the similarity between images through the count of matching descriptors. The vocabulary trees in these works do not keep the 'word-id' unchanged (they do not need the 'word-id') and are unstable. In this paper, we propose several strategies to grow the vocabulary tree, and our experiments test their stability according to the


'word shifting' phenomenon. Spectral clustering is another type of clustering method used for visual clustering tasks [2, 6, 24]. Because we limit our discussion to methods based on the bag-of-words representation, we do not discuss these works further.

The specific contributions of this work are highlighted as follows: (1) We propose a construction manner for a growing visual vocabulary and test its stability. (2) Our work shows a large improvement in visual clustering results with the growing vocabulary tree, compared to previous work. To the best of our knowledge, ours is among the first papers (if not the first) to use a non-fixed visual vocabulary in visual category analysis based on a probabilistic generative model. (3) We adapt the recent literature on incremental probabilistic semantic analysis to solve the problem of visual objects clustering.

In Section 2, we propose several strategies and give various implementation details to grow the visual vocabulary tree using a new batch of images. Section 3 describes the incremental clustering process of visual objects. To explain and compare performance, in Section 4 we apply the vocabulary tree and different visual clustering models to sets of images for which the ground-truth categories are known. We give experimental results on the feasibility and stability of the growing vocabulary tree and compare the clustering performance of our approach to the method in [20] and the batch retraining method.

2 Growing vocabulary tree

To build the vocabulary tree of visual words, we use two types of region detectors: one is Harris-affine [15], and the other is MSER [14]. Then a 128-dimensional rotationally invariant SIFT vector is computed to represent each region in each image. We use the binaries provided at [25] for all of these. We summarize the basic method of the vocabulary tree in Algorithm 1 [18]. The vocabulary tree is a regular tree. We use k to denote the branch factor (the number of children of each node), so each node has

Algorithm 1 Training vocabulary tree of k-branch and l-depth using the descriptor vectors D

function VTree(D, k, l)
    if l > 1 and partition conditions satisfied then
        run k-means on D to get k cluster centers, T1, . . . , Tk
        VT ← set {T1, . . . , Tk} as the child nodes
        partition D into k groups, D1, . . . , Dk, with T1, . . . , Tk
        foreach Ti in {T1, . . . , Tk} do
            Ti ← VTree(Di, k, l − 1)
        end foreach
    else
        set VT to empty tree
    end if
    return VT
end function


either no child node or k child nodes. D is the training data, a 128 × n matrix, where n is the total number of training descriptors. l is the maximum height we allow.

After running an initial k-means process on D, D is partitioned into k groups, where each group consists of the descriptor vectors closest to a particular cluster center. Algorithm 1 recursively uses the sub-groups {Di} to build sub-trees {Ti}. For simplicity, we denote the cluster center and the corresponding sub-tree by the same notation Ti, and the following narrative uses the same simplified notation. The recursive building process terminates when the tree reaches the desired height l, or when the partition conditions are not satisfied. In our implementation, the tree is not allowed to grow freely and we set the partition condition |Di| ≥ 2 · k, where | · | denotes the cardinality of a set. If a sub-group contains too few descriptors, fewer than 2k, we stop building that sub-tree.
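As a concrete sketch, the recursive hierarchical k-means construction of Algorithm 1, with the |Di| ≥ 2k partition condition, could look as follows. This is our own illustrative implementation (plain Lloyd's k-means, dict-based tree nodes), not the authors' code:

```python
import numpy as np

def kmeans(D, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    centers = D[rng.choice(len(D), size=k, replace=False)].copy()
    for _ in range(iters):
        # assign each descriptor to its closest center
        labels = np.linalg.norm(D[:, None, :] - centers[None, :, :], axis=2).argmin(axis=1)
        for i in range(k):
            if np.any(labels == i):
                centers[i] = D[labels == i].mean(axis=0)
    return centers, labels

def vtree(D, k, l):
    """Algorithm 1 sketch: recursive hierarchical k-means.
    A node is {'centers': (k, dim) array, 'children': list of k sub-trees};
    None marks an empty tree / leaf. Stop when the height budget is used up
    or the group is too small to split (|D| < 2k)."""
    if l <= 1 or len(D) < 2 * k:
        return None
    centers, labels = kmeans(D, k)
    children = [vtree(D[labels == i], k, l - 1) for i in range(k)]
    return {"centers": centers, "children": children}
```

With the paper's initial settings (depth 3), `vtree(D, k, 3)` would build at most two levels of internal nodes over the descriptor matrix D.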

In our framework, when the first batch of images arrives, we train the initial vocabulary tree using Algorithm 1. We encode the nodes of the initial tree by single integers and call these integers visual word-ids. When a new batch of images arrives, the new descriptor vectors are simply propagated down the previous tree by, at each layer, comparing the descriptor vector to the k candidate cluster centers (represented by the k child nodes in that layer) and choosing the closest one. At a leaf of the vocabulary tree, if the sub-group satisfies the growing conditions, the vocabulary tree grows. In this paper, we propose two main manners to grow the tree, according to whether the cluster centers at each layer are changed: the incremental vocabulary tree and the evolutionary vocabulary tree.

2.1 Incremental vocabulary tree

The incremental vocabulary tree uses the simplest growing manner. It propagates the new descriptors down the previous vocabulary tree and does not change the cluster vectors in the inner nodes. If the growing conditions are satisfied, the tree grows at the leaf node. We summarize the incremental vocabulary tree in Algorithm 2.

Algorithm 2 Incrementally grow the vocabulary tree using the new descriptor vectors Dnew

procedure INCR-VTree(VT, Dnew, k)
    if VT is leaf and growing conditions satisfied then
        l ← logk(|Dnew|/2k)
        VT ← VTree(Dnew, k, l)
    else
        get k child nodes of VT, T1, . . . , Tk
        partition Dnew into k groups, Dnew_1, . . . , Dnew_k, using T1, . . . , Tk
        foreach Ti in {T1, . . . , Tk} do
            INCR-VTree(Ti, Dnew_i, k)
        end foreach
    end if
end procedure


When the vocabulary tree grows dynamically, we face several problems: (1) to what extent to grow the tree, (2) when to grow the tree, and (3) how to encode the new visual words by single integers.

The growth occurs in the leaves of the tree, and we use Dnew to represent the descriptors propagated into one leaf node. In this paper, we use (1) to compute the height of the growing sub-tree:

    l = log_k ( |Dnew| / 2k )    (1)

We store the number of descriptors passed through each node of the initial vocabulary tree, and update these numbers when propagating the new descriptors along the previous tree. We propose two growing conditions that determine when to grow the tree. Both compare the history statistics with the statistics of the newly arrived descriptors in a leaf node. In a given leaf node, we define four variables: (1) nc, the count of new descriptors; (2) hc, the history count of descriptors; (3) nr, the ratio of the new count at this leaf to the total new count; (4) hr, the ratio of the history count at this leaf to the total history count. We also define a growing factor, GF, as an algorithm parameter. Then the growing conditions are: (1) nc ≥ GF · hc and (2) nr ≥ GF · hr. In a leaf node, if either growing condition is met, the vocabulary tree grows.
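The two growing conditions amount to a one-line test per leaf. A minimal helper (our own, purely illustrative; the variable names follow the text above):

```python
def should_grow(nc, hc, nr, hr, gf=2.0):
    """Growing conditions of Section 2.1: a leaf grows when the new
    descriptor count nc, or its ratio nr over all new descriptors, reaches
    GF times the corresponding history statistic (hc or hr)."""
    return nc >= gf * hc or nr >= gf * hr
```

With GF = 2 (the value used in the experiments), a leaf grows only when it receives at least twice its historical share of descriptors, in absolute or relative terms.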

In this paper, we encode the cluster centers in each leaf node by unique integers in natural-number order. The cluster centers in leaf nodes are the visual words, and each has an integer word-id. When the tree grows, the previous word-id of the growing leaf node is assigned to the closest descendant leaf node, and the other descendant leaf nodes are assigned the next unused integers in the tree, again in natural-number order. The algorithms for the vocabulary tree include this visual word encoding process, but we omit it from the algorithm listings.
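The word-id assignment rule for a grown leaf can be sketched as follows (our own illustrative helper; `next_free_id` is assumed to be the smallest unused integer in the tree):

```python
import numpy as np

def assign_word_ids(old_center, old_id, new_leaf_centers, next_free_id):
    """Word-id rule of Section 2.1: when a leaf grows into a sub-tree, the
    old leaf's word-id goes to the descendant leaf closest to the old center;
    the other new leaves take the following unused integers, in order."""
    d = np.linalg.norm(np.asarray(new_leaf_centers, dtype=float)
                       - np.asarray(old_center, dtype=float), axis=1)
    keep = int(d.argmin())          # descendant closest to the old center
    ids, nxt = [], next_free_id
    for i in range(len(new_leaf_centers)):
        if i == keep:
            ids.append(old_id)      # inherit the old word-id
        else:
            ids.append(nxt)
            nxt += 1
    return ids
```

This keeps descriptors that still fall near the old center mapped to the old word-id, which is what the stability measure in Section 4 rewards.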

2.2 Evolutionary vocabulary tree

The evolutionary vocabulary tree uses another growing manner. In this manner, the old cluster centers in each layer are adapted using the new descriptors. The process is recursive and the cluster centers are updated layer by layer.

In Algorithm 3, we define the moving vector mv to represent the change of the upper layer's cluster center. In an inner node, we first update the cluster centers by adding mv to them, which makes the centers of the lower layer move along with the upper layer's center. Let Tu_1, . . . , Tu_k represent these updated centers. Then we use Tu_1, . . . , Tu_k to compute k centers of the training descriptors that reached this node, and let C1, . . . , Ck represent these training centers. The details of how to compute the training centers are listed later. We use the Hungarian algorithm to match the two sets of cluster centers, i.e., {Tu_i} and {Ci}. Let ni be the count of new descriptors that belong to the cluster center Ci, and hi be the history count of descriptors passed through Ti (Ti and Tu_i have the same history count value). We compute the new cluster centers in each inner node following the evolutionary k-means [3]. We do not


Algorithm 3 Evolutionarily grow the vocabulary tree using the new descriptor vectors Dnew

procedure EVOL-VTree(VT, Dnew, k, mv)
    if VT is leaf and growing conditions satisfied then
        l ← logk(|Dnew|/2k)
        VT ← VTree(Dnew, k, l)
    else
        get k child nodes of VT, T1, . . . , Tk
        Tu_i ← Ti + mv, ∀i ∈ {1 . . . k}
        compute the k cluster centers, C1, . . . , Ck, of Dnew, using Tu_1, . . . , Tu_k
        Ti ← γi · Ci + (1 − γi) · Tu_i, ∀i ∈ {1 . . . k}
        mv′_i ← Ti − Tu_i + mv, ∀i ∈ {1 . . . k}
        partition Dnew into k groups, Dnew_1, . . . , Dnew_k, using T1, . . . , Tk
        foreach Ti in {T1, . . . , Tk} do
            EVOL-VTree(Ti, Dnew_i, k, mv′_i)
        end foreach
    end if
end procedure

use the change parameter cp in [3], because it is not normalized. Let γi = ni/(ni + hi); then update Ti as

    Ti = γi · Ci + (1 − γi) · Tu_i    (2)

Finally we compute the moving vector of the child layer, which is the sum of the change in this layer and the moving vector of the upper layer.
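The per-center update inside an inner node can be sketched as a small function (our own illustration of eq. (2) and the moving-vector bookkeeping, assuming Ci has already been matched to Ti by the Hungarian step):

```python
import numpy as np

def evolve_center(Ti, Ci, ni, hi, mv):
    """Inner-node update of the evolutionary vocabulary tree.
    Ti: stored center; Ci: matched training center of the new descriptors;
    ni/hi: new/history descriptor counts; mv: upper layer's moving vector.
    Returns the updated center and the moving vector for the children."""
    Tu = Ti + mv                              # center shifted with the upper layer
    gamma = ni / (ni + hi)
    Ti_new = gamma * Ci + (1 - gamma) * Tu    # eq. (2)
    mv_child = Ti_new - Tu + mv               # this layer's change + upper move
    return Ti_new, mv_child
```

The weight γi automatically favors history when few new descriptors arrive (ni ≪ hi) and favors the new data when a leaf is flooded with new descriptors.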

For computing the training centers C1, . . . , Ck, we use three manners: (1) divide, the training descriptors are partitioned into k groups using the updated layer centers Tu_1, . . . , Tu_k, and the mean vector of each group is taken as a training center; (2) no-initial, k-means is used to cluster the training descriptors into k clusters, and the cluster centers are taken as the training centers; (3) initial, different from the second manner, we initialize the cluster centers with Tu_1, . . . , Tu_k before running k-means.

2.3 Search in the vocabulary tree

For the visual objects clustering task based on text analysis methods, we need to determine the word-id of a descriptor. This is the searching process in the vocabulary tree, summarized in Algorithm 4.

3 The incremental visual clustering process

We describe the details of the incremental visual objects clustering process in this section, which is based on the probabilistic Latent Semantic Analysis (pLSA) model [8, 9] and its extension to an incremental form [4]. We describe the models using the terms 'images' and 'visual words' in place of the original terms 'documents' and 'words' used in the text literature.


Algorithm 4 Search the word-id of p in VT

function LEAF-Search(VT, p)
    get k child nodes of VT, T1, . . . , Tk
    Tc ← the closest to p in {T1, . . . , Tk}
    if Tc is a leaf node then
        return the word-id of Tc
    else
        return LEAF-Search(Tc, p)
    end if
end function
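The greedy descent of Algorithm 4 can be sketched in a few lines (our own illustration; we assume a node is a dict {'centers': (k, dim) array, 'children': list}, where an int in the children list marks a leaf and holds its word-id):

```python
import numpy as np

def leaf_search(node, p):
    """Algorithm 4 sketch: descend the tree, at each layer picking the child
    whose center is closest to descriptor p, and return the leaf's word-id."""
    while True:
        i = int(np.linalg.norm(node["centers"] - p, axis=1).argmin())
        child = node["children"][i]
        if isinstance(child, int):   # reached a leaf: the slot holds the word-id
            return child
        node = child
```

The cost is O(k · depth) distance computations per descriptor, which is the main advantage of the tree over a flat vocabulary of the same size.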

3.1 Training the initial clustering model

For the first batch of images B, we use the descriptors of B to train the initial vocabulary tree. For each descriptor in each image, we determine the visual word-id using the trained vocabulary tree. The batch of images is summarized in a co-occurrence matrix F, where f(w, d) stores the number of occurrences of visual word w in image d.

The initial clustering model is based on the probabilistic latent semantic analysis model [8, 9]. The pLSA is a generative model, which assumes that the visual words w are generated independently of the specific image d conditioned on the unobserved visual category variable z; that is, P(w, d|z) = P(w|z)P(d|z), P(w|z, d) = P(w|z), and P(d|z, w) = P(d|z). Let Z be the set of latent visual category variables; then the joint probability of the co-occurrence pair (w, d) is as follows:

    P(w, d) = P(d) · Σ_{z∈Z} P(w|z) P(z|d)    (3)

            = P(w) · Σ_{z∈Z} P(d|z) P(z|w)    (4)

w and d are symmetric variables in these two equations. The parameters of the pLSA model are estimated by the iterative EM algorithm, which uses the training image set B to maximize the log-likelihood function L:

    L = Σ_{d∈B} Σ_{w∈d} f(w, d) log P(w, d)    (5)

If (3) is used to partition the joint distribution, the conditional probability P(z|w, d) can be estimated in the E (Expectation) step:

    P(z|w, d) = P(w|z) P(z|d) / Σ_{z′∈Z} P(w|z′) P(z′|d)    (6)


In the M (Maximization) step, the probabilities P(w|z) and P(z|d) can be estimated,respectively, by the following:

    P(w|z) = Σ_{d∈B} f(w, d) P(z|w, d) / Σ_{d∈B} Σ_{w′∈d} f(w′, d) P(z|w′, d)    (7)

    P(z|d) = Σ_{w∈d} f(w, d) P(z|w, d) / Σ_{w∈d} f(w, d)    (8)
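A compact numpy sketch of the EM iteration of (6)–(8) on a dense co-occurrence matrix (our own illustrative implementation of standard pLSA, not the authors' code; the small epsilon guards against empty rows/columns):

```python
import numpy as np

def plsa_em(F, n_topics, n_iter=50, seed=0):
    """Batch pLSA EM for eqs. (6)-(8).
    F: (n_words, n_docs) count matrix f(w, d).
    Returns P(w|z) as (n_words, n_topics) and P(z|d) as (n_topics, n_docs)."""
    rng = np.random.default_rng(seed)
    W, D = F.shape
    Pw_z = rng.random((W, n_topics)); Pw_z /= Pw_z.sum(0)   # random, normalized
    Pz_d = rng.random((n_topics, D)); Pz_d /= Pz_d.sum(0)
    for _ in range(n_iter):
        # E step (6): P(z|w,d) ∝ P(w|z) P(z|d)
        Pz_wd = Pw_z[:, :, None] * Pz_d[None, :, :]          # (W, Z, D)
        Pz_wd /= Pz_wd.sum(1, keepdims=True) + 1e-12
        # M step (7): P(w|z) ∝ Σ_d f(w,d) P(z|w,d)
        Pw_z = (F[:, None, :] * Pz_wd).sum(2)
        Pw_z /= Pw_z.sum(0, keepdims=True) + 1e-12
        # M step (8): P(z|d) = Σ_w f(w,d) P(z|w,d) / Σ_w f(w,d)
        Pz_d = (F[:, None, :] * Pz_wd).sum(0) / (F.sum(0)[None, :] + 1e-12)
    return Pw_z, Pz_d
```

The dense (W, Z, D) posterior array is fine at illustration scale; a real vocabulary-tree-sized run would iterate over sparse nonzero entries of F instead.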

According to the symmetry of w and d, if using (4) to partition the joint distribution and following the same process as in [9], then in the E step:

    P(z|w, d) = P(d|z) P(z|w) / Σ_{z′∈Z} P(d|z′) P(z′|w)    (9)

and in the M step:

    P(d|z) = Σ_{w∈d} f(w, d) P(z|w, d) / Σ_{d′∈B} Σ_{w∈d′} f(w, d′) P(z|w, d′)    (10)

    P(z|w) = Σ_{d∈B} f(w, d) P(z|w, d) / Σ_{d∈B} f(w, d)    (11)

Equations (9) to (11) are the symmetric form of (6) to (8). When training the initial clustering model, the model parameters P(w|z) and P(z|d) (or, P(d|z) and P(z|w)) are initialized randomly and normalized; then these parameters and P(z|w, d) are iteratively refined by applying the EM procedure until they converge.

After training the initial model, for a new image q, we can estimate the new parameters P(z|q) and P(z|w, q) by applying the fold-in process, in which we only update P(z|q) and P(z|w, q) and fix P(w|z) in the EM procedure. This is the 'image fold-in' process. Also, because of the symmetry of w and d, the symmetric form of the fold-in process is, for a new visual word v, to fix P(d|z) and only update P(z|v) and P(z|v, d) in the EM procedure. This is the 'visual word fold-in' process. As in [20], an image d is considered to belong to the latent visual category z with the maximum P(z′|d), where z′ ∈ Z.
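The image fold-in step can be sketched as follows (our own minimal numpy illustration: the E step follows (6) with P(w|z) held fixed, and the update of P(z|q) follows (8)):

```python
import numpy as np

def fold_in_image(Pw_z, f_q, n_iter=30, seed=0):
    """'Image fold-in': estimate P(z|q) for a new image q with P(w|z) fixed.
    Pw_z: (n_words, n_topics) fixed word-topic matrix;
    f_q: (n_words,) counts of each visual word in q."""
    rng = np.random.default_rng(seed)
    Z = Pw_z.shape[1]
    Pz_q = rng.random(Z); Pz_q /= Pz_q.sum()      # random, normalized init
    for _ in range(n_iter):
        # E step, eq. (6) with d = q and P(w|z) fixed
        Pz_wq = Pw_z * Pz_q[None, :]              # (W, Z)
        Pz_wq /= Pz_wq.sum(1, keepdims=True) + 1e-12
        # M step, eq. (8): only P(z|q) is re-estimated
        Pz_q = (f_q[:, None] * Pz_wq).sum(0) / (f_q.sum() + 1e-12)
    return Pz_q
```

The visual word fold-in is the mirror image: fix P(d|z), iterate (9) and (11) for the new word only.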

3.2 Incremental training process of the clustering model

In the original pLSA model, the new visual words in the new images are completely ignored. The incremental pLSA can deal with the new visual words [4]. The following notations are used in the incremental training process: Bold denotes all previous batches of images, Bnew denotes the new batch of images, do denotes an image in Bold, dn denotes an image in Bnew, and d denotes an image in Bold ∪ Bnew. For each new batch of images Bnew, we first use the descriptors of Bnew to grow the previous vocabulary tree. Let wo represent an old visual word in the previous vocabulary tree, wn represent a new visual word that is in the grown but not in the previous vocabulary tree, and w represent any visual word in the grown vocabulary tree. Because we


encode the visual words with consecutive natural numbers, we can easily determine which words are new and which are old.

Similar to the first batch, we can use the grown vocabulary tree to summarize Bnew

into a co-occurrence matrix. In the previous process, before the new batch comes, we have obtained P(wo|z), P(z|do) and P(z|wo, do). The incremental training process uses two fold-in steps to initialize P(w|z) and P(z|d) to proper initial values, where w and d are defined above. This makes the training process converge faster compared to the randomly initialized parameters in the original pLSA model. The details are described as follows.

For each dn ∈ Bnew, we first use the image fold-in process to fold in dn with the fixed old parameters P(wo|z) using (6) and (8). In this process, we only consider the old visual words wo in dn. After the image fold-in process, we obtain P(z|wo, dn) and P(z|dn); then we use (10) to compute P(dn|z). For the sake of clarity, (10) is rewritten with the notations wo and dn:

    P(dn|z) = Σ_{wo∈dn} f(wo, dn) P(z|wo, dn) / Σ_{d′n∈Bnew} Σ_{wo∈d′n} f(wo, d′n) P(z|wo, d′n)

Secondly, we use the visual word fold-in process to fold in wn. In this process, P(z|wn) is initialized randomly and normalized. Then we fix P(dn|z) and use (9) and (11) to iteratively refine P(z|wn, dn) and P(z|wn) until they converge. After the visual word fold-in process, we obtain the converged parameters P(z|wn, dn). Then we can form P(z|w, d) from P(z|wo, do), P(z|wo, dn) and P(z|wn, dn), where w ∈ {wo} ∪ {wn} and d ∈ {do} ∪ {dn} are defined above; the entries P(z|wn, do) of P(z|w, d) are zero. Let f(w, d) represent the frequency of each visual word (over all visual words) in each image (over all images); f(wn, do) in f(w, d) is zero too. Then we use (7), with f(w, d) and P(z|w, d), to compute P(w|z).

Finally, we combine P(z|do) and P(z|dn) into P(z|d). We use P(z|d) and P(w|z)

as the initial parameter values to execute the EM procedure using (6) to (8). After the procedure converges, we get a new set of model parameters, P(z|d), P(w|z) and P(z|w, d). These parameters incorporate the new images and the new visual words, and they will be used in the incremental training for the next batch of images. We can determine the clustering result of the images from the new converged parameters P(z|d).

4 Experiments and results

We design a set of experiments to investigate three areas: (i) evaluating the growing vocabulary tree, including the stability of the visual words and the changes of the visual word count in the vocabulary tree; (ii) comparing the clustering performance of the fixed vs. dynamic visual vocabulary; (iii) comparing the clustering performance of batch vs. incremental training. We perform our experiments using an image dataset similar to Sivic et al. [20], in which they use four categories from the Caltech images. To better simulate a more dynamic visual clustering environment, we add three categories and use images from seven visual object categories of the Caltech image datasets: euphonium (64 images), cellphone (59 images), airplanes (800 images), face-easy (435 images), chess-board (120 images), leopards (200 images) and motorbike (798 images). We can make use of the big differences


of the image counts in the seven categories to simulate a more dynamic visual clustering environment, as follows.

4.1 Dynamic clustering environment

In our experiments, the above images are used to simulate a series of dynamic environments in which we partition the images into batches and perform the incremental clustering process described in Fig. 1. Four division strategies are used in our experiments: (1) uniform, the images of each category are partitioned evenly into batches; (2) random, all images are randomly rearranged and partitioned into batches of the same size; (3) same, each category corresponds to one batch, in the order mentioned above; and (4) non-uniform, we first sort the categories according to the image count in each category (in either ascending or descending order), then rearrange the images according to the order of the sorted categories, and finally partition the rearranged images into batches of the same size. In the same strategy, we choose the same count of images per batch (59 in our experiment, the smallest sample count among the seven categories), and in the other strategies we randomly choose 1/3 of the images in each batch to train or grow the vocabulary tree. In our experiment, the batch count is also seven. In the following narrative, we use 'uniform', 'random', 'same', 'ascend' and 'descend' to denote these division strategies. Each division strategy represents a kind of dynamic environment.
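The division strategies can be sketched as a single batching helper (our own illustrative code; the round-robin split for 'uniform' and the ceil-division chunking are our simplifications, not the authors' exact procedure):

```python
import random

def make_batches(images_by_cat, strategy, n_batches=7, seed=0):
    """Sketch of the division strategies of Section 4.1.
    images_by_cat: dict {category: list of image ids}, in the paper's order."""
    rng = random.Random(seed)
    if strategy == "uniform":
        # each category is spread evenly over the batches (round-robin)
        batches = [[] for _ in range(n_batches)]
        for imgs in images_by_cat.values():
            for j, img in enumerate(imgs):
                batches[j % n_batches].append(img)
        return batches
    if strategy == "same":
        # one category per batch, in the given category order
        return [list(imgs) for imgs in images_by_cat.values()]
    if strategy == "random":
        pool = [i for imgs in images_by_cat.values() for i in imgs]
        rng.shuffle(pool)
    elif strategy in ("ascend", "descend"):
        cats = sorted(images_by_cat, key=lambda c: len(images_by_cat[c]),
                      reverse=(strategy == "descend"))
        pool = [i for c in cats for i in images_by_cat[c]]
    else:
        raise ValueError(strategy)
    size = -(-len(pool) // n_batches)   # ceil division: equal-size chunks
    return [pool[i:i + size] for i in range(0, len(pool), size)]
```

For example, `make_batches(caltech_cats, "ascend")` would feed the tree the smallest categories first, the regime in which the vocabulary is observed to grow dramatically.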

4.2 Grow the vocabulary tree

The objective of the first set of experiments is to determine which of the growing manners described in Section 2 makes the vocabulary tree grow stably. We test the proposed growing manners (INCR, EVOL-divide, EVOL-initial and EVOL-no-initial) in the following aspects: (1) the changes of the visual word count, which affect the convergence time of the EM procedure in pLSA; (2) the stability of the visual words, which affects the clustering performance. In the growing process, two parameters, the growing factor (GF) and the depth of the tree (l), affect the final scale of the vocabulary. In this paper, we set GF = 2, which provides suitable growing ability while keeping the vocabulary from growing infinitely. The initial depth of the tree in Algorithm 1 is 3 (the reason is explained in Section 4.3). The subsequent depths of the sub-trees are adaptive, as in Algorithms 2 and 3.

Figure 2 shows that except for EVOL-no-initial, the other three growing manners tend to grow slowly under the more uniform distributions. All four manners grow quickly under the more non-uniform distributions. All of the evolutionary manners grow more gently than the incremental growing manner under non-uniform distributions, and EVOL-initial is the most gentle. Under the same division, the vocabulary tree grows dramatically at batch-5. According to our definition of the same division, the images in batch-5 are the whole set of chess-board images. This shows that the chess-board images contain a large number of descriptors unlike those in the previous batches.

In our implementation, these new descriptors are further quantized into visualwords in the vocabulary tree.
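Quantizing a descriptor through a vocabulary tree amounts to descending from the root, choosing the nearest child centroid at each level until a leaf (visual word) is reached. A minimal sketch, assuming each node stores its child nodes and centroid (the node layout and names are ours, not the paper's implementation):

```python
import numpy as np

class Node:
    def __init__(self, centroid, children=None, word_id=None):
        self.centroid = np.asarray(centroid, dtype=float)
        self.children = children or []   # empty list => leaf node
        self.word_id = word_id           # assigned at leaves only

def quantize(root, descriptor):
    """Descend the tree, picking the nearest child centroid at each level;
    the reached leaf's word_id is the visual word for this descriptor."""
    d = np.asarray(descriptor, dtype=float)
    node = root
    while node.children:
        node = min(node.children,
                   key=lambda ch: np.linalg.norm(ch.centroid - d))
    return node.word_id
```

With a branch factor k and depth l, each descriptor is quantized with only k·l distance computations instead of a search over all leaves.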

For the different growing manners, the more important property is visual word stability. After growing the vocabulary tree, the word-id of a local descriptor should remain


Fig. 2 Visual words count in the vocabulary tree. Panels (a) uniform, (b) random, (c) same, (d) ascend and (e) descend plot the number of visual words over batches 1-7 for the INCR, EVOL-divide, EVOL-initial and EVOL-no-initial growing manners

unchanged between the current and previous vocabulary trees. To describe our computation method for the stability, we first give the following notation. Let Vi denote the vocabulary tree grown using the images of batch-i, and let Wi be the word count in Vi; let Vlast be the vocabulary tree grown using the last batch of images (in our experiment, batch-7), and let Wlast be the word count in Vlast. Obviously, Wi ≤ Wlast. For an image b in batch i, let b represent the co-occurrence vector of b using Vi, and let b̂ represent the co-occurrence vector of b using Vlast. Let b̂f be the front Wi items and b̂r be the rear Wlast − Wi items of b̂. We use the difference between these two vectors, b and b̂, to measure the count of changed visual words in b, and then define the unchanged ratio (stability):

Rb = 1 − (‖b − b̂f‖1 + ‖b̂r‖1) / (2 · ‖b‖1)    (12)

where ‖·‖1 denotes the l1-norm. We compute Rb for each image from batch-1 to batch-6, and summarize the mean and standard deviation of Rb in Fig. 3. The results show that the INCR growing manner has a high mean and a small standard deviation.
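The unchanged ratio of Eq. 12 can be computed directly from the two co-occurrence vectors. A sketch with numpy, where the variable names follow the notation in the text:

```python
import numpy as np

def stability(b, b_hat):
    """Unchanged ratio (Eq. 12): b is the co-occurrence vector of an image
    under the batch-i tree (length W_i); b_hat is its vector under the
    final tree (length W_last >= W_i)."""
    b = np.asarray(b, dtype=float)
    b_hat = np.asarray(b_hat, dtype=float)
    w_i = len(b)
    b_hat_f, b_hat_r = b_hat[:w_i], b_hat[w_i:]   # front / rear parts
    diff = np.abs(b - b_hat_f).sum() + np.abs(b_hat_r).sum()
    return 1.0 - diff / (2.0 * np.abs(b).sum())
```

If no descriptor changes its word-id, b̂ equals b padded with zeros and Rb = 1; if every descriptor moves to a newly added word, Rb = 0.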

4.3 Visual object clustering

To assess our method on image clustering, we compared the following approaches:

1. Sivic et al. 05: This is the method in [20]. It first employs pLSA to learn latent topics and then uses the image fold-in process to cluster the following images.

2. batch retraining: In each batch of the dynamic clustering process, the batch retraining method uses the updated vocabulary tree to compute the co-occurrence


Fig. 3 Visual words stability. Panels (a) uniform, (b) random, (c) same, (d) ascend and (e) descend plot the stability of visual words (%) over batches 1-6 for the INCR, EVOL-divide, EVOL-initial and EVOL-no-initial growing manners

matrix of all images in the previous batches and the current batch. Then the EM procedure of the original pLSA is executed.

3. incremental training: This is the method described in this paper.

The first approach uses the fixed visual vocabulary generated from the first batch of images, and the other two use the growing visual vocabulary in the clustering process.

We set the number of latent topics in the pLSA model equal to the ground truth category number. For the Sivic et al. 05 method, we use the descriptors of the first batch to train a vocabulary tree with depth parameter l = 3 and branch parameter k = 10. This is because the usual visual word count is 1,000 to 2,000 in previous works [5, 12, 20], and the above setting ensures the initial visual word count is about 1,000 in our work. In our experiment, all trees trained for the Sivic et al. 05 method (five division strategies) include more than 2,000 visual words. Because the visual word count is an important factor in the clustering results, we start the growing-type vocabulary with a small initial word count so that its final word count does not exceed that of the fixed-type vocabulary. For the batch retraining and incremental training methods, we first train a vocabulary tree using the images of the first batch with parameters l = 3 and k = 10, and then grow the tree in the following batches. The batch retraining and incremental training approaches use the same grown vocabulary tree. In view of the stability of the incremental vocabulary tree, we use the INCR growing manner. The visual word count of the growing vocabulary is summarized in Fig. 2. In our experiment, all EM procedures use the same convergence condition. Finally, the clustering result and the ground truth label of each image are used to compute the confusion matrix for evaluating the clustering performance.

We summarize the average clustering performance in Fig. 4. It is obvious that the performance of the Sivic et al. 05 method with the fixed vocabulary declines dramatically


Fig. 4 Comparisons of the average clustering performance (%) of Sivic et al. 05, incremental training and batch retraining under the uniform, random, same, ascend and descend division strategies

in the same and non-uniform divisions. In Fig. 5, we give the details of the confusion matrices under the same division. We can see that in this situation the Sivic et al. 05 method performs badly. The main reason is that the fixed visual vocabulary tree, based on the ill-tuned training data, cannot adapt to the more dynamic environment. In the uniform and random strategies, Sivic et al. 05 performs better than the other two approaches using the growing vocabulary. This is mainly because of the different visual word counts in the two types of vocabulary. As mentioned above, the fixed vocabulary has more than 2,000 visual words. Due to the gentle growing in the 'uniform' and 'random' strategies, the final visual word counts in these two strategies are less than 1,100 (Fig. 2), which produces worse category performance than the fixed vocabulary.

Figure 4 also shows that the clustering performance of the incremental training method is usually better than that of the batch retraining method. This is because the incremental training method uses the previous model parameters to initialize the current model parameters near the previous local optimal solution. In contrast, the batch retraining method uses random values to initialize the current

Fig. 5 Comparisons of confusion matrices under the same division strategy. c1∼c7 represent the clustering classes. The clustering results of the Sivic et al. 05 method are not good because of the fixed visual vocabulary tree based on the ill-tuned training data


Fig. 6 Comparisons of the convergence processes of batch-7 under the ascend division manner: (a) batch retraining and (b) incremental training, plotting log-likelihood against iteration count. The converged log-likelihood of the batch retraining method (< −7.86 × 10^6) is still less than the initial log-likelihood of the incremental training method (> −7.826 × 10^6)

model parameters. Figure 6 shows that the batch retraining method converges to a local maximum that is even less than the initial value of the incremental training. Because the training model is based on maximum likelihood, the incremental method usually performs better than batch retraining.
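The initialization difference between the two methods can be sketched with a toy pLSA EM loop: batch retraining starts P(w|z) and P(z|d) at random, while incremental training copies the previous P(w|z) and appends only small random rows for the newly added words before renormalizing. This is a simplified illustration of the idea, not the paper's exact update; all names are ours:

```python
import numpy as np

def plsa_em(n_wd, p_w_z, p_z_d, n_iter=50):
    """Plain pLSA EM on a word-document count matrix n_wd (W x D),
    starting from the given P(w|z) (W x Z) and P(z|d) (Z x D)."""
    for _ in range(n_iter):
        # E-step: posterior P(z|w,d), computed via the joint and normalized.
        joint = p_w_z[:, None, :] * p_z_d.T[None, :, :]      # W x D x Z
        joint /= joint.sum(axis=2, keepdims=True) + 1e-12    # -> P(z|w,d)
        weighted = n_wd[:, :, None] * joint                  # n(w,d) P(z|w,d)
        # M-step: re-estimate P(w|z) and P(z|d).
        p_w_z = weighted.sum(axis=1)
        p_w_z /= p_w_z.sum(axis=0, keepdims=True)
        p_z_d = weighted.sum(axis=0).T
        p_z_d /= p_z_d.sum(axis=0, keepdims=True)
    return p_w_z, p_z_d

def warm_start(prev_p_w_z, n_new_words, rng):
    """Incremental init: keep rows of P(w|z) for old words, append small
    random rows for new words, then renormalize each topic column."""
    new_rows = 0.01 * rng.random((n_new_words, prev_p_w_z.shape[1]))
    p_w_z = np.vstack([prev_p_w_z, new_rows])
    return p_w_z / p_w_z.sum(axis=0, keepdims=True)
```

Because EM only climbs the likelihood locally, starting near the previous optimum both speeds up convergence and tends to reach a better local maximum than a random restart over the enlarged vocabulary.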

Figure 7 shows the convergence time of the three approaches under different image division strategies. From Fig. 7, we can see that the incremental training method converges more slowly than Sivic et al. 05 (fold-in), but faster than the batch retraining method. We also find that the convergence time of the incremental training method is adaptive. When the new batch of images is similar to the previous images, few new visual words are added to the vocabulary tree and the incremental training method converges very fast, as in the uniform and random strategies (see Fig. 2; the vocabulary tree grows very gently under these two division manners). When the new batch of images introduces more new visual words, the incremental training method needs more time to converge to a better local optimal solution (see Fig. 6).

Fig. 7 Comparisons of the convergence time (sec) of Sivic et al. 05, incremental training and batch retraining under different dynamic clustering environments (uniform, random, same, ascend and descend)


5 Conclusion

We have proposed a new approach to cluster visual objects incrementally under more dynamic environments. Our method extends previous generative topic models to handle both new images and new visual words. The experimental results show that using a dynamic, growing visual vocabulary is significant in visual clustering tasks: the clustering performance suffers greatly with a fixed vocabulary built from ill-tuned image categories, while with the growing vocabulary we achieve a large clustering performance improvement in our experiments. In addition, we compared three approaches in our experiments. The experimental results show that the proposed incremental method achieves the best clustering performance among the three approaches and has an adaptive convergence time. For future work, we will further adapt the current incremental visual category methods based on probabilistic generative models, such as LDA and the Dirichlet Process, to handle new visual words.

Acknowledgements This work was supported by the National High Technology Research and Development Program of China (No. 2008AA02Z310), the Shanghai Committee of Science and Technology (No. 08411951200, No. 08JG05002), 973 (2009CB320901) and NLPR (09-4-1).

References

1. Blei D, Ng A, Jordan M (2003) Latent dirichlet allocation. J Mach Learn Res 3:993–1022
2. Cai D, He X, Li Z, Ma WY, Wen JR (2004) Hierarchical clustering of WWW image search results using visual, textual and link information. In: ACM multimedia
3. Chakrabarti D, Kumar R, Tomkins A (2006) Evolutionary clustering. In: Proc. ACM SIGKDD
4. Chou TC, Chen MC (2008) Using incremental pLSA for threshold resilient online event analysis. IEEE Trans Knowl Data Eng 20:289–299
5. Fei-Fei L, Perona P (2005) A bayesian hierarchical model for learning natural scene categories. In: Proc. CVPR
6. Gao B, Liu TY, Qin T, Zheng X, Cheng QS, Ma WY (2005) Web image clustering by consistent utilization of visual features and surrounding texts. In: ACM multimedia
7. Grauman K, Darrell T (2005) The pyramid match kernel: discriminative classification with sets of image features. In: Proc. ICCV
8. Hofmann T (1999) Probabilistic latent semantic indexing. In: Proc. SIGIR
9. Hofmann T (2001) Unsupervised learning by probabilistic latent semantic analysis. Mach Learn 43:177–196
10. Lazebnik S, Schmid C, Ponce J (2006) Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: Proc. CVPR
11. Lepetit V, Fua P (2006) Keypoint recognition using randomized trees. In: PAMI, pp 1465–1479
12. Li L, Wang G, Fei-Fei L (2007) Optimol: automatic online picture collection via incremental model learning. In: Proc. CVPR
13. Lowe D (2004) Distinctive image features from scale-invariant keypoints. IJCV 60:91–110
14. Matas J, Chum O, Martin U, Pajdla T (2002) Robust wide baseline stereo from maximally stable extremal regions. In: Proc. BMVC, vol 1, pp 384–393
15. Mikolajczyk K, Schmid C (2004) Scale and affine invariant interest point detectors. IJCV 60:63–86
16. Mikolajczyk K, Schmid C (2005) A performance evaluation of local descriptors. PAMI 27:1615–1630
17. Moosmann F, Nowak E, Jurie F (2008) Randomized clustering forests for image classification. PAMI 9:1632–1646
18. Nistér D, Stewénius H (2006) Scalable recognition with a vocabulary tree. In: Proc. CVPR
19. Reddy KK, Liu J, Shah M (2009) Incremental action recognition using feature-tree. In: ICCV
20. Sivic J, Russell BC, Efros AA, Zisserman A, Freeman WT (2005) Discovering objects and their location in images. In: Proc. ICCV, pp 370–377
21. Slobodan I (2008) Object labeling for recognition using vocabulary trees. In: ICPR
22. Yeh T, Darrell T (2008) Dynamic visual category learning. In: CVPR
23. Yeh T, Lee J, Darrell T (2007) Adaptive vocabulary forests for dynamic indexing and category learning. In: Proc. ICCV
24. Zheng X, Cai D, He X, Ma WY, Lin X (2004) Locality preserving clustering for image database. In: ACM multimedia
25. http://www.robots.ox.ac.uk/∼vgg/research/affine/

Zhenyong Fu received the B.S. degree in Mathematics from the Kunming University of Science and Technology in 2002, and the M.E. degree in Computer Science from Fudan University in 2005. He is currently pursuing the Ph.D. degree in the Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China. His research interests include machine learning, computer vision and pattern recognition.

Hongtao Lu received his Ph.D. degree in Electronic Engineering from Southeast University, Nanjing, in 1997. After graduation he became a postdoctoral fellow in the Department of Computer Science, Fudan University, Shanghai, China, where he spent two years. In 1999, he joined the Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, where he is now a professor. His research interests include machine learning, computer vision and pattern recognition, and information hiding. He has published more than sixty papers in international journals, such as IEEE Transactions and Neural Networks, and in international conferences. His papers have received more than 400 citations from other researchers.


Wenbin Li received his Ph.D. degree from Shanghai Medical University, Shanghai, in 1995. In 2004, he joined the Department of Diagnostic and Interventional Radiology, Affiliated Sixth People's Hospital of Shanghai Jiao Tong University, Shanghai, where he is now a professor. His research interests include medical image processing, medical imaging examination, and pattern recognition. He has published more than fifty papers in international journals and conferences.