

Mining Visual Collocation Patterns via Self-Supervised Subspace Learning

Junsong Yuan, Member, IEEE, and Ying Wu, Senior Member, IEEE

Abstract—Traditional text data mining techniques are not directly applicable to image data, which contain spatial information and are characterized by high-dimensional visual features. It is not a trivial task to discover meaningful visual patterns from images, because the content variations and spatial dependency in visual data greatly challenge most existing data mining methods. This paper presents a novel approach to coping with these difficulties for mining visual collocation patterns. Specifically, the novelty of this work lies in the following new contributions: (1) a principled solution to the discovery of visual collocation patterns based on frequent itemset mining; and (2) a self-supervised subspace learning method that refines the visual codebook by feeding back the discovered patterns via subspace learning. The experimental results show that our method can discover semantically meaningful patterns efficiently and effectively.

Index Terms—image data mining, visual pattern discovery, visual collocation pattern.

I. INTRODUCTION

Motivated by the previous success in mining structured data (e.g., transaction data) and semi-structured data (e.g., text), it is natural to ask whether meaningful patterns can also be found in non-structured multimedia data such as images and videos [1] [2] [3] [4] [5]. For example, as illustrated in Fig. 1, once we extract invariant visual primitives such as interest points [6] or salient regions [7] from the images, we can represent each image as a collection of such visual primitives characterized by high-dimensional feature vectors. By further quantizing those visual primitives into discrete "visual items" (also known as "visual words") through clustering these high-dimensional features [1], each image is represented by a set of transaction records, where each transaction corresponds to a local image region and describes its composition of visual items. After that, data mining techniques can be applied to the transaction database induced from the images to discover visual collocation patterns.

Although the discovery of visual patterns from images appears to be quite exciting, data mining techniques that are successful on transaction and text data cannot simply be applied to image data, which contain high-dimensional features and have spatial structures. Unlike transaction and text data

Junsong Yuan is with the School of Electrical and Electronics Engineering, Nanyang Technological University, Singapore (email: [email protected]), and Ying Wu is with the Department of Electrical Engineering and Computer Science, Northwestern University, Evanston, IL, 60208 USA (email: [email protected]).

This work was supported in part by the Nanyang Assistant Professorship to Dr. Junsong Yuan, National Science Foundation grants IIS-0347877 and IIS-0916607, and the US Army Research Laboratory and the US Army Research Office under grant ARO W911NF-08-1-0504.

that are composed of discrete elements without much ambiguity (i.e., predefined items and vocabularies), visual patterns generally exhibit large variabilities in their visual appearances. The same visual pattern may look very different under different views, scales, and lighting conditions, not to mention partial occlusion. It is very difficult, if not impossible, to obtain invariant visual features that are insensitive to these variations such that they can uniquely characterize visual primitives. Therefore, although a discrete item codebook can be forcefully obtained by clustering high-dimensional visual features (e.g., by k-means clustering), such "visual items" tend to be much more ambiguous than items in transaction and text data. Thus the imperfect clustering of visual primitives brings large challenges when directly applying traditional data mining methods to image data. Specifically, the ambiguity lies in two aspects: synonymy and polysemy [8]. A synonymous visual item shares the same semantic meaning with other visual items; because the corresponding underlying semantics is split and represented by multiple visual items, synonymy leads to over-representation. On the other hand, a polysemous visual item may mean different things in different contexts; thus polysemy leads to under-representation. Both phenomena appear quite often when clustering visual primitives in an unsupervised way. The root of these phenomena is the large uncertainty of non-structured visual data in the high-dimensional space. Therefore, it is crucial to address these uncertainty issues. One possible way to resolve the ambiguity of polysemous visual words is to put them into a spatial context. In other words, the visual collocation (or co-occurrence) of several visual items is likely to be much less ambiguous. Therefore, it is of great interest to automatically discover these visual collocation patterns. Once such visual collocation patterns are discovered, they can help to learn a better representation for clustering visual primitives.

However, since visual patterns exhibit more complex structure than transaction and text patterns, the difficulty in representing and discovering spatial patterns in images prevents a straightforward generalization of traditional data mining methods that are applicable to transaction data. For example, unlike a traditional transaction database whose records are independent of each other, the transactions induced from image patches can be correlated due to spatial overlap. This phenomenon complicates the data mining process for spatial data, because simply counting the occurrence frequencies is questionable, and thus a frequent pattern is not necessarily a meaningful pattern. Special care therefore needs to be taken. Although there exist methods [9] [10] [11] for spatial collocation pattern discovery from geo-spatial data, they cannot be


Fig. 1. Illustration of visual collocation patterns in images. There are two kinds of imperfectness when translating image data into transaction data for data mining. First, visual primitives can be missed in the feature extraction process due to occlusion, bad lighting conditions, or unreliable detectors. Second, even if a visual primitive is extracted, it can be wrongly clustered into a visual item due to visual polysemy and synonymy. Directly mining patterns from the noisy transaction database cannot obtain satisfactory results.

directly applied to image data, which are characterized by high-dimensional features. Moreover, the spatial co-occurrences of the items do not necessarily indicate real associations among them, because a frequent spatial collocation pattern can be generated by self-repetitive texture in the image and thus may not be semantically meaningful. Thus, finding frequent patterns does not always yield meaningful and informative patterns in image data.

Given a collection of images, our objective of image data mining is to discover meaningful visual patterns that appear repetitively among the images. Compared with the background clutter, such visual patterns are of great interest and thus should be treated carefully when clustering visual items. For example, given a few face photos of different persons, can we discover visual collocations, such as eyes and noses, that characterize the face category? Moreover, once these visual patterns are discovered, can they help to learn a better feature representation via subspace learning?

To address these problems, this paper presents a novel bottom-up approach to discovering semantically meaningful visual patterns from images. As shown in Fig. 1, an image is represented by a collection of visual items after clustering visual primitives. To discover visual collocation patterns, a new data mining criterion is proposed. Instead of using the co-occurrence frequency as the criterion for mining meaningful collocation patterns in images, a more plausible visual collocation mining based on a likelihood ratio test is proposed to evaluate the significance of a visual itemset. Secondly, once the visual collocation patterns (foreground items) and background items are discovered, we can feed them back into the clustering procedure by learning a better subspace representation via metric learning, to distinguish the objects of interest from the cluttered background. The visual primitives can then be better clustered in the learned subspace. To this end, we propose a self-supervised subspace learning of visual items. By taking advantage of the discovered visual patterns, such a top-down refinement procedure helps to reduce the ambiguities among visual items and better distinguish foreground items from background items. Our experiments on three object categories from the Caltech-101 dataset demonstrate the effectiveness and efficiency of the proposed method.

The rest of the paper is organized as follows. We discuss the related work in Section II, followed by the overview of our approach in Section III. The discovery of visual collocation patterns is presented in Section IV. After that, we discuss how to refine the discovered patterns via metric learning in Section V. The experiments are presented in Section VI, followed by the conclusion in Section VII.

II. RELATED WORK

By characterizing an image as a collection of primitive visual features that highlight the local image invariants, and clustering these primitive features into discrete visual words, we can translate an image into a visual document. Such a "bag of words" model bridges text data mining and image data mining research, and has been extensively applied in image retrieval [12], recognition [13], as well as image data mining [1] [14] [15]. By treating images similarly to texts, previous text information retrieval and data mining methods can be applied to image data. For example, in [16], text-retrieval methods are applied to search visual objects in videos. Statistical natural language models, such as probabilistic Latent Semantic Analysis (pLSA) and Latent Dirichlet Allocation (LDA), are applied to discover object categories from images [17] [8].

Although representing images as visual documents brings many benefits, the induced visual vocabulary tends to be much more ambiguous than that of text. To learn a better visual vocabulary, [18] discusses the limitations of k-means clustering and proposes a new strategy to build the codebook. However, unsupervised learning of a good visual vocabulary is difficult. As a treatment to resolve the ambiguity of polysemous and synonymous visual words, spatial context information in images is taken into consideration. It has been noted that the co-occurrence of visual words, namely a composition of visual items, has a better representation power and is likely to be less ambiguous, as in visual phrases [19] [20], visual synsets [21], and dependent regions [22].

To discover visual patterns, some other methods consider the spatial configuration among the visual features when modeling the spatial image pattern. In [23], attributed relational graphs (ARG) are applied to model the spatial image pattern, and the EM algorithm is used for parameter learning. In [24], spatial relations are defined in terms of the distance and direction between pairs of detected parts. In [25], both geometry and appearance information of a generic visual object category are coded in a hierarchical generative model. A constellation model is used to model a visual object category in [26]. The proposed probabilistic model is able to handle all kinds of


variations of the objects, such as shape, appearance, occlusion, and relative scale. However, the structure of the model and the object parts need to be manually selected. In [27], a tree model is proposed to discover and segment visual objects that belong to the same category. Despite their powerful modeling ability, the training of these generative models is usually time-consuming. When using the EM algorithm for learning, it is easy to be trapped in local optima. In [28], a generative model, called the active basis model, is proposed to learn a deformable template to sketch the common object from images. The active basis consists of a small number of Gabor wavelet elements at selected locations and orientations.

Instead of using top-down generative models to discover visual patterns, there are also data-driven bottom-up approaches. To perform efficient image data mining, frequent itemset mining is applied in [29] [30] by translating each local image region into a transaction. In order to consider the spatial configuration among the visual items, in [31], semantics are represented as visual elements and geometric relationships between them, and an efficient mining method is proposed to find pairwise associations of visual elements that have consistent geometric relationships sufficiently often. In [32], an efficient image pattern mining approach is proposed to handle the scale, translation, and rotation variations when mining frequent spatial patterns in images. In [33], contextual visual words are proposed to improve image search and near-duplicate copy detection.

There is also related work in data mining, where discovering frequent patterns is of great interest. For example, frequent itemset mining (FIM) and its extensions [34] [35] [36] have been extensively studied. However, a highly frequent pattern may not be informative or interesting; thus a more important task is to extract informative and potentially interesting patterns from possibly noisy data. This can be done by mining meaningful patterns either through post-processing the FIM results or through new data mining criteria, including mining compressed patterns [37] [38] [39], approximate patterns [40] [41] [42], and pattern summarization [43] [44] [45]. These data mining techniques can discover meaningful frequent itemsets and represent them in a compact way.

III. OVERVIEW

A. Notations and basic concepts

Each image in the database is described by a set of visual primitives: $I = \{v_i = (\vec{f}_i, x_i, y_i)\}$, where $\vec{f}_i$ denotes the high-dimensional feature and $(x_i, y_i)$ denotes the spatial location of $v_i$ in the image. We treat these visual primitives as the atomic visual patterns. For each visual primitive $v_i \in I$, its local spatial neighbors form a group $G_i = \{v_i, v_{i_1}, v_{i_2}, \cdots, v_{i_K}\}$. For example, $G_i$ can be the spatial K-nearest neighbors (K-NN) or $\epsilon$-nearest neighbors ($\epsilon$-NN) of $v_i$ under the Euclidean distance. As illustrated in Fig. 2, the image database $D_I = \{I_t\}_{t=1}^{T}$ can generate a collection of such groups, where each group $G_i$ is associated with a visual primitive $v_i$. We want to mention that two spatially neighboring groups may share some visual primitives due to their spatial overlap.

Fig. 2. Illustration of the visual groups and the discovery of visual collocations. Each circle corresponds to a spatial group (namely a transaction), which is composed of 5-NN visual items. An image can generate a collection of such groups for data mining. A and B are discovered visual collocation patterns.

By further quantizing all the high-dimensional features $\vec{f}_i \in D_I$ into $M$ classes through k-means clustering, a codebook of visual primitives $\Omega$ can be obtained. We call every prototype $W_k$ in the codebook $\Omega = \{W_1, ..., W_M\}$ a visual item. Because each visual primitive is uniquely assigned to one of the visual items $W_i$, the group $G_i$ can be transformed into a transaction $T_i$. More formally, given the group dataset $G = \{G_i\}_{i=1}^{N}$ generated from $D_I$ and the visual item codebook $\Omega$ ($|\Omega| = M$), we can induce a transaction database $\mathbf{T} = \{T_i\}$. Such an induced transaction database is essentially based on the centric reference feature model for mining association rules [10]. Given the visual item codebook $\Omega$, a subset $\mathcal{P} \subset \Omega$ is called a visual itemset (itemset for short). For a given itemset $\mathcal{P}$, a transaction $T_i$ which includes $\mathcal{P}$ is called an occurrence of $\mathcal{P}$, i.e., $T_i$ is an occurrence of $\mathcal{P}$ if $\mathcal{P} \subseteq T_i$. Let $\mathbf{T}(\mathcal{P})$ denote the set of all the occurrences of $\mathcal{P}$ in $\mathbf{T}$; the frequency of $\mathcal{P}$ is the number of its occurrences:

$$frq(\mathcal{P}) = |\mathbf{T}(\mathcal{P})| = |\{i : \forall j \in \mathcal{P},\ t_{ij} = 1\}|, \quad (1)$$

where $t_{ij} = 1$ denotes that the $j$th item appears in the $i$th transaction, and $t_{ij} = 0$ otherwise.

For a given threshold $\theta$, called a minimum support, an itemset $\mathcal{P}$ is frequent if $frq(\mathcal{P}) > \theta$. It is not a trivial task to discover all the frequent itemsets given the dataset $\mathbf{T}$, because the number of possible itemsets is exponentially large with respect to the codebook size. For example, the codebook $\Omega$ has in total $2^{|\Omega|}$ candidates for frequent itemsets; therefore an exhaustive check is infeasible for large codebooks. Also, if an itemset $\mathcal{P}$ appears frequently, then all of its subsets $\mathcal{P}' \subset \mathcal{P}$ will also appear frequently, i.e., $frq(\mathcal{P}) > \theta \Rightarrow frq(\mathcal{P}') > \theta$. For example, a frequent itemset $\mathcal{P}$ composed of $n$ items can generate $2^n$ frequent sub-itemsets, including itself and the null itemset. To eliminate this redundant representation, the closed frequent itemset is introduced [46]. Because closed frequent itemsets losslessly summarize all frequent itemsets, no visual collocations will be left out. The closed frequent itemset is defined as follows.


Fig. 3. Overview of the proposed method for mining visual collocation patterns. We propose a hierarchical and self-supervised visual pattern discovery method that handles the imperfectness of the visual vocabulary and can reveal the hierarchical structure of visual patterns.

Definition 1: closed frequent itemset. If for an itemset $\mathcal{P}$ there is no other itemset $\mathcal{Q} \supseteq \mathcal{P}$ that satisfies $\mathbf{T}(\mathcal{P}) = \mathbf{T}(\mathcal{Q})$, we say $\mathcal{P}$ is closed. Note that for any itemsets $\mathcal{P}$ and $\mathcal{Q}$, $\mathbf{T}(\mathcal{P} \cup \mathcal{Q}) = \mathbf{T}(\mathcal{P}) \cap \mathbf{T}(\mathcal{Q})$, and if $\mathcal{P} \subseteq \mathcal{Q}$ then $\mathbf{T}(\mathcal{Q}) \subseteq \mathbf{T}(\mathcal{P})$.

To find frequent itemsets, we apply the FP-growth algorithm to discover closed frequent itemsets [47]. The number of closed frequent itemsets is much smaller than that of frequent itemsets, and they compress the information of frequent itemsets in a lossless form, i.e., the full list of frequent itemsets $F = \{\mathcal{P}_i\}$ and their corresponding frequency counts can be exactly recovered from the compressed representation of closed frequent itemsets. As the FP-tree has a prefix-tree structure and stores compressed information about frequent itemsets, FP-growth can efficiently discover all the closed frequent itemsets from the transaction dataset $\mathbf{T}$.
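To make the transaction view concrete, the following is a minimal brute-force sketch of closed frequent itemset mining (the paper itself uses FP-growth [47] for efficiency); it assumes small codebooks and transactions encoded as Python sets of item ids, and the function name is illustrative.

```python
def mine_closed_frequent_itemsets(transactions, min_support, max_order=4):
    """Brute-force closed frequent itemset mining.

    transactions: list of sets of item ids (one set per spatial group).
    min_support:  minimum number of occurrences (theta in the paper).
    Returns a dict mapping each closed frequent itemset (frozenset) to frq(P).
    """
    items = sorted(set.union(*transactions)) if transactions else []
    frequent = {}
    current = [frozenset([w]) for w in items]
    order = 1
    while current and order <= max_order:
        next_level = set()
        for cand in current:
            frq = sum(1 for t in transactions if cand <= t)
            if frq >= min_support:
                frequent[cand] = frq
                # extend by one item to build the next candidate level
                for w in items:
                    if w not in cand:
                        next_level.add(cand | {w})
        current = list(next_level)
        order += 1

    # keep only closed itemsets: no strict superset with the same frequency
    closed = {}
    for p, frq in frequent.items():
        if not any(p < q and frequent[q] == frq for q in frequent):
            closed[p] = frq
    return closed

# Toy usage: items are codebook indices of the visual items in each spatial group.
T = [{1, 2, 3}, {1, 2}, {1, 2, 4}, {3, 4}]
print(mine_closed_frequent_itemsets(T, min_support=2))   # {1,2}:3, {3}:2, {4}:2
```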

B. Overview of our method

We present the overview of our visual pattern discovery method in Fig. 3. Given a collection of images, we detect the local interest features and then cluster them into a group of visual items. The spatial dependencies among these visual items are discovered using the proposed data mining method. Once these spatial collocation patterns are discovered, they can guide the subspace learning in finding a better feature space for visual item clustering. Finally, the discovered visual items are further grouped to recover the visual patterns.

In Section IV, we present our new criteria for discovering visual collocation patterns $\mathcal{P}_i \subset \Omega$. After that, in Section V, we feed back the discovered visual collocations $\Psi$ to refine the data mining via metric learning. The experiments are presented in Section VI and we conclude in Section VII.

IV. DISCOVERING VISUAL COLLOCATION PATTERNS

A. Visual Primitive Extraction

We apply PCA-SIFT points [48] as the visual primitives. Such visual primitives are mostly located in informative image regions such as corners and edges, and the features are invariant under rotations, scale changes, and slight viewpoint changes. Normally each image may contain hundreds to thousands of such visual primitives, depending on the size of the image. According to [48], each visual primitive is a 41 × 41 gradient image patch at the given scale, rotated to align its dominant orientation to a canonical direction. Principal component analysis (PCA) is applied to reduce the dimensionality of the feature. Finally, each visual primitive is described by a 35-dimensional feature vector $\vec{f}_i$. These visual primitives are initially clustered into visual items through k-means clustering, using the Euclidean metric in the feature space. We will discuss how to obtain a better visual item codebook $\Omega$ based on the proposed self-supervised metric learning scheme in Sec. V.
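As an illustration of this preprocessing pipeline, the sketch below builds a codebook and the induced K-NN transactions using ordinary SIFT descriptors plus PCA as a stand-in for PCA-SIFT, and scikit-learn for clustering. The defaults (M = 160, K = 5, 35 dimensions) mirror settings reported later in the experiments, but the helper names and exact pipeline are ours, not the paper's.

```python
import cv2
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

def extract_primitives(image_paths):
    """Detect local interest points; return (descriptors, locations, image ids)."""
    sift = cv2.SIFT_create()              # stand-in for the PCA-SIFT detector
    descs, locs, img_ids = [], [], []
    for t, path in enumerate(image_paths):
        gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        kps, des = sift.detectAndCompute(gray, None)
        if des is None:
            continue
        descs.append(des)
        locs.append(np.array([kp.pt for kp in kps]))
        img_ids.append(np.full(len(kps), t))
    return np.vstack(descs), np.vstack(locs), np.concatenate(img_ids)

def build_codebook_and_transactions(descs, locs, img_ids, M=160, K=5, dim=35):
    """Quantize primitives into M visual items and induce K-NN transactions."""
    feats = PCA(n_components=dim).fit_transform(descs.astype(np.float64))
    labels = KMeans(n_clusters=M, n_init=10, random_state=0).fit_predict(feats)
    transactions = []
    for t in np.unique(img_ids):
        idx = np.where(img_ids == t)[0]
        if len(idx) <= K:
            continue
        nn = NearestNeighbors(n_neighbors=K + 1).fit(locs[idx])
        _, nbrs = nn.kneighbors(locs[idx])   # each row: a primitive and its K spatial NNs
        for row in nbrs:
            transactions.append(set(labels[idx[row]]))
    return labels, transactions
```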

B. Finding Meaningful Visual Collocations

Given an image dataset $D_I$ and its induced transaction database $\mathbf{T}$, the task is to discover the visual collocation patterns $\mathcal{P} \subset \Omega$ ($|\mathcal{P}| \geq 2$). Each visual collocation is composed of a collection of visual items that occur together spatially. To evaluate a candidate $\mathcal{P} \subseteq \Omega$, simply checking its frequency $frq(\mathcal{P})$ in $\mathbf{T}$ is far from sufficient. For example, even if an itemset appears frequently, it is not clear whether such co-occurrences among the items are statistically significant or just happen by chance. In order to evaluate the statistical significance of a frequent itemset $\mathcal{P}$, we propose a new likelihood ratio test criterion: we compare the likelihood that $\mathcal{P}$ is generated by a meaningful pattern against the likelihood that $\mathcal{P}$ is randomly generated, i.e., by chance.

More formally, we perform the likelihood ratio test to measure a visual collocation based on the following two hypotheses:

H0: occurrences of $\mathcal{P}$ are randomly generated;
H1: occurrences of $\mathcal{P}$ are generated by the hidden pattern.

Given a transaction database $\mathbf{T}$, the likelihood ratio $L(\mathcal{P})$ of a visual collocation $\mathcal{P} = \{W_i\}_{i=1}^{|\mathcal{P}|}$ can be calculated as:

$$L(\mathcal{P}) = \frac{P(\mathcal{P}|H_1)}{P(\mathcal{P}|H_0)} = \frac{\sum_{i=1}^{N} P(\mathcal{P}|T_i, H_1)\, P(T_i|H_1)}{\prod_{i=1}^{|\mathcal{P}|} P(W_i|H_0)}. \quad (2)$$

Here $P(T_i|H_1) = \frac{1}{N}$ is the prior, and $P(\mathcal{P}|T_i, H_1)$ is the likelihood that $\mathcal{P}$ is generated by a hidden pattern and is observed at a particular transaction $T_i$. Therefore $P(\mathcal{P}|T_i, H_1) = 1$ if $\mathcal{P} \subseteq T_i$, and $P(\mathcal{P}|T_i, H_1) = 0$ otherwise. Consequently, based on Eq. 1, we can calculate $P(\mathcal{P}|H_1) = \frac{frq(\mathcal{P})}{N}$. We also assume that the items $W_i \in \mathcal{P}$ are conditionally independent


under the null hypothesis $H_0$, and $P(W_i|H_0)$ is the prior of item $W_i \in \Omega$, i.e., it is estimated from the fraction of visual primitives that are labeled with $W_i$ in the image database $D_I$. We thus refer to $L(\mathcal{P})$ as the "significance" score that evaluates the deviation of a visual collocation pattern $\mathcal{P}$ from the null hypothesis. If $\mathcal{P}$ is a second-order itemset, then $L(\mathcal{P})$ degenerates to the pointwise mutual information criterion, e.g., the lift criterion [46].
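A small sketch of Eq. 2 may help make the criterion concrete. It assumes a uniform prior over transactions, and estimates item priors from transaction frequencies rather than from raw primitive counts; the function operates on set-based transactions and is illustrative only.

```python
def likelihood_ratio(pattern, transactions):
    """Compute L(P) from Eq. 2 for an itemset `pattern` (a set of item ids).

    P(P|H1) = frq(P) / N with a uniform prior P(T_i|H1) = 1/N.
    P(W|H0) is estimated as the fraction of transactions containing item W.
    """
    N = len(transactions)
    frq = sum(1 for t in transactions if pattern <= t)
    p_h1 = frq / N
    p_h0 = 1.0
    for w in pattern:
        p_h0 *= sum(1 for t in transactions if w in t) / N
    return p_h1 / p_h0 if p_h0 > 0 else float("inf")

# Example: L({A, B}) on a toy transaction database.
T = [{"A", "B", "C"}, {"A", "B"}, {"C", "D"}, {"B", "D"}]
print(likelihood_ratio({"A", "B"}, T))   # about 1.33; values > 1 indicate positive association
```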

It is worth noting that $L(\mathcal{P})$ may favor high-order collocations even though they appear less frequently. Table I presents such an example, where 90 transactions contain only items A and B; 30 transactions contain A, B, and C; 61 transactions contain D and E; and 19 transactions contain C and E.

TABLE I
TRANSACTION DATABASE T1.

transaction   number   L(P)
AB            90       1.67
ABC           30       1.70
DE            61       2.5
CE            19       0.97

From Table I, it is easy to evaluate the significance scores for $\mathcal{P}_1 = \{A, B\}$ and $\mathcal{P}_2 = \{A, B, C\}$: $L(\mathcal{P}_1) = 1.67$ and $L(\mathcal{P}_2) = 1.70 > L(\mathcal{P}_1)$. This result indicates that $\mathcal{P}_2$ is a more significant pattern than $\mathcal{P}_1$, which is counter-intuitive because $\mathcal{P}_2$ is not a cohesive pattern. For example, the other two sub-patterns of $\mathcal{P}_2$, $\mathcal{P}_3 = \{A, C\}$ and $\mathcal{P}_4 = \{B, C\}$, contain almost independent items: $L(\mathcal{P}_3) = L(\mathcal{P}_4) = 1.02$. Actually, $\mathcal{P}_2$ should be treated as a variation of $\mathcal{P}_1$, as C is more likely to be noise. The following equation explains what causes the incorrect result. We calculate the significance score of $\mathcal{P}_2$ as:

$$L(\mathcal{P}_2) = \frac{P(A, B, C)}{P(A)P(B)P(C)} = L(\mathcal{P}_1) \times \frac{P(C|A, B)}{P(C)}. \quad (3)$$

Therefore, when there is a small disturbance in the distribution of C over T1 such that $P(C|A, B) > P(C)$, $\mathcal{P}_2$ will compete with $\mathcal{P}_1$ even though $\mathcal{P}_2$ is not a cohesive pattern (e.g., C is not related to either A or B). To avoid such free-riders as C for $\mathcal{P}_1$, we perform a stricter test on the itemset. For a high-order $\mathcal{P}$ ($|\mathcal{P}| > 2$), we perform a t-test on each pair of its items to check whether items $W_i$ and $W_j$ ($W_i, W_j \in \mathcal{P}$) are really dependent (see Appendix A for details). A high-order collocation $\mathcal{P}_i$ is meaningful only if all of its pairwise subsets pass the test: $\forall i, j \in \mathcal{P}$, $t(W_i, W_j) > \tau$, where $\tau$ is the confidence threshold for the t-test. This further reduces the redundancy among the discovered itemsets.

Finally, to ensure that a visual collocation $\mathcal{P}$ is meaningful, we also require it to appear relatively frequently in the database, i.e., $frq(\mathcal{P}) > \theta$, such that we can eliminate collocations that appear rarely but happen to exhibit strong spatial dependency among their items. With these three criteria, a visual collocation pattern is defined as follows.

Definition 2: Visual Collocation Pattern. An itemset $\mathcal{P} \subseteq \Omega$ is a $(\theta, \tau, \gamma)$-meaningful visual collocation if it is:

1) frequent: $frq(\mathcal{P}) > \theta$;
2) pairwise cohesive: $t(W_i, W_j) > \tau$, $\forall i, j \in \mathcal{P}$;
3) significant: $L(\mathcal{P}) > \gamma$.

C. Spatial Dependency among Induced Transactions

Suppose primitives $v_i$ and $v_j$ are spatial neighbors; their induced transactions $T_i$ and $T_j$ will have a large spatial overlap. Due to such spatial dependency among the transactions, simply calculating $frq(\mathcal{P})$ from Eq. 1 can cause an over-counting problem. Fig. 4 illustrates this phenomenon, where $frq(\mathcal{P})$ contains duplicate counts.


Fig. 4. Illustration of the frequency over-counting caused by the spatial overlap of transactions. The itemset {A, B} is counted twice, by T1 = {A, B, C, E} and T2 = {A, B, D, F}, although it has only one instance in the image: there is only one pair of A and B that co-occurs, such that d(A, B) < 2ε, with ε the radius of T1. In texture regions where visual primitives are densely sampled, such over-counting will largely exaggerate the number of repetitions of a texture pattern.

In order to address this transaction dependency problem, we apply a two-phase mining scheme. First, without considering the spatial overlaps, we perform closed FIM to obtain a candidate set of frequent itemsets. For these candidates $F = \{\mathcal{P}_i : frq(\mathcal{P}_i) > \theta\}$, we re-count the number of their real instances exhaustively through the original image database $D_I$, not allowing duplicate counts. This needs one more scan of the whole database. Without causing confusion, we denote by $\hat{frq}(\mathcal{P})$ the real occurrence number of $\mathcal{P}$ and use it to update $frq(\mathcal{P})$. Accordingly, we adjust the calculation of $P(\mathcal{P}|H_1) = \frac{\hat{frq}(\mathcal{P})}{\tilde{N}}$, where $\tilde{N} = N/K$ denotes the approximate number of independent transactions, with $K$ the average size of a transaction. In practice, as $\tilde{N}$ is hard to estimate, we rank the $\mathcal{P}_i$ according to their significance value $L(\mathcal{P})$ and perform top-K pattern mining.
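One plausible reading of the duplicate-free re-counting step is the greedy scheme sketched below, which counts an occurrence only if its supporting primitives have not been used by a previously counted occurrence. The exact bookkeeping in the paper may differ, and the per-transaction primitive index sets are assumed inputs.

```python
def count_real_instances(pattern, transactions, primitive_ids):
    """Greedy duplicate-free count of a pattern's instances (frq_hat in the paper).

    pattern:        set of visual item ids.
    transactions:   list of sets of item ids (one per spatial group).
    primitive_ids:  list of sets of primitive indices covered by each transaction,
                    aligned with `transactions`.
    An occurrence is counted only if the primitives of its transaction have not
    already been used by a previously counted occurrence of the same pattern.
    """
    used = set()
    count = 0
    for items, prims in zip(transactions, primitive_ids):
        if pattern <= items and used.isdisjoint(prims):
            count += 1
            used |= set(prims)
    return count
```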

Integrating all the steps in this section, we present our algorithm for discovering meaningful visual collocations in Algorithm 1.

Algorithm 1: Visual Collocation Mining

input: transaction dataset $\mathbf{T}$, parameters $(\theta, \tau, \gamma)$
output: a collection of meaningful visual collocations $\Psi = \{\mathcal{P}_i\}$

1: Init: closed FIM with $frq(\mathcal{P}_i) > \theta$: $F = \{\mathcal{P}_i\}$, $\Psi \leftarrow \emptyset$
2: foreach $\mathcal{P}_i \in F$ do GetRealInstanceNumber($\mathcal{P}_i$)
3: for $\mathcal{P}_i \in F$ do
4:   if $L(\mathcal{P}_i) > \gamma$ and PassPairwiseTtest($\mathcal{P}_i$) then
5:     $\Psi \leftarrow \Psi \cup \{\mathcal{P}_i\}$
6: return $\Psi$
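For reference, a compact Python sketch of Algorithm 1 under simplifying assumptions is given below: item priors are estimated from transaction frequencies and plain frequency counts are used instead of the duplicate-free frq-hat; the helper names are illustrative, not the paper's code.

```python
import math
from itertools import combinations

def visual_collocation_mining(transactions, candidates, theta, tau, gamma):
    """Sketch of Algorithm 1 on set-based transactions.

    transactions: list of sets of visual item ids.
    candidates:   iterable of frozensets (e.g., closed frequent itemsets, |P| >= 2).
    Returns collocations that are frequent, pairwise cohesive, and significant.
    """
    N = len(transactions)
    item_prob = {}                       # P(W|H0), estimated from transactions

    def prob(items):
        return sum(1 for t in transactions if items <= t) / N

    def t_score(wi, wj):
        # Pairwise dependency test in the spirit of Appendix A.
        p_ij = prob({wi, wj})
        p_i, p_j = item_prob[wi], item_prob[wj]
        denom = math.sqrt(p_ij * (1 - p_ij) / N)
        if denom == 0:
            return float("inf") if p_ij > p_i * p_j else 0.0
        return (p_ij - p_i * p_j) / denom

    def likelihood_ratio(pattern):
        p_h0 = math.prod(item_prob[w] for w in pattern)
        return prob(pattern) / p_h0 if p_h0 > 0 else float("inf")

    psi = []
    for pattern in candidates:
        for w in pattern:
            item_prob.setdefault(w, prob({w}))
        frq = sum(1 for t in transactions if pattern <= t)
        if frq <= theta:
            continue                                    # criterion 1: frequent
        if any(t_score(a, b) <= tau for a, b in combinations(pattern, 2)):
            continue                                    # criterion 2: cohesive
        if likelihood_ratio(pattern) > gamma:           # criterion 3: significant
            psi.append(pattern)
    return psi
```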


V. SELF-SUPERVISED REFINEMENT OF VISUAL ITEM CODEBOOK

A. Foreground vs. Background

As discussed earlier, our image data mining method highly relies on the quality of the visual item codebook $\Omega$. A bad clustering of visual primitives brings large quantization errors when generating the transactions, and such quantization errors significantly affect the data mining results. Thus a good $\Omega$ is required. To improve the codebook construction, we propose to use the discovered collocation patterns to supervise the clustering process. Although no supervision is available initially, the unsupervised data mining process actually discovers useful information for supervision; we therefore call this self-supervised refinement.

We notice that each image contains two layers: a foreground object and background clutter. Given a collection of images containing the same category of objects, the foreground objects are similar, while the background clutter differs from image to image. The discovered visual collocations are supposed to be associated with the foreground object. As each image is composed of visual items, we can partition the codebook into two parts, $\Omega = \Omega^{+} \cup \Omega^{-}$, where items in $\Omega^{+}$ are more likely to appear in the foreground object, while items in $\Omega^{-}$ are more likely to appear in the background. This provides information to learn a better codebook.

By discovering a set of visual collocations $\Psi = \{\mathcal{P}_i\}$, we define the foreground item codebook as follows:

Definition 3: foreground codebook $\Omega^{+}$. Given a set of visual collocations $\Psi = \{\mathcal{P}_i\}$, an item $W_i \in \Omega$ is a foreground item if it belongs to any collocation pattern $\mathcal{P} \in \Psi$, namely, $\exists \mathcal{P} \in \Psi$ such that $W_i \in \mathcal{P}$. All of the foreground items compose the foreground codebook $\Omega^{+} = \bigcup_{i=1}^{|\Psi|} \mathcal{P}_i$.

With the foreground codebook, the background codebook becomes $\Omega^{-} = \Omega \setminus \Omega^{+}$. Each visual primitive belongs to either the foreground object (positive class) or the background clutter (negative class).

Our goal now is to use the data mining results to refine the codebooks $\Omega^{+}$ and $\Omega^{-}$, such that they can better distinguish the two classes. For the negative class, any visual primitive that belongs to $\Omega^{-}$ can be a negative training example. However, for the positive class $\Omega^{+}$, not all items in $\Omega^{+}$ are qualified to be positive samples; we only choose the instances of the visual collocations.

B. Learning a better metric for clustering

With these training examples obtained via data mining, we turn the unsupervised clustering problem into semi-supervised clustering to obtain a better codebook $\Omega$. Our task is to cluster all the visual primitives $v_i \in D_I$. Now some of the visual primitives are already labeled according to the discovered visual collocations and the background visual items. Thus we can use these labeled primitives to help cluster all of the visual primitives.

We apply neighborhood component analysis (NCA) [49] to improve the clustering results by learning a better Mahalanobis distance metric in the feature space. Similar to linear discriminant analysis (LDA), NCA targets learning a global linear projection matrix $A$. However, unlike LDA, NCA does not need to assume that each visual item class has a Gaussian distribution, and thus can be applied to more general cases. Given two visual primitives $v_i$ and $v_j$, NCA learns a new metric $A$, and the distance in the transformed space is:

$$d_A(v_i, v_j) = (\vec{f}_i - \vec{f}_j)^T A^T A (\vec{f}_i - \vec{f}_j) = (A\vec{f}_i - A\vec{f}_j)^T (A\vec{f}_i - A\vec{f}_j).$$

The objective of NCA is to maximize a stochastic variant of the leave-one-out K-NN score on the training set. In the transformed space, a point $v_i$ selects another point $v_j$ as its neighbor with probability:

$$p_{ij} = \frac{\exp(-\|A\vec{f}_i - A\vec{f}_j\|^2)}{\sum_{k \neq i} \exp(-\|A\vec{f}_i - A\vec{f}_k\|^2)}, \quad p_{ii} = 0. \quad (4)$$

Under the above stochastic selection rule of nearest neighbors, NCA tries to maximize the expected number of points correctly classified under the nearest neighbor classifier (the average leave-one-out performance):

$$f(A) = \sum_{i} \sum_{j \in C_i} p_{ij}, \quad (5)$$

where $C_i = \{j \,|\, c_i = c_j\}$ denotes the set of points in the same class as $i$. By differentiating $f$, the objective function can be maximized through gradient search for the optimal $A$. After obtaining the projection matrix $A$, we update all the visual features of $v_i \in D_I$ from $\vec{f}_i$ to $A\vec{f}_i$, and re-cluster the visual primitives based on their new features $A\vec{f}_i$.
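A concrete way to realize this refinement step is scikit-learn's NeighborhoodComponentsAnalysis, used below as a stand-in for the NCA formulation above; the way labels are constructed from the mined collocations is an assumption on our part, not the paper's code.

```python
import numpy as np
from sklearn.neighbors import NeighborhoodComponentsAnalysis
from sklearn.cluster import KMeans

def refine_codebook(features, train_idx, train_labels, n_items=160, n_components=35):
    """Learn a projection A with NCA from mined labels and re-cluster all primitives.

    features:     (n_primitives, d) array of visual primitive descriptors.
    train_idx:    indices of primitives labeled by the mining step (instances of
                  visual collocations and selected background items).
    train_labels: class labels aligned with train_idx (e.g., collocation id,
                  or -1 for background).
    Returns the new item assignment for every primitive.
    """
    nca = NeighborhoodComponentsAnalysis(n_components=n_components,
                                         max_iter=100, random_state=0)
    nca.fit(features[train_idx], train_labels)

    projected = nca.transform(features)          # A * f_i for all primitives
    new_items = KMeans(n_clusters=n_items, n_init=10,
                       random_state=0).fit_predict(projected)
    return new_items
```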

C. Clustering of Visual Collocations

The discovered visual collocations may not be complete patterns, and there is redundancy among them as well. Given a pattern $\mathcal{H} = \{A, B, C\}$, it is possible to obtain many incomplete visual collocations such as $\{A, B\}$, $\{A, C\}$, and $\{B, C\}$ due to image noise and quantization errors. Therefore we need to handle this problem.

If two visual collocations $\mathcal{P}_i$ and $\mathcal{P}_j$ are correlated, their transaction sets $\mathbf{T}(\mathcal{P}_i)$ and $\mathbf{T}(\mathcal{P}_j)$ (Eq. 1) should also have a large overlap [43], implying that they may be generated from the same pattern $\mathcal{H}$. As a result, $\forall i, j \in \Psi$, we can measure their similarity $s(i, j)$, which depends not only on their frequencies $\hat{frq}(\mathcal{P}_i)$ and $\hat{frq}(\mathcal{P}_j)$, but also on the correlation between their transaction sets $\mathbf{T}(\mathcal{P}_i)$ and $\mathbf{T}(\mathcal{P}_j)$. We apply the Jaccard distance to estimate $s(i, j)$ [50]:

$$s(i, j) = \exp\left(\frac{1}{1 - \frac{|\mathbf{T}(\mathcal{P}_i) \cap \mathbf{T}(\mathcal{P}_j)|}{|\mathbf{T}(\mathcal{P}_i) \cup \mathbf{T}(\mathcal{P}_j)|}}\right). \quad (6)$$

Given a collection of visual collocations $\Psi = \{\mathcal{P}_i\}$ and their pairwise similarities $s(i, j)$, we cluster them into $K$ classes using normalized cut [51]. Each class $\mathcal{H}_j = \{\mathcal{P}_i\}_{i=1}^{|\mathcal{H}_j|}$ is a group of visual collocations, called a visual part. Each $\mathcal{P}_i \in \mathcal{H}$ is a variation of $\mathcal{H}$, due to image noise or the quantization error of the codebook. By grouping visual collocations, we end up with a collection of visual parts:


$\mathbf{H} = \{\mathcal{H}_i\}$. These visual parts can be further composed to recover the whole visual object.
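The grouping step can be sketched as follows, using scikit-learn's SpectralClustering (a normalized-cut style spectral method) on a Jaccard-based affinity. To keep the affinity bounded, the sketch uses exp(-(1 - J)) rather than the exact form of Eq. 6, which is an assumption on our part.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_collocations(collocations, transactions, n_parts):
    """Group mined collocations into visual parts using a Jaccard-based affinity.

    collocations: list of frozensets of item ids (the mined patterns in Psi).
    transactions: list of sets of item ids.
    n_parts:      number of visual parts |H|.
    """
    # Occurrence sets T(P): indices of transactions that contain each pattern.
    occ = [frozenset(i for i, t in enumerate(transactions) if p <= t)
           for p in collocations]

    n = len(collocations)
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            union = len(occ[i] | occ[j])
            jacc = len(occ[i] & occ[j]) / union if union else 0.0
            # Bounded variant of the Eq. 6 affinity: larger overlap -> larger S.
            S[i, j] = np.exp(-(1.0 - jacc))

    labels = SpectralClustering(n_clusters=n_parts, affinity="precomputed",
                                random_state=0).fit_predict(S)
    return labels
```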

Algorithm 2: Main Algorithm

input: image dataset $D_I$; $\epsilon$ or $K$ for searching spatial $\epsilon$-NN or K-NN; parameters $(\theta, \tau, \gamma)$; number of semantic patterns $|\mathbf{H}|$; maximum number of iterations $l$
output: a set of semantic patterns $\mathbf{H} = \{\mathcal{H}_i\}$

1: Init: get the visual item codebook $\Omega^0$ and the induced transaction database $\mathbf{T}^0_{\Omega}$; $i \leftarrow 0$
2: while $i < l$ do
3:   $\Psi^i$ = MIM($\mathbf{T}^i_{\Omega}$)  /* visual collocation mining */
4:   $\Omega^{i+} = \cup_j \mathcal{P}_j$, where $\mathcal{P}_j \in \Psi^i$
5:   $A^i$ = NCA($\Omega^{i+}$, $\mathbf{T}^i_{\Omega}$)  /* get new metric */
6:   update $\Omega^i$ and $\mathbf{T}^i$ based on $A^i$  /* re-clustering */
7:   if little change of $\Omega^i$ then
8:     break
9:   $i \leftarrow i + 1$
10: $S$ = GetSimMatrix($\Psi^i$)
11: $\mathbf{H}$ = NCut($S$, $|\mathbf{H}|$)  /* pattern summarization */
12: return $\mathbf{H}$

VI. EXPERIMENTS

A. Dataset Description

Given a large image dataset $D_I = \{I_i\}$, we first extract the PCA-SIFT points [48] in each image $I_i$ and treat these interest points as the visual primitives. We resize all images by a factor of 2/3. The feature extraction takes on average 0.5 seconds per image. Multiple visual primitives can be located at the same spatial location, but with various scales and orientations. Each visual primitive is represented as a 35-d feature vector after principal component analysis. Then the k-means algorithm is used to cluster these visual features into a visual item codebook $\Omega$. We select three categories from the Caltech-101 database [26] for the experiments: faces (435 images of 23 persons), cars (123 images of different cars), and airplanes (800 images of different airplanes). We set the parameters for MIM as $\theta = \frac{1}{4}|D_I|$, where $|D_I|$ is the total number of images, and $\tau$ is associated with a confidence level of 0.90. Instead of setting the threshold $\gamma$, we select the top collocations by ranking their $L(\mathcal{P})$ values. We set the visual item codebook size $|\Omega| = 160$, 500, and 300 for the car, face, and airplane categories, respectively. For generating the transaction databases $\mathbf{T}$, we set $K = 5$ when searching spatial K-NN to compose each transaction. All the experiments were conducted on a Pentium-4 3.19 GHz PC with 1 GB RAM running Windows XP.

B. Evaluation of Visual Collocation Patterns

We use two metrics to evaluate the discovered visual collocations: (1) the precision of $\Psi$: $\rho^{+}$ denotes the percentage of discovered visual collocations $\mathcal{P}_i \in \Psi$ that are located in the foreground objects, and (2) the precision of $\Omega^{-}$: $\rho^{-}$ denotes the percentage of meaningless items $W_i \in \Omega^{-}$ that are located in the background. Fig. 5 illustrates the concepts of our evaluation. In the ideal situation, if $\rho^{+} = \rho^{-} = 1$, then every $\mathcal{P}_i \in \Psi$ is associated with the interesting object, i.e., located inside the object bounding box, while all meaningless items $W_i \in \Omega^{-}$ are located in the background. In such a case, we can precisely discriminate the frequently appearing foreground objects from the cluttered backgrounds through unsupervised learning. Finally, we use the retrieval rate $\eta$ to denote the percentage of retrieved images that contain at least one visual collocation. Since most airplanes in the airplane category appear against a clean background, its precision will be high because there are far fewer interest points located in the background. To make a fair evaluation, we thus only test the accuracy on the car and face categories, where the objects are always located in a cluttered background. We only use the airplane category for the discovery of high-level visual patterns.
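For completeness, the sketch below computes these scores from manually labeled bounding boxes; it counts per labeled primitive, which is one simple reading of the metrics, and the exact counting unit used in the paper may differ.

```python
def evaluate_collocations(colloc_points, background_points, boxes_per_image):
    """Compute the evaluation scores described above (rho+, rho-, eta).

    colloc_points:     dict image_id -> list of (x, y) locations of primitives
                       belonging to discovered collocations in that image.
    background_points: dict image_id -> list of (x, y) locations of primitives
                       labeled with background items.
    boxes_per_image:   dict image_id -> list of (x0, y0, x1, y1) object boxes.
    """
    def inside(pt, boxes):
        x, y = pt
        return any(x0 <= x <= x1 and y0 <= y <= y1 for x0, y0, x1, y1 in boxes)

    fg_hits = fg_total = bg_hits = bg_total = retrieved = 0
    for img, boxes in boxes_per_image.items():
        pts = colloc_points.get(img, [])
        fg_total += len(pts)
        fg_hits += sum(inside(p, boxes) for p in pts)
        if pts:
            retrieved += 1                      # image contains >= 1 collocation
        bg_pts = background_points.get(img, [])
        bg_total += len(bg_pts)
        bg_hits += sum(not inside(p, boxes) for p in bg_pts)

    rho_plus = fg_hits / fg_total if fg_total else 0.0
    rho_minus = bg_hits / bg_total if bg_total else 0.0
    eta = retrieved / len(boxes_per_image) if boxes_per_image else 0.0
    return rho_plus, rho_minus, eta
```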

Fig. 5. Evaluation of visual collocation mining. The highlighted bounding box (yellow) represents the foreground region where the interesting object is located. In the ideal case, all the collocations $\mathcal{P}_i \in \Psi$ should be located inside the bounding boxes, while all the meaningless items $W_i \in \Omega^{-}$ are located outside the bounding boxes.

In Table II, we present the visual collocations from the car database. The first row indicates the number of visual collocations ($|\Psi|$), selected by their $L(\mathcal{P})$. When selecting more visual collocations, the precision score of $\Psi$, $\rho^{+}$, decreases (from 1.00 to 0.86), while the percentage of retrieved images $\eta$ increases (from 0.11 to 0.88). The high precision $\rho^{+}$ indicates that the discovered visual collocations are associated with the foreground objects. It is also noted that the meaningful item codebook $\Omega^{+}$ is only a small subset of $\Omega$ ($|\Omega| = 160$). This implies that most visual items do not belong to the foreground objects; they are noisy items from the backgrounds.

TABLE II
PRECISION SCORE ρ+ AND RETRIEVAL RATE η FOR THE CAR DATABASE, CORRESPONDING TO VARIOUS SIZES OF Ψ. SEE TEXT FOR DESCRIPTIONS OF ρ+ AND η.

|Ψ|    1     5     10    15    20    25    30
|Ω+|   2     7     12    15    22    27    29
η      0.11  0.40  0.50  0.62  0.77  0.85  0.88
ρ+     1.00  0.96  0.96  0.91  0.88  0.86  0.86

We further compare three criteria for selecting visual collocations $\mathcal{P}$ into $\Psi$, against the baseline of selecting individual visual items $W_i \in \Omega$ to build $\Psi$. The three visual collocation selection criteria are: (1) occurrence frequency $\hat{frq}(\mathcal{P})$; (2) T-score (Eq. 7, selecting only second-order itemsets, $|\mathcal{P}| = 2$); and (3) likelihood ratio $L(\mathcal{P})$ (Eq. 2).


The comparison results are presented in Fig. 6, which shows how $\rho^{+}$ and $\rho^{-}$ vary with an increasing size of $\Psi$ ($|\Psi| = 1, ..., 30$). In general, the larger the size of $\Psi$, the lower the precision $\rho^{+}$, but the higher the precision $\rho^{-}$, because $\Omega^{-}$ becomes smaller and purer. Meanwhile, we notice that all three criteria perform significantly better than the baseline of choosing the most frequent individual items as meaningful patterns. This is not surprising, because frequent items $W_i \in \Omega$ correspond to common features (e.g., corners); therefore they appear in both foreground objects and cluttered backgrounds and have little discriminative ability.

Finally, among the three criteria, occurrence frequency $\hat{frq}(\mathcal{P})$ performs worse than the other two, which further demonstrates that not all frequent itemsets are meaningful patterns. It is also shown in Fig. 6 that when only a few visual collocations are selected, i.e., $\Psi$ has a small size, the three criteria yield similar performance. However, when more visual collocations are added, the proposed likelihood ratio test performs better than the other two, which shows that our MIM algorithm can discover meaningful visual patterns.

Fig. 6. Performance comparison ($\rho^{+}$ versus $\rho^{-}$) of three different criteria for selecting visual collocations (likelihood ratio, t-test, and co-occurrence frequency), together with the baseline of selecting the most frequent individual items to build $\Psi$.

By taking advantage of the FP-growth algorithm for closed FIM, our pattern discovery is very efficient. As presented in Table III, it takes around 17.4 seconds to discover visual collocations from the face database containing over 60,000 transactions. Compared with top-down models, such as topic models for visual pattern discovery, such a bottom-up pattern mining method is more efficient. Moreover, as we only select a very small subset of visual primitives for subspace learning, e.g., NCA, the computational cost for metric learning is acceptable.

TABLE III
CPU COMPUTATIONAL COST FOR VISUAL COLLOCATION MINING IN THE FACE DATABASE, WITH |Ψ| = 30.

# images |DI|   # transactions |T|   closed FIM [47]   MIM (Alg. 1)
435             62611                1.6 sec           17.4 sec

C. Refinement of visual item codebook

To implement NCA for metric learning, we select 5 visual collocations from $\Psi$ ($|\Psi| = 10$). In total, fewer than 10 items are shared by these 5 visual collocations for both the face and car categories, i.e., $|\Omega^{+}| < 10$. For the foreground class, we select all of the instances of the top five visual collocations as training samples. Considering the large number of background items in $\Omega^{-}$, we only select a small number of them that have a higher probability of being generated from the background. Specifically, from the visual primitives belonging to $\Omega^{-}$, we only select those uncommon visual primitives that cannot find many matches in the rest of the images. The number of negative training samples is chosen according to the number of positive training samples, to obtain balanced training examples.

After learning a new metric using NCA, the inter-class distance is enlarged while the intra-class distance is reduced among the training samples. We then reconstruct the visual item codebook $\Omega$ using k-means clustering, and perform the visual collocation discovery again. To compare the visual collocations obtained with the original codebook and with the refined one, Fig. 7 shows the results on both the car and face datasets. It can be seen that the precision $\rho^{+}$ of the visual collocations is improved with the refined codebook.

Fig. 7. Comparison of the visual item codebook before and after self-supervised refinement ($\rho^{+}$ versus retrieval rate $\eta$; curves: face and car categories, with and without refinement).

D. Clustering of visual collocations

For each object category, we select the top-10 visual collocations by their $L(\mathcal{P})$ (Eq. 2). All of the discovered visual collocations are second-order, third-order, or fourth-order itemsets. Each collocation pattern is a local composition of visual items; these items function together as a single visual lexical entity. By further clustering these top-10 visual collocations ($|\Psi| = 10$) into visual parts, we obtain the clustering results presented in Fig. 8 and Fig. 9 for the face and car categories, respectively. For the face category, we cluster the visual collocations into $|\mathbf{H}| = 6$ visual parts. Five of the six visual parts are semantically meaningful: (1) left eye, (2) between eyes, (3) right eye, (4) nose, and (5) mouth. All of the discovered visual parts have very high precision. It is interesting to note that the left eye and right eye are discovered separately, due to the differences in their visual appearance. The remaining visual part is not associated with the face; it corresponds to corners of computers and windows in the office environment. For the car category, we cluster the collocations into $|\mathbf{H}| = 2$ visual parts: (1) car wheels and (2) car bodies (mostly windows containing strong edges). For the airplane category, we cluster them into $|\mathbf{H}| = 3$ visual parts, of which two are semantically meaningful: (1) airplane heads and (2) airplane wings.


To evaluate the clustering of visual collocations, we apply precision and recall scores defined as follows: Recall = #detects / (#detects + #missed detections) and Precision = #detects / (#detects + #false alarms). For each visual part, the ground truth is manually labeled in the images. We evaluate both the car and face categories in Fig. 8 and Fig. 9. It can be seen that the discovered visual parts have high precision but a low recall rate. The high precision validates the quality of the discovered patterns. The missed detection of many visual parts is mainly caused by missed interest point detections and the quantization errors in the codebook.

E. From visual parts to visual objects

The discovered visual parts in Fig. 8 and Fig. 9 can be further composed into a high-level pattern that describes the whole object. We treat each part $\mathcal{H}_i$ as a high-level item and build another codebook $\Omega' = \mathbf{H} = \{\mathcal{H}_i\}$. Based on the new codebook $\Omega'$, each image is composed of a few high-level items $\mathcal{H}_i$, i.e., visual parts. Thus each image generates a single transaction for data mining. We then perform visual collocation mining on these transactions again, but with a much smaller codebook $\Omega'$. In general, our approach can be easily extended to a multi-level discovery of visual patterns. Let $l$ denote the level; a typical semantic pattern in layer $l$, $\mathcal{H}^{l}_i \subset \Omega^{l}$, is a composition of simpler subpatterns (items) from the lower level $\Omega^{l-1} = \{\mathcal{H}^{l-1}_1, \mathcal{H}^{l-1}_2, ..., \mathcal{H}^{l-1}_{M_{l-1}}\}$, which, in turn, are built from even simpler subpatterns $\Omega^{l-2} = \{\mathcal{H}^{l-2}_1, \mathcal{H}^{l-2}_2, ..., \mathcal{H}^{l-2}_{M_{l-2}}\}$. The most primitive subpatterns $\Omega^{1} = \{\mathcal{H}^{1}_1, \mathcal{H}^{1}_2, ..., \mathcal{H}^{1}_{M_1}\}$ are the primitive visual items. Such a multi-level pattern discovery helps to reveal the hierarchical structure of visual patterns and is computationally efficient.

To explain the procedure of detecting visual parts, Fig. 10, Fig. 12, and Fig. 14 show some exemplar results for the car, face, and airplane categories, respectively. Visual primitives are highlighted as green circles in each image, and the discovered visual collocations are highlighted by bounding boxes. The colors of the visual primitives inside a bounding box distinguish the different types of visual parts. Once the visual parts are obtained, Fig. 11, Fig. 13, and Fig. 15 show how these visual parts can be further composed to represent the whole object for the car, face, and airplane categories, respectively. Since a visual part corresponds to a group of similar visual collocations $\mathcal{H} = \{\mathcal{P}_j\}$, in each image we treat any occurrence of $\mathcal{P}_j \in \mathcal{H}$ as an occurrence of $\mathcal{H}$. For example, according to the visual parts in Fig. 8, there are 5 visual parts in the face category: $\Omega' = \{$left eye, right eye, between eyes, nose, mouth$\}$. For the car and airplane categories, there are only two visual parts each: $\Omega' = \{$car wheel, car body$\}$ for the car category and $\Omega' = \{$airplane head, airplane wing$\}$ for the airplane category, respectively. It is interesting to notice that although we do not enforce geometrical relationships among the visual parts, their geometrical configuration is consistent across different images.

VII. CONCLUSION

Text-based data mining techniques are not directly applicable to image data, which exhibit much larger variabilities and uncertainties. To leap from text data mining to image data mining, we present a systematic study on mining visual collocation patterns from images. A new criterion for discovering visual collocations, based on traditional FIM, is proposed. Such visual collocations are statistically more interesting than frequent itemsets. To obtain a better visual codebook, a self-supervised subspace learning method is proposed that uses the discovered visual collocations as supervision to learn a better similarity metric through subspace learning. By further clustering these visual collocations (incomplete sub-patterns), we successfully extract semantic visual patterns despite intra-pattern variations and cluttered backgrounds. As a pure data-driven bottom-up approach, our method does not depend on a top-down generative model for discovering visual patterns. It is computationally efficient and requires no prior knowledge of the visual pattern. Our future work will consider how to apply the discovered visual collocations to image search and categorization.

APPENDIX A
PAIRWISE DEPENDENCY TEST

If $W_i, W_j \in \Omega$ are independent, then the process of randomly generating the pair $\{W_i, W_j\}$ in a transaction $T_i$ is a (0/1) Bernoulli trial with probability $P(W_i, W_j) = P(W_i)P(W_j)$. According to the central limit theorem, as the number of trials (the transaction number $N$) becomes large, the Bernoulli distribution can be approximated by a Gaussian random variable $x$ with mean $\mu_x = P(W_i)P(W_j)$. At the same time, we can measure the average frequency of $\{W_i, W_j\}$ by counting its real instance number in $\mathbf{T}$, such that $P(W_i, W_j) = \hat{frq}(W_i, W_j)/N$. In order to verify whether the observation $P(W_i, W_j)$ is drawn from the Gaussian distribution $x$ with mean $\mu_x$, the following T-score is calculated, where $S^2$ is the variance estimated from the observed data:

$$t(W_i, W_j) = \frac{P(W_i, W_j) - \mu_x}{\sqrt{\frac{S^2}{N}}} = \frac{P(W_i, W_j) - P(W_i)P(W_j)}{\sqrt{\frac{P(W_i, W_j)(1 - P(W_i, W_j))}{N}}} \approx \frac{\hat{frq}(W_i, W_j) - \frac{1}{N}\hat{frq}(W_i)\hat{frq}(W_j)}{\sqrt{\hat{frq}(W_i, W_j)}}. \quad (7)$$
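A direct implementation of the final approximation on set-based transactions is sketched below; variable names are illustrative.

```python
import math

def t_score(wi, wj, transactions):
    """Pairwise dependency T-score of Appendix A on set-based transactions.

    Returns (frq(wi,wj) - frq(wi)*frq(wj)/N) / sqrt(frq(wi,wj)),
    i.e., the approximation given at the end of the appendix.
    """
    N = len(transactions)
    f_ij = sum(1 for t in transactions if wi in t and wj in t)
    if f_ij == 0:
        return 0.0
    f_i = sum(1 for t in transactions if wi in t)
    f_j = sum(1 for t in transactions if wj in t)
    return (f_ij - f_i * f_j / N) / math.sqrt(f_ij)

# Example: items that frequently co-occur receive a high score.
T = [{"A", "B"}, {"A", "B"}, {"A", "B"}, {"C"}]
print(t_score("A", "B", T))   # (3 - 9/4) / sqrt(3) = 0.433...
```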

REFERENCES

[1] J. Sivic and A. Zisserman, "Video data mining using configurations of viewpoint invariant regions," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 488–495, 2004.

[2] J. Yuan, Z. Li, Y. Fu, Y. Wu, and T. S. Huang, "Common spatial pattern discovery by efficient candidate pruning," in Proc. IEEE Conf. on Image Processing, 2007.

[3] D. Liu, G. Hua, and T. Chen, "A hierarchical visual model for video object summarization," TPAMI, 2010.

[4] G. Zhao and J. Yuan, "Mining and cropping common objects from images," in Proc. ACM Multimedia, pp. 975–978, 2010.

[5] G. Zhao, J. Yuan, J. Xu, and Y. Wu, "Discovering the thematic object in commercial videos," IEEE Multimedia Magazine, vol. 18, no. 3, pp. 56–65, 2011.

[6] D. Lowe, "Distinctive image features from scale-invariant keypoints," Intl. Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.


Fig. 8 panel summary: H1 left eyes (Prc. 100%, Rec. 26.7%); H2 between eyes (Prc. 97.6%, Rec. 17.8%); H3 right eyes (Prc. 99.3%, Rec. 32.6%); H4 noses (Prc. 96.3%, Rec. 24.1%); H5 mouths (Prc. 97.8%, Rec. 10.3%); H6 corners (Prc. N.A., Rec. N.A.).

Fig. 8. Face category: top-10 visual collocations Ψ (|Ψ| = 10) and their clustering results into six visual parts (|H| = 6). Each image patch shows a visual collocation Pi ∈ Ψ. The rectangles in the images are visual primitives (e.g., PCA-SIFT interest points at their scales). Every visual collocation, except for the 3rd one, is composed of 2 items; the 3rd visual collocation is a high-order one composed of 3 items. For each visual collocation, we show its precision and recall rates.

Fig. 9. Car category: top-10 visual collocations Ψ (|Ψ| = 10) and their clustering results into two visual parts (|H| = 2). Each image patch shows a visual collocation Pi ∈ Ψ. The rectangles in the images are visual primitives. Every visual collocation is composed of 2 items, except for the 5th itemset, which is composed of 3 items. For each visual collocation, we show its precision and recall rates. The per-part precision (Prc.) and recall (Rec.) values shown in the figure are:

H1 (wheels): Prc. 96.7%, Rec. 23.2%
H2 (car bodies): Prc. 67.1%, Rec. N.A.

Fig. 10. Car category: from visual primitives to visual parts. The first row shows the original images. The second row shows their visual primitives (PCA-SIFT points). Each green circle denotes a visual primitive with corresponding location, scale and orientation. It is possible that two visual primitives are located at the same position but with different scales. The third row shows the visual collocations. Each red rectangle shows a visual collocation. The colors of the visual primitives inside the red rectangle distinguish different types of visual parts. For example, wheels are red and car bodies are blue.

Fig. 11. Car category: from visual parts to visual object. The two visual parts are discovered in Fig. 9. Different colors show different visual parts: car wheels are red (H1) and car bodies are green (H2). Different types of cars are abstracted by a composition of car-wheel and car-body.

[7] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. V. Gool, "A comparison of affine region detectors," Intl. Journal of Computer Vision, vol. 65, no. 1-2, pp. 43–72, 2005.
[8] B. C. Russell, A. A. Efros, J. Sivic, W. T. Freeman, and A. Zisserman, "Using multiple segmentation to discover objects and their extent in image collections," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 1605–1614, 2006.

[9] Y. Huang, S. Shekhar, and H. Xiong, "Discovering colocation patterns from spatial data sets: a general approach," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 12, pp. 1472–1485, 2004.

Fig. 12. Face category: from visual primitives to visual parts. The first and fourth rows show the original images. The second and fifth rows show their visual primitives. Each green circle denotes a visual primitive with corresponding location, scale and orientation. The third and sixth rows show the visual collocations. Each red rectangle shows a visual collocation. The colors of the visual primitives inside the red rectangle distinguish different types of visual parts, e.g. green primitives are between eyes.

Fig. 13. Face category: from visual parts to visual object. The five visual parts are discovered in Fig. 8. Different colors show different visual parts. Different faces are abstracted by a composition of the visual parts.

[10] X. Zhang, N. Mamoulis, D. W. Cheung, and Y. Shou, "Fast mining of spatial collocations," in Proc. ACM SIGKDD, 2004.
[11] W. Hsu, J. Dai, and M. L. Lee, "Mining viewpoint patterns in image databases," in Proc. ACM SIGKDD, 2003.
[12] J. Sivic and A. Zisserman, "Video google: A text retrieval approach to object matching in videos," in Proc. IEEE Intl. Conf. on Computer Vision, pp. 1470–1477, 2003.

[13] G. Csurka, C. Dance, L. Fan, J. Williamowski, and C. Bray, "Visual categorization with bags of keypoints," in Proc. Workshop on European Conf. on Computer Vision, pp. 1–22, 2004.
[14] T. Tuytelaars, C. H. Lampert, M. B. Blaschko, and W. Buntine, "Unsupervised object discovery: A comparison," International Journal of Computer Vision, vol. 88, pp. 284–302, 2010.
[15] L. Cao and F.-F. Li, "Spatially coherent latent topic model for concurrent segmentation and classification of objects and scenes," in Proc. IEEE Intl. Conf. on Computer Vision, pp. 1–8, 2007.
[16] J. Sivic and A. Zisserman, "Efficient visual search for objects in videos," Proc. of the IEEE, vol. 96, no. 4, pp. 548–566, 2008.
[17] J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman, "Discovering objects and their location in images," in Proc. IEEE Intl. Conf. on Computer Vision, pp. 370–377, 2005.
[18] F. Jurie and B. Triggs, "Creating efficient codebooks for visual recognition," in Proc. IEEE Intl. Conf. on Computer Vision, pp. 604–610, 2005.
[19] J. Yuan, Y. Wu, and M. Yang, "Discovery of collocation patterns: from visual words to visual phrases," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 1–8, 2007.
[20] S. Zhang, Q. Tian, G. Hua, Q. Huang, and S. Li, "Descriptive visual words and visual phrases for image applications," in Proc. ACM Multimedia, 2009.
[21] Y.-T. Zheng, S.-Y. Neo, T.-S. Chua, and Q. Tian, "Visual synset: a higher-level visual representation for object-based image retrieval," The Visual Computer, vol. 25, no. 1, pp. 13–23, 2009.

[22] G. Wang, Y. Zhang, and L. Fei-Fei, "Using dependent regions for object categorization in a generative framework," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 1597–1604, 2006.

Fig. 14. Airplane category: from visual primitives to visual parts. The colors of the visual primitives inside the rectangle distinguish different types of visual parts.

Fig. 15. Airplane category: from visual parts to visual object. Two visual parts are discovered, highlighted by two different colors.

[23] P. Hong and T. S. Huang, "Spatial pattern discovery by learning a probabilistic parametric model from multiple attributed relational graphs," Discrete Applied Mathematics, pp. 113–135, 2004.

[24] S. Agarwal, A. Awan, and D. Roth, "Learning to detect objects in images via a sparse, part-based representation," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 26, no. 11, pp. 1475–1490, 2004.
[25] G. Bouchard and B. Triggs, "Hierarchical part-based visual object categorization," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 710–715, 2005.
[26] R. Fergus, P. Perona, and A. Zisserman, "Object class recognition by unsupervised scale-invariant learning," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 264–271, 2003.
[27] S. Todorovic and N. Ahuja, "Unsupervised category modeling, recognition, and segmentation in images," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 30, no. 12, pp. 2158–2174, 2008.
[28] Y. N. Wu, Z. Si, H. Gong, and S.-C. Zhu, "Learning active basis model for object detection and recognition," International Journal of Computer Vision, vol. 90, no. 2, pp. 198–235, 2010.
[29] T. Quack, V. Ferrari, B. Leibe, and L. V. Gool, "Efficient mining of frequent and distinctive feature configurations," in Proc. IEEE Intl. Conf. on Computer Vision, pp. 1–8, 2007.
[30] J. Yuan, Y. Wu, and M. Yang, "From frequent itemsets to semantically meaningful visual patterns," in Proc. ACM SIGKDD, pp. 864–873, 2007.
[31] J. Gao, Y. Hu, J. Liu, and R. Yang, "Unsupervised learning of high-order structural semantics from images," in Proc. IEEE Intl. Conf. on Computer Vision, pp. 2122–2129, 2009.
[32] S. Kim, X. Jin, and J. Han, "Sparclus: Spatial relationship pattern-based hierarchical clustering," in Proc. 2008 SIAM Int. Conf. on Data Mining, 2008.
[33] S. Zhang, Q. Huang, G. Hua, S. Jiang, W. Gao, and Q. Tian, "Building contextual visual vocabulary for large-scale image applications," in Proc. ACM Multimedia, 2010.

[34] J. Han, J. Pei, and Y. Yin, "Mining frequent patterns without candidate generation," in Proc. SIGMOD, 2000.
[35] R. Agrawal, T. Imielinski, and A. Swami, "Mining association rules between sets of items in large databases," in Proc. SIGMOD, 1993.
[36] J. Han, H. Cheng, D. Xin, and X. Yan, "Frequent pattern mining: current status and future directions," Data Mining and Knowledge Discovery, 2007.
[37] T. Calders and B. Goethals, "Depth-first non-derivable itemset mining," in Proc. SIAM International Conference on Data Mining, 2005.
[38] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal, "Discovering frequent closed itemsets for association rules," in Proc. ICDT, 1999.

[39] A. Siebes, J. Vreeken, and M. van Leeuwen, "Item sets that compress," in Proc. SIAM International Conference on Data Mining (SDM), 2006.
[40] C. Yang, U. Fayyad, and P. S. Bradley, "Efficient discovery of error-tolerant frequent itemsets in high dimensions," in Proc. ACM SIGKDD, 2001.
[41] F. Afrati, A. Gionis, and H. Mannila, "Approximating a collection of frequent sets," in Proc. ACM SIGKDD, 2004.
[42] J. Liu, S. Paulsen, X. Sun, W. Wang, A. Nobel, and J. Prins, "Mining approximate frequent itemsets in the presence of noise: Algorithm and analysis," in Proc. SIAM International Conference on Data Mining, 2006.
[43] X. Yan, H. Cheng, J. Han, and D. Xin, "Summarizing itemset patterns: a profile-based approach," in Proc. ACM SIGKDD, 2005.
[44] C. Wang and S. Parthasarathy, "Summarizing itemset patterns using probabilistic models," in Proc. ACM SIGKDD, 2006.
[45] D. Xin, H. Cheng, X. He, and J. Han, "Extracting redundancy-aware top-k patterns," in Proc. ACM SIGKDD, 2006.
[46] J. Han and M. Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2000.
[47] G. Grahne and J. Zhu, "Fast algorithms for frequent itemset mining using FP-trees," IEEE Transactions on Knowledge and Data Engineering, 2005.
[48] Y. Ke and R. Sukthankar, "PCA-SIFT: a more distinctive representation for local image descriptors," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 506–513, 2004.
[49] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov, "Neighborhood component analysis," in Proc. of Neural Information Processing Systems, 2004.
[50] Q. Mei, D. Xin, H. Cheng, J. Han, and C. Zhai, "Generating semantic annotations for frequent patterns with context analysis," in Proc. ACM SIGKDD, 2006.
[51] J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888–905, 2000.


Junsong Yuan (M'08) is currently a Nanyang Assistant Professor at Nanyang Technological University. He received his Ph.D. from Northwestern University, USA, and his M.Eng. from the National University of Singapore. Before that, he graduated from the special program for the gifted young at Huazhong University of Science and Technology, P.R. China. He was a research intern at Microsoft Research, Redmond, Kodak Research Laboratories, Rochester, and the Motorola Applied Research Center, Schaumburg, USA. His current research interests include computer vision, video analytics, multimedia search and mining, vision-based human computer interaction, biomedical image analysis, etc. He is the program director of Video Analytics in the Infocomm Center of Excellence at Nanyang Technological University.

Junsong Yuan received the Outstanding Ph.D. Thesis award from the EECS department at Northwestern University, and the Doctoral Spotlight Award from the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'09). He was also a recipient of the elite Nanyang Assistant Professorship in 2009. He has filed 3 US patents and is a member of the IEEE.



Ying Wu (SM'06) received the B.S. from Huazhong University of Science and Technology, Wuhan, China, in 1994, the M.S. from Tsinghua University, Beijing, China, in 1997, and the Ph.D. in electrical and computer engineering from the University of Illinois at Urbana-Champaign (UIUC), Urbana, Illinois, in 2001.

From 1997 to 2001, he was a research assistant at the Beckman Institute for Advanced Science and Technology at UIUC. During the summers of 1999 and 2000, he was a research intern with Microsoft Research, Redmond, Washington. In 2001, he joined the Department of Electrical and Computer Engineering at Northwestern University, Evanston, Illinois, as an assistant professor. He is currently an associate professor of Electrical Engineering and Computer Science at Northwestern University. His current research interests include computer vision, image and video analysis, pattern recognition, machine learning, multimedia data mining, and human-computer interaction. He serves as an associate editor for IEEE Transactions on Image Processing, SPIE Journal of Electronic Imaging, and IAPR Journal of Machine Vision and Applications. He received the Robert T. Chien Award at UIUC in 2001 and the NSF CAREER award in 2003. He is a senior member of the IEEE.