

Clustering with Multiviewpoint-Based Similarity Measure

Duc Thang Nguyen, Lihui Chen, Senior Member, IEEE, and Chee Keong Chan

Abstract—All clustering methods have to assume some cluster relationship among the data objects that they are applied on. Similarity between a pair of objects can be defined either explicitly or implicitly. In this paper, we introduce a novel multiviewpoint-based similarity measure and two related clustering methods. The major difference between a traditional dissimilarity/similarity measure and ours is that the former uses only a single viewpoint, which is the origin, while the latter utilizes many different viewpoints, which are objects assumed to not be in the same cluster with the two objects being measured. Using multiple viewpoints, more informative assessment of similarity could be achieved. Theoretical analysis and empirical study are conducted to support this claim. Two criterion functions for document clustering are proposed based on this new measure. We compare them with several well-known clustering algorithms that use other popular similarity measures on various document collections to verify the advantages of our proposal.

Index Terms—Document clustering, text mining, similarity measure.


1 INTRODUCTION

CLUSTERING is one of the most interesting and important topics in data mining. The aim of clustering is to find intrinsic structures in data, and organize them into meaningful subgroups for further study and analysis. There have been many clustering algorithms published every year. They can be proposed for very distinct research fields, and developed using totally different techniques and approaches. Nevertheless, according to a recent study [1], more than half a century after it was introduced, the simple algorithm k-means still remains one of the top 10 data mining algorithms nowadays. It is the most frequently used partitional clustering algorithm in practice. Another recent scientific discussion [2] states that k-means is the favorite algorithm that practitioners in the related fields choose to use. Needless to mention, k-means has more than a few basic drawbacks, such as sensitivity to initialization and to cluster size, and its performance can be worse than other state-of-the-art algorithms in many domains. In spite of that, its simplicity, understandability, and scalability are the reasons for its tremendous popularity. An algorithm with adequate performance and usability in most application scenarios could be preferable to one with better performance in some cases but limited usage due to high complexity. While offering reasonable results, k-means is fast and easy to combine with other methods in larger systems.

A common approach to the clustering problem is to treat it as an optimization process. An optimal partition is found by optimizing a particular function of similarity (or distance) among data. Basically, there is an implicit assumption that the true intrinsic structure of data could be correctly described by the similarity formula defined and embedded in the clustering criterion function. Hence, the effectiveness of clustering algorithms under this approach depends on the appropriateness of the similarity measure to the data at hand. For instance, the original k-means has a sum-of-squared-error objective function that uses euclidean distance. In a very sparse and high-dimensional domain like text documents, spherical k-means, which uses cosine similarity (CS) instead of euclidean distance as the measure, is deemed to be more suitable [3], [4].

In [5], Banerjee et al. showed that euclidean distance was indeed one particular form of a class of distance measures called Bregman divergences. They proposed the Bregman hard-clustering algorithm, in which any kind of the Bregman divergences could be applied. Kullback-Leibler divergence was a special case of Bregman divergences that was said to give good clustering results on document data sets. Kullback-Leibler divergence is a good example of a nonsymmetric measure. Also on the topic of capturing dissimilarity in data, Pekalska et al. [6] found that the discriminative power of some distance measures could increase when their noneuclidean and nonmetric attributes were increased. They concluded that noneuclidean and nonmetric measures could be informative for statistical learning of data. In [7], Pelillo even argued that the symmetry and nonnegativity assumption of similarity measures was actually a limitation of current state-of-the-art clustering approaches. Simultaneously, clustering still requires more robust dissimilarity or similarity measures; recent works such as [8] illustrate this need.

The work in this paper is motivated by investigations from the above and similar research findings. It appears to us that the nature of the similarity measure plays a very important role in the success or failure of a clustering method. Our first objective is to derive a novel method for measuring similarity between data objects in a sparse and high-dimensional domain, particularly text documents. From the proposed similarity measure, we then formulate new clustering criterion functions and introduce their respective clustering algorithms, which are fast and scalable like k-means, but are also capable of providing high-quality and consistent performance.

The authors are with the Division of Information Engineering, School of Electrical and Electronic Engineering, Nanyang Technological University, Block S1, Nanyang Avenue, Republic of Singapore 639798. E-mail: [email protected], {elhchen, eckchan}@ntu.edu.sg.

Manuscript received 16 July 2010; revised 25 Feb. 2011; accepted 10 Mar. 2011; published online 22 Mar. 2011. Recommended for acceptance by M. Ester. Digital Object Identifier no. 10.1109/TKDE.2011.86.

The remainder of this paper is organized as follows: In Section 2, we review related literature on similarity and clustering of documents. We then present our proposal for the document similarity measure in Section 3.1. It is followed by two criterion functions for document clustering and their optimization algorithms in Section 4. Extensive experiments on real-world benchmark data sets are presented and discussed in Sections 5 and 6. Finally, conclusions and potential future work are given in Section 7.

2 RELATED WORK

First of all, Table 1 summarizes the basic notations that will be used extensively throughout this paper to represent documents and related concepts. Each document in a corpus corresponds to an m-dimensional vector $d$, where $m$ is the total number of terms that the document corpus has. Document vectors are often subjected to some weighting schemes, such as the standard Term Frequency-Inverse Document Frequency (TF-IDF), and normalized to have unit length.

The principal definition of clustering is to arrange data objects into separate clusters such that the intracluster similarity as well as the intercluster dissimilarity is maximized. The problem formulation itself implies that some forms of measurement are needed to determine such similarity or dissimilarity. There are many state-of-the-art clustering approaches that do not employ any specific form of measurement, for instance, probabilistic model-based methods [9], nonnegative matrix factorization [10], information theoretic coclustering [11], and so on. In this paper, though, we primarily focus on methods that indeed do utilize a specific measure. In the literature, euclidean distance is one of the most popular measures

$$\mathrm{Dist}(d_i, d_j) = \|d_i - d_j\|. \qquad (1)$$

It is used in the traditional k-means algorithm. The objective of k-means is to minimize the euclidean distance between objects of a cluster and that cluster's centroid

$$\min \sum_{r=1}^{k} \sum_{d_i \in S_r} \|d_i - C_r\|^2. \qquad (2)$$

However, for data in a sparse and high-dimensional space, such as that in document clustering, cosine similarity is more widely used. It is also a popular similarity score in text mining and information retrieval [12]. Particularly, the similarity of two document vectors $d_i$ and $d_j$, $\mathrm{Sim}(d_i, d_j)$, is defined as the cosine of the angle between them. For unit vectors, this equals their inner product

$$\mathrm{Sim}(d_i, d_j) = \cos(d_i, d_j) = d_i^t d_j. \qquad (3)$$

The cosine measure is used in a variant of k-means called spherical k-means [3]. While k-means aims to minimize euclidean distance, spherical k-means intends to maximize the cosine similarity between documents in a cluster and that cluster's centroid

$$\max \sum_{r=1}^{k} \sum_{d_i \in S_r} \frac{d_i^t C_r}{\|C_r\|}. \qquad (4)$$
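To make the notation concrete, the following is a minimal sketch (not the authors' code) of the cosine similarity in (3) and the spherical k-means objective in (4), assuming unit-length, TF-IDF-weighted document vectors stored as NumPy arrays.

```python
# Sketch of (3) and (4): docs is an (n, m) array of unit vectors and
# labels is an integer array assigning each document to one of k clusters.
import numpy as np

def cosine_sim(di, dj):
    # For unit vectors, cosine similarity reduces to the inner product d_i^t d_j.
    return float(di @ dj)

def spkmeans_objective(docs, labels, k):
    total = 0.0
    for r in range(k):
        members = docs[labels == r]
        if len(members) == 0:
            continue
        centroid = members.mean(axis=0)              # C_r
        norm = np.linalg.norm(centroid)
        if norm > 0:
            # sum over d_i in S_r of d_i^t C_r / ||C_r||
            total += float((members @ (centroid / norm)).sum())
    return total
```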

The major difference between euclidean distance and cosine similarity, and therefore between k-means and spherical k-means, is that the former focuses on vector magnitudes, while the latter emphasizes vector directions. Besides direct application in spherical k-means, the cosine of document vectors is also widely used in many other document clustering methods as a core similarity measurement. The min-max cut graph-based spectral method is an example [13]. In the graph partitioning approach, the document corpus is considered as a graph $G = (V, E)$, where each document is a vertex in $V$ and each edge in $E$ has a weight equal to the similarity between a pair of vertices. The min-max cut algorithm tries to minimize the criterion function

$$\min \sum_{r=1}^{k} \frac{\mathrm{Sim}(S_r, S \setminus S_r)}{\mathrm{Sim}(S_r, S_r)}, \qquad (5)$$
$$\text{where } \mathrm{Sim}(S_q, S_r) = \sum_{d_i \in S_q,\, d_j \in S_r} \mathrm{Sim}(d_i, d_j), \quad 1 \le q, r \le k,$$

and when the cosine as in (3) is used, minimizing the criterion in (5) is equivalent to

$$\min \sum_{r=1}^{k} \frac{D_r^t D}{\|D_r\|^2}. \qquad (6)$$

There are many other graph partitioning methods with different cutting strategies and criterion functions, such as Average Weight [14] and Normalized Cut [15], all of which have been successfully applied for document clustering using cosine as the pairwise similarity score [16], [17]. In [18], an empirical study was conducted to compare a variety of criterion functions for document clustering.

Another popular graph-based clustering technique is implemented in a software package called CLUTO [19]. This method first models the documents with a nearest neighbor graph, and then splits the graph into clusters using a min-cut algorithm. Besides the cosine measure, the extended Jaccard coefficient can also be used in this method to represent similarity between nearest documents. Given nonunit document vectors $u_i$, $u_j$ ($d_i = u_i/\|u_i\|$, $d_j = u_j/\|u_j\|$), their extended Jaccard coefficient is

$$\mathrm{Sim}_{\mathrm{eJacc}}(u_i, u_j) = \frac{u_i^t u_j}{\|u_i\|^2 + \|u_j\|^2 - u_i^t u_j}. \qquad (7)$$
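As a quick illustration (a sketch, not code from the paper), the extended Jaccard coefficient in (7) can be computed directly from the nonunit vectors:

```python
import numpy as np

def extended_jaccard(ui, uj):
    # Extended Jaccard coefficient (7) on the original, nonunit vectors u_i and u_j.
    dot = float(ui @ uj)
    return dot / (float(ui @ ui) + float(uj @ uj) - dot)

# Example: extended_jaccard(np.array([1., 2.]), np.array([2., 1.])) == 4 / 6
```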


TABLE 1: Notations


Compared with euclidean distance and cosine similarity, the extended Jaccard coefficient takes into account both the magnitude and the direction of the document vectors. If the documents are instead represented by their corresponding unit vectors, this measure has the same effect as cosine similarity. In [20], Strehl et al. compared four measures: euclidean, cosine, Pearson correlation, and extended Jaccard, and concluded that cosine and extended Jaccard are the best ones on web documents.

In nearest neighbor graph clustering methods, such as CLUTO's graph method above, the concept of similarity is somewhat different from the previously discussed methods. Two documents may have a certain value of cosine similarity, but if neither of them is in the other one's neighborhood, they have no connection between them. In such a case, some context-based knowledge or relativeness property is already taken into account when considering similarity. Recently, Ahmad and Dey [21] proposed a method to compute distance between two categorical values of an attribute based on their relationship with all other attributes. Subsequently, Ienco et al. [22] introduced a similar context-based distance learning method for categorical data. However, for a given attribute, they only selected a relevant subset of attributes from the whole attribute set to use as the context for calculating distance between its two values.

More related to text data, there are phrase-based and concept-based document similarities. Lakkaraju et al. [23] employed a conceptual tree-similarity measure to identify similar documents. This method requires representing documents as concept trees with the help of a classifier. For clustering, Chim and Deng [24] proposed a phrase-based document similarity by combining the suffix tree model and the vector space model. They then used the Hierarchical Agglomerative Clustering algorithm to perform the clustering task. However, a drawback of this approach is the high computational complexity due to the need to build the suffix tree and calculate pairwise similarities explicitly before clustering. There are also measures designed specifically for capturing structural similarity among XML documents [25]. They are essentially different from the document-content measures that are discussed in this paper.

In general, cosine similarity still remains the most popular measure because of its simple interpretation and easy computation, though its effectiveness is yet fairly limited. In the following sections, we propose a novel way to evaluate similarity between documents, and consequently formulate new criterion functions for document clustering.

3 MULTIVIEWPOINT-BASED SIMILARITY

3.1 Our Novel Similarity Measure

The cosine similarity in (3) can be expressed in the following form without changing its meaning:

$$\mathrm{Sim}(d_i, d_j) = \cos(d_i - 0, d_j - 0) = (d_i - 0)^t (d_j - 0), \qquad (8)$$

where $0$ is the zero vector that represents the origin point. According to this formula, the measure takes $0$ as the one and only reference point. The similarity between two documents $d_i$ and $d_j$ is determined w.r.t. the angle between the two points when looking from the origin.

To construct a new concept of similarity, it is possible to use more than just one point of reference. We may have a more accurate assessment of how close or distant a pair of points is, if we look at them from many different viewpoints. From a third point $d_h$, the directions and distances to $d_i$ and $d_j$ are indicated, respectively, by the difference vectors $(d_i - d_h)$ and $(d_j - d_h)$. By standing at various reference points $d_h$ to view $d_i$, $d_j$ and working on their difference vectors, we define the similarity between the two documents as

$$\mathrm{Sim}(d_i, d_j)_{d_i, d_j \in S_r} = \frac{1}{n - n_r} \sum_{d_h \in S \setminus S_r} \mathrm{Sim}(d_i - d_h, d_j - d_h). \qquad (9)$$

As described by the above equation, the similarity of two documents $d_i$ and $d_j$, given that they are in the same cluster, is defined as the average of similarities measured relatively from the views of all other documents outside that cluster. What is interesting is that the similarity here is defined in a close relation to the clustering problem. A presumption of cluster memberships has been made prior to the measure. The two objects to be measured must be in the same cluster, while the points from where to establish this measurement must be outside of the cluster. We call this proposal the Multiviewpoint-based Similarity, or MVS. From this point onwards, we will denote the proposed similarity measure between two document vectors $d_i$ and $d_j$ by $\mathrm{MVS}(d_i, d_j | d_i, d_j \in S_r)$, or occasionally $\mathrm{MVS}(d_i, d_j)$ for short.

The final form of MVS in (9) depends on the particular formulation of the individual similarities within the sum. If the relative similarity is defined by the dot product of the difference vectors, we have

$$\begin{aligned} \mathrm{MVS}(d_i, d_j | d_i, d_j \in S_r) &= \frac{1}{n - n_r} \sum_{d_h \in S \setminus S_r} (d_i - d_h)^t (d_j - d_h) \\ &= \frac{1}{n - n_r} \sum_{d_h} \cos(d_i - d_h, d_j - d_h) \, \|d_i - d_h\| \, \|d_j - d_h\|. \end{aligned} \qquad (10)$$

The similarity between two points $d_i$ and $d_j$ inside cluster $S_r$, viewed from a point $d_h$ outside this cluster, is equal to the product of the cosine of the angle between $d_i$ and $d_j$ looking from $d_h$ and the euclidean distances from $d_h$ to these two points. This definition is based on the assumption that $d_h$ is not in the same cluster with $d_i$ and $d_j$. The smaller the distances $\|d_i - d_h\|$ and $\|d_j - d_h\|$ are, the higher the chance that $d_h$ is in fact in the same cluster with $d_i$ and $d_j$, and the similarity based on $d_h$ should also be small to reflect this potential. Therefore, through these distances, (10) also provides a measure of intercluster dissimilarity, given that points $d_i$ and $d_j$ belong to cluster $S_r$, whereas $d_h$ belongs to another cluster. The overall similarity between $d_i$ and $d_j$ is determined by taking the average over all the viewpoints not belonging to cluster $S_r$. It is possible to argue that while most of these viewpoints are useful, there may be some of them giving misleading information, just as may happen with the origin point. However, given a large enough number of viewpoints and their variety, it is reasonable to assume that the majority of them will be useful. Hence, the effect of misleading viewpoints is constrained and reduced by the averaging step. It can be seen that this method offers a more informative assessment of similarity than the single origin point-based similarity measure.
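A direct reading of (9) and (10) can be sketched as follows (illustrative only; the documents are assumed to be unit vectors, and `viewpoints` holds the documents outside the cluster of $d_i$ and $d_j$):

```python
import numpy as np

def mvs(di, dj, viewpoints):
    # Average of (d_i - d_h)^t (d_j - d_h) over all viewpoints d_h in S \ S_r, as in (10).
    diffs_i = di - viewpoints     # one row per viewpoint: d_i - d_h
    diffs_j = dj - viewpoints     # one row per viewpoint: d_j - d_h
    return float(np.mean(np.sum(diffs_i * diffs_j, axis=1)))
```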

3.2 Analysis and Practical Examples of MVS

In this section, we present an analytical study to show that the proposed MVS could be a very effective similarity measure for data clustering. In order to demonstrate its advantages, MVS is compared with cosine similarity on how well they reflect the true group structure in document collections.

First, exploring (10), we have

$$\begin{aligned} \mathrm{MVS}(d_i, d_j | d_i, d_j \in S_r) &= \frac{1}{n - n_r} \sum_{d_h \in S \setminus S_r} \left( d_i^t d_j - d_i^t d_h - d_j^t d_h + d_h^t d_h \right) \\ &= d_i^t d_j - \frac{1}{n - n_r} d_i^t \sum_{d_h} d_h - \frac{1}{n - n_r} d_j^t \sum_{d_h} d_h + 1, \qquad \|d_h\| = 1 \\ &= d_i^t d_j - \frac{1}{n - n_r} d_i^t D_{S \setminus S_r} - \frac{1}{n - n_r} d_j^t D_{S \setminus S_r} + 1 \\ &= d_i^t d_j - d_i^t C_{S \setminus S_r} - d_j^t C_{S \setminus S_r} + 1, \end{aligned} \qquad (11)$$

where $D_{S \setminus S_r} = \sum_{d_h \in S \setminus S_r} d_h$ is the composite vector of all the documents outside cluster $r$, called the outer composite w.r.t. cluster $r$, and $C_{S \setminus S_r} = D_{S \setminus S_r}/(n - n_r)$ the outer centroid w.r.t. cluster $r$, $\forall r = 1, \ldots, k$. From (11), when comparing two pairwise similarities $\mathrm{MVS}(d_i, d_j)$ and $\mathrm{MVS}(d_i, d_l)$, document $d_j$ is more similar to document $d_i$ than the other document $d_l$ is, if and only if

$$\begin{aligned} & d_i^t d_j - d_j^t C_{S \setminus S_r} > d_i^t d_l - d_l^t C_{S \setminus S_r} \\ \Leftrightarrow \;& \cos(d_i, d_j) - \cos(d_j, C_{S \setminus S_r}) \|C_{S \setminus S_r}\| > \cos(d_i, d_l) - \cos(d_l, C_{S \setminus S_r}) \|C_{S \setminus S_r}\|. \end{aligned} \qquad (12)$$

From this condition, it is seen that even when $d_l$ is considered "closer" to $d_i$ in terms of CS, i.e., $\cos(d_i, d_j) \le \cos(d_i, d_l)$, $d_l$ can still possibly be regarded as less similar to $d_i$ based on MVS if, on the contrary, it is sufficiently "closer" to the outer centroid $C_{S \setminus S_r}$ than $d_j$ is. This is intuitively reasonable, since the "closer" $d_l$ is to $C_{S \setminus S_r}$, the greater the chance it actually belongs to another cluster rather than $S_r$ and is, therefore, less similar to $d_i$. For this reason, MVS brings to the table an additional useful measure compared with CS.
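The closed form in (11) also suggests that, in practice, MVS can be evaluated without iterating over the viewpoints once the outer centroid is known. A hedged sketch (assuming unit document vectors and integer cluster labels):

```python
import numpy as np

def outer_centroid(docs, labels, r):
    # C_{S\S_r}: mean of all documents outside cluster r.
    outside = docs[labels != r]
    return outside.sum(axis=0) / len(outside)

def mvs_closed_form(di, dj, c_out):
    # MVS(d_i, d_j) = d_i^t d_j - d_i^t C_{S\S_r} - d_j^t C_{S\S_r} + 1, as in (11).
    return float(di @ dj - di @ c_out - dj @ c_out + 1.0)
```

For unit vectors, this agrees with averaging (10) over all outside viewpoints, but costs only a few inner products per pair.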

To further justify the above proposal and analysis, we carried out a validity test for MVS and CS. The purpose of this test is to check how much a similarity measure coincides with the true class labels. It is based on one principle: if a similarity measure is appropriate for the clustering problem, for any document in the corpus, the documents that are closest to it based on this measure should be in the same cluster with it.

The validity test is designed as follows: For each type of similarity measure, a similarity matrix $A = \{a_{ij}\}_{n \times n}$ is created. For CS, this is simple, as $a_{ij} = d_i^t d_j$. The procedure for building the MVS matrix is described in Fig. 1.

First, the outer composite w.r.t. each class is determined. Then, for each row $a_i$ of $A$, $i = 1, \ldots, n$, if the pair of documents $d_i$ and $d_j$, $j = 1, \ldots, n$, are in the same class, $a_{ij}$ is calculated as in line 10, Fig. 1. Otherwise, $d_j$ is assumed to be in $d_i$'s class, and $a_{ij}$ is calculated as in line 12, Fig. 1. After matrix $A$ is formed, the procedure in Fig. 2 is used to get its validity score.

For each document $d_i$ corresponding to row $a_i$ of $A$, we select the $q_r$ documents closest to $d_i$. The value of $q_r$ is chosen relatively as a percentage of the size of the class $r$ that contains $d_i$, where $percentage \in (0, 1]$. Then, the validity w.r.t. $d_i$ is calculated by the fraction of these $q_r$ documents having the same class label with $d_i$, as in line 12, Fig. 2. The final validity is determined by averaging over all the rows of $A$, as in line 14, Fig. 2. It is clear that the validity score is bounded within 0 and 1. The higher the validity score a similarity measure has, the more suitable it should be for the clustering task.
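The validity procedure of Figs. 1 and 2 can be paraphrased as in the sketch below (our assumed reading of the figures, which are not reproduced here; whether the document itself is excluded from its own neighborhood is our assumption):

```python
import numpy as np

def validity_score(A, labels, percentage=1.0):
    # A: n x n similarity matrix (CS or MVS); labels: true class of each document.
    n = len(labels)
    class_size = {c: int(np.sum(labels == c)) for c in np.unique(labels)}
    per_doc = []
    for i in range(n):
        qr = max(1, int(percentage * class_size[labels[i]]))
        order = np.argsort(-A[i])               # most similar documents first
        order = order[order != i][:qr]          # q_r closest neighbors, excluding d_i
        per_doc.append(np.mean(labels[order] == labels[i]))
    return float(np.mean(per_doc))              # average validity over all rows of A
```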

Fig. 1. Procedure: Build MVS similarity matrix.
Fig. 2. Procedure: Get validity score.

1. http://www.daviddlewis.com/resources/testcollections/reuters21578/.

Two real-world document data sets are used as examples in this validity test. The first is reuters7, a subset of the famous collection, Reuters-21578 Distribution 1.0, of Reuters newswire articles.¹ Reuters-21578 is one of the most widely used test collections for text categorization. In our validity test, we selected 2,500 documents from the largest seven categories: "acq," "crude," "interest," "earn," "money-fx," "ship," and "trade" to form reuters7. Some of the documents may appear in more than one category. The second data set is k1b, a collection of 2,340 webpages from the Yahoo! subject hierarchy, including six topics: "health," "entertainment," "sport," "politics," "tech," and "business." It was created from a past study in information retrieval called WebAce [26], and is now available with the CLUTO toolkit [19].

The two data sets were preprocessed by stop-word removal and stemming. Moreover, we removed words that appear in less than two documents or more than 99.5 percent of the total number of documents. Finally, the documents were weighted by TF-IDF and normalized to unit vectors. The full characteristics of reuters7 and k1b are presented in Fig. 3.
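The preprocessing just described can be approximated with scikit-learn (our tooling assumption; the authors do not state theirs, and stemming would require an additional tokenizer):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    stop_words="english",  # stop-word removal
    min_df=2,              # drop words appearing in fewer than two documents
    max_df=0.995,          # drop words appearing in more than 99.5 percent of documents
    norm="l2",             # TF-IDF vectors normalized to unit length
)
docs = ["oil prices rise", "stocks fall on weak earnings", "crude oil exports grow"]  # toy corpus
X = vectorizer.fit_transform(docs)  # sparse (n x m) document-term matrix with unit rows
```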

Fig. 4 shows the validity scores of CS and MVS on the two data sets relative to the parameter percentage. The value of percentage is set at 0.001, 0.01, 0.05, 0.1, 0.2, ..., 1.0. According to Fig. 4, MVS is clearly better than CS for both data sets in this validity test. For example, with the k1b data set at percentage = 1.0, MVS's validity score is 0.80, while that of CS is only 0.67. This indicates that, on average, when we pick up any document and consider its neighborhood of size equal to its true class size, only 67 percent of that document's neighbors based on CS actually belong to its class. If based on MVS, the number of valid neighbors increases to 80 percent. The validity test has illustrated the potential advantage of the new multiviewpoint-based similarity measure compared to the cosine measure.

4 MULTIVIEWPOINT-BASED CLUSTERING

4.1 Two Clustering Criterion Functions $I_R$ and $I_V$

Having defined our similarity measure, we now formulate our clustering criterion functions. The first function, called $I_R$, is the cluster size-weighted sum of average pairwise similarities of documents in the same cluster. First, let us express this sum in a general form by function $F$:

$$F = \sum_{r=1}^{k} n_r \left[ \frac{1}{n_r^2} \sum_{d_i, d_j \in S_r} \mathrm{Sim}(d_i, d_j) \right]. \qquad (13)$$

We would like to transform this objective function into some suitable form, such that it could facilitate the optimization procedure to be performed in a simple, fast, and effective way. According to (10),

$$\begin{aligned} \sum_{d_i, d_j \in S_r} \mathrm{Sim}(d_i, d_j) &= \sum_{d_i, d_j \in S_r} \frac{1}{n - n_r} \sum_{d_h \in S \setminus S_r} (d_i - d_h)^t (d_j - d_h) \\ &= \frac{1}{n - n_r} \sum_{d_i, d_j} \sum_{d_h} \left( d_i^t d_j - d_i^t d_h - d_j^t d_h + d_h^t d_h \right). \end{aligned}$$

Since
$$\sum_{d_i \in S_r} d_i = \sum_{d_j \in S_r} d_j = D_r, \quad \sum_{d_h \in S \setminus S_r} d_h = D - D_r \quad \text{and} \quad \|d_h\| = 1,$$

we have

$$\begin{aligned} \sum_{d_i, d_j \in S_r} \mathrm{Sim}(d_i, d_j) &= \sum_{d_i, d_j \in S_r} d_i^t d_j - \frac{2 n_r}{n - n_r} \sum_{d_i \in S_r} d_i^t \sum_{d_h \in S \setminus S_r} d_h + n_r^2 \\ &= D_r^t D_r - \frac{2 n_r}{n - n_r} D_r^t (D - D_r) + n_r^2 \\ &= \frac{n + n_r}{n - n_r} \|D_r\|^2 - \frac{2 n_r}{n - n_r} D_r^t D + n_r^2. \end{aligned}$$

Substituting this into (13), we get

$$F = \sum_{r=1}^{k} \frac{1}{n_r} \left[ \frac{n + n_r}{n - n_r} \|D_r\|^2 - \left( \frac{n + n_r}{n - n_r} - 1 \right) D_r^t D \right] + n.$$

Because $n$ is constant, maximizing $F$ is equivalent to maximizing $\hat{F}$:

$$\hat{F} = \sum_{r=1}^{k} \frac{1}{n_r} \left[ \frac{n + n_r}{n - n_r} \|D_r\|^2 - \left( \frac{n + n_r}{n - n_r} - 1 \right) D_r^t D \right]. \qquad (14)$$

Fig. 3. Characteristics of reuters7 and k1b data sets.
Fig. 4. CS and MVS validity test.

If comparing $\hat{F}$ with the min-max cut in (5), both functions contain the two terms $\|D_r\|^2$ (an intracluster similarity measure) and $D_r^t D$ (an intercluster similarity measure).

Nonetheless, while the objective of min-max cut is to minimize the inverse ratio between these two terms, our aim here is to maximize their weighted difference. In $\hat{F}$, this difference term is determined for each cluster. The terms are weighted by the inverse of the cluster's size, before being summed up over all the clusters. One problem is that this formulation is expected to be quite sensitive to cluster size. From the formulation of COSA [27], a widely known subspace clustering algorithm, we have learned that it is desirable to have a set of weight factors $\lambda = \{\lambda_r\}_1^k$ to regulate the distribution of these cluster sizes in clustering solutions. Hence, we integrate $\lambda$ into the expression of $\hat{F}$ to have it become

$$\hat{F}_\lambda = \sum_{r=1}^{k} \frac{\lambda_r}{n_r} \left[ \frac{n + n_r}{n - n_r} \|D_r\|^2 - \left( \frac{n + n_r}{n - n_r} - 1 \right) D_r^t D \right]. \qquad (15)$$

In common practice, $\{\lambda_r\}_1^k$ are often taken to be simple functions of the respective cluster sizes $\{n_r\}_1^k$ [28]. Let us use a parameter $\alpha$, called the regulating factor, which has some constant value ($\alpha \in [0, 1]$), and let $\lambda_r = n_r^\alpha$ in (15); the final form of our criterion function $I_R$ is

$$I_R = \sum_{r=1}^{k} \frac{1}{n_r^{1-\alpha}} \left[ \frac{n + n_r}{n - n_r} \|D_r\|^2 - \left( \frac{n + n_r}{n - n_r} - 1 \right) D_r^t D \right]. \qquad (16)$$

In the empirical study of Section 5.4, it appears that $I_R$'s performance is not critically dependent on the value of $\alpha$. The criterion function yields relatively good clustering results for $\alpha \in (0, 1)$.

In the formulation of $I_R$, a cluster's quality is measured by the average pairwise similarity between documents within that cluster. However, such an approach can lead to sensitivity to the size and tightness of the clusters. With CS, for example, the pairwise similarity of documents in a sparse cluster is usually smaller than those in a dense cluster. Though not as clear as with CS, it is still possible that the same effect may hinder MVS-based clustering if pairwise similarity is used. To prevent this, an alternative approach is to consider the similarity between each document vector and its cluster's centroid instead. This is expressed in objective function $G$:

$$\begin{aligned} G &= \sum_{r=1}^{k} \sum_{d_i \in S_r} \frac{1}{n - n_r} \sum_{d_h \in S \setminus S_r} \mathrm{Sim}\!\left( d_i - d_h,\; \frac{C_r}{\|C_r\|} - d_h \right) \\ &= \sum_{r=1}^{k} \frac{1}{n - n_r} \sum_{d_i \in S_r} \sum_{d_h \in S \setminus S_r} (d_i - d_h)^t \left( \frac{C_r}{\|C_r\|} - d_h \right). \end{aligned} \qquad (17)$$

Similar to the formulation of $I_R$, we would like to express this objective in a simple form that we could optimize more easily. Expanding the vector dot product, we get

$$\begin{aligned} \sum_{d_i \in S_r} \sum_{d_h \in S \setminus S_r} (d_i - d_h)^t \left( \frac{C_r}{\|C_r\|} - d_h \right) &= \sum_{d_i} \sum_{d_h} \left( d_i^t \frac{C_r}{\|C_r\|} - d_i^t d_h - d_h^t \frac{C_r}{\|C_r\|} + 1 \right) \\ &= (n - n_r) D_r^t \frac{D_r}{\|D_r\|} - D_r^t (D - D_r) - n_r \frac{(D - D_r)^t D_r}{\|D_r\|} + n_r (n - n_r), \quad \text{since } \frac{C_r}{\|C_r\|} = \frac{D_r}{\|D_r\|} \\ &= \left( n + \|D_r\| \right) \|D_r\| - \left( n_r + \|D_r\| \right) \frac{D_r^t D}{\|D_r\|} + n_r (n - n_r). \end{aligned}$$

Substituting the above into (17), we have
$$G = \sum_{r=1}^{k} \left[ \frac{n + \|D_r\|}{n - n_r} \|D_r\| - \left( \frac{n + \|D_r\|}{n - n_r} - 1 \right) \frac{D_r^t D}{\|D_r\|} \right] + n.$$

Again, we could eliminate $n$ because it is a constant. Maximizing $G$ is equivalent to maximizing $I_V$ below:

$$I_V = \sum_{r=1}^{k} \left[ \frac{n + \|D_r\|}{n - n_r} \|D_r\| - \left( \frac{n + \|D_r\|}{n - n_r} - 1 \right) \frac{D_r^t D}{\|D_r\|} \right]. \qquad (18)$$

$I_V$ calculates the weighted difference between the two terms $\|D_r\|$ and $D_r^t D / \|D_r\|$, which again represent an intracluster similarity measure and an intercluster similarity measure, respectively. The first term is actually equivalent to an element of the sum in the spherical k-means objective function in (4); the second one is similar to an element of the sum in the min-max cut criterion in (6), but with $\|D_r\|$ as the scaling factor instead of $\|D_r\|^2$. We have presented our clustering criterion functions $I_R$ and $I_V$ in their simple forms. Next, we show how to perform clustering by using a greedy algorithm to optimize these functions.
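For illustration, both criterion functions can be evaluated from the cluster sizes $n_r$ and composite vectors $D_r$ alone. The sketch below (not the authors' implementation) follows (16) and (18) directly:

```python
import numpy as np

def criterion_IR(D, comps, sizes, alpha=0.3):
    # I_R as in (16); D is the composite vector of the whole corpus,
    # comps[r] = D_r and sizes[r] = n_r for each cluster r.
    n = int(sum(sizes))
    total = 0.0
    for Dr, nr in zip(comps, sizes):
        w = (n + nr) / (n - nr)
        total += (w * float(Dr @ Dr) - (w - 1.0) * float(Dr @ D)) / nr ** (1.0 - alpha)
    return total

def criterion_IV(D, comps, sizes):
    # I_V as in (18), with ||D_r|| replacing n_r inside the weights.
    n = int(sum(sizes))
    total = 0.0
    for Dr, nr in zip(comps, sizes):
        norm_Dr = np.linalg.norm(Dr)
        w = (n + norm_Dr) / (n - nr)
        total += w * norm_Dr - (w - 1.0) * float(Dr @ D) / norm_Dr
    return total
```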

4.2 Optimization Algorithm and Complexity

We denote our clustering framework by MVSC, meaning Clustering with Multiviewpoint-based Similarity. Subsequently, we have MVSC-$I_R$ and MVSC-$I_V$, which are MVSC with criterion function $I_R$ and $I_V$, respectively. The main goal is to perform document clustering by optimizing $I_R$ in (16) and $I_V$ in (18). For this purpose, the incremental k-way algorithm [18], [29], a sequential version of k-means, is employed. Considering that the expression of $I_V$ in (18) depends only on $n_r$ and $D_r$, $r = 1, \ldots, k$, $I_V$ can be written in a general form:

$$I_V = \sum_{r=1}^{k} I_r(n_r, D_r), \qquad (19)$$

where $I_r(n_r, D_r)$ corresponds to the objective value of cluster $r$. The same applies to $I_R$. With this general form, the incremental optimization algorithm, which has two major steps, Initialization and Refinement, is described in Fig. 5.

At Initialization, $k$ arbitrary documents are selected to be the seeds from which initial partitions are formed. Refinement is a procedure that consists of a number of iterations. During each iteration, the $n$ documents are visited one by one in a totally random order. Each document is checked to see whether moving it to another cluster results in an improvement of the objective function. If yes, the document is moved to the cluster that leads to the highest improvement. If no cluster is better than the current cluster, the document is not moved. The clustering process terminates when an iteration completes without any documents being moved to new clusters. Unlike the traditional k-means, this algorithm is a stepwise optimal procedure. While k-means only updates after all $n$ documents have been reassigned, the incremental clustering algorithm updates immediately whenever each document is moved to a new cluster. Since every move, when it happens, increases the objective function value, convergence to a local optimum is guaranteed.
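The Refinement step can be sketched as below. This is a simplified, illustrative version: it re-evaluates the whole objective for every tentative move, whereas the actual algorithm only needs the change contributed by the two affected clusters; `objective` can be either criterion sketched earlier.

```python
import numpy as np

def refine(docs, labels, k, objective, max_iters=50):
    # docs: (n, m) array of unit vectors; labels: NumPy integer array of initial cluster indices.
    D = docs.sum(axis=0)
    comps = [docs[labels == r].sum(axis=0) for r in range(k)]
    sizes = [int(np.sum(labels == r)) for r in range(k)]

    def move(i, src, dst):
        comps[src] -= docs[i]; comps[dst] += docs[i]
        sizes[src] -= 1; sizes[dst] += 1

    for _ in range(max_iters):
        moved = False
        for i in np.random.permutation(len(docs)):
            r = labels[i]
            if sizes[r] <= 1:
                continue                        # keep every cluster nonempty
            base = objective(D, comps, sizes)
            best_gain, best_s = 0.0, r
            for s in range(k):
                if s == r:
                    continue
                move(i, r, s)                   # tentative move of d_i to cluster s
                gain = objective(D, comps, sizes) - base
                move(i, s, r)                   # undo the tentative move
                if gain > best_gain:
                    best_gain, best_s = gain, s
            if best_s != r:
                move(i, r, best_s)              # commit the best move
                labels[i] = best_s
                moved = True
        if not moved:
            break                               # no document moved in a full pass
    return labels
```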

During the optimization procedure, in each iteration, the main sources of computational cost are:

. Searching for optimum clusters to move individual documents to: $O(nz \cdot k)$.
. Updating composite vectors as a result of such moves: $O(m \cdot k)$.

Here, $nz$ is the total number of nonzero entries in all document vectors. Our clustering approach is partitional and incremental; therefore, computing a similarity matrix is not needed at all. If $\tau$ denotes the number of iterations the algorithm takes, since $nz$ is often several tens of times larger than $m$ for the document domain, the computational complexity required for clustering with $I_R$ and $I_V$ is $O(nz \cdot k \cdot \tau)$.

5 PERFORMANCE EVALUATION OF MVSC

To verify the advantages of our proposed methods, we evaluate their performance in experiments on document data. The objective of this section is to compare MVSC-$I_R$ and MVSC-$I_V$ with the existing algorithms that also use specific similarity measures and criterion functions for document clustering. The similarity measures to be compared include euclidean distance, cosine similarity, and the extended Jaccard coefficient.

5.1 Document Collections

The data corpora that we used for experiments consist of 20 benchmark document data sets. Besides reuters7 and k1b, which have been described in detail earlier, we included another 18 text collections so that the examination of the clustering methods is more thorough and exhaustive. Similar to k1b, these data sets are provided together with CLUTO by the toolkit's authors [19]. They had been used for experimental testing in previous papers, and their source and origin had also been described in detail [30], [31]. Table 2 summarizes their characteristics. The corpora present a diversity of size, number of classes, and class balance. They were all preprocessed by standard procedures, including stop-word removal, stemming, removal of too rare as well as too frequent words, TF-IDF weighting, and normalization.

5.2 Experimental Setup and Evaluation

To demonstrate how well MVSCs can perform, we compare them with five other clustering methods on the 20 data sets in Table 2. In summary, the seven clustering algorithms are:

. MVSC-$I_R$: MVSC using criterion function $I_R$
. MVSC-$I_V$: MVSC using criterion function $I_V$
. k-means: standard k-means with euclidean distance
. Spkmeans: spherical k-means with CS
. graphCS: CLUTO's graph method with CS
. graphEJ: CLUTO's graph with extended Jaccard
. MMC: Spectral Min-Max Cut algorithm [13].

Our MVSC-$I_R$ and MVSC-$I_V$ programs are implemented in Java. The regulating factor $\alpha$ in $I_R$ is always set at 0.3 during the experiments. We observed that this is one of the most appropriate values. A study on MVSC-$I_R$'s performance relative to different $\alpha$ values is presented in a later section. The other algorithms are provided by the C library interface, which is available freely with the CLUTO toolkit [19].


Fig. 5. Algorithm: Incremental clustering.

TABLE 2: Document Data Sets


For each data set, the cluster number is predefined to be equal to the number of true classes, i.e., $k = c$.

None of the above algorithms are guaranteed to find the global optimum, and all of them are initialization-dependent. Hence, for each method, we performed clustering a few times with randomly initialized values, and chose the best trial in terms of the corresponding objective function value. In all the experiments, each test run consisted of 10 trials. Moreover, the result reported here on each data set by a particular clustering method is the average of 10 test runs.

After a test run, the clustering solution is evaluated by comparing the documents' assigned labels with their true labels provided by the corpus. Three types of external evaluation metric are used to assess clustering performance. They are the FScore, Normalized Mutual Information (NMI), and Accuracy. FScore is an equally weighted combination of the "precision" ($P$) and "recall" ($R$) values used in information retrieval. Given a clustering solution, FScore is determined as

$$\mathrm{FScore} = \sum_{i=1}^{k} \frac{n_i}{n} \max_j (F_{i,j}), \quad \text{where } F_{i,j} = \frac{2 \times P_{i,j} \times R_{i,j}}{P_{i,j} + R_{i,j}}, \quad P_{i,j} = \frac{n_{i,j}}{n_j}, \quad R_{i,j} = \frac{n_{i,j}}{n_i},$$

where $n_i$ denotes the number of documents in class $i$, $n_j$ the number of documents assigned to cluster $j$, and $n_{i,j}$ the number of documents shared by class $i$ and cluster $j$. From another aspect, NMI measures the information that the true class partition and the cluster assignment share. It measures how much knowing about the clusters helps us know about the classes:

$$\mathrm{NMI} = \frac{\sum_{i=1}^{k} \sum_{j=1}^{k} n_{i,j} \log \frac{n \cdot n_{i,j}}{n_i n_j}}{\sqrt{\left( \sum_{i=1}^{k} n_i \log \frac{n_i}{n} \right) \left( \sum_{j=1}^{k} n_j \log \frac{n_j}{n} \right)}}.$$

Finally, Accuracy measures the fraction of documents that are correctly labeled, assuming a one-to-one correspondence between true classes and assigned clusters. Let $q$ denote any possible permutation of the index set $\{1, \ldots, k\}$; Accuracy is calculated by

$$\mathrm{Accuracy} = \frac{1}{n} \max_q \sum_{i=1}^{k} n_{i, q(i)}.$$

The best mapping $q$ to determine Accuracy could be found by the Hungarian algorithm.² For all three metrics, their range is from 0 to 1, and a greater value indicates a better clustering solution.

2. http://en.wikipedia.org/wiki/Hungarian_algorithm.
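As a concrete sketch (our illustration, using SciPy's implementation of the Hungarian algorithm), Accuracy can be computed from the class/cluster contingency table:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def accuracy(true_labels, cluster_labels):
    classes, clusters = np.unique(true_labels), np.unique(cluster_labels)
    # contingency[i, j] = n_{i,j}: documents shared by class i and cluster j
    contingency = np.array([[np.sum((true_labels == c) & (cluster_labels == s))
                             for s in clusters] for c in classes])
    rows, cols = linear_sum_assignment(-contingency)   # best one-to-one mapping q
    return contingency[rows, cols].sum() / len(true_labels)
```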

5.3 Results

Fig. 6 shows the Accuracy of the seven clustering algorithms on the 20 text collections. Presented in a different way, clustering results based on FScore and NMI are reported in Tables 3 and 4, respectively. For each data set in a row, the value in bold and underlined is the best result, while the value in bold only is the second best.

Fig. 6. Clustering results in Accuracy. Left-to-right in legend corresponds to left-to-right in the plot.
TABLE 3: Clustering Results in FScore
TABLE 4: Clustering Results in NMI

It can be observed that MVSC-$I_R$ and MVSC-$I_V$ perform consistently well. In Fig. 6, for 19 out of 20 data sets (all except reviews), either both or one of the MVSC approaches are among the top two algorithms. The next most consistent performer is Spkmeans. The other algorithms may work well on certain data sets. For example, graphEJ yields an outstanding result on classic; graphCS and MMC are good on reviews. But they do not fare very well on the rest of the collections.

To have a statistical justification of the clustering performance comparisons, we also carried out statistical significance tests. Each of MVSC-$I_R$ and MVSC-$I_V$ was paired up with one of the remaining algorithms for a paired t-test [32]. Given two paired sets $X$ and $Y$ of $N$ measured values, the null hypothesis of the test is that the differences between $X$ and $Y$ come from a population with mean 0. The alternative hypothesis is that the paired sets differ from each other in a significant way. In our experiment, these tests were done based on the evaluation values obtained on the 20 data sets. The typical 5 percent significance level was used. For example, considering the pair (MVSC-$I_R$, k-means), from Table 3, it is seen that MVSC-$I_R$ dominates k-means w.r.t. FScore. If the paired t-test returns a p-value smaller than 0.05, we reject the null hypothesis and say that the dominance is significant. Otherwise, the null hypothesis is true and the comparison is considered insignificant.

The outcomes of the paired t-tests are presented in Table 5. As the paired t-tests show, the advantage of MVSC-$I_R$ and MVSC-$I_V$ over the other methods is statistically significant. A special case is the graphEJ algorithm. On the one hand, MVSC-$I_R$ is not significantly better than graphEJ if based on FScore or NMI. On the other hand, even when MVSC-$I_R$ and MVSC-$I_V$ test as clearly better than graphEJ, the p-values can still be considered relatively large, although they are smaller than 0.05. The reason is that, as observed before, graphEJ's results on the classic data set are very different from those of the other algorithms. While interesting, these values can be considered as outliers, and including them in the statistical tests would affect the outcomes greatly. Hence, we also report in Table 5 the tests where classic was excluded and only results on the other 19 data sets were used. Under this circumstance, both MVSC-$I_R$ and MVSC-$I_V$ outperform graphEJ significantly with good p-values.

5.4 Effect of $\alpha$ on MVSC-$I_R$'s Performance

It has been known that criterion function-based partitional clustering methods can be sensitive to cluster size and balance. In the formulation of $I_R$ in (16), there exists the parameter $\alpha$, which is called the regulating factor, $\alpha \in [0, 1]$. To examine how the determination of $\alpha$ could affect MVSC-$I_R$'s performance, we evaluated MVSC-$I_R$ with different values of $\alpha$ from 0 to 1, with a 0.1 incremental interval. The assessment was done based on the clustering results in NMI, FScore, and Accuracy, each averaged over all the 20 given data sets. Since the evaluation metrics for different data sets could be very different from each other, simply taking the average over all the data sets would not be very meaningful. Hence, we employed the method used in [18] to transform the metrics into relative metrics before averaging. On a particular document collection $S$, the relative FScore measure of MVSC-$I_R$ with $\alpha = \alpha_i$ is determined as follows:

$$\mathrm{relative\_FScore}(I_R; S; \alpha_i) = \frac{\max_{\alpha_j} \left\{ \mathrm{FScore}(I_R; S; \alpha_j) \right\}}{\mathrm{FScore}(I_R; S; \alpha_i)},$$

where $\alpha_i, \alpha_j \in \{0.0, 0.1, \ldots, 1.0\}$ and $\mathrm{FScore}(I_R; S; \alpha_i)$ is the FScore result on data set $S$ obtained by MVSC-$I_R$ with $\alpha = \alpha_i$. The same transformation was applied to NMI and Accuracy to yield relative_NMI and relative_Accuracy, respectively. MVSC-$I_R$ performs best with an $\alpha_i$ if its relative measure has a value of 1. Otherwise, its relative measure is greater than 1; the larger this value is, the worse MVSC-$I_R$ with $\alpha_i$ performs in comparison with other settings of $\alpha$. Finally, the average relative measures were calculated over all the data sets to present the overall performance.
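The relative-metric transformation can be written compactly. In the sketch below (our illustration), `fscores[s, i]` is assumed to hold $\mathrm{FScore}(I_R; S_s; \alpha_i)$ for data set $S_s$ and regulating factor $\alpha_i$:

```python
import numpy as np

def average_relative_fscore(fscores):
    # Best score per data set over all alpha settings (row-wise maximum).
    best = fscores.max(axis=1, keepdims=True)
    relative = best / fscores          # >= 1 everywhere; equals 1 at the best alpha
    return relative.mean(axis=0)       # average relative FScore for each alpha
```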

Fig. 7 shows the plot of the average relative FScore, NMI, and Accuracy w.r.t. different values of $\alpha$. In a broad view, MVSC-$I_R$ performs the worst at the extreme values of $\alpha$ (0 and 1), and tends to get better when $\alpha$ is set at some intermediate values between 0 and 1. Based on our experimental study, MVSC-$I_R$ always produces results within 5 percent of the best case, regardless of the type of evaluation metric, with $\alpha$ from 0.2 to 0.8.

6 MVSC AS REFINEMENT FOR k-MEANS

From the analysis of (12) in Section 3.2, MVS provides an additional criterion for measuring the similarity among documents compared with CS. Alternatively, MVS can be considered as a refinement of CS, and consequently the MVSC algorithms as refinements of spherical k-means, which uses CS. To further investigate the appropriateness and effectiveness of MVS and its clustering algorithms, we carried out another set of experiments in which solutions obtained by Spkmeans were further optimized by MVSC-$I_R$ and MVSC-$I_V$. The rationale for doing so is that if the final solutions by MVSC-$I_R$ and MVSC-$I_V$ are better than the intermediate ones obtained by Spkmeans, MVS is indeed good for the clustering problem. These experiments would reveal more clearly whether MVS actually improves the clustering performance compared with CS.

In the previous section, the MVSC algorithms have been compared against the existing algorithms that are closely related to them, i.e., ones that also employ similarity measures and criterion functions. In this section, we make use of the extended experiments to further compare MVSC with a different type of clustering approach, the NMF methods [10], which do not use any form of explicitly defined similarity measure for documents.

6.1 TDT2 and Reuters-21578 Collections

For variety and thoroughness, in this empirical study, we used two new document corpora described in Table 6: TDT2 and Reuters-21578. The original TDT2 corpus,³ which consists of 11,201 documents in 96 topics (i.e., classes), has been one of the most standard sets for the document clustering purpose. We used a subcollection of this corpus which contains 10,021 documents in the largest 56 topics. The Reuters-21578 Distribution 1.0 has been mentioned earlier in this paper. The original corpus consists of 21,578 documents in 135 topics. We used a subcollection having 8,213 documents from the largest 41 topics. The same two document collections had been used in the paper of the NMF methods [10]. Documents that appear in two or more topics were removed, and the remaining documents were preprocessed in the same way as in Section 5.1.

3. http://www.nist.gov/speech/tests/tdt/tdt98/index.html.

6.2 Experiments and Results

The following clustering methods:

. Spkmeans: spherical k-means
. rMVSC-$I_R$: refinement of Spkmeans by MVSC-$I_R$
. rMVSC-$I_V$: refinement of Spkmeans by MVSC-$I_V$
. MVSC-$I_R$: normal MVSC using criterion $I_R$
. MVSC-$I_V$: normal MVSC using criterion $I_V$,

and two new document clustering approaches that do not use any particular form of similarity measure:

. NMF: Nonnegative Matrix Factorization method
. NMF-NCW: Normalized Cut Weighted NMF

were involved in the performance comparison. When used as a refinement for Spkmeans, the algorithms rMVSC-$I_R$ and rMVSC-$I_V$ worked directly on the output solution of Spkmeans. The cluster assignment produced by Spkmeans was used as the initialization for both rMVSC-$I_R$ and rMVSC-$I_V$. We also investigated the performance of the original MVSC-$I_R$ and MVSC-$I_V$ further on the new data sets. Besides, it would be interesting to see how they and their Spkmeans-initialized versions fare against each other. What is more, two well-known document clustering approaches based on nonnegative matrix factorization, NMF and NMF-NCW [10], are also included for a comparison with our algorithms, which use the explicit MVS measure.

During the experiments, each of the two corpora in Table 6 was used to create six different test cases, each of which corresponded to a distinct number of topics used ($c = 5, \ldots, 10$). For each test case, $c$ topics were randomly selected from the corpus and their documents were mixed together to form a test set. This selection was repeated 50 times so that each test case had 50 different test sets. The average performance of the clustering algorithms with $k = c$ was calculated over these 50 test sets. This experimental setup is inspired by the similar experiments conducted in the NMF paper [10]. Furthermore, similar to the previous experimental setup in Section 5.2, each algorithm (including NMF and NMF-NCW) actually considered 10 trials on any test set before using the solution of the best obtainable objective function value as its final output.

The clustering results on TDT2 and Reuters-21578 are shown in Tables 7 and 8, respectively. For each test case in a column, the value in bold and underlined is the best among the results returned by the algorithms, while the value in bold only is the second best. From the tables, several observations can be made. First, MVSC-$I_R$ and MVSC-$I_V$ continue to show that they are good clustering algorithms by outperforming other methods frequently. They are always the best in every test case of TDT2.


Fig. 7. MVSC-IR’s performance with respect to �.

TABLE 6: TDT2 and Reuters-21578 Document Corpora

TABLE 5: Statistical Significance of Comparisons Based on Paired t-Tests with 5 Percent Significance Level


Compared with NMF-NCW, they are better in almost all the cases, except only the case of Reuters-21578, $k = 5$, where NMF-NCW is the best based on Accuracy.

TABLE 7: Clustering Results on TDT2
TABLE 8: Clustering Results on Reuters-21578
Fig. 8. Accuracies on the 50 test sets (in sorted order of Spkmeans) in the test case $k = 5$.

The second observation, which is also the main objective of this empirical study, is that by applying MVSC to refine the output of spherical k-means, clustering solutions are improved significantly. Both rMVSC-$I_R$ and rMVSC-$I_V$ lead to higher NMIs and Accuracies than Spkmeans in all the cases. Interestingly, there are many circumstances where Spkmeans' result is worse than that of the NMF clustering methods, but after being refined by MVSCs, it becomes better. To have a more descriptive picture of the improvements, we could refer to the radar charts in Fig. 8. The figure shows details of a particular test case where $k = 5$. Remember that a test case consists of 50 different test sets. The charts display the result on each test set, including the accuracy obtained by Spkmeans, and the results after refinement by MVSC, namely rMVSC-$I_R$ and rMVSC-$I_V$. For effective visualization, they are sorted in ascending order of the accuracies by Spkmeans (clockwise). As the patterns in both Figs. 8a and 8b reveal, improvement in accuracy is most likely attainable by rMVSC-$I_R$ and rMVSC-$I_V$. Many of the improvements come with a considerably large margin, especially when the original accuracy obtained by Spkmeans is low. There are only a few exceptions where, after refinement, accuracy becomes worse. Nevertheless, the decreases in such cases are small.

Finally, it is also interesting to notice from Tables 7 and 8 that MVSC preceded by spherical k-means does not necessarily yield better clustering results than MVSC with random initialization. There are only a small number of cases in the two tables where rMVSC is found to be better than MVSC. This phenomenon, however, is understandable. Given a locally optimal solution returned by spherical k-means, the rMVSC algorithms as a refinement method would be constrained by this local optimum itself and, hence, their search space might be restricted. The original MVSC algorithms, on the other hand, are not subjected to this constraint, and are able to follow the search trajectory of their objective function from the beginning. Hence, while the performance improvement after refining spherical k-means' results by MVSC proves the appropriateness of MVS and its criterion functions for document clustering, this observation in fact only reaffirms its potential.

7 CONCLUSIONS AND FUTURE WORK

In this paper, we propose a Multiviewpoint-based Similarity measuring method, named MVS. Theoretical analysis and empirical examples show that MVS is potentially more suitable for text documents than the popular cosine similarity. Based on MVS, two criterion functions, $I_R$ and $I_V$, and their respective clustering algorithms, MVSC-$I_R$ and MVSC-$I_V$, have been introduced. Compared with other state-of-the-art clustering methods that use different types of similarity measure, on a large number of document data sets and under different evaluation metrics, the proposed algorithms show that they could provide significantly improved clustering performance.

The key contribution of this paper is the fundamental concept of similarity measure from multiple viewpoints. Future methods could make use of the same principle, but define alternative forms for the relative similarity in (10), or, instead of using the average, combine the relative similarities from the different viewpoints in other ways. Besides, this paper focuses on partitional clustering of documents. In the future, it would also be possible to apply the proposed criterion functions to hierarchical clustering algorithms. Finally, we have shown the application of MVS and its clustering algorithms for text data. It would be interesting to explore how they work on other types of sparse and high-dimensional data.




Duc Thang Nguyen received the BEng degree in electrical and electronic engineering from Nanyang Technological University, Singapore, where he is also working toward the PhD degree in the Division of Information Engineering. Currently, he is an operations planning analyst at PSA Corporation Ltd., Singapore. His research interests include algorithms, information retrieval, data mining, optimization, and operations research.

Lihui Chen received the BEng degree in computer science and engineering from Zhejiang University, China, and the PhD degree in computational science from the University of St. Andrews, United Kingdom. Currently, she is an associate professor in the Division of Information Engineering at Nanyang Technological University, Singapore. Her research interests include machine learning algorithms and applications, data mining, and web intelligence. She has published more than 70 refereed papers in international journals and conferences in these areas. She is a senior member of the IEEE and a member of the IEEE Computational Intelligence Society.

Chee Keong Chan received the BEng degree in electrical and electronic engineering from the National University of Singapore, the MSc degree and DIC in computing from Imperial College, University of London, and the PhD degree from Nanyang Technological University. Upon graduation from NUS, he worked as an R&D engineer at Philips Singapore for several years. Currently, he is an associate professor in the Information Engineering Division, lecturing in subjects related to computer systems, artificial intelligence, software engineering, and cyber security. Over his years at NTU, he has published numerous research papers in conferences, journals, books, and book chapters, and has provided numerous consultations to industry. His current research interests include data mining (text and solar radiation data), evolutionary algorithms (scheduling and games), and renewable energy.


