
Mining Projected Clusters in High-Dimensional Spaces

    Mohamed Bouguessa and Shengrui Wang, Member, IEEE

Abstract: Clustering high-dimensional data has been a major challenge due to the inherent sparsity of the points. Most existing clustering algorithms become substantially inefficient if the required similarity measure is computed between data points in the full-dimensional space. To address this problem, a number of projected clustering algorithms have been proposed. However, most of them encounter difficulties when clusters hide in subspaces with very low dimensionality. These challenges motivate our effort to propose a robust partitional distance-based projected clustering algorithm. The algorithm consists of three phases. The first phase performs attribute relevance analysis by detecting dense and sparse regions and their location in each attribute. Starting from the results of the first phase, the goal of the second phase is to eliminate outliers, while the third phase aims to discover clusters in different subspaces. The clustering process is based on the K-Means algorithm, with the computation of distance restricted to subsets of attributes where object values are dense. Our algorithm is capable of detecting projected clusters of low dimensionality embedded in a high-dimensional space and avoids the computation of the distance in the full-dimensional space. The suitability of our proposal has been demonstrated through an empirical study using synthetic and real data sets.

Index Terms: Data mining, clustering, high dimensions.

    1 INTRODUCTION

Data mining is the process of extracting potentially useful information from a data set [1]. Clustering is a popular data mining technique which is intended to help the user discover and understand the structure or grouping of the data in the set according to a certain similarity measure [2]. Clustering algorithms usually employ a distance metric (e.g., euclidean) or a similarity measure in order to partition the database so that the data points in each partition are more similar than points in different partitions.

The commonly used euclidean distance, while computationally simple, requires similar objects to have close values in all dimensions. However, with the high-dimensional data commonly encountered nowadays, the concept of similarity between objects in the full-dimensional space is often invalid and generally not helpful. Recent theoretical results [3] reveal that data points in a set tend to be more equally spaced as the dimension of the space increases, as long as the components of the data points are independently and identically distributed (i.i.d.). Although the i.i.d. condition is rarely satisfied in real applications, it still becomes less meaningful to differentiate data points based on a distance or a similarity measure computed using all the dimensions. These results explain the poor performance of conventional distance-based clustering algorithms on such data sets.

Feature selection techniques are commonly utilized as a preprocessing stage for clustering, in order to overcome the curse of dimensionality. The most informative dimensions are selected by eliminating irrelevant and redundant ones. Such techniques speed up clustering algorithms and improve their performance [4]. Nevertheless, in some applications, different clusters may exist in different subspaces spanned by different dimensions. In such cases, dimension reduction using a conventional feature selection technique may lead to substantial information loss [5].

The following example provides an idea of the difficulties encountered by conventional clustering algorithms and feature selection techniques. Fig. 1 illustrates a generated data set composed of 1,000 data points in 10D space. Note that this data set is generated based on the data generator model described in [5]. As we can see from Fig. 1, there are four clusters that have their own relevant dimensions (e.g., cluster 1 exists in dimensions A1, A4, A8, and A10). By relevant dimensions, we mean dimensions that exhibit cluster structure. In our example, there are also three irrelevant dimensions, A3, A5, and A7, in which all the data points are sparsely distributed, i.e., no cluster structure exists in these dimensions. Note that the rows in this figure indicate the boundaries of each cluster.

For such an example, a traditional clustering algorithm is likely to fail to find the four clusters. This is because the distance function used by these algorithms gives equal treatment to all dimensions, which are, however, not of equal importance. While feature selection techniques can reduce the dimensionality of the data by eliminating irrelevant attributes such as A3, A5, and A7, there is an enormous risk that they will also eliminate relevant attributes such as A1. This is due to the presence of many sparse data points in A1, where a cluster is in fact present. To cope with this problem, new classes of projected clustering have emerged.

Projected clustering exploits the fact that in high-dimensional data sets, different groups of data points may be correlated along different sets of dimensions. The clusters produced by such algorithms are called projected clusters.


The authors are with the Department of Computer Science, University of Sherbrooke, Sherbrooke, QC J1K 2R1, Canada. E-mail: {mohamed.bouguessa, shengrui.wang}@usherbrooke.ca.

Manuscript received 6 Feb. 2007; revised 22 Jan. 2008; accepted 17 July 2008; published online 25 July 2008. Recommended for acceptance by D. Talia. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TKDE-0054-0207. Digital Object Identifier no. 10.1109/TKDE.2008.162.


A projected cluster is a subset SP of data points, together with a subspace SD of dimensions, such that the points in SP are closely clustered in SD [5]. For instance, the fourth cluster in the data set presented in Fig. 1 is (SP_4, SD_4) = ({x_k, ..., x_n}, {A2, A4, A6, A9}). Recent research has suggested the presence of projected clusters in many real-life data sets [6].

A number of projected clustering algorithms have been proposed in recent years. Although these previous algorithms have been successful in discovering clusters in different subspaces, they encounter difficulties in identifying very low-dimensional projected clusters embedded in high-dimensional space. Yip et al. [7] observed that current projected clustering algorithms provide meaningful results only when the dimensionalities of the clusters are not much lower than that of the data set. For instance, some partitional projected clustering algorithms, such as PROCLUS [5] and ORCLUS [8], make use of a similarity function that involves all dimensions in order to find an initial approximation of the clusters. After that, relevant dimensions of each cluster are determined using some heuristics, and the clustering is refined based on the relevant dimensions previously selected. Here, it is clear that a similarity function that uses all dimensions misleads the relevant dimension detection mechanism and adversely affects the performance of these algorithms. Another example is HARP [9], a hierarchical projected clustering algorithm based on the assumption that two data points are likely to belong to the same cluster if they are very similar to each other along many dimensions. However, when the number of relevant dimensions per cluster is much lower than the data set dimensionality, such an assumption may not be valid. In addition, some existing projected clustering algorithms, such as PROCLUS [5] and ORCLUS [8], require the user to provide the average dimensionality of the subspaces, which is very difficult to establish in real-life applications.

These observations motivate our effort to propose a novel projected clustering algorithm, called Projected Clustering based on the K-Means Algorithm (PCKA). PCKA is composed of three phases: attribute relevance analysis, outlier handling, and discovery of projected clusters. Our algorithm is partitional in nature and able to automatically detect projected clusters of very low dimensionality embedded in high-dimensional space, thereby avoiding computation of the distance in the full-dimensional space.

The remainder of this paper is organized as follows: In Section 2, we provide a brief overview of recent projected clustering algorithms and discuss their strengths and weaknesses. Section 3 describes our projected clustering algorithm in detail. Section 4 presents experiments and performance results on a number of synthetic and real data sets. Our conclusion is given in Section 5.

2 RELATED WORK

The problem of finding projected clusters has been addressed in [5]. The partitional algorithm PROCLUS, which is a variant of the K-medoid method, iteratively computes a good medoid for each cluster. With the set of medoids, PROCLUS finds the subspace dimensions for each cluster by examining the neighboring locality of the space near it. After the subspace has been determined, each data point is assigned to the cluster of the nearest medoid. The algorithm is run until the sum of intracluster distances ceases to change. ORCLUS [8] is an extended version of PROCLUS that looks for non-axis-parallel clusters, by using Singular Value Decomposition (SVD) to transform the data to a new coordinate system and select principal components. PROCLUS and ORCLUS were the first to successfully introduce a methodology for discovering projected clusters in high-dimensional spaces, and they continue to inspire novel approaches.

A limitation of these two approaches is that the process of forming the locality is based on the full dimensionality of the space. However, it is not useful to look for neighbors in data sets with very low-dimensional projected clusters. In addition, PROCLUS and ORCLUS require the user to provide the average dimensionality of the subspace, which also is very difficult to do in real-life applications.

In [10], Procopiuc et al. propose an approach called Density-based Optimal projective Clustering (DOC) in order to identify projected clusters. DOC proceeds by discovering clusters one after another, defining a projected cluster as a hypercube with width 2w, where w is a user-supplied parameter. In order to identify relevant dimensions for each cluster, the algorithm randomly selects a seed point and a small set Y of neighboring data points from the data set. A dimension is considered as relevant to the cluster if and only if the distance between the projected value of the seed point and the data point in Y on the dimension is no more than w. All data points that belong to the defined hypercube form a candidate cluster. The suitability of the resulting cluster is evaluated by a quality function which is based on a user-provided parameter that controls the trade-off between the number of objects and the number of relevant dimensions. DOC tries different seeds and neighboring data points, in order to find the cluster that optimizes the quality function. The entire process is repeated to find other projected clusters. It is clear that since DOC scans the entire data set repetitively, its execution time is very high. To alleviate this problem, an improved version of DOC called FastDOC is also proposed in [10].

DOC is based on an interesting theoretical foundation and has been successfully applied to image processing applications [10]. In contrast to previous approaches (i.e., PROCLUS and ORCLUS), DOC is able to automatically discover the number of clusters in the data set. However, the input parameters of DOC are difficult to determine and an inappropriate choice by the user can greatly diminish its accuracy. Furthermore, DOC looks for clusters with equal width along all relevant dimensions. In some types of data, however, clusters with different widths are more realistic.


    Fig. 1. Example of data set containing projected clusters.


Another hypercube approach called Frequent-Pattern-based Clustering (FPC) is proposed in [11] to improve the efficiency of DOC. FPC replaces the randomized module of DOC with systematic search for the best cluster defined by a random medoid point p. In order to discover relevant dimensions for the medoid p, an optimized adaptation of the frequent pattern tree growth method used for mining item sets is proposed. In this context, the authors of FPC illustrate the analogy between mining frequent item sets and discovering dense projected clusters around random points. The adapted mining technique is combined with FastDOC to discover clusters. However, the fact that FPC returns only one cluster at a time adversely affects its computational efficiency. In order to speed up FPC, an extended version named Concurrent FPC (CFPC) is also proposed in [11]. CFPC can discover multiple clusters simultaneously, which improves the efficiency of the clustering process.

It is shown in [11] that FPC significantly improves the efficiency of DOC/FastDOC and can be much faster than the previous approaches. However, since FPC is built on DOC/FastDOC, it inherits some of their drawbacks. FPC performs well only when each cluster is in the form of a hypercube and the parameter values are specified correctly.

A recent paper [9] proposes a hierarchical projected clustering algorithm called a Hierarchical approach with Automatic Relevant dimension selection for Projected clustering (HARP). The basic assumption of HARP is that if two data points are similar in high-dimensional space, they have a high probability of belonging to the same cluster in lower-dimensional space. Based on this assumption, two clusters are allowed to merge only if they are similar enough in a number of dimensions. The minimum similarity and minimum number of similar dimensions are dynamically controlled by two thresholds, without the assistance of user parameters. The advantage of HARP is that it provides a mechanism to automatically determine relevant dimensions for each cluster and avoid the use of input parameters, whose values are difficult to set. In addition to this, the study in [6] illustrates that HARP provides interesting results on gene expression data.

On the other hand, as mentioned in Section 1, it has been shown in [3] that, for a number of common data distributions, as dimensionality increases, the distance to the nearest data point approaches the distance to the farthest data point. Based on these results, the basic assumption of HARP will be less valid when projected clusters have few relevant dimensions. In such situations, the accuracy of HARP deteriorates severely. This effect on HARP's performance was also observed by Yip et al. in [7].

In order to overcome the limitation encountered by HARP and other projected clustering algorithms, the authors of HARP propose in [7] a semisupervised approach named Semisupervised Projected Clustering (SSPC). This algorithm is partitional in nature and similar in structure to PROCLUS. As in semisupervised clustering, SSPC makes use of domain knowledge (labeled data points and/or labeled dimensions) in order to improve the quality of a clustering. As reported in [7], the clustering accuracy can be greatly improved by inputting only a small amount of domain knowledge. However, in some applications, domain knowledge in the form of labeled data points and/or labeled dimensions is very limited and not usually available.

A density-based algorithm named Efficient Projective Clustering by Histograms (EPCH) is proposed in [12] for projected clustering. EPCH performs projected clustering by histogram construction. By iteratively lowering a threshold, dense regions are identified in each histogram. A signature is generated for each data point corresponding to some region in some subspace. Projected clusters are uncovered by identifying signatures with a large number of data points [12]. EPCH has an interesting property in that no assumption is made about the number of clusters or the dimensionality of subspaces. In addition to this, it has been shown in [12] that EPCH can be fast and is able to handle clusters of irregular shape. On the other hand, while EPCH avoids the computation of distance between data points in the full-dimensional space, it suffers from the curse of dimensionality. In our experiments [13], we have observed that when the dimensionality of the data space increases and the number of relevant dimensions for clusters decreases, the accuracy of EPCH is affected.

A field that is closely related to projected clustering is subspace clustering. CLIQUE [1] was the pioneering approach to subspace clustering, followed by a number of algorithms in the same field such as ENCLUS [14], MAFIA [15], and SUBCLU [16]. The idea behind subspace clustering is to identify all dense regions in all subspaces, whereas in projected clustering, as the name implies, the main focus is on discovering clusters that are projected onto particular spaces [5]. The outputs of subspace clustering algorithms differ significantly from those of projected clustering [5]. Subspace clustering techniques tend to produce a partition of the data set with overlapping clusters [5], [9]. The output of such algorithms is very large, because data points may be assigned to multiple clusters. In contrast, projected clustering algorithms produce disjoint clusters with a single partitioning of points [5], [9], [10], [11], [12]. Depending on the application domain, both subspace clustering and projected clustering can be powerful tools for mining high-dimensional data. Since the major concern of this paper is projected clustering, we will focus only on such techniques. Further details and a survey on subspace clustering algorithms and projected clustering algorithms can be found in [17] and [18].

3 THE ALGORITHM PCKA

    3.1 Problem Statement

To describe our algorithm, we will introduce some notation and definitions. Let DB be a data set of d-dimensional points, where the set of attributes is denoted by A = {A_1, A_2, ..., A_d}. Let X = {x_1, x_2, ..., x_N} be the set of N data points, where x_i = (x_i1, ..., x_ij, ..., x_id). Each x_ij (i = 1, ..., N; j = 1, ..., d) corresponds to the value of data point x_i on attribute A_j. In what follows, we will call x_ij a 1D point. In this paper, we assume that each data point x_i belongs either to one projected cluster or to the set of outliers OUT. Given the number of clusters n_c, which is an input parameter, a projected cluster C_s, s = 1, ..., n_c, is defined as a pair (SP_s, SD_s), where SP_s is a subset of data points of DB and SD_s is a subset of dimensions of A, such that the projections of the points in SP_s along each dimension in SD_s are closely clustered. The dimensions in SD_s are called relevant dimensions for the cluster C_s. The remaining dimensions, i.e., A − SD_s, are called irrelevant dimensions for the cluster C_s.


The cardinality of the set SD_s is denoted by d_s, where d_s ≤ d, and n_s denotes the cardinality of the set SP_s, where n_s < N.

PCKA is focused on discovering axis-parallel projected clusters which satisfy the following properties:

1. Projected clusters must be dense. Specifically, the projected values of the data points along each dimension of {SD_s}, s = 1, ..., n_c, form regions of high density in comparison to those in each dimension of {A − SD_s}, s = 1, ..., n_c.

2. The subsets of dimensions {SD_s}, s = 1, ..., n_c, may not be disjoint and they may have different cardinalities.

3. For each projected cluster C_s, the projections of the data points in SP_s along each dimension in SD_s are similar to each other according to a similarity function, but dissimilar to other data points not in C_s.

The first property is based on the fact that relevant dimensions of the clusters contain dense regions in comparison to irrelevant ones, and such a concept of density is comparatively relative across all the dimensions in the data set. The reason for the second and third properties is trivial. In our clustering process, which is K-Means-based, we will use the euclidean distance in order to measure the similarity between a data point and a cluster center, such that only dimensions that contain dense regions are involved in the distance calculation. Hence, the discovered clusters should have, in general, a convex (near spherical) shape [1].
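To make the notation of Section 3.1 concrete, the following minimal sketch (ours, not the paper's; the NumPy layout and toy values are assumptions) represents DB as an N × d array and a projected cluster C_s as a pair of index sets (SP_s, SD_s):

import numpy as np

# DB: N d-dimensional points as an N x d array (rows are points x_i, columns are attributes A_1..A_d).
N, d = 1000, 10
rng = np.random.default_rng(0)
DB = rng.uniform(0.0, 100.0, size=(N, d))

# A projected cluster C_s is a pair (SP_s, SD_s): member-point indices and relevant-dimension indices.
SP_s = np.arange(0, 250)        # hypothetical member points
SD_s = np.array([1, 3, 5, 8])   # hypothetical relevant dimensions (0-based)

# The 1D projections that the properties above require to be closely clustered and dense:
projections = DB[np.ix_(SP_s, SD_s)]
print(projections.shape)        # (250, 4)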

Note that the algorithm that we propose does not presume any distribution on each individual dimension for the input data. Furthermore, there is no restriction imposed on the size of the clusters or the number of relevant dimensions of each cluster. A projected cluster should have a significant number of selected (i.e., relevant) dimensions with high relevance in which a large number of points are close to each other (see footnote 1). To achieve this, PCKA proceeds in three phases:

1. Attribute relevance analysis. The goal is to identify all dimensions in a data set which exhibit some cluster structure by discovering dense regions and their location in each dimension. The underlying assumption for this phase is that, in the context of projected clustering, a cluster should have relevant dimensions in which the projection of each point of the cluster is close to a sufficient number of other projected points (from the whole data set), and this concept of closeness is relative across all the dimensions. The identified dimensions represent potential candidates for relevant dimensions of the clusters.

2. Outlier handling. Based on the results of the first phase, the aim is to identify and eliminate outlier points from the data set. Like the majority of clustering algorithms, PCKA considers outliers as points that do not cluster well [5].

3. Discovery of projected clusters. The goal of this phase is to identify clusters and their relevant dimensions. The clustering process is based on a modified version of the K-Means algorithm in which the computation of distance is restricted to subsets where the data point values are dense. Based on the identified clusters, in the last step of our algorithm, we refine the results of phase 1 by selecting the appropriate dimensions of each cluster.

Looking at the design of our algorithm, it is clear that our strategy to identify clusters and their relevant dimensions cannot be viewed as globally optimizing an objective function. In other words, our technique is not similar to full iterative clustering algorithms, such as PROCLUS [5] and SSPC [7], which require an objective function to be optimized. The proposed algorithm belongs to the broad category of techniques, such as CLIQUE [1] and EPCH [12], that do not treat projected/subspace clustering as an optimization problem and thus do not admit an explicit objective function. On the other hand, PCKA follows the general principle of projected clustering and uses various techniques pertaining to minimizing the intercluster similarity and maximizing the intracluster similarity in the relevant dimensions. To identify projected clusters, PCKA proceeds phase-by-phase and avoids the difficulty of attempting to solve a hard combinatorial optimization problem. The algorithms of these phases are described in the following sections.

    3.2 Attribute Relevance Analysis

In the context of projected clustering, irrelevant attributes contain noise/outliers and sparse data points, while relevant ones may exhibit some cluster structure [5]. By cluster structure, we mean a region that has a higher density of points than its surrounding regions. Such a dense region represents the 1D projection of some cluster. Hence, it is clear that by detecting dense regions in each dimension we are able to discriminate between dimensions that are relevant to clusters and irrelevant ones.

In order to detect densely populated regions in each attribute, we compute a sparseness degree y_ij for each 1D point x_ij by measuring the variance of its k nearest (1D point) neighbors (k-nn).

Definition 1. The sparseness degree of x_ij is defined as

y_ij = ( Σ_{r ∈ p_ji(x_ij)} (r − c_ji)^2 ) / (k + 1),

where p_ji(x_ij) = nn_jk(x_ij) ∪ {x_ij} and |p_ji(x_ij)| = k + 1. nn_jk denotes the set of k-nn of x_ij in dimension A_j, and c_ji is the center of the set p_ji(x_ij), i.e., c_ji = ( Σ_{r ∈ p_ji(x_ij)} r ) / (k + 1).

Intuitively, a large value of y_ij means that x_ij belongs to a sparse region, while a small one indicates that x_ij belongs to a dense region.

Calculation of the k nearest neighbors is, in general, an expensive task, especially when the number of data points N is very large. However, since we are searching for the k nearest neighbors in a 1D space, we can perform the task in an efficient way by presorting the values in each attribute and limiting the number of distance comparisons to a maximum of 2k values.
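The following sketch computes the sparseness degrees of Definition 1 with the presorting shortcut just described; it assumes NumPy, and the per-dimension normalization into ]0, 1] is our reading of Section 3.2.1 (the text does not spell out whether normalization is per dimension or global):

import numpy as np

def sparseness_degrees(DB: np.ndarray, k: int) -> np.ndarray:
    """Compute y_ij (Definition 1) for every 1D point, one dimension at a time."""
    N, d = DB.shape
    Y = np.empty((N, d))
    for j in range(d):
        col = DB[:, j]
        order = np.argsort(col)                  # presort the attribute once
        sorted_vals = col[order]
        for pos, i in enumerate(order):
            lo, hi = max(0, pos - k), min(N, pos + k + 1)
            window = np.delete(sorted_vals[lo:hi], pos - lo)   # at most 2k candidates
            # k nearest 1D neighbours nn_jk(x_ij) among the candidates
            nn = window[np.argsort(np.abs(window - sorted_vals[pos]))[:k]]
            p = np.append(nn, sorted_vals[pos])                # p_ji(x_ij), k + 1 values
            c = p.mean()                                       # centre c_ji
            Y[i, j] = np.sum((p - c) ** 2) / (k + 1)           # sparseness degree y_ij
        Y[:, j] /= Y[:, j].max()                 # normalize into ]0, 1] (here per dimension)
    return Y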

The major advantage of using the sparseness degree is that it provides a relative measure by which dense regions are more easily distinguishable from sparse regions. On the other hand, when a dimension contains only sparse regions, all the estimated y_ij for that dimension tend to be very large.


1. This stipulation ensures that the high relevance of the selected dimensions of each cluster is not due to random chance. Hence, some degenerative situations in which we can get a large cluster with a low number of relevant dimensions are successfully avoided.


Our objective now is to determine whether or not dense regions are present in a dimension.

In order to identify dense regions in each dimension, we are interested in all sets of x_ij having a small sparseness degree. In the preliminary version of PCKA described in [13], dense regions are distinguished from sparse regions using a user-predefined density threshold. However, in such an approach, an inappropriate choice of the value of the density threshold by the user may affect the clustering accuracy. In this paper, we develop a systematic and efficient way to discriminate between dense and sparse regions. For this purpose, we propose to model the sparseness degrees y_ij of all the 1D points x_ij in a dimension as a mixture distribution. The probability density function (PDF) is therefore estimated and the characteristics of each dimension are identified.

Note that, in order to identify dense regions, it might be possible in some cases to fit the original data points in each dimension as a mixture of Gaussian distributions (or another more complex mixture distribution). However, this would unnecessarily limit the generality and thus the applicability of our algorithm. For instance, when a dimension contains only noise/outliers (i.e., data points with sparse distribution), the use of a Gaussian distribution is not the best choice due to its symmetric shape restriction. On the other hand, the sparseness degree is a natural choice for representing the local density of a point in a dimension. We will show that the sparseness degree tends to have a smooth and multimodal distribution, which more naturally suggests a mixture distribution. These features make our approach suitable to deal with various possible distributions in each dimension. Below, we propose an effective way to identify dense regions and their location in each dimension.

    3.2.1 PDF Estimation

Since we are dealing with 1D spaces, estimating the histogram is a flexible tool to describe some statistical properties of the data. For the purpose of clarification, consider the data set presented in Fig. 1. The histograms of the sparseness degrees calculated for dimensions A1, A2, A3, and A4 are illustrated in Fig. 2. Here, k is chosen to be √N and the calculated y_ij are normalized in the interval ]0, 1]. The histograms presented in Fig. 2 suggest the existence of components with different shape and/or heavy tails, which inspired us to use the gamma mixture model.

Formally, we expect that the sparseness degrees of a dimension follow a mixture density of the form

G(y) = Σ_{l=1}^{m} ρ_l G_l(y; α_l, β_l),   (1)

where G_l(·) is the lth gamma distribution with parameters α_l and β_l, which represent, respectively, the shape and the scale parameters of the lth component, and ρ_l (l = 1, ..., m) are the mixing coefficients, with the restriction that ρ_l > 0 for l = 1, ..., m and Σ_{l=1}^{m} ρ_l = 1. The density function of the lth component is given by

G_l(y; α_l, β_l) = (β_l^{α_l} / Γ(α_l)) y^{α_l − 1} exp(−β_l y),   (2)

where Γ(α_l) is the gamma function, given by Γ(α) = ∫_0^∞ t^{α − 1} exp(−t) dt, t > 0.

As mentioned above, each gamma component G_l in (2) has two parameters: the shape parameter α_l and the scale parameter β_l. The shape parameter allows the distribution to take on a variety of shapes, depending on its value [19], [20].


    Fig. 2. Histograms of the sparseness degree of (a) dimension A1, (b) dimension A2, (c) dimension A3, and (d) dimension A4.


When α_l < 1, the distribution is highly skewed and is L-shaped. When α_l = 1, we get the exponential distribution. In the case of α_l > 1, the distribution has a peak (mode) at (α_l − 1)/β_l and a skewed shape. The skewness decreases as the value of α_l increases. This flexibility of the gamma distribution and its positive sample space make it particularly suitable to model the distribution of the sparseness degrees.
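As a quick illustration of (1) and (2), the mixture density can be evaluated with scipy.stats.gamma. Note that (2) uses β_l as a rate, so SciPy's scale argument is 1/β_l; the weights and parameter values below are made up:

import numpy as np
from scipy.stats import gamma

def gamma_mixture_pdf(y, weights, shapes, rates):
    """G(y) = sum_l rho_l * G_l(y; alpha_l, beta_l), with G_l as in Eq. (2)."""
    y = np.asarray(y, dtype=float)
    total = np.zeros_like(y)
    for rho, alpha, beta in zip(weights, shapes, rates):
        total += rho * gamma.pdf(y, a=alpha, scale=1.0 / beta)   # scale = 1 / beta
    return total

# Toy example: a sharp low-valued component (dense region) plus a broad one (sparse region).
y = np.linspace(0.01, 1.0, 5)
print(gamma_mixture_pdf(y, weights=[0.4, 0.6], shapes=[2.0, 1.2], rates=[40.0, 3.0]))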

A standard approach for estimating the parameters of the gamma components G_l is the maximum likelihood technique [21]. The likelihood function is defined as

L_{G_l}(α_l, β_l) = Π_{y ∈ G_l} G_l(y; α_l, β_l)
                  = (β_l^{N_l α_l} / Γ(α_l)^{N_l}) ( Π_{y ∈ G_l} y^{α_l − 1} ) exp( −β_l Σ_{y ∈ G_l} y ),   (3)

where N_l is the size of the lth component. The logarithm of the likelihood function is given by

log L_{G_l}(α_l, β_l) = N_l α_l log(β_l) − N_l log Γ(α_l) + (α_l − 1) Σ_{y ∈ G_l} log(y) − β_l Σ_{y ∈ G_l} y.   (4)

To find the values of α_l and β_l that maximize the likelihood function, we differentiate log L_{G_l}(α_l, β_l) with respect to each of these parameters and set the result equal to zero:

∂/∂α_l log L_{G_l}(α_l, β_l) = N_l log(β_l) − N_l Γ'(α_l)/Γ(α_l) + Σ_{y ∈ G_l} log(y) = 0,
which gives log(β_l) = Γ'(α_l)/Γ(α_l) − (1/N_l) Σ_{y ∈ G_l} log(y),   (5)

and

∂/∂β_l log L_{G_l}(α_l, β_l) = N_l α_l / β_l − Σ_{y ∈ G_l} y = 0,
which gives β_l = N_l α_l / Σ_{y ∈ G_l} y.   (6)

This yields the equation

log(α_l) − Ψ(α_l) = log( (1/N_l) Σ_{y ∈ G_l} y ) − (1/N_l) Σ_{y ∈ G_l} log(y),   (7)

where Ψ(·) is the digamma function, given by Ψ(α) = Γ'(α)/Γ(α). The digamma function can be approximated very accurately using the following equation [22]:

Ψ(α) ≈ log(α) − 1/(2α) − 1/(12α^2) + 1/(120α^4) − 1/(252α^6).   (8)

The parameter α_l can be estimated by solving (7) using the Newton-Raphson method. α_l is then substituted into (6) to determine β_l.
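A sketch of that single-component estimation step, assuming NumPy: α_l is obtained by applying Newton-Raphson to (7), with the digamma function and its derivative taken from the approximation (8) (scipy.special.digamma could be used instead), and β_l then follows from (6). The starting value for α and the positivity guard are our choices, and the sample is assumed not to be constant:

import numpy as np

def digamma_approx(a):
    """Approximation of the digamma function, Eq. (8)."""
    return (np.log(a) - 1.0 / (2 * a) - 1.0 / (12 * a**2)
            + 1.0 / (120 * a**4) - 1.0 / (252 * a**6))

def digamma_approx_prime(a):
    """Term-by-term derivative of Eq. (8), used in the Newton-Raphson step."""
    return (1.0 / a + 1.0 / (2 * a**2) + 1.0 / (6 * a**3)
            - 1.0 / (30 * a**5) + 1.0 / (42 * a**7))

def fit_gamma_ml(y, n_iter=50, tol=1e-8):
    """Estimate (alpha, beta) of one gamma component from samples y via Eqs. (6)-(8)."""
    y = np.asarray(y, dtype=float)
    c = np.log(y.mean()) - np.mean(np.log(y))   # right-hand side of Eq. (7); c > 0 for non-constant y
    alpha = 0.5 / c                             # rough starting value (our choice, not from the text)
    for _ in range(n_iter):
        f = np.log(alpha) - digamma_approx(alpha) - c
        step = f / (1.0 / alpha - digamma_approx_prime(alpha))
        alpha = max(alpha - step, 1e-6)         # keep the shape parameter positive
        if abs(step) < tol:
            break
    beta = alpha / y.mean()                     # Eq. (6): beta = N_l * alpha / sum(y)
    return alpha, beta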

The use of a mixture of gamma distributions allows us to propose a flexible model to describe the distribution of the sparseness degrees. To form such a model, we need to estimate m, the number of components, and the parameters for each component. One popular approach to estimating the number of components m is to increase m from 1 to m_max and to compute some particular performance measures on each run, until partition into an optimal number of components is obtained. For this purpose, we implement a standard two-step process. In the first step, we calculate the maximum likelihood of the parameters of the mixture for a range of values of m (from 1 to m_max). The second step involves calculating an associated criterion and selecting the value of m which optimizes the criterion. A variety of approaches have been proposed to estimate the number of components in a data set [23], [24]. In our method, we use a penalized likelihood criterion, called the Bayesian Information Criterion (BIC). BIC was first introduced by Schwartz [25] and is given by

BIC(m) = −2 L(m) + N_p log(N),   (9)

where L is the logarithm of the likelihood at the maximum likelihood solution for the mixture model under investigation, and N_p is the number of parameters estimated. The number of components that minimizes BIC(m) is considered to be the optimal value for m.

Typically, the maximum likelihood of the parameters of the distribution is estimated using the EM algorithm [26]. This algorithm requires the initial parameters of each component. Since EM is highly dependent on initialization [27], it is helpful to perform initialization by means of a clustering algorithm [27], [28]. For this purpose, we implement the FCM algorithm [29] to partition the set of sparseness degrees into m components. Based on such a partition, we can estimate the parameters of each component and set them as initial parameters for the EM algorithm. Once the EM algorithm converges, we can derive a classification decision about the membership of each sparseness degree in each component. The procedure for estimating the PDF of the sparseness degrees of each dimension is summarized in Algorithm 1.

Algorithm 1. PDF estimation of the sparseness degrees of each dimension
1: Input: A_j, m_max, k
2: Output: m, ρ_l, α_l, β_l
3: Based on Definition 1, compute the sparseness degrees y_ij;
4: Normalize y_ij in the interval ]0, 1];
5: for m = 1 to m_max do
6:   if m == 1 then
7:     Estimate the parameters of the gamma distribution based on the likelihood formula using (6), (7), and (8);
8:     Compute the value of BIC(m) using (9);
9:   else
10:    Apply the FCM algorithm as an initialization of the EM algorithm;
11:    Apply the EM algorithm to estimate the parameters of the mixture using (6), (7), and (8);
12:    Compute the value of BIC(m) using (9);
13:  end if
14: end for
15: Select the number of components m such that m = argmin BIC(m);

Let us focus now on the choice of the value of m_max.

    512 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 21, NO. 4, APRIL 2009

    Authorized licensed use limited to: Kongu Eningerring College. Downloaded on August 26, 2009 at 11:06 from IEEE Xplore. Restrictions apply.

  • 8/9/2019 base paper1.pdf

    7/16

In our experiments on different data sets, we observed that when a dimension contains only sparse regions, the sparseness degrees are well fitted, in general, by one gamma component. When a dimension contains a mix of dense and sparse regions, the sparseness degrees are well fitted, in general, by two gamma components. This can be explained by the fact that the sparseness degree provides a relative measure in which sparse regions are easily distinguishable from dense regions. Such a concept of relativity between the values of the sparseness degree makes the choice of m_max fairly simple. Based on this, we believe that setting m_max = 3 is, in general, a practical choice. The reader should be aware, however, that the choice of m_max is not limited to 3 and the user can set other values. However, setting values of m_max > 3 can unnecessarily increase the execution time. Hence, we suggest using m_max = 3 as a default value. The estimated PDFs of the sparseness degrees of A1, A2, A3, and A4 are illustrated in Fig. 3.
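A condensed sketch of the model-selection loop of Algorithm 1 under two stated simplifications: the components are initialized by an equal-size split of the sorted values rather than FCM, and the M-step uses weighted moment matching rather than the Newton-Raphson maximum-likelihood update of (6)-(8). All function names are ours:

import numpy as np
from scipy.stats import gamma

def _component_densities(y, w, alpha, beta):
    """Weighted per-component gamma densities, shape (len(y), m)."""
    return np.stack([w[l] * gamma.pdf(y, a=alpha[l], scale=1.0 / beta[l])
                     for l in range(len(w))], axis=1) + 1e-300

def fit_gamma_mixture(y, m, n_iter=200):
    """Simplified EM for an m-component gamma mixture; returns (log-likelihood, parameters)."""
    y = np.asarray(y, dtype=float)
    groups = np.array_split(np.sort(y), m)                 # crude initial hard partition
    w = np.full(m, 1.0 / m)
    mu = np.array([g.mean() for g in groups])
    var = np.array([g.var() + 1e-12 for g in groups])
    alpha, beta = mu**2 / var, mu / var                    # moment matching (rate parameterization)
    for _ in range(n_iter):
        dens = _component_densities(y, w, alpha, beta)
        r = dens / dens.sum(axis=1, keepdims=True)         # E-step: responsibilities
        nl = r.sum(axis=0)                                 # M-step: weighted moment matching
        w = nl / len(y)
        mu = (r * y[:, None]).sum(axis=0) / nl
        var = (r * (y[:, None] - mu) ** 2).sum(axis=0) / nl + 1e-12
        alpha, beta = mu**2 / var, mu / var
    loglik = np.log(_component_densities(y, w, alpha, beta).sum(axis=1)).sum()
    return loglik, (w, alpha, beta)

def select_m_by_bic(y, m_max=3):
    """Pick the m minimizing BIC(m) = -2 L(m) + N_p log(N), Eq. (9)."""
    best = None
    for m in range(1, m_max + 1):
        loglik, params = fit_gamma_mixture(y, m)
        n_params = 3 * m - 1                               # m shapes + m rates + (m - 1) free weights
        bic = -2.0 * loglik + n_params * np.log(len(y))
        if best is None or bic < best[0]:
            best = (bic, m, params)
    return best[1], best[2]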

    3.2.2 Detection of Dense Regions

Once the PDF of the sparseness degrees y_ij of each dimension is estimated, we turn to the problem of how to detect dense regions and their location in a dimension. For this purpose, we make efficient use of the properties of the estimated PDF. As illustrated in Fig. 3, the locations of the components that represent dense regions are close to zero in comparison to those that represent sparse regions. Based on this observation, we propose a method to examine the location of each of the components in all the dimensions in order to find a typical value that best describes the sparseness degrees of each component. Let loc_q denote the location of the component q, where q = 1, ..., m_total, and m_total is the total number of all the components over all the dimensions in the data set. Intuitively, a large value of loc_q means that the component q corresponds to a sparse region, while a small one indicates that this component corresponds to a dense region.

The most commonly used measures of location for univariate data are the mean, the mode, and the median. The most appropriate measure in our case is the median, due to the variable shape of the gamma distribution. For instance, when the shape parameter α_l < 1, the distribution is L-shaped with a heavy tail. In this case, the distribution has no mode and the extreme values present in the tail affect the computation of the mean. When α_l > 1, the distribution is skewed, which implies that the mean will be pulled in the direction of the skewness. In all these situations, the median provides a better estimate of location than does the mean or the mode, because skewed distributions and extreme values do not distort the median, whose computation is based on ranks.

In order to identify dense regions in each dimension, we are interested in all components with small values of loc_q. We therefore propose to use the MDL principle [30] to separate small and large values of loc_q. The MDL-selection technique that we use in our approach is similar to the MDL-pruning technique described in [1]. The authors in [1] use this technique to select subspaces with large coverage values and discard the rest.


Fig. 3. PDFs of the sparseness degrees of (a) dimension A1, (b) dimension A2, (c) dimension A3, and (d) dimension A4. The sparseness degrees of A1 and A2 are well fitted by two gamma components, where the first component represents a dense region, while the second one represents a sparse region. In the case of A3 and A4, the estimated PDFs contain only one gamma component, which represents a sparse region for A3 and a dense region for A4.


We want to use the MDL principle in a similar way, but in our case we want to select small values of loc_q and their corresponding components. The fundamental idea behind the MDL principle is to encode the input data under a given model and select the encoding that minimizes the code length [30].

Let LOC = {loc_1, ..., loc_q, ..., loc_{m_total}} be the set of all loc_q values for each dimension in the entire data set. Our objective is to divide LOC into two groups E and F, where E contains the highest values of loc_q and F contains the low values. To this end, we implement the model described in Algorithm 2. Based on the result of this partitioning process, we select the components corresponding to each of the loc_q values in F. In Fig. 4, we use the data set described in Section 1 to provide an illustration of this approach. As shown in this figure, there is a clear cutoff point which allows us to select the components with the smallest loc_q values.

Algorithm 2. MDL-based selection technique
1: Input: LOC
2: Output: E, F
3: Sort the values in LOC in descending order;
4: for q = 2 to m_total do
5:   E = {loc_i, i = 1, ..., q − 1}, and the mean of E is μ_E;
6:   F = {loc_j, j = q, ..., m_total}, and the mean of F is μ_F;
7:   Calculate the code length
     CL(q) = log_2(μ_E) + Σ_{loc_i ∈ E} log_2(|loc_i − μ_E|) + log_2(μ_F) + Σ_{loc_j ∈ F} log_2(|loc_j − μ_F|);
8: end for
9: Select the best partitioning given by the pair (E, F), i.e., the one for which the corresponding CL(q) is the smallest;
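A compact translation of Algorithm 2 (ours); a small constant guards the log2 of zero deviations, which the pseudocode leaves implicit:

import numpy as np

def mdl_split(loc, eps=1e-12):
    """Split the loc_q values into E (large values) and F (small values) at the cut
    that minimizes the code length CL(q) of Algorithm 2."""
    loc = np.sort(np.asarray(loc, dtype=float))[::-1]      # descending order
    best_q, best_cl = None, np.inf
    for q in range(1, len(loc)):
        E, F = loc[:q], loc[q:]
        cl = (np.log2(E.mean() + eps) + np.sum(np.log2(np.abs(E - E.mean()) + eps))
              + np.log2(F.mean() + eps) + np.sum(np.log2(np.abs(F - F.mean()) + eps)))
        if cl < best_cl:
            best_q, best_cl = q, cl
    return loc[:best_q], loc[best_q:]                      # E, F

# Usage: F holds the small loc_q values, i.e., the components kept as dense regions.
E, F = mdl_split([0.9, 0.85, 0.8, 0.78, 0.1, 0.08, 0.05])
print(F)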

Based on the components selected by means of Algorithm 2, we can now definitively discriminate between dense and sparse regions in all the dimensions in the data set.

Definition 2. Let z_ij ∈ {0, 1}, where z_ij is a binary weight. If y_ij belongs to one of the selected components, then z_ij = 1 and x_ij belongs to a dense region; else z_ij = 0 and x_ij belongs to a sparse region.

From Definition 2, we obtain a binary matrix Z of size N × d which contains the information on whether each data point falls into a dense region of an attribute. For example, Fig. 5 illustrates the matrix Z for the data used in the example in Section 1.
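Given, for each sparseness degree y_ij, the identifier of the gamma component it was assigned to and the set of components retained by the MDL step, Z follows directly from Definition 2. A short sketch with our own data layout (labels and selected are hypothetical inputs):

import numpy as np

def build_Z(labels, selected):
    """labels[i, j]: global id of the component that y_ij belongs to;
    selected: ids of the components chosen as dense regions.
    Returns the N x d binary matrix Z of Definition 2."""
    return np.isin(labels, list(selected)).astype(np.int8)

# Toy usage: 3 points, 2 dimensions, components {0, 2} flagged as dense.
labels = np.array([[0, 1], [2, 1], [0, 2]])
print(build_Z(labels, {0, 2}))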

It is clear that the computation of z_ij depends on the input parameter k (the number of nearest neighbors of a 1D point). Although it is difficult to formulate and obtain optimal values for this parameter, it is possible for us to propose guidelines for its estimation. In fact, the role of the parameter k is intuitively easy to understand and it can be set by the user based on specific knowledge of the application. In general, if k is too small, the sparseness degrees y_ij are not meaningful, since a 1D point in a dense region might have a similar sparseness degree value to a 1D point in a sparse region. Obviously, the parameter k is related to the expected minimum cluster size and should be much smaller than the number of objects N in the data. To gain a clear idea of the sparseness of the neighborhood of a point, we have chosen to set k = √N in our implementation. Phase 1 of PCKA is summarized in Algorithm 3.

Algorithm 3. Phase 1 of PCKA
1: Input: DB, k, m_max
2: Output: Z
3: LOC = ∅;
4: Choose m_max;
5: for j = 1 to d do
6:   Apply Algorithm 1 (A_j, m_max, k; m, ρ_l, α_l, β_l);
7:   Apply EM to partition the sparseness degrees in dimension j into m components;
8:   for i = 1 to m do
9:     LOC = LOC ∪ {median(component_i)};
10:  end for
11: end for
12: Apply Algorithm 2 (LOC; E, F);
13: Based on the loc_q values in F, select the components which represent dense regions;
14: Based on Definition 2, compute the matrix Z;

In summary, phase 1 of PCKA allows attribute relevance analysis to be performed in a systematic way without imposing any restriction on the distribution of the original data points, which is actually an advantage. Indeed, the sparseness degree is an effective feature indicating whether a point is more likely situated in a sparse or a dense region in a dimension. The gamma mixture model is used due to its shape flexibility, which allows it to provide valuable information about the distribution of points and the location of dense and sparse regions, when considering each individual dimension. The MDL-pruning technique is employed to automatically discriminate between dense and sparse regions over all the dimensions considered together. We believe that this combination of the three techniques makes our attribute relevance analysis approach particularly suitable for performing projected clustering, as our experiments will illustrate.


Fig. 4. Partitioning of the set LOC into two sets E and F.

Fig. 5. The matrix Z of size N × d.


Finally, we should point out that our approach to detecting gamma components that correspond to dense regions is based on the assumption that the medians of all the gamma components of all dimensions are comparable. We have made this assumption since we are performing our task in a systematic way. However, such an assumption could have some negative influence on the results if the data set contains clusters with very different densities. Specifically, clusters with very low density can be confused with sparse regions that do not contain any cluster structure. As a result, dimensions in which such clusters exist will probably not be identified as relevant. This limitation arises from the fact that we attempt to separate regions in all individual dimensions into two classes: dense and sparse. To cope with this limitation, one possible solution is to extend our approach to deal with more general cases by adopting the local-density-based strategy developed in [31]. We will consider this issue in our future work.

    3.3 Outlier Handling

In addition to the presence of irrelevant dimensions, high-dimensional data are also characterized by the presence of outliers. Outliers can be defined as a set of data points that are considerably dissimilar, exceptional, or inconsistent with respect to the remaining data [32]. Most of the clustering algorithms, including PCKA, consider outliers as points that are not located in clusters and should be captured and eliminated because they hinder the clustering process.

A common approach to identify outliers is to analyze the relationship of each data point with the rest of the data, based on the concept of proximity [33], [34]. However, in high-dimensional spaces, the notion of proximity is not straightforward [3]. To overcome this problem, our outlier handling mechanism makes efficient use of the properties of the binary matrix Z. In fact, the matrix Z contains useful information about dense regions and their locations in the data set DB. It is obvious that outliers do not belong to any of the identified dense regions; they are located in sparse regions in DB. Such points are highly dissimilar from the rest of the data set and can be identified by using the binary weights z_ij. For this purpose, we use binary similarity coefficients to measure the similarity between binary data points z_i, for i = 1, ..., N, in the matrix Z.

Given two binary data points z_1 and z_2, there are four fundamental quantities that can be used to define similarity between the two [35]: a = |{j : z_1j = z_2j = 1}|, b = |{j : z_1j = 1 ∧ z_2j = 0}|, c = |{j : z_1j = 0 ∧ z_2j = 1}|, and d = |{j : z_1j = z_2j = 0}|, where j = 1, ..., d. One commonly used similarity measure for binary data is the Jaccard coefficient. This measure is defined as the number of variables that are coded as 1 for both states divided by the number of variables that are coded as 1 for either or both states. In our work, we require a similarity measure that can reflect the degree of overlap between the binary data points z_i in the identified dense regions in the matrix Z. Since dense regions are encoded by 1 in the matrix Z, we believe that the Jaccard coefficient is suitable for our task because it considers only matches on 1s to be important. The Jaccard coefficient is given as

JC(z_1, z_2) = a / (a + b + c).   (10)

The Jaccard coefficient has values between 0 (not at all similar) and 1 (completely similar). A pair of points is considered similar if the estimated Jaccard coefficient between them exceeds a certain threshold ε. In all our experiments on a number of data sets, we observed that setting ε = 0.70, for the task of our study, represents an acceptable degree of similarity between two binary vectors. Our outlier handling mechanism is based on the following definition:

Definition 3. Let δ be a user-defined parameter, z_i a binary vector from Z, and SV_i = {z_j | JC(z_i, z_j) > ε and j = 1, ..., N} the set of similar binary vectors of z_i. A data point x_i is an outlier with respect to parameters δ and ε if |SV_i| < δ.

The above definition has intuitive appeal since, in essence, it exploits the fact that, in contrast to outliers, points which belong to dense regions (clusters) generally have a large number of similar points. Based on this, the value of the parameter δ should not be larger than the size of the cluster containing the ith data point x_i and should be much smaller than N, the size of the data set DB. Hence, setting δ = √N is, in general, a reasonable choice. On the other hand, it is clear that when all of the binary weights for a binary data point z_i in the matrix Z are equal to zero, the related data point x_i is systematically considered as an outlier, because it does not belong to any of the discovered dense regions.

The identified outliers are discarded from DB and stored in the set OUT, while their corresponding rows are eliminated from the matrix Z. Thus, phase 2 of PCKA yields a reduced data set RDB with size N_r = N − |OUT| and its new associated matrix of binary weights T of size N_r × d. Our method for eliminating outliers is described in Algorithm 4.

Algorithm 4. Phase 2 of PCKA
1: Input: DB, Z, ε, δ
2: Output: RDB, T, OUT
3: OUT = ∅;
4: Let count be a table of size N;
5: for i = 1 to N do
6:   count(i) = 0;
7: end for
8: for i = 1 to N do
9:   if Σ_{j=1}^{d} z_ij = 0 then
10:    OUT = OUT ∪ {x_i};
11:  else
12:    for j = i + 1 to N do
13:      Estimate the number of similar binary vectors of z_i and z_j:
14:      if JC(z_i, z_j) > ε then
15:        count(i) = count(i) + 1;
16:        count(j) = count(j) + 1;
17:      end if
18:    end for
19:    if count(i) < δ then
20:      OUT = OUT ∪ {x_i};
21:    end if
22:  end if
23: end for
24: RDB = DB − OUT;
25: Based on RDB and OUT, extract T from Z;
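A vectorized sketch of Algorithm 4: pairwise Jaccard coefficients over the rows of Z, then the two outlier tests of Definition 3 (an all-zero row, or fewer than δ similar vectors). The O(N^2) pairwise computation mirrors the pseudocode; function and variable names are ours:

import numpy as np

def phase2_outliers(DB, Z, eps=0.70, delta=None):
    """Return (RDB, T, outlier_indices) following Algorithm 4."""
    N = Z.shape[0]
    if delta is None:
        delta = int(np.sqrt(N))                      # suggested default: delta = sqrt(N)
    Zf = Z.astype(float)
    inter = Zf @ Zf.T                                # a: common 1s for every pair
    ones = Zf.sum(axis=1)
    union = ones[:, None] + ones[None, :] - inter    # a + b + c for every pair
    with np.errstate(divide="ignore", invalid="ignore"):
        jc = np.where(union > 0, inter / union, 0.0) # Jaccard coefficient, Eq. (10)
    np.fill_diagonal(jc, 0.0)                        # a vector is not its own neighbour
    similar_counts = (jc > eps).sum(axis=1)
    is_outlier = (ones == 0) | (similar_counts < delta)
    keep = ~is_outlier
    return DB[keep], Z[keep], np.where(is_outlier)[0]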


    3.4 Discovery of Projected Clusters

The main focus of phase 3 of PCKA is to identify projected clusters. The problem of finding projected clusters is twofold: we must discover the clusters and find the appropriate set of dimensions in which each cluster exists [5]. To tackle this chicken-and-egg problem [9], we proceed in two steps:

1. In the first step, we cluster the data points based on the K-Means algorithm, with the computation of distance restricted to subsets of dimensions where object values are dense.

2. Based on the clusters obtained in the first step, the second step proceeds to select the relevant dimensions of the identified clusters by making use of the properties of the binary matrix T.

In the following, we describe each step in detail.

As mentioned above, our goal in the first step is to identify cluster members. For this purpose, we propose a modification to the basic K-Means algorithm. The standard K-Means assumes that each cluster is composed of objects distributed closely around its centroid. The objective of the K-Means is thus to minimize the squared distance between each object and the centroid of its cluster [9]. However, as can be expected from the discussion in Section 1, this is not an effective approach with high-dimensional data, because all the dimensions are involved in computing the distance between a point and the cluster center.

The problem described above can be addressed by modifying the distance measure, making use of the dimension relevance information identified in phase 1 of PCKA. More precisely, in order to cluster the data points in a more effective way, we modify the basic K-Means by using a distance function that considers contributions only from relevant dimensions when computing the distance between a data point and the cluster center. In concrete terms, we associate the binary weights t_ij (i = 1, ..., N_r; j = 1, ..., d) in matrix T with the euclidean distance. This makes the distance measure more effective because the computation of distance is restricted to subsets (i.e., projections) where the object values are dense. Formally, the distance between a point x_i and the cluster center v_s (s = 1, ..., n_c) is defined as

dist(x_i, v_s) = √( Σ_{j=1}^{d} t_ij (x_ij − v_sj)^2 ).   (11)

In this particular setting, our modified K-Means algorithm will allow two data points with different relevant dimensions to be grouped into one cluster if the two points are close to each other in their common relevant dimension(s). Each cluster obtained will have dimensions in which there are a significant number of cluster members which are close to each other. These clusters are actually already projected clusters, and we have adopted a straightforward strategy to detect relevant dimensions for each cluster.
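The restricted distance of (11) amounts to a masked euclidean distance; a short sketch (t_i is the row of T associated with x_i):

import numpy as np

def restricted_distance(x_i, t_i, v_s):
    """Eq. (11): euclidean distance computed only over the dense dimensions of x_i."""
    diff = (x_i - v_s) * t_i          # t_ij in {0, 1} masks out the sparse dimensions
    return np.sqrt(np.sum(diff ** 2))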

Specifically, we make use of the density information stored in matrix T to determine how well a dimension contributes to the formation of the obtained clusters. In fact, the sum of the binary weights of the data points belonging to the same cluster over each dimension gives us a meaningful measure of the relevance of each dimension to the cluster. Based on this observation, we propose a relevance index W_sj for each dimension in cluster C_s. The index W_sj for the dimension j (j = 1, ..., d) in cluster C_s is defined as follows:

W_sj = ( Σ_{t_i ∈ C_s} t_ij ) / |C_s|.   (12)

W_sj represents the percentage of the points in the cluster C_s that have the dimension j as a relevant dimension.

Definition 4. Let λ ∈ ]0, 1[. A dimension A_j is considered λ-relevant for the cluster C_s if W_sj > λ.

In the above definition, λ is a user-defined parameter that controls the degree of relevance of the dimension A_j to the cluster C_s. It is clear that the larger the value of the relevance index, the more relevant the dimension to the cluster. Based on this property, it is possible to perform a more general analysis on {W_sj | j = 1, ..., d}, or even on {W_sj}, s = 1, ..., n_c, j = 1, ..., d, to automatically determine relevant dimensions. Currently, it is still not clear whether this advantage is significant. On the other hand, since W_sj is a relative measure, it is not difficult to choose an appropriate value for λ. In our experiments, we set λ = 0.8, since it is a good practical choice to ensure a high degree of relevance. Adopting the simple thresholding approach also better illustrates the contribution of attribute relevance analysis (i.e., phase 1) to PCKA. Phase 3 of our algorithm is summarized in Algorithm 5.

Algorithm 5. Phase 3 of PCKA
1: Input: RDB, T, n_c, λ
2: Output: v_s, U (of size N_r × n_c), SD_s
   {v_s: cluster centers, where s = 1, ..., n_c. U: matrix of the membership degrees of each data point in each cluster. SD_s: set of the relevant dimensions of each cluster.}
3: Choose the cluster centers v_s^(0) (s = 1, ..., n_c) randomly from RDB;
4: repeat
5:   Compute the membership matrix U:
6:   for i = 1 to N_r do
7:     for j = 1 to n_c do
8:       if dist(x_i, v_s) < dist(x_i, v_j) for some s ≠ j then
9:         u_ij = 0;
10:      else
11:        u_ij = 1;
12:      end if
13:    end for
14:  end for
15:  Compute the cluster centers:
16:  v_s^(1) = ( Σ_{i=1}^{N_r} u_is t_i x_i ) / ( Σ_{i=1}^{N_r} u_is ), s = 1, ..., n_c, where t_i x_i denotes the componentwise product;
17: until convergence, i.e., no change in centroid coordinates;
18: Based on Definition 4, detect the set SD_s of relevant dimensions for each cluster C_s;
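A condensed sketch of Algorithm 5 together with the dimension selection of Definition 4 and Eq. (12). The vectorized distance computation and the empty-cluster guard are ours; the center update follows step 16 above:

import numpy as np

def phase3_pcka(RDB, T, nc, lam=0.8, max_iter=100, seed=0):
    """Modified K-Means with the restricted distance (11), then lambda-relevant dimensions (12)."""
    rng = np.random.default_rng(seed)
    Nr, d = RDB.shape
    centers = RDB[rng.choice(Nr, size=nc, replace=False)]
    labels = np.zeros(Nr, dtype=int)
    for _ in range(max_iter):
        # assignment step: each point goes to the center closest under Eq. (11)
        dists = np.sqrt(((RDB[:, None, :] - centers[None, :, :]) ** 2 * T[:, None, :]).sum(axis=2))
        new_labels = dists.argmin(axis=1)
        # update step: centers from the dense projections of the members (step 16)
        new_centers = centers.copy()
        for s in range(nc):
            members = new_labels == s
            if members.any():
                new_centers[s] = (T[members] * RDB[members]).sum(axis=0) / members.sum()
        if np.array_equal(new_labels, labels) and np.allclose(new_centers, centers):
            break
        labels, centers = new_labels, new_centers
    # relevance index W_sj (Eq. (12)) and lambda-relevant dimensions (Definition 4)
    relevant_dims = []
    for s in range(nc):
        members = labels == s
        W_s = T[members].mean(axis=0) if members.any() else np.zeros(d)
        relevant_dims.append(np.where(W_s > lam)[0])
    return labels, centers, relevant_dims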


4 EMPIRICAL EVALUATION

In this section, we devise a series of experiments designed to evaluate the suitability of our algorithm in terms of

1. accuracy: the aim is to test whether our algorithm, in comparison with other existing approaches, is able to correctly identify projected clusters, and

2. efficiency: the aim is to determine how the running time scales with 1) the size and 2) the dimensionality of the data set.

We compare the performance of PCKA to that of SSPC [7], HARP [9], PROCLUS [5], and FastDOC [10]. The evaluation is performed on a number of generated data sets with different characteristics. Furthermore, experiments on real data sets are also presented. All the experiments reported in this section were run on a PC with an Intel Core 2 Duo CPU at 2.4 GHz and 4 Gbytes of RAM.

    4.1 Performance Measure

A number of new metrics for comparing projected clustering algorithms and subspace clustering algorithms were recently proposed in [36]. The performance measure used in our paper is the Clustering Error (CE) distance for projected/subspace clustering. This metric performs comparisons in a more objective way since it takes into account the data point groups and the associated subspace simultaneously. The CE distance has been shown to be the most desirable metric for measuring agreement between partitions in the context of projected/subspace clustering [36]. A deeper investigation of the properties of the CE distance and other performance measures for projected/subspace clustering can be found in [36].

Assume that GP is a collection {C1, ..., Cs, ..., Cnc} of nc generated projected clusters and RP is a collection {C'1, ..., C's, ..., C'nc} of nc real projected clusters. To describe the CE distance, we need to define the union of projected clusters. Let U denote the union of the projected clusters in GP and RP. Formally, U = U(GP, RP) = supp(GP) ∪ supp(RP), where supp(GP) and supp(RP) are the supports of the clusterings GP and RP, respectively. The support of a clustering is the union of the supports of its clusters, e.g., supp(GP) = ∪_{s=1,...,nc} supp(Cs), where supp(Cs), the support of the projected cluster Cs, is given by supp(Cs) = {xij | xi ∈ SPs ∧ Aj ∈ SDs}. Let M = [mij] denote the confusion matrix, in which mij represents the number of 1D points shared by the projected clusters Ci and C'j, i.e., mij = |supp(Ci) ∩ supp(C'j)|. The matrix M is transformed, using the Hungarian method [37], in order to find a permutation of cluster labels such that the sum of the diagonal elements of M is maximized. Dmax denotes this maximized sum. The generalized CE distance for projected/subspace clustering is given by

CE(RP, GP) = (|U| - Dmax) / |U|.   (13)

The value of CE is always between 0 and 1. The more similar the two partitions GP and RP, the smaller the CE value. When GP and RP are identical, the CE value is zero.
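As a rough illustration (our own sketch, not code from the paper), the CE distance of (13) can be computed from the two clusterings' supports by filling the confusion matrix of shared 1D points and solving the label-matching problem with a Hungarian-method solver such as scipy.optimize.linear_sum_assignment. The representation of each projected cluster as a set of (point, dimension) pairs is an assumption of this sketch.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian method

def ce_distance(generated, real):
    """CE distance between two projected clusterings, each given as a list
    of clusters, where a cluster is a set of (point_index, dimension_index)
    pairs, i.e., its support of 1D points."""
    support_gp = set().union(*generated)
    support_rp = set().union(*real)
    union_size = len(support_gp | support_rp)

    # confusion matrix of shared 1D points
    m = np.array([[len(ci & cj) for cj in real] for ci in generated])

    # permutation of cluster labels maximizing the diagonal sum (D_max)
    row, col = linear_sum_assignment(-m)
    d_max = m[row, col].sum()

    return (union_size - d_max) / union_size

# toy usage: two clusters over points {0, 1, 2} and dimensions {0, 1}
gp = [{(0, 0), (0, 1), (1, 0)}, {(2, 0), (2, 1)}]
rp = [{(0, 0), (0, 1), (1, 0), (1, 1)}, {(2, 0), (2, 1)}]
print(ce_distance(gp, rp))  # small value: the two partitions nearly agree
```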

    4.2 Synthetic Data Generation Method

Since our focus is on axis-parallel clusters, we used the data generator model described in the PROCLUS paper [5] in order to simulate various situations. This data generation model has been used by most researchers in their studies [7], [9], [10], [11], [12] to evaluate the performance of projected clustering algorithms. The parameters used in synthetic data generation are the size of the data set N, the number of clusters nc, the data set dimensionality d, the average cluster dimensionality lreal, the domain of the values of each dimension [minj, maxj], the standard deviation value range [sdvmin, sdvmax], which is related to the distribution of points in each cluster, and the outlier percentage OP. Using these parameters, clusters and their associated subspaces are created randomly. Projected values of cluster points are generated according to the normal distribution in their relevant dimensions, with the mean randomly chosen from [minj, maxj] and the standard deviation value from [sdvmin, sdvmax]. For irrelevant dimensions and outliers, the dimensional values are randomly selected from [minj, maxj].
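As a concrete illustration, the following sketch is our simplified reading of this generation model, not the authors' generator: each cluster gets a fixed number of relevant dimensions rather than an average, and the standard deviation range is taken as absolute values rather than percentages of the global standard deviation. The function name and defaults are illustrative only.

```python
import numpy as np

def generate_projected_data(N=3000, d=100, nc=5, l_real=10,
                            min_j=-100.0, max_j=100.0,
                            sdv_range=(1.0, 6.0), outlier_pct=0.05, seed=0):
    """Generate axis-parallel projected clusters plus uniformly distributed
    outliers. Returns the data matrix X, labels (-1 marks an outlier), and
    the relevant dimensions of each cluster."""
    rng = np.random.default_rng(seed)
    n_out = int(N * outlier_pct)
    sizes = rng.multinomial(N - n_out, np.ones(nc) / nc)   # random cluster sizes

    # irrelevant dimensions and outliers are uniform over [min_j, max_j]
    X = rng.uniform(min_j, max_j, size=(N, d))
    labels = np.full(N, -1)
    relevant_dims = []

    start = 0
    for c, size in enumerate(sizes):
        dims = np.sort(rng.choice(d, size=l_real, replace=False))  # cluster subspace
        means = rng.uniform(min_j, max_j, size=l_real)
        sdvs = rng.uniform(sdv_range[0], sdv_range[1], size=l_real)
        for k, j in enumerate(dims):
            # projected values follow a normal law in each relevant dimension
            X[start:start + size, j] = rng.normal(means[k], sdvs[k], size=size)
        labels[start:start + size] = c
        relevant_dims.append(dims)
        start += size

    perm = rng.permutation(N)          # interleave outliers with cluster points
    return X[perm], labels[perm], relevant_dims
```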

4.3 Parameters for Comparing Algorithms

As mentioned above, we compared the performance of PCKA to that of SSPC, HARP, PROCLUS, and FastDOC. In all the experiments on synthetic data sets, the number of clusters nc was set to the true number of projected clusters used to generate the data sets. PROCLUS requires the average cluster dimensionality as a parameter; it was set to the true average cluster dimensionality. Several values were tried for the parameters of FastDOC and SSPC, following the suggestions in their respective papers, and we report results for the parameter settings that produced the best results. HARP requires the maximum percentage of outliers as a parameter; it was set to the true percentage of outliers present in the data sets. In order to obtain satisfactory results and avoid initialization bias, nondeterministic algorithms such as SSPC, PROCLUS, and FastDOC were run more than 10 times on each data set and we consider only the best result for each of them. In all of the experiments, SSPC was run without any semisupervision. For PCKA, in all the experiments, we set k = √N and the two threshold parameters to 0.7 and δ = 0.8, respectively.

4.4 Quality of Results

The performance of projected clustering algorithms is primarily affected by the average cluster dimensionality and the amount of outliers in the data set. The main goal of the experiments presented in this section was to evaluate the capability of projected clustering algorithms to correctly identify projected clusters in various situations.

    4.4.1 Robustness to the Average Cluster Dimensionality

The main concern of the first set of experiments was to analyze the impact of cluster dimensionality on the quality of clustering. For this purpose, we generated 16 different data sets with N = 3,000 data points, number of dimensions d = 100, minj = -100, and maxj = 100. In each data set, there were five clusters with sizes varying between 10 percent of N and 30 percent of N. For each relevant dimension of a cluster, the values of sdvmin and sdvmax were set to 1 percent and 6 percent of the global standard deviation on that dimension, respectively. The average cluster dimensionality varied from 2 percent to 70 percent of the data dimensionality d. Since our goal in this first set of experiments was to analyze the impact of cluster dimensionality, no outliers were generated. Therefore, the outlier detection mechanism of PCKA, SSPC, and PROCLUS was disabled. For FastDOC, since we could not disable its outlier elimination option, we chose the results with the highest accuracy after several runs. Fig. 6 illustrates the CE distance between the output of each of the five algorithms and the true clustering.

The values of the CE distance in Fig. 6 invite several comments.

PCKA is able to achieve highly accurate results and its performance is generally consistent. As we can see from Fig. 6, PCKA is more robust to variation of the average cluster dimensionality than the other algorithms. For instance, when the average cluster dimensionality is very low (lreal = 2 percent of d), which is a difficult case, only PCKA yields acceptable results. The experiments show that our algorithm is able to detect clusters and their relevant dimensions accurately in various situations. PCKA successfully avoids the selection of irrelevant dimensions in all the data sets used in these experiments. This can be explained by the fact that PCKA starts by identifying dense regions and their locations in each dimension, enabling it to restrict the computation of the distance to subsets of dimensions where the projected values of the data points are dense.

SSPC encounters difficulties when the average cluster dimensionality is as low as 2 percent of d. In other situations, the best results of SSPC are similar to those of PCKA: SSPC provides accurate results and is able to correctly identify projected clusters. This may be due to the fact that SSPC makes use of an objective function that combines data point clustering and dimension selection into a single optimization problem [7].

HARP performs well when lreal ≥ 30 percent of d, displaying performance comparable to that of PCKA and SSPC. On the other hand, when lreal < 30 percent of d, the performance of HARP is affected and its results are less competitive with those of PCKA and SSPC. We have observed that when lreal < 20 percent of d, HARP encounters difficulties in correctly identifying clusters and their relevant dimensions. When lreal is between 20 percent of d and 30 percent of d, HARP clusters the data points well but tends to include many irrelevant dimensions, which explains the values of the CE distance in Fig. 6. This can be attributed to the fact that data sets with low-dimensional projected clusters mislead HARP's dimension selection procedures. In such situations, the basic assumption of HARP, namely that if two data points are similar in high-dimensional space, they have a high probability of belonging to the same cluster in lower-dimensional space, becomes less valid.

The results of PROCLUS are less accurate than those given by PCKA and SSPC. When data sets contain projected clusters of low dimensionality, PROCLUS performs poorly. When lreal ≥ 40 percent of d, we have observed that PROCLUS is able to cluster the data points well, but its dimension selection mechanism is not very accurate because it tends to include some irrelevant dimensions and discard some relevant ones. This behavior of PROCLUS can be explained by the fact that its dimension selection mechanism is based on a distance calculation that involves all dimensions in detecting the set of objects neighboring a medoid, which severely hampers its performance.

As we can see from Fig. 6, FastDOC encounters difficulties in correctly identifying projected clusters. In our experiments, we have observed that even when the dimensionality of projected clusters is high (i.e., lreal = 70 percent of d), FastDOC tends to select a small subset of relevant dimensions for some clusters. All the experiments reported here seem to suggest that, in the presence of relevant attributes with different standard deviation values, FastDOC's assumption that projected clusters take the form of a hypercube appears to be less valid. In such situations, it is difficult to set correct values for the parameters of FastDOC.

    4.4.2 Outlier Immunity

The aim of this set of experiments was to test the effect of the presence of outliers on the performance of PCKA in comparison to SSPC, HARP, PROCLUS, and FastDOC. For this purpose, we generated three groups of data sets, each containing five data sets with N = 1,000, d = 100, and nc = 3. In each data set in a group, the percentage of outliers varied from 0 percent to 20 percent of N. We set lreal = 2 percent of d for the first group, lreal = 15 percent of d for the second, and lreal = 30 percent of d for the third. In all the data sets in the three groups, sdvmin and sdvmax were set to 1 percent and 6 percent of the global standard deviation, respectively. Fig. 7 illustrates the CE distance for the five algorithms.

As we can see from the figure, PCKA displays consistent performance, as was observed in the first set of experiments on data sets without outliers. In difficult cases (i.e., when lreal = 2 percent of d), PCKA provides much better results than all the other algorithms. All the results reported in Fig. 7 suggest that PCKA is less sensitive to the percentage of outliers in data sets. In addition, variations in the average cluster dimensionality in the presence of different percentages of outliers have no major impact on the performance of PCKA. This can be explained by the fact that the outlier handling mechanism of PCKA makes efficient use of the density information stored in the matrix Z, giving it high outlier immunity.

Fig. 6. CE distance between the output of each of the five algorithms and the true clustering.

The performance of SSPC is greatly affected when lreal = 2 percent of d, with different values of OP. When lreal ≥ 4 percent of d, the performance of SSPC is greatly improved and is not much affected by the percentage of outliers in the data sets.

HARP is less sensitive to the percentage of outliers in data sets when lreal = 30 percent of d. Otherwise (i.e., when lreal < 30 percent of d, as in the cases illustrated in Fig. 7), the performance of HARP is severely affected. This behavior of HARP is consistent with the analysis given in the previous section. On the other hand, we have observed in our experiments that HARP is sensitive to its maximum outlier percentage parameter. An incorrect choice of the value of this parameter may affect the accuracy of HARP.

Fig. 7 illustrates that PROCLUS and FastDOC encounter difficulties in correctly identifying projected clusters. PROCLUS tends to classify a large number of data points as outliers. Similar behavior of PROCLUS was also observed in [8] and [9]. The same phenomenon occurs for FastDOC. We found that FastDOC performs well on data sets that contain dense projected clusters with the form of a hypercube.

    4.5 Scalability

In this section, we study the scalability of PCKA with increasing data set size and dimensionality. In all of the following experiments, the quality of the results returned by PCKA was similar to that presented in the previous section.

Scalability with the data set size. Fig. 8a shows the results for scalability with the size of the data set. In this experiment, we used 50-dimensional data and varied the number of data points from 1,000 to 100,000. There were four clusters with lreal = 12 percent of d, and 5 percent of N were generated as outliers. As we can see from Fig. 8a, PCKA scales quadratically with increase in data set size. In all our experiments, we observed that PCKA is faster than HARP. For N ≥ 25,000, the execution time of PCKA is comparable to that of SSPC and PROCLUS when the time used for repeated runs is also included.

Scalability with the dimensionality of the data set. In Fig. 8b, we see that PCKA scales linearly with the increase in the data dimension. The results presented are for data sets with 5,000 data points grouped in four clusters, with lreal = 12 percent of d and 5 percent of N generated as outliers. The dimensionality of the data sets varies from 100 to 1,000. As in the scalability experiments w.r.t. the data set size, the execution time of PCKA is usually better than that of HARP and comparable to those of PROCLUS and SSPC when the time used for repeated runs is also included.

    4.6 Experiments on Real-World Data

In addition to our experiments on synthetic data, the suitability of our algorithm was also tested on three real-world data sets. A brief description of each of these is given below.

Wisconsin Diagnostic Breast Cancer Data (WDBC). This data set can be obtained from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/). The set contains 569 samples, each with 30 features. The samples are grouped into two clusters: 357 samples for benign and 212 for malignant.

Saccharomyces Cerevisiae Gene Expression Data (SCGE). This data set, available from http://cs.wellesley.edu/~btjaden/CORE, represents the supplementary material used in Tjaden's paper [38].


Fig. 7. Immunity to outliers. (a) Data sets with lreal = 2 percent of d. (b) Data sets with lreal = 15 percent of d. (c) Data sets with lreal = 30 percent of d.

Fig. 8. Scalability experiments. (a) Scalability of PCKA with regard to the data set size. (b) Scalability of PCKA with regard to the data set dimensionality.


The Saccharomyces Cerevisiae data set contains the expression levels of 205 genes under 80 experiments. The data set is presented as a matrix: each row corresponds to a gene and each column to an experiment. The genes are grouped into four clusters.

Multiple Features Data (MF). This data set is available from the UCI Machine Learning Repository. The set consists of features of handwritten numerals (0-9) extracted from a collection of Dutch utility maps. Two hundred patterns per cluster (for a total of 2,000 patterns) have been digitized in binary images. For our experiments, we used

    five feature sets (files):

1. mfeat-fou: 76 Fourier coefficients of the character shapes;
2. mfeat-fac: 216 profile correlations;
3. mfeat-kar: 64 Karhunen-Loève coefficients;
4. mfeat-zer: 47 Zernike moments; and
5. mfeat-mor: 6 morphological features.

In summary, we have a data set with 2,000 patterns, 409 features, and 10 clusters. All the values in each feature were standardized to mean 0 and variance 1.

We compared the performance of our algorithm, in terms of clustering quality, with that of SSPC, HARP, PROCLUS, and FastDOC. For this purpose, we used the class labels as ground truth and measured the accuracy of clustering by matching the points in input and output clusters. We adopted the classical definition of accuracy as the percentage of correctly partitioned data points. We note that the value of the average cluster dimensionality parameter in PROCLUS was estimated based on the number of relevant dimensions identified by PCKA. In addition, since there are no data points labeled as outliers in any of the data sets described above, the outlier detection option of PCKA, SSPC, HARP, and PROCLUS was disabled. Table 1 illustrates the accuracy of clustering for the five algorithms.
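As a rough illustration of this accuracy measure (our own sketch, not the authors' evaluation code), the best one-to-one matching between output clusters and class labels can again be obtained with a Hungarian-method solver, this time applied to a point-level contingency table rather than to 1D points:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def matching_accuracy(true_labels, pred_labels):
    """Percentage of correctly partitioned points under the best
    one-to-one matching of predicted clusters to true classes."""
    true_ids = np.unique(true_labels)
    pred_ids = np.unique(pred_labels)
    # contingency table: rows = predicted clusters, cols = true classes
    table = np.array([[np.sum((pred_labels == p) & (true_labels == t))
                       for t in true_ids] for p in pred_ids])
    row, col = linear_sum_assignment(-table)
    return 100.0 * table[row, col].sum() / len(true_labels)
```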

As can be seen from Table 1, PCKA is able to achieve highly accurate results in different situations involving data sets with different characteristics. These results confirm the suitability of PCKA previously observed on the generated data. The behavior of SSPC is also characterized by high clustering accuracy. However, with the Multiple Features data, the clustering process is stopped and SSPC returns an error. HARP yields moderate accuracy on the WDBC and MF data, while its performance on the SCGE data is acceptable. FastDOC yields acceptable accuracy on the WDBC data, while it performs poorly on the SCGE and MF data. The poor results may be due to an inappropriate choice of parameters, although we did try different sets of parameters in our experiments. Finally, from the experiments on the three data sets, we can see that the accuracy achieved by PROCLUS is reasonable.

To provide more comparisons and to confirm the suitability of our approach, we also analyzed the qualitative behavior of nonprojected clustering algorithms on the data sets considered in this set of experiments. In [39], the standard K-Means algorithm and three unsupervised competitive neural network algorithms, namely the neural gas network [40], the growing neural gas network [41], and the self-organizing feature map [42], are used to cluster the WDBC data. The accuracy achieved by these algorithms on this data is between 90 percent and 92 percent, which is very close to the accuracy achieved by PCKA on the same data (WDBC). Such results suggest that the moderate number of dimensions of this data set does not have a major impact on algorithms that consider all dimensions in the clustering process.

In [38], the algorithm Clustering Of Repeat Expression data (CORE), a nonprojected clustering algorithm, is used to cluster the SCGE data. CORE is a clustering approach akin to K-Means, designed to cluster gene expression data sets. Such data sets are expected to contain noisy dimensions [6], [38]. The accuracy achieved by the CORE algorithm on the SCGE data is 72.68 percent (see footnote 2), which is less than that achieved by PCKA and the other projected clustering algorithms, as illustrated in Table 1. Similar to the study in [6], our result suggests that projected clustering algorithms are suitable for clustering gene expression data.

Finally, for the MF data, contrary to the other two data sets, there seem to be few studies in the literature that use this set to evaluate clustering algorithms. Most of the work on this data set is related to classification algorithms [43], [44]. Hence, in order to get an intuitive idea of the behavior of a method that considers all dimensions in the clustering process, we chose to run the standard K-Means on this data set. The accuracy achieved is 77.2 percent, which is less than that achieved by PCKA and PROCLUS on the same data, as can be seen from Table 1. The enhanced performance of PCKA and PROCLUS in comparison to that of K-Means suggests that some dimensions are likely not relevant to some clusters in this data set.

    5 CONCLUSION

We have proposed a robust distance-based projected clustering algorithm for the challenging problem of high-dimensional clustering, and illustrated the suitability of our algorithm in tests and comparisons with previous work. Experiments show that PCKA provides meaningful results and significantly improves the quality of clustering when the dimensionalities of the clusters are much lower than that of the data set.


2. We calculated the accuracy of CORE on SCGE data based on the clustering result available from http://cs.wellesley.edu/~btjaden/CORE/yeast_clustering.txt.

TABLE 1: Accuracy of Clustering


Moreover, our algorithm yields accurate results when handling data with outliers. The performance of PCKA on real data sets suggests that our approach could be an interesting tool in practice. The accuracy achieved by PCKA results from its restriction of the distance computation to subsets of attributes, and from its procedure for the initial selection of these subsets. Using this approach, we believe that many distance-based clustering algorithms could be adapted to cluster high-dimensional data sets.

There are still many promising directions to be explored in the future. The interesting behavior of PCKA on generated data sets with low cluster dimensionality suggests that our approach can be used to extract useful information from gene expression data, which usually have a high level of background noise. From the algorithmic point of view, we believe that an improved scheme for PCKA is possible. One obvious direction for further study is to extend our approach to the case of arbitrarily oriented clusters. Another interesting direction to explore is to extend the scope of phase 1 of PCKA from attribute relevance analysis to attribute relevance and redundancy analysis. This aspect seems to have been ignored by all of the existing projected clustering algorithms.

    ACKNOWLEDGMENTS

The authors gratefully thank Kevin Yuk-Lap Yip of Yale University for providing the implementations of the projected clustering algorithms used in the experiments, and the reviewers for their valuable comments and important suggestions. This work is supported by research grants from the Natural Sciences and Engineering Research Council of Canada (NSERC).

REFERENCES

[1] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, "Automatic Subspace Clustering of High Dimensional Data," Data Mining and Knowledge Discovery, vol. 11, no. 1, pp. 5-33, 2005.
[2] A.K. Jain, M.N. Murty, and P.J. Flynn, "Data Clustering: A Review," ACM Computing Surveys, vol. 31, no. 3, pp. 264-323, 1999.
[3] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, "When Is 'Nearest Neighbor' Meaningful?," Proc. Seventh Int'l Conf. Database Theory (ICDT '99), pp. 217-235, 1999.
[4] H. Liu and L. Yu, "Toward Integrating Feature Selection Algorithms for Classification and Clustering," IEEE Trans. Knowledge and Data Eng., vol. 17, no. 4, pp. 491-502, Apr. 2005.
[5] C.C. Aggarwal, C. Procopiuc, J.L. Wolf, P.S. Yu, and J.S. Park, "Fast Algorithms for Projected Clustering," Proc. ACM SIGMOD '99, pp. 61-72, 1999.
[6] K.Y.L. Yip, D.W. Cheung, M.K. Ng, and K. Cheung, "Identifying Projected Clusters from Gene Expression Profiles," J. Biomedical Informatics, vol. 37, no. 5, pp. 345-357, 2004.
[7] K.Y.L. Yip, D.W. Cheung, and M.K. Ng, "On Discovery of Extremely Low-Dimensional Clusters Using Semi-Supervised Projected Clustering," Proc. 21st Int'l Conf. Data Eng. (ICDE '05), pp. 329-340, 2005.
[8] C.C. Aggarwal and P.S. Yu, "Redefining Clustering for High-Dimensional Applications," IEEE Trans. Knowledge and Data Eng., vol. 14, no. 2, pp. 210-225, Mar./Apr. 2002.
[9] K.Y.L. Yip, D.W. Cheung, and M.K. Ng, "HARP: A Practical Projected Clustering Algorithm," IEEE Trans. Knowledge and Data Eng., vol. 16, no. 11, pp. 1387-1397, Nov. 2004.
[10] C.M. Procopiuc, M. Jones, P.K. Agarwal, and T.M. Murali, "A Monte Carlo Algorithm for Fast Projective Clustering," Proc. ACM SIGMOD '02, pp. 418-427, 2002.
[11] M.L. Yiu and N. Mamoulis, "Iterative Projected Clustering by Subspace Mining," IEEE Trans. Knowledge and Data Eng., vol. 17, no. 2, pp. 176-189, Feb. 2005.
[12] E.K.K. Ng, A.W. Fu, and R.C. Wong, "Projective Clustering by Histograms," IEEE Trans. Knowledge and Data Eng., vol. 17, no. 3, pp. 369-383, Mar. 2005.
[13] M. Bouguessa, S. Wang, and Q. Jiang, "A K-Means-Based Algorithm for Projective Clustering," Proc. 18th IEEE Int'l Conf. Pattern Recognition (ICPR '06), pp. 888-891, 2006.
[14] C.H. Cheng, A.W. Fu, and Y. Zhang, "Entropy-Based Subspace Clustering for Mining Numerical Data," Proc. ACM SIGKDD '99, pp. 84-93, 1999.
[15] S. Goil, H. Nagesh, and A. Choudhary, "MAFIA: Efficient and Scalable Subspace Clustering for Very Large Data Sets," Technical Report CPDC-TR-9906-010, Northwestern Univ., 1999.
[16] K. Kailing, H.-P. Kriegel, and P. Kröger, "Density-Connected Subspace Clustering for High-Dimensional Data," Proc. Fourth SIAM Int'l Conf. Data Mining (SDM '04), pp. 246-257, 2004.
[17] L. Parsons, E. Haque, and H. Liu, "Subspace Clustering for High Dimensional Data: A Review," ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, pp. 90-105, 2004.
[18] K.Y.L. Yip, "HARP: A Practical Projected Clustering Algorithm for Mining Gene Expression Data," master's thesis, The Univ. of Hong Kong, 2004.
[19] K. Bury, Statistical Distributions in Engineering. Cambridge Univ. Press, 1998.
[20] N. Balakrishnan and V.B. Nevzorov, A Primer on Statistical Distributions. John Wiley & Sons, 2003.
[21] R.V. Hogg, J.W. McKean, and A.T. Craig, Introduction to Mathematical Statistics, sixth ed. Pearson Prentice Hall, 2005.
[22] J.F. Lawless, Statistical Models and Methods for Lifetime Data. John Wiley & Sons, 1982.
[23] M. Bouguessa, S. Wang, and H. Sun, "An Objective Approach to Cluster Validation," Pattern Recognition Letters, vol. 27, no. 13, pp. 1419-1430, 2006.
[24] J.J. Oliver, R.A. Baxter, and C.S. Wallace, "Unsupervised Learning Using MML," Proc. 13th Int'l Conf. Machine Learning (ICML '96), pp. 364-372, 1996.
[25] G. Schwarz, "Estimating the Dimension of a Model," Annals of Statistics, vol. 6, no. 2, pp. 461-464, 1978.
[26] A. Dempster, N. Laird, and D. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm," J. Royal Statistical Soc. (Series B), vol. 39, pp. 1-37, 1977.
[27] M.A.T. Figueiredo and A.K. Jain, "Unsupervised Learning of Finite Mixture Models," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 3, pp. 381-396, Mar. 2002.
[28] G. McLachlan and T. Krishnan, The EM Algorithm and Extensions. John Wiley & Sons, 1997.
[29] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum, 1981.
[30] J. Rissanen, Stochastic Complexity in Statistical Inquiry. World Scientific, 1989.
[31] M.M. Breunig, H.-P. Kriegel, R. Ng, and J. Sander, "LOF: Identifying Density-Based Local Outliers," Proc. ACM SIGMOD '00, pp. 93-104, 2000.
[32] J. Han and M. Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001.
[33] F. Angiulli and C. Pizzuti, "Outlier Mining in Large High-Dimensional Data Sets," IEEE Trans. Knowledge and Data Eng., vol. 17, no. 2, pp. 369-383, Feb. 2005.
[34] E.M. Knorr, R.T. Ng, and V. Tucakov, "Distance-Based Outliers: Algorithms and Applications," The VLDB J., vol. 8, nos. 3/4, pp. 237-253, 2000.
[35] T. Li, "A Unified View on Clustering Binary Data," Machine Learning, vol. 62, no. 3, pp. 199-215, 2006.
[36] A. Patrikainen and M. Meila, "Comparing Subspace Clusterings," IEEE Trans. Knowledge and Data Eng., vol. 18, no. 7, pp. 902-916, July 2006.
[37] C.H. Papadimitriou and K. Steiglitz, Combinatorial Optimization: Algorithms and Complexity. Prentice-Hall, 1982.
[38] B. Tjaden, "An Approach for Clustering Gene Expression Data with Error Information," BMC Bioinformatics, vol. 7, no. 17, 2006.
[39] K.A.J. Doherty, R.G. Adams, and N. Davey, "Unsupervised Learning with Normalised Data and Non-Euclidean Norms," Applied Soft Computing, vol. 7, no. 17, pp. 203-210, 2007.
[40] T.M. Martinetz, S.G. Berkovich, and K.J. Schulten, "Neural Gas Network for Vector Quantization and Its Application to Time-Series Prediction," IEEE Trans. Neural Networks, vol. 4, no. 4, pp. 558-569, 1993.


[41] B. Fritzke, "A Growing Neural Gas Network Learns Topologies," Advances in Neural Information Processing Systems, G. Tesauro, D.S. Touretzky, and T.K. Leen, eds., vol. 7, pp. 625-632, 1995.
[42] T. Kohonen, Self-Organizing Maps. Springer, 1997.
[43] P. Mitra, C.A. Murthy, and S.K. Pal, "Unsupervised Feature Selection Using Feature Similarity," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 3, pp. 301-312, Mar. 2002.
[44] A.K. Jain, R.P.W. Duin, and J. Mao, "Statistical Pattern Recognition: A Review," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 4-37, Jan. 2000.

Mohamed Bouguessa received the MSc degree in computer science in 2005. He is currently working toward the PhD degree in the Department of Computer Science, University of Sherbrooke, Quebec, Canada. His research interests include the area of data mining. His research projects include high-dimensional data, clustering, knowledge discovery in human-web interaction, and social networks analysis.

Shengrui Wang received the bachelor's degree from Hebei University, the MS degree from the Université Joseph Fourier, and the PhD degree from the Institut National Polytechnique de Grenoble. He is a full professor in the Department of Computer Science, University of Sherbrooke, Sherbrooke, Quebec, Canada. His research interests include pattern recognition, data mining, bioinformatics, image processing, artificial intelligence, and information retrieval. He is a member of the IEEE.
