
Computing Clusters of Correlation Connected Objects

Christian Böhm, Karin Kailing, Peer Kröger, Arthur Zimek
Institute for Computer Science, University of Munich

Proc. ACM Int. Conf. on Management of Data (SIGMOD), Paris, France, 2004

ABSTRACT

The detection of correlations between different features in a set of feature vectors is a very important data mining task because correlation indicates a dependency between the features or some association of cause and effect between them. This association can be arbitrarily complex, i.e. one or more features might be dependent on a combination of several other features. Well-known methods like the principal components analysis (PCA) can perfectly find correlations which are global, linear, not hidden in a set of noise vectors, and uniform, i.e. the same type of correlation is exhibited in all feature vectors. In many applications such as medical diagnosis, molecular biology, time sequences, or electronic commerce, however, correlations are not global since the dependency between features can be different in different subgroups of the set. In this paper, we propose a method called 4C (Computing Correlation Connected Clusters) to identify local subgroups of the data objects sharing a uniform but arbitrarily complex correlation. Our algorithm is based on a combination of PCA and density-based clustering (DBSCAN). Our method has a determinate result and is robust against noise. A broad comparative evaluation demonstrates the superior performance of 4C over competing methods such as DBSCAN, CLIQUE and ORCLUS.

1. INTRODUCTION

The tremendous amount of data produced nowadays in various application domains can only be fully exploited by efficient and effective data mining tools. One of the primary data mining tasks is clustering, which aims at partitioning the objects (described by points in a high dimensional feature space) of a data set into dense regions (clusters) separated by regions with low density (noise). Knowing the cluster structure is important and valuable because the different clusters often represent different classes of objects which have previously been unknown. Therefore, the clusters bring additional insight about the stored data set which can be exploited for various purposes such as improved marketing by customer segmentation, determination of homogeneous groups of web users through clustering of web logs, structuring large amounts of text documents by hierarchical clustering, or developing thematic maps from satellite images.

A second kind of hidden information that may be interesting to users is correlations in a data set. A correlation is a linear dependency between two or more features (attributes) of the data set. The most important method for detecting correlations is the principal components analysis (PCA), also known as the Karhunen-Loève transformation. Knowing correlations is also important and valuable because the dimensionality of the data set can be considerably reduced, which improves both the performance of similarity search and data mining as well as the accuracy. Moreover, knowing about the existence of a relationship between attributes enables one to detect hidden causalities (e.g. the influence of the age of a patient and the dose rate of medication on the course of his disease, or the coregulation of gene expression) or to gain financial advantage (e.g. in stock quota analysis), etc.

Methods such as PCA, however, are restricted because they can only be applied to the data set as a whole. Therefore, it is only possible to detect correlations which are expressed in all points or almost all points of the data set. For a lot of applications this is not the case. For instance, in the analysis of gene expression, we are facing the problem that a dependency between two genes exists only under certain conditions. Therefore, the correlation is visible only in a local subset of the data. Other subsets may be either not correlated at all, or they may exhibit completely different kinds of correlation (different features are dependent on each other). The correlation of the whole data set can be weak, even if strong correlations exist for local subsets of the data. Figure 1 shows a simple example, where two subsets of 2-dimensional points exhibit different correlations.

To the best of our knowledge, both concepts of clustering (i.e. finding densely populated subsets of the data) and correlation analysis have not yet been addressed as a combined task for data mining. The most relevant related approach is ORCLUS [1], but since it is k-means-based, it is very sensitive to noise and the locality of the analyzed correlations is usually too coarse, i.e. the number of objects taken into account for correlation analysis is too large (cf. Sections 2 and 6 for a detailed discussion). In this paper, we develop a new method which is capable of detecting local subsets of the data which exhibit strong correlations and which are densely populated (w.r.t. a given density threshold). We call such a subset a correlation connected cluster.


Figure 1: 1-Dimensional Correlation Lines. (a) 2D view. (b) Transposed view (attribute 1, attribute 2).


In lots of applications such correlation connected clusters are interesting. For example, in E-commerce (recommendation systems or target marketing), where sets of customers with similar behavior need to be detected, one searches for positive linear correlations. In DNA microarray analysis (gene expression analysis), negative linear correlations express the fact that two genes may be coregulated, i.e. if one has a high expression level, the other one is very low and vice versa. Usually such a coregulation will only exist in a small subset of conditions or cases, i.e. the correlation will be hidden locally in the dataset and cannot be detected by global techniques. Figures 1 and 2 show simple examples of what correlation connected clusters can look like. In Figure 1 the attributes exhibit two different forms of linear correlation. We observe that if for some points there is a linear correlation of all attributes, these points are located along a line. Figure 2 presents two examples where an attribute z is correlated to attributes x and y (i.e. z = a + bx + cy). In this case the set of points forms a 2-dimensional plane.

As stated above, in this paper, we propose an approach that meets both the goal of clustering and correlation analysis in order to find correlation connected clusters. The remainder is organized as follows: In Section 2 we review and discuss related work. In Section 3 we formalize our notion of correlation connected clusters. Based on this formalization, we present in Section 4 an algorithm called 4C (Computing Correlation Connected Clusters) to efficiently compute such correlation connected clusters. In Section 5 we analyze the computational complexity of our algorithm, while Section 6 contains an extensive experimental evaluation of 4C. Section 7 concludes the paper and gives some hints at future work.

2. RELATED WORK

Traditional clustering algorithms such as k-means or the EM algorithm search for spatial clusters which are spherically shaped. In [10] and [5] two algorithms are proposed which are able to find clusters of arbitrary shape. However, these approaches are not able to distinguish between arbitrarily shaped clusters and correlation clusters. In [10] the density-based algorithm DBSCAN is introduced. We will describe the basic concepts of this approach in detail in the next section. In [5] the authors propose the algorithm FC (Fractal Clustering) to find clusters of arbitrary shape. The paper presents only experiments for two-dimensional data sets, so it is not clear whether the fractal dimension is really stable in higher dimensions. Furthermore, the shapes of clusters depend on the type of correlation of the involved points. Thus, in the case of linear correlations it is more target-oriented to search for shapes like lines or hyperplanes than for arbitrarily shaped clusters.

As most traditional clustering algorithms fail to detect meaningful clusters in high-dimensional data, a lot of research has been done in the area of subspace clustering in recent years. In the following we examine to what extent subspace clustering algorithms are able to capture local data correlations and find clusters of correlated objects. The principal axes of correlated data are arbitrarily oriented. In contrast, subspace clustering techniques like CLIQUE [3] and its successors MAFIA [11] or ENCLUS [9] or projected clustering algorithms like PROCLUS [2] and DOC [15] only find axis-parallel projections of the data. In the evaluation part we show that CLIQUE, as one representative algorithm, is in fact not able to find correlation clusters. Therefore we turn our attention to ORCLUS [1], which is a k-means related projected clustering algorithm allowing clusters to exist in arbitrarily oriented subspaces. The problem of this approach is that the user has to specify the number of clusters in advance. If this guess does not correspond to the actual number of clusters, the results of ORCLUS deteriorate. A second problem, which can also be seen in the evaluation part, is noisy data. In this case the clusters found by ORCLUS are far from optimal, since ORCLUS assigns each object to a cluster and thus cannot handle noise efficiently.

Based on the fractal (intrinsic) dimensionality of a data set, the authors of [14] present a global dimensionality reduction method. Correlation in the data leads to the phenomenon that the embedding dimension of a data set (in other words the number of attributes of the data set) and the intrinsic dimension (the dimension of the spatial object represented by the data) can differ a lot. The intrinsic dimension (correlation fractal dimension) is used to reduce the dimensionality of the data. As this approach adopts a global view on the data set and does not account for local data distributions, it cannot capture local subspace correlations. Therefore it is only useful and applicable if the underlying correlation affects all data objects. Since this is not the case for most real-world data, in [8] a local dimensionality reduction technique for correlated data is proposed which is similar to [1]. The authors focus on identifying correlated clusters for enhancing the indexing of high dimensional data only. Unfortunately, they do not give any hint to what extent their heuristic-based algorithm can also be used to gain new insight into the correlations contained in the data.

Figure 2: 2-Dimensional Correlation Planes. (a) 3D view (attributes 1-3). (b) Transposed view of one plane.

In [21] the move-based algorithm FLOC computing near-optimal δ-clusters is presented. A transposed view of the data is used to show the correlations which are captured by the δ-cluster model. Each data object is shown as a curve where the attribute values of each object are connected. Examples can be seen in Figures 1(b) and 2(b). A cluster is regarded as a subset of objects and attributes for which the participating objects show the same (or a similar) tendency rather than being close to each other on the associated subset of dimensions. The δ-cluster model concentrates on two forms of coherence, namely shifting (or addition) and amplification (or production). In the case of amplification coherence, for example, the vectors representing the objects must be multiples of each other. The authors state that this can easily be transformed to the problem of finding shifting coherent δ-clusters by applying a logarithmic function to each object. Thus they focus on finding shifting coherent δ-clusters and introduce the metric of residue to measure the coherency among objects of a given cluster. An advantage is that thereby they can easily handle missing attribute values. But in contrast to our approach, the δ-cluster model limits itself to a very special form of correlation where all attributes are positively linearly correlated. It does not include negative correlations or correlations where one attribute is determined by two or more other attributes. In these cases searching for a trend is no longer possible, as can be seen in Figures 1 and 2. Let us note that such complex dependencies cannot be illustrated by transposed views of the data. The same considerations apply for the very similar p-cluster model introduced in [19].

3. THE NOTION OF CORRELATION CONNECTED CLUSTERS

In this section, we formalize the notion of a correlation connected cluster. Let D be a database of d-dimensional feature vectors (D ⊆ R^d). An element P ∈ D is called point or object. The value of the i-th attribute (1 ≤ i ≤ d) of P is denoted by p_i (i.e. P = (p_1, ..., p_d)^T). Intuitively, a correlation connected cluster is a dense region of points in the d-dimensional feature space having at least one principal axis with low variation along this axis. Thus, a correlation connected cluster has two different properties: density and correlation. In the following, we will first address these two properties and then merge these ingredients to formalize our notion of correlation connected clusters.

3.1 Density-Connected Sets

The density-based notion is a common approach for clustering used by various clustering algorithms such as DBSCAN [10], DBCLASD [20], DENCLUE [12], and OPTICS [4]. All these methods search for regions of high density in a feature space that are separated by regions of lower density.

A typical density-based clustering algorithm needs two parameters to define the notion of density: first, a parameter µ specifying the minimum number of objects, and second, a parameter ε specifying a volume. These two parameters determine a density threshold for clustering.

Our approach follows the formal definitions of density-based clusters underlying the algorithm DBSCAN. The formal definition of the clustering notion is presented and discussed in full detail in [10]. In the following we give a short summary of all necessary definitions.

Definition 1 (ε-neighborhood).
Let ε ∈ R and O ∈ D. The ε-neighborhood of O, denoted by N_ε(O), is defined by

N_ε(O) = {X ∈ D | dist(O, X) ≤ ε}.

Based on the two input parameters ε and µ, dense regions can be defined by means of core objects:


Definition 2 (core object).
Let ε ∈ R and µ ∈ N. An object O ∈ D is called core object w.r.t. ε and µ, if its ε-neighborhood contains at least µ objects, formally:

Core_den(O) ⇔ |N_ε(O)| ≥ µ.

Let us note that we use the acronym "den" for the density parameters ε and µ. In the following, we omit the parameters ε and µ wherever the context is clear and use "den" instead.

A core object O can be used to expand a cluster, with all the density-connected objects of O. To find these objects the following concepts are used.

Definition 3 (direct density-reachability).
Let ε ∈ R and µ ∈ N. An object P ∈ D is directly density-reachable from Q ∈ D w.r.t. ε and µ, if Q is a core object and P is an element of N_ε(Q), formally:

DirReach_den(Q, P) ⇔ Core_den(Q) ∧ P ∈ N_ε(Q).

Let us note that direct density-reachability is symmetric only for core objects.

Definition 4 (density-reachability).
Let ε ∈ R and µ ∈ N. An object P ∈ D is density-reachable from Q ∈ D w.r.t. ε and µ, if there is a chain of objects P_1, ..., P_n ∈ D, P_1 = Q, P_n = P such that P_{i+1} is directly density-reachable from P_i, formally:

Reach_den(Q, P) ⇔ ∃ P_1, ..., P_n ∈ D : P_1 = Q ∧ P_n = P ∧ ∀ i ∈ {1, ..., n−1} : DirReach_den(P_i, P_{i+1}).

Density-reachability is the transitive closure of direct density-reachability. However, it is still not symmetric in general.

Definition 5 (density-connectivity).
Let ε ∈ R and µ ∈ N. An object P ∈ D is density-connected to an object Q ∈ D w.r.t. ε and µ, if there is an object O ∈ D such that both P and Q are density-reachable from O, formally:

Connect_den(Q, P) ⇔ ∃ O ∈ D : Reach_den(O, Q) ∧ Reach_den(O, P).

Density-connectivity is a symmetric relation. A density-connected cluster is defined as a set of density-connected objects which is maximal w.r.t. density-reachability [10].

Definition 6 (density-connected set).
Let ε ∈ R and µ ∈ N. A non-empty subset C ⊆ D is called a density-connected set w.r.t. ε and µ, if all objects in C are density-connected and C is maximal w.r.t. density-reachability, formally:

ConSet_den(C) ⇔

(1) Connectivity: ∀ O, Q ∈ C : Connect_den(O, Q)

(2) Maximality: ∀ P, Q ∈ D : Q ∈ C ∧ Reach_den(Q, P) ⇒ P ∈ C.

Using these concepts, DBSCAN is able to detect arbitrarily shaped clusters by one single pass over the data. To do so, DBSCAN uses the fact that a density-connected cluster can be detected by finding one of its core objects O and computing all objects which are density-reachable from O. The correctness of DBSCAN can be formally proven (cf. lemmata 1 and 2 in [10], proofs in [17]). Although DBSCAN is not in a strong sense deterministic (the run of the algorithm depends on the order in which the points are stored), both the run-time as well as the result (number of detected clusters and association of core objects to clusters) are determinate. The worst case time complexity of DBSCAN is O(n log n) assuming an efficient index and O(n^2) if no index exists.
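To make Definitions 1 and 2 concrete, the following is a minimal sketch of the ε-neighborhood query and the core object test using a linear scan and the Euclidean distance. The authors implemented their methods in Java; the Python/NumPy helper names used here and in the later sketches are our own and are reused below:

import numpy as np

def eps_neighborhood(data, o_idx, eps):
    # N_eps(O): indices of all points within Euclidean distance eps of point o_idx (Definition 1).
    dists = np.linalg.norm(data - data[o_idx], axis=1)
    return np.where(dists <= eps)[0]

def is_core(data, o_idx, eps, mu):
    # Core_den(O): true iff the eps-neighborhood contains at least mu objects (Definition 2).
    return len(eps_neighborhood(data, o_idx, eps)) >= mu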

3.2 Correlation Sets

In order to identify correlation connected clusters (regions in which the points exhibit correlation) and to distinguish them from usual clusters (regions of high point density only) we are interested in all sets of points with an intrinsic dimensionality that is considerably smaller than the embedding dimensionality of the data space (e.g. a line or a plane in a three or higher dimensional space). There are several methods to measure the intrinsic dimensionality of a point set in a region, such as the fractal dimension or the principal components analysis (PCA). We choose PCA because the fractal dimension appeared to be not stable enough in our first experiments.

The PCA determines the covariance matrix M = [m_ij] with m_ij = Σ_{s ∈ S} s_i · s_j of the considered point set S, and decomposes it into an orthonormal matrix V called eigenvector matrix and a diagonal matrix E called eigenvalue matrix such that M = V E V^T. The eigenvectors represent the principal axes of the data set whereas the eigenvalues represent the variance along these axes. In case of a linear dependency between two or more attributes of the point set (correlation), one or more eigenvalues are close to zero. A set forms a λ-dimensional correlation hyperplane if d − λ eigenvalues fall below a given threshold δ ≈ 0. Since the eigenvalues of different sets exhibiting different densities may differ a lot in their absolute values, we normalize the eigenvalues by mapping them onto the interval [0, 1]. This normalization is denoted by Ω and simply divides each eigenvalue e_i by the maximum eigenvalue e_max. We call the eigenvalues e_i with Ω(e_i) ≤ δ close to zero.

Definition 7 (λ-dimensional linear correlation set).
Let S ⊆ D, λ ∈ N (λ ≤ d), EV = {e_1, ..., e_d} the eigenvalues of S in descending order (i.e. e_i ≥ e_{i+1}) and δ ∈ R (δ ≈ 0). S forms a λ-dimensional linear correlation set w.r.t. δ if at least d − λ eigenvalues of S are close to zero, formally:

CorSet^λ_δ(S) ⇔ |{e_i ∈ EV | Ω(e_i) ≤ δ}| ≥ d − λ,

where Ω(e_i) = e_i / e_1.

Page 5: Computing Clusters of Correlation Connected Objects · Computing Clusters of Correlation Connected Objects ... set of feature vectors is a very important data mining task ... have

This condition states that the variance of S along d − λ principal axes is low and therefore the objects of S form a λ-dimensional hyperplane. If we drop the index λ and speak of a correlation set in the following, we mean a λ-dimensional linear correlation set, where λ is not specified but fixed.

Definition 8 (correlation dimension).
Let S ⊆ D be a linear correlation set w.r.t. δ ∈ R. The number of eigenvalues with Ω(e_i) > δ is called the correlation dimension, denoted by CorDim(S).

Let us note that if S is a λ-dimensional linear correlation set, then CorDim(S) ≤ λ. The correlation dimension of a linear correlation set S corresponds to the intrinsic dimension of S.
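As an illustration of Definitions 7 and 8, the following sketch (building on the NumPy import above) derives the eigenvalues of a point set via PCA, normalizes them with Ω, and evaluates the correlation set predicate and the correlation dimension. We use the standard mean-centered covariance (np.cov) as a stand-in for the matrix M described above:

def normalized_eigenvalues(points):
    # Eigenvalues of the covariance matrix of the point set, in descending order,
    # together with their normalization Omega(e_i) = e_i / e_1.
    cov = np.cov(points, rowvar=False)
    eigvals = np.linalg.eigvalsh(cov)[::-1]
    omega = eigvals / eigvals[0]
    return eigvals, omega

def is_correlation_set(points, lam, delta):
    # CorSet^lambda_delta(S): at least d - lambda normalized eigenvalues are <= delta (Definition 7).
    d = points.shape[1]
    _, omega = normalized_eigenvalues(points)
    return int(np.sum(omega <= delta)) >= d - lam

def correlation_dimension(points, delta):
    # CorDim(S): number of normalized eigenvalues larger than delta (Definition 8).
    _, omega = normalized_eigenvalues(points)
    return int(np.sum(omega > delta))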

3.3 Clusters as Correlation-Connected Sets

A correlation connected cluster can be regarded as a maximal set of density-connected points that exhibit uniform correlation. We can formalize the concept of correlation connected sets by merging the concepts described in the previous two subsections: density-connected sets (cf. Definition 6) and correlation sets (cf. Definition 7). The intuition of our formalization is to consider those points as core objects of a cluster which have an appropriate correlation dimension in their neighborhood. Therefore, we associate each point P with a similarity matrix M_P which is determined by PCA of the points in the ε-neighborhood of P. For convenience we call V_P and E_P the eigenvectors and eigenvalues of P, respectively. A point P is inserted into a cluster if it has the same or a similar similarity matrix as the points in the cluster. To achieve this goal, our algorithm looks for points that are close to the principal axis (or axes) of those points which are already in the cluster. We will define a similarity measure M̂_P for the efficient search of such points.

We start with the formal definition of the covariance matrix M_P associated with a point P.

Definition 9 (covariance matrix).
Let P ∈ D. The matrix M_P = [m_ij] with

m_ij = Σ_{s ∈ N_ε(P)} s_i · s_j   (1 ≤ i, j ≤ d)

is called the covariance matrix of the point P. V_P and E_P (with M_P = V_P E_P V_P^T) as determined by PCA of M_P are called the eigenvectors and eigenvalues of the point P, respectively.

We can now define the new similarity measure M̂_P which searches for points in the direction of highest variance of M_P (the major axes). Theoretically, M_P could be directly used as a similarity measure, i.e.

dist_{M_P}(P, Q) = √((P − Q) M_P (P − Q)^T)   where P, Q ∈ D.

Figure 3(a) shows the set of points which lies in an ε-neighborhood of the point using M_P as similarity measure. The distance measure puts high weights on those axes with a high variance whereas directions with a low variance are associated with low weights. This is usually desired in similarity search applications where directions of high variance have a high distinguishing power and, in contrast, directions of low variance are negligible.

Obviously, for our purpose of detecting correlation clusters, we need quite the opposite. We want to search for points in the direction of highest variance of the data set. Therefore, we need to assign low weights to the direction of highest variance in order to shape the ellipsoid such that it reflects the data distribution (cf. Figure 3(b)). The solution is to change large eigenvalues into smaller ones and vice versa. We use two fixed values, 1 and a parameter κ ≫ 1, rather than e.g. inverting the eigenvalues, in order to avoid problems with singular covariance matrices. The number 1 is a natural choice because the corresponding semi-axes of the ellipsoid then have length ε. The parameter κ controls the "thickness" of the λ-dimensional correlation line or plane, i.e. the tolerated deviation.

This is formally captured in the following definition:

Definition 10 (correlation similarity matrix).
Let P ∈ D and V_P, E_P the corresponding eigenvectors and eigenvalues of the point P. Let κ ∈ R be a constant with κ ≫ 1. The new eigenvalue matrix Ê_P with entries ê_i (i = 1, ..., d) is computed from the eigenvalues e_1, ..., e_d in E_P according to the following rule:

ê_i = 1 if Ω(e_i) > δ
ê_i = κ if Ω(e_i) ≤ δ

where Ω is the normalization of the eigenvalues onto [0, 1] as described above. The matrix M̂_P = V_P Ê_P V_P^T is called the correlation similarity matrix. The correlation similarity measure associated with point P is denoted by

dist_P(P, Q) = √((P − Q) · M̂_P · (P − Q)^T).

Figure 3(b) shows the ε-neighborhood according to the correlation similarity matrix M̂_P. As described above, the parameter κ specifies how much deviation from the correlation is allowed. The greater the parameter κ, the tighter and clearer the correlations which will be computed. It empirically turned out that our algorithm presented in Section 4 is rather insensitive to the choice of κ. A good suggestion is to set κ = 50 in order to achieve satisfying results; thus, for the sake of simplicity, we omit the parameter κ in the following.
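The construction in Definition 10 can be sketched as follows: the eigenvalues of the local covariance matrix are mapped to 1 or κ and reassembled into M̂_P, which then defines the correlation similarity measure. This is a minimal sketch with our own helper names, reusing the NumPy import above and np.cov in place of the uncentered sum of Definition 9:

def correlation_similarity_matrix(neighborhood, delta, kappa=50.0):
    # M_hat_P = V_P * E_hat_P * V_P^T (Definition 10): eigenvalues whose normalized
    # value exceeds delta are mapped to 1, all others to kappa >> 1.
    cov = np.cov(neighborhood, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    omega = eigvals / eigvals.max()
    e_hat = np.where(omega > delta, 1.0, kappa)
    return eigvecs @ np.diag(e_hat) @ eigvecs.T

def correlation_distance(m_hat, p, q):
    # dist_P(P, Q) = sqrt((P - Q) * M_hat_P * (P - Q)^T).
    diff = p - q
    return np.sqrt(diff @ m_hat @ diff)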

Figure 3: Correlation ε-neighborhood of a point P according to (a) M_P and (b) M̂_P (the major axis of the data set is indicated).

Figure 4: Symmetry of the correlation ε-neighborhood: (a) P ∈ N^{M̂_Q}_ε(Q). (b) P ∉ N^{M̂_Q}_ε(Q).

Using this similarity measure, we can define the notions of correlation core objects and correlation reachability. However, in order to define correlation-connectivity as a symmetric relation, we face the problem that the similarity measure in Definition 10 is not symmetric, because dist_P(P, Q) = dist_Q(Q, P) does not hold in general (cf. Figure 4(b)). Symmetry, however, is important to avoid ambiguity of the clustering result. If an asymmetric similarity measure is used in DBSCAN, a different clustering result can be obtained depending on the order of processing (e.g. which point is selected as the starting object) because the symmetry of density-connectivity depends on the symmetry of direct density-reachability for core objects. Although the result is typically not seriously affected by this ambiguity effect, we avoid this problem easily by an extension of our similarity measure which makes it symmetric. The trick is to consider both similarity measures, dist_P(P, Q) as well as dist_Q(P, Q), and to combine them by a suitable arithmetic operation such as the maximum of the two. Based on these considerations, we define the correlation ε-neighborhood as a symmetric concept:

Definition 11 (correlation ε-neighborhood).
Let ε ∈ R. The correlation ε-neighborhood of an object O ∈ D, denoted by N^{M̂_O}_ε(O), is defined by:

N^{M̂_O}_ε(O) = {X ∈ D | max{dist_O(O, X), dist_X(O, X)} ≤ ε}.

The symmetry of the correlation ε-neighborhood is illustrated in Figure 4. Correlation core objects can now be defined as follows.

Definition 12 (correlation core object).
Let ε, δ ∈ R and µ, λ ∈ N. A point O ∈ D is called correlation core object w.r.t. ε, µ, δ, and λ (denoted by Core^cor_den(O)), if its ε-neighborhood is a λ-dimensional linear correlation set and its correlation ε-neighborhood contains at least µ points, formally:

Core^cor_den(O) ⇔ CorSet^λ_δ(N_ε(O)) ∧ |N^{M̂_O}_ε(O)| ≥ µ.

Let us note that in Core^cor_den the acronym "cor" refers to the correlation parameters δ and λ. In the following, we omit the parameters ε, µ, δ, and λ wherever the context is clear and use "den" and "cor" instead.
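Definitions 11 and 12 can be combined into a test for correlation core objects as sketched below; the symmetric neighborhood takes the maximum of the two point-specific distances, as required by Definition 11. The sketch reuses the helpers introduced above and assumes the correlation similarity matrices have been precomputed for all points:

def corr_eps_neighborhood(data, m_hats, o_idx, eps):
    # N^M_hat_O_eps(O): all X with max(dist_O(O,X), dist_X(O,X)) <= eps (Definition 11).
    result = []
    for x_idx in range(len(data)):
        d_o = correlation_distance(m_hats[o_idx], data[o_idx], data[x_idx])
        d_x = correlation_distance(m_hats[x_idx], data[o_idx], data[x_idx])
        if max(d_o, d_x) <= eps:
            result.append(x_idx)
    return result

def is_correlation_core(data, m_hats, o_idx, eps, mu, lam, delta):
    # Core^cor_den(O): the eps-neighborhood is a lambda-dimensional linear correlation set
    # and the correlation eps-neighborhood contains at least mu points (Definition 12).
    neigh = eps_neighborhood(data, o_idx, eps)
    return (is_correlation_set(data[neigh], lam, delta)
            and len(corr_eps_neighborhood(data, m_hats, o_idx, eps)) >= mu)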

Definition 13 (direct correlation-reachability).
Let ε, δ ∈ R and µ, λ ∈ N. A point P ∈ D is direct correlation-reachable from a point Q ∈ D w.r.t. ε, µ, δ, and λ (denoted by DirReach^cor_den(Q, P)) if Q is a correlation core object, the correlation dimension of N_ε(P) is at most λ, and P ∈ N^{M̂_Q}_ε(Q), formally:

DirReach^cor_den(Q, P) ⇔

(1) Core^cor_den(Q)

(2) CorDim(N_ε(P)) ≤ λ

(3) P ∈ N^{M̂_Q}_ε(Q).

Correlation-reachability is symmetric for correlation core objects (cf. Figure 4(b)). Both objects P and Q must find the other object in their corresponding correlation ε-neighborhood.

Definition 14 (correlation-reachability).
Let ε, δ ∈ R (δ ≈ 0) and µ, λ ∈ N. An object P ∈ D is correlation-reachable from an object Q ∈ D w.r.t. ε, µ, δ, and λ (denoted by Reach^cor_den(Q, P)), if there is a chain of objects P_1, ..., P_n such that P_1 = Q, P_n = P and P_{i+1} is direct correlation-reachable from P_i, formally:

Reach^cor_den(Q, P) ⇔ ∃ P_1, ..., P_n ∈ D : P_1 = Q ∧ P_n = P ∧ ∀ i ∈ {1, ..., n−1} : DirReach^cor_den(P_i, P_{i+1}).

It is easy to see that correlation-reachability is the transitive closure of direct correlation-reachability.

Definition 15 (correlation-connectivity).
Let ε ∈ R and µ ∈ N. An object P ∈ D is correlation-connected to an object Q ∈ D if there is an object O ∈ D such that both P and Q are correlation-reachable from O, formally:

Connect^cor_den(Q, P) ⇔ ∃ O ∈ D : Reach^cor_den(O, Q) ∧ Reach^cor_den(O, P).

Correlation-connectivity is a symmetric relation. A correlation-connected cluster can now be defined as a maximal correlation-connected set:

Definition 16 (correlation-connected set).
Let ε, δ ∈ R and µ, λ ∈ N. A non-empty subset C ⊆ D is called a correlation-connected set w.r.t. ε, µ, δ, and λ, if all objects in C are correlation-connected and C is maximal w.r.t. correlation-reachability, formally:

ConSet^cor_den(C) ⇔

(1) Connectivity: ∀ O, Q ∈ C : Connect^cor_den(O, Q)

(2) Maximality: ∀ P, Q ∈ D : Q ∈ C ∧ Reach^cor_den(Q, P) ⇒ P ∈ C.

The following two lemmata are important for validating the correctness of our clustering algorithm. Intuitively, they state that we can discover a correlation-connected set for a given parameter setting in a two-step approach: First, choose an arbitrary correlation core object O from the database. Second, retrieve all objects that are correlation-reachable from O. This approach yields the correlation-connected set containing O.

Lemma 1.
Let P ∈ D. If P is a correlation core object, then the set of objects which are correlation-reachable from P is a correlation-connected set, formally:

Core^cor_den(P) ∧ C = {O ∈ D | Reach^cor_den(P, O)} ⇒ ConSet^cor_den(C).

Proof.

(1) C ≠ ∅:
By assumption, Core^cor_den(P) and thus CorDim(N_ε(P)) ≤ λ.
⇒ DirReach^cor_den(P, P)
⇒ Reach^cor_den(P, P)
⇒ P ∈ C.

(2) Maximality:
Let X ∈ C and Y ∈ D and Reach^cor_den(X, Y).
⇒ Reach^cor_den(P, X) ∧ Reach^cor_den(X, Y)
⇒ Reach^cor_den(P, Y) (since correlation-reachability is a transitive relation)
⇒ Y ∈ C.

(3) Connectivity:
∀ X, Y ∈ C : Reach^cor_den(P, X) ∧ Reach^cor_den(P, Y)
⇒ Connect^cor_den(X, Y) (via P).

Lemma 2.
Let C ⊆ D be a correlation-connected set. Let P ∈ C be a correlation core object. Then C equals the set of objects which are correlation-reachable from P, formally:

ConSet^cor_den(C) ∧ P ∈ C ∧ Core^cor_den(P) ⇒ C = {O ∈ D | Reach^cor_den(P, O)}.

Proof.

Let C̄ = {O ∈ D | Reach^cor_den(P, O)}. We have to show that C̄ = C:

(1) C̄ ⊆ C: obvious from the definition of C̄ and the maximality of C.

(2) C ⊆ C̄: Let Q ∈ C. By assumption, P ∈ C and ConSet^cor_den(C).
⇒ ∃ O ∈ C : Reach^cor_den(O, P) ∧ Reach^cor_den(O, Q)
⇒ Reach^cor_den(P, O) (since both O and P are correlation core objects and correlation-reachability is symmetric for correlation core objects)
⇒ Reach^cor_den(P, Q) (transitivity of correlation-reachability)
⇒ Q ∈ C̄.

4. ALGORITHM 4C

In the following we describe the algorithm 4C, which performs one pass over the database to find all correlation clusters for a given parameter setting. The pseudocode of the algorithm 4C is given in Figure 5. At the beginning each object is marked as unclassified. During the run of 4C all objects are either assigned a certain cluster identifier or marked as noise. For each object which is not yet classified, 4C checks whether this object is a correlation core object (see STEP 1 in Figure 5). If the object is a correlation core object, the algorithm expands the cluster belonging to this object (STEP 2.1). Otherwise the object is marked as noise (STEP 2.2). To find a new cluster, 4C starts in STEP 2.1 with an arbitrary correlation core object O and searches for all objects that are correlation-reachable from O. This is sufficient to find the whole cluster containing the object O, due to Lemma 2. When 4C enters STEP 2.1 a new cluster identifier "clusterID" is generated which will be assigned to all objects found in STEP 2.1. 4C begins by inserting all objects in the correlation ε-neighborhood of object O into a queue. For each object in the queue it computes all directly correlation-reachable objects and inserts those objects into the queue which are still unclassified. This is repeated until the queue is empty.

As discussed in Section 3, the results of 4C do not depend on the order of processing, i.e. the resulting clustering (number of clusters and association of core objects to clusters) is determinate.

algorithm 4C(D, ε, µ, λ, δ)
// assumption: each object in D is marked as unclassified
for each unclassified O ∈ D do
  STEP 1. test Core^cor_den(O) predicate:
    compute N_ε(O);
    if |N_ε(O)| ≥ µ then
      compute M_O;
      if CorDim(N_ε(O)) ≤ λ then
        compute M̂_O and N^{M̂_O}_ε(O);
        test |N^{M̂_O}_ε(O)| ≥ µ;
  STEP 2.1. if Core^cor_den(O) expand a new cluster:
    generate new clusterID;
    insert all X ∈ N^{M̂_O}_ε(O) into queue Φ;
    while Φ ≠ ∅ do
      Q = first object in Φ;
      compute R = {X ∈ D | DirReach^cor_den(Q, X)};
      for each X ∈ R do
        if X is unclassified or noise then
          if X is unclassified then
            insert X into Φ;
          assign current clusterID to X;
      remove Q from Φ;
  STEP 2.2. if not Core^cor_den(O), O is noise:
    mark O as noise;
end.

Figure 5: Pseudocode of the 4C algorithm.
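For concreteness, the control flow of Figure 5 can be transcribed into a compact sketch that reuses the helper functions from Section 3. This is a straightforward, unoptimized transcription under our stated assumptions (linear-scan range queries, correlation similarity matrices precomputed for all points, no handling of degenerate neighborhoods), not the authors' Java implementation:

from collections import deque

NOISE, UNCLASSIFIED = -1, None

def cluster_4c(data, eps, mu, lam, delta, kappa=50.0):
    # Sketch of the 4C main loop (Figure 5); returns one cluster id per point, or NOISE.
    n = len(data)
    # Precompute M_hat for every point from its eps-neighborhood (Definitions 9 and 10).
    m_hats = [correlation_similarity_matrix(data[eps_neighborhood(data, i, eps)], delta, kappa)
              for i in range(n)]
    labels = [UNCLASSIFIED] * n
    cluster_id = 0
    for o in range(n):
        if labels[o] is not UNCLASSIFIED:
            continue
        # STEP 1: test the correlation core object predicate.
        if not is_correlation_core(data, m_hats, o, eps, mu, lam, delta):
            labels[o] = NOISE                      # STEP 2.2
            continue
        # STEP 2.1: expand a new cluster; o lies in its own correlation eps-neighborhood,
        # so it is labeled when its own direct-reachability set is processed below.
        queue = deque(corr_eps_neighborhood(data, m_hats, o, eps))
        while queue:
            q = queue.popleft()
            if not is_correlation_core(data, m_hats, q, eps, mu, lam, delta):
                continue                           # q has no directly reachable objects
            for x in corr_eps_neighborhood(data, m_hats, q, eps):
                # DirReach additionally requires CorDim(N_eps(X)) <= lambda (Definition 13).
                if correlation_dimension(data[eps_neighborhood(data, x, eps)], delta) > lam:
                    continue
                if labels[x] is UNCLASSIFIED:
                    labels[x] = cluster_id
                    queue.append(x)
                elif labels[x] == NOISE:
                    labels[x] = cluster_id
        cluster_id += 1
    return labels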

5. COMPLEXITY ANALYSIS

The computational complexity with respect to the number of data points as well as the dimensionality of the data space is an important issue because the proposed algorithms are typically applied to large data sets of high dimensionality. The idea of our correlation connected clustering method is founded on DBSCAN, a density-based clustering algorithm for Euclidean data spaces. The complexity of the original DBSCAN algorithm depends on the existence of an index structure for high dimensional data spaces. The worst case complexity is O(n^2), but the existence of an efficient index reduces the complexity to O(n log n) [10]. DBSCAN is linear in the dimensionality of the data set for the Euclidean distance metric. If a quadratic form distance metric is applied instead of the Euclidean distance (which enables user adaptability of the distance function), the time complexity of DBSCAN is O(d^2 · n log n). In contrast, subspace clustering methods such as CLIQUE are known to be exponential in the number of dimensions [1, 3].

We begin our analysis with the assumption that no index structure is available.

Lemma 3. The overall worst-case time complexity of our algorithm on top of the sequential scan of the data set is O(d^2 · n^2 + d^3 · n).

Proof. Our algorithm has to associate each point of the data set with a similarity measure that is used for searching neighbors (cf. Definition 10). We assume that the corresponding similarity matrix must be computed once for each point, and that it can be held in the cache until it is no longer needed (it can be easily decided whether or not the similarity matrix can be safely discarded). The covariance matrix is filled with the result of a Euclidean range query which can be evaluated in O(d · n) time. Then the matrix is decomposed using PCA which requires O(d^3) time. For all points together, we have O(d · n^2 + d^3 · n). Checking the correlation core point property according to Definition 12 and expanding a correlation connected cluster requires for each point the evaluation of a range query with a quadratic form distance measure, which can be done in O(d^2 · n). For all points together (including the above cost for the determination of the similarity matrix), we obtain a worst-case time complexity of O(d^2 · n^2 + d^3 · n).

Under the assumption that an efficient index structure for high dimensional data spaces [7, 6] is available, the complexity of all range queries is reduced from O(n) to O(log n). Let us note that we can use Euclidean range queries as a filter step for the quadratic form range queries because no semi-axis of the corresponding ellipsoid exceeds ε. Therefore, the overall time complexity in this case is given as follows:

Lemma 4. The overall worst-case time complexity of our algorithm on top of an efficient index structure for high dimensional data is O(d^2 · n log n + d^3 · n).

Proof. Analogous to Lemma 3.

6. EVALUATION

In this section, we present a broad evaluation of 4C. We implemented 4C as well as the three comparative methods DBSCAN, CLIQUE, and ORCLUS in Java. All experiments were run on a Linux workstation with a 2.0 GHz CPU and 2.0 GB RAM.

6.1 Efficiency

According to Section 5, the runtime of 4C scales superlinearly with the number of input records. This is illustrated in Figure 6, which shows the results of 4C applied to synthetic 2-dimensional data of variable size.

6.2 Effectiveness

We evaluated the effectiveness of 4C on several synthetic datasets as well as on real world datasets including gene expression data and metabolome data. In addition, we compared the quality of the results of our method to the quality of the results of DBSCAN, ORCLUS, and CLIQUE. In all our experiments, we set the parameter κ = 50 as suggested in Section 3.3.

6.2.1 Synthetic Datasets

We first applied 4C on several synthetic datasets (with 2 ≤ d ≤ 30) consisting of several dense, linear correlations. In all cases, 4C had no problems identifying the correlation-connected clusters. As an example, Figure 7 illustrates the transposed view of the three clusters and the noise 4C found on a sample 10-dimensional synthetic dataset consisting of approximately 1,000 points.

Figure 6: Scalability against database size (runtime in s × 1,000 vs. database size × 1,000).

Figure 7: Clusters found by 4C on a 10D synthetic dataset (Dataset D; transposed views of Clusters 1-3 and noise). Parameters: ε = 10.0, µ = 5, λ = 2, δ = 0.1.

6.2.2 Real World Datasets

Gene Expression Data. We applied 4C to the gene expression dataset of [18]. The dataset is derived from time series experiments on the yeast mitotic cell cycle. The expression levels of approximately 3000 genes are measured at 17 different time slots. Thus, we face a 17-dimensional data space to search for correlations indicating coregulated genes. 4C found 60 correlation connected clusters with few coregulated genes (10-20). Such small cluster sizes are quite reasonable from a biological perspective. The transposed views of four sample clusters are depicted in Figure 8. All four clusters exhibit simple linear correlations on a subset of their attributes. Let us note that we also found other linear correlations which are rather complex to visualize. We also analyzed the results of our correlation clusters (based on the publicly available information resources on the yeast genome [16]) and found several biologically important implications. For example, one cluster consists of several genes coding for proteins related to the assembly of the spindle pole required for mitosis (e.g. KIP1, SLI15, SPC110, SPC25, and NUD1). Another cluster contains several genes coding for structural constituents of the ribosome (e.g. RPL4B, RPL15A, RPL17B, and RPL39). The functional relationships of the genes in the clusters confirm the significance of the computed coregulation.

Figure 8: Sample clusters found by 4C on the gene expression dataset (transposed views of Sample Clusters 1-4). Parameters: ε = 25.0, µ = 8, λ = 8, δ = 0.01.

Metabolome Data. We applied 4C on a metabolome dataset [13]. The dataset consists of the concentrations of 43 metabolites in 2,000 human newborns. The newborns were labeled according to some specific metabolic diseases. Thus, the dataset consists of 2,000 data points with d = 43. 4C detected six correlation connected sets which are visualized in Figure 9. Clusters one and two (in the lower left corner, marked with "control") consist of healthy newborns, whereas the other clusters consist of newborns having one specific disease (e.g. "PKU" or "LCHAD"). The group of newborns suffering from "PKU" was split into three clusters. Several ill as well as healthy newborns were classified as noise.

Figure 9: Clusters found by 4C on the metabolome dataset (three "PKU" clusters, one "LCHAD" cluster, two "control" clusters). Parameters: ε = 150.0, µ = 8, λ = 20, δ = 0.1.

6.2.3 Comparisons to Other Methods

We compared the effectiveness of 4C with related clustering methods, in particular the density-based clustering algorithm DBSCAN, the subspace clustering algorithm CLIQUE, and the projected clustering algorithm ORCLUS. For that purpose, we applied these methods on several synthetic datasets including 2-dimensional datasets and higher dimensional datasets (d = 10).

Comparison with DBSCAN. The clusters found by DBSCAN and 4C applied on the 2-dimensional datasets are depicted in Figure 10. In both cases, DBSCAN finds clusters which do not exhibit correlations (and thus are not detected by 4C). In addition, DBSCAN cannot distinguish varying correlations which overlap (e.g. both correlations in Dataset B in Figure 10) and treats such clusters as one density-connected set, whereas 4C can differentiate such correlations. We made similar observations when we applied DBSCAN and 4C on the higher dimensional datasets. Let us note that these results are not astonishing since DBSCAN only searches for density-connected sets but does not search for correlations and thus cannot be applied to the task of finding correlation connected sets.

Comparison with CLIQUE. A comparison of 4C with CLIQUE yielded similar results. CLIQUE finds clusters in subspaces which do not exhibit correlations (and thus are not detected by 4C). On the other hand, CLIQUE is usually limited to axis-parallel clusters and thus cannot detect arbitrary correlations. These observations occur especially with higher dimensional data (d ≥ 10 in our tests). Again, these results are not astonishing since CLIQUE only searches for axis-parallel subspace clusters (dense projections) but does not search for correlations. This empirically supported the suspicion that CLIQUE cannot be applied to the task of finding correlation connected sets.

Comparison with ORCLUS. A comparison of 4C with ORCLUS resulted in quite different observations. In fact, ORCLUS computes clusters of correlated objects. However, since it is based on k-means, it suffers from the following two drawbacks: First, the choice of k is a rather hard task for real-world datasets. Even for synthetic datasets, where we knew the number of clusters beforehand, ORCLUS often performs better with a slightly different value of k. Second, ORCLUS is rather sensitive to noise, which often appears in real-world datasets. Since all objects have to be assigned to a cluster, the locality of the analyzed correlations is often too coarse (i.e. the subsets of the points taken into account for correlation analysis are too large). As a consequence, the correlation clusters are often blurred by noise objects and thus are hard to obtain from the resulting output. Figure 11 illustrates a sample 3-dimensional synthetic dataset; the clusters found by 4C are marked by black lines. Figure 12 depicts the objects in each cluster found by ORCLUS (k = 3 yields the best result) separately. It can be seen that the correlation clusters are, if detected at all, blurred by noise objects. When we applied ORCLUS on higher dimensional datasets (d = 10), the choice of k became even more complex and the problem of noise objects blurring the clusters (i.e. too coarse locality) simply cumulated in the fact that ORCLUS often could not detect correlation clusters in high-dimensional data.

Figure 10: Comparison between 4C and DBSCAN (clusters found by DBSCAN vs. clusters found by 4C on Dataset A and Dataset B).

Figure 11: Three correlation connected clusters found by 4C on a 3-dimensional dataset. Parameters: ε = 2.5, µ = 8, δ = 0.1, λ = 2.

Figure 12: Clusters found by ORCLUS on the dataset depicted in Figure 11. Parameters: k = 3, l = 2.

6.2.4 Input Parameters

The algorithm 4C needs four input parameters which are discussed in the following:

The parameter ε ∈ R specifies the size of the local areas in which the correlations are examined and thus determines the number of objects which contribute to the covariance matrix and consequently to the correlation similarity measure of each object. It also participates in the determination of the density threshold a cluster must exceed. Its choice usually depends on the volume of the data space (i.e. the maximum value of each attribute and the dimensionality of the feature space). The choice of ε has two aspects. First, it should not be too small because in that case, an insufficient number of objects contribute to the correlation similarity measure of each object and thus, this measure can be meaningless. On the other hand, ε should not be too large because then some noise objects might be correlation-reachable from objects within a correlation connected cluster. Let us note that our experiments indicated that the second aspect is not significant for 4C (in contrast to ORCLUS).

The parameter µ ∈ N specifies the number of neighbors an object must find in an ε-neighborhood and in a correlation ε-neighborhood to exceed the density threshold. It determines the minimum cluster size. The choice of µ should not be too small (µ ≥ 5 is a reasonable lower bound) but is rather insensitive in a broad range of values.

Both ε and µ should be chosen hand in hand.

The parameter λ ∈ N specifies the correlation dimension of the correlation connected clusters to be computed. As discussed above, the correlation dimension of a correlation connected cluster corresponds to its intrinsic dimension. In our experiments, it turned out that λ can be seen as an upper bound for the correlation dimension of the detected correlation connected clusters. However, the computed clusters tend to have a correlation dimension close to λ.

The parameter δ ∈ R (where 0 ≤ δ ≤ 1) specifies the lower bound for the decision whether an eigenvalue is set to 1 or to κ ≫ 1. It empirically turned out that the choice of δ influences the tightness of the detected correlations, i.e. how much local variance from the correlation is allowed. Our experiments also showed that δ ≤ 0.1 is usually a good choice.
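As a usage illustration of these parameters, a call to the sketch from Section 4 with the values used for the synthetic 10-dimensional dataset in Figure 7 might look as follows (the random data here is only a stand-in, not one of the datasets used in the evaluation):

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 10))               # placeholder for a 10-dimensional dataset
labels = cluster_4c(data, eps=10.0, mu=5, lam=2, delta=0.1)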

7. CONCLUSIONS

In this paper, we have proposed 4C, an algorithm for computing clusters of correlation connected objects. This algorithm searches for local subgroups of a set of feature vectors with a uniform correlation. Knowing a correlation is potentially useful because a correlation indicates a causal dependency between features. In contrast to well-known methods for the detection of correlations like the principal components analysis (PCA), our algorithm is capable of separating different subgroups in which the dimensionality as well as the type of the correlation (positive/negative) and the dependent features are different, even in the presence of noise points.

Our proposed algorithm 4C is determinate, robust against noise, and efficient with a worst case time complexity between O(d^2 · n log n + d^3 · n) and O(d^2 · n^2 + d^3 · n). In an extensive evaluation, data from gene expression analysis and metabolic screening have been used. Our experiments show a superior performance of 4C over methods like DBSCAN, CLIQUE, and ORCLUS.

In the future we will integrate our method into object-relational database systems and investigate parallel and distributed versions of 4C.


8. REFERENCES

[1] C. Aggarwal and P. Yu. "Finding Generalized Projected Clusters in High Dimensional Space". In Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'00), Dallas, TX, 2000.

[2] C. C. Aggarwal and C. Procopiuc. "Fast Algorithms for Projected Clustering". In Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'99), Philadelphia, PA, 1999.

[3] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. "Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications". In Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'98), Seattle, WA, 1998.

[4] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander. "OPTICS: Ordering Points to Identify the Clustering Structure". In Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'99), Philadelphia, PA, pages 49-60, 1999.

[5] D. Barbara and P. Chen. "Using the Fractal Dimension to Cluster Datasets". In Proc. ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases (SIGKDD'00), Boston, MA, 2000.

[6] S. Berchtold, C. Böhm, H. V. Jagadish, H.-P. Kriegel, and J. Sander. "Independent Quantization: An Index Compression Technique for High-Dimensional Data Spaces". In Proc. Int. Conf. on Data Engineering (ICDE 2000), San Diego, CA, pages 577-588, 2000.

[7] S. Berchtold, D. A. Keim, and H.-P. Kriegel. "The X-Tree: An Index Structure for High-Dimensional Data". In Proc. 22nd Int. Conf. on Very Large Databases (VLDB'96), Mumbai (Bombay), India, pages 28-39, 1996.

[8] K. Chakrabarti and S. Mehrotra. "Local Dimensionality Reduction: A New Approach to Indexing High Dimensional Spaces". In Proc. 26th Int. Conf. on Very Large Databases (VLDB'00), Cairo, Egypt, 2000.

[9] C.-H. Cheng, A.-C. Fu, and Y. Zhang. "Entropy-Based Subspace Clustering for Mining Numerical Data". In Proc. ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases (SIGKDD'99), San Diego, CA, 1999.

[10] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise". In Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining (KDD'96), Portland, OR, pages 291-316. AAAI Press, 1996.

[11] S. Goil, H. Nagesh, and A. Choudhary. "MAFIA: Efficient and Scalable Subspace Clustering for Very Large Data Sets". Tech. Report No. CPDC-TR-9906-010, Center for Parallel and Distributed Computing, Dept. of Electrical and Computer Engineering, Northwestern University, 1999.

[12] A. Hinneburg and D. A. Keim. "An Efficient Approach to Clustering in Large Multimedia Databases with Noise". In Proc. 4th Int. Conf. on Knowledge Discovery and Data Mining (KDD'98), New York, NY, pages 224-228. AAAI Press, 1998.

[13] B. Liebl, U. Nennstiel-Ratzel, R. von Kries, R. Fingerhut, B. Olgemöller, A. Zapf, and A. A. Roscher. "Very High Compliance in an Expanded MS-MS-Based Newborn Screening Program Despite Written Parental Consent". Preventive Medicine, 34(2):127-131, 2002.

[14] E. Parros Machado de Sousa, C. Traina, A. Traina, and C. Faloutsos. "How to Use Fractal Dimension to Find Correlations between Attributes". In Workshop KDD 2002, 2002.

[15] C. M. Procopiuc, M. Jones, P. K. Agarwal, and T. M. Murali. "A Monte Carlo Algorithm for Fast Projective Clustering". In Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'02), Madison, WI, pages 418-427, 2002.

[16] Saccharomyces Genome Database (SGD). http://www.yeastgenome.org/. (visited: October/November 2003).

[17] J. Sander, M. Ester, H.-P. Kriegel, and X. Xu. "Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and its Applications". Data Mining and Knowledge Discovery, 2:169-194, 1998.

[18] S. Tavazoie, J. D. Hughes, M. J. Campbell, R. J. Cho, and G. M. Church. "Systematic Determination of Genetic Network Architecture". Nature Genetics, 22:281-285, 1999.

[19] H. Wang, W. Wang, J. Yang, and P. S. Yu. "Clustering by Pattern Similarity in Large Data Sets". In Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'02), Madison, WI, 2002.

[20] X. Xu, M. Ester, H.-P. Kriegel, and J. Sander. "A Distribution-Based Clustering Algorithm for Mining in Large Spatial Databases". In Proc. 14th Int. Conf. on Data Engineering (ICDE'98), Orlando, FL, pages 324-331, 1998.

[21] J. Yang, W. Wang, H. Wang, and P. S. Yu. "Delta-Clusters: Capturing Subspace Correlation in a Large Data Set". In Proc. 18th Int. Conf. on Data Engineering (ICDE'02), San Jose, CA, 2002.