CACTUS – Clustering Categorical Data Using Summaries
By Venkatesh Ganti, Johannes Gehrke and Raghu Ramakrishnan
RongEn Li, School of Informatics, Edinburgh
Overview
• Introduction and motivation
• Existing tools for clustering categorical data: STIRR and ROCK
• Definition of a cluster over categorical data
• The algorithm – CACTUS
• Experiments and results
• Summary
Introduction and motivation
• Numeric data, e.g. {1, 2, 3, 4, 5, …}
• Categorical data, e.g. {LFD, PMR, DME}
– Domains usually contain a small number of attribute values; it is hard to infer useful information from large domains.
– Use relations! Relations contain different attributes, but the cross product of the attribute domains can be large.
• CACTUS – a fast summarisation-based algorithm which uses summary information to find well-defined clusters.
Existing tools for clustering categorical data
• STIRR
– Each attribute value is represented as a weighted vertex in a graph.
– Multiple copies b1,…,bm (called basins) of the weighted vertices are maintained; the same vertex can have different weights in different basins.
– Starting step: assign a set of weights to all vertices in all basins.
– Iterative step: for each tuple t=<t1,…,tn> in the dataset, increment the weight in basin bi on vertex tj using a function that combines the weights of the vertices other than tj in bi.
– At the fixed point: the large positive weights and small negative weights across the basins isolate two groups of attribute values on each attribute.
• ROCK
– Starts with each tuple in its own cluster.
– Merges close clusters until a required (user-specified) number of clusters remains; closeness is defined by a similarity function.
• STIRR is used for comparison with CACTUS.
Definitions: Interval region, support and belonging
• A1,…,An is a set of categorical attributes with domains D1,…,Dn respectively. D is a set of tuples, where each tuple t є D1 × … × Dn.
– Interval region: S = S1 × … × Sn, where Si ⊆ Di for all i є {1,…,n}. Equivalent to intervals in numeric data.
– Support of a value pair: σD(ai,aj) = |{t є D: t.Ai = ai and t.Aj = aj}|. The support σD(S) of a region is the number of tuples in D contained in S.
– Belonging: a tuple t = <t.A1,…,t.An> є D belongs to a region S if t.Ai є Si for all i є {1,…,n}.
Definitions: expected support, strongly connected
• The expected support under the attribute-independence assumption:
– Of a region: E[σD(S)] = |D| · (|S1| × … × |Sn|) / (|D1| × … × |Dn|)
– Of a value pair ai, aj: E[σD(ai,aj)] = α · |D| / (|Di| × |Dj|)
– α is normally set to 2 or 3.
• Strongly connected
– ai and aj: if σD(ai,aj) > E[σD(ai,aj)], then σ*D(ai,aj) = σD(ai,aj); otherwise σ*D(ai,aj) = 0.
– ai є Si w.r.t. Sj: ai and x are strongly connected for all x є Sj.
– Si and Sj: each ai є Si is strongly connected w.r.t. Sj, and each aj є Sj is strongly connected w.r.t. Si.
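Following the slide's convention (α folded into the expected pair support), the truncated support σ* can be sketched as:

```python
def expected_pair_support(n_tuples, di_size, dj_size, alpha=3):
    """E[sigma_D(ai, aj)] = alpha * |D| / (|Di| * |Dj|), with alpha
    (typically 2 or 3) folded in, as on the slide."""
    return alpha * n_tuples / (di_size * dj_size)

def sigma_star(sigma, n_tuples, di_size, dj_size, alpha=3):
    """sigma*_D(ai, aj): keep the observed support only when it exceeds
    the scaled expectation; otherwise 0."""
    if sigma > expected_pair_support(n_tuples, di_size, dj_size, alpha):
        return sigma
    return 0
```

With |D| = 1000 and |Di| = |Dj| = 10, a pair must co-occur more than α·10 times to survive.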
Definitions: Cluster, Cluster-projection, sub-cluster and subspace cluster
• C = <C1,…,Cn> is a cluster over {A1,…,An} if
– 1. Ci and Cj are strongly connected for all i ≠ j.
– 2. There exists no C'i that is a proper superset of Ci such that C'i and Cj are strongly connected for all j ≠ i.
– 3. The support σD(C) of C is >= α times the expected support of C under the attribute-independence assumption.
• Ci is the cluster-projection of C on Ai.
• C is a sub-cluster if it satisfies only conditions 1 and 3.
• A cluster C over a proper subset S of the attributes {A1,…,An} is a subspace cluster on S.
Definitions: similarity, inter-attribute summaries, intra-attribute summaries
• Similarity: γj(a1,a2) = |{x є Dj: σ*D(a1,x) > 0 and σ*D(a2,x) > 0}|
• Inter-attribute summary:
– ∑ij = {(ai, aj, σ*D(ai,aj)) : ai є Di, aj є Dj, and σ*D(ai,aj) > 0}
– Pairs of strongly connected attribute values where the two values come from different attributes.
• Intra-attribute summary:
– ∑jii = {(a1, a2, γj(a1,a2)) : a1, a2 є Di, i ≠ j, and γj(a1,a2) > 0}
– Similarities between attribute values of the same attribute.
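The similarity γj can be sketched directly from the definition; the value names and truncated supports below are hypothetical:

```python
def similarity(a1, a2, Dj, sigma_star):
    """gamma_j(a1, a2): number of values x in domain Dj that are strongly
    connected (sigma* > 0) to both a1 and a2. sigma_star is assumed to be
    a function returning the truncated support sigma*_D(a, x)."""
    return sum(1 for x in Dj if sigma_star(a1, x) > 0 and sigma_star(a2, x) > 0)

# Hypothetical truncated supports between values of Ai and values of Aj.
star = {("a1", "x"): 5, ("a2", "x"): 4, ("a1", "y"): 3}
lookup = lambda a, x: star.get((a, x), 0)
```

Here a1 and a2 are both strongly connected only to "x", so their similarity w.r.t. Aj is 1.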
CACTUS Vs STIRR: clusters found by CACTUS
CACTUS Vs STIRR: clusters found by STIRR
CACTUS: CAtegorical ClusTering Using Summaries
• Central idea: the data summary (inter- and intra-attribute summaries) is sufficient to find candidate clusters, which can then be validated.
• A three-phase clustering algorithm:
– Summarisation
– Clustering
– Validation
Summarisation Phase
• Assumption: the inter- & intra- attribute summary of any pair of attributes fits easily into main memory.
• Inter-attribute summaries:
– Keep a counter, initially 0, for each pair (ai,aj) є Di × Dj.
– Scan the dataset, incrementing the counter for each pair that occurs in a tuple.
– After the scan, compute σ*D(ai,aj): reset the counters of those pairs whose support is below E[σD(ai,aj)], and store the remaining value pairs.
• Intra-attribute summaries:
– Computed from the inter-attribute summaries: a1 and a2 in Di are similar w.r.t. Aj if some x є Dj is strongly connected to both.
– A very fast operation, hence they are only computed when needed.
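A minimal sketch of the counter-based procedure above, on toy data (the threshold uses the slide's α-scaled expected support):

```python
from collections import Counter
from itertools import combinations

def inter_attribute_summaries(D, domains, alpha=2):
    """One scan over the dataset: count co-occurrences for every attribute
    pair, then keep only the strongly connected value pairs."""
    pairs = list(combinations(range(len(domains)), 2))
    counts = {p: Counter() for p in pairs}
    for t in D:                          # single scan, incrementing counters
        for i, j in pairs:
            counts[(i, j)][(t[i], t[j])] += 1
    summary = {}
    for (i, j), c in counts.items():     # reset counters below the expectation
        threshold = alpha * len(D) / (len(domains[i]) * len(domains[j]))
        summary[(i, j)] = {vp: n for vp, n in c.items() if n > threshold}
    return summary
```

With 10 tuples and two binary domains, the threshold is α·10/4, so at α = 2 only pairs occurring more than 5 times are stored.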
Clustering Phase
• A two-step operation:
– Step 1: analyse each attribute to compute all cluster-projections on it.
– Step 2: synthesise candidate clusters on sets of attributes from the cluster-projections on individual attributes.
Clustering Phase continued
• Step 1: compute cluster-projections on attributes
– Step A: find all cluster-projections on Ai of clusters over pairs (Ai,Aj).
– Step B: compute all cluster-projections on Ai of clusters over {A1,…,An} by intersecting sets of cluster-projections from Step A.
– Step A is NP-hard! Solution: use distinguishing sets.
• Distinguishing sets identify distinct cluster-projections.
• Construct candidate distinguishing sets on Ai and extend some of them w.r.t. Aj.
• Detailed steps are too long for this presentation, sorry!
– Step B: intersection of cluster-projections
• Intersection join: S1 ⨅ S2 = {s: there exist s1 є S1 and s2 є S2 such that s = s1 ∩ s2 and |s| > 1}
• Apply the intersection join to all sets of attribute values on Ai.
• Step 2: try to extend a candidate cluster <c1,…,ck> with a cluster-projection ck+1 on attribute Ak+1. If <ci, ck+1> is a sub-cluster on (Ai,Ak+1) for all i є {1,…,k}, then <c1,…,ck+1> becomes a new candidate cluster.
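The intersection join of Step B and the extension test of Step 2 can be sketched as follows; the sub-cluster predicate here is an assumed stand-in for the paper's full check:

```python
def intersection_join(S1, S2):
    """S1 join S2 = {s : s = s1 & s2 for some s1 in S1, s2 in S2, |s| > 1}."""
    out = []
    for s1 in S1:
        for s2 in S2:
            s = s1 & s2                   # pairwise set intersection
            if len(s) > 1 and s not in out:
                out.append(s)
    return out

def extend_candidate(candidate, c_next, k_next, is_subcluster):
    """Step 2: extend <c1,...,ck> with cluster-projection c_next on
    attribute A_{k_next} only if every <ci, c_next> is a sub-cluster on
    (Ai, A_{k_next}). is_subcluster(ci, i, cj, j) is an assumed predicate."""
    if all(is_subcluster(ci, i, c_next, k_next) for i, ci in enumerate(candidate)):
        return candidate + [c_next]
    return None
```

Singleton intersections are discarded by the |s| > 1 condition, matching the definition above.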
Validation Phase
• Use a required support threshold to reject false candidates: a candidate may lack sufficient support because the 2-clusters combined to form it may be due to different sets of tuples.
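The validation check can be sketched as a final scan over the data (toy data; the α-scaled expected-support threshold follows the cluster definition given earlier):

```python
def validate(candidates, D, domains, alpha=3):
    """Keep a candidate cluster C = <C1,...,Cn> only if its actual support
    meets alpha times its expected support under attribute independence."""
    kept = []
    for C in candidates:
        support = sum(1 for t in D if all(t[i] in Ci for i, Ci in enumerate(C)))
        expected = len(D)                 # |D| * prod(|Ci| / |Di|)
        for i, Ci in enumerate(C):
            expected *= len(Ci) / len(domains[i])
        if support >= alpha * expected:
            kept.append(C)
    return kept
```

In the toy run below, only the candidate actually backed by 8 of 10 tuples survives validation.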
Experiments and Results
• To compare with STIRR
• Synthetic data: 1 million tuples, 10 attributes, and 100 attribute values per attribute.
• CACTUS discovers a broader class of clusters than STIRR.
Conclusion
• The authors formalised the definition of a cluster over categorical data.
• CACTUS is a fast and efficient algorithm for clustering categorical data.
• I am sorry that I did not show some parts of the algorithm due to time constraints.
Question Time