CACTUS – Clustering Categorical Data Using Summaries
By Venkatesh Ganti, Johannes Gehrke and Raghu Ramakrishnan
RongEn Li, School of Informatics, Edinburgh
Overview
• Introduction and motivation
• Existing tools for clustering categorical data: STIRR and ROCK
• Definition of a cluster over categorical data
• The algorithm – CACTUS
• Experiments and results
• Summary
Introduction and motivation
• Numeric data, e.g. {1, 2, 3, 4, 5, …}
• Categorical data, e.g. {LFD, PMR, DME}
– Domains usually contain a small number of attribute values; it is hard to infer useful information from large domains.
– Use relations! Relations contain different attributes, but the cross product of the attribute domains can be large.
• CACTUS – a fast summarisation-based algorithm which uses summary information to find well-defined clusters.
Existing tools for clustering categorical data
• STIRR
– Each attribute value is represented as a weighted vertex in a graph.
– Multiple copies b1,…,bm (called basins) of the weighted vertices are maintained; the same vertex can have different weights in different basins.
– Starting step: assign a set of weights to all vertices in all basins.
– Iterative step: for each tuple t=<t1,…,tn> in the dataset, increment the weight in basin bi on vertex tj using a function that combines the weights of the vertices other than tj in bi.
– At the fixed point: the large positive weights and small negative weights across the basins isolate two groups of attribute values on each attribute.
• ROCK
– Starts with each tuple in its own cluster.
– Merges close clusters until a required (user-specified) number of clusters remains; closeness is defined by a similarity function.
• STIRR is used for comparison with CACTUS.
Definitions: Interval region, support and belonging
• A1,…,An is a set of categorical attributes with domains D1,…,Dn respectively. D is a set of tuples, where each tuple t є D1 × … × Dn.
– Interval region: S = S1 × … × Sn, where Si ⊆ Di for all i є {1,…,n}. Equivalent to intervals in numeric data.
– Support of a value pair: σD(ai,aj) = |{t є D: t.Ai = ai and t.Aj = aj}|. The support σD(S) of a region is the number of tuples in D contained in S.
– Belonging: a tuple t = <t.A1,…,t.An> є D belongs to a region S if t.Ai є Si for all i є {1,…,n}.
Definitions: expected support, strongly connected
• The expected support under the attribute-independence assumption:
– Of a region: E[σD(S)] = |D| · (|S1| × … × |Sn|) / (|D1| × … × |Dn|)
– Of a value pair ai, aj: E[σD(ai,aj)] = α · |D| / (|Di| × |Dj|)
– α is normally set to 2 or 3.
• Strongly connected
– ai and aj: if σD(ai,aj) > E[σD(ai,aj)], then σ*D(ai,aj) = σD(ai,aj); otherwise σ*D(ai,aj) = 0.
– ai є Si w.r.t. Sj: ai and x are strongly connected for all x є Sj.
– Si and Sj: each ai є Si is strongly connected w.r.t. Sj, and each aj є Sj is strongly connected w.r.t. Si.
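Following the slide's convention (α folded into the expected pair support), the truncated support σ* can be sketched as:

```python
def expected_pair_support(n_tuples, di_size, dj_size, alpha=3):
    """E[sigma_D(ai, aj)] = alpha * |D| / (|Di| * |Dj|), with alpha
    (typically 2 or 3) folded in, as on the slide."""
    return alpha * n_tuples / (di_size * dj_size)

def sigma_star(sigma, n_tuples, di_size, dj_size, alpha=3):
    """sigma*_D(ai, aj): keep the observed support only when it exceeds
    the scaled expectation; otherwise 0."""
    if sigma > expected_pair_support(n_tuples, di_size, dj_size, alpha):
        return sigma
    return 0
```

With |D| = 1000 and |Di| = |Dj| = 10, a pair must co-occur more than α·10 times to survive.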
Definitions: Cluster, Cluster-projection, sub-cluster and subspace cluster
• C = <C1,…,Cn> is a cluster over {A1,…,An} if
– 1. Ci and Cj are strongly connected for all i ≠ j.
– 2. There exists no C'i that is a proper superset of Ci such that C'i and Cj are strongly connected for all j ≠ i.
– 3. The support σD(C) of C is >= α times the expected support of C under the attribute-independence assumption.
• Ci is the cluster-projection of C on Ai.
• C is a sub-cluster if it satisfies only conditions 1 and 3.
• A cluster C over a proper subset S of the attributes {A1,…,An} is a subspace cluster on S.
Definitions: similarity, inter-attribute summaries, intra-attribute summaries
• Similarity: γj(a1,a2) = |{x є Dj: σ*D(a1,x) > 0 and σ*D(a2,x) > 0}|
• Inter-attribute summary:
– ∑ij = {(ai, aj, σ*D(ai,aj)) : ai є Di, aj є Dj, and σ*D(ai,aj) > 0}
– Pairs of strongly connected attribute values where the two values come from different attributes.
• Intra-attribute summary:
– ∑jii = {(a1, a2, γj(a1,a2)) : a1, a2 є Di, i ≠ j, and γj(a1,a2) > 0}
– Similarities between attribute values of the same attribute.
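The similarity γj can be sketched directly from the definition; the value names and truncated supports below are hypothetical:

```python
def similarity(a1, a2, Dj, sigma_star):
    """gamma_j(a1, a2): number of values x in domain Dj that are strongly
    connected (sigma* > 0) to both a1 and a2. sigma_star is assumed to be
    a function returning the truncated support sigma*_D(a, x)."""
    return sum(1 for x in Dj if sigma_star(a1, x) > 0 and sigma_star(a2, x) > 0)

# Hypothetical truncated supports between values of Ai and values of Aj.
star = {("a1", "x"): 5, ("a2", "x"): 4, ("a1", "y"): 3}
lookup = lambda a, x: star.get((a, x), 0)
```

Here a1 and a2 are both strongly connected only to "x", so their similarity w.r.t. Aj is 1.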
CACTUS Vs STIRR: clusters found by CACTUS
CACTUS Vs STIRR: clusters found by STIRR
CACTUS: CAtegorical ClusTering Using Summaries
• Central idea: the data summary (inter- and intra-attribute summaries) is sufficient to find candidate clusters, which can then be validated.
• A three-phase clustering algorithm:
– Summarisation
– Clustering
– Validation
Summarisation Phase
• Assumption: the inter- & intra- attribute summary of any pair of attributes fits easily into main memory.
• Inter-attribute summaries:
– Keep a counter, initially 0, for each pair (ai,aj) є Di × Dj.
– Scan the dataset, incrementing the counter for each pair that occurs in a tuple.
– After the scan, compute σ*D(ai,aj): reset the counters of those pairs whose support is below E[σD(ai,aj)], and store the remaining value pairs.
• Intra-attribute summaries:
– Computed from the inter-attribute summaries: a1 and a2 in Di are similar w.r.t. Aj if some x є Dj is strongly connected to both.
– A very fast operation, hence they are only computed when needed.
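A minimal sketch of the counter-based procedure above, on toy data (the threshold uses the slide's α-scaled expected support):

```python
from collections import Counter
from itertools import combinations

def inter_attribute_summaries(D, domains, alpha=2):
    """One scan over the dataset: count co-occurrences for every attribute
    pair, then keep only the strongly connected value pairs."""
    pairs = list(combinations(range(len(domains)), 2))
    counts = {p: Counter() for p in pairs}
    for t in D:                          # single scan, incrementing counters
        for i, j in pairs:
            counts[(i, j)][(t[i], t[j])] += 1
    summary = {}
    for (i, j), c in counts.items():     # reset counters below the expectation
        threshold = alpha * len(D) / (len(domains[i]) * len(domains[j]))
        summary[(i, j)] = {vp: n for vp, n in c.items() if n > threshold}
    return summary
```

With 10 tuples and two binary domains, the threshold is α·10/4, so at α = 2 only pairs occurring more than 5 times are stored.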
Clustering Phase
• A two-step operation:
– Step 1: analyse each attribute to compute all cluster-projections on it.
– Step 2: synthesise candidate clusters on sets of attributes from the cluster-projections on individual attributes.
Clustering Phase continued
• Step 1: compute cluster-projections on attributes
– Step A: find all cluster-projections on Ai of clusters over pairs (Ai,Aj).
– Step B: compute all cluster-projections on Ai of clusters over {A1,…,An} by intersecting sets of cluster-projections from Step A.
– Step A is NP-hard! Solution: use distinguishing sets.
• Distinguishing sets identify distinct cluster-projections.
• Construct candidate distinguishing sets on Ai and extend some of them w.r.t. Aj.
• Detailed steps are too long for this presentation, sorry!
– Step B: intersection of cluster-projections
• Intersection join: S1 ⨅ S2 = {s: there exist s1 є S1 and s2 є S2 such that s = s1 ∩ s2 and |s| > 1}
• Apply the intersection join to all sets of attribute values on Ai.
• Step 2: try to extend a candidate cluster <c1,…,ck> with a cluster-projection ck+1 on attribute Ak+1. If <ci, ck+1> is a sub-cluster on (Ai,Ak+1) for all i є {1,…,k}, then <c1,…,ck+1> becomes a new candidate cluster.
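The intersection join of Step B and the extension test of Step 2 can be sketched as follows; the sub-cluster predicate here is an assumed stand-in for the paper's full check:

```python
def intersection_join(S1, S2):
    """S1 join S2 = {s : s = s1 & s2 for some s1 in S1, s2 in S2, |s| > 1}."""
    out = []
    for s1 in S1:
        for s2 in S2:
            s = s1 & s2                   # pairwise set intersection
            if len(s) > 1 and s not in out:
                out.append(s)
    return out

def extend_candidate(candidate, c_next, k_next, is_subcluster):
    """Step 2: extend <c1,...,ck> with cluster-projection c_next on
    attribute A_{k_next} only if every <ci, c_next> is a sub-cluster on
    (Ai, A_{k_next}). is_subcluster(ci, i, cj, j) is an assumed predicate."""
    if all(is_subcluster(ci, i, c_next, k_next) for i, ci in enumerate(candidate)):
        return candidate + [c_next]
    return None
```

Singleton intersections are discarded by the |s| > 1 condition, matching the definition above.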
Validation Phase
• Use a required support threshold to reject false candidates: a candidate may lack sufficient support because the 2-clusters combined to form it may be due to different sets of tuples.
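The validation check can be sketched as a final scan over the data (toy data; the α-scaled expected-support threshold follows the cluster definition given earlier):

```python
def validate(candidates, D, domains, alpha=3):
    """Keep a candidate cluster C = <C1,...,Cn> only if its actual support
    meets alpha times its expected support under attribute independence."""
    kept = []
    for C in candidates:
        support = sum(1 for t in D if all(t[i] in Ci for i, Ci in enumerate(C)))
        expected = len(D)                 # |D| * prod(|Ci| / |Di|)
        for i, Ci in enumerate(C):
            expected *= len(Ci) / len(domains[i])
        if support >= alpha * expected:
            kept.append(C)
    return kept
```

In the toy run below, only the candidate actually backed by 8 of 10 tuples survives validation.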
Experiments and Results
• To compare with STIRR
• Synthetic data: 1 million tuples, 10 attributes, and 100 attribute values per attribute.
• CACTUS discovers a broader class of clusters than STIRR.
Conclusion
• The authors formalised the definition of a cluster over categorical data.
• CACTUS is a fast and efficient algorithm for clustering categorical data.
• I am sorry that I did not show some parts of the algorithm due to time constraints.
Question Time