Mixed-Attribute Clustering and Weighted Clustering
Presented by: Yiu Man Lung
24 January, 2003
Outline
- Mixed-Attribute Clustering
  - ROCK, CACTUS
  - Links between mixed attributes
- Weighted Clustering
  - PROCLUS
  - Weights, distance and similarity measures
  - Methods of computing the weights
- Conclusion
Mixed-Attribute Clustering
- Most real datasets have mixed attributes:
  - numeric (continuous): total ordering
  - categorical (discrete): no total ordering
- Few clustering algorithms handle mixed attributes
- Combined information from mixed attributes may be useful for clustering
- Use the context to compute the distance / similarity measure instead of using a fixed measure
- Apply the concept of links (from ROCK) and the concept of strongly connected attribute values (from CACTUS)
ROCK
- Hierarchical, agglomerative clustering algorithm
- We focus only on the concept of links
- Given records Ti, Tj:
  - Their similarity is sim(Ti, Tj)
  - They are neighbors if sim(Ti, Tj) ≥ θ
  - link(Ti, Tj) is the number of their common neighbors
Relationship between links and clusters
- High intra-cluster similarity within clusters → many links within clusters
- High inter-cluster dissimilarity among clusters → few cross links among clusters
Example of links
- Jaccard coefficient: sim(Ti, Tj) = |Ti ∩ Tj| / |Ti ∪ Tj|
- sim({1,2,3}, {1,2,7}) = 0.5 (different clusters)
- sim({1,2,3}, {3,4,5}) = 0.2 (same cluster)
- For links, let θ = 0.5:
  - link({1,2,3}, {1,2,7}) = 3
  - link({1,2,3}, {3,4,5}) = 4
- Although the pair from different clusters has the higher similarity, the pair from the same cluster has more links, so links are the more robust merging criterion

Figure 1: Basket data example (adapted from [1])
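To make the link computation concrete, here is a minimal Python sketch (the function names are mine, not from the ROCK paper) that computes Jaccard similarities, the θ-neighbor relation, and link counts. Note that the slide's link values of 3 and 4 come from the full basket dataset of [1]; over only the three records shown, the counts would differ.

```python
# A sketch of ROCK-style links; names are illustrative, not from [1].
def jaccard(ti: frozenset, tj: frozenset) -> float:
    """Jaccard coefficient |Ti ∩ Tj| / |Ti ∪ Tj|."""
    return len(ti & tj) / len(ti | tj)

def links(records: list[frozenset], theta: float) -> dict[tuple[int, int], int]:
    """link(Ti, Tj): number of common neighbors, where Ti and Tj are
    neighbors iff sim(Ti, Tj) >= theta."""
    n = len(records)
    neighbor = [[jaccard(records[a], records[b]) >= theta for b in range(n)]
                for a in range(n)]
    return {(a, b): sum(neighbor[a][c] and neighbor[b][c] for c in range(n))
            for a in range(n) for b in range(a + 1, n)}

records = [frozenset({1, 2, 3}), frozenset({1, 2, 7}), frozenset({3, 4, 5})]
print(jaccard(records[0], records[1]))   # 0.5 (different clusters)
print(jaccard(records[0], records[2]))   # 0.2 (same cluster)
print(links(records, theta=0.5))
```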
CACTUS
- Clusters are generated and validated in later phases; we focus only on the summarization (first) phase:
  - inter-attribute summary IJ (links among attribute values)
  - intra-attribute summary II (similarity of attribute values)

Some notation:
- Dataset: D
- Tuple: t
- Categorical attributes: A1, A2, …, An
- Domains: D1, D2, …, Dn
- Values in domains: V1,*, V2,*, …, Vn,*
Support and similarity
- Let i ≠ j. The support σD(Vi,x, Vj,y) is defined as
  σD(Vi,x, Vj,y) = |{t ∈ D : t.Ai = Vi,x ∧ t.Aj = Vj,y}|
- Let α > 1. Vi,x and Vj,y are strongly connected if
  σD(Vi,x, Vj,y) > α · |D| / (|Di| · |Dj|)
  where |D| / (|Di| · |Dj|) is the expected support under the attribute-independence assumption
- σ*D(Vi,x, Vj,y) = σD(Vi,x, Vj,y) if they are strongly connected, and 0 otherwise
- The similarity γj(Vi,x, Vi,z) with respect to Dj (i ≠ j) is defined as
  γj(Vi,x, Vi,z) = |{u ∈ Dj : σ*D(Vi,x, u) > 0 ∧ σ*D(Vi,z, u) > 0}|
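The definitions above translate directly into code. Below is a minimal sketch, with my own function names and a toy two-attribute dataset, of the support, the strong-connection test, and γj.

```python
# Sketch of the CACTUS summarization quantities; names are mine.
def support(D, i, vi, j, vj):
    """sigma_D(Vi,x, Vj,y): tuples with t.Ai = vi and t.Aj = vj."""
    return sum(1 for t in D if t[i] == vi and t[j] == vj)

def strong_support(D, domains, alpha, i, vi, j, vj):
    """sigma*_D: the support if vi, vj are strongly connected, else 0."""
    s = support(D, i, vi, j, vj)
    expected = len(D) / (len(domains[i]) * len(domains[j]))  # independence
    return s if s > alpha * expected else 0  # strong-connection test

def gamma(D, domains, alpha, i, vx, vz, j):
    """gamma_j(Vi,x, Vi,z): values of Aj strongly connected to both."""
    return sum(1 for u in domains[j]
               if strong_support(D, domains, alpha, i, vx, j, u) > 0
               and strong_support(D, domains, alpha, i, vz, j, u) > 0)

D = [("a", "d"), ("a", "e"), ("b", "d"), ("b", "e"), ("a", "d")]
domains = [sorted({t[i] for t in D}) for i in range(2)]
print(gamma(D, domains, alpha=1.5, i=0, vx="a", vz="b", j=1))
```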
Example of similarities
Figure 2: Inter-attribute summary IJ, i.e. links among attribute values of different attributes (adapted from [2])
Figure 3: Intra-attribute summary II, i.e. the number of common neighbors in another attribute (adapted from [2])
Summary of the two previous concepts
- Links and strongly connected attribute values are both defined for categorical data
- The former is defined on tuples; the latter on attribute values
- The latter can be viewed as "links" between attribute values
- Both concepts need to be extended to mixed attributes
Links between mixed attributes
- Suppose Ai is categorical and Aj is numeric. How do we compute the similarity γj(Vi,x, Vi,z) with respect to Dj (i ≠ j)?
- Multiset: a set with multiplicity
  - {3,7} and {7,3} are equivalent; {3,7} and {3,3,7} are different
- The multiset of values of Aj over tuples with Ai = Vi,x is defined by
  MSet(Vi,x, i, j) = {t.Aj : t ∈ D ∧ t.Ai = Vi,x}
- Example: for D = {(a,5), (a,6), (b,7), (b,8)}, MSet(a,1,2) = {5,6} and MSet(b,1,2) = {7,8}
- The similarity γj(Vi,x, Vi,z) can then be computed from MSet(Vi,x, i, j) and MSet(Vi,z, i, j)

Figure 4: Inter-attribute summary of mixed attributes
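A minimal sketch of MSet, using Python's Counter as the multiset type and 0-based attribute indices (the slide uses 1-based indices):

```python
from collections import Counter

def mset(D, vi, i, j):
    """MSet(Vi,x, i, j) = multiset {t.Aj : t in D and t.Ai = Vi,x}."""
    return Counter(t[j] for t in D if t[i] == vi)

D = [("a", 5), ("a", 6), ("b", 7), ("b", 8)]
print(mset(D, "a", 0, 1))   # {5, 6}, i.e. MSet(a, 1, 2) on the slide
print(mset(D, "b", 0, 1))   # {7, 8}, i.e. MSet(b, 1, 2) on the slide
```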
Links between mixed attributes (1): Histogram
- Represent MSet(Vi,x, i, j) by a histogram Hist(Vi,x, i, j)
- Compute the similarity γj(Vi,x, Vi,z) as sim(Hist(Vi,x, i, j), Hist(Vi,z, i, j))
- Histogram intersection: Sim = 2 + 3 + 3 + 4 + 5 + 4 + 3 = 24 (needs to be normalized)
- A more robust method also considers adjacent regions
Figure 5: Histogram of Hist(Vi,x, i, j)
Figure 6: Histogram of Hist(Vi,z, i, j)
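A minimal sketch of histogram intersection. The bin counts are assumptions standing in for Figures 5 and 6 (chosen so the per-bin minima sum to 24, matching the slide); dividing by the smaller histogram mass is one common normalization, not one prescribed by the slide.

```python
def histogram_intersection(h1: list[int], h2: list[int]) -> float:
    """Sum of per-bin minima, normalized to [0, 1]."""
    raw = sum(min(a, b) for a, b in zip(h1, h2))
    return raw / min(sum(h1), sum(h2))

hist_x = [2, 4, 5, 4, 5, 6, 3]   # hypothetical Hist(Vi,x, i, j)
hist_z = [3, 3, 3, 6, 5, 4, 4]   # hypothetical Hist(Vi,z, i, j)
# Per-bin minima: 2 + 3 + 3 + 4 + 5 + 4 + 3 = 24, as on the slide.
print(histogram_intersection(hist_x, hist_z))
```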
Links between mixed attributes (2)
- Approximate the value sequence by a normal distribution: assume MSet(Vi,x, i, j) follows a normal distribution, so its mean and variance describe MSet(Vi,x, i, j) approximately
- Compute the similarity γj(Vi,x, Vi,z) from the means and variances of MSet(Vi,x, i, j) and MSet(Vi,z, i, j)
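The slide leaves the comparison of the two fitted normals open. One reasonable concrete choice (my assumption, not stated on the slide) is the Bhattacharyya coefficient between the two Gaussians, which lies in (0, 1]:

```python
import math
import statistics

def gaussian_similarity(xs: list[float], zs: list[float]) -> float:
    """Bhattacharyya coefficient of normals fitted to the two multisets."""
    m1, v1 = statistics.fmean(xs), statistics.pvariance(xs)
    m2, v2 = statistics.fmean(zs), statistics.pvariance(zs)
    # Bhattacharyya distance between N(m1, v1) and N(m2, v2);
    # assumes both variances are positive.
    bd = (0.25 * (m1 - m2) ** 2 / (v1 + v2)
          + 0.5 * math.log((v1 + v2) / (2.0 * math.sqrt(v1 * v2))))
    return math.exp(-bd)

print(gaussian_similarity([5.0, 6.0, 5.5], [7.0, 8.0, 7.5]))
```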
Weighted clustering
- Clustering is not meaningful in high-dimensional spaces because of irrelevant attributes; clusters may form in subspaces
- Projected clustering algorithms find the subspaces of the clusters during cluster formation
- Different attributes may have different relevance to different clusters
- Weighted clustering algorithms determine the weights of the attributes in each cluster
- Users can interpret the weights in a meaningful way
Example of weighted clustering
Assume there are 3 attributes: X, Y and Z.
Figure 7: Projection on the X-Y plane (adapted from [3])
Figure 8: Projection on the X-Z plane (adapted from [3])

           Projected clustering   Weighted clustering
Cluster 1  {X, Z}                 wX = 0.45, wY = 0.10, wZ = 0.45
Cluster 2  {X, Y}                 wX = 0.45, wY = 0.45, wZ = 0.10
PROCLUS
- A projected clustering algorithm; medoid-based and efficient
- Some disadvantages:
  - Clusters with fewer than (|D|/k) · minDev points are considered bad
  - The quality of the clusters depends on the choice of medoids
- Example: with k = 2 clusters, if the 2 medoids are unluckily drawn from the same cluster, that cluster is split into two small clusters, and the points in the other cluster become misses or outliers

Figure 9: An example of clusters
Definition of weights
Assume there are k clusters and m attributes. The weights of the clusters must satisfy:
- ∀ i ∈ [1,k], j ∈ [1,m]: wi,j is a real number with wi,j ∈ [0,1]
- ∀ i ∈ [1,k]: ∑j∈[1,m] wi,j = 1
(Note: i and j are integers.)
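As a quick sanity check, the weight matrix from the X/Y/Z example earlier satisfies both constraints:

```python
import math

# Hypothetical weights for k = 2 clusters over m = 3 attributes (X, Y, Z).
weights = [[0.45, 0.10, 0.45],
           [0.45, 0.45, 0.10]]
assert all(0.0 <= w <= 1.0 for row in weights for w in row)   # w_{i,j} in [0,1]
assert all(math.isclose(sum(row), 1.0) for row in weights)    # rows sum to 1
```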
Weighted measures
- Weighted distance: distp(x,y) = ∑i∈[1,m] wp,i · distp,i(x.Ai, y.Ai)
- Weighted similarity: simp(x,y) = ∑i∈[1,m] wp,i · simp,i(x.Ai, y.Ai)
- For the weights to be meaningful, distp,i and simp,i must return real values in [0,1]
- A simple categorical distance measure: distp,i(x.Ai, y.Ai) = 0 if x.Ai = y.Ai, and 1 otherwise
- A more complex simp,i will be introduced later
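A minimal sketch of the weighted distance with the simple 0/1 categorical per-attribute distance above; the tuple layout and weight values are illustrative assumptions.

```python
def weighted_distance(x: tuple, y: tuple, w: list[float]) -> float:
    """dist_p(x, y) = sum_i w_p,i * dist_p,i(x.Ai, y.Ai), in [0, 1]."""
    per_attr = [0.0 if xi == yi else 1.0 for xi, yi in zip(x, y)]  # 0/1 measure
    return sum(wi * di for wi, di in zip(w, per_attr))

w_p = [0.5, 0.375, 0.125]   # hypothetical weights of cluster p
print(weighted_distance(("a", "d", "f"), ("a", "e", "f"), w_p))  # 0.375
```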
Adapted algorithm from PROCLUS
- Adapt PROCLUS into a weighted clustering algorithm:
  - Change the FindDimension procedure (for finding relevant attributes) to a FindWeight procedure (for computing the weights of the attributes)
  - Replace the distance functions
- Three methods for computing the weights follow
1st method
wp,i = Var({|{t : t ∈ Cp ∧ t.Ai = Vi,j}| : Vi,j ∈ Di}), then normalize wp,i.
An attribute whose counts vary greatly among its attribute values has high relevance to the cluster, so the variance of the value counts of an attribute is used as the weight of that attribute in that cluster.

Example cluster:
A  B  C
a  d  f
a  d  g
a  d  h
a  e  i
b  e  j

Counts of each attribute value in the cluster, their variances, and the normalized weights:
      A         B         C
      a:4, b:1  d:3, e:2  f:1, g:1, h:1, i:1, j:1
Var   4.5       0.5       0
w     0.9       0.1       0
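A minimal sketch of the 1st method on the example cluster above. Using the sample variance reproduces the slide's numbers (Var = 4.5, 0.5, 0 and w = 0.9, 0.1, 0):

```python
from collections import Counter
import statistics

def variance_weights(cluster: list[tuple]) -> list[float]:
    """w_p,i = variance of the value counts of attribute i, normalized."""
    m = len(cluster[0])
    variances = []
    for i in range(m):
        counts = list(Counter(t[i] for t in cluster).values())
        variances.append(statistics.variance(counts) if len(counts) > 1 else 0.0)
    total = sum(variances)
    return [v / total for v in variances] if total > 0 else [1.0 / m] * m

cluster = [("a", "d", "f"), ("a", "d", "g"), ("a", "d", "h"),
           ("a", "e", "i"), ("b", "e", "j")]
print(variance_weights(cluster))   # [0.9, 0.1, 0.0]
```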
2nd method
wp,i = |{t : t ∈ Cp ∧ t.Ai = Medoid(Cp).Ai}|, then normalize wp,i.
For each attribute, we count the records in the cluster whose value on that attribute matches the medoid's. Attributes with high counts receive higher weights.

Example cluster (the medoid is the first tuple, (a, d, f)):
A  B  C
a  d  f   ← medoid
a  d  g
a  d  h
a  e  i
b  e  j

Match counts and normalized weights:
       A    B      C
count  4    3      1
w      0.5  0.375  0.125
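A minimal sketch of the 2nd method on the same cluster, with (a, d, f) as the medoid:

```python
def medoid_match_weights(cluster: list[tuple], medoid: tuple) -> list[float]:
    """w_p,i = |{t in Cp : t.Ai = Medoid(Cp).Ai}|, then normalized."""
    counts = [sum(1 for t in cluster if t[i] == medoid[i])
              for i in range(len(medoid))]
    total = sum(counts)
    return [c / total for c in counts]

cluster = [("a", "d", "f"), ("a", "d", "g"), ("a", "d", "h"),
           ("a", "e", "i"), ("b", "e", "j")]
print(medoid_match_weights(cluster, medoid=("a", "d", "f")))
# [0.5, 0.375, 0.125]
```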
3rd method
The original similarity γj(Vi,x, Vi,z) is too strict because only strongly connected attribute values are considered. We change the definition from
  γj(Vi,x, Vi,z) = |{u ∈ Dj : σ*D(Vi,x, u) > 0 ∧ σ*D(Vi,z, u) > 0}|
to
  γ^p_j(Vi,x, Vi,z) = ∑u∈Dj σCp(Vi,x, u) · σCp(Vi,z, u)
  γ^p(Vi,x, Vi,z) = ∑j∈[1,m], j≠i γ^p_j(Vi,x, Vi,z)
3rd method (cont'd)
There is one similarity matrix per attribute per cluster (k·m matrices in total).
  SimMaxp,i = max({γ^p(Vi,x, Vi,z) : Vi,x, Vi,z ∈ Di})   (the maximum entry of similarity matrix [p,i])
  simp,i(Vi,x, Vi,x) = 1
  simp,i(Vi,x, Vi,z) = γ^p(Vi,x, Vi,z) / SimMaxp,i
  wp,i = SimMaxp,i, then normalize wp,i

Example similarity matrix:
sim  f     g     s     y
f    1     0.01  0.68  0.97
g    0.01  1     0.01  0.01
s    0.68  0.01  1     1
y    0.97  0.01  1     1
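A minimal sketch of the 3rd method (function names are mine): γ^p sums products of within-cluster supports over all other attributes, and the normalized SimMax values become the weights.

```python
def support_in(Cp, i, vi, j, vj):
    """sigma_Cp(vi, vj): tuples in Cp with t.Ai = vi and t.Aj = vj."""
    return sum(1 for t in Cp if t[i] == vi and t[j] == vj)

def gamma_p(Cp, m, i, vx, vz):
    """gamma^p(Vi,x, Vi,z) = sum over j != i and u in Dj of
    sigma_Cp(vx, u) * sigma_Cp(vz, u)."""
    return sum(support_in(Cp, i, vx, j, u) * support_in(Cp, i, vz, j, u)
               for j in range(m) if j != i
               for u in {t[j] for t in Cp})

def third_method_weights(Cp):
    """w_p,i = SimMax_p,i (max entry of similarity matrix [p,i]), normalized."""
    m = len(Cp[0])
    sim_max = []
    for i in range(m):
        dom = sorted({t[i] for t in Cp})
        sim_max.append(max(gamma_p(Cp, m, i, vx, vz)
                           for vx in dom for vz in dom))
    total = sum(sim_max)
    return [s / total for s in sim_max]

Cp = [("a", "d", "f"), ("a", "d", "g"), ("a", "d", "h"),
      ("a", "e", "i"), ("b", "e", "j")]
print(third_method_weights(Cp))
```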
Extension: pruning of insignificant weights
- Although the weights of irrelevant attributes are low, they can still affect the distance measure if there are too many of them
- Let α > 1. A weight is said to be insignificant if it is lower than 1/(α · m), where m is the number of dimensions
- Insignificant weights are set to zero, and all the weights are normalized again
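A minimal sketch of the pruning step; the weight vector and α below are illustrative.

```python
def prune_weights(w: list[float], alpha: float) -> list[float]:
    """Zero out weights below 1/(alpha * m), then renormalize."""
    threshold = 1.0 / (alpha * len(w))   # insignificance threshold
    kept = [wi if wi >= threshold else 0.0 for wi in w]
    total = sum(kept)
    return [wi / total for wi in kept]

print(prune_weights([0.45, 0.45, 0.04, 0.03, 0.03], alpha=2.0))
# [0.5, 0.5, 0.0, 0.0, 0.0]
```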
Future work
- Redefine the "badness" of the medoids:
  - Medoids are currently bad if their clusters are smaller than a predefined size, but real datasets may have clusters of various sizes
  - Detect whether different medoids are chosen from the same cluster
- Use other distance or similarity measures
- Study mixed-attribute clustering further
- Adapt other algorithms into weighted clustering algorithms
Conclusion
- Mixed-attribute clustering can exploit information from both types of attributes
- Weighted clustering can reduce the effect of noise on the clusters
- The weights are meaningful to end users
- Other algorithms can be adapted into weighted clustering algorithms
References
1) Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. ROCK: A robust clustering algorithm for categorical attributes. Information Systems, 25(5):345–366, 2000.
2) Venkatesh Ganti, Johannes Gehrke, and Raghu Ramakrishnan. CACTUS – clustering categorical data using summaries. In Knowledge Discovery and Data Mining, pages 73–83, 1999.
3) Charu C. Aggarwal, Joel L. Wolf, Philip S. Yu, Cecilia Procopiuc, and Jong Soo Park. Fast algorithms for projected clustering. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 61–72, 1999.