Mixed-Attribute Clustering and Weighted Clustering
Presented by: Yiu Man Lung
24 January, 2003
Outline
- Mixed-Attribute Clustering
  - ROCK, CACTUS
  - Links between mixed attributes
- Weighted Clustering
  - PROCLUS
  - Weights, distance and similarity measures
  - Methods of computing the weights
- Conclusion
Mixed-Attribute Clustering
- Most real datasets have mixed attributes:
  - numeric (continuous): total ordering
  - categorical (discrete): no total ordering
- Few clustering algorithms handle mixed attributes
- Combined information from mixed attributes may be useful for clustering
- Use the context to compute the distance / similarity measure instead of using a fixed measure
- Apply the concept of links (from ROCK) and the concept of strongly connected attribute values (from CACTUS)
ROCK
- Hierarchical, agglomerative clustering algorithm
- We focus only on the concept of links
- Given records Ti, Tj:
  - Their similarity is sim(Ti, Tj)
  - They are neighbors if sim(Ti, Tj) ≥ θ
  - link(Ti, Tj) is the number of their common neighbors
Relationship between links and clusters
- High intra-cluster similarity within clusters → many links within clusters
- High inter-cluster dissimilarity among clusters → few cross links among clusters
Example of links
- Jaccard coefficient: sim(Ti, Tj) = |Ti ∩ Tj| / |Ti ∪ Tj|
- sim({1,2,3}, {1,2,7}) = 0.5 (different clusters)
- sim({1,2,3}, {3,4,5}) = 0.2 (same cluster)
- For links, let θ = 0.5:
  - link({1,2,3}, {1,2,7}) = 3
  - link({1,2,3}, {3,4,5}) = 4
- Although the pair from different clusters has the higher similarity, the pair from the same cluster has more links, so links are the more robust merging criterion

Figure 1: Basket data example (adapted from [1])
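To make the link computation concrete, here is a minimal Python sketch (the function names are mine, not from the ROCK paper) that computes Jaccard similarities, the θ-neighbor relation, and link counts. Note that the slide's link values of 3 and 4 come from the full basket dataset of [1]; over only the three records shown, the counts would differ.

```python
# A sketch of ROCK-style links; names are illustrative, not from [1].
def jaccard(ti: frozenset, tj: frozenset) -> float:
    """Jaccard coefficient |Ti ∩ Tj| / |Ti ∪ Tj|."""
    return len(ti & tj) / len(ti | tj)

def links(records: list[frozenset], theta: float) -> dict[tuple[int, int], int]:
    """link(Ti, Tj): number of common neighbors, where Ti and Tj are
    neighbors iff sim(Ti, Tj) >= theta."""
    n = len(records)
    neighbor = [[jaccard(records[a], records[b]) >= theta for b in range(n)]
                for a in range(n)]
    return {(a, b): sum(neighbor[a][c] and neighbor[b][c] for c in range(n))
            for a in range(n) for b in range(a + 1, n)}

records = [frozenset({1, 2, 3}), frozenset({1, 2, 7}), frozenset({3, 4, 5})]
print(jaccard(records[0], records[1]))   # 0.5 (different clusters)
print(jaccard(records[0], records[2]))   # 0.2 (same cluster)
print(links(records, theta=0.5))
```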
CACTUS
- Clusters are generated and validated in later phases; we focus only on the summarization (first) phase:
  - inter-attribute summary IJ (links among attribute values)
  - intra-attribute summary II (similarity of attribute values)

Some notation:
- Dataset: D
- Tuple: t
- Categorical attributes: A1, A2, …, An
- Domains: D1, D2, …, Dn
- Values in domains: V1,*, V2,*, …, Vn,*
Support and similarity
- Let i ≠ j. The support σD(Vi,x, Vj,y) is defined as
  σD(Vi,x, Vj,y) = |{t ∈ D : t.Ai = Vi,x ∧ t.Aj = Vj,y}|
- Let α > 1. Vi,x and Vj,y are strongly connected if
  σD(Vi,x, Vj,y) > α · |D| / (|Di| · |Dj|)
  where |D| / (|Di| · |Dj|) is the expected support under the attribute-independence assumption
- σ*D(Vi,x, Vj,y) = σD(Vi,x, Vj,y) if they are strongly connected, and 0 otherwise
- The similarity γj(Vi,x, Vi,z) with respect to Dj (i ≠ j) is defined as
  γj(Vi,x, Vi,z) = |{u ∈ Dj : σ*D(Vi,x, u) > 0 ∧ σ*D(Vi,z, u) > 0}|
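The definitions above translate directly into code. Below is a minimal sketch, with my own function names and a toy two-attribute dataset, of the support, the strong-connection test, and γj.

```python
# Sketch of the CACTUS summarization quantities; names are mine.
def support(D, i, vi, j, vj):
    """sigma_D(Vi,x, Vj,y): tuples with t.Ai = vi and t.Aj = vj."""
    return sum(1 for t in D if t[i] == vi and t[j] == vj)

def strong_support(D, domains, alpha, i, vi, j, vj):
    """sigma*_D: the support if vi, vj are strongly connected, else 0."""
    s = support(D, i, vi, j, vj)
    expected = len(D) / (len(domains[i]) * len(domains[j]))  # independence
    return s if s > alpha * expected else 0  # strong-connection test

def gamma(D, domains, alpha, i, vx, vz, j):
    """gamma_j(Vi,x, Vi,z): values of Aj strongly connected to both."""
    return sum(1 for u in domains[j]
               if strong_support(D, domains, alpha, i, vx, j, u) > 0
               and strong_support(D, domains, alpha, i, vz, j, u) > 0)

D = [("a", "d"), ("a", "e"), ("b", "d"), ("b", "e"), ("a", "d")]
domains = [sorted({t[i] for t in D}) for i in range(2)]
print(gamma(D, domains, alpha=1.5, i=0, vx="a", vz="b", j=1))
```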
Example of similarities
Figure 2: Inter-attribute summary IJ, i.e. links among attribute values of different attributes (adapted from [2])
Figure 3: Intra-attribute summary II, i.e. the number of common neighbors in another attribute (adapted from [2])
Summary of the two previous concepts
- Links and strongly connected attribute values are both defined for categorical data
- The former is defined on tuples; the latter on attribute values
- The latter can be viewed as "links" between attribute values
- Both concepts need to be extended to mixed attributes
Links between mixed attributes
- Suppose Ai is categorical and Aj is numeric. How do we compute the similarity γj(Vi,x, Vi,z) with respect to Dj (i ≠ j)?
- Multiset: a set with multiplicity
  - {3,7} and {7,3} are equivalent; {3,7} and {3,3,7} are different
- The multiset of values of Aj over tuples with Ai = Vi,x is defined by
  MSet(Vi,x, i, j) = {t.Aj : t ∈ D ∧ t.Ai = Vi,x}
- Example: for D = {(a,5), (a,6), (b,7), (b,8)}, MSet(a,1,2) = {5,6} and MSet(b,1,2) = {7,8}
- The similarity γj(Vi,x, Vi,z) can then be computed from MSet(Vi,x, i, j) and MSet(Vi,z, i, j)

Figure 4: Inter-attribute summary of mixed attributes
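A minimal sketch of MSet, using Python's Counter as the multiset type and 0-based attribute indices (the slide uses 1-based indices):

```python
from collections import Counter

def mset(D, vi, i, j):
    """MSet(Vi,x, i, j) = multiset {t.Aj : t in D and t.Ai = Vi,x}."""
    return Counter(t[j] for t in D if t[i] == vi)

D = [("a", 5), ("a", 6), ("b", 7), ("b", 8)]
print(mset(D, "a", 0, 1))   # {5, 6}, i.e. MSet(a, 1, 2) on the slide
print(mset(D, "b", 0, 1))   # {7, 8}, i.e. MSet(b, 1, 2) on the slide
```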
Links between mixed attributes (1): Histogram
- Represent MSet(Vi,x, i, j) by a histogram Hist(Vi,x, i, j)
- Compute the similarity γj(Vi,x, Vi,z) as sim(Hist(Vi,x, i, j), Hist(Vi,z, i, j))
- Histogram intersection: Sim = 2 + 3 + 3 + 4 + 5 + 4 + 3 = 24 (needs to be normalized)
- A more robust method also considers adjacent regions
Figure 5: Histogram of Hist(Vi,x, i, j)
Figure 6: Histogram of Hist(Vi,z, i, j)
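A minimal sketch of histogram intersection. The bin counts are assumptions standing in for Figures 5 and 6 (chosen so the per-bin minima sum to 24, matching the slide); dividing by the smaller histogram mass is one common normalization, not one prescribed by the slide.

```python
def histogram_intersection(h1: list[int], h2: list[int]) -> float:
    """Sum of per-bin minima, normalized to [0, 1]."""
    raw = sum(min(a, b) for a, b in zip(h1, h2))
    return raw / min(sum(h1), sum(h2))

hist_x = [2, 4, 5, 4, 5, 6, 3]   # hypothetical Hist(Vi,x, i, j)
hist_z = [3, 3, 3, 6, 5, 4, 4]   # hypothetical Hist(Vi,z, i, j)
# Per-bin minima: 2 + 3 + 3 + 4 + 5 + 4 + 3 = 24, as on the slide.
print(histogram_intersection(hist_x, hist_z))
```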
Links between mixed attributes (2)
- Approximate the value sequence by a normal distribution: assume MSet(Vi,x, i, j) follows a normal distribution, so its mean and variance describe MSet(Vi,x, i, j) approximately
- Compute the similarity γj(Vi,x, Vi,z) from the means and variances of MSet(Vi,x, i, j) and MSet(Vi,z, i, j)
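The slide leaves the comparison of the two fitted normals open. One reasonable concrete choice (my assumption, not stated on the slide) is the Bhattacharyya coefficient between the two Gaussians, which lies in (0, 1]:

```python
import math
import statistics

def gaussian_similarity(xs: list[float], zs: list[float]) -> float:
    """Bhattacharyya coefficient of normals fitted to the two multisets."""
    m1, v1 = statistics.fmean(xs), statistics.pvariance(xs)
    m2, v2 = statistics.fmean(zs), statistics.pvariance(zs)
    # Bhattacharyya distance between N(m1, v1) and N(m2, v2);
    # assumes both variances are positive.
    bd = (0.25 * (m1 - m2) ** 2 / (v1 + v2)
          + 0.5 * math.log((v1 + v2) / (2.0 * math.sqrt(v1 * v2))))
    return math.exp(-bd)

print(gaussian_similarity([5.0, 6.0, 5.5], [7.0, 8.0, 7.5]))
```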
Weighted clustering
- Clustering is not meaningful in high-dimensional spaces because of irrelevant attributes; clusters may form in subspaces
- Projected clustering algorithms find the subspaces of the clusters during cluster formation
- Different attributes may have different relevance to different clusters
- Weighted clustering algorithms determine the weights of the attributes in each cluster
- Users can interpret the weights in a meaningful way
Example of weighted clustering
Assume there are 3 attributes: X, Y and Z.
Figure 7: Projection on the X-Y plane (adapted from [3])
Figure 8: Projection on the X-Z plane (adapted from [3])

           Projected clustering   Weighted clustering
Cluster 1  {X, Z}                 wX = 0.45, wY = 0.10, wZ = 0.45
Cluster 2  {X, Y}                 wX = 0.45, wY = 0.45, wZ = 0.10
PROCLUS
- A projected clustering algorithm; medoid-based and efficient
- Some disadvantages:
  - Clusters with fewer than (|D|/k) · minDev points are considered bad
  - The quality of the clusters depends on the choice of medoids
- Example: with k = 2 clusters, if the 2 medoids are unluckily drawn from the same cluster, that cluster is split into two small clusters, and the points in the other cluster become misses or outliers

Figure 9: An example of clusters
Definition of weights
Assume there are k clusters and m attributes. The weights of the clusters must satisfy:
- ∀ i ∈ [1,k], j ∈ [1,m]: wi,j is a real number with wi,j ∈ [0,1]
- ∀ i ∈ [1,k]: ∑j∈[1,m] wi,j = 1
(Note: i and j are integers.)
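As a quick sanity check, the weight matrix from the X/Y/Z example earlier satisfies both constraints:

```python
import math

# Hypothetical weights for k = 2 clusters over m = 3 attributes (X, Y, Z).
weights = [[0.45, 0.10, 0.45],
           [0.45, 0.45, 0.10]]
assert all(0.0 <= w <= 1.0 for row in weights for w in row)   # w_{i,j} in [0,1]
assert all(math.isclose(sum(row), 1.0) for row in weights)    # rows sum to 1
```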
Weighted measures
- Weighted distance: distp(x,y) = ∑i∈[1,m] wp,i · distp,i(x.Ai, y.Ai)
- Weighted similarity: simp(x,y) = ∑i∈[1,m] wp,i · simp,i(x.Ai, y.Ai)
- For the weights to be meaningful, distp,i and simp,i must return real values in [0,1]
- A simple categorical distance measure: distp,i(x.Ai, y.Ai) = 0 if x.Ai = y.Ai, and 1 otherwise
- A more complex simp,i will be introduced later
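A minimal sketch of the weighted distance with the simple 0/1 categorical per-attribute distance above; the tuple layout and weight values are illustrative assumptions.

```python
def weighted_distance(x: tuple, y: tuple, w: list[float]) -> float:
    """dist_p(x, y) = sum_i w_p,i * dist_p,i(x.Ai, y.Ai), in [0, 1]."""
    per_attr = [0.0 if xi == yi else 1.0 for xi, yi in zip(x, y)]  # 0/1 measure
    return sum(wi * di for wi, di in zip(w, per_attr))

w_p = [0.5, 0.375, 0.125]   # hypothetical weights of cluster p
print(weighted_distance(("a", "d", "f"), ("a", "e", "f"), w_p))  # 0.375
```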
Adapted algorithm from PROCLUS
- Adapt PROCLUS into a weighted clustering algorithm:
  - Change the FindDimension procedure (for finding relevant attributes) to a FindWeight procedure (for computing the weights of the attributes)
  - Replace the distance functions
- Three methods for computing the weights follow
1st method
wp,i = Var({|{t : t ∈ Cp ∧ t.Ai = Vi,j}| : Vi,j ∈ Di}), then normalize wp,i.
An attribute whose counts vary greatly among its attribute values has high relevance to the cluster, so the variance of the value counts of an attribute is used as the weight of that attribute in that cluster.

Example cluster:
A  B  C
a  d  f
a  d  g
a  d  h
a  e  i
b  e  j

Counts of each attribute value in the cluster, their variances, and the normalized weights:
      A         B         C
      a:4, b:1  d:3, e:2  f:1, g:1, h:1, i:1, j:1
Var   4.5       0.5       0
w     0.9       0.1       0
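A minimal sketch of the 1st method on the example cluster above. Using the sample variance reproduces the slide's numbers (Var = 4.5, 0.5, 0 and w = 0.9, 0.1, 0):

```python
from collections import Counter
import statistics

def variance_weights(cluster: list[tuple]) -> list[float]:
    """w_p,i = variance of the value counts of attribute i, normalized."""
    m = len(cluster[0])
    variances = []
    for i in range(m):
        counts = list(Counter(t[i] for t in cluster).values())
        variances.append(statistics.variance(counts) if len(counts) > 1 else 0.0)
    total = sum(variances)
    return [v / total for v in variances] if total > 0 else [1.0 / m] * m

cluster = [("a", "d", "f"), ("a", "d", "g"), ("a", "d", "h"),
           ("a", "e", "i"), ("b", "e", "j")]
print(variance_weights(cluster))   # [0.9, 0.1, 0.0]
```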
2nd method
wp,i = |{t : t ∈ Cp ∧ t.Ai = Medoid(Cp).Ai}|, then normalize wp,i.
For each attribute, we count the records in the cluster whose value on that attribute matches the medoid's. Attributes with high counts receive higher weights.

Example cluster (the medoid is the first tuple, (a, d, f)):
A  B  C
a  d  f   ← medoid
a  d  g
a  d  h
a  e  i
b  e  j

Match counts and normalized weights:
       A    B      C
count  4    3      1
w      0.5  0.375  0.125
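A minimal sketch of the 2nd method on the same cluster, with (a, d, f) as the medoid:

```python
def medoid_match_weights(cluster: list[tuple], medoid: tuple) -> list[float]:
    """w_p,i = |{t in Cp : t.Ai = Medoid(Cp).Ai}|, then normalized."""
    counts = [sum(1 for t in cluster if t[i] == medoid[i])
              for i in range(len(medoid))]
    total = sum(counts)
    return [c / total for c in counts]

cluster = [("a", "d", "f"), ("a", "d", "g"), ("a", "d", "h"),
           ("a", "e", "i"), ("b", "e", "j")]
print(medoid_match_weights(cluster, medoid=("a", "d", "f")))
# [0.5, 0.375, 0.125]
```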
3rd method
The original similarity γj(Vi,x, Vi,z) is too strict because only strongly connected attribute values are considered. We change the definition from
  γj(Vi,x, Vi,z) = |{u ∈ Dj : σ*D(Vi,x, u) > 0 ∧ σ*D(Vi,z, u) > 0}|
to
  γ^p_j(Vi,x, Vi,z) = ∑u∈Dj σCp(Vi,x, u) · σCp(Vi,z, u)
  γ^p(Vi,x, Vi,z) = ∑j∈[1,m], j≠i γ^p_j(Vi,x, Vi,z)
3rd method (cont'd)
There is one similarity matrix per attribute per cluster (k·m matrices in total).
  SimMaxp,i = max({γ^p(Vi,x, Vi,z) : Vi,x, Vi,z ∈ Di})   (the maximum entry of similarity matrix [p,i])
  simp,i(Vi,x, Vi,x) = 1
  simp,i(Vi,x, Vi,z) = γ^p(Vi,x, Vi,z) / SimMaxp,i
  wp,i = SimMaxp,i, then normalize wp,i

Example similarity matrix:
sim  f     g     s     y
f    1     0.01  0.68  0.97
g    0.01  1     0.01  0.01
s    0.68  0.01  1     1
y    0.97  0.01  1     1
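A minimal sketch of the 3rd method (function names are mine): γ^p sums products of within-cluster supports over all other attributes, and the normalized SimMax values become the weights.

```python
def support_in(Cp, i, vi, j, vj):
    """sigma_Cp(vi, vj): tuples in Cp with t.Ai = vi and t.Aj = vj."""
    return sum(1 for t in Cp if t[i] == vi and t[j] == vj)

def gamma_p(Cp, m, i, vx, vz):
    """gamma^p(Vi,x, Vi,z) = sum over j != i and u in Dj of
    sigma_Cp(vx, u) * sigma_Cp(vz, u)."""
    return sum(support_in(Cp, i, vx, j, u) * support_in(Cp, i, vz, j, u)
               for j in range(m) if j != i
               for u in {t[j] for t in Cp})

def third_method_weights(Cp):
    """w_p,i = SimMax_p,i (max entry of similarity matrix [p,i]), normalized."""
    m = len(Cp[0])
    sim_max = []
    for i in range(m):
        dom = sorted({t[i] for t in Cp})
        sim_max.append(max(gamma_p(Cp, m, i, vx, vz)
                           for vx in dom for vz in dom))
    total = sum(sim_max)
    return [s / total for s in sim_max]

Cp = [("a", "d", "f"), ("a", "d", "g"), ("a", "d", "h"),
      ("a", "e", "i"), ("b", "e", "j")]
print(third_method_weights(Cp))
```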
Extension: pruning of insignificant weights
- Although the weights of irrelevant attributes are low, they can still affect the distance measure if there are too many of them
- Let α > 1. A weight is said to be insignificant if it is lower than 1/(α · m), where m is the number of dimensions
- Insignificant weights are set to zero, and all the weights are normalized again
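A minimal sketch of the pruning step; the weight vector and α below are illustrative.

```python
def prune_weights(w: list[float], alpha: float) -> list[float]:
    """Zero out weights below 1/(alpha * m), then renormalize."""
    threshold = 1.0 / (alpha * len(w))   # insignificance threshold
    kept = [wi if wi >= threshold else 0.0 for wi in w]
    total = sum(kept)
    return [wi / total for wi in kept]

print(prune_weights([0.45, 0.45, 0.04, 0.03, 0.03], alpha=2.0))
# [0.5, 0.5, 0.0, 0.0, 0.0]
```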
Future work
- Redefine the "badness" of the medoids:
  - Medoids are currently bad if their clusters are smaller than a predefined size, but real datasets may have clusters of various sizes
  - Detect whether different medoids are chosen from the same cluster
- Use other distance or similarity measures
- Study mixed-attribute clustering further
- Adapt other algorithms into weighted clustering algorithms
Conclusion
- Mixed-attribute clustering can exploit information from both types of attributes
- Weighted clustering can reduce the effect of noise on the clusters
- The weights are meaningful to end users
- Other algorithms can be adapted into weighted clustering algorithms
References
1) Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. ROCK: A robust clustering algorithm for categorical attributes. Information Systems, 25(5):345–366, 2000.
2) Venkatesh Ganti, Johannes Gehrke, and Raghu Ramakrishnan. CACTUS – clustering categorical data using summaries. In Knowledge Discovery and Data Mining, pages 73–83, 1999.
3) Charu C. Aggarwal, Joel L. Wolf, Philip S. Yu, Cecilia Procopiuc, and Jong Soo Park. Fast algorithms for projected clustering. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 61–72, 1999.