Mixed-Attribute Clustering and Weighted Clustering Presented by: Yiu Man Lung 24 January, 2003


Page 1

Mixed-Attribute Clustering and Weighted Clustering

Presented by: Yiu Man Lung

24 January, 2003

Page 2

Outline

- Mixed-Attribute Clustering
  - ROCK, CACTUS
  - Links between mixed attributes
- Weighted Clustering
  - PROCLUS
  - Weights, distance and similarity measures
  - Methods of computing the weights
- Conclusion

Page 3

Mixed-Attribute Clustering

- Most real datasets have mixed attributes:
  - numeric (continuous): total ordering
  - categorical (discrete): no total ordering
- Few clustering algorithms handle mixed attributes
- Combined information from mixed attributes may be useful for clustering
- Use the context to compute the distance/similarity measure instead of using a fixed measure
- Apply the concept of links (from ROCK) and the concept of strongly-connected attribute values (from CACTUS)

Page 4

ROCK

- A hierarchical, agglomerative clustering algorithm; we focus only on the concept of links
- Given records Ti, Tj:
  - Their similarity is sim(Ti, Tj)
  - They are neighbors if sim(Ti, Tj) ≥ θ
  - link(Ti, Tj) is the number of their common neighbors
- Relationship between links and clusters:
  - High intra-cluster similarity: many links within clusters
  - High inter-cluster dissimilarity: few cross links among clusters

Page 5

Example of links

Jaccard coefficient: sim(Ti, Tj) = |Ti ⋂ Tj| / |Ti ⋃ Tj|

- sim({1,2,3},{1,2,7}) = 0.5, yet the two records belong to different clusters
- sim({1,2,3},{3,4,5}) = 0.2, yet the two records belong to the same cluster
- For links, let θ = 0.5:
  - link({1,2,3},{1,2,7}) = 3
  - link({1,2,3},{3,4,5}) = 4

Links thus capture cluster membership better than the raw similarity values.

Figure 1: Basket data example (adapted from [1])
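As a minimal sketch of how links can be computed, the snippet below derives neighbors from the Jaccard coefficient and counts common neighbors for every pair of baskets. Whether a basket counts as its own neighbor is not fixed by the slide, so excluding it here is an assumption, and the toy basket list is not the full dataset behind Figure 1.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard coefficient |Ti ∩ Tj| / |Ti ∪ Tj| of two baskets."""
    return len(a & b) / len(a | b)

def links(baskets, theta):
    """link(Ti, Tj): number of common neighbors, for every pair of baskets.

    Two baskets are neighbors if their similarity is at least theta.
    A basket is not counted as its own neighbor here; [1] leaves that
    convention to the similarity function, so treat it as an assumption.
    """
    n = len(baskets)
    nbrs = [{j for j in range(n)
             if j != i and jaccard(baskets[i], baskets[j]) >= theta}
            for i in range(n)]
    return {(i, j): len(nbrs[i] & nbrs[j])
            for i, j in combinations(range(n), 2)}

# Toy baskets only; not the full dataset behind Figure 1.
baskets = [{1, 2, 3}, {1, 2, 7}, {3, 4, 5}, {1, 2, 4}, {1, 2, 5}]
print(jaccard({1, 2, 3}, {1, 2, 7}))   # 0.5
print(jaccard({1, 2, 3}, {3, 4, 5}))   # 0.2
print(links(baskets, theta=0.5))
```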

Page 6

CACTUS

- Clusters are generated and validated in later phases; we focus only on the summarization (first) phase:
  - inter-attribute summary IJ (links among attribute values)
  - intra-attribute summary II (similarity of attribute values)
- Some notation:
  - Dataset: D
  - Tuple: t
  - Categorical attributes: A1, A2, …, An
  - Domains: D1, D2, …, Dn
  - Values in domains: V1,*, V2,*, …, Vn,*

Page 7

Support and similarity

Let i ≠ j. The support σD(Vi,x, Vj,y) is defined as

  σD(Vi,x, Vj,y) = |{t ∊ D : t.Ai = Vi,x ⋀ t.Aj = Vj,y}|

Let α > 1. Vi,x and Vj,y are strongly connected if

  σD(Vi,x, Vj,y) > α · |D| / (|Di| · |Dj|)

where |D| / (|Di| · |Dj|) is the expected support under the attribute-independence assumption. Then

  σ*D(Vi,x, Vj,y) = σD(Vi,x, Vj,y) if Vi,x and Vj,y are strongly connected, and 0 otherwise

The similarity γj(Vi,x, Vi,z) with respect to Dj (i ≠ j) is defined as

  γj(Vi,x, Vi,z) = |{u ∊ Dj : σ*D(Vi,x, u) > 0 ⋀ σ*D(Vi,z, u) > 0}|
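A minimal sketch of these definitions in Python follows, with tuples indexed by attribute position; the dataset, domain sizes, and α value in the usage lines are illustrative only.

```python
def support(D, i, j, v, u):
    """σD(Vi,x, Vj,y): number of tuples with t.Ai = v and t.Aj = u."""
    return sum(1 for t in D if t[i] == v and t[j] == u)

def starred_support(D, i, j, v, u, dom_sizes, alpha):
    """σ*D: the support if v and u are strongly connected, else 0."""
    s = support(D, i, j, v, u)
    expected = len(D) / (dom_sizes[i] * dom_sizes[j])  # independence assumption
    return s if s > alpha * expected else 0

def gamma(D, i, j, v, z, dom_j, dom_sizes, alpha):
    """γj(Vi,x, Vi,z): values u of Aj strongly connected to both v and z."""
    return sum(1 for u in dom_j
               if starred_support(D, i, j, v, u, dom_sizes, alpha) > 0
               and starred_support(D, i, j, z, u, dom_sizes, alpha) > 0)

# Illustrative only: 2 attributes with 2-value domains, alpha just above 1.
D = [("a", "x"), ("a", "x"), ("b", "x"), ("b", "y")]
print(gamma(D, 0, 1, "a", "b", dom_j={"x", "y"}, dom_sizes=[2, 2], alpha=1.2))
```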

Page 8

Example of similarities

Figure 2: Inter-attribute summary IJ, i.e. links among attribute values of different attributes (adapted from [2])

Figure 3: Intra-attribute summary II, i.e. the number of common neighbors in another attribute (adapted from [2])

Page 9

Summary of the two previous concepts

- Links and strongly-connected attribute values are both defined for categorical data
- The former is defined on tuples, the latter on attribute values
- The latter can be viewed as "links" between attribute values
- Both need to be extended to mixed attributes

Page 10

Links between mixed attributes

- Suppose Ai is categorical and Aj is numeric. How do we compute the similarity γj(Vi,x, Vi,z) with respect to Dj (i ≠ j)?
- Multiset: a set with multiplicity
  - {3,7} and {7,3} are equivalent; {3,7} and {3,3,7} are different
- The multiset of values of Aj among tuples with Ai = Vi,x is defined by
  MSet(Vi,x, i, j) = {t.Aj : t ∊ D ⋀ t.Ai = Vi,x}
- Example: D = {(a,5),(a,6),(b,7),(b,8)} gives MSet(a,1,2) = {5,6} and MSet(b,1,2) = {7,8}
- The similarity γj(Vi,x, Vi,z) can then be computed from MSet(Vi,x, i, j) and MSet(Vi,z, i, j)

Figure 4: Inter-attribute summary of mixed attributes

Page 11

Links between mixed attributes (1): Histogram

- Represent MSet(Vi,x, i, j) by a histogram Hist(Vi,x, i, j)
- Compute the similarity γj(Vi,x, Vi,z) as sim(Hist(Vi,x, i, j), Hist(Vi,z, i, j))
- Histogram intersection over the two histograms below:
  sim = 2 + 3 + 3 + 4 + 5 + 4 + 3 = 24 (needs to be normalized)
- A more robust method also considers adjacent regions

Figure 5: Histogram of Hist(Vi,x, i, j) (bins 0 to 6)
Figure 6: Histogram of Hist(Vi,z, i, j) (bins 0 to 6)
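The sketch below builds the multisets and histograms and computes their intersection. Binning over a shared value range and normalizing by the smaller histogram mass are assumptions, since the slide only says the raw sum "needs to be normalized".

```python
from collections import Counter

def mset(D, i, j, v):
    """MSet(Vi,x, i, j): multiset of Aj values among tuples with t.Ai = v."""
    return Counter(t[j] for t in D if t[i] == v)

def hist(values, bins, lo, hi):
    """Bucket numeric values into `bins` equal-width bins over [lo, hi]."""
    width = (hi - lo) / bins
    h = [0] * bins
    for x in values:
        h[min(int((x - lo) / width), bins - 1)] += 1
    return h

def hist_intersection(h1, h2):
    """Sum of bin-wise minima, normalized by the smaller histogram mass
    (one common normalization; the slide does not fix one)."""
    raw = sum(min(a, b) for a, b in zip(h1, h2))
    return raw / min(sum(h1), sum(h2))

D = [("a", 5.0), ("a", 6.0), ("b", 6.0), ("b", 8.0)]
xs = list(mset(D, 0, 1, "a").elements())
zs = list(mset(D, 0, 1, "b").elements())
lo, hi = min(xs + zs), max(xs + zs)   # shared range for both histograms
print(hist_intersection(hist(xs, 3, lo, hi), hist(zs, 3, lo, hi)))
```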

Page 12

Links between mixed attributes (2): Normal approximation

- Assume MSet(Vi,x, i, j) follows a normal distribution, so its mean and variance describe it approximately
- Compute the similarity γj(Vi,x, Vi,z) from the means and variances of MSet(Vi,x, i, j) and MSet(Vi,z, i, j)
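The slide does not name a concrete formula for comparing the two fitted normals. As one plausible choice, the sketch below uses the Bhattacharyya coefficient, which is 1 for identical distributions and falls toward 0 as they diverge.

```python
from math import exp, log, sqrt
from statistics import mean, pvariance

def gaussian_similarity(xs, zs):
    """Similarity in (0, 1] between two multisets, each fitted as a normal
    distribution via its mean and variance (variances must be nonzero)."""
    m1, v1 = mean(xs), pvariance(xs)
    m2, v2 = mean(zs), pvariance(zs)
    # Bhattacharyya distance between N(m1, v1) and N(m2, v2)
    d = 0.25 * (m1 - m2) ** 2 / (v1 + v2) \
        + 0.5 * log((v1 + v2) / (2 * sqrt(v1 * v2)))
    return exp(-d)  # the coefficient: 1 when the distributions coincide

print(gaussian_similarity([5, 6], [7, 8]))  # distant means, small similarity
print(gaussian_similarity([5, 6], [5, 6]))  # identical multisets -> 1.0
```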

Page 13

Weighted clustering

- Clustering is not meaningful in high-dimensional spaces because of irrelevant attributes; clusters may form in subspaces
- Projected clustering algorithms find the subspaces of the clusters during cluster formation
- Different attributes may have different relevance to different clusters
- Weighted clustering algorithms determine the weights of the attributes in each cluster
- Users can interpret the weights in a meaningful way

Page 14

Example of weighted clustering

Assume there are 3 attributes X, Y and Z.

Figure 7: Projection on X-Y plane (adapted from [3])
Figure 8: Projection on X-Z plane (adapted from [3])

              Projected clustering   Weighted clustering
  Cluster 1   {X,Z}                  wx = 0.45, wy = 0.10, wz = 0.45
  Cluster 2   {X,Y}                  wx = 0.45, wy = 0.45, wz = 0.10

Page 15

PROCLUS

- A projected clustering algorithm; medoid-based and efficient
- Some disadvantages:
  - Clusters with fewer than (|D|/k)*minDev points are considered bad
  - The quality of the clusters depends on the medoids
- Example with number of clusters k = 2: if the 2 medoids are unluckily drawn from the same cluster, that cluster is split into two small clusters, and points in the other cluster become misses or outliers

Figure 9: An example of clusters

Page 16

Definition of weights

Assume there are k clusters and m attributes. The weights of the clusters must satisfy:

- ∀ i ∈ [1,k], j ∈ [1,m]: wi,j is a real number with wi,j ∈ [0,1]
- ∀ i ∈ [1,k]: ∑j∈[1,m] wi,j = 1

Note: i and j are integers.

Page 17

Weighted measures

- Weighted distance: distp(x,y) = ∑i∈[1,m] wp,i · distp,i(x.Ai, y.Ai)
- Weighted similarity: simp(x,y) = ∑i∈[1,m] wp,i · simp,i(x.Ai, y.Ai)
- For the weights to be meaningful, distp,i and simp,i must return real values in [0,1]
- A simple categorical distance measure:
  distp,i(x.Ai, y.Ai) = 0 if x.Ai = y.Ai, and 1 otherwise
- A more complex simp,i will be introduced later
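A minimal sketch of the weighted distance, with the slide's simple categorical measure plugged in; the example tuples and weight values are illustrative only.

```python
def weighted_distance(x, y, weights, dists):
    """distp(x, y) = Σi wp,i · distp,i(x.Ai, y.Ai).

    With weights summing to 1 and every per-attribute distance in [0, 1],
    the result also stays in [0, 1]."""
    return sum(w * d(a, b) for w, d, a, b in zip(weights, dists, x, y))

# The slide's simple categorical measure: 0 if equal, 1 otherwise.
def categorical(a, b):
    return 0.0 if a == b else 1.0

x, y = ("a", "d", "f"), ("a", "e", "j")
print(weighted_distance(x, y, weights=[0.5, 0.375, 0.125],
                        dists=[categorical] * 3))  # 0.375 + 0.125 = 0.5
```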

Page 18

Adapted algorithms from PROCLUS

- Adapt PROCLUS into a weighted clustering algorithm:
  - Change the FindDimension procedure (for finding relevant attributes) into a FindWeight procedure (for computing the weights of the attributes)
  - Replace the distance functions
- 3 methods for computing the weights are given next

Page 19

1st method

wp,i = Var({|{t : t ∊ Cp ⋀ t.Ai = Vi,j}| : Vi,j ∊ Di}), then normalize wp,i

- The variance of the attribute-value counts (the count of each attribute value in the cluster) of an attribute is used as the weight of that attribute in that cluster
- An attribute with a high variance of counts among its attribute values has high relevance to the cluster

Example cluster:

  A  B  C
  a  d  f
  a  d  g
  a  d  h
  a  e  i
  b  e  j

          A         B         C
  counts  a:4, b:1  d:3, e:2  f:1, g:1, h:1, i:1, j:1
  Var     4.5       0.5       0
  w       0.9       0.1       0
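A minimal sketch of the 1st method in Python, reproducing the weights in the table above. How to treat domain values that never occur in the cluster is not specified on the slide, so this version uses only the values that do occur.

```python
from collections import Counter
from statistics import variance

def variance_weights(cluster):
    """1st method: weight each attribute by the sample variance of its
    attribute-value counts in the cluster, then normalize to sum to 1."""
    m = len(cluster[0])
    raw = []
    for i in range(m):
        counts = list(Counter(t[i] for t in cluster).values())
        raw.append(variance(counts) if len(counts) > 1 else 0.0)
    total = sum(raw)
    return [w / total for w in raw] if total else raw

cluster = [("a", "d", "f"), ("a", "d", "g"), ("a", "d", "h"),
           ("a", "e", "i"), ("b", "e", "j")]
print(variance_weights(cluster))  # [0.9, 0.1, 0.0], matching the table
```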

Page 20

2nd method

wp,i = |{t : t ∊ Cp ⋀ t.Ai = Medoid(Cp).Ai}|, then normalize wp,i

- For each attribute, count the records in the cluster whose value on that attribute equals the medoid's
- Attributes with high counts get higher weights

Example cluster with medoid (a,d,f):

  A  B  C
  a  d  f   (medoid)
  a  d  g
  a  d  h
  a  e  i
  b  e  j

         A    B      C
  count  4    3      1
  w      0.5  0.375  0.125
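A minimal sketch of the 2nd method on the same example cluster; the medoid (a,d,f) is taken from the table above.

```python
def medoid_weights(cluster, medoid):
    """2nd method: for each attribute, count the cluster tuples sharing the
    medoid's value on that attribute, then normalize to sum to 1."""
    raw = [sum(1 for t in cluster if t[i] == v) for i, v in enumerate(medoid)]
    total = sum(raw)
    return [w / total for w in raw]

cluster = [("a", "d", "f"), ("a", "d", "g"), ("a", "d", "h"),
           ("a", "e", "i"), ("b", "e", "j")]
print(medoid_weights(cluster, medoid=("a", "d", "f")))  # [0.5, 0.375, 0.125]
```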

Page 21

3rd method

- The original similarity γj(Vi,x, Vi,z) is too strict because only strongly-connected attribute values are considered
- We change the definition from
  γj(Vi,x, Vi,z) = |{u ∊ Dj : σ*D(Vi,x, u) > 0 ⋀ σ*D(Vi,z, u) > 0}|
  to
  γ^p_j(Vi,x, Vi,z) = ∑u∊Dj σCp(Vi,x, u) · σCp(Vi,z, u)
  γ^p(Vi,x, Vi,z) = ∑j∈[1,m], j≠i γ^p_j(Vi,x, Vi,z)

Page 22

3rd method (cont'd)

One similarity matrix per attribute per cluster (k·m matrices in total):

  SimMaxp,i = max({γ^p(Vi,x, Vi,z) : Vi,x, Vi,z ∊ Di})   (the maximum entry of similarity matrix [p,i])
  simp,i(Vi,x, Vi,x) = 1
  simp,i(Vi,x, Vi,z) = γ^p(Vi,x, Vi,z) / SimMaxp,i
  wp,i = SimMaxp,i, then normalize wp,i

Example similarity matrix:

  sim  f     g     s     y
  f    1     0.01  0.68  0.97
  g    0.01  1     0.01  0.01
  s    0.68  0.01  1     1
  y    0.97  0.01  1     1
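A minimal sketch of the 3rd method. Whether the diagonal entries (x = z) participate in SimMax is not stated on the slide, so this version takes the maximum over distinct value pairs.

```python
from collections import Counter
from itertools import combinations

def gamma_p(cluster, i, v, z):
    """γ^p(Vi,x, Vi,z): co-occurrence products summed over other attributes."""
    m = len(cluster[0])
    total = 0
    for j in range(m):
        if j == i:
            continue
        cv = Counter(t[j] for t in cluster if t[i] == v)  # σCp(Vi,x, u)
        cz = Counter(t[j] for t in cluster if t[i] == z)  # σCp(Vi,z, u)
        total += sum(cv[u] * cz[u] for u in cv.keys() & cz.keys())
    return total

def third_method_weights(cluster):
    """3rd method: wp,i = SimMaxp,i over distinct value pairs, normalized."""
    m = len(cluster[0])
    raw = []
    for i in range(m):
        dom = sorted({t[i] for t in cluster})
        pairs = [gamma_p(cluster, i, v, z) for v, z in combinations(dom, 2)]
        raw.append(max(pairs) if pairs else 0)
    total = sum(raw)
    return [w / total for w in raw] if total else raw

cluster = [("a", "d", "f"), ("a", "d", "g"), ("a", "d", "h"),
           ("a", "e", "i"), ("b", "e", "j")]
print(third_method_weights(cluster))
```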

Page 23

Extension: pruning insignificant weights

- Although the weights of irrelevant attributes are low, they can still distort the distance measure if there are too many of them
- Let α > 1. A weight is insignificant if it is lower than 1/(α · m), where m is the number of dimensions
- Insignificant weights are set to zero and all the weights are normalized again
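A minimal sketch of the pruning step; the weight vector and α in the usage line are illustrative only.

```python
def prune_weights(weights, alpha):
    """Zero out weights below 1/(alpha·m), then renormalize. Since the
    weights sum to 1, at least one weight survives when alpha > 1."""
    threshold = 1.0 / (alpha * len(weights))
    kept = [w if w >= threshold else 0.0 for w in weights]
    total = sum(kept)
    return [w / total for w in kept]

print(prune_weights([0.30, 0.28, 0.28, 0.05, 0.05, 0.04], alpha=2.0))
```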

Page 24

Future work

- Redefine the "badness" of the medoids:
  - Medoids are currently bad if their clusters are smaller than a predefined size, but real datasets may have clusters of various sizes
  - Detect whether different medoids are chosen from the same cluster
- Use other distance or similarity measures
- Study mixed-attribute clustering further
- Adapt other algorithms into weighted clustering algorithms

Page 25

Conclusion

- Mixed-attribute clustering can exploit information from both types of attributes
- Weighted clustering can reduce the effect of noise on the clusters
- The weights are meaningful to end-users
- Other algorithms can be adapted into weighted clustering algorithms

Page 26

References

1) Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. ROCK: A robust clustering algorithm for categorical attributes. Information Systems, 25(5):345–366, 2000.

2) Venkatesh Ganti, Johannes Gehrke, and Raghu Ramakrishnan. CACTUS – clustering categorical data using summaries. In Knowledge Discovery and Data Mining, pages 73–83, 1999.

3) Charu C. Aggarwal, Joel L. Wolf, Philip S. Yu, Cecilia Procopiuc, and Jong Soo Park. Fast algorithms for projected clustering. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 61–72, 1999.