

Distance-based clustering

    Chapter 8


    Clustering

• Clustering is the art of finding groups in data (Kaufman and Rousseeuw, 1990).

• What is a cluster?
  – A group of objects separated from other clusters.


    Means and medians

As discussed earlier, the mean is the minimizer of

$$\arg\min_y \sum_{x \in D} \|x - y\|^2.$$

What about using

$$\arg\min_y \sum_{x \in D} \|x - y\| \;?$$

This gives rise to the geometric median, which is more robust to outliers. Issue: there is no closed-form solution.
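Although there is no closed form, the geometric median is easy to approximate numerically. A minimal sketch of one standard iterative method, Weiszfeld's algorithm, in NumPy; the function name and the tolerance/iteration parameters are illustrative choices, not from the slides:

```python
import numpy as np

def geometric_median(X, tol=1e-6, max_iter=200):
    """Approximate the geometric median of the rows of X via Weiszfeld's algorithm."""
    y = X.mean(axis=0)                        # start from the ordinary mean
    for _ in range(max_iter):
        d = np.linalg.norm(X - y, axis=1)     # distances to the current estimate
        d = np.where(d < 1e-12, 1e-12, d)     # avoid division by zero at data points
        w = 1.0 / d
        y_new = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(y_new - y) < tol:
            return y_new
        y = y_new
    return y

# The median is pulled far less by the outlier than the mean is.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [100.0, 100.0]])
print("mean:            ", X.mean(axis=0))
print("geometric median:", geometric_median(X))
```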

    Means, medians, medoids

    It may be useful to restrict exemplars to be one of the given data points. These are called medoids. How would we compute the medoid for a set of points?

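One direct answer to the question above: compute all pairwise distances and pick the data point whose total distance to the others is smallest (this costs O(n^2 d) for n points in d dimensions). A minimal NumPy sketch; the function name is mine, not from the slides:

```python
import numpy as np

def medoid(X):
    """Return the data point in X minimising the sum of distances to all other points."""
    # Pairwise Euclidean distances via broadcasting: D[i, j] = ||X[i] - X[j]||
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return X[D.sum(axis=1).argmin()]

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
print(medoid(X))   # one of the three clustered points, not the outlier
```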


    K-means clustering

Notation: $D_j$ is the set of points assigned to cluster $j$, and $\mu_j$ is its mean. A plausible objective:

$$\text{minimize} \quad \sum_{j=1}^{K} \sum_{x_i \in D_j} \| x_i - \mu_j \|^2$$

Issue: minimizing this objective exactly is NP-hard.

    K-means clustering

Algorithm 8.1 (p. 248): K-means clustering

Algorithm KMeans(D, K)
Input:  data D ⊆ R^d; number of clusters K ∈ N.
Output: K cluster means µ1, ..., µK ∈ R^d.
1  randomly initialise K vectors µ1, ..., µK ∈ R^d;
2  repeat
3      assign each x ∈ D to argmin_j Dis2(x, µj);
4      for j = 1 to K do
5          Dj ← {x ∈ D | x assigned to cluster j};
6          µj ← (1 / |Dj|) Σ_{x ∈ Dj} x;
7      end
8  until no change in µ1, ..., µK;
9  return µ1, ..., µK;

(Slides adapted from Peter Flach (University of Bristol), Machine Learning: Making Sense of Data, 2012.)
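A minimal NumPy transcription of Algorithm 8.1; the function name, seed, and iteration cap are illustrative choices, not from the book:

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Plain K-means (Algorithm 8.1). Returns (means, cluster assignments)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]   # initialise to K random data points
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest mean (squared Euclidean distance).
        dist2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
        assign = dist2.argmin(axis=1)
        # Update step: recompute each mean from the points assigned to it.
        new_mu = np.array([X[assign == j].mean(axis=0) if np.any(assign == j) else mu[j]
                           for j in range(K)])
        if np.allclose(new_mu, mu):                      # no change in the means: converged
            break
        mu = new_mu
    return mu, assign

# Three well-separated Gaussian blobs as a quick sanity check.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in ([0, 0], [4, 4], [0, 4])])
means, labels = kmeans(X, K=3)
print(np.round(means, 2))
```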

    K-means clustering

    Iterations of K-means

Figure 8.11 (p. 248): K-means clustering

(left) First iteration of 3-means on Gaussian mixture data. The dotted lines are the Voronoi boundaries resulting from randomly initialised centroids; the violet solid lines are the result of the recalculated means. (middle) Second iteration, taking the previous partition as starting point (dotted line). (right) Third iteration with stable clustering.


    Local minima

    The k-means algorithm converges to a local minimum of its objective function:

Figure 8.12 (p. 249): Sub-optimality of K-means

(left) First iteration of 3-means on the same data as Figure 8.11, with differently initialised centroids. (right) 3-means has converged to a sub-optimal clustering.

    Running time?

What is the running time per iteration? Assigning every point to its nearest centroid and recomputing the K means costs O(nKd) per iteration for n points in d dimensions (see Algorithm 8.1 above).

Typically, K-means converges very quickly, and it is in fact guaranteed to converge in a finite number of iterations: the objective never increases, and there are only finitely many partitions of the data.

The algorithm can also easily be kernelized, since the assignment step only needs distances to cluster means, which can be expressed entirely in terms of inner products (a sketch follows).
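To see why kernelization is easy, note that the squared distance from a point to a cluster mean in feature space can be written purely in terms of kernel evaluations: $\|\phi(x_i) - \mu_j\|^2 = K_{ii} - \frac{2}{|C_j|}\sum_{a \in C_j} K_{ia} + \frac{1}{|C_j|^2}\sum_{a,b \in C_j} K_{ab}$. A sketch assuming a precomputed kernel matrix; function and parameter names are mine:

```python
import numpy as np

def kernel_kmeans(K, k, max_iter=100, seed=0):
    """Kernel K-means on a precomputed n x n kernel matrix K. Returns cluster assignments."""
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    assign = rng.integers(0, k, size=n)               # random initial assignment
    for _ in range(max_iter):
        dist2 = np.empty((n, k))
        for j in range(k):
            members = np.flatnonzero(assign == j)
            if len(members) == 0:                     # keep empty clusters from breaking things
                dist2[:, j] = np.inf
                continue
            # ||phi(x_i) - mu_j||^2 = K_ii - 2/|C_j| * sum_a K_ia + 1/|C_j|^2 * sum_ab K_ab
            dist2[:, j] = (np.diag(K)
                           - 2.0 * K[:, members].mean(axis=1)
                           + K[np.ix_(members, members)].mean())
        new_assign = dist2.argmin(axis=1)
        if np.array_equal(new_assign, assign):
            break
        assign = new_assign
    return assign

# With a linear kernel K = X @ X.T this reduces to ordinary K-means.
```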


    Dealing with local minima

Run the algorithm multiple times with different initializations and keep the solution with the lowest objective value (a sketch follows).
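A minimal sketch of the restart strategy using scikit-learn's KMeans, if available: one initialisation per run, keeping the run with the lowest within-cluster sum of squares (`inertia_`). The number of restarts here is an arbitrary choice.

```python
from sklearn.cluster import KMeans

def best_of_restarts(X, k, n_restarts=20):
    """Run K-means n_restarts times and keep the fit with the lowest objective."""
    best = None
    for seed in range(n_restarts):
        km = KMeans(n_clusters=k, n_init=1, random_state=seed).fit(X)
        if best is None or km.inertia_ < best.inertia_:
            best = km
    return best   # best.cluster_centers_, best.labels_, best.inertia_
```

In practice `KMeans(n_clusters=k, n_init=20)` performs the same restarts internally.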

    Initialization

A good initialization can lead to faster convergence and a better local optimum. Common choices:

• Choose K random data points as centroids.
• Randomly divide the data into K clusters and compute their centroids.

A more sophisticated approach (Bradley and Fayyad, 1998), sketched after the reference below:

• Create a collection of subsamples of the data and cluster each subsample with K-means; then cluster the resulting cluster centers using K-means and use those centers as the initialization.

P.S. Bradley and Usama M. Fayyad. Refining Initial Points for K-Means Clustering. Proceedings of the Fifteenth International Conference on Machine Learning (ICML '98).
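A rough sketch of this subsample-based refinement, as I read the Bradley and Fayyad idea (simplified; the sample sizes, counts, and function name are arbitrary choices):

```python
import numpy as np
from sklearn.cluster import KMeans

def refined_init(X, k, n_subsamples=10, subsample_size=100, seed=0):
    """Cluster several small subsamples, then cluster the resulting centers;
    use the centers of that second clustering to initialise K-means on the full data."""
    rng = np.random.default_rng(seed)
    centers = []
    for _ in range(n_subsamples):
        idx = rng.choice(len(X), size=min(subsample_size, len(X)), replace=False)
        centers.append(KMeans(n_clusters=k, n_init=1).fit(X[idx]).cluster_centers_)
    centers = np.vstack(centers)
    init = KMeans(n_clusters=k, n_init=1).fit(centers).cluster_centers_
    return KMeans(n_clusters=k, init=init, n_init=1).fit(X)
```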

    K-medoids clustering

The K-medoids algorithm replaces the cluster means with medoids, i.e. exemplars that must be actual data points:

Algorithm 8.2 (p. 250): K-medoids clustering

Algorithm KMedoids(D, K, Dis)
Input:  data D ⊆ X; number of clusters K ∈ N; distance metric Dis: X × X → R.
Output: K medoids µ1, ..., µK ∈ D, representing a predictive clustering of X.
1  randomly pick K data points µ1, ..., µK ∈ D;
2  repeat
3      assign each x ∈ D to argmin_j Dis(x, µj);
4      for j = 1 to K do
5          Dj ← {x ∈ D | x assigned to cluster j};
6          µj ← argmin_{x ∈ Dj} Σ_{x' ∈ Dj} Dis(x, x');
7      end
8  until no change in µ1, ..., µK;
9  return µ1, ..., µK;
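A direct NumPy transcription of Algorithm 8.2, working from a precomputed distance matrix; names and defaults are illustrative:

```python
import numpy as np

def kmedoids(D, k, max_iter=100, seed=0):
    """K-medoids (Algorithm 8.2) on a precomputed n x n distance matrix D.
    Returns (medoid indices, cluster assignments)."""
    n = D.shape[0]
    rng = np.random.default_rng(seed)
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(max_iter):
        assign = D[:, medoids].argmin(axis=1)            # nearest medoid for each point
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(assign == j)
            if len(members) == 0:
                continue
            # The new medoid minimises the total distance to the cluster's members.
            within = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[within.argmin()]
        if np.array_equal(np.sort(new_medoids), np.sort(medoids)):
            break
        medoids = new_medoids
    return medoids, D[:, medoids].argmin(axis=1)
```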

    Partitioning around medoids (PAM)

    PAM clustering

Algorithm 8.3 (p. 251): Partitioning around medoids (PAM)

Algorithm PAM(D, K, Dis)
Input:  data D ⊆ X; number of clusters K ∈ N; distance metric Dis: X × X → R.
Output: K medoids µ1, ..., µK ∈ D, representing a predictive clustering of X.
1   randomly pick K data points µ1, ..., µK ∈ D;
2   repeat
3       assign each x ∈ D to argmin_j Dis(x, µj);
4       for j = 1 to K do
5           Dj ← {x ∈ D | x assigned to cluster j};
6       end
7       Q ← Σ_j Σ_{x ∈ Dj} Dis(x, µj);
8       for each medoid m and each non-medoid o do
9           calculate the improvement in Q resulting from swapping m with o;
10      end
11      select the pair with maximum improvement and swap;
12  until no further improvement possible;
13  return µ1, ..., µK;
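A brute-force sketch of the PAM swap loop in Algorithm 8.3; every candidate swap is re-evaluated from scratch, which is fine for small data but not efficient. Names are illustrative:

```python
import numpy as np

def pam(D, k, seed=0):
    """Partitioning around medoids (Algorithm 8.3) on a precomputed distance matrix D."""
    n = D.shape[0]
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(n, size=k, replace=False))
    cost = lambda med: D[:, med].min(axis=1).sum()       # Q: total distance to nearest medoid
    Q = cost(medoids)
    improved = True
    while improved:
        improved = False
        best_swap, best_Q = None, Q
        for i in range(k):                                # try every medoid / non-medoid swap
            for o in range(n):
                if o in medoids:
                    continue
                candidate = medoids[:i] + [o] + medoids[i + 1:]
                q = cost(candidate)
                if q < best_Q:
                    best_swap, best_Q = candidate, q
        if best_swap is not None:                         # apply the best improving swap
            medoids, Q, improved = best_swap, best_Q, True
    return np.array(medoids), D[:, medoids].argmin(axis=1)
```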


    Sensitivity to scaling

Figure 8.13 (p. 251): Scale-sensitivity of K-means

(left) On this data, 2-means detects the right clusters. (right) After rescaling the y-axis, this configuration has a higher between-cluster scatter than the intended one.
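Because of this scale sensitivity, it is common to standardise each feature (zero mean, unit variance) before running K-means. A one-function NumPy sketch, with the caveat that rescaling is itself a modelling choice and, as the figure shows, the "right" scaling depends on what structure you want to find:

```python
import numpy as np

def standardise(X):
    """Rescale each column of X to zero mean and unit variance."""
    std = X.std(axis=0)
    std[std == 0] = 1.0            # leave constant features unchanged
    return (X - X.mean(axis=0)) / std
```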

    Assumptions behind the model

K-means assumes spherical clusters. We will discuss probabilistic extensions that address this to some extent. It is probably the most widely used clustering algorithm because of its simplicity and ease of implementation.

    Silhouettes

    How do we know we have a good clustering?

Silhouettes I

• For any data point $x_i$, let $d(x_i, D_j)$ denote the average distance of $x_i$ to the data points in cluster $D_j$, and let $j(i)$ denote the index of the cluster that $x_i$ belongs to.

• Furthermore, let $a(x_i) = d(x_i, D_{j(i)})$ be the average distance of $x_i$ to the points in its own cluster $D_{j(i)}$, and let $b(x_i) = \min_{k \neq j(i)} d(x_i, D_k)$ be the average distance to the points in its neighbouring cluster.

• We would expect $a(x_i)$ to be considerably smaller than $b(x_i)$, but this cannot be guaranteed.

• So we can take the difference $b(x_i) - a(x_i)$ as an indication of how 'well-clustered' $x_i$ is, and divide this by $b(x_i)$ to obtain a number less than or equal to 1.

    Silhouettes

Silhouettes II

• It is, however, conceivable that $a(x_i) > b(x_i)$, in which case the difference $b(x_i) - a(x_i)$ is negative. This describes the situation that, on average, the members of the neighbouring cluster are closer to $x_i$ than the members of its own cluster.

• In order to get a normalised value we divide by $a(x_i)$ in this case. This leads to the following definition:

$$s(x_i) = \frac{b(x_i) - a(x_i)}{\max\big(a(x_i), b(x_i)\big)}$$

• A silhouette then sorts and plots $s(x)$ for each instance, grouped by cluster.
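A direct NumPy implementation of $s(x_i)$ as defined above (scikit-learn's `silhouette_samples` / `silhouette_score` compute the same quantity); the function name is mine, and at least two clusters are assumed:

```python
import numpy as np

def silhouette_values(D, assign):
    """Silhouette s(x_i) for each point, given a distance matrix D and cluster labels."""
    labels = np.unique(assign)
    s = np.zeros(len(assign))
    for i in range(len(assign)):
        own = assign[i]
        # a(x_i): average distance to the other points in x_i's own cluster
        same = np.flatnonzero(assign == own)
        same = same[same != i]
        a = D[i, same].mean() if len(same) > 0 else 0.0
        # b(x_i): smallest average distance to the points of any other cluster
        b = min(D[i, assign == other].mean() for other in labels if other != own)
        s[i] = (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return s
```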

Figure 8.14 (p. 252): Silhouettes

(left) Silhouette for the clustering in Figure 8.13 (left), using squared Euclidean distance. Almost all points have a high s(x), which means that they are much closer, on average, to the other members of their cluster than to the members of the neighbouring cluster. (right) The silhouette for the clustering in Figure 8.13 (right) is much less convincing.


Dendrograms

Definition: Given a dataset D, a dendrogram is a binary tree with the elements of D at its leaves. An internal node of the tree represents the subset of elements in the leaves of the subtree rooted at that node.

    Hierarchical clustering

Algorithm outline:

• Start with each data point in a separate cluster.
• At each step, merge the closest pair of clusters.

We need to define a measure of distance between clusters (a linkage function):

Definition 8.4 (p. 254): Dendrogram and linkage function

Given a data set D, a dendrogram is a binary tree with the elements of D at its leaves. An internal node of the tree represents the subset of elements in the leaves of the subtree rooted at that node. The level of a node is the distance between the two clusters represented by the children of the node. Leaves have level 0.

A linkage function L: 2^X × 2^X → R calculates the distance between arbitrary subsets of the instance space, given a distance metric Dis: X × X → R.

    Linkage functions

• Single linkage: the smallest pairwise distance between elements from each cluster.
• Complete linkage: the largest pairwise distance between elements from each cluster.
• Average linkage: the average pairwise distance between elements from each cluster.
• Centroid linkage: the distance between the cluster means.

(A sketch of these four linkage functions follows.)
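The four linkage functions written out for two clusters given as arrays of points; a small sketch, with the helper and function names being my own:

```python
import numpy as np

def _pairwise(A, B):
    # All pairwise Euclidean distances between the rows of A and the rows of B.
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

def single_linkage(A, B):   return _pairwise(A, B).min()
def complete_linkage(A, B): return _pairwise(A, B).max()
def average_linkage(A, B):  return _pairwise(A, B).mean()
def centroid_linkage(A, B): return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))
```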


Dendrograms revisited

Interpretation of the vertical dimension: the distance between two clusters at the point when they were merged (the level associated with the resulting cluster). The leaves have level 0.

    Hierarchical clustering

Algorithm 8.4 (p. 255): Hierarchical agglomerative clustering

Algorithm HAC(D, L)
Input:  data D ⊆ X; linkage function L: 2^X × 2^X → R, defined in terms of a distance metric.
Output: a dendrogram representing a descriptive clustering of D.
1  initialise clusters to singleton data points;
2  create a leaf at level 0 for every singleton cluster;
3  repeat
4      find the pair of clusters X, Y with lowest linkage l, and merge;
5      create a parent of X, Y at level l;
6  until all data points are in one cluster;
7  return the constructed binary tree with linkage levels;
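In practice hierarchical agglomerative clustering is rarely hand-rolled. A typical usage sketch with SciPy and Matplotlib, if available; the choice of complete linkage and of three clusters is arbitrary:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.default_rng(0).random((20, 2))      # 20 random 2-d points

Z = linkage(X, method='complete')                 # build the merge tree (Algorithm 8.4)
dendrogram(Z)                                     # plot it; heights are the linkage levels
plt.show()

labels = fcluster(Z, t=3, criterion='maxclust')   # cut the tree into (at most) 3 clusters
print(labels)
```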

    Linkage matters

Figure 8.16 (p. 256): Linkage matters

(left) Complete linkage defines cluster distance as the largest pairwise distance between elements from each cluster, indicated by the coloured lines between data points. (middle) Centroid linkage defines the distance between clusters as the distance between their means. Notice that E obtains the same linkage as A and B, and so the latter clusters effectively disappear. (right) Single linkage defines the distance between clusters as the smallest pairwise distance. The dendrogram all but collapses, which means that no meaningful clusters are found in the given grid configuration.

There are also issues with centroid linkage (see the book).

    Clustering random data

Figure 8.18 (p. 258): Clustering random data

(left) 20 data points, generated by uniform random sampling. (middle) The dendrogram generated from complete linkage. The three clusters suggested by the dendrogram are spurious, as they cannot be observed in the data. (right) The rapidly decreasing silhouette values in each cluster confirm the absence of a strong cluster structure. Point 18 has a negative silhouette value, as it is on average closer to the green points than to the other red points.


How many clusters are in my data?

Clustering algorithms will find as many clusters as you ask for, so we need methods for deciding on the number of clusters (a sketch of one common heuristic follows).
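One common heuristic, sketched here with scikit-learn if available: run K-means for a range of K and compare average silhouette scores (alternatively, look for an "elbow" in the objective value). The range of K and the function name are arbitrary choices:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def pick_k(X, k_values=range(2, 11)):
    """Return the K with the best average silhouette, plus all scores for inspection."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)   # higher is better
    return max(scores, key=scores.get), scores
```

As Figure 8.18 illustrates, such scores should be read with care: on structureless data every choice of K is equally spurious.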