

Distance-based clustering

    Chapter 8


    Clustering

• Clustering is the art of finding groups in data (Kaufman and Rousseeuw, 1990).

• What is a cluster?
  – A group of objects separated from other clusters.


    Means and medians

As discussed earlier, the mean is the minimizer of

$$\arg\min_y \sum_{x \in D} \|x - y\|^2.$$

What about using

$$\arg\min_y \sum_{x \in D} \|x - y\| \;?$$

This gives rise to the geometric median, which is more robust to outliers. Issue: there is no closed-form solution.
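Although there is no closed form, the geometric median is easy to approximate numerically. A minimal sketch of one standard iterative method, Weiszfeld's algorithm, in NumPy; the function name and the tolerance/iteration parameters are illustrative choices, not from the slides:

```python
import numpy as np

def geometric_median(X, tol=1e-6, max_iter=200):
    """Approximate the geometric median of the rows of X via Weiszfeld's algorithm."""
    y = X.mean(axis=0)                        # start from the ordinary mean
    for _ in range(max_iter):
        d = np.linalg.norm(X - y, axis=1)     # distances to the current estimate
        d = np.where(d < 1e-12, 1e-12, d)     # avoid division by zero at data points
        w = 1.0 / d
        y_new = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(y_new - y) < tol:
            return y_new
        y = y_new
    return y

# The median is pulled far less by the outlier than the mean is.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [100.0, 100.0]])
print("mean:            ", X.mean(axis=0))
print("geometric median:", geometric_median(X))
```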

    Means, medians, medoids

    It may be useful to restrict exemplars to be one of the given data points. These are called medoids. How would we compute the medoid for a set of points?

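One direct answer to the question above: compute all pairwise distances and pick the data point whose total distance to the others is smallest (this costs O(n^2 d) for n points in d dimensions). A minimal NumPy sketch; the function name is mine, not from the slides:

```python
import numpy as np

def medoid(X):
    """Return the data point in X minimising the sum of distances to all other points."""
    # Pairwise Euclidean distances via broadcasting: D[i, j] = ||X[i] - X[j]||
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return X[D.sum(axis=1).argmin()]

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
print(medoid(X))   # one of the three clustered points, not the outlier
```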


    K-means clustering

Notation: $D_j$ is the set of points assigned to cluster $j$, and $\mu_j$ is its mean. A plausible objective:

$$\text{minimize} \quad \sum_{j=1}^{K} \sum_{x_i \in D_j} \| x_i - \mu_j \|^2$$

Issue: minimizing this objective exactly is NP-hard.

    K-means clustering

Algorithm 8.1 (p. 248): K-means clustering

Algorithm KMeans(D, K)
Input:  data D ⊆ R^d; number of clusters K ∈ N.
Output: K cluster means µ1, ..., µK ∈ R^d.
1  randomly initialise K vectors µ1, ..., µK ∈ R^d;
2  repeat
3      assign each x ∈ D to argmin_j Dis2(x, µj);
4      for j = 1 to K do
5          Dj ← {x ∈ D | x assigned to cluster j};
6          µj ← (1 / |Dj|) Σ_{x ∈ Dj} x;
7      end
8  until no change in µ1, ..., µK;
9  return µ1, ..., µK;

(Slides adapted from Peter Flach (University of Bristol), Machine Learning: Making Sense of Data, 2012.)
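A minimal NumPy transcription of Algorithm 8.1; the function name, seed, and iteration cap are illustrative choices, not from the book:

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Plain K-means (Algorithm 8.1). Returns (means, cluster assignments)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]   # initialise to K random data points
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest mean (squared Euclidean distance).
        dist2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
        assign = dist2.argmin(axis=1)
        # Update step: recompute each mean from the points assigned to it.
        new_mu = np.array([X[assign == j].mean(axis=0) if np.any(assign == j) else mu[j]
                           for j in range(K)])
        if np.allclose(new_mu, mu):                      # no change in the means: converged
            break
        mu = new_mu
    return mu, assign

# Three well-separated Gaussian blobs as a quick sanity check.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in ([0, 0], [4, 4], [0, 4])])
means, labels = kmeans(X, K=3)
print(np.round(means, 2))
```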

    K-means clustering

    Iterations of K-means

Figure 8.11 (p. 248): K-means clustering

(left) First iteration of 3-means on Gaussian mixture data. The dotted lines are the Voronoi boundaries resulting from randomly initialised centroids; the violet solid lines are the result of the recalculated means. (middle) Second iteration, taking the previous partition as starting point (dotted line). (right) Third iteration with stable clustering.


    Local minima

    The k-means algorithm converges to a local minimum of its objective function:

Figure 8.12 (p. 249): Sub-optimality of K-means

(left) First iteration of 3-means on the same data as Figure 8.11, with differently initialised centroids. (right) 3-means has converged to a sub-optimal clustering.

    Running time?

What is the running time per iteration? Assigning every point to its nearest centroid and recomputing the K means costs O(nKd) per iteration for n points in d dimensions (see Algorithm 8.1 above).

Typically, K-means converges very quickly, and it is in fact guaranteed to converge in a finite number of iterations: the objective never increases, and there are only finitely many partitions of the data.

The algorithm can also easily be kernelized, since the assignment step only needs distances to cluster means, which can be expressed entirely in terms of inner products (a sketch follows).
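To see why kernelization is easy, note that the squared distance from a point to a cluster mean in feature space can be written purely in terms of kernel evaluations: $\|\phi(x_i) - \mu_j\|^2 = K_{ii} - \frac{2}{|C_j|}\sum_{a \in C_j} K_{ia} + \frac{1}{|C_j|^2}\sum_{a,b \in C_j} K_{ab}$. A sketch assuming a precomputed kernel matrix; function and parameter names are mine:

```python
import numpy as np

def kernel_kmeans(K, k, max_iter=100, seed=0):
    """Kernel K-means on a precomputed n x n kernel matrix K. Returns cluster assignments."""
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    assign = rng.integers(0, k, size=n)               # random initial assignment
    for _ in range(max_iter):
        dist2 = np.empty((n, k))
        for j in range(k):
            members = np.flatnonzero(assign == j)
            if len(members) == 0:                     # keep empty clusters from breaking things
                dist2[:, j] = np.inf
                continue
            # ||phi(x_i) - mu_j||^2 = K_ii - 2/|C_j| * sum_a K_ia + 1/|C_j|^2 * sum_ab K_ab
            dist2[:, j] = (np.diag(K)
                           - 2.0 * K[:, members].mean(axis=1)
                           + K[np.ix_(members, members)].mean())
        new_assign = dist2.argmin(axis=1)
        if np.array_equal(new_assign, assign):
            break
        assign = new_assign
    return assign

# With a linear kernel K = X @ X.T this reduces to ordinary K-means.
```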


    Dealing with local minima

Run the algorithm multiple times with different initializations and keep the solution with the lowest objective value (a sketch follows).
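A minimal sketch of the restart strategy using scikit-learn's KMeans, if available: one initialisation per run, keeping the run with the lowest within-cluster sum of squares (`inertia_`). The number of restarts here is an arbitrary choice.

```python
from sklearn.cluster import KMeans

def best_of_restarts(X, k, n_restarts=20):
    """Run K-means n_restarts times and keep the fit with the lowest objective."""
    best = None
    for seed in range(n_restarts):
        km = KMeans(n_clusters=k, n_init=1, random_state=seed).fit(X)
        if best is None or km.inertia_ < best.inertia_:
            best = km
    return best   # best.cluster_centers_, best.labels_, best.inertia_
```

In practice `KMeans(n_clusters=k, n_init=20)` performs the same restarts internally.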

    Initialization

A good initialization can lead to faster convergence and a better local optimum. Common choices:

• Choose K random data points as centroids.
• Randomly divide the data into K clusters and compute their centroids.

A more sophisticated approach (Bradley and Fayyad, 1998), sketched after the reference below:

• Create a collection of subsamples of the data and cluster each subsample with K-means; then cluster the resulting cluster centers using K-means and use those centers as the initialization.

P.S. Bradley and Usama M. Fayyad. Refining Initial Points for K-Means Clustering. Proceedings of the Fifteenth International Conference on Machine Learning (ICML '98).
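A rough sketch of this subsample-based refinement, as I read the Bradley and Fayyad idea (simplified; the sample sizes, counts, and function name are arbitrary choices):

```python
import numpy as np
from sklearn.cluster import KMeans

def refined_init(X, k, n_subsamples=10, subsample_size=100, seed=0):
    """Cluster several small subsamples, then cluster the resulting centers;
    use the centers of that second clustering to initialise K-means on the full data."""
    rng = np.random.default_rng(seed)
    centers = []
    for _ in range(n_subsamples):
        idx = rng.choice(len(X), size=min(subsample_size, len(X)), replace=False)
        centers.append(KMeans(n_clusters=k, n_init=1).fit(X[idx]).cluster_centers_)
    centers = np.vstack(centers)
    init = KMeans(n_clusters=k, n_init=1).fit(centers).cluster_centers_
    return KMeans(n_clusters=k, init=init, n_init=1).fit(X)
```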

    K-medoids clustering

The K-medoids algorithm replaces the cluster means with medoids, i.e. exemplars that must be actual data points:

Algorithm 8.2 (p. 250): K-medoids clustering

Algorithm KMedoids(D, K, Dis)
Input:  data D ⊆ X; number of clusters K ∈ N; distance metric Dis: X × X → R.
Output: K medoids µ1, ..., µK ∈ D, representing a predictive clustering of X.
1  randomly pick K data points µ1, ..., µK ∈ D;
2  repeat
3      assign each x ∈ D to argmin_j Dis(x, µj);
4      for j = 1 to K do
5          Dj ← {x ∈ D | x assigned to cluster j};
6          µj ← argmin_{x ∈ Dj} Σ_{x' ∈ Dj} Dis(x, x');
7      end
8  until no change in µ1, ..., µK;
9  return µ1, ..., µK;
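A direct NumPy transcription of Algorithm 8.2, working from a precomputed distance matrix; names and defaults are illustrative:

```python
import numpy as np

def kmedoids(D, k, max_iter=100, seed=0):
    """K-medoids (Algorithm 8.2) on a precomputed n x n distance matrix D.
    Returns (medoid indices, cluster assignments)."""
    n = D.shape[0]
    rng = np.random.default_rng(seed)
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(max_iter):
        assign = D[:, medoids].argmin(axis=1)            # nearest medoid for each point
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(assign == j)
            if len(members) == 0:
                continue
            # The new medoid minimises the total distance to the cluster's members.
            within = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[within.argmin()]
        if np.array_equal(np.sort(new_medoids), np.sort(medoids)):
            break
        medoids = new_medoids
    return medoids, D[:, medoids].argmin(axis=1)
```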

    Partitioning around medoids (PAM)

    PAM clustering

Algorithm 8.3 (p. 251): Partitioning around medoids (PAM)

Algorithm PAM(D, K, Dis)
Input:  data D ⊆ X; number of clusters K ∈ N; distance metric Dis: X × X → R.
Output: K medoids µ1, ..., µK ∈ D, representing a predictive clustering of X.
1   randomly pick K data points µ1, ..., µK ∈ D;
2   repeat
3       assign each x ∈ D to argmin_j Dis(x, µj);
4       for j = 1 to K do
5           Dj ← {x ∈ D | x assigned to cluster j};
6       end
7       Q ← Σ_j Σ_{x ∈ Dj} Dis(x, µj);
8       for each medoid m and each non-medoid o do
9           calculate the improvement in Q resulting from swapping m with o;
10      end
11      select the pair with maximum improvement and swap;
12  until no further improvement possible;
13  return µ1, ..., µK;
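A brute-force sketch of the PAM swap loop in Algorithm 8.3; every candidate swap is re-evaluated from scratch, which is fine for small data but not efficient. Names are illustrative:

```python
import numpy as np

def pam(D, k, seed=0):
    """Partitioning around medoids (Algorithm 8.3) on a precomputed distance matrix D."""
    n = D.shape[0]
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(n, size=k, replace=False))
    cost = lambda med: D[:, med].min(axis=1).sum()       # Q: total distance to nearest medoid
    Q = cost(medoids)
    improved = True
    while improved:
        improved = False
        best_swap, best_Q = None, Q
        for i in range(k):                                # try every medoid / non-medoid swap
            for o in range(n):
                if o in medoids:
                    continue
                candidate = medoids[:i] + [o] + medoids[i + 1:]
                q = cost(candidate)
                if q < best_Q:
                    best_swap, best_Q = candidate, q
        if best_swap is not None:                         # apply the best improving swap
            medoids, Q, improved = best_swap, best_Q, True
    return np.array(medoids), D[:, medoids].argmin(axis=1)
```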


    Sensitivity to scaling

Figure 8.13 (p. 251): Scale-sensitivity of K-means

(left) On this data, 2-means detects the right clusters. (right) After rescaling the y-axis, this configuration has a higher between-cluster scatter than the intended one.
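Because of this scale sensitivity, it is common to standardise each feature (zero mean, unit variance) before running K-means. A one-function NumPy sketch, with the caveat that rescaling is itself a modelling choice and, as the figure shows, the "right" scaling depends on what structure you want to find:

```python
import numpy as np

def standardise(X):
    """Rescale each column of X to zero mean and unit variance."""
    std = X.std(axis=0)
    std[std == 0] = 1.0            # leave constant features unchanged
    return (X - X.mean(axis=0)) / std
```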

    Assumptions behind the model

K-means assumes spherical clusters. We will discuss probabilistic extensions that address this to some extent. It is probably the most widely used clustering algorithm because of its simplicity and ease of implementation.

    Silhouettes

    How do we know we have a good clustering?

Silhouettes I

• For any data point $x_i$, let $d(x_i, D_j)$ denote the average distance of $x_i$ to the data points in cluster $D_j$, and let $j(i)$ denote the index of the cluster that $x_i$ belongs to.

• Furthermore, let $a(x_i) = d(x_i, D_{j(i)})$ be the average distance of $x_i$ to the points in its own cluster $D_{j(i)}$, and let $b(x_i) = \min_{k \neq j(i)} d(x_i, D_k)$ be the average distance to the points in its neighbouring cluster.

• We would expect $a(x_i)$ to be considerably smaller than $b(x_i)$, but this cannot be guaranteed.

• So we can take the difference $b(x_i) - a(x_i)$ as an indication of how 'well-clustered' $x_i$ is, and divide this by $b(x_i)$ to obtain a number less than or equal to 1.

    Silhouettes

Silhouettes II

• It is, however, conceivable that $a(x_i) > b(x_i)$, in which case the difference $b(x_i) - a(x_i)$ is negative. This describes the situation that, on average, the members of the neighbouring cluster are closer to $x_i$ than the members of its own cluster.

• In order to get a normalised value we divide by $a(x_i)$ in this case. This leads to the following definition:

$$s(x_i) = \frac{b(x_i) - a(x_i)}{\max\big(a(x_i), b(x_i)\big)}$$

• A silhouette then sorts and plots $s(x)$ for each instance, grouped by cluster.
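A direct NumPy implementation of $s(x_i)$ as defined above (scikit-learn's `silhouette_samples` / `silhouette_score` compute the same quantity); the function name is mine, and at least two clusters are assumed:

```python
import numpy as np

def silhouette_values(D, assign):
    """Silhouette s(x_i) for each point, given a distance matrix D and cluster labels."""
    labels = np.unique(assign)
    s = np.zeros(len(assign))
    for i in range(len(assign)):
        own = assign[i]
        # a(x_i): average distance to the other points in x_i's own cluster
        same = np.flatnonzero(assign == own)
        same = same[same != i]
        a = D[i, same].mean() if len(same) > 0 else 0.0
        # b(x_i): smallest average distance to the points of any other cluster
        b = min(D[i, assign == other].mean() for other in labels if other != own)
        s[i] = (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return s
```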

Figure 8.14 (p. 252): Silhouettes

(left) Silhouette for the clustering in Figure 8.13 (left), using squared Euclidean distance. Almost all points have a high s(x), which means that they are much closer, on average, to the other members of their cluster than to the members of the neighbouring cluster. (right) The silhouette for the clustering in Figure 8.13 (right) is much less convincing.


Dendrograms

Definition: Given a dataset D, a dendrogram is a binary tree with the elements of D at its leaves. An internal node of the tree represents the subset of elements in the leaves of the subtree rooted at that node.

    Hierarchical clustering

Algorithm outline:

• Start with each data point in a separate cluster.
• At each step, merge the closest pair of clusters.

We need to define a measure of distance between clusters (a linkage function):

Definition 8.4 (p. 254): Dendrogram and linkage function

Given a data set D, a dendrogram is a binary tree with the elements of D at its leaves. An internal node of the tree represents the subset of elements in the leaves of the subtree rooted at that node. The level of a node is the distance between the two clusters represented by the children of the node. Leaves have level 0.

A linkage function L: 2^X × 2^X → R calculates the distance between arbitrary subsets of the instance space, given a distance metric Dis: X × X → R.

    Linkage functions

• Single linkage: the smallest pairwise distance between elements from each cluster.
• Complete linkage: the largest pairwise distance between elements from each cluster.
• Average linkage: the average pairwise distance between elements from each cluster.
• Centroid linkage: the distance between the cluster means.

(A sketch of these four linkage functions follows.)
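The four linkage functions written out for two clusters given as arrays of points; a small sketch, with the helper and function names being my own:

```python
import numpy as np

def _pairwise(A, B):
    # All pairwise Euclidean distances between the rows of A and the rows of B.
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

def single_linkage(A, B):   return _pairwise(A, B).min()
def complete_linkage(A, B): return _pairwise(A, B).max()
def average_linkage(A, B):  return _pairwise(A, B).mean()
def centroid_linkage(A, B): return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))
```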


Dendrograms revisited

Interpretation of the vertical dimension: the distance between two clusters at the point when they were merged (the level associated with the resulting cluster). The leaves have level 0.

    Hierarchical clustering

Algorithm 8.4 (p. 255): Hierarchical agglomerative clustering

Algorithm HAC(D, L)
Input:  data D ⊆ X; linkage function L: 2^X × 2^X → R, defined in terms of a distance metric.
Output: a dendrogram representing a descriptive clustering of D.
1  initialise clusters to singleton data points;
2  create a leaf at level 0 for every singleton cluster;
3  repeat
4      find the pair of clusters X, Y with lowest linkage l, and merge;
5      create a parent of X, Y at level l;
6  until all data points are in one cluster;
7  return the constructed binary tree with linkage levels;
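In practice hierarchical agglomerative clustering is rarely hand-rolled. A typical usage sketch with SciPy and Matplotlib, if available; the choice of complete linkage and of three clusters is arbitrary:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.default_rng(0).random((20, 2))      # 20 random 2-d points

Z = linkage(X, method='complete')                 # build the merge tree (Algorithm 8.4)
dendrogram(Z)                                     # plot it; heights are the linkage levels
plt.show()

labels = fcluster(Z, t=3, criterion='maxclust')   # cut the tree into (at most) 3 clusters
print(labels)
```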

    Linkage matters

Figure 8.16 (p. 256): Linkage matters

(left) Complete linkage defines cluster distance as the largest pairwise distance between elements from each cluster, indicated by the coloured lines between data points. (middle) Centroid linkage defines the distance between clusters as the distance between their means. Notice that E obtains the same linkage as A and B, and so the latter clusters effectively disappear. (right) Single linkage defines the distance between clusters as the smallest pairwise distance. The dendrogram all but collapses, which means that no meaningful clusters are found in the given grid configuration.

There are also issues with centroid linkage (see the book).

    Clustering random data

Figure 8.18 (p. 258): Clustering random data

(left) 20 data points, generated by uniform random sampling. (middle) The dendrogram generated from complete linkage. The three clusters suggested by the dendrogram are spurious, as they cannot be observed in the data. (right) The rapidly decreasing silhouette values in each cluster confirm the absence of a strong cluster structure. Point 18 has a negative silhouette value, as it is on average closer to the green points than to the other red points.


How many clusters are in my data?

Clustering algorithms will find as many clusters as you ask for, so we need methods for deciding on the number of clusters (a sketch of one common heuristic follows).
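One common heuristic, sketched here with scikit-learn if available: run K-means for a range of K and compare average silhouette scores (alternatively, look for an "elbow" in the objective value). The range of K and the function name are arbitrary choices:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def pick_k(X, k_values=range(2, 11)):
    """Return the K with the best average silhouette, plus all scores for inspection."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)   # higher is better
    return max(scores, key=scores.get), scores
```

As Figure 8.18 illustrates, such scores should be read with care: on structureless data every choice of K is equally spurious.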