CS 565: Data mining (cs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf)
![Page 1: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/1.jpg)
Boston University Slideshow Title Goes Here
CS 565: Data mining
• Clustering: David Arthur, Sergei Vassilvitskii. k-means++: The Advantages of Careful Seeding. In SODA 2007
• Thanks A. Gionis and S. Vassilvitskii for the slides
![Page 2: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/2.jpg)
What is clustering?
• a grouping of data objects such that the objects within a group are similar (or near) to one another and dissimilar (or far) from the objects in other groups
![Page 3: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/3.jpg)
How to capture this objective?
• a grouping of data objects such that the objects within a group are similar (or near) to one another and dissimilar (or far) from the objects in other groups
• minimize intra-cluster distances
• maximize inter-cluster distances
![Page 4: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/4.jpg)
The clustering problem
• Given a collection of data objects
• Find a grouping so that
  • similar objects are in the same cluster
  • dissimilar objects are in different clusters
✦ Why do we care?
  ✦ stand-alone tool to gain insight into the data
  ✦ visualization
  ✦ preprocessing step for other algorithms
  ✦ indexing or compression often relies on clustering
![Page 5: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/5.jpg)
Applications of clustering
• image processing
  • cluster images based on their visual content
• web mining
  • cluster groups of users based on their access patterns on webpages
  • cluster webpages based on their content
• bioinformatics
  • cluster similar proteins together (similarity with respect to chemical structure and/or functionality, etc.)
• many more...
![Page 6: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/6.jpg)
The clustering problem
• Given a collection of data objects
• Find a grouping so that
  • similar objects are in the same cluster
  • dissimilar objects are in different clusters
✦ Basic questions:
  ✦ what does similar mean?
  ✦ what is a good partition of the objects? i.e., how is the quality of a solution measured?
  ✦ how to find a good partition?
![Page 7: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/7.jpg)
Notion of a cluster can be ambiguous
[Figure: How many clusters? Panels: Two Clusters; Four Clusters; Six Clusters. © Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004]
Types of clusterings
• A clustering is a set of clusters
• Important distinction between hierarchical and partitional sets of clusters
• Partitional clustering
  • a division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
• Hierarchical clustering
  • a set of nested clusters organized as a hierarchical tree
![Page 8: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/8.jpg)
Types of clusterings
• Partitional
  • each object belongs to exactly one cluster
• Hierarchical
  • a set of nested clusters organized in a tree
![Page 9: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/9.jpg)
Hierarchical clustering
[Figure: Hierarchical Clustering of points p1 to p4. Panels: Traditional Hierarchical Clustering; Traditional Dendrogram; Non-traditional Hierarchical Clustering; Non-traditional Dendrogram. © Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004]
![Page 10: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/10.jpg)
Partitional clustering
[Figure: Partitional Clustering. Panels: Original Points; A Partitional Clustering. © Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004]
![Page 11: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/11.jpg)
Partitional algorithms
• partition the n objects into k clusters
• each object belongs to exactly one cluster
• the number of clusters k is given in advance
![Page 12: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/12.jpg)
The k-means problem
• consider a set $X = \{x_1, \dots, x_n\}$ of $n$ points in $\mathbb{R}^d$
• assume that the number $k$ is given
• problem:
  • find $k$ points $c_1, \dots, c_k$ (named centers or means) so that the cost
$$\sum_{i=1}^{n} \min_{j} L_2^2(x_i, c_j) = \sum_{i=1}^{n} \min_{j} \|x_i - c_j\|_2^2$$
is minimized
![Page 13: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/13.jpg)
The k-means problem
• consider a set $X = \{x_1, \dots, x_n\}$ of $n$ points in $\mathbb{R}^d$
• assume that the number $k$ is given
• problem:
  • find $k$ points $c_1, \dots, c_k$ (named centers or means)
  • and partition $X$ into $\{X_1, \dots, X_k\}$ by assigning each point $x_i$ in $X$ to its nearest cluster center,
  • so that the cost
$$\sum_{i=1}^{n} \min_{j} \|x_i - c_j\|_2^2 = \sum_{j=1}^{k} \sum_{x \in X_j} \|x - c_j\|_2^2$$
is minimized
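The two forms of the cost can be compared on a toy instance. The following is a minimal sketch in plain Python; the function names `cost_min_form` and `cost_partition_form` and the four sample points are illustrative assumptions, not part of the slides.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def cost_min_form(points, centers):
    """Left-hand side: each point pays the squared distance to its nearest center."""
    return sum(min(sq_dist(x, c) for c in centers) for x in points)

def cost_partition_form(points, centers):
    """Right-hand side: partition by nearest center, then sum within each cluster."""
    clusters = [[] for _ in centers]
    for x in points:
        j = min(range(len(centers)), key=lambda j: sq_dist(x, centers[j]))
        clusters[j].append(x)
    return sum(sq_dist(x, centers[j])
               for j, cluster in enumerate(clusters) for x in cluster)

# Hypothetical data: two pairs of points, one center in the middle of each pair.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(0.0, 1.0), (10.0, 1.0)]
print(cost_min_form(points, centers), cost_partition_form(points, centers))
```

Both functions return the same value because assigning every point to its nearest center is exactly what the inner minimum does.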
![Page 14: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/14.jpg)
The k-means problem
• k=1 and k=n are easy special cases (why?)
• an NP-hard problem if the dimension of the data is at least 2 (d≥2)
  • for d≥2, no polynomial-time algorithm for finding the optimal solution is known (none exists unless P = NP)
• for d=1 the problem is solvable in polynomial time
• in practice, a simple iterative algorithm works quite well
![Page 15: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/15.jpg)
The k-means algorithm
• voted among the top-10 algorithms in data mining
• one way of solving the k-means problem
![Page 16: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/16.jpg)
The k-means algorithm
1. randomly (or with another method) pick $k$ cluster centers $\{c_1, \dots, c_k\}$
2. for each $j$, set the cluster $X_j$ to be the set of points in $X$ that are the closest to center $c_j$
3. for each $j$ let $c_j$ be the center of cluster $X_j$ (mean of the vectors in $X_j$)
4. repeat (go to step 2) until convergence
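The four steps can be sketched in plain Python as follows. This is an illustrative sketch, not the course's reference implementation: the function name `kmeans`, the `seed` and `max_iter` parameters, the rule that an empty cluster keeps its old center, and the sample data are all assumptions made here.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centers = rng.sample(points, k)              # step 1: random initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # step 2: assign each point to its closest center
        clusters = [[] for _ in range(k)]
        for x in points:
            j = min(range(k), key=lambda j: sq_dist(x, centers[j]))
            clusters[j].append(x)
        # step 3: move each center to the mean of its cluster
        # (assumption: an empty cluster keeps its previous center)
        new_centers = [mean(c) if c else centers[j]
                       for j, c in enumerate(clusters)]
        if new_centers == centers:               # step 4: stop when nothing moves
            break
        centers = new_centers
    return centers, clusters

# Hypothetical data: two small groups of points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, clusters = kmeans(points, 2, seed=1)
print(centers)
```

On convergence every returned center is the mean of its cluster, which is the local-optimum condition of the algorithm.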
![Page 17: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/17.jpg)
Sample execution
[Figure: Fig. 1 from "Top 10 algorithms in data mining": changes in cluster representative locations (indicated by '+' signs) and data assignments (indicated by color) during an execution of the k-means algorithm]
![Page 18: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/18.jpg)
Properties of the k-means algorithm
• finds a local optimum
• often converges quickly but not always
• the choice of initial points can have a large influence on the result
![Page 19: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/19.jpg)
Effects of bad initialization
[Figure: Fig. 2 from X. Wu et al.: effect of an inferior initialization on the k-means results]
… extent by running the algorithm multiple times with different initial centroids, or by doing limited local search about the converged solution.
2.2 Limitations
In addition to being sensitive to initialization, the k-means algorithm suffers from several other problems. First, observe that k-means is a limiting case of fitting data by a mixture of k Gaussians with identical, isotropic covariance matrices ($\Sigma = \sigma^2 I$), when the soft assignments of data points to mixture components are hardened to allocate each data point solely to the most likely component. So, it will falter whenever the data is not well described by reasonably separated spherical balls, for example, if there are non-convex shaped clusters in the data. This problem may be alleviated by rescaling the data to “whiten” it before clustering, or by using a different distance measure that is more appropriate for the dataset. For example, information-theoretic clustering uses the KL-divergence to measure the distance between two data points representing two discrete probability distributions. It has been recently shown that if one measures distance by selecting any member of a very large class of divergences called Bregman divergences during the assignment step and makes no other changes, the essential properties of k-means, including guaranteed convergence, linear separation boundaries and scalability, are retained [3]. This result makes k-means effective for a much larger class of datasets so long as an appropriate divergence is used.
![Page 20: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/20.jpg)
Limitations of k-means: different sizes
• K-means has problems when clusters are of differing
  • sizes
  • densities
  • non-globular shapes
• K-means has problems when the data contains outliers.
[Figure: Limitations of K-means: Differing Sizes. Panels: Original Points; K-means (3 Clusters). © Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004]
![Page 21: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/21.jpg)
Limitations of k-means: different density
[Figure: Limitations of K-means: Differing Density. Panels: Original Points; K-means (3 Clusters). © Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004]
![Page 22: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/22.jpg)
Limitations of k-means: non-spherical shapes
[Figure: Limitations of K-means: Non-globular Shapes. Panels: Original Points; K-means (2 Clusters). © Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004]
![Page 23: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/23.jpg)
Discussion on the k-means algorithm
• finds a local optimum
• often converges quickly but not always
• the choice of initial points can have a large influence on the result
• tends to find spherical clusters
• outliers can cause a problem
• different densities may cause a problem
![Page 24: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/24.jpg)
Initialization
• random initialization
• random, but repeat many times and take the best solution
  • helps, but solution can still be bad
• pick points that are distant to each other
  • k-means++
  • provable guarantees
![Page 25: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/25.jpg)
k-means++
David Arthur and Sergei Vassilvitskii
k-means++: The advantages of careful seeding
SODA 2007
![Page 26: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/26.jpg)
k-means algorithm: random initialization
![Page 27: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/27.jpg)
k-means algorithm: random initialization
![Page 28: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/28.jpg)
k-means algorithm: initialization with furthest-first traversal
[Figure: points labeled 1, 2, 3, 4]
![Page 29: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/29.jpg)
k-means algorithm: initialization with furthest-first traversal
![Page 30: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/30.jpg)
but... sensitive to outliers
[Figure: points labeled 1, 2, 3]
![Page 31: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/31.jpg)
but... sensitive to outliers
![Page 32: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/32.jpg)
Here random may work well
![Page 33: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/33.jpg)
k-means++ algorithm
• interpolate between the two methods
• let $D(x)$ be the distance between $x$ and the nearest center selected so far
• choose next center with probability proportional to $(D(x))^a = D^a(x)$
✦ $a = 0$: random initialization
✦ $a = \infty$: furthest-first traversal
✦ $a = 2$: k-means++
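The effect of the exponent $a$ can be seen by computing the selection probabilities directly. The sketch below is an illustration under assumptions made here: the function name `selection_probabilities`, the treatment of $a = \infty$ as "all mass on the furthest point(s)", and the three sample points are not from the slides.

```python
import math

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def selection_probabilities(points, centers, a):
    """Probability of each point being the next center, proportional to D(x)^a."""
    d = [math.sqrt(min(sq_dist(x, c) for c in centers)) for x in points]
    if math.isinf(a):
        # a = infinity: only the furthest point(s) can be selected
        far = max(d)
        weights = [1.0 if di == far else 0.0 for di in d]
    else:
        # in Python 0.0 ** 0 == 1.0, so a = 0 gives uniform weights
        weights = [di ** a for di in d]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical 1-D data: one center at 0, candidates at distance 0, 1 and 3.
points = [(0.0,), (1.0,), (3.0,)]
centers = [(0.0,)]
uniform = selection_probabilities(points, centers, 0)
kmeanspp = selection_probabilities(points, centers, 2)
furthest = selection_probabilities(points, centers, math.inf)
print(uniform, kmeanspp, furthest)
```

With $a = 2$ the point at distance 3 is nine times as likely as the point at distance 1, while a point that is already a center has probability zero.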
![Page 34: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/34.jpg)
k-means++ algorithm
• initialization phase:
  • choose the first center uniformly at random
  • choose next center with probability proportional to $D^2(x)$
• iteration phase:
  • iterate as in the k-means algorithm until convergence
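The initialization phase can be sketched as below. This is a minimal illustration, assuming the data contains at least $k$ distinct points; the function name `kmeanspp_seed`, the `seed` parameter and the sample data are hypothetical. It uses `random.choices` from the Python standard library for the weighted draw.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeanspp_seed(points, k, seed=0):
    rng = random.Random(seed)
    centers = [rng.choice(points)]               # first center: uniform at random
    # d2[i] holds D^2(x_i), the squared distance to the nearest chosen center
    d2 = [sq_dist(x, centers[0]) for x in points]
    while len(centers) < k:
        # draw index i with probability d2[i] / sum(d2)
        i = rng.choices(range(len(points)), weights=d2, k=1)[0]
        centers.append(points[i])
        # only the new center can reduce a point's nearest-center distance
        d2 = [min(old, sq_dist(x, points[i])) for old, x in zip(d2, points)]
    return centers

# Hypothetical data: three pairs of nearby points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0),
          (10.0, 1.0), (20.0, 0.0), (20.0, 1.0)]
seeds = kmeanspp_seed(points, 3, seed=0)
print(seeds)
```

The returned seeds would then replace the random centers in step 1 of the k-means iteration; the iteration phase itself is unchanged.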
![Page 35: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/35.jpg)
k-means++ initialization
[Figure: points labeled 1, 2, 3]
![Page 36: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/36.jpg)
k-means++ result
![Page 37: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/37.jpg)
k-means++ provable guarantee
Theorem:
k-means++ is $O(\log k)$-approximate in expectation
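For reference, the bound stated in the Arthur and Vassilvitskii paper makes the constant explicit. Here $\phi$ denotes the cost of the clustering after seeding and $\phi_{\mathrm{OPT}}$ the optimal k-means cost; the subscript notation follows the paper rather than these slides, which write $\phi^*$.

```latex
\mathbb{E}[\phi] \;\le\; 8(\ln k + 2)\,\phi_{\mathrm{OPT}}
```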
![Page 38: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/38.jpg)
k-means++ provable guarantee
• approximation guarantee comes just from the first iteration (initialization)
• subsequent iterations can only improve cost
![Page 39: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/39.jpg)
k-means++ analysis
• consider optimal clustering $C^*$
• assume that k-means++ selects a center from a new optimal cluster
• then
  • k-means++ is 8-approximate in expectation
• intuition: if no points from a cluster are picked, then it probably does not contribute much to the overall error
• an inductive proof shows that the algorithm is $O(\log k)$-approximate
![Page 40: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/40.jpg)
k-means++ proof: first cluster
• fix an optimal clustering $C^*$
• first center is selected uniformly at random
• bound the total error of the points in the optimal cluster of the first center
![Page 41: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/41.jpg)
k-means++ proof: first cluster
• let $A$ be the first cluster
• each point $a_0 \in A$ is equally likely to be selected as center
✦ expected error:
$$E[\phi(A)] = \sum_{a_0 \in A} \frac{1}{|A|} \sum_{a \in A} \|a - a_0\|^2 = 2 \sum_{a \in A} \|a - \bar{A}\|^2 = 2\phi^*(A)$$
where $\bar{A}$ is the mean of the points in $A$ and $\phi^*(A)$ is the optimal cost of $A$
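The identity behind the factor 2 can be checked numerically. The sketch below draws a hypothetical cluster of 50 random points (the data, the seed and the helper names are assumptions made here) and compares the average cost over all choices of $a_0$ with twice the cost at the mean.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

rng = random.Random(0)
A = [(rng.random(), rng.random()) for _ in range(50)]

# Left-hand side: cost of A with center a0, averaged over a uniform choice of a0.
expected_cost = sum(sq_dist(a, a0) for a0 in A for a in A) / len(A)

# Right-hand side: twice the cost of A with its mean as the center.
optimal_cost = sum(sq_dist(a, centroid(A)) for a in A)

print(expected_cost, 2 * optimal_cost)
```

The two printed values agree up to floating-point rounding, for any point set, because the cross terms around the mean cancel.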
![Page 42: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/42.jpg)
k-means++ proof : other clusters
• suppose next center is selected from a new cluster in the optimal clustering C*
• bound the total error of that cluster
![Page 43: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/43.jpg)
k-means++ proof: other clusters
• let $B$ be the second cluster and $b_0$ the center selected
$$E[\phi(B)] = \sum_{b_0 \in B} \frac{D^2(b_0)}{\sum_{b \in B} D^2(b)} \sum_{b \in B} \min\{D(b), \|b - b_0\|\}^2$$
• triangle inequality:
$$D(b_0) \le D(b) + \|b - b_0\|$$
• squaring and using $(u + v)^2 \le 2u^2 + 2v^2$:
$$D^2(b_0) \le 2D^2(b) + 2\|b - b_0\|^2$$
![Page 44: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/44.jpg)
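k-means++ proof: other clusters
$$D^2(b_0) \le 2D^2(b) + 2\|b - b_0\|^2$$
• average over all points $b$ in $B$
$$D^2(b_0) \le \frac{2}{|B|} \sum_{b \in B} D^2(b) + \frac{2}{|B|} \sum_{b \in B} \|b - b_0\|^2$$
✦ recall
$$E[\phi(B)] = \sum_{b_0 \in B} \frac{D^2(b_0)}{\sum_{b \in B} D^2(b)} \sum_{b \in B} \min\{D(b), \|b - b_0\|\}^2$$
✦ substituting the averaged bound for $D^2(b_0)$ gives
$$E[\phi(B)] \le 4 \sum_{b \in B} \frac{1}{|B|} \sum_{b_0 \in B} \|b - b_0\|^2 = 4 \sum_{b \in B} 2\|b - \bar{B}\|^2 = 8\phi^*(B)$$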
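The substitution step can be written out term by term. This is a reconstruction under the definitions above (it follows the argument in the Arthur and Vassilvitskii paper, not a slide): the averaged bound on $D^2(b_0)$ splits $E[\phi(B)]$ into two sums; in the first the minimum is bounded by $\|b - b_0\|^2$, in the second by $D^2(b)$, which cancels the denominator.

```latex
\begin{aligned}
E[\phi(B)]
 &\le \frac{2}{|B|}\sum_{b_0\in B}\frac{\sum_{b\in B}D^2(b)}{\sum_{b\in B}D^2(b)}
      \sum_{b\in B}\min\{D(b),\|b-b_0\|\}^2
   + \frac{2}{|B|}\sum_{b_0\in B}\frac{\sum_{b\in B}\|b-b_0\|^2}{\sum_{b\in B}D^2(b)}
      \sum_{b\in B}\min\{D(b),\|b-b_0\|\}^2 \\
 &\le \frac{2}{|B|}\sum_{b_0\in B}\sum_{b\in B}\|b-b_0\|^2
   + \frac{2}{|B|}\sum_{b_0\in B}\sum_{b\in B}\|b-b_0\|^2 \\
 &= \frac{4}{|B|}\sum_{b_0\in B}\sum_{b\in B}\|b-b_0\|^2
  \;=\; 8\,\phi^*(B)
\end{aligned}
```

The last equality is the first-cluster identity applied to $B$: averaging the cost over a uniformly chosen center $b_0 \in B$ gives $2\phi^*(B)$.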
![Page 45: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/45.jpg)
k-means++ analysis
• if k-means++ selects a center from a new optimal cluster
• then
  • k-means++ is 8-approximate in expectation
• an inductive proof shows that the algorithm is $O(\log k)$-approximate
![Page 46: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/46.jpg)
Lesson learned
• no reason to use k-means and not k-means++
• k-means++:
  • easy to implement
  • provable guarantee
  • works well in practice
![Page 47: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/47.jpg)
The k-median problem
• consider a set $X = \{x_1, \dots, x_n\}$ of $n$ points in $\mathbb{R}^d$
• assume that the number $k$ is given
• problem:
  • find $k$ points $c_1, \dots, c_k$ (named medians)
  • and partition $X$ into $\{X_1, \dots, X_k\}$ by assigning each point $x_i$ in $X$ to its nearest cluster median,
  • so that the cost
$$\sum_{i=1}^{n} \min_{j} \|x_i - c_j\|_2 = \sum_{j=1}^{k} \sum_{x \in X_j} \|x - c_j\|_2$$
is minimized
![Page 48: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/48.jpg)
The k-medoids algorithm
or PAM (partitioning around medoids)
1. randomly (or with another method) choose $k$ medoids $\{c_1, \dots, c_k\}$ from the original dataset $X$
2. assign the remaining $n-k$ points in $X$ to their closest medoid $c_j$
3. for each cluster, replace its medoid by a point in the cluster that improves the cost
4. repeat (go to step 2) until convergence
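A minimal sketch of these steps in plain Python follows. It is a simplification made here, not the full PAM procedure: in step 3 it replaces each medoid by the cluster member with the smallest total distance to the rest of its cluster, rather than evaluating arbitrary swaps. The names `kmedoids` and `assign`, the `seed` and `max_iter` parameters, and the sample data are assumptions.

```python
import math
import random

def dist(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def assign(points, medoids):
    """Partition the points by closest medoid."""
    clusters = [[] for _ in medoids]
    for x in points:
        j = min(range(len(medoids)), key=lambda j: dist(x, medoids[j]))
        clusters[j].append(x)
    return clusters

def kmedoids(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    medoids = rng.sample(points, k)                      # step 1
    for _ in range(max_iter):
        clusters = assign(points, medoids)               # step 2
        # step 3 (simplified): best member of each cluster becomes its medoid
        new_medoids = [
            min(c, key=lambda m: sum(dist(x, m) for x in c)) if c else medoids[j]
            for j, c in enumerate(clusters)
        ]
        if new_medoids == medoids:                       # step 4
            break
        medoids = new_medoids
    return medoids, assign(points, medoids)

# Hypothetical 1-D data: two groups of three points.
points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
medoids, clusters = kmedoids(points, 2, seed=0)
print(medoids)
```

Unlike a mean, every medoid is one of the input points, so the method only needs pairwise distances; the price is that step 3 is quadratic in the cluster size.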
![Page 49: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/49.jpg)
Discussion on the k-medoids algorithm
• very similar to the k-means algorithm
• same advantages and disadvantages
• how about efficiency?
![Page 50: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/50.jpg)
The k-center problem
• consider a set $X = \{x_1, \dots, x_n\}$ of $n$ points in $\mathbb{R}^d$
• assume that the number $k$ is given
• problem:
  • find $k$ points $c_1, \dots, c_k$ (named centers)
  • and partition $X$ into $\{X_1, \dots, X_k\}$ by assigning each point $x_i$ in $X$ to its nearest cluster center,
  • so that the cost
$$\max_{i=1}^{n} \min_{j=1}^{k} \|x_i - c_j\|_2$$
is minimized
![Page 51: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/51.jpg)
Properties of the k-center problem
• NP-hard for dimension d≥2
• for d=1 the problem is solvable in polynomial time (how?)
• a simple combinatorial algorithm works well
![Page 52: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/52.jpg)
The k-center problem
• consider a set $X = \{x_1, \dots, x_n\}$ of $n$ points in $\mathbb{R}^d$
• assume that the number $k$ is given
• problem:
  • find $k$ points $c_1, \dots, c_k$ (named centers)
  • and partition $X$ into $\{X_1, \dots, X_k\}$ by assigning each point $x_i$ in $X$ to its nearest cluster center,
  • so that the cost
$$\max_{i=1}^{n} \min_{j=1}^{k} \|x_i - c_j\|_2$$
is minimized
![Page 53: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/53.jpg)
Furthest-first traversal algorithm
• pick any data point and label it 1
• for $i = 2, \dots, k$
  • find the unlabeled point that is furthest from $\{1, 2, \dots, i-1\}$
  • // use $d(x, S) = \min_{y \in S} d(x, y)$
  • label that point $i$
• assign the remaining unlabeled data points to the closest labeled data point
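The traversal can be sketched in plain Python as below. The function name `furthest_first`, the choice of the first data point as label 1, and the sample data are assumptions made for illustration; the sketch also assumes at least $k$ distinct points.

```python
import math

def dist(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def furthest_first(points, k):
    """Return the indices of the labeled points (in labeling order) and the k-center cost."""
    labeled = [0]                                  # "pick any data point": here the first
    # d[i] = d(x_i, S), the distance from point i to the labeled set S
    d = [dist(x, points[0]) for x in points]
    for _ in range(1, k):
        i = max(range(len(points)), key=lambda i: d[i])   # furthest point from S
        labeled.append(i)
        d = [min(old, dist(x, points[i])) for old, x in zip(d, points)]
    # every point is assigned to its closest labeled point, so the cost is max(d)
    return labeled, max(d)

# Hypothetical data: two pairs on a line and one point above them.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0), (5.0, 8.0)]
labeled, cost = furthest_first(points, 3)
print(labeled, cost)
```

Keeping the array `d` up to date makes each of the $k$ rounds linear in $n$, since only the newly labeled point can shrink a distance.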
![Page 54: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/54.jpg)
Furthest-first traversal algorithm: example
[Figure: points labeled 1, 2, 3, 4]
![Page 55: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/55.jpg)
Furthest-first traversal algorithm
• furthest-first traversal algorithm gives a factor 2 approximation
![Page 56: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/56.jpg)
Furthest-first traversal algorithm
• pick any data point and label it 1
• for $i = 2, \dots, k$
  • find the unlabeled point that is furthest from $\{1, 2, \dots, i-1\}$
  • // use $d(x, S) = \min_{y \in S} d(x, y)$
  • label that point $i$
  • $p(i) = \arg\min_{j < i} d(i, j)$
  • $R_i = d(i, p(i))$
• assign the remaining unlabeled data points to the closest labeled data point
![Page 57: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/57.jpg)
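Analysis
• Claim 1: $R_1 \ge R_2 \ge \dots \ge R_k$
• proof: for $j > i$,
  $R_j = d(j, p(j)) = d(j, \{1, 2, \dots, j-1\})$
  $\le d(j, \{1, 2, \dots, i-1\})$ // since $j > i$, the minimum is taken over a subset
  $\le d(i, \{1, 2, \dots, i-1\}) = R_i$ // $i$ was the furthest point from $\{1, \dots, i-1\}$ when it was labeled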
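Claim 1 can be observed on random data by recording $R_i$ as each point is labeled. In this sketch the function name `fft_radii`, the random data and the seed are assumptions, and the list starts at $R_2$ because the first labeled point has no earlier labeled point to measure against.

```python
import math
import random

def dist(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def fft_radii(points, k):
    """Run furthest-first traversal and return [R_2, R_3, ..., R_k]."""
    d = [dist(x, points[0]) for x in points]       # distances to the labeled set
    radii = []
    for _ in range(1, k):
        i = max(range(len(points)), key=lambda i: d[i])
        radii.append(d[i])                         # R_i = d(i, {1, ..., i-1})
        d = [min(old, dist(x, points[i])) for old, x in zip(d, points)]
    return radii

rng = random.Random(1)
points = [(rng.random(), rng.random()) for _ in range(200)]
radii = fft_radii(points, 10)
print(radii)
```

The printed sequence is non-increasing, as the claim states: each new labeled point is at most as far from the labeled set as the previous one was.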
![Page 58: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/58.jpg)
Analysis
• Claim 2:
  • let $C$ be the clustering produced by the FFT algorithm
  • let $R(C)$ be the cost of that clustering
  • then $R(C) = R_{k+1}$, where $k+1$ is the point the traversal would label next
• proof:
  • for any $i > k$ we have:
    $d(i, \{1, 2, \dots, k\}) \le d(k+1, \{1, 2, \dots, k\}) = R_{k+1}$
![Page 59: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/59.jpg)
Analysis
• Theorem
  • let $C$ be the clustering produced by the FFT algorithm
  • let $C^*$ be the optimal clustering
  • then $R(C) \le 2R(C^*)$
• proof:
  • let $C^*_1, \dots, C^*_k$ be the clusters of the optimal k-clustering
  • if each of these clusters contains exactly one of the points $\{1, \dots, k\}$, then $R(C) \le 2R(C^*)$ ✪
  • otherwise suppose that one of these clusters contains two or more of the points in $\{1, \dots, k\}$
  • these points are at distance at least $R_k$ from each other
  • this (optimal) cluster must have radius at least $\tfrac{1}{2} R_k \ge \tfrac{1}{2} R_{k+1} = \tfrac{1}{2} R(C)$
![Page 60: CS 565: Data miningcs-people.bu.edu/evimaria/cs565-14/kmeanspp.pdf · • voted among the top-10 algorithms in data mining • one way of solving the k-means problem. Boston University](https://reader034.vdocuments.us/reader034/viewer/2022050308/5f70243c2e89af6a752792c9/html5/thumbnails/60.jpg)
✪ $R(C) \le 2R(C^*)$
[Figure: an optimal cluster of radius $R(C^*)$ with a labeled point in the cluster; $x$ and $z$ mark distances in the figure]
$R(C) \le x \le z + R(C^*) \le 2R(C^*)$
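The factor 2 can be checked by brute force on a tiny instance. The sketch below is an illustration under assumptions made here: it compares furthest-first traversal with the best choice of $k$ centers among the data points themselves, which costs at least as much as the unrestricted optimum $R(C^*)$, so the bound must still hold. The names `kcenter_cost` and `best_discrete`, the data and the seed are hypothetical.

```python
import itertools
import math
import random

def dist(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kcenter_cost(points, centers):
    """Largest distance from any point to its nearest center."""
    return max(min(dist(x, c) for c in centers) for x in points)

def furthest_first(points, k):
    """Return the k points labeled by furthest-first traversal."""
    centers = [points[0]]
    d = [dist(x, points[0]) for x in points]
    while len(centers) < k:
        i = max(range(len(points)), key=lambda i: d[i])
        centers.append(points[i])
        d = [min(old, dist(x, points[i])) for old, x in zip(d, points)]
    return centers

rng = random.Random(2)
points = [(rng.random(), rng.random()) for _ in range(12)]
k = 3
fft_cost = kcenter_cost(points, furthest_first(points, k))
# exhaustive search over all 220 ways to pick 3 of the 12 points as centers
best_discrete = min(kcenter_cost(points, c)
                    for c in itertools.combinations(points, k))
print(fft_cost, best_discrete)
```

The exhaustive search is only feasible for very small $n$ and $k$; the traversal itself needs $O(nk)$ distance computations.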