Data Clustering

Héctor Corrada Bravo
University of Maryland, College Park, USA
CMSC 644 · 2019-03-06
Motivating Example

Time series dataset of mortgage affordability as calculated and distributed by Zillow: https://www.zillow.com/research/data/.

The dataset consists of monthly mortgage affordability values for 76 counties with data from 1979 to 2017.
Motivating Example

"To calculate mortgage affordability, we first calculate the mortgage payment for the median-valued home in a metropolitan area by using the metro-level Zillow Home Value Index for a given quarter and the 30-year fixed mortgage interest rate during that time period, provided by the Freddie Mac Primary Mortgage Market Survey (based on a 20 percent down payment)."
Motivating Example

"Then, we consider what portion of the monthly median household income (U.S. Census) goes toward this monthly mortgage payment. Median household income is available with a lag."
Motivating Example

Can we partition counties into groups of counties with similar value trends across time?
Cluster Analysis

The high-level goal of cluster analysis is to organize objects (observations) that are similar to each other into groups. We want objects within a group to be more similar to each other than objects in different groups.
Central to this high-level goal is how to measure the degree of similarity between objects. A clustering method then uses the similarity measure provided to it to group objects into clusters.
Cluster Analysis

Result of the k-means algorithm partitioning the data into 9 clusters. The darker series within each cluster shows the average time series within the cluster.
Dissimilarity-based Clustering

For certain algorithms, instead of similarity we work with dissimilarity, often represented as distances. When we have observations defined over attributes, or predictors, we define dissimilarity based on these attributes.
Dissimilarity-based Clustering

Given measurements $x_{ij}$ for observations $i = 1, \ldots, N$ over predictors $j = 1, \ldots, p$, suppose we define a per-predictor dissimilarity $d_j(x_{ij}, x_{i'j})$. We can then define a dissimilarity between objects as

$$d(x_i, x_{i'}) = \sum_{j=1}^{p} d_j(x_{ij}, x_{i'j})$$
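This additive, per-feature dissimilarity can be sketched as follows (the per-feature function `d_j` is a placeholder you can swap in, e.g., squared or absolute difference):

```python
import numpy as np

def dissimilarity(x, x_prime, d_j):
    """Total dissimilarity: sum of per-feature dissimilarities d_j."""
    return sum(d_j(a, b) for a, b in zip(x, x_prime))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 3.0])

squared = dissimilarity(x, y, lambda a, b: (a - b) ** 2)  # 1 + 4 + 0 = 5
absolute = dissimilarity(x, y, lambda a, b: abs(a - b))   # 1 + 2 + 0 = 3
print(squared, absolute)
```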
Dissimilarity-based Clustering

In the k-means algorithm, and many other algorithms, the most common usage is squared distance

$$d_j(x_{ij}, x_{i'j}) = (x_{ij} - x_{i'j})^2$$

We can use different dissimilarities, for example

$$d_j(x_{ij}, x_{i'j}) = |x_{ij} - x_{i'j}|$$

which may affect our choice of clustering algorithm later on.
Dissimilarity-based Clustering

For categorical variables, we could set

$$d_j(x_{ij}, x_{i'j}) = \begin{cases} 0 & \text{if } x_{ij} = x_{i'j} \\ 1 & \text{otherwise} \end{cases}$$
Dissimilarity-based Clustering

If the values of the categorical variable have an intrinsic similarity, generalize using a symmetric matrix $L$ with elements $L_{rr'} = L_{r'r}$, $L_{rr} = 0$, and $L_{rr'} \geq 0$ otherwise. This may of course lead to a dissimilarity that is not a proper distance.
Limitations of Distance-based clustering

When working with distances, we are quickly confronted with the curse of dimensionality. One flavor: in high dimensions, all points become nearly equidistant from one another.
The curse of dimensionality

Consider the case where we have many covariates and we want to use a distance-based clustering method. Basically, we need to define distance and look for small multidimensional "balls" around the data points. With many covariates this becomes difficult.
The curse of dimensionality

Imagine we have equally spaced data and that each covariate is in $[0, 1]$. We want something that clusters points into local regions, containing some reasonably small amount of data points (say 10%). Let's imagine these are high-dimensional cubes.

If we have $p$ covariates and we are forming $p$-dimensional cubes, then each side of the cube must have size $l$ determined by $l \times l \times \cdots \times l = l^p = 0.10$.
The curse of dimensionality

If the number of covariates is $p = 10$, then $l = 0.1^{1/10} \approx 0.80$. So it really isn't local! If we reduce the percent of data we consider to 1%, $l = 0.01^{1/10} \approx 0.63$. Still not very local.
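The side-length calculation above is easy to check numerically; a small sketch:

```python
def cube_side(fraction, p):
    """Side length l of a p-dimensional cube in [0,1]^p capturing a
    given fraction of uniformly spread data: l^p = fraction."""
    return fraction ** (1.0 / p)

print(round(cube_side(0.10, 10), 2))  # 0.79: covers ~80% of each axis
print(round(cube_side(0.01, 10), 2))  # 0.63: still not local
```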
If we keep reducing the size of the neighborhoods we will end up with a very small number of data points in each cluster and require a large number of clusters.
K-means Clustering

A commonly used algorithm to perform clustering is the K-means algorithm. It is appropriate when using squared Euclidean distance as the measure of object dissimilarity:

$$d(x_i, x_{i'}) = \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 = \|x_i - x_{i'}\|^2$$
K-means Clustering

K-means partitions observations into $K$ clusters, with $K$ provided as a parameter. Given some clustering, or partition, $C$, the assignment of observation $x_i$ to cluster $k \in \{1, \ldots, K\}$ is denoted $C(i) = k$.
K-means minimizes this clustering criterion:
$$W(C) = \frac{1}{2} \sum_{k=1}^{K} \sum_{i:\, C(i)=k} \sum_{i':\, C(i')=k} \|x_i - x_{i'}\|^2$$
K-means Clustering

This is equivalent to minimizing

$$W(C) = \sum_{k=1}^{K} N_k \sum_{i:\, C(i)=k} \|x_i - \bar{x}_k\|^2$$

with $\bar{x}_k = (\bar{x}_{k1}, \ldots, \bar{x}_{kp})$, where $\bar{x}_{kj}$ is the average of predictor $j$ over the observations assigned to cluster $k$, and $N_k$ is the number of observations assigned to cluster $k$.
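The equivalence between the pairwise and centroid forms of $W(C)$ can be verified numerically; a minimal sketch on arbitrary data and cluster assignments:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
labels = rng.integers(0, 3, size=30)  # an arbitrary assignment into K=3 clusters

# Pairwise form: (1/2) * sum over clusters of all pairwise squared distances
pairwise = 0.0
for k in range(3):
    Xk = X[labels == k]
    diffs = Xk[:, None, :] - Xk[None, :, :]
    pairwise += 0.5 * (diffs ** 2).sum()

# Centroid form: sum over clusters of N_k times squared distances to the mean
centroid = 0.0
for k in range(3):
    Xk = X[labels == k]
    centroid += len(Xk) * ((Xk - Xk.mean(axis=0)) ** 2).sum()

print(np.isclose(pairwise, centroid))  # True
```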
K-means Clustering

That is, minimize the total squared distance from each observation to the mean (centroid) of the cluster to which the observation is assigned:

$$W(C) = \sum_{k=1}^{K} N_k \sum_{i:\, C(i)=k} \|x_i - \bar{x}_k\|^2$$
K-means Clustering

An iterative algorithm is used to minimize this criterion:

1. Initialize by choosing $K$ observations as centroids $m_1, m_2, \ldots, m_K$
2. Assign each observation $i$ to the cluster with the nearest centroid, i.e., set $C(i) = \arg\min_{1 \leq k \leq K} \|x_i - m_k\|^2$
3. Update centroids $m_k = \bar{x}_k$
4. Iterate steps 2 and 3 until convergence
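A minimal sketch of these steps in Python (plain iterations with random initialization; in practice a library implementation such as scikit-learn's `KMeans` would be used):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: choose K observations as initial centroids
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each observation to the nearest centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: update each centroid to the mean of its assigned points
        new = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                        else centroids[k] for k in range(K)])
        if np.allclose(new, centroids):  # step 4: iterate until convergence
            break
        centroids = new
    return labels, centroids

# Two well-separated pairs of points end up in different clusters
labels, _ = kmeans(np.array([[0.0], [1.0], [10.0], [11.0]]), K=2)
print(labels)
```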
K-means Clustering

Here we illustrate the k-means algorithm over four iterations on our example data with $K = 4$.
K-means Clustering

The criterion $W(C)$ is reduced in each iteration, so the algorithm is assured to converge. However, it is not a convex criterion, so the clustering we obtain may not be globally optimal. In practice, the algorithm is run with multiple initializations (step 1) and the best clustering achieved is used.
K-means Clustering

Also, the selection of observations as centroids can be improved using the K-means++ algorithm:

1. Choose an observation as centroid $m_1$ uniformly at random
2. To choose centroid $m_k$, compute for each observation $i$ not chosen as a centroid the distance to the nearest centroid, $d_i = \min_{1 \leq l < k} \|x_i - m_l\|^2$
3. Set centroid $m_k$ to an observation randomly chosen with probability $\frac{d_i}{\sum_{i'} d_{i'}}$
4. Iterate steps 2 and 3 until $K$ centroids are chosen
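A sketch of this seeding, assuming the standard k-means++ $D^2$-weighting (next centroid sampled with probability proportional to $d_i$):

```python
import numpy as np

def kmeanspp_init(X, K, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: first centroid chosen uniformly at random
    centroids = [X[rng.integers(len(X))]]
    for _ in range(1, K):
        # Step 2: squared distance from each point to its nearest centroid
        d = (((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2)
             .sum(axis=2).min(axis=1))
        # Step 3: sample the next centroid with probability d_i / sum_i' d_i'
        centroids.append(X[rng.choice(len(X), p=d / d.sum())])
    return np.array(centroids)

# Already-chosen points have d_i = 0 and are never re-sampled; with two
# well-separated blobs the two seeds land in different blobs.
X = np.vstack([np.zeros((10, 2)), np.full((10, 2), 5.0)])
print(kmeanspp_init(X, K=2))
```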
Choosing the number of clusters

The number of clusters $K$ must be determined before running the K-means algorithm. There is no clean, direct method for choosing the number of clusters to use in the K-means algorithm (e.g., no cross-validation method).
Choosing the number of clusters

Looking at the criterion $W(C)$ alone is not sufficient, as the criterion becomes smaller as the value of $K$ is increased.
Choosing the number of clusters

We can use properties of this plot for ad-hoc selection. Suppose there is a true underlying number of clusters $K^*$ in the data: improvement in the $W_K(C)$ statistic will be fast for values of $K \leq K^*$ and slower for values of $K > K^*$.
Choosing the number of clusters

Improvement in the $W_K(C)$ statistic will be fast for values of $K \leq K^*$. In this case, there will be a cluster which contains observations belonging to two of the true underlying clusters, and therefore will have poor within-cluster similarity. As $K$ is increased, these observations may then be separated into their own clusters, providing a sharp improvement in the $W_K(C)$ statistic.
Choosing the number of clusters

Improvement in the $W_K(C)$ statistic will be slower for values of $K > K^*$. In this case, observations belonging to a single true cluster are split into multiple clusters, all with generally high within-cluster similarity. Splitting these clusters further will not improve the $W_K(C)$ statistic very sharply.
Choosing the number of clusters

The $W_K(C)$ curve will therefore have an inflection point around $K^*$.
Choosing the number of clusters

The gap statistic is used to identify the inflection point in the curve. It compares the behavior of the $W_K(C)$ statistic on the data with the behavior of the statistic on data generated uniformly at random over the range of the data, and chooses the $K$ that maximizes the gap between these two $W_K(C)$ curves.
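A rough sketch of the gap computation (simplified: a few uniform reference datasets drawn over the bounding box of the data, no standard-error correction, and scikit-learn's k-means inertia standing in for $W_K(C)$):

```python
import numpy as np
from sklearn.cluster import KMeans

def log_WK(X, K, seed=0):
    """log of the within-cluster criterion, via k-means inertia."""
    return np.log(KMeans(n_clusters=K, n_init=10, random_state=seed).fit(X).inertia_)

def gap(X, K, n_ref=5, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    # Reference curve: uniform data over the range of the observed data
    ref = [log_WK(rng.uniform(lo, hi, size=X.shape), K) for _ in range(n_ref)]
    return np.mean(ref) - log_WK(X, K)

# Three tight, well-separated blobs: the gap should peak at K = 3
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.1, size=(30, 2)) for c in (0, 5, 10)])
gaps = {K: gap(X, K) for K in (2, 3, 4)}
print(max(gaps, key=gaps.get))
```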
Choosing the number of clusters

For this dataset, the gap statistic suggests there is no clear cluster structure, and therefore $K = 1$ is the best choice. A choice of $K = 4$ is also appropriate.
Large-scale clustering

Cost of K-means as presented: each iteration computes the distance from each of the $n$ points to each of the $k$ centroids over $p$ features, $O(knp)$.

This implies we have to do multiple passes over the entire dataset, which is not good for massive datasets. Can we do this in "almost" a single pass?
Large-scale clustering (BFR Algorithm)

1. Select $k$ points as initial centroids, as before
2. Process the data file in chunks:
   - Set the chunk size so each chunk can be processed in main memory
   - Some memory will be used as workspace, so not the entire memory is available
Large-scale clustering (BFR Algorithm)

For each chunk:

- All points sufficiently close to the centroid of one of the $k$ clusters are assigned to that cluster (the Discard Set)
- Remaining points are clustered (e.g., using k-means with some value of $k$). Two cases:
  - Clusters with more than one point (the Compressed Set)
  - Singleton clusters (the Retained Set)
- Try to merge clusters in the Compressed Set
Large-scale clustering (BFR Algorithm)

What is sufficiently close? A "weighted" distance to the centroid below some threshold:

$$\sqrt{\sum_{i=1}^{d} \left( \frac{p_i - c_i}{\sigma_i} \right)^2}$$

where $c_i$ is the cluster mean for feature $i$ and $\sigma_i$ is the cluster standard deviation for feature $i$.
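A small sketch of this normalized distance:

```python
import numpy as np

def weighted_distance(point, centroid, std):
    """Distance to a cluster centroid with each feature normalized by the
    cluster's standard deviation (Mahalanobis distance, diagonal covariance)."""
    return np.sqrt((((point - centroid) / std) ** 2).sum())

centroid = np.array([0.0, 10.0])
std = np.array([1.0, 2.0])
print(weighted_distance(np.array([2.0, 10.0]), centroid, std))  # 2.0 (2 sds away)
```

A point might then be added to the Discard Set when this distance falls below a threshold of, say, 2 to 3 standard deviations; the threshold is a tuning choice, not fixed by the algorithm.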
Large-scale clustering (BFR Algorithm)
Assumption: points are distributed along axis-parallel ellipses
Large-scale clustering (BFR Algorithm)

Under this assumption, we only need to store means and variances to calculate distances. We can do this by storing, for each cluster $j$:

- $N_j$: the number of points assigned to cluster $j$
- $s_{ij}$: the sum of values of feature $i$ in cluster $j$
- $s^2_{ij}$: the sum of squares of values of feature $i$ in cluster $j$

This is a constant amount of space to represent a cluster.

Exercise: show these are sufficient to calculate the weighted distance.
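A sketch of maintaining these sufficient statistics and recovering the mean and variance from them (using the identity $\mathrm{Var}[X] = E[X^2] - E[X]^2$):

```python
import numpy as np

class ClusterSummary:
    """Constant-space cluster summary: N, per-feature sum, sum of squares."""

    def __init__(self, dim):
        self.n = 0
        self.s = np.zeros(dim)    # sum of feature values
        self.s2 = np.zeros(dim)   # sum of squared feature values

    def add(self, point):
        self.n += 1
        self.s += point
        self.s2 += point ** 2

    def mean(self):
        return self.s / self.n

    def variance(self):
        # Var[X] = E[X^2] - E[X]^2, per feature
        return self.s2 / self.n - self.mean() ** 2

summary = ClusterSummary(dim=1)
for v in [1.0, 2.0, 3.0]:
    summary.add(np.array([v]))
print(summary.mean(), summary.variance())
```

Merging two summaries is just adding their three statistics elementwise, which is what makes the Compressed Set merges cheap.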
Large-scale clustering (BFR Algorithm)

This representation is used for the (final) clusters in the Discard Set and the (partial) clusters in the Compressed Set. The only points explicitly kept in memory are those in the Retained Set; points outside of the Retained Set are never kept in memory (they are written out along with their cluster assignment).
Large-scale clustering (BFR Algorithm)

Merging clusters in the Compressed Set. Example: merge if the variance of the combined cluster is sufficiently close to the variances of the separate clusters.
Large-scale clustering (BFR Algorithm)

After all data is processed, either:

- Assign points in the Retained Set to the cluster with the nearest centroid, and merge partial clusters in the Compressed Set with the final clusters in the Discard Set, or
- Flag all of these as outliers
Soft K-means Clustering

Instead of the combinatorial approach of the K-means algorithm, take a more direct probabilistic approach to modeling the distribution $\Pr(X)$.

Assume each of the $K$ clusters corresponds to a multivariate distribution $\Pr_k(X)$; $\Pr(X)$ is then given by the mixture $\Pr(X) = \sum_{k=1}^{K} \pi_k \Pr_k(X)$.
Soft K-means Clustering

Specifically, take $\Pr_k(X)$ as a multivariate normal distribution $f_k(X) = N(\mu_k, \sigma_k^2 I)$, giving the mixture density $f(X) = \sum_{k=1}^{K} \pi_k f_k(X)$.
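A sketch of evaluating this mixture density (the component means, variances, and weights below are illustrative values, not fitted estimates):

```python
import numpy as np

def spherical_normal_pdf(x, mu, sigma2):
    """Density of N(mu, sigma2 * I) at x."""
    p = len(mu)
    return (2 * np.pi * sigma2) ** (-p / 2) * \
        np.exp(-((x - mu) ** 2).sum() / (2 * sigma2))

def mixture_density(x, mus, sigma2s, pis):
    """f(x) = sum_k pi_k f_k(x)."""
    return sum(pi * spherical_normal_pdf(x, mu, s2)
               for mu, s2, pi in zip(mus, sigma2s, pis))

mus = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
sigma2s = [1.0, 1.0]
pis = [0.5, 0.5]
print(mixture_density(np.array([0.0, 0.0]), mus, sigma2s, pis))
```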
Soft K-means Clustering
Use Maximum Likelihood to estimate parameters

$$\theta = (\mu_1, \ldots, \mu_K, \sigma_1^2, \ldots, \sigma_K^2, \pi_1, \ldots, \pi_K)$$

based on their log-likelihood

$$\ell(\theta; X) = \sum_{i=1}^N \log \left[ \sum_{k=1}^K \pi_k f_k(x_i; \theta) \right]$$
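The observed-data log-likelihood above can be computed numerically; a sketch assuming spherical Gaussian components as defined on the previous slide (names are illustrative):

```python
import numpy as np

def mixture_log_likelihood(X, mus, sigma2s, pis):
    """l(theta; X) = sum_i log( sum_k pi_k f_k(x_i; theta) )
    for spherical Gaussian components."""
    X = np.atleast_2d(X)
    n, p = X.shape
    ll = 0.0
    for x in X:
        mix = 0.0
        for mu, s2, pi in zip(mus, sigma2s, pis):
            norm_const = (2 * np.pi * s2) ** (-p / 2)
            mix += pi * norm_const * np.exp(-np.sum((x - mu) ** 2) / (2 * s2))
        ll += np.log(mix)                      # log of the mixture density at x_i
    return ll
```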
Soft K-means Clustering
Maximizing this likelihood directly is computationally difficult.

Use the Expectation-Maximization (EM) algorithm instead.

$$\ell(\theta; X) = \sum_{i=1}^N \log \left[ \sum_{k=1}^K \pi_k f_k(x_i; \theta) \right]$$
Soft K-means Clustering
Consider unobserved latent variables $\Delta_{ik}$ taking values 0 or 1; $\Delta_{ik} = 1$ specifies that observation $x_i$ was generated by component $k$ of the mixture distribution.
Soft K-means Clustering
Now set $Pr(\Delta_{ik} = 1) = \pi_k$, and assume we observed values for the indicator variables $\Delta_{ik}$.

We can write the log-likelihood of our parameters in this case as

$$\ell_0(\theta; X, \Delta) = \sum_{i=1}^N \sum_{k=1}^K \Delta_{ik} \log f_k(x_i; \theta) + \sum_{i=1}^N \sum_{k=1}^K \Delta_{ik} \log \pi_k$$
Soft K-means Clustering
Maximum likelihood estimates:

$$\hat{\mu}_k = \frac{\sum_{i=1}^N \Delta_{ik} x_i}{\sum_{i=1}^N \Delta_{ik}}$$

$$\hat{\sigma}_k^2 = \frac{\sum_{i=1}^N \Delta_{ik} (x_i - \hat{\mu}_k)^2}{\sum_{i=1}^N \Delta_{ik}}$$

$$\hat{\pi}_k = \frac{\sum_{i=1}^N \Delta_{ik}}{N}$$
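With observed indicators these estimates are simple per-component averages; a sketch (names are hypothetical; `delta` is the $N \times K$ 0/1 indicator matrix):

```python
import numpy as np

def complete_data_mles(X, delta):
    """MLEs when the indicators Delta_ik are observed:
    per-component weighted averages, per-coordinate variance
    for the sigma^2 I model, and component proportions."""
    X = np.atleast_2d(X)
    n, p = X.shape
    K = delta.shape[1]
    nk = delta.sum(axis=0)                      # points assigned to each component
    mus = (delta.T @ X) / nk[:, None]           # mu_k = sum_i Delta_ik x_i / sum_i Delta_ik
    sigma2s = np.empty(K)
    for k in range(K):
        sq = np.sum((X - mus[k]) ** 2, axis=1)  # squared distance to mu_k
        sigma2s[k] = np.sum(delta[:, k] * sq) / (p * nk[k])
    pis = nk / n                                # pi_k = sum_i Delta_ik / N
    return mus, sigma2s, pis
```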
Soft K-means Clustering
Of course, this result depends on observing values for $\Delta_{ik}$, which we don't observe. Use an iterative approach as well: given current estimate of parameters $\theta$, maximize

$$E(\ell_0(\theta'; X, \Delta) \mid X, \theta)$$
Soft K-means Clustering
We can prove that maximizing this quantity also maximizes the likelihood we need, $\ell(\theta; X)$.
Soft K-means Clustering
In the mixture case, what is the function we would maximize? Define

$$\gamma_{ik}(\theta) = E(\Delta_{ik} \mid X_i, \theta) = Pr(\Delta_{ik} = 1 \mid X_i, \theta)$$
Soft K-means Clustering
Use Bayes' Rule to write this in terms of the multivariate normal densities with respect to current estimates $\theta$:

$$\gamma_{ik} = \frac{Pr(X_i \mid \Delta_{ik} = 1) Pr(\Delta_{ik} = 1)}{Pr(X_i)} = \frac{f_k(x_i; \mu_k, \sigma_k^2) \pi_k}{\sum_{l=1}^K f_l(x_i; \mu_l, \sigma_l^2) \pi_l}$$
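The E-step computes exactly this ratio for every observation and every cluster; a sketch assuming spherical Gaussian components (names are illustrative):

```python
import numpy as np

def responsibilities(X, mus, sigma2s, pis):
    """E-step: gamma_ik = pi_k f_k(x_i) / sum_l pi_l f_l(x_i)
    for spherical Gaussian components. Returns an (n, K) matrix
    whose rows sum to 1."""
    X = np.atleast_2d(X)
    n, p = X.shape
    K = len(pis)
    dens = np.empty((n, K))
    for k, (mu, s2, pi) in enumerate(zip(mus, sigma2s, pis)):
        norm_const = (2 * np.pi * s2) ** (-p / 2)
        sq = np.sum((X - mu) ** 2, axis=1)      # squared distance to mu_k
        dens[:, k] = pi * norm_const * np.exp(-sq / (2 * s2))
    return dens / dens.sum(axis=1, keepdims=True)
```

A point equidistant from two identical components gets responsibility 1/2 from each, which is the "soft" assignment that replaces K-means' hard one.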
Soft K-means Clustering
The quantity $\gamma_{ik}(\theta)$ is referred to as the responsibility of cluster $k$ for observation $i$, according to current parameter estimate $\theta$.
Soft K-means Clustering
Then the expectation we are maximizing is given by

$$E(\ell_0(\theta'; X, \Delta) \mid X, \theta) = \sum_{i=1}^N \sum_{k=1}^K \gamma_{ik}(\theta) \log f_k(x_i; \theta') + \sum_{i=1}^N \sum_{k=1}^K \gamma_{ik}(\theta) \log \pi_k'$$
Soft K-means Clustering
We can now give a complete specification of the EM algorithm for mixture model clustering.

1. Take initial guesses for parameters $\theta$
2. Expectation Step: Compute responsibilities $\gamma_{ik}(\theta)$
3. Maximization Step: Estimate new parameters based on responsibilities as below
4. Iterate steps 2 and 3 until convergence
Soft K-means Algorithm
Estimates in the Maximization step are given by

$$\hat{\mu}_k = \frac{\sum_{i=1}^N \gamma_{ik}(\theta) x_i}{\sum_{i=1}^N \gamma_{ik}(\theta)}$$

$$\hat{\sigma}_k^2 = \frac{\sum_{i=1}^N \gamma_{ik}(\theta) (x_i - \hat{\mu}_k)^2}{\sum_{i=1}^N \gamma_{ik}(\theta)}$$

and

$$\hat{\pi}_k = \frac{\sum_{i=1}^N \gamma_{ik}(\theta)}{N}$$
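Putting the E-step and these M-step estimates together gives a minimal EM loop for a spherical Gaussian mixture. This is an illustrative sketch, not a reference implementation: initialization and convergence handling are simplified, `init_mus` is a hypothetical parameter for deterministic starts, and the variance update divides by dimension $p$ as well to give the per-coordinate variance of the $\sigma^2 I$ model (identical to the slide's formula when $p = 1$):

```python
import numpy as np

def em_soft_kmeans(X, K, iters=50, init_mus=None, seed=0):
    """Minimal EM for a spherical Gaussian mixture (soft K-means):
    alternate responsibilities (E-step) and weighted estimates (M-step)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    if init_mus is not None:
        mus = np.array(init_mus, dtype=float)       # caller-supplied starts
    else:
        mus = X[rng.choice(n, size=K, replace=False)]
    sigma2s = np.full(K, X.var())
    pis = np.full(K, 1.0 / K)
    gamma = np.full((n, K), 1.0 / K)
    for _ in range(iters):
        # E-step: responsibilities gamma_ik
        dens = np.empty((n, K))
        for k in range(K):
            c = (2 * np.pi * sigma2s[k]) ** (-p / 2)
            sq = np.sum((X - mus[k]) ** 2, axis=1)
            dens[:, k] = pis[k] * c * np.exp(-sq / (2 * sigma2s[k]))
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted averages across all observations
        nk = gamma.sum(axis=0)
        mus = (gamma.T @ X) / nk[:, None]
        for k in range(K):
            sq = np.sum((X - mus[k]) ** 2, axis=1)
            sigma2s[k] = np.sum(gamma[:, k] * sq) / (p * nk[k])
        pis = nk / n
    return mus, sigma2s, pis, gamma
```

On two well-separated 1-D groups, the loop recovers the group means and equal mixing proportions regardless of which group each start point falls nearest.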
Soft K-means Algorithm
The name "soft" K-means refers to the fact that parameter estimates for each cluster are obtained by weighted averages across all observations.

This is an instance of the Expectation-Maximization Algorithm, which we will see again later this semester.