dmtm lecture 15 clustering evaluation
TRANSCRIPT
![Page 1: DMTM Lecture 15 Clustering evaluation](https://reader033.vdocuments.us/reader033/viewer/2022050803/5a647b1d7f8b9a27568b4c65/html5/thumbnails/1.jpg)
Prof. Pier Luca Lanzi
Clustering Validation
Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)
![Page 2: DMTM Lecture 15 Clustering evaluation](https://reader033.vdocuments.us/reader033/viewer/2022050803/5a647b1d7f8b9a27568b4c65/html5/thumbnails/2.jpg)
Syllabus
• Chapter 17, Data Mining and Analysis: Fundamental Concepts and Algorithms. Mohammed J. Zaki & Wagner Meira Jr.
• Functions available in Scikit-learn: http://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation
• Functions available in R: https://cran.r-project.org/web/packages/clusterCrit/vignettes/clusterCrit.pdf
![Page 3: DMTM Lecture 15 Clustering evaluation](https://reader033.vdocuments.us/reader033/viewer/2022050803/5a647b1d7f8b9a27568b4c65/html5/thumbnails/3.jpg)
Cluster Validation and Assessment
• Clustering Evaluation: assess the goodness or quality of the clustering
• Clustering Stability: sensitivity of the clustering result to various algorithmic parameters
• Clustering Tendency: suitability of applying clustering in the first place; does the data have any inherent grouping structure?
![Page 4: DMTM Lecture 15 Clustering evaluation](https://reader033.vdocuments.us/reader033/viewer/2022050803/5a647b1d7f8b9a27568b4c65/html5/thumbnails/4.jpg)
Validity Measures
• External Validation Measures
  ▪ Employ criteria that are not inherent to the dataset
  ▪ E.g., prior or expert-specified knowledge about the clusters, such as class labels for each point
• Internal Validation Measures
  ▪ Employ criteria that are derived from the data itself
  ▪ For instance, intracluster and intercluster distances to measure cluster compactness (how similar are the points in the same cluster?) and separation (how far apart are the points in different clusters?)
• Relative Validation Measures
  ▪ Aim to directly compare different clusterings, usually those obtained via different parameter settings for the same algorithm
![Page 5: DMTM Lecture 15 Clustering evaluation](https://reader033.vdocuments.us/reader033/viewer/2022050803/5a647b1d7f8b9a27568b4c65/html5/thumbnails/5.jpg)
External Measures
(the correct or ground-truth clustering is known a priori)
![Page 6: DMTM Lecture 15 Clustering evaluation](https://reader033.vdocuments.us/reader033/viewer/2022050803/5a647b1d7f8b9a27568b4c65/html5/thumbnails/6.jpg)
Given a clustering partition C and the ground truth partitioning T, we redefine TP, TN, FP, FN in the context of clustering
![Page 7: DMTM Lecture 15 Clustering evaluation](https://reader033.vdocuments.us/reader033/viewer/2022050803/5a647b1d7f8b9a27568b4c65/html5/thumbnails/7.jpg)
True Positives, True Negatives, False Positives, and False Negatives
• True Positives
  ▪ xi and xj are a true positive pair if they belong to the same partition in T, and they are also in the same cluster in C
  ▪ TP is defined as the number of true positive pairs
• False Negatives
  ▪ xi and xj are a false negative pair if they belong to the same partition in T, but they do not belong to the same cluster in C
  ▪ FN is defined as the number of false negative pairs
![Page 8: DMTM Lecture 15 Clustering evaluation](https://reader033.vdocuments.us/reader033/viewer/2022050803/5a647b1d7f8b9a27568b4c65/html5/thumbnails/8.jpg)
True Positives, True Negatives, False Positives, and False Negatives
• False Positives
  ▪ xi and xj are a false positive pair if they do not belong to the same partition in T, but belong to the same cluster in C
  ▪ FP is the number of false positive pairs
• True Negatives
  ▪ xi and xj are a true negative pair if they belong neither to the same partition in T nor to the same cluster in C
  ▪ TN is the number of true negative pairs
![Page 9: DMTM Lecture 15 Clustering evaluation](https://reader033.vdocuments.us/reader033/viewer/2022050803/5a647b1d7f8b9a27568b4c65/html5/thumbnails/9.jpg)
Given the number of pairs N (for n points, N = n(n−1)/2), every pair falls into exactly one of the four groups, so

$$N = TP + FP + FN + TN$$
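As a minimal sketch of the four definitions above, the pair counts can be obtained by enumerating all unordered point pairs. The function name `pair_counts` and the two small label lists are illustrative assumptions, not part of the lecture material.

```python
from itertools import combinations

def pair_counts(clusters, truth):
    """Return (TP, FN, FP, TN) counted over all unordered point pairs."""
    tp = fn = fp = tn = 0
    for i, j in combinations(range(len(clusters)), 2):
        same_c = clusters[i] == clusters[j]   # same cluster in C?
        same_t = truth[i] == truth[j]         # same partition in T?
        if same_t and same_c:
            tp += 1
        elif same_t:
            fn += 1
        elif same_c:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn

clusters = [0, 0, 1, 1, 1]   # hypothetical cluster labels (C)
truth = [0, 0, 0, 1, 1]      # hypothetical ground-truth labels (T)
print(pair_counts(clusters, truth))   # (2, 2, 2, 4)
```

The four counts add up to 5·4/2 = 10, the number of pairs for five points. The enumeration is quadratic in the number of points, which is fine for a small illustration.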
![Page 10: DMTM Lecture 15 Clustering evaluation](https://reader033.vdocuments.us/reader033/viewer/2022050803/5a647b1d7f8b9a27568b4c65/html5/thumbnails/10.jpg)
Jaccard Coefficient
• Measures the fraction of true positive point pairs after ignoring the true negatives,

$$\text{Jaccard} = \frac{TP}{TP + FN + FP}$$

• For a perfect clustering C, the coefficient is one, that is, there are no false positives or false negatives.
• Note that the Jaccard coefficient is asymmetric in that it ignores the true negatives
![Page 11: DMTM Lecture 15 Clustering evaluation](https://reader033.vdocuments.us/reader033/viewer/2022050803/5a647b1d7f8b9a27568b4c65/html5/thumbnails/11.jpg)
Rand Statistic
• Measures the fraction of true positives and true negatives over all pairs,

$$\text{Rand} = \frac{TP + TN}{N}$$

• The Rand statistic measures the fraction of point pairs on which the clustering C and the ground truth T agree
• A perfect clustering has a value of 1 for the statistic.
• The adjusted Rand index is the extension of the Rand statistic corrected for chance.
![Page 12: DMTM Lecture 15 Clustering evaluation](https://reader033.vdocuments.us/reader033/viewer/2022050803/5a647b1d7f8b9a27568b4c65/html5/thumbnails/12.jpg)
Fowlkes-Mallows Measure
• Define pairwise precision and recall analogously to classification,

$$prec = \frac{TP}{TP + FP} \qquad recall = \frac{TP}{TP + FN}$$

• The Fowlkes–Mallows (FM) measure is defined as the geometric mean of the pairwise precision and recall,

$$FM = \sqrt{prec \cdot recall} = \frac{TP}{\sqrt{(TP + FN)(TP + FP)}}$$

• FM is also asymmetric in terms of the true positives and negatives because it ignores the true negatives. Its highest value is also 1, achieved when there are no false positives or negatives.
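The three pairwise measures (Jaccard, Rand, Fowlkes–Mallows) follow directly from the four pair counts. Below is a small sketch under that reading. The function names and the example counts are hypothetical. None of the functions guards against a zero denominator (for instance, no same-cluster pairs at all).

```python
from math import sqrt

def jaccard(tp, fn, fp):
    """Fraction of true positive pairs, ignoring the true negatives."""
    return tp / (tp + fn + fp)

def rand_statistic(tp, fn, fp, tn):
    """Fraction of pairs on which clustering and ground truth agree."""
    return (tp + tn) / (tp + fn + fp + tn)

def fowlkes_mallows(tp, fn, fp):
    """Geometric mean of pairwise precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return sqrt(precision * recall)

# Hypothetical pair counts, chosen only for illustration.
tp, fn, fp, tn = 2, 2, 2, 4
print(jaccard(tp, fn, fp))             # 2/6, about 0.333
print(rand_statistic(tp, fn, fp, tn))  # 0.6
print(fowlkes_mallows(tp, fn, fp))     # 0.5
```

Scikit-learn offers related scores on label vectors, for example `sklearn.metrics.adjusted_rand_score` and `sklearn.metrics.fowlkes_mallows_score`.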
![Page 13: DMTM Lecture 15 Clustering evaluation](https://reader033.vdocuments.us/reader033/viewer/2022050803/5a647b1d7f8b9a27568b4c65/html5/thumbnails/13.jpg)
Mutual Information Based Scores
![Page 14: DMTM Lecture 15 Clustering evaluation](https://reader033.vdocuments.us/reader033/viewer/2022050803/5a647b1d7f8b9a27568b4c65/html5/thumbnails/14.jpg)
Mutual Information
• Mutual information tries to quantify the amount of shared information between the clustering C and ground truth partitioning T,

$$I(\mathcal{C}, \mathcal{T}) = \sum_{i} \sum_{j} p_{ij} \log\left(\frac{p_{ij}}{p_{C_i}\, p_{T_j}}\right)$$

• Where
  ▪ $p_{ij}$ is the probability that a point in cluster i also belongs to partition j
  ▪ $p_{C_i}$ is the probability of cluster $C_i$
  ▪ $p_{T_j}$ is the probability of partition $T_j$
![Page 15: DMTM Lecture 15 Clustering evaluation](https://reader033.vdocuments.us/reader033/viewer/2022050803/5a647b1d7f8b9a27568b4c65/html5/thumbnails/15.jpg)
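Normalized Mutual Information
• The normalized mutual information (NMI) is defined as

$$NMI(\mathcal{C}, \mathcal{T}) = \frac{I(\mathcal{C}, \mathcal{T})}{\sqrt{H(\mathcal{C})\, H(\mathcal{T})}}$$

• Where $I(\mathcal{C}, \mathcal{T})$ is the mutual information and $H$ is the entropy of a partitioning,

$$H(\mathcal{C}) = -\sum_{i} p_{C_i} \log p_{C_i} \qquad H(\mathcal{T}) = -\sum_{j} p_{T_j} \log p_{T_j}$$

• Values close to zero indicate two label assignments that are largely independent, while values close to one indicate significant agreement.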
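A small sketch of how mutual information and NMI could be computed from two label lists, estimating every probability as a relative frequency. It assumes the geometric-mean normalization sqrt(H(C)·H(T)). Other normalizations exist, and libraries may default to a different one. The function names are illustrative.

```python
from collections import Counter
from math import log, sqrt

def entropy(labels):
    """Entropy of a labelling, with probabilities estimated as frequencies."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def mutual_information(clusters, truth):
    """I(C, T) = sum over (i, j) of p_ij * log(p_ij / (p_Ci * p_Tj))."""
    n = len(clusters)
    joint = Counter(zip(clusters, truth))
    pc, pt = Counter(clusters), Counter(truth)
    return sum(
        (nij / n) * log((nij / n) / ((pc[ci] / n) * (pt[tj] / n)))
        for (ci, tj), nij in joint.items()
    )

def nmi(clusters, truth):
    """Mutual information divided by sqrt(H(C) * H(T))."""
    denom = sqrt(entropy(clusters) * entropy(truth))
    # Convention chosen for this sketch: return 0 when an entropy is zero.
    return mutual_information(clusters, truth) / denom if denom > 0 else 0.0

print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))  # same grouping, labels swapped
print(nmi([0, 1, 0, 1], [0, 0, 1, 1]))  # independent groupings
```

The first call illustrates that NMI does not depend on the label names, only on the grouping. The second pairs two groupings that share no information.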
![Page 16: DMTM Lecture 15 Clustering evaluation](https://reader033.vdocuments.us/reader033/viewer/2022050803/5a647b1d7f8b9a27568b4c65/html5/thumbnails/16.jpg)
Homogeneity, Completeness and V-measure
![Page 17: DMTM Lecture 15 Clustering evaluation](https://reader033.vdocuments.us/reader033/viewer/2022050803/5a647b1d7f8b9a27568b4c65/html5/thumbnails/17.jpg)
Homogeneity, Completeness, and V-measure
• Homogeneity
  ▪ Each cluster contains only members of a single class
• Completeness
  ▪ All members of a given class are assigned to the same cluster
• V-measure
  ▪ Harmonic mean of homogeneity and completeness
• The three measures are bounded between 0 and 1
• The higher the value the better
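The slide gives no formulas for these three scores. The sketch below assumes the usual entropy-based definitions: homogeneity = 1 − H(T|C)/H(T) and completeness = 1 − H(C|T)/H(C), with the V-measure as their harmonic mean. All function names are illustrative.

```python
from collections import Counter
from math import log

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def conditional_entropy(a, b):
    """H(a | b) estimated from two parallel label lists."""
    n = len(a)
    joint = Counter(zip(a, b))
    b_counts = Counter(b)
    return -sum((nab / n) * log(nab / b_counts[bv])
                for (_, bv), nab in joint.items())

def homogeneity_completeness_v(clusters, truth):
    h_t, h_c = entropy(truth), entropy(clusters)
    # Assumed convention: a score is 1 when its normalizing entropy is zero.
    hom = 1.0 if h_t == 0 else 1.0 - conditional_entropy(truth, clusters) / h_t
    com = 1.0 if h_c == 0 else 1.0 - conditional_entropy(clusters, truth) / h_c
    v = 0.0 if hom + com == 0 else 2 * hom * com / (hom + com)
    return hom, com, v

# Four singleton clusters over two classes: every cluster is pure,
# but each class is split across two clusters.
hom, com, v = homogeneity_completeness_v([0, 1, 2, 3], [0, 0, 1, 1])
print(hom, com, v)   # approximately 1.0, 0.5, 0.667
```

The example shows why both scores are needed: over-splitting keeps homogeneity at 1 while completeness drops. Scikit-learn provides `sklearn.metrics.homogeneity_completeness_v_measure` for the same purpose.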
![Page 18: DMTM Lecture 15 Clustering evaluation](https://reader033.vdocuments.us/reader033/viewer/2022050803/5a647b1d7f8b9a27568b4c65/html5/thumbnails/18.jpg)
Internal Validation Measures
(criteria that are derived from the data itself)
![Page 19: DMTM Lecture 15 Clustering evaluation](https://reader033.vdocuments.us/reader033/viewer/2022050803/5a647b1d7f8b9a27568b4c65/html5/thumbnails/19.jpg)
Internal Validation Measures
• Based on the notions of intracluster similarity or compactness, contrasted with the notions of intercluster separation
• They typically propose a trade-off in maximizing these two competing measures
• They are computed from the distance (or proximity) matrix
• The internal measures are based on various functions over the intracluster and intercluster weights.
![Page 20: DMTM Lecture 15 Clustering evaluation](https://reader033.vdocuments.us/reader033/viewer/2022050803/5a647b1d7f8b9a27568b4c65/html5/thumbnails/20.jpg)
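Some Important Statistics
• Sum of all the intracluster weights over all the clusters

$$W_{in} = \frac{1}{2} \sum_{i=1}^{k} W(C_i, C_i)$$

• Sum of all intercluster weights

$$W_{out} = \sum_{i=1}^{k-1} \sum_{j > i} W(C_i, C_j)$$

• Number of distinct intracluster edges $N_{in}$ and intercluster edges $N_{out}$

$$N_{in} = \sum_{i=1}^{k} \binom{n_i}{2} \qquad N_{out} = \sum_{i=1}^{k-1} \sum_{j > i} n_i\, n_j$$

where $W(S, R) = \sum_{x_a \in S} \sum_{x_b \in R} w_{ab}$ is the sum of the pairwise weights (distances) between the point sets $S$ and $R$, and $n_i$ is the number of points in cluster $C_i$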
![Page 21: DMTM Lecture 15 Clustering evaluation](https://reader033.vdocuments.us/reader033/viewer/2022050803/5a647b1d7f8b9a27568b4c65/html5/thumbnails/21.jpg)
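BetaCV
• BetaCV is computed as the ratio of the mean intracluster distance to the mean intercluster distance,

$$BetaCV = \frac{W_{in} / N_{in}}{W_{out} / N_{out}}$$

where $W_{in}$ and $W_{out}$ are the sums of the intracluster and intercluster weights, and $N_{in}$ and $N_{out}$ are the numbers of intracluster and intercluster edges
• The smaller the BetaCV ratio, the better the clustering, as it indicates that intracluster distances are on average smaller than intercluster distances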
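As an illustration, BetaCV can be computed by splitting all point pairs into intracluster and intercluster pairs. The sketch assumes Euclidean distance and uses a hypothetical four-point dataset with one sensible and one poor labelling. The function name `betacv` is illustrative.

```python
from itertools import combinations
from math import dist

def betacv(points, labels):
    """Mean intracluster distance divided by mean intercluster distance."""
    w_in = w_out = 0.0
    n_in = n_out = 0
    for i, j in combinations(range(len(points)), 2):
        d = dist(points[i], points[j])
        if labels[i] == labels[j]:
            w_in += d
            n_in += 1
        else:
            w_out += d
            n_out += 1
    return (w_in / n_in) / (w_out / n_out)

# Two tight groups far apart (hypothetical data).
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = [0, 0, 1, 1]   # groups the nearby points together
bad = [0, 1, 0, 1]    # groups the distant points together
print(betacv(points, good))
print(betacv(points, bad))
```

The labelling that keeps nearby points together yields the smaller ratio, in line with "smaller is better".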
![Page 22: DMTM Lecture 15 Clustering evaluation](https://reader033.vdocuments.us/reader033/viewer/2022050803/5a647b1d7f8b9a27568b4c65/html5/thumbnails/22.jpg)
C-Index
• Let $W_{min}(N_{in})$ be the sum of the smallest $N_{in}$ distances in the proximity matrix W, where $N_{in}$ is the total number of intracluster edges, or point pairs
• Let $W_{max}(N_{in})$ be the sum of the largest $N_{in}$ distances in W
• The C-index measures to what extent the clustering puts together the $N_{in}$ point pairs that are the closest across the k clusters
• It is defined as,

$$C\text{-}index = \frac{W_{in} - W_{min}(N_{in})}{W_{max}(N_{in}) - W_{min}(N_{in})}$$

where $W_{in}$ is the sum of all the intracluster distances
• The smaller the C-index, the better the clustering, as it indicates more compact clusters with relatively smaller distances within clusters rather than between clusters.
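The C-index compares the observed sum of intracluster distances with the smallest and largest sums that the same number of pairs could produce. Here is a sketch under the assumption of Euclidean distance, with hypothetical data and names. It assumes at least one intracluster pair and that the largest and smallest sums differ.

```python
from itertools import combinations
from math import dist

def c_index(points, labels):
    """(W_in - W_min(N_in)) / (W_max(N_in) - W_min(N_in))."""
    pairs = list(combinations(range(len(points)), 2))
    dists = [dist(points[i], points[j]) for i, j in pairs]
    intra = [d for d, (i, j) in zip(dists, pairs) if labels[i] == labels[j]]
    n_in = len(intra)               # number of intracluster pairs
    w_in = sum(intra)               # sum of intracluster distances
    ordered = sorted(dists)
    w_min = sum(ordered[:n_in])     # N_in smallest distances overall
    w_max = sum(ordered[-n_in:])    # N_in largest distances overall
    return (w_in - w_min) / (w_max - w_min)

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
print(c_index(points, [0, 0, 1, 1]))  # 0.0: the closest pairs are clustered together
print(c_index(points, [0, 1, 0, 1]))  # close to 1: distant pairs are clustered together
```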
![Page 23: DMTM Lecture 15 Clustering evaluation](https://reader033.vdocuments.us/reader033/viewer/2022050803/5a647b1d7f8b9a27568b4c65/html5/thumbnails/23.jpg)
Dunn Index
• Defined as the ratio between the minimum distance between point pairs from different clusters and the maximum distance between point pairs from the same cluster,

$$Dunn = \frac{W_{out}^{min}}{W_{in}^{max}}$$

• Where the minimum intercluster distance is computed as,

$$W_{out}^{min} = \min_{i,\, j > i} \left\{ \delta(x_a, x_b) \mid x_a \in C_i,\ x_b \in C_j \right\}$$

• And the maximum intracluster distance is computed as,

$$W_{in}^{max} = \max_{i} \left\{ \delta(x_a, x_b) \mid x_a, x_b \in C_i \right\}$$

with $\delta(x_a, x_b)$ the distance between points $x_a$ and $x_b$
• The larger the Dunn index, the better the clustering, because it means that even the closest distance between points in different clusters is much larger than the farthest distance between points in the same cluster.
• However, the Dunn index may be insensitive because the minimum intercluster and maximum intracluster distances do not capture all the information about a clustering.
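A minimal sketch of the Dunn index as the smallest intercluster distance divided by the largest intracluster distance. Euclidean distance, the toy data, and the function name are assumptions. The code expects at least two clusters and at least one cluster with two or more distinct points.

```python
from itertools import combinations
from math import dist, inf

def dunn_index(points, labels):
    """Minimum intercluster distance over maximum intracluster distance."""
    min_out, max_in = inf, 0.0
    for i, j in combinations(range(len(points)), 2):
        d = dist(points[i], points[j])
        if labels[i] == labels[j]:
            max_in = max(max_in, d)
        else:
            min_out = min(min_out, d)
    return min_out / max_in

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
print(dunn_index(points, [0, 0, 1, 1]))  # 10.0
print(dunn_index(points, [0, 1, 0, 1]))  # 0.1
```

Because only two extreme distances enter the ratio, moving any other point leaves the value unchanged, which is the insensitivity noted above.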
![Page 24: DMTM Lecture 15 Clustering evaluation](https://reader033.vdocuments.us/reader033/viewer/2022050803/5a647b1d7f8b9a27568b4c65/html5/thumbnails/24.jpg)
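Davies–Bouldin Index
• Let $\mu_i$ denote the cluster mean and $\sigma_{\mu_i}$ denote the dispersion or spread of the points around the cluster mean,

$$\sigma_{\mu_i} = \sqrt{\frac{\sum_{x_j \in C_i} \delta(x_j, \mu_i)^2}{n_i}} = \sqrt{var(C_i)}$$

where $var(C_i)$ is the total variance of cluster $C_i$ and $n_i$ is its number of points
• The Davies–Bouldin measure for a pair of clusters $C_i$ and $C_j$ is defined as the ratio

$$DB_{ij} = \frac{\sigma_{\mu_i} + \sigma_{\mu_j}}{\delta(\mu_i, \mu_j)}$$

• $DB_{ij}$ measures how compact the clusters are compared to the distance between the cluster means.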
![Page 25: DMTM Lecture 15 Clustering evaluation](https://reader033.vdocuments.us/reader033/viewer/2022050803/5a647b1d7f8b9a27568b4c65/html5/thumbnails/25.jpg)
Davies–Bouldin Index
• The Davies–Bouldin index is then defined as

$$DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \left\{ DB_{ij} \right\}$$

• For each cluster $C_i$, we pick the cluster $C_j$ that yields the largest $DB_{ij}$ ratio.
• The smaller the DB value the better the clustering, as it means that the clusters are well separated (i.e., the distance between cluster means is large), and each cluster is well represented by its mean (i.e., has a small spread).
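The Davies–Bouldin computation can be sketched as follows, taking the spread of a cluster to be the root mean squared distance of its points to the cluster mean (the square root of the total variance). Euclidean distance, the toy data, and the function name are assumptions. At least two clusters with distinct means are required.

```python
from math import dist, sqrt

def davies_bouldin(points, labels):
    """Average, over clusters, of the worst (largest) pairwise DB ratio."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    means, spreads = {}, {}
    for l, pts in clusters.items():
        mu = tuple(sum(c) / len(pts) for c in zip(*pts))
        means[l] = mu
        # Root mean squared distance to the cluster mean.
        spreads[l] = sqrt(sum(dist(p, mu) ** 2 for p in pts) / len(pts))
    total = 0.0
    for i in clusters:
        total += max((spreads[i] + spreads[j]) / dist(means[i], means[j])
                     for j in clusters if j != i)
    return total / len(clusters)

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
print(davies_bouldin(points, [0, 0, 1, 1]))  # 0.1: spreads of 0.5 each, means 10 apart
```

Library implementations may define the spread differently (for example, as the mean distance to the centroid rather than the root mean square), so values need not match exactly on other datasets.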
![Page 26: DMTM Lecture 15 Clustering evaluation](https://reader033.vdocuments.us/reader033/viewer/2022050803/5a647b1d7f8b9a27568b4c65/html5/thumbnails/26.jpg)
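Silhouette Coefficient
• A measure of both cohesion and separation of clusters, based on the difference between the average distance to points in the closest cluster and to points in the same cluster
• For each point $x_i$ we calculate its silhouette coefficient $s_i$ as

$$s_i = \frac{\mu_{out}^{min}(x_i) - \mu_{in}(x_i)}{\max\left\{ \mu_{out}^{min}(x_i),\ \mu_{in}(x_i) \right\}}$$

• Where $\mu_{out}^{min}(x_i)$ is the mean distance from $x_i$ to points in the closest other cluster, and $\mu_{in}(x_i)$ is the mean distance from $x_i$ to points in its own cluster $y_i$,

$$\mu_{in}(x_i) = \frac{\sum_{x_j \in C_{y_i},\ j \neq i} \delta(x_i, x_j)}{n_{y_i} - 1}$$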
![Page 27: DMTM Lecture 15 Clustering evaluation](https://reader033.vdocuments.us/reader033/viewer/2022050803/5a647b1d7f8b9a27568b4c65/html5/thumbnails/27.jpg)
Silhouette Coefficient
• And the mean of the distances from $x_i$ to points in the closest cluster, $\mu_{out}^{min}(x_i)$, is computed as,

$$\mu_{out}^{min}(x_i) = \min_{j \neq y_i} \left\{ \frac{\sum_{y \in C_j} \delta(x_i, y)}{n_j} \right\}$$

• The $s_i$ value of a point lies in the interval [−1,+1].
  ▪ A value close to +1 indicates that $x_i$ is much closer to points in its own cluster and is far from other clusters.
  ▪ A value close to zero indicates that $x_i$ is close to the boundary between two clusters.
  ▪ A value close to −1 indicates that $x_i$ is much closer to another cluster than its own cluster, and therefore, the point may be mis-clustered.
![Page 28: DMTM Lecture 15 Clustering evaluation](https://reader033.vdocuments.us/reader033/viewer/2022050803/5a647b1d7f8b9a27568b4c65/html5/thumbnails/28.jpg)
Silhouette Coefficient
• The silhouette coefficient is defined as the mean $s_i$ value across all the points,

$$SC = \frac{1}{n} \sum_{i=1}^{n} s_i$$

• A value close to +1 indicates a good clustering.
• Drawbacks
  ▪ The silhouette coefficient is generally higher for convex clusters than for other concepts of clusters, such as density-based clusters like those obtained through DBSCAN
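The per-point silhouette values and their mean can be sketched as below. The code assumes Euclidean distance, at least two clusters, and at least two points in every cluster, so that the mean distance to a point's own cluster is defined. Data and names are illustrative.

```python
from math import dist

def silhouette_values(points, labels):
    """Per-point s_i = (mu_out_min - mu_in) / max(mu_out_min, mu_in)."""
    values = []
    for i, p in enumerate(points):
        sums, counts = {}, {}
        for j, q in enumerate(points):
            if j == i:
                continue
            l = labels[j]
            sums[l] = sums.get(l, 0.0) + dist(p, q)
            counts[l] = counts.get(l, 0) + 1
        own = labels[i]
        mu_in = sums[own] / counts[own]   # mean distance within own cluster
        mu_out = min(sums[l] / counts[l] for l in sums if l != own)  # closest other cluster
        values.append((mu_out - mu_in) / max(mu_out, mu_in))
    return values

def silhouette_coefficient(points, labels):
    s = silhouette_values(points, labels)
    return sum(s) / len(s)

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
print(silhouette_coefficient(points, [0, 0, 1, 1]))  # about 0.90
print(silhouette_coefficient(points, [0, 1, 0, 1]))  # about -0.45
```

Scikit-learn exposes comparable functionality as `sklearn.metrics.silhouette_samples` and `sklearn.metrics.silhouette_score`.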
![Page 29: DMTM Lecture 15 Clustering evaluation](https://reader033.vdocuments.us/reader033/viewer/2022050803/5a647b1d7f8b9a27568b4c65/html5/thumbnails/29.jpg)
Calinski-Harabasz Index
• Given k clusters, the Calinski-Harabasz score s is given by the ratio of the between-cluster dispersion mean and the within-cluster dispersion
• That is,

$$s(k) = \frac{\mathrm{Tr}(B_k)}{\mathrm{Tr}(W_k)} \times \frac{N - k}{k - 1}$$

where $B_k$ is the between-cluster dispersion matrix, $W_k$ is the within-cluster dispersion matrix, and $N$ is the number of points
• The score is higher when clusters are dense and well separated, which relates to a standard concept of a cluster
• The index is generally higher for convex clusters than for other concepts of clusters, such as density-based clusters like those obtained through DBSCAN.
![Page 30: DMTM Lecture 15 Clustering evaluation](https://reader033.vdocuments.us/reader033/viewer/2022050803/5a647b1d7f8b9a27568b4c65/html5/thumbnails/30.jpg)
Relative Measures
(compare different clusterings obtained by varying different parameters for the same algorithm, e.g., the number of clusters k)
![Page 31: DMTM Lecture 15 Clustering evaluation](https://reader033.vdocuments.us/reader033/viewer/2022050803/5a647b1d7f8b9a27568b4c65/html5/thumbnails/31.jpg)
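Within/Between Clusters Sum of Squares
• Within-cluster sum of squares

$$WSS = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$$

where $\mu_i$ is the centroid of cluster $C_i$ (in case of Euclidean spaces)
• Between-cluster sum of squares

$$BSS = \sum_{i=1}^{k} n_i\, \lVert \mu_i - \mu \rVert^2$$

where $\mu$ is the centroid of the whole dataset and $n_i$ is the number of points in $C_i$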
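The two sums of squares can be computed directly from the centroids. The sketch below also derives the Calinski-Harabasz score from them as (BSS/(k−1)) / (WSS/(n−k)), which is the trace-based form written with sums of squares. The toy data and function names are assumptions. The score needs 1 < k < n and a nonzero WSS.

```python
from math import dist

def sums_of_squares(points, labels):
    """Return (WSS, BSS) for Euclidean data."""
    n = len(points)
    mu = tuple(sum(c) / n for c in zip(*points))       # centroid of the whole dataset
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    wss = bss = 0.0
    for pts in clusters.values():
        mu_i = tuple(sum(c) / len(pts) for c in zip(*pts))   # cluster centroid
        wss += sum(dist(p, mu_i) ** 2 for p in pts)
        bss += len(pts) * dist(mu_i, mu) ** 2
    return wss, bss

def calinski_harabasz(points, labels):
    wss, bss = sums_of_squares(points, labels)
    n, k = len(points), len(set(labels))
    return (bss / (k - 1)) / (wss / (n - k))

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
print(sums_of_squares(points, labels))    # (1.0, 100.0)
print(calinski_harabasz(points, labels))  # 200.0
```

WSS + BSS equals the total sum of squared distances to the global centroid (101 here), so increasing k trades WSS for BSS. Scikit-learn provides `sklearn.metrics.calinski_harabasz_score` for the index itself.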
![Page 32: DMTM Lecture 15 Clustering evaluation](https://reader033.vdocuments.us/reader033/viewer/2022050803/5a647b1d7f8b9a27568b4c65/html5/thumbnails/32.jpg)
Knee/Elbow Analysis of Clustering
![Page 33: DMTM Lecture 15 Clustering evaluation](https://reader033.vdocuments.us/reader033/viewer/2022050803/5a647b1d7f8b9a27568b4c65/html5/thumbnails/33.jpg)
Calinski-Harabasz Index
• We can use the Calinski-Harabasz index to select k
• In a good clustering, we expect the within-cluster scatter to be smaller relative to the between-cluster scatter, which should result in a higher value of the index
• Thus, we can either select the k corresponding to the highest index value, or we can perform a knee analysis and look for a significant increase followed by much smaller differences
• For instance, we can choose the value k > 3 that minimizes,

$$\Delta(k) = \big(CH(k+1) - CH(k)\big) - \big(CH(k) - CH(k-1)\big)$$
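A tiny sketch of the knee analysis, assuming the quantity to minimize is the second difference Δ(k) = (CH(k+1) − CH(k)) − (CH(k) − CH(k−1)). The index values below are made up purely for illustration and do not come from any dataset. The function name is hypothetical.

```python
def knee_delta(ch, k):
    """Second difference of the index around k."""
    return (ch[k + 1] - ch[k]) - (ch[k] - ch[k - 1])

# Made-up index values for k = 3..7 (illustration only).
ch = {3: 100.0, 4: 180.0, 5: 190.0, 6: 195.0, 7: 197.0}

# Only values of k with both neighbours available can be scored.
candidates = [k for k in ch if k - 1 in ch and k + 1 in ch]
best_k = min(candidates, key=lambda k: knee_delta(ch, k))
print(best_k)   # 4
```

A strongly negative Δ(k) marks a large gain when reaching k followed by a much smaller gain afterwards, which is the knee the analysis looks for.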
![Page 34: DMTM Lecture 15 Clustering evaluation](https://reader033.vdocuments.us/reader033/viewer/2022050803/5a647b1d7f8b9a27568b4c65/html5/thumbnails/34.jpg)
Silhouette Coefficient
• We can use the silhouette coefficient sj of each point xj and the average SC value to estimate the number of clusters in the data
• For each cluster, plot the sj values in descending order
• Check the overall SC value for a particular value of k, as well as SCi values for each cluster i
• Pick the value of k that yields the best clustering, with many points having high sj values within each cluster, as well as high values for SC and SCi (1 ≤ i ≤ k).
![Page 35: DMTM Lecture 15 Clustering evaluation](https://reader033.vdocuments.us/reader033/viewer/2022050803/5a647b1d7f8b9a27568b4c65/html5/thumbnails/35.jpg)
Silhouette coefficients for the Iris dataset computed using a k-means algorithm with k=2
![Page 36: DMTM Lecture 15 Clustering evaluation](https://reader033.vdocuments.us/reader033/viewer/2022050803/5a647b1d7f8b9a27568b4c65/html5/thumbnails/36.jpg)
Silhouette coefficients for the Iris dataset computed using a k-means algorithm with k=2
![Page 37: DMTM Lecture 15 Clustering evaluation](https://reader033.vdocuments.us/reader033/viewer/2022050803/5a647b1d7f8b9a27568b4c65/html5/thumbnails/37.jpg)
Silhouette coefficients for the Iris dataset computed using a k-means algorithm with k=2
![Page 38: DMTM Lecture 15 Clustering evaluation](https://reader033.vdocuments.us/reader033/viewer/2022050803/5a647b1d7f8b9a27568b4c65/html5/thumbnails/38.jpg)
Cluster Stability
![Page 39: DMTM Lecture 15 Clustering evaluation](https://reader033.vdocuments.us/reader033/viewer/2022050803/5a647b1d7f8b9a27568b4c65/html5/thumbnails/39.jpg)
The clusterings obtained from several datasets sampled from the same distribution should be similar or “stable.”
![Page 40: DMTM Lecture 15 Clustering evaluation](https://reader033.vdocuments.us/reader033/viewer/2022050803/5a647b1d7f8b9a27568b4c65/html5/thumbnails/40.jpg)
Algorithm to choose k as the number of clusters that exhibits the least deviation between the clusterings. From Zaki’s textbook © Cambridge University Press 2014
![Page 41: DMTM Lecture 15 Clustering evaluation](https://reader033.vdocuments.us/reader033/viewer/2022050803/5a647b1d7f8b9a27568b4c65/html5/thumbnails/41.jpg)
Clustering Tendency
![Page 42: DMTM Lecture 15 Clustering evaluation](https://reader033.vdocuments.us/reader033/viewer/2022050803/5a647b1d7f8b9a27568b4c65/html5/thumbnails/42.jpg)
Clustering Tendency
• Aims to determine whether the dataset has any meaningful groups to begin with
• Difficult task typically tackled by comparing the data distribution with samples randomly generated from the same data space
• Existing approaches include,
  ▪ Spatial Histogram
  ▪ Distance Distribution
  ▪ Hopkins Statistic
  ▪ …
![Page 43: DMTM Lecture 15 Clustering evaluation](https://reader033.vdocuments.us/reader033/viewer/2022050803/5a647b1d7f8b9a27568b4c65/html5/thumbnails/43.jpg)
Run the Python notebook for this lecture