Clustering Introduction
TRANSCRIPT
![Page 1: Clustering introduction](https://reader031.vdocuments.us/reader031/viewer/2022013013/58f21b1f1a28abc3428b45ad/html5/thumbnails/1.jpg)
Clustering for New Discovery in Data
Houston Machine Learning Meetup
![Page 2: Clustering introduction](https://reader031.vdocuments.us/reader031/viewer/2022013013/58f21b1f1a28abc3428b45ad/html5/thumbnails/2.jpg)
SCR ©
Roadmap: Method
• Tour of machine learning algorithms (1 session)
• Feature engineering (1 session)
  – Feature selection - Yan
• Supervised learning (4 sessions)
  – Regression models - Yan
  – SVM and kernel SVM - Yan
  – Tree-based models - Dario
  – Bayesian method - Xiaoyang
  – Ensemble models - Yan
• Unsupervised learning (3 sessions)
  – K-means clustering, DBSCAN - Cheng
  – Mean shift, agglomerative clustering - Kunal
  – Dimension reduction for data visualization - Yan
• Deep learning (4 sessions)
  – Neural network
  – From neural network to deep learning
  – Convolutional neural network
  – Train deep nets with open-source tools
![Page 3: Clustering introduction](https://reader031.vdocuments.us/reader031/viewer/2022013013/58f21b1f1a28abc3428b45ad/html5/thumbnails/3.jpg)
Roadmap: Application
• Business analytics
• Recommendation system
• Natural language processing
• Computer vision
• Energy industry
![Page 4: Clustering introduction](https://reader031.vdocuments.us/reader031/viewer/2022013013/58f21b1f1a28abc3428b45ad/html5/thumbnails/4.jpg)
Agenda
• Introduction
• Application of clustering
• K-means
• DBSCAN
• Cluster validation
![Page 5: Clustering introduction](https://reader031.vdocuments.us/reader031/viewer/2022013013/58f21b1f1a28abc3428b45ad/html5/thumbnails/5.jpg)
What is clustering?
Clustering: discovering the natural groupings of a set of objects or patterns in unlabeled data
![Page 6: Clustering introduction](https://reader031.vdocuments.us/reader031/viewer/2022013013/58f21b1f1a28abc3428b45ad/html5/thumbnails/6.jpg)
Application: Recommendation
![Page 7: Clustering introduction](https://reader031.vdocuments.us/reader031/viewer/2022013013/58f21b1f1a28abc3428b45ad/html5/thumbnails/7.jpg)
Application: Document Clustering
https://www.noggle.online/knowledgebase/document-clustering/
![Page 8: Clustering introduction](https://reader031.vdocuments.us/reader031/viewer/2022013013/58f21b1f1a28abc3428b45ad/html5/thumbnails/8.jpg)
Application: Pizza Hut Center Placement
Delivery locations
![Page 9: Clustering introduction](https://reader031.vdocuments.us/reader031/viewer/2022013013/58f21b1f1a28abc3428b45ad/html5/thumbnails/9.jpg)
Application: Discovering Gene Functions
Important for discovering diseases and treatments
![Page 10: Clustering introduction](https://reader031.vdocuments.us/reader031/viewer/2022013013/58f21b1f1a28abc3428b45ad/html5/thumbnails/10.jpg)
Clustering Algorithm
• K-Means (King of clustering, many variants)
• DBSCAN (group neighboring points)
• Mean shift (locating the maxima of density)
• Spectral clustering (cares about connectivity instead of proximity)
• Hierarchical clustering (a hierarchical structure, multiple levels)
• Expectation Maximization (k-means is a variant of EM)
• Latent Dirichlet Allocation (natural language processing)
……
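As an illustrative sketch (assuming scikit-learn is available; the dataset and parameters below are invented for illustration), the first two algorithms on this list can be run in a few lines:

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

rng = np.random.default_rng(0)
# Two well-separated blobs of 50 points each (synthetic toy data).
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)),
               rng.normal(3.0, 0.3, (50, 2))])

# K-means needs the number of clusters up front.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN instead groups neighboring points within radius `eps`;
# points in no dense region get the noise label -1.
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
```

The contrast previewed here recurs throughout the talk: K-means requires K as input, while DBSCAN derives the number of groups from the density of the data.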
![Page 11: Clustering introduction](https://reader031.vdocuments.us/reader031/viewer/2022013013/58f21b1f1a28abc3428b45ad/html5/thumbnails/11.jpg)
• K-Means
• DBSCAN
![Page 12: Clustering introduction](https://reader031.vdocuments.us/reader031/viewer/2022013013/58f21b1f1a28abc3428b45ad/html5/thumbnails/12.jpg)
Cluster Validation
![Page 13: Clustering introduction](https://reader031.vdocuments.us/reader031/viewer/2022013013/58f21b1f1a28abc3428b45ad/html5/thumbnails/13.jpg)
Cluster Validity
• For cluster analysis, the question is how to evaluate the “goodness” of the resulting clusters.
• Why evaluate them at all?
  – To avoid finding patterns in noise
  – To compare clustering algorithms
  – To determine the optimal number of clusters
![Page 14: Clustering introduction](https://reader031.vdocuments.us/reader031/viewer/2022013013/58f21b1f1a28abc3428b45ad/html5/thumbnails/14.jpg)
Cluster Validity
• Numerical measures:
  – External: measure the extent to which cluster labels match externally supplied class labels
    • Example: entropy
  – Internal: measure the goodness of a clustering structure without reference to external information
    • Example: Sum of Squared Error (SSE)
  – Relative: compare two different clusterings
    • Often an external or internal measure is used for this, e.g., SSE or entropy
• Visualization
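The external case can be made concrete with a small sketch: a hypothetical `clustering_entropy` helper (ours, not from any library) that computes the weighted average entropy of the class distribution inside each cluster, which is zero when clusters match the supplied class labels exactly:

```python
import numpy as np

def clustering_entropy(labels, classes):
    """Weighted average entropy of the class mix within each cluster."""
    n = len(labels)
    total = 0.0
    for c in np.unique(labels):
        members = classes[labels == c]
        _, counts = np.unique(members, return_counts=True)
        p = counts / counts.sum()          # class distribution in this cluster
        h = -(p * np.log2(p)).sum()        # entropy of that distribution
        total += (len(members) / n) * h    # weight by cluster size
    return total

# Pure clusters give entropy 0; a single mixed cluster of two classes gives 1 bit.
pure = clustering_entropy(np.array([0, 0, 1, 1]), np.array(['a', 'a', 'b', 'b']))
```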
![Page 15: Clustering introduction](https://reader031.vdocuments.us/reader031/viewer/2022013013/58f21b1f1a28abc3428b45ad/html5/thumbnails/15.jpg)
Internal Measures: WSS and BSS
• Cluster cohesion: measures how closely related the objects within a cluster are
  – Example: SSE
• Cluster separation: measures how distinct or well-separated a cluster is from the other clusters
• Example: squared error
  – Cohesion is measured by the within-cluster sum of squares (WSS):

      WSS = Σ_i Σ_{x ∈ C_i} (x − m_i)²

  – Separation is measured by the between-cluster sum of squares (BSS):

      BSS = Σ_i |C_i| (m − m_i)²

  – where m is the overall mean, m_i is the centroid of cluster C_i, and |C_i| is the size of cluster i
![Page 16: Clustering introduction](https://reader031.vdocuments.us/reader031/viewer/2022013013/58f21b1f1a28abc3428b45ad/html5/thumbnails/16.jpg)
Internal Measures: WSS and BSS
• Example: BSS + WSS = constant
Points 1, 2, 4, 5 on a line, with overall mean m = 3.

K = 1 cluster:
  WSS = (1 − 3)² + (2 − 3)² + (4 − 3)² + (5 − 3)² = 10
  BSS = 4 × (3 − 3)² = 0
  Total = 10

K = 2 clusters ({1, 2} with m1 = 1.5; {4, 5} with m2 = 4.5):
  WSS = (1 − 1.5)² + (2 − 1.5)² + (4 − 4.5)² + (5 − 4.5)² = 1
  BSS = 2 × (3 − 1.5)² + 2 × (4.5 − 3)² = 9
  Total = 10
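The arithmetic in this example can be checked with a short NumPy sketch (the `wss_bss` helper is ours, written directly from the WSS and BSS definitions):

```python
import numpy as np

def wss_bss(X, labels):
    """Within-cluster (WSS) and between-cluster (BSS) sums of squares."""
    m = X.mean(axis=0)                          # overall mean
    wss = bss = 0.0
    for i in np.unique(labels):
        Ci = X[labels == i]
        mi = Ci.mean(axis=0)                    # centroid of cluster i
        wss += ((Ci - mi) ** 2).sum()           # cohesion term
        bss += len(Ci) * ((m - mi) ** 2).sum()  # separation term, weighted by |Ci|
    return wss, bss

X = np.array([[1.0], [2.0], [4.0], [5.0]])      # the slide's 1, 2, 4, 5 example
w2, b2 = wss_bss(X, np.array([0, 0, 1, 1]))     # K = 2 split
w1, b1 = wss_bss(X, np.array([0, 0, 0, 0]))     # K = 1 (everything together)
```

Either way the two terms add up to 10, illustrating that WSS + BSS is constant for a fixed dataset.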
![Page 17: Clustering introduction](https://reader031.vdocuments.us/reader031/viewer/2022013013/58f21b1f1a28abc3428b45ad/html5/thumbnails/17.jpg)
Internal Measures: WSS and BSS
• Can be used to estimate the number of clusters
[Figures: a sample dataset, and SSE plotted against the number of clusters K; the point where the SSE curve stops dropping sharply suggests the number of clusters.]
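This is the familiar “elbow” heuristic. A sketch assuming scikit-learn, whose `inertia_` attribute is the WSS of the fitted K-means model (the toy dataset below, with three blobs, is invented for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated 2-D blobs of 40 points each.
X = np.vstack([rng.normal(c, 0.2, (40, 2)) for c in (0.0, 2.0, 4.0)])

# Sweep K and record the WSS (KMeans' inertia_) for each fit.
wss = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in range(1, 7)}
# WSS always decreases with K; past the true K = 3 the drops become small,
# which is the "elbow" of the curve.
```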
![Page 18: Clustering introduction](https://reader031.vdocuments.us/reader031/viewer/2022013013/58f21b1f1a28abc3428b45ad/html5/thumbnails/18.jpg)
Internal Measures: Proximity graph measures
• Cluster cohesion is the sum of the weight of all links within a cluster.
• Cluster separation is the sum of the weights between nodes in the cluster and nodes outside the cluster.
[Figure: links within a cluster illustrate cohesion; links crossing the cluster boundary illustrate separation]
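A minimal sketch of these two graph measures, given a hypothetical weighted adjacency (proximity) matrix `W` and a member index list (the helper is ours, for illustration):

```python
import numpy as np

def cohesion_separation(W, members):
    """Sum of link weights inside a cluster, and across its boundary."""
    inside = np.zeros(len(W), dtype=bool)
    inside[np.asarray(members)] = True
    cohesion = W[np.ix_(inside, inside)].sum() / 2   # each undirected link once
    separation = W[np.ix_(inside, ~inside)].sum()
    return cohesion, separation

# Toy symmetric proximity graph on 3 nodes.
W = np.array([[0.0, 1.0, 0.1],
              [1.0, 0.0, 0.2],
              [0.1, 0.2, 0.0]])
c, s = cohesion_separation(W, [0, 1])   # cohesion 1.0, separation ≈ 0.3
```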
![Page 19: Clustering introduction](https://reader031.vdocuments.us/reader031/viewer/2022013013/58f21b1f1a28abc3428b45ad/html5/thumbnails/19.jpg)
Correlation between affinity matrix and incidence matrix
• Given an affinity (distance) matrix D = {d11, d12, …, dnn} and an incidence matrix C = {c11, c12, …, cnn} from the clustering (c_ij = 1 if points i and j fall in the same cluster, 0 otherwise)
• The correlation r between D and C is

  r = Σ_{i,j} (d_ij − d̄)(c_ij − c̄) / √( Σ_{i,j} (d_ij − d̄)² × Σ_{i,j} (c_ij − c̄)² )
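A sketch of this correlation in NumPy (the helper name `matrix_correlation` is ours; we correlate only upper-triangle entries so each pair is counted once):

```python
import numpy as np

def matrix_correlation(D, C):
    """Pearson correlation between the off-diagonal entries of two matrices."""
    i, j = np.triu_indices_from(D, k=1)   # each pair once, skip the diagonal
    return np.corrcoef(D[i, j], C[i, j])[0, 1]

# Toy data: two tight pairs of 1-D points.
X = np.array([[0.0], [0.1], [5.0], [5.1]])
D = np.abs(X - X.T)                        # distance (affinity) matrix
labels = np.array([0, 0, 1, 1])
C = (labels[:, None] == labels[None, :]).astype(float)  # incidence matrix
r = matrix_correlation(D, C)               # strongly negative for a good clustering
```

The correlation is negative because small distances should coincide with incidence 1 (same cluster); the closer r is to −1, the better the clustering matches the distances.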
![Page 20: Clustering introduction](https://reader031.vdocuments.us/reader031/viewer/2022013013/58f21b1f1a28abc3428b45ad/html5/thumbnails/20.jpg)
Correlation with Incidence matrix
  r = Σ_{i,j} (d_ij − d̄)(c_ij − c̄) / √( Σ_{i,j} (d_ij − d̄)² × Σ_{i,j} (c_ij − c̄)² )

[Figures: two 2-D datasets in the unit square with their correlations: r = −0.9235 and r = −0.5810]
![Page 21: Clustering introduction](https://reader031.vdocuments.us/reader031/viewer/2022013013/58f21b1f1a28abc3428b45ad/html5/thumbnails/21.jpg)
Visualization of similarity matrix
• Order the similarity matrix with respect to cluster labels and inspect visually.
[Figures: a scatter of well-clustered points in the unit square and its similarity matrix with rows and columns ordered by cluster label; same-cluster entries appear as bright blocks on the diagonal]
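A sketch of this reordering trick in NumPy (toy 1-D data; the similarity function and all names below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=30)           # hypothetical cluster labels
X = labels[:, None] + rng.normal(0, 0.1, (30, 1))  # points near their label
S = 1.0 / (1.0 + np.abs(X - X.T))              # similarity from distance

# Reorder rows and columns so same-cluster points are adjacent.
order = np.argsort(labels, kind="stable")
S_sorted = S[np.ix_(order, order)]
# Well-clustered data shows bright diagonal blocks in S_sorted;
# random data would show no block structure.
```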
![Page 22: Clustering introduction](https://reader031.vdocuments.us/reader031/viewer/2022013013/58f21b1f1a28abc3428b45ad/html5/thumbnails/22.jpg)
Visualization of similarity matrix
• Clusters in random data are not so crisp
[Figures: a scatter of random points in the unit square and its similarity matrix ordered by cluster label; little block structure is visible]
![Page 23: Clustering introduction](https://reader031.vdocuments.us/reader031/viewer/2022013013/58f21b1f1a28abc3428b45ad/html5/thumbnails/23.jpg)
Final Comment on Cluster Validity
“The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage.”
— Jain and Dubes, Algorithms for Clustering Data
![Page 24: Clustering introduction](https://reader031.vdocuments.us/reader031/viewer/2022013013/58f21b1f1a28abc3428b45ad/html5/thumbnails/24.jpg)
Roadmap: Method
• Tour of machine learning algorithms (1 session)
• Feature engineering (1 session)
  – Feature selection - Yan
• Supervised learning (4 sessions)
  – Regression models - Yan
  – SVM and kernel SVM - Yan
  – Tree-based models - Dario
  – Bayesian method - Xiaoyang
  – Ensemble models - Yan
• Unsupervised learning (3 sessions)
  – K-means clustering, DBSCAN - Cheng
  – Mean shift, hierarchical clustering - Kunal
  – Dimension reduction for data visualization - Yan
• Deep learning (4 sessions)
  – Neural network
  – From neural network to deep learning - Yan
  – Convolutional neural network
  – Train deep nets with open-source tools
![Page 25: Clustering introduction](https://reader031.vdocuments.us/reader031/viewer/2022013013/58f21b1f1a28abc3428b45ad/html5/thumbnails/25.jpg)
Thank you
Slides will be posted on SlideShare:
http://www.slideshare.net/xuyangela