Download - Lecture 10 Clustering
![Page 1: Lecture 10 Clustering](https://reader033.vdocuments.us/reader033/viewer/2022052510/56812a47550346895d8d87df/html5/thumbnails/1.jpg)
1
Lecture 10 Clustering
![Page 2: Lecture 10 Clustering](https://reader033.vdocuments.us/reader033/viewer/2022052510/56812a47550346895d8d87df/html5/thumbnails/2.jpg)
2
Preview
Introduction Partitioning methods Hierarchical methods Model-based methods Density-based methods
![Page 3: Lecture 10 Clustering](https://reader033.vdocuments.us/reader033/viewer/2022052510/56812a47550346895d8d87df/html5/thumbnails/3.jpg)
4
Examples of Clustering Applications
Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
Land use: Identification of areas of similar land use in an earth observation database
Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
Urban planning: Identifying groups of houses according to their house type, value, and geographical location
Seismology: Observed earth quake epicenters should be clustered along continent faults
![Page 4: Lecture 10 Clustering](https://reader033.vdocuments.us/reader033/viewer/2022052510/56812a47550346895d8d87df/html5/thumbnails/4.jpg)
5
What Is a Good Clustering?
A good clustering method will produce clusters with High intra-class similarity Low inter-class similarity
Precise definition of clustering quality is difficult Application-dependent Ultimately subjective
![Page 5: Lecture 10 Clustering](https://reader033.vdocuments.us/reader033/viewer/2022052510/56812a47550346895d8d87df/html5/thumbnails/5.jpg)
6
Requirements for Clustering in Data Mining
Scalability Ability to deal with different types of attributes Discovery of clusters with arbitrary shape Minimal domain knowledge required to
determine input parameters Ability to deal with noise and outliers Insensitivity to order of input records Robustness wrt high dimensionality Incorporation of user-specified constraints Interpretability and usability
![Page 6: Lecture 10 Clustering](https://reader033.vdocuments.us/reader033/viewer/2022052510/56812a47550346895d8d87df/html5/thumbnails/6.jpg)
7
Similarity and Dissimilarity Between Objects
Euclidean distance (p = 2):
Properties of a metric d(i,j): d(i,j) 0 d(i,i) = 0 d(i,j) = d(j,i) d(i,j) d(i,k) + d(k,j)
)||...|||(|),( 22
22
2
11 pp jx
ix
jx
ix
jx
ixjid
![Page 7: Lecture 10 Clustering](https://reader033.vdocuments.us/reader033/viewer/2022052510/56812a47550346895d8d87df/html5/thumbnails/7.jpg)
8
Major Clustering Approaches
Partitioning: Construct various partitions and then
evaluate them by some criterion
Hierarchical: Create a hierarchical decomposition of
the set of objects using some criterion
Model-based: Hypothesize a model for each cluster
and find best fit of models to data
Density-based: Guided by connectivity and density
functions
![Page 8: Lecture 10 Clustering](https://reader033.vdocuments.us/reader033/viewer/2022052510/56812a47550346895d8d87df/html5/thumbnails/8.jpg)
9
Partitioning Algorithms
Partitioning method: Construct a partition of a database D of n objects into a set of k clusters
Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion Global optimal: exhaustively enumerate all partitions Heuristic methods: k-means and k-medoids
algorithms k-means (MacQueen, 1967): Each cluster is
represented by the center of the cluster k-medoids or PAM (Partition around medoids)
(Kaufman & Rousseeuw, 1987): Each cluster is represented by one of the objects in the cluster
![Page 9: Lecture 10 Clustering](https://reader033.vdocuments.us/reader033/viewer/2022052510/56812a47550346895d8d87df/html5/thumbnails/9.jpg)
10
K-Means Clustering
Given k, the k-means algorithm consists of four steps: Select initial centroids at random. Assign each object to the cluster with
the nearest centroid. Compute each centroid as the mean
of the objects assigned to it. Repeat previous 2 steps until no
change.
![Page 10: Lecture 10 Clustering](https://reader033.vdocuments.us/reader033/viewer/2022052510/56812a47550346895d8d87df/html5/thumbnails/10.jpg)
k-Means [MacQueen ’67]
![Page 11: Lecture 10 Clustering](https://reader033.vdocuments.us/reader033/viewer/2022052510/56812a47550346895d8d87df/html5/thumbnails/11.jpg)
![Page 12: Lecture 10 Clustering](https://reader033.vdocuments.us/reader033/viewer/2022052510/56812a47550346895d8d87df/html5/thumbnails/12.jpg)
![Page 13: Lecture 10 Clustering](https://reader033.vdocuments.us/reader033/viewer/2022052510/56812a47550346895d8d87df/html5/thumbnails/13.jpg)
![Page 14: Lecture 10 Clustering](https://reader033.vdocuments.us/reader033/viewer/2022052510/56812a47550346895d8d87df/html5/thumbnails/14.jpg)
![Page 15: Lecture 10 Clustering](https://reader033.vdocuments.us/reader033/viewer/2022052510/56812a47550346895d8d87df/html5/thumbnails/15.jpg)
![Page 16: Lecture 10 Clustering](https://reader033.vdocuments.us/reader033/viewer/2022052510/56812a47550346895d8d87df/html5/thumbnails/16.jpg)
![Page 17: Lecture 10 Clustering](https://reader033.vdocuments.us/reader033/viewer/2022052510/56812a47550346895d8d87df/html5/thumbnails/17.jpg)
![Page 18: Lecture 10 Clustering](https://reader033.vdocuments.us/reader033/viewer/2022052510/56812a47550346895d8d87df/html5/thumbnails/18.jpg)
Another example (k=3)
19
-2 -1.5 -1 -0.5 0 0.5 1 1.5 2
0
0.5
1
1.5
2
2.5
3
x
y
Iteration 1
-2 -1.5 -1 -0.5 0 0.5 1 1.5 2
0
0.5
1
1.5
2
2.5
3
x
y
Iteration 2
-2 -1.5 -1 -0.5 0 0.5 1 1.5 2
0
0.5
1
1.5
2
2.5
3
x
y
Iteration 3
-2 -1.5 -1 -0.5 0 0.5 1 1.5 2
0
0.5
1
1.5
2
2.5
3
x
y
Iteration 4
-2 -1.5 -1 -0.5 0 0.5 1 1.5 2
0
0.5
1
1.5
2
2.5
3
x
y
Iteration 5
-2 -1.5 -1 -0.5 0 0.5 1 1.5 2
0
0.5
1
1.5
2
2.5
3
x
y
Iteration 6
![Page 19: Lecture 10 Clustering](https://reader033.vdocuments.us/reader033/viewer/2022052510/56812a47550346895d8d87df/html5/thumbnails/19.jpg)
20
K-Means Clustering (contd.)
Example
0
1
2
3
4
5
6
7
8
9
10
0 1 2 3 4 5 6 7 8 9 10
0
1
2
3
4
5
6
7
8
9
10
0 1 2 3 4 5 6 7 8 9 10
0
1
2
3
4
5
6
7
8
9
10
0 1 2 3 4 5 6 7 8 9 10
0
1
2
3
4
5
6
7
8
9
10
0 1 2 3 4 5 6 7 8 9 10
![Page 20: Lecture 10 Clustering](https://reader033.vdocuments.us/reader033/viewer/2022052510/56812a47550346895d8d87df/html5/thumbnails/20.jpg)
21
Comments on the K-Means Method
Strengths Relatively efficient: O(tkn), where n is # objects, k is
# clusters, and t is # iterations. Normally, k, t << n.
Often terminates at a local optimum. The global optimum may be found using techniques such as simulated annealing and genetic algorithms
Weaknesses Applicable only when mean is defined (what about
categorical data?) Need to specify k, the number of clusters, in advance Trouble with noisy data and outliers Not suitable to discover clusters with non-convex
shapes
![Page 21: Lecture 10 Clustering](https://reader033.vdocuments.us/reader033/viewer/2022052510/56812a47550346895d8d87df/html5/thumbnails/21.jpg)
22
Hierarchical Clustering
Use distance matrix as clustering criteria. This method does not require the number of clusters k as an input, but needs a termination condition
Step 0 Step 1 Step 2 Step 3 Step 4
b
d
c
e
a a b
d e
c d e
a b c d e
Step 4 Step 3 Step 2 Step 1 Step 0
agglomerative(AGNES)
divisive(DIANA)
![Page 22: Lecture 10 Clustering](https://reader033.vdocuments.us/reader033/viewer/2022052510/56812a47550346895d8d87df/html5/thumbnails/22.jpg)
Simple example
1 3 2 5 4 60
0.05
0.1
0.15
0.2
1
2
3
4
5
6
1
23 4
5
![Page 23: Lecture 10 Clustering](https://reader033.vdocuments.us/reader033/viewer/2022052510/56812a47550346895d8d87df/html5/thumbnails/23.jpg)
Cut-off point
![Page 24: Lecture 10 Clustering](https://reader033.vdocuments.us/reader033/viewer/2022052510/56812a47550346895d8d87df/html5/thumbnails/24.jpg)
Single link example
![Page 25: Lecture 10 Clustering](https://reader033.vdocuments.us/reader033/viewer/2022052510/56812a47550346895d8d87df/html5/thumbnails/25.jpg)
![Page 26: Lecture 10 Clustering](https://reader033.vdocuments.us/reader033/viewer/2022052510/56812a47550346895d8d87df/html5/thumbnails/26.jpg)
![Page 27: Lecture 10 Clustering](https://reader033.vdocuments.us/reader033/viewer/2022052510/56812a47550346895d8d87df/html5/thumbnails/27.jpg)
![Page 28: Lecture 10 Clustering](https://reader033.vdocuments.us/reader033/viewer/2022052510/56812a47550346895d8d87df/html5/thumbnails/28.jpg)
![Page 29: Lecture 10 Clustering](https://reader033.vdocuments.us/reader033/viewer/2022052510/56812a47550346895d8d87df/html5/thumbnails/29.jpg)
![Page 30: Lecture 10 Clustering](https://reader033.vdocuments.us/reader033/viewer/2022052510/56812a47550346895d8d87df/html5/thumbnails/30.jpg)
![Page 31: Lecture 10 Clustering](https://reader033.vdocuments.us/reader033/viewer/2022052510/56812a47550346895d8d87df/html5/thumbnails/31.jpg)
![Page 32: Lecture 10 Clustering](https://reader033.vdocuments.us/reader033/viewer/2022052510/56812a47550346895d8d87df/html5/thumbnails/32.jpg)
![Page 33: Lecture 10 Clustering](https://reader033.vdocuments.us/reader033/viewer/2022052510/56812a47550346895d8d87df/html5/thumbnails/33.jpg)
close?
Inter-Cluster similarity?
![Page 34: Lecture 10 Clustering](https://reader033.vdocuments.us/reader033/viewer/2022052510/56812a47550346895d8d87df/html5/thumbnails/34.jpg)
MIN(Single-link)
![Page 35: Lecture 10 Clustering](https://reader033.vdocuments.us/reader033/viewer/2022052510/56812a47550346895d8d87df/html5/thumbnails/35.jpg)
MAX(complete-link)
![Page 36: Lecture 10 Clustering](https://reader033.vdocuments.us/reader033/viewer/2022052510/56812a47550346895d8d87df/html5/thumbnails/36.jpg)
Distance Between Centroids
![Page 37: Lecture 10 Clustering](https://reader033.vdocuments.us/reader033/viewer/2022052510/56812a47550346895d8d87df/html5/thumbnails/37.jpg)
Group Average(average-link)
![Page 38: Lecture 10 Clustering](https://reader033.vdocuments.us/reader033/viewer/2022052510/56812a47550346895d8d87df/html5/thumbnails/38.jpg)
What is: density-based problem?
Classes of arbitrary spatial shapes
Desired properties: good efficiency (large databases) minimal domain knowledge (determine
parameters)
Intuition: notion of density