Clustering
Panagiotis Papapetrou, PhD
Associate Professor, Stockholm University
Adjunct Professor, Aalto University
(Source: people.dsv.su.se/~panagiotis/DAMI2014/lect3.pdf)
TODAY…
Nov 4       Introduction to data mining
Nov 5       Association Rules
Nov 10, 14  Clustering and Data Representation
Nov 17      Exercise session 1 (Homework 1 due)
Nov 19      Classification
Nov 24, 26  Similarity Matching and Model Evaluation
Dec 1       Exercise session 2 (Homework 2 due)
Dec 3       Combining Models
Dec 8, 10   Time Series Analysis
Dec 15      Exercise session 3 (Homework 3 due)
Dec 17      Ranking
Jan 13      Review
Jan 14      EXAM
Feb 23      Re-EXAM
What is clustering?
• A grouping of data objects such that the objects within a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups
• Intra-cluster distances are minimized; inter-cluster distances are maximized
Outliers
• Outliers are objects that do not belong to any cluster or form clusters of very small cardinality
• In some applications we are interested in discovering outliers, not clusters (outlier analysis)
Applications of clustering?
• Image Processing
– Cluster images based on their visual content
• Web
– Cluster groups of users based on their access patterns on webpages
– Cluster webpages based on their content
• Bioinformatics
– Cluster similar proteins together (similarity w.r.t. chemical structure and/or functionality, etc.)
• Many more…
The clustering task
• Group observations into groups so that observations belonging to the same group are similar, whereas observations in different groups are different
• Basic questions:
– What does “similar” mean?
– What is a good partition of the objects? I.e., how is the quality of a solution measured?
– How do we find a good partition of the observations?
Observations to cluster
• Usually data objects consist of a set of attributes (also known as dimensions)
• J. Smith, 20, 200K
• If all d dimensions are real-valued then we can visualize the data as points in a d-dimensional space
• If all d dimensions are binary then we can think of each data point as a binary vector
Distance functions for binary vectors
• Jaccard similarity between binary vectors X and Y:
JSim(X, Y) = |X ∩ Y| / |X ∪ Y|
• Jaccard distance between binary vectors X and Y:
Jdist(X, Y) = 1 − JSim(X, Y)
• Example:
     Q1 Q2 Q3 Q4 Q5 Q6
  X   1  0  0  1  1  0
  Y   0  1  1  0  1  0
• |X ∩ Y| = 1 (only Q5) and |X ∪ Y| = 5 (Q1–Q5), so JSim = 1/5 and Jdist = 4/5
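The Jaccard computation above can be sketched in Python (the helper name is hypothetical). Note that under the standard definition, intersection and union count only positions where at least one vector is 1, so the example evaluates to 1/5:

```python
def jaccard_sim(x, y):
    """Jaccard similarity of two equal-length binary vectors."""
    inter = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)  # |X ∩ Y|
    union = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)   # |X ∪ Y|
    return inter / union if union else 1.0

X = [1, 0, 0, 1, 1, 0]
Y = [0, 1, 1, 0, 1, 0]
print(jaccard_sim(X, Y))      # 0.2 (= 1/5)
print(1 - jaccard_sim(X, Y))  # 0.8 (= 4/5), the Jaccard distance
```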
Distance functions for real-valued vectors
• Lp norms or Minkowski distance:

L_p(x, y) = ( Σ_{i=1}^{d} |x_i − y_i|^p )^{1/p}

where p is a positive integer
• If p = 1, L1 is the Manhattan (or city block) distance:

L_1(x, y) = Σ_{i=1}^{d} |x_i − y_i|

• If p = 2, L2 is the Euclidean distance:

L_2(x, y) = ( |x_1 − y_1|² + |x_2 − y_2|² + … + |x_d − y_d|² )^{1/2}
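The Lp family above can be written as one small function; a minimal sketch (the function name is illustrative):

```python
def minkowski(x, y, p):
    """Lp (Minkowski) distance between two real-valued vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

x, y = [0.0, 0.0], [3.0, 4.0]
print(minkowski(x, y, 1))  # 7.0  (L1, Manhattan)
print(minkowski(x, y, 2))  # 5.0  (L2, Euclidean)
```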
Data Structures
• Data matrix (n tuples/objects × d attributes/dimensions):

    [ x_11 … x_1ℓ … x_1d ]
    [  …       …      …  ]
    [ x_i1 … x_iℓ … x_id ]
    [  …       …      …  ]
    [ x_n1 … x_nℓ … x_nd ]

• Distance matrix (n objects × n objects, symmetric, zero diagonal):

    [ 0                         ]
    [ d(2,1)  0                 ]
    [ d(3,1)  d(3,2)  0         ]
    [   …       …      …        ]
    [ d(n,1)  d(n,2)  …       0 ]
Partitioning algorithms
• Basic concept:
– Construct a partition of a set of n objects into a set of k clusters
– Each object belongs to exactly one cluster
– The number of clusters k is given in advance
The k-means problem
• Given a set X of n points in a d-dimensional space and an integer k
• Task: choose a set of k points {c1, c2, …, ck} in the d-dimensional space to form clusters C = {C1, C2, …, Ck} such that

Cost(C) = Σ_{i=1}^{k} Σ_{x ∈ Ci} L2²(x, ci) = Σ_{i=1}^{k} Σ_{x ∈ Ci} (x − ci)²

is minimized
• Some special cases: k = 1, k = n
Algorithmic properties of the k-means problem
• NP-hard if the dimensionality of the data is at least 2 (d ≥ 2)
• Finding the best solution in polynomial time is infeasible
• For d = 1 the problem is solvable in polynomial time
• A simple iterative algorithm works quite well in practice
The k-means algorithm
• Randomly pick k cluster centers {c1, …, ck}
• For each i, set the cluster Ci to be the set of points in X that are closer to ci than to cj for all j ≠ i
• For each i, let ci be the center of cluster Ci (the mean of the vectors in Ci)
• Repeat until convergence
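The iteration above (Lloyd's algorithm) can be sketched in a few lines of Python; this is an illustrative implementation, not the lecture's reference code, and the random initialization is one of several possible choices:

```python
import random

def kmeans(X, k, iters=100, seed=0):
    """Plain k-means iteration; X is a list of coordinate tuples."""
    rng = random.Random(seed)
    centers = rng.sample(X, k)                  # random initial centers
    for _ in range(iters):
        # Assignment step: each point joins its closest center.
        clusters = [[] for _ in range(k)]
        for p in X:
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            clusters[i].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:              # no center moved: converged
            break
        centers = new_centers
    return centers, clusters

centers, clusters = kmeans([(1, 1), (1.5, 2), (8, 8), (9, 9)], 2)
```

On these two well-separated pairs the loop settles on one cluster per pair regardless of which two points are sampled as initial centers.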
[Figure: K-means Clustering, Step 1 (Algorithm: k-means, Distance Metric: Euclidean Distance; centers k1, k2, k3)]
[Figure: K-means Clustering, Step 2 (Algorithm: k-means, Distance Metric: Euclidean Distance; centers k1, k2, k3)]
[Figure: K-means Clustering, Step 3 (Algorithm: k-means, Distance Metric: Euclidean Distance; centers k1, k2, k3)]
[Figure: K-means Clustering, Step 4 (Algorithm: k-means, Distance Metric: Euclidean Distance; centers k1, k2, k3)]
Properties of the k-means algorithm
• Finds a local optimum
• Often converges quickly (but not always)
• Guaranteed to converge after at most k^N iterations (N = number of objects)
• The choice of initial points can have a large influence on the result
Discussion of the k-means algorithm
[Figure: four points x1, x2, x3, x4; K = 2]
Discussion of the k-means algorithm
[Figure: four points x1–x4 with initial centers c1, c2; K = 2]
Discussion of the k-means algorithm
[Figure: four points x1–x4 with centers c1, c2; K = 2]
K-means converges immediately!
Discussion of the k-means algorithm
[Figure: four points x1–x4 with centers c1, c2; K = 2]
K-means converges immediately! Bad result ☹
Discussion of the k-means algorithm
[Figure: four points x1–x4 with centers c1, c2; K = 2]
K-means converges immediately! Even worse as the horizontal lines stretch
K-means++
• Choose one center uniformly at random from X
• For each data point x, compute d(x)
– the distance between x and the nearest center that has already been chosen
• Choose a new center from the data points using a weighted probability distribution, where a point x is chosen with probability proportional to d(x)², so far-away points are favored
• Repeat Steps 2 and 3 until k centers have been chosen
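The seeding steps above can be sketched as follows; this is an illustrative implementation (function name hypothetical), using the standard d(x)² weighting for the probability draw:

```python
import random

def kmeanspp_init(X, k, seed=0):
    """k-means++ seeding sketch: each new center is sampled with
    probability proportional to its squared distance to the nearest
    already-chosen center."""
    rng = random.Random(seed)
    centers = [rng.choice(X)]                    # step 1: uniform first center
    while len(centers) < k:
        # d(x)^2 for every point: squared distance to nearest chosen center
        d2 = [min(sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers)
              for x in X]
        # weighted draw proportional to d(x)^2
        r = rng.random() * sum(d2)
        acc = 0.0
        for x, w in zip(X, d2):
            acc += w
            if acc >= r:
                centers.append(x)
                break
        else:                                    # numeric edge case
            centers.append(X[-1])
    return centers

X = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(kmeanspp_init(X, 2))
```

Since existing centers have d(x)² = 0, the weighted draw never re-picks a chosen center, and the far pair is picked with high probability.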
K-means
• Limitations?
• What if the space is not known?
• Example: images…
[Figure: points with centers k1, k2, k3 in a 2-d grid]
Clustering images
Clustering images: scientist or not
What are the centroids?
The k-median problem
• Given a set X of n points in a d-dimensional space and an integer k
• Task: choose a set of k points {c1, c2, …, ck} from X and form clusters {C1, C2, …, Ck} such that

Cost(C) = Σ_{i=1}^{k} Σ_{x ∈ Ci} L1(x, ci)

is minimized
The k-medoids algorithm
• Or … PAM (Partitioning Around Medoids, 1987)
– Choose randomly k medoids from the original dataset X
– Assign each of the n − k remaining points in X to their closest medoid
– Iteratively replace one of the medoids by one of the non-medoids if it improves the total clustering cost
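A minimal sketch of the swap loop above, assuming a greedy acceptance rule and a caller-supplied distance function (the original PAM evaluates swaps somewhat more carefully, so treat this as an illustration only):

```python
from itertools import product

def pam(X, k, dist, seed_medoids=None):
    """Naive PAM sketch: greedily swap a medoid with a non-medoid
    while the total clustering cost improves."""
    medoids = list(seed_medoids if seed_medoids is not None else X[:k])

    def cost(meds):
        # every point is charged the distance to its closest medoid
        return sum(min(dist(x, m) for m in meds) for x in X)

    improved = True
    while improved:
        improved = False
        for i, x in product(range(k), X):
            if x in medoids:
                continue                      # only non-medoids can swap in
            candidate = medoids[:i] + [x] + medoids[i + 1:]
            if cost(candidate) < cost(medoids):
                medoids = candidate           # accept the improving swap
                improved = True
    return medoids

manhattan = lambda a, b: sum(abs(u - v) for u, v in zip(a, b))
X = [(1, 1), (2, 1), (9, 9), (9, 8), (100, 100)]
print(pam(X, 2, manhattan))
```

Because the medoids must be actual data points, any distance function works here, which is the flexibility argument made on the next slides.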
Discussion of the PAM algorithm
• The algorithm is very similar to the k-means algorithm
• Advantages and disadvantages?
K-means vs. k-medoids
• K-medoids is more flexible:
– Can be used with any distance/similarity measure
– K-means works only with measures that are consistent with the mean (e.g., Euclidean distance)
• K-medoids is more robust to outliers:
– Example: [1 2 3 4 100000]
– mean = 20002 vs. medoid = 3
• K-medoids is more expensive:
– All pairwise distances are computed at each iteration
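The outlier example above is easy to check directly; note that for 1-d data under L1 the medoid coincides with the median element:

```python
data = [1, 2, 3, 4, 100000]

# The mean is dragged toward the outlier...
mean = sum(data) / len(data)
# ...while the medoid (here: the median element) stays with the bulk of the data.
medoid = sorted(data)[len(data) // 2]

print(mean)    # 20002.0
print(medoid)  # 3
```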
CLARA (Clustering Large Applications)
• Draws multiple samples of the data set, applies PAM on each sample, and returns the best clustering as the output
• Strength: deals with larger data sets than PAM
• Weaknesses:
– Efficiency depends on the sample size
– A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased
What is the right number of clusters?
• …or who sets the value of k?
• For n points to be clustered, consider the case where k = n. What is the value of the error function?
• What happens when k = 1?
• Since we want to minimize the error, why don't we always select k = n?
• X-means:
– based on the Bayesian Information Criterion
– detects the optimal number of clusters
Hierarchical Clustering
• Produces a set of nested clusters organized as a hierarchical tree
• Can be visualized as a dendrogram
– A tree-like diagram that records the sequences of merges or splits
[Figure: example dendrogram over points 1–6, with merge heights between 0.05 and 0.2]
Strengths of Hierarchical Clustering
• No assumptions on the number of clusters
– Any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level
• Hierarchical clusterings may correspond to meaningful taxonomies
– Examples on the web (e.g., product catalogs), etc.
Hierarchical Clustering
• Two main types of hierarchical clustering
– Agglomerative:
• Start with the points as individual clusters
• At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
– Divisive:
• Start with one, all-inclusive cluster
• At each step, split a cluster until each cluster contains a single point (or there are k clusters)
• Traditional hierarchical algorithms use a similarity or distance matrix
– Merge or split one cluster at a time
Complexity of hierarchical clustering
• The distance matrix is used for deciding which clusters to merge/split
• At least quadratic in the number of data points
• Not usable for large datasets
Agglomerative clustering algorithm
• Most popular hierarchical clustering technique
• Basic algorithm:
1. Compute the distance matrix between the input data points
2. Let each data point be a cluster
3. Repeat
4.   Merge the two closest clusters
5.   Update the distance matrix
6. Until only a single cluster remains
• The key operation is the computation of the distance between two clusters
– Different definitions of the distance between clusters lead to different algorithms
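The basic algorithm above can be sketched naively in Python; this illustrative version recomputes pairwise distances instead of maintaining a distance matrix, and the `linkage` parameter anticipates the single-/complete-link definitions discussed later:

```python
def agglomerative(points, dist, linkage=min, k=1):
    """Naive agglomerative clustering: repeatedly merge the two closest
    clusters until k clusters remain. `linkage` aggregates the pairwise
    point distances (min = single-link, max = complete-link)."""
    clusters = [[p] for p in points]          # step 2: every point is a cluster
    while len(clusters) > k:                  # steps 3-6
        # find the two closest clusters under the chosen linkage
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # merge the closest pair
        del clusters[j]
    return clusters

euclid = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
pts = [(0, 0), (0, 1), (5, 5), (5, 6)]
print(agglomerative(pts, euclid, linkage=min, k=2))
# [[(0, 0), (0, 1)], [(5, 5), (5, 6)]]
```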
Input / Initial setting
• Start with clusters of individual points and a distance/proximity matrix
[Figure: points p1–p5 and their pairwise distance/proximity matrix]
Intermediate state
• After some merging steps, we have some clusters
[Figure: clusters C1–C5 and the cluster-level distance/proximity matrix]
Intermediate state
• Merge the two closest clusters (C2 and C5) and update the distance matrix
[Figure: clusters C1–C5 with C2 and C5 about to be merged; distance/proximity matrix]
After merging
• “How do we update the distance matrix?”
[Figure: clusters C1, C3, C4, and the merged C2 ∪ C5; matrix entries involving C2 ∪ C5 are marked ?]
Distance between two clusters
• Each cluster is a set of points
• How do we define the distance between two sets of points?
– Lots of alternatives
– Not an easy task
Distance between two clusters
• Single-link distance between clusters Ci and Cj: the minimum distance (or highest similarity) between any object in Ci and any object in Cj
• The distance is defined by the two most similar objects:

D_sl(Ci, Cj) = min{ d(x, y) : x ∈ Ci, y ∈ Cj }
Single-link clustering: example
• Determined by one pair of points, i.e., by one link in the proximity graph, using the similarity matrix below:

       I1    I2    I3    I4    I5
  I1  1.00  0.90  0.10  0.65  0.20
  I2  0.90  1.00  0.70  0.60  0.50
  I3  0.10  0.70  1.00  0.40  0.30
  I4  0.65  0.60  0.40  1.00  0.80
  I5  0.20  0.50  0.30  0.80  1.00
Strengths of single-link clustering
• Can handle non-elliptical shapes
[Figure: original points vs. the two clusters found]
Limitations of single-link clustering
• Sensitive to noise and outliers
• Produces long, elongated clusters
[Figure: original points vs. the two clusters found]
Distance between two clusters
• Complete-link distance between clusters Ci and Cj: the maximum distance (minimum similarity) between any object in Ci and any object in Cj
• The distance is defined by the two most dissimilar objects:

D_cl(Ci, Cj) = max{ d(x, y) : x ∈ Ci, y ∈ Cj }
Complete-link clustering: example
• The distance between clusters is determined by the two most distant points in the different clusters (same similarity matrix as in the single-link example)
Strengths of complete-link clustering
• More balanced clusters (with equal diameter)
• Less susceptible to noise
[Figure: original points vs. the two clusters found]
Limitations of complete-link clustering
• Tends to break large clusters
• All clusters tend to have the same diameter
– small clusters are merged with larger ones
[Figure: original points vs. the two clusters found]
Distance between two clusters
• Average-link distance between clusters Ci and Cj: the average distance between any object in Ci and any object in Cj

D_avg(Ci, Cj) = (1 / (|Ci| · |Cj|)) Σ_{x ∈ Ci, y ∈ Cj} d(x, y)
Average-link clustering: example
• The proximity of two clusters is the average of the pairwise proximities between points in the two clusters (same similarity matrix as in the single-link example)
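The three linkage definitions can be compared on the similarity matrix from the single-link example, converting similarity to dissimilarity as 1 − sim; the cluster choice {I1, I2} vs. {I4, I5} is an illustrative assumption, not taken from the slides:

```python
# Upper-triangle similarities from the example matrix (diagonal = 1.00).
sim = {
    ("I1", "I2"): 0.90, ("I1", "I3"): 0.10, ("I1", "I4"): 0.65, ("I1", "I5"): 0.20,
    ("I2", "I3"): 0.70, ("I2", "I4"): 0.60, ("I2", "I5"): 0.50,
    ("I3", "I4"): 0.40, ("I3", "I5"): 0.30, ("I4", "I5"): 0.80,
}
dist = lambda a, b: 1 - sim[tuple(sorted((a, b)))]   # dissimilarity = 1 - similarity

def linkage_distances(Ci, Cj):
    """Return (single-link, complete-link, average-link) distances."""
    ds = [dist(x, y) for x in Ci for y in Cj]
    return min(ds), max(ds), sum(ds) / len(ds)

single, complete, average = linkage_distances({"I1", "I2"}, {"I4", "I5"})
print(single, complete, average)  # single ≈ 0.35, complete ≈ 0.80, average ≈ 0.51
```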
Average-link clustering: discussion
• Compromise between single and complete link
• Strengths:
– Less susceptible to noise and outliers
• Limitations:
– Biased towards globular clusters
Distance between two clusters
• Ward’s distance between clusters Ci and Cj is the difference between:
– the within-cluster sum of squares resulting from merging the two clusters into cluster Cij, and
– the total within-cluster sum of squares of the two clusters taken separately
• ri: centroid (mean) of Ci
• rj: centroid (mean) of Cj
• rij: centroid (mean) of Cij

D_w(Ci, Cj) = Σ_{x ∈ Cij} (x − r_ij)² − ( Σ_{x ∈ Ci} (x − r_i)² + Σ_{x ∈ Cj} (x − r_j)² )
Ward’s distance for clusters
• Similar to average-link
• Less susceptible to noise and outliers
• Biased towards globular clusters
• Hierarchical analogue of k-means
– Can be used to initialize k-means
Hierarchical Clustering: Time and Space requirements
• For a dataset X consisting of n points
• O(n²) space; it requires storing the distance matrix
• O(n³) time in most cases
– There are n steps, and at each step the O(n²) distance matrix must be updated and searched
– Complexity can be reduced to O(n² log n) time by using appropriate data structures
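The cost analysis can be made concrete with a naive agglomerative sketch. This version (function name and defaults are mine, not from the slides) recomputes pairwise distances rather than maintaining the distance matrix, but the overall behavior is the same O(n³): roughly n merge steps, each scanning all cluster pairs.

```python
def agglomerative(points, k, linkage=min):
    """Naive agglomerative clustering of 1-D points into k clusters.
    linkage=min gives single link, linkage=max gives complete link."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # distance between two clusters under the chosen linkage
                d = linkage(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]  # merge the closest pair of clusters
        del clusters[j]
    return clusters

print(agglomerative([1.0, 1.5, 5.0, 5.2, 9.0], 2))  # [[1.0, 1.5, 5.0, 5.2], [9.0]]
```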
How good is a clustering?
• Cluster purity
– Each cluster is assigned to the class label that is most frequent in the cluster
– Purity is measured by counting the number of correctly assigned objects and dividing by the total number of objects

Purity = (5 + 4 + 3) / 17 = 12 / 17
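Purity as defined above is straightforward to compute. In the sketch below the per-cluster label lists are invented to match the slide's example counts (majorities of 5, 4 and 3 out of 17 objects):

```python
from collections import Counter

def purity(clusters):
    """clusters: one list of true class labels per cluster.
    Purity = (sum of each cluster's majority-class count) / total objects."""
    total = sum(len(c) for c in clusters)
    majority = sum(Counter(c).most_common(1)[0][1] for c in clusters)
    return majority / total

# Hypothetical labels matching the slide's counts: 5 + 4 + 3 correct of 17.
clusters = [["x"] * 5 + ["o"],        # 6 objects, majority class count 5
            ["o"] * 4 + ["x", "d"],   # 6 objects, majority class count 4
            ["d"] * 3 + ["x", "o"]]   # 5 objects, majority class count 3
print(purity(clusters))  # 12/17 ≈ 0.706
```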
How good is purity?
• Cluster purity – simple and intuitive
• Trade-off between number of clusters and purity?
– If each object belongs to its own cluster, then purity is 1.0
– Is that desirable?
• Purity cannot trade off the quality of the clusters against the number of clusters
• Alternatives:
– Normalized Mutual Information
– Rand Index
![Page 63: Clustering - people.dsv.su.sepeople.dsv.su.se/~panagiotis/DAMI2014/lect3.pdfClustering Panagio’s)Papapetrou,)PhD) Associate)Professor,)StockholmUniversity) AdjunctProfessor,)Aalto)University!](https://reader034.vdocuments.us/reader034/viewer/2022042914/5f4fcac6fac1a85cbc5833d5/html5/thumbnails/63.jpg)
What if we don't have class labels?
• a(i): average dissimilarity (distance) of i to all other data within the same cluster
• b(i): the lowest average dissimilarity (distance) of i to any other cluster of which i is not a member
• The Silhouette of object i:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$$

• Values between -1 and 1
– if b(i) >> a(i) then s(i) ≈ 1
– if b(i) << a(i) then s(i) ≈ -1
• Total Silhouette: average silhouette value of all objects
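The definitions of a(i) and b(i) translate directly into code. A minimal sketch (function names are mine), assuming a cluster label per object and a pairwise distance function:

```python
def silhouette(i, labels, dist):
    """s(i) = (b(i) - a(i)) / max(a(i), b(i)) for object i."""
    n = len(labels)
    # a(i): average distance of i to the other members of its own cluster
    own = [j for j in range(n) if labels[j] == labels[i] and j != i]
    a = sum(dist(i, j) for j in own) / len(own)
    # b(i): lowest average distance of i to the members of any other cluster
    b = float("inf")
    for c in set(labels) - {labels[i]}:
        members = [j for j in range(n) if labels[j] == c]
        b = min(b, sum(dist(i, j) for j in members) / len(members))
    return (b - a) / max(a, b)

points = [0.0, 0.2, 5.0, 5.1]
labels = [0, 0, 1, 1]
d = lambda i, j: abs(points[i] - points[j])
print(silhouette(0, labels, d))  # ≈ 0.96, a well-clustered point
```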
Correct number of clusters
• The silhouette can be used to identify the correct number of clusters
![Page 65: Clustering - people.dsv.su.sepeople.dsv.su.se/~panagiotis/DAMI2014/lect3.pdfClustering Panagio’s)Papapetrou,)PhD) Associate)Professor,)StockholmUniversity) AdjunctProfessor,)Aalto)University!](https://reader034.vdocuments.us/reader034/viewer/2022042914/5f4fcac6fac1a85cbc5833d5/html5/thumbnails/65.jpg)
Clustering aggregation
• Many different clusterings for the same dataset!
– Different objective functions
– Different algorithms
– Different numbers of clusters
• Which clustering is the best?
– Aggregation: we do not need to decide, but rather find a reconciliation between the different outputs
The clustering-aggregation problem
• Input
– n objects X = {x1,x2,…,xn}
– m clusterings of the objects {C1,…,Cm}
– clustering: a collection of disjoint sets that cover X
• Output
– a single clustering C that is as close as possible to all input clusterings
• How do we measure closeness of clusterings?
– disagreement distance
Disagreement distance
• For two partitions C and P, and objects x, y in X, define:

$$I_{C,P}(x,y) = \begin{cases} 1 & \text{if } \big(C(x)=C(y) \text{ and } P(x)\neq P(y)\big) \text{ or } \big(C(x)\neq C(y) \text{ and } P(x)=P(y)\big) \\ 0 & \text{otherwise} \end{cases}$$

• if IC,P(x,y) = 1 we say that x, y create a disagreement between clusterings C and P
• The disagreement distance between C and P is:

$$D(C,P) = \sum_{(x,y)} I_{C,P}(x,y)$$

| U | C | P |
|----|---|---|
| x1 | 1 | 1 |
| x2 | 1 | 2 |
| x3 | 2 | 1 |
| x4 | 3 | 3 |
| x5 | 3 | 4 |

D(C,P) = 3
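The definition and the example above can be checked mechanically. A minimal sketch, with each clustering given as a dict from object to cluster id (a representation I chose for convenience, not from the slides):

```python
def disagreement(C, P):
    """Number of object pairs grouped together in exactly one of the
    two partitions C and P (dicts: object -> cluster id)."""
    objs = list(C)
    count = 0
    for a in range(len(objs)):
        for b in range(a + 1, len(objs)):
            x, y = objs[a], objs[b]
            if (C[x] == C[y]) != (P[x] == P[y]):
                count += 1  # x, y create a disagreement
    return count

# The table from the slide:
C = {"x1": 1, "x2": 1, "x3": 2, "x4": 3, "x5": 3}
P = {"x1": 1, "x2": 2, "x3": 1, "x4": 3, "x5": 4}
print(disagreement(C, P))  # 3
```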
![Page 68: Clustering - people.dsv.su.sepeople.dsv.su.se/~panagiotis/DAMI2014/lect3.pdfClustering Panagio’s)Papapetrou,)PhD) Associate)Professor,)StockholmUniversity) AdjunctProfessor,)Aalto)University!](https://reader034.vdocuments.us/reader034/viewer/2022042914/5f4fcac6fac1a85cbc5833d5/html5/thumbnails/68.jpg)
Clustering aggregation
• Given m clusterings C1,…,Cm, find the C that minimizes the aggregation cost:

$$D(C) = \sum_{i=1}^{m} D(C, C_i)$$

| U | C1 | C2 | C3 | C |
|----|----|----|----|---|
| x1 | 1 | 1 | 1 | 1 |
| x2 | 1 | 2 | 2 | 2 |
| x3 | 2 | 1 | 1 | 1 |
| x4 | 2 | 2 | 2 | 2 |
| x5 | 3 | 3 | 3 | 3 |
| x6 | 3 | 4 | 3 | 3 |
![Page 69: Clustering - people.dsv.su.sepeople.dsv.su.se/~panagiotis/DAMI2014/lect3.pdfClustering Panagio’s)Papapetrou,)PhD) Associate)Professor,)StockholmUniversity) AdjunctProfessor,)Aalto)University!](https://reader034.vdocuments.us/reader034/viewer/2022042914/5f4fcac6fac1a85cbc5833d5/html5/thumbnails/69.jpg)
Why clustering aggregation?
• Clustering categorical data

| U | City | Profession | Nationality |
|----|-------------|---------|--------|
| x1 | New York | Doctor | U.S. |
| x2 | New York | Teacher | Canada |
| x3 | Boston | Doctor | U.S. |
| x4 | Boston | Teacher | Canada |
| x5 | Los Angeles | Lawyer | Mexican |
| x6 | Los Angeles | Actor | Mexican |
![Page 70: Clustering - people.dsv.su.sepeople.dsv.su.se/~panagiotis/DAMI2014/lect3.pdfClustering Panagio’s)Papapetrou,)PhD) Associate)Professor,)StockholmUniversity) AdjunctProfessor,)Aalto)University!](https://reader034.vdocuments.us/reader034/viewer/2022042914/5f4fcac6fac1a85cbc5833d5/html5/thumbnails/70.jpg)
Why clustering aggregation?
• Detect outliers
– outliers are defined as points for which there is no consensus among the clusterings
• Improve the robustness of clustering algorithms
– different algorithms have different weaknesses
– combining them can produce a better result
• Privacy-preserving clustering
– different companies have data for the same users
– compute an aggregate clustering without sharing the actual data
Complexity of Clustering Aggregation
• The clustering aggregation problem is NP-hard
– the median partition problem [Barthelemy and LeClerc 1995]
• Look for heuristics and approximate solutions: an algorithm ALG is a c-approximation if, on every instance I,

ALG(I) ≤ c · OPT(I)
A simple 2-approximation algorithm
• The disagreement distance D(C,P) is a metric
– What does that mean? We will revisit this in a few lectures…
• The algorithm BEST:
– select among the input clusterings the clustering C* that minimizes D(C*)
– a 2-approximate solution
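BEST is easy to sketch: evaluate each input clustering against all the others and keep the one with the smallest total disagreement. Clusterings are dicts from object to cluster id (my representation, not the slides'); the example reuses the aggregation table from a few slides back, where BEST returns C3, which happens to coincide with the slide's aggregate clustering C:

```python
def disagreement(C, P):
    """Pairs grouped together in exactly one of the two partitions."""
    objs = list(C)
    return sum(1
               for a in range(len(objs))
               for b in range(a + 1, len(objs))
               if (C[objs[a]] == C[objs[b]]) != (P[objs[a]] == P[objs[b]]))

def best(clusterings):
    """BEST: the input clustering with minimum total disagreement against
    all input clusterings (a 2-approximation, since the disagreement
    distance is a metric)."""
    return min(clusterings,
               key=lambda c: sum(disagreement(c, p) for p in clusterings))

C1 = {"x1": 1, "x2": 1, "x3": 2, "x4": 2, "x5": 3, "x6": 3}
C2 = {"x1": 1, "x2": 2, "x3": 1, "x4": 2, "x5": 3, "x6": 4}
C3 = {"x1": 1, "x2": 2, "x3": 1, "x4": 2, "x5": 3, "x6": 3}
print(best([C1, C2, C3]) == C3)  # True
```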
TODOs
• Assignment 1:
– Due next week [Nov. 17 at 11:00am]
– Discussion forum on this assignment now available on iLearn2
• Exercise Session 1:
– Nov. 17 at 11:00am