Dm13 Clustering
TRANSCRIPT
Outline
Introduction
K-means clustering
Hierarchical clustering: COBWEB
Classification vs. Clustering
Classification (supervised learning):
Learns a method for predicting the instance class from pre-labeled (classified) instances
Clustering
Unsupervised learning:
Finds “natural” grouping of instances given un-labeled data
Clustering Methods
Many different methods and algorithms:
For numeric and/or symbolic data
Deterministic vs. probabilistic
Exclusive vs. overlapping
Hierarchical vs. flat
Top-down vs. bottom-up
Clusters: exclusive vs. overlapping
[Figure: simple 2-D representation of non-overlapping clusters vs. a Venn diagram of overlapping clusters, over items a–k]
Clustering Evaluation
Manual inspection
Benchmarking on existing labels
Cluster quality measures, distance measures:
high similarity within a cluster, low across clusters (see the sketch below)
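As a rough illustration of the "high similarity within a cluster, low across clusters" criterion, here is a minimal sketch assuming NumPy; the helper name within_between and the use of plain Euclidean distance are illustrative choices, not something from the slides.

```python
import numpy as np

def within_between(X, labels):
    """Average pairwise Euclidean distance inside clusters vs. across clusters.
    A good clustering should give a small 'within' value and a large 'between' value.
    X: (n, d) array of numeric instances; labels: length-n cluster assignment."""
    within, between = [], []
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            d = np.linalg.norm(X[i] - X[j])
            (within if labels[i] == labels[j] else between).append(d)
    return float(np.mean(within)), float(np.mean(between))
```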
The distance function
Simplest case: one numeric attribute A
Distance(X,Y) = A(X) – A(Y)
Several numeric attributes: Distance(X,Y) = Euclidean distance between X and Y
Nominal attributes: distance is set to 1 if values are different, 0 if they are equal
Are all attributes equally important?
Weighting the attributes might be necessary (see the sketch below)
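A minimal sketch of such a distance function, handling numeric and nominal attributes plus optional attribute weights. The function name and interface are illustrative only, not from the slides.

```python
import math

def distance(x, y, numeric, weights=None):
    """Distance between two instances x and y (sequences of attribute values).
    numeric[i] is True if attribute i is numeric (use the difference),
    False if nominal (0 if equal, 1 if different).
    weights lets some attributes count more than others."""
    weights = weights or [1.0] * len(x)
    total = 0.0
    for xi, yi, is_num, w in zip(x, y, numeric, weights):
        d = (xi - yi) if is_num else (0.0 if xi == yi else 1.0)
        total += w * d * d
    return math.sqrt(total)

# e.g. distance((70.0, "Sunny"), (65.0, "Rainy"), numeric=(True, False))
```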
Simple Clustering: K-means
Works with numeric data only
1) Pick a number (K) of cluster centers (at random)
2) Assign every item to its nearest cluster center (e.g. using Euclidean distance)
3) Move each cluster center to the mean of the items assigned to it
Repeat steps 2 and 3 until convergence
(change in cluster assignments less than a threshold; see the sketch below)
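A minimal NumPy sketch of these steps: random initial centers, nearest-center assignment, mean update, repeated until the assignments stop changing. The function name kmeans and the exact convergence test are simplifications, not the slides' own code.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Basic K-means on an (n, d) numeric array X."""
    rng = np.random.default_rng(seed)
    # 1) pick K cluster centers at random (here: K distinct data points)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # 2) assign every item to its nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # converged: assignments unchanged
            break
        labels = new_labels
        # 3) move each cluster center to the mean of its cluster
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels, centers
```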
K-means example, step 1
[Figure: pick 3 initial cluster centers k1, k2, k3 at random in the X–Y plane]
K-means example, step 2
[Figure: assign each point to the closest cluster center]
K-means example, step 3
[Figure: move each cluster center to the mean of each cluster]
K-means example, step 4b
[Figure: recompute cluster means for k1, k2, k3]
K-means example, step 5
[Figure: move cluster centers to cluster means]
Discussion, 1
What can be the problems with K-means clustering?
Discussion, 2
Result can vary significantly depending on the initial choice of seeds (number and position)
Can get trapped in a local minimum
Example:
[Figure: instances and initial cluster centers]
Q: What can be done?
Discussion, 3
A: To increase the chance of finding the global optimum: restart with different random seeds (see the sketch below).
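A sketch of that restart idea, reusing the kmeans function sketched earlier and keeping the run with the lowest within-cluster sum of squared distances; the scoring choice is an assumption, not from the slides. Library implementations do the same thing, e.g. scikit-learn's KMeans has an n_init parameter for the number of restarts.

```python
import numpy as np

def kmeans_restarts(X, k, n_restarts=10):
    """Run K-means from several random seeds; keep the best run."""
    best = None
    for seed in range(n_restarts):
        labels, centers = kmeans(X, k, seed=seed)   # kmeans sketch from above
        inertia = sum(np.sum((X[labels == c] - centers[c]) ** 2) for c in range(k))
        if best is None or inertia < best[0]:
            best = (inertia, labels, centers)
    return best[1], best[2]
```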
K-means clustering summary
Advantages:
Simple, understandable
Items automatically assigned to clusters
Disadvantages:
Must pick number of clusters beforehand
All items forced into a cluster
Too sensitive to outliers
K-means clustering – outliers?
What can be done about outliers?
K-means variations
K-medoids – instead of the mean, use medians of each cluster (see the sketch below)
Mean of 1, …
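One way to sketch the "use a representative point instead of the mean" idea is an update step that picks the medoid, i.e. the cluster member with the smallest total distance to the other members. This is only an illustrative fragment under the assumption that no cluster is empty, not a full K-medoids algorithm such as PAM.

```python
import numpy as np

def medoid_update(X, labels, k):
    """Replace the K-means mean-update: for each cluster pick the actual
    member whose total distance to the other members is smallest."""
    centers = []
    for c in range(k):
        members = X[labels == c]
        pairwise = np.linalg.norm(members[:, None, :] - members[None, :, :], axis=2)
        centers.append(members[pairwise.sum(axis=1).argmin()])
    return np.array(centers)
```

Because such centers are always real data points, a single extreme outlier pulls them around far less than it pulls a mean.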
Hierarchical clustering
Bottom up:
Start with single-instance clusters
At each step, join the two closest clusters
Design decision: distance between clusters
E.g. two closest instances in the clusters vs. distance between means
Top down:
Start with one universal cluster
Find two clusters
Proceed recursively on each subset
Can be very fast
Both methods produce a dendrogram (see the sketch below)
[Figure: dendrogram over instances g a c i e d k b j f h]
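A minimal bottom-up (agglomerative) sketch assuming SciPy; the toy data and the leaf names standing in for instances a..k are invented for illustration. The "single" linkage corresponds to the two-closest-instances rule, "centroid" to the distance between cluster means.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# toy 2-D stand-ins for instances a..k (values invented for illustration)
X = np.random.default_rng(0).normal(size=(11, 2))
leaf_names = list("abcdefghijk")

# bottom up: start from single-instance clusters, repeatedly join the two closest
Z = linkage(X, method="single")   # "single" = two closest instances; "centroid" = cluster means
dendrogram(Z, labels=leaf_names)  # draws the dendrogram (requires matplotlib)
```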
Incremental clustering
Heuristic approach (COBWEB/CLASSIT)
Forms a hierarchy of clusters incrementally
Start:
tree consists of an empty root node
Then:
add instances one by one
update the tree appropriately at each stage
to update, find the right leaf for an instance
May involve restructuring the tree
Base update decisions on category utility
Clustering weather data

ID  Outlook   Temp.  Humidity  Windy
A   Sunny     Hot    High      False
B   Sunny     Hot    High      True
C   Overcast  Hot    High      False
D   Rainy     Mild   High      False
E   Rainy     Cool   Normal    False
F   Rainy     Cool   Normal    True
G   Overcast  Cool   Normal    True
H   Sunny     Mild   High      False
I   Sunny     Cool   Normal    False
J   Rainy     Mild   Normal    False
K   Sunny     Mild   Normal    True
L   Overcast  Mild   High      True
M   Overcast  Hot    Normal    False
N   Rainy     Mild   High      True

[Figure: COBWEB tree after adding the first instances – stages 1, 2, 3]
Clustering weather data (continued)
(Same 14-instance weather table as above.)
[Figure: COBWEB tree – stages 3 to 5]
Merge best host and runner-up
Consider splitting the best host if merging doesn't help
Final hierarchy

ID  Outlook   Temp.  Humidity  Windy
A   Sunny     Hot    High      False
B   Sunny     Hot    High      True
C   Overcast  Hot    High      False
D   Rainy     Mild   High      False

Oops: a and b are actually very similar
Example: the iris data (subset)
Clustering with cutoff
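A cutoff turns the dendrogram into flat clusters: cut the tree at a chosen distance, and every subtree whose merges all happen below the cut becomes one cluster. A sketch assuming SciPy, reusing the linkage matrix Z from the hierarchical-clustering sketch above; the threshold 1.0 is arbitrary.

```python
from scipy.cluster.hierarchy import fcluster

# every group of leaves fully merged below the cutoff becomes one flat cluster
flat_labels = fcluster(Z, t=1.0, criterion="distance")
```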
Category utility
Category utility: a quadratic loss function defined on conditional probabilities:

CU(C_1, C_2, ..., C_k) = ( Σ_l Pr[C_l] Σ_i Σ_j ( Pr[a_i = v_ij | C_l]^2 − Pr[a_i = v_ij]^2 ) ) / k

Every instance in a different category: the numerator becomes (its maximum)

n − Σ_i Σ_j Pr[a_i = v_ij]^2,   where n is the number of attributes
Overfitting-avoidance heuristic
If every instance gets put into a different category, the numerator becomes (maximal):

n − Σ_i Σ_j Pr[a_i = v_ij]^2   (maximum value of CU)

where n is the number of attributes.
So without k in the denominator of the CU formula, every cluster would consist of one instance (see the sketch below).
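A small sketch that computes category utility directly from its definition for nominal data; the representation of clusters as lists of attribute-value tuples is an assumption made here for illustration.

```python
from collections import Counter

def category_utility(clusters):
    """CU = (1/k) * sum_l Pr[C_l] * sum_i sum_j
             (Pr[a_i = v_ij | C_l]^2 - Pr[a_i = v_ij]^2)
    clusters: list of clusters, each a list of instances,
    each instance a tuple of nominal attribute values."""
    instances = [x for cl in clusters for x in cl]
    n, k, m = len(instances), len(clusters), len(instances[0])
    # unconditional term: sum_i sum_j Pr[a_i = v_ij]^2
    base = sum((cnt / n) ** 2
               for i in range(m)
               for cnt in Counter(x[i] for x in instances).values())
    total = 0.0
    for cl in clusters:
        cond = sum((cnt / len(cl)) ** 2
                   for i in range(m)
                   for cnt in Counter(x[i] for x in cl).values())
        total += (len(cl) / n) * (cond - base)
    return total / k

# e.g. two clusters of weather instances:
# category_utility([[("Sunny", "Hot", "High", "False"), ("Sunny", "Hot", "High", "True")],
#                   [("Rainy", "Cool", "Normal", "False")]])
```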
Other Clustering Approaches
EM – probability-based clustering
Bayesian clustering
SOM – self-organizing maps
…
Discussion
Can interpret clusters by using supervised learning:
learn a classifier based on the clusters (see the sketch after this list)
Decrease dependence between attributes?
pre-processing step
E.g. use principal component analysis
Can be used to fill in missing values
Key advantage of probabilistic clustering:
Can estimate likelihood of data
Use it to compare different models objectively
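A sketch of the first two points together, assuming scikit-learn: principal component analysis as a pre-processing step, K-means on the reduced data, then a small decision tree trained to predict the cluster labels so each cluster can be read off as rules over the original attributes. The iris data is used because the slides show it; the concrete parameter values are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_reduced = PCA(n_components=2).fit_transform(iris.data)            # pre-processing step
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_reduced)

# interpret the clusters: learn a classifier that predicts the cluster label
tree = DecisionTreeClassifier(max_depth=2).fit(iris.data, clusters)
print(export_text(tree, feature_names=list(iris.feature_names)))
```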
Examples of Clustering Applications
Marketing: discover customer groups and use them for targeted marketing and re-organization
Astronomy: find groups of similar stars and galaxies
Earthquake studies: observed earthquake epicenters should be clustered along continent faults
Genomics: finding groups of genes with similar expressions
…
Clustering Summary
Unsupervised
Many approaches
K-means – simple, sometimes useful; K-medoids is less sensitive to outliers
Hierarchical clustering – works for symbolic attributes
Evaluation is a problem