Lesson 9: Clustering
TRANSCRIPT
Chapter 3: Cluster Analysis
3.1 Basic Concepts of Clustering
3.2 Partitioning Methods
3.3 Hierarchical Methods
3.3.1 The Principle
3.3.2 Agglomerative and Divisive Clustering
3.3.3 BIRCH
3.3.4 ROCK
3.4 Density-based Methods
3.4.1 The Principle
3.4.2 DBSCAN
3.4.3 OPTICS
3.5 Clustering High-Dimensional Data
3.6 Outlier Analysis
3.3.1 The Principle
Group data objects into a tree of clusters
Hierarchical methods can be
Agglomerative: bottom-up approach
Divisive: top-down approach
Hierarchical clustering has no backtracking
If a particular merge or split turns out to be a poor choice, it cannot be corrected
3.3.2 Agglomerative and Divisive
Agglomerative Hierarchical Clustering
Bottom-up strategy
Each cluster starts with only one object
Clusters are merged into larger and larger clusters until:
All the objects are in a single cluster
Certain termination conditions are satisfied
Divisive Hierarchical Clustering
Top-down strategy
Start with all objects in one cluster
Clusters are subdivided into smaller and smaller clusters until:
Each object forms a cluster on its own
Certain termination conditions are satisfied
Example
Agglomerative and divisive algorithms on a data set of five objects {a, b, c, d, e}
[Figure: dendrogram over Steps 0-4. AGNES (agglomerative) merges {a}, {b}, {c}, {d}, {e} into {a, b}, {d, e}, {c, d, e}, and finally {a, b, c, d, e}; DIANA (divisive) performs the same splits in the reverse order.]
Example
AGNES
Clusters C1 and C2 may be merged if an object in C1 and an object in C2 form the minimum Euclidean distance between any two objects from different clusters
DIANA
A cluster is split according to some principle, e.g., the maximum Euclidean distance between the closest neighboring objects in the cluster
[Figure: the same AGNES/DIANA dendrogram over {a, b, c, d, e} as on the previous slide]
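As a minimal sketch of AGNES-style (agglomerative, single-linkage) clustering, the example below uses SciPy; the five 2-D points and the parameter choices are assumptions made only for illustration, not part of the original slides.

```python
# Agglomerative (AGNES-style) clustering of five objects with single (minimum-distance) linkage.
# The coordinates below are invented purely for illustration.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1.0, 1.0],   # a
                   [1.2, 1.1],   # b
                   [5.0, 5.0],   # c
                   [5.1, 5.2],   # d
                   [5.2, 4.9]])  # e

# Each merge step records: (cluster i, cluster j, merge distance, new cluster size)
merges = linkage(points, method='single', metric='euclidean')
print(merges)

# Cut the dendrogram into two flat clusters, e.g. {a, b} and {c, d, e}
labels = fcluster(merges, t=2, criterion='maxclust')
print(labels)
```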
Distance Between Clusters
First measure: Minimum distance
$d_{min}(C_i, C_j) = \min_{p \in C_i,\, p' \in C_j} |p - p'|$
|p - p'| is the distance between two objects p and p'
Use cases
An algorithm that uses the minimum distance to measure the distance between clusters is sometimes called a nearest-neighbor clustering algorithm
If the clustering process terminates when the minimum distance between nearest clusters exceeds an arbitrary threshold, it is called a single-linkage algorithm
An agglomerative algorithm that uses the minimum distance measure is also called a minimal spanning tree algorithm
Distance Between Clusters
Second measure: Maximum distance
$d_{max}(C_i, C_j) = \max_{p \in C_i,\, p' \in C_j} |p - p'|$
|p - p'| is the distance between two objects p and p'
Use cases
An algorithm that uses the maximum distance to measure the distance between clusters is sometimes called a farthest-neighbor clustering algorithm
If the clustering process terminates when the maximum distance between nearest clusters exceeds an arbitrary threshold, it is called a complete-linkage algorithm
Distance Between Clusters
Minimum and maximum distances are extreme, implying that they are overly sensitive to outliers or noisy data
Third measure: Mean distance
$d_{mean}(C_i, C_j) = |m_i - m_j|$
m_i and m_j are the means of clusters C_i and C_j respectively
Fourth measure: Average distance
$d_{avg}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{p' \in C_j} |p - p'|$
|p - p'| is the distance between two objects p and p'
n_i and n_j are the numbers of objects in clusters C_i and C_j respectively
The mean is difficult to compute for categorical data
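A small sketch of the four inter-cluster distance measures above, written with NumPy; the two example clusters are invented for illustration.

```python
# The four inter-cluster distance measures for two clusters of numerical points.
import numpy as np

def pairwise_distances(Ci, Cj):
    # |p - p'| for every p in Ci and p' in Cj (Euclidean distance)
    return np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=2)

def d_min(Ci, Cj):  return pairwise_distances(Ci, Cj).min()     # single linkage
def d_max(Ci, Cj):  return pairwise_distances(Ci, Cj).max()     # complete linkage
def d_mean(Ci, Cj): return np.linalg.norm(Ci.mean(axis=0) - Cj.mean(axis=0))
def d_avg(Ci, Cj):  return pairwise_distances(Ci, Cj).mean()

Ci = np.array([[0.0, 0.0], [1.0, 0.0]])   # example cluster C_i
Cj = np.array([[4.0, 3.0], [5.0, 3.0]])   # example cluster C_j
print(d_min(Ci, Cj), d_max(Ci, Cj), d_mean(Ci, Cj), d_avg(Ci, Cj))
```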
Challenges & Solutions
It is difficult to select merge or split points
No backtracking
Hierarchical clustering does not scale well: it examines a large number of objects before any decision to split or merge
One promising direction to solve these problems is to combine hierarchical clustering with other clustering techniques: multiple-phase clustering
3.3.3 BIRCH
BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies
Agglomerative clustering designed for clustering a large amount of numerical data
What does the BIRCH algorithm try to solve?
Most of the existing algorithms DO NOT consider the case that datasets can be too large to fit in main memory
They DO NOT concentrate on minimizing the number of scans of the dataset
I/O costs are very high
The complexity of BIRCH is O(n), where n is the number of objects to be clustered.
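For orientation, a minimal sketch using scikit-learn's Birch implementation; the synthetic data and the parameter values are assumptions chosen only to illustrate the threshold and branching-factor knobs discussed in the following slides.

```python
# Minimal BIRCH run with scikit-learn; threshold and branching_factor map to
# the CF-tree parameters T and B discussed below.
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=0.0, scale=0.3, size=(100, 2)),
               rng.normal(loc=3.0, scale=0.3, size=(100, 2))])

model = Birch(threshold=0.5, branching_factor=50, n_clusters=2)
labels = model.fit_predict(X)
print(labels[:10], labels[-10:])
```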
BIRCH: The Idea by Example
Data objects 1 to 6 are inserted one at a time while the clustering process builds a tree
[Figure: object 1 starts Cluster 1 in a leaf node; object 2 arrives]
If Cluster 1 becomes too large (not compact) by adding object 2, then split the cluster
BIRCH: The Idea by Example
[Figure: leaf node with two entries, entry 1 for Cluster 1 and entry 2 for Cluster 2]
BIRCH: The Idea by Example
entry 1 is the closest to object 3
If Cluster 1 becomes too large by adding object 3, then split the cluster
BIRCH: The Idea by Example
[Figure: leaf node with three entries, entry 1, entry 2, and entry 3, for Cluster 1, Cluster 2, and Cluster 3]
BIRCH: The Idea by Example
entry 3 is the closest to object 4
Cluster 2 remains compact when adding object 4, so add object 4 to Cluster 2
BIRCH: The Idea by Example
entry 2 is the closest to object 5
Cluster 3 becomes too large by adding object 5, so split Cluster 3?
BUT there is a limit to the number of entries a node can have; thus, split the node
BIRCH: The Idea by Example
[Figure: the leaf node is split; a non-leaf node with entry 1 and entry 2 now points to two leaf nodes with entries 1.1, 1.2, 2.1, and 2.2, covering Clusters 1 to 4]
BIRCH: The Idea by Example
entry 1.2 is the closest to object 6
Cluster 3 remains compact when adding object 6, so add object 6 to Cluster 3
BIRCH: Key Components
Clustering Feature (CF)
Summary of the statistics for a given cluster: the 0th, 1st, and 2nd moments of the cluster from the statistical point of view
Used to compute centroids, and to measure the compactness and distance of clusters
CF-Tree
A height-balanced tree
Two parameters:
The number of entries in each node
The diameter of all entries in a leaf node
Leaf nodes are connected via prev and next pointers
Clustering Feature
Clustering Feature (CF): CF = (N, LS, SS)
N: number of data points
LS: linear sum of the N points: $\sum_{i=1}^{N} X_i$
SS: square sum of the N points: $\sum_{i=1}^{N} X_i^2$
Example
Cluster 1: (2,5), (3,2), (4,3)
CF1 = <3, (2+3+4, 5+2+3), (2^2+3^2+4^2, 5^2+2^2+3^2)> = <3, (9,10), (29,38)>
Cluster 2: CF2 = <3, (35,36), (417,440)>
Cluster 3 = Cluster 1 + Cluster 2:
CF3 = CF1 + CF2 = <3+3, (9+35, 10+36), (29+417, 38+440)> = <6, (44,46), (446,478)>
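A minimal sketch of the CF triple and its additivity, mirroring the numbers above; the helper names are invented for this sketch and are not part of BIRCH itself.

```python
# Clustering Feature CF = (N, LS, SS) and the additivity used to merge sub-clusters.
import numpy as np

def clustering_feature(points):
    points = np.asarray(points, dtype=float)
    N = len(points)
    LS = points.sum(axis=0)          # linear sum of the points
    SS = (points ** 2).sum(axis=0)   # square sum of the points (per dimension)
    return N, LS, SS

def merge_cf(cf_a, cf_b):
    # Additivity: the CF of the union is the component-wise sum of the two CFs.
    return cf_a[0] + cf_b[0], cf_a[1] + cf_b[1], cf_a[2] + cf_b[2]

cf1 = clustering_feature([(2, 5), (3, 2), (4, 3)])   # -> (3, [9, 10], [29, 38])
cf2 = (3, np.array([35.0, 36.0]), np.array([417.0, 440.0]))
print(merge_cf(cf1, cf2))                            # -> (6, [44, 46], [446, 478])
```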
Properties of Clustering Feature
A CF entry is a summary of the statistics of the cluster
A representation of the cluster
A CF entry has sufficient information to calculate the centroid, radius, diameter, and many other distance measures
The additivity theorem allows us to merge sub-clusters incrementally
Distance Measures
Given a cluster with n data points x_i:
Centroid: $x_0 = \frac{\sum_{i=1}^{n} x_i}{n}$
Radius: average distance from any point of the cluster to its centroid
$R = \sqrt{\frac{\sum_{i=1}^{n} (x_i - x_0)^2}{n}}$
Diameter: square root of the average mean squared distance between all pairs of points in the cluster
$D = \sqrt{\frac{\sum_{i=1}^{n} \sum_{j=1}^{n} (x_i - x_j)^2}{n(n-1)}}$
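As a sketch of the earlier point that a CF entry suffices to compute these measures, the centroid, radius, and diameter below are derived from (N, LS, SS) alone; the algebraic rearrangement is mine (standard sum-of-squares identities), with the diameter normalized by n(n-1) as in the pairwise-average definition above.

```python
# Centroid, radius, and diameter computed directly from a CF = (N, LS, SS),
# which is why BIRCH only needs to keep CF triples in the tree.
import numpy as np

def centroid_radius_diameter(N, LS, SS):
    ss_total = SS.sum()                      # sum of squared norms of the points
    centroid = LS / N
    radius = np.sqrt(ss_total / N - np.dot(centroid, centroid))
    diameter = np.sqrt((2 * N * ss_total - 2 * np.dot(LS, LS)) / (N * (N - 1)))
    return centroid, radius, diameter

points = np.array([[2.0, 5.0], [3.0, 2.0], [4.0, 3.0]])
N, LS, SS = len(points), points.sum(axis=0), (points ** 2).sum(axis=0)
print(centroid_radius_diameter(N, LS, SS))
```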
CF Tree
B = branching factor: maximum number of children in a non-leaf node
T = threshold for the diameter or radius of the clusters in a leaf
L = maximum number of entries in a leaf node
A CF entry in a parent node = the sum of the CF entries in the child node of that entry
In-memory, height-balanced tree
[Figure: root level with entries CF1, CF2, ..., CFk, each pointing to a first-level node with its own CF entries]
CF Tree Insertion
Start with the root
Find the CF entry in the root closest to the data point, move to that child, and repeat the process until the closest leaf entry is found.
At the leaf
If the point can be accommodated in the cluster, update the entry
If this addition violates the threshold T, split the entry; if this violates the limit imposed by L, split the leaf. If its parent node is full, split that, and so on
Update the CF entries from the leaf to the root to accommodate this point
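A highly simplified sketch of the descent-and-update logic described above; node splits on overflow (the B and L limits) are omitted, and the class and method names are invented for illustration rather than BIRCH's actual data structures.

```python
# Simplified CF-tree insertion: descend to the closest entry, then either
# absorb the point (if the threshold T still holds) or start a new leaf entry.
import numpy as np

class CFEntry:
    def __init__(self, point):
        p = np.asarray(point, dtype=float)
        self.N, self.LS, self.SS = 1, p.copy(), p ** 2
        self.child = None                       # None for leaf entries

    def centroid(self):
        return self.LS / self.N

    def radius_if_added(self, p):
        N, LS, SS = self.N + 1, self.LS + p, self.SS + p ** 2
        c = LS / N
        return np.sqrt(max(SS.sum() / N - np.dot(c, c), 0.0))

    def absorb(self, p):
        self.N += 1; self.LS += p; self.SS += p ** 2

def insert(entries, point, T):
    p = np.asarray(point, dtype=float)
    closest = min(entries, key=lambda e: np.linalg.norm(e.centroid() - p))
    if closest.child is not None:               # non-leaf: recurse, then update its CF
        insert(closest.child, point, T)
        closest.absorb(p)
    elif closest.radius_if_added(p) <= T:       # leaf entry: absorb if still compact
        closest.absorb(p)
    else:                                       # otherwise open a new leaf entry
        entries.append(CFEntry(p))
```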
BIRCH Algorithm
Phase 1: Load the data into memory by building a CF tree (Data -> initial CF tree)
Phase 2 (optional): Condense the tree into a desirable range by building a smaller CF tree (initial CF tree -> smaller CF tree)
Phase 3: Global clustering (smaller CF tree -> good clusters)
Phase 4 (optional and offline): Cluster refining (good clusters -> better clusters)
BIRCH Algorithm: Phase 1
Choose an initial value for the threshold, and start inserting the data points one by one into the tree as per the insertion algorithm
If, in the middle of the above step, the size of the CF tree exceeds the size of the available memory, increase the value of the threshold
Convert the partially built tree into a new tree
Repeat the above steps until the entire dataset is scanned and a full tree is built
Outlier handling
BIRCH Algorithm: Phases 2, 3, and 4
Phase 2
A bridge between Phase 1 and Phase 3
Builds a smaller CF tree by increasing the threshold
Phase 3
Apply a global clustering algorithm to the sub-clusters given by the leaf entries of the CF tree
Improves clustering quality
Phase 4
Scan the entire dataset to label the data points
Outlier handling
3.3.4 ROCK: For Categorical Data
Experiments show that distance functions do not lead to high-quality clusters when clustering categorical data
Most clustering techniques assess the similarity between points to create clusters
At each step, points that are similar are merged into a single cluster
This localized approach is prone to errors
ROCK: uses links instead of distances
Example: Compute the Jaccard Coefficient
Transaction items: a, b, c, d, e, f, g
Two clusters of transactions:
Cluster 1: {a,b,c} {a,b,d} {a,b,e} {a,c,d} {a,c,e} {a,d,e} {b,c,d} {b,c,e} {b,d,e} {c,d,e}
Cluster 2: {a,b,f} {a,b,g} {a,f,g} {b,f,g}
Compute the Jaccard coefficient between transactions:
$sim(T_i, T_j) = \frac{|T_i \cap T_j|}{|T_i \cup T_j|}$
Sim({a,b,c}, {b,d,e}) = 1/5 = 0.2
The Jaccard coefficient between transactions of Cluster 1 ranges from 0.2 to 0.5
The Jaccard coefficient between transactions belonging to different clusters can also reach 0.5
Sim({a,b,c}, {a,b,f}) = 2/4 = 0.5
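A tiny sketch of the Jaccard computation above, using Python sets; it simply reproduces the two values from the slide.

```python
# Jaccard coefficient between two transactions, reproducing the two values above.
def jaccard(ti, tj):
    ti, tj = set(ti), set(tj)
    return len(ti & tj) / len(ti | tj)

print(jaccard({'a', 'b', 'c'}, {'b', 'd', 'e'}))  # 1/5 = 0.2
print(jaccard({'a', 'b', 'c'}, {'a', 'b', 'f'}))  # 2/4 = 0.5
```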
Example: Using Links
Transaction items: a, b, c, d, e, f, g
Two clusters of transactions:
Cluster 1: {a,b,c} {a,b,d} {a,b,e} {a,c,d} {a,c,e} {a,d,e} {b,c,d} {b,c,e} {b,d,e} {c,d,e}
Cluster 2: {a,b,f} {a,b,g} {a,f,g} {b,f,g}
The number of links between Ti and Tj is the number of common neighbors
Ti and Tj are neighbors if Sim(Ti, Tj) >= θ
Consider θ = 0.5
Link({a,b,f}, {a,b,g}) = 5 (common neighbors)
Link({a,b,f}, {a,b,c}) = 3 (common neighbors)
Link is a better measure than the Jaccard coefficient
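A short sketch that recomputes the two link counts above; the neighbor rule (Jaccard similarity at least θ) follows the definition on this slide, and the function names are mine.

```python
# Link(Ti, Tj) = number of common neighbors, where Tk is a neighbor of Ti
# if jaccard(Ti, Tk) >= theta. Reproduces the two link counts above.
def jaccard(ti, tj):
    ti, tj = set(ti), set(tj)
    return len(ti & tj) / len(ti | tj)

cluster1 = [{'a','b','c'}, {'a','b','d'}, {'a','b','e'}, {'a','c','d'}, {'a','c','e'},
            {'a','d','e'}, {'b','c','d'}, {'b','c','e'}, {'b','d','e'}, {'c','d','e'}]
cluster2 = [{'a','b','f'}, {'a','b','g'}, {'a','f','g'}, {'b','f','g'}]
transactions = cluster1 + cluster2

def neighbors(t, theta=0.5):
    return [u for u in transactions if u != t and jaccard(t, u) >= theta]

def link(ti, tj, theta=0.5):
    ni = neighbors(ti, theta)
    return sum(1 for u in neighbors(tj, theta) if u in ni)

print(link({'a','b','f'}, {'a','b','g'}))  # 5
print(link({'a','b','f'}, {'a','b','c'}))  # 3
```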
ROCK
ROCK: RObust Clustering using linKs
Major ideas
Use links to measure similarity/proximity; not distance-based
Computational complexity:
$O(n^2 + n\, m_m m_a + n^2 \log n)$
m_a: average number of neighbors
m_m: maximum number of neighbors
n: number of objects
Algorithm
Sampling-based clustering
Draw a random sample
Cluster with links
Label data on disk
3.4.1 The Principle
Regard clusters as dense regions in the data space separated by regions of low density
Major features
Discover clusters of arbitrary shape
Handle noise
One scan
Need density parameters as termination condition
Several interesting studies
DBSCAN: Ester, et al. (KDD'96)
OPTICS: Ankerst, et al. (SIGMOD'99)
DENCLUE: Hinneburg & Keim (KDD'98)
CLIQUE: Agrawal, et al. (SIGMOD'98) (more grid-based)
Basic Concepts: ε-neighborhood & Core Objects
The neighborhood within a radius ε of a given object is called the ε-neighborhood of the object
If the ε-neighborhood of an object contains at least a minimum number, MinPts, of objects, then the object is called a core object
Example: ε = 1 cm, MinPts = 3
m and p are core objects because their ε-neighborhoods contain at least 3 points
[Figure: points p, m, q with their ε-neighborhoods]
Directly Density-Reachable Objects
An object p is directly density-reachable from object q if p is within the ε-neighborhood of q and q is a core object
Example:
q is directly density-reachable from m
m is directly density-reachable from p, and vice versa
[Figure: points p, m, q]
Density-Reachable Objects
An object p is density-reachable from object q with respect to ε and MinPts if there is a chain of objects p1, ..., pn, where p1 = q and pn = p, such that p_{i+1} is directly density-reachable from p_i with respect to ε and MinPts
Example:
q is density-reachable from p because q is directly density-reachable from m and m is directly density-reachable from p
p is not density-reachable from q because q is not a core object
[Figure: points p, m, q]
Density-Connectivity
An object p is density-connected to object q with respect to ε and MinPts if there is an object O such that both p and q are density-reachable from O with respect to ε and MinPts
Example:
p, q, and m are all density-connected
[Figure: points p, m, q]
3.4.2 DBSCAN
Searches for clusters by checking the ε-neighborhood of each point in the database
If the ε-neighborhood of a point p contains more than MinPts points, a new cluster with p as a core object is created
DBSCAN iteratively collects directly density-reachable objects from these core objects, which may involve the merge of a few density-reachable clusters
The process terminates when no new point can be added to any cluster
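A compact sketch of the DBSCAN procedure just described; the eps and min_pts values, the label conventions (0 = unassigned, -1 = noise), and the tiny data set are choices of this sketch rather than part of the slides.

```python
# Minimal DBSCAN: grow a cluster from each unvisited core point by repeatedly
# collecting directly density-reachable points. Labels: 0 = unassigned, -1 = noise.
import numpy as np

def dbscan(X, eps, min_pts):
    n = len(X)
    labels = np.zeros(n, dtype=int)
    cluster_id = 0

    def region_query(i):                      # indices in the eps-neighborhood of point i
        return np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0]

    for i in range(n):
        if labels[i] != 0:
            continue
        neighbors = region_query(i)
        if len(neighbors) < min_pts:          # not a core point: mark as noise for now
            labels[i] = -1
            continue
        cluster_id += 1
        labels[i] = cluster_id
        seeds = list(neighbors)
        while seeds:                          # expand the cluster
            j = seeds.pop()
            if labels[j] == -1:               # former noise point becomes a border point
                labels[j] = cluster_id
            if labels[j] != 0:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(j)
            if len(j_neighbors) >= min_pts:   # j is a core point: keep expanding
                seeds.extend(j_neighbors)
    return labels

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [8, 8], [8, 9], [9, 8], [5, 5]], dtype=float)
print(dbscan(X, eps=1.5, min_pts=3))          # two clusters plus one noise point
```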
Density-based Clustering
[Figure: panels 1-4 illustrating DBSCAN cluster growth with MinPts = 4]
Density-based Clustering
[Figure: panels 5-8 continuing the DBSCAN cluster growth example]
DBSCAN: Sensitive to Parameters
3.4.3 OPTICS
Motivation
Very different local densities may be needed to reveal clusters in different regions
Clusters A, B, C1, C2, and C3 cannot be detected using one global density parameter
A global density parameter can detect either A, B, C or C1, C2, C3
Solution
Use OPTICS
[Figure: nested clusters A, B, and C, with C containing C1, C2, and C3]
OPTICS Principle
Produce a special order of the database
with respect to its density-based clustering structure
containing information about every clustering level of the data set (up to a generating distance ε)
Which information to use?
Core-distance and reachability-distance
Core-Distance and Reachability-Distance
The core-distance of an object p is the smallest ε' that makes p a core object
If p is not a core object, the core-distance of p is undefined
Example (ε = 6 mm, MinPts = 5)
3 mm is the core-distance of p: it is the distance between p and its fourth closest object
The reachability-distance of an object q with respect to object p is:
max(core-distance(p), Euclidean(p, q))
Example
Reachability-distance(q1, p) = core-distance(p) = 3 mm
Reachability-distance(q2, p) = Euclidean(q2, p)
[Figure: point p with core-distance 3 mm and ε = 6 mm; q1 lies within the core-distance, q2 outside it]
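A small sketch of these two definitions; the point set and parameter values are invented, and counting the point itself toward MinPts is one common convention, assumed here.

```python
# Core-distance and reachability-distance as defined above (Euclidean metric).
import numpy as np

def core_distance(X, i, eps, min_pts):
    d = np.sort(np.linalg.norm(X - X[i], axis=1))   # distances from X[i], including itself (0)
    if np.sum(d <= eps) < min_pts:
        return None                                  # p is not a core object: undefined
    return d[min_pts - 1]                            # smallest eps' making X[i] a core object

def reachability_distance(X, q, p, eps, min_pts):
    cd = core_distance(X, p, eps, min_pts)
    if cd is None:
        return None
    return max(cd, np.linalg.norm(X[q] - X[p]))

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
print(core_distance(X, 0, eps=1.0, min_pts=5))
print(reachability_distance(X, 5, 0, eps=1.0, min_pts=5))
```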
OPTICS Algorithm
Creates an ordering of the objects in the database and stores for each object its:
Core-distance
Reachability-distance from the closest core object from which the object has been directly density-reachable
This information is sufficient for the extraction of all density-based clusterings with respect to any distance ε' that is smaller than the distance ε used in generating the order
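As a closing sketch, scikit-learn's OPTICS implementation exposes exactly the quantities listed above (the ordering, core distances, and reachability distances); the synthetic data and parameter values here are illustrative assumptions only.

```python
# OPTICS with scikit-learn: the fitted model stores the cluster ordering,
# core distances, and reachability distances described in this slide.
import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(100, 2)),
               rng.normal(4.0, 0.8, size=(100, 2))])

model = OPTICS(min_samples=5, max_eps=np.inf).fit(X)
print(model.ordering_[:10])                       # the special order of the database
print(model.reachability_[model.ordering_][:10])  # reachability along that order
print(model.core_distances_[:5])
```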