Unsupervised Clustering
Unsupervised Clustering
Training samples are not labeled
May not know:
how many classes there are
the a priori probabilities $P(\omega_i)$
the class-conditional densities $p(x \mid \omega_i)$
Automatic discovery of structures
Intuitively, objects in the same class stick together and form clusters
Unsupervised Clustering (cont.)
Locating groups (clusters) having similar measurements
Given unlabelled samples $\{x_1, x_2, \ldots, x_n\}$, partition them into $C$ clusters $\omega_1, \omega_2, \ldots, \omega_C$
Similarity Measure
Need a similarity measure s(x, x') (e.g., the distance between x and x')

Minkowski metric:
$$d(\mathbf{x}, \mathbf{y}) = \left( \sum_{k=1}^{d} |x_k - y_k|^q \right)^{1/q}, \qquad q \ge 1$$

Mahalanobis distance:
$$d(\mathbf{x}, \mathbf{y}) = (\mathbf{x} - \mathbf{y})^T \Sigma^{-1} (\mathbf{x} - \mathbf{y})$$

Normalized inner product:
$$s(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x}^T \mathbf{y}}{\|\mathbf{x}\|\, \|\mathbf{y}\|}$$

Commonality (for binary features):
$$s(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x}^T \mathbf{y}}{d}$$
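As a concrete illustration, here is a minimal NumPy sketch of these four measures (the function names, and normalizing the binary commonality by the dimension d, are my own choices):

```python
import numpy as np

def minkowski(x, y, q=2):
    """Minkowski metric; q=2 gives Euclidean, q=1 gives Manhattan."""
    return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

def mahalanobis(x, y, cov):
    """Mahalanobis distance under the covariance matrix `cov`."""
    diff = x - y
    return float(diff @ np.linalg.inv(cov) @ diff)

def normalized_inner_product(x, y):
    """Cosine similarity between x and y."""
    return float(x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

def commonality(x, y):
    """Fraction of shared 1-bits for binary feature vectors."""
    return float(x @ y) / x.size
```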
Similarity Measure (cont.)
How do we set the threshold?
Too large: all samples are assigned to a single class
Too small: each sample ends up in its own class
How do we properly weight each feature?
e.g., how do the brightness of a fish and the length of a fish relate?
How do we scale (or normalize) different measures?
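One common remedy, offered here as a sketch rather than the slides' prescription, is to standardize each feature before computing distances (z-scoring is my assumption):

```python
import numpy as np

def standardize(X):
    """Scale each feature (column) of X to zero mean and unit variance,
    so that, e.g., fish brightness and fish length become comparable."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0   # guard against constant features
    return (X - mu) / sigma
```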
Axis Scaling
Axis Scaling (cont.)
Threshold
Criteria for Clustering
A criterion function for clustering: the sum of squared error
$$J_e = \sum_{i=1}^{c} \sum_{x \in \omega_i} \|x - m_i\|^2, \qquad m_i = \frac{1}{n_i} \sum_{x \in \omega_i} x$$
Equivalently, in terms of pairwise distances within each cluster:
$$J_e = \sum_{i=1}^{c} \frac{1}{n_i} \sum_{x \in \omega_i} \sum_{x' \in \omega_i} \frac{1}{2}\, \|x - x'\|^2$$
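A short sketch of computing this criterion (the helper name is hypothetical):

```python
import numpy as np

def sum_of_squared_error(X, labels):
    """J_e: sum over clusters of squared distances to the cluster mean."""
    J = 0.0
    for c in np.unique(labels):
        cluster = X[labels == c]
        m = cluster.mean(axis=0)          # cluster mean m_i
        J += np.sum((cluster - m) ** 2)
    return J
```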
Criteria for Clustering (cont.)
Within- and between-group variance
Minimize and maximize the trace and determinant of the appropriate scatter matrix
trace: square of the scattering radius
determinant: square of the scattering volume
$$S_W = \frac{1}{n} \sum_{i=1}^{c} \sum_{x \in \omega_i} (x - m_i)(x - m_i)^T$$
$$S_B = \frac{1}{n} \sum_{i=1}^{c} n_i\, (m_i - m)(m_i - m)^T$$
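A sketch of both scatter matrices, assuming the 1/n normalization reconstructed above:

```python
import numpy as np

def scatter_matrices(X, labels):
    """Within-class (S_W) and between-class (S_B) scatter matrices."""
    n, dim = X.shape
    m = X.mean(axis=0)                       # overall mean
    S_W = np.zeros((dim, dim))
    S_B = np.zeros((dim, dim))
    for c in np.unique(labels):
        cluster = X[labels == c]
        m_i = cluster.mean(axis=0)
        diff = cluster - m_i
        S_W += diff.T @ diff                 # sum of (x - m_i)(x - m_i)^T
        gap = (m_i - m).reshape(-1, 1)
        S_B += len(cluster) * (gap @ gap.T)  # n_i (m_i - m)(m_i - m)^T
    return S_W / n, S_B / n
```

np.trace and np.linalg.det of these matrices give the trace and determinant criteria mentioned above.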
Pathology: when class sizes differ
$$J_e = \sum_{i=1}^{c} \sum_{x \in \omega_i} \|x - m_i\|^2$$
Relaxed constraint: allows the large class to grow into the smaller class
Stringent constraint: disallows the large class and splits it
K-Means Algorithm
(fixed # of clusters)
1. Arbitrarily pick N cluster centers and assign samples to the nearest center
2. Compute the sample mean of each cluster
3. Reassign all samples to the cluster with the nearest mean
4. If any assignment changed, repeat from step 2; otherwise stop
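A minimal sketch of the algorithm just described (initialization by sampling rows is one common choice, not the slides' prescription):

```python
import numpy as np

def kmeans(X, k, seed=0):
    """Plain k-means: assign samples to the nearest center, recompute
    cluster means, and repeat until assignments stop changing."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = None
    while True:
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            return centers, labels        # no changes: stop
        labels = new_labels
        for j in range(k):
            if np.any(labels == j):       # skip empty clusters
                centers[j] = X[labels == j].mean(axis=0)
```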
K-Means
In some sense, it is a simplified version of case II of learning parametric forms (mixture-density estimation):
$$P(\omega_i \mid x_k, \hat{\theta}) = \frac{p(x_k \mid \omega_i, \hat{\theta}_i)\, P(\omega_i)}{\sum_{j=1}^{c} p(x_k \mid \omega_j, \hat{\theta}_j)\, P(\omega_j)}$$
Instead of multiple (fractional) membership, k-means gives the sample entirely to whichever class whose mean is closest to it:
$$P(\omega_i \mid x_k, \hat{\theta}) = \begin{cases} 1 & \text{if } \omega_i \text{ is the class with the closest mean} \\ 0 & \text{otherwise} \end{cases}$$
Fuzzy K-Means
In MATLAB, you can find k-means (there called c-means) under the Fuzzy Logic Toolbox
It implements fuzzy k-means
where the membership function is not 1 or 0, but can be fractional
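A minimal sketch of the fractional membership update (this is the standard fuzzy c-means formula with fuzzifier m; the function is my own, not the toolbox's API):

```python
import numpy as np

def fuzzy_memberships(X, centers, m=2.0):
    """Fractional membership u[i, j] of sample i in cluster j:
    u_ij is proportional to d_ij^(-2/(m-1))."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    d = np.fmax(d, 1e-12)                        # avoid division by zero
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)
```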
Iterative K-Means
Iterative refinement of clusters to minimize
$$J_e = \sum_{i=1}^{c} \sum_{x \in \omega_i} \|x - m_i\|^2$$
Each step: move one sample $\hat{x}$ from cluster $\omega_i$ to cluster $\omega_j$. The means and criterion values change by
$$m_j' = m_j + \frac{\hat{x} - m_j}{n_j + 1}, \qquad J_j' = J_j + \frac{n_j}{n_j + 1}\, \|\hat{x} - m_j\|^2$$
$$m_i' = m_i - \frac{\hat{x} - m_i}{n_i - 1}, \qquad J_i' = J_i - \frac{n_i}{n_i - 1}\, \|\hat{x} - m_i\|^2$$
The transfer is advantageous if
$$\frac{n_j}{n_j + 1}\, \|\hat{x} - m_j\|^2 < \frac{n_i}{n_i - 1}\, \|\hat{x} - m_i\|^2$$
1. Select an initial partition of the n samples into c clusters and compute $J$, $m_1, \ldots, m_c$
2. Select the next candidate sample $\hat{x}$; suppose it currently belongs to cluster $\omega_i$
3. If the current cluster is larger than 1, compute
$$\rho_j = \frac{n_j}{n_j + 1}\, \|\hat{x} - m_j\|^2 \;\; (j \ne i), \qquad \rho_i = \frac{n_i}{n_i - 1}\, \|\hat{x} - m_i\|^2$$
4. Transfer $\hat{x}$ to $\omega_k$ if $\rho_k \le \rho_j$ for all $j$
5. Update $J$, $m_i$, and $m_k$
6. If $J$ has not changed in n attempts, stop; otherwise go back to 2
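A sketch of the transfer test at the heart of this loop (using the update quantities derived above; cluster i must contain more than one sample):

```python
import numpy as np

def best_transfer(x, means, counts, i):
    """Cost rho_j of moving sample x from cluster i to each cluster j;
    returns the index of the cheapest destination."""
    rho = np.empty(len(means))
    for j, (m, n) in enumerate(zip(means, counts)):
        d2 = float(np.sum((x - m) ** 2))
        # leaving cluster i decreases J; joining cluster j increases it
        rho[j] = n / (n - 1) * d2 if j == i else n / (n + 1) * d2
    return int(rho.argmin())
```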
Hierarchical Clustering
K-means assumes a “flat” data description
Hierarchical descriptions are frequently more natural
Hierarchical clustering (cont.)
Samples in the same cluster will remain so
at a higher level
[Dendrogram: samples x1–x6 merged successively across levels 1–5]
Hierarchical Clustering Options
Bottom-up: agglomerative
Top-down: divisive
Agglomerative procedure (a sketch in code follows):
1. Let $\hat{c} = n$ and $\omega_i = \{x_i\}$, $i = 1, \ldots, n$
2. If $\hat{c} \le c$, stop
3. Find the nearest pair of distinct clusters, say $\omega_i$ and $\omega_j$
4. Merge $\omega_i$ and $\omega_j$, delete $\omega_j$, and decrement $\hat{c}$ by 1
5. Go to 2
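A compact sketch of this agglomerative loop; the merge cost anticipates the criterion derived on the next slides:

```python
import numpy as np

def agglomerate(X, c):
    """Merge singleton clusters bottom-up until c clusters remain.
    The nearest pair minimizes n_i*n_j/(n_i+n_j) * ||m_i - m_j||^2."""
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > c:
        best_cost, pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ni, nj = len(clusters[a]), len(clusters[b])
                mi = X[clusters[a]].mean(axis=0)
                mj = X[clusters[b]].mean(axis=0)
                cost = ni * nj / (ni + nj) * np.sum((mi - mj) ** 2)
                if cost < best_cost:
                    best_cost, pair = cost, (a, b)
        a, b = pair
        clusters[a] += clusters.pop(b)   # merge omega_j into omega_i
    return clusters
```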
Hierarchical Clustering Procedure
At a certain step, we have c clusters, with
$$n_i = |\omega_i|, \qquad m_i = \frac{1}{n_i} \sum_{x \in \omega_i} x$$
Merging two clusters $\omega_i$ and $\omega_j$ gives
$$n_{ij} = n_i + n_j, \qquad m_{ij} = \frac{1}{n_{ij}} \sum_{x \in \omega_i \cup \omega_j} x = \frac{1}{n_i + n_j} \left( n_i m_i + n_j m_j \right)$$
Hierarchical Clustering Procedure (cont.)
Then the criterion function increases by
$$E_{ij} = \sum_{x \in \omega_i \cup \omega_j} \|x - m_{ij}\|^2 - \sum_{x \in \omega_i} \|x - m_i\|^2 - \sum_{x \in \omega_j} \|x - m_j\|^2 = \frac{n_i n_j}{n_i + n_j}\, \|m_i - m_j\|^2$$
$i^*, j^*$ are chosen to minimize this increase:
$$i^*, j^* = \arg\min_{i,j} \frac{n_i n_j}{n_i + n_j}\, \|m_i - m_j\|^2 = \arg\min_{i,j} d(i, j)$$
Merging continues until no pair exists whose merge increases the criterion function by less than a certain preset threshold
Criterion Function
In fact, the criterion function d(i, j) can be chosen in many different ways, resulting in different clustering behaviors:
Nearest neighbor (favors elongated classes):
$$d_{\min}(\omega_i, \omega_j) = \min_{x \in \omega_i,\, y \in \omega_j} \|x - y\|$$
Furthest neighbor (favors compact classes):
$$d_{\max}(\omega_i, \omega_j) = \max_{x \in \omega_i,\, y \in \omega_j} \|x - y\|$$
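The two linkages as code (a minimal sketch; A and B are arrays of samples from the two clusters):

```python
import numpy as np

def d_min(A, B):
    """Nearest-neighbor (single) linkage: favors elongated clusters."""
    return np.min(np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2))

def d_max(A, B):
    """Furthest-neighbor (complete) linkage: favors compact clusters."""
    return np.max(np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2))
```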
•Start with every sample in its own cluster
•Merge clusters according to some criterion function
•Continue until two clusters remain
In Reality
With n objects, the distance matrix is n × n
For large databases, it is computationally expensive to compute and store this matrix
Solution: store only the k nearest clusters in the distance matrix, reducing the complexity to k × n
Divisive Clustering
Less often used
One has to be careful because the criterion function usually decreases monotonically (the samples become purer with each split)
The natural grouping is the one where a large drop in impurity occurs
[Plot: impurity vs. iteration, with a large drop marking the natural grouping]
ISODATA Algorithm
Iterative Self-Organizing Data Analysis Technique
A tried-and-true clustering algorithm
Dynamically updates the # of clusters
Can proceed in both top-down (split) and bottom-up (merge) directions
Notation:
$T$: threshold on the number of samples in a cluster
$N_D$: approximate (desired) number of clusters
$\sigma_s$: maximum spread for splitting
$D_m$: maximum distance for merging
$N_{\max}$: maximum number of clusters that can be merged
Algorithm
1. Cluster the existing data into $N_c$ clusters, eliminating any data and classes with fewer than $T$ members and decreasing $N_c$ accordingly. Exit when the classification of samples no longer changes
2. If the iteration is odd and $N_c \le N_D / 2$: split clusters whose samples are sufficiently disjoint, increasing $N_c$. If any clusters have been split, go to 1
3. If the iteration is even or $N_c \ge 2 N_D$: merge any pair of clusters whose samples are sufficiently close
4. Go to step 1
Step 1 (parameter: $T$)
Classify samples according to the nearest mean
Discard samples in clusters with fewer than $T$ samples; decrease $N_c$
Recompute the sample mean of each cluster
If any classification changed, repeat; otherwise exit
Step 2
Compute for each cluster k:
$$d_k = \frac{1}{N_k} \sum_{y \in \omega_k} \|y - \mu_k\| \qquad \text{(average distance to the cluster mean)}$$
$$\sigma_k^2 = \max_j \frac{1}{N_k} \sum_{y \in \omega_k} (y_j - \mu_{k,j})^2 \qquad \text{(largest per-feature variance)}$$
Compute globally:
$$\bar{d} = \frac{1}{N} \sum_{k=1}^{N_c} N_k\, d_k$$
For each cluster k, test the split condition (parameters: $\sigma_s^2$, $N_D$, $T$):
$$\sigma_k^2 > \sigma_s^2 \quad \text{and} \quad \left[ \left( d_k > \bar{d} \;\text{and}\; N_k > 2T \right) \;\text{or}\; N_c \le N_D / 2 \right]$$
If it holds, split the cluster and set $N_c = N_c + 1$; the two new cluster centers are displaced in opposite directions along the axis of largest variance
Step 3
Compute for each pair of clusters
$$d_{ij} = \|\mu_i - \mu_j\|$$
Sort the distances from smallest to largest, keeping those less than $D_m$
If $d_{ij} < D_m$, neither cluster i nor j has already been merged in this pass, and the number of merges is below $N_{\max}$: merge i and j, set $N_c = N_c - 1$, and use the weighted mean
$$\mu' = \frac{1}{N_i + N_j} \left( N_i \mu_i + N_j \mu_j \right)$$
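The merge update as a one-line sketch:

```python
import numpy as np

def merge_means(mu_i, mu_j, n_i, n_j):
    """Weighted mean of two merged ISODATA clusters."""
    return (n_i * mu_i + n_j * mu_j) / (n_i + n_j)
```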
Spectral Clustering
A graph-theoretical approach
(Advanced) linear algebra is really important here
A field by itself; we cover only the basics
Graph notations
Undirected graph G = (V, E)
V = {v1, …, vn} are nodes (samples)
E = {ei,j | 0 <= i, j < n} are edges, with weights (similarities) wi,j
W: adjacency matrix with entries wi,j
D: degree matrix (diagonal) with entries $d_i = \sum_j w_{i,j}$
A: subset of vertices; $\bar{A}$: its complement V \ A
$\mathbf{1}_A = (f_1, \ldots, f_n)^T$, with $f_i = 1$ if $i \in A$ and 0 otherwise
More graph notations
Size of a subset A of V:
Based on # of vertices: $|A|$
Based on its connections: $\mathrm{vol}(A) = \sum_{i \in A} d_i$
Connected component A:
Any two vertices in A can be joined by a path where all intermediate points lie in A
No connections between A and $\bar{A}$
Partition into $A_1, \ldots, A_k$ if the $A_i$ are disjoint and their union is V
Similarity Definitions
RBF (fully connected or thresholded):
E.g., a Gaussian kernel gives $w_{i,j} = \exp\!\left( -\|x_i - x_j\|^2 / (2\sigma^2) \right)$
ε-neighborhood graph:
Connect all points whose pairwise distances are less than ε
K-nearest neighbor:
vi and vj are neighbors if either one is a kNN of the other
Mutual k-nearest neighbor:
vi and vj are neighbors if both are a kNN of the other
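A sketch of building W with a Gaussian kernel plus kNN sparsification (sigma and k are free parameters; the "either one is a kNN of the other" variant is used):

```python
import numpy as np

def similarity_graph(X, sigma=1.0, k=10):
    """Gaussian-kernel adjacency matrix, keeping only kNN edges."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    keep = np.zeros_like(W, dtype=bool)
    nearest = np.argsort(-W, axis=1)[:, :k]       # k largest weights per row
    keep[np.arange(len(X))[:, None], nearest] = True
    return W * (keep | keep.T)                    # symmetric kNN graph
```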
Graph Laplacian: L = D - W
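A standard property worth recording here (a known fact from the spectral clustering literature, not spelled out on the slide): for any vector $f \in \mathbb{R}^n$,

$$f^T L f = \frac{1}{2} \sum_{i,j=1}^{n} w_{i,j}\, (f_i - f_j)^2 \ge 0$$

so L is symmetric and positive semi-definite, and the constant vector $\mathbf{1}$ is an eigenvector with eigenvalue 0.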
One connected component
If the graph is connected, the eigenvalue 0 of L has multiplicity 1, with the constant vector $\mathbf{1}$ as its eigenvector
Multiple connected components
The multiplicity of the eigenvalue 0 of L equals the number of connected components, and its eigenspace is spanned by the components' indicator vectors
Normalized Laplacian
$L_{\mathrm{sym}} = D^{-1/2} L\, D^{-1/2} = I - D^{-1/2} W D^{-1/2}$, and $L_{\mathrm{rw}} = D^{-1} L = I - D^{-1} W$
Unnormalized Spectral Clustering
Build W and L = D − W; compute the k eigenvectors of L with the smallest eigenvalues; stack them as the columns of a matrix U; run k-means on the rows of U
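A minimal sketch of this pipeline (it reuses the `kmeans` sketch from the K-Means slide; dense `eigh` stands in for a sparse Lanczos solver):

```python
import numpy as np

def unnormalized_spectral_clustering(W, k, seed=0):
    """Embed samples with the k smallest eigenvectors of L = D - W,
    then cluster the rows of the embedding with k-means."""
    D = np.diag(W.sum(axis=1))
    L = D - W
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    U = eigvecs[:, :k]                     # n x k spectral embedding
    return kmeans(U, k, seed=seed)
```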
Normalized Spectral Clustering
Same idea, but using the generalized eigenproblem $L u = \lambda D u$ (Shi–Malik) or the eigenvectors of $L_{\mathrm{sym}}$ with row re-normalization (Ng–Jordan–Weiss)
Toy Example
Random sample of 200 points drawn from 4 Gaussians
Similarity based on a Gaussian kernel
Graph: fully connected, or 10 nearest neighbors
Min-Cut Formulation
If the edge weights represent degree of similarity, the optimal bi-partitioning of a graph is the one that minimizes the cut (the so-called min-cut problem):
$$\mathrm{cut}(A, B) = \sum_{u \in A,\, v \in B} w(u, v)$$
Intuition: cut the connections between dissimilar samples; the edge weights (similarities) across the cut should be small
[Figure: bi-partitioning of samples in feature space]
Problem with Min-cut
Tends to cut out small regions
The total edge weight (green + red + blue) is constant
Min-cut minimizes the sum of weights of the blue edges, with no regard to the green and red edges (the other half of the picture)
Remedy: distribute the total weight such that
The sum of weights of the blue edges is minimized
(max between-group variance)
The sum of weights of the red (green) edges is maximized
(min within-group variance)
Normalized Cut
Penalize cutting out small, isolated clusters:
$$\mathrm{Ncut}(A, B) = \frac{\mathrm{cut}(A, B)}{\mathrm{asso}(A, V)} + \frac{\mathrm{cut}(A, B)}{\mathrm{asso}(B, V)} = \frac{\text{blue}}{\text{total} - \text{red}} + \frac{\text{blue}}{\text{total} - \text{green}}$$
where
$$\mathrm{cut}(A, B) = \sum_{u \in A,\, v \in B} w(u, v) \qquad \text{(blue)}$$
$$\mathrm{asso}(A, V) = \sum_{u \in A,\, v \in V} w(u, v) \qquad \text{(green + blue = total − red)}$$
$$\mathrm{asso}(B, V) = \sum_{u \in B,\, v \in V} w(u, v) \qquad \text{(red + blue = total − green)}$$
$V$: the full vertex set
[Figure: partition of the graph into A and B]
Normalized Cut (cont.)
Penalize cutting out small, isolated clusters:
$$\mathrm{Ncut}(A, B) = \frac{\mathrm{cut}(A, B)}{\mathrm{asso}(A, V)} + \frac{\mathrm{cut}(A, B)}{\mathrm{asso}(B, V)} = \frac{\text{blue}}{\text{total} - \text{red}} + \frac{\text{blue}}{\text{total} - \text{green}}$$
Cutting out a small, isolated cluster makes the blue (cut) weight small, but it also makes the red (or green) weight small, so the corresponding ratio stays large
Intuition
$$\mathrm{asso}(A, V) = \sum_{u \in A,\, v \in V} w(u, v) = \sum_{u \in A,\, v \in A} w(u, v) + \sum_{u \in A,\, v \in B} w(u, v) = \mathrm{assoc}(A, A) + \mathrm{cut}(A, B)$$
Hence
$$\frac{\mathrm{cut}(A, B)}{\mathrm{asso}(A, V)} = 1 - \frac{\mathrm{assoc}(A, A)}{\mathrm{asso}(A, V)} \qquad \left( \frac{\text{blue}}{\text{green} + \text{blue}} = 1 - \frac{\text{green}}{\text{green} + \text{blue}} \right)$$
and similarly for B, with red in place of green. Defining
$$\mathrm{Nassoc}(A, B) = \frac{\mathrm{assoc}(A, A)}{\mathrm{asso}(A, V)} + \frac{\mathrm{assoc}(B, B)}{\mathrm{asso}(B, V)} = \frac{\text{green}}{\text{green} + \text{blue}} + \frac{\text{red}}{\text{red} + \text{blue}}$$
we obtain
$$\mathrm{Ncut}(A, B) = 2 - \mathrm{Nassoc}(A, B)$$
Nassoc reflects the intra-class connections, which should be maximized
Ncut represents the inter-class connections, which should be minimized
So minimizing Ncut and maximizing Nassoc are achieved simultaneously
Solution
How do you define similarity?
Multiple measurements (i.e., a feature vector) can be used, e.g., a weighted combination over the features:
$$d(i, j) = \sum_{k} w_k\, |v_i^{(k)} - v_j^{(k)}|$$
How do you find the minimal normalized cut?
The solution turns out to be a generalized eigenvalue problem!
Setup: let $N = |V|$ and define the indicator vector $\mathbf{x} = (x_1, \ldots, x_N)$ with
$$x_i = \begin{cases} 1 & \text{if } v_i \in A \\ -1 & \text{if } v_i \in B \end{cases}$$
$$d_i = \sum_j w(i, j), \qquad \mathbf{D} = \begin{pmatrix} d_1 & & 0 \\ & \ddots & \\ 0 & & d_N \end{pmatrix} \; (N \times N), \qquad \mathbf{W} = \begin{pmatrix} w(1,1) & w(1,2) & \cdots & w(1,N) \\ w(2,1) & w(2,2) & \cdots & w(2,N) \\ \vdots & & & \vdots \\ w(N,1) & w(N,2) & \cdots & w(N,N) \end{pmatrix}$$
$$k = \frac{\sum_{x_i > 0} d_i}{\sum_i d_i}$$
With this notation (following Shi and Malik),
$$\mathrm{Ncut}(A, B) = \frac{\mathrm{cut}(A, B)}{\mathrm{asso}(A, V)} + \frac{\mathrm{cut}(A, B)}{\mathrm{asso}(B, V)} = \frac{\sum_{x_i > 0,\, x_j < 0} -w_{ij}\, x_i x_j}{\sum_{x_i > 0} d_i} + \frac{\sum_{x_i < 0,\, x_j > 0} -w_{ij}\, x_i x_j}{\sum_{x_i < 0} d_i}$$
which can be rewritten in matrix form as
$$4\, \mathrm{Ncut}(A, B) = \frac{(\mathbf{1} + \mathbf{x})^T (\mathbf{D} - \mathbf{W}) (\mathbf{1} + \mathbf{x})}{k\, \mathbf{1}^T \mathbf{D} \mathbf{1}} + \frac{(\mathbf{1} - \mathbf{x})^T (\mathbf{D} - \mathbf{W}) (\mathbf{1} - \mathbf{x})}{(1 - k)\, \mathbf{1}^T \mathbf{D} \mathbf{1}}$$
Here $\frac{1}{2}(\mathbf{1} + \mathbf{x})$ and $\frac{1}{2}(\mathbf{1} - \mathbf{x})$ are the 0/1 indicator vectors of A and B, and $d_i = \sum_j w_{ij}$ splits as $\sum_{x_j > 0} w_{ij} + \sum_{x_j < 0} w_{ij}$
Setting $b = \frac{k}{1 - k}$ and $\mathbf{y} = (\mathbf{1} + \mathbf{x}) - b\,(\mathbf{1} - \mathbf{x})$, one can show
$$\min_{\mathbf{x}} \mathrm{Ncut}(\mathbf{x}) = \min_{\mathbf{y}} \frac{\mathbf{y}^T (\mathbf{D} - \mathbf{W})\, \mathbf{y}}{\mathbf{y}^T \mathbf{D}\, \mathbf{y}}$$
whose relaxed (real-valued) solution satisfies the generalized eigensystem
$$(\mathbf{D} - \mathbf{W})\, \mathbf{y} = \lambda\, \mathbf{D}\, \mathbf{y}$$
Substituting $\mathbf{z} = \mathbf{D}^{1/2} \mathbf{y}$ turns this into a standard eigenproblem:
$$\mathbf{D}^{-1/2} (\mathbf{D} - \mathbf{W})\, \mathbf{D}^{-1/2}\, \mathbf{z} = \lambda\, \mathbf{z}$$
This matrix is symmetric and positive semi-definite:
Real, >= 0 eigenvalues
Orthogonal eigenvectors
$\mathbf{z}_0 = \mathbf{D}^{1/2} \mathbf{1}$ is an eigenvector with eigenvalue 0
Hence, the second smallest eigenvector contains the minimal-cut solution (in real-valued form, which is then thresholded back into a discrete partition)
Even though computing all eigenvectors/eigenvalues is expensive, O(n^3), computing a small number of them is not that expensive (Lanczos method)
Recursive application of the procedure yields further partitions
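Putting the derivation together, here is a minimal dense-matrix sketch of the two-way normalized cut (for large sparse W, scipy.sparse.linalg.eigsh would compute just the two smallest eigenpairs via Lanczos, as noted above):

```python
import numpy as np

def normalized_cut_bipartition(W):
    """Solve (D - W) y = lambda D y via the symmetric form
    D^{-1/2} (D - W) D^{-1/2} z = lambda z, take the second smallest
    eigenvector, and threshold it at zero to get the partition."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.diag(d) - W                            # unnormalized Laplacian
    L_sym = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L_sym)      # ascending eigenvalues
    y = d_inv_sqrt * eigvecs[:, 1]                # map z back: y = D^{-1/2} z
    return y > 0                                  # boolean partition of V
```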
Results