Unsupervised Clustering
Unsupervised Clustering
Training samples are not labeled
May not know:
how many classes there are
the a priori probabilities $P(\omega_i)$
the class-conditional densities $p(x \mid \omega_i)$
Automatic discovery of structures
Intuitively, objects in the same class stick together and form clusters
Unsupervised Clustering (cont.)
Locating groups (clusters) having similar measurements
Given unlabelled samples $\{x_1, x_2, \ldots, x_n\}$, partition them into $C$ clusters $\omega_1, \omega_2, \ldots, \omega_C$
Similarity Measure
Need a similarity measure s(x, x') (e.g., the distance between x and x')

Minkowski metric:
$$d(\mathbf{x}, \mathbf{y}) = \left( \sum_{k=1}^{d} |x_k - y_k|^q \right)^{1/q}, \qquad q \ge 1$$

Mahalanobis distance:
$$d(\mathbf{x}, \mathbf{y}) = (\mathbf{x} - \mathbf{y})^T \Sigma^{-1} (\mathbf{x} - \mathbf{y})$$

Normalized inner product:
$$s(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x}^T \mathbf{y}}{\|\mathbf{x}\|\, \|\mathbf{y}\|}$$

Commonality (for binary features):
$$s(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x}^T \mathbf{y}}{d}$$
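As a concrete illustration, here is a minimal NumPy sketch of these four measures (the function names, and normalizing the binary commonality by the dimension d, are my own choices):

```python
import numpy as np

def minkowski(x, y, q=2):
    """Minkowski metric; q=2 gives Euclidean, q=1 gives Manhattan."""
    return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

def mahalanobis(x, y, cov):
    """Mahalanobis distance under the covariance matrix `cov`."""
    diff = x - y
    return float(diff @ np.linalg.inv(cov) @ diff)

def normalized_inner_product(x, y):
    """Cosine similarity between x and y."""
    return float(x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

def commonality(x, y):
    """Fraction of shared 1-bits for binary feature vectors."""
    return float(x @ y) / x.size
```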
Similarity Measure (cont.)
How do we set the threshold?
Too large: all samples are assigned to a single class
Too small: each sample ends up in its own class
How do we properly weight each feature?
e.g., how do the brightness of a fish and the length of a fish relate?
How do we scale (or normalize) different measures?
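One common remedy, offered here as a sketch rather than the slides' prescription, is to standardize each feature before computing distances (z-scoring is my assumption):

```python
import numpy as np

def standardize(X):
    """Scale each feature (column) of X to zero mean and unit variance,
    so that, e.g., fish brightness and fish length become comparable."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0   # guard against constant features
    return (X - mu) / sigma
```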
Axis Scaling
Axis Scaling (cont.)
Threshold
Criteria for Clustering
A criterion function for clustering: the sum of squared error
$$J_e = \sum_{i=1}^{c} \sum_{x \in \omega_i} \|x - m_i\|^2, \qquad m_i = \frac{1}{n_i} \sum_{x \in \omega_i} x$$
Equivalently, in terms of pairwise distances within each cluster:
$$J_e = \sum_{i=1}^{c} \frac{1}{n_i} \sum_{x \in \omega_i} \sum_{x' \in \omega_i} \frac{1}{2}\, \|x - x'\|^2$$
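A short sketch of computing this criterion (the helper name is hypothetical):

```python
import numpy as np

def sum_of_squared_error(X, labels):
    """J_e: sum over clusters of squared distances to the cluster mean."""
    J = 0.0
    for c in np.unique(labels):
        cluster = X[labels == c]
        m = cluster.mean(axis=0)          # cluster mean m_i
        J += np.sum((cluster - m) ** 2)
    return J
```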
Criteria for Clustering (cont.)
Within- and between-group variance
Minimize and maximize the trace and determinant of the appropriate scatter matrix
trace: square of the scattering radius
determinant: square of the scattering volume
$$S_W = \frac{1}{n} \sum_{i=1}^{c} \sum_{x \in \omega_i} (x - m_i)(x - m_i)^T$$
$$S_B = \frac{1}{n} \sum_{i=1}^{c} n_i\, (m_i - m)(m_i - m)^T$$
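A sketch of both scatter matrices, assuming the 1/n normalization reconstructed above:

```python
import numpy as np

def scatter_matrices(X, labels):
    """Within-class (S_W) and between-class (S_B) scatter matrices."""
    n, dim = X.shape
    m = X.mean(axis=0)                       # overall mean
    S_W = np.zeros((dim, dim))
    S_B = np.zeros((dim, dim))
    for c in np.unique(labels):
        cluster = X[labels == c]
        m_i = cluster.mean(axis=0)
        diff = cluster - m_i
        S_W += diff.T @ diff                 # sum of (x - m_i)(x - m_i)^T
        gap = (m_i - m).reshape(-1, 1)
        S_B += len(cluster) * (gap @ gap.T)  # n_i (m_i - m)(m_i - m)^T
    return S_W / n, S_B / n
```

np.trace and np.linalg.det of these matrices give the trace and determinant criteria mentioned above.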
Pathology: when class sizes differ
$$J_e = \sum_{i=1}^{c} \sum_{x \in \omega_i} \|x - m_i\|^2$$
Relaxed constraint: allows the large class to grow into the smaller class
Stringent constraint: disallows the large class and splits it
K-Means Algorithm
(fixed # of clusters)
1. Arbitrarily pick N cluster centers and assign samples to the nearest center
2. Compute the sample mean of each cluster
3. Reassign all samples to the cluster with the nearest mean
4. If any assignment changed, repeat from step 2; otherwise stop
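A minimal sketch of the algorithm just described (initialization by sampling rows is one common choice, not the slides' prescription):

```python
import numpy as np

def kmeans(X, k, seed=0):
    """Plain k-means: assign samples to the nearest center, recompute
    cluster means, and repeat until assignments stop changing."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = None
    while True:
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            return centers, labels        # no changes: stop
        labels = new_labels
        for j in range(k):
            if np.any(labels == j):       # skip empty clusters
                centers[j] = X[labels == j].mean(axis=0)
```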
K-Means
In some sense, it is a simplified version of case II of learning parametric forms (mixture-density estimation):
$$P(\omega_i \mid x_k, \hat{\theta}) = \frac{p(x_k \mid \omega_i, \hat{\theta}_i)\, P(\omega_i)}{\sum_{j=1}^{c} p(x_k \mid \omega_j, \hat{\theta}_j)\, P(\omega_j)}$$
Instead of multiple (fractional) membership, k-means gives the sample entirely to whichever class whose mean is closest to it:
$$P(\omega_i \mid x_k, \hat{\theta}) = \begin{cases} 1 & \text{if } \omega_i \text{ is the class with the closest mean} \\ 0 & \text{otherwise} \end{cases}$$
Fuzzy K-Means
In MATLAB, you can find k-means (there called c-means) under the Fuzzy Logic Toolbox
It implements fuzzy k-means
where the membership function is not 1 or 0, but can be fractional
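A minimal sketch of the fractional membership update (this is the standard fuzzy c-means formula with fuzzifier m; the function is my own, not the toolbox's API):

```python
import numpy as np

def fuzzy_memberships(X, centers, m=2.0):
    """Fractional membership u[i, j] of sample i in cluster j:
    u_ij is proportional to d_ij^(-2/(m-1))."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    d = np.fmax(d, 1e-12)                        # avoid division by zero
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)
```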
Iterative K-Means
Iterative refinement of clusters to minimize
$$J_e = \sum_{i=1}^{c} \sum_{x \in \omega_i} \|x - m_i\|^2$$
Each step: move one sample $\hat{x}$ from cluster $\omega_i$ to cluster $\omega_j$. The means and criterion values change by
$$m_j' = m_j + \frac{\hat{x} - m_j}{n_j + 1}, \qquad J_j' = J_j + \frac{n_j}{n_j + 1}\, \|\hat{x} - m_j\|^2$$
$$m_i' = m_i - \frac{\hat{x} - m_i}{n_i - 1}, \qquad J_i' = J_i - \frac{n_i}{n_i - 1}\, \|\hat{x} - m_i\|^2$$
The transfer is advantageous if
$$\frac{n_j}{n_j + 1}\, \|\hat{x} - m_j\|^2 < \frac{n_i}{n_i - 1}\, \|\hat{x} - m_i\|^2$$
1. Select an initial partition of the n samples into c clusters and compute $J$, $m_1, \ldots, m_c$
2. Select the next candidate sample $\hat{x}$; suppose it currently belongs to cluster $\omega_i$
3. If the current cluster is larger than 1, compute
$$\rho_j = \frac{n_j}{n_j + 1}\, \|\hat{x} - m_j\|^2 \;\; (j \ne i), \qquad \rho_i = \frac{n_i}{n_i - 1}\, \|\hat{x} - m_i\|^2$$
4. Transfer $\hat{x}$ to $\omega_k$ if $\rho_k \le \rho_j$ for all $j$
5. Update $J$, $m_i$, and $m_k$
6. If $J$ has not changed in n attempts, stop; otherwise go back to 2
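A sketch of the transfer test at the heart of this loop (using the update quantities derived above; cluster i must contain more than one sample):

```python
import numpy as np

def best_transfer(x, means, counts, i):
    """Cost rho_j of moving sample x from cluster i to each cluster j;
    returns the index of the cheapest destination."""
    rho = np.empty(len(means))
    for j, (m, n) in enumerate(zip(means, counts)):
        d2 = float(np.sum((x - m) ** 2))
        # leaving cluster i decreases J; joining cluster j increases it
        rho[j] = n / (n - 1) * d2 if j == i else n / (n + 1) * d2
    return int(rho.argmin())
```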
Hierarchical Clustering
K-means assumes a “flat” data description
Hierarchical descriptions are frequently more natural
Hierarchical clustering (cont.)
Samples in the same cluster will remain so
at a higher level
[Dendrogram: samples x1–x6 merged successively across levels 1–5]
Hierarchical Clustering Options
Bottom-up: agglomerative
Top-down: divisive
Agglomerative procedure (a sketch in code follows):
1. Let $\hat{c} = n$ and $\omega_i = \{x_i\}$, $i = 1, \ldots, n$
2. If $\hat{c} \le c$, stop
3. Find the nearest pair of distinct clusters, say $\omega_i$ and $\omega_j$
4. Merge $\omega_i$ and $\omega_j$, delete $\omega_j$, and decrement $\hat{c}$ by 1
5. Go to 2
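A compact sketch of this agglomerative loop; the merge cost anticipates the criterion derived on the next slides:

```python
import numpy as np

def agglomerate(X, c):
    """Merge singleton clusters bottom-up until c clusters remain.
    The nearest pair minimizes n_i*n_j/(n_i+n_j) * ||m_i - m_j||^2."""
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > c:
        best_cost, pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ni, nj = len(clusters[a]), len(clusters[b])
                mi = X[clusters[a]].mean(axis=0)
                mj = X[clusters[b]].mean(axis=0)
                cost = ni * nj / (ni + nj) * np.sum((mi - mj) ** 2)
                if cost < best_cost:
                    best_cost, pair = cost, (a, b)
        a, b = pair
        clusters[a] += clusters.pop(b)   # merge omega_j into omega_i
    return clusters
```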
Hierarchical Clustering Procedure
At a certain step, we have c clusters, with
$$n_i = |\omega_i|, \qquad m_i = \frac{1}{n_i} \sum_{x \in \omega_i} x$$
Merging two clusters $\omega_i$ and $\omega_j$ gives
$$n_{ij} = n_i + n_j, \qquad m_{ij} = \frac{1}{n_{ij}} \sum_{x \in \omega_i \cup \omega_j} x = \frac{1}{n_i + n_j} \left( n_i m_i + n_j m_j \right)$$
Hierarchical Clustering Procedure (cont.)
Then the criterion function increases by
$$E_{ij} = \sum_{x \in \omega_i \cup \omega_j} \|x - m_{ij}\|^2 - \sum_{x \in \omega_i} \|x - m_i\|^2 - \sum_{x \in \omega_j} \|x - m_j\|^2 = \frac{n_i n_j}{n_i + n_j}\, \|m_i - m_j\|^2$$
$i^*, j^*$ are chosen to minimize this increase:
$$i^*, j^* = \arg\min_{i,j} \frac{n_i n_j}{n_i + n_j}\, \|m_i - m_j\|^2 = \arg\min_{i,j} d(i, j)$$
Merging continues until no pair exists whose merge increases the criterion function by less than a certain preset threshold
Criterion Function
In fact, the criterion function d(i, j) can be chosen in many different ways, resulting in different clustering behaviors:
Nearest neighbor (favors elongated classes):
$$d_{\min}(\omega_i, \omega_j) = \min_{x \in \omega_i,\, y \in \omega_j} \|x - y\|$$
Furthest neighbor (favors compact classes):
$$d_{\max}(\omega_i, \omega_j) = \max_{x \in \omega_i,\, y \in \omega_j} \|x - y\|$$
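The two linkages as code (a minimal sketch; A and B are arrays of samples from the two clusters):

```python
import numpy as np

def d_min(A, B):
    """Nearest-neighbor (single) linkage: favors elongated clusters."""
    return np.min(np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2))

def d_max(A, B):
    """Furthest-neighbor (complete) linkage: favors compact clusters."""
    return np.max(np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2))
```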
•Start with every sample in its own cluster
•Merge clusters according to some criterion function
•Continue until two clusters remain
In Reality
With n objects, the distance matrix is n × n
For large databases, it is computationally expensive to compute and store this matrix
Solution: store only the k nearest clusters in the distance matrix, reducing the complexity to k × n
Divisive Clustering
Less often used
One has to be careful because the criterion function usually decreases monotonically (the samples become purer with each split)
The natural grouping is the one where a large drop in impurity occurs
[Plot: impurity vs. iteration, with a large drop marking the natural grouping]
ISODATA Algorithm
Iterative Self-Organizing Data Analysis Technique
A tried-and-true clustering algorithm
Dynamically updates the # of clusters
Can proceed in both top-down (split) and bottom-up (merge) directions
Notation:
$T$: threshold on the number of samples in a cluster
$N_D$: approximate (desired) number of clusters
$\sigma_s$: maximum spread for splitting
$D_m$: maximum distance for merging
$N_{\max}$: maximum number of clusters that can be merged
Algorithm
1. Cluster the existing data into $N_c$ clusters, eliminating any data and classes with fewer than $T$ members and decreasing $N_c$ accordingly. Exit when the classification of samples no longer changes
2. If the iteration is odd and $N_c \le N_D / 2$: split clusters whose samples are sufficiently disjoint, increasing $N_c$. If any clusters have been split, go to 1
3. If the iteration is even or $N_c \ge 2 N_D$: merge any pair of clusters whose samples are sufficiently close
4. Go to step 1
Step 1 (parameter: $T$)
Classify samples according to the nearest mean
Discard samples in clusters with fewer than $T$ samples; decrease $N_c$
Recompute the sample mean of each cluster
If any classification changed, repeat; otherwise exit
Step 2
Compute for each cluster k:
$$d_k = \frac{1}{N_k} \sum_{y \in \omega_k} \|y - \mu_k\| \qquad \text{(average distance to the cluster mean)}$$
$$\sigma_k^2 = \max_j \frac{1}{N_k} \sum_{y \in \omega_k} (y_j - \mu_{k,j})^2 \qquad \text{(largest per-feature variance)}$$
Compute globally:
$$\bar{d} = \frac{1}{N} \sum_{k=1}^{N_c} N_k\, d_k$$
For each cluster k, test the split condition (parameters: $\sigma_s^2$, $N_D$, $T$):
$$\sigma_k^2 > \sigma_s^2 \quad \text{and} \quad \left[ \left( d_k > \bar{d} \;\text{and}\; N_k > 2T \right) \;\text{or}\; N_c \le N_D / 2 \right]$$
If it holds, split the cluster and set $N_c = N_c + 1$; the two new cluster centers are displaced in opposite directions along the axis of largest variance
Step 3
Compute for each pair of clusters
$$d_{ij} = \|\mu_i - \mu_j\|$$
Sort the distances from smallest to largest, keeping those less than $D_m$
If $d_{ij} < D_m$, neither cluster i nor j has already been merged in this pass, and the number of merges is below $N_{\max}$: merge i and j, set $N_c = N_c - 1$, and use the weighted mean
$$\mu' = \frac{1}{N_i + N_j} \left( N_i \mu_i + N_j \mu_j \right)$$
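The merge update as a one-line sketch:

```python
import numpy as np

def merge_means(mu_i, mu_j, n_i, n_j):
    """Weighted mean of two merged ISODATA clusters."""
    return (n_i * mu_i + n_j * mu_j) / (n_i + n_j)
```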
Spectral Clustering
A graph-theoretical approach
(Advanced) linear algebra is really important here
A field by itself; we cover only the basics
Graph notations
Undirected graph G = (V, E)
V = {v1, …, vn} are nodes (samples)
E = {ei,j | 0 <= i, j < n} are edges, with weights (similarities) wi,j
W: adjacency matrix with entries wi,j
D: degree matrix (diagonal) with entries $d_i = \sum_j w_{i,j}$
A: subset of vertices; $\bar{A}$: its complement V \ A
$\mathbf{1}_A = (f_1, \ldots, f_n)^T$, with $f_i = 1$ if $i \in A$ and 0 otherwise
More graph notations
Size of a subset A of V:
Based on # of vertices: $|A|$
Based on its connections: $\mathrm{vol}(A) = \sum_{i \in A} d_i$
Connected component A:
Any two vertices in A can be joined by a path where all intermediate points lie in A
No connections between A and $\bar{A}$
Partition into $A_1, \ldots, A_k$ if the $A_i$ are disjoint and their union is V
Similarity Definitions
RBF (fully connected or thresholded):
E.g., a Gaussian kernel gives $w_{i,j} = \exp\!\left( -\|x_i - x_j\|^2 / (2\sigma^2) \right)$
ε-neighborhood graph:
Connect all points whose pairwise distances are less than ε
K-nearest neighbor:
vi and vj are neighbors if either one is a kNN of the other
Mutual k-nearest neighbor:
vi and vj are neighbors if both are a kNN of the other
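A sketch of building W with a Gaussian kernel plus kNN sparsification (sigma and k are free parameters; the "either one is a kNN of the other" variant is used):

```python
import numpy as np

def similarity_graph(X, sigma=1.0, k=10):
    """Gaussian-kernel adjacency matrix, keeping only kNN edges."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    keep = np.zeros_like(W, dtype=bool)
    nearest = np.argsort(-W, axis=1)[:, :k]       # k largest weights per row
    keep[np.arange(len(X))[:, None], nearest] = True
    return W * (keep | keep.T)                    # symmetric kNN graph
```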
Graph Laplacian: L = D - W
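A standard property worth recording here (a known fact from the spectral clustering literature, not spelled out on the slide): for any vector $f \in \mathbb{R}^n$,

$$f^T L f = \frac{1}{2} \sum_{i,j=1}^{n} w_{i,j}\, (f_i - f_j)^2 \ge 0$$

so L is symmetric and positive semi-definite, and the constant vector $\mathbf{1}$ is an eigenvector with eigenvalue 0.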
One connected component
If the graph is connected, the eigenvalue 0 of L has multiplicity 1, with the constant vector $\mathbf{1}$ as its eigenvector
Multiple connected components
The multiplicity of the eigenvalue 0 of L equals the number of connected components, and its eigenspace is spanned by the components' indicator vectors
Normalized Laplacian
$L_{\mathrm{sym}} = D^{-1/2} L\, D^{-1/2} = I - D^{-1/2} W D^{-1/2}$, and $L_{\mathrm{rw}} = D^{-1} L = I - D^{-1} W$
Unnormalized Spectral Clustering
Build W and L = D − W; compute the k eigenvectors of L with the smallest eigenvalues; stack them as the columns of a matrix U; run k-means on the rows of U
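A minimal sketch of this pipeline (it reuses the `kmeans` sketch from the K-Means slide; dense `eigh` stands in for a sparse Lanczos solver):

```python
import numpy as np

def unnormalized_spectral_clustering(W, k, seed=0):
    """Embed samples with the k smallest eigenvectors of L = D - W,
    then cluster the rows of the embedding with k-means."""
    D = np.diag(W.sum(axis=1))
    L = D - W
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    U = eigvecs[:, :k]                     # n x k spectral embedding
    return kmeans(U, k, seed=seed)
```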
Normalized Spectral Clustering
Same idea, but using the generalized eigenproblem $L u = \lambda D u$ (Shi–Malik) or the eigenvectors of $L_{\mathrm{sym}}$ with row re-normalization (Ng–Jordan–Weiss)
Toy Example
Random sample of 200 points drawn from 4 Gaussians
Similarity based on a Gaussian kernel
Graph: fully connected, or 10 nearest neighbors
Min-Cut Formulation
If the edge weights represent degree of similarity, the optimal bi-partitioning of a graph is the one that minimizes the cut (the so-called min-cut problem):
$$\mathrm{cut}(A, B) = \sum_{u \in A,\, v \in B} w(u, v)$$
Intuition: cut the connections between dissimilar samples; the edge weights (similarities) across the cut should be small
[Figure: bi-partitioning of samples in feature space]
Problem with Min-cut
Tends to cut out small regions
The total edge weight (green + red + blue) is constant
Min-cut minimizes the sum of weights of the blue edges, with no regard to the green and red edges (the other half of the picture)
Remedy: distribute the total weight such that
The sum of weights of the blue edges is minimized
(max between-group variance)
The sum of weights of the red (green) edges is maximized
(min within-group variance)
Normalized Cut
Penalize cutting out small, isolated clusters:
$$\mathrm{Ncut}(A, B) = \frac{\mathrm{cut}(A, B)}{\mathrm{asso}(A, V)} + \frac{\mathrm{cut}(A, B)}{\mathrm{asso}(B, V)} = \frac{\text{blue}}{\text{total} - \text{red}} + \frac{\text{blue}}{\text{total} - \text{green}}$$
where
$$\mathrm{cut}(A, B) = \sum_{u \in A,\, v \in B} w(u, v) \qquad \text{(blue)}$$
$$\mathrm{asso}(A, V) = \sum_{u \in A,\, v \in V} w(u, v) \qquad \text{(green + blue = total − red)}$$
$$\mathrm{asso}(B, V) = \sum_{u \in B,\, v \in V} w(u, v) \qquad \text{(red + blue = total − green)}$$
$V$: the full vertex set
[Figure: partition of the graph into A and B]
Normalized Cut (cont.)
Penalize cutting out small, isolated clusters:
$$\mathrm{Ncut}(A, B) = \frac{\mathrm{cut}(A, B)}{\mathrm{asso}(A, V)} + \frac{\mathrm{cut}(A, B)}{\mathrm{asso}(B, V)} = \frac{\text{blue}}{\text{total} - \text{red}} + \frac{\text{blue}}{\text{total} - \text{green}}$$
Cutting out a small, isolated cluster makes the blue (cut) weight small, but it also makes the red (or green) weight small, so the corresponding ratio stays large
Intuition
$$\mathrm{asso}(A, V) = \sum_{u \in A,\, v \in V} w(u, v) = \sum_{u \in A,\, v \in A} w(u, v) + \sum_{u \in A,\, v \in B} w(u, v) = \mathrm{assoc}(A, A) + \mathrm{cut}(A, B)$$
Hence
$$\frac{\mathrm{cut}(A, B)}{\mathrm{asso}(A, V)} = 1 - \frac{\mathrm{assoc}(A, A)}{\mathrm{asso}(A, V)} \qquad \left( \frac{\text{blue}}{\text{green} + \text{blue}} = 1 - \frac{\text{green}}{\text{green} + \text{blue}} \right)$$
and similarly for B, with red in place of green. Defining
$$\mathrm{Nassoc}(A, B) = \frac{\mathrm{assoc}(A, A)}{\mathrm{asso}(A, V)} + \frac{\mathrm{assoc}(B, B)}{\mathrm{asso}(B, V)} = \frac{\text{green}}{\text{green} + \text{blue}} + \frac{\text{red}}{\text{red} + \text{blue}}$$
we obtain
$$\mathrm{Ncut}(A, B) = 2 - \mathrm{Nassoc}(A, B)$$
Nassoc reflects the intra-class connections, which should be maximized
Ncut represents the inter-class connections, which should be minimized
So minimizing Ncut and maximizing Nassoc are achieved simultaneously
Solution
How do you define similarity?
Multiple measurements (i.e., a feature vector) can be used, e.g., a weighted combination over the features:
$$d(i, j) = \sum_{k} w_k\, |v_i^{(k)} - v_j^{(k)}|$$
How do you find the minimal normalized cut?
The solution turns out to be a generalized eigenvalue problem!
Setup: let $N = |V|$ and define the indicator vector $\mathbf{x} = (x_1, \ldots, x_N)$ with
$$x_i = \begin{cases} 1 & \text{if } v_i \in A \\ -1 & \text{if } v_i \in B \end{cases}$$
$$d_i = \sum_j w(i, j), \qquad \mathbf{D} = \begin{pmatrix} d_1 & & 0 \\ & \ddots & \\ 0 & & d_N \end{pmatrix} \; (N \times N), \qquad \mathbf{W} = \begin{pmatrix} w(1,1) & w(1,2) & \cdots & w(1,N) \\ w(2,1) & w(2,2) & \cdots & w(2,N) \\ \vdots & & & \vdots \\ w(N,1) & w(N,2) & \cdots & w(N,N) \end{pmatrix}$$
$$k = \frac{\sum_{x_i > 0} d_i}{\sum_i d_i}$$
With this notation (following Shi and Malik),
$$\mathrm{Ncut}(A, B) = \frac{\mathrm{cut}(A, B)}{\mathrm{asso}(A, V)} + \frac{\mathrm{cut}(A, B)}{\mathrm{asso}(B, V)} = \frac{\sum_{x_i > 0,\, x_j < 0} -w_{ij}\, x_i x_j}{\sum_{x_i > 0} d_i} + \frac{\sum_{x_i < 0,\, x_j > 0} -w_{ij}\, x_i x_j}{\sum_{x_i < 0} d_i}$$
which can be rewritten in matrix form as
$$4\, \mathrm{Ncut}(A, B) = \frac{(\mathbf{1} + \mathbf{x})^T (\mathbf{D} - \mathbf{W}) (\mathbf{1} + \mathbf{x})}{k\, \mathbf{1}^T \mathbf{D} \mathbf{1}} + \frac{(\mathbf{1} - \mathbf{x})^T (\mathbf{D} - \mathbf{W}) (\mathbf{1} - \mathbf{x})}{(1 - k)\, \mathbf{1}^T \mathbf{D} \mathbf{1}}$$
Here $\frac{1}{2}(\mathbf{1} + \mathbf{x})$ and $\frac{1}{2}(\mathbf{1} - \mathbf{x})$ are the 0/1 indicator vectors of A and B, and $d_i = \sum_j w_{ij}$ splits as $\sum_{x_j > 0} w_{ij} + \sum_{x_j < 0} w_{ij}$
Setting $b = \frac{k}{1 - k}$ and $\mathbf{y} = (\mathbf{1} + \mathbf{x}) - b\,(\mathbf{1} - \mathbf{x})$, one can show
$$\min_{\mathbf{x}} \mathrm{Ncut}(\mathbf{x}) = \min_{\mathbf{y}} \frac{\mathbf{y}^T (\mathbf{D} - \mathbf{W})\, \mathbf{y}}{\mathbf{y}^T \mathbf{D}\, \mathbf{y}}$$
whose relaxed (real-valued) solution satisfies the generalized eigensystem
$$(\mathbf{D} - \mathbf{W})\, \mathbf{y} = \lambda\, \mathbf{D}\, \mathbf{y}$$
Substituting $\mathbf{z} = \mathbf{D}^{1/2} \mathbf{y}$ turns this into a standard eigenproblem:
$$\mathbf{D}^{-1/2} (\mathbf{D} - \mathbf{W})\, \mathbf{D}^{-1/2}\, \mathbf{z} = \lambda\, \mathbf{z}$$
This matrix is symmetric and positive semi-definite:
Real, >= 0 eigenvalues
Orthogonal eigenvectors
$\mathbf{z}_0 = \mathbf{D}^{1/2} \mathbf{1}$ is an eigenvector with eigenvalue 0
Hence, the second smallest eigenvector contains the minimal-cut solution (in real-valued form, which is then thresholded back into a discrete partition)
Even though computing all eigenvectors/eigenvalues is expensive, O(n^3), computing a small number of them is not that expensive (Lanczos method)
Recursive application of the procedure yields further partitions
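Putting the derivation together, here is a minimal dense-matrix sketch of the two-way normalized cut (for large sparse W, scipy.sparse.linalg.eigsh would compute just the two smallest eigenpairs via Lanczos, as noted above):

```python
import numpy as np

def normalized_cut_bipartition(W):
    """Solve (D - W) y = lambda D y via the symmetric form
    D^{-1/2} (D - W) D^{-1/2} z = lambda z, take the second smallest
    eigenvector, and threshold it at zero to get the partition."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.diag(d) - W                            # unnormalized Laplacian
    L_sym = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L_sym)      # ascending eigenvalues
    y = d_inv_sqrt * eigvecs[:, 1]                # map z back: y = D^{-1/2} z
    return y > 0                                  # boolean partition of V
```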
Results