clustering introduction preprocessing: dimensional reduction with svd clustering methods: k-means,...
Post on 15-Jan-2016
Clustering
Introduction
Preprocessing: dimensional reduction with SVD
Clustering methods: K-means, FCM
Hierarchical methods
Model based methods (at the end)
Competitive NN (SOM) (not shown here)
SVC, QC
Applications
COMPACT
(an ill-defined problem)
What Is Clustering?
Why? To help understand the natural grouping or structure in a data set.
When? Used either as a stand-alone tool to get insight into the data distribution or as a preprocessing step for other algorithms, e.g., to discover classes.
Not classification!
Clustering is partitioning of data into meaningful (?) groups called clusters. A cluster is a collection of objects that are "similar" to one another. … what is similar? Unsupervised learning: no predefined classes.
Clustering Applications
Operations Research: the Facility Location Problem. Locate fire stations so as to minimize the maximum/average distance a fire truck must travel.
Signal Processing: Vector Quantization. Transmit large files (e.g., video, speech) by computing quantizers.
Astronomy: SkyCat clustered 2x10^9 sky objects into stars, galaxies, quasars, etc., based on radiation emitted in different spectrum bands.
Clustering Applications
Marketing: segmentation of customers for target marketing; segmentation of customers based on online clickstream data.
Web: to discover categories of content; search results.
Bioinformatics: gene expression. Finding groups of individuals (sick vs. healthy); finding groups of genes; motif search; …
In practice, clustering is one of the most widely used data mining techniques: association rule algorithms produce too many rules, and other machine learning algorithms require labeled data.
Points/Metric Space
Points could be in R^d, {0,1}^d, …
Metric Space: dist(x,y) is a distance metric if it is
Reflexive: dist(x,y) = 0 iff x = y
Symmetric: dist(x,y) = dist(y,x)
Triangle Inequality: dist(x,y) ≤ dist(x,z) + dist(z,y)
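The three axioms can be spot-checked numerically on a finite set of points; a minimal Python sketch (the helper `is_metric` is ours, not from the slides):

```python
import numpy as np

def is_metric(dist, points, tol=1e-12):
    """Spot-check the three metric axioms on a finite sample of points."""
    for x in points:
        for y in points:
            d = dist(x, y)
            # Reflexive: dist(x, y) == 0 iff x == y
            if np.allclose(x, y):
                if d > tol:
                    return False
            elif d <= tol:
                return False
            # Symmetric: dist(x, y) == dist(y, x)
            if abs(d - dist(y, x)) > tol:
                return False
            # Triangle inequality: dist(x, y) <= dist(x, z) + dist(z, y)
            for z in points:
                if d > dist(x, z) + dist(z, y) + tol:
                    return False
    return True

euclid = lambda a, b: float(np.linalg.norm(np.asarray(a) - np.asarray(b)))
pts = [np.array(p, float) for p in [(0, 0), (1, 0), (0, 1), (2, 2)]]
print(is_metric(euclid, pts))  # True
```

This only checks the axioms on the sample points, of course; it cannot prove them for the whole space.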
Example of Distance Metrics
The distance between x = ⟨x1,…,xn⟩ and y = ⟨y1,…,yn⟩ is:
L2 norm (Euclidean): $\sqrt{(x_1-y_1)^2 + \cdots + (x_n-y_n)^2}$
Manhattan Distance (L1 norm): $|x_1-y_1| + \cdots + |x_n-y_n|$
Documents: cosine measure. This is a similarity, i.e., more similar → close to 1, less similar → close to 0.
Not a metric space, but 1 − cos is.
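The three measures above as a small Python sketch (function names are ours):

```python
import numpy as np

def l2(x, y):
    """Euclidean (L2) distance."""
    return float(np.sqrt(np.sum((x - y) ** 2)))

def l1(x, y):
    """Manhattan (L1) distance."""
    return float(np.sum(np.abs(x - y)))

def cosine_sim(x, y):
    """Cosine similarity: 1 = same direction, 0 = orthogonal."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
print(l2(x, y))          # sqrt(1 + 4 + 9)
print(l1(x, y))          # 1 + 2 + 3 = 6.0
print(cosine_sim(x, y))  # 1.0: same direction, so 1 - cos = 0
```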
Correlation
We might care more about the overall shape of expression profiles rather than the actual magnitudes
That is, we might want to consider genes similar when they are “up” and “down” together
When might we want this kind of measure? What experimental issues might make this appropriate?
Pearson Linear Correlation
We’re shifting the expression profiles down (subtracting the means) and scaling by the standard deviations (i.e., making the data have mean = 0 and std = 1)
$$\rho(x, y) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}, \qquad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \quad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$
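A direct transcription of the Pearson formula, plus the dissimilarity d_p = (1 − ρ)/2 used below; a sketch with function names of our choosing:

```python
import numpy as np

def pearson(x, y):
    """Pearson linear correlation: subtract means, scale by std devs."""
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))

def plc_dissimilarity(x, y):
    """Turn the similarity into a dissimilarity: d_p = (1 - rho) / 2."""
    return (1.0 - pearson(x, y)) / 2.0

x = np.array([1.0, 2.0, 3.0, 4.0])
print(pearson(x, 3 * x + 5))     # 1.0: invariant to scaling and shifting
print(plc_dissimilarity(x, -x))  # 1.0: perfectly anti-correlated
```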
Pearson Linear Correlation Pearson linear correlation (PLC) is a measure that is
invariant to scaling and shifting (vertically) of the expression values
Always between –1 and +1 (perfectly anti-correlated and perfectly correlated)
This is a similarity measure, but we can easily make it into a dissimilarity measure:
$$d_p = \frac{1 - \rho(x, y)}{2}$$
PLC (cont.)
PLC only measures the degree of a linear relationship between two expression profiles!
If you want to measure other relationships, there are many other possible measures (see Jagota book and project #3 for more examples)
ρ(x, y) = 0.0249, so d_p = (1 − 0.0249)/2 ≈ 0.4876
The green curve is the square of the blue curve – this relationship is not captured with PLC
More correlation examples
What do you think the correlation is here? Is this what we want?
How about here? Is this what we want?
Missing Values A common problem w/ microarray data One approach with Euclidean distance or
PLC is just to ignore missing values (i.e., pretend the data has fewer dimensions)
There are more sophisticated approaches that use information such as continuity of a time series or related genes to estimate missing values – better to use these if possible
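The "pretend the data has fewer dimensions" idea can be sketched as follows; rescaling the sum to the full dimensionality is one simple convention (our choice, not prescribed here) that keeps distances comparable across pairs with different numbers of missing values:

```python
import numpy as np

def euclidean_ignore_nan(x, y):
    """Euclidean distance over the dimensions present in both profiles,
    rescaled to the full dimensionality."""
    mask = ~(np.isnan(x) | np.isnan(y))   # dimensions observed in both
    d2 = np.sum((x[mask] - y[mask]) ** 2)
    return float(np.sqrt(d2 * len(x) / mask.sum()))

x = np.array([1.0, 2.0, np.nan, 4.0])
y = np.array([1.0, 2.0, 3.0, 0.0])
d = euclidean_ignore_nan(x, y)  # only 3 of 4 dimensions used
```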
Preprocessing
For methods that are not applicable in very high dimensions you may want to apply
- Dimensional reduction, e.g. consider the first few SVD components (truncate S at r dimensions) and use the remaining values of the U or V matrices
- Dimensional reduction + normalization: after applying dimensional reduction normalize all resulting vectors to unit length (i.e. consider angles as proximity measures)
- Feature selection, e.g. consider only features that have large variance. More on feature selection in the future.
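The first two options can be sketched with NumPy; the shapes and the value of r are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))   # rows = objects, columns = features

# Dimensional reduction: keep the first r SVD components
r = 3
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_reduced = U[:, :r] * s[:r]     # objects in r-dimensional SVD space

# Optional normalization: scale every vector to unit length,
# so only angles (cosine proximity) matter afterwards
norms = np.linalg.norm(X_reduced, axis=1, keepdims=True)
X_unit = X_reduced / norms

print(X_reduced.shape)                                   # (100, 3)
print(np.allclose(np.linalg.norm(X_unit, axis=1), 1.0))  # True
```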
Clustering Types
Exclusive vs. Overlapping Clustering Hierarchical vs. Global Clustering Formal vs. Heuristic Clustering
First two examples:
K-Means: exclusive, global, heuristic
FCM (fuzzy c-means): overlapping, global, heuristic
Two classes of data described by (o) and (*). The objective is to reproduce the two classes by K=2 clustering.
[Figure: tiles data, log(intensity) 557 Hz vs. log(intensity) 475 Hz; o = whole tiles, * = cracked tiles, x = centres]
1. Place two cluster centres (x) at random.
2. Assign each data point (* and o) to the nearest cluster centre (x).
[Figure: tiles data, log(intensity) 557 Hz vs. log(intensity) 475 Hz; o = whole tiles, * = cracked tiles, x = centres]
[Figure: tiles data, log(intensity) 557 Hz vs. log(intensity) 475 Hz; o = whole tiles, * = cracked tiles, x = centres]
1. Compute the new centre of each class.
2. Move the crosses (x).
Iteration 2
[Figure: tiles data, log(intensity) 557 Hz vs. log(intensity) 475 Hz; o = whole tiles, * = cracked tiles, x = centres]
Iteration 3
[Figure: tiles data, log(intensity) 557 Hz vs. log(intensity) 475 Hz; o = whole tiles, * = cracked tiles, x = centres]
Iteration 4 (then stop, because no visible change).
Each data point belongs to the cluster defined by the nearest centre.
[Figure: tiles data, log(intensity) 557 Hz vs. log(intensity) 475 Hz; o = whole tiles, * = cracked tiles, x = centres]
The membership matrix M:
1. The last five data points (rows) belong to the first cluster (column).
2. The first five data points (rows) belong to the second cluster (column).
M =
0.0000 1.0000
0.0000 1.0000
0.0000 1.0000
0.0000 1.0000
0.0000 1.0000
1.0000 0.0000
1.0000 0.0000
1.0000 0.0000
1.0000 0.0000
1.0000 0.0000
Membership matrix M

$$m_{ik} = \begin{cases} 1 & \text{if } \|u_k - c_i\|^2 \le \|u_k - c_j\|^2 \text{ for all } j \ne i \\ 0 & \text{otherwise} \end{cases}$$

where $\|u_k - c_i\|$ is the distance from data point $u_k$ to cluster centre $c_i$, and $c_j$ is any other cluster centre.
Results of K-means depend on the starting point of the algorithm. Repeat it several times to get a better feeling whether the results are meaningful.
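The assign/update loop illustrated above can be sketched as follows; we use a deterministic farthest-first start instead of the random placement on the slides, purely so the example is reproducible:

```python
import numpy as np

def kmeans(X, k, n_iter=100):
    """Minimal K-means sketch with a deterministic farthest-first start."""
    centres = [X[0]]
    for _ in range(1, k):
        # next centre = point farthest from all centres chosen so far
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centres], axis=0)
        centres.append(X[np.argmax(d)])
    centres = np.array(centres)
    for _ in range(n_iter):
        # 1. Assign each data point to the nearest cluster centre
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # 2. Move each centre to the mean of its assigned points
        new = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                        else centres[i] for i in range(k)])
        if np.allclose(new, centres):   # stop: no visible change
            break
        centres = new
    return labels, centres

# Two well-separated blobs, in the spirit of the tiles example
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])
labels, centres = kmeans(X, 2)
```

As the slide warns, with random initialization the result can change from run to run; repeating the algorithm several times is the usual remedy.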
c-partition

$$\bigcup_{i=1}^{c} C_i = U$$
$$C_i \cap C_j = \varnothing \quad \text{for all } i \ne j$$
$$\varnothing \subset C_i \subset U \quad \text{for all } i$$
$$2 \le c \le K$$

All clusters $C_i$ together fill the whole universe U. Clusters do not overlap. A cluster $C_i$ is never empty and is smaller than the whole universe U. There must be at least 2 clusters in a c-partition, and at most as many as the number of data points K.
Objective function

$$J = \sum_{i=1}^{c} J_i = \sum_{i=1}^{c} \sum_{k,\,u_k \in C_i} \|u_k - c_i\|^2$$

Minimise the total sum of all distances.
Algorithm: fuzzy c-means (FCM)
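A minimal sketch of the standard FCM update equations (fuzzifier m = 2 and a deterministic farthest-first start are our illustrative choices):

```python
import numpy as np

def fcm(X, c, m=2.0, n_iter=50, eps=1e-9):
    """Minimal fuzzy c-means: alternate membership and centre updates."""
    centres = [X[0]]
    for _ in range(1, c):
        d = np.min([np.linalg.norm(X - v, axis=1) for v in centres], axis=0)
        centres.append(X[np.argmax(d)])
    centres = np.array(centres)
    for _ in range(n_iter):
        # Memberships m_ik in (0, 1); each row sums to 1
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2) + eps
        inv = d ** (-2.0 / (m - 1.0))
        M = inv / inv.sum(axis=1, keepdims=True)
        # Centres = means weighted by M^m
        w = M ** m
        centres = (w.T @ X) / w.sum(axis=0)[:, None]
    return M, centres

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])
M, centres = fcm(X, 2)   # each point belongs to both clusters to a degree
```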
Each data point belongs to two clusters to different degrees
[Figure: tiles data, log(intensity) 557 Hz vs. log(intensity) 475 Hz; o = whole tiles, * = cracked tiles, x = centres]
1. Place two cluster centres
2. Assign a fuzzy membership to each data point depending on distance
[Figure: tiles data, log(intensity) 557 Hz vs. log(intensity) 475 Hz; o = whole tiles, * = cracked tiles, x = centres]
1. Compute the new centre of each class.
2. Move the crosses (x).
[Figure: tiles data, log(intensity) 557 Hz vs. log(intensity) 475 Hz; o = whole tiles, * = cracked tiles, x = centres]
Iteration 2
[Figure: tiles data, log(intensity) 557 Hz vs. log(intensity) 475 Hz; o = whole tiles, * = cracked tiles, x = centres]
Iteration 5
[Figure: tiles data, log(intensity) 557 Hz vs. log(intensity) 475 Hz; o = whole tiles, * = cracked tiles, x = centres]
Iteration 10
[Figure: tiles data, log(intensity) 557 Hz vs. log(intensity) 475 Hz; o = whole tiles, * = cracked tiles, x = centres]
Iteration 13 (then stop, because no visible change).
Each data point belongs to the two clusters to a degree.
[Figure: tiles data, log(intensity) 557 Hz vs. log(intensity) 475 Hz; o = whole tiles, * = cracked tiles, x = centres]
The membership matrix M:
1. The last five data points (rows) belong mostly to the first cluster (column).
2. The first five data points (rows) belong mostly to the second cluster (column).
M =
0.0025 0.9975
0.0091 0.9909
0.0129 0.9871
0.0001 0.9999
0.0107 0.9893
0.9393 0.0607
0.9638 0.0362
0.9574 0.0426
0.9906 0.0094
0.9807 0.0193
Hard Classifier (HCM)
[Figure: classification map; each column of cells shows exactly one of the classes Ok / light / moderate / severe]
A cell is either one or the other class, defined by a colour.
Fuzzy Classifier (FCM)
[Figure: classification map; each column of cells may mix the classes Ok / light / moderate / severe]
A cell can belong to several classes to a degree, i.e., one column may have several colours.
Hierarchical Clustering
• Greedy
• Agglomerative vs. Divisive
Dendrograms allow us to visualize the clustering; the visualization is not unique!
Tends to be sensitive to small changes in the data.
Provides clusters of every size: where to "cut" is user-determined.
Large storage demand.
Running time: O(n^2 · |levels|) = O(n^3). Depends on: distance measure, linkage method.
Hierarchical Agglomerative Clustering
We start with every data point in a separate cluster
We keep merging the most similar pairs of data points/clusters until we have one big cluster left
This is called a bottom-up or agglomerative method
Hierarchical Clustering (cont.)
This produces a binary tree or dendrogram.
The final cluster is the root and each data item is a leaf.
The height of the bars indicates how close the items are.
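The bottom-up merging can be sketched with a naive agglomerative implementation; single linkage is our choice of linkage method for the example, and the brute-force search is fine only for tiny n:

```python
import numpy as np

def single_linkage(X, k):
    """Naive agglomerative clustering with single linkage:
    start with every point as its own cluster, repeatedly merge the
    two closest clusters, and stop when k clusters remain."""
    clusters = [[i] for i in range(len(X))]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    while len(clusters) > k:
        best, pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: distance between the closest members
                d = min(D[i, j] for i in clusters[a] for j in clusters[b])
                if d < best:
                    best, pair = d, (a, b)
        a, b = pair
        clusters[a].extend(clusters.pop(b))
    labels = np.empty(len(X), int)
    for c, members in enumerate(clusters):
        labels[members] = c
    return labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (8, 2)), rng.normal(4, 0.2, (8, 2))])
labels = single_linkage(X, 2)
```

Running the loop to a single cluster instead of stopping at k, and recording each merge distance, is what produces the dendrogram.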
Hierarchical Clustering Demo
Hierarchical Clustering Issues Distinct clusters are not produced –
sometimes this can be good, if the data has a hierarchical structure w/o clear boundaries
There are methods for producing distinct clusters, but these usually involve specifying somewhat arbitrary cutoff values
What if data doesn’t have a hierarchical structure? Is HC appropriate?
Support Vector Clustering
Given points x in data space, define images in Hilbert space.
Require all images to be enclosed by a minimal sphere in Hilbert space.
Reflection of this sphere in data space defines cluster boundaries.
Two parameters: width of Gaussian kernel and fraction of outliers
Ben-Hur, Horn, Siegelmann & Vapnik. JMLR 2 (2001) 125-137
Variation of q allows for clustering solutions on various scales
q = 1, 20, 24, 48
Example that allows for SV clustering only in the presence of outliers. Procedure: limit β < C = 1/(pN), where p = fraction of assumed outliers in the data.
q=3.5, p=0 and q=1, p=0.3
Similarity to the scale-space approach for high values of q and p. Probability distribution obtained from R(x).
q=4.8, p=0.7
From Scale-space to Quantum Clustering
Parzen window approach: estimate the probability density by kernel functions (Gaussians) located at data points.
$$P(x) = C\,\psi(x) = C \sum_{i=1}^{N} e^{-(x-x_i)^2/(2\sigma^2)}, \qquad \sigma = 1/\sqrt{2q}$$
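The Parzen estimate above as a short NumPy sketch (unnormalized, i.e., dropping the constant C):

```python
import numpy as np

def parzen_density(x, data, sigma):
    """Unnormalized Parzen window estimate psi(x):
    one Gaussian of width sigma centred at each data point."""
    x = np.atleast_2d(x)
    sq = np.sum((x[:, None, :] - data[None, :, :]) ** 2, axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2)).sum(axis=1)

data = np.array([[0.0], [0.1], [5.0], [5.1]])
p = parzen_density(np.array([[0.05], [2.5]]), data, sigma=0.5)
# density is high near the data points, low in the gap between them
```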
Quantum Clustering
View P ∝ ψ as the solution of the Schrödinger equation:

$$H\psi \equiv \left(-\frac{\sigma^2}{2}\nabla^2 + V(x)\right)\psi = E\psi$$

with the potential V(x) responsible for attraction to cluster centers and the Laplacian (kinetic) term causing the spread.

Solving for V(x):

$$V(x) = E + \frac{\sigma^2}{2}\frac{\nabla^2\psi}{\psi} = E - \frac{d}{2} + \frac{1}{2\sigma^2\psi}\sum_i \|x - x_i\|^2\, e^{-\|x-x_i\|^2/(2\sigma^2)}$$
Horn and Gottlieb, Phys. Rev. Lett. 88 (2002) 018702
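The potential can be evaluated directly from the formula, up to the additive constant E − d/2; a NumPy sketch:

```python
import numpy as np

def qc_potential(x, data, sigma):
    """Quantum-clustering potential up to the constant E - d/2:
    v(x) = (1 / (2 sigma^2 psi)) * sum_i ||x - x_i||^2 * g_i(x),
    where g_i(x) = exp(-||x - x_i||^2 / (2 sigma^2)) and psi = sum_i g_i."""
    x = np.atleast_2d(x)
    sq = np.sum((x[:, None, :] - data[None, :, :]) ** 2, axis=2)
    g = np.exp(-sq / (2.0 * sigma ** 2))
    return (sq * g).sum(axis=1) / (2.0 * sigma ** 2 * g.sum(axis=1))

data = np.array([[0.0], [0.2], [-0.2], [5.0], [5.2], [4.8]])
v = qc_potential(np.array([[0.0], [2.5]]), data, sigma=0.6)
# the potential is low at a cluster centre and high between clusters
```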
The Crabs Example (from Ripley's textbook): 4 classes, 50 samples each, d=5
A topographic map of the probability distribution for the crab data set with σ=1/2 using principal components 2 and 3. There exists only one maximum.
The Crabs Example: the QC potential exhibits four minima identified with cluster centers
A topographic map of the potential for the crab data set with σ=1/2 using principal components 2 and 3. The four minima are denoted by crossed circles. The contours are set at values V=cE for c=0.2,…,1.
The Crabs Example - Contd.
A three dimensional plot of the potential for the crab data set with σ=1/3 using principal components 2 and 3.
The Crabs Example - Contd.
A three dimensional plot of the potential for the crab data set with σ=1/2 using principal components 2 and 3.
Identifying Clusters
Local minima of the potential are identified with cluster centers.
Data points are assigned to clusters according to:
- minimal distance from centers, or
- sliding points down the slopes of the potential with gradient descent until they reach the centers.
The Iris Example: 3 classes, each containing 50 samples, d=4
A topographic map of the potential for the iris data set with σ=0.25 using principal components 1 and 2. The three minima are denoted by crossed circles. The contours are set at values V=cE for c=0.2,…,1.
The Iris Example - Gradient Descent Dynamics
The Iris Example - Using Raw Data in 4D
There are only 5 misclassifications. σ=0.21.
Example – Yeast cell cycle
Yeast cell cycle data were studied by several groups who have applied SVD (Spellman et al., Molecular Biology of the Cell, 9, Dec. 2000). We use it to test clustering of genes, whose classification into groups was investigated by Spellman et al.
The gene/sample matrix that we start from has dimensions of 798x72, using the same selection as made by Shamir, R. and Sharan, R. (2002).
We truncate it to r=4 and obtain, once again, our best results for σ=0.5, where four clusters follow from the QC algorithm.
Example – Yeast cell cycle
The five gene families as represented in two coordinates of our r=4 dimensional space.
Example – Yeast cell cycle
Cluster assignments of genes for QC with σ=0.46, as compared to the classification by Spellman into five classes, shown as alternating gray and white areas.
Yeast cell cycle in normalized 2 dimensions
Hierarchical Quantum Clustering (HQC)
Start with the raw data matrix containing the gene expression profiles of the samples.
Apply SVD and truncate to r-space by selecting the first r significant eigenvectors.
Apply QC in r dimensions, starting at a small scale and obtaining many clusters. Move data points to cluster centers and reiterate the process at higher σ. This produces a hierarchical clustering that can be represented by a dendrogram.
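The SVD truncation in the second step can be sketched in a few lines (a generic NumPy illustration, not the actual HQC/COMPACT code; the function name and toy matrix are ours):

```python
import numpy as np

def truncate_svd(X, r):
    """Project the rows of X onto the first r singular directions.

    A sketch of the SVD preprocessing step: keep only the r largest
    singular values, so each row of X gets r coordinates.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Coordinates of each row in the truncated r-dimensional space.
    return U[:, :r] * s[:r]

# Toy example standing in for a genes-by-samples matrix, reduced to r=4.
X = np.random.default_rng(0).normal(size=(100, 20))
Y = truncate_svd(X, 4)
```

Keeping the full rank reproduces the original row geometry exactly; truncation trades that for a compact, denoised representation.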
Example – Clustering of human cancer cells
The NCI60 set is a gene expression profile of ~8000 genes in 60 human cancer cells.
NCI60 includes cell lines derived from cancers of colorectal, renal, ovarian, breast, prostate, lung and central nervous system origin, as well as leukemias and melanomas.
After application of selective filters the number of gene spots is reduced to a 1,376-gene subset (Scherf et al., Nature Genetics 24, 2000).
We applied HQC with r=5 dimensions.
Example – Clustering of human cancer cells
Dendrogram of 60 cancer cell samples. The clustering was done in 5 truncated dimensions. The first 2 letters in each sample represent the tissue/cancer type.
Example - Projection onto the unit sphere
Representation of data of four classes of cancer cells on two dimensions of the truncated space. The circles denote the locations of the data points before this normalization was applied.
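The projection onto the unit sphere amounts to normalizing each point to unit Euclidean norm; a minimal sketch (assuming rows are data points in the truncated space):

```python
import numpy as np

def project_to_unit_sphere(X):
    """Normalize each data point (row) to unit Euclidean norm.

    After this step all points lie on the unit sphere of the
    truncated space, so only their directions matter.
    """
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / norms

X = np.array([[3.0, 4.0], [0.0, 2.0]])
Y = project_to_unit_sphere(X)
```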
COMPACT – a comparative package for clustering assessment
COMPACT is a GUI Matlab tool that provides an easy and intuitive way to compare several clustering methods.
COMPACT is a five-step wizard that contains basic Matlab clustering methods as well as the quantum clustering algorithm, and provides a flexible and customizable interface for clustering high-dimensional data.
COMPACT allows both textual and graphical display of the clustering results.
How to Install?
COMPACT is a self-extracting package. In order to install and run the GUI tool, follow these three easy steps:
Download the COMPACT.zip package to your local drive.
Add the COMPACT destination directory to your Matlab path.
Within Matlab, type ‘compact’ at the command prompt.
Steps – 1
Input parameters
Steps – 1
Selecting variables
Steps – 2
Determining the matrix shape and vectors to cluster
Steps – 3
Preprocessing procedures
Components' variance graphs
Preprocessing parameters
Steps – 4
Points distribution preview
and clustering method selection
Steps – 5
Parameters for clustering algorithms: K-means
Steps – 5
Parameters for clustering algorithms: FCM
Steps – 5
Parameters for clustering algorithms: NN
Steps – 5
Parameters for clustering algorithms: QC
Steps – 6
COMPACT results
Steps – 6
Results
Clustering Methods: Model-Based

Data are generated from a mixture of underlying probability distributions.
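This generative view can be illustrated by sampling from such a mixture: each observation first picks a component k with probability τk, then is drawn from that component's distribution. A univariate-normal sketch (the helper function is ours, not from the slides):

```python
import numpy as np

def sample_mixture(n, means, sigmas, weights, seed=None):
    """Draw n points from a univariate Gaussian mixture.

    Each point first selects a component k with probability
    weights[k], then is sampled from N(means[k], sigmas[k]**2).
    """
    rng = np.random.default_rng(seed)
    ks = rng.choice(len(weights), size=n, p=weights)
    return rng.normal(np.take(means, ks), np.take(sigmas, ks)), ks

# E.g. two equal-proportion components with common variance 1.
x, labels = sample_mixture(1000, means=[1.0, 4.0], sigmas=[1.0, 1.0],
                           weights=[0.5, 0.5], seed=0)
```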
Some Examples

Two univariate normal components, with equal proportions and common variance σ² = 1, shown for means μ = 1, 2, 3, 4.
Two univariate normal components, with proportions 0.75 and 0.25 and common variance σ² = 1, shown for means μ = 1, 2, 3, 4.
and some more
Probability Models

Classification likelihood:

$$L_C(\theta_1,\dots,\theta_G;\,\gamma_1,\dots,\gamma_n \mid x) = \prod_{i=1}^{n} f_{\gamma_i}(x_i \mid \theta_{\gamma_i})$$

where θk is the set of parameters of cluster k, and γi = k if xi belongs to cluster k.

Mixture likelihood:

$$L_M(\theta_1,\dots,\theta_G;\,\tau_1,\dots,\tau_G \mid x) = \prod_{i=1}^{n} \sum_{k=1}^{G} \tau_k f_k(x_i \mid \theta_k)$$

where τk ≥ 0 and $\sum_{k=1}^{G} \tau_k = 1$; τk is the probability that an observation belongs to cluster k.
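The mixture likelihood can be evaluated directly for the univariate normal case; a sketch with our own function names, following the slide's notation (τk, θk = (μk, σk²)):

```python
import math

def normal_pdf(x, mu, sigma2):
    """Univariate normal density f(x | mu, sigma^2)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def mixture_log_likelihood(xs, taus, mus, sigma2s):
    """log L_M = sum_i log sum_k tau_k f_k(x_i | theta_k)."""
    ll = 0.0
    for x in xs:
        ll += math.log(sum(t * normal_pdf(x, m, s2)
                           for t, m, s2 in zip(taus, mus, sigma2s)))
    return ll

# Two equal-weight components with common variance 1.
ll = mixture_log_likelihood([0.0, 3.0], [0.5, 0.5], [0.0, 3.0], [1.0, 1.0])
```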
Probability Models (Cont.)

The most widely used model is the multivariate normal distribution, where Θk has a mean vector μk and a covariance matrix Σk:

$$f_k(x_i \mid \mu_k, \Sigma_k) = \frac{\exp\!\left(-\tfrac{1}{2}(x_i-\mu_k)^T \Sigma_k^{-1} (x_i-\mu_k)\right)}{(2\pi)^{p/2}\,|\Sigma_k|^{1/2}}$$

How is the covariance matrix Σk calculated?
Calculating the covariance matrix Σk

The idea: parameterize the covariance matrix as

$$\Sigma_k = \lambda_k D_k A_k D_k^T$$

Dk – orthogonal matrix of eigenvectors; determines the orientation of the principal components of Σk
Ak – diagonal matrix whose elements are proportional to the eigenvalues of Σk; determines the shape of the density contours
λk – scalar; determines the volume of the corresponding ellipsoid
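One way to recover λk, Dk and Ak from a given Σk (a sketch under the common convention det(Ak) = 1, so that λk alone carries the volume; names are ours):

```python
import numpy as np

def volume_shape_orientation(Sigma):
    """Decompose a covariance matrix as Sigma = lam * D @ A @ D.T.

    D holds the eigenvectors (orientation), A is diagonal with
    det(A) = 1 (shape), and the scalar lam carries the volume.
    """
    eigvals, D = np.linalg.eigh(Sigma)
    lam = np.prod(eigvals) ** (1.0 / len(eigvals))  # geometric mean of eigenvalues
    A = np.diag(eigvals / lam)                      # normalized so det(A) == 1
    return lam, D, A

# Example: an axis-aligned ellipsoid stretched 4:1.
Sigma = np.array([[4.0, 0.0], [0.0, 1.0]])
lam, D, A = volume_shape_orientation(Sigma)
```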
Σk Definition Determines the Model

Σk = λI: spherical, equal volumes (SOS criterion)
Σk = λDADᵀ: all ellipsoids are equal
How is Θk computed? EM algorithm

The complete-data log-likelihood (*):

$$l(\theta_k, \tau_k, z_{ik} \mid x) = \sum_{i=1}^{n} \sum_{k=1}^{G} z_{ik} \log\left[\tau_k f_k(x_i \mid \theta_k)\right]$$

where

$$z_{ik} = \begin{cases} 1 & \text{if } x_i \text{ belongs to group } k \\ 0 & \text{otherwise} \end{cases}$$

The density of an observation xi given zi is $\prod_{k=1}^{G} f_k(x_i \mid \theta_k)^{z_{ik}}$.

$\hat{z}_{ik} = E[z_{ik} \mid x_i, \theta_1, \dots, \theta_G]$ is the conditional expectation of zik given xi and Θ1,…,ΘG.
E step: calculate

$$\hat{z}_{ik} = \frac{\hat{\tau}_k\, f_k(x_i \mid \hat{\mu}_k, \hat{\Sigma}_k)}{\sum_{j=1}^{G} \hat{\tau}_j\, f_j(x_i \mid \hat{\mu}_j, \hat{\Sigma}_j)}$$

M step: given the ẑik, maximize (*):

$$\hat{\mu}_k = \frac{\sum_{i=1}^{n} \hat{z}_{ik}\, x_i}{n_k}, \qquad \hat{\tau}_k = \frac{n_k}{n}, \qquad n_k = \sum_{i=1}^{n} \hat{z}_{ik}$$

$\hat{\Sigma}_k$ depends on the model.
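The E and M steps above can be sketched for a univariate normal mixture (a minimal illustration with our own quantile-based initialization heuristic, not a production implementation):

```python
import numpy as np

def em_gmm_1d(x, G, iters=50):
    """Minimal EM for a univariate G-component normal mixture.

    z[i, k] is the responsibility tau_k f_k(x_i) / sum_j tau_j f_j(x_i);
    the M step re-estimates tau, mu and the variances from z.
    """
    n = len(x)
    # Initialization heuristic (ours): spread means over data quantiles.
    mu = np.quantile(x, (np.arange(G) + 0.5) / G)
    var = np.full(G, x.var())
    tau = np.full(G, 1.0 / G)
    for _ in range(iters):
        # E step: responsibilities, shape (n, G).
        dens = tau * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) \
               / np.sqrt(2 * np.pi * var)
        z = dens / dens.sum(axis=1, keepdims=True)
        # M step: maximize the complete-data log-likelihood given z.
        nk = z.sum(axis=0)
        tau = nk / n
        mu = (z * x[:, None]).sum(axis=0) / nk
        var = (z * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return tau, mu, var

# Two well-separated clusters are recovered easily.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(10.0, 1.0, 200)])
tau, mu, var = em_gmm_1d(x, 2)
```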
Limitations of the EM Algorithm

Low rate of convergence: you should start with good starting points and hope for well-separated clusters…
Not practical for a large number of clusters (i.e., of probabilities to estimate)
"Crashes" when a covariance matrix becomes singular
Problems when there are few observations in a cluster
EM must not be asked for more clusters than exist in nature…
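A common workaround for the singular-covariance crash (a standard regularization trick, not part of the algorithm description above) is to add a small ridge to each covariance estimate so it stays invertible:

```python
import numpy as np

def regularize_cov(Sigma, eps=1e-6):
    """Add a small ridge eps*I to keep the covariance invertible.

    Guards against the failure mode where a cluster collapses onto
    few (or identical) observations and its covariance turns singular.
    """
    return Sigma + eps * np.eye(Sigma.shape[0])

Sigma = np.zeros((2, 2))            # degenerate: all points identical
R = regularize_cov(Sigma)
Rinv = np.linalg.inv(R)             # now invertible
```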