Unsupervised Learning & Cluster Analysis: Basic Concepts and Algorithms
Assaf Gottlieb
Some of the slides are taken from Introduction to Data Mining, by Tan, Steinbach, and Kumar.
What is unsupervised learning & cluster analysis?
"Learning without a priori knowledge about the classification of samples; learning without a teacher."
Kohonen (1995), “Self-Organizing Maps”
“Cluster analysis is a set of methods for constructing a (hopefully) sensible and informative classification of an initially unclassified set of data, using the variable values observed on each individual.”
B. S. Everitt (1998), “The Cambridge Dictionary of Statistics”
What do we cluster?
Samples/Instances
Features/Variables
Applications of Cluster Analysis
Understanding: group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations.
Data exploration: get insight into the data distribution; understand patterns in the data.
Summarization: reduce the size of large data sets; a preprocessing step.
Objectives of Cluster Analysis
Find groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups. There are two competing objectives: inter-cluster distances are maximized while intra-cluster distances are minimized.
Notion of a Cluster Can Be Ambiguous
How many clusters? The same points can plausibly be seen as two, four, or six clusters. It depends on the "resolution"!
Prerequisites
Understand the nature of your problem, the type of features, etc.
The similarity metric that you choose (for example, Euclidean distance or Pearson correlation) often impacts the clusters you recover.
Similarity/Distance Measures
Euclidean distance (highly depends on the scale of the features; may require normalization):

$$d_{euc}(x, y) = \sqrt{\sum_{n=1}^{N} (x_n - y_n)^2}$$

City block (Manhattan) distance:

$$D_M(x, w) = \sum_{i=1}^{d} |x_i - w_i|$$
[Figure: pairs of expression profiles with d_euc = 0.5846, d_euc = 1.1345, and d_euc = 2.6115.] These examples of Euclidean distance match our intuition of dissimilarity pretty well…
[Figure: pairs of expression profiles with d_euc = 1.41 and d_euc = 1.22.] …But what about these? What might be going on with the expression profiles on the left? On the right?
Similarity/Distance Measures
Cosine similarity:

$$C_{cosine}(x, y) = \frac{\sum_{i=1}^{N} x_i y_i}{\|x\| \, \|y\|}$$

Pearson correlation:

$$C_{pearson}(x, y) = \frac{\sum_{i=1}^{N} (x_i - m_x)(y_i - m_y)}{\sqrt{\left[\sum_{i=1}^{N} (x_i - m_x)^2\right]\left[\sum_{i=1}^{N} (y_i - m_y)^2\right]}}$$

Both are invariant to scaling (Pearson also to addition). Use Spearman correlation for ranks.
Similarity/Distance Measures
Jaccard similarity, used when interested in intersection size:

$$JSim(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}$$
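As a concrete illustration of these measures, here is a minimal sketch in Python; the vectors and sets are made up for illustration, and the distance routines come from NumPy and SciPy:

```python
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])   # y = 2x: a scaled version of x

print(distance.euclidean(x, y))      # Euclidean: sensitive to scale
print(distance.cityblock(x, y))      # city block (Manhattan)
print(distance.cosine(x, y))         # cosine *distance* = 1 - similarity; ~0 here (scaling-invariant)
print(np.corrcoef(x, y)[0, 1])       # Pearson correlation; exactly 1 here

# Jaccard similarity for two sets: |X ∩ Y| / |X ∪ Y|
X, Y = {1, 2, 3}, {2, 3, 4}
print(len(X & Y) / len(X | Y))       # 0.5
```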
Types of Clusterings
An important distinction is between hierarchical and partitional sets of clusters.
Partitional clustering: a division of the data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.
Hierarchical clustering: a set of nested clusters organized as a hierarchical tree.
Partitional Clustering
[Figure: the original points and a partitional clustering of them.]
Hierarchical Clustering
[Figure: the same four points p1–p4 shown as nested clusters and as two alternative dendrograms, Dendrogram 1 and Dendrogram 2.]
Other Distinctions Between Sets of Clustering Methods
Exclusive versus non-exclusive: in non-exclusive clusterings, points may belong to multiple clusters; this can represent multiple classes or 'border' points.
Fuzzy versus non-fuzzy: in fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1, and the weights must sum to 1.
Partial versus complete: in some cases, we only want to cluster some of the data.
Heterogeneous versus homogeneous: clusters of widely different sizes, shapes, and densities.
Clustering Algorithms Hierarchical clustering K-means Bi-clustering
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree. Can be visualized as a dendrogram: a tree-like diagram that records the sequences of merges or splits.
[Figure: a dendrogram over points 1–6, with merge heights between 0 and 0.2, alongside the corresponding nested clusters.]
Strengths of Hierarchical Clustering
You do not have to assume any particular number of clusters: any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level.
The clusters may correspond to meaningful taxonomies, as in the biological sciences (e.g., the animal kingdom, phylogeny reconstruction, …).
Hierarchical Clustering
Two main types of hierarchical clustering:
Agglomerative (bottom up): start with the points as individual clusters; at each step, merge the closest pair of clusters until only one cluster (or k clusters) is left.
Divisive (top down): start with one, all-inclusive cluster; at each step, split a cluster until each cluster contains a single point (or there are k clusters).
Traditional hierarchical algorithms use a similarity or distance matrix and merge or split one cluster at a time.
Agglomerative Clustering Algorithm
The more popular hierarchical clustering technique. The basic algorithm is straightforward:
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4.   Merge the two closest clusters
5.   Update the proximity matrix
6. Until only a single cluster remains
The key operation is the computation of the proximity of two clusters; different approaches to defining the distance between clusters distinguish the different algorithms.
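A minimal sketch of this loop using SciPy, whose linkage routine implements exactly this agglomerative procedure; the 2-D points are made up for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [4.0, 4.0], [4.1, 4.2], [4.2, 3.9]])

# linkage() starts from singleton clusters and repeatedly merges the
# closest pair, recording each merge and its height.
Z = linkage(points, method="single")   # MIN / single link (see below)

# Cut the tree to obtain k = 2 flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)                          # e.g. [1 1 1 2 2 2]

# dendrogram(Z) draws the merge tree (requires matplotlib).
```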
Starting Situation
Start with clusters of individual points p1, p2, p3, p4, p5, … and a proximity matrix. [Figure: the points and the p1–p5 proximity matrix.]
Intermediate Situation
After some merging steps, we have some clusters C1–C5. [Figure: the current clusters and the C1–C5 proximity matrix.]
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix. [Figure: the clusters before the merge and the C1–C5 proximity matrix.]
After Merging
The question is: how do we update the proximity matrix? [Figure: the clusters after merging C2 and C5, with the rows and columns for C2 ∪ C5 in the proximity matrix marked "?".]
How to Define Inter-Cluster Similarity
[Figure: the p1–p5 proximity matrix, asking "Similarity?" between two clusters.]
MIN
MAX
Group Average
Distance Between Centroids
Other methods driven by an objective function (Ward's Method uses squared error)
Cluster Similarity: MIN or Single Link
Similarity of two clusters is based on the two most similar (closest) points in the different clusters. It is determined by one pair of points, i.e., by one link in the proximity graph.

     I1   I2   I3   I4   I5
I1  1.00 0.90 0.10 0.65 0.20
I2  0.90 1.00 0.70 0.60 0.50
I3  0.10 0.70 1.00 0.40 0.30
I4  0.65 0.60 0.40 1.00 0.80
I5  0.20 0.50 0.30 0.80 1.00
Hierarchical Clustering: MIN
[Figure: nested clusters over points 1–6 and the corresponding single-link dendrogram, with merge heights between 0 and 0.2.]
Strength of MIN
[Figure: original points vs. the two clusters found.]
• Can handle non-elliptical shapes

Limitations of MIN
[Figure: original points vs. the two clusters found.]
• Sensitive to noise and outliers
Cluster Similarity: MAX or Complete Linkage
Similarity of two clusters is based on the two least similar (most distant) points in the different clusters. It is determined by all pairs of points in the two clusters.

     I1   I2   I3   I4   I5
I1  1.00 0.90 0.10 0.65 0.20
I2  0.90 1.00 0.70 0.60 0.50
I3  0.10 0.70 1.00 0.40 0.30
I4  0.65 0.60 0.40 1.00 0.80
I5  0.20 0.50 0.30 0.80 1.00
Hierarchical Clustering: MAX
[Figure: nested clusters over points 1–6 and the corresponding complete-link dendrogram, with merge heights between 0 and 0.4.]
Strength of MAX
[Figure: original points vs. the two clusters found.]
• Less susceptible to noise and outliers

Limitations of MAX
[Figure: original points vs. the two clusters found.]
• Tends to break large clusters
• Biased towards globular clusters
Cluster Similarity: Group Average
Proximity of two clusters is the average of pairwise proximity between points in the two clusters:

$$proximity(Cluster_i, Cluster_j) = \frac{\sum_{p_i \in Cluster_i,\; p_j \in Cluster_j} proximity(p_i, p_j)}{|Cluster_i| \cdot |Cluster_j|}$$

Need to use average connectivity for scalability, since total proximity favors large clusters.

     I1   I2   I3   I4   I5
I1  1.00 0.90 0.10 0.65 0.20
I2  0.90 1.00 0.70 0.60 0.50
I3  0.10 0.70 1.00 0.40 0.30
I4  0.65 0.60 0.40 1.00 0.80
I5  0.20 0.50 0.30 0.80 1.00
Hierarchical Clustering: Group Average
[Figure: nested clusters over points 1–6 and the corresponding group-average dendrogram, with merge heights between 0 and 0.25.]
Hierarchical Clustering: Group Average
A compromise between single and complete link.
Strengths: less susceptible to noise and outliers.
Limitations: biased towards globular clusters.
Hierarchical Clustering: Comparison
[Figure: the same six points clustered with MIN, MAX, and Group Average, showing the different groupings each method produces.]
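A hedged sketch comparing these cluster-similarity definitions on the same data with SciPy; the method names "single", "complete", and "average" correspond to MIN, MAX, and Group Average, and the two-blob data are synthetic:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 0.3, (20, 2)),   # blob around (0, 0)
                    rng.normal(3, 0.3, (20, 2))])  # blob around (3, 3)

for method in ("single", "complete", "average"):   # MIN, MAX, Group Average
    Z = linkage(points, method=method)
    labels = fcluster(Z, t=2, criterion="maxclust")
    print(method, labels[:5], labels[-5:])          # well-separated blobs: all agree
```

On messier data (noise, touching clusters) the three methods start to disagree, in the ways the preceding slides illustrate.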
Hierarchical Clustering: Problems and Limitations
Once a decision is made to combine two clusters, it cannot be undone.
Different schemes have problems with one or more of the following: sensitivity to noise and outliers; difficulty handling clusters of different sizes and convex shapes; breaking large clusters (divisive).
The dendrogram corresponding to a given hierarchical clustering is not unique, since for each merge one needs to specify which subtree goes on the left and which on the right.
They impose structure on the data, instead of revealing structure in the data.
How many clusters? (some suggestions later)
K-means Clustering
A partitional clustering approach. Each cluster is associated with a centroid (center point), and each point is assigned to the cluster with the closest centroid. The number of clusters, K, must be specified. The basic algorithm is very simple.
K-means Clustering – Details
Initial centroids are often chosen randomly, so the clusters produced vary from one run to another. The centroid is (typically) the mean of the points in the cluster. 'Closeness' is measured mostly by Euclidean distance (the typical choice), cosine similarity, correlation, etc. K-means will converge for the common similarity measures mentioned above, and most of the convergence happens in the first few iterations; often the stopping condition is changed to 'until relatively few points change clusters'. Complexity is O(n · K · I · d), where n = number of points, K = number of clusters, I = number of iterations, and d = number of attributes.
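A minimal from-scratch sketch of these iterations (assignment step, then centroid update, until the centroids stop moving); the function name and data handling are mine, and empty clusters are not handled:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Random initialization: pick k distinct points as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its closest centroid (Euclidean).
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        # (an empty cluster would yield NaN in this sketch).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):   # converged
            break
        centroids = new_centroids
    sse = ((X - centroids[labels]) ** 2).sum()      # SSE, defined on the next slide
    return labels, centroids, sse
```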
Evaluating K-means Clusters
The most common measure is the Sum of Squared Error (SSE). For each point, the error is the distance to the nearest cluster centroid; to get the SSE, we square these errors and sum them:

$$SSE = \sum_{i=1}^{K} \sum_{x \in C_i} dist(m_i, x)^2$$

Here x is a data point in cluster C_i and m_i is the representative point for cluster C_i; one can show that m_i corresponds to the center (mean) of the cluster. Given two clusterings, we can choose the one with the smaller error. One easy way to reduce SSE is to increase K, the number of clusters, yet a good clustering with smaller K can have a lower SSE than a poor clustering with higher K.
Issues and Limitations for K-means
How to choose initial centers? How to choose K? How to handle outliers? Clusters differing in shape, density, and size.
K-means assumes clusters are spherical in vector space and is sensitive to coordinate changes.
Two Different K-means Clusterings
[Figure: the same original points partitioned two ways: an optimal clustering and a sub-optimal clustering.]
Importance of Choosing Initial Centroids
[Figure: six successive K-means iterations (Iteration 1–6) from one choice of initial centroids, converging to a clustering.]
Importance of Choosing Initial Centroids …
[Figure: five successive K-means iterations (Iteration 1–5) from a different choice of initial centroids, converging to a different final clustering.]
Solutions to the Initial Centroids Problem
Multiple runs.
Sample and use hierarchical clustering to determine initial centroids.
Select more than k initial centroids and then select among these the most widely separated ones.
Bisecting K-means: not as susceptible to initialization issues.
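A hedged sketch of the "multiple runs" remedy using scikit-learn, whose KMeans supports repeated restarts via n_init and a spread-out initialization via init="k-means++" (a later refinement of the "select widely separated centroids" idea); the three-blob data are synthetic:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, (50, 2)) for c in (0, 5, 10)])

# 10 independent runs; the fit with the lowest SSE (inertia) is kept.
km = KMeans(n_clusters=3, n_init=10, init="k-means++", random_state=0).fit(X)
print(km.inertia_)   # SSE of the best of the 10 runs
```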
Bisecting K-means
The bisecting K-means algorithm is a variant of K-means that can produce a partitional or a hierarchical clustering. [Figure: bisecting K-means example.]
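A minimal sketch of bisecting K-means under its usual formulation: repeatedly split the cluster with the largest SSE using 2-means. The helper names are mine, not from the slides, and each cluster is assumed to have at least two points:

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k):
    clusters = [X]
    while len(clusters) < k:
        # Pick the cluster with the largest SSE around its own mean.
        sse = [((c - c.mean(axis=0)) ** 2).sum() for c in clusters]
        target = clusters.pop(int(np.argmax(sse)))
        # Split it in two with ordinary K-means.
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(target)
        clusters += [target[labels == 0], target[labels == 1]]
    return clusters   # the sequence of splits also yields a hierarchy
```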
Issues and Limitations for K-means
How to choose initial centers? (see the solutions above)
How to choose K? Depends on the problem + some suggestions later.
How to handle outliers? Preprocessing.
Clusters differing in shape, density, and size: illustrated below.
Limitations of K-means: Differing Sizes
[Figure: original points vs. K-means (3 clusters).]
Limitations of K-means: Differing Density
[Figure: original points vs. K-means (3 clusters).]
Limitations of K-means: Non-globular Shapes
[Figure: original points vs. K-means (2 clusters).]
Overcoming K-means Limitations
[Figures: for each of the three cases above, the original points vs. the K-means clusters when many clusters are used.]
One solution is to use many clusters: this finds parts of clusters, which then need to be put together.
K-means
Pros: simple; fast for low-dimensional data; it can find pure sub-clusters if a large number of clusters is specified.
Cons: K-means cannot handle non-globular data of different sizes and densities; K-means will not identify outliers; K-means is restricted to data which has the notion of a center (centroid).
Biclustering/Co-clustering
Two genes can have similar expression patterns only under some conditions; similarly, in two related conditions, some genes may exhibit different expression patterns. [Figure: an N-genes × M-conditions expression matrix.]
As a result, each cluster may involve only a subset of genes and a subset of conditions, which form a "checkerboard" structure.
Biclustering
In general a hard task (NP-hard). Heuristic algorithms described briefly:
Cheng & Church – deletion of rows and columns; biclusters are discovered one at a time.
Order-Preserving Submatrices (OPSM) – Ben-Dor et al.
Coupled Two-Way Clustering (CTWC) – Getz et al.
Spectral Co-clustering
Cheng and Church
Objective function for the heuristic method (to minimize):

$$H(I, J) = \frac{1}{|I||J|} \sum_{i \in I,\; j \in J} (a_{ij} - a_{iJ} - a_{Ij} + a_{IJ})^2$$

Greedy method:
Initialization: the bicluster contains all rows and columns.
Iteration: 1. Compute all a_Ij, a_iJ, a_IJ and H(I, J) for reuse. 2. Remove the row or column that gives the maximum decrease of H.
Termination: when no action will decrease H, or H ≤ δ. Mask this bicluster and continue.
Problem: removing "trivial" biclusters.
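A minimal NumPy sketch of the mean squared residue H(I, J) defined above, for a candidate bicluster given by row indices I and column indices J of an expression matrix A (all names are mine):

```python
import numpy as np

def msr(A, I, J):
    sub = A[np.ix_(I, J)]                         # the candidate bicluster
    row_means = sub.mean(axis=1, keepdims=True)   # a_iJ
    col_means = sub.mean(axis=0, keepdims=True)   # a_Ij
    overall = sub.mean()                          # a_IJ
    residue = sub - row_means - col_means + overall
    return (residue ** 2).mean()                  # H(I, J)
```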
Ben-Dor et al. (OPSM)
Model: for a condition set T and a gene g, the conditions in T can be ordered so that the expression values are sorted in ascending order (suppose the values are all unique). A submatrix A is a bicluster if there is an ordering (permutation) of T such that the expression values of all genes in G are sorted in ascending order.
Idea of the algorithm: grow partial models until they become complete models.

     t1  t2  t3  t4  t5
g1    7  13  19   2  50
g2   19  23  39   6  42
g3    4   6   8   2  10

Induced permutation: 2 3 4 1 5
Getz et al. (CTWC)
Idea: repeatedly perform one-way clustering on genes/conditions. Stable clusters of genes are used as the attributes for condition clustering, and vice versa.
Spectral Co-clustering
Main idea: normalize the two dimensions (rows and columns); use the SVD of the normalized matrix to embed the m rows and n columns together as m + n points; use k-means to cluster both types of data at once.
http://adios.tau.ac.il/SpectralCoClustering/
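A hedged usage sketch: scikit-learn ships an implementation of this scheme (sklearn.cluster.SpectralCoclustering, after Dhillon's algorithm); the random nonnegative matrix below stands in for a genes × conditions table:

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

A = np.abs(np.random.default_rng(0).normal(size=(30, 20)))  # genes x conditions

model = SpectralCoclustering(n_clusters=3, random_state=0).fit(A)
print(model.row_labels_)      # bicluster label per gene (row)
print(model.column_labels_)   # bicluster label per condition (column)
```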
Evaluating Cluster Quality
Use known classes (pairwise F-measure, best-class F-measure). Clusters can be evaluated with "internal" as well as "external" measures: internal measures are related to the inter/intra-cluster distance, while external measures are related to how representative the current clusters are of the "true" classes.
Inter/Intra Cluster Distances
Intra-cluster distance: the sum/min/max/avg of the (absolute/squared) distances between
- all pairs of points in the cluster, OR
- the centroid and all points in the cluster, OR
- the "medoid" and all points in the cluster.
Inter-cluster distance: sum the (squared) distances between all pairs of clusters, where the distance between two clusters is defined as
- the distance between their centroids/medoids (suits spherical clusters), OR
- the distance between the closest pair of points belonging to the clusters (suits chain-shaped clusters).
Davies-Bouldin Index
A function of the ratio of the sum of within-cluster (i.e., intra-cluster) scatter to between-cluster (i.e., inter-cluster) separation. Let C = {C1, …, Ck} be a clustering of a set of N objects:

$$DB = \frac{1}{k} \sum_{i=1}^{k} R_i$$

with

$$R_i = \max_{j=1..k,\; j \neq i} R_{ij}$$

and

$$R_{ij} = \frac{var(C_i) + var(C_j)}{\|c_i - c_j\|}$$

where C_i is the i-th cluster and c_i is the centroid of cluster i.
Davies-Bouldin Index Example
For the clusters shown, compute var(C1) = 0, var(C2) = 4.5, var(C3) = 2.33. The centroid is simply the mean here, so c1 = 3, c2 = 8.5, c3 = 18.33. So R12 = 1, R13 = 0.152, R23 = 0.797. Now compute R1 = 1 (max of R12 and R13), R2 = 1 (max of R21 and R23), R3 = 0.797 (max of R31 and R32). Finally, compute DB = 0.932.
Davies-Bouldin Index Example (ctd.)
For the clusters shown, there are only 2 clusters: var(C1) = 12.33 while var(C2) = 2.33; c1 = 6.67 while c2 = 18.33; so R12 = 1.26. Since we have only 2 clusters, R1 = R12 = 1.26 and R2 = R21 = 1.26. Finally, compute DB = 1.26.
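For checking such computations, scikit-learn provides davies_bouldin_score; note that its definition uses mean distances to the centroid rather than variances, so its numbers need not match the hand computation above exactly. A minimal sketch on made-up 1-D data:

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
labels = np.array([0, 0, 0, 1, 1, 1])
print(davies_bouldin_score(X, labels))   # lower is better
```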
Other Criteria
Dunn method:

$$V = \min_{1 \le i \le c} \; \min_{\substack{1 \le j \le c,\; j \neq i}} \left( \frac{\delta(X_i, X_j)}{\max_{1 \le k \le c} \Delta(X_k)} \right)$$

where δ(X_i, X_j) is the inter-cluster distance between clusters X_i and X_j, and Δ(X_k) is the intra-cluster distance of cluster X_k.
Silhouette method: also useful for identifying outliers.
C-index: compare the sum of distances S over all pairs from the same cluster against the same number of smallest and largest pairs.
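A minimal sketch of using the silhouette criterion to choose the number of clusters with scikit-learn; the three-blob data are synthetic:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, (40, 2)) for c in (0, 4, 8)])

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))   # higher is better; peaks at k = 3 here
```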
Example Dataset: AML/ALL (Golub et al.)
[Figure: genes/features × samples expression matrix.]
Leukemia: 72 patients (samples), 7129 genes, 4 groups. The two major types are ALL & AML; T and B cells within ALL; with/without treatment within AML.
AML/ALL Dataset
Davies-Bouldin index: C = 4
Dunn method: C = 2
Silhouette method: C = 2
Validity Index

Index    c = 2  c = 3  c = 4  c = 5  c = 6
V11       0.18   0.25   0.20   0.19   0.19
V21       1.21   0.69   0.58   0.49   0.47
V31       0.68   0.37   0.39   0.32   0.33
V41       0.55   0.22   0.27   0.21   0.23
V51       0.61   0.32   0.33   0.27   0.28
V61       0.93   0.52   0.53   0.42   0.40
V12       0.62   0.56   0.63   0.59   0.61
V22       4.27   2.20   1.87   1.56   1.49
V32       2.40   1.18   1.24   1.03   1.06
V42       1.95   0.70   0.85   0.66   0.73
V52       2.15   1.01   1.07   0.86   0.90
V62       3.26   1.66   1.69   1.36   1.28
V13       0.23   0.20   0.23   0.22   0.23
V23       1.55   0.80   0.69   0.57   0.55
V33       0.87   0.43   0.46   0.37   0.39
V43       0.71   0.26   0.31   0.24   0.27
V53       0.78   0.32   0.39   0.31   0.33
V63       1.18   0.61   0.63   0.50   0.47
Average   1.34   0.68   0.69   0.57   0.57
Visual Evaluation – Coherency
Cluster quality example: do you see clusters? Silhouette scores for two example data sets:

C    Silhouette (1)   Silhouette (2)
2       0.4922           0.4863
3       0.5739           0.5762
4       0.4773           0.5957
5       0.4991           0.5351
6       0.5404           0.5701
7       0.541            0.5487
8       0.5171           0.5083
9       0.5956           0.5311
10      0.6446           0.5229
Dimensionality Reduction
Map points in high-dimensional space to a lower number of dimensions, preserving structure (pairwise distances, etc.). Useful for further processing: less computation, fewer parameters; easier to understand and visualize.
Dimensionality Reduction: Feature Selection vs. Feature Extraction
Feature selection: select important features.
Pros: meaningful features; less work acquiring them.
Unsupervised examples: variance, fold change, UFF.
Dimensionality Reduction: Feature Extraction
Transforms the entire feature set to a lower dimension.
Pros: uses an objective function to select the best projection; sometimes single features are not good enough.
Unsupervised examples: PCA, SVD.
Principal Components Analysis (PCA)
Approximates a high-dimensional data set with a lower-dimensional linear subspace.
[Figure: a 2-D cloud of data points with the original axes and the first and second principal components overlaid.]
Singular Value Decomposition
Principal Components Analysis (PCA)
Rule of thumb for selecting the number of components: look for a "knee" in the screeplot, or keep components up to a target cumulative percentage of variance.
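A minimal PCA sketch with scikit-learn illustrating both rules of thumb; the correlated data are synthetic:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10)) @ rng.normal(size=(10, 10))  # correlated features

pca = PCA().fit(X)
print(pca.explained_variance_ratio_)             # screeplot values: look for a knee
print(np.cumsum(pca.explained_variance_ratio_))  # keep components up to, e.g., 90%

X_reduced = PCA(n_components=2).fit_transform(X)  # project onto the first 2 PCs
```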
Tools for Clustering
Matlab – COMPACT: http://adios.tau.ac.il/compact/
Cluster + TreeView (Eisen et al.): http://rana.lbl.gov/eisen/?page_id=42
Summary
Clustering is ill-defined and considered an "art". In practice, this means you need to understand your data beforehand and know how to interpret the clusters afterwards. The problem determines the best solution (which measure, which clustering algorithm), so try to experiment with different options.