Lecture 13 Clustering

Upload: amberly-rich

Post on 08-Jan-2018


Page 1: Lecture 13 Clustering. What is Clustering? Clustering is an EXPLORATORY statistical technique used to break up a data set into smaller groups or “clusters”

Lecture 13

Clustering


What is Clustering?

• Clustering is an EXPLORATORY statistical technique used to break up a data set into smaller groups or “clusters” with the idea that objects within a cluster are similar and objects in different clusters are different.

• It uses different distance measures between units of a group and across groups to decide which units fall in a group.


Clustering: some thoughts

• Biologists really LOVE clustering, and believe that clustering can produce “discoveries” of patterns.

• Statisticians for the most part are somewhat skeptical about these methods.

• Clustering is called “unsupervised learning” by computer scientists and “class discovery” by micro-array biologists.

• Though clustering DOES provide clusters or smaller groups, keep in mind that to interpret them you need to know the epidemiological and technical aspect of the study.


Keep in mind

• Epidemiological aspects of the study: experiment or observational study, and if the latter, knowledge of possible confounders; relation between training and test sets; relation between current data and future data to which results might be applied, …

• Technical aspects of the study: tissue collection, storage and processing procedures, microarray assays and analyses, focussing on the role of time, place, personnel, reagents, and methods used, including evidence of design, randomization, blinding, to avoid or correct for possible confounding, …

• In other contexts, and possibly in these, the results have been driven by study inadequacies rather than by biology. Beware!


Clustering in MA

• Idea: to group observations that are “similar” based on predefined criteria.

• Clustering can be applied to rows (genes) and/or columns (arrays) of an expression data matrix.

• Clustering allows for reordering of the rows/columns of an expression data matrix which is appropriate for visualization.

• Fundamentally an exploratory tool, clustering is firmly imbedded in many biologists’ minds as the statistical method for the analysis of microarray gene expression data. Worrying, but an opportunity!
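
The reordering of rows and columns mentioned above can be sketched in a few lines of R. This is only an illustration on simulated numbers: the matrix `expr` and its gene and array names are made up, and Euclidean distance with the default linkage of `hclust` is an arbitrary choice.

```r
set.seed(2)
# Hypothetical expression matrix: 10 genes (rows) by 6 arrays (columns)
expr <- matrix(rnorm(60), nrow = 10,
               dimnames = list(paste0("gene", 1:10), paste0("array", 1:6)))

# Cluster rows and columns separately and read off the leaf order
row_order <- hclust(dist(expr))$order      # genes
col_order <- hclust(dist(t(expr)))$order   # arrays

# Same data, rows and columns rearranged so similar profiles sit together
reordered <- expr[row_order, col_order]
```

The values are untouched; only their arrangement changes, which is why this is a visualization aid rather than an analysis in itself.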


Aim and End product of clustering

• Clustering leads to readily interpretable figures and can be helpful for identifying patterns in time or space, especially artifacts!

• Examples:

• We can cluster cell samples (columns), e.g. 1) for identification (profiles). Here, we might want to estimate the number of different neuronal cell types in a set of samples, based on gene expression measurements. 2) the identification of new / unknown tumor classes using gene expression profiles.

• We can cluster genes (rows) , e.g. 1) using large numbers of yeast experiments, to identify groups of co-regulated genes. These are usually interpreted and subjected to further analysis. 2) to reduce redundancy (cf. variable selection) in predictive models.

• Biologists like the latter, but it is often NOT practical to do.


Why Cluster Samples?

• There are very few formal theories about clustering, though intuitively the idea is that clusters should show internal cohesion and external isolation.

• Time-course experiments are often clustered to see if there are developmental similarities.

• Useful for visualization.

• Generally considered appropriate in typical clinical experiments.


Clustering

• How is “closeness” decided?

• For clustering we generally need two ideas:

• Distance: the measure used to quantify how far apart two points are

• Linkage: condensation of each group of observations into a single representative point


Clustering: preliminaries

• Distance or similarity measures:

• Geometric distances

• L1 (Manhattan): $d_1(x,y) = \sum_i |x_i - y_i|$

• L2 (Euclidean, ruler distance): $d_2(x,y) = \left[\sum_i (x_i - y_i)^2\right]^{1/2} = \left[(x - y)'(x - y)\right]^{1/2}$

• Standardized ruler distance: $\left[(z_1 - z_2)'(z_1 - z_2)\right]^{1/2}$, with $z_1$, $z_2$ the standardized observations

• Mahalanobis distance: $\left[(x - y)' S^{-1} (x - y)\right]^{1/2}$, with $S$ the covariance matrix

• Correlation distance: $1 - r$, where $r$ is the correlation coefficient.

• CAN HAVE WEIGHTED VERSIONS OF THESE.
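
As a rough illustration of these measures, the sketch below computes each one in base R for two made-up profiles `x` and `y`. The numbers are arbitrary, and the covariance matrix `S` used for the Mahalanobis distance is estimated from simulated reference data, which is purely an assumption for the sake of the example.

```r
# Two toy expression profiles (made-up numbers, purely illustrative)
x <- c(1.0, 2.5, 0.3, 4.1)
y <- c(0.8, 3.0, 1.1, 3.5)

d_manhattan <- sum(abs(x - y))        # L1
d_euclid    <- sqrt(sum((x - y)^2))   # L2
d_corr      <- 1 - cor(x, y)          # correlation distance

# The same two geometric distances via stats::dist (rows are observations)
m <- rbind(x, y)
dist(m, method = "manhattan")
dist(m, method = "euclidean")

# Mahalanobis needs a covariance matrix S; here it is estimated from
# simulated reference data (an assumption, not part of the lecture)
set.seed(1)
ref <- matrix(rnorm(200), ncol = 4)
S <- cov(ref)
# mahalanobis() returns the SQUARED distance, hence the square root
d_mahal <- sqrt(mahalanobis(x, center = y, cov = S))
```

Note that `dist()` works on the rows of its argument, which is why the lecture's own code transposes the data before clustering columns.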


Clustering: preliminaries

Linkage:

• Average Linkage: the distance between two groups of points is the average of all pairwise distances.

• Median Linkage: the distance between two groups of points is the median of all pairwise distances.

• Centroid method: the distance between two groups of points is the distance between the centroids of the two groups.

• Single Linkage: the distance between two groups is the smallest of all pairwise distances.

• Complete Linkage: the distance between two groups is the largest of all pairwise distances.
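
To make the linkage definitions concrete, here is a small sketch that evaluates each of them for two made-up groups of points on a line. The groups `g1` and `g2` are hypothetical, and the median linkage follows the definition given above (R's `hclust(method = "median")` uses a different, centroid-based update).

```r
# Two small groups of points on a line (made-up values)
g1 <- c(1, 2, 3)
g2 <- c(6, 8)

# All pairwise distances between members of g1 and members of g2
pairwise <- abs(outer(g1, g2, "-"))

single_link   <- min(pairwise)              # smallest pairwise distance
complete_link <- max(pairwise)              # largest pairwise distance
average_link  <- mean(pairwise)             # average of all pairwise distances
median_link   <- median(pairwise)           # median, as defined on the slide
centroid_link <- abs(mean(g1) - mean(g2))   # distance between group centroids

c(single = single_link, complete = complete_link, average = average_link,
  median = median_link, centroid = centroid_link)
```

Single and complete linkage bracket the other three, which is one reason the choice of linkage can change the shape of the resulting tree.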


Types of Clustering

• Hierarchical and Non-hierarchical methods:

• Non-Hierarchical (Partitioning): Have an initial set of cluster seed points and then build clusters around the points, using one of the distance measures. If the cluster is too large, it can split into smaller ones.

• Hierarchical: Observed data points are grouped into clusters in a nested sequence of groups.


Non-hierarchical: Partitioning methods

Partition the data into a pre-specified number k of mutually exclusive and exhaustive groups.

Iteratively reallocate the observations to clusters until some criterion is met, e.g. minimize within cluster sums of squares.

Issues: Need to know the seeds and the number of clusters to start off with. If one uses the computer to pick the seeds, the order of entry of the data may make a difference.
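
One common partitioning method is k-means, sketched below with base R's `kmeans()` on simulated data. The lecture does not name a specific algorithm, so k-means is an assumption here, as are the data and the choice `nstart = 20`, which reruns the algorithm from several random seed sets to reduce the dependence on starting points.

```r
set.seed(42)   # k-means starts from random seeds, so fix the RNG
# Simulated data: two well-separated groups of 20 points in two dimensions
pts <- rbind(matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
             matrix(rnorm(40, mean = 4, sd = 0.5), ncol = 2))

# k must be supplied; nstart = 20 tries 20 random seed sets and keeps the
# solution with the smallest total within-cluster sum of squares
km <- kmeans(pts, centers = 2, nstart = 20)

table(km$cluster)   # cluster sizes
km$tot.withinss     # the criterion being minimized
```

Every observation ends up in exactly one of the k groups, so the partition is mutually exclusive and exhaustive by construction.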


Hierarchical methods

• Hierarchical clustering methods produce a tree or dendrogram often using single-link clustering methods

• They avoid specifying how many clusters are appropriate by providing a partition for each k obtained from cutting the tree at some level.

• The tree can be built in two distinct ways:

- bottom-up: agglomerative clustering.

- top-down: divisive clustering.
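
The two ways of building a tree, and the idea of one partition per k, can be sketched as follows. The data are simulated, average linkage is an arbitrary choice, and `diana()` from the `cluster` package is used as one available divisive method, not necessarily the one the lecture has in mind.

```r
library(cluster)   # for diana()

set.seed(7)
# Simulated data: two groups of 10 points in two dimensions
pts <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 5), ncol = 2))
d <- dist(pts)

hc_agg <- hclust(d, method = "average")   # bottom-up (agglomerative)
hc_div <- as.hclust(diana(d))             # top-down (divisive)

# One tree, one partition per k: each column is the cut for that k
cuts <- cutree(hc_agg, k = 2:4)
head(cuts)
```

Because the partitions come from the same tree, they are nested: every cluster at k = 3 sits entirely inside one cluster at k = 2.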


Partitioning vs. Hierarchical

• Partitioning

Advantages

• Optimal for certain criteria.

• Genes automatically assigned to clusters.

Disadvantages

• Need initial k.

• Often require long computation times.

• All genes are forced into a cluster.

• Hierarchical

Advantages

• Faster computation.

• Visual.

Disadvantages

• Unrelated genes are eventually joined.

• Rigid, cannot correct later for erroneous decisions made earlier.

• Hard to define clusters.


Bottom-up: Agglomerative Method

• This is the most commonly used method and produces the famous tree diagram.

• VERY common in MA literature, first introduced by Eisen (1998).

• Start with n mRNA sample (or g gene) clusters.

• At each step, merge the two closest clusters using a measure of between-cluster dissimilarity which reflects the shape of the clusters.

• The distance between clusters is defined by the method used (e.g., if complete linkage, the distance is defined as the distance between furthest pair of points in the two clusters)


Example

• Suppose we have 5 genes with a distance matrix given by:

      1     2     3     4     5
1          .31   .43   .47   .23
2                .48   .47   .33
3                      .37   .46
4                            .45
5


Example

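• First we have 5 clusters:

• C0 = {[1],[2],[3],[4],[5]}

• Since 1 and 5 have the least distance (.23) they are combined, and C1 = {[1,5],[2],[3],[4]}

• Next, 2 is closest to [1,5] (.31 under single linkage, .33 under complete linkage, both below the .37 between 3 and 4), so C2 = {[1,5,2],[3],[4]}

• And C3 = {[1,5,2],[3,4]}

• And C4 = {[1,5,2,3,4]}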
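
The merge order for this distance matrix can be checked directly in R. The sketch below enters the ten distances from the example by hand and runs `hclust()`; single linkage is an assumption, since the example does not say which linkage it uses.

```r
# Distance matrix for the 5 genes in the example (upper triangle, filled
# column by column: d12; d13, d23; d14, d24, d34; d15, d25, d35, d45)
m <- matrix(0, 5, 5, dimnames = list(1:5, 1:5))
m[upper.tri(m)] <- c(.31, .43, .48, .47, .47, .37, .23, .33, .46, .45)
m <- m + t(m)
d <- as.dist(m)

hc <- hclust(d, method = "single")   # linkage choice is an assumption
hc$height                            # merge heights: 0.23 0.31 0.37 0.43

# Cluster membership of genes 1..5 after each merge
cutree(hc, k = 4)   # 1 and 5 share a cluster
cutree(hc, k = 3)   # 2 joins {1, 5}
cutree(hc, k = 2)   # 3 and 4 are merged
```

With complete linkage the merge heights change (.23, .33, .37, .48) but the sequence of partitions for this matrix stays the same.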


Dendrograms

• The dendrogram should be interpreted with care: remember that each branch of the dendrogram is really like a mobile and can rotate without altering the mathematical structure of the tree.

• Neighboring nodes are “close” ONLY if they lie on the same branch.

• It has been proposed one should slice the tree and look at the clusters produced therein. However, WHERE to cut the tree is subjective and there is no consensus about this.

• Issue: mistakes made early have no way of being corrected later in this approach.
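
The "mobile" property and the slicing of a tree can both be illustrated in R. This is a sketch on simulated points with made-up labels `g1` to `g8`; flipping every branch with `rev()` is just one of many possible rotations, and cutting at k = 3 is an arbitrary choice.

```r
set.seed(3)
# Eight simulated points with hypothetical labels g1..g8
pts <- matrix(rnorm(16), ncol = 2, dimnames = list(paste0("g", 1:8), NULL))
hc <- hclust(dist(pts), method = "average")

dend    <- as.dendrogram(hc)
flipped <- rev(dend)   # mirror every branch, like spinning a mobile

labels(dend)      # left-to-right leaf order before the flip
labels(flipped)   # reversed leaf order after the flip

# The tree-implied (cophenetic) distances do not depend on the leaf order
c1 <- as.matrix(cophenetic(dend))
c2 <- as.matrix(cophenetic(flipped))
all.equal(c1, c2[rownames(c1), colnames(c1)])

# Slicing the tree into a chosen number of clusters
cutree(hc, k = 3)
```

Two leaves that happen to be drawn side by side can therefore be far apart in the tree; only the height at which their branches join carries information.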


Two-way Clustering

• Refers to methods that use samples and genes simultaneously to extract information.

Some examples:

• Block Clustering (Hartigan, 1972), which repeatedly rearranges rows and columns to obtain the largest reduction of total within-block variance.

• Plaid Models (Lazzeroni and Owen, 2002)

• Friedman and Meulman (2002) present an algorithm to cluster samples based on subsets of attributes, i.e. each group of samples could be characterized by a different gene set.


How many clusters? A brief discussion

• Global Criteria:

1. Statistics based on within- and between-clusters matrices of sums-of-squares and cross-products (30 methods reviewed by Milligan & Cooper, 1985).

2. Average silhouette (Kaufman & Rousseeuw, 1990).

3. Graph theory (e.g.: cliques in CAST) (Ben-Dor et al., 1999).

4. Model-based methods: EM algorithm for Gaussian mixtures, Fraley & Raftery (1998, 2000) and McLachlan et al. (2001).

• Resampling methods:

1. Gap statistic (Tibshirani et al., 2000).

2. WADP (Bittner et al., 2000).

3. Clest (Dudoit & Fridlyand, 2001).

4. Bootstrap (van der Laan & Pollard, 2001).
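
As one example of a global criterion, the average silhouette width can be computed for several candidate values of k with `silhouette()` from the `cluster` package. The sketch below uses simulated data with three groups, so the answer is known in advance; the data, the average linkage, and the range of k are all assumptions made for illustration.

```r
library(cluster)   # for silhouette()

set.seed(11)
# Simulated data with three well-separated groups of 20 points each
pts <- rbind(matrix(rnorm(40, mean = 0, sd = 0.4), ncol = 2),
             matrix(rnorm(40, mean = 3, sd = 0.4), ncol = 2),
             cbind(rnorm(20, mean = 0, sd = 0.4), rnorm(20, mean = 3, sd = 0.4)))
d  <- dist(pts)
hc <- hclust(d, method = "average")

# Average silhouette width for each candidate number of clusters
ks <- 2:6
avg_sil <- sapply(ks, function(k) {
  sil <- silhouette(cutree(hc, k = k), d)
  mean(sil[, "sil_width"])
})
names(avg_sil) <- ks
avg_sil

best_k <- ks[which.max(avg_sil)]   # k with the largest average silhouette
best_k
```

On real expression data the silhouette profile is usually much flatter than in this toy case, so the chosen k should be read as a suggestion rather than an answer.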


Some remarks on clustering- 1

• Simplistically, clustering cannot fail. That is, every clustering method will return clusters, whether the data are organized in clusters or not.

• Clustering helps to group / order information and is a visualization tool for learning about the data. However, clustering results do not provide any kind of “proof” of anything.
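
The point that clustering cannot fail is easy to demonstrate. The sketch below clusters pure noise (points drawn uniformly at random, so there is no group structure by construction) and still gets back exactly the number of clusters asked for; the sample size, linkage, and k = 4 are arbitrary choices.

```r
set.seed(5)
# Pure noise: 50 points drawn uniformly in the unit square
noise <- matrix(runif(100), ncol = 2)

hc <- hclust(dist(noise), method = "complete")
groups <- cutree(hc, k = 4)

table(groups)   # four non-empty "clusters", even though none exist
```

The dendrogram for such data looks just as tidy as one for structured data, which is why the output alone proves nothing.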


Some remarks-II

• One of the more paradoxical aspects of clustering is that it gets used in biology, even when class labels are available instead of using a discrimination method.

• The idea is: it is somehow seen as less “biased” to demonstrate the ability of the data to produce the class differences without using class labels.

• When the inferred clusters largely coincide with the known classes, this is thought to “validate” the class labels.

• The illogicality and inefficiency of this process does not seem to have become widely appreciated. One sees different “classifiers” (e.g. different gene sets) compared w.r.t their ability to separate known classes, simply by inspecting the clustering they produce, rather than by building classifiers.


library(stats)
library(cluster)
# reading a comma-delimited file (10 conditions, 200 genes)
my.data1 = read.table("cluster.csv", header = TRUE, sep = ",")
# removing the first column (with the row names from the dataset)
my.data = my.data1[, -1]
par(mfrow = c(1, 3))
# clustering using euclidean, manhattan, correlation distance with
# average, single, complete linkage
dist1 = dist(t(my.data), method = "euclidean")
hc1 = hclust(dist1, "ave")
plot(hc1)
dist2 = dist(t(my.data), method = "manhattan")
hc2 = hclust(dist2, "single")
plot(hc2)
# correlation-based distance between columns: (1 - r)/2 lies in [0, 1]
dd <- as.dist((1 - cor(my.data))/2)
round(1000 * dd) # (prints more nicely)
hc3 = hclust(dd, method = "complete") # to see a dendrogram of clustered variables
plot(hc3)


[Figure: three cluster dendrograms of the columns C2 to C11, y-axis "Height". Panels: hclust (*, "average") on dist1; hclust (*, "single") on dist2; hclust (*, "complete") on dd.]