Page 1: Lab 5: Cluster Analysis Using R and Bioconductormaster.bioconductor.org/help/course-materials/2003/... · In this lab we introduce you to various notions of distance and to some of

Lab 5: Cluster Analysis Using R and Bioconductor

June 4, 2003

Introduction

In this lab we introduce you to various notions of distance and to some of the clustering algorithms that are available in R. The lab comes in two parts: in the first we consider different distance measures, and in the second we consider the clustering methods. It is worth mentioning that all clustering and machine learning methods rely on some measure of distance. Almost all implementations use some measure as a default (usually Euclidean distance). As a user of this software you need to decide which distance measures are of interest or relevant for your situation, and then ensure that the clustering methods use the distance metric that you believe is appropriate.
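To illustrate why the choice of distance matters, here is a minimal sketch on made-up profiles (not the lab data). It compares Euclidean distance with the correlation distance 1 - r that is used later in this lab. The profiles `a`, `b`, and `cc` are hypothetical.

```r
# Three made-up expression profiles over 6 samples (hypothetical values).
a  <- c(1, 2, 3, 4, 5, 6)
b  <- 10 * a               # same shape as a, much larger scale
cc <- c(6, 5, 4, 3, 2, 1)  # similar scale to a, opposite trend

profiles <- rbind(a = a, b = b, c = cc)

# Euclidean distance between rows (the default method in dist()).
euc <- as.matrix(dist(profiles))

# Correlation distance between rows: 1 - Pearson correlation.
cord <- 1 - cor(t(profiles))

print(round(euc, 2))
print(round(cord, 2))
```

Under Euclidean distance `a` is closer to `c` than to `b`, whereas under correlation distance `a` and `b` are at distance 0 and `a` and `c` are at the maximum distance of 2. Which behavior is preferable depends on whether scale or shape matters for the question at hand.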

The examples in this section rely on the data reported in Golub et al. (1999). The basic comparisons here are between B-cell derived ALL, T-cell derived ALL, and AML. In some cases we will just compare ALL to AML, while in others comparisons among the three groups will be considered.

> library(Biobase)
> library(annotate)
> library(golubEsets)
> library(genefilter)
> library(ellipse)
> library(lattice)
Loading required package: grid
> library(cluster)
> library(mva)

We transform the data in much the same way that Golub et al. (1999) did. The sequence of commands needed is given below.

> data(golubTrain)
> X <- exprs(golubTrain)
> # Floor the intensities at 100 and cap them at 16000
> X[X < 100] <- 100
> X[X > 16000] <- 16000
> # Filter: keep a gene if max/min > r and max - min > d
> mmfilt <- function(r = 5, d = 500, na.rm = TRUE) {
+     function(x) {
+         minval <- min(x, na.rm = na.rm)
+         maxval <- max(x, na.rm = na.rm)
+         (maxval/minval > r) && (maxval - minval > d)
+     }
+ }
> mmfun <- mmfilt()
> ffun <- filterfun(mmfun)
> sub <- genefilter(X, ffun)
> sum(sub)
[1] 3051
> # Keep the selected genes and move to the log2 scale
> X <- X[sub, ]
> X <- log2(X)
> golubTrainSub <- golubTrain[sub, ]
> golubTrainSub@exprs <- X
> # Sample labels: ALL/AML status combined with T-cell/B-cell status
> Y <- golubTrainSub$ALL.AML
> Y <- paste(golubTrain$ALL.AML, golubTrain$T.B.cell)
> Y <- sub("NA", "", Y)
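The same three steps (thresholding, min/max filtering, log transformation) can be sketched in base R alone. This is only an illustration on simulated intensities, assuming a 200 by 10 matrix named `raw`; it does not use `genefilter` and does not reproduce the 3,051 genes above.

```r
set.seed(1)
# Hypothetical raw intensities: 200 genes by 10 samples (not the Golub data).
raw <- matrix(rlnorm(200 * 10, meanlog = 6, sdlog = 1.5), nrow = 200)

# Step 1: floor at 100 and cap at 16000, as in the lab.
X0 <- pmin(pmax(raw, 100), 16000)

# Step 2: keep genes with max/min > 5 and max - min > 500.
rowMax <- apply(X0, 1, max)
rowMin <- apply(X0, 1, min)
keep <- (rowMax / rowMin > 5) & (rowMax - rowMin > 500)

# Step 3: log2-transform the retained genes.
Xf <- log2(X0[keep, , drop = FALSE])

sum(keep)   # number of genes passing the filter
dim(Xf)
```

Because of the floor at 100, the ratio max/min is always well defined, which is one reason the thresholding is applied before the filter.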

Now that the data are set up, we are ready to compute distances between tumor samples and apply clustering procedures to these samples.

1 Distances

We first compute the correlation distance between the 38 tumor samples from the training set, using all 3,051 genes retained after the above filtering procedure.

> r <- cor(X)
> dimnames(r) <- list(as.vector(Y), as.vector(Y))
> d <- 1 - r

We rely on the plotcorr function from the ellipse package and the levelplot function from lattice for displaying this 38 × 38 distance matrix.

> plotcorr(r, main = "Leukemia data: Correlation matrix for 38 mRNA samples\n All 3,051 genes")

[Figure: Leukemia data: Correlation matrix for 38 mRNA samples, all 3,051 genes. Display produced by plotcorr; rows and columns are labeled by tumor class (ALL B-cell, ALL T-cell, AML).]

> plotcorr(r, numbers = TRUE, main = "Leukemia data: Correlation matrix for 38 mRNA samples\n All 3,051 genes")

[Figure: Leukemia data: Correlation matrix for 38 mRNA samples, all 3,051 genes. Same display with numbers = TRUE: each cell shows a rounded correlation as a number (10 on the diagonal) instead of an ellipse.]

> levelplot(r, col.regions = heat.colors(50), main = "Leukemia data: Correlation matrix for 38 mRNA samples\n All 3,051 genes")

We can see in these plots that there is not a strong grouping of the ALL and AML samples. We now consider selecting genes that are good differentiators between the two groups. We use a t-test filter and require that the p-value be below 0.01 for selection (no adjustment for multiple testing has been made).

> gtt <- ttest(golubTrainSub$ALL, p = 0.01)
> gf1 <- filterfun(gtt)
> whT <- genefilter(golubTrainSub, gf1)
> gTrT <- golubTrainSub[whT, ]

Now we can repeat the above computations and plots using the data in gTrT. There were 609 genes selected using the t-test criterion above.

> rS <- cor(exprs(gTrT))
> dimnames(rS) <- list(gTrT$ALL, gTrT$ALL)
> dS <- 1 - rS
> plotcorr(rS, main = "Leukemia data: Correlation matrix for 38 mRNA samples\n 609 genes")
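The t-test filter can be sketched in base R as a row-wise two-sample test. The example below uses simulated log-expression values with the group sizes of the training set (27 ALL, 11 AML). It relies on `t.test` with its default Welch correction, which is an assumption and may differ from what `ttest` in genefilter does. The matrix `expr` and the shift of 2 for the first 50 genes are made up.

```r
set.seed(2)
n_genes <- 500
grp <- factor(rep(c("ALL", "AML"), times = c(27, 11)))

# Simulated log-expression: the first 50 genes differ between groups
# by construction, the remaining 450 do not.
expr <- matrix(rnorm(n_genes * length(grp)), nrow = n_genes)
expr[1:50, grp == "AML"] <- expr[1:50, grp == "AML"] + 2

# Two-sample t-test p-value for each gene (row).
pvals <- apply(expr, 1, function(x) t.test(x ~ grp)$p.value)

selected <- pvals < 0.01
sum(selected)

# Expected number of false positives if all 3,051 genes were null
# and tested at the unadjusted level 0.01.
3051 * 0.01
```

The last line is a reminder of what "no adjustment" means in practice: with 3,051 tests at level 0.01, roughly 30 genes would be expected to pass the filter by chance alone.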


2 Multidimensional Scaling

We now turn to multidimensional scaling (MDS) for representing the tumor sample distance matrix. MDS is a data reduction method that is appropriate for general distance matrices. It is closely related to principal component analysis (PCA), but the two coincide only for Euclidean distances. Given an n × n distance matrix, MDS is concerned with identifying n points in Euclidean space with a similar distance structure. The purpose is to provide a lower dimensional representation of the distances that better conveys the relationships between objects, such as the existence of clusters or one-dimensional structure in the data (e.g., seriation). In most cases we would like to find a two- or three-dimensional reduction that can easily be visualized.
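The relationship between classical MDS and PCA for Euclidean distances can be checked numerically. The following is a small sketch on simulated data (the matrix `x` is hypothetical), comparing `cmdscale` applied to `dist(x)` with the first two principal component scores from `prcomp`.

```r
set.seed(3)
# Simulated data: 20 observations in 5 dimensions (hypothetical).
x <- matrix(rnorm(20 * 5), nrow = 20)

# Classical MDS on Euclidean distances, two dimensions.
mds_xy <- cmdscale(dist(x), k = 2)

# First two principal component scores (centred, unscaled).
pca_xy <- prcomp(x, center = TRUE, scale. = FALSE)$x[, 1:2]

# Each axis is determined only up to sign, so compare absolute values.
max(abs(abs(mds_xy) - abs(pca_xy)))
```

The maximum difference is zero up to floating point error. With a non-Euclidean dissimilarity, such as the correlation distance 1 - r used in this lab, no such equivalence with PCA on the original data should be expected.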

> mds <- cmdscale(d, k = 2, eig = TRUE)
> plot(mds$points, type = "n", xlab = "", ylab = "", main = "MDS for ALL AML data, correlation matrix, G=3,051 genes, k=2")
> text(mds$points[, 1], mds$points[, 2], Y, col = as.integer(factor(Y)) +
+     1, cex = 0.8)

[Figure: MDS for ALL AML data, correlation matrix, G=3,051 genes, k=2. Samples are plotted at their two MDS coordinates and labeled by tumor class.]

> mds3 <- cmdscale(d, k = 3, eig = TRUE)
> pairs(mds3$points, main = "MDS for ALL AML data, correlation matrix, G=3,051 genes, k=3",
+     pch = c("B", "T", "M")[as.integer(factor(Y))], col = as.integer(factor(Y)) +
+         1)

[Figure: MDS for ALL AML data, correlation matrix, G=3,051 genes, k=3. Pairs plot of the three MDS coordinates (var 1, var 2, var 3); the plotting symbols B, T, and M mark ALL B-cell, ALL T-cell, and AML samples.]

> if (require(rgl) && interactive()) {
+     rgl.spheres(x = mds3$points[, 1], mds3$points[, 2], mds3$points[,
+         3], col = as.integer(factor(Y)) + 1, radius = 0.01)
+ }
Loading required package: rgl

Note that if we are measuring distances between samples on the basis of G = 3,051 probes, then we are essentially looking at points in 3,051-dimensional space. As with PCA, the quality of the representation will depend on the magnitude of the first k eigenvalues. To assess how many components there are, a scree plot similar to that used for principal components can be created. This plot of the eigenvalues suggests that much of the information is contained in the first component. One might also consider using three or four components.

> mdsScree <- cmdscale(d, k = 8, eig = TRUE)
> plot(mdsScree$eig, pch = 18, col = "blue")

[Figure: scree plot of the eigenvalues mdsScree$eig against their index, 1 to 8.]
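Alongside a scree plot, one can look at the cumulative share of the eigenvalues. The sketch below does this on simulated data with a correlation distance (the objects `x`, `dd`, and `fit` are hypothetical). Restricting to the positive eigenvalues is one possible convention, chosen here because a dissimilarity of the form 1 - r need not be Euclidean.

```r
set.seed(4)
x  <- matrix(rnorm(30 * 10), nrow = 30)  # hypothetical data, 30 observations
dd <- as.dist(1 - cor(t(x)))             # correlation distance between rows

fit <- cmdscale(dd, k = 8, eig = TRUE)

# Some eigenvalues may be negative for a non-Euclidean dissimilarity;
# here only the positive ones enter the proportions.
ev  <- fit$eig
pos <- ev[ev > 0]
cum_share <- cumsum(pos) / sum(pos)

round(head(cum_share, 8), 2)
```

A sharp early rise in `cum_share` suggests that a few dimensions capture most of the distance structure; a slow rise suggests that a two-dimensional picture should be read with caution.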

EDD and ROC

In this section we consider a couple of other tools for visualizing genomic data. The first is edd, which stands for expression density diagnostics. The idea behind edd is to look for genes whose patterns of expression are quite different in two groups of interest.

To assess whether the patterns of expression differ, edd transforms all expression values to the range [0, 1] and then performs density estimation on them. Thus it is looking for differences in shape and ignores differences in scale or location. In some situations location and scale are more important (however, there are many different tools that can be used in those situations).

In this case we will use edd to compare those patients with AML to those with B-cell derived ALL. We will rely on the merged data set so that we have a number of samples from each group (38 ALLs and 25 AMLs).

Once we have subset the data to the appropriate cases, we next perform some non-specific filtering to remove any genes that show little variation across samples. This is an important step when using edd. Since edd standardizes the genes (across samples) to give approximately common centering and scaling, non-informative genes will not be distinguishable from informative ones. Therefore, we must explicitly remove the non-informative ones before applying edd.

> gMs <- golubMerge[, (golubMerge$T.B != "T-cell" | is.na(golubMerge$T.B))]
> fF1 <- gapFilter(100, 200, 0.1)
> ff <- filterfun(fF1)
> whgMs <- genefilter(gMs, ff)
> sum(whgMs)
[1] 3082
> gMs <- gMs[whgMs, ]
> library(edd)
Loading required package: nnet
Loading required package: class
Loading required package: xtable

We next split the data into two expression sets, one with the AML and the other with the B-cell ALL samples. For each set we use edd to classify each gene into one of a set of predetermined shapes.

We will use edd to perform the classification and will use the default values of most of the parameters. That means that we will be using multiCand for candidate comparison and knn for classification.

When multiCand is selected, edd draws 100 samples from each of the eight candidate densities. Each sample is scaled to have median 0 and MAD 1. These constitute the reference samples; for each gene we perform the same transformation and then compare the resulting data to the reference samples. Any machine learning technique can in principle be used to make this comparison. In this example we will use the default classifier, which is knn.

The density, across samples, for each gene is found and scaled to have median 0 and MAD 1 (as for the reference samples).
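The median/MAD scaling described above is easy to sketch in base R. The helper `med_mad_scale` and the two simulated genes are hypothetical, and the use of R's `mad` with its default consistency constant is an assumption; the text does not state which constant edd uses.

```r
set.seed(5)
# Two hypothetical genes with the same (normal) shape but different
# location and scale across 40 samples.
g1 <- rnorm(40, mean = 8,  sd = 0.5)
g2 <- rnorm(40, mean = 12, sd = 3)

# Centre by the median and scale by the MAD.
med_mad_scale <- function(x) (x - median(x)) / mad(x)

z1 <- med_mad_scale(g1)
z2 <- med_mad_scale(g2)

# After scaling, both genes have median 0 and MAD 1.
c(median(z1), mad(z1), median(z2), mad(z2))
```

After this transformation only the shape of each distribution is left to compare, which is why genes with essentially no signal must be filtered out beforehand: their noise is rescaled to the same spread as everything else.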

> gMsALL <- gMs[, gMs$ALL == "ALL"]
> gMsAML <- gMs[, gMs$ALL == "AML"]
> set.seed(1234)
> edd.ALL <- edd(gMsALL, method = "knn", k = 4, l = 2)
> sum(is.na(edd.ALL))
[1] 29
> edd.AML <- edd(gMsAML, method = "knn", k = 4, l = 2)
> sum(is.na(edd.AML))
[1] 71

> table(edd.ALL, edd.AML)
                     edd.AML
edd.ALL               .25N(0,1)+.75N(4,1) .75N(0,1)+.25N(4,1) B(2,8) B(8,2)
  .25N(0,1)+.75N(4,1)                   0                   4      3      3
  .75N(0,1)+.25N(4,1)                   2                  28     69     10
  B(2,8)                                3                  88    243     40
  B(8,2)                                1                   5      8      9
  logN(0,1)                             1                  45    102     14
  N(0,1)                                4                  27     94     34
  t(3)                                 13                  63    103     50
  U(0,1)                                0                   6      4      2
  X^2(1)                                1                  42     48      5
                     edd.AML
edd.ALL               logN(0,1) N(0,1) t(3) U(0,1) X^2(1)
  .25N(0,1)+.75N(4,1)         3      5   10      0      1
  .75N(0,1)+.25N(4,1)        20     40   40     17     15
  B(2,8)                     66    167  150     58     46
  B(8,2)                      3     20   11      4      1
  logN(0,1)                  79     60   70     15     58
  N(0,1)                     20    118   74     30     17
  t(3)                       60    116  153     38     28
  U(0,1)                      3      9    5      3      0
  X^2(1)                     68     25   29     11     45

For 29 of the genes in the ALL group no classification was made, so properly they should be classified as doubt. For the AML group 71 genes were classified as doubt. We can see from the resulting table that there are a number of genes for which the ALL classification is a mixture (mix1) while the AML classification is a Normal random variable. To explore these further we will need to make use of some plots.

> diff1 <- edd.AML == "N(0,1)" & edd.ALL == ".75N(0,1)+.25N(4,1)"
> table(diff1)
diff1
FALSE  TRUE
 3021    40
> diff1[is.na(diff1)] <- FALSE
> AMLexprs <- exprs(gMsAML)[diff1, ]
> ALLexprs <- exprs(gMsALL)[diff1, ]

Now we have the two sets of expression data and we can see what the differences are.

> tAML <- fq.matrows(t(apply(AMLexprs, 1, centerScale)))
> tALL <- fq.matrows(t(apply(ALLexprs, 1, centerScale)))
> par(mfrow = c(2, 2))
> hist(AMLexprs[1, ])
> hist(ALLexprs[1, ])
> hist(AMLexprs[2, ])
> hist(ALLexprs[2, ])
> par(mfrow = c(1, 1))

ROC

Receiver operating characteristic (ROC) curves are a tool that can be used to assess the classification capabilities of a covariate. In situations where there are two classes (for example, AML versus ALL), one can consider splitting the data into two groups depending on the value of the covariate. The relative proportions of AML and ALL in the two resulting groups can be used to assess the capability of that covariate for splitting the data.
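As a concrete illustration of the idea, the sketch below builds an ROC curve by hand for a simulated covariate. The vectors `cls` and `score` are made up (27 ALL and 11 AML samples, as in the training set), and calling a sample "AML" when the score exceeds a threshold is an arbitrary choice of direction.

```r
set.seed(6)
# Hypothetical covariate: expression of one gene in 27 ALL and 11 AML samples.
cls   <- rep(c("ALL", "AML"), times = c(27, 11))
score <- c(rnorm(27, mean = 0), rnorm(11, mean = 1.5))

# For each threshold th, call a sample "AML" when score >= th.
thresholds <- sort(unique(c(-Inf, score, Inf)), decreasing = TRUE)
tpr <- sapply(thresholds, function(th) mean(score[cls == "AML"] >= th))
fpr <- sapply(thresholds, function(th) mean(score[cls == "ALL"] >= th))

# Area under the curve by the trapezoidal rule.
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
auc

# Draw the curve only in an interactive session.
if (interactive()) {
    plot(fpr, tpr, type = "s", xlab = "False positive rate",
         ylab = "True positive rate", main = "ROC curve (simulated covariate)")
    abline(0, 1, lty = 2)
}
```

Each threshold splits the samples into two groups, and the pair (false positive rate, true positive rate) summarizes the proportions of ALL and AML on the "AML" side of the split. An area near 1 indicates a covariate that separates the classes well; an area near 0.5 indicates one that does no better than chance.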

Hierarchical Clustering

We can use the hclust function in the mva package for agglomerative hierarchical clustering. We consider the following agglomeration methods, i.e., methods for computing between-cluster distances:

Single linkage: The distance between two clusters is the minimum distance between any two objects, one from each cluster.

Average linkage: The distance between two clusters is the average of all pairwise distances between an object in one cluster and an object in the other.

Complete linkage: The distance between two clusters is the maximum distance between two objects, one from each cluster.
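The three definitions can be computed directly from a distance matrix. The following sketch uses five made-up points on a line, split into two hypothetical clusters `clusterA` and `clusterB`.

```r
# Five hypothetical points on a line, split into two clusters.
pts <- c(a = 0, b = 1, c = 2, d = 6, e = 9)
D <- as.matrix(dist(pts))

clusterA <- c("a", "b", "c")
clusterB <- c("d", "e")

# All pairwise distances with one object from each cluster.
between <- D[clusterA, clusterB]
between

c(single   = min(between),
  average  = mean(between),
  complete = max(between))
```

Here the single, average, and complete linkage distances are 4, 6.5, and 9 respectively. All three are computed from the same six pairwise distances, which is why the choice of linkage can change the resulting tree even though the underlying distance matrix is fixed.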

The nested sequence of clusters resulting from hierarchical clustering methods can be represented graphically using a dendrogram. A dendrogram is a binary tree diagram in which the terminal nodes, or leaves, represent individual observations and in which the height of the nodes reflects the dissimilarity between the two clusters that are joined. While dendrograms are quite appealing because of their apparent ease of interpretation, they can be misleading.

First, the dendrogram corresponding to a given hierarchical clustering is not unique, since for each merge one needs to specify which subtree should go on the left and which on the right. For n observations, there are n − 1 merges and hence 2^(n − 1) possible orderings of the terminal nodes. Therefore, 2^(n − 1) different dendrograms are consistent with any given sequence of hierarchical clustering operations. The ordering is of practical importance because it will result in different genes being placed next to each other in the heat-maps and possibly different interpretations of the same data. A number of heuristics have been suggested for ordering the terminal nodes in a dendrogram. The default in the R function hclust from the mva package is to order the subtrees so that the tighter cluster is on the left (i.e., the most recent merge of the left subtree is at a lower value than the last merge of the right subtree).
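The non-uniqueness of the leaf ordering can be seen on a small example. The sketch below, on six simulated observations (the matrix `x` is hypothetical), flips left and right at every merge with `rev` and checks that the tree itself is unchanged.

```r
set.seed(7)
# Six hypothetical observations in two dimensions.
x <- matrix(rnorm(12), nrow = 6, dimnames = list(letters[1:6], NULL))
hc <- hclust(dist(x), method = "average")

dend    <- as.dendrogram(hc)
flipped <- rev(dend)   # swap left and right at every merge

# The leaf order differs ...
labels(dend)
labels(flipped)

# ... but both drawings encode the same tree: the cophenetic
# dissimilarities agree once the labels are aligned.
c1 <- as.matrix(cophenetic(dend))[rownames(x), rownames(x)]
c2 <- as.matrix(cophenetic(flipped))[rownames(x), rownames(x)]
all.equal(c1, c2)

# Number of left/right orderings for n leaves: 2^(n - 1).
2^(nrow(x) - 1)
```

For the 38 tumor samples in this lab the same count is 2^37, so the particular left-to-right order seen in any one plot should not be over-interpreted.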

A second, and perhaps less recognized, shortcoming of dendrograms is that they impose structure on the data, instead of revealing structure in the data. Indeed, dendrograms are often viewed as graphical summaries of the data, rather than of the results of the clustering algorithm. Such a representation will be valid only to the extent that the pairwise dissimilarities possess the hierarchical structure imposed by the clustering algorithm. Note in particular that the same matrix of pairwise distances between observations will be represented by a different dendrogram depending on the distance function that is used to compute between-cluster distances (e.g., complete or average linkage). The cophenetic correlation coefficient can be used to measure how well the hierarchical structure from the dendrogram represents the actual distances. This measure is defined as the correlation between the n(n − 1)/2 pairwise dissimilarities between observations and their cophenetic dissimilarities from the dendrogram, i.e., the between-cluster dissimilarities at which two observations are first joined together in the same cluster.

Next, we apply three agglomerative hierarchical clustering methods to cluster tumor samples. For each method, we plot a dendrogram, compute the corresponding cophenetic correlation, and compare the three clusters obtained from cutting the tree into three branches to the actual tumor labels (ALL B-cell, ALL T-cell, and AML).

> hc1 <- hclust(as.dist(d), method = "average")
> coph1 <- cor(cophenetic(hc1), as.dist(d))
> plot(hc1, main = paste("Dendrogram for ALL AML data: Coph = ",
+     round(coph1, 2)), sub = "Average linkage, correlation matrix, G=3,051 genes")
> cthc1 <- cutree(hc1, 3)
> table(Y, cthc1)
            cthc1
Y             1  2  3
  ALL B-cell 16  2  1
  ALL T-cell  8  0  0
  AML         0 11  0

[Figure: Dendrogram for ALL AML data: Coph = 0.74. Average linkage, correlation matrix, G=3,051 genes. Vertical axis: Height; leaves labeled by tumor class.]

> hc2 <- hclust(as.dist(d), method = "single")
> coph2 <- cor(cophenetic(hc2), as.dist(d))
> plot(hc2, main = paste("Dendrogram for ALL AML data: Coph = ",
+     round(coph2, 2)), sub = "Single linkage, correlation matrix, G= 3,051 genes")
> cthc2 <- cutree(hc2, 3)
> table(Y, cthc2)
            cthc2
Y             1  2  3
  ALL B-cell 17  1  1
  ALL T-cell  8  0  0
  AML        11  0  0

[Figure: Dendrogram for ALL AML data: Coph = 0.61. Single linkage, correlation matrix, G= 3,051 genes. Vertical axis: Height; leaves labeled by tumor class.]

> hc3 <- hclust(as.dist(d), method = "complete")
> coph3 <- cor(cophenetic(hc3), as.dist(d))
> plot(hc3, main = paste("Dendrogram for ALL AML data: Coph = ",
+     round(coph3, 2)), sub = "Complete linkage, correlation matrix, G= 3,051 genes")
> cthc3 <- cutree(hc3, 3)
> table(Y, cthc3)
            cthc3
Y             1  2  3
  ALL B-cell 18  1  0
  ALL T-cell  8  0  0
  AML         5  0  6

[Figure: Dendrogram for ALL AML data: Coph = 0.7. Complete linkage, correlation matrix, G= 3,051 genes. Vertical axis: Height; leaves labeled by tumor class.]

Divisive hierarchical clustering is available using diana from the cluster package.

> di1 <- diana(as.dist(d))
> cophdi <- cor(cophenetic(di1), as.dist(d))
> plot(di1, which.plots = 2, main = paste("Dendrogram for ALL AML data: Coph = ",
+     round(cophdi, 2)), sub = "Divisive algorithm, correlation matrix, G= 3,051 genes")
> ct.di <- cutree(di1, 3)
> table(Y, ct.di)
            ct.di
Y             1  2  3
  ALL B-cell 16  2  1
  ALL T-cell  1  7  0
  AML         0 11  0

[Figure: Dendrogram for ALL AML data: Coph = 0.56. Divisive algorithm, correlation matrix, G= 3,051 genes. Vertical axis: Height; leaves labeled by tumor class.]

3 Partitioning Methods

Here we apply the partitioning around medoids (PAM) method to cluster tumor mRNA samples.

> set.seed(12345)
> pm3 <- pam(as.dist(d), k = 3, diss = TRUE)
> table(Y, pm3$clustering)

Y             1  2  3
  ALL B-cell 16  3  0
  ALL T-cell  0  1  7
  AML         0 11  0
> clusplot(d, pm3$clustering, diss = TRUE, labels = 3, col.p = 1,
+     col.txt = as.integer(factor(Y)) + 1, main = "Bivariate cluster plot for ALL AML data\n Correlation matrix, K=3, G=3,051 genes")

[Figure: Bivariate cluster plot for ALL AML data, correlation matrix, K=3, G=3,051 genes. Axes: Component 1 and Component 2; points labeled by tumor class. These two components explain 35.9 % of the point variability.]

> plot(pm3, which.plots = 2, main = "Silhouette plot for ALL-AML Data")

One of the more difficult questions that arises when clustering data is the assessment of how many clusters there are in the data. We now consider using PAM with four clusters.

> set.seed(12345)
> pm4 <- pam(as.dist(d), k = 4, diss = TRUE)
> clusplot(d, pm4$clustering, diss = TRUE, labels = 3, col.p = 1,
+     col.txt = as.integer(factor(Y)) + 1, main = "Bivariate cluster plot for ALL AML data\n Correlation matrix, K=4, G=3,051 genes")

[Figure: Bivariate cluster plot for ALL AML data, correlation matrix, K=4, G=3,051 genes. Axes: Component 1 and Component 2; points labeled by tumor class. These two components explain 35.9 % of the point variability.]
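One common heuristic for the question of how many clusters there are is to compare the average silhouette width of the PAM solutions for several values of k. The sketch below does this on simulated data with three well-separated groups (the objects `centers`, `x`, and `dd` are hypothetical); it is not a result for the leukemia data, and the silhouette is only one of several possible criteria.

```r
library(cluster)

set.seed(8)
# Hypothetical data: three well-separated groups of 15 points in the plane.
centers <- rbind(c(0, 0), c(10, 0), c(5, 8.66))
x <- do.call(rbind, lapply(1:3, function(i)
    cbind(rnorm(15, centers[i, 1]), rnorm(15, centers[i, 2]))))
dd <- dist(x)

# Average silhouette width of the PAM solution for k = 2, ..., 5.
ks <- 2:5
avg_width <- sapply(ks, function(k)
    pam(dd, k = k, diss = TRUE)$silinfo$avg.width)
names(avg_width) <- ks
round(avg_width, 2)

# Heuristic: prefer the k with the largest average silhouette width.
ks[which.max(avg_width)]
```

The same loop could be run with `as.dist(d)` from this lab in place of `dd`, which would complement the silhouette plot produced by `plot(pm3, which.plots = 2)` above.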