clustering (gene expression data) 6.095/6.895 - computational biology: genomes, networks, evolution...


Post on 21-Dec-2015


Page 1: Clustering (Gene Expression Data) 6.095/6.895 - Computational Biology: Genomes, Networks, Evolution LectureOctober 4, 2005

Clustering (Gene Expression Data)

6.095/6.895 - Computational Biology: Genomes, Networks, Evolution

Lecture October 4, 2005

Page 2

Challenges in Computational Biology

[Figure: course overview diagram built around a DNA sequence (TCATGCTATTCGTGATAATGAGGATATTTATCATATTTATGATTT) and its RNA transcript. The topic labels are listed below with the numbers that survive from the diagram; an unattached 11 also appears.]

• 4 Genome Assembly
• Gene Finding
• Regulatory motif discovery
• Database lookup
• Sequence alignment
• 7 Evolutionary Theory
• 8 RNA folding
• 9 Gene expression analysis
• 10 Cluster discovery
• Gibbs sampling
• 12 Protein network analysis
• 13 Regulatory network inference
• 14 Emerging network properties
• Comparative Genomics

Page 3

Plan

• Gene Expression Data/DNA Microarrays

• Feature selection and Clustering

Page 4

DNA Microarrays

• To measure levels of messages (RNA transcripts) in a cell:
  – Construct an array with DNA sequences for multiple genes
  – Hybridize each RNA in your sample to a sequence in your array (all sequences from the same gene hybridize to the same spot)
  – Measure the number of hybridizations for each spot

[Figure: an array with one spot per gene (DNA 1 to DNA 6, for Gene 1 to Gene 6). RNA 4 and RNA 6 from the sample are reverse-transcribed (RT) into cDNA 4 and cDNA 6, hybridized to the array, and measured.]

Page 5

Result

• 6000 genes in one shot
• Entire transcriptome observable in one experiment
• Can perform multiple experiments under varying conditions:
  – Temperature
  – Time
  – Sugar level
  – Other chemicals
  – Gene knock-outs
  – …

Page 6

Noise

• Sources of noise:
  – Cross-hybridization
  – Non-uniform hybridization kinetics
  – Non-linearity of array response to concentration
  – Non-linear amplification
  – Improper probe sequence
  – Difference in materials/procedures

Page 7

Expression Value Normalization

• Noise model: $y_{ij} = n_i \alpha_{ij} (c_j t_{ij}) + \varepsilon_{ij}$
  – $y_{ij}$: observed level for gene j on chip i
  – $t_{ij}$: true level
  – $c_j$: gene constant
  – $n_i$: multiplicative chip normalization
  – $\alpha_{ij}$, $\varepsilon_{ij}$: multiplicative and additive noise terms
• Estimating the parameters:
  – $n_i$: spiked-in control probes, not present in the genome studied
  – $c_j$: control experiments of known concentrations for gene j
  – $\varepsilon_{ij}$: un-spiked control probes should read zero
  – $\alpha_{ij}$: spiked controls that are constant across chips
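A minimal sketch of how the estimates above could be combined for one chip. It assumes the multiplicative noise is 1 on average and that the additive term is a single background value per chip; neither simplification is stated on the slide. The function name, the argument names, and the spike reference level are hypothetical.

```python
from statistics import mean

def normalize_chip(readings, spiked, unspiked, spike_level, gene_constants):
    """Estimate true levels t_ij on one chip i from observed levels y_ij.

    Assumes y_ij = n_i * c_j * t_ij + eps_i, i.e. alpha_ij is taken as 1 and
    the additive noise is one background value per chip (both assumptions).
    Spiked controls are assumed to have gene constant 1.
    """
    # eps_i: un-spiked control probes should read zero, so their mean is background.
    eps = mean(readings[p] for p in unspiked)
    # n_i: spiked-in controls have a known level, so their ratio gives the chip scale.
    n = mean(readings[p] - eps for p in spiked) / spike_level
    # Invert the model for every gene with a known gene constant c_j.
    return {g: (readings[g] - eps) / (n * c) for g, c in gene_constants.items()}

# Hypothetical chip generated with n_i = 2 and eps_i = 5.
readings = {"geneA": 25.0, "geneB": 13.0, "spike1": 205.0, "spike2": 205.0,
            "blank1": 5.0, "blank2": 5.0}
levels = normalize_chip(readings, ["spike1", "spike2"], ["blank1", "blank2"],
                        spike_level=100.0,
                        gene_constants={"geneA": 1.0, "geneB": 0.5})
print(levels)
```

In this toy chip the recovered levels are the ones used to generate the readings (10 for geneA, 8 for geneB), because the data contain no random noise.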

Page 8

Gene expression data

• For each gene j we have a vector $t_j = (t_{1j}, t_{2j}, \dots, t_{dj})$, one entry per chip
• Now what?
• I.e., what can we do with this data?

Page 9

Supervised vs. unsupervised “learning”

• Draw the parallel with the modeling of biological sequences that we saw last week.
• What do we do when we don’t have any models?
  – We can look for patterns, i.e., similarities between the different genes
  – We can look for recurring themes

→ Clustering

Page 10

The problem

• Group genes into co-regulated sets
  – Observe cells under different environmental changes
  – Find genes whose expression profiles are affected in a similar way
  – These genes are potentially co-regulated, i.e., regulated by the same transcription factor
• Clustering!

Page 11

Clustering expression levels

• Clustering process:
  1. How to tell if two expression profiles are similar?
     – Define the (dis)-similarity measure between two profiles
  2. How to group multiple profiles into meaningful subsets?
     – Describe the clustering procedure
  3. Are the results meaningful?
     – Evaluate statistical significance of a clustering
• And don’t forget about:
  – De-noising
  – Choice of experiments/features

Page 12

(Dis)-similarity measures

• Distance metrics (between vectors x and y):
  – “Manhattan” distance: $MD(x,y) = \sum_i |x_i - y_i|$
  – Euclidean distance: $ED(x,y) = \left[ \sum_i (x_i - y_i)^2 \right]^{1/2}$
  – SSE: $SSE(x,y) = \sum_i (x_i - y_i)^2$
• Correlation: $C(x,y) = \sum_i x_i y_i$ (possibly take absolute value). This is the dot product; it equals the Pearson correlation when x and y have each been shifted to zero mean and scaled to unit length.
• Data pre-processing: instead of clustering on direct observation of expression values…
  – … can cluster based on differential expression from the mean, e.g., $\sum_i | (x_i - \mathrm{avg}(x)) - (y_i - \mathrm{avg}(y)) |$
  – … or differential expression normalized by standard deviation, e.g., $\sum_i | (x_i - \mathrm{avg}(x))/\mathrm{stdev}(x) - (y_i - \mathrm{avg}(y))/\mathrm{stdev}(y) |$
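The measures and the two pre-processing steps above can be sketched in a few lines of plain Python. The function names are illustrative, and the use of the population standard deviation (`pstdev`) is an assumption, since the slide does not say which variant of stdev is meant.

```python
import math
from statistics import mean, pstdev

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def sse(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def euclidean(x, y):
    return math.sqrt(sse(x, y))

def correlation(x, y):
    # Dot product, as written on the slide (not mean-centered or scaled).
    return sum(a * b for a, b in zip(x, y))

def center(x):
    # Differential expression from the mean.
    m = mean(x)
    return [a - m for a in x]

def standardize(x):
    # Differential expression normalized by standard deviation
    # (population standard deviation: an assumption).
    m, s = mean(x), pstdev(x)
    return [(a - m) / s for a in x]

x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]   # same shape as x, twice the amplitude
print(manhattan(x, y), sse(x, y), correlation(x, y))
print(manhattan(center(x), center(y)))
print(manhattan(standardize(x), standardize(y)))
```

The two profiles rise in the same way but at different scales: the raw Manhattan distance is 6, centering reduces it to 2, and standardizing brings it to (numerically) zero, which is why the choice of pre-processing changes which genes look similar.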

Page 13

Clustering Algorithms

• Hierarchical: merge data successively to construct a tree

[Figure: points a to h scattered in the plane, next to a tree whose leaves read, left to right: a b d e f g h c]

• Non-hierarchical: place k means (centers) to best explain the data

[Figure: the same points a to h with three centers c1, c2, c3; the point labels listed below the figure read: a b g h c d e f]

Page 14

Hierarchical clustering

• Bottom-up algorithm:
  – Initialization: each point in a separate cluster
• At each step:
  – Choose the pair of closest clusters
  – Merge
• The exact behavior of the algorithm depends on how we define the distance CD(X,Y) between clusters X and Y
• Avoids the problem of specifying the number of clusters

[Figure: points a to h scattered in the plane]
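The bottom-up procedure can be sketched as follows. This is a naive illustration rather than an efficient implementation; the function names are hypothetical, and single-link is used as the default cluster distance only to make the example concrete.

```python
def single_link(X, Y, dist):
    # Cluster distance = distance of the closest pair of points.
    return min(dist(x, y) for x in X for y in Y)

def agglomerate(points, dist, cluster_dist=single_link):
    """Return the merge history [(cluster_a, cluster_b, distance), ...]."""
    clusters = [[p] for p in points]      # initialization: one cluster per point
    merges = []
    while len(clusters) > 1:
        # Choose the pair of closest clusters.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_dist(clusters[i], clusters[j], dist)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        # Merge them and record the step; the history encodes the tree.
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

points = [0.0, 1.0, 5.0, 6.5]             # hypothetical 1-D expression values
merges = agglomerate(points, dist=lambda a, b: abs(a - b))
for a, b, d in merges:
    print(a, "+", b, "at distance", d)
```

The number of clusters is never specified: the full merge history is produced, and a clustering at any granularity can be read off by stopping early.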

Page 15

Distance between clusters

• Single-link method: $CD(X,Y) = \min_{x \in X,\, y \in Y} D(x,y)$
• Complete-link method: $CD(X,Y) = \max_{x \in X,\, y \in Y} D(x,y)$
• Average-link method: $CD(X,Y) = \mathrm{avg}_{x \in X,\, y \in Y} D(x,y)$
• Centroid method: $CD(X,Y) = D(\mathrm{avg}(X), \mathrm{avg}(Y))$

[Figure: four copies of the points d, e, f, g, h, one per method]
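The four definitions translate directly into code. The sketch below assumes points are coordinate tuples and takes D to be the Euclidean distance (`math.dist`); the function names are illustrative.

```python
import math

def single_link(X, Y, D=math.dist):
    return min(D(x, y) for x in X for y in Y)

def complete_link(X, Y, D=math.dist):
    return max(D(x, y) for x in X for y in Y)

def average_link(X, Y, D=math.dist):
    return sum(D(x, y) for x in X for y in Y) / (len(X) * len(Y))

def centroid(X):
    # Coordinate-wise average of the points in X.
    return tuple(sum(col) / len(X) for col in zip(*X))

def centroid_link(X, Y, D=math.dist):
    return D(centroid(X), centroid(Y))

X = [(0.0, 0.0), (0.0, 2.0)]
Y = [(3.0, 0.0), (3.0, 2.0)]
for f in (single_link, complete_link, average_link, centroid_link):
    print(f.__name__, f(X, Y))
```

For these two clusters the single-link and centroid distances are both 3, the complete-link distance is the diagonal sqrt(13), and the average-link distance lies in between.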

Page 16

Example I

Page 17

Example II

Page 18

K-means algorithm

• Each cluster $X_i$ has a center $c_i$
• Define the clustering cost criterion
  $COST(X_1, \dots, X_k) = \sum_{X_i} \sum_{x \in X_i} SSE(x, c_i)$, where $SSE(x,y) = \sum_i (x_i - y_i)^2$
• Algorithm tries to find clusters $X_1 \dots X_k$ and centers $c_1 \dots c_k$ that minimize COST
• K-means algorithm:
  – Initialize centers “somehow”
  – Repeat:
    • Compute best clusters for given centers (how?) → attach each point to the closest center
    • Compute best centers for given clusters (how?) → choose the centroid of the points in the cluster
  – Until the COST is “small”

[Figure: points a to h with three centers c1, c2, c3]
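A minimal sketch of the loop above. Two choices are assumptions, because the slide leaves them open: the centers are initialized to k randomly chosen data points, and the loop stops when the assignment no longer changes rather than when COST falls below a threshold. The function and parameter names are illustrative.

```python
import random

def sse(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)   # "somehow": k distinct data points (assumption)
    assignment = None
    for _ in range(max_iter):
        # Best clusters for given centers: attach each point to the closest center.
        new_assignment = [min(range(k), key=lambda c: sse(p, centers[c]))
                          for p in points]
        if new_assignment == assignment:
            break                     # stopping rule (assumption): nothing changed
        assignment = new_assignment
        # Best centers for given clusters: centroid of the points in each cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:               # an empty cluster keeps its old center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    cost = sum(sse(p, centers[a]) for p, a in zip(points, assignment))
    return centers, assignment, cost

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers, assignment, cost = kmeans(points, k=2)
print(sorted(centers), assignment, cost)
```

Each of the two steps can only lower COST or leave it unchanged, so the loop terminates, but in general it reaches a local optimum that depends on the initialization.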

Page 19

Choosing optimal center

• Consider a cluster X and a center c (not necessarily a centroid)
• Want to minimize
  $\sum_{x \in X} SSE(x,c) = \sum_{x \in X} \sum_i (x_i - c_i)^2 = \sum_i \sum_{x \in X} (x_i - c_i)^2$
• Can optimize each $c_i$ separately:
  $\sum_{x \in X} (x_i - c_i)^2 = \sum_{x \in X} (x_i^2 - 2 x_i c_i + c_i^2) = \sum_{x \in X} x_i^2 - c_i \sum_{x \in X} 2 x_i + |X| c_i^2$
• Setting the derivative with respect to $c_i$ to zero: $-2 \sum_{x \in X} x_i + 2 |X| c_i = 0$
• Optimum: $c_i = \sum_{x \in X} x_i / |X|$
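A quick numerical check of the result, on a hypothetical three-point cluster: the total SSE at the centroid is lower than at a shifted center, and the increase equals |X| times the squared shift.

```python
def total_sse(X, c):
    # Sum over points x in X of SSE(x, c).
    return sum(sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for x in X)

def centroid(X):
    # c_i = (sum over x in X of x_i) / |X|, coordinate by coordinate.
    return tuple(sum(col) / len(X) for col in zip(*X))

X = [(1.0, 2.0), (3.0, 0.0), (5.0, 4.0)]
c = centroid(X)
shifted = (c[0] + 0.5, c[1])          # move the center by 0.5 along one axis

print(c, total_sse(X, c), total_sse(X, shifted))
```

Here the centroid is (3, 2) with total SSE 16, and the shifted center gives 16.75 = 16 + 3 × 0.5².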

Page 20

Links

• http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/AppletKM.html

• http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/AppletH.html

• http://www.cs.mcgill.ca/~papou/helppage.htm

Page 21

• Relationship between k-means and EM
• Optimizing two variables (clusters and centers) at the same time
• Knowing one, compute the other; draw the parallel

Page 22

Clustering Algorithms: Running time

• Hierarchical: merge data successively to construct a tree

[Figure: points a to h scattered in the plane, next to a tree whose leaves read, left to right: a b d e f g h c]

• Non-hierarchical: place k means (centers) to best explain the data

[Figure: the same points a to h with three centers c1, c2, c3]

Page 23

Running time: Hierarchical methods

• Repeat:
  – Choose the pair of closest clusters
  – Merge
• Number of iterations:
  – Exactly n-1
• Iteration cost:
  – At most $n^2$ computations of CD( , )
  – How many point-point distance computations?
  – At most $n^2$ as well!
• Total running time: $O(n^3)$

[Figure: points a to h scattered in the plane]

Page 24

• What about the running time for k-means?

Page 25

Improvements

• Single-link = Minimum Spanning Tree
  – $O(n^2)$ time

• …
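One way to see the single-link improvement: build the minimum spanning tree with the array-based version of Prim's algorithm, which takes O(n^2) time on n points, then drop the k-1 longest tree edges to obtain k single-link clusters. The sketch below assumes points are coordinate tuples with Euclidean distance; the function names are illustrative.

```python
import math

def mst_edges(points, dist=math.dist):
    """Prim's algorithm on the complete graph: n rounds of O(n) work each."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n        # best[v]: cheapest known distance from v to the tree
    parent = [None] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the point outside the tree that is closest to it.
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: best[v])
        in_tree[u] = True
        if parent[u] is not None:
            edges.append((best[u], parent[u], u))
        # Update the distances of the remaining points to the grown tree.
        for v in range(n):
            if not in_tree[v]:
                d = dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges

def single_link_clusters(points, k):
    # Keep the n-k shortest MST edges, i.e. cut the k-1 longest ones.
    kept = sorted(mst_edges(points))[: len(points) - k]
    label = list(range(len(points)))
    def find(v):
        while label[v] != v:
            v = label[v]
        return v
    for _, u, v in kept:
        label[find(u)] = find(v)
    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 0)]
print(single_link_clusters(points, k=3))
```

The clusters are reported as lists of point indices. Sorting the n-1 tree edges adds only O(n log n), so the total stays O(n^2).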

Page 26

What have we learned?

• Gene expression data
  – Microarray technology
  – De-noising
• Two methods for clustering
  – Hierarchical clustering
    • non-parametric, general, bottom-up
  – K-means clustering
    • ‘model’-based
  – Relationship with HMMs, alignment
• Distance metrics
• What’s next?
  – Evaluate clustering results
  – Visualizing clustering output

Page 27

Evaluating clustering output

• Computing statistical significance of clusters
• N experiments, p labeled + (positive), N-p labeled – (negative)
• Cluster: k elements, m positive
• Probability that a randomly chosen set of k experiments would result in m positive and k-m negative:
  $\binom{p}{m} \binom{N-p}{k-m} / \binom{N}{k}$
• P-value of a single cluster containing k elements, out of which r are the same (P-value of uniformity in the computed cluster):
  $P(\mathrm{pos} \ge r) = \sum_{m \ge r} \binom{p}{m} \binom{N-p}{k-m} / \binom{N}{k}$
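The P-value is a hypergeometric tail, which can be computed exactly with integer binomial coefficients. A small sketch using the slide's symbols N, p, k, m, r; the function names are illustrative.

```python
from math import comb

def prob_exactly(N, p, k, m):
    # Probability that a random set of k experiments (out of N, p of them
    # positive) contains exactly m positives and k - m negatives.
    return comb(p, m) * comb(N - p, k - m) / comb(N, k)

def p_value(N, p, k, r):
    # Probability of seeing r or more positives in a random set of size k.
    return sum(prob_exactly(N, p, k, m) for m in range(r, min(k, p) + 1))

# Hypothetical numbers: 10 experiments, 4 positive, cluster of 3.
print(p_value(10, 4, 3, 3))   # all three positive
print(p_value(10, 4, 3, 2))   # at least two positive
```

With these numbers, a cluster of three that is entirely positive has P-value 4/120 = 1/30, while at least two positives out of three has P-value 40/120 = 1/3.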

Page 28

Visualizing clustering output

Page 29

Rearranging tree branches

• Optimizing one-dimensional ordering of tree leaves

[Figure: the same tree drawn with two leaf orders: a b g h c d e f, and a b d e f g h c]

Page 30

• Ziv Bar-Joseph published a DP algorithm to calculate the optimal branch re-ordering (polynomial-time, O(n^4) in the original paper from what I remember, rather than linear-time). It’d be fun to show it in lecture, if you have time.