clustering (gene expression data) 6.095/6.895 - computational biology: genomes, networks, evolution...

Clustering(Gene Expression Data)

6.095/6.895 - Computational Biology: Genomes, Networks, Evolution

Lecture October 4, 2005

Challenges in Computational Biology

DNA

4 Genome Assembly

Gene FindingRegulatory motif discovery

Database lookup

Gene expression analysis9

RNA transcript

Sequence alignment

Evolutionary Theory7

TCATGCTATTCGTGATAATGAGGATATTTATCATATTTATGATTT

Cluster discovery10 Gibbs samplingProtein network analysis12

Emerging network properties14

13 Regulatory network inference

Comparative Genomics

RNA folding8

11

Plan

• Gene Expression Data/DNA Microarrays

• Feature selection and Clustering

DNA MicroArrays

• To measure levels of messages in a cell– Construct an array with DNA sequences for multiple genes– Hybridize each RNA in your sample to a sequence in your

array (All sequences from the same gene hybridize to the same spot)

– Measure the number of hybridizations for each spot

DN

A 1

DN

A 3

DN

A 5

DN

A 6

DN

A 4

DN

A 2

cDNA 4

cDNA 6

Hybridize Gen

e 1

Gen

e 3

Gen

e 5

Gen

e 6

Gen

e 4

Gen

e 2

MeasureRNA 4

RNA 6

RT

Result

• 6000 genes in one shot• Entire transcriptome

observable in one experiment

• Can perform multiple experiments under varying conditions– Temperature– Time– Sugar level– Other chemicals– Gene knock-outs– …

Noise

• Sources of Noise– Cross-hybridization– Non-uniform hybridization

kinetics– Non-linearity of array

response to concentration– Non-linear amplification– Improper probe sequence– Difference in

materials/procedures

• Noise model: yij=niαij (cj tij) + εij

– yij: observed level for gene j on chip i– tij: true level– cj: gene constant– ni: multiplicative chip normalization– αij, εij: multiplicative and additive noise terms

• Estimating the parameters– ni: spiked in control probes, not present in genome studied– cj: control experiments of known concentrations for gene j– εij: un-spiked control probes should be zero– αij: spiked controls that are constant across chips

Expression Value Normalization

Gene expression data

• For each gene j we have a vector

tj=(t1j,t2j, …, tdj)

• Now what ?

• I.e., what can we do with this data ?

Supervised vs. unsupervised “learning”

• Make the parallel with modeling biological sequences that we saw last week.

• What do we do when we don’t have any models? – We can look for patterns / i.e. similarities between

the different genes– We can look for recurring themes.

Clustering

The problem

• Group genes into co-regulated sets– Observe cells under different

environmental changes– Find genes whose expression

profiles are affected in a similar way

– These genes are potentially co-regulated, i.e. regulated by the same transcription factor

• Clustering!

Clustering expression levels

• Clustering process: 1. How to tell if two expression profiles are similar ?

– Define the (dis)-similarity measure between two profiles

2. How to group multiple profiles into meaningful subsets ?

– Describe the clustering procedure

3. Are the results meaningful ?– Evaluate statistical significance of a clustering

• And don’t forget about:– De-noising– Choice of experiments/features

(Dis)-similarity measures• Distance metrics (between vectors x and y )

– “Manhattan” distance: MD(x,y) = ∑I |xi-yi|– Euclidean distance: ED(x,y) = [ ∑I (xi-yi)2 ]1/2

– SSE: SSE(x,y) = ∑I (xi-yi)2 • Correlation:

C(x,y)= ∑I xi * yi

(possibly take absolute value)

• Data pre-processing: Instead of clustering on direct observation of expression values…– … can cluster based on differential expression from the mean,

e.g.,

∑I | xi – avg(x) – (yi – avg(y)) |– … or differential expression normalized by standard deviation,

e.g.,

∑I | (xi – avg(x))/stdev(x) – (y i- avg(y))/stdev(y) |

Clustering Algorithms

• Hierarchical: Merge data successively to construct tree

b

ed

f

a

c

h

ga b d e f g hc

• Non-Hierarchical: place k-means to best explain data

b

ed

f

a

c

h

gc1

c2

c3a b g hcd e f

Hierarchical clustering

• Bottom-up algorithm:– Initialization: each point in a

separate cluster

• At each step:– Choose the pair of closest

clusters– Merge

• The exact behavior of the algorithm depends on how we define the distance CD(X,Y) between clusters X and Y

• Avoids the problem of specifying the number of clusters

b

ed

f

a

c

h

g

Distance between clusters

• CD(X,Y)=minx X, y Y D(x,y)

Single-link method

• CD(X,Y)=maxx X, y Y D(x,y)

Complete-link method

• CD(X,Y)=avgx X, y Y D(x,y)

Average-link method

• CD(X,Y)=D( avg(X) , avg(Y) )

Centroid method

ed

f

h

g

ed

f

h

g

ed

f

h

g

ed

f

h

g

Example I

Example II

K-means algorithm• Each cluster Xi has a center ci

• Define the clustering cost criterion

COST(X1,…Xk) = ∑Xi ∑x Xi SSE(x,ci)

• Algorithm tries to find clusters X1…Xk and centers c1…ck that minimize COST

• K-means algorithm:– Initialize centers “somehow”– Repeat:

• Compute best clusters for given centers → Attach each point to the closest center• Compute best centers for given clusters → Choose the centroid of points in cluster

– Until the COST is “small”

b

ed

f

a

c

h

g

c1

c2

c3

How ?

How ?

SSE(x,y) = ∑I (xi-yi)2

Choosing optimal center

• Consider a cluster X and a center c (not necessarily a centroid)• Want to minimize

∑x X SSE(x,c)

= ∑x X ∑ i (xi-ci)2

= ∑ i ∑x X (xi-ci)2

• Can optimize each ci separately:

∑x X (xi-ci)2

= ∑x X ( xi2 - 2xici – ci

2 )

= ∑x X xi2 – ci ∑x X 2xi + |X|ci

2

• Optimum:

c i= ∑x X xi / |X|

Links

• http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/AppletKM.html

• http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/AppletH.html

• http://www.cs.mcgill.ca/~papou/helppage.htm

http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/AppletKM.html

http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/AppletKM.html

http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/AppletH.html

http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/AppletH.html

• Relationship between k-means and EM,

• Optimizing two variables at the same time.

• Know one compute the other, make the parallel

Clustering Algorithms: Running time

• Hierarchical: Merge data successively to construct tree

b

ed

f

a

c

h

ga b d e f g hc

• Non-Hierarchical: place k-means to best explain data

b

ed

f

a

c

h

gc1

c2

c3

Running time: Hierarchical methods

• Repeat:– Choose the pair of closest

clusters– Merge

• Number of iterations:– Exactly n-1

• Iteration cost:– At most n2 computations of CD( , )– How many point-point distance

computations ?– At most n2 as well !

• Total running time: O(n3)

b

ed

f

a

c

h

g

• What about the running time for k-means?

Improvements

• Single-link = Minimum Spanning Tree– O(n2) time

• …

What have we learned?

• Gene expression data– Microarray technology– De-noising

• Two methods for clustering– Hierarchical clustering

• non-parametric, general, top-down

– K-means clustering• ‘model’-based

– Relationship with HMMs, alignment• Distance metrics

• What’s next? – Evaluate clustering results– Visualizing clustering output

Evaluating clustering output• Computing statistical significance of clusters

rm

k

Nmk

n

m

p

rposP )(

• N experiments, p labeled ++, (N-p) ––

• Cluster: k elements, m positive

• P-value of single cluster containing k elements out of which r are same

Prob that a randomly chosen set of k experiments would result in m positive and k-m negative

P-value of uniformity

in computed cluster

Visualizing clustering output

Rearranging tree branches

• Optimizing one-dimensional ordering of tree leaves

a b g hcd e f

a b d e f g hc

• Ziv Bar-Zoseph published a linear-time DP algorithm (from what I remember) to calculate branch re-ordering. It’d be fun to show it in lecture, if you have time

clustering (gene expression data) 6.095/6.895 - computational biology: genomes, networks, evolution...

Documents

gene j ij

clustering slide

n i ij c j t ij ij y

gene constant n i

genome assembly gene

chip i t ij

materialsprocedures

rt slide