Clustering(Gene Expression Data)
6.095/6.895 - Computational Biology: Genomes, Networks, Evolution
Lecture October 4, 2005
Challenges in Computational Biology
[Course overview figure: the central dogma (DNA → RNA transcript → protein) annotated with the course topics: genome assembly, gene finding, regulatory motif discovery, database lookup, sequence alignment, gene expression analysis, evolutionary theory, RNA folding, cluster discovery, Gibbs sampling, protein network analysis, regulatory network inference, emerging network properties, comparative genomics]
Plan
• Gene Expression Data/DNA Microarrays
• Feature selection and Clustering
DNA MicroArrays
• To measure levels of messages in a cell:
– Construct an array with DNA sequences for multiple genes
– Hybridize each RNA in your sample to a sequence in your array (all sequences from the same gene hybridize to the same spot)
– Measure the number of hybridizations for each spot
[Figure: each RNA in the sample is reverse-transcribed (RT) into cDNA, hybridized to its gene's DNA spot on the array (DNA 1 … DNA 6), and the amount of hybridization at each spot (Gene 1 … Gene 6) is measured]
Result
• 6000 genes in one shot
• Entire transcriptome observable in one experiment
• Can perform multiple experiments under varying conditions:
– Temperature
– Time
– Sugar level
– Other chemicals
– Gene knock-outs
– …
Noise
• Sources of noise:
– Cross-hybridization
– Non-uniform hybridization kinetics
– Non-linearity of array response to concentration
– Non-linear amplification
– Improper probe sequence
– Difference in materials/procedures
• Noise model: y_ij = n_i · α_ij · (c_j · t_ij) + ε_ij
– y_ij: observed level for gene j on chip i
– t_ij: true level
– c_j: gene constant
– n_i: multiplicative chip normalization
– α_ij, ε_ij: multiplicative and additive noise terms
• Estimating the parameters:
– n_i: spiked-in control probes, not present in the genome studied
– c_j: control experiments of known concentrations for gene j
– ε_ij: un-spiked control probes, which should read zero
– α_ij: spiked controls that are constant across chips
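The noise model above can be simulated in a few lines (a minimal pure-Python sketch; the parameter ranges and noise magnitudes are made-up illustrations, not values from the lecture):

```python
import random

random.seed(0)
n_chips, n_genes = 4, 6

# True levels t_ij, gene constants c_j, chip normalizations n_i
# (ranges are arbitrary, chosen only for illustration)
t = [[random.uniform(1.0, 10.0) for _ in range(n_genes)] for _ in range(n_chips)]
c = [random.uniform(0.5, 2.0) for _ in range(n_genes)]
n = [random.uniform(0.8, 1.2) for _ in range(n_chips)]

def observe(i, j):
    # y_ij = n_i * alpha_ij * (c_j * t_ij) + eps_ij
    alpha_ij = random.gauss(1.0, 0.1)  # multiplicative noise
    eps_ij = random.gauss(0.0, 0.05)   # additive noise
    return n[i] * alpha_ij * (c[j] * t[i][j]) + eps_ij

y = [[observe(i, j) for j in range(n_genes)] for i in range(n_chips)]
```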
Expression Value Normalization
Gene expression data
• For each gene j we have a vector t_j = (t_1j, t_2j, …, t_dj)
• Now what? I.e., what can we do with this data?
Supervised vs. unsupervised “learning”
• Make the parallel with modeling biological sequences that we saw last week.
• What do we do when we don’t have any models?
– We can look for patterns, i.e. similarities between the different genes
– We can look for recurring themes
Clustering
The problem
• Group genes into co-regulated sets:
– Observe cells under different environmental changes
– Find genes whose expression profiles are affected in a similar way
– These genes are potentially co-regulated, i.e. regulated by the same transcription factor
• Clustering!
Clustering expression levels
• Clustering process:
1. How to tell if two expression profiles are similar?
– Define the (dis)-similarity measure between two profiles
2. How to group multiple profiles into meaningful subsets?
– Describe the clustering procedure
3. Are the results meaningful?
– Evaluate the statistical significance of a clustering
• And don’t forget about:
– De-noising
– Choice of experiments/features
(Dis)-similarity measures
• Distance metrics (between vectors x and y):
– “Manhattan” distance: MD(x,y) = ∑_i |x_i − y_i|
– Euclidean distance: ED(x,y) = [ ∑_i (x_i − y_i)² ]^(1/2)
– SSE: SSE(x,y) = ∑_i (x_i − y_i)²
• Correlation: C(x,y) = ∑_i x_i · y_i (possibly take the absolute value)
• Data pre-processing: instead of clustering on direct observation of expression values…
– … can cluster based on differential expression from the mean, e.g.,
∑_i | (x_i − avg(x)) − (y_i − avg(y)) |
– … or differential expression normalized by standard deviation, e.g.,
∑_i | (x_i − avg(x))/stdev(x) − (y_i − avg(y))/stdev(y) |
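As a concrete reference, the measures above translate directly into code (a minimal pure-Python sketch; the function names are mine):

```python
import math

def manhattan(x, y):
    # MD(x, y) = sum_i |x_i - y_i|
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    # ED(x, y) = [ sum_i (x_i - y_i)^2 ]^(1/2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def sse(x, y):
    # SSE(x, y) = sum_i (x_i - y_i)^2
    return sum((a - b) ** 2 for a, b in zip(x, y))

def correlation(x, y):
    # C(x, y) = sum_i x_i * y_i  (take abs() to also catch anti-correlation)
    return sum(a * b for a, b in zip(x, y))

def standardize(x):
    # pre-processing: subtract the mean, divide by the standard deviation
    mu = sum(x) / len(x)
    sd = math.sqrt(sum((a - mu) ** 2 for a in x) / len(x))
    return [(a - mu) / sd for a in x]
```

For example, `manhattan([0, 0], [3, 4])` gives 7 while `euclidean([0, 0], [3, 4])` gives 5.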
Clustering Algorithms
• Hierarchical: merge data successively to construct a tree
[Figure: points a–h merged bottom-up into a dendrogram]
• Non-Hierarchical: place k means to best explain the data
[Figure: the same points a–h partitioned around three centers c1, c2, c3]
Hierarchical clustering
• Bottom-up algorithm:
– Initialization: each point in a separate cluster
• At each step:
– Choose the pair of closest clusters
– Merge them
• The exact behavior of the algorithm depends on how we define the distance CD(X,Y) between clusters X and Y
• Avoids the problem of specifying the number of clusters
[Figure: example points a–h]
Distance between clusters
• Single-link method: CD(X,Y) = min_{x∈X, y∈Y} D(x,y)
• Complete-link method: CD(X,Y) = max_{x∈X, y∈Y} D(x,y)
• Average-link method: CD(X,Y) = avg_{x∈X, y∈Y} D(x,y)
• Centroid method: CD(X,Y) = D( avg(X), avg(Y) )
[Figure: the four cluster-distance definitions illustrated on the same two clusters of points (d, e, f vs. g, h)]
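The four cluster-distance definitions, plus the bottom-up merge loop, can be sketched as follows (a minimal pure-Python sketch; `cluster_distance` and `hierarchical` are my own names):

```python
def cluster_distance(X, Y, D, method="single"):
    """Distance CD(X, Y) between clusters X and Y under a point metric D.

    method: 'single' (min), 'complete' (max), 'average', or 'centroid'.
    """
    pairs = [D(x, y) for x in X for y in Y]
    if method == "single":
        return min(pairs)
    if method == "complete":
        return max(pairs)
    if method == "average":
        return sum(pairs) / len(pairs)
    if method == "centroid":
        cx = [sum(v) / len(X) for v in zip(*X)]
        cy = [sum(v) / len(Y) for v in zip(*Y)]
        return D(cx, cy)
    raise ValueError(method)

def hierarchical(points, D, method="single"):
    # Bottom-up: start with singleton clusters, repeatedly merge the closest pair.
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: cluster_distance(clusters[ab[0]], clusters[ab[1]], D, method),
        )
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges
```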
Example I
Example II
K-means algorithm
• Each cluster X_i has a center c_i
• Define the clustering cost criterion:
COST(X_1, …, X_k) = ∑_i ∑_{x∈X_i} SSE(x, c_i)
• Algorithm tries to find clusters X_1…X_k and centers c_1…c_k that minimize COST
• K-means algorithm:
– Initialize centers “somehow”
– Repeat:
• Compute best clusters for given centers → attach each point to the closest center
• Compute best centers for given clusters → choose the centroid of the points in the cluster
– Until the COST is “small”
[Figure: points a–h partitioned around centers c1, c2, c3; “How?” callouts point at the two update steps, using SSE(x,y) = ∑_i (x_i − y_i)²]
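A minimal sketch of the two alternating steps (the lecture leaves initialization open; here I assume k random points as starting centers, and stop when the centers stop moving rather than when COST is “small”):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    # Initialize centers "somehow": here, k random data points (an assumption).
    rnd = random.Random(seed)
    centers = rnd.sample(points, k)
    clusters = []
    for _ in range(iters):
        # Step 1: best clusters for given centers -> attach to closest center
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[i].append(p)
        # Step 2: best centers for given clusters -> centroid of each cluster
        new_centers = [
            tuple(sum(v) / len(c) for v in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:  # converged: COST no longer decreases
            break
        centers = new_centers
    return centers, clusters
```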
Choosing optimal center
• Consider a cluster X and a center c (not necessarily a centroid)
• Want to minimize
∑_{x∈X} SSE(x,c)
= ∑_{x∈X} ∑_i (x_i − c_i)²
= ∑_i ∑_{x∈X} (x_i − c_i)²
• Can optimize each c_i separately:
∑_{x∈X} (x_i − c_i)²
= ∑_{x∈X} ( x_i² − 2 x_i c_i + c_i² )
= ∑_{x∈X} x_i² − 2 c_i ∑_{x∈X} x_i + |X| c_i²
• Optimum (setting the derivative with respect to c_i to zero):
c_i = ( ∑_{x∈X} x_i ) / |X|
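A quick numeric check of this result (the cluster `X` here is made-up data): the centroid should cost no more than any perturbed center.

```python
X = [(1.0, 2.0), (3.0, 6.0), (5.0, 4.0)]

def total_sse(X, c):
    # sum over x in X of SSE(x, c)
    return sum(sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for x in X)

# Optimum from the derivation: c_i = (sum over x in X of x_i) / |X|
centroid = tuple(sum(v) / len(X) for v in zip(*X))

# Any perturbed center should cost at least as much as the centroid
for delta in [(0.1, 0.0), (0.0, -0.2), (1.0, 1.0)]:
    perturbed = tuple(ci + d for ci, d in zip(centroid, delta))
    assert total_sse(X, centroid) <= total_sse(X, perturbed)
```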
Links
• http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/AppletKM.html
• http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/AppletH.html
• http://www.cs.mcgill.ca/~papou/helppage.htm
• Relationship between k-means and EM: both alternate between optimizing two sets of variables; knowing one, you can compute the other. Make the parallel.
Clustering Algorithms: Running time
Running time: Hierarchical methods
• Repeat:
– Choose the pair of closest clusters
– Merge them
• Number of iterations: exactly n − 1
• Iteration cost:
– At most n² computations of CD(·,·)
– How many point-point distance computations? At most n² as well!
• Total running time: O(n³)
• What about the running time for k-means?
Improvements
• Single-link = Minimum Spanning Tree: O(n²) time
• …
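One way to see the single-link/MST connection: Prim's algorithm computes the MST of the complete distance graph in O(n²), and its edges sorted by weight give the single-link merge order (a sketch; `single_link_mst` is my own name):

```python
def single_link_mst(points, D):
    """Prim's algorithm in O(n^2); the MST edges sorted by weight
    give the single-link merge order."""
    n = len(points)
    in_tree = [False] * n
    best = [float("inf")] * n   # cheapest edge from the tree to each point
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # pick the cheapest point not yet in the tree
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        # relax edges from u to points outside the tree
        for v in range(n):
            if not in_tree[v]:
                d = D(points[u], points[v])
                if d < best[v]:
                    best[v] = d
                    parent[v] = u
    return sorted(edges)   # ascending weight = single-link merge order
```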
What have we learned?
• Gene expression data
– Microarray technology
– De-noising
• Two methods for clustering
– Hierarchical clustering: non-parametric, general, bottom-up
– K-means clustering: ‘model’-based
• Relationship with HMMs, alignment
• Distance metrics
• What’s next?
– Evaluate clustering results
– Visualize clustering output
Evaluating clustering output
• Computing statistical significance of clusters
• N experiments: p labeled “+”, (N − p) labeled “−”
• Cluster: k elements, m of which are positive
• Probability that a randomly chosen set of k experiments contains exactly m positive and k − m negative (C(a, b) denotes the binomial coefficient “a choose b”):
P(pos = m) = C(p, m) · C(N − p, k − m) / C(N, k)
• Summing this tail (m or more positives) gives the P-value of label uniformity in the computed cluster
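With the counts above, the tail probability can be computed directly (a sketch using Python's `math.comb`; the function name is mine):

```python
from math import comb

def cluster_pvalue(N, p, k, m):
    """P-value: probability that a random size-k subset of the N experiments
    contains at least m of the p positives (hypergeometric tail)."""
    total = comb(N, k)
    return sum(comb(p, j) * comb(N - p, k - j)
               for j in range(m, min(p, k) + 1)) / total
```

For instance, with N = 10 experiments, p = 5 positives, and a cluster of k = 5 that is all positive (m = 5), only 1 of the C(10, 5) = 252 random subsets is that extreme.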
Visualizing clustering output
Rearranging tree branches
• Optimizing the one-dimensional ordering of tree leaves
[Figure: the same dendrogram over leaves a–h drawn with two different leaf orderings]
• Ziv Bar-Joseph published a DP algorithm (linear-time, from what I remember) to calculate optimal branch re-ordering. It’d be fun to show it in lecture, if you have time.