Clustering(Gene Expression Data)
6.095/6.895 - Computational Biology: Genomes, Networks, Evolution
Lecture October 4, 2005
Challenges in Computational Biology
[Course overview figure: the central dogma (DNA → RNA transcript → protein) annotated with the course topics: genome assembly, gene finding, regulatory motif discovery, database lookup, sequence alignment, gene expression analysis, evolutionary theory, RNA folding, cluster discovery, Gibbs sampling, protein network analysis, regulatory network inference, emerging network properties, comparative genomics]
Plan
• Gene Expression Data/DNA Microarrays
• Feature selection and Clustering
DNA MicroArrays
• To measure levels of messages in a cell:
– Construct an array with DNA sequences for multiple genes
– Hybridize each RNA in your sample to a sequence in your array (all sequences from the same gene hybridize to the same spot)
– Measure the number of hybridizations for each spot
[Figure: each RNA in the sample is reverse-transcribed (RT) into cDNA, hybridized to its gene's DNA spot on the array (DNA 1 … DNA 6), and the amount of hybridization at each spot (Gene 1 … Gene 6) is measured]
Result
• 6000 genes in one shot
• Entire transcriptome observable in one experiment
• Can perform multiple experiments under varying conditions:
– Temperature
– Time
– Sugar level
– Other chemicals
– Gene knock-outs
– …
Noise
• Sources of noise:
– Cross-hybridization
– Non-uniform hybridization kinetics
– Non-linearity of array response to concentration
– Non-linear amplification
– Improper probe sequence
– Difference in materials/procedures
• Noise model: y_ij = n_i · α_ij · (c_j · t_ij) + ε_ij
– y_ij: observed level for gene j on chip i
– t_ij: true level
– c_j: gene constant
– n_i: multiplicative chip normalization
– α_ij, ε_ij: multiplicative and additive noise terms
• Estimating the parameters:
– n_i: spiked-in control probes, not present in the genome studied
– c_j: control experiments of known concentrations for gene j
– ε_ij: un-spiked control probes, which should read zero
– α_ij: spiked controls that are constant across chips
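The noise model above can be simulated in a few lines (a minimal pure-Python sketch; the parameter ranges and noise magnitudes are made-up illustrations, not values from the lecture):

```python
import random

random.seed(0)
n_chips, n_genes = 4, 6

# True levels t_ij, gene constants c_j, chip normalizations n_i
# (ranges are arbitrary, chosen only for illustration)
t = [[random.uniform(1.0, 10.0) for _ in range(n_genes)] for _ in range(n_chips)]
c = [random.uniform(0.5, 2.0) for _ in range(n_genes)]
n = [random.uniform(0.8, 1.2) for _ in range(n_chips)]

def observe(i, j):
    # y_ij = n_i * alpha_ij * (c_j * t_ij) + eps_ij
    alpha_ij = random.gauss(1.0, 0.1)  # multiplicative noise
    eps_ij = random.gauss(0.0, 0.05)   # additive noise
    return n[i] * alpha_ij * (c[j] * t[i][j]) + eps_ij

y = [[observe(i, j) for j in range(n_genes)] for i in range(n_chips)]
```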
Expression Value Normalization
Gene expression data
• For each gene j we have a vector t_j = (t_1j, t_2j, …, t_dj)
• Now what? I.e., what can we do with this data?
Supervised vs. unsupervised “learning”
• Make the parallel with modeling biological sequences that we saw last week.
• What do we do when we don’t have any models?
– We can look for patterns, i.e. similarities between the different genes
– We can look for recurring themes
Clustering
The problem
• Group genes into co-regulated sets:
– Observe cells under different environmental changes
– Find genes whose expression profiles are affected in a similar way
– These genes are potentially co-regulated, i.e. regulated by the same transcription factor
• Clustering!
Clustering expression levels
• Clustering process:
1. How to tell if two expression profiles are similar?
– Define the (dis)-similarity measure between two profiles
2. How to group multiple profiles into meaningful subsets?
– Describe the clustering procedure
3. Are the results meaningful?
– Evaluate the statistical significance of a clustering
• And don’t forget about:
– De-noising
– Choice of experiments/features
(Dis)-similarity measures
• Distance metrics (between vectors x and y):
– “Manhattan” distance: MD(x,y) = ∑_i |x_i − y_i|
– Euclidean distance: ED(x,y) = [ ∑_i (x_i − y_i)² ]^(1/2)
– SSE: SSE(x,y) = ∑_i (x_i − y_i)²
• Correlation: C(x,y) = ∑_i x_i · y_i (possibly take the absolute value)
• Data pre-processing: instead of clustering on direct observation of expression values…
– … can cluster based on differential expression from the mean, e.g.,
∑_i | (x_i − avg(x)) − (y_i − avg(y)) |
– … or differential expression normalized by standard deviation, e.g.,
∑_i | (x_i − avg(x))/stdev(x) − (y_i − avg(y))/stdev(y) |
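As a concrete reference, the measures above translate directly into code (a minimal pure-Python sketch; the function names are mine):

```python
import math

def manhattan(x, y):
    # MD(x, y) = sum_i |x_i - y_i|
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    # ED(x, y) = [ sum_i (x_i - y_i)^2 ]^(1/2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def sse(x, y):
    # SSE(x, y) = sum_i (x_i - y_i)^2
    return sum((a - b) ** 2 for a, b in zip(x, y))

def correlation(x, y):
    # C(x, y) = sum_i x_i * y_i  (take abs() to also catch anti-correlation)
    return sum(a * b for a, b in zip(x, y))

def standardize(x):
    # pre-processing: subtract the mean, divide by the standard deviation
    mu = sum(x) / len(x)
    sd = math.sqrt(sum((a - mu) ** 2 for a in x) / len(x))
    return [(a - mu) / sd for a in x]
```

For example, `manhattan([0, 0], [3, 4])` gives 7 while `euclidean([0, 0], [3, 4])` gives 5.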
Clustering Algorithms
• Hierarchical: merge data successively to construct a tree
[Figure: points a–h merged bottom-up into a dendrogram]
• Non-Hierarchical: place k means to best explain the data
[Figure: the same points a–h partitioned around three centers c1, c2, c3]
Hierarchical clustering
• Bottom-up algorithm:
– Initialization: each point in a separate cluster
• At each step:
– Choose the pair of closest clusters
– Merge them
• The exact behavior of the algorithm depends on how we define the distance CD(X,Y) between clusters X and Y
• Avoids the problem of specifying the number of clusters
[Figure: example points a–h]
Distance between clusters
• Single-link method: CD(X,Y) = min_{x∈X, y∈Y} D(x,y)
• Complete-link method: CD(X,Y) = max_{x∈X, y∈Y} D(x,y)
• Average-link method: CD(X,Y) = avg_{x∈X, y∈Y} D(x,y)
• Centroid method: CD(X,Y) = D( avg(X), avg(Y) )
[Figure: the four cluster-distance definitions illustrated on the same two clusters of points (d, e, f vs. g, h)]
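The four cluster-distance definitions, plus the bottom-up merge loop, can be sketched as follows (a minimal pure-Python sketch; `cluster_distance` and `hierarchical` are my own names):

```python
def cluster_distance(X, Y, D, method="single"):
    """Distance CD(X, Y) between clusters X and Y under a point metric D.

    method: 'single' (min), 'complete' (max), 'average', or 'centroid'.
    """
    pairs = [D(x, y) for x in X for y in Y]
    if method == "single":
        return min(pairs)
    if method == "complete":
        return max(pairs)
    if method == "average":
        return sum(pairs) / len(pairs)
    if method == "centroid":
        cx = [sum(v) / len(X) for v in zip(*X)]
        cy = [sum(v) / len(Y) for v in zip(*Y)]
        return D(cx, cy)
    raise ValueError(method)

def hierarchical(points, D, method="single"):
    # Bottom-up: start with singleton clusters, repeatedly merge the closest pair.
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: cluster_distance(clusters[ab[0]], clusters[ab[1]], D, method),
        )
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges
```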
Example I
Example II
K-means algorithm
• Each cluster X_i has a center c_i
• Define the clustering cost criterion:
COST(X_1, …, X_k) = ∑_i ∑_{x∈X_i} SSE(x, c_i)
• Algorithm tries to find clusters X_1…X_k and centers c_1…c_k that minimize COST
• K-means algorithm:
– Initialize centers “somehow”
– Repeat:
• Compute best clusters for given centers → attach each point to the closest center
• Compute best centers for given clusters → choose the centroid of the points in the cluster
– Until the COST is “small”
[Figure: points a–h partitioned around centers c1, c2, c3; “How?” callouts point at the two update steps, using SSE(x,y) = ∑_i (x_i − y_i)²]
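A minimal sketch of the two alternating steps (the lecture leaves initialization open; here I assume k random points as starting centers, and stop when the centers stop moving rather than when COST is “small”):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    # Initialize centers "somehow": here, k random data points (an assumption).
    rnd = random.Random(seed)
    centers = rnd.sample(points, k)
    clusters = []
    for _ in range(iters):
        # Step 1: best clusters for given centers -> attach to closest center
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[i].append(p)
        # Step 2: best centers for given clusters -> centroid of each cluster
        new_centers = [
            tuple(sum(v) / len(c) for v in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:  # converged: COST no longer decreases
            break
        centers = new_centers
    return centers, clusters
```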
Choosing optimal center
• Consider a cluster X and a center c (not necessarily a centroid)
• Want to minimize
∑_{x∈X} SSE(x,c)
= ∑_{x∈X} ∑_i (x_i − c_i)²
= ∑_i ∑_{x∈X} (x_i − c_i)²
• Can optimize each c_i separately:
∑_{x∈X} (x_i − c_i)²
= ∑_{x∈X} ( x_i² − 2 x_i c_i + c_i² )
= ∑_{x∈X} x_i² − 2 c_i ∑_{x∈X} x_i + |X| c_i²
• Optimum (setting the derivative with respect to c_i to zero):
c_i = ( ∑_{x∈X} x_i ) / |X|
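A quick numeric check of this result (the cluster `X` here is made-up data): the centroid should cost no more than any perturbed center.

```python
X = [(1.0, 2.0), (3.0, 6.0), (5.0, 4.0)]

def total_sse(X, c):
    # sum over x in X of SSE(x, c)
    return sum(sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for x in X)

# Optimum from the derivation: c_i = (sum over x in X of x_i) / |X|
centroid = tuple(sum(v) / len(X) for v in zip(*X))

# Any perturbed center should cost at least as much as the centroid
for delta in [(0.1, 0.0), (0.0, -0.2), (1.0, 1.0)]:
    perturbed = tuple(ci + d for ci, d in zip(centroid, delta))
    assert total_sse(X, centroid) <= total_sse(X, perturbed)
```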
Links
• http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/AppletKM.html
• http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/AppletH.html
• http://www.cs.mcgill.ca/~papou/helppage.htm
• Relationship between k-means and EM: both alternate between optimizing two sets of variables; knowing one, you can compute the other. Make the parallel.
Clustering Algorithms: Running time
Running time: Hierarchical methods
• Repeat:
– Choose the pair of closest clusters
– Merge them
• Number of iterations: exactly n − 1
• Iteration cost:
– At most n² computations of CD(·,·)
– How many point-point distance computations? At most n² as well!
• Total running time: O(n³)
• What about the running time for k-means?
Improvements
• Single-link = Minimum Spanning Tree: O(n²) time
• …
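One way to see the single-link/MST connection: Prim's algorithm computes the MST of the complete distance graph in O(n²), and its edges sorted by weight give the single-link merge order (a sketch; `single_link_mst` is my own name):

```python
def single_link_mst(points, D):
    """Prim's algorithm in O(n^2); the MST edges sorted by weight
    give the single-link merge order."""
    n = len(points)
    in_tree = [False] * n
    best = [float("inf")] * n   # cheapest edge from the tree to each point
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # pick the cheapest point not yet in the tree
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        # relax edges from u to points outside the tree
        for v in range(n):
            if not in_tree[v]:
                d = D(points[u], points[v])
                if d < best[v]:
                    best[v] = d
                    parent[v] = u
    return sorted(edges)   # ascending weight = single-link merge order
```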
What have we learned?
• Gene expression data
– Microarray technology
– De-noising
• Two methods for clustering
– Hierarchical clustering: non-parametric, general, bottom-up
– K-means clustering: ‘model’-based
• Relationship with HMMs, alignment
• Distance metrics
• What’s next?
– Evaluate clustering results
– Visualize clustering output
Evaluating clustering output
• Computing statistical significance of clusters
• N experiments: p labeled “+”, (N − p) labeled “−”
• Cluster: k elements, m of which are positive
• Probability that a randomly chosen set of k experiments contains exactly m positive and k − m negative (C(a, b) denotes the binomial coefficient “a choose b”):
P(pos = m) = C(p, m) · C(N − p, k − m) / C(N, k)
• Summing this tail (m or more positives) gives the P-value of label uniformity in the computed cluster
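With the counts above, the tail probability can be computed directly (a sketch using Python's `math.comb`; the function name is mine):

```python
from math import comb

def cluster_pvalue(N, p, k, m):
    """P-value: probability that a random size-k subset of the N experiments
    contains at least m of the p positives (hypergeometric tail)."""
    total = comb(N, k)
    return sum(comb(p, j) * comb(N - p, k - j)
               for j in range(m, min(p, k) + 1)) / total
```

For instance, with N = 10 experiments, p = 5 positives, and a cluster of k = 5 that is all positive (m = 5), only 1 of the C(10, 5) = 252 random subsets is that extreme.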
Visualizing clustering output
Rearranging tree branches
• Optimizing the one-dimensional ordering of tree leaves
[Figure: the same dendrogram over leaves a–h drawn with two different leaf orderings]
• Ziv Bar-Joseph published a DP algorithm (linear-time, from what I remember) to calculate optimal branch re-ordering. It’d be fun to show it in lecture, if you have time.