Post on 21-Dec-2015
Gibbs biclustering of microarray data
Yves Moreau & Qizheng Sheng
Katholieke Universiteit Leuven, ESAT-SCD (SISTA)
on leave at Center for Biological Sequence Analysis,
Danish Technical University
April 18, 2023 CBS Microarray Course 2
Clustering
Form coherent groups of
  Genes
  Patient samples (e.g., tumors)
  Drug or toxin response
Study these groups to get insight into biological processes
  Diagnostic and prognostic classes
  Genes in the same cluster can have the same function or the same regulation
Clustering algorithms
  Hierarchical clustering
  K-means
  Self-Organizing Maps
  ...
What’s wrong with clustering?
Clustering is a long-solved problem?!?
Many problems with current clustering algorithms
  PCA does not do any form of grouping
  Hierarchical clustering does not produce distinct groups
    Only a tree; it is then up to the user to pick nodes from the tree
  K-means does not tell you how many clusters are really present in the data
  ...
A wish list for clustering
We expect a lot from a clustering algorithm
  Fast and not memory hungry
    Can run easily on a large microarray data set
    10,000-100,000 genes, >100 experiments
  Partitioning of genes into distinct groups, automatically determining the “right” number of groups
  Robust
    If you remove some genes and some experiments, you want to obtain roughly the same groups
  Rejection of outliers (genes that do not clearly belong to any group)
  Probabilistic cluster membership
    One gene can belong to several clusters
  Taking biological knowledge into account
    Maybe you want some known genes to cluster together
    Meaning of the clusters?
Heterogeneous microarray data sources
From genome projects to transcriptome projects
Microarray cost per expression measurement
Budgets and expertise
Publicly available microarray data
  Need for exchange standards & repositories
Big consortia set up big microarray projects
  Genome projects, then “transcriptome” projects (= compendia)
Change in microarray projects (as in sequence analysis)
  Analyze public data first to generate a hypothesis
  Design and perform your own microarray experiment
Why biclustering?
Data becomes more heterogeneous
Gene clustering
  Group genes that behave similarly over all conditions
Gene biclustering
  Group genes that behave similarly over a subset of conditions
  “Feature selection”
  More suitable for heterogeneous compendia
Probabilistic graphical models
[Figure: graphical models at the center of four application areas]
Biostatistics
  Bayesian stats, clustering, decision support
Genetics
  Linkage analysis, phylogeny
Sequence analysis
  Modeling protein families, gene prediction, regulatory sequence analysis
Expression analysis
  Clustering, genetic network inference
Discretizing microarray data
Microarray data is continuous
Discretize by equal frequency
[Figure: distribution of expression values for a given gene, split into High, Medium, and Low]
[Figure: discretized microarray data set (genes by conditions) with a bicluster highlighted]
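Equal-frequency discretization can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `discretize_equal_frequency`, the choice of three levels (0 = Low, 1 = Medium, 2 = High), and the rank-based binning are assumptions made here. Each gene (row) is discretized separately, and ties are broken arbitrarily by the sort.

```python
import numpy as np

def discretize_equal_frequency(X, n_levels=3):
    """Map each row (gene) of X to integer levels 0..n_levels-1 of roughly equal frequency."""
    X = np.asarray(X, dtype=float)
    n_conditions = X.shape[1]
    # A double argsort gives the rank (0 = smallest) of each value within its row.
    ranks = X.argsort(axis=1).argsort(axis=1)
    # Split the ranks into n_levels bins of (nearly) equal size.
    return (ranks * n_levels) // n_conditions

# Hypothetical expression values: 2 genes measured under 6 conditions.
X = np.array([[0.1, 5.0, 2.0, 3.0, 0.2, 4.0],
              [9.0, 1.0, 8.0, 2.0, 7.0, 3.0]])
levels = discretize_equal_frequency(X)
print(levels)
```

With six conditions and three levels, each gene ends up with two Low, two Medium, and two High entries.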
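Likelihood
[Figure: per-entry likelihood values on a color scale from 0 to 1]
.9   .9   .9   .9   .9
.9   .05  .9   .9   .9
.9   .9   .9   .9   .9
.05  .9   .9   .9   .9
.9   .9   .9   .9   .05
$P(D \mid g, c, \theta)$, with $g$ the genes, $c$ the conditions, and $\theta$ the frequency pattern

Likelihood
[Figure: per-entry likelihood values on a color scale from 0 to 1]
.9   .05  .05  .05  .9
.05  .9   .9   .05  .05
.05  .05  .05  .05  .05
.05  .05  .9   .9   .05
Compare $P(D \mid g', c, \theta)$ with $P(D \mid g, c, \theta)$
Get the right genes

Likelihood
[Figure: per-entry likelihood values on a color scale from 0 to 1]
.9   .9   .05  .05  .9
.9   .05  .05  .9   .9
.9   .9   .05  .05  .9
.05  .9   .05  .05  .9
.9   .9   .05  .05  .05
Compare $P(D \mid g, c', \theta)$ with $P(D \mid g, c, \theta)$
Get the right conditions

Likelihood
[Figure: per-entry likelihood values on a color scale from 0 to 1]
.6   .6   .2   .2   .6
.6   .2   .2   .2   .6
.6   .6   .2   .2   .6
.2   .6   .2   .2   .6
.2   .6   .2   .2   .2
Compare $P(D \mid g, c, \theta')$ with $P(D \mid g, c, \theta)$
Get the right frequency pattern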
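The likelihood of a candidate bicluster can be sketched in code under a simplified model. The assumptions here are not stated on the slides: entries inside the bicluster follow a per-condition `pattern` (a probability for each discretized level), all other entries follow a single `background` distribution, and the function name `bicluster_log_likelihood` is hypothetical.

```python
import numpy as np

def bicluster_log_likelihood(D, genes, conds, pattern, background):
    """Log-likelihood of discretized data D for a given choice of genes, conditions, and pattern.

    D          : (n_genes, n_conds) integer levels
    genes      : boolean mask over rows (True = gene in the bicluster)
    conds      : boolean mask over columns (True = condition in the bicluster)
    pattern    : (n_conds, n_levels) level probabilities per condition (assumed model)
    background : (n_levels,) level probabilities outside the bicluster (assumed model)
    """
    D = np.asarray(D)
    pattern = np.asarray(pattern, dtype=float)
    background = np.asarray(background, dtype=float)
    in_block = np.outer(np.asarray(genes, dtype=bool), np.asarray(conds, dtype=bool))
    # Column index of every entry, so that pattern[cols, D] looks up pattern[j, D[i, j]].
    cols = np.broadcast_to(np.arange(D.shape[1]), D.shape)
    log_p_pattern = np.log(pattern[cols, D])
    log_p_background = np.log(background[D])
    return float(np.where(in_block, log_p_pattern, log_p_background).sum())

# Hypothetical toy data: 2 genes, 2 conditions, 2 levels.
D = np.array([[0, 1],
              [1, 1]])
ll = bicluster_log_likelihood(D, genes=[True, False], conds=[True, True],
                              pattern=[[0.9, 0.1], [0.2, 0.8]],
                              background=[0.5, 0.5])
print(ll)
```

Changing the gene mask, the condition mask, or the pattern and recomputing the value is the comparison the four likelihood slides illustrate.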
Optimizing the bicluster
Find the right bicluster
  Genes
  Conditions
  Pattern
For a given choice of genes and conditions, the “best” pattern is given by the frequencies found in the extracted pattern
  No more need to optimize over the pattern
Maximum likelihood: find genes and conditions that maximize $P(D \mid g, c)$
Gibbs sampling: find genes and conditions that optimize $P(g, c \mid D)$
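Reading the “best” pattern off the selected genes and conditions amounts to counting level frequencies per condition. The sketch below assumes discretized integer levels and at least one selected gene and one selected condition; the function name `estimate_pattern` is hypothetical.

```python
import numpy as np

def estimate_pattern(D, genes, conds, n_levels):
    """Frequency of each level, per selected condition, among the selected genes."""
    D = np.asarray(D)
    genes = np.asarray(genes, dtype=bool)
    conds = np.asarray(conds, dtype=bool)
    block = D[np.ix_(genes, conds)]          # rows = selected genes, columns = selected conditions
    counts = np.stack([np.bincount(col, minlength=n_levels) for col in block.T])
    return counts / counts.sum(axis=1, keepdims=True)

# Hypothetical toy data: 3 genes, 2 conditions, 2 levels.
D = np.array([[0, 1],
              [0, 1],
              [1, 1]])
pattern = estimate_pattern(D, genes=[True, True, True], conds=[True, True], n_levels=2)
print(pattern)
```

Each row of the result is the empirical level distribution for one selected condition, so no separate optimization over the pattern is required.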
Markov Chain Monte-Carlo
Markov chain with transition matrix T
$$T_{ij} = P(X_{t+1} = j \mid X_t = i)$$
        A       C       G       T
A  0.0643  0.8268  0.0659  0.0430
C  0.0598  0.0484  0.8515  0.0403
G  0.1602  0.3407  0.1736  0.3255
T  0.1507  0.1608  0.3654  0.3231
[Figure: state transition diagram over the states X=A, X=C, X=G, X=T]
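A chain with this transition matrix can be simulated in a few lines. The matrix values are those of the slide; the function name `simulate_chain`, the starting state, and the random seed are assumptions made for illustration.

```python
import numpy as np

STATES = "ACGT"
# Rows: current state (A, C, G, T); columns: next state (A, C, G, T).
T = np.array([[0.0643, 0.8268, 0.0659, 0.0430],
              [0.0598, 0.0484, 0.8515, 0.0403],
              [0.1602, 0.3407, 0.1736, 0.3255],
              [0.1507, 0.1608, 0.3654, 0.3231]])

def simulate_chain(T, n_steps, start=0, seed=0):
    """Draw a sequence of n_steps states; each next state is sampled from row T[current]."""
    rng = np.random.default_rng(seed)
    state = start
    path = [state]
    for _ in range(n_steps - 1):
        state = rng.choice(len(T), p=T[state])
        path.append(state)
    return "".join(STATES[s] for s in path)

seq = simulate_chain(T, n_steps=60)
print(seq)
```

Because A is followed by C and C by G with high probability, sampled sequences tend to contain many ACG runs.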
Markov Chain Monte-Carlo
Markov chains can sample from complex distributions
ACGCGGTGTGCGTTTGACGAACGGTTACGCGACGTTTGGTACGTGCGGTGTACGTGTACGACGGAGTTTGCGGGACGCGTACGCGCGTGACGTACGCGTGAGACGCGTGCGCGCGGACGCACGGGCGTGCGCGCGTCGCGAACGCGTTTGTGTTCGGTGCACCGCGTTTGACGTCGGTTCACGTGACGCGTAGTTCGACGACGTGACACGGACGTACGCGACCGTACTCGCGTTGACACGATACGGCGCGGCGGGCGCGGACGTACGCGTACACGCGGGAACGCGCGTGTTTACGACGTGACGTCGCACGCGTCGGTGTGACGGCGGTCGGTACACGTCGACGTTGCGACGTGCGTGCTGACGGAACGACGACGCGACGCACGGCGTGTTCGCGGTGCGG
[Figure: percentage (%) of A, C, G, and T by position]
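Gibbs sampling
Markov chain for Gibbs sampling
Target distribution $P(A, B, C)$ with conditionals $P(A \mid B, C)$, $P(B \mid A, C)$, $P(C \mid A, B)$
Start from $(a_0, b_0, c_0)$ and iterate:
$$
\begin{aligned}
a_{i+1} &\sim P(A \mid B = b_i, C = c_i) & (a_i, b_i, c_i) &\rightarrow (a_{i+1}, b_i, c_i) \\
b_{i+1} &\sim P(B \mid A = a_{i+1}, C = c_i) & (a_{i+1}, b_i, c_i) &\rightarrow (a_{i+1}, b_{i+1}, c_i) \\
c_{i+1} &\sim P(C \mid A = a_{i+1}, B = b_{i+1}) & (a_{i+1}, b_{i+1}, c_i) &\rightarrow (a_{i+1}, b_{i+1}, c_{i+1})
\end{aligned}
$$
$$\lim_{k \to \infty} P(A_k, B_k, C_k) = P(A, B, C)$$
$$\lim_{k \to \infty} P(A_k) = P(A); \quad \lim_{k \to \infty} P(B_k) = P(B); \quad \lim_{k \to \infty} P(C_k) = P(C)$$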
Gibbs sampling
True target distribution (2D normal $N(\mu, \Sigma)$)
$$\mu_{\text{true}} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \quad \Sigma_{\text{true}} = \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix}$$

Gibbs sampling
First 20 Gibbs sampling iterates (conditionals are 1D normals)

Gibbs sampling
Burn-in samples (1000 samples)
$$\mu_{\text{burn-in}} = \begin{pmatrix} 0.3634 \\ 0.4190 \end{pmatrix}, \quad \Sigma_{\text{burn-in}} = \begin{pmatrix} 1.1243 & 0.7443 \\ 0.7443 & 1.3724 \end{pmatrix}$$

Gibbs sampling
Samples after Markov chain convergence (samples 1000-2000)
$$\mu_{\text{converged}} = \begin{pmatrix} 0.0187 \\ 0.0443 \end{pmatrix}, \quad \Sigma_{\text{converged}} = \begin{pmatrix} 1.0282 & 0.5052 \\ 0.5052 & 1.0621 \end{pmatrix}$$
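The 2D normal example can be reproduced in spirit with a short Gibbs sampler. For zero means, unit variances, and correlation 0.5, each conditional is the 1D normal $x \mid y \sim N(\rho y, 1 - \rho^2)$, and symmetrically for $y$. The starting point, the seed, and the number of samples are assumptions, so the estimates will not match the slide's numbers exactly.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_samples, start=(3.0, 3.0), seed=0):
    """Gibbs sampler for a zero-mean, unit-variance 2D normal with correlation rho."""
    rng = np.random.default_rng(seed)
    x, y = start                      # assumed starting point, deliberately off-center
    sd = np.sqrt(1.0 - rho ** 2)      # standard deviation of each 1D conditional
    samples = np.empty((n_samples, 2))
    for k in range(n_samples):
        x = rng.normal(rho * y, sd)   # draw x from P(x | y)
        y = rng.normal(rho * x, sd)   # draw y from P(y | x), using the new x
        samples[k] = x, y
    return samples

samples = gibbs_bivariate_normal(rho=0.5, n_samples=20000)
kept = samples[1000:]                 # discard the first 1000 iterates as burn-in
print(kept.mean(axis=0))
print(np.cov(kept, rowvar=False))
```

Discarding the early iterates mirrors the slide's distinction between burn-in samples and samples drawn after convergence.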
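Data augmentation Gibbs sampling
Introducing unobserved variables often simplifies the expression of the likelihood
A Gibbs sampler can then be set up
  Sample from $P(\theta, M \mid D)$ using the conditionals $P(\theta \mid M, D)$ and $P(M \mid \theta, D)$
  One variable at a time: $P(\theta_i \mid \theta_{j \neq i}, M, D)$ and $P(M_i \mid M_{j \neq i}, \theta, D)$
  $\theta$: model parameters, $M$: missing data, $D$: data
Samples from the Gibbs sampler can be used to estimate parameters
$$\hat{\theta}_{\text{PME}} = E(\theta \mid D) = \int_{\theta} \int_{M} \theta \, P(\theta, M \mid D) \, dM \, d\theta \approx \frac{1}{N} \sum_{k=1}^{N} \theta_k$$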
Pros and cons
Gibbs sampling
  Explore the space of configurations of a probabilistic model of the data according to the probability of each configuration
  Based on incrementally perturbing the configuration one variable at a time, preferably choosing more likely configurations
Pros
  Clear probabilistic interpretation
  Bayesian framework
  “Global optimization”
Cons
  Mathematical details not easy to work out
  Relatively slow
Gibbs sampling
Current configuration
  $P(g_1 = 1 \mid g_{j \neq 1}, c, D)$?
  $P(g_2 = 1 \mid g_{j \neq 2}, c, D)$?
Next gene configuration
  $P(g_3 = 1 \mid g_{j \neq 3}, c, D)$?
Updated gene configuration
Next complete configuration
  Iterate many times
Gibbs biclustering
Sample from $P(g, c \mid D)$ using the conditionals $P(g_i \mid g_{j \neq i}, c, D)$ and $P(c_i \mid c_{j \neq i}, g, D)$
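The alternating updates of gene and condition indicators can be sketched as follows. This is a simplified illustration under assumptions that are not taken from the slides: entries inside the bicluster follow a per-condition multinomial with a symmetric Dirichlet prior (`alpha`), all other entries follow a fixed background distribution estimated from the whole data set, and every indicator has the same prior probability `prior`. All function names are hypothetical.

```python
import math
import numpy as np

def _sigmoid(z):
    z = max(min(z, 50.0), -50.0)                      # clip to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def gene_log_odds(D, g, c, i, log_bg, alpha, prior):
    """Log-odds of g_i = 1 versus g_i = 0, given all other indicators."""
    others = g.copy()
    others[i] = False                                 # pattern counts exclude gene i itself
    n_levels = len(log_bg)
    z = math.log(prior / (1.0 - prior))
    for j in np.flatnonzero(c):
        counts = np.bincount(D[others, j], minlength=n_levels)
        # Predictive probability of gene i's level under the pattern of the other genes.
        pred = (counts[D[i, j]] + alpha) / (counts.sum() + n_levels * alpha)
        z += math.log(pred) - log_bg[D[i, j]]
    return z

def condition_log_odds(D, g, j, log_bg, alpha, prior):
    """Log-odds of c_j = 1 versus c_j = 0, given the gene indicators."""
    x = D[g, j]
    n_levels = len(log_bg)
    counts = np.bincount(x, minlength=n_levels)
    # Dirichlet-multinomial marginal likelihood of column j restricted to bicluster genes.
    log_marginal = (math.lgamma(n_levels * alpha) - math.lgamma(len(x) + n_levels * alpha)
                    + sum(math.lgamma(k + alpha) - math.lgamma(alpha) for k in counts))
    return math.log(prior / (1.0 - prior)) + log_marginal - log_bg[x].sum()

def gibbs_bicluster(D, n_levels, n_iter=100, burn_in=50, alpha=1.0, prior=0.5, seed=0):
    """Return the average gene and condition memberships over the kept iterations."""
    rng = np.random.default_rng(seed)
    D = np.asarray(D)
    n_genes, n_conds = D.shape
    bg = np.bincount(D.ravel(), minlength=n_levels) + alpha
    log_bg = np.log(bg / bg.sum())
    g = rng.random(n_genes) < prior                   # random initial configuration
    c = rng.random(n_conds) < prior
    g_mean, c_mean = np.zeros(n_genes), np.zeros(n_conds)
    for it in range(n_iter):
        for i in range(n_genes):                      # update one gene indicator at a time
            g[i] = rng.random() < _sigmoid(gene_log_odds(D, g, c, i, log_bg, alpha, prior))
        for j in range(n_conds):                      # then one condition indicator at a time
            c[j] = rng.random() < _sigmoid(condition_log_odds(D, g, j, log_bg, alpha, prior))
        if it >= burn_in:
            g_mean += g
            c_mean += c
    return g_mean / (n_iter - burn_in), c_mean / (n_iter - burn_in)

# Hypothetical toy data: 20 genes, 8 conditions, 3 levels, with one planted bicluster.
toy = np.random.default_rng(1).integers(0, 3, size=(20, 8))
toy[:8, :4] = 2
p_genes, p_conds = gibbs_bicluster(toy, n_levels=3)
print(p_genes.round(2), p_conds.round(2))
```

The returned values are membership frequencies averaged over the iterations kept after burn-in. How quickly the chain settles on a planted block depends on the data, the seed, and the number of iterations.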
Remarks
Gibbs biclustering allows noisy patterns
Optimized configuration is obtained by averaging successive iterated configurations
Biclustering is oriented
  Find subset of samples for which a subset of genes is consistently expressed across genes
  Find subset of genes that are consistently expressed across a subset of samples
Searching for multiple patterns
  For gene biclustering, remove the data of the genes from the current bicluster
  Search for a new pattern
  Stop if only empty pattern repeatedly found
Mixed-Lineage Leukemia
Armstrong et al., Nature Genetics, 2002
Mixed-Lineage Leukemia (MLL) is a subtype of ALL
  Caused by chromosomal rearrangement in MLL gene
  Poorer prognosis than ALL
Microarray analysis shows that MLL is distinct from ALL
FLT3 tyrosine kinase distinguishes most strongly between MLL, ALL, and AML
  Candidate drug target
Biclustering leukemia data
Bicluster patients
  Find patients for which a subset of genes has a consistent expression profile across this group of patients
Discovery set
  21 ALL, 17 MLL, 25 AML
Validation set
  3 ALL, 3 MLL, 3 AML