1
Sylvia Richardson
Centre for Biostatistics
Imperial College, London
Bayesian hierarchical modelling of gene expression data
In collaboration with Natalia Bochkina, Anne Mette Hein, Alex Lewin (St Mary’s)
Helen Causton and Tim Aitman (Hammersmith)
Graeme Ambler and Peter Green (Bristol)
Philippe Broët (INSERM, Paris)
BBSRC Exploiting Genomics grant
2
Outline
• Hierarchical modelling framework
• A Bayesian gene expression index
• Modelling differential expression
• False discovery rate and mixture models
3
Introduction
• Gene expression is a hierarchical process
  – Substantive question
  – Experimental design
  – Sample preparation
  – Array design & manufacture
  – Gene expression matrix
  – Probe level data
  – Image level data
• Interest in using a statistical framework capable of handling multiple sources of variability coherently
Interesting variability (signal) + obscuring variability (noise)
Bayesian statistics
4
Bayesian hierarchical model framework
• Has the flexibility to model various sources of variability: between probes, gene specific, within array, between array, …
• Building of all these features into a common model
• Avoids the need to systematically use a plug-in approach: uncertainty is propagated
• ‘Borrow strength’ / share out information according to principle
• Allows some model checking
5
Gene expression analysis is a multi-step process
• Low-level model (how is the measured expression related to the signal)
• Multi-array processing (how to make appropriate combined inference)
• Differential expression
• Clustering
• Partition model
We build all these steps in a common statistical framework
6
Integrated modelling of Affymetrix data
[Diagram: under each of Condition 1 and Condition 2, PM/MM probe data with gene specific variability (probe) feed a gene and condition BGX index, which enters a hierarchical model of replicate (biological) variability and array effect; the two conditions are linked by the differential expression parameter.]
7
A fully Bayesian Gene eXpression index for Affymetrix GeneChip arrays
Anne Mette Hein, SR, Helen Causton, Graeme Ambler, Peter Green
[Diagram: PM/MM probe data with gene specific variability (probe), summarised by the gene index BGX]
8
Single array model: Motivation
Key observations and conclusions:
• Observation: PMs and MMs both increase with spike-in concentration (MMs more slowly than PMs)
  Conclusion: MMs bind a fraction of the signal
• Observation: spread of PMs increases with level
  Conclusion: multiplicative (and additive) error; transformation needed
• Observation: considerable variability in PM (and MM) response within a probe set
  Conclusion: varying reliability in gene expression estimation for different genes
• Observation: probe effects approximately additive on log scale
  Conclusion: estimate gene expression measure from PMs and MMs on log scale
9
Model assumptions and key biological parameters
• The intensity of the PM measurement for probe (reporter) j and gene g is due to:
  – binding of labelled fragments that perfectly match the oligos in the spot: the true signal $S_{gj}$
  – binding of labelled fragments that do not perfectly match these oligos: the non-specific hybridisation $H_{gj}$
• The intensity of the corresponding MM measurement is caused:
  – by a binding fraction $\Phi$ of the true signal $S_{gj}$
  – by non-specific hybridisation $H_{gj}$
10
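BGX single array model: $g = 1,\dots,G$ (thousands), $j = 1,\dots,J$ (11-20)
$PM_{gj} \sim N(S_{gj} + H_{gj}, \tau^2)$
$MM_{gj} \sim N(\Phi S_{gj} + H_{gj}, \tau^2)$
($S_{gj}$: signal; $H_{gj}$: non-specific hybridisation; $\Phi$: fraction; $\tau^2$: background noise, additive)
$\log(S_{gj}+1) \sim TN(\mu_g, \xi_g^2)$, $j = 1,\dots,J$
Gene specific error terms, exchangeable: $\log(\xi_g^2) \sim N(a, b^2)$
$\log(H_{gj}+1) \sim TN(\lambda, \eta^2)$ (array-wide distribution)
Gene expression index (BGX): $\theta_g = \mathrm{median}(TN(\mu_g, \xi_g^2))$, “pools” information over probes $j = 1,\dots,J$
Priors: “vague” $\tau^2 \sim \Gamma(10^{-3}, 10^{-3})$, $\Phi \sim B(1,1)$, $\mu_g \sim U(0,15)$, $\eta^2 \sim \Gamma(10^{-3}, 10^{-3})$, $\lambda \sim N(0, 10^3)$; “Empirical Bayes” for $a$, $b$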
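To make the generative structure of the single array model concrete, here is a minimal simulation sketch in Python for one probe set. It is not the authors' WinBUGS or C++ code. It assumes that TN denotes a normal distribution truncated below at 0, that the second argument of N and TN is a variance, and the function names and parameter values are hypothetical.

```python
import math
import random
import statistics

def truncated_normal(rng, mean, sd, lower=0.0):
    """Draw from N(mean, sd^2) truncated to [lower, inf) by simple rejection.

    Assumption: TN is truncated below at 0. Rejection is only efficient
    when most of the normal mass lies above the truncation point.
    """
    while True:
        x = rng.gauss(mean, sd)
        if x >= lower:
            return x

def simulate_probe_set(rng, mu_g, xi_g, lam, eta, phi, tau, n_probes):
    """Simulate (PM, MM) pairs for one gene under the single array model."""
    pairs = []
    for _ in range(n_probes):
        s = math.exp(truncated_normal(rng, mu_g, xi_g)) - 1.0  # signal S_gj
        h = math.exp(truncated_normal(rng, lam, eta)) - 1.0    # non-specific hybridisation H_gj
        pm = rng.gauss(s + h, tau)        # PM_gj ~ N(S_gj + H_gj, tau^2)
        mm = rng.gauss(phi * s + h, tau)  # MM_gj ~ N(phi * S_gj + H_gj, tau^2)
        pairs.append((pm, mm))
    return pairs

def bgx_index(rng, mu_g, xi_g, n_draws=20001):
    """Monte Carlo median of TN(mu_g, xi_g^2) for known mu_g and xi_g."""
    return statistics.median(truncated_normal(rng, mu_g, xi_g) for _ in range(n_draws))
```

In the full Bayesian analysis the parameters are of course unknown and the index is summarised from its posterior distribution; the sketch only shows how PM and MM arise from signal, non-specific hybridisation and additive noise.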
11
Implementation
• In WinBUGS for ease of model development and C++ for efficiency
• Joint estimation of parameters in full Bayesian framework
• Base inference on posterior distribution of all unknown quantities, $S_{gj}$, $H_{gj}$, $\theta_g$ = median of $TN(\mu_g, \xi_g^2)$, …, and use appropriate summaries
12
Single array model performance
Data set: varying concentrations (GeneLogic)
• 14 samples of cRNA from acute myeloid leukemia (AML) tumor cell line
• In sample k: each of 11 genes spiked in at concentration $c_k$:

sample k:         1    2    3     4    5    6    7    8    9     10   11   12   13   14
conc. c_k (pM):   0.0  0.5  0.75  1.0  1.5  2.0  3.0  5.0  12.5  25   50   75   100  150

• Each sample hybridised to an array
Consider subset consisting of 500 normal genes + 11 spike-ins
13
Single array model performance: one array, four genes spiked in at concentration 5.0
[Figure: probe behaviour (PM, MM, PM-MM) and posterior distributions of $\log(S_{gj}+1)$ and of the BGX index $\theta_g$ for the four genes. Legend: o: $\log(PM - MM)$; curve: $TN(\mathrm{medPost}(\mu_g), \mathrm{medPost}(\xi_g^2))$; bars: 2.5-97.5% credibility intervals.]
Probe behaviour: highly variable responses within probe sets and between genes
Probes, degree of response / variability over probe set: medium / high, low / low, medium / low, high / low
Posterior distributions reflect variability
14
Single array model performance: signal and expression index
10 arrays: gene 1 spiked in at increasing concentrations
‘True signal’ / expression index BGX increases with concentration
[Figure: posterior distributions and 2.5-97.5% credibility intervals for $\log(S_{gj}+1)$, $\theta_g$ and $\log(H_{gj}+1)$. Legend as previously: o: $\log(PM - MM)$; curve: $TN(\mathrm{medPost}(\mu_g), \mathrm{medPost}(\xi_g^2))$.]
15
Single array model performance: non-specific hybridization
10 arrays: gene 1 spiked in at increasing concentrations
Non-specific hybridization does not increase with concentration
[Figure: panels "Signals" and "Signals/cross"; 2.5-97.5% credibility intervals for $\log(H_{gj}+1)$; curve: $TN(\mathrm{medPost}(\mu_g), \mathrm{medPost}(\xi_g^2))$.]
16
Single array model performance: comparison with other expression measures
11 genes spiked in at 13 (increasing) concentrations
BGX index $\theta_g$ increases with concentration … except for gene 7 (spiked-in??)
Indication of smooth & sustained increase over a wider range of concentrations
17
2.5-97.5% credibility intervals for the Bayesian expression index
11 spike-in genes at 13 different concentrations (data set A)
Note how the variability is substantially larger for low expression levels
Each colour corresponds to a different spike-in gene
Gene 7: broken red line
18
What variability is captured?
• For some genes, there is considerable discrepancy between the information given by the different probes
• Posterior becomes “flat” or “bimodal”
• Hard to summarise by a single number
  Less reproducibility of point estimates of expression level
• Model improvement:
  – stratify Φ by CG content?
  – less weight to the MM in some cases?
  – more robust summary of index distribution or heavy tail distributions?
19
Single array model: examples of posterior distributions of BGX expression indices
[Figure: each curve represents a gene. Examples with data: o: $\log(PM_{gj} - MM_{gj})$, $j = 1,\dots,J_g$ (plotted at 0 if not defined); bars: mean ± 1 SD.]
20
Differential expression and array effects
Alex Lewin, SR, Natalia Bochkina, Anne Glazier, Tim Aitman
21
Data Set and Biological question
Previous Work (Tim Aitman, Anne Marie Glazier)
The spontaneously hypertensive rat (SHR): A model of human insulin resistance syndromes.
Deficiency in gene Cd36 found to be associated with insulin resistance in SHR
Following this, several animal models were developed where other relevant genes are knocked out, leading to comparisons between knockout and wildtype (normal) mice or rats.
22
Data Set and Biological question
Microarray Data
Data set A (MAS 5) (~12000 genes on each array)
3 SHR compared with 3 transgenic rats
Data set B (RMA) (~22700 genes on each array)
8 wildtype (normal) mice compared with 8 knocked out mice
Biological Question
Find genes which are expressed differently in wildtype and knockout / transgenic mice
23
[Diagram: PM/MM data under Condition 1 and Condition 2, each with a gene specific error term and a hierarchical model of replicate variability and array effect, linked by the differential expression parameter. The differential expression parameter is summarised either by its posterior distribution (flat prior) or by mixture modelling for classification.]
24
Model for Differential Expression
• Expression-level-dependent normalisation
• Only few replicates per gene, so share information between genes to estimate variability of gene expression between the replicates
• To select interesting genes:
  – Use posterior distribution of quantities of interest (function of $d_g$, ranks, …)
  – Use mixture prior on the differential expression parameter
25
Bayesian hierarchical model for replicate expression data (under one condition)
Data: $y_{gr}$ = log gene expression for gene g, replicate r (for the present, $y_{gr}$ is treated as known data)
$\alpha_g$ = gene effect
$\beta_r(\cdot)$ = array effect (possibly expression-level dependent)
$\sigma_g^2$ = gene specific variance
• 1st level
  $y_{gr} \sim N(\alpha_g + \beta_r(\alpha_g), \sigma_g^2)$, $\sum_r \beta_r(\alpha_g) = 0$
  $\beta_r(\cdot)$ = smooth function of $\alpha_g$: piecewise polynomial with unknown break points
26
Exploratory analysis of array effect
[Figure: panels for Condition 1 (3 replicates) and Condition 2 (3 replicates); spline curves shown]
Needs ‘normalisation’
27
Hierarchical structure for gene specific parameters
• 2nd level
  Priors for $\alpha_g$ (flat), coefficients and break points
  $\sum_r \beta_r(\alpha_g) = 0$ constraint imposed
  $\sigma_g^2 \sim \text{lognormal}(\mu, \tau)$
  Hyper-parameters $\mu$ and $\tau$ can be influential. In a full Bayesian analysis, these are not fixed
• 3rd level
  $\mu \sim N(c, d)$, $\tau \sim \text{lognormal}(e, f)$
28
Smoothing of the gene specific variances
• Variances are estimated using information from all G × R measurements (~12000 × 3) rather than just 3
• Variances are stabilised and shrunk towards average variance
29
Bayesian Model Checking
• Check assumptions on gene variances, e.g. exchangeable variances, what distribution?
• Predict sample variance $S_g^{2,\text{new}}$ (a chosen checking function) from the model specification (not using the data for this)
• Compare predicted $S_g^{2,\text{new}}$ with observed $S_g^{2,\text{obs}}$
  ‘Bayesian p-value’: $\text{Prob}(S_g^{2,\text{new}} > S_g^{2,\text{obs}})$
• Distribution of p-values approx. Uniform if model is ‘true’ (Marshall and Spiegelhalter, 2003)
• Easily implemented in MCMC algorithm
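The Bayesian p-value for the gene variances can be approximated inside the sampler with a few lines of code. The Python sketch below is one possible reading of the check, not the authors' implementation. It assumes normal replicates, so that a sample variance from R replicates is the gene variance times a chi-square variable on R - 1 degrees of freedom divided by R - 1, and it assumes that the second lognormal parameter is the standard deviation on the log scale. The function names and the `hyper_draws` input (posterior draws of the two hyper-parameters) are hypothetical.

```python
import random

def predict_sample_variance(rng, mu, tau, n_rep):
    """One predictive draw of a sample variance based on n_rep replicates."""
    sigma2_new = rng.lognormvariate(mu, tau)         # new gene variance from the 2nd level
    chi2 = rng.gammavariate((n_rep - 1) / 2.0, 2.0)  # chi-square with n_rep - 1 d.f.
    return sigma2_new * chi2 / (n_rep - 1)

def bayesian_p_value(rng, s2_obs, hyper_draws, n_rep):
    """Fraction of predictive draws that exceed the observed sample variance.

    hyper_draws: list of (mu, tau) pairs, one per MCMC iteration.
    """
    exceed = sum(predict_sample_variance(rng, mu, tau, n_rep) > s2_obs
                 for mu, tau in hyper_draws)
    return exceed / len(hyper_draws)
```

Computing this for every gene and plotting a histogram of the resulting p-values gives the comparison with the Uniform distribution described on the slide.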
30
Data set A
31
Differential expression model
The quantity of interest is the difference between conditions for each gene: $d_g$, $g = 1,\dots,N$
Joint model for the 2 conditions:
$y_{g1r} = \alpha_g - \tfrac{1}{2} d_g + \beta_{1r}(\alpha_g) + \varepsilon_{g1r}$, $r = 1,\dots,R_1$
$y_{g2r} = \alpha_g + \tfrac{1}{2} d_g + \beta_{2r}(\alpha_g) + \varepsilon_{g2r}$, $r = 1,\dots,R_2$
• $\alpha_g$ is now the overall gene effect over the conditions
• The parameter of interest $d_g$ is given a flat prior (for now)
• Same assumptions for the distribution of $\sigma^2_{gs}$
• Modelling of $\beta_{sr}(\alpha_g)$ as before, $s = 1, 2$; sum-to-zero constraint imposed within each condition
32
Possible Statistics for Differential Expression
$d_g \approx$ log fold change
$d_g^* = d_g / (\sigma^2_{g1}/R_1 + \sigma^2_{g2}/R_2)^{1/2}$ (standardised difference)
• We obtain the posterior distribution of all $\{d_g\}$ and/or $\{d_g^*\}$
• Can compute directly posterior probability of genes satisfying criterion X of interest:
  $p_{g,X} = \text{Prob}(g \text{ of “interest”} \mid \text{Criterion X}, \text{data})$
• Can compute the distributions of ranks
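Both summaries are simple functions of the MCMC output. The following Python sketch shows one way to compute them, assuming the sampler output is stored as a list of iterations, each holding one value per gene. The storage layout and function names are hypothetical, and the criterion is passed in as a function.

```python
def posterior_prob(draws, criterion):
    """Per-gene fraction of MCMC iterations in which criterion(value) holds.

    draws: list of iterations; draws[i][g] is the value for gene g at iteration i.
    """
    n_iter = len(draws)
    n_genes = len(draws[0])
    return [sum(criterion(it[g]) for it in draws) / n_iter for g in range(n_genes)]

def rank_draws(draws):
    """Rank the genes within each iteration (1 = smallest value)."""
    ranked = []
    for it in draws:
        order = sorted(range(len(it)), key=lambda g: it[g])
        ranks = [0] * len(it)
        for position, g in enumerate(order, start=1):
            ranks[g] = position
        ranked.append(ranks)
    return ranked
```

Quantiles of each gene's column in the output of `rank_draws` give credibility intervals for its rank, and a criterion that involves several parameters can be handled by storing a tuple per gene and iteration.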
33
Plot of log fold change versus overall expression level
Data set A: 3 wildtype mice compared to 3 knockout mice (U74A chip), MAS5
Criterion X: gene is of interest if |log fold change| > log(2) and log(overall expression) > 4
The majority of the genes have very small $p_{g,X}$: 90% of genes have $p_{g,X} < 0.2$
Genes with $p_{g,X} > 0.5$ (green): 280
Genes with $p_{g,X} > 0.8$ (red): 46
[Figure annotation: $p_{g,X} = 0.49$]
Genes with low overall expression have a greater range of fold change than those with higher expression
34
Plot of log fold change versus overall expression level
Experiment: 8 wildtype mice compared to 8 knockout mice, RMA
Criterion X: gene is of interest if |log fold change| > log(1.5)
The majority of the genes have very small $p_{g,X}$: 97% of genes have $p_{g,X} < 0.2$
Genes with $p_{g,X} > 0.5$ (green): 292
Genes with $p_{g,X} > 0.8$ (red): 139
35
Posterior probabilities and log fold change
Data set A : 3 replicates MAS5 Data set B : 8 replicates RMA
36
Credibility intervals for ranks
100 genes with lowest rank (most under/over expressed)
Low rank, high uncertainty
Low rank, low uncertainty
Data set B
37
Using the posterior distribution of $d_g^*$ (standardised difference)
• Compute $\text{Probability}(|d_g^*| > 2 \mid \text{data})$: Bayesian analogue of a t test!
• Order genes
• Select genes such that $\text{Probability}(|d_g^*| > 2 \mid \text{data}) >$ cut-off
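The three steps above can be sketched in Python as follows. The point of the sketch is that the difference is standardised within each posterior draw, using that draw's variances, rather than by plugging in variance estimates. The input layout (a list of draws per gene) and the function names are hypothetical.

```python
import math

def standardised_difference(d, var1, var2, r1, r2):
    """d* = d / sqrt(var1/r1 + var2/r2) for one posterior draw."""
    return d / math.sqrt(var1 / r1 + var2 / r2)

def select_genes(draws_by_gene, r1, r2, threshold=2.0, cutoff=0.95):
    """Select genes with P(|d*| > threshold | data) > cutoff.

    draws_by_gene[g] is a list of (d, var1, var2) posterior draws for gene g.
    Returns (gene, probability) pairs ordered by decreasing probability.
    """
    selected = []
    for g, draws in enumerate(draws_by_gene):
        hits = sum(abs(standardised_difference(d, v1, v2, r1, r2)) > threshold
                   for d, v1, v2 in draws)
        prob = hits / len(draws)
        if prob > cutoff:
            selected.append((g, prob))
    return sorted(selected, key=lambda pair: -pair[1])
```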
38
Volcano plots
[Figure: panels "Bayesian" and "T test (Bayesian estimate)"]
For illustration, cut-off lines drawn at 0.95
39
Integrated modelling of Affymetrix data
[Diagram: under each of Condition 1 and Condition 2, PM/MM probe data with gene specific variability (probe) give the distribution of the expression index (BGX) for gene g in that condition, which enters a hierarchical model of replicate (biological) variability and array effect; the two conditions together give the distribution of the differential expression parameter.]
40
BGX multiple array model: conditions $c = 1,\dots,C$, replicates $r = 1,\dots,R_c$
$PM_{gjcr} \sim N(S_{gjcr} + H_{gjcr}, \tau_{cr}^2)$
$MM_{gjcr} \sim N(\Phi S_{gjcr} + H_{gjcr}, \tau_{cr}^2)$
($\tau_{cr}^2$: background noise, additive, array specific)
$\log(S_{gjcr}+1) \sim TN(\mu_{gc}, \xi_{gc}^2)$
Gene and condition specific BGX: $\theta_{gc} = \mathrm{median}(TN(\mu_{gc}, \xi_{gc}^2))$, “pools” information over replicate probe sets $j = 1,\dots,J$, $r = 1,\dots,R_c$
$\log(H_{gjcr}+1) \sim TN(\lambda_{cr}, \eta_{cr}^2)$: array-specific distribution of non-specific hybridisation
41
Posterior distributions of BGX: single array vs multiple array analyses
Mean ± 1 SD
Three replicate arrays analysed separately
Three replicate arrays analysed together (multiple array model)
42
Subset of AffyU133A spike-in data set (AffyComp)
Consider:
• Six arrays, 1154 genes (every 20th and 42 spike-ins)
• Same cRNA hybridised to all arrays EXCEPT for spike-ins:

                             1      2      3      …   12      13      14
Spike-in genes:              1-3    4-6    7-9    …   34-36   37-39   40-42
Spike-in conc (pM):
  Condition 1 (arrays 1-3):  0.0    0.25   0.50   …   128     256     512
  Condition 2 (arrays 4-6):  0.25   0.50   1.00   …   256     512     0.00
Fold change:                 -      2      2      …   2       2       -
43
M v A plots:
True fold changes: Black: zero, Red: 2
$A = \tfrac{1}{2}(\text{expr}_{g,1} + \text{expr}_{g,2})$, $M = \text{expr}_{g,1} - \text{expr}_{g,2}$
NB! Point estimates used
MAS5 and RMA: $\text{expr}_{gc}$ = mean over three replicates
BGX: multiple array index
44
BGX: measure of uncertainty provided
Posterior mean ± 1 SD credibility intervals
$\text{diff}_g = \text{bgx}_{g,1} - \text{bgx}_{g,2}$
Spike-ins 1113-1154 above the blue line
Blue stars show RMA measure
45
Mixture and Bayesian estimation of false discovery rates
Natalia Bochkina, Alex Lewin, SR, Philippe Broët
46
Multiple Testing Problem
• Gene lists can be built by computing a criterion separately for each gene and ranking
• Thousands of genes are considered simultaneously
• How to assess the performance of such lists?
Statistical Challenge
Select interesting genes without including too many false positives in a gene list
A gene is a false positive if it is included in the list when it is truly unmodified under the experimental set-up
Want an evaluation of the expected false discovery rate (FDR)
47
Bayesian Estimate of FDR
• Step 1: Choose a gene specific parameter (e.g. $d_g$) or a gene statistic (see later)
• Step 2: Model its prior (resp. marginal) distribution using a mixture model
  – with one component that models the unaffected genes (null hypothesis), e.g. point mass at 0 for $d_g$
  – other components that model (flexibly) the alternative
• Step 3: Calculate the posterior probability for any gene of belonging to the unmodified component: $p_{g0} \mid \text{data}$
• Step 4: Evaluate FDR (and FNR) for any list. Assuming that all the gene classifications are independent:
  $\text{Bayes FDR}(\text{list}) \mid \text{data} = \frac{1}{\mathrm{card}(\text{list})} \sum_{g \in \text{list}} p_{g0}$
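Step 4 is a plain average of posterior null probabilities over the list. The Python sketch below computes it, and also shows one way to build the largest list whose estimated FDR stays below a chosen target by adding genes in order of increasing null probability. The function names are hypothetical and the second function is an illustration, not a procedure taken from the slides.

```python
def bayes_fdr(p_null, gene_list):
    """Average posterior probability of the null over the genes on the list."""
    return sum(p_null[g] for g in gene_list) / len(gene_list)

def largest_list_with_fdr(p_null, target):
    """Largest list, built by increasing p_null, with Bayes FDR <= target.

    Because genes are added in increasing order of p_null, the running
    average never decreases, so stopping at the first violation is safe.
    """
    order = sorted(range(len(p_null)), key=lambda g: p_null[g])
    selected, total = [], 0.0
    for g in order:
        if (total + p_null[g]) / (len(selected) + 1) > target:
            break
        selected.append(g)
        total += p_null[g]
    return selected
```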
48
Mixture prior
• To obtain a gene list, a commonly used method (cf. Lonnstedt & Speed 2002, Newton 2003, Smyth 2003, …) is to define a mixture prior for $d_g$:
  – H0: $d_g = 0$, point mass at 0 with probability $p_0$
  – H1: $d_g \sim$ flexible 2-sided distribution to model differential expression
• Classify each gene following its posterior probability of not being in the null: $1 - p_{g0}$
• Use Bayes rule or fix the FDR
49
Classification with mixture prior
• Joint estimation of all the mixture parameters (including p0) avoids plugging-in of values (e.g. p0) that are influential on the classification
• Sensitivity to prior settings of the alternative distribution and performance has been tested on simulated data sets
Work in progress
Poster by Natalia Bochkina
50
Performance of the mixture prior
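$y_{g1r} = \alpha_g - \tfrac{1}{2} d_g + \varepsilon_{g1r}$, $r = 1,\dots,R_1$
$y_{g2r} = \alpha_g + \tfrac{1}{2} d_g + \varepsilon_{g2r}$, $r = 1,\dots,R_2$
(For simplification, we assume that the data have been pre-normalised)
$\sigma^2_g \sim IG(a, b)$
$d_g \sim p_0 \delta_0 + p_1 G(1.5, \lambda_1) + p_2 G(1.5, \lambda_2)$ (H0: point mass $\delta_0$; H1: the two Gamma components)
Dirichlet distribution for $(p_0, p_1, p_2)$
Exponential hyper prior for $\lambda_1$ and $\lambda_2$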
51
Simulated data
$y_{gr} \sim N(d_g, \sigma^2_g)$ (8 replicates)
$\sigma^2_g \sim IG(1.5, 0.05)$
$d_g \sim (-1)^{\text{Bern}(0.5)}\, G(2, 2)$, $g = 1{:}200$
$d_g = 0$, $g = 201{:}1000$
Choice of simulation parameters inspired by estimates found in analyses of biological data sets
Plot of the true differences
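A data set of this form can be generated with the Python standard library, as in the sketch below. The parameterisations are assumptions, since the slide does not state them: G(2, 2) is read as shape 2 and rate 2, and IG(1.5, 0.05) as shape 1.5 and scale 0.05 (the reciprocal of a Gamma variable with shape 1.5 and rate 0.05). The function name is hypothetical.

```python
import random

def simulate(rng, n_genes=1000, n_de=200, n_rep=8):
    """Return (d, sigma2, y): true differences, variances and replicate data."""
    d, sigma2, y = [], [], []
    for g in range(n_genes):
        if g < n_de:
            sign = -1.0 if rng.random() < 0.5 else 1.0     # (-1)^Bern(0.5)
            d_g = sign * rng.gammavariate(2.0, 1.0 / 2.0)  # G(2, 2): shape 2, rate 2 (assumed)
        else:
            d_g = 0.0                                      # unmodified genes
        # IG(1.5, 0.05): reciprocal of Gamma(shape 1.5, rate 0.05) (assumed)
        s2_g = 1.0 / rng.gammavariate(1.5, 1.0 / 0.05)
        sd_g = s2_g ** 0.5
        d.append(d_g)
        sigma2.append(s2_g)
        y.append([rng.gauss(d_g, sd_g) for _ in range(n_rep)])
    return d, sigma2, y
```

Note that `random.gammavariate` takes a shape and a scale, which is why the rates are inverted in the calls.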
52
Posterior estimates of fold change using mixture model
53
Comparison of mixture classification and posterior probabilities for the standardised differences
[Figure: $\text{Probability}(|d_g^*| > 2 \mid \text{data})$ compared with Post Prob($g \in H1$); in red, the 200 genes with $d_g \neq 0$]
False negatives: 31 = 4%
False positives: 10 = 6%
54
Post Prob($g \in H1$) = $1 - p_{g0}$
[Figure: FDR (black) and FNR (blue) as a function of $1 - p_{g0}$, with the Bayes rule cut-off marked]
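Curves of this kind can be traced from the posterior null probabilities alone. The Python sketch below assumes that a gene is listed when its posterior probability of not being in the null exceeds the cut-off, that the FDR estimate is the average null probability over listed genes, and that the FNR estimate is the average non-null probability over the remaining genes. These estimators and the function name are assumptions made for illustration.

```python
def fdr_fnr_curve(p_null, cutoffs):
    """Return (cutoff, FDR, FNR) triples for lists defined by 1 - p_null > cutoff."""
    rows = []
    for c in cutoffs:
        listed = [p for p in p_null if 1.0 - p > c]
        rest = [p for p in p_null if 1.0 - p <= c]
        fdr = sum(listed) / len(listed) if listed else 0.0         # average null prob. on the list
        fnr = sum(1.0 - p for p in rest) / len(rest) if rest else 0.0  # average non-null prob. off the list
        rows.append((c, fdr, fnr))
    return rows
```

A cut-off of 0.5 corresponds to classifying each gene to its more probable component.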
55
Using mixtures for modelling the marginal distribution of gene statistics
• Instead of modelling the prior for $d_g$ as a mixture, an alternative is
  – To summarise differential expression by a gene statistic
  – To model its marginal distribution as a mixture such that the distribution is approximately known under H0, and use a flexible distribution for the alternative
56
Mixture modelling of transformed F statistics
Gene statistic based on classical F statistic (this was developed to analyse multiclass ( > 2 conditions) experiments)
Gives a de-centred asymmetric marginal distribution rather than a two-tailed one
Transform F -> approx. standard Normal if no change across conditions (H0).
Use a mixture of normals (variable number) for modelling the alternative (following Richardson and Green 1997)
57
Results for Simulated Data (to detect modified profile over 3 conditions)
Broët, Lewin, SR 2004
Bayes mixture estimate of FDR is close to true value
Case A: well separated null and alternative hypotheses
Case B: less separated null and alternative hypotheses
For details, see the poster by Alex Lewin
58
Marginal mixture performance for the simulated data (2 conditions, same data as for the prior mixture)
[Figure: number on list as a function of cut-off probability; expected number of false positives]
59
Simulated data, comparison of prior and marginal mixture classification
Good agreement between the 2 approaches
The marginal mixture has more false positives
Transformation to Normality for 2 conditions??
Further comparison in progress
60
Summary
Bayesian gene expression measure (BGX)
• Good range of resolution, provides credibility intervals
Differential Expression
• Expression-level-dependent normalisation
• Borrow information across genes for variances
• Joint distribution of ranks, gene lists based on posterior probabilities
False Discovery Rate
• Mixture gives good estimate of FDR and classifies well
Future work
• Mixture prior on BGX index, with uncertainty propagated to mixture parameters, comparison of marginal and prior mixture approaches, clustering for more general experimental set-ups
61
Papers and technical reports:
Hein AM., Richardson S., Causton H., Ambler G. and Green P. (2004) BGX: a fully Bayesian gene expression index for Affymetrix GeneChip data (submitted)
Lewin A., Richardson S., Marshall C., Glazier A. and Aitman T. (2003) Bayesian Modelling of Differential Gene Expression (submitted)
Broët P., Lewin A., Richardson S., Dalmasso C. and Magdelenat H. (2004) A mixture model-based strategy for selecting sets of genes in multiclass response microarray experiments. (Bioinformatics, advanced access April 29 2004)
Broët P., Richardson S. and Radvanyi F. (2002) Bayesian Hierarchical Model for Identifying Changes in Gene Expression from Microarray Experiments, Journal of Computational Biology 9, 671-683.
Available at http://www.bgx.org.uk/
Thanks