Using Simulated Data to Optimise Experimental Design
and Analysis for RNA-Sequencing.
Conrad Burden Mathematical Sciences Institute Australian National University
Canberra
RNA-Seq: Using high-throughput sequencing technology to sequence cDNA that has been reverse-transcribed from RNA, to get information about a sample's RNA content. If the sample is mRNA from a cell, it detects which genes are expressed.
Useful for:
1. Expression profiling
2. Detecting differential expression
Extract RNA → Library prep (RNA → cDNA) → Sequencing
• Extract mRNA from total RNA
• Randomly fragment
• Reverse transcribe to cDNA
• Ligate sequencing adaptor
• Size select to ~ 200 bases
• Amplify with PCR
Sequence and map to reference genome to get a digital count of fragments sampled from each gene
Sources of variability along the workflow: biological variation and technical variation (giving overdispersion) and Poisson noise.
Final count for each gene is overdispersed Poisson
Extract RNA → Library prep → Sequencing: RNA → cDNA (conc = R) → (count = K); transcript abundance ~ q
1. For a given gene of interest, let R = molar concentration of cDNA in the 'library', with E(R) = q and Var(R) = v.
2. Consider q as a proxy for the 'transcript abundance' of this gene.
3. The sequencer count K for this gene, given R, is Poisson: K | R ~ Pois(λR).
1, 2 and 3 imply
E(K) = μ, Var(K) = μ(1 + φμ), where μ = λq, φ = v/q².
φ is called the overdispersion.
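The step from 1, 2 and 3 to the mean and variance of K follows from the laws of total expectation and total variance. A short derivation, using only the symbols already defined (R, q, v, λ, μ, φ):

```latex
\begin{align*}
E(K) &= E\big[E(K \mid R)\big] = E(\lambda R) = \lambda q = \mu, \\
\mathrm{Var}(K) &= E\big[\mathrm{Var}(K \mid R)\big] + \mathrm{Var}\big[E(K \mid R)\big] \\
                &= E(\lambda R) + \mathrm{Var}(\lambda R) = \lambda q + \lambda^{2} v \\
                &= \mu + \mu^{2}\,\frac{v}{q^{2}} = \mu(1 + \phi\mu),
                \qquad \phi = \frac{v}{q^{2}}.
\end{align*}
```

The first term, μ, is the Poisson (sequencing) noise; the second, φμ², is the extra variance inherited from the variability of R.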
Moreover, if
λR ~ Gamma(mean = μ, variance = φμ²)
then
K ~ NegBin(mean = μ, variance = μ(1 + φμ)).
If λ, μ and φ can be estimated from the data, q = μ/λ gives an estimate of the abundance of this transcript.
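The gamma-Poisson construction can be checked numerically. The sketch below is illustrative only and is not taken from the talk: it uses the Python standard library, takes Var(λR) = φμ² (the value implied by φ = v/q²), and uses arbitrary example values μ = 20, φ = 0.25. The helper names `sample_poisson` and `sample_negbin` are hypothetical.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method.

    The mean is processed in chunks of at most 500 so that exp(-chunk)
    does not underflow; a sum of independent Poissons is again Poisson.
    """
    count = 0
    while lam > 0:
        chunk = min(lam, 500.0)
        lam -= chunk
        threshold = math.exp(-chunk)
        product = rng.random()
        while product > threshold:
            count += 1
            product *= rng.random()
    return count


def sample_negbin(mu, phi, rng):
    """Draw K with mean mu and variance mu*(1 + phi*mu) as a gamma-Poisson mixture."""
    shape = 1.0 / phi   # gamma shape
    scale = phi * mu    # gamma scale: mean = shape*scale = mu, variance = phi*mu**2
    rate = rng.gammavariate(shape, scale)  # plays the role of lambda * R
    return sample_poisson(rate, rng)       # K | R ~ Pois(lambda * R)


rng = random.Random(1)
mu, phi = 20.0, 0.25
draws = [sample_negbin(mu, phi, rng) for _ in range(20_000)]
mean = sum(draws) / len(draws)
var = sum((k - mean) ** 2 for k in draws) / (len(draws) - 1)
print("sample mean:", mean, "target:", mu)
print("sample variance:", var, "target:", mu * (1 + phi * mu))
```

The sample mean and variance should land close to μ and μ(1 + φμ); a pure Poisson sample with the same mean would instead have variance close to μ.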
(Data: human lymphoblastoid cell lines from J.K. Pickrell et al., Nature 464 768–772.)
[Figure panels: synthetic Poisson vs. Poisson; same cDNA library, different sequencers; same biological source, different cDNA libraries; different biological replicates.]
Data from an RNA-Seq experiment to detect differential expression typically looks like this:

Gene               Condition A               Condition B               ... etc.
                   Rep 1   Rep 2   ...etc    Rep 1   Rep 2   ...etc
ENSG00000209432        4       6   ...          35      45   ...
ENSG00000209432        0       0   ...           2       1   ...
ENSG00000209432      110      96   ...         177     203   ...
ENSG00000209432     1268    1089   ...        9246    9873   ...
ENSG00000212678      148     201   ...         112      93   ...
... etc.

Rows: typically > 10,000 genes or transcript isoforms.
Columns: n reps per condition, for different conditions or biol. samples.
Which genes are differentially expressed?
R packages for assessing differential expression based on the negative binomial distribution:
• DESeq: S. Anders and W. Huber, Genome Biol. 11:R106 (2010)
• edgeR: M. Robinson, D. McCarthy and G. Smyth, Bioinformatics 26:139 (2010)
• (also NBPseq: Y. Di et al., SAGMB 10:24 (2011) and TSPM: P. Auer and R. Doerge, SAGMB 10:26 (2011))
They differ in how they estimate the overdispersion (φ) for each gene from a limited number of replicates:
• DESeq: dispersion φ estimated for each gene as the greater of a per-gene maximum likelihood estimate and a parametric fit to φ = a + b/μ
• edgeR: dispersion φ estimated per gene from a likelihood function conditioned on the sum across conditions, then squeezed towards a common-to-all-genes dispersion using empirical Bayes
p-values under the null hypothesis
(μ/λ)_condition A = (μ/λ)_condition B
are calculated under the approximation that the total counts in each condition are NB, and conditioned on the sum of counts.
[Figure: the plane of K_A = counts (cond. A) against K_B = counts (cond. B), with the point (a, b) marked; beside it, Prob(K_A = a | K_A + K_B = a + b) plotted against a, with k_A marked.]
• The (1-sided) p-value is the sum of the highlighted probabilities in one tail.
• The (2-sided) p-value is the sum of the highlighted probabilities in both tails.
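A minimal sketch of such a conditional test, using only the Python standard library. It is not the packages' code. It assumes the convention that the 2-sided p-value sums every split of the observed total that is no more probable than the observed split, and it takes the NB parameters of the two per-condition totals as given inputs. In practice these have to be estimated. The function names are hypothetical.

```python
import math


def nb_log_pmf(k, mu, phi):
    """log P(K = k) for K ~ NegBin(mean = mu, variance = mu*(1 + phi*mu))."""
    r = 1.0 / phi                 # NB 'size' parameter
    p = 1.0 / (1.0 + phi * mu)    # success probability, so that the mean is mu
    return (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
            + r * math.log(p) + k * math.log(1.0 - p))


def conditional_pvalue(k_a, k_b, mu_a, phi_a, mu_b, phi_b):
    """Two-sided p-value for the observed split (k_a, k_b), given k_a + k_b.

    Sums Prob(K_A = a, K_B = total - a) over all splits that are no more
    likely than the observed one and divides by the sum over all splits.
    """
    total = k_a + k_b
    log_joint = [nb_log_pmf(a, mu_a, phi_a) + nb_log_pmf(total - a, mu_b, phi_b)
                 for a in range(total + 1)]
    peak = max(log_joint)
    joint = [math.exp(v - peak) for v in log_joint]  # rescaled to avoid underflow
    observed = joint[k_a]
    # Small relative tolerance so that exact ties are counted on both sides.
    extreme = sum(p for p in joint if p <= observed * (1 + 1e-9))
    return extreme / sum(joint)


# Example with made-up parameters: equal means and dispersions under the null.
print(conditional_pvalue(10, 80, mu_a=40.0, phi_a=0.1, mu_b=40.0, phi_b=0.1))
```

Because the conditional distribution is discrete, the most probable split always receives a p-value of exactly 1.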
Robles et al., BMC Genomics (2012) 13:484
Test DESeq and edgeR using simulated data
Testing the null hypothesis:
1. Start with the Pickrell et al. dataset of 69 sequenced cDNA libraries from the HapMap project (i.e. a table of RNA-Seq counts for 69 biological replicates of ~60,000 transcript isoforms).
2. Use max. likelihood to produce from this a set of NB parameters (μ_i, φ_i) for i = 1, ..., ~60,000 representing a 'typical' range of means and overdispersions for our synthetic transcriptome.
3. Construct a synthetic dataset of counts:
• n reps of 'control' counts K_ij^control ~ NB(μ_i, φ_i), j = 1, ..., n
• n reps of 'treatment' counts K_ij^treatment ~ NB(μ_i, φ_i)
Null hypothesis (no up- or down-regulation), n = 3 reps vs. 3 reps: expect a flat p-value distribution.
[Figure: Synthetic data, 3 rep vs. 3 rep. Histograms of p-values (x-axis 0.0 to 1.0; y-axis percentage of total, 0 to 10) in a 3 × 3 grid: NBP, DESeq and edgeR, each for all transcripts, transcripts with > 100 counts (high) and transcripts with < 100 counts (low).]
[Figure: DESeq null p-values, synthetic data 3 vs. 3. Density against p-value.]
The right-hand spike is an artifact of calculating p-values from a discrete distribution. It could be 'fixed' by replacing the discrete distribution by a continuous distribution.
[Figure: Prob(K_A = a | K_A + K_B = k_A + k_B) plotted against a, with k_A marked.] The 2-sided p-value is the sum of these probabilities.
[Figure: the same distribution drawn as a continuous density, with k_A marked.] The 2-sided p-value is the shaded area: choose a point randomly in the interval (k_A − ½, k_A + ½).
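The effect of drawing a point in (k − ½, k + ½) can be illustrated with a simplified one-sided version. This sketch is an assumption-laden stand-in, not the talk's calculation: it uses an upper-tail p-value rather than the 2-sided one, and a Binomial(10, ½) null in place of the conditional NB distribution. The pmf is treated as a step-function density of width 1 around each integer, and the p-value is the area to the right of the randomly chosen point.

```python
import math
import random


def binom_pmf(a, n, p):
    """P(K = a) for K ~ Binomial(n, p)."""
    return math.comb(n, a) * p**a * (1 - p) ** (n - a)


def discrete_pvalue(k, pmf, support):
    """Upper-tail p-value P(K >= k) computed from the discrete distribution."""
    return sum(pmf(a) for a in support if a >= k)


def smoothed_pvalue(k, pmf, support, rng):
    """Area to the right of a point x drawn uniformly in (k - 1/2, k + 1/2),
    under a step-function density equal to pmf(a) on (a - 1/2, a + 1/2)."""
    x = k - 0.5 + rng.random()
    above = sum(pmf(a) for a in support if a > k)
    return above + (k + 0.5 - x) * pmf(k)


rng = random.Random(0)
n, p = 10, 0.5
support = range(n + 1)


def pmf(a):
    return binom_pmf(a, n, p)


# Simulate counts under the null and compare the two kinds of p-value.
draws = [sum(rng.random() < p for _ in range(n)) for _ in range(20_000)]
discrete = [discrete_pvalue(k, pmf, support) for k in draws]
smoothed = [smoothed_pvalue(k, pmf, support, rng) for k in draws]
frac_discrete = sum(v > 0.9 for v in discrete) / len(discrete)
frac_smoothed = sum(v > 0.9 for v in smoothed) / len(smoothed)
print("fraction of p-values above 0.9 (discrete):", frac_discrete)
print("fraction of p-values above 0.9 (smoothed):", frac_smoothed)
```

For this stand-in null, the discrete p-values exceed 0.9 with probability P(K ≤ 3) = 176/1024 ≈ 17%, whereas a uniform distribution gives 10%. The smoothed p-values are uniform by construction, which removes the pile-up near 1.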
[Figure: DESeq null p-values, synthetic data 3 vs. 3. Density against p-value; curves: original spectrum, spike removed.]
Remaining deviation from a uniform distribution is from having to estimate the parameters μ and φ for each transcript.
[Figure: DESeq null p-values, synthetic data 3 vs. 3. Density against p-value; curves: original spectrum, spike removed, parameters not estimated.]
Null hypothesis: α = 0 (no up- or down-regulation), n = 3 reps vs. 3 reps: expect a flat p-value distribution.
[Figure: the 3 × 3 grid of synthetic-data p-value histograms (NBP, DESeq, edgeR; all, high, low), annotated with three features: artifact of p-value calculation for discrete data; underestimate of dispersion; overestimate of dispersion.]
[Figure: bar chart of FPR for n vs. n replicates, from 2 vs. 2 up to 12 vs. 12, for edgeR, DESeq and PoissonSeq; vertical axis from 0.00 to 3.00.]
FPR = percentage of transcripts reported as differentially expressed under the null hypothesis for n reps vs. n reps at α = 1% significance.
(Li et al., Biostatistics (2012) 13:523)
Annotations on the FPR chart:
• Overdispersion underestimated: underconservative
• Overdispersion overestimated: overconservative
Testing the power to detect differential expression
• How many replicates are appropriate? (biological reps, or library prep reps if reps are from the same biological source)
• What sequencing depth?
• Is multiplexing (via barcodes) worthwhile?
Synthetic dataset to test the power of DESeq and edgeR to detect differential expression:
1. Use max. likelihood estimates of (μ_i, φ_i) from the Pickrell data again.
2. Construct a synthetic dataset of counts:
• n reps of 'control' counts K_ij^control ~ NB(μ_i, φ_i), j = 1, ..., n
• n reps of 'treatment' counts K_ij^treatment ~ NB(μ_i θ_i, φ_i), where
θ_i = 1 + X_i for 7.5% of the transcripts (up-regulated),
θ_i = (1 + X_i)^(−1) for a further 7.5% (down-regulated),
θ_i = 1 for the remainder,
with the X_i i.i.d. exponential random variables with parameter 1.
Define a gene to be 'effectively differentially expressed' if
θ_i < 1/1.2 or θ_i > 1.2
[Figure labels: effectively DE; effectively non-DE; 85% unchanged.]
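The fold-change recipe implies that not all of the 15% of altered transcripts count as 'effectively DE': an altered transcript passes the 1.2 threshold only if X_i > 0.2, which has probability e^(−0.2) ≈ 0.82, so about 0.15 × 0.82 ≈ 12.3% of transcripts are effectively DE. The sketch below (standard library only) checks this. As a simplifying assumption it assigns each transcript to a class at random with probabilities 7.5% / 7.5% / 85%, which may differ from how the classes were allocated in the study, and the function names are hypothetical.

```python
import math
import random


def draw_fold_changes(n_transcripts, rng, frac_up=0.075, frac_down=0.075):
    """Draw theta_i for each transcript following the recipe above."""
    thetas = []
    for _ in range(n_transcripts):
        u = rng.random()
        x = rng.expovariate(1.0)             # X_i ~ exponential, parameter 1
        if u < frac_up:
            thetas.append(1.0 + x)           # up-regulated
        elif u < frac_up + frac_down:
            thetas.append(1.0 / (1.0 + x))   # down-regulated
        else:
            thetas.append(1.0)               # unchanged
    return thetas


def effectively_de(theta, threshold=1.2):
    """True if theta > 1.2 or theta < 1/1.2."""
    return theta > threshold or theta < 1.0 / threshold


rng = random.Random(42)
thetas = draw_fold_changes(60_000, rng)
frac_effective = sum(effectively_de(t) for t in thetas) / len(thetas)
expected = 0.15 * math.exp(-0.2)
print("simulated fraction effectively DE:", frac_effective)
print("expected fraction:", expected)
```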
Control for false discovery rate FDR = FP/(TP + FP) using the Benjamini-Hochberg adjusted p-value p_adj < α.
Finally, measure a false positive rate and a true positive rate:
FPR = (# of effectively non-DE transcripts with p_adj < α) / (total # of effectively non-DE transcripts)
TPR = (# of effectively DE transcripts with p_adj < α) / (total # of effectively DE transcripts)
Do this for a range of coverage depths and # replicates.
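A minimal sketch of this scoring step, using only the Python standard library: Benjamini-Hochberg adjustment of a list of p-values, followed by the FPR and TPR defined above. The function names and the toy inputs are illustrative, not taken from the study.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def rates(padj, is_de, alpha=0.01):
    """Return (TPR, FPR) given adjusted p-values and truth labels.

    is_de[i] is True when transcript i is 'effectively DE'.
    """
    tp = sum(1 for p, de in zip(padj, is_de) if de and p < alpha)
    fp = sum(1 for p, de in zip(padj, is_de) if not de and p < alpha)
    n_de = sum(1 for de in is_de if de)
    n_non_de = len(is_de) - n_de
    return tp / n_de, fp / n_non_de


# Toy example: four transcripts, the first two truly DE.
padj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
tpr, fpr = rates(padj, [True, True, False, False], alpha=0.03)
print(padj, tpr, fpr)
```

In a synthetic experiment the truth labels are known by construction, which is what makes FPR and TPR directly measurable.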
With 7.5% up-regulated and 7.5% down-regulated: DESeq and edgeR
[Figures, one for DESeq and one for edgeR: TPR = TP/(TP + FN) (× 100%) = 'sensitivity', using Benjamini-Hochberg adjusted p-value p_adj ≤ 0.01 as a significance criterion; 100% coverage ≈ 10⁷ reads.]
1. TPR increases with number of reps n
2. TPR decreases with decreasing coverage depth
3. Multiplexing (more reps, less coverage, keeping n times depth constant) improves TPR (grey curve)
4. edgeR has slightly better sensitivity than DESeq
With 7.5% up-regulated and 7.5% down-regulated: DESeq and edgeR
[Figures, one for DESeq and one for edgeR: FPR = FP/(TN + FP) (× 100%) = 1 − 'specificity', using Benjamini-Hochberg adjusted p-value p_adj ≤ 0.01 as a significance criterion; curves for n = 2, n = 4 and n = 12.]
1. Multiplexing (more reps, less coverage, keeping n times depth constant) improves specificity (grey curve)
2. DESeq has slightly better specificity than edgeR
With 7.5% up-regulated and 7.5% down-regulated: edgeR
[Figure: FPR = FP/(TN + FP) (× 100%) = 1 − 'specificity', using fold change > 2 as a criterion for detecting differential expression (not recommended); curves for n = 2, n = 4 and n = 12.]
FPR increases with decreasing coverage depth because more transcripts have very low counts, and Poisson shot noise can easily induce a spurious doubling of counts.
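The size of this effect can be illustrated with a deliberately simplified calculation: two independent Poisson counts with the same mean (one count per condition, no overdispersion), and the exact probability that one is more than double the other. This is a sketch under those assumptions, not the simulation used in the study, and the function names are hypothetical.

```python
import math


def poisson_pmf(k, lam):
    """P(K = k) for K ~ Poisson(lam), computed on the log scale."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def prob_fold_change_above_2(lam):
    """P(K_B > 2*K_A or K_A > 2*K_B) for independent K_A, K_B ~ Poisson(lam)."""
    k_max = int(lam + 10 * math.sqrt(lam) + 20)  # truncation far into the tail
    pmf = [poisson_pmf(k, lam) for k in range(k_max + 1)]
    total = 0.0
    for a, pa in enumerate(pmf):
        for b, pb in enumerate(pmf):
            if b > 2 * a or a > 2 * b:
                total += pa * pb
    return total


# No true change in expression in any of these cases.
for lam in (2, 10, 100):
    print("mean count", lam, "-> P(apparent fold change > 2) =",
          prob_fold_change_above_2(lam))
```

The probability of an apparent doubling is large when the mean count is a handful of reads and becomes negligible once the mean count reaches the hundreds.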
To summarise
• Have tested the performance of negative-binomial-based R packages for detecting differential expression using synthetic data.
• Under the null hypothesis, DESeq's performance is consistently more conservative than edgeR's across # of replicates, and closer to the expected significance level for small numbers of reps. edgeR is closer for high numbers of reps.
• With 15% of transcripts differentially expressed, for both edgeR and DESeq:
– sensitivity (= TPR) improves with number of replicates, as expected
– sensitivity declines with decreased sequencing depth, as expected
– sensitivity is better for edgeR than DESeq
– but multiplexing (decreasing sequencing depth while increasing # of replicates with the same total amount of 'read estate') increases sensitivity markedly
Recommend
• The more (independent!) replicates the better
• It's OK to sacrifice sequencing read depth by multiplexing
Acknowledgements
Sue Wilson, Australian National University and University of New South Wales
Jen Taylor, Division of Plant Industry, CSIRO
Sumaira Qureshi, Mathematical Sciences Institute, Australian National University
Jose Robles, Division of Plant Industry, CSIRO
Stuart Stephen, Division of Plant Industry, CSIRO