
Using Simulated Data to Optimise Experimental Design and Analysis for RNA-Sequencing

Conrad Burden
Mathematical Sciences Institute, Australian National University, Canberra

RNA-Seq: using high-throughput sequencing technology to sequence cDNA that has been reverse-transcribed from RNA, to get information about a sample’s RNA content. If the sample is mRNA from a cell, it detects which genes are expressed.

Useful for:
1.  Expression profiling
2.  Detecting differential expression

Extract RNA → Library prep → Sequencing
RNA → cDNA

•  Extract mRNA from total RNA
•  Randomly fragment
•  Reverse transcribe to cDNA
•  Ligate sequencing adaptor
•  Size select to ~200 bases
•  Amplify with PCR

Sequence and map to a reference genome to get a digital count of fragments sampled from each gene.


RNA → cDNA

•  Biological variation and technical variation: overdispersion
•  Sequencing: Poisson noise

Final count for each gene is overdispersed Poisson.


RNA → cDNA (conc = R) → sequencer (count = K); transcript abundance ~ q

1.  For a given gene of interest, let R = molar concentration of cDNA in the ‘library’, with E(R) = q, Var(R) = v.
2.  Consider q as a proxy for the ‘transcript abundance’ of this gene.
3.  Sequencer count K for this gene, given R, is Poisson: K|R ~ Pois(λR).

1, 2 and 3 imply

E(K) = μ, Var(K) = μ(1 + φμ), where μ = λq, φ = v/q².

φ is called the overdispersion.
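The step from assumptions 1 to 3 to the stated mean and variance can be written out with the laws of total expectation and total variance. This is a sketch of the standard argument using only the symbols already defined (R, K, λ, q, v, μ, φ); it is not taken from the slides.

```latex
\begin{align*}
E(K) &= E\big[E(K \mid R)\big] = E(\lambda R) = \lambda q = \mu, \\
\operatorname{Var}(K) &= E\big[\operatorname{Var}(K \mid R)\big] + \operatorname{Var}\big(E(K \mid R)\big) \\
  &= E(\lambda R) + \operatorname{Var}(\lambda R) = \lambda q + \lambda^{2} v \\
  &= \mu + \frac{v}{q^{2}}\,(\lambda q)^{2} = \mu + \phi\mu^{2} = \mu(1 + \phi\mu).
\end{align*}
```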


Moreover, if

λR ~ Gamma(mean = μ, variance = φμ²)

then

K ~ NegBin(mean = μ, variance = μ(1 + φμ)).

If λ, μ and φ can be estimated from the data, q = μ/λ gives an estimate of the abundance of this transcript.
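The Gamma-Poisson mixture can be checked numerically. The sketch below uses only the Python standard library and assumes illustrative values μ = 20 and φ = 0.25, which are not from the talk. The Poisson sampler is Knuth's multiplication method, chosen here for brevity and suitable only for modest means.

```python
import math
import random


def sample_poisson(mean, rng):
    """Draw one Poisson variate with Knuth's multiplication method (modest means only)."""
    threshold = math.exp(-mean)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def sample_gamma_poisson(mu, phi, rng):
    """Draw lambda*R ~ Gamma(mean=mu, variance=phi*mu^2), then K | R ~ Pois(lambda*R)."""
    shape = 1.0 / phi   # Gamma shape
    scale = phi * mu    # Gamma scale: mean = shape*scale = mu, var = shape*scale^2 = phi*mu^2
    rate = rng.gammavariate(shape, scale)
    return sample_poisson(rate, rng)


rng = random.Random(1)
mu, phi = 20.0, 0.25    # illustrative values, not estimates from real data
counts = [sample_gamma_poisson(mu, phi, rng) for _ in range(100_000)]

sample_mean = sum(counts) / len(counts)
sample_var = sum((k - sample_mean) ** 2 for k in counts) / len(counts)
expected_var = mu * (1 + phi * mu)   # the NB variance mu(1 + phi*mu)

print(sample_mean, sample_var, expected_var)
```

The sample mean and variance should land close to μ and μ(1 + φμ), up to simulation noise.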

(Data: human lymphoblastoid cell lines from J.K. Pickrell et al., Nature 464:768–772 (2010).)

[Figure panels: synthetic Poisson vs. Poisson; same cDNA library, different sequencers; same biological source, different cDNA libraries; different biological replicates.]

Data from an RNA-Seq experiment to detect differential expression typically looks like this:

Gene               Condition A               Condition B               ... etc.
                   Rep 1   Rep 2   ...etc    Rep 1   Rep 2   ...etc
ENSG00000209432    4       6       ...       35      45      ...
ENSG00000209432    0       0       ...       2       1       ...
ENSG00000209432    110     96      ...       177     203     ...
ENSG00000209432    1268    1089    ...       9246    9873    ...
ENSG00000212678    148     201     ...       112     93      ...
... etc.

Rows: typically > 10,000 genes or transcript isoforms.
Columns: n reps per condition, for different conditions or biol. samples.


Which genes are differentially expressed?

R packages for assessing differential expression based on the negative binomial distribution:

•  DESeq: S. Anders and W. Huber, Genome Biol. 11:R106 (2010)
•  edgeR: M. Robinson, D. McCarthy and G. Smyth, Bioinformatics 26:139 (2010)
•  (also NBPseq: Y. Di et al., SAGMB 10:24 (2011) and TSPM: P. Auer and R. Doerge, SAGMB 10:26 (2011))

They differ in how they estimate the overdispersion (φ) for each gene from a limited number of replicates:

•  DESeq: dispersion φ estimated for each gene as the greater of a per-gene maximum likelihood estimate and a parametric fit to φ = a + b/μ
•  edgeR: dispersion φ estimated per gene from a likelihood function conditioned on the sum across conditions, then squeezed towards a common-to-all-genes dispersion using empirical Bayes
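The DESeq rule in the first bullet can be written as a one-line function. This is a sketch of the rule as stated on the slide, not a call into the DESeq package. The function names and the values of a and b are hypothetical; in practice a and b come from a fit across all genes.

```python
def fitted_dispersion(mu, a, b):
    """Parametric trend phi = a + b/mu; a and b are assumed to come from a fit across genes."""
    return a + b / mu


def deseq_style_dispersion(phi_gene, mu, a, b):
    """Use the greater of the per-gene estimate and the fitted trend, as described above."""
    return max(phi_gene, fitted_dispersion(mu, a, b))


# Hypothetical trend parameters, for illustration only.
a, b = 0.05, 2.0
low_estimate = deseq_style_dispersion(0.01, mu=100.0, a=a, b=b)   # trend value is used
high_estimate = deseq_style_dispersion(0.50, mu=100.0, a=a, b=b)  # per-gene value is used
print(low_estimate, high_estimate)
```

Taking the maximum means a gene whose own estimate happens to be small is never assigned less dispersion than the trend suggests.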

p-values are calculated under the null hypothesis

(μ/λ)_A = (μ/λ)_B (subscripts denote conditions A and B)

under the approximation that the total count in each condition is NB, and conditioned on the sum of counts.

[Figure, shown in three steps: the plane of K_A = counts (cond. A) against K_B = counts (cond. B), with the observed counts marked as the point (a, b); then Prob(K_A = a | K_A + K_B = a + b) plotted against a, with k_A marked on the axis. The 1-sided p-value is the sum of the highlighted probabilities; the 2-sided p-value is the corresponding sum in the 2-sided case.]
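The conditional p-value described above can be sketched as follows. The code assumes the NB parameters for the two condition totals are already known. It takes the 1-sided p-value as the upper tail and the 2-sided p-value as the sum of all conditional probabilities no larger than that of the observed split. That 2-sided convention is an assumption of this sketch, and the function names are hypothetical.

```python
import math


def nb_log_pmf(k, mu, phi):
    """log P(K = k) for a negative binomial with mean mu and variance mu*(1 + phi*mu)."""
    size = 1.0 / phi
    return (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
            + size * math.log(size / (size + mu))
            + k * math.log(mu / (size + mu)))


def conditional_p_values(k_a, k_b, mu_a, phi_a, mu_b, phi_b):
    """Return (1-sided, 2-sided) p-values conditioned on the total k_a + k_b."""
    total = k_a + k_b
    log_w = [nb_log_pmf(a, mu_a, phi_a) + nb_log_pmf(total - a, mu_b, phi_b)
             for a in range(total + 1)]
    shift = max(log_w)                       # subtract the maximum for numerical stability
    weights = [math.exp(x - shift) for x in log_w]
    norm = sum(weights)
    probs = [w / norm for w in weights]      # Prob(K_A = a | K_A + K_B = total)
    one_sided = sum(probs[k_a:])             # Prob(K_A >= k_a | total)
    cutoff = probs[k_a] * (1 + 1e-9)         # small tolerance so exact ties are included
    two_sided = min(1.0, sum(p for p in probs if p <= cutoff))
    return one_sided, two_sided


# Illustrative parameters only: equal means and dispersions under the null.
balanced = conditional_p_values(10, 10, mu_a=20.0, phi_a=0.1, mu_b=20.0, phi_b=0.1)
extreme = conditional_p_values(50, 2, mu_a=20.0, phi_a=0.1, mu_b=20.0, phi_b=0.1)
print(balanced, extreme)
```

A balanced split gives a 2-sided p-value of 1, while a very lopsided split with the same null parameters gives a small p-value.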

Robles et al., BMC Genomics (2012) 13:484

Test DESeq and edgeR using simulated data

Testing the null hypothesis:

1.  Start with the Pickrell et al. dataset of 69 sequenced cDNA libraries from the HapMap project (i.e. a table of RNA-Seq counts for 69 biological replicates of ~60,000 transcript isoforms).
2.  Use maximum likelihood to produce from this a set of NB parameters (μ_i, φ_i) for i = 1, ..., ~60,000 representing a ‘typical’ range of means and overdispersions for our synthetic transcriptome.
3.  Construct a synthetic dataset of counts:
    •  n reps of ‘control’ counts K_ij^control ~ NB(μ_i, φ_i), j = 1, ..., n
    •  n reps of ‘treatment’ counts K_ij^treatment ~ NB(μ_i, φ_i)
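Step 3 can be sketched in a few lines. The (μ_i, φ_i) pairs below are hypothetical placeholders for the maximum likelihood estimates from the Pickrell data. The NB sampler is a Gamma-Poisson mixture with Knuth's Poisson method, which is adequate only for modest means; a real simulation would use a library sampler.

```python
import math
import random


def sample_nb(mu, phi, rng):
    """Draw one NB(mean=mu, overdispersion=phi) count as a Gamma-Poisson mixture."""
    rate = rng.gammavariate(1.0 / phi, phi * mu)   # Gamma with mean mu, variance phi*mu^2
    threshold = math.exp(-rate)                    # Knuth's method: modest means only
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def synthetic_null_dataset(params, n_reps, rng):
    """For each (mu_i, phi_i), draw n_reps control and n_reps treatment counts from the same NB."""
    table = []
    for mu, phi in params:
        control = [sample_nb(mu, phi, rng) for _ in range(n_reps)]
        treatment = [sample_nb(mu, phi, rng) for _ in range(n_reps)]
        table.append((control, treatment))
    return table


rng = random.Random(0)
params = [(5.0, 0.5), (40.0, 0.2), (150.0, 0.1)]   # hypothetical (mu_i, phi_i) pairs
table = synthetic_null_dataset(params, n_reps=3, rng=rng)
for (mu, phi), (control, treatment) in zip(params, table):
    print(mu, phi, control, treatment)
```

Because control and treatment share the same parameters, any transcript a package flags in such a table is a false positive by construction.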

Null hypothesis (no up- or down-regulation), n = 3 reps vs. 3 reps: expect a flat p-value distribution.

[Figure: Synthetic data, 3 reps vs. 3 reps. Histograms of null p-values (horizontal axis 0.0 to 1.0; vertical axis: percentage of total, 0 to 10) for NBP, DESeq and edgeR, each shown for all transcripts, for transcripts with > 100 counts (‘high’) and for transcripts with < 100 counts (‘low’).]

[Figure: DESeq null p-values, synthetic data, 3 vs. 3. Density against p-value (0.0 to 1.0).]

The right-hand spike is an artifact of calculating p-values from a discrete distribution. It could be ‘fixed’ by replacing the discrete distribution with a continuous distribution.

[Figure, two panels of Prob(K_A = a | K_A + K_B = k_A + k_B) against a, with the observed k_A marked. First panel (discrete): the 2-sided p-value is the sum of these probabilities. Second panel (continuous): choose a point randomly in the interval (k_A − ½, k_A + ½); the 2-sided p-value is the shaded area.]
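The idea of drawing a random point in (k_A − ½, k_A + ½) can be illustrated in the simpler 1-sided case. This is one possible reading of the construction, not the implementation used in the talk. A Binomial(10, ½) distribution stands in for the conditional null distribution, and each probability is treated as a bar of width 1 centred on its integer. All names and values are hypothetical.

```python
import math
import random


def binomial_pmf(n, p):
    """Stand-in for the conditional distribution Prob(K_A = a | K_A + K_B = n)."""
    return [math.comb(n, a) * p**a * (1 - p) ** (n - a) for a in range(n + 1)]


def discrete_upper_p(probs, k):
    """Ordinary 1-sided p-value: sum of the probabilities for a >= k."""
    return sum(probs[k:])


def randomised_upper_p(probs, k, rng):
    """Area to the right of a random point x in (k - 1/2, k + 1/2) under the bar chart."""
    x = rng.uniform(k - 0.5, k + 0.5)
    return sum(probs[k + 1:]) + (k + 0.5 - x) * probs[k]


rng = random.Random(2)
probs = binomial_pmf(10, 0.5)                        # hypothetical null distribution
draws = rng.choices(range(11), weights=probs, k=20_000)

discrete = [discrete_upper_p(probs, k) for k in draws]
smoothed = [randomised_upper_p(probs, k, rng) for k in draws]

share_at_one = sum(p > 1 - 1e-9 for p in discrete) / len(discrete)
mean_smoothed = sum(smoothed) / len(smoothed)
share_below_10pc = sum(p <= 0.1 for p in smoothed) / len(smoothed)
print(share_at_one, mean_smoothed, share_below_10pc)
```

Under the null the randomised p-values are uniform on (0, 1), whereas the discrete p-values take only a few values, including exactly 1.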


[Figure: density of null p-values (0.0 to 1.0), comparing the original spectrum with the spectrum after the spike is removed.]

The remaining deviation from a uniform distribution comes from having to estimate the parameters μ and φ for each transcript.


[Figure: density of null p-values (0.0 to 1.0) for three cases: original spectrum, spike removed, and parameters not estimated.]

Null hypothesis: α = 0 (no up- or down-regulation), n = 3 reps vs. 3 reps: expect a flat p-value distribution.

[Figure: Synthetic data, 3 reps vs. 3 reps. Null p-value histograms for NBP, DESeq and edgeR (all transcripts, > 100 counts, < 100 counts), annotated with three features: artifact of p-value calculation for discrete data; underestimate of dispersion; overestimate of dispersion.]

[Figure: FPR for n vs. n replicates (2v2, 3v3, 4v4, 6v6, 8v8, 12v12) for edgeR, DESeq and PoissonSeq; axis ticks run from 0.00 to 3.00.]

FPR = percentage of transcripts reported as differentially expressed under the null hypothesis for n reps vs. n reps at α = 1% significance

(Li et al., Biostatistics (2012) 13:523)


Overdispersion underestimated: under-conservative
Overdispersion overestimated: over-conservative


Testing the power to detect differential expression

•  How many replicates are appropriate? (Biological reps, or library prep reps if reps are from the same biological source.)
•  What sequencing depth?
•  Is multiplexing (via barcodes) worthwhile?

Synthetic dataset to test the power of DESeq and edgeR to detect differential expression:

1.  Use the maximum likelihood estimates of (μ_i, φ_i) from the Pickrell data again.
2.  Construct a synthetic dataset of counts:
    •  n reps of ‘control’ counts K_ij^control ~ NB(μ_i, φ_i), j = 1, ..., n
    •  n reps of ‘treatment’ counts K_ij^treatment ~ NB(μ_i θ_i, φ_i), where
       θ_i = (1 + X_i) for 7.5% of the transcripts (up-regulated),
       θ_i = (1 + X_i)^−1 for a further 7.5% (down-regulated),
       θ_i = 1 for the remainder,
       with X_i i.i.d. exponential random variables with parameter 1.

Define a gene to be ‘effectively differentially expressed’ if

θ_i < 1/1.2 or θ_i > 1.2

[Figure: breakdown of transcripts into effectively DE and effectively non-DE; 85% unchanged.]
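The fold-change recipe and the ‘effectively differentially expressed’ threshold can be sketched as below. The function names and the transcript count of 20,000 are hypothetical. Under this recipe the expected share of effectively DE transcripts is 0.15 · exp(−0.2) ≈ 0.123, because 1 + X_i > 1.2 exactly when X_i > 0.2.

```python
import math
import random


def draw_fold_changes(n_transcripts, rng, frac_up=0.075, frac_down=0.075):
    """Draw theta_i: (1 + X) for up-regulated, 1/(1 + X) for down-regulated, 1 otherwise."""
    n_up = round(frac_up * n_transcripts)
    n_down = round(frac_down * n_transcripts)
    thetas = []
    for i in range(n_transcripts):
        if i < n_up:
            thetas.append(1.0 + rng.expovariate(1.0))
        elif i < n_up + n_down:
            thetas.append(1.0 / (1.0 + rng.expovariate(1.0)))
        else:
            thetas.append(1.0)
    rng.shuffle(thetas)   # so that regulated transcripts are not grouped together
    return thetas


def effectively_de(theta, threshold=1.2):
    """True if theta < 1/threshold or theta > threshold."""
    return theta < 1.0 / threshold or theta > threshold


rng = random.Random(3)
thetas = draw_fold_changes(20_000, rng)   # hypothetical number of transcripts

share_unchanged = sum(t == 1.0 for t in thetas) / len(thetas)
share_effectively_de = sum(effectively_de(t) for t in thetas) / len(thetas)
expected_share = 0.15 * math.exp(-0.2)
print(share_unchanged, share_effectively_de, expected_share)
```

The treatment means are then μ_i θ_i, and the counts can be drawn from NB(μ_i θ_i, φ_i) as in the null simulation.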

Control for false discovery rate, FDR = FP/(TP + FP), using the Benjamini-Hochberg adjusted p-value, p_adj < α.

Finally, measure a false positive rate

FPR = (# of effectively non-DE transcripts with p_adj < α) / (total # of effectively non-DE transcripts)

and a true positive rate

TPR = (# of effectively DE transcripts with p_adj < α) / (total # of effectively DE transcripts)

Do this for a range of coverage depths and numbers of replicates.
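The Benjamini-Hochberg adjustment and the two rates can be sketched in plain Python. The p-values and truth labels in the example are made up for illustration; in the study they would come from DESeq or edgeR and from the simulated θ_i. The function names are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):               # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def fpr_tpr(p_adj, is_de, alpha):
    """FPR and TPR as defined above; assumes both classes are non-empty."""
    false_pos = sum(1 for p, de in zip(p_adj, is_de) if p < alpha and not de)
    true_pos = sum(1 for p, de in zip(p_adj, is_de) if p < alpha and de)
    n_de = sum(1 for de in is_de if de)
    n_non_de = len(is_de) - n_de
    return false_pos / n_non_de, true_pos / n_de


# Made-up example: four transcripts, two of them effectively DE.
p_values = [0.01, 0.04, 0.03, 0.005]
is_de = [True, False, False, True]
p_adj = benjamini_hochberg(p_values)
fpr, tpr = fpr_tpr(p_adj, is_de, alpha=0.03)
print(p_adj, fpr, tpr)
```

Looping from the largest p-value downwards with a running minimum keeps the adjusted values monotone in the raw p-values.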

With 7.5% up-regulated and 7.5% down-regulated: DESeq

[Figure: TPR = TP/(TP + FN) (× 100%) = ‘sensitivity’, using the Benjamini-Hochberg adjusted p-value p_adj ≤ 0.01 as a significance criterion. 100% coverage ≈ 10^7 reads.]

With 7.5% up-regulated and 7.5% down-regulated: edgeR

[Figure: TPR = TP/(TP + FN) (× 100%) = ‘sensitivity’, using the Benjamini-Hochberg adjusted p-value p_adj ≤ 0.01 as a significance criterion. 100% coverage ≈ 10^7 reads.]


TPR = TP/(TP + FN) (× 100%) = ‘sensitivity’

1.  TPR increases with the number of reps n
2.  TPR decreases with decreasing coverage depth
3.  Multiplexing (more reps, less coverage, keeping n times depth constant) improves TPR (grey curve)
4.  edgeR has slightly better sensitivity than DESeq
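One heuristic for why multiplexing can help under the NB model is sketched below; it is not an argument made in the talk. It considers the squared coefficient of variation of the mean count over n replicates, written \bar{K} here (a symbol not used in the original). It assumes independent replicates, each with mean μ and overdispersion φ, and a φ that does not depend on depth.

```latex
\begin{align*}
\operatorname{Var}(\bar{K}) &= \frac{\mu(1 + \phi\mu)}{n}, \\
\mathrm{CV}^2(\bar{K}) &= \frac{\operatorname{Var}(\bar{K})}{\mu^{2}}
  = \frac{1}{n\mu} + \frac{\phi}{n}.
\end{align*}
```

With nμ held fixed (n times depth constant), the Poisson term 1/(nμ) does not change, while the overdispersion term φ/n shrinks as n grows.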

With 7.5% up-regulated and 7.5% down-regulated: DESeq

[Figure: FPR = FP/(TN + FP) (× 100%) = 1 − ‘specificity’, using the Benjamini-Hochberg adjusted p-value p_adj ≤ 0.01 as a significance criterion. Curves labelled n = 2, n = 4 and n = 12.]



1.  Multiplexing (more reps, less coverage, keeping n times depth constant) improves specificity (grey curve)
2.  DESeq has slightly better specificity than edgeR


[Figure: FPR = FP/(TN + FP) (× 100%) = 1 − ‘specificity’, using fold change > 2 as a criterion for detecting differential expression (not recommended). Curves labelled n = 2, n = 4 and n = 12.]

FPR increases with decreasing coverage depth because more transcripts have very low counts, and Poisson shot noise can easily induce a spurious doubling of counts.
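The shot-noise effect can be quantified in a simplified setting: two single Poisson counts with the same mean and no overdispersion or replication, where ‘fold change > 2’ means the larger count exceeds twice the smaller. This is a sketch under those assumptions, with a hypothetical function name and illustrative means.

```python
import math


def poisson_pmf(k, mean):
    """P(K = k) for a Poisson distribution, computed on the log scale."""
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))


def prob_spurious_fold_change(mean, fold=2.0):
    """P(max(A, B) > fold * min(A, B)) for independent A, B ~ Pois(mean)."""
    upper = int(mean + 12 * math.sqrt(mean) + 20)   # truncation point far into the tail
    pmf = [poisson_pmf(k, mean) for k in range(upper + 1)]
    total = 0.0
    for a, pa in enumerate(pmf):
        for b, pb in enumerate(pmf):
            if max(a, b) > fold * min(a, b):
                total += pa * pb
    return total


# Illustrative means for low, moderate and high expected counts.
p_low = prob_spurious_fold_change(2.0)
p_mid = prob_spurious_fold_change(20.0)
p_high = prob_spurious_fold_change(200.0)
print(p_low, p_mid, p_high)
```

The probability of an apparent fold change above 2 with no true difference is substantial at a mean of 2 counts and negligible at a mean of 200.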

To summarise

•  Have tested the performance of negative-binomial-based R packages for detecting differential expression using synthetic data.
•  Under the null hypothesis, DESeq’s performance is consistently more conservative than edgeR’s across numbers of replicates, and closer to the expected significance level for small numbers of reps. edgeR is closer for high numbers of reps.
•  With 15% of transcripts differentially expressed, for both edgeR and DESeq:
   –  sensitivity (= TPR) improves with number of replicates, as expected
   –  sensitivity declines with decreased sequencing depth, as expected
   –  sensitivity is better for edgeR than DESeq
   –  but multiplexing (decreasing sequencing depth while increasing # of replicates with the same total amount of ‘read estate’) increases sensitivity markedly


Recommend

•  The more (independent!) replicates the better
•  It’s OK to sacrifice sequencing read depth by multiplexing

Acknowledgements

Sue Wilson, Australian National University and University of New South Wales
Jen Taylor, Division of Plant Industry, CSIRO
Sumaira Qureshi, Mathematical Sciences Institute, Australian National University
Jose Robles, Division of Plant Industry, CSIRO
Stuart Stephen, Division of Plant Industry, CSIRO
