Using Simulated Data to Optimise Experimental Design and Analysis for RNA-Sequencing. Conrad Burden, Mathematical Sciences Institute, Australian National University, Canberra

Upload: australian-bioinformatics-network

Post on 11-May-2015


TRANSCRIPT

Page 1: Using Simulated Data to Optimise Experimental Design and Analysis for RNA  Sequencing (Conrad Burden)

Using Simulated Data to Optimise Experimental Design and Analysis for RNA-Sequencing

Conrad Burden
Mathematical Sciences Institute, Australian National University, Canberra

Page 2: Using Simulated Data to Optimise Experimental Design and Analysis for RNA  Sequencing (Conrad Burden)

RNA-Seq: using high-throughput sequencing technology to sequence cDNA that has been reverse-transcribed from RNA, to get information about a sample's RNA content. If the sample is mRNA from a cell, it detects which genes are expressed.

Useful for:
1. Expression profiling
2. Detecting differential expression

Page 3: Using Simulated Data to Optimise Experimental Design and Analysis for RNA  Sequencing (Conrad Burden)

Extract RNA → Library prep → Sequencing

RNA → cDNA

Library prep:
• Extract mRNA from total RNA
• Randomly fragment
• Reverse transcribe to cDNA
• Ligate sequencing adaptor
• Size select to ~200 bases
• Amplify with PCR

Sequencing: sequence and map to reference genome to get a digital count of fragments sampled from each gene

Page 4: Using Simulated Data to Optimise Experimental Design and Analysis for RNA  Sequencing (Conrad Burden)

Extract RNA → Library prep → Sequencing

RNA → cDNA

Sources of noise shown on the diagram:
• Biological variation
• Technical variation
• Poisson noise
• Overdispersion

Final count for each gene is overdispersed Poisson

Page 5: Using Simulated Data to Optimise Experimental Design and Analysis for RNA  Sequencing (Conrad Burden)

Extract RNA → Library prep → Sequencing

RNA (transcript abundance ~ q) → cDNA (conc = R) → (count = K)

1. For a given gene of interest, let R = molar concentration of cDNA in the 'library', with E(R) = q, Var(R) = v.
2. Consider q as a proxy for the 'transcript abundance' of this gene.
3. The sequencer count K for this gene, given R, is Poisson: K | R ~ Pois(λR).

1, 2 and 3 imply

E(K) = μ, Var(K) = μ(1 + φμ), where μ = λq, φ = v/q².

φ is called the overdispersion.

Page 6: Using Simulated Data to Optimise Experimental Design and Analysis for RNA  Sequencing (Conrad Burden)

Extract RNA → Library prep → Sequencing

RNA (transcript abundance ~ q) → cDNA (conc = R) → (count = K)

Moreover, if

   λR ~ Gamma(mean = μ, variance = φμ²)

then

   K ~ NegBin(mean = μ, variance = μ(1 + φμ))

If λ, μ and φ can be estimated from the data, q = μ/λ gives an estimate of the abundance of this transcript.
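The Gamma-Poisson construction can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names and the values μ = 20, φ = 0.25 are illustrative assumptions, not taken from the talk. It draws λR from a Gamma distribution with mean μ and variance φμ², then draws K | R from a Poisson, and compares the sample moments of K with μ and μ(1 + φμ).

```python
import math
import random
from statistics import fmean, pvariance


def sample_poisson(rate, rng, chunk=500.0):
    """Poisson variate via Knuth's method, applied in chunks so that
    exp(-rate) never underflows for large rates."""
    k = 0
    while rate > 0.0:
        step = min(rate, chunk)
        threshold, product = math.exp(-step), rng.random()
        while product > threshold:
            k += 1
            product *= rng.random()
        rate -= step
    return k


def sample_count(mu, phi, rng):
    """Draw K by first drawing lambda*R ~ Gamma(mean mu, variance phi*mu^2)."""
    shape = 1.0 / phi   # shape * scale = mu
    scale = phi * mu    # shape * scale^2 = phi * mu^2
    return sample_poisson(rng.gammavariate(shape, scale), rng)


def simulate_moments(mu, phi, n_draws, seed=1):
    """Sample mean and variance of n_draws simulated counts."""
    rng = random.Random(seed)
    counts = [sample_count(mu, phi, rng) for _ in range(n_draws)]
    return fmean(counts), pvariance(counts)


if __name__ == "__main__":
    mu, phi = 20.0, 0.25  # illustrative values only
    mean, var = simulate_moments(mu, phi, 20000)
    print(f"sample mean     {mean:.2f}  (theory {mu:.2f})")
    print(f"sample variance {var:.2f}  (theory {mu * (1 + phi * mu):.2f})")
```

With these values the theoretical variance is 20 × (1 + 0.25 × 20) = 120, six times the Poisson variance of 20, which is the sense in which the count is overdispersed.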

Page 7: Using Simulated Data to Optimise Experimental Design and Analysis for RNA  Sequencing (Conrad Burden)

(Data: human lymphoblastoid cell lines from J.K. Pickrell et al., Nature 464, 768–772.)

[Figure, four panels: Synthetic Poisson vs. Poisson; Same cDNA library, different sequencers; Same biol. source, different cDNA libraries; Different biol. reps.]

Page 8: Using Simulated Data to Optimise Experimental Design and Analysis for RNA  Sequencing (Conrad Burden)

Data from an RNA-Seq experiment to detect differential expression typically looks like this:

Gene               Condition A               Condition B               ... etc.
                   Rep 1   Rep 2   ...etc    Rep 1   Rep 2   ...etc
ENSG00000209432        4       6   ...          35      45   ...
ENSG00000209432        0       0   ...           2       1   ...
ENSG00000209432      110      96   ...         177     203   ...
ENSG00000209432     1268    1089   ...        9246    9873   ...
ENSG00000212678      148     201   ...         112      93   ...
... etc.

Rows: typically > 10,000 genes or transcript isoforms
Columns: n reps per condition, for different conditions or biol. samples

Page 9: Using Simulated Data to Optimise Experimental Design and Analysis for RNA  Sequencing (Conrad Burden)


Which genes are differentially expressed?

Page 10: Using Simulated Data to Optimise Experimental Design and Analysis for RNA  Sequencing (Conrad Burden)

R packages for assessing differential expression based on the negative binomial distribution:

• DESeq: S. Anders and W. Huber, Genome Biol. 11:R106 (2010)
• edgeR: M. Robinson, D. McCarthy and G. Smyth, Bioinformatics 26:139 (2010)
• (also NBPseq: Y. Di et al., SAGMB 10:24 (2011) and TSPM: P. Auer and R. Doerge, SAGMB 10:26 (2011))

Page 11: Using Simulated Data to Optimise Experimental Design and Analysis for RNA  Sequencing (Conrad Burden)

They differ in how they estimate the overdispersion (φ) for each gene from a limited number of replicates:

• DESeq: dispersion φ estimated for each gene as the greater of a per-gene maximum likelihood estimate and a parametric fit to φ = a + b/μ
• edgeR: dispersion φ estimated per gene from a likelihood function conditioned on sum across conditions, then squeezed towards a common-to-all-genes dispersion using empirical Bayes
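The two strategies can be caricatured in a few lines. This is only a sketch of the ideas on the slide, not the code of either package: the function names, the coefficients a and b, and the linear blend used as a stand-in for edgeR's empirical Bayes step are all assumptions made for illustration.

```python
def trend_dispersion(mu, a, b):
    """Parametric trend phi = a + b/mu."""
    return a + b / mu


def deseq_like_dispersion(per_gene_phi, mu, a, b):
    """Take the greater of the per-gene estimate and the fitted trend."""
    return max(per_gene_phi, trend_dispersion(mu, a, b))


def squeezed_dispersion(per_gene_phi, common_phi, weight):
    """Shrink a per-gene estimate towards a common value.

    weight = 0 keeps the per-gene estimate, weight = 1 returns the common
    one. This linear blend is a simplified stand-in (an assumption) for an
    empirical Bayes shrinkage, not edgeR's actual procedure.
    """
    return (1.0 - weight) * per_gene_phi + weight * common_phi


if __name__ == "__main__":
    # Hypothetical numbers: a gene with mean count 100 and a low raw estimate.
    print(deseq_like_dispersion(0.05, mu=100.0, a=0.1, b=2.0))
    print(squeezed_dispersion(0.4, common_phi=0.2, weight=0.5))
```

Taking the maximum can only raise a gene's dispersion, which pushes the test towards being conservative, whereas squeezing can move an estimate in either direction.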

   

Page 12: Using Simulated Data to Optimise Experimental Design and Analysis for RNA  Sequencing (Conrad Burden)

p-values under the null hypothesis

(μ/λ)_condition A = (μ/λ)_condition B

are calculated under the approximation that the total count in each condition is NB, and conditioned on the sum of counts.

[Figure: plane with axes K_A = counts (cond. A) and K_B = counts (cond. B), with the observed point (a, b) marked.]

Page 13: Using Simulated Data to Optimise Experimental Design and Analysis for RNA  Sequencing (Conrad Burden)

[Figure: Prob(K_A = a | K_A + K_B = a + b) plotted against a, with k_A marked. The (1-sided) p-value is the sum of these probabilities.]

Page 14: Using Simulated Data to Optimise Experimental Design and Analysis for RNA  Sequencing (Conrad Burden)

[Figure: Prob(K_A = a | K_A + K_B = a + b) plotted against a, with k_A marked. The (2-sided) p-value is the sum of these probabilities.]
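A conditional p-value of this kind can be sketched as follows. The sketch assumes that K_A and K_B are independent NB counts with known means and dispersions, and that the 2-sided p-value sums every outcome that is no more probable than the observed one. Both the function names and that summation rule are assumptions for illustration; the packages' own implementations may differ in detail.

```python
import math


def nb_log_pmf(k, mu, phi):
    """log P(K = k) for an NB with mean mu and variance mu*(1 + phi*mu)."""
    r = 1.0 / phi  # NB 'size' parameter
    return (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
            + r * math.log(r / (r + mu)) + k * math.log(mu / (r + mu)))


def conditional_probs(total, mu_a, mu_b, phi_a, phi_b):
    """P(K_A = a | K_A + K_B = total) for a = 0..total."""
    log_w = [nb_log_pmf(a, mu_a, phi_a) + nb_log_pmf(total - a, mu_b, phi_b)
             for a in range(total + 1)]
    top = max(log_w)  # subtract the maximum before exponentiating
    w = [math.exp(x - top) for x in log_w]
    norm = sum(w)
    return [x / norm for x in w]


def two_sided_p_value(k_a, k_b, mu_a, mu_b, phi_a, phi_b):
    """Sum the conditional probabilities of all outcomes that are no more
    probable than the observed pair (k_a, k_b)."""
    probs = conditional_probs(k_a + k_b, mu_a, mu_b, phi_a, phi_b)
    observed = probs[k_a]
    return min(1.0, sum(p for p in probs if p <= observed * (1 + 1e-9)))


if __name__ == "__main__":
    # Hypothetical totals under equal means and dispersions.
    print(two_sided_p_value(20, 20, 30.0, 30.0, 0.1, 0.1))
    print(two_sided_p_value(0, 40, 30.0, 30.0, 0.1, 0.1))
```

A balanced observation such as (20, 20) sits at the mode of the conditional distribution and gives a p-value of 1, while a lopsided one such as (0, 40) gives a very small p-value.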

Page 15: Using Simulated Data to Optimise Experimental Design and Analysis for RNA  Sequencing (Conrad Burden)

Robles et al., BMC Genomics (2012) 13:484

Page 16: Using Simulated Data to Optimise Experimental Design and Analysis for RNA  Sequencing (Conrad Burden)

Test DESeq and edgeR using simulated data

Testing null hypothesis:

1. Start with Pickrell et al. dataset of 69 sequenced cDNA libraries from HapMap project (i.e. a table of RNA-Seq counts for 69 biological replicates of ~60,000 transcript isoforms).
2. Use max. likelihood to produce from this a set of NB parameters (μ_i, φ_i) for i = 1, ..., ~60,000 representing a 'typical' range of means and overdispersions for our synthetic transcriptome.
3. Construct a synthetic dataset of counts:
   • n reps of 'control' counts K_ij^control ~ NB(μ_i, φ_i), j = 1, ..., n
   • n reps of 'treatment' counts K_ij^treatment ~ NB(μ_i, φ_i)
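Step 3 can be sketched as below, drawing each NB count as a Gamma-Poisson mixture. The parameter list in the example is made up for illustration; in the talk the (μ_i, φ_i) come from maximum likelihood fits to the Pickrell et al. data. Function names are hypothetical.

```python
import math
import random


def sample_poisson(rate, rng, chunk=500.0):
    """Poisson variate via Knuth's method, applied in chunks so that
    exp(-rate) never underflows for large rates."""
    k = 0
    while rate > 0.0:
        step = min(rate, chunk)
        threshold, product = math.exp(-step), rng.random()
        while product > threshold:
            k += 1
            product *= rng.random()
        rate -= step
    return k


def sample_nb(mu, phi, rng):
    """One NB count with mean mu and variance mu*(1 + phi*mu)."""
    return sample_poisson(rng.gammavariate(1.0 / phi, phi * mu), rng)


def synthetic_null_dataset(params, n_reps, seed=0):
    """params is a list of (mu_i, phi_i). Returns one row per transcript:
    (control counts, treatment counts), both from the same NB(mu_i, phi_i)."""
    rng = random.Random(seed)
    rows = []
    for mu, phi in params:
        control = [sample_nb(mu, phi, rng) for _ in range(n_reps)]
        treatment = [sample_nb(mu, phi, rng) for _ in range(n_reps)]
        rows.append((control, treatment))
    return rows


if __name__ == "__main__":
    # Hypothetical (mu, phi) pairs standing in for fitted parameters.
    params = [(5.0, 0.5), (120.0, 0.2), (1500.0, 0.1)]
    for control, treatment in synthetic_null_dataset(params, n_reps=3):
        print(control, treatment)
```

Because control and treatment share the same parameters, any transcript a package reports as differentially expressed on such data is a false positive by construction.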

Page 17: Using Simulated Data to Optimise Experimental Design and Analysis for RNA  Sequencing (Conrad Burden)

Null hypothesis (no up- or down-regulation), n = 3 reps vs. 3 reps: expect flat p-value distribution.

[Figure: Synthetic data, 3 rep vs. 3 rep. Grid of p-value histograms (x-axis 0.0 to 1.0; y-axis percentage of total, 0 to 10) for NBP, DESeq and edgeR, each shown for all transcripts, transcripts with > 100 counts (high) and transcripts with < 100 counts (low).]

Page 18: Using Simulated Data to Optimise Experimental Design and Analysis for RNA  Sequencing (Conrad Burden)

[Figure: DESeq null p-values, synthetic data 3 vs. 3. Histogram with x-axis p-value (0.0 to 1.0) and y-axis density (0 to 7).]

Page 19: Using Simulated Data to Optimise Experimental Design and Analysis for RNA  Sequencing (Conrad Burden)

Right-hand spike is an artifact of calculating p-values from a discrete distribution. It could be 'fixed' by replacing the discrete distribution by a continuous distribution.

[Figure, discrete case: Prob(K_A = a | K_A + K_B = k_A + k_B) plotted against a, with k_A marked. The 2-sided p-value is the sum of these probs.]

[Figure, continuous case: the same distribution drawn as a continuous curve. Choose a point randomly in the interval (k_A − ½, k_A + ½); the 2-sided p-value is the shaded area.]
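The origin of the spike can be seen by enumerating the discrete p-values directly. The sketch below assumes K_A and K_B are i.i.d. NB (so the common mean cancels in the conditional distribution) and that the 2-sided p-value sums all outcomes no more probable than the observed one; names and the example values are illustrative. It shows the artifact only and does not implement the continuous replacement described on the slide.

```python
import math


def conditional_probs(total, phi):
    """P(K_A = a | K_A + K_B = total) for i.i.d. NB counts with dispersion
    phi. With identical means, only r = 1/phi remains in the weights."""
    r = 1.0 / phi
    log_w = [math.lgamma(a + r) - math.lgamma(a + 1)
             + math.lgamma(total - a + r) - math.lgamma(total - a + 1)
             for a in range(total + 1)]
    top = max(log_w)
    w = [math.exp(x - top) for x in log_w]
    norm = sum(w)
    return [x / norm for x in w]


def p_value_distribution(total, phi):
    """Map each attainable 2-sided p-value to its probability under the null."""
    probs = conditional_probs(total, phi)
    dist = {}
    for observed in probs:
        p = min(1.0, sum(q for q in probs if q <= observed * (1 + 1e-9)))
        key = round(p, 12)
        dist[key] = dist.get(key, 0.0) + observed
    return dist


if __name__ == "__main__":
    # Hypothetical low-count transcript: 4 reads in total, phi = 0.1.
    for p, prob in sorted(p_value_distribution(4, 0.1).items()):
        print(f"p-value {p:.4f} occurs with probability {prob:.4f}")
```

For a total of 4 reads and φ = 0.1, only three p-values are attainable, and p = 1 occurs with probability of about 0.34. A continuous uniform distribution would put no mass at exactly 1, so low-count transcripts pile up in the right-most histogram bin.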

Page 20: Using Simulated Data to Optimise Experimental Design and Analysis for RNA  Sequencing (Conrad Burden)

[Figure: DESeq null p-values, synthetic data 3 vs. 3. Histogram with x-axis p-value (0.0 to 1.0) and y-axis density (0 to 7). Legend: original spectrum; spike removed.]

Page 21: Using Simulated Data to Optimise Experimental Design and Analysis for RNA  Sequencing (Conrad Burden)

Remaining deviation from a uniform distribution is from having to estimate the parameters μ and φ for each transcript.

[Figure: DESeq null p-values, synthetic data 3 vs. 3. Histogram with x-axis p-value (0.0 to 1.0) and y-axis density (0 to 7). Legend: original spectrum; spike removed; parameters not estimated.]

Page 22: Using Simulated Data to Optimise Experimental Design and Analysis for RNA  Sequencing (Conrad Burden)

Null hypothesis: α = 0 (no up- or down-regulation), n = 3 reps vs. 3 reps: expect flat p-value distribution.

[Figure: Synthetic data, 3 rep vs. 3 rep. Grid of p-value histograms (x-axis 0.0 to 1.0; y-axis percentage of total, 0 to 10) for NBP, DESeq and edgeR, each shown for all transcripts, transcripts with > 100 counts (high) and transcripts with < 100 counts (low).]

Annotations on the figure:
• Artifact of p-value calculation for discrete data
• Underestimate of dispersion
• Overestimate of dispersion

Page 23: Using Simulated Data to Optimise Experimental Design and Analysis for RNA  Sequencing (Conrad Burden)

[Figure: bar chart of FPR (y-axis 0.00 to 3.00) for edgeR, DESeq and PoissonSeq, each at n = 2, 3, 4, 6, 8 and 12 reps per condition.]

FPR = percentage of transcripts reported as differentially expressed under the null hypothesis for n reps vs. n reps at α = 1% significance

(Li et al., Biostatistics (2012) 13:523)

Page 24: Using Simulated Data to Optimise Experimental Design and Analysis for RNA  Sequencing (Conrad Burden)

[Figure: the FPR bar chart from the previous slide, with annotations.]

FPR = percentage of transcripts reported as differentially expressed under the null hypothesis for n reps vs. n reps at α = 1% significance

• Overdispersion underestimated → underconservative
• Overdispersion overestimated → overconservative

(Li et al., Biostatistics (2012) 13:523)

Page 25: Using Simulated Data to Optimise Experimental Design and Analysis for RNA  Sequencing (Conrad Burden)

Testing the power to detect differential expression

• How many replicates are appropriate? (biological reps, or library prep reps if reps are from the same biological source)
• What sequencing depth?
• Is multiplexing (via barcodes) worthwhile?

Page 26: Using Simulated Data to Optimise Experimental Design and Analysis for RNA  Sequencing (Conrad Burden)

• Synthetic dataset to test the power of DESeq and edgeR to detect differential expression

1. Use max. likelihood estimates of (μ_i, φ_i) from Pickrell data again
2. Construct a synthetic dataset of counts:
   • n reps of 'control' counts K_ij^control ~ NB(μ_i, φ_i), j = 1, ..., n
   • n reps of 'treatment' counts K_ij^treatment ~ NB(μ_i θ_i, φ_i)

where

θ_i = (1 + X_i) for 7.5% of the transcripts (up-regulated)
θ_i = (1 + X_i)^(-1) for a further 7.5% (down-regulated)
θ_i = 1 for the remainder,

with X_i ~ i.i.d. exponential random variables, parameter 1.
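Drawing the fold changes θ_i can be sketched as follows. The function name, the default arguments and the choice to pick the regulated transcripts uniformly at random are assumptions made for illustration; the talk does not say how the 7.5% subsets were selected.

```python
import random


def draw_fold_changes(n_transcripts, frac_up=0.075, frac_down=0.075, seed=0):
    """Return a list theta with theta_i = 1 + X_i (up), 1/(1 + X_i) (down)
    or 1 (unchanged), where X_i ~ Exponential(1)."""
    rng = random.Random(seed)
    n_up = round(frac_up * n_transcripts)
    n_down = round(frac_down * n_transcripts)
    theta = [1.0] * n_transcripts
    # Choose disjoint sets of up- and down-regulated transcripts at random.
    chosen = rng.sample(range(n_transcripts), n_up + n_down)
    for i in chosen[:n_up]:
        theta[i] = 1.0 + rng.expovariate(1.0)
    for i in chosen[n_up:]:
        theta[i] = 1.0 / (1.0 + rng.expovariate(1.0))
    return theta


if __name__ == "__main__":
    theta = draw_fold_changes(1000)
    print(sum(t > 1.0 for t in theta), "up,",
          sum(t < 1.0 for t in theta), "down,",
          sum(t == 1.0 for t in theta), "unchanged")
```

The treatment counts for transcript i are then drawn from an NB with mean μ_i θ_i and the same dispersion φ_i as the control.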

Page 27: Using Simulated Data to Optimise Experimental Design and Analysis for RNA  Sequencing (Conrad Burden)

Define a gene to be 'effectively differentially expressed' if

θ_i < 1/1.2 or θ_i > 1.2

[Figure labels: Effectively DE; Effectively non-DE; 85% unchanged]
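As a rough worked check of how many transcripts this threshold captures, assuming the fold-change model with X_i ~ Exp(1) applied to 7.5% up-regulated and 7.5% down-regulated transcripts (the resulting percentage is derived here and is not a figure quoted in the talk):

```latex
\begin{align*}
P(\theta_i > 1.2 \mid \text{up})
  &= P(1 + X_i > 1.2) = P(X_i > 0.2) = e^{-0.2} \approx 0.819, \\
P(\theta_i < 1/1.2 \mid \text{down})
  &= P\big((1 + X_i)^{-1} < 1/1.2\big) = P(X_i > 0.2) \approx 0.819, \\
\text{fraction effectively DE}
  &\approx (0.075 + 0.075) \times 0.819 \approx 0.123 .
\end{align*}
```

So roughly 12% of transcripts are effectively DE, and the remaining 88% or so (the 85% unchanged plus the regulated transcripts with small fold changes) count as effectively non-DE.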

Page 28: Using Simulated Data to Optimise Experimental Design and Analysis for RNA  Sequencing (Conrad Burden)

Control for false discovery rate

FDR = FP/(TP + FP)

using the Benjamini-Hochberg adjusted p-value p_adj < α.

Finally, measure a false positive rate

FPR = (# of effectively non-DE transcripts with p_adj < α) / (total # of effectively non-DE transcripts)

and a true positive rate

TPR = (# of effectively DE transcripts with p_adj < α) / (total # of effectively DE transcripts)

Do this for a range of coverage depths and # replicates
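The adjusted p-value and the two rates can be sketched as below. This is a plain-Python version of the standard Benjamini-Hochberg step-up adjustment; the function names and the toy inputs are illustrative assumptions.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def fpr_tpr(p_values, effectively_de, alpha=0.01):
    """FPR and TPR, calling a transcript DE when p_adj < alpha.
    Assumes at least one effectively DE and one effectively non-DE transcript."""
    p_adj = benjamini_hochberg(p_values)
    called = [p < alpha for p in p_adj]
    n_de = sum(effectively_de)
    n_non_de = len(effectively_de) - n_de
    tp = sum(c and de for c, de in zip(called, effectively_de))
    fp = sum(c and not de for c, de in zip(called, effectively_de))
    return fp / n_non_de, tp / n_de


if __name__ == "__main__":
    # Toy input: four transcripts, the first two effectively DE.
    p = [0.001, 0.04, 0.03, 0.5]
    print(benjamini_hochberg(p))
    print(fpr_tpr(p, [True, True, False, False], alpha=0.01))
```

Note that FPR here is measured against the effectively non-DE class, so it is a different quantity from the FDR that the Benjamini-Hochberg threshold controls.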

Page 29: Using Simulated Data to Optimise Experimental Design and Analysis for RNA  Sequencing (Conrad Burden)

With 7.5% up-regulated and 7.5% down-regulated: DESeq

TPR = TP/(TP + FN) (x 100%) = 'sensitivity'

using Benjamini-Hochberg adjusted p-value p_adj ≤ 0.01 as a significance criterion

100% coverage ≈ 10^7 reads

Page 30: Using Simulated Data to Optimise Experimental Design and Analysis for RNA  Sequencing (Conrad Burden)

With 7.5% up-regulated and 7.5% down-regulated: edgeR

TPR = TP/(TP + FN) (x 100%) = 'sensitivity'

using Benjamini-Hochberg adjusted p-value p_adj ≤ 0.01 as a significance criterion

100% coverage ≈ 10^7 reads

Page 31: Using Simulated Data to Optimise Experimental Design and Analysis for RNA  Sequencing (Conrad Burden)

With 7.5% up-regulated and 7.5% down-regulated: edgeR

TPR = TP/(TP + FN) (x 100%) = 'sensitivity'

1. TPR increases with number of reps n
2. TPR decreases with decreasing coverage depth
3. Multiplexing (more reps, less coverage, keeping n times depth constant) improves TPR (grey curve)
4. edgeR has slightly better sensitivity than DESeq

Page 32: Using Simulated Data to Optimise Experimental Design and Analysis for RNA  Sequencing (Conrad Burden)

With 7.5% up-regulated and 7.5% down-regulated: DESeq

FPR = FP/(TN + FP) (x 100%) = 1 – 'specificity'

using Benjamini-Hochberg adjusted p-value p_adj ≤ 0.01 as a significance criterion

[Figure: curves labelled n = 2, n = 4 and n = 12]

Page 33: Using Simulated Data to Optimise Experimental Design and Analysis for RNA  Sequencing (Conrad Burden)

With 7.5% up-regulated and 7.5% down-regulated: edgeR

FPR = FP/(TN + FP) (x 100%) = 1 – 'specificity'

using Benjamini-Hochberg adjusted p-value p_adj ≤ 0.01 as a significance criterion

[Figure: curves labelled n = 2, n = 4 and n = 12]

1. Multiplexing (more reps, less coverage, keeping n times depth constant) improves specificity (grey curve)
2. DESeq has slightly better specificity than edgeR

Page 34: Using Simulated Data to Optimise Experimental Design and Analysis for RNA  Sequencing (Conrad Burden)

With 7.5% up-regulated and 7.5% down-regulated: edgeR

FPR = FP/(TN + FP) (x 100%) = 1 – 'specificity'

using fold change > 2 as a criterion for detecting differential expression (not recommended)

[Figure: curves labelled n = 2, n = 4 and n = 12]

FPR increases with decreasing coverage depth because more transcripts have very low counts and Poisson shot noise can easily induce a spurious doubling of counts
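The size of this shot-noise effect can be illustrated with a small exact calculation. The sketch assumes two independent Poisson counts with the same true mean, ignores overdispersion, and treats "fold change > 2" as one count exceeding twice the other; these simplifications and the function names are assumptions for illustration.

```python
import math


def poisson_pmf(k, mu):
    """P(K = k) for K ~ Poisson(mu), computed on the log scale."""
    return math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))


def prob_spurious_doubling(mu):
    """P(K_B > 2*K_A or K_A > 2*K_B) for independent K_A, K_B ~ Poisson(mu)."""
    k_max = int(mu + 10 * math.sqrt(mu) + 20)  # truncation; tail is negligible
    pmf = [poisson_pmf(k, mu) for k in range(k_max + 1)]
    # upper_tail[j] = P(K >= j) within the truncated range
    upper_tail = [0.0] * (k_max + 2)
    for k in range(k_max, -1, -1):
        upper_tail[k] = upper_tail[k + 1] + pmf[k]
    one_sided = sum(pmf[a] * upper_tail[2 * a + 1]
                    for a in range(k_max + 1) if 2 * a + 1 <= k_max)
    # The two events cannot both happen and are equally likely by symmetry.
    return 2.0 * one_sided


if __name__ == "__main__":
    for mu in (2.0, 20.0, 200.0):
        print(f"mean count {mu:6.1f}: P(spurious fold change > 2) = "
              f"{prob_spurious_doubling(mu):.3g}")
```

At a mean count of 2 the probability is roughly 0.44, whereas at a mean count of 200 it is vanishingly small, which is why a fixed fold-change cut-off degrades as coverage falls.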

Page 35: Using Simulated Data to Optimise Experimental Design and Analysis for RNA  Sequencing (Conrad Burden)

To summarise

• Have tested the performance of Negative Binomial based R packages for detecting differential expression using synthetic data.
• Under null hypothesis, DESeq's performance is consistently more conservative than edgeR across # of replicates, and closer to expected significance level for small numbers of reps. edgeR is closer for high numbers of reps.
• With 15% of transcripts differentially expressed, for both edgeR and DESeq:
  – sensitivity (= TPR) improves with number of replicates, as expected
  – sensitivity declines with decreased sequencing depth, as expected
  – sensitivity better for edgeR than DESeq
  – but multiplexing (decreasing sequencing depth while increasing # of replicates with same total amount of 'read estate') increases sensitivity markedly

Page 36: Using Simulated Data to Optimise Experimental Design and Analysis for RNA  Sequencing (Conrad Burden)

To summarise

Recommend

• The more (independent!) replicates the better
• It's OK to sacrifice sequencing read depth by multiplexing

Page 37: Using Simulated Data to Optimise Experimental Design and Analysis for RNA  Sequencing (Conrad Burden)

Acknowledgements

• Sue Wilson, Australian National University and University of New South Wales
• Jen Taylor, Division of Plant Industry, CSIRO
• Sumaira Qureshi, Mathematical Sciences Institute, Australian National University
• Jose Robles, Division of Plant Industry, CSIRO
• Stuart Stephen, Division of Plant Industry, CSIRO