use of bayesian statistical techniques in high-throughput...

67
Use of Bayesian Statistical Techniques in High-throughput Sequencing Christopher E. Mason Assistant Professor The Institute for Computational Biomedicine, Department of Physiology and Biophysics at Weill Cornell Medical College, and the Tri-I Program (Cornell, MSKCC, Rockefeller) on Comp. Bio. & Medicine February 21 st , 2012

Upload: others

Post on 24-Jul-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

Use of Bayesian Statistical Techniques in High-throughput Sequencing

     

Christopher E. Mason Assistant Professor

The Institute for Computational Biomedicine, Department of Physiology and Biophysics at

Weill Cornell Medical College, and the Tri-I Program (Cornell, MSKCC, Rockefeller) on Comp. Bio. & Medicine

February 21st , 2012

Page 2: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

Kahvejian,  2008  

Since DNA defines the biochemical recipe for the genesis of organisms, sequencing allows us to create molecular portraits

of development and disease at single-base resolution.

Page 3: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

Table 1: Number of Genes vs. Genome Sizeorganism genes Base pairs

Plant ~50,000 100,000,000,000 Human, mouse or rat 25,000 3,000,000,000

Honey bee 15,000 300,000,000 Fruit Fly 13,767 130,000,000 Worm 19,000 97,000,000 Fungus 6,000 13,000,000

Bacterium 3,000 5,000,000 Mycoplasma genitalium 500 580,000

DNA virus 10–900 50,000 RNA virus 1–25 10,000

Viroid 0–1 500

Page 4: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

Understanding the genome’s mutation, selection, and/or drift

C  A  G  T  C  T  

H  A  K  V  D  S  

κ ν γ p  β α

GI

Brain

Liver

Skin

SG

Gon

H. sapiens

Disease

How can we understand these interactions?

Evo-Devo

Integrated, spatiotemporal molecular profiling.  

Page 5: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

What erudition do we have now on the functional elements?

1.2  million  sequence  reads  in  brain    (1990-­‐2009)  

Ø Currently  limited  amount  of        EST  info  at  NCBI    Ø EST  data  is  expensive,  Ome-­‐consuming  (cloning),  and  exhibits  3’  bias.      Ø Much  EST  and  cDNA  data  is  for  whole  brains,  and  few  libraries  exist  with  region-­‐specific  data.  

New  School:  One  run  of  a  NGS  machine  =  billions  of  sequence  reads  in  days

Old  School  

Page 6: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

DescripOon/Discussion  of  the  Various  Technologies  

•  The  goal  of  the  Archon  X  prize  in  Genomics  is  to  enable  a  $1,000  genome,  

•  Currently  at  $3,000-­‐$50,000  •  Certain  pla^orms  are  be_er  suited  for  certain  tasks:  –  CounOng  applicaOons  (ChIP-­‐Seq,  RNA-­‐Seq)  need  more  reads  

– De  novo  assembly  work  needs  longer  reads  – Whole  genome  re-­‐sequencing  requires  lower  errors  rate  and  high  processivity  

Page 7: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

But, there are many options:

Michael    Metzker,    2010  

Page 8: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

DNA  (0.1-­‐1.0  ug)  

   

Single  molecule  array  Sample  

prepara9on   Cluster  growth  5’  

5’  3’  

G  

T  

C  

A  

G  

T  

C  

A  

G  

T  

C  

A  

C  

A  

G  

T  C  

A  

T  

C  

A  

C  

C  

T  A  G  

C  G  

T  A  

G  T  

1 2 3 7 8 9 4 5 6

Image  acquisi9on     Base  calling    

T G C T A C G A T …

Sequencing    

Illumina  SBS  Technology  Reversible  Terminator  Chemistry  Founda7on  

 

h_p://www.illumina.com/technology/sequencing_technology.ilmn  

Page 9: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

Quality  Scores  vary  

Bansal  et  al,  2010  

Page 10: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

Pyrosequencing  

Metzker,  2010  

Page 11: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

Single  Molecule  Real-­‐Time  

Metzker,  2010  

Page 12: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

Direct  DetecOon  of  MethylaOon  

70.5 71.0 71.5 72.0 72.5 73.0 73.5 74.0 74.50

100

200

300

400

Fluo

resc

ence

inte

nsity

(a.u

.)

Time (s)

104.5 105.0 105.5 106.0 106.5 107.0 107.5 108.0 108.50

100

200

300

400

Fluo

resc

ence

inte

nsity

(a.u

.)

Time (s)

C  

T   G   A   T  C   G   T   A   C  

mA  A  G   T  C  T   A   A  

G   C   C   A  A   A  A  

Approach:  KineOc  detecOon  of  methylated  bases  during  SMRT  DNA  sequencing  

Example:  N6-­‐methyladenosine  (mA)  

Page 13: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

detect  other  base  modificaOons  

0 10 20 30 40 50 60 70 80 900.7

0.8

0.9

1.0

1.1

1.2

1.3

1.4

Pulse  width  ra

9o  

Pulse  width  

DNA  Template  Posi9on  

0 10 20 30 40 50 60 70 80 90

0.5

1.0

1.5

2.0

2.5

3.0

IPD  Ra

9o  

Interpulse  duraOon  

5-­‐methylcytosine  (mC)  

=  Methylated  posiOon  

0 10 20 30 40 50 60 70 80 900.7

0.8

0.9

1.0

1.1

1.2

1.3

1.4

DNA  Template  Posi9on  

Pulse  width  ra

9o  

Pulse  width  

0 10 20 30 40 50 60 70 80 90

0.5

1.0

1.5

2.0

2.5

3.0

IPD  Ra

9o  

Interpulse  duraOon  

5-­‐hydroxymethylcytosine  (hmC)  

OH  

Page 14: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

Watch  TranslaOon  in  Real-­‐Time  

Uemura  et  al,  Nature,  2010  

Page 15: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

Full-­‐length  cDNA  sequencing  

Page 16: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

“Post-­‐Light,”    Semi-­‐Conductor  Sequencing  

Essentially, 11 million very small pH meters Purushothaman  et  al,  2005  

IonTorrent,  Inc.  

Page 17: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

MinION   GridION  

Exonuclease-­‐Seq   Strand-­‐Seq  

DNA  Sequencing  

Page 18: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

Other  (Maybe  Killer)  Apps  

Analyte   Protein  Aptamer  

Direct  RNA  Sequencing   Small  molecule  

Page 19: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

Each  Pla^orm  has  various  sources  of  noise,  and  thus  Error  

•  De-­‐Phasing  – Lagging  strand  dephasing  from  incomplete  extension  

– Leading  strand  dephasing  from  over-­‐extension  •  Dark  NucleoOdes  •  Polymerase  errors  (10-­‐5  to  10-­‐7)  •  Pla^orm-­‐specific  errors  

–  Illumina  more  likely  to  have  error  aker  ‘G’  – PCR-­‐based  methods  miss  GC-­‐  and  AT-­‐rich  regions  

Page 20: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

Alignment  to  the  genome  

Page 21: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput
Page 22: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

Analyzing High-Resolution Data

• Bayesian Methods • Hidden Markov Models •  Permutation Testing • Circular Binary Segmentation •  Seed-seeking •  Least Squares Regression • Democratic Voting

Page 23: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

Prior •  The prior function Pr (H) gives the probability of different possible

values of the quantity of interest before the data are considered – that is, it represents the state of knowledge prior to the data.

•  Prior may be broad or flat if we have few data (non-informative prior), or peaked if we have more information (informative prior).

Pro

babi

lity

Unknown quantity of interest

(Population size, length at sexual maturity, haplotype frequency, model parameter)

Page 24: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

Likelihood •  The likelihood function Pr (data | H) gives the probability of

obtaining the data, given different possible values of the unknown quantity of interest (the “hypothesis” H).

•  The likelihood is calculated using a statistical model, which represents the process that produced the data. The likelihood function connects the parameters of the model to the data.

Like

lihoo

d

Unknown quantity of interest

(Population size, length at sexual maturity, haplotype frequency, model parameter)

Page 25: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

Posterior •  The posterior function Pr (H | data) gives the probability of different

possible values of the quantity of interest after the data are considered – that is, it represents the state of knowledge posterior to the data.

•  The posterior is a combination of the prior (what we knew before) and the likelihood (what the data told us).

•  The difference between the prior and the posterior indicates how much we learned from the data.

prior

posterior

Pro

babi

lity

Unknown quantity of interest

(Population size, length at sexual maturity, haplotype frequency, model parameter)

Page 26: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

Paradigm for Bayesian inference

posterior likelihood x prior

new state of knowledge

Thus, in Bayesian reasoning, new data update the current state of knowledge through Bayes’ Theorem. The result is a new state of knowledge represented by the posterior.

information from new data

current state of knowledge x

Page 27: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

Bayes’ Theorem

(H) The Hypothesis (the unknown) and x = data

p (H | data) = ——————————— p (data | H) p (H)

p (data)

posterior = likelihood x prior

Page 28: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

2 fundamental Bayesian concepts:

1.  Things that are unknown are represented by probability distributions.

2. Things that are known (data) are used to improve the knowledge of unknowns through Bayes’ Theorem.

Page 29: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

Bayesian view of cancer

•  1% of women at 40 yrs have breast cancer (Prior)

• Mammography diagnoses 8/10 correctly (true positive rate, false negative rate)

•  10% of mammographies are false positives

•  If you get a positive result, what are your odds of having breast cancer?

Page 30: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

healthy  c

10,000 patients

100 patients

9,900 patients

c

8/10 true positives

80 patients

healthy  990

patients

10% false negatives

80 1070

=7.5%

Page 31: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

Bayes Theorem depends on the prior probability (pr(A)):

.8*.01 .8*.01 + .1*.99 =7.5%

A = have cancer B = positive result

Page 32: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

healthy  c

1,000,000 patients

1 patient

999,999 patients

c

8/10 true positives

~.8 patients

healthy  9,999

patients

10% false negatives

0.8 10,000

=0.0008%

Page 33: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

Q.  What  is  the  Bayesian  Conspiracy?  

 A.    The  Bayesian  Conspiracy  is  a  mulOnaOonal,  interdisciplinary,  and  shadowy  group  of  scienOsts  that  controls  publicaOon,  grants,  tenure,  and  the  illicit  traffic  in  grad  students.    The  best  way  to  be  accepted  into  the  Bayesian  Conspiracy  is  to  join  the  Campus  Crusade  for  Bayes  in  high  school  or  college,  and  gradually  work  your  way  up  to  the  inner  circles.    It  is  rumored  that  at  the  upper  levels  of  the  Bayesian  Conspiracy  exist  nine  silent  figures  known  only  as  the  Bayes  Council.    

h_p://yudkowsky.net/raOonal/bayes  

Page 34: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

ApplicaOons  of  Bayes  to  DNA-­‐Seq  

Page 35: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput
Page 36: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

Each  Pla^orm  is  slightly  different,  and  so  intrinic  errors  are  different  

SeQC Consortium

Page 37: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput
Page 38: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

SNIP-­‐Seq  SNP  calling  

For  each  potenOal  variant  site  in  the  sequenced  region:  Set  the  base  quality  value  for  each  base  call  to  the  Illumina  

quality  value  For  k  =  1,  2,…    

a.  Sample  a  genotype  for  each  individual  from  the  posterior  distribuOon  using  a  heterozygote  prior  of  0.001.  

b.  Recalibrate  the  quality  score  for  each  base  call  using  genotypes  for  all  individuals.  

If  the  genotype  of  any  individual  is  different  from  the  reference,  idenOfy  posiOon  as  a  SNP.  

Sample  a  genotype  for  each  individual  from  the  posterior  distribuOon  computed  using  a  heterozygote  prior  of  0.2.  

 Bansal  et  al,  2010  

Page 39: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

PopulaOon-­‐based  models  can  overcome  systemaOc  errors,  but…  

Page 40: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

Population Stratification is from the migration patterns of haplotypes

throughout human history

Page 41: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

Genes  mirror  geography  within  Europe  Novembre  et    al,  2008  

Page 42: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

Genes  mirror  geography  within  Europe  Novembre  et    al,  2008  

Page 43: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

ApplicaOons  of  Bayes  to  RNA-­‐Seq  

Page 44: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

Microarrays  vs.  RNA-­‐Seq  

Marioni  and  Mason  et.  al,  Genome  Research,  2008  

RNA-­‐Seq:  An  assessment  of  technical  reproducibility  and  comparison  with  gene  expression  arrays  

Page 45: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

• Early days—fold change cutoffs (e.g., 2x difference or better) • not a very satisfying approach: -doesn’t take into account variance -misses any small changes

Data Analysis: What genes are differentially expressed?

Here, “A” has a fold change >2.5, but varies greatly between replicate experiments. “B” has a fold change of only 1.75, but changes reliably each time the experiment is performed.

Page 46: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

Experimental  Design:    Liver  vs.  Kidney  

Page 47: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

Metric for RNA-Seq Expression

RPKM:  Reads  per  Kilobase  per  Million  Reads  

Normalizes  for  (1)  gene  size  and  (2)  sequencing  depth    (~0.1-­‐1  transcript/cell)  

Y = (exons, introns, intergenic reads)

Mortazavi, Williams, et al. Nature Methods, 2008

FPKM=fragments-PKM is for paired-end data

Page 48: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

Comparing  GA  and  Affy  arrays  Log2  fo

ld  change  (kidne

y/liver)  -­‐  Affy

metrix  

Log2  fold  change  (kidney/liver)  -­‐  Solexa  Spearman  correlaOon  =  0.72  

Page 49: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

13,072  DifferenOally  Expressed  (DE)  Genes    

(37%)  DE  in  Solexa  4,959  

(12%)  DE  in  Affy  1,579  

(50%)  Both  6,534  

Page 50: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

Bias is introduced if these ratios are not kept:

Y All  Unique  Reads  

Exons   Intron   Inter-­‐  genic  

Exons   Intron   Inter-­‐  genic  

Good Run

Bad Run

Good Run

Bad Run RPKM = 72.7

RPKM = 128.4

Mortazavi, Williams, et al. Nature Methods, 2008

Page 51: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

NormalizaOon  is  needed  

Define  Ygk  as  the  observed  count  for  gene  g  in  library  k  summarized  from  the  raw  reads,  μgk  as  the  true  and  unknown  expression  level  (number  of  transcripts),  Lg  as  the  length  of  gene  g  and  Nk  as  total  number  of  reads  for  library  k.  We  can  model  the  expected  value  of  Ygk  as:          Sk  represents  the  total  RNA  output  of  a  sample.  The  problem  underlying  the  analysis  of  RNA-­‐seq  data  is  that  while  Nk  is  known,  Sk  is  unknown  and  can  vary  drasOcally  from  sample  to  sample,  depending  on  the  RNA  composiOon.  

Page 52: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

Robinson  and  Oshlack  Genome  Biology  2010  11:R25  

NormalizaOon  is  needed  

Page 53: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

NormalizaOon  is  needed  

Robinson  and  Oshlack  Genome  Biology  2010  11:R25  Trimmed  Mean  M-­‐values  (TMM)  

Page 54: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput
Page 55: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

Katz  et  al,  2010  

Page 56: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

Coverage Requirements: How many lanes/plates/wells?  

Depends  on:  1. Read  Length  

2. Size  of  Transcriptome  3. Complexity  of  Transcriptome  

4. Complexity  of  Tissue  5. Biological  Variance  

6. Errors  (random  and  systemaOc)    

Page 57: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

Plateau  of  InformaOon  Starts  @  ~500Mb  

Marioni,Mason et al, Genome Research, 2008

Page 58: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

Coverage Requirements  

Wang, Gerstein, and Snyder, 2008

Page 59: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

No  current  visible  end  of  gene  discovery  

Page 60: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

How  many  replicates  do  I  need?  CalculaOon  of  the  number  of  replicates  depends  on:  

1.  An  esOmate  of  σ2  obtained  from  previous  experiments.  2.  The  size  of  the  difference  (δ)  to  be  detected.  

3.  The  assurance  with  which  it  is  desired  to  detect  the  difference  (i.e.,  Power  of  the  test  =  1-­‐β).  

4.  The  level  of  significance  to  be  used  in  the  actual  experiment  (i.e.,  Type  I  error).  5.  The  test  required,  whether  a  one-­‐tail  or  two-­‐tail  test.  

 To  determine  the  number  of  replicates  use  the  following  formula  :  

     

where:  Zα/2  is  associated  with  the  Type  I  error  (two-­‐tailed)  Zβ  is  associated  with  the  Type  II  error  

δ  is  the  true  difference  to  be  detected,  and  σ  is  the  known  variance  obtained  from  previous  experiments  

Page 61: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

Bayes  in  Chip-­‐seq  too!  

Page 62: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

What’s the problem with Bayesian statistics? (according to non-Bayesians)

1.  Priors can introduce subjective judgment into data analysis.

2.  Priors affect the result. Different people can get different answers from the same data.

3.  It’s too hard. There are no simple point-and-click programs.

Page 63: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

This requires a lot of space

1.2  million  sequence  reads  in  brain    (1990-­‐2009)  

Old  School  

Stein  L,  2010  

Page 64: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

drugs  

vitamin  

hormone  

N  other  species  

     

mRNAs  degrade  

                                               

TFBS            

Gene  regulaOon  

                   

Robust  Sample  Prep  

&    Data  AcquisiOon  

Family  Medical  History  

1.  Small  Variants  (SNVs,  indels)  2.  DE  by  exon,  gene,  intron,  &  jxn  3.  PolyA  sites  and  miRNA  

changes  4.  TARs  and  ORF  potenOal  5.  Gene  fusion  events  6.  Allele-­‐specific  expression  

1.  Variants  (SNVs,  indels)  2.  Structural  variants  (CNVs,  TxC)  3.  Ancestry  predicOon  from  

AIMs  

RNA  edits,  eQTLs  

StaOsOcal  Models  Built  from  Per-­‐Base,  High-­‐Res  Data  

1.  mC  or  hmC  modificaOons  2.  Histone  marks  (HXXX)    3.  Regulatory  informaOon  

1.              Mass  spectrophotometry  data  1.  Protein-­‐Protein  InteracOons  2.  Polysome-­‐Associated  mRNAs    

Nucleosome  effects  

1.    Metabolic  changes  

                       (G)enome  Human  

AnnotaOons,  TDBGV  miRbase,  TCGA,  

1KG,dbSNP  

Bacteria  /Fungi  

G  

Virus  

G  

And  that  is  only  the  Genome!  

Page 65: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

Meta-genomic phenotypes can persist for years, and “passenger genomes” can be a phenotype, as

well as their distributions.

Jakobsson  et  al,  2010  

Normal  throat  

Throat  +  anObioOcs  

Page 66: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

Risk factors for diseases usually involve many genes and pathways

.  Li  et  al,  PLOS  GeneOcs,  2007.  

Page 67: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput

There are other factors than these!

Systems  biology  requires  spaOotemporal  monitoring  of  the  genome,  epigenome,    transcriptome,  proteome,  metabolome,  and  the  environment,  to  see  the  interactome  

SHOW  NO  DIFFERENCE