Ryan Poplin - Sources of Bias

Ryan Poplin, on behalf of the Genome Sequencing and Analysis Group Program in Medical and Population Genetics August 16, 2012 Understanding sources of bias and error from a prospective Reference Material (NA12878)

Upload: genomeinabottle

Post on 26-Jun-2015


TRANSCRIPT

Page 1: Ryan Poplin - Sources of Bias

Ryan Poplin, on behalf of the Genome Sequencing and Analysis Group Program in Medical and Population Genetics August 16, 2012

Understanding sources of bias and error from a prospective Reference Material (NA12878)

Page 2: Ryan Poplin - Sources of Bias

NA12878 is a wonderful reference sample

•  Unrestricted cell lines
•  Extensive pedigree available
•  Extensively sequenced and genotyped at the Broad and elsewhere
   – All Broad techs (both production and experimental)
   – Fosmids
   – Many library designs and sample prep protocols

Page 3: Ryan Poplin - Sources of Bias

Our framework for variation discovery

[Diagram: three-phase pipeline]

Phase 1: NGS data processing (typically by lane)
   Input: Raw reads
   Steps: Mapping, Local realignment, Duplicate marking, Base quality recalibration
   Output: Analysis-ready reads

Phase 2: Variant discovery and genotyping (typically multiple samples simultaneously but can be single sample alone)
   Input: Sample 1 reads ... Sample N reads
   Output: Raw variants (SNPs, Indels, Structural variation (SV))

Phase 3: Integrative analysis
   Input: Raw SNPs, Raw indels, Raw SVs, plus external data (Pedigrees, Known variation, Known genotypes, Population structure)
   Steps: Variant quality recalibration, Genotype refinement
   Output: Analysis-ready variants

DePristo, M., Banks, E., Poplin, R. et al. (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet.

Page 4: Ryan Poplin - Sources of Bias

Lots of work required to turn raw sequencing reads into something that is useful

[Diagram: Phase 1: NGS data processing. Input: Raw reads. Steps: Mapping, Local realignment, Duplicate marking, Base quality recalibration. Output: Analysis-ready reads.]

Desired properties of analysis-ready reads:

•  Unbiased sampling of alleles
•  Calibrated mapping quality scores
•  Indels have correct and consistent alignment in reads
•  Duplicate molecules shouldn't count as extra evidence for event
•  Calibrated base quality scores for base substitutions, base insertions, and base deletions
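The requirement that duplicate molecules should not count as extra evidence can be illustrated with a minimal sketch. It assumes duplicates are reads sharing chromosome, start position and strand, and that the copy with the highest summed base quality is kept. The `Read` type and its fields are hypothetical, introduced only for illustration; production duplicate markers also handle mate pairs and clipping.

```python
from collections import defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    name: str
    chrom: str
    pos: int           # alignment start (hypothetical field)
    reverse: bool      # True if aligned to the reverse strand
    base_quals: tuple  # per-base Phred qualities


def mark_duplicates(reads):
    """Return the set of read names flagged as duplicates.

    Reads sharing chromosome, start position and strand are treated as
    copies of one molecule; the copy with the highest summed base
    quality is kept as the representative.
    """
    groups = defaultdict(list)
    for read in reads:
        groups[(read.chrom, read.pos, read.reverse)].append(read)

    duplicates = set()
    for group in groups.values():
        best = max(group, key=lambda r: sum(r.base_quals))
        duplicates.update(r.name for r in group if r is not best)
    return duplicates


reads = [
    Read("r1", "chr1", 100, False, (30, 30, 30)),
    Read("r2", "chr1", 100, False, (35, 35, 35)),  # same start/strand as r1
    Read("r3", "chr1", 100, True, (30, 30, 30)),   # opposite strand
    Read("r4", "chr1", 250, False, (20, 20, 20)),
]
print(sorted(mark_duplicates(reads)))  # ['r1']
```

Flagging rather than deleting the duplicate keeps the read in the file while letting downstream callers ignore it.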

Page 5: Ryan Poplin - Sources of Bias

Indels have correct and consistent alignment in reads through multiple sequence local realignment

[Figure: Effect of MSA on alignments, NA12878, chr1:1,510,530-1,510,589. Four panels:
   1,000 Genomes Pilot 2 data, raw MAQ alignments
   1,000 Genomes Pilot 2 data, after MSA
   HiSeq data, raw BWA alignments
   HiSeq data, after MSA
Labeled dbSNP sites: rs28782535, rs28783181, rs28788974, rs34877486.]

DePristo, M., Banks, E., Poplin, R. et al. (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet.


Page 6: Ryan Poplin - Sources of Bias

Base Quality Score Recalibration provides a calibrated error model from which to make mutation calls

[Figure: base quality score calibration before and after recalibration for five sequencing technologies: SLX GA, 454, SOLiD, HiSeq, Complete Genomics. Three rows of panels:
   1. Reported Quality (x-axis, 0 to 40) vs. Empirical Quality (y-axis, 0 to 40)
   2. Machine Cycle (x-axis) vs. Accuracy (Empirical - Reported Quality) (y-axis, -10 to 10); paired-end panels are split into first of pair reads and second of pair reads
   3. Dinucleotide (x-axis) vs. Accuracy (Empirical - Reported Quality) (y-axis, -10 to 10)
Each panel reports the RMSE of the original and of the recalibrated quality scores. The values below are listed in the order the panels were extracted; the assignment of panels to technologies is not recoverable from the transcript.]

Reported Quality vs. Empirical Quality, RMSE (original / recalibrated):
   5.242 / 0.196
   2.556 / 0.213
   1.215 / 0.756
   4.479 / 0.235
   5.634 / 0.135

Machine Cycle vs. Accuracy, RMSE (original / recalibrated):
   2.207 / 0.186
   1.784 / 0.136
   1.688 / 0.213
   2.679 / 0.182
   2.609 / 0.089

Dinucleotide vs. Accuracy, RMSE (original / recalibrated):
   2.598 / 0.052
   2.169 / 0.135
   1.656 / 0.088
   3.503 / 0.06
   2.469 / 0.083

Highlighted as one of the major methodological advances of the 1000 Genomes Pilot Project
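The quantity plotted on this slide, empirical quality versus reported quality, and the RMSE summarizing their disagreement can be sketched as follows. This is a simplified illustration, not the recalibrator itself: it bins only by reported quality, assumes the input already excludes known polymorphic sites so that every mismatch counts as a sequencing error, and uses a +1/+2 pseudocount that is an assumption of this sketch.

```python
import math
from collections import defaultdict


def empirical_quality(errors, observations):
    """Phred-scaled empirical error rate with a +1/+2 pseudocount (assumed)."""
    rate = (errors + 1) / (observations + 2)
    return -10.0 * math.log10(rate)


def calibration_table(bases):
    """bases: iterable of (reported_quality, is_mismatch) pairs."""
    counts = defaultdict(lambda: [0, 0])  # reported Q -> [errors, observations]
    for reported_q, is_mismatch in bases:
        counts[reported_q][0] += int(is_mismatch)
        counts[reported_q][1] += 1
    return {q: empirical_quality(e, n) for q, (e, n) in counts.items()}


def rmse(table):
    """Root-mean-square gap between reported and empirical quality over bins."""
    return math.sqrt(sum((q - emp) ** 2 for q, emp in table.items()) / len(table))


# Toy data: bases reported as Q30 that mismatch about 1% of the time,
# so their empirical quality is close to Q20.
bases = [(30, True)] * 9 + [(30, False)] * 989
table = calibration_table(bases)
print(round(table[30], 1), round(rmse(table), 1))  # 20.0 10.0
```

A well-calibrated instrument would give an RMSE near zero; the real method additionally conditions on covariates such as machine cycle and sequence context, as the panels above show.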

Page 7: Ryan Poplin - Sources of Bias

Per-base indel error rate also varies by lane, sequence context and sequencing technology

[Figure: empirical gap open penalty (y-axis, 0 to 50) for each AAAAA context suffix (x-axis, three-base suffixes from AAA to TTT), one point per read group. Read groups in the legend: 20FUK.1 through 20FUK.8 and PacBio. The HiSeq and PacBio point clouds are labeled on the plot.]

AAAAA + AAA context is error-prone in HiSeq

PacBio error rate is 1000x higher but unbiased
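One way to picture the per-lane, per-context estimate behind this slide is a tally of indel events by read group and sequence context, converted to a Phred-scaled gap open penalty. The sketch below is an illustration under assumptions: the input tuples, the read group name `laneA`, and the +1/+2 pseudocount are all hypothetical.

```python
import math
from collections import Counter


def gap_open_penalties(observations):
    """observations: iterable of (read_group, context, has_indel) tuples,
    one per aligned base. Returns {(read_group, context): Phred penalty}."""
    totals, indels = Counter(), Counter()
    for read_group, context, has_indel in observations:
        key = (read_group, context)
        totals[key] += 1
        indels[key] += int(has_indel)
    return {
        key: -10.0 * math.log10((indels[key] + 1) / (totals[key] + 2))
        for key in totals
    }


# Toy data: a homopolymer context with many indel errors and a
# mixed context with none, both from one hypothetical lane.
observations = (
    [("laneA", "AAAAAAAA", True)] * 99
    + [("laneA", "AAAAAAAA", False)] * 899
    + [("laneA", "ACGTACGT", False)] * 998
)
penalties = gap_open_penalties(observations)
print(round(penalties[("laneA", "AAAAAAAA")], 1))  # 10.0
print(round(penalties[("laneA", "ACGTACGT")], 1))  # 30.0
```

A lower penalty means indel errors are more common in that context, so a read-supported indel there is weaker evidence for a real variant.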

Page 8: Ryan Poplin - Sources of Bias

Latest version of Base Quality Score Recalibrator empirically estimates the base insertion and deletion error rates in addition to substitutions

[Figure: three rows of panels, each with one panel per error type (Base Substitution, Base Insertion, Base Deletion):
   1. Reported Quality Score (x-axis, 10 to 50) vs. Empirical Quality Score (y-axis, 10 to 50)
   2. Cycle Covariate (x-axis, -100 to 100) vs. Quality Score Accuracy (y-axis)
   3. Context Covariate (x-axis, two- and three-base contexts from AA to TTT) vs. Quality Score Accuracy (y-axis)
Legend: Recalibration (Recalibrated, BQSRv2); point size scales with log10(nBases).]

UnifiedGenotyper used a flat Q45 in its indel model

Page 9: Ryan Poplin - Sources of Bias

Making use of these calibrated quality scores improves indel likelihood calibration

[Figure: Reported Genotype Quality (x-axis, 0 to 30) vs. Empirical Genotype Quality (y-axis, 0 to 30). Columns: UnifiedGenotyper, HaplotypeCaller Original Quality, HaplotypeCaller Calibrated Quality. Rows: Hom Ref, Het, Hom Var. Point color: pGGivenDType (Hom Ref, Het, Hom Var); point size: log10(Sum), 1.5 to 4.0.]

Evaluated calls: Per-read-group downsampled (~4x coverage) NA12878 indel calls
Truth: GATK-bundle indel "gold standard" truth sites with high confidence genotypes from deep coverage CEU trio

Page 10: Ryan Poplin - Sources of Bias

Another source of bias: We often find consistent (artifactual) alleles at the sites of larger events because they cannot be properly modeled by the mappers

[Figure: read alignments at a validated 30bp deletion, Chr12:15296246 GTGTGTATGTAAATATATACATACACACAT/-. Two panels: Original BWA alignments; Alignments showing the actual allele.]

Multiple called artifacts that are hard to filter out, since they are well supported by read data

Page 11: Ryan Poplin - Sources of Bias

Allele determination is much more accurate through local assembly of candidate haplotypes

[Figure: Original BWA alignments]

BAM read bases are all identical; individual alignments differ based on the whims of the mapper

-assembly: 1 multi-allelic SNP and two 1bp indels are called
+assembly: Only the complex substitution (TT to TAC) is called

Page 12: Ryan Poplin - Sources of Bias

As an added bonus we now get physical phasing for free, which allows us to distinguish between e.g. MNPs and compound hets

[Figure: read alignments for the CEU Trio, one track each for Daughter, Father, Mother]

Page 13: Ryan Poplin - Sources of Bias

Conclusions

•  NA12878 (and potentially more of the CEU pedigree) is a great reference sample

•  Read data must be in an analysis-ready form which passes a multi-faceted battery of tests related to statistical bias that go beyond sample QC metrics

•  Local de novo assembly around mutations is necessary to avoid biases in alleles resulting from a myopic view

Page 14: Ryan Poplin - Sources of Bias

Appendix

Page 15: Ryan Poplin - Sources of Bias

Multiple sequence alignment itself is not enough for calling indels

•  MSA is now a standard piece of the BAM processing pipeline and works well for previously seen indels in getting consistent alignments

•  However, it is not empowered to discover large novel indels or more complex alleles

•  What is needed is haplotype reconstruction and then calling of variants from the candidate haplotypes

•  Several groups (Oxford, Sanger, Broad) are actively working on assembly-based approaches (both global and local) for haplotype level calling

Page 16: Ryan Poplin - Sources of Bias

Step 1: propose haplotypes with local de novo assembler via DeBruijn graphs

Traverse the graph to enumerate the possible haplotypes. Each edge is weighted by the number of reads which gave evidence for that k-mer.

[Figure from: Schatz, M. et al. (2010) Assembly of large genomes using second-generation sequencing. Genome Research.]
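The traversal described on this slide can be sketched in a few lines. This is a toy illustration rather than the assembler used in the talk: nodes are (k-1)-mers, edges are k-mers weighted by how often they were seen, weak edges are pruned with a hypothetical `min_weight` threshold, and cycles are avoided simply by never revisiting a node. The sequences are made up, with the alternate haplotype carrying a TT to TAC substitution.

```python
from collections import Counter, defaultdict


def build_graph(sequences, k):
    """Edges are k-mers; each edge joins its (k-1)-mer prefix and suffix.
    The weight is the number of times the k-mer was seen."""
    weights = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            weights[seq[i:i + k]] += 1
    graph = defaultdict(list)
    for kmer, weight in weights.items():
        graph[kmer[:-1]].append((kmer[1:], weight))
    return graph


def enumerate_haplotypes(graph, source, sink, min_weight=1):
    """Depth-first traversal from source to sink, skipping weak edges
    and refusing to revisit a node (a crude guard against cycles)."""
    haplotypes = []

    def walk(node, path, visited):
        if node == sink:
            haplotypes.append(path)
            return
        for nxt, weight in graph.get(node, []):
            if weight >= min_weight and nxt not in visited:
                walk(nxt, path + nxt[-1], visited | {nxt})

    walk(source, source, {source})
    return sorted(haplotypes)


ref = "ACGGTTCAGC"
alt = "ACGGTACCAGC"        # TT replaced by TAC
error_read = "GGTTGAGC"    # seen once; contains a sequencing error
graph = build_graph([ref] * 3 + [alt] * 3 + [error_read], k=4)

print(enumerate_haplotypes(graph, "ACG", "AGC", min_weight=2))
# ['ACGGTACCAGC', 'ACGGTTCAGC']
print(len(enumerate_haplotypes(graph, "ACG", "AGC", min_weight=1)))  # 3
```

With the threshold at 1 the single erroneous read contributes a third candidate haplotype, which is why the edge weights matter when deciding which paths to propose.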

Page 17: Ryan Poplin - Sources of Bias

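Step 2: evaluate candidate haplotype likelihoods with Pair HMM

Bayesian model

4 SNP calling

4.1 Simple genotype likelihoods for presentations

$$\Pr\{G \mid D\} = \frac{\Pr\{G\}\,\Pr\{D \mid G\}}{\sum_i \Pr\{G_i\}\,\Pr\{D \mid G_i\}} \qquad \text{[Bayes' rule]}$$

Here $\Pr\{G\}$ is the prior of the genotype and $\Pr\{D \mid G\}$ is the likelihood of the genotype.

$$\Pr\{D \mid G\} = \prod_j \left( \frac{\Pr\{D_j \mid H_1\}}{2} + \frac{\Pr\{D_j \mid H_2\}}{2} \right), \qquad \text{where } G = H_1 H_2 \qquad \text{[diploid assumption]}$$

$\Pr\{D \mid H\}$ is the haploid likelihood function.

4.1.1 SNP haploid likelihood

$$\Pr\{D_j \mid H\} = \Pr\{D_j \mid b\} \qquad \text{[single base pileup]}$$

$$\Pr\{D_j \mid b\} = \begin{cases} 1 - \epsilon_j & D_j = b, \\ \epsilon_j & \text{otherwise.} \end{cases}$$

4.1.2 Indel haploid likelihood

$$\Pr\{D_j \mid H\} = \sum_{\text{alignments } \pi \text{ of } D_j \text{ to } H} \Pr\{D_j, \pi\}$$

4.2 Genotype likelihoods

$$\Pr\{D_i \mid GT_i\} = \prod_j \Pr\{D_{i,j} \mid GT_i\}$$

$$\Pr\{D_{i,j} \mid GT_i = AB\} = \frac{\Pr\{D_{i,j} \mid A\} + \Pr\{D_{i,j} \mid B\}}{2}$$

$$\Pr\{D_{i,j} \mid B\} = \begin{cases} 1 - \epsilon_{i,j} & D_{i,j} = B, \\ \epsilon_{i,j} \cdot \Pr\{B \text{ is true} \mid D_{i,j} \text{ is miscalled}\} & \text{otherwise.} \end{cases}$$

Empirical gap penalties derived from data using new BQSR. Base mismatch penalties are the base quality scores.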
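The diploid genotype likelihood and Bayes' rule from this slide can be worked through numerically for the simple SNP case (section 4.1.1). The sketch below is illustrative only: it uses a single-base pileup rather than a Pair HMM, takes the per-base error from the Phred quality, and assumes flat genotype priors chosen purely for the example.

```python
from itertools import combinations_with_replacement


def base_likelihood(observed, allele, error):
    """Pr{D_j | b}: 1 - eps_j if the read base matches, eps_j otherwise."""
    return 1.0 - error if observed == allele else error


def genotype_likelihood(pileup, genotype):
    """Pr{D | G} under the diploid assumption, G = (H1, H2)."""
    h1, h2 = genotype
    likelihood = 1.0
    for observed, qual in pileup:
        eps = 10.0 ** (-qual / 10.0)  # Phred quality -> error probability
        likelihood *= (
            0.5 * base_likelihood(observed, h1, eps)
            + 0.5 * base_likelihood(observed, h2, eps)
        )
    return likelihood


def genotype_posteriors(pileup, alleles, priors):
    """Bayes' rule over all unordered diploid genotypes."""
    genotypes = list(combinations_with_replacement(alleles, 2))
    joint = {g: priors[g] * genotype_likelihood(pileup, g) for g in genotypes}
    total = sum(joint.values())
    return {g: p / total for g, p in joint.items()}


# Five reads supporting A and five supporting G, all at Q30.
pileup = [("A", 30)] * 5 + [("G", 30)] * 5
flat_priors = {("A", "A"): 1 / 3, ("A", "G"): 1 / 3, ("G", "G"): 1 / 3}
posteriors = genotype_posteriors(pileup, "AG", flat_priors)
best = max(posteriors, key=posteriors.get)
print(best)  # ('A', 'G')
```

For indels the per-read term would instead come from summing over alignments of the read to each candidate haplotype, which is the job of the Pair HMM named in the slide title.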

Page 18: Ryan Poplin - Sources of Bias

The indel size distribution is more accurate when using local assembly of candidate haplotypes

[Figure: indel size distribution. Key: -assembly; +assembly; fosmid data (truth).]

Larger events are missing with previous methods

Page 19: Ryan Poplin - Sources of Bias

Variant annotation statistics provide signal with which to evaluate callsets of putative mutations

VCF record for an A/G SNP at 22:49582364

22 49582364 . A G 198.96 . AB=0.67; AC=3; AF=0.50; AN=6; DP=87; Dels=0.00; HRun=1; MQ=71.31; MQ0=22; QD=2.29; SB=-31.76 GT:DP:GQ 0/1:12:99.00 0/1:11:89.43 0/1:28:37.78

INFO field   Meaning
AC           No. chromosomes carrying alt allele
AN           Total no. of chromosomes
AF           Allele frequency
DP           Depth of coverage
QD           QUAL score over depth
AB           Allele balance of ref/alt in hets
HRun         Length of longest contiguous homopolymer
MQ           RMS MAPQ of all reads
MQ0          No. of MAPQ 0 reads at locus
SB           Evidence for strand bias
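To make the annotations concrete, the record on this slide can be parsed with a few lines of plain Python. This is a minimal sketch, not a full VCF parser: it assumes a tab-separated line with the standard eight fixed columns followed by FORMAT and sample columns, and the record string below is the slide's record rewritten without the spaces after the semicolons.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict, converting numbers where possible."""
    fields = {}
    for item in info.split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition("=")
        try:
            fields[key] = float(value) if "." in value else int(value)
        except ValueError:
            fields[key] = value if value else True  # flag fields carry no value
    return fields


def parse_record(line):
    """Parse one tab-separated VCF data line into a nested dict."""
    chrom, pos, vid, ref, alt, qual, flt, info, fmt, *samples = line.split("\t")
    keys = fmt.split(":")
    return {
        "chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt,
        "qual": float(qual), "info": parse_info(info),
        "samples": [dict(zip(keys, s.split(":"))) for s in samples],
    }


line = "\t".join([
    "22", "49582364", ".", "A", "G", "198.96", ".",
    "AB=0.67;AC=3;AF=0.50;AN=6;DP=87;Dels=0.00;HRun=1;MQ=71.31;MQ0=22;QD=2.29;SB=-31.76",
    "GT:DP:GQ", "0/1:12:99.00", "0/1:11:89.43", "0/1:28:37.78",
])
record = parse_record(line)
info = record["info"]

# The derived annotations agree with the raw ones in this record.
print(info["AC"] / info["AN"])                  # 0.5
print(round(record["qual"] / info["DP"], 2))    # 2.29
```

The two printed values reproduce the AF and QD annotations from AC, AN, QUAL and DP, which is a quick sanity check on the definitions in the table.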

Page 20: Ryan Poplin - Sources of Bias

Variant Quality Score Recalibration (VQSR): modeling error properties of real polymorphism to determine the probability that novel sites are real

The HapMap3 sites from NA12878 HiSeq calls are used to train the GMM. Shown here is the 2D plot of strand bias vs. the variant quality / depth for those sites.

Variants are scored based on their fit to the Gaussians. The variants (here just the novels) clearly separate into good and bad clusters.

DePristo, M., Banks, E., Poplin, R. et al. (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet.

Page 21: Ryan Poplin - Sources of Bias

[Figure: two histograms of the number of SNPs (y-axis), one over Fisher Strand Bias Score (x-axis) and one over Variant Quality / Depth (x-axis). Legend: novels; knowns (dbSNP 132); retained; filtered out.]

FS: Fisher Exact Test of Read Strand
QD: Variant Quality / Depth

Lots of the filtered out strand biased variants show up at the centromere. Very unlikely to be real SNP mutations.
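A strand bias score of the kind named on this slide can be sketched as a Phred-scaled p-value from Fisher's exact test on a 2x2 table of reference/alternate allele counts by forward/reverse strand. The sketch below is a from-scratch illustration using only the standard library; the function names and read counts are hypothetical, and the exact tail convention used by any particular tool is not confirmed by the slide.

```python
import math


def hypergeom_pmf(a, b, c, d):
    """Probability of one 2x2 table [[a, b], [c, d]] with fixed margins."""
    return (math.comb(a + b, a) * math.comb(c + d, c)) / math.comb(a + b + c + d, a + c)


def fisher_exact_two_sided(ref_fwd, ref_rev, alt_fwd, alt_rev):
    """Two-sided p-value: sum over all tables no more likely than the observed one."""
    row1, row2 = ref_fwd + ref_rev, alt_fwd + alt_rev
    col1 = ref_fwd + alt_fwd
    observed = hypergeom_pmf(ref_fwd, ref_rev, alt_fwd, alt_rev)
    p_value = 0.0
    for a in range(max(0, col1 - row2), min(row1, col1) + 1):
        p = hypergeom_pmf(a, row1 - a, col1 - a, row2 - (col1 - a))
        if p <= observed * (1 + 1e-7):  # tolerance for floating-point ties
            p_value += p
    return min(p_value, 1.0)


def strand_bias_score(ref_fwd, ref_rev, alt_fwd, alt_rev):
    """Phred-scaled p-value; larger means stronger evidence of strand bias."""
    p = fisher_exact_two_sided(ref_fwd, ref_rev, alt_fwd, alt_rev)
    return -10.0 * math.log10(max(p, 1e-300))


balanced = strand_bias_score(10, 10, 10, 10)  # alt reads on both strands
biased = strand_bias_score(10, 10, 10, 0)     # alt reads only on forward strand
print(biased > balanced)  # True
```

A site whose alternate allele appears on only one strand gets a high score, matching the slide's point that such variants are unlikely to be real SNPs.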