mpg ngs workshop - broad institute · 2015. 10. 16. · svs typically by lane typically multiple...

MPG NGS workshop: SNP calling and error modeling

February 2011

Ryan Poplin Genome Sequencing and Analysis Medical and Population Genetics

The paradigm today

Indels

Structural variation (SV)

Rawindels

RawSVs

Typically by lane Typically multiple samples simultaneously but can be single sample alone

Output

Mapping

Local realignment

Duplicate marking

Base quality recalibration

Analysis-ready reads

Raw reads Sample 1 reads

Raw variants

RawSNPs

Genotype refinement

Variant quality recalibration

Analysis-ready variants

Pedigrees Known variation

Known genotypes

Population structure

Phase 1: NGS data processing Phase 2: Variant discovery and genotyping Phase 3: Integrative analysis

Sample N reads

External data

Step 2: SNP discovery

Genotype Likelihoo

ds Calculatio

Analysis-ready BAMs

•  We note that we no longer use any hard-filters (proximity to indel calls, clustered SNPs, etc.) at any point in the process.

•  Unified Genotyper math and command lines discussed in previous meetings. (see Appendix for full details)

Allele Frequenc

y Calculatio

Variant Quality

Recalibration

Beagle

Unified Genotyper

Genotype Likelihoo

ds Calculatio

•  The variant quality recalibration process has gone through a major overhaul recently. Most notably, we have removed any dependency on Ti/Tv in the calculation. This and further changes are highlighted in the following slides.

•  Outline: •  Quick Variant Recalibration overview •  Contrastive clustering walkthrough •  Ti/Tv-free quality thresholding or commitment-free probabilistic callsets

Allele Frequenc

y Calculatio

Variant Quality

Recalibration

Beagle

Unified Genotyper

Analysis-ready BAMs

Variant annotations provide signal with which to remove artifacts!

22  49582364 . A G 198.96 . AB=0.67; AC=3; AF=0.50; AN=6; DP=87; Dels=0.00; HRun=1; MQ=71.31; MQ0=22; QD=2.29; SB=-31.76 GT:DP:GQ 0/1:12:99.00 0/1:11:89.43 0/1:28:37.78

VCF record for an A/G SNP at 22:49582364

AC No. chromosomes carrying alt allele

AB Allele balance of ref/alt in hets

AN Total no. of chromosomes HRun

Length of longest contiguous homopolymer

AF Allele frequency MQ RMS MAPQ of all reads

DP Depth of coverage MQ0 No. of MAPQ 0 reads at locus

QD QUAL score over depth SB Estimated SB score

Variant Quality Score Recalibration Model

Gaussian Mixture Model trained on annotated variants, find MAP using VBEM:

p(c) = p(z)p(c

| z) = π k p(π k )N(c

| µk ,Σk )p(µ

k ,Σk )

∑z∑

Normal – inverse Wishart distribution

Prior expectation is the empirical mean and empirical covariance of the data. Bias away from singularities.

Dirichlet distribution

Prior expectation is sparse set

Number of Novel Variants (1000s)0 500 1000 1500 2000

Ti/Tv FDR

0 100 200 300 400

Cumulative TPsTranch!specific TPsTranch!specific FPsCumulative FPs

Ti/Tv FDR

0.0 0.2 0.4 0.6 0.8

Ti/Tv FDR

Evaluating novel variants

More Bias

Less Bias

HiSeq: training on HapMap

More Bias

Less Bias

Likely dbSNP errors

Gaussian mixture model fits

A HM3 1KG Trio

NR sensitivity (%):

HiSeq C

66.9 86.5 With imputation

96.7 NGS only 83.0

Heterozygous variants Homozygous

variants

Low-pass E

Analysis tranche

HiSeq: evaluating novel variants

HiSeq HM3

Analysis tranche

Variant Quality Score Recalibration: training on highly confident known sites to determine the probability that other sites are true

Genotype Likelihoo

ds Calculatio

•  Outline: •  Quick Variant Recalibration overview •  Contrastive clustering walkthrough •  Ti/Tv-free quality thresholding or commitment-free probabilistic callsets

Allele Frequenc

y Calculatio

Variant Quality

Recalibration

Beagle

Unified Genotyper

Analysis-ready BAMs

Running the Variant Quality Score Recalibrator

9 See http://www.broadinstitute.org/gsa/wiki/index.php/Variant_quality_score_recalibration

•  Wiki page has full list of command lines broken out by the various steps in the process

•  Wiki page also has links to all the data sets we recommend using as training data

•  In a few weeks this whole process will be condensed into two much easier to use steps

Contrastive VQSR Clustering Walkthrough

First partition the data into a training set by

looking at sites which overlap with HapMap3.3

and the Omni chip.

Using Variational Bayes EM algorithm learn

probability distribution over the training set.

Assign a probability to each variant based on

how well it clusters with the training set.

Unfortunately a sizeable number of

seemingly good variants fall outside of

the main clusters.

Furthermore, all clusters are

essentially two-sided tests but most

annotations are really only one-sided.

Solution: Train a second set of clusters based on the bottom

10% of variants which had the worst LOD.

This model for the bad variants allows for

contrastive evaluation. New LOD score

becomes difference between the good

model and the bad model.

Contrastive VQSR clustering allows us to

rescue the variants which fall outside of the main clusters but

which also don’t fit the model for bad

variants.

s0 5 10 15 20 25

Genotype Likelihoo

ds Calculatio

•  Outline: •  Quick Variant Recalibration overview •  Contrastive clustering walkthrough •  Ti/Tv-free quality thresholding or commitment-free probabilistic

callsets

Allele Frequenc

y Calculatio

Variant Quality

Recalibration

Beagle

Unified Genotyper

Analysis-ready BAMs

0.5 0.6 0.7 0.8 0.9 1.0

Tranche truth sensitivity

0.5 0.6 0.7 0.8 0.9 1.0

Tranche truth sensitivitySp

Sensitivity vs. specificity plots with the new Ti/Tv-less approach look

1000G low-pass August N=629 NA12878 HiSeq WGS

!"#$%&'#$(!""$"$()*+&#(,&()-.(/*0+"'#,*&

3(*5(6

8 98 :8 ;8 <8 =8 >8 ?8 @8 A8 988 998 9:8

B&*C&(67/37*D$"(67/3E'0F'0(D;G:(H$4,3I*D$%J(H'#$9BK(/%*L$I#(/,"*#(9(H$4,3I*D$%J(H'#$

&#(H$4

,3I*D$%J(H

!"#$%&'()'%*+,-%.".$%'(/01%*+,-%

!"#$%&'#$(!""$"$()*+&#(,&()-.(/*0+"'#,*&

3(*5(6

8 98 :8 ;8 <8 =8 >8 ?8 @8 A8 988 998 9:8

8G98N(OPH(#%'&I2$9G88N(OPH(#%'&I2$=G88N(OPH(#%'&I2$98G88N(OPH(#%'&I2$B&*C&(1,Q1D(H'#,*7*D$"(1,Q1D(H'#,*

:<R<=>(&*D$"(D'%,'&#3(,&('SS%$S'#$(C,#2(9G@;(1,Q1D(%'#,*

1%'&3,#,*

&(#*(1%'&

3D$%3,*

8G8 8G: 8G< 8G> 8G@ 9G8

7*& H$5$%$&I$(!""$"$(O%$T+$&IJ

7HP(H'

/%$ UV0+#'#,*&/*3# UV0+#'#,*&

WD$%'""(7HP(H'#$(X(:<G=N

WD$%'""(7HP(H'#$(X(>G@N

8 : < > @ 98

6$T+$&I,&S(P$0#2

7HP(H'

/%$ UV0+#'#,*&/*3# UV0+#'#,*&

The low confidence tranches are comprised of the low frequency events (most likely FPs)

61-sample CEU from 1000G!

# samples

Center Total # variants

dbSNP% (129)

# knowns

Known ti/tv

# novels

Novel ti/tv

Includes genotype refinement?

1004 Broad 765,365

24.82 190,000

2.36 575,365

2.37 No

1004 BC 733,155

25.34 185,787

2.37 547,368

2.32 No

1004 Sanger 728,374

25.31 184,341

2.36 544,033

2.36 No

1004 UMich 721,250

26.46 190,871

2.33 530,379

2.35 Yes

1004 Oxford 660,024

27.44 181,095

2.38 478,929

2.38 Yes

1004 BCM 605,274

29.98 181,444

2.33 423,830

2.29 Yes

1004 NCBI 601,907

29.26 176,150

2.39 425,757

2.57 No

Broad discovered the most variants at very high quality levels in 1000G chr20 bake-off exercise

•  Our data processing pipeline produces really good SNP calls. The same pipeline is used for whole exome and WGS, both deep and low-pass sequencing. Short indel calls too!

•  Anything can be used as truth data. Validation assays, several 1000G callsets, or auto-generate your own by subsetting to the highest quality SNPs

•  There is no reason to decide between high sensitivity or high specificity. Just use a probabilistic callset.

•  The tools are available to all:

http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit

Final Thoughts

Appendix

Step 2a: SNP discovery

Genotype Likelihoo

ds Calculatio

•  The genotype likelihoods calculation now takes overlapping read pairs (where bases are not independent observations) into account, which we term “fragment-based calling”.

Allele Frequenc

y Calculatio

Variant Quality

Recalibration

Beagle

Unified Genotyper

Analysis-ready BAMs

L(G | D) = P(G)P(D |G) = P(b |G)b∈ good _ bases{ }∏

GATK single sample genotype likelihoods

•  Priors applied during multi-sample calculation; P(G) = 1 •  Likelihood of data computed using pileup of bases and

associated quality scores at given locus •  Only “good bases” are included: those satisfying

minimum base quality, mapping read quality, pair mapping quality, NQS

•  P(b | G) uses calibrated base quality score •  L(G|D) computed for all 10 genotypes

Prior for the genotype"

Likelihood for the genotype"

Likelihood of the data given the genotype"

Bayesian model

Independent base model"

See http://www.broadinstitute.org/gsa/wiki/index.php/Unified_genotyper for more information 22

Step 2b: SNP discovery

Genotype Likelihoo

ds Calculatio

•  We now use Heng Li’s Exact model to calculate P(AF>0) instead of our previous heuristic grid search model.

Allele Frequenc

y Calculatio

Variant Quality

Recalibration

Beagle

Unified Genotyper

Analysis-ready BAMs

We apply a generalization of the single sample SNP caller for multi sample

•  This approach allows us to combine weak single sample calls to discover variation among several samples with high confidence

Individual 1"

Sample-associated reads"

Individual 2"

Individual N"

Genotype likelihoods"

Joint estimate across samples"

Genotype frequencies"

Allele frequency"

Running the Unified Genotyper

java -Xmx2048m –jar GenomeAnalysisTK.jar -R /broad/1KG/reference/human_b37_both.fasta -T UnifiedGenotyper-B:dbsnp,VCF dbsnp_132_b37.vcf-o NA19240.raw.vcf -stand_call_conf 30 --heterozygosity 1.000000e-03 -I NA19240.SLX.bam

Minimum phred-scaled confidence required to emit a

1 het per 1000 reference bases on average for a Yoruban

BAM file containing NA19240 SLX reads

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA19240 1 36496 . T A 53.13 . <ATTRIBUTES> GT:DP:GQ 1/0:6:84.70

1 45162 rs10399749 C T 331.37 . <ATTRIBUTES> GT:DP:GQ 0/1:27:99.00

1 48677 . G A 399.86 . <ATTRIBUTES> GT:DP:GQ 1/0:25:99.00

Long string of variant annotations (more info in a few slides) Raw VCF calls (NA19240.raw.vcf)

Variants with bad Haplotype Scores often exhibit good Ti/Tv ratios and are included in other centers’ callsets, but are likely

Bad sites being called by other centers but correctly filtered by the Broad.

These sites are potentially bad in other SNP annotation dimensions.

Higher score means more evidence for error.

mpg ngs workshop - broad institute · 2015. 10. 16. · svs typically by lane typically multiple...

Documents

svs mini 7 lift installation manual -...

svs online journal

svs fem s.r.o. |

svs corporate brochure

hid str genotyper plugin - thermo fisher scientific · •...

agilent icp-oes svs 2 productivity package installation...

svs dii spanish manual

introduction to svs 8 tutorial -...

pvt svs-78

svs security

short tool jig svs-38 (svs-32) - sharpening · 2016. 1....

svs 2013 spring

beagle imputation in svs

su14 hit3 svs

svs college of engineering coimbatore dr. abdul kalam...

svs-experience reality brochure

group overview - e-spin group · professional svs. hardware...

marketing plan svs

svs college of engineering coimbatore...

svs-760c & svs-760cf operation manual - si-tex.com...