mpg ngs workshop - broad institute · 2015. 10. 16. · svs typically by lane typically multiple...
Post on 26-Sep-2020
0 Views
Preview:
TRANSCRIPT
MPG NGS workshop: SNP calling and error modeling
February 2011
Ryan Poplin Genome Sequencing and Analysis Medical and Population Genetics
The paradigm today
SNPs
Indels
Structural variation (SV)
Rawindels
RawSVs
Typically by lane Typically multiple samples simultaneously but can be single sample alone
Input
Output
Mapping
Local realignment
Duplicate marking
Base quality recalibration
Analysis-ready reads
Raw reads Sample 1 reads
Raw variants
RawSNPs
Genotype refinement
Variant quality recalibration
Analysis-ready variants
Pedigrees Known variation
Known genotypes
Population structure
Phase 1: NGS data processing Phase 2: Variant discovery and genotyping Phase 3: Integrative analysis
Sample N reads
External data
Step 2: SNP discovery
Genotype Likelihoo
ds Calculatio
n
Analysis-ready BAMs
• We note that we no longer use any hard-filters (proximity to indel calls, clustered SNPs, etc.) at any point in the process.
• Unified Genotyper math and command lines discussed in previous meetings. (see Appendix for full details)
Allele Frequenc
y Calculatio
n
Variant Quality
Recalibration
Beagle
Unified Genotyper
Step 3: SNP discovery
Genotype Likelihoo
ds Calculatio
n
• The variant quality recalibration process has gone through a major overhaul recently. Most notably, we have removed any dependency on Ti/Tv in the calculation. This and further changes are highlighted in the following slides.
• Outline: • Quick Variant Recalibration overview • Contrastive clustering walkthrough • Ti/Tv-free quality thresholding or commitment-free probabilistic callsets
Allele Frequenc
y Calculatio
n
Variant Quality
Recalibration
Beagle
Unified Genotyper
Analysis-ready BAMs
Variant annotations provide signal with which to remove artifacts!
22 49582364 . A G 198.96 . AB=0.67; AC=3; AF=0.50; AN=6; DP=87; Dels=0.00; HRun=1; MQ=71.31; MQ0=22; QD=2.29; SB=-31.76 GT:DP:GQ 0/1:12:99.00 0/1:11:89.43 0/1:28:37.78
VCF record for an A/G SNP at 22:49582364
5
AC No. chromosomes carrying alt allele
AB Allele balance of ref/alt in hets
AN Total no. of chromosomes HRun
Length of longest contiguous homopolymer
AF Allele frequency MQ RMS MAPQ of all reads
DP Depth of coverage MQ0 No. of MAPQ 0 reads at locus
QD QUAL score over depth SB Estimated SB score
INFO
fiel
d
Variant Quality Score Recalibration Model
Gaussian Mixture Model trained on annotated variants, find MAP using VBEM:
p(c) = p(z)p(c
| z) = π k p(π k )N(c
| µk ,Σk )p(µ
k ,Σk )
k=1
K
∑z∑
Normal – inverse Wishart distribution
Prior expectation is the empirical mean and empirical covariance of the data. Bias away from singularities.
Dirichlet distribution
Prior expectation is sparse set
Number of Novel Variants (1000s)0 500 1000 1500 2000
Ti/Tv FDR
10
5
1
0.1
1.91
1.99
2.06
2.07
0 100 200 300 400
Cumulative TPsTranch!specific TPsTranch!specific FPsCumulative FPs
Ti/Tv FDR
10
2
1
0.1
1.92
2.04
2.05
2.07
0.0 0.2 0.4 0.6 0.8
Ti/Tv FDR
10
2
1
0.1
2.79
2.96
2.98
3.01
Evaluating novel variants
More Bias
Less Bias
HiSeq: training on HapMap
More Bias
Less Bias
Likely dbSNP errors
Gaussian mixture model fits
A HM3 1KG Trio
99.5
99.5
99.5
99.5
98.2
98.5
98.4
98.6
NR sensitivity (%):
HiSeq C
HiSeq
88.0
89.7
88.0
93.8
HM3
96.0
96.8
96.0
98.3
D
65.1
66.2
65.3
66.9 86.5 With imputation
82.3
82.8
82.4
96.7 NGS only 83.0
B
Heterozygous variants Homozygous
variants
Exome
Low-pass E
Analysis tranche
Analysis tranche
HiSeq: evaluating novel variants
HiSeq HM3
Analysis tranche
Variant Quality Score Recalibration: training on highly confident known sites to determine the probability that other sites are true
Step 3: SNP discovery
Genotype Likelihoo
ds Calculatio
n
• The variant quality recalibration process has gone through a major overhaul recently. Most notably, we have removed any dependency on Ti/Tv in the calculation. This and further changes are highlighted in the following slides.
• Outline: • Quick Variant Recalibration overview • Contrastive clustering walkthrough • Ti/Tv-free quality thresholding or commitment-free probabilistic callsets
Allele Frequenc
y Calculatio
n
Variant Quality
Recalibration
Beagle
Unified Genotyper
Analysis-ready BAMs
Running the Variant Quality Score Recalibrator
9 See http://www.broadinstitute.org/gsa/wiki/index.php/Variant_quality_score_recalibration
• Wiki page has full list of command lines broken out by the various steps in the process
• Wiki page also has links to all the data sets we recommend using as training data
• In a few weeks this whole process will be condensed into two much easier to use steps
Contrastive VQSR Clustering Walkthrough
First partition the data into a training set by
looking at sites which overlap with HapMap3.3
and the Omni chip.
Contrastive VQSR Clustering Walkthrough
Using Variational Bayes EM algorithm learn
probability distribution over the training set.
Contrastive VQSR Clustering Walkthrough
Assign a probability to each variant based on
how well it clusters with the training set.
Unfortunately a sizeable number of
seemingly good variants fall outside of
the main clusters.
Furthermore, all clusters are
essentially two-sided tests but most
annotations are really only one-sided.
Contrastive VQSR Clustering Walkthrough
Solution: Train a second set of clusters based on the bottom
10% of variants which had the worst LOD.
This model for the bad variants allows for
contrastive evaluation. New LOD score
becomes difference between the good
model and the bad model.
Contrastive VQSR Clustering Walkthrough
Contrastive VQSR clustering allows us to
rescue the variants which fall outside of the main clusters but
which also don’t fit the model for bad
variants.
AC
Num
ber
of S
NP
s0 5 10 15 20 25
01000
2000
3000
4000
5000
6000
Step 3: SNP discovery
Genotype Likelihoo
ds Calculatio
n
• The variant quality recalibration process has gone through a major overhaul recently. Most notably, we have removed any dependency on Ti/Tv in the calculation. This and further changes are highlighted in the following slides.
• Outline: • Quick Variant Recalibration overview • Contrastive clustering walkthrough • Ti/Tv-free quality thresholding or commitment-free probabilistic
callsets
Allele Frequenc
y Calculatio
n
Variant Quality
Recalibration
Beagle
Unified Genotyper
Analysis-ready BAMs
!
!
!
!
!
!
0.5 0.6 0.7 0.8 0.9 1.0
2.2
2.3
2.4
2.5
2.6
2.7
2.8
Tranche truth sensitivity
Spec
ifici
ty (N
ovel
Ti/T
v ra
tio)
!
!
!
!
!
!
0.5 0.6 0.7 0.8 0.9 1.0
1.9
2.0
2.1
2.2
Tranche truth sensitivitySp
ecifi
city
(Nov
el T
i/Tv
ratio
)
Sensitivity vs. specificity plots with the new Ti/Tv-less approach look
good
1000G low-pass August N=629 NA12878 HiSeq WGS
!"#$%&'#$(!""$"$()*+&#(,&()-.(/*0+"'#,*&
12*+
3'&4
3(*5(6
7/3
898
8:8
8;8
8<8
8
8 98 :8 ;8 <8 =8 >8 ?8 @8 A8 988 998 9:8
B&*C&(67/37*D$"(67/3E'0F'0(D;G:(H$4,3I*D$%J(H'#$9BK(/%*L$I#(/,"*#(9(H$4,3I*D$%J(H'#$
8G8
8G:
8G<
8G>
8G@
9G8
M'%,'
&#(H$4
,3I*D$%J(H
'#$
!"#$%&'()'%*+,-%.".$%'(/01%*+,-%
!"#$%&'#$(!""$"$()*+&#(,&()-.(/*0+"'#,*&
12*+
3'&4
3(*5(6
7/3
898
8:8
8;8
8<8
8
8 98 :8 ;8 <8 =8 >8 ?8 @8 A8 988 998 9:8
8G98N(OPH(#%'&I2$9G88N(OPH(#%'&I2$=G88N(OPH(#%'&I2$98G88N(OPH(#%'&I2$B&*C&(1,Q1D(H'#,*7*D$"(1,Q1D(H'#,*
:<R<=>(&*D$"(D'%,'(,&('SS%$S'#$(C,#2(9G@;(1,Q1D(%'#,*
9G:
9G>
:G8
:G<
:G@
;G:
;G>
1%'&3,#,*
&(#*(1%'&
3D$%3,*
&(H'
#,*
8G8 8G: 8G< 8G> 8G@ 9G8
8G8
8G9
8G:
8G;
8G<
8G=
7*& H$5$%$&I$(!""$"$(O%$T+$&IJ
7HP(H'
#$
/%$ UV0+#'#,*&/*3# UV0+#'#,*&
WD$%'""(7HP(H'#$(X(:<G=N
WD$%'""(7HP(H'#$(X(>G@N
8 : < > @ 98
8G8
8G:
8G<
8G>
8G@
9G8
6$T+$&I,&S(P$0#2
7HP(H'
#$
/%$ UV0+#'#,*&/*3# UV0+#'#,*&
2% 3%
4% 5%
The low confidence tranches are comprised of the low frequency events (most likely FPs)
61-sample CEU from 1000G!
# samples
Center Total # variants
dbSNP% (129)
# knowns
Known ti/tv
# novels
Novel ti/tv
Includes genotype refinement?
1004 Broad 765,365
24.82 190,000
2.36 575,365
2.37 No
1004 BC 733,155
25.34 185,787
2.37 547,368
2.32 No
1004 Sanger 728,374
25.31 184,341
2.36 544,033
2.36 No
1004 UMich 721,250
26.46 190,871
2.33 530,379
2.35 Yes
1004 Oxford 660,024
27.44 181,095
2.38 478,929
2.38 Yes
1004 BCM 605,274
29.98 181,444
2.33 423,830
2.29 Yes
1004 NCBI 601,907
29.26 176,150
2.39 425,757
2.57 No
Broad discovered the most variants at very high quality levels in 1000G chr20 bake-off exercise
• Our data processing pipeline produces really good SNP calls. The same pipeline is used for whole exome and WGS, both deep and low-pass sequencing. Short indel calls too!
• Anything can be used as truth data. Validation assays, several 1000G callsets, or auto-generate your own by subsetting to the highest quality SNPs
• There is no reason to decide between high sensitivity or high specificity. Just use a probabilistic callset.
• The tools are available to all:
http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit
Final Thoughts
Appendix
Step 2a: SNP discovery
Genotype Likelihoo
ds Calculatio
n
• The genotype likelihoods calculation now takes overlapping read pairs (where bases are not independent observations) into account, which we term “fragment-based calling”.
Allele Frequenc
y Calculatio
n
Variant Quality
Recalibration
Beagle
Unified Genotyper
Analysis-ready BAMs
L(G | D) = P(G)P(D |G) = P(b |G)b∈ good _ bases{ }∏
GATK single sample genotype likelihoods
• Priors applied during multi-sample calculation; P(G) = 1 • Likelihood of data computed using pileup of bases and
associated quality scores at given locus • Only “good bases” are included: those satisfying
minimum base quality, mapping read quality, pair mapping quality, NQS
• P(b | G) uses calibrated base quality score • L(G|D) computed for all 10 genotypes
Prior for the genotype"
Likelihood for the genotype"
Likelihood of the data given the genotype"
Bayesian model
Independent base model"
See http://www.broadinstitute.org/gsa/wiki/index.php/Unified_genotyper for more information 22
Step 2b: SNP discovery
Genotype Likelihoo
ds Calculatio
n
• We now use Heng Li’s Exact model to calculate P(AF>0) instead of our previous heuristic grid search model.
Allele Frequenc
y Calculatio
n
Variant Quality
Recalibration
Beagle
Unified Genotyper
Analysis-ready BAMs
We apply a generalization of the single sample SNP caller for multi sample
data
• This approach allows us to combine weak single sample calls to discover variation among several samples with high confidence
Individual 1"
Sample-associated reads"
Individual 2"
Individual N"
Genotype likelihoods"
Joint estimate across samples"
Genotype frequencies"
Allele frequency"
SNPs"
See http://www.broadinstitute.org/gsa/wiki/index.php/Unified_genotyper for more information 24
Running the Unified Genotyper
See http://www.broadinstitute.org/gsa/wiki/index.php/Unified_genotyper for more information 25
java -Xmx2048m –jar GenomeAnalysisTK.jar -R /broad/1KG/reference/human_b37_both.fasta -T UnifiedGenotyper-B:dbsnp,VCF dbsnp_132_b37.vcf-o NA19240.raw.vcf -stand_call_conf 30 --heterozygosity 1.000000e-03 -I NA19240.SLX.bam
Minimum phred-scaled confidence required to emit a
SNP
1 het per 1000 reference bases on average for a Yoruban
BAM file containing NA19240 SLX reads
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA19240 1 36496 . T A 53.13 . <ATTRIBUTES> GT:DP:GQ 1/0:6:84.70
1 45162 rs10399749 C T 331.37 . <ATTRIBUTES> GT:DP:GQ 0/1:27:99.00
1 48677 . G A 399.86 . <ATTRIBUTES> GT:DP:GQ 1/0:25:99.00
Long string of variant annotations (more info in a few slides) Raw VCF calls (NA19240.raw.vcf)
Variants with bad Haplotype Scores often exhibit good Ti/Tv ratios and are included in other centers’ callsets, but are likely
FPs
Bad sites being called by other centers but correctly filtered by the Broad.
These sites are potentially bad in other SNP annotation dimensions.
Higher score means more evidence for error.
top related