MPG Workshop SNP calling 020410 v6 - Broad Institute NGS workshop I: SNP calling Mark DePristo...
MPG NGS workshop I: SNP calling
Mark DePristo
Manager, Medical and Population Genetic Analysis, Genome Sequencing and Analysis Group, Medical and Population Genetics Program
Broad Institute of Harvard and MIT, 02/04/10
Three-slide background on SNP calling in the GATK
SNP calling workflow
Data input and output (file size*):
• Call-ready BAM files (cleaned, dedupped, recalibrated, with well-formatted header): 200 Gb
• Raw variants (VCF) (all sites confidently containing non-reference bases; with genotypes): 1 Gb
• Filtered variants (VCF) (separate true segregating variation from machine artifacts): 1 Gb
Processing tools:
• GATK unified genotyper
• GATK variant analysis
• GATK variant filtration
• Expert user judgement
Ease of use: very easy; tools are easy to use, but parameter selection requires significant expertise and judgement
Runtime*: 10 hrs, instant, 30 min, days**
* Runtime and file sizes are for a single sample 30x whole genome BAM
** Potentially requires many rounds of experimentation and evaluation
GATK single sample genotype likelihoods
$$L(G \mid D) = P(G)\,P(D \mid G) = \prod_{b \in \{\text{good bases}\}} P(b \mid G)$$
• L(G | D): likelihood for the genotype; P(G): prior for the genotype; P(D | G): likelihood of the data given the genotype
• The first equality is the Bayesian model; the second is the independent base model
• Priors applied during multi-sample calculation; P(G) = 1
• Likelihood of data computed using pileup of bases and associated quality scores at given locus
• Only "good bases" are included: those satisfying minimum base quality, mapping read quality, pair mapping quality, NQS
• P(b | G) uses platform-specific confusion matrices
• L(G | D) computed for all 10 genotypes
See http://www.broadinstitute.org/gsa/wiki/index.php/Unified_genotyper for more information
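The independent base model above can be illustrated with a minimal Python sketch. It is not the GATK implementation: instead of platform-specific confusion matrices it assumes a uniform error model (a miscalled base is equally likely to be any of the other three bases), it assumes each read samples either chromosome with probability 0.5, and the function names and the `min_base_qual` cutoff of 10 are hypothetical.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"
# The 10 unordered diploid genotypes: AA, AC, AG, AT, CC, CG, CT, GG, GT, TT
GENOTYPES = ["".join(pair) for pair in combinations_with_replacement(BASES, 2)]


def base_given_allele(base, allele, qual):
    """P(observed base | true allele) under an assumed uniform error model."""
    err = 10 ** (-qual / 10)  # Phred-scaled base quality -> error probability
    return 1 - err if base == allele else err / 3


def genotype_log10_likelihoods(pileup, min_base_qual=10):
    """Return log10 L(G|D) for all 10 genotypes, with P(G) = 1.

    pileup is a list of (base, phred_quality) tuples at one locus.
    Bases below min_base_qual (a hypothetical cutoff) are not "good bases".
    """
    log_l = {g: 0.0 for g in GENOTYPES}
    for base, qual in pileup:
        if qual < min_base_qual:
            continue
        for g in GENOTYPES:
            # Each read is assumed to come from either chromosome with prob. 0.5
            p = 0.5 * base_given_allele(base, g[0], qual) \
                + 0.5 * base_given_allele(base, g[1], qual)
            log_l[g] += math.log10(p)  # independence: product becomes a sum of logs
    return log_l


if __name__ == "__main__":
    pileup = [("A", 30)] * 5 + [("G", 30)] * 5
    likelihoods = genotype_log10_likelihoods(pileup)
    print(max(likelihoods, key=likelihoods.get))
```

Working in log10 space keeps the product over many bases from underflowing at deep coverage.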
We apply a generalization of the single sample SNP caller to Pilot 1
• This approach allows us to combine weak single sample calls to discover variation among samples with high confidence
• Individual 1, Individual 2, ..., Individual N: sample-associated reads → genotype likelihoods
• Joint estimate across samples: genotype frequencies, allele frequency → SNPs
See http://www.broadinstitute.org/gsa/wiki/index.php/Unified_genotyper for more information
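One way to picture a joint estimate across samples is an expectation-maximization loop over per-sample genotype likelihoods. The sketch below is only an illustration and should not be read as the Unified Genotyper's algorithm: it assumes a biallelic site, Hardy-Weinberg genotype priors, and hypothetical function names and defaults.

```python
def hwe_priors(f):
    """Hardy-Weinberg priors for 0, 1, 2 copies of the alt allele at frequency f."""
    return [(1 - f) ** 2, 2 * f * (1 - f), f ** 2]


def estimate_alt_allele_frequency(sample_likelihoods, f_init=0.1, iterations=50):
    """EM estimate of the alt allele frequency.

    sample_likelihoods: one [L(hom-ref), L(het), L(hom-alt)] list per individual
    (plain likelihoods, not logs). f_init and iterations are arbitrary choices.
    """
    f = f_init
    n = len(sample_likelihoods)
    for _ in range(iterations):
        expected_alt = 0.0
        priors = hwe_priors(f)
        for likelihoods in sample_likelihoods:
            # E-step: posterior weight of each genotype for this individual
            weights = [l * p for l, p in zip(likelihoods, priors)]
            total = sum(weights)
            expected_alt += (weights[1] + 2 * weights[2]) / total
        # M-step: expected alt alleles divided by total chromosomes
        f = expected_alt / (2 * n)
    return f


if __name__ == "__main__":
    three_hets = [[1e-5, 1.0, 1e-5]] * 3
    print(round(estimate_alt_allele_frequency(three_hets), 3))
```

Because the frequency estimate pools evidence from every individual, several weakly supported genotypes can together support a variant site that no single sample would call confidently.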
Making raw variant calls with the GATK unified genotyper
Running the Unified Genotyper
java -Xmx2048m -jar GenomeAnalysisTK.jar \
  -R /broad/1KG/reference/human_b36_both.fasta \
  -T UnifiedGenotyper \
  -D dbsnp_129_b36.rod \
  -varout NA19240.raw.vcf \
  -confidence 50 \
  --heterozygosity 1.000000e-03 \
  -I NA19240.SLX.bam
• -confidence 50: minimum phred-scaled confidence required to emit a SNP
• --heterozygosity 1.000000e-03: 1 het per 1000 reference bases on average for a Yoruban
• -I NA19240.SLX.bam: BAM file containing NA19240 SLX reads
Raw VCF calls (NA19240.raw.vcf):
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA19240
1 36496 . T A 53.13 0 <ATTRIBUTES> GT:DP:GQ 1/0:6:84.70
1 45162 rs10399749 C T 331.37 0 <ATTRIBUTES> GT:DP:GQ 0/1:27:99.00
1 48677 . G A 399.86 0 <ATTRIBUTES> GT:DP:GQ 1/0:25:99.00
• <ATTRIBUTES>: long string of variant annotations (more info in a few slides)
See http://www.broadinstitute.org/gsa/wiki/index.php/Unified_genotyper for more information
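To make the column layout of the raw VCF calls concrete, here is a small Python sketch that splits one record into named fields and per-sample values. It assumes whitespace-separated columns in the order shown in the header line; `parse_vcf_record` is a hypothetical helper, not part of the GATK.

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]


def parse_vcf_record(line, sample_names):
    """Split one VCF data line into fixed columns plus per-sample fields."""
    fields = line.split()
    record = dict(zip(VCF_COLUMNS, fields[:9]))
    record["POS"] = int(record["POS"])
    record["QUAL"] = float(record["QUAL"])
    # FORMAT (e.g. GT:DP:GQ) names the colon-separated values in each sample column
    keys = record["FORMAT"].split(":")
    record["samples"] = {
        name: dict(zip(keys, value.split(":")))
        for name, value in zip(sample_names, fields[9:])
    }
    return record


if __name__ == "__main__":
    line = "1 45162 rs10399749 C T 331.37 0 <ATTRIBUTES> GT:DP:GQ 0/1:27:99.00"
    rec = parse_vcf_record(line, ["NA19240"])
    print(rec["ID"], rec["QUAL"], rec["samples"]["NA19240"]["GT"])
```

Per-sample values are left as strings here; a real parser would convert DP and GQ to numbers and handle missing values.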
SNP calling artifacts
• SNP calls are generally infested with false positives
– From systematic machine artifacts, mismapped reads, aligned indels/CNV
– Raw SNP calls might have between 5-20% FPs among novel calls
• Separating true variation from artifacts depends very much on the particulars of one's data and project goals
– Whole genome deep data, WG low-pass, hybrid capture, and pooled PCR have significantly different error modes
Filtering artifacts out of your SNP calls
• The GATK uses a three pass approach
– First emit all sites potentially containing a true variant
– Aggregate SNP covariates in the raw VCF to determine the relationship between each covariate and error [warning: requires user expertise]
– Finally, apply these filters to the raw VCF using the GATK VariantFiltration tool
• We are currently working on a robust, easy-to-use automated tool
Variant annotations and filters
VCF record for an A/G SNP at 22:49582364:
22 49582364 . A G 198.96 0 AB=0.67;AC=3;AF=0.50;AN=6;DP=87;Dels=0.00;HRun=1;MQ=71.31;MQ0=22;QD=2.29;SB=-31.76 GT:DP:GQ 0/1:12:99.00 0/1:11:89.43 0/1:28:37.78
• Heterozygous genotype A/G in all three individuals
INFO field annotations:

| Key | Meaning | Key | Meaning |
|---|---|---|---|
| AC | No. chromosomes carrying alt allele | AB | Allele balance of ref/alt in hets |
| AN | Total no. of chromosomes | HRun | Length of longest contiguous homopolymer |
| AF | Allele frequency | MQ | RMS MAPQ of all reads |
| DP | Depth of coverage | MQ0 | No. of MAPQ 0 reads at locus |
| QD | QUAL score over depth | SB | Estimated SB score |

See http://www.broadinstitute.org/gsa/wiki/index.php/VariantAnnotator for more information
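The INFO annotations of the example record can be checked against their definitions with a short Python sketch. The `parse_info` helper is hypothetical; it simply splits the semicolon-separated `key=value` pairs and converts numeric values to floats.

```python
def parse_info(info):
    """Turn 'AB=0.67;AC=3;...' into a dict, converting numeric values to float."""
    out = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        try:
            out[key] = float(value)
        except ValueError:
            out[key] = value  # keep non-numeric values (or empty flags) as strings
    return out


if __name__ == "__main__":
    qual = 198.96
    info = parse_info(
        "AB=0.67;AC=3;AF=0.50;AN=6;DP=87;Dels=0.00;HRun=1;"
        "MQ=71.31;MQ0=22;QD=2.29;SB=-31.76"
    )
    # AF is the alt allele count over the total number of chromosomes
    print(info["AC"] / info["AN"], info["AF"])
    # QD is the QUAL score divided by depth of coverage
    print(round(qual / info["DP"], 2), info["QD"])
```

For this record the derived values agree with the annotated ones: 3 of 6 chromosomes carry the alt allele (AF = 0.50), and 198.96 / 87 rounds to the annotated QD of 2.29.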
Selecting filtering thresholds
[Figure: four panels, one per annotation (AB, DP, MQ0, SB), plotting the transition/transversion ratio (titv) and dbSNP/100 against the covariate bin value]
Selected filters are: AB > 0.75 || DP > 300 || MQ0 > 40 || SB > -0.10 || 3 SNPs within 10 bp
Note: left-most values are SNPs without displayed annotation
See http://www.broadinstitute.org/gsa/wiki/index.php/VariantFiltrationWalker for more information
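The per-bin Ti/Tv curves in these plots can be approximated with a short Python sketch that groups SNPs by a covariate value and computes the transition/transversion ratio in each bin. This is an illustration of the idea, not the GATK variant analysis code; the function names and fixed-width binning are assumptions, and the plots themselves may use a different binning scheme.

```python
import math
from collections import defaultdict

# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T)
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def titv_ratio(snps):
    """Ti/Tv for a list of (ref, alt) pairs; assumes ref != alt."""
    ti = sum(1 for ref, alt in snps if (ref, alt) in TRANSITIONS)
    tv = len(snps) - ti
    return ti / tv if tv else float("inf")


def titv_by_bin(records, bin_width):
    """Group (covariate_value, ref, alt) records into bins and compute Ti/Tv per bin."""
    bins = defaultdict(list)
    for value, ref, alt in records:
        bin_start = math.floor(value / bin_width) * bin_width
        bins[bin_start].append((ref, alt))
    return {start: titv_ratio(snps) for start, snps in sorted(bins.items())}


if __name__ == "__main__":
    records = [(0.10, "A", "G"), (0.15, "A", "C"), (0.70, "C", "T"), (0.72, "C", "A")]
    print(titv_by_bin(records, 0.5))
```

A bin whose Ti/Tv falls well below the genome-wide expectation is a candidate region for a filter threshold on that covariate.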
Running VariantFiltration
java -Xmx2048m -jar GenomeAnalysisTK.jar \
  -R /broad/1KG/reference/human_b36_both.fasta \
  -T VariantFiltration \
  -B variant,VCF,NA19240.raw.vcf \
  -D dbsnp_129_b36.rod \
  --clusterWindowSize 10 \
  --filterExpression "AB > 0.75 || DP > 300 || MQ0 > 40 || SB > -0.10" \
  -l INFO \
  -o NA19240.filtered.vcf
• --filterExpression: expression describing SNPs that should be filtered out
• --clusterWindowSize 10: filters out any group of 3 SNPs within 10 bp of each other
Filtered VCF calls (NA19240.filtered.vcf):
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA19240
1 36496 . T A 53.13 GATK_FILTER <ATTRIBUTES> GT:DP:GQ 1/0:6:84.70
1 45162 rs10399749 C T 331.37 0 <ATTRIBUTES> GT:DP:GQ 0/1:27:99.00
1 48677 . G A 399.86 0 <ATTRIBUTES> GT:DP:GQ 1/0:25:99.00
• SNPs with poor characteristics have their FILTER field filled in
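The filter expression and the cluster rule can be mimicked in a few lines of Python to see which sites would be flagged. This sketch does not reproduce VariantFiltration itself: the function names are hypothetical, a missing annotation is assumed not to trigger a filter, and the treatment of the exact window boundary (here, first and last SNP at most 10 bp apart) is an assumption.

```python
def fails_hard_filters(info):
    """True if a site matches AB > 0.75 || DP > 300 || MQ0 > 40 || SB > -0.10.

    info maps annotation names to numbers; absent annotations never trigger
    a filter in this sketch (an assumption).
    """
    return (
        info.get("AB", 0.0) > 0.75
        or info.get("DP", 0) > 300
        or info.get("MQ0", 0) > 40
        or info.get("SB", float("-inf")) > -0.10
    )


def clustered_positions(positions, cluster_size=3, window=10):
    """Return positions (on one chromosome) that belong to a group of
    cluster_size SNPs whose first and last members are at most window bp apart."""
    pos = sorted(positions)
    flagged = set()
    for i in range(len(pos) - cluster_size + 1):
        if pos[i + cluster_size - 1] - pos[i] <= window:
            flagged.update(pos[i:i + cluster_size])
    return flagged


if __name__ == "__main__":
    site = {"AB": 0.67, "DP": 87, "MQ0": 22, "SB": -31.76}
    print(fails_hard_filters(site))
    print(sorted(clustered_positions([100, 104, 109, 500])))
```

With these thresholds the A/G SNP at 22:49582364 shown earlier passes every expression-based filter.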
Raw and filtered autosomal calls for YRI daughter and trio

| Call set | Callable bases (1) | # variants | dbSNP % | Ti/Tv known (est. FP rate (2)) | Ti/Tv novel (est. FP rate (2)) | Hapmap3 sensitivity (3) | Hapmap3 concordance (3) |
|---|---|---|---|---|---|---|---|
| Single individual calls from the GATK | | | | | | | |
| Raw NA19240 | 2.70B (89%) | 4.52M | 77.83 | 2.07 (1.9%) | 1.81 (18.1%) | 99.41 | 99.85 |
| Filtered NA19240 | | 4.26M | 80.42 | 2.10 (~0.0%) | 2.01 (5.6%) | 99.14 | 99.85 |
| Daughter + parents multi-sample calls from the GATK | | | | | | | |
| Raw YRI trio together | 2.5B (81%) | 6.24M | 71.65 | 2.07 (1.9%) | 1.80 (18.8%) | 99.62 | 99.85 |
| Filtered YRI trio together | | 5.60M | 74.86 | 2.11 (~0.0%) | 2.02 (5.0%) | 99.29 | 99.85 |

1. % of all 3.1B bases of the B36 human genome called with at least Q50 confidence
2. Calculated as 1 - (titv_Observed - 0.5) / (titv_Expected - 0.5) with titv_Expected of 2.1
3. NA19240 sensitivity and concordance results
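The estimated FP rates in the table follow directly from the formula in footnote 2, which a short Python sketch can reproduce. The function name is hypothetical, and clamping negative results to zero is an assumption made to mirror the "~0.0%" entries.

```python
def estimated_fp_rate(titv_observed, titv_expected=2.1, titv_random=0.5):
    """Estimated false positive fraction from an observed Ti/Tv ratio.

    Implements 1 - (titv_Observed - 0.5) / (titv_Expected - 0.5).
    The 0.5 term is the Ti/Tv expected for random errors, since there are
    twice as many possible transversions as transitions.
    """
    rate = 1 - (titv_observed - titv_random) / (titv_expected - titv_random)
    return max(rate, 0.0)  # clamp: observed Ti/Tv above expected gives ~0% (assumption)


if __name__ == "__main__":
    for titv in (2.07, 1.81, 2.01, 1.80, 2.02, 2.11):
        print(titv, f"{100 * estimated_fp_rate(titv):.1f}%")
```

For example, the raw novel Ti/Tv of 1.81 gives 1 - 1.31/1.6 ≈ 18.1%, and the filtered novel Ti/Tv of 2.01 gives 1 - 1.51/1.6 ≈ 5.6%, matching the table.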
Example novel variant
Chr1:67634785 in 3' untranslated region
Example scripts
• 1000 Genomes SLX YRI BAM files:
– Locally available at: /humgen/gsa-hpprojects/1kg/1kg_pilot2/useTheseBamsForAnalyses/<sample>.SLX.bam
– Available for download at 1000genomes.org
• Scripts and VCF files:
– /humgen/gsa-scr1/pub/tutorials/MPG_workshop
Appendix
Choosing a minimum confidence score for a SNP
[Figure: three panels plotted against SNP confidence score (0 to 500, "SNPs with confidence score within interval"), with the default threshold marked: dbSNP rate (% SNPs in dbSNP 129), Ti/Tv rate (Ti/Tv ratio), and true positives on the 1KG custom Illumina chip (cumulative)]
• Each point on the plot includes ~3000 SNPs from NA19240
• The density of points across the confidence interval indicates the number of SNPs
• ~0.5% of SNPs have Q < 100, and only 2% have Q < 200
• The default Q50 threshold results in a highly sensitive call set
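For readers less used to Phred scaling, the confidence thresholds discussed here can be converted to probabilities with a short Python sketch. It assumes the standard Phred definition, Q = -10 log10(p); reading p as the probability that a call is wrong is the usual interpretation rather than something stated on the slide, and the function names are hypothetical.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred-scaled score Q to a probability: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert a probability p to a Phred-scaled score: Q = -10 * log10(p)."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    for q in (50, 100, 200):
        print(f"Q{q}: p = {phred_to_error_prob(q):.0e}")
```

Under this reading, the default Q50 threshold corresponds to a nominal error probability of 1 in 100,000 under the model.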