![Page 1: 1000G Pilot 3 Progress (in silico analysis and comparison to experimental validation) Amit Indap, Wen-Fung Leong Gabor Marth (Boston College) Chris Hartl](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d2e5503460f94a04e1c/html5/thumbnails/1.jpg)
1000G Pilot 3 Progress
(in silico analysis and comparison to experimental validation)
Amit Indap, Wen-Fung Leong, Gabor Marth (Boston College); Chris Hartl and Kiran Garimella (Broad Institute)
1000 Genomes Project Analysis Group
February 2, 2010
Acknowledgements
Baylor: Matthew Bainbridge, Fuli Yu, Donna Muzny, Richard Gibbs
Broad: Chris Hartl, Kiran Garimella, Carrie Sougnez, Mark DePristo
Wash. U.: Dan Koboldt, Bob Fulton
Sanger: Aarno Palotie
Boston College: Amit Indap, Wen Fung Leong, Gabor Marth
Cornell: Andy Clark
Stanford: Simon Gravel, Carlos Bustamante
Michigan: Tom Blackwell
Data
| | CEU | TSI | CHB | CHD | JPT | LWK | YRI |
|---|---|---|---|---|---|---|---|
| Number of samples | 90 | 66 | 109 | 107 | 105 | 108 | 112 |
| Sequencing technology | SLX+454 | SLX | SLX+454 | SLX+454 | SLX+454 | 454 | SLX+454 |
| Per-sample coverage¹ | 67X | 64X | 42X | 59X | 49X | 30X | 56X |

• Capture technologies:
  – Nimblegen solid phase
  – Agilent liquid phase
• Sequencing technologies:
  – SLX
  – 454
• Data producers:
  – BCM
  – BI
  – WTSI
  – WUGSC
• Capture targets:
  – Started with ~1,000 genes / ~10,000 exons / 2.3Mb
  – 1.43Mb of total target length shared between 4 data centers used for this analysis
• Samples:
  – 697 total samples
  – 7 populations
• Sequence coverage:
  – Goal was deep per-sample coverage
  – Effective coverage somewhat reduced by fragment duplications

1. Mean of coverage medians per sample and population
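The coverage footnote can be read as "take each sample's median depth, then average those medians within a population". The sketch below illustrates that reading only; the function name and the toy depths are hypothetical and not taken from the Pilot 3 data.

```python
from statistics import mean, median


def mean_of_sample_medians(depths_by_sample):
    """Average, over samples, of each sample's median per-target depth."""
    return mean(median(depths) for depths in depths_by_sample.values())


# Hypothetical per-target depths for three samples in one population.
toy_population = {
    "sample_A": [60, 70, 80],
    "sample_B": [40, 50, 90],
    "sample_C": [55, 65, 75],
}

# Sample medians are 70, 50 and 65; their mean is the population figure.
print(f"{mean_of_sample_medians(toy_population):.1f}X")
```

Using the median within a sample keeps a few very deep or very shallow targets from dominating that sample's figure.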
Pipelines

| Processing step | BC software | BI software |
|---|---|---|
| Read mapping | MOSAIK | MAQ (SLX), SSAHA2 (454) |
| Duplicate filtering | Picard MarkDuplicates (SLX), BCMMarkduplicates (454) | Picard MarkDuplicates (SLX), Picard MarkDuplicates (454) |
| Base quality recalibration | GATK (SLX), None (454) | GATK (SLX), GATK (454) |
| SNP calling | GigaBayes (BamBayes) | UnifiedGenotyper |

[Diagram: SNP calling on all 697 samples (union of all called sites in all 697 samples); SNP statistics on the segregating sites in each population sample (CEU, TSI, CHB, CHD, JPT, LWK, YRI) and on all 697 samples.]
BC and BI call sets are converging

All called sites: number of sites (% of union)

| Comparison # | BC call version | BC total calls | BC unique calls | BC & BI (intersection) | BC \|\| BI (union) | BI unique calls | BI total calls | BI call version |
|---|---|---|---|---|---|---|---|---|
| 1 | 2009/11/20 | 11,580 (55.96%) | 733 (3.54%) | 10,847 (52.41%) | 20,695 (100%) | 9,115 (44.04%) | 19,962 (96.46%) | v2 |
| 2 | 2009/11/20 | 11,580 (65.75%) | 1,480 (8.40%) | 10,100 (57.34%) | 17,613 (100%) | 6,033 (34.25%) | 16,133 (91.60%) | v3 |
| 3 | 2010/01/20 | 14,502 (79.35%) | 2,144 (11.73%) | 12,358 (67.62%) | 18,277 (100%) | 3,775 (20.65%) | 16,133 (88.27%) | v3 |
| 4 | 2010/01/20 | 14,502 (72.91%) | 1,741 (8.75%) | 12,761 (64.16%) | 19,890 (100%) | 5,388 (27.09%) | 18,149 (91.25%) | v4 |

Called sites per population (BC/BI intersection): intersection (% of union)

| Comparison # | CEU | TSI | CHB | CHD | JPT | LWK | YRI |
|---|---|---|---|---|---|---|---|
| 1 | 3,354 (73.87%) | 3,168 (65.88%) | 3,279 (66.23%) | 3,226 (68.42%) | 2,942 (47.79%) | 4,922 (70.56%) | 4,917 (72.08%) |
| 2 | 3,036 (70.62%) | 2,893 (69.34%) | 2,938 (62.23%) | 2,783 (60.58%) | 2,545 (55.64%) | 4,486 (65.33%) | 4,253 (66.30%) |
| 3 | 3,333 (74.63%) | 3,155 (73.15%) | 3,294 (66.80%) | 3,201 (66.69%) | 2,795 (58.40%) | 5,165 (73.18%) | 4,728 (71.29%) |
| 4 | 3,489 (78.78%) | 3,281 (69.32%) | 3,415 (69.74%) | 3,431 (72.81%) | 2,900 (50.86%) | 5,459 (78.55%) | 5,175 (78.59%) |
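As a minimal sketch of how the "% of union" figures relate to the raw counts, the snippet below rebuilds the totals and percentages for comparison 4 from the three disjoint regions of the BC/BI Venn diagram. The function name `overlap_stats` and its argument names are illustrative, not part of either pipeline.

```python
def overlap_stats(a_unique, shared, b_unique):
    """Summarize two call sets from the three disjoint regions of their Venn diagram."""
    union = a_unique + shared + b_unique
    counts = {
        "a_total": a_unique + shared,
        "a_unique": a_unique,
        "intersection": shared,
        "union": union,
        "b_unique": b_unique,
        "b_total": shared + b_unique,
    }
    # Express every count as a percentage of the union.
    return {name: (n, round(100.0 * n / union, 2)) for name, n in counts.items()}


# Comparison 4: BC-only, shared, and BI-only site counts.
stats = overlap_stats(a_unique=1741, shared=12761, b_unique=5388)
for name, (n, pct) in stats.items():
    print(f"{name:>12}: {n:>6,} ({pct:.2f}%)")
```

Because the three regions partition the union, the unique and intersection percentages must sum to 100% (up to rounding), which is a quick consistency check on any row of the table.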
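SNP calls (per population)

| | Caller | CEU | TSI | CHB | CHD | JPT | LWK | YRI |
|---|---|---|---|---|---|---|---|---|
| samples | BC | 90 | 66 | 109 | 107 | 105 | 108 | 112 |
| | BI | 90 | 66 | 109 | 107 | 105 | 108 | 112 |
| called SNPs | BC | 4,102 | 3,729 | 4,340 | 4,262 | 3,883 | 6,039 | 5,891 |
| | BI | 3,816 | 4,285 | 3,972 | 3,881 | 4,719 | 6,370 | 5,869 |
| dbSNPs | BC | 2,422 | 2,257 | 2,042 | 1,924 | 1,950 | 2,872 | 2,897 |
| | BI | 2,352 | 2,200 | 1,827 | 1,753 | 1,710 | 2,825 | 2,856 |
| % dbSNP | BC | 59.04 | 60.53 | 47.05 | 45.14 | 50.22 | 47.56 | 49.18 |
| | BI | 61.64 | 51.34 | 46.00 | 45.17 | 36.24 | 44.35 | 48.66 |
| Ts/Tv (called SNPs) | BC | 2.73 | 2.78 | 2.82 | 3.06 | 2.85 | 3.45 | 2.92 |
| | BI | 3.14 | 2.38 | 3.15 | 3.16 | 1.83 | 3.17 | 3.15 |
| novel SNPs | BC | 1,680 | 1,472 | 2,298 | 2,338 | 1,933 | 3,167 | 2,994 |
| | BI | 1,464 | 2,085 | 2,145 | 2,128 | 3,009 | 3,545 | 3,013 |
| Ts/Tv (novel SNPs) | BC | 2.05 | 2.10 | 2.44 | 2.81 | 2.43 | 3.44 | 2.56 |
| | BI | 2.92 | 1.72 | 3.03 | 3.05 | 1.36 | 3.07 | 2.99 |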
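The Ts/Tv rows above are transition/transversion ratios. For readers unfamiliar with the statistic, here is a small sketch of how it can be computed from a list of (reference, alternate) alleles; the function names and the toy SNP list are hypothetical, and this is not the code used by either pipeline.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or not {ref, alt} <= (PURINES | PYRIMIDINES):
        raise ValueError(f"not a valid SNP: {ref}>{alt}")
    return {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES


def ts_tv_ratio(snps):
    """Number of transitions divided by number of transversions."""
    ts = sum(1 for ref, alt in snps if is_transition(ref, alt))
    tv = len(snps) - ts
    return ts / tv if tv else float("inf")


# Toy call set: three transitions (A>G, C>T, G>A) and one transversion (A>C).
toy_snps = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C")]
print(ts_tv_ratio(toy_snps))
```

A call set dominated by random errors drifts toward a lower ratio, since there are twice as many possible transversions as transitions, which is why the table reports Ts/Tv separately for novel SNPs.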
SNP calls (all samples)
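| | BC | BI |
|---|---|---|
| Samples | 697 | 697 |
| Called SNPs | 14,502 | 18,149 |
| dbSNPs | 3,948 | 4,041 |
| dbSNP fraction | 27.22% | 22.27% |

Venn diagram (BC: 14,502 SNPs; BI: 18,149 SNPs; BC ∪ BI = 19,890):

| Region | SNPs | dbSNPs | dbSNP fraction |
|---|---|---|---|
| BC only | 1,741 | 79 | 4.54% |
| BC∩BI | 12,761 | 3,869 | 30.32% |
| BI only | 5,388 | 172 | 3.19% |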
Genotype call accuracy relative to HapMap3
| | Caller | CEU | TSI | CHB | CHD | JPT | LWK | YRI |
|---|---|---|---|---|---|---|---|---|
| FDR of variant genotypes in HapMap3 (%) | BC | 0.96 | 0.23 | 2.61 | 1.42 | 3.60 | 0.47 | 0.57 |
| | BI | 1.41 | 0.45 | 2.99 | 1.82 | 3.56 | 0.66 | 1.25 |
| Correct calls (%) | BC | 98.39 | 98.98 | 96.76 | 98.20 | 95.72 | 99.06 | 98.63 |
| | BI | 97.22 | 98.26 | 95.45 | 97.35 | 94.55 | 98.74 | 96.68 |
| Accuracy of homozygote reference calls (%) | BC | 99.20 | 99.81 | 97.52 | 98.62 | 96.42 | 99.64 | 99.59 |
| | BI | 98.79 | 99.62 | 97.07 | 98.21 | 96.33 | 99.48 | 99.07 |
| Accuracy of heterozygote calls (%) | BC | 97.50 | 97.72 | 97.98 | 99.12 | 96.81 | 98.37 | 96.53 |
| | BI | 94.49 | 95.43 | 94.19 | 97.37 | 92.30 | 97.89 | 90.81 |
| Accuracy of homozygote non-reference calls (%) | BC | 97.31 | 98.44 | 93.27 | 95.78 | 92.69 | 98.21 | 98.45 |
| | BI | 96.77 | 98.76 | 93.26 | 95.16 | 93.43 | 97.67 | 97.46 |
Data quality in CHB and JPT samples seems consistently lower
Statistics only include genotype calls at SNP sites in BC∩BI
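The slide does not spell out how each accuracy figure was defined, so the following is only a sketch under stated assumptions: genotypes are encoded as alternate-allele counts (0, 1, 2), only (sample, site) pairs present in both the call set and the reference panel are compared, and per-class accuracy is measured relative to the genotype that was called. All names and the toy data are hypothetical.

```python
from collections import Counter


def genotype_accuracy(calls, truth):
    """Overall and per-called-class agreement between two genotype maps.

    calls, truth: dicts keyed by (sample, site) -> alt-allele count 0, 1 or 2.
    """
    shared = calls.keys() & truth.keys()
    total, correct = Counter(), Counter()
    for key in shared:
        total[calls[key]] += 1
        if calls[key] == truth[key]:
            correct[calls[key]] += 1
    per_class = {g: correct[g] / total[g] for g in total}
    overall = sum(correct.values()) / len(shared) if shared else float("nan")
    return overall, per_class


# Toy data: one heterozygote call disagrees with the reference panel;
# ("s3", 1) is absent from the panel and is therefore not scored.
calls = {("s1", 1): 0, ("s1", 2): 1, ("s2", 1): 2, ("s2", 2): 1, ("s3", 1): 0}
truth = {("s1", 1): 0, ("s1", 2): 1, ("s2", 1): 2, ("s2", 2): 0}
overall, per_class = genotype_accuracy(calls, truth)
print(overall, per_class)
```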
Genotype calls
[Figure: two scatter plots of BC vs. BI allele frequencies.]
• All SNP sites considered: # SNP sites = 3,489, r = 0.9921
• Only SNP sites with >= 80% called genotypes: # SNP sites = 3,075, r = 0.9979

• Filtering:
  – BC filters on genotype call quality
  – BI reports a genotype for any site covered by at least one read (the Broad caller does not filter on genotype quality)
• Nominally, BI makes more calls than BC and has, on average, higher AF
• Good allele frequency concordance between BC and BI
• At genotype calls that pass the BC filter and where BI also makes a call, no discordance was found
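The allele frequency comparison above can be sketched as follows: compute each caller's allele frequency per site from its called genotypes, optionally keep only sites with a minimum fraction of called genotypes, and report the Pearson correlation r. This is an illustration only; requiring the call-rate threshold in both callers is an assumption, and all names and toy genotypes are hypothetical.

```python
from math import sqrt


def allele_frequency(genotypes):
    """Return (alt allele frequency, call rate); None marks a missing genotype."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return None, 0.0
    return sum(called) / (2 * len(called)), len(called) / len(genotypes)


def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def af_concordance(calls_a, calls_b, min_call_rate=0.0):
    """Number of shared sites passing the call-rate filter, and r between AFs."""
    xs, ys = [], []
    for site in calls_a.keys() & calls_b.keys():
        af_a, rate_a = allele_frequency(calls_a[site])
        af_b, rate_b = allele_frequency(calls_b[site])
        if af_a is None or af_b is None:
            continue
        if min(rate_a, rate_b) >= min_call_rate:
            xs.append(af_a)
            ys.append(af_b)
    return len(xs), pearson_r(xs, ys)


# Toy genotypes (alt-allele counts per sample); caller A misses one genotype.
calls_a = {"site1": [0, 1, 2, None], "site2": [0, 0, 1, 0], "site3": [2, 2, 1, 2]}
calls_b = {"site1": [0, 1, 2, 1], "site2": [0, 0, 1, 0], "site3": [2, 2, 1, 2]}
print(af_concordance(calls_a, calls_b))
print(af_concordance(calls_a, calls_b, min_call_rate=0.8))
```

With the 80% threshold, `site1` is dropped because caller A has only 3 of 4 genotypes called there, mirroring how the filtered panel contains fewer sites than the unfiltered one.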
1KG validation executive summary
• Evaluated BI and BC calls against validation
  – 1KG chip¹
    • 312/697 samples across 7 populations represented
    • ~300 sites (150 novel) overlap with Pilot 3 target region
• Concordance with 1KG chip is very high
  – Where covered (> 5 reads):
    • 302/312 (97%) of samples have >90% variant sensitivity
    • 269/312 (86%) of samples have >90% genotype sensitivity
  – Remaining disparities between 1KG chip and Pilot 3 calls can be explained by data quality issues
    • Later sequencing has far greater concordance with chip than earlier sequencing

1. Details in Appendix
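The per-sample variant PPV and sensitivity figures used in the following slides can be illustrated with the usual definitions, PPV = TP / (TP + FP) and sensitivity = TP / (TP + FN), where the chip is treated as truth. The sketch below also skips sites under a minimum read depth, as the later slides do; the function name, the tuple layout, and the toy sites are assumptions for illustration.

```python
def variant_ppv_sensitivity(sites, min_depth=0):
    """Per-sample variant PPV and sensitivity against a validation chip.

    sites: iterable of (called_variant, chip_variant, read_depth) tuples.
    Sites with read_depth < min_depth are ignored and counted separately.
    """
    tp = fp = fn = ignored = 0
    for called, on_chip, depth in sites:
        if depth < min_depth:
            ignored += 1
        elif called and on_chip:
            tp += 1
        elif called:
            fp += 1
        elif on_chip:
            fn += 1
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    return ppv, sensitivity, ignored


# Toy sample: the only errors fall on sites covered by fewer than 4 reads.
toy_sites = [
    (True, True, 30),
    (True, True, 12),
    (False, True, 2),   # missed variant at a low-coverage site
    (True, False, 3),   # false positive at a low-coverage site
    (False, False, 25),
    (True, True, 8),
]
print(variant_ppv_sensitivity(toy_sites))
print(variant_ppv_sensitivity(toy_sites, min_depth=4))
```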
Variant PPV/Sensitivity to 1KG chip is reasonably high for most samples; discordant samples are poorly sequenced
[Figure: % Variant PPV and % Variant Sensitivity (y-axis, 0-100%) per sample (x-axis: 318 Pilot 3 samples overlapping with 1KG chip).]
Spikes in Variant PPV are due to low-quality sequencing in JPT samples (see Appendix)
After filtering out sites with < 4 reads, nearly all samples in call-set overlap have high sensitivity and specificity
[Figure: % Variant PPV, % Variant Sensitivity, and % Ignored Low-Coverage Bases (y-axis, 0-100%) per sample (x-axis: 312 Pilot 3 samples after eliminating those with low coverage).]
These 10 low-sensitivity samples have strange allele balances and are likely contaminated
All but one sample with low PPV (false-positive rate > 10%) are among the earliest-sequenced samples (JPT/CHB/CHD)
Concordance to chip tracks closely with submission-to-DCC date (proxy for sequencing date)
[Figure: % Variant PPV, % Variant Sensitivity, and % Ignored Low-Coverage Bases (y-axis, 0-100%) per sample (x-axis: 312 Pilot 3 samples sorted by earliest DCC submission date).]
Submitted 12/08-7/09: median number of lanes: 2
Submitted 8/08-10/08: median number of lanes: 3
The most recently sequenced samples have higher concordance to 1KG chip.
Increase in number of sites with < 4 reads corresponds with fewer lanes being run per sample.
Mean sensitivity/PPV per population is good, and improves on more recently-sequenced populations
[Figure: bar chart of Mean % PPV and Mean % Sensitivity (y-axis, 0-100%) per population.]

| | CEU | CHB | CHD | JPT | LWK | TSI | YRI |
|---|---|---|---|---|---|---|---|
| N samples | 69 | 13 | 27 | 102 | 69 | 3 | 24 |

Bar annotations (date, platform, centers):
• 8/2008, ILMN/454, All Ctrs
• 8/2008, ILMN/454, All Ctrs
• 8/2008, ILMN/454, BI/BCM
• 1/2009, 454, BCM
• 8/2008, ILMN/454, BI/BCM
• 10/2008, ILMN, BI/SC
• 2008/2009, ILMN/454, All Ctrs
Low-frequency / singleton validation: executive summary
• Low-frequency Sequenom assay¹
  – Chose 105 putative novel singletons from early Pilot 3 46-CEU-sample callsets (called in at least 2/4 callers)
  – Validated sites in those 46 individuals
    • 89/105 are true singletons
    • 16/105 are false-positive singletons (hom-refs and two non-singletons)
• Concordance with low-frequency assay is very high
  – Callsets today (January 2010)
    • In BI and BC overlap, recovered 71/89 (80%) of assayed singletons with 0 false-positives and 0 non-singletons
    • In BI and BC union, recovered all 89 singletons with 3 false-positives and 0 non-singletons

1. Details in Appendix
Callers are able to detect most singletons with very low false-positive rate

Assay Performance

| Assay | Loci (after filtering¹) | TP (PPV) | FP | True, but not Singleton |
|---|---|---|---|---|
| Whole Assay | 105 | 89 (85%) | 16 | 2 |

Callset Performance

| Call Set | Loci Tested | Overlap with Test Set | TP (PPV) | FP | True, but not Singleton |
|---|---|---|---|---|---|
| BC ∩ BI | 105 | 71 | 71 (100%) | 0 | 0 |
| BC ∪ BI | 105 | 92 | 89 (97%) | 3 | 0 |

1. HWE violations, no-call rate > 5%

Callset union finds every singleton in the assay with few false-positives.
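The callset figures above follow from simple ratios: PPV is the number of true singletons recovered divided by the number of call-set sites overlapping the assay, and recovery is the number recovered divided by the 89 validated singletons. A small sketch using the counts from the slide (the function and variable names are illustrative):

```python
VALIDATED_SINGLETONS = 89  # true singletons in the Sequenom assay


def callset_performance(overlap_with_assay, true_positives):
    """Return (PPV, fraction of validated singletons recovered)."""
    ppv = true_positives / overlap_with_assay
    recovered = true_positives / VALIDATED_SINGLETONS
    return ppv, recovered


# Counts taken from the Callset Performance table.
intersection = callset_performance(overlap_with_assay=71, true_positives=71)
union = callset_performance(overlap_with_assay=92, true_positives=89)

for label, (ppv, recovered) in [("BC ∩ BI", intersection), ("BC ∪ BI", union)]:
    print(f"{label}: PPV {ppv:.0%}, recovered {recovered:.0%}")
```

The trade-off is visible directly: the intersection gives up about a fifth of the validated singletons in exchange for no false positives, while the union recovers all of them at the cost of three.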
Many sites shared between P3 and external projects; low overall FP rate
Calls (90 CEU samples)
Loci in P1/P2 = 60%
Loci in other projects/databases = 71%¹
FP Rate (sites on validation chips) = 5.3%
FN Rate (sites on validation chips) < 5%²

Calls (overall)
FP Rate (sites on validation chips) = 9.1%³
FN Rate (sites on validation chips) < 5%³

1. Sites seen across all 90 Pilot 3 CEU individuals, occurring in dbSNP 129, HapMap 3, Pilot 1, or Pilot 2
2. No per-locus FNs observed in overlapping set
3. Includes FP and FN errors due to sample contamination/data quality
FP rate is likely a slight overestimate because a hom-ref site across the 69 CEU samples on the chip doesn’t preclude the possibility of a variant harbored in one of the other 21 samples not represented in the validation assay.
Some of these FPs are also due to sample contamination in older lanes.
Conclusions / future directions
• Data quality has improved significantly over the life of the project
• Both BC and BI pipelines produce high-quality call sets
  – Good agreement between call sets
  – Intersection highly concordant with experimental validation data
  – Estimated FP rate below ~9%
• The current Pilot 3 release is the BC∩BI (intersection) call set
• We are proceeding with validations
  – Dual focus: accuracy and functional classes
  – Results will inform future releases
APPENDIX
Population spectrum of called SNPs
Population-spectrum of called SNPs
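| | Caller | CEU | TSI | CHB | CHD | JPT | LWK | YRI | ALL |
|---|---|---|---|---|---|---|---|---|---|
| called SNPs | BC | 4,102 | 3,729 | 4,340 | 4,262 | 3,883 | 6,039 | 5,891 | 14,502 |
| | BI | 3,816 | 4,285 | 3,972 | 3,881 | 4,719 | 6,370 | 5,869 | 18,149 |

• Observation: BC calls more SNPs than BI in four of the seven populations (CEU, CHB, CHD, YRI), but fewer SNP sites overall (14,502 vs. 18,149)
• Reason: BC tends to call the same site in more populations…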
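One way to see the "same site in more populations" effect from the table alone is to divide the sum of the per-population counts by the number of distinct sites. If every site in the ALL column is called in at least one population (an assumption; the slides do not say whether ALL is a union of the per-population sets), that ratio is the average number of populations in which a site is called. A sketch with the counts from the table:

```python
per_population = {
    "BC": [4102, 3729, 4340, 4262, 3883, 6039, 5891],
    "BI": [3816, 4285, 3972, 3881, 4719, 6370, 5869],
}
all_samples = {"BC": 14502, "BI": 18149}


def mean_populations_per_site(pop_counts, total_sites):
    """Average number of populations in which a called site appears."""
    return sum(pop_counts) / total_sites


sharing = {
    caller: mean_populations_per_site(per_population[caller], all_samples[caller])
    for caller in per_population
}
for caller, value in sharing.items():
    print(f"{caller}: {value:.2f} populations per site")
```

Under that assumption a BC site is seen in roughly 2.2 populations on average versus roughly 1.8 for BI, which is consistent with BC having similar per-population counts but fewer sites overall.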
BC/BI SNP calls per population (more detail)
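SNP calls (per population)

| | Caller | CEU | TSI | CHB | CHD | JPT | LWK | YRI |
|---|---|---|---|---|---|---|---|---|
| samples | BC | 90 | 66 | 109 | 107 | 105 | 108 | 112 |
| | BI | 90 | 66 | 109 | 107 | 105 | 108 | 112 |
| called SNPs | BC | 4,102 | 3,729 | 4,340 | 4,262 | 3,883 | 6,039 | 5,891 |
| | BI | 3,816 | 4,285 | 3,972 | 3,881 | 4,719 | 6,370 | 5,869 |
| dbSNPs | BC | 2,422 | 2,257 | 2,042 | 1,924 | 1,950 | 2,872 | 2,897 |
| | BI | 2,352 | 2,200 | 1,827 | 1,753 | 1,710 | 2,825 | 2,856 |
| % dbSNP | BC | 59.04 | 60.53 | 47.05 | 45.14 | 50.22 | 47.56 | 49.18 |
| | BI | 61.64 | 51.34 | 46.00 | 45.17 | 36.24 | 44.35 | 48.66 |
| Ts/Tv (called SNPs) | BC | 2.73 | 2.78 | 2.82 | 3.06 | 2.85 | 3.45 | 2.92 |
| | BI | 3.14 | 2.38 | 3.15 | 3.16 | 1.83 | 3.17 | 3.15 |
| novel SNPs | BC | 1,680 | 1,472 | 2,298 | 2,338 | 1,933 | 3,167 | 2,994 |
| | BI | 1,464 | 2,085 | 2,145 | 2,128 | 3,009 | 3,545 | 3,013 |
| Ts/Tv (novel SNPs) | BC | 2.05 | 2.10 | 2.44 | 2.81 | 2.43 | 3.44 | 2.56 |
| | BI | 2.92 | 1.72 | 3.03 | 3.05 | 1.36 | 3.07 | 2.99 |
| singletons | BC | 1,378 | 1,264 | 1,654 | 1,686 | 1,284 | 1,430 | 1,457 |
| | BI | 1,240 | 1,911 | 1,555 | 1,500 | 2,347 | 1,692 | 1,489 |
| Ts/Tv (singletons) | BC | 2.72 | 3.36 | 3.33 | 3.39 | 3.09 | 4.68 | 3.04 |
| | BI | 2.84 | 1.72 | 2.81 | 3.03 | 1.11 | 3.26 | 2.73 |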
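Broad & BC calls: CEU

Population: CEU (90 samples)

| | BC | Broad |
|---|---|---|
| # SNPs called (Ts/Tv) | 4,102 (2.73) | 3,816 (3.14) |
| # dbSNP (Ts/Tv) | 2,422 (3.40) | 2,352 (3.28) |
| # novel SNPs (Ts/Tv) | 1,680 (2.05) | 1,464 (2.92) |
| # Singleton (Ts/Tv) | 1,378 (2.72) | 1,240 (2.84) |

Venn diagram of BC and Broad calls:

| Region | SNP | # dbSNP (%) | Ts/Tv |
|---|---|---|---|
| BC only | 613 | 122 (19.90%) | 0.92 |
| BC ∩ Broad | 3,489 | 2,300 (65.92%) | 3.47 |
| Broad only | 327 | 52 (15.90%) | 1.32 |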
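Broad & BC calls: CHB

Population: CHB (109 samples)

| | BC | Broad |
|---|---|---|
| # SNPs called (Ts/Tv) | 4,340 (2.82) | 3,972 (3.15) |
| # dbSNP (Ts/Tv) | 2,042 (3.37) | 1,827 (3.30) |
| # novel SNPs (Ts/Tv) | 2,298 (2.44) | 2,145 (3.03) |
| # Singleton (Ts/Tv) | 1,654 (3.33) | 1,555 (2.81) |

Venn diagram of BC and Broad calls:

| Region | SNP | # dbSNP (%) | Ts/Tv |
|---|---|---|---|
| BC only | 925 | 247 (26.70%) | 1.23 |
| BC ∩ Broad | 3,415 | 1,795 (52.56%) | 3.74 |
| Broad only | 557 | 32 (5.75%) | 1.37 |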
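Broad & BC calls: CHD

Population: CHD (107 samples)

| | BC | Broad |
|---|---|---|
| # SNPs called (Ts/Tv) | 4,262 (3.06) | 3,881 (3.16) |
| # dbSNP (Ts/Tv) | 1,924 (3.40) | 1,753 (3.30) |
| # novel SNPs (Ts/Tv) | 2,338 (2.81) | 2,128 (3.05) |
| # Singleton (Ts/Tv) | 1,686 (3.39) | 1,500 (3.03) |

Venn diagram of BC and Broad calls:

| Region | SNP | # dbSNP (%) | Ts/Tv |
|---|---|---|---|
| BC only | 831 | 200 (24.07%) | 1.68 |
| BC ∩ Broad | 3,431 | 1,724 (50.25%) | 3.64 |
| Broad only | 450 | 31 (6.44%) | 1.33 |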
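Broad & BC calls: JPT

Population: JPT (105 samples)

| | BC | Broad |
|---|---|---|
| # SNPs called (Ts/Tv) | 3,883 (2.85) | 4,719 (1.83) |
| # dbSNP (Ts/Tv) | 1,950 (3.39) | 1,710 (3.31) |
| # novel SNPs (Ts/Tv) | 1,933 (2.43) | 3,009 (1.36) |
| # Singleton (Ts/Tv) | 1,284 (3.09) | 2,347 (1.11) |

Venn diagram of BC and Broad calls:

| Region | SNP | # dbSNP (%) | Ts/Tv |
|---|---|---|---|
| BC only | 983 | 271 (27.57%) | 1.54 |
| BC ∩ Broad | 2,900 | 1,679 (57.90%) | 3.67 |
| Broad only | 1,819 | 31 (1.70%) | 0.74 |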
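Broad & BC calls: LWK

Population: LWK (108 samples)

| | BC | Broad |
|---|---|---|
| # SNPs called (Ts/Tv) | 6,039 (3.45) | 6,370 (3.17) |
| # dbSNP (Ts/Tv) | 2,872 (3.46) | 2,825 (3.31) |
| # novel SNPs (Ts/Tv) | 3,167 (3.44) | 3,545 (3.08) |
| # Singleton (Ts/Tv) | 1,430 (4.68) | 1,692 (3.26) |

Venn diagram of BC and Broad calls:

| Region | SNP | # dbSNP (%) | Ts/Tv |
|---|---|---|---|
| BC only | 580 | 136 (23.45%) | 2.09 |
| BC ∩ Broad | 5,459 | 2,736 (50.12%) | 3.67 |
| Broad only | 911 | 89 (9.77%) | 1.56 |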
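Broad & BC calls: TSI

Population: TSI (66 samples)

| | BC | Broad |
|---|---|---|
| # SNPs called (Ts/Tv) | 3,729 (2.78) | 4,285 (2.39) |
| # dbSNP (Ts/Tv) | 2,257 (3.42) | 2,200 (3.40) |
| # novel SNPs (Ts/Tv) | 1,472 (2.10) | 2,085 (1.72) |
| # Singleton (Ts/Tv) | 1,264 (3.36) | 1,911 (1.72) |

Venn diagram of BC and Broad calls:

| Region | SNP | # dbSNP (%) | Ts/Tv |
|---|---|---|---|
| BC only | 448 | 105 (23.44%) | 0.71 |
| BC ∩ Broad | 3,281 | 2,152 (65.59%) | 3.54 |
| Broad only | 1,004 | 48 (4.78%) | 0.85 |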
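Broad & BC calls: YRI

Population: YRI (112 samples)

| | BC | Broad |
|---|---|---|
| # SNPs called (Ts/Tv) | 5,891 (2.92) | 5,869 (3.15) |
| # dbSNP (Ts/Tv) | 2,897 (3.38) | 2,856 (3.34) |
| # novel SNPs (Ts/Tv) | 2,994 (2.56) | 3,013 (2.99) |
| # Singleton (Ts/Tv) | 1,489 (3.04) | 1,457 (2.73) |

Venn diagram of BC and Broad calls:

| Region | SNP | # dbSNP (%) | Ts/Tv |
|---|---|---|---|
| BC only | 716 | 112 (15.64%) | 0.95 |
| BC ∩ Broad | 5,175 | 2,785 (53.82%) | 3.56 |
| Broad only | 694 | 71 (10.23%) | 1.48 |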
BC vs. BI allele frequency comparisons per population at SNPs in the BC∩BI call set
BC/BI genotype calls (CHB & CHD)
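[Figure: BC vs. BI allele frequency scatter plots for CHB and CHD, two panels each: all SNPs, and SNPs with >= 80% called genotypes.]
All SNPs: CHB #sites=3415, r=0.9925; CHD #sites=3431, r=0.9941
SNPs with >= 80% called genotypes: #sites=3028, r=0.9993; #sites=3310, r=0.9991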
BC/BI genotype calls (TSI & JPT)
[Figure: BC vs. BI allele frequency scatter plots for TSI and JPT, two panels each.]

| Population | All SNPs | SNPs with >= 80% called genotypes |
|---|---|---|
| TSI | #sites=3281, r=0.9912 | #sites=3108, r=0.9973 |
| JPT | #sites=2900, r=0.9922 | #sites=2370, r=0.9991 |
BC/BI genotype calls (LWK & YRI)
[Figure: BC vs. BI allele frequency scatter plots for LWK and YRI, two panels each.]

| Population | All SNPs | SNPs with >= 80% called genotypes |
|---|---|---|
| LWK | #sites=5459, r=0.9924 | #sites=5337, r=0.9984 |
| YRI | #sites=5175, r=0.9917 | #sites=4276, r=0.9978 |
Low frequency / singleton validation design
Recap: Novel singletons from 66 CEU samples chosen for validation
• Interesting singleton: a putative SNP…
  1. that is novel (not in dbSNP 129)
  2. that has been identified by the BC or BI caller
  3. that only occurs in 1 out of 66 of the test individuals
  4. where the individual in whom the SNP is identified is the same among callers
  5. that is also identified by one other caller
  6. whose locus has nominal coverage in other non-variant samples
Data and Definitions
• Sequenom validation run on 46 of 66 individuals (Broad did not have DNA for all 66 samples)
• Sequenom calls filtered by Broad standard metrics (no significant deviation from Hardy-Weinberg; no-call rate of <5%)
• Concordance checked across call sets which were used for selection, and the new Broad and BC calls
Validated true singletons may not be singletons
• Because 20 members of the population could not be genotyped, it is possible that validated novel singletons are actually present in one or more of the additional 20 individuals
• Basic pop-gen gives some ballpark estimates:
  – Probability that a validated singleton is in one of the other 20 individuals:
    • 1.2% ( $= 1 - (1 - \theta)^{20}$ )
  – All validated singletons are truly singletons:
    • 33.5% ( $= (1 - P[\text{event above}])^{89}$ )

*$\theta = 1/1600$
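The two ballpark formulas can be evaluated directly. This sketch plugs in the slide's values (θ = 1/1600, 20 ungenotyped individuals, 89 validated singletons); the function names are illustrative. With these inputs the first quantity comes out at roughly 1.2% and the second at about one third, in line with the slide's estimates.

```python
THETA = 1 / 1600


def p_in_ungenotyped(theta, n_ungenotyped=20):
    """Probability that a validated singleton also occurs in an ungenotyped individual."""
    return 1 - (1 - theta) ** n_ungenotyped


def p_all_truly_singletons(p_single, n_singletons=89):
    """Probability that none of the validated singletons occurs elsewhere."""
    return (1 - p_single) ** n_singletons


p_single = p_in_ungenotyped(THETA)
p_all = p_all_truly_singletons(p_single)
print(f"P(one singleton seen elsewhere) = {p_single:.2%}")
print(f"P(all 89 are true singletons)   = {p_all:.1%}")
```

Even a small per-site probability compounds quickly over 89 sites, which is why most likely at least one "validated singleton" is not a singleton in the full 66-sample set.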
Per population PPV and sensitivity
Variant PPV/Sensitivity – unadjusted for depth
[Figure: % Variant PPV and % Variant Sensitivity (y-axis, 0-100%) per individual in Pilot 3 (318 overlapping individuals).]
Variant PPV/Sensitivity for CEU
[Figure: % Variant PPV, % Variant Sensitivity, and % Ignored Low-Coverage Bases (y-axis, 0-100%) per CEU individual in Pilot 3 (68 well-covered individuals).]
Per-Locus FP Rate: 5.3%
Per-Locus FN Rate: < 5%*
*No FN observed
Variant PPV/Sensitivity for CEU – Counting Low Depth
[Figure: % Variant PPV and % Variant Sensitivity (y-axis, 0-100%) per CEU individual in Pilot 3 (69 individuals).]
Variant PPV/Sensitivity for CHB
[Figure: % Variant PPV, % Variant Sensitivity, and % Ignored Low-Coverage Bases (y-axis, 0-100%) per CHB individual in Pilot 3 (13 well-covered individuals).]
Per-Locus FP Rate: 9.4%
Per-Locus FN Rate: < 5%*
*No Locus FN observed
Variant PPV/Sensitivity for CHB – Counting Low Depth
[Figure: % Variant PPV and % Variant Sensitivity (y-axis, 0-100%) per CHB individual in Pilot 3 (14 individuals).]
Variant PPV/Sensitivity for CHD
[Figure: % Variant PPV, % Variant Sensitivity, and % Ignored Low-Coverage Bases (y-axis, 0-100%) per CHD individual in Pilot 3 (28 well-covered individuals).]
Per-Locus FP Rate: 3.4%
Per-Locus FN Rate: < 5%*
*3 FN in 555 TP observed
Variant PPV/Sensitivity for CHD – Counting Low Depth
[Figure: % Variant PPV and % Variant Sensitivity (y-axis, 0-100%) per CHD individual in Pilot 3 (28 individuals).]
Variant PPV/Sensitivity for JPT
[Figure: % Variant PPV, % Variant Sensitivity, and % Ignored Low-Coverage Bases (y-axis, 0-100%) per JPT individual in Pilot 3 (104 well-covered individuals).]
Per-Locus FP Rate: 2.2%
Per-Locus FN Rate: < 5%*
*No Locus FN observed
Variant PPV/Sensitivity for JPT – Counting Low Depth
[Plot: % Variant PPV and % Variant Sensitivity (y-axis, 0% to 105%) for each JPT individual in Pilot 3 (x-axis, 104 individuals)]
![Page 49: 1000G Pilot 3 Progress (in silico analysis and comparison to experimental validation) Amit Indap, Wen-Fung Leong Gabor Marth (Boston College) Chris Hartl](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d2e5503460f94a04e1c/html5/thumbnails/49.jpg)
Variant PPV/Sensitivity for LWK
[Plot: % Variant PPV, % Variant Sensitivity, and % Ignored Low-Coverage Bases (y-axis, 0% to 105%) for each LWK individual in Pilot 3 (x-axis, 70 well-covered individuals)]
Per-Locus FP Rate: 1.3%
Per-Locus FN Rate: < 5%
* 1 FN in 755 TP observed
![Page 50: 1000G Pilot 3 Progress (in silico analysis and comparison to experimental validation) Amit Indap, Wen-Fung Leong Gabor Marth (Boston College) Chris Hartl](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d2e5503460f94a04e1c/html5/thumbnails/50.jpg)
Variant PPV/Sensitivity for LWK – Counting Low Depth
[Plot: % Variant PPV and % Variant Sensitivity (y-axis, 0% to 105%) for each LWK individual in Pilot 3 (x-axis, 70 individuals)]
![Page 51: 1000G Pilot 3 Progress (in silico analysis and comparison to experimental validation) Amit Indap, Wen-Fung Leong Gabor Marth (Boston College) Chris Hartl](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d2e5503460f94a04e1c/html5/thumbnails/51.jpg)
Variant PPV/Sensitivity for TSI
[Plot: % Variant PPV, % Variant Sensitivity, and % Ignored Low-Coverage Bases (y-axis, 0% to 105%) for each TSI individual in Pilot 3 (x-axis, 3 well-covered individuals)]
Per-Locus FP Rate: 16%
Per-Locus FN Rate: < 5%
* 14 FN in 456 TP observed
Contaminated individuals (not shown on the plot) are counted in the FP and FN rates.
A high per-locus FP rate may indicate contamination by an individual from outside the population.
![Page 52: 1000G Pilot 3 Progress (in silico analysis and comparison to experimental validation) Amit Indap, Wen-Fung Leong Gabor Marth (Boston College) Chris Hartl](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d2e5503460f94a04e1c/html5/thumbnails/52.jpg)
Variant PPV/Sensitivity for TSI – Counting Low Depth
[Plot: % Variant PPV and % Variant Sensitivity (y-axis, 0% to 105%) for each TSI individual in Pilot 3 (x-axis, 4 individuals)]
![Page 53: 1000G Pilot 3 Progress (in silico analysis and comparison to experimental validation) Amit Indap, Wen-Fung Leong Gabor Marth (Boston College) Chris Hartl](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d2e5503460f94a04e1c/html5/thumbnails/53.jpg)
Variant PPV/Sensitivity for YRI
[Plot: % Variant PPV, % Variant Sensitivity, and % Ignored Low-Coverage Bases (y-axis, 0% to 105%) for each YRI individual in Pilot 3 (x-axis, 25 well-covered individuals)]
Per-Locus FP Rate: 20%
Per-Locus FN Rate: < 5%
* 1 FN in 731 TP observed
Contaminated individuals (not shown on the plot) are counted in the FP and FN rates.
A high per-locus FP rate may indicate contamination by an individual from outside the population.
![Page 54: 1000G Pilot 3 Progress (in silico analysis and comparison to experimental validation) Amit Indap, Wen-Fung Leong Gabor Marth (Boston College) Chris Hartl](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d2e5503460f94a04e1c/html5/thumbnails/54.jpg)
Variant PPV/Sensitivity for YRI – Counting Low Depth
[Plot: % Variant PPV and % Variant Sensitivity (y-axis, 0% to 105%) for each YRI individual in Pilot 3 (x-axis, 29 individuals)]