1000g pilot 3 progress in silico analysis and comparison to experimental validation
1
1000G Pilot 3 Progress
in silico analysis and comparison to experimental validation
Gabor Marth (Boston College) + A + L
Kiran Garimella (Broad Institute) + C
February 2, 2010
2
Acknowledgements
Baylor: Matthew Bainbridge, Fuli Yu, Donna Muzny, Richard Gibbs
Broad: Chris Hartl, Kiran Garimella, Carrie Sougnez, Mark DePristo
WUGSC: Dan Koboldt, Bob Fulton
WTSI: Aarno Palotie
Boston College: Amit Indap, Wen Fung Leong, Gabor Marth
Cornell: Andy Clark
Stanford: Simon Gravel, Carlos Bustamante
Michigan: Tom Blackwell
3
Data
                        CEU      TSI      CHB      CHD      JPT      LWK      YRI
Number of samples       90       66       109      107      105      108      112
Sequencing technology   SLX+454  SLX      SLX+454  SLX+454  SLX+454  454      SLX+454
Per-sample coverage     78.20X   65.20X   45.40X   60.25X   52.79X   31.29X   58.12X
• Capture technologies:
  – Nimblegen solid phase
  – Agilent liquid phase
• Sequencing technologies:
  – SLX
  – 454
• Data producers:
  – BCM
  – BI
  – WTSI
  – WUGSC
• Capture targets:
  – Started with ~1,000 genes / ~10,000 exons / 2.3Mb
  – 1.43Mb of total target length shared between 4 data centers used for this analysis
• Samples:
  – 697 total samples
  – 7 populations
• Sequence coverage:
  – Goal was deep per-sample coverage
  – Effective coverage somewhat reduced by fragment duplications
4
Pipelines
Processing step                 BC                                  BI
Read mapping SW                 MOSAIK                              MAQ (SLX)
                                                                    SSAHA2 (454)
Duplicate filtering SW          Picard MarkDuplicates (SLX)         Picard MarkDuplicates (SLX)
                                BCMMarkduplicates (454)             Picard MarkDuplicates (454)
Base quality recalibration SW   GATK (SLX)                          GATK (SLX)
                                None (454)                          GATK (454)
SNP calling SW                  GigaBayes (BamBayes)                UnifiedGenotyper
[Diagram: SNP calling is run on all 697 samples (CEU, TSI, CHB, CHD, JPT, LWK, YRI) and yields the union of all called sites in all 697 samples; SNP statistics are reported for the segregating sites in each population sample.]
5
BC and BI call sets are converging

All called sites: number of sites (% of union)
Comparison #   BC call version   BC total calls    BC unique calls   BC & BI (intersection)   BC || BI (union)   BI unique calls   BI total calls    BI call version
1              2009/11/20        11,580 (55.96%)   733 (3.54%)       10,847 (54.34%)          20,695 (100%)      9,115 (44.04%)    19,962 (96.46%)   v2
2              2009/11/20        11,580 (65.75%)   1,480 (8.40%)     10,100 (62.60%)          17,613 (100%)      6,033 (34.25%)    16,133 (91.60%)   v3
3              2010/01/20        14,502 (79.35%)   2,144 (11.73%)    12,358 (76.60%)          18,277 (100%)      3,775 (20.65%)    16,133 (88.27%)   v3
4              2010/01/20        14,502 (72.91%)   1,741 (8.75%)     12,761 (64.16%)          19,890 (100%)      5,388 (27.01%)    18,149 (91.25%)   v4

Called sites per population (BC/BI intersection): intersection (% of union)
Comparison #   CEU              TSI              CHB              CHD              JPT              LWK              YRI
1              3,354 (73.87%)   3,168 (65.88%)   3,279 (66.23%)   3,226 (68.42%)   2,942 (47.79%)   4,922 (70.56%)   4,917 (72.08%)
2              3,036 (70.62%)   2,893 (69.34%)   2,938 (62.23%)   2,783 (60.58%)   2,545 (55.64%)   4,486 (65.33%)   4,253 (66.30%)
3              3,333 (74.63%)   3,155 (73.15%)   3,294 (66.80%)   3,201 (66.69%)   2,795 (58.40%)   5,165 (73.18%)   4,728 (71.29%)
4              3,489 (78.78%)   3,281 (69.32%)   3,415 (69.74%)   3,431 (72.81%)   2,900 (50.86%)   5,459 (78.55%)   5,175 (78.59%)
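The counts in the "all called sites" table follow from three numbers per comparison: the two call-set totals and the size of their intersection. The sketch below recomputes the derived columns for comparison 4; the function and key names are illustrative and not part of either pipeline.

```python
def overlap_summary(total_a, total_b, intersection):
    """Derive unique and union counts, plus percent-of-union, for two call sets."""
    unique_a = total_a - intersection
    unique_b = total_b - intersection
    union = unique_a + unique_b + intersection

    def pct(n):
        # Percentage of the union, rounded to two decimals as on the slide.
        return round(100.0 * n / union, 2)

    return {
        "union": union,
        "unique_a": unique_a,
        "unique_b": unique_b,
        "pct_total_a": pct(total_a),
        "pct_total_b": pct(total_b),
        "pct_intersection": pct(intersection),
    }


# Comparison 4: BC = 14,502 calls, BI = 18,149 calls, 12,761 shared sites.
summary = overlap_summary(14502, 18149, 12761)
print(summary)
```

The same function can be applied to the per-population rows if the per-population totals are substituted for the all-sample totals.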
6
SNP calls (per population)
                            CEU     TSI     CHB     CHD     JPT     LWK     YRI
samples               BC    90      66      109     107     105     108     112
                      BI    90      66      109     107     105     108     112
called SNPs           BC    4,102   3,729   4,340   4,262   3,883   6,039   5,891
                      BI    3,816   4,285   3,972   3,881   4,719   6,370   5,869
dbSNPs                BC    2,422   2,257   2,042   1,924   1,950   2,872   2,897
                      BI    2,352   2,200   1,827   1,753   1,710   2,825   2,856
% dbSNP               BC    59.04   60.53   47.05   45.14   50.22   47.56   49.18
                      BI    61.64   51.34   46.00   45.17   36.24   44.35   48.66
Ts/Tv (called SNPs)   BC    2.73    2.78    2.82    3.06    2.85    3.45    2.92
                      BI    3.14    2.38    3.15    3.16    1.83    3.17    3.15
novel SNPs            BC    1,680   1,472   2,298   2,338   1,933   3,167   2,994
                      BI    1,464   2,085   2,145   2,128   3,009   3,545   3,013
Ts/Tv (novel SNPs)    BC    2.05    2.10    2.44    2.81    2.43    3.44    2.56
                      BI    2.92    1.72    3.03    3.05    1.36    3.07    2.99
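Two of the quantities in this table are simple to state in code: the transition/transversion (Ts/Tv) ratio and the dbSNP fraction. The sketch below is a minimal illustration, not the callers' actual implementation; the toy SNP list is made up, while the CEU counts are taken from the BC row above.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """A transition stays within purines (A<->G) or within pyrimidines (C<->T).

    Assumes ref != alt, i.e. the input really is a SNP.
    """
    pair = {ref.upper(), alt.upper()}
    return pair <= PURINES or pair <= PYRIMIDINES


def ts_tv_ratio(snps):
    """snps: list of (ref, alt) base pairs. Returns transitions / transversions."""
    ts = sum(1 for ref, alt in snps if is_transition(ref, alt))
    tv = len(snps) - ts
    return ts / tv if tv else float("inf")


def dbsnp_summary(called, in_dbsnp):
    """Split called SNPs into known (dbSNP) and novel, with the dbSNP percentage."""
    return {
        "novel": called - in_dbsnp,
        "pct_dbsnp": round(100.0 * in_dbsnp / called, 2),
    }


# Made-up SNPs: three transitions and two transversions.
toy_snps = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("T", "G")]
toy_ratio = ts_tv_ratio(toy_snps)

# CEU, BC call set: 4,102 called SNPs, 2,422 of them in dbSNP.
ceu_bc = dbsnp_summary(4102, 2422)
print(toy_ratio, ceu_bc)
```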
7
SNP calls (all samples)
                 BC       BI
Samples          697      697
Called SNPs      14,502   18,149
dbSNPs           3,948    4,041
dbSNP fraction   27.22%   22.27%
[Venn diagram: BC (14,502 SNPs) vs. BI (18,149 SNPs); BC U BI = 19,890]
BC only:   1,741 SNPs, 79 dbSNPs (dbSNP = 4.54%)
BC ∩ BI:   12,761 SNPs, 3,869 dbSNPs (dbSNP = 30.32%)
BI only:   5,388 SNPs, 172 dbSNPs (dbSNP = 3.19%)
8
Genotype call accuracy relative to HapMap3
                                                       CEU     TSI     CHB     CHD     JPT     LWK     YRI
FDR of variant genotypes in HapMap3 (%)          BC    0.96    0.23    2.61    1.42    3.60    0.47    0.57
                                                 BI    1.41    0.45    2.99    1.82    3.56    0.66    1.25
Correct calls (%)                                BC    98.39   98.98   96.76   98.20   95.72   99.06   98.63
                                                 BI    97.22   98.26   95.45   97.35   94.55   98.74   96.68
Accuracy of homozygote reference calls (%)       BC    99.20   99.81   97.52   98.62   96.42   99.64   99.59
                                                 BI    98.79   99.62   97.07   98.21   96.33   99.48   99.07
Accuracy of heterozygote calls (%)               BC    97.50   97.72   97.98   99.12   96.81   98.37   96.53
                                                 BI    94.49   95.43   94.19   97.37   92.30   97.89   90.81
Accuracy of homozygote non-reference calls (%)   BC    97.31   98.44   93.27   95.78   92.69   98.21   98.45
                                                 BI    96.77   98.76   93.26   95.16   93.43   97.67   97.46
Data quality in CHB and JPT samples seems consistently lower
Statistics only include genotype calls at SNP sites in BC∩BI
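The slide does not define its accuracy metrics, so the following is only a sketch under one plausible assumption: that each per-class accuracy is the percentage of HapMap3 genotypes of that class that the sequencing call reproduces. Genotypes are coded by non-reference allele count, and the pairs are made-up toy data.

```python
from collections import Counter


def genotype_accuracy(pairs):
    """pairs: iterable of (truth, call), coded 0 = hom-ref, 1 = het, 2 = hom-non-ref.

    Returns the overall percent of matching calls and the percent per truth class.
    """
    totals, correct = Counter(), Counter()
    for truth, call in pairs:
        totals[truth] += 1
        if truth == call:
            correct[truth] += 1
    per_class = {g: 100.0 * correct[g] / totals[g] for g in sorted(totals)}
    overall = 100.0 * sum(correct.values()) / sum(totals.values())
    return overall, per_class


# Toy (HapMap3 genotype, sequencing genotype) pairs; not real data.
toy_pairs = [
    (0, 0), (0, 0), (0, 0), (0, 1),
    (1, 1), (1, 1), (1, 0),
    (2, 2), (2, 2), (2, 1),
]
overall, per_class = genotype_accuracy(toy_pairs)
print(overall, per_class)
```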
9
Genotype calls
[Figures: BC vs. BI allele frequency scatter plots]
All SNP sites considered: # SNP sites = 3,489, r = 0.9921
Only SNP sites with >= 80% called genotypes: # SNP sites = 3,075, r = 0.9979
• Filtering:
  – BC filters on genotype call quality
  – BI reports a genotype for any site covered by at least one read
• Nominally, BI makes more calls than BC and has, on average, higher AF
The Broad caller does not filter on genotype quality
• Good allele frequency concordance between BC and BI
• At genotype calls that pass the BC filter and where BI also makes a call, no discordance was found
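The allele frequency comparison can be sketched as follows: compute the non-reference allele frequency per site from each caller's genotypes, optionally keep only sites where at least 80% of genotypes are called, and correlate the two frequency vectors. This is an illustration under assumptions; in particular, applying the call-rate filter to both callers is a choice made here, and the genotype data are invented.

```python
from math import sqrt


def allele_frequency(genotypes):
    """Non-reference allele frequency over called genotypes (0/1/2); None is a no-call."""
    called = [g for g in genotypes if g is not None]
    return sum(called) / (2 * len(called)) if called else None


def call_rate(genotypes):
    """Fraction of samples with a called genotype."""
    return sum(g is not None for g in genotypes) / len(genotypes)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def af_concordance(sites_a, sites_b, min_call_rate=0.0):
    """Correlate allele frequencies at shared sites that pass the call-rate filter."""
    xs, ys = [], []
    for site in sorted(sites_a.keys() & sites_b.keys()):
        ga, gb = sites_a[site], sites_b[site]
        if min(call_rate(ga), call_rate(gb)) < min_call_rate:
            continue
        fa, fb = allele_frequency(ga), allele_frequency(gb)
        if fa is None or fb is None:
            continue
        xs.append(fa)
        ys.append(fb)
    return len(xs), pearson_r(xs, ys)


# Toy genotypes for five samples at four sites; caller A leaves some no-calls.
caller_a = {
    "s1": [0, 1, None, 0, 0],
    "s2": [1, 1, 2, 1, 0],
    "s3": [2, 2, 2, 1, 2],
    "s4": [None, None, None, 1, 0],
}
caller_b = {
    "s1": [0, 1, 0, 0, 0],
    "s2": [1, 1, 2, 1, 0],
    "s3": [2, 2, 2, 1, 2],
    "s4": [0, 0, 1, 1, 0],
}
n_all, r_all = af_concordance(caller_a, caller_b)
n_filtered, r_filtered = af_concordance(caller_a, caller_b, min_call_rate=0.8)
print(n_all, r_all, n_filtered, r_filtered)
```

In this toy input the filter drops the one site where caller A called only two of five samples, mirroring how the filtered panel on the slide has fewer sites than the unfiltered one.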
10
1KG validation executive summary
• Evaluated BI and BC calls against validation
  – 1KG chip (1)
    • 312/697 samples across 7 populations represented
    • ~300 sites (150 novel) overlap with Pilot 3 target region
• Concordance with 1KG chip is very high
  – Where covered (> 5 reads):
    • 302/312 (97%) of samples have >90% variant sensitivity
    • 269/312 (86%) of samples have >90% genotype sensitivity
  – Remaining disparities between 1KG chip and Pilot 3 calls can be explained by data quality issues
    • Later sequencing has far greater concordance with chip than earlier sequencing
(1) Details in Appendix
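For one sample, variant sensitivity and PPV against the chip can be sketched as below, ignoring sites that are not covered by more than 5 reads. The dictionaries, site names, and the exact handling of low-coverage sites are assumptions made for illustration; the slides do not give the precise procedure.

```python
def variant_metrics(chip_variant, seq_variant, depth, depth_gt=5):
    """Compare sequencing calls with chip genotypes at sites covered by > depth_gt reads.

    chip_variant / seq_variant: dict site -> True if the genotype carries a
    non-reference allele. depth: dict site -> number of reads covering the site.
    """
    tp = fp = fn = ignored = 0
    for site, truth in chip_variant.items():
        if depth.get(site, 0) <= depth_gt:
            ignored += 1  # low-coverage site, excluded from both metrics
            continue
        called = seq_variant.get(site, False)
        if truth and called:
            tp += 1
        elif called:
            fp += 1
        elif truth:
            fn += 1
    sensitivity = tp / (tp + fn) if tp + fn else None
    ppv = tp / (tp + fp) if tp + fp else None
    return {"sensitivity": sensitivity, "ppv": ppv, "ignored": ignored}


# Toy data for a single sample; not taken from the validation set.
chip = {"a": True, "b": True, "c": False, "d": True, "e": False, "f": True}
seq = {"a": True, "b": True, "c": False, "d": False, "e": True, "f": True}
reads = {"a": 30, "b": 12, "c": 40, "d": 8, "e": 25, "f": 3}
metrics = variant_metrics(chip, seq, reads)
print(metrics)
```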
Nearly all samples in call-set overlap have high sensitivity and specificity
[Figure: % Variant PPV, % Variant Sensitivity, and % Ignored Low-Coverage Bases (y-axis, 0% to 100%) for each Pilot 3 individual (x-axis; 312 individuals total after eliminating low-coverage samples)]
These 10 low-sensitivity samples have strange allele balances and are likely contaminated
All but one sample with low PPV (false-positive rate > 10%) are among the earliest-sequenced samples (JPT/CHB/CHD)
11
12
Mean sensitivity/PPV per population is good, and improves on more recently-sequenced populations
[Figure: Mean % PPV and Mean % Sensitivity (y-axis, 0% to 100%) per population]
Population:   CEU   CHB   CHD   JPT   LWK   TSI   YRI
N Samples:    69    13    27    102   69    3     24
Annotations on the chart (sequencing date, technology, centers), in the order extracted:
8/2008, ILMN/454, All Ctrs
8/2008, ILMN/454, All Ctrs
8/2008, ILMN/454, BI/BCM
1/2009, 454, BCM
8/2008, ILMN/454, BI/BCM
10/2008, ILMN, BI/SC
2008/2009, ILMN/454, All Ctrs
13
Low-frequency / singleton validation: executive summary
• Low-frequency Sequenom assay (1)
  – Chose 105 putative novel singletons from early Pilot 3 46-CEU-sample callsets (called in at least 2/4 callers)
  – Validated sites in those 46 individuals
    • 89/105 are true singletons
    • 16/105 are false-positive singletons (hom-refs and two non-singletons)
• Concordance with low-frequency assay is very high
  – Callsets today (January 2010)
    • In BI and BC overlap, recovered 71/89 (80%) of assayed singletons with 0 false-positives and 0 non-singletons
    • In BI and BC union, recovered all 89 singletons with 3 false-positives and 0 non-singletons
(1) Details in Appendix
14
Call Set      Loci Tested (after Sequenom filtering)   Overlap with Test Set   TP (PPV)    FP   True, but not Singleton
BC ∩ BI       105                                      71                      71 (100%)   0    0
BC ∪ BI       105                                      92                      89 (97%)    3    0
Whole Assay   105                                      105                     89 (85%)    16   2
Callers are able to detect most singletons with very low false-positive rate
Joint calls find every singleton in the assay, with exceedingly few
false positives.
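The percentages in the singleton table can be reproduced from its counts: PPV is true positives over the sites in the overlap with the test set, and the recovery rate is true positives over the 89 validated singletons. The sketch below uses only numbers from the slides; the dictionary keys and function name are illustrative.

```python
TRUE_SINGLETONS = 89  # validated true singletons among the 105 assayed sites

call_sets = {
    "BC_and_BI": {"overlap": 71, "tp": 71, "fp": 0},
    "BC_or_BI": {"overlap": 92, "tp": 89, "fp": 3},
    "whole_assay": {"overlap": 105, "tp": 89, "fp": 16},
}


def summarize(counts, true_total=TRUE_SINGLETONS):
    """Return (PPV %, recovered %) rounded to whole percentages."""
    ppv = 100.0 * counts["tp"] / counts["overlap"]
    recovered = 100.0 * counts["tp"] / true_total
    return round(ppv), round(recovered)


results = {name: summarize(counts) for name, counts in call_sets.items()}
for name, (ppv, recovered) in results.items():
    print(f"{name}: PPV {ppv}%, recovered {recovered}% of true singletons")
```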
15
Conclusions / future directions
• Data quality has improved significantly over the life of the project
• Both BC and BI pipelines produce high-quality call sets
  – Good agreement between call sets
  – Intersection highly concordant with experimental validation data
  – Estimated FP rate below 5%
• The current Pilot 3 release is the BC∩BI (intersection) call set
• We are proceeding with validations
  – Dual focus: accuracy and functional classes
  – Results will inform future releases
APPENDIX
Population spectrum of called SNPs
18
Population-spectrum of called SNPs
                    CEU     TSI     CHB     CHD     JPT     LWK     YRI     ALL
called SNPs   BC    4,102   3,729   4,340   4,262   3,883   6,039   5,891   14,502
              BI    3,816   4,285   3,972   3,881   4,719   6,370   5,869   18,149
• Observation: BC calls more SNPs at the population level, but fewer SNP sites overall
• Reason: BC tends to call the same site in more populations…
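The reasoning in the two bullets above can be made concrete with a toy example: a caller whose sites recur across populations can have larger per-population counts yet a smaller union. The site identifiers and population labels are invented; only the counting logic is the point.

```python
def per_population_and_overall(calls_by_pop):
    """calls_by_pop: dict population -> set of called site identifiers.

    Returns the per-population site counts and the size of the union over populations.
    """
    per_pop = {pop: len(sites) for pop, sites in calls_by_pop.items()}
    overall = len(set().union(*calls_by_pop.values()))
    return per_pop, overall


# Caller X calls mostly the same sites in both populations.
caller_x = {"P1": {1, 2, 3, 4}, "P2": {1, 2, 3, 5}}
# Caller Y calls fewer sites per population, but they rarely overlap.
caller_y = {"P1": {1, 2, 6}, "P2": {3, 7, 8}}

x_per_pop, x_overall = per_population_and_overall(caller_x)
y_per_pop, y_overall = per_population_and_overall(caller_y)
print(x_per_pop, x_overall)
print(y_per_pop, y_overall)
```

Here caller X has more sites in each population (4 vs. 3) but fewer distinct sites overall (5 vs. 6), which is the pattern the slide describes.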
BC/BI SNP calls per population (more detail)
20
SNP calls (per population)
                            CEU     TSI     CHB     CHD     JPT     LWK     YRI
samples               BC    90      66      109     107     105     108     112
                      BI    90      66      109     107     105     108     112
called SNPs           BC    4,102   3,729   4,340   4,262   3,883   6,039   5,891
                      BI    3,816   4,285   3,972   3,881   4,719   6,370   5,869
dbSNPs                BC    2,422   2,257   2,042   1,924   1,950   2,872   2,897
                      BI    2,352   2,200   1,827   1,753   1,710   2,825   2,856
% dbSNP               BC    59.04   60.53   47.05   45.14   50.22   47.56   49.18
                      BI    61.64   51.34   46.00   45.17   36.24   44.35   48.66
Ts/Tv (called SNPs)   BC    2.73    2.78    2.82    3.06    2.85    3.45    2.92
                      BI    3.14    2.38    3.15    3.16    1.83    3.17    3.15
novel SNPs            BC    1,680   1,472   2,298   2,338   1,933   3,167   2,994
                      BI    1,464   2,085   2,145   2,128   3,009   3,545   3,013
Ts/Tv (novel SNPs)    BC    2.05    2.10    2.44    2.81    2.43    3.44    2.56
                      BI    2.92    1.72    3.03    3.05    1.36    3.07    2.99
singletons            BC    1,378   1,264   1,654   1,686   1,284   1,430   1,457
                      BI    1,240   1,911   1,555   1,500   2,347   1,692   1,489
Ts/Tv (singletons)    BC    2.72    3.36    3.33    3.39    3.09    4.68    3.04
                      BI    2.84    1.72    2.81    3.03    1.11    3.26    2.73
Broad & BC calls: CEU
Population: CEU (90 samples)   BC             Broad
# SNPs called (Ts/Tv)          4,102 (2.73)   3,816 (3.14)
# dbSNP (Ts/Tv)                2,422 (3.40)   2,352 (3.28)
# novel SNPs (Ts/Tv)           1,680 (2.05)   1,464 (2.92)
# Singleton (Ts/Tv)            1,378 (2.72)   1,240 (2.84)
[Venn diagram: # SNPs, # dbSNP (%), Ts/Tv]
BC only:      613 SNPs, 122 dbSNP (19.90%), Ts/Tv 0.92
BC ∩ Broad:   3,489 SNPs, 2,300 dbSNP (65.92%), Ts/Tv 3.47
Broad only:   327 SNPs, 52 dbSNP (15.90%), Ts/Tv 1.32
Broad & BC calls: CHB
Population: CHB (109 samples)   BC             Broad
# SNPs called (Ts/Tv)           4,340 (2.82)   3,972 (3.15)
# dbSNP (Ts/Tv)                 2,042 (3.37)   1,827 (3.30)
# novel SNPs (Ts/Tv)            2,298 (2.44)   2,145 (3.03)
# Singleton (Ts/Tv)             1,654 (3.33)   1,555 (2.81)
[Venn diagram: # SNPs, # dbSNP (%), Ts/Tv]
BC only:      925 SNPs, 247 dbSNP (26.70%), Ts/Tv 1.23
BC ∩ Broad:   3,415 SNPs, 1,795 dbSNP (52.56%), Ts/Tv 3.74
Broad only:   557 SNPs, 32 dbSNP (5.75%), Ts/Tv 1.37
Broad & BC calls: CHD
Population: CHD (107 samples)   BC             Broad
# SNPs called (Ts/Tv)           4,262 (3.06)   3,881 (3.16)
# dbSNP (Ts/Tv)                 1,924 (3.40)   1,753 (3.30)
# novel SNPs (Ts/Tv)            2,338 (2.81)   2,128 (3.05)
# Singleton (Ts/Tv)             1,686 (3.39)   1,500 (3.03)
[Venn diagram: # SNPs, # dbSNP (%), Ts/Tv]
BC only:      831 SNPs, 200 dbSNP (24.07%), Ts/Tv 1.68
BC ∩ Broad:   3,431 SNPs, 1,724 dbSNP (50.25%), Ts/Tv 3.64
Broad only:   450 SNPs, 31 dbSNP (6.44%), Ts/Tv 1.33
Broad & BC calls: JPT
Population: JPT (105 samples)   BC             Broad
# SNPs called (Ts/Tv)           3,883 (2.85)   4,719 (1.83)
# dbSNP (Ts/Tv)                 1,950 (3.39)   1,710 (3.31)
# novel SNPs (Ts/Tv)            1,933 (2.43)   3,009 (1.36)
# Singleton (Ts/Tv)             1,284 (3.09)   2,347 (1.11)
[Venn diagram: # SNPs, # dbSNP (%), Ts/Tv]
BC only:      983 SNPs, 271 dbSNP (27.57%), Ts/Tv 1.54
BC ∩ Broad:   2,900 SNPs, 1,679 dbSNP (57.90%), Ts/Tv 3.67
Broad only:   1,819 SNPs, 31 dbSNP (1.70%), Ts/Tv 0.74
Broad & BC calls: LWK
Population: LWK (108 samples)   BC             Broad
# SNPs called (Ts/Tv)           6,039 (3.45)   6,370 (3.17)
# dbSNP (Ts/Tv)                 2,872 (3.46)   2,825 (3.31)
# novel SNPs (Ts/Tv)            3,167 (3.44)   3,545 (3.08)
# Singleton (Ts/Tv)             1,430 (4.68)   1,692 (3.26)
[Venn diagram: # SNPs, # dbSNP (%), Ts/Tv]
BC only:      580 SNPs, 136 dbSNP (23.45%), Ts/Tv 2.09
BC ∩ Broad:   5,459 SNPs, 2,736 dbSNP (50.12%), Ts/Tv 3.67
Broad only:   911 SNPs, 89 dbSNP (9.77%), Ts/Tv 1.56
Broad & BC calls: TSI
Population: TSI (66 samples)   BC             Broad
# SNPs called (Ts/Tv)          3,729 (2.78)   4,285 (2.39)
# dbSNP (Ts/Tv)                2,257 (3.42)   2,200 (3.40)
# novel SNPs (Ts/Tv)           1,472 (2.10)   2,085 (1.72)
# Singleton (Ts/Tv)            1,264 (3.36)   1,911 (1.72)
[Venn diagram: # SNPs, # dbSNP (%), Ts/Tv]
BC only:      448 SNPs, 105 dbSNP (23.44%), Ts/Tv 0.71
BC ∩ Broad:   3,281 SNPs, 2,152 dbSNP (65.59%), Ts/Tv 3.54
Broad only:   1,004 SNPs, 48 dbSNP (4.78%), Ts/Tv 0.85
Broad & BC calls: YRI
Population: YRI (112 samples)   BC             Broad
# SNPs called (Ts/Tv)           5,891 (2.92)   5,869 (3.15)
# dbSNP (Ts/Tv)                 2,897 (3.38)   2,856 (3.34)
# novel SNPs (Ts/Tv)            2,994 (2.56)   3,013 (2.99)
# Singleton (Ts/Tv)             1,489 (3.04)   1,457 (2.73)
[Venn diagram: # SNPs, # dbSNP (%), Ts/Tv]
BC only:      716 SNPs, 112 dbSNP (15.64%), Ts/Tv 0.95
BC ∩ Broad:   5,175 SNPs, 2,785 dbSNP (53.82%), Ts/Tv 3.56
Broad only:   694 SNPs, 71 dbSNP (10.23%), Ts/Tv 1.48
BC vs. BI allele frequency comparisons per population at SNPs in the BC∩BI call set
[Figures: BC vs. BI allele frequency scatter plots, two panels per population: all SNPs, and SNPs with >= 80% called genotypes]
BC/BI genotype calls (CHB & CHD)
CHB, all SNPs: #sites=3,415, r=0.9925
CHD, all SNPs: #sites=3,431, r=0.9941
SNPs with >= 80% called genotypes (CHB/CHD, in the order extracted): #sites=3,028, r=0.9993; #sites=3,310, r=0.9991
BC/BI genotype calls (TSI & JPT)
TSI, all SNPs: #sites=3,281, r=0.9912
TSI, SNPs with >= 80% called genotypes: #sites=3,108, r=0.9973
JPT, all SNPs: #sites=2,900, r=0.9922
JPT, SNPs with >= 80% called genotypes: #sites=2,370, r=0.9991
BC/BI genotype calls (LWK & YRI)
LWK, all SNPs: #sites=5,459, r=0.9924
LWK, SNPs with >= 80% called genotypes: #sites=5,337, r=0.9984
YRI, all SNPs: #sites=5,175, r=0.9917
YRI, SNPs with >= 80% called genotypes: #sites=4,276, r=0.9978
Low frequency / singleton validation design
Per population PPV and sensitivity