Validation of Identity and Ancestry SNP Panels for the Ion PGM™
Christopher Phillips, Carla Santos, Maria de la Puente, Manuel Fondevila, Ángel Carracedo, Maviky Lareu
Forensic Genetics Unit, University of Santiago de Compostela
Runa Daniel, Dennis McNevin & Roland van Oorschot, VPFSD, Melbourne
Walther Parson & Mayra Eduardoff, GMI
Peter Schneider & Theresa Gross, UHC
Where does Ion PGM™ fit into the ‘sequence explosion’?
LT SOLiD
Illumina HiSeq
Ion Torrent is designed to read comparatively short fragments at much greater coverage: more reads provide high levels of accuracy
Next-generation sequencing (NGS)
Massively parallel sequencing Ion PGM™
MiSeq
Considerations for evaluation of Ion PGM™ SNP panels
• For analysing degraded DNA with short-amplicon SNPs there is plenty of choice (75 million variants in 1000 Genomes), so if a SNP performs poorly it can easily be replaced. However, SNPs selected for EVC prediction and the best AIMs must work optimally for the test to be informative.
• Interested in the SNPs as much as the detection system, and can ask: ‘Are there good and bad performers in any one SNP set detected with Ion PGM™?’ ‘Does performance impact quality of results from low-level DNA?’
Are all SNPs in the multiplex equally well genotyped?
What is the genotyping precision (but also, what is the true genotype)?
What happens to sequence data when low-level or mixed DNA is the input?
• The need to align generated sequences to a reference sequence adds a new layer of complexity if context sequence features (homopolymers or indels) occur near the SNP and interfere with its secure alignment.
• Forensic SNP typing must get the balance right between the desired coverage, the multiplex scale and the number of barcoded samples loaded per chip. Ion PGM™ sequences are spread across the ‘pit capacity’ of the chip, which is not easy to set up consistently.
Are there coverage outliers?
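The coverage / multiplex scale / samples-per-chip balance can be sketched as simple arithmetic. This is a minimal illustration that assumes reads are spread evenly across samples and SNPs, which real runs do not achieve (hence the question about outliers). The function names and the 400,000-read figure are hypothetical and not taken from the talk.

```python
def mean_coverage_per_snp(usable_reads: int, samples: int, snps: int) -> float:
    """Idealised mean coverage per SNP if reads were spread perfectly evenly."""
    if samples <= 0 or snps <= 0:
        raise ValueError("samples and snps must be positive")
    return usable_reads / (samples * snps)


def max_samples_per_chip(usable_reads: int, snps: int, target_coverage: int) -> int:
    """Largest number of barcoded samples that still reaches the target mean coverage."""
    if snps <= 0 or target_coverage <= 0:
        raise ValueError("snps and target_coverage must be positive")
    return usable_reads // (snps * target_coverage)


# Hypothetical chip yield (an assumption, not a value reported in the talk).
reads = 400_000
print(mean_coverage_per_snp(reads, samples=4, snps=136))
print(max_samples_per_chip(reads, snps=136, target_coverage=500))
```

Because real coverage is uneven, a planning figure like this is only an upper bound on what the weakest SNPs will receive.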
Can we set a minimum coverage limit?
• As SNPs are binary and Ion PGM™ is very sensitive, it is important to create a secure system to distinguish mixtures from imbalanced heterozygotes. What are the patterns of allelic balance in the SNP set?
Three studies of the Ion PGM™ system
Combined five established SNaPshot® PCRs to create a pool of 136 unique amplicons with a 51-156 bp size range (five-fold increase in marker depth)
No balancing of individual PCRs: 10 µl of purified product (elution columns) pooled for library preparation, then sequenced on 314v1 chips. No AmpliSeq™
Three samples, three input amounts, concordance checked with SNaPshot®/Sanger
Provided with prototype versions of the LT forensic Identity Panel
Decided to adopt a simple NIST validation framework centred on the qualified run - here, closely matched protocols for the same control DNAs
Four chip types, 16 samples, concordance checked with public data. AmpliSeq™
A EUROFORGEN deliverable is to develop ancestry and EVC inference panels. Designed a 128-SNP ancestry set for NGS.
Obtained the customised primer design service of the LT ‘white glove’ team. Validating the performance of this SNP set with the above framework in five labs. Ongoing
Combined five optimised forensic PCRs
52plex 34plex Eurasiaplex Pacifiplex IrisPlex
Sequence coverage: 9947A 0.2 ng
Amplified three control DNAs with five SNaPshot PCRs
QIAquick® PCR purification, but no quantitation
Pooled DNA end-labelled and put into library preparation
SNPforID: 52
34-plex: 34
Eurasiaplex: 27
Pacifiplex: 28
IrisPlex: 6
136 unique sites
[Figure labels: Pacifiplex, 34plex AIMs, IrisPlex, SNPforID 52, Eurasiaplex]
Eurasiaplex dominates top 10% coverage - PCR too strong
Established forensic PCRs worked well
9947A
007
S1
Obtained three times more coverage than expected from 314 Chips
0.1 ng 0.2 ng 0.3 ng
Template amounts did not influence coverage
Allelic balance
[Figure panels: 9947A, 007, S1 at 0.1 ng, 0.2 ng, 0.3 ng; axis labels: allele counts, coverage]
• Outlier Allele Read Frequency ranges correlated with lower coverage
• Outlier 10-30% and 70-90% Allele Read Frequencies were difficult to interpret
• These ranges represent ambiguous genotype designations: are these homozygotes with a large number of spurious alternative alleles, or heterozygotes with pronounced allelic imbalance?
• If a small proportion of SNPs are consistently imbalanced they may be ‘discounted’ from analyses
Genotyping concordance
97% concordance obtained comparing Ion PGM™ to SNaPshot®, with a further 1.5% of genotypes from Sanger sequencing concordant with Ion PGM™. No no-calls
[Chart labels: With SNaPshot®; With Sanger; 98.5%; -3%; -1.5%; rs1029047, rs717302]
rs1029047
...one of several tricky homopolymeric alignments
Conclusions
The PCR multiplex limits of SNaPshot® can be readily expanded by combining products of multiple reactions and pooling for library preparation.
Although the average sequence coverage per SNP of 500x was very high, it was highly variable amongst the SNPs in each multiplex. This coverage bias was non-random: the same SNPs had consistently high/low coverage in each sample/input amount. The absence of AmpliSeq™ is likely to affect balance.
Imbalanced heterozygotes correlated with low coverage, but not all low-coverage SNPs were imbalanced. So an Allele Read Frequency threshold (10/90 - 40/60) is the preferred approach and needs adjusting for outlier SNPs.
Furthermore, SNaPshot® could be better balanced. Not certain if Ion PGM™ kit PCRs are equimolar or have molar ratios adjusted to improve balance.
4/5 discordant genotypes in Ion PGM™ had coverage <13x, indicating that a 20x minimum could be applicable (a value suggested in other NGS studies).
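One way to read the threshold and minimum-coverage conclusions above is as a simple calling rule. The sketch below is an interpretation, not the rule used in the study: it assumes that "10/90 - 40/60" means homozygotes are called at or outside 10%/90% reference-allele read frequency, heterozygotes within 40-60%, and everything in between is flagged as ambiguous. The function name and return labels are hypothetical.

```python
def call_genotype(ref_reads: int, alt_reads: int, min_coverage: int = 20,
                  hom_limit: float = 0.10, het_low: float = 0.40,
                  het_high: float = 0.60) -> str:
    """Call a binary SNP from read counts using ARF thresholds (illustrative only)."""
    coverage = ref_reads + alt_reads
    # Below the minimum coverage (or with no reads at all) no call is made.
    if coverage < max(min_coverage, 1):
        return "no-call"
    arf = ref_reads / coverage  # reference-allele read frequency
    if arf >= 1 - hom_limit:
        return "hom-ref"
    if arf <= hom_limit:
        return "hom-alt"
    if het_low <= arf <= het_high:
        return "het"
    # 10-40% and 60-90%: imbalanced heterozygote, noisy homozygote or mixture.
    return "ambiguous"


print(call_genotype(480, 20))   # ARF 0.96
print(call_genotype(260, 240))  # ARF 0.52
print(call_genotype(150, 350))  # ARF 0.30
print(call_genotype(8, 4))      # coverage 12
```

For SNPs known to be consistently imbalanced, the thresholds would need to be passed per SNP rather than left at the defaults.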
The HID-SNP identity set has undergone several revisions
HID-SNP 2.3 ‘prototype’ (169 SNPs)
HID-Ion AmpliSeq™ Identity Panel (124)
HID-Ion AmpliSeq™ Ancestry Panel
Genplex
SNaPshot
9 Y-SNPs replaced
45 A-SNPs removed
P202, M479, L298
NIST validation framework chosen for Ion PGM™ evaluations
SRM2391b now superseded by SRM2391c, but opted for 007 and 9947A forensic control DNAs. Added six staff and seven Coriell genomic controls
Coriell universal genomic controls:
NA06994 EUR male
NA07000 EUR female
NA07029 male child
NA18498 AFR
HG00403 E ASN
NA10540 OCE
NA11200 AME
NIST suggest a simple CE validation MTP for new STRs/kits/protocols:
neg, 10 ng, 1 ng, 0.1 ng, 0.05 ng, 0.025 ng (series shown twice)
Challenging DNA sources
Concordance: inter/intra-lab, plus vs. online data for Coriells
Mixture: male:female staff at above five ratios (duplicated)
Qualifying run: closely matched Ion PGM™ protocols IMU/UHC/USC
Sensitivity: 007 & 9947A plus 12th Century aDNA (0.45 ng)
Key analysis parameters
Ion Torrent Suite (plugin: VariantCaller) used Somatic & Germline parameters
• Allele Read Frequency - how much of each allele-carrying sequence is detected and in what ratio
• Coverage – number of target sequences per chip, per sample, per SNP
• Base misincorporation - the number of incorrect bases detected
• Strand bias - the ratio of sequences in each direction
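The four parameters listed above can be sketched as plain functions over per-SNP read counts. This is an illustration under assumptions, not the VariantCaller implementation: the `SnpReads` container and function names are hypothetical, strand bias is taken to be the forward fraction of reads, and only non-specific (3rd/4th base) misincorporation is computed, since misincorporation of an expected allele needs the true genotype. The example base counts are invented; the 355 forward / 2 reverse strand counts are those quoted in this talk for one strongly biased SNP.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class SnpReads:
    """Read counts at one SNP in one sample (hypothetical container)."""
    ref: str                     # reference allele, e.g. "A"
    alt: str                     # alternative allele, e.g. "G"
    base_counts: Dict[str, int]  # reads carrying each base at the SNP position
    forward: int                 # reads sequenced on the forward strand
    reverse: int                 # reads sequenced on the reverse strand


def coverage(s: SnpReads) -> int:
    """Total reads covering the SNP position."""
    return sum(s.base_counts.values())


def allele_read_frequency(s: SnpReads) -> Optional[float]:
    """Reference-allele reads as a fraction of reference + alternative reads."""
    ref = s.base_counts.get(s.ref, 0)
    alt = s.base_counts.get(s.alt, 0)
    return ref / (ref + alt) if ref + alt else None


def nonspecific_misincorporation(s: SnpReads) -> Optional[float]:
    """Reads carrying a base that is neither allele, as a fraction of coverage."""
    total = coverage(s)
    other = sum(n for base, n in s.base_counts.items() if base not in (s.ref, s.alt))
    return other / total if total else None


def strand_bias(s: SnpReads) -> Optional[float]:
    """Fraction of reads generated from the forward strand."""
    total = s.forward + s.reverse
    return s.forward / total if total else None


example = SnpReads("A", "G", {"A": 0, "C": 1, "G": 356, "T": 0}, forward=355, reverse=2)
print(coverage(example), allele_read_frequency(example))
print(round(nonspecific_misincorporation(example), 4), round(strand_bias(example), 3))
```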
Coverage
[Figure: ranked mean coverage per SNP (marker), with A-SNPs and Y-SNPs marked; heat map ordered by increasing mean coverage per sample and increasing mean coverage per SNP; coverage bins: 0, > 10, > 100, > 200, > 300, > 400, > 500, > 1000, > 1500, > 5000]
The majority of the topmost samples were:
• amplified with ≤100 pg
• degraded DNA
• too many samples loaded on a chip
20x minimum coverage limit
Allele Read Frequency balance
[Figure: reference allele frequency / total allele frequency, per SNP]
Legend: identified as ARF outliers by both; identified as ARF outliers by Børsting only; identified as ARF outliers by this study only
Base misincorporations
[Figure panels: A reads, C reads, G reads, T reads; y-axis: misincorporation as proportion of total coverage]
Reference or alternative allele misincorporations (e.g. low levels of A in GG homozygotes)
Non-specific base misincorporation (e.g. C or T in an A/G SNP)
<1.5% misincorporation of expected bases
<3% misincorporation of a non-specific 3rd/4th base
Total Y-SNP sequences detected in females: 34 (in >2 million sequences from 6 analyses)
0.25% overall rate of misincorporation, applicable to nearly all SNPs
Strand bias
rs430046, rs1463729, rs9866013, rs13182883
rs5746846, rs576261, rs2567608, rs4606077, rs1523537
The SNP target base calls unequivocally record a GG homozygote, but sequences were generated from 355 forward strands and 2 reverse strands (= 0.994 strand bias)
[IGV views: rs13182883 and rs430046. Legend: bases C, T, A, G; deletion (DEL); insertion (INS); read direction. Sites marked 1, 2 and 3.]
One deletion in ~95% of sequences in the forward strand
Twelve direction-based deletion sites recorded in the reverse strand, including the target SNP and two clustering SNP sites 1 and 3
Concordance
99.8% concordance from: inter-run (6009/6022); inter-lab (3751/3763); vs. online data (1621/1624)
[Figure annotations]
• Likely 1000 Genomes-Phase 1 error (two instances)
• No call on either allele in Complete Genomics (NN; two instances)
• No call for 1st allele in Complete Genomics (NG)
• No call for 2nd allele in Complete Genomics (GN)
• One Ion PGM™ discordancy in a single sample: T is the incorrect base call (A/C SNP)
• rs2032597
Discordant SNPs: rs2399332, rs1004357, rs938283, rs1979255 and rs2032597
Rate labels: 0.3%, 0.2%, 1.2%, 0.8%, 2.4%, 0.2%
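The 99.8% figure can be checked against the three concordance counts quoted for this study (inter-run, inter-lab, vs. online data). The sketch below simply pools the counts. Whether the talk's figure was derived by pooling is an assumption, as is treating the three comparisons as non-overlapping.

```python
# Concordant / total genotype comparisons as quoted in the talk.
comparisons = {
    "inter-run": (6009, 6022),
    "inter-lab": (3751, 3763),
    "vs. online data": (1621, 1624),
}


def concordance_pct(concordant: int, total: int) -> float:
    """Concordance as a percentage of compared genotypes."""
    return 100.0 * concordant / total


for name, (concordant, total) in comparisons.items():
    print(f"{name}: {concordance_pct(concordant, total):.1f}%")

pooled_concordant = sum(c for c, _ in comparisons.values())
pooled_total = sum(t for _, t in comparisons.values())
pooled = concordance_pct(pooled_concordant, pooled_total)
print(f"pooled: {pooled:.1f}% ({pooled_concordant}/{pooled_total})")
```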
Mixtures
[Figure panels: single donors; mixtures at 1:9, 1:3, 1:1, 3:1 and 9:1]
Mixed input DNA created Allele Read Frequency balance shifts - but at 9:1 very little heterozygote displacement
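The size of these shifts can be illustrated with an idealised two-person model, in which reads are assumed to be drawn in proportion to each contributor's share of the input DNA. That assumption ignores the per-SNP imbalance discussed earlier, and the function name is hypothetical. Genotypes are given as the number of reference alleles (0, 1 or 2).

```python
def expected_ref_arf(major_ref_alleles: int, minor_ref_alleles: int,
                     major_parts: float, minor_parts: float) -> float:
    """Expected reference-allele read frequency in a two-person mixture (idealised)."""
    for g in (major_ref_alleles, minor_ref_alleles):
        if g not in (0, 1, 2):
            raise ValueError("genotype must be 0, 1 or 2 reference alleles")
    total_parts = major_parts + minor_parts
    if total_parts <= 0:
        raise ValueError("mixture parts must sum to a positive value")
    # Each contributor supplies two alleles per unit of DNA.
    ref = major_parts * major_ref_alleles + minor_parts * minor_ref_alleles
    return ref / (2 * total_parts)


# Major contributor heterozygous, minor contributor homozygous reference.
for major, minor in [(1, 1), (3, 1), (9, 1)]:
    print(f"{major}:{minor} -> {expected_ref_arf(1, 2, major, minor):.3f}")
```

Under this model the major contributor's heterozygote moves from 0.50 to 0.75 at 1:1 but only to 0.55 at 9:1, consistent with the small displacement noted at 9:1.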
Sensitivity
25 or 50 pg
In low-level samples ~8-12 SNPs failed. Only rs2016276 appeared disproportionately in failing SNPs.
In ‘optimum input’ concordance samples ~1-3 SNPs gave no-calls or dropouts. All were outlier SNPs.
[Figure: read length in bp, low-level DNA vs. optimum input DNA]
Sensitivity
aDNA: Volders site, Tyrol, Austria
450 pg quantified with Quantifiler Duo
25 PCR cycles or 25+5 library re-amplification cycles
[Figure labels: N / NN; QUAL=0; SNPs >100x coverage; SNPs 20-100x coverage; RMP: 1.2E-28, GlobalFiler, 1.2E-33]
Prototype SNP panel amplicons
HID-Ion AmpliSeq™ Identity Panel amplicons
[Figure value labels: 47, 68, 108, 130; 120, 123, 137, 117, 119, 99]
54 SNPs removed from prototype set
57 SNPs have reduced amplicon sizes (by an average 57.5 bp)
58 SNPs retain the original prototype primer designs
Identified three kinds of outlier SNPs
Should be removed: rs1979255, rs1004357, rs938283, rs2032597, rs2399332
Have some outlying characteristics, such as strand bias, that mean they should be treated with caution when looking for unusual patterns such as mixed DNA: rs9866013, rs727811, rs321198, rs4606077, rs1463729, rs6591147, rs8037429, rs430046, rs2567608, rs1523537, rs17250535
Have some outlying characteristics, but these do not have a detectable effect on genotyping performance: rs1029047, rs1336071, rs1478829, rs2032599, rs13182883, rs2107612, rs576261, rs5746846, rs13447352
Conclusions
Not all SNPs performed equally well. Five gave genotype discordancies (inter-lab and inter-run); two remain in the HID-Ion Identity Panel.
Several other SNPs have outlier characteristics; those with imbalanced ARFs were also identified in Børsting’s study, though two were not and three Børsting outliers had reasonably good allelic balance in our study.
Limited experiments with low-level DNA suggest very high sensitivity. 81/169 SNPs gave >100x coverage from aDNA. 57 SNPs have since been shortened.
Sought to identify ARF outliers to allow their discounting in mixtures. LT Genotyper uses Somatic parameter settings, but mixtures mimic mutation patterns, so Germline settings improve post-hoc analysis of mixed DNA.
Harmonising chip loading to balance the coverage with samples-per-chip was very difficult. Continues into AIM set validation - high inter-lab variability.
The 99.8% concordance is the best of any SNP test comparisons made at USC.
EUROFORGEN is working on a mixture analysis system for Ion PGM™ data that works well with binary loci but requires conditioning on one contributor.
Optimising a custom SNP panel for Ion PGM™
• 125 of 128 SNPs incorporated into the PCR multiplex: 97.5% conversion rate
• Concordance rates high: inter-run 99.98% (1 SNP), inter-lab 99.75% (6 SNPs), with reasons for discordance identified in each case
• Concordance from online database comparisons of Coriell DNAs: 1000 Genomes 99.74% (3 SNPs), Complete Genomics 99.79% (same SNPs)
• The six discordant SNPs (13 genotypes) mainly had homopolymeric tracts around the SNP. Need retrogressive analysis due to strand directionality.
• Eight no-call SNPs in 20 genotypes: 4 low coverage, 1 population-specific flanking indel, 3 did not pass VariantCaller quality filters.
• Some population-specific low coverage seen - untracked primer site SNPs?
Thank You
Thanks to all the team at Santiago particularly Carla, Maria and Fonde. To Mayra, Walther at GMI and Theresa, Peter at UHC. To Matt Phipps, LT and David Ballard, KCL for help with data analysis.
Speaker was provided travel and hotel support by Thermo Fisher Scientific for this presentation, but no remuneration