![Page 1: 140127 GIAB update and NIST high-confidence calls](https://reader035.vdocuments.us/reader035/viewer/2022062513/554ea712b4c905fb7c8b4a76/html5/thumbnails/1.jpg)
Genome in a Bottle Consortium
Progress Update
January 27, 2014
Justin Zook, Marc Salit, and the Genome in a Bottle Consortium
![Page 2: 140127 GIAB update and NIST high-confidence calls](https://reader035.vdocuments.us/reader035/viewer/2022062513/554ea712b4c905fb7c8b4a76/html5/thumbnails/2.jpg)
Whole Genome RMs vs. Current Validation Methods
• Sanger confirmation
  – Limited by number of sites (and sometimes it’s wrong)
• High-depth NGS confirmation
  – May have same systematic errors
• Genotyping microarrays
  – Limited to known (easier) variants
  – Problems with neighboring “complex” variants, duplications
• Mendelian inheritance
  – Can’t account for some systematic errors
• Simulated data
  – Generally not very representative of errors in real data
• Ti/Tv
  – Varies by region of genome, and only gives overall statistic
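As an illustration of why Ti/Tv is only an overall statistic, here is a minimal sketch of how the ratio can be computed from a list of SNPs. The function names and the `(ref, alt)` tuple input are assumptions for this example, not part of any GIAB tool.

```python
# Purines and pyrimidines; a transition stays within one chemical class.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def is_transition(ref: str, alt: str) -> bool:
    """True for A<->G or C<->T substitutions."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ti_tv_ratio(snps):
    """Return transitions / transversions for an iterable of (ref, alt) SNPs.

    Records that are not single-base substitutions are skipped.
    """
    ti = tv = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue  # not a simple SNP
        if is_transition(ref, alt):
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else float("inf")
```

The result is a single number for the whole call set, so it cannot say which individual calls are wrong.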
![Page 3: 140127 GIAB update and NIST high-confidence calls](https://reader035.vdocuments.us/reader035/viewer/2022062513/554ea712b4c905fb7c8b4a76/html5/thumbnails/3.jpg)
Goals for Data to Accompany RM
• ~0 false positive AND false negative calls in confident regions
• Include as much of the genome as possible in the confident regions (i.e., don’t just take the intersection)
• Avoid bias towards any particular platform– take advantage of strengths of each platform
• Avoid bias towards any particular bioinformatics algorithms
![Page 4: 140127 GIAB update and NIST high-confidence calls](https://reader035.vdocuments.us/reader035/viewer/2022062513/554ea712b4c905fb7c8b4a76/html5/thumbnails/4.jpg)
Integrate 14 Datasets from 5 platforms
![Page 5: 140127 GIAB update and NIST high-confidence calls](https://reader035.vdocuments.us/reader035/viewer/2022062513/554ea712b4c905fb7c8b4a76/html5/thumbnails/5.jpg)
Integration of Data to Form Highly Confident Genotype Calls
• Candidate variants: Find all possible variant sites
• Concordant variants: Find concordant sites across multiple datasets
• Find characteristics of bias: Identify sites with atypical characteristics signifying sequencing, mapping, or alignment bias
• Arbitrate using evidence of bias: For each site, remove datasets with decreasingly atypical characteristics until all datasets agree
• Confidence Level: Even if all datasets agree, identify them as uncertain if few have typical characteristics, or if they fall in known segmental duplications, SVs, or long repeats
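The arbitration and confidence steps can be sketched for a single site as follows. This is a simplified illustration, not the NIST integration pipeline: the `atypicality` score, the `typical_cutoff` and `min_typical` thresholds, and the `in_difficult_region` flag are all hypothetical stand-ins for the evidence of bias described above.

```python
from dataclasses import dataclass


@dataclass
class DatasetCall:
    name: str
    genotype: str        # e.g. "0/1"
    atypicality: float   # hypothetical score; higher means more evidence of bias


def arbitrate(calls, min_typical=2, typical_cutoff=1.0, in_difficult_region=False):
    """Return (genotype, confident) for one candidate site.

    Datasets are dropped from most to least atypical until the rest agree.
    The site is flagged uncertain if too few agreeing datasets look typical,
    or if it lies in a known difficult region (seg dup, SV, long repeat).
    """
    if not calls:
        raise ValueError("at least one dataset call is required")
    remaining = sorted(calls, key=lambda c: c.atypicality)  # most typical first
    while len({c.genotype for c in remaining}) > 1:
        remaining.pop()  # remove the most atypical dataset still present
    genotype = remaining[0].genotype
    n_typical = sum(c.atypicality < typical_cutoff for c in remaining)
    confident = n_typical >= min_typical and not in_difficult_region
    return genotype, confident
```

For example, two typical datasets calling `0/1` and one strongly atypical dataset calling `1/1` would yield `("0/1", True)` under the default thresholds assumed here.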
![Page 6: 140127 GIAB update and NIST high-confidence calls](https://reader035.vdocuments.us/reader035/viewer/2022062513/554ea712b4c905fb7c8b4a76/html5/thumbnails/6.jpg)
Verification of “Highly Confident” Genotype accuracy
• Sanger sequencing
  – 100% accuracy but only 100s of sites
• X Prize Fosmid sequencing
  – Sometimes call only part of a complex variant
• Microarrays
  – Differences appear to be FP or FN in arrays
• Broad 250bp HaplotypeCaller
  – Very highly concordant
• Platinum genomes pedigree SNPs
  – Some systematic errors are inherited; different representations of complex variants
• Real Time Genomics SNPs and indels
  – Some interesting sites called by RTG complex caller
![Page 7: 140127 GIAB update and NIST high-confidence calls](https://reader035.vdocuments.us/reader035/viewer/2022062513/554ea712b4c905fb7c8b4a76/html5/thumbnails/7.jpg)
GCAT – Interactive Performance Metrics
• NIST is working with GCAT to use our highly confident variant calls
• Assess performance of many combinations of mappers and variant callers
• www.bioplanet.com/gcat
Improvement of FreeBayes over 1 year with indels
![Page 8: 140127 GIAB update and NIST high-confidence calls](https://reader035.vdocuments.us/reader035/viewer/2022062513/554ea712b4c905fb7c8b4a76/html5/thumbnails/8.jpg)
Why do calls differ from our highly confident genotypes?
Apparent False Positives
• Platform-specific systematic sequencing errors for SNPs
• Analysis-specific
• Difficult to map regions
• Indels in long homopolymers
Apparent False Negatives
• Different complex variant representation
• Near indels
• Inside repeats
![Page 9: 140127 GIAB update and NIST high-confidence calls](https://reader035.vdocuments.us/reader035/viewer/2022062513/554ea712b4c905fb7c8b4a76/html5/thumbnails/9.jpg)
Complex variants have multiple correct unphased representations
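[Figure: alignments from BWA, ssaha2, CGTools, and Novoalign compared with the reference (Ref), with the variant shown as a T insertion or a TCTCT insertion]

| | FP SNPs | FP MNPs | FP indels |
|---|---|---|---|
| Traditional comparison | 0.38% (610) | 100% (915) | 6.5% (733) |
| Comparison with realignment | 0.15% (249) | 4.2% (38) | 2.6% (298) |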
• ~225,000 highly confident variants are within 10bp of another variant
• FPs and FNs are significantly enriched for complex variants
• RTG vcfeval can fix this issue!
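To show why differently written calls can still be the same variant, here is a small sketch that replays each representation onto the reference and compares the resulting sequences. It illustrates the general idea of haplotype-level comparison only; it is not RTG vcfeval's algorithm, and the function names and 0-based `(pos, ref, alt)` tuples are assumptions for this example.

```python
def apply_variants(reference, variants):
    """Apply non-overlapping (pos, ref, alt) edits (0-based) to a reference string."""
    pieces = []
    cursor = 0
    for pos, ref, alt in sorted(variants):
        if pos < cursor:
            raise ValueError("overlapping variants are not supported")
        if reference[pos:pos + len(ref)] != ref:
            raise ValueError(f"REF allele {ref!r} does not match reference at {pos}")
        pieces.append(reference[cursor:pos])  # unchanged sequence before the variant
        pieces.append(alt)                    # substituted allele
        cursor = pos + len(ref)
    pieces.append(reference[cursor:])
    return "".join(pieces)


def same_haplotype(reference, calls_a, calls_b):
    """True if two variant lists produce the same sequence on this reference."""
    return apply_variants(reference, calls_a) == apply_variants(reference, calls_b)


# One MNP versus two adjacent SNPs: different records, same haplotype.
assert same_haplotype("ACGT", [(1, "CG", "TA")], [(1, "C", "T"), (2, "G", "A")])
# An insertion in a CT repeat placed at two different positions.
assert same_haplotype("ACTCTG", [(0, "A", "ACT")], [(1, "C", "CTC")])
```

A position-by-position comparison would count each of these pairs as one false positive plus one false negative, even though the underlying sequences are identical.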
![Page 10: 140127 GIAB update and NIST high-confidence calls](https://reader035.vdocuments.us/reader035/viewer/2022062513/554ea712b4c905fb7c8b4a76/html5/thumbnails/10.jpg)
Reasons we exclude regions from high-confidence set
![Page 11: 140127 GIAB update and NIST high-confidence calls](https://reader035.vdocuments.us/reader035/viewer/2022062513/554ea712b4c905fb7c8b4a76/html5/thumbnails/11.jpg)
Reasons we exclude regions from high-confidence set
![Page 12: 140127 GIAB update and NIST high-confidence calls](https://reader035.vdocuments.us/reader035/viewer/2022062513/554ea712b4c905fb7c8b4a76/html5/thumbnails/12.jpg)
Structural variant analytical approach
• Depth of coverage (DOC): Control-FREEC, CnD
• Paired-end mapping (PEM): Breakdancer
• Split read (SR): Pindel
• Assembly based (AS): Velvet, ABySS
• Combination: Genome-STRiP
• SVMerge: List of structural variant calls
![Page 13: 140127 GIAB update and NIST high-confidence calls](https://reader035.vdocuments.us/reader035/viewer/2022062513/554ea712b4c905fb7c8b4a76/html5/thumbnails/13.jpg)
![Page 14: 140127 GIAB update and NIST high-confidence calls](https://reader035.vdocuments.us/reader035/viewer/2022062513/554ea712b4c905fb7c8b4a76/html5/thumbnails/14.jpg)
Validation parameters for each SV
• Coverage (mean and standard deviation)
• Paired-end distance/insert size (mean and standard deviation)
• # of discordant paired-ends
• Soft clipping of the reads (mean and standard deviation)
• Mapping quality (mean and standard deviation)
• # of heterozygous and homozygous SNP genotype calls
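A rough sketch of how these per-SV parameters could be summarized is shown below. The `Read` record, the input shapes, and the choice to count only homozygous-variant genotypes as "homozygous" are assumptions for illustration; a real pipeline would extract these values from BAM and VCF files.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Read:
    insert_size: int
    soft_clipped_bases: int
    mapping_quality: int
    discordant: bool  # True if the pair maps with unexpected distance/orientation


def mean_sd(values):
    """Return (mean, population standard deviation), or (0.0, 0.0) if empty."""
    values = list(values)
    return (mean(values), pstdev(values)) if values else (0.0, 0.0)


def alleles(genotype):
    """Split a VCF-style genotype such as '0/1' or '0|1' into its alleles."""
    return genotype.replace("|", "/").split("/")


def sv_parameters(depth_per_base, reads, snp_genotypes):
    """Summarize the validation parameters for one candidate SV region."""
    gts = [alleles(g) for g in snp_genotypes]
    return {
        "coverage": mean_sd(depth_per_base),
        "insert_size": mean_sd(r.insert_size for r in reads),
        "discordant_pairs": sum(r.discordant for r in reads),
        "soft_clipping": mean_sd(r.soft_clipped_bases for r in reads),
        "mapping_quality": mean_sd(r.mapping_quality for r in reads),
        "het_snps": sum(len(set(g)) > 1 for g in gts),
        # Assumption: "homozygous" means homozygous for a non-reference allele.
        "hom_snps": sum(len(set(g)) == 1 and g[0] != "0" for g in gts),
    }
```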
![Page 15: 140127 GIAB update and NIST high-confidence calls](https://reader035.vdocuments.us/reader035/viewer/2022062513/554ea712b4c905fb7c8b4a76/html5/thumbnails/15.jpg)
Challenges with assessing performance
• All variant types are not equal
• All regions of the genome are not equal
  – Homopolymers, STRs, duplications
  – Can be similar or different in different genomes
• Labeling difficult variants as uncertain leads to higher apparent accuracy when assessing performance
• Genotypes fall in 3+ categories (not positive/negative)
  – standard diagnostic accuracy measures not well posed
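One way to see the "3+ categories" problem is to tabulate truth against calls in a genotype confusion matrix rather than a 2x2 table. The sketch below assumes three categories and hypothetical function names; it is one possible bookkeeping scheme, not a consortium-defined metric.

```python
from collections import Counter

CATEGORIES = ("hom_ref", "het", "hom_alt")


def genotype_confusion(truth, calls):
    """Count (truth, call) genotype-category pairs over matched sites."""
    if len(truth) != len(calls):
        raise ValueError("truth and calls must cover the same sites")
    return Counter(zip(truth, calls))


def genotype_concordance(matrix):
    """Fraction of sites where the called category equals the truth category."""
    total = sum(matrix.values())
    agree = sum(n for (t, c), n in matrix.items() if t == c)
    return agree / total if total else 0.0
```

A het site called as hom_alt lands in an off-diagonal cell that is neither a clean false positive nor a clean false negative, which is why sensitivity and specificity need an explicit convention before they can be reported.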
![Page 16: 140127 GIAB update and NIST high-confidence calls](https://reader035.vdocuments.us/reader035/viewer/2022062513/554ea712b4c905fb7c8b4a76/html5/thumbnails/16.jpg)
Pedigree calls
• RTG and Illumina Platinum Genomes working on this
• Sequence NA12878, husband, and 11 children to identify high confidence variants
  – Identify cross-over events
  – Determine if genotypes are consistent with inheritance
• Should we integrate these with the NIST high-confidence genotypes?
• Should we find larger families for future genomes?
• See afternoon presentations!
Source: Mike Eberle, Illumina
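The inheritance-consistency check can be sketched for a single autosomal site as below. This is a simplified illustration, not the RTG or Platinum Genomes method: it assumes diploid VCF-style genotypes, does not model cross-over events or phasing, and uses hypothetical function names.

```python
from itertools import product


def alleles(genotype):
    """Split a VCF-style genotype such as '0/1' or '0|1' into its alleles."""
    return tuple(genotype.replace("|", "/").split("/"))


def is_mendelian_consistent(mother, father, child):
    """True if the child can inherit one allele from each parent (autosomal, diploid)."""
    possible = {
        tuple(sorted(pair)) for pair in product(alleles(mother), alleles(father))
    }
    return tuple(sorted(alleles(child))) in possible


def inconsistent_children(mother, father, children):
    """Indices of children whose genotype violates Mendelian inheritance."""
    return [
        i for i, child in enumerate(children)
        if not is_mendelian_consistent(mother, father, child)
    ]
```

With 11 children, a site where every child is consistent gives much stronger support than a single trio, although errors shared by the whole family can still pass this check.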
![Page 17: 140127 GIAB update and NIST high-confidence calls](https://reader035.vdocuments.us/reader035/viewer/2022062513/554ea712b4c905fb7c8b4a76/html5/thumbnails/17.jpg)
Pedigree Calls in Uncertain Regions
![Page 18: 140127 GIAB update and NIST high-confidence calls](https://reader035.vdocuments.us/reader035/viewer/2022062513/554ea712b4c905fb7c8b4a76/html5/thumbnails/18.jpg)
GIAB Characterization of pilot RM
• NIST – 300x 150x150bp HiSeq (from 6 vials)
• NIST – 100x 75bp ECC SOLiD 5500W
• Illumina – 50x 100x100bp HiSeq
• Complete Genomics – Normal and LFR (non-RM)
• Garvan Institute – Illumina exome
• NCI – Ion Proton whole genome
• INOVA – Infinium SNP/CNV array
![Page 19: 140127 GIAB update and NIST high-confidence calls](https://reader035.vdocuments.us/reader035/viewer/2022062513/554ea712b4c905fb7c8b4a76/html5/thumbnails/19.jpg)
Homogeneity and Stability
Homogeneity
• Multiplex First and last vial
  – 3 libraries x 33x HiSeq each
• Multiplex 4 Random vials
  – 2 libraries x 12.5x HiSeq each
• Compare variability due to:
  – vial
  – library
  – day
  – flow cell
  – lane
  – sampling
• Run PFGE on each vial for size
Stability
• Run PFGE to detect DNA degradation
• Freeze-thaw 2 and 5 times
• Vortex for 10s
• 4°C for 2 and 8 weeks
• 37°C for 2 and 8 weeks
![Page 20: 140127 GIAB update and NIST high-confidence calls](https://reader035.vdocuments.us/reader035/viewer/2022062513/554ea712b4c905fb7c8b4a76/html5/thumbnails/20.jpg)
FTP site and Amazon S3
• NCBI is hosting fastq, bam, and vcf files on the giab ftp site
• These data are mirrored to Amazon S3, so we encourage you to take advantage of this!
![Page 21: 140127 GIAB update and NIST high-confidence calls](https://reader035.vdocuments.us/reader035/viewer/2022062513/554ea712b4c905fb7c8b4a76/html5/thumbnails/21.jpg)
Pilot Reference Material
• High-confidence calls are available on the ftp site and are already being used
• NIST plans to release this as a NIST Reference Material in the next couple of months
![Page 22: 140127 GIAB update and NIST high-confidence calls](https://reader035.vdocuments.us/reader035/viewer/2022062513/554ea712b4c905fb7c8b4a76/html5/thumbnails/22.jpg)
Future Directions
• Characterize more “difficult” regions/variants
• Structural variants
• Compare to pedigree calls
• Examine potentially clinically relevant regions/variants in RMs
• Use long-read technologies
  – Moleculo
  – CG LFR
  – PacBio
  – BioNano Genomics
  – future technologies??
• Use glia/platypus to realign reads to candidate variants
• Analyze interlaboratory study data
• Characterize PGP genomes
  – Ashkenazim trio
  – son in Asian trio
  – DNA at NIST in Jan-Feb 2014
  – Volunteers to sequence?
• Select future genomes
• Tumor-normal?
![Page 23: 140127 GIAB update and NIST high-confidence calls](https://reader035.vdocuments.us/reader035/viewer/2022062513/554ea712b4c905fb7c8b4a76/html5/thumbnails/23.jpg)
Topic #1: Moving beyond the easy regions/variants
Presentations
• Emerging Technologies
  – PacBio
  – Complete Genomics LFR
  – Moleculo
  – BioNano Genomics
• Structural Variants
  – Bina Technologies
Topics
• Structural Variants
• Phasing
• Validation
• Where should we set the threshold(s) for confidence?
![Page 24: 140127 GIAB update and NIST high-confidence calls](https://reader035.vdocuments.us/reader035/viewer/2022062513/554ea712b4c905fb7c8b4a76/html5/thumbnails/24.jpg)
Topic #2: Cancer and Future Genomes
Cancer
• Spike-ins
• Mixtures of normal cell lines
• Tumor-normal cell line pair
• Transcriptome controls
Priorities for Future Genomes
• Diverse ancestry groups
• Larger families
• Recruitment with consent for commercialization
• How many genomes?
• Should the parents be NIST Reference Materials, or only the child?
![Page 25: 140127 GIAB update and NIST high-confidence calls](https://reader035.vdocuments.us/reader035/viewer/2022062513/554ea712b4c905fb7c8b4a76/html5/thumbnails/25.jpg)
Working Group Questions
RM Selection & Design
• Spike-in controls
• FFPE
• Commercial RMs
• ABRF interlaboratory study
• Should we prioritize one or two genomes?
RM Characterization
• Production mode for new trios
  – Pilot was characterized by Illumina, SOLiD, Ion Proton, and Complete Genomics
  – What resources should we invest in measurements for each new family?
![Page 26: 140127 GIAB update and NIST high-confidence calls](https://reader035.vdocuments.us/reader035/viewer/2022062513/554ea712b4c905fb7c8b4a76/html5/thumbnails/26.jpg)
Working Group Questions
Bioinformatics
• Storing data/pipelines
  – Suggestions for ftp structure
  – Data submission/accessioning process
  – Data model for genomic data
  – Archiving pipelines and reproducible research
• GRCh38
• How to use pedigree calls for pilot genome?
• Clones for targeted regions (hard regions if not whole genome)
• In which difficult regions should we focus our characterization?
Performance Metrics
• Target audience
• Requirements for user interface
  – Establishing truth set(s)
  – Inputs/Outputs
  – Visualization
• Integration with GeT-RM