mar2013 performance metrics working group

Post on 10-May-2015


David Jenkins on behalf of

Justin H. Johnson

Director of Bioinformatics

Performance Metric & Figures of Merit

Who are we?

• Justin Johnson

– Managing Director of Services

– Director of Bioinformatics

– 10 Years at JCVI before EdgeBio

– Project Manager - Archon Genomics XPrize

• EdgeBio

– CLIA Lab

– Illumina HiSeq & MiSeq, Ion Proton & PGM

Overview – GIAB as I See It.

• Which genomes?

• How do we sequence them?

• How do we analyze them?

• How do we enable their usage?

Overview Experimental Data

• Sequence Data & Variation

• Metadata

Database

• RM vs. Reference

• Every Base

Visualize and Filter

• Browser over DB

• Query by Experiment Data

Compare and Report

• Single Genome Browser

• ValidationProtocol.org

Refine and Feedback

Experimental Data = Combination of Prep / Sequencing / Analysis

Bioinformatics Data Integration / Representation

Experimental Data

• GetRM Model for Collection

– http://www.ncbi.nlm.nih.gov/projects/variation/get-rm/

• Preparation

– Link to published prep protocol

– ROI in BED/GFF/GBK format

• Sequencing

– Platform Information (Minimally - Name)

– Chemistry (Minimally - Version)

• Analysis

– Link to published analysis protocol or best practices

– Read Data (fastq, sra, hdf5, others)

– Alignment/Assembly Data (bam); minimal tag set TBD

– Variation (VCF or gVCF); minimal tag set TBD in INFO field of VCF, or define an external XSD (see https://sites.google.com/site/gvcftools/home/about-gvcf)

gVCF

https://sites.google.com/site/gvcftools/home/about-gvcf

Meta Data

• All required fields in VCF 4.1

• Others (Examples)

– AA : ancestral allele

– AC : allele count in genotypes, for each ALT allele, in the same order as listed

– AF : allele frequency for each ALT allele in the same order as listed: use this when estimated from primary data, not called genotypes

– AN : total number of alleles in called genotypes

– BQ : RMS base quality at this position

– CIGAR : cigar string describing how to align an alternate allele to the reference allele

– DB : dbSNP membership

– DP : combined depth across samples, e.g. DP=154

– END : end position of the variant described in this record (for use with symbolic alleles)

– H2 : membership in hapmap2

– VALIDATED : validated by follow-up experiment

• Reference Block Implementations

• Handle Indel Conflicts and Resolution

• Genotype Quality for non-variant sites (GQX)
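The INFO tags listed above travel in the eighth column of a VCF record as semicolon-separated `KEY=value` pairs, with flags such as DB carrying no value. A minimal sketch of reading that column follows. The function name `parse_info` is hypothetical, values are kept as strings rather than typed from the VCF header, and this is an illustration, not the tooling used by the working group.

```python
def parse_info(info):
    """Parse a VCF INFO column into a dict.

    Flags (keys without '=') map to True; comma-separated values
    become lists of strings; everything else stays a string.
    """
    fields = {}
    if info in ("", "."):  # "." means the INFO column is empty
        return fields
    for entry in info.split(";"):
        if not entry:
            continue
        key, sep, value = entry.partition("=")
        if not sep:
            fields[key] = True              # flag, e.g. DB or VALIDATED
        elif "," in value:
            fields[key] = value.split(",")  # one value per ALT allele, e.g. AF
        else:
            fields[key] = value
    return fields


example = parse_info("DP=154;AF=0.5,0.25;DB;END=10500")
```

Here `example["END"]` is the string `"10500"`; a real loader would convert types using the `##INFO` header lines.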

Database

• Store Each Base + Meta of RM versus Reference for each Experiment from gVCF

– Distinguish missing versus homozygous reference

– Include copy number and phasing when available, not required

• Engine that drives front end visualization (Genome Browser)

• Build on GetRM/NCBI Database Work
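To illustrate why a per-base store needs to separate "missing" from "homozygous reference", here is a small sketch that expands gVCF-style records into a state for every position in a region. The record layout `(pos, end, is_variant)` and the names `per_base_states`, `MISSING`, `HOM_REF` and `VARIANT` are assumptions made for this example, not a schema from the slides.

```python
MISSING = "missing"    # no record covers this base
HOM_REF = "hom_ref"    # covered by a reference block
VARIANT = "variant"    # covered by a variant record


def per_base_states(region_start, region_end, records):
    """Return {position: state} for a 1-based, inclusive region.

    records: iterable of (pos, end, is_variant) tuples, where a
    reference block spans pos..end and a SNV has pos == end.
    """
    states = {p: MISSING for p in range(region_start, region_end + 1)}
    for pos, end, is_variant in records:
        for p in range(max(pos, region_start), min(end, region_end) + 1):
            if is_variant:
                states[p] = VARIANT
            elif states[p] != VARIANT:
                # a reference block never overwrites a variant call
                states[p] = HOM_REF
    return states


states = per_base_states(1, 10, [(1, 4, False), (5, 5, True), (8, 9, False)])
```

Positions 6, 7 and 10 stay `MISSING` in this example, which is the distinction a plain VCF (variant sites only) cannot express.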

Visualize and Filter

• Build on GetRM/NCBI Browser Work

• Single RM -> Many Experiments

• Not all metadata will be visual, but most/all will be filterable

• Filter data to generate ROI or VOI

– Canned: e.g. Intersect of All Platforms + Analysis, All OMIM SNPs, Clinical Cert SNV List, etc.

– Dynamic: allowing people to explore prep, sequence, or analysis bias

• Slice, dice, and export VOI to the compare and reporting SW

• Allow user defined tracks

• By-product is a community educational resource

– I have a ROI for a test and want to know what platform, prep, exome kit version, etc. covers it best. What do I do?
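A canned filter such as "intersect of all platforms" and a dynamic filter on experiment metadata can be sketched as below. The metadata keys (`platform`, `prep`) and the function names are hypothetical; variants are modeled simply as `(chrom, pos, ref, alt)` tuples.

```python
def filter_experiments(experiments, **criteria):
    """Dynamic filter: keep experiments whose metadata match all criteria."""
    return [
        exp for exp in experiments
        if all(exp["meta"].get(key) == value for key, value in criteria.items())
    ]


def intersect_calls(experiments):
    """Canned filter: variants of interest called in every experiment."""
    call_sets = [set(exp["calls"]) for exp in experiments]
    if not call_sets:
        return set()
    return set.intersection(*call_sets)


experiments = [
    {"meta": {"platform": "HiSeq", "prep": "exome"},
     "calls": [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")]},
    {"meta": {"platform": "Proton", "prep": "exome"},
     "calls": [("chr1", 100, "A", "G"), ("chr1", 300, "G", "A")]},
]

voi = intersect_calls(experiments)                      # shared by all platforms
hiseq_only = filter_experiments(experiments, platform="HiSeq")
```

The resulting set could then be exported as a VOI list; comparing the per-platform sets against the intersection is one way to explore platform bias.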

Parallel Database, Filter Effort (Gemini) Quinlan Lab at UVA - https://github.com/arq5x/gemini

• Gemini – simple, flexible, and powerful framework for exploring genetic variation

• Basic browser capabilities being developed

• Flexible custom annotation and metadata addition to DB

• Leverage the expressive power of SQL while overcoming fundamental challenges associated with using databases for very large datasets
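As a rough illustration of what "the expressive power of SQL" buys for this kind of filtering, the sketch below uses Python's built-in `sqlite3` module with a made-up `calls` table. The table and column names are assumptions for this example only and are not Gemini's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE calls (experiment TEXT, platform TEXT, chrom TEXT, "
    "pos INTEGER, ref TEXT, alt TEXT, depth INTEGER)"
)
rows = [
    ("exp1", "HiSeq", "chr1", 100, "A", "G", 30),
    ("exp2", "Proton", "chr1", 100, "A", "G", 25),
    ("exp1", "HiSeq", "chr1", 200, "C", "T", 40),
    ("exp2", "Proton", "chr1", 300, "G", "A", 5),
]
conn.executemany("INSERT INTO calls VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

# Variants seen with adequate depth on at least two distinct platforms.
query = """
    SELECT chrom, pos, ref, alt, COUNT(DISTINCT platform) AS n_platforms
    FROM calls
    WHERE depth >= ?
    GROUP BY chrom, pos, ref, alt
    HAVING COUNT(DISTINCT platform) >= ?
    ORDER BY chrom, pos
"""
shared = conn.execute(query, (10, 2)).fetchall()
```

Depth and platform-count thresholds are passed as parameters, so the same query serves both canned and user-driven filters.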

Gemini

http://dl.dropbox.com/u/515640/posters_and_slides/Quinlan-Gemini-Poster.pdf


Compare and Reporting

• Take in ROI or VOI from the visualize and filter stage

• Take in user defined VOI or VOI + ROI

• Leverage SW under ValidationProtocol.org to generate reports and files including, but not limited to:

– Summary of completeness, accuracy, phasing

– Discordant variants in VCF

– Concordant variants in VCF

– Phasing errors in VCF

• Provide an intuitive way to feed these results into downstream analysis SW (VariantViz, IO8) or back into the browser (User Defined Track)
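The concordant and discordant categories above can be sketched with plain set operations. This is a simplified illustration, not the method used by the ValidationProtocol.org software: the function name and the two summary ratios are assumptions, variants are matched exactly on `(chrom, pos, ref, alt)`, and phasing is not considered.

```python
def compare_calls(reference_calls, test_calls):
    """Split test calls into concordant and discordant against a reference set."""
    reference, test = set(reference_calls), set(test_calls)
    concordant = reference & test
    missed = reference - test   # in the reference, not called (false negatives)
    extra = test - reference    # called, not in the reference (potential false positives)
    return {
        "concordant": concordant,
        "missed": missed,
        "extra": extra,
        # fraction of reference variants recovered
        "sensitivity": len(concordant) / len(reference) if reference else None,
        # fraction of test calls supported by the reference
        "precision": len(concordant) / len(test) if test else None,
    }


reference = [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
             ("chr1", 300, "G", "A"), ("chr1", 400, "T", "C")]
calls = [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
         ("chr1", 300, "G", "A"), ("chr1", 500, "A", "T")]
report = compare_calls(reference, calls)
```

Each of the three sets could be written out as its own VCF and loaded back into a browser as a user-defined track.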

• $10 million prize competition to showcase whole genome sequencing technology

• Award to the team(s) who can most completely, accurately and affordably sequence 100 human genomes in 30 days or less

• Competing Teams will sequence the genomes of the 100 centenarians who have evaded the usual diseases of aging such as heart disease, diabetes, cancer and Alzheimer’s

AGXP Validation Study Overview

AGXP Validation Study Analysis

• 2 Major Phases using NA19239 and NA12878

– Develop Reference Standards: Fosmid Reconstruction, Variation Discovery; Technology Comparison and Bias Removal

– Develop Performance Metrics: Software Development; Help labs use the data

Compare and Report

• The validationprotocol.org website provides a simple way for anyone to compare their variant calls against the public reference genomes.

• Encourages submission and analysis in public tools like Galaxy through transparent interoperability with GenomeSpace.


Follow On

• Export variants by category (Concordant/Discordant/Phasing Error) to VariantViz, IO8

• Visualize Quality, Allele Frequencies, Depth, etc Info to detect patterns in and between variant categories

Concordant SNPs

Potential false positives

Xprize Team

• Justin H. Johnson and Team, EdgeBio

• Brad Chapman, Harvard: automated high-throughput analysis pipelines with custom visualization and processing tools

• Gabor Marth, Boston College: read mapping, single-nucleotide and insertion-deletion polymorphism detection, and discovery of structural variants

• Aaron Quinlan, University of Virginia: structural variation (SV)

• Granger Sutton, JCVI: Oversight Committee

• Victor Jongeneel, University of Illinois and NCSA: Oversight Committee

• Larry Kedes, UCLA: Oversight Committee

EdgeBio Team

• LAB

– Joy Adigun

– Ryan Mease

– Jennifer Sheffield

– Aaron Johnson

– Jackie Jackson

• IFX

– David Jenkins

– Anju Varadarajan

– Vani Rajan

– Karthik Kota

– Phil Dagasto

• Adam Bennett

• Isabel Llorente

More info available at

http://bit.ly/agxpval

http://www.genomeinabottle.org

Thank You!
