Richard H. Scheuermann, Ph.D.
Department of Pathology
Division of Biomedical Informatics
U.T. Southwestern Medical Center
Standardizing Metadata Associated with NIAID Genome Sequencing Center Projects and their Implementation in NIAID Bioinformatics Resource Centers
N01AI2008038 N01AI40041
Richard H. Scheuermann, Ph.D.
Director of Informatics
J. Craig Venter Institute
Standardizing Metadata Associated with NIAID Genome Sequencing Center Projects and their Implementation in NIAID Bioinformatics Resource Centers
N01AI2008038 N01AI40041
Genome Sequencing Centers for Infectious Disease (GSCID)
Bioinformatics Resource Centers (BRC)
www.viprbrc.org www.fludb.org
High Throughput Sequencing
• Enabling technology
– Epidemiology of outbreaks
– Pathogen evolution
– Host range restriction
– Genetic determinants of virulence and pathogenicity
• Metadata requirements
– Temporal-spatial information about isolates
– Selective pressures
– Host species of specimen source
– Disease severity and clinical manifestations
Metadata Submission Spreadsheets
Complex Query Interface
Metadata Inconsistencies
• Each project was providing different types of metadata
• No consistent nomenclature being used
• Impossible to perform reliable comparative genomics analysis
• Required extensive custom bioinformatics system development
GSC-BRC Metadata Standards Working Group
• NIAID assembled a group of representatives from their three Genome Sequencing Centers for Infectious Diseases (Broad, JCVI, UMD) and five Bioinformatics Resource Centers (EuPathDB, IRD, PATRIC, VectorBase, ViPR) programs
• Develop metadata standards for pathogen isolate sequencing projects
• Bottom up approach
• Assemble into a semantic framework
GSC-BRC Metadata Working Groups
Metadata Standards Process
• Divide into pathogen subgroups – viruses, bacteria, eukaryotic pathogens and vectors
• Collect example metadata sets from sequencing project white papers and other project sources (e.g. CEIRS)
• Identify data fields that appear to be common across projects within a pathogen subgroup (core) and data fields that appear to be project specific
• For each data field, provide a common set of attributes, including definitions, synonyms, allowed value sets (preferably using controlled vocabularies), expected syntax, etc.
• Merge subgroup core elements into a common set of core metadata fields and attributes
• Assemble set of pathogen-specific and project-specific metadata fields to be used in conjunction with core fields
• Compare, harmonize, map to other relevant initiatives, including OBI, MIGS, MIxS, BioProjects, BioSamples (ongoing)
• Assemble all metadata fields into a semantic network (ongoing)
• Harmonize semantic network with the Ontology for Biomedical Investigations (OBI)
• Draft data submission spreadsheets to be used for all white paper and BRC-associated projects
• Finalize version 1.0 metadata standard and version 1.0 data submission spreadsheet
• Beta test version 1.0 standard with new white paper projects, collecting feedback
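The per-field attributes named in the process above (definition, synonyms, allowed value set, expected syntax) can be sketched as a small data structure with a validator. This is only an illustrative sketch: the field names `host_species` and `collection_date`, their synonyms, the value set and the date pattern are hypothetical and are not taken from the working group's standard.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MetadataField:
    """One metadata field with the attributes listed in the process above."""
    name: str
    definition: str
    synonyms: set = field(default_factory=set)
    allowed_values: Optional[set] = None   # controlled vocabulary, if any
    syntax: Optional[str] = None           # regular expression, if any

    def matches_name(self, label: str) -> bool:
        """True if a spreadsheet column label refers to this field."""
        label = label.strip().lower()
        return label == self.name or label in self.synonyms

    def validate(self, value: str) -> bool:
        """Check a value against the value set and syntax, when defined."""
        if self.allowed_values is not None and value not in self.allowed_values:
            return False
        if self.syntax is not None and re.fullmatch(self.syntax, value) is None:
            return False
        return True


# Hypothetical example fields (illustration only, not from the standard).
host_species = MetadataField(
    name="host_species",
    definition="Species of the organism the specimen was collected from",
    synonyms={"host", "specimen source species"},
    allowed_values={"Homo sapiens", "Gallus gallus", "Sus scrofa"},
)
collection_date = MetadataField(
    name="collection_date",
    definition="Date on which the specimen was collected",
    synonyms={"date collected"},
    syntax=r"\d{4}-\d{2}-\d{2}",   # assumed ISO 8601 calendar date
)

print(host_species.matches_name("Host"))        # True: "host" is a synonym
print(host_species.validate("human"))           # False: not in the value set
print(collection_date.validate("2011-03-15"))   # True: matches the pattern
```

Keeping synonyms next to the canonical name is one way to map differently labelled spreadsheet columns from separate projects onto the same core field.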
Data Fields: Core Project Core Sample
Attributes
[Figure: semantic network for Specimen Isolation]
Entities: organism; environmental material; equipment; person; specimen source role; specimen capture role; specimen collector role; temporal-spatial region; spatial region; temporal interval; GPS location; date/time; geographic location; name; affiliation; ID; specimen X; specimen type; specimen isolation procedure X; specimen isolation procedure type; isolation protocol; organism part; hypothesis; IRB/IACUC approval; environment; organism pathogenic disposition; gender; age; health status
Relations: has_input; has_output; plays; has_specification; has_part; denotes; located_in; has_affiliation; instance_of; is_about; has_authorization; has_quality; has_disposition
Core sample field labels: CS1; CS2/3; CS4; CS5/6; CS7; CS8; CS9/10; CS11/12; CS13; CS14; CS15/16; CS18
Metadata Processes
[Figure: semantic network of metadata processes]
Panels: Investigation; Specimen Isolation; Material Processing; Sequencing Assay; Data Processing; Quality Assessment
Entities: specimen source (organism or environmental); specimen collector; specimen; microorganism; microorganism genomic NA; enriched NA sample; input sample; reagents; technician; equipment; type; ID; qualities; temporal-spatial region; specimen isolation process; isolation protocol; sample processing; sequencing assay; quality assessment assay; data transformations (image processing, assembly); data transformations (variant detection, serotype marker detection, gene detection); primary data; sequence data; genotype/serotype/gene data; data archiving process; sequence data record; GenBank ID
Relations: has_input; has_output; has_specification; has_part; is_about; denotes; located_in; has_quality; instance_of
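The process chain in the diagram above ties each step to its inputs and outputs, which is what makes provenance questions answerable. As a rough sketch, the relations can be stored as subject-predicate-object triples and walked backwards from a data product to the specimen source. The relation names `has_input` and `has_output` come from the diagram; the node identifiers and the choice of which nodes are connected are assumptions made for illustration, not a transcription of the figure.

```python
# Triples of (subject, predicate, object). Node names are hypothetical.
TRIPLES = [
    ("specimen_isolation", "has_input", "specimen_source"),
    ("specimen_isolation", "has_output", "specimen"),
    ("sample_processing", "has_input", "specimen"),
    ("sample_processing", "has_output", "enriched_NA_sample"),
    ("sequencing_assay", "has_input", "enriched_NA_sample"),
    ("sequencing_assay", "has_output", "primary_data"),
    ("assembly", "has_input", "primary_data"),
    ("assembly", "has_output", "sequence_data"),
]


def producer_of(entity, triples):
    """Return the process that has `entity` as output, or None."""
    for subj, pred, obj in triples:
        if pred == "has_output" and obj == entity:
            return subj
    return None


def inputs_of(process, triples):
    """Return all entities that `process` has as input."""
    return [obj for subj, pred, obj in triples
            if subj == process and pred == "has_input"]


def trace_back(entity, triples):
    """Walk has_output/has_input links back towards the original source.

    Returns a list of (process, input) steps, most recent step first.
    Assumes each entity has at most one producer and follows only the
    first input of each process, which is enough for the toy data above.
    """
    steps = []
    seen = set()
    while entity not in seen:          # guard against accidental cycles
        seen.add(entity)
        process = producer_of(entity, triples)
        if process is None:
            break
        inputs = inputs_of(process, triples)
        if not inputs:
            break
        entity = inputs[0]
        steps.append((process, entity))
    return steps


for process, source in trace_back("sequence_data", TRIPLES):
    print(f"{process} <- {source}")
```

A production system would more likely use an RDF store and a query language for this; the plain-list version only shows why consistent input/output relations make such traversal possible.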
Outcome of Metadata Standards WG
• Consistent metadata captured across GSCID
• Guidance to collaborators regarding metadata expectations for sequencing and analysis services
• Support more standardized BRC interface development
• Harmonization with related stakeholders – Genomic Standards Consortium MIxS, OBO Foundry OBI and NCBI BioSample
• Represented in the context of an extensible semantic framework
Conclusions
• Metadata standards for microorganism sequencing projects
• Bottom up approach focuses standard on important features
• Harmonizing with related standards from the Genomic Standards Consortium, OBO Foundry and NCBI
• Being beta-tested by GSCIDs for adoption by all NIAID-sponsored sequencing projects
• Utility of semantic representation
– Identified gaps in data field list (e.g. temporal components)
– Includes logical structure for other, project-specific, data fields (extensible)
– Identified gaps in ontology data standards (use case-driven standard development)
– Identified commonalities in data structures (reusable)
– Support for semantic queries and inferential analysis in future
• Ontology-based framework is extensible
– Sequencing => “omics”