
Richard H. Scheuermann, Ph.D.

Department of Pathology

Division of Biomedical Informatics

U.T. Southwestern Medical Center

Standardizing Metadata Associated with NIAID Genome Sequencing Center Projects and their Implementation in NIAID Bioinformatics Resource Centers

N01AI2008038 N01AI40041

http://www.niaid.nih.gov/default.htm

Richard H. Scheuermann, Ph.D.

Director of Informatics

J. Craig Venter Institute


Genome Sequencing Centers for Infectious Disease (GSCID)

Bioinformatics Resource Centers (BRC)

www.viprbrc.org www.fludb.org

High Throughput Sequencing

• Enabling technology
  – Epidemiology of outbreaks
  – Pathogen evolution
  – Host range restriction
  – Genetic determinants of virulence and pathogenicity

• Metadata requirements
  – Temporal-spatial information about isolates
  – Selective pressures
  – Host species of specimen source
  – Disease severity and clinical manifestations

Metadata Submission Spreadsheets


Complex Query Interface

Metadata Inconsistencies

• Each project was providing different types of metadata

• No consistent nomenclature being used

• Impossible to perform reliable comparative genomics analysis

• Required extensive custom bioinformatics system development

GSC-BRC Metadata Standards Working Group

• NIAID assembled a group of representatives from their three Genome Sequencing Centers for Infectious Diseases (Broad, JCVI, UMD) and five Bioinformatics Resource Centers (EuPathDB, IRD, PATRIC, VectorBase, ViPR) programs

• Develop metadata standards for pathogen isolate sequencing projects

• Bottom-up approach

• Assemble into a semantic framework

GSC-BRC Metadata Working Groups

Metadata Standards Process

• Divide into pathogen subgroups – viruses, bacteria, eukaryotic pathogens and vectors
• Collect example metadata sets from sequencing project white papers and other project sources (e.g. CEIRS)
• Identify data fields that appear to be common across projects within a pathogen subgroup (core) and data fields that appear to be project specific
• For each data field, provide a common set of attributes, including definitions, synonyms, allowed value sets (preferably using controlled vocabularies), and expected syntax
• Merge subgroup core elements into a common set of core metadata fields and attributes
• Assemble set of pathogen-specific and project-specific metadata fields to be used in conjunction with core fields
• Compare, harmonize, and map to other relevant initiatives, including OBI, MIGS, MIxS, BioProjects, BioSamples (ongoing)
• Assemble all metadata fields into a semantic network (ongoing)
• Harmonize semantic network with the Ontology for Biomedical Investigations (OBI)
• Draft data submission spreadsheets to be used for all white paper and BRC-associated projects
• Finalize version 1.0 metadata standard and version 1.0 data submission spreadsheet
• Beta test version 1.0 standard with new white paper projects, collecting feedback
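The per-field attributes described above (definition, synonyms, allowed value set, expected syntax) could be captured in a small record type. The following is only a minimal sketch under stated assumptions: the `FieldSpec` class, the two example fields, their synonyms, value sets, and the date pattern are all hypothetical illustrations and are not taken from the working group's standard or spreadsheets.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FieldSpec:
    """One metadata field and its attributes (hypothetical structure)."""
    name: str
    definition: str
    synonyms: set = field(default_factory=set)
    allowed_values: Optional[set] = None  # controlled vocabulary, if any
    syntax: Optional[str] = None          # expected syntax as a regular expression

    def matches(self, header: str) -> bool:
        """True if a spreadsheet column header names this field or a synonym."""
        names = {self.name.lower()} | {s.lower() for s in self.synonyms}
        return header.strip().lower() in names

    def validate(self, value: str) -> list:
        """Return a list of problems with a submitted value (empty if none)."""
        problems = []
        if self.allowed_values is not None and value not in self.allowed_values:
            problems.append(f"{self.name}: {value!r} is not an allowed value")
        if self.syntax is not None and re.fullmatch(self.syntax, value) is None:
            problems.append(f"{self.name}: {value!r} does not match expected syntax")
        return problems


# Illustrative definitions only; assumed for this sketch.
COLLECTION_DATE = FieldSpec(
    name="specimen collection date",
    definition="Date on which the specimen was collected.",
    synonyms={"collection date", "date collected"},
    syntax=r"\d{4}-\d{2}-\d{2}",
)
HOST_GENDER = FieldSpec(
    name="specimen source gender",
    definition="Gender of the organism from which the specimen was taken.",
    synonyms={"host sex", "gender"},
    allowed_values={"male", "female", "unknown"},
)
```

With definitions like these, a submission pipeline could map a column headed "Date Collected" to the core field via `COLLECTION_DATE.matches(...)` and flag a value such as "3/5/2011" via `COLLECTION_DATE.validate(...)`, which is the kind of consistency that differing per-project nomenclature made difficult.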

Data Fields: Core Project Core Sample

Attributes

[Figure: semantic network for Specimen Isolation. Node labels: organism, organism part, environmental material, environment, equipment, person, name, affiliation, ID, specimen source role, specimen capture role, specimen collector role, temporal-spatial region, spatial region, temporal interval, geographic location, GPS location, date/time, specimen, specimen type, specimen isolation procedure, specimen isolation procedure type, isolation protocol, hypothesis, IRB/IACUC approval, organism pathogenic disposition, gender, age, health status. Relation labels: has_input, has_output, plays, has_specification, has_part, denotes, located_in, has_affiliation, instance_of, is_about, has_authorization, has_quality, has_disposition. Nodes are annotated with core sample field numbers CS1, CS2/3, CS4, CS5/6, CS7, CS8, CS9/10, CS11/12, CS13, CS14, CS15/16, CS18.]
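The Specimen Isolation network can be read as a set of subject-relation-object statements. As a rough sketch, the relation names below come from the diagram labels, but the diagram's layout is not recoverable from this text, so which node is connected to which is an assumption made for illustration. The helper functions `objects` and `subjects` are likewise hypothetical.

```python
# Each triple is (subject, relation, object).
# Relation names appear in the diagram; the specific wiring is ASSUMED.
TRIPLES = [
    ("specimen isolation procedure", "has_input", "organism"),
    ("specimen isolation procedure", "has_output", "specimen"),
    ("specimen isolation procedure", "has_specification", "isolation protocol"),
    ("specimen isolation procedure", "located_in", "temporal-spatial region"),
    ("temporal-spatial region", "has_part", "spatial region"),
    ("temporal-spatial region", "has_part", "temporal interval"),
    ("GPS location", "denotes", "spatial region"),
    ("date/time", "denotes", "temporal interval"),
    ("organism", "plays", "specimen source role"),
    ("person", "plays", "specimen collector role"),
    ("person", "has_affiliation", "affiliation"),
]


def objects(subject, relation):
    """All objects linked from `subject` by `relation`, in listing order."""
    return [o for s, r, o in TRIPLES if s == subject and r == relation]


def subjects(relation, obj):
    """All subjects linked to `obj` by `relation`, in listing order."""
    return [s for s, r, o in TRIPLES if r == relation and o == obj]


# Example: which data field denotes where the isolation took place?
where_field = subjects("denotes", "spatial region")
```

Representing the fields this way is what allows a spreadsheet column such as a GPS location to be tied to the entity it describes, rather than standing alone as a free-text value.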

Metadata Processes

[Figure: semantic network of metadata processes within an Investigation, divided into panels: Specimen Isolation, Material Processing, Sequencing Assay, Data Processing, Quality Assessment. Node labels: specimen source (organism or environmental), specimen collector, specimen isolation process, isolation protocol, specimen, microorganism, sample processing, enriched NA sample, microorganism genomic NA, input sample, reagents, technician, equipment, sequencing assay, primary data, data transformations (image processing, assembly, variant detection, serotype marker detection, gene detection), sequence data, genotype/serotype/gene data, data archiving process, sequence data record, GenBank ID, quality assessment assay, type, ID, qualities. Relation labels: has_input, has_output, has_specification, has_part, is_about, denotes, located_in, has_quality, instance_of.]
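The process view chains steps together through has_input and has_output, so a sequence record can in principle be traced back to its specimen source. The sketch below illustrates that idea only. The process and material names are taken from the diagram labels, but the order in which they are linked, the `PROCESSES` table, and the `provenance` function are assumptions, not the working group's model.

```python
# process name -> (has_input items, has_output items); wiring is ASSUMED.
PROCESSES = {
    "specimen isolation process": (["specimen source"], ["specimen"]),
    "sample processing": (["specimen"], ["enriched NA sample"]),
    "sequencing assay": (["enriched NA sample", "reagents"], ["primary data"]),
    "data transformation: assembly": (["primary data"], ["sequence data"]),
    "data archiving process": (["sequence data"], ["sequence data record"]),
}


def provenance(item):
    """Follow has_output/has_input links backward from `item`.

    Returns the processes that contributed to `item`, most recent first.
    """
    chain = []
    seen = set()
    frontier = [item]
    while frontier:
        current = frontier.pop()
        for name, (inputs, outputs) in PROCESSES.items():
            if current in outputs and name not in seen:
                seen.add(name)
                chain.append(name)
                frontier.extend(inputs)
    return chain


steps = provenance("sequence data record")
```

Under these assumed links, `steps` lists the archiving, assembly, sequencing, sample processing, and specimen isolation steps in that order, which is the sort of query a shared process model would support across projects.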

Outcome of Metadata Standards WG

• Consistent metadata captured across GSCID

• Guidance to collaborators regarding metadata expectations for sequencing and analysis services

• Support more standardized BRC interface development

• Harmonization with related stakeholders – Genome Standards Consortium MIxS, OBO Foundry OBI and NCBI BioSample

• Represented in the context of an extensible semantic framework

Conclusions

• Metadata standards for microorganism sequencing projects
• Bottom-up approach focuses standard on important features
• Harmonizing with related standards from the Genome Standards Consortium, OBO Foundry and NCBI
• Being beta-tested by GSCIDs for adoption by all NIAID-sponsored sequencing projects
• Utility of semantic representation
  – Identified gaps in data field list (e.g. temporal components)
  – Includes logical structure for other, project-specific, data fields (extensible)
  – Identified gaps in ontology data standards (use case-driven standard development)
  – Identified commonalities in data structures (reusable)
  – Support for semantic queries and inferential analysis in future
• Ontology-based framework is extensible
  – Sequencing => “omics”
