gsc-brc metadata standards

GSC-BRC Metadata Standards Richard H. Scheuermann U.T. Southwestern Medical Center

Post on 23-Feb-2016

Page 1: GSC-BRC Metadata Standards

GSC-BRC Metadata Standards

Richard H. Scheuermann
U.T. Southwestern Medical Center

Page 2: GSC-BRC Metadata Standards

Metadata Inconsistencies

• Each project was providing different types of metadata
• No consistent nomenclature being used
• Impossible to perform reliable comparative genomics analysis

Page 3: GSC-BRC Metadata Standards

Dengue Clinical Metadata

Page 4: GSC-BRC Metadata Standards

Virus Isolate Information

Page 5: GSC-BRC Metadata Standards

Complex Query Interface

Page 6: GSC-BRC Metadata Standards

Additional Clinical Characteristics

Page 7: GSC-BRC Metadata Standards

GSC-BRC Metadata Standards Working Group

• NIAID assembled a group of representatives from its three Genome Sequencing Centers for Infectious Diseases (Broad, JCVI, UMD) and five Bioinformatics Resource Centers (EuPathDB, IRD, PATRIC, VectorBase, ViPR)

• Develop metadata standards for pathogen isolate sequencing projects

Page 8: GSC-BRC Metadata Standards

Metadata Standards Process

• Divide into pathogen subgroups – viruses, bacteria, eukaryotic pathogens and vectors
• Collect example metadata sets from sequencing project white papers and other project sources (e.g. CEIRS)
• Identify data fields that appear to be common across projects within a pathogen subgroup (core) and data fields that appear to be project specific
• For each data field, provide definitions, synonyms, allowed value sets preferably using controlled vocabularies, expected syntax, examples, data categories and data providers
• Merge subgroup core elements into a common set of core metadata fields and attributes
• Assemble metadata fields into a semantic network
• Harmonize semantic network with the Ontology for Biomedical Investigations (OBI)
• Compare, harmonize, map to other relevant initiatives, including MIGS, MIMS, BioProjects, BioSamples
• Develop data submission spreadsheets to be used for all white paper and BRC-associated projects
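
As a rough illustration of the per-field attributes named in the process (definition, synonyms, allowed value set, expected syntax, example, core vs. project specific), a field specification could be encoded and checked as below. This is a minimal sketch, not the working group's schema: the `FieldSpec` class, its `validate` method, and the two sample fields with their value sets are all hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class FieldSpec:
    """Hypothetical record for one metadata field (all names are assumptions)."""
    name: str
    definition: str
    synonyms: List[str] = field(default_factory=list)
    allowed_values: Optional[Set[str]] = None  # controlled vocabulary, if any
    syntax: Optional[str] = None               # regular expression, if any
    example: str = ""
    core: bool = True                          # core vs. project specific

    def validate(self, value: str) -> List[str]:
        """Return a list of problems; an empty list means the value passes."""
        problems = []
        if self.allowed_values is not None and value not in self.allowed_values:
            problems.append(f"{self.name}: {value!r} not in allowed value set")
        if self.syntax is not None and not re.fullmatch(self.syntax, value):
            problems.append(f"{self.name}: {value!r} does not match {self.syntax}")
        return problems


# Illustrative fields only; the real sheets define their own names and values.
collection_date = FieldSpec(
    name="collection_date",
    definition="Date on which the specimen was collected",
    synonyms=["isolation date", "sampling date"],
    syntax=r"\d{4}-\d{2}-\d{2}",
    example="2010-06-15",
)
specimen_type = FieldSpec(
    name="specimen_type",
    definition="Kind of material collected from the source",
    allowed_values={"blood", "serum", "nasal swab"},
    example="serum",
)

print(collection_date.validate("2010-06-15"))    # no problems reported
print(collection_date.validate("15 June 2010"))  # fails the syntax check
print(specimen_type.validate("Serum"))           # case differs from vocabulary
```

Keeping the vocabulary and syntax with the field definition means a submission spreadsheet can be checked row by row against the same specification that documents it.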

Page 9: GSC-BRC Metadata Standards

GSC-BRC Metadata Working Groups

Page 10: GSC-BRC Metadata Standards

Example Metadata

Page 11: GSC-BRC Metadata Standards

Virus Core Metadata Sheet

Page 12: GSC-BRC Metadata Standards

Metadata Merge

Page 13: GSC-BRC Metadata Standards

Network Overview

[Figure: semantic network diagram]
Entities: specimen source (organism or environmental), specimen collector, specimen, microorganism, microorganism genomic NA, enriched NA sample, input sample, reagents, technician, equipment, isolation protocol, type, ID, qualities, temporal-spatial region, primary data, sequence data, genotype/serotype/gene data, sequence data record, GenBank ID
Processes: specimen isolation process, sample processing, sequencing assay, data transformations (image processing, assembly, variant detection, serotype marker detection, gene detection), data archiving process
Relations: has_input, has_output, has_specification, has_part, has_quality, instance_of, is_about, denotes, located_in
Legend: node styles distinguish independent continuant, dependent continuant, occurrent and temporal-spatial region; italics mark relations
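
One way to read the network on this slide is as a set of (subject, relation, object) triples that can be walked to recover provenance. The sketch below is an assumption-laden illustration: the node and relation names are taken from the slide, but the transcript does not preserve which nodes each edge connects, so the specific edges in `TRIPLES` are a guessed reading, and the helper functions are hypothetical.

```python
from collections import defaultdict

# (subject, relation, object) triples. Node and relation names come from the
# slide; which nodes each edge connects is an assumed reading of the diagram.
TRIPLES = [
    ("specimen isolation process", "has_input", "specimen source"),
    ("specimen isolation process", "has_output", "specimen"),
    ("specimen isolation process", "has_specification", "isolation protocol"),
    ("sample processing", "has_input", "specimen"),
    ("sample processing", "has_output", "enriched NA sample"),
    ("sequencing assay", "has_input", "enriched NA sample"),
    ("sequencing assay", "has_output", "primary data"),
    ("data transformation", "has_input", "primary data"),
    ("data transformation", "has_output", "sequence data"),
    ("data archiving process", "has_input", "sequence data"),
    ("data archiving process", "has_output", "sequence data record"),
    ("GenBank ID", "denotes", "sequence data record"),
]


def build_producers(triples):
    """Map each entity to the process that outputs it."""
    return {obj: subj for subj, rel, obj in triples if rel == "has_output"}


def build_inputs(triples):
    """Map each process to the list of entities it takes as input."""
    inputs = defaultdict(list)
    for subj, rel, obj in triples:
        if rel == "has_input":
            inputs[subj].append(obj)
    return inputs


def provenance(entity, triples):
    """Walk backwards from an entity to everything it was derived from."""
    producers = build_producers(triples)
    inputs = build_inputs(triples)
    chain, frontier, seen = [], [entity], {entity}
    while frontier:
        current = frontier.pop()
        process = producers.get(current)
        if process is None:
            continue  # nothing in the network produces this entity
        for source in inputs[process]:
            if source not in seen:
                seen.add(source)
                chain.append((source, process, current))
                frontier.append(source)
    return chain


for source, process, product in provenance("sequence data record", TRIPLES):
    print(f"{product} <- {process} <- {source}")
```

Under these assumed edges, the walk leads from the archived sequence record back through the assay and sample processing to the specimen source, which is the kind of question a consistent network is meant to answer.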

Page 14: GSC-BRC Metadata Standards

[Figure: the Network Overview diagram from Page 13, with the same entities, processes and relations, partitioned into five labeled regions: Investigation, Specimen Isolation, Material Processing, Sequencing Assay, Data Processing]

Page 15: GSC-BRC Metadata Standards

Metadata Categories

• Investigation
• Host/Source Characterization
• Specimen Isolation
• Pathogen Detection
• Pathogen Isolation
• Pathogen Characterization
• Specimen Processing
• Sample Shipment
• Sequencing Sample Preparation
• Sequencing Assay
• Data Transformation

Page 16: GSC-BRC Metadata Standards

Host/Source Characterization

[Figure: semantic network diagram]
Entities: organism, environmental material, specimen source role, species/strain, common name, organism ID, qualities (age, gender, symptom), temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location)
Processes: specimen isolation procedure X
Relations: has_input, plays, denotes, has_quality, instance_of, has_part, located_in
Labels: v10-v13, b14-b17, b19, b20
Legend: vX – row X in virus sheet; node styles distinguish independent continuant, dependent continuant, occurrent and temporal-spatial region; italics mark relations

Page 17: GSC-BRC Metadata Standards

Specimen Isolation

[Figure: semantic network diagram]
Entities: organism, organism part, environmental material, environment, equipment, person (name, affiliation), specimen source role, specimen capture role, specimen collector role, specimen X, specimen type, ID, isolation protocol, specimen isolation procedure type, hypothesis, IRB/IACUC approval, Comments (marked "????"), temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location)
Processes: specimen isolation procedure X
Relations: has_input, has_output, plays, has_specification, has_part, has_quality, has_affiliation, has_authorization, instance_of, is_about, denotes, located_in
Labels: v2-v9, v15-v19, b18, b22-b30

Page 18: GSC-BRC Metadata Standards

Pathogen Detection

[Figure: semantic network diagram]
Entities: specimen X, specimen type, amount, ID, microorganism X, species/strain, pathogen detection method, pathogen detection protocol, data about pathogen presence, temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location)
Processes: pathogen detection process X
Relations: has_input, has_output, has_specification, has_part, has_quality, instance_of, is_about, denotes, located_in
Labels: v15, v16, v27, v28, b21

Page 19: GSC-BRC Metadata Standards

Pathogen Isolation

[Figure: semantic network diagram]
Entities: specimen X, specimen type, amount, ID, microorganism X, species/strain, pathogen isolate X, pathogen type, pathogen isolation method, pathogen isolation protocol, temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location)
Processes: pathogen isolation process X
Relations: has_input, has_output, has_specification, has_part, has_quality, instance_of, denotes, located_in
Labels: v15, v16, v26, v34

Page 20: GSC-BRC Metadata Standards

Pathogen Characterization

[Figure: semantic network diagram]
Entities: specimen X, specimen type, amount, ID, microorganism X, species/strain, pathogen isolate X, pathogen type, pathogen isolation method, pathogen isolation protocol, biovar characteristic, serovar characteristic, pathovar characteristic, genotype characteristic, chromosome/plasmid characteristic, antibody sensitivity characteristic, genus/species/strain characteristic, temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location)
Processes: pathogen isolation process X, biological characteristic assay X, antigenic characteristic assay X, pathologic characteristic assay X, genetic characteristic assay X, chromosome/plasmid assay X, antibiotic sensitivity assay X, genus/species/strain determination assay X
Relations: has_input, has_output, has_specification, has_part, has_quality, instance_of, is_about, denotes, located_in
Labels: v15, v16, v27, v29-v32, v34, b2-b13

Page 21: GSC-BRC Metadata Standards

Specimen Processing

[Figure: semantic network diagram]
Entities: specimen X, microorganism X, species/strain, sample set X, specimen X aliquot Y, specimen A aliquot B, specimen M aliquot N, specimen T aliquot U, repository specimen X, specimen repository, information record, ID, specimen type, amount, sample set assembly protocol, aliquoting protocol, temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location)
Processes: aliquoting process X, sample set assembly process X, repository deposition process X
Relations: has_input, has_output, has_specification, has_part, has_quality, instance_of, denotes, located_in
Labels: v15, v16, v20, v22, v23, v27, b40-b43

Page 22: GSC-BRC Metadata Standards

Sample Shipment

[Figure: semantic network diagram]
Entities: sample set X, sample set X in transit, sample set X at GSC, sample X at GSC, ID, sample set type, sample type, amount, sample shipment protocol, sample receipt protocol, temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location)
Processes: sample shipment process X, sample receipt process X
Relations: has_input, has_output, has_specification, has_part, has_quality, instance_of, denotes, located_in
Labels: v21, v24, v25

Page 23: GSC-BRC Metadata Standards

Sequencing Sample Preparation

[Figure: semantic network diagram]
Entities: specimen X, specimen aliquot X, microorganism X, microorganism genomic NA, species/strain, enriched NA sample X, NA amplified sample X, ID, specimen type, amount, aliquoting protocol, NA enrichment protocol, NA amplification protocol, library construction protocol, temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location)
Processes: aliquoting process X, NA enrichment process X, NA amplification process X
Relations: has_input, has_output, has_specification, has_part, has_quality, instance_of, denotes, located_in
Labels: v15, v16, v27, v33, v35-v39, b31-b33

Page 24: GSC-BRC Metadata Standards

Sequencing Assay

[Figure: semantic network diagram]
Entities: sample material X (sample ID, sample type, template role), material X (lot #, reagent type, reagent role), person X (name, species, sequencing tech. role), equipment X (serial #, equipment type, signal detection role), primary data, sequencing protocol, objectives (coverage, genome type targeted, finishing), run ID, sequencing assay type, temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location)
Processes: sequencing assay X
Relations: has_input, has_output, has_specification, has_part, plays, instance_of, denotes, located_in
Labels: v14, v40, v41, b34, b38

Page 25: GSC-BRC Metadata Standards

Data Transformations

[Figure: semantic network diagram]
Entities: primary data, sequence data, genotype data, serotype data, gene data, sequence data record, GenBank ID, microorganism X, microorganism genomic NA, species/strain, algorithm, software, data transfer protocol, finishing status, person X (name, species, bioinformatics tech. role), run ID, temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location)
Processes: data transformations (image processing, assembly X, variant detection, serotype marker detection, gene detection), data archiving process
Relations: has_input, has_output, has_specification, has_part, part_of, has_quality, plays, instance_of, is_about, denotes, located_in
Labels: v29-v32, v42-v47, b35-b37, b39

Page 26: GSC-BRC Metadata Standards

Generic Assay

[Figure: semantic network diagram]
Entities: sample material X (sample ID, sample type, target role), input sample material X, analyte X, quality x, material X (lot #, reagent type, reagent role), person X (name, species, technician role), equipment X (serial #, equipment type, signal detection role), primary data, assay protocol, objectives, run ID, assay type, temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location)
Processes: assay X
Relations: has_input, has_output, has_specification, has_part, has_quality, plays, instance_of, is_about, denotes, located_in

Page 27: GSC-BRC Metadata Standards

Generic Material Transformation

[Figure: semantic network diagram]
Entities: sample material X (sample ID, sample type, target role, quality x), output material X (sample ID, material type, quality x), material X (lot #, reagent type, reagent role), person X (name, species, technician role), equipment X (serial #, equipment type, signal detection role), material transformation protocol, objectives, run ID, material transformation type, temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location)
Processes: material transformation X
Relations: has_input, has_output, has_specification, has_part, has_quality, plays, instance_of, denotes, located_in

Page 28: GSC-BRC Metadata Standards

Generic Data Transformation

[Figure: semantic network diagram]
Entities: input data, output data, material X, algorithm, software, person X (name, data analyst role), run ID, data transformation type, temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location)
Processes: data transformation X
Relations: has_input, has_output, has_specification, has_part, plays, instance_of, is_about, denotes, located_in
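
The generic data transformation pattern on this slide (input data, output data, a specification, a person playing the data analyst role, a run ID) could be captured as a small record type. The following is a minimal sketch under stated assumptions: the `DataTransformation` class, the `external_inputs` helper, the software names, and the two-step example chain are hypothetical and only loosely follow the labels on the slides.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataTransformation:
    """One data transformation, with slots named after the slide's labels."""
    run_id: str          # run ID that denotes the transformation
    kind: str            # instance_of: data transformation type
    inputs: List[str]    # has_input: input data
    outputs: List[str]   # has_output: output data
    software: str        # has_specification (attachment point is assumed)
    analyst: str         # person who plays the data analyst role


def external_inputs(steps):
    """Return inputs that no earlier step in the list produced."""
    produced, external = set(), []
    for step in steps:
        for item in step.inputs:
            if item not in produced:
                external.append(item)
        produced.update(step.outputs)
    return external


# Hypothetical two-step chain; software and analyst names are placeholders.
steps = [
    DataTransformation("run-001", "assembly", ["primary data"],
                       ["sequence data"], "assembler (placeholder)", "analyst A"),
    DataTransformation("run-002", "variant detection", ["sequence data"],
                       ["genotype data"], "caller (placeholder)", "analyst A"),
]

print(external_inputs(steps))  # ['primary data']
```

Recording every step in the same shape makes it straightforward to check that each dataset in a submission is either supplied from outside or produced by a documented transformation.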

Page 29: GSC-BRC Metadata Standards

Generic Material (IC)

[Figure: semantic network diagram]
Entities: material X (ID, material type, quality x), material Y, material Z (quality y), temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location)
Relations: has_part, has_quality, instance_of, denotes, located_in

Page 30: GSC-BRC Metadata Standards

OBI specimen creation

[Figure: semantic network diagram using OBI terms]
Entities: organism (for ‘collecting specimen from an organism’), human being, material entity (for ‘environmental material collection’), anatomical entity (‘portion of body substance’ or ‘portion of tissue’), specimen, infectious agent, individual organism identifier, synonym, written name, CRID symbol, quality, measurement datum, time measurement datum, geographic location, growth environment, treatment, organization, protocol, specimen creation objective, textual entity, document, genetic characteristics information, information content entity
Processes: specimen creation
Relations: has_specified_input, has_specified_output, has_participant, has_agent, has_supplier, has_quality, is_quality_measured_as, is_member_of_organization, achieves_planned_objective, realizes, unfolds_in, is_duration_of, is_a, is_about, denotes, located_in
Labels: e14-e27, e29-e33, e35-e47, e50

Page 31: GSC-BRC Metadata Standards

Status

• Core metadata merge process nearly complete
• Comprehensive semantic networks developed
• Begun the OBI harmonization process
• Begun the MIGS/MIMS harmonization process
• Still need to:
  – Compare, harmonize, map with BioProjects and BioSamples
  – Decide what to do about metadata fields that appear to be project specific
  – Develop metadata submission templates
  – Report process and results