Post on 02-Jan-2016


Ontologies and data integration in biomedicine

Olivier Bodenreider

Lister Hill National Center for Biomedical Communications

Bethesda, Maryland - USA

Kno.e.sis

Wright State University, Dayton, Ohio
May 27, 2009

Outline

• Why integrate data?
• Ontologies and data integration
• Examples
• Challenging issues

Why integrate data?

• Sources of information
  • Created by
    • Independent researchers
    • Separate workflows
  • Heterogeneous
  • Scattered
  • “Silos”
• To identify patterns in integrated datasets
  • Hypothesis generation
  • Knowledge discovery

Motivation
Translational research

• “Bench to Bedside”
• Integration of clinical and research activities and results
• Supported by research programs
  • NIH Roadmap
  • Clinical and Translational Science Awards (CTSA)
• Requires the effective integration and exchange of information between
  • Basic research
  • Clinical research

Genotype and phenotype [Goh, PNAS 2007]

• OMIM
• [HPO]

Genes and environmental factors [Liu, BMC Bioinf. 2008]

• MEDLINE (MeSH index terms)
• Genetic Association Database

Integrating drugs and targets [Yildirim, Nature Biot. 2007]

• DrugBank
• ATC
• Gene Ontology

Why ontologies?

Uses of biomedical ontologies

• Knowledge management
  • Annotating data and resources
  • Accessing biomedical information
  • Mapping across biomedical ontologies
• Data integration, exchange and semantic interoperability
• Decision support
  • Data selection and aggregation
  • Decision support
  • NLP applications
  • Knowledge discovery

[Bodenreider, YBMI 2008]

Terminology and translational research

[Figure labels: Cancer Basic Research; EHR Cancer Patients; NCI Thesaurus; SNOMED CT]

Approaches to data integration (1)

• Warehousing
  • Sources to be integrated are transformed into a common format and converted to a common vocabulary
• Mediation
  • Local schema (of the sources)
  • Global schema (in reference to which the queries are made)

[Stein, Nature Rev. Gen. 2003] [Hernandez, SIGMOD Rec. 2004] [Goble, J. Biomedical Informatics 2008]

Approaches to data integration (2)

• Linked data
  • Links among data elements
  • Enable navigation by humans

[Stein, Nature Rev. Gen. 2003] [Hernandez, SIGMOD Rec. 2004] [Goble, J. Biomedical Informatics 2008]

Ontologies and warehousing

• Role
  • Provide a conceptualization of the domain
    • Help define the schema
    • Information model vs. ontology
  • Provide value sets for data elements
  • Enable standardization and sharing of data
• Examples
  • Annotations to the Gene Ontology
  • BioWarehouse [http://biowarehouse.ai.sri.com/]
  • Clinical information systems

Ontologies and mediation

• Role
  • Reference for defining the global schema
  • Map between local and global schemas
    • Query reformulation
    • Local-as-view vs. Global-as-view
• Examples
  • TAMBIS [Stevens, Bioinformatics 2000]
  • BioMediator [Louie, AMIA 2005]
  • OntoFusion [Perez-Rey, Comput Biol Med 2006]
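Query reformulation in a mediator can be sketched roughly as below. This is only an illustration of the global-as-view idea, assuming two hypothetical local sources whose field names differ; the source names, fields, and records are all invented, and this is not how TAMBIS, BioMediator, or OntoFusion are implemented.

```python
# Hypothetical local sources, each with its own schema (field names invented).
SOURCE_A = [{"gene_symbol": "ALDH2", "organism": "human"}]
SOURCE_B = [{"symbol": "Aldh2", "species": "mouse"}]

# Global-as-view: each global attribute is defined in terms of local fields.
GLOBAL_AS_VIEW = {
    "source_a": {"records": SOURCE_A,
                 "fields": {"gene": "gene_symbol", "taxon": "organism"}},
    "source_b": {"records": SOURCE_B,
                 "fields": {"gene": "symbol", "taxon": "species"}},
}


def query_global(**criteria):
    """Answer a query phrased in the global schema by rewriting it per source."""
    results = []
    for name, view in GLOBAL_AS_VIEW.items():
        # Query reformulation: translate global attribute names to local ones.
        local_criteria = {view["fields"][attr]: value
                          for attr, value in criteria.items()}
        for record in view["records"]:
            if all(record.get(f) == v for f, v in local_criteria.items()):
                # Report the hit back in the global vocabulary.
                row = {g: record[l] for g, l in view["fields"].items()}
                row["source"] = name
                results.append(row)
    return results


mouse_genes = query_global(taxon="mouse")
```

The caller only ever sees the global attribute names (`gene`, `taxon`); an ontology would play the role of the reference from which those names are drawn.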

Ontologies and linked data

• Role
  • Explicit conceptualization of the domain
  • Semantic normalization of data elements
• Examples
  • Entrez [http://www.ncbi.nlm.nih.gov/]
  • Semantic Web mashups [J. Biomedical Informatics 41(5) 2008]
  • Bio2RDF [http://bio2rdf.org/]

Ontologies and data integration

• Warehouse approaches
  • Source of identifiers for biomedical entities
  • Semantic normalization
• Mediator-based approaches
  • Source of reference relations for the global schema
  • Mapping between local and global schemas
• Linked data approaches
  • Source of identifiers for biomedical entities
  • Semantic normalization
  • Explicit conceptualization of the domain

Ontologies and data aggregation

• Source of hierarchical relations
  • Aggregate data into coarser categories
  • Abstract away from low-frequency, fine-grained data points
    • Increase power
    • Improve visualization
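Aggregation along hierarchical relations can be sketched as a roll-up of counts from fine-grained terms to their ancestors. The hierarchy and counts below are invented placeholders, not taken from any real ontology.

```python
from collections import Counter

# Hypothetical child -> parents map (is-a links); term names are placeholders.
IS_A = {
    "fine_term_1": ["mid_term"],
    "fine_term_2": ["mid_term"],
    "mid_term": ["coarse_term"],
    "coarse_term": [],
}


def ancestors(term):
    """Return the term itself plus everything reachable through is-a links."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(IS_A.get(current, []))
    return seen


def aggregate(counts):
    """Propagate each fine-grained count to the term and all of its ancestors."""
    rolled_up = Counter()
    for term, n in counts.items():
        for anc in ancestors(term):
            rolled_up[anc] += n
    return rolled_up


observed = {"fine_term_1": 2, "fine_term_2": 3}
totals = aggregate(observed)
```

Two sparse categories (2 and 3 observations) become one category of 5 at the `mid_term` level, which is the sense in which aggregation can increase power.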

Examples

Gene Ontology
http://www.geneontology.org/

Annotating data

• Gene Ontology
  • Functional annotation of gene products in several dozen model organisms
• Various communities use the same controlled vocabularies
• Enabling comparisons across model organisms
• Annotations
  • Assigned manually by curators
  • Inferred automatically (e.g., from sequence similarity)

GO Annotations for Aldh2 (mouse)
http://www.informatics.jax.org/

GO ALD4 in Yeast
http://db.yeastgenome.org/

GO Annotations for ALDH2 (Human)
http://www.ebi.ac.uk/GOA/

Integration applications

• Based on shared annotations
  • Enrichment analysis (within/across species)
  • Clustering (co-clustering with gene expression data)
• Based on the structure of GO
  • Closely related annotations
  • Semantic similarity
• Based on associations between gene products and annotations
  • Leveraging reasoning

[Bodenreider, PSB 2005] [Sahoo, Medinfo 2007] [Lord, PSB 2003]
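To make "semantic similarity" based on ontology structure concrete, here is a minimal sketch using Jaccard overlap of ancestor sets. This measure is chosen only for simplicity and is an assumption of this example; the cited works may use different measures. The toy hierarchy is invented.

```python
# Hypothetical is-a hierarchy (child -> parents); term names are placeholders.
PARENTS = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "a1": ["a"],
    "a2": ["a"],
}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    result = {term}
    for parent in PARENTS.get(term, []):
        result |= ancestors(parent)
    return result


def jaccard_similarity(term_1, term_2):
    """Shared ancestors divided by all ancestors of either term."""
    set_1, set_2 = ancestors(term_1), ancestors(term_2)
    return len(set_1 & set_2) / len(set_1 | set_2)


siblings = jaccard_similarity("a1", "a2")   # share "a" and "root"
distant = jaccard_similarity("a1", "b")     # share only "root"
```

Terms under the same parent score higher than terms that meet only at the root, which is the intuition behind treating closely related annotations as similar.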

Integration
Entrez Gene + GO

[Figure labels: Gene Ontology; Entrez Gene; gene; GO; PubMed; Gene name; OMIM; Sequence; Interactions; Glycosyltransferase; Congenital muscular dystrophy]

[Sahoo, Medinfo 2007]

From glycosyltransferase to congenital muscular dystrophy

[Figure: graph with the following nodes and relations]
• MIM:608840 Muscular dystrophy, congenital, type 1D
• EG:9215 LARGE
• GO:0008375 acetylglucosaminyltransferase
• GO:0008194
• GO:0016758
• GO:0016757 glycosyltransferase
• Relations: has_associated_phenotype, has_molecular_function, isa
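The path from glycosyltransferase to congenital muscular dystrophy can be sketched as a traversal over a small set of triples. The identifiers and relation names come from the slide; the exact edges between nodes are an assumption, read off the flattened figure, and the helper functions are hypothetical.

```python
# Triples assumed from the figure (the edge layout is a reconstruction).
TRIPLES = [
    ("EG:9215", "has_molecular_function", "GO:0008375"),
    ("EG:9215", "has_associated_phenotype", "MIM:608840"),
    ("GO:0008375", "isa", "GO:0008194"),
    ("GO:0008375", "isa", "GO:0016758"),
    ("GO:0008194", "isa", "GO:0016757"),
    ("GO:0016758", "isa", "GO:0016757"),
]


def objects(subject, predicate):
    """All objects of triples with the given subject and predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def is_subclass_of(term, target):
    """True if term equals target or reaches it through isa links."""
    if term == target:
        return True
    return any(is_subclass_of(parent, target) for parent in objects(term, "isa"))


def phenotypes_for_function(target_function):
    """Genes whose function falls under target_function, with their phenotypes."""
    hits = []
    for subject, predicate, obj in TRIPLES:
        if predicate == "has_molecular_function" and is_subclass_of(obj, target_function):
            for phenotype in objects(subject, "has_associated_phenotype"):
                hits.append((subject, phenotype))
    return hits


links = phenotypes_for_function("GO:0016757")
```

The gene is annotated with the specific function GO:0008375, not with GO:0016757 directly; the link to the broader function only appears once the isa hierarchy is followed.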

Examples

caBIG
http://cabig.nci.nih.gov/

Cancer Biomedical Informatics Grid

• US National Cancer Institute
• Common infrastructure used to share data and applications across institutions to support cancer research efforts in a grid environment
• Service-oriented architecture
  • Data and application services available on the grid
  • Supported by ontological resources

caBIG services

• caArray
  • Microarray data repository
• caTissue
  • Biospecimen repository
• caFE (Cancer Function Express)
  • Annotations on microarray data
• …
• caTRIP
  • Cancer Translational Research Informatics Platform
  • Integrates data services

Ontological resources

• NCI Thesaurus
  • Reference terminology for the cancer domain
  • ~ 60,000 concepts
  • OWL Lite
• Cancer Data Standards Repository (caDSR)
  • Metadata repository
  • Used to bridge across UML models through Common Data Elements
  • Links to concepts in ontologies

Examples

Semantic Web for Health Care and Life Sciences
http://www.w3.org/2001/sw/hcls/

Semantic Web layer cake

Linked data
linkeddata.org

Linked biomedical data
[Tim Berners-Lee, TED 2009 conference] http://www.w3.org/2009/Talks/0204-ted-tbl/#(1)

W3C Health Care and Life Sciences IG

Biomedical Semantic Web

• Integration
  • Data/Information
  • E.g., translational research
• Hypothesis generation
• Knowledge discovery

[Ruttenberg, BMC Bioinf. 2007]

HCLS mashup of biomedical sources

• NeuronDB
• BAMS
• NC Annotations
• Homologene
• SWAN
• Entrez Gene
• Gene Ontology
• Mammalian Phenotype
• PDSPki
• BrainPharm
• AlzGene
• Antibodies
• PubChem
• MeSH
• Reactome
• Allen Brain Atlas
• Publications

http://esw.w3.org/topic/HCLS/HCLSIG_DemoHomePage_HCLSIG_Demo

Shared identifiers
Example

[Figure label: GO]

HCLS mashup

• NeuronDB: Protein (channels/receptors), Neurotransmitters, Neuroanatomy, Cell, Compartments, Currents
• BAMS: Protein, Neuroanatomy, Cells, Metabolites (channels), PubMedID
• NC Annotations: Genes/Proteins, Processes, Cells (maybe), PubMed ID
• Allen Brain Atlas: Genes, Brain images, Gross anatomy -> neuroanatomy
• Homologene: Genes, Species, Orthologies, Proofs
• SWAN: PubMedID, Hypothesis, Questions, Evidence, Genes
• Entrez Gene: Genes, Protein, GO, PubMedID, Interaction (g/p), Chromosome, C. location
• GO: Molecular function, Cell components, Biological process, Annotation gene, PubMedID
• Mammalian Phenotype: Genes, Phenotypes, Disease, PubMedID
• PDSPki: Proteins, Chemicals, Neurotransmitters
• BrainPharm: Drug, Drug effect, Pathological agent, Phenotype, Receptors, Channels, Cell types, PubMedID, Disease
• AlzGene: Gene, Polymorphism, Population, Alz Diagnosis
• Antibodies: Genes, Antibodies
• PubChem: Name, Structure, Properties, MeSH term
• MeSH: Drugs, Anatomy, Phenotypes, Compounds, Chemicals, PubMedID, PubChem
• Reactome: Genes/proteins, Interactions, Cellular location, Processes (GO)

HCLS mashups

• Based on RDF/OWL
• Based on shared identifiers
  • “Recombinant data” (E. Neumann)
• Ontologies used in some cases
• Support applications (SWAN, SenseLab, etc.)
• Journal of Biomedical Informatics special issue on Semantic bio-mashups [J. Biomedical Informatics 41(5) 2008]
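The role of shared identifiers in a mashup can be sketched as follows: statements from independent sources "recombine" simply because they use the same identifier for the same entity. The sources, predicates, and example.org URIs are all invented for illustration; plain tuples stand in for RDF triples.

```python
from collections import defaultdict

# Hypothetical statements from two independent sources (URIs are invented).
SOURCE_1 = [
    ("http://example.org/gene/G1", "has_function", "http://example.org/go/F1"),
]
SOURCE_2 = [
    ("http://example.org/gene/G1", "associated_with", "http://example.org/disease/D1"),
    ("http://example.org/gene/G2", "associated_with", "http://example.org/disease/D2"),
]


def merge(*sources):
    """Index all statements by subject, regardless of which source made them."""
    merged = defaultdict(dict)
    for source in sources:
        for subject, predicate, obj in source:
            merged[subject].setdefault(predicate, set()).add(obj)
    return merged


mashup = merge(SOURCE_1, SOURCE_2)
g1 = mashup["http://example.org/gene/G1"]
```

No mapping step is needed here because both sources already agree on the identifier for G1; when they do not, mappings across identifiers have to be supplied first.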

Semantic bio-mashups

• Bio2RDF: Towards a mashup to build bioinformatics knowledge systems
• Identifying disease-causal genes using Semantic Web-based representation of integrated genomic and phenomic knowledge
• Schema driven assignment and implementation of life science identifiers (LSIDs)
• The SWAN biomedical discourse ontology
• An ontology-driven semantic mashup of gene and biological pathway information: Application to the domain of nicotine dependence
• Towards an ontology for sharing medical images and regions of interest in neuroimaging
• yOWL: An ontology-driven knowledge base for yeast biologists
• Dynamic sub-ontology evolution for traditional Chinese medicine web ontology
• Ontology-centric integration and navigation of the dengue literature
• Infrastructure for dynamic knowledge integration - Automated biomedical ontology extension using textual resources
• An ontological knowledge framework for adaptive medical workflow
• Semi-automatic web service composition for the life sciences using the BioMoby semantic web framework
• Combining Semantic Web technologies with Multi-Agent Systems for integrated access to biological resources

[J. Biomedical Informatics 41(5) 2008]

Challenging issues

• Bridges across ontologies
• Permanent identifiers for biomedical entities
• Other issues

Challenging issues
Bridges across ontologies

Trans-namespace integration

• Biomedical literature (MeSH): Addison Disease (D000224)
• Clinical repositories (SNOMED CT): Addison's disease (363732003)
• ICD 10: Primary adrenocortical insufficiency (E27.1)

(Integrated) concept repositories

• Unified Medical Language System
  http://umlsks.nlm.nih.gov
• NCBO’s BioPortal
  http://www.bioontology.org/tools/portal/bioportal.html
• caDSR
  http://ncicb.nci.nih.gov/NCICB/infrastructure/cacore_overview/cadsr
• Open Biomedical Ontologies (OBO)
  http://obofoundry.org/

Integrating subdomains

[Figure: subdomains, each with its vocabulary, plus the UMLS]
• Biomedical literature: MeSH
• Genome annotations: GO
• Model organisms: NCBI Taxonomy
• Genetic knowledge bases: OMIM
• Clinical repositories: SNOMED CT
• Anatomy: FMA
• Other subdomains
• UMLS

Trans-namespace integration

[Figure: subdomains and vocabularies (Genome annotations: GO; Model organisms: NCBI Taxonomy; Genetic knowledge bases: OMIM; Anatomy: FMA; Other subdomains), plus the UMLS]
• Biomedical literature (MeSH): Addison Disease (D000224)
• Clinical repositories (SNOMED CT): Addison's disease (363732003)
• UMLS: C0001403
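Trans-namespace integration through a shared concept identifier can be sketched as a small crosswalk. The three identifiers are the ones shown on the slide; the table layout and the `translate` helper are hypothetical and do not reflect the actual UMLS file formats or API.

```python
# (namespace, code) -> shared concept identifier; values are from the slide.
TO_CONCEPT = {
    ("MeSH", "D000224"): "C0001403",
    ("SNOMED CT", "363732003"): "C0001403",
}


def translate(source, code, target):
    """Map a code from one namespace to another through the shared concept."""
    concept = TO_CONCEPT.get((source, code))
    if concept is None:
        return []
    return [c for (namespace, c), cui in TO_CONCEPT.items()
            if cui == concept and namespace == target]


snomed_codes = translate("MeSH", "D000224", "SNOMED CT")
```

Literature indexed with the MeSH descriptor and clinical data coded in SNOMED CT can then be joined on the shared concept rather than on either local code.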

Mappings

• Created manually (e.g., UMLS)
  • Purpose
  • Directionality
• Created automatically (e.g., BioPortal)
  • Lexically: ambiguity, normalization
  • Semantically: lack of / incomplete formal definitions
• Key to enabling semantic interoperability
• Enabling resource for the Semantic Web
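A lexical mapping with normalization can be sketched as below, using the two names for Addison's disease from the earlier slides. The normalization rules (lowercasing, dropping possessives and punctuation, ignoring word order) are assumptions chosen for this example, not the procedure used by the UMLS or BioPortal.

```python
import re


def normalize(term):
    """Reduce a term to a crude canonical form for lexical comparison."""
    term = term.lower()
    term = re.sub(r"'s\b", "", term)           # drop possessive marker
    term = re.sub(r"[^a-z0-9 ]+", " ", term)   # punctuation to spaces
    return " ".join(sorted(term.split()))      # ignore word order


def lexical_mappings(vocab_a, vocab_b):
    """Pair codes from two vocabularies whose names normalize identically."""
    index = {}
    for code, name in vocab_b.items():
        index.setdefault(normalize(name), []).append(code)
    return [(code_a, code_b)
            for code_a, name in vocab_a.items()
            for code_b in index.get(normalize(name), [])]


mesh = {"D000224": "Addison Disease"}
snomed = {"363732003": "Addison's disease"}
pairs = lexical_mappings(mesh, snomed)
```

The same mechanism also pairs terms that merely look alike, which is the ambiguity problem noted on the slide; lexical matches generally need semantic or manual validation.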

Challenging issues
Permanent identifiers for biomedical entities

Identifying biomedical entities

• Multiple identifiers for the same entity in different ontologies
• Barrier to data integration in general
  • Data annotated to different ontologies cannot “recombine”
  • Need for mappings across ontologies
• Barrier to data integration in the Semantic Web
  • Multiple possible identifiers for the same entity
    • Depending on the underlying representational scheme (URI vs. LSID)
    • Depending on who creates the URI

Possible solutions

• PURL http://purl.org
  • One level of indirection between developers and users
  • Independence from local constraints at the developer’s end
• The institution creating a resource is also responsible for minting URIs
  • E.g., URI for genes in Entrez Gene
• Guidelines: “URI note”
  • W3C Health Care and Life Sciences Interest Group
• Shared names initiative [http://sharedname.org/]
  • Identify resources vs. entities
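The "one level of indirection" offered by a persistent identifier can be sketched as a resolver table. The URIs below use the reserved example.org domain and are invented; this models the idea only and is not the purl.org service or its API.

```python
# Hypothetical resolver table: persistent identifier -> current location.
RESOLVER = {
    "http://purl.example.org/gene/9215": "http://old-host.example.org/genes?id=9215",
}


def resolve(persistent_uri):
    """Return the current location registered for a persistent identifier."""
    return RESOLVER[persistent_uri]


def relocate(persistent_uri, new_location):
    """Developer side: move the resource without changing its public identifier."""
    RESOLVER[persistent_uri] = new_location


before = resolve("http://purl.example.org/gene/9215")
relocate("http://purl.example.org/gene/9215", "http://new-host.example.org/gene/9215")
after = resolve("http://purl.example.org/gene/9215")
```

Users keep citing the same persistent URI before and after the move; only the developer's entry in the table changes.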

Challenging issues
Other issues

Availability

• Many ontologies are freely available
• The UMLS is freely available for research purposes
  • Cost-free license required
• Licensing issues can be tricky
  • SNOMED CT is freely available in member countries of the IHTSDO
• Being freely available
  • Is a requirement for the Open Biomedical Ontologies (OBO)
  • Is a de facto prerequisite for Semantic Web applications

Discoverability

• Ontology repositories
  • UMLS: 152 source vocabularies (biased towards healthcare applications)
  • NCBO BioPortal: ~141 ontologies (biased towards biological applications)
  • Limited overlap between the two repositories
• Need for discovery services
  • Metadata for ontologies

Formalism

• Several major formalisms
  • Web Ontology Language (OWL) – NCI Thesaurus
  • OBO format – most OBO ontologies
  • UMLS Rich Release Format (RRF) – UMLS, RxNorm
• Conversion mechanisms
  • OBO to OWL
  • LexGrid (import/export to LexGrid internal format)

Ontology integration

• Post hoc integration, from the bottom up
  • UMLS approach
  • Integrates ontologies “as is”, including legacy ontologies
  • Facilitates the integration of the corresponding datasets
• Coordinated development of ontologies
  • OBO Foundry approach
  • Ensures consistency ab initio
  • Excludes legacy ontologies

Quality

• Quality assurance in ontologies is still imperfectly defined
  • Difficult to define outside a use case or application
• Several approaches to evaluating quality
  • Collaboratively, by users (Web 2.0 approach)
    • Marginal notes enabled by BioPortal
  • Centrally, by experts
    • OBO Foundry approach
• Important factors besides quality
  • Governance
  • Installed base / Community of practice

Conclusions

• Ontologies are enabling resources for data integration
• Standardization works
  • Grass roots effort (GO)
  • Regulatory context (ICD 9-CM)
• Bridging across resources is crucial
  • Ontology integration resources / strategies (UMLS, BioPortal / OBO Foundry)
• Massive amounts of imperfect data integrated with rough methods might still be useful

Medical Ontology Research

Olivier Bodenreider

Lister Hill National Center for Biomedical Communications
Bethesda, Maryland - USA

Contact: olivier@nlm.nih.gov
Web: mor.nlm.nih.gov
