Ontologies and data integration in biomedicine
Olivier Bodenreider
Lister Hill National Center for Biomedical Communications
Bethesda, Maryland - USA
Kno.e.sis
Wright State University, Dayton, Ohio
May 27, 2009
Outline
  Why integrate data?
  Ontologies and data integration
  Examples
  Challenging issues
Why integrate data?
  Sources of information
    Created by
      Independent researchers
      Separate workflows
    Heterogeneous
    Scattered
    “Silos”
  To identify patterns in integrated datasets
    Hypothesis generation
    Knowledge discovery
Motivation
  Translational research
    “Bench to Bedside”
    Integration of clinical and research activities and results
    Supported by research programs
      NIH Roadmap
      Clinical and Translational Science Awards (CTSA)
  Requires the effective integration and exchange of information between
    Basic research
    Clinical research
Genotype and phenotype [Goh, PNAS 2007]
  • OMIM
  • [HPO]
Genes and environmental factors [Liu, BMC Bioinf. 2008]
  • MEDLINE (MeSH index terms)
  • Genetic Association Database
Integrating drugs and targets [Yildirim, Nature Biot. 2007]
  • DrugBank
  • ATC
  • Gene Ontology
Why ontologies?
Uses of biomedical ontologies
  Knowledge management
    Annotating data and resources
    Accessing biomedical information
    Mapping across biomedical ontologies
  Data integration, exchange and semantic interoperability
  Decision support
    Data selection and aggregation
    Decision support
    NLP applications
    Knowledge discovery
  [Bodenreider, YBMI 2008]
Terminology and translational research
  [Figure labels: Cancer, Basic Research, EHR, Cancer Patients; NCI Thesaurus, SNOMED CT]
Approaches to data integration (1)
  Warehousing
    Sources to be integrated are transformed into a common format and converted to a common vocabulary
  Mediation
    Local schema (of the sources)
    Global schema (in reference to which the queries are made)
  [Stein, Nature Rev. Gen. 2003] [Hernandez, SIGMOD Rec. 2004] [Goble, J. Biomedical Informatics 2008]
Approaches to data integration (2)
  Linked data
    Links among data elements
    Enable navigation by humans
  [Stein, Nature Rev. Gen. 2003] [Hernandez, SIGMOD Rec. 2004] [Goble, J. Biomedical Informatics 2008]
Ontologies and warehousing
  Role
    Provide a conceptualization of the domain
      Help define the schema
      Information model vs. ontology
    Provide value sets for data elements
    Enable standardization and sharing of data
  Examples
    Annotations to the Gene Ontology
    BioWarehouse [http://biowarehouse.ai.sri.com/]
    Clinical information systems
Ontologies and mediation
  Role
    Reference for defining the global schema
    Map between local and global schemas
      Query reformulation
      Local-as-view vs. Global-as-view
  Examples
    TAMBIS [Stevens, Bioinformatics 2000]
    BioMediator [Louie, AMIA 2005]
    OntoFusion [Perez-Rey, Comput Biol Med 2006]
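As a minimal sketch of the global-as-view idea named on this slide, the Python fragment below defines one global relation as a view over two local sources and answers a query posed against the global schema only. The source layouts, field names, and the record for "geneX" are hypothetical and serve only as illustration; real mediators such as those listed above are far richer.

```python
# Two hypothetical local sources, each with its own schema.
local_a = [{"symbol": "LARGE", "go": "GO:0008375"}]
local_b = [{"name": "geneX", "function_id": "GO:0016757"}]  # made-up record


def global_annotation():
    """Global-as-view: the global relation annotation(gene, function)
    is defined as a view (here, a union) over the local sources."""
    for row in local_a:
        yield {"gene": row["symbol"], "function": row["go"]}
    for row in local_b:
        yield {"gene": row["name"], "function": row["function_id"]}


def query_global(function):
    """Answer a query written against the global schema only;
    the view definition hides the local field names."""
    return sorted(r["gene"] for r in global_annotation()
                  if r["function"] == function)


genes = query_global("GO:0008375")
```

In a local-as-view design the direction is reversed: each source is described as a view over the global schema, and the mediator has to rewrite the query in terms of those views.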
Ontologies and linked data
  Role
    Explicit conceptualization of the domain
    Semantic normalization of data elements
  Examples
    Entrez [http://www.ncbi.nlm.nih.gov/]
    Semantic Web mashups [J. Biomedical Informatics 41(5) 2008]
    Bio2RDF [http://bio2rdf.org/]
Ontologies and data integration
  Source of identifiers for biomedical entities
    Semantic normalization
    Warehouse approaches
  Source of reference relations for the global schema
    Mapping between local and global schemas
    Mediator-based approaches
  Source of identifiers for biomedical entities
    Semantic normalization
    Explicit conceptualization of the domain
    Linked data approaches
Ontologies and data aggregation
  Source of hierarchical relations
    Aggregate data into coarser categories
    Abstract away from low-frequency, fine-grained data points
      Increase power
      Improve visualization
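The roll-up described on this slide can be sketched in a few lines of Python. The hierarchy, term labels, and counts below are toy assumptions chosen for illustration, not data from the talk; the point is only that isa links let sparse, fine-grained counts be summed under a coarser category.

```python
from collections import Counter

# Assumed toy hierarchy: child -> parent (isa). Labels are illustrative.
PARENT = {
    "acetylglucosaminyltransferase": "glycosyltransferase",
    "fucosyltransferase": "glycosyltransferase",
    "glycosyltransferase": "transferase",
    "kinase": "transferase",
}


def ancestors_or_self(term):
    """Yield the term, then each ancestor, following isa links upward."""
    while term is not None:
        yield term
        term = PARENT.get(term)


def aggregate(counts, level_terms):
    """Roll each fine-grained count up to the nearest term in level_terms.
    Terms with no ancestor in level_terms are left out of the result."""
    rolled = Counter()
    for term, n in counts.items():
        for candidate in ancestors_or_self(term):
            if candidate in level_terms:
                rolled[candidate] += n
                break
    return rolled


fine = {"acetylglucosaminyltransferase": 2, "fucosyltransferase": 1, "kinase": 5}
coarse = aggregate(fine, {"glycosyltransferase", "kinase"})
```

Choosing `level_terms` is the design decision: the higher the level, the larger the counts per category and the less specific the categories.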
Examples
  Gene Ontology
  http://www.geneontology.org/
Annotating data
  Gene Ontology
    Functional annotation of gene products in several dozen model organisms
  Various communities use the same controlled vocabularies
    Enabling comparisons across model organisms
  Annotations
    Assigned manually by curators
    Inferred automatically (e.g., from sequence similarity)
GO Annotations for Aldh2 (mouse)
  http://www.informatics.jax.org/
GO ALD4 in Yeast
  http://db.yeastgenome.org/
GO Annotations for ALDH2 (Human)
  http://www.ebi.ac.uk/GOA/
Integration applications
  Based on shared annotations
    Enrichment analysis (within/across species)
    Clustering (co-clustering with gene expression data)
  Based on the structure of GO
    Closely related annotations
    Semantic similarity
  Based on associations between gene products and annotations
    Leveraging reasoning
  [Bodenreider, PSB 2005] [Sahoo, Medinfo 2007] [Lord, PSB 2003]
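The slide names enrichment analysis without specifying a statistical test. A hypergeometric tail probability is one common choice, and the sketch below assumes it; the function name, parameter names, and numbers are illustrative only.

```python
from math import comb


def enrichment_p_value(N, K, n, k):
    """Probability of seeing at least k annotated genes in a sample of n,
    when K of the N genes in the population carry the annotation.
    This is the upper tail of a hypergeometric distribution."""
    upper = min(K, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


# Toy numbers: 10 genes overall, 5 carry the term, and all 5 genes
# in the study set carry it.
p = enrichment_p_value(N=10, K=5, n=5, k=5)
```

Because the computation depends only on which genes share an annotation, the same function applies within one species or across species, provided the annotations use the same GO terms.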
Integration: Entrez Gene + GO [Sahoo, Medinfo 2007]
  [Figure labels: Gene Ontology; Entrez Gene (gene, GO, PubMed, Gene name, OMIM, Sequence, Interactions); Glycosyltransferase; Congenital muscular dystrophy]
From glycosyltransferase to congenital muscular dystrophy
  [Figure: graph over the following nodes and relations]
  EG:9215 (LARGE)
    has_molecular_function: GO:0008375 (acetylglucosaminyltransferase)
    has_associated_phenotype: MIM:608840 (Muscular dystrophy, congenital, type 1D)
  isa links among GO:0008375 (acetylglucosaminyltransferase), GO:0008194, GO:0016758 and GO:0016757 (glycosyltransferase)
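The path shown on this slide can be followed with a small traversal over triples. The sketch below uses the identifiers from the slide; the exact orientation of the isa edges is an assumed reading of the flattened figure, and the function names are hypothetical.

```python
# Triples transcribed from the slide. The isa edges (child, "isa", parent)
# are an assumed reading of the figure, not a verified copy of GO.
TRIPLES = [
    ("EG:9215", "has_molecular_function", "GO:0008375"),
    ("EG:9215", "has_associated_phenotype", "MIM:608840"),
    ("GO:0008375", "isa", "GO:0008194"),
    ("GO:0008375", "isa", "GO:0016758"),
    ("GO:0008194", "isa", "GO:0016757"),
    ("GO:0016758", "isa", "GO:0016757"),
]


def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is a triple."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def is_a(term, ancestor):
    """True if term equals ancestor or reaches it through isa edges."""
    if term == ancestor:
        return True
    return any(is_a(parent, ancestor) for parent in objects(term, "isa"))


def phenotypes_for_function(function_id):
    """Phenotypes of genes whose molecular function is subsumed by function_id."""
    genes = {s for s, p, o in TRIPLES
             if p == "has_molecular_function" and is_a(o, function_id)}
    return sorted(ph for g in genes
                  for ph in objects(g, "has_associated_phenotype"))


# Start from glycosyltransferase (GO:0016757) and end at a phenotype.
result = phenotypes_for_function("GO:0016757")
```

The gene is annotated with the specific term GO:0008375 only; it is the isa hierarchy that connects it to the broader query term.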
Examples
  caBIG
  http://cabig.nci.nih.gov/
Cancer Biomedical Informatics Grid
  US National Cancer Institute
  Common infrastructure used to share data and applications across institutions to support cancer research efforts in a grid environment
  Service-oriented architecture
    Data and application services available on the grid
  Supported by ontological resources
caBIG services
  caArray
    Microarray data repository
  caTissue
    Biospecimen repository
  caFE (Cancer Function Express)
    Annotations on microarray data
  …
  caTRIP
    Cancer Translational Research Informatics Platform
    Integrates data services
Ontological resources
  NCI Thesaurus
    Reference terminology for the cancer domain
    ~ 60,000 concepts
    OWL Lite
  Cancer Data Standards Repository (caDSR)
    Metadata repository
    Used to bridge across UML models through Common Data Elements
    Links to concepts in ontologies
Examples
  Semantic Web for Health Care and Life Sciences
  http://www.w3.org/2001/sw/hcls/
Semantic Web layer cake
Linked data
  linkeddata.org
Linked biomedical data
  [Tim Berners-Lee, TED 2009 conference] http://www.w3.org/2009/Talks/0204-ted-tbl/#(1)
W3C Health Care and Life Sciences IG
Biomedical Semantic Web
  Integration
    Data/Information
    E.g., translational research
  Hypothesis generation
  Knowledge discovery
  [Ruttenberg, BMC Bioinf. 2007]
HCLS mashup of biomedical sources
  NeuronDB
  BAMS
  NC Annotations
  Homologene
  SWAN
  Entrez Gene
  Gene Ontology
  Mammalian Phenotype
  PDSPki
  BrainPharm
  AlzGene
  Antibodies
  PubChem
  MeSH
  Reactome
  Allen Brain Atlas
  Publications
  http://esw.w3.org/topic/HCLS/HCLSIG_DemoHomePage_HCLSIG_Demo
Shared identifiers
  Example
  GO
HCLS mashup
  NeuronDB: Protein (channels/receptors), Neurotransmitters, Neuroanatomy, Cell, Compartments, Currents
  BAMS: Protein, Neuroanatomy, Cells, Metabolites (channels), PubMedID
  NC Annotations: Genes/Proteins, Processes, Cells (maybe), PubMed ID
  Allen Brain Atlas: Genes, Brain images, Gross anatomy -> neuroanatomy
  Homologene: Genes, Species, Orthologies, Proofs
  SWAN: PubMedID, Hypothesis, Questions, Evidence, Genes
  Entrez Gene: Genes, Protein, GO, PubMedID, Interaction (g/p), Chromosome, C. location
  GO: Molecular function, Cell components, Biological process, Annotation gene, PubMedID
  Mammalian Phenotype: Genes, Phenotypes, Disease, PubMedID
  PDSPki: Proteins, Chemicals, Neurotransmitters
  BrainPharm: Drug, Drug effect, Pathological agent, Phenotype, Receptors, Channels, Cell types, PubMedID, Disease
  AlzGene: Gene, Polymorphism, Population, Alz Diagnosis
  Antibodies: Genes, Antibodies
  PubChem: Name, Structure, Properties, MeSH term
  MeSH: Drugs, Anatomy, Phenotypes, Compounds, Chemicals, PubMedID, PubChem
  Reactome: Genes/proteins, Interactions, Cellular location, Processes (GO)
HCLS mashups
  Based on RDF/OWL
  Based on shared identifiers
    “Recombinant data” (E. Neumann)
  Ontologies used in some cases
  Support applications (SWAN, SenseLab, etc.)
  Journal of Biomedical Informatics special issue on Semantic bio-mashups [J. Biomedical Informatics 41(5) 2008]
Semantic bio-mashups [J. Biomedical Informatics 41(5) 2008]
  Bio2RDF: Towards a mashup to build bioinformatics knowledge systems
  Identifying disease-causal genes using Semantic Web-based representation of integrated genomic and phenomic knowledge
  Schema driven assignment and implementation of life science identifiers (LSIDs)
  The SWAN biomedical discourse ontology
  An ontology-driven semantic mashup of gene and biological pathway information: Application to the domain of nicotine dependence
  Towards an ontology for sharing medical images and regions of interest in neuroimaging
  yOWL: An ontology-driven knowledge base for yeast biologists
  Dynamic sub-ontology evolution for traditional Chinese medicine web ontology
  Ontology-centric integration and navigation of the dengue literature
  Infrastructure for dynamic knowledge integration—Automated biomedical ontology extension using textual resources
  An ontological knowledge framework for adaptive medical workflow
  Semi-automatic web service composition for the life sciences using the BioMoby semantic web framework
  Combining Semantic Web technologies with Multi-Agent Systems for integrated access to biological resources
Challenging issues
  Bridges across ontologies
  Permanent identifiers for biomedical entities
  Other issues
Challenging issues
  Bridges across ontologies
Trans-namespace integration
  Biomedical literature (MeSH): Addison Disease (D000224)
  Clinical repositories (SNOMED CT): Addison's disease (363732003)
  ICD 10: Primary adrenocortical insufficiency (E27.1)
(Integrated) concept repositories
  Unified Medical Language System
    http://umlsks.nlm.nih.gov
  NCBO’s BioPortal
    http://www.bioontology.org/tools/portal/bioportal.html
  caDSR
    http://ncicb.nci.nih.gov/NCICB/infrastructure/cacore_overview/cadsr
  Open Biomedical Ontologies (OBO)
    http://obofoundry.org/
Integrating subdomains
  [Figure: subdomains and their terminologies, integrated through the UMLS]
  Biomedical literature: MeSH
  Genome annotations: GO
  Model organisms: NCBI Taxonomy
  Genetic knowledge bases: OMIM
  Clinical repositories: SNOMED CT
  Anatomy: FMA
  Other subdomains: …
Trans-namespace integration
  [Figure: subdomains and their terminologies (GO, NCBI Taxonomy, OMIM, FMA, …), integrated through the UMLS]
  Biomedical literature (MeSH): Addison Disease (D000224)
  Clinical repositories (SNOMED CT): Addison's disease (363732003)
  UMLS: C0001403
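A minimal Python sketch of this bridging step, using only the three identifiers shown on the slide. The table layout and the `translate` function are illustrative assumptions and do not reflect how the UMLS is actually distributed or queried.

```python
# (vocabulary, code) -> UMLS concept identifier, as shown on the slide.
CODE_TO_CUI = {
    ("MeSH", "D000224"): "C0001403",
    ("SNOMED CT", "363732003"): "C0001403",
}


def translate(vocab, code, target_vocab):
    """Return the codes in target_vocab that share a concept with (vocab, code)."""
    cui = CODE_TO_CUI.get((vocab, code))
    if cui is None:
        return []
    return sorted(c for (v, c), u in CODE_TO_CUI.items()
                  if v == target_vocab and u == cui)


# From a literature index term to the code used in clinical repositories.
snomed_codes = translate("MeSH", "D000224", "SNOMED CT")
```

Routing every vocabulary through one shared concept identifier means each vocabulary needs a single mapping to the hub, rather than one mapping per pair of vocabularies.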
Mappings
  Created manually (e.g., UMLS)
    Purpose
    Directionality
  Created automatically (e.g., BioPortal)
    Lexically: ambiguity, normalization
    Semantically: lack of / incomplete formal definitions
  Key to enabling semantic interoperability
  Enabling resource for the Semantic Web
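To make the lexical route concrete, here is a small Python sketch that maps terms whose normalized names coincide. The normalization rules are a deliberately crude assumption (lowercasing, dropping possessives and punctuation, ignoring word order) and are not the procedure used by the UMLS or BioPortal.

```python
import re


def normalize(term):
    """Lowercase, drop possessive 's and punctuation, and sort the words."""
    term = term.lower()
    term = re.sub(r"'s\b", "", term)
    term = re.sub(r"[^a-z0-9 ]+", " ", term)
    return " ".join(sorted(term.split()))


def lexical_mappings(terms_a, terms_b):
    """Pair identifiers from two vocabularies whose normalized names match."""
    index = {}
    for ident, name in terms_b.items():
        index.setdefault(normalize(name), []).append(ident)
    return [(a, b)
            for a, name in terms_a.items()
            for b in index.get(normalize(name), [])]


mesh = {"D000224": "Addison Disease"}
snomed = {"363732003": "Addison's disease"}
pairs = lexical_mappings(mesh, snomed)
```

The ambiguity mentioned on the slide shows up here directly: two distinct entities that happen to share a normalized name would be mapped to each other, so lexical matches generally need review or additional semantic checks.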
Challenging issues
  Permanent identifiers for biomedical entities
Identifying biomedical entities
  Multiple identifiers for the same entity in different ontologies
  Barrier to data integration in general
    Data annotated to different ontologies cannot “recombine”
    Need for mappings across ontologies
  Barrier to data integration in the Semantic Web
    Multiple possible identifiers for the same entity
      Depending on the underlying representational scheme (URI vs. LSID)
      Depending on who creates the URI
Possible solutions
  PURL http://purl.org
    One level of indirection between developers and users
    Independence from local constraints at the developer’s end
  The institution creating a resource is also responsible for minting URIs
    E.g., URI for genes in Entrez Gene
  Guidelines: “URI note”
    W3C Health Care and Life Sciences Interest Group
  Shared names initiative [http://sharedname.org/]
    Identify resources vs. entities
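The "one level of indirection" behind persistent URLs can be illustrated with a lookup table. This is only a sketch of the idea: the URLs are made-up examples on example.org, and a real PURL service resolves identifiers through HTTP redirects rather than an in-memory dictionary.

```python
# Hypothetical resolver table: persistent identifier -> current location.
RESOLVER = {
    "http://purl.example.org/gene/9215":
        "http://old-host.example.org/genes?id=9215",
}


def resolve(persistent_url):
    """Return the current location registered for a persistent identifier."""
    return RESOLVER[persistent_url]


def move_resource(persistent_url, new_location):
    """The developer relocates the resource; the persistent URL is unchanged."""
    RESOLVER[persistent_url] = new_location


# A dataset stores only the persistent identifier.
link_in_dataset = "http://purl.example.org/gene/9215"
before = resolve(link_in_dataset)
move_resource(link_in_dataset, "http://new-host.example.org/gene/9215")
after = resolve(link_in_dataset)
```

Users keep citing the same identifier before and after the move, which is the independence from local constraints that the slide refers to.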
Challenging issues
  Other issues
Availability
  Many ontologies are freely available
  The UMLS is freely available for research purposes
    Cost-free license required
  Licensing issues can be tricky
    SNOMED CT is freely available in member countries of the IHTSDO
  Being freely available
    Is a requirement for the Open Biomedical Ontologies (OBO)
    Is a de facto prerequisite for Semantic Web applications
Discoverability
  Ontology repositories
    UMLS: 152 source vocabularies (biased towards healthcare applications)
    NCBO BioPortal: ~141 ontologies (biased towards biological applications)
    Limited overlap between the two repositories
  Need for discovery services
    Metadata for ontologies
Formalism
  Several major formalisms
    Web Ontology Language (OWL) – NCI Thesaurus
    OBO format – most OBO ontologies
    UMLS Rich Release Format (RRF) – UMLS, RxNorm
  Conversion mechanisms
    OBO to OWL
    LexGrid (import/export to LexGrid internal format)
Ontology integration
  Post hoc integration, from the bottom up
    UMLS approach
    Integrates ontologies “as is”, including legacy ontologies
    Facilitates the integration of the corresponding datasets
  Coordinated development of ontologies
    OBO Foundry approach
    Ensures consistency ab initio
    Excludes legacy ontologies
Quality
  Quality assurance in ontologies is still imperfectly defined
    Difficult to define outside a use case or application
  Several approaches to evaluating quality
    Collaboratively, by users (Web 2.0 approach)
      Marginal notes enabled by BioPortal
    Centrally, by experts
      OBO Foundry approach
  Important factors besides quality
    Governance
    Installed base / Community of practice
Conclusions
  Ontologies are enabling resources for data integration
  Standardization works
    Grass roots effort (GO)
    Regulatory context (ICD 9-CM)
  Bridging across resources is crucial
    Ontology integration resources / strategies (UMLS, BioPortal / OBO Foundry)
  Massive amounts of imperfect data integrated with rough methods might still be useful
Medical Ontology Research
Olivier Bodenreider
Lister Hill National Center for Biomedical Communications
Bethesda, Maryland - USA
Contact: [email protected]
Web: mor.nlm.nih.gov