Page 1: Ontologies and data integration in biomedicine Olivier Bodenreider Lister Hill National Center for Biomedical Communications Bethesda, Maryland - USA Kno.e.sis

Ontologies and data integration in biomedicine

Olivier Bodenreider

Lister Hill National Center for Biomedical Communications

Bethesda, Maryland - USA

Kno.e.sis

Wright State University, Dayton, Ohio
May 27, 2009

Page 2: Outline

• Why integrate data?
• Ontologies and data integration
• Examples
• Challenging issues

Page 3: Why integrate data?

Page 4: Why integrate data?

• Sources of information
  • Created by
    • Independent researchers
    • Separate workflows
  • Heterogeneous
  • Scattered
  • “Silos”
• To identify patterns in integrated datasets
  • Hypothesis generation
  • Knowledge discovery

Page 5: Motivation: Translational research

• “Bench to Bedside”
• Integration of clinical and research activities and results
• Supported by research programs
  • NIH Roadmap
  • Clinical and Translational Science Awards (CTSA)
• Requires the effective integration and exchange of information between
  • Basic research
  • Clinical research

Page 6: Genotype and phenotype

[Goh, PNAS 2007]

• OMIM
• [HPO]

Page 7: Genes and environmental factors

[Liu, BMC Bioinf. 2008]

• MEDLINE (MeSH index terms)
• Genetic Association Database

Page 8: Integrating drugs and targets

[Yildirim, Nature Biot. 2007]

• DrugBank
• ATC
• Gene Ontology

Page 9: Why ontologies?

Page 10: Uses of biomedical ontologies

• Knowledge management
  • Annotating data and resources
  • Accessing biomedical information
  • Mapping across biomedical ontologies
• Data integration, exchange and semantic interoperability
• Decision support
  • Data selection and aggregation
  • Decision support
  • NLP applications
  • Knowledge discovery

[Bodenreider, YBMI 2008]

Page 11: Terminology and translational research

[Figure: two panels, "Cancer / Basic research" and "EHR / Cancer patients"; terminologies shown: NCI Thesaurus, SNOMED CT]

Page 12: Approaches to data integration (1)

• Warehousing
  • Sources to be integrated are transformed into a common format and converted to a common vocabulary
• Mediation
  • Local schema (of the sources)
  • Global schema (in reference to which the queries are made)

[Stein, Nature Rev. Gen. 2003]
[Hernandez, SIGMOD Rec. 2004]
[Goble, J. Biomedical Informatics 2008]
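
The warehousing approach can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two toy sources, their field names, the record identifiers and the code "CONCEPT:1" are hypothetical and do not come from the talk.

```python
# Toy records from two hypothetical sources with different local schemas.
LITERATURE = [{"doc": "doc-1", "topic": "Addison Disease"}]
CLINIC = [{"patient": "p-17", "dx": "Addison's disease"}]

# Assumed common vocabulary: each local label maps to one shared code.
COMMON_VOCABULARY = {
    "Addison Disease": "CONCEPT:1",
    "Addison's disease": "CONCEPT:1",
}


def load(records, id_field, label_field, source):
    """Transform local records into the warehouse's common row format."""
    rows = []
    for record in records:
        rows.append({
            "source": source,
            "record_id": record[id_field],
            # Convert the local label to the common vocabulary.
            "concept": COMMON_VOCABULARY[record[label_field]],
        })
    return rows


# The warehouse stores a transformed copy of every source.
warehouse = (
    load(LITERATURE, "doc", "topic", "literature")
    + load(CLINIC, "patient", "dx", "clinic")
)


def records_for(concept):
    """Query the warehouse directly; the sources are no longer consulted."""
    return [(r["source"], r["record_id"]) for r in warehouse if r["concept"] == concept]


print(records_for("CONCEPT:1"))
```

The point of the design is that transformation happens once, at load time, so queries only ever see the common format and vocabulary.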

Page 13: Approaches to data integration (2)

• Linked data
  • Links among data elements
  • Enable navigation by humans

[Stein, Nature Rev. Gen. 2003]
[Hernandez, SIGMOD Rec. 2004]
[Goble, J. Biomedical Informatics 2008]

Page 14: Ontologies and warehousing

• Role
  • Provide a conceptualization of the domain
    • Help define the schema
    • Information model vs. ontology
  • Provide value sets for data elements
  • Enable standardization and sharing of data
• Examples
  • Annotations to the Gene Ontology
  • BioWarehouse
  • Clinical information systems

http://biowarehouse.ai.sri.com/

Page 15: Ontologies and mediation

• Role
  • Reference for defining the global schema
  • Map between local and global schemas
    • Query reformulation
    • Local-as-view vs. Global-as-view
• Examples
  • TAMBIS [Stevens, Bioinformatics 2000]
  • BioMediator [Louie, AMIA 2005]
  • OntoFusion [Perez-Rey, Comput Biol Med 2006]
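
Query reformulation in a mediator can be sketched as follows, under the assumption of a global-as-view setup in which each global relation is defined as a view over the sources. The sources, field names and gene labels "G1" and "G2" are hypothetical; this is not how TAMBIS, BioMediator or OntoFusion are implemented.

```python
# Hypothetical local sources, each with its own schema.
SOURCE_X = [{"gene_symbol": "G1", "go_id": "GO:0008375"}]
SOURCE_Y = [{"locus": "G2", "function": "GO:0016757"}]

# Global-as-view: each global relation is defined as a view over the sources.
# Each mapping goes from a global attribute name to the local field name.
GLOBAL_SCHEMA = {
    "gene_function": [
        (SOURCE_X, {"gene": "gene_symbol", "function": "go_id"}),
        (SOURCE_Y, {"gene": "locus", "function": "function"}),
    ],
}


def query(relation, **conditions):
    """Answer a query on the global schema by rewriting it for each source."""
    results = []
    for rows, mapping in GLOBAL_SCHEMA[relation]:
        # Reformulate conditions from global attribute names to local ones.
        local_conditions = {mapping[name]: value for name, value in conditions.items()}
        for row in rows:
            if all(row[field] == value for field, value in local_conditions.items()):
                # Translate the local row back to global attribute names.
                results.append({g: row[l] for g, l in mapping.items()})
    return results


print(query("gene_function", function="GO:0016757"))
```

Unlike a warehouse, nothing is copied: the data stay in the sources and only the query is translated at run time.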

Page 16: Ontologies and linked data

• Role
  • Explicit conceptualization of the domain
  • Semantic normalization of data elements
• Examples
  • Entrez [http://www.ncbi.nlm.nih.gov/]
  • Semantic Web mashups [J. Biomedical Informatics 41(5) 2008]
  • Bio2RDF [http://bio2rdf.org/]

Page 17: Ontologies and data integration

• Warehouse approaches
  • Source of identifiers for biomedical entities
  • Semantic normalization
• Mediator-based approaches
  • Source of reference relations for the global schema
  • Mapping between local and global schemas
• Linked data approaches
  • Source of identifiers for biomedical entities
  • Semantic normalization
  • Explicit conceptualization of the domain

Page 18: Ontologies and data aggregation

• Source of hierarchical relations
• Aggregate data into coarser categories
  • Abstract away from low-frequency, fine-grained data points
  • Increase power
  • Improve visualization
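
Aggregation along hierarchical relations can be sketched as a roll-up of counts to a chosen set of coarser categories. The small hierarchy and the counts below are toy values assumed for illustration, not data from the talk.

```python
from collections import Counter

# Toy hierarchy (child -> parent), assumed for illustration.
PARENT = {
    "type 1 diabetes": "diabetes",
    "type 2 diabetes": "diabetes",
    "diabetes": "endocrine disease",
    "Addison's disease": "endocrine disease",
}


def ancestor_in(term, categories):
    """Walk up the hierarchy until a term in `categories` is reached."""
    while term not in categories:
        term = PARENT[term]  # raises KeyError if no ancestor is a category
    return term


def aggregate(counts, categories):
    """Roll fine-grained counts up into the given coarser categories."""
    rolled = Counter()
    for term, n in counts.items():
        rolled[ancestor_in(term, categories)] += n
    return rolled


# Hypothetical observation counts at the finest level.
counts = {"type 1 diabetes": 3, "type 2 diabetes": 40, "Addison's disease": 2}

print(aggregate(counts, {"diabetes", "Addison's disease"}))
print(aggregate(counts, {"endocrine disease"}))
```

Choosing the category set controls the granularity: low-frequency leaves such as the two-case category here are absorbed into a larger group, which is what increases statistical power.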

Page 19: Examples

Gene Ontology
http://www.geneontology.org/

Page 20: Annotating data

• Gene Ontology
  • Functional annotation of gene products in several dozen model organisms
• Various communities use the same controlled vocabularies
  • Enabling comparisons across model organisms
• Annotations
  • Assigned manually by curators
  • Inferred automatically (e.g., from sequence similarity)

Page 21: GO: Annotations for Aldh2 (mouse)

http://www.informatics.jax.org/

Page 22: GO: ALD4 in Yeast

http://db.yeastgenome.org/

Page 23: GO: Annotations for ALDH2 (Human)

http://www.ebi.ac.uk/GOA/

Page 24: Integration applications

• Based on shared annotations
  • Enrichment analysis (within/across species)
  • Clustering (co-clustering with gene expression data)
• Based on the structure of GO
  • Closely related annotations
  • Semantic similarity
• Based on associations between gene products and annotations
  • Leveraging reasoning

[Bodenreider, PSB 2005]
[Sahoo, Medinfo 2007]
[Lord, PSB 2003]
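
Enrichment analysis on shared annotations is commonly framed as a one-sided hypergeometric test. The slide does not name a specific test, so the choice of test, the function name and the parameter names below are assumptions for illustration.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, annotated_in_sample):
    """Probability of seeing at least `annotated_in_sample` annotated genes
    in a random sample, under a hypergeometric null model.

    population:          number of genes in the background set
    annotated:           background genes annotated with the term
    sample:              number of genes in the list of interest
    annotated_in_sample: genes of interest annotated with the term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(annotated_in_sample, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 10 background genes, 5 carry the annotation,
# and all 5 genes in the list of interest carry it too.
print(enrichment_p_value(10, 5, 5, 5))
```

A small p-value suggests the annotation is over-represented in the gene list; in practice a correction for testing many GO terms would also be needed.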

Page 25: Integration: Entrez Gene + GO

[Figure: diagram with the labels Gene Ontology, Entrez Gene, gene, gene name, GO, PubMed, OMIM, sequence and interactions, and the two terms "Glycosyltransferase" and "Congenital muscular dystrophy"]

[Sahoo, Medinfo 2007]

Page 26: From glycosyltransferase to congenital muscular dystrophy

[Figure: graph with the following nodes and relations]

• EG:9215 LARGE
  • has_molecular_function: GO:0008375 acetylglucosaminyl-transferase
  • has_associated_phenotype: MIM:608840 Muscular dystrophy, congenital, type 1D
• GO terms linked by isa
  • GO:0016757 glycosyltransferase
  • GO:0008194
  • GO:0016758
  • GO:0008375 acetylglucosaminyl-transferase
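
The path from glycosyltransferase to congenital muscular dystrophy can be followed with a small graph traversal. The identifiers come from the slide; the exact isa edges (GO:0008375 under both GO:0008194 and GO:0016758, each under GO:0016757) are an assumption reconstructed from the flattened figure, and the function names are hypothetical.

```python
# isa edges (child -> parents); the exact edges are an assumption
# reconstructed from the flattened figure.
IS_A = {
    "GO:0008375": ["GO:0008194", "GO:0016758"],
    "GO:0008194": ["GO:0016757"],
    "GO:0016758": ["GO:0016757"],
}

# Gene-level relations shown on the slide.
HAS_MOLECULAR_FUNCTION = {"EG:9215": ["GO:0008375"]}
HAS_ASSOCIATED_PHENOTYPE = {"EG:9215": ["MIM:608840"]}


def ancestors(term):
    """All terms reachable from `term` by following isa edges upward."""
    seen = set()
    stack = [term]
    while stack:
        for parent in IS_A.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def phenotypes_for_function(go_term):
    """Phenotypes of genes whose function is `go_term` or a descendant of it."""
    hits = set()
    for gene, functions in HAS_MOLECULAR_FUNCTION.items():
        if any(f == go_term or go_term in ancestors(f) for f in functions):
            hits.update(HAS_ASSOCIATED_PHENOTYPE.get(gene, []))
    return hits


# Asking about the general class still finds the gene annotated with
# the more specific function.
print(phenotypes_for_function("GO:0016757"))
```

The query succeeds only because the isa hierarchy is used: the gene is annotated with the specific term GO:0008375, not with GO:0016757 itself.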

Page 27: Examples

caBIG
http://cabig.nci.nih.gov/

Page 28: Cancer Biomedical Informatics Grid

• US National Cancer Institute
• Common infrastructure used to share data and applications across institutions to support cancer research efforts in a grid environment
• Service-oriented architecture
  • Data and application services available on the grid
  • Supported by ontological resources

Page 29: caBIG services

• caArray
  • Microarray data repository
• caTissue
  • Biospecimen repository
• caFE (Cancer Function Express)
  • Annotations on microarray data
• …
• caTRIP
  • Cancer Translational Research Informatics Platform
  • Integrates data services

Page 30: Ontological resources

• NCI Thesaurus
  • Reference terminology for the cancer domain
  • ~ 60,000 concepts
  • OWL Lite
• Cancer Data Standards Repository (caDSR)
  • Metadata repository
  • Used to bridge across UML models through Common Data Elements
  • Links to concepts in ontologies

Page 31: Examples

Semantic Web for Health Care and Life Sciences
http://www.w3.org/2001/sw/hcls/

Page 32: Semantic Web layer cake

Page 33: Linked data

linkeddata.org

Page 34: Linked data

Page 35: Linked biomedical data

[Tim Berners-Lee, TED 2009 conference]
http://www.w3.org/2009/Talks/0204-ted-tbl/#(1)

Page 36: W3C Health Care and Life Sciences IG

Page 37: Biomedical Semantic Web

• Integration
  • Data/Information
  • E.g., translational research
• Hypothesis generation
• Knowledge discovery

[Ruttenberg, BMC Bioinf. 2007]

Page 38: HCLS mashup of biomedical sources

• NeuronDB
• BAMS
• NC Annotations
• Homologene
• SWAN
• Entrez Gene
• Gene Ontology
• Mammalian Phenotype
• PDSPki
• BrainPharm
• AlzGene
• Antibodies
• PubChem
• MeSH
• Reactome
• Allen Brain Atlas
• Publications

http://esw.w3.org/topic/HCLS/HCLSIG_DemoHomePage_HCLSIG_Demo

Page 39: Shared identifiers: Example

GO

Page 40: HCLS mashup

• NeuronDB: Protein (channels/receptors), Neurotransmitters, Neuroanatomy, Cell, Compartments, Currents
• BAMS: Protein, Neuroanatomy, Cells, Metabolites (channels), PubMedID
• NC Annotations: Genes/Proteins, Processes, Cells (maybe), PubMed ID
• Allen Brain Atlas: Genes, Brain images, Gross anatomy -> neuroanatomy
• Homologene: Genes, Species, Orthologies, Proofs
• SWAN: PubMedID, Hypothesis, Questions, Evidence, Genes
• Entrez Gene: Genes, Protein, GO, PubMedID, Interaction (g/p), Chromosome, C. location
• GO: Molecular function, Cell components, Biological process, Annotation gene, PubMedID
• Mammalian Phenotype: Genes, Phenotypes, Disease, PubMedID
• PDSPki: Proteins, Chemicals, Neurotransmitters
• BrainPharm: Drug, Drug effect, Pathological agent, Phenotype, Receptors, Channels, Cell types, PubMedID, Disease
• AlzGene: Gene, Polymorphism, Population, Alz Diagnosis
• Antibodies: Genes, Antibodies
• PubChem: Name, Structure, Properties, MeSH term
• MeSH: Drugs, Anatomy, Phenotypes, Compounds, Chemicals, PubMedID, PubChem
• Reactome: Genes/proteins, Interactions, Cellular location, Processes (GO)


Page 42: HCLS mashups

• Based on RDF/OWL
• Based on shared identifiers
  • “Recombinant data” (E. Neumann)
• Ontologies used in some cases
• Support applications (SWAN, SenseLab, etc.)
• Journal of Biomedical Informatics special issue on Semantic bio-mashups

[J. Biomedical Informatics 41(5) 2008]
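
How shared identifiers make data "recombinant" can be sketched with plain subject-predicate-object triples. This is a stand-in for RDF written in ordinary Python rather than with an RDF library, and every identifier and predicate name below is hypothetical.

```python
# Hypothetical triples (subject, predicate, object) from two separate sources.
SOURCE_GENES = [
    ("gene:1", "has_function", "function:A"),
    ("gene:2", "has_function", "function:B"),
]
SOURCE_LITERATURE = [
    ("pmid:100", "mentions", "gene:1"),
    ("pmid:200", "mentions", "gene:2"),
]


def merge(*sources):
    """Merging triple sets is a plain union; no schema alignment is needed
    as long as the sources use the same identifiers for the same things."""
    graph = set()
    for source in sources:
        graph.update(source)
    return graph


def publications_for_function(graph, function_id):
    """Join across sources through the shared gene identifiers."""
    genes = {s for s, p, o in graph if p == "has_function" and o == function_id}
    return {s for s, p, o in graph if p == "mentions" and o in genes}


graph = merge(SOURCE_GENES, SOURCE_LITERATURE)
print(publications_for_function(graph, "function:A"))
```

The join works only because both sources refer to the gene as "gene:1"; if each used its own identifier, a mapping step would be required first.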

Page 43: Semantic bio-mashups

• Bio2RDF: Towards a mashup to build bioinformatics knowledge systems
• Identifying disease-causal genes using Semantic Web-based representation of integrated genomic and phenomic knowledge
• Schema driven assignment and implementation of life science identifiers (LSIDs)
• The SWAN biomedical discourse ontology
• An ontology-driven semantic mashup of gene and biological pathway information: Application to the domain of nicotine dependence
• Towards an ontology for sharing medical images and regions of interest in neuroimaging
• yOWL: An ontology-driven knowledge base for yeast biologists
• Dynamic sub-ontology evolution for traditional Chinese medicine web ontology
• Ontology-centric integration and navigation of the dengue literature
• Infrastructure for dynamic knowledge integration - Automated biomedical ontology extension using textual resources
• An ontological knowledge framework for adaptive medical workflow
• Semi-automatic web service composition for the life sciences using the BioMoby semantic web framework
• Combining Semantic Web technologies with Multi-Agent Systems for integrated access to biological resources

[J. Biomedical Informatics 41(5) 2008]

Page 44: Challenging issues

Page 45: Challenging issues

• Bridges across ontologies
• Permanent identifiers for biomedical entities
• Other issues

Page 46: Challenging issues

Bridges across ontologies

Page 47: Trans-namespace integration

[Figure: one disease identified in three namespaces]

• Biomedical literature, MeSH: Addison Disease (D000224)
• Clinical repositories, SNOMED CT: Addison's disease (363732003)
• ICD 10: Primary adrenocortical insufficiency (E27.1)

Page 48: (Integrated) concept repositories

• Unified Medical Language System
  http://umlsks.nlm.nih.gov
• NCBO’s BioPortal
  http://www.bioontology.org/tools/portal/bioportal.html
• caDSR
  http://ncicb.nci.nih.gov/NCICB/infrastructure/cacore_overview/cadsr
• Open Biomedical Ontologies (OBO)
  http://obofoundry.org/

Page 49: Integrating subdomains

[Figure: subdomains and their terminologies, with the UMLS]

• Biomedical literature: MeSH
• Genome annotations: GO
• Model organisms: NCBI Taxonomy
• Genetic knowledge bases: OMIM
• Clinical repositories: SNOMED CT
• Anatomy: FMA
• Other subdomains
• UMLS

Page 50: Integrating subdomains

[Figure: subdomains shown without terminologies: biomedical literature, genome annotations, model organisms, genetic knowledge bases, clinical repositories, anatomy, other subdomains]

Page 51: Ontologies and data integration in biomedicine Olivier Bodenreider Lister Hill National Center for Biomedical Communications Bethesda, Maryland - USA Kno.e.sis

Trans-namespace integration

[Diagram: subdomains and their vocabularies, connected through the UMLS]
Genome annotations: GO
Model organisms: NCBI Taxonomy
Genetic knowledge bases: OMIM
Anatomy: FMA
Other subdomains
Biomedical literature: MeSH, Addison Disease (D000224)
Clinical repositories: SNOMED CT, Addison's disease (363732003)
UMLS: C0001403
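
The trans-namespace example above can be sketched in a few lines of Python. This is only an illustration: the table, the function names and the two records are hypothetical, and only the two source codes and the UMLS identifier shown on the slide are used. It is not the UMLS API.

```python
# Hypothetical mapping table: (source vocabulary, local code) -> UMLS concept identifier.
# Only the two codes and the identifier shown on the slide are used.
CUI_BY_SOURCE_CODE = {
    ("MeSH", "D000224"): "C0001403",         # Addison Disease
    ("SNOMED CT", "363732003"): "C0001403",  # Addison's disease
}


def to_cui(source, code):
    """Return the shared concept identifier, or None if the code is unmapped."""
    return CUI_BY_SOURCE_CODE.get((source, code))


def group_by_concept(records):
    """Group (record_id, source, code) triples under their shared concept."""
    grouped = {}
    for record_id, source, code in records:
        cui = to_cui(source, code)
        if cui is not None:
            grouped.setdefault(cui, []).append(record_id)
    return grouped


# Made-up records: a citation indexed with MeSH and a clinical record coded in SNOMED CT.
records = [
    ("article-1", "MeSH", "D000224"),
    ("patient-7", "SNOMED CT", "363732003"),
]
integrated = group_by_concept(records)
```

Both records end up under the same concept identifier, which is what lets data annotated in two different namespaces be queried together.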

Page 52: Ontologies and data integration in biomedicine Olivier Bodenreider Lister Hill National Center for Biomedical Communications Bethesda, Maryland - USA Kno.e.sis

Mappings

Created manually (e.g., UMLS)
  Purpose
  Directionality
Created automatically (e.g., BioPortal)
  Lexically: ambiguity, normalization
  Semantically: lack of / incomplete formal definitions
Key to enabling semantic interoperability
  Enabling resource for the Semantic Web
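
A minimal sketch of a lexical mapping, assuming a very crude normalization (lowercasing, dropping possessives and punctuation, sorting words). The functions are hypothetical and are not the normalization tools used by the UMLS or BioPortal. The two terms and codes are those of the Addison's disease example.

```python
import re


def normalize(term):
    """Crude lexical normalization: lowercase, drop possessives and punctuation, sort words."""
    lowered = term.lower()
    no_possessive = re.sub(r"['’]s\b", "", lowered)
    words = re.findall(r"[a-z0-9]+", no_possessive)
    return " ".join(sorted(words))


def lexical_matches(terms_a, terms_b):
    """Pair codes from two vocabularies whose terms share a normalized form."""
    index = {}
    for code, term in terms_a.items():
        index.setdefault(normalize(term), []).append(code)
    return [
        (a, b)
        for b, term in terms_b.items()
        for a in index.get(normalize(term), [])
    ]


mesh = {"D000224": "Addison Disease"}
snomed = {"363732003": "Addison's disease"}
matches = lexical_matches(mesh, snomed)
```

The same looseness that makes the two spellings match is also the source of the ambiguity noted above: ignoring case, punctuation and word order can merge terms that do not mean the same thing.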

Page 53: Ontologies and data integration in biomedicine Olivier Bodenreider Lister Hill National Center for Biomedical Communications Bethesda, Maryland - USA Kno.e.sis

Challenging issues

Permanent identifiers for biomedical entities

Page 54: Ontologies and data integration in biomedicine Olivier Bodenreider Lister Hill National Center for Biomedical Communications Bethesda, Maryland - USA Kno.e.sis

Identifying biomedical entities

Multiple identifiers for the same entity in different ontologies
Barrier to data integration in general
  Data annotated to different ontologies cannot “recombine”
  Need for mappings across ontologies
Barrier to data integration in the Semantic Web
  Multiple possible identifiers for the same entity
    Depending on the underlying representational scheme (URI vs. LSID)
    Depending on who creates the URI

Page 55: Ontologies and data integration in biomedicine Olivier Bodenreider Lister Hill National Center for Biomedical Communications Bethesda, Maryland - USA Kno.e.sis

Possible solutions

PURL (http://purl.org)
  One level of indirection between developers and users
  Independence from local constraints at the developer’s end
The institution creating a resource is also responsible for minting URIs
  E.g., URI for genes in Entrez Gene
Guidelines: “URI note”
  W3C Health Care and Life Sciences Interest Group
Shared names initiative
  Identify resources vs. entities
  [http://sharedname.org/]
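
The level of indirection behind the PURL idea can be illustrated with a toy resolver. Everything here is hypothetical (the class, the identifier and the example.org URLs); it is not the purl.org service or its API, only a sketch of why users can keep citing one name while the developer moves the resource.

```python
class PersistentResolver:
    """Toy resolver: stable identifiers mapped to their current locations."""

    def __init__(self):
        self._targets = {}

    def register(self, persistent_id, target_url):
        """Create or update the location behind a persistent identifier."""
        self._targets[persistent_id] = target_url

    def resolve(self, persistent_id):
        """Return the current location for a persistent identifier."""
        return self._targets[persistent_id]


resolver = PersistentResolver()

# The developer publishes a term and users start citing the persistent identifier.
resolver.register("example/ontology/term-1", "http://old-host.example.org/terms/1")
cited_by_users = "example/ontology/term-1"

# The developer later moves the resource; only the resolver entry changes.
resolver.register("example/ontology/term-1", "http://new-host.example.org/terms/1")
current_location = resolver.resolve(cited_by_users)
```

The identifier cited by users never changes, which is the independence from local constraints at the developer's end mentioned above.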

Page 56: Ontologies and data integration in biomedicine Olivier Bodenreider Lister Hill National Center for Biomedical Communications Bethesda, Maryland - USA Kno.e.sis

Challenging issues

Other issues

Page 57: Ontologies and data integration in biomedicine Olivier Bodenreider Lister Hill National Center for Biomedical Communications Bethesda, Maryland - USA Kno.e.sis

Availability

Many ontologies are freely available
  The UMLS is freely available for research purposes
    Cost-free license required
Licensing issues can be tricky
  SNOMED CT is freely available in member countries of the IHTSDO
Being freely available
  Is a requirement for the Open Biomedical Ontologies (OBO)
  Is a de facto prerequisite for Semantic Web applications

Page 58: Ontologies and data integration in biomedicine Olivier Bodenreider Lister Hill National Center for Biomedical Communications Bethesda, Maryland - USA Kno.e.sis

Discoverability

Ontology repositories
  UMLS: 152 source vocabularies (biased towards healthcare applications)
  NCBO BioPortal: ~141 ontologies (biased towards biological applications)
  Limited overlap between the two repositories
Need for discovery services
  Metadata for ontologies

Page 59: Ontologies and data integration in biomedicine Olivier Bodenreider Lister Hill National Center for Biomedical Communications Bethesda, Maryland - USA Kno.e.sis

Formalism

Several major formalisms
  Web Ontology Language (OWL) – NCI Thesaurus
  OBO format – most OBO ontologies
  UMLS Rich Release Format (RRF) – UMLS, RxNorm
Conversion mechanisms
  OBO to OWL
  LexGrid (import/export to LexGrid internal format)

Page 60: Ontologies and data integration in biomedicine Olivier Bodenreider Lister Hill National Center for Biomedical Communications Bethesda, Maryland - USA Kno.e.sis

Ontology integration

Post hoc integration, from the bottom up
  UMLS approach
  Integrates ontologies “as is”, including legacy ontologies
  Facilitates the integration of the corresponding datasets
Coordinated development of ontologies
  OBO Foundry approach
  Ensures consistency ab initio
  Excludes legacy ontologies

Page 61: Ontologies and data integration in biomedicine Olivier Bodenreider Lister Hill National Center for Biomedical Communications Bethesda, Maryland - USA Kno.e.sis

Quality

Quality assurance in ontologies is still imperfectly defined
  Difficult to define outside a use case or application
Several approaches to evaluating quality
  Collaboratively, by users (Web 2.0 approach)
    Marginal notes enabled by BioPortal
  Centrally, by experts
    OBO Foundry approach
Important factors besides quality
  Governance
  Installed base / Community of practice

Page 62: Ontologies and data integration in biomedicine Olivier Bodenreider Lister Hill National Center for Biomedical Communications Bethesda, Maryland - USA Kno.e.sis

Conclusions

Ontologies are enabling resources for data integration
Standardization works
  Grass roots effort (GO)
  Regulatory context (ICD-9-CM)
Bridging across resources is crucial
  Ontology integration resources / strategies (UMLS, BioPortal / OBO Foundry)
  Massive amounts of imperfect data integrated with rough methods might still be useful

Page 63: Ontologies and data integration in biomedicine Olivier Bodenreider Lister Hill National Center for Biomedical Communications Bethesda, Maryland - USA Kno.e.sis

Medical Ontology Research

Olivier Bodenreider

Lister Hill National Center for Biomedical Communications
Bethesda, Maryland - USA

Contact: [email protected]
Web: mor.nlm.nih.gov