kno.e · 27-05-2009  · bio2rdf: towards a mashup to build bioinformatics knowledge systems...

63
Ontologies and data integration in biomedicine Olivier Bodenreider Lister Hill National Center for Biomedical Communications Bethesda, Maryland - USA Kno.e.sis Wright State University, Dayton, Ohio May 27, 2009

Upload: others

Post on 01-Feb-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Ontologies and data integration in biomedicine

Olivier Bodenreider

Lister Hill National Centerfor Biomedical Communications

Bethesda, Maryland - USA

Kno.e.sisWright State University, Dayton, Ohio

May 27, 2009

Page 2: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Lister Hill National Center for Biomedical Communications 2

Outline

Why integrate data?Ontologies and data integrationExamplesChallenging issues

Page 3: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Why integrate data?

Page 4: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Lister Hill National Center for Biomedical Communications 4

Why integrate data?

Sources of informationCreated by

Independent researchersSeparate workflows

HeterogeneousScattered“Silos”

To identify patterns in integrated datasetsHypothesis generationKnowledge discovery

Page 5: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Lister Hill National Center for Biomedical Communications 5

Motivation Translational research

“Bench to Bedside”Integration of clinical and research activities and resultsSupported by research programs

NIH RoadmapClinical and Translational Science Awards (CTSA)

Requires the effective integration and exchange and of information between

Basic researchClinical research

Page 6: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Lister Hill National Center for Biomedical Communications 6

Genotype and phenotype[Goh, PNAS 2007]

• OMIM• [HPO]

Page 7: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Genes and environmental factors

[Liu, BMC Bioinf. 2008]

• MEDLINE (MeSH index terms)• Genetic Association Database

Page 8: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Lister Hill National Center for Biomedical Communications 8

Integrating drugs and targets[Yildirim, Nature Biot. 2007]

• DrugBank• ATC• Gene Ontology

Page 9: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Why ontologies?

Page 10: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Lister Hill National Center for Biomedical Communications 10

Uses of biomedical ontologies

Knowledge managementAnnotating data and resourcesAccessing biomedical informationMapping across biomedical ontologies

Data integration, exchange and semantic interoperabilityDecision support

Data selection and aggregationDecision supportNLP applicationsKnowledge discovery

[Bodenreider, YBMI 2008]

Page 11: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Lister Hill National Center for Biomedical Communications 11

Terminology and translational research

CancerBasic

Research

EHRCancerPatients

NCI Thesaurus SNOMED CT

Page 12: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Lister Hill National Center for Biomedical Communications 12

Approaches to data integration (1)

WarehousingSources to be integrated are transformed into a common format and converted to a common vocabulary

MediationLocal schema (of the sources)Global schema (in reference to which the queries are made)

[Stein, Nature Rev. Gen. 2003][Hernandez, SIGMOD Rec. 2004]

[Goble J. Biomedical Informatics 2008]

Page 13: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Lister Hill National Center for Biomedical Communications 13

Approaches to data integration (2)

Linked dataLinks among data elementsEnable navigation by humans

[Stein, Nature Rev. Gen. 2003][Hernandez, SIGMOD Rec. 2004]

[Goble J. Biomedical Informatics 2008]

Page 14: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Lister Hill National Center for Biomedical Communications 14

Ontologies and warehousing

RoleProvide a conceptualization of the domain

Help define the schemaInformation model vs. ontology

Provide value sets for data elementsEnable standardization and sharing of data

ExamplesAnnotations to the Gene OntologyBioWarehouseClinical information systems

http://biowarehouse.ai.sri.com/

Page 15: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Lister Hill National Center for Biomedical Communications 15

Ontologies and mediation

RoleReference for defining the global schemaMap between local and global schemas

Query reformulationLocal-as-view vs. Global-as-view

ExamplesTAMBISBioMediatorOntoFusion

[Stevens, Bioinformatics 2000]

[Louie, AMIA 2005]

[Perez-Rey, Comput Biol Med 2006]

Page 16: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Lister Hill National Center for Biomedical Communications 16

Ontologies and linked data

RoleExplicit conceptualization of the domainSemantic normalization of data elements

ExamplesEntrezSemantic Web mashupsBio2RDF

[http://www.ncbi.nlm.nih.gov/]

[J. Biomedical informatics 41(5) 2008]

[http://bio2rdf.org/]

Page 17: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Lister Hill National Center for Biomedical Communications 17

Ontologies and data integration

Source of identifiers for biomedical entitiesSemantic normalizationWarehouse approaches

Source of reference relations for the global schemaMapping between local and global schemasMediator-based approaches

Source of identifiers for biomedical entitiesSemantic normalizationExplicit conceptualization of the domainLinked data approaches

Page 18: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Lister Hill National Center for Biomedical Communications 18

Ontologies and data aggregation

Source of hierarchical relationsAggregate data into coarser categoriesAbstract away from low-frequency, fine grained data pointsIncrease powerImprove visualization

Page 19: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Examples

Gene Ontologyhttp://www.geneontology.org/

Page 20: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Lister Hill National Center for Biomedical Communications 20

Annotating data

Gene OntologyFunctional annotation of gene productsin several dozen model organisms

Various communities use the same controlled vocabulariesEnabling comparisons across model organismsAnnotations

Assigned manually by curatorsInferred automatically (e.g., from sequence similarity)

Page 21: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Lister Hill National Center for Biomedical Communications 21

GO Annotations for Aldh2 (mouse)

http:// www.informatics.jax.org/

Page 22: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Lister Hill National Center for Biomedical Communications 22

GO ALD4 in Yeast

http://db.yeastgenome.org/

Page 23: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Lister Hill National Center for Biomedical Communications 23

GO Annotations for ALDH2 (Human)

http://www.ebi.ac.uk/GOA/

Page 24: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Lister Hill National Center for Biomedical Communications 24

Integration applications

Based on shared annotationsEnrichment analysis (within/across species)Clustering (co-clustering with gene expression data)

Based on the structure of GOClosely related annotationsSemantic similarity

Based on associations between gene products and annotationsLeveraging reasoning

[Bodenreider, PSB 2005]

[Sahoo, Medinfo 2007]

[Lord, PSB 2003]

Page 25: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Lister Hill National Center for Biomedical Communications 25

Gene Ontology

Integration Entrez Gene + GO

gene

GO

PubMed

Gene name

OMIM

Sequence

InteractionsGlycosyltransferase

Congenital muscular dystrophy

Entrez Gene

[Sahoo, Medinfo 2007]

Page 26: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Lister Hill National Center for Biomedical Communications 26

From glycosyltransferaseto congenital muscular dystrophy

MIM:608840 Muscular dystrophy, congenital, type 1D

GO:0008375

has_associated_phenotype

has_molecular_function

EG:9215LARGE

acetylglucosaminyl-transferase

GO:0016757glycosyltransferase

GO:0008194isa

GO:0008375 acetylglucosaminyl-transferase

GO:0016758

Page 27: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Examples

caBIGhttp://cabig.nci.nih.gov/

Page 28: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Lister Hill National Center for Biomedical Communications 28

Cancer Biomedical Informatics Grid

US National Cancer InstituteCommon infrastructure used to share data and applications across institutions to support cancer research efforts in a grid environmentService-oriented architecture

Data and application services available on the gridSupported by ontological resources

Page 29: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Lister Hill National Center for Biomedical Communications 29

caBIG services

caArrayMicroarray data repository

caTissueBiospecimen repository

caFE (Cancer Function Express)Annotations on microarray data

caTRIPCancer Translational Research Informatics PlatformIntegrates data services

Page 30: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Lister Hill National Center for Biomedical Communications 30

Ontological resources

NCI ThesaurusReference terminology for the cancer domain~ 60,000 conceptsOWL Lite

Cancer Data Standards Repository (caDSR)Metadata repositoryUsed to bridge across UML models through Common Data ElementsLinks to concepts in ontologies

Page 31: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Examples

Semantic Webfor Health Care and Life Sciences

http://www.w3.org/2001/sw/hcls/

Page 32: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Lister Hill National Center for Biomedical Communications 32

Semantic Web layer cake

Page 33: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Linked datalinkeddata.org

Page 34: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Lister Hill National Center for Biomedical Communications 34

Linked data

Page 35: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Lister Hill National Center for Biomedical Communications 35

Linked biomedical data[Tim Berners-Lee TED 2009 conference]http://www.w3.org/2009/Talks/0204-ted-tbl/#(1)

Page 36: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Lister Hill National Center for Biomedical Communications 36

W3C Health Care and Life Sciences IG

Page 37: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Lister Hill National Center for Biomedical Communications 37

Biomedical Semantic Web

IntegrationData/InformationE.g., translational research

Hypothesis generationKnowledge discovery

[Ruttenberg, BMC Bioinf. 2007]

Page 38: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Lister Hill National Center for Biomedical Communications 38

HCLS mashup of biomedical sources

NeuronDB

BAMS

NC Annotations

Homologene

SWAN

Entrez Gene

Gene Ontology

Mammalian Phenotype

PDSPki

BrainPharm

AlzGene

Antibodies

PubChem

MeSH

Reactome

Allen Brain Atlas

Publications

http://esw.w3.org/topic/HCLS/HCLSIG_DemoHomePage_HCLSIG_Demo

Page 39: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Lister Hill National Center for Biomedical Communications 39

Shared identifiers Example

GO

Page 40: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Lister Hill National Center for Biomedical Communications 40

HCLS mashup NeuronDB

Protein (channels/receptors)NeurotransmittersNeuroanatomyCellCompartmentsCurrents

BAMSProteinNeuroanatomyCellsMetabolites (channels)PubMedID

NC Annotations

Genes/ProteinsProcessesCells (maybe)PubMed ID

Allen Brain Atlas

GenesBrain imagesGross anatomy -> neuroanatomy

Homologene

GenesSpeciesOrthologiesProofs

SWAN

PubMedIDHypothesisQuestionsEvidence

Genes

Entrez GeneGenesProtein

GOPubMedID

Interaction (g/p)Chromosome

C. location

GO

Molecular functionCell components

Biological processAnnotation gene

PubMedID

Mammalian Phenotype

Genes Phenotypes

DiseasePubMedID

ProteinsChemicals

Neurotransmitters

PDSPki

BrainPharmDrug

Drug effectPathological agent

PhenotypeReceptorsChannelsCell typesPubMedIDDisease

AlzGene

Gene Polymorphism

PopulationAlz Diagnosis

AntibodiesGenes Antibodies

PubChem

NameStructurePropertiesMeSH term

MeSHDrugsAnatomyPhenotypesCompoundsChemicalsPubMedIDPubChem

Reactome

Genes/proteinsInteractionsCellular locationProcesses (GO)

Page 41: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Lister Hill National Center for Biomedical Communications 41

HCLS mashup NeuronDB

Protein (channels/receptors)NeurotransmittersNeuroanatomyCellCompartmentsCurrents

BAMSProteinNeuroanatomyCellsMetabolites (channels)PubMedID

NC Annotations

Genes/ProteinsProcessesCells (maybe)PubMed ID

Allen Brain Atlas

GenesBrain imagesGross anatomy -> neuroanatomy

Homologene

GenesSpeciesOrthologiesProofs

SWAN

PubMedIDHypothesisQuestionsEvidence

Genes

Entrez GeneGenesProtein

GOPubMedID

Interaction (g/p)Chromosome

C. location

GO

Molecular functionCell components

Biological processAnnotation gene

PubMedID

Mammalian Phenotype

GenesPhenotypes

DiseasePubMedID

ProteinsChemicals

Neurotransmitters

PDSPki

BrainPharmDrug

Drug effectPathological agent

PhenotypeReceptorsChannelsCell typesPubMedIDDisease

AlzGene

GenePolymorphism

PopulationAlz Diagnosis

AntibodiesGenesAntibodies

PubChem

NameStructurePropertiesMeSH term

MeSHDrugsAnatomyPhenotypesCompoundsChemicalsPubMedIDPubChem

Reactome

Genes/proteinsInteractionsCellular locationProcesses (GO)

Page 42: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Lister Hill National Center for Biomedical Communications 42

HCLS mashups

Based on RDF/OWLBased on shared identifiers

“Recombinant data” (E. Neumann)

Ontologies used in some casesSupport applications (SWAN, SenseLab, etc.)

Journal of Biomedical Informaticsspecial issue on Semantic bio-mashups[J. Biomedical Informatics 41(5) 2008]

Page 43: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Lister Hill National Center for Biomedical Communications 43

Semantic bio-mashupsBio2RDF: Towards a mashup to build bioinformatics knowledge systemsIdentifying disease-causal genes using Semantic Web-based representation of integrated genomic and phenomic knowledgeSchema driven assignment and implementation of life science identifiers (LSIDs)The SWAN biomedical discourse ontologyAn ontology-driven semantic mashup of gene and biological pathway information: Application to the domain of nicotine dependenceTowards an ontology for sharing medical images and regions of interest in neuroimagingyOWL: An ontology-driven knowledge base for yeast biologistsDynamic sub-ontology evolution for traditional Chinese medicine web ontologyOntology-centric integration and navigation of the dengue literatureInfrastructure for dynamic knowledge integration—Automated biomedical ontology extension using textual resourcesAn ontological knowledge framework for adaptive medical workflowSemi-automatic web service composition for the life sciences using the BioMoby semantic web frameworkCombining Semantic Web technologies with Multi-Agent Systems for integrated access to biological resources

[J. Biomedical Informatics 41(5) 2008]

Page 44: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Challenging issues

Page 45: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Lister Hill National Center for Biomedical Communications 45

Challenging issues

Bridges across ontologiesPermanent identifiers for biomedical entitiesOther issues

Page 46: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Challenging issues

Bridges across ontologies

Page 47: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Lister Hill National Center for Biomedical Communications 47

Trans-namespace integration

Addison Disease(D000224)

Addison's disease (363732003)

Biomedicalliterature

MeSH

Clinicalrepositories

SNOMED CT

Primary adrenocortical insufficiency(E27.1)

ICD 10

Page 48: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Lister Hill National Center for Biomedical Communications 48

(Integrated) concept repositories

Unified Medical Language Systemhttp://umlsks.nlm.nih.govNCBO’s BioPortalhttp://www.bioontology.org/tools/portal/bioportal.htmlcaDSRhttp://ncicb.nci.nih.gov/NCICB/infrastructure/cacore_overview/cadsr

Open Biomedical Ontologies (OBO)http://obofoundry.org/

Page 49: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Lister Hill National Center for Biomedical Communications 49

Integrating subdomains

Biomedicalliterature

MeSH

Genomeannotations

GOModelorganisms

NCBITaxonomy

Geneticknowledge bases

OMIM

Clinicalrepositories

SNOMED CTOthersubdomains

Anatomy

FMA

UMLS

Page 50: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Lister Hill National Center for Biomedical Communications 5050

Integrating subdomains

Biomedicalliterature

Genomeannotations

Modelorganisms

Geneticknowledge bases

Clinicalrepositories

Othersubdomains

Anatomy

Page 51: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Lister Hill National Center for Biomedical Communications 51

Trans-namespace integration

Genomeannotations

GOModelorganisms

NCBITaxonomy

Geneticknowledge bases

OMIMOther

subdomains

Anatomy

FMA

UMLSAddison Disease (D000224)

Addison's disease (363732003)

Biomedicalliterature

MeSH

Clinicalrepositories

SNOMED CT

UMLSC0001403

Page 52: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Lister Hill National Center for Biomedical Communications 52

Mappings

Created manually (e.g., UMLS)PurposeDirectionality

Created automatically (e.g., BioPortal)Lexically: ambiguity, normalizationSemantically: lack of / incomplete formal definitions

Key to enabling semantic interoperabilityEnabling resource for the Semantic Web

Page 53: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Challenging issues

Permanent identifiers for biomedical entities

Page 54: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Lister Hill National Center for Biomedical Communications 54

Identifying biomedical entities

Multiple identifiers for the same entity in different ontologiesBarrier to data integration in general

Data annotated to different ontologies cannot “recombine”Need for mappings across ontologies

Barrier to data integration in the Semantic WebMultiple possible identifiers for the same entity

Depending on the underlying representational scheme (URI vs. LSID)Depending on who creates the URI

Page 55: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Lister Hill National Center for Biomedical Communications 55

Possible solutions

PURL http://purl.orgOne level of indirection between developers and usersIndependence from local constraints at the developer’s end

The institution creating a resource is also responsible for minting URIs

E.g., URI for genes in Entrez Gene

Guidelines: “URI note”W3C Health Care and Life Sciences Interest Group

Shared names initiativeIdentify resources vs. entities

[http://sharedname.org/]

Page 56: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Challenging issues

Other issues

Page 57: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Lister Hill National Center for Biomedical Communications 57

Availability

Many ontologies are freely availableThe UMLS is freely available for research purposes

Cost-free license requiredLicensing issues can be tricky

SNOMED CT is freely available in member countries of the IHTSDO

Being freely availableIs a requirement for the Open Biomedical Ontologies (OBO)Is a de facto prerequisite for Semantic Web applications

Page 58: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Lister Hill National Center for Biomedical Communications 58

Discoverability

Ontology repositoriesUMLS: 152 source vocabularies(biased towards healthcare applications)NCBO BioPortal: ~141ontologies(biased towards biological applications)Limited overlap between the two repositories

Need for discovery servicesMetadata for ontologies

Page 59: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Lister Hill National Center for Biomedical Communications 59

Formalism

Several major formalismWeb Ontology Language (OWL) – NCI ThesaurusOBO format – most OBO ontologiesUMLS Rich Release Format (RRF) – UMLS, RxNorm

Conversion mechanismsOBO to OWLLexGrid (import/export to LexGrid internal format)

Page 60: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Lister Hill National Center for Biomedical Communications 60

Ontology integration

Post hoc integration , form the bottom upUMLS approachIntegrates ontologies “as is”, including legacy ontologiesFacilitates the integration of the corresponding datasets

Coordinated development of ontologiesOBO Foundry approachEnsures consistency ab initioExcludes legacy ontologies

Page 61: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Lister Hill National Center for Biomedical Communications 61

Quality

Quality assurance in ontologies is still imperfectly defined

Difficult to define outside a use case or applicationSeveral approaches to evaluating quality

Collaboratively, by users (Web 2.0 approach)Marginal notes enabled by BioPortal

Centrally, by expertsOBO Foundry approach

Important factors besides qualityGovernanceInstalled base / Community of practice

Page 62: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

Lister Hill National Center for Biomedical Communications 62

Conclusions

Ontologies are enabling resources for data integrationStandardization works

Grass roots effort (GO)Regulatory context (ICD 9-CM)

Bridging across resources is crucialOntology integration resources / strategies(UMLS, BioPortal / OBO Foundry)

Massive amounts of imperfect data integrated with rough methods might still be useful

Page 63: Kno.e · 27-05-2009  · Bio2RDF: Towards a mashup to build bioinformatics knowledge systems Identifying disease-causal genes using Semantic Web-based representation of integrated

MedicalOntologyResearch

Olivier Bodenreider

Lister Hill National Centerfor Biomedical CommunicationsBethesda, Maryland - USA

Contact:Web:

[email protected]