kno.e · 27-05-2009 · bio2rdf: towards a mashup to build bioinformatics knowledge systems...
TRANSCRIPT
Ontologies and data integration in biomedicine
Olivier Bodenreider
Lister Hill National Center for Biomedical Communications
Bethesda, Maryland - USA
Kno.e.sis, Wright State University, Dayton, Ohio
May 27, 2009
Lister Hill National Center for Biomedical Communications 2
Outline
Why integrate data?
Ontologies and data integration
Examples
Challenging issues
Why integrate data?
Why integrate data?
Sources of information
  Created by
    Independent researchers
    Separate workflows
  Heterogeneous
  Scattered
  “Silos”
To identify patterns in integrated datasets
  Hypothesis generation
  Knowledge discovery
Motivation
Translational research
  “Bench to Bedside”
  Integration of clinical and research activities and results
Supported by research programs
  NIH Roadmap
  Clinical and Translational Science Awards (CTSA)
Requires the effective integration and exchange of information between
  Basic research
  Clinical research
Genotype and phenotype
[Goh, PNAS 2007]
• OMIM
• [HPO]
Genes and environmental factors
[Liu, BMC Bioinf. 2008]
• MEDLINE (MeSH index terms)
• Genetic Association Database
Integrating drugs and targets
[Yildirim, Nature Biot. 2007]
• DrugBank
• ATC
• Gene Ontology
Why ontologies?
Uses of biomedical ontologies
Knowledge management
  Annotating data and resources
  Accessing biomedical information
  Mapping across biomedical ontologies
Data integration, exchange and semantic interoperability
Decision support
  Data selection and aggregation
  Decision support
  NLP applications
  Knowledge discovery
[Bodenreider, YBMI 2008]
Terminology and translational research
[Diagram: cancer basic research (NCI Thesaurus) and EHR data on cancer patients (SNOMED CT)]
Approaches to data integration (1)
Warehousing
  Sources to be integrated are transformed into a common format and converted to a common vocabulary
Mediation
  Local schema (of the sources)
  Global schema (in reference to which the queries are made)
[Stein, Nature Rev. Gen. 2003]
[Hernandez, SIGMOD Rec. 2004]
[Goble, J. Biomedical Informatics 2008]
Approaches to data integration (2)
Linked data
  Links among data elements
  Enable navigation by humans
[Stein, Nature Rev. Gen. 2003]
[Hernandez, SIGMOD Rec. 2004]
[Goble, J. Biomedical Informatics 2008]
Ontologies and warehousing
Role
  Provide a conceptualization of the domain
  Help define the schema
  Information model vs. ontology
  Provide value sets for data elements
  Enable standardization and sharing of data
Examples
  Annotations to the Gene Ontology
  BioWarehouse
  Clinical information systems
http://biowarehouse.ai.sri.com/
Ontologies and mediation
Role
  Reference for defining the global schema
  Map between local and global schemas
  Query reformulation
  Local-as-view vs. Global-as-view
Examples
  TAMBIS
  BioMediator
  OntoFusion
[Stevens, Bioinformatics 2000]
[Louie, AMIA 2005]
[Perez-Rey, Comput Biol Med 2006]
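A minimal sketch of mediation under a global-as-view assumption: each relation of the global schema is defined as a set of views over the local sources, and a query against the global schema is reformulated by unfolding those views. The two sources, their field names, the identifiers and the `gene_function` relation are all hypothetical; nothing here is taken from TAMBIS, BioMediator or OntoFusion.

```python
# Hypothetical local sources, each with its own local schema.
SOURCE_A = [{"gene_id": "GENE:1", "go_term": "FUNC:10"}]
SOURCE_B = [
    {"locus": "GENE:2", "function": "FUNC:10"},
    {"locus": "GENE:3", "function": "FUNC:20"},
]


def view_a():
    """Expose source A as (gene, function) pairs of the global schema."""
    for row in SOURCE_A:
        yield row["gene_id"], row["go_term"]


def view_b():
    """Expose source B as (gene, function) pairs of the global schema."""
    for row in SOURCE_B:
        yield row["locus"], row["function"]


# Global-as-view: a global relation is defined by views over the sources.
GLOBAL_SCHEMA = {"gene_function": [view_a, view_b]}


def query_global(relation, function):
    """Answer a global query by unfolding it into the local views."""
    return sorted(
        gene
        for view in GLOBAL_SCHEMA[relation]
        for gene, func in view()
        if func == function
    )


print(query_global("gene_function", "FUNC:10"))
```

The caller only sees `gene_function`; the local field names (`gene_id`, `locus`) stay hidden behind the views, which is the point of the global schema.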
Ontologies and linked data
Role
  Explicit conceptualization of the domain
  Semantic normalization of data elements
Examples
  Entrez
  Semantic Web mashups
  Bio2RDF
[http://www.ncbi.nlm.nih.gov/]
[J. Biomedical informatics 41(5) 2008]
[http://bio2rdf.org/]
Ontologies and data integration
Source of identifiers for biomedical entities
Semantic normalization
  Warehouse approaches
Source of reference relations for the global schema
Mapping between local and global schemas
  Mediator-based approaches
Source of identifiers for biomedical entities
Semantic normalization
Explicit conceptualization of the domain
  Linked data approaches
Ontologies and data aggregation
Source of hierarchical relations
Aggregate data into coarser categories
Abstract away from low-frequency, fine-grained data points
Increase power
Improve visualization
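A minimal sketch of aggregation along hierarchical relations: counts attached to fine-grained terms are rolled up to every ancestor, so sparse data points pool into coarser categories. The is-a hierarchy and the counts are hypothetical and only illustrate the idea.

```python
from collections import Counter

# Hypothetical is-a hierarchy: child term -> list of parent terms.
IS_A = {
    "viral pneumonia": ["pneumonia"],
    "bacterial pneumonia": ["pneumonia"],
    "pneumonia": ["lung disease"],
    "asthma": ["lung disease"],
    "lung disease": [],
}


def ancestors_or_self(term):
    """Return the term together with all of its ancestors."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(IS_A.get(current, []))
    return seen


def aggregate(counts):
    """Propagate each count to the term itself and to all its ancestors."""
    rolled = Counter()
    for term, n in counts.items():
        # A set is used so a term reachable by two paths is counted once.
        for ancestor in ancestors_or_self(term):
            rolled[ancestor] += n
    return rolled


observed = {"viral pneumonia": 2, "bacterial pneumonia": 3, "asthma": 4}
rolled_up = aggregate(observed)
print(rolled_up["pneumonia"], rolled_up["lung disease"])  # 5 9
```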
Annotating data
Gene Ontology
  Functional annotation of gene products in several dozen model organisms
Various communities use the same controlled vocabularies
  Enabling comparisons across model organisms
Annotations
  Assigned manually by curators
  Inferred automatically (e.g., from sequence similarity)
GO Annotations for Aldh2 (mouse)
http://www.informatics.jax.org/
GO ALD4 in Yeast
http://db.yeastgenome.org/
GO Annotations for ALDH2 (Human)
http://www.ebi.ac.uk/GOA/
Integration applications
Based on shared annotations
  Enrichment analysis (within/across species)
  Clustering (co-clustering with gene expression data)
Based on the structure of GO
  Closely related annotations
  Semantic similarity
Based on associations between gene products and annotations
  Leveraging reasoning
[Bodenreider, PSB 2005]
[Sahoo, Medinfo 2007]
[Lord, PSB 2003]
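A minimal sketch of enrichment analysis based on shared annotations, assuming a one-sided hypergeometric test: given a background population of annotated genes, it asks how likely a study set would contain at least as many genes carrying a term by chance. The gene names, the term "T" and the annotations are hypothetical; real analyses typically also correct for multiple testing, which this sketch does not do.

```python
from math import comb


def hypergeom_tail(N, K, n, k):
    """P(X >= k) when drawing n genes from N, of which K carry the term."""
    total = comb(N, n)
    favourable = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1)
    )
    return favourable / total


def enrichment_p(annotations, study_genes, term):
    """One-sided enrichment p-value of `term` in the study set."""
    population = set(annotations)
    study = set(study_genes) & population
    with_term = {g for g in population if term in annotations[g]}
    return hypergeom_tail(
        len(population), len(with_term), len(study), len(study & with_term)
    )


# Hypothetical annotations: ten genes, three of them annotated with "T".
ANNOTATIONS = {f"g{i}": ({"T"} if i <= 3 else set()) for i in range(1, 11)}
p_value = enrichment_p(ANNOTATIONS, ["g1", "g2", "g3"], "T")
print(p_value)
```

Here all three study genes carry the term, so only one of the 120 possible three-gene draws is as extreme.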
Gene Ontology
Integration Entrez Gene + GO
[Diagram: an Entrez Gene record linking a gene to its gene name, sequence, interactions, GO, PubMed and OMIM; labels shown: Glycosyltransferase, Congenital muscular dystrophy]
[Sahoo, Medinfo 2007]
From glycosyltransferase to congenital muscular dystrophy
[Diagram]
EG:9215 (LARGE) has_molecular_function GO:0008375 (acetylglucosaminyl-transferase)
EG:9215 (LARGE) has_associated_phenotype MIM:608840 (Muscular dystrophy, congenital, type 1D)
GO:0008375 (acetylglucosaminyl-transferase) isa GO:0008194, GO:0016758
GO:0008194, GO:0016758 isa GO:0016757 (glycosyltransferase)
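A minimal sketch of how the path on this slide can be followed once Entrez Gene and GO are expressed as triples with shared identifiers. The identifiers come from the slide; the exact set of isa edges is an interpretation of the diagram, and the plain-tuple representation is an assumption made for illustration (it is not the RDF store or query language used in the cited work).

```python
# Triples read off the slide; the isa edges are an interpretation of the diagram.
TRIPLES = [
    ("EG:9215", "has_molecular_function", "GO:0008375"),
    ("EG:9215", "has_associated_phenotype", "MIM:608840"),
    ("GO:0008375", "isa", "GO:0008194"),
    ("GO:0008375", "isa", "GO:0016758"),
    ("GO:0008194", "isa", "GO:0016757"),
    ("GO:0016758", "isa", "GO:0016757"),
]


def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is asserted."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def is_a(term, ancestor):
    """True if term equals ancestor or reaches it through isa edges."""
    stack, seen = [term], set()
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(objects(current, "isa"))
    return False


def phenotypes_for_function(go_class):
    """Phenotypes of genes whose molecular function falls under go_class."""
    genes = {
        s
        for s, p, o in TRIPLES
        if p == "has_molecular_function" and is_a(o, go_class)
    }
    found = set()
    for gene in genes:
        found |= objects(gene, "has_associated_phenotype")
    return found


print(phenotypes_for_function("GO:0016757"))  # {'MIM:608840'}
```

The gene is annotated with the specific term GO:0008375, so the query for the general class GO:0016757 only succeeds because the isa hierarchy is traversed.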
Cancer Biomedical Informatics Grid
US National Cancer Institute
Common infrastructure used to share data and applications across institutions to support cancer research efforts in a grid environment
Service-oriented architecture
  Data and application services available on the grid
Supported by ontological resources
caBIG services
caArray
  Microarray data repository
caTissue
  Biospecimen repository
caFE (Cancer Function Express)
  Annotations on microarray data
…
caTRIP
  Cancer Translational Research Informatics Platform
  Integrates data services
Ontological resources
NCI Thesaurus
  Reference terminology for the cancer domain
  ~ 60,000 concepts
  OWL Lite
Cancer Data Standards Repository (caDSR)
  Metadata repository
  Used to bridge across UML models through Common Data Elements
  Links to concepts in ontologies
Examples
Semantic Web for Health Care and Life Sciences
http://www.w3.org/2001/sw/hcls/
Semantic Web layer cake
Linked data
linkeddata.org
Linked data
Linked biomedical data
[Tim Berners-Lee TED 2009 conference]
http://www.w3.org/2009/Talks/0204-ted-tbl/#(1)
W3C Health Care and Life Sciences IG
Biomedical Semantic Web
Integration
  Data/Information
  E.g., translational research
Hypothesis generation
Knowledge discovery
[Ruttenberg, BMC Bioinf. 2007]
HCLS mashup of biomedical sources
NeuronDB
BAMS
NC Annotations
Homologene
SWAN
Entrez Gene
Gene Ontology
Mammalian Phenotype
PDSPki
BrainPharm
AlzGene
Antibodies
PubChem
MeSH
Reactome
Allen Brain Atlas
Publications
http://esw.w3.org/topic/HCLS/HCLSIG_DemoHomePage_HCLSIG_Demo
Shared identifiers Example
GO
HCLS mashup
[Diagram: sources and the kinds of data elements each contributes]
NeuronDB: Protein (channels/receptors), Neurotransmitters, Neuroanatomy, Cell, Compartments, Currents
BAMS: Protein, Neuroanatomy, Cells, Metabolites (channels), PubMedID
NC Annotations: Genes/Proteins, Processes, Cells (maybe), PubMed ID
Allen Brain Atlas: Genes, Brain images, Gross anatomy -> neuroanatomy
Homologene: Genes, Species, Orthologies, Proofs
SWAN: PubMedID, Hypothesis, Questions, Evidence, Genes
Entrez Gene: Genes, Protein, GO, PubMedID, Interaction (g/p), Chromosome, C. location
GO: Molecular function, Cell components, Biological process, Annotation gene, PubMedID
Mammalian Phenotype: Genes, Phenotypes, Disease, PubMedID
PDSPki: Proteins, Chemicals, Neurotransmitters
BrainPharm: Drug, Drug effect, Pathological agent, Phenotype, Receptors, Channels, Cell types, PubMedID, Disease
AlzGene: Gene, Polymorphism, Population, Alz Diagnosis
Antibodies: Genes, Antibodies
PubChem: Name, Structure, Properties, MeSH term
MeSH: Drugs, Anatomy, Phenotypes, Compounds, Chemicals, PubMedID, PubChem
Reactome: Genes/proteins, Interactions, Cellular location, Processes (GO)
HCLS mashups
Based on RDF/OWL
Based on shared identifiers
  “Recombinant data” (E. Neumann)
Ontologies used in some cases
Support applications (SWAN, SenseLab, etc.)
Journal of Biomedical Informatics special issue on Semantic bio-mashups
[J. Biomedical Informatics 41(5) 2008]
Semantic bio-mashups
Bio2RDF: Towards a mashup to build bioinformatics knowledge systems
Identifying disease-causal genes using Semantic Web-based representation of integrated genomic and phenomic knowledge
Schema driven assignment and implementation of life science identifiers (LSIDs)
The SWAN biomedical discourse ontology
An ontology-driven semantic mashup of gene and biological pathway information: Application to the domain of nicotine dependence
Towards an ontology for sharing medical images and regions of interest in neuroimaging
yOWL: An ontology-driven knowledge base for yeast biologists
Dynamic sub-ontology evolution for traditional Chinese medicine web ontology
Ontology-centric integration and navigation of the dengue literature
Infrastructure for dynamic knowledge integration—Automated biomedical ontology extension using textual resources
An ontological knowledge framework for adaptive medical workflow
Semi-automatic web service composition for the life sciences using the BioMoby semantic web framework
Combining Semantic Web technologies with Multi-Agent Systems for integrated access to biological resources
[J. Biomedical Informatics 41(5) 2008]
Challenging issues
Challenging issues
Bridges across ontologies
Permanent identifiers for biomedical entities
Other issues
Challenging issues
Bridges across ontologies
Trans-namespace integration
[Diagram: the same disease identified in three namespaces]
Biomedical literature: MeSH, Addison Disease (D000224)
Clinical repositories: SNOMED CT, Addison's disease (363732003)
ICD 10: Primary adrenocortical insufficiency (E27.1)
(Integrated) concept repositories
Unified Medical Language System
  http://umlsks.nlm.nih.gov
NCBO’s BioPortal
  http://www.bioontology.org/tools/portal/bioportal.html
caDSR
  http://ncicb.nci.nih.gov/NCICB/infrastructure/cacore_overview/cadsr
Open Biomedical Ontologies (OBO)
  http://obofoundry.org/
Integrating subdomains
[Diagram: subdomains and their ontologies, with the UMLS at the center]
Biomedical literature: MeSH
Genome annotations: GO
Model organisms: NCBI Taxonomy
Genetic knowledge bases: OMIM
Clinical repositories: SNOMED CT
Other subdomains: …
Anatomy: FMA
UMLS
Integrating subdomains
Biomedical literature
Genome annotations
Model organisms
Genetic knowledge bases
Clinical repositories
Other subdomains
Anatomy
Trans-namespace integration
[Diagram: subdomains and their ontologies, linked through the UMLS]
Genome annotations: GO
Model organisms: NCBI Taxonomy
Genetic knowledge bases: OMIM
Other subdomains: …
Anatomy: FMA
Biomedical literature: MeSH, Addison Disease (D000224)
Clinical repositories: SNOMED CT, Addison's disease (363732003)
UMLS: C0001403
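A minimal sketch of trans-namespace integration through a shared concept identifier, using the two codes and the UMLS identifier shown on this slide. The dictionary layout and the `translate` helper are assumptions made for illustration; they are not the UMLS file format or API.

```python
# (source vocabulary, code) -> shared concept identifier, as on the slide.
CODE_TO_CUI = {
    ("MeSH", "D000224"): "C0001403",         # Addison Disease
    ("SNOMED CT", "363732003"): "C0001403",  # Addison's disease
}


def translate(source, code, target):
    """Codes in `target` that share a concept identifier with (source, code)."""
    cui = CODE_TO_CUI.get((source, code))
    if cui is None:
        return []
    return sorted(
        other_code
        for (vocabulary, other_code), other_cui in CODE_TO_CUI.items()
        if vocabulary == target and other_cui == cui
    )


print(translate("MeSH", "D000224", "SNOMED CT"))  # ['363732003']
```

Data indexed with MeSH in the literature and data coded with SNOMED CT in clinical repositories can then be joined on the shared identifier rather than on the source-specific codes.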
Mappings
Created manually (e.g., UMLS)
  Purpose
  Directionality
Created automatically (e.g., BioPortal)
  Lexically: ambiguity, normalization
  Semantically: lack of / incomplete formal definitions
Key to enabling semantic interoperability
  Enabling resource for the Semantic Web
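A minimal sketch of lexical mapping with normalization, assuming a deliberately crude normalizer (lowercasing, dropping possessives and punctuation, ignoring word order). It is not the normalization used by the UMLS or BioPortal, and it does nothing about ambiguity: two different concepts with the same normalized name would be mapped to each other.

```python
import re


def normalize(term):
    """Lowercase, drop possessive 's and punctuation, and sort the words."""
    text = term.lower()
    text = re.sub(r"'s\b", "", text)
    text = re.sub(r"[^a-z0-9 ]+", " ", text)
    return " ".join(sorted(text.split()))


def lexical_mappings(source_a, source_b):
    """Pair codes from two vocabularies whose names normalize identically."""
    index = {}
    for code, name in source_b.items():
        index.setdefault(normalize(name), []).append(code)
    return [
        (code_a, code_b)
        for code_a, name in source_a.items()
        for code_b in index.get(normalize(name), [])
    ]


MESH = {"D000224": "Addison Disease"}
SNOMED_CT = {"363732003": "Addison's disease"}
print(lexical_mappings(MESH, SNOMED_CT))  # [('D000224', '363732003')]
```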
Challenging issues
Permanent identifiers for biomedical entities
Identifying biomedical entities
Multiple identifiers for the same entity in different ontologies
Barrier to data integration in general
  Data annotated to different ontologies cannot “recombine”
  Need for mappings across ontologies
Barrier to data integration in the Semantic Web
  Multiple possible identifiers for the same entity
    Depending on the underlying representational scheme (URI vs. LSID)
    Depending on who creates the URI
Possible solutions
PURL http://purl.org
  One level of indirection between developers and users
  Independence from local constraints at the developer’s end
The institution creating a resource is also responsible for minting URIs
  E.g., URI for genes in Entrez Gene
Guidelines: “URI note”
  W3C Health Care and Life Sciences Interest Group
Shared names initiative
  Identify resources vs. entities
[http://sharedname.org/]
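A minimal sketch of the "one level of indirection" idea behind persistent identifiers: users keep a stable identifier, and a registry maps it to wherever the developer currently serves the resource. The class, the URIs and the server names are hypothetical; this is not how purl.org or the Shared Names initiative is implemented.

```python
class PersistentResolver:
    """Toy registry standing in for a PURL-style resolution service."""

    def __init__(self):
        self._targets = {}

    def register(self, persistent_uri, current_location):
        """Create or update the location a persistent URI points to."""
        self._targets[persistent_uri] = current_location

    def resolve(self, persistent_uri):
        """Return the current location for a persistent URI."""
        return self._targets[persistent_uri]


resolver = PersistentResolver()
gene_uri = "http://purl.example.org/gene/9215"  # hypothetical persistent URI
resolver.register(gene_uri, "http://server-a.example.org/genes?id=9215")
first_location = resolver.resolve(gene_uri)

# The provider reorganizes its site: only the registry entry changes.
resolver.register(gene_uri, "http://server-b.example.org/gene/9215")
second_location = resolver.resolve(gene_uri)
print(first_location, second_location)
```

Datasets that stored `gene_uri` need no change after the move, which is the independence from local constraints mentioned on the slide.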
Challenging issues
Other issues
Availability
Many ontologies are freely available
The UMLS is freely available for research purposes
  Cost-free license required
  Licensing issues can be tricky
SNOMED CT is freely available in member countries of the IHTSDO
Being freely available
  Is a requirement for the Open Biomedical Ontologies (OBO)
  Is a de facto prerequisite for Semantic Web applications
Discoverability
Ontology repositories
  UMLS: 152 source vocabularies (biased towards healthcare applications)
  NCBO BioPortal: ~141 ontologies (biased towards biological applications)
  Limited overlap between the two repositories
Need for discovery services
  Metadata for ontologies
Formalism
Several major formalisms
  Web Ontology Language (OWL) – NCI Thesaurus
  OBO format – most OBO ontologies
  UMLS Rich Release Format (RRF) – UMLS, RxNorm
Conversion mechanisms
  OBO to OWL
  LexGrid (import/export to LexGrid internal format)
Ontology integration
Post hoc integration, from the bottom up
  UMLS approach
  Integrates ontologies “as is”, including legacy ontologies
  Facilitates the integration of the corresponding datasets
Coordinated development of ontologies
  OBO Foundry approach
  Ensures consistency ab initio
  Excludes legacy ontologies
Quality
Quality assurance in ontologies is still imperfectly defined
  Difficult to define outside a use case or application
Several approaches to evaluating quality
  Collaboratively, by users (Web 2.0 approach)
    Marginal notes enabled by BioPortal
  Centrally, by experts
    OBO Foundry approach
Important factors besides quality
  Governance
  Installed base / Community of practice
Conclusions
Ontologies are enabling resources for data integration
Standardization works
  Grass roots effort (GO)
  Regulatory context (ICD 9-CM)
Bridging across resources is crucial
  Ontology integration resources / strategies (UMLS, BioPortal / OBO Foundry)
Massive amounts of imperfect data integrated with rough methods might still be useful
Medical Ontology Research
Olivier Bodenreider
Lister Hill National Center for Biomedical Communications
Bethesda, Maryland - USA