ontology-based data integration
DESCRIPTION
Data integration is a perennial challenge for data scientists working at large scale. Bio-ontologies are useful in this endeavour as sources of synonyms and also for rules-based fuzzy integration pipelines.
TRANSCRIPT
![Page 1: Ontology-based Data Integration](https://reader036.vdocuments.us/reader036/viewer/2022062418/555012d8b4c90555618b4b22/html5/thumbnails/1.jpg)
Industry Programme Workshop: Data Integration, 18-19 September 2013
Ontology-based data integration
Janna Hastings
![Page 2: Ontology-based Data Integration](https://reader036.vdocuments.us/reader036/viewer/2022062418/555012d8b4c90555618b4b22/html5/thumbnails/2.jpg)
Data integration is hard
Technology
Syntax
Semantics
Content
![Page 3: Ontology-based Data Integration](https://reader036.vdocuments.us/reader036/viewer/2022062418/555012d8b4c90555618b4b22/html5/thumbnails/3.jpg)
Different data resources, different needs
“why can’t they all just use the same
- schema
- measurement accuracy
- units
- labels
- content?”
![Page 4: Ontology-based Data Integration](https://reader036.vdocuments.us/reader036/viewer/2022062418/555012d8b4c90555618b4b22/html5/thumbnails/4.jpg)
Standards are the solution… (?)
Source: http://xkcd.com/927/
![Page 5: Ontology-based Data Integration](https://reader036.vdocuments.us/reader036/viewer/2022062418/555012d8b4c90555618b4b22/html5/thumbnails/5.jpg)
Ontology-based data integration
Ontologies can help with the semantic and the content aspects of data integration
• Semantic: definition for schemas
  • OWL is a good language for defining schemas
  • See RDF and Semantic Web presentations, today
• Content: definition of the entities referred to by data
  • Ontologies embedded into a data integration workflow help facilitate content-aware data integration
![Page 6: Ontology-based Data Integration](https://reader036.vdocuments.us/reader036/viewer/2022062418/555012d8b4c90555618b4b22/html5/thumbnails/6.jpg)
Core challenge: labelling
Multiple labels can mean the same thing
One label can mean multiple things
![Page 7: Ontology-based Data Integration](https://reader036.vdocuments.us/reader036/viewer/2022062418/555012d8b4c90555618b4b22/html5/thumbnails/7.jpg)
Semantics-free identifiers, multiple synonyms
CHEBI:27732 (caffeine)
A trimethylxanthine in which the three methyl groups are located at positions 1, 3, and 7.
Synonyms: guaranine, methyltheobromine, 1,3,7-trimethylxanthine, Koffein, caféine
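A minimal sketch of how the synonym lists attached to a semantics-free identifier could be used to resolve free-text labels. The CHEBI:27732 entry reuses the synonyms from the slide; the `EX:` identifiers, the "lead" label, and the function names are hypothetical and only illustrate the case where one label means multiple things.

```python
from collections import defaultdict

# Identifier -> known labels. The CHEBI entry uses the synonyms shown on the slide;
# the EX: entries are hypothetical and exist only to show an ambiguous label.
SYNONYMS = {
    "CHEBI:27732": ["guaranine", "methyltheobromine",
                    "1,3,7-trimethylxanthine", "Koffein", "caféine"],
    "EX:0001": ["lead"],  # hypothetical: the metal
    "EX:0002": ["lead"],  # hypothetical: an ECG lead
}

def build_label_index(synonyms):
    """Invert the table: case-folded label -> set of identifiers."""
    index = defaultdict(set)
    for term_id, labels in synonyms.items():
        for label in labels:
            index[label.casefold()].add(term_id)
    return index

def resolve(label, index):
    """Return every identifier a label may refer to (empty list if unknown)."""
    return sorted(index.get(label.strip().casefold(), set()))

index = build_label_index(SYNONYMS)
print(resolve("Koffein", index))
print(resolve("lead", index))
```

A label that resolves to more than one identifier is ambiguous and needs context or curation; a label that resolves to none marks a gap in the ontology's synonym coverage.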
![Page 8: Ontology-based Data Integration](https://reader036.vdocuments.us/reader036/viewer/2022062418/555012d8b4c90555618b4b22/html5/thumbnails/8.jpg)
Core challenge: biological knowledge
The answer to the question “Is Entity A from Data Source 1 the same thing as Entity B from Data Source 2?” often depends on who is asking and who is answering!
• Left lung vs. lung
• Hippocampus vs. brain
• Dopamine vs. L-dopamine
• In vitro vs. in vivo cells of type X
• Gene Y and post-translationally modified form Y’
• Gene Z in mouse, Gene Z in human
![Page 9: Ontology-based Data Integration](https://reader036.vdocuments.us/reader036/viewer/2022062418/555012d8b4c90555618b4b22/html5/thumbnails/9.jpg)
Hierarchy
left lung is a lung; lung is an organ
Generalise to the nearest common ancestor: if you are integrating data about tissue samples annotated to ‘lung’ in one dataset and ‘left lung’ in the other, the ontology can compute ‘lung’ as the nearest common ancestor. The same applies to ‘left lung’ and ‘right lung’.
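One way the nearest common ancestor could be computed over an ‘is a’ hierarchy, as a sketch only. The small `IS_A` table mirrors the lung example from the slide; the function names and the tie-breaking rule (smallest total number of ‘is a’ steps, then alphabetical) are assumptions rather than a description of any particular ontology tool.

```python
from collections import deque

# Child -> direct 'is a' parents, following the slide's example.
IS_A = {
    "left lung": ["lung"],
    "right lung": ["lung"],
    "lung": ["organ"],
    "organ": [],
}

def ancestor_distances(term, parents):
    """Breadth-first walk: map the term and each ancestor to its 'is a' distance."""
    dist = {term: 0}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, []):
            if parent not in dist:
                dist[parent] = dist[current] + 1
                queue.append(parent)
    return dist

def nearest_common_ancestor(a, b, parents):
    """Return the shared ancestor closest to both terms, or None if there is none."""
    dist_a = ancestor_distances(a, parents)
    dist_b = ancestor_distances(b, parents)
    common = set(dist_a) & set(dist_b)
    if not common:
        return None
    return min(common, key=lambda t: (dist_a[t] + dist_b[t], t))

print(nearest_common_ancestor("lung", "left lung", IS_A))
print(nearest_common_ancestor("left lung", "right lung", IS_A))
```

A term counts as its own ancestor at distance zero, which is what lets ‘lung’ and ‘left lung’ generalise to ‘lung’ rather than to ‘organ’.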
![Page 10: Ontology-based Data Integration](https://reader036.vdocuments.us/reader036/viewer/2022062418/555012d8b4c90555618b4b22/html5/thumbnails/10.jpg)
Other relationships
Relationships encode biological knowledge
Rules let you specify which relationships can be traversed for data integration purposes
e.g. for tissue samples, part_of:
sample_from ∘ part_of => sample_from
A sample from a part of the brain (e.g. the hippocampus) is a sample from the brain
(Quite aside from the ‘is a’ hierarchy!)
hippocampus part of brain
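The traversal rule for tissue samples can be sketched as a closure over only those relationships the rule allows. The hippocampus/brain edge comes from the slide; the `adjacent_to` edge, the triple layout, and the function name are hypothetical, included to show a relationship that the rule does not traverse.

```python
# (subject, relationship, object) triples.
EDGES = [
    ("hippocampus", "part_of", "brain"),          # from the slide
    ("hippocampus", "adjacent_to", "amygdala"),   # hypothetical, must not be traversed
]

def sample_from_closure(tissue, edges, traversable=frozenset({"part_of"})):
    """Return every structure a sample from `tissue` also counts as a sample from.

    Only relationships listed in `traversable` are followed, which encodes
    the rule: sample_from followed by part_of implies sample_from.
    """
    reached = {tissue}
    frontier = [tissue]
    while frontier:
        current = frontier.pop()
        for subject, relation, obj in edges:
            if subject == current and relation in traversable and obj not in reached:
                reached.add(obj)
                frontier.append(obj)
    return reached

print(sample_from_closure("hippocampus", EDGES))
```

Keeping the set of traversable relationships as a parameter makes the integration rule explicit and lets different data types (tissue samples, chemicals, genes) use different rules over the same ontology.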
![Page 11: Ontology-based Data Integration](https://reader036.vdocuments.us/reader036/viewer/2022062418/555012d8b4c90555618b4b22/html5/thumbnails/11.jpg)
Core challenge: flexibility
… (>150 members)
Fixed-depth hierarchies force some classes to be too big, with the lowest level collapsing biological hierarchy
and others too small
… (<1 member)
![Page 12: Ontology-based Data Integration](https://reader036.vdocuments.us/reader036/viewer/2022062418/555012d8b4c90555618b4b22/html5/thumbnails/12.jpg)
Ontologies in content integration
[Diagram: data sources A and B, integrated as A&B]
1. Schema mappings
2. Ontology-provided synonyms
3. Hierarchy and relationship rules for integration
OWL language and tools: web-embedded (but whole-ontology rule reasoning may be slow)
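How schema mappings, synonyms, and hierarchy rules might chain together when merging two sources, as a rough sketch. Every field name, synonym, and record here is hypothetical; only the ‘left lung is a lung is an organ’ hierarchy comes from the slides, and plain dictionaries stand in for an OWL toolchain.

```python
# Step 1: hypothetical schema mappings from each source's field names to shared ones.
SCHEMA_A = {"tissue_name": "tissue", "val": "value"}
SCHEMA_B = {"organ": "tissue", "measurement": "value"}

# Step 2: hypothetical synonym table, lower-cased label -> ontology term.
SYNONYM_TO_TERM = {"left lung": "left lung", "lung, left": "left lung", "lung": "lung"}

# Step 3: 'is a' parent of each term (hierarchy from the slides).
IS_A_PARENT = {"left lung": "lung", "lung": "organ"}

def apply_schema(record, mapping):
    """Rename source-specific fields to the shared field names."""
    return {mapping.get(key, key): value for key, value in record.items()}

def normalise(record):
    """Replace the free-text tissue label by its ontology term when one is known."""
    label = record["tissue"].strip().lower()
    return {**record, "tissue": SYNONYM_TO_TERM.get(label, label)}

def roll_up(term, targets):
    """Walk up the 'is a' hierarchy until a target term is reached (None if never)."""
    while term is not None and term not in targets:
        term = IS_A_PARENT.get(term)
    return term

def integrate(source_a, source_b, targets):
    """Merge values from both sources, grouped by the chosen integration-level term."""
    merged = {}
    for records, schema in ((source_a, SCHEMA_A), (source_b, SCHEMA_B)):
        for record in records:
            record = normalise(apply_schema(record, schema))
            key = roll_up(record["tissue"], targets)
            merged.setdefault(key, []).append(record["value"])
    return merged

a = [{"tissue_name": "Lung, left", "val": 1.0}]
b = [{"organ": "lung", "measurement": 2.0}]
print(integrate(a, b, {"lung"}))
```

Because the roll-up only touches terms that actually occur in the data, a pipeline like this avoids reasoning over the whole ontology, which is one way to sidestep the slowness noted above.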
![Page 13: Ontology-based Data Integration](https://reader036.vdocuments.us/reader036/viewer/2022062418/555012d8b4c90555618b4b22/html5/thumbnails/13.jpg)
Is ontology integration just another type of data integration?
• Which ontology(-ies) to use?
• How to use them together?
• How to plug the gaps?
• Why should I (as a user) have to do this integration over and over?
![Page 14: Ontology-based Data Integration](https://reader036.vdocuments.us/reader036/viewer/2022062418/555012d8b4c90555618b4b22/html5/thumbnails/14.jpg)
Desiderata for ontologies for data integration
• Ontologies should be neutral and shared community-wide
• Users should be able to directly and rapidly extend the ontology where there are gaps (responsiveness)
• The ontology should use semantics-free identifiers and at the same time energetically annotate synonyms
• When necessary, ontologies should take care of ontology integration to provide the community with a one-stop service and appropriate cross-references
• The ontologies should be used in data annotation
See http://www.obofoundry.org/
![Page 15: Ontology-based Data Integration](https://reader036.vdocuments.us/reader036/viewer/2022062418/555012d8b4c90555618b4b22/html5/thumbnails/15.jpg)
Questions?