Larisa Soldatova
RCUK Fellow
The Department of Computer Science
The University of Wales, Aberystwyth
ART: ontology based annotation of scientific papers
Manchester, December 2, 2008
Plan of the talk:
1. Introduction to ontology.
2. An example: classification of biomedical text by Hagit Shatkay.
3. The Robot Scientist project and EXPO.
4. LABORS, EXACT (protocols), DD (drug discovery), OntoDM (data mining).
5. The ART project.
6. SAPIENT demo - by Maria Liakata.
Ontologies
An ontology is “a concise and unambiguous description of what principal entities are relevant to an application domain and the relationship between them”*.
*Schulze-Kremer, S., 2001, Computer and Information Sci. 6(21)
Soldatova, UWA
1. Introduction
Ontology parts
Classes and instances;
Is-a relations;
Other relations (part-of, located-in, has-agent).
Axioms.
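The parts listed above can be illustrated with a minimal Python sketch. The class names, the instance and the part-of fact are hypothetical examples chosen for illustration; they are not taken from any of the ontologies discussed in this talk. The single axiom encoded here is the assumption that is-a is transitive.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    """A toy ontology: classes, instances, is-a links and other relations."""
    is_a: dict = field(default_factory=dict)         # class -> parent class
    instance_of: dict = field(default_factory=dict)  # instance -> class
    relations: set = field(default_factory=set)      # (subject, relation, object)

    def subsumes(self, ancestor: str, cls: str) -> bool:
        """Axiom (assumed): is-a is transitive, so walk up the parent chain."""
        seen = set()
        while cls is not None and cls not in seen:
            if cls == ancestor:
                return True
            seen.add(cls)
            cls = self.is_a.get(cls)
        return False

    def is_instance(self, instance: str, cls: str) -> bool:
        """An instance of a class is also an instance of every superclass."""
        direct = self.instance_of.get(instance)
        return direct is not None and self.subsumes(cls, direct)


# Hypothetical content, for illustration only.
onto = Ontology()
onto.is_a["yeast cell"] = "cell"
onto.is_a["cell"] = "material entity"
onto.instance_of["cell_42"] = "yeast cell"
onto.relations.add(("nucleus", "part-of", "cell"))
```

With this content, `onto.is_instance("cell_42", "cell")` holds because the instance's direct class is linked to `cell` by is-a.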
The EXACT description of protocols
Ontologies in life sciences: positive examples
OBI (biomedical investigations): http://obi-ontology.org/page/Main_Page
FMA (Foundational Model of Anatomy ontology): http://sig.biostr.washington.edu/projects/fm/AboutFM.html
MSI for metabolomics experiments*: http://msi-ontology.sourceforge.net/
* Sansone, S., Schober, D., Atherton, H.J., Fiehn, O., Jenkins, H., Rocca-Serra, Ph., Rubtsov, D.V., Spasic, I., Soldatova, L.N., Taylor, Ch., Tseng, A., Viant, M.R. and the Ontology Working Group Members. (2007) Metabolomics Standards Initiative - Ontology Working Group. Work in Progress. Metabolomics 3/3: 249-256.
Ontologies in life sciences: negative examples
MGED ontology for microarray experiments*
mmCIF for protein data bank (PDB)**
*Soldatova, L.N., King, R.D., (2005) Are the Current Ontologies used in Biology Good Ontologies? Nature Biotechnology 9/23: 1096-1098.
Soldatova, LN & King, RD. (2006) Reply to Wrestling with SUMO and bio-ontologies. Nature Biotechnology. 24/ 23.
** Schierz, A.C., Soldatova, L.N. and King, R.D. (2007) Overhauling the PDB. Nature Biotechnology 25/4: 437-442.
Schierz, A.C., Soldatova, L.N. and King, R.D. (2007) The reply: Overhauling the PDB. Nature Biotechnology 25/8: 846.
An example: classification of biomedical text by Hagit Shatkay.
Focus (scientific, generic, methodology);
Polarity (affirmative/negative);
Certainty (0-3);
Evidence (E0-E3);
Direction/Trend (increase/decrease).
Shatkay et al. (2008) Multidimensional classification of biomedical text: Toward automated, practical provision of high-utility text to diverse users. Bioinformatics 24/18: 2086-2093.
Problems?
Polarity, Certainty, Direction/Trend – are properties of some entities;
Values: scientific, generic, methodology – have overlapping semantics;
Evidence – re-inventing the wheel: ECO (evidence codes) http://www.obofoundry.org/cgi-bin/detail.cgi?id=evidence_code
ECO vs Hagit
E0 – no stated evidence or lack of evidence
E1 – mentions of evidence with no explicit reference
E2 – statement is backed by a reference to a supporting publication
E3 – experimental evidence is directly given in the text
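As a way to make the five dimensions and the E0-E3 levels concrete, here is a minimal Python sketch of one annotated text fragment. The type names, the field layout and the example sentence are assumptions made for illustration; this is not the representation used by Shatkay et al., and the optional trend field is a guess that not every fragment expresses a direction.

```python
from dataclasses import dataclass
from enum import Enum, IntEnum
from typing import Optional


class Focus(Enum):
    SCIENTIFIC = "scientific"
    GENERIC = "generic"
    METHODOLOGY = "methodology"


class Polarity(Enum):
    AFFIRMATIVE = "affirmative"
    NEGATIVE = "negative"


class Trend(Enum):
    INCREASE = "increase"
    DECREASE = "decrease"


class Evidence(IntEnum):
    E0 = 0  # no stated evidence or lack of evidence
    E1 = 1  # mention of evidence with no explicit reference
    E2 = 2  # backed by a reference to a supporting publication
    E3 = 3  # experimental evidence given directly in the text


@dataclass(frozen=True)
class FragmentAnnotation:
    text: str
    focus: Focus
    polarity: Polarity
    certainty: int                  # 0-3, as on the slide
    evidence: Evidence
    trend: Optional[Trend] = None   # assumption: may be absent

    def __post_init__(self):
        if not 0 <= self.certainty <= 3:
            raise ValueError("certainty must be in the range 0-3")


# Hypothetical fragment, for illustration only.
example = FragmentAnnotation(
    text="Expression of the gene increased (Smith, 2001).",
    focus=Focus.SCIENTIFIC,
    polarity=Polarity.AFFIRMATIVE,
    certainty=3,
    evidence=Evidence.E2,
    trend=Trend.INCREASE,
)
```

Writing the scheme down this way also shows the first problem raised above: polarity, certainty and trend end up as fields of a fragment rather than as properties of the entities the fragment talks about.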
Example definition of a relation:
is_concretization_of relates a generically dependent continuant to a specifically dependent continuant. A generically dependent continuant may inhere in more than one entity. It does so by virtue of the fact that there is, for each entity that it inheres, a specifically dependent *concretization* of the generically dependent continuant that is specifically dependent.
2. Ontology of scientific experiments EXPO
EXPO* v.1
Concepts: 220
Language: OWL
http://sourceforge.net/projects/expo
Tool: Hozo Ontology Editor
*Soldatova, LN & King, RD (2006) An Ontology of Scientific Experiments. Journal of the Royal Society Interface 3/11: 795-803.
2. EXPO
EXPO concepts
Solenodons
The paper investigates the phylogenetic status of the mammalian species Solenodon cubanus and Solenodon paradoxus, i.e. the evolutionary relationship of these animals to all others.
Solenodons have been isolated since the age of the dinosaurs!
Scientific Experiment: Hypothesis-forming, Hypothesis-driven
Admin info about experiment:
  Title: Mesozoic Origin of West Indian Insectivores
  Author: Roca, A.L., Bar-Gal, G.K., Eizirik, E., Helgen, M.K.,……
  Organisation: 1. National Cancer Institute, Frederick, USA …
  Status: public academic
  Reference: Roca, A.L., Bar-Gal, G.K., Eizirik, E., Helgen, M.K., Maria, R. et al. Mesozoic origin for West Indian insectivores. Nature, 429, 649-651 (2004).
Classification of experiment:
  Taxonomy
    DDC (Dewey): 575 Evolution and Genetics
    Library of Congress: QH 367.5 molecular phylogenetics
  Zoology
    DDC (Dewey): 599 mammalogy
    Library of Congress: QL351-QL352 Zoology-Classification
Experimental goal: To discover the phylogeny of the species Solenodon paradoxus and Solenodon cubanus
Null hypothesis H01: explicit
  Representation style: text
  Linguistic expression: natural language
    “Some have suggested a close relationship to soricids (shrews) but not to talpids”
  Linguistic expression: artificial language: predicate calculus
…………………………
experimental action 1.1.1 extraction and purification
  object: sample of DNA
  parent group: DNA from Solenodon paradoxus
  sampling: random sampling
  instrument: Qiagen DNA cleanup kit
experimental action 1.1.2 DNA amplification
…………………………
Experimental Conclusions (Formed Hypotheses)
C1) Hypothesis
  Representation style: text
  Linguistic expression: natural language
    There existed a mammal that is the ancestor of: Solenodons, Soricoidea, Talpoidea, Erinaceidea, and which is not the ancestor of any other mammal.
  Linguistic expression: artificial language: predicate calculus
…………………………
Prolog:
% Query: is there a mammal An that is an ancestor of So and Sh but not of T?
?- instantiation(solenodon, So), instantiation(soricoidea, Sh),
   instantiation(talpoidea, T), instantiation(mammalia, An),
   shared_ancestor([So, Sh], [T], An).

% shared_ancestor(Shared, Not_shared, An):
% An is an ancestor of every member of Shared and of no member of Not_shared.
shared_ancestor([X], [Y], An) :-
    ancestor(An, X),
    \+ ancestor(An, Y).
shared_ancestor([X|Lx], Ly, An) :-
    Lx \= [],
    shared_ancestor(Lx, Ly, An),
    ancestor(An, X).
shared_ancestor(Lx, [Y|Ly], An) :-
    Ly \= [],
    shared_ancestor(Lx, Ly, An),
    \+ ancestor(An, Y).
EXPO: A scientific experiment is a research method which permits the investigation of cause-effect relations between known and unknown (target) variables of the field of study (domain). An experimental result cannot be known with certainty in advance.
EXPO: A classification of experiments is a hierarchical system of categories – types of experiments – according to their domains or used models of experiments.
EXPO: A null hypothesis is an experimental hypothesis that states that a known controlled variable or variables does not have a specified effect on the unknown (target) variable or variables of the domain.
XML:
</rdfs:Class>
<rdfs:Class rdf:ID="classification of experiments">
  <rdfs:label>classification of experiments</rdfs:label>
  <rdfs:subClassOf rdf:resource="#classification" />
  <rdfs:comment>Def: A classification of experiments is a hierarchical system of categories - types of experiments - according to their domains or used models of experiments. Axiom: </rdfs:comment>
Problems Highlighted by Annotation
The use of EXPO makes explicit the different hypotheses described in the paper. The research conclusions are not mentioned as hypotheses in the text. This contrasts with seven null-hypotheses mentioned explicitly in the main text.
The DNA sequences produced during the experiment were stored in the EMBL database using the taxonomic term “Insectivora”. This taxon is now generally recognised to be polyphyletic, and its use contradicts the actual conclusions of the paper.
The authors’ conclusion: “Cuban Solenodons should be classified in a distinct genus, Atopogale”. Our analysis shows that it would be more internally consistent to classify Cuban Solenodons as a distinct family.
etc…..
EXPO dissemination:
Soldatova, LN & King, RD. (2006) An Ontology of Scientific Experiments. Journal of the Royal Society Interface 3/11: 795-803.
2006 nomination for World Technology Award (software).
Articles in the New Scientist and the Chronicle of Higher Education.
The Concept of a Robot Scientist:
[Diagram of the Robot Scientist cycle. Labels: Background Knowledge, Hypothesis Formation, Consistent Hypotheses, Experiment selection, Robot, Experiment, Results Interpretation, Analysis, Final Theory]
The robot scientist project aims to develop a computer system that is capable of originating its own experiments, physically doing them, interpreting the results, and then repeating the cycle.
*King et al. (2004) Nature, 427, 247-252.
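The cycle described above can be sketched as a small loop in Python. This is a toy illustration under stated assumptions, not the Robot Scientist's actual software: a hypothesis is modelled as a table of predicted yes/no outcomes, the "most even split" selection rule is an assumption chosen for simplicity, and the physical experiment is replaced by a callback. All names are hypothetical.

```python
from typing import Callable, Dict, List

# Assumption: a hypothesis maps an experiment name to the outcome it predicts.
Hypothesis = Dict[str, bool]


def select_experiment(hypotheses: List[Hypothesis], experiments: List[str]) -> str:
    """Pick the experiment whose predicted outcomes split the hypotheses most evenly."""
    def imbalance(exp: str) -> int:
        positives = sum(h[exp] for h in hypotheses)
        return abs(2 * positives - len(hypotheses))
    return min(experiments, key=imbalance)


def robot_scientist_cycle(
    hypotheses: List[Hypothesis],
    experiments: List[str],
    run_experiment: Callable[[str], bool],
) -> List[Hypothesis]:
    """Repeat select -> run -> interpret until one hypothesis or no experiment is left."""
    remaining = list(experiments)
    while len(hypotheses) > 1 and remaining:
        exp = select_experiment(hypotheses, remaining)
        remaining.remove(exp)
        result = run_experiment(exp)  # stands in for the physical experiment
        # Interpretation: keep only hypotheses consistent with the observed result.
        hypotheses = [h for h in hypotheses if h[exp] == result]
    return hypotheses


# Hypothetical candidates and a simulated "robot" that answers from a hidden truth.
candidates = [
    {"exp_a": True, "exp_b": True},
    {"exp_a": True, "exp_b": False},
    {"exp_a": False, "exp_b": False},
]
truth = candidates[1]
final = robot_scientist_cycle(candidates, ["exp_a", "exp_b"], lambda exp: truth[exp])
```

In this run both experiments are needed before a single consistent hypothesis remains.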
The Application Domain: Systems Biology
Yeast (S. cerevisiae) – best understood eukaryotic organism.
Strain libraries, e.g. EUROFAN 2 has knocked out each of the 6,000 genes.
Task: to learn models of yeast metabolism using selected mutant strains and quantitative growth experiments.
Soldatova et al., UWA
The Robot During Commissioning
3. The ART project (an ontology based ARticle preparation Tool)
Translating scientific papers into a format with an explicit semantics.
Explicit linking of repository papers to data and metadata.
Creation of an example intelligent digital repository.
* Soldatova, L.N., Batchelor, C.R., Liakata, M., Fielding, H.H., Lewis, S. and King, R.D. (2007) ART: An ontology based tool for the translation of papers into Semantic Web format. SIG/ ISMB Proceedings.
**http://www.jisc.ac.uk/whatwedo/programmes/programme_rep_pres/tools/art.aspx
Motivation:
to improve information retrieval;
to provide semantic clarity and explicitness of represented information and knowledge;
to promote the sharing of research results;
to facilitate text mining and knowledge discovery applications.
3. ART
The Core Information about Scientific Papers (CISP):
<goal of investigation>
<object of investigation>
<motivation for investigation>
<method>
<model>
<experiment>
<observation>
<result>
<conclusion>
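The nine CISP concepts can be treated as a closed label set for sentences. The Python sketch below is only an illustration: the enum, the two helper functions and the example sentences are hypothetical, and nothing here reflects how ART or SAPIENT store annotations internally.

```python
from collections import Counter
from enum import Enum


class CISP(Enum):
    GOAL = "goal of investigation"
    OBJECT = "object of investigation"
    MOTIVATION = "motivation for investigation"
    METHOD = "method"
    MODEL = "model"
    EXPERIMENT = "experiment"
    OBSERVATION = "observation"
    RESULT = "result"
    CONCLUSION = "conclusion"


def concept_counts(annotated):
    """Count how often each CISP concept labels a sentence."""
    return Counter(label for _, label in annotated)


def missing_concepts(annotated):
    """CISP concepts that no sentence in the paper has been labelled with."""
    used = {label for _, label in annotated}
    return [concept for concept in CISP if concept not in used]


# Hypothetical sentences, for illustration only.
annotated_sentences = [
    ("We aim to determine the phylogeny of the species.", CISP.GOAL),
    ("DNA was extracted and amplified.", CISP.METHOD),
    ("The two species diverged early.", CISP.CONCLUSION),
]
```

A check such as `missing_concepts(annotated_sentences)` is one way a tool could flag a paper that states no explicit result or observation.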
Please tell us your opinion:
http://www.aber.ac.uk/compsci/Research/bio/art/news/survey/
The related projects
Colin R. Batchelor, Royal Society of Chemistry
DEMO by Dr Maria Liakata: SAPIENT
SAPIENT: Semantic Annotation of Papers: Interface and ENrichment Tool
• A web-based tool for sentence-by-sentence annotation of full papers
• Developed at UWA by Maria Liakata and Claire Q
• SAPIENT can be used to annotate papers with CISP (OSCAR annotations are also incorporated)
• SAPIENT can also be used with other sentence-based annotation schemes
• SAPIENT is currently suitable for manual annotation, to facilitate the creation of a corpus
• SAPIENT is currently used by 16 experts to create a corpus of full papers from Chemistry/Biochemistry annotated with CISP concepts.
• Papers provided by the RSC
• Corpus creation consists of 3 phases. Now at the start of phase 2.
• Software and manual available on-line.
SAPIENT Architecture
[Architecture diagram. Columns: User input, Browser, Server. Labels:]
• Page for paper upload and links to uploaded papers
• Paper in .xml; paper saved as source.xml
• 1) Paper is split into sentences with SSSplit 2) Paper saved as mode2.xml
• Click on paper; processing with .xsl; paper displayed in dynamic HTML
• JavaScript-based annotation with CISP; OSCAR annotations
• Click on Save; XMLHttpRequest; annotations saved in mode2.xml
SSSplit: SAPIENT Sentence Splitting
• Rule based sentence splitter developed in Java by Maria Liakata and Claire Q at UWA
• SSSplit developed to take as input full papers in XML
• It fully respects XML annotations pertaining to paper structure, references, formatting.
• Can be used independently of SAPIENT from command line or can be imported as package
• Has been tested successfully on 130 papers
• Software available on-line.
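To give a feel for what "rule based" means here, the following Python sketch splits plain text with two simple rules. It is not SSSplit: SSSplit is written in Java, works on XML input and its actual rules are not given in these slides. The abbreviation list, the boundary pattern and the example text are all assumptions made for illustration.

```python
import re

# Hypothetical abbreviation list; a real splitter would need a much larger one.
ABBREVIATIONS = {"e.g.", "i.e.", "et al.", "Fig.", "Dr.", "vs."}

# Rule 1: a candidate boundary is ., ! or ? followed by whitespace
# and then an upper-case letter or a digit.
BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z0-9])")


def split_sentences(text: str) -> list:
    """Split text at candidate boundaries unless the left side ends in an abbreviation."""
    sentences, start = [], 0
    for match in BOUNDARY.finditer(text):
        candidate = text[start:match.start()]
        # Rule 2: never split directly after a known abbreviation.
        if any(candidate.endswith(abbr) for abbr in ABBREVIATIONS):
            continue
        sentences.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


sample = (
    "Yeast was grown as shown in Fig. 2. "
    "The cells were harvested, e.g. by centrifugation. "
    "Growth was measured."
)
parts = split_sentences(sample)
```

A known limitation of this sketch is that a sentence which genuinely ends in an abbreviation (for example "et al.") is merged with the next one; handling such cases, and respecting inline XML markup, is where a dedicated tool earns its keep.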
The Future
• Perform machine learning on corpus of papers annotated with CISP
• Automate SAPIENT to suggest CISP annotations in new papers
• Use CISP metadata to generate digital abstract
• Incorporate SAPIENT in publishers’ workflow as tool for editors, reviewers and authors of scientific papers.
SAPIENT and SSSplit can be downloaded from:
http://www.aber.ac.uk/compsci/Research/bio/art/
For comments or questions contact [email protected]