ART: ontology based annotation of scientific papers

Larisa Soldatova
RCUK Fellow
The Department of Computer Science
The University of Wales, Aberystwyth

Manchester, December 2, 2008



Plan of the talk:

1. Introduction to ontologies.
2. An example: classification of biomedical text by Hagit Shatkay.
3. The Robot Scientist project and EXPO.
4. LABORS, EXACT (protocols), DD (drug discovery), OntoDM (data mining).
5. The ART project.
6. SAPIENT demo - by Maria Liakata.

Ontologies

An ontology is “a concise and unambiguous description of what principal entities are relevant to an application domain and the relationship between them”*.

*Schulze-Kremer, S., 2001, Computer and Information Sci. 6(21)

Soldatova, UWA

1. Introduction

Ontology parts

Classes and instances;
Is-a relations;
Other relations (part-of, located-in, has-agent);
Axioms.
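The four parts listed above can be illustrated with a small Python sketch. It is a toy container, not an OWL implementation. The class name `Ontology`, its fields, and the example entries are assumptions made for illustration. The one axiom-like rule it encodes is that an instance of a class is also an instance of that class's superclasses.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    """Toy container for classes, instances, relations and one axiom-like rule."""
    is_a: dict[str, str] = field(default_factory=dict)         # class -> parent class
    instance_of: dict[str, str] = field(default_factory=dict)  # instance -> class
    relations: set[tuple[str, str, str]] = field(default_factory=set)  # (subject, relation, object)

    def superclasses(self, cls: str) -> list[str]:
        """Follow is-a links upwards from cls (assumes the hierarchy has no cycles)."""
        chain = []
        while cls in self.is_a:
            cls = self.is_a[cls]
            chain.append(cls)
        return chain

    def is_instance(self, individual: str, cls: str) -> bool:
        """An instance of a class also counts as an instance of its superclasses."""
        direct = self.instance_of.get(individual)
        if direct is None:
            return False
        return direct == cls or cls in self.superclasses(direct)


# Hypothetical entries, chosen only to show each part once.
onto = Ontology()
onto.is_a["null hypothesis"] = "experimental hypothesis"   # is-a relation
onto.instance_of["H01"] = "null hypothesis"                # instance of a class
onto.relations.add(("experimental action", "part-of", "experiment"))  # other relation

print(onto.is_instance("H01", "experimental hypothesis"))
```

A real ontology language such as OWL adds much richer axioms; the sketch only shows how the parts fit together.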

The EXACT description of protocols

Ontologies in life sciences: positive examples

OBI (biomedical investigations) http://obi-ontology.org/page/Main_Page
FMA (Foundational Model of Anatomy ontology) http://sig.biostr.washington.edu/projects/fm/AboutFM.html
MSI for metabolomics experiments* http://msi-ontology.sourceforge.net/

* Sansone, S., Schober, D., Atherton, H.J., Fiehn, O., Jenkins, H., Rocca-Serra, Ph., Rubtsov, D.V., Spasic, I., Soldatova, L.N., Taylor Ch., Tseng, A., Viant, M.R. and the Ontology Working Group Members. (2007) Metabolomics Standards Initiative - Ontology Working Group. Work in Progress. Metabolomics 3/3: 249-256.

Ontologies in life sciences: negative examples

MGED ontology for microarray experiments*
mmCIF for protein data bank (PDB)**

*Soldatova, L.N., King, R.D., (2005) Are the Current Ontologies used in Biology Good Ontologies? Nature Biotechnology 9/23: 1096-1098.

Soldatova, LN & King, RD. (2006) Reply to Wrestling with SUMO and bio-ontologies. Nature Biotechnology. 24/ 23.

** Schierz, A.C., Soldatova, L.N. and King, R.D. (2007) Overhauling the PDB. Nature Biotechnology 25/4: 437-442.

Schierz, A.C., Soldatova, L.N. and King, R.D. (2007) The reply: Overhauling the PDB. Nature Biotechnology 25/8: 846.

An example: classification of biomedical text by Hagit Shatkay.

Focus (scientific, generic, methodology);
Polarity (affirmative/negative);
Certainty (0-3);
Evidence (E0-E3);
Direction/Trend (increase/decrease).

Shatkay et al. (2008) Multidimensional classification of biomedical text: Toward automated, practical provision of high-utility text to diverse users. Bioinformatics 24/18: 2086-2093
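As a rough illustration, the five dimensions listed above could be held in a single record per text fragment. The sketch below is an assumption about how such a record might look, not the representation used by Shatkay et al. The names `FragmentLabel`, `Focus`, `Polarity` and `Trend` are hypothetical, and treating the trend as optional is also an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Focus(Enum):
    SCIENTIFIC = "scientific"
    GENERIC = "generic"
    METHODOLOGY = "methodology"


class Polarity(Enum):
    AFFIRMATIVE = "affirmative"
    NEGATIVE = "negative"


class Trend(Enum):
    INCREASE = "increase"
    DECREASE = "decrease"


@dataclass(frozen=True)
class FragmentLabel:
    focus: Focus
    polarity: Polarity
    certainty: int                 # 0-3
    evidence: int                  # 0-3, standing for E0-E3
    trend: Optional[Trend] = None  # assumption: a fragment may state no trend

    def __post_init__(self):
        # Reject values outside the ranges given on the slide.
        if not 0 <= self.certainty <= 3:
            raise ValueError("certainty must be in 0..3")
        if not 0 <= self.evidence <= 3:
            raise ValueError("evidence must be in 0..3 (E0-E3)")


label = FragmentLabel(Focus.SCIENTIFIC, Polarity.AFFIRMATIVE, certainty=3, evidence=2)
print(label)
```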

Problems?

Polarity, Certainty and Direction/Trend are properties of some entities;

The values scientific, generic and methodology have overlapping semantics;

Evidence re-invents the wheel: see ECO (evidence codes) http://www.obofoundry.org/cgi-bin/detail.cgi?id=evidence_code

ECO vs Hagit

E0 – no stated evidence or lack of evidence

E1 – mentions of evidence with no explicit reference

E2 – statement is backed by a reference to a supporting publication

E3 – experimental evidence is directly given in the text

Example definition of a relation:

is_concretization_of relates a generically dependent continuant to a specifically dependent continuant. A generically dependent continuant may inhere in more than one entity. It does so by virtue of the fact that there is, for each entity that it inheres, a specifically dependent *concretization* of the generically dependent continuant that is specifically dependent.

2. Ontology of scientific experiments EXPO

EXPO* v.1
Concepts: 220
Language: OWL
http://sourceforge.net/projects/expo
Tool: Hozo Ontology Editor

*Soldatova, LN & King, RD (2006) An Ontology of Scientific Experiments. Journal of the Royal Society Interface 3/11: 795-803.

2. EXPO

EXPO concepts

Solenodons

The paper investigates the phylogenetic status of the mammalian species Solenodon cubanus and Solenodon paradoxus, i.e. the evolutionary relationship of these animals with all others.

Solenodons have been isolated since the age of the dinosaurs!

Scientific Experiment: Hypothesis-forming, Hypothesis-driven

Admin info about experiment:
  Title: Mesozoic Origin of West Indian Insectivores
  Author: Roca, A.L., Bar-Gal, G.K., Eizirik, E., Helgen, M.K., ……
  Organisation: 1. National Cancer Institute, Frederick, USA …
  Status: public academic
  Reference: Roca, A.L., Bar-Gal, G.K., Eizirik, E., Helgen, M.K., Maria, R. et al. Mesozoic origin for West Indian insectivores. Nature, 429, 649-651 (2004).

Classification of experiment:
  Taxonomy
    DDC (Dewey): 575 Evolution and Genetics
    Library of Congress: QH 367.5 molecular phylogenetics
  Zoology
    DDC (Dewey): 599: mammalogy
    Library of Congress: QL351-QL352 Zoology-Classification

Experimental goal: To discover the phylogeny of the species Solenodon paradoxus and Solenodon cubanus

Null hypothesis H01: explicit
  Representation style: text
  Linguistic expression: natural language
    “Some have suggested a close relationship to soricids (shrews) but not to talpids”
  Linguistic expression: artificial language: predicate calculus
…………………………
experimental action 1.1.1 extraction and purification
  object: sample of DNA
  parent group: DNA from Solenodon paradoxus
  sampling: random sampling
  instrument: Qiagen DNA cleanup kit
experimental action 1.1.2 DNA amplification
…………………………
Experimental Conclusions (Formed Hypotheses)
C1) Hypothesis
  Representation style: text
  Linguistic expression: natural language
    There existed a mammal that is the ancestor of: Solenodons, Soricoidea, Talpoidea, Erinaceidea, and which is not the ancestor of any other mammal.
  Linguistic expression: artificial language: predicate calculus
…………………………

Prolog:
  ?- instantiation(solenodon, So),
     instantiation(soricoidea, Sh),
     instantiation(talpoidea, T),
     instantiation(mammalia, An),
     shared_ancestor([So, Sh], [T], An).

  % shared_ancestor(Shared, Not_shared, An):
  % An is an ancestor of every member of Shared and of no member of Not_shared.
  shared_ancestor([X], [Y], An) :-
      ancestor(An, X),
      \+ ancestor(An, Y).
  shared_ancestor([X|Lx], Ly, An) :-
      shared_ancestor(Lx, Ly, An),
      ancestor(An, X).
  shared_ancestor(Lx, [Y|Ly], An) :-
      shared_ancestor(Lx, Ly, An),
      \+ ancestor(An, Y).

EXPO: A scientific experiment is a research method which permits the investigation of cause-effect relations between known and unknown (target) variables of the field of study (domain). An experimental result cannot be known with certainty in advance.

EXPO: A classification of experiments is a hierarchical system of categories – types of experiments – according to their domains or used models of experiments.

EXPO: A null hypothesis is an experimental hypothesis that states that a known controlled variable or variables does not have a specified effect on the unknown (target) variable or variables of the domain.

XML:
  <rdfs:Class rdf:ID="classification of experiments">
    <rdfs:label>classification of experiments</rdfs:label>
    <rdfs:subClassOf rdf:resource="#classification" />
    <rdfs:comment>Def: A classification of experiments is a hierarchical system of categories - types of experiments - according to their domains or used models of experiments. Axiom: </rdfs:comment>
  </rdfs:Class>
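The intent of the shared_ancestor predicate in the Prolog fragment above can also be sketched in Python: a candidate is accepted only if it is an ancestor of every member of the first list and of no member of the second. The parent links below use placeholder node names and are entirely made up for illustration; they are not a phylogeny from the paper.

```python
# Hypothetical child -> parent links with placeholder names (not real data).
PARENT = {
    "so": "anc1",
    "sh": "anc1",
    "anc1": "root",
    "t": "root",
}


def ancestors(node):
    """Return the set of all ancestors of node by following parent links upwards."""
    found = set()
    while node in PARENT:
        node = PARENT[node]
        found.add(node)
    return found


def shared_ancestor(shared, not_shared, candidate):
    """True if candidate is an ancestor of every node in shared and of none in not_shared."""
    return (all(candidate in ancestors(n) for n in shared)
            and not any(candidate in ancestors(n) for n in not_shared))


print(shared_ancestor(["so", "sh"], ["t"], "anc1"))
print(shared_ancestor(["so", "sh"], ["t"], "root"))
```

In this toy tree the first call succeeds because "anc1" sits above "so" and "sh" only, while the second fails because "root" is also an ancestor of "t".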

Problems Highlighted by Annotation

The use of EXPO makes explicit the different hypotheses described in the paper. The research conclusions are not mentioned as hypotheses in the text. This contrasts with seven null-hypotheses mentioned explicitly in the main text.

The DNA sequences produced during the experiment were stored in the EMBL database using the taxonomic term “Insectivora”. This taxon is now generally recognised to be polyphyletic, and its use contradicts the actual conclusions of the paper.

The authors’ conclusion: “Cuban Solenodons should be classified in a distinct genus, Atopogale”. Our analysis shows that it would be more internally consistent to classify Cuban Solenodons as a distinct family.

etc…..

EXPO dissemination:

Soldatova, LN & King, RD. (2006) An Ontology of Scientific Experiments. Journal of the Royal Society Interface 3/11: 795-803.

2006 nomination for World Technology Award (software).

Articles in the New Scientist and the Chronicle of Higher Education.

The Concept of a Robot Scientist:

The robot scientist project aims to develop a computer system that is capable of originating its own experiments, physically doing them, interpreting the results, and then repeating the cycle.

[Cycle diagram. Labels: Background Knowledge, Analysis, Hypothesis Formation, Consistent Hypotheses, Experiment selection, Robot, Experiment, Results Interpretation, Final Theory]

*King et al. (2004) Nature, 427, 247-252.
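The cycle described on this slide (select an experiment, run it, interpret the result, keep only the hypotheses that remain consistent, repeat) can be sketched as a generic loop. This is a toy illustration under stated assumptions, not the Robot Scientist's software. The function `run_cycle`, its parameters, and the number-guessing "laboratory" are all hypothetical.

```python
def run_cycle(hypotheses, select_experiment, run_experiment, is_consistent, max_cycles=20):
    """Repeat select -> run -> interpret until at most one hypothesis remains."""
    for _ in range(max_cycles):
        if len(hypotheses) <= 1:
            break
        experiment = select_experiment(hypotheses)   # experiment selection
        result = run_experiment(experiment)          # the robot performs the experiment
        hypotheses = [h for h in hypotheses          # results interpretation
                      if is_consistent(h, experiment, result)]
    return hypotheses


# Toy stand-in for a laboratory: an "experiment" asks whether a hidden value
# is at least some threshold. All names and values here are made up.
HIDDEN_VALUE = 6

final = run_cycle(
    hypotheses=list(range(1, 9)),
    select_experiment=lambda hs: sorted(hs)[len(hs) // 2],
    run_experiment=lambda threshold: HIDDEN_VALUE >= threshold,
    is_consistent=lambda h, threshold, result: (h >= threshold) == result,
)
print(final)
```

Passing the three steps in as functions keeps the loop itself independent of any particular domain, which mirrors the separation between hypothesis handling and the physical robot in the diagram.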

The Application Domain

Systems Biology

Yeast (S. cerevisiae) – best understood eukaryotic organism.

Strain libraries, e.g. EUROFAN 2 has knocked out each of the 6,000 genes.

Task: to learn models of yeast metabolism using selected mutant strains and quantitative growth experiments.

Soldatova et al., UWA

The Robot During Commissioning

3. The ART project (an ontology based ARticle preparation Tool)

Translating scientific papers into a format with an explicit semantics.
Explicit linking of repository papers to data and metadata.
Creation of an example intelligent digital repository.

* Soldatova, L.N., Batchelor, C.R., Liakata, M., Fielding, H.H., Lewis, S. and King, R.D. (2007) ART: An ontology based tool for the translation of papers into Semantic Web format. SIG/ ISMB Proceedings.

**http://www.jisc.ac.uk/whatwedo/programmes/programme_rep_pres/tools/art.aspx

Motivation:

to improve information retrieval;
to provide semantic clarity and explicitness of represented information and knowledge;
to promote the sharing of research results;
to facilitate text mining and knowledge discovery applications.

3. ART

The Core Information about Scientific Papers (CISP):

<goal of investigation>
<object of investigation>
<motivation for investigation>
<method>
<model>
<experiment>
<observation>
<result>
<conclusion>

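To make the CISP concepts listed above concrete, the sketch below attaches one concept to each sentence of a paper and then retrieves sentences by concept. The type names `CISP` and `AnnotatedSentence`, the `sentence_id` field, and the helper `by_concept` are assumptions made for illustration. The two example sentences are taken from the Solenodon slides, and the labels assigned to them here are illustrative only.

```python
from dataclasses import dataclass
from enum import Enum


class CISP(Enum):
    GOAL = "goal of investigation"
    OBJECT = "object of investigation"
    MOTIVATION = "motivation for investigation"
    METHOD = "method"
    MODEL = "model"
    EXPERIMENT = "experiment"
    OBSERVATION = "observation"
    RESULT = "result"
    CONCLUSION = "conclusion"


@dataclass
class AnnotatedSentence:
    sentence_id: int  # hypothetical identifier: position of the sentence in the paper
    text: str
    concept: CISP


def by_concept(sentences, concept):
    """Return the text of every sentence annotated with the given concept."""
    return [s.text for s in sentences if s.concept is concept]


paper = [
    AnnotatedSentence(
        1,
        "To discover the phylogeny of the species Solenodon paradoxus and Solenodon cubanus.",
        CISP.GOAL,
    ),
    AnnotatedSentence(
        2,
        "Cuban Solenodons should be classified in a distinct genus, Atopogale.",
        CISP.CONCLUSION,
    ),
]

print(by_concept(paper, CISP.CONCLUSION))
```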

Please tell us your opinion:

http://www.aber.ac.uk/compsci/Research/bio/art/news/survey/


Colin R. Batchelor, Royal Society of Chemistry

The related projects


DEMO by Dr Maria Liakata: SAPIENT

SAPIENT: Semantic Annotation of Papers: Interface and ENrichment Tool

• A web-based tool for sentence-by-sentence annotation of full papers

• Developed at UWA by Maria Liakata and Claire Q

• SAPIENT can be used to annotate papers with CISP (it also incorporates OSCAR annotations)

• SAPIENT can also be used with other sentence based annotation schemes


• SAPIENT is currently suitable for manual annotation, to facilitate the creation of a corpus

• SAPIENT is currently being used by 16 experts to create a corpus of full papers from Chemistry/BioChemistry annotated with CISP concepts.

• Papers provided by the RSC

• Corpus creation consists of 3 phases. Now at the start of phase 2.

• Software and manual available on-line.

SAPIENT Architecture

[Architecture diagram. Components: User input, Browser, Server. Labelled steps: page for paper upload and links to uploaded papers; paper in .xml; 1) paper is split into sentences with SSSplit, 2) paper saved as mode2.xml; paper saved as source.xml; click on paper; processing with .xsl; paper displayed in dynamic HTML; JavaScript-based annotation with CISP; OSCAR annotations; click on Save; XMLHttpRequest; annotations saved in mode2.xml]


SSSplit: SAPIENT Sentence Splitting

• Rule-based sentence splitter developed in Java by Maria Liakata and Claire Q at UWA

• SSSplit developed to take as input full papers in XML

• It fully respects XML annotations pertaining to paper structure, references, formatting.

• Can be used independently of SAPIENT from the command line or imported as a package

• Has been tested successfully on 130 papers

• Software available on-line.
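The general idea of rule-based sentence splitting over an XML paper can be sketched as follows. This is not SSSplit (which is written in Java) and does not reproduce its rules. The element names `PAPER`, `P` and `REF`, the abbreviation list, and the sample text are assumptions made for illustration. Unlike SSSplit, this sketch flattens inline markup to plain text rather than preserving it.

```python
import re
import xml.etree.ElementTree as ET

ABBREVIATIONS = ("et al.", "e.g.", "i.e.", "Fig.")  # assumed list, for illustration
# Candidate boundary: ., ! or ? followed by whitespace and a capital letter.
BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(text):
    """Split at candidate boundaries, except directly after a known abbreviation."""
    sentences, start = [], 0
    for match in BOUNDARY.finditer(text):
        candidate = text[start:match.start()]
        if candidate.endswith(ABBREVIATIONS):
            continue  # rule: an abbreviation does not end a sentence
        sentences.append(candidate)
        start = match.end()
    tail = text[start:]
    if tail:
        sentences.append(tail)
    return sentences


def split_paper(xml_string, paragraph_tag="P"):
    """Split each paragraph element separately, so sentences never cross paragraphs."""
    root = ET.fromstring(xml_string)
    result = []
    for para in root.iter(paragraph_tag):
        text = " ".join("".join(para.itertext()).split())  # flatten markup, normalise spaces
        result.extend(split_sentences(text))
    return result


# Made-up sample paragraph with one inline reference element.
sample = (
    "<PAPER><P>Samples came from two species, e.g. Solenodon paradoxus. "
    "DNA was extracted <REF>Roca et al. 2004</REF>. Results follow.</P></PAPER>"
)

for sentence in split_paper(sample):
    print(sentence)
```

Working paragraph by paragraph is one simple way to respect document structure; keeping inline elements such as references intact in the output would need a splitter that operates on the XML tree rather than on flattened text.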


The Future

• Perform machine learning on the corpus of papers annotated with CISP

• Automate SAPIENT to suggest CISP annotations in new papers

• Use CISP metadata to generate a digital abstract

• Incorporate SAPIENT in publishers’ workflow as tool for editors, reviewers and authors of scientific papers.


SAPIENT and SSSplit can be downloaded from:

http://www.aber.ac.uk/compsci/Research/bio/art/

For comments or questions contact [email protected]