EXPO: An Ontology of Scientific Research

Ross D. King & Larisa N. Soldatova

Department of Computer Science, University of Wales, Aberystwyth

Motivation: Formalisation of Science

The goal of science is to increase our knowledge of the natural world through the performance of experiments.

Scientific knowledge should, ideally, be expressed in a standard formal logical language.

Formal languages promote semantic clarity, which in turn supports the free exchange of scientific knowledge and simplifies scientific reasoning.

The Robot Scientist Concept

[Figure: the Robot Scientist cycle, with stages labelled Background Knowledge, Hypothesis Formation, Consistent Hypotheses, Experiment(s) Selection, Robot, Experiment(s), Results, Analysis, and Final Theory.]

The robot scientist project aims to develop a computer system that is capable of originating its own experiments, physically doing them, interpreting the results, and then repeating the cycle.
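The cycle described above can be sketched as a small loop. This is only an illustration, not the Robot Scientist's actual software: the names `select_experiment`, `robot_scientist_cycle`, and `run_experiment` are hypothetical, and choosing the experiment that splits the remaining hypotheses most evenly is an assumed selection rule.

```python
from typing import Callable, Dict, List

# A hypothesis is modelled as a mapping: experiment name -> predicted outcome.
Hypothesis = Dict[str, bool]


def select_experiment(hypotheses: List[Hypothesis], experiments: List[str]) -> str:
    """Pick the experiment whose predictions split the hypotheses most evenly
    (an assumed selection rule, used here only for illustration)."""
    def imbalance(exp: str) -> int:
        positives = sum(1 for h in hypotheses if h[exp])
        return abs(2 * positives - len(hypotheses))
    return min(experiments, key=imbalance)


def robot_scientist_cycle(
    hypotheses: List[Hypothesis],
    experiments: List[str],
    run_experiment: Callable[[str], bool],
) -> List[Hypothesis]:
    """Repeat: select an experiment, run it, keep only consistent hypotheses."""
    remaining = list(experiments)
    while len(hypotheses) > 1 and remaining:
        exp = select_experiment(hypotheses, remaining)
        remaining.remove(exp)
        result = run_experiment(exp)  # stands in for the physical experiment
        hypotheses = [h for h in hypotheses if h[exp] == result]
    return hypotheses


# Toy data: three made-up hypotheses, the second of which is "true".
candidates = [
    {"e1": True, "e2": True},
    {"e1": True, "e2": False},
    {"e1": False, "e2": True},
]
truth = candidates[1]
final = robot_scientist_cycle(candidates, ["e1", "e2"], lambda e: truth[e])
```

In this toy run the loop needs both experiments before a single consistent hypothesis remains.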

Our Latest Robot Scientist

King et al. (2004) Nature. 427, 247-252.

The Unity of Science

We are interested in formalising generic knowledge about scientific experimental design, methodology, and results representation. Such a common ontology is both feasible and desirable because all the sciences follow the same experimental principles. Despite their different subject matters, they all organise, execute, and analyse experiments in similar ways. They use related instruments and materials; they describe experimental results in identical formats, dimensional units, etc.

Ontologies

The formal description of experiments for efficient analysis, annotation, and sharing of results is a fundamental objective of science.

Ontologies are required to achieve this goal.

A few subject-specific ontologies of experiments currently exist. The most notable of these is the MGED Ontology (MO). It was designed to provide descriptors required by MIAME (Minimum Information About a Microarray Experiment).

We propose the ontology EXPO to meet this need.

Advantages of Ontologies

The utilisation of a common standard ontology for the annotation of scientific experiments would:

– Make scientific knowledge more explicit.

– Help detect errors.

– Enable the sharing and reuse of common knowledge.

– Remove redundancies in domain-specific ontologies.

– Promote the interchange and reliability of experimental methods and conclusions.

Our Approach to Ontology Building

Explicitly list the principles of an ontology's design, its constraints, along with definitions and axioms.

Provide compliance with a standard upper ontology (SUO) developed by IEEE P1600.1.

Keep domain-dependent knowledge separate from domain-independent knowledge, and declarative knowledge separate from procedural knowledge.

Build ontologies so that they are purpose-independent and therefore future-proof.

Generic Ontology of Experiments

[Figure: EXPO in the context of e-Science. Labels: e-Science; Ontology of science (formalization of scientific methods, technologies, infrastructure of science); EXPO, the ontology of scientific experiments (concepts: 218; language: OWL); Classification of experiments.]

• Controlled vocabulary of scientific experiments;

• Formalized computer based representation of scientific experiments;

• Unified standards for representation, annotation, storage, and access to experimental results;

• Reasoning over experimental data and conclusions.

[Figure labels: Admin info about experiment, Scientific Experiment, Experimental goal, Experimental design, Experimental object, Experimental action, Experimental results.]
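The top-level parts of an experiment named above could be held in a simple record type. The sketch below is an assumption made for illustration: the class and field names are invented here and are not EXPO's OWL identifiers. The `missing_fields` helper hints at how a structured annotation can help detect omissions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExperimentalAction:
    name: str             # e.g. "extraction and purification"
    obj: str              # experimental object acted upon
    instrument: str = ""


@dataclass
class ScientificExperiment:
    title: str                      # admin info about experiment
    classification: List[str]       # classification of experiments
    goal: str                       # experimental goal
    design: str = ""                # experimental design
    actions: List[ExperimentalAction] = field(default_factory=list)
    results: List[str] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        """Return the names of top-level fields that are still empty."""
        return [name for name, value in vars(self).items() if not value]


# Values taken from the Solenodon example in this presentation.
record = ScientificExperiment(
    title="Mesozoic Origin of West Indian Insectivores",
    classification=["Zoology", "Taxonomy"],
    goal="To discover the phylogeny of Solenodon paradoxus and Solenodon cubanus",
    actions=[ExperimentalAction("extraction and purification",
                                "sample of DNA",
                                "Qiagen DNA cleanup kit")],
)
print(record.missing_fields())
```

Here the design and results are reported as missing because they were left empty in this partial record.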

[Figure: EXPO's position among related ontologies, arranged in three levels (upper level, intermediate level, domain level). Ontologies shown: SUMO, Ontology of science, EXPO, PSI ontology, MGED ontology, Chemical ontology, MSI ontology, Functional ontology, Ontology of representations. EXPO concepts shown: <Experiment>, <Subject of exp.>, <Object of exp.>, <Domain of exp.>, <Biblio reference>, <Experimental goal>, <Experimental action>.]

EXPO Description

EXPO v.1

Concepts: ~200

Language: OWL

Tool: Hozo Ontology Editor

Small Section of EXPO

SUMO-EXPO extension

SUMO: http://suo.ieee.org/SUO/SUMO/index.html

An Experiment as Part of a Scientific Investigation

Example Applications of EXPO

Bioinformatics

Particle Physics

Solenodons

Solenodons are endangered insectivores from Hispaniola and Cuba.

Phylogenetic Example

Random paper selected from Nature: Roca, A.L., Bar-Gal, G.K., Eizirik, E., Helgen, M.K., Maria, R. Mesozoic origin for West Indian insectivores. Nature, 429, 649-651 (2004).

The paper investigates the phylogenetic status of the mammalian species Solenodon cubanus and Solenodon paradoxus, i.e. the evolutionary relationship of these animals to all others.

Solenodons have been isolated since the age of the dinosaurs!

Scientific Experiment: Hypothesis-forming, Hypothesis-driven
Admin info about experiment:
  Title: Mesozoic Origin of West Indian Insectivores
  Author: Roca, A.L., Bar-Gal, G.K., Eizirik, E., Helgen, M.K.,………………
  Organisation: 1. National Cancer Institute, Frederick, USA ………………
  Status: public academic
  Reference: Roca, A.L., Bar-Gal, G.K., Eizirik, E., Helgen, M.K., Maria, R. et al. Mesozoic origin for West Indian insectivores. Nature, 429, 649-651 (2004).
Classification of experiment:
  Taxonomy
    DDC (Dewey): 575 Evolution and Genetics
    Library of Congress: QH 367.5 molecular phylogenetics
  Zoology
    DDC (Dewey): 599: mammalogy
    Library of Congress: QL351-QL352 Zoology-Classification
Experimental goal: To discover the phylogeny of the species Solenodon paradoxus and Solenodon cubanus
Null hypothesis H01: explicit
  Representation style: text
  Linguistic expression: natural language
    “Some have suggested a close relationship to soricids (shrews) but not to talpids”
  Linguistic expression: artificial language: predicate calculus
…………………………
experimental action 1.1.1 extraction and purification
  object: sample of DNA
  parent group: DNA from Solenodon paradoxus
  sampling: random sampling
  instrument: Qiagen DNA cleanup kit
experimental action 1.1.2 DNA amplification
…………………………
Experimental Conclusions (Formed Hypotheses)
C1) Hypothesis
  Representation style: text
  Linguistic expression: natural language
    There existed a mammal that is the ancestor of Solenodons, Soricoidea, Talpoidea, Erinaceidea, and which is not the ancestor of any other mammal.
  Linguistic expression: artificial language: predicate calculus
…………………………

Prolog:

% Query: is An an ancestor of So and Sh but not of T?
?- instantiation(solenodon, So), instantiation(soricoidea, Sh),
   instantiation(talpoidea, T), instantiation(mammalia, An),
   shared_ancestor([So, Sh], [T], An).

% shared_ancestor(Shared, Not_shared, An): An is an ancestor of every
% member of Shared and of no member of Not_shared.
shared_ancestor([X], [Y], An) :-
    ancestor(An, X),
    \+ ancestor(An, Y).
shared_ancestor([X|Lx], Ly, An) :-
    shared_ancestor(Lx, Ly, An),
    ancestor(An, X).
shared_ancestor(Lx, [Y|Ly], An) :-
    shared_ancestor(Lx, Ly, An),
    \+ ancestor(An, Y).
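For readers less familiar with Prolog, the same shared-ancestor test can be sketched in Python. Everything here is an assumption made for illustration: the `PARENT` table is a made-up toy tree (it is NOT the phylogeny reported in the paper), and the node names `an` and `root` are hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical toy tree as child -> parent links. Illustration only:
# this is not the phylogeny reported in the paper.
PARENT: Dict[str, str] = {
    "solenodon": "an",
    "soricoidea": "an",
    "an": "root",
    "talpoidea": "root",
}


def ancestors(node: str) -> Set[str]:
    """All strict ancestors of node, found by walking child -> parent links."""
    found: Set[str] = set()
    while node in PARENT:
        node = PARENT[node]
        found.add(node)
    return found


def shared_ancestor(shared: List[str], not_shared: List[str], an: str) -> bool:
    """True if an is an ancestor of every taxon in shared and of none in not_shared."""
    return (all(an in ancestors(x) for x in shared)
            and not any(an in ancestors(y) for y in not_shared))


print(shared_ancestor(["solenodon", "soricoidea"], ["talpoidea"], "an"))
print(shared_ancestor(["solenodon", "soricoidea"], ["talpoidea"], "root"))
```

In the toy tree, `an` satisfies the test while `root` does not, because `root` is also an ancestor of `talpoidea`.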

Solenodon Annotation

EXPO: A scientific experiment is a research method which permits the investigation of cause-effect relations between known and unknown (target) variables of the field of study (domain). An experimental result cannot be known with certainty in advance.

EXPO: A classification of experiments is a hierarchical system of categories - types of experiments - according to their domains or used models of experiments.

EXPO: A null hypothesis is an experimental hypothesis that states that a known controlled variable or variables does not have a specified effect on the unknown (target) variable or variables of the domain.

XML:

</rdfs:Class>
<rdfs:Class rdf:ID="classification of experiments">
  <rdfs:label>classification of experiments</rdfs:label>
  <rdfs:subClassOf rdf:resource="#classification" />
  <rdfs:comment>
    Def: A classification of experiments is a hierarchical system of categories - types of experiments - according to their domains or used models of experiments.
    Axiom:
  </rdfs:comment>
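A class definition of this kind can be read back by a program. The sketch below uses Python's standard `xml.etree.ElementTree` on a self-contained, well-formed variant of the fragment. The wrapping `rdf:RDF` element, the namespace declarations, the underscore form of the `rdf:ID`, and the helper name `read_classes` are assumptions added so the example parses on its own; they are not taken from the EXPO file.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"

# Assumed, self-contained variant of the fragment shown above.
snippet = f"""
<rdf:RDF xmlns:rdf="{RDF}" xmlns:rdfs="{RDFS}">
  <rdfs:Class rdf:ID="classification_of_experiments">
    <rdfs:label>classification of experiments</rdfs:label>
    <rdfs:subClassOf rdf:resource="#classification" />
    <rdfs:comment>Def: A classification of experiments is a hierarchical
    system of categories - types of experiments - according to their
    domains or used models of experiments.</rdfs:comment>
  </rdfs:Class>
</rdf:RDF>
""".strip()


def read_classes(xml_text: str) -> dict:
    """Map each class label to the resource named in its rdfs:subClassOf."""
    root = ET.fromstring(xml_text)
    classes = {}
    for cls in root.findall(f"{{{RDFS}}}Class"):
        label = cls.findtext(f"{{{RDFS}}}label")
        parent = cls.find(f"{{{RDFS}}}subClassOf")
        parent_ref = parent.get(f"{{{RDF}}}resource") if parent is not None else None
        classes[label] = parent_ref
    return classes


print(read_classes(snippet))
```

A dedicated RDF library would be the more robust choice for real ontology files; plain XML parsing is used here only to keep the sketch dependency-free.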

EXPO Solenodon Annotation

<scientific experiment>: <physical experiment>: <hypothesis-forming>; <hypothesis-driven>
<admin info about experiment>:
  <title>: Mesozoic Origin of West Indian Insectivores
<classification by domain>:
  <domain of experiment>: Zoology
    <DDC(Dewey)>: 599: mammalogy
    <Library of Congress>: QL351-QL352 Zoology-Classif.
  Taxonomy
    <DDC(Dewey)>: 575 Evolution and Genetics
    <Library of Congress>: QH 367.5 molecular phylogenetics
    <Library of Congress>: QH 83 Biosystematics
  <related domain>: Computational Statistics
    <DDC(Dewey)>: 519 Probabilities and Applied Mathematics
    <Library of Congress>: QA 273-274 Probabilities
<research hypothesis>: <implicit>
<experimental goal>: To investigate the phylogeny of the species Solenodon paradoxus and Solenodon cubanus
<null hypothesis> H01:
  <representation style>: <text>
  <linguistic expression>: <natural language>:
    “Some have suggested a close relationship to soricids (shrews) but not to talpids”
  <linguistic expression>: <artificial language>
    So, Sh, T, An ∈ mammalia
    ∀So . ∀Sh . ∀T . ∃An . solenodon(So) ∧ soricoidea(Sh) ∧ talpoidea(T) ∧ ancestor(An, So) ∧ ancestor(An, Sh) ∧ ¬ancestor(An, T)

EXPO Solenodon Annotation

<domain model>: Statistical branching model of evolution
<experimental model>:
  <factor>: T - Phylogenetic tree relating Solenodon paradoxus and Solenodon cubanus to other mammal species
    <factor level>: “sampling trees every 20 generations” (spl. in [9])
  <target variable>: PS(T) - position and BL(T) - branch length of Solenodon paradoxus and Solenodon cubanus in tree T
  <model assumption> 1: The use of the selected DNA sequences will produce results that generalise to the complete genomes.
  <model assumption> 2: Molecular clock assumptions.
<experimental action> 1.1.1: extraction and purification
  <object>: sample of DNA
  <parent group>: DNA from Solenodon paradoxus
  <sampling>: random sampling
  <instrument>: Qiagen DNA cleanup kit
………………………………………………….
<experimental conclusion> C2:
  Comment: Formed Hypothesis
  <linguistic expression>: <artificial language>
    So, Sh, T, E, An, De ∈ mammalia
    ∀So . ∀Sh . ∀T . ∀E . ∃De . ∃An . solenodon(So) ∧ soricoidea(Sh) ∧ talpoidea(T) ∧ erinaceidea(E) ∧ ancestor(An, So) ∧ ancestor(An, De) ∧ ¬ancestor(De, So) ∧ ancestor(De, Sh) ∧ ancestor(De, T) ∧ ancestor(De, E)

Problems Highlighted by Expo 1

The use of EXPO makes explicit the different hypotheses described in the paper.

– The hypotheses we have identified in the <research conclusion> are not mentioned as hypotheses in the text.

– This contrasts with what we identify as the seven null-hypotheses, which are mentioned explicitly in the main text.

– May be sub-optimal statistically.


Another problem of the research which EXPO exposed was that the DNA sequences produced during the experiment were stored in the EMBL database using the taxonomic term “Insectivora”.

This taxon is now generally recognised to be polyphyletic, and what is more, its use contradicts the actual conclusions of the paper.

Problems Highlighted by Expo 3

We formalised, using Prolog, the knowledge behind the authors’ argument that “Cuban Solenodons should be classified in a distinct genus, Atopogale”. Our analysis indicated that it would have been more internally consistent for the authors to have classified Cuban Solenodons as a distinct family.

The decision to use a genus rather than family has probably more to do with sociology than biology.

However, it is a biological embarrassment that the terms “genus” and “family” are ill defined.

etc…..

High-energy/particle physics

Another random paper selected from the same Nature issue: D0 Collaboration. A precision measurement of the mass of the top quark. Nature, 429, 639-642 (2004).

Small Part of Their Experimental Equipment!

EXPO D0 Example 1

<scientific experiment>: <computational experiment>: <simulation>
<admin info about experiment>:
  <title>: A precision measurement of the mass of the top quark
<classification by domain>:
  <domain of experiment>: High Energy Physics / Particle Physics
    <DDC(Dewey) classification>: 539.7 Atomic and nuclear physics
    <Library of Congress classification>: QC 770-798 Atomic, Nuclear, Particle Physics
  <related domain>: Computational Statistics
    <DDC(Dewey) classification>: 519 Probabilities and Applied Mathematics
    <Library of Congress classification>: QA 273-274 Probabilities
<research hypothesis>:
  <representation style>: <text>
  <linguistic expression>: <natural language>:
    Given the same observed data: use of the new statistical method M1 will produce a more accurate estimate of Mtop than the original method M0.
  <linguistic expression>: <artificial language>:
    M0(∀ D0 observations ∧ ∀ relevant background knowledge) ↦ E0
    M1(∀ D0 observations ∧ ∀ relevant background knowledge) ↦ E1
    estimation_error(E0, Mtop) ↦ Error0
    estimation_error(E1, Mtop) ↦ Error1
    Error0 > Error1
<alternative hypothesis>: <subject effect>: <experimenter bias>:
  <linguistic expression>: <artificial language>
    P(The Standard Model) is high
    The Standard Model → Higgs boson
    (Mtop = 173.3) → Higgs boson mass best fit mass estimate is experimentally excluded.
    ∴ Mtop > 173.3

EXPO D0 Example 2

<domain model>: A statistical branching model of particles (see [1] fig. 1)
<experimental model>:
  <factor>: M - a method of estimating Mtop
    <factor level>: M0 – the old method
    <factor level>: M1 – the new method
  <target variable>: Mtop
  <model assumption> 1: It is possible to compare the accuracy of estimates of M0 and M1 without knowing the true value of Mtop.
  <model assumption> 2: “the specific value of the Pbkg cut-off” based on Mtop = 175 GeV/c2 [8] has no effects on the experimental results.
…………………………………………………………………………………………………
<experimental conclusion>:
  <representation style>: <text>
  <linguistic expression>: <natural language>
    The probability of the research hypothesis H1 has been increased: method M1 “yields a greatly improved precision” of Mtop [8].
<result error>:
  <error of conclusion>: <hypothesis acceptance mistake>: <false positive> ≠ 0
  <error of conclusion>: <faulty comparison>: M0 and M1 estimates


Poor science, despite having ~350 authors and being published in Nature!

This annotation makes it explicit that the experiment was somewhat unusual in not generating any new observational data. Instead, it presents the results of applying a new statistical analysis method to existing data (a set of putative top quark pair decay events involving e+jets and μ+jets).

Problems Highlighted by Expo 2

The paper has no explicit hypothesis. However, we argue that the paper’s implicit experimental hypothesis was: given the same observed data, use of the new statistical method will produce a more accurate estimate of Mtop than the original method. This is based on the authors’ statement “here we report a technique that extracts more information from each top-quark event and yields a greatly improved precision when compared to previous measurements”.

We prefer the term “accuracy” to “precision”.


The Carnap principle: All relevant knowledge should be used to decide a scientific question:

– 91 candidate events were used to calculate the old value, but only 22 of these were used for the new value!

– The old method estimate of Mtop is: 173.3 ± 5.6 (stat) ± 5.5 (sys) GeV/c2.

– The new method estimate of Mtop is: 180.1 ± 3.6 (stat) ± 3.9 (sys) GeV/c2.

– The current (June 2005) best estimate for Mtop is 174.3 ± 3.4 GeV/c2.
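The figures above can be compared with a little arithmetic. The sketch below makes two assumptions that are not stated in the slides: that statistical and systematic uncertainties are combined in quadrature, and that the June 2005 best estimate may serve as a reference point (it is not the true value). The function names are hypothetical.

```python
import math

BEST_2005 = 174.3  # GeV/c^2, the June 2005 best estimate quoted above

# (central value, statistical uncertainty, systematic uncertainty) in GeV/c^2
estimates = {
    "old method (M0)": (173.3, 5.6, 5.5),
    "new method (M1)": (180.1, 3.6, 3.9),
}


def total_uncertainty(stat: float, sys: float) -> float:
    """Combine stat and sys uncertainties in quadrature (an assumed convention)."""
    return math.sqrt(stat ** 2 + sys ** 2)


def summarise(value: float, stat: float, sys: float):
    """Return (combined uncertainty, offset from the 2005 best estimate)."""
    return round(total_uncertainty(stat, sys), 2), round(value - BEST_2005, 1)


for name, (value, stat, sys) in estimates.items():
    combined, offset = summarise(value, stat, sys)
    print(f"{name}: +/- {combined} GeV/c^2 combined, offset {offset:+} GeV/c^2")
```

Under these assumptions the new method has the smaller combined uncertainty (about 5.31 against 7.85 GeV/c2) yet lies further from the 2005 estimate (+5.8 against -1.0 GeV/c2), which illustrates why precision and accuracy are worth distinguishing.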

Problems Highlighted by Expo 4

The paper concluded that Mtop is higher than previously estimated, which deductively implies a higher mass for the Higgs Boson. As the Higgs Boson has not yet been observed, even at energies above its previously predicted maximum likelihood mass, the newly inferred higher Mtop lent support to the existence of the Higgs Boson. However, it would have been possible to argue validly the other way: that the Higgs Boson is thought highly likely to exist, therefore its non-observation makes more probable a higher value of Mtop. This argument was not explicit in the paper, but may have existed implicitly as a motivation, and the paper would have benefited from making this argument explicit, even if not used.

e-Science

Related to the formalisation of science is “e-Science”, whose aim is to develop a way of explicitly publishing online all of the data and metadata from a scientific experiment, so that it can be repeated and compared with other related experiments.

We believe that EXPO is an important step in developing such an e-Science approach. It will enable scientists to annotate their experiments in a unified way with the minimum of overhead, and it will also enable the efficient review, retrieval, comparison, and sharing of data from experiments.

Scientific Publication 1

The traditional way of presenting scientific knowledge in scientific papers has many limitations.

The most important and obvious of these is the use of natural language to describe knowledge - albeit augmented by various formalisms and mathematics.

Natural language is notorious for its imprecision and ambiguity.

Scientific Publication 2

The use of natural language is a great hindrance when using computers to store and analyse data, hence the growing importance of text mining.

We argue that the content of scientific papers should increasingly be expressed in formal languages.

Is writing a scientific paper closer to writing poetry or a computer program?

Conclusions

The unity of science implies that an accepted general ontology of experiments is both possible and desirable.

Such an ontology would promote the sharing of results within and between subjects, reducing both the duplication and loss of knowledge.

It is also an essential step in formalising science, and so fully exploiting computer reasoning in science.

Acknowledgments

Nigel Hardy, Helen Jenkins

Steve Oliver, Manchester University

Riichiro Mizoguchi, Osaka University (Japan)

BBSRC
