EXPO An Ontology of Scientific Research
Ross D. King & Larisa N. Soldatova
Department of Computer Science, University of Wales, Aberystwyth
Motivation: Formalisation of Science
The goal of science is to increase our knowledge of the natural world through the performance of experiments.
Scientific knowledge should, ideally, be expressed in a standard formal logical language.
Formal languages promote semantic clarity, which in turn supports the free exchange of scientific knowledge and simplifies scientific reasoning.
The Robot Scientist Concept
The robot scientist project aims to develop a computer system that is capable of originating its own experiments, physically doing them, interpreting the results, and then repeating the cycle.
[Figure: the robot scientist cycle. Diagram labels: Background Knowledge, Hypothesis Formation, Consistent Hypotheses, Experiment(s) selection, Robot, Experiment(s), Results, Analysis, Final Theory.]
The Unity of Science
We are interested in formalising generic knowledge about scientific experimental design, methodology, and results representation. Such a common ontology is both feasible and desirable because all the sciences follow the same experimental principles. Despite their different subject matters, they all organise, execute, and analyse experiments in similar ways. They use related instruments and materials; they describe experimental results in identical formats, dimensional units, etc.
Ontologies
The formal description of experiments for efficient analysis, annotation, and sharing of results is a fundamental objective of science.
Ontologies are required to achieve this goal.
A few subject-specific ontologies of experiments currently exist. The most notable of these is the MGED Ontology (MO). It was designed to provide the descriptors required by MIAME (Minimum Information About a Microarray Experiment).
We propose the ontology EXPO to meet this need.
Advantages of Ontologies
The utilisation of a common standard ontology for the annotation of scientific experiments would:
– Make scientific knowledge more explicit.
– Help detect errors.
– Enable the sharing and reuse of common knowledge.
– Remove redundancies in domain-specific ontologies.
– Promote the interchange and reliability of experimental methods and conclusions.
Our Approach to Ontology Building
Explicitly list the principles of an ontology's design, its constraints, along with definitions and axioms.
Provide compliance with a standard upper ontology (SUO) developed by IEEE P1600.1.
Keep domain-dependent and domain-independent knowledge separate, and likewise declarative and procedural knowledge.
Build ontologies so that they are purpose-independent and therefore are future-proof.
Generic Ontology of Experiments
e-Science
Ontology of science (formalization of scientific methods, technologies, infrastructure of science)
EXPO: Ontology of scientific experiments
– concepts: 218
– language: OWL
• Controlled vocabulary of scientific experiments;
• Formalized computer based representation of scientific experiments;
• Unified standards for representation, annotation, storage, and access to experimental results;
• Reasoning over experimental data and conclusions.
[Figure: Scientific Experiment and its related concepts. Diagram labels: Classification of experiments, Admin info about experiment, Experimental goal, Experimental design, Experimental object, Experimental action, Experimental results.]
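The concepts named on this slide can be read as the top-level parts of an experiment record. The following is a minimal sketch only, assuming a flat Python data structure; the class names, field names, and the `missing_parts` helper are illustrative and are not part of EXPO itself. The example values are taken from the Solenodon annotation later in this talk.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExperimentalAction:
    label: str            # e.g. "1.1.1"
    name: str             # e.g. "extraction and purification"
    obj: str = ""         # experimental object the action is applied to
    instrument: str = ""


@dataclass
class ScientificExperiment:
    title: str                                            # admin info about experiment
    classification: List[str] = field(default_factory=list)
    goal: str = ""
    design: str = ""
    actions: List[ExperimentalAction] = field(default_factory=list)
    results: List[str] = field(default_factory=list)

    def missing_parts(self) -> List[str]:
        """Return the names of top-level parts that are still empty."""
        parts = {
            "classification": self.classification,
            "goal": self.goal,
            "design": self.design,
            "actions": self.actions,
            "results": self.results,
        }
        return [name for name, value in parts.items() if not value]


exp = ScientificExperiment(
    title="Mesozoic Origin of West Indian Insectivores",
    classification=["DDC 599", "DDC 575"],
    goal="To discover the phylogeny of Solenodon paradoxus and Solenodon cubanus",
)
exp.actions.append(
    ExperimentalAction("1.1.1", "extraction and purification",
                       obj="sample of DNA", instrument="Qiagen DNA cleanup kit")
)
print(exp.missing_parts())  # parts of the annotation not yet filled in
```

A record of this kind makes an incomplete annotation visible at a glance, which is one practical reading of "making scientific knowledge more explicit".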
[Figure: position of EXPO relative to other ontologies. Levels shown: Upper level, Intermediate level, Domain level. Ontologies shown: SUMO, Ontology of science, EXPO, PSI ontology, MGED ontology, MSI ontology, Chemical ontology, Functional ontology, Ontology of representations. EXPO concepts shown: <Experiment>, <Subject of exp.>, <Object of exp.>, <Domain of exp.>, <Biblio reference>, <Experimental goal>, <Experimental action>.]
Phylogenetic Example
Random paper selected from Nature: Roca, A.L., Bar-Gal, G.K., Eizirik, E., Helgen, M.K., Maria, R. et al. Mesozoic origin for West Indian insectivores. Nature, 429, 649-651 (2004).
The paper investigates the phylogenetic status of the mammalian species Solenodon cubanus and Solenodon paradoxus, i.e. the evolutionary relationship of these animals with all others.
Solenodons have been isolated since the age of the dinosaurs!
Scientific Experiment: Hypothesis-forming, Hypothesis-driven
Admin info about experiment:
  Title: Mesozoic Origin of West Indian Insectivores
  Author: Roca, A.L., Bar-Gal, G.K., Eizirik, E., Helgen, M.K.,………………
  Organisation: 1. National Cancer Institute, Frederick, USA ………………
  Status: public academic
  Reference: Roca, A.L., Bar-Gal, G.K., Eizirik, E., Helgen, M.K., Maria, R. et al. Mesozoic origin for West Indian insectivores. Nature, 429, 649-651 (2004).
Classification of experiment:
  Taxonomy
    DDC(Dewey): 575 Evolution and Genetics
    Library of Congress: QH 367.5 molecular phylogenetics
  Zoology
    DDC(Dewey): 599: mammalogy
    Library of Congress: QL351-QL352 Zoology-Classification
Experimental goal: To discover the phylogeny of the species: Solenodon paradoxus and Solenodon cubanus
Null hypothesis H01: explicit
  Representation style: text
  Linguistic expression: natural language
    “Some have suggested a close relationship to soricids (shrews) but not to talpids”
  Linguistic expression: artificial language: predicate calculus
…………………………
experimental action 1.1.1 extraction and purification
  object: sample of DNA
  parent group: DNA from Solenodon paradoxus
  sampling: random sampling
  instrument: Qiagen DNA cleanup kit
experimental action 1.1.2 DNA amplification
…………………………
Experimental Conclusions (Formed Hypotheses)
C1) Hypothesis
  Representation style: text
  Linguistic expression: natural language
    There existed a mammal that is the ancestor of Solenodons, Soricoidea, Talpoidea, Erinaceidea, and which is not the ancestor of any other mammal.
  Linguistic expression: artificial language: predicate calculus
…………………………
Prolog:
% shared_ancestor(Shared, Not_shared, An): An is an ancestor of every member
% of Shared and of no member of Not_shared. Both lists must be non-empty.
shared_ancestor([X], [Y], An) :-
    ancestor(An, X),
    \+ ancestor(An, Y).
shared_ancestor([X|Lx], Ly, An) :-
    ancestor(An, X),
    shared_ancestor(Lx, Ly, An).
shared_ancestor(Lx, [Y|Ly], An) :-
    \+ ancestor(An, Y),
    shared_ancestor(Lx, Ly, An).

% Query (instantiation/2 and ancestor/2 are assumed to be defined elsewhere):
?- instantiation(solenodon, So), instantiation(soricoidea, Sh),
   instantiation(talpoidea, T), instantiation(mammalia, An),
   shared_ancestor([So, Sh], [T], An).
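For readers less familiar with Prolog, the same shared-ancestor test can be sketched in Python. This is an illustration under stated assumptions: the `PARENT` table is a hypothetical toy tree (with invented internal nodes `root` and `n1`) chosen so that the null hypothesis holds; it is not the tree reported in the paper, and `witnesses` is a helper introduced here.

```python
from typing import Dict, Iterable, List, Optional

# Hypothetical toy tree, child -> parent. Internal node names are invented.
PARENT: Dict[str, Optional[str]] = {
    "root": None,
    "talpoidea": "root",
    "n1": "root",
    "solenodon": "n1",
    "soricoidea": "n1",
}


def ancestor(an: str, x: str) -> bool:
    """True if `an` is a proper ancestor of `x` in PARENT."""
    node = PARENT.get(x)
    while node is not None:
        if node == an:
            return True
        node = PARENT.get(node)
    return False


def shared_ancestor(shared: Iterable[str], not_shared: Iterable[str], an: str) -> bool:
    """`an` is an ancestor of every taxon in `shared` and of none in `not_shared`."""
    return (all(ancestor(an, x) for x in shared)
            and not any(ancestor(an, y) for y in not_shared))


def witnesses(shared: List[str], not_shared: List[str]) -> List[str]:
    """All nodes that satisfy the shared-ancestor condition."""
    return [an for an in PARENT if shared_ancestor(shared, not_shared, an)]


print(witnesses(["solenodon", "soricoidea"], ["talpoidea"]))
```

In this toy tree only `n1` qualifies, because `root` is also an ancestor of `talpoidea`. Swapping in a different tree changes the answer, which is the point of stating the hypothesis formally.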
Solenodon Annotation
EXPO: A scientific experiment is a research method which permits the investigation of cause-effect relations between known and unknown (target) variables of the field of study (domain). An experimental result cannot be known with certainty in advance.
EXPO: A classification of experiments is a hierarchical system of categories – types of experiments – according to their domains or used models of experiments.
EXPO: A null hypothesis is an experimental hypothesis that states that a known controlled variable or variables does not have a specified effect on the unknown (target) variable or variables of the domain.
XML:
</rdfs:Class>
<rdfs:Class rdf:ID="classification of experiments">
  <rdfs:label>classification of experiments</rdfs:label>
  <rdfs:subClassOf rdf:resource="#classification" />
  <rdfs:comment>
    Def: A classification of experiments is a hierarchical system of categories - types of experiments - according to their domains or used models of experiments.
    Axiom:
  </rdfs:comment>
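A class definition of this kind can be read back with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree`. Because the slide shows only a fragment, the example wraps it in an `rdf:RDF` root element, closes the class element, and omits the empty "Axiom:" line; these changes and the `read_classes` helper are assumptions made for illustration, not part of the EXPO distribution.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"

# Assumed wrapper: rdf:RDF root and closing </rdfs:Class> added around the fragment.
DOC = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:rdfs="{RDFS}">
  <rdfs:Class rdf:ID="classification of experiments">
    <rdfs:label>classification of experiments</rdfs:label>
    <rdfs:subClassOf rdf:resource="#classification" />
    <rdfs:comment>
      Def: A classification of experiments is a hierarchical system of
      categories - types of experiments - according to their domains or
      used models of experiments.
    </rdfs:comment>
  </rdfs:Class>
</rdf:RDF>"""


def read_classes(xml_text):
    """Extract id, label, parent class and comment for each rdfs:Class."""
    root = ET.fromstring(xml_text)
    classes = []
    for cls in root.findall(f"{{{RDFS}}}Class"):
        parent = cls.find(f"{{{RDFS}}}subClassOf")
        classes.append({
            "id": cls.get(f"{{{RDF}}}ID"),
            "label": cls.findtext(f"{{{RDFS}}}label"),
            "parent": parent.get(f"{{{RDF}}}resource") if parent is not None else None,
            "comment": " ".join((cls.findtext(f"{{{RDFS}}}comment") or "").split()),
        })
    return classes


for c in read_classes(DOC):
    print(c["id"], "is a subclass of", c["parent"])
```

A dedicated RDF/OWL toolkit would be the more robust choice for a full ontology; plain XML parsing is used here only to keep the example self-contained.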
EXPO Solenodon Annotation
<scientific experiment>: <physical experiment>: <hypothesis-forming>; <hypothesis-driven>
<admin info about experiment>:
  <title>: Mesozoic Origin of West Indian Insectivores
<classification by domain>:
  <domain of experiment>: Zoology
    <DDC(Dewey)>: 599: mammalogy
    <Library of Congress>: QL351-QL352 Zoology-Classif.
  Taxonomy
    <DDC(Dewey)>: 575 Evolution and Genetics
    <Library of Congress>: QH 367.5 molecular phylogenetics
    <Library of Congress>: QH 83 Biosystematics
  <related domain>: Computational Statistics
    <DDC(Dewey)>: 519 Probabilities and Applied Mathematics
    <Library of Congress>: QA 273-274 Probabilities
<research hypothesis>: <implicit>
<experimental goal>: To investigate the phylogeny of the species: Solenodon paradoxus and Solenodon cubanus
<null hypothesis> H01: <representation style>: <text>
  <linguistic expression>: <natural language>:
    “Some have suggested a close relationship to soricids (shrews) but not to talpids”
  <linguistic expression>: <artificial language>
    So, Sh, T, An ∈ mammalia
    ∀So . ∀Sh . ∀T . ∃An . solenodon(So) ∧ soricoidea(Sh) ∧ talpoidea(T) ∧ ancestor(An, So) ∧ ancestor(An, Sh) ∧ ¬ancestor(An, T)
EXPO Solenodon Annotation
<domain model>: Statistical branching model of evolution
<experimental model>:
  <factor>: T - Phylogenetic tree relating Solenodon paradoxus and Solenodon cubanus to other mammal species
  <factor level>: “sampling trees every 20 generations” (spl. in [9])
  <target variable>: PS(T) - position and BL(T) - branch length of Solenodon paradoxus and Solenodon cubanus in tree T
  <model assumption>1: The use of the selected DNA sequences will produce results that generalise to the complete genomes.
  <model assumption>2: Molecular clock assumptions.
<experimental action> 1.1.1: extraction and purification
  <object>: sample of DNA
  <parent group>: DNA from Solenodon paradoxus
  <sampling>: random sampling
  <instrument>: Qiagen DNA cleanup kit
………………………………………………….
<experimental conclusion> C2: Comment: Formed Hypothesis
  <linguistic expression>: <artificial language>
    So, Sh, T, E, An, De ∈ mammalia
    ∀So . ∀Sh . ∀T . ∀E . ∃De . ∃An . solenodon(So) ∧ soricoidea(Sh) ∧ talpoidea(T) ∧ erinaceidea(E) ∧ ancestor(An, So) ∧ ancestor(An, De) ∧ ¬ancestor(De, So) ∧ ancestor(De, Sh) ∧ ancestor(De, T) ∧ ancestor(De, E)
Problems Highlighted by EXPO 1
The use of EXPO makes explicit the different hypotheses described in the paper.
– The hypotheses we have identified in the <research conclusion> are not mentioned as hypotheses in the text.
– This contrasts with what we identify as the seven null-hypotheses, which are mentioned explicitly in the main text.
– May be sub-optimal statistically.
Problems Highlighted by EXPO 2
Another problem of the research which EXPO exposed was that the DNA sequences produced during the experiment were stored in the EMBL database using the taxonomic term “Insectivora”.
This taxon is now generally recognised to be polyphyletic, and what is more, its use contradicts the actual conclusions of the paper.
Problems Highlighted by EXPO 3
We formalised, using Prolog, the knowledge behind the authors’ argument that “Cuban Solenodons should be classified in a distinct genus, Atopogale”. Our analysis indicated that it would have been more internally consistent for the authors to have classified Cuban Solenodons as a distinct family.
The decision to use a genus rather than family has probably more to do with sociology than biology.
However, it is a biological embarrassment that the terms “genus” and “family” are ill defined.
etc…..
High-energy/particle physics
Another random paper selected from the same Nature issue: D0 Collaboration. A precision measurement of the mass of the top quark. Nature, 429, 639-642 (2004).
EXPO D0 Example 1
<scientific experiment>: <computational experiment>: <simulation>
<admin info about experiment>:
  <title>: A precision measurement of the mass of the top quark
<classification by domain>:
  <domain of experiment>: High Energy Physics / Particle Physics
    <DDC(Dewey) classification>: 539.7 Atomic and nuclear physics
    <Library of Congress classification>: QC 770-798 Atomic, Nuclear, Particle Physics
  <related domain>: Computational Statistics
    <DDC(Dewey) classification>: 519 Probabilities and Applied Mathematics
    <Library of Congress classification>: QA 273-274 Probabilities
<research hypothesis>: <representation style>: <text>
  <linguistic expression>: <natural language>:
    Given the same observed data: use of the new statistical method M1 will produce a more accurate estimate of Mtop than the original method M0.
  <linguistic expression>: <artificial language>:
    M0(∀ D0 observations ∧ ∀ relevant background knowledge) ↦ E0
    M1(∀ D0 observations ∧ ∀ relevant background knowledge) ↦ E1
    estimation_error(E0, Mtop) ↦ Error0
    estimation_error(E1, Mtop) ↦ Error1
    Error0 > Error1
<alternative hypothesis>: <subject effect>: <experimenter bias>:
  <linguistic expression>: <artificial language>
    P(The Standard Model) is high
    The Standard Model → Higgs boson
    (Mtop = 173.3) → Higgs boson mass best fit mass estimate is experimentally excluded.
    ∴ Mtop > 173.3
EXPO D0 Example 2
<domain model>: A statistical branching model of particles (see [1] fig. 1)
<experimental model>:
  <factor>: M - a method of estimating Mtop
  <factor level>: M0 – the old method; M1 – the new method
  <target variable>: Mtop
  <model assumption>1: It is possible to compare the accuracy of estimates of M0 and M1 without knowing the true value of Mtop.
  <model assumption>2: “the specific value of the Pbkg cut-off” based on Mtop = 175 GeV/c² [8] has no effects on the experimental results.
…………………………………………………………………………………………………
<experimental conclusion>: <representation style>: <text>
  <linguistic expression>: <natural language>
    The probability of the research hypothesis H1 has been increased: method M1 “yields a greatly improved precision” of Mtop [8].
<result error>:
  <error of conclusion>: <hypothesis acceptance mistake>: <false positive> ≠ 0
  <error of conclusion>: <faulty comparison>: M0 and M1 estimates
Problems Highlighted by EXPO 1
Poor science, even though it has ~350 authors and was published in Nature!
This annotation makes it explicit that the experiment was somewhat unusual in not generating any new observational data. Instead, it presents the results of applying a new statistical analysis method to existing data (a set of putative top quark pair decay events involving e+jets and μ+jets).
Problems Highlighted by EXPO 2
The paper has no explicit hypothesis. However, we argue that the paper’s implicit experimental hypothesis was: given the same observed data, use of the new statistical method will produce a more accurate estimate of Mtop than the original method. This is based on the authors’ statement “here we report a technique that extracts more information from each top-quark event and yields a greatly improved precision when compared to previous measurements”.
We prefer the term “accuracy” to “precision”.
Problems Highlighted by EXPO 3
The Carnap principle: All relevant knowledge should be used to decide a scientific question:
– 91 candidate events were used to calculate the old value, but only 22 of these were used for the new value!
– The old method estimate of Mtop is: 173.3 ± 5.6 (stat) ± 5.5 (sys) GeV/c².
– The new method estimate of Mtop is: 180.1 ± 3.6 (stat) ± 3.9 (sys) GeV/c².
– The current (June 2005) best estimate for Mtop is 174.3 ± 3.4 GeV/c².
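The three estimates above can be compared with a few lines of arithmetic. The sketch below rests on two assumptions that are not stated on the slide: that statistical and systematic uncertainties are independent and can be combined in quadrature, and that the June 2005 best estimate can stand in as a reference value for Mtop (it is itself an estimate, not the true value). The function names are illustrative.

```python
import math

# Values from the slide, in GeV/c^2.
OLD = {"value": 173.3, "stat": 5.6, "sys": 5.5}   # method M0
NEW = {"value": 180.1, "stat": 3.6, "sys": 3.9}   # method M1
REFERENCE = 174.3                                  # June 2005 best estimate


def total_uncertainty(stat: float, sys: float) -> float:
    """Combine stat and sys in quadrature (assumes they are independent)."""
    return math.hypot(stat, sys)


def estimation_error(estimate: float, reference: float) -> float:
    """Absolute distance of an estimate from the chosen reference value."""
    return abs(estimate - reference)


for name, m in (("old", OLD), ("new", NEW)):
    sigma = total_uncertainty(m["stat"], m["sys"])
    err = estimation_error(m["value"], REFERENCE)
    print(f"{name}: total uncertainty {sigma:.2f}, distance from reference {err:.1f}")
```

Under these assumptions the new method has the smaller quoted uncertainty (about 5.3 against about 7.8) but lies further from the reference (5.8 against 1.0), which illustrates why we distinguish "precision" from "accuracy".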
Problems Highlighted by EXPO 4
The paper concluded that Mtop is higher than previously estimated, which deductively implies a higher mass for the Higgs Boson. As the Higgs Boson has not yet been observed, even at energies above its previously predicted maximum likelihood mass, the newly inferred higher Mtop lent support to the existence of the Higgs Boson. However, it would have been possible to argue validly the other way: that the Higgs Boson is thought highly likely to exist, therefore its non-observation makes a higher value of Mtop more probable. This argument was not explicit in the paper, but may have existed implicitly as a motivation, and the paper would have benefited from making this argument explicit, even if not used.
e-Science
Related to the formalisation of science is the aim of “e-Science”: to develop a way of explicitly publishing online all of the data and metadata from a scientific experiment, so that it can be repeated and compared with other related experiments.
We believe that EXPO is an important step in developing such an e-Science approach. It will enable scientists to annotate their experiments in a unified way with the minimum of overhead, and it will also enable the efficient review, retrieval, comparison, and sharing of data from experiments.
Scientific Publication 1
The traditional way of presenting scientific knowledge in scientific papers has many limitations.
The most important and obvious of these is the use of natural language to describe knowledge - albeit augmented by various formalisms and mathematics.
Natural language is notorious for its imprecision and ambiguity.
Scientific Publication 2
The use of natural language is a great hindrance when using computers to store and analyse data – hence the growing importance of text-mining.
We argue that the content of scientific papers should increasingly be expressed in formal languages.
Is writing a scientific paper closer to writing poetry or a computer program?
Conclusions
The unity of science implies that an accepted general ontology of experiments is both possible and desirable.
Such an ontology would promote the sharing of results within and between subjects, reducing both the duplication and loss of knowledge.
It is also an essential step in formalising science, and so fully exploiting computer reasoning in science.