Download - Obol: Open Bio-Ontology Language
Obol:Obol:Open Bio-Ontology Open Bio-Ontology
LanguageLanguageUsing grammars to extract and use Using grammars to extract and use
implicit knowledge in the GO and OBOimplicit knowledge in the GO and OBO
Chris MungallChris Mungall
Berkeley Drosophila Genome Berkeley Drosophila Genome Project / GO ConsortiumProject / GO Consortium
ObolObol
Obol is a system for discovering and Obol is a system for discovering and reasoning over hidden knowledge in reasoning over hidden knowledge in ontologiesontologies
Obol is useful for helping maintain Obol is useful for helping maintain cross-products in the Gene Ontologycross-products in the Gene Ontology
Obol works by parsing syntax and Obol works by parsing syntax and semantics from GO and OBO terms semantics from GO and OBO terms
Motivation: Ontology Motivation: Ontology MaintenanceMaintenance
GO: 3 ontologies, 16k terms, 23k GO: 3 ontologies, 16k terms, 23k relationshipsrelationships
OBO: cell, biochemical, sequence and OBO: cell, biochemical, sequence and multiple anatomical ontologiesmultiple anatomical ontologies
Many GO terms are combinatorial Many GO terms are combinatorial (cross-products)(cross-products) ““regulation of neutrophil differentiation”regulation of neutrophil differentiation”
No explicit links between ontologiesNo explicit links between ontologies Difficult to maintain manuallyDifficult to maintain manually
Some Sample GO termsSome Sample GO terms
‘regulation of neutrophil differentiation.’‘neutrophil differentiation.’
‘granuloctye differentiation.’‘smooth muscle contraction.’
‘nucleolar chromatin.’‘nucleolus.’
‘oxygen transport.’‘negative regulation of interleukin-2
biosynthesis.’‘oxidoreductase activity, acting on paired donors,
with incorporation or reduction of molecular oxygen, reduced iron-sulfur protein as one donor, and incorporation of one atom of oxygen.’
Graph complexityGraph complexity
interleukin-2biosynthesis
cytokinebiosynthesis
biosynthesis
regulation ofinterleukin-2biosynthesis
regulation ofcytokine
biosynthesis
regulation ofbiosynthesis
negativeregulation
ofinterleukin-2biosynthesis
negativeregulation
ofcytokine
biosynthesis
negativeregulation ofbiosynthesis
part-of
is-a
Automatic inference of Automatic inference of relationshipsrelationships
Some relationships can be derived Some relationships can be derived computationally…computationally…
……provided we have complete logical provided we have complete logical definitionsdefinitions
regulation ^ (regtype:negative) ^(regprocess:biosynthesis ^ (makes:interleukin-2))
Tools exist for reasoning over these logical definitions, but…
Generating logical Generating logical definitionsdefinitions
Generating and maintaining logical Generating and maintaining logical definitions for GO/OBO is non-trivialdefinitions for GO/OBO is non-trivial
Obol exploits the highly regular Obol exploits the highly regular grammatical structure of GO term namesgrammatical structure of GO term names ““regulation of X”, never “X regulation”regulation of X”, never “X regulation” ““Y biosynthesis”, never “biosynthesis of Y”Y biosynthesis”, never “biosynthesis of Y” no stemming requiredno stemming required
Obol derives candidate class definitions Obol derives candidate class definitions from term names, and performs basic from term names, and performs basic reasoning over themreasoning over them
Obol: parsing and Obol: parsing and reasoningreasoning
GO/OBO TermLexical string
Class Definition(s)may involve relationshipsto other OBO terms
Inferencesusing definitions and
existing ontologies
“interleukin-2 biosynthesis”
biosynthesis^(makes:interleukin-2)
“interleukin-2 biosynthesis” is_a“cytokine biosynthesis”inferred from: interleukin-2 is_a cytokine
How Obol WorksHow Obol Works
term names are broken into lexical tokens term names are broken into lexical tokens (words) using a tokeniser(words) using a tokeniser
tokens are parsed using a tokens are parsed using a grammargrammar, , generating parse treesgenerating parse trees
parse trees are turned into class definitions parse trees are turned into class definitions using using transformation rulestransformation rules and and property property definitionsdefinitions
transformation is reversibletransformation is reversible class definitions are reasoned overclass definitions are reasoned over implemented in XSB Prologimplemented in XSB Prolog
Word tokensWord tokens
Obol uses an atomic vocabulary of Obol uses an atomic vocabulary of word tokensword tokens
tokens are partitioned by ontology tokens are partitioned by ontology domaindomain cell, anatomy, biological process, etccell, anatomy, biological process, etc
tokens have a grammatical typetokens have a grammatical type adj, noun, prep, relational adj, specialadj, noun, prep, relational adj, special
vocabularies need not be correct or vocabularies need not be correct or completecomplete
Computational Computational GrammarsGrammars
formal grammars can elucidate formal grammars can elucidate sentence structuresentence structure
grammars transform token lists grammars transform token lists into parse treesinto parse trees
multiple parses may be possiblemultiple parses may be possible parses are reversibleparses are reversible a grammar is a collection of a grammar is a collection of
transformation rulestransformation rules
A simple OBO term A simple OBO term grammargrammar
Term --> NP e.g. negative regulation of interleukin-2 biosynthesis NP --> NP PP e.g. negative regulation of interleukin-2 biosynthesis
NP --> NOUN e.g. interleukin;; regulation;; biosynthesis
NP --> NP-TOK e.g. interleukin-2
NP --> ADJ NP e.g. negative regulation
NP --> NP NP e.g. interleukin-2 biosynthesis
PP --> PREP NP e.g. of interleukin-2 biosynthesis
(subset of the whole OBO grammar)
Applying grammar rulesApplying grammar rules
negative regulation of interleukin-2 biosynthesisadj noun prep noun tok noun
np npnpnp -> n
npnp -> adj np npnp -> np-tok
np
np -> np nppp
pp -> p npnp
np -> np pp
term -> np
Generating Class Generating Class DefinitionsDefinitions
A parse tree shows the A parse tree shows the syntaxsyntax structure of a structure of a termterm
A class definition is a description of the A class definition is a description of the meaningmeaning of a term of a term
An Obol classdef is a cross product An Obol classdef is a cross product (intersection) of necessary and sufficient (intersection) of necessary and sufficient conditionsconditions
Classdefs are generated from parse trees using Classdefs are generated from parse trees using tree transform rules and property descriptionstree transform rules and property descriptions
Classdefs can be exported using obo or OWL Classdefs can be exported using obo or OWL formatformat
Property definitions Property definitions guide class constructionguide class construction
Property name: makes domain: biosynthesis range: substance grammar: np_modifier interleukin-2
npnp
np
biosynthesis
biosynthesis^(makes:interleukin-2)
Property definitions Property definitions guide class constructionguide class construction
negativenpadj
np
regulation
regulation^(regtype:negative)
Property name: regtype domain: regulation range: neg/pos grammar: np_modifier
Property definitions Property definitions guide class constructionguide class construction
regulation^(regtype:negative)
npnp
pp
biosynthesis^(makes:IL-2)
regulation^(regtype:negative)^(regprocess:biosynthesis^(makes:IL-2))
Property name: regprocess domain: regulation range: biological_process grammar: prep(of)
np
of
Unparseable terms and Unparseable terms and multi-parse termsmulti-parse terms
biologicalprocess
molecularfunction
cellularcomponent
single-token terms excluded from this analysis
Reasoning over class Reasoning over class definitionsdefinitions
Using class definitions, we can:Using class definitions, we can: autocreate parentage for new termsautocreate parentage for new terms check for missing relationshipscheck for missing relationships find inconsistencies between ontologiesfind inconsistencies between ontologies generate implicit orthogonal ontologiesgenerate implicit orthogonal ontologies
Method:Method: Use native OBOL rules (via prolog or DAG-Use native OBOL rules (via prolog or DAG-
Edit)Edit) OROR use external reasoner; eg RACER, FaCT use external reasoner; eg RACER, FaCT
Finding missing Finding missing relationshipsrelationships
Obol is run periodically on GO to check Obol is run periodically on GO to check for missing IS A and PART OF for missing IS A and PART OF relationshipsrelationships
Multiple parses produce false-positivesMultiple parses produce false-positives 223 missing relationships added to GO223 missing relationships added to GO ToDo: increase specificity by improving ToDo: increase specificity by improving
vocabularies and property definitionsvocabularies and property definitions
Obol sample reportObol sample report
nucleolar chromatin PART OF nucleusclathrin-coated vesicle HAS PART clathrin coatchromoplast membrane IS A plastid membranenuclear microtubule PART OF nucleusvitamin E biosynthesis IS A vitamin E metabolismuracil permease activity IS A permease activitychloroplast envelope IS A plastid envelopenegative regulation of lipid biosynthesis IS A negative regulation of lipid metabolismketone body metabolism IS A ketone metabolismdense nuclear body IS A nuclear body
falsepositive!
inversepresent
Aligning to the OBO cell Aligning to the OBO cell ontologyontology
cardioblastdifferentiation
cardiac celldifferentiation
cardioblast
mesodermalcell
animalcell
cardiacmusclecell
musclecell
DEVELOPSFROM
most differentiation terms align precisely; some don’t…
???
Deriving existing GO Deriving existing GO relationshipsrelationships
Function Process Component
all relationships 8002 13613 1691
obol-derivable relationships
479 3055 346
non-trivially derivable relationships
0 1089 63
Obol as an ontology Obol as an ontology curation toolcuration tool
Obol can be used by GO curators in a variety Obol can be used by GO curators in a variety of waysof ways
““Behind the scenes”Behind the scenes” IterativeIterative
GO curator receives periodic suggestion reportsGO curator receives periodic suggestion reports ContinuousContinuous
GO curator uses OBOL interactively via DAG-Edit pluginGO curator uses OBOL interactively via DAG-Edit plugin
To help the transition to a fully specified To help the transition to a fully specified ontologyontology
GO curators then maintain class definitionsGO curators then maintain class definitions
Obol as a search tool?Obol as a search tool?
Problems to addressProblems to address Integration with curation processIntegration with curation process Memory usageMemory usage Syntax parsingSyntax parsing
chemical terms, long termschemical terms, long terms Dealing with Dealing with andand, , oror and and notnot Generating text definitionsGenerating text definitions Word list maintenanceWord list maintenance
solution: integrate with ontology maintenancesolution: integrate with ontology maintenance Ontology dependenciesOntology dependencies
protein and generic anatomy ontologies neededprotein and generic anatomy ontologies needed Obol can be used to help generate theseObol can be used to help generate these
ConclusionsConclusions
Obol is useful for maintainng large Obol is useful for maintainng large GO-style ontologiesGO-style ontologies
combination of semantic parsing combination of semantic parsing with reasoning is powerfulwith reasoning is powerful benefits of both GO-style ontology benefits of both GO-style ontology
development and formal reasoningdevelopment and formal reasoning
AcknowledgementsAcknowledgements Berkeley/GO
John Richter Brad Marshall Karen Eilbeck Suzanna Lewis Gerry Rubin
Jackson Labs/GO David Hill Joel Richardson Judith Blake
GO CuratorsMidori HarrisJennifer ClarkAmelia IrelandJane Lomax
ManchesterChris WroeRobert StevensPhillip Lord
J Michael CherryMichael Ashburnerall the GO
Consortium
The OBO Universe The OBO Universe (partial)(partial)
Phenotype
Sequence
Anatomy
Cell
Protein
Chemical
Component
Process
Function
Fish-Anat
Fly-Anat
Detecting Detecting inconsistenciesinconsistencies
Y binding
X binding
Z binding
X
Z
Y
Prolog as an ontology Prolog as an ontology languagelanguage
% DATABASE OF FACTS% DATABASE OF FACTS
isa(carb_binding, binding).isa(carb_binding, binding).
isa(polysac_binding, isa(polysac_binding, carb_binding).carb_binding).
isa(chitin_binding, isa(chitin_binding, polysac_binding)polysac_binding)
isa(cellulose_binding, isa(cellulose_binding, polysac_binding).polysac_binding).
% INFERENCE RULES% INFERENCE RULES
isaT(isaT(XX,,YY):- isa():- isa(XX,,YY).).
isaT(isaT(XX,,YY):-isa():-isa(XX,,ZZ),), isaT( isaT(ZZ,,YY).).
● ?- isaT(chitin_binding, binding).
● YES
● ?-isaT(X, polysac_binding).
● X=carb_binding.
● X=chitin_binding.
● X=cellulose_binding.
● ?-isaT(chitin_binding, cellulose_binding).
● NO
● ?-isaT(X,Y). % returns all paths
regulation
regulates
contraction muscle
negative smooth
class(regulation <process> qualifier=class(negative <general>) regulates=class(contraction <process> affects_cell_type=class(muscle <anatomical> has_type=class(smooth <general>))))
Prolog internal Prolog internal representationrepresentation
Prolog Grammar Prolog Grammar ImplementationImplementation
Prolog: the classic logic programming Prolog: the classic logic programming languagelanguage High-level declarative language, natural choice High-level declarative language, natural choice
for ontologies; built in “database”for ontologies; built in “database” Definite Clause Grammars (DCGs)part of the Definite Clause Grammars (DCGs)part of the
language; language; DCGs allow passing data up the parse treeDCGs allow passing data up the parse tree
XSB PrologXSB Prolog Uses tabling (more efficient, less re-calculation)Uses tabling (more efficient, less re-calculation) Tabling + DCGs = Tabling + DCGs = chart parsingchart parsing (Earley's (Earley's
algorithm)algorithm) This means we can have This means we can have left-recursiveleft-recursive grammars grammars
A Formal Grammar for A Formal Grammar for OBO termsOBO terms
All(?) GO/OBO terms are NOUN-PHRASES (exception: All(?) GO/OBO terms are NOUN-PHRASES (exception: phenotypes?)phenotypes?)
A NOUN-PHRASE is (recursively) made fromA NOUN-PHRASE is (recursively) made from
a NOUN (includes inflected verbs; ega NOUN (includes inflected verbs; eg bind bindinging)) an ADJECTIVE followed by a NOUN-PHRASE eg an ADJECTIVE followed by a NOUN-PHRASE eg inner inner
membranemembrane a NOUN-PHRASE preceeded by a NOUN-PHRASE a NOUN-PHRASE preceeded by a NOUN-PHRASE acting asacting as
ADJECTIVE; eg ADJECTIVE; eg clathrin coatclathrin coat a NOUN-PHRASE then PREPOSITION then NOUN-PHRASE; eg a NOUN-PHRASE then PREPOSITION then NOUN-PHRASE; eg
regulation of transcriptionregulation of transcription an (optional) NOUN-PHRASE then a RELATIONAL ADJECTIVE an (optional) NOUN-PHRASE then a RELATIONAL ADJECTIVE
then a NOUN-PHRASE; eg then a NOUN-PHRASE; eg clathrin-coatclathrin-coateded vesicle vesicle Precedence rulesPrecedence rules are also required to prune parse forest are also required to prune parse forest
Simple but effectiveSimple but effective
A formal grammar is a set of A formal grammar is a set of production rulesproduction rules operating over operating over terminal symbolsterminal symbols (eg (eg wordswords) ) and and non-terminal symbolsnon-terminal symbols (eg word/phrase (eg word/phrase categories)categories)
The rules determine how sequences of symbols The rules determine how sequences of symbols can be can be transformedtransformed, making a , making a parse treeparse tree
Sentence -> Subject Verb ObjectSubject -> Article NounObject -> Article NounArticle -> a | theVerb -> ate | chasedNoun -> cat | banana | mouse
GENERATING: Start at top ('sentence')and apply rules until all symbols are terminal
PARSING: Start at bottom – a sequence of terminals;apply rules, combining symbols if necessary
Parse tree for a simple sentence,“the cat ate the banana”
negative regulation of smooth muscle contraction ADJ NOUN P ADJ NOUN NOUN
NP NP NP NP -> ADJ NOUN*
NP NP -> NP NP*
PP PP -> P NP*
NP NP -> NP* PPParse tree
Class Definition (shown as DAG)
regulation
regulates
contraction muscle
negative smooth
thick lines inidcatestem terms
the parse tree showsthe synctacticstructure
we need new OBO format (or OWL) to representcross-products (aka intersections / complete defs)
DAG definition is minimal – non-definitional relationships to other terms not shown
the classdef shows thelogical structure
recurse down treeapplying grammaticalcontext rules to getproperty fillers
X
XX
XX
negative regulationof smooth muscle contraction
smooth musclecontraction
smooth muscle
COPII-coated vesicle COPII-coated vesicle membranemembrane
class(membrane <component> “COPII-coated vesicle membrane” part_of=class(vesicle <component> “COPII-coated vesicle” has_part=class(coat <component> “COPII coat” made_from=class(COPII <complex> “COPII”))))
class/term name shown in quotes; these can be derived byreversion the transformation
requires use of inverse properties (has_part vs part_of)- supported in new OBO format.
the above classdef is consistent with what is in the GOcellular_component ontology
OutlineOutline
Motivation – combinatorial issues with composite termsMotivation – combinatorial issues with composite terms
Approaches: annotation-time term composition Approaches: annotation-time term composition vsvs tools for tools for
maintenance of large DAGsmaintenance of large DAGs
The OBOL SystemThe OBOL System
Term decomposition using grammarsTerm decomposition using grammars
Generating computable logical class definitionsGenerating computable logical class definitions
Rules and reasoning over class definitionsRules and reasoning over class definitions
Initial ResultsInitial Results
Strategies for using OBOL within GO/OBOStrategies for using OBOL within GO/OBO
Manual maintenance of Manual maintenance of GOGO
GO is 3 DAGs of over 16k termsGO is 3 DAGs of over 16k terms Large DAGs of terms are hard to maintainLarge DAGs of terms are hard to maintain ““cross-products” produce combinatorial cross-products” produce combinatorial
explosions and highly connected sub-explosions and highly connected sub-graphsgraphs
GO terms include OBO termsGO terms include OBO terms eg eg oxygen bindingoxygen binding; ; wing developmentwing development
Zipf's LawZipf's Law Many terms not yet used in annotation Many terms not yet used in annotation
(Ogren, pers. Comm.)(Ogren, pers. Comm.)
Example combinatorial Example combinatorial explosionexplosion
contraction
muscle contraction
smooth muscle contraction
negative regulation of contraction
Negative regulation of muscle contraction
regulation
regulation of contraction
regulation ofmuscle contraction
regulation ofsmooth muscle contraction
negative regulation ofsmooth muscle contraction
Negative regulation
part_of
part_of
“implicit” terms
negative regulation
negative regulation of muscle contraction
actual GO terms
implicit anatomicalterms not shown
affects
part_of affects
cell motility
One Approach: One Approach: PropertiesProperties
One One extremeextreme solution is to remove composite terms from ontology solution is to remove composite terms from ontology
altogetheraltogether
Generate “anonymous” composite terms at annotation-time via Generate “anonymous” composite terms at annotation-time via
property/slotproperty/slot restrictions to restrictions to atomic termsatomic terms
● bindingbinding ^ ^ affects(affects(interleukin-18interleukin-18))
● contractioncontraction ^ ^ affects(affects(musclemuscle++typetype((smoothsmooth))))
These bindings constitute a These bindings constitute a class definitionclass definition
But it is still necessary to make statements about composite terms in But it is still necessary to make statements about composite terms in
the ontologythe ontology
● macrophage activationmacrophage activation is_ais_a immune cell activityimmune cell activity
● fibrinolysis fibrinolysis is_ais_a negative regulation of blood coagulation negative regulation of blood coagulation
Another approach: Another approach: Computationally aided Computationally aided ontology maintenanceontology maintenance
GO terms exhibit regularity in their GO terms exhibit regularity in their syntactic structuresyntactic structure SubstringSubstring relationships highly relationships highly
correlated with actual relationshipscorrelated with actual relationships– regulation of smooth muscle contractionregulation of smooth muscle contraction
– smooth muscle contractionsmooth muscle contraction
– muscle contractionmuscle contraction
– contractioncontractionOgren PV, Cohen KB, Acquaah-Mensah GK, Eberlein J, Hunter L. 2004. The compositional structure of Gene Ontology terms.Pac Symp Biocomput 9: 214-215.
OBOL: Syntax and OBOL: Syntax and SemanticsSemantics
What about using the term syntax What about using the term syntax to get at the to get at the meaningmeaning of the term? of the term?
GO/OBO Term
Class Definitionmay involve relationshipsto other OBO terms
Inferences
“interleukin binding”
binding^affects(interleukin)
“interleukin binding” is_a“cytokine binding”inferred from: interleukin is_a cytokine
Inference of Inference of intermediate terms and intermediate terms and
IS_As – example ruleIS_As – example rule
class(regulation process_regulated=R qual=Q) is_aclass(regulation process_regulated=R' qual=Q) <=> R is_a R'
FORALL classdef pairs IFF the stem-class is the same AND all the property-values in the restriction-list are identical EXCEPT for one property, in which the property-values are linked by an isa, THEN the classdefs are linked by an isa
class(C P1=V1 P2=V2..Px=Vx Pn=Vn) is_aclass(C P1=V1 P2=V2..Px=Vx' Pn=Vn) <=> Vx is_a Vx'