Collaborative Ontology building: So much more than authoring an Ontology
Robert Stevens
BioHealth Informatics Group
The University of Manchester
Manchester
United Kingdom
Overview
• An experiment in collaborative authoring
• Issues raised
• Observations made
• The process and the artefact
• Bits of technology
Ontologists: What’s their Problem?
David Randall
Manchester Metropolitan University
What do I Know about Collaborative Ontology Authoring?
• “you’ve never built a real ontology”
• Advisor in projects
• Experiments in collaborative authoring
• Doing it for real in a Kidney and Urinary Pathway Ontology
• Informal observational studies with Collaborative Protégé
The Software Engineering Life-Cycle
Ontology
Issues in Ontology Authoring
SCOPE
COMPLEXITY
COST
AUTHORING
EVALUATION
http://ontogenesis.ontonet.org/ppt/Issues_mindmapSB.pdf
The NCL Study
• A small group met to normalise the OBO Cell Ontology (CL)
• Transform an axiomatically lean hand-crafted “tangled” ontology to:
• An axiomatically rich ontology where the structure is computationally maintained
• Study the process and deliver the artefact
• http://www.gong.manchester.ac.uk/CTON.html
• Two two-day meetings; videoed and observed by an ethnographer
• Part of the OntoGenesis network
Contractile cell CL
What is Ontology Normalisation?
• Hand-crafted ontologies with multiple inheritance are “tangled”
• Usually axiomatically lean
• We classify along one axis and use “restrictions” to other modules to capture other axes
• Then re-build the multiple inheritance using the axiomatically rich ontology
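The re-building step can be sketched in a few lines of Python. This is a toy structural check, not a description logic reasoner: a class is placed under a defined class only when its asserted restrictions literally contain the defined class's conditions. The T cell restrictions and "diploid cell" come from later slides in this deck; "secretory cell" and the dictionary layout are assumptions made for illustration.

```python
# Toy illustration of normalisation: NOT a description logic reasoner.
# "secretory cell" is a hypothetical defined class added for this sketch.

# Primitive classes: one asserted parent on the primary axis, plus
# restrictions pointing at other modules (qualities, processes).
primitive = {
    "CD8-positive alpha-beta immature T cell": {
        "parent": "cell",
        "restrictions": {
            ("has_ploidy", "diploid"),
            ("has_nucleation", "mononucleate"),
            ("participates_in", "secretion by cell"),
        },
    },
    "red blood cell": {
        "parent": "cell",
        "restrictions": {("participates_in", "oxygen transport")},
    },
}

# Defined classes: necessary and sufficient conditions (EquivalentTo).
defined = {
    "diploid cell": {
        "parent": "cell",
        "restrictions": {("has_ploidy", "diploid")},
    },
    "secretory cell": {
        "parent": "cell",
        "restrictions": {("participates_in", "secretion by cell")},
    },
}


def infer_parents(primitive, defined):
    """Return asserted plus inferred parents for every primitive class."""
    result = {}
    for name, cls in primitive.items():
        parents = [cls["parent"]]
        for defined_name, definition in sorted(defined.items()):
            same_axis = definition["parent"] == cls["parent"]
            # Subset test: every condition of the definition is asserted.
            satisfied = definition["restrictions"] <= cls["restrictions"]
            if same_axis and satisfied:
                parents.append(defined_name)
        result[name] = parents
    return result


inferred = infer_parents(primitive, defined)
for name, parents in inferred.items():
    print(name, "->", parents)
```

Nobody asserts the multiple inheritance by hand here; it falls out of the restrictions. A real reasoner also handles property hierarchies, filler subsumption and negation, which this sketch ignores.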
Tangled Ontology of Cars
[Figure panels: Tangled, Untangled, Inferred]
Contractile cell nCL
The People
• Ten people “friends and family”
• All some sort of biologists
• All familiar with OWL and normalisation
• All “singing from the same hymn sheet”
The Overall Process
• Analyse issues in current OBO CL
• Determine primary axis of classification
• Identify supporting ontologies
• Identify properties and design patterns; determine representation
• Gather knowledge
• Generate OWL encoding
• Evaluate, iterate
• Two face-to-face meetings; separate work; email and Skype
Questions Raised
• When do we work as a larger group, in smaller groups, and singly?
• What resources do we use?
• Who knows what?
• What strategies do we use?
• What expertise do we need?
• What are the vested interests?
Producing the “schema”
• What is it we want to say about cells?
• How do we want to say it?
• Most time was spent on these questions (one day)
• Best done face to face as the whole group
• Perhaps a fait accompli in the large
• Lots of modifications through debate
• Strong chair and process (“benevolent dictatorship”)
“what about sea urchins?”
Ethnographer’s Observations!
“I don’t know about plants”
NCL Schema Captured in a Spreadsheet
Term Name | CTO id | ploidy | morphology | Cellular component | size | germ line | nucleation | process
slow muscle cell | CL:0000189 | | PATO:0001873 | GO:0030017 ; GO:0005739 | Large | n/a | PATO:0001908 | GO:0031444
blue sensitive photoreceptor cell | CL:0000495 | PATO:0001394 | PATO:0001154 ; PATO:0001873 | | Large | Somatic | PATO:0001407 | GO:0050908 ; GO:0007603
green sensitive photoreceptor cell | CL:0000496 | PATO:0001394 | PATO:0001154 ; PATO:0001873 | | Large | Somatic | PATO:0001407 | GO:0050908 ; GO:0007603
R1 photoreceptor cell | CL:0000687 | PATO:0001394 | ?? | | Variable | Somatic | PATO:0001407 | GO:0050908 ; GO:0007603
CL normalisation Workflow
Ontology API
CL Spreadsheet
The Ontology Preprocessor Language
• Adding “select”, “add” and “remove” keywords to MOS (Manchester OWL Syntax)
• A “scripting” language for OWL
• We generate a list of instructions to build an ontology
• We can embed patterns into this generation
• Saves “mouse clicks”
• Rapid production of large amounts of ontology
• Easy to apply changes; acts as a macro language
OPPL sample
ADD Class: CL_0000811;
REMOVE subClassOf owl:Thing;
ADD label ``CD8-positive, alpha-beta immature T cell'';
ADD subClassOf cto:Cell;
ADD subClassOf cto:has_ploidy some pato:PATO_0001394;
ADD comment ``MORPHOLOGY: pleiomorphic'';
ADD comment ``CELULAR COMPONENT: '';
ADD subClassOf cto:has_size some cto:Small;
ADD comment ``GERM LINE: n/a'';
ADD subClassOf cto:has_nucleation some pato:PATO_0001407;
ADD subClassOf cto:participates_in some go:GO_2456;
ADD subClassOf cto:participates_in some go:GO_0021700;
ADD subClassOf cto:participates_in some go:GO_0032940;
ADD comment ``PROCESS: '';
ADD comment ``LINEAGE: mesoderm'';
ADD subClassOf cto:appears_in some cto:Animalia;
ADD comment ``ORGANISM COMMENT: '';
ADD subClassOf cto:potentiality some cto:TerminallyDifferentiated;
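A script like the sample above could be generated from one spreadsheet row roughly as follows. This is a minimal sketch that copies only the surface form of the sample's instructions; it is not a validated OPPL grammar and not the group's actual generator. The dictionary keys and the helper names `curie_to_name` and `row_to_oppl` are hypothetical, and the row values are one reading of the "blue sensitive photoreceptor cell" row of the schema spreadsheet.

```python
def curie_to_name(curie):
    """Turn 'PATO:0001394' into 'pato:PATO_0001394', as in the sample."""
    prefix, local_id = curie.strip().split(":")
    return f"{prefix.lower()}:{prefix}_{local_id}"


def row_to_oppl(row):
    """Emit OPPL-style instructions for one spreadsheet row (hypothetical keys)."""
    lines = [
        f"ADD Class: {row['id'].replace(':', '_')};",
        "REMOVE subClassOf owl:Thing;",
        f"ADD label ``{row['name']}'';",
        "ADD subClassOf cto:Cell;",
    ]
    # Each filled-in column becomes one existential restriction.
    if row.get("ploidy"):
        lines.append(
            f"ADD subClassOf cto:has_ploidy some {curie_to_name(row['ploidy'])};"
        )
    if row.get("size"):
        lines.append(f"ADD subClassOf cto:has_size some cto:{row['size']};")
    if row.get("nucleation"):
        lines.append(
            f"ADD subClassOf cto:has_nucleation some {curie_to_name(row['nucleation'])};"
        )
    # A cell may list several processes separated by ';'.
    for term in row.get("process", "").split(";"):
        if term.strip():
            lines.append(
                f"ADD subClassOf cto:participates_in some {curie_to_name(term)};"
            )
    return lines


# Assumed column assignment for the row; empty columns are simply omitted.
row = {
    "name": "blue sensitive photoreceptor cell",
    "id": "CL:0000495",
    "ploidy": "PATO:0001394",
    "size": "Large",
    "nucleation": "PATO:0001407",
    "process": "GO:0050908 ; GO:0007603",
}

script = row_to_oppl(row)
print("\n".join(script))
```

Because the pattern lives in one function, changing how a column is represented means editing one line and re-generating, rather than re-clicking every class by hand.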
What we Generate
Class: 'CD8-positive alpha-beta immature T cell'
SubClassOf:
    Cell,
    has_morphology some pleomorphic,
    has_nucleation some mononucleate,
    has_ploidy some diploid,
    has_potentiality some TerminallyDifferentiated,
    derives_from some 'double-positive alpha-beta immature T cell',
    located_in some 'Animalia',
    not (participates_in some gametogenesis),
    participates_in some 'T cell mediated immunity',
    participates_in some 'developmental maturation',
    participates_in some 'secretion by cell'
A Defined Class
Class: “diploid cell”
EquivalentTo: cell that has_ploidy some diploid
• Picks up all cells that has_ploidy some diploid
• Trivial, but difficult to do by hand and be complete
Class: “germline cell”
EquivalentTo: cell that (participates_in some gametogenesis) or (directly_derived_from some gamete)
The Representation
• Aligning with RO and most OBO conventions
• Red_blood_cell participates_in some Oxygen_transport
• Red_blood_cell has_disposition some (realisable_entity that is_realised_in some oxygen_transport)
• First is simple and useful, but not actually true
• Second is more ontologically formal and “right”
• Can easily expand the “schema” to either representation
• Do experiments with patterns
Entity Quality or Entity Property Quality Pattern?
• At least two ways of representing qualities
• Need only one instance of a quality type inhering in each entity
• has_quality exactly 1 diploid
• coupled with has_quality max 1 ploidy
• Otherwise:
• has_ploidy some diploid
• has_ploidy is functional and in property hierarchy under has_quality
• Again, applying patterns is easy; do experiments; gain consistency
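One way to picture "applying patterns is easy" is a generator in which the choice of pattern is a single argument. The sketch below emits the axiom strings from this slide under either pattern. The function names, the `PATTERNS` table and the input layout are assumptions for illustration; the quality values are those of the generated T cell class shown earlier.

```python
def entity_quality(quality_type, value):
    """Entity Quality pattern: generic has_quality with cardinalities."""
    return [
        f"has_quality exactly 1 {value}",
        f"has_quality max 1 {quality_type}",
    ]


def entity_property_quality(quality_type, value):
    """Entity Property Quality pattern: one sub-property per quality type.

    The property itself (e.g. has_ploidy) is assumed to be declared
    functional and placed under has_quality elsewhere in the ontology.
    """
    return [f"has_{quality_type} some {value}"]


PATTERNS = {"EQ": entity_quality, "EPQ": entity_property_quality}


def expand(schema, pattern):
    """Expand every (quality type, value) pair with the chosen pattern."""
    build = PATTERNS[pattern]
    return {
        cell: [axiom for qtype, value in pairs for axiom in build(qtype, value)]
        for cell, pairs in schema.items()
    }


schema = {
    "CD8-positive alpha-beta immature T cell": [
        ("ploidy", "diploid"),
        ("nucleation", "mononucleate"),
    ],
}

eq_axioms = expand(schema, "EQ")
epq_axioms = expand(schema, "EPQ")
for axiom in epq_axioms["CD8-positive alpha-beta immature T cell"]:
    print(axiom)
```

Running the same schema through both patterns and classifying each result is the kind of experiment the slide suggests. Since every class is produced by the same function, the output is consistent by construction.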
Time Spent
• First two-day meeting
• One day “planning the schema”
• Half a day describing 30 cells and producing an ontology
• An hour or so evaluating and re-generating
• Quick iterations and always having an ontology to look at
The Second Meeting
• Six months gathering material
• An hour or so of review all together
• Pairs adding more material
• A review
• More pair work
• More review
• Then dispersed activity (all “spare time”)
• Short iteration periods (in terms of work spent)
Resources used
• Brain power
• The Web – Wikipedia is our friend
• Other ontologies
• Text books (minor use)
• Research papers
• The developing ontology and the reasoner
• Phone a friend (who is an authority in the field?)
Identifying Issues in OBO CL
• CL generated in a few days and not really touched (not true now)
• Lots of well recognised issues: Wrong biology; missing biology; ontological defects; …
• Still observed to be very useful
• Issues gave us some “tests”
Identifying Supporting Ontologies
[Diagram: the CL Ontology and its supporting ontologies, with example terms]
• PATO Qualities: Nucleation, Morphology, Size, Ploidy
• GO Biological Process: Muscle Contraction, Secretion
• GO Cellular Component: Chloroplast, Cell Membrane
• NCBI Taxonomy: Bacillus anthracis str. Ames
• FMA Anatomy: Epithelium, Kidney
“It lets me do the biology”
• Is what one of our biologists said
• I can see what we’ve said about a cell
• I can see where it is in the structure
• I relate the two
• The work is “turned around”: thinking about the biology and its consequences
• P1: ‘flight muscle cell, that’s interesting ... no, a cardiac muscle cell is not a skeletal muscle cell!!’
• P2: ‘a flight muscle cell is never a cardiac muscle cell.’
• “Why has it put it there?”
• Here “it” is the reasoner
Strategies
• Pinning down the scope: Only cells in vivo
• Dealing with a representative set of cells: developing a test plan
• Collective wisdom: testing against current knowledge – “pericytes”
• Concentrating on biology and less on ontology engineering
• Using the owners and authorities
Being “Agile”
• Software engineering has moved on from simplistic life cycles
• Agile methods are the fashion
• Embedding users
• Always have something working
• Test driven development
• Short iterations
• Deliver early
Observations on Collaboration
• The work is not mechanical
• It involves extensive synchronous face-to-face work on deciding on scope and purpose
• It relies on socially distributed expertise, and ‘knowing who knows’
• It involves the synchronous or rapid use of a number of different artefacts, and an understanding of how best to use them
• It involves constant ‘testing’ and the delaying of final decisions through ambiguity resolution and error checking, and the constant recording of rationales for decision-making
The New KUPO Process
Collaborative Spreadsheet
Individual Spreadsheet
Semantic Wiki
Issue Tracker
OPPL Script Formulation
Generate OWL
Reasoned Ontology
View Ontology
Summary
• Mass direct authoring of an ontology seems bad
• In NCL we only used Protégé to “look at it” – no hand-building
• Mass knowledge gathering and commenting seems good
• Keeping “Agile” seems good
• Doing too much by hand seems bad
• Developing the schema in a team seems good
• The team should have coherent, non-clashing interests
Acknowledgements
• Mikel Aranguren and Simon Jupp for slides
• Mikel Aranguren, Simon Jupp, Helen Parkinson, Phil Lord, David Shotton, James Malone, Jonathan Bard, Midori Harris did the work
• Dave Randall did the ethnography• The EPSRC for funding OntoGenesis