stealthy annotation of experimental biology by … · special issue paper stealthy annotation of...

CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCEConcurrency Computat.: Pract. Exper. 2013; 25:467–480Published online 10 October 2012 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/cpe.2941

SPECIAL ISSUE PAPER

Stealthy annotation of experimental biology by spreadsheets

KatherineWolstencroft1,*,†, Stuart Owen1, MatthewHorridge1, Simon Jupp1, Olga Krebs2,Jacky Snoep3, Franco du Preez3, Wolfgang Mueller2, Robert Stevens1 and Carole Goble1

1School of Computer Science, University of Manchester, Oxford Road, Manchester, UK2HITS gGmbH, Heidelberg, Germany

3Manchester Interdisciplinary Biocentre, University of Manchester, UK

SUMMARY

The increase in volume and complexity of biological data has led to increased requirements to reuse that data.Consistent and accurate metadata is essential for this task, creating new challenges in semantic data annotationand in the constriction of terminologies and ontologies used for annotation. The BioSharing community aredeveloping standards and terminologies for annotation, which have been adopted across bioinformatics, but the realchallenge is to make these standards accessible to laboratory scientists. Widespread adoption requires the provisionof tools to assist scientists whilst reducing the complexities of working with semantics. This paper describesunobtrusive ‘stealthy’ methods for collecting standards compliant, semantically annotated data and for contributingto ontologies used for those annotations. Spreadsheets are ubiquitous in laboratory data management. Our spread-sheet-basedRightField tool enables scientists to structure information and select ontology terms for annotationwithinspreadsheets, producing high quality, consistent datawithout changing commonworking practices. Furthermore, ourPopulous spreadsheet tool proves effective for gathering domain knowledge in the form ofWebOntology Language(OWL) ontologies. Such a corpus of structured and semantically enriched knowledge can be extracted in ResourceDescription Framework (RDF), providing further means for searching across the content and contributing to OpenLinked Data (http://linkeddata.org/). Copyright © 2012 John Wiley & Sons, Ltd.

Received 4 April 2011; Revised 1 September 2012; Accepted 10 September 2012

KEY WORDS: semantic annotation; data curation; ontology

1. INTRODUCTION

Howe et al. in [1] make the compelling case that although databases have become important avenues forpublishing biological data, the curation increasingly lags behind data generation in funding, developmentand recognition. The interpretation and reuse of data rely on it being annotated with rich, standardisedmetadata. However, the labour cost of metadata construction, data curation and annotation is high: it isa time-consuming and awkward process, which is undervalued and provides little reward for the skilleddata curators that undertake it [2, 3].

In an effort to drive-up the quality of curation at the point of data capture, various biological sciencecommunities have proposed minimum information models and community ontologies that should beused for structuring and describing these metadata. The models set community-wide expectationsfor the least information required to understand and interpret experimentally gathered data.Recommendations for the use of specific terminologies and ontologies clarify the information requiredand the choice of terminologies for annotation, and encourage widespread compliance [4].

*Correspondence to: Katherine Wolstencroft, School of Computer Science, University of Manchester, Oxford Road,Manchester, UK.†E-mail: [email protected]

Copyright © 2012 John Wiley & Sons, Ltd.

http://linkeddata.org/

468 K. WOLSTENCROFT ET AL.

Minimum Information about a Biological or Biomedical Investigation (MIBBI) [5] is an umbrellaorganisation describing minimum information models for over 30 different biological fields orexperimental technologies. The approach is a pragmatic one, with many of the models little more thanchecklists. Many MIBBI checklists recommend the use of specific terminologies and ontologies forannotation. These ontologies are published in the BioPortal [6], a community repository for biologicalontologies. Table I shows some minimum information models and corresponding ontologies commonlyused in the life sciences.

Data management specialists routinely use these MIBBI checklists and related terminologies to captureand standardise the structure of data for sharing or for deposition into public repositories. However, theuptake of such standards by experimental biologists is much less common. One significant reason forthis difference is the ease with which the metadata standards can be applied to data. The existence oftools that assist with the process can increase compliance significantly. For example, MinimumInformation about a Microarray Experiment (MIAME) was initially implemented as The MicroArrayGene Expression Markup Language (MAGE-ML), an (Extensible Markup Language) XML-basedschema. Users were required to submit data in this format to ArrayExpress (http://www.ebi.ac.uk/arrayexpress/) or GEO (Gene Expression Omnibus http://www.ncbi.nlm.nih.gov/geo/) in order topublish work relating to it. In practice, this was only feasible for laboratories with dedicatedbioinformatics support. The community developed MicroArray Gene Expression Tab (MAGE-TAB)[9], a tabular alternative that allows scientists to prepare MIAME compliant data from within MicrosoftExcel spreadsheets to lower the barrier of adoption of the standard. This more straightforward processrequires no prior knowledge of MAGE or XML and enables experimentalists to use the tools that theyalready use to record their data and are highly proficient experts in. MAGE-TAB demonstrates theimportance of providing tools and interfaces as well as metadata standards and schemas [10].

There is a similar situation in the development of the terminologies or ontologies that supply themetadata for these annotation tasks. For a good introduction to ontologies in biology, see [11]; here, webriefly set the context for the paper. Ontologies are used to define the entities and the relationshipsbetween those entities [12]. The labels used for the concepts describing the domain form a controlledvocabulary for that domain. Ontologies are now widely used in this way within biology; they providemetadata for describing the data. Ontologies come in a variety of forms, from simple trees orpolyhierarchies, through to highly axiomatised ontologies amenable to automated reasoning.

In biology, two languages for authoring ontologies are widely used: Web Ontology Language (OWL)and Open Biomedical Ontology (OBO). The OWL (http://www.w3.org/2007/OWL) is widely used forauthoring ontologies as a collection of axioms, each of which has a precise semantics that make suchontologies amenable to automated reasoners. OWL and its ontologies are hard to understand and use; thetools for developing OWL ontologies, such as Protégé (http://protege.stanford.edu) are best used in thehands of specialists. The OBO (http://www.geneontology.org/GO.format.obo-1_2.shtml) format is acommon format for biological ontologies. OBO is less complex than OWL and may therefore be easierto understand, but the large number of OBO ontologies available and the large numbers of terms withinthem, make the task of exploration difficult. Although ontologies play an important role in many fields,including biology, they are best used through tools that present them in a form compatible with everydayworking practices of users. For example, ontologies, irrespective of their level of axiomatisation, areusually used by biologists as taxonomies of vocabulary terms and they are best presented as such.

Our aim is to make the collection, validation and construction of accurate and precise metadata asimple task: without using special tools, without the need for our laboratory experimentalist to beexposed to rich and complex ontologies and with the tools and techniques they are already familiarwith. As the most common data collection and exchange file is the Microsoft Excel spreadsheet, weuse standard Excel spreadsheets.

• To enable ontology-enriched data collection. We have developed the RightField ontologyannotation and information management desktop application (http://www.rightfield.org.uk) [13].RightField adds constrained ontology term selection to Excel spreadsheets, addressing issuessuch as multiple ontology loading, display and browsing, multiple ontology formalisms, ontologyevolution and ontology provenance, off-line working, cross-platform support, Excel applicationlegacy, ontology selection depths and nesting, and format import–export.

Copyright © 2012 John Wiley & Sons, Ltd. Concurrency Computat.: Pract. Exper. 2013; 25:467–480DOI: 10.1002/cpe

http://www.ebi.ac.uk/arrayexpress/

http://www.ebi.ac.uk/arrayexpress/

http://www.ncbi.nlm.nih.gov/geo/

http://www.w3.org/2007/OWL

http://www.geneontology.org/GO.format.obo-1_2.shtml

http://www.rightfield.org.uk

TableI.Examples

ofstandardsandontologies

fordataannotatio

n.Other

ontologies

aremoregenerally

applicableandareused

across

thedomain.

For

exam

ple,Ontology

forBiomedicalInvestigations

(OBI)[7]canbe

used

todescribe

experimentalproperties,andtheGeneOntology[8]canbe

used

todescribe

functio

nalpropertiesof

genes

andgene

products.

Data

MIBBIModel

Ontologies

Microarray

MIA

ME:Minim

umInform

ationaboutaMicroarrayExperim

ent

MO:MGED

Ontology

Proteom

ics

MIA

PE:Minim

umInform

ationaboutaProteom

icsExperim

ent

MS:massspectrom

etry;MOD:Protein

Modificatio

n;SEP:sampleprocessing

andseparatio

nInteractionexperiments

MIM

IX:Minim

umInform

ationaboutaMolecular

InteractionExperim

ent

MI:protein–

proteininteraction

SystemsBiology

Models

MIRIA

M:Minim

umInform

ationRequiredIn

theAnnotationof

biochemical

Models

SBO:SystemsBiology

Ontology

SystemsBiology

Model

Sim

ulation

MIA

SE:Minim

umInform

ationAbout

aSim

ulationExperim

ent

KISAO:Kinetic

Sim

ulationAlgorith

mOntology

STEALTHY ANNOTATION OF EXPERIMENTAL BIOLOGY BY SPREADSHEETS 469



RightField and the spreadsheets it produces are already in use in a large Systems Biology programmeincorporating over 300 scientists from 120 institutes. It has significantly improved the ability forlaboratory experimentalists to self-curate their data from a familiar environment and a lower barrierof entry to the process of semantic annotation, helping to ensure the quality of data annotation andsupporting standards compliance. When data is required to comply with particular standards inorder to be published, scientists already have it preformatted. Although developed for SystemsBiology, the software is generic. For example, the Gas and Oil industry’s instrument data sheetsare spreadsheet-based and are in need of a tool such as Rightfield with the added capability of alocal dialect to core term mapping functions.

• To enable ontology content collection. We have developed the Populous ontology populationdesktop application (http://www.populous.org.uk) [14]. Populous extends RightField by addingsupport for the creation and population of ontology design patterns. Specifying good ontologydesign patterns in a language such as OWL can be difficult and requires knowledge of OWLsyntax, OWL semantics, knowledge representation and the ontology authoring tools. The taskof populating the design pattern however, requires only an understanding of the pattern andsufficient domain knowledge in order to populate it. The Populous extension to RightFieldprovides support for domain experts in populating design pattern templates using a familiarspreadsheet style interface. Once the template is populated, Populous uses the Ontology Pre-Processing Language (OPPL) [15, 16] to map the data in the columns to variables within thedesign pattern to generate axioms for the growing ontology. Populous has been successfully usedin the e-LICO (http://www.e-lico.eu) project for the construction of an ontology describing theKidney and Urinary Pathway (KUP) [17]. The KUP ontology (KUPO) is used to annotateexperimental data held in the KUP Knowledge-base (KUPKB). The challenge was to engagethe biologists in the ontology’s development, whilst shielding them from details of the ontologyitself. Experienced ontology developers defined the set of design patterns for KUPO andgenerated a set of templates for the biologists using RightField. The content for KUPO waspopulated entirely by biologists with little or no previous knowledge in ontology development.The community can now submit new experiments to the KUPKB using RightField-generatedspreadsheets that contain metadata coming from KUPO.

To enable information model design. We combine RightField and Populous. The overall structurefor the KUPO was developed by bio-ontologists, and Populous was used to populate the patternsimplied by the KUPO. Portions of the KUPO can then be used within a RightField spreadsheet tohelp biologists provide content to the KUPKB. As the biologists used the KUP-orientatedRightField data submission form, gaps in the KUPO will be spotted; Populous can be returnedto update the KUPO. The KUPO provides an information model for placing into RightField; asthe RightField templates are used, the information model can change. The Populous extensionto RighField then allows the same biologists to expand the ontology that forms the informationmodel for the RightField generated spreadsheets.

• RightField and Populous are so-called Data Ramps [18], tools that lower the barriers to using andmanipulating data. These tools are not Excel plugins; instead, they are Excel spreadsheetgenerators. The spreadsheets are plain Excel spreadsheets that use only the standard features ofExcel to produce drop-down lists of allowed cell values. In this way, the features previously out-lined are ensured. The tabulation and constraints within spreadsheets generated by RightFieldmeans that data collected using RightField templates is highly structured and can be exportedand stored in relational databases, XML or RDF. As all, this is achieved using only Excel featuresin an environment familiar to target users, these highly structured data are produced by stealth,avoiding the need to develop, install and train users with a bespoke tool. RightField and Populousare important components towards quality-assured Linked Data. Linked data is a method forexposing, sharing and connecting pieces of data, information and knowledge on the SemanticWeb using unique identifiers and common formats. These tools automatically produce semanti-cally annotated life science data that could be exported to the Linked Data cloud in a structuredformat such as RDF.


http://www.populous.org.uk

http://www.e-lico.eu


RightField and Populous are open source Java applications distributed under the Berkeley SoftwareDistribution (BSD) licence and available from http://www.rightfield.org.uk and http://www.populous.org.uk, respectively.

The paper is organised as follows. In Section 2, we present the motivation for RightField anddemonstrate how it is used for semantically enriched and validated data annotation against project-defined information model templates by the pan-European SysMO Systems Biology Consortium. InSection 3, we present the motivation for Populous and demonstrate how it is used to gather content andpopulate ontologies against pre-defined patterns by the European Union e-LICO Project. We alsopropose how RightField and Populous can be used to define and validate new minimum informationmodels. In Section 4, we reflect on our experiences of the tools in the field, including shortcomings, andset out our future plans with respect to Linked Data. Related work is referred to throughout the sections.

2. RIGHTFIELD: ONTOLOGY-ENRICHED DATA COLLECTION BY STEALTH

A classic example of the challenges of accurate yet onerous scientific data collection is demonstrated by theSysMO consortium (Systems Biology of Microorganisms http://www.sysmo.net). This consortium is a pan-European research initiative to record and describe the dynamic molecular processes occurring inmicroorganisms and to present these processes in the form of computerised mathematical models. Thediversity of experiments performed across the 13 multi-partner projects varies widely. Its scientists typicallyperform a mixture of high-throughput ’omics experiments, such as microarray analysis or proteomics, aswell as traditional molecular biology and enzyme reaction kinetics. Despite this diversity, the consortiumaims to pool its research capacities and know-how, sharing data, methods, models and results within theconsortium and with the rest of the Systems Biology community. It has developed, through a speciallyfunded SysMO-DB project [19], a suite of resources and a web-based platform, the SEEK [20], to providea data sharing environment for scientists within the consortium, a dissemination mechanism to shareresearch with the wider community and a gateway to other commonly used community resources. Themost common data collection and exchange format is the Microsoft Excel spreadsheet.

There is no additional curation resource outside the projects so data curation must be handled at source bythe scientists. The laboratory scientists in SysMO (some 300 drawn from over 120 institutes across Europe)have little experience in metadata management and semantics but understand the value of reusing data. Thetime they have to spend on data curation is limited, so they will only adopt community standards if there areclear advantages in doing so. The range ofMIBBImodels and accompanying ontologies they are expected tocomply with is wide, confusing and daunting; the metadata models are complicated and impractical,requiring familiarity with XML formats, ontology languages and formats, and the annotation of typicallyover 100 metadata elements. Often, multiple ontologies are needed for the annotation of a single file.

The SysMO-DB project has improved ‘curation at source’ by leveraging the widespread use of, andfamiliarity with, spreadsheets. Experimentalists collect data against standard Excel spreadsheettemplates based on variations of the MIBBI community models. Terms from ontologies are exposedas simple dropdown cell selection lists, shielding the scientist from understanding and browsing anyontologies and guiding them into only choosing valid terms.

To enable expert bioinformaticians to make these templates, we have developed the RightFieldontology annotation and information management desktop application for adding constrained ontologyterm selection to Excel spreadsheets. Thus, SEEK users are encouraged to upload spreadsheets that areRightField enabled.

RightField is an administrator’s tool. The workflow is a three-stage process.

1. An expert bioinformatician or data manager prepares data collection templates using standard Excel.These are imported into RightField. Ontologies are accessed and browsed through the BioPortal, arepository of biological ontologies (http://bioportal.bioontology.org/) or from their local file system(Figure 1(a)). The tool supports ontologies in OBO (http://www.geneontology.org/GO.format.obo-1_2.shtml), OWL and RDFS formats and simple RDF vocabularies. Ontology terms areselected and bound to the cells. Terms can be all the subclasses from a chosen class, all the individuals(instances), or a combination of both (Figure 1(b)). Individual cells or whole columns or rows can be


http://www.rightfield.org.uk



http://www.sysmo.net

http://bioportal.bioontology.org/



Figure 1. A screenshot of RightField showing a loaded spreadsheet template and the process of markingcells with ranges of ontology terms from the Microarray Gene Expression Data (MGED) ontology [21].


marked with the required ranges of ontology terms (Figure 1(c)). A spreadsheet can be annotated withterms from multiple ontologies. The full Internationalised Resource Identifiers (IRIs) and associatedlabels for these ontology elements are retained and discreetly stored within hidden sheets in the spread-sheet file. This affords a full provenance track concerning the origins and versions of the ontologyterms used in the annotation. The spreadsheets are saved in native Excel format.

2. A laboratory experimentalist opens the RightField enabled spreadsheet in standard Excel orOpenOffice (Figure 2). No plugins are required, and there are no macros, visual basic code orplatform-specific libraries needed for its use. No connection to any other service is required: thespreadsheet is self-contained as terms are embedded within its file. The spreadsheet behaves asusual; however, marked-up cells (Figure 2(a)) are restricted to selections from a simple drop-downlist of terms using the term labels (Figure 2(b)). The ontologies have been bridged and flattened intovalidated vocabulary pick-lists, so no in-depth knowledge of the ontologies or the reportingstandards are required. By selecting a term, the IRI of the ontology term is attached to the cell forthat data sheet. Once marked-up and saved, the spreadsheet data can link back to the originalontology. The default behaviour is to retain this information in a hidden state to shield the user fromcomplexities of the underlying semantics. However, a feature to toggle between term labels andidentifiers has subsequently been included at the request of the expert bioinformatician and thecurious experimentalist.

3. The marked up datasheets now have embedded annotation bound to the IRIs of ontology terms.When viewed using Excel or OpenOffice, these appear as labels (Figure 2(b)). However, forsemantically capable applications, these encapsulated ontology IRIs are available for extractionand processing alongside the data they annotate. This has the added benefit that the data is readyfor publishing into the Linked Data cloud.

Figure 3 describes the RightField architecture. RightField combines the use of the Apache POIlibrary to read and manipulate spreadsheets and the [22] OWL Application Programming Interfac


Figure 3. The RightField implementation architecture.

Figure 2. Excel spreadsheet using a Rightfield template.


(API) to read and process ontology files. In order to achieve the seamless integration with Excel,RightField augments an existing Microsoft Excel Workbook with the data as follows. For eachdistinct set of ontology terms selected by the informatician, RightField adds a ‘very hidden’worksheet to the workbook. Each worksheet records each term in the set as a tuple containing thehuman readable term name, its unique identifier (IRI) and the most specific ontology that ‘defines’the term. If the associated defining ontology was obtained from the internet or from the BioPortalontology repository, then RightField records the ontology document IRI, both for versioningpurposes, and so that the original ontology can automatically be reloaded if the workbook is re-opened in RightField at some point in the future. Having embedded each constraining term set as avery hidden worksheet, RightField then adds Excel Data Validation constraint information to eachof the marked up cells, or ranges of cells, in the user worksheets. The Data Validation functionality



used is taken directly from Excel. Using out-of-the-box functionality such as this is one of the reasonswhy RightField-enabled spreadsheets can be opened in Excel without the need for any special ExcelAdd-Ins or platform specific native code — both of which would decrease the robustness of thesolution. In essence, the Excel Data Validation functionality links the ranges of cells in the veryhidden worksheets to the human readable names for the ontology terms in the user worksheet.Given this markup ‘schema’, it is easy to map back from the values of cells that correspond tohuman readable term names to the IDs of the corresponding terms and the ontologies that define them.

During the design of the markup ‘schema’, the possibility of embedding complete ontologies intoExcel Workbooks was considered. The motivation for this was that the exact copy of the ontologythat defines the constraining ontology terms would then be preserved for future use and extraction,thus providing a robust provenance solution. With regard to providing such a solution, it is notdifficult to imagine how an ontology could be encoded into a grid that could then be written into anExcel worksheet. However, some bio-ontologies contain hundreds of thousands of axioms thatdefine hundreds of thousands of terms. Since older versions Excel, but still widely used, imposerather strict size limitations on the amount of data that can be stored in a worksheet, this solutionwas discounted at an early stage.

2.1. Virtue in simplicity

The simplicity of this approach—fixing and embedding terms within hidden sheets—is a significantfeature to match the operational requirements of our target users (experimentalists and informaticians).

• Ontology stability: The choices of terms are stable once the spreadsheet is established. This ensuresthat a series of experiments can be annotated with the same versions of the same ontologies. If anontology needs to be updated, the spreadsheet must be actively uploaded and modified withinRightField.

• Offline working: Linkage to the ontology server is only required at the time the template iscreated. This enables experimentalists to add or view data offline and applications to view or process(with limitations) the data without a connection to an ontology server.

• Cross platform support: Our experimentalists use Apple OS and Windows, often old versions ofExcel or OpenOffice and do not want plug-ins. Thus macros, visual basic code or platform-specificlibraries are out of bounds.

• Standard Tools: Several metadata collection efforts have sought to take advantage of the spreadsheetmetaphor by developing bespoke ‘spreadsheet like’ ‘data disposition tools’. However, our experi-mentalists wanted to use the mainstream, off the shelf tools they were already familiar with.

• Distribution: The self-contained and plain nature of the resulting datasheets means that they canbe stored, distributed and shared amongst scientists as normal, through email, shared file stores,USB sticks and so on.

• Ontology flattening: Life science ontologies come in several formats (OBO, OWL, RDFS, SKOS)and are often broad and deep trees. Some, such as the Gene Ontology, have tens of thousands ofterms. It is easy to become overwhelmed, especially as one data sheet requires multiple ontologies.By picking out the relevant terms for a particular context, we have merged and flattened thesecomplex structures. A side-effect is the exposure of shallow ontologies with many subclasses andinstances. This can result in long drop-down boxes, making the annotation process more difficultand prone to inaccuracies. Auto-complete functions for marked up cells can help relieve this problembut it actually highlights a wider issue with the structure of some of the ontologies. A communitystandard that specifies that users should select from over 50 terms, for example, may indicate thatthe ontology requires further development in that area. Defining more specific subclasses andsplitting any instances between them would reduce this problem much more effectively and wouldbenefit other users of the ontology.

2.2. RightField in practice

In the SysMO consortium, spreadsheets are prepared to conform to the ‘Just Enough Results Model’(JERM). The JERM is the SysMO-DB minimum information model. It describes what type



of experiment was performed, who performed it and what was measured. It defines and describesthe relationships between SysMO assets (i.e. different types of data, mathematical models andexperimental methods) and allows JERM representations to be mapped to MIBBI modelswhere available.

Experiments in SysMO are related to one another using the ISA-TAB format [23]. Investigations,Studies, Assays Tab (ISA-TAB) is a tabular representation for describing how experiments (Assays)are grouped into wider Studies and Investigations. It allows multiple ’omics investigations to belinked together and described and provides a format to submit data into existing databases, such asArrayExpress [24] and PRIDE [25]. This is a typical use for RightField. Research groups performingtranscriptomics, metabolomics and proteomics experiments can use RightField-enabled templates tocapture and therefore integrate the results from these multiple studies. The templates can ensure thatbiological entities in each data set (genes, metabolites and proteins etc.) are named consistently andthat each data set is compliant to the appropriate standard (e.g. MIAME, Minimum InformationAbout a Cellular Assay and Minimum Information about a Proteomics Experiment, respectively).

The combination of embedding ontology term selection and standardising the content and format ofspreadsheets provides a simple mechanism to ensure data is consistently annotated and compliant withcommunity standards, ‘by stealth’. The aim is to make it easier to use appropriate ontology terms thannot to do so. The result for the SysMO consortium is a growing corpus of experimental data that iseasier to compare and explore. RightField-enabled templates for transcriptomics data are currentlythe most widely used due to the more stringent reporting standards for transcriptomics publicationscompared with other areas. A collection of RightField-enabled templates developed for SysMO canbe found here: http://www.sysmo-db.org/rightfield/templates.

The next step for SysMO-DB is the exploitation of the emerging corpus of RightField annotateddata. By extracting and storing the data in RDF, based on specifications and mappings to theJERM ontology (http://bioportal.bioontology.org/ontologies/45133), we are able to provide furthermechanisms for searching across the content of spreadsheets and publishing SysMO data as OpenLinked Data [26]. Compliance to community ontologies means that the system can already use WebServices from the BioPortal for term lookup and visualisation.

2.3. Related work

The idea of using spreadsheets as a data collection mechanism is not novel. The developers of ISA-Infrastructure tools have produced a suite of tools to design and validate ISA compliant data [27].Ontology terms can also be added via an Ontology Lookup Service [28] and a BioPortal Plugin.The ISA Creator GUI has the look and feel of a spreadsheet but can only perform some of the samefunctions and is not a standard tool. Scientists must download and install the software and learn anew environment, and more of the complexities of the ontologies are exposed making it unsuitablefor experimentalists. Other ontology annotation tools in the life sciences include Phenote, whichassists with the annotation of biological phenotypes (http://phenote.org), and the PRIDE ProteomeHarvest Spreadsheet submission tool (http://www.ebi.ac.uk/pride/proteomeharvest/), which assistswith the annotation and submission of proteomics data to the PRIDE public repository. These arepowerful annotation tools for specific biological disciplines, and are not generically applicable.

3. POPULOUS: ONTOLOGY DEVELOPMENT BY STEALTH

The Populous tool [29] is an extension to RightField that changes the focus from data annotation toontology development.

Developing logic-based ontologies in a language such as OWL or OBO is a specialised task, butgathering the domain knowledge that forms the content of the ontology requires input from the domainspecialist [30]. Current ontology authoring tools, such as Protégé, require an understanding of OWLsemantics and ontology construction, as well as the need for a consistent application of axiom patternsfor related types of entities. These requirements suggest a separation of domain knowledge, andaxiomatization in constructing OWL ontologies is needed. The tabulation of entities and the attributes


http://bioportal.bioontology.org/ontologies/45133

http://http://phenote.org





that describe them is a common practice in developing OWL ontologies [31, 32]. Much of ontology’scontent can be authored as a set of axiom patterns or templates. Spreadsheets are commonly usedto tabulate the information that is used to populate these patterns. For example, the Ontology forBiomedical Investigation [33] describes many forms of assay used in an investigation; each assay isdescribed in terms of its analyte, evaluent and unit. Quick Term Templates [32] provide a form styletable into which contributors to the ontology can add values that will form the ontology’s content whencombined with the relationships in the pattern associated with the template. As the details for eachassay can be easily tabulated by a domain expert and the axioms for the ontology can be readilygenerated from this tabulated knowledge, a barrier to ontology development is overcome. Thisseparation allows the domain experts that know about the assays to describe them in terms of otherparts of OBI without writing complex OWL axioms and doing so in a consistent style across a largenumber of assay types. Using a spreadsheet-based approach enables domain experts to use familiartools to achieve the complex task of ontology building.

Populous is a tool for gathering content and populating ontologies. It builds on the RightField codebaseand makes use of the ontology loading and spreadsheet display functionality. Like RightField, it is alsobuilt on the premise that a tabular, template view is a natural way of gathering data. In this case, it isused to separate the processes of gathering content for the ontology from the process of ontologyconceptualisation. The structure and axioms from the ontology are shielded from the user, who onlysees the entities related to one another in the spreadsheet table. Consequently, Populous is most usefulwhen a repeating ontology design pattern emerges that needs to be populated en-mass.

Consider the OBO cell type ontology [34] that contains a basic design pattern stating that every celltype has a nucleation, where the nucleation is one of the anucleate, nucleate, bi-nucleate ormultinucleate. Populous can be used to create a simple template where column A captures the celltype and column B captures the nucleation value. RightField can be used to restrict the allowablevalues in column A to named cell types and column B to the set of valid nucleation terms. Thistemplate can either be exported or modified in Microsoft Excel, or users can choose to work directlyinside the Populous RightField extension. Using Populous directly offers some additional benefitsover standalone Excel, such as the ability to auto-complete and search for terms using wildcardstring matches from within a spreadsheet cell. Figure 4 shows a populated template as it appears inthe Populous extension to RightField. Data that matches terms in the validation list are highlightedin green, whereas incorrect or unknown terms are shown in red. The user is free to choose betweenwhether they want to work only in Excel documents or if they want to work on the same filesdirectly in Populous to benefit from some of the advance features Populous offers.

The template approach lowers the entry level for a user wishing to contribute to the ontology. Once thetemplate is populated, the information can be translated into an OWL representation using the OPPL thatcomes packaged with the Populous extension. OPPL allows ontology design patterns to be specified interms of OWL axioms that include variables for the differentia. Each row in the table represents asingle instantiation of the pattern. The Populous OPPL wizard guides the user through the instantiationof a particular design pattern by binding columns in the table to variables in the axiom. The output ofthis process is a new OWL ontology along with a reproducible workflow that details how the ontologywas built. Capturing knowledge using constrained templates improves the consistency with which thetemplates are populated. This reduces the amount of time required to validate the data and offers a wayto automate the workflow for ontology construction. Building ontologies in this way means we capturethe provenance of the ontology’s construction, such as how each axiom was constructed and whogenerated the content. This separation of content from the axiom patterns offers new possibilities forrefactoring of ontologies. Ontology developers can explore different conceptualisation from the samedata captured in the template. A linked data application may only require a simple taxonomy from thedata that could be represented in SKOS but if classification and reasoning were important, the designpatterns could be substituted to build a more expressive ontology that both form the same template.

3.1. Populous in practice

RightField and the Populous extension are being used in the e-LICO project to develop ontologiesrelating to the KUP. The KUPO describes the kidney’s cells, anatomy and diseases and is being


Figure 4. Populous, an ontology content population extension to RightField.


used to capture metadata about experiments relating to renal physiology and disease. The KUPO re-uses many existing ontologies to build a much more detailed model of the kidney. Several designpatterns for the KUPO were specified in templates built using RightField. Biologists with little or noprior knowledge about ontology development populated these templates using Populous. Thedesigned patterns were expressed in OPPL and used to generate the final KUPO. The templates,OPPL scripts and generated ontologies can be found at http://www.e-lico.eu/kupo.

Initial evaluations showed that RightField-enabled spreadsheets produced datasets with greaterconsistency of annotation and compliance with standards. It is possible to over-ride the term selectionand annotate cells with plain text, but users did not typically do so. Therefore, the annotation wasconsistent with the allowed terms from the ontology. If a suitable term could not be found, the defaultbehaviour was to leave the parent class as annotation, which also produces consistently annotated data.

For ontology, users are not able to contribute new term suggestions without entering a curation andreview process, so this behaviour is desirable, but for local ontologies (e.g. the SysMO-JERM and theKUPO), this mechanism could be used to identify and collect new terms by adding the Populousvalidation extension into RightField.

In Populous, if users add terms to marked-up cells that do not feature in the range of terms from thechosen ontology, Populous will highlight these. It could mean that these terms are errors, or it couldmean that the relevant ontology needs to be extended with new terms. In each case, the tool assistswith the curation process, identifying areas that require further investigation. In the development ofKUPO, the biologists identified several gaps in the existing anatomy and cell type ontologies. Over200 new terms can now be extracted from the populated template for candidate submission to therespective ontology development projects.

3.2. Related work

Protégé 4 ExcelImporter plugin and the MappingMaster plugin for Protégé 3 [31] generate OWLaxioms. This and RDF conversion tools such as Excel2RDF (http://www.mindswap.org/~rreck/excel2rdf.shtml) XLWrap [35] and RDF123 [36] focus on the transformation of spreadsheet contentbut pay little attention to how the data are collected in those spreadsheets. When generating


http://www.e-lico.eu/kupo

http://www.mindswap.org/~rreck/excel2rdf.shtml

http://www.mindswap.org/~rreck/excel2rdf.shtml


ontologies, in this way, a large portion of time is taken to ‘clean up’ and validate the populatedspreadsheets. RightField and the Populous extension enable the validation of input data at the sourceand bridge the gap between the population on transformation of these spreadsheets. The template towhich users of spreadsheets in this form have to comply is implicit, rather than being exposed as acollection of constrained entries to which the user complies. These constraints make it easier for theuser to find the terms they need for annotating data and make the task of transformation easier; all ina stealthy form by using the very tools that the users deal with in their everyday tasks.

4. DISCUSSION

Many experimental biologists have no interest or experience in the use of ontologies but understand thevalue of semantic annotation and quality metadata when attempting to reuse data from others. Forprojects such as SysMO and e-LICO, the overall aims of the project include sharing and dissemination.

The more functionality that can be embedded in tools already in use, the more likely that semanticannotation will be considered part of the data management process. There are already a plethora ofterminologies and information models available to support the consistent annotation of data, butproviding such schemas is only the first step. It should not be an extra task for the data providers todescribe their data in such a way that it can be interpreted and reused. This means that simplemethods for adding semantics need to be embedded in the data management process.

RightField and Populous are flexible and extensible curation tools that make semantic annotationeasier and more accessible to researchers: hence, annotation ‘by stealth’. They bridge the gapbetween informaticians and data producers as data can be annotated accurately and at source by thelaboratory scientists who can also contribute terms to ontologies when needed. The advantages arethe following:

• Reduction in errors and the time it takes to annotate the data to comply with community standards;• Greater understanding through choice restrictions of annotation terms to a small and manageable set;• Greater validation to ensure data annotation quality and provide more feedback during annotation; and• Community intelligence for information model and ontology developers to incrementally improve thecoverage and structure of the metadata standards they propose. The combination of the tools provides aconvenient platform for prototyping and piloting data collection standards by trying them out to collectdata, designing models through use. A similar approach was pioneered by the Pedro toolkit [37].

RightField and Populous use a simple, yet powerful and generic approach. The tools are already in usein large biological initiatives, but there is nothing in the applications that make them specific to thebiological domain. RightField is already under evaluation for use for datasets in the oil and gasindustry. Any domain with heterogeneous data that requires annotation to common schemas andontologies could use them.

Excel2RDF, XLWrap and RDF123 enable the generation of RDF statements from spreadsheet data.RightField and Populous lay the foundations for the structured extraction of data and knowledge andpresent the possibility of contributing to Open Linked Data. Ongoing research will focus onextracting spreadsheet data into RDF for storage and comparison to fully exploit the large corpus ofdata that is being generated in the SysMO consortium and in the wider systems biology community.

ACKNOWLEDGEMENTS

This work was funded by the UK’s BBSRC award BBG0102181 and EU/FP7/ICT-2007.4.4 e-LICO project.

REFERENCES

1. Howe D, Costanzo M, Fey P, Gojobori T, Hannick L, Hide W, Hill DP, Kania R, Schaeffer M, St Pierre S, TwiggerS, White O, Rhee SY. Big data: the future of biocuration. Nature 2008; 455:47–50.

2. Bateman A. Curators of the world unite: the International Society of Biocuration. Bioinformatics 2010; 26:991.3. St Pierre S, McQuilton P. Inside FlyBase: biocuration as a career. Fly (Austin) 2009;3:112–114.4. Sansone SA, Rocca-Serra P, Field D, Maguire E, Taylor C, Hofmann O, Fang H, Neumann S, Tong W, Amaral-

Zettler L, Begley K, Booth T, Bougueleret L, Burns G, Chapman B, Clark T, Coleman LA, Copeland J, Das S,



de Daruvar A, de Matos P, Dix I, Edmunds S, Evelo CT, Forster MJ, Gaudet P, Gilbert J, Goble C, Griffin JL, JacobD, Kleinjans J, Harland L, Haug K, Hermjakob H, Ho Sui SJ, Laederach A, Liang S, Marshall S, McGrath A, MerrillE, Reilly D, Roux M, Shamu CE, Shang CA, Steinbeck C, Trefethen A, Williams-Jones B, Wolstencroft K, XenariosI, Hide W. Toward interoperable bioscience data. Nature Genetics 2012; 44:121–126.

5. Taylor CF, Field D, Sansone SA, Aerts J, Apweiler R, Ashburner M, Ball CA, Binz PA, Bogue M, Booth T, Brazma A,Brinkman RR, Michael Clark A, Deutsch EW, Fiehn O, Fostel J, Ghazal P, Gibson F, Gray T, Grimes G, Hancock JM,Hardy NW, Hermjakob H, Julian RK Jr, Kane M, Kettner C, Kinsinger C, Kolker E, Kuiper M, Le Novère N, Leebens-Mack J, Lewis SE, Lord P, Mallon AM, Marthandan N, Masuya H, McNally R, Mehrle A, Morrison N, Orchard S,Quackenbush J, Reecy JM, Robertson DG, Rocca-Serra P, Rodriguez H, Rosenfelder H, Santoyo-Lopez J, ScheuermannRH, Schober D, Smith B, Snape J, Stoeckert CJ Jr, Tipton K, Sterk P, Untergasser A, Vandesompele J, Wiemann S.Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project.Nature Biotechnology 2008; 26:889–896.

6. Noy NF, Shah NH, Whetzel PL, Dai B, Dorf M, Griffith N, Jonquet C, Rubin DL, Storey MA, Chute CG, MusenMA. BioPortal: ontologies and integrated data resources at the click of a mouse. Nucleic Acids Research 2009;37:W170–173.

7. Brinkman RR, Courtot M, Derom D, Fostel JM, He Y, Lord P, Malone J, Parkinson H, Peters B, Rocca-Serra P,Ruttenberg A, Sansone SA, Soldatova LN, Stoeckert CJ Jr, Turner JA, Zheng J, OBI consortium. Modeling biomed-ical experimental processes with OBI. Journal of Biomedical Semantics 2010; 1(Suppl 1):S7.

8. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT,Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM,Sherlock G. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genetics2000; 25:25–29.

9. Rayner TF, Rocca-Serra P, Spellman PT, Causton HC, Farne A, Holloway E, Irizarry RA, Liu J, Maier DS, Miller M,Petersen K, Quackenbush J, Sherlock G, Stoeckert CJ Jr, White J, Whetzel PL, Wymore F, Parkinson H, Sarkans U,Ball CA, Brazma A. A simple spreadsheet-based, MIAME-supportive format for microarray data: MAGE-TAB. BMCBioinformatics 2006; 7:489.

10. Rayner TF, Rezwan FI, Lukk M, Bradley XZ, Farne A, Holloway E, Malone J, Williams E, Parkinson H. MAGE-Tabulator, a suite of tools to support the microarray data format MAGE-TAB. Bioinformatics 2009; 25:279–280.

11. Bodenreider O, Stevens R. Bio-ontologies: current trends and future directions. Briefings in Bioinformatics 2006;7:257–274.

12. Rector A. Modularisation of domain ontologies implemented in description logics and related formalisms includingowl. Proceedings of the 2nd International Conference on Knowledge Capture, Sanibel Island, USA, 2003.

13. Wolstencroft K, Owen S, Horridge M, Krebs O, Mueller W, Snoep JL, du Preez F, Goble C. RightField: embeddingontology annotation in spreadsheets. Bioinformatics 2011; 27:2021–2022.

14. Jupp S, Horridge M, Iannone L, Klein J, Owen S, Schanstra J, Wolstencroft K, Stevens R. Populous: a tool for buildingOWL ontologies from templates. BMC Bioinformatics 2012; 13(Suppl 1):S5.

15. Egana M, Rector A, Stevens R, Antezana A. Applying ontology design patterns in bio-ontologies. Knowledge Engi-neering: Practice and Patterns, Proceedings 2008; 5268:7–16.

16. Iannone L, Rector A, Stevens R. Embedding knowledge patterns into OWL. Semantic Web: Research and Applications2009; 5554:218–232.

17. Jupp S, Klein J, Schanstra J, Stevens R. Developing a kidney and urinary pathway knowledge base. Journal of BiomedicalSemantics 2011; 2(Suppl 2):S7.

18. Atkinson M, De Roure D, van Hemert J, Michaelides D. Shaping ramps for data-intensive research. UK e-Science AllHands Meeting, Cardiff, UK, 2010.

19. Müller W, Krebs O, Rojas I, Wolstencroft K, Owen S, Alexejevs S, Goble C, Snoep J. SysMO-DB: justenough exchange for systems biology data and models. Microsoft Research eScience Workshop, Pittsburgh,USA, 2009.

20. Wolstencroft K, Owen S, du Preez F, Krebs O, Mueller W, Goble C, Snoep JL. The SEEK: a platform for sharingdata and models in systems biology. Methods in Enzymology 2011; 500:629–655.

21. Whetzel PL, Parkinson H, Causton HC, Fan L, Fostel J, Fragoso G, Game L, Heiskanen M, Morrison N, Rocca-SerraP, Sansone SA, Taylor C, White J, Stoeckert CJ Jr. The MGED ontology: a resource for semantics-based descriptionof microarray experiments. Bioinformatics 2006; 22:866–873.

22. Horridge M, Bechhofer S. The OWL API: A Java API for OWL ontologies. Semantic Web 2010. DOI:10.1.1.163.7035

23. Sansone SA, Rocca-Serra P, Brandizi M, Brazma A, Field D, Fostel J, Garrow AG, Gilbert J, Goodsaid F, Hardy N,Jones P, Lister A, Miller M, Morrison N, Rayner T, Sklyar N, Taylor C, Tong W, Warner G, Wiemann S, Membersof the RSBI Working Group. The first RSBI (ISA-TAB) workshop: “can a simple format work for complex studies?”Omics 2008; 12:143–149.

24. Parkinson H, Sarkans U, Shojatalab M, Abeygunawardena N, Contrino S, Coulson R, Farne A, Lara GG, HollowayE, Kapushesky M, Lilja P, Mukherjee G, Oezcimen A, Rayner T, Rocca-Serra P, Sharma A, Brazma A, Sansone S.ArrayExpress--a public repository for microarray gene expression data at the EBI. Nucleic Acids Research 2005; 33:D553–555.

25. Vizcaino JA, Reisinger F, Côté R, Martens L. PRIDE: data submission and analysis. Current Protocols in ProteinScience 2010, Chapter 25: Unit 25 24.



26. Wolstencroft K, Owen S, Goble C, Nguyen Q, Krebs O, Müller W. RightField: semantic enrichment of systemsbiology data using spreadsheets. IEEE International Conference on e-Science, Chicago, USA, 2012.

27. Rocca-Serra P, BrandiziM,Maguire E, Sklyar N, Taylor C, Begley K, Field D, Harris S, HideW, Hofmann O, NeumannS, Sterk P, Tong W, Sansone SA. ISA software suite: supporting standards-compliant experimental annotation andenabling curation at the community level. Bioinformatics 2010; b26:2354–2356.

28. Cote R, Reisinger F, Martens L, Barsnes H, Vizcaino JA, Hermjakob H. The Ontology Lookup Service: bigger andbetter, Nucleic Acids Research 2010; 38:W155–160.

29. Jupp S, Horridge M, Iannone L, Klein J, Owen S, Schanstra J, Stevens R, Wolstencroft K. A tool for populatingontology templates. Semantic Web Applications and Tools for Life Sciences (SWAT4LS), Berlin, 2010.

30. Suarez-Figueroa MC, Gomez-Perez A. NeOn methodology for building ontology networks: a scenario-basedmethodology. Proceedings of the International Conference on SOFTWARE, SERVICES & SEMANTIC TECHNOLOGIES(S3T 2009), 2009.

31. O’Connor MJ, Halaschek-Wiener C, Musen MA. Mapping master: a flexible approach for mapping spreadsheets to OWL.Proceedings of the 9th International Semantic Web Conference on The Semantic Web, Shanghai, China, 2010; 194–208.

32. Rocca-Serra P, Ruttenberg A, O’Connor MJ, Whetzel T, Schober D, Greenbaum J, Courtot M, Sansone SA,Scheurmann R, Peters B. Overcoming the ontology enrichment bottleneck with quick term templates. InternationalConference on Biomedical Ontology, 2009.

33. BrazmaA,Hingamp P, Quackenbush J, SherlockG, Spellman P, Stoeckert C, Aach J, AnsorgeW,Ball CA, Causton HC,Gaasterland T, Glenisson P, Holstege FC, Kim IF, Markowitz V, Matese JC, Parkinson H, Robinson A, Sarkans U,Schulze-Kremer S, Stewart J, Taylor R, Vilo J, Vingron M. Minimum information about a microarray experiment(MIAME)-toward standards for microarray data. Nature Genetics 2001; 29:365–371.

34. Bard J, Rhee SY, Ashburner M. An ontology for cell types, Genome Biology 2005; 6:R21.35. Langegger A,WossW. XLWrap - querying and integrating arbitrary spreadsheets with SPARQL. Semantic Web - ISWC

2009 Proceedings 2009; 5823:359–374.36. Han LS, Finin T, Parr C, Sachs J, Joshi A. RDF123: from spreadsheets to RDF. Semantic Web - ISWC. Lecture Notes

in Computer Science 2008; 5318:451–466.37. Garwood K, McLaughlin T, Garwood C, Joens S, Morrison N, Taylor CF, Carroll K, Evans C, Whetton AD, Hart S,

Stead D, Yin Z, Brown AJ, Hesketh A, Chater K, Hansson L, Mewissen M, Ghazal P, Howard J, Lilley KS, GaskellSJ, Brass A, Hubbard SJ, Oliver SG, Paton NW. PEDRo: a database for storing, searching and disseminating experimen-tal proteomics data, BMC Genomics 2004; 5:68.


stealthy annotation of experimental biology by … · special issue paper stealthy annotation of...

Documents