Page 1: Shotton - Nature of Biomedical Research Data 2 Shotton.pdf · Characteristics of biomedical research data Bottom-up data flow, lacking central control Very large research community

e-mail: [email protected]

David Shotton

Getting the most out of data, Making the most of research

A Research Information Network workshop

Tuesday 5th December 2006, Royal Institute of Public Health, London

The nature of biomedical research data

Image BioInformatics Research Group, Oxford e-Research Centre and Department of Zoology, University of Oxford, UK

Outline

Characteristics of biological and medical research data

What data do we publish?

Structured research information and structured metadata

Ontologies in biology

Electronic laboratory notebooks and the FlyData Project

Biomedical data – 1: Universals

Bioinformatics in silico research

BioMOBY, myGrid and Taverna

Biomedical data – 2: Particulars

Data webs and ImageWeb

ImageBLAST

Digital data and the historical scientific record

Characteristics of biomedical research data

Bottom-up data flow, lacking central control

Very large research community with diverse research topics

Highly distributed research activities and publication structures

Research data heterogeneous and largely unstructured, often with little by way of semantic mark-up

An open world, where change is as ubiquitous as consensus is elusive

This stands in contrast to much of the chemistry and particle physics data being discussed in a parallel session

Chemistry: Unified data notation

Defined conventions

Particle physics: Relatively small unified community

Limited number of research objectives

Top-down hierarchical data flow from CERN

Ancient DNA

Remote sensing

Satellite imagery supplying information about land temperature and vegetation is being increasingly used to study a large range of biological topics:

Animal population studies

Epidemiology

Species migration in the face of climate change

Mr Grizzle at home with GPS logging device

Tracking Mr Grizzle by GPS

Credits: Tim Guilford, Department of Zoology, University of Oxford

Biro, D., Guilford, T., Dell’Omo, G. & Lipp, H-P. (2002) Journal of Experimental Biology 205: 3833-3844

Biro, D., Meade, J. & Guilford, T. (2004) Proceedings of the National Academy of Sciences 101: 17440-17443

Other biological observations are closer to earth . . .

Credits: Oxford 2006 M. Sc. in Biology students, on a field trip to Skomer Island

. . . or even underground

The Wood Wide Web

Studying network transport dynamics

Credits: Mark Fricker, Department of Plant Sciences, University of Oxford

Tool making and use in New Caledonian Crows

Introducing Betty the Crow

Credits: Alex Weir, Jackie Chappell and Alex Kacelnik, Department of Zoology, Oxford

Weir, A.A.S., Chappell, J., & Kacelnik, A. (2002). Shaping of hooks in New Caledonian crows. Science 297: 981. DOI 10.1126/science.1073433

Studies of gene expression

DNA chip: gene expression in Drosophila testis. Credits: Helen White-Cooper, Department of Zoology, University of Oxford

Immunofluorescence: gene expression in Drosophila embryo. Credits: James Langeland and Stephen Paddock, Howard Hughes Medical Institute, University of Wisconsin, Madison

Studies of ultrastructure

Electron micrograph: Thin section through the cytopharynx of a protozoan showing microtubule array.

Confocal optical sections of a pollen grain of Passiflora caerulea. Credits: Anna Smallcome, BioRad Microscience

Numbers from images

584 490 465 598 540 441 555 393 485 502 489 454 446 446 463 218 437 509 522 451 497 412 418 529 513 468 580 493 485 498 443 450 459 497 557 481 430 463 426 437 578 441 475

Microvillar areas: Mean = 477.16, Std. Dev. = 63.88
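
The summary statistics on this slide can be recomputed from the listed measurements. The short sketch below uses Python's standard library; the variable names are chosen here for illustration, and the choice of the sample standard deviation (n - 1 denominator) is an assumption that happens to match the quoted value.

```python
import statistics

# Microvillar area measurements transcribed from the slide.
areas = [
    584, 490, 465, 598, 540, 441, 555, 393, 485, 502, 489, 454, 446, 446,
    463, 218, 437, 509, 522, 451, 497, 412, 418, 529, 513, 468, 580, 493,
    485, 498, 443, 450, 459, 497, 557, 481, 430, 463, 426, 437, 578, 441,
    475,
]

mean_area = statistics.mean(areas)
# statistics.stdev computes the sample standard deviation (n - 1 denominator).
sd_area = statistics.stdev(areas)

print(f"n = {len(areas)}, mean = {mean_area:.2f}, std. dev. = {sd_area:.2f}")
```

Publishing the list of measurements alongside the two summary figures is what makes this kind of check possible.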

In summary

RAW DATA > INTERPRETATIONS

Sequences > Phylogenies

GPS coordinates > Homing trajectories

Observations and video recordings > Behavioural analysis

Scintillation imaging > Metabolite transport kinetics

Assay results > Gene expression measurements

Micrographs > Understanding of ultrastructure

Image analysis > Quantitation

It is frequently the interpretations, rather than the raw data, that are published in research papers

The nature of medical research data

In diversity, the data are very similar to those from biological research

However, data sharing is restricted and complicated by issues of patient confidentiality

Nevertheless, data sharing is essential for diagnosis, training and epidemiological studies

Establishing a cyberinfrastructure for storing, manipulating and sharing biomedical data and resources

Test beds focused on neuroimaging, linking animal and human imaging data

Human MRI and fMRI

Mouse models of neurological disease

“The Biomedical Informatics Research Network (BIRN) is a geographically distributed virtual community of shared resources offering tremendous potential to advance the diagnosis and treatment of disease.

“The BIRN is changing how biomedical scientists and clinical researchers make discoveries by enhancing communication and collaboration across research disciplines.”

While it has not been easy, by working hard together they have cracked the problems involved in sharing patient data in a secure manner

http://www.nbirn.net

Common Healthcare Educational Recordings Reusability Infrastructure (CHERRI)

http://www.cherri.mvm.ed.ac.uk

CHERRI has defined licensing conditions for research re-use of clinical data with informed consent

Their recommendations need adoption by the NHS and other stakeholders

A follow-on from the eDIAMOND Digital Mammography e-Science Project, which aimed to produce a federated database of patient mammograms for breast cancer diagnosis, training and research

The most difficult part of the eDIAMOND Project was not the technical aspects, but negotiating barriers to data sharing because of patient confidentiality considerations

This follow-on project was funded by ESRC to examine legal, IPR and confidentiality issues surrounding medical data in greater depth

http://www.oerc.ox.ac.uk/activities/projects/index.xml.ID=image

What data do we publish?

A scientific paper does not just report scientific observations

A scientific paper is an exercise in rhetoric designed to convince readers of the truth of particular scientific hypotheses or conclusions believed by the authors

The goal of the article is not to state facts, but rather to convince

Facts are selected to support the argument, and are embedded in a rhetorical structure with the purpose of conviction

Credits: Anita de Waard, Advanced Technology Group, Elsevier, Amsterdam, and Content and Knowledge Engineering Group, University of Utrecht. A Semantic Structure for Scientific Articles, April 2006

Papers are also written to show off the authors’ research in the best possible light, to encourage funding agencies to continue supporting the work, and to score well in the Research Assessment Exercise

“These observations support theories that defects of the muscle plasma membrane are important for dystrophic pathogenesis.”

“If, as we have argued, the caveolar patterning over regions of T-tubule invagination into the underlying sarcomeres is of functional significance, its loss in dystrophy both in chicken and in man suggests the presence of a fundamental defect that may be of importance in initiating the internal necrotic changes found in dystrophic fast muscle fibres.”

“These data extend recent studies by demonstrating that TCR interaction with variant peptide antigens can trigger target cell adhesion and surface exploration without activating the signaling pathway that results in cytotoxicity.”

“The facts indicate that this signalling pathway must be distinct from that leading to delivery of the cytotoxic lethal hit”

. . . but what about the original research data?

Findings that support hypotheses appear in research articles, but the bulk of the raw research data are never published

Historically, there has been no method for doing this

Journals had limited space

Other publication avenues were not available

Now, with the Web and with on-line journals, comes the possibility of journals publishing ‘supplementary information’ on-line

However, this facility is not widely used

Furthermore, such supplementary data are usually poorly structured, and are not discoverable by external search engines

Depositing data as supplementary information thus consigns them to data graveyards from which resurrection is difficult

The purpose of this talk is to examine how we might improve on this situation

The need to publish structured research information

Robert Muetzelfeldt (University of Edinburgh) recently wrote to the RIN:

“There is a problem with the word 'data'. For some people it covers everything from numeric tabular data to structured symbolically-represented information, while for others it tends to mean just the numeric kind.”

In the following discussion, I adopt Robert’s wider usage – whenever I say ‘data’, I mean all types of research information, and want them to be properly structured

For data to be maximally useful, they have to have the following characteristics:

Saved in machine-processable form in conformity to appropriate standards (XML, RDF, OWL)

Published and made freely accessible on the World Wide Web

Referenced by globally unique and resolvable identifiers (URIs, DOIs, LSIDs)

Accompanied by metadata based upon standard thesauri or ontologies

Including provenance information – by whom, when, where and particularly why were the data recorded, and who funded the research
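
As a rough illustration of these characteristics, the sketch below builds a small machine-processable record in XML using Python's standard library. The element names, the example.org identifier and all field values are hypothetical placeholders invented for this sketch; a real record would follow a community schema and take its terms from a standard thesaurus or ontology.

```python
import xml.etree.ElementTree as ET

def build_record(identifier, title, terms, provenance):
    """Build a small XML metadata record for one dataset."""
    record = ET.Element("dataset", attrib={"id": identifier})
    ET.SubElement(record, "title").text = title
    # Descriptive terms, assumed to come from a controlled vocabulary.
    for term in terms:
        ET.SubElement(record, "term").text = term
    # Provenance: by whom, when, where and why, and who funded the work.
    prov = ET.SubElement(record, "provenance")
    for key in ("who", "when", "where", "why", "funder"):
        ET.SubElement(prov, key).text = provenance[key]
    return record

record = build_record(
    identifier="http://example.org/data/track-001",  # hypothetical URI
    title="GPS track of a homing flight",
    terms=["GPS coordinates", "homing trajectory"],
    provenance={
        "who": "A. Researcher",               # placeholder values throughout
        "when": "2006-12-05",
        "where": "Example field site",
        "why": "Test of route learning",
        "funder": "Example funding body",
    },
)

xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

The point of the sketch is only that identifier, descriptive terms and provenance travel together with the data in a form that software can parse.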

Structured metadata

Structured metadata can take different forms:

A controlled vocabulary (word list with no internal structure)

A categorically organised thesaurus, a restricted vocabulary in which additional relationships between terms may be defined

A hierarchical taxonomy (a hierarchy of ‘parent-offspring’ is_a relationships, e.g. a crow is a bird, a bird is a vertebrate)

An ontology, in which such relationships take the form of a directed acyclic graph, in which an entry can have more than one ‘parent’, and in which more than one relationship type is allowed

A computable ontology, structured in such a manner as to permit computers to make semantic inferences and undertake logical reasoning over the data
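
The step from taxonomy to ontology can be sketched in a few lines of Python. The class below is an illustrative assumption, not an existing library: terms are stored in a directed graph, a term may have more than one parent, each edge carries a relationship type, and a simple traversal stands in for the kind of inference a computable ontology permits. The crow, bird and vertebrate terms come from the example above; "tool user" and "wing" are added here purely for illustration.

```python
from collections import defaultdict

class Ontology:
    """A tiny directed graph of terms with typed relationships."""

    def __init__(self):
        # term -> list of (relationship, parent) pairs
        self.parents = defaultdict(list)

    def add(self, child, relationship, parent):
        self.parents[child].append((relationship, parent))

    def ancestors(self, term, relationship="is_a"):
        """All terms reachable from `term` by following one relationship type."""
        found, stack = set(), [term]
        while stack:
            current = stack.pop()
            for rel, parent in self.parents.get(current, []):
                if rel == relationship and parent not in found:
                    found.add(parent)
                    stack.append(parent)
        return found

onto = Ontology()
onto.add("crow", "is_a", "bird")
onto.add("bird", "is_a", "vertebrate")
onto.add("crow", "is_a", "tool user")   # a second parent: no longer a strict hierarchy
onto.add("wing", "part_of", "bird")     # a second relationship type

# Inference by traversal: a crow is a vertebrate, although that was never stated directly.
print(sorted(onto.ancestors("crow")))
print(sorted(onto.ancestors("wing", relationship="part_of")))
```

A controlled vocabulary would be the same set of terms with no edges at all, and a taxonomy would allow only one is_a parent per term.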

Ontologies in biology

A helpful definition of an ontology has been given by Tom Gruber as

The formal explicit specification of a shared conceptualisation

The role of an ontology is thus to facilitate the formal sharing and re-use of knowledge through the construction of an explicit domain model

The most successful application of ontologies to bioinformatics has been the Gene Ontology (GO), first developed to annotate model organism databases and thus provide interoperability between them

GO has only two relationships: is_a (subsumption) and part_of (partonomy)

GO is composed of three orthogonal sub-ontologies:

Molecular function – capabilities of gene product

Cellular component – cellular location at which the gene product acts

Biological process – processes in which the gene product is involved

GO has achieved extensive take-up by the biomedical community, and is now widely used to annotate entries in a growing number of bioinformatics databases

GO was developed by Michael Ashburner in Cambridge, with others elsewhere

www.geneontology.org
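
How annotation with a shared ontology supports queries across databases can be sketched as follows. The term labels below are a toy fragment written for this example and are not copied from GO; the gene names are hypothetical. The sketch assumes that an annotation to a term also counts as an annotation to every ancestor of that term reached through is_a or part_of.

```python
# Toy term graph: child -> list of (relationship, parent).
# Labels are illustrative only and are not copied from the Gene Ontology.
PARENTS = {
    "nuclear envelope": [("part_of", "nucleus")],
    "nucleus": [("part_of", "cell")],
    "protein kinase activity": [("is_a", "kinase activity")],
    "kinase activity": [("is_a", "catalytic activity")],
}

# Hypothetical gene products annotated with terms from the toy graph.
ANNOTATIONS = {
    "geneX": ["nuclear envelope", "protein kinase activity"],
    "geneY": ["nucleus"],
    "geneZ": ["catalytic activity"],
}

def with_ancestors(term):
    """Return the term together with all of its is_a / part_of ancestors."""
    seen, stack = {term}, [term]
    while stack:
        for _relationship, parent in PARENTS.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def genes_annotated_to(term):
    """Genes annotated to `term` directly or through a more specific term."""
    return sorted(
        gene
        for gene, terms in ANNOTATIONS.items()
        if any(term in with_ancestors(t) for t in terms)
    )

print(genes_annotated_to("nucleus"))
print(genes_annotated_to("catalytic activity"))
```

Two databases that annotate their entries with the same terms can both be queried in this way without agreeing on anything else about their internal structure.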

The Ontogenesis Network

A Network of Excellence to foster the creation, ontogeny and evolution of biological, bioinformatics and medical ontologies

EPSRC funding for three years from November 2006

PIs: David Shotton and Robert Stevens (Computer Science, Manchester)

Regular meetings and workshops

Initial partners: FreshwaterLife; CancerGrid; European Bioinformatics Institute; CCLRC e-Science Group; Semantic Web Research Team, HP Research Labs; Asemantics Ltd; BioOntology Project; National e-Social Science Centre

Other interested parties welcome - please contact me [email protected]

Recording metadata for primary research data

Even with good ontologies, creating metadata is difficult and time-consuming, but absolutely necessary if Web-accessible data are to be useful

Metadata should ideally be created at the time the primary research information is obtained – retro-fitting only happens in exceptional circumstances

Thus we need to provide metadata creation tools that make this VERY EASY for the researcher

Even with such support, metadata creation will still involve effort

We must thus make it ADVANTAGEOUS for researchers to do this, by enhancing or facilitating their existing activities

decision support, report creation, data publication, paper writing, etc.

We must also catch as much metadata as possible automatically from lab instruments

e.g. digital microscopes record the date, magnification, lens and filter details in image headers
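
A minimal sketch of automatic capture is given below. It assumes, purely for illustration, that the instrument writes its settings as plain "key: value" lines; real image header formats differ between manufacturers, and the field names and values here are invented. The captured fields are kept separate from those the researcher types in, so that the origin of each value remains clear.

```python
def parse_header(header_text):
    """Parse 'key: value' lines from an instrument header into a dict."""
    fields = {}
    for line in header_text.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            fields[key.strip().lower()] = value.strip()
    return fields

def build_metadata(header_text, entered_fields):
    """Combine automatically captured fields with those entered by hand."""
    return {
        "captured": parse_header(header_text),   # from the instrument, no typing needed
        "entered": dict(entered_fields),         # supplied by the researcher
    }

# Hypothetical header written by a digital microscope.
header = """Date: 2006-12-05
Magnification: 63x
Lens: oil immersion
Filter: FITC"""

metadata = build_metadata(header, {"specimen": "Drosophila testis", "purpose": "in situ screen"})
print(metadata)
```

Only the two hand-entered fields cost the researcher any effort; everything else arrives with the image.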

Electronic laboratory notebooks

Commercial electronic lab notebook systems are widely used in the pharmaceutical industry, where drug licensing regulations require them

But they have found little uptake in academic labs

This is partly because of cost

ConturELN, one of the leading European systems, costs €12,500 for a perpetual licence for ten named people

An 80% academic discount brings the cost to €2,500, but this is still too much for most labs

An alternative – the FlyData Project

THE FLYDATA PROJECT: Decision support and semantic organization of laboratory data in Drosophila gene expression experiments

The FlyData Project has just been funded by the BBSRC

My colleague Helen White-Cooper is co-investigator, and she and her team will generate the primary research data

Its objective is to develop a simple but powerful computer-based information management system that will help us to

organise all the laboratory data arising from our Drosophila gene expression experiments in meaningful ways

relate them to on-line information, and

obtain different semantic views into this multi-dimensional information space, to support research decision-making processes

The FlyData System will also record our research decisions and their provenance, creating a complete record of our research ‘journey’ that will simplify subsequent report writing and data publication

Cell differentiation in the Drosophila testis

Gene expression patterns in fruit fly testes

These typical Drosophila testis in situ images show the sites of expression of particular genes at different stages of spermatogenesis

These images are the “end game” – the final stage of a long and complex process of experimental screening, decision-making and preparation

Reproducibility and interpretation require that the preparatory data and research decisions are recorded along with the images

aly

cyclinB

Mst87F

Workflow in Drosophila gene expression experiments #1

Workflow in Drosophila gene expression experiments #2

Workflow in Drosophila gene expression experiments #3

Summary

There is an awful lot of data to keep in mind when selecting which 1,500 genes to study from ~16,000 genes in the Drosophila genome

To facilitate laboratory data management and provide decision support, we will use

open standards

lightweight software tools

appropriately structured semantic metadata

carefully-designed graphical user interfaces

The solution will initially be bespoke; we hope later to make it generic

Biomedical data – 1: Universals

Biomedical results may represent ‘universal truths’, such as the sequence of a particular gene, or the 3D structure of a specific protein

These form bounded data sets

The data need only be discovered and recorded once, and would be the same whoever acquires them

Such information is typically published in the public domain. It is seen as fundamental research knowledge to which all should have free access

A note of caution, however (from the Towards 2020 Science report by Microsoft Research, March 2005)

“We believe that attempts to solve the issues of scientific data management by building large, centralised, archival repositories are both dangerous and unworkable.”

Universals: The scope of the problem

The number of databases has more than doubled in the last three years

Source: Michael Galperin (2006) Nucleic Acids Research 34: D3-D5

So many sites! So many formats!

[Chart: NAR Molecular Biology Database Collection, number of databases by year, 2003 to 2006]

Genomics databases

Secondary databases

Species-specific databases

The Bioinformatics data maze

Navigating through all this to find the information you need is currently not straightforward!

12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt
12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt
12541 ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg
12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga
12661 tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc
12721 atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa
12781 taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa

Typical bioinformatics in silico research

Searches are frequently repeated, both because data are rapidly added to public databases, and also because it is often quicker to repeat a search than find files!

Data resulting from one service are used as input to the next

Advantages:

Expert human intervention at every step

Quick and easy access to distributed Web resources

Disadvantages:

Labour intensive, time consuming, highly repetitive and error prone

Involves tacit procedures, so difficult to share both protocol and results

Much knowledge remains undocumented

Integrating sequence bioinformatics data

Data interpretation and integration are currently undertaken by skilled biologists and bioinformaticians, using their expert domain knowledge

Data transfer between applications is typically by ‘cut and paste’

e.g. a sequence selected from a genome database is pasted into the BLAST sequence comparison tool

There may be a little local subset selection, format modification or data transformation (e.g. from nucleotide to amino acid sequence)

Data integration? What data integration?

Separate resources are searched independently and sequentially

Information is downloaded as required

‘Data integration’ is usually limited to cutting and pasting into a Word document!
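
The data transformation mentioned above, from nucleotide to amino acid sequence, is the kind of small step that is easy to script rather than perform by hand. The sketch below uses the standard genetic code and reads a single forward frame from the first base; the function name and the use of "X" for unrecognised codons are choices made for this example, and any incomplete trailing codon is ignored.

```python
from itertools import product

# Standard genetic code, listed in the conventional TCAG order
# (first base, then second, then third); '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(nucleotides):
    """Translate a nucleotide sequence in one forward frame, from the first base."""
    sequence = nucleotides.upper().replace("U", "T")
    protein = []
    for start in range(0, len(sequence) - 2, 3):
        # Codons containing ambiguous or unexpected characters become 'X'.
        protein.append(CODON_TABLE.get(sequence[start:start + 3], "X"))
    return "".join(protein)

print(translate("atggcctaa"))
```

Scripting the step also leaves a record of exactly which transformation was applied, which a manual cut and paste does not.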

Data integration – what people actually want!

What the service provider wants

NOT to have to restructure all his relational data into an XML schema

NOT to have to change his publication methods or adopt some standard generic service interface

What the user wants

NOT a fully automated decision making system that spits out the “answer” at the end (even if this were possible)

Rather, an interactive decision support system, in which the knowledgeable opinionated expert user, who is highly sceptical that any computer can do things as well as she can, can check all the methodologies and intermediate results

Results presented in formats to which she has become accustomed

The ability to use legacy data, with new information available in legacy format so existing tools can use it

i.e. both service providers and consumers are highly resistant to change!

BioMOBY philosophy

Provide a single user interface to a variety of underlying bioinformatics services

Each service provider is described in the MOBY-S ‘Yellow Pages’ Central Registry

Each service has only a single input, a set of parameters (options that define what the service works on, or how it works), and a single output

e.g. performBlast (sequence, gap, etc.) → BlastReport

Each service is atomic, stateless and unrelated to other services

One can browse and select the services using a simple MOBY-Services browser

However, there is no facility for building complex workflows

http://www.biomoby.org/

BioMOBY has been developed by Mark Wilkinson of the University of British Columbia
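
The service model described on this slide can be sketched in Python. This is not the BioMOBY API: the registry, the type labels and the two toy services are assumptions made for illustration. Each service takes one input plus optional parameters, returns one output, keeps no state between calls, and is described in a central registry that can be browsed by input type.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass(frozen=True)
class Service:
    """Registry entry: one input type, one output type, one stateless function."""
    name: str
    input_type: str
    output_type: str
    run: Callable[..., object]

REGISTRY: Dict[str, Service] = {}

def register(service):
    REGISTRY[service.name] = service

def find_services(input_type):
    """Browse the registry for services that accept a given input type."""
    return sorted(s.name for s in REGISTRY.values() if s.input_type == input_type)

def reverse_complement(sequence):
    # Complement each base, then reverse the string.
    return sequence.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]

def gc_content(sequence, digits=2):
    # 'digits' is a parameter: it changes how the service works, not its input.
    sequence = sequence.upper()
    return round((sequence.count("G") + sequence.count("C")) / len(sequence), digits)

register(Service("reverseComplement", "DNASequence", "DNASequence", reverse_complement))
register(Service("gcContent", "DNASequence", "Fraction", gc_content))

print(find_services("DNASequence"))
print(REGISTRY["reverseComplement"].run("ATGC"))
print(REGISTRY["gcContent"].run("ATGC", digits=2))
```

Because every service is atomic and unrelated to the others, chaining them together is left to the user, which is the limitation noted above.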

myGrid permits data intensive in silico biology

Uses an Open architecture, Open domain services, Open community, Open application

Web Services based

Metadata and ontology driven

Manages the whole experiment lifecycle

Uses LSIDs, ontologies and Semantic Web technologies

http://www.mygrid.org.uk

myGrid has been developed by Carole Goble et al., in Manchester and elsewhere

The primary focus of myGrid is on facilitating workflows

For this, Taverna, an Open Source workflow GUI, has been developed

Real bioinformatics workflows are complex

[Workflow diagram with three stages, labelled A, B and C]

A: Identification of overlapping sequence

B: Characterisation of nucleotide sequence

C: Characterisation of protein sequence

Taverna lets you see intermediate results

Taverna has been developed by Tom Oinn at the EBI http://taverna.sf.net
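
The value of seeing intermediate results can be shown with a generic sketch. It does not use Taverna or any myGrid interface: the runner below is a hypothetical stand-in that chains simple functions, feeding the output of each step to the next and recording every step's input and output as it goes. The three toy steps are placeholders for real services.

```python
def run_workflow(steps, initial_input):
    """Run the steps in order, keeping every intermediate result."""
    trace = []
    data = initial_input
    for name, step in steps:
        result = step(data)
        # Record what went in and what came out of this step.
        trace.append({"step": name, "input": data, "output": result})
        data = result
    return data, trace

# Toy steps standing in for real services.
steps = [
    ("uppercase", str.upper),
    ("remove_whitespace", lambda text: "".join(text.split())),
    ("length", len),
]

final, trace = run_workflow(steps, "atg gcc\ntaa")

for entry in trace:
    print(f"{entry['step']}: {entry['input']!r} -> {entry['output']!r}")
```

The trace is what lets an expert user check each stage rather than being handed only the final answer, and it doubles as a simple provenance record.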

Where are we?

“The problem now is not connecting up and running the services.

“It’s managing and visualising all the data results, and the metadata and the provenance records and stuff.”

Carole Goble, EBI Hinxton, October 2004

iHOP – Information Hyperlinked Over Proteins

The World Wide Web has changed the way we access information

www.google.com “iHOP”

iHOP has been developed to exploit this – the work of Robert Hoffmann and Alfonso Valencia

iHOP is a text mining tool – “12 million pages but just 3 clicks away”

By employing gene and protein names as hyperlinks between sentences and abstracts, the information in PubMed abstracts becomes bound together as one navigable small-world network

Traversing interrelated sentences within this network is closer to human intuition than conventional keyword searches

Each move through the network produces just the information pertaining to one single gene and its associations
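
The hyperlinking idea can be sketched in a few lines of Python. This is not iHOP's implementation: the three sentences, the gene dictionary and the function names are invented for illustration, and real gene-name recognition is far harder than the exact word matching assumed here.

```python
from collections import defaultdict

# Hypothetical toy data: three sentences standing in for PubMed abstracts.
SENTENCES = [
    "SNF1 phosphorylates MIG1 in response to glucose limitation.",
    "MIG1 represses transcription of SUC2.",
    "SUC2 encodes invertase.",
]
GENES = {"SNF1", "MIG1", "SUC2"}  # assumed dictionary of gene names


def words_in(sentence):
    """Split a sentence into words, dropping surrounding punctuation."""
    return {word.strip(".,;()") for word in sentence.split()}


def build_index(sentences, genes):
    """Map each gene name to the indices of the sentences that mention it."""
    index = defaultdict(set)
    for i, sentence in enumerate(sentences):
        for gene in genes & words_in(sentence):
            index[gene].add(i)
    return index


def neighbours(gene, sentences, genes, index):
    """Genes mentioned alongside `gene`: the links one click away."""
    linked = set()
    for i in index[gene]:
        linked |= genes & words_in(sentences[i])
    linked.discard(gene)
    return linked


index = build_index(SENTENCES, GENES)
print(sorted(neighbours("MIG1", SENTENCES, GENES, index)))
```

Each call to `neighbours` returns only the genes associated with one gene, which mirrors the idea that each move through the network yields the information pertaining to a single gene.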

Biomedical data – 2: Particulars

Research data can also be ‘particulars’, rather than ‘universals’, for example individual assay results, microscopy images and wildlife photos:

These data form unbounded data sets

Data collection will never be complete

Such information is not (yet) widely available on line

It is NOT appropriate to submit such data to central global databases

The data are too heterogeneous

Such activities would not scale

Photographs of the unique stripe patterns on the rumps of different individual Grevy’s zebras (Equus grevyi).

(Credits: Alistair Nelson, Dept Zoology, Oxford)

Publication@source

With the advent of the Semantic Web, the possibility exists for centralized databases to give way to a new paradigm where everyone publishes their own research results

We are entering the age of distributed personal data publication

Most research data will in future not be submitted to centralized databases

Rather, data will be published locally by individuals, institutional repositories or journals, complete with semantically rich metadata

This method of publication is particularly appropriate for ‘particulars’

The challenge then becomes how best to provide cross-searchable access to these heterogeneous resources, so that the data do not become lost in isolated data silos

A potential solution, which I outlined during the first RIN Workshop last June, is the data web

Data integration – the lightweight data web approach

The data web is a novel concept for digital information storage and integration

The data are NOT submitted to a central database, but are simply published by the data providers on their own Web servers

Then, separately for each data web serving a particular knowledge domain, lightweight software tools are used to harvest, marshal and index metadata describing the distributed data into a central ontology-enabled data web registry

By requiring only core metadata, conforming to a specific minimalist data web ontology, each data web overcomes the problems caused by syntactic and semantic differences in data presentation between providers, and makes collating selected information from multiple web sites possible for machines

Data webs represent a step towards Berners-Lee’s vision of the World Wide Web as a ‘Web of Data’

Data webs are particularly suitable for data that represent research ‘particulars’
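
The harvesting step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the data web software itself: the core field names, the two providers, their local field names and the example URLs are all hypothetical.

```python
# Core fields required by a hypothetical minimalist data web ontology.
CORE_FIELDS = ("identifier", "title", "creator", "url")

# Two hypothetical providers describing their data with different field names.
PROVIDER_A = [{"id": "a1", "name": "Zebra rump photograph",
               "author": "Researcher A", "link": "http://example.org/a/1"}]
PROVIDER_B = [{"dc_identifier": "b7", "dc_title": "Electron density map",
               "dc_creator": "Researcher B", "href": "http://example.org/b/7",
               "film_type": "positive print"}]  # extra local field, not harvested

# Per-provider mappings from local field names to the core fields.
MAPPINGS = {
    "A": {"id": "identifier", "name": "title",
          "author": "creator", "link": "url"},
    "B": {"dc_identifier": "identifier", "dc_title": "title",
          "dc_creator": "creator", "href": "url"},
}


def to_core(record, mapping):
    """Keep only the core metadata, renamed to the shared vocabulary."""
    core = {mapping[key]: value for key, value in record.items() if key in mapping}
    missing = [field for field in CORE_FIELDS if field not in core]
    if missing:
        raise ValueError(f"record lacks core fields: {missing}")
    return core


def harvest(sources):
    """Build the registry from core metadata; the data stay with the providers."""
    registry = []
    for name, records in sources.items():
        registry.extend(to_core(record, MAPPINGS[name]) for record in records)
    return registry


registry = harvest({"A": PROVIDER_A, "B": PROVIDER_B})
for entry in registry:
    print(entry["identifier"], entry["url"])
```

Because only the core fields are required, a provider's extra local fields (such as `film_type` here) neither break the harvest nor need to be agreed with other providers.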

Role of the central metadata registry

The data web registry acts first as a data marshal, gathering, ordering and integrating the metadata from across the web into a single searchable RDF graph

It then provides an integrated cross-searchable access point to all the data in the data web, with both human and programmatic access

The data web registry adds value by providing interoperability and customizable search interfaces, with a rigorous semantic underpinning

The primary data holders benefit by increased user traffic to their sites, while at the same time being able to maintain normal copyright and access control

The primary metadata are never controlled by the data web registry, but are freely available on the Web for use by other presently unforeseen applications, including novel data mining, integration and analysis services
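
As a rough illustration of a single searchable graph, the sketch below stores harvested statements as subject, predicate, object triples and answers a query by pattern matching. It assumes plain Python tuples in place of a real RDF toolkit, and every identifier and predicate name is invented.

```python
# Each statement is a (subject, predicate, object) triple, as in RDF.
# All identifiers below are hypothetical.
GRAPH = {
    ("http://example.org/a/1", "depicts", "Equus grevyi"),
    ("http://example.org/a/1", "heldBy", "Repository A"),
    ("http://example.org/b/7", "depicts", "elastase"),
    ("http://example.org/b/7", "heldBy", "Publisher B"),
    ("http://example.org/c/3", "depicts", "Equus grevyi"),
    ("http://example.org/c/3", "heldBy", "Publisher B"),
}


def match(graph, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        triple for triple in graph
        if (s is None or triple[0] == s)
        and (p is None or triple[1] == p)
        and (o is None or triple[2] == o)
    )


# One query spans metadata harvested from several providers.
zebra_images = [s for s, _, _ in match(GRAPH, p="depicts", o="Equus grevyi")]
print(zebra_images)
```

The query returns pointers to images held by two different providers, while the images themselves remain on the providers' own servers.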

Data web advantages

A data web of this type will have all the advantages of the World Wide Web itself:

Distributed data

Freedom, decentralization and low cost of publication

Lack of centralized control

Built-in evolvability and scalability

Data webs are Open – Open – Open – Open

Support for open access data

Use of open source software components

Open standards for software and metadata development

An Open World (“missing isn’t broken”) data philosophy

How does a data web differ from Google?

A data web will provide for the selected data what Web search engines such as Google do for conventional Web pages, but with the following advantages:

It permits database information hidden in the Deep Web to be accessed

It involves specific targeting to a particular knowledge domain, thus achieving a significantly higher signal-to-noise ratio

It provides integration of information with ontological underpinning, semantic coherence, and truth maintenance

It permits programmatic access, enabling further services to be built on top of one or more data webs

The ImageWeb Project

The first data web we aim to create is for published biomedical research images

We aim to integrate and make cross-searchable research images held by publishers, institutional repositories, and other curators of research images

We will require minimum effort on the part of the publishers and repository managers, who can use their existing relational databases, XML metadata schemas or RSS feeds

ImageWeb will enable access to published images to become a more integral part of day-to-day research

The JISC have funded an initial requirements analysis project for such data web integration across university repositories, as part of their Repositories and Preservation Programme:

“Defining Image Access: Requirements for interoperable discovery and delivery of image data stored in DSpace, EPrints and Fedora-based institutional repositories using a data web approach”

The ImageWeb Consortium

My Image BioInformatics Research Group, University of Oxford

Leading commercial publishers: Nature, OUP, Blackwell, Elsevier

Leading Open Access publishers: PLoS and BioMed Central

University institutional repositories at Cambridge, Imperial College, Oxford and Southampton

Professional biologists and academic biological image collections: NHM

Other stakeholders: British Library, CCLRC, UKOLN, ILRT, CrossRef, SPARC

I welcome the involvement of other partners who hold image collections that they wish to be used on the Web – please contact me <[email protected]>

Our next consortium meeting will be at OUP in Oxford on January 4th 2007

What users really want !

I originally imagined that ImageWeb users would directly access the central ImageWeb Registry with their queries, and from there be led to relevant images

However, I now believe that it might be even more useful for a user to be able to click on an image within an online paper she is reading, and have semantically related images from other sources presented as a ranked list

This service would resemble the basic bioinformatics BLAST service for finding related biological sequences (http://www.ncbi.nlm.nih.gov/BLAST/)

This secondary ‘ImageBLAST’ service, built on top of an ImageWeb registry, would not locate images that resemble the reference image in terms of visual appearance, but in terms of being about the same thing

e.g. the same gene expressed in a different organism

or the same concept demonstrated in a different system
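
One simple way to rank images by what they are about, rather than by how they look, is to compare their annotation terms. The sketch below is only an assumption about how such a service might work: it uses Jaccard overlap between flat sets of terms, whereas a real ImageBLAST would presumably exploit ontology relationships. The registry contents and function names are invented.

```python
def jaccard(a, b):
    """Overlap between two sets of annotation terms, from 0.0 to 1.0."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def related_images(reference_terms, registry, limit=10):
    """Rank registry images by shared annotation terms, best match first."""
    scored = [(jaccard(reference_terms, terms), image_id)
              for image_id, terms in registry.items()]
    scored = [(score, image_id) for score, image_id in scored if score > 0]
    # Highest score first; ties are broken by image identifier.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [image_id for _, image_id in scored[:limit]]


# Hypothetical registry: image identifier -> set of annotation terms.
REGISTRY = {
    "img-1": {"EGFP", "hepatocyte", "mouse"},
    "img-2": {"EGFP", "keratinocyte", "mouse"},
    "img-3": {"elastase", "X-ray diffraction"},
    "img-4": {"EGFP", "hepatocyte", "zebrafish"},
}

reference = {"EGFP", "hepatocyte", "mouse"}
print(related_images(reference, REGISTRY))
```

Here `img-4` is returned even though it comes from a different organism, because it shares the gene product and cell type with the reference image, while `img-3` shares no terms and is omitted.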

An example

Fig. 2. (A and B) Immunohistochemical staining for EGFP on livers of (A) Z/EG x Cre-into-Cre and (B) Z/EG-into-Cre transplants. (C) Immunofluorescence staining with cytokeratin (green) and Y chromosome FISH (red) in the same Z/EG-into-Cre transplant, showing the presence of a donor-derived Y-positive hepatocyte (arrow). (D and E) Immunofluorescence staining of (D) untransplanted positive control (Z/EG x Cre F1) and (E) experimental (Z/EG into Cre) epidermal sections with antibodies against EGFP (green) and cytokeratin AE1/AE3 (red). (F) Immunofluorescence staining with cytokeratin AE1/AE3 (red) and Y chromosome FISH (green), showing the presence of a donor-derived Y-positive keratinocyte (arrow) in the epidermis of a Z/EG-into-Cre transplant recipient.

Related images

The data behind the structure of elastase

While the atomic coordinates of the elastase molecule that I determined with Lindsay Sawyer in 1976 are faithfully preserved in PDB – our submission was one of the first in this fledgling database – what should now be done with the original experimental data?

A fraction of the raw elastase structural data

Precession photograph of tosyl-elastase X-ray diffraction intensities at 3.5Å resolution

Original positive print of a single section of the 2.5Å resolution elastase electron density map from which the first elastase atomic model was built

Digital data and the historical scientific record

With modern digital technology, it is now technically possible to store as much research data as we wish. But how much is enough? When is it right not to save data? And what about legacy data?

Clearly, not every electron micrograph ever recorded is worth saving. However, the era of electron microscopy is past, with few domain experts left

Many of the pioneers of cell biology, who discovered the ultrastructure of the cell in the 1960s and 1970s, are now retired or close to retirement

The community is in danger of losing all their original data unless funding is provided within the next five years for a digital preservation programme

A good rule of thumb is that for every 1000 EM images taken, 100 will be good, 10 will be superb, and 1 or 2 will have been published in a scientific paper

While we should be happy to discard the 900 poor negatives, what we should do with the 98 unpublished good images is a pressing question

These and countless other types of analogue data constitute our scientific cultural heritage, and yet will almost certainly be lost if nothing is done soon

The cost of having to repeat these observations would far outweigh the cost of preserving the original data

Conclusions

Biomedical data are bottom-up, diverse, highly distributed, heterogeneous and largely unstructured. Let the diversity flourish – it is a fool’s mission to try to impose conformity!

We only publish selected data or interpretations in journal articles, to back hypotheses

Most original research data remain unpublished – a shame in the digital age

Creating ontologies and metadata is hard work, and requires better tools

‘Universals’ are well served by a growing number of sequence and structural bioinformatics databases. In silico research is enabled by tools such as myGrid

‘Particulars’ are best handled by distributed publication, with data web structures to create searchable registries, on top of which can be built secondary services such as ImageBLAST

Legacy analogue biomedical data are in danger of being lost for ever unless steps are taken to preserve them

The end