TRANSCRIPT
e-mail: [email protected]
David Shotton
Getting the most out of data,Making the most of research
A Research Information Network workshop
Tuesday 5th December 2006Royal institute of Public Health, London
The nature of biomedical research data
Image BioInformatics Research GroupOxford e-Research Centre andDepartment of ZoologyUniversity of Oxford, UK
Outline
Characteristics of biological and medical research data
What data do we publish?
Structured research information and structured metadata
Ontologies in biology
Electronic laboratory notebooks and the FlyData Project
Biomedical data – 1: Universals
Bioinformatics in silico research
BioMoby, myGrid and Taverna
Biomedical data – 2: Particulars
Data webs and ImageWeb
ImageBLAST
Digital data and the historical scientific record
Characteristics of biomedical research data
Bottom-up data flow, lacking central control
Very large research community with diverse research topics
Highly distributed research activities and publication structures
Research data heterogeneous and largely unstructured, often with little by way of semantic mark-up
An open world, where change is as ubiquitous as consensus is elusive
This stands in contrast to much chemistry and particle physics data being discussed in a parallel session
Chemistry: Unified data notation
Defined conventions
Particle physics: Relatively small unified community
Limited number of research objectives
Top-down hierarchical data flow from CERN
Ancient DNA
Remote sensing
Satellite imagery supplying information about land temperature and vegetation is being increasingly used to study a large range of biological topics:
Animal population studies
Epidemiology
Species migration in the face of climate change
Mr Grizzle at home with GPS logging device
Tracking Mr Grizzle by GPS
Credits: Tim Guilford, Department of Zoology, University of Oxford
Biro, D., Guilford, T., Dell’Omo, G. & Lipp, H-P. (2002) Journal of Experimental Biology 205: 3833-3844
Biro, D., Meade, J. & Guilford, T. (2004) Proceedings of the National Academy of Sciences 101: 17440-17443
Other biological observations are closer to earth . . .
Credits: Oxford 2006 M. Sc. in Biology students, on a field trip to Skomer Island
. . . or even underground
The Wood Wide Web
Studying network transport dynamics
Credits: Mark Fricker, Department of Plant Sciences, University of Oxford
Tool making and use in New Caledonian Crows
Introducing Betty the Crow
Credits: Alex Weir, Jackie Chappell and Alex Kacelnik, Department of Zoology, Oxford
Weir, A.A.S., Chappell, J., & Kacelnik, A. (2002). Shaping of hooks in New Caledonian crows. Science 297: 981. DOI 10.1126/science.1073433
Studies of gene expression
DNA chip: gene expression in Drosophila testis. Credits: Helen White-Cooper, Department of Zoology, University of Oxford
Immunofluorescence: gene expression in Drosophila embryo. Credits: James Langeland and Stephen Paddock, Howard Hughes Medical Institute, University of Wisconsin, Madison
Studies of ultrastructure
Electron micrograph: Thin section through the cytopharynx of a protozoan showing microtubule array.
Confocal optical sections of a pollen grain of Passiflora coerulea. Credits: Anna Smallcome, BioRad Microscience
Numbers from images
584 490 465 598 540 441 555 393 485 502 489 454 446 446 463 218 437 509 522 451 497 412 418 529 513 468 580 493 485 498 443 450 459 497 557 481 430 463 426 437 578 441 475
Microvillar areas: Mean = 477.16, Std. Dev. = 63.88
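These summary figures can be reproduced directly from the raw measurements; a minimal sketch using only the Python standard library, assuming the values above are the complete data set:

```python
# Reproduce the summary statistics quoted above from the raw
# microvillar area measurements.
from statistics import mean, stdev

areas = [584, 490, 465, 598, 540, 441, 555, 393, 485, 502, 489,
         454, 446, 446, 463, 218, 437, 509, 522, 451, 497, 412,
         418, 529, 513, 468, 580, 493, 485, 498, 443, 450, 459,
         497, 557, 481, 430, 463, 426, 437, 578, 441, 475]

print(f"Mean = {mean(areas):.2f}")        # 477.16
print(f"Std. Dev. = {stdev(areas):.2f}")  # 63.88 (sample standard deviation)
```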
In summary
RAW DATA → INTERPRETATIONS
Sequences → Phylogenies
GPS coordinates → Homing trajectories
Observations and video recordings → Behavioural analysis
Scintillation imaging → Metabolite transport kinetics
Assay results → Gene expression measurements
Micrographs → Understanding of ultrastructure
Image analysis → Quantitation
It is frequently the interpretations, rather than the raw data, that are published in research papers
The nature of medical research data
In diversity, the data are very similar to those from biological research
However, data sharing is restricted and complicated by issues of patient confidentiality
Nevertheless, data sharing is essential for diagnosis, training and epidemiological studies
Establishing a cyberinfrastructure for storing, manipulating and sharing biomedical data and resources
Test beds focused on neuroimaging, linking animal and human imaging data
Human MRI and fMRI
Mouse models of neurological disease
"The Biomedical Informatics Research Network (BIRN) is a geographically distributed virtual community of shared resources offering tremendous potential to advance the diagnosis and treatment of disease.
“The BIRN is changing how biomedical scientists and clinical researchers make discoveries by enhancing communication and collaboration across research disciplines."
While it has not been easy, by working hard together the BIRN partners have cracked the problems involved in sharing patient data in a secure manner
http://www.nbirn.net
Common Healthcare Educational Recordings Reusability Infrastructure (CHERRI)
http://www.cherri.mvm.ed.ac.uk
CHERRI has defined licensing conditions for research re-use of clinical data with informed consent
Their recommendations need adoption by the NHS and other stakeholders
A follow-on from the eDIAMOND Digital Mammography e-Science Project, which aimed to produce a federated database of patient mammograms for breast cancer diagnosis, training and research
The most difficult part of the eDIAMOND Project was not the technical aspects, but negotiating barriers to data sharing because of patient confidentiality considerations
This follow-on project was funded by ESRC to examine legal, IPR and confidentiality issues surrounding medical data in greater depth
http://www.oerc.ox.ac.uk/activities/projects/index.xml.ID=image
What data do we publish?
A scientific paper does not just report scientific observations
A scientific paper is an exercise in rhetoric designed to convince readers of the truth of particular scientific hypotheses or conclusions believed by the authors
The goal of the article is not to state facts, but rather to convince
Facts are selected to support the argument, and are embedded in a rhetorical structure with the purpose of conviction
Credits: Anita de Waard, Advanced Technology Group, Elsevier, Amsterdam, and Content and Knowledge Engineering Group, University of Utrecht: A Semantic Structure for Scientific Articles, April 2006
Papers are also written to show off the authors’ research in the best possible light, to encourage funding agencies to continue supporting the work, and to score well in the research assessment exercise
“These observations support theories that defects of the muscle plasma membrane are important for dystrophic pathogenesis.”
“If, as we have argued, the caveolar patterning over regions of T-tubule invagination into the underlying sarcomeres is of functional significance, its loss in dystrophy both in chicken and in man suggests the presence of a fundamental defect that may be of importance in initiating the internal necrotic changes found in dystrophic fast muscle fibres.”
“These data extend recent studies by demonstrating that TCR interaction with variant peptide antigens can trigger target cell adhesion and surface exploration without activating the signaling pathway that results in cytotoxicity.”
“These facts indicate that this signalling pathway must be distinct from that leading to delivery of the cytotoxic lethal hit.”
. . . but what about the original research data?
Findings that support hypotheses appear in research articles, but the bulk of the raw research data are never published
Historically, there has been no method for doing this
Journals had limited space
Other publication avenues were not available
Now, with the Web and with on-line journals, comes the possibility of journals publishing ‘supplementary information’ on-line
However, this facility is not widely used
Furthermore, such supplementary data are usually poorly structured, and are not discoverable by external search engines
Depositing data as supplementary information thus consigns them to data graveyards from which resurrection is difficult
The purpose of this talk is to examine how we might improve on this situation
The need to publish structured research information
Robert Muetzelfeldt (University of Edinburgh) recently wrote to the RIN:
“There is a problem with the word 'data'. For some people it covers everything from numeric tabular data to structured symbolically-represented information, while for others it tends to mean just the numeric kind.”
In the following discussion, I adopt Robert’s wider usage – whenever I say ‘data’, I mean all types of research information, and want them to be properly structured
For data to be maximally useful, they must have the following characteristics (a sketch of such metadata follows this list):
Saved in machine-processable form in conformity to appropriate standards (XML, RDF, OWL)
Published and made freely accessible on the World Wide Web
Referenced by globally unique and resolvable identifiers (URIs, DOIs, LSIDs)
Accompanied by metadata based upon standard thesauri or ontologies
Including provenance information – by whom, when, where and particularly why were the data recorded, and who funded the research
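As a concrete illustration of these characteristics, here is a minimal sketch of provenance-rich metadata expressed in RDF using the rdflib library; the identifier, names and values are invented for illustration:

```python
# A minimal sketch of machine-processable metadata with provenance,
# using RDF and Dublin Core terms (all identifiers and values below
# are invented for illustration).
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCTERMS

g = Graph()
g.bind("dcterms", DCTERMS)

# A globally unique, resolvable identifier for the dataset (hypothetical DOI)
dataset = URIRef("https://doi.org/10.0000/example-dataset")

g.add((dataset, DCTERMS.creator, Literal("Jane Researcher")))   # by whom
g.add((dataset, DCTERMS.created, Literal("2006-12-05")))        # when
g.add((dataset, DCTERMS.spatial, Literal("Oxford, UK")))        # where
g.add((dataset, DCTERMS.description,
       Literal("Recorded to test gene expression during spermatogenesis")))  # why
g.add((dataset, DCTERMS.contributor, Literal("BBSRC (funder)")))  # who funded

print(g.serialize(format="turtle"))
```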
Structured metadata
Structured metadata can take different forms:
A controlled vocabulary (word list with no internal structure)
A categorically organised thesaurus, a restricted vocabulary in which additional relationships between terms may be defined
A hierarchical taxonomy (a hierarchy of ‘parent-offspring’ is_a relationships, e.g. a crow is a bird, a bird is a vertebrate)
An ontology, in which such relationships take the form of a directed acyclic graph, in which an entry can have more than one ‘parent’, and in which more than one relationship type is allowed (sketched in code after this list)
A computable ontology, structured in such a manner as to permit computers to make semantic inferences and undertake logical reasoning over the data
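To make the taxonomy/ontology distinction concrete, here is the sketch referred to above: a toy ontology in which a term may have more than one parent and more than one relationship type, with a simple traversal over the resulting directed acyclic graph (terms and relations are invented for illustration):

```python
# A toy ontology: each term may have several parents, and more than
# one relationship type is allowed, so edges form a directed acyclic
# graph rather than a simple tree.
ontology = {
    # term: [(relation, target), ...]
    "crow":       [("is_a", "bird"), ("is_a", "tool_user")],  # two parents
    "bird":       [("is_a", "vertebrate")],
    "wing":       [("part_of", "bird")],                      # second relation type
    "vertebrate": [],
    "tool_user":  [],
}

def ancestors(term, relation="is_a"):
    """All terms reachable from `term` by following `relation` edges."""
    found = set()
    for rel, parent in ontology.get(term, []):
        if rel == relation:
            found.add(parent)
            found |= ancestors(parent, relation)
    return found

print(ancestors("crow"))  # {'bird', 'tool_user', 'vertebrate'}
```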
Ontologies in biology
A helpful definition of an ontology has been given by Tom Gruber as
The formal explicit specification of a shared conceptualisation
The role of an ontology is thus to facilitate the formal sharing and re-use of knowledge through the construction of an explicit domain model
The most successful application of ontologies to bioinformatics has been the Gene Ontology (GO), first developed to annotate model organism databases and thus provide interoperability between them
GO has only two relationships: is_a (subsumption) and part_of (partonomy)
GO is composed of three orthogonal sub-ontologies:
Molecular function – capabilities of the gene product
Cellular component – cellular location at which the gene product acts
Biological process – processes in which the gene product is involved
GO has achieved extensive take-up by the biomedical community, and is now widely used to annotate entries in a growing number of bioinformatics databases
GO was developed by Michael Ashburner in Cambridge, with others elsewhere
www.geneontology.org
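GO is distributed in the plain-text OBO flat-file format; as a hedged illustration, a minimal parser for an abbreviated and simplified [Term] stanza might look like this (the stanza shown is an illustrative extract, not a verbatim copy of the current ontology):

```python
# A minimal sketch of reading GO terms from the OBO flat-file format.
SAMPLE_OBO = """\
[Term]
id: GO:0007283
name: spermatogenesis
namespace: biological_process
is_a: GO:0019953 ! sexual reproduction
"""

def parse_obo(text):
    terms, current = [], None
    for line in text.splitlines():
        line = line.strip()
        if line == "[Term]":
            current = {}
            terms.append(current)
        elif current is not None and ": " in line:
            key, _, value = line.partition(": ")
            current.setdefault(key, []).append(value)  # keys may repeat (e.g. is_a)
    return terms

for term in parse_obo(SAMPLE_OBO):
    print(term["id"][0], term["name"][0])  # GO:0007283 spermatogenesis
```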
The Ontogenesis Network
A Network of Excellence to foster the creation, ontogeny and evolution of biological, bioinformatics and medical ontologies
EPSRC funding for three years from November 2006
PIs: David Shotton and Robert Stevens (Computer Science, Manchester)
Regular meetings and workshops
Initial partners: FreshwaterLife, CancerGrid, European Bioinformatics Institute, CCLRC e-Science Group, Semantic Web Research Team (HP Research Labs), Asemantics Ltd, BioOntology Project, National e-Social Science Centre
Other interested parties welcome – please contact me: [email protected]
Recording metadata for primary research data
Even with good ontologies, creating metadata is difficult and time-consuming, but absolutely necessary if Web-accessible data are to be useful
Metadata should ideally be created at the time the primary research information is obtained – retro-fitting only happens in exceptional circumstances
Thus we need to provide metadata creation tools that make this VERY EASY for the researcher
Even with such support, metadata creation will still involve effort
We must thus make it ADVANTAGEOUS for researchers to do this, by enhancing or facilitating their existing activities
decision support, report creation, data publication, paper writing, etc.
We must also catch as much metadata as possible automatically from lab instruments
e.g. digital microscopes record the date, magnification, and lens and filter details in image headers, as sketched below
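A minimal sketch of harvesting such instrument-written metadata, dumping the TIFF header tags of a microscope image with the Pillow imaging library; the file name is hypothetical, and real microscopes often put acquisition details in vendor-specific tags or the ImageDescription field:

```python
# Dump the TIFF header tags written by a digital microscope
# (file name is hypothetical).
from PIL import Image, TiffTags

with Image.open("confocal_section_001.tif") as img:
    for tag_id, value in img.tag_v2.items():
        tag_name = TiffTags.lookup(tag_id).name
        print(f"{tag_name}: {value}")  # e.g. DateTime, Make, Model, ImageDescription
```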
Electronic laboratory notebooks
Commercial electronic lab notebook systems are widely used in the pharmaceutical industry, where drug licensing regulations require them
But they have found little uptake in academic labs
This is partly because of cost
ConturELN, one of the leading European systems, costs €12,500 for a perpetual licence for ten named people
An 80% academic discount brings the cost to €2,500, but this is still too much for most labs
An alternative – the FlyData Project
THE FLYDATA PROJECT: Decision support and semantic organization of laboratory data in Drosophila gene expression experiments
The FlyData Project has just been funded by the BBSRC
My colleague Helen White-Cooper is co-investigator, and she and her team will generate the primary research data
Its objective is to develop a simple but powerful computer-based information management system that will help us to
organise all the laboratory data arising from our Drosophila gene expression experiments in meaningful ways
relate them to on-line information, and
obtain different semantic views into this multi-dimensional information space, to support research decision-making processes
The FlyData System will also record our research decisions and their provenance, creating a complete record of our research ‘journey’ that will simplify subsequent report writing and data publication
Cell differentiation in the Drosophila testis
Gene expression patterns in fruit fly testes
These typical Drosophila testis in situ images show the sites of expression of particular genes at different stages of spermatogenesis
These images are the “end game” – the final stage of a long and complex process of experimental screening, decision-making and preparation
Reproducibility and interpretation require that the preparatory data and research decisions are recorded along with the images
[In situ images of three such genes: aly, cyclinB, Mst87F]
Workflow in Drosophila gene expression experiments (three workflow diagrams, #1–#3)
There is an awful lot of data to keep in mind when selecting which 1,500 genes to study from the ~16,000 genes in the Drosophila genome
To facilitate laboratory data management and provide decision support, we will use
open standards
lightweight software tools
appropriately structured semantic metadata
carefully-designed graphical user interfaces
Though initially a bespoke solution, we hope later to make it generic
Summary
Biomedical data – 1: Universals
Biomedical results may represent ‘universal truths’, such as the sequence of a particular gene, or the 3D structure of a specific protein
These form bounded data sets
The data need only be discovered and recorded once, and would be the same whoever acquires them
Such information is typically published in the public domain. It is seen as fundamental research knowledge to which all should have free access
A note of caution, however (from the Towards 2020 Science report by Microsoft Research, March 2006)
“We believe that attempts to solve the issues of scientific data management by building large, centralised, archival repositories are both dangerous and unworkable.”
Universals: The scope of the problem
The number of databases has more than doubled in the last three years
Source: Michael Galperin (2006) Nucleic Acids Research 34: D3-D5
So many sites! So many formats!
[Chart: NAR Molecular Biology Database Collection – numbers of genomics, secondary and species-specific databases by year, 2003–2006; vertical axis 0–1,000]
The Bioinformatics data maze
Navigating through all this to find the information you need is currently not straightforward!
12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt
12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt
12541 ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg
12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga
12661 tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc
12721 atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa
12781 taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa
Typical bioinformatics in silico research
Searches are frequently repeated, both because data are rapidly added to public databases, and also because it is often quicker to repeat a search than find files!
Data resulting from one service is used as input to the next
Advantages:
Expert human intervention at every step
Quick and easy access to distributed Web resources
Disadvantages:
Labour intensive, time consuming, highly repetitive and error prone
Involves tacit procedures, making it difficult to share both the protocol and the results
Much knowledge remains undocumented
Integrating sequence bioinformatics data
Data interpretation and integration are currently undertaken by skilled biologists and bioinformaticians, using their expert domain knowledge
Data transfer between applications is typically by ‘cut and paste’
e.g. a sequence selected from a genome database is pasted into the BLAST sequence comparison tool
There may be a little local subset selection, format modification or data transformation (e.g. from nucleotide to amino acid sequence)
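The cut-and-paste step can in principle be scripted; a minimal sketch using Biopython’s qblast interface to NCBI BLAST, where the query sequence is simply an illustrative fragment taken from the excerpt above:

```python
# Replace the cut-and-paste step with a scripted call: a sequence
# fetched from one resource is passed directly to NCBI BLAST.
from Bio.Blast import NCBIWWW, NCBIXML

sequence = "acatttctaccaacagtggatgaggttgttggtctatgttctcaccaaat"  # fetched earlier

result_handle = NCBIWWW.qblast("blastn", "nt", sequence)  # runs remotely at NCBI
blast_record = NCBIXML.read(result_handle)

for alignment in blast_record.alignments[:5]:
    print(alignment.title, alignment.hsps[0].expect)  # top hits and their E-values
```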
Data integration? What data integration?
Separate resources are searched independently and sequentially
Information is downloaded as required
‘Data integration’ is usually limited to cutting and pasting into a Word document!
Data integration – what people actually want!
What the service provider wants
NOT to have to restructure all his relational data into an XML schema
NOT to have to change his publication methods or adopt some standard generic service interface
What the user wants
NOT a fully automated decision making system that spits out the “answer” at the end (even if this were possible)
Rather, an interactive decision support system, in which the knowledgeable, opinionated expert user, who is highly sceptical that any computer can do things as well as she can, can check all the methodologies and intermediate results
Results presented in formats to which she has become accustomed
The ability to use legacy data, with new information available in legacy format so existing tools can use it
i.e. both service providers and consumers are highly resistant to change !
BioMOBY philosophy
Provide a single user interface to a variety of underlying bioinformatics services
Each service provider is described in the MOBY-S ‘Yellow Pages’ Central Registry
Each service has a single input, a set of parameters (options defining what the service will work on, or how), and a single output
e.g. performBlast(sequence, gap, etc.) → BlastReport
Each service is atomic, stateless and unrelated to other services
One can browse and select the services using a simple MOBY-Services browser
However, there is no facility for building complex workflows
http://www.biomoby.org/
BioMOBY has been developed by Mark Wilkinson of the University of British Columbia
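The single-input/parameters/single-output pattern is easy to mimic; the sketch below is purely illustrative and does not use the real MOBY-S API or registry:

```python
# A sketch of the BioMOBY pattern: each service is atomic and
# stateless, takes a single input plus parameters, returns a single
# output, and is described in a central 'Yellow Pages' registry.
# (Names are illustrative, not the real MOBY-S API.)

def perform_blast(sequence: str, gap: int = -2) -> dict:
    """performBlast(sequence, gap, ...) -> BlastReport"""
    return {"type": "BlastReport", "hits": [f"hit for {sequence[:10]}..."]}

REGISTRY = {
    # service name: (input type, output type, callable)
    "performBlast": ("DNASequence", "BlastReport", perform_blast),
}

# Browse the registry, then invoke the chosen service
name, (in_type, out_type, service) = next(iter(REGISTRY.items()))
print(f"{name}: {in_type} -> {out_type}")
report = service("acatttctaccaacagtgga")
print(report["hits"])
```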
myGrid permits data intensive in silico biology
Uses an open architecture:
Open domain services
Open community
Open application
Web Services based
Metadata and ontology driven
Manages the whole experiment lifecycle
Uses LSIDs, ontologies and Semantic Web technologies
http://www.mygrid.org.uk
myGrid has been developed by Carole Goble et al., in Manchester and elsewhere
The primary focus of myGrid is on facilitating workflows
For this, Taverna, an Open Source workflow GUI, has been developed
[Workflow diagram in three panels, A–C:]
A: Identification of overlapping sequence
B: Characterisation of nucleotide sequence
C: Characterisation of protein sequence
Real bioinformatics workflows are complex
Taverna lets you see intermediate results
Taverna has been developed by Tom Oinn at the EBI: http://taverna.sf.net
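The essence of such a workflow system – chaining services while keeping every intermediate result inspectable – can be sketched in a few lines; the two steps are illustrative stubs, not real Taverna processors:

```python
# Chain services while recording every intermediate result, so the
# whole run remains inspectable (as Taverna's GUI allows).
def identify_overlaps(seq):             # illustrative stub for step A
    return seq.upper()

def characterise_nucleotides(seq):      # illustrative stub for step B
    return {"gc_content": (seq.count("G") + seq.count("C")) / len(seq)}

def run_workflow(data, steps):
    intermediates = [("input", data)]
    for step in steps:
        data = step(data)
        intermediates.append((step.__name__, data))  # visible at every stage
    return data, intermediates

result, trace = run_workflow("acatttctaccaacagtgga",
                             [identify_overlaps, characterise_nucleotides])
for name, value in trace:
    print(name, "->", value)
```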
Where are we?
“The problem now is not connecting up and running the services.
“It’s managing and visualising all the data results, and the metadata and the provenance records and stuff.”
Carole Goble, EBI Hinxton, October 2004
iHOP – Information Hyperlinked Over Proteins
The World Wide Web has changed the way we access information
www.google.com: search for “iHOP”
iHOP has been developed to exploit this – the work of Robert Hoffmann and Alfonso Valencia
iHOP is a text mining tool – “12 million pages but just 3 clicks away”
By employing gene and protein names as hyperlinks between sentences and abstracts, the information in PubMed abstracts becomes bound together as one navigable small-world network
Traversing interrelated sentences within this network is closer to human intuition than conventional keyword searches
Each move through the network produces just the information pertaining to one single gene and its associations
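The underlying idea can be sketched simply: treat gene or protein names co-mentioned in the same sentence as links, and the corpus becomes a navigable network. The gene list and sentences below are invented for illustration:

```python
# Build a small co-mention network from sentences, iHOP-style
# (gene list and sentences are invented for illustration).
from collections import defaultdict
from itertools import combinations

GENES = {"aly", "cyclinB", "Mst87F", "comr"}
sentences = [
    "aly and comr are required for transcription of Mst87F.",
    "cyclinB expression depends on aly.",
]

links = defaultdict(set)
for sentence in sentences:
    mentioned = {g for g in GENES if g in sentence}  # naive substring matching
    for a, b in combinations(sorted(mentioned), 2):
        links[a].add(b)
        links[b].add(a)

# Each hop through the network shows just one gene and its associations
print("aly ->", sorted(links["aly"]))  # ['Mst87F', 'comr', 'cyclinB']
```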
Biomedical data – 2: Particulars
Research data can also be ‘particulars’, rather than ‘universals’, for example individual assay results, microscopy images and wildlife photos:
These data form unbounded data sets
Data collection will never be complete
Such information is not (yet) widely available on line
It is NOT appropriate to submit such data to central global databases
The data are too heterogeneous
Such activities would not scale
Photographs of the unique stripe patterns on the rumps of different individual Grevy’s zebras (Equus grevyi).
(Credits: Alistair Nelson, Dept Zoology, Oxford)
Publication@source
With the advent of the Semantic Web, the possibility exists for centralized databases to give way to a new paradigm where everyone publishes their own research results
We are entering the age of distributed personal data publication
Most research data will in future not be submitted to centralized databases
Rather, data will be published locally by individuals, institutional repositories or journals, complete with semantically rich metadata
This method of publication is particularly appropriate for ‘particulars’
The challenge then becomes how best to provide cross-searchable access to these heterogeneous resources, so that the data do not become lost in isolated data silos
A potential solution, that I outlined during the first RIN Workshop last June, is the data web
Data integration – the lightweight data web approach
The data web is a novel concept for digital information storage and integration
The data are NOT submitted to a central database, but are simply published by the data providers on their own Web servers
Then, separately for each data web serving a particular knowledge domain, lightweight software tools are used to harvest, marshal and index metadata describing the distributed data into a central ontology-enabled data web registry
By requiring only core metadata, conforming to a specific minimalist data web ontology, each data web overcomes the problems caused by syntactic and semantic differences in data presentation between providers, and makes collating selected information from multiple web sites possible for machines
Data webs represent a step towards Berners-Lee’s vision of the World Wide Web as a ‘Web of Data’
Data webs are particularly suitable for data that represent research ‘particulars’
Role of the central metadata registry
The data web registry acts first as a data marshal, gathering, ordering and integrating the metadata from across the web into a single searchable RDF graph
It then provides an integrated cross-searchable access point to all the data in the data web, with both human and programmatic access
The data web registry adds value by providing interoperability and customizable search interfaces, with a rigorous semantic underpinning
The primary data holders benefit by increased user traffic to their sites, while at the same time being able to maintain normal copyright and access control
The primary metadata are never controlled by the data web registry, but are freely available on the Web for use by other presently unforeseen applications, including novel data mining, integration and analysis services
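A minimal sketch of this marshalling role, using rdflib to harvest metadata from hypothetical provider URLs into a single graph and query it with SPARQL; the provider URLs and property choices are illustrative assumptions, not the actual ImageWeb schema:

```python
# Harvest RDF metadata published at source by independent providers
# into one registry graph, then offer a single SPARQL query point
# (provider URLs and properties are hypothetical).
from rdflib import Graph

PROVIDERS = [
    "http://provider-a.example.org/images/metadata.rdf",
    "http://provider-b.example.org/collection.rdf",
]

registry = Graph()
for url in PROVIDERS:
    registry.parse(url)  # metadata stays published at source; only harvested here

QUERY = """
SELECT ?image ?label WHERE {
    ?image <http://purl.org/dc/terms/subject> ?subject ;
           <http://www.w3.org/2000/01/rdf-schema#label> ?label .
    FILTER regex(str(?subject), "spermatogenesis")
}
"""
for image, label in registry.query(QUERY):
    print(image, label)
```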
Data web advantages
A data web of this type will have all the advantages of the World Wide Web itself:
Distributed data
Freedom, decentralization and low cost of publication
Lack of centralized control
Built-in evolvability and scalability
Data webs are Open – Open – Open – Open
Support for open access data
Use of open source software components
Open standards for software and metadata development
An Open World (“missing isn’t broken”) data philosophy
How does a data web differ from Google?
A data web will provide for the selected data what Web search engines such as Google do for conventional Web pages, but with the following advantages:
It permits database information hidden in the Deep Web to be accessed
It involves specific targeting to a particular knowledge domain, thus achieving a significantly higher signal-to-noise ratio
It provides integration of information with ontological underpinning, semantic coherence, and truth maintenance
It permits programmatic access, enabling further services to be built on top of one or more data webs
The ImageWeb Project
The first data web we aim to create is for published biomedical research images
We aim to integrate and make cross-searchable research images held by publishers, institutional repositories, and other curators of research images
We will require only minimal effort on the part of the publishers and repository managers, who can use their existing relational databases, XML metadata schemas or RSS feeds
ImageWeb will enable access to published images to become a more integral part of day-to-day research
The JISC have funded an initial requirements analysis project for such data web integration across university repositories, as part of their Repositories and Preservation Programme:
“Defining Image Access: Requirements for interoperable discovery and delivery of image data stored in DSpace, EPrints and Fedora-based institutional repositories using a data web approach”
The ImageWeb Consortium
My Image BioInformatics Research Group, University of Oxford
Leading commercial publishers: Nature, OUP, Blackwell, Elsevier
Leading Open Access publishers: PLoS and BioMed Central
University institutional repositories at Cambridge, Imperial College, Oxford and Southampton
Professional biologists and academic biological image collections: NHM
Other stakeholders: British Library, CCLRC, UKOLN, ILRT, CrossRef, SPARC
I welcome the involvement of other partners who hold image collections that they wish to be used on the Web – please contact me <[email protected]>
Our next consortium meeting will be at OUP in Oxford on January 4th 2007
What users really want!
I originally imagined that ImageWeb users would directly access the central ImageWeb Registry with their queries, and from there be led to relevant images
However, I now believe that it might be even more useful for a user to be able to click on an image within an online paper she is reading, and have semantically related images from other sources presented as a ranked list
This service would resemble the basic bioinformatics BLAST service for finding related biological sequences (http://www.ncbi.nlm.nih.gov/BLAST/)
This secondary ‘ImageBLAST’ service, built on top of an ImageWeb registry, would not locate images that resemble the reference image in terms of visual appearance, but in terms of being about the same thing
e.g. the same gene expressed in a different organism
or the same concept demonstrated in a different system
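One plausible way to rank ‘aboutness’ is by overlap of ontology annotations; the sketch below uses Jaccard similarity over invented annotation sets, and is an assumption about how such a service might work rather than a description of ImageBLAST’s actual design:

```python
# Rank candidate images by shared ontology annotations, not pixels
# (all annotation sets below are invented examples).
def jaccard(a, b):
    return len(a & b) / len(a | b)

reference = {"GO:0007283", "Drosophila", "in_situ_hybridisation"}

candidates = {
    "img1": {"GO:0007283", "Mus_musculus", "immunofluorescence"},  # same process, other organism
    "img2": {"GO:0006412", "Drosophila"},
    "img3": {"GO:0007283", "Drosophila", "in_situ_hybridisation", "testis"},
}

ranked = sorted(candidates.items(),
                key=lambda item: jaccard(reference, item[1]),
                reverse=True)
for name, annotations in ranked:
    print(name, round(jaccard(reference, annotations), 2))  # img3 first
```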
An example
Fig. 2. (A and B) Immunohistochemical staining for EGFP on livers of (A) Z/EG x Cre-into-Cre and (B) Z/EG-into-Cre transplants. (C) Immunofluorescence staining with cytokeratin (green) and Y chromosome FISH (red) in the same Z/EG-into-Cre transplant, showing the presence of a donor-derived Y-positive hepatocyte (arrow). (D and E) Immunofluorescence staining of (D) untransplanted positive control (Z/EG x Cre F1) and (E) experimental (Z/EG into Cre) epidermal sections with antibodies against EGFP (green) and cytokeratin AE1/AE3 (red). (F) Immunofluorescence staining with cytokeratin AE1/AE3 (red) and Y chromosome FISH (green), showing the presence of a donor-derived Y-positive keratinocyte (arrow) in the epidermis of a Z/EG-into-Cre transplant recipient.
Related images
The data behind the structure of elastase
While the atomic coordinates of the elastase molecule that I determined with Lindsay Sawyer in 1976 are faithfully preserved in the PDB – our submission was one of the first in this fledgling database – what should now be done with the original experimental data?
A fraction of the raw elastase structural data
Precession photograph of tosyl-elastase X-ray diffraction intensities at 3.5 Å resolution
Original positive print of a single section of the 2.5 Å resolution elastase electron density map from which the first elastase atomic model was built
Digital data and the historical scientific record
With modern digital technology, it is now technically possible to store as much research data as we wish. But how much is enough? When is it right not to save data? And what about legacy data?
Clearly, not every electron micrograph ever recorded is worth saving. However, the era of electron microscopy is past, with few domain experts left
Many of the pioneers of cell biology, who discovered the ultrastructure of the cell in the 1960s and 1970s, are now retired or close to retirement
The community is in danger of losing all their original data unless funding is provided within the next five years for a digital preservation programme
A good rule of thumb is that for every 1000 EM images taken, 100 will be good, 10 will be superb, and 1 or 2 will have been published in a scientific paper
While we should be happy to discard the 900 poor negatives, what we should do with the 98 unpublished good images is a pressing question
These and countless other types of analogue data constitute our scientific cultural heritage, and yet will almost certainly be lost if nothing is done soon
The cost of having to repeat these observations would far outweigh the cost of preserving the original data
Conclusions
Biomedical data are bottom-up, diverse, highly distributed, heterogeneous and largely unstructured. Let the diversity flourish – it is a fool’s mission to try to impose conformity!
We only publish selected data or interpretations in journal articles, to back hypotheses
Most original research data remain unpublished – a shame in the digital age
Creating ontologies and metadata is hard work, and requires better tools
‘Universals’ are well served by a growing number of sequence and structural bioinformatics databases. In silico research is enabled by tools such as myGrid
‘Particulars’ are best handled by distributed publication, with data web structures to create searchable registries, on top of which can be built secondary services such as ImageBLAST
Legacy analogue biomedical data are in danger of being lost for ever unless steps are taken to preserve them
The end