
Alex Hardisty, XLDB-Europe, Edinburgh, 8-10th June 2011

What does research infrastructure really need for data?

A personal view based on LifeWatch and ENVRI

Alex Hardisty, Cardiff University

LifeWatch: an ESFRI Research Infrastructure; an e-Infrastructure for Biodiversity and Ecosystem Science.

What is LifeWatch? Biodiversity science is the study of the diversity of life on our planet – plants, animals, microorganisms and viruses – and the environments (ecosystems) they live in. LifeWatch (www.lifewatch.eu) will be an open access infrastructure, accessed through a single portal (portal.lifewatch.eu) for users from the scientific community, as well as policy makers and representatives of the private sector. It will allow scientists to explore, describe and understand patterns in biodiversity, and the processes that maintain biodiversity, in space and time at the gene, species, ecosystem and landscape levels; and to understand what causes and affects species diversity.

The innovative design of LifeWatch offers integrated access to large-scale data resources, advanced algorithms and computational capability through a service-oriented architecture to support the creation of new knowledge. Key elements of the infrastructure will include: distributed observatories/sensors, interoperable datasets, processing and analytical tools, and both computational capability and capacity. Data mining, data analysis and modelling allow users to study patterns and mechanisms across different levels of biodiversity. The LifeWatch infrastructure provides scientific research teams with new collaborative environments by creating ‘e-Laboratories’ or composing ‘e-Services’. Teams may share their data and their analytical and modelling algorithms with others, while controlling access. LifeWatch enables “distributed large scale” and collaborative research on complex and multidisciplinary problems.

After three years of planning, LifeWatch is now moving into its construction phase. Early Virtual Labs are likely to support scientific studies of biodiversity in marine wetlands and of the fragility of ecosystems in the face of alien and invasive species. The Biodiversity Virtual e-Laboratory (BioVeL) project (www.biovel.eu) contributes to construction by helping islands of compatible infrastructure to emerge at key centres across Europe.

The challenges of scale and heterogeneity

LifeWatch is supported by many good data providers from within the scientific communities (networks of excellence) for terrestrial ecology, marine ecology and the natural history collections with all their biological specimens. There are currently about 1800 terrestrial monitoring sites and 200 marine research sites across Europe. Hundreds of millions of specimens in natural history collections all over Europe are gradually being indexed and digitised.

Biodiversity data is extremely diverse and heterogeneous. Biodiversity science spans many of the more familiar disciplines: biology, botany, zoology, ecology, genetics, soil science, biogeography, climate science and chemistry, to name but a few. Each of these established scientific communities already has its own way of doing things, its own data resources and its own tools. Not only that, but each has its own vocabulary and conceptual underpinnings. Interoperability is a problem demanding a determined ontological and thesaurus solution like that used in the medical domain: the Unified Medical Language System (UMLS) (www.nlm.nih.gov/research/umls).
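To make the thesaurus idea concrete, here is a minimal sketch of how terms from different communities might be resolved to one shared concept. It is not how UMLS or LifeWatch does it: the `Concept` and `Thesaurus` classes, the concept identifiers and the example synonyms are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Concept:
    """One shared concept plus the terms different communities use for it."""
    concept_id: str            # hypothetical identifier, not a real UMLS code
    preferred_label: str
    synonyms: Set[str] = field(default_factory=set)


class Thesaurus:
    """Maps community-specific terms onto shared concept identifiers."""

    def __init__(self) -> None:
        self._by_term: Dict[str, Concept] = {}

    def add(self, concept: Concept) -> None:
        # Index the preferred label and every synonym, case-insensitively.
        for term in {concept.preferred_label, *concept.synonyms}:
            self._by_term[term.casefold()] = concept

    def lookup(self, term: str) -> Optional[Concept]:
        return self._by_term.get(term.strip().casefold())

    def same_concept(self, a: str, b: str) -> bool:
        """True only if both terms are known and resolve to one concept."""
        ca, cb = self.lookup(a), self.lookup(b)
        return ca is not None and ca is cb


# Illustrative entries only; real vocabularies would be curated by experts.
thesaurus = Thesaurus()
thesaurus.add(Concept("C0001", "species richness",
                      {"number of species", "taxon richness"}))
thesaurus.add(Concept("C0002", "invasive species",
                      {"alien invasive species", "IAS"}))

print(thesaurus.same_concept("Species Richness", "number of species"))
print(thesaurus.same_concept("IAS", "taxon richness"))
```

A flat synonym table like this only covers the thesaurus half of the problem; the ontological half (relations between concepts) would need a richer structure.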

The interconnections between different biodiversity ideas/concepts, data sources, and the outputs from data processing, manipulation and modelling are intricate. As well as the traditional sources mentioned above, genomic data (for example, sequence data, DNA barcodes and phylogenies) are becoming increasingly important sources. Biodiversity science also demands environmental data (climate, soil, ocean temperature, etc.), as well as economic and census data for particular types of studies.

Apart from the well-known and often large sources (GBIF, EBI, environmental data, census data) there are numerous small datasets in the hands of individual researchers. If computerised at all, these small datasets are often held in spreadsheets with no identifiable common structure. There are probably thousands of them, and there are multiple tools for processing them too. The biodiversity science community is highly fragmented, and all these kinds of small, personal, group and departmental datasets need to be published and to become discoverable and usable.
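As an illustration of what giving such spreadsheets a common structure could involve, the sketch below maps differing column headers onto one set of field names. Everything in it is an assumption made for the example: the header variants, the sample rows and the `harmonise` function are hypothetical, and the target field names are borrowed from Darwin Core purely for illustration.

```python
import csv
import io
from typing import Dict, List

# Hypothetical header variants seen in personal spreadsheets, mapped to
# common field names (Darwin Core terms, used here only as an example).
HEADER_MAP: Dict[str, str] = {
    "species": "scientificName",
    "species name": "scientificName",
    "taxon": "scientificName",
    "lat": "decimalLatitude",
    "latitude": "decimalLatitude",
    "lon": "decimalLongitude",
    "long": "decimalLongitude",
    "longitude": "decimalLongitude",
    "date": "eventDate",
    "observed on": "eventDate",
}


def harmonise(csv_text: str) -> List[dict]:
    """Return rows as dicts keyed by the common field names.

    Columns with unrecognised headers are kept under 'unmapped' so that
    nothing is silently dropped.
    """
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        record: dict = {"unmapped": {}}
        for header, value in row.items():
            if header is None:      # surplus cells beyond the header row
                continue
            key = HEADER_MAP.get(header.strip().lower())
            if key is None:
                record["unmapped"][header] = value
            else:
                record[key] = (value or "").strip()
        records.append(record)
    return records


# Two made-up spreadsheets with different headers for the same information.
sheet_a = "Species,Lat,Lon,Date\nParus major,51.48,-3.18,2011-06-08\n"
sheet_b = "taxon,latitude,longitude,observer\nErithacus rubecula,55.95,-3.19,AH\n"

for sheet in (sheet_a, sheet_b):
    for record in harmonise(sheet):
        print(record)
```

Keeping unmapped columns rather than discarding them is a deliberate choice here: with thousands of idiosyncratic datasets, the mapping table will always lag behind the data.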

LifeWatch aims to support upwards of 25,000 users, primarily from the academic and research community, and the policymaking community, but also supporting the student education sector and the general public (citizen science).

The LifeWatch strategy of “Thinking globally, acting locally” addresses these challenges of heterogeneity and scale. At the pan-European level it devises and promotes top-down strategies that foster collaboration and interoperability; at the local level it assists and encourages ‘islands’ of compliant infrastructure to emerge and fuse.

ENVRI: Common Operations of the ESFRI Environmental Research Infrastructures

What is ENVRI? ENVRI is a soon-to-be-funded EC FP7 project that brings together many of the main ESFRI research infrastructures from the environmental sciences domain. The ENVRI project will contribute to the construction of these research infrastructures by sharing experiences and technologies and by solving crucial common technology issues and challenges together. Through cooperation in this project the ESFRI ENV infrastructures, together with ICT partners, are seeking to improve the interoperability of their data and facilities, and so to increase the use and effectiveness of their infrastructures. The central goal of the ENVRI project is to implement harmonised solutions and draw up guidelines for the common needs of the environmental ESFRI projects, with a special focus on issues such as architectures, metadata frameworks, data discovery in scattered repositories, visualization and data curation.

ENVRI recognises scientific data services as part of a horizontal set of foundational services that include communications, distributed computing, and storage. It recognises that data providers, as well as data users, are users of data services and that there are common requirements irrespective of domain-specific communities. Community-specific services sit on top of data services and interact with them.

The key to improved interoperability is finding common solutions to common problems that can be adopted by each research infrastructure as it progresses through its construction phase. Fundamental common solutions include:

a) A Common Reference Model providing multiple compatible ‘views’ of the research infrastructure for different purposes.

An ENVRI Common Reference Model is likely to be based on the ISO/IEC 10746 series of Standards for Open Distributed Processing, presenting 5 viewpoints: i) Science business / enterprise view; ii) Information view; iii) Computational / services view; iv) Engineering view and v) Technology view.

b) “Standards, Standards, Standards” are required for, at least:
• Data capture from distributed sensors
• Metadata definition
• Management of high volume data
• Execution of workflows
• Visualization of data
• Provenance and annotation
• Interoperability between assets

c) Based on a generic metadata model (the Information viewpoint of the Common Reference Model), tools to allow data discovery and access in a federation of distributed digital repositories and interoperating infrastructures;

d) RDF and OWL frameworks to describe relations between (virtualized) e-Infrastructure components, and to link semantic descriptions of data with the semantic descriptions of the infrastructure, allowing the creation of a data-centric network.
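The idea in item d) of linking descriptions of data to descriptions of the infrastructure can be sketched with subject-predicate-object triples. The example below is a plain-Python stand-in, not RDF or OWL itself: a real system would use an RDF library and a proper ontology. The `Graph` class, the `infra:` and `data:` prefixes and all resource names are hypothetical.

```python
from typing import Iterator, List, NamedTuple, Optional, Set


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


class Graph:
    """A tiny in-memory triple store with wildcard pattern matching."""

    def __init__(self) -> None:
        self._triples: Set[Triple] = set()

    def add(self, subject: str, predicate: str, obj: str) -> None:
        self._triples.add(Triple(subject, predicate, obj))

    def match(self, s: Optional[str] = None, p: Optional[str] = None,
              o: Optional[str] = None) -> Iterator[Triple]:
        """Yield triples matching the pattern; None acts as a wildcard."""
        for t in self._triples:
            if ((s is None or t.subject == s)
                    and (p is None or t.predicate == p)
                    and (o is None or t.obj == o)):
                yield t


g = Graph()
# Description of the infrastructure (hypothetical sensors and sites).
g.add("infra:sensor-42", "rdf:type", "infra:Sensor")
g.add("infra:sensor-42", "infra:hostedAt", "infra:observatory-A")
g.add("infra:sensor-7", "rdf:type", "infra:Sensor")
g.add("infra:sensor-7", "infra:hostedAt", "infra:observatory-B")
# Description of the data, linked to the infrastructure that produced it.
g.add("data:dataset-1", "data:producedBy", "infra:sensor-42")
g.add("data:dataset-2", "data:producedBy", "infra:sensor-7")
g.add("data:dataset-3", "data:producedBy", "infra:sensor-42")


def datasets_from(graph: Graph, observatory: str) -> List[str]:
    """Datasets produced by any sensor hosted at the given observatory."""
    sensors = {t.subject
               for t in graph.match(p="infra:hostedAt", o=observatory)}
    return sorted(t.subject
                  for t in graph.match(p="data:producedBy")
                  if t.obj in sensors)


print(datasets_from(g, "infra:observatory-A"))
# prints ['data:dataset-1', 'data:dataset-3']
```

The query crosses from the data description into the infrastructure description and back, which is the kind of traversal a data-centric network would rely on.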

Riding the Wave: How Europe can gain from the rising tide of scientific data

The recently published report of the High Level Expert Group on Scientific Data – “Riding the Wave: How Europe can gain from the rising tide of scientific data” – is an important contribution towards addressing the question of what research infrastructures really need for data. Neelie Kroes, the Vice-President of the European Commission responsible for the Digital Agenda, has asked “every citizen and every organisation involved in scientific research to take note of this report and to use it as a reference point when discussing the priorities of EU research investments.”

The report may be found here:

http://cordis.europa.eu/fp7/ict/e-infrastructure/docs/hlg-sdi-report.pdf
