building an open data infrastructure for science: policy and practice juan bicarregui stfc e-science...

43
Building an Open Data Infrastructure for Science: Policy and Practice Juan Bicarregui STFC e-Science WIFI: BMA Guest “apa2 conference” “visitor” APA conference, November 2011

Upload: kenya-stapleford

Post on 15-Dec-2015

216 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: Building an Open Data Infrastructure for Science: Policy and Practice Juan Bicarregui STFC e-Science WIFI: BMA Guest “apa2 conference” “visitor” APA conference,

Building an Open Data Infrastructure for Science:

Policy and Practice

Juan BicarreguiSTFC e-Science

WIFI: BMA Guest“apa2 conference” “visitor”

APA conference, November 2011

Page 2: Building an Open Data Infrastructure for Science: Policy and Practice Juan Bicarregui STFC e-Science WIFI: BMA Guest “apa2 conference” “visitor” APA conference,

Overview

• Introduction– What is STFC? What do we do? Why do we do it?

• An example project (Practice)– Not: ODE, APARSEN, EUDAT, SCAPE, SCIDIP-ES

• RCUK Data Principles (Policy)

Page 3: Building an Open Data Infrastructure for Science: Policy and Practice Juan Bicarregui STFC e-Science WIFI: BMA Guest “apa2 conference” “visitor” APA conference,

Programme includes: • Neutron and Muon Source• Synchrotron Radiation Source• Lasers• Space Science • Particle Physics• Compuing and Data Management• Microstructures• Nuclear Physics• Radio Communications

What is STFC?

250m

ESRF & ILL, GrenobleDaresbury Laboratory

Square Kilometre Array Large Hadron Collider

Page 4: Building an Open Data Infrastructure for Science: Policy and Practice Juan Bicarregui STFC e-Science WIFI: BMA Guest “apa2 conference” “visitor” APA conference,

What is the science?

Page 5: Building an Open Data Infrastructure for Science: Policy and Practice Juan Bicarregui STFC e-Science WIFI: BMA Guest “apa2 conference” “visitor” APA conference,

©John Womersley/Keith Jeffery/STFC

Tomorrow’s Digital Infrastructure for Science

APA Conference 2007 John Womersley/Keith Jeffery

Page 6: Building an Open Data Infrastructure for Science: Policy and Practice Juan Bicarregui STFC e-Science WIFI: BMA Guest “apa2 conference” “visitor” APA conference,

The Innovation Lifecycle

The Body of Knowledge

The GovernmentProcess

The ResearchProcess

Aggregation of Knowledge lies at the heart of the innovation lifecycle

Enabling Knowledge Creation

Enabling Wealth Creation

Quality Assessment

Strategic Direction

Improved Quality of Life

Improved Understanding

Data and the Research Process

Page 7: Building an Open Data Infrastructure for Science: Policy and Practice Juan Bicarregui STFC e-Science WIFI: BMA Guest “apa2 conference” “visitor” APA conference,

Data centric view of research

DataCreation

Archival

Access

Storage ComputeNetwork

Services

Curation

the researcher actsthrough ingest and access

Virtual Research Environment

the researcher shouldn’t have to worry about the information infrastructure

Information Infrastructure

Page 8: Building an Open Data Infrastructure for Science: Policy and Practice Juan Bicarregui STFC e-Science WIFI: BMA Guest “apa2 conference” “visitor” APA conference,

The 7 C’s

Creation Collection

Capacity

Computation

Curation

Collaboration Communication

Data

Page 9: Building an Open Data Infrastructure for Science: Policy and Practice Juan Bicarregui STFC e-Science WIFI: BMA Guest “apa2 conference” “visitor” APA conference,

Linked systems for:

• Proposal submission• User management• Data acquisition

Metadata carried from each system to the next

Detectors moving from Hz to KHz, MHz, GHz, ...

Creation

Examining the detectors on MAPS instrument on ISIS

Barrel toroid magnet and detector module from ATLAS at CERN

Page 10: Building an Open Data Infrastructure for Science: Policy and Practice Juan Bicarregui STFC e-Science WIFI: BMA Guest “apa2 conference” “visitor” APA conference,

1997 1998 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 201010

100

1,000

10,000

Data Stored at Rutherford Appleton Lab

Capacity

Currently store about

7 PetaBytes of data

1PB = 1015 Bytes

= a Billion Floppys

= a Million CDs

= a Thousand PCs

LHC experiments ~ 2PB

Active file system based HSM ~2PB(facilities)

Dark archive ~2PB(facilities and non-STFC services)

of Today’s

}

10 x Moore’s Law (2 years)2 x Moore’s Law (1.5 years)

Moore’s Lawx1000 in 13 yearsDoubling every 1.3 years

Page 11: Building an Open Data Infrastructure for Science: Policy and Practice Juan Bicarregui STFC e-Science WIFI: BMA Guest “apa2 conference” “visitor” APA conference,

Computation

Compute intensive components on the grid

Computational applications for Laser Theory Group’s adoption of HPC

Laser real-time diagnostics & data flow pipeline.

Fitting of experimental data to model

Page 12: Building an Open Data Infrastructure for Science: Policy and Practice Juan Bicarregui STFC e-Science WIFI: BMA Guest “apa2 conference” “visitor” APA conference,

CurationComplexity of Facility Archives• All ISIS data (~25 years)

> 3,000,000 files

• All Diamond Data (~2.5 years)> 11,000,000 files

Breadth of Data Sources:• STFC (Tier 1)• NERC (BADC, NEODC)• BBSRC (All institutes)• AHRC• MRC • Others:

– The STFC Data Portal – The STFC Publications Archive – The CCPs (Collaborative Computational Projects) – The Chemical Database Service – The Digital Curation Centre – The EUROPRACTICE Software service– The HPCx Supercomputer– The JISCmail service– The NERC Datagrid – The Starlink Software suite – The UK Grid Support Centre – The World Data Centre for Solar-Terrestrial Physics

Atlas Datastore Tape Robot

The StorageTek tape robot with capacity of 20PB

Page 13: Building an Open Data Infrastructure for Science: Policy and Practice Juan Bicarregui STFC e-Science WIFI: BMA Guest “apa2 conference” “visitor” APA conference,

Collection

Proposal

Approval

SchedulingExperiment

Data cleansing

Record Publication

Scientist submits application for

beamtime

Facility committee approves application

Facility registers, trains, and schedules

scientist’s visit

Scientists visits, facility run’s experiment

Subsequent publication registered

with facility

Raw data filtered and cleansed

Data analysis

Tools for processing made available

Page 14: Building an Open Data Infrastructure for Science: Policy and Practice Juan Bicarregui STFC e-Science WIFI: BMA Guest “apa2 conference” “visitor” APA conference,

CommunicationImmense Expectations ! • Web enables:

– access to everything • Everything on-line

• Interlinking enables:– Validation of results – Repetition of experiment

• Discovery enables:– new knowledge from old

• Archiving enables:– Unplanned reuse of data

• Antarctic environmental data– Reuse of knowledge

• One paper has >20,000 downloads– Completing the cycle

• Publications entered in next proposal

STFC’s “e-pubs” Institutional Repository has records of 30,000 publications spanning 25 years

“The web has changed everything...”

Page 15: Building an Open Data Infrastructure for Science: Policy and Practice Juan Bicarregui STFC e-Science WIFI: BMA Guest “apa2 conference” “visitor” APA conference,

Collaboration

Technology integration facilitates scientific collaboration• Cross facility/beamline• Cross disciplinary

Technology integration improves facility efficiency • PaN-data –Photon and Neutron Data infrastructure• ICAT also used in Australian Synchrotron and Oak Ridge National Lab

Page 16: Building an Open Data Infrastructure for Science: Policy and Practice Juan Bicarregui STFC e-Science WIFI: BMA Guest “apa2 conference” “visitor” APA conference,

Overview

• Introduction– What is STFC? What do we do? Why do we do it?

• An example project – Not: ODE, APARSEN, EUDAT, SCAPE, SCIDIP-ES

• RCUK Data Policy Principles

Page 17: Building an Open Data Infrastructure for Science: Policy and Practice Juan Bicarregui STFC e-Science WIFI: BMA Guest “apa2 conference” “visitor” APA conference,

Tools for virtual research environments

Tools for virtual research environments

Generic services, storage and computation

OA participatory infrastructure

Agricultur

e

Environment

Physics, Engineering

Biolo

gy

Medici

n

e

Atmosphere/Space Physics

Social SciencesScientific Data

(Discipline Specific)

Other Data

Researcher 1

Non Scientific World

Scientific WorldResearcher 2

Aggregated Data Sets(Temporary or Permanent)

Workflows

Aggregation Path

transPLANT

EUDAT

AgINFRA

iMarine

OPENAire Plus

diXa

SCIDIP-ES

ESPAS

ENGAGE

PanDataODI

Scientific Data Landscape of Initiatives – results from call9

VREs

VREs

PaNdataODI

Page 18: Building an Open Data Infrastructure for Science: Policy and Practice Juan Bicarregui STFC e-Science WIFI: BMA Guest “apa2 conference” “visitor” APA conference,

PaN-data bring together 11 major European Research Infrastructures

PaN-data is coordinated by the e-Science Department at the Rutherford Appleton Laboratory, UK

ISIS is the world’s leading pulsed spallation neutron source

ILL operates the most intense slow neutron source in the world

PSI operates the Swiss Light Source, SLS, and Neutron Spallation Source, SINQ, and is developing the SwissFEL Free Electron Laser

HZB operates the BER II research reactor the BESSY II synchrotron

CEA/LLB operates neutron scattering spectrometers from the Orphée fission reactor

ESRF is a third generation synchrotron light source jointly funded by 19 European countries

Diamond is new 3rd generation synchrotron funded by the UK and the Wellcome Trust

DESY operates two synchrotrons, Doris III and Petra III, and the FLASH free electron laser

Soleil is a 2.75 GeV synchrotron radiation facility in operation since 2007

ELETTRA operates a 2-2.4 GeV synchrotron and is building the FERMI Free Electron Laser

ALBA is a new 3 GeV synchrotron facility due to become operational in 2010

PaN-data Partners

JCNS Juelich Centre for Neutron Science MaxLab, Max IV Synchrotron

The partners operate hundreds of instruments used by over 30,000 scientists each year

Page 19: Building an Open Data Infrastructure for Science: Policy and Practice Juan Bicarregui STFC e-Science WIFI: BMA Guest “apa2 conference” “visitor” APA conference,

EDNS - European Data Infrastructure for Neutron and Synchrotron Sources

PaNdata Vision

Single Infrastructure Single User Experience

CapacityStorage

Publications Repositories

Data Repositories

Software Repositories

Raw Data Data Analysis

Analysed Data

Publication Data

Publications

Facility 1

Raw Data Data Analysis

Analysed Data

Publication Data

Publications

Facility 2

Raw Data Data Analysis

Analysed Data

Publication Data

Publications

Facility 3

Different Infrastructures Different User ExperiencesRaw Data Catalogue

Data Analysis

Analysed Data Catalogue

Publication Data Catalogue

Publications Catalogue

Page 20: Building an Open Data Infrastructure for Science: Policy and Practice Juan Bicarregui STFC e-Science WIFI: BMA Guest “apa2 conference” “visitor” APA conference,

Science driver – Data IntegrationNeutron diffraction X-ray diffraction

}NMR

High-qualitystructure refinement

}

Page 21: Building an Open Data Infrastructure for Science: Policy and Practice Juan Bicarregui STFC e-Science WIFI: BMA Guest “apa2 conference” “visitor” APA conference,

PaN-data Standardisation

PaN-data Europe is undertaking 5 standardisation activities:

1. Development of a common data policy framework

2. Agreement on protocols for shared user information exchange

3. Definition of standards for common scientific data formats

4. Strategy for the interoperation of data analysis software enabling the most appropriate software to be used independently of where the data is collected

5. Integration and cross-linking of research outputs completing the lifecycle of research, linking all information underpinning publications, and supporting the long-term preservation of the research outputs

PaN-data Europe – building a sustainable data infrastructure for Neutron and Photon laboratories

Page 22: Building an Open Data Infrastructure for Science: Policy and Practice Juan Bicarregui STFC e-Science WIFI: BMA Guest “apa2 conference” “visitor” APA conference,

2.1 Data Policy

2.2Software Policy

2.3UserPolicy

2.4Integrated Policy

4.1User

Proposal

4.2User

Workshop

4.3User

Revision

5.1Data

Proposal

5.2Data

Workshop

5.3Data

Revision

6.1SoftwareReview

6.2Software Workshop

6.3SoftwareProposal

6.4Software Revision

7.1Integration Report

7.2Integration Proposal

7.3Integration Revision

3.4

Final

Workshop

Project Management, Knowledge Exchange and Dissemination Activities

Dependencies between the major project tasks

PaNdata Activities

Page 23: Building an Open Data Infrastructure for Science: Policy and Practice Juan Bicarregui STFC e-Science WIFI: BMA Guest “apa2 conference” “visitor” APA conference,

ERA Open Access Sharing Initiatives (examples, etc)

ERA Infrastructure Platform Initiatives (EGI, etc)

PaNdata Support Action

(Ends 30 Nov 11)

Policies and Standards

PaNdataODI

(begins end2011)

JRAs

Users

Data

Software

Integration

Provenance

Preservation

Scalability

PaNdataODI

(begins end2011)

ServicesUsers

Data

PaNdataODI

Virtual Labs

Policies Powder Diff

SAXS & SANS

Tomography

Page 24: Building an Open Data Infrastructure for Science: Policy and Practice Juan Bicarregui STFC e-Science WIFI: BMA Guest “apa2 conference” “visitor” APA conference,

ObjectivesObjective 2 – UsersTo deploy, operate and evaluate a system for pan-European user identification across the participating facilities and

implement common processes for the joint maintenance of that system.

Objective 3 – DataTo deploy, operate and evaluate a generic catalogue of scientific data across the participating facilities and promote

its integration with other catalogues beyond the project.

Objective 4 – Provenance To research and develop a conceptual framework, defined as a metadata model, which can record the analysis

process, and to provide a software infrastructure which implements that model to record analysis steps hence enabling the tracing of the derivation of analysed data outputs.

Objective 5 – PreservationTo add to the PaNdata infrastructure extra capabilities oriented towards long-term preservation and to integrate

these within selected virtual laboratories of the project to demonstrate benefits. These capabilities should, as for the developments in the provenance JRA, be integrated into the normal scientific lifecycle as far as possible. The conceptual foundations will be the OAIS standard and the NeXus file format.

Objective 6 – Scalability To develop a scalable data processing framework, combining parallel filesystems with a parallelized standard data

formats (pNexus pHDF5) to permit applications to make most efficient use of dedicated multi-core environments and to permit simultaneous ingest of data from various sources, while maintaining the possibility for real-time data processing.

Objective 7 – DemonstrationTo deploy and operate the services and technology developed in the project in virtual laboratories for three specific

techniques providing a set of integrated end-to-end data services.

Page 25: Building an Open Data Infrastructure for Science: Policy and Practice Juan Bicarregui STFC e-Science WIFI: BMA Guest “apa2 conference” “visitor” APA conference,

PaNdata ODI Joint Research Activities

PaNdata ODI Service Activities

PaNdata ODI Service ReleasesStandards from

PaNdataSupport Action

uCat

dCat

vLabs

Prov

Pres

Scale

Rel 1 Rel 2 Rel 3 Rel 4

users

data

s/w

Integ

Jun 2014Jun 2013 Dec 2013Dec 2012

Page 26: Building an Open Data Infrastructure for Science: Policy and Practice Juan Bicarregui STFC e-Science WIFI: BMA Guest “apa2 conference” “visitor” APA conference,

Data

The Research Lifecycle – a personal view

the researcher actsthrough ingest and access Research

Environment

Creation

Archival

Access

Storage ComputeNetwork

Data

Services

the researcher shouldn’t have to worry about the information infrastructure

Information Infrastructure

MetaData/ Catalogues

PortalsUser Info feed

DAQ feedData Analysis feed

EGIGEANT

Local resources

e-Infrastructure

Provenanced Research

Page 27: Building an Open Data Infrastructure for Science: Policy and Practice Juan Bicarregui STFC e-Science WIFI: BMA Guest “apa2 conference” “visitor” APA conference,

Overview

• Introduction– What is STFC? What do we do? Why do we do it?

• An example project – Not: ODE, APARSEN, EUDAT, SCAPE, SCIDIP-ES

• RCUK Data Policy Principles

Page 28: Building an Open Data Infrastructure for Science: Policy and Practice Juan Bicarregui STFC e-Science WIFI: BMA Guest “apa2 conference” “visitor” APA conference,

RCUK Principles on Data Policy

Seven (fairly) orthogonal principles: • Public good • Preservation• Discoverability• Confidentiality • First use• Recognition• Costs

} Data

} Access

} Rights

Page 29: Building an Open Data Infrastructure for Science: Policy and Practice Juan Bicarregui STFC e-Science WIFI: BMA Guest “apa2 conference” “visitor” APA conference,

Motivation

Data are a critical output of the research process:

• For the integrity, transparency and robustness of the research record

• Often value increases through aggregation

• Enables new research questions to be addressed

• Supports the wider exploitation of data

Page 30: Building an Open Data Infrastructure for Science: Policy and Practice Juan Bicarregui STFC e-Science WIFI: BMA Guest “apa2 conference” “visitor” APA conference,

Repeat, Repeal, Repurpose

Why might we want access to data?

Three distinct reasons for sharing data:

• Repeat - Validation of previous analysis

– How does this fit with peer review?

• Reconsider/Reform/Repeal/Reverse - Alternative hypotheses in the same

field

– c.f. Reuse

– How does this fit with “right” to first use?

• Repurpose - New research in another field

– c.f. Recycle

– How does this fit with recognition of Intellectual contribution? (What’s in it for me?)

Different concerns and requirements for each type of sharing

Page 31: Building an Open Data Infrastructure for Science: Policy and Practice Juan Bicarregui STFC e-Science WIFI: BMA Guest “apa2 conference” “visitor” APA conference,

Data are a Public GoodPublicly funded research data are a public good, produced in the public interest, which should be made openly available with as few restrictions as possible in a timely and responsible manner that does not harm intellectual property.

Public good – is nonrival and non-excludable [wikipedia] consumption by one does not reduce availability for othersno one can be effectively excluded from using

Research Data recorded factual material commonly retained by and accepted in the

scientific community as necessary to validate research findings As few restrictions as possible

Later (distinguish registration from restriction)Timely

Later (discipline specific)Responsible

Later (maximising access does not necessarily maximising research benefit)Intellectual Property

Later (balance contribution from sharing and from primary research)

Page 32: Building an Open Data Infrastructure for Science: Policy and Practice Juan Bicarregui STFC e-Science WIFI: BMA Guest “apa2 conference” “visitor” APA conference,

Data should be managedInstitutional and project specific data management policies and plans should be in accordance with relevant standards and community best practice. Data with acknowledged long term value should be preserved and remain accessible and usable for future researchPolicies and Plans

DMPs should exist for all dataInstitutional/Departmental v. Project

Standards and Best practicediscipline specific

Long term value eg. NERC has the concept of the Data Value Checklist.

Future research (by current and future generations)Don’t lose it by accident

Page 33: Building an Open Data Infrastructure for Science: Policy and Practice Juan Bicarregui STFC e-Science WIFI: BMA Guest “apa2 conference” “visitor” APA conference,

Data should be discoverableTo enable research data to be discoverable and effectively reused by others, sufficient metadata should be recorded and made openly available to enable other researchers to understand the research and re-use potential of the data. Published results should always include information on how to access the supporting data

Effectively reused by othersfor validation (the data behind the graph – and derivation/provenance?)for reworking (alternative hypothesis)for repurposing(same data or data merging)

Sufficient Metadata ... openly available (stronger than for data)Understand the re-use potential Discoverable – repository? registration?

Published results should include (pointers)could be an email address (but note longevity requirement)

Page 34: Building an Open Data Infrastructure for Science: Policy and Practice Juan Bicarregui STFC e-Science WIFI: BMA Guest “apa2 conference” “visitor” APA conference,

Data may be protectedRCUK recognises that there are legal, ethical and commercial constraints on release of research data. To ensure that the research process is not damaged by inappropriate release of data, research organisation policies and practices should ensure that these are considered at all stages in the research process. Legal

Data Protection Act, Freedom of Information Act, Environmental Information Regulations, EU INSPIRE Directive (Spatial Data) Some Case law: UEA “Climategate” and a few others - can go either way

ICO developing Guidelines on FoIProtection of Freedoms Bill

EthicalConsent, Privacy, National Security, Consent eg longitudinal cohort studies – protection of cohort participationCommercial – shared funding, patent pending, commercial in confidence... all stages in the research process ...

Plan in proposal – peer reviewedexternal review of access requests

Page 35: Building an Open Data Infrastructure for Science: Policy and Practice Juan Bicarregui STFC e-Science WIFI: BMA Guest “apa2 conference” “visitor” APA conference,

Originators may have first use

To ensure that research teams get appropriate recognition for the effort involved in collecting and analysing data, those who undertake research council funded work may be entitled to a limited period of privileged use of the data they have collected to enable them to publish the results of their research. The length of this period varies by research discipline and, where appropriate, is discussed further in the published policies of individual Research Councils.

... may be entitled ...c.f. will be entitled

... limited period of privileged use ... “Withholding of data without good reason beyond this period is not

acceptable”... to publish the results of their research ... to work on and publish the results ... the length of this period varies ...

individual research councils’ policies elaborate

Page 36: Building an Open Data Infrastructure for Science: Policy and Practice Juan Bicarregui STFC e-Science WIFI: BMA Guest “apa2 conference” “visitor” APA conference,

Reusers have responsibilities

In order to recognise the intellectual contributions of researchers who generate, preserve and share key research datasets, all users of research data should acknowledge the sources of their data and abide by the terms and conditions under which they are accessed.

... abide by the terms and conditions ... terms and conditions may exist

to monitor useto promote terms and conditions of use

... should acknowledge the sources of their data ...

Data citation c.f. should be required to acknowledge....

Page 37: Building an Open Data Infrastructure for Science: Policy and Practice Juan Bicarregui STFC e-Science WIFI: BMA Guest “apa2 conference” “visitor” APA conference,

Data sharing is not freeIt is appropriate to use public funds to support the management and sharing of publicly-funded research data. To maximise the research benefit which can be gained from limited budgets, the mechanisms for these activities should be both efficient and cost-effective in the use of public funds.

Data sharing costs!Marginal cost may be small but intial cost may be high

Even if data are free at the point of use (which it may not be) there are costs behind.

Cost is fundable by RCs - but needs tensioning against other research• The policies, models and mechanisms for managing and providing access to research data• Funders will work with grant holders to .....”• No prescription on how funding should be raised.”

You can ask for funding... ... but be reasonable

Page 38: Building an Open Data Infrastructure for Science: Policy and Practice Juan Bicarregui STFC e-Science WIFI: BMA Guest “apa2 conference” “visitor” APA conference,

Outcomes

• Ensure the continuing availability of data of long-term value

• Facilitate the development of mechanisms to improve the management, accessibility and attribution of data

• Promote awareness of legislation and guidance relating to the management and dissemination of information

Page 39: Building an Open Data Infrastructure for Science: Policy and Practice Juan Bicarregui STFC e-Science WIFI: BMA Guest “apa2 conference” “visitor” APA conference,

The Seven Ps

• Public good Public good

• Preservation Preservation

• Discoverability Promotion

• Confidentiality Protection & Privacy

• First use Privilege

• Recognition Probity

• Costs Price

Page 40: Building an Open Data Infrastructure for Science: Policy and Practice Juan Bicarregui STFC e-Science WIFI: BMA Guest “apa2 conference” “visitor” APA conference,

Overview

• Introduction– What is STFC? What do we do? Why do we do it?

• An example project – Not: ODE, APARSEN, EUDAT, SCAPE, SCIDIP-ES

• RCUK Data Policy Principles

• a final thought

Page 41: Building an Open Data Infrastructure for Science: Policy and Practice Juan Bicarregui STFC e-Science WIFI: BMA Guest “apa2 conference” “visitor” APA conference,

The 7 C’s

Creation Collection

Capacity

Computation

Curation

Collaboration Communication

Permanent Access

Provenanced Research

The KnowledgeLifecycle

DataCreation

Archival

Access

Storage ComputeNetworkServicesCuration

Page 43: Building an Open Data Infrastructure for Science: Policy and Practice Juan Bicarregui STFC e-Science WIFI: BMA Guest “apa2 conference” “visitor” APA conference,

Building an Open Data Infrastructure for Science

Policy and Practice

Juan BicarreguiSTFC e-Science

APA conference, November 2011