Imagining the UK National Data Infrastructure - Recommendations
Imagining the UK National Data Infrastructure
Connecting up Big Data in the UK
Project Directors Group (PDG)
Report of the UK National e-Infrastructure Project Directors Group workshop held at the Farr Institute, London, 15th December 2014
Authors:
David Fergusson, Francis Crick Institute
David Colling, Imperial College / GridPP / WLCG
David de Roure, University of Oxford / ESRC
Martin Hamilton, Jisc (editor)
Brian Matthews, STFC
Jacky Pallas, University College London / eMedLab
David Salmon, Jisc
Jeremy Yates, University College London / STFC DiRAC
Contents
1. Purpose and scope
2. Integration
3. Capability
4. Connections
5. Infrastructure
6. Deliverables
7. An Imagined Data Infrastructure – Another Traditional View
1. Purpose and scope
The data ecosystem in the UK is expanding rapidly to cope with the demands of the UK’s data intensive research. We recognise the key challenges ahead if we are to develop our world-leading data infrastructure in a sustainable and innovative way. In response to these challenges the National e-Infrastructure Project Directors Group (NeI-PDG) brought together in December 2014 a large number of representatives from RCUK-funded ‘Big Data’ projects to imagine how the national data infrastructure could develop.
Figure 1 – The UK National Data Landscape for Research
The working group have made a number of recommendations in the key themes of Integration, Capability, Connections, and Infrastructure (as identified in the EPSRC e-Infrastructure roadmap¹) and we outline some key deliverables for 2015.
1 http://www.epsrc.ac.uk/newsevents/pubs/e-infrastructure-roadmap/
2. Integration
“Our aspiration is for the UK to have an integrated e-infrastructure: one that is run and managed as a whole without silos or boundaries, where there are simple processes by which users can get access to the e-infrastructure they need across the eco-system, as appropriate for the type or stage of research they are doing.”
We do not envisage the UK data infrastructure as a single system but rather an integrated solution which reflects the range of excellent science supported via both large-scale projects and research institutions. We propose to build on existing resources and work towards better integration through best practice for sharing data coupled to extensive training support.
The UK engages in a broad range of international projects such as EUDAT, ELIXIR, and SKA. There is a need for a “single voice” for the UK in the international arena which can represent the academic community in large collaborations.
Recommendation: Build on international activity (standards, policies etc.) in a more strategic and co-ordinated way. There is a role for an RCUK coordinator to ensure that the UK gets value for money from its involvement in, and subscriptions to, large scale international collaborations.
There is an expectation that significant capital investment in the research e-Infrastructure should deliver benefits for UK industry, especially allowing SMEs to benefit through access to big data and compute resources. Some of these benefits can be realised through direct collaboration between industry and academic institution(s). However, we believe that there are additional opportunities in leveraging funding with Innovate UK and established (or future) Catapult Centres.
Recommendation: Identify funding opportunities within existing streams to allow academic institutions to interact more effectively with the existing and projected future Catapult centres, as a mechanism to engage industry more effectively around key areas such as digital health and future cities/urban transformation.
We can only work effectively and share data with researchers, whether UK, international or industry, if datasets are managed and discoverable.
Standards
● Datasets
● Metadata, e.g. schema.org, CKAN, DataCite and others. We need methods to capture metadata automatically
● Internationally agreed, community driven
● Domain/project specific, regulatory (e.g. health)
De facto standards have often been driven by common hardware in instruments across domains, e.g. EXIF in digital imaging. We then need to layer domain-specific metadata standards, with “Discovery Metadata”, on top of those. In some domains these are well established, such as Biosharing.org; however, this is not widely the case.
Metadata is a key enabler of data management and discovery, and at “big data” scales its collection and sometimes its use must be automated. However, there is a need to document the current metadata landscape and best practice, and identify areas for further development, improvement and standardization. This documentation will become a living document, maintained in collaboration with those organisations involved in the Open Research Data and Data Transparency areas, e.g. the Digital Curation Centre and HE institutions.
Recommendation: Metadata is a key enabler of data management and discovery, which at “big data” scales must be automated. However, there is a need to document the current metadata landscape and best practice, and identify areas for further development, improvement and standardization.
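As an illustration of what automated metadata capture could look like at the level of a single file, the following is a minimal Python sketch. It is not drawn from any of the projects named in this report: the function name `capture_metadata` and the record's field names are assumptions made for the example, and a real deployment would map them onto an agreed schema such as DataCite or schema.org.

```python
import hashlib
import os
import tempfile
from datetime import datetime, timezone


def capture_metadata(path, project, owner):
    """Build a minimal discovery-metadata record for one data file.

    The field names are hypothetical; they are not taken from any
    published metadata standard.
    """
    sha256 = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large files need not fit in memory.
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            sha256.update(chunk)
    stat = os.stat(path)
    modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
    return {
        "name": os.path.basename(path),
        "size_bytes": stat.st_size,
        "sha256": sha256.hexdigest(),
        "modified": modified.isoformat(),
        "project": project,
        "owner": owner,
    }


def demo():
    """Create a small temporary file and return its metadata record."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.dat")
        with open(path, "wb") as handle:
            handle.write(b"example data")
        return capture_metadata(path, project="demo-project", owner="A. Researcher")
```

Calling `demo()` returns a dictionary that could be written alongside the data or sent to a catalogue; the checksum gives a simple way to confirm later that a copy of the file is unchanged.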
In order to promote sharing at scale, researchers must see some benefit beyond compliance with RCUK and other funder policies. Sharing of datasets should bring academic credit through data citations (for example via the DataCite consortium), with DOIs or other persistent identifiers being associated with published datasets. Publication of datasets should be captured as an impact outcome of funded research through metrics portals such as Researchfish. Jisc is also reviewing proposals for innovations in data management through the Research Data Spring initiative².
Recommendation: Recognise the impact of research datasets on the community through the use of DOIs or other common identifiers and, equally, give credit to researchers for generating datasets. Metrics should be captured via existing mechanisms such as Jisc, Gateway to Research and Researchfish.
2 http://www.Jisc.ac.uk/rd/projects/research-data-spring
3. Capability
There is broad recognition of research data management as an essential activity across the project lifecycle rather than just a paper exercise at the time of grant submission, as illustrated in the DCC Data Lifecycle model below. RCUK has driven the requirement for institutions to show leadership in research data management, with a joint position on Data Management³ and the EPSRC in particular asking HEIs to meet specific standards by May 2015⁴.
Figure 2 – The Digital Curation Centre Lifecycle Model
Training in research data management needs to speak to projects/centres, institutions and individual researchers at all levels. There is a huge opportunity to reach Early Career Researchers in particular through existing Centres for Doctoral Training via a “train the trainers” type approach.
3 http://www.rcuk.ac.uk/research/datapolicy
4 http://www.epsrc.ac.uk/files/aboutus/standards/clarificationsofexpectationsresearchdatamanagement/
Recommendation: Training in data management - Build upon existing PDG, SSI and DCC activities to create a concerted and coordinated approach to promoting best practice in data management. Capitalize on existing activities to orchestrate this, e.g. “train the trainers” whereby the actual training is delivered by projects and institutions.
Capacity building and skills training
The need for technology transfer between subject domains, in terms of staff experience rather than commercialization, was recognised. While RCUK has a number of schemes for academic placements such as Bridging the Gaps, there is no equivalent for technical staff. One possible activity was proposed: a cross-RCUK big data tech-specific scheme. Proposals to such a scheme would preferably be driven by an actual problem, ideally across disciplines or e-Infrastructure projects, and provide potential for host institution staff to gain management or supervisory experience.
Recommendation: Sharing excellence across domains - e.g. a cross-RCUK initiative buying out staff time (not just academics) for a defined period to work on specific activities, with proposals from two subject domains as a minimum.
4. Connections
User management
User management systems are essential to enable researcher access to regional and national systems. This is especially important for the health informatics and administrative data networks, which require additional security and two-factor authentication systems. There are existing activities around Shibboleth, SAFE, VOMS, Moonshot and Safe Share, but existing well established services and facilities have their own approaches that need to be taken into account. Pilots will lead to recommendations for common standards. There is a particular role for Jisc and RCUK here in terms of international standards liaison, e.g. W3C, schema.org, Research Data Alliance. This will require wider buy-in from the community as well as pump-priming funding.
Data Transfer and access
Many closely coupled systems co-locate compute and storage, and there are some examples of tiered approaches when huge volumes of data are involved, e.g. WLCG. The group felt that these issues were typically addressed as part of projects. Exemplars for researcher access to datasets (and compute) respecting trust boundaries include EBI, UKDA, NERC data centres and GridPP data movement orchestration. A comparison was made between LHC data (instrument in the stream) and the Twitter “firehose” for social sciences studies.
There is a requirement for remote data access for researchers with the necessary control and orchestration, and caching tiers. Examples range from a client running on an end user workstation (GridPP) to access mediated through a website (EBI). We propose a new project to develop cross-discipline solutions to managing data transfer through joint working with biomedical and physical science domains.
Recommendation: Orchestrating data transfer - the problem is widely recognised, and there are already understood approaches in some subject domains. This is proposed as a joint project between Crick, EBI and GridPP.
5. Infrastructure
Networks
The group felt that, with the recent investment in Janet6, the network had sufficient capacity and “room for expansion”. However, access to high capacity for short periods would increasingly be required. A number of points were raised about campus networks, which would be challenging, difficult or expensive to address.
● “Last mile” - e.g. campus network to end user.
● Is the campus LAN fit for purpose for NeI users?
● Do campus firewalls have sufficient throughput?
● Is the campus Janet connection oversubscribed / is a separate research connection required?
● What would a campus focal point look like? e.g. GridPP use of Squid cache
● Estates constraints on many institutions - listed buildings, busy city streets etc.
● Investment in Janet6, improved connectivity to major research institutions and improved resilience for day-to-day use.
Q: Do we need a new equivalent to the HEFCE LAN/MAN initiative?
Q: What would a “NeI Network Appliance” look like?
Would it be
● a Virtual Machine (VM) image or
● a tuned Transmission Control Protocol (TCP) stack, e.g. with an adjusted Maximum Transmission Unit (MTU)?
It would need to use AAAI and it should schedule file transfers.
Recommendation: The group felt that more flexible access to high capacity networking for defined periods would increasingly be required. For example, the eMedLab project will be moving 2.5 PB of data from EBI at the start of the project (April 2015).
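To give a sense of scale for a transfer of this size, the short Python sketch below estimates how long 2.5 PB would take at a few sustained link rates. The rates and the efficiency factor are illustrative assumptions only, not statements about Janet6 or any project's actual connection, and a petabyte is taken as 10^15 bytes.

```python
PB = 10**15  # decimal petabyte in bytes (an assumption; some sites use 2**50)


def transfer_days(volume_bytes, link_gbit_per_s, efficiency=1.0):
    """Estimate transfer time in days for a sustained link rate.

    `efficiency` is a hypothetical factor (0 to 1) standing in for
    protocol overhead and contention; 1.0 means the full line rate.
    """
    bits = volume_bytes * 8
    usable_bits_per_s = link_gbit_per_s * 1e9 * efficiency
    return bits / usable_bits_per_s / 86400


if __name__ == "__main__":
    for rate in (1, 10, 100):  # illustrative link rates in Gbit/s
        days = transfer_days(2.5 * PB, rate)
        print(f"{rate:>3} Gbit/s: {days:.1f} days")
```

At a sustained 10 Gbit/s the estimate is a little over 23 days, and halving the efficiency doubles it, which is one way to see why bookable high-capacity access for a defined period matters for bulk moves of this kind.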
Archive
There was much discussion around archives, defined as long-term storage of immutable datasets. Some projects have their own archives and some disciplines have international repositories (e.g. EBI). However, the RCUK data sharing policy has specific requirements to make research data objects available for up to 10 years after the last requested access. The group felt that it was difficult to focus on approaches offered by individual institutions and proposed a survey of the data management landscape. Any institutional archive should provide DOIs or persistent identifiers for datasets to allow discovery, and a means of crediting researchers for creating and depositing datasets (as outlined earlier).
6. Deliverables
Pre-Requisites
• The Data Analytics and Open Research Data activities in the data e-Infrastructure should be supported by a simple layered middleware and software e-Infrastructure.
• This e-infrastructure should consist of a Common Basic Layer (CBL) on which a Research Domain Specific layer would sit.
• The Common Basic Layer (CBL) should therefore be small and capable of generic use.
• The Research Domain Specific Layer (RDSL) needs to be constructed at the same time.
• Key elements of the CBL are
  o The AAAI and Security Models – I am who I am and I can use resources.
  o Control access to data – The RCUK AAAI project SAFE SHARE is delivering aspects of this.
  o Data In-flight Security – my data is going to flow ok and only the right people will get it and see it
  o Data at-rest Security – it’s looked after and I am obeying the pertinent regulations. The data are open to those who are allowed to see it; it is searchable and query-able.
  o Cloud/Grid middleware to enable appropriate resources to be used. From the user perspective this can be broken down into the following attributes:
    1. Can I see resources?
    2. You can use resources,
    3. and actually using resources,
    4. here is what you have used and
    5. here are your results in the place you asked them to be put.
  o Wrapping compute around big data – use of virtualisation and containers to send our workflows to where the data are residing. The local compute simply executes the workflows we have constructed/run on other machines.
  o An Application Program Interface (API) that allows Data Policies (e.g. metadata requirements) to be actualised in applications.
  o Simple Tools and Services to enable data discovery and exploration. Data can be accessed and queried using published metadata and data transport tools.
• An RDSL would have elements such as
  o Applications or web portals that allow its researchers to use CBL services. These are the user-friendly User Interface (UI) and would be the gateway to the NeI for the average researcher.
  o If needed, extra security and AAAI requirements could be included here.
  o Access to training resources could be included, such as online courses and tests.
  o The interfaces and APIs to the Data Analytics and Open Research Data infrastructures would reside in the RDSL.
• Hardware will be domain and activity specific. However, object stores that can act as repositories could be centralised and run as a common activity between the RCs.
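One of the CBL elements listed above is an API that lets Data Policies, such as metadata requirements, be actualised in applications. As a rough illustration only, the Python sketch below shows one shape such an interface might take. The class `DataPolicy`, its methods and the example field names are hypothetical and do not describe any existing NeI, Jisc or RCUK software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataPolicy:
    """A hypothetical data policy: the metadata fields a dataset must carry."""

    name: str
    required_fields: tuple

    def missing_fields(self, metadata):
        """Return the required fields that are absent or empty in `metadata`."""
        return [field for field in self.required_fields if not metadata.get(field)]

    def check(self, metadata):
        """Return True if the record satisfies the policy, else raise ValueError."""
        missing = self.missing_fields(metadata)
        if missing:
            raise ValueError(f"{self.name}: missing metadata fields {missing}")
        return True


# Illustrative policy only; the field names are assumptions for this sketch.
EXAMPLE_POLICY = DataPolicy(
    name="example-policy",
    required_fields=("owner", "project", "grant_no"),
)
```

An application in the domain-specific layer could call `EXAMPLE_POLICY.check(record)` before accepting a deposit, so that the policy is enforced at the point where data are produced rather than audited afterwards.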
Our progress in creating these Pre-Requisites through current activities is listed below.
Table 1: Pre-Requisites for the Data Infrastructure
Infrastructure | Projects | Who is Responsible?
Authentication, Allocation and Authorisation Infrastructure with 2 factor Security Controls | Jisc-led Safe Share Project already underway | Jisc and partners from ESRC and MRC
 | Research Domain aspects of AAAI need to be constructed. | Research Domains
Data-in-flight Information Assurance | Jisc | Jisc, Research Domains
Data-at-rest Information assurance | No overall description, or indeed none | NeI as a whole
Data abstraction layer development | NeI Projects | PDG members, RCs
Networks | High Capacity Networking | Jisc
 | Local Research Organisation | RO
 | Links to Business | Jisc
Advanced Compute | NeI Projects | PDG members, RCs
Data Storage Facilities | NeI Projects | PDG Members, RCs
Cloud/Grid Infrastructure | GridPP, JASMIN2, EMBASSY CLOUD, eMedLab | Cloud WG, PDG
Tools and Software | Varied – no coherence | Big Data SIG, PDG and RCs
What needs to be tried out and tested?
The tools and software needed to discover data and move it around (needed for multiple data sources) need to be developed into a coherent and simple package. Listed below is a set of deliverables that can be achieved in 2015 to enable this. However, these are dependent on the activities listed in Table 1, which is why the tests will be done in the field on live NeI systems.
Table 2: List of Deliverables
Recommendations | Action | Milestone (OWNER)
Training in data management | Projects to produce data management plans and run courses on data management for user communities and staff. CDTs to be involved. | DMPs and Courses in place by June 2015 (PDG)
Document the current metadata landscape and best practice | RCs to document the relevant Metadata standards and publish these standards. Create code libraries that applications can use to produce metadata when data are produced. | Publish Standards and insist on their use – particularly when data are produced (RCUK). Demonstrate on PDG Projects’ systems (PDG)
Develop data abstraction layer | Build, test and open source software tools for data abstraction and presentation of meta-data | Integration of iRODS and OpenStack as a POC for data integration and presentation (PDG)
Co-ordination of International Projects to extract best value and influence Agendas | Produce report on the various national and international projects the UK is involved in | Produce Strategy Document (RCUK)
Working with Catapult Centres | Work with Innovate UK to ensure that business has access to Janet. RCUK NeI Group to communicate to academic community opportunities to work with catapult centres | Simple contracts and portal to make sure business can book network access easily (Jisc). Adding to existing regular research bulletins (RCUK). Organise joint academic/Innovate UK workshops to link academy to Catapults (RCUK)
Data Transfer 1 – Data transport and orchestration | Make FTS a generic tool to act as an aggregator and orchestrator and link to the RCUK AAAI | Test on the DiRAC, JASMIN2 and eMedLab systems (PDG, Jisc)
Data Transfer 2 – High Capacity Network Access | Secure Transport of Data to eMedLab and RAL WOS | Transfer of multi-PB EBI data to eMedLab and DiRAC@Durham Data to RAL WOS (PDG, Jisc)
Data Transfer 3 – Creating Single Name Spaces | Create WLAN and VLANS in projects to create single filesystems (global spaces) between distributed systems | Test on DiRAC systems between Durham and Edinburgh and between EBI and eMedLab (PDG, Jisc). Test on wLHC and DiRAC (PDG, Jisc)
Knowledge Transfer and Consultancy | Produce Work programme | Produce by April 2015
7. An Imagined Data Infrastructure – Another Traditional View
A schematic of what a National Data e-Infrastructure may look like. Note the ubiquitous presence of Janet.
Key: a Janet Connection
The Proposed CBL and RDSL would be the enabling middleware infrastructure for this e-Infrastructure.
[Figure: schematic with nodes labelled HEI 1, HEI 2, HEI 3, JASMINE2, DIAMOND, National Deep Archive Service, National Tertiary Storage Service, and Sanger, EBI, ESRC, DiRAC, ARCHER]
The Attributes and functional blow-up of a TYPICAL Local System, the National Tertiary Storage Service and the National Deep Archive Service
[Figure: layers labelled “Local” Tertiary Storage Layer; Meta Data Presented to World; Database Creation/Ingestion Layer and Analytics; Parallel File System, HEI RDM/Repository; Data Generator (Experiments, Clusters, PCs....)]
The principal components needed for such an e-Infrastructure are:-
1. Local tertiary storage platforms for active data.
2. Data Base Creator/Ingestor widget to create structured data from unstructured data and policies to meta-data tag such data – e.g. owner, project, grant no. etc.
3. A National tertiary storage /metadata service to build up and store metadata from the other databases in the National e-I, as well as store our major active databases.
4. A National Deep Archive Service to store data that has been produced by National Facilities and to provide data replication services for the National E-Infrastructure.
This is a traditional representation of a computing infrastructure. It is very much the end point of the proposed work in this document, which is why it belongs at the end. The work proposed in this document enables this infrastructure to exist in an efficacious way. The outputs we propose are the real Data Infrastructure in that they enable data to be moved, selected, and queried. It is these that give the data its form and value.