Solving the data problem for research and beyond
Matthew Dovey, Head of e-infrastructure strategy, Jisc
John Kaye, Senior co-design manager - research data, Jisc

28/04/2017

Jisc research strategy

28/04/2017 Solving the data problem 2

Research is changing

» The 4th Paradigm of data-intensive research and data-driven innovation

» Open by default

» Dependency on digital infrastructures and digital transformation

» Globally competitive environment – digital transformation is open to everyone

The vision

» Jisc’s vision is to make the UK the most digitally advanced research nation in the world by fully exploiting the possibilities of modern digital empowerment, content and connectivity

» Jisc will provide the underlying infrastructure which can scale and flex to enable researchers to deliver the outcomes that funders, government, industry and society want from the sector

» Our vision is of a seamless, interoperable digital infrastructure which gives researchers and research organisations the freedom to apply their strategic resources to maximise their research impact and minimise the cost and burden of the supporting operations

Underpinning infrastructure

Information model

Dynamic research platform

» Cyber-Security Support

» Data Assurance

» Network Performance Optimisation

» Procurement Frameworks

» Research Analytics

» Research Outputs - Publication, Curation, Archiving and Preservation

» Content Licensing, Discovery and Management

» Standards and Identifiers

» Vocabularies

» Data Model

» Janet Backbone

» Federated Access and Identity Management

» Data Centres

Research enabling services

» Advanced Networking Technologies

» Data Warehouse

» Flexible Storage

» Metadata Profiles

» Application Profiles

» Data Brokerage

Top three priorities

» The comprehensive connectivity across the infrastructure at a diversity of scales (local, regional, national, international)

» A coherent suite of research services which reduces the burden on institutions, increases efficiency, delivers solutions to common problems and improves the UK’s research performance

» Representation of the UK’s digital needs in our engagements and advocacy in the national and international arena

Jisc will provide three elements of the vision

Research strategy outcomes

1. The UK’s research environment is underpinned by flexible, scalable infrastructure where standards-based approaches ensure that data can be generated, moved, stored, found and used with the minimum of cost or burden to the institution and the researcher

2. The transition from Open Access to Open Science where research objects are findable, accessible, interoperable and reusable by academia, industry and society for wider economic and social benefit

3. UK interests are represented in both international policy and operational environments enabling UK researchers to collaborate, compete and comply with the global research community

4. The UK maintains its position as a digital thought leader and shaper of both research infrastructures and the wider scholarly communications environment

5. The investment in the mission-critical UK e-infrastructure required by the research base is safeguarded for the long term, enabling UK research to continue to punch above its weight in the global research environment

Tiered storage

Motivation and engagement

» Initial interest explored with SDC-North tenants

» Informal vendor discussions to determine technical feasibility

» Requirements workshop – November 2016

» Active working group to develop full business case for phased implementation in 2017

» Progress and input from wider community via https://community.jisc.ac.uk/groups/tiered-storage

Opportunities

» Provide a national storage provision filling a current gap

› Universities looking at ever-increasing storage requirements and needs

› Confused by different approaches (in house, cloud, hybrid), technologies, solutions, pricing structures

› Different requirements and policies (internal, and externally imposed)

» Remove headache of procurement and management across multiple providers and technologies

» Maximise Janet network value

» De-risk universities in an area of exponential growth

› Low-risk/PAYG infrastructure avoids over-investment

Benefits

» Savings on costs of power, cooling and carbon arising from a modern consolidated infrastructure in a high-specification datacentre with modern cooling

» Procurement cost savings not just from quantity of procurements, but also from timeliness of procurements: you will get cheaper overall storage costs by procuring 100TB a year in each of five years than procuring 500TB once (simply because you get more storage for your money as time goes on)

» Operational savings on time for installing and managing storage hardware

» Clear compliance with research council expectations for appropriate data management across the research lifecycle

» Benefits across the University sector of providing a standard for research data management and a standard costing
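
The procurement timing point in the benefits list can be checked with a little arithmetic. The sketch below assumes a hypothetical starting price of £100 per TB and a hypothetical 20% annual fall in the price per TB; neither figure comes from the slides, and the function names are illustrative only.

```python
def upfront_cost(total_tb: float, price_per_tb: float) -> float:
    """Cost of buying all capacity in year 0 at today's price."""
    return total_tb * price_per_tb


def phased_cost(tb_per_year: float, years: int, price_per_tb: float,
                annual_decline: float) -> float:
    """Cost of buying the same amount each year while the price per TB falls."""
    return sum(
        tb_per_year * price_per_tb * (1 - annual_decline) ** year
        for year in range(years)
    )


# Hypothetical inputs (not from the slides): GBP 100 per TB today,
# falling by 20% a year.
once = upfront_cost(500, 100.0)
phased = phased_cost(100, 5, 100.0, 0.20)
print(f"500TB once:                  GBP {once:,.0f}")
print(f"100TB a year for five years: GBP {phased:,.0f}")
```

Under these assumed figures the phased purchase comes out cheaper. The size of the gap depends entirely on the assumed rate of price decline, and the sketch ignores power, cooling and staff costs.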

Multi-vendor tiered storage proposal

[Diagram: Jisc tiered storage service]

» Customer side: customer infrastructure (eg VMware vSphere), customer applications, RDM share services, applications

» Interfaces shown: iSCSI, SMB, CIFS, NFS, S3, https, Swift, Ceph

» HSM appliance

» Storage pools: distributed storage pool, cloud storage pool, archival storage pool

» Other labels: AWS, Amazon Glacier, Google, Arkivum, Cloud9

HSM data policy

» Pool prioritisation

» Replication

» Snapshots

» SLAs (eg retention, availability, security)

Tiered storage proposal - pools

» Distributed storage pool

› Overview: Data stored near sites (possibly based on SDC1, SDC2 and other locations, eg national research e-infrastructure centres, other NRENs) to give onsite/near-site recovery times. Use of erasure-encoding to give equivalence of 2 copies with ~1.6 times storage capacity

› Class: Leverage Janet backbone to deliver onsite equivalence

› Copies: Equivalent to 2 copies including offsite

› Recovery Time Objective: Onsite/near-site equivalent

› Recovery Point Objective: <1 hour

» Cloud storage pool

› Overview: Managing data copies across multiple cloud providers

› Class: Archive

› Copies: Equivalent to 2 copies including offsite

› Recovery Time Objective: <1 hour

› Recovery Point Objective: 1-24 hours

» Archival storage pool

› Overview: Managing data copies across multiple cloud “vault” providers (ie 99% or 100% guaranteed data recovery)

› Class: Vault

› Copies: Guaranteed recovery

› Recovery Time Objective: N/A

› Recovery Point Objective: N/A
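
The ~1.6 times figure quoted for the distributed storage pool is consistent with a k+m erasure code, in which data is split into k data fragments plus m parity fragments. The sketch below assumes a hypothetical 10+6 layout, which the slides do not specify, and compares the raw capacity it needs with keeping two full copies. The function names are illustrative.

```python
def raw_capacity_tb(usable_tb: float, data_fragments: int,
                    parity_fragments: int) -> float:
    """Raw storage needed to hold usable_tb under a k+m erasure code."""
    total_fragments = data_fragments + parity_fragments
    return usable_tb * total_fragments / data_fragments


def replicated_capacity_tb(usable_tb: float, copies: int) -> float:
    """Raw storage needed when keeping whole copies of the data."""
    return usable_tb * copies


# Hypothetical 10+6 layout (an assumption, not taken from the slides):
# 16 fragments are stored for every 10 fragments of data.
erasure = raw_capacity_tb(100, data_fragments=10, parity_fragments=6)
mirrored = replicated_capacity_tb(100, copies=2)
print(f"Erasure coded: {erasure:.0f}TB raw for 100TB usable")
print(f"Two copies:    {mirrored:.0f}TB raw for 100TB usable")
```

With a maximum-distance-separable code such as Reed-Solomon, any 10 of the 16 fragments are enough to rebuild the data, so a layout like this can lose several fragments or sites while using less raw capacity than two full copies.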

Requirements and demand working group

Current members

» University of Oxford

» University of Leeds

» University of Manchester

» University College London

» London School of Economics

» Natural History Museum

» Additions welcome

Key outputs

» Phased technical specification

» Use scenarios

› (eg data movement)

» Business and financial case

› (including TCO analysis)

» Market review and supplier engagement

Tiered storage positioning

[Diagram labels: Storage Providers; Jisc Tiered Storage; Other Jisc Services; Jisc RDSS; Local Research Data Systems; Other local systems (financial, T&L, etc); Storage Policy (shown four times)]

Jisc research data shared service

The futures portfolio consists of three big areas

[Diagram labels]

» Store services

» Playlists

» Diagnostic tool builder

» Curation and remix

» Learner Analytics Services

» Digital capability

» Learning analytics

» Digital launchpad

» Apprentice workforce development

» Digital leadership

» Summer of student innovation

» Analytics academy

» Analytics labs

» Qualification verification

» App and content store

» Research data discovery

» Research data usage metrics

» Equipment data

» Repository and preservation platform

» Research data shared service

Research data discovery service

Alpha site

Research data usage and metrics

Shared Service Goals

» Policy compliance

» Efficiency

» Better research

A key requirement

…..but a challenging problem

Implementing Archivematica for research data preservation at York and Hull

Jenny Mitcham (Digital Archivist), University of York

Research data shared service overview

Data model

Service MVP (Alpha – July 2017)

Pilot MVP components

RDSS component offer: number of pilots requiring (total = 17)

» RDSS Repository: 14

» RDSS Preservation: 17

» RDSS Reporting: 14 (TBC)*

» RDSS Storage: 16

* Under review as additional reporting options may be available, also differing offers from full dashboard/analytics to API only. Further discovery work is underway.

Pilot Alpha MVP integrations

RDSS component offer: number of pilots requiring (total = 17)

» EPrints (Repository): 12

» DSpace (Repository): 4

» Hydra (Repository): 2

» Symplectic (CRIS)*: 4

» Pure (CRIS): 3

» Converis (CRIS): 1

» Authentication: 17

* RDSS Framework Supplier

Middlesex Figshare implementation

» Accelerated deployment in 10 weeks (Installation by 10th November)

» Stakeholder engagement

» Development of institutional requirements

» Sign up to DataCite membership

» Implementation team (informal)

» Integration with Jisc Storage

» Implementation of pilot data repository

The University of Jisc Sandbox

» Scratch environment for testing of configuration and integration of service platform components

» A mock HEI to integrate with

» Infrastructure as code, learning from building and managing the mixture of SaaS and custom applications. This will allow easy push-button installation of products

» Working with test data and metadata taken from real HEI repositories

» Consistent and standardised UX

» Bespoke development environment

[Diagram labels]

» Apps

» CRIS

» Test data: Zenodo, RDSS pilot HEI repositories, publisher data

» AWS storage + tools

» Data repositories: Figshare, Hydra, Islandora, Haplo

» Publication repositories: EPrints, DSpace

» Preservation systems: Preservica, Archivematica

» Additional software and services

Assessing researchers’ needs - Data asset framework

Preservation of research data

“I currently spend about £1,200 pa on data storage from my own salary. I have the highest data needs in my School, and there is no plan in place for storing my data.”

Sensitive research data

“It would be helpful to clarify the rules for storing anonymised data on cloud services. My departmental rules say this is never OK, however this seems to contradict University rules.”

University services to support RDM

“Support is woeful in the university currently, in particular long-term data archiving is critically required. Most of my non-current data is rotting on CD's and hard-drives.”

University services to support RDM

“Please, individualise the support. Workshop are useless, emails with information are useless, brochures are useless, posters are useless.”

Researchdata.network

Discussion

What we’d like to know…..

» What are your current priorities and pain points with managing data?

» Do you have or are you expecting a data deluge?

» What would you like Jisc to provide for managing data?

» What would you like the Jisc offer to look like?

» Have we missed anything in our pilots? Are there gaps?

» Are there any aspects of data management you’d like to keep ‘in-house’?

» Do you have issues around research systems user experience for researchers and staff?

» Do you have issues around systems interoperability?

» Do you have preservation needs beyond research data (eg records management, archives)?

» Can you share any hooks or incentives to engage researchers in data management services?

» Any tips for success and lessons learned that we can utilise in implementing systems?

» Anything else…..

Matthew Dovey
Head of e-infrastructure strategy
[email protected]

John Kaye
Senior co-design manager – Research data
[email protected]

jisc.ac.uk/rd/projects/research-data-shared-service
https://community.jisc.ac.uk/groups/tiered-storage