Solving the data problem for research and beyond
Matthew Dovey, Head of e-infrastructure strategy, Jisc
John Kaye, Senior co-design manager - research data, Jisc
28/04/2017
Research is changing
» The 4th Paradigm of data-intensive research and data-driven innovation
» Open by default
» Dependency on digital infrastructures and digital transformation
» Globally competitive environment – digital transformation is open to everyone
28/04/2017 Solving the data problem 3
The vision
» Jisc’s vision is to make the UK the most digitally advanced research nation in the world by fully exploiting the possibilities of modern digital empowerment, content and connectivity
» Jisc will provide the underlying infrastructure which can scale and flex to enable researchers to deliver the outcomes that funders, government, industry and society want from the sector
» Our vision is of a seamless, interoperable digital infrastructure which gives researchers and research organisations the freedom to apply their strategic resources to maximise their research impact and minimise the cost and burden of the supporting operations
The vision
Underpinning infrastructure
Information model
Dynamic research platform
» Cyber-Security Support
» Data Assurance
» Network Performance Optimisation
» Procurement Frameworks
» Research Analytics
» Research Outputs - Publication, Curation, Archiving and Preservation
» Content Licensing, Discovery and Management
» Standards and Identifiers
» Vocabularies
» Data Model
» Janet Backbone
» Federated Access and Identity Management
» Data Centres
Research enabling services
» Advanced Networking Technologies
» Data Warehouse
» Flexible Storage
» Metadata Profiles
» Application Profiles
» Data Brokerage
Top three priorities
» The comprehensive connectivity across the infrastructure at a diversity of scales (local, regional, national, international)
» A coherent suite of research services which reduces the burden on institutions, increases efficiency, delivers solutions to common problems and improves the UK’s research performance
» Representation of the UK’s digital needs in our engagements and advocacy in the national and international arena
Jisc will provide three elements of the vision
Research strategy outcomes
1. The UK’s research environment is underpinned by flexible, scalable infrastructure where standards-based approaches ensure that data can be generated, moved, stored, found and used with the minimum of cost or burden to the institution and the researcher
2. The transition from Open Access to Open Science where research objects are findable, accessible, interoperable and reusable by academia, industry and society for wider economic and social benefit
3. UK interests are represented in both international policy and operational environments enabling UK researchers to collaborate, compete and comply with the global research community
4. The UK maintains its position as a digital thought leader and shaper of both research infrastructures and the wider scholarly communications environment
5. The investment in the mission-critical UK e-infrastructure required by the research base is safeguarded for the long term, enabling UK research to continue to punch above its weight in the global research environment
Motivation and engagement
» Initial interest explored with SDC-North tenants
» Informal vendor discussions to determine technical feasibility
» Requirements workshop – November 2016
» Active working group to develop full business case for phased implementation in 2017
» Progress and input from wider community via https://community.jisc.ac.uk/groups/tiered-storage
Opportunities
» Provide a national storage provision filling a current gap
› Universities looking at ever-increasing storage requirements and needs
› Confused by different approaches (in house, cloud, hybrid), technologies, solutions, pricing structures
› Different requirements and policies (internal, and externally imposed)
» Remove headache of procurement and management across multiple providers and technologies
» Maximise Janet network value
» De-risk universities in an area of exponential growth
› Low-risk/PAYG infrastructure avoids over-investment
Benefits
» Savings on costs of power, cooling and carbon arising from a modern consolidated infrastructure in a high-specification datacentre with modern cooling
» Procurement cost savings not just from quantity of procurements, but also from timeliness of procurements: you will get cheaper overall storage costs by procuring 100TB a year in each of five years than procuring 500TB once (simply because you get more storage for your money as time goes on)
» Operational savings on time for installing and managing storage hardware
» Clear compliance with research council expectations for appropriate data management across the research lifecycle
» Benefits across the University sector of providing a standard for research data management and a standard costing
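The point about timeliness of procurement can be checked with a little arithmetic. The sketch below is illustrative only: the starting price of 100 GBP per TB and the 20% yearly fall in price are assumptions, not figures from the slides. Only the comparison of 100TB a year for five years against 500TB once comes from the text.

```python
def upfront_cost(total_tb: float, price_per_tb: float) -> float:
    """Cost of buying all the capacity in year 0 at today's price."""
    return total_tb * price_per_tb


def phased_cost(tb_per_year: float, years: int, price_per_tb: float,
                annual_decline: float) -> float:
    """Cost of buying tb_per_year each year while the unit price falls."""
    return sum(
        tb_per_year * price_per_tb * (1 - annual_decline) ** year
        for year in range(years)
    )


# Hypothetical inputs: neither figure appears in the slides.
START_PRICE = 100.0    # assumed price per TB in year 0 (GBP)
ANNUAL_DECLINE = 0.20  # assumed yearly fall in the price per TB

once = upfront_cost(500, START_PRICE)
phased = phased_cost(100, 5, START_PRICE, ANNUAL_DECLINE)

print(f"500TB once:                  GBP {once:,.0f}")
print(f"100TB a year for five years: GBP {phased:,.0f}")
```

With these assumed inputs the phased route comes to about two thirds of the single purchase. The saving shrinks as the assumed price decline shrinks, and it only applies if the full 500TB is not needed from day one.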
Multi-vendor tiered storage proposal
Diagram: Jisc tiered storage service
» Applications: customer infrastructure (eg VMware vSphere), customer applications, RDM shared services
» Access protocols: iSCSI, SMB, CIFS, NFS, S3, https, Swift, ceph, …
» HSM appliance
» HSM data policy: pool prioritisation, replication, snapshots, SLAs (eg retention, availability, security)
» Storage pools: distributed storage pool, cloud storage pool, archival storage pool
» Other labels: AWS, Amazon Glacier, Arkivum, Cloud9
Tiered storage proposal - pools
Table columns: Pool; Overview; Class; Copies; Recovery Time Objective; Recovery Point Objective
Distributed storage pool
» Overview: Data stored near sites (possibly based on SDC1, SDC2 and other locations, eg national research e-infrastructure centres, other NRENs) to give onsite/nearsite recovery times. Use of erasure encoding to give the equivalence of 2 copies with ~1.6 times the storage capacity. Leverage the Janet backbone to deliver onsite equivalence
» Copies: Equivalent to 2 copies including offsite
» Other cells (in source order): onsite/near site equivalent; <1 hour
Cloud storage pool
» Overview: Managing data copies across multiple cloud providers
» Class: Archive
» Copies: Equivalent to 2 copies including offsite
» Recovery Time Objective: <1 hour
» Recovery Point Objective: 1-24 hours
Archival storage pool
» Overview: Managing data copies across multiple cloud “vault” providers (ie 99% or 100% guaranteed data recovery)
» Class: Vault
» Copies: Guaranteed recovery
» Recovery Time Objective: N/A
» Recovery Point Objective: N/A
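The ~1.6 times figure quoted for the distributed pool is the kind of overhead an erasure code produces. A minimal sketch, assuming a k+m scheme in which data is split into k fragments and m parity fragments are added: the 10+6 layout below is a hypothetical choice that happens to give 1.6, since the slides do not say which scheme is intended.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErasureScheme:
    data_fragments: int    # k: fragments holding the original data
    parity_fragments: int  # m: extra fragments added for redundancy

    @property
    def storage_overhead(self) -> float:
        """Raw capacity consumed per unit of user data: (k + m) / k."""
        return (self.data_fragments + self.parity_fragments) / self.data_fragments

    @property
    def tolerated_losses(self) -> int:
        """Fragments that can be lost with the data still recoverable (MDS code assumed)."""
        return self.parity_fragments


# Hypothetical layout: 10 data + 6 parity fragments spread across sites.
scheme = ErasureScheme(data_fragments=10, parity_fragments=6)
two_full_copies = 2.0  # plain replication stores every byte twice

print(f"erasure-coded overhead:    {scheme.storage_overhead:.1f}x")
print(f"replication overhead:      {two_full_copies:.1f}x")
print(f"fragment losses tolerated: {scheme.tolerated_losses}")
```

On these assumptions, 100TB of research data would occupy 160TB of raw capacity, against 200TB for two full copies.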
Requirements and demand working group
Current members
» University of Oxford
» University of Leeds
» University of Manchester
» University College London
» London School of Economics
» Natural History Museum
» Additions welcome
Key outputs
» Phased technical specification
» Use scenarios (eg data movement)
» Business and financial case (including TCO analysis)
» Market review and supplier engagement
Tiered storage positioning
Diagram labels:
» Storage providers
» Jisc Tiered Storage
» Storage policy (shown four times)
» Jisc RDSS
» Other Jisc services
» Local research data systems
» Other local systems (financial, T&L, etc)
The futures portfolio consists of three big areas
Diagram labels:
» Store services
» Playlists
» Diagnostic tool builder
» Curation and remix
» Learner analytics services
» Digital capability
» Learning analytics
» Digital launchpad
» Apprentice workforce development
» Digital leadership
» Summer of student innovation
» Analytics academy
» Analytics labs
» Qualification verification
» App and content store
» Research data discovery
» Research data usage metrics
» Equipment data
» Repository and preservation platform
» Research data shared service
» ?
Research data discovery service
Alpha site
Shared Service Goals
» Policy compliance
» Efficiency
» Better research
…..but a challenging problem
Implementing Archivematica for research data preservation at York and Hull
Jenny Mitcham (Digital Archivist) - University of York
Pilot MVP components
RDSS component offer: number of pilots requiring (total = 17)
» RDSS Repository: 14
» RDSS Preservation: 17
» RDSS Reporting: 14 (TBC)*
» RDSS Storage: 16
* Under review as additional reporting options may be available, also differing offers from full dashboard/analytics to API only. Further discovery work is underway.
Pilot Alpha MVP integrations
RDSS component offer: number of pilots requiring (total = 17)
» EPrints (Repository): 12
» DSpace (Repository): 4
» Hydra (Repository): 2
» Symplectic (CRIS)*: 4
» Pure (CRIS): 3
» Converis (CRIS): 1
» Authentication: 17
* RDSS Framework Supplier
Middlesex Figshare implementation
» Accelerated deployment in 10 weeks (Installation by 10th November)
» Stakeholder engagement
» Development of institutional requirements
» Sign up to DataCite membership
» Implementation team (informal)
» Integration with Jisc Storage
» Implementation of pilot data repository
The University of Jisc Sandbox
» Scratch environment for testing of configuration and integration of service platform components
» A mock HEI to integrate with
» Infrastructure as code, learning from building and managing the mixture of SaaS and custom applications. This will allow easy push-button installation of products
» Working with test data and metadata taken from real HEI repositories
» Consistent and standardised UX
» Bespoke development environment
Diagram labels:
» Apps
» CRIS
» Test data: Zenodo, RDSS pilot HEI repositories, publisher data
» AWS storage + tools
» Data repositories: Figshare, Hydra, Islandora, Haplo
» Publication repositories: EPrints, DSpace
» Preservation systems: Preservica, Archivematica
» Additional software and services
Preservation of research data
“I currently spend about £1,200 pa on data storage from my own salary. I have the highest data needs in my School, and there is no plan in place for storing my data.”
Sensitive research data
“It would be helpful to clarify the rules for storing anonymised data on cloud services. My departmental rules say this is never OK, however this seems to contradict University rules.”
University services to support RDM
“Support is woeful in the university currently, in particular long-term data archiving is critically required. Most of my non-current data is rotting on CD's and hard-drives.”
University services to support RDM
“Please, individualise the support. Workshop are useless, emails with information are useless, brochures are useless, posters are useless.”
What we’d like to know…..
» What are your current priorities and pain points with managing data?
» Do you have or are you expecting a data deluge?
» What would you like Jisc to provide for managing data?
» What would you like the Jisc offer to look like?
» Have we missed anything in our pilots? Are there gaps?
» Are there any aspects of data management you’d like to keep ‘in-house’?
» Do you have issues around research systems user experience for researchers and staff?
» Do you have issues around systems interoperability?
» Do you have preservation needs beyond research data (eg records management, archives)?
» Can you share any hooks or incentives to engage researchers in data management services?
» Any tips for success and lessons learned that we can utilise in implementing systems?
» Anything else…..
Matthew Dovey, Head of e-infrastructure strategy, [email protected]
John Kaye, Senior co-design manager – Research data, [email protected]
jisc.ac.uk/rd/projects/research-data-shared-service
https://community.jisc.ac.uk/groups/tiered-storage