metro rdm webinar

Managing & Preserving Data Sets Vicky Steeves | METRO Webinar | 8/3/2015

Upload: victoria-steeves

Post on 15-Apr-2017

TRANSCRIPT

Page 1: METRO RDM Webinar

Managing & Preserving Data Sets

Vicky Steeves | METRO Webinar | 8/3/2015

Page 2: METRO RDM Webinar

Itinerary

❖ Science Data: Definition & Explanation

❖ Current Trends in DigiPres for Science

❖ Benefits to Curating Datasets

❖ Existing Problems

❖ Research Data Management

❖ Upcoming Tools

Page 3: METRO RDM Webinar

What is Data @ the Federal Gov’t?

“the recorded factual material commonly accepted in the scientific community as necessary to validate research findings.”

-Federal Office of Management & Budget Circular A-110

Page 4: METRO RDM Webinar

Why is science data different?

Page 5: METRO RDM Webinar

Why is Science data different?

OR

Page 6: METRO RDM Webinar

Why is Science data different?

Page 7: METRO RDM Webinar

Why is Science data different?

Page 8: METRO RDM Webinar

Why is Science data different?

Page 9: METRO RDM Webinar

Digital Preservation of Science in the US

❏ North Carolina County Geospatial Data

❏ Caroline Dean Wildflower Collection

❏ FSU Biological Scientist, Dr. A.K.S.K. Prasad Diatomscapes I and II Collections Photographs

❏ FSU Department of Oceanography Technical Reports

Page 10: METRO RDM Webinar

NDSA Levels of Digital Preservation

Level 1 Level 2 Level 3 Level 4

Storage & Geographic Locations

File Fixity & Data Integrity

Information Security

Metadata

File Formats

Page 11: METRO RDM Webinar

USGS Levels of Digital Preservation

Level 1 Level 2 Level 3 Level 4

Storage & Geographic Locations

Data Integrity

Information Security

Metadata

File Formats

Physical Media

Page 12: METRO RDM Webinar

Benefits

● Satisfy your federal grant requirements

○ more likely to receive future funding if data is made immediately accessible & you write and follow a well-structured DMP

● Saves time & effort, making research process more efficient

○ reduces duplicated work & cuts costs of storage

○ more efficient quality control of data produced

○ better version control of data: identify versions that can be periodically purged to save money on storage costs!

● Adding metadata means that both you & others can go back to it and easily understand and use it effectively

Page 13: METRO RDM Webinar

Benefits

● Makes your data scientifically and legally defensible

○ verifiable & authenticatable, replicable & easily duplicated!

● Supports Open Access Movement

○ advocates for researchers to share data to foster development of body of knowledge

○ sustainability of science data in the long-term!

● Reinforce open scientific inquiry

○ could lead to new and unanticipated discoveries!

○ continually improve the quality of data

○ data can be used as valuable teaching instrument to train future scientists

Page 14: METRO RDM Webinar

EXAMPLE: Benefit to Science

Page 15: METRO RDM Webinar

Existing Problems: Management

“My research assistants manage all my data.”

“I think only the division chair knows how much server space we have in total.”

“There are not enough computer terminals in the imaging lab.”

“I have no time to standardize my data management.”

“I have no way to get my data off of the computers in the gene sequencing lab because the files are too large.”

“I don’t know where exactly to get support for my database. I don’t know if it’s IT’s jurisdiction or job.”

Page 16: METRO RDM Webinar

Existing Problem: Scope

Page 17: METRO RDM Webinar

Existing Problems: Media

Page 18: METRO RDM Webinar

Possible Solution

Page 19: METRO RDM Webinar

Research Data Management

a set of practices which affords researchers the ability to more quickly, efficiently, and accurately find, access, and understand their own or others’ research data

*also our way to get science to listen to us! (sorry Scott)

Page 20: METRO RDM Webinar

Data Management Plan

A data management plan (DMP) is a document that describes how you will collect, organise, manage, store, secure, backup, preserve, and share your data.

Page 21: METRO RDM Webinar

Federal Agencies & DMPs

Page 22: METRO RDM Webinar

DMPs Usual Requirements

❏ Format of Data

❏ Research methodology

❏ Roles & Responsibilities for Data

❏ Metadata Standards

❏ Storage & Back Up Procedure

❏ Long-Term Archiving & Preservation Plan

❏ Access Policy

❏ Security Measures

    ❏ Data

    ❏ Humans
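The requirements list above can double as a completeness check on a draft DMP. The following is a minimal sketch, not a funder-approved template: the section keys are hypothetical names derived from the slide's bullet points, and `missing_sections` is an illustrative helper.

```python
# Hypothetical section keys, one per requirement listed on the slide.
REQUIRED_SECTIONS = [
    "format_of_data",
    "research_methodology",
    "roles_and_responsibilities",
    "metadata_standards",
    "storage_and_backup",
    "archiving_and_preservation",
    "access_policy",
    "security_measures",
]


def missing_sections(dmp):
    """Return the required sections that are absent or blank in a draft DMP."""
    missing = []
    for name in REQUIRED_SECTIONS:
        value = dmp.get(name)
        # Treat a missing key, None, or whitespace-only text as "not filled in".
        if value is None or not str(value).strip():
            missing.append(name)
    return missing


# A made-up draft with one real answer and one blank answer.
draft = {"format_of_data": "CSV and TIFF", "access_policy": "   "}
print(missing_sections(draft))
```

Each agency words its requirements differently, so the key list would need adjusting to the specific funder's guidance.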

Page 23: METRO RDM Webinar

Existing Tools

Page 24: METRO RDM Webinar

Research Lifecycle

Data management is done at all stages of the research lifecycle.

*each step in the process has its own best practices & standards

Page 25: METRO RDM Webinar

Research Lifecycle: Create

Creating Data

what format will the data be in?

where will we store this data?

how will it be backed up?

how are we going to share this data?

how will we collect this data?

how will we describe this data?

Page 26: METRO RDM Webinar

Research Lifecycle: Process

Processing Data

how will we check, validate, or clean the data?

a. how will we describe that process?

b. how will we describe the data?

will we store the processed data? where? how?
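One way to answer both "how will we clean the data?" and "how will we describe that process?" is to have the cleaning step write its own log. This is a small sketch under stated assumptions: the column names (`specimen_id`, `mass_g`) and the sample rows are made up for illustration, and the only rule applied is "drop rows with a missing required value".

```python
import csv
import io


def clean_rows(rows, required):
    """Strip whitespace, drop rows missing a required field, and log every drop."""
    kept, log = [], []
    for row_no, row in enumerate(rows, start=1):
        stripped = {key: (value or "").strip() for key, value in row.items()}
        missing = [field for field in required if not stripped.get(field)]
        if missing:
            log.append(f"row {row_no}: dropped, missing {', '.join(missing)}")
        else:
            kept.append(stripped)
    return kept, log


# Made-up sample data: the second row has no mass recorded.
sample = io.StringIO("specimen_id,mass_g\nA1, 12.5 \nA2,\n")
kept, log = clean_rows(csv.DictReader(sample), required=["specimen_id", "mass_g"])
print(kept)
print(log)
```

Storing the log next to the processed file keeps a record of what was changed and why, which is the process description the slide asks for.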

Page 27: METRO RDM Webinar

Research Lifecycle: Analyze

Analyzing Data

how will we interpret data?

what research outputs will be produced?

a. what format will they be in?

b. how can we make it preservation-ready?

where will we store this data?

how will we ready this data for publication?

Page 28: METRO RDM Webinar

Existing Resources

Metadata

Open File Formats

Darwin Core

Library of Congress File Standards Guide

ABCD

Data Documentation Initiative

File Directories & Org

Fixity Checks & Security

Back Ups

Version Control
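A fixity check, as listed above, means recording a checksum for each file and later confirming that the checksums still match. The sketch below uses SHA-256 from Python's standard library. The function names and the sample file are illustrative assumptions, not a tool named in the webinar.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large data files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder):
    """Map each file's path (relative to folder) to its SHA-256 checksum."""
    folder = Path(folder)
    return {
        str(path.relative_to(folder)): sha256_of(path)
        for path in sorted(folder.rglob("*"))
        if path.is_file()
    }


def changed_files(folder, manifest):
    """List files that were added, removed, or altered since the manifest was built."""
    current = build_manifest(folder)
    return sorted(
        name
        for name in manifest.keys() | current.keys()
        if manifest.get(name) != current.get(name)
    )


# Demonstration on a throwaway folder with made-up data.
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "counts.csv"
    data_file.write_text("site,count\nA,3\n")
    manifest = build_manifest(tmp)
    unchanged = changed_files(tmp, manifest)
    data_file.write_text("site,count\nA,4\n")  # simulate silent corruption
    changed = changed_files(tmp, manifest)

print(unchanged, changed)
```

In practice the manifest would be saved somewhere separate from the data (and from its backups) so that damage to one does not also damage the other.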

Page 29: METRO RDM Webinar

Research Lifecycle: Preserve

Preserving Data

what is the best archival format for our type of data?

what is the best type of archival storage for our data?

What needs to be preserved alongside our data to make it useful to others?

a. metadata and documentation
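Keeping metadata alongside the data can be as simple as writing a small "sidecar" file next to each data file. The following is a rough sketch only: the field names are hypothetical and do not follow Darwin Core, DDI, or any other standard mentioned in this webinar, and the sample values are made up.

```python
import json
import tempfile
from pathlib import Path


def write_sidecar(data_path, description, creator, collected_on, columns):
    """Write <data file>.metadata.json next to the data file and return its path."""
    data_path = Path(data_path)
    sidecar = data_path.with_name(data_path.name + ".metadata.json")
    record = {
        "file": data_path.name,
        "description": description,
        "creator": creator,
        "collected_on": collected_on,  # date as an ISO 8601 string
        "columns": columns,            # column name -> meaning and units
    }
    sidecar.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return sidecar


# Demonstration on a throwaway folder with made-up values.
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "counts.csv"
    data_file.write_text("site,count\nA,3\n", encoding="utf-8")
    sidecar = write_sidecar(
        data_file,
        description="Example survey counts (illustrative data)",
        creator="Example Lab",
        collected_on="2015-08-03",
        columns={"site": "site code", "count": "individuals observed"},
    )
    sidecar_name = sidecar.name
    loaded = json.loads(sidecar.read_text(encoding="utf-8"))

print(sidecar_name)
```

For data headed to a repository, a community standard suited to the discipline would normally replace these ad hoc fields.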

Page 30: METRO RDM Webinar

Research Lifecycle: Access

Access to Data

Interoperable formatting means many people can use our data

distribute data

share data

promote data

Page 31: METRO RDM Webinar

Existing Resources

Page 32: METRO RDM Webinar

Research Lifecycle: ReUse

Re-using Data

5 years down the road? 10?

a. use open, new archival formats

b. refresh our storage media

c. scrutinize our findings & integrate data into new projects

teach next generation using our datasets

Page 33: METRO RDM Webinar

What’s more...

Page 34: METRO RDM Webinar

Short-Term Solution

Page 35: METRO RDM Webinar

Long-Term Solution

Page 36: METRO RDM Webinar

Upcoming Tools & Strategies

● ReproZip is a general tool for Linux distributions that simplifies the process of creating reproducible experiments from command-line executions.

○ automatically creates a package that contains all the dependencies (e.g., libraries, input data) that are required to run the input command.

● Hydra in a Box is a turnkey, feature-rich, robust, flexible digital repository that is easy to install, configure, and maintain

○ Joint venture between DPLA, Stanford University, and DuraSpace for $2M from IMLS

○ 30 month period to completion

Page 37: METRO RDM Webinar

Ask me questions!

@VickySteeves
