metro rdm webinar
Managing & Preserving Data Sets
Vicky Steeves | METRO Webinar | 8/3/2015
Itinerary
❖ Science Data: Definition & Explanation
❖ Current Trends in DigiPres for Science
❖ Benefits to Curating Datasets
❖ Existing Problems
❖ Research Data Management
❖ Upcoming Tools
What is Data @ the Federal Gov’t?
“the recorded factual material commonly accepted in the scientific community as necessary to validate research findings.”
-Federal Office of Management & Budget Circular A-110
Why is science data different?
Digital Preservation of Science in the US
❏ North Carolina County Geospatial Data
❏ Caroline Dean Wildflower Collection
❏ FSU Biological Scientist, Dr. A.K.S.K. Prasad Diatomscapes I and II Collections Photographs
❏ FSU Department of Oceanography Technical Reports
NDSA Levels of Digital Preservation
Level 1 Level 2 Level 3 Level 4
Storage & Geographic Locations
File Fixity & Data Integrity
Information Security
Metadata
File Formats
USGS Levels of Digital Preservation
Level 1 Level 2 Level 3 Level 4
Storage & Geographic Locations
Data Integrity
Information Security
Metadata
File Formats
Physical Media
Benefits
Satisfy your federal grant requirements
more likely to receive future funding if data is made immediately accessible & you write and follow a well-structured DMP
Saves time & effort, making research process more efficient
reduces duplicated work & cuts costs of storage
more efficient quality control of data produced
better version control of data--identify versions that can be periodically purged to save money on storage costs!
Adding metadata means that both you & others can go back to it and easily understand and use it effectively
Benefits
Makes your data scientifically and legally defensible
verifiable & authenticatable, replicable & easily duplicated!
Supports Open Access Movement
advocates for researchers to share data to foster development of body of knowledge
sustainability of science data in the long-term!
Reinforce open scientific inquiry
could lead to new and unanticipated discoveries!
continually improve the quality of data
data can be used as valuable teaching instrument to train future scientists
EXAMPLE: Benefit to Science
Existing Problems: Management
“My research assistants manage all my data.”
“I think only the division chair knows how much server space we have in total.”
“There are not enough computer terminals in the imaging lab.”
“I have no time to standardize my data management.”
“I have no way to get my data off of the computers in the gene sequencing lab because the files are too large.”
“I don’t know where exactly to get support for my database. I don’t know if it’s IT’s jurisdiction or job.”
Existing Problem: Scope
Existing Problems: Media
Possible Solution
Research Data Management
a set of practices which affords researchers the ability to more quickly, efficiently, and accurately find, access, and understand their own or others’ research data
*also our way to get science to listen to us! (sorry Scott)
Data Management Plan
A data management plan (DMP) is a document that describes how you will collect, organise, manage, store, secure, backup, preserve, and share your data.
Federal Agencies & DMPs
DMPs Usual Requirements
❏ Format of Data
❏ Research methodology
❏ Roles & Responsibilities for Data
❏ Metadata Standards
❏ Storage & Back Up Procedure
❏ Long-Term Archiving & Preservation Plan
❏ Access Policy
❏ Security Measures
❏ Data
❏ Humans
Existing Tools
Research Lifecycle
Data management is done at all stages of the research lifecycle.
*each step in the process has its own best practices & standards
Research Lifecycle: Create
Creating Data
what format will the data be in?
where will we store this data?
how will it be backed up?
how are we going to share this data?
how will we collect this data?
how will we describe this data?
Research Lifecycle: Process
Processing Data
how will we check, validate, or clean the data?
a. how will we describe that process?
b. how will we describe the data?
will we store the processed data? where? how?
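One way to answer both "how will we check, validate, or clean the data?" and "how will we describe that process?" is to have the cleaning step write its own log. The following is a minimal Python sketch, not a tool from the webinar: the column names `sample_id` and `value`, the function `clean_rows`, and the sample rows are all hypothetical.

```python
import csv
import io
import json


def clean_rows(rows, required=("sample_id", "value")):
    """Return (kept_rows, log); the log records why each row was dropped."""
    kept, log = [], []
    for row_no, row in enumerate(rows, start=1):
        # A field counts as missing if it is absent, empty, or only whitespace.
        missing = [name for name in required if not (row.get(name) or "").strip()]
        if missing:
            log.append({"row": row_no, "action": "dropped",
                        "reason": "missing " + ", ".join(missing)})
            continue
        try:
            # Convert the measurement to a number so later analysis can rely on it.
            row = dict(row, value=float(row["value"]))
        except ValueError:
            log.append({"row": row_no, "action": "dropped",
                        "reason": "value is not numeric"})
            continue
        kept.append(row)
    return kept, log


# Made-up raw data: the second row has no value, the third has a non-numeric one.
raw = "sample_id,value\nA1,3.5\nA2,\nA3,abc\nA4,7\n"
rows = list(csv.DictReader(io.StringIO(raw)))
cleaned, process_log = clean_rows(rows)

# The log can be stored next to the processed data as documentation of the cleaning.
print(json.dumps(process_log, indent=2))
```

Keeping the log as structured data (rather than notes in someone's head) means the processed file and the description of how it was produced can be stored and preserved together.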
Research Lifecycle: Analyze
Analyzing Data
how will we interpret data?
what research outputs will be produced?
a. what format will they be in?
b. how can we make it preservation-ready?
where will we store this data?
how will we ready this data for publication?
Existing Resources
Metadata
Open File Formats
Darwin Core
Library of Congress File Standards Guide
ABCD
Data Documentation Initiative
File Directories & Org
Fixity Checks & Security
Back Ups
Version Control
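As an illustration of what a fixity check involves, the sketch below records a SHA-256 checksum for every file in a folder and later reports which files no longer match. It is a minimal example using only the Python standard library, not one of the tools shown in the webinar; the function names and the `readings.csv` file are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large data files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its checksum."""
    root = Path(root)
    return {
        str(path.relative_to(root)): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(root, manifest):
    """List files from the manifest that are now missing or have different content."""
    current = build_manifest(root)
    return sorted(name for name in manifest if current.get(name) != manifest[name])


# Demonstration in a temporary folder with a made-up data file.
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "readings.csv"
    data_file.write_text("sample_id,value\nA1,3.5\n")
    manifest = build_manifest(tmp)           # record checksums once
    unchanged = changed_files(tmp, manifest)  # nothing has changed yet
    data_file.write_text("sample_id,value\nA1,9.9\n")
    altered = changed_files(tmp, manifest)    # the edit is detected

print(unchanged, altered)
```

The same manifest can be re-checked after each backup or media refresh to confirm that the copy matches the original.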
Research Lifecycle: Preserve
Preserving Data
what is the best archival format for our type of data?
what is the best type of archival storage for our data?
What needs to be preserved alongside our data to make it useful to others?
a. metadata and documentation
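To make "metadata and documentation" concrete, here is a small Python sketch that writes observations to CSV using Darwin Core term names as column headers and builds a documentation record that explains each column. The record values, the `documentation` layout, and the helper `to_csv` are assumptions made for illustration; only the Darwin Core term names come from the standard.

```python
import csv
import io
import json

# Column names taken from Darwin Core terms.
DWC_FIELDS = [
    "occurrenceID", "basisOfRecord", "scientificName",
    "eventDate", "decimalLatitude", "decimalLongitude",
]

# One made-up observation, for illustration only.
records = [{
    "occurrenceID": "example-0001",
    "basisOfRecord": "HumanObservation",
    "scientificName": "Trillium grandiflorum",
    "eventDate": "2015-05-02",
    "decimalLatitude": "35.60",
    "decimalLongitude": "-82.55",
}]


def to_csv(rows, fields):
    """Serialize rows to CSV text, an open format most tools can read."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical documentation record to store alongside the data file.
documentation = {
    "title": "Example wildflower observations (made-up values)",
    "file_format": "CSV, UTF-8",
    "columns": {
        "occurrenceID": "identifier for the observation",
        "basisOfRecord": "nature of the record",
        "scientificName": "full scientific name of the organism",
        "eventDate": "date of observation, ISO 8601",
        "decimalLatitude": "latitude in decimal degrees",
        "decimalLongitude": "longitude in decimal degrees",
    },
}

csv_text = to_csv(records, DWC_FIELDS)
print(csv_text)
print(json.dumps(documentation, indent=2))
```

Using a community vocabulary for the column names, plus a plain-text description of each column, lets someone outside the project interpret the file without asking the original researchers.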
Research Lifecycle: Access
Access to Data
Interoperable formatting means many people can use our data
distribute data
share data
promote data
Existing Resources
Research Lifecycle: ReUse
Re-using Data
5 years down the road? 10?
a. use open, new archival formats
b. refresh our storage media
c. scrutinize our findings & integrate data into new projects
teach next generation using our datasets
What’s more...
Short-Term Solution
Long-Term Solution
Upcoming Tools & Strategies
● ReproZip is a general tool for Linux distributions that simplifies the process of creating reproducible experiments from command-line executions.
○ automatically creates a package that contains all the dependencies (e.g., libraries, input data) that are required to run the input command.
● Hydra in a Box is a turnkey, feature-rich, robust, flexible digital repository that is easy to install, configure, and maintain
○ Joint venture between DPLA, Stanford University, and DuraSpace for $2M from IMLS
○ 30 month period to completion