Data Management Plans: A good idea, but not sufficient
Andreas Rauber
Department of Software Technology and Interactive Systems
Vienna University of Technology & Secure Business Austria
http://www.ifs.tuwien.ac.at/~andi
Outline
Why are Data Management Plans good but insufficient?
From Data to Process Management Plans
How to capture process & context?
Summary
Sustainable (e-)Science
Data is a key enabler in science
- Basis for evaluation and verification
- Basis for re-use
- Basis for meta-studies
Safeguarding investment made in data
Need to preserve and curate the data
Preservation: keeping data usable over time, fighting mostly technical & semantic obsolescence
How to avoid data being lost after projects end?
Sustainable (e-)Science
Data Management Plans as integral part of research proposals
Need recognized by researchers, funding bodies,…
Focus on
- Data
- Descriptions
- Declarations of activities to ensure long-term availability of data
Data Management Plans are good, but not sufficient!
https://data.uni-bielefeld.de/de/data-management-plan
https://dmp.cdlib.org/
https://dmponline.dcc.ac.uk/
Data Management Plans
Short, free-form text, requiring human interpretation
Declarations of intent
Not enforceable, hardly verifiable
(Burden remains with researchers / institutions, who need to become data management experts)
Focuses solely on data, ignoring the process: pre-processing, processing, analysis
Limits
- availability of data & results
- verification of results
- re-use and re-purposing
http://deepblue.lib.umich.edu/bitstream/handle/2027.42/86586/CoE_DMP_template_v1.pdf?sequence=1
http://rci.ucsd.edu/_files/DMP%20Example%20Cosman.pdf
From Data to Processes
Excursion: Scientific Processes
From Data to Processes
Rhythm Pattern Feature Set
- extracts numeric descriptors from audio
- basically 2 Fourier Transforms
- some psycho-acoustic modelling
- some filters (Gaussian, gradient) to make features more robust
Used for
- music genre classification
- clustering of music by similarity
- retrieval
Implemented first in Matlab, then in Java
- both publicly available on website
- same same but different...
From Data to Processes
Excursion: Scientific Processes
[Figure: Java vs. Matlab results for the test signals set1_freq440Hz_Am12.0Hz, set1_freq440Hz_Am05.5Hz and set1_freq440Hz_Am11.0Hz]
From Data to Processes
Excursion: Scientific Processes
- Bug?
- Psychoacoustic transformation tables?
- Forgetting a transformation?
- Different implementation of filters?
- Limited accuracy of calculation?
- Difference in FFT implementation?
- ...?
From Data to Processes
http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0038234
From Data to Processes
To sum up:
Data
- is the fuel for scientific processes
- is the result of scientific processes
Curation of data thus needs to consider these processes
Data Management Plans
- are data centric
- put too little focus on the processes associated with data
- are written by humans for humans
Outline
Why are Data Management Plans insufficient?
From Data to Process Management Plans
How to capture process & context?
Summary
Process Management Plans
Process Management Plans (PMPs)
Go beyond data to cover research process:
- ideas, steps, tools, documentation, results, …
- data is only one (important) element, and often itself the result of a research (pre-)process
Ensure re-executability, re-usability
Must be machine-actionable & verifiable
Basis for preservation and re-use of research
Similar to “research objects”, “executable papers”, …
Process Management Plans
Need to establish
Models for representing such process management plans (PMPs)
Must be machine-readable and machine-actionable
Identify “minimum set” of information
Devise means to automate (most of) the activity in creating and maintaining those PMPs
Establish them to replace (enhance / subsume / …) Data Management Plans
Process Management Plans
Structure of PMPs (following concept of DMPs):
1. Overview and context
2. Description of processes and their implementation: Process description | Process implementation | Data used and produced by process
3. Preservation: Preservation history | Long term storage and funding
4. Sharing and reuse: Sharing | Reuse | Verification | Legal aspects
5. Monitoring and external dependencies
6. Adherence and Review
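To illustrate what "machine-readable and machine-actionable" could look like for the PMP structure listed here, the following is a minimal Python sketch. All class and field names are hypothetical and chosen only to mirror the section headings; they do not come from any published PMP schema.

```python
import json
from dataclasses import dataclass, field, asdict

# Hypothetical field names mirroring the PMP sections;
# none of them are taken from an existing PMP standard.
@dataclass
class ProcessStep:
    description: str
    implementation: str                      # e.g. tool or script name
    data_used: list = field(default_factory=list)
    data_produced: list = field(default_factory=list)

@dataclass
class ProcessManagementPlan:
    overview: str
    processes: list = field(default_factory=list)
    preservation: dict = field(default_factory=dict)
    sharing_and_reuse: dict = field(default_factory=dict)
    monitoring: dict = field(default_factory=dict)
    adherence_and_review: dict = field(default_factory=dict)

    def missing_sections(self):
        """Return the names of sections that are still empty."""
        return [name for name, value in asdict(self).items() if not value]

pmp = ProcessManagementPlan(
    overview="Music genre classification experiment",
    processes=[ProcessStep(
        description="Extract numeric features from audio",
        implementation="feature extractor (illustrative)",
        data_used=["music collection (MP3)"],
        data_produced=["feature vectors"],
    )],
)

print(json.dumps(asdict(pmp), indent=2))   # machine-readable form
print(pmp.missing_sections())              # simple automated completeness check
```

Unlike a free-form DMP text, a record of this kind can be checked automatically, for example by rejecting a plan whose preservation section is still empty.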
Outline
Why are Data Management Plans insufficient?
From Data to Process Management Plans
How to capture process & context?
Summary
Process Capture
Need to establish what forms part of a process:
- analyzing process documentation
- establishing context of process, relationships between elements
- monitoring of process activities
Capture and describe this in a context model
Architectural Concepts
Based on Enterprise Architecture Framework (Zachman), taxonomies (e.g. PREMIS), …
DIO: Domain-Independent Ontology
DSO: Domain-Specific Ontologies (legal, sensor, multimedia codecs, …)
[Figure: DIO (ArchiMate) linked to DSO-1 and DSO-2 via the DIO-DSO1 and DIO-DSO2 Transformation Maps]
Process Capture
Example: Music Classification Process
Input: music (e.g. MP3 format)
Input: training data, i.e. music with genre labels
Output: classification of music, e.g. into genres
Intermediate steps
- extract numeric description (features) from music
- combine features with ground truth into specific file format, …
Process Capture
Taverna
Process Capture
On operating systems with package management (e.g. Linux), the software setup can be detected automatically;
this allows detection of licenses and dependencies
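The slide refers to operating-system package managers. As a stand-in that runs anywhere, the sketch below applies the same idea to the packages installed in a Python environment, using the standard-library `importlib.metadata` module. The function name `capture_python_packages` and the record layout are assumptions made for illustration, and the license field is whatever each package declares, which may be missing.

```python
from importlib import metadata

def capture_python_packages():
    """Collect name, version, declared license and dependencies of installed packages."""
    records = []
    for dist in metadata.distributions():
        meta = dist.metadata
        records.append({
            "name": meta.get("Name"),
            "version": dist.version,
            "license": meta.get("License"),          # None if the package declares none
            "requires": list(dist.requires or []),   # declared dependencies, if any
        })
    # Sort by name so that two captures of the same environment are comparable.
    return sorted(records, key=lambda r: (r["name"] or "").lower())

records = capture_python_packages()
for record in records[:5]:
    print(record["name"], record["version"], record["license"])
```

A capture like this could be stored alongside the process description, so that a later re-deployment can compare its environment against the original one.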
Process Capture
Process Capture
Example: Music Classification Workflow
Process Re-deployment
Preservation and Re-deployment
“Encapsulate” as complex “research objects” (ROs)
Re-deployment beyond original environment
- Format migration of elements of ROs
- Cross-compilation of code
- Emulation-as-a-Service, virtual machines, …
Process Re-deployment
Verification, Validation & Data
Verify correctness of re-execution
- validation and verification framework
- process instance data
- points of capture
- metrics
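One simple way to picture such a metric is to compare the values captured at the same point in the original run and in the re-execution. The sketch below does this for two numeric vectors and reports the maximum absolute deviation. The function name, the tolerance and the sample values are made up for illustration and are not taken from the validation framework mentioned on the slide.

```python
def compare_outputs(original, reexecuted, abs_tol=1e-9):
    """Compare two equally long numeric vectors captured at the same point of a process."""
    if len(original) != len(reexecuted):
        raise ValueError("outputs differ in length")
    deviations = [abs(a - b) for a, b in zip(original, reexecuted)]
    max_dev = max(deviations, default=0.0)
    return {
        "max_abs_deviation": max_dev,
        "identical": max_dev == 0.0,             # bit-for-bit equal values
        "within_tolerance": max_dev <= abs_tol,  # equal up to numeric noise
    }

# Made-up values: the re-run differs only by a tiny numeric offset.
reference = [0.25, 0.5, 0.75]
rerun = [0.25, 0.5, 0.75 + 1e-12]
report = compare_outputs(reference, rerun)
print(report)
```

Separating "identical" from "within tolerance" matters for cases like the Matlab and Java implementations earlier in the talk, where small numeric differences and genuine processing differences need to be told apart.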
Data and data citation
- Identifying subsets of data in large and dynamic databases
- Timestamping and versioning of data
- Assigning PID (DOI, …) to time-stamped query
[Figure: data citation components: Data (Table A, Table B), Query, Query Store, Subsets, PID Provider, PID Store]
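The data citation idea (versioned data, a query store, and a PID assigned to a time-stamped query) can be sketched with an in-memory SQLite database. Everything specific here is an assumption for illustration: the `songs` table, the integer "logical" timestamps, and the locally computed `pid:` identifier, which stands in for a real PID provider such as a DOI service.

```python
import hashlib
import sqlite3

con = sqlite3.connect(":memory:")
# Versioned data table: rows are never overwritten, only closed via valid_to.
con.execute("CREATE TABLE songs (id INTEGER, genre TEXT, valid_from INTEGER, valid_to INTEGER)")
# Query store: query text, execution timestamp, result hash and assigned PID.
con.execute("CREATE TABLE query_store (pid TEXT PRIMARY KEY, query TEXT, ts INTEGER, result_hash TEXT)")

def run_as_of(query, ts):
    """Execute a query against the state of the data at logical time ts."""
    return con.execute(query, {"ts": ts}).fetchall()

def result_hash(rows):
    return hashlib.sha256(repr(rows).encode("utf-8")).hexdigest()

def cite(query, ts):
    """Store the time-stamped query and return a hypothetical, locally minted PID."""
    digest = result_hash(run_as_of(query, ts))
    pid = "pid:" + hashlib.sha256(f"{query}|{ts}".encode("utf-8")).hexdigest()[:12]
    con.execute("INSERT OR IGNORE INTO query_store VALUES (?, ?, ?, ?)", (pid, query, ts, digest))
    return pid

def resolve(pid):
    """Re-execute the stored query as of its timestamp and verify the result hash."""
    query, ts, digest = con.execute(
        "SELECT query, ts, result_hash FROM query_store WHERE pid = ?", (pid,)).fetchone()
    rows = run_as_of(query, ts)
    return rows, result_hash(rows) == digest

QUERY = ("SELECT id, genre FROM songs "
         "WHERE valid_from <= :ts AND (valid_to IS NULL OR valid_to > :ts) ORDER BY id")

# Logical time 1: two rows are inserted; the subset is cited at time 2.
con.executemany("INSERT INTO songs VALUES (?, ?, ?, NULL)", [(1, "rock", 1), (2, "jazz", 1)])
pid = cite(QUERY, ts=2)
# Logical time 3: row 2 is re-labelled; the old version is closed, a new one added.
con.execute("UPDATE songs SET valid_to = 3 WHERE id = 2 AND valid_to IS NULL")
con.execute("INSERT INTO songs VALUES (2, 'blues', 3, NULL)")

subset, unchanged = resolve(pid)
print(subset, unchanged)
```

Resolving the PID after the update still returns the subset as it was cited, because the query is re-run against the versioned data at the stored timestamp rather than against the current state.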
Sustainable (e-)Science
How to get there?
Research infrastructure support
- Versioning systems
- Logging (“virtual lab-book”)
- Virtual machines / pre-configured virtual labs for research
- Data citation support for large, dynamic databases
R&D in process preservation, re-deployment & verification
- Evolving research environments, code migration, …
- Verification of process re-execution
- Financial impact, business models
Summary
Need to move beyond concept of data
Need to move beyond the focus on description
Process Management Plans (PMPs) extending DMPs
Process capture, preservation & verification
Capture “all” elements of a research process
Machine-readable and -actionable
Data and process re-use as basis for data driven science
Thank you!
http://www.ifs.tuwien.ac.at/imp