Data Management Plans: A good idea, but not sufficient

Post on 13-Jan-2016

Data Management Plans: A good idea, but not sufficient

Andreas Rauber

Department of Software Technology and Interactive Systems

Vienna University of Technology & Secure Business Austria

rauber@ifs.tuwien.ac.at

http://www.ifs.tuwien.ac.at/~andi

Outline

Why are Data Management Plans good but insufficient?

From Data to Process Management Plans

How to capture process & context?

Summary

Sustainable (e-)Science

Data is a key enabler in science

- Basis for evaluation and verification

- Basis for re-use

- Basis for meta-studies

Safeguarding investment made in data

Need to preserve and curate the data

Preservation: keeping data usable over time, fighting mostly technical & semantic obsolescence

How to avoid data being lost after projects end?

Sustainable (e-)Science

Data Management Plans as integral part of research proposals

Need recognized by researchers, funding bodies, …

Focus on

- Data

- Descriptions

- Declarations of activities to ensure long-term availability of data

Data Management Plans are good, but not sufficient!

https://data.uni-bielefeld.de/de/data-management-plan

https://dmp.cdlib.org/

https://dmponline.dcc.ac.uk/

Data Management Plans

- Short, free-form text, requiring human interpretation

- Declarations of intent

- Not enforceable, hardly verifiable

- (Burden remains with researchers / institutions, who need to become data management experts)

- Focuses solely on data, ignoring the process: pre-processing, processing, analysis

Limits

- availability of data & results

- verification of results

- re-use and re-purposing

http://deepblue.lib.umich.edu/bitstream/handle/2027.42/86586/CoE_DMP_template_v1.pdf?sequence=1

http://rci.ucsd.edu/_files/DMP%20Example%20Cosman.pdf

From Data to Processes

Excursion: Scientific Processes

From Data to Processes

Rhythm Pattern Feature Set

- extracts numeric descriptors from audio

- basically 2 Fourier Transforms

- some psycho-acoustic modelling

- some filters (gaussian, gradient) to make features more robust

Used for

- music genre classification

- clustering of music by similarity

- retrieval

Implemented first in Matlab, then in Java

- both publicly available on website

- same same but different...

From Data to Processes

Excursion: Scientific Processes

[Figure: Java vs. Matlab output for the test signals set1_freq440Hz_Am12.0Hz, set1_freq440Hz_Am05.5Hz and set1_freq440Hz_Am11.0Hz]

From Data to Processes

Excursion: Scientific Processes

- Bug?

- Psychoacoustic transformation tables?

- Forgetting a transformation?

- Different implementation of filters?

- Limited accuracy of calculation?

- Difference in FFT implementation?

- ...?
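Whichever of these causes applies, the first step is to measure how far the two implementations actually diverge. The following is a minimal sketch, assuming both implementations export their feature vectors as plain lists of numbers; the function name `compare_features`, the tolerances and the sample values are illustrative and not taken from the original Matlab or Java code.

```python
import math


def compare_features(reference, candidate, rel_tol=1e-6, abs_tol=1e-9):
    """Compare two feature vectors element by element.

    Returns (max_abs_diff, mismatches), where mismatches lists the indices
    whose values differ by more than the given tolerances.
    """
    if len(reference) != len(candidate):
        raise ValueError("feature vectors differ in length")
    max_abs_diff = 0.0
    mismatches = []
    for i, (a, b) in enumerate(zip(reference, candidate)):
        max_abs_diff = max(max_abs_diff, abs(a - b))
        if not math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol):
            mismatches.append(i)
    return max_abs_diff, mismatches


# Made-up values standing in for the output of the two implementations.
matlab_features = [0.125, 0.5, 2.0, 4.0]
java_features = [0.125, 0.5, 2.25, 4.0]

max_diff, bad = compare_features(matlab_features, java_features)
print(max_diff, bad)
```

The indices of the mismatching elements help narrow down which processing stage (filter, transformation table, FFT) is the likely source of the difference.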

From Data to Processes

http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0038234

From Data to Processes

To sum up:

Data

- is the fuel for scientific processes

- is the result of scientific processes

Curation of data thus needs to consider these processes

Data Management Plans

- are data centric

- put too little focus on the processes associated with data

- are written by humans for humans

Outline

Why are Data Management Plans insufficient?

From Data to Process Management Plans

How to capture process & context?

Summary

Process Management Plans

Process Management Plans (PMPs)

Go beyond data to cover research process:

- ideas, steps, tools, documentation, results, …

- data is only one (important) element, commonly actually a result of a research (pre-)process

Ensure re-executability, re-usability

Must be machine-actionable & verifiable

Basis for preservation and re-use of research

Similar to “research objects”, “executable papers”, …

Process Management Plans

Need to establish

Models for representing such process management plans (PMPs)

Must be machine-readable and machine-actionable

Identify “minimum set” of information

Devise means to automate (most of) the activity in creating and maintaining those PMPs

Establish them to replace (enhance / subsume / …) Data Management Plans

Process Management Plans

Structure of PMPs (following concept of DMPs):

1. Overview and context

2. Description of processes and their implementation: Process description | Process implementation | Data used and produced by process

3. Preservation: Preservation history | Long term storage and funding

4. Sharing and reuse: Sharing | Reuse | Verification | Legal aspects

5. Monitoring and external dependencies

6. Adherence and Review
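One way to make such a plan machine-readable is to store it as structured data rather than free text, so that a tool can check it for completeness. The sketch below assumes the six sections listed above form the "minimum set"; the class name `ProcessManagementPlan`, the field names and the sample content are hypothetical and not part of any published PMP model.

```python
import json
from dataclasses import dataclass, field, asdict

# Section names follow the structure listed above; treating all of them as
# the required "minimum set" is an assumption made for this sketch.
REQUIRED_SECTIONS = (
    "overview_and_context",
    "processes_and_implementation",
    "preservation",
    "sharing_and_reuse",
    "monitoring_and_external_dependencies",
    "adherence_and_review",
)


@dataclass
class ProcessManagementPlan:
    title: str
    sections: dict = field(default_factory=dict)

    def missing_sections(self):
        """Return the required sections that are absent or empty."""
        return [s for s in REQUIRED_SECTIONS if not self.sections.get(s)]

    def to_json(self):
        """Serialize the plan so that tools, not only humans, can read it."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


pmp = ProcessManagementPlan(
    title="Music genre classification",
    sections={
        "overview_and_context": {"domain": "music information retrieval"},
        "processes_and_implementation": {
            "steps": ["feature extraction", "classification"],
        },
    },
)
print(pmp.missing_sections())
```

A check like `missing_sections()` is something a funder or repository could run automatically, which a free-form text plan does not allow.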

Outline

Why are Data Management Plans insufficient?

From Data to Process Management Plans

How to capture process & context?

Summary

Process Capture

Need to establish what forms part of a process:

- analyzing process documentation

- establishing context of process, relationships between elements

- monitoring of process activities

Capture and describe this in a context model

Architectural Concepts

Based on Enterprise Architecture Framework (Zachman), taxonomies (e.g. PREMIS), …

DIO: Domain-Independent Ontology

DSO: Domain-Specific Ontologies (legal, sensor, multimedia codecs, …)

[Diagram: DIO (ArchiMate) linked to DSO-1 and DSO-2 via the DIO-DSO1 and DIO-DSO2 Transformation Maps]

Process Capture

Example: Music Classification Process

- Input: music (e.g. MP3 format)

- Input: training data, i.e. music with genre labels

- Output: classification of music, e.g. into genres

- Intermediate steps: extract numeric description (features) from music; combine features with ground truth into specific file format; …
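Capturing such a process means recording which step consumed and produced which data, not only the final result. The sketch below is a toy illustration of that idea: it runs a list of steps and stores a digest of every input and output. The functions `extract_features` and `classify` are stand-ins invented for this example and do no real audio processing or genre classification.

```python
import hashlib
import json


def digest(value):
    """SHA-256 over a canonical JSON rendering of a step's data."""
    payload = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def run_and_capture(steps, data):
    """Run the steps in order and record input/output digests for each."""
    trace = []
    for name, func in steps:
        result = func(data)
        trace.append(
            {"step": name, "input": digest(data), "output": digest(result)}
        )
        data = result
    return data, trace


# Toy stand-ins for the real feature extraction and classification steps.
def extract_features(tracks):
    return {name: [len(s), sum(s)] for name, s in tracks.items()}


def classify(features):
    return {name: ("long" if f[0] > 3 else "short") for name, f in features.items()}


tracks = {"a.mp3": [1, 2, 3, 4], "b.mp3": [5, 6]}
steps = [("extract_features", extract_features), ("classify", classify)]
result, trace = run_and_capture(steps, tracks)
print(result)
```

Because each step's output digest equals the next step's input digest, the trace documents the data flow and can later be compared against the trace of a re-execution.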

Process Capture

Taverna

Process Capture

Software setup can be detected automatically on operating systems with package management (e.g. Linux); this allows detection of licenses and dependencies
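The same idea can be illustrated one level up, for the packages of a language runtime instead of the operating system. This is a minimal sketch that uses Python's standard `importlib.metadata` module to list installed Python distributions with their version, declared license and dependencies; it is an analogy to the OS-level detection described above, not the tooling used in the original work, and the function name `capture_python_environment` is made up for this example.

```python
from importlib import metadata


def capture_python_environment():
    """List installed Python distributions with version, license, dependencies."""
    packages = []
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if not name:
            continue  # skip distributions with broken metadata
        packages.append(
            {
                "name": name,
                "version": dist.metadata.get("Version"),
                "license": dist.metadata.get("License") or "unknown",
                "requires": sorted(dist.requires or []),
            }
        )
    return sorted(packages, key=lambda p: p["name"].lower())


environment = capture_python_environment()
print(len(environment), "distributions found")
```

The resulting list can be stored alongside the process description, so that a later re-deployment knows which versions to reinstall and which licenses apply.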

Process Capture

Example: Music Classification Workflow

Process Re-deployment

Preservation and Re-deployment

“Encapsulate” as complex “research objects” (RO)

Re-deployment beyond original environment

- Format migration of elements of ROs

- Cross-compilation of code

- Emulation-as-a-Service, virtual machines, …

Process Re-deployment

Verification, Validation & Data

Verify correctness of re-execution

- validation and verification framework

- process instance data

- points of capture

- metrics

Data and data citation

- Identifying subsets of data in large and dynamic databases

- Timestamping and versioning of data

- Assigning PID (DOI, …) to time-stamped query

[Diagram: Data (Table A, Table B), Query, Query Store, Subsets, PID Provider, PID Store]
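The query-based citation approach can be sketched with an in-memory SQLite database. The example assumes a versioned table in which every row carries `valid_from` and `valid_to` markers, and a query store that keeps the query text, its timestamp and a hash of the result. The table layout, the integer timestamps and the hash-derived `pid:` identifiers are assumptions made for this sketch; a real deployment would obtain identifiers such as DOIs from a PID provider.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE measurements (
        id INTEGER, value REAL,
        valid_from INTEGER NOT NULL,   -- logical timestamp of insertion
        valid_to INTEGER               -- NULL while the row version is current
    );
    CREATE TABLE query_store (
        pid TEXT PRIMARY KEY, query TEXT, executed_at INTEGER, result_hash TEXT
    );
""")


def run_as_of(query, timestamp):
    """Run a query against the data as it was at the given timestamp."""
    return conn.execute(query, {"ts": timestamp}).fetchall()


def result_hash(rows):
    return hashlib.sha256(repr(rows).encode("utf-8")).hexdigest()


def cite(query, timestamp):
    """Store the query with its timestamp and result hash; return a PID."""
    rows = run_as_of(query, timestamp)
    # Hypothetical PID scheme, standing in for a call to a PID provider.
    pid = "pid:" + hashlib.sha256(f"{query}|{timestamp}".encode("utf-8")).hexdigest()[:12]
    conn.execute(
        "INSERT OR IGNORE INTO query_store VALUES (?, ?, ?, ?)",
        (pid, query, timestamp, result_hash(rows)),
    )
    return pid


def resolve(pid):
    """Re-execute the stored query and check that the result is unchanged."""
    query, ts, stored = conn.execute(
        "SELECT query, executed_at, result_hash FROM query_store WHERE pid = ?",
        (pid,),
    ).fetchone()
    rows = run_as_of(query, ts)
    return rows, result_hash(rows) == stored


SUBSET = """SELECT id, value FROM measurements
            WHERE valid_from <= :ts AND (valid_to IS NULL OR valid_to > :ts)
            ORDER BY id"""

conn.execute("INSERT INTO measurements VALUES (1, 10.0, 1, NULL)")
conn.execute("INSERT INTO measurements VALUES (2, 20.0, 1, NULL)")
pid = cite(SUBSET, 1)

# Update at time 2: close the old row version, then add the new one.
conn.execute("UPDATE measurements SET valid_to = 2 WHERE id = 2 AND valid_to IS NULL")
conn.execute("INSERT INTO measurements VALUES (2, 25.0, 2, NULL)")

rows, unchanged = resolve(pid)
print(rows, unchanged)
```

Resolving the PID after the update still returns the subset as it was cited, because old row versions are closed rather than overwritten.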

Sustainable (e-)Science

How to get there?

Research infrastructure support

- Versioning systems

- Logging (“virtual lab-book”)

- Virtual machines / pre-configured virtual labs for research

- Data citation support for large, dynamic databases

R&D in process preservation, re-deployment & verification

- Evolving research environments, code migration, …

- Verification of process re-execution

- Financial impact, business models

Summary

Need to move beyond concept of data

Need to move beyond the focus on description

Process Management Plans (PMPs) extending DMPs

Process capture, preservation & verification

Capture “all” elements of a research process

Machine-readable and -actionable

Data and process re-use as basis for data driven science

Thank you!

http://www.ifs.tuwien.ac.at/imp
