Data management all a scientist never wanted to know but will not be able to avoid

Upload: elijah-mckinney

Post on 14-Dec-2015


Data management

all a scientist never wanted to know but will not be able to avoid

Why data management?

Selfish reasons
  Work more efficiently
  Avoid data corruption and loss
Altruistic reasons
  Facilitate data exchange
  Avoid data loss
Altruistic = selfish in the long run
  Treat others as you want to be treated

Why conserve data?

Moral obligation
Price of data collection
Uniqueness of observations
  You can’t measure a 2003 temperature in 2009
Allow peer review and audit of results
  Cf. molecular genetics: requirement to deposit sequences in international databases (GenBank)

Tools of the trade

Principles and attitude are more important than hardware
  In principle, dissociated from computer use
    Cf. the gigantic card indices of some libraries
  In practice, involves the use of computerised databases
    Often an RDBMS
    Not always! (GenBank, World Ocean Database)

E2EDM

Data management starts from day α
  A data management plan should be part of any ‘project’ description
Data management ends on day ω
  The last activity of a project is submitting the final data set to a ‘deep archive’
‘End-to-End Data Management’

Data?

Results of measurements or observations
Monitoring vs scientific
‘Operational’ vs delayed-mode
Supporting data (e.g. ‘underway data’ collected automatically by research vessels)
Not necessarily numerical (e.g. species identifications)
Measurement scales: nominal, rank, interval, ratio
Representation: string, boolean, integer, real

Information?

Widely different meanings
  (Supporting data)
  Interpreted data
  Metadata: data about data
  Data about the science rather than about the scientific subject
    E.g. bibliographies, directories

Different aspects

Documentation and inventories
Recording and logging procedures
Quality control
Exchange, redistribution
Backup
Archive

Documentation

Creating information about the dataset: metadata
  What, where, objectives, limitations…
Make available as widely as possible
  Avoid duplication
  Attract partners (scientific!)
Store metadata together with data
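One way to keep metadata together with the data is a small "sidecar" file written next to the data file. The sketch below is only an illustration: the function name `write_with_metadata`, the `.meta.json` suffix and the metadata field names are assumptions made for this example, not part of any metadata standard.

```python
import json
import tempfile
from pathlib import Path


def write_with_metadata(data_path, rows, metadata):
    """Write rows as CSV and a JSON metadata sidecar next to the data file."""
    data_path = Path(data_path)
    data_path.write_text("\n".join(",".join(map(str, r)) for r in rows) + "\n")
    # Hypothetical naming convention: <data file name>.meta.json
    sidecar = data_path.with_name(data_path.name + ".meta.json")
    sidecar.write_text(json.dumps(metadata, indent=2))
    return sidecar


# Illustrative (made-up) dataset and metadata fields: what, where, objectives, limitations
rows = [("station", "temp_c"), ("S1", 12.1), ("S2", 11.8)]
metadata = {
    "what": "sea surface temperature",
    "where": "example transect",
    "objectives": "demonstration only",
    "limitations": "made-up values",
}

with tempfile.TemporaryDirectory() as tmp:
    sidecar = write_with_metadata(Path(tmp) / "temps.csv", rows, metadata)
    loaded = json.loads(sidecar.read_text())
```

Because the sidecar sits in the same directory as the data, copying or archiving that directory carries the documentation along with it.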

Documentation

Different types of metadata
  Discovery
  Documentation
  Technical
Serve different purposes, often in different systems
Ideally ‘harvested’ from data

Inventorising

Metadata database
  Discovery-type information
Document not only what has been measured, but also planned campaigns
Make inventory searchable
  Facilitate exchange of data and information
  Avoid duplication
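A searchable inventory can be sketched as a list of discovery-type records with a keyword search. Everything here is hypothetical: the record fields, the `search` function and the two sample entries are invented for illustration and do not reflect the schema of GCMD, MEDI or any other existing system.

```python
# Made-up discovery records; "status" distinguishes finished from planned campaigns
inventory = [
    {"title": "Coastal CTD survey", "keywords": {"temperature", "salinity"},
     "status": "measured"},
    {"title": "Planned mooring deployment", "keywords": {"temperature", "currents"},
     "status": "planned"},
]


def search(records, keyword, status=None):
    """Return titles of records matching a keyword, optionally filtered by status."""
    keyword = keyword.lower()
    return [
        rec["title"]
        for rec in records
        if keyword in rec["keywords"] and (status is None or rec["status"] == status)
    ]


all_temperature = search(inventory, "temperature")
planned_only = search(inventory, "temperature", status="planned")
```

Listing planned campaigns alongside completed ones is what lets a prospective partner spot overlap before the work is duplicated.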

Existing systems

Global Change Master Directory (GCMD) Gcmd.nasa.gov

IODE Marine Environmental Directory of Information (MEDI)

Recording

Often in systems other than the final data management system
  Paper forms
    Reminder of what information should be recorded
  Spreadsheet
    Makes quality control possible during first steps
Needs a system to control data flow

Quality control

Automated
  Range check (impossible values)
  Statistical (improbable values)
    Danger of excluding unexpected phenomena (e.g. hole in the ozone layer, El Niño)
Expert
  ‘Manually’: anything that requires knowledge of the subject area
  Often involves creating graphs
Flag, don’t delete
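The automated checks and the "flag, don't delete" rule can be sketched as follows. This is a minimal illustration under stated assumptions: the flag codes, the function name `flag_values` and the sample temperatures are invented for the example, and the statistical test is a simple distance-from-mean rule chosen for brevity, not a recommended method.

```python
from statistics import mean, stdev

# Hypothetical flag codes for this sketch (not an official flagging scheme)
GOOD, IMPROBABLE, IMPOSSIBLE = 1, 3, 4


def flag_values(values, valid_min, valid_max, n_sigma=3.0):
    """Return one flag per value; the input values are never changed or removed."""
    # Statistics are computed only on values that pass the range check
    in_range = [v for v in values if valid_min <= v <= valid_max]
    centre = mean(in_range) if in_range else 0.0
    spread = stdev(in_range) if len(in_range) > 1 else 0.0
    flags = []
    for v in values:
        if not (valid_min <= v <= valid_max):
            flags.append(IMPOSSIBLE)   # range check: impossible value
        elif spread > 0 and abs(v - centre) > n_sigma * spread:
            flags.append(IMPROBABLE)   # statistical check: improbable value
        else:
            flags.append(GOOD)
    return flags


# Made-up water temperatures in degrees Celsius; -99.0 mimics a sensor fill value
temps = [12.1, 12.3, 11.9, 12.0, 12.2, 12.1, 11.8, 12.4, 12.0, 19.5, -99.0]
flags = flag_values(temps, valid_min=-2.0, valid_max=40.0, n_sigma=2.5)
```

Here the reading of 19.5 is flagged as improbable and -99.0 as impossible, but both stay in `temps`. An expert can later decide whether the improbable value is an error or a real, unexpected phenomenon.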

Backing up

Needs rigorous procedures
Keeping a separate copy of working data sets
  Disaster recovery
    Needs a copy to be kept in a separate location
  Wrong manipulation
On larger systems: on specialised hardware (tape drives…), necessitated by large volume
But the principle is more important!!
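Whatever the hardware, a rigorous backup procedure copies the working file and then verifies the copy. The sketch below assumes a plain file copy checked with a SHA-256 checksum; the function names and paths are hypothetical, and a temporary directory stands in for what should be a separate location in practice.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source, backup_dir):
    """Copy source into backup_dir and verify the copy against the original."""
    source = Path(source)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)  # copy2 also preserves file timestamps
    if sha256_of(source) != sha256_of(target):
        raise OSError(f"backup of {source} does not match the original")
    return target


with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "working" / "cruise.csv"
    original.parent.mkdir()
    original.write_text("station,temp_c\nS1,12.1\n")
    # In real use the backup directory would be on another machine or site
    copy = back_up(original, Path(tmp) / "backup")
    same_content = copy.read_text() == original.read_text()
```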

Exchanging

Communicating data to others
  To systems: distributed data systems
  To people
Requires data exchange protocols
  Agree on the formats for exchange
Requires a data exchange policy
  Agree on what can be done with the data by the ‘recipient’

Archiving

Important to ensure long-term integrity of the data
  On time scales that are typically much longer than a project…
Often will involve specialised organisations
  Data repositories, data centres
Needs careful thinking about storage medium
  Magnetic media are not ideal, certainly not in tropical countries
  Documentation, viewer software

Role of data centres

Data management tasks
  Inventorising and documenting
  Archiving
Specific tasks
  Redistribution
  Integration
Support

Redistribution

Preferably online
  Fast and efficient
  No marginal costs
Inventory
  Metadatabase as a tool
Data rescue
  Recovering data that are in danger of being lost
Respecting rights of data providers
  Data policy
  Proper use statement

Integration

Over different disciplines
  CTD cast/Niskin bottles
Over different institutions
  Implies ‘trust’
  Needs formal arrangements
    Data policy
Creates possibility of extra quality control
  Checks on consistency

New technologies

Technological developments make new types of applications possible
  Internet, bandwidth
  Standard protocols
    DiGIR, XML
  Distributed databases
Data centres are forced to rethink their role
  No longer a passive archive, but an active service centre

Data policy

Formal agreement between partners exchanging data

Describes rights and duties of data provider and data user

Considerations
  Data are public property
  Rights of data collector

Prisoner’s dilemma

Thought experiment in ESS (evolutionarily stable strategy) research
Fate of prisoners depends on their behaviour:
  If they collaborate, they have a reasonable chance of escaping
  A traitor is released, his companion stays
Evolutionarily stable strategy: works against collaboration

Data sharers’ dilemma

Three data sets: A’s data, B’s data, C’s data
A and B play it fair, C cheats
The cheat wins, since (s)he has access to his or her own *and* to the naïve colleagues’ data
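The arithmetic behind the dilemma can be made explicit with a small count of who can see which data sets. This is a sketch under one assumption taken from the slide: everyone can read the shared pool, whether or not they contribute to it. The function name `accessible_datasets` is invented for the example.

```python
def accessible_datasets(shares):
    """Count the data sets each participant can use.

    shares maps a participant to True (contributes own data to the pool)
    or False (keeps own data private, i.e. cheats).
    """
    pool = {name for name, sharing in shares.items() if sharing}
    # Everyone sees the pool plus their own data, shared or not
    return {name: len(pool | {name}) for name in shares}


access = accessible_datasets({"A": True, "B": True, "C": False})
# A and B each see 2 data sets; C sees 3
```

With no reward for contributing, withholding is never worse for the individual, which is why a data policy has to change the payoff.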

Data policy

Breaking the prisoner’s dilemma
Rewards for data providers?
  Co-authorship
  Dataset citation