VO Sandpit, November 2009

Managing complex datasets and accompanying information for reuse and repurpose

Bryan N. Lawrence, Sarah Callaghan, Sam Pepler
STFC Centre for Environmental Data Archival
[email protected]


Source: home.badc.rl.ac.uk/lawrence/static/2011/03/29/data2knowledge.pdf



Outline

Context: The Data Deluge; Communications and Evidence

Real Experience: Who are we and where are we from? What is metadata and where does it come from?

A couple of simple examples.

One complex example: CMIP5

Exploiting information about data: Data Citation: why it's crucial, and what we are doing.

… and subsiding into the sunset with a short summary!


The Data Deluge

“the amount of data generated worldwide...is growing by 58% per year; in 2010 the world generated 1250 billion gigabytes of data”

Decisions, decisions, decisions (need) information & better yet PRIOR planning

(Exploded view from 2007 IDC study – but note colours swapped)
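The quoted figures imply compound growth. A minimal sketch of that arithmetic follows, assuming (hypothetically) that the 58% annual rate stays constant; the quote itself gives only the rate and the 2010 volume, so the projection function and the doubling time are illustrations, not forecasts.

```python
import math

BASE_YEAR = 2010
BASE_VOLUME_GB = 1250e9   # "1250 billion gigabytes" from the quote (1.25 zettabytes)
ANNUAL_GROWTH = 0.58      # "growing by 58% per year"; assumed constant here


def projected_volume_gb(year: int) -> float:
    """Projected worldwide data volume in gigabytes under constant growth."""
    return BASE_VOLUME_GB * (1 + ANNUAL_GROWTH) ** (year - BASE_YEAR)


def doubling_time_years() -> float:
    """Years needed for the volume to double at the assumed growth rate."""
    return math.log(2) / math.log(1 + ANNUAL_GROWTH)


if __name__ == "__main__":
    print(f"Doubling time: {doubling_time_years():.2f} years")
    print(f"Projected 2012 volume: {projected_volume_gb(2012):.3e} GB")
```

At this rate the volume doubles roughly every year and a half, which is why prior planning matters more than any single purchase of storage.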


SI Prefixes

SI prefix    Name          Power of 10   Power of 2   Status

k  kilo      thousand      10^3          2^10         Count on fingers
M  mega      million       10^6          2^20         Trivial
G  giga      billion       10^9          2^30         Small
T  tera      trillion      10^12         2^40         Real
P  peta      quadrillion   10^15         2^50         Challenging
E  exa       quintillion   10^18         2^60         Aspirational
Z  zetta     sextillion    10^21         2^70         Wacko
Y  yotta     septillion    10^24         2^80         Science fiction

Stuart Feldman, Google


Humans and the Data Deluge

A person working full time for a year has about 1500 hours to do something. (In the UK 220 working days a year is about standard. Let's remove about 20 days for courses, staff meetings etc. ... so that leaves about 200 days or, for a working day of 7.5 hours, a working year of about 1500 hours.)

• What does a 50 TB dataset mean?
– Take a set of climate predictions.
– A single lat/lon map might be of order 50 KB … so we have of the order of 1 billion maps.
– Looking at each map for 10 s, one individual could quality control those maps in approximately two thousand years of work!
– Bring on crowd sourcing … but there are only so many people in the world!

• If it takes 2 minutes to find something, have a quick look at it and extract (e.g.) a parameter name,

• you can process 45,000 items a year

• but no human could do that full time (repetitive boredom)!

• Maybe 30K in two years?

Your examples will differ, but your conclusions are unlikely to: we can’t manage big data relying on humans! We need automation!
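The arithmetic on this slide can be checked with a few lines. This is a sketch assuming decimal units (50 TB = 50 × 10^12 bytes, 50 KB = 50 × 10^3 bytes); the function name is made up for illustration.

```python
# Working year from the slide: 200 working days of 7.5 hours each.
HOURS_PER_YEAR = 200 * 7.5


def review_effort(dataset_bytes: float, item_bytes: float, seconds_per_item: float):
    """Return (number of items, working years needed to look at every item once)."""
    items = dataset_bytes / item_bytes
    hours = items * seconds_per_item / 3600
    return items, hours / HOURS_PER_YEAR


# 50 TB of climate predictions, one 50 KB map at a time, 10 seconds per map.
maps, years = review_effort(50e12, 50e3, 10)

# Two minutes per item, full time for one working year.
items_per_year = HOURS_PER_YEAR * 60 / 2

if __name__ == "__main__":
    print(f"{maps:.0e} maps, about {years:.0f} working years")
    print(f"{items_per_year:.0f} items per year at 2 minutes each")
```

With these inputs the sketch gives 10^9 maps and a little under two thousand working years, and 45,000 items per year at two minutes each.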


Data Landscape

(Diagram: collections arranged by scale, from Laptop Scale up to Petascale, and from (more) homogeneous to heterogeneous(?). Slide concept: Carole Goble via Liz Lyons.)

Labels on the diagram:

– Big Data, Big Science: ever larger!
– Personal science: ever more complex, and ever more of it!
– CMIP5 Archive
– Earth Observation (Climate Data)
– ChemSpider
– Typical departmental collection
– One instrument archive
– Fortran, IDL, Matlab, Python & (yuck) Excel, Access


Preservation and Curation

One could argue that the writers of these documents did a brilliant job of preserving the bits-and-bytes of their time …thousands of years of preservation must be worth something … but

Phaistos Disk, 1700BC

they’ve both been translated many times, it’s a shame the meanings are different …

=> Data Preservation is not enough, we need “Active Curation” to preserve Information


The reality of curation (as opposed to preservation)

(Diagram: data collections arranged from (more) homogeneous to heterogeneous. Labels on the diagram:)

– Expensive, but generally already being managed!
– Standards, Controlled Vocabularies
– risks
– Danger Zone: off most radar. But worth it!
– Give up! (Prohibitively expensive to be anything other than deceptive!)

(Solution: move data into managed repositories sharing some common syntax and semantics, i.e. move it left!)

Libraries work well because we have common concepts (books, chapters, pages, words, interpreted using language).

Large data repositories share common concepts for syntax and semantics (and are actively managed for real user communities – like museums).


(Scientific) Communication through the ages

Science, as a process, requires the exchange of information and ideas.

We can make this exchange face-to-face (conferences, meetings, seminars) or through another medium (text, video, images), or both.

No matter what method we use, we wind up telling each other stories about what we’ve discovered.

Technology has given us new tools, but it’s also provided new challenges

http://www.intoon.com/#68559


Journals – a 17th century technology

The first scientific journal, Journal des sçavans (later renamed Journal des savants), was first published on Monday, 5 January 1665.

It also carried a proportion of material that would not now be considered scientific, such as obituaries of famous men, church history, and legal reports.

It still exists, but is more of a literary journal

The first edition of the Philosophical Transactions of the Royal Society of London was on 6 March 1665. That still exists, and continues to publish scientific information to this day.


Journals work, but...

... They’re not enough now to communicate everything we need to know about a scientific event

- whether that’s an observation, simulation, development of a theory, or any combination of these.

Data always has been the foundation of scientific progress – without it, we can’t test any of our assertions.

Previously data was hard to capture, but could be (relatively) easily published in image or table format – papers “included” data!

But now...

(Image: Suber cells and mimosa leaves. Robert Hooke, Micrographia, 1665)


The role of data

And those data appear, if they're lucky, as one figure!

Just exactly how “evidential” is the academic record now?

Can her work be repeated/refuted without access to her data? (And even with the data, will they alone be enough? What metadata might one need?)


Repeat – Verify, Refute; “cite”; Trust

Reuse; Extrapolate; Aggregate

Extend: “build on the shoulders of giants”

PROGRESS!


Who are we…? What do we know about data?

We’re (several of) the NERC data centres!


Some BADC numbers for context

Dataset: A collection of files sharing some administrative and/or project heritage.

BADC has approximately 150 real datasets (and thousands of virtual datasets).

BADC has approx 200 million files containing thousands of measured or simulated parameters.

BADC tries to deploy information systems that describe those data, parameters, projects and files, along with services that allow one to manipulate them …

Calendar year 2010: 2800 active users (of 12000 registered), downloaded 64 TB data in 16 million files from 165 datasets.

Less than half of the BADC data consumers are “atmospheric science” users!


What is this metadata malarkey?

Metadata:

(1) a noun carrying not much information

(do you know what I mean when I say metadata?)

or

(2) the heart and soul of data reuse?

BOTH!


Metadata for Discovery, Documentation, Definition

Lawrence et al 2009, doi:10.1098/rsta.2008.0237


Who makes metadata? Part 1 of 4

• Metadata should be added in layers, with original metadata created with measurements, augmented by notes, and extra material (provenance) as it migrates within the scientific community.

– Big role for electronic notebooks and machine generated information!

• The tools for creating and maintaining that metadata could and (probably) should vary with migration …

• … before it finally reaches a data scientist for ingestion … but data scientists have a role in helping shape the information (requirements and syntax) at every step of the way.

(Diagram labels:)

Individuals, research groups, collaborators, data centres, data users/citers

Provenance, Documentation (aka metadata)

Documentation and Institutional Wisdom

“Lab Books”, “Notes”, “File_Metadata”
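The idea of metadata added in layers can be sketched as a simple data structure. This is a minimal illustration only: the class names, field names and example values below are hypothetical and are not taken from any BADC system.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MetadataLayer:
    added_by: str   # who added it, e.g. "individual", "research group", "data centre"
    kind: str       # what it is, e.g. "lab book", "notes", "file metadata"
    content: dict   # the information itself


@dataclass
class DatasetRecord:
    name: str
    layers: List[MetadataLayer] = field(default_factory=list)

    def add_layer(self, added_by: str, kind: str, content: dict) -> "DatasetRecord":
        # Layers are appended, never overwritten, so provenance accumulates
        # as the data migrate through the community.
        self.layers.append(MetadataLayer(added_by, kind, content))
        return self

    def provenance(self) -> List[str]:
        """Who has contributed metadata, in the order it was added."""
        return [layer.added_by for layer in self.layers]


if __name__ == "__main__":
    record = DatasetRecord("example-campaign")  # hypothetical dataset name
    record.add_layer("individual", "lab book", {"instrument": "radiometer"})
    record.add_layer("research group", "notes", {"calibration": "applied"})
    record.add_layer("data centre", "file metadata", {"format": "NetCDF"})
    print(record.provenance())
```

Keeping each contribution as its own layer means the original measurement metadata is never lost when later notes are added.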


Who makes metadata? Part 2 of 4

Very Important (Newish) Career

(A specific data scientist role)

Define Structure

Create A metadata


Who makes metadata? Part 3 of 4

Create B, O and D metadata!

Acquire E metadata

Connect data and metadata to SERVICES


Who makes metadata?: Part 4 of 4

The USER does!

Citation and Annotation: crucial measures of quality

– cf. Amazon and Trip Advisor

Quality is:

• An essential and distinguishing attribute.
• A degree or grade of excellence or worth.
• The degree to which a man-made object or system is free from bugs and flaws …
• A perceptual, conditional and somewhat subjective attribute that may be understood differently by different people.
• In relation to relevancy, this term refers to … how important it is as viewed by the content owner or search application.
• The suitability of procedures, processes and systems in relation to the strategic objectives.

Philosophically: distinction between primary and secondary qualities (John Locke):

– Primary qualities are intrinsic to an object (a thing or a person), whereas

– Secondary qualities are dependent on the interpretation and context.

From a metadata perspective, annotation of quality is just as important as provider attribution.

“C” metadata crucial to reuse!


Putting it all together

The things we worry about!

The difference between preservation and curation? All information packages need to evolve as the producer and consumer communities evolve!


Example One


(Figure labelled “B”)


(Figure labelled “A”)


Example Two: The beauty of really old incidental metadata!

(Not ours)


Old Weather/New Results

(Modern max & min over HMS Dorothea 1818; Brohan et al 2010)

e.g. voyage of HMS Dorothea, 1818 (and others)

Comparing climate of past with current climate range!


Citizen Science: Old Weather

http://www.oldweather.org


Example 3: Global Earth System Modelling & CMIP5

(Image: from J. Lafeuille, 2006)


CMIP5: Fifth Coupled Model Intercomparison Project

Global community activity under the auspices of the World Meteorological Organisation (WMO) via the World Climate Research Programme (WCRP).

Aim:

– to address outstanding scientific questions that arose as part of the IPCC AR4 process,
– to improve understanding of climate, and
– to provide estimates of future climate change that will be useful to those considering its possible consequences.

Method: a standard set of model simulations in order to:

– evaluate how realistic the models are in simulating the recent past,
– provide projections of future climate change on two time scales, near term (out to about 2035) and long term (out to 2100 and beyond), and
– understand some of the factors responsible for differences in model projections, including quantifying some key feedbacks such as those involving clouds and the carbon cycle.


CMIP5 numbers!

Simulations:

~90,000 years
~60 experiments within CMIP5
~20 modelling centres (from around the world) using
~several model configurations each
~2 million output “atomic” datasets
~10s of petabytes of output
~2 petabytes of CMIP5 requested output
~1 petabyte of CMIP5 “replicated” output

Which will be replicated at a number of sites (including ours), arriving now!

Of the replicants:

~220 TB decadal
~540 TB long term
~220 TB atmos-only

~80 TB of 3-hourly data
~215 TB of ocean 3D monthly data!
~250 TB for the cloud feedbacks!
~10 TB of land-biochemistry (from the long term experiments alone).
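A quick consistency check on the replicated volumes: the decadal, long term and atmos-only figures should add up to roughly the 1 petabyte quoted for replicated output. This is a sketch using the approximate slide numbers and decimal units (1 PB = 1000 TB); the dictionary name is made up for illustration.

```python
# Approximate replicated volumes in terabytes, as quoted on the slide.
REPLICATED_TB = {
    "decadal": 220,
    "long term": 540,
    "atmos-only": 220,
}

total_tb = sum(REPLICATED_TB.values())
total_pb = total_tb / 1000  # decimal units: 1 PB = 1000 TB

# Fraction of the replicated archive taken by each experiment family.
shares = {name: tb / total_tb for name, tb in REPLICATED_TB.items()}

if __name__ == "__main__":
    print(f"Total replicated: {total_tb} TB (about {total_pb:.2f} PB)")
    for name, share in shares.items():
        print(f"  {name}: {share:.0%}")
```

The three families sum to 980 TB, consistent with the "~1 petabyte" headline, and the long term experiments account for a little over half of it.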


Volume or Metadata?

So we could wax lyrical about how hard it is to handle the data volume!

– Yes, we need to purchase a lot of disk and tapes!

– Yes, we need it to be fast disk.

– No, we can't just wander down to PC-World and buy armfuls of disks.

– Yes, we need to worry about networks, and

– Yes, that all takes time and money,

BUT

The hard part is the information handling! The major cost for supporting CMIP5 is in the information systems

– describing the models,
– the data, how to find them, and how to use them,

and eventually,

– what data one used, and what results were obtained!


IPCC:

FAR: 1990
SAR: 1995
TAR: 2001
AR4: 2007
AR5: 2013


State of the Art: Model Comparison

Guilyardi E. (2006): El Niño- mean state - seasonal cycle interactions in a multi-model ensemble. Clim. Dyn., 26:329-348, DOI: 10.1007/s00382-005-0084-6

1: Tabulate some interesting property (and author grafts hard to get the information)


State of the Art: Model Comparison

Kharin et al, Journal of Climate 2007, doi: 10.1175/JCLI4066.1

Dai, A., J. Climate 2006, doi: 10.1175/JCLI3884.1

2: Provide some (slightly) organised citation material (and author and readers graft hard to get the information)


State of the art: Model Comparison

3: Calculate and tabulate some interesting properties and bury in a table or figure

Guilyardi E. (2006): El Niño- mean state - seasonal cycle interactions in a multi-model ensemble. Clim. Dyn., 26:329-348, DOI: 10.1007/s00382-005-0084-6

Not an easy or time efficient way of doing things!


Handling the CMIP5 data

Earth System Grid (ESG)

US DoE funded project to develop software and support CMIP5. Consists of

– distributed data node software (to publish data)

– Tools

– gateway software (to provide catalog and services)

Metafor

– Information model to describe models and simulations, and

– Tools to manipulate it

Major “technical challenges”

Earth System Grid FEDERATION (ESGF)

Global initiative to deploy the ESG (and other) software to support:

– timely access to the data

– minimum international movement of the data

– long term access to significant versions of the CMIP5 data.

Major “social challenge” as well as “technical challenge”


CMIP5: Handling the metadata

Three streams of provenance metadata:
A) “archive” metadata
B) “browse” metadata
C) “character” metadata

A: Archive Metadata: three levels of information from the file system:

I. CF compliance in the NetCDF files

II. “Extra” CMIP5 required attributes including a unique identifier within each file.

III. Use of the Data Reference Syntax (DRS) to help maintain version information.

Compliance enforced by ESG publisher.

B: Browse Metadata, added independently of the archive

• Exploiting Metafor controlled vocabularies in a “Common Information Model” (CIM) via a customised “CMIP5 questionnaire”.

Compliance enforced by CMIP5 quality control systems, leading to

C: Character Metadata

• Data assessment
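The three levels of archive metadata lend themselves to automated checks. The sketch below is an illustration only and not the ESG publisher's logic: the attribute names, the reduced list of required attributes and the shortened path layout are all assumptions made for the example, and the attributes are passed in as a plain dictionary rather than read from a NetCDF file.

```python
import uuid
from pathlib import PurePosixPath

# Hypothetical, much reduced list of required global attributes.
REQUIRED_ATTRIBUTES = ("Conventions", "tracking_id", "model_id", "experiment_id")


def check_archive_metadata(attrs: dict) -> list:
    """Return a list of problems found in a file's global attributes."""
    problems = [f"missing attribute: {name}"
                for name in REQUIRED_ATTRIBUTES if name not in attrs]
    # Level I: the file should declare which CF convention it follows.
    if not str(attrs.get("Conventions", "")).startswith("CF-"):
        problems.append("Conventions does not declare a CF version")
    # Level II: the per-file unique identifier, assumed here to be a UUID.
    try:
        uuid.UUID(str(attrs.get("tracking_id", "")))
    except ValueError:
        problems.append("tracking_id is not a valid UUID")
    return problems


def versioned_path(attrs: dict, version: str) -> PurePosixPath:
    """Level III: a simplified, DRS-like directory path carrying the version."""
    return PurePosixPath(attrs["model_id"], attrs["experiment_id"], version)


if __name__ == "__main__":
    attrs = {
        "Conventions": "CF-1.4",
        "tracking_id": str(uuid.uuid4()),
        "model_id": "EXAMPLE-MODEL",      # hypothetical model name
        "experiment_id": "historical",
    }
    print(check_archive_metadata(attrs))       # no problems expected
    print(versioned_path(attrs, "v20110329"))
```

Because the checks need nothing but the file's own attributes, they can run at publication time on every one of the millions of files without human involvement.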


Bringing it all together for CMIP5

The players:

A) Data publication system on data nodes: strips metadata from files, loads into a catalogue. Viewable from Gateways!

B) Model descriptions, and C) quality descriptions: entered via a combination of script-driven upload and questionnaire entry. Viewable from Gateways and third party Portals!


A look under the hood of a piece of the CIM

Aerosol, Atmosphere, Atmospheric Chemistry, Land Ice, Land Surface, Ocean Biogeochem, Ocean, Sea Ice

(Yes, we know we shouldn't have this sort of detail in the UML, and it won't be … shortly)


Scientific Properties: Controlled Vocabularies developed with expert community using mindmaps and some “rules” to aid machine post processing …

(everyone can use mindmaps: no a priori semantic technology knowledge required.)


A piece of the mindmap XML …
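To give a flavour of the machine post processing, here is a minimal sketch that turns a mindmap into a flat list of vocabulary terms. It assumes a FreeMind-style layout (nested `node` elements with a `TEXT` attribute); the actual Metafor files, rules and terms are not reproduced here, and the example terms are invented.

```python
import xml.etree.ElementTree as ET

# Invented example content in an assumed FreeMind-style layout.
MINDMAP_XML = """
<map>
  <node TEXT="Atmosphere">
    <node TEXT="Radiation">
      <node TEXT="Longwave scheme"/>
    </node>
    <node TEXT="Cloud scheme"/>
  </node>
</map>
"""


def vocabulary_terms(xml_text: str) -> list:
    """Flatten nested mindmap nodes into 'Parent > Child' vocabulary terms."""
    root = ET.fromstring(xml_text.strip())
    terms = []

    def walk(node, parents):
        path = parents + [node.get("TEXT")]
        terms.append(" > ".join(path))
        # Only direct children, so the hierarchy is preserved in the path.
        for child in node.findall("node"):
            walk(child, path)

    for top in root.findall("node"):
        walk(top, [])
    return terms


if __name__ == "__main__":
    for term in vocabulary_terms(MINDMAP_XML):
        print(term)
```

The experts only ever see the mindmap; the flattened terms are what a questionnaire or validator would consume.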


… vocabulary driven content in web based “human entry tool”

(A great advertisement for Python and Django)


The Earth System Grid Federation

Data Nodes, providing data services and publishing to Data Gateways linked in a Global Federation

Three nodes (and gateways) committed to persisting the data!

(Slide from D. Williams et al, in preparation)


Earth System Grid Gateways

NCAR, PCMDI


ESGF Gateways


Take home points from the CMIP5 example

Big data is a global activity, requiring global solutions:
- global technology with global agreements!
- it is our experience that the information requirements of multi-disciplinary science cannot be satisfied “in-house”!

Big data requires big information systems
- storage is the least of our worries (which doesn't mean it's insignificant)

There WILL be complex lines of provenance in all interesting systems
- information aggregation matters,
- without “standards” (for information, and for interfaces), we have little chance of progress

None of this comes cheap, and despite our best efforts, initial budgets never really accurately quantify the resources required!


Segue from CMIP5 to Data Publication

From CMIP5 to data publication.

CMIP5 has three levels of “internal” quality control:

– 1: Syntactic: do the data and metadata conform to controlled vocabularies and syntactic requirements?

– 2: Semantic: are the data and metadata sensible and consistent with expectation? (Includes human “spot checks” of diagnostic plots and questionnaire output.)

– 3: Support Citation: match the data and model metadata, confirm with the “authors”, further human examination of the metadata, leading to

• Publication by the World Data Centre for Climate, who will assign and maintain a Digital Object Identifier.
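The three quality control levels form a ladder: a dataset only reaches a level once it has passed the ones below. The sketch below illustrates that ordering with deliberately simple stand-ins; the record fields, the range test and the confirmation flag are hypothetical, and the real levels 2 and 3 involve human inspection that code cannot replace.

```python
def quality_level(record: dict, vocabulary: set) -> int:
    """Return the highest QC level (0 to 3) that a record passes."""
    # Level 1, syntactic: the experiment name comes from the controlled vocabulary.
    if record.get("experiment") not in vocabulary:
        return 0
    # Level 2, semantic: values fall inside the range an expert would expect.
    low, high = record["expected_range"]
    if not all(low <= value <= high for value in record["values"]):
        return 1
    # Level 3, citation support: the authors have confirmed the metadata.
    if not record.get("confirmed_by_authors", False):
        return 2
    return 3


if __name__ == "__main__":
    vocabulary = {"historical", "decadal"}   # hypothetical controlled terms
    record = {
        "experiment": "historical",
        "expected_range": (200.0, 330.0),    # e.g. plausible temperatures in kelvin
        "values": [250.0, 288.0, 301.5],
        "confirmed_by_authors": False,
    }
    print(quality_level(record, vocabulary))  # passes levels 1 and 2 only
```

Only records at the top level would go forward for publication and DOI assignment.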


Publishing data – why do it?

• Scientific journal publication mainly focuses on the analysis, interpretation and conclusions drawn from a given dataset.

• Examining the raw data that forms the dataset is more difficult, as datasets are usually stored in digital media, in a variety of (proprietary or non-standard) formats.

• Peer-review is generally only applied to the methodology and final conclusions of a piece of work, and not the underlying data itself. But if the conclusions are to stand, the data and accompanying metadata must be of good quality.

• A process of data publication, involving peer-review of datasets, is (in some cases) and will be (in others) of benefit to many sectors of the academic community – including data producers!


Publishing: Some terminology (1)

cite verb (GIVE EXAMPLE) /saɪt/ [T]
• formal - to mention something as proof for a theory or as a reason why something has happened
• formal - to speak or write words taken from a particular writer or written work

(Cambridge Advanced Learner's Dictionary - http://dictionary.cambridge.org/dictionary/british/cite_1 )

publish verb /ˈpʌb.lɪʃ/ [T]
to make information available to people, especially in a book, magazine or newspaper, or to produce and sell a book, magazine or newspaper

(Cambridge Advanced Learner's Dictionary - http://dictionary.cambridge.org/dictionary/british/publish )


Publishing: Some terminology (2)

0. Serving of data sets: This is what data centres do as our day job – take in data supplied by scientists and make it available to other interested parties. “On the web” is not the same as “Published”.

1. Data set Citation: This is what we’re aiming for now – formulate and formalise a way of citing data sets. Will provide benefits to our users – and a carrot to get them to provide data to us!

2. Publication of data sets: This involves the peer-review of data sets, and gives the “stamp of approval” associated with traditional journal publications. Can’t be done without effective linking/citing of the data sets.

Being able to cite a dataset is a good thing in its own right!


To publish or to Publish

We draw a clear distinction between

publishing = making available for consumption (e.g. on the web), and

Publishing = publishing after some formal process which adds value for the consumer:

– e.g. PLoS ONE type review, or
– EGU journal type public review, or
– More traditional peer review.

AND

– provides commitment to persistence


How we’re going to support citation and persistence

We decided to use digital object identifiers (DOIs) because:

• They are actionable, interoperable, persistent links for (digital) objects

• Scientists are already used to citing papers using DOIs

• Pangaea assign DOIs, and ESSD use DOIs to link to the datasets they publish

• The British Library gave us an allocation of 500 DOIs to assign to datasets as we saw fit.


What sort of data can we cite?

Dataset has to be:

• Stable (i.e. not going to be modified)

• Complete (i.e. not going to be updated)

• Permanent – by assigning a DOI we’re committing to make the dataset available for posterity

• Good quality – by assigning a DOI we’re giving it our data centre stamp of approval, saying that it’s complete and all the metadata is available

When a dataset is cited that means:

• There will be bitwise fixity

• With no additions or deletions of files

• No changes to the directory structure in the dataset “bundle”
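One way to check those three fixity conditions is to record a checksum manifest when the dataset is frozen and compare against it later. The following is a minimal sketch, not the procedure used at the data centre: the function names and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Reads the whole file into memory; fine for a sketch,
            # large archive files would need chunked hashing.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def fixity_problems(frozen, current):
    """List every difference between the frozen manifest and the current one."""
    problems = []
    # A file that was moved shows up as one deletion plus one addition,
    # which also catches changes to the directory structure.
    for name in sorted(frozen.keys() - current.keys()):
        problems.append(f"deleted or moved: {name}")
    for name in sorted(current.keys() - frozen.keys()):
        problems.append(f"added or moved: {name}")
    for name in sorted(frozen.keys() & current.keys()):
        if frozen[name] != current[name]:
            problems.append(f"content changed: {name}")
    return problems
```

An empty list from `fixity_problems` means the dataset “bundle” is bit-for-bit what was cited; anything else suggests a new edition is needed rather than a silent change.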

A DOI should point to an HTML representation of a record which describes a data object.

Upgrades to versions of data formats will result in new editions of datasets.

Landing page(s)

We don’t want to have the DOI resolve right at the archive level where you can get the database or files in the dataset because:

• This leaves the user with nothing to go on but a list of filenames or table records

• If we change the archive structure, that requires re-mapping all the DOIs

So, we need a landing page.

Users are used to this, as that’s what on-line journals do.

For example – clicking 10.1049/iet-map:20060126 will bring you to an HTML page with a link to a PDF of the referenced paper.
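To make the indirection concrete: the DOI itself never contains the archive path. It is handed to a resolver, which redirects to whatever landing page is currently registered for it. A minimal sketch of turning a DOI string into a clickable link follows, assuming the public `https://doi.org/` proxy; the helper name `doi_to_url` is made up for this example.

```python
from urllib.parse import quote

# Public DOI proxy; it redirects to the landing page registered for the DOI.
DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi):
    """Turn a DOI such as 'doi:10.1049/iet-map:20060126' into a resolver URL."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:]
    prefix, slash, suffix = doi.partition("/")
    if not (prefix.startswith("10.") and slash and suffix):
        raise ValueError(f"does not look like a DOI: {doi!r}")
    # Percent-encode anything unusual, but keep '/' and ':' readable.
    return DOI_RESOLVER + quote(doi, safe="/:")


print(doi_to_url("doi:10.1049/iet-map:20060126"))
```

Because only the resolver's record points at the landing page, the archive behind that page can be reorganised without touching any citation that has already been published.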

Human-readable citation string

We worked on this in the JISC and NERC-funded CLADDIER project (http://claddier.badc.ac.uk/trac) and will follow those rules.

For example:

Science and Technology Facilities Council (STFC), Chilbolton Facility for Atmospheric and Radio Research, [Wrench, C.L.]. Chilbolton Facility for Atmospheric and Radio Research (CFARR) data, [Internet]. British Atmospheric Data Centre, 2003-, Date of citation. Available from http://badc.nerc.ac.uk/data/chilbolton/.

For the GBS datasets:

Science and Technology Facilities Council (STFC), Chilbolton Facility for Atmospheric and Radio Research, [Callaghan, S. A., J. Waight, C. J. Walden, J. Agnew and S. Ventouras]. GBS 20.7GHz slant path radio propagation measurements, Chilbolton site, [Internet]. British Atmospheric Data Centre, 2003-2005, doi:10.12345/1234567890
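The two examples above share a pattern that can be generated from a handful of metadata fields. The sketch below reproduces the GBS string; the field names are labels chosen for this example, and the pattern is inferred from the two examples rather than taken from the CLADDIER rules themselves.

```python
def join_authors(authors):
    """Join names as 'A, B and C', matching the bracketed author list above."""
    if len(authors) <= 1:
        return "".join(authors)
    return ", ".join(authors[:-1]) + " and " + authors[-1]


def format_citation(funder, facility, authors, title, publisher, years, locator):
    """Assemble a human-readable citation string from its metadata fields."""
    return (
        f"{funder}, {facility}, [{join_authors(authors)}]. "
        f"{title}, [Internet]. {publisher}, {years}, {locator}"
    )


gbs = format_citation(
    funder="Science and Technology Facilities Council (STFC)",
    facility="Chilbolton Facility for Atmospheric and Radio Research",
    authors=["Callaghan, S. A.", "J. Waight", "C. J. Walden",
             "J. Agnew", "S. Ventouras"],
    title="GBS 20.7GHz slant path radio propagation measurements, Chilbolton site",
    publisher="British Atmospheric Data Centre",
    years="2003-2005",
    locator="doi:10.12345/1234567890",  # placeholder DOI from the slide
)
print(gbs)
```

Generating the string from the same record that feeds the landing page keeps the two from drifting apart.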

Using the DataCite API to mint DOIs

And this is the point where we are at the moment.

Next steps:

• Talk to our developers at the BADC about how to use the API (and writing a script to assign DOIs)

• Improve metadata and general user-friendliness of chosen landing pages.

• Confirm that each dataset to be cited is complete and ready to be frozen!

• Decide what our bit of the DOI string should be and finalize human-readable citation.

• Think about how we integrate DOIs into our metadata management procedures

• Prepare announcement to our users about our new ability to mint DOIs
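As a starting point for the script mentioned in the first step, here is a rough sketch of the metadata record such a script might assemble. It makes no network call. The JSON layout follows my understanding of the present-day DataCite REST API, which postdates this talk, so every field name should be checked against the official documentation before use. The landing URL and publication year are hypothetical placeholders.

```python
import json


def build_doi_record(doi, creators, title, publisher, year, landing_url):
    """Assemble a DOI metadata record (layout assumed, see the note above)."""
    return {
        "data": {
            "type": "dois",
            "attributes": {
                "doi": doi,
                "creators": [{"name": name} for name in creators],
                "titles": [{"title": title}],
                "publisher": publisher,
                "publicationYear": year,
                "types": {"resourceTypeGeneral": "Dataset"},
                # The DOI resolves to the landing page, not to the archive.
                "url": landing_url,
            },
        }
    }


record = build_doi_record(
    doi="10.12345/1234567890",  # placeholder DOI from the citation slide
    creators=["Callaghan, S. A.", "Waight, J.", "Walden, C. J.",
              "Agnew, J.", "Ventouras, S."],
    title="GBS 20.7GHz slant path radio propagation measurements, Chilbolton site",
    publisher="British Atmospheric Data Centre",
    year=2009,  # hypothetical value, for illustration only
    landing_url="https://example.org/landing/gbs",  # hypothetical landing page
)
print(json.dumps(record, indent=2))
```

Keeping record construction separate from submission means the same function can feed both the minting script and the landing page metadata.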

Data publication summary

• We have the ability now (thanks to the British Library) to mint our own DOIs

• We can therefore cite our datasets, giving academic credit to those scientists who get cited – making them more likely to give us good quality data to archive.

• Publication – and peer-review – is the next step

• We need to work with recognized academic journals to do this

• Data journals already exist:

• Earth System Science Data (http://earth-system-science-data.net/)

• Geochemistry, Geophysics, Geosystems (G3, http://www.agu.org/journals/gc/)

http://www.keepcalm-o-matic.co.uk/default.aspx#createposter

Summary and maybe conclusions?

• Data is important, and becoming more so for a far wider range of the population

• Conclusions and knowledge are only as good as the data they’re based on

– Data is only as good as its metadata!

– There are lots of important types of metadata!

• Science is supposed to be reproducible and verifiable

• It’s up to us as scientists to care for the data we’ve got and ensure that the story of what we did to the data is transparent

– So we can use the data again

– And so people will trust our results