TRANSCRIPT
Managing complex datasets and accompanying information for reuse
and repurpose
Bryan N. Lawrence, Sarah Callaghan, Sam Pepler
STFC Centre for Environmental Data Archival
Outline
Context: the Data Deluge; Communications and Evidence
Real experience: Who are we and where are we from? What is metadata and where does it come from?
A couple of simple examples.
One complex example: CMIP5
Exploiting information about data: data citation, why it's crucial, and what we are doing.
… and subsiding into the sunset with a short summary!
The Data Deluge
“the amount of data generated worldwide...is growing by 58% per year; in 2010 the world generated 1250 billion gigabytes of data”
Decisions, decisions, decisions (need) information, and better yet, PRIOR planning.
(Exploded view from 2007 IDC study, but note colours swapped)
SI Prefixes
SI prefix | Name  | Meaning     | Power of 10 | Power of 2 | Status
k         | kilo  | thousand    | 10^3        | 2^10       | Count on fingers
M         | mega  | million     | 10^6        | 2^20       | Trivial
G         | giga  | billion     | 10^9        | 2^30       | Small
T         | tera  | trillion    | 10^12       | 2^40       | Real
P         | peta  | quadrillion | 10^15       | 2^50       | Challenging
E         | exa   | quintillion | 10^18       | 2^60       | Aspirational
Z         | zetta | sextillion  | 10^21       | 2^70       | Wacko
Y         | yotta | septillion  | 10^24       | 2^80       | Science fiction
Stuart Feldman, Google
Humans and the Data Deluge
A person working full time for a year has about 1500 hours to do something. (In the UK 220 working days a year is about standard. Let's remove about 20 days for courses, staff meetings etc ... so that leaves about 200 days or, for a working day of 7.5 hours, a working year of about 1500 hours.)
• What does a 50 TB dataset mean?
– Take a set of climate predictions.
– A single lat/lon map might be of order 50 KB … so we have of the order of a billion maps.
– Looking at each map for 10 s, one individual could quality control those maps in approximately two thousand years of work!
– Bring on crowd sourcing … but there are only so many people in the world!
• If it takes 2 minutes to find something, have a quick look at it and extract (e.g.) a parameter name,
• you can process 45,000 items a year
• but no human could do that full time (repetitive boredom)!
• Maybe 30K in two years?
Your examples will differ, but your conclusions are unlikely to: we can't manage big data relying on humans! We need automation!
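As a back-of-the-envelope check of the numbers above, a minimal sketch (assuming, as the slide does, a 50 TB dataset, roughly 50 KB per map, 10 seconds of attention per map, and a 1500-hour working year):

```python
# Back-of-the-envelope check of the "humans and the data deluge" numbers.
# Assumptions (illustrative, from the slide): 50 TB dataset, ~50 KB per map,
# 10 seconds of human attention per map, 1500 working hours per year.
dataset_bytes = 50e12          # 50 TB
map_bytes = 50e3               # ~50 KB per lat/lon map
seconds_per_map = 10
hours_per_year = 1500

n_maps = dataset_bytes / map_bytes                 # ~1 billion maps
total_hours = n_maps * seconds_per_map / 3600      # ~2.8 million hours
work_years = total_hours / hours_per_year          # ~1,900 work-years

print(f"{n_maps:.1e} maps, {total_hours:.1e} hours, about {work_years:.0f} work-years")
```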
The Data Landscape (slide concept: Carole Goble via Liz Lyons)
A figure spanning petascale down to laptop scale, and from (more) homogeneous to heterogeneous data:
– Big Data / Big Science: ever larger! (e.g. the CMIP5 archive, Earth Observation / climate data)
– Typical departmental collection; one instrument archive; ChemSpider
– Personal science: ever more complex, and ever more of it! (Fortran, IDL, Matlab, Python and (yuck) Excel, Access)
Preservation and Curation
One could argue that the writers of these documents did a brilliant job of preserving the bits-and-bytes of their time … thousands of years of preservation must be worth something … but
Phaistos Disk, 1700 BC
they've both been translated many times; it's a shame the meanings are different …
=> Data Preservation is not enough, we need “Active Curation” to preserve Information
The reality of curation (as opposed to preservation)
A figure spanning from (more) homogeneous to heterogeneous data (and the associated risks):
– (More) homogeneous: expensive, but generally already being managed (standards, controlled vocabularies)!
– Danger zone: off most radar, but worth it!
– Heterogeneous: give up! (Prohibitively expensive to be anything other than deceptive!)
(Solution: move data into managed repositories sharing some common syntax and semantics, i.e. move it left!)
Libraries work well because we have common concepts (books, chapters, pages, words, interpreted using language).
Large data repositories share common concepts for syntax and semantics (and are actively managed for real user communities – like museums).
(Scientific) Communication through the ages
Science, as a process, requires the exchange of information and ideas.
We can make this exchange face-to-face (conferences, meetings, seminars) or through another medium (text, video, images), or both.
No matter what method we use, we wind up telling each other stories about what we’ve discovered.
Technology has given us new tools, but it's also provided new challenges.
http://www.intoon.com/#68559
Journals – a 17th century technology
The first scientific journal, Journal des sçavans (later renamed Journal des savants), was first published on Monday, 5 January 1665.
It also carried a proportion of material that would not now be considered scientific, such as obituaries of famous men, church history, and legal reports.
It still exists, but is now more of a literary journal.
The first edition of the Philosophical Transactions of the Royal Society of London was on 6 March 1665. That still exists, and continues to publish scientific information to this day.
Journals work, but...
... They’re not enough now to communicate everything we need to know about a scientific event
- whether that’s an observation, simulation, development of a theory, or any combination of these.
Data always has been the foundation of scientific progress – without it, we can’t test any of our assertions.
Previously data was hard to capture, but could be (relatively) easily published in image or table format – papers “included” data!
But now...
(Suber cells and mimosa leaves. Robert Hooke, Micrographia, 1665)
The role of data
And those data appear, if they're lucky, as one figure!
Just exactly how "evidential" is the academic record now?
Can her work be repeated/refuted without access to her data? (And even with the data, will they alone be enough? What metadata might one need?)
Repeat – verify, refute; "cite"; trust.
Reuse – extrapolate, aggregate.
Extend – "build on the shoulders of giants".
PROGRESS!
Who are we…? What do we know about data?
We’re (several) of the NERC data centres!
Some BADC numbers for context
Dataset: A collection of files sharing some administrative and/or project heritage.
BADC has approximately 150 real datasets (and thousands of virtual datasets).
BADC has approx 200 million files containing thousands of measured or simulated parameters.
BADC tries to deploy information systems that describe those data, parameters, projects and files, along with services that allow one to manipulate them …
Calendar year 2010: 2800 active users (of 12000 registered), downloaded 64 TB data in 16 million files from 165 datasets.
Less than half of the BADC data consumers are “atmospheric science” users!
What is this metadata malarkey?
Metadata:
(1) a noun carrying not much information
(do you know what I mean when I say metadata?)
or
(2) the heart and soul of data reuse?
BOTH!
Metadata for Discovery, Documentation, Definition
Lawrence et al 2009, doi:10.1098/rsta.2008.0237
Who makes metadata? Part 1 of 4
• Metadata should be added in layers, with original metadata created with measurements, augmented by notes, and extra material (provenance) as it migrates within the scientific community.
– Big role for electronic notebooks and machine generated information!
• The tools for creating and maintaining that metadata could and (probably) should vary with migration …
• … before it finally reaches a data scientist for ingestion … but data scientists have a role in helping shape the information (requirements and syntax) at every step of the way.
(Figure: provenance documentation, aka metadata, accumulates as data migrates from individuals, to research groups, to collaborators, to data centres, and on to data users/citers – from "lab books" and "notes", through "file metadata", to documentation and institutional wisdom.)
Who makes metadata? Part 2 of 4
A very important (newish) career (a specific data scientist role):
define structure; create A metadata.
Who makes metadata? Part 3 of 4
Create B, O and D metadata!
Acquire E metadata
Connect data and metadata to SERVICES
Who makes metadata? Part 4 of 4
The USER does! Citation and annotation: crucial measures of quality
– cf Amazon and Trip Advisor
Quality is:
– An essential and distinguishing attribute
– A degree or grade of excellence or worth
– The degree to which a man-made object or system is free from bugs and flaws …
– A perceptual, conditional and somewhat subjective attribute, which may be understood differently by different people
– In relation to relevancy, this term refers to … how important it is as viewed by the content owner or search application
– The suitability of procedures, processes and systems in relation to the strategic objectives
Philosophically, there is a distinction between primary and secondary qualities (John Locke):
– Primary qualities are intrinsic to an object (a thing or a person), whereas
– Secondary qualities are dependent on the interpretation and context.
From a metadata perspective, annotation of quality just as important as provider attribution.
“C” metadata crucial to reuse!
Putting it all together: the things we worry about!
The difference between preservation and curation? All information packages need to evolve as the producer and consumer communities evolve!
Example One
(Figures A and B.)
Example Two: The beauty of really old incidental metadata!
(Not ours)
Old Weather / New Results
e.g. the voyage of HMS Dorothea, 1818 (and others): comparing the climate of the past with the current climate range!
(Figure: modern max & min over the HMS Dorothea 1818 voyage; Brohan et al 2010)
Citizen Science: Old Weather
http://www.oldweather.org
Example 3: Global Earth System Modelling & CMIP5
(Image: from J. Lafeuille, 2006)
CMIP5: Fifth Coupled Model Intercomparison Project
A global community activity under the auspices of the World Meteorological Organisation (WMO) via the World Climate Research Programme (WCRP).
Aim:
– to address outstanding scientific questions that arose as part of the IPCC AR4 process,
– to improve understanding of climate, and
– to provide estimates of future climate change that will be useful to those considering its possible consequences.
Method: a standard set of model simulations in order to:
– evaluate how realistic the models are in simulating the recent past,
– provide projections of future climate change on two time scales, near term (out to about 2035) and long term (out to 2100 and beyond), and
– understand some of the factors responsible for differences in model projections, including quantifying some key feedbacks such as those involving clouds and the carbon cycle.
CMIP5 numbers!
Simulations:
– ~90,000 years of simulation
– ~60 experiments within CMIP5
– ~20 modelling centres (from around the world), using several model configurations each
– ~2 million output "atomic" datasets
– ~10's of petabytes of output
– ~2 petabytes of CMIP5 requested output
– ~1 petabyte of CMIP5 "replicated" output, which will be replicated at a number of sites (including ours), arriving now!
Of the replicants:
– ~220 TB decadal
– ~540 TB long term
– ~220 TB atmos-only
– ~80 TB of 3-hourly data
– ~215 TB of ocean 3D monthly data!
– ~250 TB for the cloud feedbacks!
– ~10 TB of land-biochemistry (from the long term experiments alone)
Volume or Metadata?
So we could wax lyrical about how hard it is to handle the data volume!
– Yes, we need to purchase a lot of disk and tapes!
– Yes, we need it to be fast disk.
– No, we can't just wander down to PC-World and buy armfuls of disks.
– Yes, we need to worry about networks, and
– Yes, that all takes time and money,
BUT
The hard part is the information handling! The major cost for supporting CMIP5 is in the information systems:
– describing the models,
– the data, how to find them, and how to use them,
– and eventually, what data one used, and what results were obtained!
IPCC assessment reports: FAR: 1990, SAR: 1995, TAR: 2001, AR4: 2007, AR5: 2013
State of the Art: Model Comparison
Guilyardi E. (2006): El Niño- mean state - seasonal cycle interactions in a multi-model ensemble. Clim. Dyn., 26:329-348, DOI: 10.1007/s00382-005-0084-6
1: Tabulate some interesting property (and author grafts hard to get the information)
State of the Art: Model Comparison
Kharin et al, Journal of Climate 2007 doi: 10.1175/JCLI4066.1Dai, A.,J. Climate 2006 doi: 10.1175/JCLI3884.1
2: Provide some (slightly) organised citation material (and author and readers graft hard to get the information)
State of the art: Model Comparison
3: Calculate and tabulate some interesting properties and bury in a table or figure
Guilyardi E. (2006): El Niño- mean state - seasonal cycle interactions in a multi-model ensemble. Clim. Dyn., 26:329-348, DOI: 10.1007/s00382-005-0084-6
Not an easy or time efficient way of doing things!
Handling the CMIP5 data
Earth System Grid (ESG)
A US DoE funded project to develop software and support CMIP5. Consists of:
– distributed data node software (to publish data)
– tools
– gateway software (to provide catalog and services)
Metafor
– an information model to describe models and simulations, and
– tools to manipulate it
Major "technical challenges"
Earth System Grid Federation (ESGF)
A global initiative to deploy the ESG (and other) software to support:
– timely access to the data
– minimum international movement of the data
– long term access to significant versions of the CMIP5 data
A major "social challenge" as well as a "technical challenge"
CMIP5: Handling the metadata
Three streams of provenance metadata:
A) "archive" metadata
B) "browse" metadata
C) "character" metadata
A: Archive metadata – three levels of information from the file system:
I. CF compliance in the NetCDF files
II. "Extra" CMIP5 required attributes, including a unique identifier within each file (a sketch of such a check follows below)
III. Use of the Data Reference Syntax (DRS) to help maintain version information
Compliance enforced by the ESG publisher.
B: Browse metadata, added independently of the archive:
• Exploiting Metafor controlled vocabularies in a "Common Information Model" (CIM) via a customised "CMIP5 questionnaire".
Compliance enforced by CMIP5 quality control systems, leading to
C: Character metadata
• Data assessment
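As an illustration of the level II check (required global attributes inside each file), here is a minimal sketch using the netCDF4 library. The attribute list and the example filename are assumptions for illustration, not the authoritative CMIP5 requirements:

```python
# Minimal sketch: check that a CMIP5-style NetCDF file carries a set of
# required global attributes. The attribute list below is illustrative,
# not the official CMIP5 requirements document.
from netCDF4 import Dataset

REQUIRED_ATTRS = ["institute_id", "model_id", "experiment_id",
                  "frequency", "tracking_id"]   # assumed examples

def check_global_attrs(path):
    """Return the required global attributes missing from a file."""
    with Dataset(path) as ds:
        present = set(ds.ncattrs())
    return [attr for attr in REQUIRED_ATTRS if attr not in present]

# Hypothetical file name, following CMIP5-style naming conventions.
missing = check_global_attrs("tas_Amon_MODEL_historical_r1i1p1_185001-200512.nc")
print("missing attributes:", missing or "none")
```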
Bringing it all together for CMIP5
The players:
A) Data publication system: on data nodes, strips metadata from files, loads into a catalogue. Viewable from gateways!
B) Model descriptions, and
C) Quality descriptions: entered via a combination of script driven upload and questionnaire entry. Viewable from gateways and third party portals!
A look under the hood of a piece of the CIM
Component types: Aerosol, Atmosphere, Atmospheric Chemistry, Land Ice, Land Surface, Ocean Biogeochem, Ocean, Sea Ice.
(Yes, we know we shouldn't have this sort of detail in the UML, and it won't be … shortly.)
Scientific properties: controlled vocabularies developed with the expert community, using mindmaps and some "rules" to aid machine post-processing …
(Everyone can use mindmaps: no a priori semantic technology knowledge required.)
A piece of the mindmap XML …
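The XML fragment itself did not survive the slide export. As a hedged sketch of the kind of machine post-processing those "rules" enable, here is a minimal parser that walks FreeMind-style mindmap XML (nested <node TEXT="..."> elements, an assumed layout) and flattens it into vocabulary terms:

```python
# Minimal sketch: flatten a FreeMind-style mindmap (nested <node TEXT="...">
# elements; the layout is assumed for illustration) into controlled-vocabulary
# terms of the form "Parent :: Child".
import xml.etree.ElementTree as ET

def mindmap_terms(path):
    terms = []
    def walk(node, trail):
        label = node.get("TEXT", "").strip()
        new_trail = trail + [label] if label else trail
        if label:
            terms.append(" :: ".join(new_trail))
        for child in node.findall("node"):
            walk(child, new_trail)
    root = ET.parse(path).getroot()
    for top in root.findall("node"):
        walk(top, [])
    return terms

# Hypothetical file name for illustration.
for term in mindmap_terms("ocean_component.mm"):
    print(term)
```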
… vocabulary driven content in a web-based "human entry tool"
(A great advertisement for Python and Django.)
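A hedged sketch of what "vocabulary driven" can mean in a Django entry tool: the choices offered by a form field are generated from the controlled vocabulary rather than hard-coded. Names and values here are illustrative, not the actual CMIP5 questionnaire code:

```python
# Minimal sketch (illustrative, not the actual CMIP5 questionnaire code):
# a Django form whose choices come from a controlled vocabulary, so updating
# the vocabulary updates the entry tool without code changes.
from django import forms

# In the real tool the vocabulary would be loaded from the mindmap-derived
# CV store; here it is a hard-coded stand-in.
OCEAN_ADVECTION_SCHEMES = ["centred 2nd order", "flux limited", "piecewise parabolic"]

class OceanComponentForm(forms.Form):
    scheme = forms.ChoiceField(
        label="Tracer advection scheme",
        choices=[(v, v) for v in OCEAN_ADVECTION_SCHEMES],
    )
    notes = forms.CharField(widget=forms.Textarea, required=False)
```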
The Earth System Grid Federation
Data nodes, providing data services and publishing to data gateways, linked in a global federation.
Three nodes (and gateways) committed to persisting the data!
(Slide from D. Williams et al, in preparation)
Earth System Grid Gateways: NCAR, PCMDI
ESGF Gateways
Take home points from the CMIP5 example
Big data is a global activity, requiring global solutions:
– global technology with global agreements!
– it is our experience that the information requirements of multi-disciplinary science cannot be satisfied "in-house"!
Big data requires big information systems:
– storage is the least of our worries (which doesn't mean it's insignificant)
There WILL be complex lines of provenance in all interesting systems:
– information aggregation matters,
– without "standards" (for information, and for interfaces), we have little chance of progress
None of this comes cheap, and despite our best efforts, initial budgets never really accurately quantify the resources required!
Segue from CMIP5 to Data Publication
From CMIP5 to data publication.
CMIP5 has three levels of "internal" quality control:
– 1: Syntactic: do the data and metadata conform to controlled vocabularies and syntactic requirements? (A sketch of such a check follows below.)
– 2: Semantic: are the data and metadata sensible and consistent with expectation? (Includes human "spot checks" of diagnostic plots and questionnaire output.)
– 3: Support citation: match the data and model metadata, confirm with the "authors", further human examination of the metadata, leading to
• Publication by the World Data Centre for Climate who will assign and maintain a Digital Object Identifier.
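A minimal sketch of a level 1 (syntactic) check: does a filename match a simplified CMIP5-style pattern of variable_table_model_experiment_ensemble[_timerange].nc? The pattern is illustrative, not the official DRS specification:

```python
# Minimal sketch of a "level 1" syntactic check: does a filename match a
# simplified CMIP5-style pattern? (Illustrative only, not the official DRS.)
import re

FILENAME_RE = re.compile(
    r"^(?P<variable>\w+)_(?P<table>\w+)_(?P<model>[\w-]+)_"
    r"(?P<experiment>[\w-]+)_(?P<ensemble>r\d+i\d+p\d+)"
    r"(_(?P<timerange>\d{6}-\d{6}))?\.nc$"
)

def syntactic_check(filename):
    """Return the parsed name components, or None if the name is malformed."""
    match = FILENAME_RE.match(filename)
    return match.groupdict() if match else None

print(syntactic_check("tas_Amon_HadGEM2-ES_historical_r1i1p1_185001-200512.nc"))
print(syntactic_check("not_a_valid_name.txt"))  # -> None
```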
Publishing data – why do it?
• Scientific journal publication mainly focuses on the analysis, interpretation and conclusions drawn from a given dataset.
• Examining the raw data that forms the dataset is more difficult, as datasets are usually stored in digital media, in a variety of (proprietary or non-standard) formats.
• Peer-review is generally only applied to the methodology and final conclusions of a piece of work, and not the underlying data itself. But if the conclusions are to stand, the data and accompanying metadata must be of good quality.
• A process of data publication, involving peer-review of datasets is (in some cases) and will be (in others) of benefit to many sectors of the academic community – including data producers!
Publishing: Some terminology (1)
cite verb (GIVE EXAMPLE) /saɪt/ [T]
• formal - to mention something as proof for a theory or as a reason why something has happened
• formal - to speak or write words taken from a particular writer or written work
(Cambridge Advanced Learner's Dictionary - http://dictionary.cambridge.org/dictionary/british/cite_1 )

publish verb /ˈpʌb.lɪʃ/ [T]
• to make information available to people, especially in a book, magazine or newspaper, or to produce and sell a book, magazine or newspaper
(Cambridge Advanced Learner's Dictionary - http://dictionary.cambridge.org/dictionary/british/publish )
Publishing: Some terminology (2)
0. Serving of data sets: This is what data centres do as our day job – take in data supplied by scientists and make it available to other interested parties. "On the web" is not the same as "Published".
1. Data set citation: This is what we're aiming for now – formulate and formalise a way of citing data sets. Will provide benefits to our users – and a carrot to get them to provide data to us!
2. Publication of data sets: This involves the peer-review of data sets, and gives the "stamp of approval" associated with traditional journal publications. Can't be done without effective linking/citing of the data sets.
Being able to cite a dataset is a good thing in its own right!
To publish or to Publish
We draw a clear distinction between
publishing = making available for consumption (e.g. on the web), and
Publishing = publishing after some formal process which adds value for the consumer:
– e.g. PLoS ONE type review, or
– EGU journal type public review, or
– more traditional peer review
AND
– provides commitment to persistence
How we’re going to support citation and persistence
We decided to use digital object identifiers (DOIs) because:
• They are actionable, interoperable, persistent links for (digital) objects
• Scientists are already used to citing papers using DOIs
• Pangaea assign DOIs, and ESSD use DOIs to link to the datasets they publish
• The British Library gave us an allocation of 500 DOIs to assign to datasets as we saw fit.
What sort of data can we cite?
A dataset has to be:
• Stable (i.e. not going to be modified)
• Complete (i.e. not going to be updated)
• Permanent – by assigning a DOI we're committing to make the dataset available for posterity
• Good quality – by assigning a DOI we're giving it our data centre stamp of approval, saying that it's complete and all the metadata is available

When a dataset is cited that means:
• There will be bitwise fixity (see the sketch below)
• With no additions or deletions of files
• No changes to the directory structure in the dataset "bundle"

A DOI should point to an HTML representation of some record which describes a data object.
Upgrades to versions of data formats will result in new editions of datasets.
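A minimal sketch of what "bitwise fixity" can mean operationally: record a checksum manifest when the dataset is frozen, and verify it later. The paths and the choice of SHA-256 are illustrative assumptions:

```python
# Minimal sketch: build and verify a checksum manifest for a frozen dataset,
# so "bitwise fixity" can be demonstrated later. Paths and the choice of
# SHA-256 are illustrative assumptions.
import hashlib
from pathlib import Path

def sha256(path, chunk=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def build_manifest(dataset_dir):
    root = Path(dataset_dir)
    return {str(p.relative_to(root)): sha256(p)
            for p in sorted(root.rglob("*")) if p.is_file()}

def verify_manifest(dataset_dir, manifest):
    """Return (path, checksum) entries that are missing, added, or changed."""
    current = build_manifest(dataset_dir)
    return sorted(set(manifest.items()) ^ set(current.items()))

# manifest = build_manifest("/archive/gbs_20GHz_v1")             # at freeze time
# problems = verify_manifest("/archive/gbs_20GHz_v1", manifest)  # any time later
```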
Landing page(s)
We don’t want to have the DOI resolve right at the archive level where you can get the database or files in the dataset because:
• this dumps the user with the only information being a list of filenames or table records
• If we change the archive structure, that requires re-mapping all the DOIs
So, we need a landing page.
Users are used to this, as that’s what on-line journals do
For example – clicking doi:10.1049/iet-map:20060126 will bring you to an HTML page with a link to a PDF of the referenced paper (a small resolution sketch follows below).
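A quick way to see that landing-page behaviour is to resolve a DOI over HTTP: the doi.org proxy redirects to the publisher's HTML page. A small sketch using the requests library:

```python
# Minimal sketch: resolve a DOI via the doi.org proxy and show where it lands.
import requests

def resolve_doi(doi):
    response = requests.get(f"https://doi.org/{doi}", allow_redirects=True, timeout=30)
    return response.url, response.headers.get("Content-Type", "")

landing_url, content_type = resolve_doi("10.1049/iet-map:20060126")
print(landing_url, content_type)   # expect an HTML landing page, not a PDF or raw data
```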
Human-readable citation string
We worked on this in the JISC and NERC-funded CLADDIER project (http://claddier.badc.ac.uk/trac) and will follow those rules
For example:Science and Technology Facilities Council (STFC), Chilbolton Facility for Atmospheric and Radio Research, [Wrench, C.L.]. Chilbolton Facility for Atmospheric and Radio Research (CFARR) data, [Internet]. British Atmospheric Data Centre, 2003-,Date of citation. Available from http://badc.nerc.ac.uk/data/chilbolton/.
For the GBS datasets:Science and Technology Facilities Council (STFC), Chilbolton Facility for Atmospheric and Radio Research, [Callaghan, S. A., J. Waight, C. J. Walden, J. Agnew and S. Ventouras]. GBS 20.7GHz slant path radio propagation measurements, Chilbolton site, [Internet]. British Atmospheric Data Centre, 2003-2005, doi:10.12345/1234567890
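A hedged sketch of assembling such a citation string from its components; the field names are illustrative, not a BADC schema:

```python
# Minimal sketch: assemble a human-readable dataset citation string from its
# components. Field names are illustrative, not an actual BADC schema.
def format_citation(publisher, facility, authors, title, archive, years, doi=None, url=None):
    locator = f"doi:{doi}" if doi else f"Available from {url}"
    return (f"{publisher}, {facility}, [{', '.join(authors)}]. "
            f"{title}, [Internet]. {archive}, {years}, {locator}")

print(format_citation(
    publisher="Science and Technology Facilities Council (STFC)",
    facility="Chilbolton Facility for Atmospheric and Radio Research",
    authors=["Callaghan, S. A.", "Waight, J.", "Walden, C. J."],
    title="GBS 20.7GHz slant path radio propagation measurements, Chilbolton site",
    archive="British Atmospheric Data Centre",
    years="2003-2005",
    doi="10.12345/1234567890",   # placeholder DOI from the example above
))
```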
Using the DataCite API to mint DOIs
And this is the point where we are at the moment.
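As a starting point for the script mentioned in the next steps, here is a minimal sketch of minting a DOI. It assumes DataCite Metadata Store (MDS) style endpoints and placeholder credentials; check the current DataCite documentation for the exact API before relying on any of these calls:

```python
# Minimal sketch of minting a DOI, assuming DataCite Metadata Store (MDS) style
# endpoints: first post the DataCite metadata record, then register the DOI
# against its landing-page URL. Credentials and prefix are placeholders.
import requests

MDS = "https://mds.datacite.org"
AUTH = ("BL.BADC", "not-a-real-password")   # placeholder datacentre credentials

def mint_doi(doi, landing_url, metadata_xml):
    # 1. Store the metadata record (DataCite XML, which itself contains the DOI).
    r = requests.post(f"{MDS}/metadata", data=metadata_xml.encode("utf-8"),
                      headers={"Content-Type": "application/xml;charset=UTF-8"},
                      auth=AUTH, timeout=30)
    r.raise_for_status()
    # 2. Register the DOI -> landing page mapping.
    body = f"doi={doi}\nurl={landing_url}"
    r = requests.put(f"{MDS}/doi/{doi}", data=body.encode("utf-8"),
                     headers={"Content-Type": "text/plain;charset=UTF-8"},
                     auth=AUTH, timeout=30)
    r.raise_for_status()
```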
Next steps:
• Talk to our developers at the BADC about how to use the API (and writing a script to assign DOIs – the sketch above is a starting point)
• Improve metadata and general user-friendliness of chosen landing pages
• Confirm that the datasets to be cited are complete and ready to be frozen!
• Decide what our bit of the DOI string should be and finalize the human-readable citation
• Think about how we integrate DOIs into our metadata management procedures
• Prepare an announcement to our users about our new ability to mint DOIs
Data publication summary
• We have the ability now (thanks to the British Library) to mint our own DOIs
• We can therefore cite our datasets, giving academic credit to those scientists who get cited – making them more likely to give us good quality data to archive
• Publication – and peer-review – is the next step
• We need to work with recognized academic journals to do this
• Data journals already exist:
– Earth System Science Data (http://earth-system-science-data.net/)
– Geochemistry, Geophysics, Geosystems (G3, http://www.agu.org/journals/gc/ )
(Image: http://www.keepcalm-o-matic.co.uk/default.aspx#createposter )
Summary and maybe conclusions?
Data is important, and becoming more so for a far wider range of the population.
Conclusions and knowledge are only as good as the data they're based on:
– Data is only as good as its metadata!
– There are lots of important types of metadata!
Science is supposed to be reproducible and verifiable.
It's up to us as scientists to care for the data we've got and ensure that the story of what we did to the data is transparent:
– So we can use the data again
– And so people will trust our results