TRANSCRIPT
Managing complex datasets and accompanying information for reuse
and repurpose
Bryan N. Lawrence, Sarah Callaghan, Sam Pepler
STFC Centre for Environmental Data Archival
Outline
Context: the Data Deluge; Communications and Evidence
Real experience: Who are we and where are we from? What is metadata and where does it come from?
A couple of simple examples.
One complex example: CMIP5
Exploiting information about data: data citation, why it's crucial, and what we are doing.
… and subsiding into the sunset with a short summary!
The Data Deluge
“the amount of data generated worldwide...is growing by 58% per year; in 2010 the world generated 1250 billion gigabytes of data”
Decisions, decisions, decisions (need) information, and better yet, PRIOR planning.
(Exploded view from 2007 IDC study, but note colours swapped)
SI Prefixes
SI prefix | Name  | Meaning     | Power of 10 | Power of 2 | Status
k         | kilo  | thousand    | 10^3        | 2^10       | Count on fingers
M         | mega  | million     | 10^6        | 2^20       | Trivial
G         | giga  | billion     | 10^9        | 2^30       | Small
T         | tera  | trillion    | 10^12       | 2^40       | Real
P         | peta  | quadrillion | 10^15       | 2^50       | Challenging
E         | exa   | quintillion | 10^18       | 2^60       | Aspirational
Z         | zetta | sextillion  | 10^21       | 2^70       | Wacko
Y         | yotta | septillion  | 10^24       | 2^80       | Science fiction
Stuart Feldman, Google
Humans and the Data Deluge
A person working full time for a year has about 1500 hours to do something. (In the UK 220 working days a year is about standard. Let's remove about 20 days for courses, staff meetings etc ... so that leaves about 200 days or, for a working day of 7.5 hours, a working year of about 1500 hours.)
• What does a 50 TB dataset mean?
– Take a set of climate predictions.
– A single lat/lon map might be of order 50 KB … so we have of the order of a billion maps.
– Looking at each map for 10 s, one individual could quality control those maps in approximately two thousand years of work!
– Bring on crowd sourcing … but there are only so many people in the world!
• If it takes 2 minutes to find something, have a quick look at it and extract (e.g.) a parameter name,
• you can process 45,000 items a year
• but no human could do that full time (repetitive boredom)!
• Maybe 30K in two years?
Your examples will differ, but your conclusions are unlikely to: we can't manage big data relying on humans! We need automation!
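As a back-of-the-envelope check of the numbers above, a minimal sketch (assuming, as the slide does, a 50 TB dataset, roughly 50 KB per map, 10 seconds of attention per map, and a 1500-hour working year):

```python
# Back-of-the-envelope check of the "humans and the data deluge" numbers.
# Assumptions (illustrative, from the slide): 50 TB dataset, ~50 KB per map,
# 10 seconds of human attention per map, 1500 working hours per year.
dataset_bytes = 50e12          # 50 TB
map_bytes = 50e3               # ~50 KB per lat/lon map
seconds_per_map = 10
hours_per_year = 1500

n_maps = dataset_bytes / map_bytes                 # ~1 billion maps
total_hours = n_maps * seconds_per_map / 3600      # ~2.8 million hours
work_years = total_hours / hours_per_year          # ~1,900 work-years

print(f"{n_maps:.1e} maps, {total_hours:.1e} hours, about {work_years:.0f} work-years")
```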
The Data Landscape (slide concept: Carole Goble via Liz Lyons)
A figure spanning petascale down to laptop scale, and from (more) homogeneous to heterogeneous data:
– Big Data / Big Science: ever larger! (e.g. the CMIP5 archive, Earth Observation / climate data)
– Typical departmental collection; one instrument archive; ChemSpider
– Personal science: ever more complex, and ever more of it! (Fortran, IDL, Matlab, Python and (yuck) Excel, Access)
Preservation and Curation
One could argue that the writers of these documents did a brilliant job of preserving the bits-and-bytes of their time … thousands of years of preservation must be worth something … but
Phaistos Disk, 1700 BC
they've both been translated many times; it's a shame the meanings are different …
=> Data Preservation is not enough, we need “Active Curation” to preserve Information
The reality of curation (as opposed to preservation)
A figure spanning from (more) homogeneous to heterogeneous data (and the associated risks):
– (More) homogeneous: expensive, but generally already being managed (standards, controlled vocabularies)!
– Danger zone: off most radar, but worth it!
– Heterogeneous: give up! (Prohibitively expensive to be anything other than deceptive!)
(Solution: move data into managed repositories sharing some common syntax and semantics, i.e. move it left!)
Libraries work well because we have common concepts (books, chapters, pages, words, interpreted using language).
Large data repositories share common concepts for syntax and semantics (and are actively managed for real user communities – like museums).
(Scientific) Communication through the ages
Science, as a process, requires the exchange of information and ideas.
We can make this exchange face-to-face (conferences, meetings, seminars) or through another medium (text, video, images), or both.
No matter what method we use, we wind up telling each other stories about what we’ve discovered.
Technology has given us new tools, but it's also provided new challenges.
http://www.intoon.com/#68559
Journals – a 17th century technology
The first scientific journal, Journal des sçavans (later renamed Journal des savants), was first published on Monday, 5 January 1665.
It also carried a proportion of material that would not now be considered scientific, such as obituaries of famous men, church history, and legal reports.
It still exists, but is now more of a literary journal.
The first edition of the Philosophical Transactions of the Royal Society of London was on 6 March 1665. That still exists, and continues to publish scientific information to this day.
Journals work, but...
... They’re not enough now to communicate everything we need to know about a scientific event
- whether that’s an observation, simulation, development of a theory, or any combination of these.
Data always has been the foundation of scientific progress – without it, we can’t test any of our assertions.
Previously data was hard to capture, but could be (relatively) easily published in image or table format – papers “included” data!
But now...
(Suber cells and mimosa leaves. Robert Hooke, Micrographia, 1665)
The role of data
And those data appear, if they're lucky, as one figure!
Just exactly how "evidential" is the academic record now?
Can her work be repeated/refuted without access to her data? (And even with the data, will they alone be enough? What metadata might one need?)
Repeat – verify, refute; "cite"; trust.
Reuse – extrapolate, aggregate.
Extend – "build on the shoulders of giants".
PROGRESS!
Who are we…? What do we know about data?
We’re (several) of the NERC data centres!
Some BADC numbers for context
Dataset: A collection of files sharing some administrative and/or project heritage.
BADC has approximately 150 real datasets (and thousands of virtual datasets).
BADC has approx 200 million files containing thousands of measured or simulated parameters.
BADC tries to deploy information systems that describe those data, parameters, projects and files, along with services that allow one to manipulate them …
Calendar year 2010: 2800 active users (of 12000 registered), downloaded 64 TB data in 16 million files from 165 datasets.
Less than half of the BADC data consumers are “atmospheric science” users!
What is this metadata malarkey?
Metadata:
(1) a noun carrying not much information
(do you know what I mean when I say metadata?)
or
(2) the heart and soul of data reuse?
BOTH!
Metadata for Discovery, Documentation, Definition
Lawrence et al 2009, doi:10.1098/rsta.2008.0237
Who makes metadata? Part 1 of 4
• Metadata should be added in layers, with original metadata created with measurements, augmented by notes, and extra material (provenance) as it migrates within the scientific community.
– Big role for electronic notebooks and machine generated information!
• The tools for creating and maintaining that metadata could and (probably) should vary with migration …
• … before it finally reaches a data scientist for ingestion … but data scientists have a role in helping shape the information (requirements and syntax) at every step of the way.
(Figure: provenance documentation, aka metadata, accumulates as data migrates from individuals, to research groups, to collaborators, to data centres, and on to data users/citers – from "lab books" and "notes", through "file metadata", to documentation and institutional wisdom.)
Who makes metadata? Part 2 of 4
A very important (newish) career (a specific data scientist role):
define structure; create A metadata.
Who makes metadata? Part 3 of 4
Create B, O and D metadata!
Acquire E metadata
Connect data and metadata to SERVICES
Who makes metadata? Part 4 of 4
The USER does! Citation and annotation: crucial measures of quality
– cf Amazon and Trip Advisor
Quality is:
– An essential and distinguishing attribute
– A degree or grade of excellence or worth
– The degree to which a man-made object or system is free from bugs and flaws …
– A perceptual, conditional and somewhat subjective attribute, which may be understood differently by different people
– In relation to relevancy, this term refers to … how important it is as viewed by the content owner or search application
– The suitability of procedures, processes and systems in relation to the strategic objectives
Philosophically, there is a distinction between primary and secondary qualities (John Locke):
– Primary qualities are intrinsic to an object (a thing or a person), whereas
– Secondary qualities are dependent on the interpretation and context.
From a metadata perspective, annotation of quality just as important as provider attribution.
“C” metadata crucial to reuse!
Putting it all together: the things we worry about!
The difference between preservation and curation? All information packages need to evolve as the producer and consumer communities evolve!
Example One
(Figures A and B.)
Example Two: The beauty of really old incidental metadata!
(Not ours)
Old Weather / New Results
e.g. the voyage of HMS Dorothea, 1818 (and others): comparing the climate of the past with the current climate range!
(Figure: modern max & min over the HMS Dorothea 1818 voyage; Brohan et al 2010)
Citizen Science: Old Weather
http://www.oldweather.org
Example 3: Global Earth System Modelling & CMIP5
(Image: from J. Lafeuille, 2006)
CMIP5: Fifth Coupled Model Intercomparison Project
A global community activity under the auspices of the World Meteorological Organisation (WMO) via the World Climate Research Programme (WCRP).
Aim:
– to address outstanding scientific questions that arose as part of the IPCC AR4 process,
– to improve understanding of climate, and
– to provide estimates of future climate change that will be useful to those considering its possible consequences.
Method: a standard set of model simulations in order to:
– evaluate how realistic the models are in simulating the recent past,
– provide projections of future climate change on two time scales, near term (out to about 2035) and long term (out to 2100 and beyond), and
– understand some of the factors responsible for differences in model projections, including quantifying some key feedbacks such as those involving clouds and the carbon cycle.
CMIP5 numbers!
Simulations:
– ~90,000 years of simulation
– ~60 experiments within CMIP5
– ~20 modelling centres (from around the world), using several model configurations each
– ~2 million output "atomic" datasets
– ~10's of petabytes of output
– ~2 petabytes of CMIP5 requested output
– ~1 petabyte of CMIP5 "replicated" output, which will be replicated at a number of sites (including ours), arriving now!
Of the replicants:
– ~220 TB decadal
– ~540 TB long term
– ~220 TB atmos-only
– ~80 TB of 3-hourly data
– ~215 TB of ocean 3D monthly data!
– ~250 TB for the cloud feedbacks!
– ~10 TB of land-biochemistry (from the long term experiments alone)
Volume or Metadata?
So we could wax lyrical about how hard it is to handle the data volume!
– Yes, we need to purchase a lot of disk and tapes!
– Yes, we need it to be fast disk.
– No, we can't just wander down to PC-World and buy armfuls of disks.
– Yes, we need to worry about networks, and
– Yes, that all takes time and money,
BUT
The hard part is the information handling! The major cost for supporting CMIP5 is in the information systems:
– describing the models,
– the data, how to find them, and how to use them,
– and eventually, what data one used, and what results were obtained!
IPCC assessment reports: FAR: 1990, SAR: 1995, TAR: 2001, AR4: 2007, AR5: 2013
State of the Art: Model Comparison
Guilyardi E. (2006): El Niño- mean state - seasonal cycle interactions in a multi-model ensemble. Clim. Dyn., 26:329-348, DOI: 10.1007/s00382-005-0084-6
1: Tabulate some interesting property (and author grafts hard to get the information)
State of the Art: Model Comparison
Kharin et al, Journal of Climate 2007 doi: 10.1175/JCLI4066.1Dai, A.,J. Climate 2006 doi: 10.1175/JCLI3884.1
2: Provide some (slightly) organised citation material (and author and readers graft hard to get the information)
State of the art: Model Comparison
3: Calculate and tabulate some interesting properties and bury in a table or figure
Guilyardi E. (2006): El Niño- mean state - seasonal cycle interactions in a multi-model ensemble. Clim. Dyn., 26:329-348, DOI: 10.1007/s00382-005-0084-6
Not an easy or time efficient way of doing things!
Handling the CMIP5 data
Earth System Grid (ESG)
A US DoE funded project to develop software and support CMIP5. Consists of:
– distributed data node software (to publish data)
– tools
– gateway software (to provide catalog and services)
Metafor
– an information model to describe models and simulations, and
– tools to manipulate it
Major "technical challenges"
Earth System Grid Federation (ESGF)
A global initiative to deploy the ESG (and other) software to support:
– timely access to the data
– minimum international movement of the data
– long term access to significant versions of the CMIP5 data
A major "social challenge" as well as a "technical challenge"
CMIP5: Handling the metadata
Three streams of provenance metadata:
A) "archive" metadata
B) "browse" metadata
C) "character" metadata
A: Archive metadata – three levels of information from the file system:
I. CF compliance in the NetCDF files
II. "Extra" CMIP5 required attributes, including a unique identifier within each file (a sketch of such a check follows below)
III. Use of the Data Reference Syntax (DRS) to help maintain version information
Compliance enforced by the ESG publisher.
B: Browse metadata, added independently of the archive:
• Exploiting Metafor controlled vocabularies in a "Common Information Model" (CIM) via a customised "CMIP5 questionnaire".
Compliance enforced by CMIP5 quality control systems, leading to
C: Character metadata
• Data assessment
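As an illustration of the level II check (required global attributes inside each file), here is a minimal sketch using the netCDF4 library. The attribute list and the example filename are assumptions for illustration, not the authoritative CMIP5 requirements:

```python
# Minimal sketch: check that a CMIP5-style NetCDF file carries a set of
# required global attributes. The attribute list below is illustrative,
# not the official CMIP5 requirements document.
from netCDF4 import Dataset

REQUIRED_ATTRS = ["institute_id", "model_id", "experiment_id",
                  "frequency", "tracking_id"]   # assumed examples

def check_global_attrs(path):
    """Return the required global attributes missing from a file."""
    with Dataset(path) as ds:
        present = set(ds.ncattrs())
    return [attr for attr in REQUIRED_ATTRS if attr not in present]

# Hypothetical file name, following CMIP5-style naming conventions.
missing = check_global_attrs("tas_Amon_MODEL_historical_r1i1p1_185001-200512.nc")
print("missing attributes:", missing or "none")
```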
Bringing it all together for CMIP5
The players:
A) Data publication system: on data nodes, strips metadata from files, loads into a catalogue. Viewable from gateways!
B) Model descriptions, and
C) Quality descriptions: entered via a combination of script driven upload and questionnaire entry. Viewable from gateways and third party portals!
A look under the hood of a piece of the CIM
Component types: Aerosol, Atmosphere, Atmospheric Chemistry, Land Ice, Land Surface, Ocean Biogeochem, Ocean, Sea Ice.
(Yes, we know we shouldn't have this sort of detail in the UML, and it won't be … shortly.)
Scientific properties: controlled vocabularies developed with the expert community, using mindmaps and some "rules" to aid machine post-processing …
(Everyone can use mindmaps: no a priori semantic technology knowledge required.)
A piece of the mindmap XML …
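The XML fragment itself did not survive the slide export. As a hedged sketch of the kind of machine post-processing those "rules" enable, here is a minimal parser that walks FreeMind-style mindmap XML (nested <node TEXT="..."> elements, an assumed layout) and flattens it into vocabulary terms:

```python
# Minimal sketch: flatten a FreeMind-style mindmap (nested <node TEXT="...">
# elements; the layout is assumed for illustration) into controlled-vocabulary
# terms of the form "Parent :: Child".
import xml.etree.ElementTree as ET

def mindmap_terms(path):
    terms = []
    def walk(node, trail):
        label = node.get("TEXT", "").strip()
        new_trail = trail + [label] if label else trail
        if label:
            terms.append(" :: ".join(new_trail))
        for child in node.findall("node"):
            walk(child, new_trail)
    root = ET.parse(path).getroot()
    for top in root.findall("node"):
        walk(top, [])
    return terms

# Hypothetical file name for illustration.
for term in mindmap_terms("ocean_component.mm"):
    print(term)
```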
… vocabulary driven content in a web-based "human entry tool"
(A great advertisement for Python and Django.)
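A hedged sketch of what "vocabulary driven" can mean in a Django entry tool: the choices offered by a form field are generated from the controlled vocabulary rather than hard-coded. Names and values here are illustrative, not the actual CMIP5 questionnaire code:

```python
# Minimal sketch (illustrative, not the actual CMIP5 questionnaire code):
# a Django form whose choices come from a controlled vocabulary, so updating
# the vocabulary updates the entry tool without code changes.
from django import forms

# In the real tool the vocabulary would be loaded from the mindmap-derived
# CV store; here it is a hard-coded stand-in.
OCEAN_ADVECTION_SCHEMES = ["centred 2nd order", "flux limited", "piecewise parabolic"]

class OceanComponentForm(forms.Form):
    scheme = forms.ChoiceField(
        label="Tracer advection scheme",
        choices=[(v, v) for v in OCEAN_ADVECTION_SCHEMES],
    )
    notes = forms.CharField(widget=forms.Textarea, required=False)
```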
The Earth System Grid Federation
Data nodes, providing data services and publishing to data gateways, linked in a global federation.
Three nodes (and gateways) committed to persisting the data!
(Slide from D. Williams et al, in preparation)
Earth System Grid Gateways: NCAR, PCMDI
ESGF Gateways
Take home points from the CMIP5 example
Big data is a global activity, requiring global solutions:
– global technology with global agreements!
– it is our experience that the information requirements of multi-disciplinary science cannot be satisfied "in-house"!
Big data requires big information systems:
– storage is the least of our worries (which doesn't mean it's insignificant)
There WILL be complex lines of provenance in all interesting systems:
– information aggregation matters,
– without "standards" (for information, and for interfaces), we have little chance of progress
None of this comes cheap, and despite our best efforts, initial budgets never really accurately quantify the resources required!
Segue from CMIP5 to Data Publication
From CMIP5 to data publication.
CMIP5 has three levels of "internal" quality control:
– 1: Syntactic: do the data and metadata conform to controlled vocabularies and syntactic requirements? (A sketch of such a check follows below.)
– 2: Semantic: are the data and metadata sensible and consistent with expectation? (Includes human "spot checks" of diagnostic plots and questionnaire output.)
– 3: Support citation: match the data and model metadata, confirm with the "authors", further human examination of the metadata, leading to
• Publication by the World Data Centre for Climate who will assign and maintain a Digital Object Identifier.
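A minimal sketch of a level 1 (syntactic) check: does a filename match a simplified CMIP5-style pattern of variable_table_model_experiment_ensemble[_timerange].nc? The pattern is illustrative, not the official DRS specification:

```python
# Minimal sketch of a "level 1" syntactic check: does a filename match a
# simplified CMIP5-style pattern? (Illustrative only, not the official DRS.)
import re

FILENAME_RE = re.compile(
    r"^(?P<variable>\w+)_(?P<table>\w+)_(?P<model>[\w-]+)_"
    r"(?P<experiment>[\w-]+)_(?P<ensemble>r\d+i\d+p\d+)"
    r"(_(?P<timerange>\d{6}-\d{6}))?\.nc$"
)

def syntactic_check(filename):
    """Return the parsed name components, or None if the name is malformed."""
    match = FILENAME_RE.match(filename)
    return match.groupdict() if match else None

print(syntactic_check("tas_Amon_HadGEM2-ES_historical_r1i1p1_185001-200512.nc"))
print(syntactic_check("not_a_valid_name.txt"))  # -> None
```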
Publishing data – why do it?
• Scientific journal publication mainly focuses on the analysis, interpretation and conclusions drawn from a given dataset.
• Examining the raw data that forms the dataset is more difficult, as datasets are usually stored in digital media, in a variety of (proprietary or non-standard) formats.
• Peer-review is generally only applied to the methodology and final conclusions of a piece of work, and not the underlying data itself. But if the conclusions are to stand, the data and accompanying metadata must be of good quality.
• A process of data publication, involving peer-review of datasets is (in some cases) and will be (in others) of benefit to many sectors of the academic community – including data producers!
Publishing: Some terminology (1)
cite verb (GIVE EXAMPLE) /saɪt/ [T]
• formal - to mention something as proof for a theory or as a reason why something has happened
• formal - to speak or write words taken from a particular writer or written work
(Cambridge Advanced Learner's Dictionary - http://dictionary.cambridge.org/dictionary/british/cite_1 )

publish verb /ˈpʌb.lɪʃ/ [T]
• to make information available to people, especially in a book, magazine or newspaper, or to produce and sell a book, magazine or newspaper
(Cambridge Advanced Learner's Dictionary - http://dictionary.cambridge.org/dictionary/british/publish )
Publishing: Some terminology (2)
0. Serving of data sets: This is what data centres do as our day job – take in data supplied by scientists and make it available to other interested parties. "On the web" is not the same as "Published".
1. Data set citation: This is what we're aiming for now – formulate and formalise a way of citing data sets. Will provide benefits to our users – and a carrot to get them to provide data to us!
2. Publication of data sets: This involves the peer-review of data sets, and gives the "stamp of approval" associated with traditional journal publications. Can't be done without effective linking/citing of the data sets.
Being able to cite a dataset is a good thing in its own right!
To publish or to Publish
We draw a clear distinction between
publishing = making available for consumption (e.g. on the web), and
Publishing = publishing after some formal process which adds value for the consumer:
– e.g. PLoS ONE type review, or
– EGU journal type public review, or
– more traditional peer review
AND
– provides commitment to persistence
How we’re going to support citation and persistence
We decided to use digital object identifiers (DOIs) because:
• They are actionable, interoperable, persistent links for (digital) objects
• Scientists are already used to citing papers using DOIs
• Pangaea assign DOIs, and ESSD use DOIs to link to the datasets they publish
• The British Library gave us an allocation of 500 DOIs to assign to datasets as we saw fit.
What sort of data can we cite?
A dataset has to be:
• Stable (i.e. not going to be modified)
• Complete (i.e. not going to be updated)
• Permanent – by assigning a DOI we're committing to make the dataset available for posterity
• Good quality – by assigning a DOI we're giving it our data centre stamp of approval, saying that it's complete and all the metadata is available

When a dataset is cited that means:
• There will be bitwise fixity (see the sketch below)
• With no additions or deletions of files
• No changes to the directory structure in the dataset "bundle"

A DOI should point to an HTML representation of some record which describes a data object.
Upgrades to versions of data formats will result in new editions of datasets.
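A minimal sketch of what "bitwise fixity" can mean operationally: record a checksum manifest when the dataset is frozen, and verify it later. The paths and the choice of SHA-256 are illustrative assumptions:

```python
# Minimal sketch: build and verify a checksum manifest for a frozen dataset,
# so "bitwise fixity" can be demonstrated later. Paths and the choice of
# SHA-256 are illustrative assumptions.
import hashlib
from pathlib import Path

def sha256(path, chunk=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def build_manifest(dataset_dir):
    root = Path(dataset_dir)
    return {str(p.relative_to(root)): sha256(p)
            for p in sorted(root.rglob("*")) if p.is_file()}

def verify_manifest(dataset_dir, manifest):
    """Return (path, checksum) entries that are missing, added, or changed."""
    current = build_manifest(dataset_dir)
    return sorted(set(manifest.items()) ^ set(current.items()))

# manifest = build_manifest("/archive/gbs_20GHz_v1")             # at freeze time
# problems = verify_manifest("/archive/gbs_20GHz_v1", manifest)  # any time later
```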
Landing page(s)
We don’t want to have the DOI resolve right at the archive level where you can get the database or files in the dataset because:
• this dumps the user with the only information being a list of filenames or table records
• If we change the archive structure, that requires re-mapping all the DOIs
So, we need a landing page.
Users are used to this, as that’s what on-line journals do
For example – clicking doi:10.1049/iet-map:20060126 will bring you to an HTML page with a link to a PDF of the referenced paper (a small resolution sketch follows below).
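A quick way to see that landing-page behaviour is to resolve a DOI over HTTP: the doi.org proxy redirects to the publisher's HTML page. A small sketch using the requests library:

```python
# Minimal sketch: resolve a DOI via the doi.org proxy and show where it lands.
import requests

def resolve_doi(doi):
    response = requests.get(f"https://doi.org/{doi}", allow_redirects=True, timeout=30)
    return response.url, response.headers.get("Content-Type", "")

landing_url, content_type = resolve_doi("10.1049/iet-map:20060126")
print(landing_url, content_type)   # expect an HTML landing page, not a PDF or raw data
```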
Human-readable citation string
We worked on this in the JISC and NERC-funded CLADDIER project (http://claddier.badc.ac.uk/trac) and will follow those rules
For example:Science and Technology Facilities Council (STFC), Chilbolton Facility for Atmospheric and Radio Research, [Wrench, C.L.]. Chilbolton Facility for Atmospheric and Radio Research (CFARR) data, [Internet]. British Atmospheric Data Centre, 2003-,Date of citation. Available from http://badc.nerc.ac.uk/data/chilbolton/.
For the GBS datasets:Science and Technology Facilities Council (STFC), Chilbolton Facility for Atmospheric and Radio Research, [Callaghan, S. A., J. Waight, C. J. Walden, J. Agnew and S. Ventouras]. GBS 20.7GHz slant path radio propagation measurements, Chilbolton site, [Internet]. British Atmospheric Data Centre, 2003-2005, doi:10.12345/1234567890
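A hedged sketch of assembling such a citation string from its components; the field names are illustrative, not a BADC schema:

```python
# Minimal sketch: assemble a human-readable dataset citation string from its
# components. Field names are illustrative, not an actual BADC schema.
def format_citation(publisher, facility, authors, title, archive, years, doi=None, url=None):
    locator = f"doi:{doi}" if doi else f"Available from {url}"
    return (f"{publisher}, {facility}, [{', '.join(authors)}]. "
            f"{title}, [Internet]. {archive}, {years}, {locator}")

print(format_citation(
    publisher="Science and Technology Facilities Council (STFC)",
    facility="Chilbolton Facility for Atmospheric and Radio Research",
    authors=["Callaghan, S. A.", "Waight, J.", "Walden, C. J."],
    title="GBS 20.7GHz slant path radio propagation measurements, Chilbolton site",
    archive="British Atmospheric Data Centre",
    years="2003-2005",
    doi="10.12345/1234567890",   # placeholder DOI from the example above
))
```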
Using the DataCite API to mint DOIs
And this is the point where we are at the moment.
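As a starting point for the script mentioned in the next steps, here is a minimal sketch of minting a DOI. It assumes DataCite Metadata Store (MDS) style endpoints and placeholder credentials; check the current DataCite documentation for the exact API before relying on any of these calls:

```python
# Minimal sketch of minting a DOI, assuming DataCite Metadata Store (MDS) style
# endpoints: first post the DataCite metadata record, then register the DOI
# against its landing-page URL. Credentials and prefix are placeholders.
import requests

MDS = "https://mds.datacite.org"
AUTH = ("BL.BADC", "not-a-real-password")   # placeholder datacentre credentials

def mint_doi(doi, landing_url, metadata_xml):
    # 1. Store the metadata record (DataCite XML, which itself contains the DOI).
    r = requests.post(f"{MDS}/metadata", data=metadata_xml.encode("utf-8"),
                      headers={"Content-Type": "application/xml;charset=UTF-8"},
                      auth=AUTH, timeout=30)
    r.raise_for_status()
    # 2. Register the DOI -> landing page mapping.
    body = f"doi={doi}\nurl={landing_url}"
    r = requests.put(f"{MDS}/doi/{doi}", data=body.encode("utf-8"),
                     headers={"Content-Type": "text/plain;charset=UTF-8"},
                     auth=AUTH, timeout=30)
    r.raise_for_status()
```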
Next steps:
• Talk to our developers at the BADC about how to use the API (and writing a script to assign DOIs – the sketch above is a starting point)
• Improve metadata and general user-friendliness of chosen landing pages
• Confirm that the datasets to be cited are complete and ready to be frozen!
• Decide what our bit of the DOI string should be and finalize the human-readable citation
• Think about how we integrate DOIs into our metadata management procedures
• Prepare an announcement to our users about our new ability to mint DOIs
Data publication summary
• We have the ability now (thanks to the British Library) to mint our own DOIs
• We can therefore cite our datasets, giving academic credit to those scientists who get cited – making them more likely to give us good quality data to archive
• Publication – and peer-review – is the next step
• We need to work with recognized academic journals to do this
• Data journals already exist:
– Earth System Science Data (http://earth-system-science-data.net/)
– Geochemistry, Geophysics, Geosystems (G3, http://www.agu.org/journals/gc/ )
(Image: http://www.keepcalm-o-matic.co.uk/default.aspx#createposter )
Summary and maybe conclusions?
Data is important, and becoming more so for a far wider range of the population.
Conclusions and knowledge are only as good as the data they're based on:
– Data is only as good as its metadata!
– There are lots of important types of metadata!
Science is supposed to be reproducible and verifiable.
It's up to us as scientists to care for the data we've got and ensure that the story of what we did to the data is transparent:
– So we can use the data again
– And so people will trust our results