
Data Sets, Ensemble Cloud Computing, and the University Library: Getting the Most Out of Research Support

Jim Myers1, Margaret Hedstrom1, Beth A Plale2, Praveen Kumar3, Robert McDonald4, Rob Kooper5, Luigi Marini5, Inna Kouper4, Kavitha Chandrasekar4


1 School of Information, University of Michigan, Ann Arbor, MI, United States. 2 School of Informatics and Computing, Indiana University, Bloomington, IN, United States. 3 Civil and Environmental Engineering, University of Illinois, Urbana-Champaign, IL, United States. 4 Data To Insight Center, Indiana University, Bloomington, IN, United States. 5 National Center for Supercomputing Applications, University of Illinois, Urbana-Champaign, IL, United States.


Overview

• Technological advances are making it ever easier to move computation, data, and metadata around

• With decreasing costs and increasing recognition of the value of data re-use, many organizations are exploring their role in data curation/preservation

• If we look at the nature of the problem:
  – How should data be curated to scalably support research?
    • Lifecycle approaches to manage value-defined research objects
  – Can we do it?
    • SEAD as an end-to-end demonstration…
  – What organization(s) are best positioned/the most capable of leading/providing such services long-term?
    • Primary research organizations have a combination of capability, motivation, and long-term commitment.

Technology – the world is flat

• Today’s researchers can employ computing and data resources from anywhere, using scalable search technologies …

Enough said.

Data as a key resource, Big Data

• Data is increasingly recognized as valuable beyond its initial use:
  – Data reproducibility
  – Re-analysis
  – Reference Data
  – Data mining/machine learning/…
  – NSF Data plan requirement
  – Paper publication with data requirements
  – Community and institutional collections growing

Data Publication today

• Data cited in papers (to limited depth)
• Project file archives (large, limited description, gray/dark)
• Reference/analytical data (standardized content, limited breadth)
• Historical collections (temporal breadth, limited numbers)

- do any of these solve the problem?

Researchers think, and work, like this:

• Multi-
  – Disciplinary
  – Format
  – Model
  – Semantics
  – Location

and this

– Raw and derived data
– ~5 levels of quality, processing, maturity
– Observations, calibrations, experiments, models, statistical ensembles, …

– Also organized by location, time, variables, technique, creator, project, provenance, …

– Large amount of reference information from external sources (e.g. NASA)

– Evidence for ‘non-orthogonal’ sub-collections

What’s Really Needed?

Scalable Research Productivity Requires:

• A way to
  – Store what you want
  – Reference what you want
  – Organize how you want (search, filter, tag, collect)
• At the scale, and level of detail/richness, you want
• When you figure that out
• In a way that is self-describing/high-fidelity across applications and owners
• In the vocabularies and formats you find efficient
• Beyond the lifetime of individual/project interest
• For active use and external credit
• With minimal training/IT support required.

How can we approach magic?

• Global identifiers – data, terms, metadata
• Content management abstractions (blob + type + metadata)
• Service architectures and automated processing (conversion, preview, extraction, derivation, cataloging, …)
• Applications that share these abstractions – write what you know, display/ignore what you don’t
• Research Object management (structured, inter-related collections)

Web 2.0, Web 3.0, + explicit context management …
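The "blob + type + metadata" abstraction and the "write what you know, display/ignore what you don’t" rule can be illustrated with a minimal Python sketch. This is an assumption-laden illustration, not SEAD's actual data model: the `ContentItem` class, the `display` helper, the hash-based `urn:sha256:` identifier scheme, and the `example.org` term are all hypothetical. Only the Dublin Core term URI is a real, published vocabulary term.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List

# Real Dublin Core term URI, used here as a global identifier for a metadata term.
DC_TITLE = "http://purl.org/dc/terms/title"
# Hypothetical project-specific term (assumption, for illustration only).
SENSOR_DEPTH = "http://example.org/terms/sensorDepth"


@dataclass
class ContentItem:
    """One piece of content: opaque bytes, a MIME type, and open-ended metadata."""
    blob: bytes
    mime_type: str
    # Metadata maps a global term identifier to the list of values asserted for it.
    metadata: Dict[str, List[str]] = field(default_factory=dict)

    @property
    def identifier(self) -> str:
        # Hypothetical global ID scheme: derive the identifier from the content hash.
        return "urn:sha256:" + hashlib.sha256(self.blob).hexdigest()

    def annotate(self, term: str, value: str) -> None:
        """Write what you know: add one statement, leave all others untouched."""
        self.metadata.setdefault(term, []).append(value)


def display(item: ContentItem, known_terms: Dict[str, str]) -> Dict[str, List[str]]:
    """Display what you understand: unknown terms are skipped, never deleted."""
    return {
        label: item.metadata[term]
        for term, label in known_terms.items()
        if term in item.metadata
    }


item = ContentItem(b"depth,temp\n1,20\n", "text/csv")
item.annotate(DC_TITLE, "Lake profile")
item.annotate(SENSOR_DEPTH, "1 m")
# An application that only understands titles still works on this item.
view = display(item, {DC_TITLE: "Title"})
```

Because unknown terms are ignored rather than rejected, a second application can add its own vocabulary to the same item without breaking the first.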

SEAD: Sustainable Environment - Actionable Data

• An NSF DataNet project started in October, 2011

• An international resource for sustainability science

• A provider of light-weight Data Services based on novel technical and business approaches:
  – Supporting the long-tail of research
  – Enabling active and social curation
  – Providing integrated lifecycle support for data

http://sead-data.net/

Margaret Hedstrom, PI
Praveen Kumar, co-PI
Jim Myers, co-PI
Beth Plale, co-PI

SEAD is:

• Data discovery
• Project workspaces
• A data-aware community network
• Curation and preservation services that link to multiple archives and discovery services

SEAD is:

• An active repository that creates data pages with
  – Previews
  – Extracted Metadata
  – Overlays
  – Tags
  – Comments
  – Provenance
  – Use information
  – Download/Embed
• A tool for community exploration:
  – Personal and Project Profiles
  – Publications and Data Citations
  – Co-author, co-investigator graphs
  – Temporal analysis


SEAD is:

• Curation and Preservation Services:
  – Research Object management
  – ID assignment
  – Matchmaking to long-term repositories
  – Citation Generation
  – Catalog Registration
  – Discovery services

SEAD’s Virtual Archive allows curators to access, assess, enhance, package, and submit data from SEAD project repositories for long-term storage in SEAD-managed storage or external institutional repositories and cloud data services.

Semantic Content Middleware over Scalable File System and Triple Store

Flickr-style web management of data

Geospatial, social network mash-ups, workflows and services

Curation Services to harvest and package specific data sets

Federation of OAI repositories for long-term preservation

Sensor data

– Apps read what they need and write what they know
– Curation snapshots meaningful Research Objects
– Multiple ROs can be defined/managed re-using the same underlying ‘living’ content
– The larger graph can be ~reassembled w/o the ongoing cost of managing at the item level
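The idea of snapshotting Research Objects (ROs) over shared ‘living’ content can be sketched in Python as follows. The `Workspace` and `ResearchObject` classes, the integer version counter, and the file names are hypothetical and chosen only for illustration; they are not SEAD's implementation. The sketch shows two ROs that overlap on one item while the item itself keeps changing.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, Tuple


@dataclass(frozen=True)
class ResearchObject:
    """An immutable snapshot: a title plus pinned (item_id, version) pairs."""
    title: str
    members: FrozenSet[Tuple[str, int]]


class Workspace:
    """Holds the 'living' content; here each item is just a version counter."""

    def __init__(self) -> None:
        self._versions: Dict[str, int] = {}

    def update(self, item_id: str) -> None:
        # Creating or editing an item bumps its version (assumed scheme).
        self._versions[item_id] = self._versions.get(item_id, 0) + 1

    def snapshot(self, title: str, item_ids: Iterable[str]) -> ResearchObject:
        # Pin the current version of each item; the items are referenced, not copied.
        pinned = frozenset((i, self._versions[i]) for i in item_ids)
        return ResearchObject(title, pinned)


ws = Workspace()
for name in ("raw.csv", "model.nc", "fig.png"):
    ws.update(name)

# Two ROs share raw.csv: the collections overlap, the underlying items do not.
ro_a = ws.snapshot("Paper A", ["raw.csv", "model.nc"])
ro_b = ws.snapshot("Paper B", ["raw.csv", "fig.png"])

# Work continues on the living content; the earlier snapshots stay fixed.
ws.update("raw.csv")
```

Pinning references rather than copying content is one way to keep several overlapping ROs cheap to define; other designs (for example, full copies per RO) trade storage for simpler preservation.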

Key Points

• Research Objects have meaning/value but data comes in smaller chunks
• Research Objects are not orthogonal, but individual data sets/files are

• Lifecycle approaches for datasets are becoming possible

• Managing intermixed ROs is the problem that needs to be tackled to meet the research community’s needs

• Research Data Alliance (RDA) can help drive standardization/scaling

What will drive research data preservation?

• The most valuable data service(s) are active/actionable research service(s)…
  – The ability to define Research Objects is more important than any given RO
• Led by research organizations as part of their long-term mission?
  – The only organizations with the focus, scope, and scale to solve the whole problem (end-to-end research productivity)

Acknowledgements

• SEAD Team @ UM, UI, IU
• NSF
• NCED, IRBO, WSC-Reach, IMLCZO, ICPSR, other sustainability researchers

• and Thank You!

… stop by the SEAD booth and share your thoughts!

http://sead-data.net/