Data Sets, Ensemble Cloud Computing, and the University Library: Getting the Most Out of Research Support
A presentation given at AGU 2013
Jim Myers (1), Margaret Hedstrom (1), Beth A Plale (2), Praveen Kumar (3), Robert McDonald (4), Rob Kooper (5), Luigi Marini (5), Inna Kouper (4), Kavitha Chandrasekar (4)
(1) School of Information, University of Michigan, Ann Arbor, MI, United States.
(2) School of Informatics and Computing, Indiana University, Bloomington, IN, United States.
(3) Civil and Environmental Engineering, University of Illinois, Urbana-Champaign, IL, United States.
(4) Data To Insight Center, Indiana University, Bloomington, IN, United States.
(5) National Center for Supercomputing Applications, University of Illinois, Urbana-Champaign, IL, United States.
Overview
• Technological advances are making it ever easier to move computation, data, and metadata around
• With decreasing costs and increasing recognition of the value of data re-use, many organizations are exploring their role in data curation/preservation
• If we look at the nature of the problem:
– How should data be curated to scalably support research?
  • Lifecycle approaches to manage value-defined research objects
– Can we do it?
  • SEAD as an end-to-end demonstration…
– What organization(s) are best positioned/the most capable of leading/providing such services long-term?
  • Primary research organizations have a combination of capability, motivation, and long-term commitment.
Technology – the world is flat
• Today’s researchers can employ computing and data resources from anywhere, using scalable search technologies …
Enough said.
Data as a key resource, Big Data
• Data is increasingly recognized as valuable beyond its initial use:
– Data reproducibility
– Re-analysis
– Reference Data
– Data mining/machine learning/…
– NSF Data plan requirement
– Paper publication with data requirements
– Community and institutional collections growing
Data Publication today
• Data cited in papers (to limited depth)
• Project file archives (large, limited description, gray/dark)
• Reference/analytical data (standardized content, limited breadth)
• Historical collections (temporal breadth, limited numbers)
- do any of these solve the problem?
Researchers think, and work, like this:
• Multi-
– Disciplinary
– Format
– Model
– Semantics
– Location
and this
– Raw and derived data
– ~5 levels of quality, processing, maturity
– Observations, calibrations, experiments, models, statistical ensembles, …
– Also organized by location, time, variables, technique, creator, project, provenance, …
– Large amount of reference information from external sources (e.g. NASA)
– Evidence for ‘non-orthogonal’ sub-collections
What’s Really Needed?
Scalable Research Productivity Requires:
• A way to
– Store what you want
– Reference what you want
– Organize how you want (search, filter, tag, collect)
• At the scale, and level of detail/richness, you want
• When you figure that out
• In a way that is self-describing/high-fidelity across applications and owners
• In the vocabularies and formats you find efficient
• Beyond the lifetime of individual/project interest
• For active use and external credit
• With minimal training/IT support required.
How can we approach magic?
• Global identifiers – data, terms, metadata
• Content management abstractions (blob + type + metadata)
• Service architectures and automated processing (conversion, preview, extraction, derivation, cataloging, …)
• Applications that share these abstractions – write what you know, display/ignore what you don’t
• Research Object management (structured, inter-related collections)
Web 2.0, Web 3.0, + explicit context management …
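The "blob + type + metadata" abstraction and the "write what you know, display/ignore what you don't" rule can be sketched roughly as follows. This is an illustrative sketch only, not SEAD's actual data model: the `ContentItem` class, the `view` helper, and the `example.org` term are hypothetical names invented here, while the two Dublin Core URIs are real published terms used as examples of global identifiers.

```python
import uuid
from dataclasses import dataclass, field

# Real Dublin Core term URIs, used as examples of globally identified terms.
DC_TITLE = "http://purl.org/dc/terms/title"
DC_CREATOR = "http://purl.org/dc/terms/creator"
# Hypothetical domain-specific term, invented for this illustration.
SENSOR_DEPTH = "http://example.org/terms/sensorDepth"


@dataclass
class ContentItem:
    """Blob + type + metadata, addressed by a global identifier."""
    blob: bytes
    media_type: str
    metadata: dict = field(default_factory=dict)
    identifier: str = field(default_factory=lambda: f"urn:uuid:{uuid.uuid4()}")

    def annotate(self, term: str, value: str) -> None:
        """Write what you know: add a value under a globally identified term."""
        self.metadata.setdefault(term, []).append(value)


def view(item: ContentItem, understood_terms: set) -> dict:
    """Return only the terms this application understands.

    Unknown terms are ignored for display but stay on the item untouched.
    """
    return {t: v for t, v in item.metadata.items() if t in understood_terms}


item = ContentItem(blob=b"depth_m,temp_c\n1.5,12.1\n", media_type="text/csv")
item.annotate(DC_TITLE, "Stream temperature profile")
item.annotate(DC_CREATOR, "Example Researcher")
item.annotate(SENSOR_DEPTH, "1.5")

# A generic catalog application that only understands Dublin Core terms.
catalog_view = view(item, {DC_TITLE, DC_CREATOR})
print(catalog_view)
```

Because terms are keyed by global identifiers rather than by application-specific field names, a second application can add its own annotations to the same item without the catalog application needing to change.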
SEAD: Sustainable Environment - Actionable Data
• An NSF DataNet project started in October, 2011
• An international resource for sustainability science
• A provider of light-weight Data Services based on novel technical and business approaches:
– Supporting the long-tail of research
– Enabling active and social curation
– Providing integrated lifecycle support for data
http://sead-data.net/
Margaret Hedstrom, PI
Praveen Kumar, co-PI
Jim Myers, co-PI
Beth Plale, co-PI
SEAD is:
• Data discovery
• Project workspaces
• A data-aware community network
• Curation and preservation services that link to multiple archives and discovery services
SEAD is:
• An active repository that creates data pages with
– Previews
– Extracted Metadata
– Overlays
– Tags
– Comments
– Provenance
– Use information
– Download/Embed
• A tool for community exploration:
– Personal and Project Profiles
– Publications and Data Citations
– Co-author, co-investigator graphs
– Temporal analysis
SEAD is:
• Curation and Preservation Services:
– Research Object management
– ID assignment
– Matchmaking to long-term repositories
– Citation Generation
– Catalog Registration
– Discovery services
SEAD’s Virtual Archive allows curators to access, assess, enhance, package, and submit data from SEAD project repositories for long-term storage in SEAD-managed storage or external institutional repositories and cloud data services.
Semantic Content Middleware over Scalable File System and Triple Store
Flickr-style web management of data
Geospatial, social network mash-ups, workflows and services
Curation Services to harvest and package specific data sets
Federation of OAI repositories for long-term preservation
Sensor data
– Apps read what they need and write what they know
– Curation snapshots meaningful Research Objects
– Multiple ROs can be defined/managed re-using the same underlying ‘living’ content
– The larger graph can be ~reassembled w/o the ongoing cost of managing at the item level
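One way to picture Research Objects as curation snapshots over shared 'living' content is the minimal sketch below. It is an assumption-laden illustration, not SEAD's implementation: the `Workspace` and `ResearchObject` classes, the tag-based selection rule, and the `urn:example:` identifiers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResearchObject:
    """A frozen, named set of item identifiers taken at curation time."""
    title: str
    members: frozenset


class Workspace:
    """'Living' project content: item identifiers carrying mutable tags."""

    def __init__(self):
        self.tags_by_item = {}

    def tag(self, identifier: str, *tags: str) -> None:
        """Add an item if needed and attach tags to it."""
        self.tags_by_item.setdefault(identifier, set()).update(tags)

    def snapshot(self, title: str, tag: str) -> ResearchObject:
        """Freeze the items that currently carry the given tag into an RO."""
        members = frozenset(
            item for item, tags in self.tags_by_item.items() if tag in tags
        )
        return ResearchObject(title, members)


ws = Workspace()
ws.tag("urn:example:raw-1", "field-campaign", "paper-draft")
ws.tag("urn:example:calibration-1", "field-campaign")
ws.tag("urn:example:model-run-1", "paper-draft")

# Two ROs defined over the same underlying content; they overlap on raw-1.
campaign_ro = ws.snapshot("Field campaign data", "field-campaign")
paper_ro = ws.snapshot("Data supporting the paper", "paper-draft")
shared = campaign_ro.members & paper_ro.members

# The workspace keeps changing after curation; earlier snapshots do not.
ws.tag("urn:example:raw-2", "field-campaign")
print(sorted(shared), len(campaign_ro.members))
```

In this sketch the two ROs share an item, which is the 'non-orthogonal' case, while each item still has a single identity in the workspace. Only the RO membership sets need curating; the items themselves are managed once.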
Key Points
• Research Objects have meaning/value but data comes in smaller chunks
• Research Objects are not orthogonal, but individual data sets/files are
• Lifecycle approaches for datasets are becoming possible
• Managing intermixed ROs is the problem that needs to be tackled to meet the research community’s needs
• Research Data Alliance (RDA) can help drive standardization/scaling
What will drive research data preservation?
• The most valuable data service(s) are active/actionable research service(s)…
– The ability to define Research Objects is more important than any given RO
• Led by research organizations as part of their long-term mission?
– The only organizations with the focus, scope, and scale to solve the whole problem (end-to-end research productivity)
Acknowledgements
• SEAD Team @ UM, UI, IU
• NSF
• NCED, IRBO, WSC-Reach, IMLCZO, ICPSR, other sustainability researchers
• and Thank You!
… stop by the SEAD booth and share your thoughts!
http://sead-data.net/