NISO Forum, Denver, Sept. 24, 2012: Scientific Discovery and Innovation in an Era of Data-Intensive...
DESCRIPTION
Scientific discovery and innovation in an era of data-intensive science
William (Bill) Michener, Professor and Director of e-Science Initiatives for University Libraries, University of New Mexico; DataONE Principal Investigator
The scope and nature of biological, environmental and earth sciences research are evolving rapidly in response to environmental challenges such as global climate change, invasive species and emergent diseases. Scientific studies are increasingly focusing on long-term, broad-scale, and complex questions that require massive amounts of diverse data collected by remote sensing platforms and embedded environmental sensor networks; collaborative, interdisciplinary science teams; and new tools that promote scientific data preservation, discovery, and innovation. This talk describes the challenges facing scientists as they transition into this new era of data-intensive science, presents current solutions, and lays out a roadmap to the future where new information technologies significantly increase the pace of scientific discovery and innovation.
TRANSCRIPT
Data Observation Network for Earth (DataONE): Supporting Scientific Data Preservation, Discovery, and Innovation
Bill Michener
Professor and DataONE Project Director
University of New Mexico
24 September 2012
National Information Standards Organization
Research and Data Life Cycle Integration
Data life cycle: Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze
Research cycle: Proposal writing, Research, Publication, Ideas
Three Key Challenges
[Figure: data life cycle (Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze) annotated with challenges 1, 2, and 3; a brace labels "Innovation"]
1. Data Preservation and Planning
The Long Tail of Orphan Data
[Figure: Volume vs. rank frequency of datatype; labeled regions: Specialized repositories (e.g., GenBank, PDB); Orphan data (B. Heidorn)]
“Most of the bytes are at the high end, but most of the datasets are at the low end” – Jim Gray
Planning?
Metadata standard?
Data repository?
Three major components for a flexible, scalable, sustainable network
Member Nodes
• diverse institutions
• serve local community
• provide resources for managing their data
• retain copies of data
Coordinating Nodes
• retain complete metadata catalog
• indexing for search
• network-wide services
• ensure content availability (preservation)
• replication services
Investigator Toolkit
DataONE and the DMPTool Support Data Preservation
Dryad (>3,000 data products)
Coordinated submission of articles and underlying data
Handshaking with specialized repositories
Promotion of reuse and incentives for deposit
Contributors
• Individual investigators
• Field stations and networks
• Government agencies
• Non-profit partnerships
• Synthesis centers
Data Types
• Ecological
• Environmental
• Demographic
• Social/Legal/Economic
[Figure: histogram of data sizes; bins < 1, 1-10, 10-200, > 200 (axis label as extracted: "10MB"); vertical axis in percent, 0 to 60]
Knowledge Network for Biocomplexity (20,000+ data packages)
✔ Check for best practices
✔ Create metadata
✔ Connect to ONEShare
Data & Metadata (EML)
Data Management Planning Tool
2. Data Discovery
Data Silos
The DataONE Federation
Tier 1: Read only, public content
ping(), getLogRecords(), getCapabilities(), get(), getSystemMetadata(), getChecksum(), listObjects(), synchronizationFailed()
Tier 2: Read only, with access control
isAuthorized(), setAccessPolicy()
Tier 3: Read/Write using client tools
create(), update(), delete()
Tier 4: Able to operate as a replication target
replicate(), getReplica()
http://mule1.dataone.org/ArchitectureDocs-current/apis/MN_APIs.html
Member Node Functional Tiers
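The tier list on this slide can be pictured as a small lookup table. The following is a minimal sketch, not DataONE code: the method names are copied from the slide, while the `highest_tier` helper and the assumption that tiers are cumulative (a node at Tier N also implements every lower tier) are illustrative only.

```python
# Method names are copied from the slide; everything else is a
# hypothetical illustration, not part of any DataONE library.
TIER_METHODS = {
    1: {"ping", "getLogRecords", "getCapabilities", "get",
        "getSystemMetadata", "getChecksum", "listObjects",
        "synchronizationFailed"},
    2: {"isAuthorized", "setAccessPolicy"},
    3: {"create", "update", "delete"},
    4: {"replicate", "getReplica"},
}


def highest_tier(supported):
    """Return the highest tier whose methods, and those of every lower
    tier, are all in `supported` (assumes tiers are cumulative)."""
    supported = set(supported)
    tier = 0
    for level in sorted(TIER_METHODS):
        if TIER_METHODS[level] <= supported:
            tier = level
        else:
            break  # a gap at this level stops the climb
    return tier


if __name__ == "__main__":
    read_only_node = TIER_METHODS[1] | TIER_METHODS[2]
    print(highest_tier(read_only_node))
```

Under the cumulative assumption, a node that offers the Tier 3 write methods but lacks the Tier 2 access-control methods would still be reported as Tier 1.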
[Figure labels: NASA collectors; DAAC Users (UWG); DataONE Users]
ORNL DAAC as a DataONE Member Node
Investigator Toolkit
1. Ontology-based discovery search results
Concepts acquire context: biomass as Material or biomass as Energy
Additional search terms
Super-classes may have different properties
1. NCBO ontology repository instance
2. Populated with ontologies (e.g., the NASA-JPL Semantic Web for Earth and Environmental Terminology)
3. Queried ontologies and returned results using REST services
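The idea that a search term acquires context from its super-class can be illustrated with a toy lookup. This is a minimal sketch under stated assumptions: the two-entry ontology fragment below is hand-written for illustration (it is not taken from SWEET or from an NCBO repository), and the `contexts` function is hypothetical rather than any service the project exposed.

```python
# Toy, hand-written ontology fragment (hypothetical; for illustration only).
# Each sense of a term points at its super-class.
SUPER_CLASS = {
    "biomass (material)": "Material",
    "biomass (energy)": "Energy",
}

# Search label -> the senses that share that label.
SENSES = {
    "biomass": ["biomass (material)", "biomass (energy)"],
}


def contexts(term):
    """Return (sense, super-class) pairs for a search term, so a user
    can pick the intended meaning before the search is run."""
    return [(sense, SUPER_CLASS[sense]) for sense in SENSES.get(term.lower(), [])]


if __name__ == "__main__":
    for sense, parent in contexts("biomass"):
        print(f"{sense} is a kind of {parent}")
```

A real deployment would replace the two dictionaries with queries against an ontology repository; the lookup shape stays the same.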
Actual Keywords
1. canopy characteristics
2. field investigation
3. vegetation index
4. leaf characteristics
5. Satellite
6. land cover
7. leaf area meter
8. Reflectance
9. steel measuring tape
10. vegetative cover
11. plant characteristics
12. albedo
Suggested Keywords
[1] field investigation
[2] analysis
[3] land cover
[4] computational model
[5] reflectance
[6] vegetative cover
[7] biomass
[8] primary production
[9] steel measuring tape
[10] weigh balance
[11] precipitation amount
[12] canopy characteristics
[13] leaf characteristics
[14] water vapor
[15] quadrat sample frame
[16] rain gauge
[17] surface air temperature
[18] air temperature
[19] meteorological station
[20] human observer
[21] vegetation index
[22] soil core device
[23] plant characteristics
[24] surface wind
[25] albedo

                            DAAC     DRYAD    KNB
Number of Documents         978      1,729    24,249
Total Number of Keywords    7,294    8,266    254,525
Average Keywords/Document   7.46     4.78     10.49

[Figure: bar chart of average keywords per document for DAAC, DRYAD, and KNB; axis from 0 to 12]
Approach 2: Enrich MN Metadata
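The "Average Keywords/Document" row of the table on this slide is the total keyword count divided by the document count. A minimal sketch of that arithmetic follows; the counts are copied from the slide, and the dictionary and function names are illustrative assumptions.

```python
# Counts copied from the slide: (number of documents, total keywords).
STATS = {
    "DAAC": (978, 7294),
    "DRYAD": (1729, 8266),
    "KNB": (24249, 254525),
}


def average_keywords(documents, keywords):
    """Mean number of keywords per metadata document."""
    return keywords / documents


if __name__ == "__main__":
    for name, (docs, kws) in STATS.items():
        print(f"{name}: {average_keywords(docs, kws):.3f}")
```

The DAAC and DRYAD quotients round to the slide's 7.46 and 4.78. The KNB quotient is about 10.496, so the slide's 10.49 appears to be truncated rather than rounded.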
3. Innovation
The Fourth Paradigm:
1. Observational and experimental
2. Theoretical research
3. Computer simulations of natural phenomena
4. Data-intensive research
• new tools, techniques, and ways of working
[Figure, adapted from CENR-OSTP: levels labeled Remote sensing; Intensive science sites and experiments; Extensive science sites; Volunteer & education networks. Axes: Decreasing Spatial Coverage; Increasing Process Knowledge]
“Data Intensive Science” and the “80:20 Rule”
Public Participation in Scientific Research Conference: 4-5 August 2012 in Portland, Oregon USA prior to Ecological Society of America meeting (6-10 Aug.): http://www.birds.cornell.edu/citscitoolkit/conference/2012
Kepler
DMP-Tool
Investigator Toolkit Support
Data life cycle: Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze
Spatio-Temporal Exploratory Model identifies factors affecting patterns of migration
Diverse bird observations and environmental data from 300,000 locations in the US integrated and analyzed using High Performance Computing Resources
Land Cover
Meteorology
MODIS – Remote sensing data
• Examine patterns of migration
• Infer how climate change may affect bird migration
Model results: Occurrence of Indigo Bunting (2008)
[Figure panels: Jan, Apr, Jun, Sep, Dec]
Exploration, Visualization, and Analysis
Taverna, MyExperiment
Provenance Browser
DataONE: Supporting Scientific Data Preservation, Discovery, and Innovation
Current Member Nodes:
Coming Soon:
Current Tools:
Tools Coming Soon: Queensland University of Technology
2009 2010 2011 2012 2013 2014
Deployment Targets – Y5
Y1 Y2 Y3 Y4 Y5
Metadata Objects     100k   (130k)   400k   1M
Datasets             90k    (120k)   180k   360k
Uptime               99.0   (100)    99.9   99.9
Metadata Schemas     8      (4)      8      8
Member Nodes         10     (8)      20     40
MN Countries         3      (2)      5      10
Coordinating Nodes   3      (3)      4      5
CN Countries         1      (1)      1      2
ITK Tools            8      (4)      10     12
Community Engagement
Year 1 Year 2 Year 3 Year 4 Year 5
Scientists: BL
User Assessments
Scientists: FU
Librarians: BL Librarians: FU
Policy Makers: BL Policy Makers: FU
Educators: BL Educators: FU
Library Policies: BL Library Policies: FU
Community Engagement
Best Practices and Software Tools
June 3-21, 2013
University of New Mexico
Internships
https://notebooks.dataone.org/summer2012/
2009 – 4 interns, 2010 – 4 interns
2011 – 8 interns, 2012 – 6 interns
DataONE: Supporting Scientific Data Preservation, Discovery, and Innovation
DataONE.org
DataONE Team and Sponsors
• Bertram Ludaescher
• Deborah McGuinness
• Jeff Horsburgh
• Robert Sandusky
• Peter Honeyman
• Carole Goble
• Cliff Duke
• Donald Hobern
• Ewa Deelman
• Amber Budden, Roger Dahl, Rebecca Koskela, Bill Michener, Robert Nahf, Skye Roseboom, Mark Servilla
• Patricia Cruse, John Kunze
• Dave Vieglais
• Paul Allen, Rick Bonney, Steve Kelling
• Stephanie Hampton, Chris Jones, Matt Jones, Ben Leinfelder, Andrew Pippin
• Suzie Allard, Nick Dexter, Kimberly Douglass, Carol Tenopir, Robert Waltz, Bruce Wilson
• John Cobb, Bob Cook, Ranjeet Devarakonda, Giri Palanisamy, Line Pouchard
• Sky Bristol, Mike Frame, Richard Huffine, Viv Hutchison, Jeff Morisette, Jake Weltzin, Lisa Zolly
• David DeRoure
• Ryan Scherle, Todd Vision
LEON LEVY FOUNDATION
• Randy Butler