11/18/2011
App as Repository: Functionality to attract Dark Data
eResearch Australasia 2011
P. Bryan Heidorn, SIRLS, Univ of Arizona
7 November 2011
(near) University of Arizona
Thesis
• Large amounts of data remain uncurated
• Most of that data is from small data sets and is currently largely invisible – Dark Data
• This data should be curated locally but not by scientists alone
The problem
• Information is not in accessible format
• Computer Science, Information Science and Technology have not addressed the problem
• Challenge to reproducibility of science
Images courtesy Ian Foster
Cyberinfrastructure Vision
“The anticipated growth in both the production and repurposing of digital data raises complex issues not only of scale and heterogeneity, but also of stewardship, curation and long‐term access.”
NSF Cyberinfrastructure Vision for 21st Century Discovery, Chapter 3, 2007
Recognition of need for data curation
“Recommendation 6: The NSF, working in partnership with collection managers and the community at large, should act to develop and mature the career path for data scientists and to ensure that the research enterprise includes a sufficient number of high‐quality data scientists.”
Long‐Lived Digital Data Collections: Enabling Research and Education in the 21st Century, Recommendations, 2005
• Recognition of the importance of Information
• Recognition of the need for education
• New work roles within traditional institutions
Interagency Working Group on Digital Data
Dark data is the data that we know is there but we can’t see.
Hubble Space Telescope composite image "ring" of dark matter in the galaxy cluster Cl 0024+17
Power Law of Science Data
$f(x) = a x^{k} + o(x^{k})$
[Figure: power‐law curve, Data Volume (y‐axis) against Science Projects and Initiatives (x‐axis); the region X < .20 is marked]
Does NSF’s Data Follow the Power Law?
I do not know but if $1 = X bytes…..
[Figure: Awarded Amount 2007, grants ranked by award size; vertical axis $0 to $7,000,000, horizontal axis grant rank 1 to 8776]
20‐80 Rule: The small are big!
Total Grants: 9347
Total Dollars: $2,137,636,716

                 Largest 20%            Smallest 80%
Number Grants    1869                   7478
Total Dollars    $1,199,088,125         $938,548,595
Range            $6,892,810-$350,000    $350,000-$831
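The arithmetic behind the 20‐80 slide can be checked with a short sketch. It uses only the figures shown in the table; the group labels and variable names are illustrative, and the dollar total is computed from the two columns rather than taken from the slide.

```python
# Figures copied from the slide's table (NSF awards, 2007).
# The group labels are illustrative names, not from the slide.
grants = {"largest 20%": 1869, "smallest 80%": 7478}
dollars = {"largest 20%": 1_199_088_125, "smallest 80%": 938_548_595}

# Totals are derived from the two columns.
total_grants = sum(grants.values())
total_dollars = sum(dollars.values())

# Fraction of all grants and of all dollars held by each group.
grant_share = {group: n / total_grants for group, n in grants.items()}
dollar_share = {group: d / total_dollars for group, d in dollars.items()}

for group in grants:
    print(f"{group}: {grant_share[group]:.1%} of grants, "
          f"{dollar_share[group]:.1%} of dollars")
```

On these numbers the smallest 80% of grants carry roughly 44% of the awarded dollars, which is the sense in which "the small are big".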
Where is your data now?
Is it working or on the dole?
What is a data scientist anyway and what should they be doing?
Data Scientist Job
• Find Dark/Hidden data – Or better: do not lose it in the first place
• Evaluate usefulness – Or better: provide usage metrics
• Organize (metadata and format) – Or better: allow multiple views
• Collect into sufficient mass to be visible – Normalize, Interlink and Integrate
• Provide access – Or better: Enable new research
• Preserve – Or better: Make it indispensable
Apps:Heidorn
Small data is big science
• Because it is high volume
• Because it is information rich – high entropy
• While the needs of large data are understood, the needs of small data and integration are not
Heidorn, P. Bryan (2008). Shedding Light on the Dark Data in the Long Tail of Science. Library Trends 57(2), Fall 2008. Institutional Repositories: Current State and Future. Edited by Sarah Shreeves and Melissa Cragin. http://hdl.handle.net/2142/9127
Where to find dark data in biodiversity?
• Literature/Biodiversity Heritage Library
• Museum Specimens
• Field notes
• (Un)Experimental data sets
• Citizen Observations
What is dark data good for?
• Ecological Niche Modeling
• Climate Change niche change prediction
• Taxonomic Name Resolution
• Literature Search Support– Taxonomic intelligence
– Key‐like – character searching
• Phenology and Phenology change
• Food‐web / trophic level
Cyberinfrastructure Needs
• Collection
• Storage
• Access
• Processing
• Communication
• Training
• Institutions
Institutionalization of e‐research
The tools of science and the business of science need to be altered to seamlessly integrate good data curation into professional practice.
We need professional training, social institutions, scholarly communication and stable funding to support and promote data‐ and computation‐enabled research.
Repositories
http://gavinwedell.com/doodles/
Lab Notebook
Sensors
Statistical Packages
Lab Notebook
Sensors
Statistical Packages
IBM System/360 Model 65
This photo was taken by Mike Ross of corestore.org
Lab Notebook
Sensors
Statistical Packages
Data Repurposing
From: To stand the test of time: Long‐term stewardship of digital data sets in science and engineering. Sept 26‐27, 2006, Arlington, VA
A number of projects working in this direction
The iPlant Collaborative
Cyberinfrastructure to Support the Challenges of Modern Biology
Society for Experimental Biology, Glasgow, UK
July 3rd, 2011
Dan Stanzione
Co-PI and Cyberinfrastructure Lead, iPlant Collaborative
Deputy Director, Texas Advanced Computing Center
What is iPlant?
• iPlant’s mission is to build the CI to support plant biology’s Grand Challenge solutions
• Grand Challenges were not defined in advance, but identified through engagement with the community
• A virtual organization with Grand Challenge teams relying on national cyberinfrastructure
• Long term focus on sustainable food supply, climate change, biofuels, ecological stability, etc.
• Hundreds of participants globally… Working group members at >50 US institutions, USDA, DOE, etc.
Brief History
• Funding by NSF – February 1st, 2008
• iPlant Kickoff Conference at CSHL – April 2008
o ~200 participants
• Grand Challenge Workshops – Sept-Dec 2008
• CI workshop – Jan 2009
• Grand Challenge White Paper Review – March 2009
• Project Recommendations – March 2009
• Project Kickoffs – May 2009 & August 2009
• Start of software development – September 2009
• First prototypes to public – April 2010
• First release with user-driven tool integration – July 2011
iPlant’s Central Challenge
• To define what it means to build a lasting, community driven Cyberinfrastructure for the Grand Challenges of Plant Science, to get community buy‐in of this vision, and to execute this vision.
Steve Goff, PI, U of Arizona
Dan Stanzione, co‐PI, Texas Advanced Computing Center
National Science Board
Update on Award Progress: DBI‐0735191
Directorate for Biological Sciences
July 2011
What iPlant Offers
Grand Challenges in Plant Science
• Genotype‐to‐Phenotype
– To understand how DNA blueprints produce a plant’s characteristic traits and functions and to predict how traits change in response to complex environments
– Requires ability to collect, query, interpret, and model high‐throughput, genome‐scale data sets
• Tree of Life
– To understand evolutionary relationships among green plants
– Requires ability to create, display, and query information in very large phylogenetic trees
Taxonomic Name Resolution Service
The Biodiversity Heritage Library: 34 million pages now
Collection of Australian birds' eggs and nests in the possession of D. Le Souef, Director, Zoological Gardens, Melbourne, 1900?
Long Citation Half‐life
Critical use for Taxonomy
Ecology and Environmental History
Naming for genomics and metagenomics
Mobilizing Data Locked on Paper
• Fine‐Grained Semantic Markup of Descriptive Data for Knowledge Applications in Biodiversity Domains. Hong Cui (Principal Investigator)
• The University of Arizona is awarded a grant to develop and evaluate a set of algorithms/software to help computers read and “understand” taxonomic descriptions of plants, animals, and other living or fossil organisms. The major functions of the algorithms/software include: 1) annotating large sets of text descriptions in a machine‐readable way to support various knowledge applications, including producing character matrices and identification keys for various taxon groups.
Semantic Markup System
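To make the idea of turning free‐text descriptions into a character matrix concrete, here is a toy sketch. It is not the project's algorithm: the lexicon, the function names and the two sample descriptions are all invented for illustration, and real taxonomic text needs far more than keyword lookup.

```python
import re

# Hypothetical mini-lexicon (an assumption for this sketch only).
STRUCTURES = {"leaf", "leaves", "petal", "petals", "stem", "stems"}
SINGULAR = {"leaves": "leaf", "petals": "petal", "stems": "stem"}
STATES = {
    "color": {"green", "white", "red", "yellow"},
    "shape": {"ovate", "linear", "round"},
    "texture": {"hairy", "smooth"},
}

def annotate(description):
    """Return {structure: {character: state}} for one description."""
    result = {}
    # Treat each clause (split on '.' or ';') as describing one structure.
    for clause in re.split(r"[.;]", description.lower()):
        words = re.findall(r"[a-z]+", clause)
        structure = next(
            (SINGULAR.get(w, w) for w in words if w in STRUCTURES), None
        )
        if structure is None:
            continue
        for character, states in STATES.items():
            for w in words:
                if w in states:
                    result.setdefault(structure, {})[character] = w
    return result

def character_matrix(descriptions):
    """Build (columns, {taxon: row}) from {taxon: description text}."""
    rows = {taxon: annotate(text) for taxon, text in descriptions.items()}
    columns = sorted(
        {(s, c) for r in rows.values() for s, chars in r.items() for c in chars}
    )
    # "?" marks a character that the description does not mention.
    matrix = {
        taxon: [r.get(s, {}).get(c, "?") for s, c in columns]
        for taxon, r in rows.items()
    }
    return columns, matrix

descriptions = {
    "Taxon A": "Leaves ovate, green; petals white.",
    "Taxon B": "Leaves linear, hairy. Petals yellow.",
}
columns, matrix = character_matrix(descriptions)
print(columns)
for taxon, row in matrix.items():
    print(taxon, row)
```

Each column is a (structure, character) pair and each row a taxon, which is the shape a character matrix or identification key would start from.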
Data Interlinking
The World Mind (Ver0.01b)
John Deck, University of California, Berkeley
Brian Stucky, University of Colorado, Boulder
Lukasz Ziemba, University of Florida, Gainesville
Nico Cellinese, University of Florida, Gainesville
Rob Guralnick, University of Colorado, Boulder
BiSciCol Team: Reed Beaman, Nico Cellinese, Jonathan Coddington, Neil Davies, John Deck, Rob Guralnick, P. Bryan Heidorn, Chris Meyer, Tom Orrell, Rich Pyle, Kate Rachwal, Brian Stucky, Rob Whitton, Lukasz Ziemba
BiSciCol: Tracking Biodiversity Objects to Brokering Standards
“Or, Gustav’s Big Problem”
Biological Science Collections Tracker: working towards building an infrastructure designed to tag and track scientific collections and all of their derivatives.
• National Science Foundation funded 2010 – 2014
• Partners are University of Florida at Gainesville, University of Colorado at Boulder, Bishop Museum, University of California at Berkeley, Smithsonian Institution, University of Arizona at Tucson
• Relies on globally unique identifiers (GUIDs) to track objects
• Implements a Linked Data approach
• Provides support for the Global Names Architecture
Biological Science Collections (BiSciCol) Tracker
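The combination of GUIDs and a Linked Data approach can be sketched as a small store of subject‐predicate‐object statements. This is an illustration only, not BiSciCol code: the `Tracker` class, the `derivedFrom` predicate and the object labels are assumptions made for the example.

```python
import uuid
from collections import defaultdict

class Tracker:
    """Toy store of (subject GUID, predicate, object GUID) statements."""

    def __init__(self):
        self.labels = {}                   # GUID -> human-readable label
        self.triples = []                  # all statements, in insertion order
        self.children = defaultdict(list)  # source GUID -> direct derivatives

    def register(self, label):
        """Assign a globally unique identifier to a new object."""
        guid = f"urn:uuid:{uuid.uuid4()}"
        self.labels[guid] = label
        return guid

    def link(self, subject, predicate, obj):
        """Record a statement; 'derivedFrom' is a hypothetical predicate."""
        self.triples.append((subject, predicate, obj))
        if predicate == "derivedFrom":
            self.children[obj].append(subject)

    def derivatives(self, guid):
        """Return every object derived, directly or indirectly, from guid."""
        found, stack, seen = [], [guid], {guid}
        while stack:
            for child in self.children[stack.pop()]:
                if child not in seen:
                    seen.add(child)
                    found.append(child)
                    stack.append(child)
        return found

tracker = Tracker()
specimen = tracker.register("specimen")
tissue = tracker.register("tissue sample")
sequence = tracker.register("gene sequence")
tracker.link(tissue, "derivedFrom", specimen)
tracker.link(sequence, "derivedFrom", tissue)

print([tracker.labels[g] for g in tracker.derivatives(specimen)])
# prints ['tissue sample', 'gene sequence']
```

Because every object carries its own identifier, a sequence held in one database can still be traced back to the specimen held in another, as long as both publish their links.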
[Figure: Agave sisalana objects at three sources, S1: KNM (Nairobi National Museum), S2: MNHN (Muséum national d'histoire naturelle) and S3: MBG (Living Collection: Missouri Botanical Garden); links to Determination, Gene Sequence (GENBANK) and Parasitism records are marked "?"]
From “Facebook Visualizer”
Tracking Facebook relationships …
Can we track relationships for Biological Objects as well?
Why? Here is Gustav’s Problem….
(Prefers to collect stuff)
Lots of Data ….
Generates …
Due to project requirements and integration needs, Gustav is left navigating a plethora of redundant and disconnected distributed databases. Lots of effort to track objects and their derivatives.
Can we borrow from Facebook and social networking to help solve Gustav’s Problem?
A Biological Relationship Graph …
[Figure: graph linking Specimens, Tissues and Sequences, with a Taxonomic Type Filter and a Class Filter. Functions: Infer Relationships Across Providers]
Mobilizing data in museums
NSF: Advanced Digitization of Biological Collections
• iDigBio: The National Resource for Advancing Digitization of Biological Collections
Organization
• National Hub (~$7.5M)
– Title: A Collections Digitization Framework for the 21st Century
– PI: Lawrence Page, University of Florida
• Thematic Hub (~$2M each)
– Title: InvertNet–An Integrative Platform for Research on Environmental Change, Species Discovery and Identification
• PI: Christopher Dietrich, University of Illinois, Urbana‐Champaign
– Title: Plants, Herbivores and Parasitoids: A Model System for the Study of Tri‐Trophic Associations
• PI: Randall T. Schuh, American Museum of Natural History
– Title: North American Lichens and Bryophytes: Sensitive Indicators of Environmental Quality and Change
• PI: Corinna Gries, University of Wisconsin, Madison
Example of Virtual Community in NanoTechnology
Citizen Science
• Need for standardization
• Validation
• Feedback
• Use
• Like non‐citizen science?
Agile Science
• Disaster: RAPID: Gulf Coast Oil Spill Biodiversity Tracker. A Volunteer‐based Observation Network. Steven Kelling (Principal Investigator)
• RAPID: Enhancement of Fishnet2 for Disaster Impact Assessment. Henry Bart (Principal Investigator)
http://ebird.org/tools/oilspill/
New Validation Models
• Filtered Push: Continuous Quality Control for Distributed Collections and Other Species‐Occurrence Data. James Macklin (Principal Investigator), Bertram Ludaescher (Co‐Principal Investigator)
• A networked solution to enable annotation of distributed biological collection data and to share assertions about their quality or usability.
Improved collection management
• Collaborative Biodiversity Collections Computing. James Beach (Principal Investigator)
http://digbiocol.wordpress.com/
Map of Life
Co‐PIs: Walter Jetz (Yale), Rob Guralnick (CU Boulder)
An infrastructure for integrating and advancing global species distribution knowledge
Advancing species distribution knowledge
[Figure: data availability by Scale (Grain): World, 200km, 50km, 1km, 100m, 1m. Columns: Topography, Landcover current, Landcover future, Species distributions (Vertebrates). Datasets labeled: 1996 GTOPO30; 2009 SRTM V4; 2003 GLC 2000; 2009 GlobCover; 1992 BIOME; 2001 Image 2.2; regional models; 2006 WWF; 2005‐9 expert maps; atlas data, surveys. A "Knowledge Gap" is marked "?".]
Hurlbert and Jetz (PNAS 2007); Jetz et al. (Conservation Biology 2008)
Cougar
John Wieczorek, Museum of Vertebrate Zoology at Berkeley
• How has VertNet built a large community of users?
• What technology/portal components were important to the community development?
• How did success lead to VertNet in its current incarnation?
The Vertebrate Networks (2011)
Est. 1999, 2004
31 collections
Est. 2001
41 collections
Est. 2004
48 collections
Est. 2002
56 collections
176 active collections
76 participating institutions
Collaborations
Growth of the Vertebrate Networks
[Figure: “Participation ‐ Peers” architecture diagram. Labels: Portal; Provider (×3); Collection Database (×3); Public Database (×2)]
Kenya National Museum Library
VertNet
New Architecture
Why do people share
• Credit for their work
• “Free” enhanced user statistics
• Easier to share than not to share
DataNet
The DataONE R Client Package is an R package that provides access to the DataONE services that are present at Coordinating Nodes and Member Nodes.
SQLShare: Database‐as‐a‐Service for Researchers
SQLShare is a database service aimed at removing the obstacles to using relational databases: installation, configuration, schema design, tuning, data ingest, and even application design. Our goal for SQLShare is to make available a cloud‐based system where you can simply upload your data and immediately start querying it.
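The "upload your data and immediately start querying it" idea can be illustrated locally with Python's standard library. This is a sketch of the workflow only and does not use or represent the SQLShare service or its API; the `load_csv` helper, the table name and the sample rows are invented for the example.

```python
import csv
import io
import sqlite3

def load_csv(conn, table, text):
    """Create a table straight from CSV text: the header row becomes the
    column names and no types or schema are declared up front."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    cols = ", ".join(f'"{name}"' for name in header)
    marks = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({cols})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', reader)

# Invented sample observations, standing in for an uploaded spreadsheet.
sample = "species,site,count\nCougar,A,2\nCougar,B,1\nBobcat,A,4\n"

conn = sqlite3.connect(":memory:")
load_csv(conn, "observations", sample)

# Query immediately; values arrive as text, so cast where numbers are needed.
rows = conn.execute(
    'SELECT species, SUM(CAST("count" AS INTEGER)) '
    "FROM observations GROUP BY species ORDER BY species"
).fetchall()
print(rows)
```

The design point is that the researcher never writes a schema: the file's own header is taken as the structure, and typing decisions are deferred to query time.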
Why Libraries
• Long history of scholarly data management
• Skills overlap, such as development of metadata standards, ontologies, controlled vocabularies, thesauri
• Long‐lived institutions
• Overlap with museums and archives
New Information Disciplines
• Digital Curator: an expert knowledgeable of and with responsibility for the content of a digital collection(s)
• Digital Archivist: an expert competent to appraise, acquire, authenticate, preserve, and provide access to records in digital form
• Data Scientists: the information and computer scientists, database and software engineers and programmers, disciplinary experts, expert annotators, and others, who are crucial to the successful management of a digital data collection
(Long‐Lived Digital Data Collections: Enabling Research and Education in the 21st Century, report of the National Science Board, September 2005)
Library Roles
Library Skills
Thank you