
11/18/2011


App as Repository: Functionality to attract Dark Data

eResearch Australasia 2011

P. Bryan Heidorn, SIRLS, Univ of Arizona

7 November 2011

(near) University of Arizona


Thesis

• Large amounts of data remain uncurated

• Most of that data is from small data sets and is currently largely invisible – Dark Data

• This data should be curated locally but not by scientists alone

The problem

• Information is not in accessible format

• Computer Science, Information Science and Technology have not addressed the problem

• Challenge to reproducibility of science


Images courtesy Ian Foster

Cyberinfrastructure Vision

“The anticipated growth in both the production and repurposing of digital data raises complex issues not only of scale and heterogeneity, but also of stewardship, curation and long‐term access.”

NSF Cyberinfrastructure Vision for 21st Century Discovery, Chapter 3, 2007


Recognition of need for data curation

“Recommendation 6: The NSF, working in partnership with collection managers and the community at large, should act to develop and mature the career path for data scientists and to ensure that the research enterprise includes a sufficient number of high‐quality data scientists.”

Long‐Lived Digital Data Collections: Enabling Research and Education in the 21st Century, Recommendations, 2005

• Recognition of the importance of Information

• Recognition of the need for education

• New work roles within traditional institutions

Interagency Working Group on Digital Data


Dark data is the data that we know is there but can’t see.

Hubble Space Telescope composite image "ring" of dark matter in the galaxy cluster Cl 0024+17

Power Law of Science Data

$f(x) = a x^{k} + o(x^{k})$

[Figure: Data Volume plotted against Science Projects and Initiatives; the region $x < 0.20$ of the curve is marked.]


Does NSF’s Data Follow the Power Law?

I do not know, but if $1 = X bytes…

[Chart: Awarded Amount 2007. Vertical axis: award size, $0 to $7,000,000. Horizontal axis: grant index, ticks from 1 to 8776.]

20‐80 Rule: The small are big!

Total grants: 9,347. Total dollars: $2,137,636,716.

                  Largest 20%               Smallest 80%
Number of grants  1,869                     7,478
Total dollars     $1,199,088,125            $938,548,595
Range             $6,892,810 to $350,000    $350,000 to $831
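The split in the 20‐80 figures can be checked with a few lines of arithmetic. The sketch below is illustrative only: the numbers are copied from the slide, while the variable names and the choice to use the sum of the two groups as the denominator are assumptions made here.

```python
# Figures copied from the 20-80 slide (NSF awards, 2007).
total_grants = 9347
top_grants, rest_grants = 1869, 7478            # largest 20%, smallest 80%
top_dollars, rest_dollars = 1_199_088_125, 938_548_595

# Assumption: use the sum of the two groups as the denominator.
# It is within a few dollars of the total stated on the slide.
all_dollars = top_dollars + rest_dollars

top_grant_share = top_grants / total_grants
top_dollar_share = top_dollars / all_dollars
rest_dollar_share = rest_dollars / all_dollars

print(f"largest {top_grant_share:.0%} of grants: {top_dollar_share:.1%} of dollars")
print(f"smallest {1 - top_grant_share:.0%} of grants: {rest_dollar_share:.1%} of dollars")
```

Under these figures, about a fifth of the grants account for roughly 56% of the dollars and the remaining four fifths for roughly 44%, which is the sense in which "the small are big".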


Where is your data now?

Is it working or on the dole?

What is a data scientist anyway and what should they be doing?

Data Scientist Job

• Find Dark/Hidden data – Or better: do not lose it in the first place

• Evaluate usefulness – Or better: provide usage metrics

• Organize (metadata and format) – Or better: allow multiple views

• Collect into sufficient mass to be visible – Normalize, Interlink and Integrate

• Provide access – Or better: Enable new research

• Preserve – Or better: Make it indispensable

Apps:Heidorn


• Because it is high volume

• Because it is information rich – high entropy

• While the needs of large data are understood, those of small data and integration are not

Heidorn, P. Bryan (2008). Shedding Light on the Dark Data in the Long Tail of Science. Library Trends 57(2), Fall 2008. Institutional Repositories: Current State and Future. Edited by Sarah Shreeves and Melissa Cragin. http://hdl.handle.net/2142/9127

Small data is big science

Where to find dark data in biodiversity?

• Literature/Biodiversity Heritage Library

• Museum Specimens

• Field notes

• (Un)Experimental data sets

• Citizen Observations


What is dark data good for?

• Ecological Niche Modeling

• Climate Change niche change prediction

• Taxonomic Name Resolution

• Literature Search Support

– Taxonomic intelligence

– Key‐like – character searching

• Phenology and Phenology change

• Food‐web / trophic level

Cyberinfrastructure Needs

• Collection

• Storage

• Access

• Processing

• Communication

• Training

• Institutions


Institutionalization of e‐research

The tools of science and the business of science need to be altered to seamlessly integrate good data curation into professional practice.

We need professional training, social institutions, scholarly communication and stable funding to support and promote data‐ and computation‐enabled research.


Repositories


http://gavinwedell.com/doodles/

Lab Notebook


Sensors

Statistical Packages


IBM en:System/360 Model 65

This photo was taken by Mike Ross of corestore.org


Lab Notebook

Sensors

Statistical Packages


Data Repurposing

From: To stand the test of time: Long‐term stewardship of digital data sets in science and engineering. Sept 26‐27, 2006, Arlington, VA


A number of projects working in this direction


The iPlant Collaborative

Cyberinfrastructure to Support the Challenges of Modern Biology

Society for Experimental Biology, Glasgow, UK

July 3rd, 2011

Dan Stanzione

Co-PI and Cyberinfrastructure Lead, iPlant Collaborative

Deputy Director, Texas Advanced Computing Center


What is iPlant?

• iPlant’s mission is to build the CI to support plant biology’s Grand Challenge solutions

• Grand Challenges were not defined in advance, but identified through engagement with the community

• A virtual organization with Grand Challenge teams relying on national cyberinfrastructure

• Long term focus on sustainable food supply, climate change, biofuels, ecological stability, etc.

• Hundreds of participants globally… Working group members at >50 US institutions, USDA, DOE, etc.

Brief History

• Funding by NSF – February 1st, 2008

• iPlant Kickoff Conference at CSHL – April 2008

o ~200 participants

• Grand Challenge Workshops – Sept-Dec 2008

• CI workshop – Jan 2009

• Grand Challenge White Paper Review – March 2009

• Project Recommendations – March 2009

• Project Kickoffs – May 2009 & August 2009

• Start of software development – September 2009

• First prototypes to public – April 2010

• First release with user-driven tool integration – July 2011


iPlant’s Central Challenge

• To define what it means to build a lasting, community driven Cyberinfrastructure for the Grand Challenges of Plant Science, to get community buy‐in of this vision, and to execute this vision.

Steve Goff, PI
U of Arizona

Dan Stanzione, co‐PI
Texas Advanced Computing Center

National Science Board
Update on Award Progress: DBI‐0735191

Directorate for Biological Sciences
July 2011


What iPlant Offers

Grand Challenges in Plant Science

• Genotype‐to‐Phenotype

– To understand how DNA blueprints produce a plant’s characteristic traits and functions and to predict how traits change in response to complex environments

– Requires ability to collect, query, interpret, and model high‐throughput, genome‐scale data sets

• Tree of Life

– To understand evolutionary relationships among green plants

– Requires ability to create, display, and query information in very large phylogenetic trees


Taxonomic Name Resolution Service

The Biodiversity Heritage Library: 34 million pages now

Example item: Collection of Australian birds' eggs and nests in the possession of D. Le Souef, Director, Zoological Gardens, Melbourne, 1900?

Long Citation Half‐life

Critical use for Taxonomy

Ecology and Environmental History

Naming for genomics and metagenomics


Mobilizing Data Locked on Paper

• Fine‐Grained Semantic Markup of Descriptive Data for Knowledge Applications in Biodiversity Domains. Hong Cui (Principal Investigator)

• The University of Arizona is awarded a grant to develop and evaluate a set of algorithms/software to help computers to read and “understand” taxonomic descriptions of plants, animals, and other living or fossil organisms. The major functions of the algorithms/software include 1) annotate large sets of text descriptions in a machine‐readable way to support various knowledge applications, including producing character matrices and identification keys for various taxon groups.

Semantic Markup System


Data Interlinking


The World Mind (Ver0.01b)

BiSciCol: Tracking Biodiversity Objects to Brokering Standards

“Or, Gustav’s Big Problem”

John Deck, University of California, Berkeley
Brian Stucky, University of Colorado, Boulder
Lukasz Ziemba, University of Florida, Gainesville
Nico Cellinese, University of Florida, Gainesville
Rob Guralnick, University of Colorado, Boulder

BiSciCol Team: Reed Beaman, Nico Cellinese, Jonathan Coddington, Neil Davies, John Deck, Rob Guralnick, P. Bryan Heidorn, Chris Meyer, Tom Orrell, Rich Pyle, Kate Rachwal, Brian Stucky, Rob Whitton, Lukasz Ziemba


Biological Science Collections Tracker: working towards building an infrastructure designed to tag and track scientific collections and all of their derivatives.

• National Science Foundation funded 2010 – 2014

• Partners are University of Florida at Gainesville, University of Colorado at Boulder, Bishop Museum, University of California at Berkeley, Smithsonian Institution, University of Arizona at Tucson

• Relies on globally unique identifiers (GUIDs) to track objects

• Implements a Linked Data approach

• Provides support for the Global Names Architecture
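The idea of tracking objects and their derivatives by GUID can be sketched in a few lines. This is a toy illustration, not BiSciCol's implementation: the `ObjectGraph` class, its method names and the example labels are all hypothetical, and random UUIDs stand in for whatever identifier scheme a real system would use.

```python
import uuid
from collections import defaultdict, deque


class ObjectGraph:
    """Toy registry: every object gets a GUID; edges record derivation."""

    def __init__(self):
        self.labels = {}                   # GUID -> human-readable label
        self.derived = defaultdict(list)   # parent GUID -> child GUIDs

    def register(self, label, parent=None):
        """Mint a GUID for a new object, optionally derived from a parent."""
        guid = str(uuid.uuid4())
        self.labels[guid] = label
        if parent is not None:
            self.derived[parent].append(guid)
        return guid

    def derivatives(self, guid):
        """Return all objects derived, directly or indirectly, from guid."""
        found, queue = [], deque([guid])
        while queue:
            for child in self.derived[queue.popleft()]:
                found.append(child)
                queue.append(child)
        return found


# Hypothetical chain: a specimen yields a tissue sample, which yields a sequence.
graph = ObjectGraph()
specimen = graph.register("specimen")
tissue = graph.register("tissue sample", parent=specimen)
sequence = graph.register("gene sequence", parent=tissue)

names = [graph.labels[g] for g in graph.derivatives(specimen)]
print(names)
```

Because each link is stored against an identifier rather than a database row, the same traversal would work when the specimen, tissue and sequence records live with different providers, which is the point of the Linked Data approach.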

Biological Science Collections (BiSciCol) Tracker

[Figure: S1: KNM (Nairobi National Museum), S2: MNHN (Muséum national d'histoire naturelle) and S3: MBG (Living Collection: Missouri Botanical Garden), shown with Determination, Gene Sequence (GENBANK), Parasitism and Agave sisalana nodes; the links between them are marked “?”.]


From “Facebook Visualizer”

Tracking Facebook relationships …

Can we track relationships for Biological Objects as well?


Why? Here is Gustav’s Problem….

(Prefers to collect stuff)

Lots of Data ….

Generates …

Due to project requirements and integration needs, Gustav is left navigating a plethora of redundant and disconnected distributed databases. Lots of effort to track objects and their derivatives.

Can we borrow from Facebook and social networking to help solve Gustav’s Problem?


A Biological Relationship Graph …

[Figure: graph of Specimens, Tissues and Sequences, with a Taxonomic Type Filter, a Class Filter and Functions, including Infer Relationships Across Providers.]

Mobilizing data in museums


NSF: Advanced Digitization of Biological Collections

• iDigBio: The National Resource for Advancing Digitization of Biological Collections

Organization

• National Hub (~$7.5M)

– Title: A Collections Digitization Framework for the 21st Century

– PI: Lawrence Page, University of Florida

• Thematic Hub (~$2M each)

– Title: InvertNet: An Integrative Platform for Research on Environmental Change, Species Discovery and Identification

• PI: Christopher Dietrich, University of Illinois, Urbana‐Champaign

– Title: Plants, Herbivores and Parasitoids: A Model System for the Study of Tri‐Trophic Associations

• PI: Randall T. Schuh, American Museum of Natural History

– Title: North American Lichens and Bryophytes: Sensitive Indicators of Environmental Quality and Change

• PI: Corinna Gries, University of Wisconsin, Madison


Example of Virtual Community in NanoTechnology

Citizen Science

• Need for standardization

• Validation

• Feedback

• Use

• Like non‐citizen science?


Agile Science

• Disaster: RAPID: Gulf Coast Oil Spill Biodiversity Tracker. A Volunteer‐based Observation Network. Steven Kelling (Principal Investigator)

• RAPID: Enhancement of Fishnet2 for Disaster Impact Assessment. Henry Bart (Principal Investigator)


http://ebird.org/tools/oilspill/


New Validation Models

• Filtered Push: Continuous Quality Control for Distributed Collections and Other Species‐Occurrence Data. James Macklin (Principal Investigator), Bertram Ludaescher (Co‐Principal Investigator)

• A networked solution to enable annotation of distributed biological collection data and to share assertions about their quality or usability.

Improved collection management

• Collaborative Biodiversity Collections Computing. James Beach (Principal Investigator)

http://digbiocol.wordpress.com/


Map of Life

Co‐PIs: Walter Jetz (Yale), Rob Guralnick (CU Boulder)

An infrastructure for integrating and advancing global species distribution knowledge

Advancing species distribution knowledge

[Figure: available data sets by scale (grain), from World through 200 km, 50 km, 1 km and 100 m down to 1 m.
Topography: 1996: GTOPO30; 2009: SRTM V4.
Landcover, current: 2003: GLC 2000; 2009: GlobCover.
Landcover, future: 1992: BIOME; 2001: Image 2.2; regional models.
Species distributions (vertebrates): 2006: WWF; 2005‐9: expert maps; atlas data, surveys. The finer scales are marked “?” and labelled Knowledge Gap.]

Hurlbert and Jetz (PNAS 2007); Jetz et al. (Conservation Biology 2008)


Cougar


John Wieczorek, Museum of Vertebrate Zoology at Berkeley

• How has VertNet built a large community of users?

• What technology/portal components were important to the community development?

• How did success lead to VertNet in its current incarnation?


The Vertebrate Networks (2011)

Est. 1999, 2004

31 collections

Est. 2001

41 collections

Est. 2004

48 collections

Est. 2002

56 collections

176 active collections

76 participating institutions





Collaborations

Growth of the Vertebrate Networks





“Participation ‐ Peers”

[Diagram labels: Portal; Provider (three); Collection Database (three); Public Database (two).]


Kenya National Museum Library


VertNet

New Architecture

Why do people share

• Credit for their work

• “Free” enhanced user statistics

• Easier to share than not to share


DataNet

The DataONE R Client Package is an R package that provides access to the DataONE services that are present at Coordinating Nodes and Member Nodes.


SQLShare: Database‐as‐a‐Service for Researchers

SQLShare is a database service aimed at removing the obstacles to using relational databases: installation, configuration, schema design, tuning, data ingest, and even application design. Our goal for SQLShare is to make available a cloud‐based system where you can simply upload your data and immediately start querying it.
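The "upload your data and immediately start querying it" workflow can be imitated locally to show what it saves the researcher. The sketch below does not use SQLShare or its API: it uses Python's built-in `sqlite3` and `csv` modules, and the table name `upload` and the sample observations are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical observation data standing in for an uploaded CSV file.
uploaded = io.StringIO(
    "species,site,count\n"
    "Agave sisalana,A,3\n"
    "Agave sisalana,B,5\n"
    "Puma concolor,A,1\n"
)

rows = list(csv.reader(uploaded))
header, records = rows[0], rows[1:]

# No schema design: the table is created directly from the header row,
# with untyped columns named after the CSV fields.
conn = sqlite3.connect(":memory:")
columns = ", ".join(f'"{name}"' for name in header)
conn.execute(f"CREATE TABLE upload ({columns})")
placeholders = ", ".join("?" for _ in header)
conn.executemany(f"INSERT INTO upload VALUES ({placeholders})", records)

# The data can be queried as soon as it is loaded.
totals = conn.execute(
    'SELECT species, SUM(CAST("count" AS INTEGER)) FROM upload '
    "GROUP BY species ORDER BY species"
).fetchall()
print(totals)
```

The design choice illustrated is that the schema is inferred from the file rather than designed up front, so the only step between a spreadsheet export and a SQL query is the upload itself.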


Why Libraries

• Long history of scholarly data management

• Skills overlap, such as development of metadata standards, ontologies, controlled vocabularies, thesauri

• Long‐lived institutions

• Overlap with museums and archives

New Information Disciplines

• Digital Curator: an expert knowledgeable of and with responsibility for the content of a digital collection(s)

• Digital Archivist: an expert competent to appraise, acquire, authenticate, preserve, and provide access to records in digital form

• Data Scientists: the information and computer scientists, database and software engineers and programmers, disciplinary experts, expert annotators, and others, who are crucial to the successful management of a digital data collection

(Long‐Lived Digital Data Collections: Enabling Research and Education in the 21st Century, report of the National Science Board, September 2005)


Library Roles

Library Skills


Thank you
