science data repositories (sdrs) on the web: an initial survey

SCIENCE DATA REPOSITORIES (SDRs) ON THE WEB: AN INITIAL SURVEY LAURA MARCIAL [email protected] BRAD HEMMINGER 5 March 2010


Page 1: SCIENCE DATA REPOSITORIES (SDRs)  ON THE WEB:   AN INITIAL SURVEY

SCIENCE DATA REPOSITORIES (SDRs)

ON THE WEB: AN INITIAL SURVEY

LAURA MARCIAL

BRAD HEMMINGER

5 March 2010

Page 2: SCIENCE DATA REPOSITORIES (SDRs)  ON THE WEB:   AN INITIAL SURVEY

OBJECTIVES

Rationale

Study

Discussion

Future Work

Page 3: SCIENCE DATA REPOSITORIES (SDRs)  ON THE WEB:   AN INITIAL SURVEY
Page 4: SCIENCE DATA REPOSITORIES (SDRs)  ON THE WEB:   AN INITIAL SURVEY

RATIONALE

“Digital data collections are powerful catalysts for progress and for democratization of the research and the enterprise.”

National Science Board [NSB] Report 2005c

Page 5: SCIENCE DATA REPOSITORIES (SDRs)  ON THE WEB:   AN INITIAL SURVEY

ORIGINS

DDBJ/EMBL/GenBank Database Growth

Total sequencing contributions: 12,000,000,000 base pairs

Page 6: SCIENCE DATA REPOSITORIES (SDRs)  ON THE WEB:   AN INITIAL SURVEY

ORIGINS

Data from the Department of Energy’s Joint Genome Institute (JGI)

JGI generates on the order of 2.3 gigabases of sequence per month, or 1 terabyte of data per month

Page 7: SCIENCE DATA REPOSITORIES (SDRs)  ON THE WEB:   AN INITIAL SURVEY

ORIGINS

NOAA: Large-array data growth expected over 15 years. Current estimates predict data archive growth to more than 160,000 TB by 2020. http://www.ngdc.noaa.gov/noaa_pubs/pdf/NOAA_DataManagementReport_Final.pdf

Page 8: SCIENCE DATA REPOSITORIES (SDRs)  ON THE WEB:   AN INITIAL SURVEY

ORIGINS

Hubble Space Telescope: generates 10 gigabytes of data per day

Page 9: SCIENCE DATA REPOSITORIES (SDRs)  ON THE WEB:   AN INITIAL SURVEY

ORIGINS

Coastal Data Monitoring

Page 10: SCIENCE DATA REPOSITORIES (SDRs)  ON THE WEB:   AN INITIAL SURVEY

ORIGINS

Functional Magnetic Resonance Imaging (fMRI)

Amygdala activation at 3T in response to human and avatar facial expressions of emotions. http://evolution.anthro.univie.ac.at/institutes/urbanethology/projects/simulation/fmri/index.html

Page 11: SCIENCE DATA REPOSITORIES (SDRs)  ON THE WEB:   AN INITIAL SURVEY

ORIGINS

So, what is happening with all of the Closed Circuit TV (CCTV) data generated every day?

Page 12: SCIENCE DATA REPOSITORIES (SDRs)  ON THE WEB:   AN INITIAL SURVEY

Clearly, we are entering the yottabyte (YB) era: 1,000,000,000,000,000,000,000,000 (one septillion) bytes
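To put the figures quoted on the ORIGINS slides on a common footing, the short sketch below converts them to bytes and expresses each as a fraction of a yottabyte. It is illustrative arithmetic only: decimal (SI) units and a 365-day year are assumptions, since the slides specify neither, and the variable names are invented for this example.

```python
# Assumptions (not stated on the slides): decimal SI units and a 365-day year.
GB = 10**9
TB = 10**12
YB = 10**24  # one septillion bytes

hubble_bytes_per_year = 10 * GB * 365   # Hubble: 10 GB per day
jgi_bytes_per_year = 1 * TB * 12        # JGI: about 1 TB per month
noaa_archive_2020 = 160_000 * TB        # NOAA: projected archive size by 2020


def as_fraction_of_yottabyte(n_bytes):
    """Express a byte count as a fraction of one yottabyte."""
    return n_bytes / YB


for label, n_bytes in [
    ("Hubble, one year", hubble_bytes_per_year),
    ("JGI, one year", jgi_bytes_per_year),
    ("NOAA archive, 2020 projection", noaa_archive_2020),
]:
    print(f"{label}: {n_bytes / TB:,.2f} TB = {as_fraction_of_yottabyte(n_bytes):.2e} YB")
```

Integer arithmetic is used for the byte counts so that the conversions stay exact; only the final fraction is a float.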

Page 13: SCIENCE DATA REPOSITORIES (SDRs)  ON THE WEB:   AN INITIAL SURVEY

RATIONALE

At least many thousands of SDRs

Often start as government projects

Keys to success are elusive

Highly heterogeneous

Highly domain specific

Page 14: SCIENCE DATA REPOSITORIES (SDRs)  ON THE WEB:   AN INITIAL SURVEY
Page 15: SCIENCE DATA REPOSITORIES (SDRs)  ON THE WEB:   AN INITIAL SURVEY

“I found it interesting to read your survey results and see what information you inferred about the KNB. It points out areas that we need to improve upon in terms of communication from our web presence.”

--Matt Jones, Knowledge Network for Biocomplexity

STUDY

Page 16: SCIENCE DATA REPOSITORIES (SDRs)  ON THE WEB:   AN INITIAL SURVEY

GOALS

Inventory a convenience sample of 100 SDRs

Identify major characteristics

Examine commonalities

Look for trends over time

Identify characteristics that correlate with success

Page 17: SCIENCE DATA REPOSITORIES (SDRs)  ON THE WEB:   AN INITIAL SURVEY

TIMELINE

2007 2008 2009 2010

In 2007-2008, identified 100 SDRs through Google searches

Initial review was done to refine salient characteristics

In 2009, site profiles were sent to site administrators for review and comment

50 characteristics were captured, 17 of which were analyzed using cluster analysis

Page 18: SCIENCE DATA REPOSITORIES (SDRs)  ON THE WEB:   AN INITIAL SURVEY

CHARACTERISTICS of the 50

GENERAL: Scientific Domain; Research, Community or Reference; Holding Size Information

BUSINESS: Governmentally based; Business Type; Memberships or Subscriptions

DATA DETAILS: Deposits and Access; Representation; Ingest Methods; Metadata; Preservation; Additional Services; Usage Statistics

Page 19: SCIENCE DATA REPOSITORIES (SDRs)  ON THE WEB:   AN INITIAL SURVEY

ANALYSIS

The 17 characteristics suitable for analysis and their data type

# | Characteristic | Type
1 | Natural Science | Binary
2 | Science Area | Nominal
3 | Virtual | Binary
4 | Holding Size | Ordinal
5 | Research/Community/Reference | Nominal
6 | Centralized/Distributed | Binary
7 | Instrument Based | Binary
8 | Business Type | Nominal
9 | Subscription or Membership | Binary
10 | How Based | Nominal
11 | Multi-Sponsored | Binary
12 | Grants & Contracts | Binary
13 | Accept Submitted Data | Binary
14 | Registration Required | Ordinal
15 | Free in the Public Domain | Ordinal
16 | Preservation Policy | Binary
17 | Portal | Binary
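Because the 17 characteristics mix binary, nominal and ordinal types, any cluster analysis first needs a dissimilarity that can handle all three. The slides do not say which measure or linkage was used, so the following is only a minimal sketch of one common option, a Gower-style dissimilarity. The field names, the ordinal coding and the three sample records are all hypothetical.

```python
from itertools import combinations

# Hypothetical schema: (field name, type, number of levels for ordinal fields).
SCHEMA = [
    ("natural_science", "binary", None),
    ("business_type", "nominal", None),
    ("holding_size", "ordinal", 3),   # assumed coding: 0=small, 1=moderate, 2=large
    ("preservation_policy", "binary", None),
]


def field_distance(kind, levels, a, b):
    """Per-field dissimilarity in the range [0, 1]."""
    if kind in ("binary", "nominal"):
        return 0.0 if a == b else 1.0        # simple matching
    if kind == "ordinal":
        return abs(a - b) / (levels - 1)     # rank difference scaled by the range
    raise ValueError(f"unknown field type: {kind}")


def gower(rec_a, rec_b):
    """Mean of the per-field dissimilarities across the schema."""
    total = sum(
        field_distance(kind, levels, rec_a[name], rec_b[name])
        for name, kind, levels in SCHEMA
    )
    return total / len(SCHEMA)


# Invented records, for illustration only (not taken from the survey data).
repositories = {
    "repo_gov": {"natural_science": 1, "business_type": "federal",
                 "holding_size": 2, "preservation_policy": 1},
    "repo_univ": {"natural_science": 1, "business_type": "university",
                  "holding_size": 1, "preservation_policy": 1},
    "repo_comm": {"natural_science": 1, "business_type": "partnership",
                  "holding_size": 1, "preservation_policy": 0},
}

pairwise = {
    (a, b): gower(repositories[a], repositories[b])
    for a, b in combinations(repositories, 2)
}
closest_pair = min(pairwise, key=pairwise.get)
print(pairwise)
print("closest pair:", closest_pair)
```

A matrix of such pairwise values could then be passed to a hierarchical clustering routine; which routine and linkage the study used is not stated in the slides.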

Page 20: SCIENCE DATA REPOSITORIES (SDRs)  ON THE WEB:   AN INITIAL SURVEY

CLUSTER RESULTS

[Figure: dendrogram of the 100 observations (OB1 through OB100). Axes: Semi-Partial R-Squared (0.00 to 0.30) and Name of Observation or Cluster.]

Cluster labels on the figure: Cluster A (N=40), Cluster B (N=18), Cluster C (N=27), Cluster D (N=15); Cluster 1 (N=40), Cluster 2 (N=60)

Page 21: SCIENCE DATA REPOSITORIES (SDRs)  ON THE WEB:   AN INITIAL SURVEY

CLUSTER A

Agency for Healthcare Quality and Research
Multimission Archive at STScI (MAST)
Alternative Fuels Data Center (AFDC)
NASA Langley Atmospheric Science Data Center
Atlantic Oceanographic and Meteorological Laboratory (AOML) Environmental Data Server or ENVIDS
NASA/IPAC Infrared Science Archive (IRSA)
Atmospheric Radiation Monitoring (ARM) Data Centers
National Ecological Observatory Network (NEON)
Carbon Dioxide Information Analysis Center (CDIAC)
National Nuclear Data Center Nuclear Data Portal
Centers for Disease Control and Prevention Data and Statistics
National Space Science Data Center
Climate and Environmental Retrieval and Archive (CERA) for the WDCC
Natural Resource and GIS Metadata and Data Store of the National Park Service
Chandra data archive
Oak Ridge National Laboratory Distributed Active Archive Center (ORNL DAAC)
Comprehensive Epidemiological Data Resource (CEDR)
Planetary Data System (PDS)
Controlled Fusion Atomic Data Center (CFADC)
Renewable Resource Data Center (RReDC)
DNA Data Bank of Japan (DDBJ)
Solar Data Analysis Center (SDAC) at NASA Goddard Space Flight Center
DOE Joint Genome Institute's (JGI) Genome Web Portal
SkyView
DOE's Energy Information Administration (EIA)
Smithsonian Tropical Research Institute's (STRI) Center for Tropical Forest Science (CTFS)
European Southern Observatory (ESO) Archive Facility
U.S. Transuranium and Uranium Registries (USTUR)
Genbank
United States Census Bureau
Geodata.gov
US National Virtual Observatory (NVO)
NASA’s High Energy Astrophysics Science Archive Research Center (HEASARC)
US Transplant -- Scientific Registry of Transplant Recipients
HubbleSite Gallery
Visible Human Project®
NOAA's Integrated Coral Observing Network (ICON)
World Data Center (WDC)
Integrated Monitoring Network
World Data Center (WDC) for Biodiversity and Ecology

Page 22: SCIENCE DATA REPOSITORIES (SDRs)  ON THE WEB:   AN INITIAL SURVEY

CLUSTER B

BioSystematic Database of World Diptera (BDWD)
CalSurv, the California Vectorborne Disease Surveillance System
Ecological Society of America's Ecological Archives
European Molecular Biology Laboratory - European Bioinformatics Institute or EMBL-EBI
Encyclopedia of Astronomy and Astrophysics
Ensembl
International Council for Science: Committee on Data for Science and Technology
Iubio
J. Craig Venter Institute
Jaspar
Journal of Applied Econometrics (JAE) Data Archive
National Center for Ecological Analysis and Synthesis (NCEAS) Data Repository
NC One Map
Spec Patterns
The BioGRID
The Sanger Institute

Page 23: SCIENCE DATA REPOSITORIES (SDRs)  ON THE WEB:   AN INITIAL SURVEY

CLUSTER C

ACE Science Center (ASC)
Antarctic Glaciological Data Center (AGDC)
Astronomy Digital Image Library
Brain biodiversity bank at Michigan State University
Bugwood Network
Center for International Earth Science Information Network (CIESIN)
Chesapeake Bay Environmental Observatory (CBEO) Portal
Coastal Data Information Program (CDIP) of the Scripps Institution of Oceanography, University of California at San Diego
Cornell University Geospatial Information Repository
Forestry Images
Henry A. Murray Research Archive (MRA)
IAU Minor Planet Center
Inter-university Consortium for Political and Social Research (ICPSR)
IQSS Dataverse network
LTER Network
McIDAS
Melanoma Molecular Map Project
Repository for Archiving, Managing and Accessing Diverse Data (RAMADDA)
Socioeconomic Data and Applications Center (SEDAC)
Space Science and Engineering Center (SSEC) Data Center, University of Wisconsin-Madison
The Howard W. Odum Institute for Research in Social Science
The USA National Phenology Network (USA-NPN)
Thematic Realtime Environmental Distributed Data Services (THREDDS) Data Server
Unidata Program at the University Corporation for Atmospheric Research (UCAR)
University of California Santa Cruz Genome Bioinformatics
Woods Hole Oceanographic Institute Data Center
World Data Center for Human Interactions in the Environment

Page 24: SCIENCE DATA REPOSITORIES (SDRs)  ON THE WEB:   AN INITIAL SURVEY

CLUSTER D

Amphibian Ark Team Portal
Discover Life in America's Great Smoky Mountains National Park's All Taxa Biodiversity Inventory
Encyclopedia of Life
fMRI Data Center
Global Biodiversity Information Facility
Knowledge Network for Biocomplexity (KNB)
Mouse Genome Informatics
NEEScentral
Netlib
Ocean Biogeographic Information System (OBIS)
Paleobiology Database
PANGAEA® - Publishing Network for Geoscientific and Environmental Data
Tree of Life Web Project
Treebase, Treebase2
VegBank, a vegetation plot database

Page 25: SCIENCE DATA REPOSITORIES (SDRs)  ON THE WEB:   AN INITIAL SURVEY

LOGISTIC REGRESSION

[Figure: bar chart, axis scaled 0 to 10. Relative contribution of variables (measured using simple logistic regression Wald Chi-Square/df).]

Variables, in the order they are labeled on the chart:

Free in the Public Domain
Business Type
Scientific Area
Subscription Membership
Natural Science
Portal
Research/Community/Reference***
Instrument Based
Centralized/Decentralized
Accept Submitted Data
How Based**
Registration Required
Virtually Based
Preservation Policy
Holding Size*
Multiple Sponsors
Grants Contracts
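For a single binary characteristic and a two-group outcome, the Wald chi-square from a simple logistic regression has a closed form: the estimated slope is the log odds ratio of the 2x2 table, and its asymptotic standard error is the square root of the summed reciprocal cell counts. The sketch below illustrates that special case only. The slides do not say which outcome was modeled or how multi-level variables were coded, so the two-group framing, the function name and the counts are assumptions made for illustration.

```python
import math


def wald_chi_square_2x2(a, b, c, d):
    """Wald chi-square (df = 1) for the slope of a logistic regression of a
    binary outcome on one binary predictor.

    a: predictor = 1, outcome = 1    b: predictor = 1, outcome = 0
    c: predictor = 0, outcome = 1    d: predictor = 0, outcome = 0
    All four cells must be positive, otherwise the estimate is undefined.
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cell counts must be positive")
    beta = math.log((a * d) / (b * c))               # MLE of the slope = log odds ratio
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)    # asymptotic standard error
    return (beta / se) ** 2


# Hypothetical counts for 100 repositories (not taken from the survey data).
statistic = wald_chi_square_2x2(a=30, b=10, c=10, d=50)
df = 1  # one binary predictor
print(f"Wald chi-square / df = {statistic / df:.2f}")
```

Dividing by the degrees of freedom matters for nominal characteristics with more than two levels, where df is the number of levels minus one; that case needs a fitted model rather than this closed form.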

Page 26: SCIENCE DATA REPOSITORIES (SDRs)  ON THE WEB:   AN INITIAL SURVEY

GROUP COMPOSITION

Variables | Cluster A: ‘Governmental’ | Cluster B: ‘Medicine/Small’ | Cluster C: ‘University’ | Cluster D: Community ‘Biology’
Grants Contracts | No | Mixed | Yes | Yes
Multiple Sponsors | No | Yes | Yes | Mixed
Holding Size | Large | Small | Mixed | Moderate
Preservation Policy | Yes | Mixed | Yes | No
Virtually Based | No | Mixed | No | No
Registration Required | No | No | Mixed | No
How Based | Governmental | Mixed | University | Mixed
Accept Submitted Data | Mixed | Yes | Yes | Yes
Centralized/Distributed | Mixed | Mixed | Mixed | Distributed
Instrument Based | Mixed | No | No | No
Res/Com/Ref | Research | Mixed | Research | Community
Portal | Mixed | No | Mixed | Mixed
Natural Science | Yes | Yes | Yes | Yes
Subscription Membership | No | No | No | No
Scientific Area | Mixed | Medicine | Mixed | Biology
Business Type | Federal Center | Mixed | University | Partnership
Free in the Public Domain | Yes | Yes | Yes | Yes

Page 27: SCIENCE DATA REPOSITORIES (SDRs)  ON THE WEB:   AN INITIAL SURVEY

CLUSTER RESULTS

[Figure: dendrogram of the 100 observations (OB1 through OB100). Axes: Semi-Partial R-Squared (0.00 to 0.30) and Name of Observation or Cluster.]

Cluster labels on the figure: Cluster A (N=40), Cluster B (N=18), Cluster C (N=27), Cluster D (N=15); Cluster 1 (N=40), Cluster 2 (N=60)

Page 28: SCIENCE DATA REPOSITORIES (SDRs)  ON THE WEB:   AN INITIAL SURVEY

ALTERNATIVE SAMPLING

If we are all about studying success and SUCCESS = performance over time

How can we study SDRs over time?

Page 29: SCIENCE DATA REPOSITORIES (SDRs)  ON THE WEB:   AN INITIAL SURVEY

WAYBACK MACHINE
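One way to look at a repository's web presence at earlier dates is to ask the Internet Archive for the snapshot closest to a given date. The sketch below builds such a query against the Wayback Machine availability endpoint and parses the JSON reply. It is an illustration under assumptions, not the procedure used in the study: the helper names are invented, `example.org` is a placeholder site, and the sample payload is a hand-written stand-in for a real response.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(site, timestamp):
    """Build a query for the snapshot closest to `timestamp` (YYYYMMDD)."""
    return AVAILABILITY_ENDPOINT + "?" + urlencode({"url": site, "timestamp": timestamp})


def closest_snapshot(payload):
    """Return (timestamp, url) of the closest snapshot, or None if there is none."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["timestamp"], closest["url"]


def fetch_closest_snapshot(site, timestamp):
    """Network call: query the endpoint and parse its reply (not run here)."""
    with urlopen(availability_url(site, timestamp), timeout=30) as response:
        return closest_snapshot(json.load(response))


# Hand-written sample payload standing in for a real response.
sample_payload = {
    "archived_snapshots": {
        "closest": {
            "available": True,
            "status": "200",
            "timestamp": "20080102030405",
            "url": "http://web.archive.org/web/20080102030405/http://example.org/",
        }
    }
}

print(availability_url("example.org", "20080101"))
print(closest_snapshot(sample_payload))
```

Repeating the query for one date per year would give a series of archived pages per repository that could be profiled with the same characteristics used in the survey.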

Page 30: SCIENCE DATA REPOSITORIES (SDRs)  ON THE WEB:   AN INITIAL SURVEY
Page 31: SCIENCE DATA REPOSITORIES (SDRs)  ON THE WEB:   AN INITIAL SURVEY

DISCUSSION

“I am interested in your preservation policy line. We don't have a policy explicitly listed, though we do hope and aim to make the data permanently preserved. Could you provide me with some examples of preservation policies so that we might create one?”

--Michael Lee, VegBank

Page 32: SCIENCE DATA REPOSITORIES (SDRs)  ON THE WEB:   AN INITIAL SURVEY

PRESERVATION

Although preservation ranked as the fourth most important variable (taken independently) in defining group membership, what we did not find was at least as important as what we did.

Preservation Policy = any mention of long term data storage

Page 33: SCIENCE DATA REPOSITORIES (SDRs)  ON THE WEB:   AN INITIAL SURVEY

DATANET

Page 34: SCIENCE DATA REPOSITORIES (SDRs)  ON THE WEB:   AN INITIAL SURVEY

EVOLUTION/ECOLOGY

Emerging environments (observed) pattern: Idea, Funding, Scope, Contributors, Technology + Services, Service Offering, Policies + Structure, Business Strategy

Mature environments pattern: Idea + Business Strategy, Funding, Scope, Contributors, Technology + Services, Service Offering, Policies + Structure, Evaluation

Page 35: SCIENCE DATA REPOSITORIES (SDRs)  ON THE WEB:   AN INITIAL SURVEY

Research: products of one or more focused research projects and typically contain data that are subject to limited processing or curation. These collections are generally small and/or project specific.

Community data collections: serve a single science or engineering community. They are generally intermediate in size and supported in a somewhat more distributed fashion by the community served.

Reference data collections: serve large segments of the scientific and education community. These are generally broad and/or multidisciplinary as well as long lived.

A TYPOLOGY

Page 36: SCIENCE DATA REPOSITORIES (SDRs)  ON THE WEB:   AN INITIAL SURVEY

FRAMEWORK

Grants and Contracts

Multiple Sponsors

Holding Size

Preservation Policy

How Based

Business Type

Scientific Area

Page 37: SCIENCE DATA REPOSITORIES (SDRs)  ON THE WEB:   AN INITIAL SURVEY

ENVIRONMENT

Characteristic | Institutional Repository | Science Data Repository

Holdings Management | IRs have a high degree of similarity in terms of management of holdings. | SDRs are dissimilar to each other in terms of holdings and are often highly domain specific.

Handling Procedures | Homogeneity of handling procedures both within and among repositories (DRIVER, 2008) | Heterogeneity of handling procedures, perhaps necessary given the degree of specialization within a domain, but often seemingly due to lack of standardization.

Base | Institutionally based (DRIVER, 2008) | Typically domain based, though increasingly cross-cutting, which makes the call for standardization more critical.

Page 38: SCIENCE DATA REPOSITORIES (SDRs)  ON THE WEB:   AN INITIAL SURVEY

SUCCESS

Characteristics of Success/Group Composition

DRIVER (2008): Business of digital repositories, Stimuli for depositing materials into repositories, intellectual property rights, Data curation, and Long-term preservation

SDRs: Grants Contracts, Multiple Sponsors, Holding Size, Preservation Policy

Page 39: SCIENCE DATA REPOSITORIES (SDRs)  ON THE WEB:   AN INITIAL SURVEY
Page 40: SCIENCE DATA REPOSITORIES (SDRs)  ON THE WEB:   AN INITIAL SURVEY

FUTURE WORK

“The format of this form forces us to pigeonhole ourselves in a way that is not accurate or useful. Sorry I can't be of more help.”

--Matthew LaPoint, J. Craig Venter Institute

Page 41: SCIENCE DATA REPOSITORIES (SDRs)  ON THE WEB:   AN INITIAL SURVEY

FUTURE WORK

Looking back, the key to moving ahead is LONGITUDINAL EVALUATION

Page 42: SCIENCE DATA REPOSITORIES (SDRs)  ON THE WEB:   AN INITIAL SURVEY

GETTING THE WORD OUT

Page 43: SCIENCE DATA REPOSITORIES (SDRs)  ON THE WEB:   AN INITIAL SURVEY

QUESTIONS?

Page 44: SCIENCE DATA REPOSITORIES (SDRs)  ON THE WEB:   AN INITIAL SURVEY

THANK YOU!