Designing Flexible Workflow for Upstream Participation of the Scientific Data Community Robert R. Downs and Robert S. Chen NASA Socioeconomic Data and Applications Center (SEDAC) Center for International Earth Science Information Network (CIESIN) The Earth Institute, Columbia University Prepared for presentation to the IASSIST 2010 Meeting June 3, 2010 Cornell University Ithaca, NY

Designing Flexible Workflow for Upstream Participation of the Scientific Data Community

Robert R. Downs and Robert S. Chen

NASA Socioeconomic Data and Applications Center (SEDAC)

Center for International Earth Science Information Network (CIESIN)

The Earth Institute, Columbia University

Prepared for presentation to the IASSIST 2010 Meeting

June 3, 2010

Cornell University

Ithaca, NY

Scientific Data are at Risk if not Archived

• Replication, comparison, new, and future uses of existing data require scientific data stewardship

– Data must be identifiable, discoverable, accessible, usable, and recoverable

• Data Preservation requires preparation

– Datasets need to be complete, documented, and described, and must contain permissions for their use

• Stewardship of data often decreases after completion of the project that produced the data

– Some data are neglected if not archived soon after creation

Saving Scientific Data For Use By Others

• Scientific data repositories can provide capabilities to submit data for archiving

– Scientist or team member submits data online

• A data submission system could assist data producers in preparing and describing their data for archiving

– Data preparation prior to project completion

• Capabilities for data submission must balance the need for comprehensive information about the data with the practicalities of what data producers are willing and able to provide.

– Easy tools to deposit and describe data

Designing a Data Submission System

• Identify Trusted Repository Requirements for Submission

• Categorize Submission Services

• Define Functions for Submission Services

• Create Workflow for Data Submission and Review

• Model Scientific Data Submission and Workflow

• Review of Successful Submissions

• Recommendations for Submission Services

Identifying Requirements for Submission System

• Reviewed requirements for trustworthy archives and digital repositories in relevant standards and documents

– Consultative Committee for Space Data Systems (CCSDS) (2002) Reference Model for an Open Archival Information System (OAIS). Adopted as ISO 14721:2003

– CCSDS (2004) Producer-Archive Interface Methodology Abstract Standard. Adopted as ISO 20652:2006

– CCSDS. Audit and Certification of Trustworthy Digital Repositories: Draft Recommended Practice. 652.0-R-1 Red Book, Issue 1. (July 2009).

• Initially Utilized TRAC document

– Online Computer Library Center (OCLC) and Center for Research Libraries (CRL) (2007) Trustworthy Repositories Audit & Certification: Criteria and Checklist (TRAC), Version 1.0.

• Identified and categorized pre-ingest requirements from TRAC

– Requirements relevant to submission and workflow prior to ingest

• Identified pre-ingest requirements from 652.0-R-1 (Draft ISO Standard)

– Related and additional submission and pre-ingest workflow requirements

Communication Requirements Identified From TRAC Document*

• A3.5 Repository has policies and procedures to ensure that feedback from producers and users is sought and addressed over time.

• A3.7 Repository commits to transparency and accountability in all actions supporting the operation and management of the repository, especially those that affect the preservation of digital content over time.

• B1.4 Repository’s ingest process verifies each submitted object (i.e., SIP) for completeness and correctness as specified in B1.2.

• B1.6 Repository provides producer/depositor with appropriate responses at predefined points during the ingest processes.

• B1.7 Repository can demonstrate when preservation responsibility is formally accepted for the contents of the submitted data objects (i.e., SIPs).

• B1.8 Repository has contemporaneous records of actions and administration processes that are relevant to preservation (Ingest: content acquisition).

*Source: Online Computer Library Center (OCLC) and Center for Research Libraries (CRL). (2007). Trustworthy Repositories Audit & Certification: Criteria and Checklist (TRAC), Version 1.0. OCLC & CRL. February 2007. Available: http://www.crl.edu/sites/default/files/attachments/pages/trac_0.pdf

Authentication Requirements Identified From Draft Recommended Practice (CCSDS 652.0-R-1)*

3.3.4 The repository shall commit to transparency and accountability in all actions supporting the operation and management of the repository that affect the preservation of digital content over time.

4.1.4 The repository shall have mechanisms to appropriately verify the identity of the Producer of all materials.

4.6.2 The repository shall follow policies and procedures that enable the dissemination of digital objects that are traceable to the originals, with evidence supporting their authenticity.

*Source: Consultative Committee for Space Data Systems (CCSDS) Audit and Certification of Trustworthy Digital Repositories: Draft Recommended Practice. Red Book, Issue 1. 652.0-R-1 (July 2009). Available: http://public.ccsds.org/sites/cwe/rids/Lists/CCSDS%206520R1/NASAUSOverview.aspx

Digital Repository Services for Web-Based Data Submission

• Authentication: Verify identity of data producer or representative for each submission session

• Data Deposit: Gather and deposit data and documentation

• Data Description: Describe data for preservation, discovery, and use

• Submission Agreement: Establish agreement between the producer and repository

• Communication: Confirm submission, request information if needed, and notify upon ingest

• Review and Approval: Review submission information package and approve for ingest

• Transformations: Transform descriptive information and actions into metadata standards for ingest

Source: Downs & Chen (2009) Earth and Space Science Informatics Workshop. http://essi.gsfc.nasa.gov/pdf/Downs.pdf

Workflow for Web-Based Submission of Scientific Data

1. Secure authenticated login by authorized data producer or representative

– Multiple sessions may be needed to assemble submission information

2. Deposit and describe data and documentation files

– Automate and encourage descriptions for each file

3. Describe scientific data set

– Encourage unique title and offer selectable choices when possible

4. Grant permissions for data set

– Offer choices based on data type, organization, and collection

5. Submit Data Set

– Provide capabilities to review and modify entire package before submission

6. Notify Submitter and Archivist that submission was completed

– Email notifications include contact information for subsequent communication

7. Review submission for completeness and correctness

– Apply appraisal criteria for collection to which data set was submitted

– Contact producer regarding questions or need for additional information

8. Approve data set for ingest to digital repository

– Notify submitter that submission has been approved for ingest into digital repository

9. Transform descriptions and actions into metadata for ingest to digital repository

– Descriptive information is converted into XML metadata and ingested into digital repository

Source: Downs & Chen (2009) Earth and Space Science Informatics Workshop. http://essi.gsfc.nasa.gov/pdf/Downs.pdf
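
The nine workflow steps above could be enforced in software as a simple ordering check. The following Python sketch is illustrative only and is not taken from the SEDAC system: the names `Step`, `PREPARATION`, and `Submission` are hypothetical. It assumes that steps 2 to 4 may be completed in any order and revisited across sessions, that step 5 requires all of them, and that steps 6 to 9 occur strictly in sequence. Locking the package after submission is also an assumption.

```python
from enum import Enum


class Step(Enum):
    # Hypothetical names for the nine workflow steps listed above.
    LOGIN = 1
    DEPOSIT_AND_DESCRIBE_FILES = 2
    DESCRIBE_DATA_SET = 3
    GRANT_PERMISSIONS = 4
    SUBMIT = 5
    NOTIFY = 6
    REVIEW = 7
    APPROVE = 8
    TRANSFORM = 9


# Assumption: the producer may work on these in any order, over several sessions.
PREPARATION = {
    Step.DEPOSIT_AND_DESCRIBE_FILES,
    Step.DESCRIBE_DATA_SET,
    Step.GRANT_PERMISSIONS,
}


class Submission:
    def __init__(self):
        self.done = set()  # steps completed at least once
        self.log = []      # ordered record of every completed action

    def complete(self, step):
        if step is Step.LOGIN:
            pass  # logging in again simply starts another session
        elif Step.LOGIN not in self.done:
            raise ValueError("login is required before any other step")
        elif step in PREPARATION:
            # Assumption: the package is locked once it has been submitted.
            if Step.SUBMIT in self.done:
                raise ValueError("package is locked after submission")
        elif step is Step.SUBMIT:
            missing = sorted(s.name for s in PREPARATION - self.done)
            if missing:
                raise ValueError("incomplete package: " + ", ".join(missing))
        else:
            # Steps 6 to 9 each require the step numbered just before them.
            previous = Step(step.value - 1)
            if previous not in self.done:
                raise ValueError(f"{previous.name} must precede {step.name}")
        self.done.add(step)
        self.log.append(step)
```

Keeping an ordered log of completed actions is one possible way to support the record-keeping requirement quoted earlier from TRAC (B1.8).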

Model for Web-Based Data Submission and Workflow

[Diagram with two actors, Data Producer and Data Reviewer, and the following components]

• Authentication: Login for One or More Sessions

• Data Deposit: Provide Files and Descriptions

• Data Description: Describe Data Set

• Submission Agreement: Grant Intellectual Property Rights

• Communication: Notifications and Requests

• Review and Approval: Appraise and Approve Submission Information Package

• Transformation: Transform Values to XML Metadata

• Ingest: Archival Information Package In Digital Repository

Derived from Downs & Chen (2009) Earth and Space Science Informatics Workshop. http://essi.gsfc.nasa.gov/pdf/Downs.pdf
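
The Transformation component of the model converts entered values to XML metadata. A minimal Python sketch of that idea follows, using the standard library `xml.etree.ElementTree`. The function name `to_xml_metadata`, the root element `dataset`, and the field names are hypothetical and do not correspond to any particular metadata standard or to the schema used by the repository.

```python
import xml.etree.ElementTree as ET


def to_xml_metadata(description):
    """Convert a dict of descriptive values into a flat XML string.

    Hypothetical layout: one child element per value, repeated for lists.
    Keys must already be valid XML element names.
    """
    root = ET.Element("dataset")
    for field, value in description.items():
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            # ElementTree escapes special characters such as "&" on output.
            ET.SubElement(root, field).text = str(item)
    return ET.tostring(root, encoding="unicode")
```

For example, `to_xml_metadata({"title": "Water & Land", "topic": ["oceans", "environment"]})` yields one `title` element and two `topic` elements under the `dataset` root.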

Review of Successful Data Submissions

• Resources Reviewed:

– Legacy Data Submission Process

– Forms Used in Legacy Submission Process

– Descriptions of Submitted Data

– Data Collections

– Cyberinfrastructure and physical facilities

– Initial Prototype of Submission System

Support for Successful Submission

• Affordances identified to address challenges for online submission of data:

– Enable Timely Preparation of Submissions

– Facilitate Authentication of Submitter

– Elicit Information to Contact Submitter

– Invite Complete Documentation

– Foster Composition of Data Descriptions

– Provide Choices to Describe Data

– Request Non-Restrictive Permissions

Enable Timely Preparation of Submissions

• Challenge: Data submitted before creation or a long time after creation can be incorrect or incomplete

– Previous asynchronous capabilities enabled assembly of submissions locally prior to submission.

– Submissions prior to completion can result in an addendum to replace missing or incorrect files.

– Submissions long after completion can result in delays for scheduling dissemination.

• Recommendation: Encourage producers to submit data at the time they are created by enabling multiple sessions for producers to prepare and submit data.

Facilitate Authentication of Submitter

• Challenge: Identification of the data submitter is needed to ensure that the data producer is being represented

– Previous physical and email submission capabilities enabled verification of the identity of the data provider.

– Submissions received from non-authorized individuals might not contain the correct or complete data.

– The data producer or their representative can provide rights for archiving and using the data.

• Recommendation: Establish capabilities and procedures to allow data producers and their representatives to receive a username and password that can be used to log in to the data submission system when submitting data.

Elicit Information to Contact Submitter

• Challenge: Submitters need to be contacted to resolve issues with submission.

• Recommendation: Request or generate the complete name and email address of the individual who submits the data.

– Automatically populate contact information fields upon login and request verification.

– Online form to request contact information: complete name and email address

– Obtain additional contact information

• Institution, mailing address, telephone number

Invite Complete Documentation

• Challenge: Data require documentation to facilitate understanding about the data and their applicability

– Data must be understood by those not familiar with the study.

• Recommendation: Request submission of documents describing the data, their creation, and measures used.

– Methodology document (who, why, what, where, when, and how the data were obtained)

– Variable definitions and specification (location) of values (codebook)

– Descriptions of instruments, measures, and units of measurement

– Explanations of caveats, assumptions, additions, corrections
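
A submission system could prompt for any of the requested documents that have not yet been provided. The Python sketch below shows one way to do that. The dictionary keys and the function name `missing_documents` are hypothetical labels chosen for illustration, not part of the system described in these slides.

```python
# Hypothetical labels for the four kinds of documentation requested above.
REQUESTED_DOCUMENTS = {
    "methodology": "Who, why, what, where, when, and how the data were obtained",
    "codebook": "Variable definitions and specification (location) of values",
    "instruments": "Descriptions of instruments, measures, and units of measurement",
    "caveats": "Explanations of caveats, assumptions, additions, corrections",
}


def missing_documents(provided):
    """Return the requested document kinds absent from `provided`.

    `provided` maps a document kind to the name of the deposited file.
    """
    return [kind for kind in REQUESTED_DOCUMENTS if not provided.get(kind)]
```

Because the slides frame documentation as something to invite rather than require, the result is probably best used to display reminders rather than to block a submission.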

Foster Composition of Data Descriptions: Title

• Challenge: The relevance of a data set cannot always be determined from the title.

• Recommendation: Provide guidance for describing the data within the title to enable discovery and to differentiate it from other data.

• Considerations for inclusion within title:

– Purpose: Characteristic measured

– Measure: Instrument

– Location: Geographical aspects measured or political (country, state, county, city, etc.)

– Temporal Aspects: Date or range of dates when data were collected or measured

– Version: Sequential version identifier or date of release

• Examples

Indicators of Coastal Water Quality: Change in Chlorophyll-a Concentration 1998-2007, Alaska-Argentina

National Footprint Accounts, 2006 Edition, Footprint and Biocapacity by major land type by nation, 2003
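
The title considerations above could be offered as separate form fields and then assembled into a suggested title. The following Python sketch assumes a simple comma-separated layout in a fixed order. The field names and the functions `compose_title` and `missing_elements` are hypothetical, and the example titles above show that real titles need not follow this exact pattern.

```python
# Hypothetical field names for the five title considerations listed above.
TITLE_ELEMENTS = ("purpose", "measure", "location", "temporal", "version")


def compose_title(**elements):
    """Join the supplied title elements, in a fixed order, with commas."""
    unknown = set(elements) - set(TITLE_ELEMENTS)
    if unknown:
        raise ValueError("unknown title elements: " + ", ".join(sorted(unknown)))
    parts = (elements.get(name, "").strip() for name in TITLE_ELEMENTS)
    return ", ".join(part for part in parts if part)


def missing_elements(**elements):
    """List the considerations the submitter has not yet addressed."""
    return [name for name in TITLE_ELEMENTS if not elements.get(name, "").strip()]
```

A form could show the suggested title for editing and use `missing_elements` to hint at what might still distinguish the data set from others.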

Provide Choices to Describe Data

• Challenge: Identifying terms to describe data can be time-consuming

• Recommendation: Provide choices from groups of controlled vocabularies to describe data

– Examples of terminology for consideration:

• ISO 19115:2003 Geographic Information – Metadata Topic Categories

• Semantic Web for Earth and Environmental Terminology (SWEET). See http://sweet.jpl.nasa.gov/ontology/
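
When terms are typed rather than clicked, a system can check them against the controlled vocabulary and suggest near matches. The Python sketch below uses a small subset of the ISO 19115 topic category codes for illustration. The list is deliberately incomplete, and the function name `check_terms` is hypothetical.

```python
import difflib

# Illustrative subset of ISO 19115 topic category codes, not the full list.
TOPIC_CATEGORIES = (
    "biota",
    "climatologyMeteorologyAtmosphere",
    "economy",
    "environment",
    "health",
    "inlandWaters",
    "oceans",
    "society",
)


def check_terms(selected):
    """Split entered terms into accepted codes and rejected terms.

    Matching ignores case and surrounding whitespace. Each rejected term
    maps to a list of up to three similar codes offered as suggestions.
    """
    by_lower = {code.lower(): code for code in TOPIC_CATEGORIES}
    accepted, rejected = [], {}
    for term in selected:
        key = term.strip().lower()
        if key in by_lower:
            accepted.append(by_lower[key])
        else:
            close = difflib.get_close_matches(key, list(by_lower), n=3)
            rejected[term] = [by_lower[match] for match in close]
    return accepted, rejected
```

Offering the vocabulary as selectable choices, as recommended above, avoids this check altogether. The sketch applies mainly where free-text entry or bulk upload of descriptions is also supported.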

Selecting Terms from Controlled Vocabulary: ISO 19115 Topic Categories

Source: Downs & Chen (2009) Earth and Space Science Informatics Workshop. http://essi.gsfc.nasa.gov/pdf/Downs.pdf

Request Non-Restrictive Permissions

• Challenge: Intellectual property rights must be obtained to enable the use of data by the archive and by others.

– Unknown rights to data can restrict data stewardship and use

– Limiting the rights to data can prevent some uses of the data

• Recommendation: Avoid legal terms in request for data producer to grant rights, with limited restrictions, if possible.

– Simple Form with choices to be clicked, based on affiliation of submitter and type of resource

• Creative Commons License (Attribution) http://creativecommons.org/

• Additional data sharing options http://sciencecommons.org/

• Public Domain (Created by Government employee)
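
The simple form described above offers choices that depend on the submitter's affiliation and the type of resource. The Python sketch below illustrates that selection logic. The mapping rules, the category values `"government"` and `"data"`, and the function name `permission_choices` are assumptions made for illustration. The slides name the options but do not specify which option is shown to whom.

```python
def permission_choices(affiliation, resource_type):
    """Return the permission options to display on the form.

    Assumed rules, for illustration only:
    - works created by government employees are offered as public domain;
    - everyone else is offered a Creative Commons Attribution license;
    - data resources additionally get the other data sharing options.
    """
    if affiliation == "government":
        return ["Public Domain"]
    choices = ["Creative Commons Attribution"]
    if resource_type == "data":
        choices.append("Additional data sharing options")
    return choices
```

Presenting a short list of labeled choices keeps legal terms out of the request, in line with the recommendation above.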

Summary: Capabilities for Upstream Submission and Workflow

• Requirements are applicable to social science and natural science data and to interdisciplinary data

• Potential risk when engaging data producers early

– not knowing which data are important to preserve (but, capturing more information should improve selection and appraisal)

• Benefits of obtaining data through robust workflow prior to the end of the project that collects the data

– higher quality metadata, including provenance information

– reduced risk of not getting minimum metadata (e.g., when authors move on to other projects)

– lower costs overall (data are submitted when ready)

– ability to follow up with producers

References

Consultative Committee for Space Data Systems (2004) Producer-Archive Interface Methodology Abstract Standard. (CCSDS 651.0-B-1). Also: Space data and information transfer systems – Producer-archive interface – Methodology abstract standard (ISO 20652:2006). Available: http://public.ccsds.org/publications/archive/651x0b1.pdf

Consultative Committee for Space Data Systems (2002) Reference Model for an Open Archival Information System (OAIS). Also: Space data and information transfer systems – Open archival information system – Reference model (ISO 14721:2003). Available: http://public.ccsds.org/publications/archive/650x0b1.pdf

Consultative Committee for Space Data Systems (CCSDS) Audit and Certification of Trustworthy Digital Repositories: Draft Recommended Practice. Red Book, Issue 1. 652.0-R-1 (July 2009). Available: http://public.ccsds.org/sites/cwe/rids/Lists/CCSDS%206520R1/NASAUSOverview.aspx

Downs RR, Chen RS (2009) Designing Submission Services for a Trustworthy Digital Repository of Interdisciplinary Scientific Data. Earth and Space Science Informatics Workshop: Developing the Next Generation of Earth and Space Science Informatics. August 3-5, 2009. University of Maryland, Baltimore County. Available: http://essi.gsfc.nasa.gov/pdf/Downs.pdf

Downs RR, Chen RS (2010) Designing Submission and Workflow Services for Preserving Interdisciplinary Scientific Data. Earth Science Informatics. Available: http://dx.doi.org/10.1007/s12145-010-0051-6

Nestor Working Group, Trusted Repositories - Certification (2006) Catalogue of Criteria for Trusted Digital Repositories, Version 1. Available: http://edoc.hu-berlin.de/series/nestor-materialien/8en/PDF/8en.pdf

Online Computer Library Center (OCLC) and Center for Research Libraries (CRL) (2007) Trustworthy Repositories Audit & Certification: Criteria and Checklist (TRAC), Version 1.0. OCLC & CRL. February 2007. Available: http://www.crl.edu/sites/default/files/attachments/pages/trac_0.pdf

The Digital Curation Centre (DCC) and Digital Preservation Europe (DPE) (2007) Digital Repository Audit Method Based on Risk Assessment (DRAMBORA). Available: http://www.repositoryaudit.eu/