Assuring the quality of your data: A natural history collection community perspective Deborah Paul, Katja Seltmann, Laura Russell, David Bloom @idbdeb @iDigBio @VertNetOrg Tuesday, January 12, 2016 9 am Pacific / 10 am Mountain / 11 am Central / 12 noon Eastern




pH


Our data producers & data users

Naturalists, Field biologists, Nature explorers, Research institutions, Citizen scientists, Curators

Ecologists, biogeographers; Analysts, modellers; Conservation planners; Nature managers; Policy managers; Funding agencies; Industry; ?, ...

Data Quality starts here, before collection of specimens and field data


Data, data types, data standards

http://rs.tdwg.org/dwc/terms/index.htm

● Darwin Core (DwC)
● Audubon Core (AC)
● Ecological Metadata Language (EML)
● Global Genome Biodiversity Network (GGBN)
● ...
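
One practical use of a shared standard is checking column headers before publishing. The sketch below is illustrative only: it compares spreadsheet headers against a small, hand-picked subset of Darwin Core term names (the full list is at the TDWG URL above). The function names `check_headers` and `suggest_term` are hypothetical helpers, not part of any published tool.

```python
# A small, hand-picked subset of Darwin Core term names.
# Extend this set from the full term list before relying on it.
DWC_TERMS_SUBSET = {
    "occurrenceID", "basisOfRecord", "institutionCode", "collectionCode",
    "catalogNumber", "scientificName", "recordedBy", "eventDate",
    "country", "stateProvince", "locality",
    "decimalLatitude", "decimalLongitude", "geodeticDatum",
}


def check_headers(headers):
    """Split column headers into recognized terms and unrecognized names."""
    known = [h for h in headers if h in DWC_TERMS_SUBSET]
    unknown = [h for h in headers if h not in DWC_TERMS_SUBSET]
    return known, unknown


def suggest_term(header):
    """Suggest a term when a header differs only by case, spaces, or underscores."""
    normalized = header.replace(" ", "").replace("_", "").lower()
    for term in DWC_TERMS_SUBSET:
        if term.lower() == normalized:
            return term
    return None


if __name__ == "__main__":
    headers = ["catalogNumber", "Scientific Name", "decimal_latitude", "collector"]
    known, unknown = check_headers(headers)
    print("recognized:", known)
    for name in unknown:
        print(name, "->", suggest_term(name))
```

Headers with no suggestion (such as "collector" here) still need a person to decide which term, if any, they map to.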


Data Publication Fallacies (and Truths)

The Fallacy of Perfection: Data must be perfect before publication.

The Truth: Data will never be perfect in every aspect.


Data Publication Fallacies (and Truths)

The Fallacy of Petrification: Data do not change (or don’t need to change) once they are in a ledger or database.

The Truth: Data are dynamic and require regular curation.


Data Publication Fallacies (and Truths)

[Figure: DataONE Data Life Cycle]


Data Publication Fallacies (and Truths)

The Fallacy of Fitness for Use: The fitness for use of data depends upon how and why the data were collected.

The Truth: Fitness depends upon the questions being asked (value is in the eye of the beholder).


Challenges for data publication and quality

Individual: Education, Experience, Funding, Presence of support

Institutional: Inter/Intra-collection communication, Mixed technology, Legal limitations

Structural: Interoperability, Related/Derivative sources


Data assurance practices - VertNet

● Data publishing
  ○ VertNet Toolkit
    ■ Data migrator
    ■ Data quality reports
● Data acquisition and processing
  ○ Google BigQuery
● Data access - VertNet portal
  ○ Spatial quality
  ○ Flag data issues
● Feedback
  ○ Issue tracking and feedback via GitHub


https://github.com/VertNet/toolkit


http://vertnet.org/resources/spatialqualitytabguide.html
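
To give a feel for what basic spatial quality checks look like, here is a minimal sketch. It is not the VertNet portal's implementation (see the guide above for what the portal actually reports). It assumes each record is a dict of strings keyed by Darwin Core term names, and the issue labels it returns are hypothetical.

```python
def spatial_issues(record):
    """Return a list of issue labels for one record (a dict of strings)."""
    lat_raw = record.get("decimalLatitude", "")
    lon_raw = record.get("decimalLongitude", "")
    if not lat_raw or not lon_raw:
        return ["coordinates_missing"]
    try:
        lat, lon = float(lat_raw), float(lon_raw)
    except ValueError:
        return ["coordinates_not_numeric"]

    issues = []
    if not -90 <= lat <= 90:
        issues.append("latitude_out_of_range")
    if not -180 <= lon <= 180:
        issues.append("longitude_out_of_range")
    if lat == 0 and lon == 0:
        # 0,0 is a common placeholder for "unknown" rather than a real locality.
        issues.append("coordinates_zero")
    if not record.get("geodeticDatum"):
        issues.append("datum_missing")
    return issues


if __name__ == "__main__":
    example = {"decimalLatitude": "137.87", "decimalLongitude": "-122.26"}
    print(spatial_issues(example))
```

Checks like these only catch values that cannot be right. Whether a plausible-looking point is the correct locality still needs comparison against the locality text or a map.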


Easy for users to submit. Users do need a free GitHub account.

Easy for data publishers to receive and manage feedback about their published data.

http://vertnet.org/resources/issuetrackingguide.html
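
For readers curious what a data issue looks like on the GitHub side, the sketch below builds (without sending) a request to GitHub's REST endpoint for creating an issue. This is an illustration under assumptions, not a description of how the VertNet portal submits feedback: the owner, repository, label, and record identifier are all hypothetical placeholders.

```python
import json
import urllib.request


def build_issue_request(owner, repo, token, record_id, problem):
    """Build (but do not send) a request that would open an issue about one record."""
    payload = {
        "title": f"Data issue: {record_id}",
        "body": f"Record: {record_id}\n\nReported problem: {problem}",
        "labels": ["data-quality"],  # hypothetical label
    }
    return urllib.request.Request(
        url=f"https://api.github.com/repos/{owner}/{repo}/issues",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        method="POST",
    )


if __name__ == "__main__":
    request = build_issue_request(
        "example-museum", "example-collection", "TOKEN",
        "urn:catalog:EX:Herp:123", "Latitude sign looks inverted",
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would submit it, which requires a valid token with permission to create issues in that repository.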


Data assurance practices

Data Publication - an overview of the process data goes through before ingestion at iDigBio

1. Negotiating

2. Mobilizing (human, scripts)

3. Ingestion

a. Evaluation (human, scripts, reports)

b. iDigBio Data Quality (DQ) Flags

Graphic by Joanna McCaffrey, iDigBio Biodiversity Informatics Manager

a Perk of data sharing!

https://www.idigbio.org/wiki/index.php/Data_Ingestion_Guidance
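
The "scripts, reports" part of evaluation can start very simply. The following sketch is one possible report, assuming the data arrive as CSV text with a header row: it counts how many rows have a non-blank value in each column. It is an illustration only, not iDigBio's evaluation script, and `completeness_report` is a hypothetical name.

```python
import csv
import io
from collections import Counter


def completeness_report(csv_text):
    """Return {column: (filled_rows, total_rows)} for CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    filled = Counter()
    total = 0
    for row in reader:
        total += 1
        for column in reader.fieldnames:
            value = row.get(column) or ""
            if value.strip():
                filled[column] += 1
    return {column: (filled[column], total) for column in reader.fieldnames}


if __name__ == "__main__":
    sample = (
        "catalogNumber,scientificName,eventDate\n"
        "1,Acer rubrum,2015-06-01\n"
        "2,,2015-06-02\n"
        "3,Quercus alba,\n"
    )
    for column, (count, total) in completeness_report(sample).items():
        print(f"{column}: {count}/{total} filled")
```

A report like this tells a provider where the gaps are before ingestion, which fits the point that data need not be perfect to be published.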


Search and download using the iDigBio DQ Flags

DQ Flags

● enhance and improve data
● enhance discoverability
● visualization aid
● transparency builds community trust
● facilitate potential development of automated updates by provider

https://www.idigbio.org/portal/search
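
Once records carry flags, summarizing and filtering by flag is straightforward. The sketch below works on records already in memory and assumes each record is a dict with a "flags" list. That shape and the flag names are placeholders chosen for illustration, not the actual iDigBio flag vocabulary or download format.

```python
from collections import Counter


def flag_counts(records):
    """Count how many records carry each flag."""
    counts = Counter()
    for record in records:
        counts.update(record.get("flags", []))
    return counts


def with_flag(records, flag):
    """Return only the records that carry the given flag."""
    return [record for record in records if flag in record.get("flags", [])]


if __name__ == "__main__":
    # Flag names below are placeholders, not real iDigBio flags.
    records = [
        {"id": "a", "flags": ["example_datum_missing"]},
        {"id": "b", "flags": ["example_datum_missing", "example_country_corrected"]},
        {"id": "c", "flags": []},
    ]
    print(flag_counts(records).most_common())
    print([r["id"] for r in with_flag(records, "example_country_corrected")])
```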


Data assurance practices

● Data Quality Flags by recordset at iDigBio
● Download Darwin Core Archive Files
  ○ raw data
  ○ and a bonus file
  ○ available to everyone
● Annotations coming
  ○ feedback loop
  ○ transparency

Darwin Core Archive File + DQ enhanced data
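
A Darwin Core Archive is a zip file containing a meta.xml descriptor plus one or more delimited text files. The sketch below shows one way to read the core file with the Python standard library. It makes simplifying assumptions that are stated in the comments: the core file has a single header line and the header names are used as column names, rather than the field definitions in meta.xml. The example archive is built in memory purely for demonstration and its file names are made up.

```python
import csv
import io
import xml.etree.ElementTree as ET
import zipfile

DWC_TEXT_NS = {"dwc": "http://rs.tdwg.org/dwc/text/"}


def read_core_rows(archive_bytes):
    """Read the core data file of a Darwin Core Archive held in memory.

    Simplification: assumes one header line and trusts its column names.
    """
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        meta = ET.fromstring(archive.read("meta.xml"))
        core = meta.find("dwc:core", DWC_TEXT_NS)
        location = core.find("dwc:files/dwc:location", DWC_TEXT_NS).text
        delimiter = core.get("fieldsTerminatedBy", ",")
        if delimiter == "\\t":  # meta.xml writes a tab as the two characters \t
            delimiter = "\t"
        text = archive.read(location).decode(core.get("encoding", "utf-8"))
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def build_example_archive():
    """Build a tiny, made-up archive so the reader can be tried without a download."""
    meta = (
        '<archive xmlns="http://rs.tdwg.org/dwc/text/">'
        '<core encoding="UTF-8" fieldsTerminatedBy="\\t" linesTerminatedBy="\\n" '
        'ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Occurrence">'
        "<files><location>occurrence.txt</location></files>"
        '<id index="0"/>'
        '<field index="1" term="http://rs.tdwg.org/dwc/terms/scientificName"/>'
        "</core></archive>"
    )
    data = "id\tscientificName\n1\tAcer rubrum\n2\tQuercus alba\n"
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("meta.xml", meta)
        archive.writestr("occurrence.txt", data)
    return buffer.getvalue()


if __name__ == "__main__":
    for row in read_core_rows(build_example_archive()):
        print(row)
```

Any additional files in a downloaded archive can be read the same way with `archive.read(name)` once their names are known from `archive.namelist()`.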


Contact the provider...

https://www.idigbio.org/portal/records/d0199563-4490-4364-a768-e2318c3b4e24


Future DQ Work at iDigBio

● Full taxonomic resolution against the GBIF Backbone, including adding TaxonID values (for all taxonomic levels), accepted names, and canonical names.

● Expanded geographic name/point validation/augmentation using GADM, GeoNames, or some similar place name database.

● Opening up the stage 1 corrections process to include externally provided corrections data sources. An API has been provisionally designed, but it hasn’t undergone any testing or validation.

● Opening up the stage 1 corrections process to include annotations, since from a technical perspective applying an annotation to a record is the same as applying a general correction that applies only to specific single records or versions. This is even farther out than the previous item, since it would need an entire workflow built around it to be effective.

● Harmonization of flag values, conditions, and actions with other projects (such as this ALA/GBIF list: http://bit.ly/evMJv5 )
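
The idea of a lookup-based corrections step can be sketched in a few lines. This is a toy model under assumptions, not iDigBio's stage 1 process or its provisional API: `corrections` is a hypothetical dict mapping (field, lowercased raw value) to a corrected value, and the flag naming is made up.

```python
def apply_corrections(record, corrections):
    """Return a corrected copy of `record` plus flags naming what changed.

    `corrections` maps (field, raw_value_lowercased) -> corrected value.
    The original record is left untouched so raw data stay available.
    """
    corrected = dict(record)
    flags = []
    for field, raw in record.items():
        key = (field, raw.strip().lower())
        if key in corrections and corrections[key] != raw:
            corrected[field] = corrections[key]
            flags.append(f"{field}_corrected")  # hypothetical flag naming
    return corrected, flags


if __name__ == "__main__":
    corrections = {
        ("country", "usa"): "United States",
        ("country", "u.s.a."): "United States",
    }
    record = {"country": " USA ", "stateProvince": "Florida"}
    print(apply_corrections(record, corrections))
```

Keeping the raw record and returning the changes as flags mirrors the transparency goal described earlier: users can see both what was provided and what was altered.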


NSF ADBC Digitization Thematic Collection Network

Plants, Herbivores, and Parasitoids: A Model System for the study of Tri-Trophic Associations

3,645,975 new records in 4 years


1. Collection manager commitment and training

2. QC tools (data entry, data evaluation)

3. Research use of data


Acknowledgements and References

DRAFT FOR CONSULTATION: Best Practice Guide for Data Gap Analysis for biodiversity stakeholders. Arturo H. Ariño, Vishwas Chavan, Javier Otegui. http://www.gbif.org/system/files_force/gbif_resource/resource-82566/DRAFT-Best-Practice-Guide-for-Data-Gap-Analysis-for-biodiversity-stakeholders.pdf

VertNet Migrators https://github.com/VertNet/toolkit

VertNet Portal Spatial Quality Tab http://www.vertnet.org/resources/spatialqualitytabguide.html

VertNet GitHub Reference & Set Up for Data Issue Tracking http://www.vertnet.org/resources/issuetrackingguide.html

Improving Data Quality: iDigBio Recordset data cleaning method, tools, and data flags https://www.idigbio.org/content/improving-data-quality-idigbio-recordset-data-cleaning-method-tools-and-data-flags

A summer learning R to clean up data with the iDigBio portal recordset correction feature https://www.idigbio.org/content/summer-learning-r-clean-data-idigbio-portal-recordset-correction-feature

iDigBio Data recommendations for optimal searchability and applicability in the aggregate https://www.idigbio.org/wiki/index.php/Data_Ingestion_Guidance#Data_recommendations_for_optimal_searchability_and_applicability_in_the_aggregate

Exploring unique values in iDigBio using Apache Spark https://www.idigbio.org/content/exploring-unique-values-idigbio-using-apache-spark


Acknowledgements and References, cont’d

Biocode Field Information Management System (ppt, youtube). A Field Information Management System (FIMS) enables data collection at the source (in the field) by generating spreadsheet templates, validating data, and assigning persistent identifiers for every unique biological sample. The following diagram shows how the system works. The most typical functions are Generating Templates and Validating Data, both of which can be found under the Tools menu.

[Figure: How FIMS works (Generate a Template, Validate data)]
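
The validate-then-identify pattern described for FIMS can be illustrated with a short sketch. It is not the Biocode FIMS code: the required columns are a hypothetical template, and random UUIDs stand in for whatever persistent identifier scheme a real system assigns.

```python
import uuid

# Hypothetical template columns, chosen for illustration.
REQUIRED_COLUMNS = ["sampleID", "scientificName", "eventDate"]


def validate_rows(rows):
    """Return human-readable errors for missing values and duplicate sample IDs."""
    errors = []
    seen = set()
    for number, row in enumerate(rows, start=1):
        for column in REQUIRED_COLUMNS:
            if not row.get(column, "").strip():
                errors.append(f"row {number}: missing {column}")
        sample_id = row.get("sampleID", "").strip()
        if sample_id:
            if sample_id in seen:
                errors.append(f"row {number}: duplicate sampleID {sample_id}")
            seen.add(sample_id)
    return errors


def assign_identifiers(rows):
    """Return copies of the rows with a UUID-based identifier added (a stand-in scheme)."""
    return [dict(row, identifier=f"urn:uuid:{uuid.uuid4()}") for row in rows]


if __name__ == "__main__":
    rows = [
        {"sampleID": "S1", "scientificName": "Acer rubrum", "eventDate": "2015-06-01"},
        {"sampleID": "S1", "scientificName": "", "eventDate": "2015-06-02"},
    ]
    problems = validate_rows(rows)
    print(problems if problems else assign_identifiers(rows))
```

Running validation in the field, before identifiers are assigned, is what lets data quality start before specimens reach the collection.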

Contact us

Deb Paul, Katja Seltmann, Laura Russell, David Bloom

iDigBio is funded by a grant from the National Science Foundation's Advancing Digitization of Biodiversity Collections Program (Cooperative Agreement EF-1115210). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.