Data Quality: Towards a Common Validator
DESCRIPTION
Data Quality: Towards a Common Validator. SiBBr 4th Workshop, Petrópolis, Rio de Janeiro, Brazil. Authors: Christian Gendreau, Anne Bruneau, David Shorthouse
TRANSCRIPT
![Page 1: Data Quality: Towards a Common Validator](https://reader036.vdocuments.us/reader036/viewer/2022062706/557dc84bd8b42a8a4e8b510d/html5/thumbnails/1.jpg)
Data quality
Towards a common validator
Christian Gendreau, Anne Bruneau, David Shorthouse
Université de Montréal, Biodiversity Centre
![Page 2: Data Quality: Towards a Common Validator](https://reader036.vdocuments.us/reader036/viewer/2022062706/557dc84bd8b42a8a4e8b510d/html5/thumbnails/2.jpg)
What is data quality?
● Relative
● Fitness for use
  o Coordinate precision for distribution
  o Hierarchy not provided
● What – When – Where
![Page 3: Data Quality: Towards a Common Validator](https://reader036.vdocuments.us/reader036/viewer/2022062706/557dc84bd8b42a8a4e8b510d/html5/thumbnails/3.jpg)
Examples
● 2008 VI 13, 2008-06-13, 13-06-2008, June 13 2008, 13 junho 2008
● Canada, Québec, Montréal, -73.55399 45.508669
● Narwalus microcephalus => Monodon monoceros
Public Domain: Freshwater and Marine Image Bank, University of Washington Libraries Digital Collections
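The date variants on this slide can all be reduced to one value. The following is a minimal sketch of that idea, not the narwhal-processor implementation: the function name `normalize_date`, the month dictionaries, and the assumption that an all-numeric date ending in a year is day-first (13-06-2008) are illustrative choices made here.

```python
import re
from datetime import date

ROMAN_MONTHS = {"I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6,
                "VII": 7, "VIII": 8, "IX": 9, "X": 10, "XI": 11, "XII": 12}

EN = ["january", "february", "march", "april", "may", "june", "july",
      "august", "september", "october", "november", "december"]
PT = ["janeiro", "fevereiro", "março", "abril", "maio", "junho", "julho",
      "agosto", "setembro", "outubro", "novembro", "dezembro"]
# Illustrative dictionary: English and Portuguese only.
MONTH_NAMES = {name: i + 1 for names in (EN, PT) for i, name in enumerate(names)}


def normalize_date(text):
    """Return a datetime.date for a three-part date string, or None."""
    tokens = [t for t in re.split(r"[\s\-/.,]+", text.strip()) if t]
    if len(tokens) != 3:
        return None
    a, b, c = tokens
    try:
        if a.isdigit() and len(a) == 4:
            # Year first: "2008 VI 13" or "2008-06-13".
            year, day = int(a), int(c)
            month = ROMAN_MONTHS.get(b.upper()) or (int(b) if b.isdigit() else None)
        elif c.isdigit() and len(c) == 4:
            year = int(c)
            if a.isdigit():
                # Assumption: day first, "13-06-2008" or "13 junho 2008".
                day = int(a)
                month = int(b) if b.isdigit() else MONTH_NAMES.get(b.lower())
            else:
                # Month name first: "June 13 2008".
                month = MONTH_NAMES.get(a.lower())
                day = int(b)
        else:
            return None
        if month is None:
            return None
        return date(year, month, day)
    except ValueError:
        # Non-numeric day or an impossible calendar date.
        return None


for raw in ["2008 VI 13", "2008-06-13", "13-06-2008", "June 13 2008", "13 junho 2008"]:
    print(raw, "->", normalize_date(raw))
```

A string such as 06-13-2008 would be rejected under the day-first assumption rather than silently reinterpreted, which is the kind of case a validator should report instead of guessing.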
![Page 4: Data Quality: Towards a Common Validator](https://reader036.vdocuments.us/reader036/viewer/2022062706/557dc84bd8b42a8a4e8b510d/html5/thumbnails/4.jpg)
Data Quality Information Chain
Courtesy of Arthur D. Chapman
![Page 5: Data Quality: Towards a Common Validator](https://reader036.vdocuments.us/reader036/viewer/2022062706/557dc84bd8b42a8a4e8b510d/html5/thumbnails/5.jpg)
Brief History
● First Canadensys Explorer/Harvester (2012)
● narwhal-processor (2013)
● TDWG 2013
  o DQ Interest Group
  o Presentation about our plan
  o Discussion with GBIF
![Page 6: Data Quality: Towards a Common Validator](https://reader036.vdocuments.us/reader036/viewer/2022062706/557dc84bd8b42a8a4e8b510d/html5/thumbnails/6.jpg)
Why do we need a validator?
● Identify and quantify potential issues
● Darwin Core is permissive
  o Darwin Core itself can change
● Records and technologies will always evolve
![Page 7: Data Quality: Towards a Common Validator](https://reader036.vdocuments.us/reader036/viewer/2022062706/557dc84bd8b42a8a4e8b510d/html5/thumbnails/7.jpg)
What should a validator allow?
● Define a validation scope
  o at the source (e.g. collection)
  o national node
  o aggregator
  o GBIF
![Page 8: Data Quality: Towards a Common Validator](https://reader036.vdocuments.us/reader036/viewer/2022062706/557dc84bd8b42a8a4e8b510d/html5/thumbnails/8.jpg)
Validator - Expected design
● Modular, scalable, reusable
● Customizable
  o per configuration/extension
  o use user-defined dictionary
● Validation Chain
![Page 9: Data Quality: Towards a Common Validator](https://reader036.vdocuments.us/reader036/viewer/2022062706/557dc84bd8b42a8a4e8b510d/html5/thumbnails/9.jpg)
Current Options
● GBIF validator
● CRIA tools
● ALA tools
Probably all organisations have their own tools.
![Page 10: Data Quality: Towards a Common Validator](https://reader036.vdocuments.us/reader036/viewer/2022062706/557dc84bd8b42a8a4e8b510d/html5/thumbnails/10.jpg)
dwca-validator
● Starting from previous GBIF validator
● Building a community project
● Provide framework for Biodiversity Data Quality Interest Group (TDWG)
https://github.com/gbif/dwca-validator
![Page 11: Data Quality: Towards a Common Validator](https://reader036.vdocuments.us/reader036/viewer/2022062706/557dc84bd8b42a8a4e8b510d/html5/thumbnails/11.jpg)
Vision
• Library
  o Core module, reusable (e.g. IPT)
• Web
  o Send archive, view report
• narwhal-processor
  o Suggest interpreted value
• Extensions
  o Domain knowledge / Quality index
![Page 12: Data Quality: Towards a Common Validator](https://reader036.vdocuments.us/reader036/viewer/2022062706/557dc84bd8b42a8a4e8b510d/html5/thumbnails/12.jpg)
Validation chain
● Chain element
  o Self-contained (never relies on another chain element)
  o Ordering independent
● Composed chain element (narwhal and extensions)
  o Wrap chain elements under a new element
  o Ordering possible between wrapped elements
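The two kinds of element described above can be sketched as follows. This is a hypothetical illustration, not the dwca-validator API (which is a Java project): the class names `ChainElement`, `RequiredField`, `LatitudeRange`, `ComposedElement` and the function `run_chain` are invented for the example. Only the field names `scientificName` and `decimalLatitude` are real Darwin Core terms.

```python
from dataclasses import dataclass
from typing import Dict, List

Record = Dict[str, str]


@dataclass(frozen=True)
class Issue:
    element: str
    message: str


class ChainElement:
    """Self-contained check: sees only the record, never another element."""
    name = "element"

    def check(self, record: Record) -> List[Issue]:
        raise NotImplementedError


class RequiredField(ChainElement):
    def __init__(self, field: str):
        self.field = field
        self.name = f"required:{field}"

    def check(self, record: Record) -> List[Issue]:
        if record.get(self.field, "").strip():
            return []
        return [Issue(self.name, f"{self.field} is missing")]


class LatitudeRange(ChainElement):
    name = "latitude-range"

    def check(self, record: Record) -> List[Issue]:
        raw = record.get("decimalLatitude", "").strip()
        if not raw:
            return []  # absence is another element's concern
        try:
            value = float(raw)
        except ValueError:
            return [Issue(self.name, "decimalLatitude is not a number")]
        if -90 <= value <= 90:
            return []
        return [Issue(self.name, "decimalLatitude is out of range")]


class ComposedElement(ChainElement):
    """Wraps elements under one name; order matters only inside the wrapper."""

    def __init__(self, name: str, elements: List[ChainElement]):
        self.name = name
        self.elements = elements

    def check(self, record: Record) -> List[Issue]:
        for element in self.elements:
            issues = element.check(record)
            if issues:
                return issues  # stop at the first failing wrapped element
        return []


def run_chain(chain: List[ChainElement], record: Record) -> List[Issue]:
    issues: List[Issue] = []
    for element in chain:
        issues.extend(element.check(record))
    return issues


record = {"scientificName": "", "decimalLatitude": "123.4"}
chain = [RequiredField("scientificName"), LatitudeRange()]
for issue in run_chain(chain, record):
    print(issue.element, "-", issue.message)
```

Because top-level elements never read each other's output, reversing `chain` yields the same set of issues. Ordering is confined to `ComposedElement`, where a later check may assume an earlier one has passed.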
![Page 13: Data Quality: Towards a Common Validator](https://reader036.vdocuments.us/reader036/viewer/2022062706/557dc84bd8b42a8a4e8b510d/html5/thumbnails/13.jpg)
Chain element example
![Page 14: Data Quality: Towards a Common Validator](https://reader036.vdocuments.us/reader036/viewer/2022062706/557dc84bd8b42a8a4e8b510d/html5/thumbnails/14.jpg)
Validation types
● Structure
  o metadata
  o organization of data
● Rows
  o dates, coordinates, ...
● Columns
  o ID uniqueness
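The difference between these types is mostly how much of the file a check must see: a row check needs one record, while a column check needs every value in that column. A minimal sketch of the two broader kinds, with hypothetical function names and `taxonID` assumed as the identifier column:

```python
from collections import Counter


def missing_columns(header, required=("taxonID", "scientificName")):
    """Structure check: which required columns are absent from the header?"""
    return [name for name in required if name not in header]


def duplicate_ids(rows, id_field="taxonID"):
    """Column check: identifiers that occur more than once in the whole file."""
    counts = Counter(row.get(id_field, "") for row in rows)
    return sorted(value for value, n in counts.items() if value and n > 1)


rows = [
    {"taxonID": "1", "scientificName": "Monodon monoceros"},
    {"taxonID": "2", "scientificName": "Lynx canadensis"},
    {"taxonID": "1", "scientificName": "Narwalus microcephalus"},
]
print(missing_columns(["taxonID", "scientificName"]))
print(duplicate_ids(rows))
```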
![Page 15: Data Quality: Towards a Common Validator](https://reader036.vdocuments.us/reader036/viewer/2022062706/557dc84bd8b42a8a4e8b510d/html5/thumbnails/15.jpg)
Result Accumulator
● Records validation results as they occur
  o ID/Validator/Context/ValidationType/Result/Message
● Allows different views of results
  o Web view
  o Feed another application
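An accumulator of this shape might look like the sketch below. The six fields follow the slide. The class and method names, the `"ROW"`/`"FAILED"`/`"PASSED"` values, and the JSON Lines output are assumptions made for illustration and are not taken from the dwca-validator code.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ValidationResult:
    record_id: str
    validator: str
    context: str          # e.g. the field that was checked
    validation_type: str  # e.g. "STRUCTURE", "ROW", "COLUMN"
    result: str           # e.g. "PASSED", "FAILED"
    message: str


class ResultAccumulator:
    """Collects results as they are produced; views are derived afterwards."""

    def __init__(self):
        self._results = []

    def add(self, result: ValidationResult) -> None:
        self._results.append(result)

    def summary(self) -> Counter:
        """View 1: counts per (validator, result), e.g. for a web report."""
        return Counter((r.validator, r.result) for r in self._results)

    def to_json_lines(self) -> str:
        """View 2: one JSON object per line, to feed another application."""
        return "\n".join(json.dumps(asdict(r)) for r in self._results)


acc = ResultAccumulator()
acc.add(ValidationResult("t1", "date-parser", "eventDate", "ROW", "FAILED", "unparseable date"))
acc.add(ValidationResult("t2", "date-parser", "eventDate", "ROW", "PASSED", ""))
print(acc.summary())
print(acc.to_json_lines())
```

Keeping the stored results flat and immutable means every view is a read-only projection of the same list, so adding a new output format does not touch the validators.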
![Page 16: Data Quality: Towards a Common Validator](https://reader036.vdocuments.us/reader036/viewer/2022062706/557dc84bd8b42a8a4e8b510d/html5/thumbnails/16.jpg)
Current Status
● Library with CLI (command line interface)
● Basic evaluators and rules
● Ready for contributions
![Page 17: Data Quality: Towards a Common Validator](https://reader036.vdocuments.us/reader036/viewer/2022062706/557dc84bd8b42a8a4e8b510d/html5/thumbnails/17.jpg)
Demo
• Darwin Core Archive, Taxon Checklist
  o Invalid characters
  o Broken link between synonym and accepted taxon
Lynx canadensis, http://www.animalgalleries.org/
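The two demo checks can be approximated in a few lines. This is a sketch under assumptions, not the code shown in the demo: the function names are hypothetical, "invalid characters" is taken to mean the Unicode replacement character or control characters, and a synonym is assumed to point at its accepted taxon through the Darwin Core term `acceptedNameUsageID`.

```python
import unicodedata


def invalid_characters(value):
    """Return replacement or control characters found in a field value."""
    return [
        ch for ch in value
        if ch == "\ufffd"
        or (unicodedata.category(ch) == "Cc" and ch not in "\t\n\r")
    ]


def broken_accepted_links(rows):
    """Return taxonIDs whose acceptedNameUsageID matches no taxonID in the file."""
    known = {row["taxonID"] for row in rows if row.get("taxonID")}
    return [
        row.get("taxonID", "")
        for row in rows
        if row.get("acceptedNameUsageID")
        and row["acceptedNameUsageID"] not in known
    ]


checklist = [
    {"taxonID": "1", "scientificName": "Monodon monoceros"},
    {"taxonID": "2", "scientificName": "Narwalus microcephalus", "acceptedNameUsageID": "1"},
    {"taxonID": "3", "scientificName": "Lynx canadensis", "acceptedNameUsageID": "99"},
]
print(broken_accepted_links(checklist))
print(invalid_characters("Montr\ufffdal"))
```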
![Page 18: Data Quality: Towards a Common Validator](https://reader036.vdocuments.us/reader036/viewer/2022062706/557dc84bd8b42a8a4e8b510d/html5/thumbnails/18.jpg)
Future validations
● Use semantic web (e.g. GeoNames)
● Use external resolver (e.g. CoL)
● Use more complex validation (e.g. climate layer)
![Page 19: Data Quality: Towards a Common Validator](https://reader036.vdocuments.us/reader036/viewer/2022062706/557dc84bd8b42a8a4e8b510d/html5/thumbnails/19.jpg)
Future validations
Accommodate localisation vs misspellings
● Brésil (fr)
● Brazil (en)
● Brasil (pt)
● Brasilien (se)
● Brézil (??)
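One way to separate the two cases is to look the value up in a dictionary of localised names first and only then fall back to fuzzy matching. The sketch below uses Python's standard `difflib` module. The dictionary contents, the 0.8 similarity cutoff, and the status labels are assumptions chosen for illustration.

```python
import difflib
import unicodedata

# Hypothetical user-defined dictionary: localised country name -> ISO 3166-1 code.
COUNTRY_NAMES = {
    "brésil": "BR",
    "brazil": "BR",
    "brasil": "BR",
    "brasilien": "BR",
}


def classify_country(value, cutoff=0.8):
    """Return (status, code): a known localisation, a likely misspelling, or unknown."""
    key = unicodedata.normalize("NFC", value).strip().casefold()
    if key in COUNTRY_NAMES:
        return ("known", COUNTRY_NAMES[key])
    close = difflib.get_close_matches(key, list(COUNTRY_NAMES), n=1, cutoff=cutoff)
    if close:
        return ("possible-misspelling", COUNTRY_NAMES[close[0]])
    return ("unknown", None)


for name in ["Brésil", "Brazil", "Brasil", "Brasilien", "Brézil", "Atlantis"]:
    print(name, "->", classify_country(name))
```

The exact lookup keeps valid localised spellings from being flagged, while the fuzzy fallback only suggests a correction and leaves the decision to the data owner.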
![Page 20: Data Quality: Towards a Common Validator](https://reader036.vdocuments.us/reader036/viewer/2022062706/557dc84bd8b42a8a4e8b510d/html5/thumbnails/20.jpg)
Questions?
Public Domain: robynm
![Page 21: Data Quality: Towards a Common Validator](https://reader036.vdocuments.us/reader036/viewer/2022062706/557dc84bd8b42a8a4e8b510d/html5/thumbnails/21.jpg)
Acknowledgements
![Page 22: Data Quality: Towards a Common Validator](https://reader036.vdocuments.us/reader036/viewer/2022062706/557dc84bd8b42a8a4e8b510d/html5/thumbnails/22.jpg)
Contact
http://www.canadensys.net
http://github.com/Canadensys
@Canadensys