![Page 1: Kurator: Towards Data Curation for Mere Mortals](https://reader035.vdocuments.us/reader035/viewer/2022062710/55b1f1a9bb61eb845c8b4629/html5/thumbnails/1.jpg)
Kurator: Towards Data Curation Workflows for Mere Mortals
An extensible, open-source workflow platform for users & makers of data curation tools
B. Ludäscher J. Hanken D. Lowery J.A. Macklin T. McPhillips P.J. Morris R.A. Morris T. Song
![Page 2: Kurator: Towards Data Curation for Mere Mortals](https://reader035.vdocuments.us/reader035/viewer/2022062710/55b1f1a9bb61eb845c8b4629/html5/thumbnails/2.jpg)
SPNHC'15 Kurator/P 2
Problem: Data & Metadata Quality
• Collections & occurrence data can be all over the map
• Examples:
– Lat/Long transposition, other geo-ref issues (projections, …)
– Scientific Names (spelling errors, other)
– Data entry/creation, “fuzzy” data, naming issues, bit rot, data conversions and transformations, schema mappings, …
• Related:
– Filtered-Push
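As an illustration of the Lat/Long transposition issue listed above, here is a minimal sketch of a range-based check. The function name `flag_latlon` and the status strings are hypothetical and not part of Kurator; the only assumption is the usual coordinate ranges (latitude in [-90, 90], longitude in [-180, 180]).

```python
def flag_latlon(lat, lon):
    """Return a (status, lat, lon) triple for one coordinate pair.

    Hypothetical rule: if the pair is out of range as given but in range
    when swapped, flag it as a likely transposition and return the
    swapped values as a suggested repair.
    """
    def in_range(la, lo):
        return -90 <= la <= 90 and -180 <= lo <= 180

    if in_range(lat, lon):
        return ("ok", lat, lon)
    if in_range(lon, lat):
        return ("possibly transposed", lon, lat)
    return ("out of range", lat, lon)


if __name__ == "__main__":
    print(flag_latlon(42.38, -71.11))
    print(flag_latlon(-120.5, 38.5))
```

A check like this only catches swaps where the longitude lies outside the latitude range; a transposed pair with both values within ±90 would need extra evidence, for example the stated country.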
![Page 3: Kurator: Towards Data Curation for Mere Mortals](https://reader035.vdocuments.us/reader035/viewer/2022062710/55b1f1a9bb61eb845c8b4629/html5/thumbnails/3.jpg)
What problems are we trying to solve?
• Detect and flag data quality issues
• Repair if possible
– … ask human curators as needed
• Keep track of provenance
– (semi-)automatic repairs
– human curators’ edits
• Employ workflow (semi-)automation
– Scientific workflow systems:
• Kepler/COMAD, Restflow, Galaxy, Biovel/Taverna, Argo, VisTrails, …
– Related technologies
• Akka parallel execution platform
• Script-based automation (e.g. Python, R), digital notebooks (iPython)
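The "repair and keep track of provenance" goal above can be sketched in a few lines of script-based automation. This is an illustrative assumption, not Kurator's data model: the class `CuratedRecord`, its `repair` method and the provenance fields are all hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class CuratedRecord:
    """One occurrence record plus a log of every change made to it."""
    values: dict
    provenance: list = field(default_factory=list)

    def repair(self, fieldname, new_value, agent, comment):
        # Log the old value, the new value and who made the change
        # (a service or a human curator) before applying the repair.
        self.provenance.append({
            "field": fieldname,
            "old": self.values.get(fieldname),
            "new": new_value,
            "agent": agent,
            "comment": comment,
        })
        self.values[fieldname] = new_value


if __name__ == "__main__":
    rec = CuratedRecord({"scientificName": "Quercus albba"})
    rec.repair("scientificName", "Quercus alba",
               agent="name-checker (hypothetical)",
               comment="spelling corrected")
    print(rec.values, rec.provenance)
```

Keeping automatic repairs and human edits in the same log means both kinds of change can be reviewed or reverted later.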
![Page 4: Kurator: Towards Data Curation for Mere Mortals](https://reader035.vdocuments.us/reader035/viewer/2022062710/55b1f1a9bb61eb845c8b4629/html5/thumbnails/4.jpg)
Customers of Curation Workflows
• Collection Managers
– … who are managing the collections databases
– Can run curation workflows periodically
• … in the presence of new data and/or new curation services
• (Biodiversity) Researchers
– To perform an analysis in the presence of (partially) dirty data, researchers need to
• Clean or fix dirty data
• Throw out unfixable data
– Reporting back to the collection managers (cf. FPush)
![Page 5: Kurator: Towards Data Curation for Mere Mortals](https://reader035.vdocuments.us/reader035/viewer/2022062710/55b1f1a9bb61eb845c8b4629/html5/thumbnails/5.jpg)
How we do it (Part 1 …): Kepler curation workflows
• Why workflows? ASAP!
Dou, Lei., G. Cao, P.J. Morris, R.A. Morris, B. Ludäscher, J.A. Macklin, J. Hanken. 2012. Kurator: A Kepler Package for Data Curation Workflows, Procedia Computer Science, 9:1614-1619, doi:10.1016/j.procs.2012.04.177
![Page 6: Kurator: Towards Data Curation for Mere Mortals](https://reader035.vdocuments.us/reader035/viewer/2022062710/55b1f1a9bb61eb845c8b4629/html5/thumbnails/6.jpg)
Scientific Workflows: ASAP!
• Automation
– wfs to automate computational aspects of science
• Scaling (exploit and optimize machine cycles)
– wfs should make use of parallel compute resources
– wfs should be able to handle large data
• Abstraction, Evolution, Reuse (human cycles)
– wfs should be easy to (re-)use, evolve, share
• Provenance
– wfs should capture processing history, data lineage
traceable data- and wf-evolution Reproducible Science
TridentWorkbench
VisTrails
![Page 7: Kurator: Towards Data Curation for Mere Mortals](https://reader035.vdocuments.us/reader035/viewer/2022062710/55b1f1a9bb61eb845c8b4629/html5/thumbnails/7.jpg)
Scientific workflows: a(nother) silver bullet?
Beware of the Turing tar-pit in which everything is possible but nothing of interest is easy.
—Alan Perlis
![Page 8: Kurator: Towards Data Curation for Mere Mortals](https://reader035.vdocuments.us/reader035/viewer/2022062710/55b1f1a9bb61eb845c8b4629/html5/thumbnails/8.jpg)
I beg your pardon, I never promised ..
“Thanks to our Graphical UI your scientific workflows will be much easier to develop, understand and maintain!”
Hmm… this was supposed to be easier than programming!
![Page 9: Kurator: Towards Data Curation for Mere Mortals](https://reader035.vdocuments.us/reader035/viewer/2022062710/55b1f1a9bb61eb845c8b4629/html5/thumbnails/9.jpg)
Scientific Workflows …
Cabellos et al. Computer Physics Communications 182, 2011
![Page 10: Kurator: Towards Data Curation for Mere Mortals](https://reader035.vdocuments.us/reader035/viewer/2022062710/55b1f1a9bb61eb845c8b4629/html5/thumbnails/10.jpg)
… are a wonderful thing …
Norbert Podhorszki (then: UC Davis)
![Page 11: Kurator: Towards Data Curation for Mere Mortals](https://reader035.vdocuments.us/reader035/viewer/2022062710/55b1f1a9bb61eb845c8b4629/html5/thumbnails/11.jpg)
… after simplifying a bit (here: Kepler/COMAD)
Sven Köhler (then: UC Davis)
![Page 12: Kurator: Towards Data Curation for Mere Mortals](https://reader035.vdocuments.us/reader035/viewer/2022062710/55b1f1a9bb61eb845c8b4629/html5/thumbnails/12.jpg)
So many systems (models of computation ~ languages, … )
… so little time …
Sven Köhler (then: UC Davis)
![Page 13: Kurator: Towards Data Curation for Mere Mortals](https://reader035.vdocuments.us/reader035/viewer/2022062710/55b1f1a9bb61eb845c8b4629/html5/thumbnails/13.jpg)
Workflow Systems: Learning to program, all over again …
![Page 14: Kurator: Towards Data Curation for Mere Mortals](https://reader035.vdocuments.us/reader035/viewer/2022062710/55b1f1a9bb61eb845c8b4629/html5/thumbnails/14.jpg)
Scientific Workflow: it’s called R&D for a reason …
Workflow Modeling & Design (prospective provenance)
Runtime Provenance (traces, retrospective provenance)
Fault-tolerance, crash recovery
Scalability, parallel processing
![Page 15: Kurator: Towards Data Curation for Mere Mortals](https://reader035.vdocuments.us/reader035/viewer/2022062710/55b1f1a9bb61eb845c8b4629/html5/thumbnails/15.jpg)
Meanwhile, on a nearby planet …
Highly dynamic visualization (so dynamic, it’s hard to capture)
![Page 16: Kurator: Towards Data Curation for Mere Mortals](https://reader035.vdocuments.us/reader035/viewer/2022062710/55b1f1a9bb61eb845c8b4629/html5/thumbnails/16.jpg)
It’s time to Shift Control …
• … back from being consumers of tools “Just click here!”
• ... to tool makers!
• Kurator/P:
– Yes, develop for end users …
– … but don’t forget the tool makers!
• Can we do this together?
![Page 17: Kurator: Towards Data Curation for Mere Mortals](https://reader035.vdocuments.us/reader035/viewer/2022062710/55b1f1a9bb61eb845c8b4629/html5/thumbnails/17.jpg)
How we do it (Part 2 of … )
• Kurator
– Apply workflow technologies and workflow thinking
– … in a technology agnostic way (if possible)
Build a library of curation services such that curation workflows can be run from various platforms
– Scientific workflow systems
• e.g. Restflow, Kepler, Taverna, Galaxy
– Other platforms
• e.g. Akka, Python-based, …
• … leveraging existing technologies
![Page 18: Kurator: Towards Data Curation for Mere Mortals](https://reader035.vdocuments.us/reader035/viewer/2022062710/55b1f1a9bb61eb845c8b4629/html5/thumbnails/18.jpg)
How we do it (Part 3 of … )
• YesWorkflow
– Grass-roots effort (goes well with stone-soup…)
– Scripts + User annotations can give us much of ASAP!
• Key Ideas:
– Meet the tool makers and researchers (R, Python, …)
– Make them workflow/dataflow thinkers …
– … but giving them workflow benefits (ASAP!)
– … via simple annotations!
![Page 19: Kurator: Towards Data Curation for Mere Mortals](https://reader035.vdocuments.us/reader035/viewer/2022062710/55b1f1a9bb61eb845c8b4629/html5/thumbnails/19.jpg)
SKOPE: Synthesized Knowledge Of Past Environments
Bocinsky, Kohler et al. study rain-fed maize of Anasazi – Four Corners; AD 600–1500. Climate change influenced Mesa Verde Migrations; late 13th century AD. Uses network of tree-ring chronologies to reconstruct a spatio-temporal climate field at a fairly high resolution (~800 m) from AD 1–2000. Algorithm estimates joint information in tree-rings and a climate signal to identify “best” tree-ring chronologies for climate reconstruction.
K. Bocinsky, T. Kohler, A 2000-year reconstruction of the rain-fed maize agricultural niche in the US Southwest. Nature Communications. doi:10.1038/ncomms6618
… implemented as an R Script …
![Page 20: Kurator: Towards Data Curation for Mere Mortals](https://reader035.vdocuments.us/reader035/viewer/2022062710/55b1f1a9bb61eb845c8b4629/html5/thumbnails/20.jpg)
User Comments: YW Annotations
@begin GO_Analysis
@in hgCutoff
@in …
@out BP_Summl_file
@out …
@end GO_Analysis
...
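To show how such annotations sit inside an ordinary script, here is a small Python sketch. It uses only the four keywords shown on the slide (`@begin`, `@in`, `@out`, `@end`) inside regular comments; the block names, variable names and the cleaning logic are made up for illustration and are unrelated to the GO_Analysis script.

```python
# @begin clean_names
# @in raw_names
# @out cleaned_names
def clean_names(raw_names):
    # @begin strip_whitespace
    # @in raw_names
    # @out stripped_names
    stripped_names = [name.strip() for name in raw_names]
    # @end strip_whitespace

    # @begin drop_empty
    # @in stripped_names
    # @out cleaned_names
    cleaned_names = [name for name in stripped_names if name]
    # @end drop_empty
    return cleaned_names
# @end clean_names


if __name__ == "__main__":
    print(clean_names(["  Quercus alba ", "", "Acer rubrum"]))
```

Because the annotations are plain comments, the script runs exactly as before; they only declare which named steps exist and which data flows between them.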
![Page 21: Kurator: Towards Data Curation for Mere Mortals](https://reader035.vdocuments.us/reader035/viewer/2022062710/55b1f1a9bb61eb845c8b4629/html5/thumbnails/21.jpg)
Get 3 views for the price of 1!
Process view
Data view
Combined view
![Page 22: Kurator: Towards Data Curation for Mere Mortals](https://reader035.vdocuments.us/reader035/viewer/2022062710/55b1f1a9bb61eb845c8b4629/html5/thumbnails/22.jpg)
Paleoclimate Reconstruction (EnviRecon.org)
• … explained using YesWorkflow!
Kyle B., (computational) archaeologist: "It took me about 20 minutes to comment. Less than an hour to learn and YW-annotate, all-told."
SKOPE Kurator
++
=> YesWorkflow.org
![Page 23: Kurator: Towards Data Curation for Mere Mortals](https://reader035.vdocuments.us/reader035/viewer/2022062710/55b1f1a9bb61eb845c8b4629/html5/thumbnails/23.jpg)
The Road ahead …
• YesWorkflow:
– … finishing support for retrospective provenance without using a runtime provenance recorder!
– Key insight: scientists already leave provenance “bread crumbs” behind! (it’s not an accident!)
– Exploit that via annotations: URI-templates
• Kurator[/P]:
– How far can we go towards ASAP via YW?
![Page 24: Kurator: Towards Data Curation for Mere Mortals](https://reader035.vdocuments.us/reader035/viewer/2022062710/55b1f1a9bb61eb845c8b4629/html5/thumbnails/24.jpg)
YesWorkflow.org
![Page 25: Kurator: Towards Data Curation for Mere Mortals](https://reader035.vdocuments.us/reader035/viewer/2022062710/55b1f1a9bb61eb845c8b4629/html5/thumbnails/25.jpg)
YW-RECON: Prospective & Retrospective Provenance … (almost) for free!
• YW annotations in the script (R, Python, Matlab) are used to recreate the workflow view from the script …
YW
![Page 26: Kurator: Towards Data Curation for Mere Mortals](https://reader035.vdocuments.us/reader035/viewer/2022062710/55b1f1a9bb61eb845c8b4629/html5/thumbnails/26.jpg)
YW-RECON: Prospective & Retrospective Provenance … (almost) for free!
• URI-templates link conceptual entities to runtime provenance “left behind” by the script author …
• … facilitating provenance reconstruction
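The idea of reconstructing provenance from files a script leaves behind can be sketched as template matching. This is not YesWorkflow's implementation: the helper names, the `{variable}` template syntax as used here, and the example paths are assumptions chosen for illustration.

```python
import re


def template_to_regex(template):
    """Compile a template like 'out/{site}_{year}.csv' into a regex.

    Each {name} becomes a named group matching one path segment or less.
    """
    parts = re.split(r"\{(\w+)\}", template)
    pattern = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            pattern += re.escape(part)               # literal text
        else:
            pattern += "(?P<%s>[^/]+?)" % part       # template variable
    return re.compile("^" + pattern + "$")


def match_files(template, paths):
    """Return (path, variable bindings) for every path fitting the template."""
    regex = template_to_regex(template)
    bindings = []
    for path in paths:
        m = regex.match(path)
        if m:
            bindings.append((path, m.groupdict()))
    return bindings


if __name__ == "__main__":
    files = ["out/mesaverde_1250.csv", "out/readme.txt"]
    print(match_files("out/{site}_{year}.csv", files))
```

Each match ties a concrete file to the conceptual output it instantiates, together with the variable values encoded in its name. This sketch assumes each variable appears only once per template.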
![Page 27: Kurator: Towards Data Curation for Mere Mortals](https://reader035.vdocuments.us/reader035/viewer/2022062710/55b1f1a9bb61eb845c8b4629/html5/thumbnails/27.jpg)
Summary: Data Curation with Scientific Workflow Systems
Scientific Workflows
• [+] Automation
• [+] Scalability
• [+] Abstraction
• [+] Provenance
• …
• [+/0] Easy to use
– [0] learning a new paradigm
• [-] Teaching resources: learning a new language!
• [-] Special expertise needed for deep changes, e.g. new Java actors, shims, …
![Page 28: Kurator: Towards Data Curation for Mere Mortals](https://reader035.vdocuments.us/reader035/viewer/2022062710/55b1f1a9bb61eb845c8b4629/html5/thumbnails/28.jpg)
Kurator/P: Scripts + YesWorkflow ++
Scripts: [+] Automation, [0] Scalability, [-] Abstraction, [0/-] Provenance
Now: Scripts + YesWorkflow Annotations
• [+] Abstraction
– explain your methods to mere mortals => encourage (re-)use
• [+] Provenance:
– YesWorkflow (prospective and retrospective provenance)
• [+] Language independent (R, Matlab, Python, …)
• [+] Empower tool makers (script programmers): give them …
– … some immediate benefits (workflow views, retrospective provenance)
– … some long term benefits: think about your methods differently => dataflow programming => [+] Scalability
![Page 29: Kurator: Towards Data Curation for Mere Mortals](https://reader035.vdocuments.us/reader035/viewer/2022062710/55b1f1a9bb61eb845c8b4629/html5/thumbnails/29.jpg)
Acknowledgments
• NSF-DBI #1356751 – Collaborative Research: ABI Development: Kurator: A Provenance-enabled Workflow Platform and Toolkit to Curate Biodiversity Data
![Page 30: Kurator: Towards Data Curation for Mere Mortals](https://reader035.vdocuments.us/reader035/viewer/2022062710/55b1f1a9bb61eb845c8b4629/html5/thumbnails/30.jpg)
Additional Material
![Page 31: Kurator: Towards Data Curation for Mere Mortals](https://reader035.vdocuments.us/reader035/viewer/2022062710/55b1f1a9bb61eb845c8b4629/html5/thumbnails/31.jpg)
Date Validation
• Check:
– Collector’s life span
– .. vs. Date-Collected
• Possible outcomes:
– Valid
– Corrected
– Unable to validate
• Internal inconsistency
– Contradicting dates
• External inconsistency
– Lack of date data
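A minimal sketch of the life-span check with the three outcomes listed above. The repair rule used here (swapping the last two digits of the year, as a stand-in for a data-entry typo) is purely a hypothetical example and not the rule used by the Kurator validators; the function name is also an assumption.

```python
def validate_year_collected(year, birth, death):
    """Compare a collecting year with the collector's life span.

    Returns (outcome, year), where outcome is one of
    'Valid', 'Corrected' or 'Unable to validate'.
    """
    # Missing date data: nothing to compare against.
    if year is None or birth is None or death is None:
        return ("Unable to validate", year)
    if birth <= year <= death:
        return ("Valid", year)
    # Hypothetical repair: try swapping the last two digits of a 4-digit year.
    digits = str(year)
    if len(digits) == 4:
        swapped = int(digits[:2] + digits[3] + digits[2])
        if birth <= swapped <= death:
            return ("Corrected", swapped)
    return ("Unable to validate", year)


if __name__ == "__main__":
    print(validate_year_collected(1875, 1850, 1910))
    print(validate_year_collected(1809, 1850, 1910))
    print(validate_year_collected(None, 1850, 1910))
```

In practice a proposed correction like this would be flagged for a human curator to confirm before it replaces the stored value.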
![Page 32: Kurator: Towards Data Curation for Mere Mortals](https://reader035.vdocuments.us/reader035/viewer/2022062710/55b1f1a9bb61eb845c8b4629/html5/thumbnails/32.jpg)
… Logic Behind Each Step (cont’d)
• Scientific Name Validation
– Customer-dependent:
• Collection Managers:
– Nomenclature
• Researchers:
– Taxonomy (current names)
– Several Remote services
• IPNI, GNI, …
• …. <your logic here> …
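As a placeholder for "your logic here", the sketch below validates a name against a small local reference list using fuzzy string matching from Python's standard `difflib` module. It does not call IPNI, GNI or any other remote service; the function name, the reference list and the 0.85 cutoff are assumptions for illustration.

```python
import difflib


def validate_name(name, reference_names, cutoff=0.85):
    """Return (outcome, name) for one scientific name.

    Exact matches are 'Valid'; a sufficiently close match in the
    reference list is proposed as 'Corrected'; otherwise the name
    is left as 'Unable to validate'.
    """
    if name in reference_names:
        return ("Valid", name)
    close = difflib.get_close_matches(name, reference_names, n=1, cutoff=cutoff)
    if close:
        return ("Corrected", close[0])
    return ("Unable to validate", name)


if __name__ == "__main__":
    reference = ["Quercus alba", "Acer rubrum"]
    print(validate_name("Quercus albba", reference))
    print(validate_name("Zzz", reference))
```

Which reference list to use depends on the customer: a nomenclatural list for collection managers, a list of currently accepted names for researchers.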
![Page 33: Kurator: Towards Data Curation for Mere Mortals](https://reader035.vdocuments.us/reader035/viewer/2022062710/55b1f1a9bb61eb845c8b4629/html5/thumbnails/33.jpg)
Simplified Example Workflow
• Related Research (Tianhong Song, UC Davis)
– Analyze linear workflow “story”
– Use patterns to discover wf design issues (e.g. use before update); then fix them
– Parallelize when possible
• Allow easy assembly of such workflows
• For tool makers
• … and tool users
• … scalability …
![Page 34: Kurator: Towards Data Curation for Mere Mortals](https://reader035.vdocuments.us/reader035/viewer/2022062710/55b1f1a9bb61eb845c8b4629/html5/thumbnails/34.jpg)
Example Output …
![Page 35: Kurator: Towards Data Curation for Mere Mortals](https://reader035.vdocuments.us/reader035/viewer/2022062710/55b1f1a9bb61eb845c8b4629/html5/thumbnails/35.jpg)
… close up …
![Page 36: Kurator: Towards Data Curation for Mere Mortals](https://reader035.vdocuments.us/reader035/viewer/2022062710/55b1f1a9bb61eb845c8b4629/html5/thumbnails/36.jpg)
FilteredPush Curation Provenance (Spreadsheet View)
![Page 37: Kurator: Towards Data Curation for Mere Mortals](https://reader035.vdocuments.us/reader035/viewer/2022062710/55b1f1a9bb61eb845c8b4629/html5/thumbnails/37.jpg)
Agile Kurator Development
![Page 38: Kurator: Towards Data Curation for Mere Mortals](https://reader035.vdocuments.us/reader035/viewer/2022062710/55b1f1a9bb61eb845c8b4629/html5/thumbnails/38.jpg)
Related Research (Tianhong Song, UC Davis)
• Analyze linear workflow “story”
• Use patterns to discover wf design issues (e.g. use before update); then fix them
• Parallelize when possible
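One way to read the "use before update" pattern is: a step reads a field that a later step in the linear workflow rewrites, so the earlier step worked on a stale value. The sketch below detects that pattern under this reading; the step representation as (name, reads, writes) tuples and the example step names are hypothetical, not taken from the cited research.

```python
def use_before_update(steps):
    """Find fields that are read by one step and updated by a later one.

    steps: list of (name, reads, writes) in execution order, where
    reads and writes are sets of field names.
    Returns a list of (field, reading_step, updating_step) tuples.
    """
    issues = []
    for i, (name_i, reads_i, _) in enumerate(steps):
        for name_j, _, writes_j in steps[i + 1:]:
            for fieldname in sorted(set(reads_i) & set(writes_j)):
                issues.append((fieldname, name_i, name_j))
    return issues


if __name__ == "__main__":
    workflow = [
        ("georef_check", {"country", "latitude"}, set()),
        ("name_check", {"scientificName"}, {"scientificName"}),
        ("country_repair", {"country"}, {"country"}),
    ]
    print(use_before_update(workflow))
```

A reported issue suggests reordering the steps; steps that share no fields are candidates for running in parallel.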
![Page 39: Kurator: Towards Data Curation for Mere Mortals](https://reader035.vdocuments.us/reader035/viewer/2022062710/55b1f1a9bb61eb845c8b4629/html5/thumbnails/39.jpg)
Contact me!
• If you’re interested in a project, research theme (or similar ones): Send me email!
– Email: [email protected]