GOKb and Refine (Kuali Days 2013)
Post on 20-Aug-2015
• The problem space
• Tool selection
• Enhancements
• Open Refine in action
• Features and limitations
• The data journey
• So what?
Kristin Antelman (North Carolina State University)
David Kay (Sero Consulting, UK)
Problem Space / Domain Requirement
• Unstructured messy data
  – Critical data is largely poorly controlled text strings (titles, publishers)
  – Data is sloppy: duplicate rows, blank rows, multiple values in a single column, incorrectly formatted dates
  – Standards and identifiers exist but are poorly or incorrectly adopted
• Bad data
  – Titles associated with wrong identifiers
  – Data is out of date (has changed)
  – Key data is missing
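The kinds of sloppiness listed above can be made concrete with a small sketch. It is illustrative only, not GOKb or Open Refine code: the column names (`title`, `issn`, `start_date`), the `;` separator and the accepted date formats are all assumptions chosen for the example.

```python
import csv
import io
from datetime import datetime

# Hypothetical date formats, chosen only for illustration.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def normalise_date(text):
    """Return an ISO date string, or None when no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean_rows(rows):
    """Trim cells, drop blank and duplicate rows, split multi-valued ISSNs."""
    seen = set()
    cleaned = []
    for row in rows:
        row = {key: (value or "").strip() for key, value in row.items()}
        if not any(row.values()):
            continue  # blank row
        fingerprint = tuple(sorted(row.items()))
        if fingerprint in seen:
            continue  # exact duplicate once white space is trimmed
        seen.add(fingerprint)
        # Multiple values in a single column: split on the assumed ';' separator.
        row["issn"] = [part.strip() for part in row["issn"].split(";") if part.strip()]
        # Incorrectly formatted dates become None so they can be flagged later.
        row["start_date"] = normalise_date(row["start_date"])
        cleaned.append(row)
    return cleaned


# Made-up sample data (hypothetical column names).
SAMPLE = """title,issn,start_date
 Journal of Examples ,1234-5678; 8765-4321,01/02/2010
Journal of Examples,1234-5678; 8765-4321,01/02/2010
,,
Another Title,1111-2222,2011-13-45
"""

rows = clean_rows(csv.DictReader(io.StringIO(SAMPLE)))
for row in rows:
    print(row)
```

Of the four sample rows, the duplicate and the blank row are dropped, and the impossible date is left as `None` rather than guessed at. The "bad data" cases (wrong identifiers, out-of-date or missing values) cannot be fixed by mechanical rules like these and need an external authority to check against.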
Problem Space / Domain Requirement
Library Book Lifecycle
Library E-Content Lifecycle
The Data Improvement Workflow
[Diagram. Labels: Publisher Source Data, Ingest, Open Refine, Ingest, GOKb Database, API, Kuali OLE, Library]
From Vision to Implementation
July 2012 to October 2013
• Straw Man
• Feasibility Study
• Iterative Development
Lucas van Valckenborch (1535 or later–1597) [Public domain], via Wikimedia Commons
Aspiration
Tools Selection
Feasibility Study
Knowledge Integration – Summer 2012
Options
• Open Rules
• Drools
• DIY
• Google Refine
Considerations
• Open
• Performance
• Rule Syntax & Interface *
• Rule Management *
• Rule Precedence Support
• Auditing
• Deployment *
Open Rules
Drools Expert
DIY
Critical Factors
• Geared to the main objective
• Suited to the expected user skill sets
• Ease of deployment
• Scales in the ways we need
• An open platform for integration and extensions
• Supported by an active community
Selection of Google Refine (now Open Refine)
Open Refine Extensions
GOKb Open Refine Extensions in the current release (September 2013)
• Server side management
  – Projects
  – Check-out, Check-in
  – Rules
• Refine UI extensions geared to GOKb expectations
  – Pre-edit checks – e.g. New file? White space?
  – Authority validation – e.g. Organisations
  – Feedback panel – Errors and Warnings
  – Access to Quick Resolutions involving stored transformations
  – Pre-processing impact assessment – what this will do to the database
  – Update options – Incremental and Replacement
• Post-ingest support within GOKb
  – Audit trail, To do checklist
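As a rough illustration of how pre-edit checks, authority validation and an errors-and-warnings feedback panel fit together, here is a minimal sketch. It is not the GOKb extension code: the function name, the `publisher` column and the in-memory authority list are hypothetical, and in practice the known organisations would presumably be looked up on the GOKb server.

```python
# Hypothetical authority list, standing in for a lookup against the GOKb server.
KNOWN_ORGANISATIONS = {"Example Press", "Sample University Press"}


def pre_edit_checks(rows, org_column="publisher"):
    """Return (errors, warnings) as lists of human-readable messages."""
    errors, warnings = [], []
    for number, row in enumerate(rows, start=1):
        # White-space check: worth a warning, since it can be fixed automatically.
        for column, value in row.items():
            if value != value.strip():
                warnings.append(
                    f"Row {number}: leading or trailing white space in '{column}'"
                )
        # Authority validation: an unknown or missing organisation is an error.
        org = row.get(org_column, "").strip()
        if not org:
            errors.append(f"Row {number}: '{org_column}' is empty")
        elif org not in KNOWN_ORGANISATIONS:
            errors.append(f"Row {number}: unknown organisation '{org}'")
    return errors, warnings


# Made-up sample rows.
sample_rows = [
    {"title": "Journal of Examples ", "publisher": "Example Press"},
    {"title": "Another Title", "publisher": "Unknown House"},
    {"title": "Third Title", "publisher": ""},
]

errors, warnings = pre_edit_checks(sample_rows)
for message in errors:
    print("ERROR  ", message)
for message in warnings:
    print("WARNING", message)
```

Separating errors from warnings mirrors the feedback panel idea: a warning can be cleared with a stored transformation such as trimming, while an error needs a decision from the editor before ingest.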
GOKb Open Refine Screencast
Why Open Refine is a good fit for us (and may be for you as well)
• Extensible
• Supports collaboration/shared workspace
• Supports users at multiple levels of expertise
  – Cross between a spreadsheet and a database for novices
  – GREL, JSON scripting
  – API calls to external data sets
• But sometimes it’s not the right tool…
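To give a flavour of "GREL, JSON scripting": Open Refine can export the steps applied to a project as a JSON list of operations, each of which may carry a GREL expression, and that list can be re-applied to another project. The sketch below builds such a list in Python. The field names follow the operation JSON as commonly seen in Open Refine exports, but treat them as assumptions and compare against an export from your own installation; the column name `title` is hypothetical.

```python
import json


def text_transform(column, grel_expression, description):
    """Build one text-transform operation (field names are assumed, see above)."""
    return {
        "op": "core/text-transform",
        "description": description,
        "engineConfig": {"facets": [], "mode": "row-based"},
        "columnName": column,
        "expression": "grel:" + grel_expression,
        "onError": "keep-original",
        "repeat": False,
        "repeatCount": 10,
    }


operations = [
    # GREL: remove leading and trailing white space.
    text_transform("title", "value.trim()", "Trim white space in title"),
    # GREL: collapse runs of white space into a single space.
    text_transform(
        "title", "value.replace(/\\s+/, ' ')", "Collapse repeated spaces in title"
    ),
]

operations_json = json.dumps(operations, indent=2)
print(operations_json)
```

Keeping transformations as data rather than as manual edits is what makes them reusable across files from the same publisher.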
Round Trip Data Journey
[Diagram. Nodes: OpenRefine, GOKb Database, Target Applications (e.g. OLE). Connections labelled Ingest, API, API and Routes 1 to 4]
• Route 1 – New project
• Route 2 – CRED user edits
• Route 3 – Update project
• Route 4 – CRED Delta ingest
What’s Next? The Round Trip
RESTful APIs Supporting JSON
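One way to picture a delta travelling over a JSON API is sketched below. It is an assumption-laden illustration, not the GOKb API: the function name, the record shape and the `added`/`changed`/`removed` keys are all hypothetical, and no real endpoint is called.

```python
import json


def compute_delta(previous, current):
    """Compare two snapshots keyed by identifier and describe the difference."""
    added = {key: current[key] for key in current.keys() - previous.keys()}
    removed = sorted(previous.keys() - current.keys())
    changed = {
        key: current[key]
        for key in current.keys() & previous.keys()
        if current[key] != previous[key]
    }
    return {"added": added, "changed": changed, "removed": removed}


# Made-up snapshots keyed by a made-up identifier.
previous = {
    "1234-5678": {"title": "Journal of Examples"},
    "1111-2222": {"title": "Old Title"},
    "9999-0000": {"title": "Ceased Title"},
}
current = {
    "1234-5678": {"title": "Journal of Examples"},
    "1111-2222": {"title": "New Title"},
    "3333-4444": {"title": "Brand New"},
}

delta = compute_delta(previous, current)
payload = json.dumps(delta, sort_keys=True, indent=2)
print(payload)
```

Sending only the delta keeps an incremental update small; a replacement update would send the whole current snapshot instead.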
So what? Or … why might you be interested?
The Application
• Data cleansing / enhancement
• Reuse … Automation
• Managing distributed activity
• Leveraging Refine and Excel user skills
• Note – GOKb Extensions are Open Source
The Meta Challenge
• Kuali software and the evolving ecosystem
• Tool selection
• An example of community innovation
Open Refine Resources
• Tutorials, FAQs and the Open Refine wiki: http://openrefine.org/documentation.
• About GREL: https://github.com/OpenRefine/OpenRefine/wiki/Understanding-Expressions-
• Common formulas for editing with GREL: https://github.com/OpenRefine/OpenRefine/wiki/Recipes
• Step-by-step tutorials: http://www.davidhuynh.net/spaces/nicar2011/tutorial.pdf, http://freeyourmetadata.org
• Book by the freeyourmetadata authors: http://www.packtpub.com/openrefine-guide-for-data-analysis-and-linking-dataset-to-the-web/book
• GOKb guidance on Open Refine: https://wiki.kuali.org/display/OLE/OpenRefine
• Twitter: @OpenRefine