GOKb and Refine (Kuali Days 2013)

Post on 20-Aug-2015


• The problem space
• Tool Selection
• Enhancements
• Open Refine in Action
• Features and Limitations
• The Data Journey
• So What?

Kristin Antelman (North Carolina State University)

David Kay (Sero Consulting, UK)

Problem Space / Domain Requirement

• Unstructured messy data
– Critical data is largely poorly controlled text strings (titles, publishers)
– Data is sloppy: duplicate rows, blank rows, multiple values in single column, incorrectly formatted dates
– Standards and identifiers exist but have poor (or incorrect) adoption

• Bad data
– Titles associated with wrong identifiers
– Data is out of date (has changed)
– Key data is missing
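As a rough illustration of the "sloppy data" problems listed above, the following Python sketch removes blank and duplicate rows, splits a multi-valued cell, and normalises dates. The sample rows, column names, the ";" separator and the day-first date format are all assumptions made for this example; they are not taken from GOKb or from any publisher file.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample showing the problems named on the slide.
RAW = """title,publisher,issn,start_date
Journal of Examples,Example Press,1234-5678; 8765-4321,01/02/2010
Journal of Examples,Example Press,1234-5678; 8765-4321,01/02/2010
,,,
Annals of Samples,Sample House,2345-6789,2011-03-04
"""

# Assumed input formats; day-first is a guess for the slash-separated form.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def normalize_date(text):
    """Return an ISO date string, or the original text if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return text


def clean(raw):
    """Drop blank and duplicate rows, split multi-valued ISSNs, fix dates."""
    seen = set()
    cleaned = []
    for row in csv.DictReader(io.StringIO(raw)):
        if not any(value.strip() for value in row.values()):
            continue  # blank row
        key = tuple(value.strip() for value in row.values())
        if key in seen:
            continue  # duplicate row
        seen.add(key)
        cleaned.append({
            "title": row["title"].strip(),
            "publisher": row["publisher"].strip(),
            "issns": [v.strip() for v in row["issn"].split(";") if v.strip()],
            "start_date": normalize_date(row["start_date"]),
        })
    return cleaned


rows = clean(RAW)
for row in rows:
    print(row)
```

The "bad data" cases (wrong identifiers, outdated or missing values) cannot be fixed by rules like these alone; they need checking against an external authority.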


Library Book Lifecycle


Library E-Content Lifecycle


The Data Improvement Workflow

[Diagram. Components: Publisher Source Data, Open Refine, GOKb Database, Kuali OLE, Library. Connections labelled: Ingest (twice), API.]

From Vision to Implementation

July 2012 to October 2013

• Straw Man

• Feasibility Study

• Iterative Development

Lucas van Valckenborch (1535 or later–1597) [Public domain], via Wikimedia Commons

Aspiration

Tools Selection

Feasibility Study
Knowledge Integration – Summer 2012

Options
• Open Rules
• Drools
• DIY
• Google Refine

Considerations
• Open
• Performance
• Rule Syntax & Interface *
• Rule Management *
• Rule Precedence Support
• Auditing
• Deployment *

Open Rules

Drools Expert

DIY

Critical Factors
• Geared to the main objective
• Suited to the expected user skill sets
• Ease of deployment
• Scales in the ways we need
• An open platform for integration and extensions
• Supported by an active community

Selection of Google Refine

:= Open Refine

Open Refine Extensions

GOKb Open Refine Extensions in the current release (September 2013)

• Server side management
– Projects
– Check-out, Check-in
– Rules

• Refine UI extensions geared to GOKb expectations
– Pre-edit checks – e.g. New file? White space?
– Authority validation – e.g. Organisations
– Feedback panel – Errors and Warnings
– Access to Quick Resolutions involving stored transformations
– Pre-processing impact assessment – what this will do to the database
– Update options – Incremental and Replacement

• Post-ingest support within GOKb
– Audit trail, To do checklist
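To make the idea of pre-edit checks and authority validation concrete, here is a minimal Python sketch that reports stray white space as warnings and unrecognised organisation names as errors. It is not the GOKb extension code: the function name `check_rows`, the `KNOWN_ORGANISATIONS` set, the column names and the message wording are all hypothetical.

```python
# Hypothetical authority list; a real one would come from the GOKb server.
KNOWN_ORGANISATIONS = {"Example Press", "Sample House"}


def check_rows(rows):
    """Return (errors, warnings) as two lists of human-readable messages."""
    errors, warnings = [], []
    for number, row in enumerate(rows, start=1):
        # Pre-edit check: leading or trailing white space in any cell.
        for column, value in row.items():
            if value != value.strip():
                warnings.append(f"row {number}: stray white space in '{column}'")
        # Authority validation: the publisher must be a known organisation.
        publisher = row.get("publisher", "").strip()
        if publisher not in KNOWN_ORGANISATIONS:
            errors.append(f"row {number}: unknown organisation '{publisher}'")
    return errors, warnings


sample = [
    {"title": "Journal of Examples ", "publisher": "Example Press"},
    {"title": "Annals of Samples", "publisher": "Sampel House"},
]
errors, warnings = check_rows(sample)
print("Errors:", errors)
print("Warnings:", warnings)
```

Separating errors from warnings mirrors the feedback panel described above: a warning could be cleared by a stored transformation such as trimming, while an error needs a decision from the editor.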

GOKb Open Refine Screencast

Why Open Refine is a good fit for us (and may be for you as well)

• Extensible
• Supports collaboration/shared workspace
• Supports users at multiple levels of expertise
– Cross between a spreadsheet and a database for novices
– GREL, JSON scripting
– API calls to external data sets

• But sometimes it’s not the right tool…

Round Trip Data Journey

[Diagram. Components: Open Refine, GOKb Database, Target Applications (e.g. OLE). Connections labelled: Routes 1 to 4, Ingest, API (twice).]

• Route 1 – New project
• Route 2 – CRED user edits
• Route 3 – Update project
• Route 4 – CRED Delta ingest

What’s Next? The Round Trip

RESTful APIs supporting JSON
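One way to picture an update or delta ingest is as a comparison between the previously ingested records and the newly cleaned ones, serialised as JSON for transfer over an API. The Python sketch below shows that idea only. The record layout, the identifier keys and the "added" / "removed" / "changed" structure are assumptions for illustration and do not describe GOKb's actual API or payload format.

```python
import json


def compute_delta(old, new):
    """Compare two {identifier: record} mappings and describe the difference."""
    return {
        "added": {k: new[k] for k in new if k not in old},
        "removed": sorted(k for k in old if k not in new),
        "changed": {k: new[k] for k in new if k in old and old[k] != new[k]},
    }


# Hypothetical records keyed by ISSN.
previous = {
    "1234-5678": {"title": "Journal of Examples", "publisher": "Example Press"},
    "2345-6789": {"title": "Annals of Samples", "publisher": "Sample House"},
}
current = {
    "1234-5678": {"title": "Journal of Examples", "publisher": "Example Press Ltd"},
    "3456-7890": {"title": "Review of Instances", "publisher": "Sample House"},
}

delta = compute_delta(previous, current)
payload = json.dumps(delta, indent=2, sort_keys=True)
print(payload)
```

Sending only the delta corresponds loosely to an incremental update; sending the full `current` mapping would correspond to a replacement.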

So what? Or … why might you be interested?

The Application
• Data cleansing / enhancement
• Reuse … Automation
• Managing distributed activity
• Leveraging Refine and Excel user skills
• Note – GOKb Extensions are Open Source

The Meta Challenge
• Kuali software and the evolving ecosystem
• Tool selection
• An example of community innovation

Open Refine Resources

Tutorials, FAQs and the Open Refine wiki
http://openrefine.org/documentation

About GREL
https://github.com/OpenRefine/OpenRefine/wiki/Understanding-Expressions-

Common formulas for editing with GREL
https://github.com/OpenRefine/OpenRefine/wiki/Recipes

Step-by-step tutorials
http://www.davidhuynh.net/spaces/nicar2011/tutorial.pdf
http://freeyourmetadata.org

Book by the freeyourmetadata authors
http://www.packtpub.com/openrefine-guide-for-data-analysis-and-linking-dataset-to-the-web/book

GOKb guidance on Open Refine
https://wiki.kuali.org/display/OLE/OpenRefine

Twitter @OpenRefine
