open refine to update and clean up your messy data

16
OPEN REFINE TO PROFILE AND CLEAN UP YOUR MESSY DATA

Upload: university-of-connecticut-libraries

Post on 06-Jan-2017

110 views

Category:

Education


2 download

TRANSCRIPT

Page 1: Open refine to update and clean up your messy data

OPEN REFINE TO PROFILE AND CLEAN UP YOUR

MESSY DATA

Page 2: Open refine to update and clean up your messy data

THE WOES OF DATA

NOTE ALL DATA ARE MADE EQUAL!

DATA ARE CREATED FROM A VARIETY OF SOURCES:

• AUTOMATICALLY GENERATED

• MANUALLY CREATED AND MANAGED

• SCRAPPED FROM THE WEB

• TAKEN FROM ANOTHER SOURCE

• ETC.

EACH SOURCE TYPICALLY HAS ITS OWN METHOD OF HANDLING DATA.

THEREFORE DATA SOURCES ARE OFTEN HETEROGENEOUS, WHICH CREATES

INCONSISTENT AND INACCURATE DATA. FURTHER, THERE IS NO INFORMATION

PROVIDED ABOUT THE DATA TO HELP BETTER UNDERSTAND WHAT THE DATA ARE

ABOUT.

Page 3: Open refine to update and clean up your messy data

WHAT TO DO?

ESSENTIALLY, YOU ARE LOOKING FOR CONSISTENT AND ACCURATE

DATA. TO DO THIS, THERE ARE OPTIONS:

• CREATE A SCRIPT TO CLEAN DATA

• HAVE SOMEONE CLEAN YOUR DATA

• USE AN APPLICATION TO HELP YOU CLEAN YOUR DATA

TODAY, WE’RE GOING TO LOOK AT OPEN REFINE, AN APPLICATION

THAT CAN HELP YOU CLEAN YOUR DATA.

Page 4: Open refine to update and clean up your messy data

WHAT IS OPEN REFINE?

OPEN REFINE ORIGINATED WITH METAWEB TECHNOLOGIES AND THEN

GOOGLE. FORMERLY KNOWN AS FREEBASE GRIDWORKS, IT BECAME

KNOWN AS GOOGLE REFINE AND THEN OPEN REFINE.

IT IS A FULLY OPEN SOURCE APPLICATION.

YOU CAN DOWNLOAD IT AT: HTTP://OPEN

REFINE.ORG/DOWNLOAD.HTML

Page 5: Open refine to update and clean up your messy data

OPEN REFINE AND THE WOES OF DATA

WHETHER YOU HAVE HETEROGENEOUS DATA OR DATA YOU’RE

UNFAMILIAR WITH, OPEN REFINE CAN HELP WITH:

• DATA CLEANUP

• CORRECT INCONSISTENCIES AND INACCURACIES

• DATA PROFILING

• CREATE AN ANALYSIS OF UNFAMILIAR DATA BY LEARNING WHAT THE

DATA ARE

Page 6: Open refine to update and clean up your messy data

2 BASIC OPERATIONS OF OPEN REFINE

BOTH DATA CLEANUP AND DATA PROFILING RELY ON A DATABASE,

CALLED FREEBASE, TO VISUALIZE AND MANIPULATE YOUR DATA.

VISUALIZING YOUR DATA INCLUDES BEING ABLE TO SEE THE “TYPE”

OF DATA. IS IT A DATE, NUMBER, TEXT? ARE ALL THE ENTRIES THE

SAME? VISUALIZING ALSO INCLUDES BEING ABLE TO SEE TIMELINES

FOR ENTRIES THAT ARE DATES.

MANIPULATING DATA INCLUDES GLOBAL SEARCHING AND

REPLACING, UPDATING ENTRIES SINGLY OR BY BATCH, ADDING NEW

DATA, REMOVING DATA, CHANGING THE DATA TYPES, AND MORE!

Page 7: Open refine to update and clean up your messy data

LET’S DO THE TOUR

• CREATE A PROJECT

• DIFFERENT VIEWS

• SEEING YOUR DATA

• MANIPULATING ROWS/COLUMNS

• CORRECTING ERRORS

Page 8: Open refine to update and clean up your messy data

HOW TO MAKE CHANGES TO DATA

• CHANGES TO COLUMNS

• SPLIT

• ADD

• REMOVE

• RENAME

• MOVE

• CHANGES CELLS (TO DATA ANYWHERE)

• EDIT CELLS USING CUSTOM OR DEFAULT TRANSFORMS, FILL

DOWN/BLANK, SPLIT/JOIN CELLS, CLUSTER/EDIT

Page 9: Open refine to update and clean up your messy data

AVAILABLE TRANSFORMATIONS

TRIM OR COLLAPSE WHITESPACES

UNESCAPE HTML ENTITIES

CHANGE THE CASE (TITLE, LOWER OR UPPER CASE)

CHANGE THE DATA TO NUMERIC, DATE OR TEXT

Page 10: Open refine to update and clean up your messy data

CUSTOM TRANSFORMATIONS

• JYTHON

• HTTP://WWW.JYTHON.ORG/

• HTTPS://GITHUB.COM/OPENREFINE/OPENREFINE/WIKI/JYTHON

• GREL (OPEN REFINE EXPRESSION LANGUAGE)

• HTTPS://GITHUB.COM/OPENREFINE/OPENREFINE/WIKI/GREL-FUNCTIONS

• REGULAR EXPRESSIONS

• HTTP://EN.WIKIPEDIA.ORG/WIKI/REGULAR_EXPRESSION

• HTTP://WWW.REGULAR-EXPRESSIONS.INFO/QUICKSTART.HTML

Page 11: Open refine to update and clean up your messy data

GREL

• SLICE

• EXAMPLE CHANGE “2010-05-31T01:10:0Z” TO “05/31/2010”

• VALUE.SLICE(5,7) + ‘/’ + VALUE.SLICE(8,10) + ‘/’ + VALUE.SLICE(0,4)

• YOU CAN ALSO USE:

• ADD A PREFIX

• “PREFIX” + VALUE

• SPLIT AND JOIN

• A:B:C:D:E -> B:C:D

Page 12: Open refine to update and clean up your messy data

REGULAR EXPRESSIONS

• TEXT PATTERN THAT ONE CAN USE WITH MANY MODERN

APPLICATIONS AND PROGRAMMING LANGUAGES

• REGULAR EXPRESSIONS COME IN FLAVORS

• .NET, JAVA, JAVASCRIPT, PERL, PYTHON, RUBY, …

• METACHARACTERS: 12 PUNCTUATION CHARACTERS THAT MAKE

REGULAR EXPRESSIONS WORK

Page 13: Open refine to update and clean up your messy data

REGULAR EXPRESSIONS

• CHARACTER CLASS ABBREVIATIONS

Page 14: Open refine to update and clean up your messy data

REGULAR EXPRESSIONS

• REMOVE THE “.” AT THE END OF A PHRASE

• VALUE.REPLACE(/[.]$/, “”)

• WHAT HAPPENS IF YOU JUST PUT /.$/ WITHOUT THE BRACKETS?

• REMOVE A STRING AT THE BEGINNING OF THE PHRASE WHERE THE

FIRST LETTER OF THE STRING IS UPPER OR LOWER CASE

• VALUE.REPLACE(/^\W+\S/, “”)

• TEST IT OUT

• HTTP://REGEXPAL.COM/

Page 15: Open refine to update and clean up your messy data

QUESTIONS?

Page 16: Open refine to update and clean up your messy data

RESOURCES

• USING OPEN REFINE … THE BOOK, QA76.9.D343 V47 2013

• OPEN REFINE DOCUMENTATION: HTTP://OPEN

REFINE.ORG/DOCUMENTATION.HTML

• OPEN REFINE COMMUNITY: HTTP://OPEN

REFINE.ORG/COMMUNITY.HTML

• CODE PROJECT ON REGULAR EXPRESSIONS:

HTTP://WWW.CODEPROJECT.COM/ARTICLES/9099/THE-MINUTE-

REGEX-TUTORIAL

• REGULAR EXPRESSIONS CHEAT SHEET:

HTTP://WWW.CHEATOGRAPHY.COM/DAVECHILD/CHEAT-

SHEETS/REGULAR-EXPRESSIONS/

• REGULAR EXPRESSIONS TESTER: HTTP://REGEXPAL.COM/ ,

HTTP://REGEX101.COM/