cleaning data with google refine

20
Google ‘Google Refine download’ http://code.google.com/p/google- refine/wiki/Downloads Download and install Google Refine Download data at http://bit.ly/nqbIaI Open it up - it should open in a browser at http://127.0.0.1:3333/ Get yourself ready Saturday, 15 October 2011

Upload: paul-bradshaw

Post on 09-May-2015

3.152 views

Category:

Education


1 download

DESCRIPTION

Presentation at the Global Investigative Journalism Conference, Kiev, Ukraine, 15 Oct 2011

TRANSCRIPT

Page 1: Cleaning data with Google Refine

• Google ‘Google Refine download’http://code.google.com/p/google-refine/wiki/Downloads• Download and install Google Refine• Download data at http://bit.ly/nqbIaI• Open it up - it should open in a browser at http://127.0.0.1:3333/

Get yourself ready

Saturday, 15 October 2011

Page 2: Cleaning data with Google Refine

Google Refine: cleaning data

Paul Bradshaw OnlineJournalismBlog.com, Twitter.com/paulbradshaw

Saturday, 15 October 2011

Page 3: Cleaning data with Google Refine

• Getting rid of common data problems• ‘Clustering’ data to clean up multiple names for same thing• Manual tidying

In a nutshell...

Saturday, 15 October 2011

Page 4: Cleaning data with Google Refine

The basics

Common transforms

Saturday, 15 October 2011

Page 5: Cleaning data with Google Refine

• Clean common data problems: wrong format, inconsistent case, HTML, spaces, etc.• Use algorithms to find similar items• Use APIs and GREL to add new data

What can you do with Google Refine?

Saturday, 15 October 2011

Page 6: Cleaning data with Google Refine

"Because we take the time to clean the data, we are able to do lobbying stories no other news organisation can do."

David Donald, Center for Public Integrity

Saturday, 15 October 2011

Page 7: Cleaning data with Google Refine

Humans collect dataHumans enter dataHuman error

Time spent now...

Saturday, 15 October 2011

Page 8: Cleaning data with Google Refine

Different words for the same thingDouble spaces, punctuationWrong data typeMistypedDuplicate entriesDefault entries (1/1/00)

...Saves time later

Saturday, 15 October 2011

Page 9: Cleaning data with Google Refine

Save some copies of the raw data Work on a new copySave versions as you go to revertNote: Docs limited to 200,000 cells/256 cols; some Excel limited to 66,000 rows

First!

Saturday, 15 October 2011

Page 10: Cleaning data with Google Refine

Group by term to see duplicationsFind & replace double spaces, etc. Select column/row & check data typeSort to find unusually large/small, and neighbouring misspellings

Cleaning methods

Saturday, 15 October 2011

Page 11: Cleaning data with Google Refine

Never publish a name from data without running a background check

Check.

Saturday, 15 October 2011

Page 12: Cleaning data with Google Refine

Edit cells>Common transforms

Saturday, 15 October 2011

Page 13: Cleaning data with Google Refine

Facets

Saturday, 15 October 2011

Page 14: Cleaning data with Google Refine

Facets, Edit cells

Edit cells > common transforms > cluster & edit > unescape HTMLEdit cells > split multi-valued cellsFacet > text facetExport...

Saturday, 15 October 2011

Page 15: Cleaning data with Google Refine

Clustering

An intelligent helper

Saturday, 15 October 2011

Page 16: Cleaning data with Google Refine

Algorithms

Fingerprint: looks for items with identical characters, e.g. “John Smith,” and “Smith, John”Double-metaphone: looks for similar sounds, e.g. “Horowitz” and “Horowicz”PPM: partial matches - try increasing radius to increase

Saturday, 15 October 2011

Page 17: Cleaning data with Google Refine

Algorithms

Nearest neighbor: looks for shared clusters of characters, e.g. “Johnson” and “Johnsons”Levenshtein: looks for number of edits needed to change one to another, e.g. “New York” -> “newyork” = 3 edits

Saturday, 15 October 2011

Page 18: Cleaning data with Google Refine

Just a helper...

Check and tick to apply the cleanup - click ‘Browse this cluster’ to see in more detail.Research to check if there are 2 people with same nameWill not spot abbreviations, e.g. MOJ vs Ministry of Justice

Saturday, 15 October 2011

Page 19: Cleaning data with Google Refine

Saturday, 15 October 2011

Page 20: Cleaning data with Google Refine

Delicious.com/paulb/kiev11Delicious.com/paulb/googlerefineOnlineJournalismBlog.com/tag/google-refine

Links

Saturday, 15 October 2011