cleaning data with google refine
DESCRIPTION
Presentation at the Global Investigative Journalism Conference, Kiev, Ukraine, 15 Oct 2011TRANSCRIPT
• Google ‘Google Refine download’http://code.google.com/p/google-refine/wiki/Downloads• Download and install Google Refine• Download data at http://bit.ly/nqbIaI• Open it up - it should open in a browser at http://127.0.0.1:3333/
Get yourself ready
Saturday, 15 October 2011
Google Refine: cleaning data
Paul Bradshaw OnlineJournalismBlog.com, Twitter.com/paulbradshaw
Saturday, 15 October 2011
• Getting rid of common data problems• ‘Clustering’ data to clean up multiple names for same thing• Manual tidying
In a nutshell...
Saturday, 15 October 2011
The basics
Common transforms
Saturday, 15 October 2011
• Clean common data problems: wrong format, inconsistent case, HTML, spaces, etc.• Use algorithms to find similar items• Use APIs and GREL to add new data
What can you do with Google Refine?
Saturday, 15 October 2011
"Because we take the time to clean the data, we are able to do lobbying stories no other news organisation can do."
David Donald, Center for Public Integrity
Saturday, 15 October 2011
Humans collect dataHumans enter dataHuman error
Time spent now...
Saturday, 15 October 2011
Different words for the same thingDouble spaces, punctuationWrong data typeMistypedDuplicate entriesDefault entries (1/1/00)
...Saves time later
Saturday, 15 October 2011
Save some copies of the raw data Work on a new copySave versions as you go to revertNote: Docs limited to 200,000 cells/256 cols; some Excel limited to 66,000 rows
First!
Saturday, 15 October 2011
Group by term to see duplicationsFind & replace double spaces, etc. Select column/row & check data typeSort to find unusually large/small, and neighbouring misspellings
Cleaning methods
Saturday, 15 October 2011
Never publish a name from data without running a background check
Check.
Saturday, 15 October 2011
Edit cells>Common transforms
Saturday, 15 October 2011
Facets
Saturday, 15 October 2011
Facets, Edit cells
Edit cells > common transforms > cluster & edit > unescape HTMLEdit cells > split multi-valued cellsFacet > text facetExport...
Saturday, 15 October 2011
Clustering
An intelligent helper
Saturday, 15 October 2011
Algorithms
Fingerprint: looks for items with identical characters, e.g. “John Smith,” and “Smith, John”Double-metaphone: looks for similar sounds, e.g. “Horowitz” and “Horowicz”PPM: partial matches - try increasing radius to increase
Saturday, 15 October 2011
Algorithms
Nearest neighbor: looks for shared clusters of characters, e.g. “Johnson” and “Johnsons”Levenshtein: looks for number of edits needed to change one to another, e.g. “New York” -> “newyork” = 3 edits
Saturday, 15 October 2011
Just a helper...
Check and tick to apply the cleanup - click ‘Browse this cluster’ to see in more detail.Research to check if there are 2 people with same nameWill not spot abbreviations, e.g. MOJ vs Ministry of Justice
Saturday, 15 October 2011
Saturday, 15 October 2011
Delicious.com/paulb/kiev11Delicious.com/paulb/googlerefineOnlineJournalismBlog.com/tag/google-refine
Links
Saturday, 15 October 2011