Post on 09-Oct-2020
Data Cleaning & Integration
CSE6242 / CX4242
Jan 14, 2014
Duen Horng (Polo) Chau
Georgia Tech
Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos
Last Time
Big data analytics building blocks
Data collection & simple data storage
• Why SQLite?
• Simplicity: nothing to install/maintain, database in a single file
• Popular: cross-platform, cross-device
• SQL basics (create table, join, create index, etc.)
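As a quick refresher on the SQL basics listed above, here is a minimal sketch using Python's built-in sqlite3 module. The file name demo.db, the people table and the idx_people_name index are made up for illustration; only the general pattern (one file, create table, create index) comes from the slide.

```python
import os
import sqlite3
import tempfile

# Hypothetical file name; the whole database lives in this single file.
db_path = os.path.join(tempfile.mkdtemp(), "demo.db")

conn = sqlite3.connect(db_path)

# Create a table and an index on the column we expect to search by.
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE INDEX idx_people_name ON people (name)")

conn.executemany(
    "INSERT INTO people (id, name) VALUES (?, ?)",
    [(111, "Smith"), (222, "Johnson")],
)
conn.commit()

# Read the rows back, and list the indexes SQLite recorded for the table.
names = [row[0] for row in conn.execute("SELECT name FROM people ORDER BY name")]
indexes = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' AND tbl_name = 'people'"
    )
]
conn.close()
```

Nothing needs to be installed or started: connecting to a path creates the file if it does not exist yet.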
Collection
Cleaning
Integration
Visualization
Analysis
Presentation
Dissemination
Data Cleaning
How dirty is real data?
Data Cleaners
Watch videos
• Google Refine
• Data Wrangler (research at Stanford)
Write down
• Examples of data dirtiness
• Tools’ features demoed (or that you like)
Will collectively summarize similarities and differences afterwards
Google Refine: http://code.google.com/p/google-refine/
Data Wrangler: http://vis.stanford.edu/wrangler/
How dirty is real data?
Examples
• no specific schemas / different names for the same thing / numbers and text mixed
• trailing spaces / text not relevant to data
• different units / data out of range (unrealistic) / skewed data distributions
• missing values / missing rows entirely
• file formats
• text may not be where you want it to be (maybe in a different column)
• improper merge of two tables
• duplications
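A few of the problems listed above (trailing spaces, numbers stored as text, missing values, out-of-range values, duplicates) can be sketched in plain Python. The rows, the column names and the 0 to 120 age range are invented for illustration; they are not from the lecture, and real cleaning rules depend on the data.

```python
# Toy rows (hypothetical): a trailing space, numbers stored as text,
# a missing value, an unrealistic value, and a duplicate record.
raw_rows = [
    {"name": "Smith ", "age": "34"},
    {"name": "Johnson", "age": ""},
    {"name": "Lee", "age": "340"},
    {"name": "Smith", "age": "34"},
]


def clean(rows, min_age=0, max_age=120):
    """Trim text, parse ages, blank out unrealistic ages, drop duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        name = row["name"].strip()
        age_text = row["age"].strip()
        # Non-numeric or empty text becomes a missing value (None).
        age = int(age_text) if age_text.isdigit() else None
        # Assumed rule: values outside the plausible range count as missing.
        if age is not None and not (min_age <= age <= max_age):
            age = None
        key = (name, age)
        if key in seen:
            continue  # exact duplicate after cleaning
        seen.add(key)
        cleaned.append({"name": name, "age": age})
    return cleaned


cleaned_rows = clean(raw_rows)
```

Note that the duplicate only becomes visible after the trailing space is stripped, which is one reason cleaning steps are usually applied in a fixed order.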
How are they similar?
• mass/batch conversion
• graph/chart visualization
• heuristics (e.g., group in G, selection in W)
• removing redundancy
• tracking changes / history / undo-redo
• table based
• suggestions (what to fix)
• filtering (show less)
G = Google Refine
W = Data Wrangler
How do they differ?
• G has clustering feature
• W has format conversion (1 column spread into multiple)
• W can export actions as scripts
• G supports offline mode (online too?)
• W extracts part of text into new column
• W can copy and paste
• W allows you to preview changes
• W uses colors to indicate different kinds of changes
• G can show statistics
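The list above does not say how clustering works. One simple idea is key collision: normalize each value into a "fingerprint" and group values that share one. The sketch below is a simplified illustration of that idea, not Google Refine's actual code; the function names and sample values are made up.

```python
import re
from collections import defaultdict


def fingerprint(value):
    """Normalize a string: trim, lowercase, drop punctuation, sort unique tokens."""
    cleaned = re.sub(r"[^\w\s]", "", value.strip().lower())
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values whose fingerprints collide; keep groups with more than one member."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(group) > 1]


# Hypothetical column values with inconsistent spelling of the same thing.
values = ["Georgia Tech", "georgia tech ", "Tech, Georgia", "Stanford"]
clusters = cluster(values)
```

Each resulting group is a candidate set of "different names for the same thing" that a person can then review and merge.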
The videos only show some of the tools’ features. Try them out.
Google Refine: http://code.google.com/p/google-refine/
Data Wrangler: http://vis.stanford.edu/wrangler/
Data Integration
Course Overview
Collection
Cleaning
Integration
Visualization
Analysis
Presentation
Dissemination
What is Data Integration? Why is it Important?
Data Integration
Combining data from different sources to provide the user with a unified view
As data’s volume, velocity and variety increase, and veracity decreases, data integration presents new (and more) opportunities and challenges
How to help people effectively leverage multiple data sources? (People: analysts, researchers, practitioners, etc.)
Examples of businesses based on data integration
Mashup
More Examples?
• Palantir Gotham
• Yelp: restaurant reviews, business reviews
• Facebook friend requests: look at your friends’ friends and recommend them as your friends
• Trulia / Zillow (real estate sites)
• Graph Search (Facebook)
• Waze
• Yahoo Pipes
• Google search engine
• Google Transit
• Google Now / Apple Siri
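The friends-of-friends idea in the list above can be sketched in a few lines. This is only an illustration of the idea as described, with a made-up friendship graph and a made-up ranking rule (number of mutual friends); it is not how Facebook actually implements recommendations.

```python
from collections import Counter

# Hypothetical, symmetric friendship graph.
friends = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan"},
    "cat": {"ann", "dan", "eve"},
    "dan": {"bob", "cat"},
    "eve": {"cat"},
}


def recommend(user, graph):
    """Rank friends-of-friends who are not already friends, by mutual-friend count."""
    counts = Counter()
    for friend in graph[user]:
        for friend_of_friend in graph[friend]:
            if friend_of_friend != user and friend_of_friend not in graph[user]:
                counts[friend_of_friend] += 1
    # Most mutual friends first; ties broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


suggestions = recommend("ann", friends)
```

Here "dan" is reached through both "bob" and "cat", so he ranks above "eve", who is reached only through "cat".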
How to do data integration?
“Low” Effort Approaches
Use database’s “Join”! (e.g., SQLite)
Google Refine: http://code.google.com/p/google-refine/ (video #3)

id   name     state
111  Smith    GA
222  Johnson  NY
222  Obama    CA

id   name
111  Smith
222  Johnson
333  Obama

id   state
111  GA
222  NY
222  CA
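A minimal sketch of the join approach, using Python's built-in sqlite3 module and the values from the two smaller tables above. The table names names and states are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Table names are hypothetical; the rows are the ones shown above.
conn.execute("CREATE TABLE names (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE states (id INTEGER, state TEXT)")
conn.executemany(
    "INSERT INTO names VALUES (?, ?)",
    [(111, "Smith"), (222, "Johnson"), (333, "Obama")],
)
conn.executemany(
    "INSERT INTO states VALUES (?, ?)",
    [(111, "GA"), (222, "NY"), (222, "CA")],
)

# Inner join: one output row per pair of rows with equal ids.
joined = conn.execute(
    "SELECT n.id, n.name, s.state "
    "FROM names AS n JOIN states AS s ON n.id = s.id "
    "ORDER BY n.id, s.state"
).fetchall()

# Left join plus a NULL check: names that found no matching state row.
unmatched = conn.execute(
    "SELECT n.id, n.name "
    "FROM names AS n LEFT JOIN states AS s ON n.id = s.id "
    "WHERE s.id IS NULL"
).fetchall()
conn.close()
```

With the ids exactly as printed above, id 222 matches two state rows and id 333 matches none, so the join output is not the same as the three-column table. Whether that mismatch is intentional is not clear from the transcript, but checking for unmatched and multiply matched keys like this is a cheap way to catch an improper merge.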
Crowd-sourcing Approaches: Freebase
http://wiki.freebase.com/wiki/What_is_Freebase%3F
Freebase (a graph of entities)
“…a large collaborative knowledge base consisting of metadata composed mainly by its community members…”
Wikipedia.
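To make "a graph of entities" concrete, here is a toy sketch that stores facts as (subject, relation, object) triples and follows the links from one entity. The triples and relation names are written for this example and are not taken from Freebase, and this is not Freebase's data model or API.

```python
from collections import defaultdict

# Toy triples (subject, relation, object); relation names are hypothetical.
triples = [
    ("Duen Horng (Polo) Chau", "works_at", "Georgia Tech"),
    ("Georgia Tech", "located_in", "Atlanta"),
    ("Atlanta", "located_in", "Georgia"),
    ("Georgia", "part_of", "United States"),
]


def build_graph(triples):
    """Index the triples by subject so outgoing links are easy to follow."""
    graph = defaultdict(list)
    for subject, relation, obj in triples:
        graph[subject].append((relation, obj))
    return graph


def reachable(graph, start):
    """Return every entity reachable from start by following outgoing links."""
    seen = []
    stack = [start]
    while stack:
        node = stack.pop()
        for _, neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.append(neighbor)
                stack.append(neighbor)
    return seen


graph = build_graph(triples)
related = reachable(graph, "Duen Horng (Polo) Chau")
```

Because every fact is a link between two entities, integrating a new source amounts to adding more triples that point at entities already in the graph.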
So what? What can you do with Freebase?
(Hint: Google acquired it in 2010)
http://www.google.com/insidesearch/features/search/knowledge.html
Given a graph of entities, like Freebase, what other cool things can you do?
Facebook’s Graph Search
Integrate your friends’ info with yours
Feldspar
Finding Information by Association.
CHI 2008
Polo Chau, Brad Myers, Andrew Faulring
Paper: http://www.cs.cmu.edu/~dchau/feldspar/feldspar-chi08.pdf
YouTube: http://www.youtube.com/watch?v=Q0TIV8F_o_E&feature=youtu.be&list=ULQ0TIV8F_o_E
Summary for data integration
Opportunities
• enable new services (Siri, PadMapper)
• enable new ways to discover info
• improve existing services
• reduce redundancy
• new ways to interact with data
• promote knowledge transfer (e.g., between companies)