Data Cleaning & Integration

CSE6242 / CX4242, Jan 14, 2014

Duen Horng (Polo) Chau, Georgia Tech

Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos

Last Time

Big data analytics building blocks

Data collection & simple data storage

• Why SQLite?
  • Simplicity: nothing to install/maintain, database in a single file
  • Popular: cross-platform, cross-device
• SQL basics (create table, join, create index, etc.)

Collection

Cleaning

Integration

Visualization

Analysis

Presentation

Dissemination

Data Cleaning

How dirty is real data?

Data Cleaners

Watch videos
• Google Refine
• Data Wrangler (research at Stanford)

Write down
• Examples of data dirtiness
• Tools’ features demoed (or that you like)

Will collectively summarize similarities and differences afterwards

Google Refine: http://code.google.com/p/google-refine/
Data Wrangler: http://vis.stanford.edu/wrangler/

How dirty is real data?

Examples
• no specific schemas / different names for the same thing / numbers and text mixed
• trailing spaces / text not relevant to data
• different units / data out of range (unrealistic) / skewed data distributions
• missing values / missing rows entirely
• file formats
• text may not be where you want it to be (maybe in a different column)
• improper merge of two tables
• duplications
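A few of these problems can be sketched in code. The following is a minimal illustration, assuming a hypothetical list of records with `name`, `height`, and `unit` fields (none of them come from the lecture). It trims stray whitespace, converts units, blanks out missing and out-of-range values, and drops duplicates.

```python
# Hypothetical raw records; field names and values are made up for illustration.
raw_rows = [
    {"name": "  Alice ", "height": "170", "unit": "cm"},
    {"name": "Bob", "height": "5.9", "unit": "ft"},
    {"name": "Alice", "height": "170", "unit": "cm"},   # duplicate after trimming
    {"name": "Carol", "height": "", "unit": "cm"},      # missing value
    {"name": "Dave", "height": "1700", "unit": "cm"},   # out of range
]

CM_PER_FOOT = 30.48


def clean_row(row):
    """Return a cleaned (name, height_cm) tuple; height_cm is None if unusable."""
    name = row["name"].strip()
    text = row["height"].strip()
    if not text:
        return name, None
    value = float(text)
    if row["unit"] == "ft":
        value *= CM_PER_FOOT
    # Assumed plausibility range for a human height in centimeters.
    if not 30 <= value <= 300:
        return name, None
    return name, round(value, 1)


def clean(rows):
    """Clean every row and drop exact duplicates, preserving order."""
    seen = set()
    result = []
    for row in rows:
        cleaned = clean_row(row)
        if cleaned not in seen:
            seen.add(cleaned)
            result.append(cleaned)
    return result


cleaned_rows = clean(raw_rows)
```

Replacing unusable values with `None` rather than deleting the row is one possible design choice; it keeps the record visible so that a person can decide what to do with it.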

How are they similar?

G = Google Refine
W = Data Wrangler

• mass/batch conversion
• graph/chart visualization
• heuristics (e.g., group in G, selection in W)
• removing redundancy
• tracking changes / history / undo-redo
• table based
• suggestions (what to fix)
• filtering (show less)

How are they different?

• G has a clustering feature
• W has format conversion (1 column spread into multiple)
• W can export actions as scripts
• G supports offline mode (online too?)
• W extracts part of text into a new column
• W can copy and paste
• W allows you to preview changes
• W uses colors to indicate different kinds of changes
• G can show statistics
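To make the clustering feature concrete, here is a minimal sketch of key-collision clustering, one family of methods that can group different spellings of the same value. The `fingerprint` function and the sample values are illustrative assumptions, not Google Refine's actual implementation.

```python
import string
from collections import defaultdict


def fingerprint(value):
    """Normalize a string so that trivially different spellings collide."""
    text = value.strip().lower()
    # Remove ASCII punctuation, then sort and de-duplicate the tokens.
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(text.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values sharing a fingerprint; keep groups with more than one spelling."""
    groups = defaultdict(list)
    for value in values:
        key = fingerprint(value)
        if value not in groups[key]:
            groups[key].append(value)
    return [group for group in groups.values() if len(group) > 1]


# Hypothetical dirty column values.
names = ["Georgia Tech", "georgia tech ", "Tech, Georgia", "MIT", "Georgia Tech"]
clusters = cluster(names)
```

Each returned group is a candidate set of spellings that a person could review and merge into a single canonical value.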


The videos only show some of the tools’ features. Try them out.

Data Integration

Course Overview

Collection

Cleaning

Integration

Visualization

Analysis

Presentation

Dissemination

What is Data Integration? Why is it Important?

Data Integration

Combining data from different sources to provide the user with a unified view

As data’s volume, velocity and variety increase, and veracity decreases, data integration presents new (and more) opportunities and challenges

How to help people effectively leverage multiple data sources? (People: analysts, researchers, practitioners, etc.)

Examples of businesses based on data integration

Mashup

More Examples?

• Palantir Gotham
• Yelp: restaurant reviews, business reviews
• Facebook friend requests: look at your friends’ friends and recommend them as your friends
• Trulia / Zillow (real estate sites)
• Graph Search (Facebook)
• Waze
• Yahoo Pipes
• Google search engine
• Google Transit
• Google Now / Apple Siri
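The friend-recommendation idea in the list above (look at your friends' friends) can be sketched as follows. This is a minimal illustration over a hypothetical friendship graph with made-up names; it is not Facebook's actual algorithm, and ranking by the number of mutual friends is an assumption.

```python
from collections import Counter


def recommend_friends(graph, user):
    """Suggest friends-of-friends, ranked by number of mutual friends."""
    friends = graph.get(user, set())
    mutual_counts = Counter()
    for friend in friends:
        for candidate in graph.get(friend, set()):
            # Skip the user and people who are already friends.
            if candidate != user and candidate not in friends:
                mutual_counts[candidate] += 1
    # Sort by mutual-friend count (descending), then by name for stable output.
    return sorted(mutual_counts, key=lambda name: (-mutual_counts[name], name))


# Hypothetical undirected friendship graph stored as adjacency sets.
friendships = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan", "eve"},
    "cat": {"ann", "dan"},
    "dan": {"bob", "cat"},
    "eve": {"bob"},
}
suggestions = recommend_friends(friendships, "ann")
```

Here `dan` shares two friends with `ann` and `eve` shares one, so `dan` is suggested first.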

How to do data integration?

“Low” Effort Approaches

Use database’s “Join”! (e.g., SQLite)

Google Refine: http://code.google.com/p/google-refine/ (video #3)

id   name     state
111  Smith    GA
222  Johnson  NY
222  Obama    CA

id   name
111  Smith
222  Johnson
333  Obama

id   state
111  GA
222  NY
222  CA
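The join behind the tables above can be sketched with Python's built-in `sqlite3` module. The table names `names` and `states` are assumptions, and the sketch assumes the third row has id 333 in both tables so that each name matches exactly one state.

```python
import sqlite3

# In-memory database; table and column names are assumptions for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE states (id INTEGER, state TEXT)")
conn.executemany(
    "INSERT INTO names VALUES (?, ?)",
    [(111, "Smith"), (222, "Johnson"), (333, "Obama")],
)
# Assumption: the third id is 333 so that it matches the names table.
conn.executemany(
    "INSERT INTO states VALUES (?, ?)",
    [(111, "GA"), (222, "NY"), (333, "CA")],
)

# An inner join on the shared id column produces the unified view.
joined = conn.execute(
    "SELECT names.id, names.name, states.state "
    "FROM names JOIN states ON names.id = states.id "
    "ORDER BY names.id"
).fetchall()
conn.close()
```

An inner join silently drops rows whose id appears in only one table, so mismatched or duplicated ids in the inputs show up as missing or repeated rows in the result.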

Crowd-sourcing Approaches: Freebase

http://wiki.freebase.com/wiki/What_is_Freebase%3F

Freebase (a graph of entities)

“…a large collaborative knowledge base consisting of metadata composed mainly by its community members…” (Wikipedia)
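One simple way to picture a graph of entities is as a set of (subject, predicate, object) facts. The sketch below is only an illustration of that data structure: the predicate names and the helper functions are hypothetical and do not reflect Freebase's real schema or API.

```python
from collections import defaultdict

# Hypothetical (subject, predicate, object) facts; not Freebase's real schema.
triples = [
    ("Georgia Tech", "located_in", "Atlanta"),
    ("Atlanta", "located_in", "Georgia"),
    ("Georgia Tech", "type", "University"),
]

# Index outgoing edges by subject so lookups do not scan every triple.
outgoing = defaultdict(list)
for subject, predicate, obj in triples:
    outgoing[subject].append((predicate, obj))


def follow(entity, predicate):
    """Return all entities reached from `entity` over edges labeled `predicate`."""
    return [obj for pred, obj in outgoing.get(entity, []) if pred == predicate]


def follow_path(entity, predicates):
    """Follow a chain of predicates, one hop per predicate."""
    frontier = [entity]
    for predicate in predicates:
        frontier = [nxt for node in frontier for nxt in follow(node, predicate)]
    return frontier


state = follow_path("Georgia Tech", ["located_in", "located_in"])
```

Chaining hops like this is what lets a question such as "which state is this university in?" be answered from facts that were contributed separately.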

So what? What can you do with Freebase?

(Hint: Google acquired it in 2010)

http://www.google.com/insidesearch/features/search/knowledge.html

Given a graph of entities, like Freebase, what other cool things can you do?

https://www.facebook.com/about/graphsearch

Facebook’s Graph Search

Integrate your friends’ info with yours

Feldspar: Finding Information by Association

CHI 2008. Polo Chau, Brad Myers, Andrew Faulring

Paper: http://www.cs.cmu.edu/~dchau/feldspar/feldspar-chi08.pdf
YouTube: http://www.youtube.com/watch?v=Q0TIV8F_o_E&feature=youtu.be&list=ULQ0TIV8F_o_E

Summary for data integration

Opportunities
• enable new services (Siri, PadMapper)
• enable new ways to discover info
• improve existing services
• reduce redundancy
• new ways to interact with data
• promote knowledge transfer (e.g., between companies)
