18 cleaning
TRANSCRIPT
![Page 1: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/1.jpg)
Garrett GrolemundPhd Student / Rice University
Department of Statistics
Data cleaning
![Page 2: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/2.jpg)
1. Intro to data cleaning
2. What you can’t fix
3. What you can fix
4. Intro to reshape
![Page 3: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/3.jpg)
Your turn
Do you think men or women leave a larger tip when dining out? What data would you collect to test this belief? What would prompt you to change your belief?
![Page 4: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/4.jpg)
Data Analysis
Data
Residuals
ModelCompare
Visualize
Transform
![Page 5: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/5.jpg)
Data Analysis
Data
Residuals
ModelCompare
Visualize
Transform
![Page 6: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/6.jpg)
Data Analysis
Data
Residuals
ModelCompare
Visualize
Transform
![Page 7: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/7.jpg)
Data Analysis
Data
Residuals
ModelCompare
Visualize
Transform
![Page 8: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/8.jpg)
Data Analysis
Data
Residuals
ModelCompare
Visualize
Transform
![Page 9: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/9.jpg)
Data Analysis
Data
Residuals
ModelCompare
Visualize
Transform
![Page 10: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/10.jpg)
10 - 20% of an analysis
![Page 11: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/11.jpg)
Data Cleaning
Data
Residuals
ModelCompare
VisualizeTransform
![Page 12: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/12.jpg)
Data cleaning
![Page 13: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/13.jpg)
“Happy families are all alike; every unhappy family is unhappy in its own way.”—Leo Tolstoy
![Page 14: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/14.jpg)
“Clean datasets are all alike; every messy dataset is messy in its own way.”—Hadley Wickham
![Page 15: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/15.jpg)
Clean data is:
Complete
Correct(factual and internally consistent)
Concise
Compatible(required variables: observations in rows, one column per variable)
![Page 16: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/16.jpg)
What you can’t fix:
![Page 17: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/17.jpg)
Complete
Correct
![Page 18: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/18.jpg)
Correct
Can’t restore incorrect values without original data but can remove clearly incorrect values
Options:
Remove entire row
Mark incorrect value as missing (NA)
![Page 19: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/19.jpg)
When two rows present the same information with different values, at least one row is wrong.
Whenever there is inconsistency, you are going to have to make some tradeoff to ensure concision.
Detecting inconsistency is not always easy.
Inconsistency = incorrect
![Page 20: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/20.jpg)
General strategy
To find incorrect values you need to be creative, combining graphics and data processing.
![Page 21: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/21.jpg)
Tipping data
One waiter recorded information about each tip he received over a period of a few months
244 records
Do men or women tip more?
![Page 22: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/22.jpg)
Your turn
Subset the tipping data to include only rows without NA’s. Judge whether you think all of the data points are correct. How will you make your decision?
![Page 23: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/23.jpg)
tips <- read.csv("tipping.csv", stringsAsFactors = FALSE)
summary(tips)
tips <- subset(tips, !is.na(smoker) & !is.na(non_smoker))
qplot(tip, data = tips, binwidth = .5)qplot(total_bill, data = tips, binwidth = 2)qplot(total_bill, tip, data = tips)
![Page 24: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/24.jpg)
nrow(tips)
sum(tips$male)
sum(tips$female)
subset(tips, male != female)
![Page 25: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/25.jpg)
What you can fix:
![Page 26: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/26.jpg)
Concise(each fact represented once)
Repeating facts: 1. wastes memory 2. creates opportunities for inconsistency
![Page 27: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/27.jpg)
Compatible(Data is compatible with your analysis
in both form and fact)
1. Do you have the relevant variables for your analysis?
![Page 28: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/28.jpg)
This often requires some type of calculation. For example,
proportion = sucesses / attempts
Avg score per game per team = ?
join(), transform(), summarise(), ddply(), plyr address this need
![Page 29: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/29.jpg)
Compatible(Data is compatible with your analysis
in both form and fact)
2. Is the data in the right form for your analysis and visualization tools? (reshape)
![Page 30: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/30.jpg)
Rectangular
![Page 31: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/31.jpg)
Observations in rows
![Page 32: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/32.jpg)
Variables in columns
(1 column per variable)
![Page 33: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/33.jpg)
Your turn
What are the variables in tipping.csv? How are they arranged in rows and columns? Can you form the variables into two groups?
![Page 34: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/34.jpg)
Reshape
![Page 35: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/35.jpg)
install.packages("reshape")library(reshape)library(stringr)head(tips)
![Page 36: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/36.jpg)
Molten data
We can use melt to put each variable into its own column.“Protect” the good columns. “Melt” the offending columns.Then subset.
![Page 37: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/37.jpg)
1. ID variables - identify the object that measurements will take place on (we know these before the experiment)
2. Measured variables - the features of the object that will be measured (we have to do an experiment to observe these)
Two types of variables
![Page 38: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/38.jpg)
object
ID Variables
Bruce Wayne
Batman
SSN: 555-89-3000
Measured Var.
Height (6’1’’)
IQ (180)
Age (71)
![Page 39: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/39.jpg)
ID Variables
Gotham City +
male +
Top 1% tax bracket
![Page 40: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/40.jpg)
Identifier variable Measured variable
Index of random variable
Random variable
Dimension Measure
Experimental design Measurement
predictors (Xi) response (Y)
![Page 41: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/41.jpg)
Molten data
Molten data collapses all the measured variables into two columns: 1) the variable being measured and 2) the value. Sometimes called “long” form.
To protect a column from being melted, label it as an id variable.
reshape::melt(data, id)
![Page 42: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/42.jpg)
tips1 <- melt(tips, id = c("customer_ID", "total_bill", "tip", "smoker", "non_smoker"))
# assign an appropriate variable namenames(tips1)[6] <- "sex"
# subset out unwanted rowstips1 <- subset(tips1, value == 1)tips1 <- tips1[ , c(1,2,6,4,5,3)]
![Page 43: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/43.jpg)
Use melt to fix the smoking variable. One column should be enough to record whether a person smokes or not.
Your turn
![Page 44: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/44.jpg)
Rectangular data are much easier to work with!
qplot(total_bill, tip, data = tips1, color = sex)
# vs.
qplot(total_bill, tip, data = tip, colour = ?)
![Page 45: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/45.jpg)
qplot(total_bill, tip, data = tips1, color = sex) + geom_smooth(method = lm)
![Page 46: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/46.jpg)
Clean data is:
Complete
Correct(factual and internally consistent)
Concise
Compatible(required variables: observations in rows, one column per variable)
![Page 47: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/47.jpg)
Resource
Wickham, H. (2007) Reshaping data with the reshape package. Journal of Statistical Software. 22 (12)
http://www.jstatsoft.org/v21/i12
![Page 48: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/48.jpg)
Summary
![Page 49: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/49.jpg)
Clean data is:
Rectangular(observations in rows, one column per variable)
Consistent
Concise
Complete
Correct
![Page 50: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/50.jpg)
Data
Residuals
ModelCompare
VisualizeTransform
![Page 51: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/51.jpg)
Data
Residuals
ModelCompare
VisualizeTransform
ggplot2
![Page 52: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/52.jpg)
Data
Residuals
ModelCompare
VisualizeTransform
ggplot2plyr
![Page 53: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/53.jpg)
Data
Residuals
ModelCompare
VisualizeTransform
ggplot2plyr
reshape
![Page 54: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/54.jpg)
Data
Residuals
ModelCompare
Visualize
Transform
most statistics classes
![Page 55: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/55.jpg)
![Page 56: 18 cleaning](https://reader033.vdocuments.us/reader033/viewer/2022052505/5552caf5b4c90581158b4e13/html5/thumbnails/56.jpg)
This work is licensed under the Creative Commons Attribution-Noncommercial 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.