data tidying
7/25/2019 data tidying
http://slidepdf.com/reader/full/data-tidying 1/13
LESSON 5:
“Data Tidying in Practice”
5 key characteristics of “messy” data, or data that is not tidy
4 common times when data errors are likely to be introduced
“Tidy” data is key to ensuring error-free analysis
Consistency and accuracy are hallmarks of tidy data
“Tidy” data is important to ensure error-free analysis
• Each variable you measure should be in one column
• Each different observation should be in a different row
• There should be one table for each type of variable
• If you have multiple tables, they should include a key in the tables that allows them to be linked
Source: Leek, “How to Share Data with a Statistician” (2013)
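As a minimal sketch of the last principle, assume two hypothetical tables, `patients` and `visits`, that share a `patient_id` key column. The names, the data, and the helper `link_tables` are illustrative assumptions and do not come from Leek's guide; the sketch only shows how a shared key lets separate tidy tables be linked.

```python
# Hypothetical example data: one table per kind of observation,
# one row per observation, one column per variable.
patients = [
    {"patient_id": 1, "name": "Ana", "birth_year": 1980},
    {"patient_id": 2, "name": "Ben", "birth_year": 1975},
]
visits = [
    {"visit_id": 10, "patient_id": 1, "weight_kg": 61.0},
    {"visit_id": 11, "patient_id": 1, "weight_kg": 60.5},
    {"visit_id": 12, "patient_id": 2, "weight_kg": 82.3},
]


def link_tables(left, right, key):
    """Attach the matching `left` row to every `right` row that shares `key`."""
    lookup = {row[key]: row for row in left}
    return [{**lookup[row[key]], **row} for row in right if row[key] in lookup]


linked = link_tables(patients, visits, "patient_id")
for row in linked:
    print(row)
```

Because the key is stored in both tables, patient details live in one place only and are combined with visit measurements on demand.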
[Figure: “Tidy” data]
[Figure: “Messy” data]
Wickham identifies common characteristics
of messy data
• Multiple variables are stored in one column
• Variables are stored in both rows and columns
• Column headers are values, not variable names
• Multiple types of observational units are stored in the same table
• A single observational unit is stored in multiple tables
Source: Wickham, “Tidy Data,” Journal of Statistical Software, August 2014, Volume 59, Issue 10
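To illustrate one of these cases, “column headers are values, not variable names,” here is a small sketch in plain Python. The table, the column names (`country`, `year`, `cases`), and the helper `melt` are hypothetical and chosen only for illustration; this is not code from Wickham's paper.

```python
# Hypothetical messy table: the year columns hold values of a "year" variable.
messy = [
    {"country": "A", "2019": 120, "2020": 135},
    {"country": "B", "2019": 80, "2020": 95},
]


def melt(rows, id_column, variable_name, value_name):
    """Turn every non-id column header into a value stored in its own row."""
    tidy_rows = []
    for row in rows:
        for header, value in row.items():
            if header == id_column:
                continue
            tidy_rows.append(
                {id_column: row[id_column], variable_name: header, value_name: value}
            )
    return tidy_rows


tidy = melt(messy, "country", "year", "cases")
for row in tidy:
    print(row)
```

After the reshaping, each row is one observation (a country in a given year) and each column is one variable.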
Follow these guidelines to turn messy
data into tidy data
• Keep a close eye on your data when importing it
• Produce a new file at each step of the tidying process with unique extensions (e.g., “_ORIGINAL”, “_COL-CLEAN”, “_ROW-CLEAN” or “_V1”, “_V2”, etc.)
• (In a table of numbers) Flush out errors by summing rows and columns and comparing results
• Understand where errors are commonly introduced and proactively work to limit mistakes from occurring
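The row-and-column summing check can be sketched as follows. The table, the recorded totals, and the function name `find_total_mismatches` are hypothetical; the sketch assumes the totals were recorded separately (for example, in a “Total” row and column of the original spreadsheet) and that the values are integers, so exact comparison is safe.

```python
# Hypothetical table of counts, plus totals recorded when the data was entered.
table = [
    [12, 7, 5],
    [3, 9, 4],
    [8, 2, 6],
]
recorded_row_totals = [24, 16, 16]
recorded_col_totals = [23, 18, 16]  # last value is deliberately inconsistent


def find_total_mismatches(table, row_totals, col_totals):
    """Return indices of rows and columns whose computed sum differs from the recorded total."""
    bad_rows = [i for i, row in enumerate(table) if sum(row) != row_totals[i]]
    bad_cols = [j for j, col in enumerate(zip(*table)) if sum(col) != col_totals[j]]
    return bad_rows, bad_cols


bad_rows, bad_cols = find_total_mismatches(
    table, recorded_row_totals, recorded_col_totals
)
print("Rows to recheck:", bad_rows)
print("Columns to recheck:", bad_cols)
```

A mismatch only says where to look: either a cell or the recorded total is wrong. For floating-point data, a tolerance such as `math.isclose` would be a safer comparison than `!=`.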
Understand common data errors to
identify trouble spots
• Measurement errors (“bias”) when data collected from a sample is not reflective of the population
• Data entry errors that occur when data is incorrectly recorded
• Data integration errors when data imported or joined from multiple databases does not integrate smoothly
• Calculation errors when inaccurate formulas or manipulations produce unintended outputs
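Data entry and integration errors often surface as keys that fail to match when two sources are joined. The sketch below is one possible pre-join check; the source names (`crm_ids`, `billing_ids`), the sample values, and the helpers are hypothetical and serve only to illustrate the idea.

```python
# Hypothetical key columns from two sources that are about to be joined.
crm_ids = ["A001", "A002", "a003 ", "A004"]
billing_ids = ["A001", "A003", "A004", "A005"]


def normalize(key):
    """Strip stray whitespace and unify case, two common data entry slips."""
    return key.strip().upper()


def unmatched_keys(left, right):
    """Return the keys found in only one source after normalization."""
    left_set = {normalize(k) for k in left}
    right_set = {normalize(k) for k in right}
    return sorted(left_set - right_set), sorted(right_set - left_set)


only_in_crm, only_in_billing = unmatched_keys(crm_ids, billing_ids)
print("Only in CRM:", only_in_crm)
print("Only in billing:", only_in_billing)
```

Reviewing the unmatched keys before joining makes it easier to tell a genuine gap between sources from a typing or formatting mistake.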
Kaushik offers a plain philosophy on data quality
• Assume those managing the data have a level of comfort with the data (i.e., trust them)
• Start making decisions that you are comfortable with
• Over time, drill deeper in micro-specific areas and learn more
• Get more comfortable with data and its limitations over time
• Consistency in calculations = Good
Source: Kaushik, “Data Quality Sucks, Let's Just Get Over It” (2006)
Supplemental reading for this lesson
• Data Quality Sucks, Let’s Just Get Over It: http://www.kaushik.net/avinash/data-quality-sucks-lets-just-get-over-it/
• Tidy Data: http://www.jstatsoft.org/v59/i10/paper
References
1. Jeff Leek. 2013. “How to Share Data with a Statistician”. Retrieved from https://github.com/jtleek/datasharing
2. Hadley Wickham. 2014. “Tidy Data”. Journal of Statistical Software, Vol. 59, Issue 10. Retrieved from www.jstatsoft.org/v59/i10/paper
3. Avinash Kaushik. 2006. “Data Quality Sucks, Let's Just Get Over It”. Retrieved from http://www.kaushik.net/avinash/data-quality-sucks-lets-just-get-over-it/