![Page 1: 06 Data](https://reader034.vdocuments.us/reader034/viewer/2022042814/5552ca64b4c905920f8b4f75/html5/thumbnails/1.jpg)
Hadley Wickham
Stat405
Data
Monday, 14 September 2009
![Page 2: 06 Data](https://reader034.vdocuments.us/reader034/viewer/2022042814/5552ca64b4c905920f8b4f75/html5/thumbnails/2.jpg)
1. Group work
2. Motivating problem
3. Loading & saving data
4. Factors & characters
![Page 3: 06 Data](https://reader034.vdocuments.us/reader034/viewer/2022042814/5552ca64b4c905920f8b4f75/html5/thumbnails/3.jpg)
Group project
I want to help your groups become effective teams.
We’ll spend 15 minutes getting you into teams and establishing expectations. See handouts.
Part of the final project weighting is for team citizenship.
![Page 4: 06 Data](https://reader034.vdocuments.us/reader034/viewer/2022042814/5552ca64b4c905920f8b4f75/html5/thumbnails/4.jpg)
Firing & Quitting
You may fire a non-participating team member, but you need to meet with me and issue a written warning.
If you feel that you are doing all the work in your team, you may quit. You’ll also need to meet with me and give a written warning to the rest of your team.
![Page 5: 06 Data](https://reader034.vdocuments.us/reader034/viewer/2022042814/5552ca64b4c905920f8b4f75/html5/thumbnails/5.jpg)
State regulated payoffs: how can we be sure they’re honest? CC by-nc-nd: http://www.flickr.com/photos/amoleji/2979221622/
![Page 6: 06 Data](https://reader034.vdocuments.us/reader034/viewer/2022042814/5552ca64b4c905920f8b4f75/html5/thumbnails/6.jpg)
Where are we going?
In the next few weeks we will be focussing our attention on some slot machine data. We want to figure out if the slot machine is paying out at the rate the manufacturer claims.
To do this, we’ll need to learn more about data formats and how to write functions.
![Page 7: 06 Data](https://reader034.vdocuments.us/reader034/viewer/2022042814/5552ca64b4c905920f8b4f75/html5/thumbnails/7.jpg)
Loading data
read.table(): white space separated
read.table(sep="\t"): tab separated
read.csv(): comma separated
read.fwf(): fixed width
load(): R binary format
All take file argument
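A minimal sketch of the text-format readers listed above. The tiny data frame and the temporary files are made up for illustration and are not part of the course data.

```r
# Write the same tiny data set in three text formats, then read each back.
df <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))

ws_file  <- tempfile(fileext = ".txt")
tab_file <- tempfile(fileext = ".tsv")
csv_file <- tempfile(fileext = ".csv")

write.table(df, ws_file,  row.names = FALSE, quote = FALSE)              # space separated
write.table(df, tab_file, row.names = FALSE, quote = FALSE, sep = "\t")  # tab separated
write.table(df, csv_file, row.names = FALSE, quote = FALSE, sep = ",")   # comma separated

# read.table() does not assume a header row by default; read.csv() does.
from_ws  <- read.table(ws_file, header = TRUE)
from_tab <- read.table(tab_file, sep = "\t", header = TRUE)
from_csv <- read.csv(csv_file)

# Fixed width: each field occupies a set number of characters.
fwf_file <- tempfile(fileext = ".txt")
writeLines(c("001 2.5", "002 3.0", "003 4.5"), fwf_file)
from_fwf <- read.fwf(fwf_file, widths = c(3, 4), col.names = c("id", "score"))
```

In every call the first argument is the file to read, which is the "file argument" the slide refers to.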
![Page 8: 06 Data](https://reader034.vdocuments.us/reader034/viewer/2022042814/5552ca64b4c905920f8b4f75/html5/thumbnails/8.jpg)
Why csv?
Simple.
Compatible with all statistics software.
Human readable (in 20 years’ time you will still be able to extract data from it).
![Page 9: 06 Data](https://reader034.vdocuments.us/reader034/viewer/2022042814/5552ca64b4c905920f8b4f75/html5/thumbnails/9.jpg)
Your turn
Download the baseball and slots csv files from the website. Practice using read.csv() to load them into R.
Guess the name of the function you might use to write the R object back to a csv file on disk. Practice using it.
What happens if you read in a file you wrote with this method?
![Page 10: 06 Data](https://reader034.vdocuments.us/reader034/viewer/2022042814/5552ca64b4c905920f8b4f75/html5/thumbnails/10.jpg)
batting <- read.csv("batting.csv")
players <- read.csv("players.csv")
slots <- read.csv("slots.csv")

write.csv(slots, "slots-2.csv")
slots2 <- read.csv("slots-2.csv")
str(slots)
str(slots2)

# Better
write.table(slots, file = "slots-3.csv", sep = ",", row.names = FALSE)
slots3 <- read.csv("slots-3.csv")
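To see why the second version is "better", here is a small sketch using a made-up two-row data frame (an assumption, not the real slots data). By default write.csv() also writes the row names, and they come back as an extra column when the file is read again.

```r
df <- data.frame(w1 = c("B", "BB"), prize = c(0, 5))
f1 <- tempfile(fileext = ".csv")
f2 <- tempfile(fileext = ".csv")

write.csv(df, f1)                      # row names written as an unnamed first column
write.csv(df, f2, row.names = FALSE)   # row names omitted

names(read.csv(f1))  # "X" "w1" "prize"
names(read.csv(f2))  # "w1" "prize"
```

write.csv(row.names = FALSE) and write.table(sep = ",", row.names = FALSE) avoid the extra column in the same way.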
![Page 11: 06 Data](https://reader034.vdocuments.us/reader034/viewer/2022042814/5552ca64b4c905920f8b4f75/html5/thumbnails/11.jpg)
Working directory
Remember to set your working directory.
From the terminal (Linux or Mac): the working directory is the directory you’re in when you start R.
On Windows: setwd(choose.dir())
On the Mac: ⌘-D
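A platform-independent sketch of the same idea, using getwd() and setwd(). The temporary directory is only a stand-in for your own project folder.

```r
old <- getwd()           # where R currently looks for files
setwd(tempdir())         # switch to a scratch directory (stand-in for a project folder)

writeLines("a,b\n1,2", "example.csv")
file.exists("example.csv")   # relative paths now resolve inside the new directory
read.csv("example.csv")

setwd(old)               # restore the original working directory
```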
![Page 12: 06 Data](https://reader034.vdocuments.us/reader034/viewer/2022042814/5552ca64b4c905920f8b4f75/html5/thumbnails/12.jpg)
Saving data
# For long-term
write.table(slots, file = "slots-3.csv", sep = ",", row.names = FALSE)

# For short-term caching
save(slots, file = "slots.rdata")
![Page 13: 06 Data](https://reader034.vdocuments.us/reader034/viewer/2022042814/5552ca64b4c905920f8b4f75/html5/thumbnails/13.jpg)
| .csv | .rdata |
| --- | --- |
| read.csv() | load() |
| write.table(sep = ",", row.names = FALSE) | save() |
| Only data frames | Any R object |
| Can be read by any program | Only by R |
| Long term | Short term caching of expensive computations |
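The contrast can be sketched as follows. The object names slots_demo and settings are hypothetical and chosen only for this illustration.

```r
slots_demo <- data.frame(w1 = c("B", "DD"), prize = c(5, 800))
settings   <- list(machine = "demo", n_wheels = 3)   # not a data frame

csv_file   <- tempfile(fileext = ".csv")
rdata_file <- tempfile(fileext = ".rdata")

# csv: one data frame, stored as plain text that any program can open.
write.table(slots_demo, file = csv_file, sep = ",", row.names = FALSE)
readLines(csv_file)

# rdata: several objects of any type, stored in R's binary format.
save(slots_demo, settings, file = rdata_file)

rm(slots_demo, settings)
load(rdata_file)        # restores both objects under their original names
```

Note that load() recreates objects under the names they were saved with, so it can silently overwrite an existing object of the same name.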
![Page 14: 06 Data](https://reader034.vdocuments.us/reader034/viewer/2022042814/5552ca64b4c905920f8b4f75/html5/thumbnails/14.jpg)
Cleaning
I cleaned up slots.csv for you to practice with. The original data was slots.txt. Your next task is to perform the cleaning yourself.
This should always be the first step in an analysis: ensure that your data is available as a clean csv file. Do this once, in a file called clean.r.
![Page 15: 06 Data](https://reader034.vdocuments.us/reader034/viewer/2022042814/5552ca64b4c905920f8b4f75/html5/thumbnails/15.jpg)
Your turn
Take two minutes to find as many differences as possible between slots.txt and slots.csv.
What did I do to clean up the file?
![Page 16: 06 Data](https://reader034.vdocuments.us/reader034/viewer/2022042814/5552ca64b4c905920f8b4f75/html5/thumbnails/16.jpg)
Cleaning
• Convert from space delimited to csv
• Add variable names
• Convert uninformative numbers to informative labels
![Page 17: 06 Data](https://reader034.vdocuments.us/reader034/viewer/2022042814/5552ca64b4c905920f8b4f75/html5/thumbnails/17.jpg)
Variable names
names(slots)
names(slots) <- c("w1", "w2", "w3", "prize", "night")
dput(names(slots))
This is a general pattern we’ll see a lot of
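One reading of the "general pattern" is that the same function name is used both to get a property and, in its replacement form, to set it. A small sketch with a made-up data frame:

```r
df <- data.frame(V1 = 1:2, V2 = c(0, 5))

names(df)                       # getter: "V1" "V2"
names(df) <- c("w1", "prize")   # replacement form of the same function
names(df)                       # "w1" "prize"

# The same get/set pairing exists for other properties, e.g. levels():
f <- factor(c("a", "b", "a"))
levels(f) <- c("low", "high")
as.character(f)                 # "low" "high" "low"

# dput() prints R code that recreates an object, handy for pasting into a script.
dput(names(df))                 # c("w1", "prize")
```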
![Page 18: 06 Data](https://reader034.vdocuments.us/reader034/viewer/2022042814/5552ca64b4c905920f8b4f75/html5/thumbnails/18.jpg)
Factors
• R’s way of storing categorical data
• Have ordered levels() which:
• Control order on plots and in table()
• Are preserved across subsets
• Affect contrasts in linear models
![Page 19: 06 Data](https://reader034.vdocuments.us/reader034/viewer/2022042814/5552ca64b4c905920f8b4f75/html5/thumbnails/19.jpg)
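# Creating a factor
x <- sample(5, 20, replace = TRUE)
a <- factor(x)
b <- factor(x, levels = 1:10)
# levels = 1:5 keeps the five labels valid even if a value is missing from the sample
c <- factor(x, levels = 1:5, labels = letters[1:5])
levels(a); levels(b); levels(c)
table(a); table(b); table(c)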
![Page 20: 06 Data](https://reader034.vdocuments.us/reader034/viewer/2022042814/5552ca64b4c905920f8b4f75/html5/thumbnails/20.jpg)
# Subsets
b2 <- b[1:5]
levels(b2)
table(b2)
# Remove extra levels
b2[, drop=T]
factor(b2)
# Convert to character
b3 <- as.character(b)
table(b3)
table(b3[1:5])
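Because x is a random sample, the output above differs from run to run. A deterministic sketch of the same steps, with a fixed vector assumed here purely for illustration:

```r
x <- c(1, 3, 3, 5, 2, 4)
b <- factor(x, levels = 1:10)

b2 <- b[1:3]
nlevels(b2)                   # 10: subsetting keeps every level
table(b2)                     # counts include zeros for the unused levels

nlevels(b2[, drop = TRUE])    # 2
nlevels(droplevels(b2))       # 2, an equivalent spelling

table(as.character(b)[1:3])   # only "1" and "3" appear
```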
![Page 21: 06 Data](https://reader034.vdocuments.us/reader034/viewer/2022042814/5552ca64b4c905920f8b4f75/html5/thumbnails/21.jpg)
as.numeric(a)
as.numeric(b)
as.numeric(c)
d <- factor(x, levels = 1:5, labels = 2^(1:5))
as.numeric(d)
as.character(d)
as.numeric(as.character(d))
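The point of these calls is that as.numeric() on a factor returns the internal integer codes, not the labels. A deterministic sketch with a fixed x (an assumption, so the output is reproducible):

```r
x <- c(1, 3, 5)
d <- factor(x, levels = 1:5, labels = 2^(1:5))

levels(d)                     # "2" "4" "8" "16" "32"
as.numeric(d)                 # 1 3 5   (internal integer codes)
as.character(d)               # "2" "8" "32"
as.numeric(as.character(d))   # 2 8 32  (the labelled values)
```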
![Page 22: 06 Data](https://reader034.vdocuments.us/reader034/viewer/2022042814/5552ca64b4c905920f8b4f75/html5/thumbnails/22.jpg)
Character vs. factor
Characters don’t remember all levels. Tables of characters are always ordered alphabetically.
By default, strings are converted to factors when loading data frames (this default changed to FALSE in R 4.0.0).
Use stringsAsFactors = FALSE to turn this off for one data frame, or options(stringsAsFactors = FALSE) to turn it off for the whole session.
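A short sketch of the difference, passing stringsAsFactors explicitly so that it behaves the same regardless of R version. The file contents are invented for the example.

```r
csv_file <- tempfile(fileext = ".csv")
writeLines(c("name,treatment", "Ann,placebo", "Bob,drug", "Cy,drug"), csv_file)

as_factor <- read.csv(csv_file, stringsAsFactors = TRUE)
as_char   <- read.csv(csv_file, stringsAsFactors = FALSE)

class(as_factor$treatment)   # "factor"
class(as_char$treatment)     # "character"

# A factor subset still reports every level; a character subset does not.
table(as_factor$treatment[1])   # drug 0, placebo 1
table(as_char$treatment[1])     # placebo 1
```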
![Page 23: 06 Data](https://reader034.vdocuments.us/reader034/viewer/2022042814/5552ca64b4c905920f8b4f75/html5/thumbnails/23.jpg)
Character vs. factor
Use a factor when there is a well-defined set of all possible values.
Use a character vector when there are potentially infinite possibilities.
![Page 24: 06 Data](https://reader034.vdocuments.us/reader034/viewer/2022042814/5552ca64b4c905920f8b4f75/html5/thumbnails/24.jpg)
Quiz
Take one minute to decide which data type is most appropriate for each of the following variables collected in a medical experiment:
Subject id, name, treatment, sex, address, race, eye colour, birth city, birth state.
![Page 25: 06 Data](https://reader034.vdocuments.us/reader034/viewer/2022042814/5552ca64b4c905920f8b4f75/html5/thumbnails/25.jpg)
Your turn
Convert w1, w2 and w3 to factors with labels from the adjacent table
Rearrange levels in terms of value: DD, 7, BBB, BB, B, C, 0
Save as a csv file
Read in and look at levels. Compare to input with stringsAsFactors = F
0 Blank (0)
1 Single Bar (B)
2 Double Bar (BB)
3 Triple Bar (BBB)
5 Double Diamond (DD)
6 Cherries (C)
7 Seven (7)
![Page 26: 06 Data](https://reader034.vdocuments.us/reader034/viewer/2022042814/5552ca64b4c905920f8b4f75/html5/thumbnails/26.jpg)
slots <- read.table("slots.txt")
names(slots) <- c("w1", "w2", "w3", "prize", "night")

levels <- c(0, 1, 2, 3, 5, 6, 7)
labels <- c("0", "B", "BB", "BBB", "DD", "C", "7")

slots$w1 <- factor(slots$w1, levels = levels, labels = labels)
slots$w2 <- factor(slots$w2, levels = levels, labels = labels)
slots$w3 <- factor(slots$w3, levels = levels, labels = labels)

write.table(slots, "slots.csv", sep = ",", row.names = FALSE)
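The code above labels the wheels but leaves the levels in code order. One possible way to approach the remaining steps of the exercise (reordering by value, then checking what survives a csv round trip) is sketched below. The five wheel codes are hypothetical stand-ins for a column of slots.txt.

```r
# Hypothetical wheel codes standing in for one column of slots.txt.
codes  <- c(0, 1, 2, 3, 5, 6, 7)
labels <- c("0", "B", "BB", "BBB", "DD", "C", "7")
w1 <- factor(c(7, 0, 5, 1, 6), levels = codes, labels = labels)

# Rebuild the factor with the levels listed in order of value.
by_value <- c("DD", "7", "BBB", "BB", "B", "C", "0")
w1 <- factor(w1, levels = by_value)
levels(w1)

# Round trip through csv: the level order is not stored in the file.
csv_file <- tempfile(fileext = ".csv")
write.table(data.frame(w1 = w1), csv_file, sep = ",", row.names = FALSE)

levels(read.csv(csv_file, stringsAsFactors = TRUE)$w1)   # only the labels present, re-sorted
str(read.csv(csv_file, stringsAsFactors = FALSE)$w1)     # plain character vector
```

After reading the file back, the value ordering has to be reapplied with factor(..., levels = by_value).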