12 adv-manip
![Page 1: 12 adv-manip](https://reader036.vdocuments.us/reader036/viewer/2022062405/5552ca59b4c905920f8b4f6d/html5/thumbnails/1.jpg)
Hadley Wickham
Stat405
Advanced data manipulation 2
Thursday, 30 September 2010
![Page 2: 12 adv-manip](https://reader036.vdocuments.us/reader036/viewer/2022062405/5552ca59b4c905920f8b4f6d/html5/thumbnails/2.jpg)
1. Colloquium Monday
2. String basics
3. Group-wise transformations
4. Practice challenges
![Page 3: 12 adv-manip](https://reader036.vdocuments.us/reader036/viewer/2022062405/5552ca59b4c905920f8b4f6d/html5/thumbnails/3.jpg)
Colloquium on summer research experiences
When: Monday, 4:00 - 5:00
Where: DH 1070
Coffee and cookies will be served ahead of time.
![Page 4: 12 adv-manip](https://reader036.vdocuments.us/reader036/viewer/2022062405/5552ca59b4c905920f8b4f6d/html5/thumbnails/4.jpg)
Speakers
1. James Rigby - summer institute in bioinformatics
2. Gabi Quart - internship at Deutsche Bank
3. Liz Jackson - survey methodology summer program
4. Ollie McDonald - internship at Novartis in Switzerland
5. Christine Peterson - research in the Med center
6. Joseph Egbulefu - research at Rice
![Page 5: 12 adv-manip](https://reader036.vdocuments.us/reader036/viewer/2022062405/5552ca59b4c905920f8b4f6d/html5/thumbnails/5.jpg)
String basics
![Page 6: 12 adv-manip](https://reader036.vdocuments.us/reader036/viewer/2022062405/5552ca59b4c905920f8b4f6d/html5/thumbnails/6.jpg)
install.packages("stringr")
library(stringr)

str_length("Hadley")
str_c(letters, LETTERS)
str_c(letters, LETTERS, sep = " ")
str_c("H", "a", "d", "l", "e", "y", collapse = "")

tolower("Hadley")
toupper("Hadley")

str_sub("Hadley", 1, 3)
str_sub("Hadley", -1)
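These functions are vectorised, so they apply to a whole column of names at once. The sketch below uses a small made-up vector (`nm` is a hypothetical name, not part of the course data) to show length, first letter, and last letter, with negative positions counting from the end of the string.

```r
library(stringr)

# Hypothetical sample names, not taken from the baby names file
nm <- c("Hadley", "Al", "Christine")

str_length(nm)            # number of characters in each name
str_sub(nm, 1, 1)         # first letter of each name
str_sub(nm, -1, -1)       # last letter of each name
str_sub("Hadley", -3, -1) # last three letters: "ley"
```

Because each call returns one value per input string, the results can be assigned directly as new columns of a data frame.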
![Page 7: 12 adv-manip](https://reader036.vdocuments.us/reader036/viewer/2022062405/5552ca59b4c905920f8b4f6d/html5/thumbnails/7.jpg)
Your turn
Add new columns that give:
The length of each name
The first letter of the name
The last letter of the name
![Page 8: 12 adv-manip](https://reader036.vdocuments.us/reader036/viewer/2022062405/5552ca59b4c905920f8b4f6d/html5/thumbnails/8.jpg)
library(stringr)
bnames <- read.csv("baby-names2.csv.bz2", stringsAsFactors = FALSE)
bnames <- transform(bnames,
  length = str_length(name),
  first = str_sub(name, 1, 1),
  last = str_sub(name, -1, -1))
![Page 9: 12 adv-manip](https://reader036.vdocuments.us/reader036/viewer/2022062405/5552ca59b4c905920f8b4f6d/html5/thumbnails/9.jpg)
Your turn
Explore how the average length of names has changed over time (for each sex).
![Page 10: 12 adv-manip](https://reader036.vdocuments.us/reader036/viewer/2022062405/5552ca59b4c905920f8b4f6d/html5/thumbnails/10.jpg)
library(plyr)
sy <- ddply(bnames, c("sex", "year"), summarise,
  avg_length = weighted.mean(length, prop))

library(ggplot2)
qplot(year, avg_length, data = sy, colour = sex, geom = "line")
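The summary above uses `weighted.mean(length, prop)` rather than a plain `mean(length)`. A tiny sketch with made-up numbers (the vectors `len` and `prop` below are hypothetical) suggests why: a plain mean counts every name once, while the weighted mean counts every baby once.

```r
# Hypothetical group: two names within one sex and year
len  <- c(3, 7)       # name lengths
prop <- c(0.9, 0.1)   # share of babies given each name

mean(len)                 # 5: both names count equally
weighted.mean(len, prop)  # 3.4: the common short name dominates
```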
![Page 11: 12 adv-manip](https://reader036.vdocuments.us/reader036/viewer/2022062405/5552ca59b4c905920f8b4f6d/html5/thumbnails/11.jpg)
# Another approach
syl <- ddply(bnames, c("sex", "length", "year"), summarise,
  prop = sum(prop))
qplot(year, prop, data = syl, colour = sex, geom = "line") +
  facet_wrap(~ length)
![Page 12: 12 adv-manip](https://reader036.vdocuments.us/reader036/viewer/2022062405/5552ca59b4c905920f8b4f6d/html5/thumbnails/12.jpg)
Transformations
![Page 13: 12 adv-manip](https://reader036.vdocuments.us/reader036/viewer/2022062405/5552ca59b4c905920f8b4f6d/html5/thumbnails/13.jpg)
Transformations
What about group-wise transformations? For example, what if we want to compute the rank of a name within a sex and year?
This task is easy if we have a single year and sex, but hard otherwise.
![Page 14: 12 adv-manip](https://reader036.vdocuments.us/reader036/viewer/2022062405/5552ca59b4c905920f8b4f6d/html5/thumbnails/14.jpg)
How would you do it for a single group?
![Page 15: 12 adv-manip](https://reader036.vdocuments.us/reader036/viewer/2022062405/5552ca59b4c905920f8b4f6d/html5/thumbnails/15.jpg)
one <- subset(bnames, sex == "boy" & year == 2008)
one$rank <- rank(-one$prop, ties.method = "first")

# or
one <- transform(one, rank = rank(-prop, ties.method = "first"))
head(one)
What if we want to transform every sex and year?
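Before scaling up, it may help to see what `rank(-prop, ties.method = "first")` does on a tiny vector. The values below are made up for illustration: negating the input makes the largest proportion rank 1, and `ties.method = "first"` breaks ties by position so every rank is a distinct integer.

```r
prop <- c(0.1, 0.5, 0.5, 0.2)   # hypothetical proportions with one tie

rank(prop)                          # 1.0 3.5 3.5 2.0: smallest value is rank 1
rank(-prop)                         # 4.0 1.5 1.5 3.0: largest is first, ties averaged
rank(-prop, ties.method = "first")  # 4 1 2 3: ties broken by order of appearance
```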
![Page 16: 12 adv-manip](https://reader036.vdocuments.us/reader036/viewer/2022062405/5552ca59b4c905920f8b4f6d/html5/thumbnails/16.jpg)
Workflow
1. Extract a single group
2. Figure out how to solve it for just that group
3. Use ddply to solve it for all groups
![Page 17: 12 adv-manip](https://reader036.vdocuments.us/reader036/viewer/2022062405/5552ca59b4c905920f8b4f6d/html5/thumbnails/17.jpg)
How would you use ddply to calculate all ranks?
![Page 18: 12 adv-manip](https://reader036.vdocuments.us/reader036/viewer/2022062405/5552ca59b4c905920f8b4f6d/html5/thumbnails/18.jpg)
bnames <- ddply(bnames, c("sex", "year"), transform,
  rank = rank(-prop, ties.method = "first"))
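To check that the ranking restarts within each group, the same call can be tried on a miniature data frame. The `toy` data below is hypothetical (the proportions are invented, not real values from the baby names file).

```r
library(plyr)

# Hypothetical miniature of bnames: two groups of three names each
toy <- data.frame(
  year = 2008,
  sex  = rep(c("boy", "girl"), each = 3),
  name = c("Jacob", "Michael", "Ethan", "Emma", "Isabella", "Emily"),
  prop = c(0.010, 0.009, 0.009, 0.009, 0.009, 0.008),  # invented values
  stringsAsFactors = FALSE
)

# Rank names by descending proportion, separately for each sex and year
ranked <- ddply(toy, c("sex", "year"), transform,
  rank = rank(-prop, ties.method = "first"))
ranked
```

Each sex gets its own ranks 1 to 3, and the result keeps all six rows with one extra column.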
![Page 19: 12 adv-manip](https://reader036.vdocuments.us/reader036/viewer/2022062405/5552ca59b4c905920f8b4f6d/html5/thumbnails/19.jpg)
ddply + transform = group-wise transformation
ddply + summarise = per-group summaries
ddply + subset = per-group subsets
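The three combinations differ mainly in the shape of what comes back. A minimal sketch on a hypothetical data frame (`toy`, with placeholder names and invented proportions) shows the row counts each one produces.

```r
library(plyr)

toy <- data.frame(
  sex  = rep(c("boy", "girl"), each = 3),
  name = c("A", "B", "C", "D", "E", "F"),   # placeholder names
  prop = c(0.5, 0.3, 0.2, 0.6, 0.3, 0.1),   # invented values
  stringsAsFactors = FALSE
)

# transform: same rows, plus a column computed within each group
with_rank <- ddply(toy, "sex", transform,
  rank = rank(-prop, ties.method = "first"))

# summarise: one row per group
totals <- ddply(toy, "sex", summarise, total = sum(prop), top = max(prop))

# subset: only the rows that meet a per-group condition
top_rows <- ddply(toy, "sex", subset, prop == max(prop))

nrow(with_rank)  # 6
nrow(totals)     # 2
nrow(top_rows)   # 2
```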
![Page 20: 12 adv-manip](https://reader036.vdocuments.us/reader036/viewer/2022062405/5552ca59b4c905920f8b4f6d/html5/thumbnails/20.jpg)
Tools
You now have all the tools to solve 95% of data manipulation problems in R. It’s just a matter of figuring out which tools to use, and how to combine them.
The following challenges will give you some practice.
![Page 21: 12 adv-manip](https://reader036.vdocuments.us/reader036/viewer/2022062405/5552ca59b4c905920f8b4f6d/html5/thumbnails/21.jpg)
Challenges
![Page 22: 12 adv-manip](https://reader036.vdocuments.us/reader036/viewer/2022062405/5552ca59b4c905920f8b4f6d/html5/thumbnails/22.jpg)
Warmups
Which names were most popular in 1999?
Work out the average proportion for each name.
List the 10 names with the highest average proportions.
![Page 23: 12 adv-manip](https://reader036.vdocuments.us/reader036/viewer/2022062405/5552ca59b4c905920f8b4f6d/html5/thumbnails/23.jpg)
# Which names were most popular in 1999?
subset(bnames, year == 1999 & rank < 10)
# Filter to 1999 first so that max(prop) is taken over that year only,
# not over the whole data set
n1999 <- subset(bnames, year == 1999)
subset(n1999, prop == max(prop))

# Average usage
overall <- ddply(bnames, "name", summarise, prop = mean(prop))

# Top 10 names
head(arrange(overall, desc(prop)), 10)
![Page 24: 12 adv-manip](https://reader036.vdocuments.us/reader036/viewer/2022062405/5552ca59b4c905920f8b4f6d/html5/thumbnails/24.jpg)
Challenge 1
How has the total proportion of babies with names in the top 1000 changed over time?
How has the popularity of different initials changed over time?
![Page 25: 12 adv-manip](https://reader036.vdocuments.us/reader036/viewer/2022062405/5552ca59b4c905920f8b4f6d/html5/thumbnails/25.jpg)
sy <- ddply(bnames, c("year", "sex"), summarise,
  prop = sum(prop),
  npop = sum(prop > 1/1000))

qplot(year, prop, data = sy, colour = sex, geom = "line")
qplot(year, npop, data = sy, colour = sex, geom = "line")
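The slide shows no solution for the question about initials. One possible approach, sketched here on hypothetical data, is to sum `prop` over each year, sex, and first letter, reusing the idea of the `first` column created earlier with `str_sub(name, 1, 1)`. The `toy` data frame and its proportions are invented for illustration.

```r
library(plyr)
library(stringr)

# Hypothetical miniature of bnames
toy <- data.frame(
  year = rep(c(1900, 2000), each = 3),
  sex  = "girl",
  name = c("Mary", "Margaret", "Anna", "Madison", "Ashley", "Alexis"),
  prop = c(0.05, 0.02, 0.02, 0.01, 0.01, 0.01),  # invented values
  stringsAsFactors = FALSE
)
toy$first <- str_sub(toy$name, 1, 1)

# Total proportion of babies per initial, within each year and sex
init <- ddply(toy, c("year", "sex", "first"), summarise, prop = sum(prop))
init

# With the full data, one plausible way to look at the result:
# qplot(year, prop, data = init, colour = sex, geom = "line") + facet_wrap(~ first)
```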
![Page 26: 12 adv-manip](https://reader036.vdocuments.us/reader036/viewer/2022062405/5552ca59b4c905920f8b4f6d/html5/thumbnails/26.jpg)
Challenge 2
For each name, find the year in which it was most popular, and the rank in that year. (Hint: you might find which.max useful).
Print all names that have been the most popular name at least once.
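As a reminder of what the hint refers to, `which.max` returns the position of the largest value (the first one when there is a tie), and that position can then index other vectors of the same length. The vectors below are a hypothetical history of a single name; `rnk` is an illustrative variable name.

```r
# Hypothetical history of a single name
year <- c(1990, 1991, 1992, 1993)
prop <- c(0.002, 0.005, 0.005, 0.001)
rnk  <- c(40, 12, 15, 60)

i <- which.max(prop)  # 2: position of the first maximum
year[i]               # 1991: the year of peak popularity
rnk[i]                # 12: the rank in that same year
```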
![Page 27: 12 adv-manip](https://reader036.vdocuments.us/reader036/viewer/2022062405/5552ca59b4c905920f8b4f6d/html5/thumbnails/27.jpg)
most_pop <- ddply(bnames, "name", summarise,
  year = year[which.max(prop)],
  rank = min(rank))
most_pop <- ddply(bnames, "name", subset, prop == max(prop))

subset(most_pop, rank == 1)

# Double challenge: Why is this last one wrong?
![Page 28: 12 adv-manip](https://reader036.vdocuments.us/reader036/viewer/2022062405/5552ca59b4c905920f8b4f6d/html5/thumbnails/28.jpg)
Challenge 3
What name has been in the top 10 most often?
(Hint: you'll have to do this in three steps. Think about what they are before starting)
![Page 29: 12 adv-manip](https://reader036.vdocuments.us/reader036/viewer/2022062405/5552ca59b4c905920f8b4f6d/html5/thumbnails/29.jpg)
top10 <- subset(bnames, rank <= 10)
counts <- count(top10, c("sex", "name"))

ddply(counts, "sex", subset, freq == max(freq))
head(arrange(counts, desc(freq)), 10)
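The middle step relies on `count()` from plyr, which tallies how many rows share each combination of the given variables and stores the tally in a column named `freq`. A small sketch on hypothetical data (this `top10` is invented, standing in for the real subset):

```r
library(plyr)

# Hypothetical stand-in for the top-10 subset: one row per name, sex and year
top10 <- data.frame(
  year = c(2006, 2007, 2008, 2006, 2007, 2008),
  sex  = c("boy", "boy", "boy", "girl", "girl", "girl"),
  name = c("Jacob", "Jacob", "Michael", "Emma", "Emily", "Emma"),
  stringsAsFactors = FALSE
)

# Number of years each name appears, per sex
counts <- count(top10, c("sex", "name"))
counts

# The most frequent name within each sex
ddply(counts, "sex", subset, freq == max(freq))
```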
![Page 30: 12 adv-manip](https://reader036.vdocuments.us/reader036/viewer/2022062405/5552ca59b4c905920f8b4f6d/html5/thumbnails/30.jpg)
Homework
No homework this week.
Use what you’ve learned to make your projects even better!