R Programming Taster Session
Slides for a research methods class on using R.
Taster/Skills Set Session
R-Programming
is a lot like Magic
Instead of spells, you have functions.
Muggles
Incapable of magic and hardly aware of it.
• Limited ability to change the environment.
• Must rely on algorithms developed for them.
• Problem-solving constrained by SPSS developers.
• Must pay for using the constrained algorithms.
Most people are muggles.
And that’s okay.
Wizards
• Can use functions made by top statistics researchers or create their own.
• Almost unlimited in their ability to change their environment.
• Can do things SPSS users cannot even dream of.
• Get their powers for free.
Warning! Here’s the small print.
Wizards also...
• Love to stretch their brains
• Have strong sitting muscles
• Put in the effort to learn
• Persist with puzzles
• Feel at home with the esoteric and obscure
Do you still want to be a wizard?
Syllabus
History of Magic: Origins of R
Arithmancy: Learning the system
Transfiguration: Working with data
Divination: Models and predictions
History of Magic
What is R?
R is a computer language used for data manipulation, statistics, and graphics.
Learning any new language is tough.
Grammar, vocabulary, idioms, orthography, a new world view...
The payoff is a whole new world of possibility.
Advantages
• Open source
• State of the art
• Publication-quality graphics
• Reproducible research
• Computer-intensive analyses
• Makes you think
• Easy interface with databases
Disadvantages
• Not user friendly at start
• Minimal GUI
• Easy to lose “sense” of data
1976 – Bell Labs develops S, a language for data analysis; released commercially as S-plus.
1990s – R written and released as open source by (R)oss Ihaka and (R)obert Gentleman.
1997 – The Comprehensive R Archive Network (CRAN) launched.
Today – 2781 user-contributed packages for R.
Accio R.
To download R, go to
http://cran.r-project.org/bin/
Windows Mac Linux
Software | Pros | Cons
[logo] | Easy(ish), common in psychology | Limited analytic capability
[logo] | Easy, common in business | Very limited analytic capability
[logo] | Elegant matrix support | Expensive, lacks statistics support
[logo] | Extensibility, visualization, programmability | Learning curve
data analysis contests
Why R?
• EVERYTHING in one framework
‣ base: linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering etc.
‣ packages from Medical Image Analysis to Pharmacokinetics
• CUSTOM functionality
‣ Programming ➞ Automation
Practical Benefits
• Multiple datasets open at once
• Automate away “click-click-click” tasks
• Reproducibility
Why not ?
deducer
Taster session
Learning
• Self-study
‣ Past programming experience recommended
‣ Lots of expert advice available
• Oxford
‣ e.g., Ruth Ripley, Department of Statistics
‣ We’ll scratch the surface today
Arithmancy
Working with R
R | SPSS
Multi-dimensional data | Rectangular data (“spreadsheet”)
Functions can be modified | Proprietary functions
Interactive experience | Passive experience
Extensible | Cross/up-selling
Open and free | Commercial
New Mindset
Getting started with R
(Not very consoling) R console
Write a script here and run it.
Output appears here. Did you get what you wanted?
Revise the script and run it again.
Saved scripts can be rerun later.
Interactive data analysis session
write script
run script
Textmate
Grammar of Spells
object = function(arguments)
Assignment operator
Guess what this does!
z = read.table("MyFile.txt")
Two ways about it
For assignment, = is the same as <-
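A minimal sketch of the spell grammar in practice. The file contents are invented for illustration, and the file is written to a temporary folder so the example runs on its own; the slides assume "MyFile.txt" already exists.

```r
# Hypothetical example data, written to a temporary file so the
# sketch is self-contained.
path <- file.path(tempdir(), "MyFile.txt")
writeLines(c("item cost", "apple 1.5", "melon 3.0"), path)

# object <- function(arguments)
z <- read.table(path, header = TRUE)

# "=" also assigns at the top level, so this creates an identical object.
z2 = read.table(path, header = TRUE)

identical(z, z2)  # TRUE
```

Most R style guides prefer <- for assignment and keep = for naming arguments inside a function call, as in header = TRUE above.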
Data Frames
z
You can also use, e.g., the read.csv() function, or read.spss() from the foreign package.
Accessing data
z[1,]
Read 1st row, all columns.
z[1,3]
Read cell at 1st row, 3rd column.
z[,3]
Read 3rd column.
z[,3:6]
Read columns from 3rd to 6th.
z$avbity
Read 3rd column by name (as a vector).
z["avbity"]
Read 3rd column by name (as a one-column data frame).
How about?
z[1:6,1:3]
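To try the indexing forms yourself, here is a small sketch. The data frame is a made-up stand-in for z: only the column name "avbity" and its position (third) come from the slides, the other columns and all values are assumptions.

```r
# Hypothetical data frame standing in for z.
z <- data.frame(
  id     = 1:8,
  cost   = c(1.2, 3.5, 0.9, 2.0, 4.1, 1.7, 2.6, 0.5),
  avbity = c(3, 1, 4, 1, 5, 9, 2, 6)
)

z[1, ]        # 1st row, all columns (a data frame with one row)
z[1, 3]       # single cell: 1st row, 3rd column
z[, 3]        # 3rd column as a plain vector
z$avbity      # same column, by name, as a vector
z["avbity"]   # same column, kept as a one-column data frame
z[1:6, 1:3]   # rows 1 to 6 of columns 1 to 3
```

The pattern is always z[rows, columns]; leaving one side of the comma empty means "all of them".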
Subsets
Task: Make a data set of items that cost less than 2.
Subset function
z.cheap <- subset(z, cost < 2)
Can you make sense of this?
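A runnable sketch of the subset task. The items and prices are invented; only the column name "cost" and the condition come from the slide.

```r
# Hypothetical price list.
z <- data.frame(
  item = c("quill", "cauldron", "ink", "owl treats"),
  cost = c(1.5, 12, 0.8, 2),
  stringsAsFactors = FALSE
)

# subset() keeps the rows where the condition is TRUE.
z.cheap <- subset(z, cost < 2)

# The same selection written with bracket indexing.
z.cheap2 <- z[z$cost < 2, ]

z.cheap$item  # "quill" "ink"
```

Note that "owl treats" is dropped: its cost is exactly 2, and the condition asks for strictly less than 2.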
Transfiguration
Practical magic
Task: Transform a data set from individual data to pair-wise data.
(A typical tall-to-wide transformation.)
Create a data set “c” in which each row has data from both the male and female in each pair from data set “p”.
Goal
A pair
How would you do this in SPSS?
• Create an id variable for each pair.
• Click Data > Restructure.
‣ You want the second option, “Restructure selected cases into variables”.
• Move id variable into the “Identifier Variable/s” and click “Next.”
• Click “Yes” when asked whether you want to sort the data.
• For “Order of New Variables,” click “Group by Original Variable” and click “Next.”
The Plan
• Step 1
‣ Make a variable to identify each pair
• Step 2
‣ Split the tall data into two parts: one chunk for men and one chunk for women
• Step 3
‣ Merge the two chunks side by side using the pair identifier
Participant ids
10/10 = 1
11/10 = 1.1
When rounded, both equal 1.
Create pair ID
p$pair_id <- round(p$code/10)
Now each member of a pair has a common ID.
Separate genders
men <- subset(p, gender == "Male")
women <- subset(p, gender == "Female")
Merge sets
c <- merge(men, women, by.x = "pair_id", by.y = "pair_id")
In the merged data, columns from men get the suffix “.x” and columns from women get the suffix “.y”.
Ugly variable names
Rename variables
names(c) <- gsub(
  "\\.x$",   # find the “.x” suffix
  ".m",      # replace with “.m”
  names(c))
names(c) <- gsub(
  "\\.y$",   # find the “.y” suffix
  ".f",      # replace with “.f”
  names(c))
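The three steps can be tried end to end with a toy data set. This is only a sketch: the column names (code, gender, BirthYear) follow the slides, the values are invented, the result is called wide rather than c to avoid clashing with R's c() function, and the suffixes argument of merge() is swapped in for the separate renaming step.

```r
# Hypothetical tall data: one row per person, two people per pair.
p <- data.frame(
  code      = c(10, 11, 20, 21, 30, 31),
  gender    = c("Male", "Female", "Male", "Female", "Male", "Female"),
  BirthYear = c(1985, 1987, 1990, 1989, 1979, 1982),
  stringsAsFactors = FALSE
)

# Step 1: a shared identifier for both members of a pair
# (codes 10 and 11 both round to 1).
p$pair_id <- round(p$code / 10)

# Step 2: split the tall data by gender.
men   <- subset(p, gender == "Male")
women <- subset(p, gender == "Female")

# Step 3: merge side by side; "suffixes" labels the clashing columns
# directly, so no renaming is needed afterwards.
wide <- merge(men, women, by = "pair_id", suffixes = c(".m", ".f"))

names(wide)
nrow(wide)  # 3, one row per pair
```

The rounding trick assumes the two codes in a pair share everything but the last digit and that the last digit stays below 5; floor(p$code / 10) would be a more forgiving choice if codes such as 15 or 16 can occur.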
But...
Wouldn’t it be useful to have each participant’s age instead of their birth year?
Do it all over again. Click click click click click click.
Just add a line of code to the top:
p$Age = (2011 - p$BirthYear)
Now re-run the script.
Practical magic
Task: Extract participants’ written responses for statistical analysis in LIWC.
(For analysis, LIWC requires each text response in a separate file.)
Extract each cell to a text file.
62 participants × 8 variables = 496 files
Manual labour
• Boring
• Prone to human errors
• Risk of repetitive strain injury
• You have better things to do
The R way
• Quick
• Efficient
• Repeatable
The Plan
• Step 1
‣ Load SPSS data into R
• Step 2
‣ Create a function that extracts the cell contents and writes them to a file based on participant id and variable name
• Step 3
‣ Run the function on the data
Load data to R
library(foreign)
d <- read.spss("RESEARCH_DATA_FILE.sav", to.data.frame = TRUE)
Function ingredients
• Information to identify the right cell
‣ Participant id (the right row)
‣ Variable name (the right column)
• A unique file name
‣ We’ll just use the above information + “.txt”
The Function
saveText <- function(id, variable) {
  data = subset(d, d$Ppno == id)
  value = as.character(data[variable][1,1])
  filename = paste(id, variable, ".txt", sep = "")
  writeLines(value, con = filename)
}
• saveText: The name of our function. I could have used “Waddiwasi” instead, but I didn’t.
• function: The function to make functions.
• (id, variable): The function requires two things to work: the participant id and the name of the variable to extract.
• data: Create a new object “data” that contains only the rows from “d” where the Ppno is the same as the id fed into the function.
• value: Create a new object “value” that contains the specified variable from the participant data in text format.
• filename: Create a new object “filename” by squishing together the participant id, the variable name, and “.txt”.
• writeLines: Save the value to a file (name specified by filename).
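To see the function at work without the SPSS file, here is a sketch with a tiny made-up data frame in place of d. The dir argument and the returned path are additions for illustration (so the file lands in a temporary folder and can be read back); they are not part of the function on the slides.

```r
# Hypothetical stand-in for the SPSS data: two participants, two text variables.
d <- data.frame(
  Ppno     = c(1, 2),
  Comments = c("Lovely evening.", "Too noisy."),
  Descr    = c("Tall, cheerful.", "Quiet, bookish."),
  stringsAsFactors = FALSE
)

# Same logic as saveText() on the slides, plus an assumed "dir" argument.
saveText <- function(id, variable, dir = tempdir()) {
  data     <- subset(d, d$Ppno == id)                 # rows for this participant
  value    <- as.character(data[variable][1, 1])      # the cell, as text
  filename <- paste(id, variable, ".txt", sep = "")   # e.g. "1Comments.txt"
  path     <- file.path(dir, filename)
  writeLines(value, con = path)
  invisible(path)                                     # report where it was saved
}

saved <- saveText(1, "Comments")
basename(saved)   # "1Comments.txt"
readLines(saved)  # "Lovely evening."
```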
Ok, what’s next?
• Since the function writes out the data one cell at a time (based on two bits of information), we need two lists to automate our work:
‣ A list of participants
‣ A list of all the variables we need
Get ready to run the function
participants = unique(d$Ppno)
variables = c(
  "phys_attra", "pers_attra",
  "Descr__app", "Comments",
  "Signal_conveyed", "portrayyou",
  "their_signals", "their_portrayal")
“For each participant, go through the variables and save the results for each.”
Run, function, run!
List of participants
List of variables
Our function
Loopty loop
for (participant in participants) {
  for (variable in variables) {
    saveText(participant, variable)
  }
}
Outer loop: do this once per participant (62 times total).
Inner loop: do this once per variable (8 times for each participant).
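A scaled-down sketch of the whole run, assuming three invented participants and two variables instead of 62 and 8. The output folder name is hypothetical, and the files go to a temporary folder rather than the working directory.

```r
# Hypothetical miniature of the research data.
d <- data.frame(
  Ppno       = c(1, 2, 3),
  phys_attra = c("tall", "smiling", "stylish"),
  Comments   = c("fun", "awkward", "fine"),
  stringsAsFactors = FALSE
)

# Assumed output folder; emptied first so the file count below is exact.
out_dir <- file.path(tempdir(), "liwc_texts")
unlink(out_dir, recursive = TRUE)
dir.create(out_dir)

saveText <- function(id, variable) {
  data     <- subset(d, d$Ppno == id)
  value    <- as.character(data[variable][1, 1])
  filename <- file.path(out_dir, paste(id, variable, ".txt", sep = ""))
  writeLines(value, con = filename)
}

participants <- unique(d$Ppno)
variables    <- c("phys_attra", "Comments")

for (participant in participants) {     # once per participant
  for (variable in variables) {         # once per variable, per participant
    saveText(participant, variable)
  }
}

length(list.files(out_dir))  # 3 participants x 2 variables = 6 files
```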
Result
496 FILES
...In a flick of a wand!
Divination
What can R do for you?
More or less everything.
Basic magic
• Out of the box, R can do
‣ Linear and nonlinear modeling
‣ Classical statistical tests
‣ Time-series analysis
‣ Classification
‣ Clustering
‣ and many other statistical techniques...
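A brief taste of three items from the list, using only functions and data sets that ship with a standard R installation (mtcars and iris). The choice of variables is arbitrary and purely illustrative.

```r
# Linear modeling: fuel economy as a function of car weight.
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)

# Classical statistical test: Welch t-test of mpg by transmission type.
tt <- t.test(mpg ~ am, data = mtcars)
tt$p.value

# Clustering: k-means on the four iris measurements.
set.seed(1)                              # k-means starts from random centres
km <- kmeans(iris[, 1:4], centers = 3)
table(km$cluster, iris$Species)          # compare clusters with the species
```

summary(fit) prints the full regression table, and plot(fit) draws the usual diagnostic plots.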
More help
• An Introduction to R
‣ http://cran.r-project.org/doc/manuals/R-intro.html
• R Starter Kit
‣ http://www.ats.ucla.edu/stat/r/sk/
• R mailing list
• Dumbledore’s Ruth Ripley’s class, Department of Statistics, University of Oxford
‣ http://www.stats.ox.ac.uk/~ruth/
Remember...
Without R, it’s only esearch.
Thanks for listening!