r basics v10 - learning research and development center · 2019. 1. 9. · r basics lr commands...
TRANSCRIPT
R Basics / Course Businessl We’ll be using a sample dataset in class today:
l CourseWeb: Course Documents à Sample Data à Week 2
l Can download to your computer before class
l Thanks for answering CourseWeb background survey!
l If sitting in on the course, e-mail me so I can add you to CourseWeb
R Basics
R Basics
R Basics
l R commands & functionsl Reading in datal Saving R scriptsl Descriptive statisticsl Subsetting datal Assigning new valuesl Referring to specific cellsl Types & type conversionl NA valuesl Getting help
R Commands
l Simplest way to interact with R is by typing in commands at the > prompt:
R STUDIO R
R as a Calculator
l Typing in a simple calculation shows us the result:l 608 + 28
l What’s 11527 minus 283?l Some more examples:
l 400 / 65 (division)l 2 * 4 (multiplication)l 5 ^ 2 (exponentiation)
Functions
l More complex calculations can be done with functions:l sqrt(64)
l Can often read theseleft to right (“square root of 64”)
l What do you thinkthis means?l abs(-7)
What the function is (square root)
In parenthesis: What we want to perform the function on
Arguments
l Some functions have settings(“arguments”) that we can adjust:
l round(3.14)- Rounds off to the nearest integer (zero
decimal places)
l round(3.14, digits=1)- One decimal place
Nested Functions
Nested Functions
l We can use multiple functions in a row, one inside another- sqrt(abs(-16))- “Square root of the absolute value of -16”
l Don't get scared when you see multiple parentheses!- Can often just read left to right- R first figures out the thing nested in
the middle• Can you round off the square root of 7?
Using Multiple Numbers at Once
l When we want to use multiple numbers, we concatenate them
l c(2,6,16)- A list of the numbers 2, 6, and 16
l Sometimes a computation requires multiple numbers- mean(c(2,6,16))
l Also a quick way to do the same thing to multiple different numbers:- sqrt(c(16,100,144))
R Basics
l R commands & functionsl Reading in datal Saving R scriptsl Descriptive statisticsl Subsetting datal Assigning new valuesl Referring to specific cellsl Types & type conversionl NA valuesl Getting help
Course Documents: Sample Data: Week 2
l Reading plausible versus implausible sentences
l “Scott chopped the carrots with a knife.”
“Scott chopped the carrots with a spoon.”
Measure reading time on final word
Note: Simulated data; not a real experiment.
Course Documents: Sample Data: Week 2
l Reading plausible versus implausible sentences
l Reading time on critical word
l 36 subjectsl Each subject sees 30 items (sentences):
half plausible, half implausiblel Interested in changes over time, so we’ll
track number of trials remaining (29 vs 28 vs 27 vs 26…)
Reading in Data
l Make sure you have the dataset at this point if you want to follow along:
Course Documents àSample Data à
Week 2
Reading in Data – RStudio
l Navigate to thefolder in lower-right
l More ->Set as Working Directory
l Open a “comma-separated value” file:- experiment <-read.csv('week2.csv')
Name of the “dataframe” we’re creating (whatever we want to call this dataset) read.csv is the
function nameFile name
Reading in Data – Regular Rl Read in a “comma-separated value” file:
- experiment <- read.csv('/Users/scottfraundorf/Desktop/week2.csv')
Name of the “dataframe” we’re creating (whatever we want to call this dataset)
read.csv is the function name
Folder & file name
• Drag & drop the file into R to get the full folder & filename
Looking at the Data: Summary
l A “big picture” of the dataset:l summary(experiment)
l summary() is a very important function!l Basic info & descriptive statisticsl Check to make sure the data are correct
Looking at the Data: Summary
l A “big picture” of the dataset:l summary(experiment)
l We can use $ to refer to a specific column/variable in our dataset:l summary(experiment$ItemName)
Looking at the Data: Raw Data
l Let’s look at the data!l experiment
l Ack! That’s too much! How about just a few rows?l head(experiment)l head(experiment, n=10)
Reading in Data: Other Formats
l Excel:- library(gdata)- experiment <-
read.xls('/Users/scottfraundorf/Desktop/week2.xls')
l SPSS:- library(foreign)- experiment <-
read.spss('/Users/scottfraundorf/Desktop/week2.spss', to.data.frame=TRUE)
R Basics
l R commands & functionsl Reading in datal Saving R scriptsl Descriptive statisticsl Subsetting datal Assigning new valuesl Referring to specific cellsl Types & type conversionl NA valuesl Getting help
R Scriptsl Save & reuse commands with a script
R STUDIO
RFile -> New Document
R Scripts
l Run commands without typing them all again
l R Studio:l Code -> Run Region -> Run All: Run entire scriptl Code -> Run Line(s): Run just what you’ve
highlighted/selectedl R:- Highlight the section of script you want to run- Edit -> Execute
l Keyboard shortcut for this:- Ctrl+Enter (PC), ⌘+Enter (Mac)
R Scripts
l Saves times when re-running analyses
l Other advantages?
l Some:- Documentation for yourself- Documentation for others- Reuse with new analyses/experiments- Quicker to run—can automatically
perform one analysis after another
R Scripts—Comments
l Add # before a line to make it a comment- Not commands to R, just notes to self
(or other readers)
• Can also add a # to make the rest of a line a comment
• summary(experiment$Subject) #awesome
R Basics
l R commands & functionsl Reading in datal Saving R scriptsl Descriptive statisticsl Subsetting datal Assigning new valuesl Referring to specific cellsl Types & type conversionl NA valuesl Getting help
Descriptive Statistics
l Remember how we referred to a particular variable in a dataframe?- $
l Combine that with functions:- mean(experiment$RT)- median(experiment$RT)- sd(experiment$RT)
l Or, for a categorical variable:- levels(experiment$ItemName)- summary(experiment$Subject)
Descriptive Statisticsl We often want to look at a dependent variable
as a function of some independent variable(s)- tapply(experiment$RT,
experiment$Condition, mean)- “Split up the RTs by Condition, then get the mean”
l Try getting the mean RT for each iteml How about the median RT for each subject?l To combine multiple results into one table,
“column bind” them with cbind():l cbind(
tapply(experiment$RT, experiment$Condition, mean), tapply(experiment$RT, experiment$Condition, sd)
)
Descriptive Statistics
l Can have 2-way tables...- tapply(experiment$RT,
list(experiment$Subject, experiment$Condition), mean)
- 1st variable is rows, 2nd is columnsl ...or more!- tapply(experiment$RT,
list(experiment$ItemName, experiment$Condition, experiment$TestingRoom), mean)
Descriptive Statistics
l Contingency tables for categorical variables:
- xtabs (~ Subject + Condition, data=experiment)
R Basics
l R commands & functionsl Reading in datal Saving R scriptsl Descriptive statisticsl Subsetting datal Assigning new valuesl Referring to specific cellsl Types & type conversionl NA valuesl Getting help
Subsetting Data
l Often, we want to examine or use just part of a dataframe
l Remember how we read our dataframe?- experiment <- read.csv(...)
l Create a new dataframe that's just a subset of experiment:- experiment.LongRTsRemoved <-
subset(experiment, RT < 2000)Inclusion criterion: RT less than 2000 msOriginal dataframeNew dataframe name
Subsetting Data: Logical Operators
l Try getting just the observations with RTs 200 ms or more:- experiment.ShortRTsRemoved <-
subset(experiment, RT >= 200)l Why not just delete the bad RTs from the
spreadsheet?l Easy to make a mistake / miss some of theml Faster to have the computer do itl We’d lose the original datal No documentation of how we subsetted the data
Subsetting Data: AND and ORl What if we wanted only RTs between 200
and 2000 ms?- Could do two steps:- experiment.Temp <-
subset(experiment, RT >= 200)- experiment.BadRTsRemoved <-
subset(experiment.Temp, RT <= 2000)l One step with & for AND:- experiment2 <- subset(experiment,
RT >= 200 & RT <= 2000)
Subsetting Data: AND and ORl What if we wanted only RTs between 200
and 2000 ms?l One step with & for AND:- experiment2 <- subset(experiment,
RT >= 200 & RT <= 2000)
l | means OR:- experiment.BadRTs <-
subset(experiment, RT < 200 | RT > 2000)
- Logical OR (“either or both”)
Subsetting Data: == and !=l Get a match / equals:- experiment.LastTrials <-
subset(experiment, TrialsRemaining== 0)
l Words/categorical variables need quotes:- experiment.ImplausibleSentences <-
subset(experiment, Condition=='Implausible')
l != means “not equal to”:- experiment.BadSubjectRemoved <-
subset(experiment, Subject != 'S23')
Note DOUBLE equals sign
Drops subject “S23”
Subsetting Data: %in%l Sometimes our inclusion criteria aren't so
mathematicall Suppose I just want the “Ducks” and
“Panther” itemsl We can check against any arbitrary list:- experiment.SpecialItems <-
subset(experiment, ItemName %in% c('Ducks', 'Panther'))
l Or, keep just things that aren't in a list:- experiment.NonNativeSpeakersRemoved
<- subset(experiment, Subject %in% c('S10', 'S23') == FALSE)
Writing Data
l Note that these subsets are just creating new dataframes in R
l If you want to save to a folder on your computer, use write.csv():l write.csv(experiment.BadRTsRemoved, file='experiment_badremoved.csv')
Logical Operators Review
l Summary- > Greater than- >= Greater than or equal to- < Less than- <= Less than or equal to- & AND- | OR- == Equal to- != Not equal to- %in% Is this included in a list?
R Basics
l R commands & functionsl Reading in datal Saving R scriptsl Descriptive statisticsl Subsetting datal Assigning new valuesl Referring to specific cellsl Types & type conversionl NA valuesl Getting help
Let’s Practice It!
l Try getting the mean RT for each number of TrialsRemaining (29 trials remaining, 28 trials remaining, etc.)
l Try getting a subset of just the people in TestingRoom 3
Let’s Practice It!
l Try getting the mean RT for each number of TrialsRemaining (29 trials remaining, 28 trials remaining, etc.)l tapply(experiment$RT, experiment$TrialsRemaining, mean)
l Try getting a subset of just the people in TestingRoom 3l experiment.Room3 <- subset(experiment, TestingRoom == 3)
R Basics
l R commands & functionsl Reading in datal Saving R scriptsl Descriptive statisticsl Subsetting datal Assigning new valuesl Referring to specific cellsl Types & type conversionl NA valuesl Getting help
Assignmentl Remember the pointing arrow used to
create dataframes and subsets?- e.g., experiment <- read.csv(...)
l This is the assignment operator. It saves results or values in a variable- x <- sqrt(64)- CriticalTrialsPerSubject <- 30
- Remember, typing a name by itself shows you the current value:
- CriticalTrialsPerSubjectl Assigning a new value overwrites the old
Assignment
l We can use this to create new columns in our dataframe:- experiment$ExperimentNumber <- 1- Here, the same number (1) is assigned to
every triall Or, compute a value for each row:- experiment$RTinSeconds <-
experiment$RT / 1000- For each trial, finds the RT in seconds for that
specific trial and saves that into RTinSeconds- Similar to an Excel formula
Assignment
l We can use this to create new columns in our dataframe:- experiment$ExperimentNumber <- 1- Here, the same number (1) is assigned to
every triall Another example:- experiment$SerialPosition <-
30 - experiment$TrialsRemaining- For each trial, calculates the serial position
(trial #1, trial #2, etc.) and saves the result into SerialPosition
ifelse()
IF YOU WANT DESSERT, EAT YOUR PEAS
… OR ELSE!
ifelse()l ifelse(): Use a test to decide which of two
values to assign:l experiment$Half <- ifelse(experiment$SerialPosition <= 15,1,2)
l Possible to nest ifelse() if we need more than 2 categories:
l experiment$Third <-ifelse(experiment$SerialPosition <= 10, 1, ifelse(experiment$SerialPosition <= 20, 2, 3))
Function nameIf serial position IS <= 15…“Half” is 1If it’s NOT, “Half” is 2
ifelse()l Instead of specific numbers, can use
other columns or a formula:l experiment$RT.Fixed <- ifelse(experiment$TestingRoom==2,experiment$RT + 100,experiment$RT)
l How can we check if this worked?l head(experiment)
Fixed RTs are indeed 100 ms longer for TestingRoom2
ifelse()l Instead of specific numbers, can use
other columns or a formula:l experiment$RT.Fixed <- ifelse(experiment$TestingRoom==2,experiment$RT + 100,experiment$RT)
l How can we check if this worked?l tapply(experiment$RT, experiment$TestingRoom, mean)
l tapply(experiment$RT.Fixed, experiment$TestingRoom, mean)
l Only room 2 should be different
Which do you like better?- experiment$Half <-
ifelse(experiment$SerialPosition <= 15, 1, 2)
- Shorter & faster to writel vs:- CriticalTrialsPerSubject <- 30- experiment$Half <-
ifelse(experiment$SerialPosition <= (CriticalTrialsPerSubject / 2) , 1, 2)
- Explains where the 15 comes from—helpful if we come back to this script later
- We can also refer to CriticalTrialsPerSubjectvariable later in the script & this ensure it’s consistent
- Easy to update if we change the number of critical trials
Deleting Variables
l It is also possible to delete a variable by assigning it the value of NULL- experiment$TrialsRemaining <- NULL
- Since we now have SerialPosition, maybe we don’t want to bother keeping TrialsRemainingany more
R Basics
l R commands & functionsl Reading in datal Saving R scriptsl Descriptive statisticsl Subsetting datal Assigning new valuesl Referring to specific cellsl Types & type conversionl NA valuesl Getting help
Referring to Specific Cells
l So far, we’ve seen how to- Create a new dataframe that’s a subset of an
existing dataframe- Modify a dataframe by creating an entire column
l What if we want to modify a dataframe by adjusting some existing values?l e.g., replace all RTs above 2000 ms with the
number 2000 (“fencing”)l Creating a new subset won’t work because we
want to change the original dataframel Need a way to edit specific values
Referring to Specific Cellsl Use square brackets [ ] to refer to specific
entries in a dataframe:- Row, column- experiment[3,7]
l Omit the row or column number to mean all rows or all columns:- experiment[3,]- experiment[,4]
l Can also use column names:l experiment[,'RT'] All rows in the RT column
l Remember c()? We can check multiple rows:- experiment[c(1:4),]
Row 3, all columnsAll rows in column 4
Logical Indexingl We can look at rows or columns that meet
a specific criterion...- experiment[experiment$RT < 200,]
l Can use this as another way to subset:- experiment.ShortRTsRemoved <-
experiment[experiment$RT > 200, ]- Actually, subset() just does this
l But we can also set values this way- experiment[experiment$RT < 200,
'RT'] <- 200- In the dataframe experiment, find the rows
where RT < 200, and set the column RT to 200
R Basics
l R commands & functionsl Reading in datal Saving R scriptsl Descriptive statisticsl Subsetting datal Assigning new valuesl Referring to specific cellsl Types & type conversionl NA valuesl Getting help
Types
l R treats continuous & categorical variables differently:
l These are different data types:- Numeric- Factor: Variable w/ fixed set of
categories (e.g., treatment vs. placebo)- Character: Freely entered text (e.g.,
open response question)
Types
l R's heuristic when reading in data:- Letters anywhere in the column → factor
- No letters, purely numbers → numeric
Type Conversion: Numeric → Factor
l Sometimes we need to correct this- Room 4 is not “twice as much” Room 2
l Create a new column that's the factor(categorical) version of TestingRoom:- experiment$Room.Factor <-
as.factor(experiment$TestingRoom)l Or, just overwrite the old column:- experiment$TestingRoom <-
as.factor(experiment$TestingRoom)
Conversion: Character → Factorl When ifelse() results in words, R creates
a character variable rather than a factor- Need to convert it
l Wrong:- experiment$FaveRoom <-
ifelse(experiment$TestingRoom==3, 'My favorite room', 'Not favorite')
l Right:- experiment$FaveRoom <-
as.factor(ifelse(experiment$TestingRoom== 3, 'My favorite room', 'Not favorite'))
Type Conversion: Factor → Numeric
l To change a factor to a number, need to turn it into a character first:- experiment$Age.Numeric <-
as.numeric(as.character(experiment$Age))
R Basics
l R commands & functionsl Reading in datal Saving R scriptsl Descriptive statisticsl Subsetting datal Assigning new valuesl Referring to specific cellsl Types & type conversionl NA valuesl Getting help
NA
l We might have run into some problems trying to change Age into a numerical variable...
l NA means “not available”...- Characters that don't convert to numbers- Missing data in a spreadsheet- Invalid computations
NA
l If we try to do computations on a set of numbers where any of them is NA, we get NA as a result...- mean(experiment$Age.Numeric)
l R wants you to think about how you want to treat these missing values
NA – Solutions
l To ignore the NAs when doing a specificcomputation, use na.rm=TRUE:- sd(experiment$Age.Numeric,na.rm=TRUE)
l To get a copy of the dataframe that excludes all rows with an NA (in any column):- experiment.NoNAs <- na.omit(experiment)
l Change NAs to something else with logical indexing:- experiment[is.na(experiment$
Age.Numeric)==TRUE, ]$Age.Numeric <- 23
R Basics
l R commands & functionsl Reading in datal Saving R scriptsl Descriptive statisticsl Subsetting datal Assigning new valuesl Referring to specific cellsl Types & type conversionl NA valuesl Getting help
Getting Help
l Get help on a specific known function:- ?sqrt- ?write.csv- Lists all
argumentsl Try to find a
function on aparticular topic:- ??logarithm
Analyses & Add-On Packages
l Some built-in analyses:l aov() ANOVAl lm() Linear regressionl glm() Generalized linear models (e.g., logistic)l cor.test() Pearson correlationl t.test() t-test
l Help function (?) will tell you about the arguments to these particular functions
Wrap-Up
l Can use R for:- Reading in data- Descriptive statistics- Subsetting data- Creating new variables