r basics v10 - learning research and development center · 2019. 1. 9. · r basics lr commands...

R Basics / Course Businessl We’ll be using a sample dataset in class today:

l CourseWeb: Course Documents à Sample Data à Week 2

l Can download to your computer before class

l Thanks for answering CourseWeb background survey!

l If sitting in on the course, e-mail me so I can add you to CourseWeb

R Basics

R Basics

l R commands & functionsl Reading in datal Saving R scriptsl Descriptive statisticsl Subsetting datal Assigning new valuesl Referring to specific cellsl Types & type conversionl NA valuesl Getting help

R Commands

l Simplest way to interact with R is by typing in commands at the > prompt:

R STUDIO R

R as a Calculator

l Typing in a simple calculation shows us the result:l 608 + 28

l What’s 11527 minus 283?l Some more examples:

l 400 / 65 (division)l 2 * 4 (multiplication)l 5 ^ 2 (exponentiation)

Functions

l More complex calculations can be done with functions:l sqrt(64)

l Can often read theseleft to right (“square root of 64”)

l What do you thinkthis means?l abs(-7)

What the function is (square root)

In parenthesis: What we want to perform the function on

Arguments

l Some functions have settings(“arguments”) that we can adjust:

l round(3.14)- Rounds off to the nearest integer (zero

decimal places)

l round(3.14, digits=1)- One decimal place

Nested Functions

Nested Functions

l We can use multiple functions in a row, one inside another- sqrt(abs(-16))- “Square root of the absolute value of -16”

l Don't get scared when you see multiple parentheses!- Can often just read left to right- R first figures out the thing nested in

the middle• Can you round off the square root of 7?

Using Multiple Numbers at Once

l When we want to use multiple numbers, we concatenate them

l c(2,6,16)- A list of the numbers 2, 6, and 16

l Sometimes a computation requires multiple numbers- mean(c(2,6,16))

l Also a quick way to do the same thing to multiple different numbers:- sqrt(c(16,100,144))

R Basics


Course Documents: Sample Data: Week 2

l Reading plausible versus implausible sentences

l “Scott chopped the carrots with a knife.”

“Scott chopped the carrots with a spoon.”

Measure reading time on final word

Note: Simulated data; not a real experiment.

Course Documents: Sample Data: Week 2

l Reading plausible versus implausible sentences

l Reading time on critical word

l 36 subjectsl Each subject sees 30 items (sentences):

half plausible, half implausiblel Interested in changes over time, so we’ll

track number of trials remaining (29 vs 28 vs 27 vs 26…)

Reading in Data

l Make sure you have the dataset at this point if you want to follow along:

Course Documents àSample Data à

Week 2

Reading in Data – RStudio

l Navigate to thefolder in lower-right

l More ->Set as Working Directory

l Open a “comma-separated value” file:- experiment <-read.csv('week2.csv')

Name of the “dataframe” we’re creating (whatever we want to call this dataset) read.csv is the

function nameFile name

Reading in Data – Regular Rl Read in a “comma-separated value” file:

- experiment <- read.csv('/Users/scottfraundorf/Desktop/week2.csv')

Name of the “dataframe” we’re creating (whatever we want to call this dataset)

read.csv is the function name

Folder & file name

• Drag & drop the file into R to get the full folder & filename

Looking at the Data: Summary

l A “big picture” of the dataset:l summary(experiment)

l summary() is a very important function!l Basic info & descriptive statisticsl Check to make sure the data are correct

Looking at the Data: Summary

l A “big picture” of the dataset:l summary(experiment)

l We can use $ to refer to a specific column/variable in our dataset:l summary(experiment$ItemName)

Looking at the Data: Raw Data

l Let’s look at the data!l experiment

l Ack! That’s too much! How about just a few rows?l head(experiment)l head(experiment, n=10)

Reading in Data: Other Formats

l Excel:- library(gdata)- experiment <-

read.xls('/Users/scottfraundorf/Desktop/week2.xls')

l SPSS:- library(foreign)- experiment <-

read.spss('/Users/scottfraundorf/Desktop/week2.spss', to.data.frame=TRUE)

R Basics


R Scriptsl Save & reuse commands with a script

R STUDIO

RFile -> New Document

R Scripts

l Run commands without typing them all again

l R Studio:l Code -> Run Region -> Run All: Run entire scriptl Code -> Run Line(s): Run just what you’ve

highlighted/selectedl R:- Highlight the section of script you want to run- Edit -> Execute

l Keyboard shortcut for this:- Ctrl+Enter (PC), ⌘+Enter (Mac)

R Scripts

l Saves times when re-running analyses

l Other advantages?

l Some:- Documentation for yourself- Documentation for others- Reuse with new analyses/experiments- Quicker to run—can automatically

perform one analysis after another

R Scripts—Comments

l Add # before a line to make it a comment- Not commands to R, just notes to self

(or other readers)

• Can also add a # to make the rest of a line a comment

• summary(experiment$Subject) #awesome

R Basics


Descriptive Statistics

l Remember how we referred to a particular variable in a dataframe?- $

l Combine that with functions:- mean(experiment$RT)- median(experiment$RT)- sd(experiment$RT)

l Or, for a categorical variable:- levels(experiment$ItemName)- summary(experiment$Subject)

Descriptive Statisticsl We often want to look at a dependent variable

as a function of some independent variable(s)- tapply(experiment$RT,

experiment$Condition, mean)- “Split up the RTs by Condition, then get the mean”

l Try getting the mean RT for each iteml How about the median RT for each subject?l To combine multiple results into one table,

“column bind” them with cbind():l cbind(

tapply(experiment$RT, experiment$Condition, mean), tapply(experiment$RT, experiment$Condition, sd)

)


l Can have 2-way tables...- tapply(experiment$RT,

list(experiment$Subject, experiment$Condition), mean)

- 1st variable is rows, 2nd is columnsl ...or more!- tapply(experiment$RT,

list(experiment$ItemName, experiment$Condition, experiment$TestingRoom), mean)


l Contingency tables for categorical variables:

- xtabs (~ Subject + Condition, data=experiment)

R Basics


Subsetting Data

l Often, we want to examine or use just part of a dataframe

l Remember how we read our dataframe?- experiment <- read.csv(...)

l Create a new dataframe that's just a subset of experiment:- experiment.LongRTsRemoved <-

subset(experiment, RT < 2000)Inclusion criterion: RT less than 2000 msOriginal dataframeNew dataframe name

Subsetting Data: Logical Operators

l Try getting just the observations with RTs 200 ms or more:- experiment.ShortRTsRemoved <-

subset(experiment, RT >= 200)l Why not just delete the bad RTs from the

spreadsheet?l Easy to make a mistake / miss some of theml Faster to have the computer do itl We’d lose the original datal No documentation of how we subsetted the data

Subsetting Data: AND and ORl What if we wanted only RTs between 200

and 2000 ms?- Could do two steps:- experiment.Temp <-

subset(experiment, RT >= 200)- experiment.BadRTsRemoved <-

subset(experiment.Temp, RT <= 2000)l One step with & for AND:- experiment2 <- subset(experiment,

RT >= 200 & RT <= 2000)

Subsetting Data: AND and ORl What if we wanted only RTs between 200

and 2000 ms?l One step with & for AND:- experiment2 <- subset(experiment,

RT >= 200 & RT <= 2000)

l | means OR:- experiment.BadRTs <-

subset(experiment, RT < 200 | RT > 2000)

- Logical OR (“either or both”)

Subsetting Data: == and !=l Get a match / equals:- experiment.LastTrials <-

subset(experiment, TrialsRemaining== 0)

l Words/categorical variables need quotes:- experiment.ImplausibleSentences <-

subset(experiment, Condition=='Implausible')

l != means “not equal to”:- experiment.BadSubjectRemoved <-

subset(experiment, Subject != 'S23')

Note DOUBLE equals sign

Drops subject “S23”

Subsetting Data: %in%l Sometimes our inclusion criteria aren't so

mathematicall Suppose I just want the “Ducks” and

“Panther” itemsl We can check against any arbitrary list:- experiment.SpecialItems <-

subset(experiment, ItemName %in% c('Ducks', 'Panther'))

l Or, keep just things that aren't in a list:- experiment.NonNativeSpeakersRemoved

<- subset(experiment, Subject %in% c('S10', 'S23') == FALSE)

Writing Data

l Note that these subsets are just creating new dataframes in R

l If you want to save to a folder on your computer, use write.csv():l write.csv(experiment.BadRTsRemoved, file='experiment_badremoved.csv')

Logical Operators Review

l Summary- > Greater than- >= Greater than or equal to- < Less than- <= Less than or equal to- & AND- | OR- == Equal to- != Not equal to- %in% Is this included in a list?

R Basics


Let’s Practice It!

l Try getting the mean RT for each number of TrialsRemaining (29 trials remaining, 28 trials remaining, etc.)

l Try getting a subset of just the people in TestingRoom 3

Let’s Practice It!

l Try getting the mean RT for each number of TrialsRemaining (29 trials remaining, 28 trials remaining, etc.)l tapply(experiment$RT, experiment$TrialsRemaining, mean)

l Try getting a subset of just the people in TestingRoom 3l experiment.Room3 <- subset(experiment, TestingRoom == 3)

R Basics


Assignmentl Remember the pointing arrow used to

create dataframes and subsets?- e.g., experiment <- read.csv(...)

l This is the assignment operator. It saves results or values in a variable- x <- sqrt(64)- CriticalTrialsPerSubject <- 30

- Remember, typing a name by itself shows you the current value:

- CriticalTrialsPerSubjectl Assigning a new value overwrites the old

Assignment

l We can use this to create new columns in our dataframe:- experiment$ExperimentNumber <- 1- Here, the same number (1) is assigned to

every triall Or, compute a value for each row:- experiment$RTinSeconds <-

experiment$RT / 1000- For each trial, finds the RT in seconds for that

specific trial and saves that into RTinSeconds- Similar to an Excel formula

Assignment

l We can use this to create new columns in our dataframe:- experiment$ExperimentNumber <- 1- Here, the same number (1) is assigned to

every triall Another example:- experiment$SerialPosition <-

30 - experiment$TrialsRemaining- For each trial, calculates the serial position

(trial #1, trial #2, etc.) and saves the result into SerialPosition

ifelse()

IF YOU WANT DESSERT, EAT YOUR PEAS

… OR ELSE!

ifelse()l ifelse(): Use a test to decide which of two

values to assign:l experiment$Half <- ifelse(experiment$SerialPosition <= 15,1,2)

l Possible to nest ifelse() if we need more than 2 categories:

l experiment$Third <-ifelse(experiment$SerialPosition <= 10, 1, ifelse(experiment$SerialPosition <= 20, 2, 3))

Function nameIf serial position IS <= 15…“Half” is 1If it’s NOT, “Half” is 2

ifelse()l Instead of specific numbers, can use

other columns or a formula:l experiment$RT.Fixed <- ifelse(experiment$TestingRoom==2,experiment$RT + 100,experiment$RT)

l How can we check if this worked?l head(experiment)

Fixed RTs are indeed 100 ms longer for TestingRoom2

ifelse()l Instead of specific numbers, can use

other columns or a formula:l experiment$RT.Fixed <- ifelse(experiment$TestingRoom==2,experiment$RT + 100,experiment$RT)

l How can we check if this worked?l tapply(experiment$RT, experiment$TestingRoom, mean)

l tapply(experiment$RT.Fixed, experiment$TestingRoom, mean)

l Only room 2 should be different

Which do you like better?- experiment$Half <-

ifelse(experiment$SerialPosition <= 15, 1, 2)

- Shorter & faster to writel vs:- CriticalTrialsPerSubject <- 30- experiment$Half <-

ifelse(experiment$SerialPosition <= (CriticalTrialsPerSubject / 2) , 1, 2)

- Explains where the 15 comes from—helpful if we come back to this script later

- We can also refer to CriticalTrialsPerSubjectvariable later in the script & this ensure it’s consistent

- Easy to update if we change the number of critical trials

Deleting Variables

l It is also possible to delete a variable by assigning it the value of NULL- experiment$TrialsRemaining <- NULL

- Since we now have SerialPosition, maybe we don’t want to bother keeping TrialsRemainingany more

R Basics


Referring to Specific Cells

l So far, we’ve seen how to- Create a new dataframe that’s a subset of an

existing dataframe- Modify a dataframe by creating an entire column

l What if we want to modify a dataframe by adjusting some existing values?l e.g., replace all RTs above 2000 ms with the

number 2000 (“fencing”)l Creating a new subset won’t work because we

want to change the original dataframel Need a way to edit specific values

Referring to Specific Cellsl Use square brackets [ ] to refer to specific

entries in a dataframe:- Row, column- experiment[3,7]

l Omit the row or column number to mean all rows or all columns:- experiment[3,]- experiment[,4]

l Can also use column names:l experiment[,'RT'] All rows in the RT column

l Remember c()? We can check multiple rows:- experiment[c(1:4),]

Row 3, all columnsAll rows in column 4

Logical Indexingl We can look at rows or columns that meet

a specific criterion...- experiment[experiment$RT < 200,]

l Can use this as another way to subset:- experiment.ShortRTsRemoved <-

experiment[experiment$RT > 200, ]- Actually, subset() just does this

l But we can also set values this way- experiment[experiment$RT < 200,

'RT'] <- 200- In the dataframe experiment, find the rows

where RT < 200, and set the column RT to 200

R Basics


Types

l R treats continuous & categorical variables differently:

l These are different data types:- Numeric- Factor: Variable w/ fixed set of

categories (e.g., treatment vs. placebo)- Character: Freely entered text (e.g.,

open response question)

Types

l R's heuristic when reading in data:- Letters anywhere in the column → factor

- No letters, purely numbers → numeric

Type Conversion: Numeric → Factor

l Sometimes we need to correct this- Room 4 is not “twice as much” Room 2

l Create a new column that's the factor(categorical) version of TestingRoom:- experiment$Room.Factor <-

as.factor(experiment$TestingRoom)l Or, just overwrite the old column:- experiment$TestingRoom <-

as.factor(experiment$TestingRoom)

Conversion: Character → Factorl When ifelse() results in words, R creates

a character variable rather than a factor- Need to convert it

l Wrong:- experiment$FaveRoom <-

ifelse(experiment$TestingRoom==3, 'My favorite room', 'Not favorite')

l Right:- experiment$FaveRoom <-

as.factor(ifelse(experiment$TestingRoom== 3, 'My favorite room', 'Not favorite'))

Type Conversion: Factor → Numeric

l To change a factor to a number, need to turn it into a character first:- experiment$Age.Numeric <-

as.numeric(as.character(experiment$Age))

R Basics


NA

l We might have run into some problems trying to change Age into a numerical variable...

l NA means “not available”...- Characters that don't convert to numbers- Missing data in a spreadsheet- Invalid computations

NA

l If we try to do computations on a set of numbers where any of them is NA, we get NA as a result...- mean(experiment$Age.Numeric)

l R wants you to think about how you want to treat these missing values

NA – Solutions

l To ignore the NAs when doing a specificcomputation, use na.rm=TRUE:- sd(experiment$Age.Numeric,na.rm=TRUE)

l To get a copy of the dataframe that excludes all rows with an NA (in any column):- experiment.NoNAs <- na.omit(experiment)

l Change NAs to something else with logical indexing:- experiment[is.na(experiment$

Age.Numeric)==TRUE, ]$Age.Numeric <- 23

R Basics


Getting Help

l Get help on a specific known function:- ?sqrt- ?write.csv- Lists all

argumentsl Try to find a

function on aparticular topic:- ??logarithm

Analyses & Add-On Packages

l Some built-in analyses:l aov() ANOVAl lm() Linear regressionl glm() Generalized linear models (e.g., logistic)l cor.test() Pearson correlationl t.test() t-test

l Help function (?) will tell you about the arguments to these particular functions

Wrap-Up

l Can use R for:- Reading in data- Descriptive statistics- Subsetting data- Creating new variables