An Introduction to Statistical Computing in R
K2I Data Science Boot Camp - Day 1 AM Session
May 15, 2017
Statistical Computing in R May 15, 2017 1 / 55
AM Session Outline
Intro to R Basics
Plotting In R
Data Manipulation
R Basics
Here we will give a quick overview of the R language and the RStudio IDE.
Our emphasis will be to explore the most used features of R, especially those used in later courses.
This won’t cover all the details, but it will cover the most important parts.
Working with RStudio
Before beginning with R let’s orient ourselves with RStudio.
Our initial view of RStudio is:
Go to: File -> New File -> R Script. This gives:
Try It Out
Type the following into the console:
?lm
??linear
plot(1:20, 1:20)
There are several useful shortcut keys in RStudio. A few popular ones:
Ctrl+Enter - When pressed in Editor, sends current line to console.
Ctrl+1, Ctrl+2 - switch between editor and console
Ctrl+Shift+Enter - run entire script in console
tab completion - this is perhaps the most used feature
For vim/emacs users, Tools -> Global Options -> Code -> Keybindings will give you your preferred bindings.
It’s important to know our working directory.
Given a file name, R will assume it is located in your current working directory.
R will also save output to the working directory by default.
It is important to set your working directory to the correct location or specify full path names.
Try out the following in the console window:
getwd()
list.files()
To change your working directory go to: Session -> Set Working Directory -> Choose Directory
Alternatively,
setwd("/path/to/directory")
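As a small illustration of how a bare file name is resolved against the working directory, here is a sketch that uses R's temporary directory as a throwaway location. The file name note.txt is made up for this example and is not one of the course files.

```r
# Remember the current working directory so it can be restored afterwards
old.wd <- getwd()

# Use R's per-session temporary directory as a throwaway working directory
setwd(tempdir())

# "note.txt" is a hypothetical file name; a bare name is resolved
# relative to the working directory
writeLines("hello", "note.txt")
list.files(pattern = "note")

# The same file addressed by its full path
full.path <- file.path(tempdir(), "note.txt")
file.exists(full.path)

# Put the working directory back
setwd(old.wd)
```

Building paths with file.path() avoids hard-coding the path separator, which differs between operating systems.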
Reading, Writing, Saving, and Loading
Here we’ll look at bringing data into R and getting it out
We’ll also see how to save R objects and environments
Reading In Data
read.table
read.csv
read.fwf
Check out the options for each with ?read.table
Syntax
?read.table
?read.csv
read.table("/path/to/your/file.ext",
header=TRUE,
sep=",",
stringsAsFactors = FALSE)
Most Common Options
sep tells how fields/variables are separated. Common values are:
"," (comma)
" " (single space)
"\t" (tab escape character)
stringsAsFactors tells whether to treat non-numeric values as factor/categorical variables.
header tells whether first line of file has variable names
na.strings tells how missing values are encoded in the file.
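To see these four options working together, here is a minimal sketch that writes a tiny file to a temporary location and reads it back. The file contents, the column names, and the token "missing" used to encode missing values are all invented for this illustration.

```r
# Hypothetical example data written to a temporary file
# (not one of the course files)
tmp <- tempfile(fileext = ".txt")
writeLines(c("name;score",
             "ann;10",
             "bob;missing",
             "cy;7"), tmp)

dat <- read.table(tmp,
                  header = TRUE,             # first line holds variable names
                  sep = ";",                 # fields are separated by semicolons
                  na.strings = "missing",    # how missing values are encoded
                  stringsAsFactors = FALSE)  # keep text as character, not factor

# Inspect the result: column types and the missing value
str(dat)
is.na(dat$score)
```

Changing stringsAsFactors to TRUE and calling str(dat) again shows the name column become a factor.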
Standard Procedure
Open file in text editor
Check items relevant to options. Header? Separator type?
For big files, Linux tools are helpful: head -n10 BigFile.txt > OpenMe
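If command-line tools are not available, a similar peek can be done from within R. The sketch below assumes a stand-in "big" file generated on the spot; readLines() with the n argument reads only the first lines as raw text.

```r
# Stand-in for a large file: a temporary file with 1000 lines (made-up data)
big.file <- tempfile(fileext = ".txt")
writeLines(c("id,value",
             paste(1:999, round(sqrt(1:999), 2), sep = ",")),
           big.file)

# Read only the first 10 lines as raw text, without parsing the whole file
first.lines <- readLines(big.file, n = 10)
print(first.lines)
```

Looking at the raw lines is usually enough to decide on header and sep before calling read.table.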
Try it Out
Let’s read the ReadMeInX.txt files into R.
Try it on your own before looking at the answer on the next slides.
Example workflow:
1 Set your working directory to the directory containing the files.
2 Examine the files in a text editor to check for common options (header, separator, etc.)
# read.table's default separator is ok for this one
set0 <- read.table("ReadMeIn0.txt",
header=TRUE)
# specify new separator
set1 <- read.table("ReadMeIn1.txt",
header=TRUE,
sep=',')
# Or use read.csv
set1 <- read.csv("ReadMeIn1.txt",
header=TRUE)
# another change of separator
set2 <- read.table("ReadMeIn2.txt",
header=TRUE,
sep=';')
# check for missing
set3 <- read.table("ReadMeIn3.txt",
header=FALSE,
sep=',',
na.strings = '')
Writing Data
write.table
write.csv
Syntax and Common Options
?write.csv
write.csv(myRObject,
file="/path/to/save/spot/file.csv",
row.names=FALSE)
Options largely the same as their read counterparts
row.names = FALSE is helpful to avoid having 1,2,3,... written as a variable/column
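The effect of row.names is easiest to see by writing the same data twice and comparing. This is a sketch on a made-up three-row data frame; the object and file names are illustrative only.

```r
# Hypothetical small data set used only for this illustration
small.df <- data.frame(id = 1:3, label = c("a", "b", "c"))

with.rn <- tempfile(fileext = ".csv")
without.rn <- tempfile(fileext = ".csv")

write.csv(small.df, file = with.rn)                        # default: row names written
write.csv(small.df, file = without.rn, row.names = FALSE)  # row names suppressed

# Look at the raw text of each file
readLines(with.rn)
readLines(without.rn)

# Reading back: the first file gains an extra column holding 1,2,3
names(read.csv(with.rn))
names(read.csv(without.rn))
```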
Try It Out
Write out one of the files you imported. Try varying options like sep and quote (write.csv fixes sep to a comma, so use write.table to change it).
Saving Objects
saveRDS/readRDS are used to save (a compressed version of) individual R objects
# save our data set
saveRDS(set1,file="TstObj.rds")
# get it back
newtst <- readRDS("TstObj.rds")
# can save any R object. Try a vector
my.vector <- c(1,8,-100)
saveRDS(my.vector, file="JustAVector.rds")
Saving Environment
We can save all variables in the current R workspace with save.image
We can load in a saved workspace with load
R will ask whether to save your workspace when you exit
# Save all our work
save.image("AllMyWork.RData")
# Reload it
load("AllMyWork.RData")
# name given to default save
load(".RData")
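A related sketch, in case you want to store only a few named objects rather than the whole workspace: save() takes the objects to store, and load() can restore them into a separate environment so nothing in your workspace is overwritten. The object and file names below are made up for this example.

```r
# Two hypothetical objects to store
a.number <- 42
a.vector <- c(1, 8, -100)

# save() stores only the named objects (save.image stores everything)
work.file <- tempfile(fileext = ".RData")
save(a.number, a.vector, file = work.file)

# Remove the originals, then restore them into a fresh environment
rm(a.number, a.vector)
restored <- new.env()
loaded.names <- load(work.file, envir = restored)

# load() returns the names of the objects it restored
loaded.names
restored$a.vector
```

Unlike readRDS, load() restores objects under the names they were saved with, which is why loading into a separate environment can be a safer habit.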
The Basics of R
Let’s do a whirlwind tour of R: its syntax and data structures
This won’t cover all the details, but it will cover the most important parts
Basic R Data Types
# numeric types: integer, double
348
# character
"my string"
# logical
TRUE
FALSE
# arithmetic as you'd expect
43 + 1 * 2^4
# so too logical operators/comparison
TRUE | FALSE
1 + 7 != 7
# Other logical operators:
# &, |, !
# <,>,<=,>=, ==, !=
Data Types Cont.
# variable assignment is done with the <- operator
my.number <- 483
# the '.' above does nothing. we could have done:
# mynumber <- 483
# instead
# it's an Rism to use .'s in variable names.
# typeof() tells us the type
typeof(my.number)
## [1] "double"
# we can convert between types
my.int <- as.integer(my.number)
typeof(my.int)
## [1] "integer"
# we can test for types
is.logical(my.int)
## [1] FALSE
R Data Structures - Vectors
# the vector is the most important data structure
# create it with c()
my.vec <- c(1,2,67,-98)
# get some properties
str(my.vec)
## num [1:4] 1 2 67 -98
length(my.vec)
## [1] 4
# access elements with []
my.vec[3]
## [1] 67
my.vec[c(3,4)]
## [1] 67 -98
# can do assignment too
my.vec[5] <- 41.2
Vectors - Cont.
# other ways to create vectors
x <- 1:6
y <- seq(7,12,by=1)
# Operations get recycled through whole vector
x + 1
## [1] 2 3 4 5 6 7
x > 3
## [1] FALSE FALSE FALSE TRUE TRUE TRUE
# Can do component wise operations between vectors
x * y
## [1] 7 16 27 40 55 72
x / y
## [1] 0.1428571 0.2500000 0.3333333 0.4000000 0.4545455 0.5000000
y %/% x
## [1] 7 4 3 2 2 2
Try It Out
# Try to guess what the following lines will do
# Will it run at all? If so, what will it give?
# Think about it and run to confirm
7 -> w
w <- z <- 44
1 + TRUE
0 | 15 & 3
my.vec[2:4]
my.vec[-2]
my.vec[c(TRUE,FALSE,FALSE,TRUE,FALSE)]
my.vec[
sum(
c(TRUE,FALSE,FALSE,TRUE,TRUE)
)
] <- TRUE
my.vec[3] <- "I'm a string"
as.numeric(my.vec)
x[x>3]
x + c(1,2)
Matrices
# matrices are 2d vectors.
# create using matrix()
my.matrix <- matrix(rnorm(20),nrow=4,ncol=5)
# rnorm(20) draws 20 random samples from a N(0,1) distribution
my.matrix
##            [,1]        [,2]       [,3]      [,4]       [,5]
## [1,]  0.5351131  1.08710882  0.5670939 0.2800755 -0.8050743
## [2,] -1.9263838  0.86267009  0.7318280 0.4177110 -0.9576529
## [3,] -1.2931770 -1.03381286 -0.9035750 1.9787516  0.3747967
## [4,] -2.6190953 -0.04829205  1.3157181 1.2562005  0.1131199
# note: matrices are filled by column
# Get details
dim(my.matrix)
## [1] 4 5
nrow(my.matrix)
## [1] 4
ncol(my.matrix)
## [1] 5
Matrices - Cont.
# Indexing is similar to vectors but with 2 dimensions
# get second row
my.matrix[2,]
## [1] -1.9263838 0.8626701 0.7318280 0.4177110 -0.9576529
# get first and fourth columns of row three
my.matrix[3,c(1,4)]
## [1] -1.293177 1.978752
# transposing done with t()
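A short sketch of t() on a small matrix with known entries, so that the column-wise fill and the effect of transposing are easy to check by eye. The name m is introduced only for this example.

```r
# A 2 x 3 matrix with known entries; values are filled column by column
m <- matrix(1:6, nrow = 2, ncol = 3)
m

# Transposing swaps rows and columns, giving a 3 x 2 matrix
tm <- t(m)
dim(tm)

# Element [i, j] of m becomes element [j, i] of t(m)
m[2, 3] == tm[3, 2]

# byrow = TRUE fills row by row instead of column by column
m.byrow <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
m.byrow
```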
Lists
# lists are similar to vectors but can contain different types
# create with list()
my.list <- list("just a string",
44,
my.matrix,
c(TRUE,TRUE,FALSE))
# access items via double brackets [[]]
my.list[[4]]
## [1] TRUE TRUE FALSE
# access multiple items
my.list[1:2]
## [[1]]
## [1] "just a string"
##
## [[2]]
## [1] 44
# list items can be named too
named.list <- list(Item1="my string",
Item2=my.list)
# access of named item is via dollar sign operator
# [[]] also works
c(named.list$Item1,named.list[[1]])
## [1] "my string" "my string"
Putting it together
Let’s practice with R data types by doing PCA on the iris data.
data("iris")
head(iris)
str(iris)
Note that iris is a data.frame; this is simply a list whose elements (the columns) all have the same length.
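The claim that a data.frame is a list can be checked directly. The following sketch only inspects the built-in iris data and does not modify it.

```r
data("iris")

# A data.frame passes the list test as well as the data.frame test
is.list(iris)
is.data.frame(iris)

# As a list, its elements are the columns
length(iris)   # number of columns
names(iris)    # column names

# Every element (column) has the same length: the number of rows
sapply(iris, length)

# List-style access with [[ ]] returns a single column
class(iris[[5]])
```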
PCA outline
Save the numeric columns of iris as a matrix. (Hint: ?as.matrix)
Center and scale the matrix (Hint: ?scale)
Compute the correlation matrix
$$R = \frac{1}{n-1} X^T X$$
Here $X$ is our (centered and scaled) data matrix, $n$ is the number of rows/observations in our data, and $X^T$ is the transpose of $X$.
(Hint: t(X) is the transpose operator and A%*%B performs matrix multiplication on the matrices A and B)
PCA outline cont.
Obtain the two leading eigenvectors of the correlation matrix R. Denote these as $v_1, v_2$. (Hint: ?eigen)
Compute the first and second principal components via
$$z_1 = X v_1$$
$$z_2 = X v_2$$
Produce a scatter plot of $z_1$ vs $z_2$ (Hint: ?plot)
Take a few moments to try it yourself before looking at the answers on the next slides.
PCA from scratch
data("iris")
# get numeric portions of list and make a matrix
X <- as.matrix(iris[1:4])
# center and scale
X <- scale(X,center = TRUE,scale=TRUE)
# get the number of rows
n <- nrow(X)
# compute correlation matrix
R <- (1/(n-1))*t(X)%*%X
# perform eigen decomposition
Reig <- eigen(R)
# get eigenvectors
Reig.vecs <- Reig$vectors
# create principal components
pc1 <- X%*%Reig.vecs[,1]
pc2 <- X%*%Reig.vecs[,2]
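One way to gain confidence in the hand-computed correlation matrix is to compare it with R's built-in cor() function. The sketch below recomputes R so that it runs on its own; the names evals and var.explained are introduced only for this check.

```r
data("iris")

# Recompute the centered and scaled data matrix and the correlation matrix
X <- scale(as.matrix(iris[1:4]), center = TRUE, scale = TRUE)
n <- nrow(X)
R <- (1 / (n - 1)) * t(X) %*% X

# Should agree with the built-in correlation function up to rounding error
all.equal(R, cor(iris[1:4]), check.attributes = FALSE)

# The eigenvalues of a correlation matrix sum to the number of variables,
# so each eigenvalue's share is the proportion of variance it explains
evals <- eigen(R)$values
var.explained <- evals / sum(evals)
var.explained
```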
PCA from scratch cont.
# compare to R's PCA function
their.pcs <- prcomp(iris[1:4],center = TRUE,scale. = TRUE)
head(their.pcs$x[,1:2])
## PC1 PC2
## [1,] -2.257141 -0.4784238
## [2,] -2.074013 0.6718827
## [3,] -2.356335 0.3407664
## [4,] -2.291707 0.5953999
## [5,] -2.381863 -0.6446757
## [6,] -2.068701 -1.4842053
# our result
head(cbind(pc1,pc2))
## [,1] [,2]
## [1,] -2.257141 -0.4784238
## [2,] -2.074013 0.6718827
## [3,] -2.356335 0.3407664
## [4,] -2.291707 0.5953999
## [5,] -2.381863 -0.6446757
## [6,] -2.068701 -1.4842053
PCA from scratch cont.
plot(pc1,pc2,col=iris$Species)
[Figure: scatter plot of pc1 (x-axis) against pc2 (y-axis), points colored by iris species]
Factors
# Factors are like vectors, but with predefined allowed values called levels
# Factors are used to represent categorical variables in R
# create a factor
factor1 <- factor(c('Good','Bad','Ugly'))
# find its levels
levels(factor1)
## [1] "Bad" "Good" "Ugly"
# below gives warning, but not error
factor1[4] <- 17
## Warning in `[<-.factor`(`*tmp*`, 4, value = 17): invalid factor level, NA generated
# see what happened
factor1
## [1] Good Bad Ugly <NA>
## Levels: Bad Good Ugly
factor1[4] <- 'Bad'
# get the breakdown
table(factor1)
## factor1
## Bad Good Ugly
## 2 1 1
Note that in one of our previous examples, R filled in the improper factor value with NA.
NA is R’s way of specifying missing data.
Note that missing data is handled differently than ordinary values, as we will see as we go along.
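A brief sketch of what "handled differently" means in practice: comparisons involving NA yield NA, so missing values are detected with is.na() rather than ==. The vector v is made up for this example.

```r
# A small vector with one missing value
v <- c(3, NA, 5)

# Comparing with NA gives NA, not TRUE or FALSE
NA > 1
v == NA

# Use is.na() to detect missing values
is.na(v)

# Many summaries return NA unless told to drop missing values
mean(v)
mean(v, na.rm = TRUE)
```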
Questions
What will the following lines of code do?
my.matrix[3:4,1:2] <- c(4,5)
my.matrix[4,5] <- 'string'
mf.strings <- c('F','F','M','F')
factor2 <- as.factor(mf.strings)
c(factor1, factor2)
factor1 == 'Ugly'
my.list[[3]][2,]
sum(c(1,2,3,NA))
sum(c(1,2,3,NA),na.rm = TRUE)
Data Frames
The data.frame is how R represents data sets. They are simply lists, with a few additional restrictions.
# create your own
my.df <- data.frame(
age = c(45,27,19,59,71,13,5),
gender = factor(c('M','M','M','F','M','F','F'))
)
str(my.df)
## 'data.frame': 7 obs. of 2 variables:
## $ age : num 45 27 19 59 71 13 5
## $ gender: Factor w/ 2 levels "F","M": 2 2 2 1 2 1 1
Data Frames - Cont.
Individual variables can be accessed via $ operator
my.df$age
## [1] 45 27 19 59 71 13 5
summary(my.df$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 16.00 27.00 34.14 52.00 71.00
table(my.df$gender)
##
## F M
## 3 4
# data frames are really just lists
my.df[[2]]
## [1] M M M F M F F
## Levels: F M
Data Frames - Cont.
# data.frames can be subsetted like matrices
my.df[1:3,c("age")]
## [1] 45 27 19
# logical subsetting is especially useful for data.frames
# get ages over 40
age.logic <- my.df$age > 40
# take a subset of these rows
my.df[age.logic,]
## age gender
## 1 45 M
## 4 59 F
## 5 71 M
# create a new variable age.sq
my.df$age.sq <- my.df$age^2
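Logical conditions can also be combined with & and | before subsetting. The sketch below rebuilds the same small data frame so that it runs on its own; the names older.men, young.or.f, and mean.age.f are introduced only for illustration.

```r
# Same toy data frame as above, rebuilt so this block is self-contained
my.df <- data.frame(
  age = c(45, 27, 19, 59, 71, 13, 5),
  gender = factor(c('M', 'M', 'M', 'F', 'M', 'F', 'F'))
)

# Rows where both conditions hold: over 40 AND male
older.men <- my.df[my.df$age > 40 & my.df$gender == 'M', ]
older.men

# Rows where at least one condition holds: under 18 OR female
young.or.f <- my.df[my.df$age < 18 | my.df$gender == 'F', ]
young.or.f

# Subset a single column by a condition on another column
mean.age.f <- mean(my.df$age[my.df$gender == 'F'])
mean.age.f
```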
Try It Out
Let’s use R’s internal iris data set to practice with data frames
my.iris <- iris
my.iris
1 Create two new variables Length.Sum and Width.Sum which are the sum of Sepal and Petal length/width respectively.
2 Use subsetting and R’s mean function to find the average Length.Sum of the setosa species
my.iris$Length.Sum = my.iris$Sepal.Length +
my.iris$Petal.Length
my.iris$Width.Sum = my.iris$Sepal.Width +
my.iris$Petal.Width
setosa.inds <- my.iris$Species == 'setosa'
mean(my.iris[setosa.inds,]$Length.Sum)
## [1] 6.468
Control Structures
R has all the typical control structures:
if-else statements
for loops
while loops
Syntax
if(logical_expression){
  execute_code
} else{
  execute_other_code
}
for(value in sequence){
  work_with_value
}
while(expression_is_true){
  execute_code
}
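A concrete, runnable instance of each template may help. The variable names and values below are invented for this sketch.

```r
# if-else: classify a number as even or odd
x <- 7
if (x %% 2 == 0) {
  parity <- "even"
} else {
  parity <- "odd"
}
parity

# for loop: add up the numbers 1 through 5
total <- 0
for (value in 1:5) {
  total <- total + value
}
total

# while loop: count how many halvings bring 100 down to 1 or below
n <- 100
count <- 0
while (n > 1) {
  n <- n / 2
  count <- count + 1
}
count
```

Note that else must appear on the same line as the closing brace of the if block when code is run line by line at the console.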
Functions
Defining functions in R is easy
# use function key word with assignment <-
my.mean <- function(input.vector){
  sum = 0
  for(val in input.vector) {
    sum = sum + val
  }
  # the value of the last expression gets returned (invisibly here, because it is an assignment)
  return.me <- sum / length(input.vector)
}
my.mean(1:10)
Functions cont.
my.mean <- function(input.vector){
  sum = 0
  for(val in input.vector) {
    sum = sum + val
  }
  # returns 1 now
  return.me <- sum / length(input.vector)
  1
}
my.mean(1:10)
## [1] 1
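Besides relying on the last expression, a function can also hand back a value early with return(). The following is a sketch of that variant; the name my.mean2 and the choice to return NA for empty input are assumptions made for this illustration, not part of the original example.

```r
# A variant of my.mean with an explicit early return
# (my.mean2 and the empty-input rule are illustrative choices)
my.mean2 <- function(input.vector) {
  # return() exits the function immediately with the given value
  if (length(input.vector) == 0) {
    return(NA)
  }
  total <- 0
  for (val in input.vector) {
    total <- total + val
  }
  # last expression: this value is returned when no return() was hit
  total / length(input.vector)
}

my.mean2(1:10)
my.mean2(c())
```

Using a name such as total instead of sum also avoids shadowing R's built-in sum function inside the function body.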
Try It Out
Create a function my.summary which inputs a vector, x, calculates the mean, standard deviation, max, and min of x, and returns these in a list
Try out R’s internal functions mean, sd, max, min
my.summary <- function(x) {
  list(
    mean = mean(x),
    sd = sd(x),
    max = max(x),
    min = min(x)
  )
}
Try It Out cont.
Loop through the variables in my.iris, evaluating my.summary on each (provided the variable is numeric) and printing the maximum.
Hint: Use is.numeric to test each variable before applying my.summary
for(var in my.iris) {
  if(is.numeric(var)){
    tmp <- my.summary(var)
    print(tmp$max)
  }
}