
Stat 437 Lecture Notes 1

Xiongzhi Chen

Washington State University

Contents

Set up RStudio
  Install R and RStudio
  RStudio: a snapshot
  RStudio
Objects in R: I
  Scalars in R
  Vectors in R: I
  Vectors in R: II
  The seq command
  Matrices in R: I
  Matrices in R: II
  Matrices in R: III
  Matrices in R: IV
  Data frames in R: I
  Data frames in R: II
  Data frames in R: III
  Data frames in R: IV
Objects in R: II
  Character vectors in R
  Strings in R
  Factors in R: I
  Factors in R: II
  Logic operators in R: I
  Logic operators in R: II
  Logic operators in R: III
  Logic operators in R: IV
  Lists in R: I
  Lists in R: II
  Lists in R: III
  Set operations in R: I
  Set operations in R: II
  “Coerce” in R
  length and dim
R markdown
  Install R markdown
  Create an R markdown file
  Structure of a markdown file
  A sample markdown file
  Basic syntax: I
  Basic syntax: II
  Basic syntax: III
  Basic syntax: IV
  Latex in markdown
Data visualization
  Why data visualization?
  R packages for visualization
  Basic principles for plotting
Scatter plot, density plot, boxplot, bar plot
  Scatter plot matrix
  Density plot
  Boxplot
  Scatter plot matrix: ggpairs
  Bar plot
Visualization with factors
  Look into iris data set
  Faceting with 1 factor
  Non-faceting with 1 factor
  Faceting with 2 factors
  Visualization with 3 factors
Mathematical expressions
  Math expressions in R
  Subset of diamonds data
  Base layer
  Math symbols in axis titles
  Math symbols in legend title
  Subset of diamonds data
  Base layer
  Math symbols in legend labels
  Math symbols in strip names
  Math symbols in plot
Other ggplot2 tweaks
  Not covered
  License and session Information

Set up RStudio

Install R and RStudio

• RStudio free version at: https://www.rstudio.com/products/rstudio/download/
• R at: https://www.r-project.org/
• install an R package with install.packages("package_name")
• install the R packages “tidyverse”, “ggplot2”, “markdown”, “igraph”, “plotly”, “ggmap”

RStudio: a snapshot

RStudio

• Upper Left panel: R scripts, R markdown file, R project file, View data, etc

• Lower Left panel: R console, R markdown log, etc

• Upper Right panel: R workspace, History, etc

• Lower Right panel: Files in working directory, Plots, Help, etc

Objects in R: I

Scalars in R

> x = 3 # assign value 3 to variable x
> y = 2
> x+y # addition
[1] 5
> x*y # multiplication
[1] 6
> x/y # division
[1] 1.5
> x%%y # modulo
[1] 1
> x^y # exponentiation
[1] 9
> x/0
[1] Inf
> 0/0 # undefined
[1] NaN

Vectors in R: I

> z = c(1,2,3) # a vector of 3 components
> v = c(5,6,7)
> z+v # vector addition
[1]  6  8 10
> z*v # paired componentwise product
[1]  5 12 21
> z/v # paired componentwise division
[1] 0.2000000 0.3333333 0.4285714
> z%*%v # inner product
     [,1]
[1,]   38
> 2*z # scalar-vector multiplication
[1] 2 4 6

Vectors in R: II

> z = c(1,2,3)
> v = c(5,6,7)
> z[1] # access the 1st component of z
[1] 1
> t(v) # transpose of vector
     [,1] [,2] [,3]
[1,]    5    6    7
> z%*%t(v) # outer product
     [,1] [,2] [,3]
[1,]    5    6    7
[2,]   10   12   14
[3,]   15   18   21
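The outer product can also be obtained with base R's outer(). The following is a minimal sketch for comparison, reusing the same z and v; the names op1, op2 and ip are only illustrative.

```r
z <- c(1, 2, 3)
v <- c(5, 6, 7)

# z %*% t(v) treats z as a 3-by-1 column and t(v) as a 1-by-3 row
op1 <- z %*% t(v)

# outer() builds the same 3-by-3 table of products z[i] * v[j]
op2 <- outer(z, v)

# the two results agree entry by entry
all.equal(op1, op2)

# the inner product is returned as a 1-by-1 matrix; drop() gives a plain number
ip <- drop(z %*% v)
ip
```

drop() is convenient when the inner product is used later as a scalar, for example as a divisor.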

The seq command

> seq(from = 1, to = 1, by = ((to - from)/(length.out - 1)),
+     length.out = NULL, along.with = NULL, ...)

Usage:
> seq(0,1,by=0.1)
 [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
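The other arguments in the signature can be illustrated in the same way. The sketch below contrasts by with length.out and adds the base R helpers seq_len() and seq_along(); the variable names are only illustrative.

```r
# by fixes the step size; length.out fixes the number of points instead
s1 <- seq(0, 1, by = 0.25)
s2 <- seq(0, 1, length.out = 5)   # the same five points as s1

# seq_len(n) gives 1, 2, ..., n and returns an empty vector when n is 0
idx <- seq_len(3)
empty <- seq_len(0)

# seq_along(x) gives the index vector of x
w <- c("a", "b", "c", "d")
pos <- seq_along(w)
```

seq_len() and seq_along() are commonly preferred over 1:n in loops, because 1:0 counts down (1, 0) instead of being empty.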

Matrices in R: I

> matrix(1:6,nrow=2,ncol=3) # a 2-by-3 matrix
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> x = c(1,3,5) # a 3-component vector
> y = c(2,4,6) # a 3-component vector
> # stack x and y as 2 rows to obtain a 2-by-3 matrix
> rbind(x,y)
  [,1] [,2] [,3]
x    1    3    5
y    2    4    6
> # stack x and y as 2 columns to obtain a 3-by-2 matrix
> cbind(x,y)
     x y
[1,] 1 2
[2,] 3 4
[3,] 5 6

Matrices in R: II

> x=matrix(1:6,nrow=2,ncol=3) # a 2-by-3 matrix
> x
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> x[,1] # 1st column of x
[1] 1 2
> x[2,] # 2nd row of x
[1] 2 4 6
> x[1,2] # (1,2)-entry of x
[1] 3
> t(x) # transpose of x
     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6

Matrices in R: III

> x=matrix(1:6,nrow=2,ncol=3) # a 2-by-3 matrix
> x
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> y = rbind(c(0,1,0),c(1,1,1))
> y
     [,1] [,2] [,3]
[1,]    0    1    0
[2,]    1    1    1
> x %*%t(y) # matrix product
     [,1] [,2]
[1,]    3    9
[2,]    4   12
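The matrix product should not be confused with the entrywise product. A minimal sketch with the same x and y shows both, together with the dimension rule for %*%; the names ew, mp and res are only illustrative.

```r
x <- matrix(1:6, nrow = 2, ncol = 3)
y <- rbind(c(0, 1, 0), c(1, 1, 1))

# "*" multiplies entry by entry, so both matrices must have the same shape
ew <- x * y

# "%*%" needs ncol(left) == nrow(right); x is 2-by-3 and t(y) is 3-by-2
mp <- x %*% t(y)
dim(mp)

# x %*% y is not conformable (3 columns against 2 rows) and raises an error
res <- try(x %*% y, silent = TRUE)
inherits(res, "try-error")
```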

Matrices in R: IV

> x=matrix(1:6,nrow=2,ncol=3) # a 2-by-3 matrix
> x
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> y = rbind(c(0,1,0),c(1,1,1))
> y
     [,1] [,2] [,3]
[1,]    0    1    0
[2,]    1    1    1
> x + y # matrix addition
     [,1] [,2] [,3]
[1,]    1    4    5
[2,]    3    5    7
> 2*x # scalar multiplication
     [,1] [,2] [,3]
[1,]    2    6   10
[2,]    4    8   12

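Data frames in R: I

> x <- data.frame("SN" = 1:2, "Age" = c(21,15),
+                 "Name" = c("John","Dora"))
> x
  SN Age Name
1  1  21 John
2  2  15 Dora
> x$SN # access SN
[1] 1 2
> x[,1] # access SN
[1] 1 2
> class(x$SN) # check object type for SN
[1] "integer"
> class(x$Name) # check object type for Name
[1] "factor"

Note: "factor" is the result under the R version used for these notes (3.5.0). Since R 4.0.0, data.frame() keeps strings as character by default (stringsAsFactors = FALSE), so class(x$Name) returns "character".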

Data frames in R: II

> x <- data.frame("SN" = 1:2, "Age" = c(21,15),
+                 "Name" = c("John","Dora"))
> x
  SN Age Name
1  1  21 John
2  2  15 Dora
> x$SN[2] # access the 2nd entry of SN
[1] 2
> x[1,2] # access the 1st entry of Age
[1] 21

Caution: do not transpose a data.frame when it contains different types of objects
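The reason for this caution can be seen in a small sketch: t() converts the data frame to a matrix first, and a matrix can hold only one type. The example rebuilds the same x; the names tx and res are only illustrative.

```r
x <- data.frame(SN = 1:2, Age = c(21, 15), Name = c("John", "Dora"))

# t() first converts the data frame to a matrix; a matrix holds one type only,
# so the numeric columns are converted to character strings
tx <- t(x)
typeof(tx)    # "character"
dim(tx)       # 3 rows (the former columns) by 2 columns

# arithmetic on the transposed object no longer works
res <- try(tx["Age", ] + 1, silent = TRUE)
inherits(res, "try-error")
```

If only the numeric columns are needed, transposing a numeric subset such as t(x[, c("SN", "Age")]) keeps the numbers as numbers.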

Data frames in R: III

Import (malaria related death) data as data.frame:
> Y = read.csv("dataMalyria.csv",header = TRUE,sep=",",
+              colClasses=c("country"=NA,"percent"="numeric",
+                           "labels"=NA))
> head(Y)
     country percent labels
1    Lesotho       0    <1%
2  Mauritius       0    <1%
3 Seychelles       0    <1%
4 Cabo Verde       0    <1%
5    Algeria       0    <1%
6      Egypt       0    <1%
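Since the course data file is not reproduced in these notes, the role of colClasses can be tried on a throwaway file. The sketch below writes a tiny CSV with made-up rows (not the malaria data) to a temporary path and reads it back with the same arguments; the object name Y0 is hypothetical.

```r
# write a tiny illustrative file (made-up rows, not the course data set)
tmp <- tempfile(fileext = ".csv")
writeLines(c("country,percent,labels",
             "A,0,<1%",
             "B,2.5,1-4%"), tmp)

# NA in colClasses lets read.csv choose the type; "numeric" forces it
Y0 <- read.csv(tmp, header = TRUE, sep = ",",
               colClasses = c("country" = NA, "percent" = "numeric",
                              "labels" = NA))
str(Y0)
unlink(tmp)   # remove the temporary file
```

Forcing "numeric" is a useful check: read.csv stops with an error if the column contains an entry that cannot be read as a number.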

Data frames in R: IV

Inspect the imported (malaria related death) data:
> str(Y) # object structure of Y
'data.frame': 53 obs. of 3 variables:
 $ country: Factor w/ 53 levels "Algeria","Angola",..: 25 32 41 7 1 15 27 33 50 47 ...
 $ percent: num 0 0 0 0 0 0 0 0 0 0 ...
 $ labels : Factor w/ 5 levels " <1% "," 1-4% ",..: 1 1 1 1 1 1 1 1 1 1 ...
> dim(Y) # dimension of Y
[1] 53 3
> Y$id = 1:53 # append a column to Y
> Y[1:3,] # display the first 3 rows of Y
     country percent labels id
1    Lesotho       0    <1%  1
2  Mauritius       0    <1%  2
3 Seychelles       0    <1%  3

Objects in R: II

Character vectors in R

> w = c("a","b","c") # a vector of 3 character components
> w[2] # access the 2nd component
[1] "b"
> # 1st 10 upper case letters in the alphabet
> LETTERS[seq( from = 1, to = 10 )]
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J"
> # 1st 10 lower case letters in the alphabet
> letters[seq( from = 1, to = 10 )]
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
> Q = c("Go","WSU","Cougs","!")
> Q
[1] "Go"    "WSU"   "Cougs" "!"
> # concatenate two character vectors
> c(w,Q)
[1] "a"     "b"     "c"     "Go"    "WSU"   "Cougs" "!"

Strings in R

> w = "Go cougs!"
> w
[1] "Go cougs!"
>
> v = "Data analytics"
> v
[1] "Data analytics"
>
> # concatenate two strings
> paste(w,v,sep = " ")
[1] "Go cougs! Data analytics"

Factors in R: I

> grades = c("A","F","D","C","B") # character vector
> grades
[1] "A" "F" "D" "C" "B"
> class(grades)
[1] "character"
> gradesF = factor(grades) # gradesF is now a factor
> gradesF
[1] A F D C B
Levels: A B C D F
> class(gradesF)
[1] "factor"
> # levels of the factor "gradesF"
> levels(gradesF)
[1] "A" "B" "C" "D" "F"
> # levels are ordered alphabetically

Factors in R: II

> x = c(1,3,2) # numeric vector
> b = factor(x) # change x into a factor
> b
[1] 1 3 2
Levels: 1 2 3
> levels(b) # levels are ordered from smallest to largest
[1] "1" "2" "3"
> # relabel levels of b
> d = factor(x,labels = c("3Level","1Level","2Level"))
> d
[1] 3Level 2Level 1Level
Levels: 3Level 1Level 2Level

Logic operators in R: I

> x = 0 # assign 0 to x
> x >0
[1] FALSE
> x == 0
[1] TRUE
> !x # return TRUE
[1] TRUE
> y = 1
> y >= 1
[1] TRUE
> !y # return FALSE
[1] FALSE
> x & y # "and"; return FALSE
[1] FALSE
> x | y # "or"; return TRUE
[1] TRUE

Logic operators in R: II

> x = 1
> y = -1
> x >0 & y > 0 # "and"
[1] FALSE
> x > 0 | y > 0 # "or"
[1] TRUE
> x >0 & !(y>0)
[1] TRUE

Logic operators in R: III

> x = c(1,2,3) # a 3-component vector
> x >0 # returns a 3-component logical vector
[1] TRUE TRUE TRUE
> x > 2 # returns a 3-component logical vector
[1] FALSE FALSE  TRUE
> # return indices of entries of x that are greater than 2
> which(x>2)
[1] 3
> # take the subvector of x whose entries are not smaller than 2
> x[x >=2]
[1] 2 3

Logic operators in R: IV

> x = c(1,2,3) # a 3-component vector
> y = c(-1,4,-1) # a 3-component vector
> # compare x and y entrywise; return a 3-component vector
> x > y
[1]  TRUE FALSE  TRUE
> x == y
[1] FALSE FALSE FALSE
> x >= y
[1]  TRUE FALSE  TRUE
> any(x>y)
[1] TRUE
> all(x>y)
[1] FALSE

Lists in R: I

> x = vector("list",3) # a list with 3 components
> # assign a vector to its 1st component
> x[[1]] = c(1,2,3)
> # assign a string to its 2nd component
> x[[2]] = "Second part of x"
> # assign a matrix to its 3rd component
> x[[3]] = matrix(1:6,nrow=3)
> x
[[1]]
[1] 1 2 3

[[2]]
[1] "Second part of x"

[[3]]
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

Lists in R: II

> x = vector("list",3) # a list with 3 components
> x[[1]] = c(1,2,3)
> x[[2]] = "Second part of x"
> x[[3]] = matrix(1:6,nrow=3)
> x[[2]] # show 2nd component of x
[1] "Second part of x"

Lists in R: III

> a = c(1,2,3)
> b = "Second part of x"
> c = matrix(1:6,nrow=3)
> y = list("vector" = a, "string" = b, "matrix" = c)
> y
$vector
[1] 1 2 3

$string
[1] "Second part of x"

$matrix
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
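Components of a named list can be reached in several ways. The sketch below rebuilds the same list (under the names used above) and contrasts $, [[ ]] and [ ]; a1, a2 and sub are only illustrative names.

```r
y <- list(vector = c(1, 2, 3),
          string = "Second part of x",
          matrix = matrix(1:6, nrow = 3))

# "$" and "[[ ]]" both return the component itself
a1 <- y$vector
a2 <- y[["vector"]]

# "[ ]" returns a shorter list that still wraps the component
sub <- y["vector"]
class(sub)    # "list"

# names() lists the component names; length() counts the components
names(y)
length(y)
```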

Set operations in R: I

> x = c(1,2,3) # a 3-component vector
> 1 %in% x # check membership
[1] TRUE
> c(2,3) %in% x
[1] TRUE TRUE
> y = c("stat","115","lecture")
> "stat" %in% y
[1] TRUE
> "time" %in% y
[1] FALSE

Set operations in R: II

> x = c(1,2,3) # a 3-component vector
> y = c(-1,4,-1) # a 3-component vector
> union(x, y)
[1]  1  2  3 -1  4
> intersect(x, y)
numeric(0)
> setdiff(x, y)
[1] 1 2 3

“Coerce” in R

• as.numeric coerces an object to be numeric
• as.factor coerces an object to be a factor
• as.matrix . . .
• as.logical . . .
• as.data.frame . . .
• and so on . . .
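A few of these coercions can be sketched on small inputs. The values are made up for illustration; the factor case is included because as.numeric() applied to a factor returns level codes rather than the original numbers.

```r
# character to numeric
n1 <- as.numeric("3.5")

# text that is not a number becomes NA (R also issues a warning)
n2 <- suppressWarnings(as.numeric("abc"))

# numeric to logical: 0 is FALSE, any other number is TRUE
l1 <- as.logical(c(0, 1, 2))

# a vector becomes a one-column matrix
m1 <- as.matrix(1:3)

# a matrix becomes a data frame with one column per matrix column
d1 <- as.data.frame(matrix(1:6, nrow = 3))

# factor to numeric: as.numeric() alone returns the internal level codes
b <- factor(c(10, 30, 20))              # levels "10" "20" "30"
codes <- as.numeric(b)                  # 1 3 2
vals <- as.numeric(as.character(b))     # 10 30 20
```

Going through as.character() first is the usual way to recover numbers that were stored as factor labels.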

length and dim

• length returns the number of components of a vector
> a = 1:10
> length(a)
[1] 10

• dim returns the dimension of a matrix or data frame
> x=dim(matrix(1:6,nrow=3,ncol=2))
> x
[1] 3 2
> x[1]
[1] 3
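The two functions behave differently when applied to the "other" kind of object. A short sketch on made-up objects (a, m and df are only illustrative names):

```r
a <- 1:10
dim(a)              # NULL: a plain vector has no dim attribute

m <- matrix(1:6, nrow = 3, ncol = 2)
length(m)           # 6: the number of entries
nrow(m)             # 3, the first part of dim(m)
ncol(m)             # 2, the second part of dim(m)

df <- data.frame(SN = 1:2, Age = c(21, 15))
length(df)          # 2: a data frame is a list of columns
dim(df)             # 2 2
```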


R markdown

Install R markdown

> install.packages("markdown")
> install.packages("knitr")

In Rstudio, follow “Tools > Global Options > Sweave”, and set “Weave Rnw files using” as “knitr”

More details and video tutorial at: Course website

Create an R markdown file

In Rstudio, follow “File > New File > R markdown . . . ”

More details and video tutorial at: Course website

Structure of a markdown file

• Header (that typesets the output document)
• Main body (that contains the contents)
  – R chunk (that contains R code)
  – Text chunk (that contains non-code text or latex commands)

More details and video tutorial at: Course website

A sample markdown file

Basic syntax: I

Online tutorial: https://rmarkdown.rstudio.com/authoring_basics.html

Online tutorial: https://bookdown.org/yihui/rmarkdown/r-code.html

Basic syntax: II

Some things to go over carefully:

• Adjust figure size in the output document when the figure is generated by an R chunk

• Enable current R chunk to use results produced by previous R chunks

• Basic latex commands

Basic syntax: III

To adjust figure size when the figure is generated by an R chunk:

• use fig.width and fig.height to set graphical device size as in

{r eval=TRUE,fig.width = 3,fig.height=4}

• use out.width and out.height to set output size as in

{r eval=TRUE,out.width = 5,out.height=6}

More details at: https://bookdown.org/yihui/rmarkdown/r-code.html

Basic syntax: IV

To enable the current R chunk to use results produced by previous R chunks:

• name a chunk as “chunk1” and cache results as in

{r chunk1,eval=TRUE,cache=TRUE}

• use dependson= to refer to “chunk1” as in

{r chunk2,dependson="chunk1",eval=TRUE}

More details at: https://yihui.name/knitr/options/
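To see how these chunk headers sit inside a complete file, the sketch below writes a small R markdown document from R. It is only an illustration under stated assumptions: the file name demo.Rmd, the title and the chunk contents are hypothetical, and chunk2 is also given cache=TRUE, which is an addition to the header shown above.

```r
fence <- strrep("`", 3)   # three backticks, built here to avoid nesting fences

rmd <- c(
  "---",
  "title: \"Demo\"",
  "output: pdf_document",
  "---",
  "",
  paste0(fence, "{r chunk1, eval=TRUE, cache=TRUE}"),
  "fit <- lm(dist ~ speed, data = cars)",
  fence,
  "",
  paste0(fence, "{r chunk2, dependson=\"chunk1\", cache=TRUE, fig.width=3, fig.height=4}"),
  "plot(cars); abline(fit)",
  fence
)

path <- file.path(tempdir(), "demo.Rmd")   # hypothetical file name
writeLines(rmd, path)
# rmarkdown::render(path) would knit it, provided rmarkdown and LaTeX are installed
```

Chunks of one document are evaluated in order in a shared session, so chunk2 can see fit in any case. The dependson option matters once caching is on: it tells knitr to re-run a cached chunk when the chunk it depends on has changed.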

Latex in markdown

• To include latex packages, add - \usepackage{package_name} in the header, such as:

header-includes:
- \usepackage{bbm}
- \usepackage{amssymb}
- \usepackage{amsmath}
- \usepackage{graphicx,float}

• For Latex commands, please use a quick reference: https://wch.github.io/latexsheet/

• Caution: not all Latex commands work in markdown

Data visualization

Why data visualization?

Data visualization

• provides preliminary understanding of data
• helps present and disseminate knowledge
• is a relatively under-developed subject of data science

R packages for visualization

• ggplot2: create plots
• GGally: extend ggplot2
• ggmap: provide maps
• igraph: produce graphs
• plotly: create interactive web-based plots
• Other specialized packages

Basic principles for plotting

• data usually need to be a data frame
• build plot layer by layer
• basic components of a plot command:
  – data, mapping, scales
  – geometric objects, coordinate system
  – facet, statistical transformations
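The layer-by-layer idea can be sketched with the built-in iris data. This is a minimal illustration, assuming ggplot2 is installed; the object name p and the axis title are arbitrary choices.

```r
library(ggplot2)   # assumes ggplot2 is installed

# data + mapping: which data frame, and which columns go to x and y
p <- ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Petal.Length))

# geometric object: draw the observations as points
p <- p + geom_point()

# facet: one panel per level of Species
p <- p + facet_wrap(~Species, nrow = 1)

# scale and theme: rename the x axis, switch to a black-and-white theme
p <- p + scale_x_continuous(name = "Sepal length") + theme_bw()

p   # printing the object draws the plot
```

Because each step returns a plot object, intermediate versions can be stored and reused, which is how the base layers p1 and p3 are used later in these notes.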

Scatter plot, density plot, boxplot, bar plot

Scatter plot matrix

A scatter plot can be used to show any “visible” relationship between two variables.

Iris data:

• 4 variables: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
• species: setosa, versicolor, and virginica
• 150 observations for each variable

Scatter plot matrix

> pairs(iris[,1:4], col=iris$Species)

[Figure: scatter plot matrix of Sepal.Length, Sepal.Width, Petal.Length and Petal.Width, with points coloured by species]

Density plot

A density plot can be used to:

• visually check model assumptions
• visually compare a response’s behavior under different conditions

Example: iris data set

Density plot

> library(ggplot2)
> ggplot(iris, aes(x=Sepal.Length, color=Species)) +
+   geom_density(linetype = "dashed") + theme_bw()

[Figure: dashed density curves of Sepal.Length, one per species (setosa, versicolor, virginica); x-axis: Sepal.Length, y-axis: density]

Boxplot

A boxplot does not present the full distributional information that a density plot does, but it can be used to visually check:

• median of data
• range of data
• skewness of data
• outliers in data
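The quantities a boxplot displays can also be computed directly. The sketch below uses base R on one species of the iris data and the common 1.5 × IQR rule for flagging outliers (the default rule of boxplot() and geom_boxplot()); the variable names are only illustrative.

```r
x <- iris$Sepal.Length[iris$Species == "virginica"]

# quartiles and interquartile range
q <- quantile(x, c(0.25, 0.5, 0.75))
iqr <- IQR(x)

# points beyond 1.5 * IQR from the box are commonly flagged as outliers
lower <- q[1] - 1.5 * iqr
upper <- q[3] + 1.5 * iqr
out <- x[x < lower | x > upper]

# boxplot.stats() reports the five numbers behind the box and the flagged points
bs <- boxplot.stats(x)
bs$stats
bs$out
```

boxplot.stats() works with hinges rather than quantile(), so its box limits can differ slightly from q on some data sets.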

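Boxplot

> library(ggplot2)
> ggplot(iris, aes(x=Species,y=Sepal.Length))+geom_boxplot()+
+   theme_bw()+stat_summary(fun.y=mean,geom="point",shape=23,size=4)
> # in ggplot2 3.3.0 and later, fun.y is deprecated; use fun=mean instead

[Figure: boxplots of Sepal.Length for setosa, versicolor and virginica, with the mean of each species marked by a diamond; x-axis: Species, y-axis: Sepal.Length]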

Scatter plot matrix: ggpairs

> library(GGally)
> ggpairs(iris, aes(colour = Species, alpha = 0.4))

[Figure: ggpairs plot matrix of Sepal.Length, Sepal.Width, Petal.Length, Petal.Width and Species, coloured by species. Correlations shown in the upper panels:]

Pair                          Cor     setosa  versicolor  virginica
Sepal.Length, Sepal.Width    -0.118   0.743   0.526       0.457
Sepal.Length, Petal.Length    0.872   0.267   0.754       0.864
Sepal.Width, Petal.Length    -0.428   0.178   0.561       0.401
Sepal.Length, Petal.Width     0.818   0.278   0.546       0.281
Sepal.Width, Petal.Width     -0.366   0.233   0.664       0.538
Petal.Length, Petal.Width     0.963   0.332   0.787       0.322

Bar plot

> library(ggplot2)
> ggplot(mpg, aes(x=drv,y=hwy,fill=class))+theme_bw()+
+   geom_bar(stat='identity', position='dodge')

[Figure: dodged bar plot of hwy against drv (4, f, r), filled by class (2seater, compact, midsize, minivan, pickup, subcompact, suv)]

Visualization with factors

Look into iris data set

> library(ggplot2)
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Faceting with 1 factor

> library(ggplot2)
> ggplot(iris, aes(x=Sepal.Length,y=Petal.Length))+
+   theme_bw()+geom_point()+
+   facet_wrap(~Species,nrow=1)

[Figure: scatter plots of Petal.Length against Sepal.Length, one panel per species (setosa, versicolor, virginica)]

Non-faceting with 1 factor

> library(ggplot2)
> ggplot(iris, aes(x=Sepal.Length,y=Petal.Length,
+                  shape=Species,colour=Species))+
+   theme_bw()+geom_point()

[Figure: single scatter plot of Petal.Length against Sepal.Length, with point shape and colour indicating Species (setosa, versicolor, virginica)]

Faceting with 2 factors

> library(ggplot2)
> head(diamonds)
# A tibble: 6 x 10
  carat cut   color clarity depth table price     x     y     z
  <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23  Ideal E     SI2      61.5    55   326  3.95  3.98  2.43
2 0.21  Prem~ E     SI1      59.8    61   326  3.89  3.84  2.31
3 0.23  Good  E     VS1      56.9    65   327  4.05  4.07  2.31
4 0.290 Prem~ I     VS2      62.4    58   334  4.2   4.23  2.63
5 0.31  Good  J     SI2      63.3    58   335  4.34  4.35  2.75
6 0.24  Very~ J     VVS2     62.8    57   336  3.94  3.96  2.48

Faceting with 2 factors

> library(ggplot2)
> diamondsA = diamonds[diamonds$color %in% c("E","J","G"), ]
> ggplot(diamondsA, aes(x=carat,y=price))+theme_bw()+
+   geom_point(aes(colour=depth))+facet_grid(color~cut)

[Figure: scatter plots of price against carat in a grid; rows: color (E, G, J), columns: cut (Fair, Good, Very Good, Premium, Ideal); point colour: depth]

Visualization with 3 factors

> library(ggplot2)
> ggplot(diamondsA, aes(x=carat,y=price))+theme_bw()+
+   geom_point(aes(colour=clarity))+facet_grid(color~cut)

[Figure: scatter plots of price against carat in a grid; rows: color (E, G, J), columns: cut (Fair, Good, Very Good, Premium, Ideal); point colour: clarity (I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF)]

Mathematical expressions

Math expressions in R

• Plotmath documentation

• expression and paste commands
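A minimal sketch of these two commands with base graphics follows. The labels xs and ys are the ones defined later in these notes; the use of bquote() and the iris data are additions for illustration only.

```r
# expression() stores unevaluated plotmath: pi["1,m"] is a subscript, gamma^2 a power
xs <- expression(paste("carat ", pi["1,m"]))
ys <- expression(paste("price ", gamma^2))
class(xs)    # "expression"

# bquote() splices a computed value into a math expression via .()
r <- round(cor(iris$Sepal.Length, iris$Petal.Length), 2)
lab <- bquote(rho == .(r))

# base graphics accept these objects directly as titles and axis labels
plot(iris$Sepal.Length, iris$Petal.Length, xlab = xs, ylab = ys, main = lab)
```

Inside expression(), paste() is not the string function: plotmath interprets it as "place these pieces side by side", which is why text and symbols can be mixed.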

Subset of diamonds data

Use dplyr and piping:
> library(dplyr)
> dB = diamonds %>%
+   filter(color %in% c("E","J","G")) %>%
+   filter(cut %in% c("Ideal","Premium"))
> head(dB)
# A tibble: 6 x 10
  carat cut   color clarity depth table price     x     y     z
  <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  0.23 Ideal E     SI2      61.5    55   326  3.95  3.98  2.43
2  0.21 Prem~ E     SI1      59.8    61   326  3.89  3.84  2.31
3  0.23 Ideal J     VS1      62.8    56   340  3.93  3.9   2.46
4  0.31 Ideal J     SI2      62.2    54   344  4.35  4.37  2.71
5  0.2  Prem~ E     SI2      60.2    62   345  3.79  3.75  2.27
6  0.32 Prem~ E     I1       60.9    58   345  4.38  4.42  2.68

Base layer

> library(ggplot2)
> p1 = ggplot(dB, aes(x=carat,y=price))+theme_bw()+
+   geom_point(aes(colour=depth))+facet_grid(cut~color)
> p1

[Figure: scatter plots of price against carat in a grid; rows: cut (Premium, Ideal), columns: color (E, G, J); point colour: depth]

Math symbols in axis titles

Create math expressions for axis titles:
> xs = expression(paste("carat ", pi["1,m"], sep=""))
> ys = expression(paste("price ", gamma^2, sep=""))
> ms = c("Price vs carat")

Math symbols in axis titles

> p2=p1 + ggtitle(ms)+xlab(xs)+ylab(ys)+
+   theme(plot.title = element_text(hjust = 0.5))
> p2

[Figure: the base-layer plot p1 with a centred title, x-axis title “carat π1,m” (subscript 1,m), y-axis title “price γ2” (γ squared), and legend title depth]

Math symbols in legend title

> p2 + labs(col = expression(paste("my ",lambda, sep="")))

[Figure: the plot p2 with the legend title changed from depth to “my λ”]

Subset of diamonds data

> library(dplyr)
> dC = diamonds %>% filter(color %in% c("E","J","G")) %>%
+   filter(cut %in% c("Ideal","Premium")) %>%
+   filter(clarity %in% c("SI1","VS1"))
> head(dC)
# A tibble: 6 x 10
  carat cut   color clarity depth table price     x     y     z
  <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  0.21 Prem~ E     SI1      59.8    61   326  3.89  3.84  2.31
2  0.23 Ideal J     VS1      62.8    56   340  3.93  3.9   2.46
3  0.33 Ideal J     SI1      61.1    56   403  4.49  4.55  2.76
4  0.23 Ideal G     VS1      61.9    54   404  3.93  3.95  2.44
5  0.31 Prem~ G     SI1      61.8    58   553  4.35  4.32  2.68
6  0.7  Ideal E     SI1      62.5    57  2757  5.7   5.72  3.57

Base layer

> library(ggplot2)
> p3 = ggplot(dC, aes(x=carat,y=price))+theme_bw()+
+   geom_point(aes(colour=clarity))+facet_grid(cut~color)
> p3

[Figure: scatter plots of price against carat in a grid; rows: cut (Premium, Ideal), columns: color (E, G, J); point colour: clarity (SI1, VS1)]

Math symbols in legend labels

> library(ggplot2)
> p3+scale_color_discrete(labels =
+   c(expression(italic(omega)),"Any"))

[Figure: the plot p3 with the clarity legend labels replaced by “ω” and “Any”]

Math symbols in strip names

Command:

• Both factors

facet_grid(factorA ~ factorB, labeller = label_parsed)

• One factor

facet_grid(factorA ~ factorB,labeller = labeller(factorA=label_parsed))

Math symbols in strip names

Create math expressions for levels of factors:
> ColorFStg = c(expression(paste(pi[0],"=", 0.5,sep="")),
+               expression(paste(lambda[z],"=", 0.6,sep="")),
+               expression(paste(zeta[0],"=", 0.7,sep="")))
> dC$colorF = dC$color
> dC$colorF = factor(dC$color, labels =ColorFStg)
> dC[,c(1:4,7,11)] %>% group_by(colorF) %>% slice(1)
# A tibble: 3 x 6
# Groups:   colorF [3]
  carat cut     color clarity price colorF
  <dbl> <ord>   <ord> <ord>   <int> <ord>
1  0.21 Premium E     SI1       326 "paste(pi[0], \"=\", 0.5, ~
2  0.23 Ideal   G     VS1       404 "paste(lambda[z], \"=\", 0~
3  0.23 Ideal   J     VS1       340 "paste(zeta[0], \"=\", 0.7~
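The output above shows that the factor levels are stored as plain text. A base R sketch of the round trip follows; the labels are written directly as strings here (an assumption made to keep the example short), and f, labs and e are illustrative names.

```r
# the three level labels, written here directly as strings
labs <- c("paste(pi[0], \"=\", 0.5)",
          "paste(lambda[z], \"=\", 0.6)",
          "paste(zeta[0], \"=\", 0.7)")

# factor() keeps labels as plain text, which is what the tibble displays
f <- factor(c("E", "G", "J"), labels = labs)
levels(f)

# parsing a label turns the text back into a plotmath expression;
# label_parsed does this kind of parsing for each strip label
e <- parse(text = levels(f)[1])
class(e)    # "expression"
e[[1]]      # the unevaluated call paste(pi[0], "=", 0.5)
```

This is why the strips only show math symbols when labeller = label_parsed is supplied: without it, the level text is printed as is.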

Math symbols in strip names

> ggplot(dC, aes(x=carat,y=price))+theme_bw()+
+   geom_point(aes(colour=clarity))+
+   facet_grid(cut~colorF,labeller = label_parsed)

[Figure: scatter plots of price against carat in a grid; rows: cut (Premium, Ideal), columns with parsed strip labels π0=0.5, λz=0.6, ζ0=0.7 (0 and z are subscripts); point colour: clarity (SI1, VS1)]

Math symbols in plot

> ggplot(dC, aes(x=carat,y=price))+theme_bw()+
+   geom_line(aes(linetype=clarity))+
+   facet_grid(cut~colorF,labeller = label_parsed)

[Figure: line plots of price against carat in a grid; rows: cut (Premium, Ideal), columns with parsed strip labels π0=0.5, λz=0.6, ζ0=0.7 (0 and z are subscripts); line type: clarity (SI1, VS1)]

Other ggplot2 tweaks

Not covered

The following have not been covered:

• some statistical transforms: stat_XXX
• lines, shapes for x-y plot: geom_XXX
• axis, legend and strip adjustment: theme
• figure margin adjustment: margin

Information on these can be found in the ggplot2 book or at https://stackoverflow.com

License and session Information

License

> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17134)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods
[7] base

other attached packages:
[1] bindrcpp_0.2.2 dplyr_0.7.8    GGally_1.4.0   ggplot2_3.1.0
[5] knitr_1.21

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0         RColorBrewer_1.1-2 pillar_1.3.1
 [4] compiler_3.5.0     plyr_1.8.4         highr_0.7
 [7] bindr_0.1.1        tools_3.5.0        digest_0.6.18
[10] viridisLite_0.3.0  evaluate_0.12      tibble_1.4.2
[13] gtable_0.2.0       pkgconfig_2.0.2    rlang_0.3.0.1
[16] cli_1.0.1          rstudioapi_0.8     yaml_2.2.0
[19] xfun_0.4           withr_2.1.2        stringr_1.3.1
[22] grid_3.5.0         tidyselect_0.2.5   reshape_0.8.8
[25] glue_1.3.0         R6_2.3.0           fansi_0.4.0
[28] rmarkdown_1.11     reshape2_1.4.3     purrr_0.2.5
[31] magrittr_1.5       scales_1.0.0       codetools_0.2-15
[34] htmltools_0.3.6    assertthat_0.2.0   colorspace_1.3-2
[37] labeling_0.3       utf8_1.1.4         stringi_1.2.4
[40] lazyeval_0.2.1     munsell_0.5.0      crayon_1.3.4