Stat 115 Lecture Notes 1
Xiongzhi Chen
Washington State University
Contents
An overview of big data
  Deluge of data
  Velocity and volume of data
  Variety of data
  Variety of data: discussion
  Some features of modern data
  Illustration of “big”
  Illustration of “Heterogeneous”
  Illustration of “Complicated”
  Features of modern data: discussion
  Nature, data and us
  Some challenges of modern data
Illustrating examples of data analytics
  Example 1: clustering
  Example 2: predictive modeling
  Example 3: dimension reduction
  Example 4: classification
  Example 5: community detection
  Example 6: survival rates
An overview of data analytics
  Different philosophies towards data analytics
  A workflow for data analysis
  Discussion on the workflow
  The landscape of big data
  Data analytics: a joint endeavor
  Software tools for data analytics
  On course contents
Installation of R, RStudio and R packages
  Download R and RStudio
  Installations
  RStudio: a snapshot
  RStudio
Objects in R: I
  Scalars in R
  Vectors in R: I, II
  The seq command
  Matrices in R: I, II, III, IV
  Data frames in R: I, II, III, IV
Objects in R: II
  Character vectors in R
  Strings in R
  Example: create a scatter plot
  Factors in R: I, II
  Logic operators in R: I, II, III, IV
  Set operations in R: I, II
  Lists in R: I, II, III
  “Coerce” in R
  The length and dim commands
R markdown
  Install R markdown
  Create an R markdown file
  Structure of a markdown file
  A sample markdown file
  Basic syntax: I, II, III, IV
  LaTeX in markdown
Copyright and session information
  License and session information
An overview of big data
Deluge of data
[Figure: the three V's of big data: Variety, Velocity, Volume]
Velocity and volume of data
The use of data today is transforming the way we
live, work, and play. Businesses in industries
around the world are using data to transform
themselves to become more agile, improve
customer experience, introduce new business
models, and develop new sources of competitive
advantage. Consumers are living in an
increasingly digital world, depending on online
and mobile channels to connect with friends and
family, access goods and services, and run nearly
every aspect of their lives, even while asleep.
Much of today’s economy relies on data, and this
reliance will only increase in the future as
companies capture, catalog, and cash in on data
in every step of their supply chain; enterprises
collect vast sums of customer data to provide
greater levels of personalization; and consumers
integrate social media, entertainment, cloud
storage, and real-time personalized services into
their streams of life.
The consequence of this increasing reliance on
data will be a never-ending expansion in the size
of the Global Datasphere. Estimated to be 33 ZB
in 2018, IDC forecasts the Global Datasphere to
grow to 175 ZB by 2025. (Figure 1). See Appendix
for methodology and data/device categories.
Figure 1 – Annual Size of the Global Datasphere
MRI image creation is driving storage requirements significantly. The trend is more images with thinner slices and 3D capability. We've gone from 2,000 images to over 20,000 for an MRI of a human head, and stronger magnets and higher resolution pictures means more data stored.
– Senior Director in IT, Major Healthcare Provider
Source: Data Age 2025, sponsored by Seagate with data from IDC Global DataSphere, Nov 2018
IDC White Paper I Doc# US44413318 I November 2018 The Digitization of the World – From Edge to Core I 6
Variety of data
1. Sensors deployed in the physical world (e.g., sequencers in biology, scanners in medicine, wind and humidity sensors in agriculture and weather systems, engine performance sensors in automobiles and aircraft)
2. Sensors deployed in the virtual world (e.g., visits to a webpage on the internet, user interactions on Facebook, search keywords and their frequencies on Google, Amazon transaction records and user profiles, Netflix movie preferences)
3. The rest
Variety of data: discussion
• Are you aware of other sources that generate data (different from those given in the previous slide)?
• Can you provide other instances of data sources within one of the categories given in the previous slide?
Some features of modern data
• Big: requiring large storage space; containing measurements for many variables
• Heterogeneous: uncertainties associated with measurements differ across parts of the data; measurements may have been obtained from different technologies; measurements have different numerical, algebraic or topological properties
• Complicated: measurements may not be generated from designed experiments; certain mathematical and statistical operations cannot be applied to measurements
Illustration of “big”
• Brightness of stars in the Milky Way galaxy
• Visits to (or transactions on) Amazon products
• Neuroimaging (e.g., magnetic resonance imaging, diffuse optical imaging)
• Mobile device data (e.g., iPhone, Apple Watch)
• Social interactions (e.g., Facebook, Twitter)
• U.S. weather system (e.g., temperature, wind direction and velocity)
Illustration of “Heterogeneous”
• Data on transmission of a contagious disease between humans
• Data on Amazon product query and purchase and user profiles
• Data on the prices of a collection of stocks on the NYSE
• Data on the weather system in the U.S.
• Data on search queries via Google
Illustration of “Complicated”
• Data that do not have a global mathematical structure:
  – Social or biological interaction networks
  – Images (in medical studies, agriculture, pattern recognition)
  – Branches (of wings of flies, leaves of trees, or descendants)
• Data that are not generated from designed experiments:
  – Social network or social media data (e.g., Facebook and Twitter)
  – Personal mobile data (e.g., recorded by i-devices)
  – Large-scale ecology or environment data
Features of modern data: discussion
What features of modern data are there other than “big”, “heterogeneous” and “complicated”?
Nature, data and us
The main theme:
• Nature generates Data, and we are part of Nature
• We also generate Data, and we learn from Data
• We respect and maintain harmony with NATURE
Some challenges of modern data
• Data acquisition (issues with obtaining observations)
• Data storage, distribution and access (issues with compressing/decompressing data, data center infrastructure, and database management)
• Data analysis (how to learn from data)
• Data visualization
Illustrating examples of data analytics
Example 1: clustering
Cancer data:
• Cancer type of each patient
• Measurements on how active the same sets of genes are in each patient
• Target: is gene activity suggestive of a cancer type?
Example 1: clustering
Example 2: predictive modeling
Baseball players data:
• Salary of player
• Hits, Runs, Years, League, Division, etc. (a total of 19 features)
• Target: which features of a player are able to predict his salary?
Example 2: predictive modeling
[Figure: scatterplot matrix of Hits, CRuns, CRBI and Salary]
Example 2: predictive modeling
A predictive model:
4 x 1 sparse Matrix of class "dgCMatrix"
                      1
(Intercept) -46.9002500
Hits          3.2192174
CRuns         0.2097842
CRBI          0.4840020
average Salary = -46.90 + 3.22 Hits + 0.21 CRuns + 0.48 CRBI
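As a worked instance of the fitted equation above, the sketch below plugs one player's statistics into it. The helper name predict_salary and the example inputs (100 hits, 300 career runs, 250 career RBI) are assumptions made for illustration; only the coefficients come from the output shown above.

```r
# Coefficients copied from the model output above
coefs <- c("(Intercept)" = -46.9002500, Hits = 3.2192174,
           CRuns = 0.2097842, CRBI = 0.4840020)

# Hypothetical helper: evaluate the fitted linear equation for one player
predict_salary <- function(hits, cruns, crbi) {
  unname(coefs["(Intercept)"] + coefs["Hits"] * hits +
           coefs["CRuns"] * cruns + coefs["CRBI"] * crbi)
}

# Made-up inputs, used only to show the arithmetic
predict_salary(hits = 100, cruns = 300, crbi = 250)  # about 458.96
```

Each coefficient is the change in average salary associated with one additional unit of that feature, holding the other two fixed.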
Example 3: dimension reduction
US arrests data:
• Arrests per 100,000 residents in each of the 50 US states
• Three types of crimes: assault, murder, rape
• UrbanPop (percent of population in each state living in urban areas)
Example 3: dimension reduction
[Figure: biplot of the 50 US states on the first two principal components (PC1, PC2), with arrows for the variables Murder, Assault, UrbanPop and Rape]
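A plot of this kind can probably be reproduced along the following lines. This is a minimal sketch, not necessarily the code behind the slide: it assumes the built-in USArrests data set and standardized variables (scale. = TRUE), neither of which is stated in the notes, and the signs of the components may be flipped relative to the figure.

```r
# USArrests ships with R (datasets package): Murder, Assault, UrbanPop, Rape
pca <- prcomp(USArrests, scale. = TRUE)  # standardizing is an assumption

summary(pca)            # proportion of variance explained by each component
head(pca$x[, 1:2])      # coordinates of the first few states on PC1 and PC2
pca$rotation[, 1:2]     # loadings of the four variables on PC1 and PC2

biplot(pca, scale = 0)  # states as labels, variables as arrows
```

Two components replace four variables here, which is the sense in which the dimension is reduced.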
Example 4: classification
Iris data:
• Three species: “s” (setosa), “c” (versicolor), “v” (virginica)
• Measurements: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
• Classify an iris into one of the 3 species
Example 4: classification
[Figure: scatter plot of Sepal.Width against Sepal.Length, points labeled by KnnClass (c, s, v)]
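The legend name KnnClass suggests a k-nearest-neighbour classifier. The sketch below shows one way such a classification could be run on the built-in iris data. The train/test split, the seed, k = 5 and the choice of the two sepal measurements are all assumptions for illustration, not settings taken from the notes.

```r
library(class)   # provides knn(); a recommended package shipped with standard R installs

set.seed(115)                          # arbitrary seed so the split is reproducible
train_idx <- sample(nrow(iris), 100)   # 100 flowers for training, 50 held out
features  <- c("Sepal.Length", "Sepal.Width")

pred <- knn(train = iris[train_idx, features],
            test  = iris[-train_idx, features],
            cl    = iris$Species[train_idx],
            k     = 5)

# Cross-tabulate predicted against actual species for the held-out flowers
table(predicted = pred, actual = iris$Species[-train_idx])
mean(pred == iris$Species[-train_idx])   # share classified correctly
```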
Example 5: community detection
Karate club data:
• 34 members of the club
• Edges in this network represent friendship, which is defined by consistent interaction outside the normal activities of the club (karate classes and club meetings)
• Target: detect hidden communities among the members
Example 5: community detection
[Figure: friendship network of the 34 club members, nodes numbered 1 to 34]
Example 6: survival rates
Survival times:
• Patients suffering from Ovarian Cancer
• Patients suffering from Breast Cancer
• Target: are the survival probabilities for these two types of cancers different?
Example 6: survival rates
[Figure: survival probability (0 to 1) against Time (0 to 8000) for two strata, admin.disease_code=brca and admin.disease_code=ov, with + marks along each curve]
An overview of data analytics
Different philosophies towards data analytics
• Hypothesis-driven vs data-driven
• Predictive modeling vs Inferential modeling
• Supervised learning vs Unsupervised learning
A workflow for data analysis
Discussion on the workflow
• What is the purpose of a step?
• How does a step affect the steps that follow it?
• What tools are usually needed for a step?
[Figure 1: the data cycle, from “Doing Data Science: Straight Talk from the Frontline” by Rachel Schutt and Cathy O’Neil (O’Reilly)]
The landscape of big data
[Figure: the data science ecosystem, “A Complex Ecosystem!”]
Data analytics: a joint endeavor
• Computer science
• Mathematics
• Statistics
• Domain knowledge
Software tools for data analytics
Source of figure
On course contents
The course will focus on data analytics via statistical and machine learning methods:
• The R programming environment
• Exploratory data analysis
• Supervised learning via predictive modeling and classification
• Unsupervised learning such as clustering and dimension reduction
• Introduction to analysis of non-Euclidean data
Installation of R, Rstudio and R packages
Download R and Rstudio
• RStudio free version at: https://www.rstudio.com/products/rstudio/download/
• R at: https://www.r-project.org/
Installations
• install Rstudio and R
• install R packages “tidyverse”, “ggplot2” and “markdown” by:
install.packages("package_name")
• install additional R packages also by
install.packages("package_name")
RStudio: a snapshot
Rstudio
• Upper Left panel: R scripts, R markdown file, R project file, View data, etc
• Lower Left panel: R console, R markdown log, etc
• Upper Right panel: R workspace, History, etc
• Lower Right panel: Files in working directory, Plots, Help, etc
Objects in R: I
Scalars in R
> x = 3 # assign value 3 to variable x
> y = 2
> x+y # addition
[1] 5
> x*y # multiplication
[1] 6
> x/y # division
[1] 1.5
> x%%y # modulo
[1] 1
> x^y # exponentiation
[1] 9
> x/0
[1] Inf
> 0/0 # undefined
[1] NaN
Vectors in R: I
> z = c(1,2,3) # a vector of 3 components
> v = c(5,6,7)
> z+v # vector addition
[1]  6  8 10
> z*v # paired componentwise product
[1]  5 12 21
> z/v # paired componentwise division
[1] 0.2000000 0.3333333 0.4285714
> z%*%v # inner product
     [,1]
[1,]   38
> 2*z # scalar-vector multiplication
[1] 2 4 6
Vectors in R: II
> z = c(1,2,3)
> v = c(5,6,7)
> z[1] # access the 1st component of z
[1] 1
> t(v) # transpose of vector
     [,1] [,2] [,3]
[1,]    5    6    7
> z%*%t(v) # outer product
     [,1] [,2] [,3]
[1,]    5    6    7
[2,]   10   12   14
[3,]   15   18   21
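The inner and outer products above are written with %*%, which always returns a matrix. As a small sketch of some alternatives available in base R (the choice of functions here is mine, not the notes'):

```r
z <- c(1, 2, 3)
v <- c(5, 6, 7)

sum(z * v)        # inner product as a plain number: 38
crossprod(z, v)   # same value, returned as a 1-by-1 matrix
outer(z, v)       # outer product, same entries as z %*% t(v)
```

sum(z * v) is convenient when the result is needed as a scalar instead of a 1-by-1 matrix.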
The seq command
> seq(from = 1, to = 1, by = ((to - from)/(length.out - 1)),
+     length.out = NULL, along.with = NULL, ...)

Usage:
> seq(0,1,by=0.1)
 [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
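The other arguments in the signature above can be illustrated in the same way. A short sketch with example values chosen here for illustration:

```r
seq(0, 1, length.out = 5)   # 5 equally spaced points: 0.00 0.25 0.50 0.75 1.00
seq(10, 1, by = -3)         # a decreasing sequence: 10 7 4 1

w <- c("a", "b", "c")
seq(along.with = w)         # one index per component of w: 1 2 3
seq_along(w)                # shorthand for the line above
```

Specifying length.out fixes the number of points and lets R work out the step, whereas by fixes the step and lets R work out the number of points.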
Matrices in R: I
> matrix(1:6,nrow=2,ncol=3) # a 2-by-3 matrix
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> x = c(1,3,5) # a 3-component vector
> y = c(2,4,6) # a 3-component vector
> # stack x and y as 2 rows to obtain a 2-by-3 matrix
> rbind(x,y)
  [,1] [,2] [,3]
x    1    3    5
y    2    4    6
> # stack x and y as 2 columns to obtain a 3-by-2 matrix
> cbind(x,y)
     x y
[1,] 1 2
[2,] 3 4
[3,] 5 6
Matrices in R: II
> x=matrix(1:6,nrow=2,ncol=3) # a 2-by-3 matrix
> x
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> x[,1] # 1st column of x
[1] 1 2
> x[2,] # 2nd row of x
[1] 2 4 6
> x[1,2] # (1,2)-entry of x
[1] 3
> t(x) # transpose of x
     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
Matrices in R: III
> x=matrix(1:6,nrow=2,ncol=3) # a 2-by-3 matrix
> x
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> y = rbind(c(0,1,0),c(1,1,1))
> y
     [,1] [,2] [,3]
[1,]    0    1    0
[2,]    1    1    1
> x %*%t(y) # matrix product
     [,1] [,2]
[1,]    3    9
[2,]    4   12
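The transpose in the product above is needed because the inner dimensions must agree. The sketch below, which reuses x and y from the slide, shows what happens without it and contrasts %*% with the entrywise product *. Wrapping the call in tryCatch is my own choice so that the script keeps running after the error.

```r
x <- matrix(1:6, nrow = 2, ncol = 3)
y <- rbind(c(0, 1, 0), c(1, 1, 1))

dim(x %*% t(y))   # (2-by-3) times (3-by-2) gives 2-by-2
dim(t(x) %*% y)   # (3-by-2) times (2-by-3) gives 3-by-3

# x %*% y is not defined: x has 3 columns but y has only 2 rows
res <- tryCatch(x %*% y, error = function(e) conditionMessage(e))
res               # the error message returned by R

x * y             # entrywise product, which only needs equal dimensions
```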
Matrices in R: IV
> x=matrix(1:6,nrow=2,ncol=3) # a 2-by-3 matrix
> x
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> y = rbind(c(0,1,0),c(1,1,1))
> y
     [,1] [,2] [,3]
[1,]    0    1    0
[2,]    1    1    1
> x + y # matrix addition
     [,1] [,2] [,3]
[1,]    1    4    5
[2,]    3    5    7
> 2*x # scalar multiplication
     [,1] [,2] [,3]
[1,]    2    6   10
[2,]    4    8   12
Data frames in R: I
> x <- data.frame("SN" = 1:2, "Age" = c(21,15),
+     "Name" = c("John","Dora"))
> x
  SN Age Name
1  1  21 John
2  2  15 Dora
> x$SN # access SN
[1] 1 2
> x[,1] # access SN
[1] 1 2
> class(x$SN) # check object type for SN
[1] "integer"
> class(x$Name) # check object type for Name ("character" in R 4.0.0 and later)
[1] "factor"
Data frames in R: II
> x <- data.frame("SN" = 1:2, "Age" = c(21,15),
+     "Name" = c("John","Dora"))
> x
  SN Age Name
1  1  21 John
2  2  15 Dora
> x$SN[2] # access the 2nd entry of SN
[1] 2
> x[1,2] # access the 1st entry of Age
[1] 21
Caution: do not transpose a data.frame when it contains different types of objects
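The reason for this caution is that t() first converts the data frame to a matrix, and a matrix can hold only one type of entry. A minimal sketch using the same data frame as above:

```r
x <- data.frame(SN = 1:2, Age = c(21, 15), Name = c("John", "Dora"))

tx <- t(x)
is.matrix(tx)   # TRUE: the result is a matrix, no longer a data frame
typeof(tx)      # "character": SN and Age have been converted to text
tx["Age", ]     # the ages are now strings, so arithmetic on them fails
```

If a transposed layout is really needed, it is usually safer to transpose only the numeric columns, for example t(x[, c("SN", "Age")]).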
Data frames in R: III
Import (malaria related death) data as data.frame:
> Y = read.csv("dataMalyria.csv",header = TRUE,sep=",",
+     colClasses=c("country"=NA,"percent"="numeric",
+     "labels"=NA))
> head(Y)
     country percent labels
1    Lesotho       0    <1%
2  Mauritius       0    <1%
3 Seychelles       0    <1%
4 Cabo Verde       0    <1%
5    Algeria       0    <1%
6      Egypt       0    <1%
Data frames in R: IV
Inspect and extend the imported (malaria related death) data:
> str(Y) # object structure of Y
'data.frame': 53 obs. of 3 variables:
 $ country: Factor w/ 53 levels "Algeria","Angola",..: 25 32 41 7 1 15 27 33 50 47 ...
 $ percent: num 0 0 0 0 0 0 0 0 0 0 ...
 $ labels : Factor w/ 5 levels " <1% "," 1-4% ",..: 1 1 1 1 1 1 1 1 1 1 ...
> dim(Y) # dimension of Y
[1] 53 3
> Y$id = 1:53 # append a column to Y
> Y[1:3,] # display the first 3 rows of Y
     country percent labels id
1    Lesotho       0    <1%  1
2  Mauritius       0    <1%  2
3 Seychelles       0    <1%  3
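The CSV file itself is not part of these notes, so the steps above cannot be rerun as written. The sketch below imitates them on a three-row stand-in built from the rows displayed above. The object names csv_text and Y_small are made up for illustration, and the text = argument of read.csv is used in place of a file name.

```r
# Three rows copied from the output above, typed in as CSV text
csv_text <- "country,percent,labels
Lesotho,0,<1%
Mauritius,0,<1%
Seychelles,0,<1%"

Y_small <- read.csv(text = csv_text, header = TRUE, sep = ",",
                    stringsAsFactors = FALSE)  # keep text columns as character

str(Y_small)                                   # structure of the small data frame
Y_small$id <- seq_len(nrow(Y_small))           # append an id column
Y_small[Y_small$country == "Mauritius", ]      # select rows by a condition
```

Using seq_len(nrow(...)) in place of a hard-coded 1:53 keeps the id column correct if the number of rows changes.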
Objects in R: II
Character vectors in R
> w = c("a","b","c") # a vector of 3 character components
> w[2] # access the 2nd component
[1] "b"
> # 1st 10 upper case letters in the alphabet
> LETTERS[seq( from = 1, to = 10 )]
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J"
> # 1st 10 lower case letters in the alphabet
> letters[seq( from = 1, to = 10 )]
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
> Q = c("Go","WSU","Cougs","!")
> Q
[1] "Go"    "WSU"   "Cougs" "!"
> # concatenate two character vectors
> c(w,Q)
[1] "a"     "b"     "c"     "Go"    "WSU"   "Cougs" "!"
Strings in R
> w = "Go cougs!"
> w
[1] "Go cougs!"
>
> v = "Data analytics"
> v
[1] "Data analytics"
>
> # concatenate two strings
> paste(w,v,sep = " ")
[1] "Go cougs! Data analytics"
Example: create a scatter plot
> x = seq(1,10,by=1) # generate vector
> y = seq(1,10,by=1) # generate vector
> title_stg = "Simple plot" # generate string
> plot(x,y,main = title_stg) # scatter plot
[Figure: scatter plot titled “Simple plot”, y against x, both axes running from 1 to 10]
Factors in R: I
> grades = c("A","F","D","C","B") # character vector
> grades
[1] "A" "F" "D" "C" "B"
> class(grades)
[1] "character"
> gradesF = factor(grades) # gradesF is now a factor
> gradesF
[1] A F D C B
Levels: A B C D F
> class(gradesF)
[1] "factor"
> # levels of the factor "gradesF"
> levels(gradesF)
[1] "A" "B" "C" "D" "F"
> # levels are ordered alphabetically
Factors in R: II
> x = c(1,3,2) # numeric vector
> b = factor(x) # change x into a factor
> b
[1] 1 3 2
Levels: 1 2 3
> levels(b) # levels are ordered from smallest to largest
[1] "1" "2" "3"
> # relabel levels of b
> d = factor(x,labels = c("3Level","1Level","2Level"))
> d
[1] 3Level 2Level 1Level
Levels: 3Level 1Level 2Level
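Both factor slides rely on R's default ordering of levels (alphabetical, or smallest to largest). When another order is wanted, it can be supplied through the levels argument. The sketch below does this for the grades example; the object name gradesO and the choice to make the factor ordered are assumptions added for illustration.

```r
grades <- c("A", "F", "D", "C", "B")

# Supply the levels explicitly, from worst to best grade
gradesO <- factor(grades, levels = c("F", "D", "C", "B", "A"), ordered = TRUE)

levels(gradesO)       # "F" "D" "C" "B" "A"
gradesO > "C"         # TRUE FALSE FALSE FALSE TRUE
as.integer(gradesO)   # position of each grade among the levels: 5 1 2 3 4
```

With ordered = TRUE, comparisons such as > follow the stated level order instead of the alphabet.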
Logic operators in R: I
> x = 0 # assign 0 to x
> x >0
[1] FALSE
> x == 0
[1] TRUE
> !x # return TRUE
[1] TRUE
> y = 1
> y >= 1
[1] TRUE
> !y # return FALSE
[1] FALSE
> x & y # "and"; return FALSE
[1] FALSE
> x | y # "or"; return TRUE
[1] TRUE
Logic operators in R: II
> x = 1
> y = -1
> x >0 & y > 0 # "and"
[1] FALSE
> x > 0 | y > 0 # "or"
[1] TRUE
> x >0 & !(y>0)
[1] TRUE
Logic operators in R: III
> x = c(1,2,3) # a 3-component vector
> x >0 # returns a 3-component logic vector
[1] TRUE TRUE TRUE
> x > 2 # returns a 3-component logic vector
[1] FALSE FALSE  TRUE
> # return indices of entries of x that are greater than 2
> which(x>2)
[1] 3
> # take the subvector of x whose entries are not smaller than 2
> x[x >=2]
[1] 2 3
Logic operators in R: IV
> x = c(1,2,3) # a 3-component vector
> y = c(-1,4,-1) # a 3-component vector
> # compare x and y entrywise; return a 3-component vector
> x > y
[1]  TRUE FALSE  TRUE
> x == y
[1] FALSE FALSE FALSE
> x >= y
[1]  TRUE FALSE  TRUE
> any(x>y)
[1] TRUE
> all(x>y)
[1] FALSE
Set operations in R: I
> x = c(1,2,3) # a 3-component vector
> 1 %in% x # check membership
[1] TRUE
> c(2,3) %in% x
[1] TRUE TRUE
> y = c("stat","115","lecture")
> "stat" %in% y
[1] TRUE
> "time" %in% y
[1] FALSE
Set operations in R: II
> x = c(1,2,3) # a 3-component vector
> y = c(-1,4,-1) # a 3-component vector
> union(x, y)
[1]  1  2  3 -1  4
> intersect(x, y)
numeric(0)
> setdiff(x, y)
[1] 1 2 3
Lists in R: I
> x = vector("list",3) # a list with 3 components
> # assign a vector to its 1st component
> x[[1]] = c(1,2,3)
> # assign a string to its 2nd component
> x[[2]] = "Second part of x"
> # assign a matrix to its 3rd component
> x[[3]] = matrix(1:6,nrow=3)
> x
[[1]]
[1] 1 2 3

[[2]]
[1] "Second part of x"

[[3]]
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
Lists in R: II
> x = vector("list",3) # a list with 3 components
> x[[1]] = c(1,2,3)
> x[[2]] = "Second part of x"
> x[[3]] = matrix(1:6,nrow=3)
> x[[2]] # show 2nd component of x
[1] "Second part of x"
Lists in R: III
> a = c(1,2,3)
> b = "Second part of x"
> c = matrix(1:6,nrow=3)
> y = list("vector" = a, "string" = b, "matrix" = c)
> y
$vector
[1] 1 2 3

$string
[1] "Second part of x"

$matrix
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
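Once the components of a list are named, they can be reached by name as well as by position. The sketch below rebuilds the list y from the slide above and contrasts single and double brackets. The particular accesses shown are chosen here for illustration.

```r
y <- list(vector = c(1, 2, 3),
          string = "Second part of x",
          matrix = matrix(1:6, nrow = 3))

names(y)               # "vector" "string" "matrix"
y$matrix[2, 1]         # entry (2,1) of the matrix component: 2

class(y["string"])     # "list": single brackets return a shorter list
class(y[["string"]])   # "character": double brackets return the component itself

lengths(y)             # number of elements in each component: 3 1 6
```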
“Coerce” in R
• as.numeric coerces an object to be numeric
• as.factor coerces an object to be a factor
• as.matrix . . .
• as.logical . . .
• as.data.frame . . .
• and so on . . .
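A few concrete calls may help make the list above tangible. This is a minimal sketch with inputs chosen for illustration; the factor case is included because it is a common surprise.

```r
as.numeric("3.14")                    # text to number: 3.14
as.logical(c(0, 1, 2))                # FALSE TRUE TRUE (zero is FALSE, nonzero is TRUE)

f <- factor(c("10", "20"))
as.numeric(f)                         # 1 2: the internal level codes, not 10 and 20
as.numeric(as.character(f))           # 10 20: go through character first

suppressWarnings(as.numeric("abc"))   # NA: text that is not a number cannot be coerced

as.matrix(data.frame(a = 1:2, b = 3:4))   # data frame to a 2-by-2 matrix
```

Without suppressWarnings, the fifth call also prints a warning that NAs were introduced by coercion.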
The length and dim commands
• length returns the number of components of a vector
> a = 1:10
> length(a)
[1] 10
• dim returns the dimension of a matrix or data frame
> x=dim(matrix(1:6,nrow=3,ncol=2))
> x
[1] 3 2
> x[1]
[1] 3
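The two commands behave differently when applied to the "other" kind of object, which the following sketch illustrates. The objects m and df are made up for this purpose.

```r
m <- matrix(1:6, nrow = 3, ncol = 2)
length(m)    # 6: total number of entries of the matrix
nrow(m)      # 3, the same as dim(m)[1]
ncol(m)      # 2, the same as dim(m)[2]

dim(1:10)    # NULL: a plain vector has no dim attribute

df <- data.frame(SN = 1:2, Age = c(21, 15))
length(df)   # 2: for a data frame, length counts columns
dim(df)      # 2 2: rows, then columns
```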
R markdown
Install R markdown
> install.packages("markdown")
> install.packages("knitr")
In RStudio, follow “Tools > Global Options > Sweave”, and set “Weave Rnw files using” as “knitr”
Create an R markdown file
In RStudio, follow “File > New File > R markdown . . . ”
More details and video tutorial at: Course website
Structure of a markdown file
• Header (that typesets the output document)
• Main body (that contains the contents)
  – R chunk (that contains R code)
  – Text chunk (that contains non-code text or LaTeX commands)
More details and video tutorial at: Course website
A sample markdown file
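A minimal sketch of such a file, written out from R so that it can be reproduced, is given below. The title, the text, the chunk contents and the file name sample.Rmd are all made up for illustration. The file is built line by line to make the three parts visible: header, text chunk and R chunk.

```r
fence <- strrep("`", 3)   # the three backticks that open and close an R chunk

rmd_lines <- c(
  "---",                                    # header: typesets the output document
  "title: \"A sample document\"",
  "output: html_document",
  "---",
  "",
  "Some text with inline math $a^2 + b^2 = c^2$.",   # text chunk
  "",
  paste0(fence, "{r chunk1, eval=TRUE}"),   # R chunk starts
  "x <- seq(0, 1, by = 0.1)",
  "mean(x)",
  fence                                     # R chunk ends
)

rmd_file <- file.path(tempdir(), "sample.Rmd")   # hypothetical file name
writeLines(rmd_lines, rmd_file)

# To render it, uncomment the next line (needs the rmarkdown package and pandoc)
# rmarkdown::render(rmd_file)
```

Opening the resulting file in RStudio and clicking Knit should have the same effect as the commented render call.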
Basic syntax: I
Online tutorial: https://rmarkdown.rstudio.com/authoring_basics.html
Online tutorial: https://bookdown.org/yihui/rmarkdown/r-code.html
Basic syntax: II
Some things to go over carefully:
• Adjust figure size in the output document when the figure is generated by an R chunk
• Enable the current R chunk to use results produced by previous R chunks
• Basic LaTeX commands
Basic syntax: III
To adjust figure size when the figure is generated by an R chunk:
• use fig.width and fig.height to set graphical device size as in
{r eval=TRUE,fig.width = 3,fig.height=4}
• use out.width and out.height to set output size as in
{r eval=TRUE,out.width = 5,out.height=6}
More details at: https://bookdown.org/yihui/rmarkdown/r-code.html
Basic syntax: IV
To enable the current R chunk to use results produced by previous R chunks:
• name a chunk as “chunk1” and cache results as in
{r chunk1,eval=TRUE,cache=TRUE}
• use dependson= to refer to “chunk1” as in
{r chunk2,dependson="chunk1",eval=TRUE,cache=TRUE}
More details at: https://yihui.name/knitr/options/
Latex in markdown
• To include LaTeX packages, add - \usepackage{package_name} in the header, such as:
header-includes:
- \usepackage{bbm}
- \usepackage{amssymb}
- \usepackage{amsmath}
- \usepackage{graphicx,float}
• For LaTeX commands, please use a quick reference: https://wch.github.io/latexsheet/
• Caution: not all LaTeX commands work in markdown
Copyright and session information
License and session Information
License: Instructor owns the copyright

> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17134)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base

other attached packages:
[1] ggplot2_3.1.0 class_7.3-15 glmnet_2.0-16 foreach_1.4.4
[5] Matrix_1.2-14 ISLR_1.2 knitr_1.21

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0 highr_0.7 compiler_3.5.0
 [4] pillar_1.3.1 plyr_1.8.4 bindr_0.1.1
 [7] iterators_1.0.10 tools_3.5.0 digest_0.6.18
[10] evaluate_0.12 tibble_1.4.2 gtable_0.2.0
[13] lattice_0.20-35 pkgconfig_2.0.2 png_0.1-7
[16] rlang_0.3.0.1 rstudioapi_0.8 yaml_2.2.0
[19] xfun_0.4 bindrcpp_0.2.2 withr_2.1.2
[22] stringr_1.3.1 dplyr_0.7.8 tidyselect_0.2.5
[25] grid_3.5.0 glue_1.3.0 R6_2.3.0
[28] rmarkdown_1.11 purrr_0.2.5 magrittr_1.5
[31] scales_1.0.0 codetools_0.2-15 htmltools_0.3.6
[34] assertthat_0.2.0 colorspace_1.3-2 labeling_0.3
[37] stringi_1.2.4 lazyeval_0.2.1 munsell_0.5.0
[40] crayon_1.3.4