
Stat 115 Lecture Notes 1 - Washington State University · math.wsu.edu/faculty/xchen/stat115/LectureNotes1_notes.pdf · 2019-01-07

Stat 115 Lecture Notes 1

Xiongzhi Chen

Washington State University

Contents

An overview of big data
  Deluge of data
  Velocity and volume of data
  Variety of data
  Variety of data: discussion
  Some features of modern data
  Illustration of “big”
  Illustration of “Heterogeneous”
  Illustration of “Complicated”
  Features of modern data: discussion
  Nature, data and us
  Some challenges of modern data

Illustrating examples of data analytics
  Example 1: clustering
  Example 2: predictive modeling
  Example 3: dimension reduction
  Example 4: classification
  Example 5: community detection
  Example 6: survival rates

An overview of data analytics
  Different philosophies towards data analytics
  A workflow for data analysis
  Discussion on the workflow
  The landscape of big data
  Data analytics: a joint endeavor
  Software tools for data analytics
  On course contents

Installation of R, RStudio and R packages
  Download R and RStudio
  Installations
  RStudio: a snapshot
  RStudio


Objects in R: I
  Scalars in R
  Vectors in R: I
  Vectors in R: II
  The seq command
  Matrices in R: I
  Matrices in R: II
  Matrices in R: III
  Matrices in R: IV
  Data frames in R: I
  Data frames in R: II
  Data frames in R: III
  Data frames in R: IV

Objects in R: II
  Character vectors in R
  Strings in R
  Example: create a scatter plot
  Factors in R: I
  Factors in R: II
  Logic operators in R: I
  Logic operators in R: II
  Logic operators in R: III
  Logic operators in R: IV
  Set operations in R: I
  Set operations in R: II
  Lists in R: I
  Lists in R: II
  Lists in R: III
  “Coerce” in R
  The length and dim commands

R markdown
  Install R markdown
  Create a R markdown file
  Structure of a markdown file
  A sample markdown file
  Basic syntax: I
  Basic syntax: II
  Basic syntax: III
  Basic syntax: IV
  Latex in markdown

Copyright and session information
  License and session Information


An overview of big data

Deluge of data

[Figure: the three V's of big data: Variety, Velocity, Volume]

Velocity and volume of data

The use of data today is transforming the way we live, work, and play. Businesses in industries around the world are using data to transform themselves to become more agile, improve customer experience, introduce new business models, and develop new sources of competitive advantage. Consumers are living in an increasingly digital world, depending on online and mobile channels to connect with friends and family, access goods and services, and run nearly every aspect of their lives, even while asleep.

Much of today’s economy relies on data, and this reliance will only increase in the future as companies capture, catalog, and cash in on data in every step of their supply chain; enterprises collect vast sums of customer data to provide greater levels of personalization; and consumers integrate social media, entertainment, cloud storage, and real-time personalized services into their streams of life.

The consequence of this increasing reliance on data will be a never-ending expansion in the size of the Global Datasphere. Estimated to be 33 ZB in 2018, IDC forecasts the Global Datasphere to grow to 175 ZB by 2025 (Figure 1). See Appendix for methodology and data/device categories.

Global Datasphere Expansion is Never-ending

Figure 1 – Annual Size of the Global Datasphere [chart not reproduced: zettabytes (axis from 0 to 180) by year, 2010 to 2025, reaching 175 ZB in 2025]

“MRI image creation is driving storage requirements significantly. The trend is more images with thinner slices and 3D capability. We've gone from 2,000 images to over 20,000 for an MRI of a human head, and stronger magnets and higher resolution pictures means more data stored.” – Senior Director in IT, Major Healthcare Provider

Source: Data Age 2025, sponsored by Seagate with data from IDC Global DataSphere, Nov 2018. IDC White Paper, Doc# US44413318, November 2018, The Digitization of the World – From Edge to Core, Chapter 1 (Characterizing the Global Datasphere), p. 6.
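The two figures quoted above (33 ZB in 2018, 175 ZB forecast for 2025) imply an average yearly growth rate. The following is a minimal R sketch of that arithmetic; the assumption of a constant yearly growth factor is ours, not IDC's, and the variable names are made up for illustration.

```r
# Figures quoted above: 33 ZB in 2018, forecast 175 ZB in 2025.
size_2018 <- 33    # zettabytes
size_2025 <- 175   # zettabytes
years <- 2025 - 2018

# Assumed model: a constant yearly growth factor g with
# size_2018 * g^years = size_2025
growth_factor <- (size_2025 / size_2018)^(1 / years)
annual_rate <- growth_factor - 1

round(100 * annual_rate, 1)  # percent growth per year, roughly 27
```

Under this simplification the datasphere would grow by a bit more than a quarter every year.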


Variety of data

1. Sensors deployed in the physical world (e.g., sequencers in biology, scanners in medicine, wind and humidity sensors in agriculture and weather systems, engine performance sensors in automobiles and aircraft)

2. Sensors deployed in the virtual world (e.g., visits to a webpage on the internet, user interactions on Facebook, search keywords and their frequencies on Google, Amazon transaction records and user profiles, Netflix movie preferences)

3. The rest

Variety of data: discussion

• Are you aware of other sources that generate data (different from those given in the previous slide)?

• Can you give other examples of data sources within one of the categories given in the previous slide?

Some features of modern data

• Big: requiring large storage space; containing measurements for many variables

• Heterogeneous: uncertainties associated with measurements are different for different parts of data; measurements may have been obtained from different technologies; measurements have different numerical, algebraic or topological properties

• Complicated: measurements may not be generated from designed experiments; certain mathematical and statistical operations cannot be applied to measurements

Illustration of “big”

• Brightness of stars in the Milky Way galaxy
• Visits to (or transactions on) Amazon products
• Neuroimaging (e.g., magnetic resonance imaging, diffuse optical imaging)
• Mobile device data (e.g., iPhone, Apple Watch)
• Social interactions (e.g., Facebook, Twitter)
• U.S. weather system (e.g., temperature, wind direction and velocity)

Illustration of “Heterogeneous”

• Data on transmission of a contagious disease between humans
• Data on Amazon product query and purchase and user profiles
• Data on the prices of a collection of stocks on the NYSE
• Data on the weather system in the U.S.
• Data on search queries via Google


Illustration of “Complicated”

• Data that do not have a global mathematical structure:
  – Social or biological interaction networks
  – Images (in medical studies, agriculture, pattern recognition)
  – Branches (of wings of flies, leaves of trees, or descendants)

• Data that are not generated from designed experiments:
  – Social network or social media data (e.g., Facebook and Twitter)
  – Personal mobile data (e.g., recorded by i-devices)
  – Large-scale ecology or environment data

Features of modern data: discussion

What features of modern data are there other than “big”, “heterogeneous” and “complicated”?

Nature, data and us

The main theme:

• Nature generates Data, and we are part of Nature

• We also generate Data, and we learn from Data

• We respect and maintain harmony with NATURE

Some challenges of modern data

• Data acquisition (issues with obtaining observations)

• Data storage, distribution and access (issues with compressing/decompressing data, data center infrastructure, and database management)

• Data analysis (how to learn from data)

• Data visualization

Illustrating examples of data analytics

Example 1: clustering

Cancer data:

• Cancer type of each patient

• Measurements on how active the same sets of genes are in each patient

• Target: is gene activity suggestive of a cancer type?
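The question in the target above can be explored by clustering patients on gene activity and then comparing the clusters with the known cancer types. The sketch below does this with hierarchical clustering on simulated data; the data set, its dimensions, the shift applied to type "B" and the choice of hierarchical clustering are all assumptions made for illustration, not the data or method used in the lecture.

```r
set.seed(115)  # arbitrary seed so the sketch is reproducible

# Hypothetical data: 20 patients x 50 genes, two cancer types.
# Type "B" patients have the first 10 genes shifted upward.
n_per_type <- 10
n_genes <- 50
activity <- matrix(rnorm(2 * n_per_type * n_genes), nrow = 2 * n_per_type)
cancer_type <- rep(c("A", "B"), each = n_per_type)
activity[cancer_type == "B", 1:10] <- activity[cancer_type == "B", 1:10] + 6

# Hierarchical clustering on Euclidean distances between patients
tree <- hclust(dist(activity), method = "complete")
cluster <- cutree(tree, k = 2)  # cut the tree into 2 groups

# Cross-tabulate discovered clusters against the known cancer types
table(cluster, cancer_type)
```

If gene activity is suggestive of cancer type, most patients of one type end up in the same cluster, which shows up as a near-diagonal table.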


Example 1: clustering

Example 2: predictive modeling

Baseball players data:

• Salary of player• Hits, Runs, Years, League, Division, etc (a total of 19 features)• Target: which features of a player are able to predict his salary?


Example 2: predictive modeling

[Figure: scatterplot matrix of Hits, CRuns, CRBI and Salary]

Example 2: predictive modeling

A predictive model:

4 x 1 sparse Matrix of class "dgCMatrix"
                      1
(Intercept) -46.9002500
Hits          3.2192174
CRuns         0.2097842
CRBI          0.4840020

average Salary = -46.90 + 3.22 Hits + 0.21 CRuns + 0.48 CRBI
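To see how the fitted equation is used, the sketch below plugs one player's features into it. The coefficients are copied from the output above; the player and the feature values are made up for illustration.

```r
# Coefficients as printed in the model output above
coefs <- c("(Intercept)" = -46.9002500,
           Hits = 3.2192174,
           CRuns = 0.2097842,
           CRBI = 0.4840020)

# Hypothetical player (made-up values, for illustration only)
player <- c(Hits = 100, CRuns = 300, CRBI = 250)

# Prediction = intercept + sum of (coefficient * feature value)
predicted_salary <- coefs[["(Intercept)"]] + sum(coefs[names(player)] * player)
predicted_salary
```

Each extra hit raises the predicted average salary by about 3.22 units, holding CRuns and CRBI fixed.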

Example 3: dimension reduction

US arrests data:

• Arrests per 100,000 residents in each of the 50 US states

• Three types of crimes: assault, murder, rape

• UrbanPop (percent of population in each state living in urban areas)
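A data set with these four variables ships with base R under the name USArrests. The sketch below runs a principal component analysis on it with prcomp; whether the lecture's figure was produced exactly this way (in particular, with standardized variables) is an assumption.

```r
# USArrests ships with base R (package "datasets"):
# columns Murder, Assault, UrbanPop, Rape; one row per state.
data(USArrests)

# Standardize each variable, since they are on very different scales
pca <- prcomp(USArrests, scale. = TRUE)

summary(pca)          # proportion of variance explained by each component
head(pca$x[, 1:2])    # scores of the first few states on PC1 and PC2

# States plotted on PC1 and PC2, with arrows for the four variables
biplot(pca, scale = 0)
```

The first two components summarize the four variables in a two-dimensional picture, which is the sense in which the dimension is reduced.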

Example 3: dimension reduction

[Figure: biplot of the 50 US states on the first two principal components (PC1 and PC2), with loading arrows for Murder, Assault, UrbanPop and Rape]

Example 4: classification

Iris data:

• Three species: “s” (setosa), “c” (versicolor), “v” (virginica)

• Measurements: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width

• Classify an iris into one of the 3 species
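One common way to carry out such a classification is k-nearest neighbours, which the "KnnClass" legend of the accompanying figure suggests. The sketch below is one possible version using the class package (a recommended package included with standard R installations); the train/test split, the two sepal features and k = 5 are assumptions chosen for illustration.

```r
library(class)  # provides knn()

set.seed(115)   # arbitrary seed so the split is reproducible
data(iris)

# Hold out 50 of the 150 flowers for testing
test_idx <- sample(nrow(iris), 50)
features <- c("Sepal.Length", "Sepal.Width")
train_x <- iris[-test_idx, features]
test_x <- iris[test_idx, features]
train_y <- iris$Species[-test_idx]

# Each test flower gets the majority species among its k nearest training flowers
pred <- knn(train = train_x, test = test_x, cl = train_y, k = 5)

# Confusion table and accuracy on the held-out flowers
table(predicted = pred, actual = iris$Species[test_idx])
mean(pred == iris$Species[test_idx])
```

Using all four measurements instead of the two sepal ones only requires changing the features vector.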


Example 4: classification

[Figure: Sepal.Width versus Sepal.Length, with points marked by KnnClass (c, s, v)]

Example 5: community detection

Karate club data:

• 34 members of the club

• Edges in this network represent friendship, which is defined by consistent interaction outside the normal activities of the club (karate classes and club meetings)

• Target: detect hidden communities among the members


Example 5: community detection

[Figure: network of the 34 club members, with nodes labeled 1 to 34]

Example 6: survival rates

Survival times:

• Patients suffering from Ovarian Cancer

• Patients suffering from Breast Cancer

• Target: are the survival probabilities for these two types of cancers different?


Example 6: survival rates

[Figure: survival probability (0 to 1) versus Time (0 to 8000) for two strata, admin.disease_code=brca and admin.disease_code=ov]

An overview of data analytics

Different philosophies towards data analytics

• Hypothesis-driven vs data-driven

• Predictive modeling vs Inferential modeling

• Supervised learning vs Unsupervised learning

A workflow for data analysis

Discussion on the workflow

• What is the purpose of a step?

• How does a step affect the ones that follow it?

• What tools are usually needed for a step?


Figure 1: The data cycle, from “Doing Data Science: Straight talk from the frontline” by Rachel Schutt and Cathy O’Neil (O’Reilly)

The landscape of big data

[Figure: the data science ecosystem, captioned “A Complex Ecosystem!”]


Data analytics: a joint endeavor

• Computer science

• Mathematics

• Statistics

• Domain knowledge

Software tools for data analytics

Source of figure

On course contents

The course will focus on data analytics via statistical and machine learning methods

• The R programming environment

• Exploratory data analysis

• Supervised learning such as predictive modeling and classification

• Unsupervised learning such as clustering and dimension reduction

• Introduction to analysis of non-Euclidean data

Installation of R, RStudio and R packages

Download R and RStudio

• RStudio free version at: https://www.rstudio.com/products/rstudio/download/

• R at: https://www.r-project.org/

Installations

• install RStudio and R

• install R packages “tidyverse”, “ggplot2” and “markdown” by:

install.packages("package_name")

• install additional R packages also by

install.packages("package_name")
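Since install.packages accepts a vector of package names, the packages named above can be handled in one call. The sketch below first checks which of them are missing; the variable names are made up, and the install line is commented out because it needs an internet connection.

```r
# Packages named in the notes
wanted <- c("tidyverse", "ggplot2", "markdown")

# Which of them are not installed yet?
missing_pkgs <- setdiff(wanted, rownames(installed.packages()))
missing_pkgs

# Install only what is missing (needs an internet connection);
# uncomment to run.
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```

After installation, a package is loaded into a session with library(package_name).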

RStudio: a snapshot

Rstudio

• Upper Left panel: R scripts, R markdown file, R project file, View data, etc

• Lower Left panel: R console, R markdown log, etc

• Upper Right panel: R workspace, History, etc

• Lower Right panel: Files in working directory, Plots, Help, etc

Objects in R: I

Scalars in R


> x = 3 # assign value 3 to variable x
> y = 2
> x+y # addition
[1] 5
> x*y # multiplication
[1] 6
> x/y # division
[1] 1.5
> x%%y # modulo
[1] 1
> x^y # exponentiation
[1] 9
> x/0
[1] Inf
> 0/0 # undefined
[1] NaN

Vectors in R: I

> z = c(1,2,3) # a vector of 3 components
> v = c(5,6,7)
> z+v # vector addition
[1] 6 8 10
> z*v # paired componentwise product
[1] 5 12 21
> z/v # paired componentwise division
[1] 0.2000000 0.3333333 0.4285714
> z%*%v # inner product
     [,1]
[1,]   38
> 2*z # scalar-vector multiplication
[1] 2 4 6

Vectors in R: II

> z = c(1,2,3)
> v = c(5,6,7)
> z[1] # access the 1st component of z
[1] 1
> t(v) # transpose of vector
     [,1] [,2] [,3]
[1,]    5    6    7
> z%*%t(v) # outer product
     [,1] [,2] [,3]
[1,]    5    6    7
[2,]   10   12   14
[3,]   15   18   21


The seq command

> seq(from = 1, to = 1, by = ((to - from)/(length.out - 1)),
+     length.out = NULL, along.with = NULL, ...)

Usage:
> seq(0,1,by=0.1)
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
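The other arguments in the signature above can be illustrated in the same way. The following sketch shows length.out and along.with together with two common shorthands; the vector w is made up for illustration.

```r
seq(0, 1, length.out = 5)   # 5 equally spaced points: 0.00 0.25 0.50 0.75 1.00
seq(10, 1, by = -3)         # decreasing sequence: 10 7 4 1

w <- c("a", "b", "c")
seq(along.with = w)         # one index per component of w: 1 2 3
seq_along(w)                # shorthand for the line above
1:5                         # shorthand for seq(1, 5, by = 1)
```

Specifying length.out instead of by is convenient when the number of points matters more than the step size.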

Matrices in R: I

> matrix(1:6,nrow=2,ncol=3) # a 2-by-3 matrix
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> x = c(1,3,5) # a 3-component vector
> y = c(2,4,6) # a 3-component vector
> # stack x and y as 2 rows to obtain a 2-by-3 matrix
> rbind(x,y)
  [,1] [,2] [,3]
x    1    3    5
y    2    4    6
> # stack x and y as 2 columns to obtain a 3-by-2 matrix
> cbind(x,y)
     x y
[1,] 1 2
[2,] 3 4
[3,] 5 6

Matrices in R: II

> x=matrix(1:6,nrow=2,ncol=3) # a 2-by-3 matrix
> x
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> x[,1] # 1st column of x
[1] 1 2
> x[2,] # 2nd row of x
[1] 2 4 6
> x[1,2] # (1,2)-entry of x
[1] 3
> t(x) # transpose of x
     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6


Matrices in R: III

> x=matrix(1:6,nrow=2,ncol=3) # a 2-by-3 matrix
> x
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> y = rbind(c(0,1,0),c(1,1,1))
> y
     [,1] [,2] [,3]
[1,]    0    1    0
[2,]    1    1    1
> x %*%t(y) # matrix product
     [,1] [,2]
[1,]    3    9
[2,]    4   12
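The operator %*% used above should not be confused with *, which multiplies matrices entry by entry. The sketch below contrasts the two on the same x and y; the use of try to capture the error is our addition for illustration.

```r
x <- matrix(1:6, nrow = 2, ncol = 3)
y <- rbind(c(0, 1, 0), c(1, 1, 1))

x * y          # entrywise product: same shape as x and y (2-by-3)
x %*% t(y)     # matrix product: (2-by-3) times (3-by-2) gives 2-by-2

# x %*% y is an error because the inner dimensions (3 and 2) do not match
result <- try(x %*% y, silent = TRUE)
inherits(result, "try-error")   # TRUE
```

The matrix product requires the number of columns of the left factor to equal the number of rows of the right factor, which is why y is transposed first.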

Matrices in R: IV

> x=matrix(1:6,nrow=2,ncol=3) # a 2-by-3 matrix
> x
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> y = rbind(c(0,1,0),c(1,1,1))
> y
     [,1] [,2] [,3]
[1,]    0    1    0
[2,]    1    1    1
> x + y # matrix addition
     [,1] [,2] [,3]
[1,]    1    4    5
[2,]    3    5    7
> 2*x # scalar multiplication
     [,1] [,2] [,3]
[1,]    2    6   10
[2,]    4    8   12

Data frames in R: I

> x <- data.frame("SN" = 1:2, "Age" = c(21,15),
+                 "Name" = c("John","Dora"))
> x
  SN Age Name
1  1  21 John
2  2  15 Dora
> x$SN # access SN
[1] 1 2
> x[,1] # access SN
[1] 1 2
> class(x$SN) # check object type for SN
[1] "integer"
> class(x$Name) # check object type for Name
[1] "factor"
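The "factor" result above depends on the R version: before R 4.0.0, data.frame converted character columns to factors by default, while from R 4.0.0 on they stay character. The sketch below sets the stringsAsFactors argument explicitly so that the result does not depend on the version; the object names are made up for illustration.

```r
people_chr <- data.frame(SN = 1:2, Age = c(21, 15),
                         Name = c("John", "Dora"),
                         stringsAsFactors = FALSE)
people_fct <- data.frame(SN = 1:2, Age = c(21, 15),
                         Name = c("John", "Dora"),
                         stringsAsFactors = TRUE)

class(people_chr$Name)  # "character"
class(people_fct$Name)  # "factor"
```

Stating the argument explicitly makes scripts behave the same on old and new installations.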

Data frames in R: II

> x <- data.frame("SN" = 1:2, "Age" = c(21,15),
+                 "Name" = c("John","Dora"))
> x
  SN Age Name
1  1  21 John
2  2  15 Dora
> x$SN[2] # access the 2nd entry of SN
[1] 2
> x[1,2] # access the 1st entry of Age
[1] 21

Caution: do not transpose a data.frame when it contains different types of objects
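The reason for this caution is that t() converts the data frame to a matrix, and a matrix can hold only one type, so numbers are turned into strings. A minimal sketch on the same data frame:

```r
x <- data.frame(SN = 1:2, Age = c(21, 15), Name = c("John", "Dora"))

tx <- t(x)       # t() first converts the data frame to a matrix
tx
class(tx[1, 1])  # "character": the numbers were turned into strings

# Transposing only the numeric columns keeps them numeric
t(x[, c("SN", "Age")])
```

Arithmetic on tx would fail, since its entries are now text.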

Data frames in R: III

Import (malaria-related death) data as data.frame:
> Y = read.csv("dataMalyria.csv",header = TRUE,sep=",",
+              colClasses=c("country"=NA,"percent"="numeric",
+                           "labels"=NA))
> head(Y)
     country percent labels
1    Lesotho       0    <1%
2  Mauritius       0    <1%
3 Seychelles       0    <1%
4 Cabo Verde       0    <1%
5    Algeria       0    <1%
6      Egypt       0    <1%
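Since the course data file is not included here, the same import can be tried on a small stand-in file. The sketch below writes three of the rows shown above to a temporary CSV file and reads it back; the file, its path and the object name Y_small are made up for illustration.

```r
# Hypothetical three-row file, written to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c("country,percent,labels",
             "Lesotho,0,<1%",
             "Mauritius,0,<1%",
             "Seychelles,0,<1%"),
           csv_path)

# Force "percent" to numeric; let R choose the type of the other columns
Y_small <- read.csv(csv_path, header = TRUE, sep = ",",
                    colClasses = c("percent" = "numeric"))
str(Y_small)
```

Columns that are not named in colClasses are treated as NA, meaning that R picks their type automatically.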

Data frames in R: IV

Import (malaria-related death) data as data.frame:
> str(Y) # object structure of Y
'data.frame': 53 obs. of 3 variables:
 $ country: Factor w/ 53 levels "Algeria","Angola",..: 25 32 41 7 1 15 27 33 50 47 ...
 $ percent: num 0 0 0 0 0 0 0 0 0 0 ...
 $ labels : Factor w/ 5 levels " <1% "," 1-4% ",..: 1 1 1 1 1 1 1 1 1 1 ...

> dim(Y) # dimension of Y
[1] 53 3
> Y$id = 1:53 # append a column to Y
> Y[1:3,] # display the first 3 rows of Y
     country percent labels id
1    Lesotho       0    <1%  1
2  Mauritius       0    <1%  2
3 Seychelles       0    <1%  3

Objects in R: II

Character vectors in R

> w = c("a","b","c") # a vector of 3 character components
> w[2] # access the 2nd component
[1] "b"
> # 1st 10 upper case letters in the alphabet
> LETTERS[seq( from = 1, to = 10 )]
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J"

> # 1st 10 lower case letters in the alphabet
> letters[seq( from = 1, to = 10 )]
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"

> Q = c("Go","WSU","Cougs","!")
> Q
[1] "Go" "WSU" "Cougs" "!"
> # concatenate two character vectors
> c(w,Q)
[1] "a" "b" "c" "Go" "WSU" "Cougs" "!"

Strings in R

> w = "Go cougs!"
> w
[1] "Go cougs!"
>
> v = "Data analytics"
> v
[1] "Data analytics"
>
> # concatenate two strings
> paste(w,v,sep = " ")
[1] "Go cougs! Data analytics"

Example: create a scatter plot

> x = seq(1,10,by=1) # generate vector
> y = seq(1,10,by=1) # generate vector
> title_stg = "Simple plot" # generate string
> plot(x,y,main = title_stg) # scatter plot


[Figure: scatter plot titled “Simple plot”, y versus x, both axes running from 1 to 10]

Factors in R: I

> grades = c("A","F","D","C","B") # character vector
> grades
[1] "A" "F" "D" "C" "B"
> class(grades)
[1] "character"
> gradesF = factor(grades) # gradesF is now a factor
> gradesF
[1] A F D C B
Levels: A B C D F
> class(gradesF)
[1] "factor"
> # levels of the factor "gradesF"
> levels(gradesF)
[1] "A" "B" "C" "D" "F"
> # levels are ordered alphabetically
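Alphabetical order is only the default. When the levels have a natural order, as grades do, it can be stated explicitly. The sketch below builds an ordered factor from the same grades; the name gradesO and the chosen ordering (worst to best) are ours.

```r
grades <- c("A", "F", "D", "C", "B")

# Spell out the levels from worst to best instead of relying on alphabetical order
gradesO <- factor(grades, levels = c("F", "D", "C", "B", "A"), ordered = TRUE)

levels(gradesO)        # "F" "D" "C" "B" "A"
gradesO >= "C"         # TRUE FALSE FALSE TRUE TRUE
as.integer(gradesO)    # position of each grade among the levels: 5 1 2 3 4
```

Comparisons such as >= are only meaningful for ordered factors; on an unordered factor they return NA with a warning.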


Factors in R: II

> x = c(1,3,2) # numeric vector
> b = factor(x) # change x into a factor
> b
[1] 1 3 2
Levels: 1 2 3
> levels(b) # levels are ordered from smallest to largest
[1] "1" "2" "3"
> # relabel levels of b
> d = factor(x,labels = c("3Level","1Level","2Level"))
> d
[1] 3Level 2Level 1Level
Levels: 3Level 1Level 2Level

Logic operators in R: I

> x = 0 # assign 0 to x
> x >0
[1] FALSE
> x == 0
[1] TRUE
> !x # return TRUE
[1] TRUE
> y = 1
> y >= 1
[1] TRUE
> !y # return FALSE
[1] FALSE
> x & y # "and"; return FALSE
[1] FALSE
> x | y # "or"; return TRUE
[1] TRUE

Logic operators in R: II

> x = 1
> y = -1
> x >0 & y > 0 # "and"
[1] FALSE
> x > 0 | y > 0 # "or"
[1] TRUE
> x >0 & !(y>0)
[1] TRUE


Logic operators in R: III

> x = c(1,2,3) # a 3-component vector
> x >0 # returns a 3-component logic vector
[1] TRUE TRUE TRUE
> x > 2 # returns a 3-component logic vector
[1] FALSE FALSE TRUE
> # return indices of entries of x that are greater than 2
> which(x>2)
[1] 3
> # take the subvector of x whose entries are not smaller than 2
> x[x >=2]
[1] 2 3

Logic operators in R: IV

> x = c(1,2,3) # a 3-component vector
> y = c(-1,4,-1) # a 3-component vector
> # compare x and y entrywise; return a 3-component vector
> x > y
[1] TRUE FALSE TRUE
> x == y
[1] FALSE FALSE FALSE
> x >= y
[1] TRUE FALSE TRUE
> any(x>y)
[1] TRUE
> all(x>y)
[1] FALSE

Set operations in R: I

> x = c(1,2,3) # a 3-component vector
> 1 %in% x # check membership
[1] TRUE
> c(2,3) %in% x
[1] TRUE TRUE
> y = c("stat","115","lecture")
> "stat" %in% y
[1] TRUE
> "time" %in% y
[1] FALSE

Set operations in R: II

> x = c(1,2,3) # a 3-component vector
> y = c(-1,4,-1) # a 3-component vector
> union(x, y)
[1] 1 2 3 -1 4
> intersect(x, y)
numeric(0)
> setdiff(x, y)
[1] 1 2 3
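In the example above `x` and `y` share no entries, so the intersection is empty and the set difference returns all of `x`. The sketch below adds a hypothetical vector `z` that overlaps with `x`, to show non-trivial results; `z`, `u`, `common`, `only_x`, and `only_z` are names introduced only for this illustration.

```r
# Sketch: set operations on vectors that overlap.
x = c(1,2,3)
y = c(-1,4,-1)
z = c(3,4,5)                 # hypothetical vector sharing the entry 3 with x

u = union(x, y)              # 1 2 3 -1 4; the repeated -1 appears once
common = intersect(x, z)     # 3
only_x = setdiff(x, z)       # 1 2; entries of x that are not in z
only_z = setdiff(z, x)       # 4 5; setdiff is not symmetric

unique(y)                    # -1 4; drops repeated entries of one vector
```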

Lists in R: I

> x = vector("list",3) # a list with 3 components
> # assign a vector to its 1st component
> x[[1]] = c(1,2,3)
> # assign a string to its 2nd component
> x[[2]] = "Second part of x"
> # assign a matrix to its 3rd component
> x[[3]] = matrix(1:6,nrow=3)
> x
[[1]]
[1] 1 2 3

[[2]]
[1] "Second part of x"

[[3]]
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

Lists in R: II

> x = vector("list",3) # a list with 3 components
> x[[1]] = c(1,2,3)
> x[[2]] = "Second part of x"
> x[[3]] = matrix(1:6,nrow=3)
> x[[2]] # show 2nd component of x
[1] "Second part of x"

Lists in R: III

> a = c(1,2,3)
> b = "Second part of x"
> c = matrix(1:6,nrow=3)
> y = list("vector" = a, "string" = b, "matrix" = c)
> y
$vector
[1] 1 2 3

$string
[1] "Second part of x"

$matrix
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
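Components of a named list can be reached by name as well as by position. A minimal sketch follows; it names the matrix `m` rather than `c` (an assumption made here to avoid reusing the name of the function `c`), and the component `extra` is hypothetical.

```r
# Sketch: access components of a named list.
a = c(1,2,3)
b = "Second part of x"
m = matrix(1:6, nrow=3)
y = list("vector" = a, "string" = b, "matrix" = m)

names(y)                 # "vector" "string" "matrix"
length(y)                # 3 components

# [[ ]] and $ return the component itself
entry = y[["matrix"]][2,1]   # row 2, column 1 of the matrix: 2
class(y$string)              # "character"

# [ ] returns a sub-list, not the component
s = y["string"]
class(s)                     # "list"

# assigning to a new name adds a component
y$extra = TRUE               # hypothetical 4th component
length(y)                    # 4
```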

“Coerce” in R

• as.numeric coerces an object to be numeric
• as.factor coerces an object to be a factor
• as.matrix . . .
• as.logical . . .
• as.data.frame . . .
• and so on . . .
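The coercion functions listed above can be tried on small inputs. The sketch below is illustrative only; the inputs and the names `f`, `codes`, `vals`, `m`, and `d` are made up here. It also shows a common pitfall: applying `as.numeric` to a factor returns the level codes, not the original numbers.

```r
# Sketch: coercion between basic types.
as.numeric("3.14")            # 3.14
as.numeric(c(TRUE, FALSE))    # 1 0
as.logical(c(0, 1, 2))        # FALSE TRUE TRUE; nonzero numbers become TRUE

f = as.factor(c("10","30","20"))
levels(f)                     # "10" "20" "30"

# pitfall: this returns the positions among the levels, not the values
codes = as.numeric(f)                   # 1 3 2
# go through character first to recover the values
vals = as.numeric(as.character(f))      # 10 30 20

m = as.matrix(data.frame(u = 1:2, v = 3:4))   # data frame to matrix
d = as.data.frame(matrix(1:6, nrow = 3))      # matrix to data frame
dim(m)                        # 2 2
dim(d)                        # 3 2
```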

The length and dim commands

• length returns the number of components of a vector
> a = 1:10
> length(a)
[1] 10

• dim returns the dimensions of a matrix or data frame
> x=dim(matrix(1:6,nrow=3,ncol=2))
> x
[1] 3 2
> x[1]
[1] 3
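`length` and `dim` behave differently on vectors, matrices, and data frames. A minimal sketch, with a made-up data frame `d` (its columns `id` and `score` are hypothetical):

```r
# Sketch: length and dim on different kinds of objects.
m = matrix(1:6, nrow=3, ncol=2)
length(m)          # 6: total number of entries of the matrix
dim(m)             # 3 2
nrow(m)            # 3, same as dim(m)[1]
ncol(m)            # 2, same as dim(m)[2]

v = 1:10
dim(v)             # NULL: a plain vector has no dim attribute

d = data.frame(id = 1:4, score = c(90,85,70,60))   # hypothetical data
dim(d)             # 4 2: rows, then columns
length(d)          # 2: for a data frame, length is the number of columns
```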

R markdown

Install R markdown

> install.packages("markdown")
> install.packages("knitr")

In RStudio, follow “Tools > Global Options > Sweave”, and set “Weave Rnw files using” to “knitr”

Create an R markdown file

In RStudio, follow “File > New File > R markdown . . . ”


More details and video tutorial at: Course website

Structure of a markdown file

• Header (that typesets the output document)
• Main body (that contains the contents)
  – R chunk (that contains R code)
  – Text chunk (that contains non-code text or LaTeX commands)

More details and video tutorial at: Course website

A sample markdown file
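As a rough illustration of the header, text chunk, and R chunk structure, the sketch below writes a tiny R markdown file from R. It is not the instructor's sample: the title, the file name `sample.Rmd`, and the chunk contents are all assumptions made for this example.

```r
# Sketch: write a minimal R markdown file to a temporary folder.
fence = strrep("`", 3)        # three backticks delimit an R chunk

rmd = c(
  "---",                                   # header starts
  'title: "A sample document"',            # hypothetical title
  "output: html_document",
  "---",                                   # header ends
  "",
  "This is a text chunk.",                 # text chunk
  "",
  paste0(fence, "{r chunk1, eval=TRUE}"),  # R chunk starts
  "x = c(1,2,3)",
  "mean(x)",
  fence                                    # R chunk ends
)

path = file.path(tempdir(), "sample.Rmd")
writeLines(rmd, path)

# To produce the output document, assuming the rmarkdown package is installed:
# rmarkdown::render(path)
```

Opening the written file in RStudio and clicking “Knit” should give the same result as the commented `render` call.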


Basic syntax: I

Online tutorial: https://rmarkdown.rstudio.com/authoring_basics.html

Online tutorial: https://bookdown.org/yihui/rmarkdown/r-code.html

Basic syntax: II

Some things to go over carefully:

• Adjust the figure size in the output document when a figure is generated by an R chunk

• Enable the current R chunk to use results produced by previous R chunks

• Basic LaTeX commands

Basic syntax: III

To adjust the figure size when a figure is generated by an R chunk:

• use fig.width and fig.height to set graphical device size as in

{r eval=TRUE,fig.width = 3,fig.height=4}

• use out.width and out.height to set output size as in

{r eval=TRUE,out.width = 5,out.height=6}

More details at: https://bookdown.org/yihui/rmarkdown/r-code.html
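The options above apply to one chunk at a time. As an alternative not covered in the notes, knitr also lets default chunk options be set once with `knitr::opts_chunk$set`, typically in a setup chunk near the top of the file. A minimal sketch, assuming the knitr package is installed; the values 3 and 4 simply mirror the example above.

```r
# Sketch: set default figure sizes for all later chunks.
# The guard skips the calls when knitr is not installed.
if (requireNamespace("knitr", quietly = TRUE)) {
  knitr::opts_chunk$set(fig.width = 3, fig.height = 4)

  # read the defaults back to confirm that they were stored
  knitr::opts_chunk$get("fig.width")    # 3
  knitr::opts_chunk$get("fig.height")   # 4
}
```

Options written in an individual chunk header still override these defaults for that chunk.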

Basic syntax: IV

To enable the current R chunk to use results produced by previous R chunks:

• name a chunk as “chunk1” and cache results as in

{r chunk1,eval=TRUE,cache=TRUE}

• use dependson= to refer to “chunk1” as in

{r chunk2,dependson="chunk1",eval=TRUE,cache=TRUE}

More details at: https://yihui.name/knitr/options/

Latex in markdown

• To include LaTeX packages, add a line of the form - \usepackage{package_name} under header-includes: in the header, such as:

header-includes:
- \usepackage{bbm}
- \usepackage{amssymb}
- \usepackage{amsmath}
- \usepackage{graphicx,float}

• For LaTeX commands, please use a quick reference: https://wch.github.io/latexsheet/

• Caution: not all LaTeX commands work in markdown


Copyright and session information

License and session Information

License: Instructor owns the copyright

> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17134)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base

other attached packages:
[1] ggplot2_3.1.0 class_7.3-15 glmnet_2.0-16 foreach_1.4.4
[5] Matrix_1.2-14 ISLR_1.2 knitr_1.21

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0 highr_0.7 compiler_3.5.0
 [4] pillar_1.3.1 plyr_1.8.4 bindr_0.1.1
 [7] iterators_1.0.10 tools_3.5.0 digest_0.6.18
[10] evaluate_0.12 tibble_1.4.2 gtable_0.2.0
[13] lattice_0.20-35 pkgconfig_2.0.2 png_0.1-7
[16] rlang_0.3.0.1 rstudioapi_0.8 yaml_2.2.0
[19] xfun_0.4 bindrcpp_0.2.2 withr_2.1.2
[22] stringr_1.3.1 dplyr_0.7.8 tidyselect_0.2.5
[25] grid_3.5.0 glue_1.3.0 R6_2.3.0
[28] rmarkdown_1.11 purrr_0.2.5 magrittr_1.5
[31] scales_1.0.0 codetools_0.2-15 htmltools_0.3.6
[34] assertthat_0.2.0 colorspace_1.3-2 labeling_0.3
[37] stringi_1.2.4 lazyeval_0.2.1 munsell_0.5.0
[40] crayon_1.3.4
