using r for data analysis and graphics - eth zstat.ethz.ch/~stahel/courses/r/usingr-slides.pdf ·...


Introduction Basics Simple Statistics More on S

Using R for Data Analysis and Graphics

1. Introduction


1.1 What is R?

R is a software environment for statistical computing.
R is based on commands. It implements the S language.
There is an unofficial menu-based interface called R-Commander.
Drawbacks of menus: it is difficult to store what you do. A script of commands documents the analysis and allows for easy repetition with changed data, options, ...

R is free software. http://www.r-project.org
Supported operating systems: Linux, Mac OS X, Windows
Language for exchanging statistical methods among researchers


1.2 Other Statistical Software

S-Plus: same programming language, commercial. Features a GUI.
SPSS: good for standard procedures.
SAS: all-rounder, good for large data sets, complicated analyses.
Systat: Analysis of Variance, easy-to-use graphics system.
Excel: Very limited collection of statistical methods. Good for getting the dataset ready.
Matlab: Mathematical methods. Statistical methods limited. Similar “paradigm”, less flexible structure.


1.3 Introductory examples

A dataset that we have stored before in the system is called d.sport

          weit  kugel  hoch   disc  stab  speer  punkte
OBRIEN    7.57  15.66   207  48.78   500  66.90    8824
BUSEMANN  8.07  13.60   204  45.04   480  66.86    8706
DVORAK    7.60  15.82   198  46.28   470  70.16    8664
:            :      :     :      :     :      :       :
:            :      :     :      :     :      :       :
:            :      :     :      :     :      :       :
CHMARA    7.75  14.51   210  42.60   490  54.84    8249

Draw a histogram of the results of variable kugel!
We type hist(d.sport[,"kugel"])
The graphics window is opened automatically.
We have called the S-function hist with argument d.sport[,"kugel"] .
[,] is used to select the column.
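The histogram call can be tried on a small stand-in data set. The sketch below assumes a d.sport that contains only the four athletes printed above, so it is not the full data set; the second call uses plot = FALSE to inspect the bin counts without drawing.

```r
# Stand-in for d.sport: only the four rows shown above (an assumption,
# the real data set has more athletes).
d.sport <- data.frame(
  weit   = c(7.57, 8.07, 7.60, 7.75),
  kugel  = c(15.66, 13.60, 15.82, 14.51),
  hoch   = c(207, 204, 198, 210),
  disc   = c(48.78, 45.04, 46.28, 42.60),
  stab   = c(500, 480, 470, 490),
  speer  = c(66.90, 66.86, 70.16, 54.84),
  punkte = c(8824, 8706, 8664, 8249),
  row.names = c("OBRIEN", "BUSEMANN", "DVORAK", "CHMARA")
)

kugel <- d.sport[, "kugel"]      # select one column as a numeric vector
hist(kugel)                      # opens a graphics device and draws the histogram
h <- hist(kugel, plot = FALSE)   # same binning, returned as an object instead
h$counts                         # number of observations per bin
```

The object returned by hist() also holds the bin boundaries in h$breaks, which is handy when the default binning needs to be checked.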


Scatter plot: type

plot(d.sport[,"kugel"], d.sport[,"speer"])

First argument: x coordinates; second: y coordinates
Many optional arguments!
plot(d.sport[,"kugel"], d.sport[,"speer"],
     xlab="ball push", ylab="javelin", pch=7)

Scatter plot matrix:
pairs(d.sport)

Every column of d.sport is plotted against all other columns.


Get a dataset from a text file and assign it to a name:

d.sport <- read.table(
    "http://stat.ethz.ch/Teaching/Datasets/WBL/sport.dat", header=TRUE)

Open the file browser of the operating system to pick a file:
d.sport <- read.table(file....())
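How read.table turns text into a data.frame can be seen without any file. The sketch below feeds it an inline string through the text argument; the layout of the string (a header line with one field fewer than the data lines) is an assumption chosen for illustration and is not confirmed to be the layout of sport.dat.

```r
# Inline text standing in for a data file (values copied from the table
# of d.sport shown earlier). Because the header line has one field fewer
# than the data lines, read.table uses the first column as row names.
txt <- "weit kugel hoch
OBRIEN 7.57 15.66 207
BUSEMANN 8.07 13.60 204
DVORAK 7.60 15.82 198"

d.small <- read.table(text = txt, header = TRUE)
d.small        # prints the data.frame
str(d.small)   # shows the type of each column
```

The same call works with a file name or URL in place of text = txt.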


1.4 Using R

Within a window running R, you will see the prompt >.
You type a command and get a result and a new prompt.
> hist(d.sport[,"kugel"])
>

An incomplete statement can be continued on the next line:
> plot(d.sport[,"kugel"],
+      d.sport[,"speer"])

R stores “objects” in your workspace:
> d.sport <- read.table(...)

Objects have names like a, fun, d.sport

R provides a huge number of functions and other objects


An R statement consists of

a name of an object → object is displayed
> d.sport

a call to a function → graphical or numerical result
> hist(d.sport[,"kugel"])

an assignment
> a <- 2*pi/360
> mn <- mean(d.sport[,"kugel"])
stores the mean of d.sport[,"kugel"] under the name mn


Some special and useful functions (more details later):

documentation on the arguments etc. of a function (or dataset provided by the system):
> help(hist) or ?hist

list all “objects” (names) in the workspace:
> objects()

leave the R session:
> q()
You get the question:
Save workspace image? [y/n/c]:
If you answer "y", your objects will be available for your next session.


1.5 Scripts and Editors

Instead of typing commands into the R window, you can write commands in an editor and then “send” them to the R window ... and later modify (correct) them and send them again.

Text editors supporting R:

WinEdt: http://www.winedt.com/

Emacs: http://www.gnu.org/software/emacs/
ESS: http://stat.ethz.ch/ESS/

Tinn-R: http://www.sciviews.org/Tinn-R/


[Figure: the Tinn-R window]

Define Tinn-R keyboard shortcuts: use the dialog R / Hotkeys of R


2. Basics


2.1 Vectors

Functions and operations are usually applied to whole “collections” instead of single numbers, including “vectors”, “matrices”, “data.frames” ( d.sport )

Numbers can be combined into “vectors” using the function c() (“combine”)
> t.v <- c(4,2,7,8,2)
> t.a <- c(3.1, 5, -0.7, 0.9, 1.7)
> t.u <- c(t.v,t.a)
> t.u
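A short sketch of what c() produces for the vectors above. The named vector t.n at the end is a hypothetical addition that only illustrates that c() also accepts element names.

```r
t.v <- c(4, 2, 7, 8, 2)
t.a <- c(3.1, 5, -0.7, 0.9, 1.7)
t.u <- c(t.v, t.a)   # concatenation: elements of t.v followed by those of t.a

length(t.u)          # 10 elements in total
t.u[6]               # 3.1, the first element contributed by t.a

# Hypothetical example: elements can be given names inside c()
t.n <- c(first = 4, second = 2)
t.n["second"]        # select by name
```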


Generate a sequence of consecutive integers:
> seq(1, 9)
[1] 1 2 3 4 5 6 7 8 9

Since sequences of integers are needed very often, this can be abbreviated to 1:9 .

Equally spaced numbers: Use argument by (default: 1)
> seq(0, 3, by=0.5)
[1] 0.0 0.5 1.0 1.5 2.0 2.5 3.0

Repetition:
> rep(0.7, 5)
[1] 0.7 0.7 0.7 0.7 0.7

> rep(c(1, 3, 5), length=8)
[1] 1 3 5 1 3 5 1 3
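A few variations on seq() and rep() that go slightly beyond the calls above. The arguments length.out, times and each are part of base R; the particular numbers are chosen only for illustration.

```r
all(seq(1, 9) == 1:9)              # TRUE: both give the integers 1 to 9

s1 <- seq(0, 3, length.out = 7)    # 7 equally spaced values, same as by = 0.5 here
s2 <- seq(10, 1, by = -3)          # decreasing sequence: 10 7 4 1

r1 <- rep(c(1, 3, 5), times = 2)   # whole vector twice:  1 3 5 1 3 5
r2 <- rep(c(1, 3, 5), each = 2)    # each element twice:  1 1 3 3 5 5
```

Specifying length.out instead of by is convenient when the number of points matters more than the step size, for example for a plotting grid.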


Basic functions for vectors:

Call, Example   Description
length(t.v)     Length of a vector, number of elements
sum(t.v)        Sum of all elements
mean(t.v)       Arithmetic mean
var(t.v)        Empirical variance
range(t.v)      Range (minimum and maximum)
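To make the table concrete, the sketch below recomputes mean, variance and range of t.v by hand and compares them with the built-in functions. It relies on var() using the divisor n - 1, which is how R defines the empirical variance.

```r
t.v <- c(4, 2, 7, 8, 2)

n <- length(t.v)                  # 5
m <- sum(t.v) / n                 # same value as mean(t.v)
v <- sum((t.v - m)^2) / (n - 1)   # same value as var(t.v); divisor is n - 1
r <- c(min(t.v), max(t.v))        # same values as range(t.v)

all.equal(m, mean(t.v))
all.equal(v, var(t.v))
all.equal(r, range(t.v))
```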


2.2 Arithmetic

Simple arithmetic is as expected:
> 2+5
[1] 7

Operations: + - * / ^ (Exponentiation)

These operations are applied to vectors elementwise.
> (2:5) ^ c(2,3,1,0)
[1] 4 27 4 1

Priorities as usual. Use parentheses!
> (2:5) ^ 2
[1] 4 9 16 25
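Why the parentheses matter can be seen by leaving them out. In R, ^ binds more tightly than the colon operator and than unary minus, so the following sketch gives quite different results with and without parentheses.

```r
a  <- (2:5)^2   # square each of 2, 3, 4, 5
b  <- 2:5^2     # ^ is evaluated first, so this is the sequence 2:25
c1 <- -2^2      # read as -(2^2), i.e. -4
c2 <- (-2)^2    # 4

length(a)       # 4
length(b)       # 24
```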


Elements are recycled:
> (1:6)*(1:2)
[1] 1 4 3 8 5 12

> (1:5)-(0:1)
[1] 1 1 3 3 5
Warning message:
longer object length is not a multiple of shorter object length in: (1:5) - (0:1)

> (1:6)-(0:1)
[1] 1 1 3 3 5 5

Be careful, there is no warning in this case!
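Because silent recycling can hide mistakes, one option is to check lengths explicitly before combining two vectors. The helper check.lengths below is hypothetical (it is not a base R function) and is only a sketch of such a guard.

```r
x <- 1:6
x * 10          # a single number is recycled to every element
x - c(0, 1)     # length 2 divides length 6: recycled without any warning

# Hypothetical guard: allow equal lengths or a single number, stop otherwise.
check.lengths <- function(a, b) {
  if (length(a) != length(b) && length(a) != 1 && length(b) != 1)
    stop("lengths differ: ", length(a), " vs ", length(b))
  invisible(TRUE)
}

check.lengths(x, 10)                                   # passes: b is a single number
res <- try(check.lengths(x, c(0, 1)), silent = TRUE)   # stopped by the guard
inherits(res, "try-error")                             # TRUE
```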


2.3 Character Vectors

Character strings: "abc" , 'nut 999'
Combine strings into vector of “mode” character:
> t.names <- c("Urs", "Anna", "Max", "Pia")

Length of strings:
> nchar(t.names)
[1] 3 4 3 3

String manipulations:
> substring(t.names,3,4)
[1] "s" "na" "x" "a"

> paste(t.names,"Z.")
[1] "Urs Z." "Anna Z." "Max Z." "Pia Z."

> paste("X",1:3, sep="")
[1] "X1" "X2" "X3"
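A few more string operations on the same vector, as a sketch. All functions used here (toupper, substring, paste, paste0) are in base R; the choice of separators is only for illustration.

```r
t.names <- c("Urs", "Anna", "Max", "Pia")

toupper(t.names)                            # upper-case copies of the names
initials <- substring(t.names, 1, 1)        # first letter of each name
one.string <- paste(t.names, collapse = ", ")   # join everything into one string
labels <- paste0("X", 1:3)                  # same result as paste("X", 1:3, sep = "")
```

Note the difference between sep, which goes between the arguments of each element, and collapse, which joins the resulting elements into a single string.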


2.4 Logical Vectors

Logical vectors contain elements TRUE or FALSE
> rep(c(TRUE, FALSE), length=6)
[1] TRUE FALSE TRUE FALSE TRUE FALSE

often result from comparisons: < <= > >= == !=
> (1:5)>=3
[1] FALSE FALSE TRUE TRUE TRUE

Logical operations: & (and), | (or), ! (not).
> t.i <- (t.a>2)&(t.a<5)
> t.i
[1] TRUE FALSE FALSE FALSE FALSE
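Logical vectors are often summarised rather than printed. The sketch below uses the same t.a as before; sum() counts TRUE values because TRUE is treated as 1 and FALSE as 0 in arithmetic.

```r
t.a <- c(3.1, 5, -0.7, 0.9, 1.7)
t.i <- (t.a > 2) & (t.a < 5)

sum(t.i)       # how many elements satisfy the condition
which(t.i)     # at which positions
any(t.a < 0)   # is at least one element negative?
all(t.a < 5)   # FALSE here, because 5 < 5 is FALSE
!t.i           # negation, element by element
```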


2.5 Selecting elements

Select elements from vectors or data.frames: [ ] , [,]
> t.v <- d.sport[,"kugel"]
> t.v[c(1,3,5)]
[1] 15.66 15.82 16.32

> d.sport[c(1,3,5),1:3]
           weit kugel hoch
OBRIEN     7.57 15.66  207
DVORAK     7.60 15.82  198
HAMALAINEN 7.48 16.32  198

For data.frames, use names of columns or rows:
> d.sport[c("OBRIEN","DVORAK"),
+         c("kugel","speer","punkte")]
       kugel speer punkte
OBRIEN 15.66 66.90   8824
DVORAK 15.82 70.16   8664


Using logical vectors:
> t.a[c(TRUE,FALSE,TRUE,TRUE,FALSE,FALSE)]
[1] 3.1 -0.7 0.9

> d.sport[d.sport[,"kugel"] > 16, c(2,7)]
           kugel punkte
HAMALAINEN 16.32   8613
PENALVER   16.91   8307
SMITH      16.97   8271
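The different kinds of index can be compared side by side. The sketch below uses the vector t.a from earlier and a three-row stand-in data.frame d (an assumption for illustration, with values taken from the tables shown above).

```r
t.a <- c(3.1, 5, -0.7, 0.9, 1.7)
t.a[t.a > 1]       # logical index built from a comparison
t.a[-1]            # negative index: everything except element 1
t.a[c(2, 2, 5)]    # positions may be repeated

# Stand-in data.frame with three of the athletes (assumption)
d <- data.frame(kugel  = c(15.66, 13.60, 15.82),
                punkte = c(8824, 8706, 8664),
                row.names = c("OBRIEN", "BUSEMANN", "DVORAK"))

sel <- d[d[, "kugel"] > 15, ]          # rows chosen by a logical vector
pts <- d[d[, "kugel"] > 15, "punkte"]  # a single column comes back as a vector
```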


2.6 Matrices

Matrices are “data tables” like data.frames, but they can only contain data of a single type (numeric or character)

Generate a matrix:
> t.m1 <- matrix(1:10, nrow=2, ncol=5)
> t.m1
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10

> t.m2 <- matrix(1:10, ncol=2,
+                byrow=TRUE)

Transpose: t(t.m1) equals t.m2 .


Selection of elements as with data.frames:
> t.m1[2,1:3]
[1] 2 4 6

Matrix multiplication:
> t.m1 %*% t.m2
     [,1] [,2]
[1,]  165  190
[2,]  190  220

Vectors are treated as 1-row or 1-column matrices (mostly).
Functions for linear algebra are available.
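A sketch that checks the dimension rules of matrix multiplication with the two matrices defined above. Each element of the product is the sum of products of a row of the first matrix with a column of the second.

```r
t.m1 <- matrix(1:10, nrow = 2, ncol = 5)
t.m2 <- matrix(1:10, ncol = 2, byrow = TRUE)

all(t(t.m1) == t.m2)       # TRUE: t.m2 is the transpose of t.m1

p <- t.m1 %*% t.m2         # (2 x 5) times (5 x 2) gives a 2 x 2 matrix
q <- t.m2 %*% t.m1         # (5 x 2) times (2 x 5) gives a 5 x 5 matrix
dim(p)
dim(q)

# Element [1,1] of the product, computed by hand
p[1, 1] == sum(t.m1[1, ] * t.m2[, 1])
```

Note that * between two matrices of equal shape multiplies elementwise; %*% is needed for the matrix product.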


3. Simple Statistics


3.1 Simple Statistical Functions

Count number of cases with same value:
> table(d.blast[,"loc"])

L1 L2 L3 L4 L5 L6
14 10 14 10 24 24

Cross-table:
> table(d.blast[,"loc"],
+       d.blast[,"loading"])

     2.08 2.18 2.5 2.6 3.12 3.33 3.64
  L1    2    2   1   5    1    2    1
  L2    2    0   0   4    3    1    0
  ...
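Since d.blast is not shown here, the sketch below uses two short made-up vectors, loc and loading (hypothetical data), to show how the result of table() can be indexed like a vector or matrix.

```r
# Hypothetical data standing in for two columns of d.blast
loc     <- c("L1", "L2", "L1", "L3", "L2", "L1")
loading <- c(2.08, 2.6, 2.6, 2.08, 2.6, 2.08)

t.loc <- table(loc)            # counts per location
t.loc[["L1"]]                  # count for one value, selected by name

t.x <- table(loc, loading)     # cross-table: rows = loc, columns = loading
t.x["L1", "2.08"]              # cases with loc L1 and loading 2.08
sum(t.x)                       # total number of cases
```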


Estimation of a “location parameter”: mean(x) median(x)

Variance: var(x) ; correlation:
> cor(d.sport[,"kugel"],d.sport[,"speer"])

Correlation matrix:
> t.cor <- cor(d.sport[,1:3])
> round(100*t.cor)
      weit kugel hoch
weit   100   -63   34
kugel  -63   100   -9
hoch    34    -9  100
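What cor() computes can be checked by hand. The sketch below uses only the kugel and speer values of the four athletes listed in the introductory table, so its result is not the correlation of the full data set.

```r
x <- c(15.66, 13.60, 15.82, 14.51)   # kugel, four athletes only (assumption)
y <- c(66.90, 66.86, 70.16, 54.84)   # speer, same athletes

# Pearson correlation: covariance divided by the product of the standard deviations
r.manual <- sum((x - mean(x)) * (y - mean(y))) /
            ((length(x) - 1) * sd(x) * sd(y))
all.equal(r.manual, cor(x, y))

m <- cor(cbind(x, y))   # correlation matrix: 1 on the diagonal, symmetric
round(100 * m)
```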


3.2 Hypothesis Tests

Do two groups differ in their “location”?
→ Wilcoxon's Rank Sum Test

> t.y1 <- sleep[sleep[,'group']==1,'extra']
> t.y2 <- sleep[sleep[,'group']==2,'extra']
> wilcox.test(t.y1, t.y2, paired=FALSE)

        Wilcoxon rank sum test with continuity correction

data: t.y1 and t.y2
W = 25.5, p-value = 0.06933
alternative hyp.: true location shift not equal to 0


Better known: the t-test. It assumes normal distributions.
> t.test(t.y1,t.y2,alternative="two.sided",
+        paired=FALSE)

        Welch Two Sample t-test

data: t.y1 and t.y2
t = -1.8608, df = 17.776, p-value = 0.0794
alternative hyp.: true diff. in means not equal to 0
95 percent confidence interval:
 -3.365 0.205
sample estimates:
mean of x mean of y
     0.75      2.33

→ Confidence interval!
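The printed output is only one view of the test result. As a sketch, the object returned by t.test() can be stored and its components extracted by name; sleep is a data set that ships with R.

```r
t.y1 <- sleep[sleep[, "group"] == 1, "extra"]
t.y2 <- sleep[sleep[, "group"] == 2, "extra"]

res <- t.test(t.y1, t.y2, alternative = "two.sided", paired = FALSE)

res$statistic   # the t value
res$parameter   # degrees of freedom
res$p.value     # the p-value as a plain number
res$conf.int    # confidence interval for the difference in means
res$estimate    # the two sample means
```

Storing the result this way makes it easy to reuse single numbers, for example in a table of several tests.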


3.3 Two Groups

Plots for two samples of data.

> boxplot(t.y1,t.y2,ylab="extra")

> plot(sleep[,"group"],sleep[,"extra"],
+      xlab="group", ylab="extra")


3.4 Statistical Models, Formula Objects

Statistics is concerned with relations between “variables”.
Prototype: Relationship between target variable Y and explanatory variables X1, X2, ... → Regression.

Symbolic notation of such a relation: Y ~ X1 + X2
This symbolic notation is an S object (of class formula ).
(The notation is also used in other statistical packages.)

Use of formula :
> plot(punkte ~ kugel + speer,
+      data = d.sport)
gives 2 scatterplots, punkte (vertical) against kugel and speer , respectively (horizontal axis).
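That a formula is an ordinary object can be seen by storing it under a name. The sketch below only inspects the formula; no data are needed, because the variable names are not evaluated until a function such as plot() uses the formula together with a data argument.

```r
f <- punkte ~ kugel + speer   # stored without evaluating punkte, kugel, speer

class(f)       # "formula"
all.vars(f)    # names of all variables that occur in the formula
f[[2]]         # left-hand side (the target variable)
f[[3]]         # right-hand side (the explanatory part)
```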


Grouping or nominal or categorical variables, e.g., location, type, group, species, plot, ...
Role in models different from continuous variables → S must know! → stores them as factor s

– Character variables enter data.frame as factor s (this was the default before R 4.0.0; in later versions, request it with stringsAsFactors=TRUE)
– Grouping var. with numerical “labels” can be declared as factor
> sleep[,'group'] <-
+   factor(sleep[,'group'])

> plot(extra ~ group, data = sleep)
produces two box plots.
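The difference between a numeric grouping variable and a factor can be sketched with a small made-up vector g (hypothetical data). The labels "control" and "treated" are also assumptions, chosen only to show the labels argument of factor().

```r
g <- c(1, 1, 2, 2, 2)       # hypothetical group codes stored as numbers
f <- factor(g)              # declare them as a grouping variable

is.numeric(g)               # TRUE: treated as a continuous variable
is.factor(f)                # TRUE: treated as categories
levels(f)                   # the distinct categories, stored as strings
table(f)                    # cases per category

# Hypothetical, more readable labels for the same codes
f2 <- factor(g, labels = c("control", "treated"))
levels(f2)
```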