
An Overview of R

Sources: Using R for Introductory Statistics, John Verzani, Chapman & Hall/CRC

What is R?

• R is a computer language for statistical computing similar to the S language. It is an alternative to Matlab.

• R is open-source software and is part of the GNU project.

• The R home page is at: http://www.r-project.org. You can download your copy there.


Starting R

Double-click on the R icon from Windows.

R version 3.1.0 (2014-04-10) -- "Spring Dance"
Copyright (C) 2014 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

>

Using R as a calculator

> 2+2
[1] 4
> 2^4 # exponentiation
[1] 16
> (1-2)*3
[1] -3
> 1-2*3 # the usual precedence laws
[1] -5

> sqrt(2) # the square root
[1] 1.414214
> sin(pi) # the sine function
[1] 1.224606e-16 # this is effectively 0
> exp(1) # this is exp(x) = e^x
[1] 2.718282
> log(10) # the log base e
[1] 2.302585

Note: Floating-point rounding errors, such as 1.224606e-16 instead of 0, are common in R, as in any language that uses floating-point arithmetic.
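Because of such rounding errors, exact comparisons with == can be misleading. The following is a minimal sketch of one common way to compare numbers within a tolerance; the variable names are illustrative only.

```r
# sin(pi) is mathematically 0, but not exactly 0 in floating-point arithmetic
x <- sin(pi)

exact_match <- (x == 0)                   # exact comparison fails
close_match <- isTRUE(all.equal(x, 0))    # comparison within a small tolerance
rounded     <- round(x, 10)               # rounding to 10 decimals gives 0

print(exact_match)
print(close_match)
print(rounded)
```

all.equal() returns TRUE or a description of the difference, which is why it is wrapped in isTRUE() when a plain logical value is needed.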

Changing the default Behaviour

• Many functions have extra arguments that allow us to change the default behaviour.

• To see the detailed arguments of a function, type help(<fn>), ?<fn>, or ?"<fn>". This opens an HTML window with a description, examples, etc.

> log(10,10)
[1] 1
> log(10, base=10)
[1] 1
> help(log)
starting httpd help server ... done
> ?log
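As a small illustration of changing a default, the sketch below calls log() with and without its base argument. It uses only base R; the variable names natural and common are chosen here for illustration.

```r
# Positional and named arguments give the same result
by_position <- log(100, 10)              # base passed by position
by_name     <- log(100, base = 10)       # base passed by name
any_order   <- log(base = 10, x = 100)   # named arguments can appear in any order

# args() prints a function's argument list, including its default values
args(log)

natural <- log(100)             # default: natural logarithm (base e)
common  <- log(100, base = 10)  # overriding the default base
print(c(natural = natural, common = common))
```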

Other ways to get help

> help.search("log")
> apropos("log")
 [1] ".__C__logical"         ".__C__logLik"          ".__T__Logic:base"
 [4] "as.data.frame.logical" "as.logical"            "as.logical.factor"
 [7] "dlogis"                "is.logical"            "log"
[10] "log10"                 "log1p"                 "log2"
[13] "logb"                  "Logic"                 "logical"
[16] "logLik"                "loglin"                "plogis"
[19] "qlogis"                "rlogis"                "SSlogis"
[22] "winDialog"             "winDialogString"

Warnings and Errors

> squareroot(2)
Error: could not find function "squareroot"
> sqrt 2
Error: unexpected numeric constant in "sqrt 2"
> sqrt(2
+ )
[1] 1.414214

The + on the continuation line, like the > prompts, was not typed; R adds it when a command is incomplete.

> sqrt(-2)
[1] NaN
Warning message:
In sqrt(-2) : NaNs produced

Assignment: =, <-, <<-

> x = 2
> x + 3
[1] 5
> pi
[1] 3.141593
> e^2
Error: object 'e' not found
> e = exp(1)
> e^2
[1] 7.389056

> x <- 2
> x
[1] 2
> x <<- 17
> x
[1] 17
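At the top level, =, <- and <<- all behave the same way, as in the session above. The difference only shows up inside functions. The sketch below is one way to see it; the names counter, increment_local and increment_global are hypothetical and chosen for illustration.

```r
counter <- 0

increment_local <- function() {
  counter <- counter + 1    # creates a local variable; the outer counter is unchanged
  counter
}

increment_global <- function() {
  counter <<- counter + 1   # modifies counter in the enclosing environment
  counter
}

local_result <- increment_local()
after_local  <- counter       # the outer counter has not changed
global_result <- increment_global()
after_global <- counter       # the outer counter has been updated

print(c(after_local = after_local, after_global = after_global))
```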

Variable Names

> x = 2
> n = 25
> N = 17
> n
[1] 25
> N
[1] 17

Note: Variable names are case-sensitive.

> a.really.long.number = 123456789
> a.really.long.number
[1] 123456789
> AReallySmallNumber = 0.000000001
> AReallySmallNumber
[1] 1e-09

Data Vectors: c()

• A data set contains many observations, e.g., the number of whale beachings per year in Texas from 1990 to 1999. The c() function combines such values into a data vector.

> whales = c(74, 122, 235, 111, 292, 111, 211, 133, 156, 79)
> whales
 [1]  74 122 235 111 292 111 211 133 156  79

• c() can also combine vectors

> x = c(1, 2)
> y = c(3, 4)
> c(x,y)
[1] 1 2 3 4

Data Vectors: type, named entries

All entries of a vector must have the same type. When types are mixed, the entries are converted automatically, e.g., numbers to character.

> simpsons = c("Homer", 'Marge', "Bart", "Lisa", "Maggie")
> simpsons
[1] "Homer"  "Marge"  "Bart"   "Lisa"   "Maggie"
> mixed = c(-1.5, "a", 9, 3e-2, 'hi')
> mixed
[1] "-1.5" "a"    "9"    "0.03" "hi"

Entries can be named

> names(simpsons) = c("dad", "mom", "son", "daughter 1", "daughter 2")
> names(simpsons)
[1] "dad"        "mom"        "son"        "daughter 1" "daughter 2"
> simpsons
       dad        mom        son daughter 1 daughter 2
   "Homer"    "Marge"     "Bart"     "Lisa"   "Maggie"
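The automatic conversion follows a fixed order: logical values become integers, integers become doubles, and anything combined with a string becomes character. A short sketch using class() to inspect the result (variable names are illustrative):

```r
logical_int <- class(c(TRUE, 2L))    # logical + integer   -> "integer"
int_double  <- class(c(1L, 2.5))     # integer + double    -> "numeric"
double_char <- class(c(1.5, "a"))    # double + character  -> "character"
print(c(logical_int, int_double, double_char))

# Converting back: entries that are not numbers become NA (with a warning,
# suppressed here to keep the output short)
mixed <- c(-1.5, "a", 9, 3e-2, "hi")
nums  <- suppressWarnings(as.numeric(mixed))
print(nums)
```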

Using functions on Data Vectors

> whales
 [1]  74 122 235 111 292 111 211 133 156  79
> sum(whales)
[1] 1524
> length(whales)
[1] 10
> sum(whales)/length(whales)
[1] 152.4
> mean(whales)
[1] 152.4
> sort(whales)
 [1]  74  79 111 111 122 133 156 211 235 292

> min(whales)
[1] 74
> max(whales)
[1] 292
> range(whales)
[1]  74 292
> diff(whales)
[1]   48  113 -124  181 -181  100  -78   23  -77
> cumsum(whales)
 [1]   74  196  431  542  834  945 1156 1289 1445 1524

Note: diff computes the successive differences in the data vector.

Vectorization of Functions

> whales.tex = whales
> whales.fla = c(89, 254, 306, 292, 274, 233, 294, 204, 204, 90)
> whales.tex + whales.fla
 [1] 163 376 541 403 566 344 505 337 360 169
> whales.tex - whales.fla
 [1]  -15 -132  -71 -181   18 -122  -83  -71  -48  -11
> whales.tex - mean(whales.tex)
 [1] -78.4 -30.4  82.6 -41.4 139.6 -41.4  58.6 -19.4   3.6 -73.4

In the last example, a single number gets subtracted from each entry of the vector.
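To make the vectorized behaviour concrete, the sketch below compares the vectorized subtraction with an explicit loop that does the same work entry by entry. The loop is shown only for comparison; the names centered and centered_loop are illustrative.

```r
whales.tex <- c(74, 122, 235, 111, 292, 111, 211, 133, 156, 79)

# Vectorized: the single number mean(whales.tex) is reused for every entry
centered <- whales.tex - mean(whales.tex)

# Equivalent explicit loop
centered_loop <- numeric(length(whales.tex))
for (i in seq_along(whales.tex)) {
  centered_loop[i] <- whales.tex[i] - mean(whales.tex)
}

print(all.equal(centered, centered_loop))

# A shorter vector is recycled in the same way: c(1, 0) is repeated five times,
# so every second entry is multiplied by 0
every_other <- whales.tex * c(1, 0)
print(every_other)
```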

Creating Structured Data: simple and arithmetic sequences

> 1:10
 [1]  1  2  3  4  5  6  7  8  9 10
> rev(1:10) # countdown
 [1] 10  9  8  7  6  5  4  3  2  1
> 10:1 # counts down because 10 > 1
 [1] 10  9  8  7  6  5  4  3  2  1
> a = 1 ; h = 4 ; n = 5 ; # use ; to separate commands
> a + h * ( 0: (n - 1))
[1]  1  5  9 13 17
> a + h * ( 0: n - 1) # note: 0:n - 1 means (0:n) - 1, not 0:(n-1)
[1] -3  1  5  9 13 17
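The arithmetic sequence a + h * (0:(n - 1)) can also be produced with seq(). The sketch below checks that both forms agree and shows the precedence of the : operator; the variable names manual and via_seq are illustrative.

```r
a <- 1; h <- 4; n <- 5

manual  <- a + h * (0:(n - 1))               # start a, step h, n terms
via_seq <- seq(a, by = h, length.out = n)    # the same sequence using seq()
print(manual)
print(via_seq)

# The : operator binds more tightly than + and -,
# so 0:n - 1 is the same as (0:n) - 1
same_as_parenthesized <- identical(0:n - 1, (0:n) - 1)
print(same_as_parenthesized)
```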

Creating Structured Data: more arithmetic sequences

> seq(1,9,by=2)
[1] 1 3 5 7 9
> seq(1,10,by=2)
[1] 1 3 5 7 9
> seq(1,9,length=5)
[1] 1 3 5 7 9
> seq(1,9,length=9)
[1] 1 2 3 4 5 6 7 8 9
> seq(1,9,length=17)
 [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5 9.0

Creating Structured Data: Repeated numbers

> rep(1,10)
 [1] 1 1 1 1 1 1 1 1 1 1
> rep(1:3,3)
[1] 1 2 3 1 2 3 1 2 3
> rep(c("long","short"), c(1,2))
[1] "long"  "short" "short"

Accessing Data by using Indices

> ebay = c(88.8, 88.3, 90.2, 93.5, 95.2, 94.7, 99.2, 99.4, 101.6)
> length(ebay)
[1] 9
> ebay[1]
[1] 88.8
> ebay[9]
[1] 101.6
> ebay[length(ebay)]
[1] 101.6
> ebay[1:4]
[1] 88.8 88.3 90.2 93.5
> ebay[c(1,5,9)]
[1]  88.8  95.2 101.6

Negative indices and names

> ebay[-1] # all but the first
[1]  88.3  90.2  93.5  95.2  94.7  99.2  99.4 101.6
> ebay[-(1:4)] # all but the 1st - 4th
[1]  95.2  94.7  99.2  99.4 101.6

> x = 1:3
> names(x) = c("one", "two", "three") # set the names
> x["one"]
one
  1

Assigning Values to Data Vectors

> ebay[1] = 88.0
> ebay
[1]  88.0  88.3  90.2  93.5  95.2  94.7  99.2  99.4 101.6
> ebay[10:13] = c(97.0, 99.3, 102.0, 101.8)
> ebay
 [1]  88.0  88.3  90.2  93.5  95.2  94.7  99.2  99.4 101.6  97.0  99.3 102.0
[13] 101.8
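Assigning past the end of a vector extends it, as with ebay[10:13] above. If the assignment skips positions, R fills the gap with NA. A minimal sketch with a made-up vector v:

```r
v <- c(10, 20, 30)

v[6] <- 60    # positions 4 and 5 were never assigned, so they become NA
print(v)
print(length(v))

missing_positions <- which(is.na(v))
print(missing_positions)
```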

Logical Values

> ebay > 100
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE
[13]  TRUE
> which(ebay > 100)
[1]  9 12 13
> ebay[which(ebay > 100)]
[1] 101.6 102.0 101.8
> sum(ebay > 100) # How many values are bigger than 100
[1] 3
> sum(ebay > 100)/length(ebay) # Proportion of values bigger than 100
[1] 0.2307692
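A logical vector can also be used directly as an index, without which(), and mean() of a logical vector gives the proportion of TRUE values. The sketch below repeats the ebay data so that it runs on its own; the names above, prop and between are illustrative.

```r
ebay <- c(88.0, 88.3, 90.2, 93.5, 95.2, 94.7, 99.2, 99.4, 101.6,
          97.0, 99.3, 102.0, 101.8)

above <- ebay[ebay > 100]    # logical index: keep entries where the test is TRUE
prop  <- mean(ebay > 100)    # TRUE counts as 1 and FALSE as 0

# Conditions can be combined with & (and) and | (or)
between <- ebay[ebay > 95 & ebay < 100]

print(above)
print(prop)
print(between)
```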

Reading in other sources of data

> library()                    # list all the installed packages
> data()                       # list all available data sets in loaded packages
> data(package="MASS")         # list all the data sets in package MASS
> library(MASS)                # load the package MASS
> install.packages("RWeka")    # install the package named "RWeka"

Basic Charts

> sales = c(15, 34, 5)
> names(sales) = c("Mary", "Peter", "Paul")
> barplot(sales, main="Sales", ylab="Thousands")
> pie(sales, main="Sales")

Basic Statistical Analysis: mean, variance, histograms, densities

> mean(whales.tex)
[1] 152.4
> var(whales.tex)
[1] 5113.378
> mean(whales.fla)
[1] 224
> var(whales.fla)
[1] 6301.111
> hist(whales.tex, prob=TRUE, main="Whales, Texas")
> lines(density(whales.tex))
> hist(whales.fla, prob=TRUE, main="Whales, Florida")
> lines(density(whales.fla))

Basic Statistical Analysis: Bivariate data I

> allwhales <- rbind(whales.tex, whales.fla)
> colnames(allwhales) = c(1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999)
> allwhales
           1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
whales.tex   74  122  235  111  292  111  211  133  156   79
whales.fla   89  254  306  292  274  233  294  204  204   90
> barplot(allwhales, beside=TRUE, legend.text=TRUE, main="Whales")

Basic Statistical Analysis: Bivariate data II

> boxplot(whales.tex, whales.fla, main="Whales", names=c("Texas", "Florida"))

Statistical Review

• A Bernoulli random variable X is one that has only two values: 0 or 1. The distribution of X is characterized by p = P(X=1). [E.g., if we toss a coin and let X be 1 if heads occurs, then X is a Bernoulli random variable with p = 0.5 if the coin is fair.]

• A binomial random variable X counts the number of successes in n independent Bernoulli trials. Two parameters describe the distribution of X: the number of trials, n, and the success probability, p. [E.g., if we toss a fair coin 10 times and let X be the number of heads, X has a Binomial(10, 0.5) distribution.]

• The central limit theorem states that, for a parent population with mean μ and finite standard deviation σ, the standardised mean of n independent observations is approximately standard normal for a large enough n.

• For the binomial distribution, the rule of thumb is that the normal approximation is valid when n·p and n·(1-p) are both greater than 5.
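The binomial facts above can be checked numerically with dbinom(). The sketch below does this for the Binomial(10, 0.5) example; normal_ok is a hypothetical helper written here to encode the rule of thumb and is not a function provided by R.

```r
n <- 10; p <- 0.5
k <- 0:n

probs <- dbinom(k, size = n, prob = p)    # P(X = k) for k = 0, ..., n
total <- sum(probs)                       # probabilities sum to 1

mean_x <- sum(k * probs)                  # equals n * p
var_x  <- sum((k - mean_x)^2 * probs)     # equals n * p * (1 - p)
print(c(total = total, mean = mean_x, variance = var_x))

# Hypothetical helper: the rule of thumb for the normal approximation
normal_ok <- function(n, p) n * p > 5 && n * (1 - p) > 5

print(normal_ok(10, 0.5))     # n*p is exactly 5, which is not greater than 5
print(normal_ok(200, 0.5))
```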

Simulation: the normal approximation for the binomial

> m = 200; p = 1/2
> n = 5
> res = rbinom(m, n, p)
> res
  [1] 3 4 3 0 3 1 2 3 2 5 2 2 4 4 3 2 2 2 2 2 3 3 2 3 2 2 1 2 2 3 2 4 3 3 3 1 4
 [38] 1 2 2 2 4 2 2 4 4 2 0 2 3 3 1 3 2 3 2 0 2 2 2 5 2 3 2 1 1 2 2 2 4 4 1 2 2
…
> hist(res, prob=TRUE, main="n=5")
> curve(dnorm(x, n*p, sqrt(n*p*(1-p))), add=TRUE)
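With n = 5 and p = 1/2, n·p is only 2.5, so the rule of thumb is not met. One way to explore how the approximation changes with n is to repeat the simulation for several values of n and standardise the results. The sketch below does this without plotting; the seed, the number of simulated values m, and the chosen values of n are arbitrary assumptions made for illustration.

```r
set.seed(42)    # arbitrary seed, used only to make the sketch reproducible
m <- 10000; p <- 1/2

for (n in c(5, 50, 500)) {
  res <- rbinom(m, n, p)
  # Standardise: subtract the mean n*p and divide by the sd sqrt(n*p*(1-p))
  z <- (res - n * p) / sqrt(n * p * (1 - p))
  # For a standard normal variable, P(|Z| <= 1.96) is about 0.95
  cat("n =", n, " simulated P(|Z| <= 1.96):", mean(abs(z) <= 1.96), "\n")
}
```

Replacing the cat() line with the hist() and curve() calls from the session above gives the graphical version of the same comparison.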

Programming in R: Functions

> hello=function(){cat("hello world\n")}
> hello()
hello world
> hello = edit(hello)
> hello()
hi world
> hello=function(x){cat("hello", x, "\n")}
> hello("kitty")
hello kitty
> hello=function(x="everyone"){cat("hello", x, "\n")}
> hello("kitty")
hello kitty
> hello()
hello everyone

• edit() opens a new window and allows the programmer to make changes in that window. Upon closing the window, the programmer is asked if the changes should be saved. edit() returns the edited function, so the result must be assigned back (hello = edit(hello)) to keep the changes.

• Default values can be specified for each argument when the function is defined.

Programming in R: if, switch

> abs=function(x) { # note: this masks R's built-in abs()
+   if (x < 0) {
+     return(-x)
+   } else {
+     return(x)
+   }
+ }
> abs(-.2)
[1] 0.2
> abs(3)
[1] 3
> abs("hi")
[1] "hi"

> distrib = function(x, type){}
> distrib = edit(distrib) # in the editor, change the function to:
function(x, type) {
  switch(type,
         mean = mean(x),
         variance = var(x),
         stdev = sd(x))
}
> x = rnorm(10000, 3, 2)
> distrib(x, "mean")
[1] 2.952066
> distrib(x, "stdev")
[1] 1.993793
> distrib(x, "variance") # switch() matches names exactly, so "var" would not match
[1] 3.975209
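When no name in switch() matches and there is no default, the function quietly returns NULL. One possible way to guard against a mistyped type is to add an unnamed last argument, which acts as the default. The sketch below is a variant of distrib written for illustration; the test data x is made up.

```r
distrib <- function(x, type) {
  switch(type,
         mean     = mean(x),
         variance = var(x),
         stdev    = sd(x),
         stop("unknown type: ", type))    # unnamed last argument: the default case
}

x <- c(2, 4, 4, 4, 5, 5, 7, 9)    # made-up data

m <- distrib(x, "mean")
v <- distrib(x, "variance")
print(c(mean = m, variance = v))

# A name that does not match now raises an error instead of returning NULL
result <- tryCatch(distrib(x, "var"), error = function(e) conditionMessage(e))
print(result)
```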

Programming in R: for, while

> fact = function(x){
+   ret = 1
+   for (i in 1:x) {
+     ret = ret * i
+   }
+   return(ret)
+ }
> fact(5)
[1] 120

> tosscoin = function() {
+   coin = "tails"
+   count = -1
+   while (coin == "tails") {
+     coin = sample(c("heads", "tails"), 1)
+     count = count + 1
+   }
+   cat("There were", count, "tails before the first heads\n")
+ }
> tosscoin()
There were 2 tails before the first heads
> tosscoin()
There were 2 tails before the first heads
> tosscoin()
There were 0 tails before the first heads
> tosscoin()
There were 1 tails before the first heads
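One detail of for loops over 1:x is worth knowing: when x is 0, 1:0 is the vector c(1, 0), so the loop body runs twice and the factorial comes out as 0. A common way around this is seq_len(x), which is empty when x is 0. The sketch below is a variant of fact using it, compared against R's built-in factorial().

```r
fact <- function(x) {
  ret <- 1
  for (i in seq_len(x)) {    # seq_len(0) is empty, so the loop is skipped for x = 0
    ret <- ret * i
  }
  ret
}

print(fact(5))
print(fact(0))
print(fact(10) == factorial(10))
```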

An example: Generating random mixtures of 12 Gaussians

numClass <- 12
allData = array(0, dim=c(numClass,100,2))  # unused slots stay at 0
n = array(0, dim=numClass)
mean = array(0, dim=c(numClass,2))
var = array(0, dim=c(numClass, 2))
x1 <- c()
x2 <- c()
class <- c()
# Generate classes
for (i in 1:numClass){
  n[i] <- round(runif(1,10,100))           # class size between 10 and 100
  mean[i,1] <- runif(1,0,1)
  mean[i,2] <- runif(1,0,1)
  var[i,1] <- runif(1,.01,.07)
  var[i,2] <- runif(1,.01,.07)
  # note: the third argument of rnorm() is a standard deviation,
  # so the values in var are used as standard deviations here
  comp1 <- rnorm(n[i], mean[i,1], var[i,1])
  comp2 <- rnorm(n[i], mean[i,2], var[i,2])
  cl <- rep(i-1, n[i])
  x1 <- c(x1,comp1)
  x2 <- c(x2,comp2)
  class <- c(class, cl)
  for (j in 1:n[i]){
    allData[i,j,1] <- comp1[j]
    allData[i,j,2] <- comp2[j]
  }
}
plot(allData[,,1],allData[,,2])
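The same kind of data can also be collected in a data frame, which keeps each point together with its class label and avoids the zero-filled slots of the array. The following is only a sketch of that alternative, not the program used for the slides; the seed and the names num_class, parts and mix are assumptions made for illustration.

```r
set.seed(123)    # arbitrary seed, used only to make the sketch reproducible
num_class <- 12

# Build one small data frame per class, then stack them
parts <- lapply(seq_len(num_class), function(i) {
  n_i <- round(runif(1, 10, 100))    # class size between 10 and 100
  mu  <- runif(2, 0, 1)              # centre of the class in each coordinate
  s   <- runif(2, 0.01, 0.07)        # standard deviation in each coordinate
  data.frame(x1    = rnorm(n_i, mu[1], s[1]),
             x2    = rnorm(n_i, mu[2], s[2]),
             class = i - 1)
})
mix <- do.call(rbind, parts)

print(table(mix$class))    # number of points generated for each class
```

Calling plot(mix$x1, mix$x2, col = mix$class + 1) would then draw the points with one colour index per class.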

Examples of data sets generated by the previous program

[Figures: scatter plots of the data sets generated in two runs of the program (Run 1 and Run 2)]

To Learn More about R

• Go to: http://cran.r-project.org/doc/manuals/R-intro.html

• Some features of R not discussed here:

– Object-oriented programming facilities
– Matrix facilities (matrix multiplication, linear equations and inversion, eigenvalues and eigenvectors, singular value decomposition and determinants, least squares fitting and the QR decomposition)
– Statistical models (t-test, ANOVA, …)
– Advanced graphical procedures
– Packages (many highly useful community-contributed packages)
– And much more…
