An Overview of R Sources: Using R for Introductory Statistics, John Verzani, Chapman & Hall/CRC 1

Upload: thomas-martin

Post on 29-Dec-2015

An Overview of R

Sources: Using R for Introductory Statistics, John Verzani, Chapman & Hall/CRC

What is R?

• R is a computer language for statistical computing similar to the S language. It is an alternative to Matlab.

• R is open-source software and is part of the GNU project.

• The R home page is at: http://www.r-project.org. You can download your copy there.

Starting R

Double-click on the R icon in Windows.

R version 3.1.0 (2014-04-10) -- "Spring Dance"
Copyright (C) 2014 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

>

Using R as a calculator

> 2+2
[1] 4
> 2^4 # exponentiation
[1] 16
> (1-2)*3
[1] -3
> 1-2*3 # the usual precedence laws
[1] -5

> sqrt(2) # the square root
[1] 1.414214
> sin(pi) # the sine function
[1] 1.224647e-16 # this is 0, up to rounding error
> exp(1) # this is exp(x) = e^x
[1] 2.718282
> log(10) # the log base e
[1] 2.302585

Note: Small rounding errors, such as 1.224647e-16 instead of 0, are a normal consequence of floating-point arithmetic and appear in R as in any other numerical software.
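
Because of such rounding errors, testing a computed value for exact equality with == can give a surprising answer. The sketch below is not from the slides; it shows one common way to compare within a tolerance, using the base R function all.equal(). The variable names are only illustrative.

```r
# Exact comparison fails because sin(pi) is only approximately zero.
x <- sin(pi)
exact_match <- (x == 0)                  # FALSE
near_match  <- isTRUE(all.equal(x, 0))   # TRUE: equal within a small tolerance

# The same effect appears with ordinary decimal fractions.
sum_exact <- (0.1 + 0.2 == 0.3)                 # FALSE
sum_near  <- isTRUE(all.equal(0.1 + 0.2, 0.3))  # TRUE

print(c(exact_match, near_match, sum_exact, sum_near))
```

all.equal() returns a character description rather than FALSE when the values differ, which is why it is wrapped in isTRUE() here.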

Changing the default Behaviour

• Many functions have extra arguments that allow us to change the default behaviour.

• To see the arguments of a function, type help(<fn>), ?<fn> or ?"<fn>". This opens an HTML help page with a description, examples, etc.

> log(10,10)
[1] 1
> log(10, base=10)
[1] 1
> help(log)
starting httpd help server ... done
> ?log
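
As a small illustration of the point about default behaviour, the following sketch (not part of the original slides) calls log() and round() with and without their optional arguments. Only base R functions are used; the variable names are made up for the example.

```r
# Positional and named arguments give the same result.
by_position <- log(100, 10)          # base passed by position
by_name     <- log(100, base = 10)   # base passed by name

# Omitting 'base' falls back to the default, the natural logarithm.
natural <- log(exp(2))

# round() defaults to digits = 0; naming the argument overrides that.
rounded_default <- round(pi)
rounded_2       <- round(pi, digits = 2)

print(c(by_position, by_name, natural, rounded_default, rounded_2))
```

Naming the argument, as in base = 10, usually makes a call easier to read than relying on position.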

Other ways to get help

> help.search("log")
> apropos("log")
 [1] ".__C__logical"         ".__C__logLik"          ".__T__Logic:base"
 [4] "as.data.frame.logical" "as.logical"            "as.logical.factor"
 [7] "dlogis"                "is.logical"            "log"
[10] "log10"                 "log1p"                 "log2"
[13] "logb"                  "Logic"                 "logical"
[16] "logLik"                "loglin"                "plogis"
[19] "qlogis"                "rlogis"                "SSlogis"
[22] "winDialog"             "winDialogString"
>

Warnings and Errors

> squareroot(2)
Error: could not find function "squareroot"
> sqrt 2
Error: unexpected numeric constant in "sqrt 2"
> sqrt(2
+ )
[1] 1.414214
>

> sqrt(-2)
[1] NaN
Warning message:
In sqrt(-2) : NaNs produced

The +, like the > prompts, was not typed; R prints it as a continuation prompt when a command is incomplete.

Assignment: =, <-, <<-

> x = 2
> x + 3
[1] 5
> pi
[1] 3.141593
> e^2
Error: object 'e' not found
> e = exp(1)
> e^2
[1] 7.389056

> x <- 2
> x
[1] 2
> x <<- 17
> x
[1] 17
>
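
At the prompt, = , <- and <<- all appear to do the same thing. The difference shows up inside a function. The following sketch is an illustration added here, with hypothetical function names set_local and set_outer; it is not taken from the slides.

```r
x <- 1

# '<-' inside a function creates a local variable; the outer x is untouched.
set_local <- function() {
  x <- 2
  invisible(x)
}

# '<<-' assigns to x in an enclosing environment instead.
set_outer <- function() {
  x <<- 3
  invisible(x)
}

set_local()
after_local <- x   # still 1

set_outer()
after_outer <- x   # now 3

print(c(after_local, after_outer))
```

For everyday work at the prompt, <- (or =) is the usual choice; <<- is best kept for the rare cases where a function really must modify a variable outside itself.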

Variable Names

> x = 2
> n = 25
> N = 17
> n
[1] 25
> N
[1] 17
>

Note: Names are case-sensitive, so n and N are different variables.

> a.really.long.number = 123456789
> a.really.long.number
[1] 123456789
> AReallySmallNumber = 0.000000001
> AReallySmallNumber
[1] 1e-09
>

Data Vectors: c()

• A data set contains many observations, e.g., the number of whale beachings per year in Texas for each year from 1990 to 1999. c() combines such values into a data vector.

• c() can also combine vectors

> whales = c(74, 122, 235, 111, 292, 111, 211, 133, 156, 79)
> whales
 [1]  74 122 235 111 292 111 211 133 156  79

> x = c(1, 2)
> y = c(3, 4)
> c(x,y)
[1] 1 2 3 4
>

Data Vectors: type, named entries

All entries of a vector must have the same type; mixed entries are automatically converted to character.

> simpsons = c("Homer", "Marge", "Bart", "Lisa", "Maggie")
> simpsons
[1] "Homer"  "Marge"  "Bart"   "Lisa"   "Maggie"
> mixed = c(-1.5, "a", 9, 3e-2, 'hi')
> mixed
[1] "-1.5" "a"    "9"    "0.03" "hi"

Entries can be named

> names(simpsons) = c("dad", "mom", "son", "daughter 1", "daughter 2")
> names(simpsons)
[1] "dad"        "mom"        "son"        "daughter 1" "daughter 2"
> simpsons
       dad        mom        son daughter 1 daughter 2 
   "Homer"    "Marge"     "Bart"     "Lisa"   "Maggie" 
>
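
The automatic conversion follows a fixed order: logical values become integers, integers become doubles, and anything combined with a string becomes character. A short sketch of this, added for illustration (the variable names are arbitrary):

```r
# Each pair is coerced to the more general of the two types.
class_lgl_int <- class(c(TRUE, 2L))     # "integer"
class_int_dbl <- class(c(2L, 3.5))      # "numeric"
class_dbl_chr <- class(c(3.5, "a"))     # "character"

# Converting back only works for entries that look like numbers;
# the others become NA (R would also issue a warning, suppressed here).
mixed <- c(-1.5, "a", 9, 3e-2, "hi")
back  <- suppressWarnings(as.numeric(mixed))

print(c(class_lgl_int, class_int_dbl, class_dbl_chr))
print(back)
```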

Using functions on Data Vectors

> whales
 [1]  74 122 235 111 292 111 211 133 156  79
> sum(whales)
[1] 1524
> length(whales)
[1] 10
> sum(whales)/length(whales)
[1] 152.4
> mean(whales)
[1] 152.4
> sort(whales)
 [1]  74  79 111 111 122 133 156 211 235 292

> min(whales)
[1] 74
> max(whales)
[1] 292
> range(whales)
[1]  74 292
> diff(whales)
[1]   48  113 -124  181 -181  100  -78   23  -77
> cumsum(whales)
 [1]   74  196  431  542  834  945 1156 1289 1445 1524

Note: diff computes the successive differences in the data vector

Vectorization of Functions

> whales.tex = whales
> whales.fla = c(89, 254, 306, 292, 274, 233, 294, 204, 204, 90)
> whales.tex + whales.fla
 [1] 163 376 541 403 566 344 505 337 360 169
> whales.tex - whales.fla
 [1]  -15 -132  -71 -181   18 -122  -83  -71  -48  -11
> whales.tex - mean(whales.tex)
 [1] -78.4 -30.4  82.6 -41.4 139.6 -41.4  58.6 -19.4   3.6 -73.4

In the last example, a single number gets subtracted from each entry of the vector.
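
Subtracting a single number from a vector is a special case of what R calls recycling: the shorter operand is repeated until it matches the length of the longer one. The sketch below is an added illustration with made-up vectors, not part of the slides.

```r
short <- c(10, 20)
long  <- c(1, 2, 3, 4)

# The single number 10 is recycled across all four entries.
plus_scalar <- long + 10        # 11 12 13 14

# A length-2 vector is recycled twice to match length 4.
plus_short <- long + short      # 11 22 13 24

print(plus_scalar)
print(plus_short)
```

If the longer length is not a multiple of the shorter one, R still recycles but issues a warning, which usually signals a mistake.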

Creating Structured Data: simple and arithmetic sequences

> 1:10
 [1]  1  2  3  4  5  6  7  8  9 10
> rev(1:10) # countdown
 [1] 10  9  8  7  6  5  4  3  2  1
> 10:1 # counts down because 10 > 1
 [1] 10  9  8  7  6  5  4  3  2  1
> a = 1 ; h = 4 ; n = 5 ; # use ; to separate commands
> a + h * ( 0: (n - 1))
[1]  1  5  9 13 17
> a + h * ( 0: n - 1) # note: 0:(n-1) is not 0:n-1
[1] -3  1  5  9 13 17

Creating Structured Data: more arithmetic sequences

> seq(1,9,by=2)
[1] 1 3 5 7 9
> seq(1,10,by=2)
[1] 1 3 5 7 9
> seq(1,9,length=5)
[1] 1 3 5 7 9
> seq(1,9,length=9)
[1] 1 2 3 4 5 6 7 8 9
> seq(1,9,length=17)
 [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5 9.0
>
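
Two details about these calls may be worth spelling out. The argument written as length above is a shortened form of seq()'s full argument name, length.out. And when a sequence 1, 2, ..., n is needed for a possibly zero n, the base function seq_len() behaves more predictably than 1:n. A minimal sketch, added for illustration:

```r
# 'length' in seq(1, 9, length = 5) is matched to the full name length.out.
by_count <- seq(1, 9, length.out = 5)   # 1 3 5 7 9

# seq_len(n) is like 1:n, but gives an empty vector when n is 0,
# whereas 1:0 counts down and gives the two values 1 and 0.
n <- 0
colon_form   <- 1:n          # 1 0
seq_len_form <- seq_len(n)   # integer(0)

print(by_count)
print(colon_form)
print(seq_len_form)
```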

Creating Structured Data: Repeated numbers

> rep(1,10)
 [1] 1 1 1 1 1 1 1 1 1 1
> rep(1:3,3)
[1] 1 2 3 1 2 3 1 2 3
> rep(c("long","short"), c(1,2))
[1] "long"  "short" "short"
>
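
The second argument of rep() is called times; a second named argument, each, repeats every element in place instead of repeating the whole vector. The following sketch (an added illustration) puts the three forms side by side.

```r
# Repeat the whole vector twice.
whole_twice <- rep(1:3, times = 2)        # 1 2 3 1 2 3

# Repeat each element twice before moving on to the next.
each_twice <- rep(1:3, each = 2)          # 1 1 2 2 3 3

# A vector of counts repeats element i the i-th number of times.
per_element <- rep(c("long", "short"), times = c(1, 2))  # "long" "short" "short"

print(whole_twice)
print(each_twice)
print(per_element)
```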

Accessing Data by using Indices

> ebay = c(88.8, 88.3, 90.2, 93.5, 95.2, 94.7, 99.2, 99.4, 101.6)
> length(ebay)
[1] 9
> ebay[1]
[1] 88.8
> ebay[9]
[1] 101.6
> ebay[length(ebay)]
[1] 101.6
> ebay[1:4]
[1] 88.8 88.3 90.2 93.5
> ebay[c(1,5,9)]
[1]  88.8  95.2 101.6

Negative indices and names

> ebay[-1] # all but the first
[1]  88.3  90.2  93.5  95.2  94.7  99.2  99.4 101.6
> ebay[-(1:4)] # all but the 1st - 4th
[1]  95.2  94.7  99.2  99.4 101.6

> x = 1:3
> names(x) = c("one", "two", "three") # set the names
> x["one"]
one 
  1 
>

Assigning Values to Data Vectors

> ebay[1] = 88.0
> ebay
[1]  88.0  88.3  90.2  93.5  95.2  94.7  99.2  99.4 101.6
> ebay[10:13] = c(97.0, 99.3, 102.0, 101.8)
> ebay
 [1]  88.0  88.3  90.2  93.5  95.2  94.7  99.2  99.4 101.6  97.0  99.3 102.0
[13] 101.8
>

Logical Values

> ebay > 100
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE
[13]  TRUE
> which(ebay > 100)
[1]  9 12 13
> ebay[which(ebay > 100)]
[1] 101.6 102.0 101.8
> sum(ebay > 100) # How many values are bigger than 100
[1] 3
> sum(ebay > 100)/length(ebay) # Proportion of values bigger than 100
[1] 0.2307692
>
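
The same results can be obtained a little more directly, because a logical vector can itself be used as an index and because TRUE counts as 1 in arithmetic. A short sketch, added as an illustration and using the ebay values from the previous slides:

```r
ebay <- c(88.0, 88.3, 90.2, 93.5, 95.2, 94.7, 99.2, 99.4, 101.6,
          97.0, 99.3, 102.0, 101.8)

# A logical vector can index directly; which() is not required.
above <- ebay[ebay > 100]        # 101.6 102.0 101.8

# mean() of a logical vector is the proportion of TRUE values,
# the same number as sum(ebay > 100) / length(ebay).
prop_above <- mean(ebay > 100)

print(above)
print(prop_above)
```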

Reading in other sources of data

> library()                   # List all the installed packages
> data()                      # List all available data sets in loaded packages
> data(package="MASS")        # List all the data sets in package MASS
> library(MASS)               # Load the package MASS
> install.packages("RWeka")   # Install the package named "RWeka"
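
Once a package is loaded, one of its data sets can be brought into the workspace by name. The sketch below is an added illustration; it uses faithful, a data set that ships with R's built-in datasets package, so that nothing needs to be installed. The variable names dims and cols are arbitrary.

```r
data(faithful)                 # copy the data set into the workspace

dims <- dim(faithful)          # number of rows and columns
cols <- names(faithful)        # column names

print(dims)
print(cols)
print(head(faithful, 3))       # first three observations
```

Typing ?faithful opens the help page that documents where the data come from and what each column means.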

Basic Charts

> sales = c(15, 34, 5)
> names(sales) = c("Mary", "Peter", "Paul")
>
> barplot(sales, main="Sales", ylab="Thousands")
>
> pie(sales, main="Sales")

Basic Statistical Analysis: mean, variance, histograms, densities

> mean(whales.tex)
[1] 152.4
> var(whales.tex)
[1] 5113.378
> mean(whales.fla)
[1] 224
> var(whales.fla)
[1] 6301.111
> hist(whales.tex, prob=TRUE, main="Whales, Texas")
> lines(density(whales.tex))
> hist(whales.fla, prob=TRUE, main="Whales, Florida")
> lines(density(whales.fla))
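
To see what var() computes, the sample variance can be rebuilt from the vectorized operations introduced earlier: subtract the mean, square, sum, and divide by n - 1. This sketch is an added illustration; var_by_hand and sd_by_hand are names chosen for the example.

```r
whales.tex <- c(74, 122, 235, 111, 292, 111, 211, 133, 156, 79)

n   <- length(whales.tex)
dev <- whales.tex - mean(whales.tex)     # deviations from the mean

# Sample variance: sum of squared deviations divided by n - 1.
var_by_hand <- sum(dev^2) / (n - 1)

# The standard deviation is its square root, as returned by sd().
sd_by_hand <- sqrt(var_by_hand)

print(c(var_by_hand, var(whales.tex)))
print(c(sd_by_hand, sd(whales.tex)))
```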

Basic Statistical Analysis: Bivariate data I

> allwhales <- rbind(whales.tex, whales.fla)
> colnames(allwhales) = c(1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999)
> allwhales
           1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
whales.tex   74  122  235  111  292  111  211  133  156   79
whales.fla   89  254  306  292  274  233  294  204  204   90
> barplot(allwhales, beside=TRUE, legend.text=TRUE, main="Whales")

Basic Statistical Analysis: Bivariate data II

> boxplot(whales.tex, whales.fla, main="Whales", names=c("Texas", "Florida"))

Statistical Review

• A Bernoulli random variable X is one that has only two values: 0 or 1. The distribution of X is characterized by p = P(X = 1) [e.g., if we toss a coin and let X be 1 if a head occurs, then X is a Bernoulli random variable with p = 0.5 if the coin is fair].

• A binomial random variable X counts the number of successes in n independent Bernoulli trials with the same success probability. There are 2 parameters that describe the distribution of X: the number of trials, n, and the success probability, p [e.g., if we toss a fair coin 10 times, the number of heads X has a Binomial(10, 0.5) distribution].

• The central limit theorem states that, for independent draws from a parent population with mean μ and finite standard deviation σ, the standardised sample mean (x̄ − μ)/(σ/√n) is approximately standard normal for a large enough n.

• For the binomial distribution, the rule of thumb is that when np and n(1 − p) are both greater than 5, the normal approximation with mean np and standard deviation √(np(1 − p)) is reasonable.
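
The rule of thumb can be checked numerically without any simulation by comparing an exact binomial probability with its normal approximation. The sketch below is an added illustration: the values n = 30, p = 0.5 and k = 12 are arbitrary choices, and the + 0.5 is the usual continuity correction, which the slides do not mention.

```r
n <- 30; p <- 0.5            # n*p = 15 and n*(1-p) = 15, both above 5
k <- 12

# Exact probability P(X <= k) for X ~ Binomial(n, p).
exact <- pbinom(k, n, p)

# Normal approximation with mean n*p and sd sqrt(n*p*(1-p));
# k + 0.5 is a continuity correction (an assumption of this sketch).
approx <- pnorm(k + 0.5, mean = n * p, sd = sqrt(n * p * (1 - p)))

cat("exact:", exact, " normal approximation:", approx, "\n")
```

Repeating the comparison with a small n, or with p close to 0 or 1, shows the two numbers drifting apart.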

Simulation: the normal approximation for the binomial

> m = 200; p = 1/2
> n = 5
> res = rbinom(m, n, p)
> res
  [1] 3 4 3 0 3 1 2 3 2 5 2 2 4 4 3 2 2 2 2 2 3 3 2 3 2 2 1 2 2 3 2 4 3 3 3 1 4
 [38] 1 2 2 2 4 2 2 4 4 2 0 2 3 3 1 3 2 3 2 0 2 2 2 5 2 3 2 1 1 2 2 2 4 4 1 2 2
…
> hist(res, prob=TRUE, main="n=5")
> curve(dnorm(x, n*p, sqrt(n*p*(1-p))), add=TRUE)
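
The normal curve drawn by curve() uses the mean n*p and the standard deviation sqrt(n*p*(1-p)). A quick way to see that these are the right values is to compare them with the mean and standard deviation of the simulated sample for several n. The following is an added sketch; the seed, the larger m and the particular values of n are arbitrary choices, not from the slides.

```r
set.seed(1)            # arbitrary seed, only to make the run repeatable
m <- 10000; p <- 1/2

for (n in c(5, 15, 25)) {
  res <- rbinom(m, n, p)   # m simulated Binomial(n, p) values
  cat("n =", n,
      " simulated mean:", mean(res), " n*p:", n * p,
      " simulated sd:", sd(res),
      " sqrt(n*p*(1-p)):", sqrt(n * p * (1 - p)), "\n")
}
```

Replacing the cat() call with the hist() and curve() commands above gives one picture per value of n.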

Programming in R: Functions

> hello = function(){cat("hello world\n")}
> hello()
hello world
> hello = edit(hello)
> hello()
hi world
> hello = function(x){cat("hello", x, "\n")}
> hello("kitty")
hello kitty 
> hello = function(x="everyone"){cat("hello", x, "\n")}
> hello("kitty")
hello kitty 
> hello()
hello everyone 

• edit() opens a new window and allows the programmer to make changes in that window. Upon closing the window, the programmer is asked if the changes should be saved.

• Default values can be specified for each argument when the function is defined.

Programming in R: if, switch

> abs = function(x) {
+   if (x < 0) {
+     return(-x)
+   } else {
+     return(x)
+   }
+ }
> abs(-.2)
[1] 0.2
> abs(3)
[1] 3
> abs("hi")
[1] "hi"
>

> distrib = function(x, type){}
> distrib = edit(distrib)   # in the editor, change the definition to:
function(x, type) {
  switch(type,
         mean = mean(x),
         variance = var(x),
         stdev = sd(x))
}
> x = rnorm(10000, 3, 2)
> distrib(x, "mean")
[1] 2.952066
> distrib(x, "stdev")
[1] 1.993793
> distrib(x, "variance")   # the name must match a switch label exactly
[1] 3.975209
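
When type matches none of the labels, switch() silently returns NULL, so a misspelt label can go unnoticed. One way to guard against this is to add an unnamed final argument, which switch() treats as the default. The sketch below is an added variation on distrib, with a small made-up data vector; it is not the version from the slides.

```r
distrib <- function(x, type) {
  switch(type,
         mean     = mean(x),
         variance = var(x),
         stdev    = sd(x),
         stop("unknown type: ", type))   # unnamed last argument is the default
}

vals <- c(2, 4, 4, 4, 5, 5, 7, 9)

m <- distrib(vals, "mean")               # 5

# An unrecognised label now raises an error instead of returning NULL;
# tryCatch() is used here only to capture the message.
err <- tryCatch(distrib(vals, "var"),
                error = function(e) conditionMessage(e))

print(m)
print(err)
```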

Programming in R: for, while

> fact = function(x){
+   ret = 1
+   for (i in 1:x) {
+     ret = ret * i
+   }
+   return(ret)
+ }
> fact(5)
[1] 120
>
> tosscoin = function() {
+   coin = "tails"
+   count = -1
+   while (coin == "tails") {
+     coin = sample(c("heads", "tails"), 1)
+     count = count + 1
+   }
+   cat("There were", count, "tails before the first heads\n")
+ }
> tosscoin()
There were 2 tails before the first heads
> tosscoin()
There were 2 tails before the first heads
> tosscoin()
There were 0 tails before the first heads
> tosscoin()
There were 1 tails before the first heads
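
One edge case of the fact function above is worth noting: for x = 0 the loop runs over 1:0, that is over the values 1 and 0, and the result is 0 rather than 1. A possible variant, sketched here as an added illustration, loops over seq_len(x), which is empty when x is 0. The name fact2 is made up to keep it apart from the original.

```r
# Variant of fact() that loops over seq_len(x), so the loop body
# is skipped entirely when x is 0.
fact2 <- function(x) {
  ret <- 1
  for (i in seq_len(x)) {
    ret <- ret * i
  }
  ret
}

f0 <- fact2(0)    # 1
f5 <- fact2(5)    # 120

# Base R already provides the same computation as factorial().
print(c(f0, f5, factorial(5)))
```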

An example: Generating random mixtures of 12 Gaussians

numClass <- 12
allData = array(0, dim=c(numClass,100,2))   # up to 100 points per class, 2 coordinates
n = array(0, dim=numClass)                  # number of points in each class
mean = array(0, dim=c(numClass,2))
var = array(0, dim=c(numClass, 2))
x1 <- c()
x2 <- c()
class <- c()
# Generate Classes
for (i in 1:numClass){
  n[i] <- round(runif(1,10,100))
  mean[i,1] <- runif(1,0,1)
  mean[i,2] <- runif(1,0,1)
  var[i,1] <- runif(1,.01,.07)
  var[i,2] <- runif(1,.01,.07)
  # note: the third argument of rnorm() is a standard deviation,
  # so the values stored in var are used as standard deviations here
  comp1 <- rnorm(n[i], mean[i,1], var[i,1])
  comp2 <- rnorm(n[i], mean[i,2], var[i,2])
  cl <- rep(i-1, n[i])                      # class labels 0, 1, ..., 11
  x1 <- c(x1,comp1)
  x2 <- c(x2,comp2)
  class <- c(class, cl)
  for (j in 1:n[i]){
    allData[i,j,1] <- comp1[j]
    allData[i,j,2] <- comp2[j]
  }
}
# unused slots of allData stay at (0, 0) and are plotted too
plot(allData[,,1],allData[,,2])

Examples of data sets generated by the previous program

[Figure: two scatter plots, labelled "Example: Run 1" and "Example: Run 2"]

To Learn More about R

• Go to: http://cran.r-project.org/doc/manuals/R-intro.html

• Some features of R not discussed here:

– Object Oriented Programming facilities

– Matrix facilities (Matrix multiplication, Linear equations and inversion, Eigenvalues and eigenvectors, Singular value decomposition and determinants, Least squares fitting and the QR decomposition)

– Statistical Models (t-test, ANOVA…)

– Advanced Graphical procedures

– Packages (many community generated highly useful packages)

– And much more…