
Introduction to R: Part III
Statistics and Linear Modelling

Alexandre Perera i Lluna 1,2

1 Centre de Recerca en Enginyeria Biomèdica (CREB)
  Departament d'Enginyeria de Sistemes, Automàtica i Informàtica Industrial (ESAII)
  Universitat Politècnica de Catalunya
  mailto:[email protected]

2 Centro de Investigación Biomédica en Red en Bioingeniería, Biomateriales y Nanomedicina (CIBER-BBN)

Jan 2011 / Introduction to R
Universitat Rovira i Virgili


Contents

1 Statistics
    Univariate Data
    Bivariate Data
    Multivariate Data

2 Tests
    Hypothesis tests
    Two population tests
    t-Tests
    χ²-Tests
    Durbin-Watson
    Graphical tests

3 Linear regression
    Linear models
    Regression analysis
    Multivariate regression
    Variance analysis


Mean, standard deviation, variance: mean(), sd()

Let's define a random variable with a normal distribution:

> x <- rnorm(100, mean = 2, sd = 0.5)
> mean(x)
[1] 2.016474
> median(x)
[1] 1.996165
> sd(x)
[1] 0.4814775
> var(x)
[1] 0.2318206
> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.068   1.638   1.996   2.016   2.347   3.446


Quantiles

quantile() computes any quantile between 0 and 1:

> quantile(x, 0.25)
     25%
1.637719
> quantile(x, c(0.1, 0.9))
     10%      90%
1.429676 2.663940

IQR(): the difference between the 1st and 3rd quartiles

> IQR(x)
[1] 0.7096768

cut() categorizes continuous variables:

> summary(cut(x, c(min(x), mean(x), quantile(x, 0.75), max(x))))
(1.07,2.02] (2.02,2.35] (2.35,3.45]        NA's
         50          24          25           1


Histograms

> cuts <- quantile(x, seq(0, 1, 0.1))
> hist(x, breaks = cuts)
> rug(x)

[Figure: histogram of x on the density scale, with rug marks along the axis]


boxplots

> boxplot(x, horizontal = TRUE, col = "pink", xlab = "cm", main = "Oscillation")

[Figure: horizontal boxplot titled "Oscillation", x-axis in cm]


Density

> cortes <- quantile(x, seq(0, 1, 0.1))
> hist(x, breaks = cortes)
> rug(x)
> lines(density(x, bw = "SJ"), col = "red")

[Figure: histogram of x with a kernel density estimate overlaid in red]


Factors

> language <- as.factor(c("french", "french", "german", "german",
+     "english", "german", "french", "english", "french", "german"))
> gender <- as.factor(c("man", "woman", "woman", "woman", "woman",
+     "woman", "man", "woman", "man", "man"))
> table(gender, language)
        language
gender   english french german
  man          0      3      1
  woman        2      1      3
> plot(table(language, gender), col = c("pink", "blue"))

[Figure: mosaic plot of table(language, gender)]


barplot

> barplot(table(language, gender), col = c("pink", "blue", "green"),
+     legend.text = levels(language))

[Figure: stacked barplot of language counts for man and woman, with a legend for english, french, german]


stripchart

1D plots, an alternative to boxplot() in certain cases:

> attach(iris)
> stripchart(Sepal.Length ~ Species)

[Figure: strip chart of Sepal.Length for setosa, versicolor and virginica]

> boxplot(Sepal.Length ~ Species)
> detach(iris)

[Figure: boxplots of Sepal.Length for setosa, versicolor and virginica]


Formula notation in R

R supports a formula notation for variables that are named (names()).

In the previous example: Sepal.Length ~ Species

Formulas in R:

    variable ~ group

read as "variable by group".

This notation is used consistently throughout most of R.


Formula Notation in R, II

Formulas:

    response ~ model

See help(formula).

    log(Sepal.Length) ~ Species

Arithmetic expressions are allowed:

    I(Sepal.Length + Petal.Length) ~ Species

Heavily used in linear regression, but also in visualization functions.
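As a small illustrative sketch (not on the original slide), both formulas above can be passed directly to lm() on the built-in iris data; I() keeps the "+" as an arithmetic sum instead of adding another model term:

> lm(log(Sepal.Length) ~ Species, data = iris)
> lm(I(Sepal.Length + Petal.Length) ~ Species, data = iris)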


Contingency tables

> data(UCBAdmissions)
> UCBAdmissions[, , 1:2]
, , Dept = A

          Gender
Admit      Male Female
  Admitted  512     89
  Rejected  313     19

, , Dept = B

          Gender
Admit      Male Female
  Admitted  353     17
  Rejected  207      8

> DF <- as.data.frame(UCBAdmissions)
> head(DF)
     Admit Gender Dept Freq
1 Admitted   Male    A  512
2 Rejected   Male    A  313
3 Admitted Female    A   89
4 Rejected Female    A   19
5 Admitted   Male    B  353
6 Rejected   Male    B  207

xtabs() builds contingency tables over multiple factors; it is commonly used on data frames (data.frame).

> xtabs(Freq ~ Gender + Admit, DF)
        Admit
Gender   Admitted Rejected
  Male       1198     1493
  Female      557     1278

> summary(xtabs(Freq ~ ., DF))
Call: xtabs(formula = Freq ~ ., data = DF)
Number of cases in table: 4526
Number of factors: 3
Test for independence of all factors:
        Chisq = 2000.3, df = 16, p-value = 0
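A follow-up sketch (my addition, not on the slide): the independence of Gender and Admit alone can be tested by handing the two-way table straight to chisq.test():

> chisq.test(xtabs(Freq ~ Gender + Admit, DF))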


score plots I

> def.par <- par(no.readonly = TRUE)
> data(iris)
> xhist <- hist(iris$Petal.Length, plot = FALSE)
> yhist <- hist(iris$Sepal.Length, plot = FALSE)
> top <- max(c(xhist$counts, yhist$counts))
> xrange <- range(iris$Petal.Length)
> yrange <- range(iris$Sepal.Length)
> nf <- layout(matrix(c(2, 0, 1, 3), 2, 2, byrow = TRUE), c(3, 1), c(1, 3), TRUE)
> layout.show(nf)
> par(mar = c(3, 3, 1, 1))
> plot(iris$Petal.Length, iris$Sepal.Length, xlim = xrange, ylim = yrange,
+     xlab = "", ylab = "")
> par(mar = c(0, 3, 1, 1))
> barplot(xhist$counts, axes = FALSE, ylim = c(0, top), space = 0)
> par(mar = c(3, 0, 1, 1))
> barplot(yhist$counts, axes = FALSE, xlim = c(0, top), space = 0, horiz = TRUE)
> par(def.par)


score plots II

[Figure: scatterplot of iris$Petal.Length against iris$Sepal.Length with marginal histograms, produced by the layout code above]


lattice

library(lattice)
xyplot(Sepal.Length ~ Sepal.Width | Species, data = iris)

[Figure: lattice xyplot of Sepal.Length against Sepal.Width, one panel per Species (setosa, versicolor, virginica)]


pairs()

> panel.hist <- function(x, ...) {
+     usr <- par("usr")
+     on.exit(par(usr))
+     par(usr = c(usr[1:2], 0, 1.5))
+     h <- hist(x, plot = FALSE, breaks = 10)
+     breaks <- h$breaks
+     nB <- length(breaks)
+     y <- h$counts
+     y <- y/max(y)
+     rect(breaks[-nB], 0, breaks[-1], y, col = "blue", ...)
+ }
> pairs(iris[, c(1:4)], panel = panel.smooth,
+     cex = 1.5, pch = 21, bg = as.numeric(iris$Species),
+     diag.panel = panel.hist)

[Figure: pairs() scatterplot matrix of Sepal.Length, Sepal.Width, Petal.Length and Petal.Width, with smoothed panels, points coloured by Species and histograms on the diagonal]


dpq-functions

pnorm(q) returns the probability that a random variable takes a value lower than q (larger than q with lower.tail = FALSE):

> pnorm(c(0, 1))
[1] 0.5000000 0.8413447
> pnorm(1, lower.tail = F)
[1] 0.1586553

qnorm(p) answers the inverse question: which value corresponds to a given probability? (e.g. 0.75 for Q3)

> qnorm(c(0.75, 0.841345))
[1] 0.6744898 1.0000010

dnorm(x) is the theoretical density function.

> curve(pnorm(x), -5, 5, col = "red", ylab = "", frame.plot = FALSE)
> curve(dnorm(x), -5, 5, col = "blue", add = TRUE)
> legend("topleft", legend = c("pnorm(x)", "dnorm(x)"), col = c("red", "blue"))

[Figure: pnorm(x) in red and dnorm(x) in blue over the range -5 to 5]
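A minimal sketch (my addition) rounding out the d/p/q/r family: rnorm() draws random values, and qnorm() is the inverse of pnorm():

> rnorm(5)                                                 # five random N(0,1) draws
> all.equal(qnorm(pnorm(c(-1, 0, 1.5))), c(-1, 0, 1.5))    # TRUE: q* inverts p*
> dnorm(0)                                                 # density at the mode, 1/sqrt(2*pi) = 0.3989423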


Standardization

With our random variable x (rnorm(), runif(), ...):

> x <- rnorm(100, mean = 2, sd = 0.5)
> z <- (x - 2)/0.5
> mean(z)
[1] 0.08849595
> sd(z)
[1] 0.986837

z-score:

> pnorm(z)[1:5]
[1] 0.1282827 0.7556697 0.2045399 0.4998171 0.1222717
> pnorm(x, mean = 2, sd = 0.5)[1:5]
[1] 0.1282827 0.7556697 0.2045399 0.4998171 0.1222717
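A side note (my addition, not on the slide): scale() standardizes with the sample mean and standard deviation rather than the known parameters used above:

> z2 <- as.vector(scale(x))   # (x - mean(x)) / sd(x), dropped back to a plain vector
> mean(z2); sd(z2)            # approximately 0, exactly 1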


t-test

t-statistic:

    t = (X̄ − µ) / (s / √n)

> x <- rnorm(100, mean = 2, sd = 0.5)
> t.test(x)

        One Sample t-test

data:  x
t = 38.0565, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 1.885099 2.092484
sample estimates:
mean of x
 1.988791

> x <- rnorm(100, mean = 0, sd = 0.5)
> t.test(x)

        One Sample t-test

data:  x
t = 0.5835, df = 99, p-value = 0.5609
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -0.06811229  0.12486603
sample estimates:
 mean of x
0.02837687
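A small sketch (my addition) evaluating the formula above by hand and comparing with t.test():

> (mean(x) - 0) / (sd(x) / sqrt(length(x)))   # the same t value as reported by t.test(x)
> unname(t.test(x)$statistic)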


Proportion test

In a poll ("yes"/"no"), 43 out of 100 say "yes". Is that 50% of the population? (two-sided alternative)

H0 (null hypothesis): p = 0.5
H1 (alternative hypothesis): p ≠ 0.5

> prop.test(43, 100, p = 0.5)

        1-sample proportions test with continuity correction

data:  43 out of 100, null probability 0.5
X-squared = 1.69, df = 1, p-value = 0.1936
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.3326536 0.5327873
sample estimates:
   p
0.43

> prop.test(430, 1000, p = 0.5)

        1-sample proportions test with continuity correction

data:  430 out of 1000, null probability 0.5
X-squared = 19.321, df = 1, p-value = 1.105e-05
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.3991472 0.4613973
sample estimates:
   p
0.43

With the same observed proportion but ten times the sample size, the null hypothesis is rejected.
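As an aside (my addition, not on the slide), binom.test() gives the exact binomial counterpart of the approximate test above:

> binom.test(43, 100, p = 0.5)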


Wilcoxon test

Rainfall distribution in Albacete (Spain). The distribution is asymmetric, so a t-test is not appropriate.

H0: µ = 5
H1: µ > 5

> x = c(12.8, 3.5, 2.9, 9.4, 8.7, 0.7, 0.2, 2.8, 1.9, 2.8, 3.1, 15.8)
> stem(x)

  The decimal point is 1 digit(s) to the right of the |

  0 | 01233334
  0 | 99
  1 | 3
  1 | 6

> wilcox.test(x, mu = 5, alt = "greater")

        Wilcoxon signed rank test with continuity correction

data:  x
V = 39, p-value = 0.5156
alternative hypothesis: true location is greater than 5

The null hypothesis is not rejected.


t-test for two populations

t-statistic:

    t = ((X̄1 − X̄2) − (µ1 − µ2)) / √(s1²/n1 + s2²/n2)

Assuming X1 and X2 are normally distributed.

Equal variances:

> x = c(15, 10, 13, 7, 9, 8, 21, 9, 14, 8)
> y = c(15, 14, 12, 8, 14, 7, 16, 10, 15, 12)
> t.test(x, y, alt = "less", var.equal = TRUE)

        Two Sample t-test

data:  x and y
t = -0.5331, df = 18, p-value = 0.3002
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
     -Inf 2.027436
sample estimates:
mean of x mean of y
     11.4      12.3

Unequal variances:

> t.test(x, y, alt = "less")

        Welch Two Sample t-test

data:  x and y
t = -0.5331, df = 16.245, p-value = 0.3006
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
     -Inf 2.044664
sample estimates:
mean of x mean of y
     11.4      12.3
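A quick sketch (my addition) evaluating the formula by hand for the unequal-variance case:

> (mean(x) - mean(y)) / sqrt(var(x)/length(x) + var(y)/length(y))   # -0.5331, the same t as above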


χ2-test

Provides a statistical test for categorical data.

χ²-test:

    χ² = Σ_{i=1..n} (fi − ei)² / ei     (fi observed, ei expected frequencies)

Assumption: every expected count is greater than 1 and at least 80% of the expected counts are greater than 5.

> freqs <- c(22, 21, 22, 27, 22, 36)
> probs <- rep(1/6, 6)
> chisq.test(freqs, p = probs)

        Chi-squared test for given probabilities

data:  freqs
X-squared = 6.72, df = 5, p-value = 0.2423

> freqs <- c(22, 31, 12, 37, 12, 36)
> probs <- rep(1/6, 6)
> chisq.test(freqs, p = probs)

        Chi-squared test for given probabilities

data:  freqs
X-squared = 25.92, df = 5, p-value = 9.248e-05
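A minimal sketch (my addition) evaluating the statistic by hand for the second set of frequencies:

> e <- sum(freqs) * probs       # expected counts under H0
> sum((freqs - e)^2 / e)        # 25.92, as reported by chisq.test()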


χ2-test II

Does a process follow a given distribution? (e.g. a die)

> freqs <- c(22, 21, 22, 27, 22, 36)
> probs <- rep(1/6, 6)
> chisq.test(freqs, p = probs)

        Chi-squared test for given probabilities

data:  freqs
X-squared = 6.72, df = 5, p-value = 0.2423

> freqs <- c(22, 31, 12, 37, 12, 36)
> probs <- rep(1/6, 6)
> chisq.test(freqs, p = probs)

        Chi-squared test for given probabilities

data:  freqs
X-squared = 25.92, df = 5, p-value = 9.248e-05


χ2-test III: homogeneity

Are two processes generated by the same distribution? (e.g. two dice, one fair and one loaded (ok/ko))

> dado.ok <- sample(1:6, 200, p = c(1, 1, 1, 1, 1, 1)/6, replace = T)
> dado.ko <- sample(1:6, 100, p = c(0.5, 0.5, 0.5, 0.5, 2, 2)/6, replace = T)
> freqs.ok <- table(dado.ok)
> freqs.ko = table(dado.ko)
> rbind(freqs.ok, freqs.ko)
          1  2  3  4  5  6
freqs.ok 29 25 42 39 40 25
freqs.ko  6 11  5 12 35 31

> chisq.test(rbind(freqs.ok, freqs.ko))

        Pearson's Chi-squared test

data:  rbind(freqs.ok, freqs.ko)
X-squared = 35.5763, df = 5, p-value = 1.154e-06

> dado.ok <- sample(1:6, 200, p = c(1, 1, 1, 1, 1, 1)/6, replace = T)
> dado.ko <- sample(1:6, 100, p = c(1.1, 1, 1, 1.1, 1, 1)/6, replace = T)
> freqs.ok <- table(dado.ok)
> freqs.ko = table(dado.ko)
> rbind(freqs.ok, freqs.ko)
          1  2  3  4  5  6
freqs.ok 35 33 38 32 37 25
freqs.ko 12 19 13 18 14 24

> chisq.test(rbind(freqs.ok, freqs.ko))

        Pearson's Chi-squared test

data:  rbind(freqs.ok, freqs.ko)
X-squared = 9.2915, df = 5, p-value = 0.09799


Durbin-Watson

Evaluates the Durbin-Watson statistic for autocorrelated errors.

durbin.watson() in library(car): Durbin-Watson Test for Autocorrelated Errors
dwtest() in library(lmtest): Durbin-Watson Test

> library(lmtest)
> err1 <- rnorm(100)
> x <- rep(c(-1, 1), 50)
> y1 <- 1 + x + err1
> dwtest(y1 ~ x)

        Durbin-Watson test

data:  y1 ~ x
DW = 1.8898, p-value = 0.3244
alternative hypothesis: true autocorrelation is greater than 0

> err2 <- filter(err1, 0.9, method = "recursive")
> y2 <- 1 + x + err2
> dwtest(y2 ~ x)

        Durbin-Watson test

data:  y2 ~ x
DW = 0.2426, p-value < 2.2e-16
alternative hypothesis: true autocorrelation is greater than 0
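A small sketch (my addition) computing the statistic directly from the residuals of the first model, using DW = Σ(e_t − e_{t-1})² / Σ e_t²:

> e <- resid(lm(y1 ~ x))
> sum(diff(e)^2) / sum(e^2)   # the same DW value reported by dwtest(y1 ~ x)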


Random numbers: quantile-quantile plots

> x = rnorm(100, 0, 1)
> qqnorm(x, main = "normal(0,1)")
> qqline(x)

[Figure: normal Q-Q plot of a normal(0,1) sample]

> x = rnorm(100, 10, 15)
> qqnorm(x, main = "normal(10,15)")
> qqline(x)

[Figure: normal Q-Q plot of a normal(10,15) sample]


Random numbers: quantile-quantile plots

> x = rexp(100, 1/10)
> qqnorm(x, main = "exponential mu=10")
> qqline(x)

[Figure: normal Q-Q plot of an exponential sample with mean 10]

> x = runif(100, 0, 1)
> qqnorm(x, main = "unif(0,1)")
> qqline(x)

[Figure: normal Q-Q plot of a uniform(0,1) sample]


Linear Models

Assume a response variable Y, dependent on three predictors:

    Y = f(X1, X2, X3) + ε

The simplest linear form is:

    Y = β0 + β1·X1 + β2·X2 + β3·X3 + ε

The predictors themselves need not enter linearly; the model only has to be linear in the parameters:

    Y = β0 + β1·X1 + β2·log(X2) + β3·X3·X1 + ε

On the other hand,

    Y = β0 + β1·X1^β2 + ε

is not linear.
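A hedged sketch (my own simulated data, not from the slides): the second model above can be handed to lm() because it is linear in the betas, while the power model needs a nonlinear fitter such as nls():

> set.seed(1)
> x1 <- runif(50, 1, 10); x2 <- runif(50, 1, 10); x3 <- runif(50, 1, 10)
> y <- 1 + 2*x1 + 3*log(x2) + 0.5*x3*x1 + rnorm(50)
> lm(y ~ x1 + log(x2) + I(x3*x1))                  # linear in the parameters
> y2 <- 1 + 2*x1^1.5 + rnorm(50)                   # the power model: not linear in beta2
> nls(y2 ~ b0 + b1*x1^b2, start = list(b0 = 1, b1 = 2, b2 = 1.5))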


Linear modelling: matrix representation

Y = Xβ + ε

In matrix representation:

    | y1 |   | 1  x11  x12 ... x1P |   | β0 |   | ε1 |
    | y2 | = | 1  x21  x22 ... x2P | · | β1 | + | ε2 |      (1)
    | .. |   | .  ...  ... ... ... |   | .. |   | .. |
    | yn |   | 1  xn1  xn2 ... xnP |   | βP |   | εn |

The simplest model is the null model:

    | y1 |   | 1 |       | ε1 |
    | y2 | = | 1 | · µ + | ε2 |      (2)
    | .. |   | . |       | .. |
    | yn |   | 1 |       | εn |

Find β̂ so that Xβ̂ is as close to Y as possible:

    ŷ = Xβ̂

The residual ε lives in a subspace of dimension (n − p).
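A short sketch (my addition, on simulated data) showing that solving the normal equations by hand reproduces coef(lm()):

> set.seed(2)
> x1 <- rnorm(20); x2 <- rnorm(20)
> y <- 1 + 2*x1 - x2 + rnorm(20)
> X <- model.matrix(~ x1 + x2)               # design matrix with the intercept column
> beta.hat <- solve(t(X) %*% X, t(X) %*% y)  # least-squares estimate
> cbind(beta.hat, coef(lm(y ~ x1 + x2)))     # identical estimates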


Geometrical representation

[Figure: geometrical view of the least-squares fit, with ŷ = Xβ̂ as the projection of Y onto the column space of X]


Galapagos

Galápagos Islands: 30 islands, 7 variables.

Species: the number of species of tortoise found on the island
Area: the area of the island (km²)
Nearest: the distance from the nearest island (km)
Elevation: the highest elevation of the island (m)
Endemics: the number of endemic species
Scruz: the distance from Santa Cruz island (km)
Adjacent: the area of the adjacent island (km²)

> library(faraway)
> data(gala)
> head(gala)
             Species Endemics  Area Elevation Nearest Scruz Adjacent
Baltra            58       23 25.09       346     0.6   0.6     1.84
Bartolome         31       21  1.24       109     0.6  26.3   572.33
Caldwell           3        3  0.21       114     2.8  58.7     0.78
Champion          25        9  0.10        46     1.9  47.4     0.18
Coamano            2        1  0.05        77     1.9   1.9   903.82
Daphne.Major      18       11  0.34       119     8.0   8.0     1.84


lm()

> plot(Species ~ Elevation, data = gala)

[Figure: scatterplot of Species against Elevation for the gala data]


lm(): model construction

> mdl <- lm(Species ~ Elevation, data = gala)
> coef(mdl)
(Intercept)   Elevation
 11.3351132   0.2007922
> plot(Species ~ Elevation, data = gala)
> abline(mdl, col = "blue")

[Figure: the same scatterplot with the fitted regression line drawn in blue]


lm(): model information

> summary(mdl)

Call:
lm(formula = Species ~ Elevation, data = gala)

Residuals:
     Min       1Q   Median       3Q      Max
-218.319  -30.721  -14.690    4.634  259.180

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.33511   19.20529   0.590     0.56
Elevation    0.20079    0.03465   5.795 3.18e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 78.66 on 28 degrees of freedom
Multiple R-squared: 0.5454,     Adjusted R-squared: 0.5291
F-statistic: 33.59 on 1 and 28 DF,  p-value: 3.177e-06


lm(): plot(lm)

> par(mfrow = c(2, 2))
> plot(mdl)

[Figure: the four lm diagnostic plots (Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage with Cook's distance); SantaCruz, Fernandina and SantaMaria are labelled as the most extreme points]


lm(): Residuals

> resid(mdl)
      Baltra    Bartolome     Caldwell     Champion      Coamano Daphne.Major
  -22.809212    -2.221462   -31.225423     4.428446   -24.796112   -17.229384
Daphne.Minor       Darwin         Eden      Enderby     Espanola   Fernandina
   -6.008787   -35.068202   -17.591359   -31.823839    45.908033  -218.318650
    Gardner1     Gardner2     Genovesa      Isabela     Marchena       Onslow
   36.826069   -51.914941    13.404680    -7.087387   -29.206835   -14.354918
       Pinta       Pinzon   Las.Plazas       Rabida SanCristobal  SanSalvador
  -63.350647     4.702062   -18.209579   -15.025848   124.897677    43.747160
   SantaCruz      SantaFe   SantaMaria      Seymour      Tortuga         Wolf
  259.180432    -1.340291   145.157883     3.148434   -32.682461   -41.135538


lm(): Predictions

> newdata <- gala[15:nrow(gala), ]
> predict(mdl, newdata)
    Genovesa      Isabela     Marchena       Onslow        Pinta       Pinzon
    26.59532    354.08739     80.20684     16.35492    167.35065    103.29794
  Las.Plazas       Rabida SanCristobal  SanSalvador    SantaCruz      SantaFe
    30.20958     85.02585    155.10232    193.25284    184.81957     63.34029
  SantaMaria      Seymour      Tortuga         Wolf
   139.84212     40.85157     48.68246     62.13554


lm(): Predictions, confidence intervals

> predict(mdl, newdata, level = 0.9, interval = "confidence")
                   fit         lwr       upr
Genovesa      26.59532  -3.2897506  56.48039
Isabela      354.08739 271.4761884 436.69858
Marchena      80.20684  55.7314194 104.68225
Onslow        16.35492 -15.3566679  48.06650
Pinta        167.35065 133.0307302 201.67056
Pinzon       103.29794  78.2982338 128.29764
Las.Plazas    30.20958   0.9226656  59.49649
Rabida        85.02585  60.5948667 109.45683
SanCristobal 155.10232 123.2045765 187.00007
SanSalvador  193.25284 153.2255598 233.28012
SantaCruz    184.81957 146.7231442 222.91599
SantaFe       63.34029  38.0783575  88.60222
SantaMaria   139.84212 110.6221988 169.06203
Seymour       40.85157  13.1644062  68.53873
Tortuga       48.68246  21.9996242  75.36530
Wolf          62.13554  36.7813407  87.48974
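A side note (my addition, not on the slide): predict() can also return intervals for a single new observation rather than for the mean response:

> predict(mdl, newdata, level = 0.9, interval = "prediction")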


lm(): Model construction

> mdl <- lm(Species ~ Elevation + Endemics, data = gala)
> summary(mdl)

Call:
lm(formula = Species ~ Elevation + Endemics, data = gala)

Residuals:
   Min     1Q Median     3Q    Max
-74.85 -12.49   2.59  12.67  70.25

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -19.92862    7.14320  -2.790  0.00955 **
Elevation    -0.02294    0.02009  -1.142  0.26366
Endemics      4.35265    0.30997  14.042 6.29e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 27.8 on 27 degrees of freedom
Multiple R-squared: 0.9452,     Adjusted R-squared: 0.9412
F-statistic: 233 on 2 and 27 DF,  p-value: < 2.2e-16


lm(): Model construction

> mdl <- lm(Species ~ ., data = gala)
> summary(mdl)

Call:
lm(formula = Species ~ ., data = gala)

Residuals:
    Min      1Q  Median      3Q     Max
-68.219 -10.225   1.830   9.557  71.090

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -15.337942   9.423550  -1.628    0.117
Endemics      4.393654   0.481203   9.131 4.13e-09 ***
Area          0.013258   0.011403   1.163    0.257
Elevation    -0.047537   0.047596  -0.999    0.328
Nearest      -0.101460   0.500871  -0.203    0.841
Scruz         0.008256   0.105884   0.078    0.939
Adjacent      0.001811   0.011879   0.152    0.880
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 28.96 on 23 degrees of freedom
Multiple R-squared: 0.9494,     Adjusted R-squared: 0.9362
F-statistic: 71.88 on 6 and 23 DF,  p-value: 9.674e-14


Other regressors

Partial Least Squares: plsr() in library(pls)

> library(pls)
> data(yarn)
> mod <- plsr(density ~ NIR, ncomp = 10,
+     data = yarn[yarn$train, ],
+     validation = "CV")
> predplot(mod, ncomp = 1:6)

[Figure: predicted vs measured density for 1 to 6 PLS components (cross-validated)]


Other regressors

Principal Component Regression: pcr() in library(pls)

> data(yarn)
> mod <- pcr(density ~ NIR, ncomp = 10,
+     data = yarn[yarn$train, ],
+     validation = "CV")
> predplot(mod, ncomp = 1:6)

[Figure: predicted vs measured density for 1 to 6 principal components (cross-validated)]


Other regressors

Generally it is assumed that:

    ε is i.i.d. (independent and identically distributed), cov(ε) = σ²I
    the residuals are normally distributed

When the errors are not i.i.d.:

    glm() from library(stats)
    glm(model, family = "binomial") (the logistic version)

Independent errors, but not identically distributed:

    weighted least squares (WLS), through glm()

Errors not normally distributed:

    robust regression, through rlm() from library(MASS)

A short sketch follows below.
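A hedged sketch (my own toy simulated data, not from the slides) for two of the alternatives just listed:

> library(MASS)
> set.seed(3)
> x <- rnorm(100)
> ybin <- rbinom(100, 1, plogis(0.5 + 2*x))        # a binary response
> glm(ybin ~ x, family = "binomial")               # logistic regression
> yout <- 1 + 2*x + c(rnorm(95), rnorm(5, 0, 20))  # a few gross outliers
> rlm(yout ~ x)                                    # robust regression from MASS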


One-way ANOVA

A generalization of the t-test.

H0: µ0 = µ1 = · · · = µp

> oneway.test(Sepal.Length ~ Species, data = iris)

        One-way analysis of means (not assuming equal variances)

data:  Sepal.Length and Species
F = 138.9083, num df = 2.000, denom df = 92.211, p-value < 2.2e-16

The p-value is small: we reject the null hypothesis of equal means.

[Figure: boxplots of Sepal.Length by Species]


Anova

> mdl <- lm(Sepal.Length ~ Species - 1, data = iris)
> mdl.null <- lm(Sepal.Length ~ 1, data = iris)
> anova(mdl, mdl.null)

Analysis of Variance Table

Model 1: Sepal.Length ~ Species - 1
Model 2: Sepal.Length ~ 1
  Res.Df     RSS Df Sum of Sq      F    Pr(>F)
1    147  38.956
2    149 102.168 -2   -63.212 119.26 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
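An equivalent sketch (my addition) using aov(), the classical ANOVA interface, which reports the same F statistic for this comparison:

> summary(aov(Sepal.Length ~ Species, data = iris))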


Principal Component Analysis: model

> mdl <- prcomp(iris[, -5], center = TRUE, scale = TRUE)
> summary(mdl)
Importance of components:
                        PC1   PC2    PC3     PC4
Standard deviation     1.71 0.956 0.3831 0.14393
Proportion of Variance 0.73 0.229 0.0367 0.00518
Cumulative Proportion  0.73 0.958 0.9948 1.00000

> mdl$rotation
                    PC1         PC2        PC3        PC4
Sepal.Length  0.5210659 -0.37741762  0.7195664  0.2612863
Sepal.Width  -0.2693474 -0.92329566 -0.2443818 -0.1235096
Petal.Length  0.5804131 -0.02449161 -0.1421264 -0.8014492
Petal.Width   0.5648565 -0.06694199 -0.6342727  0.5235971
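A brief sketch (my addition): with center = TRUE and scale = TRUE, the squared standard deviations above are the eigenvalues of the correlation matrix of the data:

> eigen(cor(iris[, -5]))$values
> mdl$sdev^2                      # the same values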


Principal Component Analysis: prediction

> proj <- predict(mdl, iris[, -5])
> plot(proj)

[Figure: scatterplot of the projected data, PC1 against PC2]


Principal Component Analysis: prediction

> biplot(mdl)

[Figure: biplot of the PCA, showing the observation scores and the loadings of Sepal.Length, Sepal.Width, Petal.Length and Petal.Width on PC1 and PC2]

> library(pls)
> scoreplot(mdl, col = as.numeric(iris$Species), pch = 16)

[Figure: score plot, PC1 (73 %) against PC2 (23 %), points coloured by Species]


End Part III
