
Regression

Harry R. Erwin, PhD

School of Computing and Technology

University of Sunderland

Resources

• Crawley, MJ (2005) Statistics: An Introduction Using R. Wiley.

• Gonick, L. and Smith, W. (1993) The Cartoon Guide to Statistics. HarperResource (for fun).

Regression

• Used when both the response and the explanatory variable are continuous

• Apply when a scatter plot is the appropriate graphic.

• Four main types (a quick R sketch follows the list):

– Linear regression (straight line)

– Polynomial regression (non-linear)

– Non-linear regression (in general)

– Non-parametric regression (no obvious functional form)
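
As a quick orientation (not from the book), here is a minimal R sketch of what each of the four types looks like as a model call, using simulated data; each type is treated in more detail later.

# Simulated data, purely for illustration
set.seed(1)
x <- seq(0.1, 10, length = 50)
y <- 3 * x / (1 + x) + rnorm(50, sd = 0.2)     # an asymptotic relationship plus noise

linear     <- lm(y ~ x)                        # linear: straight line
quadratic  <- lm(y ~ x + I(x^2))               # polynomial: still fitted with lm()
asymptotic <- nls(y ~ a * x / (b + x),         # non-linear: the form is specified exactly,
                  start = list(a = 3, b = 1))  #   with starting guesses for the parameters
smooth     <- loess(y ~ x)                     # non-parametric: no functional form assumed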

Linear Regression

• Worked example from the book (128ff):

reg.data<-read.table("tannin.txt",header=T)
attach(reg.data)
names(reg.data)
plot(tannin,growth,pch=16)

• Uses the lm() function and a simple model, growth~tannin:

abline(lm(growth~tannin))
fitted<-predict(lm(growth~tannin))

• model… (141ff)

Tannin Data Set

reg.data<-read.table("tannin.txt",header=T)

attach(reg.data)

names(reg.data)

[1] "growth" "tannin”

plot(tannin,growth,pch=16)   (pch=16 gives solid dots)

Tannin Plot

Linear Regression

model<-lm(growth~tannin)
model

Call:
lm(formula = growth ~ tannin)

Coefficients:
(Intercept)       tannin
     11.756       -1.217

abline(model)

Abline

Fitting

fitted<-predict(model)

fitted

1 2 3 4 5 6 7 8 9

11.755556 10.538889 9.322222 8.105556 6.888889 5.672222 4.455556 3.238889 2.022222

for(i in 1:9)lines(c(tannin[i],tannin[i]),c(growth[i],fitted[i]))

Fitted

Summary

summary(model)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  11.7556     1.0408  11.295 9.54e-06 ***
tannin       -1.2167     0.2186  -5.565 0.000846 ***
---

Residual standard error: 1.693 on 7 degrees of freedom
Multiple R-squared: 0.8157,  Adjusted R-squared: 0.7893
F-statistic: 30.97 on 1 and 7 DF,  p-value: 0.000846

Summary.aov

summary.aov(model)

            Df Sum Sq Mean Sq F value   Pr(>F)
tannin       1 88.817  88.817  30.974 0.000846 ***
Residuals    7 20.072   2.867            <- the error variance

Report summary(model) and resist the temptation to include summary.aov(model) as well. Include the p-value (from the last slide) and the error variance (here) in a figure caption. Finally, run plot(model).

First Plot (don’t want structure here)

Second Plot (qqnorm)

Third Plot (also don’t want structure here)

Fourth Plot (influence)

Key Definitions

• SSE: the sum of the squares of the residuals (the error sum of squares); this is what is minimised for the best fit.

• SSX: ∑x² - (∑x)²/n, the corrected sum of squares of x.

• SSY: ∑y² - (∑y)²/n, the corrected sum of squares of y.

• SSXY: ∑xy - (∑x)(∑y)/n, the corrected sum of products.

• b: SSXY/SSX, the maximum likelihood estimate of the slope of the linear regression.

• SSR: SSXY²/SSX, the explained variation or the regression sum of squares. Note SSY = SSR + SSE.

• r: the correlation coefficient, SSXY/√(SSX·SSY).
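
The definitions above can be checked by hand. Below is a sketch that recomputes them in R for the tannin example; the data values are reconstructed from the outputs shown on these slides, so treat them as illustrative.

tannin <- 0:8
growth <- c(12, 10, 8, 11, 6, 7, 2, 3, 3)    # reconstructed from the fitted values and ANOVA table above
n <- length(growth)

SSX  <- sum(tannin^2) - sum(tannin)^2/n                  # corrected sum of squares of x
SSY  <- sum(growth^2) - sum(growth)^2/n                  # corrected sum of squares of y
SSXY <- sum(tannin*growth) - sum(tannin)*sum(growth)/n   # corrected sum of products

b   <- SSXY/SSX                        # slope: -1.2167
a   <- mean(growth) - b*mean(tannin)   # intercept: 11.7556
SSR <- SSXY^2/SSX                      # regression sum of squares: 88.817
SSE <- SSY - SSR                       # error sum of squares: 20.072
r   <- SSXY/sqrt(SSX*SSY)              # correlation coefficient

c(slope = b, intercept = a, r.squared = r^2)   # agrees with summary(lm(growth~tannin))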

Analysis of Variance

• Start with SSR, SSE, and SSY.

• SSY has df = n-1.

• SSE uses two estimated parameters (slope and intercept), so df = n-2.

• SSR uses a single degree of freedom, since fitting the regression model to this simple data set estimated only one extra parameter (beyond the mean value of y): the slope, b.

• Remember SSY = SSR + SSE.

Continuing

• Regression variance = SSR/1.

• Error variance s² = SSE/(n-2).

• F = regression variance/s².

• The null hypothesis is that the slope (b) is zero, so there is no dependence of the response on the explanatory variable.

• s² then allows us to work out the standard errors of the slope and intercept.

• s.e.b = √(s²/SSX)

• s.e.a = √(s²∑x²/(n·SSX))
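
Continuing the hand calculation sketched under Key Definitions (same reconstructed tannin data), the F ratio and the standard errors follow directly from these formulas.

s2 <- SSE/(n - 2)            # error variance: 2.867
F  <- (SSR/1)/s2             # F ratio: 30.97 on 1 and n-2 df
p  <- 1 - pf(F, 1, n - 2)    # p-value: 0.000846

se.b <- sqrt(s2/SSX)                      # standard error of the slope: 0.2186
se.a <- sqrt(s2*sum(tannin^2)/(n*SSX))    # standard error of the intercept: 1.0408

c(F = F, p = p, se.slope = se.b, se.intercept = se.a)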

Doing it in R

• model<-lm(growth~tannin)

• summary(model)

– This produces all of the parameters and their standard errors

• If you want to see the analysis of variance, use summary.aov(model)

• Report summary(model) and resist the temptation to include summary.aov(model). Include the p-value and error variance in a figure caption.

• The degree of fit, or coefficient of determination, r² = SSR/SSY; r is the correlation coefficient.

Critical Appraisal

• Check constancy of variance and normality of errors.

• plot(model)

– Plot 1 should show no pattern.

– Plot 2 should show a straight line.

– Plot 3 repeats Plot 1 on a different scale; you don't want to see a triangular shape.

– Plot 4 shows Cook's distance, identifying the points with the most influence. You may want to investigate those points for errors or systematic effects. Refit the model with them removed and assess whether they unduly dominate your results.

• mcheck(model)
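
plot(model) draws these four diagnostic plots one at a time; a common way to see them together is sketched below. (mcheck() is a helper function from Crawley's book rather than part of base R.)

model <- lm(growth ~ tannin)
op <- par(mfrow = c(2, 2))   # arrange the four diagnostic plots on one page
plot(model)                  # residuals vs fitted, Q-Q, scale-location, residuals vs leverage (Cook's distance)
par(op)                      # restore the previous plotting layout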

Be Aware!

• interv<-1:100/100

• theta<-2*pi*interv

• x<-cos(theta)

• y<-sin(theta)

• plot(y,x)

• What's the correct functional form?

• regress<-lm(y~x)

• plot(regress)

Polynomial Regression

• A simple way to investigate non-linearity.

• Worked example. (146ff)
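
The worked example is in the book (146ff). As a minimal sketch of the idea, using the tannin data purely for illustration, a quadratic term can be added to the lm() formula and the two models compared with anova().

model1 <- lm(growth ~ tannin)                 # straight line
model2 <- lm(growth ~ tannin + I(tannin^2))   # quadratic; I() protects the ^2 inside the formula
summary(model2)
anova(model1, model2)                         # does the extra term significantly reduce the error sum of squares?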

Non-Linear Regression

• Perhaps the science constrains the functional form of the relationship between a response variable and an explanatory variable, but the relationship cannot be linearised by transformations. What to do?

• Use nls instead of lm, precisely specify the form of the model, and define initial guesses for any parameters.

• summary(model) still reports the statistics, while anova(model1, model2) is used to compare models. summary.aov(model) reports the analysis of variance.
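
A minimal nls() sketch, with a model form and simulated data invented for illustration (not the book's example): fit an asymptotic exponential, supplying starting guesses for the parameters.

set.seed(2)
x <- seq(0, 10, length = 50)
y <- 20*(1 - exp(-0.4*x)) + rnorm(50, sd = 0.5)   # simulated asymptotic data

model <- nls(y ~ a*(1 - exp(-b*x)),           # the functional form is specified exactly
             start = list(a = 15, b = 0.2))   # initial guesses for the parameters
summary(model)                                # estimates and standard errors for a and b
plot(x, y, pch = 16)
lines(x, predict(model))                      # fitted curve over the data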

Generalised Additive Models

• If you see that the relationship is non-linear, but you don’t have a theory, use a generalised additive model (gam).

• library(mgcv)

– Note that this is not the same gam() as the one in the separate gam package; mgcv provides its own implementation.

• model<-gam(y~s(x))

– s(x) is the default smoother, a thin plate regression spline basis.

• Worked example.
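
A minimal gam() sketch with simulated data (the book's worked example uses its own data set):

library(mgcv)                    # mgcv's gam(), not the one in the gam package
set.seed(3)
x <- runif(200)
y <- sin(2*pi*x) + rnorm(200, sd = 0.3)   # smooth, non-linear relationship

model <- gam(y ~ s(x))           # s(x): thin plate regression spline smoother by default
summary(model)                   # the edf indicates how wiggly the fitted smooth is
plot(model, residuals = TRUE)    # the estimated smooth with partial residuals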