Regression
Harry R. Erwin, PhD
School of Computing and Technology, University of Sunderland
Resources
• Crawley, MJ (2005) Statistics: An Introduction Using R. Wiley.
• Gonick, L., and Woollcott Smith (1993) A Cartoon Guide to Statistics. HarperResource (for fun).
Regression
• Used when both the response and the explanatory variable are continuous
• Apply when a scatter plot is the appropriate graphic.
• Four main types:
– Linear regression (straight line)
– Polynomial regression (non-linear)
– Non-linear regression (in general)
– Non-parametric regression (no obvious functional form)
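As a rough orientation, the four types map onto standard R calls as follows. This is a sketch on synthetic data (the data and model forms are illustrative, not from the lecture):

```r
set.seed(1)
x <- seq(0, 5, length.out = 60)
y <- 4 + 2 * x - 0.3 * x^2 + rnorm(60, sd = 0.5)   # synthetic, curved data

m1 <- lm(y ~ x)                 # linear regression: straight line
m2 <- lm(y ~ x + I(x^2))        # polynomial regression: curved, but linear in the parameters
m3 <- nls(y ~ a + b * x + c * x^2,                 # non-linear regression: the model
          start = list(a = 1, b = 1, c = 0))       # form and start values are specified
m4 <- loess(y ~ x)              # non-parametric regression: no functional form assumed
```

Polynomial regression is still fitted with lm() because the model is linear in its coefficients; nls() is needed only when it is not.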
Linear Regression
• Worked example from the book (128ff):
reg.data<-read.table("tannin.txt",header=T)
attach(reg.data)
names(reg.data)
plot(tannin,growth,pch=16)
• Uses the lm() function and a simple model, growth~tannin:
abline(lm(growth~tannin))
fitted<-predict(lm(growth~tannin))
• model… (141ff)
Tannin Data Set
reg.data<-read.table("tannin.txt",header=T)
attach(reg.data)
names(reg.data)
[1] "growth" "tannin"
plot(tannin,growth,pch=16)   # pch=16: filled dots
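If tannin.txt is not to hand, the data set can be entered inline. The values below are an assumption about the file's contents, reconstructed from Crawley (2005) and consistent with the fitted values and coefficients shown on the following slides:

```r
# Inline reconstruction of the tannin data set (assumed values, per Crawley 2005)
reg.data <- data.frame(
  tannin = 0:8,
  growth = c(12, 10, 8, 11, 6, 7, 2, 3, 3)
)
plot(reg.data$tannin, reg.data$growth, pch = 16)
coef(lm(growth ~ tannin, data = reg.data))   # intercept 11.756, slope -1.217
```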
Linear Regression
model<-lm(growth~tannin)
model
Call:
lm(formula = growth ~ tannin)

Coefficients:
(Intercept)       tannin
     11.756       -1.217
abline(model)
Fitting
fitted<-predict(model)
fitted
1 2 3 4 5 6 7 8 9
11.755556 10.538889 9.322222 8.105556 6.888889 5.672222 4.455556 3.238889 2.022222
for(i in 1:9)lines(c(tannin[i],tannin[i]),c(growth[i],fitted[i]))
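The vertical segments drawn by the loop above are the residuals, and their squared sum is the quantity least squares minimises. A quick check, with the data entered inline (values assumed from the book's data set, consistent with the coefficients on these slides):

```r
tannin <- 0:8
growth <- c(12, 10, 8, 11, 6, 7, 2, 3, 3)   # assumed contents of tannin.txt
model  <- lm(growth ~ tannin)

sse <- sum((growth - fitted(model))^2)  # error sum of squares from the residuals
sse                                     # 20.072
deviance(model)                         # the same quantity, as R reports it
```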
Summary
summary(model)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  11.7556     1.0408  11.295 9.54e-06 ***
tannin       -1.2167     0.2186  -5.565 0.000846 ***
---
Residual standard error: 1.693 on 7 degrees of freedom
Multiple R-squared: 0.8157, Adjusted R-squared: 0.7893
F-statistic: 30.97 on 1 and 7 DF, p-value: 0.000846
Summary.aov
summary.aov(model)
          Df Sum Sq Mean Sq F value   Pr(>F)
tannin     1 88.817  88.817  30.974 0.000846 ***
Residuals  7 20.072   2.867            <- the error variance
• Report summary(model) and resist the temptation to include summary.aov(model). Include the p-value (last slide) and the error variance (here) in a figure caption.
• Finally, plot(model).
Key Definitions
• SSE—the sum of the squares of the residuals (or error sum of squares)—this is to be minimised for the best fit
• SSX—∑x² − (∑x)²/n, the corrected sum of squares of x.
• SSY—∑y² − (∑y)²/n, the corrected sum of squares of y.
• SSXY—∑xy − (∑x)(∑y)/n, the corrected sum of products.
• b—SSXY/SSX, the maximum likelihood estimate of the slope of the linear regression.
• SSR—SSXY²/SSX, the explained variation or the regression sum of squares. Note SSY = SSR + SSE.
• r—the correlation coefficient, SSXY/√(SSX·SSY).
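These definitions can be checked directly in R. The data values are assumed (reconstructed from the book's example, matching the coefficients on the earlier slides):

```r
x <- 0:8                                  # tannin (assumed values)
y <- c(12, 10, 8, 11, 6, 7, 2, 3, 3)      # growth (assumed values)
n <- length(x)

SSX  <- sum(x^2)   - sum(x)^2 / n         # corrected sum of squares of x
SSY  <- sum(y^2)   - sum(y)^2 / n         # corrected sum of squares of y
SSXY <- sum(x * y) - sum(x) * sum(y) / n  # corrected sum of products
b    <- SSXY / SSX                        # slope: -1.2167, as lm() reports
SSR  <- SSXY^2 / SSX                      # regression sum of squares: 88.817
SSE  <- SSY - SSR                         # since SSY = SSR + SSE: 20.072
r    <- SSXY / sqrt(SSX * SSY)            # correlation coefficient
```

Note r² = SSR/SSY = 0.8157, the Multiple R-squared from summary(model).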
Analysis of Variance
• Start with SSR, SSE, and SSY.
• SSY has df = n−1.
• SSE uses two estimated parameters (slope and intercept), so df = n−2.
• SSR uses a single degree of freedom, since fitting the regression model to this simple data set estimated only one extra parameter (beyond the mean value of y), the slope b.
• Remember SSY = SSR + SSE.
Continuing
• Regression variance = SSR/1.
• Error variance s² = SSE/(n−2).
• F = regression variance/s².
• The null hypothesis is that the slope (b) is zero, so there is no dependence of the response on the explanatory variable.
• s2 then allows us to work out the standard errors of the slope and intercept.
• s.e.(b) = √(s²/SSX)
• s.e.(a) = √(s²∑x²/(n·SSX))
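The whole ANOVA table and both standard errors can be reproduced by hand and checked against summary(model). The data values are assumed, as before:

```r
x <- 0:8; y <- c(12, 10, 8, 11, 6, 7, 2, 3, 3)   # assumed tannin data
n <- length(x)

SSX  <- sum(x^2)   - sum(x)^2 / n
SSY  <- sum(y^2)   - sum(y)^2 / n
SSXY <- sum(x * y) - sum(x) * sum(y) / n
SSR  <- SSXY^2 / SSX
SSE  <- SSY - SSR

s2   <- SSE / (n - 2)                     # error variance: 2.867
Fval <- (SSR / 1) / s2                    # F-statistic: 30.97 on 1 and 7 df
se.b <- sqrt(s2 / SSX)                    # slope standard error: 0.2186
se.a <- sqrt(s2 * sum(x^2) / (n * SSX))   # intercept standard error: 1.0408
```

These match the Std. Error column and F-statistic line of summary(lm(y ~ x)).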
Doing it in R
• model<-lm(growth~tannin)
• summary(model)
– This produces all of the parameters and their standard errors
• If you want to see the analysis of variance, use summary.aov(model)
• Report summary(model) and resist the temptation to include summary.aov(model). Include the p-value and error variance in a figure caption.
• The degree of fit or coefficient of determination (r²) is SSR/SSY; r is the correlation coefficient.
Critical Appraisal
• Check constancy of variance and normality of errors: plot(model)
– Plot 1 should show no pattern.
– Plot 2 should show a straight line.
– Plot 3 repeats Plot 1 on a different scale. You don’t want to see a triangular shape.
– Plot 4 shows Cook’s distance, identifying the points with the most influence. You may want to investigate them to look for error or systematic effects. Remodel, removing those points, and assess whether they dominate your results unduly.
• mcheck(model)
Be Aware!
• interv<-1:100/100
• theta<-2*pi*interv
• x<-cos(theta)
• y<-sin(theta)
• plot(y,x)
• What's the correct functional form?
• regress<-lm(y~x)
• plot(regress)
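The point of this cautionary example: the points lie exactly on a circle, so x and y are perfectly (deterministically) related, yet the correlation and fitted slope are essentially zero, because the relationship is not linear. Running the slide's code shows this:

```r
interv <- 1:100 / 100
theta  <- 2 * pi * interv
x <- cos(theta)
y <- sin(theta)

cor(x, y)         # essentially zero: no *linear* association
coef(lm(y ~ x))   # slope near zero too; a straight line is the wrong model
```

Always look at the scatter plot before trusting a regression summary.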
Non-Linear Regression
• Perhaps the science constrains the functional form of the relationship between a response variable and an explanatory variable, but the relationship cannot be linearized by transformations. What to do?
• Use nls instead of lm, precisely specify the form of the model, and define initial guesses for any parameters.
• summary(model) still reports the statistics, while anova(model1, model2) is used to compare models. summary.aov(model) reports the analysis of variance.
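A minimal sketch of nls() on synthetic data (the model form, data, and start values are illustrative, not from the lecture). Here an exponential decay y = a·e^(−bx) is fitted directly, rather than linearised by taking logs:

```r
set.seed(1)
x <- seq(0, 5, length.out = 50)
y <- 10 * exp(-0.7 * x) + rnorm(50, sd = 0.2)    # synthetic decay data

# Specify the model form exactly, with initial guesses for the parameters
model <- nls(y ~ a * exp(-b * x), start = list(a = 8, b = 0.5))
summary(model)   # estimates and standard errors, as for lm()
```

Poor start values can prevent convergence, so it is worth plotting the data first and reading the guesses off the plot.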
Generalised Additive Models
• If you see that the relationship is non-linear, but you don’t have a theory, use a generalised additive model (gam).
• library(mgcv)
– By the way, this is not gam() from core R.
• model<-gam(y~s(x))
– s(x) is the default smoother, a thin plate regression spline basis.
• Worked example.
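A minimal sketch of the gam() workflow on synthetic data (the data are illustrative; the lecture's own worked example is not reproduced here):

```r
library(mgcv)                                  # mgcv's gam(), not core R's

set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.3)    # non-linear, no theory assumed

model <- gam(y ~ s(x))   # s(x): thin plate regression spline smoother
summary(model)           # an edf well above 1 signals non-linearity
plot(model)              # the fitted smooth with its confidence band
```

If summary(model) reports an effective degrees of freedom (edf) near 1 for s(x), the relationship is effectively linear and lm() would do.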