COMP6053 lecture: Linear regression [email protected] [email protected]


Page 1:

COMP6053 lecture: Linear regression

[email protected] [email protected]

Page 2:

Regression analysis

• Last time we looked at correlation coefficients: how much of the variation in one variable can be explained by another?

• What if we wanted to actually predict the value one variable will take, based on our measurements of another variable?

• This is what regression analysis ("fitting a line to the data") is all about.

Page 3:

Interlude: types of variables

Three types of variables:

• Continuous: real-numbered values, e.g., time, mass, money.

• Ordinal: a numerical variable where a small number of possibilities are ranked, e.g., school grades, or Michelin stars.

Page 4:

Interlude: types of variables

• Categorical: describes membership of a group. The groups are distinct, and may even be represented with a code number, but they can't be ranked. Examples: country of birth, sex, control group vs. experimental group.

o Some categorical variables are binary (alive/dead, pass/fail) and some are not.

Page 5:

Interlude: types of variables

• The boundaries can be blurred.

• All continuous variables are "really" ordinal given that they can only be measured to limited accuracy, e.g., weight to nearest kg.

• A variable like "age of child in years" might be treated as ordinal, or it might be categorical if you're looking for cognitive differences between 3-yr-olds and 6-yr-olds.

Page 6:

Interlude: types of variables

• Whatever their form, variables can also play different roles when we start to build statistical models.

• Dependent or outcome variables: for a particular analysis, this variable is the focus. It's assumed to be linked to or predictable from some other variables. Perhaps we want to predict test scores based on certain demographic facts about people.

Page 7:

Interlude: types of variables

• Independent or predictor variables: these variables are assumed to have inherent variation (they "just are") and we will use them to try to explain the variance in the dependent variable. E.g., sex, age, and education level when we're trying to predict test scores.

Page 8:

Interlude: types of variables

• Note that sometimes we have a clear causal model in mind, and at other times we're happy to try out different variables in either role.

• For instance, we could try to predict height based on weight, but we could also do the reverse quite reasonably.

• Even if you have a causal model in mind, be careful: you may only be demonstrating correlation.

Page 9:

Variables in regression

• The classic regression case involves one dependent variable and one independent variable (we'll expand this later).

• Both of these variables are continuous or at least ordinal.

• We're trying to explain or predict the variation in Y based on the variation in X.

Page 10:

Page 11:

Drawing a line through some points

• If I give you the X value and ask for a systematic guess about the matching Y value, what are your options?

• The easiest way forward is to assume that X and Y are linearly related, as in correlation analysis.

• We can then ask what line provides a "best guess" for Y based on X.

Page 12:

Drawing a line through some points

• Another way of posing the same question is to ask for a "line of best fit" to be drawn through the cloud of (X,Y) points.

• We need two things for a line: slope (m) and intercept (b). Thus Y = mX + b.

• We could make a hand-drawn guess at a line that fits the data. But how can we be systematic about it?

Page 13:

?

Page 14:

Equation of a straight line

Page 15:

Method of least squares

• For any given line Y = mX + b, we can measure the differences between the actual Y values and those predicted by our hypothesized line.

• The sum of the squared differences between the actual Y values and the predicted ones is a reasonable way to measure goodness of fit.

• Regression analysis uses this method.
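The least-squares line has a closed-form solution, so it can be sketched in a few lines of Python (the data points here are made up for illustration):

```python
def least_squares_fit(xs, ys):
    """Return (m, b) minimising the sum of squared errors for y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    # Intercept: the fitted line always passes through the joint mean.
    b = mean_y - m * mean_x
    return m, b

# Points lying exactly on y = 2x + 1 recover that line.
m, b = least_squares_fit([0, 1, 2, 3], [1, 3, 5, 7])
```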

Page 16:

Least squares demo

http://onlinestatbook.com/simulations/reg_least_squares/reg_ls.html

• David M. Lane's demo is a great way to see what the least-squares fitting idea really means.

• Definitely spend some time playing around with it.

Page 17:

How do I run a regression in R?

• We'll start with a fictional data set.

• We have 100 men who have been measured for height and tested for basketball ability.

• We're interested in trying to predict their basketball skill from their height.

Page 18:

Page 19:

Basketball ability

• The mean height is 169.1cm, SD = 10.5cm.

• Mean basketball performance on an arbitrary scale is 65.8, SD = 5.9.

• Correlation between the two is r = 0.52, which means height explains about 27% of the variance in basketball ability (r² ≈ 0.27).

Page 20:

How do I run a regression in R?

• We read the data into R with the usual read.table() command.

• The variables of interest are Height and BasketballAbility.

• We build the regression model with: regModel = lm(BasketballAbility ~ Height)

• lm stands for "linear model".

Page 21:

Call:

lm(formula = BasketballAbility ~ Height)

Residuals:

Min 1Q Median 3Q Max

-11.0733 -3.4851 -0.5733 3.4969 12.9267

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 15.67832 8.22351 1.907 0.0595 .

Height 0.29602 0.04855 6.097 2.15e-08 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.07 on 98 degrees of freedom

Multiple R-squared: 0.275, Adjusted R-squared: 0.2676

F-statistic: 37.17 on 1 and 98 DF, p-value: 2.146e-08

The output in full

Page 22:

Implied model of data generation

Page 23:

Breaking down the R output

• summary(regModel) gives us most of what we need.

Call:

lm(formula = BasketballAbility ~ Height)

This is just restating the model formula.

Page 24:

The residuals

Residuals:

Min 1Q Median 3Q Max

-11.0733 -3.4851 -0.5733 3.4969 12.9267

How about this?

• This section describes the distribution of the residuals, i.e., the differences between the predicted and the actual basketball ability scores.

• Roughly speaking, they should be normally distributed with a mean of zero.
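As a concrete example of what a single residual is, here is a hypothetical data point run against the fitted coefficients from the R output above (the player's height and score are invented):

```python
# Fitted line taken from the R output above.
intercept, slope = 15.67832, 0.29602

def predict_skill(height_cm):
    """Basketball skill predicted by the regression line."""
    return intercept + slope * height_cm

# A hypothetical player: 180 cm tall, with an observed skill score of 70.
predicted = predict_skill(180)
residual = 70 - predicted  # residual = actual minus predicted
```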

Page 25:

The coefficients

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 15.67832 8.22351 1.907 0.0595 .

Height 0.29602 0.04855 6.097 2.15e-08 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

• This is where we get "m" and "b".

• Intercept estimate = 15.7, meaning a man of zero height would be expected to score this much on basketball skill.

Page 26:

The coefficients

• Height estimate = 0.30, meaning every additional cm of height adds 0.3 to your basketball skill score.

• We also get standard errors on these estimates, and t-tests for the hypothesis that they're equal to zero.

• The t-test to focus on is the one for the slope of the line: here it's highly significant.
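Each t value in the coefficients table is simply the estimate divided by its standard error; checking the Height row by hand:

```python
# t statistic for a coefficient: estimate divided by its standard error.
estimate, std_error = 0.29602, 0.04855
t_value = estimate / std_error  # matches the 6.097 reported by R
```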

Page 27:

Plotting the model's predictions

Page 28:

Summary of the analysis

Residual standard error: 5.07 on 98 degrees of freedom

Multiple R-squared: 0.275, Adjusted R-squared: 0.2676

F-statistic: 37.17 on 1 and 98 DF, p-value: 2.146e-08

• Note the R-squared figure, which is what we expected.

• We also get an "Adjusted R-squared" value, which tries to correct for the fact that models with more parameters will naturally get better fits to the data.
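The adjusted figure uses the standard correction 1 − (1 − R²)(n − 1)/(n − p − 1), where p is the number of predictors; reproducing R's number from the output above:

```python
# Adjusted R-squared penalises extra parameters:
# 1 - (1 - R^2) * (n - 1) / (n - p - 1), with p = number of predictors.
r_squared, n, p = 0.275, 100, 1
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
# matches the 0.2676 reported by R
```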

Page 29:

Summary of the analysis

• We finish with an F-test for the model as a whole.

• This has 1 and 98 degrees of freedom. Why these values?

• In a data set with 100 measurements of basketball ability, there are 99 degrees of freedom (the last measurement is not free to vary if we are to get the same mean).

Page 30:

Summary of the analysis

• These 99 degrees of freedom have to be allocated to either the variance explained by the model or the leftover "error" variance.

• You might think that a linear model would get 2 degrees of freedom: intercept and slope.

• In fact the intercept is not free to vary: the regression line always runs through the joint mean.

Page 31:

Page 32:

Implied model of data generation

Page 33:

Summary of the analysis

• Thus 1 degree of freedom goes to the model, and the rest (98) go to the error variance.

• The high F-statistic of 37.2 and the extremely low p-value (2 × 10⁻⁸) tell us that the linear relationship we're seeing between basketball ability and height would be extremely unlikely to occur by chance.

Page 34:

Analysis of variance

• aov(regModel) gives information on how the variance pie is divided up.


                  Height  Residuals
Sum of Squares  955.3655  2518.7945
Deg. of Freedom        1         98

• We're trying to account for the "variance pie" in basketball ability: SD = 5.92, variance ≈ 35, total SS ≈ 3474, which is the two sums of squares above combined.
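The F-statistic reported earlier follows directly from this table: divide each sum of squares by its degrees of freedom to get a mean square, then take the ratio:

```python
# Sums of squares and degrees of freedom from the aov() table.
ss_model, df_model = 955.3655, 1
ss_resid, df_resid = 2518.7945, 98

total_ss = ss_model + ss_resid  # the whole "variance pie"
# F = mean square for the model over mean square for the residuals.
f_stat = (ss_model / df_model) / (ss_resid / df_resid)  # matches R's 37.17
```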

Page 35:

How do I write up a regression analysis?

A linear regression model was used to predict basketball ability in terms of height, and the overall result was highly significant (F1,98 = 37.2, p < 2 × 10⁻⁸) with an R-squared value of 27.5%. The intercept for the regression line was 15.7 and the slope was 0.30, indicating that men of average height (169cm) could expect a mean skill rating of 65.8, with each additional 10cm of height being associated with a 3-point increase in basketball skill.

Page 36:

General truths about the regression line

• You might imagine that finding the line of best fit (lowest "mean-squared error") would be a complicated optimization problem, different for every data set.

• In fact, because there are only two parameters to find (the slope and the intercept) the results are quite predictable.

Page 37:

General truths about the regression line

• The regression line always goes through the combined means (height = 169.1cm, basketball skill = 65.8 in this case).

• The slope of the line is always equal to the correlation coefficient times the ratio of the standard deviations (SD of Y divided by SD of X).
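A quick numerical check of this claim using the summary figures quoted earlier (r = 0.52, SD of skill 5.9, SD of height 10.5):

```python
# Slope of the regression line = r * (SD of Y / SD of X).
r = 0.52       # correlation between height and basketball skill
sd_y = 5.9     # SD of basketball skill (the dependent variable)
sd_x = 10.5    # SD of height (the predictor)
slope = r * sd_y / sd_x  # close to the 0.296 that lm() estimated
```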

Page 38:

Standardized regression

Page 39:

Connections to other methods

Regression can be seen as a superset of:

• ANOVA

• Two-sample t-tests

• Correlation analysis

Page 40:

Can I run a regression in Python?

• Yes: the pylab.polyfit command will fit the regression model.

• The pylab.polyval command will allow you to generate predictions from the model, just as the predict() function does in R.
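A rough pure-Python sketch of that fit-then-predict workflow (pylab.polyfit with degree 1 returns slope and intercept, and pylab.polyval evaluates them; the data points below are invented for illustration):

```python
def polyfit_deg1(xs, ys):
    """Degree-1 least-squares fit: returns (slope, intercept),
    as pylab.polyfit(xs, ys, 1) would."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    m = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return m, my - m * mx

def polyval_deg1(coeffs, x):
    """Evaluate the fitted line at x, as pylab.polyval would."""
    m, b = coeffs
    return m * x + b

# Invented (height, skill) pairs, then a prediction for a new height.
coeffs = polyfit_deg1([160, 170, 180, 190], [60, 64, 68, 72])
prediction = polyval_deg1(coeffs, 175)
```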

Page 41:

Page 42:

Linear fit demo

http://onlinestatbook.com/2/simulations/reg_least_squares/Reg_scatter.html

Page 43:

Additional materials

• The Python code for generating graphs and the fictional data set used here.

• The fictional data set as a text file.

• An R script for analyzing the fictional data set: source("regressionScript.txt", echo=TRUE)