COMP6053 lecture: Linear regression [email protected] [email protected]


Page 1:

COMP6053 lecture: Linear regression

[email protected] [email protected]

Page 2:

Regression analysis

• Last time we looked at correlation coefficients: how much of the variation in one variable can be explained by another?

• What if we wanted to actually predict the value one variable will take, based on our measurements of another variable?

• This is what regression analysis ("fitting a line to the data") is all about.

Page 3:

Interlude: types of variables

Three types of variables:

• Continuous: real-numbered values, e.g., time, mass, money.

• Ordinal: a numerical variable where a small number of possibilities are ranked, e.g., school grades, or Michelin stars.

Page 4:

Interlude: types of variables

• Categorical: describes membership of a group. The groups are distinct, and may even be represented with a code number, but they can't be ranked. Examples: country of birth, sex, control group vs. experimental group.

o Some categorical variables are binary (alive/dead, pass/fail) and some are not.

Page 5:

Interlude: types of variables

• The boundaries can be blurred.

• All continuous variables are "really" ordinal given that they can only be measured to limited accuracy, e.g., weight to nearest kg.

• A variable like "age of child in years" might be treated as ordinal, or it might be categorical if you're looking for cognitive differences between 3-yr-olds and 6-yr-olds.

Page 6:

Interlude: types of variables

• Whatever their form, variables can also play different roles when we start to build statistical models.

• Dependent or outcome variables: for a particular analysis, this variable is the focus. It's assumed to be linked to or predictable from some other variables. Perhaps we want to predict test scores based on certain demographic facts about people.

Page 7:

Interlude: types of variables

• Independent or predictor variables: these variables are assumed to have inherent variation (they "just are") and we will use them to try to explain the variance in the dependent variable. E.g., sex, age, and education level when we're trying to predict test scores.

Page 8:

Interlude: types of variables

• Note that sometimes we have a clear causal model in mind, and at other times we're happy to try out different variables in either role.

• For instance, we could try to predict height based on weight, but we could also do the reverse quite reasonably.

• Even if you have a causal model in mind, be careful: you may only be demonstrating correlation.

Page 9:

Variables in regression

• The classic regression case involves one dependent variable and one independent variable (we'll expand this later).

• Both of these variables are continuous or at least ordinal.

• We're trying to explain or predict the variation in Y based on the variation in X.

Page 10:

Page 11:

Drawing a line through some points

• If I give you the X value and ask for a systematic guess about the matching Y value, what are your options?

• The easiest way forward is to assume that X and Y are linearly related, as in correlation analysis.

• We can then ask what line provides a "best guess" for Y based on X.

Page 12:

Drawing a line through some points

• Another way of posing the same question is to ask for a "line of best fit" to be drawn through the cloud of (X,Y) points.

• We need two things for a line: slope (m) and intercept (b). Thus Y = mX + b.

• We could make a hand-drawn guess at a line that fits the data. But how can we be systematic about it?

Page 13:

?

Page 14:

Equation of a straight line

Page 15:

Method of least squares

• For any given line Y = mX + b, we can measure the differences between the actual Y values and those predicted by our hypothesized line.

• The sum of the squared differences between the actual Y values and the predicted ones is a reasonable way to measure goodness of fit.

• Regression analysis uses this method.
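The least-squares line has a closed-form solution, so it can be sketched in a few lines of Python (the data points here are made up for illustration):

```python
def least_squares_fit(xs, ys):
    """Return (m, b) minimising the sum of squared errors for y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    # Intercept: the fitted line always passes through the joint mean.
    b = mean_y - m * mean_x
    return m, b

# Points lying exactly on y = 2x + 1 recover that line.
m, b = least_squares_fit([0, 1, 2, 3], [1, 3, 5, 7])
```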

Page 16:

Least squares demo

http://onlinestatbook.com/simulations/reg_least_squares/reg_ls.html

• David M. Lane's demo is a great way to see what the least-squares fitting idea really means.

• Definitely spend some time playing around with it.

Page 17:

How do I run a regression in R?

• We'll start with a fictional data set.

• We have 100 men who have been measured for height and tested for basketball ability.

• We're interested in trying to predict their basketball skill from their height.

Page 18:

Page 19:

Basketball ability

• The mean height is 169.1cm, SD = 10.5cm.

• Mean basketball performance on an arbitrary scale is 65.8, SD = 5.9.

• Correlation between the two is r = 0.52, which means height explains about 27% of the variance in basketball ability (r² ≈ 0.27).

Page 20:

How do I run a regression in R?

• We read the data into R with the usual read.table() command.

• The variables of interest are Height and BasketballAbility.

• We build the regression model with: regModel = lm(BasketballAbility ~ Height)

• lm stands for "linear model".

Page 21:

Call:

lm(formula = BasketballAbility ~ Height)

Residuals:

Min 1Q Median 3Q Max

-11.0733 -3.4851 -0.5733 3.4969 12.9267

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 15.67832 8.22351 1.907 0.0595 .

Height 0.29602 0.04855 6.097 2.15e-08 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.07 on 98 degrees of freedom

Multiple R-squared: 0.275, Adjusted R-squared: 0.2676

F-statistic: 37.17 on 1 and 98 DF, p-value: 2.146e-08

The output in full

Page 22:

Implied model of data generation

Page 23:

Breaking down the R output

• summary(regModel) gives us most of what we need.

Call:

lm(formula = BasketballAbility ~ Height)

This is just restating the model formula.

Page 24:

The residuals

Residuals:

Min 1Q Median 3Q Max

-11.0733 -3.4851 -0.5733 3.4969 12.9267

How about this?

• This section describes the distribution of the residuals, i.e., the differences between the predicted and the actual basketball ability scores.

• Roughly speaking, they should be normally distributed with a mean of zero.
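As a concrete example of what a single residual is, here is a hypothetical data point run against the fitted coefficients from the R output above (the player's height and score are invented):

```python
# Fitted line taken from the R output above.
intercept, slope = 15.67832, 0.29602

def predict_skill(height_cm):
    """Basketball skill predicted by the regression line."""
    return intercept + slope * height_cm

# A hypothetical player: 180 cm tall, with an observed skill score of 70.
predicted = predict_skill(180)
residual = 70 - predicted  # residual = actual minus predicted
```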

Page 25:

The coefficients

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 15.67832 8.22351 1.907 0.0595 .

Height 0.29602 0.04855 6.097 2.15e-08 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

• This is where we get "m" and "b".

• Intercept estimate = 15.7, meaning a man of zero height would be expected to score this much on basketball skill.

Page 26:

The coefficients

• Height estimate = 0.30, meaning every additional cm of height adds 0.3 to your basketball skill score.

• We also get standard errors on these estimates, and t-tests for the hypothesis that they're equal to zero.

• The t-test to focus on is the one for the slope of the line: here it's highly significant.
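Each t value in the coefficients table is simply the estimate divided by its standard error; checking the Height row by hand:

```python
# t statistic for a coefficient: estimate divided by its standard error.
estimate, std_error = 0.29602, 0.04855
t_value = estimate / std_error  # matches the 6.097 reported by R
```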

Page 27:

Plotting the model's predictions

Page 28:

Summary of the analysis

Residual standard error: 5.07 on 98 degrees of freedom

Multiple R-squared: 0.275, Adjusted R-squared: 0.2676

F-statistic: 37.17 on 1 and 98 DF, p-value: 2.146e-08

• Note the R-squared figure, which is what we expected.

• We also get an "Adjusted R-squared" value, which tries to correct for the fact that models with more parameters will naturally get better fits to the data.
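The adjusted figure uses the standard correction 1 − (1 − R²)(n − 1)/(n − p − 1), where p is the number of predictors; reproducing R's number from the output above:

```python
# Adjusted R-squared penalises extra parameters:
# 1 - (1 - R^2) * (n - 1) / (n - p - 1), with p = number of predictors.
r_squared, n, p = 0.275, 100, 1
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
# matches the 0.2676 reported by R
```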

Page 29:

Summary of the analysis

• We finish with an F-test for the model as a whole.

• This has 1 and 98 degrees of freedom. Why these values?

• In a data set with 100 measurements of basketball ability, there are 99 degrees of freedom (the last measurement is not free to vary if we are to get the same mean).

Page 30:

Summary of the analysis

• These 99 degrees of freedom have to be allocated to either the variance explained by the model or the leftover "error" variance.

• You might think that a linear model would get 2 degrees of freedom: intercept and slope.

• In fact the intercept is not free to vary: the regression line always runs through the joint mean.

Page 31:

Page 32:

Implied model of data generation

Page 33:

Summary of the analysis

• Thus 1 degree of freedom goes to the model, and the rest (98) go to the error variance.

• The high F-statistic of 37.2 and the extremely low p-value (2 × 10⁻⁸) tell us that the linear relationship we're seeing between basketball ability and height would be extremely unlikely to occur by chance.

Page 34:

Analysis of variance

• aov(regModel) gives information on how the variance pie is divided up.


                  Height  Residuals
Sum of Squares  955.3655  2518.7945
Deg. of Freedom        1         98

• We're trying to account for the "variance pie" in basketball ability: SD = 5.92, variance ≈ 35, total SS ≈ 3474, which is the two sums of squares above combined.
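The F-statistic reported earlier follows directly from this table: divide each sum of squares by its degrees of freedom to get a mean square, then take the ratio:

```python
# Sums of squares and degrees of freedom from the aov() table.
ss_model, df_model = 955.3655, 1
ss_resid, df_resid = 2518.7945, 98

total_ss = ss_model + ss_resid  # the whole "variance pie"
# F = mean square for the model over mean square for the residuals.
f_stat = (ss_model / df_model) / (ss_resid / df_resid)  # matches R's 37.17
```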

Page 35:

How do I write up a regression analysis?

A linear regression model was used to predict basketball ability in terms of height, and the overall result was highly significant (F1,98 = 37.2, p < 2 × 10⁻⁸) with an R-squared value of 27.5%. The intercept for the regression line was 15.7 and the slope was 0.30, indicating that men of average height (169cm) could expect a mean skill rating of 65.8, with each additional 10cm of height being associated with a 3-point increase in basketball skill.

Page 36:

General truths about the regression line

• You might imagine that finding the line of best fit (lowest "mean-squared error") would be a complicated optimization problem, different for every data set.

• In fact, because there are only two parameters to find (the slope and the intercept) the results are quite predictable.

Page 37:

General truths about the regression line

• The regression line always goes through the combined means (height = 169.1cm, basketball skill = 65.8 in this case).

• The slope of the line is always equal to the correlation coefficient times the ratio of the standard deviations (SD of Y divided by SD of X).
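A quick numerical check of this claim using the summary figures quoted earlier (r = 0.52, SD of skill 5.9, SD of height 10.5):

```python
# Slope of the regression line = r * (SD of Y / SD of X).
r = 0.52       # correlation between height and basketball skill
sd_y = 5.9     # SD of basketball skill (the dependent variable)
sd_x = 10.5    # SD of height (the predictor)
slope = r * sd_y / sd_x  # close to the 0.296 that lm() estimated
```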

Page 38:

Standardized regression

Page 39:

Connections to other methods

Regression can be seen as a superset of:

• ANOVA

• Two-sample t-tests

• Correlation analysis

Page 40:

Can I run a regression in Python?

• Yes: the pylab.polyfit command will fit the regression model.

• The pylab.polyval command will allow you to generate predictions from the model, just as the predict() function does in R.
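A rough pure-Python sketch of that fit-then-predict workflow (pylab.polyfit with degree 1 returns slope and intercept, and pylab.polyval evaluates them; the data points below are invented for illustration):

```python
def polyfit_deg1(xs, ys):
    """Degree-1 least-squares fit: returns (slope, intercept),
    as pylab.polyfit(xs, ys, 1) would."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    m = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return m, my - m * mx

def polyval_deg1(coeffs, x):
    """Evaluate the fitted line at x, as pylab.polyval would."""
    m, b = coeffs
    return m * x + b

# Invented (height, skill) pairs, then a prediction for a new height.
coeffs = polyfit_deg1([160, 170, 180, 190], [60, 64, 68, 72])
prediction = polyval_deg1(coeffs, 175)
```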

Page 41:

Page 42:

Linear fit demo

http://onlinestatbook.com/2/simulations/reg_least_squares/Reg_scatter.html

Page 43:

Additional materials

• The Python code for generating graphs and the fictional data set used here.

• The fictional data set as a text file.

• An R script for analyzing the fictional data set: source("regressionScript.txt", echo=TRUE)