
Page 1

Introduction to Linear regression analysis

Part 1

Simple linear regression

• Two continuous variables:

– response (dependent) variable (Y)

– predictor (independent) variable (X)

– each recorded for n observations (replicates)

• Predictor variable “influences” response variable

• Does not necessarily demonstrate causality

(depends on design of experiment or survey)

Page 2

Scatterplot

[Figure: scatterplot of CWD (coarse woody debris) basal area (0 to 200) against riparian tree density (0 to 2500)]

Linear regression

• Description:

– relationship between response (Y) and

predictor (X) variable

• Explanation:

– how much of variation in Y explained by

linear relationship with X

• Prediction:

– new Y-values from new X-values

– precision of those estimates
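The fit-then-predict workflow just described can be sketched in a few lines. This is a minimal illustration on made-up numbers (not the CWD data), using scipy's linregress as one convenient fitting routine:

```python
# Minimal sketch: fit a simple linear regression, then predict new Y from new X.
# The (x, y) values below are illustrative, not the CWD data.
import numpy as np
from scipy import stats

x = np.array([100, 400, 700, 1100, 1500, 1900, 2300], dtype=float)  # predictor
y = np.array([5, 20, 60, 90, 110, 150, 180], dtype=float)           # response

fit = stats.linregress(x, y)   # returns slope, intercept, rvalue, pvalue, stderr
print(f"slope = {fit.slope:.4f}, intercept = {fit.intercept:.2f}, "
      f"r^2 = {fit.rvalue**2:.3f}")

x_new = np.array([500.0, 2000.0])        # prediction: new Y-values from new X-values
y_new = fit.intercept + fit.slope * x_new
print("predicted:", y_new)
```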

Page 3

Regression model

yi = b0 + b1xi + ei

(CWD basal area)i = b0 + b1(tree density)i + ei

yi value of Y for ith observation

xi value of X for ith observation

b0 population intercept (value of Y when X = 0)

b1 population slope (change in Y per unit change in X)

ei error term (measures variation in Y at each xi: the deviation of each yi from its predicted value)

Regression line

[Figure: regression line of Y against X, showing the intercept (value of Y at X = 0) and the slope (change in Y per unit change in X)]

Page 4

Regression model

yi = b0 + b1xi + ei

E(yi) = b0 + b1xi

where:

• E(yi) = expected value of yi

• ei (error term) measures difference

between yi and E(yi) at each xi

Sample regression equation

ŷi = b0 + b1xi

• ŷi: predicted Y-value for xi; estimates E(yi)

• b0: sample intercept; estimates the population intercept

• b1: sample regression slope; estimates the population slope

Page 5

Regression line

[Figure: fitted regression line of CWD basal area (-100 to 200) against riparian tree density (0 to 2500), with the slope and intercept labelled; a second panel shows the same data with the sample mean ȳ and its confidence interval ȳ ± t(s/√n)]

The logic of the assessment of regression models – what to compare to

If the data are normally distributed, then an unbiased estimate of the distribution of means can be obtained from ȳ and its standard error (SE).

Page 6

[Figure: employment (thousands, with confidence interval) plotted against year, from Longley.csv; x̄ = 1954.5, ȳ = 65,317]

Now let's assume that we think there may be a relationship between year and employment.

Question: does the mean (or some other estimator that does not include the relationship between Y and X) fit the data better than an estimator that includes the effect of X?

[Figure: two panels of employment (thousands) against year, 1945 to 1965]

(A numerical sketch of this comparison follows.)
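Here is a minimal numerical sketch of that question, using made-up year/employment values standing in for Longley.csv: compare the unexplained variation left by the mean alone with that left by a fitted line.

```python
# Does the mean alone fit as well as a line in year? Illustrative data only.
import numpy as np

year = np.arange(1947, 1963, dtype=float)
emp = 60000 + 700 * (year - 1947) + np.random.default_rng(1).normal(0, 800, year.size)

sse_mean = np.sum((emp - emp.mean()) ** 2)   # estimator without x: the mean
b1, b0 = np.polyfit(year, emp, 1)            # estimator including the effect of x
sse_line = np.sum((emp - (b0 + b1 * year)) ** 2)

print(f"SS left by the mean: {sse_mean:.0f}")
print(f"SS left by the line: {sse_line:.0f}")  # much smaller when x helps
```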

Page 7

Analysis of variance in Y

Total variation (Sum of Squares) in Y: Σ(yi - ȳ)²

= Variation in Y explained by regression (SSRegression) + Variation in Y unexplained by regression (SSResidual)

[Figure: least squares regression line of Y against X, with one observation (xi, yi) decomposed into (ŷi - ȳ), (yi - ŷi) and (yi - ȳ)]

Σ(yi - ȳ)² = Σ(ŷi - ȳ)² + Σ(yi - ŷi)²

Ordinary Least Squares (OLS). A numerical check of this partition follows.
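A quick check of the partition on simulated data; the three sums of squares are computed directly from their definitions.

```python
# Numerical check of the OLS partition SSTotal = SSRegression + SSResidual.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 30)
y = 2.0 + 1.5 * x + rng.normal(0, 1.0, 30)

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

ss_total = np.sum((y - y.mean()) ** 2)
ss_reg = np.sum((y_hat - y.mean()) ** 2)
ss_res = np.sum((y - y_hat) ** 2)

print(ss_total, ss_reg + ss_res)   # agree up to floating-point error
```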

Page 8

Unexplained or residual variation: (yi - ŷi) small versus (yi - ŷi) large.

Explained variation: (ŷi - ȳ) small versus (ŷi - ȳ) large.

Page 9

Analysis of variance

Source of variation   SS             df     Variance (= mean square)
Regression            SSRegression   1      SSRegression / 1
                      (variation in Y explained by regression)
Residual              SSResidual     n - 2  SSResidual / (n - 2)
                      (variation in Y unexplained by regression)

Why n - 2? Because two parameters (the intercept and the slope) are estimated from the data, leaving n - 2 degrees of freedom for the residual.

Analysis of variance

It follows that if:

Variation in Y explained by regression >> Variation in Y

unexplained by regression (MSRegression >> MSResidual)

Then:

Regression function contributes to estimation of Y

(Slope = b1 > 0, or b1 < 0)

Page 10

Slope = b1

[Figure: three scatterplots illustrating the sign of the slope]

• b1 > 0: variation in Y explained by regression > 0; ŷi ≠ ȳ for most xi

• b1 < 0: variation in Y explained by regression > 0; ŷi ≠ ȳ for most xi

• b1 = 0: variation in Y explained by regression = 0; ŷi ≈ ȳ for all xi

Null hypothesis

• Null hypothesis: b1 = 0

• F-ratio statistic = MSRegression / MSResidual

– if H0 true, F-ratio follows F distribution with

dfRegression and dfResidual

• t-statistic = b1 / SE(b1)

– if H0 true, t-statistic follows t distribution

with df = n-2
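A sketch of both test statistics on simulated data. In simple linear regression the F-ratio equals the square of the t-statistic, so the two tests of H0: b1 = 0 agree.

```python
# F-ratio and t-statistic for H0: b1 = 0; in simple regression F = t^2.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 20)
y = 1.0 + 0.8 * x + rng.normal(0, 1.5, 20)
n = x.size

fit = stats.linregress(x, y)
t_stat = fit.slope / fit.stderr                # t = b1 / SE(b1), df = n - 2

y_hat = fit.intercept + fit.slope * x
ms_reg = np.sum((y_hat - y.mean()) ** 2) / 1   # df_regression = 1
ms_res = np.sum((y - y_hat) ** 2) / (n - 2)    # df_residual = n - 2
f_stat = ms_reg / ms_res

print(f"t^2 = {t_stat**2:.4f}, F = {f_stat:.4f}")   # identical
print(f"P = {stats.f.sf(f_stat, 1, n - 2):.6f}")    # matches fit.pvalue
```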

Page 11

Model comparisons

ANOVA for regression:

SSTotal (total variation in Y) = SSRegression (variation explained by regression with X) + SSResidual (residual variation)

Page 12

Full model

yi = b0 + b1xi + ei

• Unexplained variation in Y from full

model = SSResidual

Reduced model (H0 true)

• Reduced model (H0: b1 = 0 true): yi = b0 + ei (mean and error only)

• Unexplained variation in Y from reduced model = SSTotal = Σ(yi - ȳ)²

Page 13

Model comparison

• Difference in unexplained variation between

full and reduced models:

SSTotal - SSResidual

= SSRegression

• Variation explained by including b1 in model

Explained variation

• Proportion of variation in Y explained by

linear relationship with X

• Termed r², the coefficient of determination:

r² = SSRegression / SSTotal

• r² is simply the square of the correlation coefficient (r) between X and Y (a sketch computing it both ways follows).
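A sketch verifying the two routes to r² on simulated data: the ANOVA ratio SSRegression / SSTotal and the squared correlation coefficient.

```python
# r^2 two ways: SSRegression / SSTotal, and the square of r.
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 25)
y = 3.0 + 0.5 * x + rng.normal(0, 1.0, 25)

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

r2_anova = np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)
r2_corr = np.corrcoef(x, y)[0, 1] ** 2

print(r2_anova, r2_corr)   # equal, up to rounding
```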

Page 14

Which is the better model?

[Figure: scatterplots of Y1 against X and of Y2 against X (X from 0 to 15000, Y from 0 to 500), each with a fitted regression line]

For Y1:

Dep Var: Y1 N: 26 Multiple R: 0.754377 Squared multiple R: 0.569085

Adjusted squared multiple R: 0.551131 Standard error of estimate: 86.934708

Effect Coefficient Std Error Std Coef Tolerance t P(2 Tail)

CONSTANT 11.207815 30.277197 0.000000 . 0.37017 0.71450

X 0.026573 0.004720 0.754377 1.000000 5.62987 0.00001

Analysis of Variance

Source Sum-of-Squares df Mean-Square F-ratio P

Regression 2.39543E+05 1 2.39543E+05 31.695479 0.000009

Residual 1.81383E+05 24 7557.643448

Page 15

Which is the better model?

[Figure: scatterplot of Y2 against X with fitted regression line]

For Y2:

Dep Var: Y2 N: 5 Multiple R: 0.978152 Squared multiple R: 0.956781

Adjusted squared multiple R: 0.942374 Standard error of estimate: 33.608617

Effect Coefficient Std Error Std Coef Tolerance t P(2 Tail)

CONSTANT -31.455158 32.524324 0.000000 . -0.96713 0.40482

X 0.033584 0.004121 0.978152 1.000000 8.14944 0.00386

Analysis of Variance

Source Sum-of-Squares df Mean-Square F-ratio P

Regression 7.50166E+04 1 7.50166E+04 66.413444 0.003864

Residual 3388.617362 3 1129.539121

[Figure: the Y1 and Y2 scatterplots side by side]

Which is the better model?

Y2: n = 5, P = 0.00386, r² = 0.942
Y1: n = 26, P = 0.000009, r² = 0.551

Page 16

Which is the better model?

[Figure: the Y1 and Y2 scatterplots with fitted lines and 95% confidence bands (for the slope)]

Y2: n = 5, P = 0.00386, r² = 0.942
Y1: n = 26, P = 0.000009, r² = 0.551

Despite its higher r², the n = 5 fit has far wider confidence bands; the larger sample supports more precise estimates. (A sketch of how such bands are computed follows.)
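A sketch of how such a band is computed, assuming the usual OLS formula for the standard error of a fitted value; the data are simulated. The band is narrowest at x̄ and widens toward the ends of the X range.

```python
# 95% confidence band for the fitted line:
# y_hat +/- t(0.975, n-2) * s * sqrt(1/n + (x0 - xbar)^2 / SSX)
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.uniform(0, 15000, 26)
y = 10 + 0.027 * x + rng.normal(0, 85, 26)
n = x.size

b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)
s = np.sqrt(np.sum(resid ** 2) / (n - 2))   # standard error of estimate
ssx = np.sum((x - x.mean()) ** 2)

x0 = np.linspace(x.min(), x.max(), 5)
y0 = b0 + b1 * x0
half = stats.t.ppf(0.975, n - 2) * s * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / ssx)
for xi, lo, hi in zip(x0, y0 - half, y0 + half):
    print(f"x = {xi:8.0f}   band = ({lo:7.1f}, {hi:7.1f})")   # narrowest at xbar
```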

Assumptions

Page 17

Normality

Y normally distributed at each value of X:

– Boxplot of y should be symmetrical - watch

out for outliers and skewness

– Transformations often help

Homogeneity of variance

Variance (spread) of Y should be constant

for each value of xi (homogeneity of

variance):

– Very difficult to assess usually (for

models with only one value of y per x).

Page 18

[Figure: normal distributions of Y, with equal spread, at two values x1 and x2, centred on the regression line ŷi = b0 + b1xi]

Homogeneity of variance

Variance (spread) of Y should be constant for

each value of xi (homogeneity of variance):

– Very difficult to assess usually (for models with

only one value of y per x).

– Spread of residuals should be even when

plotted against xi or predicted yi’s

– Transformations often help

– Transformations that improve normality of Y will

also usually make variance of Y more constant

Page 19

Independence

Values of yi are independent of each

other:

– watch out for data which are a time series

on same experimental or sampling units

– should be considered at design stage

Linearity

For Linear regression: true population

relationship between Y and X is linear:

– scatterplot of Y against X

– watch out for asymptotic or exponential

patterns

– transformations of Y or Y and X often help

– Always look at residuals

Page 20

EDA and regression diagnostics

• Check assumptions

• Check fit of model

• Warn about influential observations and

outliers

EDA

• Boxplots of Y (and X):

– check for normality, outliers etc.

• Scatterplot of Y and X:

– check for linearity, homogeneity of

variance, outliers etc.

Page 21

Anscombe (1973) data set

[Figure: the four Anscombe scatterplots (X roughly 0 to 20, Y roughly 0 to 14)]

R² = 0.667, y = 3.0 + 0.5x, t = 4.24, P = 0.002 (the same fitted regression for all four data sets; a sketch reproducing these fits follows)
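The quartet values below are hard-coded from Anscombe (1973); all four data sets return essentially the same regression summary even though only the first looks like a straightforward linear relationship.

```python
# Anscombe's quartet: identical regression summaries, very different plots.
from scipy import stats

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8] * 7 + [19] + [8] * 3,
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}
for name, (x, y) in quartet.items():
    fit = stats.linregress(x, y)
    print(f"{name:>3}: y = {fit.intercept:.2f} + {fit.slope:.3f}x, "
          f"r^2 = {fit.rvalue**2:.3f}, P = {fit.pvalue:.4f}")
# All four print approximately y = 3.00 + 0.500x, r^2 = 0.667, P = 0.002.
```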

Page 22

Limited or weighted data smoothers (for data exploration – especially useful for model fitting)

• Nonparametric description of

relationship between Y and X

– unconstrained by specific model structure

• Useful exploratory technique:

– is linear model appropriate?

– are particular observations influential?

Page 23

Limited or weighted data smoothers

• Each observation replaced by the mean or median of surrounding observations
– or by the predicted value of a regression model through the surrounding observations

• Surrounding observations lie in a window (or band)
– covers a range along the X-axis
– size of window (number of observations) determined by the smoothing parameter

Limited or weighted data smoothers

• Adjacent windows overlap

– resulting line is smooth

– smoothness controlled by smoothing

parameter (width of window)

• Any section of line robust to extreme

values in other windows

Page 24

Types of limited or weighted data

smoothers (examples)

• Running (moving) means or averages:

– means or medians within each window

• Lo(w)ess:

– locally weighted regression scatterplot

smoothing

– observations within a window weighted

differently

– observations replaced by predicted values

from local regression line
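A sketch of a lowess smooth for exploration, assuming statsmodels is available; the frac argument is the smoothing parameter (the fraction of observations in each window).

```python
# Lowess (locally weighted regression) smooth of a nonlinear pattern.
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0, 10, 60))
y = np.sin(x) + rng.normal(0, 0.3, 60)   # nonlinear pattern plus noise

smooth = lowess(y, x, frac=0.3)          # columns: sorted x, smoothed y
print(smooth[:5])                        # compare against a straight-line fit
```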

Residuals – very useful for examining regression assumptions

• Difference between the observed value and the value predicted (fitted) by the model

• Residual for each observation:
– the difference between the observed y and the value of y predicted by the linear regression equation: ei = yi - ŷi

Page 25

Studentised residuals

• residual / SE of residuals
• follow a t distribution
• studentised residuals can be compared between different regressions

Observations with a large residual (or studentised residual) are outliers from the fitted model.

Plot residuals against predicted ŷi

[Figure: residuals scattered evenly within ±SE around zero against predicted ŷi, alongside a scatterplot with an even spread of Y around the line]

• No pattern in residuals indicates the assumptions are OK
• Even spread of Y around the line

(A sketch computing studentised residuals by hand follows.)
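A sketch computing internally studentised residuals by hand: each residual is divided by s·√(1 - hii), where hii is that observation's leverage from the hat matrix; the data are simulated.

```python
# Studentised residuals: r_i = e_i / (s * sqrt(1 - h_ii)).
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, 20)
y = 2.0 + 1.0 * x + rng.normal(0, 1.0, 20)
n = x.size

X = np.column_stack([np.ones(n), x])            # design matrix [1, x]
H = X @ np.linalg.inv(X.T @ X) @ X.T            # hat matrix
e = y - H @ y                                   # residuals
s = np.sqrt(np.sum(e ** 2) / (n - 2))
r = e / (s * np.sqrt(1 - np.diag(H)))           # studentised residuals

print(np.round(r, 2))   # values beyond about +/-2 deserve a closer look
```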

Page 26

[Figure: residuals fanning out in a wedge against predicted ŷi, alongside a scatterplot with uneven spread of Y around the line]

• Increasing spread of residuals, i.e. a wedge shape
• Unequal variance in Y
• Skewed distribution of Y
• Transformation of Y helps
• Uneven spread of Y around the line

Other indicators

• Outliers

• Leverage

• Influence

Page 27

Outliers

• Observations further from the fitted model than the remaining observations
– might be different from sample outliers in boxplots

• Large residual

[Figure: scatterplot with one point far from the fitted line, labelled "outlier"]

Use a robust estimator.

Leverage

• How extreme an observation is for the X-variable

• Measures how much each xi influences the predicted ŷi

[Figure: scatterplot with one point at an extreme X-value, labelled "large leverage"]

Page 28

Influence

• Cook's D statistic:
– incorporates leverage & residual
– flags observations with a large influence on the estimated slope
– observations with D near or greater than 1 should be checked

[Figure: scatterplot with three labelled observations]

• Observation 1 is an X and Y outlier but not influential
• Observation 2 has a large residual - an outlier
• Observation 3 is very influential (large Cook's D) - also an outlier

(A sketch computing leverage and Cook's D by hand follows.)
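A sketch computing leverage and Cook's D by hand on simulated data that include one extreme X-value; p = 2 parameters (intercept and slope) are fitted.

```python
# Leverage h_ii and Cook's D: D_i = e_i^2 * h_ii / (p * s^2 * (1 - h_ii)^2).
import numpy as np

rng = np.random.default_rng(7)
x = np.append(rng.uniform(0, 10, 15), 25.0)   # one extreme X-value
y = 1.0 + 0.5 * x + rng.normal(0, 1.0, 16)
n, p = x.size, 2

X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)                                # leverage
e = y - H @ y
s2 = np.sum(e ** 2) / (n - p)
D = e ** 2 * h / (p * s2 * (1 - h) ** 2)      # Cook's D

print(f"max leverage {h.max():.2f} at x = {x[h.argmax()]:.0f}")
print(f"max Cook's D {D.max():.2f}")          # check points with D near 1
```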

Page 29

Full set of lecture notes

Page 30

Introduction to Linear regression analysis

Part 1

Simple linear regression

• Two continuous variables:

– response (dependent) variable (Y)

– predictor (independent) variable (X)

– each recorded for n observations (replicates)

• Predictor variable “influences” response variable

• Does not necessarily demonstrate causality

(depends on design of experiment or survey)

Page 31

Scatterplot

[Figure: scatterplot of CWD (coarse woody debris) basal area (0 to 200) against riparian tree density (0 to 2500)]

Linear regression

• Description:

– relationship between response (Y) and

predictor (X) variable

• Explanation:

– how much of variation in Y explained by

linear relationship with X

• Prediction:

– new Y-values from new X-values

– precision of those estimates

Page 32

Regression model

yi = b0 + b1xi + ei

(CWD basal area)i = b0 + b1(tree density)i + ei

yi value of Y for ith observation

xi value of X for ith observation

b0 population intercept (value of Y when X = 0)

b1 population slope (change in Y per unit change in X)

ei error term (measures variation in Y at each xi: the deviation of each yi from its predicted value)

[Figure: scatterplot of the example data, X and Y both from 5 to 20]

Calculations of the linear regression equation

Page 33

Calculate mean x and y values

x̄ = 13.14
ȳ = 13.00

[Figure: the example scatterplot with the means marked]

Calculate deviations from mean x and y values

[Figure: the example scatterplot divided into quadrants at (x̄, ȳ); deviations such as +3 and +2.86, and -5 and -4.14, are marked, and the quadrants carry the signs (+,+), (+,-), (-,+), (-,-)]

Page 34

Calculate sum of xy deviation cross-products

[Figure: quadrants at (x̄, ȳ) showing the sign of each deviation cross-product (xi - x̄)(yi - ȳ): positive in the (+,+) and (-,-) quadrants, negative in the (+,-) and (-,+) quadrants]

Calculate slope

Slope = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)² = 1.09

X     Y     CPXY = (xi - x̄)(yi - ȳ)   SSX = (xi - x̄)²
9     8     20.7143                    17.1633
11    12     2.1429                     4.5918
12    11     2.2857                     1.3061
13    14    -0.1429                     0.0204
14    12    -0.8571                     0.7347
16    17    11.4286                     8.1633
17    17    15.4286                    14.8776

Means: x̄ = 13.14286, ȳ = 13
Sums: 51.0000 and 46.8571
Slope = 51.0000 / 46.8571 = 1.0884

[Figure: the example scatterplot with the fitted line]

(A sketch reproducing this calculation in code follows.)
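The same arithmetic in code, using the seven (x, y) pairs from the table above. The intercept (derived on the next slide) comes out near -1.31; the slide's -1.32 reflects the rounded slope 1.09.

```python
# Slope from deviation cross-products; intercept from y = b0 + b1*x at the means.
import numpy as np

x = np.array([9, 11, 12, 13, 14, 16, 17], dtype=float)
y = np.array([8, 12, 11, 14, 12, 17, 17], dtype=float)

cpxy = np.sum((x - x.mean()) * (y - y.mean()))   # 51.0000
ssx = np.sum((x - x.mean()) ** 2)                # 46.8571
b1 = cpxy / ssx                                  # 1.0884
b0 = y.mean() - b1 * x.mean()                    # about -1.31

print(f"slope b1 = {b1:.4f}, intercept b0 = {b0:.2f}")
```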

Page 35

Solve for intercept

y = b0 + b1x, where b0 = intercept and b1 = slope.

Rearrange: y - b1x = b0. It can be shown that the least squares line passes through (x̄, ȳ), so substituting the means:

13 = b0 + 1.09(13.14)
13 - 1.09(13.14) = b0
b0 = -1.32

[Figure: the example scatterplot with the fitted line extended back to its intercept]

Slopes

Slope = b1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)²

[Figure: three scatterplots with the regions of the cross-product marked at (x̄, ȳ)]

• b1 > 0: points fall mainly in the (+,+) and (-,-) regions

• b1 < 0: points fall mainly in the (+,-) and (-,+) regions

• b1 = 0: points are balanced across the regions

Page 36

Regression line

[Figure: regression line of Y against X with the intercept and the slope (change in Y per unit change in X) labelled; normal distributions of Y at x1 and x2, with equal spread, are centred on the line ŷi = b0 + b1xi]

Page 37

Regression model

yi = b0 + b1xi + ei

E(yi) = b0 + b1xi

where:

• E(yi) = expected value of yi

• ei (error term) measures the difference between yi and E(yi) at each xi

Sample regression equation

ŷi = b0 + b1xi

• ŷi: predicted Y-value for xi; estimates E(yi)

• b0: sample intercept; estimates the population intercept

• b1: sample regression slope; estimates the population slope

Page 38

Regression line

[Figure: fitted regression line of CWD basal area (-100 to 200) against riparian tree density (0 to 2500), with the slope and intercept labelled; a second panel shows the same data with the sample mean ȳ and its confidence interval ȳ ± t(s/√n)]

The logic of the assessment of regression models – what to compare to

If the data are normally distributed, then an unbiased estimate of the distribution of means can be obtained from ȳ and its standard error (SE).

Page 39

[Figure: employment (thousands, with confidence interval) plotted against year (1945 to 1965), from Longley.syd; x̄ = 1954.5, ȳ = 65,317]

Now let's assume that we think there may be a relationship between year and employment.

Question: does the mean (or some other estimator that does not include the relationship between Y and X) fit the data better than an estimator that includes the effect of X?

[Figure: two panels of employment (thousands) against year, 1945 to 1965]

Page 40

Analysis of variance in Y

Total variation (Sum of Squares) in Y: Σ(yi - ȳ)²

= Variation in Y explained by regression (SSRegression) + Variation in Y unexplained by regression (SSResidual)

[Figure: least squares regression line of Y against X, with one observation (xi, yi) decomposed into (ŷi - ȳ), (yi - ŷi) and (yi - ȳ)]

Σ(yi - ȳ)² = Σ(ŷi - ȳ)² + Σ(yi - ŷi)²

Ordinary Least Squares (OLS)

Page 41

Unexplained or residual variation: (yi - ŷi) small versus (yi - ŷi) large.

Explained variation: (ŷi - ȳ) small versus (ŷi - ȳ) large.

Page 42

Analysis of variance

Source of variation   SS             df     Variance (= mean square)
Regression            SSRegression   1      SSRegression / 1
                      (variation in Y explained by regression)
Residual              SSResidual     n - 2  SSResidual / (n - 2)
                      (variation in Y unexplained by regression)

Why n - 2? Because two parameters (the intercept and the slope) are estimated from the data, leaving n - 2 degrees of freedom for the residual.

Analysis of variance

It follows that if:

Variation in Y explained by regression >> Variation in Y

unexplained by regression (MSRegression >> MSResidual)

Then:

Regression function contributes to estimation of Y

(Slope = b1 > 0, or b1 < 0)

Page 43

Slope = b1

[Figure: three scatterplots illustrating the sign of the slope]

• b1 > 0: variation in Y explained by regression > 0; ŷi ≠ ȳ for most xi

• b1 < 0: variation in Y explained by regression > 0; ŷi ≠ ȳ for most xi

• b1 = 0: variation in Y explained by regression = 0; ŷi ≈ ȳ for all xi

Null hypothesis

• Null hypothesis: b1 = 0

• F-ratio statistic = MSRegression / MSResidual

– if H0 true, F-ratio follows F distribution with

dfRegression and dfResidual

• t-statistic = b1 / SE(b1)

– if H0 true, t-statistic follows t distribution

with df = n-2

Page 44

Model comparisons

ANOVA for regression:

SSTotal (total variation in Y) = SSRegression (variation explained by regression with X) + SSResidual (residual variation)

Page 45

Full model

yi = b0 + b1xi + ei

• Unexplained variation in Y from full

model = SSResidual

Reduced model (H0 true)

• Reduced model (H0: b1 = 0 true): yi = b0 + ei (mean and error only)

• Unexplained variation in Y from reduced model = SSTotal = Σ(yi - ȳ)²

Page 46

Model comparison

• Difference in unexplained variation between

full and reduced models:

SSTotal - SSResidual

= SSRegression

• Variation explained by including b1 in model

Explained variation

• Proportion of variation in Y explained by

linear relationship with X

• Termed r², the coefficient of determination:

r² = SSRegression / SSTotal

• r² is simply the square of the correlation coefficient (r) between X and Y.

Page 47

Which is the better model?

[Figure: scatterplots of Y1 against X and of Y2 against X (X from 0 to 15000, Y from 0 to 500), each with a fitted regression line]

For Y1:

Dep Var: Y1 N: 26 Multiple R: 0.754377 Squared multiple R: 0.569085

Adjusted squared multiple R: 0.551131 Standard error of estimate: 86.934708

Effect Coefficient Std Error Std Coef Tolerance t P(2 Tail)

CONSTANT 11.207815 30.277197 0.000000 . 0.37017 0.71450

X 0.026573 0.004720 0.754377 1.000000 5.62987 0.00001

Analysis of Variance

Source Sum-of-Squares df Mean-Square F-ratio P

Regression 2.39543E+05 1 2.39543E+05 31.695479 0.000009

Residual 1.81383E+05 24 7557.643448

Page 48

Which is the better model?

[Figure: scatterplot of Y2 against X with fitted regression line]

For Y2:

Dep Var: Y2 N: 5 Multiple R: 0.978152 Squared multiple R: 0.956781

Adjusted squared multiple R: 0.942374 Standard error of estimate: 33.608617

Effect Coefficient Std Error Std Coef Tolerance t P(2 Tail)

CONSTANT -31.455158 32.524324 0.000000 . -0.96713 0.40482

X 0.033584 0.004121 0.978152 1.000000 8.14944 0.00386

Analysis of Variance

Source Sum-of-Squares df Mean-Square F-ratio P

Regression 7.50166E+04 1 7.50166E+04 66.413444 0.003864

Residual 3388.617362 3 1129.539121

[Figure: the Y1 and Y2 scatterplots side by side]

Which is the better model?

Y2: n = 5, P = 0.00386, r² = 0.942
Y1: n = 26, P = 0.000009, r² = 0.551

Page 49

Which is the better model?

[Figure: the Y1 and Y2 scatterplots with fitted lines and 95% confidence bands (for the slope)]

Y2: n = 5, P = 0.00386, r² = 0.942
Y1: n = 26, P = 0.000009, r² = 0.551

Assumptions

Page 50

Normality

Y normally distributed at each value of X:

– Boxplot of y should be symmetrical - watch

out for outliers and skewness

– Transformations often help

Homogeneity of variance

Variance (spread) of Y should be constant

for each value of xi (homogeneity of

variance):

– Very difficult to assess usually (for

models with only one value of y per x).

Page 51

[Figure: normal distributions of Y, with equal spread, at two values x1 and x2, centred on the regression line ŷi = b0 + b1xi]

Homogeneity of variance

Variance (spread) of Y should be constant for

each value of xi (homogeneity of variance):

– Very difficult to assess usually (for models with

only one value of y per x).

– Spread of residuals should be even when

plotted against xi or predicted yi’s

– Transformations often help

– Transformations that improve normality of Y will

also usually make variance of Y more constant

Page 52

Independence

Values of yi are independent of each

other:

– watch out for data which are a time series

on same experimental or sampling units

– should be considered at design stage

Linearity

For Linear regression: true population

relationship between Y and X is linear:

– scatterplot of Y against X

– watch out for asymptotic or exponential

patterns

– transformations of Y or Y and X often help

– Always look at residuals

Page 53

EDA and regression diagnostics

• Check assumptions

• Check fit of model

• Warn about influential observations and

outliers

EDA

• Boxplots of Y (and X):

– check for normality, outliers etc.

• Scatterplot of Y and X:

– check for linearity, homogeneity of

variance, outliers etc.

Page 54

Anscombe (1973) data set

[Figure: the four Anscombe scatterplots (X roughly 0 to 20, Y roughly 0 to 14)]

R² = 0.667, y = 3.0 + 0.5x, t = 4.24, P = 0.002 (the same fitted regression for all four data sets)

Page 55

Limited or weighted data smoothers (for data exploration – especially useful for model fitting)

• Nonparametric description of

relationship between Y and X

– unconstrained by specific model structure

• Useful exploratory technique:

– is linear model appropriate?

– are particular observations influential?

Page 56

Limited or weighted data smoothers

• Each observation replaced by the mean or median of surrounding observations
– or by the predicted value of a regression model through the surrounding observations

• Surrounding observations lie in a window (or band)
– covers a range along the X-axis
– size of window (number of observations) determined by the smoothing parameter

Limited or weighted data smoothers

• Adjacent windows overlap

– resulting line is smooth

– smoothness controlled by smoothing

parameter (width of window)

• Any section of line robust to extreme

values in other windows

Page 57

Types of limited or weighted data

smoothers (examples)

• Running (moving) means or averages:

– means or medians within each window

• Lo(w)ess:

– locally weighted regression scatterplot

smoothing

– observations within a window weighted

differently

– observations replaced by predicted values

from local regression line

Residuals – very useful for examining regression assumptions

• Difference between the observed value and the value predicted (fitted) by the model

• Residual for each observation:
– the difference between the observed y and the value of y predicted by the linear regression equation: ei = yi - ŷi

Page 58

Studentised residuals

• residual / SE of residuals
• follow a t distribution
• studentised residuals can be compared between different regressions

Observations with a large residual (or studentised residual) are outliers from the fitted model.

Plot residuals against predicted ŷi

[Figure: residuals scattered evenly within ±SE around zero against predicted ŷi, alongside a scatterplot with an even spread of Y around the line]

• No pattern in residuals indicates the assumptions are OK
• Even spread of Y around the line

Page 59

[Figure: residuals fanning out in a wedge against predicted ŷi, alongside a scatterplot with uneven spread of Y around the line]

• Increasing spread of residuals, i.e. a wedge shape
• Unequal variance in Y
• Skewed distribution of Y
• Transformation of Y helps
• Uneven spread of Y around the line

Other indicators

• Outliers

• Leverage

• Influence

Page 60

Outliers

• Observations further from the fitted model than the remaining observations
– might be different from sample outliers in boxplots

• Large residual

[Figure: scatterplot with one point far from the fitted line, labelled "outlier"]

Use a robust estimator.

Leverage

• How extreme an observation is for the X-variable

• Measures how much each xi influences the predicted ŷi

[Figure: scatterplot with one point at an extreme X-value, labelled "large leverage"]

Page 61

Influence

• Cook's D statistic:
– incorporates leverage & residual
– flags observations with a large influence on the estimated slope
– observations with D near or greater than 1 should be checked

[Figure: scatterplot with three labelled observations]

• Observation 1 is an X and Y outlier but not influential
• Observation 2 has a large residual - an outlier
• Observation 3 is very influential (large Cook's D) - also an outlier