
Page 1

Introduction to Linear regression analysis

Part 1

Simple linear regression

• Two continuous variables:

– response (dependent) variable (Y)

– predictor (independent) variable (X)

– each recorded for n observations (replicates)

• Predictor variable “influences” response variable

• Does not necessarily demonstrate causality

(depends on design of experiment or survey)

Page 2

Scatterplot

[Figure: scatterplot of CWD (coarse woody debris) basal area (0 to 200) against riparian tree density (0 to 2500)]

Linear regression

• Description:

– relationship between response (Y) and

predictor (X) variable

• Explanation:

– how much of variation in Y explained by

linear relationship with X

• Prediction:

– new Y-values from new X-values

– precision of those estimates
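The fit-then-predict workflow just described can be sketched in a few lines. This is a minimal illustration on made-up numbers (not the CWD data), using scipy's linregress as one convenient fitting routine:

```python
# Minimal sketch: fit a simple linear regression, then predict new Y from new X.
# The (x, y) values below are illustrative, not the CWD data.
import numpy as np
from scipy import stats

x = np.array([100, 400, 700, 1100, 1500, 1900, 2300], dtype=float)  # predictor
y = np.array([5, 20, 60, 90, 110, 150, 180], dtype=float)           # response

fit = stats.linregress(x, y)   # returns slope, intercept, rvalue, pvalue, stderr
print(f"slope = {fit.slope:.4f}, intercept = {fit.intercept:.2f}, "
      f"r^2 = {fit.rvalue**2:.3f}")

x_new = np.array([500.0, 2000.0])        # prediction: new Y-values from new X-values
y_new = fit.intercept + fit.slope * x_new
print("predicted:", y_new)
```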

Page 3

Regression model

yi = b0 + b1xi + ei

(CWD basal area)i = b0 + b1(tree density)i + ei

yi value of Y for ith observation

xi value of X for ith observation

b0 population intercept (value of Y when X = 0)

b1 population slope (change in Y per unit change in X)

ei error term (measures variation in Y at each xi: the deviation of each yi from its predicted value)

Regression line

[Figure: regression line of Y against X, showing the intercept (value of Y at X = 0) and the slope (change in Y per unit change in X)]

Page 4

Regression model

yi = b0 + b1xi + ei

E(yi) = b0 + b1xi

where:

• E(yi) = expected value of yi

• ei (error term) measures difference

between yi and E(yi) at each xi

Sample regression equation

ŷi = b0 + b1xi

• ŷi: predicted Y-value for xi; estimates E(yi)

• b0: sample intercept; estimates the population intercept

• b1: sample regression slope; estimates the population slope

Page 5

Regression line

[Figure: fitted regression line of CWD basal area (-100 to 200) against riparian tree density (0 to 2500), with the slope and intercept labelled; a second panel shows the same data with the sample mean ȳ and its confidence interval ȳ ± t(s/√n)]

The logic of the assessment of regression models – what to compare to

If the data are normally distributed, then an unbiased estimate of the distribution of means can be obtained from ȳ and its standard error (SE).

Page 6

[Figure: employment (thousands, with confidence interval) plotted against year, from Longley.csv; x̄ = 1954.5, ȳ = 65,317]

Now let's assume that we think there may be a relationship between year and employment.

Question: does the mean (or some other estimator that does not include the relationship between Y and X) fit the data better than an estimator that includes the effect of X?

[Figure: two panels of employment (thousands) against year, 1945 to 1965]

(A numerical sketch of this comparison follows.)
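Here is a minimal numerical sketch of that question, using made-up year/employment values standing in for Longley.csv: compare the unexplained variation left by the mean alone with that left by a fitted line.

```python
# Does the mean alone fit as well as a line in year? Illustrative data only.
import numpy as np

year = np.arange(1947, 1963, dtype=float)
emp = 60000 + 700 * (year - 1947) + np.random.default_rng(1).normal(0, 800, year.size)

sse_mean = np.sum((emp - emp.mean()) ** 2)   # estimator without x: the mean
b1, b0 = np.polyfit(year, emp, 1)            # estimator including the effect of x
sse_line = np.sum((emp - (b0 + b1 * year)) ** 2)

print(f"SS left by the mean: {sse_mean:.0f}")
print(f"SS left by the line: {sse_line:.0f}")  # much smaller when x helps
```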

Page 7

Analysis of variance in Y

Total variation (Sum of Squares) in Y: Σ(yi - ȳ)²

= Variation in Y explained by regression (SSRegression) + Variation in Y unexplained by regression (SSResidual)

[Figure: least squares regression line of Y against X, with one observation (xi, yi) decomposed into (ŷi - ȳ), (yi - ŷi) and (yi - ȳ)]

Σ(yi - ȳ)² = Σ(ŷi - ȳ)² + Σ(yi - ŷi)²

Ordinary Least Squares (OLS). A numerical check of this partition follows.
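A quick check of the partition on simulated data; the three sums of squares are computed directly from their definitions.

```python
# Numerical check of the OLS partition SSTotal = SSRegression + SSResidual.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 30)
y = 2.0 + 1.5 * x + rng.normal(0, 1.0, 30)

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

ss_total = np.sum((y - y.mean()) ** 2)
ss_reg = np.sum((y_hat - y.mean()) ** 2)
ss_res = np.sum((y - y_hat) ** 2)

print(ss_total, ss_reg + ss_res)   # agree up to floating-point error
```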

Page 8

Unexplained or residual variation: (yi - ŷi) small versus (yi - ŷi) large.

Explained variation: (ŷi - ȳ) small versus (ŷi - ȳ) large.

Page 9

Analysis of variance

Source of variation   SS             df     Variance (= mean square)
Regression            SSRegression   1      SSRegression / 1
                      (variation in Y explained by regression)
Residual              SSResidual     n - 2  SSResidual / (n - 2)
                      (variation in Y unexplained by regression)

Why n - 2? Because two parameters (the intercept and the slope) are estimated from the data, leaving n - 2 degrees of freedom for the residual.

Analysis of variance

It follows that if:

Variation in Y explained by regression >> Variation in Y

unexplained by regression (MSRegression >> MSResidual)

Then:

Regression function contributes to estimation of Y

(Slope = b1 > 0, or b1 < 0)

Page 10

Slope = b1

[Figure: three scatterplots illustrating the sign of the slope]

• b1 > 0: variation in Y explained by regression > 0; ŷi ≠ ȳ for most xi

• b1 < 0: variation in Y explained by regression > 0; ŷi ≠ ȳ for most xi

• b1 = 0: variation in Y explained by regression = 0; ŷi ≈ ȳ for all xi

Null hypothesis

• Null hypothesis: b1 = 0

• F-ratio statistic = MSRegression / MSResidual

– if H0 true, F-ratio follows F distribution with

dfRegression and dfResidual

• t-statistic = b1 / SE(b1)

– if H0 true, t-statistic follows t distribution

with df = n-2
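A sketch of both test statistics on simulated data. In simple linear regression the F-ratio equals the square of the t-statistic, so the two tests of H0: b1 = 0 agree.

```python
# F-ratio and t-statistic for H0: b1 = 0; in simple regression F = t^2.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 20)
y = 1.0 + 0.8 * x + rng.normal(0, 1.5, 20)
n = x.size

fit = stats.linregress(x, y)
t_stat = fit.slope / fit.stderr                # t = b1 / SE(b1), df = n - 2

y_hat = fit.intercept + fit.slope * x
ms_reg = np.sum((y_hat - y.mean()) ** 2) / 1   # df_regression = 1
ms_res = np.sum((y - y_hat) ** 2) / (n - 2)    # df_residual = n - 2
f_stat = ms_reg / ms_res

print(f"t^2 = {t_stat**2:.4f}, F = {f_stat:.4f}")   # identical
print(f"P = {stats.f.sf(f_stat, 1, n - 2):.6f}")    # matches fit.pvalue
```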

Page 11

Model comparisons

ANOVA for regression:

SSTotal (total variation in Y) = SSRegression (variation explained by regression with X) + SSResidual (residual variation)

Page 12

Full model

yi = b0 + b1xi + ei

• Unexplained variation in Y from full

model = SSResidual

Reduced model (H0 true)

• Reduced model (H0: b1 = 0 true): yi = b0 + ei (mean and error only)

• Unexplained variation in Y from reduced model = SSTotal = Σ(yi - ȳ)²

Page 13

Model comparison

• Difference in unexplained variation between

full and reduced models:

SSTotal - SSResidual

= SSRegression

• Variation explained by including b1 in model

Explained variation

• Proportion of variation in Y explained by

linear relationship with X

• Termed r², the coefficient of determination:

r² = SSRegression / SSTotal

• r² is simply the square of the correlation coefficient (r) between X and Y (a sketch computing it both ways follows).
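A sketch verifying the two routes to r² on simulated data: the ANOVA ratio SSRegression / SSTotal and the squared correlation coefficient.

```python
# r^2 two ways: SSRegression / SSTotal, and the square of r.
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 25)
y = 3.0 + 0.5 * x + rng.normal(0, 1.0, 25)

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

r2_anova = np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)
r2_corr = np.corrcoef(x, y)[0, 1] ** 2

print(r2_anova, r2_corr)   # equal, up to rounding
```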

Page 14

Which is the better model?

[Figure: scatterplots of Y1 against X and of Y2 against X (X from 0 to 15000, Y from 0 to 500), each with a fitted regression line]

For Y1:

Dep Var: Y1 N: 26 Multiple R: 0.754377 Squared multiple R: 0.569085

Adjusted squared multiple R: 0.551131 Standard error of estimate: 86.934708

Effect Coefficient Std Error Std Coef Tolerance t P(2 Tail)

CONSTANT 11.207815 30.277197 0.000000 . 0.37017 0.71450

X 0.026573 0.004720 0.754377 1.000000 5.62987 0.00001

Analysis of Variance

Source Sum-of-Squares df Mean-Square F-ratio P

Regression 2.39543E+05 1 2.39543E+05 31.695479 0.000009

Residual 1.81383E+05 24 7557.643448

Page 15

Which is the better model?

[Figure: scatterplot of Y2 against X with fitted regression line]

For Y2:

Dep Var: Y2 N: 5 Multiple R: 0.978152 Squared multiple R: 0.956781

Adjusted squared multiple R: 0.942374 Standard error of estimate: 33.608617

Effect Coefficient Std Error Std Coef Tolerance t P(2 Tail)

CONSTANT -31.455158 32.524324 0.000000 . -0.96713 0.40482

X 0.033584 0.004121 0.978152 1.000000 8.14944 0.00386

Analysis of Variance

Source Sum-of-Squares df Mean-Square F-ratio P

Regression 7.50166E+04 1 7.50166E+04 66.413444 0.003864

Residual 3388.617362 3 1129.539121

[Figure: the Y1 and Y2 scatterplots side by side]

Which is the better model?

Y2: n = 5, P = 0.00386, r² = 0.942
Y1: n = 26, P = 0.000009, r² = 0.551

Page 16

Which is the better model?

[Figure: the Y1 and Y2 scatterplots with fitted lines and 95% confidence bands (for the slope)]

Y2: n = 5, P = 0.00386, r² = 0.942
Y1: n = 26, P = 0.000009, r² = 0.551

Despite its higher r², the n = 5 fit has far wider confidence bands; the larger sample supports more precise estimates. (A sketch of how such bands are computed follows.)
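A sketch of how such a band is computed, assuming the usual OLS formula for the standard error of a fitted value; the data are simulated. The band is narrowest at x̄ and widens toward the ends of the X range.

```python
# 95% confidence band for the fitted line:
# y_hat +/- t(0.975, n-2) * s * sqrt(1/n + (x0 - xbar)^2 / SSX)
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.uniform(0, 15000, 26)
y = 10 + 0.027 * x + rng.normal(0, 85, 26)
n = x.size

b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)
s = np.sqrt(np.sum(resid ** 2) / (n - 2))   # standard error of estimate
ssx = np.sum((x - x.mean()) ** 2)

x0 = np.linspace(x.min(), x.max(), 5)
y0 = b0 + b1 * x0
half = stats.t.ppf(0.975, n - 2) * s * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / ssx)
for xi, lo, hi in zip(x0, y0 - half, y0 + half):
    print(f"x = {xi:8.0f}   band = ({lo:7.1f}, {hi:7.1f})")   # narrowest at xbar
```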

Assumptions

Page 17

Normality

Y normally distributed at each value of X:

– Boxplot of y should be symmetrical - watch

out for outliers and skewness

– Transformations often help

Homogeneity of variance

Variance (spread) of Y should be constant

for each value of xi (homogeneity of

variance):

– Very difficult to assess usually (for

models with only one value of y per x).

Page 18

[Figure: normal distributions of Y, with equal spread, at two values x1 and x2, centred on the regression line ŷi = b0 + b1xi]

Homogeneity of variance

Variance (spread) of Y should be constant for

each value of xi (homogeneity of variance):

– Very difficult to assess usually (for models with

only one value of y per x).

– Spread of residuals should be even when

plotted against xi or predicted yi’s

– Transformations often help

– Transformations that improve normality of Y will

also usually make variance of Y more constant

Page 19

Independence

Values of yi are independent of each

other:

– watch out for data which are a time series

on same experimental or sampling units

– should be considered at design stage

Linearity

For Linear regression: true population

relationship between Y and X is linear:

– scatterplot of Y against X

– watch out for asymptotic or exponential

patterns

– transformations of Y or Y and X often help

– Always look at residuals

Page 20

EDA and regression diagnostics

• Check assumptions

• Check fit of model

• Warn about influential observations and

outliers

EDA

• Boxplots of Y (and X):

– check for normality, outliers etc.

• Scatterplot of Y and X:

– check for linearity, homogeneity of

variance, outliers etc.

Page 21

Anscombe (1973) data set

[Figure: the four Anscombe scatterplots (X roughly 0 to 20, Y roughly 0 to 14)]

R² = 0.667, y = 3.0 + 0.5x, t = 4.24, P = 0.002 (the same fitted regression for all four data sets; a sketch reproducing these fits follows)
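The quartet values below are hard-coded from Anscombe (1973); all four data sets return essentially the same regression summary even though only the first looks like a straightforward linear relationship.

```python
# Anscombe's quartet: identical regression summaries, very different plots.
from scipy import stats

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8] * 7 + [19] + [8] * 3,
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}
for name, (x, y) in quartet.items():
    fit = stats.linregress(x, y)
    print(f"{name:>3}: y = {fit.intercept:.2f} + {fit.slope:.3f}x, "
          f"r^2 = {fit.rvalue**2:.3f}, P = {fit.pvalue:.4f}")
# All four print approximately y = 3.00 + 0.500x, r^2 = 0.667, P = 0.002.
```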

Page 22

Limited or weighted data smoothers (for data exploration – especially useful for model fitting)

• Nonparametric description of

relationship between Y and X

– unconstrained by specific model structure

• Useful exploratory technique:

– is linear model appropriate?

– are particular observations influential?

Page 23

Limited or weighted data smoothers

• Each observation replaced by the mean or median of surrounding observations
– or by the predicted value of a regression model through the surrounding observations

• Surrounding observations lie in a window (or band)
– covers a range along the X-axis
– size of window (number of observations) determined by the smoothing parameter

Limited or weighted data smoothers

• Adjacent windows overlap

– resulting line is smooth

– smoothness controlled by smoothing

parameter (width of window)

• Any section of line robust to extreme

values in other windows

Page 24

Types of limited or weighted data

smoothers (examples)

• Running (moving) means or averages:

– means or medians within each window

• Lo(w)ess:

– locally weighted regression scatterplot

smoothing

– observations within a window weighted

differently

– observations replaced by predicted values

from local regression line
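A sketch of a lowess smooth for exploration, assuming statsmodels is available; the frac argument is the smoothing parameter (the fraction of observations in each window).

```python
# Lowess (locally weighted regression) smooth of a nonlinear pattern.
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0, 10, 60))
y = np.sin(x) + rng.normal(0, 0.3, 60)   # nonlinear pattern plus noise

smooth = lowess(y, x, frac=0.3)          # columns: sorted x, smoothed y
print(smooth[:5])                        # compare against a straight-line fit
```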

Residuals – very useful for examining regression assumptions

• Difference between the observed value and the value predicted (fitted) by the model

• Residual for each observation:
– the difference between the observed y and the value of y predicted by the linear regression equation: ei = yi - ŷi

Page 25

Studentised residuals

• residual / SE of residuals
• follow a t distribution
• studentised residuals can be compared between different regressions

Observations with a large residual (or studentised residual) are outliers from the fitted model.

Plot residuals against predicted ŷi

[Figure: residuals scattered evenly within ±SE around zero against predicted ŷi, alongside a scatterplot with an even spread of Y around the line]

• No pattern in residuals indicates the assumptions are OK
• Even spread of Y around the line

(A sketch computing studentised residuals by hand follows.)
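A sketch computing internally studentised residuals by hand: each residual is divided by s·√(1 - hii), where hii is that observation's leverage from the hat matrix; the data are simulated.

```python
# Studentised residuals: r_i = e_i / (s * sqrt(1 - h_ii)).
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, 20)
y = 2.0 + 1.0 * x + rng.normal(0, 1.0, 20)
n = x.size

X = np.column_stack([np.ones(n), x])            # design matrix [1, x]
H = X @ np.linalg.inv(X.T @ X) @ X.T            # hat matrix
e = y - H @ y                                   # residuals
s = np.sqrt(np.sum(e ** 2) / (n - 2))
r = e / (s * np.sqrt(1 - np.diag(H)))           # studentised residuals

print(np.round(r, 2))   # values beyond about +/-2 deserve a closer look
```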

Page 26

[Figure: residuals fanning out in a wedge against predicted ŷi, alongside a scatterplot with uneven spread of Y around the line]

• Increasing spread of residuals, i.e. a wedge shape
• Unequal variance in Y
• Skewed distribution of Y
• Transformation of Y helps
• Uneven spread of Y around the line

Other indicators

• Outliers

• Leverage

• Influence

Page 27

Outliers

• Observations further from the fitted model than the remaining observations
– might be different from sample outliers in boxplots

• Large residual

[Figure: scatterplot with one point far from the fitted line, labelled "outlier"]

Use a robust estimator.

Leverage

• How extreme an observation is for the X-variable

• Measures how much each xi influences the predicted ŷi

[Figure: scatterplot with one point at an extreme X-value, labelled "large leverage"]

Page 28

Influence

• Cook's D statistic:
– incorporates leverage & residual
– flags observations with a large influence on the estimated slope
– observations with D near or greater than 1 should be checked

[Figure: scatterplot with three labelled observations]

• Observation 1 is an X and Y outlier but not influential
• Observation 2 has a large residual - an outlier
• Observation 3 is very influential (large Cook's D) - also an outlier

(A sketch computing leverage and Cook's D by hand follows.)
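A sketch computing leverage and Cook's D by hand on simulated data that include one extreme X-value; p = 2 parameters (intercept and slope) are fitted.

```python
# Leverage h_ii and Cook's D: D_i = e_i^2 * h_ii / (p * s^2 * (1 - h_ii)^2).
import numpy as np

rng = np.random.default_rng(7)
x = np.append(rng.uniform(0, 10, 15), 25.0)   # one extreme X-value
y = 1.0 + 0.5 * x + rng.normal(0, 1.0, 16)
n, p = x.size, 2

X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)                                # leverage
e = y - H @ y
s2 = np.sum(e ** 2) / (n - p)
D = e ** 2 * h / (p * s2 * (1 - h) ** 2)      # Cook's D

print(f"max leverage {h.max():.2f} at x = {x[h.argmax()]:.0f}")
print(f"max Cook's D {D.max():.2f}")          # check points with D near 1
```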

Page 29

Full set of lecture notes

Page 30

Introduction to Linear regression analysis

Part 1

Simple linear regression

• Two continuous variables:

– response (dependent) variable (Y)

– predictor (independent) variable (X)

– each recorded for n observations (replicates)

• Predictor variable “influences” response variable

• Does not necessarily demonstrate causality

(depends on design of experiment or survey)

Page 31

Scatterplot

[Figure: scatterplot of CWD (coarse woody debris) basal area (0 to 200) against riparian tree density (0 to 2500)]

Linear regression

• Description:

– relationship between response (Y) and

predictor (X) variable

• Explanation:

– how much of variation in Y explained by

linear relationship with X

• Prediction:

– new Y-values from new X-values

– precision of those estimates

Page 32

Regression model

yi = b0 + b1xi + ei

(CWD basal area)i = b0 + b1(tree density)i + ei

yi value of Y for ith observation

xi value of X for ith observation

b0 population intercept (value of Y when X = 0)

b1 population slope (change in Y per unit change in X)

ei error term (measures variation in Y at each xi: the deviation of each yi from its predicted value)

[Figure: scatterplot of the example data, X and Y both from 5 to 20]

Calculations of the linear regression equation

Page 33

Calculate mean x and y values

x̄ = 13.14
ȳ = 13.00

[Figure: the example scatterplot with the means marked]

Calculate deviations from mean x and y values

[Figure: the example scatterplot divided into quadrants at (x̄, ȳ); deviations such as +3 and +2.86, and -5 and -4.14, are marked, and the quadrants carry the signs (+,+), (+,-), (-,+), (-,-)]

Page 34

Calculate sum of xy deviation cross-products

[Figure: quadrants at (x̄, ȳ) showing the sign of each deviation cross-product (xi - x̄)(yi - ȳ): positive in the (+,+) and (-,-) quadrants, negative in the (+,-) and (-,+) quadrants]

Calculate slope

Slope = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)² = 1.09

X     Y     CPXY = (xi - x̄)(yi - ȳ)   SSX = (xi - x̄)²
9     8     20.7143                    17.1633
11    12     2.1429                     4.5918
12    11     2.2857                     1.3061
13    14    -0.1429                     0.0204
14    12    -0.8571                     0.7347
16    17    11.4286                     8.1633
17    17    15.4286                    14.8776

Means: x̄ = 13.14286, ȳ = 13
Sums: 51.0000 and 46.8571
Slope = 51.0000 / 46.8571 = 1.0884

[Figure: the example scatterplot with the fitted line]

(A sketch reproducing this calculation in code follows.)
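The same arithmetic in code, using the seven (x, y) pairs from the table above. The intercept (derived on the next slide) comes out near -1.31; the slide's -1.32 reflects the rounded slope 1.09.

```python
# Slope from deviation cross-products; intercept from y = b0 + b1*x at the means.
import numpy as np

x = np.array([9, 11, 12, 13, 14, 16, 17], dtype=float)
y = np.array([8, 12, 11, 14, 12, 17, 17], dtype=float)

cpxy = np.sum((x - x.mean()) * (y - y.mean()))   # 51.0000
ssx = np.sum((x - x.mean()) ** 2)                # 46.8571
b1 = cpxy / ssx                                  # 1.0884
b0 = y.mean() - b1 * x.mean()                    # about -1.31

print(f"slope b1 = {b1:.4f}, intercept b0 = {b0:.2f}")
```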

Page 35

Solve for intercept

y = b0 + b1x, where b0 = intercept and b1 = slope.

Rearrange: y - b1x = b0. It can be shown that the least squares line passes through (x̄, ȳ), so substituting the means:

13 = b0 + 1.09(13.14)
13 - 1.09(13.14) = b0
b0 = -1.32

[Figure: the example scatterplot with the fitted line extended back to its intercept]

Slopes

Slope = b1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)²

[Figure: three scatterplots with the regions of the cross-product marked at (x̄, ȳ)]

• b1 > 0: points fall mainly in the (+,+) and (-,-) regions

• b1 < 0: points fall mainly in the (+,-) and (-,+) regions

• b1 = 0: points are balanced across the regions

Page 36

Regression line

[Figure: regression line of Y against X with the intercept and the slope (change in Y per unit change in X) labelled; normal distributions of Y at x1 and x2, with equal spread, are centred on the line ŷi = b0 + b1xi]

Page 37

Regression model

yi = b0 + b1xi + ei

E(yi) = b0 + b1xi

where:

• E(yi) = expected value of yi

• ei (error term) measures the difference between yi and E(yi) at each xi

Sample regression equation

ŷi = b0 + b1xi

• ŷi: predicted Y-value for xi; estimates E(yi)

• b0: sample intercept; estimates the population intercept

• b1: sample regression slope; estimates the population slope

Page 38

Regression line

[Figure: fitted regression line of CWD basal area (-100 to 200) against riparian tree density (0 to 2500), with the slope and intercept labelled; a second panel shows the same data with the sample mean ȳ and its confidence interval ȳ ± t(s/√n)]

The logic of the assessment of regression models – what to compare to

If the data are normally distributed, then an unbiased estimate of the distribution of means can be obtained from ȳ and its standard error (SE).

Page 39

[Figure: employment (thousands, with confidence interval) plotted against year (1945 to 1965), from Longley.syd; x̄ = 1954.5, ȳ = 65,317]

Now let's assume that we think there may be a relationship between year and employment.

Question: does the mean (or some other estimator that does not include the relationship between Y and X) fit the data better than an estimator that includes the effect of X?

[Figure: two panels of employment (thousands) against year, 1945 to 1965]

Page 40

Analysis of variance in Y

Total variation (Sum of Squares) in Y: Σ(yi - ȳ)²

= Variation in Y explained by regression (SSRegression) + Variation in Y unexplained by regression (SSResidual)

[Figure: least squares regression line of Y against X, with one observation (xi, yi) decomposed into (ŷi - ȳ), (yi - ŷi) and (yi - ȳ)]

Σ(yi - ȳ)² = Σ(ŷi - ȳ)² + Σ(yi - ŷi)²

Ordinary Least Squares (OLS)

Page 41

Unexplained or residual variation: (yi - ŷi) small versus (yi - ŷi) large.

Explained variation: (ŷi - ȳ) small versus (ŷi - ȳ) large.

Page 42

Analysis of variance

Source of variation   SS             df     Variance (= mean square)
Regression            SSRegression   1      SSRegression / 1
                      (variation in Y explained by regression)
Residual              SSResidual     n - 2  SSResidual / (n - 2)
                      (variation in Y unexplained by regression)

Why n - 2? Because two parameters (the intercept and the slope) are estimated from the data, leaving n - 2 degrees of freedom for the residual.

Analysis of variance

It follows that if:

Variation in Y explained by regression >> Variation in Y

unexplained by regression (MSRegression >> MSResidual)

Then:

Regression function contributes to estimation of Y

(Slope = b1 > 0, or b1 < 0)

Page 43

Slope = b1

[Figure: three scatterplots illustrating the sign of the slope]

• b1 > 0: variation in Y explained by regression > 0; ŷi ≠ ȳ for most xi

• b1 < 0: variation in Y explained by regression > 0; ŷi ≠ ȳ for most xi

• b1 = 0: variation in Y explained by regression = 0; ŷi ≈ ȳ for all xi

Null hypothesis

• Null hypothesis: b1 = 0

• F-ratio statistic = MSRegression / MSResidual

– if H0 true, F-ratio follows F distribution with

dfRegression and dfResidual

• t-statistic = b1 / SE(b1)

– if H0 true, t-statistic follows t distribution

with df = n-2

Page 44

Model comparisons

ANOVA for regression:

SSTotal (total variation in Y) = SSRegression (variation explained by regression with X) + SSResidual (residual variation)

Page 45

Full model

yi = b0 + b1xi + ei

• Unexplained variation in Y from full

model = SSResidual

Reduced model (H0 true)

• Reduced model (H0: b1 = 0 true): yi = b0 + ei (mean and error only)

• Unexplained variation in Y from reduced model = SSTotal = Σ(yi - ȳ)²

Page 46

Model comparison

• Difference in unexplained variation between

full and reduced models:

SSTotal - SSResidual

= SSRegression

• Variation explained by including b1 in model

Explained variation

• Proportion of variation in Y explained by

linear relationship with X

• Termed r², the coefficient of determination:

r² = SSRegression / SSTotal

• r² is simply the square of the correlation coefficient (r) between X and Y.

Page 47

Which is the better model?

[Figure: scatterplots of Y1 against X and of Y2 against X (X from 0 to 15000, Y from 0 to 500), each with a fitted regression line]

For Y1:

Dep Var: Y1 N: 26 Multiple R: 0.754377 Squared multiple R: 0.569085

Adjusted squared multiple R: 0.551131 Standard error of estimate: 86.934708

Effect Coefficient Std Error Std Coef Tolerance t P(2 Tail)

CONSTANT 11.207815 30.277197 0.000000 . 0.37017 0.71450

X 0.026573 0.004720 0.754377 1.000000 5.62987 0.00001

Analysis of Variance

Source Sum-of-Squares df Mean-Square F-ratio P

Regression 2.39543E+05 1 2.39543E+05 31.695479 0.000009

Residual 1.81383E+05 24 7557.643448

Page 48

Which is the better model?

[Figure: scatterplot of Y2 against X with fitted regression line]

For Y2:

Dep Var: Y2 N: 5 Multiple R: 0.978152 Squared multiple R: 0.956781

Adjusted squared multiple R: 0.942374 Standard error of estimate: 33.608617

Effect Coefficient Std Error Std Coef Tolerance t P(2 Tail)

CONSTANT -31.455158 32.524324 0.000000 . -0.96713 0.40482

X 0.033584 0.004121 0.978152 1.000000 8.14944 0.00386

Analysis of Variance

Source Sum-of-Squares df Mean-Square F-ratio P

Regression 7.50166E+04 1 7.50166E+04 66.413444 0.003864

Residual 3388.617362 3 1129.539121

[Figure: the Y1 and Y2 scatterplots side by side]

Which is the better model?

Y2: n = 5, P = 0.00386, r² = 0.942
Y1: n = 26, P = 0.000009, r² = 0.551

Page 49

Which is the better model?

[Figure: the Y1 and Y2 scatterplots with fitted lines and 95% confidence bands (for the slope)]

Y2: n = 5, P = 0.00386, r² = 0.942
Y1: n = 26, P = 0.000009, r² = 0.551

Assumptions

Page 50

Normality

Y normally distributed at each value of X:

– Boxplot of y should be symmetrical - watch

out for outliers and skewness

– Transformations often help

Homogeneity of variance

Variance (spread) of Y should be constant

for each value of xi (homogeneity of

variance):

– Very difficult to assess usually (for

models with only one value of y per x).

Page 51

[Figure: normal distributions of Y, with equal spread, at two values x1 and x2, centred on the regression line ŷi = b0 + b1xi]

Homogeneity of variance

Variance (spread) of Y should be constant for

each value of xi (homogeneity of variance):

– Very difficult to assess usually (for models with

only one value of y per x).

– Spread of residuals should be even when

plotted against xi or predicted yi’s

– Transformations often help

– Transformations that improve normality of Y will

also usually make variance of Y more constant

Page 52

Independence

Values of yi are independent of each

other:

– watch out for data which are a time series

on same experimental or sampling units

– should be considered at design stage

Linearity

For Linear regression: true population

relationship between Y and X is linear:

– scatterplot of Y against X

– watch out for asymptotic or exponential

patterns

– transformations of Y or Y and X often help

– Always look at residuals

Page 53

EDA and regression diagnostics

• Check assumptions

• Check fit of model

• Warn about influential observations and

outliers

EDA

• Boxplots of Y (and X):

– check for normality, outliers etc.

• Scatterplot of Y and X:

– check for linearity, homogeneity of

variance, outliers etc.

Page 54

Anscombe (1973) data set

[Figure: the four Anscombe scatterplots (X roughly 0 to 20, Y roughly 0 to 14)]

R² = 0.667, y = 3.0 + 0.5x, t = 4.24, P = 0.002 (the same fitted regression for all four data sets)

Page 55

Limited or weighted data smoothers (for data exploration – especially useful for model fitting)

• Nonparametric description of

relationship between Y and X

– unconstrained by specific model structure

• Useful exploratory technique:

– is linear model appropriate?

– are particular observations influential?

Page 56

Limited or weighted data smoothers

• Each observation replaced by the mean or median of surrounding observations
– or by the predicted value of a regression model through the surrounding observations

• Surrounding observations lie in a window (or band)
– covers a range along the X-axis
– size of window (number of observations) determined by the smoothing parameter

Limited or weighted data smoothers

• Adjacent windows overlap

– resulting line is smooth

– smoothness controlled by smoothing

parameter (width of window)

• Any section of line robust to extreme

values in other windows

Page 57

Types of limited or weighted data

smoothers (examples)

• Running (moving) means or averages:

– means or medians within each window

• Lo(w)ess:

– locally weighted regression scatterplot

smoothing

– observations within a window weighted

differently

– observations replaced by predicted values

from local regression line

Residuals – very useful for examining regression assumptions

• Difference between the observed value and the value predicted (fitted) by the model

• Residual for each observation:
– the difference between the observed y and the value of y predicted by the linear regression equation: ei = yi - ŷi

Page 58

Studentised residuals

• residual / SE of residuals
• follow a t distribution
• studentised residuals can be compared between different regressions

Observations with a large residual (or studentised residual) are outliers from the fitted model.

Plot residuals against predicted ŷi

[Figure: residuals scattered evenly within ±SE around zero against predicted ŷi, alongside a scatterplot with an even spread of Y around the line]

• No pattern in residuals indicates the assumptions are OK
• Even spread of Y around the line

Page 59

[Figure: residuals fanning out in a wedge against predicted ŷi, alongside a scatterplot with uneven spread of Y around the line]

• Increasing spread of residuals, i.e. a wedge shape
• Unequal variance in Y
• Skewed distribution of Y
• Transformation of Y helps
• Uneven spread of Y around the line

Other indicators

• Outliers

• Leverage

• Influence

Page 60

Outliers

• Observations further from the fitted model than the remaining observations
– might be different from sample outliers in boxplots

• Large residual

[Figure: scatterplot with one point far from the fitted line, labelled "outlier"]

Use a robust estimator.

Leverage

• How extreme an observation is for the X-variable

• Measures how much each xi influences the predicted ŷi

[Figure: scatterplot with one point at an extreme X-value, labelled "large leverage"]

Page 61

Influence

• Cook's D statistic:
– incorporates leverage & residual
– flags observations with a large influence on the estimated slope
– observations with D near or greater than 1 should be checked

[Figure: scatterplot with three labelled observations]

• Observation 1 is an X and Y outlier but not influential
• Observation 2 has a large residual - an outlier
• Observation 3 is very influential (large Cook's D) - also an outlier