
Introduction to Simple Linear Regression

Yang Feng


Course Description

Theory and practice of regression analysis: simple and multiple regression, including testing, estimation, and confidence procedures; modeling; regression diagnostics and plots; polynomial regression; collinearity and confounding; model selection; geometry of least squares. Extensive use of the computer to analyze data.

Course website: http://www.stat.columbia.edu/~yangfeng/W4315

Required Text: Applied Linear Regression Models (4th Ed.). Authors: Kutner, Nachtsheim, Neter


Philosophy and Style

Easy first half.

Difficult second half.

Some digressions from the required book.

Understanding = proof (derivation) + implementation.

Practice makes perfect.


Course Outline

First half of the course is single variable linear regression.

Least squares

Maximum likelihood, normal model

Tests / inferences

ANOVA

Diagnostics

Remedial Measures


Course Outline (Continued)

Second half of the course is multiple linear regression and other related topics.

Multiple Linear Regression

Linear algebra review
Matrix approach to linear regression
Multiple predictor variables
Diagnostics
Tests
Model selection

Other topics (If time permits)

Principal Component Analysis
Generalized Linear Models
Introduction to Bayesian Inference


Requirements

Calculus

Derivatives, gradients, convexity

Linear algebra

Matrix notation, inversion, eigenvectors, eigenvalues, rank

Probability and Statistics

Random variable
Expectation, variance
Estimation
Bias/Variance
Basic probability distributions
Hypothesis Testing
Confidence Interval


Software

R will be used throughout the course and is required in all homework. An R tutorial session will be given on Sep 5 (bring your laptop!). Reasons for R:

Completely free software. Can be downloaded from http://cran.r-project.org/

Available on various systems: PC, Mac, Linux, and even iPhone and iPad!

Advanced yet easy to use.

An Introduction to R: http://cran.r-project.org/doc/manuals/R-intro.pdf


Grading

Weekly homework (20%)
DUE 6pm every Tuesday in W4315-Inbox at 904@SSW
You can collect graded homework in W4315-Outbox at 904@SSW
NO late homework accepted
Lowest score will be dropped

In Class Midterm exam (30%)
In Class Final exam (50%)
Exams are closed book, closed notes! One double-sided cheat sheet is allowed.

Letter grade will look like the histogram of a normal distribution. Let's have a look at Spring 2013's distribution:


Office Hours / Website

http://www.stat.columbia.edu/~yangfeng

Course materials and homework with the due dates will be posted on the course website.

Office hours: Thursday 10am-noon (subject to change)

Office location: Room 1012, SSW Building (1255 Amsterdam Avenue, between 121st and 122nd Street)


Why regression?

We want to model a functional relationship between a "predictor variable" (input, independent variable, etc.) and a "response variable" (output, dependent variable, etc.)

Examples?

But the real world is noisy; there is no exact f = ma

Observation noise
Process noise

Two distinct goals

(Estimation) Understanding the relationship between predictor variables and response variables
(Prediction) Predicting the future response given newly observed predictors


History

Sir Francis Galton, 19th century

Studied the relation between heights of parents and children and noted that the children "regressed" to the population mean

"Regression" stuck as the term to describe statistical relations between variables


Example Applications

Trend lines, e.g., Google's stock price over 6 months.


Others

Epidemiology

Relating lifespan to obesity or smoking habits etc.

Science and engineering

Relating physical inputs to physical outputs in complex systems

Brain


Aims for the course

Given something you would like to predict and some number of covariates:

What kind of model should you use?
Which variables should you include?
Which transformations of variables and interaction terms should you use?

Given a model and some data:

How do you fit the model to the data?
How do you express confidence in the values of the model parameters?
How do you regularize the model to avoid over-fitting and other related issues?


Data for Regression Analysis

Observational Data
Example: relation between age of employee (X) and number of days of illness last year (Y)
Cannot be controlled!

Experimental Data
Example: an insurance company wishes to study the relation between productivity of its analysts in processing claims (Y) and length of training (X)

Treatment: the length of training
Experimental units: the analysts included in the study

Completely Randomized Design: the most basic type of statistical design
Example: same example, but every experimental unit has an equal chance to receive any one of the treatments


Questions?

Good time to ask now.


Simple Linear Regression

Want to find parameters for a function of the form

Yi = β0 + β1Xi + εi

The distribution of the error random variable is not specified.


Quick Example - Scatter Plot

[Scatter plot: simulated data points with predictor a on the horizontal axis (−2 to 2) and response b on the vertical axis (−2 to 8).]


Formal Statement of Model

Yi = β0 + β1Xi + εi

Yi is the value of the response variable in the i-th trial

β0 and β1 are parameters

Xi is a known constant, the value of the predictor variable in the i-th trial

εi is a random error term with mean E(εi) = 0 and variance Var(εi) = σ²

εi and εj are uncorrelated

i = 1, . . . , n


Properties

The response Yi is the sum of two components

Constant term β0 + β1Xi

Random term εi

The expected response is

E(Yi) = E(β0 + β1Xi + εi) = β0 + β1Xi + E(εi) = β0 + β1Xi
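To make the model concrete in R (the course's software), here is a minimal simulation sketch; the values β0 = 1, β1 = 2, σ = 1 and the grid of Xi are illustrative assumptions, not from the slides.

# Simulate Yi = beta0 + beta1*Xi + eps_i (illustrative parameters)
set.seed(1)
n     <- 100
beta0 <- 1                              # assumed intercept
beta1 <- 2                              # assumed slope
sigma <- 1                              # assumed error standard deviation
X   <- seq(-2, 2, length.out = n)       # known constants
eps <- rnorm(n, mean = 0, sd = sigma)   # random error terms, E(eps) = 0
Y   <- beta0 + beta1 * X + eps          # responses
plot(X, Y)                              # scatter of the simulated data
abline(beta0, beta1)                    # the regression function E(Y) = beta0 + beta1*X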


Expectation Review

Suppose X has probability density function f (x).

Definition

E(X) = ∫ x f(x) dx

Linearity property

E(aX) = aE(X)

E(aX + bY) = aE(X) + bE(Y)

Obvious from definition


Example Expectation Derivation

Suppose the p.d.f. of X is f(x) = 2x, 0 ≤ x ≤ 1

Expectation

E(X) = ∫₀¹ x f(x) dx = ∫₀¹ 2x² dx = (2x³/3) |₀¹ = 2/3
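As a quick numerical sanity check of this integral, R's built-in integrate function gives the same value:

# E(X) = integral of x * f(x) over [0, 1], with f(x) = 2x
integrate(function(x) x * 2 * x, lower = 0, upper = 1)   # 0.6666667 = 2/3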


Expectation of a Product of Random Variables

If X, Y are random variables with joint density function f(x, y), then the expectation of the product is given by

E(XY) = ∫∫ xy f(x, y) dx dy


Expectation of a product of random variables

What if X and Y are independent? If X and Y are independent with density functions f1 and f2 respectively, then

E(XY) = ∫∫ xy f1(x) f2(y) dx dy
      = ∫ x f1(x) [ ∫ y f2(y) dy ] dx
      = ∫ x f1(x) E(Y) dx
      = E(X) E(Y)
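A Monte Carlo sketch of this identity in R; the uniform and normal distributions (and their parameters) are arbitrary illustrative choices:

set.seed(2)
x <- runif(1e6)            # X ~ Uniform(0, 1), drawn independently of Y
y <- rnorm(1e6, mean = 3)  # Y ~ N(3, 1)
mean(x * y)                # approximately E(X)E(Y) = 0.5 * 3 = 1.5
mean(x) * mean(y)          # same value, up to simulation noise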


Regression Function

The response Yi comes from a probability distribution with mean

E(Yi) = β0 + β1Xi

This means the regression function is

E(Y) = β0 + β1X

since the regression function relates the mean of the probability distribution of Y, for a given X, to the level of X.


Error Terms

The response Yi in the i-th trial exceeds or falls short of the value of the regression function by the error term amount εi

The error terms εi are assumed to have constant variance σ²


Response Variance

Responses Yi have the same constant variance

Var(Yi) = Var(β0 + β1Xi + εi) = Var(εi) = σ²


Variance (2nd central moment) Review

Continuous distribution

Var(X) = E[(X − E(X))²] = ∫ (x − E(X))² f(x) dx

Discrete distribution

Var(X) = E[(X − E(X))²] = ∑ᵢ (Xᵢ − E(X))² P(X = Xᵢ)

Alternative form for the variance:

Var(X) = E(X²) − E(X)²


Example Variance Derivation

f(x) = 2x, 0 ≤ x ≤ 1

Var(X) = E(X²) − E(X)² = ∫₀¹ 2x · x² dx − (2/3)² = (2x⁴/4) |₀¹ − 4/9 = 1/2 − 4/9 = 1/18
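The same R check works for the variance, computing E(X²) minus the square of E(X) numerically:

EX  <- integrate(function(x) x   * 2 * x, 0, 1)$value   # E(X)  = 2/3
EX2 <- integrate(function(x) x^2 * 2 * x, 0, 1)$value   # E(X²) = 1/2
EX2 - EX^2                                              # 0.0555... = 1/18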


Variance Properties

Var(aX) = a² Var(X)

Var(aX + bY) = a² Var(X) + b² Var(Y) if X ⊥ Y

Var(a + cX) = c² Var(X) if a and c are both constants

More generally,

Var(∑ᵢ aᵢXᵢ) = ∑ᵢ ∑ⱼ aᵢaⱼ Cov(Xᵢ, Xⱼ)


Covariance

The covariance between two real-valued random variables X and Y, with expected values E(X) = µ and E(Y) = ν, is defined as

Cov(X, Y) = E((X − µ)(Y − ν)) = E(XY) − µν

If X ⊥ Y, then E(XY) = E(X)E(Y) = µν, and hence

Cov(X, Y) = 0


Formal Statement of Model

Yi = β0 + β1Xi + εi

Yi is the value of the response variable in the i-th trial

β0 and β1 are parameters

Xi is a known constant, the value of the predictor variable in the i-th trial

εi is a random error term with mean E(εi) = 0 and variance Var(εi) = σ²

i = 1, . . . , n


Example

An experimenter gave three subjects a very difficult task. Data on the age of the subject (X) and on the number of attempts to accomplish the task before giving up (Y) follow:

Subject i:               1   2   3
Age Xi:                 20  55  30
Number of Attempts Yi:   5  12  10


Least Squares Linear Regression

Goal: make Yi and b0 + b1Xi close for all i .

Proposal 1: minimize ∑ᵢ₌₁ⁿ [Yi − (b0 + b1Xi)].

Proposal 2: minimize ∑ᵢ₌₁ⁿ |Yi − (b0 + b1Xi)|.

Final proposal: minimize

Q(b0, b1) = ∑ᵢ₌₁ⁿ [Yi − (b0 + b1Xi)]²

Choose b0 and b1 as estimators for β0 and β1: b0 and b1 will minimize the criterion Q for the given sample observations (X1, Y1), (X2, Y2), . . . , (Xn, Yn).


Guess #1


Guess #2


Function maximization

Important technique to remember!

Take derivative
Set result equal to zero and solve
Test second derivative at that point

Question: does this always give you the maximum?

Going further: multiple variables, convex optimization


Function Maximization

Find the value of x that maximizes the function

f(x) = −x² + log(x), x > 0

1. Take the derivative and set it equal to zero:

d/dx (−x² + log(x)) = 0    (1)
−2x + 1/x = 0              (2)
2x² = 1                    (3)
x² = 1/2                   (4)
x = √2/2                   (5)


2. Check the second-order derivative:

d²/dx² (−x² + log(x)) = d/dx (−2x + 1/x)    (6)
                      = −2 − 1/x²           (7)
                      < 0                   (8)

Then we have found the maximum: x* = √2/2, f(x*) = −(1/2)[1 + log(2)].
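Such a result is easy to check numerically. A sketch using R's one-dimensional optimizer (the search interval (0.01, 10) is an arbitrary choice):

f <- function(x) -x^2 + log(x)
optimize(f, interval = c(0.01, 10), maximum = TRUE)
# $maximum   ~ 0.7071 = sqrt(2)/2
# $objective ~ -0.8466 = -(1 + log(2))/2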


Least Squares Minimization

Function to minimize w.r.t. b0 and b1

b0 and b1 are called point estimators of β0 and β1 respectively

Q(b0, b1) = ∑ᵢ₌₁ⁿ [Yi − (b0 + b1Xi)]²

Find partial derivatives and set both equal to zero:

∂Q/∂b0 = 0

∂Q/∂b1 = 0


Normal Equations

The resulting equations from this minimization step are called the normal equations. b0 and b1 are called point estimators of β0 and β1, respectively.

∑Yi = n b0 + b1 ∑Xi    (9)

∑XiYi = b0 ∑Xi + b1 ∑Xi²    (10)

This is a system of two equations and two unknowns.


Solution to Normal Equations

After a lot of algebra one arrives at

b1 = ∑(Xi − X̄)(Yi − Ȳ) / ∑(Xi − X̄)²

b0 = Ȳ − b1X̄

where

X̄ = ∑Xi / n,  Ȳ = ∑Yi / n
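Applied to the three-subject age/attempts example from earlier, a minimal R sketch computes these estimates directly and confirms them against R's built-in lm fit:

X <- c(20, 55, 30)   # ages
Y <- c(5, 12, 10)    # number of attempts
b1 <- sum((X - mean(X)) * (Y - mean(Y))) / sum((X - mean(X))^2)
b0 <- mean(Y) - b1 * mean(X)
c(b0 = b0, b1 = b1)  # least squares estimates from the formulas above
coef(lm(Y ~ X))      # same values from lm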


Least Square Fit


Properties of Solution

The i-th residual is defined to be

ei = Yi − Ŷi

The sum of the residuals is zero, from (9):

∑ᵢ ei = ∑(Yi − b0 − b1Xi) = ∑Yi − nb0 − b1∑Xi = 0

The sum of the observed values Yi equals the sum of the fitted values Ŷi:

∑ᵢ Yi = ∑ᵢ Ŷi


Properties of Solution

The sum of the weighted residuals is zero when the residual in the i-th trial is weighted by the level of the predictor variable in the i-th trial, from (10):

∑ᵢ Xi ei = ∑(Xi(Yi − b0 − b1Xi)) = ∑ᵢ XiYi − b0∑Xi − b1∑Xi² = 0


Properties of Solution

The sum of the weighted residuals is zero when the residual in the i-th trial is weighted by the fitted value of the response variable in the i-th trial:

∑ᵢ Ŷi ei = ∑ᵢ (b0 + b1Xi)ei = b0 ∑ᵢ ei + b1 ∑ᵢ Xi ei = 0
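These properties are easy to verify numerically. A short R sketch, reusing the example data from before (any dataset would do):

X <- c(20, 55, 30); Y <- c(5, 12, 10)
fit  <- lm(Y ~ X)
e    <- resid(fit)
Yhat <- fitted(fit)
sum(e)         # ~ 0, from normal equation (9)
sum(X * e)     # ~ 0, from normal equation (10)
sum(Yhat * e)  # ~ 0, combination of the two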


Properties of Solution

The regression line always goes through the point (X̄, Ȳ).

Using the alternative form of the linear regression model:

Yi = β0* + β1(Xi − X̄) + εi

The least squares estimator b1 for β1 remains the same as before. The least squares estimator for β0* = β0 + β1X̄ becomes

b0* = b0 + b1X̄ = (Ȳ − b1X̄) + b1X̄ = Ȳ

Hence the estimated regression function is

Ŷ = Ȳ + b1(X − X̄)

so, again, the regression line always goes through the point (X̄, Ȳ).


Estimating Error Term Variance σ²

Review estimation in non-regression setting.

Show estimation results for regression setting.


Estimation Review

An estimator is a rule that tells how to calculate the value of an estimate based on the measurements contained in a sample, e.g., the sample mean

Ȳ = (1/n) ∑ᵢ₌₁ⁿ Yi


Point Estimators and Bias

Point estimator:

θ̂ = f({Y1, . . . , Yn})

Unknown quantity / parameter:

θ

Definition: bias of an estimator

Bias(θ̂) = E(θ̂) − θ


Example

Samples:

Yi ∼ N(µ, σ²)

Estimate the population mean:

θ = µ,  θ̂ = Ȳ = (1/n) ∑ᵢ₌₁ⁿ Yi


Sampling Distribution of the Estimator

First moment

E(θ̂) = E((1/n) ∑ᵢ₌₁ⁿ Yi) = (1/n) ∑ᵢ₌₁ⁿ E(Yi) = nµ/n = θ

This is an example of an unbiased estimator:

Bias(θ̂) = E(θ̂) − θ = 0


Variance of Estimator

Definition: variance of an estimator

Var(θ̂) = E([θ̂ − E(θ̂)]²)

Remember:

Var(cY) = c² Var(Y)

Var(∑ᵢ₌₁ⁿ Yi) = ∑ᵢ₌₁ⁿ Var(Yi), only if the Yi are independent with finite variance


Example Estimator Variance

Var(θ̂) = Var((1/n) ∑ᵢ₌₁ⁿ Yi) = (1/n²) ∑ᵢ₌₁ⁿ Var(Yi) = nσ²/n² = σ²/n
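A simulation sketch in R illustrating Var(Ȳ) = σ²/n; the choices µ = 0, σ = 2, n = 25 and the number of replications are arbitrary:

set.seed(3)
n <- 25; sigma <- 2
ybar <- replicate(10000, mean(rnorm(n, mean = 0, sd = sigma)))
var(ybar)      # close to sigma^2 / n
sigma^2 / n    # 0.16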


Bias Variance Trade-off

The mean squared error of an estimator is

MSE(θ̂) = E([θ̂ − θ]²)

It can be re-expressed as

MSE(θ̂) = Var(θ̂) + [Bias(θ̂)]²


MSE = VAR + BIAS²

Proof:

MSE(θ̂) = E((θ̂ − θ)²)
= E(([θ̂ − E(θ̂)] + [E(θ̂) − θ])²)
= E([θ̂ − E(θ̂)]²) + 2E([E(θ̂) − θ][θ̂ − E(θ̂)]) + E([E(θ̂) − θ]²)
= Var(θ̂) + 2[E(θ̂) − θ] E(θ̂ − E(θ̂)) + (Bias(θ̂))²    (E(θ̂) − θ is a constant)
= Var(θ̂) + 2[E(θ̂) − θ] · 0 + (Bias(θ̂))²
= Var(θ̂) + (Bias(θ̂))²


Trade-off

Think of variance as confidence and bias as correctness.

Intuitions (largely) apply

Sometimes choosing a biased estimator can result in an overall lower MSE if it has much lower variance than the unbiased one.
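A classic illustration of this trade-off, as an R simulation sketch (sample size and replication count are arbitrary): for normal samples, dividing the sum of squares by n gives a biased variance estimator (the ML estimator seen later) with lower MSE than the unbiased s² that divides by n − 1.

set.seed(4)
n <- 10; sigma2 <- 1
draws <- replicate(50000, {
  y  <- rnorm(n, sd = sqrt(sigma2))
  ss <- sum((y - mean(y))^2)
  c(unbiased = ss / (n - 1), biased = ss / n)
})
rowMeans((draws - sigma2)^2)   # MSE of each estimator; the biased one is smaller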


s² estimator for σ² for a single population

Sum of squares:

∑ᵢ₌₁ⁿ (Yi − Ȳ)²

Sample variance estimator:

s² = ∑ᵢ₌₁ⁿ (Yi − Ȳ)² / (n − 1)

s² is an unbiased estimator of σ².

This sum of squares has n − 1 "degrees of freedom" associated with it: one degree of freedom is lost by using Ȳ as an estimate of the unknown population mean µ.


Estimating Error Term Variance σ² for the Regression Model

Regression model

The variance of each observation Yi is σ² (the same as for the error term εi)

Each Yi comes from a different probability distribution, with a different mean that depends on the level Xi

The deviation of an observation Yi must be calculated around its own estimated mean


s² estimator for σ²

s² = MSE = SSE / (n − 2) = ∑(Yi − Ŷi)² / (n − 2) = ∑ei² / (n − 2)

MSE is an unbiased estimator of σ²:

E(MSE) = σ²

The sum of squares SSE has n − 2 "degrees of freedom" associated with it.

Cochran's theorem (later in the course) tells us where degrees of freedom come from and how to calculate them.
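In R, this is what summary.lm reports: its "residual standard error" is s, so squaring it gives the MSE. A quick sketch with the running example:

X <- c(20, 55, 30); Y <- c(5, 12, 10)
fit <- lm(Y ~ X)
n <- length(Y)
sum(resid(fit)^2) / (n - 2)   # s^2 = MSE from the formula above
summary(fit)$sigma^2          # same value from summary.lm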


Normal Error Regression Model

No matter how the error terms εi are distributed, the least squares method provides unbiased point estimators of β0 and β1 that also have minimum variance among all unbiased linear estimators.

To set up interval estimates and run tests we need to specify the distribution of the εi.

We will assume that the εi are normally distributed.


Normal Error Regression Model

Yi = β0 + β1Xi + εi

Yi is the value of the response variable in the i-th trial

β0 and β1 are parameters

Xi is a known constant, the value of the predictor variable in the i-th trial

εi ∼iid N(0, σ²) (note this is different: now we know the distribution)

i = 1, . . . , n


Notational Convention

When you see εi ∼iid N(0, σ²),

it is read as: the εi are independently and identically distributed according to a normal distribution with mean 0 and variance σ².


Maximum Likelihood Principle

The method of maximum likelihood chooses as estimates those values ofthe parameters that are most consistent with the sample data.


Likelihood Function

If

Xi ∼ F(Θ), i = 1, . . . , n

then the likelihood function is

L({Xi}ⁿᵢ₌₁, Θ) = ∏ᵢ₌₁ⁿ F(Xi; Θ)


Maximum Likelihood Estimation

The likelihood function can be maximized w.r.t. the parameter(s) Θ; doing this, one can arrive at estimators for the parameters as well.

L({Xi}ⁿᵢ₌₁, Θ) = ∏ᵢ₌₁ⁿ F(Xi; Θ)

To do this, find solutions to (analytically or by following the gradient)

dL({Xi}ⁿᵢ₌₁, Θ)/dΘ = 0


Important Trick

(Almost) never maximize the likelihood function; maximize the log likelihood function instead:

log(L({Xi}ⁿᵢ₌₁, Θ)) = log(∏ᵢ₌₁ⁿ F(Xi; Θ)) = ∑ᵢ₌₁ⁿ log(F(Xi; Θ))

Usually the log of the likelihood is easier to work with mathematically.


ML Normal Regression

Likelihood function

L(β0, β1, σ²) = ∏ᵢ₌₁ⁿ (2πσ²)^(−1/2) exp(−(Yi − β0 − β1Xi)² / (2σ²))
             = (2πσ²)^(−n/2) exp(−(1/(2σ²)) ∑ᵢ₌₁ⁿ (Yi − β0 − β1Xi)²)

which, if you maximize (how?) w.r.t. the parameters, gives you. . .
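One answer to the "how?": set derivatives of the log likelihood to zero analytically (which reproduces least squares), or maximize numerically. A minimal R sketch on simulated data (the true parameter values are arbitrary); optim minimizes, so the log likelihood is negated, and σ is parameterized on the log scale to keep it positive:

set.seed(5)
X <- runif(50, 0, 10)
Y <- 1 + 2 * X + rnorm(50)
negloglik <- function(p) {          # p = (beta0, beta1, log sigma)
  s <- exp(p[3])                    # sigma > 0 by construction
  -sum(dnorm(Y, mean = p[1] + p[2] * X, sd = s, log = TRUE))
}
mle <- optim(c(0, 0, 0), negloglik)
mle$par[1:2]      # numerically maximized beta0, beta1 ...
coef(lm(Y ~ X))   # ... match the least squares estimates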


Maximum Likelihood Estimator(s)

β̂0: b0, the same as in the least squares case

β̂1: b1, the same as in the least squares case

σ̂² = ∑ᵢ (Yi − Ŷi)² / n

Note that the ML estimator σ̂² is biased, since s² is unbiased and

s² = MSE = (n / (n − 2)) σ̂²


Comments

Least squares minimizes the squared error between the prediction and the true output.

The normal distribution is fully characterized by its first two central moments (mean and variance).

Yang Feng (Columbia University) Introduction to Simple Linear Regression 70 / 70

Page 80: Introduction to Simple Linear Regression - …study.yuting.me/CU/Linear_Regression/Linear lectures.pdfCourse Description Theory and practice of regression analysis, Simple and multiple

Introduction to Simple Linear Regression

Yang Feng

Yang Feng (Columbia University) Introduction to Simple Linear Regression 1 / 70

Page 81: Introduction to Simple Linear Regression - …study.yuting.me/CU/Linear_Regression/Linear lectures.pdfCourse Description Theory and practice of regression analysis, Simple and multiple

Course Description

Theory and practice of regression analysis, Simple and multiple regression,including testing, estimation, and confidence procedures, modeling,regression diagnostics and plots, polynomial regression, colinearity andconfounding, model selection, geometry of least squares. Extensive use ofthe computer to analyze data.

Course website: http://www.stat.columbia.edu/~yangfeng/W4315

Required Text: Applied Linear Regression Models (4th Ed.)Authors: Kutner, Nachtsheim, Neter

Yang Feng (Columbia University) Introduction to Simple Linear Regression 2 / 70

Page 82: Introduction to Simple Linear Regression - …study.yuting.me/CU/Linear_Regression/Linear lectures.pdfCourse Description Theory and practice of regression analysis, Simple and multiple

Philosophy and Style

Easy first half.

Difficult second half.

Some digressions from the required book.

Understanding = proof (derivation) + implementation.

Practice makes perfect.

Yang Feng (Columbia University) Introduction to Simple Linear Regression 3 / 70

Page 83: Introduction to Simple Linear Regression - …study.yuting.me/CU/Linear_Regression/Linear lectures.pdfCourse Description Theory and practice of regression analysis, Simple and multiple

Course Outline

First half of the course is single variable linear regression.

Least squares

Maximum likelihood, normal model

Tests / inferences

ANOVA

Diagnostics

Remedial Measures

Yang Feng (Columbia University) Introduction to Simple Linear Regression 4 / 70

Page 84: Introduction to Simple Linear Regression - …study.yuting.me/CU/Linear_Regression/Linear lectures.pdfCourse Description Theory and practice of regression analysis, Simple and multiple

Course Outline (Continued)

Second half of the course is multiple linear regression and other relatedtopics .

Multiple linear Regression

Linear algebra reviewMatrix approach to linear regressionMultiple predictor variablesDiagnosticsTestsModel selection

Other topics (If time permits)

Principle Component AnalysisGeneralized Linear ModelsIntroduction to Bayesian Inference

Yang Feng (Columbia University) Introduction to Simple Linear Regression 5 / 70

Page 85: Introduction to Simple Linear Regression - …study.yuting.me/CU/Linear_Regression/Linear lectures.pdfCourse Description Theory and practice of regression analysis, Simple and multiple

Requirements

Calculus

Derivatives, gradients, convexity

Linear algebra

Matrix notation, inversion, eigenvectors, eigenvalues, rank

Probability and Statistics

Random variableExpectation, varianceEstimationBias/VarianceBasic probability distributionsHypothesis TestingConfidence Interval

Yang Feng (Columbia University) Introduction to Simple Linear Regression 6 / 70

Page 86: Introduction to Simple Linear Regression - …study.yuting.me/CU/Linear_Regression/Linear lectures.pdfCourse Description Theory and practice of regression analysis, Simple and multiple

Software

R will be used throughout the course and it is required in all homework.An R tutorial session will be given on Sep 5 (Bring your laptop!).Reasons for R:

Completely free software. Can be downloaded fromhttp://cran.r-project.org/

Available on various systems, PC, MAC, Linux, and even

Iphone andIpad!

Advanced yet easy to use.

An Introduction to R: http://cran.r-project.org/doc/manuals/R-intro.pdf

Yang Feng (Columbia University) Introduction to Simple Linear Regression 7 / 70

Page 87: Introduction to Simple Linear Regression - …study.yuting.me/CU/Linear_Regression/Linear lectures.pdfCourse Description Theory and practice of regression analysis, Simple and multiple

Software

R will be used throughout the course and it is required in all homework.An R tutorial session will be given on Sep 5 (Bring your laptop!).Reasons for R:

Completely free software. Can be downloaded fromhttp://cran.r-project.org/

Available on various systems, PC, MAC, Linux, and even Iphone andIpad!

Advanced yet easy to use.

An Introduction to R: http://cran.r-project.org/doc/manuals/R-intro.pdf

Yang Feng (Columbia University) Introduction to Simple Linear Regression 7 / 70

Page 88: Introduction to Simple Linear Regression - …study.yuting.me/CU/Linear_Regression/Linear lectures.pdfCourse Description Theory and practice of regression analysis, Simple and multiple

Grading

Weekly homework (20%)DUE 6pm every Tuesday in W4315-Inbox at 904@SSWYou can collect graded homework in W4315-Outbox at 904@SSWNO late homework acceptedLowest score will be dropped

In Class Midterm exam (30%)In Class Final exam (50%).Exams are close book, close notes! One double-sided cheat sheet isallowed.

Letter grade will look like the histogram of a normal distribution.Let’s have a look at Spring 2013’s distribution:

Yang Feng (Columbia University) Introduction to Simple Linear Regression 8 / 70

Page 89: Introduction to Simple Linear Regression - …study.yuting.me/CU/Linear_Regression/Linear lectures.pdfCourse Description Theory and practice of regression analysis, Simple and multiple

Grading

Weekly homework (20%)DUE 6pm every Tuesday in W4315-Inbox at 904@SSWYou can collect graded homework in W4315-Outbox at 904@SSWNO late homework acceptedLowest score will be dropped

In Class Midterm exam (30%)In Class Final exam (50%).Exams are close book, close notes! One double-sided cheat sheet isallowed.Letter grade will look like the histogram of a normal distribution.Let’s have a look at Spring 2013’s distribution:

Yang Feng (Columbia University) Introduction to Simple Linear Regression 8 / 70

Page 90: Introduction to Simple Linear Regression - …study.yuting.me/CU/Linear_Regression/Linear lectures.pdfCourse Description Theory and practice of regression analysis, Simple and multiple

Grading

Weekly homework (20%)DUE 6pm every Tuesday in W4315-Inbox at 904@SSWYou can collect graded homework in W4315-Outbox at 904@SSWNO late homework acceptedLowest score will be dropped

In Class Midterm exam (30%)In Class Final exam (50%).Exams are close book, close notes! One double-sided cheat sheet isallowed.Letter grade will look like the histogram of a normal distribution.Let’s have a look at Spring 2013’s distribution:

Yang Feng (Columbia University) Introduction to Simple Linear Regression 8 / 70

Page 91: Introduction to Simple Linear Regression - …study.yuting.me/CU/Linear_Regression/Linear lectures.pdfCourse Description Theory and practice of regression analysis, Simple and multiple

Office Hours / Website

http://www.stat.columbia.edu/~yangfeng

Course Materials and homework with the due dates will be posted onthe course website.

Office hours : Thursday 10am-noon subject to change

Office Location : Room 1012, SSW Building (1255 AmsterdamAvenue, between 121st and 122nd street)

Yang Feng (Columbia University) Introduction to Simple Linear Regression 9 / 70

Page 92: Introduction to Simple Linear Regression - …study.yuting.me/CU/Linear_Regression/Linear lectures.pdfCourse Description Theory and practice of regression analysis, Simple and multiple

Why regression?

Want to model a functional relationship between an “predictorvariable” (input, independent variable, etc.) and a “responsevariable” (output, dependent variable, etc.)

Examples?

But real world is noisy, no f = ma

Observation noiseProcess noise

Two distinct goals

(Estimation) Understanding the relationship between predictor variablesand response variables(Prediction) Predicting the future response given the new observedpredictors.

Yang Feng (Columbia University) Introduction to Simple Linear Regression 10 / 70

Page 93: Introduction to Simple Linear Regression - …study.yuting.me/CU/Linear_Regression/Linear lectures.pdfCourse Description Theory and practice of regression analysis, Simple and multiple

History

Sir Francis Galton, 19th century

Studied the relation between heights of parents and children and notedthat the children “regressed” to the population mean

“Regression” stuck as the term to describe statistical relationsbetween variables

Yang Feng (Columbia University) Introduction to Simple Linear Regression 11 / 70

Page 94: Introduction to Simple Linear Regression - …study.yuting.me/CU/Linear_Regression/Linear lectures.pdfCourse Description Theory and practice of regression analysis, Simple and multiple

Example Applications

Trend lines, eg. Google over 6 mo.

Yang Feng (Columbia University) Introduction to Simple Linear Regression 12 / 70

Page 95: Introduction to Simple Linear Regression - …study.yuting.me/CU/Linear_Regression/Linear lectures.pdfCourse Description Theory and practice of regression analysis, Simple and multiple

Others

Epidemiology

Relating lifespan to obesity or smoking habits etc.

Science and engineering

Relating physical inputs to physical outputs in complex systems

Brain

Yang Feng (Columbia University) Introduction to Simple Linear Regression 13 / 70

Page 96: Introduction to Simple Linear Regression - …study.yuting.me/CU/Linear_Regression/Linear lectures.pdfCourse Description Theory and practice of regression analysis, Simple and multiple

Aims for the course

Given something you would like to predict and some number ofcovariates

What kind of model should you use?Which variables should you include?Which transformations of variables and interaction terms should youuse?

Given a model and some data

How do you fit the model to the data?How do you express confidence in the values of the model parameters?How do you regularize the model to avoid over-fitting and other relatedissues?

Yang Feng (Columbia University) Introduction to Simple Linear Regression 14 / 70

Page 97: Introduction to Simple Linear Regression - …study.yuting.me/CU/Linear_Regression/Linear lectures.pdfCourse Description Theory and practice of regression analysis, Simple and multiple

Aims for the course

Given something you would like to predict and some number ofcovariates

What kind of model should you use?Which variables should you include?Which transformations of variables and interaction terms should youuse?

Given a model and some data

How do you fit the model to the data?How do you express confidence in the values of the model parameters?How do you regularize the model to avoid over-fitting and other relatedissues?

Yang Feng (Columbia University) Introduction to Simple Linear Regression 14 / 70

Page 98: Introduction to Simple Linear Regression - …study.yuting.me/CU/Linear_Regression/Linear lectures.pdfCourse Description Theory and practice of regression analysis, Simple and multiple

Data for Regression Analysis

Observational DataExample: relation between age of employee (X ) and number of daysof illness last year (Y )Cannot be controlled!

Experimental DataExample: an insurance company wishes to study the relation betweenproductivity of its analysts in processing claims (Y ) and length oftraining X .

Treatment: the length of trainingExperimental Units: the analysts included in the study.

Completely Randomized Design: Most basic type of statistical designExample: same example, but every experimental unit has an equalchance to receive any one of the treatments.

Yang Feng (Columbia University) Introduction to Simple Linear Regression 15 / 70

Page 99: Introduction to Simple Linear Regression - …study.yuting.me/CU/Linear_Regression/Linear lectures.pdfCourse Description Theory and practice of regression analysis, Simple and multiple

Data for Regression Analysis

Observational DataExample: relation between age of employee (X ) and number of daysof illness last year (Y )Cannot be controlled!

Experimental DataExample: an insurance company wishes to study the relation betweenproductivity of its analysts in processing claims (Y ) and length oftraining X .

Treatment: the length of trainingExperimental Units: the analysts included in the study.

Completely Randomized Design: Most basic type of statistical designExample: same example, but every experimental unit has an equalchance to receive any one of the treatments.

Yang Feng (Columbia University) Introduction to Simple Linear Regression 15 / 70

Page 100: Introduction to Simple Linear Regression - …study.yuting.me/CU/Linear_Regression/Linear lectures.pdfCourse Description Theory and practice of regression analysis, Simple and multiple

Data for Regression Analysis

Observational DataExample: relation between age of employee (X ) and number of daysof illness last year (Y )Cannot be controlled!

Experimental DataExample: an insurance company wishes to study the relation betweenproductivity of its analysts in processing claims (Y ) and length oftraining X .

Treatment: the length of trainingExperimental Units: the analysts included in the study.

Completely Randomized Design: Most basic type of statistical designExample: same example, but every experimental unit has an equalchance to receive any one of the treatments.

Yang Feng (Columbia University) Introduction to Simple Linear Regression 15 / 70

Page 101: Introduction to Simple Linear Regression - …study.yuting.me/CU/Linear_Regression/Linear lectures.pdfCourse Description Theory and practice of regression analysis, Simple and multiple

Questions?

Good time to ask now.

Yang Feng (Columbia University) Introduction to Simple Linear Regression 16 / 70

Page 102: Introduction to Simple Linear Regression - …study.yuting.me/CU/Linear_Regression/Linear lectures.pdfCourse Description Theory and practice of regression analysis, Simple and multiple

Simple Linear Regression

Want to find parameters for a function of the form

Yi = β0 + β1Xi + εi

Distribution of error random variable not specified

Yang Feng (Columbia University) Introduction to Simple Linear Regression 17 / 70

Page 103: Introduction to Simple Linear Regression - …study.yuting.me/CU/Linear_Regression/Linear lectures.pdfCourse Description Theory and practice of regression analysis, Simple and multiple

Quick Example - Scatter Plot

●●

●●

●●

●●

●●

−2 −1 0 1 2

−2

02

46

8

a

b

Yang Feng (Columbia University) Introduction to Simple Linear Regression 18 / 70

Page 104: Introduction to Simple Linear Regression - …study.yuting.me/CU/Linear_Regression/Linear lectures.pdfCourse Description Theory and practice of regression analysis, Simple and multiple

Formal Statement of Model

Yi = β0 + β1Xi + εi

Yi is the value of the response variable in the i-th trial

β0 and β1 are parameters

Xi is a known constant, the value of the predictor variable in the i-th trial

εi is a random error term with mean E(εi) = 0 and variance Var(εi) = σ²

εi and εj are uncorrelated for i ≠ j

i = 1, . . . , n

Properties

The response Yi is the sum of two components

Constant term β0 + β1Xi

Random term εi

The expected response is

E(Yi) = E(β0 + β1Xi + εi) = β0 + β1Xi + E(εi) = β0 + β1Xi

Expectation Review

Suppose X has probability density function f(x).

Definition:

E(X) = ∫ x f(x) dx.

Linearity properties:

E(aX) = a E(X)

E(aX + bY) = a E(X) + b E(Y)

Obvious from the definition.

Example Expectation Derivation

Suppose the p.d.f. of X is f(x) = 2x, 0 ≤ x ≤ 1.

Expectation:

E(X) = ∫₀¹ x f(x) dx = ∫₀¹ 2x² dx = [2x³/3]₀¹ = 2/3

Expectation of a Product of Random Variables

If X, Y are random variables with joint density function f(x, y), then the expectation of the product is given by

E(XY) = ∫∫ xy f(x, y) dx dy.

Expectation of a product of random variables

What if X and Y are independent? If X and Y are independent with density functions f1 and f2 respectively, then

E(XY) = ∫∫ xy f1(x) f2(y) dx dy
      = ∫ x f1(x) [∫ y f2(y) dy] dx
      = ∫ x f1(x) E(Y) dx
      = E(X) E(Y)

Regression Function

The response Yi comes from a probability distribution with mean

E(Yi ) = β0 + β1Xi

This means the regression function is

E(Y ) = β0 + β1X

The regression function relates the means of the probability distributions of Y, for given X, to the level of X.

Error Terms

The response Yi in the i th trial exceeds or falls short of the value ofthe regression function by the error term amount εi

The error terms εi are assumed to have constant variance σ2


Response Variance

Responses Yi have the same constant variance

Var(Yi) = Var(β0 + β1Xi + εi) = Var(εi) = σ²

Variance (2nd central moment) Review

Continuous distribution:

Var(X) = E[(X − E(X))²] = ∫ (x − E(X))² f(x) dx

Discrete distribution:

Var(X) = E[(X − E(X))²] = ∑ᵢ (xᵢ − E(X))² P(X = xᵢ)

Alternative form for the variance:

Var(X) = E(X²) − E(X)².

Example Variance Derivation

f(x) = 2x, 0 ≤ x ≤ 1

Var(X) = E(X²) − E(X)²
       = ∫₀¹ 2x · x² dx − (2/3)²
       = [2x⁴/4]₀¹ − 4/9
       = 1/2 − 4/9
       = 1/18

Variance Properties

Var(aX) = a² Var(X)

Var(aX + bY) = a² Var(X) + b² Var(Y) if X ⊥ Y

Var(a + cX) = c² Var(X) if a and c are both constants

More generally,

Var(∑ ai Xi) = ∑ᵢ ∑ⱼ ai aj Cov(Xi, Xj)

Covariance

The covariance between two real-valued random variables X and Y, with expected values E(X) = µ and E(Y) = ν, is defined as

Cov(X, Y) = E((X − µ)(Y − ν)) = E(XY) − µν.

If X ⊥ Y, then E(XY) = E(X) E(Y) = µν, and hence Cov(X, Y) = 0.

Formal Statement of Model

Yi = β0 + β1Xi + εi

Yi is the value of the response variable in the i-th trial

β0 and β1 are parameters

Xi is a known constant, the value of the predictor variable in the i-th trial

εi is a random error term with mean E(εi) = 0 and variance Var(εi) = σ²

i = 1, . . . , n

Example

An experimenter gave three subjects a very difficult task. Data on the age of the subject (X) and on the number of attempts to accomplish the task before giving up (Y) follow:

Subject i                1    2    3
Age Xi                  20   55   30
Number of Attempts Yi    5   12   10

Least Squares Linear Regression

Goal: make Yi and b0 + b1Xi close for all i.

Proposal 1: minimize ∑ᵢ₌₁ⁿ [Yi − (b0 + b1Xi)].

Proposal 2: minimize ∑ᵢ₌₁ⁿ |Yi − (b0 + b1Xi)|.

Final proposal: minimize

Q(b0, b1) = ∑ᵢ₌₁ⁿ [Yi − (b0 + b1Xi)]²

Choose b0 and b1 as estimators for β0 and β1: b0 and b1 will minimize the criterion Q for the given sample observations (X1, Y1), (X2, Y2), · · · , (Xn, Yn).


Guess #1


Guess #2


Function maximization

Important technique to remember!

Take derivative
Set result equal to zero and solve
Test second derivative at that point

Question: does this always give you the maximum?

Going further: multiple variables, convex optimization


Function Maximization

Find the value of x that maximizes the function

f(x) = −x² + log(x), x > 0

1. Take the derivative and set it to zero:

d/dx (−x² + log(x)) = 0        (1)
−2x + 1/x = 0                  (2)
2x² = 1                        (3)
x² = 1/2                       (4)
x = √2/2                       (5)


2. Check the second-order derivative:

d²/dx² (−x² + log(x)) = d/dx (−2x + 1/x)        (6)
                      = −2 − 1/x²               (7)
                      < 0                        (8)

Since the second derivative is negative, we have found the maximum: x* = √2/2, f(x*) = −(1/2)[1 + log(2)].

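A quick numerical cross-check (not from the slides) with R's built-in one-dimensional optimizer:

# Numerically maximize f(x) = -x^2 + log(x) on (0, 10).
f <- function(x) -x^2 + log(x)
optimize(f, interval = c(1e-8, 10), maximum = TRUE)
# $maximum is about 0.7071 = sqrt(2)/2; $objective is about -0.8466

The interval endpoints here are arbitrary choices; any interval containing √2/2 works.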

Least Squares Minimization

Function to minimize w.r.t. b0 and b1 (b0 and b1 are called point estimators of β0 and β1, respectively):

Q(b0, b1) = ∑ᵢ₌₁ⁿ [Yi − (b0 + b1Xi)]²

Find the partial derivatives and set both equal to zero:

∂Q/∂b0 = 0
∂Q/∂b1 = 0


Normal Equations

The results of this minimization step are called the normal equations; b0 and b1 are called point estimators of β0 and β1, respectively.

∑Yi = n b0 + b1 ∑Xi                (9)
∑XiYi = b0 ∑Xi + b1 ∑Xi²           (10)

This is a system of two equations in two unknowns.


Solution to Normal Equations

After a lot of algebra one arrives at

b1 = ∑(Xi − X̄)(Yi − Ȳ) / ∑(Xi − X̄)²

b0 = Ȳ − b1X̄

where

X̄ = ∑Xi/n,  Ȳ = ∑Yi/n

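These formulas are easy to check numerically on the three-subject age/attempts example above. A minimal R sketch (lm() gives the same fit):

# Least squares estimates for the age / attempts example.
X <- c(20, 55, 30)   # age
Y <- c(5, 12, 10)    # number of attempts
b1 <- sum((X - mean(X)) * (Y - mean(Y))) / sum((X - mean(X))^2)
b0 <- mean(Y) - b1 * mean(X)
c(b0 = b0, b1 = b1)   # b0 ~= 2.81, b1 ~= 0.18
coef(lm(Y ~ X))       # built-in fit, same estimates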

Least Square Fit


Properties of Solution

The i-th residual is defined to be

ei = Yi − Ŷi

The sum of the residuals is zero, from (9):

∑ᵢ ei = ∑(Yi − b0 − b1Xi)
      = ∑Yi − n b0 − b1 ∑Xi
      = 0

The sum of the observed values Yi equals the sum of the fitted values Ŷi:

∑ᵢ Yi = ∑ᵢ Ŷi

Properties of Solution

The sum of the weighted residuals is zero when the residual in the i-th trial is weighted by the level of the predictor variable in the i-th trial, from (10):

∑ᵢ Xiei = ∑ Xi(Yi − b0 − b1Xi)
        = ∑ XiYi − b0 ∑Xi − b1 ∑Xi²
        = 0

Properties of Solution

The sum of the weighted residuals is also zero when the residual in the i-th trial is weighted by the fitted value of the response variable in the i-th trial:

∑ᵢ Ŷiei = ∑ᵢ (b0 + b1Xi)ei
        = b0 ∑ᵢ ei + b1 ∑ᵢ Xiei
        = 0

Properties of Solution

The regression line always goes through the point (X̄, Ȳ).

Using the alternative form of the linear regression model,

Yi = β0* + β1(Xi − X̄) + εi

the least squares estimator b1 for β1 remains the same as before, and the least squares estimator for β0* = β0 + β1X̄ becomes

b0* = b0 + b1X̄ = (Ȳ − b1X̄) + b1X̄ = Ȳ

Hence the estimated regression function is

Ŷ = Ȳ + b1(X − X̄)

which shows that the regression line always goes through the point (X̄, Ȳ).
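These properties can be verified numerically for the example fit. A short, self-contained R sketch (repeating the example data):

# Check the residual properties of the least squares fit.
X <- c(20, 55, 30); Y <- c(5, 12, 10)
b1 <- sum((X - mean(X)) * (Y - mean(Y))) / sum((X - mean(X))^2)
b0 <- mean(Y) - b1 * mean(X)
e  <- Y - (b0 + b1 * X)       # residuals
sum(e)                        # ~ 0
sum(X * e)                    # ~ 0
sum((b0 + b1 * X) * e)        # ~ 0 (weighted by fitted values)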

Estimating Error Term Variance σ2

Review estimation in non-regression setting.

Show estimation results for regression setting.


Estimation Review

An estimator is a rule that tells how to calculate the value of an estimate based on the measurements contained in a sample, e.g. the sample mean

Ȳ = (1/n) ∑ᵢ₌₁ⁿ Yi

Point Estimators and Bias

Point estimator:

θ̂ = f({Y1, . . . , Yn})

Unknown quantity / parameter: θ

Definition: bias of an estimator

Bias(θ̂) = E(θ̂) − θ

Example

Samples:

Yi ∼ N(µ, σ²)

Estimate the population mean:

θ = µ,  θ̂ = Ȳ = (1/n) ∑ᵢ₌₁ⁿ Yi

Sampling Distribution of the Estimator

First moment:

E(θ̂) = E((1/n) ∑ᵢ₌₁ⁿ Yi) = (1/n) ∑ᵢ₌₁ⁿ E(Yi) = nµ/n = θ

This is an example of an unbiased estimator:

Bias(θ̂) = E(θ̂) − θ = 0

Variance of Estimator

Definition: variance of an estimator

Var(θ̂) = E([θ̂ − E(θ̂)]²)

Remember:

Var(cY) = c² Var(Y)

Var(∑ᵢ₌₁ⁿ Yi) = ∑ᵢ₌₁ⁿ Var(Yi), but only if the Yi are independent with finite variance

Example Estimator Variance

Var(θ̂) = Var((1/n) ∑ᵢ₌₁ⁿ Yi)
       = (1/n²) ∑ᵢ₌₁ⁿ Var(Yi)
       = nσ²/n²
       = σ²/n

Bias Variance Trade-off

The mean squared error of an estimator is

MSE(θ̂) = E([θ̂ − θ]²)

It can be re-expressed as

MSE(θ̂) = Var(θ̂) + [Bias(θ̂)]²

MSE = VAR + BIAS²

Proof:

MSE(θ̂) = E((θ̂ − θ)²)
        = E(([θ̂ − E(θ̂)] + [E(θ̂) − θ])²)
        = E([θ̂ − E(θ̂)]²) + 2 E([E(θ̂) − θ][θ̂ − E(θ̂)]) + E([E(θ̂) − θ]²)
        = Var(θ̂) + 2 [E(θ̂) − θ] E([θ̂ − E(θ̂)]) + (Bias(θ̂))²
        = Var(θ̂) + 2 [E(θ̂) − θ] · 0 + (Bias(θ̂))²
        = Var(θ̂) + (Bias(θ̂))²

(The cross term vanishes because E(θ̂) − θ is a constant and E([θ̂ − E(θ̂)]) = 0.)

Trade-off

Think of variance as confidence and bias as correctness.

These intuitions (largely) apply to estimators.

Sometimes choosing a biased estimator can result in an overall lower MSE, if it has much lower variance than the unbiased one.

s2 estimator for σ2 for Single population

Sum of squares:

∑ᵢ₌₁ⁿ (Yi − Ȳ)²

Sample variance estimator:

s² = ∑ᵢ₌₁ⁿ (Yi − Ȳ)² / (n − 1)

s² is an unbiased estimator of σ².

This sum of squares has n − 1 "degrees of freedom" associated with it: one degree of freedom is lost by using Ȳ as an estimate of the unknown population mean µ.

Estimating Error Term Variance σ2 for Regression Model

Regression model

The variance of each observation Yi is σ² (the same as for the error term εi).

Each Yi comes from a different probability distribution, with a mean that depends on the level Xi.

The deviation of an observation Yi must therefore be calculated around its own estimated mean Ŷi.

s2 estimator for σ2

s² = MSE = SSE/(n − 2) = ∑(Yi − Ŷi)²/(n − 2) = ∑ei²/(n − 2)

MSE is an unbiased estimator of σ²:

E(MSE) = σ²

The sum of squares SSE has n − 2 "degrees of freedom" associated with it.

Cochran's theorem (later in the course) tells us where degrees of freedom come from and how to calculate them.
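For the running example, the estimate takes one line of R (a sketch; summary(lm(Y ~ X))$sigma^2 reports the same quantity):

# Error-variance estimate s^2 = MSE for the age / attempts fit.
X <- c(20, 55, 30); Y <- c(5, 12, 10); n <- length(Y)
fit <- lm(Y ~ X)
s2 <- sum(resid(fit)^2) / (n - 2)   # MSE, here with n - 2 = 1 degree of freedom
c(s2 = s2, check = summary(fit)$sigma^2)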

Normal Error Regression Model

No matter how the error terms εi are distributed, the least squares method provides unbiased point estimators of β0 and β1 that also have minimum variance among all unbiased linear estimators.

To set up interval estimates and make tests we need to specify thedistribution of the εi

We will assume that the εi are normally distributed.


Normal Error Regression Model

Yi = β0 + β1Xi + εi

Yi is the value of the response variable in the i-th trial

β0 and β1 are parameters

Xi is a known constant, the value of the predictor variable in the i-th trial

εi ∼iid N(0, σ²) (note this is different: now we know the distribution)

i = 1, . . . , n

Notational Convention

When you see εi ∼iid N(0, σ2)

It is read as εi is identically and independently distributed accordingto a normal distribution with mean 0 and variance σ2


Maximum Likelihood Principle

The method of maximum likelihood chooses as estimates those values ofthe parameters that are most consistent with the sample data.


Likelihood Function

If

Xi ∼ F(Θ), i = 1, . . . , n

then the likelihood function is

L({Xi}ᵢ₌₁ⁿ, Θ) = ∏ᵢ₌₁ⁿ F(Xi; Θ)

Maximum Likelihood Estimation

The likelihood function can be maximized w.r.t. the parameter(s) Θ; doing this, one can arrive at estimators for the parameters as well:

L({Xi}ᵢ₌₁ⁿ, Θ) = ∏ᵢ₌₁ⁿ F(Xi; Θ)

To do this, find solutions to (analytically or by following the gradient)

dL({Xi}ᵢ₌₁ⁿ, Θ)/dΘ = 0

Important Trick

(Almost) never maximize the likelihood function; maximize the log likelihood function instead:

log(L({Xi}ᵢ₌₁ⁿ, Θ)) = log(∏ᵢ₌₁ⁿ F(Xi; Θ)) = ∑ᵢ₌₁ⁿ log(F(Xi; Θ))

Usually the log of the likelihood is easier to work with mathematically.

ML Normal Regression

Likelihood function

L(β0, β1, σ²) = ∏ᵢ₌₁ⁿ (2πσ²)^(−1/2) exp(−(Yi − β0 − β1Xi)²/(2σ²))
             = (2πσ²)^(−n/2) exp(−(1/(2σ²)) ∑ᵢ₌₁ⁿ (Yi − β0 − β1Xi)²)

which, if you maximize (how?) w.r.t. the parameters, gives. . .

Maximum Likelihood Estimator(s)

β0: b0, same as in the least squares case

β1: b1, same as in the least squares case

σ²:

σ̂² = ∑ᵢ (Yi − Ŷi)² / n

Note that the ML estimator σ̂² is biased, since s² is unbiased and

s² = MSE = (n/(n − 2)) σ̂²
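The bias factor is easy to see numerically (a sketch continuing the example; with n = 3, n/(n − 2) = 3):

# ML (biased) vs MSE (unbiased) estimates of the error variance.
X <- c(20, 55, 30); Y <- c(5, 12, 10); n <- length(Y)
e <- resid(lm(Y ~ X))
sigma2_ml <- sum(e^2) / n         # maximum likelihood estimator
s2        <- sum(e^2) / (n - 2)   # unbiased MSE
s2 / sigma2_ml                    # equals n / (n - 2)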

Comments

Least squares minimizes the squared error between the prediction andthe true output

The normal distribution is fully characterized by its first two centralmoments (mean and variance)


Inference in Regression Analysis

Yang Feng


Inference in the Normal Error Regression Model

Yi = β0 + β1Xi + εi

Yi is the value of the response variable in the i-th trial

β0 and β1 are parameters

Xi is a known constant, the value of the predictor variable in the i-th trial

εi ∼iid N(0, σ²)

i = 1, . . . , n

Maximum Likelihood Estimator(s)

β0: b0, same as in the least squares case

β1: b1, same as in the least squares case

σ²:

σ̂² = ∑ᵢ (Yi − Ŷi)² / n

Note that the ML estimator σ̂² is biased, since s² is unbiased and

s² = MSE = (n/(n − 2)) σ̂²

Inference concerning β1

Tests concerning β1 (the slope) are often of interest, particularly

H0 : β1 = 0
Ha : β1 ≠ 0

the null hypothesis model

Yi = β0 + (0)Xi + εi

implies that there is no linear relationship between Y and X.

Note the means of all the Yi ’s are equal at all levels of Xi .


Sampling Dist. Of b1

The point estimator for β1 is

b1 = ∑(Xi − X̄)(Yi − Ȳ) / ∑(Xi − X̄)²

The sampling distribution for b1 is the distribution of b1 that arises from the variability of b1 when the predictor variables Xi are held fixed and the observed outputs are repeatedly sampled.

Note that the sampling distribution of b1 will depend on our modelassumptions.


Sampling Dist. Of b1 In Normal Regr. Model

For the normal error regression model, the sampling distribution of b1 is normal, with mean and variance given by

E(b1) = β1

Var(b1) = σ² / ∑(Xi − X̄)²

To show this we need to go through a number of algebraic steps.

First step

To show that

∑(Xi − X̄)(Yi − Ȳ) = ∑(Xi − X̄)Yi

we observe

∑(Xi − X̄)(Yi − Ȳ) = ∑(Xi − X̄)Yi − ∑(Xi − X̄)Ȳ
                   = ∑(Xi − X̄)Yi − Ȳ ∑(Xi − X̄)
                   = ∑(Xi − X̄)Yi

since ∑(Xi − X̄) = 0.

b1 as a linear combination of the Yi's

b1 can be expressed as a linear combination of the Yi's:

b1 = ∑(Xi − X̄)(Yi − Ȳ) / ∑(Xi − X̄)²
   = ∑(Xi − X̄)Yi / ∑(Xi − X̄)²   (from the previous slide)
   = ∑ kiYi

where

ki = (Xi − X̄) / ∑(Xi − X̄)²

Properties of the ki's

It can be shown that

∑ ki = 0

∑ kiXi = 1

∑ ki² = 1 / ∑(Xi − X̄)²

We will use these properties to prove various properties of the sampling distributions of b1 and b0.

Linear combination of independent normal random variables

When Y1, . . . , Yn are independent normal random variables, the linear combination

∑ aiYi ∼ N(∑ ai E(Yi), ∑ ai² Var(Yi))

Normality of b1's Sampling Distribution

Since b1 is a linear combination of the Yi's and each Yi is an independent normal random variable, b1 is distributed normally as well:

b1 = ∑ kiYi,  ki = (Xi − X̄) / ∑(Xi − X̄)²

From the previous slide,

E(b1) = ∑ ki E(Yi),  Var(b1) = ∑ ki² Var(Yi)

b1 is an unbiased estimator

This can be seen using two of the properties:

E(b1) = E(∑ kiYi)
      = ∑ ki E(Yi)
      = ∑ ki (β0 + β1Xi)
      = β0 ∑ki + β1 ∑kiXi
      = β0(0) + β1(1)
      = β1

Variance of b1

Since the Yi are independent random variables with variance σ², and the ki's are constants, we get

Var(b1) = Var(∑ kiYi)
        = ∑ ki² Var(Yi)
        = ∑ ki² σ²
        = σ² ∑ki²
        = σ² / ∑(Xi − X̄)²

However, in most cases σ² is unknown.

Estimated variance of b1

When we don't know σ², we replace it with the MSE estimate (from the least squares estimation). Let

s² = MSE = SSE/(n − 2)

where

SSE = ∑ei²  and  ei = Yi − Ŷi

Plugging in, we get the estimated variance

s²{b1} = s² / ∑(Xi − X̄)²

Recap

We now have an expression for the sampling distribution of b1 when σ² is known:

b1 ∼ N(β1, σ²/∑(Xi − X̄)²)        (1)

When σ² is unknown we have an unbiased point estimator of Var(b1):

s²{b1} = s² / ∑(Xi − X̄)²

Digression : Gauss-Markov Theorem

Theorem

In a regression model where E(εi) = 0, Var(εi) = σ² < ∞, and εi and εj are uncorrelated for all i ≠ j, the least squares estimators b0 and b1 are unbiased and have minimum variance among all unbiased linear estimators.

No normality assumption on the error distribution! Recall

b1 = ∑(Xi − X̄)(Yi − Ȳ) / ∑(Xi − X̄)²

b0 = Ȳ − b1X̄

Proof

The theorem states that b1 has minimum variance among all unbiased linear estimators of the form

β̂1 = ∑ ciYi

As this estimator must be unbiased, we have

E(β̂1) = ∑ ci E(Yi)
       = ∑ ci (β0 + β1Xi)
       = β0 ∑ci + β1 ∑ciXi
       = β1

Proof cont.

Given the constraint

β0 ∑ci + β1 ∑ciXi = β1

clearly it must be the case that ∑ci = 0 and ∑ciXi = 1.

The variance of this estimator is

Var(β̂1) = ∑ ci² Var(Yi) = σ² ∑ci²

Proof cont.

Now define ci = ki + di, where the ki are the constants we already defined and the di are arbitrary constants:

ki = (Xi − X̄) / ∑(Xi − X̄)²

Let's look at the variance of the estimator:

Var(β̂1) = ∑ ci² Var(Yi) = σ² ∑(ki + di)²
        = σ² (∑ki² + ∑di² + 2∑kidi)

Note we just demonstrated that

σ² ∑ki² = Var(b1)

Proof cont.

Now, by showing that ∑kidi = 0, we're almost done:

∑kidi = ∑ki(ci − ki)
      = ∑kici − ∑ki²
      = ∑ci (Xi − X̄)/∑(Xi − X̄)² − 1/∑(Xi − X̄)²
      = (∑ciXi − X̄ ∑ci)/∑(Xi − X̄)² − 1/∑(Xi − X̄)²
      = 0

recalling that ∑ci = 0 and ∑ciXi = 1.

Proof end

So we are left with

Var(β̂1) = σ² (∑ki² + ∑di²)
        = Var(b1) + σ² ∑di²

which is minimized when all the di = 0. This means that the least squares estimator b1 has minimum variance among all unbiased linear estimators.

Sampling Distribution of (b1 − β1)/s(b1)

b1 is normally distributed, so

(b1 − β1)/√Var(b1) ∼ N(0, 1)

We don't know Var(b1), so it must be estimated from the data; we have already derived its estimate. Using that estimate, it can be shown that

(b1 − β1)/s(b1) ∼ t(n − 2),  where s(b1) = √(s²{b1})

Where does this come from?

For now we need to rely upon the following theorem.

For the normal error regression model,

SSE/σ² = ∑(Yi − Ŷi)²/σ² ∼ χ²(n − 2)

and this quantity is independent of b0 and b1.

Intuitively this follows the standard result for the sum of squared normal random variables; here there are two linear constraints imposed by the regression parameter estimation, each reducing the number of degrees of freedom by one.

We will revisit this subject soon.

Another useful fact : t distributed random variables

Let z and χ²(ν) be independent random variables (standard normal and χ² respectively). Then the following is a t-distributed random variable:

t(ν) = z / √(χ²(ν)/ν)

This version of the t distribution has one parameter, the degrees of freedom ν.

Distribution of the studentized statistic

To derive the distribution of this statistic, first we rewrite

(b1 − β1)/s(b1) = [(b1 − β1)/σ(b1)] / [s(b1)/σ(b1)]

where

s(b1)/σ(b1) = √(s²{b1}/σ²{b1})

Studentized statistic cont.

And note the following:

s²{b1}/σ²{b1} = [MSE/∑(Xi − X̄)²] / [σ²/∑(Xi − X̄)²] = MSE/σ² = SSE/(σ²(n − 2))

where we know (by the given theorem) that the last term is χ²-distributed, and independent of b1 and b0:

SSE/(σ²(n − 2)) ∼ χ²(n − 2)/(n − 2)

Studentized statistic final

But by the given definition of the t distribution we have our result:

(b1 − β1)/s(b1) ∼ t(n − 2)

because, putting everything together, we can see that

(b1 − β1)/s(b1) ∼ z / √(χ²(n − 2)/(n − 2))

Confidence Intervals and Hypothesis Tests

Now that we know the sampling distribution of b1 (t with n− 2 degrees offreedom) we can construct confidence intervals and hypothesis tests easily.

Things to think about

What does the t-distribution look like?

Why is the estimator distributed according to a t-distribution ratherthan a normal distribution?

When performing tests, why does this matter?

When is it safe to cheat and use a normal approximation?


Quick Review : Hypothesis Testing

Elements of a statistical test

Null hypothesis, H0

Alternative hypothesis, Ha

Test statistic

Rejection region

Quick Review : Hypothesis Testing - Errors

Errors

A type I error is made if H0 is rejected when H0 is true. The probability of a type I error is denoted by α; the value of α is called the level of the test.

A type II error is made if H0 is accepted when Ha is true. The probability of a type II error is denoted by β.

              H0 is true        Ha is true
Accept H0     Right decision    Type II Error
Reject H0     Type I Error      Right decision

p-value

The p-value, or attained significance level, is the smallest level ofsignificance α for which the observed data indicate that the nullhypothesis should be rejected.


Null Hypothesis

If the null hypothesis is that β1 = 0 then b1 should fall in the range aroundzero. The further it is from 0 the less likely the null hypothesis is to hold.


Alternative Hypothesis : Least Squares Fit

If our estimated value b1 deviates from 0, we have to determine whether that deviation would be surprising given the model and the sampling distribution of the estimator. If it is sufficiently different (where "sufficiently" is defined by a chosen significance level), we reject the null hypothesis.

Testing This Hypothesis

We only have a finite sample.

Different finite sets of samples (from the same population / source) will (almost always) produce different point estimates (b0, b1) of β0 and β1, given the same estimation procedure.

Key point: b0 and b1 are random variables whose sampling distributions can be statistically characterized.

Hypothesis tests about β0 and β1 can be constructed using these distributions.

Confidence Interval Example

A machine fills cups with margarine, and is supposed to be adjusted so that the content of the cups is µ = 250 g of margarine.

Observed random variable: X ∼ N(250, σ²) with known σ = 2.5.

Confidence Interval Example, Cont.

X1, . . . , X25: a random sample from X.

The natural estimator is the sample mean: µ̂ = X̄ = (1/n) ∑ᵢ₌₁ⁿ Xi.

Suppose the sample shows actual weights X1, . . . , X25 with mean

X̄ = (1/25) ∑ᵢ₌₁²⁵ Xi = 250.2 grams.

Confidence Interval Example, Cont.

Say we want to get a confidence interval for µ. By standardizing, we get a random variable

Z = (X̄ − µ)/(σ/√n) = (X̄ − µ)/0.5

with P(−z ≤ Z ≤ z) = 1 − α = 0.95.

The number z follows from the cumulative distribution function:

Φ(z) = P(Z ≤ z) = 1 − α/2 = 0.975,        (2)
z = Φ⁻¹(0.975) = 1.96.                     (3)

Confidence Interval Example, Cont.

Now we get:

0.95 = 1 − α = P(−z ≤ Z ≤ z)
     = P(−1.96 ≤ (X̄ − µ)/(σ/√n) ≤ 1.96)          (4)
     = P(X̄ − 1.96 σ/√n ≤ µ ≤ X̄ + 1.96 σ/√n)      (5)
     = P(X̄ − 1.96 × 0.5 ≤ µ ≤ X̄ + 1.96 × 0.5)    (6)
     = P(X̄ − 0.98 ≤ µ ≤ X̄ + 0.98).               (7)

This may be interpreted as: with probability 0.95 we will draw a sample whose stochastic interval endpoints

(X̄ − 0.98, X̄ + 0.98)

enclose the parameter µ.

Confidence Interval Example, Cont.

Therefore, our 0.95 confidence interval becomes:

(X̄ − 0.98, X̄ + 0.98) = (250.2 − 0.98, 250.2 + 0.98) = (249.22, 251.18).
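The same interval in one line of R (a sketch, using the known σ = 2.5 and n = 25):

# 95% z-interval for the mean cup weight.
250.2 + c(-1, 1) * qnorm(0.975) * 2.5 / sqrt(25)   # 249.22 251.18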

Recap

We know that the point estimator of β1 is

b1 = ∑(Xi − X̄)(Yi − Ȳ) / ∑(Xi − X̄)²

We derived the sampling distribution of b1, it being N(β1, σ²{b1}) when σ² is known, with

Var(b1) = σ²{b1} = σ² / ∑(Xi − X̄)²

And we suggested that an estimate of Var(b1) could be arrived at by substituting the MSE for σ² when σ² is unknown:

s²{b1} = MSE / ∑(Xi − X̄)² = [SSE/(n − 2)] / ∑(Xi − X̄)²

Sampling Distribution of (b1 − β1)/s{b1}

Since b1 is normally distributed,

(b1 − β1)/σ{b1} ∼ N(0, 1)

Using the estimate s²{b1},

(b1 − β1)/s{b1} ∼ t(n − 2),  where s{b1} = √(s²{b1})

Confidence Interval for β1

Since the "studentized" statistic follows a t distribution, we can make the following probability statement:

P(t(α/2; n − 2) ≤ (b1 − β1)/s{b1} ≤ t(1 − α/2; n − 2)) = 1 − α

By symmetry,

t(α/2; n − 2) = −t(1 − α/2; n − 2),

so we have the following confidence interval for β1:

P(b1 − t(1 − α/2; n − 2) s{b1} ≤ β1 ≤ b1 + t(1 − α/2; n − 2) s{b1}) = 1 − α

Look up the t distribution table (Table B.2 in the appendix) to produce confidence intervals, or use the R function qt(1 − α/2, n − 2).

Remember

Density: f(y) = dF(y)/dy

Distribution (CDF): F(y) = P(Y ≤ y) = ∫₋∞ʸ f(t) dt

Inverse CDF: F⁻¹(p) = y s.t. F(y) = p

1− α confidence limits for β1

The 1 − α confidence limits for β1 are

b1 ± t(1 − α/2; n − 2) s{b1}

Note that this quantity can be used to calculate confidence intervals given n and α.

Fixing α can guide the choice of sample size if a particular confidence interval is desired; given a sample size, vice versa.

Also useful for hypothesis testing.

Tests Concerning β1

Example 1 (two-sided test)

H0 : β1 = 0
Ha : β1 ≠ 0

Test statistic:

t* = (b1 − 0)/s{b1}

Tests Concerning β1

We have an estimate of the sampling distribution of b1 from the data.

If the null hypothesis holds, then the b1 estimate coming from the data should be within the 95% confidence interval of the sampling distribution centered at 0 (in this case):

t* = (b1 − 0)/s{b1}

Decision rules

If |t*| ≤ t(1 − α/2; n − 2), accept H0.
If |t*| > t(1 − α/2; n − 2), reject H0.

Absolute values make the test two-sided.
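Carrying the running example through (a sketch; summary(lm(Y ~ X)) reports the same t statistic and two-sided p-value):

# Two-sided t test of H0: beta1 = 0 for the age / attempts fit.
X <- c(20, 55, 30); Y <- c(5, 12, 10); n <- length(Y)
fit    <- lm(Y ~ X)
s_b1   <- sqrt(sum(resid(fit)^2) / (n - 2) / sum((X - mean(X))^2))
t_star <- coef(fit)["X"] / s_b1
abs(t_star) > qt(0.975, n - 2)    # TRUE would mean reject H0 at alpha = 0.05
2 * pt(-abs(t_star), n - 2)       # two-sided p-value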

Calculating the p-value

The p-value, or attained significance level, is the smallest level ofsignificance α for which the observed data indicate that the nullhypothesis should be rejected.

This can be looked up using the CDF of the test statistic.


p-value Example

An experiment is performed to determine whether a coin flip is fair (50% chance, each, of landing heads or tails) or unfairly biased (≠ 50% chance of one of the outcomes).

Outcome: suppose the experimental results show the coin turning up heads 14 times out of 20 total flips.

p-value: the p-value of this result would be the chance of a fair coin landing on heads at least 14 times out of 20 flips.

Calculation:

P(14 heads) + P(15 heads) + · · · + P(20 heads)                                                   (8)
= (1/2²⁰) [(20 choose 14) + (20 choose 15) + · · · + (20 choose 20)] = 60,460/1,048,576 ≈ 0.058   (9)

Two-sided p-value: 2 × 0.058 = 0.116
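A quick check in R (not in the slides):

# Binomial tail probability for 14 or more heads in 20 fair flips.
sum(dbinom(14:20, size = 20, prob = 0.5))   # one-sided, ~ 0.0577
2 * sum(dbinom(14:20, 20, 0.5))             # two-sided, ~ 0.115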

Power of the test

Power = P{Reject H0 | δ}                          (10)
      = P{|t*| > t(1 − α/2; n − 2) | δ}           (11)

where δ is the noncentrality measure

δ = |β1| / σ{b1}

and

t* = [(b1 − β1)/σ{b1} + β1/σ{b1}] / [s{b1}/σ{b1}]        (13)

Table B.5 presents the power of the two-sided t test.

Notice: the power depends on the value of σ².

In R, the power can be computed from the noncentral t distribution:

a <- qt(1 - alpha/2, n - 2)
1 - pt(a, n - 2, delta) + pt(-a, n - 2, delta)

Inferences Concerning β0

Largely, inference procedures regarding β0 can be performed in thesame way as those for β1

Remember the point estimator b0 for β0

b0 = Ȳ − b1X̄


Sampling distribution of b0

The sampling distribution of b0 refers to the different values of b0 that would be obtained with repeated sampling, when the levels of the predictor variable X are held constant from sample to sample.

For the normal error regression model, the sampling distribution of b0 is normal.

Sampling distribution of b0

When the error variance is known:

E(b0) = β0

σ²{b0} = σ² (1/n + X̄² / ∑(Xi − X̄)²)

When the error variance is unknown:

s²{b0} = MSE (1/n + X̄² / ∑(Xi − X̄)²)

Confidence interval for β0

The 1 − α confidence limits for β0 are obtained in the same manner as those for β1:

b0 ± t(1 − α/2; n − 2) s{b0}

Considerations on Inferences on β0 and β1

Effects of departures from normality of the Yi:
The estimators of β0 and β1 have the property of asymptotic normality; their distributions approach normality as the sample size increases (under general conditions).

Spacing of the X levels:
The variances of b0 and b1 (for a given n and σ²) depend strongly on the spacing of X.

Sampling distribution of point estimator of mean response

Let Xh be the level of X for which we would like an estimate of the mean response. Xh need not be one of the observed X's; it may be any value of the predictor within the scope of the model.

The mean response when X = Xh is denoted by

E(Yh) = β0 + β1Xh

The point estimator of E(Yh) is

Ŷh = b0 + b1Xh

We are interested in the sampling distribution of this quantity


Sampling Distribution of Yh

We have Ŷh = b0 + b1Xh

Since this quantity is a linear combination of the Yi's, its sampling distribution is itself normal.

The mean of the sampling distribution is

E{Ŷh} = E{b0} + E{b1}Xh = β0 + β1Xh

Biased or unbiased?


Sampling Distribution of Yh

To derive the variance of the sampling distribution of the mean response, we first show that b1 and (1/n)∑Yi are uncorrelated and hence, for the normal error regression model, independent.

We start with the definitions

Ȳ = ∑(1/n)Yi

b1 = ∑ kiYi,  ki = (Xi − X̄) / ∑(Xi − X̄)²


Sampling Distribution of Yh

We want to show that the sample mean Ȳ and the estimate b1 are uncorrelated:

Cov(Ȳ, b1) = σ²{Ȳ, b1} = 0

To do this we need the following result (A.32):

σ²{∑ aiYi, ∑ ciYi} = ∑ ai ci σ²{Yi} (sums over i = 1, . . . , n)

when the Yi are independent.


Sampling Distribution of Yh

Using this fact we have

σ²{∑ (1/n)Yi, ∑ kiYi} = ∑ (1/n) ki σ²{Yi}
                      = ∑ (1/n) ki σ²
                      = (σ²/n) ∑ ki
                      = 0

since ∑ ki = 0, as shown previously. So Ȳ and b1 are uncorrelated.


Sampling Distribution of Yh

This means that we can write down the variance

σ²{Ŷh} = σ²{Ȳ + b1(Xh − X̄)}

(using the alternative and equivalent form of the regression function)

But we know that Ȳ and b1 are uncorrelated, so

σ²{Ŷh} = σ²{Ȳ} + (Xh − X̄)² σ²{b1}


Sampling Distribution of Yh

We know

σ²{b1} = σ² / ∑(Xi − X̄)²

s²{b1} = MSE / ∑(Xi − X̄)²

And we can find

σ²{Ȳ} = (1/n²) ∑ σ²{Yi} = nσ²/n² = σ²/n


Sampling Distribution of Yh

So, plugging in, we get

σ²{Ŷh} = σ²/n + (Xh − X̄)² σ² / ∑(Xi − X̄)²

Or

σ²{Ŷh} = σ² [ 1/n + (Xh − X̄)² / ∑(Xi − X̄)² ]


Sampling Distribution of Yh

Since we usually won't know σ², we can, as usual, plug in our estimate s² = SSE/(n − 2) to get an estimate of this sampling distribution variance:

s²{Ŷh} = s² [ 1/n + (Xh − X̄)² / ∑(Xi − X̄)² ]


No surprise. . .

The standardized point estimator of the mean response is distributed as a t-distribution with n − 2 degrees of freedom:

(Ŷh − E{Yh}) / s{Ŷh} ∼ t(n − 2)

This means that we can construct confidence intervals in the samemanner as before.


Confidence Intervals for E(Yh)

The 1 − α confidence intervals for E(Yh) are

Ŷh ± t(1 − α/2; n − 2) s{Ŷh}

From this, hypothesis tests can be constructed as usual.
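A quick sketch in R (same hypothetical dat; Xh = 65 is an arbitrary illustrative level):

R code:
fit = lm(y ~ x, data = dat)
predict(fit, newdata = data.frame(x = 65), interval = "confidence", level = 0.95)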


Comments

The variance of the estimator for E(Yh) is smallest near the mean of X. Designing studies such that the mean of X is near Xh will improve inference precision.

When Xh is zero the variance of the estimator for E(Yh) reduces to the variance of the estimator b0 for β0.


Confidence Band for Regression Line

At times, we want to get a confidence band for the entire regression line E{Y} = β0 + β1X.

The Working-Hotelling 1 − α confidence band is

Ŷh ± W s{Ŷh}

where W² = 2F(1 − α; 2, n − 2).

Same form as before, except the t multiple is replaced with the W multiple.

R code:
W = sqrt(2 * qf(1 - alpha, 2, n - 2))


Prediction interval for single new observation

Essentially follows the sampling distribution arguments for E(Yh)

If all regression parameters are known, then the 1 − α prediction interval for a new observation Yh is

E{Yh} ± z(1 − α/2) σ


Prediction interval for single new observation

If the regression parameters are unknown, the 1 − α prediction interval for a new observation Yh is given by the following theorem:

Ŷh ± t(1 − α/2; n − 2) s{pred}

This is very nearly the same as the interval for known parameters, but includes a correction for the additional variability arising from the fact that the new observation was not used in the original estimates of b0, b1, and s².


Prediction interval for single new observation

We have

σ²{pred} = σ²{Yh − Ŷh} = σ²{Yh} + σ²{Ŷh} = σ² + σ²{Ŷh}

An unbiased estimator of σ²{pred} is s²{pred} = MSE + s²{Ŷh}, which is given by

s²{pred} = MSE [ 1 + 1/n + (Xh − X̄)² / ∑(Xi − X̄)² ]

Confidence interval: represents an inference on a parameter, and is an interval intended to cover the value of the parameter.

Prediction interval: a statement about the value to be taken by a random variable. Wider than the corresponding confidence interval.
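The same predict() call gives the wider prediction limits (a sketch reusing the hypothetical fit above):

R code:
predict(fit, newdata = data.frame(x = 65), interval = "prediction", level = 0.95)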


Prediction interval for the mean of m new observations for given Xh

Ŷh ± t(1 − α/2; n − 2) s{predmean}

An unbiased estimator of σ²{predmean} is s²{predmean} = MSE/m + s²{Ŷh}, which is given by

s²{predmean} = MSE [ 1/m + 1/n + (Xh − X̄)² / ∑(Xi − X̄)² ]


ANOVA

ANOVA is nothing new, but is instead a way of organizing the parts of linear regression so as to yield easy inference recipes.

We will return to ANOVA when discussing multiple regression and other types of linear statistical models.


Partitioning Total Sum of Squares

“The ANOVA approach is based on the partitioning of sums of squares and degrees of freedom associated with the response variable Y”

We start with the observed deviations of Yi around the observed mean Ȳ:

Yi − Ȳ


Partitioning of Total Deviations


Measure of Total Variation

The measure of total variation is denoted by

SSTO = ∑(Yi − Ȳ)²

SSTO stands for total sum of squares

If all Yi's are the same, SSTO = 0

The greater the variation of the Yi's, the greater SSTO


Variation after predictor effect

The measure of variation in the Yi's that is still present when the predictor variable X is taken into account is the sum of the squared deviations

SSE = ∑(Yi − Ŷi)²

SSE denotes error sum of squares


Regression Sum of Squares

The difference between SSTO and SSE is SSR

SSR = ∑(Ŷi − Ȳ)²

SSR stands for regression sum of squares


Partitioning of Sum of Squares

Yi − Ȳ = (Ŷi − Ȳ) + (Yi − Ŷi)

total deviation = deviation of fitted regression value around mean + deviation around fitted regression line


Remarkable Property

The sums of the same deviations, squared, have the same property!

∑(Yi − Ȳ)² = ∑(Ŷi − Ȳ)² + ∑(Yi − Ŷi)²

or

SSTO = SSR + SSE


Proof

∑(Yi − Ȳ)² = ∑[(Ŷi − Ȳ) + (Yi − Ŷi)]²
           = ∑[(Ŷi − Ȳ)² + (Yi − Ŷi)² + 2(Ŷi − Ȳ)(Yi − Ŷi)]
           = ∑(Ŷi − Ȳ)² + ∑(Yi − Ŷi)² + 2∑(Ŷi − Ȳ)(Yi − Ŷi)

but

∑(Ŷi − Ȳ)(Yi − Ŷi) = ∑Ŷi(Yi − Ŷi) − Ȳ∑(Yi − Ŷi) = 0

by properties previously demonstrated, namely ∑Ŷi ei = 0 and ∑ ei = 0.


Breakdown of Degrees of Freedom

SSTO: 1 linear constraint due to the calculation and inclusion of the mean, so n − 1 degrees of freedom.

SSE: 2 linear constraints arising from the estimation of β1 and β0, so n − 2 degrees of freedom.

SSR: two degrees of freedom in the regression parameters, one lost due to the linear constraint, so 1 degree of freedom.


Mean Squares

A sum of squares divided by its associated degrees of freedom is called a mean square.

The regression mean square is

MSR = SSR/1 = SSR

The mean square error is

MSE = SSE/(n − 2)


ANOVA table for simple lin. regression

Source of Variation | SS                 | df    | MS                | E(MS)
Regression          | SSR = ∑(Ŷi − Ȳ)²   | 1     | MSR = SSR/1       | σ² + β1² ∑(Xi − X̄)²
Error               | SSE = ∑(Yi − Ŷi)²  | n − 2 | MSE = SSE/(n − 2) | σ²
Total               | SSTO = ∑(Yi − Ȳ)²  | n − 1 |                   |


E{MSE} = σ²

Remember the following theorem, presented in an earlier lecture:

For the normal error regression model, SSE/σ² is distributed as χ² with n − 2 degrees of freedom and is independent of both b0 and b1.

Rewritten, this yields

SSE/σ² ∼ χ²(n − 2)

That means that E{SSE/σ²} = n − 2, and thus that E{SSE/(n − 2)} = E{MSE} = σ².


E{MSR} = σ² + β1² ∑(Xi − X̄)²

To begin, we take an alternative but equivalent form for SSR:

SSR = b1² ∑(Xi − X̄)²

And note that, by definition of variance, we can write

σ²{b1} = E{b1²} − (E{b1})²


E{MSR} = σ² + β1² ∑(Xi − X̄)²

But we know that b1 is an unbiased estimator of β1, so E{b1} = β1.

We also know (from previous lectures) that

σ²{b1} = σ² / ∑(Xi − X̄)²

So we can rearrange terms and plug in:

σ²{b1} = E{b1²} − (E{b1})²

E{b1²} = σ² / ∑(Xi − X̄)² + β1²


E{MSR} = σ² + β1² ∑(Xi − X̄)²

From the previous slide,

E{b1²} = σ² / ∑(Xi − X̄)² + β1²

which brings us to the result

E{MSR} = E{SSR/1} = E{b1²} ∑(Xi − X̄)² = σ² + β1² ∑(Xi − X̄)²


Comments and Intuition

The mean of the sampling distribution of MSE is σ² regardless of whether X and Y are linearly related (i.e. whether β1 = 0)

The mean of the sampling distribution of MSR is also σ² when β1 = 0.

When β1 = 0 the sampling distributions of MSR and MSE tend to be the same


F Test of β1 = 0 vs. β1 ≠ 0

ANOVA provides a battery of useful tests. For example, ANOVA provides an easy test for the two-sided hypothesis

H0: β1 = 0 vs. Ha: β1 ≠ 0

Test statistic from before:

t* = (b1 − 0) / s{b1}

ANOVA test statistic:

F* = MSR / MSE
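In R, anova() on a fitted model prints this decomposition directly (a sketch with the hypothetical dat again):

R code:
fit = lm(y ~ x, data = dat)
anova(fit)  # SS, df, and MS for regression and error, plus F* = MSR/MSE and its p-value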


Sampling distribution of F ∗

The sampling distribution of F* when H0: β1 = 0 holds can be derived from Cochran's theorem.

Cochran's theorem: if all n observations Yi come from the same normal distribution with mean µ and variance σ², and SSTO is decomposed into k sums of squares SSr, each with degrees of freedom dfr, then the SSr/σ² terms are independent χ² variables with dfr degrees of freedom, provided

df1 + df2 + · · · + dfk = n − 1


The F Test

We have decomposed SSTO into two sums of squares, SSR and SSE, and their degrees of freedom are additive. Hence, by Cochran's theorem: if β1 = 0, so that all Yi have the same mean µ = β0 and the same variance σ², then SSE/σ² and SSR/σ² are independent χ² variables.


F ∗ Test Statistic

F ∗ can be written as follows

F* = MSR / MSE = [ (SSR/σ²) / 1 ] / [ (SSE/σ²) / (n − 2) ]

But by Cochran's theorem, when H0 holds we have

F* ∼ [ χ²(1) / 1 ] / [ χ²(n − 2) / (n − 2) ]

with the two χ² variables independent.


F Distribution

The F distribution is the ratio of two independent χ² random variables normalized by their corresponding degrees of freedom.

The test statistic F* follows the distribution

F* ∼ F(1, n − 2)


Hypothesis Test Decision Rule

Since F* is distributed as F(1, n − 2) when H0 holds, the decision rule to follow when the risk of a Type I error is to be controlled at α is:

If F* ≤ F(1 − α; 1, n − 2), conclude H0

If F* > F(1 − α; 1, n − 2), conclude Ha


F distribution

PDF, CDF, Inverse CDF of F distribution

Note: MSR/MSE must be large in order to reject the null hypothesis.


Partitioning of Total Deviations

Does this make sense? When is MSR/MSE big?


Equivalence of F test and two-sided t test

F* = MSR / MSE (14)
   = b1² ∑(Xi − X̄)² / MSE (15)
   = b1² / s²{b1} (16)
   = (b1 / s{b1})² (17)
   = (t*)² (18)

In addition: F(1 − α; 1, n − 2) = t(1 − α/2; n − 2)².
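The equivalence is easy to verify numerically (a sketch; assumes the predictor column is named x):

R code:
fit = lm(y ~ x, data = dat)
tval = coef(summary(fit))["x", "t value"]
Fval = anova(fit)["x", "F value"]
all.equal(tval^2, Fval)  # TRUE up to floating-point error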


General Linear Test

The test of β1 = 0 versus β1 ≠ 0 is a simple example of a general linear test.

The general linear test has three parts:
- Full model
- Reduced model
- Test statistic


Full Model Fit

A full linear model is first fit to the data

Yi = β0 + β1Xi + εi

Using this model the error sum of squares is obtained; here, for example, the simple linear model with non-zero slope is the "full" model:

SSE(F) = ∑[Yi − (b0 + b1Xi)]² = ∑(Yi − Ŷi)² = SSE


Fit Reduced Model

One can test the hypothesis that a simpler model is a "better" model via a general linear test (which is really a likelihood ratio test in disguise). For instance, consider a "reduced" model in which the slope is zero (i.e. no relationship between input and output):

H0: β1 = 0 vs. Ha: β1 ≠ 0

The model when H0 holds is called the reduced or restricted model:

Yi = β0 + εi

The SSE for the reduced model is obtained as

SSE(R) = ∑(Yi − b0)² = ∑(Yi − Ȳ)² = SSTO

(under H0 the least squares estimate of β0 is b0 = Ȳ).


Test Statistic

The idea is to compare the two error sums of squares SSE(F) and SSE(R).

Because the full model F has more parameters than the reduced model, SSE(F) ≤ SSE(R) always.

In the general linear test, the test statistic is

F* = [ (SSE(R) − SSE(F)) / (dfR − dfF) ] / [ SSE(F) / dfF ]

which follows the F(dfR − dfF, dfF) distribution when H0 holds.

dfR and dfF are the degrees of freedom associated with the reduced and full model error sums of squares, respectively.
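anova() with two nested fits carries out exactly this comparison; for simple regression, reducing to y ~ 1 reproduces the F test of β1 = 0 (a sketch):

R code:
reduced = lm(y ~ 1, data = dat)
full = lm(y ~ x, data = dat)
anova(reduced, full)  # reports SSE(R) - SSE(F), dfR - dfF, and F*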


R² (Coefficient of determination)

SSTO measures the variation in the observations Yi when X is not considered.

SSE measures the variation in the Yi after a predictor variable X is employed.

A natural measure of the effect of X in reducing variation in Y is to express the reduction in variation (SSTO − SSE = SSR) as a proportion of the total variation:

R² = SSR/SSTO = 1 − SSE/SSTO

Note that since 0 ≤ SSE ≤ SSTO, we have 0 ≤ R² ≤ 1.
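A small sketch confirming the two equivalent forms on the hypothetical fit:

R code:
fit = lm(y ~ x, data = dat)
summary(fit)$r.squared
1 - sum(resid(fit)^2) / sum((dat$y - mean(dat$y))^2)  # 1 - SSE/SSTO, same value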


Limitations of and misunderstandings about R²

1. Claim: a high R² indicates that useful predictions can be made. The prediction interval for a particular input of interest may still be wide even if R² is high.

2. Claim: a high R² means that there is a good linear fit between predictor and output. It can be the case that an approximate (bad) linear fit to a truly curvilinear relationship results in a high R².

3. Claim: a low R² means that there is no relationship between input and output. Also not true, since there can be clear and strong relationships between input and output that are not well explained by a linear functional relationship.


Coefficient of Correlation

r = ±√R²

Range: −1 ≤ r ≤ 1

If b1 > 0, r = √R²; if b1 < 0, r = −√R².


Normal Correlation Models

Sometimes, correlation models are more natural than regression models, such as

Relationship between sales of gasoline and sales of auxiliary products.

Relationship between blood pressure and age.

Difference between Regression models and Correlation models

Regression models: X values are known constants.

Correlation models: Both X and Y are random.


Bivariate Normal Density

f(Y1, Y2) = [ 1 / (2π σ1 σ2 √(1 − ρ12²)) ] exp{ −[ 1 / (2(1 − ρ12²)) ] [ ((Y1 − µ1)/σ1)² − 2ρ12 ((Y1 − µ1)/σ1)((Y2 − µ2)/σ2) + ((Y2 − µ2)/σ2)² ] }

Parameters: µ1, µ2, σ1, σ2, ρ12

Here, ρ12 is the coefficient of correlation between the variables Y1 and Y2:

ρ12 = ρ{Y1, Y2} = σ12 / (σ1 σ2)

σ12 = σ{Y1, Y2} = E{(Y1 − µ1)(Y2 − µ2)}


Marginal Density

The marginal distribution of Y1 is normal with mean µ1 and standard deviation σ1:

f1(Y1) = [ 1 / (√(2π) σ1) ] exp[ −(1/2) ((Y1 − µ1)/σ1)² ]

How to get it from the bivariate density function? (Integrate the joint density over Y2.)


Conditional Inferences

f(Y1|Y2) = f(Y1, Y2) / f2(Y2)
         = [ 1 / (√(2π) σ1|2) ] exp[ −(1/2) ((Y1 − α1|2 − β12Y2)/σ1|2)² ]

Here

α1|2 = µ1 − µ2 ρ12 σ1/σ2

β12 = ρ12 σ1/σ2

σ²1|2 = σ1² (1 − ρ12²)


Inferences on Correlation Coefficients

Maximum likelihood estimation of ρ12 (the Pearson product-moment correlation coefficient):

r12 = ∑(Yi1 − Ȳ1)(Yi2 − Ȳ2) / [ ∑(Yi1 − Ȳ1)² ∑(Yi2 − Ȳ2)² ]^(1/2)

Usually biased, but the bias is small when n is large.

Test whether ρ12 = 0:

H0: ρ12 = 0 vs. H1: ρ12 ≠ 0

Test statistic:

t* = r12 √(n − 2) / √(1 − r12²)

If H0 holds, t* follows the t(n − 2) distribution.
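cor.test() in base R performs exactly this t test (a sketch; y1 and y2 are hypothetical numeric vectors):

R code:
cor.test(y1, y2)  # reports r12, t*, and the p-value under H0: rho12 = 0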


Interval Estimation of ρ12

Make the Fisher z transformation:

z′ = (1/2) loge[ (1 + r12) / (1 − r12) ]

When n is large (n > 25), z′ is approximately normally distributed with

E{z′} = ζ = (1/2) loge[ (1 + ρ12) / (1 − ρ12) ]

σ²{z′} = 1 / (n − 3)

Then we can make an interval estimate:

(z′ − ζ) / σ{z′}

is approximately standard normal, so the 1 − α confidence limits for ζ are

z′ ± z(1 − α/2) σ{z′}
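A sketch of the computation (hypothetical y1, y2; the last line back-transforms the limits for ζ into limits for ρ12, since z′ = atanh(r12)):

R code:
r = cor(y1, y2); n = length(y1)
z = 0.5 * log((1 + r) / (1 - r))
ci.z = z + c(-1, 1) * qnorm(0.975) / sqrt(n - 3)
tanh(ci.z)  # approximate 95% limits for rho12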


Spearman Rank Correlation Coefficient

Denote the rank of Yi1 by Ri1 and the rank of Yi2 by Ri2. The ranks run from 1 to n.

The Spearman rank correlation coefficient rS is then defined as

rS = ∑(Ri1 − R̄1)(Ri2 − R̄2) / [ ∑(Ri1 − R̄1)² ∑(Ri2 − R̄2)² ]^(1/2)

As before, −1 ≤ rS ≤ 1.

More robust than the Pearson correlation coefficient.
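In R (a sketch; the second line shows it is just the Pearson formula applied to the ranks):

R code:
cor(y1, y2, method = "spearman")
cor(rank(y1), rank(y2))  # identical value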


Diagnostics and Remedial Measures

Yang Feng


Remedial Measures

How do we know that the regression function is a good explainer of the observed data?
- Plotting
- Tests

What if it is not? What can we do about it?
- Transformation of variables


Graphical Diagnostics for the Predictor Variable

Dot plot - useful for visualizing the distribution of inputs

Sequence plot - useful for visualizing dependencies between error terms

Box plot - useful for visualizing the distribution of inputs

Toluca manufacturing example again: production time vs. lot size


Dot Plot


How many observations per input value?

Range of inputs?


Sequence Plot


If observations are made over time, is there a correlation between input and position in the observation sequence?


Box Plot


Shows:
- Median
- 1st and 3rd quartiles
- Maximum and minimum


Residuals

Recall the definition of residuals:

ei = Yi − Ŷi

And the difference between that and the unknown true error

εi = Yi − E(Yi)

In a normal regression model the εi's are assumed to be iid N(0, σ²) random variables. The observed residuals ei should reflect these properties.


Remember: residual properties

Mean

ē = ∑ ei / n = 0

Variance

s² = ∑(ei − ē)² / (n − 2) = SSE / (n − 2) = MSE


Nonindependence of Residuals

The residuals ei are not independent random variables:
- The fitted values Ŷi are based on the same fitted regression line.
- The residuals are subject to two constraints:
  1. the sum of the ei's equals 0;
  2. the sum of the products Xi ei equals 0.

When the sample size is large in comparison to the number of parameters in the regression model, the dependency effect among the residuals ei can be safely ignored.


Definition: semistudentized residuals

It may be useful sometimes to look at a standardized set of residuals, for instance in outlier detection.

As usual, since the standard deviation of εi is σ (itself estimated by the square root of MSE), a natural form of standardization to consider is

e*i = ei / √MSE

This is called a semistudentized residual.


Departures from Model...

To be studied by residuals

Regression function not linear

Error terms do not have constant variance

Error terms are not independent

Model fits all but one or a few outlier observations

Error terms are not normally distributed

One or more predictor variables have been omitted from the model


Diagnostics for Residuals

Plot of residuals against predictor variable

Plot of absolute or squared residuals against predictor variable

Plot of residuals against fitted values

Plot of residuals against time or other sequence

Plot of residuals against omitted predictor variables

Box plot of residuals

Normal probability plot of residuals


1. Test for nonlinearity of the regression function: residual plot against the predictor variable

Figure: Transit example: ridership increase vs. number of maps distributed

There should be no systematic relationship between the residuals and the predictor variable if the relation is linear.


Prototype Residual Plots

Figure: prototype residual plots


2. Nonconstancy of Error Variance


3. Presence of Outliers

Outliers can strongly affect the fitted values of the regression line.


Outlier effect on residuals


4. Nonindependence of Error Terms

Sequential observations can exhibit observable trends in the error distribution. Application: time series.


5. Non-normality of Error Terms

Distribution plots. (e.g., boxplot)

Comparison of frequencies (e.g., about 68 percent of the residuals should fall between ±√MSE, and about 90 percent between ±1.645√MSE).

Normal probability plot - a Q-Q plot with numerical quantiles on the horizontal axis

Figure: Examples of non-normality in the distribution of error terms


Normal probability plot

For a N(0, MSE) random variable, a good approximation of the expected value of the k-th smallest observation in a random sample of size n is

√MSE · z( (k − .375) / (n + .25) )

A normal probability plot consists of plotting the expected value of the k-th smallest observation against the observed k-th smallest observation.
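A sketch for a fitted model's residuals (hypothetical fit; qqnorm(e) with qqline(e) gives essentially the same picture up to scaling):

R code:
fit = lm(y ~ x, data = dat)
e = resid(fit); n = length(e)
mse = sum(e^2) / df.residual(fit)
expected = sqrt(mse) * qnorm((rank(e) - 0.375) / (n + 0.25))
plot(expected, e)  # roughly a straight line if the errors are normal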


6. Omission of Important Predictor Variables

Example: a qualitative variable - type of machine

Partitioning the data can reveal dependence on omitted variable(s)

Works for quantitative variables as well

Can suggest that inclusion of other inputs is important


Tests Involving Residuals

Tests for randomness (run test, Durbin-Watson test, Chapter 12)

Tests for constancy of variance (Brown-Forsythe test, Breusch-Pagan test, Section 3.6)

Tests for outliers (fit a new regression line to the other n − 1 observations; details in Chapter 10)

Tests for normality of the error distribution (discussed now)


Correlation Test for Normality of Error Distribution

For one way to run this correlation test, let

e = {eσ(1), . . . , eσ(n)}

be the ordered sequence (from smallest to largest) of the observed errors. Let

r = {r1, . . . , rk, . . . , rn},

where rk is the expected value of the kth residual under the normality assumption, i.e.

rk ≈ √MSE · z( (k − .375) / (n + .25) )

Then compute the sample correlation between e and r .


Correlation Test for Normality of Error Distribution

A formal test for the normality of the error terms can be developed in terms of the coefficient of correlation between the residuals ei and their expected values under normality. A high value indicates normality!

Table B.6 in the book gives critical values under the null hypothesis (normally distributed errors).

If the correlation is less than the critical value, reject the null hypothesis!

Toluca Company example: coefficient of correlation 0.991 > 0.959, the critical value for α risk at 0.05. We therefore conclude that the error distribution does not depart substantially from normality.
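A sketch of the statistic itself (same hypothetical fit; the √MSE factor does not affect a correlation, so it can be dropped):

R code:
e = resid(fit); n = length(e)
r.exp = qnorm((1:n - 0.375) / (n + 0.25))  # expected order statistics, up to scale
cor(sort(e), r.exp)  # compare with the Table B.6 critical value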


Tests for Constancy of Error Variance

The Brown-Forsythe test does not depend on normality of the error terms. It is applicable to simple linear regression when:
- the variance of the error terms either increases or decreases with X ("megaphone" residual plot);
- the sample size is large enough to ignore dependencies between the residuals.

The Brown-Forsythe test is essentially a t-test for whether the means of two normally distributed populations are the same, where the populations are the absolute deviations of the residuals in two non-overlapping partitions of the input space.


Brown-Forsythe Test

Divide X into X1 (the low values of X) and X2 (the high values of X)

Let ei1 denote the i-th residual for group X1, and ei2 the i-th residual for group X2.

Let ẽ1 and ẽ2 denote the medians of the residuals in the two groups.

Let n = n1 + n2.

The Brown-Forsythe test uses the absolute deviations of the residuals around their group median:

di1 = |ei1 − ẽ1|,  di2 = |ei2 − ẽ2|


Brown-Forsythe Test

The test statistic for comparing the means of the absolute deviations of the residuals around the group medians is

t*BF = (d̄1 − d̄2) / [ s √(1/n1 + 1/n2) ]

where the pooled variance is

s² = [ ∑(di1 − d̄1)² + ∑(di2 − d̄2)² ] / (n − 2)


Brown-Forsythe Test

If n1 and n2 are not extremely small,

t*BF ∼ t(n − 2)

approximately.

From this, confidence intervals and tests can be constructed.
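A sketch of the full computation (hypothetical dat; splitting at the median of X is one common choice of partition):

R code:
e = resid(lm(y ~ x, data = dat))
g = dat$x <= median(dat$x)
d1 = abs(e[g] - median(e[g])); d2 = abs(e[!g] - median(e[!g]))
n1 = length(d1); n2 = length(d2)
s2 = (sum((d1 - mean(d1))^2) + sum((d2 - mean(d2))^2)) / (n1 + n2 - 2)
tBF = (mean(d1) - mean(d2)) / sqrt(s2 * (1/n1 + 1/n2))
2 * pt(-abs(tBF), n1 + n2 - 2)  # two-sided p-value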


Breusch-Pagan Test

Another test for the constancy of error variance.

Assume the error terms are independently and normally distributed, with

σi² = γ0 + γ1Xi

Then the problem reduces to

H0: γ1 = 0 vs. H1: γ1 ≠ 0

The test statistic is derived in the following two steps:
1. Regress Y on X; obtain SSE and the residual vector e.
2. Regress e² on X; obtain SSR*.

X²BP = (SSR*/2) / (SSE/n)²


BP-test

Under H0, when n is reasonably large,

X²BP ∼ χ²(1)

If X²BP > qchisq(1 − α, 1), reject H0.

Direct function in R: ncvTest in the car package.
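A sketch of the two-step computation by hand (same hypothetical dat; car::ncvTest(fit) should agree in simple regression):

R code:
fit = lm(y ~ x, data = dat)
e2 = resid(fit)^2
aux = lm(e2 ~ x, data = dat)
ssr.star = sum((fitted(aux) - mean(e2))^2)
X2bp = (ssr.star / 2) / (sum(resid(fit)^2) / nrow(dat))^2
1 - pchisq(X2bp, 1)  # p-value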


F test for lack of fit

Formal test for determining whether a specific type of regression function adequately fits the data.

Assume the observations Y|X are:
1. independent
2. normally distributed
3. of the same variance σ²

Requires: repeat observations at one or more X levels (called replicates)


Example

11 similar branches of a bank offered gifts for setting up money market accounts

Minimum initial deposits were specified to qualify for the gift

The value of the gift was proportional to the specified minimum deposit

Interested in: the relationship between the specified minimum deposit and the number of new accounts opened


F Test Example Data and ANOVA Table


Fit


Data Arranged To Highlight Replicates


The observed value of the response variable for the i-th replicate at the j-th level of X is Yij

The mean of the Y observations at the level X = Xj is Ȳj


Full Model vs. Regression Model

The full model is

Yij = µj + εij

where
1. the µj are parameters, j = 1, ..., c
2. the εij are iid N(0, σ²)

Since the error terms have expectation zero

E (Yij) = µj


Full Model

In the full model there is a different mean (a free parameter) for each level Xj.

In the regression model the mean responses are constrained to lie on a line:

E(Y) = β0 + β1X


Fitting the Full Model

The estimators of µj are simply

µ̂j = Ȳj

The error sum of squares of the full model therefore is

SSE(F) = ∑∑(Yij − Ȳj)² = SSPE

SSPE: Pure Error Sum of Squares


Degrees of Freedom

The ordinary total sum of squares has n − 1 degrees of freedom.

Each of the c terms is an ordinary total sum of squares; each then has nj − 1 degrees of freedom.

The number of degrees of freedom of SSPE is the sum of the component degrees of freedom:

dfF = ∑j (nj − 1) = ∑j nj − c = n − c


General Linear Test

Remember: the general linear test proposes a reduced model

H0: E(Y) = β0 + β1X (the normal regression model)

Ha: E(Y) ≠ β0 + β1X (the full model, one independent mean for each level of X)


SSE For Reduced Model

The SSE for the reduced model is as before; remember

SSE(R) = ∑i ∑j [Yij − (b0 + b1Xj)]² = ∑i ∑j (Yij − Ŷij)²

and it has dfR = n − 2 degrees of freedom.


SSE(R)


F Test Statistic

From the general linear test approach:

F* = [ (SSE(R) − SSE(F)) / (dfR − dfF) ] ÷ [ SSE(F) / dfF ]
   = [ (SSE − SSPE) / ((n − 2) − (n − c)) ] ÷ [ SSPE / (n − c) ]

Lack of fit sum of squares:

SSLF = SSE − SSPE

Then

F* = [ SSLF / ((n − 2) − (n − c)) ] ÷ [ SSPE / (n − c) ] = MSLF / MSPE

where (n − 2) − (n − c) = c − 2.
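In R the lack-of-fit test is a nested-model comparison: the reduced model is the line, the full model fits one free mean per X level (a sketch; requires replicate observations at some X values):

R code:
reduced = lm(y ~ x, data = dat)
full = lm(y ~ factor(x), data = dat)  # SSE of this model is SSPE
anova(reduced, full)  # F* = MSLF/MSPE with (c - 2, n - c) degrees of freedom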


F Test Rule

From the F test we know that large values of F* lead us to reject the null hypothesis:

If F* ≤ F(1 − α; c − 2, n − c), conclude H0

If F* > F(1 − α; c − 2, n − c), conclude Ha

For this example we have


Variance decomposition

SSE = SSPE + SSLF:

∑∑(Yij − Ŷij)² = ∑∑(Yij − Ȳj)² + ∑∑(Ȳj − Ŷij)²


Example decomposition


Example Conclusion

If we set the significance level to α = .01

And look up the value of the F inverse CDF, F(.99; 4, 5) = 11.4

We can conclude that the null hypothesis should be rejected.


A review

Graphical procedures for determining the appropriateness of a regression fit
- Various residual plots

Tests to determine
- Constancy of error variance
- Lack of fit

What do we do if we determine (through testing or otherwise) that the linear regression fit is not good?


How to fix

If the simple regression model is not appropriate, then there are two choices:

1. Abandon the simple regression model and develop and use a more appropriate model, e.g., logistic regression, nonparametric regression, etc. This may yield better insights, but a more complex model leads to more complex procedures for estimating the parameters. (Later in the course.)

2. Employ some transformation of the data so that the simple regression model is appropriate for the transformed data. (This chapter.)


Fixes For...

Nonlinearity of regression function - Transformation(s) (today)

Nonconstancy of error variance - Weighted least squares (Chapter 11) and transformations

Nonindependence of error terms - Directly model correlation or use first differences (Chapter 12)

Non-normality of error terms - Transformation(s) (today)

Omission of important predictor variables - Modify the model: multiple regression analysis, Chapter 6 and later on

Outlying observations - Robust regression (Chapter 11)


Nonlinearity of regression function

Direct approach

Modify the regression model by altering the nature of the regression function. For instance, a quadratic regression function might be used:

E(Y) = β0 + β1X + β2X²

or an exponential function:

E(Y) = β0 β1^X

Such approaches employ a transformation to (approximately) linearize a regression function.


Quick Questions

How would you fit such models?

How does the exponential regression function relate to regular linearregression?

Where did the error terms go?

Yang Feng (Columbia University) Diagnostics and Remedial Measures 53 / 76


Transformations

Transformations for a nonlinear relation only
- Appropriate when the distribution of the error terms is reasonably close to a normal distribution
- In this situation
1. transformation of X should be attempted;
2. transformation of Y should not be attempted, because it will materially affect the distribution of the error terms.

Yang Feng (Columbia University) Diagnostics and Remedial Measures 54 / 76


Prototype Regression Patterns


Yang Feng (Columbia University) Diagnostics and Remedial Measures 55 / 76


Example

Experiment

X: days of training received

Y: sales performance (score)

Yang Feng (Columbia University) Diagnostics and Remedial Measures 56 / 76


X′ = √X

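As a quick sketch of how such a fit is done in R (with made-up data standing in for the sales-training example; the numbers and variable names are ours):

# Hypothetical data: x = days of training, y = sales performance score
x <- c(0.5, 0.5, 1, 1, 1.5, 1.5, 2, 2, 2.5, 2.5)
y <- c(42, 50, 68, 80, 89, 99, 105, 100, 111, 115)
fit <- lm(y ~ sqrt(x))  # regress Y on the transformed predictor X' = sqrt(X)
summary(fit)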

Yang Feng (Columbia University) Diagnostics and Remedial Measures 57 / 76


Example Data Transformation


Yang Feng (Columbia University) Diagnostics and Remedial Measures 58 / 76


Graphical Residual Analysis


Yang Feng (Columbia University) Diagnostics and Remedial Measures 59 / 76


Transformations on Y

Non-normality and unequal variances of error terms frequently appear together

To remedy these in the normal regression model we need a transformation on Y

This is because
- Shapes and spreads of the distributions of Y need to be changed
- It may also help linearize a curvilinear regression relation

Can be combined with transformation on X

Yang Feng (Columbia University) Diagnostics and Remedial Measures 60 / 76


Prototype Regression Patterns and Y Transformations


Transformations on Y:

Y′ = √Y
Y′ = log10 Y
Y′ = 1/Y

Yang Feng (Columbia University) Diagnostics and Remedial Measures 61 / 76


Example

Use of a logarithmic transformation of Y to linearize the regression relation and stabilize the error variance

Data on age (X) and plasma level of a polyamine (Y) for a portion of the 25 healthy children in a study. Younger children exhibit greater variability than older children.

Yang Feng (Columbia University) Diagnostics and Remedial Measures 62 / 76


Plasma Level vs. Age


Yang Feng (Columbia University) Diagnostics and Remedial Measures 63 / 76


Associated Data


Yang Feng (Columbia University) Diagnostics and Remedial Measures 64 / 76


Associated Data (Cont’)

If we fit a simple linear regression line to the log-transformed Y data, we obtain:

Ŷ′ = 1.135 − .1023X

And the coefficient of correlation between the ordered residuals and their expected values under normality is .981 (for α = .05, Table B.6 in the book shows a critical value of .959)

Normality of the error terms is supported; the regression model for the transformed Y data is appropriate.

Yang Feng (Columbia University) Diagnostics and Remedial Measures 65 / 76


Box Cox Transforms

It can be difficult to graphically determine which transformation of Y is most appropriate for correcting
- skewness of the distributions of error terms
- unequal variances
- nonlinearity of the regression function

The Box-Cox procedure automatically identifies a transformation from the family of power transformations on Y

Yang Feng (Columbia University) Diagnostics and Remedial Measures 66 / 76


Box Cox Transforms

This family is of the form

Y′ = Y^λ

Examples include

λ = 2: Y′ = Y²
λ = .5: Y′ = √Y
λ = 0: Y′ = ln Y (by definition)
λ = −.5: Y′ = 1/√Y
λ = −1: Y′ = 1/Y
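A minimal sketch of this family in R (the helper name boxcox_power is ours, not from any package):

# Power transformation family; lambda = 0 is defined as the natural log
boxcox_power <- function(y, lambda) {
  if (lambda == 0) log(y) else y^lambda
}
boxcox_power(c(1, 4, 9), lambda = 0.5)  # square root: 1 2 3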

Yang Feng (Columbia University) Diagnostics and Remedial Measures 67 / 76


Box Cox Cont.

The normal error regression model, with the response variable a member of the family of power transformations, becomes

Yi^λ = β0 + β1Xi + εi

This model has an additional parameter that needs to be estimated

Maximum likelihood is a way to estimate this parameter

Yang Feng (Columbia University) Diagnostics and Remedial Measures 68 / 76


Box Cox Maximum Likelihood Estimation

Before setting up the MLE, the observations are further standardized so that the magnitude of the error sum of squares does not depend on the value of λ

The transformation is given by

Wi = K1 (Yi^λ − 1), λ ≠ 0
Wi = K2 loge Yi, λ = 0

where

K2 = (∏ Yi)^(1/n)  (the geometric mean of the Yi)

K1 = 1 / (λ K2^(λ−1))

Yang Feng (Columbia University) Diagnostics and Remedial Measures 69 / 76


Box Cox Maximum Likelihood Estimation

Maximize

log L(X, Y, σ, λ, b1, b0) = − Σi (Wi − (b1Xi + b0))² / (2σ²) − n log(σ)

w.r.t. λ, σ, b1, b0

How?
- Take partial derivatives
- Solve
- or... gradient ascent methods

This is not an easy task, so we work on a grid of λ values, say λ = −2, −1.75, . . . , 1.75, 2, and calculate the SSE (or MSE) for each λ value.
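A sketch of this grid search in R, assuming a positive response y and a predictor x (variable names ours); in practice MASS::boxcox(lm(y ~ x)) produces the same kind of profile:

# For each lambda: standardize Y to W, regress W on x, record the SSE
lambdas <- seq(-2, 2, by = 0.25)
K2 <- exp(mean(log(y)))  # geometric mean of the Yi
sse <- sapply(lambdas, function(lam) {
  w <- if (lam == 0) K2 * log(y) else (y^lam - 1) / (lam * K2^(lam - 1))
  sum(resid(lm(w ~ x))^2)
})
lambdas[which.min(sse)]  # the lambda with the smallest SSE maximizes the likelihood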

Yang Feng (Columbia University) Diagnostics and Remedial Measures 70 / 76



The plasma levels example

Yang Feng (Columbia University) Diagnostics and Remedial Measures 71 / 76


Comments on Box Cox

The Box-Cox procedure is ordinarily used only to provide a guide for selecting a transformation

At times, theoretical considerations or prior information can be utilized to help in choosing an appropriate transformation

It is important to perform residual analysis after the transformation to ensure the transformation is appropriate

When transformed models are employed, the b0 and b1 obtained via least squares have the least squares property w.r.t. the transformed observations, not the original ones.

Usually we take a nearby λ value for which the power transformation is easier to understand: if the estimate is λ = 0.03, say, we may just use λ = 0. Of course, one should examine the flatness of the likelihood function in the neighborhood of the estimated λ.

When λ is near 1, no transformation of Y is needed.

Yang Feng (Columbia University) Diagnostics and Remedial Measures 72 / 76


Exploration of the Shape of the Regression Function: Nonparametric Regression Curves

So far: parametric regression approaches

Linear
Linear with transformed inputs and outputs
etc.

Other approaches

Method of moving averages: interpolate between mean outputs at adjacent inputs
Lowess: “LOcally WEighted Scatterplot Smoothing”

Yang Feng (Columbia University) Diagnostics and Remedial Measures 73 / 76


Lowess Method

Intuition: fit low-order polynomial (linear) regression models to points in a neighborhood

The neighborhood size is a parameter
Determining the neighborhood is done via a nearest-neighbors algorithm

Produce predictions by weighting the regressors by how far the set of points used to produce the regressor is from the input point for which a prediction is wanted

While somewhat ad hoc, it is a method of producing a nonlinear regression function for data that might otherwise seem difficult to regress

Yang Feng (Columbia University) Diagnostics and Remedial Measures 74 / 76


Lowess Method Example

Yang Feng (Columbia University) Diagnostics and Remedial Measures 75 / 76


R example

require(graphics)
plot(cars, main = "lowess(cars)")
lines(lowess(cars), col = 2)
lines(lowess(cars, f = .2), col = 3)
legend(5, 120, c(paste("f = ", c("2/3", ".2"))), lty = 1, col = 2:3)
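Here f is the smoother span: the fraction of the data that influences each local fit. The default is f = 2/3; smaller values such as f = .2 give a wigglier, more local curve.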

Yang Feng (Columbia University) Diagnostics and Remedial Measures 76 / 76


Simultaneous Inferences and Other Topics in RegressionAnalysis

Yang Feng

Yang Feng (Columbia University) Simultaneous Inferences 1 / 20


Simultaneous Inferences

In Chapter 2, we learned how to construct confidence intervals for β0 and β1.

Suppose we want a confidence level of 95% for β0 and β1 jointly

One could construct a separate 95% confidence interval for each of β0 and β1. BUT then the probability that both intervals cover simultaneously is below 95%.

How to create a joint confidence interval?

Yang Feng (Columbia University) Simultaneous Inferences 2 / 20


Bonferroni Joint Confidence Intervals

Calculation of Bonferroni joint confidence intervals is a general technique

We highlight its application in the regression setting

Joint confidence intervals for β0 and β1

Intuition

Set each statement confidence level larger than 1 − α so that the family confidence coefficient is at least 1 − α. BUT how much larger?

Yang Feng (Columbia University) Simultaneous Inferences 3 / 20


Ordinary Confidence Intervals

Start with ordinary confidence intervals for β0 and β1

b0 ± t(1 − α/2; n − 2) s{b0}
b1 ± t(1 − α/2; n − 2) s{b1}

And ask: what is the probability that one or both of these intervals is incorrect?

Remember

s²{b0} = MSE [ 1/n + X̄² / Σ(Xi − X̄)² ]

s²{b1} = MSE / Σ(Xi − X̄)²

Yang Feng (Columbia University) Simultaneous Inferences 4 / 20


General Procedure

Let A1 denote the event that the first confidence interval does not cover β0, i.e. P(A1) = α

Let A2 denote the event that the second confidence interval does not cover β1, i.e. P(A2) = α

We want to know the probability that both estimates fall in their respective confidence intervals, i.e. P(A1ᶜ ∩ A2ᶜ), where ᶜ denotes the complement

How do we get there from what we know?

How do we get there from what we know?

Yang Feng (Columbia University) Simultaneous Inferences 5 / 20


Venn Diagram

Yang Feng (Columbia University) Simultaneous Inferences 6 / 20


Bonferroni inequality

We can see that P(A1ᶜ ∩ A2ᶜ) = 1 − P(A1) − P(A2) + P(A1 ∩ A2)

In a Venn diagram, the size (area) of a set corresponds to its probability.

It also is clear that P(A1 ∩ A2) ≥ 0

So,

P(A1ᶜ ∩ A2ᶜ) ≥ 1 − P(A1) − P(A2)

= 1 − 2α

Yang Feng (Columbia University) Simultaneous Inferences 7 / 20


Using the Bonferroni inequality cont.

To achieve a 1 − α family confidence level for β0 and β1 (for example) using the Bonferroni procedure, the significance level of each individual interval must shrink, i.e. each individual interval must widen.

Returning to our confidence intervals for β0 and β1 from before

b0 ± t(1 − α/2; n − 2) s{b0}
b1 ± t(1 − α/2; n − 2) s{b1}

to achieve a 1 − α family confidence level these intervals must widen to

b0 ± t(1 − α/4; n − 2) s{b0}
b1 ± t(1 − α/4; n − 2) s{b1}

Then P(A1ᶜ ∩ A2ᶜ) ≥ 1 − P(A1) − P(A2) = 1 − α/2 − α/2 = 1 − α
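In R, assuming a fitted simple regression fit, this widening is exactly what confint() with an adjusted level produces:

# Bonferroni 1 - alpha joint intervals: each interval at level 1 - alpha/2
alpha <- 0.05
confint(fit, level = 1 - alpha / 2)  # uses t(1 - alpha/4; n - 2) multipliers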

Yang Feng (Columbia University) Simultaneous Inferences 8 / 20


Using the Bonferroni inequality cont.

The Bonferroni procedure is very general. To make joint confidence statements about multiple simultaneous predictions, remember that

Ŷh ± t(1 − α/2; n − 2) s{pred}

s²{pred} = MSE [ 1 + 1/n + (Xh − X̄)² / Σi(Xi − X̄)² ]

If one is interested in a 1 − α confidence statement about g predictions, then Bonferroni says that the confidence interval for each individual prediction must get wider (for each h among the g predictions):

Ŷh ± t(1 − α/(2g); n − 2) s{pred}

Note: if a sufficiently large number of simultaneous predictions are made, the widths of the individual confidence intervals may become so large that they are no longer useful.

Yang Feng (Columbia University) Simultaneous Inferences 9 / 20


The Toluca Example

Say we want to get a 90 percent confidence interval for β0 and β1 simultaneously.

Then we require B = t(1 − .1/4; 23) = t(.975; 23) = 2.069

Then we have the joint confidence intervals:

b0 ± B · s{b0}

and

b1 ± B · s{b1}

Yang Feng (Columbia University) Simultaneous Inferences 10 / 20


Confidence Band for Regression Line

Remember that in Chapter 2.5 we got the confidence interval for E{Yh} to be

Ŷh ± t(1 − α/2; n − 2) s{Ŷh}

Now we want a confidence band for the entire regression line E{Y} = β0 + β1X.

The Working-Hotelling 1 − α confidence band is

Ŷh ± W · s{Ŷh}

where W² = 2F(1 − α; 2, n − 2).

Same form as before, except the t multiple is replaced with the W multiple.

Yang Feng (Columbia University) Simultaneous Inferences 11 / 20


Example: toluca company

Say we want to estimate the boundary values of the band at Xh = 30, 65, 100.

We have

Looking up the table, W² = 2F(1 − α; 2, n − 2) = 2F(.9; 2, 23) = 5.098. R code:

w2 = 2 * qf(1 - 0.1, 2, 23)

Yang Feng (Columbia University) Simultaneous Inferences 12 / 20


Now the confidence band limits at the three points are:

Yang Feng (Columbia University) Simultaneous Inferences 13 / 20


Example confidence band

Yang Feng (Columbia University) Simultaneous Inferences 14 / 20


Compare with Bonferroni Procedure

Say we want to simultaneously estimate response for Xh = 30, 65, 100.

Then the simultaneous confidence intervals are

Ŷh ± t(1 − α/(2g); n − 2) s{Ŷh}

We have B = t(1 − α/(2g); n − 2) = t(1 − .1/(2 · 3); 23) = 2.263, and the confidence intervals are
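The multiple can be checked in R:

B <- qt(1 - 0.1 / (2 * 3), df = 23)  # 2.263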

Yang Feng (Columbia University) Simultaneous Inferences 15 / 20


Bonferroni v.s. Working-Hotelling

In this instance, the Working-Hotelling confidence limits are slightly tighter (better) than the Bonferroni limits

However, for larger families (more levels of X considered simultaneously), Working-Hotelling is eventually always tighter, since W stays the same for any number of statements while B becomes larger.

The levels of the predictor variable are sometimes not known in advance. In such cases it is better to use the Working-Hotelling procedure, since its family encompasses all possible levels of X.

Yang Feng (Columbia University) Simultaneous Inferences 16 / 20


Simultaneous Prediction Intervals for g New Observations

1 Scheffé procedure

Ŷh ± S · s{pred}, (1)

where S² = gF(1 − α; g, n − 2).

2 Bonferroni procedure

Ŷh ± B · s{pred}, (2)

where B = t(1 − α/(2g); n − 2).

Yang Feng (Columbia University) Simultaneous Inferences 17 / 20


Regression through the origin

Model: Yi = β1Xi + εi

Sometimes it is known that the regression function is linear and that it must go through the origin.

β1 is the parameter

Xi are known constants

εi are i.i.d. N(0, σ²)

The least squares and maximum likelihood estimators for β1 coincide as before; the estimator is

b1 = Σ XiYi / Σ Xi²
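In R, regression through the origin is obtained by dropping the intercept term (a sketch, with x and y as generic data vectors):

fit0 <- lm(y ~ x - 1)    # or lm(y ~ 0 + x)
coef(fit0)               # b1
sum(x * y) / sum(x^2)    # the same estimate, computed directly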

Yang Feng (Columbia University) Simultaneous Inferences 18 / 20


Regression through the origin, Cont

In regression through the origin there is only one free parameter (β1), so the number of degrees of freedom of the MSE

s² = MSE = Σ ei² / (n − 1) = Σ (Yi − Ŷi)² / (n − 1)

is increased by one. This is because this is a “reduced” model in the general linear test sense, and because the number of parameters estimated from the data is less by one.

Yang Feng (Columbia University) Simultaneous Inferences 19 / 20


A few notes on regression through the origin

Σ ei ≠ 0 in general now. The only constraint is Σ Xi ei = 0.

SSE may exceed the total sum of squares SSTO, e.g., in the case of a curvilinear pattern or a linear pattern with an intercept away from the origin.

Therefore, R² = 1 − SSE/SSTO may be negative!

Generally, it is safer to use the original model (with an intercept) as opposed to the regression-through-the-origin model.

Otherwise, one risks starting with the wrong model!

Yang Feng (Columbia University) Simultaneous Inferences 20 / 20


Linear Algebra Review

Yang Feng

Yang Feng (Columbia University) Linear Algebra Review 1 / 46


Definition of Matrix

Rectangular array of elements arranged in rows and columns, e.g.

[ 16000 23
  33000 47
  21000 35 ]

A matrix has dimensions

The dimension of a matrix is its number of rows and columns

It is expressed as 3× 2 (in this case)

Yang Feng (Columbia University) Linear Algebra Review 2 / 46


Indexing a Matrix

Rectangular array of elements arranged in rows and columns

A = [ a11 a12 a13
      a21 a22 a23 ]

A matrix can also be notated as

A = [aij], i = 1, 2; j = 1, 2, 3

Yang Feng (Columbia University) Linear Algebra Review 3 / 46


Square Matrix and Column Vector

A square matrix has equal number of rows and columns

[ 4 7       [ a11 a12 a13
  3 9 ]       a21 a22 a23
              a31 a32 a33 ]

A column vector is a matrix with a single column

[ 4         [ c1
  7           c2
  10 ]        c3
              c4
              c5 ]

All vectors (row or column) are matrices, and all scalars are 1 × 1 matrices.

Yang Feng (Columbia University) Linear Algebra Review 4 / 46


Transpose

The transpose of a matrix is another matrix in which the rows andcolumns have been interchanged

A = [ 2  5
      7 10
      3  4 ]

A′ = [ 2 7 3
       5 10 4 ]

Yang Feng (Columbia University) Linear Algebra Review 5 / 46


Equality of Matrices

Two matrices are the same if they have the same dimension and all the elements are equal

A = [ a1        B = [ 4
      a2              7
      a3 ]            3 ]

A = B implies a1 = 4, a2 = 7, a3 = 3

Yang Feng (Columbia University) Linear Algebra Review 6 / 46


Matrix Addition and Subtraction

A = [ 1 4       B = [ 1 2
      2 5             2 3
      3 6 ]           3 4 ]

Then

A + B = [ 2  6
          4  8
          6 10 ]

Yang Feng (Columbia University) Linear Algebra Review 7 / 46


Multiplication of a Matrix by a Scalar

A = [ 2 7
      9 3 ]

kA = k [ 2 7     = [ 2k 7k
         9 3 ]       9k 3k ]

Yang Feng (Columbia University) Linear Algebra Review 8 / 46


Multiplication of two Matrices

Yang Feng (Columbia University) Linear Algebra Review 9 / 46


Another Matrix Multiplication Example

A = [ 1 3 4       B = [ 3
      0 5 8 ]           5
                        2 ]

AB = [ 1 3 4    [ 3       [ 26
       0 5 8 ]    5   =     41 ]
                  2 ]

Yang Feng (Columbia University) Linear Algebra Review 10 / 46


Special Matrices

If A = A′, then A is a symmetric matrix

A = [ 1 4 6       A′ = [ 1 4 6
      4 2 5              4 2 5
      6 5 3 ]            6 5 3 ]

If the off-diagonal elements of a matrix are all zeros, it is called a diagonal matrix

A = [ a1 0  0       B = [ 4 0 0  0
      0  a2 0             0 1 0  0
      0  0  a3 ]          0 0 10 0
                          0 0 0  5 ]

Yang Feng (Columbia University) Linear Algebra Review 11 / 46


Identity Matrix

A diagonal matrix whose diagonal entries are all ones is an identity matrix. Multiplication by an identity matrix leaves the pre- or post-multiplied matrix unchanged.

I = [ 1 0 0 0
      0 1 0 0
      0 0 1 0
      0 0 0 1 ]

and

AI = IA = A

Yang Feng (Columbia University) Linear Algebra Review 12 / 46


Vector and matrix with all elements equal to one

1 = [ 1
      1
      ...
      1 ]

J = [ 1 ... 1
      ... ... ...
      1 ... 1 ]

11′ = [ 1
        1    [ 1 1 . . . 1 ]  =  [ 1 ... 1
        ...                        ... ... ...
        1 ]                        1 ... 1 ]  =  J

Yang Feng (Columbia University) Linear Algebra Review 13 / 46


Linear Dependence and Rank of Matrix

Consider

A = [ 1 2 5  1
      2 2 10 6
      3 4 15 1 ]

and think of this as a matrix of a collection of column vectors.

Note that the third column vector is a multiple of the first column vector.

Yang Feng (Columbia University) Linear Algebra Review 14 / 46


Linear Dependence

When m scalars k1, ..., km, not all zero, can be found such that

k1A1 + ... + kmAm = 0

where 0 denotes the zero column vector and Ai is the i-th column of matrix A, the m column vectors are called linearly dependent. If the only set of scalars for which the equality holds is k1 = 0, ..., km = 0, the set of m column vectors is linearly independent.

In the previous example matrix the columns are linearly dependent:

5 [ 1       0 [ 2       1 [ 5        0 [ 1       [ 0
    2    +      2    −      10   +       6    =    0
    3 ]         4 ]         15 ]         1 ]        0 ]

Yang Feng (Columbia University) Linear Algebra Review 15 / 46


Rank of Matrix

The rank of a matrix is defined to be the maximum number of linearly independent columns in the matrix. Rank properties include

The rank of a matrix is unique

The rank of a matrix can equivalently be defined as the maximum number of linearly independent rows

The rank of an r × c matrix cannot exceed min(r, c)

The row and column rank of a matrix are equal

The rank of a matrix is preserved under nonsingular transformations, i.e. let A (n × n) and C (k × k) be nonsingular matrices. Then for any n × k matrix B we have

rank(B) = rank(AB) = rank(BC)

Yang Feng (Columbia University) Linear Algebra Review 16 / 46


Inverse of Matrix

Like a reciprocal

6 ∗ (1/6) = (1/6) ∗ 6 = 1

x ∗ (1/x) = 1

But for matrices

AA−1 = A−1A = I

Yang Feng (Columbia University) Linear Algebra Review 17 / 46


Example

A = [ 2 4
      3 1 ]

A⁻¹ = [ −.1  .4
        .3  −.2 ]

A⁻¹A = [ 1 0
         0 1 ]

More generally,

A = [ a b
      c d ]

A⁻¹ = (1/D) [ d  −b
              −c  a ]

where D = ad − bc (the determinant of A)

Yang Feng (Columbia University) Linear Algebra Review 18 / 46


Inverses of Diagonal Matrices are Easy

A = [ 3 0 0
      0 4 0
      0 0 2 ]

then

A⁻¹ = [ 1/3  0   0
        0   1/4  0
        0    0  1/2 ]

Yang Feng (Columbia University) Linear Algebra Review 19 / 46


Finding the inverse

Finding an inverse takes (for general matrices with no special structure)

O(n³)

operations (where n is the number of rows in the matrix)

We will assume that numerical packages can do this for us. In R: solve(A) gives the inverse of matrix A

Yang Feng (Columbia University) Linear Algebra Review 20 / 46


Uses of Inverse Matrix

Ordinary algebra: 5y = 20 is solved by (1/5)(5y) = (1/5)(20)

Linear algebra: AY = C is solved by

A⁻¹AY = A⁻¹C, so Y = A⁻¹C

Yang Feng (Columbia University) Linear Algebra Review 21 / 46


Example

Solving a system of simultaneous equations:

2y1 + 4y2 = 20
3y1 + y2 = 10

[ 2 4     [ y1      [ 20
  3 1 ]     y2 ]  =   10 ]

[ y1       [ 2 4  ⁻¹  [ 20
  y2 ]  =    3 1 ]      10 ]

Yang Feng (Columbia University) Linear Algebra Review 22 / 46


List of Useful Matrix Properties

A + B = B + A

(A + B) + C = A + (B + C)

(AB)C = A(BC)

C(A + B) = CA + CB

k(A + B) = kA + kB

(A′)′ = A

(A + B)′ = A′ + B′

(AB)′ = B′A′

(ABC)′ = C′B′A′

(AB)⁻¹ = B⁻¹A⁻¹

(ABC)⁻¹ = C⁻¹B⁻¹A⁻¹

(A⁻¹)⁻¹ = A, (A′)⁻¹ = (A⁻¹)′

Yang Feng (Columbia University) Linear Algebra Review 23 / 46


Random Vectors and Matrices

Let’s say we have a vector consisting of three random variables

Y = [ Y1
      Y2
      Y3 ]

The expectation of a random vector is defined as

E(Y) = [ E(Y1)
         E(Y2)
         E(Y3) ]

Yang Feng (Columbia University) Linear Algebra Review 24 / 46


Expectation of a Random Matrix

The expectation of a random matrix is defined similarly

E(Y) = [E(Yij)], i = 1, ..., n; j = 1, ..., p

Yang Feng (Columbia University) Linear Algebra Review 25 / 46


Variance-covariance Matrix of a Random Vector

The variances of the three random variables, σ²(Yi), and the covariances between any two of the three random variables, σ(Yi, Yj), are assembled in the variance-covariance matrix of Y

cov(Y) = σ²{Y} = [ σ²(Y1)    σ(Y1,Y2)  σ(Y1,Y3)
                   σ(Y2,Y1)  σ²(Y2)    σ(Y2,Y3)
                   σ(Y3,Y1)  σ(Y3,Y2)  σ²(Y3) ]

remember σ(Y2,Y1) = σ(Y1,Y2) so the covariance matrix is symmetric

Yang Feng (Columbia University) Linear Algebra Review 26 / 46


Derivation of Covariance Matrix

In vector terms the variance-covariance matrix is defined by

σ²{Y} = E[(Y − E(Y))(Y − E(Y))′]

because

σ²{Y} = E[ ( Y1 − E(Y1)
             Y2 − E(Y2)
             Y3 − E(Y3) )  ( Y1 − E(Y1)  Y2 − E(Y2)  Y3 − E(Y3) ) ]

Yang Feng (Columbia University) Linear Algebra Review 27 / 46


Regression Example

Take a regression example with n = 3, constant error variance σ²(εi) = σ², and uncorrelated errors so that σ(εi, εj) = 0 for all i ≠ j

The variance-covariance matrix for the random vector ε is

σ²{ε} = [ σ²  0   0
          0   σ²  0
          0   0   σ² ]

which can be written as σ²{ε} = σ²I

Yang Feng (Columbia University) Linear Algebra Review 28 / 46


Basic Results

If A is a constant matrix and Y is a random matrix, then W = AY is a random matrix

E(A) = A

E(W) = E(AY) = A E(Y)

σ²{W} = σ²{AY} = A σ²{Y} A′

Yang Feng (Columbia University) Linear Algebra Review 29 / 46


Multivariate Normal Density

Let Y be a vector of p observations

Y = [ Y1
      Y2
      ...
      Yp ]

Let µ be a vector of the means of each of the p observations

µ = [ µ1
      µ2
      ...
      µp ]

Yang Feng (Columbia University) Linear Algebra Review 30 / 46


Multivariate Normal Density

let Σ be the variance-covariance matrix of Y

Σ = [ σ1²  σ12  ...  σ1p
      σ21  σ2²  ...  σ2p
      ...
      σp1  σp2  ...  σp² ]

Then the multivariate normal density is given by

P(Y|µ, Σ) = 1 / ((2π)^(p/2) |Σ|^(1/2)) · exp[ −(1/2) (Y − µ)′ Σ⁻¹ (Y − µ) ]

Yang Feng (Columbia University) Linear Algebra Review 31 / 46


Matrix Simple Linear Regression

Nothing new here, only matrix formalism for previous results

Remember the normal error regression model

Yi = β0 + β1Xi + εi, εi ∼ N(0, σ²), i = 1, ..., n

Expanded out this looks like

Y1 = β0 + β1X1 + ε1
Y2 = β0 + β1X2 + ε2
...
Yn = β0 + β1Xn + εn

which points towards an obvious matrix formulation.

which points towards an obvious matrix formulation.

Yang Feng (Columbia University) Linear Algebra Review 32 / 46


Regression Matrices

If we identify the following matrices

Y = [ Y1          X = [ 1 X1
      Y2                1 X2
      ...               ... ...
      Yn ]              1 Xn ]

β = ( β0          ε = [ ε1
      β1 )              ε2
                        ...
                        εn ]

We can write the linear regression equations in a compact form

Y = Xβ + ε

Yang Feng (Columbia University) Linear Algebra Review 33 / 46


Regression Matrices

Of course, in the normal regression model the expected value of each of the ε's is zero, so we can write E(Y) = Xβ

This is because E(ε) = 0:

[ E(ε1)       [ 0
  E(ε2)         0
  ...      =    ...
  E(εn) ]       0 ]

Yang Feng (Columbia University) Linear Algebra Review 34 / 46


Error Covariance

Because the error terms are independent and have constant variance σ²

σ²{ε} = [ σ²  0  ...  0
          0  σ²  ...  0
          ...
          0   0  ...  σ² ]

σ²{ε} = σ²I

Yang Feng (Columbia University) Linear Algebra Review 35 / 46


Matrix Normal Regression Model

In matrix terms the normal regression model can be written as

Y = Xβ + ε

where ε ∼ N(0, σ²I)

Yang Feng (Columbia University) Linear Algebra Review 36 / 46


Least Square Estimation

If we remember both the starting normal equations that we derived

n b0 + b1 ΣXi = ΣYi

b0 ΣXi + b1 ΣXi² = ΣXiYi

and the fact that

X′X = [ 1  1  ... 1     [ 1 X1
        X1 X2 ... Xn ]    1 X2       =  [ n    ΣXi
                          ... ...         ΣXi  ΣXi² ]
                          1 Xn ]

X′Y = [ 1  1  ... 1     [ Y1
        X1 X2 ... Xn ]    Y2      =  [ ΣYi
                          ...          ΣXiYi ]
                          Yn ]

Yang Feng (Columbia University) Linear Algebra Review 37 / 46


Least Square Estimation

Then we can see that these equations are equivalent to the following matrix operation

X′X b = X′Y

with

b = ( b0
      b1 )

and the solution to this equation given by

b = (X′X)⁻¹X′Y

when (X′X)⁻¹ exists.
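A sketch in R for simple regression data x, y (names ours), comparing the closed form with lm():

X <- cbind(1, x)                    # design matrix with an intercept column
b <- solve(t(X) %*% X, t(X) %*% y)  # b = (X'X)^{-1} X'Y
cbind(b, coef(lm(y ~ x)))           # the two columns agree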

Yang Feng (Columbia University) Linear Algebra Review 38 / 46


Fitted Value

Ŷ = Xb

Because:

Ŷ = [ Ŷ1         [ 1 X1                    [ b0 + b1X1
      Ŷ2           1 X2       ( b0           b0 + b1X2
      ...     =    ... ...      b1 )   =     ...
      Ŷn ]         1 Xn ]                    b0 + b1Xn ]

Yang Feng (Columbia University) Linear Algebra Review 39 / 46


Fitted Values, Hat Matrix

Plug in

b = (X′X)⁻¹X′Y

We have

Ŷ = Xb = X(X′X)⁻¹X′Y

or Ŷ = HY

where H = X(X′X)⁻¹X′

is called the hat matrix.

Properties of the hat matrix H:

1 symmetric

2 idempotent: HH = H.
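Both properties are easy to verify numerically in R (continuing with the design matrix X above):

H <- X %*% solve(t(X) %*% X) %*% t(X)  # hat matrix
all.equal(H, t(H))                     # symmetric
all.equal(H %*% H, H)                  # idempotent: HH = H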

Yang Feng (Columbia University) Linear Algebra Review 40 / 46


Residuals

e = Y − Ŷ = Y − HY = (I − H)Y

so e = (I − H)Y

The matrix I − H is also symmetric and idempotent.

The variance-covariance matrix of e is

σ²{e} = σ²(I − H)

And we can estimate it by

s²{e} = MSE (I − H)

Yang Feng (Columbia University) Linear Algebra Review 41 / 46


Analysis of Variance Results

SSTO = Σ(Yi − Ȳ)² = ΣYi² − (ΣYi)²/n

We know

Y′Y = ΣYi²

and J is the matrix with entries all equal to 1. Then we have

(ΣYi)²/n = (1/n) Y′JY

As a result:

SSTO = Y′Y − (1/n) Y′JY

Yang Feng (Columbia University) Linear Algebra Review 42 / 46


Analysis of Variance Results

Also,

SSE = Σei² = Σ(Yi − Ŷi)²

can be represented as

SSE = e′e = Y′(I − H)′(I − H)Y = Y′(I − H)Y

Notice that H1 = 1, so (I − H)J = 0.

Finally, by similar reasoning,

SSR = ([H − (1/n)J]Y)′([H − (1/n)J]Y) = Y′[H − (1/n)J]Y

It is easy to check that

SSTO = SSE + SSR

Yang Feng (Columbia University) Linear Algebra Review 43 / 46


Sums of Squares as Quadratic Forms

When n = 2, an example of a quadratic form:

5Y1² + 6Y1Y2 + 4Y2²

can be expressed in matrix terms as

( Y1 Y2 ) ( 5 3     ( Y1
            3 4 )     Y2 )  =  Y′AY

In general, a quadratic form is defined as

Y′AY = Σi Σj aij Yi Yj, where aij = aji

Here, A is a symmetric n × n matrix, the matrix of the quadratic form.

Yang Feng (Columbia University) Linear Algebra Review 44 / 46


Quadratic forms for ANOVA

SSTO = Y′[I − (1/n)J]Y

SSE = Y′[I − H]Y

SSR = Y′[H − (1/n)J]Y
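These identities can be checked numerically in R (continuing with y and H from the sketches above; J is the all-ones matrix):

n <- length(y)
J <- matrix(1, n, n)
SSTO <- t(y) %*% (diag(n) - J / n) %*% y
SSE  <- t(y) %*% (diag(n) - H) %*% y
SSR  <- t(y) %*% (H - J / n) %*% y
all.equal(c(SSTO), c(SSE + SSR))  # TRUE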

Yang Feng (Columbia University) Linear Algebra Review 45 / 46


Inference in Regression Analysis

Regression Coefficients: The variance-covariance matrix of b is

σ²{b} = σ²(X′X)⁻¹

Mean Response: To estimate the mean response at Xh, define

Xh = ( 1
       Xh )

Then

Ŷh = Xh′b

And the variance-covariance matrix of Ŷh is

σ²{Ŷh} = Xh′ σ²{b} Xh = σ² Xh′(X′X)⁻¹Xh

Prediction of a New Observation:

s²{pred} = MSE (1 + Xh′(X′X)⁻¹Xh)

Yang Feng (Columbia University) Linear Algebra Review 46 / 46


Multiple Regression

Yang Feng

Yang Feng (Columbia University) Multiple Regression 1 / 41


Multiple regression

One of the most widely used tools in statistical analysis

Matrix expressions for multiple regression are the same as for simple linear regression

Yang Feng (Columbia University) Multiple Regression 2 / 41


Need for Several Predictor Variables

Often the response is best understood as being a function of multiple input quantities

Examples

Spam filtering: regress the probability of an email being a spam message against thousands of input variables
Revenue prediction: regress the revenue of a company against many factors

Yang Feng (Columbia University) Multiple Regression 3 / 41


First-Order with Two Predictor Variables

When there are two predictor variables X1 and X2, the regression model

Yi = β0 + β1Xi1 + β2Xi2 + εi

is called a first-order model with two predictor variables.

A first-order model is linear in the predictor variables.

Xi1 and Xi2 are the values of the two predictor variables in the i-th trial.

Yang Feng (Columbia University) Multiple Regression 4 / 41


Functional Form of Regression Surface

Assuming the noise is zero in expectation,

E(Y) = β0 + β1X1 + β2X2

The form of this regression function is a plane

e.g. E(Y) = 10 + 2X1 + 5X2

Yang Feng (Columbia University) Multiple Regression 5 / 41


Example

Yang Feng (Columbia University) Multiple Regression 6 / 41


Meaning of Regression Coefficients

β0 is the intercept when both X1 and X2 are zero;

β1 indicates the change in the mean response E(Y) per unit increase in X1 when X2 is held constant

β2: vice versa

Example: fix X2 = 2

E(Y) = 10 + 2X1 + 5(2) = 20 + 2X1    (X2 = 2)

The intercept changes, but the relation is clearly linear

In other words, all one-dimensional restrictions of the regression surface are lines.

Yang Feng (Columbia University) Multiple Regression 7 / 41


Terminology

1 When the effect of X1 on the mean response does not depend on the level of X2 (and vice versa), the two predictor variables are said to have additive effects or not to interact.

2 The parameters β1 and β2 are sometimes called partial regression coefficients. They represent the partial effect of one predictor variable when the other predictor variable is included in the model and is held constant.

Yang Feng (Columbia University) Multiple Regression 8 / 41


Comments

1 A planar response surface may not always be appropriate, but even when not, it is often a good approximate descriptor of the regression function in “local” regions of the input space

2 The meaning of the parameters can be determined by taking partial derivatives of the regression function w.r.t. each.

Yang Feng (Columbia University) Multiple Regression 9 / 41


First order model with > 2 predictor variables

Let there be p − 1 predictor variables, then

Yi = β0 + β1Xi1 + β2Xi2 + ... + βp−1Xi,p−1 + εi

which can also be written as

Yi = β0 + Σ_{k=1}^{p−1} βk Xik + εi

and, if we set Xi0 = 1, it can also be written as

Yi = Σ_{k=0}^{p−1} βk Xik + εi

where Xi0 = 1

Yang Feng (Columbia University) Multiple Regression 10 / 41


Geometry of response surface

In this setting the response surface is a hyperplane

This is difficult to visualize but the same intuitions hold

Fixing all but one input variable, each βk tells how much the mean response will grow or decrease per unit change in that one input variable

Yang Feng (Columbia University) Multiple Regression 11 / 41


General Linear Regression Model

We have arrived at the general linear regression model. In general, the X1, ..., Xp−1 variables in the regression model do not have to represent different predictor variables, nor do they all have to be quantitative (continuous).

The general model is

Yi = Σ_{k=0}^{p−1} βk Xik + εi, where Xi0 = 1

with response function (when E(εi) = 0)

E(Y) = β0 + β1X1 + ... + βp−1Xp−1

Yang Feng (Columbia University) Multiple Regression 12 / 41


Qualitative(Discrete) Predictor Variables

Until now we have (implicitly) focused on quantitative (continuous) predictor variables.

Qualitative (discrete) predictor variables often arise in the real world.

Examples:

Patient sex: male/female

College Degree: yes/no

Etc

Yang Feng (Columbia University) Multiple Regression 13 / 41


Example

Regression model to predict the length of hospital stay (Y) based on the age (X1) and gender (X2) of the patient. Define gender as:

X2 = { 1 if the patient is female
       0 if the patient is male

And use the standard first-order regression model

Yi = β0 + β1Xi1 + β2Xi2 + εi

Yang Feng (Columbia University) Multiple Regression 14 / 41


Example cont.

Where Xi1 is patient’s age, and Xi2 is patient’s gender

If X2 = 0, the response function is E(Y) = β0 + β1X1

Otherwise, it is E(Y) = (β0 + β2) + β1X1

which is just a parallel linear response function with a different intercept
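In R this is just lm() with a 0/1 dummy (a sketch; the variable names stay, age and female are ours):

# stay: length of hospital stay, age: numeric, female: 0/1 indicator
fit <- lm(stay ~ age + female)
coef(fit)  # (Intercept) = b0, age = b1, female = b2 (the intercept shift)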

Yang Feng (Columbia University) Multiple Regression 15 / 41


Polynomial Regression

Polynomial regression models are special cases of the general regression model.

They can contain squared and higher-order terms of the predictor variables

The response function becomes curvilinear.

For example, Yi = β0 + β1Xi + β2Xi² + εi

which clearly has the same form as the general regression model.

Yang Feng (Columbia University) Multiple Regression 16 / 41


General Regression

Transformed variables: log Y, 1/Y

Interaction effects: Yi = β0 + β1Xi1 + β2Xi2 + β3Xi1Xi2 + εi

Combinations: Yi = β0 + β1Xi1 + β2Xi1² + β3Xi2 + β4Xi1Xi2 + εi

Key point: all are linear in the parameters!

Yang Feng (Columbia University) Multiple Regression 17 / 41


General Regression Model in Matrix Terms

Y = [ Y1
      ...
      Yn ]  (n × 1)

X = [ 1 X11 X12 ... X1,p−1
      ...
      1 Xn1 Xn2 ... Xn,p−1 ]  (n × p)

β = [ β0
      ...
      βp−1 ]  (p × 1)

ε = [ ε1
      ...
      εn ]  (n × 1)

Yang Feng (Columbia University) Multiple Regression 18 / 41


General Linear Regression in Matrix Terms

Y = Xβ + ε

With E (ε) = 0 and

σ²{ε} = [ σ²  0  ...  0
          0  σ²  ...  0
          ...
          0   0  ...  σ² ]

We have E(Y) = Xβ and σ²{Y} = σ²{ε} = σ²I

Yang Feng (Columbia University) Multiple Regression 19 / 41


Least Square Solution

The matrix normal equations can be derived directly from the minimization of

Q(b) = (Y − Xb)′(Y − Xb)

w.r.t. b

Key result:

∂(Xb)/∂b = X. (1)

Yang Feng (Columbia University) Multiple Regression 20 / 41


Least Square Solution

We can solve this equation

X′Xb = X′Y

(if the inverse of X′X exists) by the following

(X′X)⁻¹X′Xb = (X′X)⁻¹X′Y

and since (X′X)⁻¹X′X = I

we have b = (X′X)⁻¹X′Y

Yang Feng (Columbia University) Multiple Regression 21 / 41


Fitted Values and Residuals

Let the vector of fitted values be

Ŷ = [ Ŷ1
      Ŷ2
      ...
      Ŷn ]

In matrix notation we then have Ŷ = Xb

Yang Feng (Columbia University) Multiple Regression 22 / 41


Hat Matrix: Puts the Hat on Y

We can also directly express the fitted values in terms of the X and Y matrices

Ŷ = X(X′X)⁻¹X′Y

and we can further define H, the “hat matrix”

Ŷ = HY, where H = X(X′X)⁻¹X′

The hat matrix plays an important role in diagnostics for regression analysis.

Yang Feng (Columbia University) Multiple Regression 23 / 41


Hat Matrix Properties

1. the hat matrix is symmetric
2. the hat matrix is idempotent, i.e. HH = H

Important idempotent matrix property

For a symmetric and idempotent matrix A, rank(A) = trace(A), the number of non-zero eigenvalues of A.

Yang Feng (Columbia University) Multiple Regression 24 / 41


Residuals

The residuals, like the fitted values Ŷ, can be expressed as linear combinations of the response variable observations Yi

e = Y − Y = Y −HY = (I−H)Y

also, remember

e = Y − Y = Y − Xb

these are equivalent.

Yang Feng (Columbia University) Multiple Regression 25 / 41


Covariance of Residuals

Starting with e = (I − H)Y,

we see that σ2{e} = (I − H)σ2{Y}(I − H)′,

but σ2{Y} = σ2{ε} = σ2 I,

which means that

σ2{e} = σ2(I − H)I(I − H) = σ2(I − H)(I − H),

and since I − H is idempotent (check), we have σ2{e} = σ2(I − H).


Quadratic Forms

In general, a quadratic form is defined by

In general, a quadratic form is defined by

Y′AY = ∑i ∑j aij Yi Yj,  where aij = aji,

with A the matrix of the quadratic form.

The ANOVA sums SSTO, SSE and SSR can all be arranged into quadratic forms:

SSTO = Y′(I − (1/n)J)Y
SSE = Y′(I − H)Y
SSR = Y′(H − (1/n)J)Y
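A quick numerical check of these identities (a sketch with simulated data; J is the n × n matrix of ones):

set.seed(1)
n <- 50
X1 <- rnorm(n); X2 <- rnorm(n)
Y  <- 2 + X1 - X2 + rnorm(n)
X  <- cbind(1, X1, X2)
H  <- X %*% solve(t(X) %*% X) %*% t(X)
I  <- diag(n); J <- matrix(1, n, n)
SSTO <- drop(t(Y) %*% (I - J/n) %*% Y)  # = sum((Y - mean(Y))^2)
SSE  <- drop(t(Y) %*% (I - H) %*% Y)    # = sum(resid(lm(Y ~ X1 + X2))^2)
SSR  <- drop(t(Y) %*% (H - J/n) %*% Y)
all.equal(SSTO, SSE + SSR)              # TRUE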


Quadratic Forms

Cochran’s Theorem

Let X1, X2, ..., Xn be independent, N(0, σ2)-distributed random variables, and suppose that

∑i Xi² = Q1 + Q2 + ... + Qk,

where Q1, Q2, ..., Qk are nonnegative-definite quadratic forms in the random variables X1, X2, ..., Xn with rank(Ai) = ri, i = 1, 2, ..., k; namely,

Qi = X′AiX, i = 1, 2, ..., k.

If r1 + r2 + ... + rk = n, then

1. Q1, Q2, ..., Qk are independent; and
2. Qi ∼ σ2 χ2(ri), i = 1, 2, ..., k.


ANOVA table for multiple linear regression

Source of Variation   SS                    df      MS                   E{MS}
Regression            SSR                   p − 1   MSR = SSR/(p − 1)    ≥ σ2
Error                 SSE                   n − p   MSE = SSE/(n − p)    σ2
Total                 SSTO = ∑(Yi − Ȳ)2     n − 1


F-test for regression

H0 : β1 = β2 = · · · = βp−1 = 0

Ha: not all βk (k = 1, ..., p − 1) equal zero

Test statistic:

F∗ = MSR/MSE

Decision Rule:

if F ∗ ≤ F (1− α; p − 1, n − p), conclude H0

if F ∗ > F (1− α; p − 1, n − p), conclude Ha
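In R, the overall F statistic is reported by summary() on a fitted model; a minimal sketch with simulated data:

set.seed(1)
n <- 50
X1 <- rnorm(n); X2 <- rnorm(n)
Y  <- 2 + X1 + rnorm(n)
fit <- lm(Y ~ X1 + X2)
summary(fit)$fstatistic  # F* = MSR/MSE with df (p - 1, n - p) = (2, 47)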


R2 and adjusted R2

The coefficient of multiple determination R2 is defined as:

R2 = SSR/SSTO = 1 − SSE/SSTO

0 ≤ R2 ≤ 1

R2 never decreases when more variables are added to the model.

Therefore, the adjusted R2:

R2a = 1 − (SSE/(n − p)) / (SSTO/(n − 1)) = 1 − ((n − 1)/(n − p)) (SSE/SSTO)

R2a may decrease as p increases.

Coefficient of multiple correlation:

R = √R2

always taken as the positive square root.


Inferences about parameters

We have b = (X′X)−1X′Y.

Since σ2{Y} = σ2 I we can write

σ2{b} = (X′X)−1X′ σ2I X(X′X)−1
      = σ2 (X′X)−1X′X(X′X)−1
      = σ2 (X′X)−1 I
      = σ2 (X′X)−1

Also

E(b) = E((X′X)−1X′Y) = (X′X)−1X′ E(Y) = (X′X)−1X′Xβ = β


Inferences

The estimated variance-covariance matrix

s2{b} = MSE (X′X)−1

Then, we have

(bk − βk)/s{bk} ∼ t(n − p),  k = 0, 1, ..., p − 1

1− α confidence intervals:

bk ± t(1− α/2; n − p)s{bk}
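A sketch (simulated data) checking that s2{b} = MSE (X′X)−1 matches vcov() and that the t intervals match confint():

set.seed(2)
n <- 40
X1 <- rnorm(n); X2 <- rnorm(n)
Y  <- 1 + 2 * X1 - X2 + rnorm(n)
fit <- lm(Y ~ X1 + X2)
X   <- model.matrix(fit)
MSE <- sum(resid(fit)^2) / (n - 3)             # p = 3 parameters
all.equal(MSE * solve(t(X) %*% X), vcov(fit))  # TRUE
confint(fit, level = 0.95)                     # bk +/- t(1 - alpha/2; n - p) s{bk}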


t test

Tests for βk :

H0: βk = 0 vs. Ha: βk ≠ 0

Test statistic:

t∗ = bk / s{bk}

Decision Rule:

|t∗| ≤ t(1− α/2; n − p); conclude H0

Otherwise, conclude Ha


Joint Inferences

Bonferroni joint confidence intervals for g parameters: the confidence limits with family confidence coefficient 1 − α are

bk ± B s{bk},

where B = t(1 − α/(2g); n − p).


Interval Estimate of E{Yh}

We want to estimate the response at Xh = (1, Xh1, ..., Xh,p−1)′.

Estimator: Ŷh = X′h b

Expectation: E{Ŷh} = X′h β = E{Yh}

Variance: σ2{Ŷh} = σ2 X′h(X′X)−1Xh

Estimated variance: s2{Ŷh} = MSE (X′h(X′X)−1Xh)

1 − α confidence limits for E{Yh}:

Ŷh ± t(1 − α/2; n − p) s{Ŷh}


Confidence Region for Regression Surface and Prediction of New Observations

Working-Hotelling confidence band:

Ŷh ± W s{Ŷh}

where W2 = pF(1 − α; p, n − p).

Prediction of a new observation Yh(new):

Ŷh ± t(1 − α/2; n − p) s{pred}

where s2{pred} = MSE + s2{Ŷh} = MSE(1 + X′h(X′X)−1Xh).
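In R, predict() produces both kinds of limits directly; a minimal sketch with simulated data:

set.seed(2)
n <- 40
X1 <- rnorm(n); X2 <- rnorm(n)
Y  <- 1 + 2 * X1 - X2 + rnorm(n)
fit <- lm(Y ~ X1 + X2)
Xh  <- data.frame(X1 = 0.5, X2 = -0.2)
predict(fit, Xh, interval = "confidence")  # limits for E{Yh}
predict(fit, Xh, interval = "prediction")  # wider: uses MSE(1 + Xh'(X'X)^{-1}Xh)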


Diagnostics and Remedial Measures

Very similar to simple linear regression.

We only mention the differences.


Scatter Plot Matrix

It is a matrix of scatter plots. R code: pairs(data)


Correlation Matrix

Corresponds to the scatter plot matrix

R code: cor(data)

            Population  Income  Sales
Population    1.00       0.78    0.94
Income        0.78       1.00    0.84
Sales         0.94       0.84    1.00


Other Diagnostics and Remedial Measures (Read after class)

Residual Plots.

Against time (or some other sequence) for error dependency.
Against each X variable for potential nonlinear relationships and nonconstancy of error variance.
Against omitted variables (including interaction terms). More on interaction terms in the next chapter.

Correlation Test for Normality (Same, since it is on the residuals)

Brown-Forsythe Test for Constancy of Error Variance (Need to find a way to divide the X space)

Breusch-Pagan Test for Constancy of Error Variance (Same)

F Test for Lack of Fit (Need to have (near) replicates on all dimensions of X)

Box-Cox Transformations (Same, since it is on Y )

Multiple Regression (II)

Yang Feng


Special Topics for Multiple Regression

Extra Sums of Squares

Standardized Version of the Multiple Regression Model

Multicollinearity


Extra Sums of Squares

A topic unique to multiple regression

An extra sum of squares measures the marginal decrease in the error sum of squares when one or several predictor variables are added to the regression model, given that other variables are already in the model.

Equivalently, one can view the extra sum of squares as measuring the marginal increase in the regression sum of squares.


Example

Multiple regression
– Output: body fat percentage (Y)
– Inputs:
1. triceps skinfold thickness (X1)
2. thigh circumference (X2)
3. midarm circumference (X3)

Aim: replace the cumbersome and expensive immersion-in-water procedure with a model.

Goal: determine which predictor variables provide a good model.


The Data


Regression of Y on X1


Regression of Y on X2


Regression of Y on X1 and X2


Regression of Y on X1, X2 and X3.


Notation

SSR with X1 only in the model is denoted by SSR(X1) = 352.27

SSE with X1 only in the model is denoted by SSE(X1) = 143.12

Accordingly,

SSR(X1,X2) = 385.44
SSE(X1,X2) = 109.95


More Powerful Model, Smaller SSE

When X1 and X2 are in the model, SSE(X1,X2) = 109.95 is smaller than when the model contains only X1.

The difference is called an extra sum of squares and will be denoted by SSR(X2|X1) = SSE(X1) − SSE(X1,X2) = 33.17.

The extra sum of squares SSR(X2|X1) measures the marginal effect of adding X2 to the regression model when X1 is already in the model.


SSR increase and SSE decrease

The extra sum of squares SSR(X2|X1) can equivalently be viewed as the marginal increase in the regression sum of squares:

SSR(X2|X1) = SSR(X1,X2)− SSR(X1)

= 385.44− 352.27

= 33.17


Why does this relationship exist?

Remember SSTO = SSR + SSE

SSTO measures only the variability of the Y’s and does not depend on the regression model fitted.

Any increase in SSR must be accompanied by a corresponding decrease in the SSE.


Example relations

SSR(X3|X1,X2) = SSE (X1,X2)− SSE (X1,X2,X3)

= SSR(X1,X2,X3)− SSR(X1,X2)

= 11.54

or, with multiple variables included at a time,

SSR(X2,X3|X1) = SSE (X1)− SSE (X1,X2,X3)

= SSR(X1,X2,X3)− SSR(X1)

= 44.71


Definitions

Definition: SSR(X1|X2) = SSE(X2) − SSE(X1,X2)

Equivalently: SSR(X1|X2) = SSR(X1,X2) − SSR(X2)

We can switch the order of X1 and X2 in these expressions.

We can easily generalize these definitions for more than two variables:
SSR(X3|X1,X2) = SSE(X1,X2) − SSE(X1,X2,X3)
SSR(X3|X1,X2) = SSR(X1,X2,X3) − SSR(X1,X2)
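In R, anova() reports exactly these quantities; a sketch with simulated data:

set.seed(3)
n <- 20
X1 <- rnorm(n); X2 <- rnorm(n); X3 <- rnorm(n)
Y  <- 1 + X1 + 0.5 * X2 + rnorm(n)
full    <- lm(Y ~ X1 + X2 + X3)
reduced <- lm(Y ~ X1 + X2)
anova(full)           # sequential SS: SSR(X1), SSR(X2|X1), SSR(X3|X1,X2)
anova(reduced, full)  # F-test based on SSR(X3|X1,X2)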


N! different partitions


ANOVA Table

Various software packages can provide extra sums of squares for regression analysis. These are usually provided in the order in which the input variables are supplied to the system.


Importance

Extra sums of squares are of interest because they occur in a variety of tests about regression coefficients where the question of concern is whether certain X variables can be dropped from the regression model.


Test whether a single βk = 0

Does Xk provide a statistically significant improvement to the regression model fit?

We can use the general linear test approach

Example: First order model with three predictor variables

Yi = β0 + β1Xi1 + β2Xi2 + β3Xi3 + ϵi

We want to answer the following hypothesis test

H0: β3 = 0 vs. Ha: β3 ≠ 0


Test whether a single βk = 0

For the full model we have SSE (F ) = SSE (X1,X2,X3)

The reduced model is Yi = β0 + β1Xi1 + β2Xi2 + ϵi

And for this model we have SSE (R) = SSE (X1,X2)

where there are dfR = n − 3 degrees of freedom associated with the reduced model.


Test whether a single βk = 0

The general linear test statistic is

F∗ = { [SSE(R) − SSE(F)] / (dfR − dfF) } / { SSE(F) / dfF }

which becomes

F∗ = { [SSE(X1,X2) − SSE(X1,X2,X3)] / [(n − 3) − (n − 4)] } / { SSE(X1,X2,X3) / (n − 4) }

but SSE (X1,X2)− SSE (X1,X2,X3) = SSR(X3|X1,X2)


Test whether a single βk = 0

The general linear test statistic is

F∗ = { SSR(X3|X1,X2) / 1 } / { SSE(X1,X2,X3) / (n − 4) } = MSR(X3|X1,X2) / MSE(X1,X2,X3)

The extra sum of squares has one associated degree of freedom.


Example

Body fat: Can X3 (midarm circumference) be dropped from the model?


F∗ = { SSR(X3|X1,X2) / 1 } / { SSE(X1,X2,X3) / (n − 4) } = 1.88


Example Cont.

For α = .01 we require F (.99; 1, 16) = 8.53

We observe F ∗ = 1.88

We conclude H0 : β3 = 0


Test whether several βk = 0

Another example:

H0: β2 = β3 = 0
Ha: not both β2 and β3 are zero

The general linear test can be used again:

F∗ = { [SSE(X1) − SSE(X1,X2,X3)] / [(n − 2) − (n − 4)] } / { SSE(X1,X2,X3) / (n − 4) }

But SSE(X1) − SSE(X1,X2,X3) = SSR(X2,X3|X1), so the expression can be simplified.


Tests concerning regression coefficients

The general linear test can be used to determine whether or not a predictor variable (or set of variables) should be included in the model.

The ANOVA SSE’s can be used to compute the F∗ test statistics.

Some more general tests require fitting the model more than once, unlike the examples given.


Summary of Tests Concerning Regression Coefficients

Test whether all βk = 0

Test whether a single βk = 0

Test whether some βk = 0

Test involving relationships among coefficients, for example,

H0: β1 = β2 vs. Ha: β1 ≠ β2

H0: β1 = 3, β2 = 5 vs. Ha: otherwise

Key point in all tests: form the full model and the reduced model, then calculate SSE(F) and SSE(R).


Coefficients of Partial Determination

Recall the “coefficient of determination”: R2 measures the proportionate reduction in the variation of Y by the introduction of the entire set of X variables.

Partial determination: measures the marginal contribution of one X variable when all others are already in the model.


Two predictor variables

Yi = β0 + β1Xi1 + β2Xi2 + ϵi

The coefficient of partial determination between Y and X1, given X2 in the model, is denoted R2Y1|2:

R2Y1|2 = [SSE(X2) − SSE(X1,X2)] / SSE(X2) = SSR(X1|X2) / SSE(X2)

Likewise:

R2Y2|1 = [SSE(X1) − SSE(X1,X2)] / SSE(X1) = SSR(X2|X1) / SSE(X1)


General case

R2Y1|23 = SSR(X1|X2,X3) / SSE(X2,X3)

R2Y4|123 = SSR(X4|X1,X2,X3) / SSE(X1,X2,X3)


Properties

Value between 0 and 1

Another way of getting R2Y1|2:

1. Regress Y on X2 and obtain the residuals ei(Y|X2) = Yi − Ŷi(X2).
2. Regress X1 on X2 and obtain the residuals ei(X1|X2) = Xi1 − X̂i1(X2).
3. The R2 between ei(Y|X2) and ei(X1|X2) will be the same as R2Y1|2.

Follow-up: the scatter plot of ei(Y|X2) against ei(X1|X2) provides a graphical representation of the strength of the relationship between Y and X1, adjusted for X2. These are called “added variable plots” or “partial regression plots”. More in Chapter 10.1.
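A sketch of this residual-on-residual construction in R (simulated data; names hypothetical):

set.seed(4)
n  <- 30
X2 <- rnorm(n)
X1 <- 0.6 * X2 + rnorm(n)
Y  <- 1 + X1 + X2 + rnorm(n)
eY <- resid(lm(Y ~ X2))   # e(Y | X2)
e1 <- resid(lm(X1 ~ X2))  # e(X1 | X2)
cor(eY, e1)^2             # equals R2_{Y1|2} = SSR(X1|X2)/SSE(X2)
plot(e1, eY)              # added-variable (partial regression) plot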


Coefficients of Partial Correlation

Coefficient of partial correlation: the square root of a coefficient of partial determination, taking the same sign as the corresponding regression coefficient.


Standardized Multiple Regression

Numerical precision errors can occur when
– (X′X)−1 is poorly conditioned (near singular): collinearity
– the predictor variables have substantially different magnitudes

Solutions:
– Regularization
– Standardized multiple regression

First, transformed variables


Correlation Transformation

Makes all entries in the X′X matrix for the transformed variables fall between −1 and 1 inclusive.

Lack of comparability of regression coefficients: Y = 200 + 20000X1 + .2X2

Y in dollars, X1 in thousands of dollars, X2 in cents. Which is the most important predictor?

Increase X1 by 1,000 dollars → Y increases by 20,000 dollars; increase X2 by 1,000 dollars → Y also increases by 20,000 dollars.


Correlation Transformation

Centering and scaling

(Yi − Ȳ)/sy,   (Xik − X̄k)/sk,   k = 1, ..., p − 1

where

sy = √[ ∑(Yi − Ȳ)2 / (n − 1) ]

sk = √[ ∑(Xik − X̄k)2 / (n − 1) ],   k = 1, ..., p − 1


Correlation Transformation

Transformed variables

Y∗i = (1/√(n − 1)) ((Yi − Ȳ)/sy)

X∗ik = (1/√(n − 1)) ((Xik − X̄k)/sk),   k = 1, ..., p − 1


Standardized Regression Model

Define the matrix consisting of the transformed X variables

X∗ =
[ X∗11  ...  X∗1,p−1 ]
[ X∗21  ...  X∗2,p−1 ]
[  ...         ...   ]
[ X∗n1  ...  X∗n,p−1 ]

And define (X∗)′X∗ = rXX.


Correlation matrix of the X variables

Can show that

rXX =
[ 1       r12      ...  r1,p−1 ]
[ r21     1        ...  r2,p−1 ]
[ ...                    ...   ]
[ rp−1,1  rp−1,2   ...  1      ]

where each entry is just the coefficient of simple correlation between Xi and Xj:

∑ x∗i1 x∗i2 = ∑ [(Xi1 − X̄1)/(√(n − 1) s1)] [(Xi2 − X̄2)/(√(n − 1) s2)]
            = (1/(n − 1)) ∑ (Xi1 − X̄1)(Xi2 − X̄2) / (s1 s2)
            = ∑ (Xi1 − X̄1)(Xi2 − X̄2) / [∑(Xi1 − X̄1)2 ∑(Xi2 − X̄2)2]^(1/2)
            = r12


Standardized Regression Model

The regression model using the transformed variables:

Y∗i = β∗1 X∗i1 + · · · + β∗p−1 X∗i,p−1 + ϵ∗i

Notice that there is no need for an intercept.

If we define in a similar way (X∗)′Y∗ = rXY, where rXY contains the coefficients of simple correlation between each Xj and Y,

then we can set up a standard linear regression problem:

rXX b∗ = rXY


Standardized Regression Model

The solution

b∗ = (b∗1, b∗2, ..., b∗p−1)′

can be related to the solution of the untransformed regression problem through the relationship

bk = (sy/sk) b∗k,   k = 1, ..., p − 1

b0 = Ȳ − b1X̄1 − ... − bp−1X̄p−1
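A sketch of the correlation transformation and back-transformation in R (simulated data; ct() is a hypothetical helper implementing the transform above):

set.seed(5)
n  <- 25
X1 <- rnorm(n, 100, 10)
X2 <- rnorm(n, 5, 1)
Y  <- 3 + 0.2 * X1 + 4 * X2 + rnorm(n)
ct <- function(v) (v - mean(v)) / (sqrt(length(v) - 1) * sd(v))
bstar <- coef(lm(ct(Y) ~ ct(X1) + ct(X2) - 1))  # no intercept needed
bstar * sd(Y) / c(sd(X1), sd(X2))               # bk = (sy/sk) b*k
coef(lm(Y ~ X1 + X2))[-1]                       # matches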


Multicollinearity

When the predictor variables are correlated among themselves,intercorrelation or multicollinearity among them is said to exist.

Uncorrelated Predictor Variables

Perfectly Correlated Predictor Variables

Effects of Multicollinearity


Uncorrelated Predictor Variables

Suppose we have the following three regressions:

Regress Y on X1. (Estimator b1)

Regress Y on X2. (Estimator b2)

Regress Y on both X1 and X2 (Estimator b∗1 and b∗2)

If X1 and X2 are uncorrelated, then we have:
1. b1 = b∗1, b2 = b∗2
2. SSR(X1|X2) = SSR(X1), SSR(X2|X1) = SSR(X2)


Perfectly Correlated Predictor Variables

Regress Y on both X1 and X2. If X1 and X2 are perfectly correlated (say X2 = 5 + .5X1), then:

We have infinitely many possible solutions which fit the data equally well (they have the same SSE).

The perfect relation between X1 and X2 does not inhibit our ability to obtain a good fit.

The magnitudes of the regression coefficients cannot be interpreted as reflecting the effects of the different predictor variables.


General Effects of Multicollinearity

Usually, we still have a good fit of the data; in addition, we still have good prediction.

The estimated regression coefficients tend to have large sampling variability when the predictor variables are highly correlated. Some of the regression coefficients may not be statistically significant even though a definite statistical relation exists.

The common interpretation of a regression coefficient is NOT fully applicable any more.

Regress Y on both X1 and X2. It is possible that when individual t-tests are performed, neither β1 nor β2 is significant. However, when the F-test is performed for both β1 and β2, the result may still be significant.
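A minimal simulation sketch of this last point:

set.seed(6)
n  <- 30
X1 <- rnorm(n)
X2 <- X1 + rnorm(n, sd = 0.05)     # nearly perfectly correlated with X1
Y  <- 1 + X1 + X2 + rnorm(n)
summary(lm(Y ~ X1 + X2))           # inflated SEs; t-tests may be insignificant
anova(lm(Y ~ 1), lm(Y ~ X1 + X2))  # overall F-test is highly significant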


Regression Models for Quantitative and Qualitative Predictors

Yang Feng


Two types of predictors

Quantitative. (e.g., multiple linear regression)

Qualitative. (e.g., indicator variables)

This Class:

Polynomial Regression Models

Interaction Regression Models


Polynomial Regression Models

When the true curvilinear response function is indeed a polynomial function.

When a polynomial function is a good approximation to the true function.


One Predictor Variable-Second Order

Yi = β0 + β1xi + β11xi² + ϵi

where xi = Xi − X̄.

X is centered due to the possible high correlation between X and X².

Regression function: E{Y} = β0 + β1x + β11x², a quadratic response function.

β0 is the mean response when x = 0, i.e., X = X̄.

β1 is called the linear effect.

β11 is called the quadratic effect.


One Predictor Variable-Third Order

Yi = β0 + β1xi + β11xi² + β111xi³ + ϵi

where xi = Xi − X̄.


One Predictor Variable-Higher Orders

Employed with special caution.

Tends to overfit

Poor prediction


Two Predictors-Second Order

Yi = β0 + β1xi1 + β2xi2 + β11xi1² + β22xi2² + β12xi1xi2 + ϵi

where xi1 = Xi1 − X̄1, xi2 = Xi2 − X̄2.

The coefficient β12 is called the interaction effect coefficient.

More on interaction later.

Three Predictors- Second Order is similar.


Implementation of Polynomial Regression Models

Fitting: very easy; just use least squares for multiple linear regression, since polynomial models can all be seen as multiple regressions.

Determining the order: a very important step!

Yi = β0 + β1xi + β11xi² + β111xi³ + ϵi

Naturally, we want to test whether or not β111 = 0, or whether or not both β11 = 0 and β111 = 0. How to do the test?


Extra Sum of Squares

Decompose SSR into SSR(x), SSR(x²|x) and SSR(x³|x, x²).
Test whether β111 = 0: use SSR(x³|x, x²).
Test whether both β11 = 0 and β111 = 0: use SSR(x², x³|x).

Time for a real example!
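In R, anova() on a single fit gives exactly this sequential decomposition; a sketch with simulated data:

set.seed(7)
x <- rnorm(40)
x <- x - mean(x)                    # centered predictor
y <- 1 + 2 * x + 0.5 * x^2 + rnorm(40)
anova(lm(y ~ x + I(x^2) + I(x^3)))  # rows: SSR(x), SSR(x^2|x), SSR(x^3|x,x^2)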


Further Comments on Polynomial Regression

There are drawbacks:

1. Sometimes polynomial models are more expensive in degrees of freedom than alternative nonlinear models or linear models with transformed variables.

2. Serious multicollinearity may be present even when the variables are centered.

An alternative to using centered variables is to use orthogonal polynomials, as sketched below.
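In R, poly() constructs orthogonal polynomial terms (a sketch with simulated data):

set.seed(7)
x <- rnorm(40)
y <- 1 + 2 * x + 0.5 * x^2 + rnorm(40)
P <- poly(x, 3)          # orthogonal polynomial terms of degrees 1, 2, 3
round(crossprod(P), 10)  # off-diagonals are 0: the columns are orthogonal
summary(lm(y ~ P))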


Interaction Regression Models

Additive effects:

E{Y } = f1(X1) + f2(X2) + · · ·+ fp−1(Xp−1)

General effects with interactions. Example:

E{Y } = β0 + β1X1 + β2X2 + β3X1X2

This cross-product term β3X1X2 is called an interaction term.


Interpretation of Regression Models with Interactions

E{Y } = β0 + β1X1 + β2X2 + β3X1X2

The change in mean response with a unit increase in X1 when X2 is held constant is

β1 + β3X2

Similarly, a unit increase in X2 when X1 is constant:

β2 + β3X1


First type of interaction

First, suppose β1 and β2 are positive.

Reinforcement (synergistic) type: β3 > 0

E{Y } = 10 + 2X1 + 5X2 + .5X1X2

Conditional Effects Plot:


Second type of interaction

Interference (antagonistic) type: β3 < 0

E{Y } = 10 + 2X1 + 5X2 − .5X1X2


Implementation of Interaction Regression Models

Center the predictor variables to avoid high multicollinearity:

xik = Xik − X̄k

Use prior knowledge to reduce the number of interactions. If we have 8 predictors, then we have 28 pairwise terms in total. For p predictors, the number is p(p − 1)/2.

Now a real example...


Qualitative Predictors

Examples:

Gender (male or female)

Purchase status (yes or no)

Disability status (not disabled, partly disabled, fully disabled)


A study of innovation in the insurance industry

Objective: relate the speed with which a particular insurance innovation is adopted (Y) to the size of the insurance firm (X1) and the type of the firm.

Response Y: quantitative, continuous

Predictor X1: quantitative

Second predictor: type of firm, stock companies and mutual companies.


Qualitative Predictor with Two Classes

Suppose

X2 = 1 if stock company; 0 otherwise.

X3 = 1 if mutual company; 0 otherwise.

Then, we have the model

Yi = β0 + β1Xi1 + β2Xi2 + β3Xi3 + ϵi


Design Matrix

Suppose we have n = 4 observations, the first two being stock firms and the second two mutual firms. Then

X =
[ 1  X11  1  0 ]
[ 1  X21  1  0 ]
[ 1  X31  0  1 ]
[ 1  X41  0  1 ]

Observation: the first column is equal to the sum of the X2 and X3 columns, so the columns are linearly dependent.

Solution: a qualitative variable with c classes will be represented by c − 1 indicator variables, each taking on the values 0 and 1.


Interpretation

Now we drop X3 from the regression model:

Yi = β0 + β1Xi1 + β2Xi2 + ϵi

where X1 = size of the firm and

X2 = 1 if stock company; 0 otherwise.
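R's factor coding does this automatically: a factor with c levels enters lm() as c − 1 indicator columns. A tiny sketch with hypothetical data:

type <- factor(c("stock", "stock", "mutual", "mutual"))
size <- c(150, 300, 175, 250)
model.matrix(~ size + type)  # one "typestock" indicator column, not two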


Interpretation (Cont.)


More than Two Classes

Regression of tool wear (Y) on tool speed (X1) and tool model (four classes M1, M2, M3, M4).

4 classes → 3 indicator variables

Define

X2 = 1 if tool model M1; 0 otherwise.

X3 = 1 if tool model M2; 0 otherwise.

X4 = 1 if tool model M3; 0 otherwise.

Then, we have the following first-order regression model:

Yi = β0 + β1Xi1 + β2Xi2 + β3Xi3 + β4Xi4 + ϵi


Interpretation


Some Considerations in Using Indicator Variables

An alternative: allocated codes.

For example, the predictor variable “frequency of product use” has three classes: frequent user, occasional user, nonuser. We can use a single X1 variable to denote it as follows:

X1 = 3 for a frequent user; 2 for an occasional user; 1 for a nonuser.

Then, we have the regression model:

Yi = β0 + β1Xi1 + ϵi


Difficulties with allocated codes

The mean responses under this regression function will be:

Class            E{Y}
Frequent user    β0 + 3β1
Occasional user  β0 + 2β1
Nonuser          β0 + β1

Key implication:

E{Y | frequent user} − E{Y | occasional user} = E{Y | occasional user} − E{Y | nonuser}

Using indicator variables doesn’t impose this restriction, since one more variable is available to denote the classes.


Other Codings for Indicator Variables

For the stock company and mutual company data:

X2 = 1 if stock company; −1 if mutual company.

Another alternative: use an indicator variable for each of the c classes and drop the intercept term:

Yi = β1Xi1 + β2Xi2 + β3Xi3 + ϵi

where X1 = size of the firm,

X2 = 1 if stock company; 0 otherwise.

X3 = 1 if mutual company; 0 otherwise.


Interactions between Quantitative and Qualitative Variables

Almost the same as the regular interactions

Read Chapter 8.5 and 8.6 after class


Comparison of Two or More Regression Functions

Three examples:

A company operates two production lines for making soap bars. For each line, the relationship between the speed of the line and the amount of scrap for the day was studied.

An economist is studying the relationship between the amount of savings and the level of income for middle-income families from urban and rural areas, based on independent samples from the two populations.

Two instruments were constructed for a company to identical specifications to measure pressure in an industrial process.


Soap Production Lines Example

Y: amount of scrap; X1: line speed; X2: code for the production line.

Interaction model:

Yi = β0 + β1Xi1 + β2Xi2 + β3Xi1Xi2 + ϵi

where
Xi1 = line speed,
Xi2 = 1 if production line 1, 0 if production line 2,

i = 1, 2, · · · , 27
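A minimal R sketch of fitting this interaction model, with invented numbers standing in for the real scrap data:

# Hypothetical stand-in for the soap data: scrap (Y), line speed (X1), line id.
soap <- data.frame(
  scrap = c(218, 248, 360, 351, 470, 140, 277, 384),
  speed = c(100, 125, 220, 205, 300, 105, 215, 270),
  line  = c(1, 1, 1, 1, 1, 2, 2, 2)
)
soap$X2 <- as.numeric(soap$line == 1)   # indicator: 1 for line 1, 0 for line 2

# 'speed * X2' expands to speed + X2 + speed:X2, matching
# Y = b0 + b1 X1 + b2 X2 + b3 X1 X2 + error.
fit <- lm(scrap ~ speed * X2, data = soap)
summary(fit)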


Model Selection

Yang Feng


Model-Building Process

Data Collection and preparation

Reduction of explanatory or predictor variables (for exploratoryobservational studies)

Model refinement and selection (This class!)

Model validation


Trouble and Strife

From any set of p − 1 predictors, 2^(p−1) different linear regression models can be constructed.

Consider all binary vectors of length p − 1

Search in that space is exponentially difficult.

Greedy strategies are typically utilized.

Is this the only way?


Selecting between models

In order to select between models, some score must be given to each model.

The likelihood of the data under each model is not sufficient because, as we have seen, the likelihood can always be improved by adding more parameters until the model effectively memorizes the data.

Accordingly, some penalty that is a function of the complexity of the model must be included in the selection procedure.

There are several choices for how to do this

1 Explicit penalization of the number of parameters in the model (AIC, BIC, etc.)

2 Implicit penalization through cross-validation

3 Bayesian regularization (putting a certain prior distribution on each model)


Six Criteria

R²p, R²a,p, Cp, AICp, BICp (SBCp), PRESSp

Denote the total number of variables as P − 1, so P parameters in total. Here, 1 ≤ p ≤ P.

Use the coefficient of multiple determination R²:

R²p = 1 − SSEp / SSTO

Use the adjusted coefficient of multiple determination:

R²a,p = 1 − [(n − 1)/(n − p)] (SSEp / SSTO) = 1 − MSEp / (SSTO / (n − 1))


Mallows’ Cp Criterion

Concerned with the total mean squared error of the n fitted values. Writing

(Ŷi − µi)² = [(E{Ŷi} − µi) + (Ŷi − E{Ŷi})]²

and taking expectations, we have

E(Ŷi − µi)² = (E{Ŷi} − µi)² + σ²{Ŷi}

Criterion measure:

Γp = (1/σ²) [ ∑_{i=1}^{n} (E{Ŷi} − µi)² + ∑_{i=1}^{n} σ²{Ŷi} ]


Cp continued

∑_{i=1}^{n} σ²{Ŷi} = trace(Cov(Xb))            (1)
                   = trace(X σ²(X′X)⁻¹ X′)     (2)
                   = σ² trace((X′X)(X′X)⁻¹)    (3)
                   = p σ²                      (4)


Cp continued

E{SSEp} = ∑_{i=1}^{n} E(Yi − Ŷi)²                                  (5)
        = ∑_{i=1}^{n} E[(Ŷi − E{Ŷi} − εi + E{Ŷi} − µi)²]           (6)
        = ∑_{i=1}^{n} (E{Ŷi} − µi)² + E[(Ŷi − E{Ŷi} − εi)²]        (7)
        = ∑_{i=1}^{n} (E{Ŷi} − µi)² + (n − p)σ²                    (8)

From (7) to (8), refer to the Cp.pdf file.


Cp continued

Therefore, a good estimator of Γp is

Cp = SSEp / MSE(X1, · · · , XP−1) − (n − 2p)

When E{Ŷi} = µi for all i (no bias), we have

E{Cp} ≈ p

Cp for the full model with all P − 1 variables is always P by definition.

We would like to find a model where (1) the Cp value is small and (2) Cp ≈ p.
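A minimal R sketch of this computation, on invented data (the data frame and variable names are illustrative only):

# Mallows' Cp for a candidate subset model.
set.seed(1)
dat  <- data.frame(Y = rnorm(50), X1 = rnorm(50), X2 = rnorm(50), X3 = rnorm(50))
full <- lm(Y ~ ., data = dat)     # full model with all P - 1 predictors
sub  <- lm(Y ~ X1, data = dat)    # candidate subset model with p parameters

sse_p <- sum(resid(sub)^2)                         # SSE_p
mse_F <- sum(resid(full)^2) / full$df.residual     # MSE(X1, ..., X_{P-1})
n <- nrow(dat); p <- length(coef(sub))
sse_p / mse_F - (n - 2 * p)                        # Cp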


AIC and BIC

Akaike's information criterion (AIC) and the Bayesian information criterion (BIC) (also called the Schwarz criterion, SBC, in the book) are two criteria that penalize model complexity.

In the linear regression setting

AICp = n ln SSEp − n ln n + 2p

BICp = n ln SSEp − n ln n + (ln n) p

Roughly you can think of these two criteria as penalizing models withmany parameters (p in the case of linear regression).
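For lm fits these quantities can be computed directly; note that base R's extractAIC() uses the algebraically identical n ln(SSEp/n) + k·p form. A sketch on invented data:

set.seed(1)
dat <- data.frame(Y = rnorm(50), X1 = rnorm(50), X2 = rnorm(50))
fit <- lm(Y ~ X1 + X2, data = dat)
n <- nrow(dat); p <- length(coef(fit)); sse <- sum(resid(fit)^2)

n * log(sse) - n * log(n) + 2 * p         # AIC_p from the slide formula
n * log(sse) - n * log(n) + log(n) * p    # BIC_p
extractAIC(fit)                           # returns (df, AIC_p): same value
extractAIC(fit, k = log(n))               # returns (df, BIC_p)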


Cross Validation

A way of doing model selection that implicitly penalizes models of high complexity is cross-validation.

Fit a model with a large number of parameters to a subset of the data, then predict the rest of the data using the model you just fit. Then:

The average scores on held-out data over different folds can be used to compare models.

If the held-out data is consistently (over the various folds) well explained by the model, then one could conclude that the model is a good model.

If one model performs better on average when predicting held-out data, then there is reason to prefer that model.

Overly complex models will not generalize well and will therefore not be selected.
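A minimal k-fold cross-validation sketch in R (fold count, data, and formulas are invented for illustration):

set.seed(1)
dat   <- data.frame(Y = rnorm(60), X1 = rnorm(60), X2 = rnorm(60))
k     <- 5
folds <- sample(rep(1:k, length.out = nrow(dat)))   # random fold labels

cv_mse <- function(form) {
  errs <- sapply(1:k, function(j) {
    fit  <- lm(form, data = dat[folds != j, ])       # train on k - 1 folds
    pred <- predict(fit, newdata = dat[folds == j, ])
    mean((dat$Y[folds == j] - pred)^2)               # held-out squared error
  })
  mean(errs)
}

cv_mse(Y ~ X1)        # smaller model
cv_mse(Y ~ X1 + X2)   # larger model; prefer the smaller CV MSE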


PRESSp or Leave-One-Out Cross Validation

The PRESSp or “prediction sum of squares” criterion measures how well a subset model can predict the observed responses Yi.

Let Ŷi(i) be the fitted value for case i from a model in which case i was left out during training.

The PRESSp criterion is then given by summing over all n cases:

PRESSp = ∑_{i=1}^{n} (Yi − Ŷi(i))²

PRESSp values can be calculated without doing n separate regression runs.


PRESSp or Leave-One-Out Cross Validation

If we let di be the deleted residual for the i th case

di = Yi − Ŷi(i)

then we can rewrite

di = ei / (1 − hii)

where ei is the ordinary residual for the i th case and hii is the i th diagonal element of the hat matrix.

We can obtain the diagonal element hii of the hat matrix directly from

hii = Xi′(X′X)⁻¹Xi
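In R this identity makes PRESSp a one-liner from a single fit; a sketch on invented data:

set.seed(1)
dat <- data.frame(Y = rnorm(60), X1 = rnorm(60), X2 = rnorm(60))
fit <- lm(Y ~ X1 + X2, data = dat)

e <- resid(fit)        # ordinary residuals e_i
h <- hatvalues(fit)    # hat-matrix diagonals h_ii
sum((e / (1 - h))^2)   # PRESS_p without n separate refits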


PRESSp or Leave-One-Out Cross Validation

PRESSp is useful for choosing between models. Consider

Take two models, one Mp with p-dimensional input and one Mp−1 with (p − 1)-dimensional input.

Calculate the PRESSp criterion for Mp and Mp−1

Whichever model has the lowest PRESSp should be preferred.

Why?

Unfortunately the PRESSp criterion can't tell us which variables to include. For that, a general search is still required.

More general cross-validation procedures can be designed.

Note that the PRESSp criterion is very similar to log-likelihood.


Detecting Outliers Via Studentized Deleted Residuals

The studentized deleted residual, denoted by ti, is

ti = di / s{di}

where

s²{di} = MSE(i) / (1 − hii)

and

di / s{di} ∼ t(n − p − 1)

Fortunately, again, ti can be calculated without fitting n different regression models. It can be shown that

ti = ei [ (n − p − 1) / (SSE(1 − hii) − ei²) ]^(1/2)

These ti's can be used to formally test (e.g. using a Bonferroni test procedure) whether the largest absolute studentized deleted residual is an outlier.
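A sketch of this test in R, using rstudent() (which computes exactly these ti) and a Bonferroni critical value; data are invented:

set.seed(1)
dat <- data.frame(Y = rnorm(60), X1 = rnorm(60), X2 = rnorm(60))
fit <- lm(Y ~ X1 + X2, data = dat)

t_i <- rstudent(fit)                         # studentized deleted residuals
n <- nrow(dat); p <- length(coef(fit))
crit <- qt(1 - 0.05 / (2 * n), n - p - 1)    # Bonferroni critical value, alpha = 0.05
which(abs(t_i) > crit)                       # flagged outliers, if any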


Six Criteria

R²p, R²a,p, Cp, AICp, BICp (SBCp), PRESSp

All are useful for choosing between models:

Take two models, one Mp with p-dimensional input and one Mp−1 with (p − 1)-dimensional input.

Calculate the criteria for Mp and Mp−1.

Unfortunately the criteria can't tell us which variables to include. For that, a general search is still required.


Best Subset Algorithms

Consider all possible subsets. For each model, evaluate the criteria.

Time-saving algorithms have been developed, which require the calculation of only a small fraction of all possible models.

Still, if P exceeds 30 or 40, it requires excessive computer time.

Several regression models can be identified as “good” for final consideration, depending on which criterion we use.
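A sketch using the leaps package (assumed installed), whose regsubsets() performs an efficient exhaustive search:

library(leaps)
set.seed(1)
dat <- data.frame(Y = rnorm(60), X1 = rnorm(60), X2 = rnorm(60), X3 = rnorm(60))

all <- regsubsets(Y ~ ., data = dat, nvmax = 3)   # best model of each size
s <- summary(all)
s$cp     # Mallows' Cp for the best model of each size
s$adjr2  # adjusted R^2
s$bic    # BIC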


Stepwise Regression Methods

An automatic search procedure.

Identifies a “single” best model.

Comes in several different variants.


Forward Stepwise Regression

A (greedy) procedure for identifying variables to include in the regressionmodel is as follows. Repeat until finished:

1 Fit a simple linear regression model for each of the P − 1 X variables considered for inclusion. For each, compute the t∗ statistic for testing whether or not the slope is zero:

t∗k = bk / s{bk}

(Recall bk is the estimate for βk and s{bk} is its standard error.)

2 Pick the largest of the P − 1 |t∗k|'s and include the corresponding X variable in the regression model if |t∗k| exceeds some threshold.

3 If the number of X variables included in the regression model is greater than one, check whether the model would be improved by dropping variables (using the t-test and a threshold again).


Forward Stepwise Regression (cont)

Other criteria can be used to determine which variables to add and delete, such as the F-test (full model vs. reduced model), AIC (the default option in R), BIC, or Cp.

Usually much more efficient than best subset regression.


Forward Regression

Simplified version of forward stepwise regression

No deletion step! Once a variable is in, it will be there from then on.


Backward Elimination

Start from the full model with P − 1 variables.

Iteratively check whether any variable should be deleted from the model by some given criterion. This time, no addition step!

Time to see how we do this in R. Suppose “full” is the fitted object from the full model with all P − 1 variables, and “null” is the intercept-only fit.

Forward stepwise regression: step(null, scope = list(upper = full, lower = null), direction = "both")

Forward regression: step(null, scope = list(upper = full, lower = null), direction = "forward")

Backward elimination: step(full, scope = list(upper = full, lower = null), direction = "backward")
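Putting it together as a runnable sketch (the data frame and variable names are invented):

set.seed(1)
dat  <- data.frame(Y = rnorm(60), X1 = rnorm(60), X2 = rnorm(60), X3 = rnorm(60))
null <- lm(Y ~ 1, data = dat)    # intercept-only model
full <- lm(Y ~ ., data = dat)    # all P - 1 candidate predictors

step(null, scope = list(upper = full, lower = null), direction = "both")      # forward stepwise
step(null, scope = list(upper = full, lower = null), direction = "forward")   # forward only
step(full, scope = list(upper = full, lower = null), direction = "backward")  # backward only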


Remarks

If the number of variables is not large (2^(P−1) is not large), it is best to fit all possible models and choose the one with the smallest AIC, BIC, Cp, or PRESS.

If the number of variables is too large, then stepwise forward regression is recommended. More aggressively, if you would like more speed, you can use forward selection or backward elimination.

If p ≫ n, then direct regularization techniques are needed, such as the LASSO, SCAD, etc.


Diagnostic for Multiple Linear Regression

Yang Feng


Diagnostics

Added-Variable Plots (Model Adequacy for a Predictor Variable)

Studentized Deleted Residuals (Identifying outlying Y )

Hat Matrix Leverage Values (Identifying outlying X )

DFFITS, Cook’s Distance, DFBETAS (Identifying Influential Cases)

Variance Inflation Factor (Multicollinearity Diagnostic)


Added-Variable Plots

Measures the marginal role of Xk given that the other variables are already in the model.

Links with the coefficient of partial determination.

Suppose there are only X1 and X2, and we want to measure the role of X1 given that X2 is already in the model.

1 Regress Y on X2 and obtain the residuals ei(Y | X2) = Yi − Ŷi(X2)

2 Regress X1 on X2 and obtain the residuals ei(X1 | X2) = Xi1 − X̂i1(X2)

3 The scatter plot of ei(Y | X2) against ei(X1 | X2) provides a graphical representation of the strength of the relationship between Y and X1, adjusted for X2. This plot is called the added-variable plot.
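A minimal R sketch of building this plot by hand on simulated data; the slope of the fitted line in the plot equals the coefficient of X1 in the full regression:

set.seed(1)
dat <- data.frame(X1 = rnorm(60), X2 = rnorm(60))
dat$Y <- 1 + 2 * dat$X1 + 3 * dat$X2 + rnorm(60)

e_y  <- resid(lm(Y ~ X2, data = dat))    # e(Y | X2)
e_x1 <- resid(lm(X1 ~ X2, data = dat))   # e(X1 | X2)

plot(e_x1, e_y, xlab = "e(X1 | X2)", ylab = "e(Y | X2)")
abline(lm(e_y ~ e_x1))   # slope equals b1 from lm(Y ~ X1 + X2)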


Added-Variable Plots (some prototypes): [figure omitted]

Studentized Deleted Residuals: [figure omitted]

Studentized Deleted Residuals (Cont’)

Let di be the deleted residual for the i th case:

di = Yi − Ŷi(i)

The studentized deleted residual, denoted by ti, is

ti = di / s{di}

where

s²{di} = MSE(i) / (1 − hii)

and

di / s{di} ∼ t(n − p − 1)

It can be shown that

ti = ei [ (n − p − 1) / (SSE(1 − hii) − ei²) ]^(1/2)

These ti's can be used to formally test (e.g. using a Bonferroni test procedure) whether the largest absolute studentized deleted residual is an outlier.


Hat Matrix Leverage Values: [figure omitted]

Hat Matrix Leverage Values

For the hat matrix H, we have 0 ≤ hii ≤ 1 and ∑_{i=1}^{n} hii = p. hii is called the leverage of the i th case.

A large hii indicates that case i has substantial leverage in determining the fitted value Ŷi. Reasons:

The fitted value Ŷi is a linear combination of the observed Y values, where hii is the weight of observation Yi.

The larger hii is, the smaller the variance of the residual ei.

If hii > 2p/n, we say the leverage is large.
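A quick R sketch of flagging high-leverage cases by this rule of thumb (simulated data):

set.seed(1)
dat <- data.frame(X1 = rnorm(60), X2 = rnorm(60))
dat$Y <- 1 + dat$X1 + dat$X2 + rnorm(60)
fit <- lm(Y ~ X1 + X2, data = dat)

h <- hatvalues(fit)
p <- length(coef(fit)); n <- nrow(dat)
which(h > 2 * p / n)   # cases with large leverage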


Identifying Influential Cases

After identifying cases as outlying, we would like to ascertain whether these cases are influential, i.e., whether their exclusion causes major changes in the fitted regression function. Three measures:

Influence on a single fitted value (DFFITS). The influence case i has on the fitted value Ŷi is given by

(DFFITS)i = (Ŷi − Ŷi(i)) / [MSE(i) hii]^(1/2)    (1)
          = ti [hii / (1 − hii)]^(1/2)           (2)

Influential if larger than 1 for small to medium data sets, or larger than 2 (p/n)^(1/2) for large data sets.


Identifying Influential Cases

Influence on all fitted values (Cook's distance):

Di = ∑_{j=1}^{n} (Ŷj − Ŷj(i))² / (p MSE)         (3)
   = [ei² / (p MSE)] [hii / (1 − hii)²]          (4)

Influential if Di is near or beyond the 50th percentile of F(p, n − p).

Influence on the regression coefficients (DFBETAS):

(DFBETAS)k(i) = (bk − bk(i)) / [MSE(i) ckk]^(1/2),    (5)

where ckk is the kth diagonal element of (X′X)⁻¹. Influential if larger than 1 for small to medium data sets, or larger than 2/√n for large data sets.
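All three measures are built into base R; a sketch on simulated data:

set.seed(1)
dat <- data.frame(X1 = rnorm(60), X2 = rnorm(60))
dat$Y <- 1 + dat$X1 + dat$X2 + rnorm(60)
fit <- lm(Y ~ X1 + X2, data = dat)

dffits(fit)                        # (DFFITS)_i
cooks.distance(fit)                # Cook's distance D_i
dfbetas(fit)                       # (DFBETAS)_k(i), one column per coefficient
summary(influence.measures(fit))   # cases flagged by any influence measure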


Variance Inflation Factor

A formal way of detecting the presence of multicollinearity.

Considering the standardized regression model, we have

σ²{b∗} = (σ∗)² rXX⁻¹

Define (VIF)k as the kth diagonal element of rXX⁻¹.

(VIF)k = (1 − R²k)⁻¹, for k = 1, 2, · · · , p − 1, where R²k is the coefficient of determination when Xk is regressed on the p − 2 other X variables.

Think about what happens when R²k = 0 and when R²k is close to 1.

maxk (VIF)k > 10 indicates a serious multicollinearity problem.
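A sketch computing the VIFs directly from this definition (simulated data with X1 and X2 deliberately collinear):

set.seed(1)
X1 <- rnorm(60)
X2 <- X1 + 0.1 * rnorm(60)   # nearly collinear with X1
X3 <- rnorm(60)
dat <- data.frame(Y = 1 + X1 + X2 + X3 + rnorm(60), X1, X2, X3)

vif_k <- function(k) {
  others <- setdiff(c("X1", "X2", "X3"), k)
  r2 <- summary(lm(reformulate(others, response = k), data = dat))$r.squared
  1 / (1 - r2)                       # (VIF)_k = (1 - R_k^2)^{-1}
}
sapply(c("X1", "X2", "X3"), vif_k)   # X1 and X2 should show large VIFs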


Remedial Measures for Multiple Linear RegressionModels

Yang Feng


Outline

Unequal error variance: weighted least squares

Multicollinearity: ridge regression

Influential cases: robust regression

Nonparametric regression: lowess method and regression trees

Evaluating precision: bootstrapping


Unequal Error Variance

Yi = β0 + β1Xi1 + · · · + βp−1Xi,p−1 + εi

Here: the εi are independent N(0, σi²). (Originally: the εi are independent N(0, σ²).)

In matrix form:

σ²{ε} = diag(σ1², σ2², · · · , σn²)


Known Error Variance

Define weights

wi = 1 / σi²

and denote

W = diag(w1, w2, · · · , wn)

The weighted least squares (and maximum likelihood) estimator is

bw = (X′WX)⁻¹X′WY

(Derivation on the board, two methods: direct MLE, and transforming variables back to the regular multiple linear regression.)
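In R, weighted least squares is the weights argument of lm(); a sketch with a known variance function (all numbers invented):

set.seed(1)
x <- runif(60, 1, 10)
sigma <- 0.5 * x                        # known error standard deviations
y <- 2 + 3 * x + rnorm(60, sd = sigma)

w <- 1 / sigma^2                        # w_i = 1 / sigma_i^2
fit <- lm(y ~ x, weights = w)           # computes b_w = (X'WX)^{-1} X'WY
coef(fit)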


Error Variance Known up to Proportionality Constant

wi = k (1 / σi²)

The same estimator results: the proportionality constant k cancels in (X′WX)⁻¹X′WY.


Unknown Error Variances

In reality, one rarely knows the variances σi².

Estimation of Variance Function or Standard Deviation Function

Use of Replicates or Near Replicates


Estimation of Variance Function or Standard DeviationFunction

Four steps (which can be iterated several times to reach convergence):

1 Fit the regression model by unweighted least squares and analyze the residuals.

2 Estimate the variance function or the standard deviation function by regressing either the squared residuals or the absolute residuals on the appropriate predictor(s). (We know that the variance of εi is σi² = E(εi²) − (E(εi))² = E(εi²); hence the squared residual ei² is an estimator of σi².)

3 Use the fitted values from the estimated variance or standard deviation function to obtain the weights wi.

4 Estimate the regression coefficients using these weights.
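A one-pass sketch of these four steps in R, on simulated data whose error variance grows with x (the guard against non-positive fitted variances is my addition):

set.seed(1)
x <- runif(80, 1, 10)
y <- 2 + 3 * x + rnorm(80, sd = 0.5 * x)
dat <- data.frame(x, y)

ols  <- lm(y ~ x, data = dat)                 # step 1: unweighted fit
vfit <- lm(resid(ols)^2 ~ x, data = dat)      # step 2: regress e^2 on the predictor
w    <- 1 / pmax(fitted(vfit), 1e-8)          # step 3: weights from fitted variances
wls  <- lm(y ~ x, data = dat, weights = w)    # step 4: weighted fit
coef(wls)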


Use of Replicates or Near Replicates

When replicates or near replicates are available, use the sample variance of the replicates as the estimate for the variances.

In observational studies, replicates are usually not available.


Ridge Regression

Multicollinearity

In polynomial regression models, higher-order terms are often highly correlated with lower-order terms.

One or several predictor variables may be dropped from the model inorder to remove the multicollinearity.


Ridge Estimators

OLS: (X′X)b = X′Y

Transformed by the correlation transformation:

rXX b = rYX

Ridge estimator: for a constant c ≥ 0,

(rXX + cI) bR = rYX

c = 0: OLS.

c > 0: biased, but much more stable.
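A sketch with MASS::lm.ridge() (MASS assumed available), which fits the ridge system over a grid of c values:

library(MASS)
set.seed(1)
X1 <- rnorm(60); X2 <- X1 + 0.1 * rnorm(60)
dat <- data.frame(Y = 1 + X1 + X2 + rnorm(60), X1, X2)

fits <- lm.ridge(Y ~ X1 + X2, data = dat, lambda = seq(0, 1, by = 0.01))
plot(fits)      # ridge trace: coefficient paths as c varies
select(fits)    # HKB, L-W, and GCV suggestions for c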


Choice of c

Ridge trace (0 ≤ c ≤ 1) and the variance inflation factors (VIF)k:

(VIF)k = (1 − R²k)⁻¹

Choose the c where the ridge trace starts to become stable and the VIFs have become sufficiently small.


Robust Regression

Robust to outlying and influential cases.

LAD (least absolute deviation) regression minimizes

L1 = ∑_{i=1}^{n} |Yi − (β0 + β1Xi1 + · · · + βp−1Xi,p−1)|.        (1)

LMS (least median of squares) regression minimizes

median{ [Yi − (β0 + β1Xi1 + · · · + βp−1Xi,p−1)]² }.        (2)


IRLS Robust Regression

1 Choose a weight function for weighting the cases.

2 Obtain starting weights for all cases.

3 Use the starting weights in weighted least squares and obtain the residuals from the fit.

4 Use the residuals in step 3 to obtain revised weights.

5 Continue the iteration until convergence.


IRLS Robust Regression

1 Huber weight function:

w = 1             if |u| ≤ 1.345
w = 1.345 / |u|   if |u| > 1.345        (3)

2 Bisquare weight function:

w = [1 − (u/4.685)²]²   if |u| ≤ 4.685
w = 0                   if |u| > 4.685        (4)
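These are the default tuning constants in MASS::rlm(), which runs the IRLS iteration; a sketch on simulated data with injected outliers (MASS assumed available):

library(MASS)
set.seed(1)
x <- rnorm(60)
y <- 1 + 2 * x + rnorm(60)
y[1:3] <- y[1:3] + 15                        # a few gross outliers

coef(rlm(y ~ x, psi = psi.huber))            # Huber weights, k = 1.345 by default
coef(rlm(y ~ x, psi = psi.bisquare))         # bisquare weights, c = 4.685 by default
coef(lm(y ~ x))                              # OLS, pulled toward the outliers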


Starting weights

Huber weights: start from the OLS fit.

Bisquare weights: use an initial robust fit with Huber weights, or the residuals from an LAD regression.

Scaled Residuals

When there are no outlying observations, normalize the residuals by √MSE.

When there are outlying observations, use the resistant and robust median absolute deviation (MAD) estimator

MAD = (1/0.6745) median{ |ei − median{ei}| }.        (5)

Then the scaled residual is

ui = ei / MAD.        (6)


Lowess Method

Two predictor variables, fitted value at (Xh1,Xh2).

Distance measure:

di = [(Xi1 − Xh1)² + (Xi2 − Xh2)²]^(1/2).        (7)

Proportion q of the data that is nearest to (Xh1, Xh2). A larger q leads to a smoother fit, but may increase the bias. Usually between .4 and .6.

Weight function:

wi = [1 − (di/dq)³]³   if di < dq
wi = 0                 if di ≥ dq        (8)
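Base R's loess() uses this same tricube weight; a sketch with two predictors, where span plays the role of q:

set.seed(1)
dat <- data.frame(X1 = runif(100), X2 = runif(100))
dat$Y <- sin(2 * pi * dat$X1) + dat$X2 + rnorm(100, sd = 0.2)

fit <- loess(Y ~ X1 + X2, data = dat, span = 0.5, degree = 1)   # q = 0.5, local linear
predict(fit, newdata = data.frame(X1 = 0.5, X2 = 0.5))          # fitted value at (X_h1, X_h2)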


Regression Trees

A powerful nonparametric regression method.

Can handle multiple predictors.

Easy to calculate and require virtually no assumptions.

Achieved by partitioning the covariates.

Key quantities:

Number of regions r.

Split points between the regions.


Growing a regression tree

Take a single predictor as an example.


Growing a regression tree

Divide X into two regions, R21 and R22.

The optimal split point Xs minimizes the SSE:

SSE = SSE(R21) + SSE(R22),        (9)

where

SSE(Rjk) = ∑ (Yi − ȲRjk)²,        (10)

summing over the cases i in region Rjk, with ȲRjk the mean response in that region.
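A brute-force sketch of finding the best single split point in R (simulated data with a true break near x = 4):

set.seed(1)
x <- sort(runif(80, 0, 10))
y <- ifelse(x < 4, 2, 7) + rnorm(80, sd = 0.5)

sse_split <- function(s) {
  left <- y[x < s]; right <- y[x >= s]
  sum((left - mean(left))^2) + sum((right - mean(right))^2)
}
cand <- (x[-1] + x[-length(x)]) / 2            # midpoints between adjacent x's
cand[which.min(sapply(cand, sse_split))]       # estimated split point X_s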



A graphical example: [figure omitted]

Bootstrapping

A nonparametric method for evaluating the uncertainty of an estimate.

Suppose we want to evaluate the precision of an estimated regression coefficient b1.

1 Fix B, the number of bootstrap samples to be generated. Say B = 500.

2 For each k = 1, · · · , B, sample n observations with replacement from the original n observations.

3 Fit a linear regression model using the bootstrap sample, which yields the coefficient b1^(k).

4 The sample standard deviation of {b1^(1), b1^(2), · · · , b1^(B)} is a measure of the precision of b1.
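A direct R sketch of this recipe (random-X resampling; data simulated):

set.seed(1)
n <- 60
dat <- data.frame(x = rnorm(n))
dat$y <- 1 + 2 * dat$x + rnorm(n)

B <- 500
b1_star <- replicate(B, {
  idx <- sample(n, replace = TRUE)            # resample cases with replacement
  coef(lm(y ~ x, data = dat[idx, ]))["x"]     # refit, keep the slope b1^(k)
})
sd(b1_star)   # bootstrap measure of the precision of b1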


Bootstrap sampling

Fixed X sampling. Appropriate if the regression function is a good model for the data and the error terms have constant variance. Obtain the residuals ei from the original fit, draw a bootstrap sample e∗1, · · · , e∗n of size n from them, and define new responses

Y∗i = Ŷi + e∗i.        (11)

Then regress the Y∗ values on the original X variables.

Random X sampling. Draw a bootstrap sample of (X∗, Y∗) pairs.


Bootstrap Confidence Intervals

From the bootstrap distribution of b∗1, get the α/2 and 1 − α/2 quantiles b∗1(α/2) and b∗1(1 − α/2). Suppose the original estimate is b1.

Percentile bootstrap:

b∗1(α/2) ≤ β1 ≤ b∗1(1 − α/2).        (12)

Basic bootstrap (reflection method):

2b1 − b∗1(1 − α/2) ≤ β1 ≤ 2b1 − b∗1(α/2).        (13)
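A sketch computing both intervals from the bootstrap draws (continuing the simulation style above):

set.seed(1)
n <- 60
dat <- data.frame(x = rnorm(n)); dat$y <- 1 + 2 * dat$x + rnorm(n)
b1 <- coef(lm(y ~ x, data = dat))["x"]

b1_star <- replicate(500, {
  idx <- sample(n, replace = TRUE)
  coef(lm(y ~ x, data = dat[idx, ]))["x"]
})

q <- quantile(b1_star, c(0.025, 0.975))
q                                  # percentile interval (12)
c(2 * b1 - q[2], 2 * b1 - q[1])    # basic / reflection interval (13)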


Reflection Method, Explained

With probability 1 − α,

b1(α/2) ≤ b1 ≤ b1(1 − α/2),        (14)

where b1(α/2) and b1(1 − α/2) denote quantiles of the sampling distribution of b1. Let

D1 = β1 − b1(α/2)        (15)

D2 = b1(1 − α/2) − β1.        (16)

Substituting (15) and (16) into (14), we have

b1 − D2 ≤ β1 ≤ b1 + D1.        (17)

Estimating D1 and D2 from the bootstrap quantiles (D1 ≈ b1 − b∗1(α/2), D2 ≈ b∗1(1 − α/2) − b1) yields interval (13).
