Introduction to Simple Linear Regression
Yang Feng
Course Description
Theory and practice of regression analysis: simple and multiple regression, including testing, estimation, and confidence procedures; modeling; regression diagnostics and plots; polynomial regression; collinearity and confounding; model selection; geometry of least squares. Extensive use of the computer to analyze data.
Course website: http://www.stat.columbia.edu/~yangfeng/W4315
Required Text: Applied Linear Regression Models (4th Ed.)
Authors: Kutner, Nachtsheim, Neter
Philosophy and Style
Easy first half.
Difficult second half.
Some digressions from the required book.
Understanding = proof (derivation) + implementation.
Practice makes perfect.
Course Outline
First half of the course is single variable linear regression.
Least squares
Maximum likelihood, normal model
Tests / inferences
ANOVA
Diagnostics
Remedial Measures
Course Outline (Continued)
Second half of the course is multiple linear regression and other related topics.
Multiple Linear Regression
Linear algebra review
Matrix approach to linear regression
Multiple predictor variables
Diagnostics
Tests
Model selection
Other topics (if time permits)
Principal Component Analysis
Generalized Linear Models
Introduction to Bayesian Inference
Requirements
Calculus
Derivatives, gradients, convexity
Linear algebra
Matrix notation, inversion, eigenvectors, eigenvalues, rank
Probability and Statistics
Random variable
Expectation, variance
Estimation
Bias/Variance
Basic probability distributions
Hypothesis Testing
Confidence Interval
Software
R will be used throughout the course and is required for all homework. An R tutorial session will be given on Sep 5 (bring your laptop!). Reasons for R:
Completely free software. Can be downloaded from http://cran.r-project.org/
Available on various systems: PC, Mac, Linux, and even iPhone and iPad!
Advanced yet easy to use.
An Introduction to R: http://cran.r-project.org/doc/manuals/R-intro.pdf
Grading
Weekly homework (20%)
Due 6pm every Tuesday in the W4315-Inbox at 904@SSW
You can collect graded homework from the W4315-Outbox at 904@SSW
No late homework accepted
Lowest score will be dropped
In-class midterm exam (30%)
In-class final exam (50%)
Exams are closed book, closed notes! One double-sided cheat sheet is allowed.
Letter grades will look like the histogram of a normal distribution. Let's have a look at Spring 2013's distribution:
Office Hours / Website
http://www.stat.columbia.edu/~yangfeng
Course materials and homework with due dates will be posted on the course website.
Office hours: Thursday 10am–noon (subject to change)
Office location: Room 1012, SSW Building (1255 Amsterdam Avenue, between 121st and 122nd Street)
Why regression?
Want to model a functional relationship between a "predictor variable" (input, independent variable, etc.) and a "response variable" (output, dependent variable, etc.)
Examples?
But the real world is noisy; there is no exact f = ma.
Observation noise
Process noise
Two distinct goals
(Estimation) Understanding the relationship between predictor variables and response variables
(Prediction) Predicting the future response given newly observed predictors
History
Sir Francis Galton, 19th century
Studied the relation between heights of parents and children and noted that the children "regressed" to the population mean
"Regression" stuck as the term to describe statistical relations between variables
Example Applications
Trend lines, e.g., Google over 6 months.
Others
Epidemiology
Relating lifespan to obesity, smoking habits, etc.
Science and engineering
Relating physical inputs to physical outputs in complex systems
Brain
Aims for the course
Given something you would like to predict and some number of covariates:
What kind of model should you use?
Which variables should you include?
Which transformations of variables and interaction terms should you use?
Given a model and some data:
How do you fit the model to the data?
How do you express confidence in the values of the model parameters?
How do you regularize the model to avoid over-fitting and other related issues?
Data for Regression Analysis
Observational Data
Example: relation between age of employee (X) and number of days of illness last year (Y)
Cannot be controlled!
Experimental Data
Example: an insurance company wishes to study the relation between the productivity of its analysts in processing claims (Y) and the length of training (X).
Treatment: the length of training
Experimental Units: the analysts included in the study
Completely Randomized Design: the most basic type of statistical design
Example: same example, but every experimental unit has an equal chance to receive any one of the treatments.
Questions?
Good time to ask now.
Simple Linear Regression
Want to find parameters for a function of the form
Yi = β0 + β1Xi + εi
Distribution of error random variable not specified
Quick Example - Scatter Plot
[Scatter plot of the example data: a on the x-axis (−2 to 2), b on the y-axis (−2 to 8).]
Formal Statement of Model
Yi = β0 + β1Xi + εi
Yi is the value of the response variable in the i-th trial
β0 and β1 are parameters
Xi is a known constant, the value of the predictor variable in the i-th trial
εi is a random error term with mean E(εi) = 0 and variance Var(εi) = σ²
εi and εj are uncorrelated
i = 1, . . . , n
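As a concrete illustration, here is a minimal R sketch (R is the course software) that simulates one dataset from this model; all parameter values below are hypothetical choices, not from the slides:

set.seed(1)                             # for reproducibility
n     <- 100
X     <- runif(n, -2, 2)                # fixed, known predictor values
beta0 <- 3; beta1 <- 2; sigma <- 1      # assumed true parameter values
eps   <- rnorm(n, mean = 0, sd = sigma) # E(eps) = 0, Var(eps) = sigma^2
Y     <- beta0 + beta1 * X + eps        # the model Yi = beta0 + beta1*Xi + eps_i
plot(X, Y)                              # produces a scatter plot like the one above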
Properties
The response Yi is the sum of two components:
Constant term β0 + β1Xi
Random term εi
The expected response is
E(Yi) = E(β0 + β1Xi + εi)
= β0 + β1Xi + E(εi)
= β0 + β1Xi
Expectation Review
Suppose X has probability density function f(x).
Definition
E(X) = ∫ x f(x) dx
Linearity property:
E(aX) = a E(X)
E(aX + bY) = a E(X) + b E(Y)
Both are obvious from the definition.
Example Expectation Derivation
Suppose the p.d.f. of X is f(x) = 2x, 0 ≤ x ≤ 1.
Expectation:
E(X) = ∫₀¹ x f(x) dx
= ∫₀¹ 2x² dx
= [2x³/3]₀¹
= 2/3
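As a quick sanity check, the integral can be evaluated numerically with base R's integrate:

integrate(function(x) x * 2 * x, lower = 0, upper = 1)  # E(X) ≈ 0.6667 = 2/3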
Expectation of a Product of Random Variables
If X, Y are random variables with joint density function f(x, y), then the expectation of the product is given by
E(XY) = ∫∫ xy f(x, y) dx dy
Expectation of a product of random variables
What if X and Y are independent? If X and Y are independent with density functions f1 and f2 respectively, then
E(XY) = ∫∫ xy f1(x) f2(y) dx dy
= ∫ x f1(x) [ ∫ y f2(y) dy ] dx
= ∫ x f1(x) E(Y) dx
= E(X) E(Y)
Regression Function
The response Yi comes from a probability distribution with mean
E(Yi) = β0 + β1Xi
This means the regression function is
E(Y) = β0 + β1X
since the regression function relates the means of the probability distributions of Y, for given X, to the level of X.
Error Terms
The response Yi in the i-th trial exceeds or falls short of the value of the regression function by the error term amount εi.
The error terms εi are assumed to have constant variance σ².
Response Variance
Responses Yi have the same constant variance:
Var(Yi) = Var(β0 + β1Xi + εi)
= Var(εi)   (since β0 + β1Xi is constant)
= σ²
Variance (2nd central moment) Review
Continuous distribution:
Var(X) = E[(X − E(X))²] = ∫ (x − E(X))² f(x) dx
Discrete distribution:
Var(X) = E[(X − E(X))²] = Σᵢ (Xi − E(X))² P(X = Xi)
Alternative form for the variance:
Var(X) = E(X²) − E(X)²
Example Variance Derivation
f(x) = 2x, 0 ≤ x ≤ 1
Var(X) = E(X²) − E(X)²
= ∫₀¹ 2x · x² dx − (2/3)²
= [2x⁴/4]₀¹ − 4/9
= 1/2 − 4/9
= 1/18
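The same numerical check in R for the variance:

EX2 <- integrate(function(x) x^2 * 2 * x, lower = 0, upper = 1)$value  # E(X^2) = 1/2
EX2 - (2/3)^2                                                          # ≈ 0.0556 = 1/18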
Variance Properties
Var(aX) = a² Var(X)
Var(aX + bY) = a² Var(X) + b² Var(Y) if X ⊥ Y
Var(a + cX) = c² Var(X) if a and c are both constants
More generally,
Var(Σ aiXi) = Σᵢ Σⱼ ai aj Cov(Xi, Xj)
Covariance
The covariance between two real-valued random variables X and Y, with expected values E(X) = µ and E(Y) = ν, is defined as
Cov(X, Y) = E((X − µ)(Y − ν))
= E(XY) − µν
If X ⊥ Y, then E(XY) = E(X)E(Y) = µν, and hence
Cov(X, Y) = 0
Formal Statement of Model
Yi = β0 + β1Xi + εi
Yi is the value of the response variable in the i-th trial
β0 and β1 are parameters
Xi is a known constant, the value of the predictor variable in the i-th trial
εi is a random error term with mean E(εi) = 0 and variance Var(εi) = σ²
i = 1, . . . , n
Example
An experimenter gave three subjects a very difficult task. Data on the age of the subject (X) and on the number of attempts to accomplish the task before giving up (Y) follow:
Table:
Subject i                1    2    3
Age Xi                  20   55   30
Number of Attempts Yi    5   12   10
Least Squares Linear Regression
Goal: make Yi and b0 + b1Xi close for all i.
Proposal 1: minimize Σᵢ₌₁ⁿ [Yi − (b0 + b1Xi)]. (Ill-posed: errors of opposite sign cancel, and the sum can be driven arbitrarily low.)
Proposal 2: minimize Σᵢ₌₁ⁿ |Yi − (b0 + b1Xi)|.
Final Proposal: minimize
Q(b0, b1) = Σᵢ₌₁ⁿ [Yi − (b0 + b1Xi)]²
Choose b0 and b1 as estimators for β0 and β1.
b0 and b1 will minimize the criterion Q for the given sample observations (X1, Y1), (X2, Y2), · · · , (Xn, Yn).
Guess #1
Guess #2
Function maximization
Important technique to remember!
Take derivative
Set result equal to zero and solve
Test second derivative at that point
Question: does this always give you the maximum?
Going further: multiple variables, convex optimization
Function Maximization
Find the value of x that maximizes the function
f(x) = −x² + log(x), x > 0
1. Take the derivative and set it to zero:
d/dx (−x² + log(x)) = 0   (1)
−2x + 1/x = 0   (2)
2x² = 1   (3)
x² = 1/2   (4)
x = √2/2   (5)
2. Check the second-order derivative:
d²/dx² (−x² + log(x)) = d/dx (−2x + 1/x)   (6)
= −2 − 1/x²   (7)
< 0 for all x > 0   (8)
so we have found the maximum: x* = √2/2, f(x*) = −(1/2)[1 + log(2)].
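The result can be confirmed numerically with base R's optimize; the search interval below is an arbitrary choice that covers the maximizer:

f <- function(x) -x^2 + log(x)
optimize(f, interval = c(1e-6, 10), maximum = TRUE)
# $maximum   ≈ 0.7071 = sqrt(2)/2
# $objective ≈ -0.8466 = -(1/2)*(1 + log(2))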
Least Squares Minimization
Function to minimize w.r.t. b0 and b1:
Q(b0, b1) = Σᵢ₌₁ⁿ [Yi − (b0 + b1Xi)]²
b0 and b1 are called point estimators of β0 and β1, respectively.
Find the partial derivatives and set both equal to zero:
∂Q/∂b0 = 0,  ∂Q/∂b1 = 0
Normal Equations
The result of this minimization step is called the normal equations. b0 and b1 are called point estimators of β0 and β1, respectively.
Σ Yi = n b0 + b1 Σ Xi   (9)
Σ XiYi = b0 Σ Xi + b1 Σ Xi²   (10)
This is a system of two equations in two unknowns.
Solution to Normal Equations
After a lot of algebra one arrives at
b1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)²
b0 = Ȳ − b1X̄
where
X̄ = Σ Xi / n,  Ȳ = Σ Yi / n
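Applying these formulas in R to the three-subject age/attempts example from earlier, with lm() as a cross-check:

X  <- c(20, 55, 30)        # ages
Y  <- c(5, 12, 10)         # numbers of attempts
b1 <- sum((X - mean(X)) * (Y - mean(Y))) / sum((X - mean(X))^2)
b0 <- mean(Y) - b1 * mean(X)
c(b0, b1)                  # 2.8077 0.1769
coef(lm(Y ~ X))            # the same estimates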
Least Squares Fit
Properties of Solution
The i-th residual is defined to be
ei = Yi − Ŷi
The sum of the residuals is zero, from (9):
Σᵢ ei = Σ(Yi − b0 − b1Xi) = Σ Yi − n b0 − b1 Σ Xi = 0
The sum of the observed values Yi equals the sum of the fitted values Ŷi:
Σᵢ Yi = Σᵢ Ŷi
Properties of Solution
The sum of the weighted residuals is zero when the residual in the i-th trial is weighted by the level of the predictor variable in the i-th trial, from (10):
Σᵢ Xiei = Σ Xi(Yi − b0 − b1Xi) = Σᵢ XiYi − b0 Σ Xi − b1 Σ Xi² = 0
Properties of Solution
The sum of the weighted residuals is zero when the residual in the i-th trial is weighted by the fitted value of the response variable in the i-th trial:
Σᵢ Ŷiei = Σᵢ(b0 + b1Xi)ei = b0 Σᵢ ei + b1 Σᵢ Xiei = 0
Properties of Solution
The regression line always goes through the point (X̄, Ȳ).
Using the alternative form of the linear regression model
Yi = β0* + β1(Xi − X̄) + εi
the least squares estimator b1 for β1 remains the same as before, while the least squares estimator for β0* = β0 + β1X̄ becomes
b0* = b0 + b1X̄ = (Ȳ − b1X̄) + b1X̄ = Ȳ
Hence the estimated regression function is
Ŷ = Ȳ + b1(X − X̄)
which evidently always passes through the point (X̄, Ȳ).
Estimating the Error Term Variance σ²
Review estimation in non-regression setting.
Show estimation results for regression setting.
Estimation Review
An estimator is a rule that tells how to calculate the value of an estimate based on the measurements contained in a sample,
e.g., the sample mean
Ȳ = (1/n) Σᵢ₌₁ⁿ Yi
Point Estimators and Bias
Point estimator:
θ̂ = f({Y1, . . . , Yn})
Unknown quantity / parameter:
θ
Definition: bias of an estimator
Bias(θ̂) = E(θ̂) − θ
Example
Samples:
Yi ∼ N(µ, σ²)
Estimate the population mean:
θ = µ,  θ̂ = Ȳ = (1/n) Σᵢ₌₁ⁿ Yi
Sampling Distribution of the Estimator
First moment:
E(θ̂) = E((1/n) Σᵢ₌₁ⁿ Yi) = (1/n) Σᵢ₌₁ⁿ E(Yi) = nµ/n = θ
This is an example of an unbiased estimator:
Bias(θ̂) = E(θ̂) − θ = 0
Variance of Estimator
Definition: variance of an estimator
Var(θ̂) = E([θ̂ − E(θ̂)]²)
Remember:
Var(cY) = c² Var(Y)
Var(Σᵢ₌₁ⁿ Yi) = Σᵢ₌₁ⁿ Var(Yi)
the latter only if the Yi are independent with finite variance.
Example Estimator Variance
Var(θ̂) = Var((1/n) Σᵢ₌₁ⁿ Yi)
= (1/n²) Σᵢ₌₁ⁿ Var(Yi)
= nσ²/n²
= σ²/n
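A short Monte Carlo check of Var(Ȳ) = σ²/n; the values of µ, σ, and n below are arbitrary:

set.seed(2)
n <- 25; mu <- 5; sigma <- 2
ybars <- replicate(10000, mean(rnorm(n, mean = mu, sd = sigma)))
var(ybars)     # ≈ 0.16
sigma^2 / n    # exactly 0.16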
Bias Variance Trade-off
The mean squared error of an estimator:
MSE(θ̂) = E([θ̂ − θ]²)
It can be re-expressed as
MSE(θ̂) = Var(θ̂) + [Bias(θ̂)]²
MSE = Var + Bias²
Proof:
MSE(θ̂) = E((θ̂ − θ)²)
= E(([θ̂ − E(θ̂)] + [E(θ̂) − θ])²)
= E([θ̂ − E(θ̂)]²) + 2 E([E(θ̂) − θ][θ̂ − E(θ̂)]) + E([E(θ̂) − θ]²)
= Var(θ̂) + 2 [E(θ̂) − θ] E[θ̂ − E(θ̂)] + (Bias(θ̂))²   (E(θ̂) − θ is a constant)
= Var(θ̂) + 2 · 0 + (Bias(θ̂))²   (since E[θ̂ − E(θ̂)] = 0)
= Var(θ̂) + (Bias(θ̂))²
Trade-off
Think of variance as measuring confidence and bias as measuring correctness.
Those intuitions (largely) apply here.
Sometimes choosing a biased estimator can result in an overall lower MSE if it has much lower variance than the unbiased one.
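For instance, for normal samples the divisor-n variance estimator (for normal data, the maximum likelihood estimator) is biased downward but has lower MSE than the unbiased divisor-(n − 1) estimator s²; a small simulation sketch with arbitrary settings:

set.seed(3)
n <- 10; sigma2 <- 1
draws <- replicate(20000, rnorm(n, mean = 0, sd = sqrt(sigma2)))
s2 <- apply(draws, 2, var)     # unbiased estimate, divisor n - 1
ml <- s2 * (n - 1) / n         # biased estimate, divisor n
mean((s2 - sigma2)^2)          # MSE ≈ 0.22
mean((ml - sigma2)^2)          # MSE ≈ 0.19: smaller, despite the bias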
s² estimator for σ² for a single population
Sum of squares:
Σᵢ₌₁ⁿ (Yi − Ȳ)²
Sample variance estimator:
s² = Σᵢ₌₁ⁿ (Yi − Ȳ)² / (n − 1)
s² is an unbiased estimator of σ².
This sum of squares has n − 1 "degrees of freedom" associated with it; one degree of freedom is lost by using Ȳ as an estimate of the unknown population mean µ.
Estimating the Error Term Variance σ² for the Regression Model
Regression model:
The variance of each observation Yi is σ² (the same as for the error term εi).
Each Yi comes from a different probability distribution, with a different mean that depends on the level Xi.
The deviation of an observation Yi must therefore be calculated around its own estimated mean.
s² estimator for σ²
s² = MSE = SSE/(n − 2) = Σ(Yi − Ŷi)² / (n − 2) = Σ ei² / (n − 2)
MSE is an unbiased estimator of σ²:
E(MSE) = σ²
The sum of squares SSE has n − 2 "degrees of freedom" associated with it.
Cochran's theorem (later in the course) tells us where degrees of freedom come from and how to calculate them.
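In R, this estimate can be read off a fitted model: for lm fits, deviance() returns SSE and df.residual() returns n − 2. A sketch reusing the age/attempts example data:

fit <- lm(Y ~ X)                  # X, Y as in the earlier example
deviance(fit) / df.residual(fit)  # s^2 = MSE; equals summary(fit)$sigma^2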
Normal Error Regression Model
No matter how the error terms εi are distributed, the least squares method provides unbiased point estimators of β0 and β1
that also have minimum variance among all unbiased linear estimators.
To set up interval estimates and carry out tests, we need to specify the distribution of the εi.
We will assume that the εi are normally distributed.
Normal Error Regression Model
Yi = β0 + β1Xi + εi
Yi is the value of the response variable in the i-th trial
β0 and β1 are parameters
Xi is a known constant, the value of the predictor variable in the i-th trial
εi ∼iid N(0, σ²) — note this is different: now we know the distribution
i = 1, . . . , n
Notational Convention
When you see εi ∼iid N(0, σ²),
it is read as: the εi are independently and identically distributed according to a normal distribution with mean 0 and variance σ².
Maximum Likelihood Principle
The method of maximum likelihood chooses as estimates those values ofthe parameters that are most consistent with the sample data.
Likelihood Function
If
Xi ∼ F(Θ), i = 1, . . . , n
then the likelihood function is
L({Xi}ᵢ₌₁ⁿ, Θ) = ∏ᵢ₌₁ⁿ F(Xi; Θ)
Maximum Likelihood Estimation
The likelihood function can be maximized w.r.t. the parameter(s) Θ; doing so, one can arrive at estimators for the parameters as well:
L({Xi}ᵢ₌₁ⁿ, Θ) = ∏ᵢ₌₁ⁿ F(Xi; Θ)
To do this, find solutions to (analytically or by following the gradient)
dL({Xi}ᵢ₌₁ⁿ, Θ)/dΘ = 0
Important Trick
(Almost) never maximize the likelihood function itself; maximize the log-likelihood function instead:
log(L({Xi}ᵢ₌₁ⁿ, Θ)) = log(∏ᵢ₌₁ⁿ F(Xi; Θ)) = Σᵢ₌₁ⁿ log(F(Xi; Θ))
Usually the log of the likelihood is easier to work with mathematically.
ML Normal Regression
Likelihood function:
L(β0, β1, σ²) = ∏ᵢ₌₁ⁿ (2πσ²)^(−1/2) exp(−(Yi − β0 − β1Xi)² / (2σ²))
= (2πσ²)^(−n/2) exp(−Σᵢ₌₁ⁿ (Yi − β0 − β1Xi)² / (2σ²))
which, if you maximize it (how?) w.r.t. the parameters, gives you. . .
Maximum Likelihood Estimator(s)
β0: b0, the same as in the least squares case
β1: b1, the same as in the least squares case
σ²:
σ̂² = Σᵢ (Yi − Ŷi)² / n
Note that the ML estimator σ̂² is biased, as s² is unbiased and
s² = MSE = (n / (n − 2)) σ̂²
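A quick numerical check of this bias relation (again with the age/attempts example, so n = 3):

fit        <- lm(Y ~ X)
n          <- length(Y)
sigma2.hat <- mean(residuals(fit)^2)     # ML estimate, divisor n
s2         <- deviance(fit) / (n - 2)    # MSE, divisor n - 2
all.equal(s2, n / (n - 2) * sigma2.hat)  # TRUE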
Comments
Least squares minimizes the squared error between the prediction andthe true output
The normal distribution is fully characterized by its first two centralmoments (mean and variance)
Inference in Regression Analysis
Yang Feng
Inference in the Normal Error Regression Model
Yi = β0 + β1Xi + εi
Yi is the value of the response variable in the i-th trial
β0 and β1 are parameters
Xi is a known constant, the value of the predictor variable in the i-th trial
εi ∼iid N(0, σ²)
i = 1, . . . , n
Maximum Likelihood Estimator(s)
β0: b0, the same as in the least squares case
β1: b1, the same as in the least squares case
σ²:
σ̂² = Σᵢ (Yi − Ŷi)² / n
Note that the ML estimator σ̂² is biased, as s² is unbiased and
s² = MSE = (n / (n − 2)) σ̂²
Inference concerning β1
Tests concerning β1 (the slope) are often of interest, particularly
H0 : β1 = 0
Ha : β1 ≠ 0
The null hypothesis model
Yi = β0 + (0)Xi + εi
implies that there is no linear relationship between Y and X.
Note that the means of all the Yi's are then equal at all levels of Xi.
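In R, this test is reported in the coefficient table of summary(lm(...)); a sketch with simulated, purely illustrative data:

set.seed(4)
X <- runif(50); Y <- 1 + 0.5 * X + rnorm(50)
summary(lm(Y ~ X))$coefficients
# In the row for X: Estimate is b1, Std. Error is its estimated standard
# error s(b1) (derived in the following slides), t value is b1/s(b1), and
# Pr(>|t|) is the p-value for H0: beta1 = 0.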
Sampling Dist. of b1
The point estimator for β1 is
b1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)²
The sampling distribution of b1 is the distribution of values that b1 takes when the predictor values Xi are held fixed and the observed outputs are repeatedly sampled.
Note that the sampling distribution of b1 will depend on our model assumptions.
Sampling Dist. of b1 in the Normal Regr. Model
For a normal error regression model, the sampling distribution of b1 is normal, with mean and variance given by
E(b1) = β1
Var(b1) = σ² / Σ(Xi − X̄)²
To show this we need to go through a number of algebraic steps.
First step
To show
Σ(Xi − X̄)(Yi − Ȳ) = Σ(Xi − X̄)Yi
we observe
Σ(Xi − X̄)(Yi − Ȳ) = Σ(Xi − X̄)Yi − Σ(Xi − X̄)Ȳ
= Σ(Xi − X̄)Yi − Ȳ Σ(Xi − X̄)
= Σ(Xi − X̄)Yi
since Σ(Xi − X̄) = 0.
b1 as a linear combination of the Yi's
b1 can be expressed as a linear combination of the Yi's:
b1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)²
= Σ(Xi − X̄)Yi / Σ(Xi − X̄)²   (from the previous slide)
= Σ kiYi
where
ki = (Xi − X̄) / Σ(Xi − X̄)²
Properties of the ki's
It can be shown that
Σ ki = 0
Σ kiXi = 1
Σ ki² = 1 / Σ(Xi − X̄)²
We will use these properties to prove various properties of the sampling distributions of b1 and b0.
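These identities are easy to verify numerically for any fixed set of predictor values; a quick R sketch:

X <- c(20, 55, 30, 42)                         # arbitrary predictor values
k <- (X - mean(X)) / sum((X - mean(X))^2)
sum(k)                                         # 0
sum(k * X)                                     # 1
all.equal(sum(k^2), 1 / sum((X - mean(X))^2))  # TRUE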
Linear combination of independent normal random variables
When Y1, . . . , Yn are independent normal random variables, the linear combination
Σ aiYi ∼ N( Σ ai E(Yi), Σ ai² Var(Yi) )
Normality of b1's Sampling Distribution
Since b1 is a linear combination of the Yi's and each Yi is an independent normal random variable, b1 is distributed normally as well:
b1 = Σ kiYi,  ki = (Xi − X̄) / Σ(Xi − X̄)²
From the previous slide,
E(b1) = Σ ki E(Yi),  Var(b1) = Σ ki² Var(Yi)
b1 is an unbiased estimator
This can be seen using two of the ki properties:
E(b1) = E(Σ kiYi)
= Σ ki E(Yi)
= Σ ki(β0 + β1Xi)
= β0 Σ ki + β1 Σ kiXi
= β0(0) + β1(1)
= β1
Variance of b1
Since the Yi are independent random variables with variance σ² and the ki's are constants, we get
Var(b1) = Var(Σ kiYi)
= Σ ki² Var(Yi)
= Σ ki² σ²
= σ² Σ ki²
= σ² / Σ(Xi − X̄)²
However, in most cases σ² is unknown.
Estimated variance of b1
When we don't know σ², we have to replace it with the MSE estimate (from the least squares estimation).
Let
s² = MSE = SSE/(n − 2)
where
SSE = Σ ei²  and  ei = Yi − Ŷi
Plugging in, we get the estimated variance
s²(b1) = s² / Σ(Xi − X̄)²
Recap
We now have an expression for the sampling distribution of b1 when σ² is known:
b1 ∼ N( β1, σ² / Σ(Xi − X̄)² )   (1)
When σ² is unknown, we plug in its unbiased point estimator s², giving
s²(b1) = s² / Σ(Xi − X̄)²
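Computing s(b1) = √(s²(b1)) by hand and comparing it with the standard error that summary(lm) reports; the data below are simulated and purely illustrative:

set.seed(5)
X <- runif(40); Y <- 2 + 3 * X + rnorm(40)
fit  <- lm(Y ~ X)
s2   <- deviance(fit) / df.residual(fit)               # MSE
s.b1 <- sqrt(s2 / sum((X - mean(X))^2))
c(s.b1, summary(fit)$coefficients["X", "Std. Error"])  # identical values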
Digression: the Gauss–Markov Theorem
Theorem
In a regression model where E(εi) = 0, Var(εi) = σ² < ∞, and εi and εj are uncorrelated for all i ≠ j, the least squares estimators b0 and b1 are unbiased and have minimum variance among all unbiased linear estimators.
No normality assumption on the error distribution!
Recall:
b1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)²
b0 = Ȳ − b1X̄
Proof
The theorem states that b1 has minimum variance among all unbiased linear estimators of the form
β̃1 = ∑ ciYi
As this estimator must be unbiased, we have
E(β̃1) = ∑ ci E(Yi)
= ∑ ci(β0 + β1Xi)
= β0 ∑ ci + β1 ∑ ciXi
= β1
Yang Feng (Columbia University) Inference in Regression Analysis 17 / 113
Proof cont.
Given that the constraint
β0 ∑ ci + β1 ∑ ciXi = β1
must hold whatever the values of β0 and β1, clearly it must be the case that
∑ ci = 0 and ∑ ciXi = 1
The variance of this estimator is
Var(β̃1) = ∑ ci² Var(Yi) = σ² ∑ ci²
Yang Feng (Columbia University) Inference in Regression Analysis 18 / 113
Proof cont.
Now define ci = ki + di, where the ki are the constants we already defined and the di are arbitrary constants.
ki = (Xi − X̄) / ∑(Xi − X̄)²
Let's look at the variance of the estimator:
Var(β̃1) = ∑ ci² Var(Yi) = σ² ∑(ki + di)²
= σ²(∑ ki² + ∑ di² + 2 ∑ kidi)
Note we just demonstrated that
σ² ∑ ki² = Var(b1)
Yang Feng (Columbia University) Inference in Regression Analysis 19 / 113
Proof cont.
Now, by showing that ∑ kidi = 0, we're almost done:
∑ kidi = ∑ ki(ci − ki)
= ∑ kici − ∑ ki²
= ∑ ci (Xi − X̄) / ∑(Xi − X̄)² − 1 / ∑(Xi − X̄)²
= (∑ ciXi − X̄ ∑ ci) / ∑(Xi − X̄)² − 1 / ∑(Xi − X̄)² = 0
Recall ∑ ci = 0 and ∑ ciXi = 1.
Yang Feng (Columbia University) Inference in Regression Analysis 20 / 113
Proof end
So we are left with
Var(β̃1) = σ²(∑ ki² + ∑ di²)
= Var(b1) + σ² ∑ di²
which is minimized when all the di = 0. This means that the least squares estimator b1 has minimum variance among all unbiased linear estimators.
Yang Feng (Columbia University) Inference in Regression Analysis 21 / 113
Sampling Distribution of (b1 − β1)/s(b1)
b1 is normally distributed, so
(b1 − β1) / √Var(b1) ∼ N(0, 1)
We don't know Var(b1), so it must be estimated from data; we have already derived its estimate s²{b1}.
Using this estimate, it can be shown that
(b1 − β1) / s(b1) ∼ t(n − 2)
where
s(b1) = √s²{b1}
Yang Feng (Columbia University) Inference in Regression Analysis 22 / 113
Where does this come from?
For now we need to rely upon the following theorem
For the normal error regression model,
SSE/σ² = ∑(Yi − Ŷi)² / σ² ∼ χ²(n − 2)
and is independent of b0 and b1.
Intuitively, this follows the standard result for the sum of squared standard normal random variables.
Here there are two linear constraints imposed by the regression parameter estimation, each reducing the number of degrees of freedom by one.
We will revisit this subject soon.
Yang Feng (Columbia University) Inference in Regression Analysis 23 / 113
Another useful fact : t distributed random variables
Let z and χ²(ν) be independent random variables (standard normal and χ², respectively). Then the following is a t-distributed random variable:
t(ν) = z / √(χ²(ν)/ν)
This version of the t distribution has one parameter, the degrees of freedom ν.
Yang Feng (Columbia University) Inference in Regression Analysis 24 / 113
Distribution of the studentized statistic
To derive the distribution of this statistic, first we rewrite
(b1 − β1)/s(b1) = [(b1 − β1)/σ(b1)] / [s(b1)/σ(b1)]
where
s(b1)/σ(b1) = √(s²{b1} / σ²{b1})
Yang Feng (Columbia University) Inference in Regression Analysis 25 / 113
Studentized statistic cont.
And note the following:
s²{b1} / σ²{b1} = [MSE / ∑(Xi − X̄)²] / [σ² / ∑(Xi − X̄)²] = MSE/σ² = SSE / (σ²(n − 2))
where we know (by the given theorem) that the last term is χ²-distributed and independent of b1 and b0:
SSE / (σ²(n − 2)) ∼ χ²(n − 2) / (n − 2)
Yang Feng (Columbia University) Inference in Regression Analysis 26 / 113
Studentized statistic final
But by the given definition of the t distribution, we have our result:
(b1 − β1) / s(b1) ∼ t(n − 2)
because, putting everything together, we can see that
(b1 − β1) / s(b1) ∼ z / √(χ²(n − 2)/(n − 2))
Yang Feng (Columbia University) Inference in Regression Analysis 27 / 113
Confidence Intervals and Hypothesis Tests
Now that we know the sampling distribution of the studentized statistic (b1 − β1)/s{b1} (t with n − 2 degrees of freedom), we can construct confidence intervals and hypothesis tests easily.
Things to think about
What does the t-distribution look like?
Why is the estimator distributed according to a t-distribution rather than a normal distribution?
When performing tests, why does this matter?
When is it safe to cheat and use a normal approximation?
Yang Feng (Columbia University) Inference in Regression Analysis 28 / 113
Quick Review : Hypothesis Testing
Elements of a statistical test
Null hypothesis, H0
Alternative hypothesis, Ha
Test statistic
Rejection region
Yang Feng (Columbia University) Inference in Regression Analysis 29 / 113
Quick Review : Hypothesis Testing - Errors
Errors
A type I error is made if H0 is rejected when H0 is true. The probability of a type I error is denoted by α; the value of α is called the level of the test.
A type II error is made if H0 is accepted when Ha is true. The probability of a type II error is denoted by β.

            H0 is true        Ha is true
Accept H0   Right decision    Type II Error
Reject H0   Type I Error      Right decision
Yang Feng (Columbia University) Inference in Regression Analysis 30 / 113
p-value
The p-value, or attained significance level, is the smallest level of significance α for which the observed data indicate that the null hypothesis should be rejected.
Yang Feng (Columbia University) Inference in Regression Analysis 31 / 113
Null Hypothesis
If the null hypothesis is that β1 = 0, then b1 should fall in a range around zero. The further it is from 0, the less likely the null hypothesis is to hold.
Yang Feng (Columbia University) Inference in Regression Analysis 32 / 113
Alternative Hypothesis : Least Squares Fit
If we find that our estimated value of b1 deviates from 0, then we have to determine whether or not that deviation would be surprising given the model and the sampling distribution of the estimator. If it is sufficiently different (where we define what is sufficient by a confidence level), then we reject the null hypothesis.
Yang Feng (Columbia University) Inference in Regression Analysis 33 / 113
Testing This Hypothesis
Only have a finite sample
Different finite sets of samples (from the same population / source) will (almost always) produce different point estimates (b0, b1) of β0 and β1, given the same estimation procedure.
Key point: b0 and b1 are random variables whose sampling distributions can be statistically characterized.
Hypothesis tests about β0 and β1 can be constructed using these distributions.
Yang Feng (Columbia University) Inference in Regression Analysis 34 / 113
Confidence Interval Example
A machine fills cups with margarine, and is supposed to be adjusted so that the content of the cups is µ = 250g of margarine.
Observed random variable X ∼ N(250, 2.5²), i.e. with standard deviation σ = 2.5g.
Yang Feng (Columbia University) Inference in Regression Analysis 35 / 113
Confidence Interval Example, Cont.
X1, . . . , X25 is a random sample from X.
The natural estimator is the sample mean: µ̂ = X̄ = (1/n) ∑ Xi.
Suppose the sample shows actual weights X1, . . . , X25 with mean
X̄ = (1/25) ∑ Xi = 250.2 grams.
Yang Feng (Columbia University) Inference in Regression Analysis 36 / 113
Confidence Interval Example, Cont.
Say we want to get a confidence interval for µ. By standardizing, we get a random variable
Z = (X̄ − µ) / (σ/√n) = (X̄ − µ) / 0.5
P(−z ≤ Z ≤ z) = 1 − α = 0.95.
The number z follows from the cumulative distribution function:
Φ(z) = P(Z ≤ z) = 1 − α/2 = 0.975,   (2)
z = Φ⁻¹(Φ(z)) = Φ⁻¹(0.975) = 1.96,   (3)
Yang Feng (Columbia University) Inference in Regression Analysis 37 / 113
Confidence Interval Example, Cont.
Now we get:
0.95 = 1 − α = P(−z ≤ Z ≤ z) = P(−1.96 ≤ (X̄ − µ)/(σ/√n) ≤ 1.96)   (4)
= P(X̄ − 1.96 σ/√n ≤ µ ≤ X̄ + 1.96 σ/√n)   (5)
= P(X̄ − 1.96 × 0.5 ≤ µ ≤ X̄ + 1.96 × 0.5)   (6)
= P(X̄ − 0.98 ≤ µ ≤ X̄ + 0.98).   (7)
This might be interpreted as: with probability 0.95 we will find a confidence interval in which we will meet the parameter µ between the stochastic endpoints
(X̄ − 0.98, X̄ + 0.98)
Yang Feng (Columbia University) Inference in Regression Analysis 38 / 113
Confidence Interval Example, Cont.
Therefore, our 0.95 confidence interval becomes:
(X̄ − 0.98, X̄ + 0.98) = (250.2 − 0.98, 250.2 + 0.98) = (249.22, 251.18).
Yang Feng (Columbia University) Inference in Regression Analysis 39 / 113
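The same interval can be computed directly in R; a minimal sketch, assuming the known σ = 2.5 and the observed mean 250.2 from above:
xbar <- 250.2; sigma <- 2.5; n <- 25
xbar + c(-1, 1) * qnorm(0.975) * sigma / sqrt(n)   # (249.22, 251.18)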
Recap
We know that the point estimator of β1 is
b1 = ∑(Xi − X̄)(Yi − Ȳ) / ∑(Xi − X̄)²
We derived the sampling distribution of b1, it being N(β1, Var(b1)) when σ² is known, with
Var(b1) = σ²{b1} = σ² / ∑(Xi − X̄)²
And we suggested that an estimate of Var(b1) could be arrived at by substituting the MSE for σ² when σ² is unknown:
s²{b1} = MSE / ∑(Xi − X̄)² = [SSE/(n − 2)] / ∑(Xi − X̄)²
Yang Feng (Columbia University) Inference in Regression Analysis 40 / 113
Sampling Distribution of (b1 − β1)/s{b1}
Since b1 is normally distributed,
(b1 − β1) / σ{b1} ∼ N(0, 1)
Using the estimate s²{b1},
(b1 − β1) / s{b1} ∼ t(n − 2),
where
s{b1} = √s²{b1}
Yang Feng (Columbia University) Inference in Regression Analysis 41 / 113
Confidence Interval for β1
Since the “studentized” statistic follows a t distribution, we can make the following probability statement:
P(t(α/2; n − 2) ≤ (b1 − β1)/s{b1} ≤ t(1 − α/2; n − 2)) = 1 − α
By symmetry,
t(α/2; n − 2) = −t(1 − α/2; n − 2),
and we have the following confidence interval for β1:
P(b1 − t(1 − α/2; n − 2) s{b1} ≤ β1 ≤ b1 + t(1 − α/2; n − 2) s{b1}) = 1 − α
Look up the t distribution table (table B.2 in the appendix) to produce confidence intervals, or use the R function
qt(1 - alpha/2, n - 2)
Yang Feng (Columbia University) Inference in Regression Analysis 42 / 113
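In practice the whole interval drops out of one fitted model; a minimal R sketch, where x and y are hypothetical names for a predictor and response:
fit   <- lm(y ~ x)                               # least squares fit
b1    <- coef(fit)[2]                            # slope estimate
s_b1  <- coef(summary(fit))[2, "Std. Error"]     # s{b1}
n     <- length(residuals(fit)); alpha <- 0.05
b1 + c(-1, 1) * qt(1 - alpha/2, n - 2) * s_b1    # 1 - alpha CI for beta_1
confint(fit, "x", level = 0.95)                  # built-in equivalent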
Remember
Density: f(y) = dF(y)/dy
Distribution (CDF): F(y) = P(Y ≤ y) = ∫_{−∞}^{y} f(t) dt
Inverse CDF: F⁻¹(p) = y s.t. ∫_{−∞}^{y} f(t) dt = p
Yang Feng (Columbia University) Inference in Regression Analysis 43 / 113
1− α confidence limits for β1
The 1 − α confidence limits for β1 are
b1 ± t(1 − α/2; n − 2) s{b1}
Note that this quantity can be used to calculate confidence intervals given n and α.
Fixing α can guide the choice of sample size if a particular confidence interval is desired; given a sample size, vice versa.
Also useful for hypothesis testing.
Yang Feng (Columbia University) Inference in Regression Analysis 44 / 113
Tests Concerning β1
Example 1 (two-sided test)
H0 : β1 = 0
Ha : β1 ≠ 0
Test statistic:
t* = (b1 − 0) / s{b1}
Yang Feng (Columbia University) Inference in Regression Analysis 45 / 113
Tests Concerning β1
We have an estimate of the sampling distribution of b1 from the data.
If the null hypothesis holds, then the b1 estimate coming from the data should be within the 95% confidence interval of the sampling distribution centered at 0 (in this case):
t* = (b1 − 0) / s{b1}
Yang Feng (Columbia University) Inference in Regression Analysis 46 / 113
Decision rules
If |t*| ≤ t(1 − α/2; n − 2), accept H0
If |t*| > t(1 − α/2; n − 2), reject H0
Absolute values make the test two-sided
Yang Feng (Columbia University) Inference in Regression Analysis 47 / 113
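A minimal R sketch of this decision rule, continuing with the same hypothetical x and y:
fit    <- lm(y ~ x)
t_star <- coef(summary(fit))[2, "t value"]   # b1 / s{b1}
n <- length(residuals(fit)); alpha <- 0.05
abs(t_star) > qt(1 - alpha/2, n - 2)         # TRUE => reject H0
2 * pt(-abs(t_star), n - 2)                  # two-sided p-value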
Calculating the p-value
The p-value, or attained significance level, is the smallest level of significance α for which the observed data indicate that the null hypothesis should be rejected.
This can be looked up using the CDF of the test statistic.
Yang Feng (Columbia University) Inference in Regression Analysis 48 / 113
p-value Example
An experiment is performed to determine whether a coin flip is fair (50% chance, each, of landing heads or tails) or unfairly biased (a greater chance of one of the outcomes).
Outcome: Suppose that the experimental results show the coin turning up heads 14 times out of 20 total flips.
p-value: The p-value of this result would be the chance of a fair coin landing on heads at least 14 times out of 20 flips.
Calculation:
Prob(14 heads) + Prob(15 heads) + · · · + Prob(20 heads)   (8)
= (1/2²⁰) [C(20,14) + C(20,15) + · · · + C(20,20)] = 60,460/1,048,576 ≈ 0.058   (9)
Two-sided p-value: 2 × 0.058 = 0.116
Yang Feng (Columbia University) Inference in Regression Analysis 49 / 113
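The same tail probability takes one line of base R, with no extra data needed:
sum(dbinom(14:20, size = 20, prob = 0.5))       # one-sided: 60460/1048576 ≈ 0.0577
2 * sum(dbinom(14:20, size = 20, prob = 0.5))   # two-sided ≈ 0.115 (0.116 after rounding)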
Power of the test
Power = P{Reject H0 | δ}   (10)
= P{|t*| > t(1 − α/2; n − 2) | δ}   (11)
where δ is the noncentrality measure
δ = |β1| / σ{b1}
t* = [(b1 − β1)/σ(b1) + β1/σ(b1)] / [s(b1)/σ(b1)]   (13)
Table B.5 presents the power of the two-sided t test.
Notice: the power depends on the value of σ².
R code:
a <- qt(1 - alpha/2, n - 2)
1 - pt(a, n - 2, ncp = delta) + pt(-a, n - 2, ncp = delta)
Yang Feng (Columbia University) Inference in Regression Analysis 50 / 113
Inferences Concerning β0
Largely, inference procedures regarding β0 can be performed in the same way as those for β1.
Remember the point estimator b0 of β0:
b0 = Ȳ − b1X̄
Yang Feng (Columbia University) Inference in Regression Analysis 51 / 113
Sampling distribution of b0
The sampling distribution of b0 refers to the different values of b0 that would be obtained with repeated sampling when the levels of the predictor variable X are held constant from sample to sample.
For the normal error regression model, the sampling distribution of b0 is normal.
Yang Feng (Columbia University) Inference in Regression Analysis 52 / 113
Sampling distribution of b0
When the error variance is known:
E(b0) = β0
σ²{b0} = σ² (1/n + X̄² / ∑(Xi − X̄)²)
When the error variance is unknown:
s²{b0} = MSE (1/n + X̄² / ∑(Xi − X̄)²)
Yang Feng (Columbia University) Inference in Regression Analysis 53 / 113
Confidence interval for β0
The 1 − α confidence limits for β0 are obtained in the same manner as those for β1:
b0 ± t(1 − α/2; n − 2) s{b0}
Yang Feng (Columbia University) Inference in Regression Analysis 54 / 113
Considerations on Inferences on β0 and β1
Effects of departures from normality of the Yi:
The estimators of β0 and β1 have the property of asymptotic normality; their distributions approach normality as the sample size increases (under general conditions).
Spacing of the X levels:
The variances of b0 and b1 (for a given n and σ²) depend strongly on the spacing of X.
Yang Feng (Columbia University) Inference in Regression Analysis 55 / 113
Sampling distribution of point estimator of mean response
Let Xh be the level of X for which we would like an estimate of the mean response; it does not need to be one of the observed X's.
The mean response when X = Xh is denoted by
E(Yh) = β0 + β1Xh
The point estimator of E(Yh) is
Ŷh = b0 + b1Xh
We are interested in the sampling distribution of this quantity.
We are interested in the sampling distribution of this quantity
Yang Feng (Columbia University) Inference in Regression Analysis 56 / 113
Sampling Distribution of Yh
We have
Ŷh = b0 + b1Xh
Since this quantity is itself a linear combination of the Yi's, its sampling distribution is itself normal.
The mean of the sampling distribution is
E{Ŷh} = E{b0} + E{b1}Xh = β0 + β1Xh
Biased or unbiased?
Yang Feng (Columbia University) Inference in Regression Analysis 57 / 113
Sampling Distribution of Yh
To derive the variance of the sampling distribution of the mean response, we first show that b1 and (1/n)∑Yi are uncorrelated and, hence, for the normal error regression model, independent.
We start with the definitions
Ȳ = ∑ (1/n)Yi
b1 = ∑ kiYi,  ki = (Xi − X̄) / ∑(Xi − X̄)²
Yang Feng (Columbia University) Inference in Regression Analysis 58 / 113
Sampling Distribution of Yh
We want to show that the mean response and the estimate b1 are uncorrelated:
Cov(Ȳ, b1) = σ²{Ȳ, b1} = 0
To do this we need the following result (A.32):
σ²{∑ aiYi, ∑ ciYi} = ∑ aici σ²{Yi}
when the Yi are independent.
Yang Feng (Columbia University) Inference in Regression Analysis 59 / 113
Sampling Distribution of Yh
Using this fact, we have
σ²{∑ (1/n)Yi, ∑ kiYi} = ∑ (1/n) ki σ²{Yi}
= ∑ (1/n) ki σ²
= (σ²/n) ∑ ki
= 0
So Ȳ and b1 are uncorrelated.
Yang Feng (Columbia University) Inference in Regression Analysis 60 / 113
Sampling Distribution of Yh
This means that we can write down the variance
σ²{Ŷh} = σ²{Ȳ + b1(Xh − X̄)}
(an alternative and equivalent form of the regression function).
But we know that Ȳ and b1 are uncorrelated, so
σ²{Ŷh} = σ²{Ȳ} + σ²{b1}(Xh − X̄)²
Yang Feng (Columbia University) Inference in Regression Analysis 61 / 113
Sampling Distribution of Yh
We know
σ²{b1} = σ² / ∑(Xi − X̄)²
s²{b1} = MSE / ∑(Xi − X̄)²
And we can find
σ²{Ȳ} = (1/n²) ∑ σ²{Yi} = nσ²/n² = σ²/n
Yang Feng (Columbia University) Inference in Regression Analysis 62 / 113
Sampling Distribution of Yh
So, plugging in, we get
σ²{Ŷh} = σ²/n + [σ² / ∑(Xi − X̄)²] (Xh − X̄)²
or
σ²{Ŷh} = σ² (1/n + (Xh − X̄)² / ∑(Xi − X̄)²)
Yang Feng (Columbia University) Inference in Regression Analysis 63 / 113
Sampling Distribution of Yh
Since we often won't know σ², we can, as usual, plug in our estimate s² = SSE/(n − 2) to get the estimated variance of this sampling distribution:
s²{Ŷh} = s² (1/n + (Xh − X̄)² / ∑(Xi − X̄)²)
Yang Feng (Columbia University) Inference in Regression Analysis 64 / 113
No surprise. . .
The studentized point estimator for the mean response is distributed as a t-distribution with n − 2 degrees of freedom:
(Ŷh − E{Yh}) / s{Ŷh} ∼ t(n − 2)
This means that we can construct confidence intervals in the same manner as before.
Yang Feng (Columbia University) Inference in Regression Analysis 65 / 113
Confidence Intervals for E(Yh)
The 1 − α confidence intervals for E(Yh) are
Ŷh ± t(1 − α/2; n − 2) s{Ŷh}
From this hypothesis tests can be constructed as usual.
Yang Feng (Columbia University) Inference in Regression Analysis 66 / 113
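R computes these limits directly from a fitted model; a minimal sketch, where x, y, and the level 65 are hypothetical:
fit <- lm(y ~ x)
predict(fit, data.frame(x = 65), interval = "confidence", level = 0.95)
# columns: fit = Yh_hat, lwr/upr = Yh_hat -/+ t(1 - alpha/2; n - 2) * s{Yh_hat}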
Comments
The variance of the estimator of E(Yh) is smallest near the mean of X. Designing studies such that the mean of X is near Xh will improve inference precision.
When Xh is zero, the variance of the estimator of E(Yh) reduces to the variance of the estimator b0 of β0.
Yang Feng (Columbia University) Inference in Regression Analysis 67 / 113
Confidence Band for Regression Line
At times, we want to get a confidence band for the entire regression line E{Y} = β0 + β1X.
The Working-Hotelling 1 − α confidence band is
Ŷh ± W × s{Ŷh}
where W² = 2F(1 − α; 2, n − 2).
Same form as before, except the t multiple is replaced with the W multiple.
R code:
2 * qf(1 - alpha, 2, n - 2)   # W^2
Yang Feng (Columbia University) Inference in Regression Analysis 68 / 113
Prediction interval for single new observation
Essentially follows the sampling distribution arguments for E(Yh).
If all regression parameters are known, then the 1 − α prediction interval for a new observation Yh is
E{Yh} ± z(1 − α/2) σ
Yang Feng (Columbia University) Inference in Regression Analysis 69 / 113
Prediction interval for single new observation
If the regression parameters are unknown, the 1 − α prediction interval for a new observation Yh is given by the following theorem:
Ŷh ± t(1 − α/2; n − 2) s{pred}
This is very nearly the same as prediction for a known value of X, but includes a correction for the additional variability arising from the fact that the new input location was not used in the original estimates of b1, b0, and s².
Yang Feng (Columbia University) Inference in Regression Analysis 70 / 113
Prediction interval for single new observation
We have
σ²{pred} = σ²{Yh − Ŷh} = σ²{Yh} + σ²{Ŷh} = σ² + σ²{Ŷh}
An unbiased estimator of σ²{pred} is s²{pred} = MSE + s²{Ŷh}, which is given by
s²{pred} = MSE [1 + 1/n + (Xh − X̄)² / ∑(Xi − X̄)²]
Confidence Interval: represents an inference on a parameter, and is an interval which is intended to cover the value of the parameter.
Prediction Interval: a statement about the value to be taken by a random variable. Wider than the confidence interval.
Yang Feng (Columbia University) Inference in Regression Analysis 71 / 113
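The corresponding R call only changes the interval type; a minimal sketch with the same hypothetical fit:
predict(fit, data.frame(x = 65), interval = "prediction", level = 0.95)
# wider than the "confidence" interval above, since MSE adds to s^2{Yh_hat}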
Prediction interval for the mean of m new observations for given Xh
Ŷh ± t(1 − α/2; n − 2) s{predmean}
An unbiased estimator of σ²{predmean} is s²{predmean} = MSE/m + s²{Ŷh}, which is given by
s²{predmean} = MSE [1/m + 1/n + (Xh − X̄)² / ∑(Xi − X̄)²]
Yang Feng (Columbia University) Inference in Regression Analysis 72 / 113
ANOVA
ANOVA is nothing new but is instead a way of organizing the parts of linear regression so as to make easy inference recipes.
We will return to ANOVA when discussing multiple regression and other types of linear statistical models.
Yang Feng (Columbia University) Inference in Regression Analysis 73 / 113
Partitioning Total Sum of Squares
“The ANOVA approach is based on the partitioning of sums of squares and degrees of freedom associated with the response variable Y.”
We start with the observed deviations of Yi around the observed mean:
Yi − Ȳ
Yang Feng (Columbia University) Inference in Regression Analysis 74 / 113
Partitioning of Total Deviations
Yang Feng (Columbia University) Inference in Regression Analysis 75 / 113
Measure of Total Variation
The measure of total variation is denoted by
SSTO = ∑(Yi − Ȳ)²
SSTO stands for total sum of squares.
If all Yi's are the same, SSTO = 0.
The greater the variation of the Yi's, the greater SSTO.
Yang Feng (Columbia University) Inference in Regression Analysis 76 / 113
Variation after predictor effect
The measure of variation of the Yi's that is still present when the predictor variable X is taken into account is the sum of the squared deviations
SSE = ∑(Yi − Ŷi)²
SSE denotes the error sum of squares.
Yang Feng (Columbia University) Inference in Regression Analysis 77 / 113
Regression Sum of Squares
The difference between SSTO and SSE is SSR:
SSR = ∑(Ŷi − Ȳ)²
SSR stands for regression sum of squares.
Yang Feng (Columbia University) Inference in Regression Analysis 78 / 113
Partitioning of Sum of Squares
Yi − Ȳ   (total deviation)
= Ŷi − Ȳ   (deviation of fitted regression value around mean)
+ Yi − Ŷi   (deviation around fitted regression line)
Yang Feng (Columbia University) Inference in Regression Analysis 79 / 113
Remarkable Property
The sums of these same deviations, squared, satisfy the same decomposition:
∑(Yi − Ȳ)² = ∑(Ŷi − Ȳ)² + ∑(Yi − Ŷi)²
or
SSTO = SSR + SSE
Yang Feng (Columbia University) Inference in Regression Analysis 80 / 113
Proof
∑(Yi − Ȳ)² = ∑[(Ŷi − Ȳ) + (Yi − Ŷi)]²
= ∑[(Ŷi − Ȳ)² + (Yi − Ŷi)² + 2(Ŷi − Ȳ)(Yi − Ŷi)]
= ∑(Ŷi − Ȳ)² + ∑(Yi − Ŷi)² + 2∑(Ŷi − Ȳ)(Yi − Ŷi)
but
∑(Ŷi − Ȳ)(Yi − Ŷi) = ∑Ŷi(Yi − Ŷi) − ∑Ȳ(Yi − Ŷi) = 0
by properties previously demonstrated, namely
∑Ŷiei = 0
and
∑ei = 0
Yang Feng (Columbia University) Inference in Regression Analysis 81 / 113
Breakdown of Degrees of Freedom
SSTO
1 linear constraint due to the calculation and inclusion of the mean
n − 1 degrees of freedom
SSE
2 linear constraints arising from the estimation of β1 and β0
n − 2 degrees of freedom
SSR
Two degrees of freedom in the regression parameters; one is lost due to the linear constraint
1 degree of freedom
Yang Feng (Columbia University) Inference in Regression Analysis 82 / 113
Mean Squares
A sum of squares divided by its associated degrees of freedom is called a mean square.
The regression mean square is
MSR = SSR/1 = SSR
The mean square error is
MSE = SSE/(n − 2)
Yang Feng (Columbia University) Inference in Regression Analysis 83 / 113
ANOVA table for simple lin. regression
Source of Variation   SS                   df      MS                   E(MS)
Regression            SSR = ∑(Ŷi − Ȳ)²     1       MSR = SSR/1          σ² + β1² ∑(Xi − X̄)²
Error                 SSE = ∑(Yi − Ŷi)²    n − 2   MSE = SSE/(n − 2)    σ²
Total                 SSTO = ∑(Yi − Ȳ)²    n − 1
Yang Feng (Columbia University) Inference in Regression Analysis 84 / 113
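R prints essentially this table for a fitted simple regression; a minimal sketch with the usual hypothetical x and y:
anova(lm(y ~ x))   # rows: x (regression) and Residuals (error); columns: Df, Sum Sq, Mean Sq, F value, Pr(>F)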
E{MSE} = σ²
Remember the following theorem, presented in an earlier lecture:
For the normal error regression model, SSE/σ² is distributed as χ² with n − 2 degrees of freedom and is independent of both b0 and b1.
Rewritten, this yields
SSE/σ² ∼ χ²(n − 2)
Since the mean of a χ² variable equals its degrees of freedom, E{SSE/σ²} = n − 2,
and thus E{SSE/(n − 2)} = E{MSE} = σ².
Yang Feng (Columbia University) Inference in Regression Analysis 85 / 113
E{MSR} = σ² + β1² ∑(Xi − X̄)²
To begin, we take an alternative but equivalent form for SSR:
SSR = b1² ∑(Xi − X̄)²
And note that, by the definition of variance, we can write
σ²{b1} = E{b1²} − (E{b1})²
Yang Feng (Columbia University) Inference in Regression Analysis 86 / 113
E{MSR} = σ² + β1² ∑(Xi − X̄)²
But we know that b1 is an unbiased estimator of β1, so E{b1} = β1.
We also know (from previous lectures) that
σ²{b1} = σ² / ∑(Xi − X̄)²
So we can rearrange terms and plug in:
σ²{b1} = E{b1²} − (E{b1})²
E{b1²} = σ² / ∑(Xi − X̄)² + β1²
Yang Feng (Columbia University) Inference in Regression Analysis 87 / 113
E{MSR} = σ² + β1² ∑(Xi − X̄)²
From the previous slide,
E{b1²} = σ² / ∑(Xi − X̄)² + β1²
which brings us to this result:
E{MSR} = E{SSR/1} = E{b1²} ∑(Xi − X̄)² = σ² + β1² ∑(Xi − X̄)²
Yang Feng (Columbia University) Inference in Regression Analysis 88 / 113
Comments and Intuition
The mean of the sampling distribution of MSE is σ² regardless of whether X and Y are linearly related (i.e. whether β1 = 0).
The mean of the sampling distribution of MSR is also σ² when β1 = 0.
When β1 = 0, the sampling distributions of MSR and MSE tend to be the same.
Yang Feng (Columbia University) Inference in Regression Analysis 89 / 113
F Test of β1 = 0 vs. β1 6= 0
ANOVA provides a battery of useful tests. For example, ANOVA provides an easy two-sided test of
H0 : β1 = 0 v.s. Ha : β1 ≠ 0
Test statistic from before:
t* = (b1 − 0)/s{b1}
ANOVA test statistic:
F* = MSR/MSE
Yang Feng (Columbia University) Inference in Regression Analysis 90 / 113
Sampling distribution of F ∗
The sampling distribution of F* when H0 : β1 = 0 holds can be derived from Cochran's theorem.
Cochran's theorem: If all n observations Yi come from the same normal distribution with mean µ and variance σ², and SSTO is decomposed into k sums of squares SSr, each with degrees of freedom dfr, then the SSr/σ² terms are independent χ² variables with dfr degrees of freedom, provided
∑ dfr = n − 1 (summing over r = 1, . . . , k)
Yang Feng (Columbia University) Inference in Regression Analysis 91 / 113
The F Test
We have decomposed SSTO into two sums of squares, SSR and SSE, and their degrees of freedom are additive. Hence, by Cochran's theorem: if β1 = 0, so that all Yi have the same mean µ = β0 and the same variance σ², then SSE/σ² and SSR/σ² are independent χ² variables.
Yang Feng (Columbia University) Inference in Regression Analysis 92 / 113
F ∗ Test Statistic
F* can be written as follows:
F* = MSR/MSE = [(SSR/σ²)/1] / [(SSE/σ²)/(n − 2)]
But by Cochran's theorem, we have, when H0 holds,
F* ∼ [χ²(1)/1] / [χ²(n − 2)/(n − 2)]
Yang Feng (Columbia University) Inference in Regression Analysis 93 / 113
F Distribution
The F distribution is the ratio of two independent χ² random variables, each normalized by its degrees of freedom.
The test statistic F* follows the distribution
F* ∼ F(1, n − 2)
Yang Feng (Columbia University) Inference in Regression Analysis 94 / 113
Hypothesis Test Decision Rule
Since F* is distributed as F(1, n − 2) when H0 holds, the decision rule to follow when the risk of a Type I error is to be controlled at α is:
If F* ≤ F(1 − α; 1, n − 2), conclude H0
If F* > F(1 − α; 1, n − 2), conclude Ha
Yang Feng (Columbia University) Inference in Regression Analysis 95 / 113
F distribution
PDF, CDF, Inverse CDF of F distribution
Note that MSR/MSE must be large in order to reject the null hypothesis.
Yang Feng (Columbia University) Inference in Regression Analysis 96 / 113
Partitioning of Total Deviations
Does this make sense? When is MSR/MSE big?
Yang Feng (Columbia University) Inference in Regression Analysis 97 / 113
Equivalence of F test and two-sided t test
F* = MSR/MSE   (14)
= b1² ∑(Xi − X̄)² / MSE   (15)
= b1² / s²{b1}   (16)
= (b1 / s{b1})²   (17)
= (t*)²   (18)
In addition: F(1 − α; 1, n − 2) = t(1 − α/2; n − 2)².
Yang Feng (Columbia University) Inference in Regression Analysis 98 / 113
General Linear Test
The test of β1 = 0 versus β1 ≠ 0 is a simple example of a general linear test.
The general linear test has three parts:
Full Model
Reduced Model
Test Statistic
Yang Feng (Columbia University) Inference in Regression Analysis 99 / 113
Full Model Fit
A full linear model is first fit to the data:
Yi = β0 + β1Xi + εi
Using this model, the error sum of squares is obtained; here, for example, the simple linear model with non-zero slope is the “full” model:
SSE(F) = ∑[Yi − (b0 + b1Xi)]² = ∑(Yi − Ŷi)² = SSE
Yang Feng (Columbia University) Inference in Regression Analysis 100 / 113
Fit Reduced Model
One can test the hypothesis that a simpler model is a “better” model via a general linear test (which is really a likelihood ratio test in disguise). For instance, consider a “reduced” model in which the slope is zero (i.e. no relationship between input and output):
H0 : β1 = 0
Ha : β1 ≠ 0
The model when H0 holds is called the reduced or restricted model:
Yi = β0 + εi
The SSE for the reduced model is obtained:
SSE(R) = ∑(Yi − b0)² = ∑(Yi − Ȳ)² = SSTO
Yang Feng (Columbia University) Inference in Regression Analysis 101 / 113
Test Statistic
The idea is to compare the two error sums of squares, SSE(F) and SSE(R).
Because the full model F has more parameters than the reduced model, SSE(F) ≤ SSE(R) always.
In the general linear test, the test statistic is
F* = [(SSE(R) − SSE(F)) / (dfR − dfF)] / [SSE(F) / dfF]
which follows the F distribution when H0 holds.
dfR and dfF are the degrees of freedom associated with the reduced and full model error sums of squares, respectively.
Yang Feng (Columbia University) Inference in Regression Analysis 102 / 113
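In R the general linear test for nested models is one anova() call; a minimal sketch with the usual hypothetical x and y:
full    <- lm(y ~ x)     # full model: non-zero slope
reduced <- lm(y ~ 1)     # reduced model under H0: beta_1 = 0 (intercept only)
anova(reduced, full)     # F* = [(SSE(R)-SSE(F))/(dfR-dfF)] / [SSE(F)/dfF]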
R2 (Coefficient of determination)
SSTO measures the variation in the observations Yi when X is not considered.
SSE measures the variation in the Yi after a predictor variable X is employed.
A natural measure of the effect of X in reducing variation in Y is to express the reduction in variation (SSTO − SSE = SSR) as a proportion of the total variation:
R² = SSR/SSTO = 1 − SSE/SSTO
Note that since 0 ≤ SSE ≤ SSTO, we have 0 ≤ R² ≤ 1.
Yang Feng (Columbia University) Inference in Regression Analysis 103 / 113
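In R, R² is part of the model summary; a one-line sketch (and, in simple linear regression only, it equals the squared sample correlation):
fit <- lm(y ~ x)          # hypothetical data as before
summary(fit)$r.squared    # R^2 = 1 - SSE/SSTO
cor(x, y)^2               # identical in the one-predictor case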
Limitations of and misunderstandings about R2
1 Claim: a high R² indicates that useful predictions can be made. The prediction interval for a particular input of interest may still be wide even if R² is high.
2 Claim: a high R² means that there is a good linear fit between predictor and output. An approximate (bad) linear fit to a truly curvilinear relationship can still result in a high R².
3 Claim: a low R² means that there is no relationship between input and output. Also not true, since there can be clear and strong relationships between input and output that are not well explained by a linear functional relationship.
Yang Feng (Columbia University) Inference in Regression Analysis 104 / 113
Coefficient of Correlation
r = ±√R²
Range: −1 ≤ r ≤ 1
If b1 > 0, r = √R²;
if b1 < 0, r = −√R².
Yang Feng (Columbia University) Inference in Regression Analysis 105 / 113
Normal Correlation Models
Sometimes, correlation models are more natural than regression models, such as:
Relationship between sales of gasoline and sales of auxiliary products.
Relationship between blood pressure and age.
Difference between regression models and correlation models:
Regression models: X values are known constants.
Correlation models: both X and Y are random.
Yang Feng (Columbia University) Inference in Regression Analysis 106 / 113
Bivariate Normal Density
f(Y1, Y2) = [1 / (2πσ1σ2√(1 − ρ12²))] exp{ −[1 / (2(1 − ρ12²))] [((Y1 − µ1)/σ1)² − 2ρ12((Y1 − µ1)/σ1)((Y2 − µ2)/σ2) + ((Y2 − µ2)/σ2)²] }
Parameters: µ1, µ2, σ1, σ2, ρ12
Here, ρ12 is the coefficient of correlation between the variables Y1 and Y2:
ρ12 = ρ{Y1, Y2} = σ12 / (σ1σ2)
σ12 = σ{Y1, Y2} = E{(Y1 − µ1)(Y2 − µ2)}
Yang Feng (Columbia University) Inference in Regression Analysis 107 / 113
Yang Feng (Columbia University) Inference in Regression Analysis 108 / 113
Marginal Density
The marginal distribution of Y1 is normal with mean µ1 and standard deviation σ1:
f1(Y1) = [1/(√(2π)σ1)] exp[−(1/2)((Y1 − µ1)/σ1)²]
How to get it from the bivariate density function?
Yang Feng (Columbia University) Inference in Regression Analysis 109 / 113
Conditional Inferences
f(Y1|Y2) = f(Y1, Y2) / f2(Y2)
= [1/(√(2π)σ1|2)] exp[−(1/2)((Y1 − α1|2 − β12Y2)/σ1|2)²]
Here
α1|2 = µ1 − µ2ρ12(σ1/σ2)
β12 = ρ12(σ1/σ2)
σ²1|2 = σ1²(1 − ρ12²)
Yang Feng (Columbia University) Inference in Regression Analysis 110 / 113
Inferences on Correlation Coefficients
Maximum likelihood estimation of ρ12 (the Pearson product-moment correlation coefficient):
r12 = ∑(Yi1 − Ȳ1)(Yi2 − Ȳ2) / [∑(Yi1 − Ȳ1)² ∑(Yi2 − Ȳ2)²]^(1/2)
Usually biased, but the bias is small when n is large.
Test whether ρ12 = 0:
H0 : ρ12 = 0 v.s. H1 : ρ12 ≠ 0
Test statistic:
t* = r12 √(n − 2) / √(1 − r12²)
If H0 holds, t* follows the t(n − 2) distribution.
Yang Feng (Columbia University) Inference in Regression Analysis 111 / 113
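This is exactly what cor.test computes; a minimal sketch, where y1 and y2 are hypothetical paired samples:
cor.test(y1, y2)   # Pearson r12, t* with n - 2 df, p-value, and a CI via Fisher's z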
Interval Estimation of ρ12
Make the Fisher z transformation:
z′ = (1/2) loge[(1 + r12)/(1 − r12)]
When n is large (n > 25), z′ is approximately normally distributed with
E{z′} = ζ = (1/2) loge[(1 + ρ12)/(1 − ρ12)]
σ²{z′} = 1/(n − 3)
Then we can make an interval estimate:
(z′ − ζ)/σ{z′}
is approximately standard normal. Then the 1 − α confidence limits for ζ are
z′ ± z(1 − α/2)σ{z′}
Yang Feng (Columbia University) Inference in Regression Analysis 112 / 113
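A minimal R sketch of the Fisher z interval (y1 and y2 are hypothetical paired samples; tanh maps the limits back to the ρ scale):
r12 <- cor(y1, y2); n <- length(y1)
zp  <- atanh(r12)                                   # z' = (1/2) log((1 + r)/(1 - r))
tanh(zp + c(-1, 1) * qnorm(0.975) / sqrt(n - 3))    # 95% limits for rho_12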
Spearman Rank Correlation Coefficient
Denote the rank of Yi1 by Ri1 and the rank of Yi2 by Ri2; the ranks run from 1 to n.
The Spearman rank correlation coefficient rS is then defined as
rS = ∑(Ri1 − R̄1)(Ri2 − R̄2) / [∑(Ri1 − R̄1)² ∑(Ri2 − R̄2)²]^(1/2)
As before, −1 ≤ rS ≤ 1.
More robust than the Pearson correlation coefficient.
Yang Feng (Columbia University) Inference in Regression Analysis 113 / 113
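In R this is just the Pearson formula applied to ranks, or one argument to cor; a minimal sketch with the same hypothetical y1 and y2:
cor(rank(y1), rank(y2))            # Spearman via the definition above
cor(y1, y2, method = "spearman")   # equivalent built-in option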
Diagnostics and Remedial Measures
Yang Feng
Yang Feng (Columbia University) Diagnostics and Remedial Measures 1 / 76
Remedial Measures
How do we know that the regression function is a good explainer of the observed data?
- Plotting
- Tests
What if it is not? What can we do about it?
- Transformation of variables
Yang Feng (Columbia University) Diagnostics and Remedial Measures 2 / 76
Graphical Diagnostics for the Predictor Variable
Dot Plot - useful for visualizing the distribution of inputs
Sequence Plot - useful for visualizing dependencies between error terms
Box Plot - useful for visualizing the distribution of inputs
Toluca manufacturing example again: production time vs. lot size
Yang Feng (Columbia University) Diagnostics and Remedial Measures 3 / 76
Dot Plot
How many observations per input value?
Range of inputs?
Yang Feng (Columbia University) Diagnostics and Remedial Measures 4 / 76
Sequence Plot
If observations are made over time, is there a correlation between input and position in the observation sequence?
Yang Feng (Columbia University) Diagnostics and Remedial Measures 5 / 76
Box Plot
Shows
- Median
- 1st and 3rd quartiles
- Maximum and minimum
Yang Feng (Columbia University) Diagnostics and Remedial Measures 6 / 76
Residuals
Recall the definition of residuals:
ei = Yi − Ŷi
And the difference between that and the unknown true error:
εi = Yi − E(Yi)
In a normal regression model the εi's are assumed to be iid N(0, σ²) random variables. The observed residuals ei should reflect these properties.
Yang Feng (Columbia University) Diagnostics and Remedial Measures 7 / 76
Remember: residual properties
Mean:
ē = ∑ei / n = 0
Variance:
s² = ∑(ei − ē)² / (n − 2) = SSE/(n − 2) = MSE
Yang Feng (Columbia University) Diagnostics and Remedial Measures 8 / 76
Nonindependence of Residuals
The residuals ei are not independent random variables:
- The fitted values Ŷi are based on the same fitted regression line.
- The residuals are subject to two constraints:
1 - the sum of the ei's equals 0
2 - the sum of the products Xiei equals 0
When the sample size is large in comparison to the number of parameters in the regression model, the dependency effect among the residuals ei can be safely ignored.
Yang Feng (Columbia University) Diagnostics and Remedial Measures 9 / 76
Definition: semistudentized residuals
It may be useful sometimes to look at a standardized set of residuals, for instance in outlier detection.
As usual, since the standard deviation of εi is σ (itself estimated by the square root of MSE), a natural form of standardization to consider is
ei* = ei / √MSE
This is called a semistudentized residual.
Yang Feng (Columbia University) Diagnostics and Remedial Measures 10 / 76
Departures from Model...
To be studied by residuals
Regression function not linear
Error terms do not have constant variance
Error terms are not independent
Model fits all but one or a few outlier observations
Error terms are not normally distributed
One or more predictor variables have been omitted from the model
Yang Feng (Columbia University) Diagnostics and Remedial Measures 11 / 76
Diagnostics for Residuals
Plot of residuals against predictor variable
Plot of absolute or squared residuals against predictor variable
Plot of residuals against fitted values
Plot of residuals against time or other sequence
Plot of residuals against omitted predictor variables
Box plot of residuals
Normal probability plot of residuals
Yang Feng (Columbia University) Diagnostics and Remedial Measures 12 / 76
1. Test for nonlinearity of the regression function: residual plot against the predictor variable
Figure : Transit example : ridership increase vs. num. maps distributed
There should be no systematic relationship between the residuals and the predictor variable if the relation is linear.
Yang Feng (Columbia University) Diagnostics and Remedial Measures 13 / 76
Prototype Residual Plots
Figure : prototype residual plots
Yang Feng (Columbia University) Diagnostics and Remedial Measures 14 / 76
2. Nonconstancy of Error Variance
Yang Feng (Columbia University) Diagnostics and Remedial Measures 15 / 76
3. Presence of Outliers
Outliers can strongly affect the fitted values of the regression line.
Yang Feng (Columbia University) Diagnostics and Remedial Measures 16 / 76
Outlier effect on residuals
Yang Feng (Columbia University) Diagnostics and Remedial Measures 17 / 76
4. Nonindependence of Error Terms
Sequential observations can exhibit observable trends in the error distribution.
Application: time series.
Yang Feng (Columbia University) Diagnostics and Remedial Measures 18 / 76
5. Non-normality of Error Terms
Distribution plots (e.g., boxplot).
Comparison of frequencies (e.g., about 68 percent of the residuals should fall between ±√MSE, or about 90 percent between ±1.645√MSE).
Normal probability plot (a Q-Q plot with numerical quantiles on the horizontal axis).
Figure : Examples of non-normality in distribution of error terms
Yang Feng (Columbia University) Diagnostics and Remedial Measures 19 / 76
Normal probability plot
For a N(0, MSE) random variable, a good approximation of the expected value of the k-th smallest observation in a random sample of size n is
√MSE [z((k − .375)/(n + .25))]
A normal probability plot consists of plotting the expected value of the k-th smallest observation against the observed k-th smallest observation.
Yang Feng (Columbia University) Diagnostics and Remedial Measures 20 / 76
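Base R draws this plot directly from the residuals; a minimal sketch with the usual hypothetical fit:
fit <- lm(y ~ x)
qqnorm(resid(fit))   # expected normal quantiles vs ordered residuals
qqline(resid(fit))   # reference line; systematic departures suggest non-normality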
6. Omission of Important Predictor Variables
Example: a qualitative variable, such as type of machine.
Partitioning the data can reveal dependence on omitted variable(s).
Works for quantitative variables as well.
Can suggest that inclusion of other inputs is important.
Yang Feng (Columbia University) Diagnostics and Remedial Measures 21 / 76
Tests Involving Residuals
Tests for randomness (run test, Durbin-Watson test; Chapter 12)
Tests for constancy of variance (Brown-Forsythe test, Breusch-Pagan test; Section 3.6)
Tests for outliers (fit a new regression line to the other n − 1 observations; details in Chapter 10)
Tests for normality of the error distribution (discussed now)
Yang Feng (Columbia University) Diagnostics and Remedial Measures 22 / 76
Correlation Test for Normality of Error Distribution
For one way to run this correlation test, let
e = {e(1), . . . , e(n)}
be the ordered sequence (from smallest to largest) of the observed residuals. Let
r = {r1, . . . , rk, . . . , rn},
where rk is the expected value of the k-th ordered residual under the normality assumption, i.e.
rk ≈ √MSE [z((k − .375)/(n + .25))]
Then compute the sample correlation between e and r.
Yang Feng (Columbia University) Diagnostics and Remedial Measures 23 / 76
Correlation Test for Normality of Error Distribution
A formal test for the normality of the error terms can be developed in terms of the coefficient of correlation between the residuals ei and their expected values under normality. A high value indicates normality!
Table B.6 in the book gives critical values under the null hypothesis (normally distributed errors).
If the correlation is less than the critical value, reject the null hypothesis!
Toluca Company example: coefficient of correlation 0.991 > 0.959, the critical value for α risk at 0.05. We conclude that the error distribution does not depart substantially from normal.
Yang Feng (Columbia University) Diagnostics and Remedial Measures 24 / 76
Tests for Constancy of Error Variance
The Brown-Forsythe test does not depend on normality of the error terms. It is applicable to simple linear regression when
the variance of the error terms either increases or decreases with X (“megaphone” residual plot), and
the sample size is large enough to ignore dependencies between the residuals.
The Brown-Forsythe test is essentially a t-test for whether the means of two normally distributed populations are the same, where the populations are the absolute deviations between the predictions and the observed outputs in two non-overlapping partitions of the input space.
Yang Feng (Columbia University) Diagnostics and Remedial Measures 25 / 76
Brown-Forsythe Test
Divide X into X1 (the low values of X) and X2 (the high values of X).
Let ei1 be the i-th residual for group X1, and ei2 likewise for X2.
Let ẽ1 and ẽ2 denote the medians of the residuals in the two groups.
Let n = n1 + n2.
The Brown-Forsythe test uses the absolute deviations of the residuals around their group medians:
di1 = |ei1 − ẽ1|,  di2 = |ei2 − ẽ2|
Yang Feng (Columbia University) Diagnostics and Remedial Measures 26 / 76
Brown-Forsythe Test
The test statistic for comparing the means of the absolute deviations of the residuals around the group medians is
t*BF = (d̄1 − d̄2) / (s √(1/n1 + 1/n2))
where the pooled variance is
s² = [∑(di1 − d̄1)² + ∑(di2 − d̄2)²] / (n − 2)
Yang Feng (Columbia University) Diagnostics and Remedial Measures 27 / 76
Brown-Forsythe Test
If n1 and n2 are not extremely small, then approximately
t*BF ∼ t(n − 2)
From this, confidence intervals and tests can be constructed.
Yang Feng (Columbia University) Diagnostics and Remedial Measures 28 / 76
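A minimal R sketch of the Brown-Forsythe recipe above (fit, x, and y are the usual hypothetical objects; here the split is at the median of x):
fit <- lm(y ~ x)
e   <- resid(fit)
lo  <- x <= median(x)                     # low-X vs high-X partition
d1  <- abs(e[lo]  - median(e[lo]))        # deviations around group medians
d2  <- abs(e[!lo] - median(e[!lo]))
n1 <- length(d1); n2 <- length(d2)
s2 <- (sum((d1 - mean(d1))^2) + sum((d2 - mean(d2))^2)) / (n1 + n2 - 2)
t_bf <- (mean(d1) - mean(d2)) / sqrt(s2 * (1/n1 + 1/n2))
2 * pt(-abs(t_bf), n1 + n2 - 2)           # approximate two-sided p-value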
Breusch-Pagan Test
Another test for the constancy of the error variance.
Assume the error terms are independently and normally distributed, with
σi² = γ0 + γ1Xi
Then the problem reduces to
H0 : γ1 = 0 v.s. H1 : γ1 ≠ 0
The test statistic is derived in the following two steps:
1 Regress Y on X; get SSE and the residual vector e.
2 Regress e² on X; get SSR*.
X²BP = (SSR*/2) / (SSE/n)²
Yang Feng (Columbia University) Diagnostics and Remedial Measures 29 / 76
BP-test
Under H0, when n is reasonably large,
X²BP ∼ χ²(1)
If X²BP > qchisq(1 − α, 1), reject H0.
Direct function in R: ncvTest in the package car.
Yang Feng (Columbia University) Diagnostics and Remedial Measures 30 / 76
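A minimal R sketch of the two-step recipe (same hypothetical fit; packaged tests noted in the last comment):
e2  <- resid(fit)^2
aux <- lm(e2 ~ x)                              # step 2: regress e^2 on X
ssr_star <- sum((fitted(aux) - mean(e2))^2)    # SSR* of the auxiliary regression
x2bp <- (ssr_star / 2) / (sum(resid(fit)^2) / length(x))^2
x2bp > qchisq(0.95, 1)                         # TRUE => reject constancy at alpha = .05
# packaged alternatives: car::ncvTest(fit), lmtest::bptest(fit, studentize = FALSE)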
F test for lack of fit
A formal test for determining whether a specific type of regression function adequately fits the data.
Assume the observations Y|X are
1 independent
2 normally distributed
3 of the same variance σ²
Requires: repeat observations at one or more X levels (called replicates).
Yang Feng (Columbia University) Diagnostics and Remedial Measures 31 / 76
Example
11 similar branches of a bank offered gifts for setting up money market accounts.
Minimum initial deposits were specified to qualify for the gift.
The value of the gift was proportional to the specified minimum deposit.
Interested in: the relationship between the specified minimum deposit and the number of new accounts opened.
Yang Feng (Columbia University) Diagnostics and Remedial Measures 32 / 76
F Test Example Data and ANOVA Table
Yang Feng (Columbia University) Diagnostics and Remedial Measures 33 / 76
Fit
Yang Feng (Columbia University) Diagnostics and Remedial Measures 34 / 76
Data Arranged To Highlight Replicates
The observed value of the response variable for the i-th replicate at the j-th level of X is Yij.
The mean of the Y observations at the level X = Xj is Ȳj.
Yang Feng (Columbia University) Diagnostics and Remedial Measures 35 / 76
Full Model vs. Regression Model
The full model is
Yij = µj + εij
where
1 the µj are parameters, j = 1, . . . , c
2 the εij are iid N(0, σ²)
Since the error terms have expectation zero,
E(Yij) = µj
Yang Feng (Columbia University) Diagnostics and Remedial Measures 36 / 76
Full Model
In the full model there is a different mean (a free parameter) for each level Xj.
In the regression model the mean responses are constrained to lie on a line:
E(Y) = β0 + β1X
Yang Feng (Columbia University) Diagnostics and Remedial Measures 37 / 76
Fitting the Full Model
The estimators of µj are simply
µ̂j = Ȳj
The error sum of squares of the full model therefore is
SSE(F) = ∑∑(Yij − Ȳj)² = SSPE
SSPE: pure error sum of squares
Yang Feng (Columbia University) Diagnostics and Remedial Measures 38 / 76
Degrees of Freedom
The ordinary total sum of squares had n − 1 degrees of freedom.
Each of the j terms is an ordinary total sum of squares, so each has nj − 1 degrees of freedom.
The number of degrees of freedom of SSPE is the sum of the component degrees of freedom:
dfF = ∑j(nj − 1) = ∑j nj − c = n − c
Yang Feng (Columbia University) Diagnostics and Remedial Measures 39 / 76
General Linear Test
Remember: the general linear test proposes a reduced model
H0 : E(Y) = β0 + β1X (the normal regression model)
Ha : E(Y) ≠ β0 + β1X (the full model, one independent mean for each level of X)
Yang Feng (Columbia University) Diagnostics and Remedial Measures 40 / 76
SSE For Reduced Model
The SSE for the reduced model is as before; remember
SSE(R) = ∑i ∑j [Yij − (b0 + b1Xj)]²
= ∑i ∑j (Yij − Ŷij)²
and it has n − 2 degrees of freedom: dfR = n − 2.
Yang Feng (Columbia University) Diagnostics and Remedial Measures 41 / 76
SSE(R)
Yang Feng (Columbia University) Diagnostics and Remedial Measures 42 / 76
F Test Statistic
From the general linear test approach,
F* = [SSE(R) − SSE(F)] / (dfR − dfF) ÷ SSE(F)/dfF
= (SSE − SSPE) / [(n − 2) − (n − c)] ÷ SSPE/(n − c)
Lack of fit sum of squares:
SSLF = SSE − SSPE
Then, since (n − 2) − (n − c) = c − 2,
F* = SSLF/(c − 2) ÷ SSPE/(n − c) = MSLF/MSPE
Yang Feng (Columbia University) Diagnostics and Remedial Measures 43 / 76
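In R the lack-of-fit test is a nested-model comparison, because the full model is “one free mean per X level”; a minimal sketch (x must contain replicates):
reduced <- lm(y ~ x)            # straight-line model
full    <- lm(y ~ factor(x))    # one free mean per X level
anova(reduced, full)            # F* = MSLF/MSPE with (c - 2, n - c) df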
F Test Rule
From the F test we know that large values of F* lead us to reject the null hypothesis:
If F* ≤ F(1 − α; c − 2, n − c), conclude H0
If F* > F(1 − α; c − 2, n − c), conclude Ha
For this example we have:
Yang Feng (Columbia University) Diagnostics and Remedial Measures 44 / 76
Variance decomposition
SSE = SSPE + SSLF:
∑∑(Yij − Ŷij)² = ∑∑(Yij − Ȳj)² + ∑∑(Ȳj − Ŷij)²
Yang Feng (Columbia University) Diagnostics and Remedial Measures 45 / 76
Example decomposition
Yang Feng (Columbia University) Diagnostics and Remedial Measures 46 / 76
Example decomposition
Yang Feng (Columbia University) Diagnostics and Remedial Measures 47 / 76
Example Conclusion
If we set the significance level to α = .01
and look up the value of the F inverse CDF, F(.99; 4, 5) = 11.4,
we can conclude that the null hypothesis should be rejected.
Yang Feng (Columbia University) Diagnostics and Remedial Measures 48 / 76
A review
Graphical procedures for determining the appropriateness of a regression fit
- Various residual plots
Tests to determine
- Constancy of error variance
- Lack of fit
What do we do if we determine (through testing or otherwise) that the linear regression fit is not good?
Yang Feng (Columbia University) Diagnostics and Remedial Measures 49 / 76
How to fix
If the simple regression model is not appropriate, then there are two choices:
1 Abandon the simple regression model and develop and use a more appropriate model, e.g., logistic regression, nonparametric regression, etc. This may yield better insights, but a more complex model leads to more complex procedures for estimating the parameters. (Later in the course)
2 Employ some transformation of the data so that the simple regression model is appropriate for the transformed data. (This chapter)
Yang Feng (Columbia University) Diagnostics and Remedial Measures 50 / 76
Fixes For...
Nonlinearity of the regression function - transformation(s) (today)
Nonconstancy of error variance - weighted least squares (Chapter 11) and transformations
Nonindependence of error terms - directly model correlation or use first differences (Chapter 12)
Non-normality of error terms - transformation(s) (today)
Omission of important predictor variables - modify the model: multiple regression analysis, Chapter 6 and later on
Outlying observations - robust regression (Chapter 11)
Yang Feng (Columbia University) Diagnostics and Remedial Measures 51 / 76
Nonlinearity of regression function
Direct approach:
Modify the regression model by altering the nature of the regression function. For instance, a quadratic regression function might be used,
E(Y) = β0 + β1X + β2X²
or an exponential function,
E(Y) = β0 β1^X
Such approaches employ a transformation to (approximately) linearize a regression function.
Yang Feng (Columbia University) Diagnostics and Remedial Measures 52 / 76
Quick Questions
How would you fit such models?
How does the exponential regression function relate to regular linearregression?
Where did the error terms go?
Yang Feng (Columbia University) Diagnostics and Remedial Measures 53 / 76
Transformations
Transformations for a nonlinear relation only
- Appropriate when the distribution of the error terms is reasonably close to a normal distribution
- In this situation:
1. transformation of X should be attempted;
2. transformation of Y should not be attempted, because it will materially affect the distribution of the error terms.
Yang Feng (Columbia University) Diagnostics and Remedial Measures 54 / 76
Prototype Regression Patterns
Yang Feng (Columbia University) Diagnostics and Remedial Measures 55 / 76
Example
Experiment
X: days of training received
Y: sales performance (score)
Yang Feng (Columbia University) Diagnostics and Remedial Measures 56 / 76
X′ = √X
Yang Feng (Columbia University) Diagnostics and Remedial Measures 57 / 76
Example Data Transformation
Yang Feng (Columbia University) Diagnostics and Remedial Measures 58 / 76
Graphical Residual Analysis
Yang Feng (Columbia University) Diagnostics and Remedial Measures 59 / 76
Transformations on Y
Non-normality and unequal variances of error terms frequently appear together.
To remedy these in the normal regression model, we need a transformation on Y.
This is because:
- shapes and spreads of the distributions of Y need to be changed
- it may help linearize a curvilinear regression relation
Can be combined with a transformation on X.
Yang Feng (Columbia University) Diagnostics and Remedial Measures 60 / 76
Prototype Regression Patterns and Y Transformations
Transformations on Y:
y′ = √Y
y′ = log10 Y
y′ = 1/Y
Yang Feng (Columbia University) Diagnostics and Remedial Measures 61 / 76
Example
Use of the logarithmic transformation of Y to linearize the regression relation and stabilize the error variance.
Data on age (X) and plasma level of a polyamine (Y) for a portion of the 25 healthy children in a study. Younger children exhibit greater variability than older children.
Yang Feng (Columbia University) Diagnostics and Remedial Measures 62 / 76
Plasma Level vs. Age
Yang Feng (Columbia University) Diagnostics and Remedial Measures 63 / 76
Associated Data
Yang Feng (Columbia University) Diagnostics and Remedial Measures 64 / 76
Associated Data (Cont’)
If we fit a simple linear regression line to the log-transformed Y data, we obtain
Ŷ′ = 1.135 − .1023X
The coefficient of correlation between the ordered residuals and their expected values under normality is .981 (for α = .05, table B.6 in the book shows a critical value of .959).
Normality of the error terms is supported; the regression model for the transformed Y data is appropriate.
Yang Feng (Columbia University) Diagnostics and Remedial Measures 65 / 76
Box Cox Transforms
It can be difficult to graphically determine which transformation of Y is most appropriate for correcting
- skewness of the distributions of error terms
- unequal variances
- nonlinearity of the regression function
The Box-Cox procedure automatically identifies a transformation from the family of power transformations on Y.
Yang Feng (Columbia University) Diagnostics and Remedial Measures 66 / 76
Box Cox Transforms
This family is of the form
Y′ = Y^λ
Examples include:
λ = 2: Y′ = Y²
λ = .5: Y′ = √Y
λ = 0: Y′ = ln Y (by definition)
λ = −.5: Y′ = 1/√Y
λ = −1: Y′ = 1/Y
Yang Feng (Columbia University) Diagnostics and Remedial Measures 67 / 76
Box Cox Cont.
The normal error regression model with the response variable a member of the family of power transformations becomes
Yi^λ = β0 + β1Xi + εi
This model has an additional parameter λ that needs to be estimated.
Maximum likelihood is a way to estimate this parameter.
Yang Feng (Columbia University) Diagnostics and Remedial Measures 68 / 76
Box Cox Maximum Likelihood Estimation
Before setting up the MLE, the observations are further standardized so that the magnitude of the error sum of squares does not depend on the value of λ.
The transformation is given by
Wi = K1(Yi^λ − 1) for λ ≠ 0
Wi = K2 loge(Yi) for λ = 0
where
K2 = (∏Yi)^(1/n)
K1 = 1 / (λ K2^(λ−1))
Yang Feng (Columbia University) Diagnostics and Remedial Measures 69 / 76
Box Cox Maximum Likelihood Estimation
Maximize
log L(X, Y, σ, λ, b1, b0) = −∑i (Wi − (b1Xi + b0))² / (2σ²) − n log(σ)
with respect to λ, σ, b1, b0.
How?
- Take partial derivatives
- Solve
- or... gradient ascent methods
Not an easy task, so we work on a grid of λ values, say λ = −2, −1.75, . . . , 1.75, 2, and calculate a list of MSE values for the different λ's.
Yang Feng (Columbia University) Diagnostics and Remedial Measures 70 / 76
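This grid search comes packaged in R; a minimal sketch using MASS (the fit is hypothetical as before, and Y must be positive):
library(MASS)
bc <- boxcox(lm(y ~ x), lambda = seq(-2, 2, 0.25))   # profile log-likelihood over lambda
bc$x[which.max(bc$y)]                                # lambda with maximum likelihood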
The plasma levels example
Yang Feng (Columbia University) Diagnostics and Remedial Measures 71 / 76
Comments on Box Cox
The Box-Cox procedure is ordinarily used only to provide a guide for selecting a transformation.
At times, theoretical considerations or prior information can be utilized to help in choosing an appropriate transformation.
It is important to perform residual analysis after the transformation to ensure the transformation is appropriate.
When transformed models are employed, the b0 and b1 obtained via least squares have the least squares property with respect to the transformed observations, not the original ones.
Usually take a nearby λ value for which the power transformation is easier to understand: say for an estimate λ = 0.03, we may just use λ = 0. Of course, one should examine the flatness of the likelihood function in the neighborhood of that λ.
When λ is near 1, no transformation of Y is needed.
Yang Feng (Columbia University) Diagnostics and Remedial Measures 72 / 76
Exploration of Shape of Regression Function: Nonparametric Regression Curves
So far: parametric regression approaches
- Linear
- Linear with transformed inputs and outputs
- etc.
Other approaches
- Method of moving averages: interpolate between mean outputs at adjacent inputs
- Lowess: “LOcally WEighted Scatterplot Smoothing”
Yang Feng (Columbia University) Diagnostics and Remedial Measures 73 / 76
Lowess Method
Intuition: fit low-order polynomial (linear) regression models to points in a neighborhood.
- The neighborhood size is a parameter.
- Determining the neighborhood is done via a nearest neighbors algorithm.
Produce predictions by weighting the regressors by how far the set of points used to produce the regressor is from the input point for which a prediction is wanted.
While somewhat ad hoc, it is a method of producing a nonlinear regression function for data that might otherwise seem difficult to regress.
Yang Feng (Columbia University) Diagnostics and Remedial Measures 74 / 76
Lowess Method Example
Yang Feng (Columbia University) Diagnostics and Remedial Measures 75 / 76
R example
require(graphics)
plot(cars, main = "lowess(cars)")
lines(lowess(cars), col = 2)
lines(lowess(cars, f=.2), col = 3)
legend(5, 120, c(paste("f = ", c("2/3", ".2"))),
lty = 1, col = 2:3)
Yang Feng (Columbia University) Diagnostics and Remedial Measures 76 / 76
Simultaneous Inferences and Other Topics in RegressionAnalysis
Yang Feng
Yang Feng (Columbia University) Simultaneous Inferences 1 / 20
Simultaneous Inferences
In Chapter 2, we learned how to construct confidence intervals for β0 and β1.
Suppose we want a confidence level of 95% for β0 and β1 jointly.
One could construct a separate confidence interval for each of β0 and β1. BUT then the probability of both statements holding simultaneously is below 95%.
How do we create a joint confidence interval?
Yang Feng (Columbia University) Simultaneous Inferences 2 / 20
Bonferroni Joint Confidence Intervals
Calculation of Bonferroni joint confidence intervals is a general technique.
We highlight its application in the regression setting: joint confidence intervals for β0 and β1.
Intuition:
Set each statement confidence level to larger than 1 − α so that the family coefficient is at least 1 − α.
BUT how much larger?
Yang Feng (Columbia University) Simultaneous Inferences 3 / 20
Ordinary Confidence Intervals
Start with ordinary confidence intervals for β0 and β1:
b0 ± t(1 − α/2; n − 2) s{b0}
b1 ± t(1 − α/2; n − 2) s{b1}
And ask what is the probability that one or both of these intervals is incorrect.
Remember:
s²{b0} = MSE [1/n + X̄² / ∑(Xi − X̄)²]
s²{b1} = MSE / ∑(Xi − X̄)²
Yang Feng (Columbia University) Simultaneous Inferences 4 / 20
General Procedure
Let A1 denote the event that the first confidence interval does not cover β0, i.e. P(A1) = α.
Let A2 denote the event that the second confidence interval does not cover β1, i.e. P(A2) = α.
We want the probability that both estimates fall in their respective confidence intervals, i.e. P(Ā1 ∩ Ā2).
How do we get there from what we know?
Yang Feng (Columbia University) Simultaneous Inferences 5 / 20
Venn Diagram
Yang Feng (Columbia University) Simultaneous Inferences 6 / 20
Bonferroni inequality
We can see that P(Ā1 ∩ Ā2) = 1 − P(A1) − P(A2) + P(A1 ∩ A2)
In a Venn diagram, the size (area) of a set corresponds to its probability.
It also is clear that P(A1 ∩ A2) ≥ 0
So,
P(Ā1 ∩ Ā2) ≥ 1 − P(A1) − P(A2) = 1 − 2α
Yang Feng (Columbia University) Simultaneous Inferences 7 / 20
Using the Bonferroni inequality cont.
To achieve a 1 − α family confidence level for β0 and β1 (for example) using the Bonferroni procedure, each individual interval must be constructed at a higher confidence level (each per-statement α must shrink).
Returning to our confidence intervals for β0 and β1 from before
b0 ± t(1 − α/2; n − 2) s{b0}
b1 ± t(1 − α/2; n − 2) s{b1}

To achieve a 1 − α family confidence level these intervals must widen to

b0 ± t(1 − α/4; n − 2) s{b0}
b1 ± t(1 − α/4; n − 2) s{b1}
Then P(Ā1 ∩ Ā2) ≥ 1 − P(A1) − P(A2) = 1 − α/2 − α/2 = 1 − α
Yang Feng (Columbia University) Simultaneous Inferences 8 / 20
Using the Bonferroni inequality cont.
The Bonferroni procedure is very general. To make joint confidence statements about multiple simultaneous predictions, remember that

Ŷh ± t(1 − α/2; n − 2) s{pred}

s²{pred} = MSE [ 1 + 1/n + (Xh − X̄)²/Σi(Xi − X̄)² ]

If one is interested in a 1 − α confidence statement about g predictions, then Bonferroni says that the confidence interval for each individual prediction (each h among the g predictions) must get wider:

Ŷh ± t(1 − α/(2g); n − 2) s{pred}
Note: if a sufficiently large number of simultaneous predictions is made, the individual confidence intervals may become so wide that they are no longer useful.
Yang Feng (Columbia University) Simultaneous Inferences 9 / 20
The Toluca Example
Say we want to get a 90 percent joint confidence interval for β0 and β1.
Then we require B = t(1− .1/4; 23) = t(.975, 23) = 2.069
Then we have the joint confidence interval:
b0 ± B · s(b0)

and

b1 ± B · s(b1)
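A small R sketch of this calculation, assuming a fitted model fit <- lm(Y ~ X, data = toluca) with n = 25 (so 23 degrees of freedom); the data frame name is hypothetical:

B <- qt(1 - 0.10/4, df = 23)   # Bonferroni multiplier, 2.069
est <- coef(summary(fit))      # estimates and standard errors
cbind(lower = est[, "Estimate"] - B * est[, "Std. Error"],
      upper = est[, "Estimate"] + B * est[, "Std. Error"])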
Yang Feng (Columbia University) Simultaneous Inferences 10 / 20
Confidence Band for Regression Line
Remember that in Chapter 2.5 we obtained the confidence interval for E{Yh}:

Ŷh ± t(1 − α/2; n − 2) s{Ŷh}

Now, we want to get a confidence band for the entire regression line E{Y} = β0 + β1X.

The Working-Hotelling 1 − α confidence band is

Ŷh ± W · s{Ŷh}

where W² = 2F(1 − α; 2, n − 2).

Same form as before, except the t multiple is replaced with the W multiple.
Yang Feng (Columbia University) Simultaneous Inferences 11 / 20
Example: toluca company
Say we want to estimate the boundary values of the band at Xh = 30, 65, 100. We have:
Looking up the table, W² = 2F(1 − α; 2, n − 2) = 2F(.90; 2, 23) = 5.098. R code:

w2 = 2 * qf(1 - 0.1, 2, 23)  # squared Working-Hotelling multiplier
Yang Feng (Columbia University) Simultaneous Inferences 12 / 20
Now the confidence band limits at the three points are:
Yang Feng (Columbia University) Simultaneous Inferences 13 / 20
Example confidence band
Yang Feng (Columbia University) Simultaneous Inferences 14 / 20
Compare with Bonferroni Procedure
Say we want to simultaneously estimate the mean response at Xh = 30, 65, 100.

Then the simultaneous confidence intervals are

Ŷh ± t(1 − α/(2g); n − 2) s{Ŷh}

We have B = t(1 − α/(2g); n − 2) = t(1 − .1/(2 · 3); 23) = 2.263, and the confidence intervals are:
Yang Feng (Columbia University) Simultaneous Inferences 15 / 20
Bonferroni vs. Working-Hotelling
In this instance, the Working-Hotelling confidence limits are slightly tighter (better) than the Bonferroni limits.
However, for larger families (more levels of X) considered simultaneously, Working-Hotelling is eventually tighter, since W stays the same for any number of statements while B becomes larger.
The levels of the predictor variable are sometimes not known in advance. In such cases, it is better to use the Working-Hotelling procedure, since the family encompasses all possible levels of X.
Yang Feng (Columbia University) Simultaneous Inferences 16 / 20
Simultaneous Prediction Intervals for g New Observations
1. Scheffe procedure:

Ŷh ± S · s{pred},   (1)

where S² = gF(1 − α; g, n − 2).

2. Bonferroni procedure:

Ŷh ± B · s{pred},   (2)

where B = t(1 − α/(2g); n − 2).
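A quick sketch comparing the two multipliers in R for, say, g = 3 new observations with n − 2 = 23 and α = 0.10 (illustrative numbers):

g <- 3; df <- 23; alpha <- 0.10
S <- sqrt(g * qf(1 - alpha, g, df))  # Scheffe multiplier
B <- qt(1 - alpha / (2 * g), df)     # Bonferroni multiplier
c(Scheffe = S, Bonferroni = B)       # use whichever is smaller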
Yang Feng (Columbia University) Simultaneous Inferences 17 / 20
Regression through the origin
Model: Yi = β1Xi + εi
Sometimes it is known that the regression function is linear and thatit must go through the origin.
β1 is parameter
Xi are known constants
εi are i.i.d N(0, σ2).
The least squares and maximum likelihood estimators for β1 coincide as before; the estimator is

b1 = ΣXiYi / ΣXi²
Yang Feng (Columbia University) Simultaneous Inferences 18 / 20
Regression through the origin, Cont
In regression through the origin there is only one free parameter (β1), so the number of degrees of freedom of the MSE

s² = MSE = Σei² / (n − 1) = Σ(Yi − Ŷi)² / (n − 1)

is increased by one. This is because this is a "reduced" model in the general linear test sense: the number of parameters estimated from the data is smaller by one.
Yang Feng (Columbia University) Simultaneous Inferences 19 / 20
A few notes on regression through the origin
Σei ≠ 0 in general now. The only constraint is ΣXiei = 0.
SSE may exceed the total sum of squares SSTO, e.g., in the case of a curvilinear pattern or a linear pattern with an intercept away from the origin.
Therefore, R2 = 1− SSE/SSTO may be negative!
Generally, it is safer to use the model with an intercept as opposed to the regression-through-the-origin model.
Otherwise, one risks starting with the wrong model!
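In R the no-intercept model is requested by dropping the intercept term from the formula; a minimal sketch with simulated data:

set.seed(2)
x <- 1:10
y <- 2 * x + rnorm(10)
fit0 <- lm(y ~ x - 1)    # regression through the origin
sum(x * y) / sum(x^2)    # matches coef(fit0), the closed-form b1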
Yang Feng (Columbia University) Simultaneous Inferences 20 / 20
Linear Algebra Review
Yang Feng
Yang Feng (Columbia University) Linear Algebra Review 1 / 46
Definition of Matrix
Rectangular array of elements arranged in rows and columns, e.g.

[ 16000 23
  33000 47
  21000 35 ]
A matrix has dimensions
The dimension of a matrix is its number of rows and columns
It is expressed as 3× 2 (in this case)
Yang Feng (Columbia University) Linear Algebra Review 2 / 46
Indexing a Matrix
Rectangular array of elements arranged in rows and columns
A = [ a11 a12 a13
      a21 a22 a23 ]

A matrix can also be notated as

A = [aij ],  i = 1, 2; j = 1, 2, 3
Yang Feng (Columbia University) Linear Algebra Review 3 / 46
Square Matrix and Column Vector
A square matrix has equal number of rows and columns
[ 4 7      [ a11 a12 a13
  3 9 ]      a21 a22 a23
             a31 a32 a33 ]
A column vector is a matrix with a single column
[ 4        [ c1
  7          c2
  10 ]       c3
             c4
             c5 ]
All vectors (row or column) are matrices, all scalars are 1× 1 matrices.
Yang Feng (Columbia University) Linear Algebra Review 4 / 46
Transpose
The transpose of a matrix is another matrix in which the rows andcolumns have been interchanged
A = [ 2  5
      7 10
      3  4 ]

A′ = [ 2 7  3
      5 10 4 ]
Yang Feng (Columbia University) Linear Algebra Review 5 / 46
Equality of Matrices
Two matrices are the same if they have the same dimension and allthe elements are equal
A = [ a1       B = [ 4
      a2             7
      a3 ]           3 ]
A = B implies a1 = 4, a2 = 7, a3 = 3
Yang Feng (Columbia University) Linear Algebra Review 6 / 46
Matrix Addition and Substraction
A = [ 1 4      B = [ 1 2
      2 5            2 3
      3 6 ]          3 4 ]

Then

A + B = [ 2  6
          4  8
          6 10 ]
Yang Feng (Columbia University) Linear Algebra Review 7 / 46
Multiplication of a Matrix by a Scalar
A = [ 2 7
      9 3 ]

kA = k [ 2 7    = [ 2k 7k
         9 3 ]      9k 3k ]
Yang Feng (Columbia University) Linear Algebra Review 8 / 46
Multiplication of two Matrices
Yang Feng (Columbia University) Linear Algebra Review 9 / 46
Another Matrix Multiplication Example
A = [ 1 3 4      B = [ 3
      0 5 8 ]          5
                       2 ]

AB = [ 1 3 4    [ 3     = [ 26
       0 5 8 ] ×  5         41 ]
                  2 ]
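The same product in R, as a quick check of the example above; note that matrix() fills column by column:

A <- matrix(c(1, 0, 3, 5, 4, 8), nrow = 2)  # [1 3 4; 0 5 8]
B <- matrix(c(3, 5, 2), ncol = 1)
A %*% B                                     # 26, 41 as above
t(A)                                        # transpose, as defined earlier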
Yang Feng (Columbia University) Linear Algebra Review 10 / 46
Special Matrices
If A = A′, then A is a symmetric matrix
A = [ 1 4 6      A′ = [ 1 4 6
      4 2 5             4 2 5
      6 5 3 ]           6 5 3 ]
If the off-diagonal elements of a matrix are all zeros, it is called a diagonal matrix
A = [ a1 0  0        B = [ 4 0 0  0
      0  a2 0              0 1 0  0
      0  0  a3 ]           0 0 10 0
                           0 0 0  5 ]
Yang Feng (Columbia University) Linear Algebra Review 11 / 46
Identity Matrix
A diagonal matrix whose diagonal entries are all ones is an identity matrix. Multiplication by an identity matrix leaves the pre- or post-multiplied matrix unchanged.

I = [ 1 0 0 0
      0 1 0 0
      0 0 1 0
      0 0 0 1 ]
and
AI = IA = A
Yang Feng (Columbia University) Linear Algebra Review 12 / 46
Vector and matrix with all elements equal to one
1 = [ 1        J = [ 1 ... 1
      1              ...
      ...            1 ... 1 ]
      1 ]

11′ = [ 1
        1    [ 1 1 ... 1 ] = [ 1 ... 1
        ...                    ...
        1 ]                    1 ... 1 ] = J
Yang Feng (Columbia University) Linear Algebra Review 13 / 46
Linear Dependence and Rank of Matrix
Consider
A = [ 1 2 5  1
      2 2 10 6
      3 4 15 1 ]
and think of this as a matrix of a collection of column vectors.
Note that the third column vector is a multiple of the first column vector.
Yang Feng (Columbia University) Linear Algebra Review 14 / 46
Linear Dependence
When m scalars k1, ..., km, not all zero, can be found such that:
k1A1 + ...+ kmAm = 0
where 0 denotes the zero column vector and Ai is the i-th column of matrix A, the m column vectors are called linearly dependent. If the only set of scalars for which the equality holds is k1 = 0, ..., km = 0, the set of m column vectors is linearly independent.
In the previous example matrix the columns are linearly dependent.
5 [ 1    + 0 [ 2    − 1 [ 5     + 0 [ 1    = [ 0
    2          2          10         6         0
    3 ]        4 ]        15 ]       1 ]       0 ]
Yang Feng (Columbia University) Linear Algebra Review 15 / 46
Rank of Matrix
The rank of a matrix is defined to be the maximum number of linearlyindependent columns in the matrix. Rank properties include
The rank of a matrix is unique
The rank of a matrix can equivalently be defined as the maximumnumber of linearly independent rows
The rank of an r × c matrix cannot exceed min(r , c)
The row and column rank of a matrix are equal
The rank of a matrix is preserved under nonsingular transformations, i.e., let A (n × n) and C (k × k) be nonsingular matrices; then for any n × k matrix B we have
rank(B) = rank(AB) = rank(BC)
Yang Feng (Columbia University) Linear Algebra Review 16 / 46
Inverse of Matrix
Like a reciprocal
6 ∗ 1/6 = 1/6 ∗ 6 = 1
x · (1/x) = 1
But for matrices
AA−1 = A−1A = I
Yang Feng (Columbia University) Linear Algebra Review 17 / 46
Example
A = [ 2 4
      3 1 ]

A−1 = [ −.1  .4       A−1A = [ 1 0
         .3 −.2 ]              0 1 ]

More generally,

A = [ a b
      c d ]

A−1 = (1/D) [ d  −b
              −c  a ]

where D = ad − bc
Yang Feng (Columbia University) Linear Algebra Review 18 / 46
Inverses of Diagonal Matrices are Easy
A = [ 3 0 0
      0 4 0
      0 0 2 ]

then

A−1 = [ 1/3 0   0
        0   1/4 0
        0   0   1/2 ]
Yang Feng (Columbia University) Linear Algebra Review 19 / 46
Finding the inverse
Finding an inverse takes (for general matrices with no special structure) O(n³) operations, where n is the number of rows in the matrix.
We will assume that numerical packages can do this for us; in R, solve(A) gives the inverse of the matrix A.
Yang Feng (Columbia University) Linear Algebra Review 20 / 46
Uses of Inverse Matrix
Ordinary algebra: 5y = 20 is solved by (1/5)(5y) = (1/5)(20)
Linear algebra: AY = C is solved by

A−1AY = A−1C,  Y = A−1C
Yang Feng (Columbia University) Linear Algebra Review 21 / 46
Example
Solving a system of simultaneous equations

2y1 + 4y2 = 20
3y1 + y2 = 10

in matrix form:

[ 2 4   [ y1    = [ 20
  3 1 ]   y2 ]      10 ]

[ y1    = [ 2 4  −1 [ 20
  y2 ]      3 1 ]     10 ]
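The same system in R, as a quick check (solve() with a right-hand side solves the system without forming the inverse explicitly):

A <- matrix(c(2, 3, 4, 1), nrow = 2)  # column-major: [2 4; 3 1]
solve(A)                              # inverse: [-.1 .4; .3 -.2]
solve(A, c(20, 10))                   # solution: y1 = 2, y2 = 4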
Yang Feng (Columbia University) Linear Algebra Review 22 / 46
List of Useful Matrix Properties
A + B = B + A
(A + B) + C = A + (B + C)
(AB)C = A(BC)
C(A + B) = CA + CB
k(A + B) = kA + kB
(A′)′ = A
(A + B)′ = A′ + B′
(AB)′ = B′A′
(ABC)′ = C′B′A′
(AB)−1 = B−1A−1
(ABC)−1 = C−1B−1A−1
(A−1)−1 = A, (A′)−1 = (A−1)′
Yang Feng (Columbia University) Linear Algebra Review 23 / 46
Random Vectors and Matrices
Let’s say we have a vector consisting of three random variables
Y = [ Y1
      Y2
      Y3 ]

The expectation of a random vector is defined as

E(Y) = [ E(Y1)
         E(Y2)
         E(Y3) ]
Yang Feng (Columbia University) Linear Algebra Review 24 / 46
Expectation of a Random Matrix
The expectation of a random matrix is defined similarly
E(Y) = [E(Yij)],  i = 1, ..., n; j = 1, ..., p
Yang Feng (Columbia University) Linear Algebra Review 25 / 46
Variance-covariance Matrix of a Random Vector
The variances of the three random variables, σ²(Yi), and the covariances between any two of the three random variables, σ(Yi, Yj), are assembled in the variance-covariance matrix of Y:

cov(Y) = σ²{Y} = [ σ²(Y1)    σ(Y1,Y2)  σ(Y1,Y3)
                   σ(Y2,Y1)  σ²(Y2)    σ(Y2,Y3)
                   σ(Y3,Y1)  σ(Y3,Y2)  σ²(Y3) ]

Remember σ(Y2,Y1) = σ(Y1,Y2), so the covariance matrix is symmetric.
Yang Feng (Columbia University) Linear Algebra Review 26 / 46
Derivation of Covariance Matrix
In vector terms the variance-covariance matrix is defined by
σ²{Y} = E[(Y − E(Y))(Y − E(Y))′]

because

σ²{Y} = E( [ Y1 − E(Y1)
             Y2 − E(Y2)    × [ Y1 − E(Y1)  Y2 − E(Y2)  Y3 − E(Y3) ] )
             Y3 − E(Y3) ]
Yang Feng (Columbia University) Linear Algebra Review 27 / 46
Regression Example
Take a regression example with n = 3, constant error variance σ²(εi) = σ², and uncorrelated errors, so that σ(εi, εj) = 0 for all i ≠ j.

The variance-covariance matrix for the random vector ε is

σ²{ε} = [ σ² 0  0
          0  σ² 0
          0  0  σ² ]
Yang Feng (Columbia University) Linear Algebra Review 28 / 46
Basic Results
If A is a constant matrix and Y is a random matrix, then W = AY is a random matrix with

E(A) = A
E(W) = E(AY) = AE(Y)
σ²{W} = σ²{AY} = Aσ²{Y}A′
Yang Feng (Columbia University) Linear Algebra Review 29 / 46
Multivariate Normal Density
Let Y be a vector of p observations
Y = [ Y1
      Y2
      ...
      Yp ]

Let µ be the vector of the means of each of the p observations:

µ = [ µ1
      µ2
      ...
      µp ]
Yang Feng (Columbia University) Linear Algebra Review 30 / 46
Multivariate Normal Density
let Σ be the variance-covariance matrix of Y
Σ = [ σ1²  σ12  ...  σ1p
      σ21  σ2²  ...  σ2p
      ...
      σp1  σp2  ...  σp² ]

Then the multivariate normal density is given by

P(Y|µ,Σ) = 1/((2π)^(p/2) |Σ|^(1/2)) · exp[ −(1/2)(Y − µ)′Σ−1(Y − µ) ]
Yang Feng (Columbia University) Linear Algebra Review 31 / 46
Matrix Simple Linear Regression
Nothing new-only matrix formalism for previous results
Remember the normal error regression model
Yi = β0 + β1Xi + εi , εi ∼ N(0, σ2), i = 1, ..., n
Expanded out this looks like
Y1 = β0 + β1X1 + ε1
Y2 = β0 + β1X2 + ε2
...
Yn = β0 + β1Xn + εn
which points towards an obvious matrix formulation.
Yang Feng (Columbia University) Linear Algebra Review 32 / 46
Regression Matrices
If we identify the following matrices
Y = [ Y1
      Y2
      ...
      Yn ]

X = [ 1 X1
      1 X2
      ...
      1 Xn ]

β = [ β0
      β1 ]

ε = [ ε1
      ε2
      ...
      εn ]
We can write the linear regression equations in a compact form
Y = Xβ + ε
Yang Feng (Columbia University) Linear Algebra Review 33 / 46
Regression Matrices
Of course, in the normal regression model the expected value of each of the ε's is zero, so we can write E(Y) = Xβ
This is because
E(ε) = 0:

[ E(ε1)     [ 0
  E(ε2)  =    0
  ...         ...
  E(εn) ]     0 ]
Yang Feng (Columbia University) Linear Algebra Review 34 / 46
Error Covariance
Because the error terms are independent and have constant variance σ2
σ²{ε} = [ σ² 0  ... 0
          0  σ² ... 0
          ...
          0  0  ... σ² ]

σ²{ε} = σ²I
Yang Feng (Columbia University) Linear Algebra Review 35 / 46
Matrix Normal Regression Model
In matrix terms the normal regression model can be written as
Y = Xβ + ε
where ε ∼ N(0, σ2I)
Yang Feng (Columbia University) Linear Algebra Review 36 / 46
Least Square Estimation
If we remember both the starting normal equations that we derived
nb0 + b1ΣXi = ΣYi
b0ΣXi + b1ΣXi² = ΣXiYi
and the fact that
X′X = [ 1  1  ... 1     [ 1 X1      = [ n    ΣXi
        X1 X2 ... Xn ] ×  1 X2          ΣXi  ΣXi² ]
                          ...
                          1 Xn ]

X′Y = [ 1  1  ... 1     [ Y1      = [ ΣYi
        X1 X2 ... Xn ] ×  Y2          ΣXiYi ]
                          ...
                          Yn ]
Yang Feng (Columbia University) Linear Algebra Review 37 / 46
Least Square Estimation
Then we can see that these equations are equivalent to the following matrix operation

X′Xb = X′Y

with

b = [ b0
      b1 ]

and with the solution to this equation given by
b = (X′X)−1X′Y
when (X′X)−1 exists.
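A minimal numerical sketch in R with simulated data (lm() itself solves the least squares problem via a QR decomposition rather than an explicit inverse):

set.seed(1)
X1 <- runif(25)
Y <- 1 + 2 * X1 + rnorm(25, sd = 0.3)
X <- cbind(1, X1)                   # design matrix with intercept column
b <- solve(t(X) %*% X, t(X) %*% Y)  # b = (X'X)^{-1} X'Y
cbind(b, coef(lm(Y ~ X1)))          # the two agree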
Yang Feng (Columbia University) Linear Algebra Review 38 / 46
Fitted Value
Ŷ = Xb, because:

Ŷ = [ Ŷ1         [ 1 X1                 [ b0 + b1X1
      Ŷ2           1 X2    [ b0           b0 + b1X2
      ...    =     ...   ×   b1 ]  =      ...
      Ŷn ]         1 Xn ]                 b0 + b1Xn ]
Yang Feng (Columbia University) Linear Algebra Review 39 / 46
Fitted Values, Hat Matrix
Plug in b = (X′X)−1X′Y. We have

Ŷ = Xb = X(X′X)−1X′Y

or

Ŷ = HY

where

H = X(X′X)−1X′

is called the hat matrix. Properties of the hat matrix H:

1. symmetric
2. idempotent: HH = H.
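Both properties are easy to verify numerically; a sketch with a made-up design matrix:

X <- cbind(1, 1:10)                    # n = 10: intercept plus one predictor
H <- X %*% solve(t(X) %*% X) %*% t(X)  # hat matrix
all.equal(H, t(H))                     # TRUE: symmetric
all.equal(H, H %*% H)                  # TRUE: idempotent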
Yang Feng (Columbia University) Linear Algebra Review 40 / 46
Residuals
e = Y − Ŷ = Y − HY = (I − H)Y

The matrix I − H is also symmetric and idempotent. The variance-covariance matrix of e is

σ²{e} = σ²(I − H)

And we can estimate it by

s²{e} = MSE (I − H)
Yang Feng (Columbia University) Linear Algebra Review 41 / 46
Analysis of Variance Results
SSTO = Σ(Yi − Ȳ)² = ΣYi² − (ΣYi)²/n

We know

Y′Y = ΣYi²

and J is the matrix with entries all equal to 1. Then we have

(ΣYi)²/n = (1/n) Y′JY

As a result:

SSTO = Y′Y − (1/n) Y′JY
Yang Feng (Columbia University) Linear Algebra Review 42 / 46
Analysis of Variance Results
Also,

SSE = Σei² = Σ(Yi − Ŷi)²

can be represented as

SSE = e′e = Y′(I − H)′(I − H)Y = Y′(I − H)Y

Notice that H1 = 1, so (I − H)J = 0. Finally, by similar reasoning,

SSR = ([H − (1/n)J]Y)′([H − (1/n)J]Y) = Y′[H − (1/n)J]Y

It is easy to check that

SSTO = SSE + SSR
Yang Feng (Columbia University) Linear Algebra Review 43 / 46
Sums of Squares as Quadratic Forms
When n = 2, an example of a quadratic form:

5Y1² + 6Y1Y2 + 4Y2²

can be expressed in matrix terms as

( Y1 Y2 ) [ 5 3   ( Y1    = Y′AY
            3 4 ]   Y2 )

In general, a quadratic form is defined as:

Y′AY = Σi Σj aijYiYj,  where aij = aji

Here, A is a symmetric n × n matrix, the matrix of the quadratic form.
Yang Feng (Columbia University) Linear Algebra Review 44 / 46
Quadratic forms for ANOVA
SSTO = Y′[I − (1/n)J]Y

SSE = Y′[I − H]Y

SSR = Y′[H − (1/n)J]Y
Yang Feng (Columbia University) Linear Algebra Review 45 / 46
Inference in Regression Analysis
Regression Coefficients: The variance-covariance matrix of b is
σ2{b} = σ2(X′X)−1
Mean Response: To estimate the mean response at Xh, define

Xh = [ 1
       Xh ]

Then

Ŷh = X′hb

and the variance-covariance matrix of Ŷh is

σ²{Ŷh} = X′h σ²{b} Xh = σ² X′h(X′X)−1Xh

Prediction of a New Observation:

s²{pred} = MSE (1 + X′h(X′X)−1Xh)
Yang Feng (Columbia University) Linear Algebra Review 46 / 46
Multiple Regression
Yang Feng
Yang Feng (Columbia University) Multiple Regression 1 / 41
Multiple regression
One of the most widely used tools in statistical analysis
Matrix expressions for multiple regression are the same as for simple linear regression
Yang Feng (Columbia University) Multiple Regression 2 / 41
Need for Several Predictor Variables
Often the response is best understood as being a function of multipleinput quantities
Examples
Spam filtering: regress the probability of an email being a spam message against thousands of input variables
Revenue prediction: regress the revenue of a company against a lot of factors
Yang Feng (Columbia University) Multiple Regression 3 / 41
First-Order with Two Predictor Variables
When there are two predictor variables X1 and X2, the regression model

Yi = β0 + β1Xi1 + β2Xi2 + εi

is called a first-order model with two predictor variables.

A first-order model is linear in the predictor variables.

Xi1 and Xi2 are the values of the two predictor variables in the i-th trial.
Yang Feng (Columbia University) Multiple Regression 4 / 41
Functional Form of Regression Surface
Assuming noise equal to zero in expectation
E(Y ) = β0 + β1X1 + β2X2
The form of this regression function is of a plane
e.g. E(Y ) = 10 + 2X1 + 5X2
Yang Feng (Columbia University) Multiple Regression 5 / 41
Example
Yang Feng (Columbia University) Multiple Regression 6 / 41
Meaning of Regression Coefficients
β0 is the intercept when both X1 and X2 are zero;
β1 indicates the change in the mean response E(Y) per unit increase in X1 when X2 is held constant
β2: vice versa
Example: fix X2 = 2

E(Y) = 10 + 2X1 + 5(2) = 20 + 2X1   (X2 = 2)

The intercept changes, but the function is clearly linear.
In other words, all one-dimensional restrictions of the regression surface are lines.
Yang Feng (Columbia University) Multiple Regression 7 / 41
Terminology
1. When the effect of X1 on the mean response does not depend on the level of X2 (and vice versa), the two predictor variables are said to have additive effects or not to interact.
2. The parameters β1 and β2 are sometimes called partial regression coefficients. They represent the partial effect of one predictor variable when the other predictor variable is included in the model and is held constant.
Yang Feng (Columbia University) Multiple Regression 8 / 41
Comments
1. A planar response surface may not always be appropriate, but even when not, it is often a good approximate descriptor of the regression function in "local" regions of the input space.
2. The meaning of the parameters can be determined by taking partials of the regression function w.r.t. each.
Yang Feng (Columbia University) Multiple Regression 9 / 41
First order model with > 2 predictor variables
Let there be p − 1 predictor variables; then

Yi = β0 + β1Xi1 + β2Xi2 + ... + βp−1Xi,p−1 + εi

which can also be written as

Yi = β0 + Σ_{k=1}^{p−1} βkXik + εi

and, with Xi0 = 1, it can also be written as

Yi = Σ_{k=0}^{p−1} βkXik + εi

where Xi0 = 1
Yang Feng (Columbia University) Multiple Regression 10 / 41
Geometry of response surface
In this setting the response surface is a hyperplane
This is difficult to visualize, but the same intuitions hold:
fixing all but one input variable, each βk tells how much the response variable grows or decreases with a unit increase in that one input variable
Yang Feng (Columbia University) Multiple Regression 11 / 41
General Linear Regression Model
We have arrived at the general regression model. In general, the X1, ..., Xp−1 variables in the regression model do not have to represent different predictor variables, nor do they all have to be quantitative (continuous).

The general model is

Yi = Σ_{k=0}^{p−1} βkXik + εi,  where Xi0 = 1

with response function, when E(εi) = 0,

E(Y) = β0 + β1X1 + ... + βp−1Xp−1
Yang Feng (Columbia University) Multiple Regression 12 / 41
Qualitative (Discrete) Predictor Variables
Until now we have (implicitly) focused on quantitative (continuous) predictor variables.
Qualitative (discrete) predictor variables often arise in the real world.
Examples:
Patient sex: male/female
College Degree: yes/no
Etc
Yang Feng (Columbia University) Multiple Regression 13 / 41
Example
Regression model to predict the length of hospital stay (Y) based on the age (X1) and gender (X2) of the patient. Define gender as:

X2 = { 1, if patient is female
       0, if patient is male
And use the standard first-order regression model
Yi = β0 + β1Xi1 + β2Xi2 + εi
Yang Feng (Columbia University) Multiple Regression 14 / 41
Example cont.
where Xi1 is the patient's age and Xi2 is the patient's gender.
If X2 = 0, the response function is E(Y) = β0 + β1X1.
Otherwise, it is E(Y) = (β0 + β2) + β1X1,
which is just another parallel linear response function with a different intercept.
Yang Feng (Columbia University) Multiple Regression 15 / 41
Polynomial Regression
Polynomial regression models are special cases of the generalregression model.
They can contain squared and higher-order terms of the predictorvariables
The response function becomes curvilinear.
For example, Yi = β0 + β1Xi + β2Xi² + εi,
which clearly has the same form as the general regression model.
Yang Feng (Columbia University) Multiple Regression 16 / 41
General Regression
Transformed variables: log Y, 1/Y
Interaction effects: Yi = β0 + β1Xi1 + β2Xi2 + β3Xi1Xi2 + εi
Combinations: Yi = β0 + β1Xi1 + β2Xi1² + β3Xi2 + β4Xi1Xi2 + εi
Key point-all linear in parameters!
Yang Feng (Columbia University) Multiple Regression 17 / 41
General Regression Model in Matrix Terms
Y = [ Y1              (n × 1)
      ...
      Yn ]

X = [ 1 X11 X12 ... X1,p−1      (n × p)
      ...
      1 Xn1 Xn2 ... Xn,p−1 ]

β = [ β0               (p × 1)
      ...
      βp−1 ]

ε = [ ε1               (n × 1)
      ...
      εn ]
Yang Feng (Columbia University) Multiple Regression 18 / 41
General Linear Regression in Matrix Terms
Y = Xβ + ε
With E(ε) = 0 and

σ²{ε} = [ σ² 0  ... 0
          0  σ² ... 0
          ...
          0  0  ... σ² ]

we have E(Y) = Xβ and σ²{Y} = σ²{ε} = σ²I
Yang Feng (Columbia University) Multiple Regression 19 / 41
Least Square Solution
The matrix normal equations can be derived directly from theminimization of
Q(b) = (Y − Xb)′(Y − Xb)
w.r.t. b

Key result:

∂(Xb)/∂b = X.   (1)
Yang Feng (Columbia University) Multiple Regression 20 / 41
Least Square Solution
We can solve this equation
X′Xb = X′Y
(if the inverse of X′X exists) by the following:

(X′X)−1X′Xb = (X′X)−1X′Y

and since

(X′X)−1X′X = I

we have

b = (X′X)−1X′Y
Yang Feng (Columbia University) Multiple Regression 21 / 41
Fitted Values and Residuals
Let the vector of fitted values be

Ŷ = [ Ŷ1
      Ŷ2
      ...
      Ŷn ]

In matrix notation we then have Ŷ = Xb
Yang Feng (Columbia University) Multiple Regression 22 / 41
Hat Matrix-Puts hat on y
We can also directly express the fitted values in terms of X and Y matrices
Ŷ = X(X′X)−1X′Y

and we can further define H, the "hat matrix":

Ŷ = HY,  H = X(X′X)−1X′

The hat matrix plays an important role in diagnostics for regression analysis.
Yang Feng (Columbia University) Multiple Regression 23 / 41
Hat Matrix Properties
1. The hat matrix is symmetric.
2. The hat matrix is idempotent, i.e. HH = H.

Important idempotent matrix property:

For a symmetric and idempotent matrix A, rank(A) = trace(A), the number of non-zero eigenvalues of A.
Yang Feng (Columbia University) Multiple Regression 24 / 41
Residuals
The residuals, like the fitted values Ŷ, can be expressed as linear combinations of the response variable observations Yi
e = Y − Ŷ = Y − HY = (I − H)Y
also, remember
e = Y − Ŷ = Y − Xb
these are equivalent.
Yang Feng (Columbia University) Multiple Regression 25 / 41
Covariance of Residuals
Starting with

e = (I − H)Y

we see that

σ²{e} = (I − H) σ²{Y} (I − H)′

but

σ²{Y} = σ²{ε} = σ²I

which means that

σ²{e} = σ²(I − H)I(I − H) = σ²(I − H)(I − H)

and since I − H is idempotent (check!), we have σ²{e} = σ²(I − H)
Yang Feng (Columbia University) Multiple Regression 26 / 41
Quadratic Forms
In general, a quadratic form is defined by
Y′AY = Σi Σj aijYiYj,  where aij = aji

with A the matrix of the quadratic form.

The ANOVA sums SSTO, SSE and SSR can all be arranged into quadratic forms:

SSTO = Y′(I − (1/n)J)Y

SSE = Y′(I − H)Y

SSR = Y′(H − (1/n)J)Y
Yang Feng (Columbia University) Multiple Regression 27 / 41
Quadratic Forms
Cochran’s Theorem
Let X1, X2, ..., Xn be independent, N(0, σ²)-distributed random variables, and suppose that

Σ_{i=1}^{n} Xi² = Q1 + Q2 + ... + Qk,

where Q1, Q2, ..., Qk are nonnegative-definite quadratic forms in the random variables X1, X2, ..., Xn, with rank(Ai) = ri, i = 1, 2, ..., k; namely,

Qi = X′AiX,  i = 1, 2, ..., k.
If r1 + r2 + . . .+ rk = n, then
1 Q1,Q2, . . . ,Qk are independent; and
2 Qi ∼ σ2χ2(ri ), i = 1, 2, . . . , k
Yang Feng (Columbia University) Multiple Regression 28 / 41
ANOVA table for multiple linear regression
Source of Variation   SS                   df     MS                   E(MS)
Regression            SSR                  p − 1  MSR = SSR/(p − 1)    > σ²
Error                 SSE                  n − p  MSE = SSE/(n − p)    σ²
Total                 SSTO = Σ(Yi − Ȳ)²    n − 1
Yang Feng (Columbia University) Multiple Regression 29 / 41
F-test for regression
H0 : β1 = β2 = · · · = βp−1 = 0
Ha: not all βk (k = 1, ..., p − 1) equal zero

Test statistic:

F* = MSR/MSE
Decision Rule:
if F ∗ ≤ F (1− α; p − 1, n − p), conclude H0
if F ∗ > F (1− α; p − 1, n − p), conclude Ha
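In R the overall F-test is reported by summary(); a sketch assuming a hypothetical data frame dwaine with columns Sales, Population, Income:

fit <- lm(Sales ~ Population + Income, data = dwaine)
summary(fit)$fstatistic  # F*, with df p - 1 and n - p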
Yang Feng (Columbia University) Multiple Regression 30 / 41
R2 and adjusted R2
The coefficient of multiple determination R² is defined as:

R² = SSR/SSTO = 1 − SSE/SSTO

0 ≤ R² ≤ 1

R² always increases when there are more variables.

Therefore, the adjusted R²:

Ra² = 1 − [SSE/(n − p)] / [SSTO/(n − 1)] = 1 − ((n − 1)/(n − p)) · (SSE/SSTO)
R2a may decrease when p is large.
Coefficient of multiple correlation:
R = √R²

Always the positive square root!
Yang Feng (Columbia University) Multiple Regression 31 / 41
Inferences about parameters
We have

b = (X′X)−1X′Y

Since σ²{Y} = σ²I, we can write

σ²{b} = (X′X)−1X′ σ²I X(X′X)−1
      = σ²(X′X)−1X′X(X′X)−1
      = σ²(X′X)−1
Also
E(b) = E((X′X)−1X′Y) = (X′X)−1X′ E(Y) = (X′X)−1X′Xβ = β
Yang Feng (Columbia University) Multiple Regression 32 / 41
Inferences
The estimated variance-covariance matrix
s2{b} = MSE (X′X)−1
Then, we have
(bk − βk)/s{bk} ∼ t(n − p),  k = 0, 1, ..., p − 1
1− α confidence intervals:
bk ± t(1− α/2; n − p)s{bk}
Yang Feng (Columbia University) Multiple Regression 33 / 41
t test
Tests for βk :
H0: βk = 0
Ha: βk ≠ 0

Test Statistic:

t* = bk/s{bk}

Decision Rule:
|t∗| ≤ t(1− α/2; n − p); conclude H0
Otherwise, conclude Ha
Yang Feng (Columbia University) Multiple Regression 34 / 41
Joint Inferences
Bonferroni joint confidence intervals for g parameters: the confidence limits with family confidence coefficient 1 − α are

bk ± B · s{bk},

where

B = t(1 − α/(2g); n − p)
Yang Feng (Columbia University) Multiple Regression 35 / 41
Interval Estimate of E{Yh}

We want to estimate the mean response at Xh = (1, Xh1, ..., Xh,p−1)′.

Estimator: Ŷh = X′hb
Expectation: E{Ŷh} = X′hβ = E{Yh}
Variance: σ²{Ŷh} = σ² X′h(X′X)−1Xh
Estimated variance: s²{Ŷh} = MSE (X′h(X′X)−1Xh)

1 − α confidence limits for E{Yh}:

Ŷh ± t(1 − α/2; n − p) s{Ŷh}
Yang Feng (Columbia University) Multiple Regression 36 / 41
Confidence Region for Regression Surface and Predictionof New Observations
Working-Hotelling confidence band:

Ŷh ± W s{Ŷh}

where W² = pF(1 − α; p, n − p)

Prediction of a New Observation Yh(new):

Ŷh ± t(1 − α/2; n − p) s{pred}

where s²{pred} = MSE + s²{Ŷh} = MSE (1 + X′h(X′X)−1Xh).
Yang Feng (Columbia University) Multiple Regression 37 / 41
Diagnostics and Remedial Measures
Very similar to simple linear regression.
We only mention the differences.
Yang Feng (Columbia University) Multiple Regression 38 / 41
Scatter Plot Matrix
It is a matrix of scatter plots! R code: pairs(data)
Yang Feng (Columbia University) Multiple Regression 39 / 41
Correlation Matrix
Corresponds to the scatter plot matrix
R code: cor(data)
             Population  Income  Sales
Population         1.00    0.78   0.94
Income             0.78    1.00   0.84
Sales              0.94    0.84   1.00
Yang Feng (Columbia University) Multiple Regression 40 / 41
Other Diagnostics and Remedial Measures (Read after class)
Residual Plots.
Against time (or some other sequence) for error dependency.
Against each X variable for potential nonlinear relationships and nonconstancy of error variance.
Against omitted variables (including interaction terms). More on interaction terms in the next chapter.
Correlation Test for Normality (same, since it is performed on the residuals)
Brown-Forsythe Test for Constancy of Error Variance (need to find a way to divide the X space)
Breusch-Pagan Test for Constancy of Error Variance (same)
F Test for Lack of Fit (need to have (near) replicates on all dimensions of X)
Box-Cox Transformations (Same, since it is on Y )
Yang Feng (Columbia University) Multiple Regression 41 / 41
Multiple Regression (II)
Yang Feng
Yang Feng (Columbia University) Multiple Regression (II) 1 / 44
Special Topics for Multiple Regression
Extra Sums of Squares
Standardized Version of the Multiple Regression Model
Multicollinearity
Yang Feng (Columbia University) Multiple Regression (II) 2 / 44
Extra Sums of Squares
A topic unique to multiple regression
An extra sum of squares measures the marginal decrease in the error sum of squares when one or several predictor variables are added to the regression model, given that other variables are already in the model.
Equivalently, one can view the extra sum of squares as measuring the marginal increase in the regression sum of squares.
Yang Feng (Columbia University) Multiple Regression (II) 3 / 44
Example
Multiple regression
– Output: body fat percentage
– Inputs:
  1. triceps skin fold thickness (X1)
  2. thigh circumference (X2)
  3. midarm circumference (X3)

Aim: replace the cumbersome and expensive immersion-in-water procedure with a model.

Goal: determine which predictor variables provide a good model.
Yang Feng (Columbia University) Multiple Regression (II) 4 / 44
The Data
Yang Feng (Columbia University) Multiple Regression (II) 5 / 44
Regression of Y on X1
Yang Feng (Columbia University) Multiple Regression (II) 6 / 44
Regression of Y on X2
Yang Feng (Columbia University) Multiple Regression (II) 7 / 44
Regression of Y on X1 and X2
Yang Feng (Columbia University) Multiple Regression (II) 8 / 44
Regression of Y on X1, X2 and X3.
Yang Feng (Columbia University) Multiple Regression (II) 9 / 44
Notation
SSR when X1 only is in the model is denoted SSR(X1) = 352.27
SSE when X1 only is in the model is denoted SSE(X1) = 143.12
Accordingly,
SSR(X1,X2) = 385.44
SSE(X1,X2) = 109.95
Yang Feng (Columbia University) Multiple Regression (II) 10 / 44
More Powerful Model, Smaller SSE
When X1 and X2 are in the model, SSE(X1,X2) = 109.95 is smaller than when the model contains only X1.
The difference is called an extra sum of squares and will be denoted by SSR(X2|X1) = SSE(X1) − SSE(X1,X2) = 33.17.
The extra sum of squares SSR(X2|X1) measures the marginal effect of adding X2 to the regression model when X1 is already in the model.
Yang Feng (Columbia University) Multiple Regression (II) 11 / 44
SSR increase and SSE decrease
The extra sum of squares SSR(X2|X1) can equivalently be viewed as the marginal increase in the regression sum of squares:
SSR(X2|X1) = SSR(X1,X2)− SSR(X1)
= 385.44− 352.27
= 33.17
Yang Feng (Columbia University) Multiple Regression (II) 12 / 44
Why does this relationship exist?
Remember SSTO = SSR + SSE
SSTO measures only the variability of the Y's and does not depend on the regression model fitted.
Any increase in SSR must be accompanied by a corresponding decrease in SSE.
Yang Feng (Columbia University) Multiple Regression (II) 13 / 44
Example relations
SSR(X3|X1,X2) = SSE (X1,X2)− SSE (X1,X2,X3)
= SSR(X1,X2,X3)− SSR(X1,X2)
= 11.54
or, with multiple variables included at a time,
SSR(X2,X3|X1) = SSE (X1)− SSE (X1,X2,X3)
= SSR(X1,X2,X3)− SSR(X1)
= 44.71
Yang Feng (Columbia University) Multiple Regression (II) 14 / 44
Definitions
Definition:
SSR(X1|X2) = SSE(X2) − SSE(X1,X2)
Equivalently:
SSR(X1|X2) = SSR(X1,X2) − SSR(X2)
We can switch the order of X1 and X2 in these expressions.
We can easily generalize these definitions to more than two variables:
SSR(X3|X1,X2) = SSE(X1,X2) − SSE(X1,X2,X3)
SSR(X3|X1,X2) = SSR(X1,X2,X3) − SSR(X1,X2)
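In R, sequential extra sums of squares come straight out of anova(); a sketch assuming a hypothetical data frame bodyfat with columns Y, X1, X2, X3:

fit123 <- lm(Y ~ X1 + X2 + X3, data = bodyfat)
anova(fit123)            # rows give SSR(X1), SSR(X2|X1), SSR(X3|X1,X2)
fit12 <- lm(Y ~ X1 + X2, data = bodyfat)
anova(fit12, fit123)     # F-test based on SSR(X3|X1,X2)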
Yang Feng (Columbia University) Multiple Regression (II) 15 / 44
N! different partitions
Yang Feng (Columbia University) Multiple Regression (II) 16 / 44
ANOVA Table
Various software packages can provide extra sums of squares for regression analysis. These are usually provided in the order in which the input variables are supplied to the system, for instance:
Yang Feng (Columbia University) Multiple Regression (II) 17 / 44
Importance
Extra sums of squares are of interest because they occur in a variety of tests about regression coefficients, where the question of concern is whether certain X variables can be dropped from the regression model.
Yang Feng (Columbia University) Multiple Regression (II) 18 / 44
Test whether a single βk = 0
Does Xk provide a statistically significant improvement to the regression model fit?
We can use the general linear test approach
Example: First order model with three predictor variables
Yi = β0 + β1Xi1 + β2Xi2 + β3Xi3 + ϵi
We want to answer the following hypothesis test:

H0: β3 = 0
Ha: β3 ≠ 0
Yang Feng (Columbia University) Multiple Regression (II) 19 / 44
Test whether a single βk = 0
For the full model we have SSE (F ) = SSE (X1,X2,X3)
The reduced model is Yi = β0 + β1Xi1 + β2Xi2 + ϵi
And for this model we have SSE (R) = SSE (X1,X2)
where there are dfR = n − 3 degrees of freedom associated with the reduced model
Yang Feng (Columbia University) Multiple Regression (II) 20 / 44
Test whether a single βk = 0
The general linear test statistic is

F* = [ (SSE(R) − SSE(F)) / (dfR − dfF) ] / [ SSE(F) / dfF ]

which becomes

F* = [ (SSE(X1,X2) − SSE(X1,X2,X3)) / ((n − 3) − (n − 4)) ] / [ SSE(X1,X2,X3) / (n − 4) ]
but SSE (X1,X2)− SSE (X1,X2,X3) = SSR(X3|X1,X2)
Yang Feng (Columbia University) Multiple Regression (II) 21 / 44
Test whether a single βk = 0
The general linear test statistic is

F* = [ SSR(X3|X1,X2) / 1 ] / [ SSE(X1,X2,X3) / (n − 4) ] = MSR(X3|X1,X2) / MSE(X1,X2,X3)
Extra sum of squares has one associated degree of freedom.
Yang Feng (Columbia University) Multiple Regression (II) 22 / 44
Example
Body fat: Can X3 (midarm circumference) be dropped from the model?
F* = [ SSR(X3|X1,X2) / 1 ] / [ SSE(X1,X2,X3) / (n − 4) ] = 1.88
Yang Feng (Columbia University) Multiple Regression (II) 23 / 44
Example Cont.
For α = .01 we require F (.99; 1, 16) = 8.53
We observe F ∗ = 1.88
We conclude H0 : β3 = 0
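The same comparison in R:

qf(0.99, 1, 16)      # critical value, 8.53
1 - pf(1.88, 1, 16)  # p-value of the observed F*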
Yang Feng (Columbia University) Multiple Regression (II) 24 / 44
Test whether several βk = 0
Another example:

H0: β2 = β3 = 0
Ha: not both β2 and β3 are zero

The general linear test can be used again:

F* = [ (SSE(X1) − SSE(X1,X2,X3)) / ((n − 2) − (n − 4)) ] / [ SSE(X1,X2,X3) / (n − 4) ]

But SSE(X1) − SSE(X1,X2,X3) = SSR(X2,X3|X1), so the expression can be simplified.
Yang Feng (Columbia University) Multiple Regression (II) 25 / 44
Tests concerning regression coefficients
The general linear test can be used to determine whether or not a predictor variable (or set of variables) should be included in the model.
The ANOVA SSE's can be used to compute the F* test statistics.
Some more general tests require fitting the model more than once, unlike the examples given.
Yang Feng (Columbia University) Multiple Regression (II) 26 / 44
Summary of Tests Concerning Regression Coefficients
Test whether all βk = 0
Test whether a single βk = 0
Test whether some βk = 0
Test involving relationships among coefficients, for example,
H0: β1 = β2 vs. Ha: β1 ≠ β2
H0: β1 = 3, β2 = 5 vs. Ha: otherwise

Key point in all tests: form the full model and the reduced model, then calculate SSE(F) and SSE(R).
Yang Feng (Columbia University) Multiple Regression (II) 27 / 44
Coefficients of Partial Determination
Recall the "coefficient of determination": R² measures the proportionate reduction in the variation of Y achieved by introducing the entire set of X variables.
A coefficient of partial determination measures the marginal contribution of one X variable when all the others are already in the model.
Yang Feng (Columbia University) Multiple Regression (II) 28 / 44
Two predictor variables
Yi = β0 + β1Xi1 + β2Xi2 + ϵi

The coefficient of partial determination between Y and X1, given that X2 is in the model, is denoted R²Y1|2:

R²Y1|2 = (SSE(X2) − SSE(X1,X2)) / SSE(X2) = SSR(X1|X2) / SSE(X2)

Likewise:

R²Y2|1 = (SSE(X1) − SSE(X1,X2)) / SSE(X1) = SSR(X2|X1) / SSE(X1)
Yang Feng (Columbia University) Multiple Regression (II) 29 / 44
General case
R²Y1|23 = SSR(X1|X2,X3) / SSE(X2,X3)

R²Y4|123 = SSR(X4|X1,X2,X3) / SSE(X1,X2,X3)
Yang Feng (Columbia University) Multiple Regression (II) 30 / 44
Properties
Value between 0 and 1
Another way of getting R²Y1|2:
1. Regress Y on X2 and obtain the residuals ei(Y|X2) = Yi − Ŷi(X2)
2. Regress X1 on X2 and obtain the residuals ei(X1|X2) = Xi1 − X̂i1(X2)
3. The R² between ei(Y|X2) and ei(X1|X2) is the same as R²Y1|2.

Followup: the scatter plot of ei(Y|X2) against ei(X1|X2) provides a graphical representation of the strength of the relationship between Y and X1, adjusted for X2. These are called "added variable plots" or "partial regression plots". More in Chapter 10.1.
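These two residual regressions are easy to carry out in R; a sketch with a hypothetical data frame d containing Y, X1, X2:

eY  <- resid(lm(Y ~ X2, data = d))   # e(Y | X2)
eX1 <- resid(lm(X1 ~ X2, data = d))  # e(X1 | X2)
cor(eY, eX1)^2                       # equals R^2 Y1|2
plot(eX1, eY)                        # added variable plot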
Yang Feng (Columbia University) Multiple Regression (II) 31 / 44
Coefficients of Partial Correlation
Coefficients of partial correlation: the square root of a coefficient of partial determination, taking the same sign as the corresponding regression coefficient.
Yang Feng (Columbia University) Multiple Regression (II) 32 / 44
Standardized Multiple Regression
Numerical precision errors can occur when
– (X′X)−1 is poorly conditioned (near singular): collinearity
– the predictor variables have substantially different magnitudes

Solutions:
– Regularization
– Standardized multiple regression

First, transformed variables.
Yang Feng (Columbia University) Multiple Regression (II) 33 / 44
Correlation Transformation
Makes all entries in the X′X matrix for the transformed variables fall between −1 and 1 inclusive.

Lack of comparability of regression coefficients:

Y = 200 + 20000X1 + .2X2

with Y in dollars, X1 in thousand dollars, X2 in cents. Which is the most important predictor?

X1 increase of 1,000 dollars → Y increases 20,000 dollars
X2 increase of 1,000 dollars → Y also increases 20,000 dollars
Yang Feng (Columbia University) Multiple Regression (II) 34 / 44
Correlation Transformation
Centering and scaling:

(Yi − Ȳ)/sy

(Xik − X̄k)/sk,  k = 1, ..., p − 1

where

sy = √( Σ(Yi − Ȳ)² / (n − 1) )

sk = √( Σ(Xik − X̄k)² / (n − 1) ),  k = 1, ..., p − 1
Yang Feng (Columbia University) Multiple Regression (II) 35 / 44
Correlation Transformation
Transformed variables:

Y*i = (1/√(n − 1)) · (Yi − Ȳ)/sy

X*ik = (1/√(n − 1)) · (Xik − X̄k)/sk,  k = 1, ..., p − 1
Yang Feng (Columbia University) Multiple Regression (II) 36 / 44
Standardized Regression Model
Define the matrix consisting of the transformed X variables

X* = [ X*11  ...  X*1,p−1
       X*21  ...  X*2,p−1
       ...
       X*n1  ...  X*n,p−1 ]

and define (X*)′X* = rXX
Yang Feng (Columbia University) Multiple Regression (II) 37 / 44
Correlation matrix of the X variables
Can show that

rXX = [ 1       r12     ...  r1,p−1
        r21     1       ...  r2,p−1
        ...
        rp−1,1  rp−1,2  ...  1 ]

where each entry is just the coefficient of correlation between Xi and Xj:

Σ x*i1 x*i2 = Σ [ (Xi1 − X̄1)/(√(n−1) s1) ] · [ (Xi2 − X̄2)/(√(n−1) s2) ]
            = (1/(n−1)) · Σ(Xi1 − X̄1)(Xi2 − X̄2) / (s1s2)
            = Σ(Xi1 − X̄1)(Xi2 − X̄2) / [ Σ(Xi1 − X̄1)² Σ(Xi2 − X̄2)² ]^(1/2)
Yang Feng (Columbia University) Multiple Regression (II) 38 / 44
Standardized Regression Model
The regression model using the transformed variables:

Y*i = β*1 X*i1 + ... + β*p−1 X*i,p−1 + ϵ*i

Notice that there is no need for an intercept.

If we define in a similar way (X*)′Y* = rXY, where rXY is the vector of simple correlation coefficients between each Xk and Y, then we can set up a standard linear regression problem:

rXX b* = rXY
Yang Feng (Columbia University) Multiple Regression (II) 39 / 44
Standardized Regression Model
The solution

b* = [ b*1
       b*2
       ...
       b*p−1 ]

can be related to the solution of the untransformed regression problem through the relationships

bk = (sy/sk) b*k,  k = 1, ..., p − 1

b0 = Ȳ − b1X̄1 − ... − bp−1X̄p−1
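The correlation transformation is a one-liner with scale(); a sketch assuming a hypothetical data frame d with response Y and predictors X1, X2:

n <- nrow(d)
Z <- as.data.frame(scale(d) / sqrt(n - 1))   # Y*, X1*, X2*
bstar <- coef(lm(Y ~ X1 + X2 - 1, data = Z)) # standardized coefficients
bstar * sd(d$Y) / c(sd(d$X1), sd(d$X2))      # back-transform to b1, b2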
Yang Feng (Columbia University) Multiple Regression (II) 40 / 44
Multicollinearity
When the predictor variables are correlated among themselves, intercorrelation or multicollinearity among them is said to exist.
Uncorrelated Predictor Variables
Perfectly Correlated Predictor Variables
Effects of Multicollinearity
Yang Feng (Columbia University) Multiple Regression (II) 41 / 44
Uncorrelated Predictor Variables
Suppose we have the following three regressions:

Regress Y on X1 (estimator b1)
Regress Y on X2 (estimator b2)
Regress Y on both X1 and X2 (estimators b*1 and b*2)

If X1 and X2 are uncorrelated, then we have
1. b1 = b*1, b2 = b*2
2. SSR(X1|X2) = SSR(X1), SSR(X2|X1) = SSR(X2)
Yang Feng (Columbia University) Multiple Regression (II) 42 / 44
Perfectly Correlated Predictor Variables
Regress Y on both X1 and X2. If X1 and X2 are perfectly correlated (say X2 = 5 + .5X1), then

We have infinitely many possible solutions that fit the model equally well (they have the same SSE).
The perfect relation between X1 and X2 does not inhibit our ability to obtain a good fit.
The magnitudes of the regression coefficients cannot be interpreted as reflecting the effects of the different predictor variables.
Yang Feng (Columbia University) Multiple Regression (II) 43 / 44
General Effects of Multicollinearity
Usually, we still have a good fit of the data; in addition, we still have good prediction.
The estimated regression coefficients tend to have large sampling variability when the predictor variables are highly correlated. Some of the regression coefficients may be statistically insignificant even though a definite statistical relation exists.
The common interpretation of a regression coefficient is NOT fully applicable any more.
Regress Y on both X1 and X2: it is possible that when individual t-tests are performed, neither β1 nor β2 is significant; however, when the F-test for both β1 and β2 is performed, the result may still be significant.
Yang Feng (Columbia University) Multiple Regression (II) 44 / 44
Regression Models for Quantitative and Qualitative Predictors
Yang Feng
Yang Feng (Columbia University) Regression Models for Quantitative and Qualitative Predictors 1 / 29
Two types of predictors
Quantitative. (e.g., multiple linear regression)
Qualitative. (e.g., indicator variables)
This Class:
Polynomial Regression Models
Interaction Regression Models
Yang Feng (Columbia University) Regression Models for Quantitative and Qualitative Predictors 2 / 29
Polynomial Regression Models
When the true curvilinear response function is indeed a polynomial function.
When a polynomial function is a good approximation to the true function.
Yang Feng (Columbia University) Regression Models for Quantitative and Qualitative Predictors 3 / 29
One-predictor variable-second order
Yi = β0 + β1xi + β11xi² + ϵi

where xi = Xi − X̄

X is centered due to the possibly high correlation between X and X².

Regression function: E{Y} = β0 + β1x + β11x², a quadratic response function.

β0 is the mean response when x = 0, i.e., X = X̄.
β1 is called the linear effect.
β11 is called the quadratic effect.
Yang Feng (Columbia University) Regression Models for Quantitative and Qualitative Predictors 4 / 29
One Predictor Variable-Third Order
Yi = β0 + β1xi + β11xi² + β111xi³ + ϵi

where xi = Xi − X̄
Yang Feng (Columbia University) Regression Models for Quantitative and Qualitative Predictors 5 / 29
One Predictor Variable-Higher Orders
Employed with special caution.
Tends to overfit
Poor prediction
Yang Feng (Columbia University) Regression Models for Quantitative and Qualitative Predictors 6 / 29
Two Predictors-Second Order
Yi = β0 + β1xi1 + β2xi2 + β11xi1² + β22xi2² + β12xi1xi2 + ϵi

where xi1 = Xi1 − X̄1, xi2 = Xi2 − X̄2
The coefficient β12 is called the interaction effect coefficient.
More on interaction later.
The second-order model with three predictors is similar.
Yang Feng (Columbia University) Regression Models for Quantitative and Qualitative Predictors 7 / 29
Implementation of Polynomial Regression Models
Fitting: very easy; just use least squares for multiple linear regression, since these models can all be seen as multiple regressions.

Determining the order: a very important step!

Yi = β0 + β1xi + β11xi² + β111xi³ + ϵi

Naturally, we want to test whether or not β111 = 0, or whether or not both β11 = 0 and β111 = 0. How do we do the test?
Yang Feng (Columbia University) Regression Models for Quantitative and Qualitative Predictors 8 / 29
Extra Sum of Squares
Decompose SSR into SSR(x), SSR(x²|x) and SSR(x³|x, x²).
Test whether β111 = 0: use SSR(x³|x, x²).
Test whether both β11 = 0 and β111 = 0: use SSR(x², x³|x).

Time for a real example! (A small R sketch follows.)
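A minimal sketch of these tests in R, with hypothetical vectors x and y:

xc <- x - mean(x)                        # center the predictor
fit3 <- lm(y ~ xc + I(xc^2) + I(xc^3))   # centered cubic model
anova(fit3)  # sequential SS: SSR(x), SSR(x^2|x), SSR(x^3|x, x^2);
             # the last row tests beta_111 = 0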
Yang Feng (Columbia University) Regression Models for Quantitative and Qualitative Predictors 9 / 29
Further Comments on Polynomial Regression
There are drawbacks:
1. Sometimes polynomial models are more expensive in degrees of freedom than alternative nonlinear models or linear models with transformed variables.
2. Serious multicollinearity may be present even when the variables are centered.

An alternative to using centered variables is to use orthogonal polynomials.
Yang Feng (Columbia University) Regression Models for Quantitative and Qualitative Predictors 10 / 29
Interaction Regression Models
Additive effects:
E{Y } = f1(X1) + f2(X2) + · · ·+ fp−1(Xp−1)
General effects with interactions. Example:
E{Y } = β0 + β1X1 + β2X2 + β3X1X2
This cross-product term β3X1X2 is called an interaction term.
Yang Feng (Columbia University) Regression Models for Quantitative and Qualitative Predictors 11 / 29
Interpretation of Regression Models with Interactions
E{Y } = β0 + β1X1 + β2X2 + β3X1X2
The change in the mean response with a unit increase in X1 when X2 is held constant is

β1 + β3X2

Similarly, for a unit increase in X2 when X1 is held constant:
β2 + β3X1
Yang Feng (Columbia University) Regression Models for Quantitative and Qualitative Predictors 12 / 29
First type of interaction
First, suppose β1 and β2 are positive.
Reinforcement (synergistic) type: β3 > 0
E{Y } = 10 + 2X1 + 5X2 + .5X1X2
Conditional Effects Plot:
Yang Feng (Columbia University) Regression Models for Quantitative and Qualitative Predictors 13 / 29
Second type of interaction
Interference (antagonistic) type: β3 < 0
E{Y } = 10 + 2X1 + 5X2 − .5X1X2
Yang Feng (Columbia University) Regression Models for Quantitative and Qualitative Predictors 14 / 29
Implementation of Interaction Regression Models
Center the predictor variables to avoid high multicollinearity:

xik = Xik − X̄k

Use prior knowledge to reduce the number of interactions. If we have 8 predictors, then we have 28 pairwise interaction terms in total; for p predictors, the number is p(p − 1)/2.
Now a real example...
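In R the cross-product term can be requested directly in the formula; a sketch with a hypothetical data frame d holding Y, X1, X2:

d$x1 <- d$X1 - mean(d$X1)          # centered predictors
d$x2 <- d$X2 - mean(d$X2)
fit <- lm(Y ~ x1 * x2, data = d)   # expands to x1 + x2 + x1:x2
summary(fit)                       # t-test on x1:x2 assesses the interaction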
Yang Feng (Columbia University) Regression Models for Quantitative and Qualitative Predictors 15 / 29
Qualitative Predictors
Examples:
Gender (male or female)
Purchase status (yes or no)
Disability status (not disabled, partly disabled, fully disabled)
Yang Feng (Columbia University) Regression Models for Quantitative and Qualitative Predictors 16 / 29
A study of innovation in insurance industry
Objective: relate the speed with which a particular insurance innovation is adopted (Y) to the size of the insurance firm (X1) and the type of the firm.
Response Y: quantitative, continuous.
Predictor X1: quantitative.
Second predictor: type of firm (stock companies and mutual companies).
Yang Feng (Columbia University) Regression Models for Quantitative and Qualitative Predictors 17 / 29
Qualitative Predictor with Two Classes
Suppose
X2 = { 1, if stock company
       0, otherwise

X3 = { 1, if mutual company
       0, otherwise
Then, we have the model
Yi = β0 + β1Xi1 + β2Xi2 + β3Xi3 + ϵi
Yang Feng (Columbia University) Regression Models for Quantitative and Qualitative Predictors 18 / 29
Design Matrix
Suppose we have n = 4 observations, the first two being stock firms and the last two mutual firms. Then

X = [ 1  X11  1  0
      1  X21  1  0
      1  X31  0  1
      1  X41  0  1 ]

Observation: the first column is equal to the sum of the X2 and X3 columns, so the columns are linearly dependent and (X′X)⁻¹ does not exist.
Solution: a qualitative variable with c classes will be represented by c − 1 indicator variables, each taking on the values 0 and 1.
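In R this coding is produced automatically: a factor with c levels contributes c − 1 indicator columns to the design matrix. A minimal sketch, assuming a hypothetical data frame dat with a column type taking the values "stock" and "mutual":

dat$type <- factor(dat$type)           # qualitative predictor with c = 2 classes
fit <- lm(Y ~ X1 + type, data = dat)   # R adds c - 1 = 1 indicator variable
model.matrix(fit)                      # inspect the resulting design matrix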
Yang Feng (Columbia University) Regression Models for Quantitative and Qualitative Predictors 19 / 29
Interpretation
Now we drop X3 from the regression model:
Yi = β0 + β1Xi1 + β2Xi2 + ϵi
where X1 = size of the firm, and
X2 = 1, if stock company; 0, otherwise.
The response function is E{Y} = β0 + β1X1 for mutual firms and E{Y} = (β0 + β2) + β1X1 for stock firms, so β2 measures the difference in mean response between the two types of firm at any given size.
Yang Feng (Columbia University) Regression Models for Quantitative and Qualitative Predictors 20 / 29
Interpretation (Cont’)
Yang Feng (Columbia University) Regression Models for Quantitative and Qualitative Predictors 21 / 29
More than Two Classes
Regression of tool wear (Y) on tool speed (X1) and tool model (four classes M1, M2, M3, M4).
4 classes → 3 indicator variables
Define
X2 = 1, if tool model M1; 0, otherwise.
X3 = 1, if tool model M2; 0, otherwise.
X4 = 1, if tool model M3; 0, otherwise.
Then, we have the following first-order regression model:
Yi = β0 + β1Xi1 + β2Xi2 + β3Xi3 + β4Xi4 + ϵi
Yang Feng (Columbia University) Regression Models for Quantitative and Qualitative Predictors 22 / 29
Interpretation
Yang Feng (Columbia University) Regression Models for Quantitative and Qualitative Predictors 23 / 29
Some Considerations in Using Indicator Variables
An alternative: allocated codes.
For example, the predictor variable “frequency of product use” has three classes: frequent user, occasional user, nonuser. We can use a single X1 variable to denote it as follows:
X1 = 3, Frequent User; 2, Occasional User; 1, Nonuser.
Then, we have the regression model:
Yi = β0 + β1Xi1 + ϵi
Yang Feng (Columbia University) Regression Models for Quantitative and Qualitative Predictors 24 / 29
Difficulties with allocated codes
The mean responses under this regression function will be:

Class             E{Y}
Frequent User     β0 + 3β1
Occasional User   β0 + 2β1
Nonuser           β0 + β1

Key implication:
E{Y | frequent user} − E{Y | occasional user} = E{Y | occasional user} − E{Y | nonuser}
Indicator variables do not impose this restriction, since the extra variables allow each class difference to be estimated separately.
Yang Feng (Columbia University) Regression Models for Quantitative and Qualitative Predictors 25 / 29
Other Codings for Indicator Variables
For the stock company and mutual company data:
X2 = 1, if stock company; −1, if mutual company.
Another alternative: use an indicator variable for each of the c classes and drop the intercept term:
Yi = β1Xi1 + β2Xi2 + β3Xi3 + ϵi
where X1 = size of the firm,
X2 = 1, if stock company; 0, otherwise.
X3 = 1, if mutual company; 0, otherwise.
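A minimal sketch of this last coding in R (hypothetical data frame dat with a factor column type): removing the intercept gives one coefficient per class.

fit <- lm(Y ~ X1 + type - 1, data = dat)   # "- 1" drops the intercept
coef(fit)                                  # one baseline coefficient per class, plus the slope on X1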
Yang Feng (Columbia University) Regression Models for Quantitative and Qualitative Predictors 26 / 29
Interactions between Quantitative and Qualitative Variables
Almost the same as the regular interactions
Read Chapter 8.5 and 8.6 after class
Yang Feng (Columbia University) Regression Models for Quantitative and Qualitative Predictors 27 / 29
Comparison of Two or More Regression Functions
Three examples:
A company operates two production lines for making soap bars. For each line, the relationship between the speed of the line and the amount of scrap for the day was studied.
An economist is studying the relationship between the amount of savings and the level of income for middle-income families from urban and rural areas, based on independent samples from the two populations.
Two instruments were constructed for a company to identical specifications to measure pressure in an industrial process.
Yang Feng (Columbia University) Regression Models for Quantitative and Qualitative Predictors 28 / 29
Soap Production Lines Example
Y: scrap; X1: line speed; X2: code for production line.
Interaction model:
Yi = β0 + β1Xi1 + β2Xi2 + β3Xi1Xi2 + ϵi
where Xi1 = line speed and
Xi2 = 1, if production line 1; 0, if production line 2.
i = 1, 2, · · · , 27
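A minimal sketch in R, assuming a hypothetical data frame soap with columns Y, X1, and a two-level factor line; testing β2 = β3 = 0 asks whether the two production lines share a single regression line.

fit  <- lm(Y ~ X1 * line, data = soap)   # separate intercepts and slopes per line
base <- lm(Y ~ X1, data = soap)          # one common line for both
anova(base, fit)                         # F-test of beta2 = beta3 = 0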
Yang Feng (Columbia University) Regression Models for Quantitative and Qualitative Predictors 29 / 29
Model Selection
Yang Feng
Yang Feng (Columbia University) Model Selection 1 / 24
Model-Building Process
Data Collection and preparation
Reduction of explanatory or predictor variables (for exploratory observational studies)
Model refinement and selection (This class!)
Model validation
Yang Feng (Columbia University) Model Selection 2 / 24
Trouble and Strife
From any set of p − 1 predictors, 2^(p−1) different linear regression models can be constructed
Consider all binary vectors of length p − 1
Search in that space is exponentially difficult.
Greedy strategies are typically utilized.
Is this the only way?
Yang Feng (Columbia University) Model Selection 3 / 24
Selecting between models
In order to select between models, some score must be given to eachmodel.
The likelihood of the data under each model is not sufficient because, as we have seen, the likelihood of the data can always be improved by adding more parameters until the model effectively memorizes the data.
Accordingly, some penalty that is a function of the complexity of the model must be included in the selection procedure.
There are several choices for how to do this:
1 Explicit penalization of the number of parameters in the model (AIC, BIC, etc.)
2 Implicit penalization through cross-validation
3 Bayesian regularization (putting a certain prior distribution on each model)
Yang Feng (Columbia University) Model Selection 4 / 24
Six Criteria
R²p, R²a,p, Cp, AICp, BICp (SBCp), PRESSp
Denote the total number of variables as P − 1, so P parameters in total. Here, 1 ≤ p ≤ P.
Use the coefficient of multiple determination R²:
R²p = 1 − SSEp/SSTO
Use the adjusted coefficient of multiple determination:
R²a,p = 1 − [(n − 1)/(n − p)] · (SSEp/SSTO) = 1 − MSEp/[SSTO/(n − 1)]
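A minimal sketch in R (hypothetical data frame dat), showing that these formulas reproduce the values reported by summary():

fit  <- lm(Y ~ X1 + X2, data = dat)
sse  <- deviance(fit)                  # SSE_p
ssto <- sum((dat$Y - mean(dat$Y))^2)   # SSTO
n <- nrow(dat); p <- length(coef(fit))
1 - sse / ssto                         # R^2_p   (= summary(fit)$r.squared)
1 - (n - 1) / (n - p) * sse / ssto     # R^2_a,p (= summary(fit)$adj.r.squared)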
Yang Feng (Columbia University) Model Selection 5 / 24
Mallows’ Cp Criterion
Concerned with the total mean squared error of the fitted values. Writing
(Ŷi − µi)² = [(E{Ŷi} − µi) + (Ŷi − E{Ŷi})]²
we have
E(Ŷi − µi)² = (E{Ŷi} − µi)² + σ²{Ŷi}
Criterion measure:
Γp = (1/σ²) [ Σi=1..n (E{Ŷi} − µi)² + Σi=1..n σ²{Ŷi} ]
Yang Feng (Columbia University) Model Selection 6 / 24
Cp continued
Σi=1..n σ²{Ŷi} = trace(Cov(Xb))                (1)
             = trace(X σ²(X′X)⁻¹ X′)           (2)
             = σ² trace((X′X)(X′X)⁻¹)          (3)
             = pσ²                             (4)
Yang Feng (Columbia University) Model Selection 7 / 24
Cp continued
E{SSEp} = Σi=1..n E(Yi − Ŷi)²                                          (5)
        = Σi=1..n E[(Ŷi − E{Ŷi} − εi + E{Ŷi} − µi)²]                   (6)
        = Σi=1..n (E{Ŷi} − µi)² + Σi=1..n E[(Ŷi − E{Ŷi} − εi)²]        (7)
        = Σi=1..n (E{Ŷi} − µi)² + (n − p)σ²                            (8)
From (7) to (8), refer to the Cp.pdf file.
Yang Feng (Columbia University) Model Selection 8 / 24
Cp continued
Therefore, a good estimator of Γp is
Cp = SSEp / MSE(X1, · · · , XP−1) − (n − 2p)
When E{Ŷi} = µi (the subset model is unbiased), we have
E{Cp} ≈ p
Cp for the full model with P − 1 variables is always P by definition.
We would like to find a model where (1) the Cp value is small, and (2) Cp ≈ p.
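A minimal sketch in R (hypothetical data frame dat), computing Cp for a candidate subset against the full model:

full <- lm(Y ~ ., data = dat)                    # all P - 1 predictors
sub  <- lm(Y ~ X1 + X2, data = dat)              # candidate subset with p parameters
n <- nrow(dat); p <- length(coef(sub))
mse.full <- deviance(full) / df.residual(full)   # MSE(X1, ..., X_{P-1})
deviance(sub) / mse.full - (n - 2 * p)           # Mallows' Cp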
Yang Feng (Columbia University) Model Selection 9 / 24
AIC and BIC
Akaike’s information criterion (AIC) and the Bayesian information criterion (BIC) (also called the Schwarz criterion, SBC, in the book) are two criteria that penalize model complexity.
In the linear regression setting:
AICp = n ln SSEp − n ln n + 2p
BICp = n ln SSEp − n ln n + (ln n)p
Roughly, you can think of these two criteria as penalizing models with many parameters (p in the case of linear regression).
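A minimal sketch in R (continuing with the hypothetical fit sub above); these match what R's extractAIC() reports:

sse <- deviance(sub); n <- nrow(dat); p <- length(coef(sub))
n * log(sse) - n * log(n) + 2 * p        # AIC_p  (= extractAIC(sub)[2])
n * log(sse) - n * log(n) + log(n) * p   # BIC_p  (= extractAIC(sub, k = log(n))[2])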
Yang Feng (Columbia University) Model Selection 10 / 24
Cross Validation
A way of doing model selection that implicitly penalizes models of high complexity is cross-validation.
If you fit a model with a large number of parameters to a subset of the data and then predict the rest of the data using the model you just fit, then:
The average scores on held-out data over different folds can be used to compare models.
If the held-out data is consistently (over the various folds) well explained by the model, then one could conclude that the model is a good model.
If one model performs better on average when predicting held-outdata then there is reason to prefer that model.
Overly complex models will not generalize well and will therefore notbe selected.
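A minimal K-fold cross-validation sketch in R (hypothetical data frame dat):

K <- 5
folds <- sample(rep(1:K, length.out = nrow(dat)))   # random fold assignment
cv.err <- sapply(1:K, function(k) {
  fit  <- lm(Y ~ X1 + X2, data = dat[folds != k, ]) # train on K - 1 folds
  pred <- predict(fit, newdata = dat[folds == k, ]) # predict the held-out fold
  mean((dat$Y[folds == k] - pred)^2)                # held-out MSE for fold k
})
mean(cv.err)                                        # average score; compare across models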
Yang Feng (Columbia University) Model Selection 11 / 24
PRESSp or Leave-One-Out Cross Validation
The PRESSp or “prediction sum of squares” criterion measures how well a subset model can predict the observed responses Yi.
Let Ŷi(i) be the fitted value for case i from a model in which case i was left out during training.
The PRESSp criterion is then given by summing over all n cases:
PRESSp = Σi=1..n (Yi − Ŷi(i))²
PRESSp values can be calculated without doing n separate regression runs.
Yang Feng (Columbia University) Model Selection 12 / 24
PRESSp or Leave-One-Out Cross Validation
If we let di be the deleted residual for the ith case,
di = Yi − Ŷi(i)
then we can rewrite
di = ei / (1 − hii)
where ei is the ordinary residual for the ith case and hii is the ith diagonal element of the hat matrix.
We can obtain the hii diagonal element of the hat matrix directly from
hii = X′i(X′X)⁻¹Xi
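A minimal sketch in R: for any fitted lm object fit, this identity gives PRESSp without n refits.

press <- sum((resid(fit) / (1 - hatvalues(fit)))^2)   # PRESS_p via d_i = e_i / (1 - h_ii)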
Yang Feng (Columbia University) Model Selection 13 / 24
PRESSp or Leave-One-Out Cross Validation
PRESSp is useful for choosing between models. Consider:
Take two models, one Mp with p-dimensional input and one Mp−1 with (p − 1)-dimensional input.
Calculate the PRESSp criterion for Mp and Mp−1.
Whichever model has the lowest PRESSp should be preferred.
Why?
Unfortunately the PRESSp criterion can’t tell us which variables to include. For that, a general search is still required.
More general cross-validation procedures can be designed.
Note that the PRESSp criterion is very similar to log-likelihood.
Yang Feng (Columbia University) Model Selection 14 / 24
Detecting Outliers Via Studentized Deleted Residuals
The studentized deleted residual, denoted by ti, is
ti = di / s{di}
where
s²{di} = MSE(i) / (1 − hii)
and
di / s{di} ∼ t(n − p − 1)
Fortunately, again, the ti can be calculated without fitting n different regression models. It can be shown that
ti = ei [ (n − p − 1) / (SSE(1 − hii) − e²i) ]^{1/2}
These ti’s can be used to formally test (e.g., using a Bonferroni test procedure) whether the largest absolute studentized deleted residual is an outlier.
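A minimal sketch in R: rstudent() returns exactly these ti for a fitted lm object fit, and the Bonferroni critical value is a t quantile (hypothetical dat).

ti <- rstudent(fit)                                 # studentized deleted residuals
n <- nrow(dat); p <- length(coef(fit)); alpha <- 0.05
max(abs(ti)) > qt(1 - alpha / (2 * n), n - p - 1)   # Bonferroni outlier test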
Yang Feng (Columbia University) Model Selection 15 / 24
Six Criteria
R²p, R²a,p, Cp, AICp, BICp (SBCp), PRESSp
Useful for choosing between models:
Take two models, one Mp with p-dimensional input and one Mp−1 with (p − 1)-dimensional input.
Calculate the criteria for Mp and Mp−1.
Unfortunately the criteria can’t tell us which variables to include. For that, a general search is still required.
Yang Feng (Columbia University) Model Selection 16 / 24
Best Subset Algorithms
Consider all possible subsets, and evaluate the criteria for each model.
Time-saving algorithms have been developed, which require the calculation of only a small fraction of all possible models.
Still, if P is more than 30 or 40, it requires excessive computer time.
Several regression models can be identified as “good” for final consideration, depending on which criteria we use.
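A minimal sketch in R using the leaps package (an assumption: the package is installed; dat is hypothetical):

library(leaps)
fits <- regsubsets(Y ~ ., data = dat, nvmax = ncol(dat) - 1)   # best subset of each size
summary(fits)$cp                                               # Cp per size; also $bic, $adjr2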
Yang Feng (Columbia University) Model Selection 18 / 24
Stepwise Regression Methods
An automatic search procedure
Identify a “single” best model
Several different formats
Yang Feng (Columbia University) Model Selection 19 / 24
Forward Stepwise Regression
A (greedy) procedure for identifying variables to include in the regressionmodel is as follows. Repeat until finished:
1 Fit a simple linear regression model for each of the P − 1 X variables considered for inclusion. For each, compute the t∗ statistic for testing whether or not the slope is zero:
t∗k = bk / s{bk} ¹
2 Pick the largest of the P − 1 |t∗k|’s, and include the corresponding X variable in the regression model if |t∗k| exceeds some threshold.
3 If the number of X variables included in the regression model is greater than one, check to see whether the model would be improved by dropping variables (using the t-test and a threshold again).
¹ Remember bk is the estimate for βk and s{bk} is its standard error.
Yang Feng (Columbia University) Model Selection 20 / 24
Forward Stepwise Regression (cont)
Other criteria can be used in determining which variables to add and delete, such as the F-test (full model vs. reduced model), AIC (the default option in R), BIC, or Cp.
Usually much more efficient than the best subset regression
Yang Feng (Columbia University) Model Selection 21 / 24
Forward Regression
Simplified version of forward stepwise regression
No deletion step! Once a variable is in, it will be there from then on.
Yang Feng (Columbia University) Model Selection 22 / 24
Backward Elimination
Start from the full model with P − 1 variables.
Iteratively check whether any variable should be deleted from the model by some given criteria. This time, no addition step!
Time to see how we do this in R. Suppose full is the fitted object for the full model with all P − 1 variables, and null is the intercept-only fit.
Forward Stepwise Regression: step(null, scope=list(upper=full, lower=null), direction='both')
Forward Regression: step(null, scope=list(upper=full, lower=null), direction='forward')
Backward Elimination: step(full, scope=list(upper=full, lower=null), direction='backward')
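A minimal, runnable version of that setup (hypothetical data frame dat):

null <- lm(Y ~ 1, data = dat)   # intercept-only model
full <- lm(Y ~ ., data = dat)   # model with all P - 1 predictors
step(null, scope = list(upper = full, lower = null), direction = 'both')   # AIC-based by default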
Yang Feng (Columbia University) Model Selection 23 / 24
Remarks
If the number of variables is not large (2^(P−1) is not large), it is best to fit all possible models and choose the one with the smallest AIC, BIC, Cp, or PRESS.
If the number of variables is too large, then stepwise forward regression is recommended. More aggressively, if you would like more speed, you can use forward selection or backward elimination.
If p ≫ n, then direct regularization techniques are needed, such as the LASSO, SCAD, etc.
Yang Feng (Columbia University) Model Selection 24 / 24
Diagnostic for Multiple Linear Regression
Yang Feng
Yang Feng (Columbia University) Diagnostic for Multiple Linear Regression 1 / 11
Diagnostics
Added-Variable Plots (Model Adequacy for a Predictor Variable)
Studentized Deleted Residuals (Identifying outlying Y )
Hat Matrix Leverage Values (Identifying outlying X )
DFFITS, Cook’s Distance, DFBETAS (Identifying Influential Cases)
Variance Inflation Factor (Multicollinearity Diagnostic)
Yang Feng (Columbia University) Diagnostic for Multiple Linear Regression 2 / 11
Added-Variable Plots
Measures the marginal role of Xk given that the other variables are already in the model.
Links with the coefficient of partial determination.
Suppose we have only X1 and X2, and we want to measure the role of X1 given that X2 is already in the model.
1 Regress Y on X2 and obtain the residuals ei(Y|X2) = Yi − Ŷi(X2)
2 Regress X1 on X2 and obtain the residuals ei(X1|X2) = Xi1 − X̂i1(X2)
3 The scatter plot of ei(Y|X2) against ei(X1|X2) provides a graphical representation of the strength of the relationship between Y and X1, adjusted for X2. This plot is called an added-variable plot.
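A minimal sketch of those three steps in R (hypothetical dat); if the car package is available, avPlots(fit) draws these automatically:

eY <- resid(lm(Y ~ X2, data = dat))    # step 1: residuals of Y given X2
eX <- resid(lm(X1 ~ X2, data = dat))   # step 2: residuals of X1 given X2
plot(eX, eY)                           # step 3: the added-variable plot
# the slope of lm(eY ~ eX) equals the coefficient of X1 in lm(Y ~ X1 + X2)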
Yang Feng (Columbia University) Diagnostic for Multiple Linear Regression 3 / 11
Added-Variable Plots (Some prototypes)
Yang Feng (Columbia University) Diagnostic for Multiple Linear Regression 4 / 11
Studentized Deleted Residuals
Yang Feng (Columbia University) Diagnostic for Multiple Linear Regression 5 / 11
Studentized Deleted Residuals (Cont’)
Let di be the deleted residual for the ith case:
di = Yi − Ŷi(i)
The studentized deleted residual, denoted by ti, is
ti = di / s{di}
where
s²{di} = MSE(i) / (1 − hii)
and
di / s{di} ∼ t(n − p − 1)
It can be shown that
ti = ei [ (n − p − 1) / (SSE(1 − hii) − e²i) ]^{1/2}
These ti’s can be used to formally test (e.g., using a Bonferroni test procedure) whether the largest absolute studentized deleted residual is an outlier.
Yang Feng (Columbia University) Diagnostic for Multiple Linear Regression 6 / 11
Hat Matrix Leverage Values
Yang Feng (Columbia University) Diagnostic for Multiple Linear Regression 7 / 11
Hat Matrix Leverage Values
For the hat matrix H, we have 0 ≤ hii ≤ 1 and Σi=1..n hii = p. hii is called the leverage of the ith case.
A large hii indicates that case i has substantial leverage in determining the fitted value Ŷi. Reasons:
The fitted value Ŷi is a linear combination of the observed Y values, where hii is the weight of observation Yi.
The larger hii is, the smaller the variance of the residual ei.
If hii > 2p/n, we say it is large.
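A minimal sketch in R for a fitted lm object fit (hypothetical dat):

h <- hatvalues(fit)                             # leverages h_ii
which(h > 2 * length(coef(fit)) / nrow(dat))    # cases exceeding the 2p/n rule of thumb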
Yang Feng (Columbia University) Diagnostic for Multiple Linear Regression 8 / 11
Identifying Influential Cases
After identifying cases as outlying, we would like to ascertain whether these cases are influential, i.e., whether their exclusion causes major changes in the fitted regression function. Three measures:
Influence on a single fitted value (DFFITS). The influence case i has on the fitted value Ŷi is given by
(DFFITS)i = (Ŷi − Ŷi(i)) / √(MSE(i) hii)        (1)
          = ti (hii / (1 − hii))^{1/2}          (2)
Influential if larger than 1 for small to medium data sets, or larger than 2√(p/n) for large data sets.
Yang Feng (Columbia University) Diagnostic for Multiple Linear Regression 9 / 11
Identifying Influential Cases
Influence on all fitted values (Cook’s distance):
Di = Σj=1..n (Ŷj − Ŷj(i))² / (p MSE)            (3)
   = [e²i / (p MSE)] · [hii / (1 − hii)²]       (4)
Influential if near or beyond the 50th percentile of F(p, n − p).
Influence on the regression coefficients (DFBETAS):
(DFBETAS)k(i) = (bk − bk(i)) / √(MSE(i) ckk),   (5)
where ckk is the kth diagonal element of (X′X)⁻¹.
Influential if larger than 1 for small to medium data sets, or larger than 2/√n for large data sets.
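All three measures are built into base R; a minimal sketch for a fitted lm object fit:

dffits(fit)               # (DFFITS)_i
cooks.distance(fit)       # D_i
dfbetas(fit)              # (DFBETAS)_k(i), one column per coefficient
influence.measures(fit)   # all of the above, with potentially influential cases flagged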
Yang Feng (Columbia University) Diagnostic for Multiple Linear Regression 10 / 11
Variance Inflation Factor
A formal way of detecting the presence of multicollinearity.
Considering the standardized regression model, we have
σ²{b∗} = (σ∗)² rXX⁻¹
Define (VIF)k as the kth diagonal element of rXX⁻¹.
(VIF)k = (1 − R²k)⁻¹, for k = 1, 2, · · · , p − 1, where R²k is the coefficient of determination when Xk is regressed on the p − 2 other X variables.
Think about what happens when R²k = 0 and when R²k is close to 1.
maxk (VIF)k > 10 indicates that there is a serious multicollinearity problem.
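A minimal sketch in R (hypothetical dat with predictors X1, X2, X3); the car package's vif(fit) computes all of them at once:

r2k <- summary(lm(X1 ~ X2 + X3, data = dat))$r.squared   # R^2_k for X_k = X1
1 / (1 - r2k)                                            # (VIF)_1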
Yang Feng (Columbia University) Diagnostic for Multiple Linear Regression 11 / 11
Remedial Measures for Multiple Linear Regression Models
Yang Feng
Yang Feng (Columbia University) Remedial Measures for Multiple Linear Regression Models 1 / 25
Outline
Unequal error variance: weighted least squares
Multicollinearity: ridge regression
Influential cases: robust regression
Nonparametric regression: lowess method and regression trees
Evaluating precision: bootstrapping
Yang Feng (Columbia University) Remedial Measures for Multiple Linear Regression Models 2 / 25
Unequal Error Variance
Yi = β0 + β1Xi1 + · · ·+ βp−1Xi ,p−1 + εi
Here: the εi are independent N(0, σ²i). (Originally: the εi are independent N(0, σ²).)
In matrix form:
σ²{ε} = diag(σ²1, σ²2, · · · , σ²n)
Yang Feng (Columbia University) Remedial Measures for Multiple Linear Regression Models 3 / 25
Known Error Variance
Define weights
wi = 1/σ²i
Denote
W = diag(w1, w2, · · · , wn)
The weighted least squares and maximum likelihood estimator is
bw = (X′WX)⁻¹X′WY
(Derivation on the board; two methods: direct MLE, and transforming variables to the regular multiple linear regression.)
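A minimal sketch in R (hypothetical dat and a hypothetical vector sigma2 of known variances): lm() takes the weights directly.

w <- 1 / sigma2                                     # w_i = 1 / sigma_i^2
fit.w <- lm(Y ~ X1 + X2, data = dat, weights = w)
coef(fit.w)                                         # b_w = (X'WX)^{-1} X'WY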
Yang Feng (Columbia University) Remedial Measures for Multiple Linear Regression Models 4 / 25
Error Variance Known up to Proportionality Constant
wi = k · (1/σ²i)
Same estimator.
Yang Feng (Columbia University) Remedial Measures for Multiple Linear Regression Models 5 / 25
Unknown Error Variances
In reality, one rarely knows the variances σ²i.
Estimation of Variance Function or Standard Deviation Function
Use of Replicates or Near Replicates
Yang Feng (Columbia University) Remedial Measures for Multiple Linear Regression Models 6 / 25
Estimation of Variance Function or Standard DeviationFunction
Four steps (they can be iterated several times to reach convergence):
1 Fit the regression model by unweighted least squares and analyze the residuals.
2 Estimate the variance function or the standard deviation function by regressing either the squared residuals or the absolute residuals on the appropriate predictor(s). (We know that the variance of εi is σ²i = E(ε²i) − (E(εi))² = E(ε²i); hence the squared residual e²i is an estimator of σ²i.)
3 Use the fitted values from the estimated variance or standard deviation function to obtain the weights wi.
4 Estimate the regression coefficients using these weights.
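A minimal sketch of one pass in R (hypothetical dat with a single predictor X1):

fit0 <- lm(Y ~ X1, data = dat)                   # step 1: unweighted fit
dat$e2 <- resid(fit0)^2                          # squared residuals estimate sigma_i^2
vfit <- lm(e2 ~ X1, data = dat)                  # step 2: estimated variance function
w <- 1 / fitted(vfit)                            # step 3: weights (assumes fitted variances > 0)
fit.w <- lm(Y ~ X1, data = dat, weights = w)     # step 4: weighted fit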
Yang Feng (Columbia University) Remedial Measures for Multiple Linear Regression Models 7 / 25
Use of Replicates or Near Replicates
When replicates or near replicates are available, use the sample variance of the replicates as the estimate for the variances.
In observational studies, usually not available.
Yang Feng (Columbia University) Remedial Measures for Multiple Linear Regression Models 8 / 25
Ridge Regression
Multicollinearity
In polynomial regression models, higher order terms
One or several predictor variables may be dropped from the model in order to remove the multicollinearity.
Yang Feng (Columbia University) Remedial Measures for Multiple Linear Regression Models 9 / 25
Ridge Estimators
OLS: (X′X)b = X′Y
Transformed by the correlation transformation:
rXX b = rYX
Ridge estimator: for a constant c ≥ 0,
(rXX + cI) bR = rYX
c = 0: OLS.
c > 0: biased, but much more stable.
Yang Feng (Columbia University) Remedial Measures for Multiple Linear Regression Models 10 / 25
Choice of c
Ridge trace (0 ≤ c ≤ 1) and the variance inflation factors (VIF)k:
(VIF)k = (1 − R²k)⁻¹
Choose the c where the ridge trace starts to become stable and the VIF has become sufficiently small.
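A minimal sketch in R using MASS::lm.ridge (hypothetical dat), where lambda plays the role of c:

library(MASS)
fits <- lm.ridge(Y ~ X1 + X2, data = dat, lambda = seq(0, 1, by = 0.01))
plot(fits)     # ridge trace: coefficient paths against c
select(fits)   # automatic suggestions for c (e.g., by GCV)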
Yang Feng (Columbia University) Remedial Measures for Multiple Linear Regression Models 11 / 25
Robust Regression
Robust to outlying and influential cases.
LAD (least absolute deviation) regression. To minimize
L1 = Σi=1..n |Yi − (β0 + β1Xi1 + · · · + βp−1Xi,p−1)|.          (1)
LMS (least median of squares) regression. To minimize
median{[Yi − (β0 + β1Xi1 + · · · + βp−1Xi,p−1)]²}.              (2)
Yang Feng (Columbia University) Remedial Measures for Multiple Linear Regression Models 12 / 25
IRLS Robust Regression
1 Choose a weight function for weighting the case.
2 Obtain starting weights for all cases.
3 Use the starting weights in weighted least squares and obtain the residuals from the fit.
4 Use the residuals in step 3 to obtain revised weights.
5 Continue the iteration until convergence.
Yang Feng (Columbia University) Remedial Measures for Multiple Linear Regression Models 13 / 25
IRLS Robust Regression
1 Huber weight function:
w = 1,          if |u| ≤ 1.345;
    1.345/|u|,  if |u| > 1.345.                                 (3)
2 Bisquare weight function:
w = [1 − (u/4.685)²]²,  if |u| ≤ 4.685;
    0,                   if |u| > 4.685.                        (4)
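A minimal sketch in R: MASS::rlm runs this IRLS procedure with either weight function (hypothetical dat).

library(MASS)
fit.h <- rlm(Y ~ X1 + X2, data = dat, psi = psi.huber)      # Huber weights
fit.b <- rlm(Y ~ X1 + X2, data = dat, psi = psi.bisquare)   # bisquare weights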
Yang Feng (Columbia University) Remedial Measures for Multiple Linear Regression Models 14 / 25
Starting weights
Huber weight: OLS
Bisquare weights: use an initial robust fit with Huber weights, or the residuals from LAD regression.
Scaled Residuals
When there are no outlying observations, scale the residuals by √MSE.
When there are outlying observations, use the resistant and robust median absolute deviation (MAD) estimator
MAD = (1/0.6745) median{|ei − median{ei}|}.                     (5)
Then the scaled residual is
ui = ei / MAD.                                                  (6)
Yang Feng (Columbia University) Remedial Measures for Multiple Linear Regression Models 15 / 25
Lowess Method
Two predictor variables, fitted value at (Xh1,Xh2).
Distance measure:
di = [(Xi1 − Xh1)² + (Xi2 − Xh2)²]^{1/2}.                       (7)
Proportion q of the data that are nearest to (Xh1, Xh2). Larger q leads to a smoother fit, but may increase the bias. Usually between .4 and .6.
Weight function:
wi = [1 − (di/dq)³]³,  if di < dq;
     0,                 if di ≥ dq.                             (8)
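A minimal sketch in R: loess() fits local regressions of this kind with two predictors, with span playing the role of q (hypothetical dat, and a hypothetical target point (xh1, xh2)):

fit.lo <- loess(Y ~ X1 + X2, data = dat, span = 0.5, degree = 1)
predict(fit.lo, newdata = data.frame(X1 = xh1, X2 = xh2))   # fitted value at (Xh1, Xh2)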
Yang Feng (Columbia University) Remedial Measures for Multiple Linear Regression Models 16 / 25
Regression Trees
A powerful nonparametric regression method.
Can handle multiple predictors.
Easy to calculate and require virtually no assumptions.
Achieved by partitioning the covariate space.
Key quantities:
Number of regions r.
Split points between the regions.
Yang Feng (Columbia University) Remedial Measures for Multiple Linear Regression Models 17 / 25
Growing a regression tree
Take a single predictor as an example.
Yang Feng (Columbia University) Remedial Measures for Multiple Linear Regression Models 18 / 25
Growing a regression tree
Divide X into two regions: R21 and R22.
The optimal split point Xs minimizes the SSE:
SSE = SSE(R21) + SSE(R22),                                      (9)
where
SSE(R2j) = Σi∈R2j (Yi − ȲR2j)².                                 (10)
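A minimal sketch in R using the rpart package (an assumption: the package is installed; dat is hypothetical); splits are chosen to reduce SSE:

library(rpart)
tree <- rpart(Y ~ X1 + X2, data = dat, method = "anova")   # SSE-based regression tree
plot(tree); text(tree)                                     # display the fitted partition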
Yang Feng (Columbia University) Remedial Measures for Multiple Linear Regression Models 19 / 25
Growing a regression tree
Yang Feng (Columbia University) Remedial Measures for Multiple Linear Regression Models 20 / 25
A graphical example
Yang Feng (Columbia University) Remedial Measures for Multiple Linear Regression Models 21 / 25
Bootstrapping
A nonparametric method of evaluating the uncertainty of an estimate.
Suppose we want to evaluate the precision of an estimated regressioncoefficient b1.
1 Fix B as the number of bootstrap samples to be generated. Say B = 500.
2 For each k = 1, · · · ,B, sample n observations with replacement fromthe original n observations.
3 Fit a linear regression model using the bootstrap sample, which yields a coefficient b1^(k).
4 The sample standard deviation of {b1^(1), b1^(2), · · · , b1^(B)} is a measure of the precision of b1.
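A minimal random-X bootstrap sketch in R (hypothetical dat):

B <- 500
b1 <- replicate(B, {
  idx <- sample(nrow(dat), replace = TRUE)          # resample cases with replacement
  coef(lm(Y ~ X1 + X2, data = dat[idx, ]))["X1"]    # b1^(k) from the bootstrap fit
})
sd(b1)                                              # bootstrap estimate of the precision of b1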
Yang Feng (Columbia University) Remedial Measures for Multiple Linear Regression Models 22 / 25
Bootstrap sampling
Fixed X sampling. If the regression function is a good model for the data and the error terms have constant variance: obtain the residuals ei from the original fit, then get a bootstrap sample e∗1, · · · , e∗n of size n from the ei, and define new responses Y∗1, · · · , Y∗n by
Y∗i = Ŷi + e∗i.                                                 (11)
Then regress Y ∗ values on the original X variables.
Random X sampling. Get a bootstrap sample of (X∗, Y∗) pairs.
Yang Feng (Columbia University) Remedial Measures for Multiple Linear Regression Models 23 / 25
Bootstrap Confidence Intervals
From the bootstrap distribution of b∗1, get the α/2 and 1 − α/2 quantiles b∗1(α/2) and b∗1(1 − α/2). Suppose the original estimate is b1.
Percentile bootstrap:
b∗1(α/2) ≤ β1 ≤ b∗1(1 − α/2).                                   (12)
Basic bootstrap (reflection method):
2b1 − b∗1(1 − α/2) ≤ β1 ≤ 2b1 − b∗1(α/2).                       (13)
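A minimal sketch in R, continuing from the bootstrap vector b1 above (b1.hat denotes the original estimate, a hypothetical name):

q <- quantile(b1, c(0.025, 0.975))        # b1*(alpha/2) and b1*(1 - alpha/2) for alpha = .05
q                                         # percentile interval (12)
c(2 * b1.hat - q[2], 2 * b1.hat - q[1])   # basic (reflection) interval (13)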
Yang Feng (Columbia University) Remedial Measures for Multiple Linear Regression Models 24 / 25
Reflection Method-explained
With probability 1 − α,
b1(α/2) ≤ b1 ≤ b1(1 − α/2),                                     (14)
where b1(α/2) and b1(1 − α/2) are the α/2 and 1 − α/2 quantiles of the sampling distribution of b1.
Let
D1 = β1 − b1(α/2)                                               (15)
D2 = b1(1 − α/2) − β1.                                          (16)
Substituting (15) and (16) into (14), we have
b1 − D2 ≤ β1 ≤ b1 + D1.                                         (17)
Estimating D1 and D2 from the bootstrap distribution, D1 ≈ b1 − b∗1(α/2) and D2 ≈ b∗1(1 − α/2) − b1, turns (17) into the reflection interval (13).
Yang Feng (Columbia University) Remedial Measures for Multiple Linear Regression Models 25 / 25