02 Chapter ALR for Printing: lecture overheads for Chapter 2 (Simple Linear Regression) of Weisberg's Applied Linear Regression (ALR)


Introduction to Simple Linear Regression: I

consider a response Y and a predictor X; the model for simple linear regression assumes a mean function E(Y | X = x) of the form \beta_0 + \beta_1 x and a variance function Var(Y | X = x) that is constant:

    E(Y | X = x) = \beta_0 + \beta_1 x   and   Var(Y | X = x) = \sigma^2

the 3 parameters are \beta_0 (intercept), \beta_1 (slope) & \sigma^2 > 0

to interpret \sigma^2, define the random variable e = Y - E(Y | X = x), so that Y = E(Y | X = x) + e

results in A.2.4 of ALR say that E(e) = 0 and Var(e) = \sigma^2

\sigma^2 tells how close Y is likely to be to its mean value E(Y | X = x)

ALR 21, 292 | II.1

Introduction to Simple Linear Regression: II

let (x_i, y_i), i = 1, ..., n, denote predictor/response pairs (X, Y)

e_i = y_i - E(Y | X = x_i) = y_i - (\beta_0 + \beta_1 x_i) is called a statistical error and represents the distance between y_i and its mean function

we will make two additional assumptions about the errors:

[1] E(e_i | x_i) = 0, implying a scatterplot of e_i versus x_i should resemble a null plot (random deviations about zero)

[2] e_1, ..., e_n are a set of n independent random variables (RVs)

we will make a third additional assumption upon occasion:

[3] conditional on the x_i's, the errors e_i are normally distributed

note: normally distributed is the same as Gaussian distributed (the preferred expression in engineering & the physical sciences)

ALR 21, 29 | II.2

Estimation of Model Parameters: I

Q: given data (x_i, y_i), i = 1, ..., n (n realizations of the RVs X and Y), how should we determine the parameters \beta_0 & \beta_1?

since \beta_0 & \beta_1 determine a line, the question is equivalent to deciding how best to draw a line through a scatterplot of (x_i, y_i)

for n > 2, possibilities for defining "best" (lots more exist!):

hire an expert to eyeball a line (Mosteller et al., 1981)

find the line minimizing distances between the data and all possible lines, with some considerations being direction (vertical, horizontal, perpendicular) and squared difference, absolute difference, etc.

look at all possible lines determined by two distinct points in the scatterplot, and pick the one with the median slope (sounds bizarre, but later on we will discuss why this might be of interest)

ALR 22, 23 | II.3

[Figure II.5: scatterplot of y_i versus x_i illustrating vertical, horizontal & perpendicular least squares.]

Estimation of Model Parameters: II

one strategy: form to-be-defined estimators \hat\beta_0 & \hat\beta_1 of \beta_0 & \beta_1, after which form the residuals (observed errors):

    \hat e_i = y_i - (\hat\beta_0 + \hat\beta_1 x_i) = y_i - \hat y_i,  where \hat y_i = \hat\beta_0 + \hat\beta_1 x_i is the fitted value for the ith case

Q: why isn't the residual \hat e_i in general equal to the error e_i = y_i - (\beta_0 + \beta_1 x_i)?

Q: if per chance we had \hat\beta_0 = \beta_0 & \hat\beta_1 = \beta_1, would the fitted value \hat y_i be equal to the actual value y_i?

ALR 22 | II.6

Least Squares Criterion: I

least squares scheme: estimate \beta_0 & \beta_1 such that the sum of squares of the resulting residuals is as small as possible

since the residuals are given by \hat e_i = y_i - (\hat\beta_0 + \hat\beta_1 x_i) once \hat\beta_0 & \hat\beta_1 are known, consider

    RSS(b_0, b_1) = \sum_{i=1}^n [y_i - (b_0 + b_1 x_i)]^2,

i.e., the residual sum of squares when we use b_0 for the intercept and b_1 for the slope

the least squares estimators \hat\beta_0 & \hat\beta_1 are such that

    RSS(\hat\beta_0, \hat\beta_1) < RSS(b_0, b_1)

when either b_0 ≠ \hat\beta_0 or b_1 ≠ \hat\beta_1 (or both)

ALR 24 | II.7

Least Squares Criterion: II

Q: how do we set b_0 & b_1 to make RSS(b_0, b_1) the smallest? could try lots of different values (a grid search, a potentially exhausting task!), but we can put calculus to good use here

to motivate how to find \hat\beta_0 & \hat\beta_1, first consider the simpler mean function E(Y | X = x) = \beta_1 x (regression through the origin)

the model is now Y = \beta_1 x + e, and the task is to find the b_1 minimizing

    RSS(b_1) = \sum_{i=1}^n [y_i - b_1 x_i]^2 = \sum_{i=1}^n ( y_i^2 - 2 b_1 x_i y_i + b_1^2 x_i^2 )

Q: why is the b_1 minimizing RSS(b_1) the same as the z minimizing

    f(z) = a z^2 + b z,  where  a = \sum_{i=1}^n x_i^2  and  b = -2 \sum_{i=1}^n x_i y_i ?

ALR 24, 47, 48 | II.8

Least Squares Criterion: III

since a = \sum_i x_i^2 > 0, f(z) = a z^2 + b z → ∞ as z → ±∞

since f(0) = 0, the minimizer z must be such that f(z) ≤ 0

[Figure II.9: plot of f(z) versus z, an upward-opening parabola passing through the origin.]

Least Squares Criterion: IV

the roots of the polynomial a z^2 + b z + c are given by the quadratic formula:

    ( -b ± sqrt(b^2 - 4ac) ) / (2a)

when c = 0, one root is 0 and the nonzero root is -b/a, so the minimizer (the vertex of the upward-opening parabola, midway between the two roots) is

    z = -b/(2a) = \sum_i x_i y_i / \sum_i x_i^2 = \hat\beta_1    (*)

alternative approach to finding the minimizer of RSS(b_1): differentiate with respect to b_1, set the result to 0 and solve for b_1:

    d RSS(b_1)/d b_1 = d/d b_1 \sum_i [y_i - b_1 x_i]^2 = -2 \sum_i x_i (y_i - b_1 x_i) = 0,

which yields the same expression for \hat\beta_1 as stated in (*)

ALR 47, 48 | II.10
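A quick numerical check of (*), written as a minimal R sketch; it assumes numeric vectors x and y holding predictor and response values (generic names, not from the slides):

    # least squares slope for regression through the origin
    b1.origin <- sum(x * y) / sum(x^2)
    # the built-in fit without an intercept gives the same slope:
    coef(lm(y ~ x - 1))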

Least Squares Criterion: V

return now to the mean function E(Y | X = x) = \beta_0 + \beta_1 x, for which RSS(b_0, b_1) = \sum_i [y_i - b_0 - b_1 x_i]^2

a calculus-based approach to get the least squares estimators \hat\beta_0 & \hat\beta_1 follows a path similar to that for E(Y | X = x) = \beta_1 x

it leads to two equations to solve for the two unknowns (\hat\beta_0 & \hat\beta_1)

differentiate RSS(b_0, b_1) with respect to b_0 and set the result to 0:

    -2 \sum_i (y_i - b_0 - b_1 x_i) = 0,  giving  b_0 n + b_1 \sum_i x_i = \sum_i y_i

differentiate RSS(b_0, b_1) with respect to b_1 and set the result to 0:

    -2 \sum_i x_i (y_i - b_0 - b_1 x_i) = 0,  giving  b_0 \sum_i x_i + b_1 \sum_i x_i^2 = \sum_i x_i y_i

ALR 293 | II.11

Least Squares Criterion: VI

the so-called normal equations for simple linear regression are thus

    b_0 n + b_1 \sum_i x_i = \sum_i y_i   and   b_0 \sum_i x_i + b_1 \sum_i x_i^2 = \sum_i x_i y_i

using \bar x = (1/n) \sum_i x_i & \bar y = (1/n) \sum_i y_i, the 1st normal equation gives

    b_0 = \bar y - b_1 \bar x

replace b_0 in the 2nd normal equation with the right-hand side of the above:

    (\bar y - b_1 \bar x) \sum_i x_i + b_1 \sum_i x_i^2 = \sum_i x_i y_i

after a bit of algebra, we get

    b_1 = ( \sum_i x_i y_i - \bar y \sum_i x_i ) / ( \sum_i x_i^2 - \bar x \sum_i x_i ) = ( \sum_i x_i y_i - n \bar x \bar y ) / ( \sum_i x_i^2 - n \bar x^2 ) = \hat\beta_1,

and hence \hat\beta_0 = \bar y - \hat\beta_1 \bar x

ALR 293, 294 | II.12

Sum of Cross Products and Sum of Squares

define the sum of cross products and the sum of squares for the x's:

    SXY = \sum_i (x_i - \bar x)(y_i - \bar y)   &   SXX = \sum_i (x_i - \bar x)^2    (**)

Problem 1: show that

    \sum_i (x_i - \bar x)(y_i - \bar y) = \sum_i x_i y_i - n \bar x \bar y   &   \sum_i (x_i - \bar x)^2 = \sum_i x_i^2 - n \bar x^2

can thus write

    \hat\beta_1 = ( \sum_i x_i y_i - n \bar x \bar y ) / ( \sum_i x_i^2 - n \bar x^2 ) = SXY/SXX

note: should avoid \sum_i x_i y_i - n \bar x \bar y & \sum_i x_i^2 - n \bar x^2 when actually computing \hat\beta_1; use SXY and SXX from (**) instead

ALR 294, 23 | II.13

Sufficient Statistics

since \hat\beta_1 = SXY/SXX and \hat\beta_0 = \bar y - \hat\beta_1 \bar x, we need only know \bar x, \bar y, SXY & SXX to form \hat\beta_0 & \hat\beta_1

since

    \bar x = (1/n) \sum_{i=1}^n x_i,  \bar y = (1/n) \sum_{i=1}^n y_i,  SXY = \sum_{i=1}^n x_i y_i - n \bar x \bar y,  SXX = \sum_{i=1}^n x_i^2 - n \bar x^2,

it follows that \hat\beta_0 & \hat\beta_1 depend only on four sufficient statistics:

    \sum_{i=1}^n x_i,  \sum_{i=1}^n y_i,  \sum_{i=1}^n x_i y_i  and  \sum_{i=1}^n x_i^2

in theory, can dispense with the 2n values (x_i, y_i), i = 1, ..., n, and just keep the 4 sufficient statistics as far as \hat\beta_0 & \hat\beta_1 are concerned

ALR 294, 23 | II.14

Atmospheric Pressure & Boiling Point of Water

as a 1st example, reconsider Forbes's recordings of atmospheric pressure and the boiling point of water, which physics suggests are related by

    log(pressure) = \beta_0 + \beta_1 × boiling point

taking the response Y to be log10(pressure) and the predictor X to be boiling point, we will estimate \beta_0 & \beta_1 for the model

    Y = \beta_0 + \beta_1 X + e

via least squares based upon the data (x_i, y_i), i = 1, ..., 17

taking log to mean log base 10, computations in R yield

    \bar x = 202.9529,  \bar y = 1.396041,  SXY = 4.753781  &  SXX = 530.7824,

from which we get

    \hat\beta_1 = SXY/SXX = 0.008956178   and   \hat\beta_0 = \bar y - \hat\beta_1 \bar x = -0.4216418

ALR 25, 26 | II.15
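The quantities on this slide can be reproduced with a short R sketch. The data frame and column names below (forbes, bp, pres) are illustrative assumptions, not from the slides; any data frame holding Forbes's 17 boiling points and pressures will do. Later sketches in this transcript reuse x, y, n, xbar, ybar, SXX, SXY, beta0.hat, beta1.hat and fit from this block.

    # assumed data frame `forbes` with columns `bp` (boiling point) and `pres` (pressure)
    x <- forbes$bp
    y <- log10(forbes$pres)               # response is log base 10 of pressure
    n <- length(y)
    xbar <- mean(x); ybar <- mean(y)
    SXX <- sum((x - xbar)^2)
    SXY <- sum((x - xbar) * (y - ybar))
    beta1.hat <- SXY / SXX                # slope estimate
    beta0.hat <- ybar - beta1.hat * xbar  # intercept estimate
    # the built-in least squares fit gives the same estimates:
    fit <- lm(y ~ x)
    coef(fit)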

Predicting the Weather

as a 2nd example, reconsider the n = 93 years of measured early/late season snowfalls from Fort Collins, Colorado

taking the response Y and predictor X to be late season (Jan-June) and early season (Sept-Dec) snowfalls, entertain the model

    Y = \beta_0 + \beta_1 X + e

computations in R yield

    \bar x = 16.74409,  \bar y = 32.04301,  SXY = 2229.014  &  SXX = 10954.07,

from which we get

    \hat\beta_1 = SXY/SXX = 0.2034873   and   \hat\beta_0 = \bar y - \hat\beta_1 \bar x = 28.6358

ALR 8, 9 | II.17

[Figure II.18: late snowfall (inches) versus early snowfall (inches), Fort Collins data. ALR 8]

Sample Variances, Covariance and Correlation

define the sample variance and sample standard deviation of the x's:

    SD_x^2 = \sum_i (x_i - \bar x)^2 / (n - 1) = SXX/(n - 1)   &   SD_x = sqrt( SXX/(n - 1) )

note: sometimes n is used in place of n - 1 in defining SD_x^2

after defining SYY = \sum_i (y_i - \bar y)^2 (the sum of squares for the y's), define similar quantities for the y's:

    SD_y^2 = \sum_i (y_i - \bar y)^2 / (n - 1) = SYY/(n - 1)   &   SD_y = sqrt( SYY/(n - 1) )

finally define the sample covariance and then the sample correlation:

    s_xy = \sum_i (x_i - \bar x)(y_i - \bar y) / (n - 1) = SXY/(n - 1)   &   r_xy = s_xy / (SD_x SD_y)

ALR 23 | II.19
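Continuing the R sketch from the Forbes example (reusing x, y, SXY and n), the sample covariance and correlation are:

    sxy <- SXY / (n - 1)
    rxy <- sxy / (sd(x) * sd(y))     # sd() uses the n - 1 divisor
    all.equal(rxy, cor(x, y))        # agrees with the built-in correlation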

Alternative Expression for Slope Estimator

Problem 2: an alternative expression for \hat\beta_1 = SXY/SXX is

    \hat\beta_1 = r_xy SD_y / SD_x

Problem 3: show that -1 ≤ r_xy ≤ 1

note that, if the x_i's & y_i's are such that SD_y = SD_x, then the estimated slope is the same as the sample correlation, as the following set of plots illustrates

ALR 24 | II.20
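A one-line check of Problem 2 in R (continuing the earlier sketch):

    cor(x, y) * sd(y) / sd(x)        # same value as beta1.hat = SXY/SXX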

[Figure II.21: scatterplot illustrating sample correlation r_xy = 0.999.]

Estimating \sigma^2: I

the simple linear regression model has 3 parameters: \beta_0 (intercept), \beta_1 (slope) & \sigma^2 (variance of the errors)

with \beta_0 & \beta_1 estimated by \hat\beta_0 & \hat\beta_1, we will base an estimator for \sigma^2 on the variance of the residuals (observed errors)

recall the definition of the residuals: \hat e_i = y_i - (\hat\beta_0 + \hat\beta_1 x_i)

in view of, e.g., SD_x^2 = \sum_i (x_i - \bar x)^2 / (n - 1), the obvious estimator of \sigma^2 would appear to be

    \sum_i (\hat e_i - \bar{\hat e})^2 / (n - 1),  where  \bar{\hat e} = (1/n) \sum_{i=1}^n \hat e_i

Problem 4: show that \bar{\hat e} = 0 always for simple linear regression

ALR 26 | II.22

Estimating \sigma^2: II

the obvious estimator thus simplifies to \sum_i \hat e_i^2 / (n - 1)

taking RSS to be shorthand for RSS(\hat\beta_0, \hat\beta_1), we have

    RSS = \sum_{i=1}^n [y_i - (\hat\beta_0 + \hat\beta_1 x_i)]^2 = \sum_{i=1}^n \hat e_i^2,

so the obvious estimator of \sigma^2 is RSS/(n - 1)

can show (e.g., Seber, 1977, p. 51) that an unbiased estimator \hat\sigma^2 of \sigma^2, i.e., E(\hat\sigma^2) = \sigma^2, is

    \hat\sigma^2 = RSS/(n - 2),

where n - 2 = sample size minus # of parameters in the mean function; the obvious estimator divides by n - 1 rather than n - 2 and hence is biased towards zero

ALR 26, 27, 306 | II.23
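Continuing the R sketch (reusing x, y, n, beta0.hat, beta1.hat and fit), the unbiased variance estimate is:

    ehat <- y - (beta0.hat + beta1.hat * x)   # residuals
    RSS <- sum(ehat^2)
    sigma2.hat <- RSS / (n - 2)
    # sigma(fit) returns the residual standard error, so sigma(fit)^2 matches sigma2.hat
    sigma(fit)^2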

[Figure II.25: log10(pressure) versus boiling point, Forbes data. ALR 6]

[Figure II.26: residual plot for the Forbes data, residuals versus boiling point. ALR 6]

Estimating \sigma^2: IV

for the Forbes data, computations in R yield

    SYY = 0.04279135,  SXY = 4.753781  &  SXX = 530.7824,

from which we get

    RSS = SYY - (SXY)^2/SXX = 0.0002156426

since n = 17 for the Forbes data,

    \hat\sigma^2 = RSS/(n - 2) = RSS/15 = 0.00001437617

the standard error of regression is

    \hat\sigma = sqrt(0.00001437617) = 0.003791592

(also called the residual standard error)

red dashed horizontal lines on the residual plot show …

ALR 26 | II.27
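The RSS shortcut used on this slide can be checked in R (continuing the sketch; SYY is introduced here):

    SYY <- sum((y - ybar)^2)
    SYY - SXY^2 / SXX                # equals the RSS computed from the residuals above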

Estimating \sigma^2: V

for the Fort Collins data, computations in R yield

    SYY = 17572.41,  SXY = 2229.014  &  SXX = 10954.07,

from which we get

    RSS = SYY - (SXY)^2/SXX = 17118.83

since n = 93 for the Fort Collins data,

    \hat\sigma^2 = RSS/(n - 2) = RSS/91 = 188.119

the standard error of regression is

    \hat\sigma = sqrt(188.119) = 13.71565

ALR 26 | II.28

[Figure II.29: late snowfall (inches) versus early snowfall (inches), Fort Collins data. ALR 8]

[Figure II.30: residual plot for the Fort Collins data, residuals versus early snowfall (inches).]

Matrix Formulation of Simple Linear Regression: I

matrix theory offers an alternative formulation for simple linear regression, with the advantage that it generalizes readily to handle multiple linear regression

start by defining an n-dimensional column vector Y containing the y_i's; an n × 2 matrix X whose 1st column consists of just 1's, and whose 2nd has the x_i's; a 2-dimensional vector \beta containing \beta_0 and \beta_1; and an n-dimensional vector e containing the e_i's:

    Y = ( y_1 )     X = ( 1  x_1 )     \beta = ( \beta_0 )     e = ( e_1 )
        ( y_2 )         ( 1  x_2 )             ( \beta_1 )         ( e_2 )
        (  ⋮  )         ( ⋮   ⋮  )                                 (  ⋮  )
        ( y_n )         ( 1  x_n )                                 ( e_n )

the matrix version of the simple linear regression model is Y = X\beta + e

ALR 63, 64, 60 | II.31

Matrix Formulation of Simple Linear Regression: II

given the definitions of X and \beta, it follows that

    X\beta = ( \beta_0 + \beta_1 x_1 )
             ( \beta_0 + \beta_1 x_2 )
             (          ⋮            )
             ( \beta_0 + \beta_1 x_n )

hence the ith row of the matrix equation Y = X\beta + e says

    y_i = \beta_0 + \beta_1 x_i + e_i,

which is consistent with the model Y = \beta_0 + \beta_1 X + e; see also II.2

let e' and X' denote the transposes of e and X; i.e., e' is an n-dimensional row vector (e_1, e_2, ..., e_n), while X' is a 2 × n matrix taking the form

    X' = ( 1    1    ...  1   )
         ( x_1  x_2  ...  x_n )

ALR 299, 300, 301 | II.32

Matrix Formulation of Simple Linear Regression: III

since e = Y - X\beta and since e'e = \sum_i e_i^2, can express the sum of squares of the errors as

    e'e = (Y - X\beta)'(Y - X\beta)

if we entertain b = (b_0, b_1)' rather than the unknown \beta = (\beta_0, \beta_1)', the corresponding residuals are given by Y - Xb, so the residual sum of squares can be written as

    RSS(b) = (Y - Xb)'(Y - Xb)
           = (Y' - b'X')(Y - Xb)
           = Y'Y - Y'Xb - b'X'Y + b'X'Xb
           = Y'Y - 2Y'Xb + b'X'Xb,

where we make use of 2 facts: (1) the transpose of a product is the product of the transposes in reverse order & (2) the transpose of a scalar is itself (hence b'X'Y = (b'X'Y)' = Y'Xb)

ALR 61, 62, 300, 301, 304 | II.33

Taking Derivatives with Respect to Vector b: I

suppose f(b) is a scalar-valued function of a vector b (elements b_1, b_2, ..., b_q)

two examples, for which a is a vector (ith element a_i) & A is a q × q matrix (element in the ith row & jth column is A_{i,j}):

    f_1(b) = a'b = \sum_{i=1}^q a_i b_i   and   f_2(b) = b'Ab = \sum_{i=1}^q \sum_{j=1}^q b_i A_{i,j} b_j

define

    df(b)/db = ( df(b)/db_1, df(b)/db_2, ..., df(b)/db_q )'

ALR 301, 304 | II.34

Taking Derivatives with Respect to Vector b: II

can show (see, e.g., Rao, 1973, p. 71) that

    f_1(b) = a'b has derivative df_1(b)/db = a

and that (for symmetric A, as X'X will be)

    f_2(b) = b'Ab has derivative df_2(b)/db = 2Ab

(not hard to show; do it for fun and games!)

Q: what is the derivative of f_3(b) = b'a?

Q: what is the derivative of f_4(b) = b'b = \sum_i b_i^2?

Q: what is the derivative of

    f_5(b) = c'Cb = \sum_{i=1}^p \sum_{j=1}^q c_i C_{i,j} b_j,

where c is a p-dimensional vector and C is a p × q matrix?

II.35

Matrix Formulation of Simple Linear Regression: IV

returning to RSS(b) = Y'Y - 2Y'Xb + b'X'Xb, taking the derivative of f(b) = RSS(b) with respect to b and setting the resulting expression to 0 (a vector of zeros) yields the matrix version of the normal equations:

    X'Xb = X'Y,

where we have made use of the facts

    d(Y'Y)/db = 0,   d(Y'Xb)/db = X'Y   and   d(b'X'Xb)/db = 2X'Xb

the least squares estimator \hat\beta of \beta is the solution to the normal equations:

    X'X\hat\beta = X'Y

ALR 304 | II.36

Matrix Formulation of Simple Linear Regression: V

let's verify that the solution to X'X\hat\beta = X'Y yields the same estimators \hat\beta_1 & \hat\beta_0 as before, namely, SXY/SXX & \bar y - \hat\beta_1 \bar x

now

    X'X = (  n          \sum_i x_i   )  =  (  n        n\bar x       )
          ( \sum_i x_i  \sum_i x_i^2 )     ( n\bar x   \sum_i x_i^2  )

and

    X'Y = ( \sum_i y_i     )  =  ( n\bar y         )
          ( \sum_i x_i y_i )     ( \sum_i x_i y_i  )

Problem 6: finish the verification!

ALR 63, 64 | II.37
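The matrix solution can be checked numerically in R (continuing the sketch; the design-matrix construction below is a standard illustration, not taken from the slides):

    # normal equations in matrix form: (X'X) beta.hat = X'Y
    X <- cbind(1, x)                  # n x 2 design matrix: a column of 1's, then the x_i's
    beta.hat <- solve(t(X) %*% X, t(X) %*% y)
    beta.hat                          # same values as beta0.hat and beta1.hat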

Properties of Least Squares Estimators: I

since E(Y | X = x) = \beta_0 + \beta_1 x, the fitted mean function is

    \hat E(Y | X = x) = \hat\beta_0 + \hat\beta_1 x,    (*)

which is a line with intercept \hat\beta_0 and slope \hat\beta_1

recalling that \hat\beta_0 = \bar y - \hat\beta_1 \bar x, start from the right-hand side of (*) with x set to \bar x to get

    \hat\beta_0 + \hat\beta_1 \bar x = \bar y - \hat\beta_1 \bar x + \hat\beta_1 \bar x = \bar y,

which says the point (\bar x, \bar y) must lie on the fitted mean function

the vertical dashed line on the following plots indicates the value of \bar x, while the horizontal dashed line, the value of \bar y

ALR 27, 28 | II.38

[Figure II.39: log10(pressure) versus boiling point, Forbes data, with dashed lines at \bar x and \bar y. ALR 6]

[Figure II.40: late snowfall (inches) versus early snowfall (inches), Fort Collins data, with dashed lines at \bar x and \bar y. ALR 8]

Properties of Least Squares Estimators: II

both \hat\beta_0 and \hat\beta_1 can be written as a linear combination of the responses y_1, y_2, ..., y_n

since \hat\beta_1 = SXY/SXX and since SXY = \sum_i (x_i - \bar x) y_i (see Problem 1), we have

    \hat\beta_1 = \sum_i (x_i - \bar x) y_i / SXX = \sum_{i=1}^n c_i y_i,  where  c_i = (x_i - \bar x)/SXX

Q: which y_i's will have the most/least influence on \hat\beta_1?

let's look at c_i plotted versus x_i for the Forbes data

ALR 27 | II.41
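The weights c_i are easy to compute in R (continuing the sketch):

    ci <- (x - xbar) / SXX            # weight attached to each response
    sum(ci * y)                       # equals beta1.hat
    plot(x, ci)                       # weights versus the predictor, as on the next overhead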

[Figure II.42: weights c_i versus boiling point, Forbes data.]

[Figure II.43: log10(pressure) versus boiling point, Forbes data. ALR 6]

Random Vectors and Their Properties: I

a column vector U is said to be a random vector if each of its elements U_i is an RV (random variable)

the expected value of a random vector, denoted by E(U), is a vector whose ith element is the expected value of the ith RV U_i in U

for example, if U = (U_1, U_2, U_3)', then E(U) = ( E(U_1), E(U_2), E(U_3) )'

if U has dimension q, if a is a p-dimensional column vector of constants and if A is a p × q matrix of constants, can show (fairly easily; give it a try!) that

    E(a + AU) = a + A E(U)

ALR 303 | II.44

Random Vectors and Their Properties: II

recall that, if U_i and U_j are two RVs, their covariance is defined to be

    Cov(U_i, U_j) = E( [U_i - E(U_i)][U_j - E(U_j)] )

note that Cov(U_j, U_i) = Cov(U_i, U_j) and that

    Cov(U_i, U_i) = E( [U_i - E(U_i)][U_i - E(U_i)] ) = E( [U_i - E(U_i)]^2 ) = Var(U_i)

by definition, the covariance matrix for a q-dimensional random vector U, to be denoted by Var(U), is the q × q matrix whose (i, j)th element is Cov(U_i, U_j)

for example, if U = (U_1, U_2, U_3)', then

    Var(U) = ( Var(U_1)        Cov(U_1, U_2)   Cov(U_1, U_3) )
             ( Cov(U_2, U_1)   Var(U_2)        Cov(U_2, U_3) )
             ( Cov(U_3, U_1)   Cov(U_3, U_2)   Var(U_3)      )

ALR 291, 292, 303 | II.45

Random Vectors and Their Properties: III

if U has dimension q, if a is a p-dimensional column vector of constants and if A is a p × q matrix of constants, can show (a bit more challenging, but still worth a try!) that

    Var(a + AU) = A Var(U) A'

the RVs in U are uncorrelated if Cov(U_i, U_j) = 0 when i ≠ j

if each U_i in U has the same variance (\sigma^2, say) and if the U_i's are uncorrelated, then Var(U) = \sigma^2 I, where I is the q × q identity matrix (1's along its diagonal and 0's elsewhere)

for this special case,

    Var(a + AU) = A(\sigma^2 I)A' = \sigma^2 AA'

ALR 304, 292 | II.46

Properties of Least Squares Estimators: III

recall that the least squares estimator \hat\beta solves the normal equations:

    X'X\hat\beta = X'Y    (*)

Problem 6: in the case of simple linear regression, X'X is an invertible matrix and thus has an inverse (X'X)^{-1} such that

    (X'X)^{-1} X'X = X'X (X'X)^{-1} = I

premultiplication of both sides of (*) by (X'X)^{-1} yields

    (X'X)^{-1} X'X \hat\beta = (X'X)^{-1} X'Y,

from which we get I\hat\beta = (X'X)^{-1} X'Y and hence

    \hat\beta = (X'X)^{-1} X'Y

the above succinctly expresses the fact that \hat\beta_0 & \hat\beta_1 (the elements of \hat\beta) are linear combinations of the y_i's (the elements of Y)

ALR 61, 64, 304 | II.47

Properties of Least Squares Estimators: IV

considering \hat\beta = (X'X)^{-1} X'Y and taking the conditional expectation of both sides yields

    E(\hat\beta | X) = E( (X'X)^{-1} X'Y | X )
                     = (X'X)^{-1} X' E(Y | X)
                     = (X'X)^{-1} X' E(X\beta + e | X)
                     = (X'X)^{-1} X' ( X\beta + E(e | X) )
                     = (X'X)^{-1} X'X \beta = \beta

Q: what's the justification for each step above?

E(\hat\beta | X) = \beta holds for all X and hence E(\hat\beta) = \beta unconditionally, from which we can conclude that \hat\beta_0 & \hat\beta_1 are unbiased estimators of \beta_0 & \beta_1: E(\hat\beta_0) = \beta_0 and E(\hat\beta_1) = \beta_1

ALR 305 | II.48

Properties of Least Squares Estimators: V

since (i) Var(a + AU) = A Var(U) A', (ii) (AB)' = B'A' and (iii) (A^{-1})' = (A')^{-1} for a square matrix, we have

    Var(\hat\beta | X) = Var( (X'X)^{-1} X'Y | X )
                       = (X'X)^{-1} X' Var(Y | X) ( (X'X)^{-1} X' )'
                       = (X'X)^{-1} X' Var(X\beta + e | X) X (X'X)^{-1}
                       = (X'X)^{-1} X' Var(e | X) X (X'X)^{-1}
                       = (X'X)^{-1} X' (\sigma^2 I) X (X'X)^{-1}
                       = \sigma^2 (X'X)^{-1} X'X (X'X)^{-1}
                       = \sigma^2 (X'X)^{-1}

Q: justification for each step above?

ALR 305 | II.49

Properties of Least Squares Estimators: VI

can readily verify that

    ( a  b )^{-1}  =  ( 1/(ad - cb) ) (  d  -b )
    ( c  d )                          ( -c   a )

since X'X = ( n, n\bar x ; n\bar x, \sum_i x_i^2 ), we get

    (X'X)^{-1} = ( 1/(n \sum_i x_i^2 - n^2 \bar x^2) ) ( \sum_i x_i^2   -n\bar x )
                                                       ( -n\bar x        n       )

since

    Var(\hat\beta | X) = ( Var(\hat\beta_0 | X)                Cov(\hat\beta_0, \hat\beta_1 | X) )  =  \sigma^2 (X'X)^{-1},
                         ( Cov(\hat\beta_0, \hat\beta_1 | X)   Var(\hat\beta_1 | X)              )

we find that, e.g.,

    Var(\hat\beta_1 | X) = \sigma^2 / ( \sum_i x_i^2 - n \bar x^2 ) = \sigma^2 / SXX

by making use of \sum_i x_i^2 - n \bar x^2 = SXX (see Problem 1)

ALR 64, 305, 28 | II.50

Properties of Least Squares Estimators: VII

Q: what happens to Var(\hat\beta_1 | X) = \sigma^2/SXX if we have the luxury of making the sample size n as large as we want?

in practice, \sigma^2 is usually unknown and must be estimated via \hat\sigma^2, leading to the following estimator for Var(\hat\beta_1 | X):

    \hat{Var}(\hat\beta_1 | X) = \hat\sigma^2/SXX

the term standard error is sometimes (but not always) used to refer to the square root of an estimated variance

the standard error of \hat\beta_1, denoted by se(\hat\beta_1), is thus \hat\sigma / sqrt(SXX)

ALR 29 | II.51
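Continuing the R sketch, the slope's standard error is:

    se.beta1 <- sqrt(sigma2.hat / SXX)
    se.beta1
    # summary(fit)$coefficients[2, "Std. Error"] reports the same quantity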

Confidence Intervals and Tests for Slope: I

assuming the errors e_i in simple linear regression to be normally distributed, the parameter estimator \hat\beta_1 for the slope \beta_1 is also normally distributed (the same holds for \hat\beta_0 also)

further assuming the errors e_i to have mean 0 and unknown variance \sigma^2, the distribution of \hat\beta_1 also depends upon the unknown \sigma^2

with \sigma^2 estimated by \hat\sigma^2, confidence intervals (CIs) and tests concerning the unknown true \beta_1 need to be based on a t-distribution with degrees of freedom in sync with the divisor used to form \hat\sigma^2

let T be a random variable with a t-distribution with d degrees of freedom, and let t(\alpha/2, d) be the percentage point such that

    Pr( T ≥ t(\alpha/2, d) ) = \alpha/2

ALR 30, 31 | II.52

Confidence Intervals and Tests for Slope: II

the plot below shows the probability density function (PDF) for the t-distribution with d = 15 degrees of freedom, with t(0.05, 15) = 1.753 marked by a vertical dashed line (thus the area under the PDF to the right of the line is 0.05, and the area to the left is 0.95)

[Figure II.53: PDF of the t-distribution with 15 degrees of freedom, vertical dashed line at 1.753.]

Confidence Intervals and Tests for Slope: III

a (1 - \alpha) × 100% CI for the slope \beta_1 is the set of points \beta_1 in the interval

    \hat\beta_1 - t(\alpha/2, n-2) se(\hat\beta_1)  ≤  \beta_1  ≤  \hat\beta_1 + t(\alpha/2, n-2) se(\hat\beta_1)

example: for the Forbes data (n = 17), \hat\beta_1 = 0.008956 and se(\hat\beta_1) = \hat\sigma/sqrt(SXX) = 0.0001646 since \hat\sigma = 0.003792 and sqrt(SXX) = 23.04, so the 90% CI for \beta_1 is

    0.008956 - 1.753 × 0.0001646  ≤  \beta_1  ≤  0.008956 + 1.753 × 0.0001646

because t(0.05, 15) = 1.753, yielding

    0.008668  ≤  \beta_1  ≤  0.009245

ALR 31 | II.54
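The same 90% interval in R (continuing the sketch):

    alpha <- 0.10
    tcrit <- qt(1 - alpha/2, df = n - 2)      # = 1.753 for n = 17
    c(beta1.hat - tcrit * se.beta1, beta1.hat + tcrit * se.beta1)
    # confint(fit, "x", level = 0.90) gives the same interval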

Confidence Intervals and Tests for Slope: IV

can test the null hypothesis \beta_1 = \beta_1^* versus the alternative hypothesis \beta_1 ≠ \beta_1^* by computing the t-statistic

    t = ( \hat\beta_1 - \beta_1^* ) / se(\hat\beta_1)

and comparing it to the percentage points for the t-distribution with n - 2 degrees of freedom

example: for the Fort Collins data (n = 93), \hat\beta_1 = 0.2035 and se(\hat\beta_1) = \hat\sigma/sqrt(SXX) = 0.1310 since \hat\sigma = 13.72 and sqrt(SXX) = 104.7, so the t-statistic for the test of zero slope (\beta_1^* = 0) is

    t = (0.2035 - 0)/0.1310 = 1.553

ALR 31 | II.55

Confidence Intervals and Tests for Slope: V

letting G(x) denote the cumulative probability distribution function for a random variable T with a t(91) distribution, i.e.,

    G(x) = Pr(T ≤ x),

the p-value associated with t is 2(1 - G(|t|)); see the next overhead

the p-value is 0.1239, which is not small by common standards (e.g., 0.05 or 0.01), so there is not much support for rejecting the null hypothesis

ALR 31 | II.56
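The t-statistic and two-sided p-value for a zero-slope test can be computed in R from the quantities in the running sketch (for whichever data set has been fitted; the slide's numbers are for Fort Collins):

    tstat <- beta1.hat / se.beta1
    2 * (1 - pt(abs(tstat), df = n - 2))
    # summary(fit) reports the same t value and Pr(>|t|) for the slope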

Prediction: I

suppose now we want to predict a yet-to-be-observed response y^* given a setting x^* for the predictor

if the assumed-to-be-true linear regression model were known perfectly, the prediction would be \tilde y^* = \beta_0 + \beta_1 x^*, whereas the model says

    y^* = \beta_0 + \beta_1 x^* + e^* = \tilde y^* + e^*

the prediction error would be y^* - \tilde y^* = e^*, which has variance Var(e^* | x^*) = \sigma^2

in general we must be satisfied using the estimators \hat\beta_0 & \hat\beta_1 in lieu of the true values \beta_0 & \beta_1, which intuitively should lead to predictions that are not as good, resulting in a prediction error with a variance inflated above \sigma^2

ALR 32, 33 | II.58

Prediction: II

using the fitted mean function \hat E(Y | X = x) = \hat\beta_0 + \hat\beta_1 x to predict the response y^* for a given x^*, the prediction is now

    \hat y^* = \hat\beta_0 + \hat\beta_1 x^*,

and the prediction error becomes

    y^* - \hat y^* = \beta_0 + \beta_1 x^* + e^* - ( \hat\beta_0 + \hat\beta_1 x^* )

recall that, if U & V are uncorrelated RVs, then Var(U - V) = Var(U) + Var(V) (see Equation (A.2), p. 291, of Weisberg)

assuming that e^* is uncorrelated with the RVs involved in the formation of \hat\beta_0 & \hat\beta_1, we can regard U = \beta_0 + \beta_1 x^* + e^* and V = \hat\beta_0 + \hat\beta_1 x^* as uncorrelated RVs when conditioned on x^* and x_1, ..., x_n

letting x_+ be shorthand for x^*, x_1, ..., x_n, we can write

    Var(y^* - \hat y^* | x_+) = Var(\beta_0 + \beta_1 x^* + e^* | x_+) + Var(\hat\beta_0 + \hat\beta_1 x^* | x_+)

ALR 32, 33, 291 | II.59

Prediction: III

study the pieces Var(\beta_0 + \beta_1 x^* + e^* | x_+) & Var(\hat\beta_0 + \hat\beta_1 x^* | x_+) one at a time

using the fact that Var(c + U) = Var(U) for a constant c, we have

    Var(\beta_0 + \beta_1 x^* + e^* | x_+) = Var(e^* | x_+) = \sigma^2

recall that, if U & V are correlated RVs and c is a constant, then Var(U + cV) = Var(U) + c^2 Var(V) + 2c Cov(U, V) (see Equation (A.3), p. 292, of Weisberg)

hence

    Var(\hat\beta_0 + \hat\beta_1 x^* | x_+) = Var(\hat\beta_0 | x_+) + (x^*)^2 Var(\hat\beta_1 | x_+) + 2 x^* Cov(\hat\beta_0, \hat\beta_1 | x_+)
                                             = Var(\hat\beta_0 | x_1, ..., x_n) + (x^*)^2 Var(\hat\beta_1 | x_1, ..., x_n) + 2 x^* Cov(\hat\beta_0, \hat\beta_1 | x_1, ..., x_n)

under the assumption that x^* is independent of the RVs forming \hat\beta_0 & \hat\beta_1

ALR 32, 33, 292 | II.60

Prediction: IV

expressions for Var(\hat\beta_0 | x_1, ..., x_n), Var(\hat\beta_1 | x_1, ..., x_n) and Cov(\hat\beta_0, \hat\beta_1 | x_1, ..., x_n) can be extracted from the matrix

    Var(\hat\beta | X) = ( Var(\hat\beta_0 | X)                Cov(\hat\beta_0, \hat\beta_1 | X) )  =  \sigma^2 (X'X)^{-1}
                         ( Cov(\hat\beta_0, \hat\beta_1 | X)   Var(\hat\beta_1 | X)              )

Exercise (unassigned): using the elements of (X'X)^{-1}, show that

    Var(\hat\beta_0 + \hat\beta_1 x^* | x_+) = \sigma^2 [ 1/n + (x^* - \bar x)^2/SXX ]

the above represents the increase in the variance of the prediction error due to the necessity of estimating \beta_0 & \beta_1, with the actual variance being

    Var(y^* - \hat y^* | x_+) = \sigma^2 [ 1 + 1/n + (x^* - \bar x)^2/SXX ]

ALR 32, 33, 295 | II.61

Prediction: V

estimating \sigma^2 by \hat\sigma^2 and taking the square root lead to the standard error of prediction (sepred) at x^*:

    sepred(\hat y^* | x_+) = \hat\sigma [ 1 + 1/n + (x^* - \bar x)^2/SXX ]^{1/2}

using the Forbes data as an example, suppose we want to predict log10(pressure) at a hypothetical location for which the boiling point of water x^* is somewhere between 190 and 215

the prediction for log10(pressure) given boiling point x^* is

    \hat y^* = \hat\beta_0 + \hat\beta_1 x^* = -0.4216418 + 0.008956178 x^*

a (1 - \alpha) × 100% prediction interval is the set of points y^* in the interval

    \hat y^* - t(\alpha/2, n-2) sepred(\hat y^* | x_+)  ≤  y^*  ≤  \hat y^* + t(\alpha/2, n-2) sepred(\hat y^* | x_+)

ALR 32, 33 | II.62

Prediction: VI

here n = 17, so, for a 99% prediction interval, we set \alpha = 0.01 and use t(0.005, 15) = 2.947

since \hat\sigma = 0.003792, \bar x = 203.0 and SXX = 530.8, we have

    sepred(\hat y^* | x_+) = 0.003792 [ 1 + 1/17 + (x^* - 203.0)^2/530.8 ]^{1/2}

solid red curves on the following plot depict the 99% prediction interval as x^* sweeps from 190 to 215 (black lines show the intervals assuming, unrealistically, no uncertainty in the parameter estimates)

for x^* = 200, the prediction is \hat y^* = 1.370, and the 99% prediction interval is specified by 1.358 ≤ y^* ≤ 1.381

in the original space, the prediction is 10^{1.370} = 23.42, and the interval is 10^{1.358} ≤ 10^{y^*} ≤ 10^{1.381}, i.e., 22.80 ≤ 10^{y^*} ≤ 24.05

ALR 32, 33 | II.63
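The 99% interval at x^* = 200 can be reproduced in R (continuing the sketch):

    xstar <- 200
    yhat.star <- beta0.hat + beta1.hat * xstar
    sepred <- sqrt(sigma2.hat * (1 + 1/n + (xstar - xbar)^2 / SXX))
    tcrit <- qt(0.995, df = n - 2)            # = 2.947 for n = 17
    c(yhat.star - tcrit * sepred, yhat.star + tcrit * sepred)
    # predict(fit, newdata = data.frame(x = xstar), interval = "prediction", level = 0.99)
    # gives the same interval; 10^(...) maps it back to the pressure scale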

[Figure II.64: log10(pressure) versus boiling point over 190-215, with 99% prediction interval curves.]

[Figure II.65: pressure versus boiling point over 190-215.]

Coefficient of Determination R^2: I

ignoring potential predictors, the best prediction of the response is the sample average \bar y of the observed responses y_1, y_2, ..., y_n

for the Fort Collins data, the total sum of squares SYY = \sum_i (y_i - \bar y)^2 is the sum of squares of the deviations of the data from the horizontal dashed line on the next plot

with the inclusion of predictors, the unexplained variation is RSS

for the Fort Collins data, RSS is the sum of squares of the deviations from the solid line on the next plot

ALR 35, 36 | II.66

[Figure II.67: late snowfall (inches) versus early snowfall (inches), Fort Collins data, with the fitted line and a horizontal dashed line at \bar y. ALR 8]

Coefficient of Determination R^2: II

the difference between SYY and RSS is called the sum of squares due to regression:

    SSreg = SYY - RSS

Problem 5 says that

    RSS = SYY - (SXY)^2/SXX

hence

    SSreg = SYY - [ SYY - (SXY)^2/SXX ] = (SXY)^2/SXX

divide SSreg by SYY to get the definition of the coefficient of determination:

    R^2 = SSreg/SYY = (SXY)^2/(SXX SYY) = 1 - RSS/SYY

ALR 35, 36 | II.68

Coefficient of Determination R^2: III

Exercise (unassigned): R^2 = r_xy^2 (the squared sample correlation)

must have 0 ≤ R^2 ≤ 1

R^2 × 100 gives the percentage of the total sum of squares explained by the regression (the concept of R^2 generalizes to multiple regression)

examples: R^2 = 0.026 for Fort Collins & R^2 = 0.995 for Forbes

ALR 35, 36 | II.69
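In R (continuing the sketch, with SYY from the earlier block):

    SSreg <- SXY^2 / SXX
    SSreg / SYY                      # equals 1 - RSS/SYY and cor(x, y)^2
    # summary(fit)$r.squared reports the same value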

Coefficient of Determination R^2: IV

R and other computer packages report both R^2 and a variation known as the adjusted R^2:

    R^2_adj = 1 - (RSS/df) / ( SYY/(n - 1) )   as compared to   R^2 = 1 - RSS/SYY,

where df is the degrees of freedom

for simple linear regression, df = n - 2, so R^2_adj gets closer and closer to R^2 as n increases

in general, df = n minus the # of parameters in the mean function

R^2_adj is intended to facilitate comparison of models, but Weisberg notes (p. 36) there are better ways of doing so

note: R^2 is useless if the mean function does not have an intercept term (e.g., regression through the origin: E(Y | X = x) = \beta_1 x)

ALR 36 | II.70

Inadequacy of Sufficient Statistics: I

all of the data-dependent variables connected with a simple linear regression (e.g., \hat\beta_0, \hat\beta_1, \hat\sigma^2, SSreg, RSS, R^2, etc.) can be formed using just five fundamental statistics:

    \bar x, \bar y, SXX, SYY and SXY

since

    \bar x = (1/n) \sum_{i=1}^n x_i,  SXX = \sum_{i=1}^n x_i^2 - n \bar x^2  and  SXY = \sum_{i=1}^n x_i y_i - n \bar x \bar y

(with analogous equations for \bar y and SYY), it follows that basic linear regression analysis depends only on five so-called sufficient statistics:

    \sum_{i=1}^n x_i,  \sum_{i=1}^n y_i,  \sum_{i=1}^n x_i^2,  \sum_{i=1}^n y_i^2  and  \sum_{i=1}^n x_i y_i

ALR 293, 294, 23, 24, 25 | II.71

Inadequacy of Sufficient Statistics: II

under the assumptions of normality and correctness of the regression model, we do not in theory lose any probabilistic information by tossing away the original data (x_i, y_i), i = 1, ..., n, and just keeping the five sufficient statistics

reliance on sufficient statistics is dangerous in actual applications, where the adequacy of the basic assumptions (normality, correctness of the model) is always open to question

Anscombe (1973) constructed an example of four data sets (n = 11) with sufficient statistics that are identical (to within rounding error), offering much food for thought

third data set: reconsider the scheme of picking the median slope amongst all possible lines determined by two distinct points

ALR 12, 13 | II.72
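A quick way to see this in R: Anscombe's four data sets ship with base R as the data frame anscombe (columns x1-x4 and y1-y4), so the four fits can be compared directly; a minimal sketch:

    # all four Anscombe data sets give essentially the same least squares fit
    fits <- lapply(1:4, function(k)
      lm(anscombe[[paste0("y", k)]] ~ anscombe[[paste0("x", k)]]))
    t(sapply(fits, coef))                              # nearly identical intercepts and slopes
    sapply(fits, function(f) summary(f)$r.squared)     # nearly identical R^2 values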

[Figure II.73: Anscombe's first data set, response versus predictor. ALR 13]

[Figure II.74: Anscombe's second data set, response versus predictor. ALR 13]

[Figure II.75: Anscombe's third data set, response versus predictor. ALR 13]

[Figure II.76: Anscombe's fourth data set, response versus predictor. ALR 13]

Residuals: I

looking at the residuals \hat e_i is a vital step in regression analysis: we can check assumptions to prevent garbage in/garbage out

the basic tool is a plot of residuals versus other quantities, of which three obvious choices are:

1. residuals versus fitted values \hat y_i
2. residuals versus predictors x_i
3. residuals versus case numbers i

the special nature of certain data might suggest other plots

a useful residual plot resembles a null plot when the assumptions hold, and a non-null plot when some assumption fails

let's look at plots 1 to 3 using Anscombe's data sets as examples

ALR 36, 37, 38 | II.77
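The three plots listed above take one line each in R (continuing the sketch, using the fitted object fit):

    ehat <- resid(fit)
    plot(fitted(fit), ehat)          # 1. residuals versus fitted values
    plot(x, ehat)                    # 2. residuals versus the predictor
    plot(seq_along(ehat), ehat)      # 3. residuals versus case number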

[Figure II.78: residuals versus fitted values, Anscombe data set #1.]

[Figure II.79: residuals versus predictors, Anscombe data set #1.]

[Figure II.80: residuals versus fitted values, Anscombe data set #2.]

[Figure II.81: residuals versus predictors, Anscombe data set #2.]

[Figure II.82: residuals versus fitted values, Anscombe data set #3.]

[Figure II.83: residuals versus predictors, Anscombe data set #3.]

[Figure II.84: residuals versus fitted values, Anscombe data set #4.]

[Figure II.85: residuals versus predictors, Anscombe data set #4.]

Residuals: II

Q: why is a plot of residuals versus \hat y_i identical to a plot of residuals versus x_i after relabeling the horizontal axis?

II.86

[Figure II.87: residuals versus case numbers, Anscombe data set #1.]

[Figure II.88: residuals versus case numbers, Anscombe data set #2.]

[Figure II.89: residuals versus case numbers, Anscombe data set #3.]

[Figure II.90: residuals versus case numbers, Anscombe data set #4.]

Residuals: III

although plots of \hat e_i versus i were not particularly useful for Anscombe's data, this plot is useful for certain other data sets (particularly where cases are collected sequentially in time)

a fourth obvious choice: plot residuals \hat e_i versus responses y_i

this choice is problematic because the relationship y_i = \hat y_i + \hat e_i says that, if the spread of the \hat y_i's is small compared to the spread of the \hat e_i's, large \hat e_i's will correspond to large y_i's even if the model is correct

thus residuals versus responses is not a useful residual plot because it need not resemble a null plot when the assumptions hold

as an example, reconsider the Fort Collins data

II.91

[Figure II.92: late snowfall (inches) versus early snowfall (inches), Fort Collins data. ALR 8]

[Figure II.93: residuals versus fitted values, Fort Collins data.]

[Figure II.94: residuals versus predictors, Fort Collins data.]

[Figure II.95: residuals versus case numbers, Fort Collins data.]

[Figure II.96: residuals versus responses, Fort Collins data.]

Residuals: IV

reconsider the Forbes data, focusing first on the 3 following overheads

red dashed horizontal lines on the residual plot show …

recall the definition of the weights c_i:

    \hat\beta_1 = \sum_i (x_i - \bar x) y_i / SXX = \sum_{i=1}^n c_i y_i,  where  c_i = (x_i - \bar x)/SXX

ALR 36, 37, 38 | II.97

[Figure II.98: log10(pressure) versus boiling point, Forbes data. ALR 6]

[Figure II.99: residuals versus predictors (boiling point), Forbes data. ALR 6]

[Figure II.100: weights c_i versus boiling point, Forbes data.]

Residuals: V

Weisberg notes that Forbes deemed this case "evidently a mistake", but perhaps just because of its appearance as an outlier

Weisberg (p. 38) shows that, if (x_12, y_12) is removed and the regression analysis is redone on the reduced data set, the resulting slope estimate is virtually the same, but \hat\sigma and the quantities that depend upon it change drastically (see the overheads that follow)

to delete or not to delete, that is the question:

if we don't delete, the normality assumption is questionable

if we do delete, the normality assumption is tenable, but there is no real scientific justification for doing so (open to charges of data massaging)

ALR 36, 37, 38 | II.101

[Figure II.102: log10(pressure) versus boiling point, Forbes data. ALR 6]

[Figure II.103: log10(pressure) versus boiling point, Forbes data, with one case marked by an x.]

[Figure II.104: residuals versus predictors (boiling point), Forbes data. ALR 6]

[Figure II.105: residuals versus predictors (boiling point), Forbes data, reduced data set.]

[Figure II.106: log10(pressure) versus boiling point over 190-215.]

[Figure II.107: log10(pressure) versus boiling point over 190-215, with one case marked by an x.]

Main Points: I

given a response Y and a predictor X, simple linear regression assumes (1) a linear mean function

    E(Y | X = x) = \beta_0 + \beta_1 x,

where \beta_0 (intercept term) and \beta_1 (slope term) are unknown parameters (constants), and (2) a constant variance function

    Var(Y | X = x) = \sigma^2,

where \sigma^2 > 0 is a third unknown parameter

the simple linear regression model can also be written as

    Y = E(Y | X = x) + e = \beta_0 + \beta_1 x + e,

where e is a statistical error, a random variable (RV) such that E(e) = 0 and Var(e) = \sigma^2

ALR 21, 292, 293 | II.108

Main Points: II

let (x_i, y_i), i = 1, ..., n, be RVs obeying Y = \beta_0 + \beta_1 x + e (the predictor/response data are realizations of these 2n RVs)

for the ith case, we have y_i = \beta_0 + \beta_1 x_i + e_i

the errors e_1, ..., e_n are independent RVs such that E(e_i | x_i) = 0

the model for the data can also be written in matrix notation as Y = X\beta + e, where

    Y = ( y_1 )     X = ( 1  x_1 )     \beta = ( \beta_0 )     e = ( e_1 )
        ( y_2 )         ( 1  x_2 )             ( \beta_1 )         ( e_2 )
        (  ⋮  )         ( ⋮   ⋮  )                                 (  ⋮  )
        ( y_n )         ( 1  x_n )                                 ( e_n )

ALR 21, 29, 63, 64 | II.109

Main Points: III

given the sample means

    \bar x = (1/n) \sum_{i=1}^n x_i   and   \bar y = (1/n) \sum_{i=1}^n y_i

and the sample cross products and sums of squares

    SXY = \sum_{i=1}^n (x_i - \bar x)(y_i - \bar y),  SXX = \sum_{i=1}^n (x_i - \bar x)^2  &  SYY = \sum_{i=1}^n (y_i - \bar y)^2,

can form the least squares estimators for the parameters \beta_1 and \beta_0:

    \hat\beta_1 = SXY/SXX   and   \hat\beta_0 = \bar y - \hat\beta_1 \bar x

the corresponding estimator for the error variance \sigma^2 is

    \hat\sigma^2 = RSS/(n - 2),  where  RSS = \sum_{i=1}^n [y_i - (\hat\beta_0 + \hat\beta_1 x_i)]^2 = SYY - (SXY)^2/SXX

ALR 293, 294, 24, 25 | II.110

Main Points: IV

letting

    RSS(b_0, b_1) = \sum_{i=1}^n [y_i - (b_0 + b_1 x_i)]^2,

the least squares estimators \hat\beta_0 and \hat\beta_1 are the choices for b_0 and b_1 such that RSS(b_0, b_1) is minimized

fitted values \hat y_i and residuals \hat e_i are defined as

    \hat y_i = \hat\beta_0 + \hat\beta_1 x_i   and   \hat e_i = y_i - (\hat\beta_0 + \hat\beta_1 x_i) = y_i - \hat y_i,

in terms of which we have

    RSS = RSS(\hat\beta_0, \hat\beta_1) = \sum_{i=1}^n \hat e_i^2   and   \hat\sigma^2 = \sum_i \hat e_i^2 / (n - 2)

ALR 24, 22, 23 | II.111

Main Points: V

in matrix notation, the least squares estimator \hat\beta of \beta is such that X'X\hat\beta = X'Y, i.e., \hat\beta is the solution to the normal equations X'Xb = X'Y

the 2 × 2 matrix X'X has an inverse as long as SXX ≠ 0, so

    \hat\beta = (X'X)^{-1} X'Y

since E(\hat\beta) = \beta, the estimators \hat\beta_0 & \hat\beta_1 are unbiased, as is \hat\sigma^2 also:

    E(\hat\beta_0) = \beta_0,  E(\hat\beta_1) = \beta_1  and  E(\hat\sigma^2) = \sigma^2

we also have Var(\hat\beta | X) = \sigma^2 (X'X)^{-1}, leading us to deduce

    Var(\hat\beta_1 | X) = \sigma^2/SXX,  which can be estimated by  \hat{Var}(\hat\beta_1 | X) = \hat\sigma^2/SXX,

the square root of which is se(\hat\beta_1), the standard error of \hat\beta_1

ALR 304, 305, 61, 62, 63, 27, 28 | II.112

Main Points: VI

can test the null hypothesis (NH) that \beta_1 = 0 by forming the t-statistic t = \hat\beta_1/se(\hat\beta_1) and comparing it to the percentage points t(·, n-2) for the t-distribution with n - 2 degrees of freedom, with a large value of |t| giving evidence against the NH via a small p-value

a (1 - \alpha) × 100% confidence interval for \beta_1 is the set of points in the interval whose end points are

    \hat\beta_1 - t(\alpha/2, n-2) se(\hat\beta_1)   and   \hat\beta_1 + t(\alpha/2, n-2) se(\hat\beta_1)

ALR 31 | II.113

Main Points: VII

can predict a yet-to-be-observed response y^* given a setting x^* for the predictor using \hat y^* = \hat\beta_0 + \hat\beta_1 x^*, which has a standard error given by

    sepred(\hat y^* | x_+) = \hat\sigma [ 1 + 1/n + (x^* - \bar x)^2/SXX ]^{1/2},

where x_+ denotes x^* along with the original predictors x_1, ..., x_n

a (1 - \alpha) × 100% prediction interval constitutes all values from

    \hat y^* - t(\alpha/2, n-2) sepred(\hat y^* | x_+)   to   \hat y^* + t(\alpha/2, n-2) sepred(\hat y^* | x_+)

ALR 32, 33 | II.114


Main Points: IX

plots of the residuals \hat e_i are invaluable for assessing the reasonableness of a fitted model (a point that cannot be emphasized too much)

the standard plot is residuals \hat e_i versus fitted values \hat y_i, which is equivalent to \hat e_i's versus the predictors x_i

a plot of residuals versus case number i is potentially, but not always, useful

do not plot residuals versus responses y_i (misleading!)

failure to plot residuals is potentially bad for your health!

Thou Shalt Plot Residuals (a proposed 11th commandment!)

ALR 36, 37, 38 | II.116

    Additional References


F. Mosteller, A. F. Siegel, E. Trapido and C. Youtz (1981), "Eye Fitting Straight Lines", The American Statistician, 35, pp. 150-152.

C. R. Rao (1973), Linear Statistical Inference and Its Applications (Second Edition), New York: John Wiley & Sons, Inc.

G. A. F. Seber (1977), Linear Regression Analysis, New York: John Wiley & Sons, Inc.