
UP School of Statistics Student Council

Education and Research | erho.weebly.com | erhomyhero@gmail.com | f /erhoismyhero | t @erhomyhero

S136_Reviewer_001
Statistics 136: Introduction to Regression Analysis
Reviewer for the 1st Long Examination

Preliminaries

For an idempotent matrix $A$: $\mathrm{rk}(I - A) = \mathrm{rk}(I) - \mathrm{rk}(A) = \mathrm{tr}(I - A)$

For general quadratic forms $f(x) = (a \pm Bx)'A(a \pm Bx)$:

$$\frac{\partial f(x)}{\partial x} = \pm 2B'A(a \pm Bx)$$

where
$a$ is an $m \times 1$ vector of constants
$B$ is an $m \times p$ matrix of constants
$x$ is a $p \times 1$ vector of variables
$A$ is an $m \times m$ symmetric matrix of constants
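As a quick numerical illustration (not part of the original reviewer), the gradient formula above can be checked against finite differences. The sketch assumes numpy is available; the matrices and the choice of the minus-sign case are arbitrary made-up values.

```python
import numpy as np

rng = np.random.default_rng(0)
m, p = 4, 3
a = rng.normal(size=m)            # m x 1 vector of constants
B = rng.normal(size=(m, p))       # m x p matrix of constants
A = rng.normal(size=(m, m))
A = (A + A.T) / 2                 # symmetric m x m matrix of constants
x = rng.normal(size=p)            # p x 1 vector of variables

f = lambda x: (a - B @ x) @ A @ (a - B @ x)   # f(x) = (a - Bx)' A (a - Bx)
analytic = -2 * B.T @ A @ (a - B @ x)         # gradient from the formula above

# central finite differences as an independent check
h = 1e-6
numeric = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(p)])
print(np.allclose(analytic, numeric, atol=1e-5))   # should print True
```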

Expected value of a vector

For $y$, an $n \times 1$ vector of random variables:

$$E(y) = E\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} = \begin{bmatrix} E(Y_1) \\ E(Y_2) \\ \vdots \\ E(Y_n) \end{bmatrix} = [E(Y_i)]$$

For matrices, simply apply the expected value operator to each element of the matrix.

Variance-covariance matrix

For $y$, an $n \times 1$ vector of random variables:


$$\mathrm{Var}(y) = E\{[y - E(y)][y - E(y)]'\} = E\left\{\begin{bmatrix} Y_1 - E(Y_1) \\ Y_2 - E(Y_2) \\ \vdots \\ Y_n - E(Y_n) \end{bmatrix}\begin{bmatrix} Y_1 - E(Y_1) & Y_2 - E(Y_2) & \cdots & Y_n - E(Y_n) \end{bmatrix}\right\} = [\sigma_{ij}]$$

where $\sigma_{ii} = \sigma_i^2 = \mathrm{Var}(Y_i)$ and $\sigma_{ij} = \mathrm{Cov}(Y_i, Y_j)$.

Note: The variance-covariance matrix is always symmetric

Correlation matrix

The correlation matrix of $y_{n \times 1}$, denoted by $\Re$, is defined by

$$\Re = \mathrm{diag}^{-1}\{\sigma_1, \sigma_2, \ldots, \sigma_n\}\,V\,\mathrm{diag}^{-1}\{\sigma_1, \sigma_2, \ldots, \sigma_n\} = \mathrm{diag}\left\{\frac{1}{\sigma_1}, \frac{1}{\sigma_2}, \ldots, \frac{1}{\sigma_n}\right\} V\, \mathrm{diag}\left\{\frac{1}{\sigma_1}, \frac{1}{\sigma_2}, \ldots, \frac{1}{\sigma_n}\right\} = \left[\frac{\sigma_{ij}}{\sigma_i\sigma_j}\right], \quad i, j = 1, 2, \ldots, n$$

where $V = \mathrm{Var}(y)$.

Note: Let C be a matrix of constants and y a random vector where Var(y) = V. Then, Var(Cy) = CVC'.
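A small numerical sketch (not from the reviewer; numpy assumed, all numbers made up) of how a correlation matrix is obtained from a covariance matrix, and of the rule $\mathrm{Var}(Cy) = CVC'$:

```python
import numpy as np

# A made-up 3x3 covariance matrix V (symmetric, positive definite)
V = np.array([[4.0, 1.2, 0.6],
              [1.2, 9.0, 2.1],
              [0.6, 2.1, 1.0]])

D_inv = np.diag(1 / np.sqrt(np.diag(V)))   # diag{1/sigma_1, ..., 1/sigma_n}
R = D_inv @ V @ D_inv                      # correlation matrix [sigma_ij / (sigma_i sigma_j)]
print(R.round(3))

# Var(Cy) = C V C' for a matrix of constants C
C = np.array([[1.0, -1.0, 0.0],
              [0.5,  0.5, 1.0]])
print(C @ V @ C.T)                         # variance-covariance matrix of Cy
```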

The Linear Model

Multiple Linear Regression model

$$Y = X\beta + \varepsilon$$
$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_k X_{ik} + \varepsilon_i$$

Matrix Notation

$$Y_{n \times 1} = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}, \quad X_{n \times (k+1)} = \begin{bmatrix} 1 & X_{11} & X_{12} & \cdots & X_{1k} \\ 1 & X_{21} & X_{22} & \cdots & X_{2k} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & X_{n1} & X_{n2} & \cdots & X_{nk} \end{bmatrix}, \quad \beta_{(k+1) \times 1} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{bmatrix}, \quad \varepsilon_{n \times 1} = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}$$

$Y_i$ is the value of the response variable in the $i$th trial
$X_{ij}$ is a known constant, namely, the value of the $j$th independent variable in the $i$th trial


$\beta_j$ is a parameter, where $j = 0, 1, 2, \ldots, k$
$\varepsilon_i$ is a random error term, $i = 1, 2, \ldots, n$

Classical Assumptions

$E(\varepsilon_i) = 0, \ \forall i$
$\mathrm{Var}(\varepsilon_i) = \sigma^2, \ \forall i$
$\mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0, \ \forall i \neq j$

Normal error model assumptions

$\varepsilon_i \sim \mathrm{NID}(0, \sigma^2), \quad i = 1, 2, \ldots, n$
$\varepsilon \sim N_n(0, \sigma^2 I)$

Dependent Variable (implicit distribution)

$Y \sim N_n(X\beta, \sigma^2 I_n)$

Regression Function

$E(Y) = X\beta$
$E(Y_i) = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_k X_{ik}$

Least Squares Criterion

Objective function

Minimize $\sum_{i=1}^{n} \varepsilon_i^2 = \varepsilon'\varepsilon$, the inner product of the vector of error terms.

Least Squares Estimation

$$\frac{\partial\left(\sum_{i=1}^{n}\varepsilon_i^2\right)}{\partial \beta} = \frac{\partial\,\varepsilon'\varepsilon}{\partial \beta} = \frac{\partial\,(Y - X\beta)'(Y - X\beta)}{\partial \beta} = -2X'(Y - X\beta)$$


Equate to the null vector

$$-2X'(Y - X\beta) = 0 \;\Rightarrow\; X'(Y - X\beta) = 0 \;\Rightarrow\; X'Y - X'X\beta = 0 \;\Rightarrow\; X'X\beta = X'Y$$

Thus, from the above equation, we get the set of normal equations and an estimator for β.

Normal Equations

$$X'X\beta = X'Y$$

$$X'X\beta = \begin{bmatrix} n & \sum_{i=1}^{n} X_{i1} & \sum_{i=1}^{n} X_{i2} & \cdots & \sum_{i=1}^{n} X_{ik} \\ \sum_{i=1}^{n} X_{i1} & \sum_{i=1}^{n} X_{i1}^2 & \sum_{i=1}^{n} X_{i1}X_{i2} & \cdots & \sum_{i=1}^{n} X_{i1}X_{ik} \\ \vdots & \vdots & \vdots & & \vdots \\ \sum_{i=1}^{n} X_{ik} & \sum_{i=1}^{n} X_{i1}X_{ik} & \cdots & \cdots & \sum_{i=1}^{n} X_{ik}^2 \end{bmatrix}\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^{n} Y_i \\ \sum_{i=1}^{n} X_{i1}Y_i \\ \vdots \\ \sum_{i=1}^{n} X_{ik}Y_i \end{bmatrix} = X'Y$$

Set of Normal Equations

1st: $\;n\beta_0 + \beta_1\sum_{i=1}^{n} X_{i1} + \beta_2\sum_{i=1}^{n} X_{i2} + \cdots + \beta_k\sum_{i=1}^{n} X_{ik} = \sum_{i=1}^{n} Y_i$

2nd: $\;\beta_0\sum_{i=1}^{n} X_{i1} + \beta_1\sum_{i=1}^{n} X_{i1}^2 + \cdots + \beta_k\sum_{i=1}^{n} X_{i1}X_{ik} = \sum_{i=1}^{n} X_{i1}Y_i$

$\;\;\vdots$

(k+1)th: $\;\beta_0\sum_{i=1}^{n} X_{ik} + \beta_1\sum_{i=1}^{n} X_{i1}X_{ik} + \cdots + \beta_k\sum_{i=1}^{n} X_{ik}^2 = \sum_{i=1}^{n} X_{ik}Y_i$

Least Squares Estimator of β

$$\hat{\beta} = (X'X)^{-1}X'Y$$

provided $(X'X)^{-1}$ exists, that is, $X$ is of full rank: $\mathrm{rk}(X) = k + 1$.
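A minimal sketch (not part of the reviewer) of the least squares estimator on made-up data, assuming numpy; it solves the normal equations directly and cross-checks with np.linalg.lstsq:

```python
import numpy as np

# Made-up illustrative data: n = 6 observations, k = 2 regressors
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
Y  = np.array([3.1, 4.0, 7.2, 7.9, 11.1, 11.8])

n = Y.size
X = np.column_stack([np.ones(n), X1, X2])        # n x (k+1) design matrix with intercept

# Least squares estimator: solve the normal equations X'X b = X'Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_hat)

# lstsq gives the same answer (and is numerically safer than forming X'X explicitly)
print(np.linalg.lstsq(X, Y, rcond=None)[0])
```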

Estimator for E(Y)

$$\widehat{E(Y)} = \hat{Y} = X\hat{\beta} = X(X'X)^{-1}X'Y = HY$$

Define $H = X(X'X)^{-1}X'$.


Residuals

$$e = Y - \hat{Y} = Y - X\hat{\beta} = Y - X(X'X)^{-1}X'Y = Y - HY = (I - H)Y$$
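The following sketch (illustrative only; simulated data, numpy assumed) builds the matrix $H = X(X'X)^{-1}X'$ and the residuals, and checks that $H$ is symmetric and idempotent with $\mathrm{tr}(H) = k + 1$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 8, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])   # design matrix with intercept
Y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T         # H = X(X'X)^{-1}X'
Y_hat = H @ Y                                # fitted values
e = (np.eye(n) - H) @ Y                      # residuals e = (I - H)Y

# H is symmetric and idempotent, and tr(H) = k + 1 = number of parameters
print(np.allclose(H, H.T), np.allclose(H @ H, H), round(np.trace(H), 6))
```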

Interpretation of Coefficients

Given $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_{i1} + \hat{\beta}_2 X_{i2} + \cdots + \hat{\beta}_k X_{ik}$, the estimated mean of $Y$, we interpret the coefficients $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$ as follows:

• $\hat{\beta}_0$: value of the estimated mean of $Y$ when all the independent variables are zero
• $\hat{\beta}_1$: change in the estimated mean of $Y$ per unit change in $X_{i1}$, holding the other independent variables constant
• In general, ceteris paribus (all other things held constant), $\hat{\beta}_j$ is the change in the estimated mean of $Y$ per unit change in $X_{ij}$, holding the other independent variables constant, for $j = 1, 2, \ldots, k$
• Caution on the interpretation of coefficients:
  i. the coefficients are partial
  ii. the validity of the interpretation depends on whether the assumption of uncorrelatedness among the X's holds
  iii. the interpretation is affected by the range of X used in estimation (for example, $\hat{\beta}_0$ may not always be interpretable)

Results from Least Squares Criterion

1. The least squares estimator $\hat{\beta}$ is unbiased for $\beta$.

Proof:

$$E(\hat{\beta}) = E[(X'X)^{-1}X'Y] = (X'X)^{-1}X'E(Y) = (X'X)^{-1}X'X\beta = \beta$$

2. The expected value of the vector of residuals is a null vector: $E(e) = 0$.


Proof:

$$E(e) = E[(I - H)Y] = (I - H)E(Y) = (I - H)X\beta = X\beta - HX\beta = X\beta - X(X'X)^{-1}X'X\beta = X\beta - X\beta = 0$$

3. The sum of squared residuals is a minimum.

$$\sum_{i=1}^{n} e_i^2 = e'e \text{ is a minimum}$$

4. The least squares regression line always passes through the centroid.

5. The sum of the residuals of any regression model that contains an intercept β0 is always equal to zero.

$$\sum_{i=1}^{n} e_i = \mathbf{1}'e = e'\mathbf{1} = 0$$

Proof:

$$\sum_{i=1}^{n} e_i = \sum_{i=1}^{n}(Y_i - \hat{Y}_i) = \sum_{i=1}^{n}\left[Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_{i1} + \hat{\beta}_2 X_{i2} + \cdots + \hat{\beta}_k X_{ik})\right]$$
$$= \sum_{i=1}^{n} Y_i - \left(n\hat{\beta}_0 + \hat{\beta}_1\sum_{i=1}^{n} X_{i1} + \hat{\beta}_2\sum_{i=1}^{n} X_{i2} + \cdots + \hat{\beta}_k\sum_{i=1}^{n} X_{ik}\right)$$
$$= \sum_{i=1}^{n} Y_i - \sum_{i=1}^{n} Y_i \quad \text{(from the first normal equation)}$$
$$\sum_{i=1}^{n} e_i = 0$$

6. The sum of the residuals weighted by the corresponding value of the regressor variable always equals zero.

$$\sum_{i=1}^{n} e_i X_{ij} = X'e = 0$$


Proof:

$$\sum_{i=1}^{n} e_i X_{ij} = X'e = X'(I - H)Y = X'[I - X(X'X)^{-1}X']Y = [X' - X'X(X'X)^{-1}X']Y = (X' - X')Y = 0'Y = 0$$

7. The sum of the observed values Yi equals the sum of the fitted values Ŷi.

Proof:

From (5): $\sum_{i=1}^{n} e_i = 0$

$$\Rightarrow \sum_{i=1}^{n} e_i = \sum_{i=1}^{n}(Y_i - \hat{Y}_i) = 0 \;\Rightarrow\; \sum_{i=1}^{n} Y_i - \sum_{i=1}^{n}\hat{Y}_i = 0 \;\Rightarrow\; \sum_{i=1}^{n} Y_i = \sum_{i=1}^{n}\hat{Y}_i$$

8. The sum of the residuals weighted by the corresponding fitted value always equals zero.

$$\sum_{i=1}^{n} e_i\hat{Y}_i = e'\hat{Y} = \hat{Y}'e = 0$$

Proof:

$$e'\hat{Y} = [(I - H)Y]'HY = Y'(I - H)'HY = Y'(I' - H')HY = Y'(I - H)HY \quad (I \text{ and } H \text{ are symmetric})$$
$$= Y'(HY - H^2Y) = Y'(HY - HY) \quad (H \text{ is idempotent}) = Y'\,0 = 0$$

9. The residuals and independent variables are uncorrelated.

$$\rho(X_j, e_i) = \frac{\mathrm{Cov}(X_j, e_i)}{\sqrt{\mathrm{Var}(X_j)}\sqrt{\mathrm{Var}(e_i)}} = \frac{\sum_{i=1}^{n}(X_{ij} - \bar{X}_j)(e_i - \bar{e})}{\sqrt{\sum_{i=1}^{n}(X_{ij} - \bar{X}_j)^2}\sqrt{\sum_{i=1}^{n}(e_i - \bar{e})^2}}$$

Consider the numerator $\sum_{i=1}^{n}(X_{ij} - \bar{X}_j)(e_i - \bar{e})$:

$$\sum_{i=1}^{n}(X_{ij} - \bar{X}_j)(e_i - \bar{e}) = \sum_{i=1}^{n}(X_{ij}e_i - X_{ij}\bar{e} - \bar{X}_j e_i + \bar{X}_j\bar{e}) = \underbrace{\sum_{i=1}^{n} X_{ij}e_i}_{0} - \underbrace{\bar{e}}_{0}\sum_{i=1}^{n} X_{ij} - \bar{X}_j\underbrace{\sum_{i=1}^{n} e_i}_{0} + n\bar{X}_j\underbrace{\bar{e}}_{0} = 0$$

Thus, $\rho(X_j, e_i) = 0$.

10. The variance-covariance matrix of $\hat{\beta}$ is given by:

$$\mathrm{Var}(\hat{\beta}) = \mathrm{Var}[(X'X)^{-1}X'Y] = (X'X)^{-1}X'\,\mathrm{Var}(Y)\,X(X'X)^{-1} = (X'X)^{-1}X'\sigma^2 I X(X'X)^{-1} = (X'X)^{-1}X'X(X'X)^{-1}\sigma^2 = (X'X)^{-1}\sigma^2$$

Thus, $\hat{\beta} \sim N\left(\beta,\; \sigma^2(X'X)^{-1}\right)$.
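A short numerical check (not from the reviewer; data are simulated and numpy assumed) of results 5 to 8 above:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
Y = X @ np.array([2.0, 1.5, -0.8]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
Y_hat = X @ beta_hat
e = Y - Y_hat

print(np.isclose(e.sum(), 0))            # result 5: sum of residuals is zero
print(np.allclose(X.T @ e, 0))           # result 6: X'e = 0
print(np.isclose(Y.sum(), Y_hat.sum()))  # result 7: observed and fitted sums match
print(np.isclose(e @ Y_hat, 0))          # result 8: residuals orthogonal to fitted values
```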

Inferences in Regression Analysis

ANOVA for Regression

Total amount of variation of $Y$: $\;SST = \sum_{i=1}^{n}(Y_i - \bar{Y})^2$

Decomposition of SST

$$\sum_{i=1}^{n}(Y_i - \bar{Y})^2 = \sum_{i=1}^{n}(Y_i - \hat{Y}_i + \hat{Y}_i - \bar{Y})^2 = \sum_{i=1}^{n}\left[(Y_i - \hat{Y}_i) + (\hat{Y}_i - \bar{Y})\right]^2$$
$$= \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 + \sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2 + 2\sum_{i=1}^{n}(Y_i - \hat{Y}_i)(\hat{Y}_i - \bar{Y})$$

Consider $2\sum_{i=1}^{n}(Y_i - \hat{Y}_i)(\hat{Y}_i - \bar{Y})$:

$$2\sum_{i=1}^{n}(Y_i - \hat{Y}_i)(\hat{Y}_i - \bar{Y}) = 2\sum_{i=1}^{n} e_i(\hat{Y}_i - \bar{Y}) = 2\sum_{i=1}^{n}(e_i\hat{Y}_i - e_i\bar{Y}) = 2\left[\sum_{i=1}^{n} e_i\hat{Y}_i - \bar{Y}\sum_{i=1}^{n} e_i\right] = 2[0 - 0] = 0$$

Thus,

$$\underbrace{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}_{SST} = \underbrace{\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2}_{SSE} + \underbrace{\sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2}_{SSR}$$

where:
SST - total corrected sum of squares; the total amount of variation in Y
SSE - sum of squares due to error; the "unexplained sum of squares"
SSR - sum of squares due to regression; the "explained sum of squares"


Matrix Notation

$$SST = Y'CY, \qquad SSE = Y'(I - H)Y, \qquad SSR = Y'(H - J)Y$$

where $J = \frac{1}{n}\mathbf{1}\mathbf{1}'$ and $C = I - J$.
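A small sketch (illustrative; simulated data, numpy assumed) of the matrix forms of the sums of squares, using $J = \frac{1}{n}\mathbf{1}\mathbf{1}'$ and $C = I - J$ as above, and checking that $SST = SSE + SSR$:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 25
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
Y = X @ np.array([1.0, 0.7, 0.3]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
J = np.ones((n, n)) / n          # J = (1/n) 1 1'
C = np.eye(n) - J                # centering matrix
I = np.eye(n)

SST = Y @ C @ Y
SSE = Y @ (I - H) @ Y
SSR = Y @ (H - J) @ Y
print(np.isclose(SST, SSE + SSR))                   # decomposition of SST
print(np.isclose(SST, ((Y - Y.mean())**2).sum()))   # matches the scalar definition
```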

Expected Values

• E(MSE)

$$E(MSE) = E\left[\frac{SSE}{n - p}\right] = \frac{1}{n - p}E[SSE] = \frac{1}{n - p}E[Y'(I - H)Y]$$

Note: Since $Y'(I - H)Y$ is a scalar, $Y'(I - H)Y = \mathrm{tr}[Y'(I - H)Y]$ and $\mathrm{tr}[Y'(I - H)Y] = \mathrm{tr}[(I - H)YY']$. So,

$$E(MSE) = \frac{1}{n - p}E\{\mathrm{tr}[(I - H)YY']\} = \frac{1}{n - p}\mathrm{tr}[(I - H)E(YY')]$$

Now,

$$\mathrm{Var}(Y) = E[(Y - \mu_Y)(Y - \mu_Y)'] = E[(Y - \mu_Y)(Y' - \mu_Y')] = E[YY' - Y\mu_Y' - \mu_Y Y' + \mu_Y\mu_Y'] = E(YY') - E(Y)\mu_Y' - \mu_Y E(Y') + \mu_Y\mu_Y'$$

Previous results: $\mu_Y = E(Y) = X\beta$ and $\mathrm{Var}(Y) = \sigma^2 I$, so

$$\sigma^2 I = E(YY') - X\beta\beta'X' - X\beta\beta'X' + X\beta\beta'X' = E(YY') - X\beta\beta'X' \;\Rightarrow\; E(YY') = \sigma^2 I + X\beta\beta'X'$$

Going back,

$$E(MSE) = \frac{1}{n - p}\mathrm{tr}[(I - H)E(YY')] = \frac{1}{n - p}\mathrm{tr}[(I - H)(\sigma^2 I + X\beta\beta'X')]$$
$$= \frac{1}{n - p}\mathrm{tr}[\sigma^2(I - H) + (X\beta\beta'X' - HX\beta\beta'X')] = \frac{1}{n - p}\mathrm{tr}[\sigma^2(I - H) + (X\beta\beta'X' - X(X'X)^{-1}X'X\beta\beta'X')]$$
$$= \frac{1}{n - p}\mathrm{tr}[\sigma^2(I - H) + (X\beta\beta'X' - X\beta\beta'X')] = \frac{1}{n - p}\mathrm{tr}[\sigma^2(I - H)] = \frac{1}{n - p}\sigma^2\,\mathrm{tr}(I - H) = \frac{1}{n - p}\sigma^2(n - p) = \sigma^2$$

Thus, MSE is unbiased for $\sigma^2$.

• E(MSR)

$$E(MSR) = E\left(\frac{SSR}{p - 1}\right) = \frac{1}{p - 1}E(SSR) = \frac{1}{p - 1}E[Y'(H - J)Y]$$

Note: $Y'(H - J)Y$ is a scalar, so $Y'(H - J)Y = \mathrm{tr}[Y'(H - J)Y] = \mathrm{tr}[(H - J)YY']$. So,

$$E(MSR) = \frac{1}{p - 1}E\{\mathrm{tr}[(H - J)YY']\} = \frac{1}{p - 1}\mathrm{tr}[(H - J)E(YY')]$$

It was shown previously that $E(YY') = \sigma^2 I + X\beta\beta'X'$. Thus,

$$E(MSR) = \frac{1}{p - 1}\mathrm{tr}[(H - J)(\sigma^2 I + X\beta\beta'X')] = \frac{1}{p - 1}\mathrm{tr}[\sigma^2(H - J) + (H - J)(X\beta\beta'X')] = \frac{1}{p - 1}\left\{\sigma^2\,\mathrm{tr}(H - J) + \mathrm{tr}[(H - J)(X\beta\beta'X')]\right\}$$

Aside,

$$\mathrm{tr}[(H - J)(X\beta\beta'X')] = \mathrm{tr}(HX\beta\beta'X' - JX\beta\beta'X') = \mathrm{tr}[X(X'X)^{-1}X'X\beta\beta'X' - JX\beta\beta'X'] = \mathrm{tr}(X\beta\beta'X' - JX\beta\beta'X')$$
$$= \mathrm{tr}[(I - J)X\beta\beta'X'] = \mathrm{tr}(CX\beta\beta'X') = \mathrm{tr}(X\beta\beta'X'C) = \mathrm{tr}(\beta\beta'X'CX) = \mathrm{tr}(\beta'X'CX\beta) = \mathrm{tr}[(X\beta)'CX\beta] = (X\beta)'CX\beta$$

since the term inside the trace is a scalar. Therefore,

$$E(MSR) = \frac{1}{p - 1}\sigma^2\,\mathrm{tr}(H - J) + \frac{1}{p - 1}(X\beta)'CX\beta = \frac{1}{p - 1}\sigma^2(p - 1) + \frac{1}{p - 1}(X\beta)'CX\beta = \sigma^2 + \frac{1}{p - 1}(X\beta)'CX\beta$$

ANOVA Table for testing $H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$ vs $H_a:$ at least one $\beta_j \neq 0$

Source of Variation | df    | Sum of Squares | Mean Square | F-Stat
Regression          | p − 1 | SSR            | MSR         | F_c = MSR/MSE
Error               | n − p | SSE            | MSE         |
Total               | n − 1 | SST            |             |

where n is the number of observations and p is the number of parameters. Alternatively, k is the number of independent variables, so that k = p − 1.


F-test of Regression relation

Under $H_0$: $\;F_c = \dfrac{MSR}{MSE} \sim F(p - 1,\; n - p)$

Critical Region: Reject $H_0$ if $F_c > F_\alpha(p - 1,\; n - p)$
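A worked sketch (not from the reviewer; numpy and scipy assumed, data simulated) of the ANOVA quantities and the overall F-test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, k = 40, 3
p = k + 1
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
Y = X @ np.array([1.0, 0.5, 0.0, -0.7]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
Y_hat = X @ beta_hat
SST = ((Y - Y.mean())**2).sum()
SSE = ((Y - Y_hat)**2).sum()
SSR = SST - SSE

MSR, MSE = SSR / (p - 1), SSE / (n - p)
F_c = MSR / MSE
p_value = stats.f.sf(F_c, p - 1, n - p)       # P(F > F_c)
crit = stats.f.ppf(0.95, p - 1, n - p)        # F_{0.05}(p-1, n-p)
print(F_c, crit, p_value)                     # reject H0 if F_c > crit
```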

Tests on Individual Regression Coefficients

Hypotheses:

$H_0: \beta_1 = 0$ vs $H_a: \beta_1 \neq 0$
$H_0: \beta_2 = 0$ vs $H_a: \beta_2 \neq 0$
$\;\;\vdots$
$H_0: \beta_k = 0$ vs $H_a: \beta_k \neq 0$

Test Statistic:

$$t_j = \frac{\hat{\beta}_j - \beta_j}{\mathrm{s.e.}(\hat{\beta}_j)} = \frac{\hat{\beta}_j}{\mathrm{s.e.}(\hat{\beta}_j)} \sim t(n - p) \text{ under } H_0$$

Critical Region: Reject $H_0$ if $|t_j| > t_{\alpha/2,\, n - p}$
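An illustrative computation (simulated data; numpy and scipy assumed) of the individual t-tests, with $\mathrm{s.e.}(\hat{\beta}_j)$ taken from the diagonal of $MSE\,(X'X)^{-1}$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, k = 40, 2
p = k + 1
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
Y = X @ np.array([2.0, 1.0, 0.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
e = Y - X @ beta_hat
MSE = e @ e / (n - p)

se = np.sqrt(MSE * np.diag(XtX_inv))       # s.e.(beta_hat_j)
t_stat = beta_hat / se
p_values = 2 * stats.t.sf(np.abs(t_stat), df=n - p)
print(np.column_stack([beta_hat, se, t_stat, p_values]).round(4))
```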

Coefficient of Multiple Determination

$$R^2 = R^2_{Y.1,2,\ldots,k} = \frac{SSR(X_1, X_2, \ldots, X_k)}{SST} = 1 - \frac{SSE(X_1, X_2, \ldots, X_k)}{SST}$$

Interpretation: the percentage of the variation in $Y$ that can be explained by the X's through the model $Y = X\beta + \varepsilon$.

Adjusted Coefficient of Determination

$$R_a^2 = 1 - \frac{MSE}{MST} = 1 - \frac{SSE/(n - p)}{SST/(n - 1)} = 1 - \left[\frac{SSE}{SST}\left(\frac{n - 1}{n - p}\right)\right]$$

Interpretation: Same as R-squared but taking into account the loss in degrees of freedom

Coefficient of Alienation (Non-determination)

$1 - R^2$ or $1 - R_a^2$
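A brief sketch (simulated data; numpy assumed) computing $R^2$ and $R_a^2$ from the sums of squares defined above:

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 30, 2
p = k + 1
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
Y = X @ np.array([1.0, 0.8, -0.5]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
e = Y - X @ beta_hat
SSE = e @ e
SST = ((Y - Y.mean())**2).sum()

R2 = 1 - SSE / SST
R2_adj = 1 - (SSE / (n - p)) / (SST / (n - 1))   # penalizes the loss in degrees of freedom
print(round(R2, 4), round(R2_adj, 4))
```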


Gauss-Markov Theorem

Under the conditions of the multiple linear regression model, the least squares estimator $\hat{\beta}$ is the best linear unbiased estimator (BLUE) of $\beta$. This means that among all linear unbiased estimators of $\beta_j$, $\hat{\beta}_j$ has the smallest variance, $j = 0, 1, 2, \ldots, k$.

Proof: $\hat{\beta}$ is unbiased for $\beta$. We have to show that $\hat{\beta}$ has the smallest variance among all linear unbiased estimators for $\beta$.

Consider any linear unbiased estimator of $\beta$, say $\tilde{\beta} = CY$, where $C = (X'X)^{-1}X' + D$.

To show unbiasedness:

$$E(\tilde{\beta}) = E(CY) = E\{[(X'X)^{-1}X' + D][X\beta + \varepsilon]\} = E[(X'X)^{-1}X'X\beta + (X'X)^{-1}X'\varepsilon + DX\beta + D\varepsilon]$$
$$= \beta + (X'X)^{-1}X'E(\varepsilon) + DX\beta + DE(\varepsilon) = \beta + DX\beta$$

So $DX = 0$ is required for $\tilde{\beta}$ to be unbiased.

$$\mathrm{Var}(\tilde{\beta}) = E[(\tilde{\beta} - \beta)(\tilde{\beta} - \beta)']$$

$$\tilde{\beta} - \beta = [(X'X)^{-1}X' + D][X\beta + \varepsilon] - \beta = \beta + (X'X)^{-1}X'\varepsilon + DX\beta + D\varepsilon - \beta = (X'X)^{-1}X'\varepsilon + DX\beta + D\varepsilon$$

$$(\tilde{\beta} - \beta)' = \varepsilon'X(X'X)^{-1} + \beta'X'D' + \varepsilon'D'$$

$$\mathrm{Var}(\tilde{\beta}) = E\{[(X'X)^{-1}X'\varepsilon + DX\beta + D\varepsilon][\varepsilon'X(X'X)^{-1} + \beta'X'D' + \varepsilon'D']\}$$
$$= (X'X)^{-1}X'E(\varepsilon\varepsilon')X(X'X)^{-1} + (X'X)^{-1}X'E(\varepsilon)\beta'X'D' + (X'X)^{-1}X'E(\varepsilon\varepsilon')D' + DX\beta E(\varepsilon')X(X'X)^{-1}$$
$$\quad + DX\beta\beta'X'D' + DX\beta E(\varepsilon')D' + DE(\varepsilon\varepsilon')X(X'X)^{-1} + DE(\varepsilon)\beta'X'D' + DE(\varepsilon\varepsilon')D'$$

Since $E(\varepsilon) = 0$, this reduces to

$$\mathrm{Var}(\tilde{\beta}) = (X'X)^{-1}X'E(\varepsilon\varepsilon')X(X'X)^{-1} + (X'X)^{-1}X'E(\varepsilon\varepsilon')D' + DX\beta\beta'X'D' + DE(\varepsilon\varepsilon')X(X'X)^{-1} + DE(\varepsilon\varepsilon')D'$$

Note: $\mathrm{Var}(\varepsilon) = \sigma^2 I = E[(\varepsilon - 0)(\varepsilon - 0)'] = E(\varepsilon\varepsilon')$. So,

$$\mathrm{Var}(\tilde{\beta}) = \sigma^2(X'X)^{-1}X'X(X'X)^{-1} + \sigma^2(X'X)^{-1}X'D' + DX\beta\beta'X'D' + \sigma^2 DX(X'X)^{-1} + \sigma^2 DD'$$

Recall the condition $DX = 0 \Rightarrow (DX)' = X'D' = 0'$. Thus, we are left with

$$\mathrm{Var}(\tilde{\beta}) = \sigma^2(X'X)^{-1}X'X(X'X)^{-1} + \sigma^2 DD' = \sigma^2(X'X)^{-1} + \sigma^2 DD'$$

Recall that the first term of this sum is the variance-covariance matrix of $\hat{\beta}$. We have to show that the variance-covariance matrix of $\tilde{\beta}$ is larger, that is, that $DD'$ is positive semidefinite (psd).

NTS: $DD'$ is psd, i.e., $x'DD'x \geq 0 \;\forall x$, with equality holding for some $x \neq 0$.

Let $y = D'x$. Then $x'DD'x = y'y \geq 0$, since it is a sum of squares (the inner product of a vector with itself). Thus $DD'$ is psd.

$$\mathrm{Var}(\tilde{\beta}) = \sigma^2(X'X)^{-1} + \sigma^2 DD' = \mathrm{Var}(\hat{\beta}) + \sigma^2 DD' \;\Rightarrow\; \mathrm{Var}(\tilde{\beta}) \geq \mathrm{Var}(\hat{\beta})$$

Therefore, $\mathrm{Var}(\hat{\beta})$ is the smallest among all linear unbiased estimators of $\beta$: $\hat{\beta}$ is the BLUE for $\beta$.

Furthermore, for any linear combination $\lambda'\beta$ of the elements of $\beta$, the BLUE is $\lambda'\hat{\beta}$.


Theorem. With the assumption of normality of the error terms, the maximum likelihood estimators (MLE) for $\beta$ and $\sigma^2$ are $\hat{\beta} = (X'X)^{-1}X'Y$ and $\hat{\sigma}^2 = \frac{1}{n}e'e$, respectively.

Proof:

Recall: $Y = X\beta + \varepsilon$, $\;\varepsilon \sim N(0, \sigma^2 I)$, so $Y \sim N(X\beta, \sigma^2 I)$ and $Y_i \sim N(x_i'\beta, \sigma^2)$.

Likelihood functions

$$L(\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n) = \prod_{i=1}^{n} f(\varepsilon_i) = (2\pi\sigma^2)^{-n/2}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}\varepsilon_i^2\right\}$$

$$L(Y_1, Y_2, \ldots, Y_n \mid X) = (2\pi\sigma^2)^{-n/2}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(Y_i - x_i'\beta)^2\right\}$$

Given the data, the likelihood function may be regarded as a function of the $p + 1$ parameters:

$$L(Y_1, Y_2, \ldots, Y_n \mid X) = (2\pi\sigma^2)^{-n/2}\exp\left\{-\frac{1}{2\sigma^2}(Y - X\beta)'(Y - X\beta)\right\}$$

Take the log-likelihood:

$$\ln L = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}(Y - X\beta)'(Y - X\beta)$$

$$\frac{\partial \ln L}{\partial \beta} = -\frac{1}{2\sigma^2}(-2)X'(Y - X\beta) = \frac{1}{\sigma^2}X'(Y - X\beta) \qquad (1)$$

$$\frac{\partial \ln L}{\partial \sigma^2} = -\frac{n}{2}\frac{2\pi}{2\pi\sigma^2} + \frac{1}{2\sigma^4}(Y - X\beta)'(Y - X\beta) = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}(Y - X\beta)'(Y - X\beta) \qquad (2)$$


Equating (1) to 0:

$$\frac{1}{\sigma^2}X'(Y - X\beta) = 0 \;\Rightarrow\; X'(Y - X\beta) = 0 \;\Rightarrow\; X'Y - X'X\beta = 0 \;\Rightarrow\; X'X\beta = X'Y \;\Rightarrow\; \hat{\beta} = (X'X)^{-1}X'Y$$

Equating (2) to 0 and substituting $\hat{\beta}$ for $\beta$:

$$-\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}(Y - X\hat{\beta})'(Y - X\hat{\beta}) = 0 \;\Rightarrow\; \frac{1}{2\sigma^4}(Y - X\hat{\beta})'(Y - X\hat{\beta}) = \frac{n}{2\sigma^2}$$

$$\frac{1}{2\sigma^4}e'e = \frac{n}{2\sigma^2} \;\Rightarrow\; \sigma^2 n = e'e \;\Rightarrow\; \hat{\sigma}^2 = \frac{1}{n}e'e$$
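A quick numerical comparison (simulated data; numpy assumed) of the MLE $\hat{\sigma}^2 = e'e/n$ with the unbiased estimator $MSE = e'e/(n - p)$; the least squares and maximum likelihood estimators of $\beta$ coincide:

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 50, 2
p = k + 1
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
Y = X @ np.array([1.0, -0.4, 0.9]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # MLE of beta = least squares estimator
e = Y - X @ beta_hat

sigma2_mle = e @ e / n          # MLE: e'e / n  (biased downward)
MSE = e @ e / (n - p)           # unbiased estimator: e'e / (n - p)
print(round(sigma2_mle, 4), round(MSE, 4))
```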

Theorem. $(\hat{\beta}, \hat{\sigma}^2)$ is the UMVUE for $(\beta, \sigma^2)$.

Proof:

We have established that $\hat{\beta}$ and $\hat{\sigma}^2 = MSE$ are unbiased estimators for $\beta$ and $\sigma^2$, respectively.

Thus, by the Lehmann-Scheffé theorem, we now have to determine a joint complete sufficient statistic (CSS) for $(\beta, \sigma^2)$ and check whether the unbiased estimators are functions of the joint CSS.

Recall: $Y_i \sim N(x_i'\beta, \sigma^2)$

$$f_{Y_i}(y_i) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left\{-\frac{1}{2\sigma^2}(y_i - x_i'\beta)^2\right\} I_{(-\infty,\infty)}(y_i)$$
$$= \frac{1}{\sigma\sqrt{2\pi}}\exp\left\{-\frac{1}{2\sigma^2}\left[y_i^2 - 2y_i x_i'\beta + (x_i'\beta)^2\right]\right\} I_{(-\infty,\infty)}(y_i)$$
$$= \frac{1}{\sigma\sqrt{2\pi}}\exp\left\{-\frac{(x_i'\beta)^2}{2\sigma^2}\right\}\exp\left\{-\frac{1}{2\sigma^2}\left[y_i^2 - 2y_i x_i'\beta\right]\right\} I_{(-\infty,\infty)}(y_i)$$

Consider $\exp\left\{-\frac{1}{2\sigma^2}\left[y_i^2 - 2y_i x_i'\beta\right]\right\}$:


$$\exp\left\{-\frac{1}{2\sigma^2}\left[y_i^2 - 2y_i x_i'\beta\right]\right\} = \exp\left\{-\frac{1}{2\sigma^2}\left[y_i^2 - 2y_i(\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_k X_{ik})\right]\right\}$$
$$= \exp\left\{-\frac{1}{2\sigma^2}y_i^2 + \frac{\beta_0}{\sigma^2}y_i + \frac{\beta_1}{\sigma^2}y_i X_{i1} + \cdots + \frac{\beta_k}{\sigma^2}y_i X_{ik}\right\}$$

Let $\theta = (\beta, \sigma^2)$.

$$a(\theta) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left\{-\frac{(x_i'\beta)^2}{2\sigma^2}\right\}, \qquad b(y_i) = I_{(-\infty,\infty)}(y_i)$$
$$c_1(\theta) = -\frac{1}{2\sigma^2}, \quad c_2(\theta) = \frac{\beta_0}{\sigma^2}, \quad c_3(\theta) = \frac{\beta_1}{\sigma^2}, \ \ldots, \ c_{k+2}(\theta) = \frac{\beta_k}{\sigma^2}$$
$$d_1(y_i) = y_i^2, \quad d_2(y_i) = y_i, \quad d_3(y_i) = y_i X_{i1}, \ \ldots, \ d_{k+2}(y_i) = y_i X_{ik}$$

Thus, $f_{Y_i}(y_i)$ is a member of the $(k+2)$-parameter exponential family of distributions. A joint CSS is given by

$$S = \left\{\sum_{i=1}^{n} d_1(y_i), \sum_{i=1}^{n} d_2(y_i), \sum_{i=1}^{n} d_3(y_i), \ldots, \sum_{i=1}^{n} d_{k+2}(y_i)\right\} = \left\{\sum_{i=1}^{n} Y_i^2, \sum_{i=1}^{n} Y_i, \sum_{i=1}^{n} Y_i X_{i1}, \ldots, \sum_{i=1}^{n} Y_i X_{ik}\right\}$$

i. Now $\hat{\beta} = (X'X)^{-1}X'Y$. It was previously shown that

$$X'Y = \begin{bmatrix} \sum_{i=1}^{n} Y_i \\ \sum_{i=1}^{n} X_{i1}Y_i \\ \vdots \\ \sum_{i=1}^{n} X_{ik}Y_i \end{bmatrix}$$

Thus $\hat{\beta}$ is a function of the joint CSS, since $(X'X)^{-1}$ is a matrix of constants, so $E(\hat{\beta} \mid S) = \hat{\beta}$.

By the Lehmann-Scheffé theorem, $\hat{\beta}$ is the UMVUE for $\beta$.

ii.
$$SSE = Y'(I - H)Y = Y'Y - Y'HY = Y'Y - Y'X(X'X)^{-1}X'Y$$
$$MSE = \frac{1}{n - p}SSE = \frac{1}{n - p}\left[Y'Y - Y'X(X'X)^{-1}X'Y\right]$$


$Y'Y = \sum_{i=1}^{n} Y_i^2$, $\;Y'X = (X'Y)'$, and $\frac{1}{n - p}$ is a constant.

Thus, MSE is a function of the CSS and is the UMVUE for $\sigma^2$.

Confidence Intervals

From the individual t-tests with $H_0: \beta_j = 0$ vs $H_a: \beta_j \neq 0$:

$$t_j = \frac{\hat{\beta}_j - \beta_j}{\mathrm{s.e.}(\hat{\beta}_j)} \sim t(n - p) \;\Rightarrow\; P\left(|t_j| \leq t_{\alpha/2,\, n-p}\right) = 1 - \alpha$$

$$-t_{\alpha/2,\, n-p} \leq \frac{\hat{\beta}_j - \beta_j}{\mathrm{s.e.}(\hat{\beta}_j)} \leq t_{\alpha/2,\, n-p} \;\Rightarrow\; \hat{\beta}_j - t_{\alpha/2,\, n-p}\,\mathrm{s.e.}(\hat{\beta}_j) \leq \beta_j \leq \hat{\beta}_j + t_{\alpha/2,\, n-p}\,\mathrm{s.e.}(\hat{\beta}_j)$$

A $(1 - \alpha)100\%$ confidence interval for $\beta_j$ is therefore $\left[\hat{\beta}_j \mp t_{\alpha/2,\, n-p}\,\mathrm{s.e.}(\hat{\beta}_j)\right]$.

• Confidence interval estimates of the parameters give a better picture than point estimates (e.g., BLUE and UMVUE).
• The width of these confidence intervals is a measure of the overall quality of the regression line.
• These confidence intervals have the usual frequency interpretation.
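A short sketch (simulated data; numpy and scipy assumed) of the $(1-\alpha)100\%$ confidence intervals for the coefficients:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
n, k = 35, 2
p = k + 1
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
Y = X @ np.array([1.5, 0.6, -0.3]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
e = Y - X @ beta_hat
MSE = e @ e / (n - p)
se = np.sqrt(MSE * np.diag(XtX_inv))

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - p)
lower, upper = beta_hat - t_crit * se, beta_hat + t_crit * se
print(np.column_stack([beta_hat, lower, upper]).round(4))   # 95% CIs for beta_0,...,beta_k
```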

Prediction Interval

• Inferences about the Mean Response $E(Y_h)$ (see the numerical sketch after this list)

Given a particular set of independent variables $x_h'$, we estimate the mean of $Y$:

$$Y_h = x_h'\beta + \varepsilon_h$$
$$E(Y_h) = x_h'\beta, \quad \text{which we estimate using } \hat{Y}_h = x_h'\hat{\beta}$$

Distribution of $\hat{Y}_h$:

$$E(\hat{Y}_h) = E(x_h'\hat{\beta}) = x_h'E(\hat{\beta}) = x_h'\beta. \quad \text{Thus, } \hat{Y}_h \text{ is unbiased for } E(Y_h).$$

$$\mathrm{Var}(\hat{Y}_h) = \mathrm{Var}(x_h'\hat{\beta}) = x_h'\,\mathrm{Var}(\hat{\beta})\,x_h = \sigma^2 x_h'(X'X)^{-1}x_h$$

Thus, $\hat{Y}_h \sim N\left(x_h'\beta,\; \sigma^2 x_h'(X'X)^{-1}x_h\right)$.

Note: $\hat{Y}_h$ is BLUE for $E(Y_h)$ (by the Gauss-Markov theorem).


A $(1 - \alpha)100\%$ CI for $E(Y_h)$ is:

$$\left[\hat{Y}_h \mp z_{\alpha/2}\sqrt{\sigma^2 x_h'(X'X)^{-1}x_h}\right] \quad \text{when } \sigma^2 \text{ is known}$$
$$\left[\hat{Y}_h \mp t_{\alpha/2,\, n-p}\sqrt{MSE\, x_h'(X'X)^{-1}x_h}\right] \quad \text{when } \sigma^2 \text{ is unknown}$$

In testing $H_0: E(Y_h) = d$ vs $H_a: E(Y_h) \neq d$:

Test statistic: $\;t_c = \dfrac{\hat{Y}_h - d}{\mathrm{s.e.}(\hat{Y}_h)} \sim t(n - p)$ under $H_0$

Critical Region: Reject $H_0$ if $|t_c| \geq t_{\alpha/2,\, n-p}$

• Prediction of a New Observation $Y_h$ given $x_h$ $\left[Y_{h(\text{new})}\right]$

We are not estimating a parameter but predicting a particular value of the random variable Y i given a particular x i based on the distribution of Y.

The new observation Y h is viewed as a result of a new trial, independent of the trials on which the regression analysis is based. We also assume that the underlying regression model applicable for the basic sample data continues to be appropriate for the new observation.

i. Case 1: Parameters are known

$$Y = X\beta + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2 I), \quad Y_i \sim N(x_i'\beta, \sigma^2)$$

$$\frac{Y_i - x_i'\beta}{\sigma} \sim N(0, 1) \;\Rightarrow\; P\left(-z_{\alpha/2} \leq \frac{Y_i - x_i'\beta}{\sigma} \leq z_{\alpha/2}\right) = 1 - \alpha$$

A $(1 - \alpha)100\%$ PI for $Y_i$ is given by:

$$\left(x_i'\beta \mp z_{\alpha/2}\,\sigma\right)$$

ii. Case 2: Parameters are unknown

Assumptions: (1) the parameters have already been estimated; (2) the given $x_h$ is independent of the previous sample.

$\beta$ is estimated by $\hat{\beta}$ and $\sigma^2$ is estimated by MSE.

$$Y_h = x_h'\beta + \varepsilon_h, \quad \varepsilon_h \sim N(0, \sigma^2)$$
$$E(Y_h) = x_h'\beta$$

The variance of the prediction error $Y_{h(\text{new})} - \hat{Y}_h$ is estimated by

$$MSE\, x_h'(X'X)^{-1}x_h + MSE = MSE\left[x_h'(X'X)^{-1}x_h + 1\right]$$


A $(1 - \alpha)100\%$ PI for $Y_{h(\text{new})}$ is given by:

$$\left(x_h'\hat{\beta} \mp t_{\alpha/2,\, n-p}\sqrt{MSE\left[x_h'(X'X)^{-1}x_h + 1\right]}\right)$$

• Prediction of the Mean of $m$ New Observations $\bar{Y}_{h(\text{new})}$ for given $x_h$

$$\left(x_h'\hat{\beta} \mp t_{\alpha/2,\, n-p}\sqrt{MSE\left[\frac{1}{m} + x_h'(X'X)^{-1}x_h\right]}\right)$$

Inverse Regression Problem

• Also known as the calibration problem.
• Given the regression model of Y, we predict the value of the independent variable, X, that gave rise to a new observation Y.
• For a simple linear regression model $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$ and the estimated regression function $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$, if a new observation $Y_{h(\text{new})}$ becomes available, a natural point estimator for the level $X_{h(\text{new})}$ is given by

$$\hat{X}_{h(\text{new})} = \frac{Y_{h(\text{new})} - \hat{\beta}_0}{\hat{\beta}_1}$$

• For a linear regression model with two independent variables $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$ and the estimated regression function $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2$, if a new observation $Y_0$ becomes available and a value of $X_2 = X_{20}$ is given, $X_1$ can be predicted by

$$\hat{X}_{10} = \frac{Y_0 - \hat{\beta}_0 - \hat{\beta}_2 X_{20}}{\hat{\beta}_1}$$

Hypothesis Testing

General Linear Test (GLT)

Steps:

1. Fit the full model and obtain the error sum of squares, SSE(F).
2. Fit the reduced model under the null hypothesis and obtain the error sum of squares, SSE(R).
3. The test statistic is

$$F^* = \frac{\dfrac{SSE(R) - SSE(F)}{df_R - df_F}}{\dfrac{SSE(F)}{df_F}}$$

and the critical region is to reject the null hypothesis if $F^* > F_\alpha(df_R - df_F,\; df_F)$.
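A minimal sketch of the general linear test (simulated data; numpy and scipy assumed; the reduced model drops one made-up regressor under $H_0$):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
n = 40
X1, X2 = rng.normal(size=n), rng.normal(size=n)
Y = 1.0 + 0.8 * X1 + 0.1 * X2 + rng.normal(size=n)

def sse(X, Y):
    """Error sum of squares and error df of the least squares fit of Y on the columns of X."""
    b = np.linalg.lstsq(X, Y, rcond=None)[0]
    r = Y - X @ b
    return r @ r, n - X.shape[1]

# Full model: intercept, X1, X2.  Reduced model under H0: beta_2 = 0.
SSE_F, df_F = sse(np.column_stack([np.ones(n), X1, X2]), Y)
SSE_R, df_R = sse(np.column_stack([np.ones(n), X1]), Y)

F_star = ((SSE_R - SSE_F) / (df_R - df_F)) / (SSE_F / df_F)
p_value = stats.f.sf(F_star, df_R - df_F, df_F)
print(round(F_star, 3), round(p_value, 4))
```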


Extra Sums of Squares

Extra sums of squares reflect the reduction in the error sum of squares (or the increase in the model sum of squares) from adding an independent variable to the model, given that another independent variable (or variables) is already in the model.

$$SSR(\text{variable/s added} \mid \text{variable/s included}) = SSR(\text{full model}) - SSR(\text{variable/s included}) = SSE(\text{variable/s included}) - SSE(\text{full model})$$

Lack-of-Fit Test

Hypotheses

$H_0: E(Y) = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_k X_{ik}$
$H_a: E(Y) \neq \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_k X_{ik}$

ANOVA Table for Testing Lack-of-Fit for MLRM

Source of Variation | df    | Sum of Squares | Mean Square | F-Stat
Regression          | p − 1 | SSR            | MSR         | F2 = MSR/MSE
Error               | n − p | SSE            | MSE         |
  Lack-of-Fit       | c − p | SSLF           | MSLF        | F1 = MSLF/MSPE
  Pure Error        | n − c | SSPE           | MSPE        |
Total               | n − 1 | SST            |             |

Remarks
• SSE = SSLF + SSPE
• $SSPE = \sum_{m=1}^{c}\sum_{i=1}^{n_m}(Y_{im} - \bar{Y}_m)^2$, where $c$ is the number of distinct X levels (groups) with $n_m$ replicates in the $m$th group
• Critical Region: Reject $H_0$ if $F_1 > F_\alpha(c - p,\; n - c)$
• Significance indicates that the model appears to be inadequate.
• The F-test for significance of the overall regression is valid only if no lack-of-fit is exhibited by the model.
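An illustrative computation of the pure error sum of squares from replicated X levels (made-up data; a simple linear fit via np.polyfit is used only for the example):

```python
import numpy as np

# Made-up data with replicates: several Y values observed at each distinct X level
X = np.array([1.0, 1.0, 2.0, 2.0, 2.0, 3.0, 3.0, 4.0])
Y = np.array([2.1, 2.4, 3.9, 4.2, 4.0, 6.2, 5.8, 7.9])

levels = np.unique(X)                     # the c distinct levels
SSPE = sum(((Y[X == x] - Y[X == x].mean())**2).sum() for x in levels)

n, c, p = Y.size, levels.size, 2          # p = 2 parameters for a simple linear fit
b = np.polyfit(X, Y, 1)                   # least squares line
SSE = ((Y - np.polyval(b, X))**2).sum()
SSLF = SSE - SSPE                         # SSE = SSLF + SSPE
print(round(SSPE, 4), round(SSLF, 4), (c - p, n - c))   # df for the lack-of-fit F test
```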

Partial F-Tests and Sequential F-Tests

• Partial F-Test: measures the value of adding another regressor given that the other k − 1 independent variables are already in the model.
• Sequential F-Test: a special type of partial F-test where the regressor variables enter one at a time.


• Structure
  • Hypotheses: $H_0: \beta_j = 0$ vs $H_a: \beta_j \neq 0$
  • Test Statistic:
  $$F^* = \frac{\dfrac{SSR(X_j \mid X_1, X_2, \ldots, X_{j-1}, X_{j+1}, \ldots, X_k)}{1}}{\dfrac{SSE}{n - p}}$$
  • Critical Region: Reject $H_0$ if $F^* > F_\alpha(1,\; n - p)$

Coefficient of Partial Determination

$r^2_{Yj.1,2,\ldots,j-1,j+1,\ldots,k}$ = the marginal contribution of $X_j$ to the reduction of the variation in $Y$ when the independent variables other than $X_j$ are already in the model
= the percentage decrease in SSE when $X_j$ is added to the model when the independent variables other than $X_j$ are already in the model

The coefficient of partial correlation is the square root of the coefficient of partial determination. It takes the sign of the associated regression coefficient. It does not have as clear a meaning as the coefficient of partial determination.

Variable Selection

• All Possible Regressions (APR)

Takes into account all $2^k - 1$ possible regression models. These criteria do not guarantee that the independent variables in the selected model/s are significant.

i. $R_p^2$ Criterion
The intent is to find the point where adding more independent variables is not worthwhile because it leads to a very small increase in $R_p^2$.

ii. $MSE_p$ or $R_a^2$ Criterion
$R_a^2$ takes the number of parameters in the model into account through the degrees of freedom. Note that $R_a^2$ increases if and only if the mean square error decreases. Hence $R_a^2$ and $MSE_p$ are equivalent criteria.

iii. Mallows' $C_p$ Criterion
Concerned with the total mean squared error of the n fitted values for each of the various subset regression models. In using this criterion, one seeks to identify subsets of the independent variables for which the $C_p$ value is small and near $p$ (the number of parameters in the subset).


• Automatic Search Procedures (ASP)
  i. Forward Selection
  ii. Backward Selection
  iii. Stepwise Selection

Standardized Coefficients

Given $Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_k X_{ik} + \varepsilon_i = x_i'\beta + \varepsilon_i$,

$$\frac{Y_i - \mu_Y}{\sigma_Y} = \frac{(x_i'\beta + \varepsilon_i) - E(x_i'\beta + \varepsilon_i)}{\sigma_Y} = \frac{(x_i'\beta + \varepsilon_i) - [E(x_i)'\beta + E(\varepsilon_i)]}{\sigma_Y} = \frac{x_i'\beta - \mu'\beta + \varepsilon_i}{\sigma_Y} = \frac{x_i'\beta - \mu'\beta}{\sigma_Y} + \frac{\varepsilon_i}{\sigma_Y}$$

$$= \frac{(\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_k X_{ik}) - (\beta_0\mu_0 + \beta_1\mu_1 + \cdots + \beta_k\mu_k)}{\sigma_Y} + \frac{\varepsilon_i}{\sigma_Y}$$
$$= \frac{\beta_0 - \beta_0\mu_0}{\sigma_Y} + \frac{\beta_1 X_{i1} - \beta_1\mu_1}{\sigma_Y} + \frac{\beta_2 X_{i2} - \beta_2\mu_2}{\sigma_Y} + \cdots + \frac{\beta_k X_{ik} - \beta_k\mu_k}{\sigma_Y} + \frac{\varepsilon_i}{\sigma_Y}$$

Since the "variable" attached to $\beta_0$ is the constant 1, $\mu_0 = \dfrac{\sum_{i=1}^{N} 1}{N} = \dfrac{N}{N} = 1$, so the first term vanishes:

$$\frac{Y_i - \mu_Y}{\sigma_Y} = 0 + \frac{\beta_1(X_{i1} - \mu_1)}{\sigma_Y} + \frac{\beta_2(X_{i2} - \mu_2)}{\sigma_Y} + \cdots + \frac{\beta_k(X_{ik} - \mu_k)}{\sigma_Y} + \frac{\varepsilon_i}{\sigma_Y}$$
$$= \frac{\beta_1\sigma_1}{\sigma_Y}\frac{(X_{i1} - \mu_1)}{\sigma_1} + \frac{\beta_2\sigma_2}{\sigma_Y}\frac{(X_{i2} - \mu_2)}{\sigma_2} + \cdots + \frac{\beta_k\sigma_k}{\sigma_Y}\frac{(X_{ik} - \mu_k)}{\sigma_k} + \frac{\varepsilon_i}{\sigma_Y}$$

Given a random sample of size $n$:

$$\frac{Y_i - \bar{Y}}{s_Y} = \frac{\beta_1 s_1}{s_Y}\frac{(X_{i1} - \bar{X}_1)}{s_1} + \frac{\beta_2 s_2}{s_Y}\frac{(X_{i2} - \bar{X}_2)}{s_2} + \cdots + \frac{\beta_k s_k}{s_Y}\frac{(X_{ik} - \bar{X}_k)}{s_k} + \varepsilon_i^*$$

The new regression model:

$$Y^* = X^*\beta^* + \varepsilon^*, \quad \text{where } E(\varepsilon^*) = 0 \text{ and } \mathrm{Var}(\varepsilon^*) = 1$$

Estimating $\beta^*$:

$$\hat{\beta}^* = (X^{*\prime}X^*)^{-1}X^{*\prime}Y^*$$

$$Y^* = \begin{bmatrix} \frac{Y_1 - \bar{Y}}{s_Y} \\ \frac{Y_2 - \bar{Y}}{s_Y} \\ \vdots \\ \frac{Y_n - \bar{Y}}{s_Y} \end{bmatrix}, \qquad X^* = \begin{bmatrix} \frac{X_{11} - \bar{X}_1}{s_1} & \frac{X_{12} - \bar{X}_2}{s_2} & \cdots & \frac{X_{1k} - \bar{X}_k}{s_k} \\ \frac{X_{21} - \bar{X}_1}{s_1} & \frac{X_{22} - \bar{X}_2}{s_2} & \cdots & \frac{X_{2k} - \bar{X}_k}{s_k} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{X_{n1} - \bar{X}_1}{s_1} & \frac{X_{n2} - \bar{X}_2}{s_2} & \cdots & \frac{X_{nk} - \bar{X}_k}{s_k} \end{bmatrix}$$

with $X^{*\prime}$ its transpose. Then

$$X^{*\prime}X^* = \begin{bmatrix} \sum_{i=1}^{n}\left(\frac{X_{i1} - \bar{X}_1}{s_1}\right)^2 & \sum_{i=1}^{n}\left(\frac{X_{i1} - \bar{X}_1}{s_1}\right)\left(\frac{X_{i2} - \bar{X}_2}{s_2}\right) & \cdots & \sum_{i=1}^{n}\left(\frac{X_{i1} - \bar{X}_1}{s_1}\right)\left(\frac{X_{ik} - \bar{X}_k}{s_k}\right) \\ \vdots & \sum_{i=1}^{n}\left(\frac{X_{i2} - \bar{X}_2}{s_2}\right)^2 & \cdots & \vdots \\ \vdots & \vdots & \ddots & \vdots \\ \cdots & \cdots & \cdots & \sum_{i=1}^{n}\left(\frac{X_{ik} - \bar{X}_k}{s_k}\right)^2 \end{bmatrix}$$

This matrix is symmetric, and

$$X^{*\prime}X^* = \begin{bmatrix} n - 1 & (n-1)r_{12} & \cdots & (n-1)r_{1k} \\ & n - 1 & \cdots & (n-1)r_{2k} \\ & & \ddots & \vdots \\ & & & n - 1 \end{bmatrix} = (n - 1)R_{xx}$$

Similarly,

$$X^{*\prime}Y^* = \begin{bmatrix} (n-1)r_{1Y} \\ (n-1)r_{2Y} \\ \vdots \\ (n-1)r_{kY} \end{bmatrix} = (n - 1)R_{xy}$$

where $R_{xx}$ is the correlation matrix of the independent variables and $R_{xy}$ is the vector of correlations of the dependent variable with each independent variable.


Thus,

$$\hat{\beta}^* = (X^{*\prime}X^*)^{-1}X^{*\prime}Y^* = [(n - 1)R_{xx}]^{-1}[(n - 1)R_{xy}] = R_{xx}^{-1}R_{xy}$$

A one standard deviation change in $X_j$ results in a $\hat{\beta}_j^*$ standard deviation change in $Y$.

$$\hat{\beta}_j^* = \hat{\beta}_j\frac{s_j}{s_Y} \;\Rightarrow\; \hat{\beta}_j^* s_Y = \hat{\beta}_j s_j$$
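A short check (simulated data; numpy assumed) that the standardized coefficients $\hat{\beta}^* = R_{xx}^{-1}R_{xy}$ match the rescaled raw coefficients $\hat{\beta}_j s_j / s_Y$:

```python
import numpy as np

rng = np.random.default_rng(11)
n = 50
X1 = rng.normal(loc=10, scale=3, size=n)
X2 = rng.normal(loc=0, scale=0.5, size=n)
Y = 4.0 + 0.3 * X1 + 2.0 * X2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), X1, X2])
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Standardized coefficients via the correlation matrices: beta* = Rxx^{-1} Rxy
Z = np.column_stack([X1, X2])
Rxx = np.corrcoef(Z, rowvar=False)
Rxy = np.array([np.corrcoef(Z[:, j], Y)[0, 1] for j in range(Z.shape[1])])
beta_star = np.linalg.solve(Rxx, Rxy)

# The same values obtained by rescaling the raw coefficients: beta_j * s_j / s_Y
s = Z.std(ddof=1, axis=0)
sY = Y.std(ddof=1)
print(beta_star.round(4), (beta_hat[1:] * s / sY).round(4))
```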

Elasticity: The effect of a percentage change of an independent variable on the dependent variable. Large elasticity indicates that the dependent variable is very sensitive to changes in the independent variable.

$$E_j = \hat{\beta}_j\frac{\bar{x}_j}{\bar{y}}$$