UP School of Statistics Student Council
Education and Research
erho.weebly.com | erhomyhero@gmail.com | f /erhoismyhero | t @erhomyhero

Statistics 136: Introduction to Regression Analysis
Reviewer for the 1st Long Examination
Preliminaries
For an idempotent matrix A:  rk(I − A) = rk(I) − rk(A) = tr(I − A)
For general quadratic forms f(x) = (a ± Bx)'A(a ± Bx),

    ∂f(x)/∂x = ±2B'A(a ± Bx)

where   a – m×1 vector of constants
        B – m×p matrix of constants
        x – p×1 vector of variables
        A – m×m symmetric matrix of constants
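This gradient identity is easy to check numerically. A minimal sketch (assuming NumPy is available; all matrix names are illustrative), comparing the closed form for the minus case against central finite differences:

```python
import numpy as np

# Check d f(x)/dx = -2 B'A(a - Bx) for f(x) = (a - Bx)'A(a - Bx)
# with a (m x 1), B (m x p), x (p x 1), A (m x m) symmetric.
rng = np.random.default_rng(0)
m, p = 4, 3
a = rng.normal(size=m)
B = rng.normal(size=(m, p))
A = rng.normal(size=(m, m))
A = (A + A.T) / 2                      # make A symmetric

x = rng.normal(size=p)

def f(x):
    r = a - B @ x
    return r @ A @ r

grad_formula = -2 * B.T @ A @ (a - B @ x)

# Central differences are exact (up to rounding) for a quadratic f
h = 1e-6
grad_numeric = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(p)])
print(np.allclose(grad_formula, grad_numeric, atol=1e-5))
```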
Expected value of a vector

For y an n×1 vector of random variables:

    E(y) = E[Y1, Y2, …, Yn]' = [E(Y1), E(Y2), …, E(Yn)]' = [E(Yi)]

For matrices, just distribute the expected value operator to all elements of the matrix.
Variance-covariance matrix

For y an n×1 vector of random variables:
Var(y) = E{[y − E(y)][y − E(y)]'}
       = E{[Y1 − E(Y1), Y2 − E(Y2), …, Yn − E(Yn)]' [Y1 − E(Y1), Y2 − E(Y2), …, Yn − E(Yn)]}
       = [σij]

where σii = σi² = Var(Yi) and σij = Cov(Yi, Yj).

Note: The variance-covariance matrix is always symmetric.
Correlation matrix

The correlation matrix of y (n×1), denoted by ℜ, is defined by

ℜ = diag⁻¹{σ1, σ2, …, σn} V diag⁻¹{σ1, σ2, …, σn}
  = diag{1/σ1, 1/σ2, …, 1/σn} V diag{1/σ1, 1/σ2, …, 1/σn}
  = [σij / (σi σj)],   i, j = 1, 2, …, n

where V = Var(y).
Note: Let C be a matrix of constants and y a random vector where Var(y) = V. Then, Var(Cy) = CVC'.
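Both the Var(Cy) = CVC' rule and the correlation-matrix construction can be verified on sample moments. A sketch (NumPy, illustrative data); note the identity holds exactly when V is itself a sample covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
Y = rng.normal(size=(3, 200))            # rows = 3 random variables, cols = 200 observations
V = np.cov(Y)                            # sample variance-covariance matrix of y

# Var(Cy) = C V C'
C = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, -1.0]])
lhs = np.cov(C @ Y)
print(np.allclose(lhs, C @ V @ C.T))

# Correlation matrix: R = diag^{-1}{sigma_i} V diag^{-1}{sigma_i}
D_inv = np.diag(1.0 / np.sqrt(np.diag(V)))
R = D_inv @ V @ D_inv                    # entry (i, j) is sigma_ij / (sigma_i sigma_j)
print(np.allclose(np.diag(R), 1.0))      # unit diagonal
```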
The Linear Model
Multiple Linear Regression model
Y = X β + ε
Yi = β0 + β1 Xi1 + β2 Xi2 + … + βk Xik + εi
Matrix Notation

Y (n×1) = [Y1, Y2, …, Yn]'

X (n×(k+1)) = [ 1  X11  X12  ⋯  X1k
                1  X21  X22  ⋯  X2k
                ⋮   ⋮    ⋮   ⋱   ⋮
                1  Xn1  Xn2  ⋯  Xnk ]

β ((k+1)×1) = [β0, β1, β2, …, βk]'

ε (n×1) = [ε1, ε2, …, εn]'
Yi is the value of the response variable in the ith trial. Xij is a known constant, namely, the value of the jth independent variable in the ith trial.
βj is a parameter, j = 0, 1, 2, …, k. εi is a random error term, i = 1, 2, …, n.
Classical Assumptions
E(εi) = 0, ∀i
Var(εi) = σ², ∀i
Cov(εi, εj) = 0, ∀i ≠ j

Normal error model assumptions

εi ~ NID(0, σ²), i = 1, 2, …, n   ⇔   ε ~ Nn(0, σ²I)
Dependent Variable (implicit distribution)
Y∼N n(X β ,σ2 I n)
Regression Function
E (Y ) = X β
E (Y i) = β 0+β 1 X i1+β 2 X i2+…+β k X ik
Least Squares Criterion
Objective function

Minimize  ∑ᵢ₌₁ⁿ εi² = ε'ε,  the inner product of the vector of error terms.

Least Squares Estimation

∂(∑ᵢ₌₁ⁿ εi²)/∂β = ∂(ε'ε)/∂β = ∂[(Y − Xβ)'(Y − Xβ)]/∂β = −2X'(Y − Xβ)
Equating to the null vector:

−2X'(Y − Xβ) = 0  →  X'(Y − Xβ) = 0  →  X'Y − X'Xβ = 0  →  X'Xβ = X'Y

Thus, from the above equation, we get the set of normal equations and an estimator for β.
Normal Equations

X'Xβ = X'Y

        [ n        ∑Xi1      ∑Xi2      ⋯  ∑Xik    ] [β0]   [ ∑Yi     ]
        [ ∑Xi1     ∑Xi1²     ∑Xi1Xi2   ⋯  ∑Xi1Xik ] [β1]   [ ∑Xi1 Yi ]
X'Xβ =  [  ⋮        ⋮         ⋮        ⋱   ⋮      ] [ ⋮ ] = [  ⋮      ] = X'Y
        [ ∑Xik     ∑Xi1Xik   ∑Xi2Xik   ⋯  ∑Xik²   ] [βk]   [ ∑Xik Yi ]

(all sums over i = 1, …, n)
Set of Normal Equations

1st:      nβ0 + β1∑Xi1 + β2∑Xi2 + … + βk∑Xik = ∑Yi
2nd:      β0∑Xi1 + β1∑Xi1² + … + βk∑Xi1Xik = ∑Xi1Yi
  ⋮
(k+1)th:  β0∑Xik + β1∑Xi1Xik + … + βk∑Xik² = ∑XikYi

(all sums over i = 1, …, n)
Least Squares Estimator of β

β̂ = (X'X)⁻¹X'Y

provided (X'X)⁻¹ exists, that is, X is of full rank: rk(X) = k + 1.
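As a sketch (simulated data, illustrative names), the estimator can be computed by solving the normal equations directly; assuming NumPy, the result matches a QR/SVD-based least squares routine:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])   # n x (k+1), full column rank
beta = np.array([2.0, 1.0, -0.5, 3.0])
Y = X @ beta + rng.normal(scale=0.1, size=n)

# Solve X'X b = X'Y rather than inverting X'X explicitly
b_hat = np.linalg.solve(X.T @ X, X.T @ Y)
b_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.allclose(b_hat, b_lstsq))
```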
Estimator for E(Y)

Ê(Y) = Ŷ = Xβ̂ = X(X'X)⁻¹X'Y = HY,   where we define H = X(X'X)⁻¹X'
Residuals

e = Y − Ŷ = Y − Xβ̂ = Y − X(X'X)⁻¹X'Y = Y − HY = (I − H)Y
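The hat matrix H ties together several facts used later: it is idempotent, e = (I − H)Y reproduces Y − Xβ̂, and tr(H) = rk(X) = k + 1 (the rank/trace identity from the Preliminaries). A sketch with simulated data:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
Y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T     # hat matrix
b_hat = np.linalg.solve(X.T @ X, X.T @ Y)
e = (np.eye(n) - H) @ Y                  # residuals e = (I - H)Y

print(np.allclose(H @ H, H))             # H is idempotent
print(np.allclose(e, Y - X @ b_hat))     # e = Y - X beta_hat
print(np.isclose(np.trace(H), k + 1))    # tr(H) = rk(X) = k + 1
```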
Interpretation of Coefficients

Given Ŷi = β̂0 + β̂1 Xi1 + β̂2 Xi2 + ⋯ + β̂k Xik, the estimated mean of Y, we interpret the coefficients β̂0, β̂1, …, β̂k as follows:

• β̂0: value of the estimated mean of Y when all the independent variables are zero
• β̂1: change in the estimated mean of Y per unit change in Xi1, holding the other independent variables constant
• In general, ceteris paribus (all other things equal), β̂j is the change in the estimated mean of Y per unit change in Xij, holding the other independent variables constant, for j = 1, 2, …, k
• Caution on the interpretation of coefficients:
  i.   coefficients are partial
  ii.  validity of the interpretation depends on whether the assumption of uncorrelatedness among the X's holds
  iii. interpretation is affected by the range of X used in estimation (for example, β̂0 may not always be interpretable)
Results from Least Squares Criterion

1. The least squares estimator β̂ is unbiased for β: E(β̂) = β.

Proof:
E(β̂) = E[(X'X)⁻¹X'Y] = (X'X)⁻¹X'E(Y) = (X'X)⁻¹X'Xβ = β

2. The expected value of the vector of residuals is a null vector: E(e) = 0.
Proof:
E(e) = E[(I − H)Y] = (I − H)E(Y) = (I − H)Xβ = Xβ − HXβ
     = Xβ − X(X'X)⁻¹X'Xβ = Xβ − Xβ = 0
3. The sum of squared residuals, ∑ᵢ₌₁ⁿ ei² = e'e, is a minimum.

4. The least squares regression line always passes through the centroid.

5. The sum of the residuals of any regression model that contains an intercept β0 is always equal to zero:

   ∑ᵢ₌₁ⁿ ei = 1'e = e'1 = 0
Proof:
∑ei = ∑(Yi − Ŷi)
    = ∑[Yi − (β̂0 + β̂1Xi1 + β̂2Xi2 + … + β̂kXik)]
    = ∑Yi − ∑(β̂0 + β̂1Xi1 + β̂2Xi2 + … + β̂kXik)
    = ∑Yi − (nβ̂0 + β̂1∑Xi1 + β̂2∑Xi2 + … + β̂k∑Xik)
    = ∑Yi − ∑Yi        (from the first normal equation)
    = 0

(all sums over i = 1, …, n)
6. The sum of the residuals weighted by the corresponding value of the regressor variable always equals zero:

   ∑ᵢ₌₁ⁿ eiXij = X'e = 0
Proof:
X'e = X'(I − H)Y
    = X'[I − X(X'X)⁻¹X']Y
    = [X' − X'X(X'X)⁻¹X']Y
    = (X' − X')Y = 0'Y = 0
7. The sum of the observed values Yi equals the sum of the fitted values Ŷi:  ∑Yi = ∑Ŷi

Proof:
From (5), ∑ei = 0. Since ∑ei = ∑(Yi − Ŷi), we have ∑(Yi − Ŷi) = 0, so ∑Yi − ∑Ŷi = 0, and therefore ∑Yi = ∑Ŷi.
8. The sum of the residuals weighted by the corresponding fitted value always equals zero:

   ∑ᵢ₌₁ⁿ eiŶi = e'Ŷ = Ŷ'e = 0

Proof:
e'Ŷ = [(I − H)Y]'HY
    = Y'(I − H)'HY
    = Y'(I' − H')HY
    = Y'(I − H)HY          (I and H are symmetric)
    = Y'(HY − H²Y)
    = Y'(HY − HY)          (H is idempotent)
    = Y'0 = 0
9. The residuals and independent variables are uncorrelated.

ρ(Xj, ei) = Cov(Xj, ei) / [√Var(Xj) √Var(ei)]
          = ∑(Xij − X̄j)(ei − ē) / [√∑(Xij − X̄j)² √∑(ei − ē)²]

Consider the numerator:

∑(Xij − X̄j)(ei − ē) = ∑(Xij ei − Xij ē − X̄j ei + X̄j ē)
                     = ∑Xij ei − ē∑Xij − X̄j∑ei + nX̄j ē
                     = 0 − 0 − 0 + 0 = 0

(∑Xij ei = 0 by result 6; ∑ei = 0 and hence ē = 0 by result 5; all sums over i = 1, …, n)

Thus ρ(Xj, ei) = 0.
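Results 5 through 8 can all be confirmed on a fitted model in a few lines. A sketch with simulated data (illustrative names):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 60, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
Y = X @ np.array([0.5, 1.0, 2.0]) + rng.normal(size=n)

b_hat = np.linalg.solve(X.T @ X, X.T @ Y)
Y_fit = X @ b_hat
e = Y - Y_fit

print(np.isclose(e.sum(), 0.0))              # result 5: sum of residuals is zero
print(np.allclose(X.T @ e, 0.0))             # result 6: X'e = 0
print(np.isclose(Y.sum(), Y_fit.sum()))      # result 7: sum observed = sum fitted
print(np.isclose(e @ Y_fit, 0.0))            # result 8: e'Y_hat = 0
```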
10. The variance-covariance matrix of β̂ is given by:

Var(β̂) = Var[(X'X)⁻¹X'Y]
        = (X'X)⁻¹X' Var(Y) X(X'X)⁻¹
        = (X'X)⁻¹X' σ²I X(X'X)⁻¹
        = σ²(X'X)⁻¹X'X(X'X)⁻¹
        = σ²(X'X)⁻¹

Thus, β̂ ~ N(β, σ²(X'X)⁻¹).

Inferences in Regression Analysis

ANOVA for Regression

Total amount of variation of Y:  SST = ∑ᵢ₌₁ⁿ (Yi − Ȳ)²
Decomposition of SST

∑(Yi − Ȳ)² = ∑(Yi − Ŷi + Ŷi − Ȳ)²
           = ∑[(Yi − Ŷi) + (Ŷi − Ȳ)]²
           = ∑[(Yi − Ŷi)² + (Ŷi − Ȳ)² + 2(Yi − Ŷi)(Ŷi − Ȳ)]
           = ∑(Yi − Ŷi)² + ∑(Ŷi − Ȳ)² + 2∑(Yi − Ŷi)(Ŷi − Ȳ)

Consider the cross term 2∑(Yi − Ŷi)(Ŷi − Ȳ):

2∑(Yi − Ŷi)(Ŷi − Ȳ) = 2∑ei(Ŷi − Ȳ)
                     = 2∑(eiŶi − eiȲ)
                     = 2[∑eiŶi − Ȳ∑ei]
                     = 2[0 − 0] = 0

Thus,

∑(Yi − Ȳ)²  =  ∑(Yi − Ŷi)²  +  ∑(Ŷi − Ȳ)²
   SST              SSE              SSR

where:
SST – total corrected sum of squares; total amount of variation in Y
SSE – sum of squares due to error; the "unexplained sum of squares"
SSR – sum of squares due to regression; the "explained sum of squares"

(all sums over i = 1, …, n)
Matrix Notation

SST = Y'CY        SSE = Y'(I − H)Y        SSR = Y'(H − J)Y

where J = (1/n)11' and C = I − J is the centering matrix.
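These quadratic forms can be checked against the summation formulas. A sketch (assuming J = (1/n)11' and C = I − J, consistent with the derivations in this section):

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
J = np.ones((n, n)) / n                  # (1/n) 1 1'
C = np.eye(n) - J                        # centering matrix

SST = Y @ C @ Y
SSE = Y @ (np.eye(n) - H) @ Y
SSR = Y @ (H - J) @ Y

print(np.isclose(SST, ((Y - Y.mean()) ** 2).sum()))   # matches sum formula
print(np.isclose(SST, SSE + SSR))                     # decomposition SST = SSE + SSR
```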
Expected Values

• E(MSE)

E(MSE) = E[SSE/(n−p)] = [1/(n−p)] E(SSE) = [1/(n−p)] E[Y'(I − H)Y]

Note: Since Y'(I − H)Y is a scalar, Y'(I − H)Y = tr[Y'(I − H)Y] = tr[(I − H)YY']. So,

E(MSE) = [1/(n−p)] E{tr[(I − H)YY']} = [1/(n−p)] tr[(I − H)E(YY')]
Now,

Var(Y) = E[(Y − μY)(Y − μY)']
       = E[YY' − YμY' − μYY' + μYμY']
       = E(YY') − E(Y)μY' − μYE(Y') + μYμY'

Previous results: μY = E(Y) = Xβ and Var(Y) = σ²I. So

σ²I = E(YY') − Xββ'X' − Xββ'X' + Xββ'X' = E(YY') − Xββ'X'
E(YY') = σ²I + Xββ'X'
Going back,
E(MSE) = [1/(n−p)] tr[(I − H)E(YY')]
       = [1/(n−p)] tr[(I − H)(σ²I + Xββ'X')]
       = [1/(n−p)] tr[σ²(I − H) + (I − H)Xββ'X']
       = [1/(n−p)] tr[σ²(I − H) + (Xββ'X' − HXββ'X')]
       = [1/(n−p)] tr[σ²(I − H) + (Xββ'X' − X(X'X)⁻¹X'Xββ'X')]
       = [1/(n−p)] tr[σ²(I − H) + (Xββ'X' − Xββ'X')]
       = [1/(n−p)] tr[σ²(I − H)]
       = [1/(n−p)] σ² tr(I − H)
       = [1/(n−p)] σ²(n − p)
       = σ²

Thus, MSE is unbiased for σ².
• E(MSR)

E(MSR) = E[SSR/(p−1)] = [1/(p−1)] E(SSR) = [1/(p−1)] E[Y'(H − J)Y]

Note: Y'(H − J)Y is a scalar, so Y'(H − J)Y = tr[Y'(H − J)Y] = tr[(H − J)YY']. So,

E(MSR) = [1/(p−1)] E{tr[(H − J)YY']} = [1/(p−1)] tr[(H − J)E(YY')]

It was shown previously that E(YY') = σ²I + Xββ'X'. Thus,

E(MSR) = [1/(p−1)] tr[(H − J)(σ²I + Xββ'X')]
       = [1/(p−1)] tr[σ²(H − J) + (H − J)Xββ'X']
       = [1/(p−1)] {σ² tr(H − J) + tr[(H − J)Xββ'X']}

Aside,

tr[(H − J)Xββ'X'] = tr(HXββ'X' − JXββ'X')
                  = tr[X(X'X)⁻¹X'Xββ'X' − JXββ'X']
                  = tr(Xββ'X' − JXββ'X')
                  = tr[(I − J)Xββ'X']
                  = tr(CXββ'X')
                  = tr(Xββ'X'C)
                  = tr(ββ'X'CX)
                  = tr(β'X'CXβ)
                  = (Xβ)'CXβ          (the term inside the trace is a scalar)

Therefore,

E(MSR) = [1/(p−1)] σ² tr(H − J) + [1/(p−1)] (Xβ)'CXβ
       = [1/(p−1)] σ²(p − 1) + [1/(p−1)] (Xβ)'CXβ
       = σ² + [1/(p−1)] (Xβ)'CXβ
ANOVA Table for testing H0: β1 = β2 = … = βk = 0 vs Ha: at least one βj ≠ 0

| Source of Variation | df    | Sum of Squares | Mean Square | F-Stat         |
|---------------------|-------|----------------|-------------|----------------|
| Regression          | p − 1 | SSR            | MSR         | Fc = MSR / MSE |
| Error               | n − p | SSE            | MSE         |                |
| Total               | n − 1 | SST            |             |                |

where n is the number of observations and p is the number of parameters. Alternatively, k is the number of independent variables, so that k = p − 1.
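The table can be filled in directly from the sums of squares. A sketch on simulated data that also checks the algebraic link between the F statistic and R²: Fc = [R²/(p−1)] / [(1−R²)/(n−p)]:

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 50, 3
p = k + 1
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
Y = X @ np.array([1.0, 0.8, -0.3, 0.0]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
J = np.ones((n, n)) / n
SST = Y @ (np.eye(n) - J) @ Y
SSE = Y @ (np.eye(n) - H) @ Y
SSR = SST - SSE

MSR, MSE = SSR / (p - 1), SSE / (n - p)
Fc = MSR / MSE                           # compare with F_alpha(p-1, n-p)

R2 = SSR / SST
print(np.isclose(Fc, (R2 / (p - 1)) / ((1 - R2) / (n - p))))
```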
F-test of Regression Relation

Under H0:  Fc = MSR/MSE ~ F(p − 1, n − p)

Critical Region: Reject H0 if Fc > F_α(p − 1, n − p)
Tests on Individual Regression Coefficients

Hypotheses:
H0: β1 = 0 vs Ha: β1 ≠ 0
H0: β2 = 0 vs Ha: β2 ≠ 0
⋮
H0: βk = 0 vs Ha: βk ≠ 0

Test Statistic:

tj = (β̂j − βj)/s.e.(β̂j) = β̂j/s.e.(β̂j) ~ t(n − p) under H0

Critical Region: Reject H0 if |tj| > t_{α/2, n−p}
Coefficient of Multiple Determination

R² = R²_{Y·1,2,…,k} = SSR(X1, X2, …, Xk)/SST = 1 − SSE(X1, X2, …, Xk)/SST

Interpretation: the percentage of the variation in Y that can be explained by the X's through the model Y = Xβ + ε.
Adjusted Coefficient of Determination

Ra² = 1 − MSE/MST = 1 − [SSE/(n−p)] / [SST/(n−1)] = 1 − [(SSE/SST)((n−1)/(n−p))]

Interpretation: same as R², but taking into account the loss in degrees of freedom.
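A sketch computing both coefficients; since (n−1)/(n−p) ≥ 1, the adjustment can only pull R² down:

```python
import numpy as np

rng = np.random.default_rng(8)
n, k = 40, 3
p = k + 1
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
Y = X @ np.array([2.0, 1.0, 0.0, -0.5]) + rng.normal(size=n)

b_hat = np.linalg.solve(X.T @ X, X.T @ Y)
e = Y - X @ b_hat
SSE = e @ e
SST = ((Y - Y.mean()) ** 2).sum()

R2 = 1 - SSE / SST
R2_adj = 1 - (SSE / (n - p)) / (SST / (n - 1))
print(R2_adj <= R2)                      # adjustment never increases R^2
print(np.isclose(R2_adj, 1 - (SSE / SST) * ((n - 1) / (n - p))))
```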
Coefficient of Alienation (Non-determination)

1 − R²  or  1 − Ra²
Gauss-Markov Theorem

Under the conditions of the multiple linear regression model, the least squares estimator β̂ is the best linear unbiased estimator (BLUE) of β. This means that among all linear unbiased estimators of βj, β̂j has the smallest variance, j = 0, 1, 2, …, k.

Proof:
β̂ is unbiased for β. We have to show that β̂ has the smallest variance among all linear unbiased estimators for β.
Consider any linear estimator of β, say β̃ = CY, where C = (X'X)⁻¹X' + D.

To show unbiasedness:

E(β̃) = E(CY)
      = E{[(X'X)⁻¹X' + D][Xβ + ε]}
      = E[(X'X)⁻¹X'Xβ + (X'X)⁻¹X'ε + DXβ + Dε]
      = β + (X'X)⁻¹X'E(ε) + DXβ + DE(ε)
      = β + DXβ

So DX = 0 is required for β̃ to be unbiased.
Var(β̃) = E[(β̃ − β)(β̃ − β)']

β̃ − β = [(X'X)⁻¹X' + D][Xβ + ε] − β
      = β + (X'X)⁻¹X'ε + DXβ + Dε − β
      = (X'X)⁻¹X'ε + DXβ + Dε

(β̃ − β)' = ε'X(X'X)⁻¹ + β'X'D' + ε'D'

Var(β̃) = E{[(X'X)⁻¹X'ε + DXβ + Dε][ε'X(X'X)⁻¹ + β'X'D' + ε'D']}

Expanding the product and distributing the expectation, the terms linear in ε vanish since E(ε) = 0, leaving

Var(β̃) = (X'X)⁻¹X'E(εε')X(X'X)⁻¹ + (X'X)⁻¹X'E(εε')D' + DXββ'X'D'
        + DE(εε')X(X'X)⁻¹ + DE(εε')D'
Note: Var(ε) = σ²I = E[(ε − 0)(ε − 0)'] = E(εε'). So,

Var(β̃) = σ²(X'X)⁻¹X'X(X'X)⁻¹ + σ²(X'X)⁻¹X'D' + DXββ'X'D' + σ²DX(X'X)⁻¹ + σ²DD'

Recall the condition DX = 0, hence X'D' = (DX)' = 0'. Thus we are left with

Var(β̃) = σ²(X'X)⁻¹ + σ²DD'

The first term of this sum is the variance-covariance matrix of β̂, so we have to show that the variance-covariance matrix of β̃ is at least as large, i.e., that DD' is positive semidefinite (psd).

NTS: DD' is psd, i.e., x'DD'x ≥ 0 ∀x.

Let y = D'x. Then x'DD'x = y'y ≥ 0, since y'y is a sum of squares (the inner product of a vector with itself). Thus DD' is psd, and

Var(β̃) = σ²(X'X)⁻¹ + σ²DD' = Var(β̂) + σ²DD' ≥ Var(β̂)

Therefore Var(β̂) is the smallest among all linear unbiased estimators of β: β̂ is the BLUE for β.
Furthermore, for any linear combination λ'β of the elements of β, the BLUE is λ'β̂.
Theorem. With the assumption of normality of the error terms, the maximum likelihood estimators (MLE) for β and σ² are β̂ = (X'X)⁻¹X'Y and σ̂² = (1/n)e'e, respectively.
Proof:

Recall: Y = Xβ + ε, ε ~ N(0, σ²I), so Y ~ N(Xβ, σ²I) and Yi ~ N(xi'β, σ²).

Likelihood functions:

L(ε1, ε2, …, εn) = ∏ᵢ₌₁ⁿ f(εi) = (2πσ²)^(−n/2) exp{−(1/2σ²) ∑εi²}

L(Y1, Y2, …, Yn | X) = (2πσ²)^(−n/2) exp{−(1/2σ²) ∑(Yi − xi'β)²}

Given the data, the likelihood function may be regarded as a function of the p + 1 parameters:

L(Y1, Y2, …, Yn | X) = (2πσ²)^(−n/2) exp{−(1/2σ²)(Y − Xβ)'(Y − Xβ)}

Taking the log-likelihood:

ln L = −(n/2) ln(2πσ²) − (1/2σ²)(Y − Xβ)'(Y − Xβ)

∂lnL/∂β = −(1/2σ²)(−2)X'(Y − Xβ) = (1/σ²)X'(Y − Xβ)        (1)

∂lnL/∂σ² = −(n/2)[2π/(2πσ²)] + (1/2σ⁴)(Y − Xβ)'(Y − Xβ)
         = −n/(2σ²) + (1/2σ⁴)(Y − Xβ)'(Y − Xβ)              (2)
Equating (1) to 0:

(1/σ²)X'(Y − Xβ) = 0  →  X'(Y − Xβ) = 0  →  X'Y − X'Xβ = 0  →  X'Xβ = X'Y
→  β̂ = (X'X)⁻¹X'Y

Equating (2) to 0 (since it is a scalar) and substituting β̂ for β:

−n/(2σ²) + (1/2σ⁴)(Y − Xβ̂)'(Y − Xβ̂) = 0
→  (1/2σ⁴)e'e = n/(2σ²)
→  nσ² = e'e
→  σ̂² = (1/n)e'e
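The MLE divides e'e by n, while the unbiased MSE divides by n − p. A sketch checking the exact relationship n·σ̂²(MLE) = (n − p)·MSE:

```python
import numpy as np

rng = np.random.default_rng(9)
n, k = 45, 2
p = k + 1
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
Y = X @ np.array([1.0, -1.0, 2.0]) + rng.normal(size=n)

b_hat = np.linalg.solve(X.T @ X, X.T @ Y)
e = Y - X @ b_hat

sigma2_mle = (e @ e) / n                 # MLE: divides by n (biased downward)
mse = (e @ e) / (n - p)                  # unbiased estimator of sigma^2

print(np.isclose(n * sigma2_mle, (n - p) * mse))
print(sigma2_mle < mse)                  # the MLE understates sigma^2
```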
Theorem. (β̂, σ̂²) is the UMVUE for (β, σ²).

Proof:

We have established that β̂ and σ̂² = MSE are unbiased estimators for β and σ², respectively. Thus, by the Lehmann-Scheffé theorem, we now have to determine a joint complete sufficient statistic (CSS) for (β, σ²) and check that the unbiased estimators are functions of the joint CSS.
Recall: Yi ~ N(xi'β, σ²)

fYi(yi) = [1/(σ√2π)] exp{−(1/2σ²)(yi − xi'β)²} I(−∞,∞)(yi)
        = [1/(σ√2π)] exp{−(1/2σ²)[yi² − 2yi xi'β + (xi'β)²]} I(−∞,∞)(yi)
        = [1/(σ√2π)] exp{−(xi'β)²/2σ²} exp{−(1/2σ²)[yi² − 2yi xi'β]} I(−∞,∞)(yi)

Consider exp{−(1/2σ²)[yi² − 2yi xi'β]}:

exp{−(1/2σ²)[yi² − 2yi xi'β]} = exp{−(1/2σ²)[yi² − 2yi(β0 + β1Xi1 + β2Xi2 + … + βkXik)]}
                              = exp{−(1/2σ²)yi² + (β0/σ²)yi + (β1/σ²)yiXi1 + … + (βk/σ²)yiXik}
Let θ = (β, σ²) and write

a(θ) = [1/(σ√2π)] exp{−(xi'β)²/2σ²},   b(yi) = I(−∞,∞)(yi)

c1(θ) = −1/2σ²,  c2(θ) = β0/σ²,  c3(θ) = β1/σ²,  …,  ck+2(θ) = βk/σ²

d1(yi) = yi²,  d2(yi) = yi,  d3(yi) = yiXi1,  …,  dk+2(yi) = yiXik
Thus fYi(yi) is a member of the (k+2)-parameter exponential family of distributions. A joint CSS is given by

S = {∑d1(yi), ∑d2(yi), ∑d3(yi), …, ∑dk+2(yi)}
  = {∑Yi², ∑Yi, ∑YiXi1, …, ∑YiXik}        (sums over i = 1, …, n)

i. Now β̂ = (X'X)⁻¹X'Y. It was previously shown that

X'Y = [∑Yi, ∑Xi1Yi, …, ∑XikYi]'

Thus β̂ is a function of the joint CSS since (X'X)⁻¹ is a matrix of constants, so E(β̂ | S) = β̂. By the Lehmann-Scheffé theorem, β̂ is the UMVUE for β.
ii. SSE = Y'(I − H)Y = Y'Y − Y'HY = Y'Y − Y'X(X'X)⁻¹X'Y

MSE = SSE/(n − p) = [1/(n−p)][Y'Y − Y'X(X'X)⁻¹X'Y]

Y'Y = ∑Yi², Y'X = (X'Y)', and 1/(n−p) is a constant. Thus MSE is a function of the CSS and is the UMVUE for σ².
Confidence Intervals

From the individual t-tests with H0: βj = 0 vs Ha: βj ≠ 0:

tj = (β̂j − βj)/s.e.(β̂j) ~ t(n − p)  →  P(|tj| ≤ t_{α/2, n−p}) = 1 − α

−t_{α/2, n−p} ≤ (β̂j − βj)/s.e.(β̂j) ≤ t_{α/2, n−p}
→ β̂j − t_{α/2, n−p} s.e.(β̂j) ≤ βj ≤ β̂j + t_{α/2, n−p} s.e.(β̂j)
→ [β̂j ∓ t_{α/2, n−p} s.e.(β̂j)]

• Confidence interval estimates of the parameters give a better picture than point estimates (e.g., BLUE and UMVUE).
• The width of these confidence intervals is a measure of the overall quality of the regression line.
• These confidence intervals have the usual frequency interpretation.
Prediction Interval

• Inferences about the Mean Response E(Yh)

Given a particular set of independent variables xh', we estimate the mean of Y:

Yh = xh'β + εh
E(Yh) = xh'β, which we estimate using Ŷh = xh'β̂

Distribution of Ŷh:

E(Ŷh) = E(xh'β̂) = xh'E(β̂) = xh'β. Thus Ŷh is unbiased for E(Yh).

Var(Ŷh) = Var(xh'β̂) = xh'Var(β̂)xh = σ² xh'(X'X)⁻¹xh

Thus Ŷh ~ N(xh'β, σ² xh'(X'X)⁻¹xh).

Note: Ŷh is the BLUE for E(Yh) (by the Gauss-Markov theorem).
A (1−α)100% CI for E(Yh) is:

[Ŷh ∓ z_{α/2} √(σ² xh'(X'X)⁻¹xh)]         when σ² is known
[Ŷh ∓ t_{α/2, n−p} √(MSE xh'(X'X)⁻¹xh)]   when σ² is unknown (estimated by MSE)

In testing H0: E(Yh) = d vs Ha: E(Yh) ≠ d,

Test statistic: tc = (Ŷh − d)/s.e.(Ŷh) ~ t(n − p) under H0
Critical Region: Reject H0 if |tc| ≥ t_{α/2, n−p}
• Prediction of a New Observation Yh Given xh [Yh(new)]
We are not estimating a parameter but predicting a particular value of the random variable Y i given a particular x i based on the distribution of Y.
The new observation Y h is viewed as a result of a new trial, independent of the trials on which the regression analysis is based. We also assume that the underlying regression model applicable for the basic sample data continues to be appropriate for the new observation.
i. Case 1: Parameters are known

Y = Xβ + ε,  ε ~ N(0, σ²I),  Yi ~ N(xi'β, σ²)

(Yi − xi'β)/σ ~ N(0, 1)  →  P(−z_{α/2} ≤ (Yi − xi'β)/σ ≤ z_{α/2}) = 1 − α

A (1−α)100% PI for Yi is given by:

(xi'β ∓ z_{α/2} σ)
ii. Case 2: Parameters are unknown

Assumptions: (1) the parameters have already been estimated; (2) the given xh is independent of the previous sample.

β is estimated by β̂ and σ² is estimated by MSE:

Ŷh = xh'β̂ + εh,  εh ~ N(0, MSE)

E(Ŷh) = xh'β̂
Var(Ŷh(new)) = MSE xh'(X'X)⁻¹xh + MSE = MSE[xh'(X'X)⁻¹xh + 1]
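The prediction variance differs from the mean-response variance only by the extra MSE term. A sketch (xh here is a hypothetical new design point, chosen for illustration) comparing the standard error of the estimated mean response with the always-larger standard error of prediction:

```python
import numpy as np

rng = np.random.default_rng(10)
n, k = 50, 2
p = k + 1
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
Y = X @ np.array([1.0, 0.5, -1.5]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b_hat = XtX_inv @ X.T @ Y
e = Y - X @ b_hat
MSE = (e @ e) / (n - p)

xh = np.array([1.0, 0.3, -0.7])          # hypothetical new point (leading 1 for the intercept)
se_mean = np.sqrt(MSE * xh @ XtX_inv @ xh)           # s.e. for estimating E(Yh)
se_pred = np.sqrt(MSE * (xh @ XtX_inv @ xh + 1.0))   # s.e. for predicting Yh(new)
print(se_pred > se_mean)
```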
A (1−α)100% PI for Yh is given by:

(xh'β̂ ∓ t_{α/2, n−p} √(MSE[xh'(X'X)⁻¹xh + 1]))

• Prediction of the Mean of m New Observations Ȳh(new) for given xh

(xh'β̂ ∓ t_{α/2, n−p} √(MSE[1/m + xh'(X'X)⁻¹xh]))

Inverse Regression Problem
• Also known as the calibration problem.
• Given the regression model of Y, we predict the value of the independent variable, X, that gave rise to a new observation Y.
• For a simple linear regression model Yi = β0 + β1Xi + εi with estimated regression function Ŷi = β̂0 + β̂1Xi, if a new observation Yh(new) becomes available, a natural point estimator for the level Xh(new) is given by

X̂h(new) = (Yh(new) − β̂0)/β̂1

• For a linear regression model with two independent variables, Y = β0 + β1X1 + β2X2 + ε, with estimated regression function Ŷ = β̂0 + β̂1X1 + β̂2X2, if a new observation Y0 becomes available and for a given value X2 = X20, X1 can be predicted by

X̂10 = (Y0 − β̂0 − β̂2X20)/β̂1
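A sketch of the simple-linear-regression case (the new observation y_new is hypothetical, chosen for illustration); inverting the fitted line means that plugging the calibration estimate back in recovers y_new exactly:

```python
import numpy as np

rng = np.random.default_rng(11)
n = 30
x = rng.uniform(0, 10, size=n)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x])
b0, b1 = np.linalg.solve(X.T @ X, X.T @ y)   # fitted intercept and slope

y_new = 17.0                              # hypothetical new observation on Y
x_new = (y_new - b0) / b1                 # calibration estimate of the X that produced it

print(np.isclose(b0 + b1 * x_new, y_new))
```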
Hypothesis Testing

General Linear Test (GLT)

Steps:
1. Fit the full model and obtain the error sum of squares, SSE(F).
2. Fit the reduced model under the null hypothesis and obtain the error sum of squares, SSE(R).
3. The test statistic is

   F* = {[SSE(R) − SSE(F)] / (dfR − dfF)} / [SSE(F) / dfF]

   and the critical region is to reject the null hypothesis if F* > F_α(dfR − dfF, dfF).
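A sketch of the three steps for a hypothetical H0: β2 = β3 = 0 (drop the last two regressors); note that SSE(R) − SSE(F) is exactly the extra sum of squares of the next subsection:

```python
import numpy as np

def sse(X, Y):
    """Error sum of squares of the least squares fit of Y on X."""
    b = np.linalg.solve(X.T @ X, X.T @ Y)
    e = Y - X @ b
    return e @ e

rng = np.random.default_rng(12)
n = 60
X_full = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])  # intercept + X1, X2, X3
Y = X_full @ np.array([1.0, 2.0, 0.0, 0.0]) + rng.normal(size=n)

X_red = X_full[:, :2]                    # reduced model under H0: beta2 = beta3 = 0
sse_f, sse_r = sse(X_full, Y), sse(X_red, Y)
df_f, df_r = n - 4, n - 2

F_star = ((sse_r - sse_f) / (df_r - df_f)) / (sse_f / df_f)
print(sse_r >= sse_f)                    # dropping regressors cannot decrease SSE
print(F_star >= 0)
```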
Extra Sums of Squares

Reflect the reduction in the error sum of squares (or the increase in the model sum of squares) from adding an independent variable to the model, given that another independent variable (or variables) is already in the model.

SSR(variable(s) added | variable(s) included) = SSR(full model) − SSR(variable(s) included)
                                              = SSE(variable(s) included) − SSE(full model)
Lack-of-Fit Test

Hypotheses:
H0: E(Y) = β0 + β1Xi1 + β2Xi2 + … + βkXik
Ha: E(Y) ≠ β0 + β1Xi1 + β2Xi2 + … + βkXik

ANOVA Table for Testing Lack-of-Fit for the MLRM

| Source of Variation | df    | Sum of Squares | Mean Square | F-Stat           |
|---------------------|-------|----------------|-------------|------------------|
| Regression          | p − 1 | SSR            | MSR         | F2 = MSR / MSE   |
| Error               | n − p | SSE            | MSE         |                  |
|   Lack-of-Fit       | c − p | SSLF           | MSLF        | F1 = MSLF / MSPE |
|   Pure Error        | n − c | SSPE           | MSPE        |                  |
| Total               | n − 1 | SST            |             |                  |
Remarks:
• SSE = SSLF + SSPE
• SSPE = ∑ₘ₌₁ᶜ ∑ᵢ₌₁^{nm} (Yim − Ȳm)², where c is the number of distinct X levels
• Critical Region: Reject H0 if F1 > F_α(c − p, n − p)
• Significance indicates that the model appears to be inadequate.
• The F-test for significance of the overall regression is valid only if no lack-of-fit is exhibited by the model.
Partial F-Tests and Sequential F-Tests

• Partial F-Test – measures the value of adding another regressor given that the other k − 1 independent variables are already in the model.
• Sequential F-Test – a special type of partial F-test where the regressor variables enter one at a time.
• Structure
  • Hypotheses: H0: βj = 0 vs Ha: βj ≠ 0
  • Test Statistic:

    F* = [SSR(Xj | X1, X2, …, Xj−1, Xj+1, …, Xk) / 1] / [SSE / (n − p)]

  • Critical Region: Reject H0 if F* > F_α(1, n − p)
Coefficient of Partial Determination

r²_{Yj·1,2,…,j−1,j+1,…,k} = marginal contribution of Xj to the reduction of the variation in Y when the independent variables other than Xj are already in the model
                          = percentage decrease in SSE when Xj is added to a model that already contains the independent variables other than Xj

The coefficient of partial correlation is the square root of the coefficient of partial determination. It follows the sign of the associated regression coefficient. It does not have as clear a meaning as the coefficient of partial determination.
Variable Selection

• All Possible Regressions (APR)
  Takes into account all 2^k − 1 possible regression models. These criteria do not guarantee that the independent variables in the selected model(s) are significant.

  i. R²p Criterion
     The intent is to find the point where adding more independent variables is not worthwhile because it leads to a very small increase in R²p.

  ii. MSEp or Ra² Criterion
     Ra² takes the number of parameters in the model into account through the degrees of freedom. Note that Ra² increases if and only if the mean square error decreases; hence Ra² and MSE are equivalent criteria.

  iii. Mallows' Cp Criterion
     Concerned with the total mean squared error of the n fitted values for each of the various subset regression models. In using this criterion, one seeks to identify subsets of the independent variables for which the Cp value is small and near p (the number of parameters in the subset).
• Automatic Search Procedures (ASP)
  i. Forward Selection
  ii. Backward Selection
  iii. Stepwise Selection
Standardized Coefficients

Given Yi = β0 + β1Xi1 + β2Xi2 + … + βkXik + εi = xi'β + εi,

(Yi − μY)/σY = [(xi'β + εi) − E(xi'β + εi)]/σY
             = [xi'β − E(xi'β)]/σY + εi/σY
             = [(β0 + β1Xi1 + β2Xi2 + … + βkXik) − (β0μ0 + β1μ1 + … + βkμk)]/σY + εi/σY
             = (β0 − β0μ0)/σY + β1(Xi1 − μ1)/σY + β2(Xi2 − μ2)/σY + … + βk(Xik − μk)/σY + εi/σY

where μj = E(Xij), and μ0, the mean of the constant regressor, is

μ0 = (∑ᵢ₌₁ᴺ 1)/N = N/N = 1

so the intercept term vanishes:

(Yi − μY)/σY = 0 + β1(Xi1 − μ1)/σY + β2(Xi2 − μ2)/σY + … + βk(Xik − μk)/σY + εi/σY
             = (β1σ1/σY)[(Xi1 − μ1)/σ1] + (β2σ2/σY)[(Xi2 − μ2)/σ2] + … + (βkσk/σY)[(Xik − μk)/σk] + εi/σY
Given a random sample of size n:

(Yi − Ȳ)/sY = (β̂1s1/sY)[(Xi1 − X̄1)/s1] + (β̂2s2/sY)[(Xi2 − X̄2)/s2] + … + (β̂ksk/sY)[(Xik − X̄k)/sk] + εi*

The new regression model:

Y* = X*β* + ε*,   where E(ε*) = 0 and Var(ε*) = 1
Estimating β*:

β̂* = (X*'X*)⁻¹X*'Y*
Y* = [(Y1 − Ȳ)/sY, (Y2 − Ȳ)/sY, …, (Yn − Ȳ)/sY]'

X* = [ (X11 − X̄1)/s1   (X12 − X̄2)/s2   ⋯   (X1k − X̄k)/sk
       (X21 − X̄1)/s1   (X22 − X̄2)/s2   ⋯   (X2k − X̄k)/sk
            ⋮                ⋮          ⋱        ⋮
       (Xn1 − X̄1)/s1   (Xn2 − X̄2)/s2   ⋯   (Xnk − X̄k)/sk ]

The (j, l) entry of X*'X* is ∑ᵢ₌₁ⁿ [(Xij − X̄j)/sj][(Xil − X̄l)/sl]; this matrix is symmetric, and each diagonal entry equals n − 1. Hence

X*'X* = [ n−1        (n−1)r12   ⋯   (n−1)r1k
          (n−1)r12   n−1        ⋯   (n−1)r2k
           ⋮           ⋮        ⋱     ⋮
          (n−1)r1k   (n−1)r2k   ⋯   n−1      ]  = (n − 1)Rxx

Similarly,

X*'Y* = [(n−1)r1Y, (n−1)r2Y, …, (n−1)rkY]' = (n − 1)Rxy

where Rxx is the correlation matrix of the independent variables, and Rxy is the vector of correlations of the dependent variable with each independent variable.
Thus

β̂* = (X*'X*)⁻¹X*'Y* = [(n−1)Rxx]⁻¹[(n−1)Rxy] = Rxx⁻¹Rxy

Interpretation: a one standard deviation change in Xj results in a β̂j* standard deviation change in Y:

β̂j* = β̂j(sj/sY)  →  β̂j* sY = β̂j sj
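A sketch verifying both β̂* = Rxx⁻¹Rxy and the rescaling identity β̂j* = β̂j(sj/sY) on simulated data (ddof=1 throughout, matching the n − 1 factor above):

```python
import numpy as np

rng = np.random.default_rng(13)
n, k = 80, 2
Xr = rng.normal(size=(n, k))
Y = 1.0 + Xr @ np.array([2.0, -1.5]) + rng.normal(scale=0.5, size=n)

# Ordinary least squares fit (with intercept)
X = np.column_stack([np.ones(n), Xr])
b_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Standardized coefficients from correlation matrices
Rxx = np.corrcoef(Xr, rowvar=False)                      # k x k correlations among the X's
Rxy = np.array([np.corrcoef(Xr[:, j], Y)[0, 1] for j in range(k)])
b_star = np.linalg.solve(Rxx, Rxy)

s = Xr.std(axis=0, ddof=1)               # sample s.d. of each X
sY = Y.std(ddof=1)                       # sample s.d. of Y
print(np.allclose(b_star, b_hat[1:] * s / sY))           # beta_j* = beta_j s_j / s_Y
```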
Elasticity: the effect of a percentage change in an independent variable on the dependent variable. A large elasticity indicates that the dependent variable is very sensitive to changes in the independent variable.

Ej = β̂j (x̄j / ȳ)