Econometrics - Slides
2011/2012
João Nicolau
1 Introduction
1.1 What is Econometrics?
Econometrics is a discipline that "aims to give empirical content to economic relations". It has been defined generally as "the application of mathematics and statistical methods to economic data". Applications of econometrics:
• forecasting (e.g. interest rates, inflation rates, and gross domestic product);
• studying economic relations;
• testing economic theories;
• evaluating and implementing government and business policy. For example, what are the effects of political campaign expenditures on voting outcomes? What is the effect of school spending on student performance in the field of education?
1.2 Steps in Empirical Economic Analysis
• Formulate the question of interest. The question might deal with testing a certain aspect of an economic theory, or it might pertain to testing the effects of a government policy.
• Build the economic model. An economic model consists of mathematical equations that describe various relationships. Formal economic modeling is sometimes the starting point for empirical analysis, but it is more common to use economic theory less formally, or even to rely entirely on intuition.
• Specify the econometric model.
• Collect the data.
• Estimate and test the econometric model.
• Answer the question in step 1.
1.3 The Structure of Economic Data
1.3.1 Cross-Sectional Data
A cross-sectional data set: a sample of individuals, households, firms, cities, states, countries, etc., taken at a given point in time. An important feature of cross-sectional data: it is obtained by random sampling from the underlying population. For example, suppose that yi is the i-th observation of the dependent variable and xi is the i-th observation of the explanatory variable. Random sampling means that

{(yi, xi)} is an i.i.d. sequence.

This implies that for i ≠ j

Cov(yi, yj) = 0,  Cov(xi, xj) = 0,  Cov(yi, xj) = 0.

Obviously, if xi "explains" yi we will have Cov(yi, xi) ≠ 0.
Cross-sectional data is closely aligned with the applied microeconomics fields, such as labor economics, state and local public finance, industrial organization, urban economics, demography, and health economics.
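The covariance implications of random sampling can be checked with a small simulation. This is my own illustration, not from the slides; the data-generating process (coefficients, sample size) is invented for the example.

```python
import numpy as np

# Illustrative sketch: draw an i.i.d. cross-section {(y_i, x_i)} and
# check the covariance implications of random sampling.
rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(size=n)   # x "explains" y

# Cov(y_i, x_i) should be far from zero (close to 1.5 here) ...
print(np.cov(y, x)[0, 1])

# ... while observations i != j are independent, so e.g. the sample
# covariance between (y_1,...,y_{n-1}) and (y_2,...,y_n) is near zero.
print(np.cov(y[:-1], y[1:])[0, 1])
```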
An example of cross-sectional data: [data table not reproduced in this transcript]
Scatterplots may be adequate for analyzing cross-section data:
Models based on cross-sectional data usually satisfy the assumptions covered in the chapter "Finite-Sample Properties of OLS".
1.3.2 Time-Series Data
A time series data set consists of observations on a variable or several variables over time. Examples: stock prices, money supply, consumer price index, gross domestic product, annual homicide rates, automobile sales figures, etc.

Time series data cannot be assumed to be independent across time. For example, knowing something about the gross domestic product from last quarter tells us quite a bit about the likely range of the GDP during this quarter.

The analysis of time series data is more difficult than that of cross-sectional data. Reasons:

• we need to account for the dependent nature of economic time series;
• time-series data exhibit unique features such as trends over time and seasonality;
• models based on time-series data rarely satisfy the assumptions covered in the chapter "Finite-Sample Properties of OLS". The most adequate assumptions are covered in the chapter "Large-Sample Theory", which is theoretically more advanced.
An example of a time series (scatterplots cannot in general be used here, but there are exceptions): [figure not reproduced in this transcript]
1.3.3 Pooled Cross Sections and Panel or Longitudinal Data
Data sets have both cross-sectional and time series features.
1.3.4 Causality And The Notion Of Ceteris Paribus In Econometric Analysis
Ceteris paribus: "other (relevant) factors being equal". It plays an important role in causal analysis.
Example. Suppose that wages depend on education and labor force experience. Your goal is to measure the "return to education". If your analysis involves only wages and education you may not uncover the ceteris paribus effect of education on wages. Consider the following data:

  monthly wage (Euros)   years of experience   years of education
  1500                   6                     9
  1500                   0                     15
  1600                   1                     15
  2000                   8                     12
  2500                   10                    12
Example. In a totalitarian regime, how could you measure the ceteris paribus effect of another year of education on wages? You might create 100 clones of a "normal" individual, give each person a different amount of education, and then measure their wages.
Ceteris Paribus is relatively easy to analyze in Experimental Data.
Example (Experimental Data). Consider the effects of new fertilizers on crop yields. Suppose the crop under consideration is soybeans. Since fertilizer amount is only one factor affecting yields (some others include rainfall, quality of land, and presence of parasites), this issue must be posed as a ceteris paribus question. One way to determine the causal effect of fertilizer amount on soybean yield is to conduct an experiment, which might include the following steps: choose several one-acre plots of land; apply different amounts of fertilizer to each plot and subsequently measure the yields.
In economics we have nonexperimental data, so in principle it is difficult to estimate ceteris paribus effects. However, we will see that econometric methods can simulate a ceteris paribus experiment. We will be able to do in nonexperimental environments what natural scientists do in a controlled laboratory setting: keep other factors fixed.
2 Finite-Sample Properties of OLS
This chapter covers the finite- or small-sample properties of the OLS estimator, that is, the statistical properties of the OLS estimator that are valid for any given sample size.
2.1 The Classical Linear Regression Model
The dependent variable is related to several other variables (called the regressors or the explanatory variables).

Let yi be the i-th observation of the dependent variable.

Let (xi1, xi2, ..., xiK) be the i-th observation of the K regressors. The sample or data is the collection of those n observations.

The data in economics cannot be generated by experiments (except in experimental economics), so both the dependent and independent variables have to be treated as random variables, i.e. variables whose values are subject to chance.
2.1.1 The Linearity Assumption
Assumption (1.1 - Linearity). We have

yi = β1xi1 + β2xi2 + ... + βKxiK + εi,   i = 1, 2, ..., n,

where the β's are unknown parameters to be estimated, and εi is the unobserved error term.

β's: regression coefficients. They represent the marginal and separate effects of the regressors.
Example (1.1). (Consumption function): Consider
coni = β1 + β2ydi + εi.
coni: consumption; ydi: disposable income. Note: xi1 = 1, xi2 = ydi. The error εi represents other variables besides disposable income that influence consumption. They include those variables (such as financial assets) that might be observable but the researcher decided not to include as regressors, as well as those variables (such as the "mood" of the consumer) that are hard to measure. The equation is called the simple regression model.
The linearity assumption is not as restrictive as it might first seem.
Example (1.2). (Wage equation). Consider
wagei = e^(β1) · e^(β2 educi) · e^(β3 tenurei) · e^(β4 expri) · e^(εi)

where wage = the wage rate for the individual, educ = education in years, tenure = years on the current job, and expr = experience in the labor market. Taking logs, this equation can be written as

log(wagei) = β1 + β2educi + β3tenurei + β4expri + εi.

The equation is said to be in the semi-log form (or log-level form).
Example. Does this model
yi = β1 + β2xi2 + β3 log(xi2) + β4xi3² + εi
violate Assumption 1.1?
There are, of course, cases of genuine nonlinearity. For example
yi = β1 + e^(β2 xi2) + εi.
Partial Effects
To simplify, let us consider K = 2 and assume that E(εi | xi1, xi2) = 0.

What is the impact on the conditional expected value E(yi | xi1, xi2) when xi2 is increased by a small amount,

x'i = (xi1, xi2) → x*'i = (xi1, xi2 + Δxi2)   (holding the other variable fixed)?

Let

ΔE(yi | xi) ≡ E(yi | x*i1 = xi1, x*i2 = xi2 + Δxi2) − E(yi | xi1, xi2).

(level-level)  yi = β1 + β2xi2 + εi              ΔE(yi|xi) = β2 Δxi2
(level-log)    yi = β1 + β2 log(xi2) + εi        ΔE(yi|xi) ≈ (β2/100) (Δxi2/xi2 × 100)
(log-level)    log(yi) = β1 + β2xi2 + εi         ΔE(yi|xi)/E(yi|xi) × 100 ≈ (100β2) Δxi2   (100β2: semi-elasticity)
(log-log)      log(yi) = β1 + β2 log(xi2) + εi   ΔE(yi|xi)/E(yi|xi) × 100 ≈ β2 (Δxi2/xi2 × 100)   (β2: elasticity)
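The log-level interpretation can be checked numerically. This is my own sketch with invented values of β1, β2 and x; it verifies that a one-unit increase in x changes y by approximately 100·β2 percent when β2 is small.

```python
import numpy as np

# Sketch: interpreting beta2 in a log-level model log(y) = b1 + b2*x.
# A one-unit increase in x multiplies y by exp(b2), i.e. changes y by
# about 100*b2 percent when b2 is small.
b1, b2 = 1.0, 0.05
x = 10.0
y0 = np.exp(b1 + b2 * x)
y1 = np.exp(b1 + b2 * (x + 1))

exact_pct = (y1 / y0 - 1) * 100   # exact: 100*(e^b2 - 1)
approx_pct = 100 * b2             # approximation: 5 percent
print(exact_pct, approx_pct)
```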
Exercise 2.1. Suppose, for example, the marginal effect of experience on wages declines with the level of experience. How can this be captured?

Exercise 2.2. Provide an interpretation of β2 in the following equations:

(a) coni = β1 + β2inci + εi, where inc: income, con: consumption (both measured in dollars). Assume that β2 = 0.8;

(b) log(wagei) = β1 + β2educi + β3tenurei + β4expri + εi. Assume that β2 = 0.05;

(c) log(pricei) = β1 + β2 log(disti) + εi, where price = housing price and dist = distance from a recently built garbage incinerator. Assume that β2 = 0.6.
2.1.2 Matrix Notation
We have

yi = β1xi1 + β2xi2 + ... + βKxiK + εi = [xi1 xi2 · · · xiK] (β1, β2, ..., βK)' + εi = x'iβ + εi

where

xi = (xi1, xi2, ..., xiK)',   β = (β1, β2, ..., βK)',

so that

yi = x'iβ + εi.
More compactly,

y = (y1, y2, ..., yn)',   ε = (ε1, ε2, ..., εn)',

X =
  x11 x12 · · · x1K
  x21 x22 · · · x2K
  ⋮    ⋮         ⋮
  xn1 xn2 · · · xnK

and the model becomes

y = Xβ + ε.
Example. yi = β1 + β2educi + β3expi + εi (yi: wages in Euros). An example of cross-sectional data is

y = (2000, 2500, 1500, ..., 5000, 1000)',

X =
  1 12  5
  1 15  6
  1 12  3
  ⋮  ⋮   ⋮
  1 17 15
  1 12  1
Important: y and X (or yi and xik) may denote random variables or observed values. We use the same notation for both cases.
2.1.3 The Strict Exogeneity Assumption
Assumption (1.2 - Strict exogeneity). E(εi | X) = 0, ∀i.

This assumption can be written as

E(εi | x1, ..., xn) = 0, ∀i.

With random sampling, εi is automatically independent of the explanatory variables for observations other than i. This implies that

E(εi | xj) = 0, ∀i, j, i ≠ j.

It remains to be analyzed whether or not

E(εi | xi) = 0.
The strict exogeneity assumption can fail in situations such as:

• (cross-section or time series) omitted variables;
• (cross-section or time series) measurement error in some of the regressors;
• (time series, static models) there is feedback from yi on future values of xi;
• (time series, dynamic models) there is a lagged dependent variable as a regressor;
• (cross-section or time series) simultaneity.

Example (Omitted variables). Suppose that the wage is determined by

wagei = β1 + β2xi2 + β3xi3 + vi,

where x2: years of education, x3: ability. Assume that E(vi | X) = 0. Since ability is not observed, we instead estimate the model wagei = β1 + β2xi2 + εi, with εi = β3xi3 + vi. If Cov(xi2, xi3) ≠ 0, then

Cov(εi, xi2) = Cov(β3xi3 + vi, xi2) = β3 Cov(xi3, xi2) ≠ 0 ⇒ E(εi | X) ≠ 0.
Example (Measurement error in some of the regressors). Consider y = household savings and w = disposable income, and

yi = β1 + β2wi + vi,   E(vi | w) = 0.

Suppose that w cannot be measured perfectly accurately (for example, because of misreporting) and denote the measured value for wi by xi2. We have

xi2 = wi + ui.

Assume: E(ui) = 0, Cov(wi, ui) = 0, Cov(vi, ui) = 0. Now substituting wi = xi2 − ui into yi = β1 + β2wi + vi we obtain

yi = β1 + β2xi2 + εi,   εi = vi − β2ui.

Hence,

Cov(εi, xi2) = ... = −β2 Var(ui) ≠ 0,

and Cov(εi, xi2) ≠ 0 ⇒ E(εi | X) ≠ 0.
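The covariance Cov(εi, xi2) = −β2 Var(ui) can be verified by simulation. This is my own sketch; all distributions and parameter values (β2 = 0.5, Var(u) = 1) are invented for illustration.

```python
import numpy as np

# Sketch: classical measurement error in the regressor.
# True model: y = b1 + b2*w + v, but we observe x2 = w + u.
rng = np.random.default_rng(1)
n = 200_000
b1, b2 = 1.0, 0.5
w = rng.normal(0, 2, size=n)    # true disposable income
u = rng.normal(0, 1, size=n)    # measurement error, Var(u) = 1
v = rng.normal(0, 1, size=n)
x2 = w + u
eps = v - b2 * u                # error term of the estimated model

# Cov(eps, x2) should be close to -b2 * Var(u) = -0.5
print(np.cov(eps, x2)[0, 1])
```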
Example (Feedback from y on future values of x). Consider a simple static time-series model to explain a city's murder rate (yt) in terms of police officers per capita (xt):

yt = β1 + β2xt + εt.

Suppose that the city adjusts the size of its police force based on past values of the murder rate. This means that, say, xt+1 might be correlated with εt (since a higher εt leads to a higher yt).
Example (There is a lagged dependent variable as a regressor). See Section 2.1.5.
Exercise 2.3. Let kids denote the number of children ever born to a woman, and let educ denote years of education for the woman. A simple model relating fertility to years of education is

kidsi = β1 + β2educi + εi,

where εi is the unobserved error. (i) What kinds of factors are contained in εi? Are these likely to be correlated with the level of education? (ii) Will a simple regression analysis uncover the ceteris paribus effect of education on fertility? Explain.
2.1.4 Implications of Strict Exogeneity
The assumption E(εi | X) = 0, ∀i, implies:

• E(εi) = 0, ∀i;
• E(εi | xj) = 0, ∀i, j;
• E(xjk εi) = 0, ∀i, j, k, or E(xj εi) = 0, ∀i, j (the regressors are orthogonal to the error term for all observations);
• Cov(xjk, εi) = 0.

Note: if E(εi | xj) ≠ 0 or E(xjk εi) ≠ 0 or Cov(xjk, εi) ≠ 0, then E(εi | X) ≠ 0.
2.1.5 Strict Exogeneity in Time-Series Models
For time-series models, strict exogeneity can be rephrased as: the regressors are orthogonal to the past, current, and future error terms. However, for most time-series models, strict exogeneity is not satisfied.

Example. Consider

yi = β yi−1 + εi,   E(εi | yi−1) = 0 (thus E(yi−1 εi) = 0).

Let xi = yi−1. By construction we have

E(xi+1 εi) = E(yi εi) = ... = E(εi²) ≠ 0.

The regressor is not orthogonal to the past error term, which is a violation of strict exogeneity. However, the estimator may possess good large-sample properties even without strict exogeneity.
2.1.6 Other Assumptions of the Model
Assumption (1.3 - no multicollinearity). The rank of the n × K data matrix X is K with probability 1.

None of the K columns of the data matrix X can be expressed as a linear combination of the other columns of X.
Example (1.4 - continuation of Example 1.2). If no individuals in the sample ever changed jobs, then tenurei = expri for all i, in violation of the no-multicollinearity assumption. There is no way to distinguish the tenure effect on the wage rate from the experience effect. Remedy: drop tenurei or expri from the wage equation.
Example (Dummy Variable Trap). Consider

wagei = β1 + β2educi + β3femalei + β4malei + εi

where

femalei = 1 if i corresponds to a female, 0 if i corresponds to a male;   malei = 1 − femalei.

In vector notation we have

wage = β1·1 + β2educ + β3female + β4male + ε.

It is obvious that 1 = female + male. Therefore the above model violates Assumption 1.3. One may also justify this using scalar notation: xi1 = femalei + malei, because this relationship implies 1 = female + male. Can you overcome the dummy variable trap by removing xi1 ≡ 1 from the equation?
Exercise 2.4. In a study relating college grade point average to time spent in various activities, you distribute a survey to several students. The students are asked how many hours they spend each week in four activities: studying, sleeping, working, and leisure. Every activity is put into one of the four categories, so that for each student the sum of hours in the four activities must be 168. (i) In the model

GPAi = β1 + β2studyi + β3sleepi + β4worki + β5leisurei + εi

does it make sense to hold sleep, work, and leisure fixed while changing study? (ii) Explain why this model violates Assumption 1.3. (iii) How could you reformulate the model so that its parameters have a useful interpretation and it satisfies Assumption 1.3?
Assumption (1.4 - spherical error variance). The error term satisfies:

E(εi² | X) = σ² > 0, ∀i   (homoskedasticity);

E(εi εj | X) = 0, ∀i, j, i ≠ j   (no correlation between observations).

Exercise 2.5. Under Assumptions 1.2 and 1.4, show that Cov(yi, yj | X) = 0.
Assumption 1.4 together with strict exogeneity implies:

• Var(εi | X) = E(εi² | X) = σ²;
• Cov(εi, εj | X) = 0;
• E(εε' | X) = σ²I;
• Var(ε | X) = σ²I.

Note:

E(εε' | X) =
  E(ε1² | X)     E(ε1ε2 | X)   · · ·   E(ε1εn | X)
  E(ε1ε2 | X)    E(ε2² | X)    · · ·   E(ε2εn | X)
  ⋮              ⋮             ⋱       ⋮
  E(ε1εn | X)    E(ε2εn | X)   · · ·   E(εn² | X)
Exercise 2.6. Consider the savings function

savi = β1 + β2inci + εi,   εi = √(inci) · zi,

where zi is a random variable with E(zi) = 0 and Var(zi) = σz². Assume that zi is independent of incj (for all i, j). (i) Show that E(ε | inc) = 0. (ii) Show that Assumption 1.4 is violated.
2.1.7 The Classical Regression Model for Random Samples
The sample (y, X) is a random sample if {(yi, xi)} is i.i.d. (independently and identically distributed) across observations. A random sample automatically implies:

E(εi | X) = E(εi | xi),   E(εi² | X) = E(εi² | xi).

Therefore Assumptions 1.2 and 1.4 can be rephrased as:

Assumption 1.2:  E(εi | xi) = E(εi) = 0;

Assumption 1.4:  E(εi² | xi) = E(εi²) = σ².
2.1.8 "Fixed" Regressors

This is a simplifying (and generally unrealistic) assumption made to keep the statistical analysis tractable. It means that X is exactly the same in repeated samples. Sampling schemes that support this assumption:

a) Experimental situations. For example, suppose that y represents the yields of a crop grown on n experimental plots, and let the rows of X represent the seed varieties, irrigation and fertilizer for each plot. The experiment can be repeated as often as desired, with the same X. Only y varies across plots.

b) Stratified sampling (for more details see Wooldridge, chap. 9).
2.2 The Algebra of Least Squares
2.2.1 OLS Minimizes the Sum of Squared Residuals
Residual for observation i (evaluated at β):

yi − x'iβ.

Vector of residuals (evaluated at β):

y − Xβ.

Sum of squared residuals (SSR):

SSR(β) = Σ(yi − x'iβ)² = (y − Xβ)'(y − Xβ)   (sum over i = 1, ..., n).

The OLS (ordinary least squares) estimator:

b = arg minβ SSR(β),

i.e. b is such that SSR(b) is minimal.
[Figure: least squares fit for K = 1, yi = βxi + εi; not reproduced]

Example. Consider yi = β1 + β2xi2 + εi. The data:

   y     X
   1    1 1
   3    1 3
   2    1 1
   8    1 3
  12    1 8

Verify that SSR(β) = 42 when β = (0, 1)'.
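The SSR claim above can be checked directly. A minimal sketch using the slide's five observations:

```python
import numpy as np

# Check the slide's claim: SSR(beta) = 42 at beta = (0, 1)'.
y = np.array([1., 3., 2., 8., 12.])
X = np.array([[1., 1.], [1., 3.], [1., 1.], [1., 3.], [1., 8.]])
beta = np.array([0., 1.])

resid = y - X @ beta       # residuals evaluated at beta
print(resid @ resid)       # 42.0
```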
2.2.2 Normal Equations
To solve the optimization problem minβ SSR(β) we use classical optimization:

• First-order condition (FOC):

∂SSR(β)/∂β = 0.

Solve this equation with respect to β; let b be the solution.

• Second-order condition (SOC):

∂²SSR(β)/∂β∂β' is a positive definite matrix ⇔ b is a global minimum point.
To obtain the FOC more easily we start by writing SSR(β) as

SSR(β) = (y − Xβ)'(y − Xβ) = ... = y'y − 2y'Xβ + β'X'Xβ.

Recalling from matrix algebra that

∂(a'β)/∂β = a,   ∂(β'Aβ)/∂β = 2Aβ   (for A symmetric),

we have

∂SSR(β)/∂β = −2(y'X)' + 2X'Xβ = 0,

i.e. (replacing β by the solution b)

X'Xb = X'y   or   X'(y − Xb) = 0.
This is a system of K equations in K unknowns. These equations are called the normal equations. If

rank(X) = K ⇒ X'X is nonsingular ⇒ (X'X)⁻¹ exists.

Therefore, if rank(X) = K we have a unique solution:

b = (X'X)⁻¹X'y   (OLS estimator).

The SOC is

∂²SSR(β)/∂β∂β' = 2X'X.

If rank(X) = K then 2X'X is a positive definite matrix, so SSR(β) is strictly convex in R^K. Hence b is a global minimum point.
The vector of residuals evaluated at β = b,

e = y − Xb,

is called the vector of OLS residuals (or simply the residuals).
The normal equations can be written as

X'e = 0  ⇔  (1/n) Σ xi ei = 0   (sum over i = 1, ..., n).

This shows that the normal equations can be interpreted as the sample analogue of the orthogonality conditions E(xi εi) = 0. Notice the reasoning: by assuming the orthogonality conditions E(xi εi) = 0 in the population, we deduce by the method of moments the corresponding sample analogue

(1/n) Σi xi (yi − x'iβ) = 0.

We obtain the OLS estimator b by solving this equation with respect to β.
2.2.3 Two Expressions for the OLS Estimator
• b = (X'X)⁻¹X'y;

• b = (X'X/n)⁻¹(X'y/n) = Sxx⁻¹ Sxy, where

Sxx = X'X/n = (1/n) Σ xi x'i   (sample average of xi x'i),

Sxy = X'y/n = (1/n) Σ xi yi   (sample average of xi yi).

Example (continuation of the previous example). Consider the data:

   y     X
   1    1 1
   3    1 3
   2    1 1
   8    1 3
  12    1 8

Obtain b, e and SSR(b).
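The example above can be worked through numerically. A minimal sketch that solves the normal equations for the slide's data and checks the orthogonality X'e = 0:

```python
import numpy as np

# Solve the normal equations X'Xb = X'y for the slide's data set
# and recover b, the residual vector e, and SSR(b).
y = np.array([1., 3., 2., 8., 12.])
X = np.array([[1., 1.], [1., 3.], [1., 1.], [1., 3.], [1., 8.]])

b = np.linalg.solve(X.T @ X, X.T @ y)   # OLS estimator
e = y - X @ b                           # OLS residuals
ssr = e @ e
print(b, ssr)

# Normal equations: X'e = 0 (up to floating-point error), and the
# minimized SSR is below the value 42 obtained at beta = (0, 1)'.
print(X.T @ e)
```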
2.2.4 More Concepts and Algebra
The fitted value for observation i: ŷi = x'ib.

The vector of fitted values: ŷ = Xb.

The vector of OLS residuals: e = y − Xb = y − ŷ.

The projection matrix P and the annihilator M are defined as

P = X(X'X)⁻¹X',   M = I − P.

Properties:

Exercise 2.7. Show that P and M are symmetric and idempotent and that

PX = X,   MX = 0,   ŷ = Py,   e = My = Mε,   SSR = e'e = y'My = ε'Mε.
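The properties listed in Exercise 2.7 can be verified numerically (which is not a proof, but a useful sanity check). A sketch on the running 5-observation example:

```python
import numpy as np

# Numerical check of the projection matrix P and the annihilator M.
y = np.array([1., 3., 2., 8., 12.])
X = np.array([[1., 1.], [1., 3.], [1., 1.], [1., 3.], [1., 8.]])

P = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(len(y)) - P

print(np.allclose(P, P.T), np.allclose(P @ P, P))    # symmetric, idempotent
print(np.allclose(P @ X, X), np.allclose(M @ X, 0))  # PX = X, MX = 0

b = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(P @ y, X @ b))        # Py = fitted values
print(np.allclose(M @ y, y - X @ b))    # My = residuals
```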
The OLS estimate of σ² (the variance of the error term), denoted s², is

s² = SSR/(n − K) = e'e/(n − K).

s, the square root of s², is called the standard error of the regression.

The sampling error is

b − β = ... = (X'X)⁻¹X'ε.
Coefficient of Determination

A measure of goodness of fit is the coefficient of determination

R² = Σ(ŷi − ȳ)² / Σ(yi − ȳ)² = 1 − Σ ei² / Σ(yi − ȳ)²,   0 ≤ R² ≤ 1   (sums over i = 1, ..., n).

It measures the proportion of the variation of y that is accounted for by variation in the regressors x'js. Derivation of R²: [board]
[Three scatterplots of y against x with fitted lines, illustrating R² = 0.96, R² = 0.19 and R² = 0.00; figures not reproduced]
"The most important thing about R² is that it is not important" (Goldberger). Why?

• We are concerned with parameters in a population, not with goodness of fit in the sample;

• We can always increase R² by adding more explanatory variables. In the limit, if K = n ⇒ R² = 1.

Exercise 2.8. Prove that K = n ⇒ R² = 1 (assume that Assumption 1.3 holds).

It can be proved that

R² = ρ̂²,   ρ̂ = [Σi (ŷi − ȳ)(yi − ȳ)/n] / (Sŷ Sy).
Adjusted coefficient of determination:

R̄² = 1 − [(n − 1)/(n − K)] (1 − R²) = 1 − [Σ ei²/(n − K)] / [Σ(yi − ȳ)²/(n − 1)].

Contrary to R², R̄² may decline when a variable is added to the set of independent variables.
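Both measures can be computed on the running example. This sketch also checks the identity R² = ρ̂² (the squared correlation between fitted and actual values), which holds when the regression includes an intercept:

```python
import numpy as np

# R^2 and adjusted R^2 for the running 5-observation example.
y = np.array([1., 3., 2., 8., 12.])
X = np.array([[1., 1.], [1., 3.], [1., 1.], [1., 3.], [1., 8.]])
n, K = X.shape

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
tss = np.sum((y - y.mean()) ** 2)      # total sum of squares

r2 = 1 - (e @ e) / tss
r2_adj = 1 - (e @ e / (n - K)) / (tss / (n - 1))
print(r2, r2_adj)

# R^2 equals the squared correlation between fitted and actual y
print(np.isclose(r2, np.corrcoef(X @ b, y)[0, 1] ** 2))
```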
2.3 Finite-Sample Properties of OLS

First of all we need to recognize that b and b|X are random!

Assumptions:

1.1 - Linearity: yi = β1xi1 + β2xi2 + ... + βKxiK + εi.

1.2 - Strict exogeneity: E(εi | X) = 0.

1.3 - No multicollinearity.

1.4 - Spherical error variance: E(εi² | X) = σ², E(εi εj | X) = 0 (i ≠ j).
Proposition (1.1 - finite-sample properties of b). We have:

(a) (unbiasedness) Under Assumptions 1.1-1.3, E(b | X) = β.

(b) (expression for the variance) Under Assumptions 1.1-1.4, Var(b | X) = σ²(X'X)⁻¹.

(c) (Gauss-Markov theorem) Under Assumptions 1.1-1.4, the OLS estimator is efficient in the class of linear unbiased estimators (it is the Best Linear Unbiased Estimator). That is, for any unbiased estimator β̃ that is linear in y, Var(b | X) ≤ Var(β̃ | X) in the matrix sense (i.e. Var(β̃ | X) − Var(b | X) is a positive semidefinite matrix).

(d) Under Assumptions 1.1-1.4, Cov(b, e | X) = 0. Proof: [board]
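Properties (a) and (b) can be illustrated by Monte Carlo. This is my own simulation, not from the slides; the design (n = 50, σ = 1, β = (1, 2)', fixed X) is invented for the example.

```python
import numpy as np

# Monte Carlo sketch: with X held fixed across replications, the average
# of b should be close to beta, and the empirical variance of b2 close to
# the (2,2) element of sigma^2 (X'X)^{-1}.
rng = np.random.default_rng(2)
n, sigma = 50, 1.0
beta = np.array([1.0, 2.0])
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # "fixed" regressors
XtX_inv = np.linalg.inv(X.T @ X)

draws = np.empty((5_000, 2))
for r in range(5_000):
    y = X @ beta + sigma * rng.normal(size=n)
    draws[r] = XtX_inv @ X.T @ y

print(draws.mean(axis=0))                            # approx (1.0, 2.0)
print(draws[:, 1].var(), sigma**2 * XtX_inv[1, 1])   # approximately equal
```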
Proposition (1.2 - unbiasedness of s²). Let s² = e'e/(n − K). We have

E(s² | X) = E(s²) = σ². Proof: [board]

An unbiased estimator of Var(b | X) is

V̂ar(b | X) = s²(X'X)⁻¹.
Example. Consider

colGPAi = β1 + β2HSGPAi + β3ACTi + β4SKIPPEDi + β5PCi + εi

where colGPA: college grade point average (GPA); HSGPA: high school GPA; ACT: achievement examination for college admission; SKIPPED: average lectures missed per week; PC: binary variable (0/1) identifying who owns a personal computer. Using a survey of 141 students (Michigan State University) in Fall 1994, we obtained the following results [regression output not reproduced]:

These results tell us that n = 141, s = 0.325, R² = 0.259, SSR = 14.37,

b = (1.356, 0.4129, 0.0133, −0.071, 0.1244)',

V̂ar(b | X) =
  0.3275²  ?        ?       ?       ?
  ?        0.0924²  ?       ?       ?
  ?        ?        0.010²  ?       ?
  ?        ?        ?       0.026²  ?
  ?        ?        ?       ?       0.0573²
2.4 More on Regression Algebra
2.4.1 Regression Matrices
Matrix P = X(X'X)⁻¹X':

  Py → fitted values from the regression of y on X
  Pz → ?

Matrix M = I − P = I − X(X'X)⁻¹X':

  My → residuals from the regression of y on X
  Mz → ?

Consider a partition of X as follows: X = [X1 X2].

Matrix P1 = X1(X'1X1)⁻¹X'1:

  P1y → ?

Matrix M1 = I − P1 = I − X1(X'1X1)⁻¹X'1:

  M1y → ?
2.4.2 Short and Long Regression Algebra
Partition X as

X = [X1 X2],   X1: n × K1,   X2: n × K2,   K1 + K2 = K.

Long regression. We have

y = ŷ + e = Xb + e = [X1 X2] (b1; b2) + e = X1b1 + X2b2 + e.

Short regression. Suppose that we shorten the list of explanatory variables and regress y on X1. We have

y = ŷ* + e* = X1b*1 + e*

where

b*1 = (X'1X1)⁻¹X'1y,

e* = M1y,   M1 = I − X1(X'1X1)⁻¹X'1.
How are b*1 and e* related to b1 and e?

b*1 vs. b1. We have

b*1 = (X'1X1)⁻¹X'1y
    = (X'1X1)⁻¹X'1(X1b1 + X2b2 + e)
    = b1 + (X'1X1)⁻¹X'1X2b2 + (X'1X1)⁻¹X'1e   [last term = 0, by the normal equations]
    = b1 + (X'1X1)⁻¹X'1X2b2
    = b1 + Fb2,   F = (X'1X1)⁻¹X'1X2.

Thus, in general, b*1 ≠ b1. Exceptional cases: b2 = 0 or X'1X2 = O ⇒ b*1 = b1.
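The identity b*1 = b1 + Fb2 can be verified numerically. This is my own example; the data-generating process is invented, with X2 deliberately correlated with X1 so that F ≠ O.

```python
import numpy as np

# Sketch: check the short-regression identity b1* = b1 + F b2,
# with F = (X1'X1)^{-1} X1'X2.
rng = np.random.default_rng(3)
n = 100
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = (0.5 * X1[:, 1] + rng.normal(size=n)).reshape(-1, 1)  # correlated with X1
y = X1 @ [1.0, 2.0] + X2[:, 0] * 1.5 + rng.normal(size=n)

X = np.hstack([X1, X2])
b = np.linalg.solve(X.T @ X, X.T @ y)            # long regression
b1, b2 = b[:2], b[2:]
b1_star = np.linalg.solve(X1.T @ X1, X1.T @ y)   # short regression

F = np.linalg.solve(X1.T @ X1, X1.T @ X2)
print(np.allclose(b1_star, b1 + F @ b2))         # True
```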
e* vs. e. We have

e* = M1y
   = M1(X1b1 + X2b2 + e)
   = M1X1b1 + M1X2b2 + M1e
   = M1X2b2 + e
   = v + e,   v = M1X2b2.

Thus,

e*'e* = e'e + v'v ≥ e'e.

Hence the SSR of the short regression (e*'e*) is at least the SSR of the long regression (e'e), and e*'e* = e'e iff v = 0, that is, iff b2 = 0.
Example. Illustration of b*1 ≠ b1 and e*'e* ≥ e'e. [Data not reproduced.] Find X, X1, X2, b, b1, b2, b*1, e*'e*, e'e.
2.4.3 Residual Regression

Consider

y = Xβ + ε = X1β1 + X2β2 + ε.

Premultiplying both sides by M1 and using M1X1 = 0, we obtain

M1y = M1X1β1 + M1X2β2 + M1ε,

ỹ = X̃2β2 + M1ε,   where ỹ = M1y and X̃2 = M1X2.

OLS applied to this equation gives

b2 = (X̃'2X̃2)⁻¹X̃'2ỹ = (X̃'2X̃2)⁻¹X̃'2M1y = (X̃'2X̃2)⁻¹X̃'2y.

Thus

b2 = (X̃'2X̃2)⁻¹X̃'2y.
Another way to prove b2 = (X̃'2X̃2)⁻¹X̃'2y (you may skip this proof). We have

(X̃'2X̃2)⁻¹X̃'2y = (X̃'2X̃2)⁻¹X̃'2X1b1 + (X̃'2X̃2)⁻¹X̃'2X2b2 + (X̃'2X̃2)⁻¹X̃'2e
               = 0 + b2 + 0
               = b2

since:

(X̃'2X̃2)⁻¹X̃'2X1b1 = (X̃'2X̃2)⁻¹X'2M1X1b1 = 0   (because M1X1 = 0);

(X̃'2X̃2)⁻¹X̃'2X2b2 = (X̃'2X̃2)⁻¹X'2M1X2b2
                  = (X'2M'1M1X2)⁻¹X'2M1X2b2
                  = (X'2M1X2)⁻¹X'2M1X2b2
                  = b2;

X̃'2e = X'2M1e = X'2e = 0.
The conclusion is that we can obtain b2 = (X̃'2X̃2)⁻¹X̃'2ỹ = (X̃'2X̃2)⁻¹X̃'2y as follows:

1) Regress X2 on X1 to get the residuals X̃2 = M1X2. Interpretation: X̃2 is X2 after the effects of X1 have been removed, i.e. X̃2 is the part of X2 that is uncorrelated with X1.

2) Regress y on X̃2 to get the coefficient b2 of the long regression.

OR:

1') Same as 1).

2'a) Regress y on X1 to get the residuals ỹ = M1y.

2'b) Regress ỹ on X̃2 to get the coefficient b2 of the long regression.

The conclusion of 1) and 2) is extremely important: b2 relates y to X2 after controlling for the effects of X1. This is why b2 can be obtained from the regression of y on X̃2, where X̃2 is X2 after the effects of X1 have been removed (fixed, or controlled for). This means that b2 has in fact a ceteris paribus interpretation.
To recover b1 we use the equation b*1 = b1 + Fb2: regress y on X1, obtaining

b*1 = (X'1X1)⁻¹X'1y,

and then

b1 = b*1 − (X'1X1)⁻¹X'1X2b2 = b*1 − Fb2.
Example. Consider the example on page 9.
Example. Consider X = [1 exper tenure IQ educ] and

X1 = [1 exper tenure IQ],   X2 = educ.

[Regression output not reproduced.]
2.4.4 Application of Residual Regression
A) Trend Removal (time series)

Suppose that yt and xt have a linear trend. Should the trend term be included in the regression, as in

yt = β1 + β2xt2 + β3xt3 + εt,   xt3 = t,

or should the variables first be "detrended" and then used without the trend term included, as in

ỹt = β2x̃t2 + εt?

According to the previous results, the OLS coefficient b2 is the same in both regressions. In the second regression b2 is obtained from the regression of ỹ = M1y on x̃•2 = M1x•2, where

X1 = [1 x•3] =
  1 1
  1 2
  ⋮ ⋮
  1 n
Example. Consider (TXDES: unemployment rate, INF: inflation, t: time)

TXDESt = β1 + β2INFt + β3t + εt.

We will show two ways to obtain b2 (compare EQ01 to EQ04).

EQ01 - Dependent Variable: TXDES; Method: Least Squares; Sample: 1948 2003
  Variable   Coefficient  Std. Error  t-Statistic  Prob.
  C          4.463068     0.425856    10.48023     0.0000
  INF        0.104712     0.063329    1.653473     0.1041
  @TREND     0.027788     0.011806    2.353790     0.0223

EQ02 - Dependent Variable: TXDES; Method: Least Squares; Sample: 1948 2003
  Variable   Coefficient  Std. Error  t-Statistic  Prob.
  C          4.801316     0.379453    12.65325     0.0000
  @TREND     0.030277     0.011896    2.545185     0.0138

EQ03 - Dependent Variable: INF; Method: Least Squares; Sample: 1948 2003
  Variable   Coefficient  Std. Error  t-Statistic  Prob.
  C          3.230263     0.802598    4.024758     0.0002
  @TREND     0.023770     0.025161    0.944696     0.3490

EQ04 - Dependent Variable: TXDES_; Method: Least Squares; Sample: 1948 2003
  Variable   Coefficient  Std. Error  t-Statistic  Prob.
  INF_       0.104712     0.062167    1.684382     0.0978
B) Seasonal Adjustment and Linear Regression with Seasonal Data
Suppose that we have data on the variable y, quarter by quarter, for m years. A way to deal with (deterministic) seasonality is the following:

yt = β1Qt1 + β2Qt2 + β3Qt3 + β4Qt4 + β5xt5 + εt

where

Qti = 1 in quarter i, 0 otherwise.

Let

X = [Q1 Q2 Q3 Q4 x•5],   X1 = [Q1 Q2 Q3 Q4].

The previous results show that b5 can be obtained from the regression of ỹ = M1y on x̃•5 = M1x•5. It can be proved that

ỹt = yt − ȳQ1 in quarter 1;  yt − ȳQ2 in quarter 2;  yt − ȳQ3 in quarter 3;  yt − ȳQ4 in quarter 4,

where ȳQi is the seasonal mean of quarter i.
C) Deviations from Means

Let x•1 be the summer vector (the vector of ones). Instead of regressing y on [x•1 x•2 · · · x•K] to get (b1, b2, ..., bK)', we can regress y on the matrix of deviations from means

  x12 − x̄2 · · · x1K − x̄K
  ⋮               ⋮
  xn2 − x̄2 · · · xnK − x̄K

to get the same vector (b2, ..., bK)'. We sketch the proof. Let

X2 = [x•2 · · · x•K]

so that

ŷ = x•1b1 + X2b2.

1) Regress X2 on x•1 to get the residuals X̃2 = M1X2, where

M1 = I − x•1(x'•1x•1)⁻¹x'•1 = I − x•1x'•1/n.
As we know,

X̃2 = M1X2 = M1[x•2 · · · x•K] = [M1x•2 · · · M1x•K] =
  x12 − x̄2 · · · x1K − x̄K
  ⋮               ⋮
  xn2 − x̄2 · · · xnK − x̄K.

2) Regress y (or ỹ = M1y) on X̃2 to get the coefficient b2 of the long regression:

b2 = (X̃'2X̃2)⁻¹X̃'2ỹ = (X̃'2X̃2)⁻¹X̃'2y.

The intercept can be recovered as

b1 = b*1 − (x'•1x•1)⁻¹x'•1X2b2 = b*1 − Fb2.
2.4.5 Short and Residual Regression in the Classical Regression Model
Consider:

y = X1b1 + X2b2 + e   (long regression),

y = X1b*1 + e*   (short regression).

The correct specification corresponds to the long regression:

E(y | X) = X1β1 + X2β2 = Xβ,

Var(y | X) = σ²I, etc.
A) Short-Regression Coefficients

b*1 is a biased estimator of β1. Given that

b*1 = (X'1X1)⁻¹X'1y = b1 + Fb2,   F = (X'1X1)⁻¹X'1X2,

we have

E(b*1 | X) = E(b1 + Fb2 | X) = β1 + Fβ2,

Var(b*1 | X) = Var((X'1X1)⁻¹X'1y | X) = (X'1X1)⁻¹X'1 Var(y | X) X1(X'1X1)⁻¹ = σ²(X'1X1)⁻¹.

Thus, in general,

b*1 is a biased estimator of β1 ("omitted-variable bias")

unless:

• β2 = 0, which corresponds to the case of "irrelevant omitted variables";
• F = O, which corresponds to the case of "orthogonal explanatory variables" (in sample space).
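The bias formula E(b*1 | X) = β1 + Fβ2 can be illustrated by Monte Carlo. This is my own simulation with an invented design in which the omitted regressor is correlated with the included one:

```python
import numpy as np

# Sketch: omitted-variable bias. With X fixed, the average short-regression
# estimate is close to beta1 + F*beta2, not beta1.
rng = np.random.default_rng(6)
n = 60
beta1, beta2 = np.array([1.0, 2.0]), np.array([1.5])
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = (0.8 * X1[:, 1] + 0.3 * rng.normal(size=n)).reshape(-1, 1)
F = np.linalg.solve(X1.T @ X1, X1.T @ X2)

R = 4_000
est = np.zeros(2)
for _ in range(R):
    y = X1 @ beta1 + X2 @ beta2 + rng.normal(size=n)
    est += np.linalg.solve(X1.T @ X1, X1.T @ y)   # short regression
print(est / R, beta1 + F @ beta2)                 # approximately equal
```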
Var(b1 | X) ≥ Var(b*1 | X) (you may skip the proof)

Consider b1 = b*1 − Fb2. Then

Var(b1 | X) = Var(b*1 − Fb2 | X)
            = Var(b*1 | X) + Var(Fb2 | X)   (since Cov(b*1, b2 | X) = O [board])
            = Var(b*1 | X) + F Var(b2 | X) F'.

Because F Var(b2 | X) F' is positive semidefinite (or nonnegative definite), Var(b1 | X) ≥ Var(b*1 | X).

This relation is still valid if β2 = 0. In that case, regressing y on X1 and on irrelevant variables (X2) involves a cost: Var(b1 | X) ≥ Var(b*1 | X), although E(b1 | X) = β1.

In practice there may be a bias-variance trade-off between the short and long regression when the target is β1.
Exercise 2.9. Consider the standard simple regression model yi = β1 + β2xi2 + εi under Assumptions 1.1 through 1.4. Thus, the usual OLS estimators b1 and b2 are unbiased for their respective population parameters. Let b*2 be the estimator of β2 obtained by assuming the intercept is zero, i.e. β1 = 0. (i) Find E(b*2 | X). Verify that b*2 is unbiased for β2 when the population intercept β1 is zero. Are there other cases where b*2 is unbiased? (ii) Find the variance of b*2. (iii) Show that Var(b*2 | X) ≤ Var(b2 | X). (iv) Comment on the trade-off between bias and variance when choosing between b*2 and b2.

Exercise 2.10. Suppose that average worker productivity at manufacturing firms (avgprod) depends on two factors, average hours of training (avgtrain) and average worker ability (avgabil):

avgprodi = β1 + β2avgtraini + β3avgabili + εi.

Assume that this equation satisfies Assumptions 1.1 through 1.4. If grants have been given to firms whose workers have less than average ability, so that avgtrain and avgabil are negatively correlated, what is the likely bias in the estimator b*2 obtained from the simple regression of avgprod on avgtrain?
B) Short-Regression Residuals (skip this)

Given that e* = M1y, we have

E(e* | X) = M1 E(y | X) = M1(X1β1 + X2β2) = X̃2β2,

Var(e* | X) = Var(M1y | X) = M1 Var(y | X) M'1 = σ²M1.

Thus E(e* | X) ≠ 0 unless β2 = 0.

Let us now see that the omission of explanatory variables leads to an increase in the expected SSR. We have, by R5,

E(e*'e* | X) = E(y'M1y | X) = tr(M1 Var(y | X)) + E(y | X)' M1 E(y | X)
             = σ² tr(M1) + β'2X̃'2X̃2β2 = σ²(n − K1) + β'2X̃'2X̃2β2,

and E(e'e | X) = σ²(n − K), thus

E(e*'e* | X) − E(e'e | X) = σ²K2 + β'2X̃'2X̃2β2 > 0.

Notice that e*'e* − e'e = b'2X̃'2X̃2b2 ≥ 0 (check that E(b'2X̃'2X̃2b2 | X) = σ²K2 + β'2X̃'2X̃2β2).
C) Residual Regression

The objective is to characterize

Var(b2|X).

By residual regression we know that b2 = (X̃2′X̃2)⁻¹X̃2′y, where X̃2 = M1X2. Thus

Var(b2|X) = Var((X̃2′X̃2)⁻¹X̃2′y|X) = (X̃2′X̃2)⁻¹X̃2′ Var(y|X) X̃2(X̃2′X̃2)⁻¹
          = σ²(X̃2′X̃2)⁻¹
          = σ²(X2′M1X2)⁻¹.
Now suppose that

X = [X1  x•K]  (i.e. x•K = X2).

65

It follows that

Var(bK|X) = σ² / (x•K′M1x•K)

and x•K′M1x•K is the sum of the squared residuals in the auxiliary regression

x•K = α1x•1 + α2x•2 + ... + α(K−1)x•(K−1) + error.

One can conclude (assuming that x•1 is the summer vector, i.e. the vector of ones):

R²K = 1 − x•K′M1x•K / Σ(xiK − x̄K)².

Solving this equation for x•K′M1x•K we have

x•K′M1x•K = (1 − R²K) Σ(xiK − x̄K)².

We get

Var(bK|X) = σ² / ((1 − R²K) Σ(xiK − x̄K)²) = σ² / ((1 − R²K) S²xK n).
66
We can conclude that the precision of bK is high (i.e. Var(bK) is small) when:

• σ² is low;

• S²xK is high (imagine the regression wage = β1 + β2educ + ε. If most people in the sample report the same education, S²xK will be low and β2 will be estimated very imprecisely);

• n is high (a large sample is preferable to a small sample);

• R²K is low (multicollinearity increases R²K).
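The equivalence between Var(bK|X) = σ²[(X′X)⁻¹]KK and the auxiliary-regression formula can be verified numerically. A sketch with a simulated design (all numbers below are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 200, 3
x2 = rng.normal(size=n)
xK = 0.8 * x2 + rng.normal(size=n)        # last regressor, correlated with x2
X = np.column_stack([np.ones(n), x2, xK])
sigma2 = 4.0

# Direct formula: Var(bK|X) = sigma^2 * [(X'X)^{-1}]_{KK}
var_direct = sigma2 * np.linalg.inv(X.T @ X)[K - 1, K - 1]

# Auxiliary regression of xK on the remaining columns -> R^2_K
X1 = X[:, : K - 1]
resid = xK - X1 @ np.linalg.lstsq(X1, xK, rcond=None)[0]
R2K = 1 - resid @ resid / np.sum((xK - xK.mean()) ** 2)
S2xK = np.mean((xK - xK.mean()) ** 2)     # S^2_xK = (1/n) * sum of squared deviations
var_formula = sigma2 / ((1 - R2K) * S2xK * n)

print(np.isclose(var_direct, var_formula))   # True: the two expressions coincide
```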
67
Exercise 2.11. Consider: sleep: minutes of sleep at night per week; totwrk: hours worked per week; educ: years of schooling; female: binary variable equal to one if the individual is female. Do women sleep more than men? Explain the differences between the estimates 32.18 and -90.969.

Dependent Variable: SLEEP; Method: Least Squares; Sample: 1 706

Variable    Coefficient   Std. Error   t-Statistic   Prob.
C           3252.407      22.22211     146.3591      0.0000
FEMALE      32.18074      33.75413     0.953387      0.3407

R-squared 0.001289; Adjusted R-squared 0.000129; S.E. of regression 444.4422; Sum squared resid 1.39E+08; Mean dependent var 3266.356; S.D. dependent var 444.4134; Akaike info criterion 15.03435; Schwarz criterion 15.04726

Dependent Variable: SLEEP; Method: Least Squares; Sample: 1 706

Variable    Coefficient   Std. Error   t-Statistic   Prob.
C           3838.486      86.67226     44.28737      0.0000
TOTWRK      -0.167339     0.017937     -9.329260     0.0000
EDUC        -13.88479     5.657573     -2.454196     0.0144
FEMALE      -90.96919     34.27441     -2.654143     0.0081

R-squared 0.119277; Adjusted R-squared 0.115514; S.E. of regression 417.9581; Sum squared resid 1.23E+08; Mean dependent var 3266.356; S.D. dependent var 444.4134; Akaike info criterion 14.91429; Schwarz criterion 14.94012
68
Example. The goal is to analyze the impact of another year of education on wages. Consider: wage: monthly earnings; KWW: knowledge of world work score (KWW is a general test of work-related abilities); educ: years of education; exper: years of work experience; tenure: years with current employer.

Dependent Variable: LOG(WAGE); Method: Least Squares; Sample: 1 935; White Heteroskedasticity-Consistent Standard Errors & Covariance

Variable    Coefficient   Std. Error   t-Statistic   Prob.
C           5.973062      0.082272     72.60160      0.0000
EDUC        0.059839      0.006079     9.843503      0.0000

R-squared 0.097417; Adjusted R-squared 0.096449; S.E. of regression 0.400320; Sum squared resid 149.5186; Mean dependent var 6.779004; S.D. dependent var 0.421144; Akaike info criterion 1.009029; Schwarz criterion 1.019383

Dependent Variable: LOG(WAGE); Method: Least Squares; Sample: 1 935; White Heteroskedasticity-Consistent Standard Errors & Covariance

Variable    Coefficient   Std. Error   t-Statistic   Prob.
C           5.496696      0.112030     49.06458      0.0000
EDUC        0.074864      0.006654     11.25160      0.0000
EXPER       0.015328      0.003405     4.501375      0.0000
TENURE      0.013375      0.002657     5.033021      0.0000

R-squared 0.155112; Adjusted R-squared 0.152390; S.E. of regression 0.387729; Sum squared resid 139.9610; Mean dependent var 6.779004; S.D. dependent var 0.421144; Akaike info criterion 0.947250; Schwarz criterion 0.967958

Dependent Variable: LOG(WAGE); Method: Least Squares; Sample: 1 935; White Heteroskedasticity-Consistent Standard Errors & Covariance

Variable    Coefficient   Std. Error   t-Statistic   Prob.
C           5.210967      0.113778     45.79932      0.0000
EDUC        0.047537      0.008275     5.744381      0.0000
EXPER       0.012897      0.003437     3.752376      0.0002
TENURE      0.011468      0.002686     4.270056      0.0000
IQ          0.004503      0.000989     4.553567      0.0000
KWW         0.006704      0.002070     3.238002      0.0012

R-squared 0.193739; Adjusted R-squared 0.189400; S.E. of regression 0.379170; Sum squared resid 133.5622; Mean dependent var 6.779004; S.D. dependent var 0.421144; Akaike info criterion 0.904732; Schwarz criterion 0.935794
69
Exercise 2.12. Consider

yi = β1 + β2xi2 + εi,  i = 1, ..., n

where xi2 is an impulse dummy, i.e. x•2 is a column vector with n − 1 zeros and only one 1. To simplify, let us suppose that this 1 is the first element of x•2, i.e.

x•2′ = [1 0 · · · 0].

Find and interpret the coefficient from the regression of y on x̃•1 = M2x•1, where M2 = I − x•2(x•2′x•2)⁻¹x•2′ (x̃•1 is the residual vector from the regression of x•1 on x•2).

Exercise 2.13. Consider the long regression model (under Assumptions 1.1 through 1.4):

y = X1b1 + X2b2 + e,

and the following coefficients (obtained from the short regressions):

b∗1 = (X1′X1)⁻¹X1′y,  b∗2 = (X2′X2)⁻¹X2′y.

Decide if you agree or disagree with the following statement: if Cov(b∗1, b∗2|X1, X2) = O (zero matrix) then b∗1 = b1 and b∗2 = b2.
70
2.5 Multicollinearity
If rank (X) < K then b is not defined. This is called strict multicollinearity. When thishappens, the statistical software will be unable to construct
(X′X
)−1 . Since the error isdiscovered quickly, this is rarely a problem for applied econometric practice.
The more relevant situation is near multicollinearity, which is often called “multicollinearity”for brevity. This is the situation when the X′X is near singular, when the columns of X areclose to linearly dependent.
Consequence: the individual coeffi cient estimates will be imprecise. We have shown that
Var (bK|X) =σ2(
1−R2K
)S2xKn.
where R2K is the coeffi cient of determination in the auxiliary regression
x•K = α1x•1 + α2x•2 + ...+ αK−1x•K−1 + error.
71
Exercise 2.14. Do you agree with the following quotations: (a) “But more data is no remedy for multicollinearity if the additional data are simply 'more of the same.' So obtaining lots of small samples from the same population will not help” (Johnston, 1984); (b) “Another important point is that a high degree of correlation between certain independent variables can be irrelevant as to how well we can estimate other parameters in the model.”

Exercise 2.15. Suppose you postulate a model explaining final exam score in terms of class attendance. Thus, the dependent variable is final exam score, and the key explanatory variable is the number of classes attended. To control for student abilities and efforts outside the classroom, you include among the explanatory variables cumulative GPA, SAT score, and measures of high school performance. Someone says, “You cannot hope to learn anything from this exercise because cumulative GPA, SAT score, and high school performance are likely to be highly collinear.” What should be your answer?
72
2.6 Statistical Inference under Normality

Assumption (1.5 - normality of the error term). ε|X ∼ Normal.

Assumption 1.5 together with Assumptions 1.2 and 1.4 implies that

ε|X ∼ N(0, σ²I) and y|X ∼ N(Xβ, σ²I).

Suppose that we want to test H0 : β2 = 1. Although Proposition 1.1 guarantees that, on average, b2 (the OLS estimate of β2) equals 1 if the hypothesis H0 : β2 = 1 is true, b2 may not be exactly equal to 1 for a particular sample at hand. Obviously, we cannot conclude that the restriction is false just because the estimate b2 differs from 1. In order to decide whether the sampling error b2 − 1 is “too large” for the restriction to be true, we need to construct from the sampling error some test statistic whose probability distribution is known given the truth of the hypothesis.
The relevant theory is built from the following results:
73
1. z ∼ N(0, I), z of dimension n × 1 ⇒ z′z ∼ χ²(n).

2. w1 ∼ χ²(m), w2 ∼ χ²(n), w1 and w2 independent ⇒ (w1/m)/(w2/n) ∼ F(m, n).

3. w ∼ χ²(n), z ∼ N(0, 1), w and z independent ⇒ z/√(w/n) ∼ t(n).

4. Asymptotic results:

v ∼ F(m, n) ⇒ mv →d χ²(m) as n → ∞;
u ∼ t(n) ⇒ u →d N(0, 1) as n → ∞.

5. Consider the n × 1 vector y|X ∼ N(Xβ, Σ). Then,

w = (y − Xβ)′Σ⁻¹(y − Xβ) ∼ χ²(n).
74
6. Consider the n × 1 vector ε|X ∼ N(0, I). Let M be an n × n idempotent matrix with rank(M) = r ≤ n. Then,

ε′Mε|X ∼ χ²(r).

7. Consider the n × 1 vector ε|X ∼ N(0, I). Let M be an n × n idempotent matrix with rank(M) = r ≤ n. Let L be a matrix such that LM = O. Let t1 = Mε and t2 = Lε. Then t1 and t2 are independent random vectors.

8. b|X ∼ N(β, σ²(X′X)⁻¹).

9. Let r = Rβ (R is p × K) with rank(R) = p (in Hayashi's notation p is equal to #r). Then,

Rb|X ∼ N(r, σ²R(X′X)⁻¹R′).
75
10. Let bk be the kth element of b and qkk the (k, k) element of (X′X)⁻¹. Then,

bk|X ∼ N(βk, σ²qkk)  or  zk = (bk − βk)/(σ√qkk) ∼ N(0, 1).

11. w = (Rb − r)′(R(X′X)⁻¹R′)⁻¹(Rb − r)/σ² ∼ χ²(p).

12. wk = (bk − βk)²/(σ²qkk) ∼ χ²(1).

13. w0 = e′e/σ² ∼ χ²(n − K).

14. The random vectors b and e are independent.

15. Each of the statistics e, e′e, w0, s², V̂ar(b) is independent of each of the statistics b, bk, Rb, w, wk.
76
16. tk = (bk − βk)/σ̂bk ∼ t(n − K), where σ̂²bk is the (k, k) element of s²(X′X)⁻¹.

17. (Rb − Rβ)/(s√(R(X′X)⁻¹R′)) ∼ t(n − K), where R is of type 1 × K.

18. F = (Rb − r)′(R(X′X)⁻¹R′)⁻¹(Rb − r)/(ps²) ∼ F(p, n − K).

Exercise 2.16. Prove the results #8, #9, #16 and #18 (take the other results as given).

The two most important results are:

tk = (bk − βk)/σ̂bk = (bk − βk)/SE(bk) ∼ t(n − K)

F = (Rb − r)′(R(X′X)⁻¹R′)⁻¹(Rb − r)/(ps²) ∼ F(p, n − K).
77
2.6.1 Confidence Intervals and Regions

Let tα/2 ≡ tα/2(n − K) be such that

P(|t| < tα/2) = 1 − α.
78
Let Fα ≡ Fα(p, n − K) be such that

P(F > Fα) = α.
79
• (1 − α)100% CI for an individual slope coefficient βk:

{βk : |(bk − βk)/σ̂bk| ≤ tα/2} ⇔ bk ± tα/2 σ̂bk.

• (1 − α)100% CI for a single linear combination of the elements of β (p = 1):

{Rβ : |(Rb − Rβ)/(s√(R(X′X)⁻¹R′))| ≤ tα/2} ⇔ Rb ± tα/2 s√(R(X′X)⁻¹R′).

In this case R is a vector 1 × K.

• (1 − α)100% confidence region for the parameter vector θ = Rβ:

{θ : (Rb − θ)′(R(X′X)⁻¹R′)⁻¹(Rb − θ)/s² ≤ pFα}.

• (1 − α)100% confidence region for the parameter vector β (consider R = I in the previous case, so p = K):

{β : (b − β)′(X′X)(b − β)/s² ≤ KFα}.
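A minimal sketch of the CI for a slope coefficient on simulated data, using the large-sample approximation t0.025(n − K) ≈ 1.96 (exact t critical values require tables or a statistics library); all numerical values below are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(2)
n, K = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

b, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ b
s2 = e @ e / (n - K)                        # s^2 = e'e/(n-K)
se_b = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))

t_crit = 1.96                               # approx. t_{0.025}(98); exact value is about 1.984
ci_low, ci_high = b[1] - t_crit * se_b[1], b[1] + t_crit * se_b[1]
print(ci_low, ci_high)                      # 95% CI for the slope, b1 +/- t * SE
```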
80
Exercise 2.17. Consider yi = β1xi1 + β2xi2 + εi where yi, xi1 and xi2 are wage, educ and exper in deviations from their sample means (so there is no intercept). The results are

Dependent Variable: Y; Method: Least Squares; Sample: 1 526

Variable    Coefficient   Std. Error   t-Statistic   Prob.
X1          0.644272      0.053755     11.98541      0.0000
X2          0.070095      0.010967     6.391393      0.0000

R-squared 0.225162; Adjusted R-squared 0.223683; S.E. of regression 3.253935; Sum squared resid 5548.160; Log likelihood -1365.969; Durbin-Watson stat 1.820274; Mean dependent var 1.34E-15; S.D. dependent var 3.693086; Akaike info criterion 5.201402; Schwarz criterion 5.217620; Hannan-Quinn criter. 5.207752

X′X = [  4025.4297   −5910.064
        −5910.064   96706.846 ]

(X′X)⁻¹ = [ 2.7291 × 10⁻⁴   1.6678 × 10⁻⁵
            1.6678 × 10⁻⁵   1.1360 × 10⁻⁵ ]

(a) Build the 95% confidence interval for β2.

(b) Build the 95% confidence interval for β1 + β2.

(c) Build the 95% confidence region for the parameter vector β.
81
Confidence regions in EVIEWS:

[Figure: 90% and 95% confidence regions (ellipses) for the parameter vector β, with β1 on the horizontal axis (0.50 to 0.80) and β2 on the vertical axis (0.04 to 0.10).]
82
2.6.2 Testing on a Single Parameter

Suppose that we have a hypothesis about the kth regression coefficient:

H0 : βk = β⁰k

(β⁰k is a specific value, e.g. zero), and that this hypothesis is tested against the alternative hypothesis

H1 : βk ≠ β⁰k.

We do not reject H0 at the α·100% level if β⁰k lies within the (1 − α)100% CI for βk, i.e., bk ± tα/2 σ̂bk; we reject H0 otherwise. Equivalently, calculate the test statistic

tobs = (bk − β⁰k)/σ̂bk

and,

if |tobs| > tα/2 then reject H0;
if |tobs| ≤ tα/2 then do not reject H0.
83
The reasoning is as follows. Under the null hypothesis we have

t⁰k = (bk − β⁰k)/σ̂bk ∼ t(n − K).

If we observe |tobs| > tα/2 and H0 is true, then a low-probability event has occurred. We take |tobs| > tα/2 as evidence against the null and the decision should be to reject H0.

Other cases:

• H0 : βk = β⁰k vs. H1 : βk > β⁰k:
if tobs > tα then reject H0 at the α·100% level; otherwise do not reject H0.

• H0 : βk = β⁰k vs. H1 : βk < β⁰k:
if tobs < −tα then reject H0 at the α·100% level; otherwise do not reject H0.
84
2.6.3 Issues in Hypothesis Testing

p-value

The p-value (or p) is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true. p is an informal measure of the evidence in favor of the null hypothesis.

Example. Consider H0 : βk = β⁰k vs. H1 : βk ≠ β⁰k:

p-value = 2P(t⁰k > |tobs| | H0 is true).

A p-value = 0.02 shows little evidence supporting H0. At the 5% level you should reject the H0 hypothesis.

Example. Consider H0 : βk = β⁰k vs. H1 : βk > β⁰k:

p-value = P(t⁰k > tobs | H0 is true).

EVIEWS reports two-sided p-values: for this one-sided alternative, divide the reported p-value by two.
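For large n − K the t distribution is close to N(0, 1) (result 4 above), so a two-sided p-value can be approximated with the Python standard library alone; this is a sketch, not a substitute for exact t tables in small samples:

```python
from statistics import NormalDist

# Two-sided p-value for t_obs, using the large-sample approximation
# t(n - K) ~ N(0, 1); exact small-sample p-values need t tables or a
# statistics library.
def p_value_two_sided(t_obs: float) -> float:
    return 2 * (1 - NormalDist().cdf(abs(t_obs)))

print(round(p_value_two_sided(1.96), 3))    # close to 0.05
print(p_value_two_sided(10.0) < 1e-10)      # essentially zero: strong evidence against H0
```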
85
Reporting the outcome of a test

Correct wording in reporting the outcome of a test involving H0 : βk = β⁰k vs. H1 : βk ≠ β⁰k:

• When the null is rejected we say that bk (not βk) is significantly different from β⁰k at the α·100% level.

• When the null isn't rejected we say that bk (not βk) is not significantly different from β⁰k at the α·100% level.

Correct wording in reporting the outcome of a test involving H0 : βk = 0 vs. H1 : βk ≠ 0:

• When the null is rejected we say that bk (not βk) is significantly different from zero at the α·100% level, or that the variable (associated with bk) is statistically significant at the α·100% level.

• When the null isn't rejected we say that bk (not βk) is not significantly different from zero at the α·100% level, or that the variable is not statistically significant at the α·100% level.
86
More remarks:

• Rejection of the null is not proof that the null is false. Why?

• Acceptance of the null is not proof that the null is true. Why? We prefer to use the language “we fail to reject H0 at the x% level” rather than “H0 is accepted at the x% level.”

• In a test of type H0 : βk = β⁰k, if σ̂bk is large (bk is an imprecise estimator) it is more difficult to reject the null. The sample contains little information about the true value of the βk parameter. Remember that σ̂bk depends on σ², S²xk, n and R²k.
87
Statistical Versus Economic Significance

The statistical significance of a variable is determined by the size of tobs = bk/SE(bk), whereas the economic significance of a variable is related to the size and sign of bk.

Example. Suppose that in a business activity we have

log(wagei) = .1 + 0.01 femalei + ...,  n = 600
                  (0.001)

H0 : β2 = 0 vs. H1 : β2 ≠ 0. We have:

t⁰k = b2/σ̂b2 ∼ t(600 − K) ≈ N(0, 1) (under the null)

tobs = 0.01/0.001 = 10,

p-value = 2P(t⁰k > |10| | H0 is true) ≈ 0.

Discuss statistical versus economic significance.
88
Exercise 2.18. Can we say that students at smaller schools perform better than those at larger schools? To discuss this hypothesis we consider data on 408 high schools in Michigan for the year 1993 (see Wooldridge, chapter 4). Performance is measured by the percentage of students receiving a passing score on a tenth-grade math test (math10). School size is measured by student enrollment (enroll). We will control for two other factors, average annual teacher compensation (totcomp) and the number of staff per one thousand students (staff). Teacher compensation is a measure of teacher quality, and staff size is a rough measure of how much attention students receive. The table below reports the results. Answer the initial question.

Dependent Variable: MATH10; Method: Least Squares; Sample: 1 408

Variable    Coefficient   Std. Error   t-Statistic   Prob.
C           2.274021      6.113794     0.371949      0.7101
TOTCOMP     0.000459      0.000100     4.570030      0.0000
STAFF       0.047920      0.039814     1.203593      0.2295
ENROLL      -0.000198     0.000215     -0.917935     0.3592

R-squared 0.054063; Adjusted R-squared 0.047038; S.E. of regression 10.24384; Sum squared resid 42394.25; Log likelihood -1526.201; F-statistic 7.696528; Prob(F-statistic) 0.000052; Mean dependent var 24.10686; S.D. dependent var 10.49361; Akaike info criterion 7.500986; Schwarz criterion 7.540312; Hannan-Quinn criter. 7.516547; Durbin-Watson stat 1.668918
89
Exercise 2.19. We want to relate the median housing price (price) in the community to various community characteristics: nox is the amount of nitrous oxide in the air, in parts per million; dist is a weighted distance of the community from five employment centers, in miles; rooms is the average number of rooms in houses in the community; and stratio is the average student-teacher ratio of schools in the community. Can we conclude that the elasticity of price with respect to nox is -1? (Sample: 506 communities in the Boston area - see Wooldridge, chapter 4.)

Dependent Variable: LOG(PRICE); Method: Least Squares; Sample: 1 506

Variable     Coefficient   Std. Error   t-Statistic   Prob.
C            11.08386      0.318111     34.84271      0.0000
LOG(NOX)     -0.953539     0.116742     -8.167932     0.0000
LOG(DIST)    -0.134339     0.043103     -3.116693     0.0019
ROOMS        0.254527      0.018530     13.73570      0.0000
STRATIO      -0.052451     0.005897     -8.894399     0.0000

R-squared 0.584032; Adjusted R-squared 0.580711; S.E. of regression 0.265003; Sum squared resid 35.18346; Log likelihood -43.49487; F-statistic 175.8552; Prob(F-statistic) 0.000000; Mean dependent var 9.941057; S.D. dependent var 0.409255; Akaike info criterion 0.191679; Schwarz criterion 0.233444; Hannan-Quinn criter. 0.208059; Durbin-Watson stat 0.681595
90
2.6.4 Test on a Set of Parameters I

Suppose that we have a joint null hypothesis about β:

H0 : Rβ = r vs. H1 : Rβ ≠ r

(where Rβ is p × 1 and R is p × K). The test statistic is

F⁰ = (Rb − r)′(R(X′X)⁻¹R′)⁻¹(Rb − r)/(ps²).

Let Fobs be the observed test statistic. We have:

reject H0 if Fobs > Fα (or if p-value < α);
do not reject H0 if Fobs ≤ Fα.

The reasoning is as follows. Under the null hypothesis we have

F⁰ ∼ F(p, n − K).

If we observe F⁰ > Fα and H0 is true, then a low-probability event has occurred.
91
In the case p = 1 (single linear combination of the elements of β) one may use the test statistic

t⁰ = (Rb − Rβ)/(s√(R(X′X)⁻¹R′)) ∼ t(n − K).

Example. We consider a simple model to compare the returns to education at junior colleges and four-year colleges; for simplicity, we refer to the latter as “universities” (see Wooldridge, chap. 4). The model is

log(wagesi) = β1 + β2jci + β3univi + β4experi + εi.

The population includes working people with a high school degree. jc is the number of years attending a two-year college and univ is the number of years at a four-year college. Note that any combination of junior college and college is allowed, including jc = 0 and univ = 0. The hypothesis of interest is whether a year at a junior college is worth a year at a university: this is stated as H0 : β2 = β3. Under H0, another year at a junior college and another year at a university lead to the same ceteris paribus percentage increase in wage. The alternative of interest is one-sided: a year at a junior college is worth less than a year at a university. This is stated as H1 : β2 < β3.
92
Dependent Variable: LWAGE; Method: Least Squares; Sample: 1 6763

Variable    Coefficient   Std. Error   t-Statistic   Prob.
C           1.472326      0.021060     69.91020      0.0000
JC          0.066697      0.006829     9.766984      0.0000
UNIV        0.076876      0.002309     33.29808      0.0000
EXPER       0.004944      0.000157     31.39717      0.0000

R-squared 0.222442; Adjusted R-squared 0.222097; S.E. of regression 0.430138; Sum squared resid 1250.544; Log likelihood -3888.687; F-statistic 644.5330; Prob(F-statistic) 0.000000; Mean dependent var 2.248096; S.D. dependent var 0.487692; Akaike info criterion 1.151172; Schwarz criterion 1.155205; Hannan-Quinn criter. 1.152564; Durbin-Watson stat 1.968444

(X′X)⁻¹ = [  0.0023972        −9.4121 × 10⁻⁵   −8.50437 × 10⁻⁵  −1.6780 × 10⁻⁵
            −9.41217 × 10⁻⁵    0.0002520        1.04201 × 10⁻⁵  −9.2871 × 10⁻⁸
            −8.50437 × 10⁻⁵    1.0420 × 10⁻⁵    2.88090 × 10⁻⁵   2.12598 × 10⁻⁷
            −1.67807 × 10⁻⁵   −9.2871 × 10⁻⁸    2.1259 × 10⁻⁷    1.3402 × 10⁻⁷ ]

Under the null, the test statistic is

t⁰ = (Rb − Rβ)/(s√(R(X′X)⁻¹R′)) ∼ t(n − K).
93
We have

R = [0 1 −1 0]

√(R(X′X)⁻¹R′) = 0.016124827

s√(R(X′X)⁻¹R′) = 0.430138 × 0.016124827 = 0.006936

Rb = [0 1 −1 0] (1.472326, 0.066697, 0.076876, 0.004944)′ = b2 − b3 = −0.01018

Rβ = [0 1 −1 0] (β1, β2, β3, β4)′ = β2 − β3 = 0 (under H0)

tobs = −0.01018/0.006936 = −1.467

−t0.05 = −1.645.

We do not reject H0 at the 5% level. There is no evidence against β2 = β3 at the 5% level.
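The computation above can be reproduced from the reported coefficients and (X′X)⁻¹ alone:

```python
import numpy as np

# Reported OLS coefficients (C, JC, UNIV, EXPER) and (X'X)^{-1} from the output
b = np.array([1.472326, 0.066697, 0.076876, 0.004944])
XtX_inv = np.array([
    [ 0.0023972,   -9.4121e-5,  -8.50437e-5, -1.6780e-5],
    [-9.41217e-5,   0.0002520,   1.04201e-5, -9.2871e-8],
    [-8.50437e-5,   1.0420e-5,   2.88090e-5,  2.12598e-7],
    [-1.67807e-5,  -9.2871e-8,   2.1259e-7,   1.3402e-7],
])
s = 0.430138                       # S.E. of regression
R = np.array([0.0, 1.0, -1.0, 0.0])

# t statistic for H0: beta2 - beta3 = 0
t_obs = (R @ b) / (s * np.sqrt(R @ XtX_inv @ R))
print(round(t_obs, 3))             # about -1.468
```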
94
Remark: in this exercise t⁰ can be written as

t⁰ = Rb/(s√(R(X′X)⁻¹R′)) = (b2 − b3)/√(V̂ar(b2 − b3)) = (b2 − b3)/SE(b2 − b3).

Exercise 2.20 (continuation). Propose another way to test H0 : β2 = β3 against H1 : β2 < β3 along the following lines: define θ = β2 − β3; write β2 = θ + β3; plug this into the equation log(wagesi) = β1 + β2jci + β3univi + β4experi + εi and test θ = 0. Use the database available on the webpage of the course.
95
2.6.5 Test on a Set of Parameters II

We focus on another way to test

H0 : Rβ = r vs. H1 : Rβ ≠ r

(where Rβ is p × 1 and R is p × K). It can be proved that

F⁰ = (Rb − r)′(R(X′X)⁻¹R′)⁻¹(Rb − r)/(ps²)
   = [(e∗′e∗ − e′e)/p] / [e′e/(n − K)]
   = [(R² − R²∗)/p] / [(1 − R²)/(n − K)]
   ∼ F(p, n − K)

where ∗ refers to the short regression, i.e. the regression subjected to the constraint Rβ = r.
96
Example. Consider once again the equation log(wagesi) = β1 + β2jci + β3univi + β4experi + εi and H0 : β2 = β3 against H1 : β2 ≠ β3. The results of the regression subjected to the constraint H0 : β2 = β3 are

Dependent Variable: LWAGE; Method: Least Squares; Sample: 1 6763

Variable    Coefficient   Std. Error   t-Statistic   Prob.
C           1.471970      0.021061     69.89198      0.0000
JC+UNIV     0.076156      0.002256     33.75412      0.0000
EXPER       0.004932      0.000157     31.36057      0.0000

R-squared 0.222194; Adjusted R-squared 0.221964; S.E. of regression 0.430175; Sum squared resid 1250.942; Log likelihood -3889.764; F-statistic 965.5576; Prob(F-statistic) 0.000000; Mean dependent var 2.248096; S.D. dependent var 0.487692; Akaike info criterion 1.151195; Schwarz criterion 1.154220; Hannan-Quinn criter. 1.152239; Durbin-Watson stat 1.968481

We have p = 1, e′e = 1250.544, e∗′e∗ = 1250.942 and

Fobs = [(e∗′e∗ − e′e)/p] / [e′e/(n − K)] = [(1250.942 − 1250.544)/1] / [1250.544/(6763 − 4)] = 2.151,

F0.05 = 3.84.

We do not reject the null at the 5% level, since Fobs = 2.151 < F0.05 = 3.84.
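The arithmetic of the SSR-based F statistic can be reproduced directly from the reported sums of squared residuals:

```python
# SSR-based F test of H0: beta2 = beta3 (one restriction), using the
# reported sums of squared residuals of the restricted and unrestricted fits
ssr_r, ssr_u = 1250.942, 1250.544
p, n, K = 1, 6763, 4

F_obs = ((ssr_r - ssr_u) / p) / (ssr_u / (n - K))
print(round(F_obs, 3))              # about 2.151
```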
97
In the case “all slopes zero” (test of significance of the complete regression), it can be proved that F⁰ equals

F⁰ = [R²/(K − 1)] / [(1 − R²)/(n − K)].

Under the null H0 : βk = 0, k = 2, 3, ..., K, we have F⁰ ∼ F(K − 1, n − K).
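As a quick check of this formula, plugging in the figures reported in Exercise 2.21 below (R² = 0.300503, n = 500, K = 3) essentially reproduces the printed F-statistic:

```python
# Overall-significance F computed from R^2 alone:
# F = [R^2/(K-1)] / [(1-R^2)/(n-K)]
R2, n, K = 0.300503, 500, 3

F = (R2 / (K - 1)) / ((1 - R2) / (n - K))
print(round(F, 2))                  # about 106.76, matching the reported F-statistic
```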
Exercise 2.21. Consider the results:
Dependent Variable: Y; Method: Least Squares; Sample: 1 500

Variable    Coefficient   Std. Error   t-Statistic   Prob.
C           0.952298      0.237528     4.009200      0.0001
X2          1.322678      1.686759     0.784154      0.4333
X3          2.026896      1.701543     1.191210      0.2341

R-squared 0.300503; Adjusted R-squared 0.297688; S.E. of regression 5.311080; Sum squared resid 14019.16; Log likelihood -1542.862; F-statistic 106.7551; Prob(F-statistic) 0.000000; Mean dependent var 0.975957; S.D. dependent var 6.337496; Akaike info criterion 6.183449; Schwarz criterion 6.208737; Hannan-Quinn criter. 6.193372; Durbin-Watson stat 2.052601

Test: (a) H0 : β2 = 0 vs. H1 : β2 ≠ 0; (b) H0 : β3 = 0 vs. H1 : β3 ≠ 0; (c) H0 : β2 = 0, β3 = 0 vs. H1 : at least one of β2, β3 ≠ 0. (d) Are xi2 and xi3 truly relevant variables? How would you explain the results you obtained in parts (a), (b) and (c)?
98
2.7 Relation to Maximum Likelihood

Having specified the distribution of the error vector, we can use the maximum likelihood (ML) principle to estimate the model parameters θ = (β′, σ²)′.

2.7.1 The Maximum Likelihood Principle

ML principle: choose the parameter estimates to maximize the probability of obtaining the data. Maximizing the joint density associated with the data, f(y, X; θ), leads to the same solution. Therefore:

ML estimator of θ = argmax over θ of f(y, X; θ).
99
Example (without X). We flipped a coin 10 times. If heads then y = 1. Obviously y ∼ Bernoulli(θ). We don't know if the coin is fair, so we treat E(Y) = θ as an unknown parameter. Suppose that Σ(i=1 to 10) yi = 6. We have

f(y; θ) = f(y1, ..., yn; θ) = Π(i=1 to n) f(yi; θ) = θ^y1 (1 − θ)^(1−y1) × ... × θ^yn (1 − θ)^(1−yn)
        = θ^(Σi yi) (1 − θ)^(10 − Σi yi) = θ⁶(1 − θ)⁴.

[Figure: the joint density θ⁶(1 − θ)⁴ plotted as a function of θ ∈ [0, 1]; it peaks at θ = 0.6, where its value is about 0.0012.]
100
To obtain the ML estimate of θ we proceed with:

d[θ⁶(1 − θ)⁴]/dθ = 0 ⇔ θ̂ = 6/10

and since

d²[θ⁶(1 − θ)⁴]/dθ² < 0 at θ̂,

θ̂ = 0.6 maximizes f(y; θ). θ̂ is the “most likely” value of θ, that is, the value that maximizes the probability of observing (y1, ..., y10). Notice that the ML estimator is ȳ.

Since log x, x > 0, is a strictly increasing function we have: θ̂ maximizes f(y; θ) iff θ̂ maximizes log f(y; θ), that is

θ̂ = argmax over θ of f(y, X; θ) ⇔ θ̂ = argmax over θ of log f(y, X; θ).

In most cases we prefer to solve max over θ of log f(y, X; θ) rather than max over θ of f(y, X; θ), since the log transformation greatly simplifies the likelihood (products become sums).
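The coin example can be replicated with a simple grid search over the log likelihood (a sketch; the analytic solution is of course θ̂ = ȳ = 0.6):

```python
import math

# Log-likelihood of 6 heads in 10 Bernoulli trials: 6*log(theta) + 4*log(1-theta)
def log_lik(theta: float) -> float:
    return 6 * math.log(theta) + 4 * math.log(1 - theta)

# Grid search over (0, 1); the maximizer should be the sample mean 0.6
grid = [i / 1000 for i in range(1, 1000)]
theta_hat = max(grid, key=log_lik)
print(theta_hat)                    # 0.6
```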
101
2.7.2 Conditional versus Unconditional Likelihood

The joint density f(y, X; ζ) is in general difficult to handle. Consider:

f(y, X; ζ) = f(y|X; θ) f(X; ψ),  ζ = (θ′, ψ′)′,
log f(y, X; ζ) = log f(y|X; θ) + log f(X; ψ).

In general we don't know f(X; ψ).

Example. Consider yi = β1xi1 + β2xi2 + εi where

εi|X ∼ N(0, σ²) ⇒ yi|X ∼ N(x′iβ, σ²),
X ∼ N(μx, σ²xI).

Thus,

θ = (β′, σ²)′,  ψ = (μ′x, σ²x)′,  ζ = (θ′, ψ′)′.

If there is no functional relationship between θ and ψ (such as a subset of ψ being a function of θ), then maximizing log f(y, X; ζ) with respect to ζ is achieved by separately maximizing f(y|X; θ) with respect to θ and maximizing f(X; ψ) with respect to ψ. Thus the ML estimate of θ also maximizes the conditional likelihood f(y|X; θ).
102
2.7.3 The Log Likelihood for the Regression Model

Assumption 1.5 (the normality assumption) together with Assumptions 1.2 and 1.4 imply that the distribution of ε conditional on X is N(0, σ²I). Thus,

ε|X ∼ N(0, σ²I) ⇒ y|X ∼ N(Xβ, σ²I) ⇒

f(y|X; θ) = (2πσ²)^(−n/2) exp(−(1/(2σ²))(y − Xβ)′(y − Xβ)) ⇒

log f(y|X; θ) = −(n/2) log(2πσ²) − (1/(2σ²))(y − Xβ)′(y − Xβ).

It can be proved that

log f(y|X; θ) = Σ(i=1 to n) log f(yi|xi) = −(n/2) log(2πσ²) − (1/(2σ²)) Σ(i=1 to n) (yi − x′iβ)².

Proposition (1.5 - ML Estimator of β and σ²). Suppose Assumptions 1.1-1.5 hold. Then,

ML estimator of β = (X′X)⁻¹X′y,
ML estimator of σ² = e′e/n ≠ s² = e′e/(n − K).
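A minimal simulated sketch contrasting the ML variance estimator e′e/n with the unbiased s² = e′e/(n − K) (the data-generating values below are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(3)
n, K = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, -0.5]) + rng.normal(size=n)

# Under normality, OLS coefficients are also the ML estimator of beta
b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b

sigma2_ml = e @ e / n               # ML estimator (biased downward)
s2 = e @ e / (n - K)                # unbiased estimator
print(sigma2_ml < s2)               # True: e'e/n is always the smaller of the two
```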
103
We know that E(s²) = σ². Therefore:

• E(e′e/n) ≠ σ²;

• lim as n → ∞ of E(e′e/n) = σ².

Proposition (1.6 - b is the Best Unbiased Estimator, BUE). Under Assumptions 1.1-1.5, the OLS estimator b of β is BUE in that any other unbiased (but not necessarily linear) estimator has larger conditional variance in the matrix sense.

This result should be distinguished from the Gauss-Markov Theorem that b is minimum variance among those estimators that are unbiased and linear in y. Proposition 1.6 says that b is minimum variance in a larger class of estimators that includes nonlinear unbiased estimators. This stronger statement is obtained under the normality assumption (Assumption 1.5), which is not assumed in the Gauss-Markov Theorem. Put differently, the Gauss-Markov Theorem does not exclude the possibility of some nonlinear estimator beating OLS, but this possibility is ruled out by the normality assumption.
104
Exercise 2.22. Suppose yi = x′iβ + εi where εi|X ∼ t(v). Assume that Assumptions 1.1-1.4 hold. Use your intuition to answer “true” or “false” to the following statements:

(a) b is the BLUE;

(b) b is the BUE;

(c) the BUE estimator can only be obtained numerically (i.e. there is no closed formula for the BUE estimator).

Just out of curiosity, notice that the log-likelihood function is

Σ(i=1 to n) log f(yi|xi) = −(n/2) log σ² − (n/2) log π − (n/2) log(v − 2) + n log[Γ((v+1)/2)/Γ(v/2)]
    − ((v + 1)/2) Σ(i=1 to n) log[1 + (yi − x′iβ)²/((v − 2)σ²)].
105
2.8 Generalized Least Squares (GLS)

We have assumed that

E(ε²i|X) = Var(εi|X) = σ² > 0, ∀i  (homoskedasticity);
E(εiεj|X) = 0, ∀i, j; i ≠ j  (no correlation between observations).

Matrix notation:

E(εε′|X) = [ E(ε²1|X)    E(ε1ε2|X)   · · ·   E(ε1εn|X)
             E(ε1ε2|X)   E(ε²2|X)    · · ·   E(ε2εn|X)
             ...         ...         . . .   ...
             E(ε1εn|X)   E(ε2εn|X)   · · ·   E(ε²n|X) ]

          = [ σ²   0    · · ·   0
              0    σ²   · · ·   0
              ...  ...  . . .   ...
              0    0    · · ·   σ² ] = σ²I.
106
The assumption E(εε′|X) = σ²I is violated if either

• E(ε²i|X) depends on X → heteroskedasticity, or

• E(εiεj|X) ≠ 0 → serial correlation (we will analyze this case later).

Let's assume now that

E(εε′|X) = σ²V  (V depends on X).

The model y = Xβ + ε based on Assumptions 1.1-1.3 and E(εε′|X) = σ²V is called the generalized regression model.

Notice that by definition, we always have:

E(εε′|X) = Var(ε|X) = Var(y|X).
107
Example (case where E(ε²i|X) depends on X). Consider the following model

yi = β1 + β2xi2 + εi

to explain household expenditure on food (y) as a function of household income. Typical behavior: low-income households do not have the option of extravagant food tastes: they have few choices and are almost forced to spend a particular portion of their income on food; high-income households could have simple food tastes or extravagant food tastes: income by itself is likely to be relatively less important as an explanatory variable.

[Figure: scatter plot of expenditure (y, 0 to 20) against income (x, 6 to 13); the dispersion of y increases with income.]
108
If e accurately reflects the behavior of ε, the information in the previous figure suggests that the variability of yi increases as income increases; thus it is reasonable to suppose that

Var(yi|xi2) is a function of xi2.

This is the same as saying that

E(ε²i|xi2) is a function of xi2.

For example, if E(ε²i|xi2) = σ²x²i2 then

E(εε′|X) = σ² [ x²12  0     · · ·  0
                0     x²22  · · ·  0
                ...   ...   . . .  ...
                0     0     · · ·  x²n2 ] = σ²V ≠ σ²I,

where V is the diagonal matrix above.
109
2.8.1 Consequences of Relaxing Assumption 1.4

1. The Gauss-Markov Theorem no longer holds for the OLS estimator. The BLUE is some other estimator.

2. The t-ratio is not distributed as the t distribution. Thus, the t-test is no longer valid. The same comments apply to the F-test. Note that Var(b|X) is no longer σ²(X′X)⁻¹. In effect,

Var(b|X) = Var((X′X)⁻¹X′y|X) = (X′X)⁻¹X′ Var(y|X) X(X′X)⁻¹ = σ²(X′X)⁻¹X′VX(X′X)⁻¹.

On the other hand,

E(s²|X) = E(e′e|X)/(n − K) = tr(Var(e|X))/(n − K) = σ² tr(MVM)/(n − K) = σ² tr(MV)/(n − K).

The conventional standard errors are incorrect when Var(y|X) ≠ σ²I. Confidence region and hypothesis test procedures based on the classical regression model are not valid.
110
3. However, the OLS estimator is still unbiased, because the unbiasedness result (Proposition 1.1 (a)) does not require Assumption 1.4. In effect,

E(b|X) = (X′X)⁻¹X′ E(y|X) = (X′X)⁻¹X′Xβ = β,  E(b) = β.

Options in the presence of E(εε′|X) ≠ σ²I:

• Use b to estimate β and Var(b|X) = σ²(X′X)⁻¹X′VX(X′X)⁻¹ for inference purposes. Note that y|X ∼ N(Xβ, σ²V) implies

b|X ∼ N(β, σ²(X′X)⁻¹X′VX(X′X)⁻¹).

This is not a good solution: if you know V you may use a more efficient estimator, as we will see below. Later on, in the chapter “Large Sample Theory” we will find that σ²V may be replaced by a consistent estimator.

• Search for a better estimator of β.
111
2.8.2 Efficient Estimation with Known V

If the value of the matrix function V is known, a BLUE estimator for β, called generalized least squares (GLS), can be deduced. The basic idea of the derivation is to transform the generalized regression model into a model that satisfies all the assumptions, including Assumption 1.4, of the classical regression model. Consider

y = Xβ + ε,  E(εε′|X) = σ²V.

We should multiply both sides of the equation by a nonsingular matrix C (depending on X):

Cy = CXβ + Cε
ỹ = X̃β + ε̃

such that the transformed error ε̃ verifies E(ε̃ε̃′|X) = σ²I, i.e.

E(ε̃ε̃′|X) = E(Cεε′C′|X) = C E(εε′|X) C′ = σ²CVC′ = σ²I,

that is, CVC′ = I.
112
Given CVC′ = I, how do we find C? Since V is by construction symmetric and positive definite, there exists a nonsingular n × n matrix C such that

V = C⁻¹(C′)⁻¹  or  V⁻¹ = C′C.

Note

CVC′ = CC⁻¹(C′)⁻¹C′ = I.

It is easy to see that if y = Xβ + ε satisfies Assumptions 1.1-1.3 and Assumption 1.5 (but not Assumption 1.4), then

ỹ = X̃β + ε̃,  where ỹ = Cy, X̃ = CX,

satisfies Assumptions 1.1-1.5. Let

β̂GLS = (X̃′X̃)⁻¹X̃′ỹ = (X′V⁻¹X)⁻¹X′V⁻¹y.
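A sketch checking that OLS on the transformed data (ỹ, X̃) equals the closed-form GLS expression, with an assumed known diagonal V (all numbers are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 25
X = np.column_stack([np.ones(n), rng.normal(size=n)])
v = np.exp(rng.normal(size=n))              # assumed known diagonal of V
V = np.diag(v)
y = X @ np.array([2.0, 1.0]) + np.sqrt(v) * rng.normal(size=n)

# Transform with C = V^{-1/2} (so that C'C = V^{-1}), then run OLS on (Cy, CX)
C = np.diag(1 / np.sqrt(v))
b_transformed = np.linalg.lstsq(C @ X, C @ y, rcond=None)[0]

# Closed form: beta_GLS = (X'V^{-1}X)^{-1} X'V^{-1}y
Vinv = np.linalg.inv(V)
b_gls = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)
print(np.allclose(b_transformed, b_gls))    # True
```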
113
Proposition (1.7 - finite-sample properties of GLS). (a) (unbiasedness) Under Assumptions 1.1-1.3,

E(β̂GLS|X) = β.

(b) (expression for the variance) Under Assumptions 1.1-1.3 and the assumption E(εε′|X) = σ²V that the conditional second moment is proportional to V,

Var(β̂GLS|X) = σ²(X′V⁻¹X)⁻¹.

(c) (the GLS estimator is BLUE) Under the same set of assumptions as in (b), the GLS estimator is efficient in that the conditional variance of any unbiased estimator that is linear in y is greater than or equal to Var(β̂GLS|X) in the matrix sense.

Remark: Var(b|X) − Var(β̂GLS|X) is a positive semidefinite matrix. In particular,

Var(bj|X) ≥ Var(β̂j,GLS|X).
114
2.8.3 A Special Case: Weighted Least Squares (WLS)

Let's suppose that

E(ε²i|X) = σ²vi  (vi is a function of X).

Recall: C is such that V⁻¹ = C′C. We have

V = diag(v1, v2, ..., vn) ⇒ V⁻¹ = diag(1/v1, 1/v2, ..., 1/vn) ⇒ C = diag(1/√v1, 1/√v2, ..., 1/√vn).
115
Now

ỹ = Cy = diag(1/√v1, ..., 1/√vn) (y1, y2, ..., yn)′ = (y1/√v1, y2/√v2, ..., yn/√vn)′,

X̃ = CX = diag(1/√v1, ..., 1/√vn)
[ 1  x12  ···  x1K
  1  x22  ···  x2K
  ··· 
  1  xn2  ···  xnK ]
=
[ 1/√v1  x12/√v1  ···  x1K/√v1
  1/√v2  x22/√v2  ···  x2K/√v2
  ···
  1/√vn  xn2/√vn  ···  xnK/√vn ].

Another way to express these relations:

ỹi = yi/√vi,  x̃ik = xik/√vi,  i = 1, 2, ..., n.
116
Example. Suppose that yi = β1 + β2xi2 + εi, with

Var(yi|xi2) = Var(εi|xi2) = σ2 exp(xi2),  Cov(yi, yj|xi2, xj2) = 0,

V = diag(exp(x12), ..., exp(xi2), ..., exp(xn2)).

Transformed model (matrix notation), Cy = CXβ + Cε:

(y1/√exp(x12), ..., yn/√exp(xn2))′ = X̃ (β1, β2)′ + (ε1/√exp(x12), ..., εn/√exp(xn2))′,

where the i-th row of X̃ is (1/√exp(xi2), xi2/√exp(xi2)). Or, in scalar notation, starting from yi = xi1β1 + xi2β2 + εi (xi1 = 1):

yi/√exp(xi2) = β1/√exp(xi2) + β2 xi2/√exp(xi2) + εi/√exp(xi2),  i = 1, ..., n.
117
Notice:

Var(ε̃i|X) = Var(εi/√exp(xi2) | xi2) = (1/exp(xi2)) Var(εi|xi2) = (1/exp(xi2)) σ2 exp(xi2) = σ2.
Efficient estimation under a known form of heteroskedasticity is called weighted regression (or weighted least squares, WLS).
Example. Consider wagei = β1 + β2educi + β3experi + εi.
[Figure: scatter plots of WAGE against EXPER and of WAGE against EDUC.]
118
Dependent Variable: WAGE. Method: Least Squares. Sample: 1 526.

Variable   Coefficient   Std. Error   t-Statistic   Prob.
C          −3.390540     0.766566     −4.423023     0.0000
EDUC        0.644272     0.053806     11.97397      0.0000
EXPER       0.070095     0.010978     6.385291      0.0000

R-squared 0.225162; Adjusted R-squared 0.222199; S.E. of regression 3.257044; Sum squared resid 5548.160; Log likelihood −1365.969; F-statistic 75.98998; Prob(F-statistic) 0.000000; Mean dependent var 5.896103; S.D. dependent var 3.693086; Akaike info criterion 5.205204; Schwarz criterion 5.229531; Hannan-Quinn criter. 5.214729; Durbin-Watson stat 1.820274.
[Figure: scatter plot of the squared residuals (RES2) against EDUC.]
Assume Var(εi|educi, experi) = σ2 educi2. Transformed model:

wagei/educi = β1 (1/educi) + β2 (educi/educi) + β3 (experi/educi) + εi/educi,  i = 1, ..., n.
119
Dependent Variable: WAGE/EDUC. Method: Least Squares. Sample: 1 526 IF EDUC>0.

Variable     Coefficient   Std. Error   t-Statistic   Prob.
1/EDUC       0.709212      0.549861     1.289800      0.1977
EDUC/EDUC    0.443472      0.038098     11.64033      0.0000
EXPER/EDUC   0.055355      0.009356     5.916236      0.0000

R-squared 0.105221; Adjusted R-squared 0.101786; S.E. of regression 0.251777; Sum squared resid 33.02718; Log likelihood −19.31365; Mean dependent var 0.469856; S.D. dependent var 0.265660; Akaike info criterion 0.085167; Schwarz criterion 0.109564; Hannan-Quinn criter. 0.094721; Durbin-Watson stat 1.777416.
Exercise 2.23. Let {yi, i = 1, 2, ...} be a sequence of independent random variables with distribution N(β, σi2), where σi2 is known (note: we assume σ12 ≠ σ22 ≠ ...). When the variances are unequal, the sample mean ȳ is not the best linear unbiased estimator (BLUE). The BLUE has the form β̂ = Σi=1..n wi yi, where the wi are nonrandom weights. (a) Find a condition on the wi such that E(β̂) = β; (b) Find the optimal weights wi that make β̂ the BLUE. Hint: You may translate this problem into an econometric framework: if {yi} is a sequence of independent random variables with distribution N(β, σi2), then yi can be represented by the equation yi = β + εi, where εi ∼ N(0, σi2). Then find the GLS estimator of β.
120
Exercise 2.24. Consider

yi = βxi1 + εi,  β > 0,

and assume E(εi|X) = 0, Var(εi|X) = 1 + |xi1|, Cov(εi, εj|X) = 0. (a) Suppose we have many observations and plot yi against xi1. What would the scatter plot look like? (b) Propose an unbiased estimator with minimum variance; (c) Suppose we have the 3 following observations of (xi1, yi): (0, 0), (3, 1) and (8, 5). Estimate the value of β from these 3 observations.
Exercise 2.25. Consider

yt = β1 + β2 t + εt,  Var(εt) = σ2 t2,  t = 1, ..., 20.

Find σ2(X′X)−1, Var(b|X) and Var(β̂GLS|X) and comment on the results. Solution:

σ2(X′X)−1 = σ2 [ 0.215 −0.01578; −0.01578 0.0015 ],
Var(b|X) = σ2 [ 13.293 −1.6326; −1.6326 0.25548 ],
Var(β̂GLS|X) = σ2 [ 1.0537 −0.1895; −0.1895 0.0840 ].
121
Exercise 2.26. A researcher first ran an OLS regression. Then she was given the true V matrix. She transformed the data appropriately and obtained the GLS estimator. For several coefficients, standard errors in the second regression were larger than those in the first regression. Does this contradict Proposition 1.7? See the previous exercise.
2.8.4 Limiting Nature of GLS
• Finite-sample properties of GLS rest on the assumption that the regressors are strictly exogenous. In time-series models the regressors are not strictly exogenous and the error is serially correlated.
• In practice, the matrix function V is unknown.
• V can be estimated from the sample. This approach is called Feasible Generalized Least Squares (FGLS). But if the function V is estimated from the sample, its value V̂ becomes a random variable, which affects the distribution of the GLS estimator. Very little is known about the finite-sample properties of the FGLS estimator. We need to use the large-sample properties ...
122
3 Large-Sample Theory
The finite-sample theory breaks down if one of the following three assumptions is violated:
1. the exogeneity of regressors,
2. the normality of the error term, and
3. the linearity of the regression equation.
This chapter develops an alternative approach based on large-sample theory (n is "sufficiently large").
123
3.1 Review of Limit Theorems for Sequences of Random Variables
3.1.1 Convergence in Probability, in Mean Square, and in Distribution
Convergence in Probability
A sequence of random scalars {zn} converges in probability to a constant (non-random) α if, for any ε > 0,

limn→∞ P(|zn − α| > ε) = 0.

We write

zn →p α  or  plim zn = α.

As we will see, zn is usually a sample mean:

zn = (Σi=1..n yi)/n  or  zn = (Σi=1..n zi)/n.
124
Example. Consider a fair coin. Let zi = 1 if the i-th toss results in heads and zi = 0 otherwise. Let z̄n = n−1 Σi=1..n zi. The following graph suggests that z̄n →p 1/2.
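A quick Monte-Carlo sketch of this convergence (the sample size and seed are arbitrary choices, not from the slides):

```python
import numpy as np

# Running sample mean of fair-coin tosses: z_bar_n should settle near 1/2.
rng = np.random.default_rng(1)
n = 100_000
z = rng.integers(0, 2, size=n)               # z_i = 1 for heads, 0 for tails
z_bar = z.cumsum() / np.arange(1, n + 1)     # running mean for each n

assert abs(z_bar[-1] - 0.5) < 0.01           # close to 1/2 for large n
```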
125
A sequence of K-dimensional vectors {zn} converges in probability to a K-dimensional vector of constants α if, for any ε > 0,

limn→∞ P(|znk − αk| > ε) = 0, ∀k.

We write zn →p α.

Convergence in Mean Square

A sequence of random scalars {zn} converges in mean square (or in quadratic mean) to α if

limn→∞ E[(zn − α)2] = 0.

The extension to random vectors is analogous to that for convergence in probability.
126
Convergence in Distribution
Let {zn} be a sequence of random scalars and Fn be the cumulative distribution function(c.d.f.) of zn, i.e. zn ∼ Fn. We say that {zn} converges in distribution to a random scalarz if the c.d.f. Fn, of zn , converges to the c.d.f. F of z at every continuity point of F . Wewrite
znd−→ z, where z ∼ F,
F is is the asymptotic (or limiting) distribution of z. If F is well-known, for example, if Fis the cumulative normal N (0, 1) distribution we prefer to write
znd−→ N (0, 1) (instead of zn
d−→ z and z ∼ N (0, 1)).
Example. Consider zn ∼ t(n). We know that znd−→ N (0, 1) .
In most applications zn is of type
zn =√n (y − E (yi)) .
Exercise 3.1. For zn =√n (y − E (yi)) calculate E (zn) and Var (zn) (assume E (yi) = µ,
Var (yi) = σ2 and {yi} is an i.i.d. sequence).
127
3.1.2 Useful Results
Lemma (2.3 - preservation of convergence for continuous transformations). Suppose f is a vector-valued continuous function that does not depend on n. Then:

(a) if zn →p α, then f(zn) →p f(α);
(b) if zn →d z, then f(zn) →d f(z).

An immediate implication of Lemma 2.3(a) is that the usual arithmetic operations preserve convergence in probability:

xn →p β, yn →p γ ⇒ xn + yn →p β + γ;
xn →p β, yn →p γ ⇒ xn yn →p βγ;
xn →p β, yn →p γ ⇒ xn/yn →p β/γ (γ ≠ 0);
Yn →p Γ ⇒ Yn−1 →p Γ−1 (Γ invertible).
128
Lemma (2.4). We have:

(a) xn →d x, yn →p α ⇒ xn + yn →d x + α;
(b) xn →d x, yn →p 0 ⇒ yn′ xn →p 0;
(c) xn →d x, An →p A ⇒ An xn →d Ax. In particular, if x ∼ N(0, Σ), then An xn →d N(0, AΣA′);
(d) xn →d x, An →p A ⇒ xn′ An−1 xn →d x′A−1x (A nonsingular).

If xn →p 0 we write xn = op(1).
If xn − yn →p 0 we write xn = yn + op(1).
In part (c) we may write An xn =d A xn (An xn and A xn have the same asymptotic distribution).
129
3.1.3 Viewing Estimators as Sequences of Random Variables
Let θ̂n be an estimator of a parameter vector θ based on a sample of size n. We say that the estimator θ̂n is consistent for θ if

θ̂n →p θ.

The asymptotic bias of θ̂n is defined as plimn→∞ θ̂n − θ. So if the estimator is consistent, its asymptotic bias is zero.
Wooldridge’s quotation:
While not all useful estimators are unbiased, virtually all economists agree that consistency is a minimal requirement for an estimator. The famous econometrician Clive W.J. Granger once remarked: "If you can't get it right as n goes to infinity, you shouldn't be in this business." The implication is that, if your estimator of a particular population parameter is not consistent, then you are wasting your time.
130
A consistent estimator θ̂n is asymptotically normal if

√n (θ̂n − θ) →d N(0, Σ).

Such an estimator is called √n-consistent. The variance matrix Σ is called the asymptotic variance and is denoted Avar(θ̂n), i.e.

limn→∞ Var(√n (θ̂n − θ)) = Avar(θ̂n) = Σ.

Some authors use the notation Avar(θ̂n) to mean Σ/n (which is zero in the limit).
131
3.1.4 Laws of Large Numbers and Central Limit Theorems
Consider

z̄n = (1/n) Σi=1..n zi.

We say that z̄n obeys the LLN if z̄n →p µ, where µ = E(zi) or limn E(z̄n) = µ.

• (A version of Chebychev's weak LLN) If lim E(z̄n) = µ and lim Var(z̄n) = 0, then z̄n →p µ.

• (Kolmogorov's second strong LLN) If {zi} is i.i.d. with E(zi) = µ, then z̄n →p µ.

These LLNs extend readily to random vectors by requiring element-by-element convergence.
132
Theorem 1 (Lindeberg-Levy CLT). Let {zi} be i.i.d. with E(zi) = µ and Var(zi) = Σ. Then

√n (z̄n − µ) = (1/√n) Σi=1..n (zi − µ) →d N(0, Σ).

Notice that

E(√n (z̄n − µ)) = 0 ⇒ E(z̄n) = µ,
Var(√n (z̄n − µ)) = Σ ⇒ Var(z̄n) = Σ/n.

Given the previous equations, some authors write

z̄n ∼a N(µ, Σ/n).
133
Example. Let {zi} be i.i.d. with distribution χ2(1). By the Lindeberg-Levy CLT (scalar case) we have

z̄n = (1/n) Σi=1..n zi ∼a N(µ, σ2/n),

where

E(z̄n) = (1/n) Σi=1..n E(zi) = E(zi) = µ = 1;
Var(z̄n) = Var((1/n) Σi=1..n zi) = (1/n) Var(zi) = σ2/n = 2/n.
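A Monte-Carlo sketch of this example (the number of replications R and the sample size n are arbitrary choices, not from the slides):

```python
import numpy as np

# sqrt(n)(z_bar_n - 1) for chi-square(1) samples should be approximately N(0, 2).
rng = np.random.default_rng(2)
n, R = 500, 20_000
z = rng.chisquare(df=1, size=(R, n))
stat = np.sqrt(n) * (z.mean(axis=1) - 1.0)   # sqrt(n)(z_bar_n - mu), mu = 1

assert abs(stat.mean()) < 0.05               # centred near 0
assert abs(stat.var() - 2.0) < 0.1           # variance near sigma^2 = 2
```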
134
[Figures: probability density function of z̄n (obtained by Monte-Carlo simulation), and probability density function of √n (z̄n − µ) (exact expressions for n = 5, 10 and 50).]
135
Example. In random sampling with sample size n = 30 on a variable z with E(z) = 10, Var(z) = 9 but unknown distribution, obtain an approximation to P(z̄n < 9.5). We do not know the exact distribution of z̄n. However, from the Lindeberg-Levy CLT we have

√n (z̄n − µ)/σ →d N(0, 1)  or  z̄n ∼a N(µ, σ2/n).

Hence,

P(z̄n < 9.5) = P(√n (z̄n − µ)/σ < √30 (9.5 − 10)/3) ≈ Φ(−0.9128) = 0.1807,

where Φ is the c.d.f. of N(0, 1).
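The approximation above can be checked numerically, writing the standard normal c.d.f. via the error function, Φ(x) = (1 + erf(x/√2))/2:

```python
import math

# CLT approximation P(z_bar_n < 9.5) ≈ Φ(sqrt(n)(9.5 - mu)/sigma)
n, mu, sigma = 30, 10.0, 3.0
x = math.sqrt(n) * (9.5 - mu) / sigma            # ≈ -0.913
prob = 0.5 * (1.0 + math.erf(x / math.sqrt(2)))  # Φ(x)

assert abs(prob - 0.1807) < 1e-3
```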
136
3.2 Fundamental Concepts in Time-Series Analysis
Stochastic process (SP): a sequence of random variables. For this reason, it is more adequate to write an SP as {zi} (meaning a sequence of random variables) rather than zi (meaning the random variable at time i).
137
3.2.1 Various Classes of Stochastic processes
Definition (Stationary Processes). An SP {zi} is (strictly) stationary if the joint distribution of (z1, z2, ..., zs) equals that of (zk+1, zk+2, ..., zk+s) for any s ∈ N and k ∈ Z.

Exercise 3.2. Consider an SP {zi} where E(|g(zi)|) < ∞. Show that if {zi} is a strictly stationary process then E(g(zi)) is constant and does not depend on i.

The definition implies that any transformation (function) of a stationary process is itself stationary; that is, if {zi} is stationary, then so is {g(zi)}. For example, if {zi} is stationary then {zi zi′} is also stationary.

Definition (Covariance Stationary Processes). A stochastic process {zi} is weakly (or covariance) stationary if: (i) E(zi) does not depend on i, and (ii) Cov(zi, zi−j) exists, is finite, and depends only on j but not on i.

If {zi} is a covariance stationary process then Cov(z1, z5) = Cov(z1001, z1005).

A transformation (function) of a covariance stationary process may or may not be a covariance stationary process.
138
Example. It can be proved that {zi}, zi = √(α0 + α1 zi−12) εi, where {εi} is i.i.d. with mean zero and unit variance, α0 > 0 and √(1/3) ≤ α1 < 1, is a covariance stationary process. However, wi = zi2 is not a covariance stationary process, as E(wi2) does not exist.

Exercise 3.3. Consider the SP {ut} where

ut = ξt if t ≤ 2000,  ut = √((k−2)/k) ζt if t > 2000,

where ξt and ζs are independent for all t and s, ξt ∼ i.i.d. N(0, 1) and ζs ∼ i.i.d. t(k). Explain why {ut} is weakly (or covariance) stationary but not strictly stationary.

Definition (White Noise Processes). A white noise process {zi} is a covariance stationary process with zero mean and no serial correlation:

E(zi) = 0,  Cov(zi, zj) = 0 for i ≠ j.
139
[Figure: sample paths of four simulated processes (series Y and Y5).]
140
In the literature there is not a unique definition of ergodicity. We prefer to call "weakly dependent process" what Hayashi calls an "ergodic process".
Definition. A stationary process {zi} is said to be a weakly dependent process (= ergodic in Hayashi's definition) if, for any two bounded functions f: Rk+1 → R and g: Rs+1 → R,

limn→∞ |E[f(zi, ..., zi+k) g(zi+n, ..., zi+n+s)]| = limn→∞ |E(f(zi, ..., zi+k))| × |E(g(zi+n, ..., zi+n+s))|.
Theorem 2 (S&WD). Let {zi} be a stationary and weakly dependent (S&WD) process with E(zi) = µ. Then z̄n →p µ.

Serial dependence, which is ruled out by the i.i.d. assumption in Kolmogorov's LLN, is allowed in this theorem, provided that it disappears in the long run. Since, for any function f, {f(zi)} is S&WD whenever {zi} is, this theorem implies that any moment of an S&WD process (if it exists and is finite) is consistently estimated by the sample moment. For example, suppose {zi} is an S&WD process and E(zi zi′) exists and is finite. Then

(1/n) Σi=1..n zi zi′ →p E(zi zi′).
141
Definition (Martingale). A vector process {zi} is called a martingale if

E(zi | zi−1, ..., z1) = zi−1 for i ≥ 2.

The process

zi = zi−1 + εi,

where {εi} is a white noise process with E(εi | zi−1) = 0, is a martingale, since

E(zi | zi−1, ..., z1) = E(zi | zi−1) = zi−1 + E(εi | zi−1) = zi−1.

Definition (Martingale Difference Sequence). A vector process {gi} with E(gi) = 0 is called a martingale difference sequence (MDS) or martingale differences if

E(gi | gi−1, ..., g1) = 0.

If {zi} is a martingale, the process defined as ∆zi = zi − zi−1 is an MDS.

Proposition. If {gi} is an MDS then Cov(gi, gi−j) = 0, j ≠ 0.
142
By definition,

Var(ḡn) = (1/n2) Var(Σt=1..n gt) = (1/n2) [ Σt=1..n Var(gt) + 2 Σj=1..n−1 Σi=j+1..n Cov(gi, gi−j) ].

However, if {gi} is a stationary MDS with finite second moment, then

Σt=1..n Var(gt) = n Var(gt),  Cov(gi, gi−j) = 0,

so

Var(ḡn) = (1/n) Var(gt).

Definition (Random Walk). Let {gi} be a vector independent white noise process. A random walk {zi} is a sequence of cumulative sums:

zi = gi + gi−1 + ... + g1.

Exercise 3.4. Show that the random walk can be written as

zi = zi−1 + gi, z1 = g1.
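A minimal numerical sketch of the two equivalent representations of the random walk (Gaussian innovations are an assumption for the demo):

```python
import numpy as np

# Random walk as the cumulative sum of an independent white noise process.
rng = np.random.default_rng(3)
g = rng.normal(size=1000)                 # innovations g_1, ..., g_n
z = np.cumsum(g)                          # z_i = g_i + g_{i-1} + ... + g_1

# Equivalent recursion z_i = z_{i-1} + g_i with z_1 = g_1
z_rec = np.empty_like(g)
z_rec[0] = g[0]
for i in range(1, len(g)):
    z_rec[i] = z_rec[i - 1] + g[i]

assert np.allclose(z, z_rec)              # the two representations coincide
```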
143
3.2.2 Different Formulation of Lack of Serial Dependence
We have three formulations of a lack of serial dependence for zero-mean covariance stationary processes:

(1) {gi} is independent white noise;
(2) {gi} is a stationary MDS with finite variance;
(3) {gi} is white noise.

(1) ⇒ (2) ⇒ (3).
Exercise 3.5 (Process that satisfies (2) but not (1) - the ARCH process). Consider gi = √(α0 + α1 gi−12) εi, where {εi} is i.i.d. with mean zero and unit variance, α0 > 0 and |α1| < 1. Show that {gi} is an MDS but not an independent white noise process.
144
3.2.3 The CLT for S&WD Martingale Difference Sequences
Theorem 3 (Stationary Martingale Differences CLT (Billingsley, 1961)). Let {gi} be a vector martingale difference sequence that is an S&WD process with E(gi gi′) = Σ, and let ḡn = (1/n) Σ gi. Then

√n ḡn = (1/√n) Σi=1..n gi →d N(0, Σ).

Theorem 4 (Martingale Differences CLT (White, 1984)). Let {gi} be a vector martingale difference sequence. Suppose that (a) E(gi gi′) = Σi is a positive definite matrix with (1/n) Σi=1..n Σi → Σ (a positive definite matrix), (b) gi has finite 4th moments, and (c) (1/n) Σ gi gi′ →p Σ. Then

√n ḡn = (1/√n) Σi=1..n gi →d N(0, Σ).
145
3.3 Large-Sample Distribution of the OLS Estimator
The model presented in this section has probably the widest range of economic applications:
• No specific distributional assumption (such as the normality of the error term) is required;
• The requirement in finite-sample theory that the regressors be strictly exogenous or fixed is replaced by a much weaker requirement that they be "predetermined."
Assumption (2.1 - linearity). yi = xi′β + εi.

Assumption (2.2 - S&WD). {(yi, xi)} is jointly S&WD.

Assumption (2.3 - predetermined regressors). All the regressors are predetermined in the sense that they are orthogonal to the contemporaneous error term: E(xik εi) = 0, ∀i, k. This can be written as

E(xi εi) = 0 or E(gi) = 0, where gi = xi εi.

Assumption (2.4 - rank condition). E(xi xi′) = Σxx is nonsingular.
146
Assumption (2.5 - {gi} is a martingale difference sequence with finite second moments). {gi}, where gi = xi εi, is a martingale difference sequence (so, a fortiori, E(gi) = 0). The K × K matrix of cross moments E(gi gi′) is nonsingular. We use S for Avar(ḡ) (the variance of √n ḡ, where ḡ = (1/n) Σ gi). By Assumption 2.2 and the S&WD Martingale Differences CLT, S = E(gi gi′).

Remarks:

1. (S&WD) A special case of S&WD is that {(yi, xi)} is i.i.d. (random sampling in cross-sectional data).

2. (The model accommodates conditional heteroskedasticity) If {(yi, xi)} is stationary, then the error term εi = yi − xi′β is also stationary. The conditional moment E(εi2|xi) can depend on xi without violating any previous assumption, as long as E(εi2) is constant.
147
3. (E(xi εi) = 0 vs. E(εi|xi) = 0) The condition E(εi|xi) = 0 is stronger than E(xi εi) = 0. In effect,

E(xi εi) = E(E(xi εi|xi)) = E(xi E(εi|xi)) = E(xi · 0) = 0.

4. (Predetermined vs. strictly exogenous regressors) Assumption 2.3 restricts only the contemporaneous relationship between the error term and the regressors. The strict exogeneity assumption (Assumption 1.2) implies that, for any regressor k, E(xjk εi) = 0 for all i and j, not just for i = j. Strict exogeneity is a strong assumption that does not hold in general for time-series models.
148
5. (Rank condition as no multicollinearity in the limit) Since

b = (X′X/n)−1 (X′y/n) = ((1/n) Σ xi xi′)−1 (1/n) Σ xi yi = Sxx−1 Sxy,

where

Sxx = X′X/n = (1/n) Σ xi xi′ (sample average of xi xi′),
Sxy = X′y/n = (1/n) Σ xi yi (sample average of xi yi).

By Assumptions 2.2 and 2.4 and the S&WD theorem, we have

X′X/n = (1/n) Σi=1..n xi xi′ →p E(xi xi′).

Assumption 2.4 guarantees that the limit in probability of X′X/n has rank K.
149
6. (A sufficient condition for {gi} to be an MDS) Since an MDS is zero-mean by definition, Assumption 2.5 is stronger than Assumption 2.3 (the latter is redundant in face of Assumption 2.5). We will need Assumption 2.5 to prove the asymptotic normality of the OLS estimator. A sufficient condition for {gi} to be an MDS is

E(εi|Fi) = 0, where
Fi = Ii−1 ∪ xi = {εi−1, εi−2, ..., ε1, xi, xi−1, ..., x1},
Ii−1 = {εi−1, εi−2, ..., ε1, xi−1, ..., x1}.

(This condition implies that the error term is serially uncorrelated and also is uncorrelated with the current and past regressors.) Proof. Notice: {gi} is an MDS if

E(gi | gi−1, ..., g1) = 0, gi = xi εi.

Now, using the condition E(εi|Fi) = 0,

E(xi εi | gi−1, ..., g1) = E[E(xi εi|Fi) | gi−1, ..., g1] = E[0 | gi−1, ..., g1] = 0;

thus E(εi|Fi) = 0 ⇒ {gi} is an MDS.
150
7. (When the regressors include a constant) Assumption 2.5 implies

E(xi εi | gi−1, ..., g1) = E[(1, ..., xiK)′ εi | gi−1, ..., g1] = 0 ⇒ E(εi | gi−1, ..., g1) = 0,
E(εi | εi−1, ..., ε1) = E(E(εi | gi−1, ..., g1) | εi−1, ..., ε1) = 0.

Assumption 2.5 thus implies that the error term itself is an MDS and hence is serially uncorrelated.
8. (S is a matrix of fourth moments)

S = E(gi gi′) = E(xi εi xi′ εi) = E(εi2 xi xi′).

Consistent estimation of S will require an additional assumption.
151
9. (S takes a different expression without Assumption 2.5) In general,

Avar(ḡ) = lim Var(√n ḡ) = lim Var((1/√n) Σi=1..n gi) = lim (1/n) Var(Σi=1..n gi)
= lim (1/n) [ Σi=1..n Var(gi) + Σj=1..n−1 Σi=j+1..n (Cov(gi, gi−j) + Cov(gi−j, gi)) ]
= lim (1/n) Σi=1..n Var(gi) + lim (1/n) Σj=1..n−1 Σi=j+1..n (E(gi gi−j′) + E(gi−j gi′)).

Given stationarity, we have

(1/n) Σi=1..n Var(gi) = Var(gi).

Thanks to Assumption 2.5 we have E(gi gi−j′) = E(gi−j gi′) = 0, so

S = Avar(ḡ) = Var(gi) = E(gi gi′).
152
Proposition (2.1 - asymptotic distribution of the OLS estimator). (a) (Consistency of b for β) Under Assumptions 2.1-2.4,

b →p β.

(b) (Asymptotic normality of b) If Assumption 2.3 is strengthened to Assumption 2.5, then

√n (b − β) →d N(0, Avar(b)),

where

Avar(b) = Σxx−1 S Σxx−1.

(c) (Consistent estimation of Avar(b)) Suppose there is available a consistent estimator Ŝ of S. Then, under Assumption 2.2, Avar(b) is consistently estimated by

Âvar(b) = Sxx−1 Ŝ Sxx−1,

where

Sxx = X′X/n = (1/n) Σi=1..n xi xi′.
153
Proposition (2.2 - consistent estimation of error variance). Under Assumptions 2.1-2.4,

s2 = (1/(n−K)) Σi=1..n ei2 →p E(εi2),

provided E(εi2) exists and is finite.

Under conditional homoskedasticity E(εi2|xi) = σ2 (we will see this in detail later) we have

S = E(gi gi′) = E(εi2 xi xi′) = ... = σ2 E(xi xi′) = σ2 Σxx

and

Avar(b) = Σxx−1 S Σxx−1 = Σxx−1 σ2 Σxx Σxx−1 = σ2 Σxx−1,
Âvar(b) = s2 (X′X/n)−1 = s2 n (X′X)−1.

Thus

b ∼a N(β, Âvar(b)/n) = N(β, s2 (X′X)−1).
154
3.4 Statistical Inference
Derivation of the distribution of test statistics is easier than in finite-sample theory because we are only concerned about the large-sample approximation to the exact distribution.
Proposition (2.3 - robust t-ratio and Wald statistic). Suppose Assumptions 2.1-2.5 hold, and suppose there is available a consistent estimator Ŝ of S. As before, let Âvar(b) = Sxx−1 Ŝ Sxx−1. Then:

(a) Under the null hypothesis H0: βk = βk0,

tk0 = (bk − βk0)/σ̂bk →d N(0, 1), where σ̂bk2 = Âvar(bk)/n = (Sxx−1 Ŝ Sxx−1)kk / n.

(b) Under the null hypothesis H0: Rβ = r, with rank(R) = p,

W = n (Rb − r)′ (R Âvar(b) R′)−1 (Rb − r) →d χ2(p).
155
Remarks
• σ̂bk is called the heteroskedasticity-consistent standard error, (heteroskedasticity-)robust standard error, or White's standard error. The reason for this terminology is that the error term can be conditionally heteroskedastic. The t-ratio is called the robust t-ratio.

• The differences from the finite-sample t-test are: (1) the way the standard error is calculated is different; (2) we use the table of N(0, 1) rather than that of t(n−K); and (3) the actual size or exact size of the test (the probability of a Type I error given the sample size) equals the nominal size (i.e., the desired significance level α) only approximately, although the approximation becomes arbitrarily good as the sample size increases. The difference between the exact size and the nominal size of a test is called the size distortion.

• Both tests are consistent in the sense that

power = P(rejecting the null H0 | H1 is true) → 1 as n → ∞.
156
3.5 Estimating S = E(εi2 xi xi′) Consistently

How do we select an estimator for a population parameter? One of the most important methods is the analog estimation method, or the method of moments. The method-of-moments principle: to estimate a feature of the population, use the corresponding feature of the sample.

Examples of analog estimators:

Parameter of the population     Estimator
E(yi)                           ȳ
Var(yi)                         Sy2
σxy/σx2                         Sxy/Sx2
P(yi ≤ c)                       (Σi=1..n I{yi ≤ c})/n
median(yi)                      sample median
max(yi)                         maxi=1,...,n yi
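A minimal sketch of the analogy principle with simulated data (the normal population and its parameters are assumptions for the demo):

```python
import numpy as np

# Each population feature is estimated by the matching sample feature.
rng = np.random.default_rng(7)
y = rng.normal(loc=10.0, scale=3.0, size=5000)   # population: N(10, 9)

mean_hat = y.mean()                # analog estimator of E(y_i)
var_hat = y.var(ddof=1)            # analog estimator of Var(y_i)
prob_hat = np.mean(y <= 9.5)       # analog estimator of P(y_i <= 9.5)
median_hat = np.median(y)          # analog estimator of the population median

assert abs(mean_hat - 10.0) < 0.2
assert abs(var_hat - 9.0) < 0.5
assert abs(median_hat - 10.0) < 0.2
```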
157
The analogy principle suggests that E(εi2 xi xi′) can be estimated using the estimator

(1/n) Σi=1..n εi2 xi xi′.

Since εi is not observable, we need another estimator:

Ŝ = (1/n) Σi=1..n ei2 xi xi′.

Assumption (2.6 - finite fourth moments for regressors). E((xik xij)2) exists and is finite for all k and j (k, j = 1, ..., K).

Proposition (2.4 - consistent estimation of S). Suppose S = E(εi2 xi xi′) exists and is finite. Then, under Assumptions 2.1-2.4 and 2.6, Ŝ is consistent for S.
158
The estimator Ŝ can be represented as

Ŝ = (1/n) Σi=1..n ei2 xi xi′ = X′BX/n, where B = diag(e12, e22, ..., en2).

Thus, Âvar(b) = Sxx−1 Ŝ Sxx−1 = n (X′X)−1 X′BX (X′X)−1. We have:

• b ∼a N(β, Âvar(b)/n) = N(β, Sxx−1 Ŝ Sxx−1 / n) = N(β, (X′X)−1 X′BX (X′X)−1);

• W = n (Rb − r)′ (R Âvar(b) R′)−1 (Rb − r)
     = n (Rb − r)′ (R Sxx−1 Ŝ Sxx−1 R′)−1 (Rb − r)
     = (Rb − r)′ (R (X′X)−1 X′BX (X′X)−1 R′)−1 (Rb − r) →d χ2(p).
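A sketch of the sandwich formula with simulated heteroskedastic data (the single-regressor design below is an assumption for the demo, not the slides' wage data):

```python
import numpy as np

# White robust variance: Avar_hat(b)/n = (X'X)^-1 X'BX (X'X)^-1, B = diag(e_i^2).
rng = np.random.default_rng(4)
n = 400
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(size=n) * (1.0 + np.abs(x))  # heteroskedastic errors

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y                    # OLS estimate
e = y - X @ b                            # OLS residuals
meat = X.T @ (e[:, None] ** 2 * X)       # X'BX without forming B explicitly
V_robust = XtX_inv @ meat @ XtX_inv      # estimated Var(b|X)
se_robust = np.sqrt(np.diag(V_robust))   # White standard errors

assert np.all(se_robust > 0)
assert abs(b[1] - 2.0) < 0.5             # slope close to the true value
```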
159
Dependent Variable: WAGE. Method: Least Squares. Sample: 1 526.

Variable   Coefficient   Std. Error   t-Statistic   Prob.
C          −1.567939     0.724551     −2.164014     0.0309
FEMALE     −1.810852     0.264825     −6.837915     0.0000
EDUC        0.571505     0.049337     11.58362      0.0000
EXPER       0.025396     0.011569     2.195083      0.0286
TENURE      0.141005     0.021162     6.663225      0.0000

R-squared 0.363541; Adjusted R-squared 0.358655; S.E. of regression 2.957572; Sum squared resid 4557.308; Log likelihood −1314.228; F-statistic 74.39801; Prob(F-statistic) 0.000000; Mean dependent var 5.896103; S.D. dependent var 3.693086; Akaike info criterion 5.016075; Schwarz criterion 5.056619; Hannan-Quinn criter. 5.031950; Durbin-Watson stat 1.794400.

Dependent Variable: WAGE. Method: Least Squares. Sample: 1 526. White Heteroskedasticity-Consistent Standard Errors & Covariance.

Variable   Coefficient   Std. Error   t-Statistic   Prob.
C          −1.567939     0.825934     −1.898382     0.0582
FEMALE     −1.810852     0.254156     −7.124963     0.0000
EDUC        0.571505     0.061217     9.335686      0.0000
EXPER       0.025396     0.009806     2.589912      0.0099
TENURE      0.141005     0.027955     5.044007      0.0000

R-squared 0.363541; Adjusted R-squared 0.358655; S.E. of regression 2.957572; Sum squared resid 4557.308; Log likelihood −1314.228; F-statistic 74.39801; Prob(F-statistic) 0.000000; Durbin-Watson stat 1.794400.
160
3.6 Implications of Conditional Homoskedasticity
Assumption (2.7 - conditional homoskedasticity). E(εi2|xi) = σ2 > 0.

Under Assumption 2.7 we have

S = E(εi2 xi xi′) = ... = σ2 E(xi xi′) = σ2 Σxx and
Avar(b) = Σxx−1 S Σxx−1 = σ2 Σxx−1 Σxx Σxx−1 = σ2 Σxx−1.
Proposition (2.5 - large-sample properties of b, t, and F under conditional homoskedasticity). Suppose Assumptions 2.1-2.5 and 2.7 are satisfied. Then:

(a) (Asymptotic distribution of b) The OLS estimator b is consistent and asymptotically normal with

Avar(b) = σ2 Σxx−1.

(b) (Consistent estimation of the asymptotic variance) Under the same set of assumptions, Avar(b) is consistently estimated by

Âvar(b) = s2 Sxx−1 = n s2 (X′X)−1.
161
(c) (Asymptotic distribution of the t and F statistics of the finite-sample theory)

Under H0: βk = βk0 we have

tk0 = (bk − βk0)/σ̂bk →d N(0, 1), where σ̂bk2 = Âvar(bk)/n = s2 ((X′X)−1)kk.

Under H0: Rβ = r with rank(R) = p, we have

pF0 →d χ2(p),

where F0 = (Rb − r)′ (R(X′X)−1R′)−1 (Rb − r)/(p s2). Notice

pF0 = (e*′e* − e′e)/(e′e/(n−K)) →d χ2(p),

where * refers to the short regression, i.e. the regression subject to the constraint Rβ = r.

Remark (No need for a fourth-moment assumption). By S&WD and Assumptions 2.1-2.4, s2 Sxx →p σ2 Σxx = S. We do not need the fourth-moment assumption (Assumption 2.6) for consistency.
162
3.7 Testing Conditional Homoskedasticity
With the advent of robust standard errors, allowing us to do inference without specifying the conditional second moment, testing conditional homoskedasticity is not as important as it used to be. This section presents only the most popular test, due to White (1980), for the case of random samples.

Let ψi be a vector collecting the unique and nonconstant elements of the K × K symmetric matrix xi xi′.
Proposition (2.6 - White's Test for Conditional Heteroskedasticity). In addition to Assumptions 2.1 and 2.4, suppose that (a) {(yi, xi)} is i.i.d. with finite E(εi2 xi xi′) (thus strengthening Assumptions 2.2 and 2.5), (b) εi is independent of xi (thus strengthening Assumption 2.3 and conditional homoskedasticity), and (c) a certain condition holds on the moments of εi and xi. Then, under H0: E(εi2|xi) = σ2 (constant), we have

nR2 →d χ2(m),

where R2 is the R2 from the auxiliary regression of ei2 on a constant and ψi, and m is the dimension of ψi.
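A sketch of the nR2 test on simulated data (the single-regressor design is an assumption for the demo, so here ψi reduces to (xi, xi2)):

```python
import numpy as np

# White's test: regress squared OLS residuals on a constant and psi_i,
# then compare nR^2 with a chi-square critical value.
rng = np.random.default_rng(5)
n = 500
x = rng.normal(size=n)
y = 1.0 + x + rng.normal(size=n) * np.sqrt(1.0 + x ** 2)  # heteroskedastic DGP

X = np.column_stack([np.ones(n), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]                  # OLS
e2 = (y - X @ b) ** 2                                     # squared residuals

Z = np.column_stack([np.ones(n), x, x ** 2])              # constant and psi_i
g = np.linalg.lstsq(Z, e2, rcond=None)[0]                 # auxiliary regression
resid = e2 - Z @ g
R2 = 1.0 - resid.var() / e2.var()
stat = n * R2                                             # ~ chi2(2) under H0

assert stat > 5.99    # rejects homoskedasticity at the 5% level (chi2(2) critical value)
```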
163
Dependent Variable: WAGE. Method: Least Squares. Sample: 1 526. Included observations: 526.

Variable   Coefficient   Std. Error   t-Statistic   Prob.
C          −1.567939     0.724551     −2.164014     0.0309
FEMALE     −1.810852     0.264825     −6.837915     0.0000
EDUC        0.571505     0.049337     11.58362      0.0000
EXPER       0.025396     0.011569     2.195083      0.0286
TENURE      0.141005     0.021162     6.663225      0.0000

R-squared 0.363541; Adjusted R-squared 0.358655; S.E. of regression 2.957572; Sum squared resid 4557.308; Log likelihood −1314.228; F-statistic 74.39801; Prob(F-statistic) 0.000000; Durbin-Watson stat 1.794400.
164
Heteroskedasticity Test: White

F-statistic 5.911627, Prob. F(13,512) 0.0000; Obs*R-squared 68.64843, Prob. Chi-Square(13) 0.0000; Scaled explained SS 227.2648, Prob. Chi-Square(13) 0.0000.

Test Equation. Dependent Variable: RESID^2.

Variable         Coefficient   Std. Error   t-Statistic   Prob.
C                47.03183      20.19579     2.328794      0.0203
FEMALE           7.205436      10.92406     0.659593      0.5098
FEMALE*EDUC      0.491073      0.778127     0.631097      0.5283
FEMALE*EXPER     0.154634      0.168490     0.917768      0.3592
FEMALE*TENURE    0.066832      0.351582     0.190089      0.8493
EDUC             7.693423      2.596664     2.962811      0.0032
EDUC^2           0.315191      0.086457     3.645652      0.0003
EDUC*EXPER       0.045665      0.036134     1.263789      0.2069
EDUC*TENURE      0.083929      0.054140     1.550226      0.1217
EXPER            0.000257      0.610348     0.000421      0.9997
EXPER^2          0.009134      0.007010     1.303002      0.1932
EXPER*TENURE     0.004066      0.017603     0.230969      0.8174
TENURE           0.298093      0.934417     0.319015      0.7498
TENURE^2         0.004633      0.016358     0.283255      0.7771

R-squared 0.130510; Adjusted R-squared 0.108433; S.E. of regression 21.27289; Sum squared resid 231698.4; Log likelihood −2347.477; F-statistic 5.911627; Prob(F-statistic) 0.000000; Mean dependent var 8.664083; S.D. dependent var 22.52940; Akaike info criterion 8.978999; Schwarz criterion 9.092525; Hannan-Quinn criter. 9.023450; Durbin-Watson stat 1.905515.
165
Dependent Variable: WAGE. Method: Least Squares. Included observations: 526. White Heteroskedasticity-Consistent Standard Errors & Covariance.

Variable   Coefficient   Std. Error   t-Statistic   Prob.
C          −1.567939     0.825934     −1.898382     0.0582
FEMALE     −1.810852     0.254156     −7.124963     0.0000
EDUC        0.571505     0.061217     9.335686      0.0000
EXPER       0.025396     0.009806     2.589912      0.0099
TENURE      0.141005     0.027955     5.044007      0.0000

R-squared 0.363541; Adjusted R-squared 0.358655; S.E. of regression 2.957572; Sum squared resid 4557.308; Log likelihood −1314.228; F-statistic 74.39801; Prob(F-statistic) 0.000000; Durbin-Watson stat 1.794400.
3.8 Estimation with Parameterized Conditional Heteroskedasticity
Even when the error is found to be conditionally heteroskedastic, the OLS estimator is still consistent and asymptotically normal, and valid statistical inference can be conducted with robust standard errors and robust Wald statistics. However, in the (somewhat unlikely) case of a priori knowledge of the functional form of the conditional second moment, it should be possible to obtain sharper estimates with smaller asymptotic variance.
166
To simplify the discussion, throughout this section we strengthen Assumptions 2.2 and 2.5 by assuming that {(yi, xi)} is i.i.d.
3.8.1 The Functional Form
The parametric functional form for the conditional second moment we consider is

E(εi2|xi) = zi′α,

where zi is a function of xi. For example, E(εi2|xi) = α1 + α2 xi22, with zi′ = (1, xi22).
167
3.8.2 WLS with Known α
The WLS (also GLS) estimator can be obtained by applying OLS to the regression

ỹi = x̃i′β + ε̃i,

where

ỹi = yi/√(zi′α),  x̃ik = xik/√(zi′α),  ε̃i = εi/√(zi′α),  i = 1, 2, ..., n.

We have

β̂GLS = β̂(V) = (X̃′X̃)−1 X̃′ỹ = (X′V−1X)−1 X′V−1y.
168
Note that

E(ε̃i | x̃i) = 0.

Therefore, provided that E(x̃i x̃i′) is nonsingular, Assumptions 2.1-2.5 are satisfied for the equation ỹi = x̃i′β + ε̃i. Furthermore, by construction, the error ε̃i is conditionally homoskedastic: E(ε̃i2 | x̃i) = 1. So Proposition 2.5 applies: the WLS estimator is consistent and asymptotically normal, and the asymptotic variance is

Avar(β̂(V)) = E(x̃i x̃i′)−1 = plim ((1/n) Σi=1..n x̃i x̃i′)−1 (by the S&WD theorem)
            = plim ((1/n) X′V−1X)−1.

Thus ((1/n) X′V−1X)−1 is a consistent estimator of Avar(β̂(V)).
169
3.8.3 Regression of e²i on zi Provides a Consistent Estimate of α
If α is unknown, we need to estimate it. Assuming E(ε²i | xi) = z′iα we have

ε²i = E(ε²i | xi) + ηi = z′iα + ηi

where by construction E(ηi | xi) = 0. This suggests that the following regression can be considered:

ε²i = z′iα + ηi.

Provided that E(ziz′i) is nonsingular, Proposition 2.1 is applicable to this auxiliary regression: the OLS estimator of α is consistent and asymptotically normal. However, we cannot run this regression as εi is not observable. In the previous regression we should replace εi by the consistent estimate ei (consistent despite the presence of conditional heteroskedasticity). In conclusion, we may obtain a consistent estimate of α by considering the regression of e²i on zi to get

α̂ = ( Σⁿi=1 ziz′i )⁻¹ Σⁿi=1 zie²i.
170
3.8.4 WLS with Estimated α
Step 1: Estimate the equation yi = x′iβ + εi by OLS and compute the OLS residuals ei.

Step 2: Regress e²i on zi to obtain the OLS coefficient estimate α̂.

Step 3: Transform the original variables according to the rules

ỹi = yi/√(z′iα̂),  x̃ik = xik/√(z′iα̂),  i = 1, 2, ..., n

and run the OLS estimator with respect to the model ỹi = x̃′iβ + ε̃i to obtain the Feasible GLS (FGLS) estimator:

β̂(V̂) = (X′V̂⁻¹X)⁻¹X′V̂⁻¹y.
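The three-step procedure can be sketched in a short numpy simulation (a hypothetical example, not part of the slides; all data and parameter values are invented). Here the skedastic function is E(ε²i | xi) = z′iα with zi = (1, x²i):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000
x = rng.uniform(1.0, 3.0, n)
X = np.column_stack([np.ones(n), x])        # regressors (1, x_i)
Z = np.column_stack([np.ones(n), x**2])     # z_i = (1, x_i^2)
alpha_true = np.array([0.5, 0.4])           # E(eps^2 | x) = 0.5 + 0.4 x^2
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + np.sqrt(Z @ alpha_true) * rng.standard_normal(n)

# Step 1: OLS and residuals
b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ b_ols

# Step 2: regress e^2 on z to estimate alpha
alpha_hat, *_ = np.linalg.lstsq(Z, e**2, rcond=None)

# Step 3: divide each observation by sqrt(z'alpha_hat) and rerun OLS (FGLS)
w = 1.0 / np.sqrt(Z @ alpha_hat)
b_fgls, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)
print(b_ols, b_fgls)
```

Both estimators are consistent here; the FGLS estimates exploit the (correctly specified) skedastic function and are asymptotically more efficient.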
171
It can be proved that:

• β̂(V̂) →p β;

• √n (β̂(V̂) − β) →d N(0, Avar(β̂(V)));

• ((1/n) X′V̂⁻¹X)⁻¹ is a consistent estimator of Avar(β̂(V)).

No finite-sample properties are known for the estimator β̂(V̂).
172
3.8.5 A popular specification for E(ε²i | xi)

The specification ε²i = z′iα + ηi may lead to z′iα̂ < 0. To overcome this problem, a popular specification for E(ε²i | xi) is

E(ε²i | xi) = exp{x′iα}

(it guarantees that Var(yi | xi) > 0 for all values of α). It implies log E(ε²i | xi) = x′iα. This suggests the following procedure:

a) Regress y on X to get the residual vector e.

b) Run the LS regression of log e²i on xi to estimate α and calculate σ̂²i = exp{x′iα̂}.

c) Transform the data: ỹi = yi/σ̂i, x̃ij = xij/σ̂i.

d) Regress ỹ on X̃ and obtain β̂(V̂).
173
Notice also that:

E(ε²i | xi) = exp{x′iα}
ε²i = exp{x′iα} + vi,  vi = ε²i − E(ε²i | xi)
log ε²i ≈ x′iα + v*i
log e²i ≈ x′iα + v**i.
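Steps a)–d) can be sketched as follows (a hypothetical simulation, with invented data and parameter values). A useful detail: the log e²i regression estimates the slope coefficients of α consistently, while its intercept absorbs E(log χ²₁); that only rescales all weights by a common constant, which does not affect the FGLS point estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000
x = rng.standard_normal(n)
X = np.column_stack([np.ones(n), x])
alpha_true = np.array([-0.5, 0.8])           # Var(eps | x) = exp(-0.5 + 0.8 x)
beta_true = np.array([1.0, -1.5])
eps = np.exp(0.5 * (X @ alpha_true)) * rng.standard_normal(n)
y = X @ beta_true + eps

# a) OLS residuals
b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ b_ols

# b) regress log(e^2) on x; exponentiate the fitted values
a_hat, *_ = np.linalg.lstsq(X, np.log(e**2), rcond=None)
sigma_hat = np.exp(0.5 * (X @ a_hat))        # positive by construction

# c)-d) weighted (divided-by-sigma) regression
b_fgls, *_ = np.linalg.lstsq(X / sigma_hat[:, None], y / sigma_hat, rcond=None)
print(b_fgls)
```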
Example (Part 1). We want to estimate a demand function for daily cigarette consumption (cigs). The explanatory variables are: log(income) - log of annual income; log(cigprice) - log of the per-pack price of cigarettes in cents; educ - years of education; age; and restaurn - binary indicator equal to unity if the person resides in a state with restaurant smoking restrictions (source: J. Mullahy (1997), “Instrumental-Variable Estimation of Count Data Models: Applications to Models of Cigarette Smoking Behavior,” Review of Economics and Statistics 79, 586-593).
Based on information below, are the standard errors reported in the first table reliable?
174
Dependent Variable: CIGS
Method: Least Squares
Sample: 1 807

Variable      Coefficient  Std. Error  t-Statistic  Prob.

C              -3.639823   24.07866   -0.151164    0.8799
LOG(INCOME)     0.880268    0.727783   1.209519    0.2268
LOG(CIGPRIC)   -0.750862    5.773342  -0.130057    0.8966
EDUC           -0.501498    0.167077  -3.001596    0.0028
AGE             0.770694    0.160122   4.813155    0.0000
AGE^2          -0.009023    0.001743  -5.176494    0.0000
RESTAURN       -2.825085    1.111794  -2.541016    0.0112

R-squared 0.052737            Mean dependent var 8.686493
Adjusted R-squared 0.045632   S.D. dependent var 13.72152
S.E. of regression 13.40479   Akaike info criterion 8.037737
Sum squared resid 143750.7    Schwarz criterion 8.078448
Log likelihood -3236.227      Hannan-Quinn criter. 8.053370
F-statistic 7.423062          Durbin-Watson stat 2.012825
Prob(F-statistic) 0.000000
Heteroskedasticity Test: White

F-statistic 2.159258          Prob. F(25,781) 0.0009
Obs*R-squared 52.17245        Prob. Chi-Square(25) 0.0011
Scaled explained SS 110.0813  Prob. Chi-Square(25) 0.0000

Test Equation:
Dependent Variable: RESID^2

Variable      Coefficient  Std. Error  t-Statistic  Prob.

C 29374.77 20559.14 1.428794 0.1535
LOG(INCOME) 1049.630 963.4359 1.089466 0.2763
(LOG(INCOME))^2 3.941183 17.07122 0.230867 0.8175
(LOG(INCOME))*(LOG(CIGPRIC)) 329.8896 239.2417 1.378897 0.1683
(LOG(INCOME))*EDUC 9.591849 8.047066 1.191969 0.2336
(LOG(INCOME))*AGE 3.354565 6.682194 0.502015 0.6158
(LOG(INCOME))*(AGE^2) 0.026704 0.073025 0.365689 0.7147
(LOG(INCOME))*RESTAURN 59.88700 49.69039 1.205203 0.2285
LOG(CIGPRIC) 10340.68 9754.559 1.060087 0.2894
(LOG(CIGPRIC))^2 668.5294 1204.316 0.555111 0.5790
(LOG(CIGPRIC))*EDUC 32.91371 59.06252 0.557269 0.5775
(LOG(CIGPRIC))*AGE 62.88164 55.29011 1.137304 0.2558
(LOG(CIGPRIC))*(AGE^2) 0.622371 0.594730 1.046477 0.2957
(LOG(CIGPRIC))*RESTAURN 862.1577 720.6219 1.196408 0.2319
EDUC 117.4705 251.2852 0.467479 0.6403
EDUC^2 0.290343 1.287605 0.225491 0.8217
EDUC*AGE 3.617048 1.724659 2.097254 0.0363
EDUC*(AGE^2) 0.035558 0.017664 2.012988 0.0445
EDUC*RESTAURN 2.896490 10.65709 0.271790 0.7859
AGE 264.1461 235.7624 1.120391 0.2629
AGE^2 3.468601 3.194651 1.085753 0.2779
AGE*(AGE^2) 0.019111 0.028655 0.666935 0.5050
AGE*RESTAURN 4.933199 10.84029 0.455080 0.6492
(AGE^2)^2 0.000118 0.000146 0.807552 0.4196
(AGE^2)*RESTAURN 0.038446 0.120459 0.319160 0.7497
RESTAURN 2868.196 2986.776 0.960299 0.3372
cigs: number of cigarettes smoked per day; log(income): log of annual income; log(cigprice): log of the per-pack price of cigarettes in cents; educ: years of education; age; restaurn: binary indicator equal to unity if the person resides in a state with restaurant smoking restrictions.
175
Example (Part 2). Discuss the results of the following figures.
Dependent Variable: CIGS
Method: Least Squares
Sample: 1 807

Variable      Coefficient  Std. Error  t-Statistic  Prob.

C              -3.639823   24.07866   -0.151164    0.8799
LOG(INCOME)     0.880268    0.727783   1.209519    0.2268
LOG(CIGPRIC)   -0.750862    5.773342  -0.130057    0.8966
EDUC           -0.501498    0.167077  -3.001596    0.0028
AGE             0.770694    0.160122   4.813155    0.0000
AGE^2          -0.009023    0.001743  -5.176494    0.0000
RESTAURN       -2.825085    1.111794  -2.541016    0.0112

R-squared 0.052737            Mean dependent var 8.686493
Adjusted R-squared 0.045632   S.D. dependent var 13.72152
S.E. of regression 13.40479   Akaike info criterion 8.037737
Sum squared resid 143750.7    Schwarz criterion 8.078448
Log likelihood -3236.227      Hannan-Quinn criter. 8.053370
F-statistic 7.423062          Durbin-Watson stat 2.012825
Prob(F-statistic) 0.000000

Dependent Variable: CIGS
Method: Least Squares
Sample: 1 807
White Heteroskedasticity-Consistent Standard Errors & Covariance

Variable      Coefficient  Std. Error  t-Statistic  Prob.

C              -3.639823   25.61646   -0.142089    0.8870
LOG(INCOME)     0.880268    0.596011   1.476931    0.1401
LOG(CIGPRIC)   -0.750862    6.035401  -0.124410    0.9010
EDUC           -0.501498    0.162394  -3.088167    0.0021
AGE             0.770694    0.138284   5.573262    0.0000
AGE^2          -0.009023    0.001462  -6.170768    0.0000
RESTAURN       -2.825085    1.008033  -2.802573    0.0052

R-squared 0.052737            Mean dependent var 8.686493
Adjusted R-squared 0.045632   S.D. dependent var 13.72152
S.E. of regression 13.40479   Akaike info criterion 8.037737
Sum squared resid 143750.7    Schwarz criterion 8.078448
Log likelihood -3236.227      Hannan-Quinn criter. 8.053370
F-statistic 7.423062          Durbin-Watson stat 2.012825
Prob(F-statistic) 0.000000
176
Example (Part 3). a) Regress y on X to get the residual vector e.
Dependent Variable: CIGS
Method: Least Squares
Sample: 1 807

Variable      Coefficient  Std. Error  t-Statistic  Prob.

C              -3.639823   24.07866   -0.151164    0.8799
LOG(INCOME)     0.880268    0.727783   1.209519    0.2268
LOG(CIGPRIC)   -0.750862    5.773342  -0.130057    0.8966
EDUC           -0.501498    0.167077  -3.001596    0.0028
AGE             0.770694    0.160122   4.813155    0.0000
AGE^2          -0.009023    0.001743  -5.176494    0.0000
RESTAURN       -2.825085    1.111794  -2.541016    0.0112

R-squared 0.052737            Mean dependent var 8.686493
Adjusted R-squared 0.045632   S.D. dependent var 13.72152
S.E. of regression 13.40479   Akaike info criterion 8.037737
Sum squared resid 143750.7    Schwarz criterion 8.078448
Log likelihood -3236.227      Hannan-Quinn criter. 8.053370
F-statistic 7.423062          Durbin-Watson stat 2.012825
Prob(F-statistic) 0.000000
177
b) Run the LS regression of log e²i on xi
Dependent Variable: LOG(RES^2)
Method: Least Squares
Sample: 1 807

Variable      Coefficient  Std. Error  t-Statistic  Prob.

C              -1.920691   2.563033   -0.749382    0.4538
LOG(INCOME)     0.291540   0.077468    3.763351    0.0002
LOG(CIGPRIC)    0.195418   0.614539    0.317992    0.7506
EDUC           -0.079704   0.017784   -4.481657    0.0000
AGE             0.204005   0.017044   11.96928     0.0000
AGE^2          -0.002392   0.000186  -12.89313     0.0000
RESTAURN       -0.627011   0.118344   -5.298213    0.0000

R-squared 0.247362            Mean dependent var 4.207486
Adjusted R-squared 0.241717   S.D. dependent var 1.638575
S.E. of regression 1.426862   Akaike info criterion 3.557468
Sum squared resid 1628.747    Schwarz criterion 3.598178
Log likelihood -1428.438      Hannan-Quinn criter. 3.573101
F-statistic 43.82129          Durbin-Watson stat 2.024587
Prob(F-statistic) 0.000000
Calculate

σ̂²i = exp{x′iα̂}.

Notice: x′1α̂, ..., x′nα̂ are the fitted values of the above regression, so σ̂²i is the exponential of the i-th fitted value.
178
c) Transform the data

ỹi = yi/σ̂i,  x̃ij = xij/σ̂i

and d) regress ỹ on X̃ to obtain β̂(V̂).
Dependent Variable: CIGS/SIGMA
Method: Least Squares
Sample: 1 807

Variable            Coefficient  Std. Error  t-Statistic  Prob.

1/SIGMA               5.635471   17.80314    0.316544    0.7517
LOG(INCOME)/SIGMA     1.295239    0.437012   2.963855    0.0031
LOG(CIGPRIC)/SIGMA   -2.940314    4.460145  -0.659242    0.5099
EDUC/SIGMA           -0.463446    0.120159  -3.856953    0.0001
AGE/SIGMA             0.481948    0.096808   4.978378    0.0000
AGE^2/SIGMA          -0.005627    0.000939  -5.989706    0.0000
RESTAURN/SIGMA       -3.461064    0.795505  -4.350776    0.0000

R-squared 0.002751            Mean dependent var 0.966192
Adjusted R-squared -0.004728  S.D. dependent var 1.574979
S.E. of regression 1.578698   Akaike info criterion 3.759715
Sum squared resid 1993.831    Schwarz criterion 3.800425
Log likelihood -1510.045      Hannan-Quinn criter. 3.775347
Durbin-Watson stat 2.049719
179
3.8.6 OLS versus WLS
Under certain conditions we have:

• b and β̂(V̂) are consistent.

• Assuming that the functional form of the conditional second moment is correctly specified, β̂(V̂) is asymptotically more efficient than b.

• It is not clear which estimator is better (in terms of efficiency) in the following situations:

— the functional form of the conditional second moment is misspecified;

— in finite samples, even if the functional form is correctly specified, the large-sample approximation will probably work less well for the WLS estimator than for OLS because of the extra parameters (α̂) estimated in the WLS procedure.
180
3.9 Serial Correlation
Because the issue of serial correlation arises almost always in time-series models, we use the subscript "t" instead of "i" in this section. Throughout this section we assume that the regressors include a constant. The issue is how to deal with

E(εtεt−j | xt−j, xt) ≠ 0.
181
3.9.1 Usual Inference is not Valid
When the regressors include a constant (true in virtually all known applications), Assumption 2.5 implies that the error term is a scalar martingale difference sequence, so if the error is found to be serially correlated (or autocorrelated), that is an indication of a failure of Assumption 2.5.
We have Cov(gt, gt−j) ≠ 0. In fact,

Cov(gt, gt−j) = E( xtεt x′t−jεt−j )
             = E( E( xtεt x′t−jεt−j | xt−j, xt ) )
             = E( xtx′t−j E( εtεt−j | xt−j, xt ) ) ≠ 0.
Assumptions 2.1-2.4 may hold under serial correlation, so the OLS estimator may be consistent even if the error is autocorrelated. However, the large-sample properties of b, t, and F of Proposition 2.5 are not valid. To see why, consider

√n (b − β) = S⁻¹xx (√n ḡ).
182
We have

Avar(b) = Σ⁻¹xx S Σ⁻¹xx,  estimated by S⁻¹xx Ŝ S⁻¹xx.

If the errors are not autocorrelated:

S = Var(√n ḡ) = Var(gt).

If the errors are autocorrelated:

S = Var(√n ḡ) = Var(gt) + (1/n) Σ^{n−1}_{j=1} Σ^n_{t=j+1} ( E(gtg′t−j) + E(gt−jg′t) ).

Since Cov(gt, gt−j) ≠ 0 and E(gt−jg′t) ≠ 0, we have

S ≠ Var(gt), i.e. S ≠ E(gtg′t).

If the errors are serially correlated we cannot use s² (1/n) Σ^n_{t=1} xtx′t or (1/n) Σ^n_{t=1} e²t xtx′t (robust to conditional heteroskedasticity) as consistent estimators of S.
183
3.9.2 Testing Serial Correlation
Consider the regression yt = x′tβ + εt. We want to test whether or not εt is serially correlated. Consider

ρj = Cov(εt, εt−j) / √( Var(εt) Var(εt−j) ) = Cov(εt, εt−j)/Var(εt) = γj/γ0 = E(εtεt−j)/E(ε²t).

Since γj is not observable, we need to consider

ρ̃j = γ̃j/γ̃0,  γ̃j = (1/n) Σ^n_{t=j+1} εtεt−j,  γ̃0 = (1/n) Σ^n_{t=1} ε²t.
184
Proposition. If {εt} is a stationary MDS with E(ε²t | εt−1, εt−2, ...) = σ², then

√n γ̃j →d N(0, σ⁴) and √n ρ̃j →d N(0, 1).

Proposition. Under the assumptions of the previous proposition,

Box-Pierce Q statistic: QBP = Σ^p_{j=1} (√n ρ̃j)² = n Σ^p_{j=1} ρ̃²j →d χ²(p).

However, ρ̃j is still unfeasible as we do not observe the errors. Thus, we use

ρ̂j = γ̂j/γ̂0,  γ̂j = (1/n) Σ^n_{t=j+1} etet−j,  γ̂0 = (1/n) Σ^n_{t=1} e²t (= SSR/n).
Exercise 3.6. Prove that ρ̂j can be obtained from the regression of et on et−j (without intercept).
185
Testing with Strictly Exogenous Regressors
To test H0: ρj = 0 we consider the following proposition:

Proposition (testing for serial correlation with strictly exogenous regressors). Suppose that Assumptions 1.2, 2.1, 2.2 and 2.4 are satisfied. Then

ρ̂j →p 0,  √n ρ̂j →d N(0, 1).
186
To test H0: ρ1 = ρ2 = ... = ρp = 0 we consider the following proposition:

Proposition (Box-Pierce Q & Ljung-Box Q). Suppose that Assumptions 1.2, 2.1, 2.2 and 2.4 are satisfied. Then

QBP = n Σ^p_{j=1} ρ̂²j →d χ²(p),

QLB = n (n+2) Σ^p_{j=1} ρ̂²j / (n−j) →d χ²(p).

It can be shown that the hypothesis H0: ρ1 = ρ2 = ... = ρp = 0 can also be tested through the following auxiliary regression:

regression of et on et−1, ..., et−p.

We calculate the F statistic for the hypothesis that the p coefficients of et−1, ..., et−p are all zero.
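As a sketch (hypothetical white-noise residuals, invented data), ρ̂j, QBP and QLB can be computed directly from the formulas above:

```python
import numpy as np

def q_stats(e, p):
    """Residual autocorrelations and Box-Pierce / Ljung-Box statistics."""
    n = len(e)
    gamma0 = np.sum(e**2) / n
    # rho_hat_j = gamma_hat_j / gamma_hat_0, j = 1, ..., p
    rho = np.array([np.sum(e[j:] * e[:-j]) / n / gamma0 for j in range(1, p + 1)])
    q_bp = n * np.sum(rho**2)
    q_lb = n * (n + 2) * np.sum(rho**2 / (n - np.arange(1, p + 1)))
    return rho, q_bp, q_lb

rng = np.random.default_rng(1)
e = rng.standard_normal(500)       # "residuals" from a correctly specified model
rho, q_bp, q_lb = q_stats(e, p=12)
print(q_bp, q_lb)                  # compare with chi-square(12) critical values
```

Note that QLB is always slightly larger than QBP, since (n+2)/(n−j) > 1 for each lag.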
187
Testing with Predetermined, but Not Strictly Exogenous, Regressors
If the regressors are not strictly exogenous, √n ρ̂j no longer has an N(0, 1) distribution and the residual-based Q statistic may not be asymptotically chi-squared.

The trick consists in removing the effect of xt in the regression of et on et−1, ..., et−p by considering now the

regression of et on xt, et−1, ..., et−p

and then calculating the F statistic for the hypothesis that the p coefficients of et−1, ..., et−p are all zero. This regression is still valid when the regressors are strictly exogenous (so you may always use that regression).

Given

et = θ1 + θ2xt2 + ... + θKxtK + γ1et−1 + ... + γpet−p + errort

the null hypothesis can be formulated as

H0: γ1 = ... = γp = 0.

Use the F test.
188
EVIEWS
189
Example. Consider: chnimp - the volume of imports of barium chloride from China; chempi - index of chemical production (to control for overall demand for barium chloride); gas - the volume of gasoline production (another demand variable); rtwex - an exchange rate index (measures the strength of the dollar against several other currencies).
Equation 1
Dependent Variable: LOG(CHNIMP)
Method: Least Squares
Sample: 1978M02 1988M12
Included observations: 131

Variable      Coefficient  Std. Error  t-Statistic  Prob.

C              19.75991   21.08580    0.937119    0.3505
LOG(CHEMPI)     3.044302   0.478954   6.356142    0.0000
LOG(GAS)        0.349769   0.906247   0.385953    0.7002
LOG(RTWEX)      0.717552   0.349450   2.053378    0.0421

R-squared 0.280905            Mean dependent var 6.174599
Adjusted R-squared 0.263919   S.D. dependent var 0.699738
S.E. of regression 0.600341   Akaike info criterion 1.847421
Sum squared resid 45.77200    Schwarz criterion 1.935213
Log likelihood -117.0061      Hannan-Quinn criter. 1.883095
F-statistic 16.53698          Durbin-Watson stat 1.421242
Prob(F-statistic) 0.000000
190
Equation 2
Breusch-Godfrey Serial Correlation LM Test:

F-statistic 2.337861      Prob. F(12,115) 0.0102
Obs*R-squared 25.69036    Prob. Chi-Square(12) 0.0119

Test Equation:
Dependent Variable: RESID
Method: Least Squares
Sample: 1978M02 1988M12
Included observations: 131
Presample missing value lagged residuals set to zero.

Variable      Coefficient  Std. Error  t-Statistic  Prob.

C               3.074901   20.73522    0.148294    0.8824
LOG(CHEMPI)     0.084948    0.457958   0.185493    0.8532
LOG(GAS)        0.110527    0.892301   0.123867    0.9016
LOG(RTWEX)      0.030365    0.333890   0.090942    0.9277
RESID(-1)       0.234579    0.093215   2.516546    0.0132
RESID(-2)       0.182743    0.095624   1.911051    0.0585
RESID(-3)       0.164748    0.097176   1.695366    0.0927
RESID(-4)       0.180123    0.098565   1.827464    0.0702
RESID(-5)       0.041327    0.099482   0.415425    0.6786
RESID(-6)       0.038597    0.098345   0.392468    0.6954
RESID(-7)       0.139782    0.098420   1.420268    0.1582
RESID(-8)       0.063771    0.099213   0.642771    0.5217
RESID(-9)       0.154525    0.098209   1.573441    0.1184
RESID(-10)      0.027184    0.098283   0.276585    0.7826
RESID(-11)      0.049692    0.097140   0.511550    0.6099
RESID(-12)      0.058076    0.095469   0.608329    0.5442

R-squared 0.196110            Mean dependent var 3.97E-15
Adjusted R-squared 0.091254   S.D. dependent var 0.593374
S.E. of regression 0.565652   Akaike info criterion 1.812335
Sum squared resid 36.79567    Schwarz criterion 2.163504
Log likelihood -102.7079      Hannan-Quinn criter. 1.955030
F-statistic 1.870289          Durbin-Watson stat 2.015299
Prob(F-statistic) 0.033268
191
If you conclude that the errors are serially correlated you have a few options:

(a) You know (at least approximately) the form of the autocorrelation, so you use a feasible GLS estimator.

(b) The second approach parallels the use of the White estimator for heteroskedasticity: you don't know the form of the autocorrelation, so you rely on OLS but use a consistent estimator of Avar(b).

(c) You are concerned only with the dynamic specification of the model and with forecasting. You may try to convert your model into a dynamically complete model.

(d) Your model may be misspecified: you respecify the model and the autocorrelation disappears.
192
3.9.3 Question (a): feasible GLS estimator
There are many forms of autocorrelation, and each one leads to a different structure for the error covariance matrix V. The most popular form is known as the first-order autoregressive process. In this case the error term in

yt = x′tβ + εt

is assumed to follow the AR(1) model

εt = ρεt−1 + vt,  |ρ| < 1,

where vt is an error term with mean zero and constant conditional variance that exhibits no serial correlation. We assume that Assumptions 2.1-2.5 hold with ρ = 0 (i.e., for the model written in terms of vt).
193
Initial model:

yt = x′tβ + εt,  εt = ρεt−1 + vt,  |ρ| < 1.

The GLS estimator is the OLS estimator applied to the transformed model

ỹt = x̃′tβ + vt

where

ỹt = √(1−ρ²) y1 for t = 1,  ỹt = yt − ρyt−1 for t > 1;
x̃′t = √(1−ρ²) x′1 for t = 1,  x̃′t = (xt − ρxt−1)′ for t > 1.

Without the first observation, the transformed model is

yt − ρyt−1 = (xt − ρxt−1)′β + vt.

If ρ is unknown we may replace it by a consistent estimator, or we may use the nonlinear least squares estimator (EViews).
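The quasi-differencing transformation can be checked numerically (a hypothetical sketch with invented data): applying yt − ρyt−1 with the true ρ must recover exactly the serially uncorrelated error vt:

```python
import numpy as np

rng = np.random.default_rng(7)
n, rho = 300, 0.6
beta = np.array([2.0, 0.5])
x = rng.standard_normal(n)
X = np.column_stack([np.ones(n), x])
v = rng.standard_normal(n)

# build the AR(1) error eps_t = rho * eps_{t-1} + v_t
eps = np.zeros(n)
eps[0] = v[0] / np.sqrt(1 - rho**2)        # stationary starting value
for t in range(1, n):
    eps[t] = rho * eps[t - 1] + v[t]
y = X @ beta + eps

# quasi-difference, dropping the first observation
y_t = y[1:] - rho * y[:-1]
X_t = X[1:] - rho * X[:-1]
u = y_t - X_t @ beta                        # should equal v_t exactly
print(np.max(np.abs(u - v[1:])))            # numerically zero
```

With ρ unknown, the same transformation is applied with a consistent estimate ρ̂ (Cochrane-Orcutt style), which is what the feasible GLS does.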
194
Example (continuation of the previous example). Let’s consider the residuals of Equation 1:
Equation 3
Dependent Variable: LOG(CHNIMP)
Method: Least Squares
Sample (adjusted): 1978M03 1988M12
Included observations: 130 after adjustments
Convergence achieved after 8 iterations

Variable      Coefficient  Std. Error  t-Statistic  Prob.

C              39.30703   23.61105    1.664772    0.0985
LOG(CHEMPI)     2.875036   0.658664   4.364949    0.0000
LOG(GAS)        1.213475   1.005164   1.207241    0.2296
LOG(RTWEX)      0.850385   0.468696   1.814362    0.0720
AR(1)           0.309190   0.086011   3.594777    0.0005

R-squared 0.338533            Mean dependent var 6.180590
Adjusted R-squared 0.317366   S.D. dependent var 0.699063
S.E. of regression 0.577578   Akaike info criterion 1.777754
Sum squared resid 41.69947    Schwarz criterion 1.888044
Log likelihood -110.5540      Hannan-Quinn criter. 1.822569
F-statistic 15.99350          Durbin-Watson stat 2.079096
Prob(F-statistic) 0.000000

Inverted AR Roots .31
Exercise 3.7. Consider yt = β1 + β2xt2 + εt where εt = ρεt−1 + vt and {vt} is a white noise process. Using the first differences of the variables one gets ∆yt = β2∆xt2 + ∆εt. Show that Corr(∆εt, ∆εt−1) = −(1 − ρ)/2. Discuss the advantages and disadvantages of differencing the variables as a procedure to remove autocorrelation.
195
3.9.4 Question (b): Heteroskedasticity and Autocorrelation-Consistent (HAC) Covariance Matrix Estimator

For the sake of generality, assume that you also have a problem of heteroskedasticity.
Given

S = Var(√n ḡ)
  = Var(gt) + (1/n) Σ^{n−1}_{j=1} Σ^n_{t=j+1} ( E(gtg′t−j) + E(gt−jg′t) )
  = E(ε²t xtx′t) + (1/n) Σ^{n−1}_{j=1} Σ^n_{t=j+1} ( E(εtεt−j xtx′t−j) + E(εt−jεt xt−jx′t) ),

a possible estimator of S based on the analogy principle would be

(1/n) Σ^n_{t=1} e²t xtx′t + (1/n) Σ^{n′−1}_{j=1} Σ^n_{t=j+1} ( etet−j xtx′t−j + et−jet xt−jx′t ),  n′ < n.

A major problem with this estimator is that it is not guaranteed to be positive semi-definite and hence may fail to be a well-defined variance-covariance matrix.
196
Newey and West show that with a suitable weighting function ω(j), the estimator below is consistent and positive semi-definite:

ŜHAC = (1/n) Σ^n_{t=1} e²t xtx′t + (1/n) Σ^L_{j=1} Σ^n_{t=j+1} ω(j) ( etet−j xtx′t−j + et−jet xt−jx′t )

where the weighting function ω(j) is

ω(j) = 1 − j/(L + 1).

The maximum lag L must be determined in advance. Autocorrelations at lags longer than L are ignored. For a moving-average process, this value is in general a small number.

This estimator is known as the HAC covariance matrix estimator and is valid when both conditional heteroskedasticity and serial correlation are present but of an unknown form.
197
Example. For xt = 1, n = 9, L = 3 we have

Σ^L_{j=1} Σ^n_{t=j+1} ω(j) ( etet−j xtx′t−j + et−jet xt−jx′t ) = Σ^L_{j=1} Σ^n_{t=j+1} ω(j) 2etet−j

= ω(1) (2e1e2 + 2e2e3 + 2e3e4 + 2e4e5 + 2e5e6 + 2e6e7 + 2e7e8 + 2e8e9)
+ ω(2) (2e1e3 + 2e2e4 + 2e3e5 + 2e4e6 + 2e5e7 + 2e6e8 + 2e7e9)
+ ω(3) (2e1e4 + 2e2e5 + 2e3e6 + 2e4e7 + 2e5e8 + 2e6e9)

with

ω(1) = 1 − 1/4 = 0.75,  ω(2) = 1 − 2/4 = 0.50,  ω(3) = 1 − 3/4 = 0.25.
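The expansion can be verified with a short loop (the residuals e1, ..., e9 are placeholder values):

```python
import numpy as np

e = np.arange(1.0, 10.0)     # placeholder residuals e_1, ..., e_9
n, L = len(e), 3

# loop form: sum over lags j of w(j) * sum_t 2 e_t e_{t-j}
total = 0.0
for j in range(1, L + 1):
    w = 1.0 - j / (L + 1)                      # Bartlett weight
    total += w * 2.0 * np.sum(e[j:] * e[:-j])

# hand expansion with w(1)=0.75, w(2)=0.50, w(3)=0.25
hand = (0.75 * 2 * sum(e[t] * e[t - 1] for t in range(1, 9))
        + 0.50 * 2 * sum(e[t] * e[t - 2] for t in range(2, 9))
        + 0.25 * 2 * sum(e[t] * e[t - 3] for t in range(3, 9)))
print(total, hand)
```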
198
Newey-West covariance matrix estimator:

Avar(b) is estimated by S⁻¹xx ŜHAC S⁻¹xx.

EVIEWS:

[figure: the default truncation lag L as a function of the sample size n, for n up to 5000]

EViews selects L = floor( 4 (n/100)^{2/9} ).
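The rule can be coded directly (a sketch); for n = 131, as in the barium example, it gives L = 4, matching the reported lag truncation:

```python
import math

def newey_west_lag(n: int) -> int:
    """EViews default truncation lag: L = floor(4 * (n/100)**(2/9))."""
    return math.floor(4 * (n / 100) ** (2 / 9))

print(newey_west_lag(131))   # 4
```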
199
Example (continuation ...). Newey-West covariance matrix estimator:

Avar(b) is estimated by S⁻¹xx ŜHAC S⁻¹xx.
Equation 4
Dependent Variable: LOG(CHNIMP)
Method: Least Squares
Sample: 1978M02 1988M12
Included observations: 131
Newey-West HAC Standard Errors & Covariance (lag truncation=4)

Variable      Coefficient  Std. Error  t-Statistic  Prob.

C              19.75991   26.25891    0.752503    0.4531
LOG(CHEMPI)     3.044302   0.667155   4.563111    0.0000
LOG(GAS)        0.349769   1.189866   0.293956    0.7693
LOG(RTWEX)      0.717552   0.361957   1.982426    0.0496

R-squared 0.280905            Mean dependent var 6.174599
Adjusted R-squared 0.263919   S.D. dependent var 0.699738
S.E. of regression 0.600341   Akaike info criterion 1.847421
Sum squared resid 45.77200    Schwarz criterion 1.935213
Log likelihood -117.0061      Hannan-Quinn criter. 1.883095
F-statistic 16.53698          Durbin-Watson stat 1.421242
Prob(F-statistic) 0.000000
200
3.9.5 Question (c): Dynamically Complete Models
Consider

yt = x′tβ + ut

such that E(ut | xt) = 0. This condition, although necessary for consistency, does not preclude autocorrelation. You may try to add regressors to xt and get a new regression model

yt = x′tβ + εt such that E(εt | xt, yt−1, xt−1, yt−2, ...) = 0.

Written in terms of yt:

E(yt | xt, yt−1, xt−1, yt−2, ...) = E(yt | xt).

Definition. The model yt = x′tβ + εt is dynamically complete (DC) if

E(εt | xt, yt−1, xt−1, yt−2, ...) = 0, or equivalently

E(yt | xt, yt−1, xt−1, yt−2, ...) = E(yt | xt)

holds (see Wooldridge).
201
Proposition. If a model is DC then the errors are not autocorrelated. Moreover, {gt} is an MDS.

Notice that E(εt | xt, yt−1, xt−1, yt−2, ...) = 0 can be rewritten as

E(εi | Fi) = 0, where

Fi = Ii−1 ∪ xi = {εi−1, εi−2, ..., ε1, xi, xi−1, ..., x1},
Ii−1 = {εi−1, εi−2, ..., ε1, xi−1, ..., x1}.
Example. Consider

yt = β1 + β2xt2 + ut,  ut = φut−1 + εt

where {εt} is a white noise process and E(εt | xt2, yt−1, xt−1,2, yt−2, ...) = 0. Set x′t = ( 1  xt2 ). The above model is not DC since the errors are autocorrelated. Notice that

E(yt | xt2, yt−1, xt−1,2, yt−2, ...) = β1 + β2xt2 + φut−1

does not coincide with

E(yt | xt) = E(yt | xt2) = β1 + β2xt2.
202
However, it is easy to obtain a DC model. Since

ut = yt − (β1 + β2xt2)  ⇒  ut−1 = yt−1 − (β1 + β2xt−1,2),

we have

yt = β1 + β2xt2 + ut
   = β1 + β2xt2 + φut−1 + εt
   = β1 + β2xt2 + φ( yt−1 − (β1 + β2xt−1,2) ) + εt.

This equation can be written in the form

yt = γ1 + γ2xt2 + γ3yt−1 + γ4xt−1,2 + εt.

Let x̃t = ( xt2, yt−1, xt−1,2 ). The previous model is DC as

E(yt | x̃t, yt−1, x̃t−1, ...) = E(yt | x̃t) = γ1 + γ2xt2 + γ3yt−1 + γ4xt−1,2.
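Matching coefficients in the derivation gives (γ1, γ2, γ3, γ4) = (β1(1 − φ), β2, φ, −φβ2), which can be verified numerically (a sketch with hypothetical parameter values):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
beta1, beta2, phi = 1.0, 0.5, 0.7
x = rng.standard_normal(n)
eps = rng.standard_normal(n)

# AR(1) error u_t = phi * u_{t-1} + eps_t
u = np.zeros(n)
u[0] = eps[0]
for t in range(1, n):
    u[t] = phi * u[t - 1] + eps[t]
y = beta1 + beta2 * x + u

# DC form: y_t = g1 + g2 x_t + g3 y_{t-1} + g4 x_{t-1} + eps_t
g1, g2, g3, g4 = beta1 * (1 - phi), beta2, phi, -phi * beta2
resid = y[1:] - (g1 + g2 * x[1:] + g3 * y[:-1] + g4 * x[:-1])
print(np.max(np.abs(resid - eps[1:])))   # numerically zero
```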
203
Example (continuation ...). Dynamically Complete Model
Equation 5
Dependent Variable: LOG(CHNIMP)
Method: Least Squares
Sample (adjusted): 1978M03 1988M12
Included observations: 130 after adjustments

Variable          Coefficient  Std. Error  t-Statistic  Prob.

C                  11.30596   23.24886    0.486302    0.6276
LOG(CHEMPI)         7.193799   3.539951   2.032175    0.0443
LOG(GAS)            1.319540   1.003825   1.314513    0.1911
LOG(RTWEX)          0.501520   2.108623   0.237842    0.8124
LOG(CHEMPI(-1))     9.618587   3.602977   2.669622    0.0086
LOG(GAS(-1))        1.223681   1.002237   1.220950    0.2245
LOG(RTWEX(-1))      0.935678   2.088961   0.447915    0.6550
LOG(CHNIMP(-1))     0.270704   0.084103   3.218710    0.0016

R-squared 0.394405            Mean dependent var 6.180590
Adjusted R-squared 0.359658   S.D. dependent var 0.699063
S.E. of regression 0.559400   Akaike info criterion 1.735660
Sum squared resid 38.17726    Schwarz criterion 1.912123
Log likelihood -104.8179      Hannan-Quinn criter. 1.807363
F-statistic 11.35069          Durbin-Watson stat 2.059684
Prob(F-statistic) 0.000000
Equation 6
Breusch-Godfrey Serial Correlation LM Test:

F-statistic 0.810670      Prob. F(12,110) 0.6389
Obs*R-squared 10.56265    Prob. Chi-Square(12) 0.5667

Test Equation:
Dependent Variable: RESID
Method: Least Squares
Date: 05/12/10 Time: 19:13
Sample: 1978M03 1988M12
Included observations: 130
Presample missing value lagged residuals set to zero.

Variable          Coefficient  Std. Error  t-Statistic  Prob.

C                   1.025127   26.26657    0.039028    0.9689
LOG(CHEMPI)         1.373671    3.968650   0.346130    0.7299
LOG(GAS)            0.279136    1.055889   0.264361    0.7920
LOG(RTWEX)          0.074592    2.234853   0.033377    0.9734
LOG(CHEMPI(-1))     1.878917    4.322963   0.434636    0.6647
LOG(GAS(-1))        0.315918    1.076831   0.293378    0.7698
LOG(RTWEX(-1))      0.007029    2.224878   0.003159    0.9975
LOG(CHNIMP(-1))     0.151065    0.293284   0.515082    0.6075
RESID(-1)           0.189924    0.307062   0.618520    0.5375
RESID(-2)           0.088557    0.124602   0.710715    0.4788
RESID(-3)           0.154141    0.098337   1.567475    0.1199
RESID(-4)           0.125009    0.098681   1.266795    0.2079
RESID(-5)           0.035680    0.099831   0.357407    0.7215
RESID(-6)           0.048053    0.098008   0.490291    0.6249
RESID(-7)           0.129226    0.097417   1.326523    0.1874
RESID(-8)           0.052884    0.099891   0.529420    0.5976
RESID(-9)           0.122323    0.102670   1.191423    0.2361
RESID(-10)          0.022149    0.099419   0.222788    0.8241
RESID(-11)          0.034364    0.099973   0.343738    0.7317
RESID(-12)          0.038034    0.102071   0.372628    0.7101

R-squared 0.081251            Mean dependent var 9.76E-15
Adjusted R-squared 0.077442   S.D. dependent var 0.544011
S.E. of regression 0.564683   Akaike info criterion 1.835533
Sum squared resid 35.07532    Schwarz criterion 2.276692
Log likelihood -99.30962      Hannan-Quinn criter. 2.014790
F-statistic 0.512002          Durbin-Watson stat 2.011429
Prob(F-statistic) 0.952295
204
3.9.6 Question (d): Misspecification
In many cases the finding of autocorrelation is an indication that the model is misspecified. If this is the case, the most natural route is not to change your estimator (from OLS to GLS) but to change your model. Types of misspecification that may lead to a finding of autocorrelation in your OLS residuals:

• dynamic misspecification (related to question (c));

• omitted variables (that are autocorrelated);

• yt and/or xtk are integrated processes, e.g. yt ∼ I(1);

• functional form misspecification.
205
Functional form misspecification. Suppose that the true relationship is

yt = β1 + β2 log t + εt.

In the following figure we estimate a misspecified functional form: yt = β1 + β2t + ε*t. The residuals are clearly autocorrelated.
206
3.10 Time Regressions
Consider
yt = α + δf(t) + εt

where f(t) is a function of time (e.g. f(t) = t or f(t) = t², etc.). This kind of model does not satisfy Assumption 2.2 ({(yi, xi)} jointly S&WD). This type of nonstationarity is not serious and the OLS is applicable. Let us focus on the case

yt = α + δt + εt = x′tβ + εt,  x′t = ( 1  t ),  β = [ α  δ ]′.

α + δt is called the time trend of yt.

Definition. We say that a process is trend stationary if it can be written as the sum of a time trend and a stationary process. The process {yt} here is a special trend-stationary process where the stationary component is independent white noise.
207
3.10.1 The Asymptotic Distribution of the OLS Estimator
Let b be the OLS estimate of β based on a sample of size n:

b = [ α̂  δ̂ ]′ = (X′X)⁻¹X′y.

Proposition (2.11 - OLS estimation of the time regression). Consider the time regression yt = α + δt + εt, where εt is independent white noise with E(ε²t) = σ² and E(ε⁴t) < ∞. Then

[ √n (α̂ − α) ; n^{3/2} (δ̂ − δ) ] →d N( 0, σ² [ 1  1/2 ; 1/2  1/3 ]⁻¹ ) = N( 0, σ² [ 4  −6 ; −6  12 ] ).

As in the stationary case, α̂ is √n-consistent because √n (α̂ − α) converges to a (normal) random variable. The OLS estimate of the time coefficient, δ̂, is also consistent, but the speed of convergence is faster: it is n^{3/2}-consistent in that n^{3/2} (δ̂ − δ) converges to a random variable. In this sense, δ̂ is superconsistent.
208
We provide a simpler proof of Proposition 2.11 in the case yt = δt + εt. We have

δ̂ − δ = (X′X)⁻¹X′ε = ( Σ^n_{t=1} t² )⁻¹ Σ^n_{t=1} tεt
      = [ √Var(Σ^n_{t=1} tεt) / Σ^n_{t=1} t² ] · [ Σ^n_{t=1} tεt / √Var(Σ^n_{t=1} tεt) ]
      = [ σ√(Σ^n_{t=1} t²) / Σ^n_{t=1} t² ] Zn,

where Zn = Σ^n_{t=1} tεt / ( σ√(Σ^n_{t=1} t²) ) →d Z ∼ N(0, 1).
209
Hence

n^{3/2} (δ̂ − δ) = n^{3/2} [ σ√(Σ^n_{t=1} t²) / Σ^n_{t=1} t² ] Zn.

Since

lim_{n→∞} n^{3/2} σ√(Σ^n_{t=1} t²) / Σ^n_{t=1} t² = σ√3,

we have

n^{3/2} (δ̂ − δ) →d N(0, 3σ²). ∎
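The limit σ√3 can be checked numerically using the closed form Σ^n_{t=1} t² = n(n+1)(2n+1)/6 (with σ = 1):

```python
import math

def scaled_se(n: int) -> float:
    """n^{3/2} * sqrt(sum_{t=1}^n t^2) / sum_{t=1}^n t^2, with sigma = 1."""
    s2 = n * (n + 1) * (2 * n + 1) / 6   # closed form for the sum of squares
    return n ** 1.5 / math.sqrt(s2)      # equal to n^{3/2} sqrt(s2) / s2

print(scaled_se(10), scaled_se(10**6), math.sqrt(3))
```

The sequence increases monotonically toward √3 ≈ 1.732.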
3.10.2 Hypothesis Testing for Time Regressions
The OLS coefficient estimates of the time regression are asymptotically normal, provided the sampling error is properly scaled. Inference about δ can be based on

n^{3/2} (δ̂ − δ) / √(12 s²) →d N(0, 1) in the case yt = α + δt + εt,

n^{3/2} (δ̂ − δ) / √(3 s²) →d N(0, 1) in the case yt = δt + εt.
210
4 Endogeneity and the GMM
Consider
yi = β1zi1 + β2zi2 + ...+ βKziK + εi.
If Cov(zij, εi) ≠ 0 (or E(zijεi) ≠ 0) then we say that zij (the j-th regressor) is endogenous. It follows that E(ziεi) ≠ 0.

Definition (endogenous regressor). We say that a regressor is endogenous if it is not predetermined (i.e., not orthogonal to the error term), that is, if it does not satisfy the orthogonality condition (Assumption 2.3 does not hold).

If the regressors are endogenous we have, under Assumptions 2.1, 2.2 and 2.4,

b = β + ( (1/n) Σ^n_{i=1} ziz′i )⁻¹ (1/n) Σ^n_{i=1} ziεi →p β + Σ⁻¹zz E(ziεi) ≠ β

since E(ziεi) ≠ 0. The term Σ⁻¹zz E(ziεi) is the asymptotic bias.
211
Example (simple regression model). The OLS estimator for

yi = β1 + β2zi2 + εi

is

b = [ b1  b2 ]′ = (Z′Z)⁻¹Z′y = [ ȳ − (Ĉov(zi2, yi)/S²z2) z̄2 ;  Ĉov(zi2, yi)/S²z2 ]

where

Ĉov(zi2, yi) = (1/n) Σ (zi2 − z̄2)(yi − ȳ),  S²z2 = (1/n) Σ (zi2 − z̄2)².

Under Assumption 2.2 we have

b2 = Ĉov(zi2, yi)/S²z2 →p Cov(zi2, yi)/Var(zi2) = Cov(zi2, β1 + β2zi2 + εi)/Var(zi2) = β2 + Cov(zi2, εi)/Var(zi2).
212
b1 = ȳ − (Ĉov(zi2, yi)/S²z2) z̄2 →p E(yi) − (Cov(zi2, yi)/Var(zi2)) E(zi2)
   = β1 + β2 E(zi2) − ( β2 + Cov(zi2, εi)/Var(zi2) ) E(zi2)
   = β1 − ( Cov(zi2, εi)/Var(zi2) ) E(zi2).

If Cov(zi2, εi) = 0 then bi →p βi. If zi2 is endogenous, b1 and b2 are inconsistent. Show that

Σ⁻¹zz E(ziεi) = [ −(Cov(zi2, εi)/Var(zi2)) E(zi2) ;  Cov(zi2, εi)/Var(zi2) ].
213
4.1 Examples of Endogeneity
4.1.1 Simultaneous Equations Bias
Example. Consider

yi1 = α0 + α1yi2 + εi1
yi2 = β0 + β1yi1 + εi2

where εi1 and εi2 are independent. By construction, yi1 and yi2 are endogenous regressors. In fact, it can be proved that

Cov(yi2, εi1) = β1 Var(εi1)/(1 − β1α1) ≠ 0,
Cov(yi1, εi2) = α1 Var(εi2)/(1 − β1α1) ≠ 0.

Now

α̂1,OLS →p Cov(yi2, yi1)/Var(yi2) = Cov(yi2, α0 + α1yi2 + εi1)/Var(yi2) = α1 + Cov(yi2, εi1)/Var(yi2) ≠ α1,

β̂1,OLS →p Cov(yi2, yi1)/Var(yi1) = Cov(yi1, β0 + β1yi1 + εi2)/Var(yi1) = β1 + Cov(yi1, εi2)/Var(yi1) ≠ β1.
214
The OLS estimator is inconsistent for both α1 and β1 (and for α0 and β0 as well). Thisphenomenon is known as the simultaneous equations bias or simultaneity bias, because theregressor and the error term are often related to each other through a system of simultaneousequations.
Example. Consider

Ci = α0 + α1Yi + ui (consumption function)
Yi = Ci + Ii (GNP identity)

where Cov(ui, Ii) = 0. It can be proved that

α̂1,OLS →p α1 + (1/(1 − α1)) · Var(ui)/Var(Yi).
Example (see Hayashi). Consider

qdi = α0 + α1pi + ui (demand equation)
qsi = β0 + β1pi + vi (supply equation)
qdi = qsi (market equilibrium)
215
4.1.2 Errors-in-Variables Bias
We will see that predetermined regressor necessarily becomes endogenous when measuredwith error. This problem is ubiquitous, particularly in micro data on households.
Consider
$$y^*_i=\beta z^*_i+u_i$$
where $z^*_i$ is a predetermined regressor. The variables $y^*_i$ and $z^*_i$ are measured with error:
$$y_i=y^*_i+\varepsilon_i \qquad\text{and}\qquad z_i=z^*_i+v_i.$$
Assume that $\mathrm{E}(z^*_iu_i)=\mathrm{E}(z^*_i\varepsilon_i)=\mathrm{E}(z^*_iv_i)=\mathrm{E}(v_iu_i)=\mathrm{E}(v_i\varepsilon_i)=0$. The regression equation is
$$y_i=\beta z_i+\eta_i,\qquad \eta_i=u_i+\varepsilon_i-\beta v_i.$$
Assuming S&WD (stationarity and weak dependence) we have (after some calculations):
$$\hat\beta_{OLS}=\frac{\sum_i z_iy_i}{\sum_i z_i^2}=\frac{\sum_i z_iy_i/n}{\sum_i z_i^2/n}\xrightarrow{p}\beta-\beta\,\frac{\mathrm{E}(v_i^2)}{\mathrm{E}(z_i^2)}.$$
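The attenuation result $\beta-\beta\,\mathrm{E}(v_i^2)/\mathrm{E}(z_i^2)$ can be verified numerically. A minimal sketch with hypothetical variances (Var$(z^*)=4$, Var$(v)=1$, so the plim is $\beta\cdot 4/5=1.6$ for $\beta=2$; none of these numbers come from the slides):

```python
import numpy as np

# Measurement-error (attenuation) bias, hypothetical parameters:
# y* = beta*z* + u; observed y = y* + eps, z = z* + v.
rng = np.random.default_rng(1)
n = 200_000
beta, var_zstar, var_v = 2.0, 4.0, 1.0

zstar = rng.normal(0.0, np.sqrt(var_zstar), n)
u = rng.normal(0.0, 1.0, n)
y = beta * zstar + u + rng.normal(0.0, 1.0, n)     # y* + eps
z = zstar + rng.normal(0.0, np.sqrt(var_v), n)     # z* + v

beta_ols = (z @ y) / (z @ z)                       # OLS without intercept
plim_beta_ols = beta - beta * var_v / (var_zstar + var_v)
print(beta_ols, plim_beta_ols)
```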
216
4.1.3 Omitted Variable Bias
Consider the “long regression”
$$y=X_1\beta_1+X_2\beta_2+u$$
and suppose that this model satisfies Assumptions 2.1-2.4 (hence OLS based on this equation is consistent). However, for some reason $X_2$ is not included in the regression model (“short regression”):
$$y=X_1\beta_1+\varepsilon,\qquad \varepsilon=X_2\beta_2+u.$$
We are interested only in $\beta_1$. We have
$$b_1=(X_1'X_1)^{-1}X_1'y=(X_1'X_1)^{-1}X_1'(X_1\beta_1+X_2\beta_2+u)=\beta_1+(X_1'X_1)^{-1}X_1'X_2\beta_2+(X_1'X_1)^{-1}X_1'u$$
$$=\beta_1+\left(\frac{X_1'X_1}{n}\right)^{-1}\frac{X_1'X_2}{n}\,\beta_2+\left(\frac{X_1'X_1}{n}\right)^{-1}\frac{X_1'u}{n}.$$
217
This expression converges in probability to
$$\beta_1+\Sigma_{x_1x_1}^{-1}\Sigma_{x_1x_2}\beta_2.$$
The conclusion is that $b_1$ is inconsistent if there are omitted variables that are correlated with $X_1$. The variables in $X_1$ are endogenous as long as $\mathrm{Cov}(X_1,X_2)\neq 0$, since
$$\mathrm{Cov}(X_1,\varepsilon)=\mathrm{Cov}(X_1,X_2\beta_2+u)=\mathrm{Cov}(X_1,X_2)\beta_2.$$
Example. Consider the problem of unobserved ability in a wage equation for working adults. A simple model is
$$\log(WAGE_i)=\beta_1+\beta_2\,educ_i+\beta_3\,abil_i+u_i$$
where $u_i$ is the error term. We put $abil_i$ into the error term, and we are left with the simple regression model
$$\log(WAGE_i)=\beta_1+\beta_2\,educ_i+\varepsilon_i$$
where $\varepsilon_i=\beta_3\,abil_i+u_i$.
218
OLS will be an inconsistent estimator of $\beta_2$ if $educ_i$ and $abil_i$ are correlated. In effect,
$$b_2\xrightarrow{p}\beta_2+\frac{\mathrm{Cov}(educ_i,\varepsilon_i)}{\mathrm{Var}(educ_i)}=\beta_2+\frac{\mathrm{Cov}(educ_i,\beta_3\,abil_i+u_i)}{\mathrm{Var}(educ_i)}=\beta_2+\beta_3\,\frac{\mathrm{Cov}(educ_i,abil_i)}{\mathrm{Var}(educ_i)}.$$
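A quick simulation of this formula, with hypothetical parameters ($\beta_2=0.08$, $\beta_3=0.04$, and educ built to be correlated with ability; none of these numbers come from the slides):

```python
import numpy as np

# Omitted-variable bias in a wage equation (hypothetical parameters).
rng = np.random.default_rng(2)
n = 200_000
b1, b2, b3 = 0.5, 0.08, 0.04

abil = rng.normal(0.0, 1.0, n)
educ = 12.0 + 2.0 * abil + rng.normal(0.0, 2.0, n)  # educ correlated with ability
lw = b1 + b2 * educ + b3 * abil + rng.normal(0.0, 0.3, n)

# Short-regression OLS slope of log-wage on educ
b2_short = np.cov(educ, lw)[0, 1] / np.var(educ, ddof=1)

# Theoretical plim: b2 + b3*Cov(educ, abil)/Var(educ) = 0.08 + 0.04*2/8 = 0.09
plim_b2 = b2 + b3 * 2.0 / 8.0
print(b2_short, plim_b2)
```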
219
4.2 The General Formulation
4.2.1 Regressors and Instruments
Definition. $x_i$ is an instrumental variable (IV) for $z_i$ if (1) $x_i$ is uncorrelated with $\varepsilon_i$, that is, $\mathrm{Cov}(x_i,\varepsilon_i)=0$ (thus, $x_i$ is a predetermined variable), and (2) $x_i$ is correlated with $z_i$, that is, $\mathrm{Cov}(x_i,z_i)\neq 0$.
Exercise 4.1. Consider $\log(wage_i)=\beta_1+\beta_2\,educ_i+\varepsilon_i$. Omitted variable: ability. (a) Is educ an endogenous variable? (b) Can IQ be considered an IV for educ? And mother’s education?

Exercise 4.2. Consider $children_i=\beta_1+\beta_2\,mothereduc_i+\beta_3\,motherage_i+\varepsilon_i$. Omitted variable: $bcm_i$, a dummy equal to one if the mother is informed about birth control methods. (a) Is mothereduc endogenous? (b) Suggest an IV for mothereduc.

Exercise 4.3. Consider $score_i=\beta_1+\beta_2\,skipped_i+\varepsilon_i$. Omitted variable: motivation. (a) Is $skipped_i$ endogenous? (b) Can the distance between home (or living quarters) and university be considered an IV?
220
Exercise 4.4. (Wooldridge, Chap. 15) Consider a simple model to estimate the effect of personal computer (PC) ownership on college grade point average for graduating seniors at a large public university:
$$GPA_i=\beta_1+\beta_2\,PC_i+\varepsilon_i$$
where PC is a binary variable indicating PC ownership. (a) Why might PC ownership be correlated with $\varepsilon_i$? (b) Explain why PC is likely to be related to parents’ annual income. Does this mean parental income is a good IV for PC? Why or why not? (c) Suppose that, four years ago, the university gave grants to buy computers to roughly one-half of the incoming students, and the students who received grants were randomly chosen. Carefully explain how you would use this information to construct an instrumental variable for PC. (d) Same question as (c), but suppose that the university gave grant priority to low-income students.

(See the use of IV in errors-in-variables problems in Wooldridge’s textbook.)
221
Assumption (3.1 - linearity). The equation to be estimated is linear:
$$y_i=z_i'\delta+\varepsilon_i,\qquad i=1,2,\ldots,n,$$
where $z_i$ is an $L$-dimensional vector of regressors, $\delta$ is an $L$-dimensional coefficient vector and $\varepsilon_i$ is an unobservable error term.

Assumption (3.2 - S&WD). Let $x_i$ be a $K$-dimensional vector to be referred to as the vector of instruments, and let $w_i$ be the unique and nonconstant elements of $(y_i,z_i,x_i)$. $\{w_i\}$ is jointly stationary and weakly dependent.

Assumption (3.3 - orthogonality conditions). All the $K$ variables in $x_i$ are predetermined in the sense that they are all orthogonal to the current error term: $\mathrm{E}(x_{ik}\varepsilon_i)=0$ for all $i$ and $k$. This can be written as
$$\mathrm{E}\big(x_i(y_i-z_i'\delta)\big)=0 \quad\text{or}\quad \mathrm{E}(g_i)=0$$
where $g_i=x_i\varepsilon_i$.

Notice: $x_i$ should include the “1” (constant). Not only can $x_{i1}=1$ be considered an IV, it also guarantees that $\mathrm{E}\big(1\cdot(y_i-z_i'\delta)\big)=0\Leftrightarrow\mathrm{E}(\varepsilon_i)=0$.
222
Example (3.1). Consider
$$q_i=\alpha_0+\alpha_1 p_i+u_i \quad\text{(demand equation)}$$
where $\mathrm{Cov}(p_i,u_i)\neq 0$, and $x_i$ is such that $\mathrm{Cov}(x_i,p_i)\neq 0$ but $\mathrm{Cov}(x_i,u_i)=0$. Using previous notation we have:
$$y_i=q_i,\qquad z_i=\begin{bmatrix}1\\ p_i\end{bmatrix},\qquad \delta=\begin{bmatrix}\alpha_0\\ \alpha_1\end{bmatrix},\quad L=2,\qquad x_i=\begin{bmatrix}1\\ x_i\end{bmatrix},\quad K=2,\qquad w_i=\begin{bmatrix}q_i\\ p_i\\ x_i\end{bmatrix}.$$
In the above example, $x_i$ and $z_i$ share the same variable (a constant). The instruments that are also regressors are called predetermined regressors, and the rest of the regressors, those that are not included in $x_i$, are called endogenous regressors.
223
Example (3.2 - wage equation). Consider
$$LW_i=\delta_1+\delta_2 S_i+\delta_3 EXPR_i+\delta_4 IQ_i+\varepsilon_i,$$
where:
$LW_i$ is the log wage of individual $i$;
$S_i$ is completed years of schooling (we assume predetermined);
$EXPR_i$ is experience in years (we assume predetermined);
$IQ_i$ is IQ (an error-ridden measure of the individual’s ability; it is endogenous due to the errors-in-variables problem).
We also have information on:
$AGE_i$ (age of the individual - predetermined);
$MED_i$ (mother’s education in years - predetermined).
Note: $AGE_i$ is excluded from the wage equation, reflecting the underlying assumption that, once experience is controlled for, age has no effect on the wage rate.
224
In terms of the general model,
$$y_i=LW_i,\qquad z_i=\begin{bmatrix}1\\ S_i\\ EXPR_i\\ IQ_i\end{bmatrix},\qquad \delta=\begin{bmatrix}\delta_1\\ \delta_2\\ \delta_3\\ \delta_4\end{bmatrix},\quad L=4,\qquad x_i=\begin{bmatrix}1\\ S_i\\ EXPR_i\\ AGE_i\\ MED_i\end{bmatrix},\quad K=5,$$
$$w_i'=\begin{bmatrix}LW_i & S_i & EXPR_i & IQ_i & AGE_i & MED_i\end{bmatrix}.$$
225
4.2.2 Identification
The GMM estimation of the parameter vector $\delta$ is about how to exploit the information afforded by the orthogonality conditions
$$\mathrm{E}\big(x_i(y_i-z_i'\delta)\big)=0 \;\Leftrightarrow\; \mathrm{E}(x_iz_i')\,\delta=\mathrm{E}(x_iy_i).$$
$\mathrm{E}(x_iz_i')\,\delta=\mathrm{E}(x_iy_i)$ can be interpreted as a linear system with $K$ equations where $\delta$ is the unknown vector. Notice: $\mathrm{E}(x_iz_i')$ is a $K\times L$ matrix and $\mathrm{E}(x_iy_i)$ is a $K\times 1$ vector. Can we solve the system with respect to $\delta$? We need to study the identification of the system.

Assumption (3.4 - rank condition for identification). The $K\times L$ matrix $\mathrm{E}(x_iz_i')$ is of full column rank (i.e., its rank equals $L$, the number of its columns). We denote this matrix by $\Sigma_{xz}$.
226
Example. Consider example 3.2 where
$$x_i=\begin{bmatrix}1\\ S_i\\ EXPR_i\\ AGE_i\\ MED_i\end{bmatrix},\qquad z_i=\begin{bmatrix}1\\ S_i\\ EXPR_i\\ IQ_i\end{bmatrix}.$$
We have
$$x_iz_i'=\begin{bmatrix}1\\ S_i\\ EXPR_i\\ AGE_i\\ MED_i\end{bmatrix}\begin{bmatrix}1 & S_i & EXPR_i & IQ_i\end{bmatrix}=\begin{bmatrix}1 & S_i & EXPR_i & IQ_i\\ S_i & S_i^2 & S_iEXPR_i & S_iIQ_i\\ EXPR_i & EXPR_iS_i & EXPR_i^2 & EXPR_iIQ_i\\ AGE_i & AGE_iS_i & AGE_iEXPR_i & AGE_iIQ_i\\ MED_i & MED_iS_i & MED_iEXPR_i & MED_iIQ_i\end{bmatrix}.$$
227
$$\mathrm{E}(x_iz_i')=\Sigma_{xz}=\begin{bmatrix}1 & \mathrm{E}(S_i) & \mathrm{E}(EXPR_i) & \mathrm{E}(IQ_i)\\ \mathrm{E}(S_i) & \mathrm{E}(S_i^2) & \mathrm{E}(S_iEXPR_i) & \mathrm{E}(S_iIQ_i)\\ \mathrm{E}(EXPR_i) & \mathrm{E}(EXPR_iS_i) & \mathrm{E}(EXPR_i^2) & \mathrm{E}(EXPR_iIQ_i)\\ \mathrm{E}(AGE_i) & \mathrm{E}(AGE_iS_i) & \mathrm{E}(AGE_iEXPR_i) & \mathrm{E}(AGE_iIQ_i)\\ \mathrm{E}(MED_i) & \mathrm{E}(MED_iS_i) & \mathrm{E}(MED_iEXPR_i) & \mathrm{E}(MED_iIQ_i)\end{bmatrix}.$$
Assumption 3.4 requires that $\mathrm{rank}(\Sigma_{xz})=4$.
228
4.2.3 Order Condition for Identification
Since $\mathrm{rank}(\Sigma_{xz})\leq\min\{K,L\}$ we have: if $K<L$ then $\mathrm{rank}(\Sigma_{xz})<L$. Thus a necessary condition for identification is that $K\geq L$.

Definition (order condition for identification). $K\geq L$, i.e.
$$\underbrace{\#\text{orthogonality conditions}}_{K}\;\geq\;\underbrace{\#\text{parameters}}_{L}.$$

Definition. We say that the equation is overidentified if the rank condition is satisfied and $K>L$; exactly identified (or just identified) if the rank condition is satisfied and $K=L$; and underidentified (or not identified) if the order condition is not satisfied (i.e., if $K<L$).
229
Example. Consider the system $Ax=b$, with $A=\mathrm{E}(x_iz_i')$ and $b=\mathrm{E}(x_iy_i)$. It can be proved that the system is always “possible” (it has at least one solution). Consider the following scenarios:

1. If $\mathrm{rank}(A)=L$ and $K=L$ the SLE is exactly identified. Example:
$$\begin{bmatrix}1 & 1\\ 0 & 1\end{bmatrix}\begin{bmatrix}x_1\\ x_2\end{bmatrix}=\begin{bmatrix}3\\ 1\end{bmatrix}\Rightarrow\begin{cases}x_1=2\\ x_2=1\end{cases}$$
Note: $\mathrm{rank}(A)=2=L=K$.

2. If $\mathrm{rank}(A)=L$ and $K>L$ the SLE is overidentified. Example:
$$\begin{bmatrix}1 & 1\\ 0 & 1\\ 0 & 1\end{bmatrix}\begin{bmatrix}x_1\\ x_2\end{bmatrix}=\begin{bmatrix}3\\ 1\\ 1\end{bmatrix}\Rightarrow\begin{cases}x_1=2\\ x_2=1\end{cases}$$
Note: $\mathrm{rank}(A)=2=L$ and $K=3$.
230
3. If $\mathrm{rank}(A)<L$ the SLE is underidentified. Example:
$$\begin{bmatrix}1 & 1\\ 2 & 2\end{bmatrix}\begin{bmatrix}x_1\\ x_2\end{bmatrix}=\begin{bmatrix}2\\ 4\end{bmatrix}\Rightarrow x_1=2-x_2,\;x_2\in\mathbb{R}$$
Note: $\mathrm{rank}(A)=1<L$.

4. If $K<L$ then $\mathrm{rank}(A)<L$ and the SLE is underidentified. Example:
$$\begin{bmatrix}1 & 1\end{bmatrix}\begin{bmatrix}x_1\\ x_2\end{bmatrix}=1\Rightarrow x_1=1-x_2,\;x_2\in\mathbb{R}$$
Note: $\mathrm{rank}(A)=1$ and $K=1<L=2$.
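The four scenarios can be checked mechanically with numpy. The helper name `identification` below is ours, introduced only for this sketch:

```python
import numpy as np

def identification(A):
    """Classify the linear system Ax = b by the order and rank conditions."""
    K, L = A.shape
    if np.linalg.matrix_rank(A) < L:
        return "underidentified"
    return "exactly identified" if K == L else "overidentified"

# The four example coefficient matrices from the slides
cases = {
    "1": np.array([[1.0, 1.0], [0.0, 1.0]]),
    "2": np.array([[1.0, 1.0], [0.0, 1.0], [0.0, 1.0]]),
    "3": np.array([[1.0, 1.0], [2.0, 2.0]]),
    "4": np.array([[1.0, 1.0]]),
}
status = {k: identification(A) for k, A in cases.items()}
print(status)
```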
231
4.2.4 The Assumption for Asymptotic Normality
Assumption (3.5 - $\{g_i\}$ is a martingale difference sequence with finite second moments). Let $g_i=x_i\varepsilon_i$. $\{g_i\}$ is a martingale difference sequence (so $\mathrm{E}(g_i)=0$). The $K\times K$ matrix of cross moments, $\mathrm{E}(g_ig_i')$, is nonsingular. Let $S=\mathrm{Avar}(\bar g)$.

Remarks:

• Assumption 3.5 implies $\mathrm{Avar}(\bar g)=\lim\mathrm{Var}(\sqrt{n}\,\bar g)=\mathrm{E}(g_ig_i')$.

• Assumption 3.5 implies $\sqrt{n}\,\bar g\xrightarrow{d}N\big(0,\mathrm{Avar}(\bar g)\big)$.

• If the instruments include a constant, then this assumption implies that the error is a martingale difference sequence (and a fortiori serially uncorrelated).
232
• A sufficient and perhaps easier to understand condition for Assumption 3.5 is that $\mathrm{E}(\varepsilon_i|\mathcal{F}_i)=0$, where
$$\mathcal{I}_{i-1}=\{\varepsilon_{i-1},\varepsilon_{i-2},\ldots,\varepsilon_1,x_{i-1},\ldots,x_1\},\qquad \mathcal{F}_i=\mathcal{I}_{i-1}\cup\{x_i\}=\{\varepsilon_{i-1},\varepsilon_{i-2},\ldots,\varepsilon_1,x_i,x_{i-1},\ldots,x_1\}.$$
It implies the error term is orthogonal not only to the current but also to the past instruments.

• Since $g_ig_i'=\varepsilon_i^2x_ix_i'$, $S$ is a matrix of fourth moments. Consistent estimation of $S$ will require a fourth-moment assumption, to be specified in Assumption 3.6 below.

• If $\{g_i\}$ is serially correlated, then $S$ does not equal $\mathrm{E}(g_ig_i')$ and will take a more complicated form.
233
4.3 Generalized Method of Moments (GMM) Defined
The method of moments principle: to estimate a feature of the population, use the corresponding feature of the sample.

Examples:

Parameter of the population | Estimator
$\mathrm{E}(y_i)$ | $\bar Y$
$\mathrm{Var}(y_i)$ | $S_y^2$
$\mathrm{E}\big(x_i(y_i-z_i'\delta)\big)$ | $\frac{1}{n}\sum_i x_i(y_i-z_i'\delta)$

Method of moments: choose the parameter estimate so that the corresponding sample moments are also equal to zero. Since we know that $\mathrm{E}\big(x_i(y_i-z_i'\delta)\big)=0$ we choose the parameter estimate $\hat\delta$ so that
$$\frac{1}{n}\sum_{i=1}^n x_i(y_i-z_i'\hat\delta)=0.$$
234
Another way of writing $\frac{1}{n}\sum_{i=1}^n x_i(y_i-z_i'\hat\delta)=0$:
$$\frac{1}{n}\sum_{i=1}^n g_i=0 \;\Leftrightarrow\; \underbrace{\frac{1}{n}\sum_{i=1}^n g_i(w_i;\hat\delta)}_{\bar g_n(\hat\delta)}=0 \;\Leftrightarrow\; \bar g_n(\hat\delta)=0.$$
Let’s expand $\bar g_n(\hat\delta)=0$:
$$\frac{1}{n}\sum_{i=1}^n x_i(y_i-z_i'\hat\delta)=0 \;\Rightarrow\; \frac{1}{n}\sum_{i=1}^n x_iy_i-\frac{1}{n}\sum_{i=1}^n x_iz_i'\hat\delta=0 \;\Rightarrow\; \frac{1}{n}\sum_{i=1}^n x_iz_i'\hat\delta=\frac{1}{n}\sum_{i=1}^n x_iy_i \;\Rightarrow\; S_{xz}\hat\delta=s_{xy}.$$
235
Thus:

• $\underset{(K\times L)}{S_{xz}}\;\underset{(L\times 1)}{\hat\delta}=\underset{(K\times 1)}{s_{xy}}$ is a system with $K$ (linear) equations in $L$ unknowns.

• $S_{xz}\hat\delta=s_{xy}$ is the sample analogue of $\mathrm{E}\big(x_i(y_i-z_i'\delta)\big)=0$, that is, of $\mathrm{E}(x_iz_i')\,\delta=\mathrm{E}(x_iy_i)$.
236
4.3.1 Method of Moments
Consider
$$S_{xz}\hat\delta=s_{xy}.$$
If $K=L$ and $\mathrm{rank}(\Sigma_{xz})=L$, then $\Sigma_{xz}:=\mathrm{E}(x_iz_i')$ is invertible and $S_{xz}$ is invertible (in probability, for $n$ large enough).

Solving $S_{xz}\hat\delta=s_{xy}$ with respect to $\hat\delta$ gives
$$\hat\delta_{IV}=S_{xz}^{-1}s_{xy}=\left(\frac{1}{n}\sum_{i=1}^n x_iz_i'\right)^{-1}\frac{1}{n}\sum_{i=1}^n x_iy_i=\left(\sum_{i=1}^n x_iz_i'\right)^{-1}\sum_{i=1}^n x_iy_i=(X'Z)^{-1}X'y.$$
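A sketch of the IV estimator $(X'Z)^{-1}X'y$ on simulated data, compared with OLS. The DGP is hypothetical: $z_{i2}$ is endogenous (it loads on the error) and $x_{i2}$ is a valid instrument.

```python
import numpy as np

# Hypothetical DGP: y = d1 + d2*z2 + eps, with z2 endogenous and
# x2 a valid instrument (Cov(x2, eps) = 0, Cov(x2, z2) != 0).
rng = np.random.default_rng(3)
n = 100_000
d1, d2 = 1.0, 2.0

x2 = rng.normal(0.0, 1.0, n)
eps = rng.normal(0.0, 1.0, n)
z2 = 0.8 * x2 + 0.5 * eps + rng.normal(0.0, 1.0, n)  # endogenous regressor
y = d1 + d2 * z2 + eps

Z = np.column_stack([np.ones(n), z2])
X = np.column_stack([np.ones(n), x2])

delta_iv = np.linalg.solve(X.T @ Z, X.T @ y)    # (X'Z)^{-1} X'y
delta_ols = np.linalg.solve(Z.T @ Z, Z.T @ y)   # for comparison
print(delta_iv, delta_ols)
```

The IV estimates land near (1, 2), while the OLS slope converges to $d_2+\mathrm{Cov}(z_{i2},\varepsilon_i)/\mathrm{Var}(z_{i2})\approx 2.26$ under these hypothetical values.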
237
Example. Consider
$$y_i=\delta_1+\delta_2 z_{i2}+\varepsilon_i$$
and suppose that $\mathrm{Cov}(z_{i2},\varepsilon_i)\neq 0$, that is, $z_{i2}$ is an endogenous variable. We have $L=2$, so we need at least $K=2$ instrumental variables. Let $x_i'=(1\;\;x_{i2})$ and suppose that $\mathrm{Cov}(x_{i2},\varepsilon_i)=0$ and $\mathrm{Cov}(x_{i2},z_{i2})\neq 0$. Thus an IV estimator is
$$\hat\delta_{IV}=(X'Z)^{-1}X'y.$$

Exercise 4.5. Consider the previous example. (a) Show that the IV estimator $\hat\delta_{2,IV}$ can be written as
$$\hat\delta_{2,IV}=\frac{\sum_{i=1}^n (x_{i2}-\bar x_2)(y_i-\bar y)}{\sum_{i=1}^n (x_{i2}-\bar x_2)(z_{i2}-\bar z_2)}.$$
(b) Show $\mathrm{Cov}(x_{i2},y_i)=\delta_2\,\mathrm{Cov}(x_{i2},z_{i2})+\mathrm{Cov}(x_{i2},\varepsilon_i)$; (c) Based on part (b), show that $\hat\delta_{2,IV}\xrightarrow{p}\delta_2$ (write the assumptions you need to prove these results).
238
4.3.2 GMM
It may happen that $K>L$ (there are more orthogonality conditions than parameters). In principle, it is better to have as many IVs as possible, so the case $K>L$ is desirable, but then the system $S_{xz}\hat\delta=s_{xy}$ may not have a solution.

Example. Suppose
$$S_{xz}=\begin{bmatrix}1.00 & 0.097 & 0.099\\ 0.097 & 1.011 & -0.059\\ 0.099 & -0.059 & 0.967\\ -0.182 & 0.203 & -0.031\end{bmatrix},\qquad s_{xy}=\begin{bmatrix}1.954\\ 1.346\\ -0.900\\ -0.0262\end{bmatrix}$$
($K=4$, $L=3$) and try (if you can) to solve $S_{xz}\hat\delta=s_{xy}$. This system is of the same type as
$$\delta_1+\delta_2=1,\qquad \delta_3=1,\qquad \delta_4+\delta_5=5,\qquad \delta_1+\delta_2=2$$
(the first and fourth equations are incompatible - the system is impossible - there is no solution).
239
This means we cannot set $\bar g_n(\hat\delta)$ exactly equal to $0$. However, we can at least choose $\hat\delta$ so that $\bar g_n(\hat\delta)$ is as close to $0$ as possible. In Linear Algebra two vectors are “close” if the distance between them is relatively small. We will define the distance in $\mathbb{R}^K$ as follows: the distance between $\xi$ and $\eta$ is equal to
$$(\xi-\eta)'\,\hat{W}\,(\xi-\eta)$$
where $\hat{W}$, called the weighting matrix, is a symmetric positive definite matrix defining the distance.

Example. If
$$\xi=\begin{bmatrix}1\\ 2\end{bmatrix},\qquad \eta=\begin{bmatrix}3\\ 5\end{bmatrix},\qquad \hat{W}=\begin{bmatrix}1 & 0\\ 0 & 1\end{bmatrix},$$
the distance between these two vectors is
$$(\xi-\eta)'\,\hat{W}\,(\xi-\eta)=\begin{bmatrix}1-3 & 2-5\end{bmatrix}\begin{bmatrix}1-3\\ 2-5\end{bmatrix}=2^2+3^2=13.$$
240
Definition (3.1 - GMM estimator). Let $\hat{W}$ be a $K\times K$ symmetric positive definite matrix, possibly dependent on the sample, such that $\hat{W}\xrightarrow{p}W$ as $n\to\infty$, with $W$ symmetric and positive definite. The GMM estimator of $\delta$, denoted $\hat\delta(\hat{W})$, is
$$\hat\delta(\hat{W})=\arg\min_\delta J(\delta,\hat{W})$$
where
$$J(\delta,\hat{W})=n\,\bar g_n(\delta)'\,\hat{W}\,\bar g_n(\delta)=n\,(s_{xy}-S_{xz}\delta)'\,\hat{W}\,(s_{xy}-S_{xz}\delta).$$

Proposition. Under Assumptions 3.2 and 3.4 the GMM estimator is
$$\hat\delta(\hat{W})=(S_{xz}'\hat{W}S_{xz})^{-1}S_{xz}'\hat{W}s_{xy}.$$
To prove this proposition you need the following rule:
$$\frac{\partial(q'\hat{W}q)}{\partial\delta}=2\,\frac{\partial q'}{\partial\delta}\,\hat{W}q$$
where $q$ is a $K\times 1$ vector depending on $\delta$ and $\hat{W}$ is a $K\times K$ matrix not depending on $\delta$.
241
If $K=L$ then $S_{xz}$ is invertible and $\hat\delta(\hat{W})$ reduces to the IV estimator:
$$\hat\delta(\hat{W})=(S_{xz}'\hat{W}S_{xz})^{-1}S_{xz}'\hat{W}s_{xy}=S_{xz}^{-1}\hat{W}^{-1}(S_{xz}')^{-1}S_{xz}'\hat{W}s_{xy}=S_{xz}^{-1}s_{xy}=\hat\delta_{IV}.$$
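The closed form $(S_{xz}'\hat{W}S_{xz})^{-1}S_{xz}'\hat{W}s_{xy}$ can be coded directly, together with a numerical check that with $K=L$ it collapses to $S_{xz}^{-1}s_{xy}$ whatever the weighting matrix. The data and the s.p.d. matrix `W` below are arbitrary and hypothetical:

```python
import numpy as np

# Hypothetical exactly identified model (K = L = 2).
rng = np.random.default_rng(4)
n = 50_000
x2 = rng.normal(0.0, 1.0, n)
eps = rng.normal(0.0, 1.0, n)
z2 = 0.8 * x2 + 0.5 * eps + rng.normal(0.0, 1.0, n)
y = 1.0 + 2.0 * z2 + eps
Z = np.column_stack([np.ones(n), z2])
X = np.column_stack([np.ones(n), x2])

Sxz = X.T @ Z / n
sxy = X.T @ y / n

def gmm(W):
    """GMM closed form: (Sxz' W Sxz)^{-1} Sxz' W sxy."""
    return np.linalg.solve(Sxz.T @ W @ Sxz, Sxz.T @ W @ sxy)

W = np.array([[2.0, 0.3], [0.3, 1.0]])          # arbitrary s.p.d. weight
delta_w = gmm(W)
delta_iv = np.linalg.solve(Sxz, sxy)             # Sxz^{-1} sxy
print(delta_w, delta_iv)
```

Since $K=L$ here, the two estimates coincide up to floating-point error, as the algebra above predicts.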
4.3.3 Sampling Error
The GMM estimator can be written as
$$\hat\delta(\hat{W})=\delta+(S_{xz}'\hat{W}S_{xz})^{-1}S_{xz}'\hat{W}\bar g.$$
242
Proof: First consider
$$s_{xy}=\frac{1}{n}\sum_i x_iy_i=\frac{1}{n}\sum_i x_i(z_i'\delta+\varepsilon_i)=\frac{1}{n}\sum_i x_iz_i'\delta+\frac{1}{n}\sum_i x_i\varepsilon_i=S_{xz}\delta+\bar g.$$
Replacing $s_{xy}=S_{xz}\delta+\bar g$ into $\hat\delta(\hat{W})=(S_{xz}'\hat{W}S_{xz})^{-1}S_{xz}'\hat{W}s_{xy}$ produces:
$$\hat\delta(\hat{W})=(S_{xz}'\hat{W}S_{xz})^{-1}S_{xz}'\hat{W}(S_{xz}\delta+\bar g)=(S_{xz}'\hat{W}S_{xz})^{-1}S_{xz}'\hat{W}S_{xz}\delta+(S_{xz}'\hat{W}S_{xz})^{-1}S_{xz}'\hat{W}\bar g=\delta+(S_{xz}'\hat{W}S_{xz})^{-1}S_{xz}'\hat{W}\bar g.$$
243
4.4 Large-Sample Properties of GMM
4.4.1 Asymptotic Distribution of the GMM Estimator
Proposition (3.1 - asymptotic distribution of the GMM estimator). (a) (Consistency) Under Assumptions 3.1-3.4, $\hat\delta(\hat{W})\xrightarrow{p}\delta$; (b) (Asymptotic Normality) If Assumption 3.3 is strengthened to Assumption 3.5, then
$$\sqrt{n}\,\big(\hat\delta(\hat{W})-\delta\big)\xrightarrow{d}N\big(0,\mathrm{Avar}(\hat\delta(\hat{W}))\big)$$
where
$$\mathrm{Avar}(\hat\delta(\hat{W}))=(\Sigma_{xz}'W\Sigma_{xz})^{-1}\Sigma_{xz}'WSW\Sigma_{xz}(\Sigma_{xz}'W\Sigma_{xz})^{-1}.$$
Recall: $S\equiv\mathrm{E}(g_ig_i')$. (c) (Consistent Estimate of $\mathrm{Avar}(\hat\delta(\hat{W}))$) Suppose there is available a consistent estimator, $\hat{S}$, of $S$. Then, under Assumption 3.2, $\mathrm{Avar}(\hat\delta(\hat{W}))$ is consistently estimated by
$$\widehat{\mathrm{Avar}}(\hat\delta(\hat{W}))=(S_{xz}'\hat{W}S_{xz})^{-1}S_{xz}'\hat{W}\hat{S}\hat{W}S_{xz}(S_{xz}'\hat{W}S_{xz})^{-1}.$$
244
4.4.2 Estimation of Error Variance
Proposition (3.2 - consistent estimation of error variance). For any consistent estimator $\hat\delta$, under Assumptions 3.1 and 3.2 and the assumptions that $\mathrm{E}(z_iz_i')$ and $\mathrm{E}(\varepsilon_i^2)$ exist and are finite, we have
$$\frac{1}{n}\sum_{i=1}^n\hat\varepsilon_i^2\xrightarrow{p}\mathrm{E}(\varepsilon_i^2),\qquad\text{where }\hat\varepsilon_i\equiv y_i-z_i'\hat\delta.$$
4.4.3 Hypothesis Testing
Proposition (3.3 - robust t-ratio and Wald statistics). Suppose Assumptions 3.1-3.5 hold, and suppose there is available a consistent estimate $\hat{S}$ of $S$ ($\equiv\mathrm{Avar}(\bar g)=\mathrm{E}(g_ig_i')$). Let
$$\widehat{\mathrm{Avar}}(\hat\delta(\hat{W}))=(S_{xz}'\hat{W}S_{xz})^{-1}S_{xz}'\hat{W}\hat{S}\hat{W}S_{xz}(S_{xz}'\hat{W}S_{xz})^{-1}.$$
245
Then (a) under the null $H_0:\delta_j=\delta_j^0$,
$$t_j=\frac{\sqrt{n}\,\big(\hat\delta_j(\hat{W})-\delta_j^0\big)}{\sqrt{\big(\widehat{\mathrm{Avar}}(\hat\delta(\hat{W}))\big)_{jj}}}=\frac{\hat\delta_j(\hat{W})-\delta_j^0}{SE_j}\xrightarrow{d}N(0,1)$$
where $\big(\widehat{\mathrm{Avar}}(\hat\delta(\hat{W}))\big)_{jj}$ is the $(j,j)$ element of $\widehat{\mathrm{Avar}}(\hat\delta(\hat{W}))$ and
$$SE_j=\sqrt{\frac{1}{n}\big(\widehat{\mathrm{Avar}}(\hat\delta(\hat{W}))\big)_{jj}}.$$
(b) Under the null hypothesis $H_0:R\delta=r$, where $p$ is the number of restrictions and $R$ ($p\times L$) is of full row rank,
$$\mathcal{W}=n\,\big(R\hat\delta(\hat{W})-r\big)'\big(R\,\widehat{\mathrm{Avar}}(\hat\delta(\hat{W}))\,R'\big)^{-1}\big(R\hat\delta(\hat{W})-r\big)\xrightarrow{d}\chi^2_{(p)}.$$
246
4.4.4 Estimation of S
Let
$$\hat{S}\equiv\frac{1}{n}\sum_{i=1}^n\hat\varepsilon_i^2\,x_ix_i',\qquad\text{where }\hat\varepsilon_i\equiv y_i-z_i'\hat\delta.$$

Assumption (3.6 - finite fourth moments). $\mathrm{E}\big((x_{ik}z_{i\ell})^2\big)$ exists and is finite for all $k=1,\ldots,K$ and $\ell=1,\ldots,L$.

Proposition (3.4 - consistent estimation of S). Suppose $\hat\delta$ is consistent and $S=\mathrm{E}(g_ig_i')$ exists and is finite. Then under Assumptions 3.1, 3.2 and 3.6 the estimator $\hat{S}$ above is consistent.
247
4.4.5 Efficient GMM Estimator
The next proposition provides a choice of $\hat{W}$ that minimizes the asymptotic variance.

Proposition (3.5 - optimal choice of the weighting matrix). If $\hat{W}$ is chosen such that
$$\hat{W}\xrightarrow{p}S^{-1}$$
then the lower bound for the asymptotic variance of the GMM estimators is reached, which is equal to
$$(\Sigma_{xz}'S^{-1}\Sigma_{xz})^{-1}.$$

Definition. The estimator
$$\hat\delta(\hat{S}^{-1})=\arg\min_\delta n\,\bar g_n(\delta)'\,\hat{S}^{-1}\,\bar g_n(\delta)$$
is called the efficient GMM estimator.
248
The efficient GMM estimator can be written as
$$\hat\delta(\hat{S}^{-1})=(S_{xz}'\hat{S}^{-1}S_{xz})^{-1}S_{xz}'\hat{S}^{-1}s_{xy}$$
and
$$\mathrm{Avar}(\hat\delta(\hat{S}^{-1}))=(\Sigma_{xz}'S^{-1}\Sigma_{xz})^{-1},\qquad \widehat{\mathrm{Avar}}(\hat\delta(\hat{S}^{-1}))=(S_{xz}'\hat{S}^{-1}S_{xz})^{-1}.$$
249
To calculate the efficient GMM estimator, we need the consistent estimator $\hat{S}$, which depends on $\hat\varepsilon_i$. This leads us to the following two-step efficient GMM procedure:

Step 1: Compute $\hat{S}\equiv\frac{1}{n}\sum_{i=1}^n\hat\varepsilon_i^2\,x_ix_i'$, where $\hat\varepsilon_i=y_i-z_i'\hat\delta$. To obtain $\hat\delta$:
$$\hat\delta(\hat{W})=\arg\min_\delta n\,(s_{xy}-S_{xz}\delta)'\,\hat{W}\,(s_{xy}-S_{xz}\delta)$$
where $\hat{W}$ is a matrix that converges in probability to a symmetric and positive definite matrix, for example $\hat{W}=S_{xx}^{-1}$. With this choice, use the (so-called) 2SLS estimator $\hat\delta(S_{xx}^{-1})$ to obtain the residuals $\hat\varepsilon_i=y_i-z_i'\hat\delta$ and $\hat{S}\equiv\frac{1}{n}\sum_{i=1}^n\hat\varepsilon_i^2\,x_ix_i'$.

Step 2: Minimize $J(\delta,\hat{S}^{-1})$ with respect to $\delta$. The minimizer is the efficient GMM estimator,
$$\hat\delta(\hat{S}^{-1})=\arg\min_\delta n\,(s_{xy}-S_{xz}\delta)'\,\hat{S}^{-1}\,(s_{xy}-S_{xz}\delta).$$
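A sketch of the two-step procedure on simulated overidentified data ($K=3$, $L=2$, with heteroskedastic errors so that the efficient weighting genuinely differs from the step-1 weight; all parameter values are hypothetical):

```python
import numpy as np

# Hypothetical overidentified DGP with conditionally heteroskedastic errors.
rng = np.random.default_rng(5)
n = 50_000
x2 = rng.normal(0.0, 1.0, n)
x3 = rng.normal(0.0, 1.0, n)
eps = rng.normal(0.0, 1.0, n) * (1.0 + 0.5 * np.abs(x2))  # E(eps|x) = 0
z2 = 0.7 * x2 + 0.4 * x3 + 0.5 * eps + rng.normal(0.0, 1.0, n)
y = 1.0 + 2.0 * z2 + eps

Z = np.column_stack([np.ones(n), z2])
X = np.column_stack([np.ones(n), x2, x3])
Sxz, sxy, Sxx = X.T @ Z / n, X.T @ y / n, X.T @ X / n

def gmm(W):
    return np.linalg.solve(Sxz.T @ W @ Sxz, Sxz.T @ W @ sxy)

# Step 1: preliminary estimate with W = Sxx^{-1} (2SLS), then S-hat
delta_1 = gmm(np.linalg.inv(Sxx))
e = y - Z @ delta_1
S_hat = (X * e[:, None] ** 2).T @ X / n       # (1/n) sum e_i^2 x_i x_i'

# Step 2: efficient GMM with W = S-hat^{-1}
delta_eff = gmm(np.linalg.inv(S_hat))
print(delta_1, delta_eff)
```

Both steps deliver estimates near the true (1, 2); the step-2 estimator is the efficient one under heteroskedasticity.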
250
Example. (Wooldridge, Chap. 15 - database: card) Wage and education data for a sample of men in 1976.

Dependent Variable: LOG(WAGE)
Method: Least Squares
Sample: 1 3010
Included observations: 3010

Variable    Coefficient  Std. Error  t-Statistic  Prob.
C           4.733664     0.067603    70.02193     0.0000
EDUC        0.074009     0.003505    21.11264     0.0000
EXPER       0.083596     0.006648    12.57499     0.0000
EXPER^2     0.002241     0.000318    7.050346     0.0000
BLACK       0.189632     0.017627    10.75828     0.0000
SMSA        0.161423     0.015573    10.36538     0.0000
SOUTH       0.124862     0.015118    8.259006     0.0000

R-squared 0.290505   Mean dependent var 6.261832
Adjusted R-squared 0.289088   S.D. dependent var 0.443798
S.E. of regression 0.374191   Akaike info criterion 0.874220
Sum squared resid 420.4760   Schwarz criterion 0.888196
Log likelihood 1308.702   Hannan-Quinn criter. 0.879247
F-statistic 204.9318   Durbin-Watson stat 1.861291
Prob(F-statistic) 0.000000

SMSA = 1 if in Standard Metropolitan Statistical Area in 1976.
NEAR4 = 1 if he grew up near a 4-year college.
251
252
$$z_i'=\begin{bmatrix}1 & EDUC_i & EXPER_i & EXPER_i^2 & BLACK_i & SMSA_i & SOUTH_i\end{bmatrix}$$
$$x_i'=\begin{bmatrix}1 & EXPER_i & EXPER_i^2 & BLACK_i & SMSA_i & SOUTH_i & NEARC4_i & NEARC2_i\end{bmatrix}$$

Dependent Variable: LOG(WAGE)
Method: Generalized Method of Moments
Sample: 1 3010
Included observations: 3010
Linear estimation with 1 weight update
Estimation weighting matrix: HAC (Bartlett kernel, Newey-West fixed bandwidth = 9.0000)
Standard errors & covariance computed using estimation weighting matrix
Instrument specification: C EXPER EXPER^2 BLACK SMSA SOUTH NEARC4 NEARC2

Variable    Coefficient  Std. Error  t-Statistic  Prob.
C           3.330464     0.886167    3.758280     0.0002
EDUC        0.157469     0.052578    2.994963     0.0028
EXPER       0.117223     0.022676    5.169509     0.0000
EXPER^2     0.002277     0.000380    5.997813     0.0000
BLACK       0.106718     0.056652    1.883736     0.0597
SMSA        0.119990     0.030595    3.921874     0.0001
SOUTH       0.095977     0.025905    3.704972     0.0002

R-squared 0.156572   Mean dependent var 6.261832
Adjusted R-squared 0.154887   S.D. dependent var 0.443798
S.E. of regression 0.407983   Sum squared resid 499.8506
Durbin-Watson stat 1.866667   J-statistic 2.200989
Instrument rank 8   Prob(J-statistic) 0.137922
253
4.5 Testing Overidentifying Restrictions
4.5.1 Testing all Orthogonality Conditions
If the equation is exactly identified then $J(\hat\delta,\hat{W})=0$. If the equation is overidentified then $J(\hat\delta,\hat{W})>0$. When $\hat{W}$ is chosen optimally, so that $\hat{W}=\hat{S}^{-1}\xrightarrow{p}S^{-1}$, then $J(\hat\delta(\hat{S}^{-1}),\hat{S}^{-1})$ is asymptotically chi-squared.

Proposition (3.6 - Hansen’s test of overidentifying restrictions). Under Assumptions 3.1-3.5,
$$J\big(\hat\delta(\hat{S}^{-1}),\hat{S}^{-1}\big)\xrightarrow{d}\chi^2_{(K-L)}.$$
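A sketch of the J statistic on simulated data with valid instruments ($K=3$, $L=2$, hence one overidentifying restriction); the $\chi^2(1)$ p-value is computed with `math.erfc` to avoid extra dependencies, and the DGP is hypothetical:

```python
import math
import numpy as np

# Hypothetical overidentified model with valid instruments (K - L = 1).
rng = np.random.default_rng(6)
n = 50_000
x2 = rng.normal(0.0, 1.0, n)
x3 = rng.normal(0.0, 1.0, n)
eps = rng.normal(0.0, 1.0, n)
z2 = 0.7 * x2 + 0.4 * x3 + 0.5 * eps + rng.normal(0.0, 1.0, n)
y = 1.0 + 2.0 * z2 + eps

Z = np.column_stack([np.ones(n), z2])
X = np.column_stack([np.ones(n), x2, x3])
Sxz, sxy, Sxx = X.T @ Z / n, X.T @ y / n, X.T @ X / n

# Two-step efficient GMM
Wx = np.linalg.inv(Sxx)
d1 = np.linalg.solve(Sxz.T @ Wx @ Sxz, Sxz.T @ Wx @ sxy)
e = y - Z @ d1
S_hat = (X * e[:, None] ** 2).T @ X / n
Wopt = np.linalg.inv(S_hat)
d_eff = np.linalg.solve(Sxz.T @ Wopt @ Sxz, Sxz.T @ Wopt @ sxy)

# Hansen's J: n * g_bar' S_hat^{-1} g_bar, compared with chi2(K - L = 1)
g_bar = sxy - Sxz @ d_eff
J = n * g_bar @ Wopt @ g_bar
p_value = math.erfc(math.sqrt(J / 2.0))   # chi2(1) survival function
print(J, p_value)
```

With valid instruments the J statistic should look like a $\chi^2(1)$ draw, so rejections at conventional levels are rare.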
254
Two comments:

1) This is a specification test, testing whether all the restrictions of the model (which are the assumptions maintained in Proposition 3.6) are satisfied. If $J(\hat\delta(\hat{S}^{-1}),\hat{S}^{-1})$ is surprisingly large, it means that either the orthogonality conditions (Assumption 3.3) or the other assumptions (or both) are likely to be false. Only when we are confident about those other assumptions can we interpret the large J statistic as evidence for the endogeneity of some of the $K$ instruments included in $x_i$.

2) Small-sample properties of the test may be a matter of concern.

Example (continuation). EVIEWS provides the J statistic of Proposition 3.6:
255
Dependent Variable: LOG(WAGE)
Method: Generalized Method of Moments
Sample: 1 3010
Included observations: 3010
Linear estimation & iterate weights
Estimation weighting matrix: White
Standard errors & covariance computed using estimation weighting matrix
Convergence achieved after 2 weight iterations
Instrument specification: C EXPER EXPER^2 BLACK SMSA SOUTH NEARC4 NEARC2

Variable    Coefficient  Std. Error  t-Statistic  Prob.
C           3.307001     0.814185    4.061733     0.0000
EDUC        0.158840     0.048355    3.284842     0.0010
EXPER       0.118205     0.021229    5.567988     0.0000
EXPER^2     0.002296     0.000367    6.250943     0.0000
BLACK       0.105678     0.051814    2.039573     0.0415
SMSA        0.117018     0.030158    3.880117     0.0001
SOUTH       0.096095     0.023342    4.116897     0.0000

R-squared 0.152137   Mean dependent var 6.261832
Adjusted R-squared 0.150443   S.D. dependent var 0.443798
S.E. of regression 0.409055   Sum squared resid 502.4789
Durbin-Watson stat 1.866149   J-statistic 2.673614
Instrument rank 8   Prob(J-statistic) 0.102024
256
4.5.2 Testing Subsets of Orthogonality Conditions
Consider
$$x_i=\begin{bmatrix}x_{i1}\\ x_{i2}\end{bmatrix},$$
where $x_{i1}$ has $K_1$ rows and $x_{i2}$ has $K-K_1$ rows. We want to test $H_0:\mathrm{E}(x_{i2}\varepsilon_i)=0$.

The basic idea is to compare two J statistics from two separate GMM estimators, one using only the instruments included in $x_{i1}$ and the other using also the suspect instruments $x_{i2}$ in addition to $x_{i1}$. If the inclusion of the suspect instruments significantly increases the J statistic, that is a good reason for doubting the predeterminedness of $x_{i2}$. This restriction is testable only if $K_1\geq L$ (why?).
257
Proposition (3.7 - testing a subset of orthogonality conditions). Suppose that the rank condition is satisfied for $x_{i1}$, so $\mathrm{E}(x_{i1}z_i')$ is of full column rank, and that Assumptions 3.1-3.5 hold. Let
$$J=n\,\bar g_n(\hat\delta)'\,\hat{S}^{-1}\,\bar g_n(\hat\delta),\qquad \hat\delta=(S_{xz}'\hat{S}^{-1}S_{xz})^{-1}S_{xz}'\hat{S}^{-1}s_{xy},$$
$$J_1=n\,\bar g_{1n}(\tilde\delta)'\,\hat{S}_{11}^{-1}\,\bar g_{1n}(\tilde\delta),\qquad \tilde\delta=(S_{x_1z}'\hat{S}_{11}^{-1}S_{x_1z})^{-1}S_{x_1z}'\hat{S}_{11}^{-1}s_{x_1y}.$$
Then, under the null $H_0:\mathrm{E}(x_{i2}\varepsilon_i)=0$,
$$C\equiv J-J_1\xrightarrow{d}\chi^2_{(K-K_1)}.$$
258
Example. EVIEWS 7 performs this test. Following the previous example, suppose you want to test $\mathrm{E}(nearc4_i\varepsilon_i)=0$. In our case, $x_{i1}$ is a $7\times 1$ vector and $x_{i2}=nearc4_i$ is a scalar ($L=7$, $K_1=7$, $K-K_1=1$).
259
Instrument Orthogonality C-test
Test Equation: EQ03
Specification: LOG(WAGE) C EDUC EXPER EXPER^2 BLACK SMSA SOUTH
Instrument specification: C EXPER EXPER^2 BLACK SMSA SOUTH NEARC4 NEARC2
Test instruments: NEARC4

                        Value      df   Probability
Difference in J-stats   2.673614   1    0.1020

J-statistic summary:
                           Value
Restricted J-statistic     2.673614
Unrestricted J-statistic   5.16E-33

Unrestricted Test Equation:
Dependent Variable: LOG(WAGE)
Method: Generalized Method of Moments
Fixed weighting matrix for test evaluation
Standard errors & covariance computed using estimation weighting matrix
Instrument specification: C EXPER EXPER^2 BLACK SMSA SOUTH NEARC2

Variable    Coefficient  Std. Error  t-Statistic  Prob.
C           0.092557     2.127447    0.043506     0.9653
EDUC        0.349764     0.126360    2.768002     0.0057
EXPER       0.196690     0.052475    3.748287     0.0002
EXPER^2     0.002445     0.000378    6.467830     0.0000
BLACK       0.088724     0.129667    0.684247     0.4939
SMSA        0.019006     0.067085    0.283317     0.7770
SOUTH       0.030415     0.046444    0.654869     0.5126

R-squared -1.171522   Mean dependent var 6.261832
Adjusted R-squared -1.175861   S.D. dependent var 0.443798
S.E. of regression 0.654637   Sum squared resid 1286.934
Durbin-Watson stat 1.818008   J-statistic 5.16E-33
Instrument rank 7
260
4.5.3 Regressor Endogeneity Test
We can use Proposition 3.7 to test for the endogeneity of a subset of regressors. See Example 3.3 of the book.

4.6 Implications of Conditional Homoskedasticity

Assume now:

Assumption (3.7 - conditional homoskedasticity). $\mathrm{E}(\varepsilon_i^2|x_i)=\sigma^2$.

This assumption implies
$$S\equiv\mathrm{E}(g_ig_i')=\mathrm{E}(\varepsilon_i^2x_ix_i')=\sigma^2\,\mathrm{E}(x_ix_i')=\sigma^2\Sigma_{xx}.$$
Its estimator is
$$\hat{S}=\hat\sigma^2 S_{xx}.$$
261
4.6.1 Efficient GMM Becomes 2SLS
The efficient GMM is
$$\hat\delta(\hat{S}^{-1})=(S_{xz}'\hat{S}^{-1}S_{xz})^{-1}S_{xz}'\hat{S}^{-1}s_{xy}=\big(S_{xz}'(\hat\sigma^2S_{xx})^{-1}S_{xz}\big)^{-1}S_{xz}'(\hat\sigma^2S_{xx})^{-1}s_{xy}=(S_{xz}'S_{xx}^{-1}S_{xz})^{-1}S_{xz}'S_{xx}^{-1}s_{xy}\equiv\hat\delta_{2SLS}.$$
The estimator $\hat\delta_{2SLS}$ is called two-stage least squares (2SLS or TSLS), for reasons we explain below. It follows that
$$\mathrm{Avar}(\hat\delta_{2SLS})=\sigma^2(\Sigma_{xz}'\Sigma_{xx}^{-1}\Sigma_{xz})^{-1},\qquad \widehat{\mathrm{Avar}}(\hat\delta_{2SLS})=\hat\sigma^2(S_{xz}'S_{xx}^{-1}S_{xz})^{-1}.$$

Proposition (3.9 - asymptotic properties of 2SLS). Skip.
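Under homoskedasticity the efficient GMM computation simplifies as above. A sketch computing $\hat\delta_{2SLS}$ and its standard errors from $\hat\sigma^2(S_{xz}'S_{xx}^{-1}S_{xz})^{-1}$, on a hypothetical homoskedastic DGP:

```python
import numpy as np

# Hypothetical homoskedastic overidentified DGP (K = 3, L = 2).
rng = np.random.default_rng(7)
n = 50_000
x2 = rng.normal(0.0, 1.0, n)
x3 = rng.normal(0.0, 1.0, n)
eps = rng.normal(0.0, 1.0, n)
z2 = 0.7 * x2 + 0.4 * x3 + 0.5 * eps + rng.normal(0.0, 1.0, n)
y = 1.0 + 2.0 * z2 + eps

Z = np.column_stack([np.ones(n), z2])
X = np.column_stack([np.ones(n), x2, x3])
Sxz, sxy, Sxx = X.T @ Z / n, X.T @ y / n, X.T @ X / n

# 2SLS = GMM with weighting matrix Sxx^{-1}
A = Sxz.T @ np.linalg.inv(Sxx) @ Sxz
delta_2sls = np.linalg.solve(A, Sxz.T @ np.linalg.inv(Sxx) @ sxy)

# Estimated asymptotic variance and standard errors
e = y - Z @ delta_2sls
sigma2 = e @ e / n
avar_hat = sigma2 * np.linalg.inv(A)           # sigma2*(Sxz' Sxx^{-1} Sxz)^{-1}
se = np.sqrt(np.diag(avar_hat) / n)
print(delta_2sls, se)
```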
262
4.6.2 Alternative Derivations of 2SLS
The 2SLS estimator can be written as
$$\hat\delta_{2SLS}=(S_{xz}'S_{xx}^{-1}S_{xz})^{-1}S_{xz}'S_{xx}^{-1}s_{xy}=\big(Z'X(X'X)^{-1}X'Z\big)^{-1}Z'X(X'X)^{-1}X'y.$$
Let us interpret the 2SLS estimator as an IV estimator. Use as instruments
$$\hat{Z}=X(X'X)^{-1}X'Z,$$
or simply $\hat{Z}=X$ if $K=L$. Define the IV estimator as
$$\hat\delta_{IV}=\left(\frac{1}{n}\sum_{i=1}^n\hat{z}_iz_i'\right)^{-1}\frac{1}{n}\sum_{i=1}^n\hat{z}_iy_i=(\hat{Z}'Z)^{-1}\hat{Z}'y=\big(Z'X(X'X)^{-1}X'Z\big)^{-1}Z'X(X'X)^{-1}X'y=\hat\delta_{2SLS}.$$
263
If $K=L$ then
$$\hat\delta_{IV}=(X'Z)^{-1}X'y.$$
Finally, let us show 2SLS as the result of two regressions:

1) regress the $L$ regressors on $x_i$ and obtain the fitted values $\hat{z}_i$;

2) regress $y_i$ on $\hat{z}_{i1},\ldots,\hat{z}_{iL}$ to obtain the estimator $(\hat{Z}'\hat{Z})^{-1}\hat{Z}'y$, which is also $\hat\delta_{2SLS}$. In effect,
$$(\hat{Z}'\hat{Z})^{-1}\hat{Z}'y=\Big(\underbrace{Z'X(X'X)^{-1}X'}_{\hat{Z}'}\underbrace{X(X'X)^{-1}X'Z}_{\hat{Z}}\Big)^{-1}\underbrace{Z'X(X'X)^{-1}X'}_{\hat{Z}'}y=\big(Z'X(X'X)^{-1}X'Z\big)^{-1}Z'X(X'X)^{-1}X'y=\hat\delta_{2SLS}.$$
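The equivalence of the two-regression procedure and the closed form can be confirmed numerically (hypothetical DGP):

```python
import numpy as np

# Hypothetical DGP: one endogenous regressor, two excluded instruments.
rng = np.random.default_rng(8)
n = 10_000
x2 = rng.normal(0.0, 1.0, n)
x3 = rng.normal(0.0, 1.0, n)
eps = rng.normal(0.0, 1.0, n)
z2 = 0.7 * x2 + 0.4 * x3 + 0.5 * eps + rng.normal(0.0, 1.0, n)
y = 1.0 + 2.0 * z2 + eps

Z = np.column_stack([np.ones(n), z2])
X = np.column_stack([np.ones(n), x2, x3])

# First stage: Z-hat = X (X'X)^{-1} X'Z (fitted values of each regressor)
Zhat = X @ np.linalg.solve(X.T @ X, X.T @ Z)

# Closed form (Zhat'Z)^{-1} Zhat'y vs. second-stage OLS (Zhat'Zhat)^{-1} Zhat'y
delta_closed = np.linalg.solve(Zhat.T @ Z, Zhat.T @ y)
delta_2stage = np.linalg.solve(Zhat.T @ Zhat, Zhat.T @ y)
print(delta_closed, delta_2stage)
```

The two coincide because the projection matrix $X(X'X)^{-1}X'$ is symmetric and idempotent, so $\hat{Z}'\hat{Z}=\hat{Z}'Z$.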
264
Exercise 4.6. Consider the equation $y_i=z_i'\delta+\varepsilon_i$ and the instrumental variables $x_i$, where $K=L$. Assume Assumptions 3.1-3.7 and suppose that $x_i$ and $z_i$ are strictly exogenous (so the use of the IV estimator is unnecessary). Show that $\hat\delta_{IV}=(X'Z)^{-1}X'y$ is unbiased and consistent but less efficient than $\hat\delta_{OLS}=(Z'Z)^{-1}Z'y$. Hint: compare $\mathrm{Var}(\hat\delta_{IV}|Z,X)$ to $\mathrm{Var}(\hat\delta_{OLS}|Z,X)$ and notice that an idempotent matrix is positive semi-definite. Also notice that $\mathrm{Var}(\hat\delta_{IV}|Z,X)-\mathrm{Var}(\hat\delta_{OLS}|Z,X)$ is positive semi-definite iff $\mathrm{Var}(\hat\delta_{OLS}|Z,X)^{-1}-\mathrm{Var}(\hat\delta_{IV}|Z,X)^{-1}$ is positive semi-definite (provided these inverses exist).