Econometrics - Slides
2011/2012
João Nicolau
1 Introduction
1.1 What is Econometrics?
Econometrics is a discipline that "aims to give empirical content to economic relations". It has been defined generally as "the application of mathematics and statistical methods to economic data". Applications of econometrics:
• forecasting (e.g. interest rates, inflation rates, and gross domestic product);
• studying economic relations;
• testing economic theories;
• evaluating and implementing government and business policy. For example, what are the effects of political campaign expenditures on voting outcomes? What is the effect of school spending on student performance in the field of education?
1.2 Steps in Empirical Economic Analysis
• Formulate the question of interest. The question might deal with testing a certain aspect of an economic theory, or it might pertain to testing the effects of a government policy.
• Build the economic model. An economic model consists of mathematical equations that describe various relationships. Formal economic modeling is sometimes the starting point for empirical analysis, but it is more common to use economic theory less formally, or even to rely entirely on intuition.
• Specify the econometric model.
• Collect the data.
• Estimate and test the econometric model.
• Answer the question in step 1.
1.3 The Structure of Economic Data
1.3.1 Cross-Sectional Data
A cross-sectional data set: a sample of individuals, households, firms, cities, states, countries, etc., taken at a given point in time. An important feature of cross-sectional data: it is obtained by random sampling from the underlying population. For example, suppose that yi is the i-th observation of the dependent variable and xi is the i-th observation of the explanatory variable. Random sampling means that

{(yi, xi)} is an i.i.d. sequence.

This implies that for i ≠ j

Cov(yi, yj) = 0,  Cov(xi, xj) = 0,  Cov(yi, xj) = 0.

Obviously, if xi "explains" yi we will have Cov(yi, xi) ≠ 0.
Cross-sectional data is closely aligned with the applied microeconomics fields, such as labor economics, state and local public finance, industrial organization, urban economics, demography, and health economics.
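The covariance implications of random sampling can be checked with a small simulation. This is my own illustration, not from the slides; the data-generating process (coefficients, sample size) is invented for the example.

```python
import numpy as np

# Illustrative sketch: draw an i.i.d. cross-section {(y_i, x_i)} and
# check the covariance implications of random sampling.
rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(size=n)   # x "explains" y

# Cov(y_i, x_i) should be far from zero (close to 1.5 here) ...
print(np.cov(y, x)[0, 1])

# ... while observations i != j are independent, so e.g. the sample
# covariance between (y_1,...,y_{n-1}) and (y_2,...,y_n) is near zero.
print(np.cov(y[:-1], y[1:])[0, 1])
```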
An example of cross-sectional data: [data table not reproduced in this transcript]
Scatterplots may be adequate for analyzing cross-section data:
Models based on cross-sectional data usually satisfy the assumptions covered in the chapter "Finite-Sample Properties of OLS".
1.3.2 Time-Series Data
A time series data set consists of observations on a variable or several variables over time. Examples: stock prices, money supply, consumer price index, gross domestic product, annual homicide rates, automobile sales figures, etc.

Time series data cannot be assumed to be independent across time. For example, knowing something about the gross domestic product from last quarter tells us quite a bit about the likely range of the GDP during this quarter.

The analysis of time series data is more difficult than that of cross-sectional data. Reasons:

• we need to account for the dependent nature of economic time series;
• time-series data exhibit unique features such as trends over time and seasonality;
• models based on time-series data rarely satisfy the assumptions covered in the chapter "Finite-Sample Properties of OLS". The most adequate assumptions are covered in the chapter "Large-Sample Theory", which is theoretically more advanced.
An example of a time series (scatterplots cannot in general be used here, but there are exceptions): [figure not reproduced in this transcript]
1.3.3 Pooled Cross Sections and Panel or Longitudinal Data
Data sets have both cross-sectional and time series features.
1.3.4 Causality And The Notion Of Ceteris Paribus In Econometric Analysis
Ceteris paribus: "other (relevant) factors being equal". It plays an important role in causal analysis.
Example. Suppose that wages depend on education and labor force experience. Your goal is to measure the "return to education". If your analysis involves only wages and education you may not uncover the ceteris paribus effect of education on wages. Consider the following data:

  monthly wage (Euros)   years of experience   years of education
  1500                   6                     9
  1500                   0                     15
  1600                   1                     15
  2000                   8                     12
  2500                   10                    12
Example. In a totalitarian regime, how could you measure the ceteris paribus effect of another year of education on wages? You might create 100 clones of a "normal" individual, give each person a different amount of education, and then measure their wages.
Ceteris Paribus is relatively easy to analyze in Experimental Data.
Example (Experimental Data). Consider the effects of new fertilizers on crop yields. Suppose the crop under consideration is soybeans. Since fertilizer amount is only one factor affecting yields (some others include rainfall, quality of land, and presence of parasites), this issue must be posed as a ceteris paribus question. One way to determine the causal effect of fertilizer amount on soybean yield is to conduct an experiment, which might include the following steps: choose several one-acre plots of land; apply different amounts of fertilizer to each plot and subsequently measure the yields.
In economics we have nonexperimental data, so in principle it is difficult to estimate ceteris paribus effects. However, we will see that econometric methods can simulate a ceteris paribus experiment. We will be able to do in nonexperimental environments what natural scientists do in a controlled laboratory setting: keep other factors fixed.
2 Finite-Sample Properties of OLS
This chapter covers the finite- or small-sample properties of the OLS estimator, that is, the statistical properties of the OLS estimator that are valid for any given sample size.
2.1 The Classical Linear Regression Model
The dependent variable is related to several other variables (called the regressors or the explanatory variables).

Let yi be the i-th observation of the dependent variable.

Let (xi1, xi2, ..., xiK) be the i-th observation of the K regressors. The sample or data is the collection of those n observations.

The data in economics cannot be generated by experiments (except in experimental economics), so both the dependent and independent variables have to be treated as random variables, i.e. variables whose values are subject to chance.
2.1.1 The Linearity Assumption
Assumption (1.1 - Linearity). We have

yi = β1xi1 + β2xi2 + ... + βKxiK + εi,   i = 1, 2, ..., n,

where the β's are unknown parameters to be estimated, and εi is the unobserved error term.

β's: regression coefficients. They represent the marginal and separate effects of the regressors.
Example (1.1). (Consumption function): Consider
coni = β1 + β2ydi + εi.
coni: consumption; ydi: disposable income. Note: xi1 = 1, xi2 = ydi. The error εi represents other variables besides disposable income that influence consumption. They include those variables (such as financial assets) that might be observable but the researcher decided not to include as regressors, as well as those variables (such as the "mood" of the consumer) that are hard to measure. The equation is called the simple regression model.
The linearity assumption is not as restrictive as it might first seem.
Example (1.2). (Wage equation). Consider
wagei = e^(β1) · e^(β2 educi) · e^(β3 tenurei) · e^(β4 expri) · e^(εi)

where wage = the wage rate for the individual, educ = education in years, tenure = years on the current job, and expr = experience in the labor market. Taking logs, this equation can be written as

log(wagei) = β1 + β2educi + β3tenurei + β4expri + εi.

The equation is said to be in the semi-log form (or log-level form).
Example. Does this model
yi = β1 + β2xi2 + β3 log(xi2) + β4xi3² + εi
violate Assumption 1.1?
There are, of course, cases of genuine nonlinearity. For example
yi = β1 + e^(β2 xi2) + εi.
Partial Effects
To simplify, let us consider K = 2 and assume that E(εi | xi1, xi2) = 0.

What is the impact on the conditional expected value E(yi | xi1, xi2) when xi2 is increased by a small amount,

x'i = (xi1, xi2) → x*'i = (xi1, xi2 + Δxi2)   (holding the other variable fixed)?

Let

ΔE(yi | xi) ≡ E(yi | x*i1 = xi1, x*i2 = xi2 + Δxi2) − E(yi | xi1, xi2).

(level-level)  yi = β1 + β2xi2 + εi              ΔE(yi|xi) = β2 Δxi2
(level-log)    yi = β1 + β2 log(xi2) + εi        ΔE(yi|xi) ≈ (β2/100) (Δxi2/xi2 × 100)
(log-level)    log(yi) = β1 + β2xi2 + εi         ΔE(yi|xi)/E(yi|xi) × 100 ≈ (100β2) Δxi2   (100β2: semi-elasticity)
(log-log)      log(yi) = β1 + β2 log(xi2) + εi   ΔE(yi|xi)/E(yi|xi) × 100 ≈ β2 (Δxi2/xi2 × 100)   (β2: elasticity)
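The log-level interpretation can be checked numerically. This is my own sketch with invented values of β1, β2 and x; it verifies that a one-unit increase in x changes y by approximately 100·β2 percent when β2 is small.

```python
import numpy as np

# Sketch: interpreting beta2 in a log-level model log(y) = b1 + b2*x.
# A one-unit increase in x multiplies y by exp(b2), i.e. changes y by
# about 100*b2 percent when b2 is small.
b1, b2 = 1.0, 0.05
x = 10.0
y0 = np.exp(b1 + b2 * x)
y1 = np.exp(b1 + b2 * (x + 1))

exact_pct = (y1 / y0 - 1) * 100   # exact: 100*(e^b2 - 1)
approx_pct = 100 * b2             # approximation: 5 percent
print(exact_pct, approx_pct)
```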
Exercise 2.1. Suppose, for example, the marginal effect of experience on wages declines with the level of experience. How can this be captured?

Exercise 2.2. Provide an interpretation of β2 in the following equations:

(a) coni = β1 + β2inci + εi, where inc: income, con: consumption (both measured in dollars). Assume that β2 = 0.8;

(b) log(wagei) = β1 + β2educi + β3tenurei + β4expri + εi. Assume that β2 = 0.05;

(c) log(pricei) = β1 + β2 log(disti) + εi, where price = housing price and dist = distance from a recently built garbage incinerator. Assume that β2 = 0.6.
2.1.2 Matrix Notation
We have

yi = β1xi1 + β2xi2 + ... + βKxiK + εi = [xi1 xi2 · · · xiK] (β1, β2, ..., βK)' + εi = x'iβ + εi

where

xi = (xi1, xi2, ..., xiK)',   β = (β1, β2, ..., βK)',

so that

yi = x'iβ + εi.
More compactly,

y = (y1, y2, ..., yn)',   ε = (ε1, ε2, ..., εn)',

X =
  x11 x12 · · · x1K
  x21 x22 · · · x2K
  ⋮    ⋮         ⋮
  xn1 xn2 · · · xnK

and the model becomes

y = Xβ + ε.
Example. yi = β1 + β2educi + β3expi + εi (yi: wages in Euros). An example of cross-sectional data is

y = (2000, 2500, 1500, ..., 5000, 1000)',

X =
  1 12  5
  1 15  6
  1 12  3
  ⋮  ⋮   ⋮
  1 17 15
  1 12  1
Important: y and X (or yi and xik) may denote random variables or observed values. We use the same notation for both cases.
2.1.3 The Strict Exogeneity Assumption
Assumption (1.2 - Strict exogeneity). E(εi | X) = 0, ∀i.

This assumption can be written as

E(εi | x1, ..., xn) = 0, ∀i.

With random sampling, εi is automatically independent of the explanatory variables for observations other than i. This implies that

E(εi | xj) = 0, ∀i, j, i ≠ j.

It remains to be analyzed whether or not

E(εi | xi) = 0.
The strict exogeneity assumption can fail in situations such as:

• (cross-section or time series) omitted variables;
• (cross-section or time series) measurement error in some of the regressors;
• (time series, static models) there is feedback from yi on future values of xi;
• (time series, dynamic models) there is a lagged dependent variable as a regressor;
• (cross-section or time series) simultaneity.

Example (Omitted variables). Suppose that the wage is determined by

wagei = β1 + β2xi2 + β3xi3 + vi,

where x2: years of education, x3: ability. Assume that E(vi | X) = 0. Since ability is not observed, we instead estimate the model wagei = β1 + β2xi2 + εi, with εi = β3xi3 + vi. If Cov(xi2, xi3) ≠ 0, then

Cov(εi, xi2) = Cov(β3xi3 + vi, xi2) = β3 Cov(xi3, xi2) ≠ 0 ⇒ E(εi | X) ≠ 0.
Example (Measurement error in some of the regressors). Consider y = household savings and w = disposable income, and

yi = β1 + β2wi + vi,   E(vi | w) = 0.

Suppose that w cannot be measured perfectly accurately (for example, because of misreporting) and denote the measured value for wi by xi2. We have

xi2 = wi + ui.

Assume: E(ui) = 0, Cov(wi, ui) = 0, Cov(vi, ui) = 0. Now substituting wi = xi2 − ui into yi = β1 + β2wi + vi we obtain

yi = β1 + β2xi2 + εi,   εi = vi − β2ui.

Hence,

Cov(εi, xi2) = ... = −β2 Var(ui) ≠ 0,

and Cov(εi, xi2) ≠ 0 ⇒ E(εi | X) ≠ 0.
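The covariance Cov(εi, xi2) = −β2 Var(ui) can be verified by simulation. This is my own sketch; all distributions and parameter values (β2 = 0.5, Var(u) = 1) are invented for illustration.

```python
import numpy as np

# Sketch: classical measurement error in the regressor.
# True model: y = b1 + b2*w + v, but we observe x2 = w + u.
rng = np.random.default_rng(1)
n = 200_000
b1, b2 = 1.0, 0.5
w = rng.normal(0, 2, size=n)    # true disposable income
u = rng.normal(0, 1, size=n)    # measurement error, Var(u) = 1
v = rng.normal(0, 1, size=n)
x2 = w + u
eps = v - b2 * u                # error term of the estimated model

# Cov(eps, x2) should be close to -b2 * Var(u) = -0.5
print(np.cov(eps, x2)[0, 1])
```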
Example (Feedback from y on future values of x). Consider a simple static time-series model to explain a city's murder rate (yt) in terms of police officers per capita (xt):

yt = β1 + β2xt + εt.

Suppose that the city adjusts the size of its police force based on past values of the murder rate. This means that, say, xt+1 might be correlated with εt (since a higher εt leads to a higher yt).
Example (There is a lagged dependent variable as a regressor). See Section 2.1.5.
Exercise 2.3. Let kids denote the number of children ever born to a woman, and let educ denote years of education for the woman. A simple model relating fertility to years of education is

kidsi = β1 + β2educi + εi,

where εi is the unobserved error. (i) What kinds of factors are contained in εi? Are these likely to be correlated with the level of education? (ii) Will a simple regression analysis uncover the ceteris paribus effect of education on fertility? Explain.
2.1.4 Implications of Strict Exogeneity
The assumption E(εi | X) = 0, ∀i, implies:

• E(εi) = 0, ∀i;
• E(εi | xj) = 0, ∀i, j;
• E(xjk εi) = 0, ∀i, j, k, or E(xj εi) = 0, ∀i, j (the regressors are orthogonal to the error term for all observations);
• Cov(xjk, εi) = 0.

Note: if E(εi | xj) ≠ 0 or E(xjk εi) ≠ 0 or Cov(xjk, εi) ≠ 0, then E(εi | X) ≠ 0.
2.1.5 Strict Exogeneity in Time-Series Models
For time-series models, strict exogeneity can be rephrased as: the regressors are orthogonal to the past, current, and future error terms. However, for most time-series models, strict exogeneity is not satisfied.

Example. Consider

yi = β yi−1 + εi,   E(εi | yi−1) = 0 (thus E(yi−1 εi) = 0).

Let xi = yi−1. By construction we have

E(xi+1 εi) = E(yi εi) = ... = E(εi²) ≠ 0.

The regressor is not orthogonal to the past error term, which is a violation of strict exogeneity. However, the estimator may possess good large-sample properties even without strict exogeneity.
2.1.6 Other Assumptions of the Model
Assumption (1.3 - no multicollinearity). The rank of the n × K data matrix X is K with probability 1.

None of the K columns of the data matrix X can be expressed as a linear combination of the other columns of X.
Example (1.4 - continuation of Example 1.2). If no individuals in the sample ever changed jobs, then tenurei = expri for all i, in violation of the no-multicollinearity assumption. There is no way to distinguish the tenure effect on the wage rate from the experience effect. Remedy: drop tenurei or expri from the wage equation.
Example (Dummy Variable Trap). Consider

wagei = β1 + β2educi + β3femalei + β4malei + εi

where

femalei = 1 if i corresponds to a female, 0 if i corresponds to a male;   malei = 1 − femalei.

In vector notation we have

wage = β1·1 + β2educ + β3female + β4male + ε.

It is obvious that 1 = female + male. Therefore the above model violates Assumption 1.3. One may also justify this using scalar notation: xi1 = femalei + malei, because this relationship implies 1 = female + male. Can you overcome the dummy variable trap by removing xi1 ≡ 1 from the equation?
Exercise 2.4. In a study relating college grade point average to time spent in various activities, you distribute a survey to several students. The students are asked how many hours they spend each week in four activities: studying, sleeping, working, and leisure. Every activity is put into one of the four categories, so that for each student the sum of hours in the four activities must be 168. (i) In the model

GPAi = β1 + β2studyi + β3sleepi + β4worki + β5leisurei + εi

does it make sense to hold sleep, work, and leisure fixed while changing study? (ii) Explain why this model violates Assumption 1.3. (iii) How could you reformulate the model so that its parameters have a useful interpretation and it satisfies Assumption 1.3?
Assumption (1.4 - spherical error variance). The error term satisfies:

E(εi² | X) = σ² > 0, ∀i   (homoskedasticity);

E(εi εj | X) = 0, ∀i, j, i ≠ j   (no correlation between observations).

Exercise 2.5. Under Assumptions 1.2 and 1.4, show that Cov(yi, yj | X) = 0.
Assumption 1.4 together with strict exogeneity implies:

• Var(εi | X) = E(εi² | X) = σ²;
• Cov(εi, εj | X) = 0;
• E(εε' | X) = σ²I;
• Var(ε | X) = σ²I.

Note:

E(εε' | X) =
  E(ε1² | X)     E(ε1ε2 | X)   · · ·   E(ε1εn | X)
  E(ε1ε2 | X)    E(ε2² | X)    · · ·   E(ε2εn | X)
  ⋮              ⋮             ⋱       ⋮
  E(ε1εn | X)    E(ε2εn | X)   · · ·   E(εn² | X)
Exercise 2.6. Consider the savings function

savi = β1 + β2inci + εi,   εi = √(inci) · zi,

where zi is a random variable with E(zi) = 0 and Var(zi) = σz². Assume that zi is independent of incj (for all i, j). (i) Show that E(ε | inc) = 0. (ii) Show that Assumption 1.4 is violated.
2.1.7 The Classical Regression Model for Random Samples
The sample (y, X) is a random sample if {(yi, xi)} is i.i.d. (independently and identically distributed) across observations. A random sample automatically implies:

E(εi | X) = E(εi | xi),   E(εi² | X) = E(εi² | xi).

Therefore Assumptions 1.2 and 1.4 can be rephrased as:

Assumption 1.2:  E(εi | xi) = E(εi) = 0;

Assumption 1.4:  E(εi² | xi) = E(εi²) = σ².
2.1.8 "Fixed" Regressors

This is a simplifying (and generally unrealistic) assumption made to keep the statistical analysis tractable. It means that X is exactly the same in repeated samples. Sampling schemes that support this assumption:

a) Experimental situations. For example, suppose that y represents the yields of a crop grown on n experimental plots, and let the rows of X represent the seed varieties, irrigation and fertilizer for each plot. The experiment can be repeated as often as desired, with the same X. Only y varies across plots.

b) Stratified sampling (for more details see Wooldridge, chap. 9).
2.2 The Algebra of Least Squares
2.2.1 OLS Minimizes the Sum of Squared Residuals
Residual for observation i (evaluated at β):

yi − x'iβ.

Vector of residuals (evaluated at β):

y − Xβ.

Sum of squared residuals (SSR):

SSR(β) = Σ(yi − x'iβ)² = (y − Xβ)'(y − Xβ)   (sum over i = 1, ..., n).

The OLS (ordinary least squares) estimator:

b = arg minβ SSR(β),

i.e. b is such that SSR(b) is minimal.
[Figure: least squares fit for K = 1, yi = βxi + εi; not reproduced]

Example. Consider yi = β1 + β2xi2 + εi. The data:

   y     X
   1    1 1
   3    1 3
   2    1 1
   8    1 3
  12    1 8

Verify that SSR(β) = 42 when β = (0, 1)'.
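The SSR claim above can be checked directly. A minimal sketch using the slide's five observations:

```python
import numpy as np

# Check the slide's claim: SSR(beta) = 42 at beta = (0, 1)'.
y = np.array([1., 3., 2., 8., 12.])
X = np.array([[1., 1.], [1., 3.], [1., 1.], [1., 3.], [1., 8.]])
beta = np.array([0., 1.])

resid = y - X @ beta       # residuals evaluated at beta
print(resid @ resid)       # 42.0
```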
2.2.2 Normal Equations
To solve the optimization problem minβ SSR(β) we use classical optimization:

• First-order condition (FOC):

∂SSR(β)/∂β = 0.

Solve this equation with respect to β; let b be the solution.

• Second-order condition (SOC):

∂²SSR(β)/∂β∂β' is a positive definite matrix ⇔ b is a global minimum point.
To obtain the FOC more easily we start by writing SSR(β) as

SSR(β) = (y − Xβ)'(y − Xβ) = ... = y'y − 2y'Xβ + β'X'Xβ.

Recalling from matrix algebra that

∂(a'β)/∂β = a,   ∂(β'Aβ)/∂β = 2Aβ   (for A symmetric),

we have

∂SSR(β)/∂β = −2(y'X)' + 2X'Xβ = 0,

i.e. (replacing β by the solution b)

X'Xb = X'y   or   X'(y − Xb) = 0.
This is a system of K equations in K unknowns. These equations are called the normal equations. If

rank(X) = K ⇒ X'X is nonsingular ⇒ (X'X)⁻¹ exists.

Therefore, if rank(X) = K we have a unique solution:

b = (X'X)⁻¹X'y   (OLS estimator).

The SOC is

∂²SSR(β)/∂β∂β' = 2X'X.

If rank(X) = K then 2X'X is a positive definite matrix, so SSR(β) is strictly convex in R^K. Hence b is a global minimum point.
The vector of residuals evaluated at β = b,

e = y − Xb,

is called the vector of OLS residuals (or simply the residuals).
The normal equations can be written as

X'e = 0  ⇔  (1/n) Σ xi ei = 0   (sum over i = 1, ..., n).

This shows that the normal equations can be interpreted as the sample analogue of the orthogonality conditions E(xi εi) = 0. Notice the reasoning: by assuming the orthogonality conditions E(xi εi) = 0 in the population, we deduce by the method of moments the corresponding sample analogue

(1/n) Σi xi (yi − x'iβ) = 0.

We obtain the OLS estimator b by solving this equation with respect to β.
2.2.3 Two Expressions for the OLS Estimator
• b = (X'X)⁻¹X'y;

• b = (X'X/n)⁻¹(X'y/n) = Sxx⁻¹ Sxy, where

Sxx = X'X/n = (1/n) Σ xi x'i   (sample average of xi x'i),

Sxy = X'y/n = (1/n) Σ xi yi   (sample average of xi yi).

Example (continuation of the previous example). Consider the data:

   y     X
   1    1 1
   3    1 3
   2    1 1
   8    1 3
  12    1 8

Obtain b, e and SSR(b).
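The example above can be worked through numerically. A minimal sketch that solves the normal equations for the slide's data and checks the orthogonality X'e = 0:

```python
import numpy as np

# Solve the normal equations X'Xb = X'y for the slide's data set
# and recover b, the residual vector e, and SSR(b).
y = np.array([1., 3., 2., 8., 12.])
X = np.array([[1., 1.], [1., 3.], [1., 1.], [1., 3.], [1., 8.]])

b = np.linalg.solve(X.T @ X, X.T @ y)   # OLS estimator
e = y - X @ b                           # OLS residuals
ssr = e @ e
print(b, ssr)

# Normal equations: X'e = 0 (up to floating-point error), and the
# minimized SSR is below the value 42 obtained at beta = (0, 1)'.
print(X.T @ e)
```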
2.2.4 More Concepts and Algebra
The fitted value for observation i: ŷi = x'ib.

The vector of fitted values: ŷ = Xb.

The vector of OLS residuals: e = y − Xb = y − ŷ.

The projection matrix P and the annihilator M are defined as

P = X(X'X)⁻¹X',   M = I − P.

Properties:

Exercise 2.7. Show that P and M are symmetric and idempotent and that

PX = X,   MX = 0,   ŷ = Py,   e = My = Mε,   SSR = e'e = y'My = ε'Mε.
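The properties listed in Exercise 2.7 can be verified numerically (which is not a proof, but a useful sanity check). A sketch on the running 5-observation example:

```python
import numpy as np

# Numerical check of the projection matrix P and the annihilator M.
y = np.array([1., 3., 2., 8., 12.])
X = np.array([[1., 1.], [1., 3.], [1., 1.], [1., 3.], [1., 8.]])

P = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(len(y)) - P

print(np.allclose(P, P.T), np.allclose(P @ P, P))    # symmetric, idempotent
print(np.allclose(P @ X, X), np.allclose(M @ X, 0))  # PX = X, MX = 0

b = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(P @ y, X @ b))        # Py = fitted values
print(np.allclose(M @ y, y - X @ b))    # My = residuals
```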
The OLS estimate of σ² (the variance of the error term), denoted s², is

s² = SSR/(n − K) = e'e/(n − K).

s, the square root of s², is called the standard error of the regression.

The sampling error is

b − β = ... = (X'X)⁻¹X'ε.
Coefficient of Determination

A measure of goodness of fit is the coefficient of determination

R² = Σ(ŷi − ȳ)² / Σ(yi − ȳ)² = 1 − Σ ei² / Σ(yi − ȳ)²,   0 ≤ R² ≤ 1   (sums over i = 1, ..., n).

It measures the proportion of the variation of y that is accounted for by variation in the regressors x'js. Derivation of R²: [board]
[Three scatterplots of y against x with fitted lines, illustrating R² = 0.96, R² = 0.19 and R² = 0.00; figures not reproduced]
"The most important thing about R² is that it is not important" (Goldberger). Why?

• We are concerned with parameters in a population, not with goodness of fit in the sample;

• We can always increase R² by adding more explanatory variables. In the limit, if K = n ⇒ R² = 1.

Exercise 2.8. Prove that K = n ⇒ R² = 1 (assume that Assumption 1.3 holds).

It can be proved that

R² = ρ̂²,   ρ̂ = [Σi (ŷi − ȳ)(yi − ȳ)/n] / (Sŷ Sy).
Adjusted coefficient of determination:

R̄² = 1 − [(n − 1)/(n − K)] (1 − R²) = 1 − [Σ ei²/(n − K)] / [Σ(yi − ȳ)²/(n − 1)].

Contrary to R², R̄² may decline when a variable is added to the set of independent variables.
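Both measures can be computed on the running example. This sketch also checks the identity R² = ρ̂² (the squared correlation between fitted and actual values), which holds when the regression includes an intercept:

```python
import numpy as np

# R^2 and adjusted R^2 for the running 5-observation example.
y = np.array([1., 3., 2., 8., 12.])
X = np.array([[1., 1.], [1., 3.], [1., 1.], [1., 3.], [1., 8.]])
n, K = X.shape

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
tss = np.sum((y - y.mean()) ** 2)      # total sum of squares

r2 = 1 - (e @ e) / tss
r2_adj = 1 - (e @ e / (n - K)) / (tss / (n - 1))
print(r2, r2_adj)

# R^2 equals the squared correlation between fitted and actual y
print(np.isclose(r2, np.corrcoef(X @ b, y)[0, 1] ** 2))
```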
2.3 Finite-Sample Properties of OLS

First of all we need to recognize that b and b|X are random!

Assumptions:

1.1 - Linearity: yi = β1xi1 + β2xi2 + ... + βKxiK + εi.

1.2 - Strict exogeneity: E(εi | X) = 0.

1.3 - No multicollinearity.

1.4 - Spherical error variance: E(εi² | X) = σ², E(εi εj | X) = 0 (i ≠ j).
Proposition (1.1 - finite-sample properties of b). We have:

(a) (unbiasedness) Under Assumptions 1.1-1.3, E(b | X) = β.

(b) (expression for the variance) Under Assumptions 1.1-1.4, Var(b | X) = σ²(X'X)⁻¹.

(c) (Gauss-Markov theorem) Under Assumptions 1.1-1.4, the OLS estimator is efficient in the class of linear unbiased estimators (it is the Best Linear Unbiased Estimator). That is, for any unbiased estimator β̃ that is linear in y, Var(b | X) ≤ Var(β̃ | X) in the matrix sense (i.e. Var(β̃ | X) − Var(b | X) is a positive semidefinite matrix).

(d) Under Assumptions 1.1-1.4, Cov(b, e | X) = 0. Proof: [board]
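Properties (a) and (b) can be illustrated by Monte Carlo. This is my own simulation, not from the slides; the design (n = 50, σ = 1, β = (1, 2)', fixed X) is invented for the example.

```python
import numpy as np

# Monte Carlo sketch: with X held fixed across replications, the average
# of b should be close to beta, and the empirical variance of b2 close to
# the (2,2) element of sigma^2 (X'X)^{-1}.
rng = np.random.default_rng(2)
n, sigma = 50, 1.0
beta = np.array([1.0, 2.0])
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # "fixed" regressors
XtX_inv = np.linalg.inv(X.T @ X)

draws = np.empty((5_000, 2))
for r in range(5_000):
    y = X @ beta + sigma * rng.normal(size=n)
    draws[r] = XtX_inv @ X.T @ y

print(draws.mean(axis=0))                            # approx (1.0, 2.0)
print(draws[:, 1].var(), sigma**2 * XtX_inv[1, 1])   # approximately equal
```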
Proposition (1.2 - unbiasedness of s²). Let s² = e'e/(n − K). We have

E(s² | X) = E(s²) = σ². Proof: [board]

An unbiased estimator of Var(b | X) is

V̂ar(b | X) = s²(X'X)⁻¹.
Example. Consider

colGPAi = β1 + β2HSGPAi + β3ACTi + β4SKIPPEDi + β5PCi + εi

where colGPA: college grade point average (GPA); HSGPA: high school GPA; ACT: achievement examination for college admission; SKIPPED: average lectures missed per week; PC: binary variable (0/1) identifying who owns a personal computer. Using a survey of 141 students (Michigan State University) in Fall 1994, we obtained the following results [regression output not reproduced]:

These results tell us that n = 141, s = 0.325, R² = 0.259, SSR = 14.37,

b = (1.356, 0.4129, 0.0133, −0.071, 0.1244)',

V̂ar(b | X) =
  0.3275²  ?        ?       ?       ?
  ?        0.0924²  ?       ?       ?
  ?        ?        0.010²  ?       ?
  ?        ?        ?       0.026²  ?
  ?        ?        ?       ?       0.0573²
2.4 More on Regression Algebra
2.4.1 Regression Matrices
Matrix P = X(X'X)⁻¹X':

  Py → fitted values from the regression of y on X
  Pz → ?

Matrix M = I − P = I − X(X'X)⁻¹X':

  My → residuals from the regression of y on X
  Mz → ?

Consider a partition of X as follows: X = [X1 X2].

Matrix P1 = X1(X'1X1)⁻¹X'1:

  P1y → ?

Matrix M1 = I − P1 = I − X1(X'1X1)⁻¹X'1:

  M1y → ?
2.4.2 Short and Long Regression Algebra
Partition X as

X = [X1 X2],   X1: n × K1,   X2: n × K2,   K1 + K2 = K.

Long regression. We have

y = ŷ + e = Xb + e = [X1 X2] (b1; b2) + e = X1b1 + X2b2 + e.

Short regression. Suppose that we shorten the list of explanatory variables and regress y on X1. We have

y = ŷ* + e* = X1b*1 + e*

where

b*1 = (X'1X1)⁻¹X'1y,

e* = M1y,   M1 = I − X1(X'1X1)⁻¹X'1.
How are b*1 and e* related to b1 and e?

b*1 vs. b1. We have

b*1 = (X'1X1)⁻¹X'1y
    = (X'1X1)⁻¹X'1(X1b1 + X2b2 + e)
    = b1 + (X'1X1)⁻¹X'1X2b2 + (X'1X1)⁻¹X'1e   [last term = 0, by the normal equations]
    = b1 + (X'1X1)⁻¹X'1X2b2
    = b1 + Fb2,   F = (X'1X1)⁻¹X'1X2.

Thus, in general, b*1 ≠ b1. Exceptional cases: b2 = 0 or X'1X2 = O ⇒ b*1 = b1.
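The identity b*1 = b1 + Fb2 can be verified numerically. This is my own example; the data-generating process is invented, with X2 deliberately correlated with X1 so that F ≠ O.

```python
import numpy as np

# Sketch: check the short-regression identity b1* = b1 + F b2,
# with F = (X1'X1)^{-1} X1'X2.
rng = np.random.default_rng(3)
n = 100
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = (0.5 * X1[:, 1] + rng.normal(size=n)).reshape(-1, 1)  # correlated with X1
y = X1 @ [1.0, 2.0] + X2[:, 0] * 1.5 + rng.normal(size=n)

X = np.hstack([X1, X2])
b = np.linalg.solve(X.T @ X, X.T @ y)            # long regression
b1, b2 = b[:2], b[2:]
b1_star = np.linalg.solve(X1.T @ X1, X1.T @ y)   # short regression

F = np.linalg.solve(X1.T @ X1, X1.T @ X2)
print(np.allclose(b1_star, b1 + F @ b2))         # True
```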
e* vs. e. We have

e* = M1y
   = M1(X1b1 + X2b2 + e)
   = M1X1b1 + M1X2b2 + M1e
   = M1X2b2 + e
   = v + e,   v = M1X2b2.

Thus,

e*'e* = e'e + v'v ≥ e'e.

Hence the SSR of the short regression (e*'e*) is at least the SSR of the long regression (e'e), and e*'e* = e'e iff v = 0, that is, iff b2 = 0.
Example. Illustration of b*1 ≠ b1 and e*'e* ≥ e'e. [Data not reproduced.] Find X, X1, X2, b, b1, b2, b*1, e*'e*, e'e.
2.4.3 Residual Regression

Consider

y = Xβ + ε = X1β1 + X2β2 + ε.

Premultiplying both sides by M1 and using M1X1 = 0, we obtain

M1y = M1X1β1 + M1X2β2 + M1ε,

ỹ = X̃2β2 + M1ε,   where ỹ = M1y and X̃2 = M1X2.

OLS applied to this equation gives

b2 = (X̃'2X̃2)⁻¹X̃'2ỹ = (X̃'2X̃2)⁻¹X̃'2M1y = (X̃'2X̃2)⁻¹X̃'2y.

Thus

b2 = (X̃'2X̃2)⁻¹X̃'2y.
Another way to prove b2 = (X̃'2X̃2)⁻¹X̃'2y (you may skip this proof). We have

(X̃'2X̃2)⁻¹X̃'2y = (X̃'2X̃2)⁻¹X̃'2X1b1 + (X̃'2X̃2)⁻¹X̃'2X2b2 + (X̃'2X̃2)⁻¹X̃'2e
               = 0 + b2 + 0
               = b2

since:

(X̃'2X̃2)⁻¹X̃'2X1b1 = (X̃'2X̃2)⁻¹X'2M1X1b1 = 0   (because M1X1 = 0);

(X̃'2X̃2)⁻¹X̃'2X2b2 = (X̃'2X̃2)⁻¹X'2M1X2b2
                  = (X'2M'1M1X2)⁻¹X'2M1X2b2
                  = (X'2M1X2)⁻¹X'2M1X2b2
                  = b2;

X̃'2e = X'2M1e = X'2e = 0.
The conclusion is that we can obtain b2 = (X̃'2X̃2)⁻¹X̃'2ỹ = (X̃'2X̃2)⁻¹X̃'2y as follows:

1) Regress X2 on X1 to get the residuals X̃2 = M1X2. Interpretation: X̃2 is X2 after the effects of X1 have been removed, i.e. X̃2 is the part of X2 that is uncorrelated with X1.

2) Regress y on X̃2 to get the coefficient b2 of the long regression.

OR:

1') Same as 1).

2'a) Regress y on X1 to get the residuals ỹ = M1y.

2'b) Regress ỹ on X̃2 to get the coefficient b2 of the long regression.

The conclusion of 1) and 2) is extremely important: b2 relates y to X2 after controlling for the effects of X1. This is why b2 can be obtained from the regression of y on X̃2, where X̃2 is X2 after the effects of X1 have been removed (fixed, or controlled for). This means that b2 has in fact a ceteris paribus interpretation.
To recover b1 we use the equation b*1 = b1 + Fb2: regress y on X1, obtaining

b*1 = (X'1X1)⁻¹X'1y,

and then

b1 = b*1 − (X'1X1)⁻¹X'1X2b2 = b*1 − Fb2.
Example. Consider the example on page 9.
Example. Consider X = [1 exper tenure IQ educ] and

X1 = [1 exper tenure IQ],   X2 = educ.

[Regression output not reproduced.]
2.4.4 Application of Residual Regression
A) Trend Removal (time series)

Suppose that yt and xt have a linear trend. Should the trend term be included in the regression, as in

yt = β1 + β2xt2 + β3xt3 + εt,   xt3 = t,

or should the variables first be "detrended" and then used without the trend term included, as in

ỹt = β2x̃t2 + εt?

According to the previous results, the OLS coefficient b2 is the same in both regressions. In the second regression b2 is obtained from the regression of ỹ = M1y on x̃•2 = M1x•2, where

X1 = [1 x•3] =
  1 1
  1 2
  ⋮ ⋮
  1 n
Example. Consider (TXDES: unemployment rate, INF: inflation, t: time)

TXDESt = β1 + β2INFt + β3t + εt.

We will show two ways to obtain b2 (compare EQ01 to EQ04).

EQ01 - Dependent Variable: TXDES; Method: Least Squares; Sample: 1948 2003
  Variable   Coefficient  Std. Error  t-Statistic  Prob.
  C          4.463068     0.425856    10.48023     0.0000
  INF        0.104712     0.063329    1.653473     0.1041
  @TREND     0.027788     0.011806    2.353790     0.0223

EQ02 - Dependent Variable: TXDES; Method: Least Squares; Sample: 1948 2003
  Variable   Coefficient  Std. Error  t-Statistic  Prob.
  C          4.801316     0.379453    12.65325     0.0000
  @TREND     0.030277     0.011896    2.545185     0.0138

EQ03 - Dependent Variable: INF; Method: Least Squares; Sample: 1948 2003
  Variable   Coefficient  Std. Error  t-Statistic  Prob.
  C          3.230263     0.802598    4.024758     0.0002
  @TREND     0.023770     0.025161    0.944696     0.3490

EQ04 - Dependent Variable: TXDES_; Method: Least Squares; Sample: 1948 2003
  Variable   Coefficient  Std. Error  t-Statistic  Prob.
  INF_       0.104712     0.062167    1.684382     0.0978
B) Seasonal Adjustment and Linear Regression with Seasonal Data
Suppose that we have data on the variable y, quarter by quarter, for m years. A way to deal with (deterministic) seasonality is the following:

yt = β1Qt1 + β2Qt2 + β3Qt3 + β4Qt4 + β5xt5 + εt

where

Qti = 1 in quarter i, 0 otherwise.

Let

X = [Q1 Q2 Q3 Q4 x•5],   X1 = [Q1 Q2 Q3 Q4].

The previous results show that b5 can be obtained from the regression of ỹ = M1y on x̃•5 = M1x•5. It can be proved that

ỹt = yt − ȳQ1 in quarter 1;  yt − ȳQ2 in quarter 2;  yt − ȳQ3 in quarter 3;  yt − ȳQ4 in quarter 4,

where ȳQi is the seasonal mean of quarter i.
C) Deviations from Means

Let x•1 be the summer vector (the vector of ones). Instead of regressing y on [x•1 x•2 · · · x•K] to get (b1, b2, ..., bK)', we can regress y on the matrix of deviations from means

  x12 − x̄2 · · · x1K − x̄K
  ⋮               ⋮
  xn2 − x̄2 · · · xnK − x̄K

to get the same vector (b2, ..., bK)'. We sketch the proof. Let

X2 = [x•2 · · · x•K]

so that

ŷ = x•1b1 + X2b2.

1) Regress X2 on x•1 to get the residuals X̃2 = M1X2, where

M1 = I − x•1(x'•1x•1)⁻¹x'•1 = I − x•1x'•1/n.
As we know,

X̃2 = M1X2 = M1[x•2 · · · x•K] = [M1x•2 · · · M1x•K] =
  x12 − x̄2 · · · x1K − x̄K
  ⋮               ⋮
  xn2 − x̄2 · · · xnK − x̄K.

2) Regress y (or ỹ = M1y) on X̃2 to get the coefficient b2 of the long regression:

b2 = (X̃'2X̃2)⁻¹X̃'2ỹ = (X̃'2X̃2)⁻¹X̃'2y.

The intercept can be recovered as

b1 = b*1 − (x'•1x•1)⁻¹x'•1X2b2 = b*1 − Fb2.
2.4.5 Short and Residual Regression in the Classical Regression Model
Consider:

y = X1b1 + X2b2 + e   (long regression),

y = X1b*1 + e*   (short regression).

The correct specification corresponds to the long regression:

E(y | X) = X1β1 + X2β2 = Xβ,

Var(y | X) = σ²I, etc.
A) Short-Regression Coefficients

b*1 is a biased estimator of β1. Given that

b*1 = (X'1X1)⁻¹X'1y = b1 + Fb2,   F = (X'1X1)⁻¹X'1X2,

we have

E(b*1 | X) = E(b1 + Fb2 | X) = β1 + Fβ2,

Var(b*1 | X) = Var((X'1X1)⁻¹X'1y | X) = (X'1X1)⁻¹X'1 Var(y | X) X1(X'1X1)⁻¹ = σ²(X'1X1)⁻¹.

Thus, in general,

b*1 is a biased estimator of β1 ("omitted-variable bias")

unless:

• β2 = 0, which corresponds to the case of "irrelevant omitted variables";
• F = O, which corresponds to the case of "orthogonal explanatory variables" (in sample space).
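The bias formula E(b*1 | X) = β1 + Fβ2 can be illustrated by Monte Carlo. This is my own simulation with an invented design in which the omitted regressor is correlated with the included one:

```python
import numpy as np

# Sketch: omitted-variable bias. With X fixed, the average short-regression
# estimate is close to beta1 + F*beta2, not beta1.
rng = np.random.default_rng(6)
n = 60
beta1, beta2 = np.array([1.0, 2.0]), np.array([1.5])
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = (0.8 * X1[:, 1] + 0.3 * rng.normal(size=n)).reshape(-1, 1)
F = np.linalg.solve(X1.T @ X1, X1.T @ X2)

R = 4_000
est = np.zeros(2)
for _ in range(R):
    y = X1 @ beta1 + X2 @ beta2 + rng.normal(size=n)
    est += np.linalg.solve(X1.T @ X1, X1.T @ y)   # short regression
print(est / R, beta1 + F @ beta2)                 # approximately equal
```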
Var(b1 | X) ≥ Var(b*1 | X) (you may skip the proof)

Consider b1 = b*1 − Fb2. Then

Var(b1 | X) = Var(b*1 − Fb2 | X)
            = Var(b*1 | X) + Var(Fb2 | X)   (since Cov(b*1, b2 | X) = O [board])
            = Var(b*1 | X) + F Var(b2 | X) F'.

Because F Var(b2 | X) F' is positive semidefinite (or nonnegative definite), Var(b1 | X) ≥ Var(b*1 | X).

This relation is still valid if β2 = 0. In that case, regressing y on X1 and on irrelevant variables (X2) involves a cost: Var(b1 | X) ≥ Var(b*1 | X), although E(b1 | X) = β1.

In practice there may be a bias-variance trade-off between the short and long regression when the target is β1.
Exercise 2.9. Consider the standard simple regression model yi = β1 + β2xi2 + εi under Assumptions 1.1 through 1.4. Thus, the usual OLS estimators b1 and b2 are unbiased for their respective population parameters. Let b*2 be the estimator of β2 obtained by assuming the intercept is zero, i.e. β1 = 0. (i) Find E(b*2 | X). Verify that b*2 is unbiased for β2 when the population intercept β1 is zero. Are there other cases where b*2 is unbiased? (ii) Find the variance of b*2. (iii) Show that Var(b*2 | X) ≤ Var(b2 | X). (iv) Comment on the trade-off between bias and variance when choosing between b*2 and b2.

Exercise 2.10. Suppose that average worker productivity at manufacturing firms (avgprod) depends on two factors, average hours of training (avgtrain) and average worker ability (avgabil):

avgprodi = β1 + β2avgtraini + β3avgabili + εi.

Assume that this equation satisfies Assumptions 1.1 through 1.4. If grants have been given to firms whose workers have less than average ability, so that avgtrain and avgabil are negatively correlated, what is the likely bias in the estimator b*2 obtained from the simple regression of avgprod on avgtrain?
B) Short-Regression Residuals (skip this)

Given that e* = M1y, we have

E(e* | X) = M1 E(y | X) = M1(X1β1 + X2β2) = X̃2β2,

Var(e* | X) = Var(M1y | X) = M1 Var(y | X) M'1 = σ²M1.

Thus E(e* | X) ≠ 0 unless β2 = 0.

Let us now see that the omission of explanatory variables leads to an increase in the expected SSR. We have, by R5,

E(e*'e* | X) = E(y'M1y | X) = tr(M1 Var(y | X)) + E(y | X)' M1 E(y | X)
             = σ² tr(M1) + β'2X̃'2X̃2β2 = σ²(n − K1) + β'2X̃'2X̃2β2,

and E(e'e | X) = σ²(n − K), thus

E(e*'e* | X) − E(e'e | X) = σ²K2 + β'2X̃'2X̃2β2 > 0.

Notice that e*'e* − e'e = b'2X̃'2X̃2b2 ≥ 0 (check that E(b'2X̃'2X̃2b2 | X) = σ²K2 + β'2X̃'2X̃2β2).
C) Residual Regression

The objective is to characterize

Var(b2|X).

By residual regression we know that b2 = (X̃2′X̃2)⁻¹X̃2′y, where X̃2 = M1X2. Thus

Var(b2|X) = Var((X̃2′X̃2)⁻¹X̃2′y|X) = (X̃2′X̃2)⁻¹X̃2′ Var(y|X) X̃2(X̃2′X̃2)⁻¹
          = σ²(X̃2′X̃2)⁻¹
          = σ²(X2′M1X2)⁻¹.
Now suppose that

X = [X1  x•K]  (i.e. x•K = X2).

65

It follows that

Var(bK|X) = σ² / (x•K′M1x•K)

and x•K′M1x•K is the sum of the squared residuals in the auxiliary regression

x•K = α1x•1 + α2x•2 + ... + α(K−1)x•(K−1) + error.

One can conclude (assuming that x•1 is the summer vector, i.e. the vector of ones):

R²K = 1 − x•K′M1x•K / Σ(xiK − x̄K)².

Solving this equation for x•K′M1x•K we have

x•K′M1x•K = (1 − R²K) Σ(xiK − x̄K)².

We get

Var(bK|X) = σ² / ((1 − R²K) Σ(xiK − x̄K)²) = σ² / ((1 − R²K) S²xK n).
66
We can conclude that the precision of bK is high (i.e. Var(bK) is small) when:

• σ² is low;

• S²xK is high (imagine the regression wage = β1 + β2educ + ε. If most people in the sample report the same education, S²xK will be low and β2 will be estimated very imprecisely);

• n is high (a large sample is preferable to a small sample);

• R²K is low (multicollinearity increases R²K).
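The equivalence between Var(bK|X) = σ²[(X′X)⁻¹]KK and the auxiliary-regression formula can be verified numerically. A sketch with a simulated design (all numbers below are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 200, 3
x2 = rng.normal(size=n)
xK = 0.8 * x2 + rng.normal(size=n)        # last regressor, correlated with x2
X = np.column_stack([np.ones(n), x2, xK])
sigma2 = 4.0

# Direct formula: Var(bK|X) = sigma^2 * [(X'X)^{-1}]_{KK}
var_direct = sigma2 * np.linalg.inv(X.T @ X)[K - 1, K - 1]

# Auxiliary regression of xK on the remaining columns -> R^2_K
X1 = X[:, : K - 1]
resid = xK - X1 @ np.linalg.lstsq(X1, xK, rcond=None)[0]
R2K = 1 - resid @ resid / np.sum((xK - xK.mean()) ** 2)
S2xK = np.mean((xK - xK.mean()) ** 2)     # S^2_xK = (1/n) * sum of squared deviations
var_formula = sigma2 / ((1 - R2K) * S2xK * n)

print(np.isclose(var_direct, var_formula))   # True: the two expressions coincide
```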
67
Exercise 2.11. Consider: sleep: minutes of sleep at night per week; totwrk: hours worked per week; educ: years of schooling; female: binary variable equal to one if the individual is female. Do women sleep more than men? Explain the differences between the estimates 32.18 and -90.969.

Dependent Variable: SLEEP; Method: Least Squares; Sample: 1 706

Variable    Coefficient   Std. Error   t-Statistic   Prob.
C           3252.407      22.22211     146.3591      0.0000
FEMALE      32.18074      33.75413     0.953387      0.3407

R-squared 0.001289; Adjusted R-squared 0.000129; S.E. of regression 444.4422; Sum squared resid 1.39E+08; Mean dependent var 3266.356; S.D. dependent var 444.4134; Akaike info criterion 15.03435; Schwarz criterion 15.04726

Dependent Variable: SLEEP; Method: Least Squares; Sample: 1 706

Variable    Coefficient   Std. Error   t-Statistic   Prob.
C           3838.486      86.67226     44.28737      0.0000
TOTWRK      -0.167339     0.017937     -9.329260     0.0000
EDUC        -13.88479     5.657573     -2.454196     0.0144
FEMALE      -90.96919     34.27441     -2.654143     0.0081

R-squared 0.119277; Adjusted R-squared 0.115514; S.E. of regression 417.9581; Sum squared resid 1.23E+08; Mean dependent var 3266.356; S.D. dependent var 444.4134; Akaike info criterion 14.91429; Schwarz criterion 14.94012
68
Example. The goal is to analyze the impact of another year of education on wages. Consider: wage: monthly earnings; KWW: knowledge of world work score (KWW is a general test of work-related abilities); educ: years of education; exper: years of work experience; tenure: years with current employer.

Dependent Variable: LOG(WAGE); Method: Least Squares; Sample: 1 935; White Heteroskedasticity-Consistent Standard Errors & Covariance

Variable    Coefficient   Std. Error   t-Statistic   Prob.
C           5.973062      0.082272     72.60160      0.0000
EDUC        0.059839      0.006079     9.843503      0.0000

R-squared 0.097417; Adjusted R-squared 0.096449; S.E. of regression 0.400320; Sum squared resid 149.5186; Mean dependent var 6.779004; S.D. dependent var 0.421144; Akaike info criterion 1.009029; Schwarz criterion 1.019383

Dependent Variable: LOG(WAGE); Method: Least Squares; Sample: 1 935; White Heteroskedasticity-Consistent Standard Errors & Covariance

Variable    Coefficient   Std. Error   t-Statistic   Prob.
C           5.496696      0.112030     49.06458      0.0000
EDUC        0.074864      0.006654     11.25160      0.0000
EXPER       0.015328      0.003405     4.501375      0.0000
TENURE      0.013375      0.002657     5.033021      0.0000

R-squared 0.155112; Adjusted R-squared 0.152390; S.E. of regression 0.387729; Sum squared resid 139.9610; Mean dependent var 6.779004; S.D. dependent var 0.421144; Akaike info criterion 0.947250; Schwarz criterion 0.967958

Dependent Variable: LOG(WAGE); Method: Least Squares; Sample: 1 935; White Heteroskedasticity-Consistent Standard Errors & Covariance

Variable    Coefficient   Std. Error   t-Statistic   Prob.
C           5.210967      0.113778     45.79932      0.0000
EDUC        0.047537      0.008275     5.744381      0.0000
EXPER       0.012897      0.003437     3.752376      0.0002
TENURE      0.011468      0.002686     4.270056      0.0000
IQ          0.004503      0.000989     4.553567      0.0000
KWW         0.006704      0.002070     3.238002      0.0012

R-squared 0.193739; Adjusted R-squared 0.189400; S.E. of regression 0.379170; Sum squared resid 133.5622; Mean dependent var 6.779004; S.D. dependent var 0.421144; Akaike info criterion 0.904732; Schwarz criterion 0.935794
69
Exercise 2.12. Consider

yi = β1 + β2xi2 + εi,  i = 1, ..., n

where xi2 is an impulse dummy, i.e. x•2 is a column vector with n − 1 zeros and only one 1. To simplify, let us suppose that this 1 is the first element of x•2, i.e.

x•2′ = [1 0 · · · 0].

Find and interpret the coefficient from the regression of y on x̃•1 = M2x•1, where M2 = I − x•2(x•2′x•2)⁻¹x•2′ (x̃•1 is the residual vector from the regression of x•1 on x•2).

Exercise 2.13. Consider the long regression model (under Assumptions 1.1 through 1.4):

y = X1b1 + X2b2 + e,

and the following coefficients (obtained from the short regressions):

b∗1 = (X1′X1)⁻¹X1′y,  b∗2 = (X2′X2)⁻¹X2′y.

Decide if you agree or disagree with the following statement: if Cov(b∗1, b∗2|X1, X2) = O (zero matrix) then b∗1 = b1 and b∗2 = b2.
70
2.5 Multicollinearity
If rank (X) < K then b is not defined. This is called strict multicollinearity. When thishappens, the statistical software will be unable to construct
(X′X
)−1 . Since the error isdiscovered quickly, this is rarely a problem for applied econometric practice.
The more relevant situation is near multicollinearity, which is often called “multicollinearity”for brevity. This is the situation when the X′X is near singular, when the columns of X areclose to linearly dependent.
Consequence: the individual coeffi cient estimates will be imprecise. We have shown that
Var (bK|X) =σ2(
1−R2K
)S2xKn.
where R2K is the coeffi cient of determination in the auxiliary regression
x•K = α1x•1 + α2x•2 + ...+ αK−1x•K−1 + error.
71
Exercise 2.14. Do you agree with the following quotations: (a) “But more data is no remedy for multicollinearity if the additional data are simply 'more of the same.' So obtaining lots of small samples from the same population will not help” (Johnston, 1984); (b) “Another important point is that a high degree of correlation between certain independent variables can be irrelevant as to how well we can estimate other parameters in the model.”

Exercise 2.15. Suppose you postulate a model explaining final exam score in terms of class attendance. Thus, the dependent variable is final exam score, and the key explanatory variable is the number of classes attended. To control for student abilities and efforts outside the classroom, you include among the explanatory variables cumulative GPA, SAT score, and measures of high school performance. Someone says, “You cannot hope to learn anything from this exercise because cumulative GPA, SAT score, and high school performance are likely to be highly collinear.” What should be your answer?
72
2.6 Statistical Inference under Normality

Assumption (1.5 - normality of the error term). ε|X ∼ Normal.

Assumption 1.5 together with Assumptions 1.2 and 1.4 implies that

ε|X ∼ N(0, σ²I) and y|X ∼ N(Xβ, σ²I).

Suppose that we want to test H0 : β2 = 1. Although Proposition 1.1 guarantees that, on average, b2 (the OLS estimate of β2) equals 1 if the hypothesis H0 : β2 = 1 is true, b2 may not be exactly equal to 1 for a particular sample at hand. Obviously, we cannot conclude that the restriction is false just because the estimate b2 differs from 1. In order to decide whether the sampling error b2 − 1 is “too large” for the restriction to be true, we need to construct from the sampling error some test statistic whose probability distribution is known given the truth of the hypothesis.
The relevant theory is built from the following results:
73
1. z ∼ N(0, I), z of dimension n × 1 ⇒ z′z ∼ χ²(n).

2. w1 ∼ χ²(m), w2 ∼ χ²(n), w1 and w2 independent ⇒ (w1/m)/(w2/n) ∼ F(m, n).

3. w ∼ χ²(n), z ∼ N(0, 1), w and z independent ⇒ z/√(w/n) ∼ t(n).

4. Asymptotic results:

v ∼ F(m, n) ⇒ mv →d χ²(m) as n → ∞;
u ∼ t(n) ⇒ u →d N(0, 1) as n → ∞.

5. Consider the n × 1 vector y|X ∼ N(Xβ, Σ). Then,

w = (y − Xβ)′Σ⁻¹(y − Xβ) ∼ χ²(n).
74
6. Consider the n × 1 vector ε|X ∼ N(0, I). Let M be an n × n idempotent matrix with rank(M) = r ≤ n. Then,

ε′Mε|X ∼ χ²(r).

7. Consider the n × 1 vector ε|X ∼ N(0, I). Let M be an n × n idempotent matrix with rank(M) = r ≤ n. Let L be a matrix such that LM = O. Let t1 = Mε and t2 = Lε. Then t1 and t2 are independent random vectors.

8. b|X ∼ N(β, σ²(X′X)⁻¹).

9. Let r = Rβ (R is p × K) with rank(R) = p (in Hayashi's notation p is equal to #r). Then,

Rb|X ∼ N(r, σ²R(X′X)⁻¹R′).
75
10. Let bk be the kth element of b and qkk the (k, k) element of (X′X)⁻¹. Then,

bk|X ∼ N(βk, σ²qkk)  or  zk = (bk − βk)/(σ√qkk) ∼ N(0, 1).

11. w = (Rb − r)′(R(X′X)⁻¹R′)⁻¹(Rb − r)/σ² ∼ χ²(p).

12. wk = (bk − βk)²/(σ²qkk) ∼ χ²(1).

13. w0 = e′e/σ² ∼ χ²(n − K).

14. The random vectors b and e are independent.

15. Each of the statistics e, e′e, w0, s², V̂ar(b) is independent of each of the statistics b, bk, Rb, w, wk.
76
16. tk = (bk − βk)/σ̂bk ∼ t(n − K), where σ̂²bk is the (k, k) element of s²(X′X)⁻¹.

17. (Rb − Rβ)/(s√(R(X′X)⁻¹R′)) ∼ t(n − K), where R is of type 1 × K.

18. F = (Rb − r)′(R(X′X)⁻¹R′)⁻¹(Rb − r)/(ps²) ∼ F(p, n − K).

Exercise 2.16. Prove the results #8, #9, #16 and #18 (take the other results as given).

The two most important results are:

tk = (bk − βk)/σ̂bk = (bk − βk)/SE(bk) ∼ t(n − K)

F = (Rb − r)′(R(X′X)⁻¹R′)⁻¹(Rb − r)/(ps²) ∼ F(p, n − K).
77
2.6.1 Confidence Intervals and Regions

Let tα/2 ≡ tα/2(n − K) be such that

P(|t| < tα/2) = 1 − α.
78
Let Fα ≡ Fα(p, n − K) be such that

P(F > Fα) = α.
79
• (1 − α)100% CI for an individual slope coefficient βk:

{βk : |(bk − βk)/σ̂bk| ≤ tα/2} ⇔ bk ± tα/2 σ̂bk.

• (1 − α)100% CI for a single linear combination of the elements of β (p = 1):

{Rβ : |(Rb − Rβ)/(s√(R(X′X)⁻¹R′))| ≤ tα/2} ⇔ Rb ± tα/2 s√(R(X′X)⁻¹R′).

In this case R is a vector 1 × K.

• (1 − α)100% confidence region for the parameter vector θ = Rβ:

{θ : (Rb − θ)′(R(X′X)⁻¹R′)⁻¹(Rb − θ)/s² ≤ pFα}.

• (1 − α)100% confidence region for the parameter vector β (consider R = I in the previous case, so p = K):

{β : (b − β)′(X′X)(b − β)/s² ≤ KFα}.
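A minimal sketch of the CI for a slope coefficient on simulated data, using the large-sample approximation t0.025(n − K) ≈ 1.96 (exact t critical values require tables or a statistics library); all numerical values below are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(2)
n, K = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

b, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ b
s2 = e @ e / (n - K)                        # s^2 = e'e/(n-K)
se_b = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))

t_crit = 1.96                               # approx. t_{0.025}(98); exact value is about 1.984
ci_low, ci_high = b[1] - t_crit * se_b[1], b[1] + t_crit * se_b[1]
print(ci_low, ci_high)                      # 95% CI for the slope, b1 +/- t * SE
```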
80
Exercise 2.17. Consider yi = β1xi1 + β2xi2 + εi where yi, xi1 and xi2 are wage, educ and exper in deviations from their sample means (so there is no intercept). The results are

Dependent Variable: Y; Method: Least Squares; Sample: 1 526

Variable    Coefficient   Std. Error   t-Statistic   Prob.
X1          0.644272      0.053755     11.98541      0.0000
X2          0.070095      0.010967     6.391393      0.0000

R-squared 0.225162; Adjusted R-squared 0.223683; S.E. of regression 3.253935; Sum squared resid 5548.160; Log likelihood -1365.969; Durbin-Watson stat 1.820274; Mean dependent var 1.34E-15; S.D. dependent var 3.693086; Akaike info criterion 5.201402; Schwarz criterion 5.217620; Hannan-Quinn criter. 5.207752

X′X = [  4025.4297   −5910.064
        −5910.064   96706.846 ]

(X′X)⁻¹ = [ 2.7291 × 10⁻⁴   1.6678 × 10⁻⁵
            1.6678 × 10⁻⁵   1.1360 × 10⁻⁵ ]

(a) Build the 95% confidence interval for β2.

(b) Build the 95% confidence interval for β1 + β2.

(c) Build the 95% confidence region for the parameter vector β.
81
Confidence regions in EVIEWS:

[Figure: 90% and 95% confidence regions (ellipses) for the parameter vector β, with β1 on the horizontal axis (0.50 to 0.80) and β2 on the vertical axis (0.04 to 0.10).]
82
2.6.2 Testing on a Single Parameter

Suppose that we have a hypothesis about the kth regression coefficient:

H0 : βk = β⁰k

(β⁰k is a specific value, e.g. zero), and that this hypothesis is tested against the alternative hypothesis

H1 : βk ≠ β⁰k.

We do not reject H0 at the α·100% level if β⁰k lies within the (1 − α)100% CI for βk, i.e., bk ± tα/2 σ̂bk; we reject H0 otherwise. Equivalently, calculate the test statistic

tobs = (bk − β⁰k)/σ̂bk

and,

if |tobs| > tα/2 then reject H0;
if |tobs| ≤ tα/2 then do not reject H0.
83
The reasoning is as follows. Under the null hypothesis we have

t⁰k = (bk − β⁰k)/σ̂bk ∼ t(n − K).

If we observe |tobs| > tα/2 and H0 is true, then a low-probability event has occurred. We take |tobs| > tα/2 as evidence against the null and the decision should be to reject H0.

Other cases:

• H0 : βk = β⁰k vs. H1 : βk > β⁰k:
if tobs > tα then reject H0 at the α·100% level; otherwise do not reject H0.

• H0 : βk = β⁰k vs. H1 : βk < β⁰k:
if tobs < −tα then reject H0 at the α·100% level; otherwise do not reject H0.
84
2.6.3 Issues in Hypothesis Testing

p-value

The p-value (or p) is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true. p is an informal measure of the evidence in favor of the null hypothesis.

Example. Consider H0 : βk = β⁰k vs. H1 : βk ≠ β⁰k:

p-value = 2P(t⁰k > |tobs| | H0 is true).

A p-value = 0.02 shows little evidence supporting H0. At the 5% level you should reject the H0 hypothesis.

Example. Consider H0 : βk = β⁰k vs. H1 : βk > β⁰k:

p-value = P(t⁰k > tobs | H0 is true).

EVIEWS reports two-sided p-values: for this one-sided alternative, divide the reported p-value by two.
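For large n − K the t distribution is close to N(0, 1) (result 4 above), so a two-sided p-value can be approximated with the Python standard library alone; this is a sketch, not a substitute for exact t tables in small samples:

```python
from statistics import NormalDist

# Two-sided p-value for t_obs, using the large-sample approximation
# t(n - K) ~ N(0, 1); exact small-sample p-values need t tables or a
# statistics library.
def p_value_two_sided(t_obs: float) -> float:
    return 2 * (1 - NormalDist().cdf(abs(t_obs)))

print(round(p_value_two_sided(1.96), 3))    # close to 0.05
print(p_value_two_sided(10.0) < 1e-10)      # essentially zero: strong evidence against H0
```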
85
Reporting the outcome of a test

Correct wording in reporting the outcome of a test involving H0 : βk = β⁰k vs. H1 : βk ≠ β⁰k:

• When the null is rejected we say that bk (not βk) is significantly different from β⁰k at the α·100% level.

• When the null isn't rejected we say that bk (not βk) is not significantly different from β⁰k at the α·100% level.

Correct wording in reporting the outcome of a test involving H0 : βk = 0 vs. H1 : βk ≠ 0:

• When the null is rejected we say that bk (not βk) is significantly different from zero at the α·100% level, or that the variable (associated with bk) is statistically significant at the α·100% level.

• When the null isn't rejected we say that bk (not βk) is not significantly different from zero at the α·100% level, or that the variable is not statistically significant at the α·100% level.
86
More remarks:

• Rejection of the null is not proof that the null is false. Why?

• Acceptance of the null is not proof that the null is true. Why? We prefer to use the language “we fail to reject H0 at the x% level” rather than “H0 is accepted at the x% level.”

• In a test of type H0 : βk = β⁰k, if σ̂bk is large (bk is an imprecise estimator) it is more difficult to reject the null. The sample contains little information about the true value of the βk parameter. Remember that σ̂bk depends on σ², S²xk, n and R²k.
87
Statistical Versus Economic Significance

The statistical significance of a variable is determined by the size of tobs = bk/SE(bk), whereas the economic significance of a variable is related to the size and sign of bk.

Example. Suppose that in a business activity we have

log(wagei) = .1 + 0.01 femalei + ...,  n = 600
                  (0.001)

H0 : β2 = 0 vs. H1 : β2 ≠ 0. We have:

t⁰k = b2/σ̂b2 ∼ t(600 − K) ≈ N(0, 1) (under the null)

tobs = 0.01/0.001 = 10,

p-value = 2P(t⁰k > |10| | H0 is true) ≈ 0.

Discuss statistical versus economic significance.
88
Exercise 2.18. Can we say that students at smaller schools perform better than those at larger schools? To discuss this hypothesis we consider data on 408 high schools in Michigan for the year 1993 (see Wooldridge, chapter 4). Performance is measured by the percentage of students receiving a passing score on a tenth-grade math test (math10). School size is measured by student enrollment (enroll). We will control for two other factors, average annual teacher compensation (totcomp) and the number of staff per one thousand students (staff). Teacher compensation is a measure of teacher quality, and staff size is a rough measure of how much attention students receive. The table below reports the results. Answer the initial question.

Dependent Variable: MATH10; Method: Least Squares; Sample: 1 408

Variable    Coefficient   Std. Error   t-Statistic   Prob.
C           2.274021      6.113794     0.371949      0.7101
TOTCOMP     0.000459      0.000100     4.570030      0.0000
STAFF       0.047920      0.039814     1.203593      0.2295
ENROLL      -0.000198     0.000215     -0.917935     0.3592

R-squared 0.054063; Adjusted R-squared 0.047038; S.E. of regression 10.24384; Sum squared resid 42394.25; Log likelihood -1526.201; F-statistic 7.696528; Prob(F-statistic) 0.000052; Mean dependent var 24.10686; S.D. dependent var 10.49361; Akaike info criterion 7.500986; Schwarz criterion 7.540312; Hannan-Quinn criter. 7.516547; Durbin-Watson stat 1.668918
89
Exercise 2.19. We want to relate the median housing price (price) in the community to various community characteristics: nox is the amount of nitrous oxide in the air, in parts per million; dist is a weighted distance of the community from five employment centers, in miles; rooms is the average number of rooms in houses in the community; and stratio is the average student-teacher ratio of schools in the community. Can we conclude that the elasticity of price with respect to nox is -1? (Sample: 506 communities in the Boston area - see Wooldridge, chapter 4.)

Dependent Variable: LOG(PRICE); Method: Least Squares; Sample: 1 506

Variable     Coefficient   Std. Error   t-Statistic   Prob.
C            11.08386      0.318111     34.84271      0.0000
LOG(NOX)     -0.953539     0.116742     -8.167932     0.0000
LOG(DIST)    -0.134339     0.043103     -3.116693     0.0019
ROOMS        0.254527      0.018530     13.73570      0.0000
STRATIO      -0.052451     0.005897     -8.894399     0.0000

R-squared 0.584032; Adjusted R-squared 0.580711; S.E. of regression 0.265003; Sum squared resid 35.18346; Log likelihood -43.49487; F-statistic 175.8552; Prob(F-statistic) 0.000000; Mean dependent var 9.941057; S.D. dependent var 0.409255; Akaike info criterion 0.191679; Schwarz criterion 0.233444; Hannan-Quinn criter. 0.208059; Durbin-Watson stat 0.681595
90
2.6.4 Test on a Set of Parameters I

Suppose that we have a joint null hypothesis about β:

H0 : Rβ = r vs. H1 : Rβ ≠ r

(where Rβ is p × 1 and R is p × K). The test statistic is

F⁰ = (Rb − r)′(R(X′X)⁻¹R′)⁻¹(Rb − r)/(ps²).

Let Fobs be the observed test statistic. We have:

reject H0 if Fobs > Fα (or if p-value < α);
do not reject H0 if Fobs ≤ Fα.

The reasoning is as follows. Under the null hypothesis we have

F⁰ ∼ F(p, n − K).

If we observe F⁰ > Fα and H0 is true, then a low-probability event has occurred.
91
In the case p = 1 (single linear combination of the elements of β) one may use the test statistic

t⁰ = (Rb − Rβ)/(s√(R(X′X)⁻¹R′)) ∼ t(n − K).

Example. We consider a simple model to compare the returns to education at junior colleges and four-year colleges; for simplicity, we refer to the latter as “universities” (see Wooldridge, chap. 4). The model is

log(wagesi) = β1 + β2jci + β3univi + β4experi + εi.

The population includes working people with a high school degree. jc is the number of years attending a two-year college and univ is the number of years at a four-year college. Note that any combination of junior college and college is allowed, including jc = 0 and univ = 0. The hypothesis of interest is whether a year at a junior college is worth a year at a university: this is stated as H0 : β2 = β3. Under H0, another year at a junior college and another year at a university lead to the same ceteris paribus percentage increase in wage. The alternative of interest is one-sided: a year at a junior college is worth less than a year at a university. This is stated as H1 : β2 < β3.
92
Dependent Variable: LWAGE; Method: Least Squares; Sample: 1 6763

Variable    Coefficient   Std. Error   t-Statistic   Prob.
C           1.472326      0.021060     69.91020      0.0000
JC          0.066697      0.006829     9.766984      0.0000
UNIV        0.076876      0.002309     33.29808      0.0000
EXPER       0.004944      0.000157     31.39717      0.0000

R-squared 0.222442; Adjusted R-squared 0.222097; S.E. of regression 0.430138; Sum squared resid 1250.544; Log likelihood -3888.687; F-statistic 644.5330; Prob(F-statistic) 0.000000; Mean dependent var 2.248096; S.D. dependent var 0.487692; Akaike info criterion 1.151172; Schwarz criterion 1.155205; Hannan-Quinn criter. 1.152564; Durbin-Watson stat 1.968444

(X′X)⁻¹ = [  0.0023972        −9.4121 × 10⁻⁵   −8.50437 × 10⁻⁵  −1.6780 × 10⁻⁵
            −9.41217 × 10⁻⁵    0.0002520        1.04201 × 10⁻⁵  −9.2871 × 10⁻⁸
            −8.50437 × 10⁻⁵    1.0420 × 10⁻⁵    2.88090 × 10⁻⁵   2.12598 × 10⁻⁷
            −1.67807 × 10⁻⁵   −9.2871 × 10⁻⁸    2.1259 × 10⁻⁷    1.3402 × 10⁻⁷ ]

Under the null, the test statistic is

t⁰ = (Rb − Rβ)/(s√(R(X′X)⁻¹R′)) ∼ t(n − K).
93
We have

R = [0 1 −1 0]

√(R(X′X)⁻¹R′) = 0.016124827

s√(R(X′X)⁻¹R′) = 0.430138 × 0.016124827 = 0.006936

Rb = [0 1 −1 0] (1.472326, 0.066697, 0.076876, 0.004944)′ = b2 − b3 = −0.01018

Rβ = [0 1 −1 0] (β1, β2, β3, β4)′ = β2 − β3 = 0 (under H0)

tobs = −0.01018/0.006936 = −1.467

−t0.05 = −1.645.

We do not reject H0 at the 5% level. There is no evidence against β2 = β3 at the 5% level.
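The computation above can be reproduced from the reported coefficients and (X′X)⁻¹ alone:

```python
import numpy as np

# Reported OLS coefficients (C, JC, UNIV, EXPER) and (X'X)^{-1} from the output
b = np.array([1.472326, 0.066697, 0.076876, 0.004944])
XtX_inv = np.array([
    [ 0.0023972,   -9.4121e-5,  -8.50437e-5, -1.6780e-5],
    [-9.41217e-5,   0.0002520,   1.04201e-5, -9.2871e-8],
    [-8.50437e-5,   1.0420e-5,   2.88090e-5,  2.12598e-7],
    [-1.67807e-5,  -9.2871e-8,   2.1259e-7,   1.3402e-7],
])
s = 0.430138                       # S.E. of regression
R = np.array([0.0, 1.0, -1.0, 0.0])

# t statistic for H0: beta2 - beta3 = 0
t_obs = (R @ b) / (s * np.sqrt(R @ XtX_inv @ R))
print(round(t_obs, 3))             # about -1.468
```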
94
Remark: in this exercise t⁰ can be written as

t⁰ = Rb/(s√(R(X′X)⁻¹R′)) = (b2 − b3)/√(V̂ar(b2 − b3)) = (b2 − b3)/SE(b2 − b3).

Exercise 2.20 (continuation). Propose another way to test H0 : β2 = β3 against H1 : β2 < β3 along the following lines: define θ = β2 − β3; write β2 = θ + β3; plug this into the equation log(wagesi) = β1 + β2jci + β3univi + β4experi + εi and test θ = 0. Use the database available on the webpage of the course.
95
2.6.5 Test on a Set of Parameters II

We focus on another way to test

H0 : Rβ = r vs. H1 : Rβ ≠ r

(where Rβ is p × 1 and R is p × K). It can be proved that

F⁰ = (Rb − r)′(R(X′X)⁻¹R′)⁻¹(Rb − r)/(ps²)
   = [(e∗′e∗ − e′e)/p] / [e′e/(n − K)]
   = [(R² − R²∗)/p] / [(1 − R²)/(n − K)]
   ∼ F(p, n − K)

where ∗ refers to the short regression, i.e. the regression subjected to the constraint Rβ = r.
96
Example. Consider once again the equation log(wagesi) = β1 + β2jci + β3univi + β4experi + εi and H0 : β2 = β3 against H1 : β2 ≠ β3. The results of the regression subjected to the constraint H0 : β2 = β3 are

Dependent Variable: LWAGE; Method: Least Squares; Sample: 1 6763

Variable    Coefficient   Std. Error   t-Statistic   Prob.
C           1.471970      0.021061     69.89198      0.0000
JC+UNIV     0.076156      0.002256     33.75412      0.0000
EXPER       0.004932      0.000157     31.36057      0.0000

R-squared 0.222194; Adjusted R-squared 0.221964; S.E. of regression 0.430175; Sum squared resid 1250.942; Log likelihood -3889.764; F-statistic 965.5576; Prob(F-statistic) 0.000000; Mean dependent var 2.248096; S.D. dependent var 0.487692; Akaike info criterion 1.151195; Schwarz criterion 1.154220; Hannan-Quinn criter. 1.152239; Durbin-Watson stat 1.968481

We have p = 1, e′e = 1250.544, e∗′e∗ = 1250.942 and

Fobs = [(e∗′e∗ − e′e)/p] / [e′e/(n − K)] = [(1250.942 − 1250.544)/1] / [1250.544/(6763 − 4)] = 2.151,

F0.05 = 3.84.

We do not reject the null at the 5% level, since Fobs = 2.151 < F0.05 = 3.84.
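The arithmetic of the SSR-based F statistic can be reproduced directly from the reported sums of squared residuals:

```python
# SSR-based F test of H0: beta2 = beta3 (one restriction), using the
# reported sums of squared residuals of the restricted and unrestricted fits
ssr_r, ssr_u = 1250.942, 1250.544
p, n, K = 1, 6763, 4

F_obs = ((ssr_r - ssr_u) / p) / (ssr_u / (n - K))
print(round(F_obs, 3))              # about 2.151
```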
97
In the case “all slopes zero” (test of significance of the complete regression), it can be proved that F⁰ equals

F⁰ = [R²/(K − 1)] / [(1 − R²)/(n − K)].

Under the null H0 : βk = 0, k = 2, 3, ..., K, we have F⁰ ∼ F(K − 1, n − K).
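As a quick check of this formula, plugging in the figures reported in Exercise 2.21 below (R² = 0.300503, n = 500, K = 3) essentially reproduces the printed F-statistic:

```python
# Overall-significance F computed from R^2 alone:
# F = [R^2/(K-1)] / [(1-R^2)/(n-K)]
R2, n, K = 0.300503, 500, 3

F = (R2 / (K - 1)) / ((1 - R2) / (n - K))
print(round(F, 2))                  # about 106.76, matching the reported F-statistic
```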
Exercise 2.21. Consider the results:
Dependent Variable: Y; Method: Least Squares; Sample: 1 500

Variable    Coefficient   Std. Error   t-Statistic   Prob.
C           0.952298      0.237528     4.009200      0.0001
X2          1.322678      1.686759     0.784154      0.4333
X3          2.026896      1.701543     1.191210      0.2341

R-squared 0.300503; Adjusted R-squared 0.297688; S.E. of regression 5.311080; Sum squared resid 14019.16; Log likelihood -1542.862; F-statistic 106.7551; Prob(F-statistic) 0.000000; Mean dependent var 0.975957; S.D. dependent var 6.337496; Akaike info criterion 6.183449; Schwarz criterion 6.208737; Hannan-Quinn criter. 6.193372; Durbin-Watson stat 2.052601

Test: (a) H0 : β2 = 0 vs. H1 : β2 ≠ 0; (b) H0 : β3 = 0 vs. H1 : β3 ≠ 0; (c) H0 : β2 = 0, β3 = 0 vs. H1 : at least one of β2, β3 ≠ 0. (d) Are xi2 and xi3 truly relevant variables? How would you explain the results you obtained in parts (a), (b) and (c)?
98
2.7 Relation to Maximum Likelihood

Having specified the distribution of the error vector, we can use the maximum likelihood (ML) principle to estimate the model parameters θ = (β′, σ²)′.

2.7.1 The Maximum Likelihood Principle

ML principle: choose the parameter estimates to maximize the probability of obtaining the data. Maximizing the joint density associated with the data, f(y, X; θ), leads to the same solution. Therefore:

ML estimator of θ = argmax over θ of f(y, X; θ).
99
Example (without X). We flipped a coin 10 times. If heads then y = 1. Obviously y ∼ Bernoulli(θ). We don't know if the coin is fair, so we treat E(Y) = θ as an unknown parameter. Suppose that Σ(i=1 to 10) yi = 6. We have

f(y; θ) = f(y1, ..., yn; θ) = Π(i=1 to n) f(yi; θ) = θ^y1 (1 − θ)^(1−y1) × ... × θ^yn (1 − θ)^(1−yn)
        = θ^(Σi yi) (1 − θ)^(10 − Σi yi) = θ⁶(1 − θ)⁴.

[Figure: the joint density θ⁶(1 − θ)⁴ plotted as a function of θ ∈ [0, 1]; it peaks at θ = 0.6, where its value is about 0.0012.]
100
To obtain the ML estimate of θ we proceed with:

d[θ⁶(1 − θ)⁴]/dθ = 0 ⇔ θ̂ = 6/10

and since

d²[θ⁶(1 − θ)⁴]/dθ² < 0 at θ̂,

θ̂ = 0.6 maximizes f(y; θ). θ̂ is the “most likely” value of θ, that is, the value that maximizes the probability of observing (y1, ..., y10). Notice that the ML estimator is ȳ.

Since log x, x > 0, is a strictly increasing function we have: θ̂ maximizes f(y; θ) iff θ̂ maximizes log f(y; θ), that is

θ̂ = argmax over θ of f(y, X; θ) ⇔ θ̂ = argmax over θ of log f(y, X; θ).

In most cases we prefer to solve max over θ of log f(y, X; θ) rather than max over θ of f(y, X; θ), since the log transformation greatly simplifies the likelihood (products become sums).
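The coin example can be replicated with a simple grid search over the log likelihood (a sketch; the analytic solution is of course θ̂ = ȳ = 0.6):

```python
import math

# Log-likelihood of 6 heads in 10 Bernoulli trials: 6*log(theta) + 4*log(1-theta)
def log_lik(theta: float) -> float:
    return 6 * math.log(theta) + 4 * math.log(1 - theta)

# Grid search over (0, 1); the maximizer should be the sample mean 0.6
grid = [i / 1000 for i in range(1, 1000)]
theta_hat = max(grid, key=log_lik)
print(theta_hat)                    # 0.6
```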
101
2.7.2 Conditional versus Unconditional Likelihood

The joint density f(y, X; ζ) is in general difficult to handle. Consider:

f(y, X; ζ) = f(y|X; θ) f(X; ψ),  ζ = (θ′, ψ′)′,
log f(y, X; ζ) = log f(y|X; θ) + log f(X; ψ).

In general we don't know f(X; ψ).

Example. Consider yi = β1xi1 + β2xi2 + εi where

εi|X ∼ N(0, σ²) ⇒ yi|X ∼ N(x′iβ, σ²),
X ∼ N(μx, σ²xI).

Thus,

θ = (β′, σ²)′,  ψ = (μ′x, σ²x)′,  ζ = (θ′, ψ′)′.

If there is no functional relationship between θ and ψ (such as a subset of ψ being a function of θ), then maximizing log f(y, X; ζ) with respect to ζ is achieved by separately maximizing f(y|X; θ) with respect to θ and maximizing f(X; ψ) with respect to ψ. Thus the ML estimate of θ also maximizes the conditional likelihood f(y|X; θ).
102
2.7.3 The Log Likelihood for the Regression Model

Assumption 1.5 (the normality assumption) together with Assumptions 1.2 and 1.4 imply that the distribution of ε conditional on X is N(0, σ²I). Thus,

ε|X ∼ N(0, σ²I) ⇒ y|X ∼ N(Xβ, σ²I) ⇒

f(y|X; θ) = (2πσ²)^(−n/2) exp(−(1/(2σ²))(y − Xβ)′(y − Xβ)) ⇒

log f(y|X; θ) = −(n/2) log(2πσ²) − (1/(2σ²))(y − Xβ)′(y − Xβ).

It can be proved that

log f(y|X; θ) = Σ(i=1 to n) log f(yi|xi) = −(n/2) log(2πσ²) − (1/(2σ²)) Σ(i=1 to n) (yi − x′iβ)².

Proposition (1.5 - ML Estimator of β and σ²). Suppose Assumptions 1.1-1.5 hold. Then,

ML estimator of β = (X′X)⁻¹X′y,
ML estimator of σ² = e′e/n ≠ s² = e′e/(n − K).
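A minimal simulated sketch contrasting the ML variance estimator e′e/n with the unbiased s² = e′e/(n − K) (the data-generating values below are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(3)
n, K = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, -0.5]) + rng.normal(size=n)

# Under normality, OLS coefficients are also the ML estimator of beta
b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b

sigma2_ml = e @ e / n               # ML estimator (biased downward)
s2 = e @ e / (n - K)                # unbiased estimator
print(sigma2_ml < s2)               # True: e'e/n is always the smaller of the two
```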
103
We know that E(s²) = σ². Therefore:

• E(e′e/n) ≠ σ²;

• lim as n → ∞ of E(e′e/n) = σ².

Proposition (1.6 - b is the Best Unbiased Estimator, BUE). Under Assumptions 1.1-1.5, the OLS estimator b of β is BUE in that any other unbiased (but not necessarily linear) estimator has larger conditional variance in the matrix sense.

This result should be distinguished from the Gauss-Markov Theorem that b is minimum variance among those estimators that are unbiased and linear in y. Proposition 1.6 says that b is minimum variance in a larger class of estimators that includes nonlinear unbiased estimators. This stronger statement is obtained under the normality assumption (Assumption 1.5), which is not assumed in the Gauss-Markov Theorem. Put differently, the Gauss-Markov Theorem does not exclude the possibility of some nonlinear estimator beating OLS, but this possibility is ruled out by the normality assumption.
104
Exercise 2.22. Suppose yi = x′iβ + εi where εi|X ∼ t(v). Assume that Assumptions 1.1-1.4 hold. Use your intuition to answer “true” or “false” to the following statements:

(a) b is the BLUE;

(b) b is the BUE;

(c) the BUE estimator can only be obtained numerically (i.e. there is no closed formula for the BUE estimator).

Just out of curiosity, notice that the log-likelihood function is

Σ(i=1 to n) log f(yi|xi) = −(n/2) log σ² − (n/2) log π − (n/2) log(v − 2) + n log[Γ((v+1)/2)/Γ(v/2)]
    − ((v + 1)/2) Σ(i=1 to n) log[1 + (yi − x′iβ)²/((v − 2)σ²)].
105
2.8 Generalized Least Squares (GLS)

We have assumed that

E(ε²i|X) = Var(εi|X) = σ² > 0, ∀i  (homoskedasticity);
E(εiεj|X) = 0, ∀i, j; i ≠ j  (no correlation between observations).

Matrix notation:

E(εε′|X) = [ E(ε²1|X)    E(ε1ε2|X)   · · ·   E(ε1εn|X)
             E(ε1ε2|X)   E(ε²2|X)    · · ·   E(ε2εn|X)
             ...         ...         . . .   ...
             E(ε1εn|X)   E(ε2εn|X)   · · ·   E(ε²n|X) ]

          = [ σ²   0    · · ·   0
              0    σ²   · · ·   0
              ...  ...  . . .   ...
              0    0    · · ·   σ² ] = σ²I.
106
The assumption E(εε′|X) = σ²I is violated if either

• E(ε²i|X) depends on X → heteroskedasticity, or

• E(εiεj|X) ≠ 0 → serial correlation (we will analyze this case later).

Let's assume now that

E(εε′|X) = σ²V  (V depends on X).

The model y = Xβ + ε based on Assumptions 1.1-1.3 and E(εε′|X) = σ²V is called the generalized regression model.

Notice that by definition, we always have:

E(εε′|X) = Var(ε|X) = Var(y|X).
107
Example (case where E(ε²i|X) depends on X). Consider the following model

yi = β1 + β2xi2 + εi

to explain household expenditure on food (y) as a function of household income. Typical behavior: low-income households do not have the option of extravagant food tastes: they have few choices and are almost forced to spend a particular portion of their income on food; high-income households could have simple food tastes or extravagant food tastes: income by itself is likely to be relatively less important as an explanatory variable.

[Figure: scatter plot of expenditure (y, 0 to 20) against income (x, 6 to 13); the dispersion of y increases with income.]
108
If e accurately reflects the behavior of ε, the information in the previous figure suggests that the variability of yi increases as income increases; thus it is reasonable to suppose that

Var(yi|xi2) is a function of xi2.

This is the same as saying that

E(ε²i|xi2) is a function of xi2.

For example, if E(ε²i|xi2) = σ²x²i2 then

E(εε′|X) = σ² [ x²12  0     · · ·  0
                0     x²22  · · ·  0
                ...   ...   . . .  ...
                0     0     · · ·  x²n2 ] = σ²V ≠ σ²I,

where V is the diagonal matrix above.
109
2.8.1 Consequences of Relaxing Assumption 1.4

1. The Gauss-Markov Theorem no longer holds for the OLS estimator. The BLUE is some other estimator.

2. The t-ratio is not distributed as the t distribution. Thus, the t-test is no longer valid. The same comments apply to the F-test. Note that Var(b|X) is no longer σ²(X′X)⁻¹. In effect,

Var(b|X) = Var((X′X)⁻¹X′y|X) = (X′X)⁻¹X′ Var(y|X) X(X′X)⁻¹ = σ²(X′X)⁻¹X′VX(X′X)⁻¹.

On the other hand,

E(s²|X) = E(e′e|X)/(n − K) = tr(Var(e|X))/(n − K) = σ² tr(MVM)/(n − K) = σ² tr(MV)/(n − K).

The conventional standard errors are incorrect when Var(y|X) ≠ σ²I. Confidence region and hypothesis test procedures based on the classical regression model are not valid.
110
3. However, the OLS estimator is still unbiased, because the unbiasedness result (Proposition 1.1 (a)) does not require Assumption 1.4. In effect,

E(b|X) = (X′X)⁻¹X′ E(y|X) = (X′X)⁻¹X′Xβ = β,  E(b) = β.

Options in the presence of E(εε′|X) ≠ σ²I:

• Use b to estimate β and Var(b|X) = σ²(X′X)⁻¹X′VX(X′X)⁻¹ for inference purposes. Note that y|X ∼ N(Xβ, σ²V) implies

b|X ∼ N(β, σ²(X′X)⁻¹X′VX(X′X)⁻¹).

This is not a good solution: if you know V you may use a more efficient estimator, as we will see below. Later on, in the chapter “Large Sample Theory” we will find that σ²V may be replaced by a consistent estimator.

• Search for a better estimator of β.
111
2.8.2 Efficient Estimation with Known V

If the value of the matrix function V is known, a BLUE estimator for β, called generalized least squares (GLS), can be deduced. The basic idea of the derivation is to transform the generalized regression model into a model that satisfies all the assumptions, including Assumption 1.4, of the classical regression model. Consider

y = Xβ + ε,  E(εε′|X) = σ²V.

We should multiply both sides of the equation by a nonsingular matrix C (depending on X):

Cy = CXβ + Cε
ỹ = X̃β + ε̃

such that the transformed error ε̃ verifies E(ε̃ε̃′|X) = σ²I, i.e.

E(ε̃ε̃′|X) = E(Cεε′C′|X) = C E(εε′|X) C′ = σ²CVC′ = σ²I,

that is, CVC′ = I.
112
Given CVC′ = I, how do we find C? Since V is by construction symmetric and positive definite, there exists a nonsingular n × n matrix C such that

V = C⁻¹(C′)⁻¹  or  V⁻¹ = C′C.

Note

CVC′ = CC⁻¹(C′)⁻¹C′ = I.

It is easy to see that if y = Xβ + ε satisfies Assumptions 1.1-1.3 and Assumption 1.5 (but not Assumption 1.4), then

ỹ = X̃β + ε̃,  where ỹ = Cy, X̃ = CX,

satisfies Assumptions 1.1-1.5. Let

β̂GLS = (X̃′X̃)⁻¹X̃′ỹ = (X′V⁻¹X)⁻¹X′V⁻¹y.
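A sketch checking that OLS on the transformed data (ỹ, X̃) equals the closed-form GLS expression, with an assumed known diagonal V (all numbers are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 25
X = np.column_stack([np.ones(n), rng.normal(size=n)])
v = np.exp(rng.normal(size=n))              # assumed known diagonal of V
V = np.diag(v)
y = X @ np.array([2.0, 1.0]) + np.sqrt(v) * rng.normal(size=n)

# Transform with C = V^{-1/2} (so that C'C = V^{-1}), then run OLS on (Cy, CX)
C = np.diag(1 / np.sqrt(v))
b_transformed = np.linalg.lstsq(C @ X, C @ y, rcond=None)[0]

# Closed form: beta_GLS = (X'V^{-1}X)^{-1} X'V^{-1}y
Vinv = np.linalg.inv(V)
b_gls = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)
print(np.allclose(b_transformed, b_gls))    # True
```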
113
Proposition (1.7 - finite-sample properties of GLS). (a) (unbiasedness) Under Assumptions 1.1-1.3,

E(β̂GLS|X) = β.

(b) (expression for the variance) Under Assumptions 1.1-1.3 and the assumption E(εε′|X) = σ²V that the conditional second moment is proportional to V,

Var(β̂GLS|X) = σ²(X′V⁻¹X)⁻¹.

(c) (the GLS estimator is BLUE) Under the same set of assumptions as in (b), the GLS estimator is efficient in that the conditional variance of any unbiased estimator that is linear in y is greater than or equal to Var(β̂GLS|X) in the matrix sense.

Remark: Var(b|X) − Var(β̂GLS|X) is a positive semidefinite matrix. In particular,

Var(bj|X) ≥ Var(β̂j,GLS|X).
114
2.8.3 A Special Case: Weighted Least Squares (WLS)

Let's suppose that

E(ε²i|X) = σ²vi  (vi is a function of X).

Recall: C is such that V⁻¹ = C′C. We have

V = diag(v1, v2, ..., vn) ⇒ V⁻¹ = diag(1/v1, 1/v2, ..., 1/vn) ⇒ C = diag(1/√v1, 1/√v2, ..., 1/√vn).
115
Now

ỹ = Cy = diag(1/√v1, ..., 1/√vn) (y1, y2, ..., yn)′ = (y1/√v1, y2/√v2, ..., yn/√vn)′,

X̃ = CX = diag(1/√v1, ..., 1/√vn)
[ 1  x12  ···  x1K
  1  x22  ···  x2K
  ··· 
  1  xn2  ···  xnK ]
=
[ 1/√v1  x12/√v1  ···  x1K/√v1
  1/√v2  x22/√v2  ···  x2K/√v2
  ···
  1/√vn  xn2/√vn  ···  xnK/√vn ].

Another way to express these relations:

ỹi = yi/√vi,  x̃ik = xik/√vi,  i = 1, 2, ..., n.
116
Example. Suppose that yi = β1 + β2xi2 + εi, with

Var(yi|xi2) = Var(εi|xi2) = σ2 exp(xi2),  Cov(yi, yj|xi2, xj2) = 0,

V = diag(exp(x12), ..., exp(xi2), ..., exp(xn2)).

Transformed model (matrix notation), Cy = CXβ + Cε:

(y1/√exp(x12), ..., yn/√exp(xn2))′ = X̃ (β1, β2)′ + (ε1/√exp(x12), ..., εn/√exp(xn2))′,

where the i-th row of X̃ is (1/√exp(xi2), xi2/√exp(xi2)). Or, in scalar notation, starting from yi = xi1β1 + xi2β2 + εi (xi1 = 1):

yi/√exp(xi2) = β1/√exp(xi2) + β2 xi2/√exp(xi2) + εi/√exp(xi2),  i = 1, ..., n.
117
Notice:

Var(ε̃i|X) = Var(εi/√exp(xi2) | xi2) = (1/exp(xi2)) Var(εi|xi2) = (1/exp(xi2)) σ2 exp(xi2) = σ2.
Efficient estimation under a known form of heteroskedasticity is called weighted regression (or weighted least squares, WLS).
Example. Consider wagei = β1 + β2educi + β3experi + εi.
[Figure: scatter plots of WAGE against EXPER and of WAGE against EDUC.]
118
Dependent Variable: WAGE. Method: Least Squares. Sample: 1 526.

Variable   Coefficient   Std. Error   t-Statistic   Prob.
C          −3.390540     0.766566     −4.423023     0.0000
EDUC        0.644272     0.053806     11.97397      0.0000
EXPER       0.070095     0.010978     6.385291      0.0000

R-squared 0.225162; Adjusted R-squared 0.222199; S.E. of regression 3.257044; Sum squared resid 5548.160; Log likelihood −1365.969; F-statistic 75.98998; Prob(F-statistic) 0.000000; Mean dependent var 5.896103; S.D. dependent var 3.693086; Akaike info criterion 5.205204; Schwarz criterion 5.229531; Hannan-Quinn criter. 5.214729; Durbin-Watson stat 1.820274.
[Figure: scatter plot of the squared residuals (RES2) against EDUC.]
Assume Var(εi|educi, experi) = σ2 educi2. Transformed model:

wagei/educi = β1 (1/educi) + β2 (educi/educi) + β3 (experi/educi) + εi/educi,  i = 1, ..., n.
119
Dependent Variable: WAGE/EDUC. Method: Least Squares. Sample: 1 526 IF EDUC>0.

Variable     Coefficient   Std. Error   t-Statistic   Prob.
1/EDUC       0.709212      0.549861     1.289800      0.1977
EDUC/EDUC    0.443472      0.038098     11.64033      0.0000
EXPER/EDUC   0.055355      0.009356     5.916236      0.0000

R-squared 0.105221; Adjusted R-squared 0.101786; S.E. of regression 0.251777; Sum squared resid 33.02718; Log likelihood −19.31365; Mean dependent var 0.469856; S.D. dependent var 0.265660; Akaike info criterion 0.085167; Schwarz criterion 0.109564; Hannan-Quinn criter. 0.094721; Durbin-Watson stat 1.777416.
Exercise 2.23. Let {yi, i = 1, 2, ...} be a sequence of independent random variables with distribution N(β, σi2), where σi2 is known (note: we assume σ12 ≠ σ22 ≠ ...). When the variances are unequal, the sample mean ȳ is not the best linear unbiased estimator (BLUE). The BLUE has the form β̂ = Σi=1..n wi yi, where the wi are nonrandom weights. (a) Find a condition on the wi such that E(β̂) = β; (b) Find the optimal weights wi that make β̂ the BLUE. Hint: You may translate this problem into an econometric framework: if {yi} is a sequence of independent random variables with distribution N(β, σi2), then yi can be represented by the equation yi = β + εi, where εi ∼ N(0, σi2). Then find the GLS estimator of β.
120
Exercise 2.24. Consider

yi = βxi1 + εi,  β > 0,

and assume E(εi|X) = 0, Var(εi|X) = 1 + |xi1|, Cov(εi, εj|X) = 0. (a) Suppose we have many observations and plot yi against xi1. What would the scatter plot look like? (b) Propose an unbiased estimator with minimum variance; (c) Suppose we have the 3 following observations of (xi1, yi): (0, 0), (3, 1) and (8, 5). Estimate the value of β from these 3 observations.
Exercise 2.25. Consider

yt = β1 + β2 t + εt,  Var(εt) = σ2 t2,  t = 1, ..., 20.

Find σ2(X′X)−1, Var(b|X) and Var(β̂GLS|X) and comment on the results. Solution:

σ2(X′X)−1 = σ2 [ 0.215 −0.01578; −0.01578 0.0015 ],
Var(b|X) = σ2 [ 13.293 −1.6326; −1.6326 0.25548 ],
Var(β̂GLS|X) = σ2 [ 1.0537 −0.1895; −0.1895 0.0840 ].
121
Exercise 2.26. A researcher first ran an OLS regression. Then she was given the true V matrix. She transformed the data appropriately and obtained the GLS estimator. For several coefficients, standard errors in the second regression were larger than those in the first regression. Does this contradict Proposition 1.7? See the previous exercise.
2.8.4 Limiting Nature of GLS
• Finite-sample properties of GLS rest on the assumption that the regressors are strictly exogenous. In time-series models the regressors are not strictly exogenous and the error is serially correlated.
• In practice, the matrix function V is unknown.
• V can be estimated from the sample. This approach is called Feasible Generalized Least Squares (FGLS). But if the function V is estimated from the sample, its value V̂ becomes a random variable, which affects the distribution of the GLS estimator. Very little is known about the finite-sample properties of the FGLS estimator. We need to use the large-sample properties ...
122
3 Large-Sample Theory
The finite-sample theory breaks down if one of the following three assumptions is violated:
1. the exogeneity of regressors,
2. the normality of the error term, and
3. the linearity of the regression equation.
This chapter develops an alternative approach based on large-sample theory (n is "sufficiently large").
123
3.1 Review of Limit Theorems for Sequences of Random Variables
3.1.1 Convergence in Probability, in Mean Square, and in Distribution
Convergence in Probability
A sequence of random scalars {zn} converges in probability to a constant (non-random) α if, for any ε > 0,

limn→∞ P(|zn − α| > ε) = 0.

We write

zn →p α  or  plim zn = α.

As we will see, zn is usually a sample mean:

zn = (Σi=1..n yi)/n  or  zn = (Σi=1..n zi)/n.
124
Example. Consider a fair coin. Let zi = 1 if the i-th toss results in heads and zi = 0 otherwise. Let z̄n = n−1 Σi=1..n zi. The following graph suggests that z̄n →p 1/2.
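A quick Monte-Carlo sketch of this convergence (the sample size and seed are arbitrary choices, not from the slides):

```python
import numpy as np

# Running sample mean of fair-coin tosses: z_bar_n should settle near 1/2.
rng = np.random.default_rng(1)
n = 100_000
z = rng.integers(0, 2, size=n)               # z_i = 1 for heads, 0 for tails
z_bar = z.cumsum() / np.arange(1, n + 1)     # running mean for each n

assert abs(z_bar[-1] - 0.5) < 0.01           # close to 1/2 for large n
```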
125
A sequence of K-dimensional vectors {zn} converges in probability to a K-dimensional vector of constants α if, for any ε > 0,

limn→∞ P(|znk − αk| > ε) = 0, ∀k.

We write zn →p α.

Convergence in Mean Square

A sequence of random scalars {zn} converges in mean square (or in quadratic mean) to α if

limn→∞ E[(zn − α)2] = 0.

The extension to random vectors is analogous to that for convergence in probability.
126
Convergence in Distribution
Let {zn} be a sequence of random scalars and Fn be the cumulative distribution function(c.d.f.) of zn, i.e. zn ∼ Fn. We say that {zn} converges in distribution to a random scalarz if the c.d.f. Fn, of zn , converges to the c.d.f. F of z at every continuity point of F . Wewrite
znd−→ z, where z ∼ F,
F is is the asymptotic (or limiting) distribution of z. If F is well-known, for example, if Fis the cumulative normal N (0, 1) distribution we prefer to write
znd−→ N (0, 1) (instead of zn
d−→ z and z ∼ N (0, 1)).
Example. Consider zn ∼ t(n). We know that znd−→ N (0, 1) .
In most applications zn is of type
zn =√n (y − E (yi)) .
Exercise 3.1. For zn =√n (y − E (yi)) calculate E (zn) and Var (zn) (assume E (yi) = µ,
Var (yi) = σ2 and {yi} is an i.i.d. sequence).
127
3.1.2 Useful Results
Lemma (2.3 - preservation of convergence for continuous transformations). Suppose f is a vector-valued continuous function that does not depend on n. Then:

(a) if zn →p α, then f(zn) →p f(α);
(b) if zn →d z, then f(zn) →d f(z).

An immediate implication of Lemma 2.3(a) is that the usual arithmetic operations preserve convergence in probability:

xn →p β, yn →p γ ⇒ xn + yn →p β + γ;
xn →p β, yn →p γ ⇒ xn yn →p βγ;
xn →p β, yn →p γ ⇒ xn/yn →p β/γ (γ ≠ 0);
Yn →p Γ ⇒ Yn−1 →p Γ−1 (Γ invertible).
128
Lemma (2.4). We have:

(a) xn →d x, yn →p α ⇒ xn + yn →d x + α;
(b) xn →d x, yn →p 0 ⇒ yn′ xn →p 0;
(c) xn →d x, An →p A ⇒ An xn →d Ax. In particular, if x ∼ N(0, Σ), then An xn →d N(0, AΣA′);
(d) xn →d x, An →p A ⇒ xn′ An−1 xn →d x′A−1x (A nonsingular).

If xn →p 0 we write xn = op(1).
If xn − yn →p 0 we write xn = yn + op(1).
In part (c) we may write An xn =d A xn (An xn and A xn have the same asymptotic distribution).
129
3.1.3 Viewing Estimators as Sequences of Random Variables
Let θ̂n be an estimator of a parameter vector θ based on a sample of size n. We say that the estimator θ̂n is consistent for θ if

θ̂n →p θ.

The asymptotic bias of θ̂n is defined as plimn→∞ θ̂n − θ. So if the estimator is consistent, its asymptotic bias is zero.
Wooldridge’s quotation:
While not all useful estimators are unbiased, virtually all economists agree that consistency is a minimal requirement for an estimator. The famous econometrician Clive W.J. Granger once remarked: "If you can't get it right as n goes to infinity, you shouldn't be in this business." The implication is that, if your estimator of a particular population parameter is not consistent, then you are wasting your time.
130
A consistent estimator θ̂n is asymptotically normal if

√n (θ̂n − θ) →d N(0, Σ).

Such an estimator is called √n-consistent. The variance matrix Σ is called the asymptotic variance and is denoted Avar(θ̂n), i.e.

limn→∞ Var(√n (θ̂n − θ)) = Avar(θ̂n) = Σ.

Some authors use the notation Avar(θ̂n) to mean Σ/n (which is zero in the limit).
131
3.1.4 Laws of Large Numbers and Central Limit Theorems
Consider

z̄n = (1/n) Σi=1..n zi.

We say that z̄n obeys the LLN if z̄n →p µ, where µ = E(zi) or limn E(z̄n) = µ.

• (A version of Chebychev's weak LLN) If lim E(z̄n) = µ and lim Var(z̄n) = 0, then z̄n →p µ.

• (Kolmogorov's second strong LLN) If {zi} is i.i.d. with E(zi) = µ, then z̄n →p µ.

These LLNs extend readily to random vectors by requiring element-by-element convergence.
132
Theorem 1 (Lindeberg-Levy CLT). Let {zi} be i.i.d. with E(zi) = µ and Var(zi) = Σ. Then

√n (z̄n − µ) = (1/√n) Σi=1..n (zi − µ) →d N(0, Σ).

Notice that

E(√n (z̄n − µ)) = 0 ⇒ E(z̄n) = µ,
Var(√n (z̄n − µ)) = Σ ⇒ Var(z̄n) = Σ/n.

Given the previous equations, some authors write

z̄n ∼a N(µ, Σ/n).
133
Example. Let {zi} be i.i.d. with distribution χ2(1). By the Lindeberg-Levy CLT (scalar case) we have

z̄n = (1/n) Σi=1..n zi ∼a N(µ, σ2/n),

where

E(z̄n) = (1/n) Σi=1..n E(zi) = E(zi) = µ = 1;
Var(z̄n) = Var((1/n) Σi=1..n zi) = (1/n) Var(zi) = σ2/n = 2/n.
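A Monte-Carlo sketch of this example (the number of replications R and the sample size n are arbitrary choices, not from the slides):

```python
import numpy as np

# sqrt(n)(z_bar_n - 1) for chi-square(1) samples should be approximately N(0, 2).
rng = np.random.default_rng(2)
n, R = 500, 20_000
z = rng.chisquare(df=1, size=(R, n))
stat = np.sqrt(n) * (z.mean(axis=1) - 1.0)   # sqrt(n)(z_bar_n - mu), mu = 1

assert abs(stat.mean()) < 0.05               # centred near 0
assert abs(stat.var() - 2.0) < 0.1           # variance near sigma^2 = 2
```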
134
[Figures: probability density function of z̄n (obtained by Monte-Carlo simulation), and probability density function of √n (z̄n − µ) (exact expressions for n = 5, 10 and 50).]
135
Example. In random sampling with sample size n = 30 on a variable z with E(z) = 10, Var(z) = 9 but unknown distribution, obtain an approximation to P(z̄n < 9.5). We do not know the exact distribution of z̄n. However, from the Lindeberg-Levy CLT we have

√n (z̄n − µ)/σ →d N(0, 1)  or  z̄n ∼a N(µ, σ2/n).

Hence,

P(z̄n < 9.5) = P(√n (z̄n − µ)/σ < √30 (9.5 − 10)/3) ≈ Φ(−0.9128) = 0.1807,

where Φ is the c.d.f. of N(0, 1).
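The approximation above can be checked numerically, writing the standard normal c.d.f. via the error function, Φ(x) = (1 + erf(x/√2))/2:

```python
import math

# CLT approximation P(z_bar_n < 9.5) ≈ Φ(sqrt(n)(9.5 - mu)/sigma)
n, mu, sigma = 30, 10.0, 3.0
x = math.sqrt(n) * (9.5 - mu) / sigma            # ≈ -0.913
prob = 0.5 * (1.0 + math.erf(x / math.sqrt(2)))  # Φ(x)

assert abs(prob - 0.1807) < 1e-3
```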
136
3.2 Fundamental Concepts in Time-Series Analysis
Stochastic process (SP): a sequence of random variables. For this reason, it is more adequate to write an SP as {zi} (meaning a sequence of random variables) rather than zi (meaning the random variable at time i).
137
3.2.1 Various Classes of Stochastic processes
Definition (Stationary Processes). An SP {zi} is (strictly) stationary if the joint distribution of (z1, z2, ..., zs) equals that of (zk+1, zk+2, ..., zk+s) for any s ∈ N and k ∈ Z.

Exercise 3.2. Consider an SP {zi} where E(|g(zi)|) < ∞. Show that if {zi} is a strictly stationary process then E(g(zi)) is constant and does not depend on i.

The definition implies that any transformation (function) of a stationary process is itself stationary; that is, if {zi} is stationary, then so is {g(zi)}. For example, if {zi} is stationary then {zi zi′} is also stationary.

Definition (Covariance Stationary Processes). A stochastic process {zi} is weakly (or covariance) stationary if: (i) E(zi) does not depend on i, and (ii) Cov(zi, zi−j) exists, is finite, and depends only on j but not on i.

If {zi} is a covariance stationary process then Cov(z1, z5) = Cov(z1001, z1005).

A transformation (function) of a covariance stationary process may or may not be a covariance stationary process.
138
Example. It can be proved that {zi}, zi = √(α0 + α1 zi−12) εi, where {εi} is i.i.d. with mean zero and unit variance, α0 > 0 and √(1/3) ≤ α1 < 1, is a covariance stationary process. However, wi = zi2 is not a covariance stationary process, as E(wi2) does not exist.

Exercise 3.3. Consider the SP {ut} where

ut = ξt if t ≤ 2000,  ut = √((k−2)/k) ζt if t > 2000,

where ξt and ζs are independent for all t and s, ξt ∼ i.i.d. N(0, 1) and ζs ∼ i.i.d. t(k). Explain why {ut} is weakly (or covariance) stationary but not strictly stationary.

Definition (White Noise Processes). A white noise process {zi} is a covariance stationary process with zero mean and no serial correlation:

E(zi) = 0,  Cov(zi, zj) = 0 for i ≠ j.
139
[Figure: sample paths of four simulated processes (series Y and Y5).]
140
In the literature there is not a unique definition of ergodicity. We prefer to call "weakly dependent process" what Hayashi calls an "ergodic process".
Definition. A stationary process {zi} is said to be a weakly dependent process (= ergodic in Hayashi's definition) if, for any two bounded functions f: Rk+1 → R and g: Rs+1 → R,

limn→∞ |E[f(zi, ..., zi+k) g(zi+n, ..., zi+n+s)]| = limn→∞ |E(f(zi, ..., zi+k))| × |E(g(zi+n, ..., zi+n+s))|.
Theorem 2 (S&WD). Let {zi} be a stationary and weakly dependent (S&WD) process with E(zi) = µ. Then z̄n →p µ.

Serial dependence, which is ruled out by the i.i.d. assumption in Kolmogorov's LLN, is allowed in this theorem, provided that it disappears in the long run. Since, for any function f, {f(zi)} is S&WD whenever {zi} is, this theorem implies that any moment of an S&WD process (if it exists and is finite) is consistently estimated by the sample moment. For example, suppose {zi} is an S&WD process and E(zi zi′) exists and is finite. Then

(1/n) Σi=1..n zi zi′ →p E(zi zi′).
141
Definition (Martingale). A vector process {zi} is called a martingale if

E(zi | zi−1, ..., z1) = zi−1 for i ≥ 2.

The process

zi = zi−1 + εi,

where {εi} is a white noise process with E(εi | zi−1) = 0, is a martingale, since

E(zi | zi−1, ..., z1) = E(zi | zi−1) = zi−1 + E(εi | zi−1) = zi−1.

Definition (Martingale Difference Sequence). A vector process {gi} with E(gi) = 0 is called a martingale difference sequence (MDS) or martingale differences if

E(gi | gi−1, ..., g1) = 0.

If {zi} is a martingale, the process defined as ∆zi = zi − zi−1 is an MDS.

Proposition. If {gi} is an MDS then Cov(gi, gi−j) = 0, j ≠ 0.
142
By definition,

Var(ḡn) = (1/n2) Var(Σt=1..n gt) = (1/n2) [ Σt=1..n Var(gt) + 2 Σj=1..n−1 Σi=j+1..n Cov(gi, gi−j) ].

However, if {gi} is a stationary MDS with finite second moment, then

Σt=1..n Var(gt) = n Var(gt),  Cov(gi, gi−j) = 0,

so

Var(ḡn) = (1/n) Var(gt).

Definition (Random Walk). Let {gi} be a vector independent white noise process. A random walk {zi} is a sequence of cumulative sums:

zi = gi + gi−1 + ... + g1.

Exercise 3.4. Show that the random walk can be written as

zi = zi−1 + gi, z1 = g1.
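A minimal numerical sketch of the two equivalent representations of the random walk (Gaussian innovations are an assumption for the demo):

```python
import numpy as np

# Random walk as the cumulative sum of an independent white noise process.
rng = np.random.default_rng(3)
g = rng.normal(size=1000)                 # innovations g_1, ..., g_n
z = np.cumsum(g)                          # z_i = g_i + g_{i-1} + ... + g_1

# Equivalent recursion z_i = z_{i-1} + g_i with z_1 = g_1
z_rec = np.empty_like(g)
z_rec[0] = g[0]
for i in range(1, len(g)):
    z_rec[i] = z_rec[i - 1] + g[i]

assert np.allclose(z, z_rec)              # the two representations coincide
```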
143
3.2.2 Different Formulation of Lack of Serial Dependence
We have three formulations of a lack of serial dependence for zero-mean covariance stationary processes:

(1) {gi} is independent white noise;
(2) {gi} is a stationary MDS with finite variance;
(3) {gi} is white noise.

(1) ⇒ (2) ⇒ (3).
Exercise 3.5 (Process that satisfies (2) but not (1) - the ARCH process). Consider gi = √(α0 + α1 gi−12) εi, where {εi} is i.i.d. with mean zero and unit variance, α0 > 0 and |α1| < 1. Show that {gi} is an MDS but not an independent white noise process.
144
3.2.3 The CLT for S&WD Martingale Difference Sequences
Theorem 3 (Stationary Martingale Differences CLT (Billingsley, 1961)). Let {gi} be a vector martingale difference sequence that is an S&WD process with E(gi gi′) = Σ, and let ḡn = (1/n) Σ gi. Then

√n ḡn = (1/√n) Σi=1..n gi →d N(0, Σ).

Theorem 4 (Martingale Differences CLT (White, 1984)). Let {gi} be a vector martingale difference sequence. Suppose that (a) E(gi gi′) = Σi is a positive definite matrix with (1/n) Σi=1..n Σi → Σ (a positive definite matrix), (b) gi has finite 4th moments, and (c) (1/n) Σ gi gi′ →p Σ. Then

√n ḡn = (1/√n) Σi=1..n gi →d N(0, Σ).
145
3.3 Large-Sample Distribution of the OLS Estimator
The model presented in this section has probably the widest range of economic applications:
• No specific distributional assumption (such as the normality of the error term) is required;
• The requirement in finite-sample theory that the regressors be strictly exogenous or fixed is replaced by a much weaker requirement that they be "predetermined."
Assumption (2.1 - linearity). yi = xi′β + εi.

Assumption (2.2 - S&WD). {(yi, xi)} is jointly S&WD.

Assumption (2.3 - predetermined regressors). All the regressors are predetermined in the sense that they are orthogonal to the contemporaneous error term: E(xik εi) = 0, ∀i, k. This can be written as

E(xi εi) = 0 or E(gi) = 0, where gi = xi εi.

Assumption (2.4 - rank condition). E(xi xi′) = Σxx is nonsingular.
146
Assumption (2.5 - {gi} is a martingale difference sequence with finite second moments). {gi}, where gi = xi εi, is a martingale difference sequence (so, a fortiori, E(gi) = 0). The K × K matrix of cross moments E(gi gi′) is nonsingular. We use S for Avar(ḡ) (the variance of √n ḡ, where ḡ = (1/n) Σ gi). By Assumption 2.2 and the S&WD Martingale Differences CLT, S = E(gi gi′).

Remarks:

1. (S&WD) A special case of S&WD is that {(yi, xi)} is i.i.d. (random sampling in cross-sectional data).

2. (The model accommodates conditional heteroskedasticity) If {(yi, xi)} is stationary, then the error term εi = yi − xi′β is also stationary. The conditional moment E(εi2|xi) can depend on xi without violating any previous assumption, as long as E(εi2) is constant.
147
3. (E(xi εi) = 0 vs. E(εi|xi) = 0) The condition E(εi|xi) = 0 is stronger than E(xi εi) = 0. In effect,

E(xi εi) = E(E(xi εi|xi)) = E(xi E(εi|xi)) = E(xi · 0) = 0.

4. (Predetermined vs. strictly exogenous regressors) Assumption 2.3 restricts only the contemporaneous relationship between the error term and the regressors. The strict exogeneity assumption (Assumption 1.2) implies that, for any regressor k, E(xjk εi) = 0 for all i and j, not just for i = j. Strict exogeneity is a strong assumption that does not hold in general for time-series models.
148
5. (Rank condition as no multicollinearity in the limit) Since

b = (X′X/n)−1 (X′y/n) = ((1/n) Σ xi xi′)−1 (1/n) Σ xi yi = Sxx−1 Sxy,

where

Sxx = X′X/n = (1/n) Σ xi xi′ (sample average of xi xi′),
Sxy = X′y/n = (1/n) Σ xi yi (sample average of xi yi).

By Assumptions 2.2 and 2.4 and the S&WD theorem, we have

X′X/n = (1/n) Σi=1..n xi xi′ →p E(xi xi′).

Assumption 2.4 guarantees that the limit in probability of X′X/n has rank K.
149
6. (A sufficient condition for {gi} to be an MDS) Since an MDS is zero-mean by definition, Assumption 2.5 is stronger than Assumption 2.3 (the latter is redundant in face of Assumption 2.5). We will need Assumption 2.5 to prove the asymptotic normality of the OLS estimator. A sufficient condition for {gi} to be an MDS is

E(εi|Fi) = 0, where
Fi = Ii−1 ∪ xi = {εi−1, εi−2, ..., ε1, xi, xi−1, ..., x1},
Ii−1 = {εi−1, εi−2, ..., ε1, xi−1, ..., x1}.

(This condition implies that the error term is serially uncorrelated and also is uncorrelated with the current and past regressors.) Proof. Notice: {gi} is an MDS if

E(gi | gi−1, ..., g1) = 0, gi = xi εi.

Now, using the condition E(εi|Fi) = 0,

E(xi εi | gi−1, ..., g1) = E[E(xi εi|Fi) | gi−1, ..., g1] = E[0 | gi−1, ..., g1] = 0;

thus E(εi|Fi) = 0 ⇒ {gi} is an MDS.
150
7. (When the regressors include a constant) Assumption 2.5 implies

E(xi εi | gi−1, ..., g1) = E[(1, ..., xiK)′ εi | gi−1, ..., g1] = 0 ⇒ E(εi | gi−1, ..., g1) = 0,
E(εi | εi−1, ..., ε1) = E(E(εi | gi−1, ..., g1) | εi−1, ..., ε1) = 0.

Assumption 2.5 thus implies that the error term itself is an MDS and hence is serially uncorrelated.
8. (S is a matrix of fourth moments)

S = E(gi gi′) = E(xi εi xi′ εi) = E(εi2 xi xi′).

Consistent estimation of S will require an additional assumption.
151
9. (S takes a different expression without Assumption 2.5) In general,

Avar(ḡ) = lim Var(√n ḡ) = lim Var((1/√n) Σi=1..n gi) = lim (1/n) Var(Σi=1..n gi)
= lim (1/n) [ Σi=1..n Var(gi) + Σj=1..n−1 Σi=j+1..n (Cov(gi, gi−j) + Cov(gi−j, gi)) ]
= lim (1/n) Σi=1..n Var(gi) + lim (1/n) Σj=1..n−1 Σi=j+1..n (E(gi gi−j′) + E(gi−j gi′)).

Given stationarity, we have

(1/n) Σi=1..n Var(gi) = Var(gi).

Thanks to Assumption 2.5 we have E(gi gi−j′) = E(gi−j gi′) = 0, so

S = Avar(ḡ) = Var(gi) = E(gi gi′).
152
Proposition (2.1 - asymptotic distribution of the OLS estimator). (a) (Consistency of b for β) Under Assumptions 2.1-2.4,

b →p β.

(b) (Asymptotic normality of b) If Assumption 2.3 is strengthened to Assumption 2.5, then

√n (b − β) →d N(0, Avar(b)),

where

Avar(b) = Σxx−1 S Σxx−1.

(c) (Consistent estimation of Avar(b)) Suppose there is available a consistent estimator Ŝ of S. Then, under Assumption 2.2, Avar(b) is consistently estimated by

Âvar(b) = Sxx−1 Ŝ Sxx−1,

where

Sxx = X′X/n = (1/n) Σi=1..n xi xi′.
153
Proposition (2.2 - consistent estimation of error variance). Under Assumptions 2.1-2.4,

s2 = (1/(n−K)) Σi=1..n ei2 →p E(εi2),

provided E(εi2) exists and is finite.

Under conditional homoskedasticity E(εi2|xi) = σ2 (we will see this in detail later) we have

S = E(gi gi′) = E(εi2 xi xi′) = ... = σ2 E(xi xi′) = σ2 Σxx

and

Avar(b) = Σxx−1 S Σxx−1 = Σxx−1 σ2 Σxx Σxx−1 = σ2 Σxx−1,
Âvar(b) = s2 (X′X/n)−1 = s2 n (X′X)−1.

Thus

b ∼a N(β, Âvar(b)/n) = N(β, s2 (X′X)−1).
154
3.4 Statistical Inference
Derivation of the distribution of test statistics is easier than in finite-sample theory because we are only concerned about the large-sample approximation to the exact distribution.
Proposition (2.3 - robust t-ratio and Wald statistic). Suppose Assumptions 2.1-2.5 hold, and suppose there is available a consistent estimator Ŝ of S. As before, let Âvar(b) = Sxx−1 Ŝ Sxx−1. Then:

(a) Under the null hypothesis H0: βk = βk0,

tk0 = (bk − βk0)/σ̂bk →d N(0, 1), where σ̂bk2 = Âvar(bk)/n = (Sxx−1 Ŝ Sxx−1)kk / n.

(b) Under the null hypothesis H0: Rβ = r, with rank(R) = p,

W = n (Rb − r)′ (R Âvar(b) R′)−1 (Rb − r) →d χ2(p).
155
Remarks
• σ̂bk is called the heteroskedasticity-consistent standard error, (heteroskedasticity-)robust standard error, or White's standard error. The reason for this terminology is that the error term can be conditionally heteroskedastic. The t-ratio is called the robust t-ratio.

• The differences from the finite-sample t-test are: (1) the way the standard error is calculated is different; (2) we use the table of N(0, 1) rather than that of t(n−K); and (3) the actual size or exact size of the test (the probability of a Type I error given the sample size) equals the nominal size (i.e., the desired significance level α) only approximately, although the approximation becomes arbitrarily good as the sample size increases. The difference between the exact size and the nominal size of a test is called the size distortion.

• Both tests are consistent in the sense that

power = P(rejecting the null H0 | H1 is true) → 1 as n → ∞.
156
3.5 Estimating S = E(εi2 xi xi′) Consistently

How do we select an estimator for a population parameter? One of the most important methods is the analog estimation method, or the method of moments. The method-of-moments principle: to estimate a feature of the population, use the corresponding feature of the sample.

Examples of analog estimators:

Parameter of the population     Estimator
E(yi)                           ȳ
Var(yi)                         Sy2
σxy/σx2                         Sxy/Sx2
P(yi ≤ c)                       (Σi=1..n I{yi ≤ c})/n
median(yi)                      sample median
max(yi)                         maxi=1,...,n yi
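A minimal sketch of the analogy principle with simulated data (the normal population and its parameters are assumptions for the demo):

```python
import numpy as np

# Each population feature is estimated by the matching sample feature.
rng = np.random.default_rng(7)
y = rng.normal(loc=10.0, scale=3.0, size=5000)   # population: N(10, 9)

mean_hat = y.mean()                # analog estimator of E(y_i)
var_hat = y.var(ddof=1)            # analog estimator of Var(y_i)
prob_hat = np.mean(y <= 9.5)       # analog estimator of P(y_i <= 9.5)
median_hat = np.median(y)          # analog estimator of the population median

assert abs(mean_hat - 10.0) < 0.2
assert abs(var_hat - 9.0) < 0.5
assert abs(median_hat - 10.0) < 0.2
```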
157
The analogy principle suggests that E(εi2 xi xi′) can be estimated using the estimator

(1/n) Σi=1..n εi2 xi xi′.

Since εi is not observable, we need another estimator:

Ŝ = (1/n) Σi=1..n ei2 xi xi′.

Assumption (2.6 - finite fourth moments for regressors). E((xik xij)2) exists and is finite for all k and j (k, j = 1, ..., K).

Proposition (2.4 - consistent estimation of S). Suppose S = E(εi2 xi xi′) exists and is finite. Then, under Assumptions 2.1-2.4 and 2.6, Ŝ is consistent for S.
158
The estimator Ŝ can be represented as

Ŝ = (1/n) Σi=1..n ei2 xi xi′ = X′BX/n, where B = diag(e12, e22, ..., en2).

Thus, Âvar(b) = Sxx−1 Ŝ Sxx−1 = n (X′X)−1 X′BX (X′X)−1. We have:

• b ∼a N(β, Âvar(b)/n) = N(β, Sxx−1 Ŝ Sxx−1 / n) = N(β, (X′X)−1 X′BX (X′X)−1);

• W = n (Rb − r)′ (R Âvar(b) R′)−1 (Rb − r)
     = n (Rb − r)′ (R Sxx−1 Ŝ Sxx−1 R′)−1 (Rb − r)
     = (Rb − r)′ (R (X′X)−1 X′BX (X′X)−1 R′)−1 (Rb − r) →d χ2(p).
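A sketch of the sandwich formula with simulated heteroskedastic data (the single-regressor design below is an assumption for the demo, not the slides' wage data):

```python
import numpy as np

# White robust variance: Avar_hat(b)/n = (X'X)^-1 X'BX (X'X)^-1, B = diag(e_i^2).
rng = np.random.default_rng(4)
n = 400
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(size=n) * (1.0 + np.abs(x))  # heteroskedastic errors

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y                    # OLS estimate
e = y - X @ b                            # OLS residuals
meat = X.T @ (e[:, None] ** 2 * X)       # X'BX without forming B explicitly
V_robust = XtX_inv @ meat @ XtX_inv      # estimated Var(b|X)
se_robust = np.sqrt(np.diag(V_robust))   # White standard errors

assert np.all(se_robust > 0)
assert abs(b[1] - 2.0) < 0.5             # slope close to the true value
```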
159
Dependent Variable: WAGE. Method: Least Squares. Sample: 1 526.

Variable   Coefficient   Std. Error   t-Statistic   Prob.
C          −1.567939     0.724551     −2.164014     0.0309
FEMALE     −1.810852     0.264825     −6.837915     0.0000
EDUC        0.571505     0.049337     11.58362      0.0000
EXPER       0.025396     0.011569     2.195083      0.0286
TENURE      0.141005     0.021162     6.663225      0.0000

R-squared 0.363541; Adjusted R-squared 0.358655; S.E. of regression 2.957572; Sum squared resid 4557.308; Log likelihood −1314.228; F-statistic 74.39801; Prob(F-statistic) 0.000000; Mean dependent var 5.896103; S.D. dependent var 3.693086; Akaike info criterion 5.016075; Schwarz criterion 5.056619; Hannan-Quinn criter. 5.031950; Durbin-Watson stat 1.794400.

Dependent Variable: WAGE. Method: Least Squares. Sample: 1 526. White Heteroskedasticity-Consistent Standard Errors & Covariance.

Variable   Coefficient   Std. Error   t-Statistic   Prob.
C          −1.567939     0.825934     −1.898382     0.0582
FEMALE     −1.810852     0.254156     −7.124963     0.0000
EDUC        0.571505     0.061217     9.335686      0.0000
EXPER       0.025396     0.009806     2.589912      0.0099
TENURE      0.141005     0.027955     5.044007      0.0000

R-squared 0.363541; Adjusted R-squared 0.358655; S.E. of regression 2.957572; Sum squared resid 4557.308; Log likelihood −1314.228; F-statistic 74.39801; Prob(F-statistic) 0.000000; Durbin-Watson stat 1.794400.
160
3.6 Implications of Conditional Homoskedasticity
Assumption (2.7 - conditional homoskedasticity). E(εi2|xi) = σ2 > 0.

Under Assumption 2.7 we have

S = E(εi2 xi xi′) = ... = σ2 E(xi xi′) = σ2 Σxx and
Avar(b) = Σxx−1 S Σxx−1 = σ2 Σxx−1 Σxx Σxx−1 = σ2 Σxx−1.
Proposition (2.5 - large-sample properties of b, t, and F under conditional homoskedasticity). Suppose Assumptions 2.1-2.5 and 2.7 are satisfied. Then:

(a) (Asymptotic distribution of b) The OLS estimator b is consistent and asymptotically normal with

Avar(b) = σ2 Σxx−1.

(b) (Consistent estimation of the asymptotic variance) Under the same set of assumptions, Avar(b) is consistently estimated by

Âvar(b) = s2 Sxx−1 = n s2 (X′X)−1.
161
(c) (Asymptotic distribution of the t and F statistics of the finite-sample theory)

Under H0: βk = βk0 we have

tk0 = (bk − βk0)/σ̂bk →d N(0, 1), where σ̂bk2 = Âvar(bk)/n = s2 ((X′X)−1)kk.

Under H0: Rβ = r with rank(R) = p, we have

pF0 →d χ2(p),

where F0 = (Rb − r)′ (R(X′X)−1R′)−1 (Rb − r)/(p s2). Notice

pF0 = (e*′e* − e′e)/(e′e/(n−K)) →d χ2(p),

where * refers to the short regression, i.e. the regression subject to the constraint Rβ = r.

Remark (No need for a fourth-moment assumption). By S&WD and Assumptions 2.1-2.4, s2 Sxx →p σ2 Σxx = S. We do not need the fourth-moment assumption (Assumption 2.6) for consistency.
162
3.7 Testing Conditional Homoskedasticity
With the advent of robust standard errors, allowing us to do inference without specifying the conditional second moment, testing conditional homoskedasticity is not as important as it used to be. This section presents only the most popular test, due to White (1980), for the case of random samples.

Let ψi be a vector collecting the unique and nonconstant elements of the K × K symmetric matrix xi xi′.
Proposition (2.6 - White's Test for Conditional Heteroskedasticity). In addition to Assumptions 2.1 and 2.4, suppose that (a) {(yi, xi)} is i.i.d. with finite E(εi2 xi xi′) (thus strengthening Assumptions 2.2 and 2.5), (b) εi is independent of xi (thus strengthening Assumption 2.3 and conditional homoskedasticity), and (c) a certain condition holds on the moments of εi and xi. Then, under H0: E(εi2|xi) = σ2 (constant), we have

nR2 →d χ2(m),

where R2 is the R2 from the auxiliary regression of ei2 on a constant and ψi, and m is the dimension of ψi.
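A sketch of the nR2 test on simulated data (the single-regressor design is an assumption for the demo, so here ψi reduces to (xi, xi2)):

```python
import numpy as np

# White's test: regress squared OLS residuals on a constant and psi_i,
# then compare nR^2 with a chi-square critical value.
rng = np.random.default_rng(5)
n = 500
x = rng.normal(size=n)
y = 1.0 + x + rng.normal(size=n) * np.sqrt(1.0 + x ** 2)  # heteroskedastic DGP

X = np.column_stack([np.ones(n), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]                  # OLS
e2 = (y - X @ b) ** 2                                     # squared residuals

Z = np.column_stack([np.ones(n), x, x ** 2])              # constant and psi_i
g = np.linalg.lstsq(Z, e2, rcond=None)[0]                 # auxiliary regression
resid = e2 - Z @ g
R2 = 1.0 - resid.var() / e2.var()
stat = n * R2                                             # ~ chi2(2) under H0

assert stat > 5.99    # rejects homoskedasticity at the 5% level (chi2(2) critical value)
```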
163
Dependent Variable: WAGE. Method: Least Squares. Sample: 1 526. Included observations: 526.

Variable   Coefficient   Std. Error   t-Statistic   Prob.
C          −1.567939     0.724551     −2.164014     0.0309
FEMALE     −1.810852     0.264825     −6.837915     0.0000
EDUC        0.571505     0.049337     11.58362      0.0000
EXPER       0.025396     0.011569     2.195083      0.0286
TENURE      0.141005     0.021162     6.663225      0.0000

R-squared 0.363541; Adjusted R-squared 0.358655; S.E. of regression 2.957572; Sum squared resid 4557.308; Log likelihood −1314.228; F-statistic 74.39801; Prob(F-statistic) 0.000000; Durbin-Watson stat 1.794400.
164
Heteroskedasticity Test: White

F-statistic 5.911627, Prob. F(13,512) 0.0000; Obs*R-squared 68.64843, Prob. Chi-Square(13) 0.0000; Scaled explained SS 227.2648, Prob. Chi-Square(13) 0.0000.

Test Equation. Dependent Variable: RESID^2.

Variable         Coefficient   Std. Error   t-Statistic   Prob.
C                47.03183      20.19579     2.328794      0.0203
FEMALE           7.205436      10.92406     0.659593      0.5098
FEMALE*EDUC      0.491073      0.778127     0.631097      0.5283
FEMALE*EXPER     0.154634      0.168490     0.917768      0.3592
FEMALE*TENURE    0.066832      0.351582     0.190089      0.8493
EDUC             7.693423      2.596664     2.962811      0.0032
EDUC^2           0.315191      0.086457     3.645652      0.0003
EDUC*EXPER       0.045665      0.036134     1.263789      0.2069
EDUC*TENURE      0.083929      0.054140     1.550226      0.1217
EXPER            0.000257      0.610348     0.000421      0.9997
EXPER^2          0.009134      0.007010     1.303002      0.1932
EXPER*TENURE     0.004066      0.017603     0.230969      0.8174
TENURE           0.298093      0.934417     0.319015      0.7498
TENURE^2         0.004633      0.016358     0.283255      0.7771

R-squared 0.130510; Adjusted R-squared 0.108433; S.E. of regression 21.27289; Sum squared resid 231698.4; Log likelihood −2347.477; F-statistic 5.911627; Prob(F-statistic) 0.000000; Mean dependent var 8.664083; S.D. dependent var 22.52940; Akaike info criterion 8.978999; Schwarz criterion 9.092525; Hannan-Quinn criter. 9.023450; Durbin-Watson stat 1.905515.
165
Dependent Variable: WAGE. Method: Least Squares. Included observations: 526. White Heteroskedasticity-Consistent Standard Errors & Covariance.

Variable   Coefficient   Std. Error   t-Statistic   Prob.
C          −1.567939     0.825934     −1.898382     0.0582
FEMALE     −1.810852     0.254156     −7.124963     0.0000
EDUC        0.571505     0.061217     9.335686      0.0000
EXPER       0.025396     0.009806     2.589912      0.0099
TENURE      0.141005     0.027955     5.044007      0.0000

R-squared 0.363541; Adjusted R-squared 0.358655; S.E. of regression 2.957572; Sum squared resid 4557.308; Log likelihood −1314.228; F-statistic 74.39801; Prob(F-statistic) 0.000000; Durbin-Watson stat 1.794400.
3.8 Estimation with Parameterized Conditional Heteroskedasticity
Even when the error is found to be conditionally heteroskedastic, the OLS estimator is still consistent and asymptotically normal, and valid statistical inference can be conducted with robust standard errors and robust Wald statistics. However, in the (somewhat unlikely) case of a priori knowledge of the functional form of the conditional second moment, it should be possible to obtain sharper estimates with smaller asymptotic variance.
166
To simplify the discussion, throughout this section we strengthen Assumptions 2.2 and 2.5 by assuming that {(yi, xi)} is i.i.d.
3.8.1 The Functional Form
The parametric functional form for the conditional second moment we consider is

E(εi2|xi) = zi′α,

where zi is a function of xi. For example, E(εi2|xi) = α1 + α2 xi22, with zi′ = (1, xi22).
167
3.8.2 WLS with Known α
The WLS (also GLS) estimator can be obtained by applying OLS to the regression

ỹi = x̃i′β + ε̃i,

where

ỹi = yi/√(zi′α),  x̃ik = xik/√(zi′α),  ε̃i = εi/√(zi′α),  i = 1, 2, ..., n.

We have

β̂GLS = β̂(V) = (X̃′X̃)−1 X̃′ỹ = (X′V−1X)−1 X′V−1y.
168
Note that

E(ε̃i | x̃i) = 0.

Therefore, provided that E(x̃i x̃i′) is nonsingular, Assumptions 2.1-2.5 are satisfied for the equation ỹi = x̃i′β + ε̃i. Furthermore, by construction, the error ε̃i is conditionally homoskedastic: E(ε̃i2 | x̃i) = 1. So Proposition 2.5 applies: the WLS estimator is consistent and asymptotically normal, and the asymptotic variance is

Avar(β̂(V)) = E(x̃i x̃i′)−1 = plim ((1/n) Σi=1..n x̃i x̃i′)−1 (by the S&WD theorem)
            = plim ((1/n) X′V−1X)−1.

Thus ((1/n) X′V−1X)−1 is a consistent estimator of Avar(β̂(V)).
169
3.8.3 Regression of e²i on zi Provides a Consistent Estimate of α
If α is unknown, we need to estimate it. Assuming E(ε²i | xi) = z′iα we have

ε²i = E(ε²i | xi) + ηi = z′iα + ηi

where by construction E(ηi | xi) = 0. This suggests that the following regression can be considered:

ε²i = z′iα + ηi.

Provided that E(ziz′i) is nonsingular, Proposition 2.1 is applicable to this auxiliary regression: the OLS estimator of α is consistent and asymptotically normal. However, we cannot run this regression as εi is not observable. In the previous regression we should replace εi by the consistent estimate ei (consistent despite the presence of conditional heteroskedasticity). In conclusion, we may obtain a consistent estimate of α by considering the regression of e²i on zi to get

α̂ = ( Σⁿi=1 ziz′i )⁻¹ Σⁿi=1 zie²i.
170
3.8.4 WLS with Estimated α
Step 1: Estimate the equation yi = x′iβ + εi by OLS and compute the OLS residuals ei.

Step 2: Regress e²i on zi to obtain the OLS coefficient estimate α̂.

Step 3: Transform the original variables according to the rules

ỹi = yi/√(z′iα̂),  x̃ik = xik/√(z′iα̂),  i = 1, 2, ..., n

and run the OLS estimator with respect to the model ỹi = x̃′iβ + ε̃i to obtain the Feasible GLS (FGLS) estimator:

β̂(V̂) = (X′V̂⁻¹X)⁻¹X′V̂⁻¹y.
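The three-step procedure can be sketched in a short numpy simulation (a hypothetical example, not part of the slides; all data and parameter values are invented). Here the skedastic function is E(ε²i | xi) = z′iα with zi = (1, x²i):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000
x = rng.uniform(1.0, 3.0, n)
X = np.column_stack([np.ones(n), x])        # regressors (1, x_i)
Z = np.column_stack([np.ones(n), x**2])     # z_i = (1, x_i^2)
alpha_true = np.array([0.5, 0.4])           # E(eps^2 | x) = 0.5 + 0.4 x^2
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + np.sqrt(Z @ alpha_true) * rng.standard_normal(n)

# Step 1: OLS and residuals
b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ b_ols

# Step 2: regress e^2 on z to estimate alpha
alpha_hat, *_ = np.linalg.lstsq(Z, e**2, rcond=None)

# Step 3: divide each observation by sqrt(z'alpha_hat) and rerun OLS (FGLS)
w = 1.0 / np.sqrt(Z @ alpha_hat)
b_fgls, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)
print(b_ols, b_fgls)
```

Both estimators are consistent here; the FGLS estimates exploit the (correctly specified) skedastic function and are asymptotically more efficient.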
171
It can be proved that:

• β̂(V̂) →p β;

• √n (β̂(V̂) − β) →d N(0, Avar(β̂(V)));

• ((1/n) X′V̂⁻¹X)⁻¹ is a consistent estimator of Avar(β̂(V)).

No finite-sample properties are known for the estimator β̂(V̂).
172
3.8.5 A popular specification for E(ε²i | xi)

The specification ε²i = z′iα + ηi may lead to z′iα̂ < 0. To overcome this problem, a popular specification for E(ε²i | xi) is

E(ε²i | xi) = exp{x′iα}

(it guarantees that Var(yi | xi) > 0 for all values of α). It implies log E(ε²i | xi) = x′iα. This suggests the following procedure:

a) Regress y on X to get the residual vector e.

b) Run the LS regression of log e²i on xi to estimate α and calculate σ̂²i = exp{x′iα̂}.

c) Transform the data: ỹi = yi/σ̂i, x̃ij = xij/σ̂i.

d) Regress ỹ on X̃ and obtain β̂(V̂).
173
Notice also that:

E(ε²i | xi) = exp{x′iα}
ε²i = exp{x′iα} + vi,  vi = ε²i − E(ε²i | xi)
log ε²i ≈ x′iα + v*i
log e²i ≈ x′iα + v**i.
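Steps a)–d) can be sketched as follows (a hypothetical simulation, with invented data and parameter values). A useful detail: the log e²i regression estimates the slope coefficients of α consistently, while its intercept absorbs E(log χ²₁); that only rescales all weights by a common constant, which does not affect the FGLS point estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000
x = rng.standard_normal(n)
X = np.column_stack([np.ones(n), x])
alpha_true = np.array([-0.5, 0.8])           # Var(eps | x) = exp(-0.5 + 0.8 x)
beta_true = np.array([1.0, -1.5])
eps = np.exp(0.5 * (X @ alpha_true)) * rng.standard_normal(n)
y = X @ beta_true + eps

# a) OLS residuals
b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ b_ols

# b) regress log(e^2) on x; exponentiate the fitted values
a_hat, *_ = np.linalg.lstsq(X, np.log(e**2), rcond=None)
sigma_hat = np.exp(0.5 * (X @ a_hat))        # positive by construction

# c)-d) weighted (divided-by-sigma) regression
b_fgls, *_ = np.linalg.lstsq(X / sigma_hat[:, None], y / sigma_hat, rcond=None)
print(b_fgls)
```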
Example (Part 1). We want to estimate a demand function for daily cigarette consumption (cigs). The explanatory variables are: log(income) - log of annual income; log(cigprice) - log of the per-pack price of cigarettes in cents; educ - years of education; age; and restaurn - binary indicator equal to unity if the person resides in a state with restaurant smoking restrictions (source: J. Mullahy (1997), “Instrumental-Variable Estimation of Count Data Models: Applications to Models of Cigarette Smoking Behavior,” Review of Economics and Statistics 79, 586-593).
Based on information below, are the standard errors reported in the first table reliable?
174
Dependent Variable: CIGS
Method: Least Squares
Sample: 1 807

Variable      Coefficient  Std. Error  t-Statistic  Prob.

C              -3.639823   24.07866   -0.151164    0.8799
LOG(INCOME)     0.880268    0.727783   1.209519    0.2268
LOG(CIGPRIC)   -0.750862    5.773342  -0.130057    0.8966
EDUC           -0.501498    0.167077  -3.001596    0.0028
AGE             0.770694    0.160122   4.813155    0.0000
AGE^2          -0.009023    0.001743  -5.176494    0.0000
RESTAURN       -2.825085    1.111794  -2.541016    0.0112

R-squared 0.052737            Mean dependent var 8.686493
Adjusted R-squared 0.045632   S.D. dependent var 13.72152
S.E. of regression 13.40479   Akaike info criterion 8.037737
Sum squared resid 143750.7    Schwarz criterion 8.078448
Log likelihood -3236.227      Hannan-Quinn criter. 8.053370
F-statistic 7.423062          Durbin-Watson stat 2.012825
Prob(F-statistic) 0.000000
Heteroskedasticity Test: White

F-statistic 2.159258          Prob. F(25,781) 0.0009
Obs*R-squared 52.17245        Prob. Chi-Square(25) 0.0011
Scaled explained SS 110.0813  Prob. Chi-Square(25) 0.0000

Test Equation:
Dependent Variable: RESID^2

Variable      Coefficient  Std. Error  t-Statistic  Prob.

C 29374.77 20559.14 1.428794 0.1535
LOG(INCOME) 1049.630 963.4359 1.089466 0.2763
(LOG(INCOME))^2 3.941183 17.07122 0.230867 0.8175
(LOG(INCOME))*(LOG(CIGPRIC)) 329.8896 239.2417 1.378897 0.1683
(LOG(INCOME))*EDUC 9.591849 8.047066 1.191969 0.2336
(LOG(INCOME))*AGE 3.354565 6.682194 0.502015 0.6158
(LOG(INCOME))*(AGE^2) 0.026704 0.073025 0.365689 0.7147
(LOG(INCOME))*RESTAURN 59.88700 49.69039 1.205203 0.2285
LOG(CIGPRIC) 10340.68 9754.559 1.060087 0.2894
(LOG(CIGPRIC))^2 668.5294 1204.316 0.555111 0.5790
(LOG(CIGPRIC))*EDUC 32.91371 59.06252 0.557269 0.5775
(LOG(CIGPRIC))*AGE 62.88164 55.29011 1.137304 0.2558
(LOG(CIGPRIC))*(AGE^2) 0.622371 0.594730 1.046477 0.2957
(LOG(CIGPRIC))*RESTAURN 862.1577 720.6219 1.196408 0.2319
EDUC 117.4705 251.2852 0.467479 0.6403
EDUC^2 0.290343 1.287605 0.225491 0.8217
EDUC*AGE 3.617048 1.724659 2.097254 0.0363
EDUC*(AGE^2) 0.035558 0.017664 2.012988 0.0445
EDUC*RESTAURN 2.896490 10.65709 0.271790 0.7859
AGE 264.1461 235.7624 1.120391 0.2629
AGE^2 3.468601 3.194651 1.085753 0.2779
AGE*(AGE^2) 0.019111 0.028655 0.666935 0.5050
AGE*RESTAURN 4.933199 10.84029 0.455080 0.6492
(AGE^2)^2 0.000118 0.000146 0.807552 0.4196
(AGE^2)*RESTAURN 0.038446 0.120459 0.319160 0.7497
RESTAURN 2868.196 2986.776 0.960299 0.3372
cigs: number of cigarettes smoked per day; log(income): log of annual income; log(cigprice): log of the per-pack price of cigarettes in cents; educ: years of education; age; restaurn: binary indicator equal to unity if the person resides in a state with restaurant smoking restrictions.
175
Example (Part 2). Discuss the results of the following figures.
Dependent Variable: CIGS
Method: Least Squares
Sample: 1 807

Variable      Coefficient  Std. Error  t-Statistic  Prob.

C              -3.639823   24.07866   -0.151164    0.8799
LOG(INCOME)     0.880268    0.727783   1.209519    0.2268
LOG(CIGPRIC)   -0.750862    5.773342  -0.130057    0.8966
EDUC           -0.501498    0.167077  -3.001596    0.0028
AGE             0.770694    0.160122   4.813155    0.0000
AGE^2          -0.009023    0.001743  -5.176494    0.0000
RESTAURN       -2.825085    1.111794  -2.541016    0.0112

R-squared 0.052737            Mean dependent var 8.686493
Adjusted R-squared 0.045632   S.D. dependent var 13.72152
S.E. of regression 13.40479   Akaike info criterion 8.037737
Sum squared resid 143750.7    Schwarz criterion 8.078448
Log likelihood -3236.227      Hannan-Quinn criter. 8.053370
F-statistic 7.423062          Durbin-Watson stat 2.012825
Prob(F-statistic) 0.000000

Dependent Variable: CIGS
Method: Least Squares
Sample: 1 807
White Heteroskedasticity-Consistent Standard Errors & Covariance

Variable      Coefficient  Std. Error  t-Statistic  Prob.

C              -3.639823   25.61646   -0.142089    0.8870
LOG(INCOME)     0.880268    0.596011   1.476931    0.1401
LOG(CIGPRIC)   -0.750862    6.035401  -0.124410    0.9010
EDUC           -0.501498    0.162394  -3.088167    0.0021
AGE             0.770694    0.138284   5.573262    0.0000
AGE^2          -0.009023    0.001462  -6.170768    0.0000
RESTAURN       -2.825085    1.008033  -2.802573    0.0052

R-squared 0.052737            Mean dependent var 8.686493
Adjusted R-squared 0.045632   S.D. dependent var 13.72152
S.E. of regression 13.40479   Akaike info criterion 8.037737
Sum squared resid 143750.7    Schwarz criterion 8.078448
Log likelihood -3236.227      Hannan-Quinn criter. 8.053370
F-statistic 7.423062          Durbin-Watson stat 2.012825
Prob(F-statistic) 0.000000
176
Example (Part 3). a) Regress y on X to get the residual vector e.
Dependent Variable: CIGS
Method: Least Squares
Sample: 1 807

Variable      Coefficient  Std. Error  t-Statistic  Prob.

C              -3.639823   24.07866   -0.151164    0.8799
LOG(INCOME)     0.880268    0.727783   1.209519    0.2268
LOG(CIGPRIC)   -0.750862    5.773342  -0.130057    0.8966
EDUC           -0.501498    0.167077  -3.001596    0.0028
AGE             0.770694    0.160122   4.813155    0.0000
AGE^2          -0.009023    0.001743  -5.176494    0.0000
RESTAURN       -2.825085    1.111794  -2.541016    0.0112

R-squared 0.052737            Mean dependent var 8.686493
Adjusted R-squared 0.045632   S.D. dependent var 13.72152
S.E. of regression 13.40479   Akaike info criterion 8.037737
Sum squared resid 143750.7    Schwarz criterion 8.078448
Log likelihood -3236.227      Hannan-Quinn criter. 8.053370
F-statistic 7.423062          Durbin-Watson stat 2.012825
Prob(F-statistic) 0.000000
177
b) Run the LS regression of log e²i on xi
Dependent Variable: LOG(RES^2)
Method: Least Squares
Sample: 1 807

Variable      Coefficient  Std. Error  t-Statistic  Prob.

C              -1.920691   2.563033   -0.749382    0.4538
LOG(INCOME)     0.291540   0.077468    3.763351    0.0002
LOG(CIGPRIC)    0.195418   0.614539    0.317992    0.7506
EDUC           -0.079704   0.017784   -4.481657    0.0000
AGE             0.204005   0.017044   11.96928     0.0000
AGE^2          -0.002392   0.000186  -12.89313     0.0000
RESTAURN       -0.627011   0.118344   -5.298213    0.0000

R-squared 0.247362            Mean dependent var 4.207486
Adjusted R-squared 0.241717   S.D. dependent var 1.638575
S.E. of regression 1.426862   Akaike info criterion 3.557468
Sum squared resid 1628.747    Schwarz criterion 3.598178
Log likelihood -1428.438      Hannan-Quinn criter. 3.573101
F-statistic 43.82129          Durbin-Watson stat 2.024587
Prob(F-statistic) 0.000000
Calculate

σ̂²i = exp{x′iα̂}.

Notice: x′1α̂, ..., x′nα̂ are the fitted values of the above regression, so σ̂²i is the exponential of the i-th fitted value.
178
c) Transform the data

ỹi = yi/σ̂i,  x̃ij = xij/σ̂i

and d) regress ỹ on X̃ to obtain β̂(V̂).
Dependent Variable: CIGS/SIGMA
Method: Least Squares
Sample: 1 807

Variable            Coefficient  Std. Error  t-Statistic  Prob.

1/SIGMA               5.635471   17.80314    0.316544    0.7517
LOG(INCOME)/SIGMA     1.295239    0.437012   2.963855    0.0031
LOG(CIGPRIC)/SIGMA   -2.940314    4.460145  -0.659242    0.5099
EDUC/SIGMA           -0.463446    0.120159  -3.856953    0.0001
AGE/SIGMA             0.481948    0.096808   4.978378    0.0000
AGE^2/SIGMA          -0.005627    0.000939  -5.989706    0.0000
RESTAURN/SIGMA       -3.461064    0.795505  -4.350776    0.0000

R-squared 0.002751            Mean dependent var 0.966192
Adjusted R-squared -0.004728  S.D. dependent var 1.574979
S.E. of regression 1.578698   Akaike info criterion 3.759715
Sum squared resid 1993.831    Schwarz criterion 3.800425
Log likelihood -1510.045      Hannan-Quinn criter. 3.775347
Durbin-Watson stat 2.049719
179
3.8.6 OLS versus WLS
Under certain conditions we have:

• b and β̂(V̂) are consistent.

• Assuming that the functional form of the conditional second moment is correctly specified, β̂(V̂) is asymptotically more efficient than b.

• It is not clear which estimator is better (in terms of efficiency) in the following situations:

— the functional form of the conditional second moment is misspecified;

— in finite samples, even if the functional form is correctly specified, the large-sample approximation will probably work less well for the WLS estimator than for OLS because of the extra parameters (α̂) estimated in the WLS procedure.
180
3.9 Serial Correlation
Because the issue of serial correlation arises almost always in time-series models, we use the subscript "t" instead of "i" in this section. Throughout this section we assume that the regressors include a constant. The issue is how to deal with

E(εtεt−j | xt−j, xt) ≠ 0.
181
3.9.1 Usual Inference is not Valid
When the regressors include a constant (true in virtually all known applications), Assumption 2.5 implies that the error term is a scalar martingale difference sequence, so if the error is found to be serially correlated (or autocorrelated), that is an indication of a failure of Assumption 2.5.
We have Cov(gt, gt−j) ≠ 0. In fact,

Cov(gt, gt−j) = E( xtεt x′t−jεt−j )
             = E( E( xtεt x′t−jεt−j | xt−j, xt ) )
             = E( xtx′t−j E( εtεt−j | xt−j, xt ) ) ≠ 0.
Assumptions 2.1-2.4 may hold under serial correlation, so the OLS estimator may be consistent even if the error is autocorrelated. However, the large-sample properties of b, t, and F of Proposition 2.5 are not valid. To see why, consider

√n (b − β) = S⁻¹xx (√n ḡ).
182
We have

Avar(b) = Σ⁻¹xx S Σ⁻¹xx,  estimated by S⁻¹xx Ŝ S⁻¹xx.

If the errors are not autocorrelated:

S = Var(√n ḡ) = Var(gt).

If the errors are autocorrelated:

S = Var(√n ḡ) = Var(gt) + (1/n) Σ^{n−1}_{j=1} Σ^n_{t=j+1} ( E(gtg′t−j) + E(gt−jg′t) ).

Since Cov(gt, gt−j) ≠ 0 and E(gt−jg′t) ≠ 0, we have

S ≠ Var(gt), i.e. S ≠ E(gtg′t).

If the errors are serially correlated we cannot use s² (1/n) Σ^n_{t=1} xtx′t or (1/n) Σ^n_{t=1} e²t xtx′t (robust to conditional heteroskedasticity) as consistent estimators of S.
183
3.9.2 Testing Serial Correlation
Consider the regression yt = x′tβ + εt. We want to test whether or not εt is serially correlated. Consider

ρj = Cov(εt, εt−j) / √( Var(εt) Var(εt−j) ) = Cov(εt, εt−j)/Var(εt) = γj/γ0 = E(εtεt−j)/E(ε²t).

Since γj is not observable, we need to consider

ρ̃j = γ̃j/γ̃0,  γ̃j = (1/n) Σ^n_{t=j+1} εtεt−j,  γ̃0 = (1/n) Σ^n_{t=1} ε²t.
184
Proposition. If {εt} is a stationary MDS with E(ε²t | εt−1, εt−2, ...) = σ², then

√n γ̃j →d N(0, σ⁴) and √n ρ̃j →d N(0, 1).

Proposition. Under the assumptions of the previous proposition,

Box-Pierce Q statistic: QBP = Σ^p_{j=1} (√n ρ̃j)² = n Σ^p_{j=1} ρ̃²j →d χ²(p).

However, ρ̃j is still unfeasible as we do not observe the errors. Thus, we use

ρ̂j = γ̂j/γ̂0,  γ̂j = (1/n) Σ^n_{t=j+1} etet−j,  γ̂0 = (1/n) Σ^n_{t=1} e²t (= SSR/n).
Exercise 3.6. Prove that ρ̂j can be obtained from the regression of et on et−j (without intercept).
185
Testing with Strictly Exogenous Regressors
To test H0: ρj = 0 we consider the following proposition:

Proposition (testing for serial correlation with strictly exogenous regressors). Suppose that Assumptions 1.2, 2.1, 2.2 and 2.4 are satisfied. Then

ρ̂j →p 0,  √n ρ̂j →d N(0, 1).
186
To test H0: ρ1 = ρ2 = ... = ρp = 0 we consider the following proposition:

Proposition (Box-Pierce Q & Ljung-Box Q). Suppose that Assumptions 1.2, 2.1, 2.2 and 2.4 are satisfied. Then

QBP = n Σ^p_{j=1} ρ̂²j →d χ²(p),

QLB = n (n+2) Σ^p_{j=1} ρ̂²j / (n−j) →d χ²(p).

It can be shown that the hypothesis H0: ρ1 = ρ2 = ... = ρp = 0 can also be tested through the following auxiliary regression:

regression of et on et−1, ..., et−p.

We calculate the F statistic for the hypothesis that the p coefficients of et−1, ..., et−p are all zero.
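As a sketch (hypothetical white-noise residuals, invented data), ρ̂j, QBP and QLB can be computed directly from the formulas above:

```python
import numpy as np

def q_stats(e, p):
    """Residual autocorrelations and Box-Pierce / Ljung-Box statistics."""
    n = len(e)
    gamma0 = np.sum(e**2) / n
    # rho_hat_j = gamma_hat_j / gamma_hat_0, j = 1, ..., p
    rho = np.array([np.sum(e[j:] * e[:-j]) / n / gamma0 for j in range(1, p + 1)])
    q_bp = n * np.sum(rho**2)
    q_lb = n * (n + 2) * np.sum(rho**2 / (n - np.arange(1, p + 1)))
    return rho, q_bp, q_lb

rng = np.random.default_rng(1)
e = rng.standard_normal(500)       # "residuals" from a correctly specified model
rho, q_bp, q_lb = q_stats(e, p=12)
print(q_bp, q_lb)                  # compare with chi-square(12) critical values
```

Note that QLB is always slightly larger than QBP, since (n+2)/(n−j) > 1 for each lag.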
187
Testing with Predetermined, but Not Strictly Exogenous, Regressors
If the regressors are not strictly exogenous, √n ρ̂j no longer has an N(0, 1) distribution and the residual-based Q statistic may not be asymptotically chi-squared.

The trick consists in removing the effect of xt in the regression of et on et−1, ..., et−p by considering now the

regression of et on xt, et−1, ..., et−p

and then calculating the F statistic for the hypothesis that the p coefficients of et−1, ..., et−p are all zero. This regression is still valid when the regressors are strictly exogenous (so you may always use that regression).

Given

et = θ1 + θ2xt2 + ... + θKxtK + γ1et−1 + ... + γpet−p + errort

the null hypothesis can be formulated as

H0: γ1 = ... = γp = 0.

Use the F test.
188
EVIEWS
189
Example. Consider: chnimp - the volume of imports of barium chloride from China; chempi - index of chemical production (to control for overall demand for barium chloride); gas - the volume of gasoline production (another demand variable); rtwex - an exchange rate index (measures the strength of the dollar against several other currencies).
Equation 1
Dependent Variable: LOG(CHNIMP)
Method: Least Squares
Sample: 1978M02 1988M12
Included observations: 131

Variable      Coefficient  Std. Error  t-Statistic  Prob.

C              19.75991   21.08580    0.937119    0.3505
LOG(CHEMPI)     3.044302   0.478954   6.356142    0.0000
LOG(GAS)        0.349769   0.906247   0.385953    0.7002
LOG(RTWEX)      0.717552   0.349450   2.053378    0.0421

R-squared 0.280905            Mean dependent var 6.174599
Adjusted R-squared 0.263919   S.D. dependent var 0.699738
S.E. of regression 0.600341   Akaike info criterion 1.847421
Sum squared resid 45.77200    Schwarz criterion 1.935213
Log likelihood -117.0061      Hannan-Quinn criter. 1.883095
F-statistic 16.53698          Durbin-Watson stat 1.421242
Prob(F-statistic) 0.000000
190
Equation 2
Breusch-Godfrey Serial Correlation LM Test:

F-statistic 2.337861      Prob. F(12,115) 0.0102
Obs*R-squared 25.69036    Prob. Chi-Square(12) 0.0119

Test Equation:
Dependent Variable: RESID
Method: Least Squares
Sample: 1978M02 1988M12
Included observations: 131
Presample missing value lagged residuals set to zero.

Variable      Coefficient  Std. Error  t-Statistic  Prob.

C               3.074901   20.73522    0.148294    0.8824
LOG(CHEMPI)     0.084948    0.457958   0.185493    0.8532
LOG(GAS)        0.110527    0.892301   0.123867    0.9016
LOG(RTWEX)      0.030365    0.333890   0.090942    0.9277
RESID(-1)       0.234579    0.093215   2.516546    0.0132
RESID(-2)       0.182743    0.095624   1.911051    0.0585
RESID(-3)       0.164748    0.097176   1.695366    0.0927
RESID(-4)       0.180123    0.098565   1.827464    0.0702
RESID(-5)       0.041327    0.099482   0.415425    0.6786
RESID(-6)       0.038597    0.098345   0.392468    0.6954
RESID(-7)       0.139782    0.098420   1.420268    0.1582
RESID(-8)       0.063771    0.099213   0.642771    0.5217
RESID(-9)       0.154525    0.098209   1.573441    0.1184
RESID(-10)      0.027184    0.098283   0.276585    0.7826
RESID(-11)      0.049692    0.097140   0.511550    0.6099
RESID(-12)      0.058076    0.095469   0.608329    0.5442

R-squared 0.196110            Mean dependent var 3.97E-15
Adjusted R-squared 0.091254   S.D. dependent var 0.593374
S.E. of regression 0.565652   Akaike info criterion 1.812335
Sum squared resid 36.79567    Schwarz criterion 2.163504
Log likelihood -102.7079      Hannan-Quinn criter. 1.955030
F-statistic 1.870289          Durbin-Watson stat 2.015299
Prob(F-statistic) 0.033268
191
If you conclude that the errors are serially correlated you have a few options:

(a) You know (at least approximately) the form of the autocorrelation, so you use a feasible GLS estimator.

(b) The second approach parallels the use of the White estimator for heteroskedasticity: you don't know the form of the autocorrelation, so you rely on OLS but use a consistent estimator of Avar(b).

(c) You are concerned only with the dynamic specification of the model and with forecasting. You may try to convert your model into a dynamically complete model.

(d) Your model may be misspecified: you respecify the model and the autocorrelation disappears.
192
3.9.3 Question (a): feasible GLS estimator
There are many forms of autocorrelation, and each one leads to a different structure for the error covariance matrix V. The most popular form is known as the first-order autoregressive process. In this case the error term in

yt = x′tβ + εt

is assumed to follow the AR(1) model

εt = ρεt−1 + vt,  |ρ| < 1,

where vt is an error term with mean zero and constant conditional variance that exhibits no serial correlation. We assume that Assumptions 2.1-2.5 hold with ρ = 0 (i.e., for the model written in terms of vt).
193
Initial model:

yt = x′tβ + εt,  εt = ρεt−1 + vt,  |ρ| < 1.

The GLS estimator is the OLS estimator applied to the transformed model

ỹt = x̃′tβ + vt

where

ỹt = √(1−ρ²) y1 for t = 1,  ỹt = yt − ρyt−1 for t > 1;
x̃′t = √(1−ρ²) x′1 for t = 1,  x̃′t = (xt − ρxt−1)′ for t > 1.

Without the first observation, the transformed model is

yt − ρyt−1 = (xt − ρxt−1)′β + vt.

If ρ is unknown we may replace it by a consistent estimator, or we may use the nonlinear least squares estimator (EViews).
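The quasi-differencing transformation can be checked numerically (a hypothetical sketch with invented data): applying yt − ρyt−1 with the true ρ must recover exactly the serially uncorrelated error vt:

```python
import numpy as np

rng = np.random.default_rng(7)
n, rho = 300, 0.6
beta = np.array([2.0, 0.5])
x = rng.standard_normal(n)
X = np.column_stack([np.ones(n), x])
v = rng.standard_normal(n)

# build the AR(1) error eps_t = rho * eps_{t-1} + v_t
eps = np.zeros(n)
eps[0] = v[0] / np.sqrt(1 - rho**2)        # stationary starting value
for t in range(1, n):
    eps[t] = rho * eps[t - 1] + v[t]
y = X @ beta + eps

# quasi-difference, dropping the first observation
y_t = y[1:] - rho * y[:-1]
X_t = X[1:] - rho * X[:-1]
u = y_t - X_t @ beta                        # should equal v_t exactly
print(np.max(np.abs(u - v[1:])))            # numerically zero
```

With ρ unknown, the same transformation is applied with a consistent estimate ρ̂ (Cochrane-Orcutt style), which is what the feasible GLS does.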
194
Example (continuation of the previous example). Let’s consider the residuals of Equation 1:
Equation 3
Dependent Variable: LOG(CHNIMP)
Method: Least Squares
Sample (adjusted): 1978M03 1988M12
Included observations: 130 after adjustments
Convergence achieved after 8 iterations

Variable      Coefficient  Std. Error  t-Statistic  Prob.

C              39.30703   23.61105    1.664772    0.0985
LOG(CHEMPI)     2.875036   0.658664   4.364949    0.0000
LOG(GAS)        1.213475   1.005164   1.207241    0.2296
LOG(RTWEX)      0.850385   0.468696   1.814362    0.0720
AR(1)           0.309190   0.086011   3.594777    0.0005

R-squared 0.338533            Mean dependent var 6.180590
Adjusted R-squared 0.317366   S.D. dependent var 0.699063
S.E. of regression 0.577578   Akaike info criterion 1.777754
Sum squared resid 41.69947    Schwarz criterion 1.888044
Log likelihood -110.5540      Hannan-Quinn criter. 1.822569
F-statistic 15.99350          Durbin-Watson stat 2.079096
Prob(F-statistic) 0.000000

Inverted AR Roots .31
Exercise 3.7. Consider yt = β1 + β2xt2 + εt where εt = ρεt−1 + vt and {vt} is a white noise process. Using the first differences of the variables one gets ∆yt = β2∆xt2 + ∆εt. Show that Corr(∆εt, ∆εt−1) = −(1 − ρ)/2. Discuss the advantages and disadvantages of differencing the variables as a procedure to remove autocorrelation.
195
3.9.4 Question (b): Heteroskedasticity and Autocorrelation-Consistent (HAC) Covariance Matrix Estimator

For the sake of generality, assume that you also have a problem of heteroskedasticity.
Given

S = Var(√n ḡ)
  = Var(gt) + (1/n) Σ^{n−1}_{j=1} Σ^n_{t=j+1} ( E(gtg′t−j) + E(gt−jg′t) )
  = E(ε²t xtx′t) + (1/n) Σ^{n−1}_{j=1} Σ^n_{t=j+1} ( E(εtεt−j xtx′t−j) + E(εt−jεt xt−jx′t) ),

a possible estimator of S based on the analogy principle would be

(1/n) Σ^n_{t=1} e²t xtx′t + (1/n) Σ^{n′−1}_{j=1} Σ^n_{t=j+1} ( etet−j xtx′t−j + et−jet xt−jx′t ),  n′ < n.

A major problem with this estimator is that it is not guaranteed to be positive semi-definite and hence may fail to be a well-defined variance-covariance matrix.
196
Newey and West show that with a suitable weighting function ω(j), the estimator below is consistent and positive semi-definite:

ŜHAC = (1/n) Σ^n_{t=1} e²t xtx′t + (1/n) Σ^L_{j=1} Σ^n_{t=j+1} ω(j) ( etet−j xtx′t−j + et−jet xt−jx′t )

where the weighting function ω(j) is

ω(j) = 1 − j/(L + 1).

The maximum lag L must be determined in advance. Autocorrelations at lags longer than L are ignored. For a moving-average process, this value is in general a small number.

This estimator is known as the HAC covariance matrix estimator and is valid when both conditional heteroskedasticity and serial correlation are present but of an unknown form.
197
Example. For xt = 1, n = 9, L = 3 we have

Σ^L_{j=1} Σ^n_{t=j+1} ω(j) ( etet−j xtx′t−j + et−jet xt−jx′t ) = Σ^L_{j=1} Σ^n_{t=j+1} ω(j) 2etet−j

= ω(1) (2e1e2 + 2e2e3 + 2e3e4 + 2e4e5 + 2e5e6 + 2e6e7 + 2e7e8 + 2e8e9)
+ ω(2) (2e1e3 + 2e2e4 + 2e3e5 + 2e4e6 + 2e5e7 + 2e6e8 + 2e7e9)
+ ω(3) (2e1e4 + 2e2e5 + 2e3e6 + 2e4e7 + 2e5e8 + 2e6e9)

with

ω(1) = 1 − 1/4 = 0.75,  ω(2) = 1 − 2/4 = 0.50,  ω(3) = 1 − 3/4 = 0.25.
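The expansion can be verified with a short loop (the residuals e1, ..., e9 are placeholder values):

```python
import numpy as np

e = np.arange(1.0, 10.0)     # placeholder residuals e_1, ..., e_9
n, L = len(e), 3

# loop form: sum over lags j of w(j) * sum_t 2 e_t e_{t-j}
total = 0.0
for j in range(1, L + 1):
    w = 1.0 - j / (L + 1)                      # Bartlett weight
    total += w * 2.0 * np.sum(e[j:] * e[:-j])

# hand expansion with w(1)=0.75, w(2)=0.50, w(3)=0.25
hand = (0.75 * 2 * sum(e[t] * e[t - 1] for t in range(1, 9))
        + 0.50 * 2 * sum(e[t] * e[t - 2] for t in range(2, 9))
        + 0.25 * 2 * sum(e[t] * e[t - 3] for t in range(3, 9)))
print(total, hand)
```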
198
Newey-West covariance matrix estimator:

Avar(b) is estimated by S⁻¹xx ŜHAC S⁻¹xx.

EVIEWS:

[figure: the default truncation lag L as a function of the sample size n, for n up to 5000]

EViews selects L = floor( 4 (n/100)^{2/9} ).
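The rule can be coded directly (a sketch); for n = 131, as in the barium example, it gives L = 4, matching the reported lag truncation:

```python
import math

def newey_west_lag(n: int) -> int:
    """EViews default truncation lag: L = floor(4 * (n/100)**(2/9))."""
    return math.floor(4 * (n / 100) ** (2 / 9))

print(newey_west_lag(131))   # 4
```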
199
Example (continuation ...). Newey-West covariance matrix estimator:

Avar(b) is estimated by S⁻¹xx ŜHAC S⁻¹xx.
Equation 4
Dependent Variable: LOG(CHNIMP)
Method: Least Squares
Sample: 1978M02 1988M12
Included observations: 131
Newey-West HAC Standard Errors & Covariance (lag truncation=4)

Variable      Coefficient  Std. Error  t-Statistic  Prob.

C              19.75991   26.25891    0.752503    0.4531
LOG(CHEMPI)     3.044302   0.667155   4.563111    0.0000
LOG(GAS)        0.349769   1.189866   0.293956    0.7693
LOG(RTWEX)      0.717552   0.361957   1.982426    0.0496

R-squared 0.280905            Mean dependent var 6.174599
Adjusted R-squared 0.263919   S.D. dependent var 0.699738
S.E. of regression 0.600341   Akaike info criterion 1.847421
Sum squared resid 45.77200    Schwarz criterion 1.935213
Log likelihood -117.0061      Hannan-Quinn criter. 1.883095
F-statistic 16.53698          Durbin-Watson stat 1.421242
Prob(F-statistic) 0.000000
200
3.9.5 Question (c): Dynamically Complete Models
Consider

yt = x′tβ + ut

such that E(ut | xt) = 0. This condition, although necessary for consistency, does not preclude autocorrelation. You may try to add regressors to xt and get a new regression model

yt = x′tβ + εt such that E(εt | xt, yt−1, xt−1, yt−2, ...) = 0.

Written in terms of yt:

E(yt | xt, yt−1, xt−1, yt−2, ...) = E(yt | xt).

Definition. The model yt = x′tβ + εt is dynamically complete (DC) if

E(εt | xt, yt−1, xt−1, yt−2, ...) = 0, or equivalently

E(yt | xt, yt−1, xt−1, yt−2, ...) = E(yt | xt)

holds (see Wooldridge).
201
Proposition. If a model is DC then the errors are not autocorrelated. Moreover, {gt} is an MDS.

Notice that E(εt | xt, yt−1, xt−1, yt−2, ...) = 0 can be rewritten as

E(εi | Fi) = 0, where

Fi = Ii−1 ∪ xi = {εi−1, εi−2, ..., ε1, xi, xi−1, ..., x1},
Ii−1 = {εi−1, εi−2, ..., ε1, xi−1, ..., x1}.
Example. Consider

yt = β1 + β2xt2 + ut,  ut = φut−1 + εt

where {εt} is a white noise process and E(εt | xt2, yt−1, xt−1,2, yt−2, ...) = 0. Set x′t = ( 1  xt2 ). The above model is not DC since the errors are autocorrelated. Notice that

E(yt | xt2, yt−1, xt−1,2, yt−2, ...) = β1 + β2xt2 + φut−1

does not coincide with

E(yt | xt) = E(yt | xt2) = β1 + β2xt2.
202
However, it is easy to obtain a DC model. Since

ut = yt − (β1 + β2xt2)  ⇒  ut−1 = yt−1 − (β1 + β2xt−1,2),

we have

yt = β1 + β2xt2 + ut
   = β1 + β2xt2 + φut−1 + εt
   = β1 + β2xt2 + φ( yt−1 − (β1 + β2xt−1,2) ) + εt.

This equation can be written in the form

yt = γ1 + γ2xt2 + γ3yt−1 + γ4xt−1,2 + εt.

Let x̃t = ( xt2, yt−1, xt−1,2 ). The previous model is DC as

E(yt | x̃t, yt−1, x̃t−1, ...) = E(yt | x̃t) = γ1 + γ2xt2 + γ3yt−1 + γ4xt−1,2.
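Matching coefficients in the derivation gives (γ1, γ2, γ3, γ4) = (β1(1 − φ), β2, φ, −φβ2), which can be verified numerically (a sketch with hypothetical parameter values):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
beta1, beta2, phi = 1.0, 0.5, 0.7
x = rng.standard_normal(n)
eps = rng.standard_normal(n)

# AR(1) error u_t = phi * u_{t-1} + eps_t
u = np.zeros(n)
u[0] = eps[0]
for t in range(1, n):
    u[t] = phi * u[t - 1] + eps[t]
y = beta1 + beta2 * x + u

# DC form: y_t = g1 + g2 x_t + g3 y_{t-1} + g4 x_{t-1} + eps_t
g1, g2, g3, g4 = beta1 * (1 - phi), beta2, phi, -phi * beta2
resid = y[1:] - (g1 + g2 * x[1:] + g3 * y[:-1] + g4 * x[:-1])
print(np.max(np.abs(resid - eps[1:])))   # numerically zero
```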
203
Example (continuation ...). Dynamically Complete Model
Equation 5
Dependent Variable: LOG(CHNIMP)
Method: Least Squares
Sample (adjusted): 1978M03 1988M12
Included observations: 130 after adjustments

Variable          Coefficient  Std. Error  t-Statistic  Prob.

C                  11.30596   23.24886    0.486302    0.6276
LOG(CHEMPI)         7.193799   3.539951   2.032175    0.0443
LOG(GAS)            1.319540   1.003825   1.314513    0.1911
LOG(RTWEX)          0.501520   2.108623   0.237842    0.8124
LOG(CHEMPI(-1))     9.618587   3.602977   2.669622    0.0086
LOG(GAS(-1))        1.223681   1.002237   1.220950    0.2245
LOG(RTWEX(-1))      0.935678   2.088961   0.447915    0.6550
LOG(CHNIMP(-1))     0.270704   0.084103   3.218710    0.0016

R-squared 0.394405            Mean dependent var 6.180590
Adjusted R-squared 0.359658   S.D. dependent var 0.699063
S.E. of regression 0.559400   Akaike info criterion 1.735660
Sum squared resid 38.17726    Schwarz criterion 1.912123
Log likelihood -104.8179      Hannan-Quinn criter. 1.807363
F-statistic 11.35069          Durbin-Watson stat 2.059684
Prob(F-statistic) 0.000000
Equation 6
Breusch-Godfrey Serial Correlation LM Test:

F-statistic 0.810670      Prob. F(12,110) 0.6389
Obs*R-squared 10.56265    Prob. Chi-Square(12) 0.5667

Test Equation:
Dependent Variable: RESID
Method: Least Squares
Date: 05/12/10 Time: 19:13
Sample: 1978M03 1988M12
Included observations: 130
Presample missing value lagged residuals set to zero.

Variable          Coefficient  Std. Error  t-Statistic  Prob.

C                   1.025127   26.26657    0.039028    0.9689
LOG(CHEMPI)         1.373671    3.968650   0.346130    0.7299
LOG(GAS)            0.279136    1.055889   0.264361    0.7920
LOG(RTWEX)          0.074592    2.234853   0.033377    0.9734
LOG(CHEMPI(-1))     1.878917    4.322963   0.434636    0.6647
LOG(GAS(-1))        0.315918    1.076831   0.293378    0.7698
LOG(RTWEX(-1))      0.007029    2.224878   0.003159    0.9975
LOG(CHNIMP(-1))     0.151065    0.293284   0.515082    0.6075
RESID(-1)           0.189924    0.307062   0.618520    0.5375
RESID(-2)           0.088557    0.124602   0.710715    0.4788
RESID(-3)           0.154141    0.098337   1.567475    0.1199
RESID(-4)           0.125009    0.098681   1.266795    0.2079
RESID(-5)           0.035680    0.099831   0.357407    0.7215
RESID(-6)           0.048053    0.098008   0.490291    0.6249
RESID(-7)           0.129226    0.097417   1.326523    0.1874
RESID(-8)           0.052884    0.099891   0.529420    0.5976
RESID(-9)           0.122323    0.102670   1.191423    0.2361
RESID(-10)          0.022149    0.099419   0.222788    0.8241
RESID(-11)          0.034364    0.099973   0.343738    0.7317
RESID(-12)          0.038034    0.102071   0.372628    0.7101

R-squared 0.081251            Mean dependent var 9.76E-15
Adjusted R-squared 0.077442   S.D. dependent var 0.544011
S.E. of regression 0.564683   Akaike info criterion 1.835533
Sum squared resid 35.07532    Schwarz criterion 2.276692
Log likelihood -99.30962      Hannan-Quinn criter. 2.014790
F-statistic 0.512002          Durbin-Watson stat 2.011429
Prob(F-statistic) 0.952295
204
3.9.6 Question (d): Misspecification
In many cases the finding of autocorrelation is an indication that the model is misspecified. If this is the case, the most natural route is not to change your estimator (from OLS to GLS) but to change your model. Types of misspecification that may lead to a finding of autocorrelation in your OLS residuals:

• dynamic misspecification (related to question (c));

• omitted variables (that are autocorrelated);

• yt and/or xtk are integrated processes, e.g. yt ∼ I(1);

• functional form misspecification.
205
Functional form misspecification. Suppose that the true relationship is

yt = β1 + β2 log t + εt.

In the following figure we estimate a misspecified functional form: yt = β1 + β2t + ε*t. The residuals are clearly autocorrelated.
206
3.10 Time Regressions
Consider
yt = α + δf(t) + εt

where f(t) is a function of time (e.g. f(t) = t or f(t) = t², etc.). This kind of model does not satisfy Assumption 2.2 ({(yi, xi)} jointly S&WD). This type of nonstationarity is not serious and the OLS is applicable. Let us focus on the case

yt = α + δt + εt = x′tβ + εt,  x′t = ( 1  t ),  β = [ α  δ ]′.

α + δt is called the time trend of yt.

Definition. We say that a process is trend stationary if it can be written as the sum of a time trend and a stationary process. The process {yt} here is a special trend-stationary process where the stationary component is independent white noise.
207
3.10.1 The Asymptotic Distribution of the OLS Estimator
Let b be the OLS estimate of β based on a sample of size n:

b = [ α̂  δ̂ ]′ = (X′X)⁻¹X′y.

Proposition (2.11 - OLS estimation of the time regression). Consider the time regression yt = α + δt + εt, where εt is independent white noise with E(ε²t) = σ² and E(ε⁴t) < ∞. Then

[ √n (α̂ − α) ; n^{3/2} (δ̂ − δ) ] →d N( 0, σ² [ 1  1/2 ; 1/2  1/3 ]⁻¹ ) = N( 0, σ² [ 4  −6 ; −6  12 ] ).

As in the stationary case, α̂ is √n-consistent because √n (α̂ − α) converges to a (normal) random variable. The OLS estimate of the time coefficient, δ̂, is also consistent, but the speed of convergence is faster: it is n^{3/2}-consistent in that n^{3/2} (δ̂ − δ) converges to a random variable. In this sense, δ̂ is superconsistent.
208
We provide a simpler proof of Proposition 2.11 in the case yt = δt + εt. We have

δ̂ − δ = (X′X)⁻¹X′ε = ( Σ^n_{t=1} t² )⁻¹ Σ^n_{t=1} tεt
      = [ √Var(Σ^n_{t=1} tεt) / Σ^n_{t=1} t² ] · [ Σ^n_{t=1} tεt / √Var(Σ^n_{t=1} tεt) ]
      = [ σ√(Σ^n_{t=1} t²) / Σ^n_{t=1} t² ] Zn,

where Zn = Σ^n_{t=1} tεt / ( σ√(Σ^n_{t=1} t²) ) →d Z ∼ N(0, 1).
209
Hence

n^{3/2} (δ̂ − δ) = n^{3/2} [ σ√(Σ^n_{t=1} t²) / Σ^n_{t=1} t² ] Zn.

Since

lim_{n→∞} n^{3/2} σ√(Σ^n_{t=1} t²) / Σ^n_{t=1} t² = σ√3,

we have

n^{3/2} (δ̂ − δ) →d N(0, 3σ²). ∎
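The limit σ√3 can be checked numerically using the closed form Σ^n_{t=1} t² = n(n+1)(2n+1)/6 (with σ = 1):

```python
import math

def scaled_se(n: int) -> float:
    """n^{3/2} * sqrt(sum_{t=1}^n t^2) / sum_{t=1}^n t^2, with sigma = 1."""
    s2 = n * (n + 1) * (2 * n + 1) / 6   # closed form for the sum of squares
    return n ** 1.5 / math.sqrt(s2)      # equal to n^{3/2} sqrt(s2) / s2

print(scaled_se(10), scaled_se(10**6), math.sqrt(3))
```

The sequence increases monotonically toward √3 ≈ 1.732.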
3.10.2 Hypothesis Testing for Time Regressions
The OLS coefficient estimates of the time regression are asymptotically normal, provided the sampling error is properly scaled. Inference about δ can be based on

n^{3/2} (δ̂ − δ) / √(12 s²) →d N(0, 1) in the case yt = α + δt + εt,

n^{3/2} (δ̂ − δ) / √(3 s²) →d N(0, 1) in the case yt = δt + εt.
210
4 Endogeneity and the GMM
Consider
yi = β1zi1 + β2zi2 + ...+ βKziK + εi.
If Cov(zij, εi) ≠ 0 (or E(zijεi) ≠ 0) then we say that zij (the j-th regressor) is endogenous. It follows that E(ziεi) ≠ 0.

Definition (endogenous regressor). We say that a regressor is endogenous if it is not predetermined (i.e., not orthogonal to the error term), that is, if it does not satisfy the orthogonality condition (Assumption 2.3 does not hold).

If the regressors are endogenous we have, under Assumptions 2.1, 2.2 and 2.4,

b = β + ( (1/n) Σ^n_{i=1} ziz′i )⁻¹ (1/n) Σ^n_{i=1} ziεi →p β + Σ⁻¹zz E(ziεi) ≠ β

since E(ziεi) ≠ 0. The term Σ⁻¹zz E(ziεi) is the asymptotic bias.
211
Example (simple regression model). The OLS estimator for

yi = β1 + β2zi2 + εi

is

b = [ b1  b2 ]′ = (Z′Z)⁻¹Z′y = [ ȳ − (Ĉov(zi2, yi)/S²z2) z̄2 ;  Ĉov(zi2, yi)/S²z2 ]

where

Ĉov(zi2, yi) = (1/n) Σ (zi2 − z̄2)(yi − ȳ),  S²z2 = (1/n) Σ (zi2 − z̄2)².

Under Assumption 2.2 we have

b2 = Ĉov(zi2, yi)/S²z2 →p Cov(zi2, yi)/Var(zi2) = Cov(zi2, β1 + β2zi2 + εi)/Var(zi2) = β2 + Cov(zi2, εi)/Var(zi2).
212
b1 = ȳ − (Ĉov(zi2, yi)/S²z2) z̄2 →p E(yi) − (Cov(zi2, yi)/Var(zi2)) E(zi2)
   = β1 + β2 E(zi2) − ( β2 + Cov(zi2, εi)/Var(zi2) ) E(zi2)
   = β1 − ( Cov(zi2, εi)/Var(zi2) ) E(zi2).

If Cov(zi2, εi) = 0 then bi →p βi. If zi2 is endogenous, b1 and b2 are inconsistent. Show that

Σ⁻¹zz E(ziεi) = [ −(Cov(zi2, εi)/Var(zi2)) E(zi2) ;  Cov(zi2, εi)/Var(zi2) ].
213
4.1 Examples of Endogeneity
4.1.1 Simultaneous Equations Bias
Example. Consider

yi1 = α0 + α1yi2 + εi1
yi2 = β0 + β1yi1 + εi2

where εi1 and εi2 are independent. By construction, yi1 and yi2 are endogenous regressors. In fact, it can be proved that

Cov(yi2, εi1) = β1 Var(εi1)/(1 − β1α1) ≠ 0,
Cov(yi1, εi2) = α1 Var(εi2)/(1 − β1α1) ≠ 0.

Now

α̂1,OLS →p Cov(yi2, yi1)/Var(yi2) = Cov(yi2, α0 + α1yi2 + εi1)/Var(yi2) = α1 + Cov(yi2, εi1)/Var(yi2) ≠ α1,

β̂1,OLS →p Cov(yi2, yi1)/Var(yi1) = Cov(yi1, β0 + β1yi1 + εi2)/Var(yi1) = β1 + Cov(yi1, εi2)/Var(yi1) ≠ β1.
214
The OLS estimator is inconsistent for both α1 and β1 (and for α0 and β0 as well). Thisphenomenon is known as the simultaneous equations bias or simultaneity bias, because theregressor and the error term are often related to each other through a system of simultaneousequations.
Example. Consider

Ci = α0 + α1Yi + ui (consumption function)
Yi = Ci + Ii (GNP identity)

where Cov(ui, Ii) = 0. It can be proved that

α̂1,OLS →p α1 + (1/(1 − α1)) · Var(ui)/Var(Yi).
Example (see Hayashi). Consider

qdi = α0 + α1pi + ui (demand equation)
qsi = β0 + β1pi + vi (supply equation)
qdi = qsi (market equilibrium)
215
4.1.2 Errors-in-Variables Bias
We will see that predetermined regressor necessarily becomes endogenous when measuredwith error. This problem is ubiquitous, particularly in micro data on households.
Consider
$$y^*_i=\beta z^*_i+u_i$$
where $z^*_i$ is a predetermined regressor. The variables $y^*_i$ and $z^*_i$ are measured with error:
$$y_i=y^*_i+\varepsilon_i \qquad\text{and}\qquad z_i=z^*_i+v_i.$$
Assume that $\mathrm{E}(z^*_iu_i)=\mathrm{E}(z^*_i\varepsilon_i)=\mathrm{E}(z^*_iv_i)=\mathrm{E}(v_iu_i)=\mathrm{E}(v_i\varepsilon_i)=0$. The regression equation is
$$y_i=\beta z_i+\eta_i,\qquad \eta_i=u_i+\varepsilon_i-\beta v_i.$$
Assuming S&WD (stationarity and weak dependence) we have (after some calculations):
$$\hat\beta_{OLS}=\frac{\sum_i z_iy_i}{\sum_i z_i^2}=\frac{\sum_i z_iy_i/n}{\sum_i z_i^2/n}\xrightarrow{p}\beta-\beta\,\frac{\mathrm{E}(v_i^2)}{\mathrm{E}(z_i^2)}.$$
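The attenuation result $\beta-\beta\,\mathrm{E}(v_i^2)/\mathrm{E}(z_i^2)$ can be verified numerically. A minimal sketch with hypothetical variances (Var$(z^*)=4$, Var$(v)=1$, so the plim is $\beta\cdot 4/5=1.6$ for $\beta=2$; none of these numbers come from the slides):

```python
import numpy as np

# Measurement-error (attenuation) bias, hypothetical parameters:
# y* = beta*z* + u; observed y = y* + eps, z = z* + v.
rng = np.random.default_rng(1)
n = 200_000
beta, var_zstar, var_v = 2.0, 4.0, 1.0

zstar = rng.normal(0.0, np.sqrt(var_zstar), n)
u = rng.normal(0.0, 1.0, n)
y = beta * zstar + u + rng.normal(0.0, 1.0, n)     # y* + eps
z = zstar + rng.normal(0.0, np.sqrt(var_v), n)     # z* + v

beta_ols = (z @ y) / (z @ z)                       # OLS without intercept
plim_beta_ols = beta - beta * var_v / (var_zstar + var_v)
print(beta_ols, plim_beta_ols)
```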
216
4.1.3 Omitted Variable Bias
Consider the “long regression”
$$y=X_1\beta_1+X_2\beta_2+u$$
and suppose that this model satisfies Assumptions 2.1-2.4 (hence OLS based on this equation is consistent). However, for some reason $X_2$ is not included in the regression model (“short regression”):
$$y=X_1\beta_1+\varepsilon,\qquad \varepsilon=X_2\beta_2+u.$$
We are interested only in $\beta_1$. We have
$$b_1=(X_1'X_1)^{-1}X_1'y=(X_1'X_1)^{-1}X_1'(X_1\beta_1+X_2\beta_2+u)=\beta_1+(X_1'X_1)^{-1}X_1'X_2\beta_2+(X_1'X_1)^{-1}X_1'u$$
$$=\beta_1+\left(\frac{X_1'X_1}{n}\right)^{-1}\frac{X_1'X_2}{n}\,\beta_2+\left(\frac{X_1'X_1}{n}\right)^{-1}\frac{X_1'u}{n}.$$
217
This expression converges in probability to
$$\beta_1+\Sigma_{x_1x_1}^{-1}\Sigma_{x_1x_2}\beta_2.$$
The conclusion is that $b_1$ is inconsistent if there are omitted variables that are correlated with $X_1$. The variables in $X_1$ are endogenous as long as $\mathrm{Cov}(X_1,X_2)\neq 0$, since
$$\mathrm{Cov}(X_1,\varepsilon)=\mathrm{Cov}(X_1,X_2\beta_2+u)=\mathrm{Cov}(X_1,X_2)\beta_2.$$
Example. Consider the problem of unobserved ability in a wage equation for working adults. A simple model is
$$\log(WAGE_i)=\beta_1+\beta_2\,educ_i+\beta_3\,abil_i+u_i$$
where $u_i$ is the error term. We put $abil_i$ into the error term, and we are left with the simple regression model
$$\log(WAGE_i)=\beta_1+\beta_2\,educ_i+\varepsilon_i$$
where $\varepsilon_i=\beta_3\,abil_i+u_i$.
218
OLS will be an inconsistent estimator of $\beta_2$ if $educ_i$ and $abil_i$ are correlated. In effect,
$$b_2\xrightarrow{p}\beta_2+\frac{\mathrm{Cov}(educ_i,\varepsilon_i)}{\mathrm{Var}(educ_i)}=\beta_2+\frac{\mathrm{Cov}(educ_i,\beta_3\,abil_i+u_i)}{\mathrm{Var}(educ_i)}=\beta_2+\beta_3\,\frac{\mathrm{Cov}(educ_i,abil_i)}{\mathrm{Var}(educ_i)}.$$
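A quick simulation of this formula, with hypothetical parameters ($\beta_2=0.08$, $\beta_3=0.04$, and educ built to be correlated with ability; none of these numbers come from the slides):

```python
import numpy as np

# Omitted-variable bias in a wage equation (hypothetical parameters).
rng = np.random.default_rng(2)
n = 200_000
b1, b2, b3 = 0.5, 0.08, 0.04

abil = rng.normal(0.0, 1.0, n)
educ = 12.0 + 2.0 * abil + rng.normal(0.0, 2.0, n)  # educ correlated with ability
lw = b1 + b2 * educ + b3 * abil + rng.normal(0.0, 0.3, n)

# Short-regression OLS slope of log-wage on educ
b2_short = np.cov(educ, lw)[0, 1] / np.var(educ, ddof=1)

# Theoretical plim: b2 + b3*Cov(educ, abil)/Var(educ) = 0.08 + 0.04*2/8 = 0.09
plim_b2 = b2 + b3 * 2.0 / 8.0
print(b2_short, plim_b2)
```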
219
4.2 The General Formulation
4.2.1 Regressors and Instruments
Definition. $x_i$ is an instrumental variable (IV) for $z_i$ if (1) $x_i$ is uncorrelated with $\varepsilon_i$, that is, $\mathrm{Cov}(x_i,\varepsilon_i)=0$ (thus, $x_i$ is a predetermined variable), and (2) $x_i$ is correlated with $z_i$, that is, $\mathrm{Cov}(x_i,z_i)\neq 0$.
Exercise 4.1. Consider $\log(wage_i)=\beta_1+\beta_2\,educ_i+\varepsilon_i$. Omitted variable: ability. (a) Is educ an endogenous variable? (b) Can IQ be considered an IV for educ? And mother’s education?

Exercise 4.2. Consider $children_i=\beta_1+\beta_2\,mothereduc_i+\beta_3\,motherage_i+\varepsilon_i$. Omitted variable: $bcm_i$, a dummy equal to one if the mother is informed about birth control methods. (a) Is mothereduc endogenous? (b) Suggest an IV for mothereduc.

Exercise 4.3. Consider $score_i=\beta_1+\beta_2\,skipped_i+\varepsilon_i$. Omitted variable: motivation. (a) Is $skipped_i$ endogenous? (b) Can the distance between home (or living quarters) and university be considered an IV?
220
Exercise 4.4. (Wooldridge, Chap. 15) Consider a simple model to estimate the effect of personal computer (PC) ownership on college grade point average for graduating seniors at a large public university:
$$GPA_i=\beta_1+\beta_2\,PC_i+\varepsilon_i$$
where PC is a binary variable indicating PC ownership. (a) Why might PC ownership be correlated with $\varepsilon_i$? (b) Explain why PC is likely to be related to parents’ annual income. Does this mean parental income is a good IV for PC? Why or why not? (c) Suppose that, four years ago, the university gave grants to buy computers to roughly one-half of the incoming students, and the students who received grants were randomly chosen. Carefully explain how you would use this information to construct an instrumental variable for PC. (d) Same question as (c), but suppose that the university gave grant priority to low-income students.

(See the use of IV in errors-in-variables problems in Wooldridge’s textbook.)
221
Assumption (3.1 - linearity). The equation to be estimated is linear:
$$y_i=z_i'\delta+\varepsilon_i,\qquad i=1,2,\ldots,n,$$
where $z_i$ is an $L$-dimensional vector of regressors, $\delta$ is an $L$-dimensional coefficient vector and $\varepsilon_i$ is an unobservable error term.

Assumption (3.2 - S&WD). Let $x_i$ be a $K$-dimensional vector to be referred to as the vector of instruments, and let $w_i$ be the unique and nonconstant elements of $(y_i,z_i,x_i)$. $\{w_i\}$ is jointly stationary and weakly dependent.

Assumption (3.3 - orthogonality conditions). All the $K$ variables in $x_i$ are predetermined in the sense that they are all orthogonal to the current error term: $\mathrm{E}(x_{ik}\varepsilon_i)=0$ for all $i$ and $k$. This can be written as
$$\mathrm{E}\big(x_i(y_i-z_i'\delta)\big)=0 \quad\text{or}\quad \mathrm{E}(g_i)=0$$
where $g_i=x_i\varepsilon_i$.

Notice: $x_i$ should include the “1” (constant). Not only can $x_{i1}=1$ be considered an IV, it also guarantees that $\mathrm{E}\big(1\cdot(y_i-z_i'\delta)\big)=0\Leftrightarrow\mathrm{E}(\varepsilon_i)=0$.
222
Example (3.1). Consider
$$q_i=\alpha_0+\alpha_1 p_i+u_i \quad\text{(demand equation)}$$
where $\mathrm{Cov}(p_i,u_i)\neq 0$, and $x_i$ is such that $\mathrm{Cov}(x_i,p_i)\neq 0$ but $\mathrm{Cov}(x_i,u_i)=0$. Using previous notation we have:
$$y_i=q_i,\qquad z_i=\begin{bmatrix}1\\ p_i\end{bmatrix},\qquad \delta=\begin{bmatrix}\alpha_0\\ \alpha_1\end{bmatrix},\quad L=2,\qquad x_i=\begin{bmatrix}1\\ x_i\end{bmatrix},\quad K=2,\qquad w_i=\begin{bmatrix}q_i\\ p_i\\ x_i\end{bmatrix}.$$
In the above example, $x_i$ and $z_i$ share the same variable (a constant). The instruments that are also regressors are called predetermined regressors, and the rest of the regressors, those that are not included in $x_i$, are called endogenous regressors.
223
Example (3.2 - wage equation). Consider
$$LW_i=\delta_1+\delta_2 S_i+\delta_3 EXPR_i+\delta_4 IQ_i+\varepsilon_i,$$
where:
$LW_i$ is the log wage of individual $i$;
$S_i$ is completed years of schooling (we assume predetermined);
$EXPR_i$ is experience in years (we assume predetermined);
$IQ_i$ is IQ (an error-ridden measure of the individual’s ability; it is endogenous due to the errors-in-variables problem).
We also have information on:
$AGE_i$ (age of the individual - predetermined);
$MED_i$ (mother’s education in years - predetermined).
Note: $AGE_i$ is excluded from the wage equation, reflecting the underlying assumption that, once experience is controlled for, age has no effect on the wage rate.
224
In terms of the general model,
$$y_i=LW_i,\qquad z_i=\begin{bmatrix}1\\ S_i\\ EXPR_i\\ IQ_i\end{bmatrix},\qquad \delta=\begin{bmatrix}\delta_1\\ \delta_2\\ \delta_3\\ \delta_4\end{bmatrix},\quad L=4,\qquad x_i=\begin{bmatrix}1\\ S_i\\ EXPR_i\\ AGE_i\\ MED_i\end{bmatrix},\quad K=5,$$
$$w_i'=\begin{bmatrix}LW_i & S_i & EXPR_i & IQ_i & AGE_i & MED_i\end{bmatrix}.$$
225
4.2.2 Identification
The GMM estimation of the parameter vector $\delta$ is about how to exploit the information afforded by the orthogonality conditions
$$\mathrm{E}\big(x_i(y_i-z_i'\delta)\big)=0 \;\Leftrightarrow\; \mathrm{E}(x_iz_i')\,\delta=\mathrm{E}(x_iy_i).$$
$\mathrm{E}(x_iz_i')\,\delta=\mathrm{E}(x_iy_i)$ can be interpreted as a linear system with $K$ equations where $\delta$ is the unknown vector. Notice: $\mathrm{E}(x_iz_i')$ is a $K\times L$ matrix and $\mathrm{E}(x_iy_i)$ is a $K\times 1$ vector. Can we solve the system with respect to $\delta$? We need to study the identification of the system.

Assumption (3.4 - rank condition for identification). The $K\times L$ matrix $\mathrm{E}(x_iz_i')$ is of full column rank (i.e., its rank equals $L$, the number of its columns). We denote this matrix by $\Sigma_{xz}$.
226
Example. Consider example 3.2 where
$$x_i=\begin{bmatrix}1\\ S_i\\ EXPR_i\\ AGE_i\\ MED_i\end{bmatrix},\qquad z_i=\begin{bmatrix}1\\ S_i\\ EXPR_i\\ IQ_i\end{bmatrix}.$$
We have
$$x_iz_i'=\begin{bmatrix}1\\ S_i\\ EXPR_i\\ AGE_i\\ MED_i\end{bmatrix}\begin{bmatrix}1 & S_i & EXPR_i & IQ_i\end{bmatrix}=\begin{bmatrix}1 & S_i & EXPR_i & IQ_i\\ S_i & S_i^2 & S_iEXPR_i & S_iIQ_i\\ EXPR_i & EXPR_iS_i & EXPR_i^2 & EXPR_iIQ_i\\ AGE_i & AGE_iS_i & AGE_iEXPR_i & AGE_iIQ_i\\ MED_i & MED_iS_i & MED_iEXPR_i & MED_iIQ_i\end{bmatrix}.$$
227
$$\mathrm{E}(x_iz_i')=\Sigma_{xz}=\begin{bmatrix}1 & \mathrm{E}(S_i) & \mathrm{E}(EXPR_i) & \mathrm{E}(IQ_i)\\ \mathrm{E}(S_i) & \mathrm{E}(S_i^2) & \mathrm{E}(S_iEXPR_i) & \mathrm{E}(S_iIQ_i)\\ \mathrm{E}(EXPR_i) & \mathrm{E}(EXPR_iS_i) & \mathrm{E}(EXPR_i^2) & \mathrm{E}(EXPR_iIQ_i)\\ \mathrm{E}(AGE_i) & \mathrm{E}(AGE_iS_i) & \mathrm{E}(AGE_iEXPR_i) & \mathrm{E}(AGE_iIQ_i)\\ \mathrm{E}(MED_i) & \mathrm{E}(MED_iS_i) & \mathrm{E}(MED_iEXPR_i) & \mathrm{E}(MED_iIQ_i)\end{bmatrix}.$$
Assumption 3.4 requires that $\mathrm{rank}(\Sigma_{xz})=4$.
228
4.2.3 Order Condition for Identification
Since $\mathrm{rank}(\Sigma_{xz})\leq\min\{K,L\}$ we have: if $K<L$ then $\mathrm{rank}(\Sigma_{xz})<L$. Thus a necessary condition for identification is that $K\geq L$.

Definition (order condition for identification). $K\geq L$, i.e.
$$\underbrace{\#\text{orthogonality conditions}}_{K}\;\geq\;\underbrace{\#\text{parameters}}_{L}.$$

Definition. We say that the equation is overidentified if the rank condition is satisfied and $K>L$; exactly identified (or just identified) if the rank condition is satisfied and $K=L$; and underidentified (or not identified) if the order condition is not satisfied (i.e., if $K<L$).
229
Example. Consider the system $Ax=b$, with $A=\mathrm{E}(x_iz_i')$ and $b=\mathrm{E}(x_iy_i)$. It can be proved that the system is always “possible” (it has at least one solution). Consider the following scenarios:

1. If $\mathrm{rank}(A)=L$ and $K=L$ the SLE is exactly identified. Example:
$$\begin{bmatrix}1 & 1\\ 0 & 1\end{bmatrix}\begin{bmatrix}x_1\\ x_2\end{bmatrix}=\begin{bmatrix}3\\ 1\end{bmatrix}\Rightarrow\begin{cases}x_1=2\\ x_2=1\end{cases}$$
Note: $\mathrm{rank}(A)=2=L=K$.

2. If $\mathrm{rank}(A)=L$ and $K>L$ the SLE is overidentified. Example:
$$\begin{bmatrix}1 & 1\\ 0 & 1\\ 0 & 1\end{bmatrix}\begin{bmatrix}x_1\\ x_2\end{bmatrix}=\begin{bmatrix}3\\ 1\\ 1\end{bmatrix}\Rightarrow\begin{cases}x_1=2\\ x_2=1\end{cases}$$
Note: $\mathrm{rank}(A)=2=L$ and $K=3$.
230
3. If $\mathrm{rank}(A)<L$ the SLE is underidentified. Example:
$$\begin{bmatrix}1 & 1\\ 2 & 2\end{bmatrix}\begin{bmatrix}x_1\\ x_2\end{bmatrix}=\begin{bmatrix}2\\ 4\end{bmatrix}\Rightarrow x_1=2-x_2,\;x_2\in\mathbb{R}$$
Note: $\mathrm{rank}(A)=1<L$.

4. If $K<L$ then $\mathrm{rank}(A)<L$ and the SLE is underidentified. Example:
$$\begin{bmatrix}1 & 1\end{bmatrix}\begin{bmatrix}x_1\\ x_2\end{bmatrix}=1\Rightarrow x_1=1-x_2,\;x_2\in\mathbb{R}$$
Note: $\mathrm{rank}(A)=1$ and $K=1<L=2$.
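The four scenarios can be checked mechanically with numpy. The helper name `identification` below is ours, introduced only for this sketch:

```python
import numpy as np

def identification(A):
    """Classify the linear system Ax = b by the order and rank conditions."""
    K, L = A.shape
    if np.linalg.matrix_rank(A) < L:
        return "underidentified"
    return "exactly identified" if K == L else "overidentified"

# The four example coefficient matrices from the slides
cases = {
    "1": np.array([[1.0, 1.0], [0.0, 1.0]]),
    "2": np.array([[1.0, 1.0], [0.0, 1.0], [0.0, 1.0]]),
    "3": np.array([[1.0, 1.0], [2.0, 2.0]]),
    "4": np.array([[1.0, 1.0]]),
}
status = {k: identification(A) for k, A in cases.items()}
print(status)
```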
231
4.2.4 The Assumption for Asymptotic Normality
Assumption (3.5 - $\{g_i\}$ is a martingale difference sequence with finite second moments). Let $g_i=x_i\varepsilon_i$. $\{g_i\}$ is a martingale difference sequence (so $\mathrm{E}(g_i)=0$). The $K\times K$ matrix of cross moments, $\mathrm{E}(g_ig_i')$, is nonsingular. Let $S=\mathrm{Avar}(\bar g)$.

Remarks:

• Assumption 3.5 implies $\mathrm{Avar}(\bar g)=\lim\mathrm{Var}(\sqrt{n}\,\bar g)=\mathrm{E}(g_ig_i')$.

• Assumption 3.5 implies $\sqrt{n}\,\bar g\xrightarrow{d}N\big(0,\mathrm{Avar}(\bar g)\big)$.

• If the instruments include a constant, then this assumption implies that the error is a martingale difference sequence (and a fortiori serially uncorrelated).
232
• A sufficient and perhaps easier to understand condition for Assumption 3.5 is that $\mathrm{E}(\varepsilon_i|\mathcal{F}_i)=0$, where
$$\mathcal{I}_{i-1}=\{\varepsilon_{i-1},\varepsilon_{i-2},\ldots,\varepsilon_1,x_{i-1},\ldots,x_1\},\qquad \mathcal{F}_i=\mathcal{I}_{i-1}\cup\{x_i\}=\{\varepsilon_{i-1},\varepsilon_{i-2},\ldots,\varepsilon_1,x_i,x_{i-1},\ldots,x_1\}.$$
It implies the error term is orthogonal not only to the current but also to the past instruments.

• Since $g_ig_i'=\varepsilon_i^2x_ix_i'$, $S$ is a matrix of fourth moments. Consistent estimation of $S$ will require a fourth-moment assumption, to be specified in Assumption 3.6 below.

• If $\{g_i\}$ is serially correlated, then $S$ does not equal $\mathrm{E}(g_ig_i')$ and will take a more complicated form.
233
4.3 Generalized Method of Moments (GMM) Defined
The method of moments principle: to estimate a feature of the population, use the corresponding feature of the sample.

Examples:

Parameter of the population | Estimator
$\mathrm{E}(y_i)$ | $\bar Y$
$\mathrm{Var}(y_i)$ | $S_y^2$
$\mathrm{E}\big(x_i(y_i-z_i'\delta)\big)$ | $\frac{1}{n}\sum_i x_i(y_i-z_i'\delta)$

Method of moments: choose the parameter estimate so that the corresponding sample moments are also equal to zero. Since we know that $\mathrm{E}\big(x_i(y_i-z_i'\delta)\big)=0$ we choose the parameter estimate $\hat\delta$ so that
$$\frac{1}{n}\sum_{i=1}^n x_i(y_i-z_i'\hat\delta)=0.$$
234
Another way of writing $\frac{1}{n}\sum_{i=1}^n x_i(y_i-z_i'\hat\delta)=0$:
$$\frac{1}{n}\sum_{i=1}^n g_i=0 \;\Leftrightarrow\; \underbrace{\frac{1}{n}\sum_{i=1}^n g_i(w_i;\hat\delta)}_{\bar g_n(\hat\delta)}=0 \;\Leftrightarrow\; \bar g_n(\hat\delta)=0.$$
Let’s expand $\bar g_n(\hat\delta)=0$:
$$\frac{1}{n}\sum_{i=1}^n x_i(y_i-z_i'\hat\delta)=0 \;\Rightarrow\; \frac{1}{n}\sum_{i=1}^n x_iy_i-\frac{1}{n}\sum_{i=1}^n x_iz_i'\hat\delta=0 \;\Rightarrow\; \frac{1}{n}\sum_{i=1}^n x_iz_i'\hat\delta=\frac{1}{n}\sum_{i=1}^n x_iy_i \;\Rightarrow\; S_{xz}\hat\delta=s_{xy}.$$
235
Thus:

• $\underset{(K\times L)}{S_{xz}}\;\underset{(L\times 1)}{\hat\delta}=\underset{(K\times 1)}{s_{xy}}$ is a system with $K$ (linear) equations in $L$ unknowns.

• $S_{xz}\hat\delta=s_{xy}$ is the sample analogue of $\mathrm{E}\big(x_i(y_i-z_i'\delta)\big)=0$, that is, of $\mathrm{E}(x_iz_i')\,\delta=\mathrm{E}(x_iy_i)$.
236
4.3.1 Method of Moments
Consider
$$S_{xz}\hat\delta=s_{xy}.$$
If $K=L$ and $\mathrm{rank}(\Sigma_{xz})=L$, then $\Sigma_{xz}:=\mathrm{E}(x_iz_i')$ is invertible and $S_{xz}$ is invertible (in probability, for $n$ large enough).

Solving $S_{xz}\hat\delta=s_{xy}$ with respect to $\hat\delta$ gives
$$\hat\delta_{IV}=S_{xz}^{-1}s_{xy}=\left(\frac{1}{n}\sum_{i=1}^n x_iz_i'\right)^{-1}\frac{1}{n}\sum_{i=1}^n x_iy_i=\left(\sum_{i=1}^n x_iz_i'\right)^{-1}\sum_{i=1}^n x_iy_i=(X'Z)^{-1}X'y.$$
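A sketch of the IV estimator $(X'Z)^{-1}X'y$ on simulated data, compared with OLS. The DGP is hypothetical: $z_{i2}$ is endogenous (it loads on the error) and $x_{i2}$ is a valid instrument.

```python
import numpy as np

# Hypothetical DGP: y = d1 + d2*z2 + eps, with z2 endogenous and
# x2 a valid instrument (Cov(x2, eps) = 0, Cov(x2, z2) != 0).
rng = np.random.default_rng(3)
n = 100_000
d1, d2 = 1.0, 2.0

x2 = rng.normal(0.0, 1.0, n)
eps = rng.normal(0.0, 1.0, n)
z2 = 0.8 * x2 + 0.5 * eps + rng.normal(0.0, 1.0, n)  # endogenous regressor
y = d1 + d2 * z2 + eps

Z = np.column_stack([np.ones(n), z2])
X = np.column_stack([np.ones(n), x2])

delta_iv = np.linalg.solve(X.T @ Z, X.T @ y)    # (X'Z)^{-1} X'y
delta_ols = np.linalg.solve(Z.T @ Z, Z.T @ y)   # for comparison
print(delta_iv, delta_ols)
```

The IV estimates land near (1, 2), while the OLS slope converges to $d_2+\mathrm{Cov}(z_{i2},\varepsilon_i)/\mathrm{Var}(z_{i2})\approx 2.26$ under these hypothetical values.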
237
Example. Consider
$$y_i=\delta_1+\delta_2 z_{i2}+\varepsilon_i$$
and suppose that $\mathrm{Cov}(z_{i2},\varepsilon_i)\neq 0$, that is, $z_{i2}$ is an endogenous variable. We have $L=2$, so we need at least $K=2$ instrumental variables. Let $x_i'=(1\;\;x_{i2})$ and suppose that $\mathrm{Cov}(x_{i2},\varepsilon_i)=0$ and $\mathrm{Cov}(x_{i2},z_{i2})\neq 0$. Thus an IV estimator is
$$\hat\delta_{IV}=(X'Z)^{-1}X'y.$$

Exercise 4.5. Consider the previous example. (a) Show that the IV estimator $\hat\delta_{2,IV}$ can be written as
$$\hat\delta_{2,IV}=\frac{\sum_{i=1}^n (x_{i2}-\bar x_2)(y_i-\bar y)}{\sum_{i=1}^n (x_{i2}-\bar x_2)(z_{i2}-\bar z_2)}.$$
(b) Show $\mathrm{Cov}(x_{i2},y_i)=\delta_2\,\mathrm{Cov}(x_{i2},z_{i2})+\mathrm{Cov}(x_{i2},\varepsilon_i)$; (c) Based on part (b), show that $\hat\delta_{2,IV}\xrightarrow{p}\delta_2$ (write the assumptions you need to prove these results).
238
4.3.2 GMM
It may happen that $K>L$ (there are more orthogonality conditions than parameters). In principle, it is better to have as many IVs as possible, so the case $K>L$ is desirable, but then the system $S_{xz}\hat\delta=s_{xy}$ may not have a solution.

Example. Suppose
$$S_{xz}=\begin{bmatrix}1.00 & 0.097 & 0.099\\ 0.097 & 1.011 & -0.059\\ 0.099 & -0.059 & 0.967\\ -0.182 & 0.203 & -0.031\end{bmatrix},\qquad s_{xy}=\begin{bmatrix}1.954\\ 1.346\\ -0.900\\ -0.0262\end{bmatrix}$$
($K=4$, $L=3$) and try (if you can) to solve $S_{xz}\hat\delta=s_{xy}$. This system is of the same type as
$$\delta_1+\delta_2=1,\qquad \delta_3=1,\qquad \delta_4+\delta_5=5,\qquad \delta_1+\delta_2=2$$
(the first and fourth equations are incompatible - the system is impossible - there is no solution).
239
This means we cannot set $\bar g_n(\hat\delta)$ exactly equal to $0$. However, we can at least choose $\hat\delta$ so that $\bar g_n(\hat\delta)$ is as close to $0$ as possible. In Linear Algebra two vectors are “close” if the distance between them is relatively small. We will define the distance in $\mathbb{R}^K$ as follows: the distance between $\xi$ and $\eta$ is equal to
$$(\xi-\eta)'\,\hat{W}\,(\xi-\eta)$$
where $\hat{W}$, called the weighting matrix, is a symmetric positive definite matrix defining the distance.

Example. If
$$\xi=\begin{bmatrix}1\\ 2\end{bmatrix},\qquad \eta=\begin{bmatrix}3\\ 5\end{bmatrix},\qquad \hat{W}=\begin{bmatrix}1 & 0\\ 0 & 1\end{bmatrix},$$
the distance between these two vectors is
$$(\xi-\eta)'\,\hat{W}\,(\xi-\eta)=\begin{bmatrix}1-3 & 2-5\end{bmatrix}\begin{bmatrix}1-3\\ 2-5\end{bmatrix}=2^2+3^2=13.$$
240
Definition (3.1 - GMM estimator). Let $\hat{W}$ be a $K\times K$ symmetric positive definite matrix, possibly dependent on the sample, such that $\hat{W}\xrightarrow{p}W$ as $n\to\infty$, with $W$ symmetric and positive definite. The GMM estimator of $\delta$, denoted $\hat\delta(\hat{W})$, is
$$\hat\delta(\hat{W})=\arg\min_\delta J(\delta,\hat{W})$$
where
$$J(\delta,\hat{W})=n\,\bar g_n(\delta)'\,\hat{W}\,\bar g_n(\delta)=n\,(s_{xy}-S_{xz}\delta)'\,\hat{W}\,(s_{xy}-S_{xz}\delta).$$

Proposition. Under Assumptions 3.2 and 3.4 the GMM estimator is
$$\hat\delta(\hat{W})=(S_{xz}'\hat{W}S_{xz})^{-1}S_{xz}'\hat{W}s_{xy}.$$
To prove this proposition you need the following rule:
$$\frac{\partial(q'\hat{W}q)}{\partial\delta}=2\,\frac{\partial q'}{\partial\delta}\,\hat{W}q$$
where $q$ is a $K\times 1$ vector depending on $\delta$ and $\hat{W}$ is a $K\times K$ matrix not depending on $\delta$.
241
If $K=L$ then $S_{xz}$ is invertible and $\hat\delta(\hat{W})$ reduces to the IV estimator:
$$\hat\delta(\hat{W})=(S_{xz}'\hat{W}S_{xz})^{-1}S_{xz}'\hat{W}s_{xy}=S_{xz}^{-1}\hat{W}^{-1}(S_{xz}')^{-1}S_{xz}'\hat{W}s_{xy}=S_{xz}^{-1}s_{xy}=\hat\delta_{IV}.$$
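The closed form $(S_{xz}'\hat{W}S_{xz})^{-1}S_{xz}'\hat{W}s_{xy}$ can be coded directly, together with a numerical check that with $K=L$ it collapses to $S_{xz}^{-1}s_{xy}$ whatever the weighting matrix. The data and the s.p.d. matrix `W` below are arbitrary and hypothetical:

```python
import numpy as np

# Hypothetical exactly identified model (K = L = 2).
rng = np.random.default_rng(4)
n = 50_000
x2 = rng.normal(0.0, 1.0, n)
eps = rng.normal(0.0, 1.0, n)
z2 = 0.8 * x2 + 0.5 * eps + rng.normal(0.0, 1.0, n)
y = 1.0 + 2.0 * z2 + eps
Z = np.column_stack([np.ones(n), z2])
X = np.column_stack([np.ones(n), x2])

Sxz = X.T @ Z / n
sxy = X.T @ y / n

def gmm(W):
    """GMM closed form: (Sxz' W Sxz)^{-1} Sxz' W sxy."""
    return np.linalg.solve(Sxz.T @ W @ Sxz, Sxz.T @ W @ sxy)

W = np.array([[2.0, 0.3], [0.3, 1.0]])          # arbitrary s.p.d. weight
delta_w = gmm(W)
delta_iv = np.linalg.solve(Sxz, sxy)             # Sxz^{-1} sxy
print(delta_w, delta_iv)
```

Since $K=L$ here, the two estimates coincide up to floating-point error, as the algebra above predicts.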
4.3.3 Sampling Error
The GMM estimator can be written as
$$\hat\delta(\hat{W})=\delta+(S_{xz}'\hat{W}S_{xz})^{-1}S_{xz}'\hat{W}\bar g.$$
242
Proof: First consider
$$s_{xy}=\frac{1}{n}\sum_i x_iy_i=\frac{1}{n}\sum_i x_i(z_i'\delta+\varepsilon_i)=\frac{1}{n}\sum_i x_iz_i'\delta+\frac{1}{n}\sum_i x_i\varepsilon_i=S_{xz}\delta+\bar g.$$
Replacing $s_{xy}=S_{xz}\delta+\bar g$ into $\hat\delta(\hat{W})=(S_{xz}'\hat{W}S_{xz})^{-1}S_{xz}'\hat{W}s_{xy}$ produces:
$$\hat\delta(\hat{W})=(S_{xz}'\hat{W}S_{xz})^{-1}S_{xz}'\hat{W}(S_{xz}\delta+\bar g)=(S_{xz}'\hat{W}S_{xz})^{-1}S_{xz}'\hat{W}S_{xz}\delta+(S_{xz}'\hat{W}S_{xz})^{-1}S_{xz}'\hat{W}\bar g=\delta+(S_{xz}'\hat{W}S_{xz})^{-1}S_{xz}'\hat{W}\bar g.$$
243
4.4 Large-Sample Properties of GMM
4.4.1 Asymptotic Distribution of the GMM Estimator
Proposition (3.1 - asymptotic distribution of the GMM estimator). (a) (Consistency) Under Assumptions 3.1-3.4, $\hat\delta(\hat{W})\xrightarrow{p}\delta$; (b) (Asymptotic Normality) If Assumption 3.3 is strengthened to Assumption 3.5, then
$$\sqrt{n}\,\big(\hat\delta(\hat{W})-\delta\big)\xrightarrow{d}N\big(0,\mathrm{Avar}(\hat\delta(\hat{W}))\big)$$
where
$$\mathrm{Avar}(\hat\delta(\hat{W}))=(\Sigma_{xz}'W\Sigma_{xz})^{-1}\Sigma_{xz}'WSW\Sigma_{xz}(\Sigma_{xz}'W\Sigma_{xz})^{-1}.$$
Recall: $S\equiv\mathrm{E}(g_ig_i')$. (c) (Consistent Estimate of $\mathrm{Avar}(\hat\delta(\hat{W}))$) Suppose there is available a consistent estimator, $\hat{S}$, of $S$. Then, under Assumption 3.2, $\mathrm{Avar}(\hat\delta(\hat{W}))$ is consistently estimated by
$$\widehat{\mathrm{Avar}}(\hat\delta(\hat{W}))=(S_{xz}'\hat{W}S_{xz})^{-1}S_{xz}'\hat{W}\hat{S}\hat{W}S_{xz}(S_{xz}'\hat{W}S_{xz})^{-1}.$$
244
4.4.2 Estimation of Error Variance
Proposition (3.2 - consistent estimation of error variance). For any consistent estimator $\hat\delta$, under Assumptions 3.1 and 3.2 and the assumptions that $\mathrm{E}(z_iz_i')$ and $\mathrm{E}(\varepsilon_i^2)$ exist and are finite, we have
$$\frac{1}{n}\sum_{i=1}^n\hat\varepsilon_i^2\xrightarrow{p}\mathrm{E}(\varepsilon_i^2),\qquad\text{where }\hat\varepsilon_i\equiv y_i-z_i'\hat\delta.$$
4.4.3 Hypothesis Testing
Proposition (3.3 - robust t-ratio and Wald statistics). Suppose Assumptions 3.1-3.5 hold, and suppose there is available a consistent estimate $\hat{S}$ of $S$ ($\equiv\mathrm{Avar}(\bar g)=\mathrm{E}(g_ig_i')$). Let
$$\widehat{\mathrm{Avar}}(\hat\delta(\hat{W}))=(S_{xz}'\hat{W}S_{xz})^{-1}S_{xz}'\hat{W}\hat{S}\hat{W}S_{xz}(S_{xz}'\hat{W}S_{xz})^{-1}.$$
245
Then (a) under the null $H_0:\delta_j=\delta_j^0$,
$$t_j=\frac{\sqrt{n}\,\big(\hat\delta_j(\hat{W})-\delta_j^0\big)}{\sqrt{\big(\widehat{\mathrm{Avar}}(\hat\delta(\hat{W}))\big)_{jj}}}=\frac{\hat\delta_j(\hat{W})-\delta_j^0}{SE_j}\xrightarrow{d}N(0,1)$$
where $\big(\widehat{\mathrm{Avar}}(\hat\delta(\hat{W}))\big)_{jj}$ is the $(j,j)$ element of $\widehat{\mathrm{Avar}}(\hat\delta(\hat{W}))$ and
$$SE_j=\sqrt{\frac{1}{n}\big(\widehat{\mathrm{Avar}}(\hat\delta(\hat{W}))\big)_{jj}}.$$
(b) Under the null hypothesis $H_0:R\delta=r$, where $p$ is the number of restrictions and $R$ ($p\times L$) is of full row rank,
$$\mathcal{W}=n\,\big(R\hat\delta(\hat{W})-r\big)'\big(R\,\widehat{\mathrm{Avar}}(\hat\delta(\hat{W}))\,R'\big)^{-1}\big(R\hat\delta(\hat{W})-r\big)\xrightarrow{d}\chi^2_{(p)}.$$
246
4.4.4 Estimation of S
Let
$$\hat{S}\equiv\frac{1}{n}\sum_{i=1}^n\hat\varepsilon_i^2\,x_ix_i',\qquad\text{where }\hat\varepsilon_i\equiv y_i-z_i'\hat\delta.$$

Assumption (3.6 - finite fourth moments). $\mathrm{E}\big((x_{ik}z_{i\ell})^2\big)$ exists and is finite for all $k=1,\ldots,K$ and $\ell=1,\ldots,L$.

Proposition (3.4 - consistent estimation of S). Suppose $\hat\delta$ is consistent and $S=\mathrm{E}(g_ig_i')$ exists and is finite. Then under Assumptions 3.1, 3.2 and 3.6 the estimator $\hat{S}$ above is consistent.
247
4.4.5 Efficient GMM Estimator
The next proposition provides a choice of $\hat{W}$ that minimizes the asymptotic variance.

Proposition (3.5 - optimal choice of the weighting matrix). If $\hat{W}$ is chosen such that
$$\hat{W}\xrightarrow{p}S^{-1}$$
then the lower bound for the asymptotic variance of the GMM estimators is reached, which is equal to
$$(\Sigma_{xz}'S^{-1}\Sigma_{xz})^{-1}.$$

Definition. The estimator
$$\hat\delta(\hat{S}^{-1})=\arg\min_\delta n\,\bar g_n(\delta)'\,\hat{S}^{-1}\,\bar g_n(\delta)$$
is called the efficient GMM estimator.
248
The efficient GMM estimator can be written as
$$\hat\delta(\hat{S}^{-1})=(S_{xz}'\hat{S}^{-1}S_{xz})^{-1}S_{xz}'\hat{S}^{-1}s_{xy}$$
and
$$\mathrm{Avar}(\hat\delta(\hat{S}^{-1}))=(\Sigma_{xz}'S^{-1}\Sigma_{xz})^{-1},\qquad \widehat{\mathrm{Avar}}(\hat\delta(\hat{S}^{-1}))=(S_{xz}'\hat{S}^{-1}S_{xz})^{-1}.$$
249
To calculate the efficient GMM estimator, we need the consistent estimator $\hat{S}$, which depends on $\hat\varepsilon_i$. This leads us to the following two-step efficient GMM procedure:

Step 1: Compute $\hat{S}\equiv\frac{1}{n}\sum_{i=1}^n\hat\varepsilon_i^2\,x_ix_i'$, where $\hat\varepsilon_i=y_i-z_i'\hat\delta$. To obtain $\hat\delta$:
$$\hat\delta(\hat{W})=\arg\min_\delta n\,(s_{xy}-S_{xz}\delta)'\,\hat{W}\,(s_{xy}-S_{xz}\delta)$$
where $\hat{W}$ is a matrix that converges in probability to a symmetric and positive definite matrix, for example $\hat{W}=S_{xx}^{-1}$. With this choice, use the (so-called) 2SLS estimator $\hat\delta(S_{xx}^{-1})$ to obtain the residuals $\hat\varepsilon_i=y_i-z_i'\hat\delta$ and $\hat{S}\equiv\frac{1}{n}\sum_{i=1}^n\hat\varepsilon_i^2\,x_ix_i'$.

Step 2: Minimize $J(\delta,\hat{S}^{-1})$ with respect to $\delta$. The minimizer is the efficient GMM estimator,
$$\hat\delta(\hat{S}^{-1})=\arg\min_\delta n\,(s_{xy}-S_{xz}\delta)'\,\hat{S}^{-1}\,(s_{xy}-S_{xz}\delta).$$
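A sketch of the two-step procedure on simulated overidentified data ($K=3$, $L=2$, with heteroskedastic errors so that the efficient weighting genuinely differs from the step-1 weight; all parameter values are hypothetical):

```python
import numpy as np

# Hypothetical overidentified DGP with conditionally heteroskedastic errors.
rng = np.random.default_rng(5)
n = 50_000
x2 = rng.normal(0.0, 1.0, n)
x3 = rng.normal(0.0, 1.0, n)
eps = rng.normal(0.0, 1.0, n) * (1.0 + 0.5 * np.abs(x2))  # E(eps|x) = 0
z2 = 0.7 * x2 + 0.4 * x3 + 0.5 * eps + rng.normal(0.0, 1.0, n)
y = 1.0 + 2.0 * z2 + eps

Z = np.column_stack([np.ones(n), z2])
X = np.column_stack([np.ones(n), x2, x3])
Sxz, sxy, Sxx = X.T @ Z / n, X.T @ y / n, X.T @ X / n

def gmm(W):
    return np.linalg.solve(Sxz.T @ W @ Sxz, Sxz.T @ W @ sxy)

# Step 1: preliminary estimate with W = Sxx^{-1} (2SLS), then S-hat
delta_1 = gmm(np.linalg.inv(Sxx))
e = y - Z @ delta_1
S_hat = (X * e[:, None] ** 2).T @ X / n       # (1/n) sum e_i^2 x_i x_i'

# Step 2: efficient GMM with W = S-hat^{-1}
delta_eff = gmm(np.linalg.inv(S_hat))
print(delta_1, delta_eff)
```

Both steps deliver estimates near the true (1, 2); the step-2 estimator is the efficient one under heteroskedasticity.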
250
Example. (Wooldridge, Chap. 15 - database: card) Wage and education data for a sample of men in 1976.

Dependent Variable: LOG(WAGE)
Method: Least Squares
Sample: 1 3010
Included observations: 3010

Variable    Coefficient  Std. Error  t-Statistic  Prob.
C           4.733664     0.067603    70.02193     0.0000
EDUC        0.074009     0.003505    21.11264     0.0000
EXPER       0.083596     0.006648    12.57499     0.0000
EXPER^2     0.002241     0.000318    7.050346     0.0000
BLACK       0.189632     0.017627    10.75828     0.0000
SMSA        0.161423     0.015573    10.36538     0.0000
SOUTH       0.124862     0.015118    8.259006     0.0000

R-squared 0.290505   Mean dependent var 6.261832
Adjusted R-squared 0.289088   S.D. dependent var 0.443798
S.E. of regression 0.374191   Akaike info criterion 0.874220
Sum squared resid 420.4760   Schwarz criterion 0.888196
Log likelihood 1308.702   Hannan-Quinn criter. 0.879247
F-statistic 204.9318   Durbin-Watson stat 1.861291
Prob(F-statistic) 0.000000

SMSA = 1 if in Standard Metropolitan Statistical Area in 1976.
NEAR4 = 1 if he grew up near a 4-year college.
251
252
$$z_i'=\begin{bmatrix}1 & EDUC_i & EXPER_i & EXPER_i^2 & BLACK_i & SMSA_i & SOUTH_i\end{bmatrix}$$
$$x_i'=\begin{bmatrix}1 & EXPER_i & EXPER_i^2 & BLACK_i & SMSA_i & SOUTH_i & NEARC4_i & NEARC2_i\end{bmatrix}$$

Dependent Variable: LOG(WAGE)
Method: Generalized Method of Moments
Sample: 1 3010
Included observations: 3010
Linear estimation with 1 weight update
Estimation weighting matrix: HAC (Bartlett kernel, Newey-West fixed bandwidth = 9.0000)
Standard errors & covariance computed using estimation weighting matrix
Instrument specification: C EXPER EXPER^2 BLACK SMSA SOUTH NEARC4 NEARC2

Variable    Coefficient  Std. Error  t-Statistic  Prob.
C           3.330464     0.886167    3.758280     0.0002
EDUC        0.157469     0.052578    2.994963     0.0028
EXPER       0.117223     0.022676    5.169509     0.0000
EXPER^2     0.002277     0.000380    5.997813     0.0000
BLACK       0.106718     0.056652    1.883736     0.0597
SMSA        0.119990     0.030595    3.921874     0.0001
SOUTH       0.095977     0.025905    3.704972     0.0002

R-squared 0.156572   Mean dependent var 6.261832
Adjusted R-squared 0.154887   S.D. dependent var 0.443798
S.E. of regression 0.407983   Sum squared resid 499.8506
Durbin-Watson stat 1.866667   J-statistic 2.200989
Instrument rank 8   Prob(J-statistic) 0.137922
253
4.5 Testing Overidentifying Restrictions
4.5.1 Testing all Orthogonality Conditions
If the equation is exactly identified then $J(\hat\delta,\hat{W})=0$. If the equation is overidentified then $J(\hat\delta,\hat{W})>0$. When $\hat{W}$ is chosen optimally, so that $\hat{W}=\hat{S}^{-1}\xrightarrow{p}S^{-1}$, then $J(\hat\delta(\hat{S}^{-1}),\hat{S}^{-1})$ is asymptotically chi-squared.

Proposition (3.6 - Hansen’s test of overidentifying restrictions). Under Assumptions 3.1-3.5,
$$J\big(\hat\delta(\hat{S}^{-1}),\hat{S}^{-1}\big)\xrightarrow{d}\chi^2_{(K-L)}.$$
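A sketch of the J statistic on simulated data with valid instruments ($K=3$, $L=2$, hence one overidentifying restriction); the $\chi^2(1)$ p-value is computed with `math.erfc` to avoid extra dependencies, and the DGP is hypothetical:

```python
import math
import numpy as np

# Hypothetical overidentified model with valid instruments (K - L = 1).
rng = np.random.default_rng(6)
n = 50_000
x2 = rng.normal(0.0, 1.0, n)
x3 = rng.normal(0.0, 1.0, n)
eps = rng.normal(0.0, 1.0, n)
z2 = 0.7 * x2 + 0.4 * x3 + 0.5 * eps + rng.normal(0.0, 1.0, n)
y = 1.0 + 2.0 * z2 + eps

Z = np.column_stack([np.ones(n), z2])
X = np.column_stack([np.ones(n), x2, x3])
Sxz, sxy, Sxx = X.T @ Z / n, X.T @ y / n, X.T @ X / n

# Two-step efficient GMM
Wx = np.linalg.inv(Sxx)
d1 = np.linalg.solve(Sxz.T @ Wx @ Sxz, Sxz.T @ Wx @ sxy)
e = y - Z @ d1
S_hat = (X * e[:, None] ** 2).T @ X / n
Wopt = np.linalg.inv(S_hat)
d_eff = np.linalg.solve(Sxz.T @ Wopt @ Sxz, Sxz.T @ Wopt @ sxy)

# Hansen's J: n * g_bar' S_hat^{-1} g_bar, compared with chi2(K - L = 1)
g_bar = sxy - Sxz @ d_eff
J = n * g_bar @ Wopt @ g_bar
p_value = math.erfc(math.sqrt(J / 2.0))   # chi2(1) survival function
print(J, p_value)
```

With valid instruments the J statistic should look like a $\chi^2(1)$ draw, so rejections at conventional levels are rare.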
254
Two comments:

1) This is a specification test, testing whether all the restrictions of the model (which are the assumptions maintained in Proposition 3.6) are satisfied. If $J(\hat\delta(\hat{S}^{-1}),\hat{S}^{-1})$ is surprisingly large, it means that either the orthogonality conditions (Assumption 3.3) or the other assumptions (or both) are likely to be false. Only when we are confident about those other assumptions can we interpret the large J statistic as evidence for the endogeneity of some of the $K$ instruments included in $x_i$.

2) Small-sample properties of the test may be a matter of concern.

Example (continuation). EVIEWS provides the J statistic of Proposition 3.6:
255
Dependent Variable: LOG(WAGE)
Method: Generalized Method of Moments
Sample: 1 3010
Included observations: 3010
Linear estimation & iterate weights
Estimation weighting matrix: White
Standard errors & covariance computed using estimation weighting matrix
Convergence achieved after 2 weight iterations
Instrument specification: C EXPER EXPER^2 BLACK SMSA SOUTH NEARC4 NEARC2

Variable    Coefficient  Std. Error  t-Statistic  Prob.
C           3.307001     0.814185    4.061733     0.0000
EDUC        0.158840     0.048355    3.284842     0.0010
EXPER       0.118205     0.021229    5.567988     0.0000
EXPER^2     0.002296     0.000367    6.250943     0.0000
BLACK       0.105678     0.051814    2.039573     0.0415
SMSA        0.117018     0.030158    3.880117     0.0001
SOUTH       0.096095     0.023342    4.116897     0.0000

R-squared 0.152137   Mean dependent var 6.261832
Adjusted R-squared 0.150443   S.D. dependent var 0.443798
S.E. of regression 0.409055   Sum squared resid 502.4789
Durbin-Watson stat 1.866149   J-statistic 2.673614
Instrument rank 8   Prob(J-statistic) 0.102024
256
4.5.2 Testing Subsets of Orthogonality Conditions
Consider
$$x_i=\begin{bmatrix}x_{i1}\\ x_{i2}\end{bmatrix},$$
where $x_{i1}$ has $K_1$ rows and $x_{i2}$ has $K-K_1$ rows. We want to test $H_0:\mathrm{E}(x_{i2}\varepsilon_i)=0$.

The basic idea is to compare two J statistics from two separate GMM estimators, one using only the instruments included in $x_{i1}$ and the other using also the suspect instruments $x_{i2}$ in addition to $x_{i1}$. If the inclusion of the suspect instruments significantly increases the J statistic, that is a good reason for doubting the predeterminedness of $x_{i2}$. This restriction is testable only if $K_1\geq L$ (why?).
257
Proposition (3.7 - testing a subset of orthogonality conditions). Suppose that the rank condition is satisfied for $x_{i1}$, so $\mathrm{E}(x_{i1}z_i')$ is of full column rank, and that Assumptions 3.1-3.5 hold. Let
$$J=n\,\bar g_n(\hat\delta)'\,\hat{S}^{-1}\,\bar g_n(\hat\delta),\qquad \hat\delta=(S_{xz}'\hat{S}^{-1}S_{xz})^{-1}S_{xz}'\hat{S}^{-1}s_{xy},$$
$$J_1=n\,\bar g_{1n}(\tilde\delta)'\,\hat{S}_{11}^{-1}\,\bar g_{1n}(\tilde\delta),\qquad \tilde\delta=(S_{x_1z}'\hat{S}_{11}^{-1}S_{x_1z})^{-1}S_{x_1z}'\hat{S}_{11}^{-1}s_{x_1y}.$$
Then, under the null $H_0:\mathrm{E}(x_{i2}\varepsilon_i)=0$,
$$C\equiv J-J_1\xrightarrow{d}\chi^2_{(K-K_1)}.$$
258
Example. EVIEWS 7 performs this test. Following the previous example, suppose you want to test $\mathrm{E}(nearc4_i\varepsilon_i)=0$. In our case, $x_{i1}$ is a $7\times 1$ vector and $x_{i2}=nearc4_i$ is a scalar ($L=7$, $K_1=7$, $K-K_1=1$).
259
Instrument Orthogonality C-test
Test Equation: EQ03
Specification: LOG(WAGE) C EDUC EXPER EXPER^2 BLACK SMSA SOUTH
Instrument specification: C EXPER EXPER^2 BLACK SMSA SOUTH NEARC4 NEARC2
Test instruments: NEARC4

                        Value      df   Probability
Difference in J-stats   2.673614   1    0.1020

J-statistic summary:
                           Value
Restricted J-statistic     2.673614
Unrestricted J-statistic   5.16E-33

Unrestricted Test Equation:
Dependent Variable: LOG(WAGE)
Method: Generalized Method of Moments
Fixed weighting matrix for test evaluation
Standard errors & covariance computed using estimation weighting matrix
Instrument specification: C EXPER EXPER^2 BLACK SMSA SOUTH NEARC2

Variable    Coefficient  Std. Error  t-Statistic  Prob.
C           0.092557     2.127447    0.043506     0.9653
EDUC        0.349764     0.126360    2.768002     0.0057
EXPER       0.196690     0.052475    3.748287     0.0002
EXPER^2     0.002445     0.000378    6.467830     0.0000
BLACK       0.088724     0.129667    0.684247     0.4939
SMSA        0.019006     0.067085    0.283317     0.7770
SOUTH       0.030415     0.046444    0.654869     0.5126

R-squared -1.171522   Mean dependent var 6.261832
Adjusted R-squared -1.175861   S.D. dependent var 0.443798
S.E. of regression 0.654637   Sum squared resid 1286.934
Durbin-Watson stat 1.818008   J-statistic 5.16E-33
Instrument rank 7
260
4.5.3 Regressor Endogeneity Test
We can use Proposition 3.7 to test for the endogeneity of a subset of regressors. See Example 3.3 of the book.

4.6 Implications of Conditional Homoskedasticity

Assume now:

Assumption (3.7 - conditional homoskedasticity). $\mathrm{E}(\varepsilon_i^2|x_i)=\sigma^2$.

This assumption implies
$$S\equiv\mathrm{E}(g_ig_i')=\mathrm{E}(\varepsilon_i^2x_ix_i')=\sigma^2\,\mathrm{E}(x_ix_i')=\sigma^2\Sigma_{xx}.$$
Its estimator is
$$\hat{S}=\hat\sigma^2 S_{xx}.$$
261
4.6.1 Efficient GMM Becomes 2SLS
The efficient GMM is
$$\hat\delta(\hat{S}^{-1})=(S_{xz}'\hat{S}^{-1}S_{xz})^{-1}S_{xz}'\hat{S}^{-1}s_{xy}=\big(S_{xz}'(\hat\sigma^2S_{xx})^{-1}S_{xz}\big)^{-1}S_{xz}'(\hat\sigma^2S_{xx})^{-1}s_{xy}=(S_{xz}'S_{xx}^{-1}S_{xz})^{-1}S_{xz}'S_{xx}^{-1}s_{xy}\equiv\hat\delta_{2SLS}.$$
The estimator $\hat\delta_{2SLS}$ is called two-stage least squares (2SLS or TSLS), for reasons we explain below. It follows that
$$\mathrm{Avar}(\hat\delta_{2SLS})=\sigma^2(\Sigma_{xz}'\Sigma_{xx}^{-1}\Sigma_{xz})^{-1},\qquad \widehat{\mathrm{Avar}}(\hat\delta_{2SLS})=\hat\sigma^2(S_{xz}'S_{xx}^{-1}S_{xz})^{-1}.$$

Proposition (3.9 - asymptotic properties of 2SLS). Skip.
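Under homoskedasticity the efficient GMM computation simplifies as above. A sketch computing $\hat\delta_{2SLS}$ and its standard errors from $\hat\sigma^2(S_{xz}'S_{xx}^{-1}S_{xz})^{-1}$, on a hypothetical homoskedastic DGP:

```python
import numpy as np

# Hypothetical homoskedastic overidentified DGP (K = 3, L = 2).
rng = np.random.default_rng(7)
n = 50_000
x2 = rng.normal(0.0, 1.0, n)
x3 = rng.normal(0.0, 1.0, n)
eps = rng.normal(0.0, 1.0, n)
z2 = 0.7 * x2 + 0.4 * x3 + 0.5 * eps + rng.normal(0.0, 1.0, n)
y = 1.0 + 2.0 * z2 + eps

Z = np.column_stack([np.ones(n), z2])
X = np.column_stack([np.ones(n), x2, x3])
Sxz, sxy, Sxx = X.T @ Z / n, X.T @ y / n, X.T @ X / n

# 2SLS = GMM with weighting matrix Sxx^{-1}
A = Sxz.T @ np.linalg.inv(Sxx) @ Sxz
delta_2sls = np.linalg.solve(A, Sxz.T @ np.linalg.inv(Sxx) @ sxy)

# Estimated asymptotic variance and standard errors
e = y - Z @ delta_2sls
sigma2 = e @ e / n
avar_hat = sigma2 * np.linalg.inv(A)           # sigma2*(Sxz' Sxx^{-1} Sxz)^{-1}
se = np.sqrt(np.diag(avar_hat) / n)
print(delta_2sls, se)
```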
262
4.6.2 Alternative Derivations of 2SLS
The 2SLS estimator can be written as
$$\hat\delta_{2SLS}=(S_{xz}'S_{xx}^{-1}S_{xz})^{-1}S_{xz}'S_{xx}^{-1}s_{xy}=\big(Z'X(X'X)^{-1}X'Z\big)^{-1}Z'X(X'X)^{-1}X'y.$$
Let us interpret the 2SLS estimator as an IV estimator. Use as instruments
$$\hat{Z}=X(X'X)^{-1}X'Z,$$
or simply $\hat{Z}=X$ if $K=L$. Define the IV estimator as
$$\hat\delta_{IV}=\left(\frac{1}{n}\sum_{i=1}^n\hat{z}_iz_i'\right)^{-1}\frac{1}{n}\sum_{i=1}^n\hat{z}_iy_i=(\hat{Z}'Z)^{-1}\hat{Z}'y=\big(Z'X(X'X)^{-1}X'Z\big)^{-1}Z'X(X'X)^{-1}X'y=\hat\delta_{2SLS}.$$
263
If $K=L$ then
$$\hat\delta_{IV}=(X'Z)^{-1}X'y.$$
Finally, let us show 2SLS as the result of two regressions:

1) regress the $L$ regressors on $x_i$ and obtain the fitted values $\hat{z}_i$;

2) regress $y_i$ on $\hat{z}_{i1},\ldots,\hat{z}_{iL}$ to obtain the estimator $(\hat{Z}'\hat{Z})^{-1}\hat{Z}'y$, which is also $\hat\delta_{2SLS}$. In effect,
$$(\hat{Z}'\hat{Z})^{-1}\hat{Z}'y=\Big(\underbrace{Z'X(X'X)^{-1}X'}_{\hat{Z}'}\underbrace{X(X'X)^{-1}X'Z}_{\hat{Z}}\Big)^{-1}\underbrace{Z'X(X'X)^{-1}X'}_{\hat{Z}'}y=\big(Z'X(X'X)^{-1}X'Z\big)^{-1}Z'X(X'X)^{-1}X'y=\hat\delta_{2SLS}.$$
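The equivalence of the two-regression procedure and the closed form can be confirmed numerically (hypothetical DGP):

```python
import numpy as np

# Hypothetical DGP: one endogenous regressor, two excluded instruments.
rng = np.random.default_rng(8)
n = 10_000
x2 = rng.normal(0.0, 1.0, n)
x3 = rng.normal(0.0, 1.0, n)
eps = rng.normal(0.0, 1.0, n)
z2 = 0.7 * x2 + 0.4 * x3 + 0.5 * eps + rng.normal(0.0, 1.0, n)
y = 1.0 + 2.0 * z2 + eps

Z = np.column_stack([np.ones(n), z2])
X = np.column_stack([np.ones(n), x2, x3])

# First stage: Z-hat = X (X'X)^{-1} X'Z (fitted values of each regressor)
Zhat = X @ np.linalg.solve(X.T @ X, X.T @ Z)

# Closed form (Zhat'Z)^{-1} Zhat'y vs. second-stage OLS (Zhat'Zhat)^{-1} Zhat'y
delta_closed = np.linalg.solve(Zhat.T @ Z, Zhat.T @ y)
delta_2stage = np.linalg.solve(Zhat.T @ Zhat, Zhat.T @ y)
print(delta_closed, delta_2stage)
```

The two coincide because the projection matrix $X(X'X)^{-1}X'$ is symmetric and idempotent, so $\hat{Z}'\hat{Z}=\hat{Z}'Z$.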
264
Exercise 4.6. Consider the equation $y_i=z_i'\delta+\varepsilon_i$ and the instrumental variables $x_i$, where $K=L$. Assume Assumptions 3.1-3.7 and suppose that $x_i$ and $z_i$ are strictly exogenous (so the use of the IV estimator is unnecessary). Show that $\hat\delta_{IV}=(X'Z)^{-1}X'y$ is unbiased and consistent but less efficient than $\hat\delta_{OLS}=(Z'Z)^{-1}Z'y$. Hint: compare $\mathrm{Var}(\hat\delta_{IV}|Z,X)$ to $\mathrm{Var}(\hat\delta_{OLS}|Z,X)$ and notice that an idempotent matrix is positive semi-definite. Also notice that $\mathrm{Var}(\hat\delta_{IV}|Z,X)-\mathrm{Var}(\hat\delta_{OLS}|Z,X)$ is positive semi-definite iff $\mathrm{Var}(\hat\delta_{OLS}|Z,X)^{-1}-\mathrm{Var}(\hat\delta_{IV}|Z,X)^{-1}$ is positive semi-definite (provided these inverses exist).