
Intensive Course in Econometrics

Slides

Rolf Tschernig & Harry Haupt

University of Regensburg University of Bielefeld

March 2009

We are greatly indebted to Kathrin Kagerer, Joachim Schnurbus, and Roland Weigand, who helped us enormously to improve and correct this course material. Of course, the usual disclaimer applies.

© These slides may be printed and reproduced for individual or instructional use, but may not be printed for commercial purposes.

Contents

1 Introduction: What is Econometrics?
1.1 A Trade Example: What Determines Trade Flows?
1.2 Economic Models and the Need for Econometrics
1.3 Causality and Experiments
1.4 Types of Economic Data

2 The Simple Regression Model
2.1 The Population Regression Model
2.2 The Sample Regression Model
2.3 The OLS Estimator
2.4 Best Linear Prediction, Correlation, and Causality
2.5 Algebraic Properties of the OLS Estimator
2.6 Parameter Interpretation and Functional Form
2.7 Statistical Properties: Expected Value and Variance
2.8 Estimation of the Error Variance

3 Multiple Regression Analysis: Estimation
3.1 Motivation: The Trade Example Continued
3.2 The Multiple Regression Model of the Population
3.3 The OLS Estimator: Derivation and Algebraic Properties
3.4 The OLS Estimator: Statistical Properties
3.5 Model Specification I: Model Selection Criteria

4 Multiple Regression Analysis: Hypothesis Testing
4.1 Basics of Statistical Tests
4.2 Probability Distribution of the OLS Estimator
4.3 The t Test in the Multiple Regression Model
4.4 Empirical Analysis of a Simplified Gravity Equation
4.5 Confidence Intervals
4.6 Testing a Single Linear Combination of Parameters
4.7 The F Test
4.8 Reporting Regression Results

5 Multiple Regression Analysis: Asymptotics
5.1 Large Sample Distribution of the Mean Estimator
5.2 Large Sample Inference for the OLS Estimator

6 Multiple Regression Analysis: Interpretation
6.1 Level and Log Models
6.2 Data Scaling
6.3 Dealing with Nonlinear or Transformed Regressors
6.4 Regressors with Qualitative Data

7 Multiple Regression Analysis: Prediction
7.1 Prediction and Prediction Error
7.2 Statistical Properties of Linear Predictions

8 Multiple Regression Analysis: Heteroskedasticity
8.1 Consequences of Heteroskedasticity for OLS
8.2 Heteroskedasticity-Robust Inference after OLS
8.3 The General Least Squares (GLS) Estimator
8.4 Feasible Generalized Least Squares (FGLS)

9 Multiple Regression Analysis: Model Diagnostics
9.1 The RESET Test
9.2 Heteroskedasticity Tests
9.3 Model Specification II: Useful Tests

10 Appendix
10.1 A Condensed Introduction to Probability
10.2 Important Rules of Matrix Algebra
10.3 Rules for Matrix Differentiation
10.4 Data for Estimating Gravity Equations

Organisation

Contact

Prof. Dr. Rolf Tschernig

Building RW(L), 5th floor, room 514

Universitätsstr. 31, 93040 Regensburg, Germany

Tel. (+49) 941/943 2737, Fax (+49) 941/943 4917

Email: [email protected]

http://www.wiwi.uni-regensburg.de/tschernig/


Intensive Course in Econometrics — Organisation — UR March 2009 — R. Tschernig

Schedule and Location

Date of Course: February 16 to February 27, 2009

Osteuropa-Institut and University of Regensburg

Lectures: every morning 8.30 - 10.00 W 113 Tschernig

every morning 10.30 - 12.00 W 113 Tschernig

Exercises: every afternoon 14.00 - 17.00 W 113 or PC-room Kagerer, Schnurbus, Weigand

Exam

no exam in this course


Required Text

Wooldridge, J.M. (2009). Introductory Econometrics. A Modern

Approach, 4th ed., Thomson South-Western.

Additional Reading

Stock, J.H. and Watson, M.W. (2007). Introduction to Econometrics, 2nd ed., Pearson, Addison-Wesley.

plus what will be announced during the course.


1 Introduction: What is Econometrics?

1.1 A Trade Example: What Determines Trade Flows?

Goal/Research Question: Identify the factors that influence imports

to Kazakhstan and quantify their impact.


• Three basic questions that have to be answered during the analysis:

1. Which (economic) relationships could be, or are “known” to be, relevant for this question?

2. Which data can be useful for checking the possibly relevant economic conjectures/theories?

3. How to decide which economic conjecture to reject or to follow?

• Let’s have a first look at some data of interest: the imports (in current US dollars) to Kazakhstan from 55 originating countries in 2004.


Figure: Imports to Kazakhstan in 2004 in current US dollars (bar chart by country of origin, ALB to YUG; vertical axis from 0 to 1,400,000,000).

The original data are from the UN Commodity Trade Statistics Database (UN COMTRADE), http://comtrade.un.org/


• See Section 10.4 in the Appendix for detailed data descriptions. Data are provided in the EViews file Kazakhstan imports 2004.wf1.

We thank Richard Frensch, Osteuropa-Institut, Regensburg, Germany, who provided all data throughout this course for analyzing trade flows.

• A first attempt to answer the three basic questions:

1. Ignore for the moment all existing economic theory and simply hypothesize that observed imports depend somehow on the GDP of the exporting country.

2. Collect GDP data for the countries of origin, e.g. from the International Monetary Fund (IMF) World Economic Outlook Database, http://www.imf.org/external/data.htm

3. Plot the data, e.g. by using a scatter plot. Can you decide whether there is a relationship between the trade flows from the exporting countries and their GDP?


A scatter plot

Figure: scatter plot of TRADE_0_D_O (vertical axis, 0 to 1.2E+09) against WEO_GDPCR_O (horizontal axis, 0 to 12000).

Some questions:

• What do you see?

• Is there a relationship?

• If so, how to quantify it?

• Is there a causal relationship - what determines what?

• By how much do the imports from Germany change if the GDP in Germany changes by 1%?

• Are there other relevant factors determining imports, e.g. distance?

• Is it possible to forecast future trade flows?


• What have you done?

– You tried to simplify reality

– by building some kind of (economic) model.

• An (economic) model

– has to reduce the complexity of reality such that it is useful for answering the question of interest;

– is a collection of cleverly chosen assumptions from which implications can be inferred (using logic); example: Heckscher-Ohlin model;

– should be as simple as possible and as complex as necessary;

– cannot be refuted or “validated” without empirical data of some kind.


• Let us consider a simple formal model for the relationship between imports and GDP of the originating countries:

$$\text{imports}_i = \beta_0 + \beta_1\,\text{gdp}_i, \quad i = 1, \dots, 55.$$

– Does this make sense?

– How to determine the values of the so-called parameters $\beta_0$ and $\beta_1$?

– Fit a straight line through the cloud!

Figure: scatter plot of TRADE_0_D_O (vertical axis) against WEO_GDPCR_O (horizontal axis).


Figure: TRADE_0_D_O vs. WEO_GDPCR_O (scatter plot, same axes as before).

More questions:

– How to fit a line through the cloud of points?

– Which properties does the fitted line have?

– What to do with other relevant factors that are currently neglected in the analysis?

– Which criteria to choose for identifying a potential relationship?


Figure: TRADE_0_D_O, TRADE_0_D_F_LEV and TRADE_0_D_F_LEV2 plotted against WEO_GDPCR_O.

Further questions:

– Is the potential relationship really linear? Compare it to the green points of a nonlinear relationship.

– And: how much may results change with a different sample, e.g. for 2003?


1.2 Economic Models and the Need for Econometrics

• Standard problems of economic models:

– The conjectured economic model is likely to neglect some factors.

– Numerical answers to the questions posed generally depend on the choice of data set: a different data set leads to different numerical results.

=⇒ Numerical answers always have some uncertainty.


• Econometrics

– offers solutions for dealing with unobserved factors in economic models,

– provides “both a numerical answer to the question and a measure how precise the answer is (Stock & Watson 2007, p. 7)”,

– as will be seen later, provides tools that allow one to refute economic hypotheses by confronting theory with data using statistical techniques, and to quantify the probability that such decisions are wrong,

– as will be seen later as well, allows one to quantify the risks of forecasts, decisions, and even of its own analysis.


• Therefore: Econometrics can also be useful for providing answers to questions like:

– How reliable are predicted growth rates or returns?

– How likely is it that the value realized in the future will be close to the predicted value? In other words, how precise are the predictions?

• Main tool: Multiple regression model

It allows one to quantify the effect of a change in one variable on another variable, holding other things constant (ceteris paribus analysis).


• Steps of an econometric analysis:

1. Careful formulation of question/problem/task of interest.

2. Specification of an economic model.

3. Careful selection of a class of econometric models.

4. Collecting data.

5. Selection and estimation of an econometric model.

6. Diagnostics of correct model specification.

7. Usage of the model.

Note that there exists a large variety of econometric models, and model choice depends very much on the research question, the underlying economic theory, the availability of data, and the structure of the problem.


• Goals of this course: to provide you with basic econometric tools such that you can

– successfully carry out simple empirical econometric analyses and provide quantitative answers to quantitative questions,

– recognize ill-conducted econometric studies and their consequences,

– recognize when to ask an expert econometrician for help,

– attend courses in advanced econometrics / empirical economics,

– study more advanced econometric techniques.


Some Definitions of Econometrics

– “... discover empirical relation between economic variables, provide forecast of various economic quantities of interest ... (First issue of volume 1, Econometrica, 1933).”

– “The science of model building consists of a set of quantitative tools which are used to construct and then test mathematical representations of the real world. The development and use of these tools are subsumed under the subject heading of econometrics (Pindyck & Rubinfeld 1998).”


– “At a broad level, econometrics is the science and art of using economic theory and statistical techniques to analyze economic data. Econometric methods are used in many branches of economics, including finance, labor economics, macroeconomics, microeconomics, marketing, and economic policy. Econometric methods are also commonly used in other social sciences, including political science and sociology (Stock & Watson 2007, p. 3).”

So, some may also say: “Alchemy or Science?”, “Economic-tricks”, “Econo-mystiques”.

– “Econometrics is based upon the development of statistical methods for estimating economic relationships, testing economic theories, and evaluating and implementing government and business policy (Wooldridge 2009, p. 1).”


• Summary of tasks for econometric methods

– In brief: econometrics can be useful whenever you encounter (economic) data and you want to make sense of them.

– In detail:

∗ Providing a formal framework for falsifying postulated economic relationships by confronting economic theory with economic data using statistical methods: economic hypotheses are formulated and statistically tested on the basis of adequately (and repeatedly) collected data such that test results may falsify the postulated hypotheses.

∗ Analyzing the effects of policy measures.

∗ Forecasting.


1.3 Causality and Experiments

• Common understanding: “causality means that a specific action” (touching a hot stove) “leads to a specific, measurable consequence” (get burned) (Stock & Watson 2007, p. 8).

• How to identify causality? Observe repeatedly an action and its consequence!

• Thus, in science one aims at repeating an action and its consequences under identical conditions. How to generate repetitions of actions?


• Randomized controlled experiments:

– there is a control group that receives no treatment (e.g. fertilizer) and a treatment group that receives treatment, and

– treatment is assigned randomly in order to eliminate any possible systematic relationship between the treatment and other possible influences.

• Causal effect:

A “causal effect is defined to be an effect on an outcome of a given action or treatment, as measured in an ideal randomized controlled experiment (Stock & Watson 2007, p. 9).”

• In economics, randomized controlled experiments are very often difficult or impossible to conduct. A randomized controlled experiment then provides a theoretical benchmark, and econometric analysis aims at mimicking as closely as possible the conditions of a randomized controlled experiment using actual data.

• Note that knowledge of causal effects is not necessary for forecasting.

• Warning: in general, multiple regression models do not allow conclusions about causality!


1.4 Types of Economic Data

1. Cross-Sectional Data

• are collected across several units at a single point or period of

time.

• Units: “economic agents”, e.g. individuals, households, investors,

firms, economic sectors, cities, countries.

• In general: the order of observations has no meaning.

• Popular to use index i.

• Optimal: the data are a random sample of the underlying population, see Section 2.1 for details.

• Cross-sectional data allow one to explain differences between individual units.

• Example: the sample of countries that exported to Kazakhstan in 2004 from Section 1.1.

2. Time Series Data

• are sampled across differing points/periods of time.

• Popular to use index t.

• Sampling frequency is important:

– variable versus fixed;

– fixed: annually, quarterly, monthly, weekly, daily, intradaily;

– variable: ticker data, duration data (e.g. unemployment spells).

• Time series data allow the analysis of dynamic effects.

• Univariate versus multivariate time series data.



• Example: Trade flow from Germany to Kazakhstan and GDP in Germany (in current US dollars), 1990 - 2007, T = 18.

Figure: two time series plots over 1990 to 2007, one for TRADE_0_D_O (0.0E+00 to 8.0E+08) and one for WDI_GDPUSDCR_O (1.60E+12 to 3.00E+12).

3. Panel data

• are a collection of cross-sectional data for at least two different points/periods of time.

• Individual units remain identical in each cross-sectional sample (except if units vanish).

• Use of a double index $it$, where $i = 1, \dots, N$ and $t = 1, \dots, T$.

• Typical problem: missing values - for some units and periods there are no data.

• Example: growth rate of imports from 55 different countries to Kazakhstan from 1991 to 2008, where all 55 countries were chosen for the sample in 1990 and kept fixed for all subsequent years ($T = 18$, $N = 55$).

4. Pooled Cross Sections

• also a collection of cross-sectional data, however, allowing for changing units across time.

• Example: in 1995 countries of origin are Germany, France, Russia and in 1996 countries of origin are Poland, US, Italy.


In this course: focus on the analysis of cross-sectional data and

specific types of time series data:

• simple regression model → Chapter 2,

• multiple regression model → Chapters 3 to 9.

• Time series analysis requires advanced econometric techniques that

are beyond the scope of this course (given the time constraints).

Recall the arithmetic quality of data:

• quantitative variables,

• qualitative or categorical variables.

Reading: Sections 1.1-1.3 in Wooldridge (2009).


2 The Simple Regression Model

Distinguish between the

• population regression model and the

• sample regression model.



2.1 The Population Regression Model

• In general: y and x are two variables that describe properties of the population under consideration for which one wants “to explain y in terms of x” or “to study how y varies with changes in x” or “to predict y for given values of x”.

Example: By how much does the hourly wage change for an additional year of schooling, keeping all other influences fixed?

• If we knew everything, then the relationship between y and x may formally be expressed as

$$y = m(x, z_1, \dots, z_s) \qquad (2.1)$$

where $z_1, \dots, z_s$ denote s additional variables that, in addition to years of schooling x, influence the hourly wage y.


• For practical application it is possible

– that relationship (2.1) is too complicated to be useful,

– that there does not exist an exact relationship, or

– that there exists an exact relationship for which, however, not all s influential variables $z_1, \dots, z_s$ can be observed, or

– that one has no idea about the structure of the function $m(\cdot)$.

• Our solution:

– build a useful model, cf. Section 1.1,

– which focuses on a relationship that holds on “average”. What do we mean by “average”?


• Crucial building blocks for our model:

– Consider the variable y as random. You may think of y as the value of the variable for a unit chosen at random from all units in the population. Furthermore, if the random variable y takes discrete values, a probability is assigned to each value of y. (If the random variable y is continuous, a density value is assigned.)

In other words: apply probability theory. See Appendices B and C in Wooldridge (2009).

Examples:

∗ The population consists of all apartments in Almaty. The variable y denotes the rent of a single apartment randomly chosen from all apartments in Almaty.


∗ The population consists of all possible values of imports to Kazakhstan from a specific country and period.

∗ For a die, the population consists of all numbers that are written on its sides, although in this case statisticians prefer to talk about a sample space.

– In terms of probability theory the “average” of a variable y is given by the expected value of this variable. In case of discrete y one has

$$E[y] = \sum_{j \,\in\, \text{all different } y_j \text{ in population}} y_j \,\mathrm{Prob}(y = y_j).$$

– Sometimes one may only look at a subset of the population, namely all y that have the same value for another variable x.


Example: one only considers the rents of all apartments in Almaty of size $x = 75$ m².

– If the “average” is conditioned on specific values of another variable x, then one considers the conditional expected value of y for a given x: $E[y|x]$. For discrete random variables y one has

$$E[y|x] = \sum_{j \,\in\, \text{all different } y_j \text{ in population}} y_j \,\mathrm{Prob}(y = y_j \,|\, x).$$

(See Appendix 10.1 for a brief introduction to probability theory and corresponding definitions for continuous random variables.)

Example continued: the conditional expectation $E[y|x = 75]$ corresponds to the average rent of all apartments in Almaty of size $x = 75$ m².
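
The two sums for the expected value and the conditional expected value can be made concrete with a small joint distribution. The sketch below is illustrative only: the probability table `joint` and both helper functions are invented for this example and are not data or code from the course.

```python
from fractions import Fraction as F

# Hypothetical joint distribution Prob(x, y): apartment size x (in m^2)
# and rent y. All numbers are invented for illustration.
joint = {
    (50, 300): F(1, 4),
    (50, 400): F(1, 4),
    (75, 400): F(1, 8),
    (75, 600): F(3, 8),
}

def expected_y(joint):
    """E[y]: sum of y_j * Prob(y = y_j) over the whole population."""
    return sum(y * p for (_, y), p in joint.items())

def conditional_expected_y(joint, x0):
    """E[y | x = x0]: sum of y_j * Prob(y = y_j | x = x0)."""
    prob_x0 = sum(p for (x, _), p in joint.items() if x == x0)
    return sum(y * p / prob_x0 for (x, y), p in joint.items() if x == x0)

print("E[y]        =", expected_y(joint))
print("E[y | x=50] =", conditional_expected_y(joint, 50))
print("E[y | x=75] =", conditional_expected_y(joint, 75))
```

In this made-up table the conditional average differs across the two sizes, which is exactly the variation in $E[y|x]$ that the regression model is meant to describe.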



– Note that the variable x can be random, too. Then, the condi-

tional expectation E[y|x] is a function of the (random) variable

x

E[y|x] = g(x)

and therefore a random variable itself.

– From the identity

y = E[y|x] + (y − E[y|x]) (2.2)

one defines the error term or disturbance term as

u ≡ y − E[y|x]

so that one obtains a simple regression model of the pop-

ulation

y = E[y|x] + u (2.3)



• Interpretation:

– The random variable y varies randomly around the conditional

expectation E[y|x]:

y = E[y|x] + u.

– The conditional expectation E[y|x] is called the systematic

part of the regression.

– The error term u is called the unsystematic part of the regression.

• So instead of trying the impossible, namely specifying $m(x, \dots)$ given by (2.1), one focuses the analysis on the “average” $E[y|x]$.


• How to determine the conditional expectation?

– This step requires assumptions!

– To keep things simple we make Assumption (A) given by

$$E[y|x] = \beta_0 + \beta_1 x. \qquad (2.4)$$

– Discussion of Assumption (A):

∗ It restricts the flexibility of $g(x) = E[y|x]$ such that $g(x) = \beta_0 + \beta_1 x$ has to be linear in x. So if $E[y|x] = \delta_0 + \delta_1 \log x$, Assumption (A) is wrong.

∗ It can be fulfilled even if there are other variables influencing y linearly. For example, consider

$$E[y|x, z] = \delta_0 + \delta_1 x + \delta_2 z.$$

Then, by the law of iterated expectations one obtains

$$E[y|x] = \delta_0 + \delta_1 x + \delta_2 E[z|x].$$

If $E[z|x]$ is linear in x, say $E[z|x] = \alpha_0 + \alpha_1 x$, one obtains

$$E[y|x] = \delta_0 + \delta_1 x + \delta_2(\alpha_0 + \alpha_1 x) = \gamma_0 + \gamma_1 x \qquad (2.5)$$

with $\gamma_0 = \delta_0 + \delta_2\alpha_0$ and $\gamma_1 = \delta_1 + \delta_2\alpha_1$. Note, however, that in this case $E[y|x, z] \neq E[y|x]$ in general. Then model choice depends on the goal of the analysis: the smaller model can sometimes be preferable for prediction, while the larger model is needed if controlling for z is important (cf. randomized controlled experiments, Section 1.3).

∗ In general, Assumption (A) is violated if (2.5) does not hold, e.g. if $E[z|x]$ is nonlinear in x. Then the linear population model is called misspecified. More on that in Section 3.4.


• Properties of the error term u: From Assumption (A)

1. E[u|x] = 0,

2. E[u] = 0,

3. Cov(x, u) = 0.
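
The three properties can be checked in a few lines. The following is a sketch that uses only the definition $u \equiv y - E[y|x]$ and the law of iterated expectations; under Assumption (A) the error is $u = y - \beta_0 - \beta_1 x$, so the same properties hold for the error of the linear model.

$$
\begin{aligned}
E[u|x] &= E\big[\,y - E[y|x] \,\big|\, x\,\big] = E[y|x] - E[y|x] = 0,\\
E[u] &= E\big[\,E[u|x]\,\big] = E[0] = 0,\\
Cov(x, u) &= E[xu] - E[x]\,E[u] = E\big[\,x\,E[u|x]\,\big] - E[x]\cdot 0 = 0.
\end{aligned}
$$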



• An alternative set of assumptions:

The above result $E[u|x] = 0$ together with the identity (2.3) allows one to rewrite Assumption (A) in terms of the following two assumptions:

1. Assumption SLR.1 (Linearity in the Parameters)

$$y = \beta_0 + \beta_1 x + u, \qquad (2.6)$$

2. Assumption SLR.4 (Zero Conditional Mean)

$$E[u|x] = 0.$$


• Linear Population Regression Model:

The simple linear population regression model is given by equation (2.6)

$$y = \beta_0 + \beta_1 x + u$$

and obtained by specifying the conditional expectation in the regression model (2.3) by a linear function (linear in the parameters).

The parameters $\beta_0$ and $\beta_1$ are called the intercept parameter and slope parameter, respectively.


• Some terminology for regressions

y                     x

Dependent variable    Independent variable

Explained variable    Explanatory variable

Response variable     Control variable

Predicted variable    Predictor variable

Regressand            Regressor

                      Covariate


• A simple example: a game of dice

Let the random numbers x and u denote the throws of two fair dice with $x, u \in \{-2.5, -1.5, -0.5, 0.5, 1.5, 2.5\}$. Based on both throws the random number y denotes the following sum:

$$y = \underbrace{2}_{\beta_0} + \underbrace{3}_{\beta_1}\, x + u.$$

This completely describes the population regression model.

– Derive the systematic relationship between y and x holding x fixed.

– Interpret the systematic relationship.

– How can you obtain the values of the parameters $\beta_0 = 2$ and $\beta_1 = 3$ if those values are unknown?

Next section: How can you determine/estimate β0 and β1?
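
For the dice game the whole population can be enumerated, so the systematic part can be checked directly. The sketch below is one possible way to do this, assuming the six faces are equally likely and the two throws are independent; the names `faces`, `population` and `cond_mean` are made up for this illustration.

```python
from fractions import Fraction as F
from itertools import product

# Faces of the two fair dice, each with probability 1/6.
faces = [F(-5, 2), F(-3, 2), F(-1, 2), F(1, 2), F(3, 2), F(5, 2)]
beta0, beta1 = 2, 3

# Population: all 36 equally likely pairs (x, u) and the implied y.
population = [(x, beta0 + beta1 * x + u) for x, u in product(faces, faces)]

def cond_mean(population, x0):
    """E[y | x = x0]: average of y over all outcomes with x equal to x0."""
    ys = [y for x, y in population if x == x0]
    return sum(ys) / len(ys)

# Systematic part: E[y | x] equals 2 + 3x for every face x.
for x0 in faces:
    print(float(x0), float(cond_mean(population, x0)))

# Population slope and intercept recovered from the population moments.
mean_x = sum(x for x, _ in population) / len(population)
mean_y = sum(y for _, y in population) / len(population)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in population)
         / sum((x - mean_x) ** 2 for x, _ in population))
intercept = mean_y - slope * mean_x
print(intercept, slope)
```

Because $E[u] = 0$ and x and u are independent, averaging y over u for a fixed x removes the unsystematic part and leaves the line $2 + 3x$.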



2.2 The Sample Regression Model

Estimators and Estimates

• In practice one has to estimate the unknown parameters $\beta_0$ and $\beta_1$ of the population regression model using a sample of observations.

• The sample has to be representative and has to be collected/drawn from the population.

• A sample of the random numbers x and y of size n is given by $(x_i, y_i),\ i = 1, \dots, n$.

• Now we require an estimator that allows us, given the sample observations $(x_i, y_i),\ i = 1, \dots, n$, to compute estimates for the unknown parameters $\beta_0$ and $\beta_1$ of the population.


• Note:

– An estimator for the unknown parameters is constructed before a sample has been observed: an estimator is a function that takes the sample values as arguments.

– Once we have an estimator and observe a sample, we can compute estimates (= numerical values) for the unknown quantities.

• For estimating the unknown parameters there exist many different estimators that differ with respect to their statistical properties (statistical quality)!

Example: Two different estimators for estimating the mean: $\frac{1}{n}\sum_{i=1}^{n} y_i$ and $\frac{1}{2}(y_1 + y_n)$.
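
A small simulation can illustrate how the two estimators of the mean differ in statistical quality. This is only a sketch under stated assumptions: the population (normal with mean 5 and standard deviation 1), the sample size, the number of replications and all function names are chosen for illustration and do not come from the course material.

```python
import random
import statistics

def mean_all(sample):
    """Sample average: (1/n) times the sum of all observations."""
    return sum(sample) / len(sample)

def mean_first_last(sample):
    """Average of the first and the last observation only."""
    return (sample[0] + sample[-1]) / 2

rng = random.Random(0)  # fixed seed so that the run is reproducible
n, replications = 10, 2000

estimates_all, estimates_first_last = [], []
for _ in range(replications):
    # Hypothetical population: normal with mean 5 and standard deviation 1.
    sample = [rng.gauss(5.0, 1.0) for _ in range(n)]
    estimates_all.append(mean_all(sample))
    estimates_first_last.append(mean_first_last(sample))

# Spread of each estimator across the repeated samples.
var_all = statistics.pvariance(estimates_all)
var_first_last = statistics.pvariance(estimates_first_last)
print("variance of sample average:       ", var_all)
print("variance of first/last estimator: ", var_first_last)
```

Under independent sampling with variance $\sigma^2$, the theoretical variances are $\sigma^2/n$ and $\sigma^2/2$, so both estimators are centered on the true mean but the sample average is more precise whenever $n > 2$.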



• If you denote estimators of the parameters $\beta_0$ and $\beta_1$ in the population regression model

$$y = \beta_0 + \beta_1 x + u$$

by $\hat\beta_0$ and $\hat\beta_1$, then the sample regression model is given by

$$y_i = \hat\beta_0 + \hat\beta_1 x_i + \hat u_i, \quad i = 1, \dots, n.$$

It consists of

– the sample regression function or regression line $\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$,

– the fitted values $\hat y_i$, and

– the residuals $\hat u_i = y_i - \hat y_i$, $i = 1, \dots, n$.

With which method can we estimate?


2.3 The Ordinary Least Squares (OLS) Estimator

• The ordinary least squares estimator is frequently abbreviated as OLS estimator. The OLS estimator goes back to C.F. Gauss (1777-1855).

• It is derived by choosing the values $\hat\beta_0$ and $\hat\beta_1$ such that the sum of squared residuals (SSR)

$$\sum_{i=1}^{n} \hat u_i^2 = \sum_{i=1}^{n} \left(y_i - \hat\beta_0 - \hat\beta_1 x_i\right)^2$$

is minimized.


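• One computes the first partial derivatives with respect to $\hat\beta_0$ and $\hat\beta_1$ and sets them equal to zero (after dividing by $-2$):

$$\sum_{i=1}^{n} \left(y_i - \hat\beta_0 - \hat\beta_1 x_i\right) = 0, \qquad (2.7)$$

$$\sum_{i=1}^{n} x_i \left(y_i - \hat\beta_0 - \hat\beta_1 x_i\right) = 0. \qquad (2.8)$$

The equations (2.7) and (2.8) are called normal equations.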


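From (2.7) one obtains

$$\hat\beta_0 = n^{-1}\sum_{i=1}^{n} y_i - \hat\beta_1\, n^{-1}\sum_{i=1}^{n} x_i,$$

$$\hat\beta_0 = \bar y - \hat\beta_1 \bar x, \qquad (2.9)$$

where $\bar z = n^{-1}\sum_{i=1}^{n} z_i$ denotes the estimated mean of $z_i$, $i = 1, \dots, n$.

Inserting (2.9) into the normal equation (2.8) delivers

$$\sum_{i=1}^{n} x_i \left(y_i - \left(\bar y - \hat\beta_1 \bar x\right) - \hat\beta_1 x_i\right) = 0.$$

Moving terms leads to

$$\sum_{i=1}^{n} x_i (y_i - \bar y) = \hat\beta_1 \sum_{i=1}^{n} x_i (x_i - \bar x).$$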


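Note that

$$\sum_{i=1}^{n} x_i (y_i - \bar y) = \sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y),$$

$$\sum_{i=1}^{n} x_i (x_i - \bar x) = \sum_{i=1}^{n} (x_i - \bar x)^2,$$

such that:

$$\hat\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^{n} (x_i - \bar x)^2}. \qquad (2.10)$$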


• Terminology:

– The sample functions (2.9) and (2.10)

$$\hat\beta_0 = \bar y - \hat\beta_1 \bar x, \qquad \hat\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^{n} (x_i - \bar x)^2}$$

are called the ordinary least squares (OLS) estimators for $\beta_0$ and $\beta_1$.

– For a given sample, the quantities $\hat\beta_0$ and $\hat\beta_1$ are called the OLS estimates for $\beta_0$ and $\beta_1$.

– The OLS sample regression function or OLS regression line for the simple regression model is given by

$$\hat y_i = \hat\beta_0 + \hat\beta_1 x_i \qquad (2.11)$$

with residuals $\hat u_i = y_i - \hat y_i$.

– The OLS sample regression model is denoted by

$$y_i = \hat\beta_0 + \hat\beta_1 x_i + \hat u_i \qquad (2.12)$$


Note:

– The OLS estimator $\hat\beta_1$ only exists if the sample observations $x_i$, $i = 1, \dots, n$, exhibit variation.

Assumption SLR.3 (Sample Variation in the Explanatory Variable): In the sample the outcomes of the independent variable $x_i$, $i = 1, 2, \dots, n$, are not all the same.

– The derivation of the OLS estimator only requires Assumption SLR.3 but not the population Assumptions SLR.1 and SLR.4.

– In order to investigate the statistical properties of the OLS estimator one needs further assumptions, see Sections 2.7, 3.4, 4.2.

– One can also derive the OLS estimator from the assumptions about the population, see below.
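
The estimators (2.9) and (2.10), together with the existence condition in Assumption SLR.3, can be sketched in a few lines of plain Python. The function name `ols_simple` and the toy data are assumptions made for this illustration; in practice one would use an econometrics package such as EViews, as in this course.

```python
def ols_simple(x, y):
    """Return (b0_hat, b1_hat) computed from equations (2.9) and (2.10)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Denominator of (2.10): sample variation in x.
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    if s_xx == 0:
        # Without variation in x the slope estimator does not exist.
        raise ValueError("no sample variation in x (Assumption SLR.3 fails)")
    # Numerator of (2.10).
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1_hat = s_xy / s_xx
    b0_hat = y_bar - b1_hat * x_bar  # equation (2.9)
    return b0_hat, b1_hat

# Toy data lying exactly on the line y = 1 + 2x (invented for illustration).
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
b0_hat, b1_hat = ols_simple(x, y)
fitted = [b0_hat + b1_hat * xi for xi in x]          # fitted values
residuals = [yi - fi for yi, fi in zip(y, fitted)]   # residuals
print(b0_hat, b1_hat, residuals)
```

Because the toy data contain no noise, the residuals are zero and the estimates coincide with the intercept and slope used to generate them; with real data the residuals are generally nonzero.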



• The OLS estimator as a Moment Estimator:

– Note that from Assumption SLR.4, $E[u|x] = 0$, one obtains two conditions on moments: $E[u] = 0$ and $Cov(x, u) = 0$. Inserting Assumption SLR.1, $u = y - \beta_0 - \beta_1 x$, defines moment conditions for the model parameters

$$E[y - \beta_0 - \beta_1 x] = 0, \qquad (2.13)$$

$$E[x(y - \beta_0 - \beta_1 x)] = 0. \qquad (2.14)$$

– How to estimate the moment conditions using sample functions?

– Assumption SLR.2 (Random Sampling): The sample of size n is obtained by random sampling, that is, the pairs $(x_i, y_i)$ and $(x_j, y_j)$, $i \neq j$, $i, j = 1, \dots, n$, are pairwise identically and independently distributed following the population model.


– An important result in statistics, see Section 5.1, says: If Assumption SLR.2 holds, then the expected value can well be estimated by the sample average. (Assumption SLR.2 can be weakened, see e.g. Chapter 11 in Wooldridge (2009).)

– If one replaces the expected values in (2.13) and (2.14) by their sample averages, one obtains

$$n^{-1}\sum_{i=1}^{n} \left(y_i - \hat\beta_0 - \hat\beta_1 x_i\right) = 0, \qquad (2.15)$$

$$n^{-1}\sum_{i=1}^{n} x_i \left(y_i - \hat\beta_0 - \hat\beta_1 x_i\right) = 0. \qquad (2.16)$$

By multiplying (2.15) and (2.16) by n one obtains the normal equations (2.7) and (2.8).
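
The sample moment conditions (2.15) and (2.16) can be checked numerically: once the OLS estimates are computed, the residuals average to zero and are uncorrelated with the regressor in the sample. The following is a minimal sketch with invented data; the variable names are chosen for this example only.

```python
# Invented sample for illustration (not course data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 6.0]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# OLS estimates from (2.10) and (2.9).
b1_hat = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
b0_hat = y_bar - b1_hat * x_bar

# Residuals of the sample regression model.
residuals = [yi - b0_hat - b1_hat * xi for xi, yi in zip(x, y)]

# Sample analogues of the moment conditions.
moment_1 = sum(residuals) / n                                 # (2.15)
moment_2 = sum(xi * ui for xi, ui in zip(x, residuals)) / n   # (2.16)
print(b0_hat, b1_hat, moment_1, moment_2)
```

Both sample moments are zero up to floating-point rounding, which is the numerical counterpart of the normal equations (2.7) and (2.8).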



The Trade Example Continued

Question:

Do imports to Kazakhstan increase if the exporting country experiences

an increase in GDP?

Scatter plot (from Section 1.1)

Figure: scatter plot of TRADE_0_D_O (vertical axis, 0 to 1.2E+09) against WEO_GDPCR_O (horizontal axis, 0 to 12000).


The OLS regression line is given by

$$\widehat{\text{imports}}_i = 53441198 + 2.16 \cdot 10^{-5}\,\text{gdp}_i, \quad i = 1, \dots, 55,$$

and the sample regression model by

$$\text{imports}_i = 53441198 + 2.16 \cdot 10^{-5}\,\text{gdp}_i + \hat u_i, \quad i = 1, \dots, 55.$$

Figure: TRADE_0_D_O vs. WEO_GDPCR_O (scatter plot, same axes as before).


====================================================================

Dependent Variable: TRADE_0_D_O

Method: Least Squares

Date: 02/09/09 Time: 17:12

Sample: 1 55

Included observations: 52

====================================================================

Variable Coefficient Std. Error t-Statistic Prob.

====================================================================

C 53441198 25852541 2.067155 0.0439

WDI_GDPUSDCR_O 2.16E-05 1.37E-05 1.572746 0.1221

====================================================================

R-squared 0.047139 Mean dependent var 67801339

Adjusted R-squared 0.028081 S.D. dependent var 1.77E+08

S.E. of regression 1.74E+08 Akaike info criterion 40.82943

Sum squared resid 1.52E+18 Schwarz criterion 40.90448

Log likelihood -1059.565 F-statistic 2.473530

Durbin-Watson stat 2.297310 Prob(F-statistic) 0.122085

====================================================================

• For a data description see Appendix 10.4:

$\text{imports}_i$ (from country i): TRADE_0_D_O

$\text{gdp}_i$ (in exporting country i): WDI_GDPUSDCR_O
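
Several entries of the regression output above are linked by simple formulas, which gives a quick plausibility check. The sketch below copies the printed numbers by hand and assumes the usual definitions (t-statistic = coefficient / standard error, adjusted R-squared with one regressor, and F = t² in a simple regression); the variable names are invented for this example.

```python
# Values copied from the regression output above.
coef_c, se_c, t_c = 53441198, 25852541, 2.067155
t_gdp, f_stat = 1.572746, 2.473530
r2, adj_r2 = 0.047139, 0.028081
n, k = 52, 1  # included observations, number of regressors besides the constant

# t-statistic of the constant: coefficient divided by its standard error.
t_c_check = coef_c / se_c

# Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - k - 1).
adj_r2_check = 1 - (1 - r2) * (n - 1) / (n - k - 1)

# With a single regressor the F-statistic equals the squared slope t-statistic.
f_check = t_gdp ** 2

print(t_c_check, adj_r2_check, f_check)
```

Note that the output reports 52 included observations although the sample range is 1 to 55, so the degrees of freedom in such checks are based on n = 52. The slope t-statistic cannot be reproduced to the same precision from the printed coefficient and standard error because both are rounded to three significant digits.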


• Potential interpretation of the estimated slope parameter:

$$\hat\beta_1 = \frac{\Delta\,\text{imports}}{\Delta\,\text{gdp}}$$

indicates by how many US dollars imports to Kazakhstan increase if GDP in an exporting country increases by 1 US dollar.

• Does this interpretation really make sense? Aren’t there other important influencing factors missing? What about using economic theory as well?

• What about the quality of the estimates?


Example: Wage Regression

Question:

How does education influence the hourly wage of an employee?

• Data (Source: Example 2.4 in Wooldridge (2009)): Sample of U.S.

employees with n = 526 observations. Available data are:

– wage per hour in dollars and

– educ years of schooling of each employee.

• The OLS regression line is given by

$$\widehat{wage}_i = -0.90 + 0.54\, educ_i, \quad i = 1, \ldots, 526.$$

The sample regression model is

$$wage_i = -0.90 + 0.54\, educ_i + \hat u_i, \quad i = 1, \ldots, 526.$$


====================================================================

Dependent Variable: WAGE Method: Least Squares Sample: 1 526


====================================================================


====================================================================

C -0.904852 0.684968 -1.321013 0.1871

EDUC 0.541359 0.053248 10.16675 0.0000

====================================================================

R-squared 0.164758 Mean dependent var 5.896103

Adjusted R-squared 0.163164 S.D. dependent var 3.693086

S.E. of regression 3.378390 Akaike info criterion 5.276470

Sum squared resid 5980.682 Schwarz criterion 5.292688



====================================================================

• Interpretation of the estimated slope parameter:

$$\hat\beta_1 = \frac{\Delta wage}{\Delta educ}$$

indicates by how much the hourly wage changes if the years of schooling increase by one year:

– An additional year in school or university increases the hourly wage by 54 cents.

– But: Somebody without any education earns an hourly wage of −90 cents? Does this interpretation make sense?

• Is it always sensible to interpret the slope coefficient? Watch out for spurious causality, see next section.

• Are these estimates reliable or good in some sense? What do we

mean by “good” in econometrics and statistics? To get more insight

study

– the statistical properties of the OLS estimator and the OLS esti-

mates, see Section 2.7 and

– check the choice of the functional form for the conditional expec-

tation E[y|x], see Section 2.6.



2.4 Best Linear Prediction, Correlation, and Causality

Best Linear Prediction

• What does the OLS estimator estimate if Assumptions SLR.1 and SLR.4 (alias Assumption (A)) are not valid in the population from which the sample is drawn?

• Note that $SSR(\gamma_0, \gamma_1)/n = \sum_{i=1}^n (y_i - \gamma_0 - \gamma_1 x_i)^2 / n$ is a sample average and thus estimates the expected value

$$E\left[(y - \gamma_0 - \gamma_1 x)^2\right] \qquad (2.17)$$

if Assumption SLR.2 (or some weaker form) holds. (For existence of (2.17) it is required that $0 < Var(x) < \infty$ and $Var(y) < \infty$.) Equation (2.17) is called the mean squared error of a linear predictor

$$\gamma_0 + \gamma_1 x.$$


• Mimicking the minimization of $SSR(\gamma_0, \gamma_1)$, the theoretically best fit of a linear predictor $\gamma_0 + \gamma_1 x$ to y is obtained by minimizing its mean squared error (2.17) with respect to $\gamma_0$ and $\gamma_1$. This leads (try to derive it) to

$$\gamma_0^* = E[y] - \gamma_1^* E[x], \qquad (2.18)$$

$$\gamma_1^* = \frac{Cov(x, y)}{Var(x)} = Corr(x, y) \sqrt{\frac{Var(y)}{Var(x)}} \qquad (2.19)$$

with

$$Corr(x, y) = \frac{Cov(x, y)}{\sqrt{Var(x)\,Var(y)}}, \quad -1 \le Corr(x, y) \le 1,$$

denoting the correlation that measures the linear dependence between two variables in a population, here x and y.

The expression

$$\gamma_0^* + \gamma_1^* x \qquad (2.20)$$

is called the best linear predictor of y where “best” is defined by minimal mean squared error.
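The derivation suggested above can be sketched as follows. This is only an outline under the stated moment conditions ($0 < Var(x) < \infty$, $Var(y) < \infty$), and it assumes that differentiation and expectation may be interchanged:

```latex
\begin{align*}
\frac{\partial}{\partial \gamma_0} E\left[(y - \gamma_0 - \gamma_1 x)^2\right]
  &= -2\,E[y - \gamma_0 - \gamma_1 x] = 0
  \;\Rightarrow\; \gamma_0^* = E[y] - \gamma_1^* E[x], \\
\frac{\partial}{\partial \gamma_1} E\left[(y - \gamma_0 - \gamma_1 x)^2\right]
  &= -2\,E[x\,(y - \gamma_0 - \gamma_1 x)] = 0 .
\end{align*}
% Inserting \gamma_0^* into the second condition gives
\begin{align*}
E[x\,(y - E[y])] - \gamma_1^* E[x\,(x - E[x])] = 0
  \;\Leftrightarrow\; Cov(x, y) - \gamma_1^* Var(x) = 0
  \;\Rightarrow\; \gamma_1^* = \frac{Cov(x, y)}{Var(x)} .
\end{align*}
```

The second-order conditions hold because the mean squared error is a convex quadratic function of $(\gamma_0, \gamma_1)$ when $Var(x) > 0$.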

• Now observe that for the simple regression model

$$y = \gamma_0^* + \gamma_1^* x + \varepsilon$$

one has $Cov(x, \varepsilon) = 0$, a weaker form of SLR.4, since

$$Cov(x, y) = \frac{Cov(x, y)}{Var(x)}\, Var(x) + Cov(x, \varepsilon).$$

This indicates that one can show that under Assumptions SLR.2 and SLR.3 the OLS estimator estimates the parameters $\gamma_0^*$ and $\gamma_1^*$ of the best linear predictor. Observe also that the OLS estimator (2.10) for the slope coefficient consists of the sample averages of the moments defining $\gamma_1^*$:

$$\hat\gamma_1 = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2}.$$


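• Rewriting $\hat\gamma_1$, which coincides with the OLS slope estimator $\hat\beta_1$, as

$$\hat\beta_1 = \widehat{Corr}(x, y) \sqrt{\frac{\sum_{i=1}^n (y_i - \bar y)^2}{\sum_{i=1}^n (x_i - \bar x)^2}}$$

using the empirical correlation coefficient

$$\widehat{Corr}(x, y) = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sqrt{\sum_{i=1}^n (x_i - \bar x)^2 \sum_{i=1}^n (y_i - \bar y)^2}}$$

shows that the estimated slope coefficient is non-zero if there is sample correlation between x and y.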
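The link between the OLS slope and the empirical correlation coefficient can be checked numerically. The following is a minimal sketch on synthetic data; the data-generating values and variable names are assumptions made for illustration and are not taken from the course data.

```python
import numpy as np

# Synthetic data (assumed for illustration only).
rng = np.random.default_rng(0)
x = rng.normal(10.0, 2.0, size=200)
y = 1.5 + 0.8 * x + rng.normal(0.0, 1.0, size=200)

dx, dy = x - x.mean(), y - y.mean()

# OLS slope as a ratio of sample moments.
slope_moments = (dx @ dy) / (dx @ dx)

# Empirical correlation coefficient and the slope rewritten in terms of it.
corr = (dx @ dy) / np.sqrt((dx @ dx) * (dy @ dy))
slope_corr = corr * np.sqrt((dy @ dy) / (dx @ dx))

print(slope_moments, slope_corr, corr)
```

Both slope expressions agree up to floating-point error, and the slope is zero exactly when the sample correlation is zero.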



Causality

• Recall Section 1.3.

• Be aware that the slope coefficient of the best linear predictor $\gamma_1^*$ and its OLS estimate $\hat\gamma_1$ cannot be automatically interpreted in terms of a causal relationship since estimating the best linear predictor

– only captures correlation but not direction,

– may not estimate the model of interest, e.g. if Assumptions SLR.1 and SLR.4 are violated and $\beta_1 \neq \gamma_1^*$,

– may produce garbage if $\widehat{Corr}(x, y)$ estimates spurious correlation ($Corr(x, y) = 0$ and Assumption SLR.2 (or its weaker versions) is violated),

– or relevant control variables are missing in the simple regression model such that the results cannot represent results of a hypothetical randomized controlled experiment, see Chapter 3 onwards.

Therefore, before any causal interpretation takes place one has to use specification and diagnostic techniques for regression models. Frequently one needs economic theory to identify causal relationships.


2.5 Algebraic Properties of the OLS Estimator

Basic properties:

• $\sum_{i=1}^n \hat u_i = 0$, because of normal equation (2.7),

• $\sum_{i=1}^n x_i \hat u_i = 0$, because of normal equation (2.8).

• The point $(\bar x, \bar y)$ lies on the regression line.

Can you provide some intuition for these properties?


• Total sum of squares (SST)

$$SST \equiv \sum_{i=1}^n (y_i - \bar y)^2$$

• Explained sum of squares (SSE)

$$SSE \equiv \sum_{i=1}^n (\hat y_i - \bar y)^2$$

• Sum of squared residuals (SSR)

$$SSR \equiv \sum_{i=1}^n \hat u_i^2$$

• The decomposition SST = SSE + SSR holds if the regression model contains an intercept $\beta_0$.


• Coefficient of determination $R^2$ (or R-squared)

$$R^2 = \frac{SSE}{SST}.$$

– Interpretation: share of variation of $y_i$ that is explained by the variation of $x_i$.

– If the regression model contains an intercept term $\beta_0$, then

$$R^2 = \frac{SSE}{SST} = 1 - \frac{SSR}{SST}$$

due to the decomposition SST = SSE + SSR, and therefore $0 \le R^2 \le 1$.

– Later we will see: Choosing regressors with $R^2$ is in general misleading.
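The algebraic properties and the decomposition can be verified on any data set. The sketch below uses synthetic data (an assumption for illustration) and the simple-regression OLS formulas.

```python
import numpy as np

# Synthetic data (assumed for illustration only).
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 20.0, size=100)
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, size=100)

# OLS estimates for the simple regression with intercept.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x
u_hat = y - y_hat

sst = np.sum((y - y.mean()) ** 2)      # total sum of squares
sse = np.sum((y_hat - y.mean()) ** 2)  # explained sum of squares
ssr = np.sum(u_hat ** 2)               # sum of squared residuals
r2 = sse / sst

print(u_hat.sum(), (x * u_hat).sum())  # both are zero up to rounding
print(sst, sse + ssr, r2)
```

Dropping the intercept from the fitted model would in general break the first residual property and with it the decomposition.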



Reading:

• Sections 1.4 and 2.1-2.3 in Wooldridge (2009) and Appendix 10.1

if needed.

• 2.4 and 2.5 in Wooldridge (2009).



2.6 Parameter Interpretation and Functional Form and

Data Transformation

• The term linear in “simple linear regression models” does not imply

that the relationship between the explained and the explanatory

variable is linear. Instead it refers to the fact that the parameters

β0 and β1 enter the model linearly.

• Examples for regression models that are linear in their parameters:

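$$\begin{aligned}
y_i &= \beta_0 + \beta_1 x_i + u_i, \\
y_i &= \beta_0 + \beta_1 \ln x_i + u_i, \\
\ln y_i &= \beta_0 + \beta_1 \ln x_i + u_i, \\
\ln y_i &= \beta_0 + \beta_1 x_i + u_i, \\
y_i &= \beta_0 + \beta_1 x_i^2 + u_i.
\end{aligned}$$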


The Natural Logarithm in Econometrics

Frequently variables are transformed by taking the natural logarithm

ln. Then the interpretation of the slope coefficient has to be ad-

justed accordingly.

Taylor approximation of the logarithmic function:

$$\ln(1 + z) \approx z \quad \text{if } z \text{ is close to } 0.$$

Using this approximation one can derive a popular approximation of growth rates or returns

$$\frac{\Delta x_t}{x_{t-1}} \equiv \frac{x_t - x_{t-1}}{x_{t-1}} \approx \ln\left(1 + \frac{x_t - x_{t-1}}{x_{t-1}}\right),$$

$$\frac{\Delta x_t}{x_{t-1}} \approx \ln(x_t) - \ln(x_{t-1}),$$

which approximates well if the relative change $\Delta x_t / x_{t-1}$ is close to 0.

One obtains percentages by multiplying by 100:

$$100\,\Delta \ln(x_t) \approx \%\Delta x_t = 100\,(x_t - x_{t-1})/x_{t-1}.$$

Thus, the percentage change for small $\Delta x_t / x_{t-1}$ can be well approximated by $100[\ln(x_t) - \ln(x_{t-1})]$.
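How quickly the log-difference approximation deteriorates can be seen from a few values. This is a small sketch; the function names and the growth rates are chosen here for illustration.

```python
import math

def pct_change(x_prev: float, x_new: float) -> float:
    """Exact percentage change 100 * (x_new - x_prev) / x_prev."""
    return 100.0 * (x_new - x_prev) / x_prev

def log_diff_pct(x_prev: float, x_new: float) -> float:
    """Approximate percentage change 100 * [ln(x_new) - ln(x_prev)]."""
    return 100.0 * (math.log(x_new) - math.log(x_prev))

# Compare both measures for small and large relative changes.
for growth in (0.01, 0.05, 0.20, 0.50):
    x_prev, x_new = 100.0, 100.0 * (1.0 + growth)
    exact = pct_change(x_prev, x_new)
    approx = log_diff_pct(x_prev, x_new)
    print(f"exact {exact:6.2f}%   log difference {approx:6.2f}%")
```

For a 1% change the two numbers are almost identical, whereas for a 50% change the log difference understates the exact percentage change considerably.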

• Examples of models that are nonlinear in the parameters $(\beta_0, \beta_1, \gamma, \lambda, \pi, \delta)$:

$$\begin{aligned}
y_i &= \beta_0 + \beta_1 x_i^{\gamma} + u_i, \\
y_i^{\gamma} &= \beta_0 + \beta_1 \ln x_i + u_i, \\
y_i &= \beta_0 + \beta_1 x_i + \frac{1}{1 + \exp(\lambda (x_i - \pi))} (\gamma + \delta x_i) + u_i.
\end{aligned}$$

• The last example allows for smooth switching between two linear regimes. The possibilities for formulating nonlinear regression models are huge. However, their estimation requires more advanced methods such as nonlinear least squares that are beyond the scope of this course.


• Note, however, that linear regression models allow for a wide range

of nonlinear relationships between the dependent and independent

variables, some of which were listed at the beginning of this section.

Economic Interpretation of OLS Parameters

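• Consider the ratio of relative changes of two non-stochastic variables y and x

$$\frac{\Delta y / y}{\Delta x / x} = \frac{\%\text{ change of } y}{\%\text{ change of } x} = \frac{\%\Delta y}{\%\Delta x}.$$

If $\Delta y \to 0$ and $\Delta x \to 0$, then it can be shown that $\frac{\Delta y}{\Delta x} \to \frac{dy}{dx}$.

• If this result is applied to the ratio above, one obtains the elasticity

$$\eta(x) = \frac{dy}{dx} \frac{x}{y}.$$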


• Interpretation: If the relative change of x is 0.01, then the relative change of y is given by $0.01\,\eta(x)$.

In other words: If x changes by 1%, then y changes by $\eta(x)$%.

• If y, x are random variables, then the elasticity is defined with respect to the conditional expectation of y given x:

$$\eta(x) = \frac{dE[y|x]}{dx} \frac{x}{E[y|x]}.$$

This can be derived from

$$\frac{\dfrac{E[y|x_1 = x_0 + \Delta x] - E[y|x_0]}{E[y|x_0]}}{\dfrac{\Delta x}{x_0}} = \frac{E[y|x_1 = x_0 + \Delta x] - E[y|x_0]}{\Delta x} \cdot \frac{x_0}{E[y|x_0]}$$

and letting $\Delta x \to 0$.


Different Models and Interpretations of β1

For each model it is assumed that SLR.1 and SLR.4 hold.

• Models that are linear with respect to their variables (level-level models)

$$y = \beta_0 + \beta_1 x + u.$$

It holds that

$$\frac{dE[y|x]}{dx} = \beta_1$$

and thus

$$\Delta E[y|x] = \beta_1 \Delta x.$$

In words:

The slope coefficient denotes the absolute change in the conditional expectation of the dependent variable y for a one-unit change in the independent variable x.


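• Level-log models

$$y = \beta_0 + \beta_1 \ln x + u.$$

It holds that

$$\frac{dE[y|x]}{dx} = \beta_1 \frac{1}{x}$$

and thus

$$\Delta E[y|x] \approx \beta_1 \Delta \ln x = \frac{\beta_1}{100}\, 100\, \Delta \ln x \approx \frac{\beta_1}{100}\, \%\Delta x.$$

In words:

The conditional expectation of y changes by $\beta_1/100$ units if x changes by 1%.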


• Log-log models

are frequently called loglinear models or constant-elasticity models and are very popular in empirical work

$$\ln y = \beta_0 + \beta_1 \ln x + u.$$

Similar to above one can show that

$$\frac{dE[y|x]}{dx} = \beta_1 \frac{E[y|x]}{x}, \quad \text{and thus } \beta_1 = \eta(x)$$

if $E[e^u|x]$ is constant.

In these models the slope coefficient is interpreted as the elasticity between the level variables y and x.
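The constant-elasticity property can be illustrated numerically. The sketch below assumes hypothetical parameter values and a constant value c for $E[e^u|x]$, so that $E[y|x] = e^{\beta_0} x^{\beta_1} c$; the elasticity is then approximated by a central difference.

```python
import math

# Hypothetical parameter values, chosen for illustration only.
beta0, beta1 = 0.5, 0.77
c = 1.0  # assumed constant value of E[exp(u)|x]

def cond_mean(x: float) -> float:
    """E[y|x] implied by the log-log model when E[exp(u)|x] = c."""
    return math.exp(beta0) * x ** beta1 * c

def elasticity(x: float, rel_step: float = 1e-6) -> float:
    """Numerical elasticity (dE[y|x]/dx) * x / E[y|x] via a central difference."""
    h = rel_step * x
    deriv = (cond_mean(x + h) - cond_mean(x - h)) / (2.0 * h)
    return deriv * x / cond_mean(x)

for x in (1.0, 10.0, 1000.0):
    print(x, elasticity(x))
```

The computed elasticity equals beta1 at every x up to numerical error, which is why the slope of a log-log model can be read as an elasticity.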



The Trade Example Continued

====================================================================

Dependent Variable: LOG(TRADE_0_D_O)


Date: 02/08/09 Time: 12:29

Sample: 1 55


====================================================================


====================================================================

C -3.460963 3.280047 -1.055157 0.2964

LOG(WDI_GDPUSDCR_O) 0.770127 0.129507 5.946600 0.0000

====================================================================





Log likelihood -109.3209 F-statistic 2.65E-07

====================================================================

Note the very different interpretation of the estimated slope coefficient $\hat\beta_1$:

– Level-level model (Section 2.3): an increase in GDP in the exporting country by 1 billion US dollars corresponds to an increase of imports to Kazakhstan by 0.0216 million US dollars ($2.16 \cdot 10^{-5} \cdot 10^{9} = 21{,}600$ US dollars).

– Log-log model: a 1% increase of GDP in the exporting country corresponds to an increase of imports by 0.77%.

But wait before you take these numbers seriously.


2.7 Statistical Properties of the OLS Estimator: Expected

Value and Variance

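• Some preparatory transformations (all sums are indexed by $i = 1, \ldots, n$):

$$\hat\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2} = \frac{\sum_{i=1}^n (x_i - \bar x)\, y_i}{\sum_{j=1}^n (x_j - \bar x)^2} = \sum_{i=1}^n \underbrace{\frac{(x_i - \bar x)}{\sum_{j=1}^n (x_j - \bar x)^2}}_{w_i}\, y_i = \sum w_i y_i$$

where it can be shown that (try it): $\sum w_i = 0$, $\sum w_i x_i = 1$ and $\sum w_i^2 = \dfrac{1}{\sum_{j=1}^n (x_j - \bar x)^2}$.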
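The three properties of the weights can be checked directly. The numbers below are hypothetical and serve only to illustrate the identities.

```python
import numpy as np

# Hypothetical data, chosen for illustration only.
x = np.array([8.0, 12.0, 12.0, 16.0, 18.0, 10.0])
y = np.array([4.1, 6.3, 5.0, 9.8, 11.2, 4.9])

ssx = np.sum((x - x.mean()) ** 2)
w = (x - x.mean()) / ssx  # weights w_i

# Slope written as a weighted sum of the y_i and in the usual moment form.
b1_weights = np.sum(w * y)
b1_direct = np.sum((x - x.mean()) * (y - y.mean())) / ssx

print(w.sum(), (w * x).sum(), (w ** 2).sum() * ssx)  # 0, 1, 1 up to rounding
print(b1_weights, b1_direct)
```

The weights depend only on the regressor values, which is what makes the conditional arguments in the following proofs work.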



• Unbiasedness of the OLS estimator:

If Assumptions SLR.1 to SLR.4 hold, then

$$E[\hat\beta_0] = \beta_0, \qquad E[\hat\beta_1] = \beta_1.$$

Interpretation:

If one keeps repeatedly drawing new samples and estimating the regression parameters, then the average of all obtained OLS parameter estimates roughly corresponds to the population parameters.

The property of unbiasedness is a property of the sampling distribution of the OLS estimators for $\beta_0$ and $\beta_1$. It does not imply that the population parameters are perfectly estimated for a specific sample.
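The repeated-sampling interpretation can be mimicked with a small Monte Carlo experiment. All population values, the sample size, and the number of replications below are assumptions made for this sketch.

```python
import numpy as np

# Assumed population model: y = beta0 + beta1 * x + u with E[u|x] = 0.
rng = np.random.default_rng(42)
beta0, beta1, sigma = 1.0, 0.5, 2.0
n, reps = 50, 5000

slopes = np.empty(reps)
for r in range(reps):
    # Draw a fresh random sample in every replication.
    x = rng.normal(5.0, 2.0, size=n)
    u = rng.normal(0.0, sigma, size=n)
    y = beta0 + beta1 * x + u
    dx = x - x.mean()
    slopes[r] = (dx @ y) / (dx @ dx)  # OLS slope estimate

print(slopes.mean())               # close to beta1
print(slopes.min(), slopes.max())  # single estimates can be far off
```

The average over replications is close to the population slope, while individual estimates scatter noticeably around it.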



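Proof for $\hat\beta_1$ (clarify where each SLR assumption is needed):

1. $E[\hat\beta_1 | x_1, \ldots, x_n]$ can be manipulated as follows:

$$\begin{aligned}
E[\hat\beta_1 | x_1, \ldots, x_n] &= E\left[\sum w_i y_i \,\Big|\, x_1, \ldots, x_n\right] \\
&= E\left[\sum w_i (\beta_0 + \beta_1 x_i + u_i) \,\Big|\, x_1, \ldots, x_n\right] \\
&= \sum E\left[w_i (\beta_0 + \beta_1 x_i + u_i) \,|\, x_1, \ldots, x_n\right] \\
&= \beta_0 \sum w_i + \beta_1 \sum w_i x_i + \sum E\left[w_i u_i \,|\, x_1, \ldots, x_n\right] \\
&= \beta_1 + \sum w_i E\left[u_i \,|\, x_1, \ldots, x_n\right] \\
&= \beta_1 + \sum w_i E\left[u_i \,|\, x_i\right] \\
&= \beta_1.
\end{aligned}$$

2. From $E[\hat\beta_1] = E[E[\hat\beta_1 | x_1, \ldots, x_n]]$ one obtains unbiasedness

$$E[\hat\beta_1] = \beta_1.$$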


• Variance of the OLS estimator

In order to determine the variance of the OLS estimators $\hat\beta_0$ and $\hat\beta_1$ we need another assumption,

Assumption SLR.5 (Homoskedasticity):

$$Var(u|x) = \sigma^2.$$

• Variances of parameter estimators conditional on the sample observations

If Assumptions SLR.1 to SLR.5 hold, then

$$Var(\hat\beta_1 | x_1, \ldots, x_n) = \sigma^2 \frac{1}{\sum_{i=1}^n (x_i - \bar x)^2},$$

$$Var(\hat\beta_0 | x_1, \ldots, x_n) = \sigma^2 \frac{n^{-1} \sum x_i^2}{\sum_{i=1}^n (x_i - \bar x)^2}.$$


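Proof (for the conditional variance of $\hat\beta_1$):

$$\begin{aligned}
Var(\hat\beta_1 | x_1, \ldots, x_n) &= Var\left(\sum w_i u_i \,\Big|\, x_1, \ldots, x_n\right) \\
&= \sum Var(w_i u_i \,|\, x_1, \ldots, x_n) \\
&= \sum w_i^2\, Var(u_i \,|\, x_1, \ldots, x_n) \\
&= \sum w_i^2\, Var(u_i \,|\, x_i) \\
&= \sum w_i^2 \sigma^2 \\
&= \sigma^2 \sum w_i^2 \\
&= \sigma^2 \frac{1}{\sum (x_i - \bar x)^2}.
\end{aligned}$$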


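• Covariance between the intercept and the slope estimator:

$$Cov(\hat\beta_0, \hat\beta_1 | x_1, \ldots, x_n) = -\sigma^2 \frac{\bar x}{\sum_{i=1}^n (x_i - \bar x)^2}.$$

Proof: $Cov(\hat\beta_0, \hat\beta_1 | x_1, \ldots, x_n)$ can be manipulated as follows:

$$\begin{aligned}
Cov(\hat\beta_0, \hat\beta_1 | x_1, \ldots, x_n) &= Cov(\bar y - \hat\beta_1 \bar x, \hat\beta_1 \,|\, x_1, \ldots, x_n) \\
&= \underbrace{Cov(\bar u, \hat\beta_1 \,|\, x_1, \ldots, x_n)}_{=0 \text{ see below}} - Cov(\hat\beta_1 \bar x, \hat\beta_1 \,|\, x_1, \ldots, x_n) \\
&= -\bar x\, Cov(\hat\beta_1, \hat\beta_1 \,|\, x_1, \ldots, x_n) \\
&= -\bar x\, Var(\hat\beta_1 \,|\, x_1, \ldots, x_n) \\
&= -\sigma^2 \frac{\bar x}{\sum (x_i - \bar x)^2}.
\end{aligned}$$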


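$$\begin{aligned}
Cov(\bar u, \hat\beta_1 \,|\, x_1, \ldots, x_n) &= Cov\left(\bar u, \sum w_i u_i \,\Big|\, x_1, \ldots, x_n\right) \\
&= Cov(\bar u, w_1 u_1 \,|\, x_1, \ldots, x_n) + \cdots + Cov(\bar u, w_n u_n \,|\, x_1, \ldots, x_n) \\
&= w_1 Cov(\bar u, u_1 \,|\, x_1, \ldots, x_n) + \cdots + w_n Cov(\bar u, u_n \,|\, x_1, \ldots, x_n) \\
&= \sum w_i\, Cov(\bar u, u_i \,|\, x_1, \ldots, x_n) \\
&= Cov(\bar u, u_1 \,|\, x_1, \ldots, x_n) \sum w_i \\
&= 0.
\end{aligned}$$

The second-to-last step uses that, under SLR.2 and SLR.5, $Cov(\bar u, u_i | x_1, \ldots, x_n) = \sigma^2/n$ is the same for every i.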


2.8 Estimation of the Error Variance

• One estimator for the error variance $\sigma^2$ is given by

$$\tilde\sigma^2 = \frac{1}{n} \sum_{i=1}^n \hat u_i^2,$$

where the $\hat u_i$'s denote the residuals of the OLS estimator.

Disadvantage: The estimator $\tilde\sigma^2$ does not take into account that 2 restrictions were imposed on obtaining the OLS residuals, namely:

$$\sum \hat u_i = 0, \qquad \sum \hat u_i x_i = 0.$$

This leads to biased estimates, $E[\tilde\sigma^2 | x_1, \ldots, x_n] \neq \sigma^2$.

• Unbiased estimator for the error variance:

$$\hat\sigma^2 = \frac{1}{n - 2} \sum_{i=1}^n \hat u_i^2.$$
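The effect of the two restrictions on the residuals can be made visible by simulation. The sketch below holds the regressor values fixed and redraws only the errors; all numerical settings are assumptions made for illustration.

```python
import numpy as np

# Assumed settings: small n makes the bias of the 1/n estimator visible.
rng = np.random.default_rng(7)
sigma2 = 4.0
n, reps = 10, 20000
x = rng.uniform(0.0, 10.0, size=n)  # regressors held fixed across replications
dx = x - x.mean()

div_n = np.empty(reps)        # estimator dividing SSR by n
div_n_minus_2 = np.empty(reps)  # estimator dividing SSR by n - 2
for r in range(reps):
    y = 1.0 + 0.5 * x + rng.normal(0.0, np.sqrt(sigma2), size=n)
    b1 = (dx @ y) / (dx @ dx)
    b0 = y.mean() - b1 * x.mean()
    ssr = np.sum((y - b0 - b1 * x) ** 2)
    div_n[r] = ssr / n
    div_n_minus_2[r] = ssr / (n - 2)

print(div_n.mean(), div_n_minus_2.mean())
```

With n = 10 the first average settles near $\sigma^2 (n-2)/n = 3.2$, while the second is close to $\sigma^2 = 4$.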



• If Assumptions SLR.1 to SLR.5 hold, then

$$E[\hat\sigma^2 | x_1, \ldots, x_n] = \sigma^2.$$

• Standard error of the regression, standard error of the estimate or root mean squared error:

$$\hat\sigma = \sqrt{\hat\sigma^2}.$$

• In the formulas for the variances of and covariance between the parameter estimators $\hat\beta_0$ and $\hat\beta_1$ the variance estimator $\hat\sigma^2$ can be used for estimating the unknown error variance $\sigma^2$.

Example:

$$\widehat{Var}(\hat\beta_1 | x_1, \ldots, x_n) = \frac{\hat\sigma^2}{\sum (x_i - \bar x)^2}.$$


Denote the standard deviation as

$$sd(\hat\beta_1 | x_1, \ldots, x_n) = \sqrt{Var(\hat\beta_1 | x_1, \ldots, x_n)},$$

then

$$\widehat{sd}(\hat\beta_1 | x_1, \ldots, x_n) = \frac{\hat\sigma}{\left(\sum (x_i - \bar x)^2\right)^{1/2}}$$

is frequently called the standard error of $\hat\beta_1$ and reported in the output of software packages.

Reading: Sections 2.4 and 2.5 in Wooldridge (2009) and Appendix 10.1 if needed.

3 Multiple Regression Analysis: Estimation

3.1 Motivation for Multiple Regression: The Trade

Example Continued

• In Section 2.6 two simple linear regression models for explaining

imports to Kazakhstan were estimated (and interpreted): a level-

level model and a log-log model.

• It is hardly credible that imports to Kazakhstan only depend on the GDP of the exporting country. What about, for example, distance, borders, and other factors causing trading costs?

• Such quantities have been found to be relevant in the empirical literature on gravity equations for explaining intra- and international trade. In general, bi-directional trade flows are considered. Here we consider only one-directional trade flows, namely exports to Kazakhstan in 2004. Such a simplified gravity equation reads as

$$\ln(imports_i) = \beta_0 + \beta_1 \ln(gdp_i) + \beta_2 \ln(distance_i) + u_i. \qquad (3.1)$$

Standard gravity equations are based on bilateral imports and exports over a number of years and thus require panel data techniques that are beyond the scope of this course. However, in Section 4.4 we will consider cross-section data on both imports and exports for 2004. For a brief introduction to gravity equations see e.g. Fratianni (2007). A recent theoretical underpinning of gravity equations was provided by Anderson & van Wincoop (2003).


• If relevant variables are neglected, Assumptions SLR.1 and/or SLR.4

could be violated and in this case interpretation of causal effects

can be highly misleading, see Section 3.4. To avoid this trap, the

multiple regression model can be useful.

• To get an idea about the change in the elasticity parameter due to a second independent variable, e.g. distance, inspect the following OLS estimate of the simple import equation (3.1):

Dependent Variable: LOG(TRADE_0_D_O), Method: Least Squares, Date: 02/13/09 Time: 14:31

Sample: 1 55, Included observations: 52

====================================================================


====================================================================

C 4.800950 3.497341 1.372743 0.1761


LOG(CEPII_DIST) -1.970804 0.480555 -4.101103 0.0002

====================================================================









Instead of an estimated elasticity of 0.77, see Section 2.6, one ob-

tains a value of 1.09. Furthermore, the R2 increases from 0.41 to

0.56, indicating a much better statistical fit. Finally, a 1% increase

in distance reduces imports by almost 2%. Is this model then better?

Or is it (also) misspecified?

To answer these questions we have to study the linear multiple re-

gression model first.



3.2 The Multiple Regression Model of the Population

• Assumptions:

The Assumptions SLR.1 and SLR.4 of the simple linear regression

model have to be adapted accordingly to the multiple linear regres-

sion model (MLR) for the population (see Section 3.3 in Wooldridge

(2009)):

– MLR.1 (Linearity in the Parameters)

The multiple regression model allows for more than one, say k, explanatory variables

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + u \qquad (3.2)$$

and the model is linear in its parameters.

Example: the import equation (3.1).


– MLR.4 (Zero Conditional Mean)

$$E[u | x_1, \ldots, x_k] = 0 \quad \text{for all } x.$$

Observe that all explanatory variables of the multiple regression (3.2) must be included in the conditioning set. Sometimes the conditioning set is called the information set.

• Remarks:

– To see the need for MLR.4, take the conditional expectation of y in (3.2) given all k regressors

$$E[y | x_1, x_2, \ldots, x_k] = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + E[u | x_1, x_2, \ldots, x_k].$$

If $E[u | x_1, x_2, \ldots, x_k] \neq 0$ for some x, then the systematic part $\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$ does not model the conditional expectation $E[y | x_1, \ldots, x_k]$ correctly.


– If MLR.1 and MLR.4 are fulfilled, then equation (3.2)

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + u$$

is also called the linear multiple regression model for the population. Frequently it is also called the true model (even if any model may be far from the truth). Alternatively, one may think of equation (3.2) as the data generating mechanism (although, strictly speaking, a data generating mechanism also requires specification of the probability distributions of all regressors and the error).

• To guarantee nice properties of the OLS estimator and the sample regression model, we adapt SLR.2 and SLR.3 accordingly:

– MLR.2 (Random Sampling)

The sample of size n is obtained by random sampling, that is, the observations $\{(x_{i1}, \ldots, x_{ik}, y_i) : i = 1, \ldots, n\}$ are pairwise independently and identically distributed following the population model in MLR.1.

– MLR.3 (No Perfect Collinearity)

(more on MLR.3 in Section 3.3)

• Interpretation:

– If Assumptions MLR.1 and MLR.4 are correct and the population regression model allows for a causal interpretation, then the multiple regression model is a great tool for ceteris paribus analysis. It allows one to hold the values of all explanatory variables fixed except one and check how the conditional expectation of the explained variable changes. This resembles changing one control variable in a randomized controlled experiment. Let $x_j$ be the control variable of interest.


– Taking conditional expectations of the multiple regression (3.2) and applying Assumption MLR.4 delivers

$$E[y | x_1, \ldots, x_j, \ldots, x_k] = \beta_0 + \beta_1 x_1 + \cdots + \beta_j x_j + \cdots + \beta_k x_k.$$

– Consider a change in $x_j$: $x_j + \Delta x_j$

$$E[y | x_1, \ldots, x_j + \Delta x_j, \ldots, x_k] = \beta_0 + \beta_1 x_1 + \cdots + \beta_j (x_j + \Delta x_j) + \cdots + \beta_k x_k.$$

∗ Ceteris-paribus effect:

In (3.2) the absolute change due to a change of $x_j$ by $\Delta x_j$ is given by

$$\begin{aligned}
\Delta E[y | x_1, \ldots, x_j, \ldots, x_k] \equiv{}& E[y | x_1, \ldots, x_{j-1}, x_j + \Delta x_j, x_{j+1}, \ldots, x_k] \\
&- E[y | x_1, \ldots, x_{j-1}, x_j, x_{j+1}, \ldots, x_k] = \beta_j \Delta x_j,
\end{aligned}$$


where $\beta_j$ corresponds to the first partial derivative

$$\frac{\partial E[y | x_1, \ldots, x_{j-1}, x_j, x_{j+1}, \ldots, x_k]}{\partial x_j} = \beta_j.$$

The parameter $\beta_j$ gives the partial effect of changing $x_j$ on the conditional expectation of y while all other regressors are held constant.

∗ Total effect:

Of course one can also consider simultaneous changes in the regressors, for example $\Delta x_1$ and $\Delta x_k$. For this case one obtains

$$\Delta E[y | x_1, \ldots, x_k] = \beta_1 \Delta x_1 + \beta_k \Delta x_k.$$

– Note that the specific interpretation of $\beta_j$ depends on how variables enter, e.g. as log variables. In a ceteris paribus analysis the results of Section 2.6 remain valid.


Trade Example Continued

• Considering the log-log model (3.1)

$$\ln(imports_i) = \beta_0 + \beta_1 \ln(gdp_i) + \beta_2 \ln(distance_i) + u_i$$

a 1% increase in distance leads to a change of $\beta_2$% in imports keeping GDP fixed. In other words, one can separate the effect of distance on imports from the effect of economic size. From the output table in Section 3.1 one obtains that a 1% increase in distance decreases imports by about 2%.

• Keep in mind that determining distances between countries is a complicated matter and results may change with the choice of the method for computing distances. Our data are from CEPII (http://www.cepii.fr/anglaisgraph/news/accueilengl.htm), see also Appendix 10.4.

• There may still be missing variables, see also Section 4.4.


Wage Example Continued

• In Section 2.3 it was assumed that hourly wage is determined by

$$wage = \beta_0 + \beta_1\, educ + u.$$

Instead of a level-level model one may also consider a log-level model

$$\ln(wage) = \beta_0 + \beta_1\, educ + u. \qquad (3.3)$$

However, since we expect that experience also matters for hourly wages, we want to include experience as well. We obtain

$$\ln(wage) = \beta_0 + \beta_1\, educ + \beta_2\, exper + v. \qquad (3.4)$$

What about the expected log wage given the variables educ and exper?

$$\begin{aligned}
E[\ln(wage) | educ, exper] &= \beta_0 + \beta_1\, educ + \beta_2\, exper + E[v | educ, exper], \\
E[\ln(wage) | educ, exper] &= \beta_0 + \beta_1\, educ + \beta_2\, exper,
\end{aligned}$$

where the second equation only holds if MLR.4 holds, that is if $E[v | educ, exper] = 0$.

• Note that if instead of (3.4) one investigates the simple linear log-level model (3.3) although the population model contains exper, one obtains

$$E[\ln(wage) | educ] = \beta_0 + \beta_1\, educ + \beta_2\, E[exper | educ] + E[v | educ],$$

indicating misspecification of the simple model since it ignores the influence of exper via $\beta_2$. Thus, the smaller model suffers from misspecification if

$$E[\ln(wage) | educ] \neq E[\ln(wage) | educ, exper]$$

for some values of educ or exper.


• Empirical results:

See Example 2.10 in Wooldridge (2009), file: wage1.wf1, output from EViews 6:

– Simple log-level model

Dependent Variable: LOG(WAGE)

Method: Least Squares, Date: 02/17/09 Time: 15:59


====================================================================

Coefficient Std. Error t-Statistic Prob.

====================================================================

C 0.583773 0.097336 5.997510 0.0000

EDUC 0.082744 0.007567 10.93534 0.0000

====================================================================





Log likelihood -359.3781 Hannan-Quinn criter. 1.380411

F-statistic 119.5816 Durbin-Watson stat 1.801328

Prob(F-statistic) 0.000000

====================================================================

$$\ln(wage_i) = 0.5838 + 0.0827\, educ_i + \hat u_i, \quad i = 1, \ldots, 526,$$

$$R^2 = 0.1858.$$

If SLR.1 to SLR.4 are valid, then each additional year of schooling is estimated to increase hourly wages by 8.3% on average. The sample regression model explains about 18.6% of the variation of the dependent variable ln(wage).


– Multivariate log-level model:

Dependent Variable: LOG(WAGE), Method: Least Squares

Date: 02/17/09 Time: 15:48


====================================================================


====================================================================

C 0.216854 0.108595 1.996909 0.0464

EDUC 0.097936 0.007622 12.84839 0.0000

EXPER 0.010347 0.001555 6.653393 0.0000

====================================================================








====================================================================

$$\ln(wage_i) = 0.2169 + 0.0979\, educ_i + 0.0103\, exper_i + \hat u_i, \quad i = 1, \ldots, 526,$$

$$R^2 = 0.2493.$$


∗ Ceteris-paribus interpretation: If MLR.1 to MLR.4 are correct, then the expected increase in hourly wages due to an additional year of schooling is about 9.8% and thus slightly larger than obtained from the simple regression model.

An additional year of experience corresponds to an increase in expected hourly wages by 1%.

∗ Model fit:

The model explains 24.9% of the variation of the dependent variable. Does this imply that the multivariate model is better than the simple regression model with an $R^2$ of 18.6%? Be careful with your answer and wait until we investigate model selection criteria.


3.3 The OLS Estimator: Derivation and Algebraic

Properties

• For an arbitrary estimator the sample regression model for a sample $(y_i, x_{i1}, \ldots, x_{ik})$, $i = 1, \ldots, n$, is given by

$$y_i = \tilde\beta_0 + \tilde\beta_1 x_{i1} + \tilde\beta_2 x_{i2} + \cdots + \tilde\beta_k x_{ik} + \tilde u_i, \quad i = 1, \ldots, n.$$

• Recall the idea of the OLS estimator: Choose $\tilde\beta_0, \ldots, \tilde\beta_k$ such that the sum of squared residuals (SSR)

$$SSR(\tilde\beta_0, \ldots, \tilde\beta_k) = \sum_{i=1}^n \tilde u_i^2 = \sum_{i=1}^n \left(y_i - \tilde\beta_0 - \tilde\beta_1 x_{i1} - \cdots - \tilde\beta_k x_{ik}\right)^2$$

is minimized. Taking first partial derivatives of $SSR(\tilde\beta_0, \ldots, \tilde\beta_k)$ with respect to all k + 1 parameters and setting them to zero yields the first order conditions of a minimum:

$$\sum_{i=1}^n \left(y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \cdots - \hat\beta_k x_{ik}\right) = 0 \qquad (3.5a)$$

$$\sum_{i=1}^n x_{i1} \left(y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \cdots - \hat\beta_k x_{ik}\right) = 0 \qquad (3.5b)$$

$$\vdots$$

$$\sum_{i=1}^n x_{ik} \left(y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \cdots - \hat\beta_k x_{ik}\right) = 0 \qquad (3.5c)$$

This system of normal equations contains k + 1 unknown parameters and k + 1 equations. Under some further conditions (see below) it has a unique solution.

Solving this set of equations becomes cumbersome if k is large. This can be circumvented if the normal equations are written in matrix notation.


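• The Multiple Regression Model in Matrix Form

Using matrix notation the multiple regression model can be rewritten as (Wooldridge 2009, Appendix E)

$$\mathbf{y} = \mathbf{X}\boldsymbol\beta + \mathbf{u}, \qquad (3.6)$$

where

$$\underbrace{\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}}_{\mathbf{y}} = \underbrace{\begin{pmatrix} x_{10} & x_{11} & x_{12} & \cdots & x_{1k} \\ x_{20} & x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & \vdots & & \vdots \\ x_{n0} & x_{n1} & x_{n2} & \cdots & x_{nk} \end{pmatrix}}_{\mathbf{X}} \underbrace{\begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{pmatrix}}_{\boldsymbol\beta} + \underbrace{\begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix}}_{\mathbf{u}}.$$

The matrix $\mathbf{X}$ is called the regressor matrix and has n rows and k + 1 columns. The column vectors $\mathbf{y}$ and $\mathbf{u}$ have n rows each, the column vector $\boldsymbol\beta$ has k + 1 rows.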


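• Derivation: The OLS Estimator in Matrix Notation

– One possibility to derive the OLS estimator in matrix notation is to rewrite the normal equations (3.5) in matrix notation. We do this explicitly for the j-th equation

$$\sum_{i=1}^n x_{ij} \left(y_i - \hat\beta_0 x_{i0} - \hat\beta_1 x_{i1} - \cdots - \hat\beta_k x_{ik}\right) = 0$$

that is manipulated to

$$\sum_{i=1}^n \left(x_{ij} y_i - \hat\beta_0 x_{ij} x_{i0} - \hat\beta_1 x_{ij} x_{i1} - \cdots - \hat\beta_k x_{ij} x_{ik}\right) = 0$$

and further to

$$\sum_{i=1}^n \left(\hat\beta_0 x_{ij} x_{i0} + \hat\beta_1 x_{ij} x_{i1} + \cdots + \hat\beta_k x_{ij} x_{ik}\right) = \sum_{i=1}^n x_{ij} y_i.$$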


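By factoring out we have

$$\left(\sum_{i=1}^n x_{ij} x_{i0}\right) \hat\beta_0 + \left(\sum_{i=1}^n x_{ij} x_{i1}\right) \hat\beta_1 + \cdots + \left(\sum_{i=1}^n x_{ij} x_{ik}\right) \hat\beta_k = \sum_{i=1}^n x_{ij} y_i.$$

Similarly, rearranging all other equations and collecting all k + 1 equations in a vector delivers

$$\begin{pmatrix} \left(\sum_{i=1}^n x_{i0} x_{i0}\right) \hat\beta_0 + \left(\sum_{i=1}^n x_{i0} x_{i1}\right) \hat\beta_1 + \cdots + \left(\sum_{i=1}^n x_{i0} x_{ik}\right) \hat\beta_k \\ \vdots \\ \left(\sum_{i=1}^n x_{ik} x_{i0}\right) \hat\beta_0 + \left(\sum_{i=1}^n x_{ik} x_{i1}\right) \hat\beta_1 + \cdots + \left(\sum_{i=1}^n x_{ik} x_{ik}\right) \hat\beta_k \end{pmatrix} = \begin{pmatrix} \sum_{i=1}^n x_{i0} y_i \\ \vdots \\ \sum_{i=1}^n x_{ik} y_i \end{pmatrix}.$$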


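Applying the rules for matrix multiplication yields

$$\underbrace{\begin{pmatrix} \sum_{i=1}^n x_{i0} x_{i0} & \sum_{i=1}^n x_{i0} x_{i1} & \cdots & \sum_{i=1}^n x_{i0} x_{ik} \\ \vdots & \vdots & \ddots & \vdots \\ \sum_{i=1}^n x_{ik} x_{i0} & \sum_{i=1}^n x_{ik} x_{i1} & \cdots & \sum_{i=1}^n x_{ik} x_{ik} \end{pmatrix}}_{\mathbf{X}'\mathbf{X}} \underbrace{\begin{pmatrix} \hat\beta_0 \\ \vdots \\ \hat\beta_k \end{pmatrix}}_{\hat{\boldsymbol\beta}} = \underbrace{\begin{pmatrix} \sum_{i=1}^n x_{i0} y_i \\ \vdots \\ \sum_{i=1}^n x_{ik} y_i \end{pmatrix}}_{\mathbf{X}'\mathbf{y}}$$

as well as the normal equations in matrix notation

$$(\mathbf{X}'\mathbf{X}) \hat{\boldsymbol\beta} = \mathbf{X}'\mathbf{y}. \qquad (3.7)$$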


– Note: The matrix $\mathbf{X}'\mathbf{X}$ has k + 1 columns and rows so that it is a square matrix.

The inverse $(\mathbf{X}'\mathbf{X})^{-1}$ exists if all columns (and rows) are linearly independent. This can be shown to be the case if all columns of $\mathbf{X}$ are linearly independent.

This is exactly what the next assumption states.

Assumption MLR.3 (No Perfect Collinearity):

In the sample none of the regressors can be expressed as an exact linear combination of one or more of the other regressors.

Is this a restrictive assumption?


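– Finally, multiply the normal equation (3.7) by $(\mathbf{X}'\mathbf{X})^{-1}$ from the left and obtain the OLS estimator in matrix notation:

$$\hat{\boldsymbol\beta} = (\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\mathbf{y}. \qquad (3.8)$$

This is the compact notation for

$$\underbrace{\begin{pmatrix} \hat\beta_0 \\ \vdots \\ \hat\beta_k \end{pmatrix}}_{\hat{\boldsymbol\beta}} = \underbrace{\begin{pmatrix} \sum_{i=1}^n x_{i0} x_{i0} & \sum_{i=1}^n x_{i0} x_{i1} & \cdots & \sum_{i=1}^n x_{i0} x_{ik} \\ \vdots & \vdots & \ddots & \vdots \\ \sum_{i=1}^n x_{ik} x_{i0} & \sum_{i=1}^n x_{ik} x_{i1} & \cdots & \sum_{i=1}^n x_{ik} x_{ik} \end{pmatrix}^{-1}}_{(\mathbf{X}'\mathbf{X})^{-1}} \underbrace{\begin{pmatrix} \sum_{i=1}^n x_{i0} y_i \\ \vdots \\ \sum_{i=1}^n x_{ik} y_i \end{pmatrix}}_{\mathbf{X}'\mathbf{y}}.$$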
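Formula (3.8) translates almost literally into code. The sketch below uses NumPy on synthetic data (an assumption for illustration); it solves the normal equations (3.7) directly instead of forming the inverse, which is a common numerical choice and not something prescribed by the slides.

```python
import numpy as np

# Synthetic data with k = 2 regressors plus a constant (assumed for illustration).
rng = np.random.default_rng(3)
n, k = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # first column: x_i0 = 1
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(scale=0.3, size=n)

# MLR.3: the columns of X must be linearly independent.
assert np.linalg.matrix_rank(X) == k + 1

# Solve the normal equations (X'X) b = X'y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)
print(beta_lstsq)
```

If one column of X were an exact linear combination of the others, the rank check would fail and the normal equations would no longer have a unique solution.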



Algebraic Properties of the OLS Estimator

• $\mathbf{X}'\hat{\mathbf{u}} = \mathbf{0}$, that is $\sum_{i=1}^n x_{ij} \hat u_i = 0$ for $j = 0, \ldots, k$.

Proof: Plugging $\mathbf{y} = \mathbf{X}\hat{\boldsymbol\beta} + \hat{\mathbf{u}}$ into the normal equation yields $(\mathbf{X}'\mathbf{X})\hat{\boldsymbol\beta} = (\mathbf{X}'\mathbf{X})\hat{\boldsymbol\beta} + \mathbf{X}'\hat{\mathbf{u}}$ and hence $\mathbf{X}'\hat{\mathbf{u}} = \mathbf{0}$.

• If $x_{i0} = 1$, $i = 1, \ldots, n$, it follows that $\sum_{i=1}^n \hat u_i = 0$.

• For the special case k = 1, the algebraic properties of the simple linear regression model follow immediately.

• The point $(\bar y, \bar x_1, \ldots, \bar x_k)$ is always located on the regression hyperplane if there is a constant in the model.

• The definitions for SST, SSE and SSR are as in the simple regression.

• If a constant term is included in the model, we can decompose

SST = SSE + SSR.


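• The Coefficient of Determination:

$R^2$ is defined as in the SLR case as

$$R^2 = \frac{SSE}{SST}$$

or, if there is an intercept in the model,

$$R^2 = 1 - \frac{SSR}{SST}.$$

It can be shown that the $R^2$ is the squared empirical coefficient of correlation between the observed $y_i$'s and the explained $\hat y_i$'s, namely

$$R^2 = \frac{\left(\sum_{i=1}^n (y_i - \bar y)(\hat y_i - \bar{\hat y})\right)^2}{\left(\sum_{i=1}^n (y_i - \bar y)^2\right) \left(\sum_{i=1}^n (\hat y_i - \bar{\hat y})^2\right)} = \left[\widehat{Corr}(y, \hat y)\right]^2.$$

Note that $\left[\widehat{Corr}(y, \hat y)\right]^2$ can be used even when $R^2$ is not useful.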
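The orthogonality of regressors and residuals and the correlation form of $R^2$ can be confirmed numerically. The data below are synthetic and only serve as an illustration.

```python
import numpy as np

# Synthetic data with a constant and two regressors (assumed for illustration).
rng = np.random.default_rng(11)
n = 150
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.uniform(size=n)])
y = X @ np.array([0.5, 1.0, -2.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
u_hat = y - y_hat

sst = np.sum((y - y.mean()) ** 2)
ssr = np.sum(u_hat ** 2)
r2 = 1.0 - ssr / sst

# Squared empirical correlation between observed and fitted values.
corr2 = np.corrcoef(y, y_hat)[0, 1] ** 2

print(X.T @ u_hat)  # zero vector up to rounding
print(r2, corr2)
```

Because the model contains a constant, the fitted values also have the same mean as the observed values.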



• Adjusted $R^2$:

If we rewrite $R^2$ by expanding the SSR/SST term by n

$$R^2 = 1 - \frac{SSR/n}{SST/n},$$

we can interpret SSR/n and SST/n as estimators for $\sigma^2$ and $\sigma_y^2$ respectively. They are biased estimators, however.

Using unbiased estimators thereof instead one obtains the “adjusted” $R^2$

$$\begin{aligned}
\bar R^2 &= 1 - \frac{SSR/(n - k - 1)}{SST/(n - 1)} = 1 - \frac{n - 1}{n - k - 1} \cdot \frac{SSR}{SST} \\
&= 1 - \frac{n - 1}{n - k - 1} \left(1 - R^2\right) = \frac{-k}{n - k - 1} + \frac{n - 1}{n - k - 1} \cdot R^2.
\end{aligned}$$

Properties of $\bar R^2$ (see Section 6.3 in Wooldridge (2009)):

– $\bar R^2$ can increase or fall when including an additional regressor.

– $\bar R^2$ always increases if an additional regressor reduces the unbiased estimate of the error variance.

Attention: Analogously to $R^2$ one may not compare $\bar R^2$ of regression models with different y.

• The quantities $R^2$ and $\bar R^2$ are both called goodness-of-fit measures.
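The adjustment can be reproduced from the numbers reported in the regression outputs of Chapter 2. The helper function below is a hypothetical name introduced for this sketch; the inputs are the R-squared values and observation counts printed in the wage and trade (level-level) outputs.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R-squared: 1 - (n - 1) / (n - k - 1) * (1 - R^2)."""
    return 1.0 - (n - 1) / (n - k - 1) * (1.0 - r2)

# Wage regression: R-squared 0.164758, n = 526, one regressor.
wage = adjusted_r2(0.164758, n=526, k=1)

# Trade regression (level-level): R-squared 0.047139, 52 included observations.
trade = adjusted_r2(0.047139, n=52, k=1)

print(wage, trade)
```

Up to rounding of the inputs, the results agree with the "Adjusted R-squared" entries 0.163164 and 0.028081 in the two output tables.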



3.4 The OLS Estimator: Statistical Properties

Assumptions (Recap):

• MLR.1 (Linearity in the Parameters)

• MLR.2 (Random Sampling)

• MLR.3 (No Perfect Collinearity)

• MLR.4 (Zero Conditional Mean)


3.4.1 The Unbiasedness of Parameter Estimates

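• Let MLR.1 through MLR.4 hold. Then we have $E[\hat{\boldsymbol\beta}] = \boldsymbol\beta$.

Proof:

$$\begin{aligned}
\hat{\boldsymbol\beta} &= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} && \text{MLR.3} \\
&= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'(\mathbf{X}\boldsymbol\beta + \mathbf{u}) && \text{MLR.1} \\
&= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}\boldsymbol\beta + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u} = \boldsymbol\beta + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}.
\end{aligned}$$

Taking conditional expectation

$$\begin{aligned}
E[\hat{\boldsymbol\beta} | \mathbf{X}] &= \boldsymbol\beta + E[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u} | \mathbf{X}] \\
&= \boldsymbol\beta + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' E[\mathbf{u} | \mathbf{X}] \\
&= \boldsymbol\beta. && \text{MLR.2 and MLR.4}
\end{aligned}$$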


• The Danger of Omitted Variable Bias

We partition the k + 1 regressors into an (n × k) matrix $\mathbf{X}_A$ and an (n × 1) vector $\mathbf{x}_a$. This yields

$$\mathbf{y} = \mathbf{X}_A \boldsymbol\beta_A + \mathbf{x}_a \beta_a + \mathbf{u}. \qquad (3.9)$$

In the following it is assumed that the population regression model has the same structure as (3.9).

Trade Example Continued (from Section 3.2):

Assume that in the population imports depend on gdp, distance, and whether the trading countries share to some extent the same language

$$\ln(imports_i) = \beta_0 + \beta_1 \ln(gdp_i) + \beta_2 \ln(distance_i) + \beta_3 \ln(language_i) + u_i, \qquad (3.10)$$

so that $\mathbf{X}_A$ includes the constant, $gdp_i$, and $distance_i$ and $\mathbf{x}_a$ includes $language_i$, each $i = 1, \ldots, n$.

Imagine now that you are only interested in the values of βA (the

parameters for the constant, gdp, and distance), and that the re-

gressor vector xa has to be omitted because you have, for instance,

no data.

What effect does omitting the variable $x_a$ have on the estimation of $\beta_A$ if, for example, the model

$$y = X_A\beta_A + w \qquad (3.11)$$

is considered? Model (3.11) is frequently called the smaller model.

Or, stated differently: which estimation properties does the OLS estimator for $\beta_A$ have on the basis of the smaller model (3.11)?


Derivation:

– Denote the OLS estimator for $\beta_A$ from the small model by $\tilde{\beta}_A$. Following the proof of unbiasedness for the small model but replacing $y$ with the true population model (3.9) delivers

$$\begin{aligned}
\tilde{\beta}_A &= (X_A'X_A)^{-1}X_A'y \\
&= (X_A'X_A)^{-1}X_A'(X_A\beta_A + x_a\beta_a + u) \\
&= \beta_A + (X_A'X_A)^{-1}X_A'x_a\beta_a + (X_A'X_A)^{-1}X_A'u.
\end{aligned}$$

– By the law of iterated expectations $E[u|X_A] = E\big[E[u|X_A, x_a]\,\big|\,X_A\big]$ and therefore $E[u|X_A] = E[0|X_A] = 0$ by validity of MLR.4 for the population model (3.9).


– Compute the conditional expectation of $\tilde{\beta}_A$. Treating the (unobserved) $x_a$ in the same way as $X_A$ one obtains

$$E[\tilde{\beta}_A|X_A, x_a] = \beta_A + (X_A'X_A)^{-1}X_A'x_a\beta_a.$$

Therefore the estimator $\tilde{\beta}_A$ is unbiased only if

$$(X_A'X_A)^{-1}X_A'x_a\beta_a = 0. \qquad (3.12)$$

Take a closer look at the term on the left hand side of (3.12), i.e. $(X_A'X_A)^{-1}X_A'x_a\beta_a$. One observes that

$$\tilde{\delta} = (X_A'X_A)^{-1}X_A'x_a$$

is the OLS estimator of $\delta$ in a regression of $x_a$ on $X_A$:

$$x_a = X_A\delta + \varepsilon.$$


Condition (3.12) holds (and there is no bias) if

∗ $\tilde{\delta} = 0$, so $x_a$ is uncorrelated with $X_A$ in the sample, or

∗ $\beta_a = 0$ holds and the smaller model is the population model.

If neither of these conditions holds, then $\tilde{\beta}_A$ is biased:

$$E[\tilde{\beta}_A|X_A, x_a] = \beta_A + \tilde{\delta}\beta_a.$$

This means that the OLS estimator $\tilde{\beta}_A$ is in general biased for every parameter in the smaller model.

Since these biases are caused by using a regression model that misses a variable that is relevant in the population model, this kind of bias is called omitted variable bias and the smaller model is said to be misspecified. (See Appendix 3A.4 in Wooldridge (2009).)
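The bias term has an exact in-sample counterpart: the coefficients of the smaller regression equal the coefficients on $X_A$ from the full regression plus $\tilde{\delta}$ times the estimated coefficient on $x_a$. The following is a minimal simulation sketch of that identity. The data-generating values (slope 2.0, $\beta_a = 1.5$, dependence 0.8 between the regressors) and the helper `ols` are hypothetical choices for illustration, not part of the course material.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated population model (3.9): y = 1 + 2*x1 + 1.5*xa + u (assumed values)
x1 = rng.normal(size=n)
xa = 0.8 * x1 + rng.normal(size=n)   # omitted regressor, correlated with x1
u = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 1.5 * xa + u

XA = np.column_stack([np.ones(n), x1])   # regressors kept in the smaller model
X_full = np.column_stack([XA, xa])       # regressors of the population model


def ols(X, target):
    """OLS coefficients (X'X)^{-1} X' target via least squares."""
    return np.linalg.lstsq(X, target, rcond=None)[0]


beta_full = ols(X_full, y)    # estimates of (beta_A, beta_a)
beta_small = ols(XA, y)       # estimator from the smaller model (3.11)
delta = ols(XA, xa)           # regression of xa on XA

# Sample analogue of beta_A + delta * beta_a
implied = beta_full[:2] + delta * beta_full[2]

print(beta_small, implied)
```

In this setup the slope from the smaller model settles near $2 + 0.8 \cdot 1.5 = 3.2$ rather than near the assumed value 2.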



– One may also ask about the unconditional bias. Applying the LIE delivers

$$E[\tilde{\beta}_A|X_A] = \beta_A + E[\tilde{\delta}|X_A]\,\beta_a,$$

$$E[\tilde{\beta}_A] = \beta_A + E[\tilde{\delta}]\,\beta_a.$$

Interpretation: The second expression delivers the expected value of the OLS estimator if one keeps drawing new samples for $y$ and $X_A$. Thus, in repeated sampling there is only bias if there is correlation in the population between the variables in $X_A$ and $x_a$, since otherwise $E[\tilde{\delta}] = 0$, cf. Section 2.4.


• Wage Example Continued (from Section 3.2):

– If the observed regressor educ is correlated with the unobserved

variable ability, then the regressor xa = ability is missing in

the regression and the OLS estimators, e.g. for the effect of an

additional year of schooling, are biased.

– Interpretation of the various information sets for computing the expectation of $\tilde{\beta}_{educ}$:

∗ First consider

$$E[\tilde{\beta}_{educ}|educ, exper, ability] = \beta_{educ} + \tilde{\delta}\beta_{ability},$$

where

$$ability = \begin{pmatrix} 1 & educ & exper \end{pmatrix}\delta + \varepsilon.$$

Then the conditional expectation above indicates the average of $\tilde{\beta}_{educ}$ computed over many different samples where each sample of workers is drawn in the following way: You always guarantee that each sample has the same number of workers with e.g. 10 years of schooling, 15 years of experience, and 150 units of ability and the same number of workers with 11 years of schooling, etc., so that for each combination of characteristics there is the same number of workers although the workers are not identical.

∗ Next consider

$$E[\tilde{\beta}_{educ}|educ, exper] = \beta_{educ} + E[\tilde{\delta}|educ, exper]\,\beta_{ability}.$$

When drawing a new sample you only guarantee that the numbers of workers with a specific number of years of schooling and experience stay the same. In contrast to above, you do not control ability.

∗ Finally consider

$$E[\tilde{\beta}_{educ}] = \beta_{educ} + E[\tilde{\delta}]\,\beta_{ability}.$$

Here you simply draw new samples where everything is allowed to vary. If you had, let's say, 50 workers with 10 years of schooling in one sample, you may have 73 workers with 10 years of schooling in another sample. This possibility is excluded in the two previous cases.


• Effect of omitted variables on the conditional mean:

– General terminology:

∗ If $E[y|x_A, x_a] \neq E[y|x_A]$, then the smaller model omitting $x_a$ is misspecified and estimation will suffer from omitted variable bias.

∗ If $E[y|x_A, x_a] = E[y|x_A]$, then the variable $x_a$ in the larger model is redundant and should be eliminated from the regression.

∗ Trade Example Continued: Assume that the population

regression model only contains the variables gdp and distance.

Then a simple regression model with gdp is misspecified and a

multiple regression model with gdp, distance, and language

contains the redundant variable language.



– It can happen that for a misspecified model Assumptions MLR.1 to MLR.4 are fulfilled.

To see this, consider only one variable in $X_A$:

$$E[y|x_A, x_a] = \beta_0 + \beta_A x_A + \beta_a x_a.$$

Then, by the law of iterated expectations one obtains

$$E[y|x_A] = \beta_0 + \beta_A x_A + \beta_a E[x_a|x_A].$$

If, in addition, $E[x_a|x_A]$ is linear in $x_A$,

$$x_a = \alpha_0 + \alpha_1 x_A + \varepsilon, \qquad E[\varepsilon|x_A] = 0,$$

one obtains

$$E[y|x_A] = \beta_0 + \beta_A x_A + \beta_a(\alpha_0 + \alpha_1 x_A) = \gamma_0 + \gamma_1 x_A$$

with $\gamma_0 = \beta_0 + \beta_a\alpha_0$ and $\gamma_1 = \beta_A + \beta_a\alpha_1$ being the parameters of the best linear predictor, see Section 2.4.

– Note that in this case SLR.1 and SLR.4 are fulfilled for the smaller model although it is not the population model. However,

$$E[y|x_A, x_a] \neq E[y|x_A]$$

if $\beta_a \neq 0$ and $\alpha_1 \neq 0$.

– Thus, model choice matters, see Section 3.5. If controlling for $x_a$ is important (controlled random experiments, see Section 1.3), then the smaller model is not of much use if the differences between the expected values are large for some values of the regressors.

If one needs a model for prediction, the smaller model may be preferable since it exhibits smaller estimation variance, see Sections 3.4.3 and 3.5.

Reading: Section 3.3 in Wooldridge (2009).


3.4.2 The Variance of Parameter Estimates

• Assumption MLR.5 (Homoskedasticity):

$$Var(u_i|x_{i1}, \ldots, x_{ik}) = \sigma^2, \qquad i = 1, \ldots, n.$$

• Assumptions MLR.1 through MLR.5 are frequently called Gauss-Markov assumptions.

• Note that by the random sampling assumption MLR.2 one has

$$Cov(u_i, u_j|x_{i1}, \ldots, x_{ik}, x_{j1}, \ldots, x_{jk}) = 0 \quad \text{for all } i \neq j,\ 1 \leq i, j \leq n,$$

$$Cov(u_i, u_j) = 0 \quad \text{for all } i \neq j,\ 1 \leq i, j \leq n,$$

where for the latter equation the LIE was used. Because of MLR.2 one may also write

$$Var(u_i|x_{i1}, \ldots, x_{ik}) = Var(u_i|X), \qquad Cov(u_i, u_j|X) = 0, \quad i \neq j.$$


• Variance of the OLS Estimator

Under the Gauss-Markov Assumptions MLR.1 to MLR.5 we have

$$Var(\hat{\beta}_j|X) = \frac{\sigma^2}{SST_j(1 - R_j^2)}, \qquad x_j \text{ not constant}, \qquad (3.15)$$

where $SST_j$ is the total sample variation (total sum of squares) of the $j$-th regressor,

$$SST_j = \sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2,$$

and the coefficient of determination $R_j^2$ is taken from a regression of the $j$-th regressor on all other regressors

$$x_{ij} = \delta_0 x_{i0} + \cdots + \delta_{j-1}x_{i,j-1} + \delta_{j+1}x_{i,j+1} + \cdots + \delta_k x_{ik} + v_i, \qquad i = 1, \ldots, n. \qquad (3.16)$$

(See Appendix 3A.5 in Wooldridge (2009) for the proof.)
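Formula (3.15) can be compared numerically with the corresponding diagonal element of $\sigma^2(X'X)^{-1}$. The following is a minimal sketch on simulated regressors; the sample size, the value of `sigma2` and the dependence between `x1` and `x2` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma2 = 200, 4.0   # assumed sample size and error variance

x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])

j = 1                                   # look at the coefficient of x1
xj = X[:, j]
others = np.delete(X, j, axis=1)        # constant and x2

# Auxiliary regression (3.16): x_j on all other regressors
coef = np.linalg.lstsq(others, xj, rcond=None)[0]
v_hat = xj - others @ coef

sst_j = np.sum((xj - xj.mean()) ** 2)
r2_j = 1.0 - (v_hat @ v_hat) / sst_j

var_formula = sigma2 / (sst_j * (1.0 - r2_j))           # equation (3.15)
var_matrix = sigma2 * np.linalg.inv(X.T @ X)[j, j]      # matrix expression

print(var_formula, var_matrix)
```

Both numbers agree because $SST_j(1 - R_j^2)$ is the residual sum of squares of the auxiliary regression, which is the reciprocal of the $j$-th diagonal element of $(X'X)^{-1}$.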



Interpretation of the variance of the OLS estimator:

– The larger the error variance $\sigma^2$, the larger is the variance of $\hat{\beta}_j$.

Note: This is a property of the population so that this variance component cannot be influenced by sample size. (In analogy to the simple regression model.)

– The larger the total sample variation $SST_j$ of the $j$-th regressor $x_j$ is, the smaller is the variance $Var(\hat{\beta}_j|X)$.

Note: The total sample variation can always be increased by increasing sample size since adding another observation increases $SST_j$.

– If $SST_j = 0$, assumption MLR.3 fails to hold.

– The larger the coefficient of determination $R_j^2$ from regression (3.16) is, the larger is the variance of $\hat{\beta}_j$.

– The larger $R_j^2$, the better the variation in $x_j$ can be explained by variation in the other regressors because in this case there is a high degree of linear dependence between $x_j$ and the other explanatory variables.

Then only a small part of the sample variation in $x_j$ is specific to the $j$-th regressor (precisely the error variation in (3.16)). The other part of the variation can be explained equally well by the estimated linear combination of all other regressors. The estimator cannot clearly attribute this common variation either to $x_j$ or to the linear combination of the remaining variables, and thus it suffers from a larger estimation variance.


– Special cases:

∗ $R_j^2 = 0$: Then $x_j$ and all other explanatory variables are empirically uncorrelated and the parameter estimator $\hat{\beta}_j$ is unaffected by all other regressors.

∗ $R_j^2 = 1$: Then MLR.3 fails to hold.

∗ $R_j^2$ near 1: This situation is called multicollinearity or near collinearity. In this case $Var(\hat{\beta}_j|X)$ is very large.

– But: The multicollinearity problem is reduced in larger samples because $SST_j$ rises and hence the variance decreases for a given value of $R_j^2$. Therefore multicollinearity is always also a problem of small sample size.


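• Estimation of the error variance $\sigma^2$

– Unbiased estimation of the error variance $\sigma^2$:

$$\hat{\sigma}^2 = \frac{\hat{u}'\hat{u}}{n - (k+1)}.$$

– Properties of the OLS estimator (continued):

Call $sd(\hat{\beta}_j|X) = \sqrt{Var(\hat{\beta}_j|X)}$ the standard deviation, then

$$\widehat{sd}(\hat{\beta}_j|X) = \frac{\hat{\sigma}}{\left(SST_j(1 - R_j^2)\right)^{1/2}}$$

is the standard error of $\hat{\beta}_j$.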


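• Variance-covariance matrix of the OLS estimator:

Basics: The covariance between the estimators of the $j$-th and the $l$-th parameter, $\hat{\beta}_j$ and $\hat{\beta}_l$, is written as

$$Cov(\hat{\beta}_j, \hat{\beta}_l|X) = E[(\hat{\beta}_j - \beta_j)(\hat{\beta}_l - \beta_l)|X], \qquad j, l = 0, 1, \ldots, k,$$

where unbiasedness of the estimators is assumed. Rewrite all variances and covariances in a $((k+1) \times (k+1))$ matrix:

$$Var(\hat{\beta}|X) \equiv \begin{pmatrix}
Cov(\hat{\beta}_0, \hat{\beta}_0|X) & Cov(\hat{\beta}_0, \hat{\beta}_1|X) & \cdots & Cov(\hat{\beta}_0, \hat{\beta}_k|X) \\
Cov(\hat{\beta}_1, \hat{\beta}_0|X) & Cov(\hat{\beta}_1, \hat{\beta}_1|X) & \cdots & Cov(\hat{\beta}_1, \hat{\beta}_k|X) \\
\vdots & \vdots & \ddots & \vdots \\
Cov(\hat{\beta}_k, \hat{\beta}_0|X) & Cov(\hat{\beta}_k, \hat{\beta}_1|X) & \cdots & Cov(\hat{\beta}_k, \hat{\beta}_k|X)
\end{pmatrix}$$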


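Proof:

Remember that correct model specification implies

$$\hat{\beta} = (X'X)^{-1}X'y = (X'X)^{-1}X'(X\beta + u) = \beta + (X'X)^{-1}X'u,$$

hence $\hat{\beta} - \beta = (X'X)^{-1}X'u$, which inserted into $Var(\hat{\beta}|X)$ yields

$$\begin{aligned}
E[(\hat{\beta} - \beta)(\hat{\beta} - \beta)'|X] &= E\left[(X'X)^{-1}X'u\left((X'X)^{-1}X'u\right)'\,\middle|\,X\right] \\
&= E\left[(X'X)^{-1}X'uu'X(X'X)^{-1}\,\middle|\,X\right] \\
&= (X'X)^{-1}X'\underbrace{E[uu'|X]}_{\sigma^2 I_n}X(X'X)^{-1} \\
&= \sigma^2(X'X)^{-1}X'X(X'X)^{-1} \\
&= \sigma^2(X'X)^{-1}.
\end{aligned}$$

From the definition of $Var(\hat{\beta}|X)$ above it can be seen that the diagonal elements are the variances $Var(\hat{\beta}_j|X)$, $j = 0, \ldots, k$.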
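In practice $\sigma^2$ is replaced by its unbiased estimate, which gives the estimated variance-covariance matrix and the standard errors. The following is a minimal sketch on simulated data; the coefficient vector `beta` and the error scale are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 120, 2   # assumed sample size and number of regressors

X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta = np.array([1.0, 0.5, -2.0])            # hypothetical true parameters
y = X @ beta + rng.normal(scale=1.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y                 # OLS estimator
u_hat = y - X @ beta_hat                     # OLS residuals

sigma2_hat = (u_hat @ u_hat) / (n - (k + 1)) # unbiased error variance estimate
vcov_hat = sigma2_hat * XtX_inv              # estimated Var(beta_hat | X)
std_errors = np.sqrt(np.diag(vcov_hat))      # standard errors of beta_hat_j

print(beta_hat, std_errors)
```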



• Efficiency of OLS

Note: The OLS estimator is a linear estimator with respect to the dependent variable because it holds for given $X$ that

$$\hat{\beta}_j = \sum_{i=1}^{n}\left(\frac{\hat{v}_i}{\sum_{l=1}^{n}\hat{v}_l^2}\right)y_i,$$

where $\hat{v}_i$ are the residuals from regression (3.16). (For a derivation without matrix algebra see Appendix 3A.2 in Wooldridge (2009).)

Further, OLS is unbiased, so $E[\hat{\beta}_j] = \beta_j$.

Gauss-Markov Theorem: Under assumptions MLR.1 through MLR.5 the OLS estimator is the best linear unbiased estimator (BLUE).

“Best” means that the OLS estimator, which is unbiased since $E[\hat{\beta}_j] = \beta_j$, has minimal variance among linear unbiased estimators.
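The linear representation of $\hat{\beta}_j$ can be verified directly: weighting the $y_i$ by the scaled residuals of the auxiliary regression (3.16) reproduces the OLS coefficient. The following is a minimal sketch on simulated data with hypothetical parameter values.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100

x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)   # assumed model
X = np.column_stack([np.ones(n), x1, x2])

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]      # full OLS fit

j = 1
others = np.delete(X, j, axis=1)
aux_coef = np.linalg.lstsq(others, X[:, j], rcond=None)[0]
v_hat = X[:, j] - others @ aux_coef                  # residuals of (3.16)

weights = v_hat / (v_hat @ v_hat)                    # depend on X only
beta_j_linear = weights @ y                          # weighted sum of the y_i

print(beta_hat[j], beta_j_linear)
```

The weights depend only on the regressors, which is exactly what makes the estimator linear in $y$ for given $X$.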



3.4.3 Trade-off between Bias and Multicollinearity

• Example: Let the population model be

y = β0 + β1x1 + β2x2 + u.

– For a given sample let $R_1^2$ be close to 1. Then $\beta_1$ is estimated with a large variance by (3.15).

– A possible solution? Leaving out the regressor $x_2$ and estimating the simple regression. But then, as already shown, the estimator of $\beta_1$ is biased.

Hence: If there is correlation between $x_1$ and $x_2$ near 1 or -1, then, for given sample size, one faces a trade-off between variance and bias.

– What we observe is a kind of statistical uncertainty relation: The sample does not provide sufficient information to precisely answer the formulated question.

– The only good solution: Increasing sample size.

– Alternative solution: Combining highly correlated variables.

• Variance of parameter estimates in misspecified models:

Again, there are different possibilities how incorrect regression mod-

els might be chosen (cf. Section 3.4.1):

– Too many variables: Parameters are estimated for variables that

do not play a role in the “true” data generation mechanism

(redundant variables).

– Too few variables: One or more variables are missing which

are relevant in the population regression model (omitted vari-

ables).

– Wrong variables: A combination of both.

Effect on the variance of parameter estimators:

– Case 1 (redundant variables):

Consider the population model $y = X\beta + u$. Assume that instead the following sample specification is chosen:

$$y = X\beta + z\alpha + w,$$

where the vector $z$ contains all sample observations of the variable $z$. The variance of the parameter estimator $\hat{\beta}_j$ is

$$Var(\hat{\beta}_j|X) = \frac{\sigma^2}{SST_j(1 - R_{j,X,z}^2)},$$

where now $R_{j,X,z}^2$ is the coefficient of determination of a regression of $x_j$ on all other variables in $X$ and on $z$. It is easily seen that $R_{j,X,z}^2 \geq R_j^2$ because fewer variables are included in the regression yielding the second $R^2$.

Therefore: Including additional variables in a regression model increases estimation variance or leaves it unchanged.

– Case 2 (omitted variables):

The converse of case 1 holds: If a variable is omitted, it can be shown that the estimation variance is smaller than when using the true model.

– Case 3 (redundant and omitted variables):

Should really be avoided.

Correct model specification is crucial!


3.5 Model Specification I: Model Selection Criteria

• Goal of model selection:

– In principle: find the population model.

– In practice: find the “best” model for the purpose of the analysis.

– More specific: Under the assumption that the population model

is a multiple linear regression model find all regressors that are

included in the regression and their appropriate transformations

(log or level or ...). Avoid omitting variables and including irrel-

evant variables.



• Brief theory of model selection:

– There are two issues:

a) the variable (model) choice,

b) the estimation variance.

– Consider a): Choose a goal function to evaluate different models. A popular goal function is the mean squared error (MSE). For fixed parameters it is defined as

$$MSE = E\left[(y - \beta_0 x_0 - \beta_1 x_1 - \cdots - \beta_k x_k)^2\right], \qquad (3.17)$$

see also equation (2.17) for the simple regression case.

Choose the model for which the MSE is minimal.


Important cases:

∗ If $x_0, \ldots, x_k$ include all relevant variables, the population model is a multiple linear regression, and MSE is minimized with respect to the parameters, then

$$MSE = E[u^2] = \sigma^2.$$

∗ If relevant variables are missing, it can be shown that the MSE decomposes into variance and squared bias. For simplicity, omit all variables except $x_1$ and fit the simple linear regression

$$y = \gamma_0 + \gamma_1 x_1 + v.$$

Then

$$\begin{aligned}
MSE_1 &= E\left[\big((y - E[y|x_1, \ldots, x_k]) + (E[y|x_1, \ldots, x_k] - E[y|x_1])\big)^2\right] \\
&= \sigma^2 + \left(E[y|x_1, \ldots, x_k] - E[y|x_1]\right)^2.
\end{aligned}$$

Since the squared bias term is positive, $\left(E[y|x_1, \ldots, x_k] - E[y|x_1]\right)^2 > 0$, one clearly has $MSE < MSE_1$.

– Consider a) and b): If parameters have to be estimated, a further term enters the mean squared error, namely the variances and covariances for estimating the model parameters. One has

$$MSE = \text{Variance of population error} + (\text{Bias of chosen model})^2 + \text{Estimation variance},$$

where the estimation variance in general increases with the number of variables. Now it can happen that for minimizing MSE it is optimal to choose a model that omits variable(s). A typical case is prediction.

– Therefore, a reliable method for estimating the MSE is needed.


• What does not work:

– Selecting the model with the smallest standard error of the regression $\hat{\sigma}$ does not work.

∗ Why? It is always possible to select a model for which every residual is zero, that is $\hat{u}_i = 0$ for all $i = 1, \ldots, n$. Then $\hat{\sigma} = 0$ as well although the error variance $\sigma^2 > 0$ in the true model.

∗ How? Simply take $k + 1 = n$ regressors into the sample regression model which fulfil MLR.3 and solve the normal equations (3.5). Then you obtain a perfect fit since you have a linear equation system with $n$ equations and $n$ unknown parameters.

∗ Note that you can add any regressors that fulfil MLR.3 even if they have nothing to do with the population regression model.

∗ Note also that SSR remains constant or decreases if for a given sample of size $n$ a further regressor variable is added since the linear equation system obtains more flexibility to fit the sample observations. Therefore $\tilde{\sigma}^2 = SSR/n$ remains constant or decreases as well.

∗ For the variance estimator $\hat{\sigma}^2 = SSR/(n - k - 1)$ there are opposing effects: a decrease in SSR may be offset by the decrease in $n - k - 1$.

In sum, the standard error of regression tends to decrease when additional regressors are added so that it is not suited for selecting those variables that are part of the population model.
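The perfect-fit argument can be reproduced with regressors that are pure noise. The following is a minimal sketch in which $k + 1 = n$ randomly drawn columns (a hypothetical setup, unrelated to `y` by construction) drive the sum of squared residuals to zero.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 8

y = rng.normal(size=n)   # dependent variable, unrelated to the regressors

# Constant plus n-1 noise regressors: k + 1 = n columns. A randomly drawn
# square matrix has full rank with probability one, so MLR.3 holds.
X = np.column_stack([np.ones(n), rng.normal(size=(n, n - 1))])

beta_hat = np.linalg.solve(X, y)   # n equations, n unknown parameters
residuals = y - X @ beta_hat
ssr = residuals @ residuals

print(ssr)   # zero up to rounding error
```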

– Selecting the model with the largest $R^2$ does not work either. Why?

– Although the adjusted $\bar{R}^2$ may fall or increase when another regressor is added, it breaks down for $k + 1 = n$ since $R^2 = 1$ as well in this case.

• Solution: Use model selection criteria

– Basic idea:

$$\text{Selection criterion} = \ln\frac{\hat{u}'\hat{u}}{n} + (k + 1) \cdot \text{penalty function}(n)$$

∗ First term: a variance estimator for $\ln(\sigma^2)$ of the chosen model.

Note that the estimated variance $\tilde{\sigma}^2 = \hat{u}'\hat{u}/n$ is reduced by every additionally included independent variable.

∗ Second term: a penalty term punishing the number of parameters to avoid models that include redundant variables.

Because the true error variance is typically underestimated using $\tilde{\sigma}^2$, the penalty term penalizes the inclusion of additional regressors.

The penalty term increases with $k$, and the penalty function must be chosen such that it decreases with $n$ so that a large number of parameters matters less in large samples. Why?

∗ This implies a trade-off: Regressors are taken into the model if the penalty is smaller than the decrease in estimated MSE. When choosing a criterion one determines how the trade-off is shaped.

∗ Rule: Choose among all considered candidate models the specification for which the criterion is minimal.


– Popular model selection criteria:

∗ the Akaike Criterion (AIC)

$$AIC = \ln\frac{\hat{u}'\hat{u}}{n} + (k + 1)\frac{2}{n}, \qquad (3.18)$$

∗ the Hannan-Quinn Criterion (HQ)

$$HQ = \ln\frac{\hat{u}'\hat{u}}{n} + (k + 1)\frac{2\ln(\ln n)}{n}, \qquad (3.19)$$

∗ the Schwarz / Bayesian Information Criterion (SC/BIC)

$$SC = \ln\frac{\hat{u}'\hat{u}}{n} + (k + 1)\frac{\ln n}{n}. \qquad (3.20)$$

It is advised always to check all criteria although the researcher decides which to use. In nice cases, all criteria deliver the same result. Note that for standard sample sizes SC punishes additional parameters more than HQ, and HQ more than AIC.
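The three criteria (3.18) to (3.20) differ only in the penalty function. The following is a minimal sketch that computes them from a sum of squared residuals; the function name `selection_criteria` and the input values are hypothetical.

```python
import math


def selection_criteria(ssr, n, k):
    """AIC, HQ and SC as in (3.18)-(3.20) for k regressors plus a constant."""
    log_var = math.log(ssr / n)   # ln(u_hat'u_hat / n)
    p = k + 1                     # number of estimated parameters
    return {
        "AIC": log_var + p * 2.0 / n,
        "HQ": log_var + p * 2.0 * math.log(math.log(n)) / n,
        "SC": log_var + p * math.log(n) / n,
    }


# Hypothetical inputs, not taken from the trade example.
crit = selection_criteria(ssr=120.0, n=100, k=3)
print(crit)
```

For $n = 100$ the penalties per parameter are 0.02 (AIC), roughly 0.031 (HQ) and roughly 0.046 (SC), which illustrates the stated ordering of the criteria.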



• Trade Example Continued:

– Model 1

LOG(TRADE_0_D_O) = -3.4610 + 0.7701*LOG(WDI_GDPUSDCR_O)

AIC = 4.282, HQ = 4.310, SC = 4.357

– Model 2

LOG(TRADE_0_D_O) = 4.8009 + 1.0885*LOG(WDI_GDPUSDCR_O) - 1.9708*LOG(CEPII_DIST)

AIC = 4.025, HQ = 4.068, SC = 4.138

– Model 3

LOG(TRADE_0_D_O) = -9.5789+ 1.3566*LOG(WDI_GDPUSDCR_O) - 1.1442*LOG(CEPII_DIST)

+ 3.1265*CEPII_COMCOL_REV

AIC = 3.694, HQ = 3.751, SC = 3.844

– Model 4

Y_LOG = -13.0268 + 1.3176*LOG(WDI_GDPUSDCR_O) - 0.6249*LOG(CEPII_DIST)

+ 3.3512*CEPII_COMCOL_REV + 2.1096*CEPII_COMLANG_OFF

AIC = 3.679, HQ = 3.751, SC = 3.867

– Comparing all four models, SC selects model 3 with regressors gdp, distance and common colonizer while

AIC selects model 4 with additional regressor common official language. See Appendix 10.4 for more details

on variables. One can nicely see that SC punishes additional variables more than AIC. Statistical tests may

provide further information on which model to choose, see Sections 4.3 onwards.
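The selection rule (pick the model with the smallest criterion value) can be applied mechanically to the values reported above. The following is a minimal sketch; the dictionary layout and the helper `best_model` are illustrative choices, while the numbers are those listed for models 1 to 4.

```python
# Criterion values as reported for the four trade models above.
reported = {
    "Model 1": {"AIC": 4.282, "HQ": 4.310, "SC": 4.357},
    "Model 2": {"AIC": 4.025, "HQ": 4.068, "SC": 4.138},
    "Model 3": {"AIC": 3.694, "HQ": 3.751, "SC": 3.844},
    "Model 4": {"AIC": 3.679, "HQ": 3.751, "SC": 3.867},
}


def best_model(criterion):
    """Return the model whose value of the given criterion is minimal."""
    return min(reported, key=lambda name: reported[name][criterion])


best_aic = best_model("AIC")
best_sc = best_model("SC")
print(best_aic, best_sc)
```

At the printed precision HQ takes the same value, 3.751, for models 3 and 4, so HQ alone does not discriminate between them here.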


4 Multiple Regression Analysis: Hypothesis Testing and

Confidence Intervals

4.1 Basics of Statistical Tests

Foundations of statistical hypothesis testing

• In general: Statistical hypothesis tests allow statistically sound and

unambiguous answers to yes-or-no questions:

– Do men and women earn equal income in Kazakhstan?



– Do certain political attempts lead to a decrease in unemployment

in 2010?

– Are imports to Kazakhstan influenced by the gdp of exporting

countries?



• Elements of a statistical test:

1. Two disjoint hypotheses about the value(s) of (a) parameter(s)

θ in a population.

That means that one of the two competing hypotheses has to

hold in the population:

– Null hypothesis H0

– Alternative hypothesis H1

2. A test statistic T that is a function of some or all sample values

(X,y). We will denote it as t(X,y).

3. A decision rule, stating for which values of t(X,y) the null

hypothesis H0 is rejected and for which values the null is

not rejected.

More precisely: Partition the domain of the test statistic $T$ into two disjoint regions:

– Rejection region, critical region $C$

If the test statistic $t(X, y)$ is located in the critical region, $H_0$ is rejected:

Reject $H_0$ if $t(X, y) \in C$

– Non-rejection region

If the test statistic $t(X, y)$ falls into the non-rejection region, $H_0$ is not rejected:

Do not reject $H_0$ if $t(X, y) \notin C$

– Critical value $c$: Boundary between rejection and non-rejection region.


• Properties of a test:

– Type I error, α error:

The type I error measures the probability (evaluated before the

sample is taken) of rejecting H0 though H0 is correct in the pop-

ulation,

α = P (Reject H0 |H0 is true) = P (T ∈ C|H0 is true).

The type I error is frequently called the significance level or

size of a test.

– Type II error, β error:

The type II error gives the probability of not rejecting H0 though

it is wrong,

β = P (Not reject H0|H1 is true).

– Size of a test: The significance level or size equal to $\alpha$ has to be fixed by the researcher before the test is carried out.

– Power of a test: The power of a test gives the probability of rejecting a wrong null hypothesis,

$$\text{power} = \pi = P(\text{Reject } H_0|H_1 \text{ is true}),$$

that is

$$\pi = 1 - P(\text{Not reject } H_0|H_1 \text{ is true}) = 1 - \beta.$$

To calculate $C$ for a given $\alpha$ one has to know the probability distribution of the test statistic under $H_0$.


Deriving Tests about the Sample Mean:

1. Consider two disjoint hypotheses about the mean of a sample.

(For example, the mean µ of hourly wages in the US in 1976.)

a) Null hypothesis

H0 : µ = µ0

(In our example: mean hourly wage is 6 US-$,

thus H0 : µ = 6)

b) Alternative hypothesis

$$H_1: \mu \neq \mu_0$$

(In the example: mean hourly wages are not 6 US-$, thus $H_1: \mu \neq 6$)


2. Test statistic:

a) Choice of an estimator for the unknown mean $\mu$, e.g. the OLS estimator of a regression of hourly wages $w$ on a constant: Compute the sample mean

$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} w_i$$

from a sample $w_1, \ldots, w_n$ with $n$ observations.

b) Obtain the probability distribution of the estimator: For simplicity assume that individual wages $w_i$ are jointly normally distributed with expected value $\mu$ and variance $\sigma_w^2$, that is

$$w_i \sim N(\mu, \sigma_w^2).$$

From the properties of jointly normally distributed random variables it follows that

$$\hat{\mu} \sim N(\mu, \sigma_{\hat{\mu}}^2),$$

where $\sigma_{\hat{\mu}}^2 = Var(\hat{\mu}) = Var\left(n^{-1}\sum w_i\right) = n^{-1}\sigma_w^2$.

c) In order to obtain a test statistic $t(w_1, \ldots, w_n)$ all unknown parameters have to be removed from the distribution. In this simple case this can be achieved by standardizing $\hat{\mu}$:

$$t(w_1, \ldots, w_n) = \frac{\hat{\mu} - \mu}{\sigma_{\hat{\mu}}} \sim N(0, 1).$$

d) The test statistic $t(w_1, \ldots, w_n)$ can be calculated if we know $\mu$ and $\sigma_{\hat{\mu}}$. Assume for the moment that $\sigma_{\hat{\mu}}$ is known.


Which value does $\mu$ take under $H_0$?

$$H_0: \mu = \mu_0.$$

Under $H_0$ we can compute the test statistic for a given sample as

$$t(w_1, \ldots, w_n) = \frac{\hat{\mu} - \mu_0}{\sigma_{\hat{\mu}}} \sim N(0, 1).$$

3. Decision rule:

When should we reject $H_0$ and in which case shouldn't we? (Now the significance level $\alpha$ has to be chosen!)

If the deviation of $\hat{\mu}$ from the null hypothesis value $\mu_0$ is large enough one would reject $H_0$.


[Figure: density $f(t)$ of the test statistic under $H_0$. The non-rejection region under $H_0$ lies between the critical values $-c$ and $c$; the regions of rejection lie in the two tails, each with probability of error $\alpha/2$.]

Intuition: If $t$ is very large (or very small) then

a) the estimated mean $\hat{\mu}$ is far from $\mu_0$ (under $H_0$) and / or

b) the standard deviation $\sigma_{\hat{\mu}}$ of the estimated deviation is small relative to $\hat{\mu} - \mu_0$.


• When is $|t|$ large enough (to reject $H_0$)?

• Note: Under $H_0$ it holds that

$$t(w_1, \ldots, w_n) = \frac{\hat{\mu} - \mu_0}{\sigma_{\hat{\mu}}} \sim N(0, 1)$$

and hence for given $\alpha$ the rejection region $C$ can be determined (see figure).

• Formally:

$$P(T < -c|H_0) + P(T > c|H_0) = \alpha$$

or in this case due to the symmetry of the normal distribution

$$P(T < -c|H_0) = \frac{\alpha}{2} \quad \text{and} \quad P(T > c|H_0) = \frac{\alpha}{2}.$$

The values of $-c$ and $c$ are tabulated: they are the $\alpha/2$ and $1 - \alpha/2$ quantiles of the standard normal distribution.
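Instead of a table, the quantiles can be computed. The following is a minimal sketch using `statistics.NormalDist` from the Python standard library; the function name `two_sided_critical_value` is an illustrative choice.

```python
from statistics import NormalDist


def two_sided_critical_value(alpha):
    """c such that P(T < -c) = P(T > c) = alpha/2 for T ~ N(0, 1)."""
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


alpha = 0.05
c = two_sided_critical_value(alpha)

# Check: the two tail probabilities add up to alpha.
z = NormalDist()
rejection_probability = z.cdf(-c) + (1.0 - z.cdf(c))

print(c, rejection_probability)
```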



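• Under $H_1$ it holds that

$$\frac{\hat{\mu} - \mu}{\sigma_{\hat{\mu}}} \sim N(0, 1).$$

Expanding yields

$$\frac{\hat{\mu} - \mu + \mu_0 - \mu_0}{\sigma_{\hat{\mu}}} = \frac{\hat{\mu} - \mu_0}{\sigma_{\hat{\mu}}} + \frac{\mu_0 - \mu}{\sigma_{\hat{\mu}}} = \underbrace{\frac{\hat{\mu} - \mu_0}{\sigma_{\hat{\mu}}}}_{t(w_1, \ldots, w_n)} - \underbrace{\frac{\mu - \mu_0}{\sigma_{\hat{\mu}}}}_{m}$$

and therefore we have under $H_1$

$$t(w_1, \ldots, w_n) = \frac{\hat{\mu} - \mu_0}{\sigma_{\hat{\mu}}} \sim N\left(\frac{\mu - \mu_0}{\sigma_{\hat{\mu}}}, 1\right)$$

since $X \sim N(m, 1)$ is equivalent to $X - m \sim N(0, 1)$.

• Conclusion: If $H_1$ is true, then the density of $t(w_1, \ldots, w_n)$ is shifted by $(\mu - \mu_0)/\sigma_{\hat{\mu}}$.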


• In the figure exhibiting the density under $H_1$ (for a specific value of $\mu \neq \mu_0$) the power can be seen as the sum of the two shaded areas because $\pi = P(t < -c|H_1) + P(t > c|H_1)$.

[Figure: density $f(t)$ under $H_1$, centered at $(\mu - \mu_0)/\sigma_{\hat{\mu}}$, with the non-rejection region of $H_0$ between the critical values and the rejection regions of $H_0$ on both sides. Power = sum of the probabilities of rejection.]

• For a given $\sigma_{\hat{\mu}}$, the power of the test increases with the distance between the null hypothesis $\mu_0$ and the true value $\mu$.

• Recall that if $H_0$ is true, then $(\mu - \mu_0)/\sigma_{\hat{\mu}} = 0$ holds and one obtains the distribution under $H_0$.

• It can further be seen that the type II error, given as $1 - \pi = 1 - (1 - \beta) = \beta$, does not equal zero!
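The power of the two-sided test with known $\sigma_{\hat{\mu}}$ follows from the shifted normal distribution. The following is a minimal sketch; the function name `power_two_sided` and the numerical inputs are hypothetical.

```python
from statistics import NormalDist


def power_two_sided(mu, mu0, sigma_mu, alpha=0.05):
    """P(t < -c | H1) + P(t > c | H1) when t ~ N((mu - mu0)/sigma_mu, 1)."""
    z = NormalDist()
    c = z.inv_cdf(1.0 - alpha / 2.0)
    m = (mu - mu0) / sigma_mu            # shift of the density under H1
    return z.cdf(-c - m) + (1.0 - z.cdf(c - m))


# Hypothetical values: null value 6, standard deviation of the mean 0.16.
power_at_null = power_two_sided(mu=6.0, mu0=6.0, sigma_mu=0.16)
power_near = power_two_sided(mu=6.2, mu0=6.0, sigma_mu=0.16)
power_far = power_two_sided(mu=6.5, mu0=6.0, sigma_mu=0.16)

print(power_at_null, power_near, power_far)
```

At $\mu = \mu_0$ the rejection probability equals the size $\alpha$, and it rises towards one as the true mean moves away from $\mu_0$.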

4. There remains one problem: In real world applications we do not know the standard deviation of the mean estimator $\sigma_{\hat{\mu}} = \sigma_w/\sqrt{n}$.

Remedy: Estimation by

$$\hat{\sigma}_{\hat{\mu}} = \frac{\hat{\sigma}_w}{\sqrt{n}}.$$

Then one has the popular t statistic

$$t(w_1, \ldots, w_n) = \frac{\hat{\mu} - \mu_0}{\hat{\sigma}_{\hat{\mu}}},$$

however, watch out!

The test statistic is no longer normally distributed but follows a t distribution with $n - 1$ degrees of freedom (short $t_{n-1}$). Therefore

$$t(w_1, \ldots, w_n) = \frac{\hat{\mu} - \mu_0}{\hat{\sigma}_{\hat{\mu}}} \sim t_{n-1}.$$

To obtain the critical values

$$P(T < -c|H_0) = \frac{\alpha}{2} \quad \text{and} \quad P(T > c|H_0) = \frac{\alpha}{2},$$

the tables of the t distribution have to be considered (see Appendix G, Table G.2 in Wooldridge (2009)).

Wage Example Continued:

Hourly wages wi, i = 1, . . . , 526 of US employees:

1. Hypotheses:

a) Null hypothesis: H0 : µ = 6

b) Alternative hypothesis: $H_1: \mu \neq 6$

2. Estimation and calculation of the t statistic in EViews:

====================================================================
Dependent Variable: WAGE, Method: Least Squares
Sample: 1 526, Included observations: 526
====================================================================
C 5.896103 0.161026 36.61580 0.0000
====================================================================
Log likelihood -1433.060 Durbin-Watson stat 1.817647
====================================================================

(The row for the constant C lists the coefficient, its standard error, the t statistic for the null hypothesis that the coefficient is zero, and the corresponding p-value.)

Thus

$$\hat{\mu} = 5.896103, \qquad \hat{\sigma}_{\hat{\mu}} = 0.161026$$

and

$$t(w_1, \ldots, w_{526}) = \frac{5.896103 - 6}{0.161026} = -0.64521878.$$


3. Determination of critical values:

Suppose a significance level of $\alpha = 5\%$. Then the critical value $c = t_{525, 0.05}$ for the two-sided test can be obtained from the table for the t distribution with $n - 1 = 525$ degrees of freedom: $c = t_{525, 0.05} = 1.96$.

4. Test decision: Do not reject $H_0: \mu = 6$ since

$$-c = -1.96 < t = -0.645 < c = 1.96,$$

and therefore $t \notin C$ (the test statistic is not contained in the rejection region).
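The test can be reproduced from the summary statistics of the wage sample (mean 5.896103, standard deviation 3.693086, $n = 526$, as reported in the EViews descriptive statistics). The following is a minimal sketch. Since the Python standard library has no t distribution, the critical value is approximated by the standard normal quantile, which is an assumption that is harmless only because 525 degrees of freedom is large.

```python
import math
from statistics import NormalDist

n = 526
mean_wage = 5.896103   # sample mean of WAGE
sd_wage = 3.693086     # sample standard deviation of WAGE
mu0 = 6.0              # value of the mean under H0

se_mean = sd_wage / math.sqrt(n)       # estimated standard deviation of the mean
t_stat = (mean_wage - mu0) / se_mean   # t statistic

# Normal approximation to the t(525) critical value for alpha = 5%, two-sided.
c = NormalDist().inv_cdf(0.975)
reject = abs(t_stat) > c

print(se_mean, t_stat, c, reject)
```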

5. However:

Do hourly wages wi really follow a normal distribution as assumed?

Examine the histogram of the sample observations $w_i$ (see the descriptive statistics option in EViews):

Result:

[Figure: histogram of WAGE (horizontal axis: wage from 0 to 24, vertical axis: frequency from 0 to 140), shown together with the following descriptive statistics.]

Series: WAGE
Sample 1 526
Observations 526
Mean 5.896103
Median 4.650000
Maximum 24.98000
Minimum 0.530000
Std. Dev. 3.693086
Skewness 2.007325
Kurtosis 7.970083
Jarque-Bera 894.6195
Probability 0.000000

• The normality condition for our test does not seem to be fulfilled. The test result could be misleading!

• There are also tests that work without the normality assumption, see Section 5.1.


One- and two-sided hypothesis tests

• Two-sided tests

$H_0: \theta = \theta_0$ versus $H_1: \theta \neq \theta_0$

• One-sided tests

– Tests with left-sided alternative hypothesis

H0 : θ ≥ θ0 versus H1 : θ < θ0

Notice: Often, also in Wooldridge (2009), you can read H0 : θ =

θ0 versus H1 : θ < θ0. This notation, however, is somewhat

imprecise since either H0 or H1 has to be true. This is not made

clear by the latter notation.

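$H_0: \theta \geq \theta_0$ versus $H_1: \theta < \theta_0$

[Figure: density $f(t)$ for $\theta = \theta_0$ with the rejection region in the left tail (probability of error $\alpha$), the critical value $c$, and the non-rejection region under $H_0$ to its right.]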

∗ Decision rule:

t < c ⇒ Reject H0.

∗ You do not need a rejection region on the right hand side since all $\theta > \theta_0$ are elements of $H_0$ and thus fall into the non-rejection region.

∗ The critical value is obtained on the basis of the density for $\theta = \theta_0$ since then for a given critical value $c$ the shaded area is larger than for any $\theta > \theta_0$ and one prefers a test for which the maximum of the type I error is controlled.


Wage Example Continued:

(In the following we ignore that wages are not normally distributed.)

∗ The null hypothesis states that mean hourly wages are US-$ 6 or more ($H_1$ says it is less than US-$ 6):

$$H_0: \mu \geq 6 \quad \text{versus} \quad H_1: \mu < 6$$

∗ Calculation of the test statistic: as in the two-sided case, because again $\mu_0$ is the boundary between null and alternative hypothesis:

$$t(w_1, \ldots, w_{526}) = \frac{5.896103 - 6}{0.161026} = -0.64521878$$

∗ Calculation of the critical value: For $\alpha = 0.05$ the critical value (note: one-sided test with left-sided alternative) from the t distribution with 525 degrees of freedom (df) is $c = -1.645$.

∗ Decision: Since

$$t = -0.64521878 > c = -1.645$$

the null hypothesis is not rejected.


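– Test with right-sided alternative

$H_0: \theta \leq \theta_0$ versus $H_1: \theta > \theta_0$

[Figure: density $f(t)$ for $\theta = \theta_0$ with the non-rejection region under $H_0$ to the left of the critical value $c$ and the region of rejection in the right tail (probability of error $\alpha$).]

As with left-sided alternatives, but reversed.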

• Why do we carry out one-sided tests? Consider the following issue: Provide statistical evidence that the mean wage is above $ 5.60.

– Since by using statistical tests we can never confirm but only reject a hypothesis, we have to choose the alternative hypothesis such that it reflects our conjecture. Here, this is a mean wage larger than $ 5.60. Rejecting the null hypothesis then provides statistical evidence for the alternative hypothesis. However, there are exceptions to this rule, see e.g. Sections 4.6 and 4.7.

– We thus have to test if the mean wage is statistically significantly larger than $ 5.60.

We therefore need a test with a one-sided alternative. Our pair of hypotheses is

$$H_0: \mu \leq 5.60 \quad \text{versus} \quad H_1: \mu > 5.60.$$

– For $\alpha = P(T > c|H_0) = 0.05$ the critical value is $c = 1.645$.

– Decision:

$$t = \frac{5.896103 - 5.60}{0.161026} = 1.8388521 > c = 1.645$$

⇒ Reject $H_0$ (for size 5%); that is, the data provide statistical evidence that the mean wage is significantly above $ 5.60.

– If, on the contrary, we want to examine whether mean wages deviate from $ 5.60 in any direction, the pair of hypotheses is:

$$H_0: \mu = 5.60 \quad \text{versus} \quad H_1: \mu \neq 5.60.$$

Given the chosen significance level, $\alpha = 0.05$, the critical values are -1.96 and 1.96, respectively, and hence

$$-1.96 < 1.84 < 1.96.$$

Thus, the null hypothesis cannot be rejected.

– It is therefore easier to reject if one has knowledge about the location of the alternative: the whole error probability $\alpha$ can then be placed in one tail, which lowers the critical value (here 1.645 instead of 1.96), so that it is “easier” to reject the null hypothesis if it is false.
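The contrast between the two decisions can be reproduced with the numbers of the example. The following is a minimal sketch; as an assumption, the t(525) critical values are approximated by standard normal quantiles from the Python standard library.

```python
from statistics import NormalDist

z = NormalDist()

# t statistic for mu0 = 5.60, using the estimates from the wage example.
t_stat = (5.896103 - 5.60) / 0.161026

c_one_sided = z.inv_cdf(0.95)    # alpha = 5% placed in the right tail
c_two_sided = z.inv_cdf(0.975)   # alpha/2 = 2.5% in each tail

reject_one_sided = t_stat > c_one_sided
reject_two_sided = abs(t_stat) > c_two_sided

print(t_stat, c_one_sided, c_two_sided, reject_one_sided, reject_two_sided)
```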



p-values

• For every test statistic one can calculate the largest significance

level for which — given a sample of observations — the computed

test statistic would have just not led to a rejection of the null. This

probability is called p-value (probability value).

In case of a one-sided test with right-hand alternative one has

(Wooldridge 2009, Appendix C.6, p. 776)

P (T ≤ t(y)|H0) ≡ 1 − p

• Since P (T > t(y)|H0) = 1 − P (T ≤ t(y)|H0), one also has

P (T > t(y)|H0) = p

and thus it is common to say that the p-value is the smallest signif-

icance level at which the null can be rejected. Cf. Section 4.2, p.

133 in Wooldridge (2009)



• The decision rule of a test can also be stated in terms of p-

values:

Reject H0 if the p-value is smaller than the significance level α.

Note: In the figure t is shorthand for t(y).

Left-sided test: p = P (T < t(X,y)),

Right-sided test: p = P (T > t(X,y))

Two-sided test: p = P (T < −|t(X,y)|) + P (T > |t(X,y)|)
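The three p-value formulas above can be sketched as one small function. This is an illustration only: the function name, the labels "left", "right" and "two-sided", and the use of the standard normal distribution as the distribution of T under H0 are assumptions made here, not part of the course material.

```python
from statistics import NormalDist


def p_value(t_stat, alternative, dist=NormalDist()):
    """p-value of an observed test statistic for the three kinds of alternatives.

    alternative: "left", "right" or "two-sided" (hypothetical labels).
    dist: distribution of T under H0; it must be symmetric around zero
          for the two-sided formula used below.
    """
    if alternative == "left":
        return dist.cdf(t_stat)                    # P(T < t)
    if alternative == "right":
        return 1 - dist.cdf(t_stat)                # P(T > t)
    if alternative == "two-sided":
        return 2 * (1 - dist.cdf(abs(t_stat)))     # P(T < -|t|) + P(T > |t|)
    raise ValueError("alternative must be 'left', 'right' or 'two-sided'")


alpha = 0.05
t_wage = 1.8388521  # t statistic of the wage example in Section 4.1
for alt in ("left", "right", "two-sided"):
    p = p_value(t_wage, alt)
    print(f"{alt:>9}: p = {p:.4f}, reject H0 at 5%: {p < alpha}")
```

The decision rule "reject H0 if p < α" gives the same answers as the comparison with critical values: rejection for the right-sided alternative, no rejection for the two-sided one.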



• Software packages (e.g. EViews) often give p-values for

H0 : θ = 0 versus H1 : θ ≠ 0.

Reading: Appendix C.6 in Wooldridge (2009).



4.2 Probability Distribution of the OLS Estimator

For the multiple regression model

y = Xβ + u

we assume MLR.1 to MLR.5, as we did in Sections 3.2 and 3.4.

• Recall from Section 3.4.1 that under MLR.1 the OLS estimator

\[ \hat{\beta} = (X'X)^{-1}X'y \]

can be written as

\[ \hat{\beta} = \beta + \underbrace{(X'X)^{-1}X'}_{W}\, u. \tag{4.1} \]



• In order to derive the probability distribution of a test statistic one

needs the probability distribution of the underlying estimators since

the former is a function of the latter. Furthermore, the probability

distribution of the OLS estimator is necessary to construct interval

estimators, see Section 4.5.

Conditioning on the regressor matrix X, it follows from (4.1) that

the probability distribution of the OLS estimator only depends on

the error vector u. Similarly to the case of testing the mean we

make the assumption that the relevant random variables are nor-

mally distributed.



• Assumption MLR.6 (Normality of Errors):

Conditionally on the regressor matrix X, the sample errors are stochastically independently and identically normally distributed,

\[ u_i \mid x_{i1}, \dots, x_{ik} \sim \text{i.i.d. } N(0, \sigma^2), \quad i = 1, \dots, n. \]

Jointly with MLR.2, this can equivalently be written as: u is multivariate normal with mean zero and variance-covariance matrix \( \sigma^2 I \),

\[ u \mid X \sim N(0, \sigma^2 I). \]

• Of course, one could assume for the errors u any other probability

distribution. However, assuming normally distributed errors has two

advantages:

1. The probability distribution of the OLS estimator and derived test

statistics can easily be derived, see the remaining sections.



2. Under certain conditions the resulting probability distribution for

the OLS estimator holds even if the errors are not normally dis-

tributed. Then it is called asymptotic distribution, see Chap-

ter 5.

See Appendix B and D in Wooldridge (2009) for rules and properties

of normally distributed random variables and vectors.



• Properties of the multivariate normal distribution:

– If \( Z \sim N(\mu, \sigma^2) \), then \( aZ + b \sim N(a\mu + b, a^2\sigma^2) \).

– If the random variables Z and V are jointly normally distributed, then Z and V are stochastically independent if and only if Cov(Z, V) = 0.

– Every linear combination of a vector of identically and independently normally distributed random variables \( z \sim N(\mu, \sigma^2 I) \) is also normally distributed. Let

\[ w = \begin{pmatrix} w_1 \\ \vdots \\ w_n \end{pmatrix}, \quad z = \begin{pmatrix} z_1 \\ \vdots \\ z_n \end{pmatrix}. \]

Then

\[ \sum_{j=1}^{n} w_j z_j = w'z \sim N\left(w'\mu, \sigma^2 w'w\right). \]



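More generally, it holds for \( z = (z_1, \dots, z_n)' \sim N(\mu, \sigma^2 I) \) and

\[ W = \begin{pmatrix} w_{01} & w_{02} & \cdots & w_{0n} \\ w_{11} & w_{12} & \cdots & w_{1n} \\ w_{21} & w_{22} & \cdots & w_{2n} \\ \vdots & \vdots & & \vdots \\ w_{k1} & w_{k2} & \cdots & w_{kn} \end{pmatrix} \]

that

\[ \begin{pmatrix} \sum_{j=1}^{n} w_{0j} z_j \\ \vdots \\ \sum_{j=1}^{n} w_{kj} z_j \end{pmatrix} = Wz \sim N\left(W\mu, \sigma^2 WW'\right). \tag{4.2} \]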

• The property (4.2) for linear combinations of normally distributed random variables is very helpful for us since the OLS estimator (4.1) is just such a linear combination.

Thus, one obtains

\[ \hat{\beta} - \beta = Wu \sim N\left(0, \sigma^2 WW'\right). \]

Since \( WW' = (X'X)^{-1}X'X(X'X)^{-1} = (X'X)^{-1} \), one obtains

\[ \hat{\beta} \sim N\left(\beta, \sigma^2 (X'X)^{-1}\right). \]

Similarly one can show that

\[ \hat{\beta}_j \sim N\left(\beta_j, \sigma^2_{\hat{\beta}_j}\right) \tag{4.3} \]

with \( \sigma^2_{\hat{\beta}_j} = \dfrac{\sigma^2}{SST_j (1 - R_j^2)} \) (see (3.15) in Section 3.4).

• Note that (4.3) generalizes the example of Section 4.1 for testing

hypotheses on the mean. If X is a column vector of ones, then

β0 = µ.
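As a worked special case of (4.1) to (4.3), the following sketch sets X equal to a column of ones. The symbols ι for the n × 1 vector of ones and ȳ for the sample mean are notation introduced here for illustration.

```latex
\[
X = \iota: \qquad X'X = \iota'\iota = n, \qquad
W = (X'X)^{-1}X' = \tfrac{1}{n}\,\iota', \qquad
WW' = \tfrac{1}{n^{2}}\,\iota'\iota = \tfrac{1}{n},
\]
\[
\hat{\beta}_0 = W y = \frac{1}{n}\sum_{i=1}^{n} y_i = \bar{y}
\;\sim\; N\!\left(\mu, \frac{\sigma^{2}}{n}\right).
\]
```

In this case the standard deviation of the estimator is σ/√n, which is the familiar standard error of a sample mean.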



4.3 The t Test in the Multiple Regression Model

• Derivation of the test statistic and its distribution

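– From (4.3): \( \hat{\beta}_j \sim N\left(\beta_j, \sigma^2_{\hat{\beta}_j}\right) \).

– Standardizing leads to

\[ \frac{\hat{\beta}_j - \beta_j}{\sigma_{\hat{\beta}_j}} \sim N(0, 1). \]

For estimated σ² (no proof) the test statistic follows a t distribution with n − k − 1 degrees of freedom. Estimating k + 1 regression parameters implies k + 1 restrictions from the normal equations:

\[ t(X, y) = \frac{\hat{\beta}_j - \beta_j}{\hat{\sigma}_{\hat{\beta}_j}} \sim t_{n-k-1}. \]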



• Critical region and decision rule

– Two-sided test

∗ Hypotheses:

H0 : βj = βj0 versus H1 : βj ≠ βj0.

For a given significance level one obtains the critical values from

the table of the t distribution such that P (T < −c|H0) = α/2

and P (T > c|H0) = α/2 or equivalently 2 ·P (T > c|H0) = α.

∗ Decision rule:

· Reject H0 if |t(X,y)| > c, otherwise do not reject H0.

· Alternatively: Calculate the p-value

p = P (|T | > |t(X,y)| |H0) = 2 · P (T > |t(X,y)| |H0)

and reject H0 if p < α, otherwise do not reject H0.



– One-sided test with left-sided alternative

∗ Hypotheses:

H0 : βj ≥ βj0 versus H1 : βj < βj0.

For a given significance level one obtains the critical value from

the table of the t distribution such that

P (T < c|H0) = α.

∗ Decision rule:

· Reject H0 if t(X,y) < c, otherwise do not reject H0.


· Alternatively: Calculate the p-value

p = P (T < t(X,y)|H0)

and reject H0 if p < α, otherwise do not reject H0.




– One-sided test with right-sided alternative

∗ Hypotheses:

H0 : βj ≤ βj0 versus H1 : βj > βj0.

For a given significance level one obtains the critical value from

the table of the t distribution such that

P (T > c|H0) = α.

∗ Decision rule:

· Reject H0 if t(X,y) > c, otherwise do not reject H0.


· Alternatively: Calculate the p-value

p = P (T > t(X,y)|H0)

and reject H0 if p < α, otherwise do not reject H0.




• Economic versus statistical significance

– For a given (statistical) significance level α, the power of a test increases with increasing sample size since \( \hat{\sigma}_{\hat{\beta}_j} \) in the denominator of the test statistic decreases with sample size.

– Not being able to reject a null hypothesis may thus simply be caused by a too small sample size (if the null hypothesis is wrong in the population).

– On the other hand, if a variable has only weak influence in the population, its parameter estimate will be significantly different from zero if the sample size is large enough. Thus, even if βjxj only has small economic impact on the dependent variable, the variable is statistically significant.

– Be careful: In order to avoid estimation bias due to too small models, significant variables must be kept in the model, see Section 3.4.1.
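How power grows with the sample size can be illustrated for the simplest case, the right-sided test on a mean. The sketch below assumes a normally distributed estimator with known standard deviation σ, so that its standard error is σ/√n; the function name and all numbers are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def power_right_sided(mu_true, mu0, sigma, n, alpha=0.05):
    """Power of the right-sided test of H0: mu <= mu0 when the true mean is mu_true.

    Assumes a normal estimator with known sigma (standard error sigma / sqrt(n)).
    """
    z = NormalDist()
    c = z.inv_cdf(1 - alpha)                      # critical value
    shift = (mu_true - mu0) / (sigma / sqrt(n))   # mean of the test statistic under mu_true
    return 1 - z.cdf(c - shift)                   # P(T > c | mu = mu_true)


# Hypothetical setting: a small true deviation of 0.10 from the null value 5.60.
sample_sizes = (25, 100, 400, 1600, 6400)
powers = [power_right_sided(5.70, 5.60, 3.0, n) for n in sample_sizes]
for n, pw in zip(sample_sizes, powers):
    print(f"n = {n:5d}: power = {pw:.3f}")
```

Even an economically small deviation is detected with high probability once n is large, which is the distinction between economic and statistical significance drawn above.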

• Choice of significance level

– Two reasons for decreasing the significance level α with increasing

sample size n:

∗ Larger sample sizes make tests more powerful. Thus, one can decide whether the benefit of a larger sample size is used only to reduce the probability of a Type II error, β = 1 − π, or whether one also wants to decrease the probability of a Type I error. In standard significance testing, the probability of a Type I error is the probability of including a variable in the model although it is irrelevant in the population model. Thus, it makes sense to reduce this probability as well.



∗ In general one selects relevant variables from a large number of possibly relevant variables. Since each significance test is carried out at significance level α, one erroneously includes on average about αK redundant variables, where K denotes the total number of variables considered. Since K is frequently allowed to increase with sample size n, the significance level α should fall in order to prevent αK from increasing.

– If one uses the Hannan-Quinn (HQ) (3.19) or the Schwarz (SC)

(3.20) model selection criterion, then the significance level de-

creases with sample size. This is not the case for the AIC criterion

(3.18).



• Insignificance, multicollinearity, and sample size

– Recall: The test statistic t(X,y) is small in absolute value if

∗ the deviation between the true value βj and the value βj0 of the null hypothesis is small

∗ or the estimated standard error \( \hat{\sigma}_{\hat{\beta}_j} \) of \( \hat{\beta}_j \) is large.

The latter can also be caused by multicollinearity in X. Thus: A high degree of multicollinearity makes it less likely to reject the null hypothesis (since |t(X,y)| is small on average).

– For this reason one may keep insignificant variables in the regres-

sion. However, corresponding parameter estimates have then to

be interpreted with care.

Reading: Appendices C.5, E.3 in Wooldridge (2009) if needed.



4.4 Example of an Empirical Analysis I: A Simplified Gravity Equation


Compare steps of an econometric analysis, see Section 1.2.

1. Question of interest:

Quantify the impact of changes in gdp of the exporting country on imports to Kazakhstan.

2. Economic model:

Under idealized assumptions including complete specialization in

production and identical consumption preferences among countries,

no trading costs, and focusing exclusively on imports, economic the-

ory implies (see Section II, equation (5) in Fratianni (2007))

\[ \text{imports}_i = A \cdot \text{gdp}_i \cdot \text{distance}_i^{\beta_2}, \quad \beta_2 < 0. \]



This implies a unit elasticity (elasticity of 1) of imports with respect to gdp. This means that a 1% increase in gdp of the exporting country increases imports by 1% as well.

This hypothesis can be statistically tested.

3. Econometric model:

The simplest econometric model is obtained by taking logs of the

economic model and adding an error term. This delivers

ln(importsi) = β0 + β1 ln(gdpi) + β2 ln(distancei) + ui.

4. Collecting data: see Appendix 10.4.

5. Selection and estimation of an econometric model:

In practice, there may be further variables influencing imports. Thus,

further control variables have to be added. Based on the Schwarz criterion, the model selection exercise in Section 3.5 suggested adding the control variable common colonizer since 1945 (Model 3),

ln(importsi) = β0 + β1 ln(gdpi) + β2 ln(distancei) + β3 comcoli + ui.

====================================================================

Dependent Variable: LOG_IMP



====================================================================


====================================================================

C -9.578911 4.273583 -2.241424 0.0297


LOG(CEPII_DIST) -1.144269 0.441293 -2.592991 0.0126

CEPII_COMCOL_REV 3.126534 0.674896 4.632616 0.0000

====================================================================





Log likelihood -92.03988 Durbin-Watson stat 2.044512

====================================================================



6. Model diagnostics: Check possible violation of MLR.5 (Homoskedas-

ticity) by plotting the residuals against the fitted values and possible

violation of MLR.6 (Normal errors) by plotting a histogram of the

residuals:

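[Figure: left panel, scatter plot of the residuals RES_OLS_IMP_KAZ (vertical axis) against the fitted values FIT_OLS_IMP_KAZ (horizontal axis); right panel, histogram of the residuals with the summary statistics listed below.]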

Series: RES_OLS_IMP_KAZ

Sample 1 55

Observations 52

Mean -2.95e-15

Median -0.103912

Maximum 2.909987

Minimum -3.625761

Std. Dev. 1.434431

Skewness -0.295179

Kurtosis 3.017182



The scatter plot does not indicate a violation of MLR.5. Why?

Inspecting the box to the right of the histogram shows that the estimated kurtosis is close to 3, which is the theoretical value implied by the normal distribution.

Thus, we may continue to use this model.

7. Usage of the model: Conduct tests:

A two-sided test

• Now we can formulate the pair of statistical hypotheses:

H0 : The elasticity of imports to gdp is 1. versus H1 : The elasticity is unequal to 1.

H0 : β1 = 1 versus H1 : β1 ≠ 1.

• Compute the t statistic from the relevant line of the output (columns Coefficient and Std. Error):

\[ t(X, y) = \frac{\hat{\beta}_1 - \beta_{10}}{\hat{\sigma}_{\hat{\beta}_1}} = \frac{1.356566 - 1}{0.128793} = 2.76852 \]



• Choose a significance level, e.g. α = 0.05.

Compute critical values: The degrees of freedom are n−k−1 =

52 − 3 − 1 = 48. One may obtain an approximate critical value

from Table G.2 in Wooldridge (2009) or a precise critical value

e.g. from

– EViews using scalar crit = @qtdist(1-alpha/2,n-k-1)

in the command window or

– Excel using c =(TINV(alpha;n-k-1))=2.0106. (Note that

the Excel function already assumes a two-sided test.)

• Since

t(X,y) = 2.76852 > 2.0106 = c

one rejects the null hypothesis.

• p-values can be computed in EViews using scalar pval = 2*(1-@ctdist(t,n-k-1)) = 0.0080. Thus, one can reject H0 even at the 1% significance level. The p-value means that, if H0 were true, we would observe a t statistic of at least 2.769 in absolute value in only about 8 out of 1000 samples drawn.

One-sided test

• Now we can formulate the pair of statistical hypotheses with re-

spect to the sign of β2, that is the impact of distance on imports.

Since we want to provide evidence for β2 < 0, we put this into

H1:

H0 : β2 ≥ 0 versus H1 : β2 < 0.

• Compute the t statistic from the relevant line of the output:

Variable Coefficient Std. Error t-Statistic Prob.

LOG(CEPII_DIST) -1.144269 0.441293 -2.592991 0.0126

\[ t(X, y) = \frac{\hat{\beta}_2 - \beta_{20}}{\hat{\sigma}_{\hat{\beta}_2}} = \frac{-1.144269 - 0}{0.441293} = -2.592991 \]

• Choosing again α = 0.05, we compute the critical value using the EViews function scalar crit = @qtdist(alpha,n-k-1) = -1.6772.

• Since

t(X,y) = −2.592991 < −1.6772 = c

one rejects the null hypothesis. Thus, log distance has a statis-

tically significant negative impact on imports at the given signif-

icance level.

• The corresponding p-value using EViews is

scalar pval = @ctdist(t,n-k-1)= 0.0063. Thus, distance

has a negative impact even at the 1% significance level.
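Both tests of this section can be retraced without EViews. The sketch below uses only the Python standard library; since that library has no t distribution, the helper functions t_pdf, t_cdf and t_quantile are written here for illustration (numerical integration with Simpson's rule and bisection) and are not part of the course material.

```python
from math import gamma, pi, sqrt


def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    const = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return const * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x, df, steps=2000):
    """P(T <= x) via Simpson's rule on [0, |x|] and the symmetry of the density."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area


def t_quantile(prob, df):
    """Inverse of t_cdf by bisection on [-50, 50]."""
    lo, hi = -50.0, 50.0
    for _ in range(80):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


n, k, alpha = 52, 3, 0.05
df = n - k - 1                                   # 48 degrees of freedom

# Two-sided test of H0: beta_1 = 1 (elasticity of imports with respect to gdp).
t1 = (1.356566 - 1) / 0.128793
c_two = t_quantile(1 - alpha / 2, df)
p_two = 2 * (1 - t_cdf(abs(t1), df))
print(f"two-sided:  t = {t1:.5f}, c = {c_two:.4f}, p = {p_two:.4f}")

# Left-sided test of H0: beta_2 >= 0 (impact of distance).
t2 = (-1.144269 - 0) / 0.441293
c_left = t_quantile(alpha, df)
p_left = t_cdf(t2, df)
print(f"left-sided: t = {t2:.5f}, c = {c_left:.4f}, p = {p_left:.4f}")
```

The printed critical values and p-values can be compared with the EViews results quoted in the text.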



Interpretation of β3:

In principle, this parameter can be interpreted like in a log-level model, see Section 2.6. However, since the estimate of β3 is very different from 0, and because the regressor is a dummy variable, one should use an exact formula to compute the relative change, see Section 6.3: the precise value is \( e^{\hat{\beta}_3} - 1 = 21.8 \). Thus, imports from countries with colonial ties are about 22 times larger than imports from other countries, keeping everything else fixed! These are very likely the trading patterns from the former Soviet Union.
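The gap between the approximate log-level reading and the exact formula is easy to check by hand. A minimal sketch, using the coefficient estimate of the common-colonizer dummy from the Model 3 output; the variable names are chosen here for illustration.

```python
from math import exp

beta3_hat = 3.126534  # estimate for CEPII_COMCOL_REV in Model 3

# Log-level approximation: relative change is roughly beta3 itself.
approx_change = beta3_hat
# Exact relative change when the dummy switches from 0 to 1.
exact_change = exp(beta3_hat) - 1

print(f"approximate relative change: {approx_change:.2f} ({100 * approx_change:.0f}%)")
print(f"exact relative change:       {exact_change:.2f} ({100 * exact_change:.0f}%)")
```

For small coefficients the two numbers nearly coincide; for an estimate as large as this one the approximation is far off, which is why the exact formula is used.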

Note that we already considered other model specifications in Sec-

tion 3.5. It might be interesting to check whether these test results

are robust if other model specifications are used such as Model 2 or

Model 4.



4.5 Confidence Intervals

• How large is the probability that the estimated parameter value

corresponds to the true value?

• A parameter estimator — to be more precise, a point estimator —

does not allow any conclusions how “close” the estimate is to the

true value of the population.

• Following the position of Sir Karl Popper, who advocated critical rationalism in the philosophy of science, point estimates are not very useful since they cannot be falsified. Instead, an empirical hypothesis is only scientific if it is falsifiable.

• Example: Assume we predicted a price index on the basis of an econometric model and obtained a predicted value of 5.12. The realized value, however, turns out to be 5.24. → Then we made a wrong prediction since it did not realize exactly.

This “error” can only have three reasons:

– The random error of the population regression model.

– The estimation error of the sample regression model.

– The regression model is not correct or (more realistic) it is a bad

approximation. At least one of our assumptions is not justified.

Problem:

From a subjective point of view one can have different opinions

about these “explanations”:

– One believes that the deviation is due to the random error.

– Another claims that the model is wrong.



Solution

One should specify objective criteria such that one can make a

scientific decision. These criteria should be determined before any

predicted value realizes.

Then one cannot escape a potential falsification of a hypothesis af-

terwards. This makes a hypothesis scientific in the sense of Popper.

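• Let’s be more precise:

How large is the probability that the estimated value \( \hat{\beta}_j \) corresponds exactly to the true value βj if, as was shown in Section 4.3,

\[ \hat{\beta}_j \sim N\left(\beta_j, \sigma^2_{\hat{\beta}_j}\right) \]

and \( (\hat{\beta}_j - \beta_j)/\sigma_{\hat{\beta}_j} \sim N(0, 1) \), or if \( \sigma_{\hat{\beta}_j} \) is estimated,

\[ \frac{\hat{\beta}_j - \beta_j}{\hat{\sigma}_{\hat{\beta}_j}} \sim t_{n-k-1}\ ? \]

• Alternative question:

How large is the probability that the true value βj lies in the interval

\[ \left[\hat{\beta}_j - c \cdot \hat{\sigma}_{\hat{\beta}_j},\ \hat{\beta}_j + c \cdot \hat{\sigma}_{\hat{\beta}_j}\right] \]

where c is given?

Note that the endpoints of the interval are random prior to obtaining a sample. Its location is random through \( \hat{\beta}_j \) and its length is random through \( \hat{\sigma}_{\hat{\beta}_j} \).

This interval is the best-known example of an interval estimator.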



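• Answer for given \( \sigma_{\hat{\beta}_j} \):

How large is the probability that the true value βj is contained in the interval \( [\hat{\beta}_j - c \cdot \sigma_{\hat{\beta}_j},\ \hat{\beta}_j + c \cdot \sigma_{\hat{\beta}_j}] \) where the value c is chosen by you?

– It is 2Φ(c) − 1 since

\[
\begin{aligned}
P\left(\hat{\beta}_j - c\sigma_{\hat{\beta}_j} \le \beta_j \le \hat{\beta}_j + c\sigma_{\hat{\beta}_j}\right)
&= P\left(-c\sigma_{\hat{\beta}_j} \le \beta_j - \hat{\beta}_j \le c\sigma_{\hat{\beta}_j}\right) \\
&= P\left(-c \le \frac{\beta_j - \hat{\beta}_j}{\sigma_{\hat{\beta}_j}} \le c\right) \\
&= P\left(-c \le \frac{\hat{\beta}_j - \beta_j}{\sigma_{\hat{\beta}_j}} \le c\right) \\
&= \Phi(c) - \Phi(-c) \\
&= \Phi(c) - (1 - \Phi(c)) \\
&= 2\Phi(c) - 1.
\end{aligned}
\]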

– Example: For c = 1.96 one obtains Φ(1.96) − Φ(−1.96) = 0.975 − 0.025 = 0.95: The true value βj will lie with 95% probability within the interval \( \hat{\beta}_j \pm c \cdot \sigma_{\hat{\beta}_j} \). One also relates this probability to α by writing 0.95 = 1 − α. Thus one has α = 0.05.
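The relation between c and the coverage probability 2Φ(c) − 1 can be tabulated directly. A minimal sketch for the case of a known standard deviation, using the standard normal distribution from the Python standard library; the function names are chosen here for illustration.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal distribution, Phi = z.cdf


def coverage(c):
    """Probability 2*Phi(c) - 1 that the interval estimate +/- c * sigma covers the true value."""
    return 2 * z.cdf(c) - 1


def critical_value(confidence_level):
    """Inverse relation: the c that yields a given confidence level 1 - alpha."""
    alpha = 1 - confidence_level
    return z.inv_cdf(1 - alpha / 2)


for c in (1.0, 1.645, 1.96, 2.576):
    print(f"c = {c:5.3f}: coverage = {coverage(c):.4f}")

print(f"c for 95% confidence: {critical_value(0.95):.4f}")
```

The table makes visible that a larger c, and hence a wider interval, is the price of a higher coverage probability.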

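• Answer for estimated \( \sigma_{\hat{\beta}_j} \): The true value βj lies in the interval \( \hat{\beta}_j \pm c \cdot \hat{\sigma}_{\hat{\beta}_j} \) with probability 1 − α. Note, however, that for computing the probability one has to use the \( t_{n-k-1} \) distribution since

\[ P\left(\hat{\beta}_j - c\hat{\sigma}_{\hat{\beta}_j} \le \beta_j \le \hat{\beta}_j + c\hat{\sigma}_{\hat{\beta}_j}\right) = P\left(-c \le \frac{\hat{\beta}_j - \beta_j}{\hat{\sigma}_{\hat{\beta}_j}} \le c\right). \]

• The interval

\[ \left[\hat{\beta}_j - c \cdot \hat{\sigma}_{\hat{\beta}_j},\ \hat{\beta}_j + c \cdot \hat{\sigma}_{\hat{\beta}_j}\right] \]

is called confidence interval. One says that the confidence interval contains the true value with a probability of confidence of (1 − α)100%. The value (1 − α) is also called confidence level or coverage probability of the confidence interval.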

• In practice one determines the confidence level 1−α and then com-

putes the value c using the appropriate distribution: either N (0, 1)

or tn−k−1.

• Note:

– The constant c corresponds to the (upper) critical value of a

two-sided test with significance level α.

– Since the confidence interval is a random interval, its location and length are in general different for each sample.

– The larger (1 − α), the smaller α, the larger is the confidence

interval. In other words: the more you want to be on the safe

side, the larger the confidence interval becomes. Why?



– A two-sided t test and a confidence interval contain the same

amount of information. The null hypothesis of a two-sided t

test is rejected if and only if the value of the null hypothesis lies

outside the confidence interval. Draw a graph to make this clear.

– If one keeps drawing new samples from a population, what fraction of the confidence intervals does not contain the true value on average?

• Trade Example Continued (from Section 4.4):

– Compute a 95% confidence interval for the elasticity βgdp of im-

ports with respect to gdp.

– From Section 4.4 it can be justified that MLR.1 to MLR.6 hold

and imports are normally distributed.

– Since \( \sigma_{\hat{\beta}_{gdp}} \) has to be estimated, one has to use the t distribution with n − k − 1 = 48 degrees of freedom. For a confidence level of 0.95 one obtains α = 0.05 and thus c = 2.0106 (e.g. by using EViews scalar crit = @qtdist(1-0.05/2,52-3-1)).

– The relevant line of output is the one used in Section 4.4 (coefficient 1.356566, standard error 0.128793).

– Therefore the 95% confidence interval is given by

\[
\begin{aligned}
&\left[\hat{\beta}_j - c \cdot \hat{\sigma}_{\hat{\beta}_j},\ \hat{\beta}_j + c \cdot \hat{\sigma}_{\hat{\beta}_j}\right] \\
&= [1.3566 - 2.0106 \cdot 0.1288,\ 1.3566 + 2.0106 \cdot 0.1288] \\
&= [1.0976,\ 1.6156].
\end{aligned}
\]

– The elasticity of imports with respect to gdp falls with 95% prob-

ability within the range between 1.0976 and 1.6156. Note that 1

is not included in the confidence interval. This reflects the test

result in Section 4.4 of rejecting H0 : βgdp = 1.
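The interval and its link to the two-sided t test can be verified with a few lines. A minimal sketch: the critical value c = 2.0106 is taken from the text rather than computed, and the function name is chosen here for illustration.

```python
def confidence_interval(estimate, std_error, c):
    """Return (lower, upper) of the interval estimate +/- c * std_error."""
    half_width = c * std_error
    return estimate - half_width, estimate + half_width


beta_gdp_hat = 1.3566   # rounded coefficient estimate from Section 4.4
se_beta_gdp = 0.1288    # rounded estimated standard error
c = 2.0106              # critical value of the t distribution with 48 df, from the text

lower, upper = confidence_interval(beta_gdp_hat, se_beta_gdp, c)
print(f"95% confidence interval: [{lower:.4f}, {upper:.4f}]")

# Duality with the two-sided t test: H0 is rejected exactly when the
# hypothesized value lies outside the confidence interval.
beta_null = 1.0
t_stat = (beta_gdp_hat - beta_null) / se_beta_gdp
outside_interval = not (lower <= beta_null <= upper)
print(f"|t| > c: {abs(t_stat) > c}, value outside interval: {outside_interval}")
```

Both criteria give the same decision for any hypothesized value, which is the sense in which the test and the interval contain the same information.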



4.6 Testing a Single Linear Combination of Parameters

• Example: Cobb-Douglas production function

log Y = β0 + β1 log K + β2 log L + u,

where Y denotes output, K and L denote the production factors

capital and labor, respectively. Note that β1 and β2 are elasticities

here.

If the restriction β1 + β2 = 1 holds true, the production function has constant returns to scale, i.e. a 1% increase of both labor and capital leads to a 1% increase of output on average.



For an empirical test of constant returns to scale, we employ the

following pair of hypotheses:

H0 : β1 + β2 = 1 versus H1 : β1 + β2 ≠ 1.

• How to construct the test statistic:

1. First, define auxiliary parameters θ and θ0, where

θ = β1 + β2, θ0 = 1,

or, equivalently

H0 : θ = θ0 versus H1 : θ ≠ θ0.



2. Second, solve θ for one of the parameters βi, here β1,

β1 = θ − β2,

and insert it into the initial regression equation and reformulate the latter to

\[ \log Y = \beta_0 + (\theta - \beta_2) \log K + \beta_2 \log L + u \]

\[ \log Y = \beta_0 + \theta \log K + \beta_2 \underbrace{(\log L - \log K)}_{\text{new variable}} + u. \tag{4.4} \]

Then estimate (4.4) and obtain the test statistic

\[ t_\theta = \frac{\hat{\theta} - \theta_0}{\hat{\sigma}_{\hat{\theta}}} \]

which can be directly calculated from the estimation of (4.4).



Example:

In a classical marketing model we regress (the natural logarithm of) sales (S) of a consumer good on (the natural logarithm of) this good’s price (P) as well as on (the natural logarithms of) cross prices (PK1, PK2) of competing goods. The following regression output is calculated from the data:

Dependent Variable: log(S), Method: Least Squares, Obs: 6917

Variable Coeff. Std.Err. t-Stat. Prob.

C 4.407786 0.079559 55.40268 0.0000

LOG(P) -3.955281 0.068095 -58.08499 0.0000

LOG(P_K1) 0.710274 0.073912 9.609683 0.0000

LOG(P_K2) 1.154163 0.079815 14.46046 0.0000


Adj. R-squared 0.331974 S.D. dependent var 1.250189

S.E. regression 1.021815 Akaike info crit. 2.881617


F-statistic 1146.631 Prob(F-statistic) 0.000000



We wish to test the following statement: the cross price elasticities are

identical, keeping everything else fixed (ceteris paribus) (though the

competing goods come from different market segments).

• The initial hypotheses are given by

H0 : βK1 = βK2 versus H1 : βK1 ≠ βK2.

We reformulate them by re-parametrization according to

θ = βK1 − βK2, θ0 = 0

H0 : θ = 0 versus H1 : θ ≠ 0.

• Thus, due to βK1 = θ + βK2, the initial regression model

ln(S) = β1 + β2 ln(P ) + βK1 ln(PK1) + βK2 ln(PK2) + u

can be rewritten as

ln(S) = β1 + β2 ln(P ) + θ ln(PK1) + βK2(ln(PK2) + ln(PK1)) + u.



• Given the estimates of the last regression

Dependent Variable: log(S), Method: Least Squares, Obs: 6917

Variable Coeff. Std.Err. t-Stat. Prob.

C 4.407786 0.079559 55.40268 0.0000

LOG(P) -3.955281 0.068095 -58.08499 0.0000

LOG(P_K1) -0.443889 0.112543 -3.944165 0.0001

LOG(P_K1)+LOG(P_K2) 1.154163 0.079815 14.46046 0.0000


Adj. R-squared 0.331974 S.D. dependent var 1.250189

S.E. regression 1.021815 Akaike info crit. 2.881617


F-statistic 1146.631 Prob(F-statistic) 0.000000

calculate the t statistic as

\[ t = \frac{-0.443889 - 0}{0.112543} \approx -3.94. \]

For a given significance level of α = 0.05, the critical values are -1.96 and 1.96. Thus, we have to reject H0.
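The numbers of the two regression outputs can be cross-checked against each other. The sketch below confirms that the estimate of θ in the re-parametrized regression equals the difference of the two cross-price elasticities of the original regression, and recomputes the t statistic. Using the standard normal distribution for the p-value is an assumption made here, motivated by the large sample (6917 observations) and the critical values ±1.96 in the text.

```python
from statistics import NormalDist

# Original regression: estimated cross-price elasticities.
b_k1, b_k2 = 0.710274, 1.154163
# Re-parametrized regression: estimate and standard error of theta = beta_K1 - beta_K2.
theta_hat, se_theta = -0.443889, 0.112543
theta0 = 0.0

# The re-parametrization does not change the fit, so theta_hat must equal b_k1 - b_k2.
difference = b_k1 - b_k2
print(f"b_k1 - b_k2 = {difference:.6f}, theta_hat = {theta_hat:.6f}")

t_stat = (theta_hat - theta0) / se_theta
p_two_sided = 2 * (1 - NormalDist().cdf(abs(t_stat)))   # normal approximation
print(f"t = {t_stat:.4f}, two-sided p-value = {p_two_sided:.5f}")
print(f"reject H0 at 5%: {abs(t_stat) > 1.96}")
```

The advantage of the re-parametrization is that the standard error of the estimated difference is read directly from the output, without needing the covariance between the two original estimates.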




4.7 Jointly Testing Several Linear Combinations of Parameters: The F Test

Some examples of possible restrictions within the MLR framework:

1. H0 : β1 = 3

2. H0 : β2 = βk

3. H0 : β1 = 1, βk = 0

4. H0 : β1 = β3, β2 = β4

5. H0 : βj = 0, j = 1, . . . , k

6. H0 : βj + 2βl = 1, βk = 2

We can already check case 1. and case 2. by applying t tests. For all

other cases we need the F test.



4.7.1 Testing of Several Exclusion Restrictions


Consider Model 4 in Section 3.5:

====================================================================

Dependent Variable: LOG_IMP, Method: Least Squares

Sample: 1 55 Included observations: 52

====================================================================


====================================================================

C -13.02677 4.726426 -2.756157 0.0083


LOG(CEPII_DIST) -0.624909 0.542325 -1.152279 0.2550

CEPII_COMCOL_REV 3.351230 0.678912 4.936177 0.0000

CEPII_COMLANG_OFF 2.109579 1.319296 1.599018 0.1165

====================================================================









Are the control variables common colonizer since 1945 (CEPII_COMCOL_REV) and common official language (CEPII_COMLANG_OFF) really needed in the specification of Model 4?

To put it more precisely, are the parameters of the two variables mentioned jointly significantly different from zero?

H0 : βcomcol_rev = 0 and βcomlang_off = 0

versus

H1 : βcomcol_rev ≠ 0 and/or βcomlang_off ≠ 0



How can one jointly test several hypotheses?

• Note that SSR decreases (or stays constant) with an additional re-

gressor.

⇒ Idea: Compare the SSR of a model on which the null hypotheses

are imposed (restricted model) with the SSR of another model that

does not impose the joint restrictions (unrestricted model).

• The estimation under H0 is easy: simply exclude all regressors from

the regression whose parameters under H0 are set to zero and re-

estimate the restricted model.

In case of Model 4 for the trade example, the OLS estimates for the restricted model (which corresponds to Model 2 in Section 3.5) are:



====================================================================



Sample: 1 55 Included observations: 52

====================================================================


====================================================================

C 4.800950 3.497341 1.372743 0.1761


LOG(CEPII_DIST) -1.970804 0.480555 -4.101103 0.0002

====================================================================







====================================================================



Results:

– The R2 of the unrestricted model is 0.7142 while the R2 of the

restricted model is (only) 0.5640.

– Correspondingly, the standard error of regression σ increases from

1.4551 to 1.7604.

– Are these changes large? It looks like that but what does “large”

really mean here?

– Note that all three model selection criteria, AIC, HQ, and SC,

“prefer” the unrestricted model, see Section 3.5. Will this finding

be confirmed by the test?



• In order to be able to use a statistic (a function that can be com-

puted from sample values) as a test statistic, one has to know its

probability distribution under the null hypothesis H0.

One can show (→ advanced econometrics course or Section 4.4 in Davidson & MacKinnon (2004)) that the following test statistic follows an F distribution

\[ F = \frac{(SSR_{H_0} - SSR_{H_1})/q}{SSR_{H_1}/(n - k - 1)} \sim F_{q,\,n-k-1}. \]

Therefore this test is called F test and the test statistic is abbreviated as F statistic.

• Note that the F distribution has two different degrees of freedom, q degrees of freedom for the random variable in the numerator, and n − k − 1 degrees of freedom for the random variable in the denominator. The value q is the number of restrictions that are jointly tested.

• Details of the F statistic:

– Its minimum is 0 since \( SSR_{H_0} \ge SSR_{H_1} \) and \( SSR_{H_1} > 0 \). (Therefore the F statistic cannot be normally distributed!)

– There is no upper bound.

• When should the joint null hypothesis be rejected?

– The larger the absolute difference between the SSRs of the restricted and the unrestricted model, \( SSR_{H_0} - SSR_{H_1} \), the more likely is a violation of the exclusion restrictions.

– However, be aware that absolute differences do not say much.

Why?



– It makes much more sense to consider the relative difference

between the SSRs. This is exactly what the F statistic does.

It scales the difference in SSRs by the SSR of the unrestricted

model. If the relative difference is large, then the joint null hy-

pothesis is likely to be violated.

– On the other hand, if the relative difference is small, then it is

likely that the excluded variables do not have any relevant impact

in the unrestricted model since they can be neglected without any

noticeable effect.

• Decision rule:

Reject H0 if the test statistic is larger than the critical value:

Reject H0 if F > c.

Thus, the critical region is (c,∞).



Calculation of the critical region:

For a given significance level α, the critical value c is implicitly

defined by the probability

P (F > c|H0) = α.

The corresponding value for c given α can be found in tables on the F distribution, e.g. Table G.3 in Appendix G in Wooldridge (2009), or be computed in EViews or Excel (for the latter one uses Finv(0.05;q;n-k-1) for alpha=0.05).



Trade Example Continued (from the beginning of this section):

• The joint null hypothesis contains two exclusion restrictions, thus the numerator has two degrees of freedom, q = 2. The degrees of freedom for the denominator correspond to the degrees of freedom of Model 4, n − k − 1 = 52 − 4 − 1 = 47. Choosing a significance level of α = 0.05, we check Table G.3 in Appendix G in Wooldridge (2009) for the appropriate critical value. Listed values are F2,40 = 3.23 and F2,60 = 3.15. While the former implies a true significance level smaller than 0.05, the latter implies one above 0.05. If one is interested in an exact critical value, one can obtain it from Excel as Finv(0.05;2;47) = 3.1951.

• Collecting the SSRs from the regression outputs for Model 4 and Model 2 at the beginning of the section, the F statistic can be computed as

\[ F = \frac{(151.855 - 99.523)/2}{99.523/47} = 12.3570. \]

Since

F = 12.3570 > 3.1951 = c,

reject H0 at a significance level of 5%.

• Check that the same decision holds for a significance level of 1%. The two variables common colonizer since 1945 (CEPII_COMCOL_REV) and common official language (CEPII_COMLANG_OFF) are jointly statistically significant at the 1% significance level and thus at least one of the two variables has an impact on imports at the 5% as well as at the 1% significance level.
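The F test of the trade example can be recomputed with the Python standard library. The sketch below relies on a closed form that holds only for q = 2 numerator degrees of freedom, namely P(F > x) = (1 + 2x/m)^(−m/2) for an F distribution with (2, m) degrees of freedom; for other values of q a statistics library would be needed. The function names are chosen here for illustration.

```python
def f_statistic(ssr_restricted, ssr_unrestricted, q, df_denom):
    """F statistic comparing the restricted (H0) and the unrestricted (H1) model."""
    return ((ssr_restricted - ssr_unrestricted) / q) / (ssr_unrestricted / df_denom)


def f_sf_q2(x, df_denom):
    """P(F > x) for (2, df_denom) degrees of freedom; closed form valid for q = 2 only."""
    return (1 + 2 * x / df_denom) ** (-df_denom / 2)


def f_crit_q2(alpha, df_denom):
    """Critical value c with P(F > c) = alpha for (2, df_denom) degrees of freedom."""
    return df_denom / 2 * (alpha ** (-2 / df_denom) - 1)


n, k, q = 52, 4, 2
df_denom = n - k - 1                       # 47
ssr_h0, ssr_h1 = 151.855, 99.523           # SSRs of Model 2 (restricted) and Model 4

F = f_statistic(ssr_h0, ssr_h1, q, df_denom)
c_5 = f_crit_q2(0.05, df_denom)
c_1 = f_crit_q2(0.01, df_denom)
p = f_sf_q2(F, df_denom)

print(f"F = {F:.4f}")
print(f"critical values: 5%: {c_5:.4f}, 1%: {c_1:.4f}")
print(f"p-value = {p:.5e}")
print(f"reject H0 at 5%: {F > c_5}, at 1%: {F > c_1}")
```

The small deviation of F from the EViews value 12.35704 comes from using the rounded SSRs quoted in the text.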



Calculation of p-values for F statistics:

• In empirical work one is frequently interested in the smallest significance level at which the null hypothesis can be rejected given the observed test statistic.

As explained in Section 4.1, this information is provided by the p-value. Equivalently, it is the largest significance level at which the null would just not be rejected.

Given the significance level that was chosen prior to any calculations,

the null hypothesis is rejected if the p-value is smaller than the given

significance level α.



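• Trade Example Continued: The p-value can be computed in Excel as =FVERT(K10;2;47) = 4.87099E-05 = 4.87099 · 10^−5, where the cell K10 presumably holds the F statistic (FVERT is the name of the FDIST function in German-language Excel). The p-value can also be calculated in EViews, see below.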

Thus, there is extremely strong statistical evidence against the null

hypothesis.

Direct Calculation of the F statistic in EViews:

• In the Equation window one clicks on View and then on

Representations where one can read how the parameter num-

bering relates to the regressor variables.

• One again clicks on View and then on Coefficient Tests and

further on Wald-Coefficient Restrictions .... Then one

enters all restrictions into the opened EViews window Wald Test.

For the test of the joint significance of the control variables in the trade example one types C(4)=0,C(5)=0 and confirms:

Wald Test:

Equation: EQ_LN_LN_COL_R_LANG_OFF

====================================================================

Test Statistic Value df Probability

====================================================================

F-statistic 12.35704 (2, 47) 0.0000

Chi-square 24.71407 2 0.0000

====================================================================

Null Hypothesis Summary:

====================================================================

Normalized Restriction (= 0) Value Std. Err.

====================================================================

C(4) 3.351230 0.678912

C(5) 2.109579 1.319296

====================================================================

Restrictions are linear in coefficients.

Note that the Chi-square statistic is a large-sample variant of the F statistic; here it equals q · F = 2 · 12.35704. It is not discussed further here.



Remarks:

• One can, of course, test the simple null hypothesis with a two-sided

alternative

H0 : βj = 0 versus H1 : βj ≠ 0

by means of an F test.

It holds that the square of a random variable X that follows a t

distribution with n − k − 1 degrees of freedom just corresponds to

a random variable that follows an F distribution with (1, n−k− 1)

degrees of freedom

X ∼ t_{n−k−1} ⟹ X² ∼ F_{1,n−k−1}.

Therefore, a two-sided t test and an F test lead to exactly the same

result for the pair of hypotheses above.
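As a numerical illustration, the sketch below compares the squared t statistic of the common official language coefficient in Model 4 with the F statistic for the corresponding single exclusion restriction. The estimate and standard error are taken from the EViews Wald output above, the SSRs of Models 3 and 4 from the table in Section 4.8; the variable names are illustrative.

```python
# Model 4: estimate and standard error of the common official language coefficient
coef, std_err = 2.109579, 1.319296
t_stat = coef / std_err

# F statistic for H0: beta_comlang_off = 0, i.e. Model 3 (restricted) vs. Model 4
ssr_restricted, ssr_unrestricted = 104.937, 99.523
q, resid_df = 1, 47  # one restriction; n - k - 1 = 52 - 4 - 1
f_stat = ((ssr_restricted - ssr_unrestricted) / q) / (ssr_unrestricted / resid_df)

print(f"t^2 = {t_stat ** 2:.4f}, F = {f_stat:.4f}")
```

Both numbers are close to 2.557; the small remaining gap comes from the rounded SSRs.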



• It may happen that each regressor tested by itself is not statisti-

cally significant but if they are jointly tested they are statistically

significant (at the same significance level). This is a sign of mul-

ticollinearity between the regressors considered. Then, the given

sample size is only sufficient for providing statistical significance

jointly for both regressors. However, it is not sufficient for providing

statistical evidence for each regressor separately. In such cases you

may check the covariance between the parameter estimates that are

included in the test (in EViews View → Covariance Matrix).

• It may also happen that one variable is statistically significant but if

jointly tested with other variables it becomes insignificant. This can

happen if the other variables that are included in the joint hypothesis

are redundant in the population regression. In this case, the power of

a single hypothesis test is weakened by the other irrelevant variables.



• Thus, there is no general rule on whether to prefer the results of joint or of single tests.

• Trade Example Continued (from the middle of this section):

Comparing four different model specifications using model selection

criteria, see Section 3.5, HQ and AIC favor Model 4. Inspecting its

parameter estimates at the beginning of this section, one finds two

parameters to be statistically insignificant even at the 10% level:

βdistance and βcommon off language.

Why, then, was this Model 4 found to be best by HQ and AIC?

Answer:

The parameter estimators for βdistance and βcom. off. lang. might

be highly correlated so that only a joint impact is significant. One

reason could be that a lot of variation of distance can be explained

by common off. language, among other things. In this case, it



can be expected that if both parameters are jointly tested by means

of an F test, they are statistically significant. Test the pair of

hypotheses:

H0 : βdistance = 0 and βcomlang off = 0

H1 : βdistance ≠ 0 and/or βcomlang off ≠ 0

The EViews output is:

Wald Test:

Equation: EQ_LN_LN_COL_LANG


F-statistic 4.749269 (2, 47) 0.0132

Chi-square 9.498538 2 0.0087

Thus, reject H0 at the 5% significance level. The previous conjecture

is statistically confirmed.



The effect of multicollinearity can nicely be seen in the confidence ellipse:

[Figure: confidence ellipse for C(3) (horizontal axis, from −2.5 to 1.0) and C(5) (vertical axis, from −2 to 6).]

Note that C(3) and C(5) correspond to βdistance and βcom. off. lang., respectively. The ellipse

is a generalization of confidence intervals to two dimensions. Thus, all points outside the ellipse

are joint null hypotheses that are rejected. Note that the origin also lies outside while zero

is included in each one-dimensional confidence interval. (Get the graph in EViews via View →

Coefficient Tests → Confidence Ellipse.) For more on confidence ellipses see, e.g., Davidson

& MacKinnon (2004).



• R2 version of the F statistic:

If a regression model contains a constant, then the decomposition

SSR = SST(1−R2) holds. Inserting each SSR into the F statistic

delivers

\[
F = \frac{(R^2_{H_1} - R^2_{H_0})/q}{(1 - R^2_{H_1})/(n - (k+1))} \sim F_{q,\,n-k-1}.
\]

Note:

– SST is canceled if the dependent variable y is the same under H0

and H1 as, for example, in case of exclusion restrictions. However,

this is not always true if general linear restrictions are tested.

– There can be slight differences between both versions of the F

statistic due to rounding errors.
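The size of such rounding differences can be seen in the trade example. The sketch below computes both versions of the F statistic for the joint exclusion of the two control variables (Model 2 under H0, Model 4 under H1), using the rounded SSR and R2 values from the table in Section 4.8. The variable names are illustrative.

```python
n, k, q = 52, 4, 2          # observations, regressors in Model 4, restrictions
resid_df = n - k - 1        # 47

# SSR version: Model 2 is the restricted model, Model 4 the unrestricted model
ssr_h0, ssr_h1 = 151.855, 99.523
f_ssr = ((ssr_h0 - ssr_h1) / q) / (ssr_h1 / resid_df)

# R2 version with the R2 values rounded to three digits
r2_h0, r2_h1 = 0.564, 0.714
f_r2 = ((r2_h1 - r2_h0) / q) / ((1.0 - r2_h1) / resid_df)

print(f"SSR version: {f_ssr:.3f}, R2 version: {f_r2:.3f}")
```

The SSR version reproduces the EViews value of about 12.357, while the R2 version gives about 12.325 because R2 is reported with only three digits.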



Overall F Test

Standard software packages (such as EViews) include in their OLS

output for the multiple regression model y = β0+β1x1+. . .+βkxk+u

the F statistic and its p-value for the pair of hypotheses:

“None of the (non-constant) regressors has impact on the dependent

variable and thus the corresponding parameters are all zero.”

H0 : β1 = · · · = βk = 0 (and y = β0 + u)

H1 : βj ≠ 0 for at least one j = 1, . . . , k.

If H0 is not rejected, this possibly indicates that

- all regressors are possibly badly/wrongly chosen,

- or at least a substantial number of regressors has no impact on y,

- or too many regressors were considered for given sample size n.

This test is a first rough check for the validity of the model.
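Since R2 equals zero under this H0, the R2 version of the F statistic reduces to F = (R2/k) / ((1 − R2)/(n − k − 1)). As a sketch, the snippet below applies this to the wage regression reported in Section 6.4 (R-squared 0.440769, n = 526, k = 6 regressors besides the constant); the function name is an illustrative choice.

```python
def overall_f(r_squared, n, k):
    """Overall F statistic for H0: beta_1 = ... = beta_k = 0 (model with constant)."""
    return (r_squared / k) / ((1.0 - r_squared) / (n - k - 1))


# Wage regression of Section 6.4: R-squared 0.440769, n = 526, k = 6
f_overall = overall_f(0.440769, n=526, k=6)
print(f"overall F = {f_overall:.4f}")
```

The result matches the F-statistic of about 68.18 printed in that EViews output.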



4.7.2 Testing of Several General Linear Restrictions

• Generalization of the F test for exclusion restrictions.

• Works equivalently by computing the relative change in the SSRs.

• R2 version cannot be used in this case!

Examples of possible pairs of hypotheses:

H0 : β2 = β3 = 1 versus H1 : β2 ≠ 1 and/or β3 ≠ 1,

H0 : β1 = 1, βj = 2βl versus H1 : β1 ≠ 1 and/or βj ≠ 2βl.

Trade Example Continued (from previous subsection):

• One may conjecture that due to the multicollinearity between the estimates for distance and common official language the impact of distance might be underestimated in absolute value while the impact of language is zero. Thus, consider the pair of hypotheses:

H0 : βdistance = −1 and βcomlang off = 0

H1 : βdistance ≠ −1 and/or βcomlang off ≠ 0

In order to compute the SSR under H0 impose these restrictions on

the regression as

log(imports) − (−1) log(distance) = β0 + βgdp log(gdp) + βcomcol cepii_comcol_rev + u



The EViews output is:

Dependent Variable: LOG_IMP+LOG(CEPII_DIST)



====================================================================


====================================================================

C -10.49513 3.196796 -3.283015 0.0019


CEPII_COMCOL_REV 3.215739 0.611625 5.257696 0.0000

====================================================================







====================================================================



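This allows one to compute the F statistic

\[
F = \frac{(SSR_{H_0} - SSR_{H_1})/q}{SSR_{H_1}/(n-k-1)} = \frac{(105.1709 - 99.5230)/2}{99.5230/47} = 1.333618 < c = 3.195,
\]

where c is the 5% critical value of the F distribution with (2, 47) degrees of freedom.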

Alternatively, one can conduct this general F test directly in EViews

via the Wald test option C(3)=-1,C(5)=0:

Wald Test:

Equation: EQ_LN_LN_COL_LANG

====================================================================


====================================================================

F-statistic 1.333603 (2, 47) 0.2733

Chi-square 2.667205 2 0.2635

→ The claim that a “common official language has no impact

while distance has an elasticity of −1” cannot be rejected at

any reasonable significance level since the p-value is about 27%.
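The hand calculation and the EViews p-value can be reproduced in a few lines. This is a sketch only: the function names are illustrative, and the p-value formula P(F > f) = (1 + 2f/d)^(−d/2) is a closed form that holds for q = 2 restrictions only (it is not taken from the slides).

```python
def f_statistic(ssr_restricted, ssr_unrestricted, q, resid_df):
    """F statistic based on the relative change in the sum of squared residuals."""
    return ((ssr_restricted - ssr_unrestricted) / q) / (ssr_unrestricted / resid_df)


def f_pvalue_q2(f_stat, denom_df):
    """p-value for an F(2, denom_df) statistic (closed form, q = 2 only)."""
    return (1.0 + 2.0 * f_stat / denom_df) ** (-denom_df / 2.0)


# SSR under H0 (restricted regression) and under H1 (Model 4), n - k - 1 = 47
f_stat = f_statistic(105.1709, 99.5230, q=2, resid_df=47)
p_value = f_pvalue_q2(f_stat, 47)
print(f"F = {f_stat:.4f}, p-value = {p_value:.4f}")
```

This gives F of about 1.3336 and a p-value of about 0.2733, in line with the Wald test output.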



4.8 Reporting Regression Results

In general, empirical researchers investigate a number of different spec-

ifications of regression functions.

In order to make visible how robust the conclusions are with respect

to model choice it is good practice to report the results of the most

important specifications so that each reader can evaluate the findings

in her own manner.

This is most easily achieved by summarizing the relevant results in a

table, see the example below.



For each specification a minimum number of results should be reported:

• OLS parameter estimates β̂j of the regression parameters βj, j = 0, 1, . . . , k (plus variable names),

• Standard error of β̂j, σ̂β̂j,

• Number of observations n,

• R2 and adjusted R2,

• Standard error of regression or estimated variance of the regression error σ̂2.

If possible, one should also report

• Model selection criteria such as AIC, HQ or SC,

• Sum of squared residuals (SSR).

Based on the SSRs one can easily compute F tests.



Trade Example Continued:

Dependent Variable: ln(imports to Kazakhstan) (standard errors in parentheses)

Independent Variables / Model        (1)        (2)        (3)        (4)
constant                           -3.461      4.801     -9.579    -13.027
                                   (3.280)    (3.497)    (4.274)    (4.726)
ln(gdp)                             0.770      1.089      1.357      1.318
                                   (0.130)    (0.137)    (0.129)    (0.129)
ln(distance)                         --       -1.971     -1.144     -0.625
                                              (0.481)    (0.441)    (0.542)
common "colonizer" since 1945        --         --        3.127      3.351
                                                         (0.675)    (0.679)
common official language             --         --         --        2.110
                                                                    (1.319)
Number of observations               52         52         52         52
R2                                  0.414      0.564      0.699      0.714
Standard error of regression        2.020      1.760      1.479      1.455
Sum of squared residuals          203.980    151.855    104.937     99.523
AIC                                4.2816     4.0249     3.6938     3.6793
HQ                                 4.3103     4.0681     3.7514     3.7513
SC                                 4.3566     4.1375     3.8439     3.8670



5 Multiple Regression Analysis: Asymptotics

The assumption of a normal (or Gaussian) distribution MLR.6 is frequently violated in empirical practice. How can we then proceed to calculate test statistics or confidence intervals?

5.1 Large Sample Distribution of the Mean Estimator

• Example: Testing the mean of hourly wages: the empirical distri-

bution is steep at the left and skewed to the right (as is typical for



prices and wages which are not generated additively).

[Figure: histogram of the series WAGE (sample 1 526, 526 observations).]

Mean 5.896103, Median 4.650000, Maximum 24.98000, Minimum 0.530000, Std. Dev. 3.693086, Skewness 2.007325, Kurtosis 7.970083



• Examples of random variables with right-skewed distribution:

– A χ2(m) distributed random variable X is defined as the sum of

m squared i.i.d. standard normal random variables

\[
X = \sum_{j=1}^{m} u_j^2, \qquad u_j \sim \text{i.i.d. } N(0,1).
\]



(Details on the χ2 distribution can be found in Appendix B in

Wooldridge (2009).)

[Figure: Density of the χ2(1) distribution.]

Moments of a χ2(1) distributed random variable:

\[
E[X] = E[u^2] = Var(u) + E[u]^2 = 1,
\]
\[
Var(X) = E[X^2] - E[X]^2 = E[u^4] - 1^2 = 2,
\]
\[
\frac{u^2 - 1}{\sqrt{2}} = \frac{X - 1}{\sqrt{2}} \sim (0, 1).
\]



Note that for a standard normal random variable we have E[u⁴] = 3.

– Linear functions of a χ2(1) distributed random variable, e.g.

\[
y_i = \nu + \sigma_y \frac{u_i^2 - 1}{\sqrt{2}}, \qquad u_i \sim \text{i.i.d. } N(0,1). \tag{5.1}
\]

Moments:

\[
E[y_i] = \nu,
\]
\[
Var(y_i) = Var\left(\sigma_y \frac{u_i^2 - 1}{\sqrt{2}}\right) = \sigma_y^2 \, Var\left(\frac{u_i^2}{\sqrt{2}}\right) = \sigma_y^2.
\]



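• Expectation and variance of the mean estimator

\[
\hat\mu_n = \frac{1}{n} \sum_{i=1}^{n} y_i:
\]
\[
E[\hat\mu_n] = \frac{1}{n} \sum_{i=1}^{n} E[y_i] = \nu,
\]
\[
Var(\hat\mu_n) = \frac{1}{n^2} \sum_{i=1}^{n} Var(y_i) = \frac{Var(y_i)}{n} = \frac{\sigma_y^2}{n},
\]
\[
sd(\hat\mu_n) = \frac{\sigma_y}{\sqrt{n}}.
\]

In this example the estimator is unbiased and its variance decreases at rate 1/n as the sample size n increases.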



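• Consistency of an estimator $\hat\theta_n$:

For every ε > 0 and δ > 0 there exists an N such that

\[
P\left(|\hat\theta_n - \theta| < \varepsilon\right) > 1 - \delta \quad \text{for all } n > N.
\]

Alternatively:

– $\lim_{n\to\infty} P\left(|\hat\theta_n - \theta| < \varepsilon\right) = 1$,

– $\mathrm{plim}\, \hat\theta_n = \theta$,

– $\hat\theta_n \xrightarrow{p} \theta$.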

The “plim” notation stands for probability limit. This concept

of convergence is usually denoted as convergence in probability or

(weak) consistency. Some notes on calculation rules for the “plim”

are given in Appendix C.3 in Wooldridge (2009).



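An estimator $\hat\theta_n$ is consistent if it has the properties

– $\lim_{n\to\infty} E[\hat\theta_n] = \theta$ and

– $\lim_{n\to\infty} Var(\hat\theta_n) = 0$.

These two conditions are sufficient but not necessary for consistency. An estimator that does not converge in probability to θ is called inconsistent. In general: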

• Weak law of large numbers (WLLN):

For yi ∼ i.i.d. with −∞ < E[yi] = µ < ∞, the mean estimator $\hat\mu_n = \frac{1}{n}\sum_{i=1}^{n} y_i$ is weakly consistent, that is

\[
\hat\mu_n \xrightarrow{p} \mu.
\]

• Then we can consistently estimate the variance of i.i.d. random variables $w_i \sim \text{i.i.d.}(\mu_w, \sigma_w^2)$ with $\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(w_i - \mu_w)^2$. Why?



• But how can we derive the asymptotic probability distri-

bution of the mean estimator µn?

• Monte Carlo Simulation (MC):

The EViews program mcarlo1 est mu.prg allows us to iteratively draw R = 1000 samples of size n with elements $y_1, \dots, y_n$, where $y_i \sim \text{i.i.d.}(\nu, \sigma_y^2)$ with ν = 3 and $\sigma_y^2 = 1$ and $y_i$ is generated from (5.1). One frequently calls (5.1) the data generating process (DGP). For every sample $y_1^r, y_2^r, \dots, y_n^r$ generated in this way, where r = 1, . . . , 1000, the mean estimator $\hat\mu^r = \frac{1}{n}\sum_{i=1}^{n} y_i^r$ is calculated and stored. After all iterations, a histogram is calculated based on the R estimates $\hat\mu^1, \hat\mu^2, \dots, \hat\mu^R$.



First, the results for the simulated moments:

Sample size n   Average of estimated means   Std. deviation (MC)   True std. deviation (DGP)
10              2.993908                     0.298423              0.3162278
30              3.003103                     0.182773              0.1825742
50              3.001263                     0.142403              0.1414214
100             3.001637                     0.098750              0.1
500             2.999619                     0.046355              0.04472136
1000            3.000503                     0.031865              0.03162278

– The true moments are accurately estimated,

– and we can observe how the LLN works.
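The simulation design can be sketched outside EViews as well. The following Python code is an illustrative re-implementation of the design described above, not the course program mcarlo1 est mu.prg; the function names and the seed are arbitrary assumptions, so the simulated numbers will differ slightly from the table.

```python
import math
import random
import statistics


def draw_sample(n, rng, nu=3.0, sigma_y=1.0):
    """Draw y_1, ..., y_n from the DGP (5.1): y = nu + sigma_y * (u^2 - 1) / sqrt(2)."""
    return [nu + sigma_y * (rng.gauss(0.0, 1.0) ** 2 - 1.0) / math.sqrt(2.0)
            for _ in range(n)]


def simulate_means(n, replications=1000, seed=2009):
    """Return the mean estimates of `replications` independent samples of size n."""
    rng = random.Random(seed)
    return [statistics.fmean(draw_sample(n, rng)) for _ in range(replications)]


for n in (10, 30, 100):
    means = simulate_means(n)
    # columns: n, average of the estimated means, their std. deviation, true value
    print(n, round(statistics.fmean(means), 4),
          round(statistics.stdev(means), 4), round(1.0 / math.sqrt(n), 4))
```

The average of the estimated means should be close to ν = 3, and their standard deviation should be close to 1/√n for every n.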



[Figure: histograms of the R = 1000 estimated means (series MU_HAT) in six panels, for n = 10, 30, 50, 100, 500, and 1000.]

Descriptive statistics of MU_HAT shown next to the histograms (1000 observations each):

n      Mean       Median     Maximum    Minimum    Std. Dev.  Skewness   Kurtosis
10     2.993908   2.966680   4.061612   2.394145   0.298423   0.621673   3.250171
30     3.003103   2.991643   3.795529   2.562306   0.182773   0.572018   3.450607
50     3.001263   2.992116   3.650012   2.595097   0.142403   0.509651   3.745789
100    3.001637   2.994952   3.434677   2.721445   0.098750   0.383232   3.350859
500    2.999619   2.998812   3.201741   2.861536   0.046355   0.198532   3.293122
1000   3.000503   2.999499   3.108631   2.905122   0.031865   0.170954   2.950809


• Results for simulated distributions:

– Right-skewness decreases with increase in sample size n.

– A test for normality (Jarque-Bera-Test): null hypothesis of normal

distribution cannot be rejected for large n.

Theoretical explanation of this phenomenon: a central limit theorem, one of the most important tools in statistics, holds under certain (rather weak) conditions.

• Central limit theorem (CLT):

For $y_i \sim \text{i.i.d.}(\mu, \sigma^2)$ with $0 < \sigma^2 < \infty$, $\hat\mu_n = \frac{1}{n}\sum_{i=1}^{n} y_i$ is asymptotically normally distributed:

\[
\sqrt{n}\,(\hat\mu_n - \mu) \xrightarrow{d} N(0, \sigma^2).
\]



– Interpretation: the larger the number of sample elements n, the

more precise is the approximation of the exact distribution of µn

(see the MC results) by an exactly specified normal distribution.

Hence the label large sample distribution.

– But how good is the asymptotic approximation for a given sample

size n?

∗ The CLT is not informative on this question, though we may

get an answer by conducting MC simulations for certain cases

or by using rather involved finite sample statistics.

∗ Experience: as the distribution of the yi approaches the nor-

mal distribution, smaller and smaller n suffice for a very good

approximation. In some cases even n = 30 is enough.

Do some experiments using mcarlo1 est mu.prg, and gradually increase the degrees of freedom r of the χ2 distribution (and observe that the skewness decreases in this process)!

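– Alternative notations (Φ(z) is the cumulative distribution function of the standard normal distribution):

\[
\sqrt{n}\left(\frac{\hat\mu_n - \mu}{\sigma}\right) \xrightarrow{d} N(0, 1) \tag{5.2}
\]
\[
P\left(\sqrt{n}\left(\frac{\hat\mu_n - \mu}{\sigma}\right) \le z\right) \longrightarrow \Phi(z) \tag{5.3}
\]
\[
\frac{\hat\mu_n - \mu}{\sigma/\sqrt{n}} \overset{approx}{\sim} N(0, 1) \tag{5.4}
\]
\[
\hat\mu_n \overset{approx}{\sim} N\left(\mu, \frac{\sigma^2}{n}\right) \tag{5.5}
\]

Notation: the mean estimator is asymptotically normally distributed.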



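• In large samples the standardized mean estimator is approximated by a standard normal distribution. Then, due to (5.4), with known standard deviation $\sigma_{\hat\mu}$ of the mean estimator,

\[
\begin{array}{lll}
w_i \sim \text{i.i.d. } N(\mu, \sigma^2) & t(w_1, \dots, w_n) = \dfrac{\hat\mu - \mu}{\sigma_{\hat\mu}} & \sim N(0, 1) \\[2ex]
w_i \sim \text{i.i.d. } (\mu, \sigma^2) & t(w_1, \dots, w_n) = \dfrac{\hat\mu - \mu}{\sigma_{\hat\mu}} & \overset{approx}{\sim} N(0, 1)
\end{array}
\]

and it can be shown that, with estimated standard deviation $\hat\sigma_{\hat\mu}$,

\[
\begin{array}{lll}
w_i \sim \text{i.i.d. } N(\mu, \sigma^2) & t(w_1, \dots, w_n) = \dfrac{\hat\mu - \mu}{\hat\sigma_{\hat\mu}} & \sim t_{n-k-1} \text{ (here } k = 0) \\[2ex]
w_i \sim \text{i.i.d. } (\mu, \sigma^2) & t(w_1, \dots, w_n) = \dfrac{\hat\mu - \mu}{\hat\sigma_{\hat\mu}} & \overset{approx}{\sim} N(0, 1)
\end{array}
\]

and we get the following (very convenient) result: the (small sample) theory of t tests and confidence intervals for the mean estimator of i.i.d. variables holds approximately in large (enough) samples.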



• Hence the test results in our empirical exercise are still approximately

valid!

• How about this concept of validity in a regression context?



5.2 Large Sample Inference for the OLS Estimator

• The OLS estimator

\[
\hat\beta = \beta + (X'X)^{-1}X'u = \beta + Wu
\]

depends on X or W. Hence, for the OLS estimator to be consistent and asymptotically normal, certain conditions must hold for the regressor variables as n → ∞. One of these conditions is that for all j, l = 0, 1, . . . , k we have $\mathrm{plim}\, \frac{1}{n}\sum_{i=1}^{n} x_{ij}x_{il} = E[x_j x_l] = a_{jl}$ or

\[
\frac{1}{n} X'X \xrightarrow{p} A. \tag{5.6}
\]



• Asymptotic normality of the OLS estimator

All necessary conditions for asymptotic normality are fulfilled if the

standard assumptions MLR.1-MLR.5 hold true. Then (see a sketch

of proof in Appendix E.4 in Wooldridge (2009)):

\[
\sqrt{n}\,(\hat\beta - \beta) \xrightarrow{d} N\left(0, \sigma^2 A^{-1}\right). \tag{5.7}
\]



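For the (asymptotic) distributions of the t statistics we get:

\[
\begin{array}{lll}
\text{MLR.1-MLR.6} & t(X, y) = \dfrac{\hat\beta_j - \beta_j}{\sigma_{\hat\beta_j}} & \sim N(0, 1) \\[2ex]
\text{MLR.1-MLR.5} & t(X, y) = \dfrac{\hat\beta_j - \beta_j}{\sigma_{\hat\beta_j}} & \overset{approx}{\sim} N(0, 1)
\end{array}
\]

and it can be shown that

\[
\begin{array}{lll}
\text{MLR.1-MLR.6} & t(X, y) = \dfrac{\hat\beta_j - \beta_j}{\hat\sigma/\sqrt{SST_j(1 - R_j^2)}} & \sim t_{n-k-1} \\[2ex]
\text{MLR.1-MLR.5} & t(X, y) = \dfrac{\hat\beta_j - \beta_j}{\hat\sigma/\sqrt{SST_j(1 - R_j^2)}} & \overset{approx}{\sim} N(0, 1)
\end{array}
\]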



A frequent observation from many Monte Carlo simulations and

empirical practice is that

– for small n one proceeds as in the case of normally distributed errors and uses the critical values of the t distribution:

\[
\frac{\hat\beta_j - \beta_j}{\hat\sigma/\sqrt{SST_j(1 - R_j^2)}} \overset{approx}{\sim} t_{n-k-1}
\]

– and analogously for the F statistic the critical values are deter-

mined from the F distribution.

– Note again: the critical values are valid only approximately, not

exactly. Analogously, the p-values (calculated in EViews) are valid

only approximately!



• Conclusion:

– For the calculation of test statistics and confidence intervals (ex-

ception: forecast intervals) we proceed as hitherto. However, all

statistical results hold only as an approximation.

– If the assumption of homoskedasticity is violated, even the asymp-

totic results do not hold and models for heteroskedastic errors are

required (with stronger assumptions for LLN and CLT), see Chap-

ter 8.

Reading: Chapter 5 and Appendix C.3 in Wooldridge (2009).


6 Multiple Regression Analysis: Interpretation

6.1 Level and Log Models

Recall section 2.6 on level-level, level-log, log-level, log-log models. All

the results remain valid in the multiple regression model in a ceteris-

paribus analysis.



6.2 Data Scaling

• Scaling the dependent variable:

– Initial model:

y = Xβ + u.

– Variable transformation: $y_i^* = a \cdot y_i$ with scale factor a.

→ New, transformed regression equation:

\[
\underbrace{ay}_{y^*} = X \underbrace{a\beta}_{\beta^*} + \underbrace{au}_{u^*}, \qquad y^* = X\beta^* + u^* \tag{6.1}
\]



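– OLS estimator for β∗ in (6.1):

\[
\hat\beta^* = (X'X)^{-1}X'y^* = a\,(X'X)^{-1}X'y = a\hat\beta.
\]

– Error variance:

\[
Var(u^*) = Var(au) = a^2\, Var(u) = a^2\sigma^2 I.
\]

– Variance-covariance matrix:

\[
Var(\hat\beta^*) = \sigma^{*2}(X'X)^{-1} = a^2\sigma^2 (X'X)^{-1} = a^2\, Var(\hat\beta).
\]

– t statistic:

\[
t^* = \frac{\hat\beta_j^* - 0}{\hat\sigma_{\hat\beta_j^*}} = \frac{a\hat\beta_j}{a\hat\sigma_{\hat\beta_j}} = t.
\]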



• Scaling explanatory variables:

– Variable transformation: X∗ = X · a. New regression equation:

\[
y = Xa \cdot a^{-1}\beta + u = X^*\beta^* + u. \tag{6.2}
\]

– OLS estimator for β∗ in (6.2):

\[
\hat\beta^* = (X^{*\prime}X^*)^{-1}X^{*\prime}y = (a^2 X'X)^{-1}X'ay = a^{-2}a\,(X'X)^{-1}X'y = a^{-1}\hat\beta.
\]

– Result: The sole magnitude of βj is no indicator for the relevance

of the impact of the j-th regressor. One always has to take the

scale of the variable into account.

– Example: In Section 2.3 a simple level-level model was estimated for imports on gdp. The parameter estimate β̂gdp = 2.16 · 10^(−5) appears very small. However, taking into account that gdp is measured in dollars, this estimate is not small. Simply rescale gdp to millions of dollars with a = 10^(−6) and you obtain β̂∗gdp = 10^6 · 2.16 · 10^(−5) = 21.6.

• Scaling of variables in logarithmic form

just alters the constant β0 since ln y∗ = ln ay = ln a + ln y.
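The scaling results can be checked numerically. The sketch below uses a small set of made-up numbers (hypothetical data, not the trade data) and a hand-written simple regression; `simple_ols` is an illustrative helper, not an EViews or library function.

```python
import math


def simple_ols(x, y):
    """Return (slope, standard error, t statistic) of a simple regression of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    ssr = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    std_err = math.sqrt(ssr / (n - 2) / sxx)
    return slope, std_err, slope / std_err


# Hypothetical data: x measured in dollars, y in arbitrary units
x = [1.0e6, 2.0e6, 3.0e6, 4.0e6, 5.0e6, 6.0e6]
y = [30.0, 48.0, 75.0, 88.0, 115.0, 121.0]

slope, se, t = simple_ols(x, y)

a = 1.0e-6  # rescale the regressor from dollars to millions of dollars
slope_x, se_x, t_x = simple_ols([a * xi for xi in x], y)

c = 100.0   # rescale the dependent variable
slope_y, se_y, t_y = simple_ols(x, [c * yi for yi in y])

print(slope, slope_x, slope_y)
print(t, t_x, t_y)
```

Rescaling the regressor by a multiplies the slope by 1/a, rescaling the dependent variable by c multiplies it by c, and the t statistic is the same in all three regressions.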

• Standardized Coefficients:

We just saw that it is not possible to deduce the relevance of ex-

planatory variables from the magnitude of the corresponding coef-

ficient. This is possible, however, if the regression is suitably stan-

dardized.



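Derivation: First, consider the following sample regression model

\[
y_i = \hat\beta_0 + x_{i1}\hat\beta_1 + \dots + x_{ik}\hat\beta_k + \hat u_i, \tag{6.3}
\]

and its representation after taking means over all n observations

\[
\bar y = \hat\beta_0 + \bar x_1\hat\beta_1 + \dots + \bar x_k\hat\beta_k. \tag{6.4}
\]

Then we subtract (6.4) from (6.3):

\[
(y_i - \bar y) = (x_{i1} - \bar x_1)\hat\beta_1 + \dots + (x_{ik} - \bar x_k)\hat\beta_k + \hat u_i. \tag{6.5}
\]

Finally, we divide equation (6.5) by the estimated standard deviation of y, say $\hat\sigma_y$, and expand every term on the right-hand side by the estimated standard deviation of the corresponding explanatory variable, say $\hat\sigma_{x_j}$, j = 1, . . . , k,

\[
\frac{y_i - \bar y}{\hat\sigma_y} = \frac{x_{i1} - \bar x_1}{\hat\sigma_y} \cdot \frac{\hat\sigma_{x_1}}{\hat\sigma_{x_1}} \hat\beta_1 + \dots + \frac{x_{ik} - \bar x_k}{\hat\sigma_y} \cdot \frac{\hat\sigma_{x_k}}{\hat\sigma_{x_k}} \hat\beta_k + \frac{\hat u_i}{\hat\sigma_y}.
\]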



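Simple algebra gives

\[
\underbrace{\frac{y_i - \bar y}{\hat\sigma_y}}_{z_{i,y}} = \underbrace{\frac{x_{i1} - \bar x_1}{\hat\sigma_{x_1}}}_{z_{i,x_1}} \underbrace{\frac{\hat\sigma_{x_1}}{\hat\sigma_y}\hat\beta_1}_{b_1} + \dots + \underbrace{\frac{x_{ik} - \bar x_k}{\hat\sigma_{x_k}}}_{z_{i,x_k}} \underbrace{\frac{\hat\sigma_{x_k}}{\hat\sigma_y}\hat\beta_k}_{b_k} + \underbrace{\frac{\hat u_i}{\hat\sigma_y}}_{\xi_i}.
\]

In the literature the transformed variables $z_{i,y}$ and $z_{i,x_1}, \dots, z_{i,x_k}$ are usually denoted as z-scores.

In compact notation we get

\[
z_{i,y} = z_{i,x_1} b_1 + \dots + z_{i,x_k} b_k + \xi_i,
\]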

where bj are denoted as standardized coefficients (or simply

beta coefficients).

The magnitudes of the standardized coefficients can be compared to each other. Hence, the explanatory variable with the largest standardized coefficient bj in absolute value has the relatively largest impact on the dependent variable.



Interpretation: a one standard deviation increase in xj changes y

by bj standard deviations.

Standardized coefficients can be calculated in SPSS (see Example

6.1 in Wooldridge (2009)).
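Standardized coefficients are also easy to compute by hand from the formula bj = β̂j · σ̂xj / σ̂y. The sketch below does this for a simple regression on made-up numbers (hypothetical data); all function names are illustrative. In the simple regression case the standardized slope coincides with the sample correlation between x and y, which serves as a check.

```python
import math


def sample_sd(values):
    """Sample standard deviation with divisor n - 1."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))


def ols_slope(x, y):
    """OLS slope of a simple regression of y on x (with constant)."""
    x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sum((xi - x_bar) ** 2 for xi in x)


def standardized_slope(x, y):
    """Standardized (beta) coefficient b = slope * sd(x) / sd(y)."""
    return ols_slope(x, y) * sample_sd(x) / sample_sd(y)


def correlation(x, y):
    """Sample correlation coefficient between x and y."""
    x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / ((len(x) - 1) * sample_sd(x) * sample_sd(y))


# Hypothetical data
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [30.0, 48.0, 75.0, 88.0, 115.0, 121.0]

b = standardized_slope(x, y)
b_rescaled = standardized_slope([1000.0 * xi for xi in x], y)
print(b, b_rescaled, correlation(x, y))
```

Unlike the raw slope, the standardized coefficient does not change when the regressor is rescaled.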



6.3 Dealing with Nonlinear or Transformed Regressors

• Further details on logarithmic variables:

Consider the following log-level regression model

ln y = β0 + β1x1 + β2x2 + u, (6.6)

where x2 is a dummy variable (it is either equal to 0 or 1).

– How can we determine the exact impact of x2, that is, how

should we interpret β2? From (6.6) follows

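\[
y = e^{\ln y} = e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + u} = e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2} \cdot e^{u}
\]

and for the conditional expectation

\[
E[y|x_1, x_2] = e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2} \cdot E[e^{u}|x_1, x_2]. \tag{6.7}
\]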



This implies

\[
\%\Delta E[y|x_1, x_2] = 100\left(e^{\beta_2} - 1\right).
\]

– In the general case of k regressors:

\[
\%\Delta E[y|x_1, x_2, \dots, x_k] = 100\left(e^{\beta_j \Delta x_j} - 1\right). \tag{6.8}
\]

Obviously (6.8) represents the exact partial effect, whereas

the interpretation as an approximate semi-elasticity may be rather

crude in some cases.

– Trade Example Continued (from Section 4.8 and specifically

from Section 4.4):

For Model 3 we obtained the sample regression

LOG(TRADE_0_D_O) = -9.5789 + 1.3566*LOG(WDI_GDPUSDCR_O)

- 1.1443*LOG(CEPII_DIST) + 3.1265*CEPII_COMCOL_REV + RESIDUAL

Recall that CEPII_COMCOL_REV denotes a dummy variable.



∗ The approximate interpretation of β̂comcol is that a one-unit change (the dummy switching from 0 to 1) changes imports by 100 β̂comcol = 313%.

∗ The exact partial effect is 100(e^(β̂comcol) − 1) = 2179.5%, almost 7 times as large!

∗ Of course, the difference between the approximate and the exact effect is smaller if β is closer to zero.
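The comparison between the approximate and the exact effect in (6.8) can be reproduced with a few lines of code. This is a sketch with illustrative function names; the coefficient 3.1265 is the rounded Model 3 estimate, and 0.05 is an arbitrary small value added for contrast.

```python
import math


def approx_percent_effect(beta, dx=1.0):
    """Approximate percentage effect in a log-level model: 100 * beta * dx."""
    return 100.0 * beta * dx


def exact_percent_effect(beta, dx=1.0):
    """Exact percentage effect from (6.8): 100 * (exp(beta * dx) - 1)."""
    return 100.0 * (math.exp(beta * dx) - 1.0)


for beta in (3.1265, 0.05):
    print(f"beta = {beta}: approximate {approx_percent_effect(beta):.2f}%, "
          f"exact {exact_percent_effect(beta):.2f}%")
```

For the large coefficient the two numbers differ by a factor of almost 7, while for β = 0.05 they are about 5.00% and 5.13%.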



• Models with quadratic regressors:

– For example, consider the multiple regression

\[
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_2^2 + u.
\]

The marginal effect of a change in x2 on the conditional expectation of y is equal to

\[
\frac{\partial E[y|x_1, x_2]}{\partial x_2} = \beta_2 + 2\beta_3 x_2.
\]

Therefore a small change of ∆x2 in x2 changes ceteris paribus the dependent variable y on average by approximately

\[
(\beta_2 + 2\beta_3 x_2)\,\Delta x_2.
\]

Clearly, this effect depends on the level of x2 (and an interpreta-

tion of β2 alone does not make any sense!).



– In some empirical applications regressor variables are considered

using quadratics and logarithms, in order to approximate a non-

linear regression function.

Example: we can approximate non-constant elasticities using the

model

ln y = β0 + β1x1 + β2 ln x2 + β3(ln x2)² + u.

Then the elasticity of y with respect to x2 equals

β2 + 2β3 ln x2

and is constant if and only if β3 = 0.



– Trade Example Continued:

So far we only considered multiple regression models that are

log-log or log-level in the original variables.

Now consider a further specification for modeling imports where a log regressor also enters squared.

Model 5:

ln(imports) = β0 + β1 ln(gdp) + β2 (ln(gdp))² + β3 ln(distance)

+ β4 com colonizer + β5 com language + u.

Using the previous result, the elasticity of imports with respect

to gdp is

β1 + 2β2 ln(gdp). (6.9)



Estimation of Model 5 delivers:

====================================================================

Dependent Variable: LOG(TRADE_0_D_O)



====================================================================


====================================================================

C -85.82647 27.89947 -3.076276 0.0035


(LOG(WDI_GDPUSDCR_O))^2 -0.111982 0.042366 -2.643219 0.0112

LOG(CEPII_DIST) -0.761431 0.513375 -1.483187 0.1448

CEPII_COMCOL_REV 3.911289 0.673603 5.806523 0.0000


====================================================================








====================================================================



Comparing the AIC, HQ, and SC of Model 5 with those of Models

1 to 4, see Section 4.4, one finds that Model 5 exhibits the lowest

values throughout. In addition, the (approximate) p-value of β2

is 0.011 and the quadratic term is statistically significant at the

5% significance level.

This provides also evidence for a nonlinear elasticity. Inserting

the parameter estimates into (6.9) delivers

η(gdp) = 7.088 − 0.224 ln(gdp).

One may plot the elasticity η(gdp) versus gdp for each observed value of gdp. In EViews this can be done by a little program

' Model 5
equation model5.ls log(trade_0_d_o) c log(wdi_gdpusdcr_o) (log(wdi_gdpusdcr_o))^2 log(cepii_dist) cepii_comcol_rev cepii_comlang_off
' generate elasticities for gdp
genr elast_gdp = model5.@coefs(2) + 2*model5.@coefs(3)*log(wdi_gdpusdcr_o)
group group_elast_gdp (wdi_gdpusdcr_o) elast_gdp ' create group
group_elast_gdp.scat ' make scatter plot



[Figure: scatter plot of ELAST_GDP (vertical axis, from 0.0 to 2.4) against WDI_GDPUSDCR_O (horizontal axis, from 0.0E+00 to 1.2E+13).]

The import elasticity with respect to gdp is much larger for small

economies in terms of gdp than for large economies.
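To get a feeling for the magnitudes, the estimated elasticity function η(gdp) = 7.088 − 0.224 ln(gdp) can be evaluated at a few gdp levels. The sketch below uses three hypothetical gdp values within the range of the plot (they are not observations from the data set); the function name is illustrative.

```python
import math


def gdp_elasticity(gdp, b1=7.088, two_b2=-0.224):
    """Estimated import elasticity w.r.t. gdp in Model 5: b1 + 2 * b2 * ln(gdp)."""
    return b1 + two_b2 * math.log(gdp)


gdp_levels = [1.0e9, 1.0e11, 1.0e13]  # hypothetical gdp levels in US dollars
elasticities = [gdp_elasticity(g) for g in gdp_levels]

for g, eta in zip(gdp_levels, elasticities):
    print(f"gdp = {g:.1e}: elasticity = {eta:.3f}")
```

The elasticity falls from roughly 2.4 for the smallest to roughly 0.4 for the largest of these gdp levels.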

Warning: Nonlinearities are sometimes due to missing variables.

Can you think of any control variables left out that should be

included in Model 5?



• Interactions:

Example:

y = β0 + β1x1 + β2x2 + β3x2x1 + u.

The marginal effect of a change in x2 is given by

∆E[y|x1, x2] = (β2 + β3x1)∆x2.

Hence, in this case the marginal effect also depends on the level of

x1!

• Selection between non-nested models of the same dependent

variable:

Definition: Non-nested means that neither model can be represented as a special case of the other model.

Example: the choice between models y = β0 + β1x1 + β2x1² + u and y = β0 + β1 ln x1 + u can be based on SC or AIC (or R2). Why?



6.4 Regressors with Qualitative Data

Dummy variables or binary variables

A binary variable can take exactly two different values and allows one to describe two qualitatively different states.

Examples: female vs. male, employed vs. unemployed, etc.

• In general these values are coded as 0 and 1. This allows for a very

easy and straightforward interpretation. Example:

y = β0 + β1x1 + β2x2 + · · · + βk−1xk−1 + δD + u,

where D equals 0 or 1.



• Interpretation (well known by now):

E[y|x1, . . . , xk−1, D = 1] − E[y|x1, . . . , xk−1, D = 0] =

β0 + β1x1 + β2x2 + · · · + βk−1xk−1 + δ

− (β0 + β1x1 + β2x2 + · · · + βk−1xk−1) = δ

The coefficient of a dummy variable is equal to an intercept shift of

size δ in the case D = 1. All slope parameters βi, i = 1, . . . , k− 1

remain unchanged.

• Wage Example Continued:

– Question of interest: Do females earn significantly less than males?

– Data: a sample of n = 526 U.S. workers obtained in 1976.

(Source: Examples 2.4, 7.1 in Wooldridge (2009)).



∗ wage in dollars per hour,

∗ educ: years of schooling of each worker,

∗ exper: years of professional experience,

∗ tenure: years of employment in current firm,

∗ female: dummy=1 if female, dummy=0 otherwise.

====================================================================

Dependent Variable: ln(WAGE), Method: Least Squares, Sample: 1 526

====================================================================


====================================================================

C 0.416691 0.098928 4.212066 0.0000

FEMALE -0.296511 0.035805 -8.281169 0.0000

EDUC 0.080197 0.006757 11.86823 0.0000

EXPER 0.029432 0.004975 5.915866 0.0000

EXPER^2 -0.000583 0.000107 -5.430528 0.0000

TENURE 0.031714 0.006845 4.633035 0.0000

TENURE^2 -0.000585 0.000235 -2.493365 0.0130

====================================================================

R-squared 0.440769, Adjusted R-squared 0.434304, Akaike info crit. 1.017438

Mean dependent var 1.623268, S.D. dependent var 0.531538, Schwarz criterion 1.074200

S.E. of regression 0.399785, Sum squared resid 82.95065

F-statistic 68.17659, Prob(F-statistic) 0.000000



– Note: In order to be able to interpret the coefficients of dummy

variables one has to know the reference group. The reference

group is given by the group for which the dummy equals zero.

– Prediction: How much does a woman with 12 years of schooling, 10 years of experience, and 1 year of tenure earn? (Of course, you can insert any other numbers here.)

E[ln(wage)|female = 1, educ = 12, exper = 10, tenure = 1]

= 0.4167 − 0.2965 · 1 + 0.0802 · 12 + 0.0294 · 10

− 0.0006 · (10²) + 0.0317 · 1 − 0.0006 · (1²)

= 1.35

Thus, the expected hourly wage is approximately exp(1.35) = 3.86 US dollars.

– We already know that in case of a log-level model the expected



value of y given the regressors x1, x2 is given by

\[
E[y|x_1, x_2] = e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2} \cdot E[e^{u}|x_1, x_2].
\]

The true value of $E[e^{u}|x_1, x_2]$ depends on the probability distribution of u.

It holds that: If u is normally distributed with variance σ², then

\[
E[e^{u}|x_1, x_2] = e^{E[u|x_1, x_2] + \sigma^2/2}.
\]

The precise prediction is therefore

\[
E[y|x_1, x_2] = e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + E[u|x_1, x_2] + \sigma^2/2}.
\]



The exact prediction of the desired hourly wage is

E[wage|female = 1, educ = 12, exper = 10, tenure = 1]

= exp(0.4167 − 0.2965 · 1 + 0.0802 · 12 + 0.02943 · 10

− 0.0006 · (10²) + 0.0317 · 1 − 0.0006 · (1²) + 0.3998²/2)

= 4.18.

Thus, the precise value of the mean hourly wage for the specified person is about $4.18 and thus about 30 cents larger than the approximate value.
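Both predictions can be recomputed from the EViews output of the wage regression. The sketch below uses the unrounded coefficient estimates and the standard error of regression from that output; the dictionary keys and the function name are illustrative, and the correction term σ̂²/2 presumes normally distributed errors as stated above.

```python
import math

# Coefficient estimates and S.E. of regression from the wage regression output
coefs = {
    "const": 0.416691, "female": -0.296511, "educ": 0.080197,
    "exper": 0.029432, "exper_sq": -0.000583,
    "tenure": 0.031714, "tenure_sq": -0.000585,
}
se_regression = 0.399785


def predict_log_wage(female, educ, exper, tenure):
    """Predicted ln(wage) for the given regressor values."""
    return (coefs["const"] + coefs["female"] * female + coefs["educ"] * educ
            + coefs["exper"] * exper + coefs["exper_sq"] * exper ** 2
            + coefs["tenure"] * tenure + coefs["tenure_sq"] * tenure ** 2)


log_wage = predict_log_wage(female=1, educ=12, exper=10, tenure=1)
naive_wage = math.exp(log_wage)                                 # ignores E[exp(u)]
corrected_wage = math.exp(log_wage + se_regression ** 2 / 2.0)  # normal errors

print(f"ln(wage) = {log_wage:.4f}, naive = {naive_wage:.2f}, "
      f"corrected = {corrected_wage:.2f}")
```

This yields a predicted log wage of about 1.35, a naive prediction of about 3.86 and a corrected prediction of about 4.18 US dollars per hour.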

– The parameter δ corresponds to the difference between the log

income of female and male workers keeping everything else con-

stant (e.g. years of schooling, experience, etc.).

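Question: How large is the exact wage difference?

Answer: 100(e^(−0.2965) − 1) = −25.66%, i.e. women earn about 25.7% less than comparable men. (Conversely, 100(e^(0.2965) − 1) = 34.51%: men earn about 34.5% more than comparable women.)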



Note that ceteris paribus analysis is much more informative than

the comparison of the unconditional means of male and female

wages. Assuming normal errors one has

\[
\frac{E[wage_f] - E[wage_m]}{E[wage_m]} = \frac{e^{E[\ln(wage_f)] + \sigma_f^2/2} - e^{E[\ln(wage_m)] + \sigma_m^2/2}}{e^{E[\ln(wage_m)] + \sigma_m^2/2}}.
\]

Inserting estimates one obtains

\[
\frac{e^{1.416 + 0.44^2/2} - e^{1.814 + 0.53^2/2}}{e^{1.814 + 0.53^2/2}} = -0.3570,
\]

which, by the way, is very similar to inserting estimates for (E[wagef] − E[wagem])/E[wagem] leading to −0.3538.

Females earn 36% less than males if one does not control for

other effects.



Several subgroups

• Example: A worker is female or male and married or unmarried

=⇒ 4 subgroups:

1. female and not married

2. female and married

3. male and not married

4. male and married

How to proceed:

– Choose one subgroup to be the reference group, for example:

female and not married

– Define dummy variables for the other subgroups. For example, in

EViews with the Command “generate series” (genr):



∗ FEMMARR = FEMALE*MARRIED

∗ MALESING = (1-FEMALE) * (1-MARRIED)

∗ MALEMARR = (1-FEMALE) * MARRIED

Dependent Variable: ln(WAGE), Method: Least Squares, Sample: 1 526
====================================================================
Variable      Coefficient   Std. Error   t-Statistic   Prob.
====================================================================
C                0.211028     0.096644      2.183548   0.0294
FEMMARR         -0.087917     0.052348     -1.679475   0.0937
MALESING         0.110350     0.055742      1.979658   0.0483
MALEMARR         0.323026     0.050114      6.445759   0.0000
EDUC             0.078910     0.006694     11.78733    0.0000
EXPER            0.026801     0.005243      5.111835   0.0000
EXPER^2         -0.000535     0.000110     -4.847105   0.0000
TENURE           0.029088     0.006762      4.301613   0.0000
TENURE^2        -0.000533     0.000231     -2.305552   0.0215
====================================================================







Examples for Interpretation:

– Married women earn about 8.8% less than unmarried women.

However, this effect is only significant at the 10% significance

level (for a two-sided test).

– The wage difference between married men and married women is about 32.3 − (−8.8) = 41.1%. A t test cannot be applied directly to this difference. (Solution: re-estimate the model with one of the two subgroups as the reference group.)

Remarks:

– Using dummies for all subgroups is not recommended since then differences with respect to the reference group cannot be tested directly.

– If you use dummies for all subgroups you cannot include a constant. Otherwise MLR.3 is violated. Why?
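One way to see the answer numerically is to check the rank of the regressor matrix. The sketch below uses a hypothetical sample of six workers (not the wage data) and NumPy; all variable names are illustrative.

```python
import numpy as np

# Hypothetical sample of six workers (illustrative values, not the wage data)
female = np.array([1, 1, 0, 0, 1, 0])
married = np.array([0, 1, 0, 1, 1, 1])

femsing = female * (1 - married)         # reference group in the slides
femmarr = female * married
malesing = (1 - female) * (1 - married)
malemarr = (1 - female) * married
const = np.ones_like(female)

# Constant plus three dummies: the columns are linearly independent
X_ok = np.column_stack([const, femmarr, malesing, malemarr])
# Constant plus all four dummies: the dummies add up to the constant column
X_trap = np.column_stack([const, femsing, femmarr, malesing, malemarr])

rank_ok = np.linalg.matrix_rank(X_ok)      # equals the number of columns
rank_trap = np.linalg.matrix_rank(X_trap)  # smaller than the number of columns
print(X_ok.shape[1], rank_ok, X_trap.shape[1], rank_trap)
```

With all four dummies and a constant, X does not have full column rank, so X′X cannot be inverted and the OLS estimator is not defined.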



• Using ordinal information in regression

Example: Ranking of universities

The quality difference between ranks 1 and 2 and ranks 11 and 12,

respectively, may be dramatically different. Hence, ranks should not

be used as regressors. Instead, we have to assign a dummy variable

Dj for all but one (the “reference category”) of the universities,

inducing several new parameters which have to be estimated.

Note: Then, the coefficient of a dummy variable Dj denotes the

intercept shift between university j and the reference university.

Sometimes there are too many ranks and hence too many parame-

ters to be estimated. Then it proves useful to group the data, e.g.,

ranks 1-10, 11-20, etc.



Interactions and Dummy Variables

• Interactions between dummy variables:

– May be used to define sub-groups (e.g., married males).

– Note that a useful interpretation and comparison of sub-group

effects crucially depends on a correct setup of dummies. For

example, let us include the dummies male and married and

their interaction in a wage equation

y = β0 + δ1male + δ2married + δ3male · married + . . .

Then, a comparison between male-married and male-single is

given by

E[y|male = 1,married = 1] − E[y|male = 1,married = 0]

= β0 + δ1 + δ2 + δ3 + . . . − (β0 + δ1 + . . .) = δ2 + δ3



• Interactions between dummies and quantitative variables:

– Allows different slope parameters for different groups

y = β0 + β1D + β2x1 + β3(x1 · D) + u.

Note: here β1 denotes the difference between both groups only for the case x1 = 0. If x1 ≠ 0, then this difference is equal to

\[
\begin{aligned}
E[y|D = 1, x_1] - E[y|D = 0, x_1]
&= \beta_0 + \beta_1 \cdot 1 + \beta_2 x_1 + \beta_3 (x_1 \cdot 1) - (\beta_0 + \beta_2 x_1) \\
&= \beta_1 + \beta_3 x_1.
\end{aligned}
\]

Even if β1 is negative, the total effect may be positive!



– Wage Example Continued:

Dependent Variable: ln(WAGE), Method: Least Squares, Sample: 1 526
====================================================================
Variable      Coefficient   Std. Error   t-Statistic   Prob.
====================================================================
C                0.388806     0.118687      3.275892   0.0011
FEMALE          -0.226789     0.167539     -1.353643   0.1764
EDUC             0.082369     0.008470      9.724919   0.0000
EXPER            0.029337     0.004984      5.885973   0.0000
EXPER^2         -0.000580     0.000108     -5.397767   0.0000
TENURE           0.031897     0.006864      4.646956   0.0000
TENURE^2        -0.000590     0.000235     -2.508901   0.0124
FEMALE*EDUC     -0.005565     0.013062     -0.426013   0.6703
====================================================================





Are returns to schooling sensitive to gender?
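To get a feeling for the size of the interaction effect, the estimated gap in log wages between women and men can be evaluated at a few education levels. This is a sketch using the two coefficients from the output above; the function name `log_wage_gap` and the chosen education levels are illustrative.

```python
import math

beta_female = -0.226789       # coefficient of FEMALE
beta_female_educ = -0.005565  # coefficient of FEMALE*EDUC


def log_wage_gap(educ):
    """Difference in expected ln(wage), women minus men, at a given education level."""
    return beta_female + beta_female_educ * educ


for educ in (8, 12, 16):
    gap = log_wage_gap(educ)
    # second column: gap in logs, third column: exact percentage difference
    print(educ, round(gap, 4), round(100 * (math.exp(gap) - 1), 2))
```

The gap changes only slightly with education, which is in line with the large p-value of the interaction term.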



• Testing for differences between groups

– Can be done with F tests.

– Chow test: allows one to test whether there is a difference between groups in the sense that there may be group-specific intercepts and/or (at least one) group-specific slope parameter.

Illustration:

\[ y = \beta_0 + \beta_1 D + \beta_2 x_1 + \beta_3 (x_1 \cdot D) + \beta_4 x_2 + \beta_5 (x_2 \cdot D) + u. \qquad (6.10) \]

Pair of hypotheses:

H0: β1 = β3 = β5 = 0 vs.

H1: β1 ≠ 0 and/or β3 ≠ 0 and/or β5 ≠ 0



Application of F tests:

∗ Estimate the regression equation for each group l

y = β0l + β2lx1 + β4lx2 + u, l = 1, 2,

and calculate SSR1 and SSR2.

∗ Then estimate this regression for both groups together and

calculate SSR.

∗ Compute the F statistic

\[ F = \frac{SSR - (SSR_1 + SSR_2)}{SSR_1 + SSR_2} \cdot \frac{n - 2(k + 1)}{k + 1}, \]

where the degrees of freedom for the F distribution are equal to k + 1 and n − 2(k + 1).
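The F statistic is easily computed once the three sums of squared residuals are available. The following sketch wraps the formula in a hypothetical helper `chow_f_statistic`; the numbers in the example call are made up purely for illustration.

```python
def chow_f_statistic(ssr_pooled, ssr_1, ssr_2, n, k):
    """Chow F statistic for equality of all k + 1 coefficients across two groups."""
    ssr_unrestricted = ssr_1 + ssr_2
    df_num = k + 1
    df_den = n - 2 * (k + 1)
    f_stat = (ssr_pooled - ssr_unrestricted) / ssr_unrestricted * df_den / df_num
    return f_stat, df_num, df_den


# Hypothetical values: pooled SSR 120, group SSRs 50 and 60, n = 106, k = 2
f_stat, df_num, df_den = chow_f_statistic(ssr_pooled=120.0, ssr_1=50.0,
                                          ssr_2=60.0, n=106, k=2)
print(round(f_stat, 4), df_num, df_den)
```

The statistic is then compared with the critical value of the F distribution with `df_num` and `df_den` degrees of freedom.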

Reading: Chapter 6 (without Section 6.4) and Chapter 7 (without Sections 7.5 and 7.6) in Wooldridge (2009).


7 Multiple Regression Analysis: Prediction

7.1 Prediction and Prediction Error

• Consider the multiple regression model y = Xβ + u, i.e.

\[ y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} + u_i, \quad 1 \le i \le n. \]

• We search for a predictor $\hat{y}_0$ for $y_0$ given $x_{01}, \ldots, x_{0k}$.

• Define the prediction error

\[ y_0 - \hat{y}_0. \]



• We assume that MLR.1 to MLR.5 hold for the prediction sample

(x0, y0). Then

y0 = β0 + β1x01 + · · · + βkx0k + u0 (7.1)

and

E[u0|x01, . . . , x0k] = 0,

so that

E[y0|x01, . . . , x0k] = β0 + β1x01 + · · · + βkx0k = x′0β,

where x′0 = (1, x01, . . . , x0k).

MLR.4 guarantees that for known parameters the predictions are un-

biased. Then, the prediction is, loosely speaking, correct on average

(if averaged over many samples).

It can be shown that the conditional expectation is optimal in the

sense of minimizing the mean squared prediction error.



• In practice, the true regression coefficients βj, j = 0, . . . , k, are unknown. Inserting the OLS estimators $\hat\beta_j$ gives

\[ \hat{y}_0 = \hat{E}[y_0|x_{01}, \ldots, x_{0k}] = \hat\beta_0 + \hat\beta_1 x_{01} + \cdots + \hat\beta_k x_{0k}. \]

Using compact notation the prediction rule is:

\[ \hat{y}_0 = x_0'\hat\beta \qquad (7.2) \]

• This prediction rule only makes sense if $(y_0, x_0')$ belongs to the population as well. Otherwise the population regression model is not valid for $(y_0, x_0')$ and the prediction based on the estimated version may be strongly misleading.



• General decomposition of the prediction error

\[
\hat{u}_0 = y_0 - \hat{y}_0
= \underbrace{(y_0 - E[y_0|x_0])}_{\text{unavoidable error } v_0}
+ \underbrace{(E[y_0|x_0] - x_0'\beta)}_{\text{possible specification error}}
+ \underbrace{(x_0'\beta - x_0'\hat\beta)}_{\text{estimation error}}
\qquad (7.3)
\]

– If MLR.1 and MLR.4 are correct for the population and if the prediction sample also belongs to the population, then the specification error is zero. Then $v_0 = u_0$ in (7.1).

– If the estimator is consistent, $\text{plim}\, \hat\beta = \beta$, then the estimation error becomes negligible in large samples.



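– Using the OLS estimator, the estimation error is

\[
\begin{aligned}
x_0'\beta - x_0'\hat\beta &= x_0'(\beta - \hat\beta) \\
&= x_0'\beta - x_0'(X'X)^{-1}X'y \\
&= x_0'\beta - x_0'\left(\beta + (X'X)^{-1}X'u\right) \\
&= -x_0'(X'X)^{-1}X'u. \qquad (7.4)
\end{aligned}
\]

Thus, the estimation error only depends on the estimation sample.

– The OLS prediction error under MLR.1 to MLR.5 is given by (using (7.3) and (7.4)):

\[
\begin{aligned}
\hat{u}_0 &= u_0 + x_0'(\beta - \hat\beta) \\
&= u_0 - x_0'(X'X)^{-1}X'u. \qquad (7.5)
\end{aligned}
\]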



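• Variance of the prediction error:

– Extension of Assumption MLR.2 (Random Sampling): $u_0$ and u are uncorrelated.

– Conditional variance of (7.5) given X and $x_0$:

\[
\begin{aligned}
Var(\hat{u}_0|X, x_0) &= Var(u_0|X, x_0) + Var\left(x_0'(\beta - \hat\beta)|X, x_0\right) \\
&= \sigma^2 + x_0' Var(\beta - \hat\beta|X) x_0 \\
&= \sigma^2 + x_0' \sigma^2 (X'X)^{-1} x_0
\end{aligned}
\]

or

\[ Var(\hat{u}_0|X, x_0) = \sigma^2\left(1 + x_0'(X'X)^{-1}x_0\right). \qquad (7.6) \]

– Relevant in practice: Estimated variance of the prediction error

\[ \widehat{Var}(\hat{u}_0|X, x_0) = \hat\sigma^2\left(1 + x_0'(X'X)^{-1}x_0\right). \]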



• Prediction interval: For the multiple regression model, a prediction interval (given an a priori chosen confidence probability 1 − α) is given by

\[
\left[ \hat{y}_0 - t_{n-k-1}\sqrt{\widehat{Var}(\hat{u}_0|X, x_0)}\, ,\;
\hat{y}_0 + t_{n-k-1}\sqrt{\widehat{Var}(\hat{u}_0|X, x_0)} \right].
\]

Notes:

– Derivation and structure are analogous to the case of confidence intervals for the parameter estimates.

– In contrast to confidence intervals, prediction intervals are valid, even in large samples, only if the prediction errors are normally distributed. This is because there is no averaging of the true prediction error $u_0$ as it occurs for $\hat\beta - \beta = Wu$ due to the central limit theorem.
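The prediction rule, the estimated prediction error variance and the interval can be put together in a few lines of NumPy. This is a minimal sketch on simulated data: the function name `ols_prediction_interval`, the simulated design and the critical value passed as `t_crit` (roughly the 97.5% quantile of the t distribution with 48 degrees of freedom) are assumptions made for illustration.

```python
import numpy as np


def ols_prediction_interval(X, y, x0, t_crit):
    """Point prediction and prediction interval for y0 at the regressor vector x0.

    X contains a column of ones; t_crit is the critical value of the
    t distribution with n - k - 1 degrees of freedom, supplied by the caller.
    """
    n, p = X.shape                                      # p = k + 1
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / (n - p)
    y0_hat = x0 @ beta_hat                              # prediction rule (7.2)
    var_pred = sigma2_hat * (1.0 + x0 @ XtX_inv @ x0)   # estimated version of (7.6)
    half_width = t_crit * np.sqrt(var_pred)
    return y0_hat, y0_hat - half_width, y0_hat + half_width


# Simulated data (assumed design, for illustration only)
rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)
x0 = np.array([1.0, 0.5])
y0_hat, lower, upper = ols_prediction_interval(X, y, x0, t_crit=2.0106)
print(y0_hat, lower, upper)
```

Because of the leading "1 +" in (7.6), the interval does not shrink to a point as n grows; only the estimation error part vanishes.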



7.2 Statistical Properties of Linear Predictions

Apparently the prediction rule is linear (in y) since

\[ \hat{y}_0 = x_0'\hat\beta = x_0'(X'X)^{-1}X'y. \]

Gauss-Markov property of linear prediction

If $\hat\beta$ is the BLU estimator for β, then

\[ \hat{y}_0 = x_0'\hat\beta \]

is the BLU prediction rule. Among all linear prediction rules with a mean prediction error of zero it exhibits the smallest prediction error variance.

Reading: Section 6.4 in Wooldridge (2009).


8 Multiple Regression Analysis: Heteroskedasticity

• In this chapter Assumptions MLR.1 through MLR.4 continue to

hold.

• If MLR.5 fails to hold such that

\[ Var(u_i|x_{i1}, \ldots, x_{ik}) = \sigma_i^2 \ne \sigma^2, \quad i = 1, \ldots, n, \]

the errors of the regression model exhibit heteroskedasticity. More precisely (instead of MLR.5) we have


Intensive Course in Econometrics — Chapter 8 — UR March 2009 — R. Tschernig

– Assumption GLS.5: Heteroskedasticity

\[
Var(u_i|x_{i1}, \ldots, x_{ik}) = \sigma_i^2(x_{i1}, \ldots, x_{ik})
= \sigma^2 h(x_{i1}, \ldots, x_{ik}) = \sigma^2 h_i, \quad i = 1, \ldots, n.
\]

The error variance of the i-th sample observation σ2i is a function

h(·) of the regressors.

• Examples:

– The variance of net rents depends on the size of the flat.

– The variance of consumption expenditures depends on the level

of income.

– The variance of log hourly wages depends on years of education.



• The covariance matrix of the errors of the regression is given

by:

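\[
Var(u|X) = E[uu'|X] =
\begin{pmatrix}
\sigma^2 h_1 & 0 & \cdots & 0 \\
0 & \sigma^2 h_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma^2 h_n
\end{pmatrix}
= \sigma^2
\underbrace{\begin{pmatrix}
h_1 & 0 & \cdots & 0 \\
0 & h_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & h_n
\end{pmatrix}}_{\Psi}.
\]

Thus, we have

\[ y = X\beta + u, \quad Var(u|X) = \sigma^2\Psi, \qquad (8.1) \]

which will be referred to as the original model in matrix notation.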



• When estimating models with heteroskedastic errors three cases

have to be distinguished:

1. Function h(·) is known, see Section 8.3.

2. Function h(·) is only partially known, see Section 8.4.

3. Function h(·) is completely unknown, see Section 8.2.

8.1 Consequences of Heteroskedasticity for OLS

• The OLS estimator is unbiased and consistent.

• Variance of the OLS estimator in the presence of heteroskedas-

tic errors (compare Section 3.4.2):



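From $\hat\beta - \beta = (X'X)^{-1}X'u$ it can be derived that

\[
\begin{aligned}
Var(\hat\beta|X) &= E\left[(\hat\beta - \beta)(\hat\beta - \beta)'|X\right] \\
&= E\left[(X'X)^{-1}X'uu'X(X'X)^{-1}|X\right] \\
&= (X'X)^{-1}X' \underbrace{E[uu'|X]}_{\sigma^2\Psi} X(X'X)^{-1} \\
&= (X'X)^{-1}X'\sigma^2\Psi X(X'X)^{-1}. \qquad (8.2)
\end{aligned}
\]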

• Note that with homoskedastic errors one has Ψ = I. Then (8.2)

yields the usual OLS covariance matrix, namely σ2(X′X)−1.

• If heteroskedasticity is present, using the usual covariance

matrix σ2(X′X)−1 is misleading and leads to faulty inference.

• The problem with using (8.2) directly is that Ψ is unknown. The

next section introduces an appropriate estimator.



• Even if Ψ is known, OLS is not the best linear unbiased estimator,

and thus not efficient. One has to use the GLS estimator instead,

see Section 8.3.

8.2 Heteroskedasticity-Robust Inference after OLS

• Derivation of heteroskedasticity-robust standard errors

Let $x_i' = (1, x_{i1}, \ldots, x_{ik})$. Note that the middle term in the variance-covariance matrix (8.2) with dimension (k + 1) × (k + 1) can be written as

\[ X'\sigma^2\Psi X = \sum_{i=1}^{n} \sigma^2 h_i x_i x_i'. \]

Because $E[u_i^2|X] = \sigma^2 h_i$, one can estimate $\sigma^2 h_i$ by the "one observation average" $u_i^2$. Of course this is not a good estimator but for the present purpose it does well enough. Since $u_i$ is not known, one takes the residual $\hat{u}_i$.

Hence one can estimate the covariance matrix (8.2) of the OLS estimator in the presence of heteroskedasticity by

\[ \widehat{Var}(\hat\beta|X) = (X'X)^{-1} \left( \sum_{i=1}^{n} \hat{u}_i^2 x_i x_i' \right) (X'X)^{-1}. \qquad (8.3) \]

• Comments:

– Standard errors obtained from (8.3) are called heteroskedasticity-

robust standard errors or also White standard errors named

after Halbert White, an econometrician at the University of Cal-

ifornia in San Diego.



– For an individual $\hat\beta_j$, heteroskedasticity-robust standard errors can be smaller or larger than the usual OLS standard errors.

– If heteroskedasticity-robust standard errors are used, it can be shown that the OLS estimator $\hat\beta$ no longer has a known finite sample distribution. However, it is asymptotically normally distributed. Thus, critical values and p-values remain approximately valid if (8.3) is used.

– The OLS estimator with White standard errors is unbiased and

consistent since MLR.1 to MLR.4 are unaffected by heteroskedas-

ticity.

– However, the OLS estimator is not efficient. Efficient estima-

tors will be presented in the next sections.
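The estimator (8.3) is straightforward to compute by hand. The sketch below does this with NumPy on simulated heteroskedastic data; the function name `white_covariance` and the data-generating process are assumptions for illustration, and no finite-sample correction is applied.

```python
import numpy as np


def white_covariance(X, resid):
    """Heteroskedasticity-robust covariance matrix (8.3) of the OLS estimator."""
    XtX_inv = np.linalg.inv(X.T @ X)
    meat = (X * resid[:, None] ** 2).T @ X     # sum_i u_hat_i^2 x_i x_i'
    return XtX_inv @ meat @ XtX_inv


# Simulated data: the error standard deviation grows with x (assumed design)
rng = np.random.default_rng(1)
n = 200
x = rng.uniform(1.0, 5.0, size=n)
X = np.column_stack([np.ones(n), x])
y = X @ np.array([1.0, 0.5]) + rng.normal(size=n) * x

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat

robust_se = np.sqrt(np.diag(white_covariance(X, resid)))
sigma2_hat = resid @ resid / (n - X.shape[1])
usual_se = np.sqrt(np.diag(sigma2_hat * np.linalg.inv(X.T @ X)))
print(usual_se, robust_se)
```

The point estimates are the same in both cases; only the standard errors, and hence t statistics and p-values, change.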



8.3 The Generalized Least Squares (GLS) Estimator

• Original model (8.1):

yi = β0 + β1xi1 + . . . + βkxik + ui, (8.4)

V ar(ui|xi1, . . . , xik) = σ2h(xi1, . . . , xik) = σ2hi.

• Basic idea: Weighted estimation of (8.4):

Transformation of the initial model to a model that satisfies all assumptions, including MLR.5. This is achieved by standardizing, in a sense, the regression error $u_i$: one divides $u_i$, and thus the whole regression equation (8.4), by the square root of $h_i$:

\[
\underbrace{\frac{y_i}{\sqrt{h_i}}}_{y_i^*}
= \beta_0 \underbrace{\frac{1}{\sqrt{h_i}}}_{x_{i0}^*}
+ \beta_1 \underbrace{\frac{x_{i1}}{\sqrt{h_i}}}_{x_{i1}^*}
+ \ldots
+ \beta_k \underbrace{\frac{x_{ik}}{\sqrt{h_i}}}_{x_{ik}^*}
+ \underbrace{\frac{u_i}{\sqrt{h_i}}}_{u_i^*}.
\]



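The resulting model is

\[ y_i^* = \beta_0 x_{i0}^* + \beta_1 x_{i1}^* + \ldots + \beta_k x_{ik}^* + u_i^*. \qquad (8.5) \]

Note: For the transformed error $u_i^*$ we have

\[
\begin{aligned}
Var(u_i^*|x_{i1}, \ldots, x_{ik})
&= Var\left( \frac{u_i}{\sqrt{h_i}} \,\Big|\, x_{i1}, \ldots, x_{ik} \right)
= E\left[ \frac{u_i^2}{h_i} \,\Big|\, x_{i1}, \ldots, x_{ik} \right] \\
&= \frac{1}{h_i} E[u_i^2|x_{i1}, \ldots, x_{ik}]
= \frac{1}{h_i} \sigma^2 h_i = \sigma^2.
\end{aligned}
\]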

Result: We have transformed the original regression (8.4) in such

a way that the homoskedasticity assumption MLR.5 holds for the

resulting regression model (8.5).



• Therefore the OLS estimator based on the transformed model (8.5)

has all desirable properties: BLU (best linear unbiased).

• The OLS estimator of the transformed model (8.5) is based on the minimization of a weighted sum of squared residuals

\[ \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_{i1} - \ldots - \beta_k x_{ik})^2 / h_i. \]

Therefore, it is called a weighted least squares (WLS) procedure. Note that in its current form it requires that h(·) is known.

• The transformed model does not contain a constant term if $\sqrt{h_i}$ is not identical to one of the regressors in model (8.4).

• Next we derive the transformed model in matrix notation.



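• Explicit statement of $y^*$, $X^*$, and $u^*$ in matrix notation:

\[
\underbrace{\begin{pmatrix} y_1^* \\ y_2^* \\ \vdots \\ y_n^* \end{pmatrix}}_{y^*}
=
\underbrace{\begin{pmatrix}
h_1^{-1/2} & 0 & \cdots & 0 \\
0 & h_2^{-1/2} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & h_n^{-1/2}
\end{pmatrix}}_{P}
\underbrace{\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}}_{y},
\]

\[
\underbrace{\begin{pmatrix}
x_{10}^* & x_{11}^* & \cdots & x_{1k}^* \\
x_{20}^* & x_{21}^* & \cdots & x_{2k}^* \\
\vdots & \vdots & & \vdots \\
x_{n0}^* & x_{n1}^* & \cdots & x_{nk}^*
\end{pmatrix}}_{X^*}
= P \cdot
\underbrace{\begin{pmatrix}
1 & x_{11} & \cdots & x_{1k} \\
1 & x_{21} & \cdots & x_{2k} \\
\vdots & \vdots & & \vdots \\
1 & x_{n1} & \cdots & x_{nk}
\end{pmatrix}}_{X},
\qquad
\underbrace{\begin{pmatrix} u_1^* \\ u_2^* \\ \vdots \\ u_n^* \end{pmatrix}}_{u^*}
= P \cdot
\underbrace{\begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix}}_{u}.
\]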



• For the transformation matrix P it holds that

P′P = Ψ−1

and hence

E[uu′|X] = σ2Ψ = σ2(P′P)−1.

• Therefore, the transformed model (8.5) in matrix notation is

given by

Py = PXβ + Pu,

or

y∗ = X∗β + u∗, E[u∗(u∗)′|X∗] = σ2I. (8.6)

• Obviously (8.6) is obtained by multiplying the original model (8.1)

y = Xβ + u by the transformation matrix P from the left.

• What is the explicit formula for the OLS estimator in terms of the

transformed model (8.6) and the original model (8.1)?



GLS (generalized least squares) estimator

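• OLS estimation of (8.6) yields

\[
\begin{aligned}
\hat\beta_{GLS} &= \left(X^{*\prime}X^*\right)^{-1} X^{*\prime}y^* \\
&= \left((PX)'PX\right)^{-1}(PX)'Py \\
&= \left(X'P'PX\right)^{-1}X'P'Py
\end{aligned}
\]

and therefore

\[ \hat\beta_{GLS} = \left(X'\Psi^{-1}X\right)^{-1}X'\Psi^{-1}y. \qquad (8.7) \]

$\hat\beta_{GLS}$ in (8.7) is called the GLS estimator.

In case of heteroskedasticity Ψ is a diagonal matrix and each of the n observations is weighted by $1/\sqrt{h_i}$.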
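That OLS on the transformed data and formula (8.7) give the same estimate can be verified numerically. The following NumPy sketch uses simulated data with an assumed known variance function $h_i = x_i^2$; all names are illustrative.

```python
import numpy as np

# Simulated data with known variance function h_i = x_i^2 (assumed for illustration)
rng = np.random.default_rng(2)
n = 100
x = rng.uniform(1.0, 4.0, size=n)
X = np.column_stack([np.ones(n), x])
h = x ** 2
y = X @ np.array([2.0, 1.0]) + rng.normal(size=n) * np.sqrt(h)

# Route 1: OLS on the transformed model y* = X* beta + u*
P = np.diag(h ** -0.5)
X_star, y_star = P @ X, P @ y
beta_transformed = np.linalg.solve(X_star.T @ X_star, X_star.T @ y_star)

# Route 2: GLS formula (8.7) in terms of the original data
Psi_inv = np.diag(1.0 / h)
beta_gls = np.linalg.solve(X.T @ Psi_inv @ X, X.T @ Psi_inv @ y)

print(beta_transformed, beta_gls)
```

For large n one would avoid building the n × n matrices and simply scale each row of X and y by $1/\sqrt{h_i}$; the explicit matrices are used here only to mirror the notation of the slides.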



• Properties for known h(·): Under MLR.1 to MLR.4 and GLS.5 the GLS estimator $\hat\beta_{GLS}$

– is unbiased and consistent,

– is BLU (best linear unbiased), and thus efficient,

– has variance-covariance matrix $Var(\hat\beta_{GLS}|X) = \sigma^2\left(X'\Psi^{-1}X\right)^{-1}$,

– is unbiased and consistent even if Ψ is misspecified since Ψ is a function of X and not of u and thus

\[ E[\hat\beta_{GLS} - \beta|X] = \left(X'\Psi^{-1}X\right)^{-1}X'\Psi^{-1}E[u|X] = 0. \]

As a consequence, OLS is inefficient: OLS and GLS are both linear unbiased estimators, and GLS is the best among them. OLS variances are larger than or equal to those of the GLS estimator. This can be shown using matrix algebra.

• Analogously to MLR.6 in Section 4.2 above, we assume



– Assumption GLS.6: Normal Distribution

\[ u_i|x_i \sim N(0, \sigma^2 h_i), \quad i = 1, \ldots, n, \]

which, together with MLR.2 (Random Sampling), implies the multivariate normal distribution

\[ u|X \sim N\left(0, \sigma^2\Psi\right). \]

Note GLS.6 implies that ui given xi is independently but not iden-

tically distributed since the variance changes with i.

All test statistics based on the transformed model (8.6) and ap-

propriately modified for the original model (8.1) exhibit the exact

distributions of Chapter 4 (normal, t, F ).

• Frequent problem in practice: hi is not known. In this case,

the feasible GLS estimator has to be used −→ Case 2.



8.4 Feasible Generalized Least Squares (FGLS)

• In general, the variance function hi is not known and has to be

estimated. Frequently neither the relevant factors nor the functional

relationship are known.

• Hence, one needs a specification that flexibly captures a large range

of possibilities, e.g.

hi = h(xi1, . . . , xik) = exp (δ1xi1 + . . . + δkxik)

and thus

V ar(ui|xi1, xi2, . . . , xik) = σ2hi = σ2 exp (δ1xi1 + . . . + δkxik) .

Remark: On pp. 282, Wooldridge (2009) considers in hi additionally

the factor exp δ0. As this factor is constant, it can also be captured

by σ2.



• How can one estimate the unknown parameters δ1, . . . , δk?

Standardizing $u_i$ delivers $v_i = u_i/(\sigma\sqrt{h_i})$ with $E[v_i|X] = 0$ and $Var(v_i|X) = 1$. Therefore $u_i = \sigma\sqrt{h_i}\,v_i$ and

\[ u_i^2 = \sigma^2 h_i v_i^2, \quad i = 1, \ldots, n. \]

Taking logarithms leads to

\[
\begin{aligned}
\ln u_i^2 &= \ln\sigma^2 + \ln h_i + \ln v_i^2 \\
&= \ln\sigma^2 + \ln\exp(\delta_1 x_{i1} + \cdots + \delta_k x_{ik}) + \ln v_i^2 \\
&= \underbrace{\ln\sigma^2 + E[\ln v_i^2]}_{\alpha_0} + \delta_1 x_{i1} + \cdots + \delta_k x_{ik} + \underbrace{\ln v_i^2 - E[\ln v_i^2]}_{e_i},
\end{aligned}
\]

\[ \ln u_i^2 = \alpha_0 + \delta_1 x_{i1} + \cdots + \delta_k x_{ik} + e_i. \qquad (8.8) \]

For the regression equation (8.8) the assumptions MLR.1-MLR.4 are satisfied. Hence, the OLS estimator for δj is unbiased and consistent.



In practice, the $u_i^2$'s in the variance regression (8.8) are replaced by the squared OLS residuals $\hat{u}_i^2$ from the sample regression $y = X\hat\beta + \hat{u}$ of (8.1). The resulting $\hat\delta_j$'s are used to get the fitted values $\hat{h}_i$ which are inserted into the GLS estimator (8.7) in Step II.

• Outline of the FGLS method:

Step I

a) Regress y on X and compute the residual vector $\hat{u}$ by OLS estimation of the original specification (8.1).

b) Calculate $\ln \hat{u}_i^2$, i = 1, . . . , n, that are used as regressand in the variance regression (8.8).

c) Estimate the variance regression (8.8) by OLS.

d) Compute $\hat{h}_i = \exp(\hat\delta_1 x_{i1} + \cdots + \hat\delta_k x_{ik})$, i = 1, . . . , n.



Step II

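The FGLS estimator $\hat\beta_{FGLS}$ is obtained analogously to the GLS procedure. The original regression (8.1) is multiplied from the left with the matrix

\[
\hat{P} =
\begin{pmatrix}
\hat{h}_1^{-1/2} & \cdots & 0 \\
\vdots & \ddots & \vdots \\
0 & \cdots & \hat{h}_n^{-1/2}
\end{pmatrix}.
\]

This delivers a variant of the transformed regression

\[ y^{\#} = X^{\#}\beta + u^{\#}. \qquad (8.9) \]

Hence, OLS estimation of (8.9) leads to the FGLS estimator

\[ \hat\beta_{FGLS} = \left(X'\hat\Psi^{-1}X\right)^{-1}X'\hat\Psi^{-1}y, \qquad (8.10) \]

with $\hat\Psi^{-1} = \hat{P}'\hat{P}$.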
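Steps I and II can be sketched compactly with NumPy. The function `fgls` below is a hypothetical helper, not part of any library; it assumes that the first column of X is the constant and that the variance function is the exponential specification used above. The data are simulated for illustration.

```python
import numpy as np


def fgls(X, y):
    """Two-step FGLS with h_i = exp(delta_1 x_i1 + ... + delta_k x_ik).

    Assumes that the first column of X is a column of ones.
    """
    # Step I a): OLS residuals of the original model
    beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta_ols
    # Step I b), c): regress ln(u_hat^2) on the regressors (including a constant)
    log_u2 = np.log(resid ** 2)
    coef_var = np.linalg.solve(X.T @ X, X.T @ log_u2)
    # Step I d): fitted variance function, leaving out the intercept alpha_0
    h_hat = np.exp(X[:, 1:] @ coef_var[1:])
    # Step II: OLS after weighting every observation by 1 / sqrt(h_hat_i)
    w = 1.0 / np.sqrt(h_hat)
    Xw, yw = X * w[:, None], y * w
    beta_fgls = np.linalg.solve(Xw.T @ Xw, Xw.T @ yw)
    return beta_ols, beta_fgls, h_hat


# Simulated data with Var(u_i | x_i) = exp(x_i) (assumed design)
rng = np.random.default_rng(3)
n = 300
x = rng.uniform(0.0, 2.0, size=n)
X = np.column_stack([np.ones(n), x])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n) * np.exp(0.5 * x)
beta_ols, beta_fgls, h_hat = fgls(X, y)
print(beta_ols, beta_fgls)
```

Whether the intercept of the variance regression is included in $\hat{h}_i$ does not matter for the estimate: it multiplies all weights by the same constant, which cancels in (8.10).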



• Estimation properties of the FGLS estimators:

– They are consistent, that is, they converge in probability to the true parameters for n → ∞:

\[ \text{plim}\, \hat\beta_{FGLS} = \beta. \]

– The FGLS estimator is asymptotically efficient: For a cor-

rectly specified hi and a sufficiently large sample, the FGLS esti-

mator is preferable to the OLS estimator as the former one has a

lower estimation-variance. (This is plausible, as FGLS also uses

information on the functional form of the heteroskedasticity while

OLS with heteroskedasticity-robust standard errors does not.)

– If the variance function hi is misspecified, then the FGLS esti-

mator is inefficient.

– Be aware that there may be considerable differences between the FGLS estimates and the OLS estimates.

• Comparing OLS with heteroskedasticity-robust standard

errors and FGLS

– If you know something about the variance function hi, then

FGLS is preferable. If you have no idea about it, then OLS with

heteroskedasticity-robust standard errors may be better.

– It is always a good idea to run an OLS regression also with

heteroskedasticity-robust standard errors in order to see

whether the significance of parameters depends on the presence

of heteroskedasticity.

– Since any estimator taking into account heteroskedasticity should

be avoided if there is no heteroskedasticity, one should test for

the presence of heteroskedasticity, see Section 9.2.



• Trade Example Continued

– Consider Model 5 of Section 6.3 and compare OLS estimates,

FGLS estimates, and OLS estimates with heteroskedasticity-robust

standard errors.

– EViews program to run OLS, FGLS with both steps, and OLS

with White standard errors, and scatter plots of residuals against

fitted values for both estimators.

' EViews program for FGLS estimation, Chapter 8 Heteroskedasticity
' RT, 2009_02_26
' requires workfile IMPORTS_KAZAKHSTAN_2004

' define variables
' define log of dependent variable
genr log_imp = log(trade_0_d_o)

' define group of regressors (right hand side variables)
' Model 5
group rhs_model5_base c log(wdi_gdpusdcr_o) (log(wdi_gdpusdcr_o))^2 log(cepii_dist) cepii_comcol_rev cepii_comlang_off
' potential regressors to add
group rhs_model5_add log(cepii_area_o) log(weo_pop_o) cepii_comlang_ethno_rev cepii_col45_rev cepii_contig
' rhs_model5_add (can be added)
group rhs_model5 rhs_model5_base

' Step I: OLS regression
' ols regression
equation eq_ols_model5.ls log_imp rhs_model5
' compute residuals
eq_ols_model5.makeresids res_ols_model5
' compute fitted values
eq_ols_model5.fit fit_ols_model5
' plot residuals versus fitted log imports in order to check for heteroskedasticity
group group_ols_res fit_ols_model5 res_ols_model5
group_ols_res.scat

' Step II: FGLS regression
' square residuals and take logs
genr ln_u_hat_sq = log(res_ols_model5^2)
' estimate variance equation
equation eq_h_model5.ls ln_u_hat_sq rhs_model5
' predicted squared residuals
eq_h_model5.fit ln_u_hat_sq_hat
' compute exponential of fitted values of variance regression
genr h_hat = exp(ln_u_hat_sq_hat)
' estimate FGLS using w=h_hat^(-1/2)
equation eq_fgls_model5.ls(w=h_hat^(-1/2)) log_imp rhs_model5
' compute fitted values based on FGLS
eq_fgls_model5.fit fit_fgls_model5
' compute residuals based on FGLS
eq_fgls_model5.makeresids res_fgls_model5
' standardize residuals with weight function
series res_fgls_model5_star = res_fgls_model5*h_hat^(-1/2)
' plot residuals versus fitted
group group_fgls_res fit_fgls_model5 res_fgls_model5_star
group_fgls_res.scat

' OLS regression with heteroskedasticity-robust standard errors
equation eq_white_model5.ls(h) log_imp rhs_model5



– OLS output with usual standard errors

====================================================================



Sample: 1 55; Included observations: 52

====================================================================


====================================================================

C -85.82647 27.89947 -3.076276 0.0035



LOG(CEPII_DIST) -0.761431 0.513375 -1.483187 0.1448

CEPII_COMCOL_REV 3.911289 0.673603 5.806523 0.0000


====================================================================








====================================================================



– FGLS - Step I: estimate variance regression (8.8)

====================================================================

Dependent Variable: LN_U_HAT_SQ



====================================================================


====================================================================

C 10.63404 44.84623 0.237122 0.8136

LOG(WDI_GDPUSDCR_O) -0.848029 3.514572 -0.241290 0.8104

(LOG(WDI_GDPUSDCR_O))^2 0.004631 0.068099 0.068007 0.9461

LOG(CEPII_DIST) 0.863174 0.825210 1.046006 0.3010

CEPII_COMCOL_REV -0.946104 1.082764 -0.873786 0.3868


====================================================================

R-squared 0.177909 Mean dependent var -0.847962







====================================================================



– Estimate FGLS - Step II: estimate (8.10)

====================================================================




Weighting series: H_HAT^(-1/2)

====================================================================


====================================================================

C -88.19314 22.59938 -3.902459 0.0003



LOG(CEPII_DIST) -1.107230 0.347434 -3.186879 0.0026

CEPII_COMCOL_REV 3.704807 0.678132 5.463250 0.0000


====================================================================

Weighted Statistics

====================================================================










====================================================================

Unweighted Statistics

====================================================================



S.E. of regression 1.384409 Sum squared resid 88.16300

Durbin-Watson stat 2.055925

====================================================================



– OLS with heteroskedasticity-robust standard errors

====================================================================




White Heteroskedasticity-Consistent Standard Errors & Covariance

====================================================================


====================================================================

C -85.82647 28.84426 -2.975513 0.0046



LOG(CEPII_DIST) -0.761431 0.456362 -1.668480 0.1020

CEPII_COMCOL_REV 3.911289 0.777904 5.027986 0.0000


====================================================================








====================================================================



– Diagnostic plots: (standardized) residuals against fitted values

[Figure: two scatter plots, RES_OLS_MODEL5 against FIT_OLS_MODEL5 and RES_FGLS_MODEL5_STAR against FIT_FGLS_MODEL5.]



– Output table for Model 4 and Model 5 using various

estimators (compare Section 4.8):



Dependent Variable: ln(imports to Kazakhstan)

Independent Variables/Model        (4)-OLS     (5)-OLS     (5)-FGLS
constant                           -13.027     -85.827     -88.193
                                   (4.726)     (27.900)    (22.599)
                                   [4.778]     [28.844]
ln(gdp)                            1.318       7.088       7.426
                                   (0.129)     (2.187)     (1.675)
                                   [0.158]     [2.230]
(ln(gdp))^2                        —           -0.112      -0.117
                                               (0.042)     (0.031)
                                               [0.042]
ln(distance)                       -0.625      -0.761      -1.107
                                   (0.542)     (0.513)     (0.347)
                                   [0.438]     [0.456]
common "colonizer" since 1945      3.351       3.911       3.705
                                   (0.679)     (0.674)     (0.678)
                                   [0.782]     [0.778]
common official language           2.110       2.312       2.012
                                   (1.319)     (1.245)     (0.989)
                                   [1.020]     [0.789]
Number of observations             52          52          52
R2                                 0.714       0.752       0.747
Standard error of regression       1.455       1.371       1.384
Sum of squared residuals           99.523      86.40       88.16
AIC                                3.6793      3.576
HQ                                 3.7513      3.663
SC                                 3.8670      3.802

Notes: OLS or FGLS standard errors in parentheses, White standard errors in brackets



– Results and Interpretation:

∗ OLS and FGLS parameter estimates for ln(distance) and com.

official language are quite different: the FGLS estimates im-

ply less impact of language, more impact of distance. In addi-

tion, log(distance) is statistically significant in the FGLS esti-

mate while insignificant in the OLS estimate with heteroskedas-

ticity-robust standard errors. This makes the FGLS estimates

more plausible.

∗ When taking into account heteroskedasticity, based on FGLS

there is no insignificant parameter at the 5% significance level

and based on heteroskedasticity-robust OLS standard errors

only log(distance) is insignificant.

∗ Comparing the usual OLS standard errors and White standard errors shows that the multicollinearity problem for the parameter estimates of distance and language decreases.

∗ Inspecting the scatter plots of OLS and standardized FGLS

residuals against fitted values does not automatically suggest

heteroskedasticity. Thus heteroskedasticity tests are useful, see

Section 9.2.

• Cigarette Example (Wooldridge 2009, Example 8.7) (with EViews-

outputs/commands):

Step I

1. OLS estimation

Dependent Variable: CIGS, Method: Least Squares, Sample: 1 807
Variable      Coefficient   Std. Error   t-Statistic   Prob.
C               -3.639855     24.07866     -0.151165   0.8799
LINCOME          0.880268     0.727783      1.209520   0.2268
LCIGPRIC        -0.750855     5.773343     -0.130056   0.8966
EDUC            -0.501498     0.167077     -3.001596   0.0028
AGE              0.770694     0.160122      4.813155   0.0000
AGE^2           -0.009023     0.001743     -5.176494   0.0000
RESTAURN        -2.825085     1.111794     -2.541016   0.0112

S.E. of regression 13.40479, Sum squared resid 143750.7, Log likelihood -3236.227
F-statistic 7.423062, Prob(F-statistic) 0.000000, Durbin-Watson stat 2.012825

2. Save the residuals using genr u_hat = resid

3. Take the logarithm of the squared residuals using genr ln_u_sq = log(u_hat^2)



4. Estimation of variance regression (8.8) with OLS yields

Dependent Variable: LN_U_SQ, Method: Least Squares, Sample: 1 807
Variable      Coefficient   Std. Error   t-Statistic   Prob.
C               -1.920704     2.563034     -0.749387   0.4538
LINCOME          0.291541     0.077468      3.763351   0.0002
LCIGPRIC         0.195421     0.614539      0.317996   0.7506
EDUC            -0.079704     0.017784     -4.481656   0.0000
AGE              0.204005     0.017044     11.96927    0.0000
AGE^2           -0.002392     0.000186    -12.89313    0.0000
RESTAURN        -0.627012     0.118344     -5.298212   0.0000

– Save the $\hat{h}_i$, i = 1, . . . , n, using genr h_hat = exp(ln_u_sq - resid).



Step II

Weighted LS estimate (see Options) with weights h_hat^(-1/2)

Dependent Variable: CIGS, Method: Least Squares, Sample: 1 807, Weighting series: WEIGHTS
Variable      Coefficient   Std. Error   t-Statistic   Prob.
C                5.635433     17.80314      0.316541   0.7517
LINCOME          1.295240     0.437012      2.963855   0.0031
LCIGPRIC        -2.940305     4.460145     -0.659240   0.5099
EDUC            -0.463446     0.120159     -3.856953   0.0001
AGE              0.481948     0.096808      4.978378   0.0000
AGE^2           -0.005627     0.000939     -5.989707   0.0000
RESTAURN        -3.461064     0.795505     -4.350776   0.0000

Weighted Statistics
R-squared 0.002751, Adjusted R-squared -0.004728, Akaike info crit. 7.765025

Unweighted Statistics
R-squared 0.045739, Adjusted R-squared 0.038582, S.E. of regression 13.45421
Mean dependent var 8.686493, S.D. dependent var 13.72152, Sum squared resid 144812.7

(Remark: The Unweighted Statistics are based on the residuals $y - X\hat\beta_{GLS}$; see the EViews help file.)

– Compare with the OLS estimator based on White standard errors.


9 Multiple Regression Analysis: Model Diagnostics

9.1 The RESET Test

RESET Test (regression specification error test)

Idea and implementation:

• If the original model

\[ y = x_0\beta_0 + \ldots + x_k\beta_k + u = x'\beta + u \]

satisfies assumption MLR.4, $E[u|x_0, \ldots, x_k] = 0$, it holds that

\[ E[y|x_0, \ldots, x_k] = x_0\beta_0 + \ldots + x_k\beta_k = x'\beta. \]

• Then, any further term added to the model should not be significant.

Thus, any nonlinear function of the independent variables should be

insignificant.

• Thus, the null hypothesis of the RESET test is formulated such that one can test the significance of nonlinear functions of the fitted values $\hat{y} = x'\hat\beta$ that are added to the model. Note that the fitted values are a linear function of the regressors of the original specification.



• In practice it has turned out that for implementing the RESET test it is sufficient to include quadratic and cubic terms of $\hat{y}$ only:

\[ y = x'\beta + \alpha\hat{y}^2 + \gamma\hat{y}^3 + \varepsilon. \]

The pair of hypotheses is

H0: α = 0, γ = 0 (linear model is correctly specified)

H1: α ≠ 0 and/or γ ≠ 0.

The null hypothesis is tested using an F test with 2 degrees of freedom in the numerator and n − k − 3 in the denominator.

• Be aware that the null hypothesis may also be rejected because of

omitting relevant regressor variables.
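The RESET procedure can be sketched in a few lines of NumPy. The helper names `ssr` and `reset_f_statistic` are hypothetical, and the data are simulated such that the true model is quadratic while a linear model is fitted.

```python
import numpy as np


def ssr(X, y):
    """Sum of squared residuals from an OLS regression of y on X."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    return resid @ resid


def reset_f_statistic(X, y):
    """RESET F statistic: adds y_hat^2 and y_hat^3 to the original regressors."""
    n, p = X.shape                                 # p = k + 1
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ beta_hat
    X_aug = np.column_stack([X, y_hat ** 2, y_hat ** 3])
    ssr_r, ssr_ur = ssr(X, y), ssr(X_aug, y)
    df_num, df_den = 2, n - p - 2                  # i.e. 2 and n - k - 3
    f_stat = ((ssr_r - ssr_ur) / df_num) / (ssr_ur / df_den)
    return f_stat, df_num, df_den


# Simulated data: the true model is quadratic, the fitted model is linear
rng = np.random.default_rng(4)
n = 200
x = rng.uniform(0.0, 3.0, size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + x + x ** 2 + rng.normal(scale=0.5, size=n)
f_stat, df_num, df_den = reset_f_statistic(X, y)
print(f_stat, df_num, df_den)
```

A large value of the F statistic relative to the critical value of the F distribution with `df_num` and `df_den` degrees of freedom leads to a rejection of the linear specification.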



9.2 Heteroskedasticity Tests

• As already noted, it does not make sense to “automatically” use the

FGLS estimator. If the errors are homoskedastic, the OLS estimator

with OLS standard errors should be used.

• Thus, one should test if there is statistical evidence for heteroskedas-

ticity.

• In the following, two different tests for heteroskedasticity are discussed: the Breusch-Pagan test and the White test. For both, the null hypothesis is "homoskedastic errors".

• Both tests are implemented in EViews 6.0, however, for the White

test without cross terms the level variables are included only in earlier

EViews versions.

366


It is assumed that for the multiple linear regression

y = β0 + x1β1 + . . . + xkβk + u

assumptions MLR.1 to MLR.4 hold.

The pair of hypotheses that has to be tested is

H0 : Var(ui|xi) = σ² (homoskedasticity),

H1 : Var(ui|xi) = $\sigma_i^2 \neq \sigma^2$ (heteroskedasticity).

The general idea underlying heteroskedasticity tests is that under the null hypothesis no regressor should have any explanatory power for Var(ui|xi). If the null hypothesis is not true, Var(ui|xi) can be a (nearly arbitrary) function of the regressors xj, (1 ≤ j ≤ k).

Note: The Breusch-Pagan test and the White test differ with respect

to the specification of their alternative hypothesis.



Breusch-Pagan Test

• Idea: Consider the regression

\[ u_i^2 = \delta_0 + \delta_1 x_{i1} + \cdots + \delta_k x_{ik} + v_i, \quad i = 1, \ldots, n. \tag{9.1} \]

Under assumptions MLR.1 to MLR.4 the OLS estimator for the δj’s

is unbiased.

The pair of hypotheses is:

H0 : δ1 = δ2 = · · · = δk = 0 versus

H1 : δ1 ≠ 0 and/or δ2 ≠ 0 and/or . . .,

since under H0 it holds that $E[u_i^2 \mid X] = \delta_0$.


• Difference from the previous application of the F test:

– The squared errors $u_i^2$ are by no means normally distributed

since they are squared quantities and thus cannot take negative

values. Hence, the vi cannot be normally distributed and the

F distribution of the F statistic does not hold exactly in finite

samples. However, the central limit theorem (CLT) works here as

well, see Section 5.2, and the F statistic follows approximately

an F distribution in large samples.

– The errors ui are unknown. They can be replaced by the OLS residuals ûi. In doing so, the F test remains asymptotically valid (the proof is technically demanding).


• The R² version of the test statistic can be used. Note that for a regression including only a constant, it holds that R² = 0 since SSR = SST (there are no regressors that show a variation). Call the coefficient of determination of the OLS estimation of (9.1) $R^2_{\hat{u}^2}$; then

\[ F = \frac{R^2_{\hat{u}^2}/k}{(1 - R^2_{\hat{u}^2})/(n - k - 1)}. \]

The F statistic for testing the joint significance of all regressors is

generally given by the appropriate software.

• H0 is rejected if F exceeds the critical value for a chosen significance

level (or equivalently if the p-value is smaller than the significance

level).
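A minimal sketch of the R² version of the Breusch-Pagan test, assuming simulated data and the hypothetical helper names `ols_residuals` and `breusch_pagan_f` (neither is from the slides). It regresses the squared OLS residuals on the original regressors as in (9.1) and plugs the resulting R² into the F formula.

```python
import numpy as np

def ols_residuals(X, y):
    """Residuals of an OLS regression of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

def breusch_pagan_f(X, y):
    """F statistic of regression (9.1); X contains the constant in its first column."""
    n, p = X.shape
    k = p - 1                                   # number of regressors without the constant
    u_hat_sq = ols_residuals(X, y) ** 2         # squared OLS residuals replace u_i^2
    v_hat = ols_residuals(X, u_hat_sq)          # residuals of the auxiliary regression
    sst = np.sum((u_hat_sq - u_hat_sq.mean()) ** 2)
    r2 = 1.0 - np.sum(v_hat ** 2) / sst         # R^2 of the auxiliary regression
    f_stat = (r2 / k) / ((1.0 - r2) / (n - k - 1))
    return float(f_stat), float(r2)

# Simulated data (hypothetical): error standard deviation constant vs. proportional to x
rng = np.random.default_rng(1)
n = 500
x = rng.uniform(1.0, 5.0, size=n)
X = np.column_stack([np.ones(n), x])
y_homo = 1.0 + 2.0 * x + rng.normal(size=n)
y_hetero = 1.0 + 2.0 * x + x * rng.normal(size=n)

f_homo, r2_homo = breusch_pagan_f(X, y_homo)
f_hetero, r2_hetero = breusch_pagan_f(X, y_hetero)
print(f"homoskedastic: F = {f_homo:.2f}; heteroskedastic: F = {f_hetero:.2f}")
```

The F statistic is then compared with the critical value of the F(k, n − k − 1) distribution, which holds only approximately for the reasons given above.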



• Cigarette Example Continued (from Section 8.4):

Dependent Variable: U_HAT_SQ, Method: Least Squares, Sample: 1 807

Variable    Coefficient   Std. Error   t-Statistic   Prob.
C           -636.3031     652.4946     -0.975185     0.3298
LINCOME     24.63849      19.72180     1.249302      0.2119
LCIGPRIC    60.97656      156.4487     0.389754      0.6968
EDUC        -2.384226     4.527535     -0.526606     0.5986
AGE         19.41748      4.339068     4.475034      0.0000
AGE^2       -0.214790     0.047234     -4.547398     0.0000
RESTAURN    -71.18138     30.12789     -2.362641     0.0184

S.E. of regression 363.2491, Sum squared resid 1.06E+08, Log likelihood -5898.905


The F statistic for the above H0 is 5.55 and the corresponding p-

value is smaller than 1%. The null hypothesis of homoskedastic

errors thus is rejected at a level of 1%.



• Note:

– If one conjectures that the heteroskedasticity is caused by specific

variables that have not been included previously, they can be

included in regression (9.1).

– If H0 is not rejected, this does not mean automatically that the

ui’s are homoskedastic. If the specification (9.1) does not con-

tain all relevant variables causing heteroskedasticity, then it may

happen that all δj, j = 1, . . . , k, are jointly insignificant.

– A variant of the Breusch-Pagan test is a test for multiplicative heteroskedasticity, i.e. the variance is of the form $\sigma_i^2 = \sigma^2 \cdot h(x_i'\beta)$. If, for example, the case $h(\cdot) = \exp(\cdot)$ is assumed, the test equation $\ln(u_i^2) = \ln(\sigma^2) + x_i'\beta + v_i$ results.


White Test

• Background:

For deriving the asymptotic distribution of the OLS estimator the

assumption of homoskedastic errors MLR.5 is not necessary.

It is enough that the squared errors $u_i^2$ are uncorrelated with all regressors and the squares and cross products of the latter.

This can easily be tested using the following regression, where the errors are already replaced by the residuals:

\[
\begin{aligned}
\hat{u}_i^2 = {} & \delta_0 + \delta_1 x_{i1} + \cdots + \delta_k x_{ik} \\
& + \delta_{k+1} x_{i1}^2 + \cdots + \delta_{J_1} x_{ik}^2 \\
& + \delta_{J_1+1} x_{i1} x_{i2} + \cdots + \delta_{J_2} x_{i,k-1} x_{ik} \\
& + v_i, \qquad i = 1, \ldots, n.
\end{aligned}
\tag{9.2}
\]


• The pair of hypotheses is:

H0 : δj = 0 for all j = 1, 2, . . . , J2, vs. H1 : δj ≠ 0 for at least one j.

Again, an F test can be used whose distribution is approximated by the F distribution (asymptotic distribution).

• With many regressors, it is tedious to implement the F test for (9.2)

manually. However, most software packages provide the White test.

• When implementing the White test, a large number of parameters

has to be estimated if the original model exhibits large k. This is

hardly possible in small samples. Then one includes only the squares $x_{ij}^2$ in the regression and neglects all cross products.

• Note: If the null hypothesis is rejected, this may also be due to

violation of MLR.1 or MLR.4. Then, the original regression is mis-

specified!
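Building the regressors of (9.2) by hand is tedious, but it can be sketched programmatically. The following is an illustrative sketch on simulated data, with the helper names `white_regressors` and `white_f` chosen for this example only. It assumes continuous regressors; for dummy variables the square coincides with the level and would have to be dropped to avoid perfect collinearity.

```python
import numpy as np
from itertools import combinations

def white_regressors(Z):
    """Constant, levels, squares and cross products of the columns of Z (Z has no constant)."""
    n, k = Z.shape
    columns = [np.ones(n)]
    columns += [Z[:, j] for j in range(k)]                                  # levels
    columns += [Z[:, j] ** 2 for j in range(k)]                             # squares
    columns += [Z[:, i] * Z[:, j] for i, j in combinations(range(k), 2)]    # cross products
    return np.column_stack(columns)

def white_f(Z, y):
    """F statistic for H0: delta_1 = ... = delta_J2 = 0 in regression (9.2)."""
    n, _ = Z.shape
    X = np.column_stack([np.ones(n), Z])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    u_hat_sq = (y - X @ beta) ** 2               # squared OLS residuals
    W = white_regressors(Z)
    J2 = W.shape[1] - 1                          # number of tested coefficients
    delta, *_ = np.linalg.lstsq(W, u_hat_sq, rcond=None)
    ssr = np.sum((u_hat_sq - W @ delta) ** 2)
    sst = np.sum((u_hat_sq - u_hat_sq.mean()) ** 2)
    r2 = 1.0 - ssr / sst
    f_stat = (r2 / J2) / ((1.0 - r2) / (n - J2 - 1))
    return float(f_stat), J2

# Simulated data (hypothetical): error standard deviation proportional to the first regressor
rng = np.random.default_rng(3)
n = 1000
Z = rng.uniform(1.0, 5.0, size=(n, 2))
y = 1.0 + Z @ np.array([1.0, -1.0]) + Z[:, 0] * rng.normal(size=n)

f_white, J2 = white_f(Z, y)
print(f"White F statistic: {f_white:.2f} with ({J2}, {n - J2 - 1}) degrees of freedom")
```

With k regressors the test equation contains J2 = 2k + k(k − 1)/2 slope coefficients, which illustrates why the version without cross products is used when k is large relative to n.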



• Cigarette Example Continued:

White Heteroskedasticity Test:

F-statistic 2.159257 Probability 0.000905

Obs*R-squared 52.17244 Probability 0.001140

Test Equation: Dependent Variable: RESID^2, Method: Least Squares, Sample: 1 807

Variable Coefficient Std. Error t-Statistic Prob.


C 29374.75 20559.14 1.428793 0.1535

LINCOME -1049.627 963.4360 -1.089462 0.2763

LINCOME^2 -3.941187 17.07122 -0.230867 0.8175

LINCOME*LCIGPRIC 329.8888 239.2417 1.378893 0.1683

LINCOME*EDUC -9.591844 8.047067 -1.191968 0.2336

LINCOME*AGE -3.354564 6.682195 -0.502015 0.6158

LINCOME*(AGE^2) 0.026704 0.073025 0.365689 0.7147

LINCOME*RESTAURN -59.88701 49.69040 -1.205203 0.2285

LCIGPRIC -10340.67 9754.559 -1.060086 0.2894

LCIGPRIC^2 668.5282 1204.316 0.555110 0.5790

LCIGPRIC*EDUC 32.91400 59.06252 0.557274 0.5775

LCIGPRIC*AGE 62.88178 55.29011 1.137306 0.2558

LCIGPRIC*(AGE^2) -0.622372 0.594730 -1.046479 0.2957

LCIGPRIC*RESTAURN 862.1558 720.6219 1.196405 0.2319

EDUC -117.4717 251.2852 -0.467484 0.6403

EDUC^2 -0.290344 1.287605 -0.225492 0.8217

EDUC*AGE 3.617047 1.724659 2.097253 0.0363

EDUC*(AGE^2) -0.035558 0.017664 -2.012988 0.0445

EDUC*RESTAURN -2.896491 10.65709 -0.271790 0.7859



AGE -264.1467 235.7624 -1.120394 0.2629

AGE^2 3.468605 3.194651 1.085754 0.2779

AGE*(AGE^2) -0.019111 0.028655 -0.666935 0.5050

AGE*RESTAURN -4.933195 10.84029 -0.455080 0.6492

(AGE^2)^2 0.000118 0.000146 0.807552 0.4196

(AGE^2)*RESTAURN 0.038446 0.120459 0.319160 0.7497

RESTAURN -2868.188 2986.776 -0.960296 0.3372



S.E. of regression 362.8853, Sum squared resid 1.03E+08, Log likelihood -5888.398


Result: With the White test H0 is also rejected.



Trade Example Continued

(from Section 8.4):

• Breusch-Pagan test for heteroskedasticity using OLS residuals

====================================================================
Heteroskedasticity Test: Breusch-Pagan-Godfrey
====================================================================
F-statistic            4.139262    Prob. F(5,46)          0.0035
Obs*R-squared          16.13595    Prob. Chi-Square(5)    0.0065
Scaled explained SS    11.61688    Prob. Chi-Square(5)    0.0404
====================================================================
Test Equation:
Dependent Variable: RESID^2
====================================================================
Variable                   Coefficient   Std. Error   t-Statistic   Prob.
====================================================================
C                          59.29291      40.51240     1.463575      0.1501
LOG(WDI_GDPUSDCR_O)        -4.634815     3.174932     -1.459815     0.1511
(LOG(WDI_GDPUSDCR_O))^2    0.075281      0.061518     1.223720      0.2273
LOG(CEPII_DIST)            1.382417      0.745464     1.854439      0.0701
CEPII_COMCOL_REV           -1.642264     0.978128     -1.678986     0.0999
====================================================================


• White test (without cross terms) for heteroskedasticity using OLS residuals

====================================================================
Heteroskedasticity Test: White
====================================================================
Test Equation:
Dependent Variable: RESID^2
====================================================================
Variable                       Coefficient   Std. Error   t-Statistic   Prob.
====================================================================
C                              18.75859      10.91313     1.718900      0.0924
((LOG(WDI_GDPUSDCR_O))^2)^2    3.05E-05      2.34E-05     1.307676      0.1975
(LOG(CEPII_DIST))^2            0.090498      0.049960     1.811423      0.0766
CEPII_COMCOL_REV^2             -1.582928     0.992436     -1.594992     0.1176
CEPII_COMLANG_OFF^2            0.291977      1.766624     0.165274      0.8695
====================================================================


• Breusch-Pagan test for heteroskedasticity using standardized FGLS residuals

====================================================================
Heteroskedasticity Test: Breusch-Pagan-Godfrey
====================================================================
Test Equation:
Dependent Variable: WGT_RESID^2
====================================================================
Variable                       Coefficient   Std. Error   t-Statistic   Prob.
====================================================================
C                              -0.323375     1.044251     -0.309672     0.7582
LOG(WDI_GDPUSDCR_O)*WGT        0.122435      0.229469     0.533560      0.5962
(LOG(WDI_GDPUSDCR_O))^2*WGT    -0.010813     0.007708     -1.402801     0.1674
LOG(CEPII_DIST)*WGT            0.678927      0.586156     1.158269      0.2527
CEPII_COMCOL_REV*WGT           -0.781193     0.641851     -1.217095     0.2298
CEPII_COMLANG_OFF*WGT          0.299789      1.228329     0.244062      0.8083
====================================================================
Adjusted R-squared -0.031997    S.D. dependent var 1.116472
====================================================================


• White test (without cross terms) for heteroskedasticity using FGLS residuals

====================================================================
Heteroskedasticity Test: White
====================================================================
Test Equation:
Dependent Variable: WGT_RESID^2
====================================================================
Variable                             Coefficient   Std. Error   t-Statistic   Prob.
====================================================================
C                                    0.568776      0.520032     1.093733      0.2799
WGT^2                                6.158506      9.676452     0.636443      0.5277
(LOG(WDI_GDPUSDCR_O))^2*WGT^2        -0.014641     0.023638     -0.619379     0.5388
((LOG(WDI_GDPUSDCR_O))^2)^2*WGT^2    6.39E-06      1.41E-05     0.454559      0.6516
(LOG(CEPII_DIST))^2*WGT^2            0.021214      0.022901     0.926322      0.3592
CEPII_COMCOL_REV^2*WGT^2             -0.941910     1.019790     -0.923632     0.3606
CEPII_COMLANG_OFF^2*WGT^2            -0.421654     1.220702     -0.345420     0.7314
====================================================================
Adjusted R-squared -0.057392    S.D. dependent var 1.116472
====================================================================

Results:

– Note that the specification of the White test without cross terms

follows EViews 6.0 and does not include level terms (in contrast

to (9.2)).

– Both the Breusch-Pagan and the White test reject the null hypothesis of homoskedastic errors for the OLS residuals at the 1% significance level. Thus, using OLS with heteroskedasticity-robust standard errors or FGLS in Section 8.4 was justified.

– Neither the Breusch-Pagan nor the White test rejects the null hypothesis of homoskedastic standardized errors in the FGLS framework. Both p-values are above 60%. Thus, the variance regression in Section 8.4 does not seem to be misspecified.

– In sum, among all models and estimation procedures considered,

the FGLS estimates of Model 5 seem to be the most reliable ones.

Reading: Chapter 8 in Wooldridge (2009) (without Section 8.5 con-

cerning linear probability models).



9.3 Model Specification II: Useful Tests

9.3.1 Comparing Models with Identical Regressand

Starting point: two non-nested models

(M1) y = x0β0 + . . . + xkβk + u = x′β + u,

(M2) y = z0γ0 + . . . + zmγm + v = z′γ + v,

where k = m does not have to hold.

Decision between (M1) and (M2): using

• information criteria (AIC, SC, HQ, ...),

• encompassing test,

• non-nested F test,

• J test.

All three tests can be based on the encompassing principle.

Encompassing Principle

Let two non-nested models be given:

(M1) y = x′β + u,

(M2) y = z′γ + v.

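For clarifying the non-nested relationship between (M1) and (M2), define

\[
x' = \begin{pmatrix} w' & x_B' \end{pmatrix}, \quad
\beta = \begin{pmatrix} \beta_A \\ \beta_B \end{pmatrix}, \qquad
z' = \begin{pmatrix} w' & z_B' \end{pmatrix}, \quad
\gamma = \begin{pmatrix} \gamma_A \\ \gamma_B \end{pmatrix},
\]

such that w contains all common regressors:

(M1) y = w′βA + x′BβB + u,

(M2) y = w′γA + z′BγB + v.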


Idea of the encompassing principle:

• If (M1) is correctly specified, it must be able to explain the results

of an estimation of (M2) (and vice versa).

• If not, (M1) has to be rejected (and vice versa).

Derivation:

Consider the “artificial nesting model”

(ANM) $y = w'a + x_B' b_x + z_B' b_z + \varepsilon, \quad E[\varepsilon \mid w, x_B, z_B] = 0.$

Different settings:

• (ANM) is the correctly specified model, so that (M1) and (M2) are misspecified. Model (M2) is estimated.

• (M1) is the correctly specified model. Model (M2) is estimated.


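Since (ANM) is correct, it holds that E[ε|w, xB, zB] = 0 and thus, by the law of iterated expectations, E[ε|w, zB] = E[0] = 0.

When estimating (M2) instead of (ANM), one gets, writing the conditional expectation of the omitted regressors as E[x′B|w, zB] = w′q + z′Bp,

\[
\begin{aligned}
E[y \mid w, z_B] &= w'a + [w'q + z_B'p]\, b_x + z_B' b_z \\
&= w' \underbrace{[a + q b_x]}_{\gamma_A} + z_B' \underbrace{[b_z + p b_x]}_{\gamma_B}.
\end{aligned}
\tag{9.3}
\]

Note that the biases q bx and p bx are caused by omitting the variable xB. These effects bias the direct impact of w via a and of zB via bz on y.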



Then bz = 0 and from (9.3) the following restriction results:

pbx = γB.

Now it can be seen that knowing the correctly specified model (M1)

is enough for deriving model (M2), thus predicting γB or the expec-

tation of the OLS estimator. In other words: Since (M2) is “smaller”

than (M1) with respect to the relevant variables, the behavior of

(M2) can be predicted with the help of (M1) when an unbiased es-

timator is used for the latter. Then one says “(M1) encompasses

(M2)”. (Knowing (M1) is not enough here if (ANM) is the correct model, i.e. if bz ≠ 0.)


Can be derived just as in the above case.



Thus, for the null hypothesis “(M1) encompasses (M2)” two equivalent

hypotheses can be tested:

• H0 : p bx − γB = 0 - more complicated, no details here. (This version is often termed encompassing test and sometimes has advantages in more general models.)

• H0 : bz = 0 in (ANM) - easy: with the help of a non-nested F test.

Procedure for more than two alternatives: selection procedure

• Based on this same principle, the remaining model competes with

further alternative models as long as it is not rejected.

• Problem of this principle: it can happen that both null hypotheses

have to be rejected.



Non-nested F test


• Hypotheses: “H0: model (M1) is correct” versus “H1: model (M1)

incorrect”.

• Again, partition z′ = (w′, z′B), where the kA regressors from w are

contained in x but the kB regressors from zB are not contained.

• Formulate the artificial nesting model (ANM)

y = x′β + z′Bbz + ε.

• Based on this ANM, test

H0 : bz = 0

using an F test with kB degrees of freedom in the numerator and n − k − 1 − kB in the denominator (the ANM contains k + 1 + kB regressors).


• For the test of (M2) vs. (M1) proceed analogously with partition

x′ = (w′,x′B) ...

J test (Davidson-MacKinnon test)


• For the J test the ANM is formulated such that both (M1) and

(M2) are nested in the ANM:

y = (1 − λ)x′β + λz′γ + ε.

For the case λ = 0 the model (M1) results, for λ = 1 model (M2).

• Problem: λ, β and γ are not identified in the above approach.

• Solution: replace γ by the OLS estimator γ̂ from (M2).

I.e. test H0 : λ = 0 with test equation y = x′β∗ + λŷM2 + η, where β∗ = (1 − λ)β and ŷM2 = z′γ̂ is the fitted value from the OLS estimation of (M2).


• For testing whether (M2) is valid, proceed analogously ...

• Interpretation of the logic of the test:

For testing model (M1), it is enlarged by the fitted values of model (M2); these (i.e. the part of y explained by the regressors in (M2)) are tested for their significance in the test equation.

• Advantages of the J test compared to the non-nested F test:

– only one single restriction has to be tested,

– higher power, if kB or respectively mB are very large,

– in case of kB = 1 or respectively mB = 1 the tests are equivalent.
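The J test amounts to two OLS regressions and one t statistic. The sketch below illustrates this on simulated data; the helper names `ols_fit` and `j_test_t_stat`, the homoskedastic OLS standard errors and the data-generating process (in which (M2) is the true model) are assumptions of this example.

```python
import numpy as np

def ols_fit(X, y):
    """OLS coefficients and residuals."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, y - X @ beta

def j_test_t_stat(X, Z, y):
    """t statistic for H0: lambda = 0 when model y = X beta + u is tested against y = Z gamma + v."""
    gamma_hat, _ = ols_fit(Z, y)
    y_hat_alt = Z @ gamma_hat                         # fitted values of the alternative model
    X_aug = np.column_stack([X, y_hat_alt])           # test equation: y = x'beta* + lambda * y_hat + eta
    coef, resid = ols_fit(X_aug, y)
    n, p = X_aug.shape
    sigma2 = resid @ resid / (n - p)                  # usual OLS error variance estimate
    cov = sigma2 * np.linalg.inv(X_aug.T @ X_aug)
    return float(coef[-1] / np.sqrt(cov[-1, -1]))

# Simulated data (hypothetical): y depends on z only, x is merely correlated with z
rng = np.random.default_rng(4)
n = 300
z = rng.normal(size=n)
x = 0.5 * z + rng.normal(size=n)
y = 1.0 + 2.0 * z + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])                  # regressors of (M1)
Z = np.column_stack([np.ones(n), z])                  # regressors of (M2)

t_m1 = j_test_t_stat(X, Z, y)   # tests (M1): fitted values of (M2) added
t_m2 = j_test_t_stat(Z, X, y)   # tests (M2): fitted values of (M1) added
print(f"testing (M1): t = {t_m1:.2f}; testing (M2): t = {t_m2:.2f}")
```

In this setup the test of (M1) should produce a large t statistic, whereas the test of (M2) should typically stay within the usual critical values.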



9.3.2 Comparing Models with Differing Regressand

Idea and implementation (of the P test):

Example: linear model versus log-log alternative

• Step 1: Run an OLS estimation for both models.

• Step 2: Compute the corresponding fitted values $\hat{y}_{lin}$ (linear model) and $\widehat{\ln y}_{log}$ (log-log model).

• Step 3a: Test the linear approach against the log-log alternative using the ANM

\[ y = \sum_j x_j \beta_{j,lin} + \delta_{lin}\,[\ln(\hat{y}_{lin}) - \widehat{\ln y}_{log}] + u, \]

by a t test with the null hypothesis

H0 : δlin = 0 (linear model is correct).


• Step 3b: Test the log-log approach against the linear alternative using the ANM

\[ \ln(y) = \sum_j \ln(x_j)\, \beta_{j,log} + \delta_{log}\,[\hat{y}_{lin} - \exp(\widehat{\ln y}_{log})] + v, \]

by a t test with the null hypothesis

H0 : δlog = 0 (log-log model is correct).

Problem: it is possible that both hypotheses are rejected (i.e. another functional form is relevant) or that neither can be rejected (e.g. because the tests lack power).

Note: in this case a comparison using the information criteria is not

possible.
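Steps 1 to 3b of the P test can be sketched as follows. This is an illustration under stated assumptions rather than the course's own implementation: the helper names `fitted_values`, `t_stat_last` and `p_test`, the single regressor and the simulated log-log data are all hypothetical. Note that Step 3a requires strictly positive fitted values from the linear model, since their logarithm is taken.

```python
import numpy as np

def fitted_values(X, y):
    """OLS fitted values."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta

def t_stat_last(X, y):
    """OLS t statistic of the last coefficient (homoskedastic standard errors)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n, p = X.shape
    cov = (resid @ resid / (n - p)) * np.linalg.inv(X.T @ X)
    return float(beta[-1] / np.sqrt(cov[-1, -1]))

def p_test(x, y):
    """Return (t_lin, t_log) for H0: delta_lin = 0 and H0: delta_log = 0."""
    n = len(y)
    X_lin = np.column_stack([np.ones(n), x])
    X_log = np.column_stack([np.ones(n), np.log(x)])
    y_hat_lin = fitted_values(X_lin, y)                 # Steps 1 and 2, linear model
    lny_hat_log = fitted_values(X_log, np.log(y))       # Steps 1 and 2, log-log model
    # Step 3a: add ln(y_hat_lin) - lny_hat_log to the linear model
    t_lin = t_stat_last(np.column_stack([X_lin, np.log(y_hat_lin) - lny_hat_log]), y)
    # Step 3b: add y_hat_lin - exp(lny_hat_log) to the log-log model
    t_log = t_stat_last(np.column_stack([X_log, y_hat_lin - np.exp(lny_hat_log)]), np.log(y))
    return t_lin, t_log

# Simulated data (hypothetical): the log-log model is the true one
rng = np.random.default_rng(5)
n = 400
x = rng.uniform(2.0, 4.0, size=n)
y = np.exp(0.5 + 2.0 * np.log(x) + 0.05 * rng.normal(size=n))

t_lin, t_log = p_test(x, y)
print(f"H0 linear model: t = {t_lin:.2f}; H0 log-log model: t = {t_log:.2f}")
```

Here the linear model should be rejected clearly, while the log-log model should usually not be rejected.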

Reading: Chapter 9 in Wooldridge (2009).


10 Appendix

10.1 A Condensed Introduction to Probability

Preliminary Statement: The following pages are not meant as a deterrent, but as a supplement to the expositions found in introductory econometrics textbooks. This supplement is intended to explain the intuition underlying the large number of definitions and concepts in probability theory. Nevertheless, formulas cannot be avoided completely, and it may take some time to absorb them.


• Sample space, outcome space:

The set Ω contains all possible outcomes of a random experiment.

This set can contain finitely many or infinitely many outcomes.

Examples:

– Urn with 4 balls of different color: Ω = {yellow, red, blue, green}

– Monthly income of a household in the future: Ω = [0,∞)

Remark:

– If there is a finite number of outcomes, they are often denoted as ωi. For S outcomes, Ω appears as

Ω = {ω1, ω2, . . . , ωS}.

– If there is an infinite number of outcomes, each one is often denoted as ω.


• Event:

Every set of possible outcomes, i.e. every subset of the set Ω including Ω itself.

Examples:

– Urn example: possible events are for example {yellow, red} or {red, blue, green}.

– Household income: possible events are all possible subintervals and combinations of them, e.g. (0, 5000], [1000, 1001), (400,∞), {4000}, and so on.

Remark: By using the general point of view with the ω’s, one has

– for the case of S outcomes: {ω1, ω2}, {ωS}, {ω3, . . . , ωS}, and so on.

– for the case of infinitely many outcomes located inside an interval Ω = (−∞,∞): (a1, b1], [a2, b2), (0,∞), and so on, where the lower bound always has to be less than or equal to the upper bound (ai ≤ bi).

• Random variable:

A random variable is a function that assigns a real number X(ω) to

each outcome ω ∈ Ω.

Urn example: X(ω1) = 0, X(ω2) = 3, X(ω3) = 17, X(ω4) = 20.

• Density function

– Preliminary statement: As we have already seen, it gets complicated if Ω contains infinitely many outcomes. Consider for example Ω = [0, 4]. If one wants to compute the probability for the number π to appear, this probability is equal to zero. If it were not equal to zero, we would have the problem that the sum of the probabilities of all (infinitely many) numbers could not be equal to 1. What to do?

– A way out is the following trick: Consider the probability that the outcome of the random variable X is located in the interval [0, x], with x < 4. This probability can be written as P(X ≤ x). Now determine how the probability changes when the interval [0, x] is extended by h. The answer is P(X ≤ x + h) − P(X ≤ x). By relating this change in probability to the interval length one gets

\[ \frac{P(X \le x + h) - P(X \le x)}{h}. \]

For a decreasing interval length h that approaches zero, one obtains the following limit:

\[ \lim_{h \to 0} \frac{P(X \le x + h) - P(X \le x)}{h} = f(x). \]

This limit is called probability density function or shortly density function that belongs to the probability function P.

– How to interpret a density function?

By using the sloppy formulation

\[ \frac{P(X \le x + h) - P(X \le x)}{h} \approx f(x) \]

and rewriting as

\[ P(X \le x + h) - P(X \le x) \approx f(x)\, h, \]

one can see that f(x) determines the rate of change for the probability that X falls into the interval [0, x] if the interval length is extended by h. Hence, the density function is a rate.

– As the density function is a derivative, we get conversely for our example

\[ \int_0^x f(u)\, du = P(X \le x) = F(x). \]


Here, F(x) = P(X ≤ x) is called probability distribution function. Certainly, in this example we get

\[ \int_0^4 f(u)\, du = P(X \le 4) = 1. \]

In general, the integral of the density function over the full support of the random variable yields a value of 1. Consider for example X(ω) ∈ R:

\[ \int_{-\infty}^{\infty} f(u)\, du = P(X \le \infty) = 1. \]
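The relationship between distribution function and density can be checked numerically. The sketch below assumes a hypothetical random variable on [0, 4] with F(x) = (x/4)² and hence f(x) = x/8; this particular distribution is an illustration only and does not appear in the text.

```python
import numpy as np

def F(x):
    """Distribution function of a hypothetical X on [0, 4]: F(x) = (x / 4)^2."""
    return (np.clip(x, 0.0, 4.0) / 4.0) ** 2

def f(x):
    """Corresponding density: f(x) = x / 8 on [0, 4], zero elsewhere."""
    return np.where((x >= 0.0) & (x <= 4.0), x / 8.0, 0.0)

# Difference quotient (P(X <= x + h) - P(X <= x)) / h for a small h
x0, h = 2.0, 1e-6
difference_quotient = float((F(x0 + h) - F(x0)) / h)

# Midpoint Riemann sum of the density over the full support [0, 4]
edges = np.linspace(0.0, 4.0, 4001)
midpoints = 0.5 * (edges[:-1] + edges[1:])
total_probability = float(np.sum(f(midpoints) * np.diff(edges)))

print(f"difference quotient at x = 2: {difference_quotient:.6f}, density f(2) = {float(f(x0)):.6f}")
print(f"integral of the density over [0, 4]: {total_probability:.6f}")
```

The difference quotient approaches f(x) as h shrinks, and the density integrates to 1 over the support, as stated above.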



• Conditional probability function

Let’s begin with an example:

Let the random variable X ∈ [0,∞) be the payoff in a lottery.

The probability function (distribution function) P(X ≤ x) = F(x) is the probability that the payoff is at most x. Additionally, we know that there are two machines (machine A and B) that determine the payoff.

Question: What is the probability that the payoff is at most x if machine A is used?

In other words, what is the probability of interest if the condition

“Machine A is used” is applied? Hence, the probability under con-

sideration is also called conditional probability and written as

P (X ≤ x|A).



Accordingly one writes P (X ≤ x|B), if the condition “Machine B

is used” is applied.

Question: What is the relationship between the unconditional probability P(X ≤ x) and the conditional probabilities P(X ≤ x|A) and P(X ≤ x|B)?

To answer this question one has to clarify what the corresponding

probabilities of using machine A or B are. Denoting these proba-

bilities by P (A) and P (B) we have:

\[ P(X \le x) = P(X \le x \mid A)\, P(A) + P(X \le x \mid B)\, P(B), \]

\[ F(x) = F(x \mid A)\, P(A) + F(x \mid B)\, P(B). \]

In this example there are two outcomes. The corresponding relationship can be extended to n discrete outcomes Ω = {A1, A2, . . . , An}:

\[ F(x) = F(x \mid A_1) P(A_1) + F(x \mid A_2) P(A_2) + \cdots + F(x \mid A_n) P(A_n). \tag{10.1} \]
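The two-machine relationship can be illustrated by simulation. In the sketch below the probabilities P(A) = 0.3 and P(B) = 0.7 and the exponential payoff distributions with means 100 and 400 are purely hypothetical choices for this example.

```python
import numpy as np

p_a, p_b = 0.3, 0.7                  # hypothetical probabilities of using machine A or B
mean_a, mean_b = 100.0, 400.0        # hypothetical mean payoffs of the two machines

def cdf_given_machine(x, mean):
    """Conditional distribution function F(x | machine) of an exponential payoff."""
    return 1.0 - np.exp(-x / mean)

def cdf_unconditional(x):
    """F(x) = F(x|A) P(A) + F(x|B) P(B)."""
    return cdf_given_machine(x, mean_a) * p_a + cdf_given_machine(x, mean_b) * p_b

# Simulate the lottery: first draw the machine, then the payoff of that machine
rng = np.random.default_rng(6)
n = 200_000
uses_a = rng.random(n) < p_a
payoff = np.where(uses_a, rng.exponential(mean_a, n), rng.exponential(mean_b, n))

x = 250.0
empirical = float(np.mean(payoff <= x))
theoretical = float(cdf_unconditional(x))
print(f"P(X <= {x}): simulated {empirical:.4f}, formula {theoretical:.4f}")
```

Up to simulation noise, the relative frequency of payoffs not exceeding x agrees with the weighted sum of the conditional distribution functions.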

Until now we defined the conditions in terms of events and not in terms of random variables. An example of the latter would be a payoff that is determined by only one machine whose mode of operation depends on a magnitude Z. In this case, the conditional distribution function is F(x|Z = z), with Z = z meaning that the random variable Z takes exactly the value z. For relating the unconditional and conditional probability we have to replace the sum by an integral, and the probability of the conditioning event by the corresponding density function, as Z can take infinitely many values. For our example we obtain:

\[ F(x) = \int_0^{\infty} F(x \mid Z = z)\, f(z)\, dz = \int_0^{\infty} F(x \mid z)\, f(z)\, dz \]

or generally

\[ F(x) = \int F(x \mid Z = z)\, f(z)\, dz = \int F(x \mid z)\, f(z)\, dz. \tag{10.2} \]

Another important property:

If the random variables X and Z are stochastically independent, we

have

F (x|z) = F (x).

• Conditional density function

The conditional density function can be heuristically derived from the conditional distribution function in the same way as for the case of the unconditional density function: one simply replaces the unconditional probabilities by conditional probabilities. The conditional density function arises from

\[ \lim_{h \to 0} \frac{P(X \le x + h \mid A) - P(X \le x \mid A)}{h} = f(x \mid A). \]

For finitely many conditions equation (10.1) becomes

\[ f(x) = f(x \mid A_1) P(A_1) + f(x \mid A_2) P(A_2) + \cdots + f(x \mid A_n) P(A_n). \]

The relationship (10.2) turns into

\[ f(x) = \int f(x \mid Z = z)\, f(z)\, dz = \int f(x \mid z)\, f(z)\, dz. \tag{10.3} \]

• Expectation

Consider again the payoff example.

Question: Which payoff would you expect “on average”?

Answer: $\int_0^{\infty} x f(x)\, dx$. For a payoff paid in n different discrete amounts, one would expect $\sum_{i=1}^{n} x_i P(X = x_i)$ on average. Each possible payoff is multiplied by its probability of occurrence and added up. It is not surprising that the result is denoted as expectation.

In general the expectation is defined as

\[ E[X] = \int x f(x)\, dx, \quad \text{continuous } X, \]

\[ E[X] = \sum_i x_i P(X = x_i), \quad \text{discrete } X. \]


• Rules for the expectation (see e.g. Appendix B in Wooldridge (2009)).

1. For each constant c it holds that

E[c] = c.

2. For all constants a and b and all random variables X and Y it

holds that

E[aX + bY ] = aE[X ] + bE[Y ].

3. If the random variables X and Y are independent, it holds that

E[Y X ] = E[Y ]E[X ].

• Conditional expectation

So far we did not care about the machine that was used to create the payoff. If we are interested in the expected payoff of using machine A, we have to calculate the conditional expectation

\[ E[X \mid A] = \int_0^{\infty} x f(x \mid A)\, dx. \]

This is easily achieved by replacing the unconditional density f(x) by the conditional density f(x|A) and stating the condition in the notation of expectations accordingly. Analogously the expected payoff for machine B is determined as

\[ E[X \mid B] = \int_0^{\infty} x f(x \mid B)\, dx. \]

In general one has for discrete conditioning events

\[ E[X \mid A] = \int x f(x \mid A)\, dx, \quad \text{continuous } X, \]

\[ E[X \mid A] = \sum_i x_i P(X = x_i \mid A), \quad \text{discrete } X, \]

and for continuous conditions

\[ E[X \mid Z = z] = \int x f(x \mid Z = z)\, dx, \quad \text{continuous } X, \]

\[ E[X \mid Z = z] = \sum_i x_i P(X = x_i \mid Z = z), \quad \text{discrete } X. \]

Remark: Frequently, the short versions are used, as in Wooldridge (2009):

\[ E[X \mid z] = \int x f(x \mid z)\, dx, \quad \text{continuous } X, \]

\[ E[X \mid z] = \sum_i x_i P(X = x_i \mid z), \quad \text{discrete } X. \]

In analogy to the relationship between unconditional and conditional probabilities there is a similar relationship between unconditional and conditional expectations. The relationship is

\[ E[X] = E\big[ E[X \mid Z] \big], \]

which is denoted as law of iterated expectations (LIE).


yields

E[X] = E[X|A]P(A) + E[X|B]P(B).

This example also shows that the conditional expectations E[X|A] and E[X|B] are random variables. If they are weighted by the corresponding probabilities of occurrence P(A) and P(B), they yield E[X].

Suppose that, prior to the lottery, you only know both conditional

expectations but not which machine is used. Then the expected

payoff is equal to E[X ] and both conditional expectations are con-

sidered as random variables. After knowing what machine is used,

the corresponding conditional expectation is the outcome of the ran-

dom variable. This is a general property of conditional expectations.
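A small discrete version of the lottery makes the law of iterated expectations concrete. All numbers below (the payoff amounts, the machine probabilities and the conditional payoff probabilities) are hypothetical values chosen for this illustration.

```python
import numpy as np

payoffs = np.array([0.0, 10.0, 100.0])           # possible payoffs x_i
p_machine = {"A": 0.4, "B": 0.6}                 # P(A), P(B)
p_payoff_given = {                               # P(X = x_i | machine)
    "A": np.array([0.5, 0.5, 0.0]),
    "B": np.array([0.9, 0.0, 0.1]),
}

# Conditional expectations E[X|A] and E[X|B]
cond_exp = {m: float(payoffs @ p) for m, p in p_payoff_given.items()}

# Law of iterated expectations: E[X] = E[X|A] P(A) + E[X|B] P(B)
e_x_lie = sum(cond_exp[m] * p_machine[m] for m in p_machine)

# Direct computation from the unconditional probabilities P(X = x_i)
p_payoff = sum(p_payoff_given[m] * p_machine[m] for m in p_machine)
e_x_direct = float(payoffs @ p_payoff)

print(f"E[X|A] = {cond_exp['A']}, E[X|B] = {cond_exp['B']}")
print(f"E[X] via LIE = {e_x_lie}, E[X] directly = {e_x_direct}")
```

Before the machine is known, the conditional expectation is a random variable taking the value E[X|A] with probability P(A) and E[X|B] with probability P(B); its expectation equals E[X].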



• Rules for conditional expectations (see e.g. Appendix B in Wooldridge (2009)).

1. For each function c(·) it holds that

E[c(X)|X ] = c(X).

2. For all functions a(·) and b(·) it holds that

E[a(X)Y + b(X)|X ] = a(X)E[Y |X ] + b(X).

3. If the random variables X and Y are independent, it holds that

E[Y |X ] = E[Y ].

4. Law of iterated expectations (LIE)

E[E[Y |X ]] = E[Y ].

5. E[Y |X ] = E[E[Y |X,Z]|X ].



6. If it holds that E[Y|X] = E[Y], then it also holds that Cov(X, Y) = 0.

7. If E[Y²] < ∞ and E[g(X)²] < ∞ for an arbitrary function g(·), then the following inequalities hold:

\[ E\big[ (Y - E[Y \mid X])^2 \mid X \big] \le E\big[ (Y - g(X))^2 \mid X \big], \]

\[ E\big[ (Y - E[Y \mid X])^2 \big] \le E\big[ (Y - g(X))^2 \big]. \]


10.2 Important Rules of Matrix Algebra

Matrix addition

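\[
A = \begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1K} \\
a_{21} & a_{22} & \cdots & a_{2K} \\
\vdots & \vdots & & \vdots \\
a_{T1} & a_{T2} & \cdots & a_{TK}
\end{pmatrix}, \quad
C = \begin{pmatrix}
c_{11} & c_{12} & \cdots & c_{1K} \\
c_{21} & c_{22} & \cdots & c_{2K} \\
\vdots & \vdots & & \vdots \\
c_{T1} & c_{T2} & \cdots & c_{TK}
\end{pmatrix}.
\]

If A and C are of the same dimension,

\[
A + C = \begin{pmatrix}
a_{11} + c_{11} & a_{12} + c_{12} & \cdots & a_{1K} + c_{1K} \\
a_{21} + c_{21} & a_{22} + c_{22} & \cdots & a_{2K} + c_{2K} \\
\vdots & \vdots & & \vdots \\
a_{T1} + c_{T1} & a_{T2} + c_{T2} & \cdots & a_{TK} + c_{TK}
\end{pmatrix}.
\]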


Matrix multiplication

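\[
A = \begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1K} \\
a_{21} & a_{22} & \cdots & a_{2K} \\
\vdots & \vdots & & \vdots \\
a_{T1} & a_{T2} & \cdots & a_{TK}
\end{pmatrix}, \quad
B = \begin{pmatrix}
b_{11} & b_{12} & \cdots & b_{1L} \\
b_{21} & b_{22} & \cdots & b_{2L} \\
\vdots & \vdots & & \vdots \\
b_{K1} & b_{K2} & \cdots & b_{KL}
\end{pmatrix}.
\]

If the number of columns in A is equal to the number of rows in B, then the product C = AB is defined and the following equality holds for every element in C:

\[
c_{ij} = \begin{pmatrix} a_{i1} & \cdots & a_{iK} \end{pmatrix}
\begin{pmatrix} b_{1j} \\ \vdots \\ b_{Kj} \end{pmatrix}
= a_{i1} b_{1j} + \cdots + a_{iK} b_{Kj} = \sum_{l=1}^{K} a_{il} b_{lj}.
\]

Caution: In general it holds that AB ≠ BA.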
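The element-wise definition and the caution about non-commutativity can be illustrated with a short sketch. The matrices and the helper name `matmul_by_definition` are chosen for this example only; in practice one would simply use NumPy's `@` operator.

```python
import numpy as np

def matmul_by_definition(A, B):
    """Compute C = AB element by element: c_ij = sum over l of a_il * b_lj."""
    T, K = A.shape
    K_rows, L = B.shape
    if K != K_rows:
        raise ValueError("number of columns of A must equal number of rows of B")
    C = np.zeros((T, L))
    for i in range(T):
        for j in range(L):
            C[i, j] = sum(A[i, l] * B[l, j] for l in range(K))
    return C

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[0.0, 1.0],
              [1.0, 0.0]])

AB = matmul_by_definition(A, B)   # multiplying by B from the right swaps the columns of A
BA = matmul_by_definition(B, A)   # multiplying by B from the left swaps the rows of A
print("AB =\n", AB)
print("BA =\n", BA)
```

Here AB and BA are both defined but differ, which is the typical case.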



Transpose of a matrix

Given the (2 × 3)-matrix (i.e. 2 rows, 3 columns)

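\[
A = \begin{pmatrix}
a_{11} & a_{12} & a_{13} \\
a_{21} & a_{22} & a_{23}
\end{pmatrix},
\]

the transpose of A is the (3 × 2)-matrix

\[
A' = \begin{pmatrix}
a_{11} & a_{21} \\
a_{12} & a_{22} \\
a_{13} & a_{23}
\end{pmatrix}.
\]

It holds that

(AB)′ = B′A′.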


Inverse of a matrix

Let A be the (K × K)-matrix

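\[
A = \begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1K} \\
a_{21} & a_{22} & \cdots & a_{2K} \\
\vdots & \vdots & & \vdots \\
a_{K1} & a_{K2} & \cdots & a_{KK}
\end{pmatrix},
\]

then the inverse of A is $A^{-1}$ and is defined by

\[
A A^{-1} = A^{-1} A = I_K = \begin{pmatrix}
1 & 0 & \cdots & 0 \\
0 & 1 & & 0 \\
\vdots & & \ddots & \vdots \\
0 & 0 & \cdots & 1
\end{pmatrix}
\]

with $I_K$ as identity matrix of dimension (K × K).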


The matrix A is invertible if its rows (or, equivalently, its columns) are linearly independent. In other words: No row (column) can be written as a linear combination of the other rows (columns). Technically this is satisfied whenever the determinant of A is different from zero.

Frequently, a noninvertible matrix is called singular.

The calculation of an inverse is better left to a computer. Only for matrices with 2 or 3 columns/rows is the calculation of moderate complexity, so that a manual calculation can be useful.


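Special case of a (2 × 2) matrix:

For a square (2 × 2) matrix

\[
B = \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}
\]

the determinant is computed as

\[ \det(B) = b_{11} b_{22} - b_{21} b_{12} \]

and the inverse as

\[
B^{-1} = \frac{1}{\det(B)} \begin{pmatrix} b_{22} & -b_{12} \\ -b_{21} & b_{11} \end{pmatrix}
= \frac{1}{b_{11} b_{22} - b_{21} b_{12}} \begin{pmatrix} b_{22} & -b_{12} \\ -b_{21} & b_{11} \end{pmatrix}.
\]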


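Example:

\[
C = \begin{pmatrix} 0 & 2 \\ 1 & -1 \end{pmatrix}, \quad \text{with } \det(C) = 0 \cdot (-1) - 1 \cdot 2 = -2,
\]

\[
C^{-1} = \frac{1}{-2} \begin{pmatrix} -1 & -2 \\ -1 & 0 \end{pmatrix}
= \begin{pmatrix} \tfrac{1}{2} & 1 \\ \tfrac{1}{2} & 0 \end{pmatrix}.
\]

Check:

\[
C C^{-1} = \begin{pmatrix} 0 & 2 \\ 1 & -1 \end{pmatrix}
\begin{pmatrix} \tfrac{1}{2} & 1 \\ \tfrac{1}{2} & 0 \end{pmatrix}
= \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.
\]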
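The (2 × 2) formula and the example above can be verified with a few lines of code. The function name `inverse_2x2` is hypothetical; for larger matrices one would rely on a library routine such as `numpy.linalg.inv`.

```python
import numpy as np

def inverse_2x2(B):
    """Inverse of a (2 x 2) matrix via the determinant formula."""
    det = B[0, 0] * B[1, 1] - B[1, 0] * B[0, 1]
    if det == 0:
        raise ValueError("matrix is singular (determinant is zero)")
    adjugate = np.array([[B[1, 1], -B[0, 1]],
                         [-B[1, 0], B[0, 0]]])
    return adjugate / det

C = np.array([[0.0, 2.0],
              [1.0, -1.0]])
C_inv = inverse_2x2(C)

print("C^{-1} =\n", C_inv)
print("C C^{-1} =\n", C @ C_inv)
```

The product C C⁻¹ reproduces the identity matrix, and the result agrees with NumPy's general-purpose inverse.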

Reading: As supplement for matrix algebra and its implementation

in the multiple linear regression framework see Appendices D, E.1 in

Wooldridge (2009).



10.3 Rules for Matrix Differentiation

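• For the (T × 1) vectors

\[
c = \begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_T \end{pmatrix}, \quad
w = \begin{pmatrix} w_1 \\ w_2 \\ \vdots \\ w_T \end{pmatrix}
\]

and

\[
z = c'w = \begin{pmatrix} c_1 & c_2 & \cdots & c_T \end{pmatrix}
\begin{pmatrix} w_1 \\ w_2 \\ \vdots \\ w_T \end{pmatrix}
\]

it holds that

\[ \frac{\partial z}{\partial w} = c. \]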


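• For the (T × T) matrix

\[
A = \begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1T} \\
a_{21} & a_{22} & \cdots & a_{2T} \\
\vdots & \vdots & & \vdots \\
a_{T1} & a_{T2} & \cdots & a_{TT}
\end{pmatrix}
\]

and

\[
z = w'Aw = \begin{pmatrix} w_1 & w_2 & \cdots & w_T \end{pmatrix}
\begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1T} \\
a_{21} & a_{22} & \cdots & a_{2T} \\
\vdots & \vdots & & \vdots \\
a_{T1} & a_{T2} & \cdots & a_{TT}
\end{pmatrix}
\begin{pmatrix} w_1 \\ w_2 \\ \vdots \\ w_T \end{pmatrix}
\]

it holds that

\[ \frac{\partial z}{\partial w} = (A' + A)\, w. \]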
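Both differentiation rules can be checked numerically with central finite differences. The sketch below uses randomly drawn c, A and w; the helper name `numerical_gradient` and all numerical values are assumptions of this example. The matrix A is deliberately not symmetric, so that (A′ + A)w differs from 2Aw.

```python
import numpy as np

def numerical_gradient(func, w, eps=1e-6):
    """Central finite-difference approximation of the gradient of func at w."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = eps
        grad[i] = (func(w + step) - func(w - step)) / (2.0 * eps)
    return grad

rng = np.random.default_rng(2)
T = 4
c = rng.normal(size=T)
A = rng.normal(size=(T, T))     # not symmetric in general
w = rng.normal(size=T)

grad_linear = numerical_gradient(lambda v: c @ v, w)          # z = c'w
grad_quadratic = numerical_gradient(lambda v: v @ A @ v, w)   # z = w'Aw

print("max deviation from c:        ", np.max(np.abs(grad_linear - c)))
print("max deviation from (A'+A)w:  ", np.max(np.abs(grad_quadratic - (A.T + A) @ w)))
```

For a symmetric A the second rule reduces to 2Aw, which is the form needed when differentiating the sum of squared residuals in the OLS derivation.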



10.4 Data for Estimating Gravity Equations

Legend for data in gravity data kaz.wf1

• Countries and country codes

No. Code Country                    No. Code Country            No. Code Country
 1  ALB  Albania                    17  GBR  United Kingdom     33  NLD  Netherlands
 2  ARM  Armenia                    18  GEO  Georgia            34  NOR  Norway
 3  AUT  Austria                    19  GER  Germany            35  POL  Poland
 4  AZE  Azerbaijan                 20  GRC  Greece             36  PRT  Portugal
 5  BEL  Belgium and Luxembourg     21  HRV  Croatia            37  ROM  Romania
 6  BGR  Bulgaria                   22  HUN  Hungary            38  RUS  Russia
 7  BLR  Belarus                    23  IRL  Ireland            39  SVK  Slovakia
 8  CAN  Canada                     24  ISL  Iceland            40  SVN  Slovenia
 9  CHE  Switzerland                25  ITA  Italy              41  SWE  Sweden
10  CYP  Cyprus                     26  KAZ  Kazakhstan         42  TKM  Turkmenistan
11  CZE  Czech Republic             27  KGZ  Kyrgyzstan         43  TUR  Turkey
12  DNK  Denmark                    28  LTU  Lithuania          44  UKR  Ukraine
13  ESP  Spain                      29  LVA  Latvia             45  USA  United States
14  EST  Estonia                    30  MDA  Moldova            46  YUG  Serbia and Montenegro
15  FIN  Finland                    31  MKD  Macedonia
16  FRA  France                     32  MLT  Malta


Notes: Table is based on Table 1 in “Explanatory notes on gravity data.wf1”.

Countries that feature only as origin countries:

BIH Bosnia and Herzegovina

TJK Tajikistan

UZB Uzbekistan

CHN China

HKG Hong Kong

JPN Japan

KOR South Korea

TWN Taiwan

THA Thailand


• Endogenous variable:

– TRADE 0 D O:

Imports of country d from country o (i.e., exports of country o

to country d) in current US dollars

– Commodity classifications: Trade flows are based on aggregating disaggregate trade flows according to the Standard International Trade Classification, Revision 3 (SITC, Rev.3) at the lowest aggregation levels (4- or 5-digit). Source: UN COMTRADE (http://comtrade.un.org/)

– Without fuels and lubricants (i.e., specifically without petrol and natural gas products). Cut-off value for underlying disaggregated trade flows (at SITC Rev.3 5-digit level) is 500 US dollars.

• Explanatory variables:


Origin country

Variable | Description | Source
WDI_GDPUSDCR_O | Origin country GDP data; in current US dollars | World Bank - World Development Indicators
WDI_GDPPCUSDCR_O | Origin country GDP per capita data; in current US dollars | World Bank - World Development Indicators
WEO_GDPCR_O | Destination and origin country GDP data; in current US dollars | IMF - World Economic Outlook database
WEO_GDPPCCR_O | Destination and origin country GDP per capita data; in current US dollars | IMF - World Economic Outlook database
WEO_POP_O | Origin country population data | IMF - World Economic Outlook database
CEPII_AREA_O | area of origin country in km² | CEPII
CEPII_COL45 | dummy; d and o country have had a colonial relationship after 1945 | CEPII
CEPII_COL45_REV | dummy; revised by “expert knowledge” |
CEPII_COLONY | dummy; d and o country have ever had a colonial link | CEPII
CEPII_COMCOL | dummy; d and o country share a common colonizer since 1945 | CEPII
CEPII_COMCOL_REV | dummy; revised by “expert knowledge” |
CEPII_COMLANG_ETHNO | dummy; d and o country share a language | CEPII
CEPII_COMLANG_ETHNO_REV | spoken by at least 9% of each population |
CEPII_COMLANG_OFF | dummy; d and o country share common official language | CEPII
CEPII_CONTIG | dummy; d and o country are contiguous (neighboring countries) | CEPII
CEPII_DISINT_O | internal distance in origin country, 0.67√(area/π) | CEPII
CEPII_DIST | geodesic distance between d and o country | CEPII
CEPII_DISTCAP | distance between d and o country based on capitals | CEPII
CEPII_DISTW | weighted distances, see CEPII for details | CEPII
CEPII_DISTWCES | weighted distances, see CEPII for details | CEPII
CEPII_LAT_O | latitude of the city | CEPII
CEPII_LON_O | longitude of the city | CEPII
CEPII_SMCTRY_REV | dummy; d and o country were/are the same country | CEPII, revised
ISO_O | ISO codes in three characters of origin country | CEPII
EBRD_TFES_O | EBRD measure of foreign trade and payments liberalisation of o country | EBRD
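The expression 0.67√(area/π) that appears in the table can be evaluated directly. The sketch below assumes that it defines the internal distance of a country from its area (the CEPII area variable, in km²), so that the result is in km; the function name `internal_distance` and the example area are hypothetical and chosen only so that the result is easy to check by hand.

```python
import math


def internal_distance(area_km2: float) -> float:
    """Evaluate 0.67 * sqrt(area / pi) for an area given in km^2."""
    if area_km2 < 0:
        raise ValueError("area must be non-negative")
    return 0.67 * math.sqrt(area_km2 / math.pi)


# Hypothetical area chosen so that area / pi = 10,000 and the square root is 100.
example_area = 10_000 * math.pi
print(internal_distance(example_area))  # approximately 67
```

Because the distance grows with the square root of the area, a country with four times the area is assigned twice the internal distance.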



Destination country

Variable | Description | Source
WDI_GDPUSDCR_D | Destination country GDP data; in current US dollars | World Bank - World Development Indicators
WDI_GDPPCUSDCR_D | Destination country GDP per capita data; in current US dollars | World Bank - World Development Indicators
WEO_GDPCR_D | Destination and origin country GDP data; in current US dollars | IMF - World Economic Outlook database
WEO_GDPPCCR_D | Destination and origin country GDP per capita data; in current US dollars | IMF - World Economic Outlook database
WEO_POP_D | Destination country population data | IMF - World Economic Outlook database

Notes: The EBRD measures reform on a scale between 1 and 4+ (=4.33); 1 represents no or little progress; 2 indicates important progress; 3 is substantial progress; 4 indicates comprehensive progress, while 4+ indicates countries have reached the standards and performance norms of advanced industrial countries, i.e., of OECD countries. By construction, this variable is ordered qualitative rather than cardinal.

• Thanks: to Richard Frensch, Osteuropa-Institut, for providing the data set.



• EViews commands to extract selected data from the main workfile:

– To select observations of countries that export to Kazakhstan: in the workfile, choose Proc → Copy/Extract from Current Page → By Value to New Page or Workfile. In Sample - observations to copy, enter: @all if (iso_d="KAZ"). Objects to copy: select. Page Destination: select.

– To select observations for one period, e.g. 2004: as above, but in Sample - observations to copy, enter: 2004 2004

– To select observations for trade flows from Germany to Kazakhstan for all periods: as above, but in Sample - observations to copy, enter: @all if (iso_o="GER") and (iso_d="KAZ")

• Websites: CEPII, http://www.cepii.fr/anglaisgraph/news/accueilengl.htm
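For readers working outside EViews, the three selections described above can be mirrored on a data frame. This is only a sketch, assuming pandas is available and that the workfile has been exported to a table with columns `iso_o` (origin), `iso_d` (destination), and a period column; the name `year` and all values in the toy data frame are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the exported workfile; all values are made up.
data = pd.DataFrame(
    {
        "iso_o": ["GER", "GER", "FRA", "KAZ"],
        "iso_d": ["KAZ", "KAZ", "KAZ", "GER"],
        "year": [2003, 2004, 2004, 2004],  # assumed name of the period column
        "trade_0_d_o": [1.0, 2.0, 3.0, 4.0],
    }
)

# Countries that export to Kazakhstan: destination is KAZ.
exports_to_kaz = data[data["iso_d"] == "KAZ"]

# Observations for one period, e.g. 2004.
year_2004 = data[data["year"] == 2004]

# Trade flows from Germany (origin) to Kazakhstan (destination), all periods.
ger_to_kaz = data[(data["iso_o"] == "GER") & (data["iso_d"] == "KAZ")]

print(len(exports_to_kaz), len(year_2004), len(ger_to_kaz))
```

Each filter returns a new data frame and leaves `data` unchanged, which parallels copying the selected observations to a new page or workfile.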

Bibliography

Anderson, J. E. & Wincoop, E. v. (2003), ‘Gravity with gravitas: A solution to the border puzzle’, The American Economic Review 93, 170–192.

Davidson, R. & MacKinnon, J. G. (2004), Econometric Theory and Methods, Oxford University Press, Oxford.

Fratianni, M. (2007), The gravity equation in international trade, Technical report, Dipartimento di Economia, Università Politecnica delle Marche.

Pindyck, R. S. & Rubinfeld, D. L. (1998), Econometric Models and Economic Forecasts, Irwin McGraw-Hill.

Stock, J. H. & Watson, M. W. (2007), Introduction to Econometrics, Pearson, Boston, Mass.

Wooldridge, J. M. (2009), Introductory Econometrics. A Modern Approach, 4th edn, Thomson South-Western.
