
Model Specification and Data Problems

Part VIII


As of Oct 18, 2018. Seppo Pynnonen, Econometrics I


1 Model Specification and Data Problems

Functional Form Misspecification

RESET test

Non-nested alternatives

Using proxies for unobserved explanatory variables

Outliers

Functional Form Misspecification



A functional form misspecification generally means that the model does not account for some important nonlinearities.

Recall that omitting an important variable is also a form of model misspecification.

Generally, functional form misspecification causes bias in the remaining parameter estimators.




Example 1

Suppose that the correct specification of the wage equation is

log(wage) = β0 + β1 educ + β2 exper + β3 exper² + u. (1)

Then the return to an extra year of experience is

∂log(wage)/∂exper = β2 + 2β3 exper. (2)

If the second-order term is dropped from (1), the resulting estimate of β2 is biased and the model forces the return to experience to be constant, so using it can be misleading.
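To make (2) concrete, the following minimal sketch evaluates the return to experience at a few experience levels. The coefficient values are hypothetical and chosen only for illustration; they are not estimates from any data set used in these notes.

```python
def return_to_experience(beta2, beta3, exper):
    """Marginal effect of exper on log(wage) in equation (1): beta2 + 2*beta3*exper."""
    return beta2 + 2.0 * beta3 * exper


# Hypothetical coefficient values, chosen only for illustration.
beta2, beta3 = 0.04, -0.0007

for exper in (0, 10, 20, 30):
    pct = 100.0 * return_to_experience(beta2, beta3, exper)
    print(f"exper = {exper:2d}: approximate return {pct:5.2f}% per extra year")

# Experience level at which the return changes sign (requires beta3 != 0).
turning_point = -beta2 / (2.0 * beta3)
print(f"return is zero at exper = {turning_point:.1f}")
```

With a negative β3 the return declines with experience, which a model without the squared term cannot capture.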






RESET test







Ramsey (1969)² proposed a general functional form misspecification test, the Regression Specification Error Test (RESET), which has proven to be useful.

Estimate

y = β0 + β1x1 + · · · + βkxk + u, (3)

obtain the fitted values ŷ, and estimate the augmented model

y = β0 + β1x1 + · · · + βkxk + δ1ŷ² + δ2ŷ³ + e. (4)

Test the null hypothesis

H0 : δ1 = δ2 = 0 (5)

with the F-test with numerator df1 = 2 and denominator df2 = n − k − 3.
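The steps of the RESET test can be sketched as follows. This is a minimal illustration on simulated data, using only the Python standard library; the helper names `solve`, `ols` and `reset_test` are hypothetical, and in practice one would use a regression package. The p-value uses the closed form of the upper tail of an F(2, df2) distribution, which is valid only because the numerator degrees of freedom equal 2.

```python
import random


def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][j] * x[j] for j in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def ols(X, y):
    """OLS via the normal equations; X is a list of rows that includes the constant."""
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    beta = solve(XtX, Xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in X]
    ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return beta, fitted, ssr


def reset_test(X, y):
    """Return (F, df2, p) for H0: delta1 = delta2 = 0 in the augmented model (4)."""
    n, k = len(X), len(X[0]) - 1           # k regressors plus a constant
    _, fitted, ssr_r = ols(X, y)           # restricted model (3)
    X_aug = [r + [f ** 2, f ** 3] for r, f in zip(X, fitted)]
    _, _, ssr_u = ols(X_aug, y)            # augmented model (4)
    df2 = n - k - 3
    F = ((ssr_r - ssr_u) / 2) / (ssr_u / df2)
    p = (1 + 2 * F / df2) ** (-df2 / 2)    # exact upper tail of F(2, df2)
    return F, df2, p


# Simulated data: the true relation is quadratic, the fitted model is linear.
rng = random.Random(0)
x = [5 * i / 99 for i in range(100)]
y = [1 + 0.5 * xi + 0.3 * xi ** 2 + rng.gauss(0, 0.5) for xi in x]

F, df2, p = reset_test([[1.0, xi] for xi in x], y)
print(f"F = {F:.2f} with (2, {df2}) df, p-value = {p:.4g}")
```

Because the simulated relation is quadratic while the fitted model is linear, the test should reject the null hypothesis clearly in this example.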

² Ramsey, J.B. (1969). Tests for specification errors in classical linear least-squares regression analysis, Journal of the Royal Statistical Society, Series B, 31, 350–371.




Example 2

Consider the house price data (Exercise 3.1) and estimate

price = β0 + β1lotsize + β2sqrft + β3bdrms + u. (6)

Estimation results are:

Dependent Variable: PRICE
Method: Least Squares
Sample: 1 88
Included observations: 88
===========================================================
Variable    Coefficient   Std. Error   t-Statistic    Prob.
-----------------------------------------------------------
C             -21.77031     29.47504     -0.738601   0.4622
LOTSIZE        0.002068     0.000642      3.220096   0.0018
SQRFT          0.122778     0.013237      9.275093   0.0000
BDRMS          13.85252     9.010145      1.537436   0.1279
===========================================================
R-squared             0.672362    Mean dependent var      293.5460
Adjusted R-squared    0.660661    S.D. dependent var      102.7134
S.E. of regression    59.83348    Akaike info criterion   11.06540
Sum squared resid     300723.8    Schwarz criterion       11.17800
Log likelihood       -482.8775    F-statistic             57.46023
Durbin-Watson stat    2.109796    Prob(F-statistic)       0.000000
===========================================================




Next, estimate (6) augmented with ŷ² and ŷ³ as in (4), where ŷ denotes the fitted values of price from (6).

The F-statistic for the null hypothesis (5) becomes F = 4.67 with 2 and 82 degrees of freedom (df2 = n − k − 3 = 88 − 3 − 3 = 82). The p-value is 0.012, so we reject the null hypothesis at the 5% level.

Thus, there is some evidence of non-linearity.




Next, estimate

log(price) = β0 + β1 log(lotsize) + β2 log(sqrft) + β3 bdrms + u. (7)

Estimation results:

Dependent Variable: LOG(PRICE)
Method: Least Squares
Date: 10/19/06 Time: 00:01
Sample: 1 88
Included observations: 88
==============================================================
Variable       Coefficient   Std. Error   t-Statistic    Prob.
--------------------------------------------------------------
C                -1.297042     0.651284     -1.991517   0.0497
LOG(LOTSIZE)      0.167967     0.038281      4.387714   0.0000
LOG(SQRFT)        0.700232     0.092865      7.540306   0.0000
BDRMS             0.036958     0.027531      1.342415   0.1831
==============================================================
R-squared             0.642965    Mean dependent var      5.633180
Adjusted R-squared    0.630214    S.D. dependent var      0.303573
S.E. of regression    0.184603    Akaike info criterion  -0.496833
Sum squared resid     2.862563    Schwarz criterion      -0.384227
Log likelihood        25.86066    F-statistic             50.42374
Durbin-Watson stat    2.088996    Prob(F-statistic)       0.000000
==============================================================




Applying the RESET test, the F-statistic for the null hypothesis (5) is now F = 2.56 with p-value 0.084, so the null hypothesis is not rejected at the 5% level.

Thus, on the basis of the RESET test, the log-log model (7) is preferred.
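The degrees of freedom and the two RESET p-values reported in Example 2 can be checked with a few lines of arithmetic. The sketch below relies on the fact that the upper tail of an F(2, df2) distribution has the closed form (1 + 2F/df2)^(−df2/2); the function name `f2_pvalue` is hypothetical, and the formula does not apply for other numerator degrees of freedom.

```python
def f2_pvalue(f_stat, df2):
    """Upper-tail probability P(F > f_stat) for an F(2, df2) random variable."""
    return (1.0 + 2.0 * f_stat / df2) ** (-df2 / 2.0)


n, k = 88, 3            # observations and regressors in both (6) and (7)
df2 = n - k - 3         # denominator degrees of freedom of the RESET F-test

p_level = f2_pvalue(4.67, df2)     # level model (6)
p_loglog = f2_pvalue(2.56, df2)    # log-log model (7)
print(df2, round(p_level, 3), round(p_loglog, 3))
```

The first value agrees with the reported 0.012. The second comes out slightly below the reported 0.084; a gap of this size is what one would expect if the reported F-statistic was rounded to two decimals.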






Non-nested alternatives




Suppose, for example, that the model choices are

y = β0 + β1x1 + β2x2 + u (8)

and

y = β0 + β1 log(x1) + β2 log(x2) + u. (9)

Because the models are non-nested (neither is a special case of the other), the usual F-test does not apply directly.

A common approach is to estimate a combined model

y = γ0 + γ1x1 + γ2x2 + γ3 log(x1) + γ4 log(x2) + u. (10)

In (10), H0 : γ3 = γ4 = 0 is a hypothesis for (8) and H0 : γ1 = γ2 = 0 is a hypothesis for (9). Both (8) and (9) are nested in (10), so the usual F-test applies again.




Davidson and MacKinnon (1981)³ procedure:

For example, to test (8), first estimate

y = β0 + β1x1 + β2x2 + θ1ŷ + v, (11)

where ŷ denotes the fitted values of (9). A significant t-value of the θ1-estimate is a rejection of (8).

Similarly, if ỹ denotes the fitted values of (8), the test of (9) is the t-statistic of the θ1-estimate from

y = β0 + β1 log(x1) + β2 log(x2) + θ1ỹ + v. (12)
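The procedure in (11) and (12) can be sketched on simulated data as follows. This is a minimal illustration using only the Python standard library; the helper names `solve`, `ols` and `dm_t_squared` are hypothetical. Instead of the t-statistic itself, the sketch computes its square, which for a single restriction equals the F-statistic obtained from the restricted and unrestricted sums of squared residuals; at the 5% level it would be compared with a critical value of roughly 3.9 for about 200 degrees of freedom.

```python
import math
import random


def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][j] * x[j] for j in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def ols(X, y):
    """OLS via the normal equations; X is a list of rows that includes the constant."""
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    beta = solve(XtX, Xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in X]
    ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return beta, fitted, ssr


def dm_t_squared(X_null, X_alt, y):
    """Squared t-statistic of theta1 when the rival model's fitted values
    are added to the null model, as in (11) and (12)."""
    _, fitted_alt, _ = ols(X_alt, y)
    _, _, ssr_r = ols(X_null, y)
    X_aug = [r + [f] for r, f in zip(X_null, fitted_alt)]
    _, _, ssr_u = ols(X_aug, y)
    df = len(y) - len(X_aug[0])
    return (ssr_r - ssr_u) / (ssr_u / df)


# Simulated data in which the log specification (9) is the true model.
rng = random.Random(0)
n = 200
x1 = [rng.uniform(1, 20) for _ in range(n)]
x2 = [rng.uniform(1, 20) for _ in range(n)]
y = [1 + 2 * math.log(a) + 1.5 * math.log(b) + rng.gauss(0, 0.3)
     for a, b in zip(x1, x2)]

X_level = [[1.0, a, b] for a, b in zip(x1, x2)]
X_log = [[1.0, math.log(a), math.log(b)] for a, b in zip(x1, x2)]

t2_level = dm_t_squared(X_level, X_log, y)   # test of (8) via regression (11)
t2_log = dm_t_squared(X_log, X_level, y)     # test of (9) via regression (12)
print(f"t^2 for testing (8): {t2_level:.1f}, for testing (9): {t2_log:.2f}")
```

Since the data are generated from the log specification, the test of (8) should reject strongly, while the test of (9) will usually not reject.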

³ Davidson, R. and J.G. MacKinnon (1981). Several tests for model specification in the presence of alternative hypotheses, Econometrica 49, 781–793.




Remark 8.1: A clear winner need not emerge. Both models may be rejected or neither may be rejected. In the latter case, the adjusted R-squared can be used to select the better fitting one. If both models are rejected, more work is needed.⁴

⁴ For more complicated cases, see Wooldridge, J.M. (1994). A simple specification test for the predictive ability of transformation models, Review of Economics and Statistics 76, 59–65.






Using proxies for unobserved explanatory variables




As discussed earlier, an important source of bias in OLS is omitted variables that are correlated with the included explanatory variables.

Often the reason for omission is that these variables are unobservable.

A way to mitigate the problem is to collect data on proxy variables.

Consider the following regression

y = β0 + β1x1 + β2x2* + u, (13)

where x2* is an unobservable variable (e.g. human ability).




Suppose that the primary interest is to estimate β1, so that x2* is a control variable.

However, as we know, the simple regression y = β0 + β1x1 + v results in a biased and inconsistent OLS estimator β̃1 of β1, such that plim β̃1 = β1 + γ1β2, where γ1 is the slope coefficient of the regression x2* = γ0 + γ1x1 + error.
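The probability limit above follows from substituting the auxiliary regression into (13). In the sketch below, r denotes the error of the auxiliary regression and the tilde marks the estimator from the short regression; both symbols are labels introduced here for illustration. By construction r is uncorrelated with x1, and u is assumed uncorrelated with x1.

```latex
\begin{align*}
y &= \beta_0 + \beta_1 x_1 + \beta_2 x_2^{*} + u,
  \qquad x_2^{*} = \gamma_0 + \gamma_1 x_1 + r, \\
y &= (\beta_0 + \beta_2\gamma_0) + (\beta_1 + \beta_2\gamma_1)\,x_1 + (u + \beta_2 r), \\
\operatorname{plim}\tilde{\beta}_1 &= \beta_1 + \beta_2\gamma_1 .
\end{align*}
```

The bias term β2γ1 vanishes only if x2* has no effect on y (β2 = 0) or is uncorrelated with x1 (γ1 = 0).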

Suppose that we have a 'good' proxy x2 for x2* such that

(i) E[x2* | x2, x1] = E[x2* | x2], i.e., given the proxy x2, x1 does not help in predicting the unobserved variable x2*;

(ii) E[u | x2] = 0 for the error term in regression (13).

These imply that in the regression x2* = δ0 + δ1x2 + θx1 + e we have θ = 0, so that only the proxy x2 is related to the unobserved variable x2*, and that the proxy x2 is not correlated with the error term of the true regression in equation (13).




With this kind of a good proxy, instead of (13) the model to be estimated becomes

y = α0 + β1x1 + α2x2 + w. (14)

Now OLS is an unbiased and consistent estimator of β1, the parameter we are primarily interested in (the OLS estimators of α0 and α2 are also unbiased and consistent for these parameters, but α0 = β0 + β2δ0 and α2 = δ1β2 differ from β0 and β2).
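The form of (14) and the composite parameters can be seen by substituting the proxy relation, with θ = 0, into (13). The following is a short derivation sketch under the two proxy conditions stated above; e denotes the error of the proxy relation.

```latex
\begin{align*}
x_2^{*} &= \delta_0 + \delta_1 x_2 + e, \\
y &= \beta_0 + \beta_1 x_1 + \beta_2(\delta_0 + \delta_1 x_2 + e) + u \\
  &= \underbrace{(\beta_0 + \beta_2\delta_0)}_{\alpha_0} + \beta_1 x_1
     + \underbrace{\beta_2\delta_1}_{\alpha_2}\,x_2
     + \underbrace{(u + \beta_2 e)}_{w}.
\end{align*}
```

The coefficient on x1 is still β1, which is why the proxy regression recovers the parameter of interest even though the intercept and the coefficient on x2 change.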




Example 3

Consider the return to education in wages (monthly) for men (wage2 data set).

lm(formula = log(wage) ~ educ + exper + tenure + married + south +
    urban + black, data = wdf)

Residuals:
     Min       1Q   Median       3Q      Max
-1.98069 -0.21996  0.00707  0.24288  1.22822

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  5.395497   0.113225  47.653  < 2e-16 ***
educ         0.065431   0.006250  10.468  < 2e-16 ***
exper        0.014043   0.003185   4.409 1.16e-05 ***
tenure       0.011747   0.002453   4.789 1.95e-06 ***
married      0.199417   0.039050   5.107 3.98e-07 ***
south       -0.090904   0.026249  -3.463 0.000558 ***
urban        0.183912   0.026958   6.822 1.62e-11 ***
black       -0.188350   0.037667  -5.000 6.84e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3655 on 927 degrees of freedom
Multiple R-squared: 0.2526, Adjusted R-squared: 0.2469
F-statistic: 44.75 on 7 and 927 DF, p-value: < 2.2e-16




The estimated return to education is 6.5%. However, if the omitted ability is positively correlated with educ (and raises wages), the estimate is too high.

Adding IQ as a proxy for ability into the equation reduces the estimate to 5.4%, which is consistent with the omitted variable bias hypothesis.


lm(formula = log(wage) ~ educ + exper + tenure + married + south +
    urban + black + iq, data = wdf)

Residuals:
     Min       1Q   Median       3Q      Max
-2.01203 -0.22244  0.01017  0.22951  1.27478

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  5.1764391  0.1280006  40.441  < 2e-16 ***
educ         0.0544106  0.0069285   7.853 1.12e-14 ***
exper        0.0141459  0.0031651   4.469 8.82e-06 ***
tenure       0.0113951  0.0024394   4.671 3.44e-06 ***
married      0.1997644  0.0388025   5.148 3.21e-07 ***
south       -0.0801695  0.0262529  -3.054 0.002325 **
urban        0.1819463  0.0267929   6.791 1.99e-11 ***
black       -0.1431253  0.0394925  -3.624 0.000306 ***
iq           0.0035591  0.0009918   3.589 0.000350 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1







Test whether the interaction of ability and education affects wages.


lm(formula = log(wage) ~ educ + exper + tenure + married + south +
    urban + black + iq + iq:educ, data = wdf)   # iq:educ adds the product of iq and educ

Residuals:
     Min       1Q   Median       3Q      Max
-2.00733 -0.21715  0.01177  0.23456  1.27305

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  5.6482478  0.5462963  10.339  < 2e-16 ***
educ         0.0184560  0.0410608   0.449 0.653192
exper        0.0139072  0.0031768   4.378 1.34e-05 ***
tenure       0.0113929  0.0024397   4.670 3.46e-06 ***
married      0.2008658  0.0388267   5.173 2.82e-07 ***
south       -0.0802354  0.0262560  -3.056 0.002308 **
urban        0.1835758  0.0268586   6.835 1.49e-11 ***
black       -0.1466989  0.0397013  -3.695 0.000233 ***
iq          -0.0009418  0.0051625  -0.182 0.855290
educ:iq      0.0003399  0.0003826   0.888 0.374564
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1







The interaction iq × educ is not only insignificant; adding it also renders educ and iq insignificant!

This is due to the high correlation of the interaction term with its components:

> with(wdf, cor(cbind(educ, iq, educ*iq)))
             educ        iq   educ*iq
educ    1.0000000 0.5156970 0.8880035
iq      0.5156970 1.0000000 0.8453237
educ*iq 0.8880035 0.8453237 1.0000000

The implied collinearity can be materially reduced by defining the interaction term in terms of demeaned variables:

> with(wdf, cor(cbind(educ, iq, (educ - mean(educ))*(iq - mean(iq)))))
                               educ         iq  (e-m(e))*(i-m(i))
educ                      1.0000000  0.5156970          0.1864668
iq                        0.5156970  1.0000000         -0.0133327
(educ-m(educ))*(iq-m(iq)) 0.1864668 -0.0133327          1.0000000
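The effect of demeaning on the correlations can be reproduced qualitatively with simulated data. The sketch below does not use the wage2 data: the variables `educ` and `iq` are hypothetical normal draws with means, spreads and mutual correlation chosen to be roughly comparable to the correlations shown above, and `corr` is a small helper written for this illustration.

```python
import random


def corr(a, b):
    """Sample Pearson correlation of two equally long lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / (saa * sbb) ** 0.5


rng = random.Random(1)
n = 5000
# Simulated stand-ins for educ and iq (hypothetical, not the wage2 data).
educ = [13.5 + rng.gauss(0, 2.2) for _ in range(n)]
iq = [100 + 3.5 * (e - 13.5) + rng.gauss(0, 13) for e in educ]

m_educ, m_iq = sum(educ) / n, sum(iq) / n
raw = [e * q for e, q in zip(educ, iq)]                            # educ*iq
centered = [(e - m_educ) * (q - m_iq) for e, q in zip(educ, iq)]   # demeaned product

print("raw product:     ", round(corr(educ, raw), 3), round(corr(iq, raw), 3))
print("centered product:", round(corr(educ, centered), 3), round(corr(iq, centered), 3))
```

The raw product is highly correlated with both components because the means of educ and iq are far from zero; after demeaning, the product is only weakly correlated with either component.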




Using the interaction term of the demeaned components also leads to a meaningful interpretation of the implied model.

Writing the original model with the interaction term as

log(wage) = β0 + β1 educ + β2 iq + β12 (educ × iq) + other factors, (15)

an equivalent representation in terms of the demeaned interaction becomes

log(wage) = γ0 + γ1 educ + γ2 iq + β12 (educ − mean(educ)) × (iq − mean(iq)) + other factors, (16)

where educ − mean(educ) and iq − mean(iq) are the demeaned educ and iq.

The relations between the coefficients of the original model (15) and model (16) are β0 = γ0 + β12 mean(educ) mean(iq), β1 = γ1 − β12 mean(iq), and β2 = γ2 − β12 mean(educ).

For example, at the mean IQ, iq − mean(iq) = 0, so that γ1 indicates the return to education for a person with average ability.
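The coefficient relations between the raw and the demeaned interaction models follow from expanding the demeaned product and collecting terms. In the derivation sketch below, overlines denote sample means.

```latex
\begin{align*}
\beta_{12}(\mathit{educ} - \overline{\mathit{educ}})(\mathit{iq} - \overline{\mathit{iq}})
 &= \beta_{12}\,\mathit{educ}\cdot\mathit{iq}
    - \beta_{12}\,\overline{\mathit{iq}}\;\mathit{educ}
    - \beta_{12}\,\overline{\mathit{educ}}\;\mathit{iq}
    + \beta_{12}\,\overline{\mathit{educ}}\;\overline{\mathit{iq}}, \\
\log(\mathit{wage})
 &= \underbrace{(\gamma_0 + \beta_{12}\,\overline{\mathit{educ}}\;\overline{\mathit{iq}})}_{\beta_0}
  + \underbrace{(\gamma_1 - \beta_{12}\,\overline{\mathit{iq}})}_{\beta_1}\,\mathit{educ}
  + \underbrace{(\gamma_2 - \beta_{12}\,\overline{\mathit{educ}})}_{\beta_2}\,\mathit{iq}
  + \beta_{12}\,\mathit{educ}\cdot\mathit{iq} + \text{other factors}.
\end{align*}
```

The interaction coefficient β12 is the same in both representations; only the intercept and the two main-effect coefficients change.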




Estimating the model, however, gives the estimate 0.00034 for β12 with p-value 0.37, which is not statistically significant at any conventional level. Hence there is no evidence that the return to education varies with IQ.

Dependent variable: log(wage)

Coefficients:
                                        Estimate Std. Error t value Pr(>|t|)
(Intercept)                            5.1846286  0.1283466  40.396  < 2e-16
educ                                   0.0528786  0.0071406   7.405 2.94e-13
exper                                  0.0139072  0.0031768   4.378 1.34e-05
tenure                                 0.0113929  0.0024397   4.670 3.46e-06
married                                0.2008658  0.0388267   5.173 2.82e-07
south                                 -0.0802354  0.0262560  -3.056 0.002308
urban                                  0.1835758  0.0268586   6.835 1.49e-11
black                                 -0.1466989  0.0397013  -3.695 0.000233
iq                                     0.0036357  0.0009957   3.652 0.000275
(iq - mean(iq)) x (educ - mean(educ))  0.0003399  0.0003826   0.888 0.374564
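As a consistency check, the reported estimates from the raw-interaction fit and the demeaned-interaction fit can be compared through the relations β1 = γ1 − β12 mean(iq) and β2 = γ2 − β12 mean(educ). Solving these for the sample means gives the values implied by the two coefficient tables. The sketch below only rearranges the reported numbers; the variable names are hypothetical, and the results are subject to rounding in the printed coefficients.

```python
# Coefficients on educ and iq in the fit with the raw interaction educ:iq.
beta1_raw, beta2_raw = 0.0184560, -0.0009418
# Coefficients on educ and iq in the fit with the demeaned interaction.
gamma1, gamma2 = 0.0528786, 0.0036357
# Interaction coefficient, identical in both fits.
beta12 = 0.0003399

# beta1 = gamma1 - beta12 * mean(iq)   =>  mean(iq)   = (gamma1 - beta1) / beta12
# beta2 = gamma2 - beta12 * mean(educ) =>  mean(educ) = (gamma2 - beta2) / beta12
implied_mean_iq = (gamma1 - beta1_raw) / beta12
implied_mean_educ = (gamma2 - beta2_raw) / beta12

print(f"implied mean(iq)   = {implied_mean_iq:.1f}")
print(f"implied mean(educ) = {implied_mean_educ:.2f}")
```

The implied means, an IQ slightly above 100 and between 13 and 14 years of education, are plausible sample averages, which supports the equivalence of the two parameterizations.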







Outliers

Particularly in small data sets, OLS estimates may be influenced by one or several observations (see figure).

Generally such observations are called outliers or influential observations.

Loosely, an observation is an outlier if dropping it changes the estimation results materially.

In detecting outliers, a usual practice is to investigate standardized (or "studentized") residuals.

If an outlier is an obvious mistake in recording the data, it can be corrected. A usual practice is also to eliminate such observations.

Data transformations, such as taking logarithms, often narrow the range of the data and hence may alleviate outlier problems, too.
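The residual-based screening mentioned above can be sketched for a simple regression with one regressor. The example computes internally studentized residuals, e_i / (s · sqrt(1 − h_i)), where h_i is the leverage of observation i. The data are artificial, with one gross error planted at x = 6, and the cutoff of 2 in absolute value is a common rule of thumb assumed here, not a threshold given in these notes.

```python
def studentized_residuals(x, y):
    """Internally studentized residuals from the simple regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    b0 = my - b1 * mx
    resid = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
    s2 = sum(e ** 2 for e in resid) / (n - 2)              # residual variance
    lev = [1.0 / n + (xi - mx) ** 2 / sxx for xi in x]     # leverages h_i
    return [e / (s2 * (1.0 - h)) ** 0.5 for e, h in zip(resid, lev)]


# Artificial data: a linear relation with a small alternating disturbance.
x = list(range(1, 13))
y = [2 + 0.5 * xi + 0.1 * (-1) ** xi for xi in x]
y[5] += 3.0   # plant one gross recording error at x = 6

r = studentized_residuals(x, y)
flagged = [xi for xi, ri in zip(x, r) if abs(ri) > 2.0]   # rule-of-thumb cutoff
print("flagged x values:", flagged)
```

Only the planted observation is flagged here. In real data a flagged residual is a prompt to inspect the observation, not by itself a reason to drop it.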

