Model Specification and Data Problems
Part VIII
As of Oct 18, 2018. Seppo Pynnonen, Econometrics I
1 Model Specification and Data Problems
Functional Form Misspecification
RESET test
Non-nested alternatives
Using proxies for unobserved explanatory variables
Outliers
Functional Form Misspecification
Functional form misspecification generally means that the model does not account for some important nonlinearities.
Recall that omitting an important variable is also a form of model misspecification.
In general, functional form misspecification biases the estimators of the remaining parameters.
Example 1
Suppose that the correct specification of the wage equation is
$$\log(\mathit{wage}) = \beta_0 + \beta_1 \mathit{educ} + \beta_2 \mathit{exper} + \beta_3 \mathit{exper}^2 + u. \tag{1}$$
Then the return for an extra year of experience is
$$\frac{\partial \log(\mathit{wage})}{\partial\, \mathit{exper}} = \beta_2 + 2\beta_3 \mathit{exper}. \tag{2}$$
If the second-order term is dropped from (1), using the resulting biased estimate of $\beta_2$ can be misleading.
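The point of Example 1 can be illustrated with a small simulation. This is only a sketch on artificial data: the sample size, the coefficient values and the distribution of exper are assumptions chosen for illustration, not estimates from any wage data set.

```r
# Simulated data only: all numbers below are illustrative assumptions.
set.seed(1)
n     <- 2000
educ  <- sample(8:18, n, replace = TRUE)
exper <- runif(n, 0, 40)
lwage <- 1 + 0.08 * educ + 0.04 * exper - 0.0007 * exper^2 + rnorm(n, sd = 0.3)

full  <- lm(lwage ~ educ + exper + I(exper^2))  # specification (1)
short <- lm(lwage ~ educ + exper)               # second-order term dropped

# Return to one more year of experience from equation (2),
# evaluated at 5, 20 and 35 years of experience
ret <- function(fit, x) {
  unname(coef(fit)["exper"] + 2 * coef(fit)["I(exper^2)"] * x)
}
ret_full  <- ret(full, c(5, 20, 35))
ret_short <- unname(coef(short)["exper"])  # one number, the same at every level

print(ret_full)
print(ret_short)
```

In the misspecified model the single slope on exper averages over a return that in fact declines with experience, so it overstates the return for long-tenured workers and understates it for new entrants.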
RESET test
Ramsey (1969)² proposed a general test for functional form misspecification, the Regression Specification Error Test (RESET), which has proven to be useful.
Estimate
$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + u, \tag{3}$$
obtain the fitted values $\hat{y}$, and estimate the augmented model
$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \delta_1 \hat{y}^2 + \delta_2 \hat{y}^3 + e. \tag{4}$$
Test the null hypothesis
$$H_0: \delta_1 = \delta_2 = 0 \tag{5}$$
with the F-test with numerator df1 = 2 and denominator df2 = n − k − 3.
² Ramsey, J.B. (1969). Tests for specification errors in classical linear least-squares regression analysis, Journal of the Royal Statistical Society, Series B, 31, 350–371.
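The steps (3) to (5) can be sketched in base R as follows. The data are simulated with a deliberately nonlinear true model, so all numbers are assumptions for illustration. The lmtest package also provides a ready-made `resettest()` function, but the manual version shows the mechanics.

```r
# Simulated data only: the true model is quadratic in x1 by construction.
set.seed(2)
n  <- 200
x1 <- runif(n, 1, 10)
x2 <- runif(n, 1, 10)
y  <- 1 + 0.5 * x1 + 0.3 * x2 + 0.15 * x1^2 + rnorm(n)

base <- lm(y ~ x1 + x2)                          # equation (3) with k = 2
yhat <- fitted(base)                             # fitted values from (3)
aug  <- lm(y ~ x1 + x2 + I(yhat^2) + I(yhat^3))  # augmented model (4)

reset  <- anova(base, aug)     # F-test of H0: delta1 = delta2 = 0
F_stat <- reset$F[2]
df1    <- reset$Df[2]          # numerator df, equals 2
df2    <- reset$Res.Df[2]      # denominator df, equals n - k - 3
p_val  <- reset$`Pr(>F)`[2]

print(c(F = F_stat, df1 = df1, df2 = df2, p = p_val))
```

A small p-value here signals that the linear specification misses some nonlinearity; it does not say which functional form is the right one.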
Example 2
Consider the house price data (Exercise 3.1) and estimate
$$\mathit{price} = \beta_0 + \beta_1 \mathit{lotsize} + \beta_2 \mathit{sqrft} + \beta_3 \mathit{bdrms} + u. \tag{6}$$
Estimation results are:
Dependent Variable: PRICE
Method: Least Squares
Sample: 1 88
Included observations: 88
==========================================================
Variable Coefficient Std. Error t-Statistic Prob.
----------------------------------------------------------
C -21.77031 29.47504 -0.738601 0.4622
LOTSIZE 0.002068 0.000642 3.220096 0.0018
SQRFT 0.122778 0.013237 9.275093 0.0000
BDRMS 13.85252 9.010145 1.537436 0.1279
==========================================================
============================================================
R-squared 0.672362 Mean dependent var 293.5460
Adjusted R-squared 0.660661 S.D. dependent var 102.7134
S.E. of regression 59.83348 Akaike info criterion 11.06540
Sum squared resid 300723.8 Schwarz criterion 11.17800
Log likelihood -482.8775 F-statistic 57.46023
Durbin-Watson stat 2.109796 Prob(F-statistic) 0.000000
============================================================
Next, estimate (6) augmented with the squared and cubed fitted values, $\widehat{\mathit{price}}^2$ and $\widehat{\mathit{price}}^3$, as in (4).
The F-statistic for the null hypothesis (5) is F = 4.67 with 2 and 82 degrees of freedom. The p-value is 0.012, so we reject the null hypothesis at the 5% level.
Thus, there is some evidence of nonlinearity.
Estimate next
$$\log(\mathit{price}) = \beta_0 + \beta_1 \log(\mathit{lotsize}) + \beta_2 \log(\mathit{sqrft}) + \beta_3 \mathit{bdrms} + u. \tag{7}$$
Estimation results:
Dependent Variable: LOG(PRICE)
Method: Least Squares
Date: 10/19/06 Time: 00:01
Sample: 1 88
Included observations: 88
============================================================
Variable Coefficient Std. Error t-Statistic Prob.
============================================================
C -1.297042 0.651284 -1.991517 0.0497
LOG(LOTSIZE) 0.167967 0.038281 4.387714 0.0000
LOG(SQRFT) 0.700232 0.092865 7.540306 0.0000
BDRMS 0.036958 0.027531 1.342415 0.1831
============================================================
==============================================================
R-squared 0.642965 Mean dependent var 5.633180
Adjusted R-squared 0.630214 S.D. dependent var 0.303573
S.E. of regression 0.184603 Akaike info criterion -0.496833
Sum squared resid 2.862563 Schwarz criterion -0.384227
Log likelihood 25.86066 F-statistic 50.42374
Durbin-Watson stat 2.088996 Prob(F-statistic) 0.000000
==============================================================
Applying the RESET test, the F-statistic for the null hypothesis (5) is now F = 2.56 with p-value 0.084, so the null hypothesis is not rejected at the 5% level.
Thus overall, on the basis of the RESET test, the log-log model (7) is preferred.
Non-nested alternatives
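Suppose, for example, that the model choices are
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u \tag{8}$$
and
$$y = \beta_0 + \beta_1 \log(x_1) + \beta_2 \log(x_2) + u. \tag{9}$$
Because the models are non-nested, the usual F-test cannot be used to test one directly against the other.
A common approach is to estimate a combined model
$$y = \gamma_0 + \gamma_1 x_1 + \gamma_2 x_2 + \gamma_3 \log(x_1) + \gamma_4 \log(x_2) + u. \tag{10}$$
Here $H_0: \gamma_3 = \gamma_4 = 0$ is a hypothesis for (8) and $H_0: \gamma_1 = \gamma_2 = 0$ is a hypothesis for (9). Within the combined model the usual F-test applies again.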
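A minimal sketch of the combined-model approach on simulated data is given below. The data are generated from the logarithmic model (9), and all numerical values are assumptions for illustration.

```r
# Simulated data only: generated from model (9), the log specification.
set.seed(3)
n  <- 300
x1 <- runif(n, 1, 20)
x2 <- runif(n, 1, 20)
y  <- 2 + 1.5 * log(x1) + 0.8 * log(x2) + rnorm(n, sd = 0.5)

m_lin  <- lm(y ~ x1 + x2)                      # model (8)
m_log  <- lm(y ~ log(x1) + log(x2))            # model (9)
m_comb <- lm(y ~ x1 + x2 + log(x1) + log(x2))  # combined model (10)

# H0: gamma3 = gamma4 = 0, i.e. the level terms alone suffice (model 8)
p_lin <- anova(m_lin, m_comb)$`Pr(>F)`[2]
# H0: gamma1 = gamma2 = 0, i.e. the log terms alone suffice (model 9)
p_log <- anova(m_log, m_comb)$`Pr(>F)`[2]

print(c(p_lin = p_lin, p_log = p_log))
```

With data generated from (9), one would expect the first hypothesis to be rejected clearly, while the second is rejected only at roughly the nominal rate of the test.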
Davidson and MacKinnon (1981)³ procedure:
For example, to test (8), first estimate
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \theta_1 \hat{y} + v, \tag{11}$$
where $\hat{y}$ is the fitted value from (9). A significant t-value of the $\theta_1$-estimate is a rejection of (8).
Similarly, if $\tilde{y}$ denotes the fitted values from (8), the test of (9) is the t-statistic of the $\theta_1$-estimate from
$$y = \beta_0 + \beta_1 \log(x_1) + \beta_2 \log(x_2) + \theta_1 \tilde{y} + v. \tag{12}$$
³ Davidson, R. and J.G. MacKinnon (1981). Several tests for model specification in the presence of alternative hypotheses, Econometrica 49, 781–793.
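The two auxiliary regressions (11) and (12) can be sketched in base R as follows. The data are again simulated from the logarithmic model, so the variable names and numbers are illustrative assumptions.

```r
# Simulated data only: generated from the log specification (9).
set.seed(4)
n  <- 300
x1 <- runif(n, 1, 20)
x2 <- runif(n, 1, 20)
y  <- 2 + 1.5 * log(x1) + 0.8 * log(x2) + rnorm(n, sd = 0.5)

fit8 <- lm(y ~ x1 + x2)             # model (8)
fit9 <- lm(y ~ log(x1) + log(x2))   # model (9)
yhat9 <- fitted(fit9)               # fitted values from (9), used in (11)
yhat8 <- fitted(fit8)               # fitted values from (8), used in (12)

test8 <- lm(y ~ x1 + x2 + yhat9)            # regression (11): test of (8)
test9 <- lm(y ~ log(x1) + log(x2) + yhat8)  # regression (12): test of (9)

# t-statistic and p-value of the theta1 estimate in each regression
res8 <- summary(test8)$coefficients["yhat9", c("t value", "Pr(>|t|)")]
res9 <- summary(test9)$coefficients["yhat8", c("t value", "Pr(>|t|)")]

print(rbind(test_of_8 = res8, test_of_9 = res9))
```

Each test needs only one extra regressor, which is the main practical advantage over the combined model when the competing specifications have many explanatory variables.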
Remark 8.1: A clear winner need not emerge. Both models may be rejected or neither may be rejected. In the latter case the adjusted R-square can be used to select the better fitting one. If both models are rejected, more work is needed.⁴
⁴ For more complicated cases, see Wooldridge, J.M. (1994). A simple specification test for the predictive ability of transformation models, Review of Economics and Statistics 76, 59–65.
Using proxies for unobserved explanatory variables
As discussed earlier, an important source of bias in OLS is omitted variables that are correlated with the included explanatory variables.
Often the reason for omission is that these variables are unobservable.
A way to mitigate the problem is to collect data on proxy variables.
Consider the following regression
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2^* + u, \tag{13}$$
where $x_2^*$ is an unobservable variable (e.g. human ability).
Suppose that the primary interest is to estimate $\beta_1$, so that $x_2^*$ is a control variable.
However, as we know, the simple regression $y = \beta_0 + \beta_1 x_1 + v$ results in a biased and inconsistent OLS estimator of $\beta_1$ such that $\operatorname{plim} \hat{\beta}_1 = \beta_1 + \gamma_1 \beta_2$, where $\gamma_1$ is the slope coefficient of the regression $x_2^* = \gamma_0 + \gamma_1 x_1 + \text{error}$.
Suppose that we have a 'good' proxy $x_2$ for $x_2^*$ such that
$E[x_2^* \mid x_2, x_1] = E[x_2^* \mid x_2]$, i.e., given the proxy $x_2$, $x_1$ does not help in predicting the unobserved variable $x_2^*$, and
$E[u \mid x_2] = 0$ for the error term in regression (13).
These imply that in the regression $x_2^* = \delta_0 + \delta_1 x_2 + \theta x_1 + e$ we have $\theta = 0$, so that only the proxy $x_2$ is related to the unobserved variable $x_2^*$, and that the proxy $x_2$ is not correlated with the error term of the true regression (13).
With this kind of good proxy, instead of (13) the model to be estimated becomes
$$y = \alpha_0 + \beta_1 x_1 + \alpha_2 x_2 + w. \tag{14}$$
Now OLS is an unbiased and consistent estimator of $\beta_1$, the parameter we are primarily interested in. (The OLS estimators of $\alpha_0$ and $\alpha_2$ are also unbiased and consistent for these parameters, but $\alpha_0 = \beta_0 + \beta_2\delta_0$ and $\alpha_2 = \delta_1\beta_2$ differ from $\beta_0$ and $\beta_2$.)
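The step from (13) to (14) can be written out as a short substitution. This is a sketch of the standard plug-in argument, using the proxy regression with $\theta = 0$ and the proxy conditions stated above; the symbol $w$ for the composite error follows (14).

```latex
\begin{align*}
x_2^* &= \delta_0 + \delta_1 x_2 + e
  && \text{(proxy regression with } \theta = 0\text{)} \\
y &= \beta_0 + \beta_1 x_1 + \beta_2(\delta_0 + \delta_1 x_2 + e) + u
  && \text{(substitute into (13))} \\
  &= \underbrace{(\beta_0 + \beta_2\delta_0)}_{\text{intercept of (14)}}
     + \beta_1 x_1
     + \underbrace{\beta_2\delta_1}_{\text{coefficient on } x_2}\, x_2
     + \underbrace{(u + \beta_2 e)}_{w}
\end{align*}
```

The coefficient on $x_1$ is still $\beta_1$, and the composite error $w = u + \beta_2 e$ is unrelated to $x_1$ and $x_2$ under the proxy conditions, which is why OLS on (14) recovers $\beta_1$ but not $\beta_0$ or $\beta_2$.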
Example 3
Consider the return to education in monthly wages for men (wage2 data set).
lm(formula = log(wage) ~ educ + exper + tenure + married + south +
urban + black, data = wdf)
Residuals:
Min 1Q Median 3Q Max
-1.98069 -0.21996 0.00707 0.24288 1.22822
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.395497 0.113225 47.653 < 2e-16 ***
educ 0.065431 0.006250 10.468 < 2e-16 ***
exper 0.014043 0.003185 4.409 1.16e-05 ***
tenure 0.011747 0.002453 4.789 1.95e-06 ***
married 0.199417 0.039050 5.107 3.98e-07 ***
south -0.090904 0.026249 -3.463 0.000558 ***
urban 0.183912 0.026958 6.822 1.62e-11 ***
black -0.188350 0.037667 -5.000 6.84e-07 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 0.3655 on 927 degrees of freedom
Multiple R-squared: 0.2526,Adjusted R-squared: 0.2469
F-statistic: 44.75 on 7 and 927 DF, p-value: < 2.2e-16
The estimated return to education is 6.5%. However, if the omitted ability is positively correlated with educ, the estimate is too high.
Adding IQ as a proxy for ability to the equation reduces the estimate to 5.4%, which is consistent with the presumed omitted variable bias.
lm(formula = log(wage) ~ educ + exper + tenure + married + south +
urban + black + iq, data = wdf)
Residuals:
Min 1Q Median 3Q Max
-2.01203 -0.22244 0.01017 0.22951 1.27478
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.1764391 0.1280006 40.441 < 2e-16 ***
educ 0.0544106 0.0069285 7.853 1.12e-14 ***
exper 0.0141459 0.0031651 4.469 8.82e-06 ***
tenure 0.0113951 0.0024394 4.671 3.44e-06 ***
married 0.1997644 0.0388025 5.148 3.21e-07 ***
south -0.0801695 0.0262529 -3.054 0.002325 **
urban 0.1819463 0.0267929 6.791 1.99e-11 ***
black -0.1431253 0.0394925 -3.624 0.000306 ***
iq 0.0035591 0.0009918 3.589 0.000350 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 0.3632 on 926 degrees of freedom
Multiple R-squared: 0.2628,Adjusted R-squared: 0.2564
F-statistic: 41.27 on 8 and 926 DF, p-value: < 2.2e-16
Test whether the interaction of ability and education affects wages.
lm(formula = log(wage) ~ educ + exper + tenure + married + south +
urban + black + iq + iq:educ, data = wdf) # iq:educ introduces interaction iq*educ
Residuals:
Min 1Q Median 3Q Max
-2.00733 -0.21715 0.01177 0.23456 1.27305
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.6482478 0.5462963 10.339 < 2e-16 ***
educ 0.0184560 0.0410608 0.449 0.653192
exper 0.0139072 0.0031768 4.378 1.34e-05 ***
tenure 0.0113929 0.0024397 4.670 3.46e-06 ***
married 0.2008658 0.0388267 5.173 2.82e-07 ***
south -0.0802354 0.0262560 -3.056 0.002308 **
urban 0.1835758 0.0268586 6.835 1.49e-11 ***
black -0.1466989 0.0397013 -3.695 0.000233 ***
iq -0.0009418 0.0051625 -0.182 0.855290
educ:iq 0.0003399 0.0003826 0.888 0.374564
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 0.3632 on 925 degrees of freedom
Multiple R-squared: 0.2634,Adjusted R-squared: 0.2563
F-statistic: 36.76 on 9 and 925 DF, p-value: < 2.2e-16
Adding iq × educ is not only insignificant, but it also renders educ and iq insignificant!
This is due to the high correlation of the interaction term with its components:
> with(wdf, cor(cbind(educ, iq, educ*iq)))
educ iq educ*iq
educ 1.0000000 0.5156970 0.8880035
iq 0.5156970 1.0000000 0.8453237
educ*iq 0.8880035 0.8453237 1.0000000
The implied collinearity can be materially reduced by defining the interaction term in terms of demeaned variables:
> with(wdf, cor(cbind(educ, iq, (educ - mean(educ))*(iq - mean(iq)))))
                       educ         iq  (e-m(e))*(i-m(i))
educ              1.0000000  0.5156970          0.1864668
iq                0.5156970  1.0000000         -0.0133327
(e-m(e))*(i-m(i)) 0.1864668 -0.0133327          1.0000000
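The interaction term of the demeaned components also leads to a meaningful interpretation of the implied model.
Writing the original model with the interaction term as
$$\log(\mathit{wage}) = \beta_0 + \beta_1 \mathit{educ} + \beta_2 \mathit{iq} + \beta_{12}(\mathit{educ} \times \mathit{iq}) + \text{other factors}, \tag{15}$$
an equivalent representation in terms of the demeaned interaction becomes
$$\log(\mathit{wage}) = \gamma_0 + \gamma_1 \mathit{educ} + \gamma_2 \mathit{iq} + \beta_{12}(\widetilde{\mathit{educ}} \times \widetilde{\mathit{iq}}) + \text{other factors}, \tag{16}$$
where $\widetilde{\mathit{educ}} = \mathit{educ} - \overline{\mathit{educ}}$ and $\widetilde{\mathit{iq}} = \mathit{iq} - \overline{\mathit{iq}}$ are the demeaned educ and iq.
The relations between the coefficients of the original model (15) and model (16) are $\beta_0 = \gamma_0 + \beta_{12}(\overline{\mathit{educ}} \times \overline{\mathit{iq}})$, $\beta_1 = \gamma_1 - \beta_{12}\overline{\mathit{iq}}$, and $\beta_2 = \gamma_2 - \beta_{12}\overline{\mathit{educ}}$.
For example, at the mean IQ, $\widetilde{\mathit{iq}} = 0$, so that $\gamma_1$ indicates the return to education for a person with average ability.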
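The coefficient relations between the raw and the demeaned interaction model can be checked numerically. The sketch below uses simulated data, not the wage2 data set; the variable names lw, educ and iq and all parameter values are assumptions for illustration.

```r
# Simulated data only: not the wage2 data set.
set.seed(5)
n    <- 500
iq   <- rnorm(n, mean = 100, sd = 15)
educ <- round(13 + 0.08 * (iq - 100) + rnorm(n, sd = 2))
lw   <- 5 + 0.05 * educ + 0.004 * iq + 0.0003 * educ * iq + rnorm(n, sd = 0.35)

raw <- lm(lw ~ educ + iq + educ:iq)   # interaction of the raw variables
dem <- lm(lw ~ educ + iq +
            I((educ - mean(educ)) * (iq - mean(iq))))  # demeaned interaction

b <- unname(coef(raw))  # beta0, beta1, beta2, beta12
g <- unname(coef(dem))  # gamma0, gamma1, gamma2, beta12

# Differences implied by the coefficient relations; all should be near zero
check <- c(
  beta12 = b[4] - g[4],
  beta1  = b[2] - (g[2] - g[4] * mean(iq)),
  beta2  = b[3] - (g[3] - g[4] * mean(educ)),
  beta0  = b[1] - (g[1] + g[4] * mean(educ) * mean(iq))
)
print(check)

# Correlation of educ with each version of the interaction term
print(c(raw = cor(educ, educ * iq),
        demeaned = cor(educ, (educ - mean(educ)) * (iq - mean(iq)))))
```

Both regressions give the same fit and the same interaction coefficient; only the meaning and the precision of the coefficients on educ and iq change.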
Estimating the model, however, gives $\hat{\beta}_{12} = 0.00034$ with p-value 0.37, which is not at all statistically significant. This implies that there is no evidence that variability in IQ as such affects the return to education.
Dependent variable: log(wage)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.1846286 0.1283466 40.396 < 2e-16
educ 0.0528786 0.0071406 7.405 2.94e-13
exper 0.0139072 0.0031768 4.378 1.34e-05
tenure 0.0113929 0.0024397 4.670 3.46e-06
married 0.2008658 0.0388267 5.173 2.82e-07
south -0.0802354 0.0262560 -3.056 0.002308
urban 0.1835758 0.0268586 6.835 1.49e-11
black -0.1466989 0.0397013 -3.695 0.000233
iq 0.0036357 0.0009957 3.652 0.000275
(iq - mean(iq)) x (educ - mean(educ)) 0.0003399 0.0003826 0.888 0.374564
Residual standard error: 0.3632 on 925 degrees of freedom
Multiple R-squared: 0.2634,Adjusted R-squared: 0.2563
F-statistic: 36.76 on 9 and 925 DF, p-value: < 2.2e-16
Outliers
Particularly in small data sets, OLS estimates may be influenced by one or several observations (see figure).
Generally such observations are called outliers or influential observations.
Loosely, an observation is an outlier if dropping it changes the estimation results materially.
In detecting outliers, a usual practice is to investigate standardized (or "studentized") residuals.
If an outlier is an obvious mistake in recording the data, it can be corrected. A usual practice is also to eliminate such observations.
Data transformations, like taking logarithms, often narrow the range of the data and hence may alleviate outlier problems, too.
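A minimal sketch of outlier screening with studentized residuals in base R follows. The data are simulated and one observation is contaminated on purpose; the cutoff of 3 is a common rule of thumb assumed here, not a formal test.

```r
# Simulated data only: the last observation is shifted to mimic a recording error.
set.seed(6)
n <- 30
x <- runif(n, 0, 10)
y <- 1 + 2 * x + rnorm(n)
y[n] <- y[n] + 15
dat <- data.frame(x = x, y = y)

fit   <- lm(y ~ x, data = dat)
rstud <- rstudent(fit)               # studentized residuals
flag  <- which(abs(rstud) > 3)       # rule-of-thumb cutoff (an assumption)

# Re-estimate without the flagged observations and compare the coefficients
keep     <- setdiff(seq_len(n), flag)
fit_drop <- lm(y ~ x, data = dat[keep, ])

print(flag)
print(rbind(all = coef(fit), dropped = coef(fit_drop)))
```

Comparing the two sets of coefficients is a direct way to apply the loose definition above: an observation matters if dropping it changes the estimates materially.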