
Problem Set 4 (Group 5)

1. Paper & Pen

a) The model is $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, where $x_i$ is endogenous in the bivariate regression and the instrument satisfies $E(z\epsilon) = 0$ and $E(\epsilon) = 0$.

Stacking the regressors as $x_i = (1, x_i)'$ and the instruments as $z_i = (1, z_i)'$, the moment conditions give

$E(z\epsilon) = 0 \;\Rightarrow\; E[z(y - x'\beta)] = E(zy) - E(zx')\beta = 0 \;\Rightarrow\; E(zx')\beta = E(zy)$

If one can assume that $E(zx')$ exists, is finite and invertible, we can solve for $\beta$ (the $2 \times 1$ vector containing $\beta_0$ and $\beta_1$). Hence,

$\beta = E(zx')^{-1} E(zy)$

Another way to solve the question is to work with the scalar moment conditions. From $E(\epsilon) = 0$:

$E(y - \beta_0 - \beta_1 x) = 0 \;\Rightarrow\; \beta_0 = E(y) - \beta_1 E(x)$

From $E(z\epsilon) = 0$:

$E[z(y - \beta_0 - \beta_1 x)] = 0 \;\Rightarrow\; E(zy) - \beta_0 E(z) - \beta_1 E(zx) = 0 \;\Rightarrow\; \beta_0 = \frac{E(zy) - \beta_1 E(zx)}{E(z)}$

Equating the two expressions for $\beta_0$:

$E(y) - \beta_1 E(x) = \frac{E(zy) - \beta_1 E(zx)}{E(z)}$

$E(y)E(z) - E(zy) = \beta_1 E(x)E(z) - \beta_1 E(zx)$

$\beta_1 \left[ E(x)E(z) - E(zx) \right] = E(y)E(z) - E(zy)$

$\beta_1 = \frac{E(y)E(z) - E(zy)}{E(x)E(z) - E(zx)} = \frac{E(zy) - E(z)E(y)}{E(zx) - E(z)E(x)} = \frac{Cov(z, y)}{Cov(z, x)}$

and therefore

$\beta_0 = E(y) - \frac{Cov(z, y)}{Cov(z, x)} E(x)$

with $Cov(z, x) \neq 0$, the relevance condition for the instrumental variable $z$.

b) From a) we have that $\beta = E(zx')^{-1} E(zy)$. By the Law of Large Numbers we can replace $E(zx')$ by $\frac{1}{n}\sum_{i=1}^{n} z_i x_i'$, which converges in probability to $E(zx')$. The same applies to $E(zy)$, substituting $\frac{1}{n}\sum_{i=1}^{n} z_i y_i$ in the equation. Hence,

$\hat{\beta}_{IV} = \left( \sum_{i=1}^{n} z_i x_i' \right)^{-1} \sum_{i=1}^{n} z_i y_i$

with $x_i = (1, x_i)'$ and $z_i = (1, z_i)'$; if $z_i = x_i$, the expression reduces to the OLS estimator.

In scalar form, the slope is

$\hat{\beta}_1^{IV} = \frac{\widehat{Cov}(z, y)}{\widehat{Cov}(z, x)} = \frac{\frac{1}{n}\sum_i (z_i - \bar{z})(y_i - \bar{y})}{\frac{1}{n}\sum_i (z_i - \bar{z})(x_i - \bar{x})}$

Since $E(z) = 0 \Rightarrow \bar{z} = 0$, this becomes

$\hat{\beta}_1^{IV} = \frac{\sum_i (y_i - \bar{y}) z_i}{\sum_i (x_i - \bar{x}) z_i}$

and the intercept is $b_0^{IV} = \bar{y} - b_1^{IV} \bar{x}$.

c) Now $z \in \{0, 1\}$. Taking again $\beta_1 = \frac{Cov(y, z)}{Cov(x, z)}$ and dividing numerator and denominator by the variance of $z$ gives

$\beta_1 = \frac{Cov(y, z)/Var(z)}{Cov(x, z)/Var(z)} = \frac{\gamma_1}{\pi_1}$

where $\gamma_1$ is the slope of the regression of $y$ on $z$ and $\pi_1$ is the slope of the regression of $x$ on $z$. As $z$ is binary, $\gamma_1 = E(y|z=1) - E(y|z=0)$ and $\pi_1 = E(x|z=1) - E(x|z=0)$. Therefore

$b_1 = \frac{E(y|z=1) - E(y|z=0)}{E(x|z=1) - E(x|z=0)}$

which is called the Wald estimator. So the IV estimator becomes:

$b_1^{IV} = \frac{\hat{E}(y|z=1) - \hat{E}(y|z=0)}{\hat{E}(x|z=1) - \hat{E}(x|z=0)}, \qquad b_0^{IV} = \bar{y} - b_1^{IV} \bar{x}$


2. Data

Table 1

VARIABLES                      (1)           (2)           (3)           (4)
                               lwage76       ed76          lwage76       lwage76
--------------------------------------------------------------------------------
Education (ed76)               0.0561***                   0.0671***     0.0679***
                               (0.00437)                   (0.0114)      (0.0113)
Lived in Metrop. Area          0.163***      0.332***      0.159***      0.159***
(smsa76)                       (0.0150)      (0.0921)      (0.0155)      (0.0155)
Lived in South (south76)       -0.120***     -0.147*       -0.118***     -0.118***
                               (0.0151)      (0.0864)      (0.0154)      (0.0154)
Enrolled (enroll76)            -0.122***     1.012***      -0.134***     -0.134***
                               (0.0249)      (0.116)       (0.0267)      (0.0267)
Black                          -0.119***     -0.0107       -0.116***     -0.116***
                               (0.0192)      (0.112)       (0.0196)      (0.0196)
KWW score (kww)                0.00750***    0.110***      0.00605***    0.00593***
                               (0.00110)     (0.00548)     (0.00181)     (0.00180)
Married (mar76)                -0.0323***    0.108***      -0.0338***    -0.0338***
                               (0.00356)     (0.0190)      (0.00383)     (0.00383)
Experience (exp76)             0.0573***                   0.0507***     0.0512***
                               (0.00690)                   (0.0168)      (0.0168)
Experience2 (exp762)           -0.00154***                 -0.00115      -0.00117
                               (0.000314)                  (0.000826)    (0.000826)
Mother's Education (momed)                   0.162***
                                             (0.0170)
Father's Education (daded)                   0.142***
                                             (0.0146)
Constant                       4.951***      5.970***      4.879***      4.870***
                               (0.0713)      (0.234)       (0.134)       (0.134)
--------------------------------------------------------------------------------
Observations                   2,956         2,956         2,956         2,956
R-squared                      0.324         0.365         0.322         0.321

Robust standard errors in parentheses. *** p<0.01, ** p<0.05, * p<0.1

a) Column (1) of Table 1 presents the regression of log(wage) on the control variables that we believe have a direct impact on wages. We ruled out the variables that we thought affect wages only indirectly, through the education variable; the remaining controls appear with values in the first column of the table above. Age and IQ were dropped: the first would create a multicollinearity problem with the experience variables, since experience grows with age; the second is missing for nearly a third of the individuals, so excluding it keeps the full sample and gives better results in terms of R². Education has a positive effect on wages: one more year of education, on average and ceteris paribus, increases wages by about 5.6%. Being married, living in the South, or being black has a negative impact on the wage level, while individuals from metropolitan areas earn higher wages on average. Furthermore, individuals with a higher score on the Knowledge of the World of Work (KWW) test earn slightly more, everything else constant.

b) The economic interpretation is that staying more years in school may be correlated with higher ability, so the returns to education might be overestimated. At the same time, people who stay longer in school expect a higher wage at the end of their studies, while on the opposite side there are people who do not expect wage returns from additional years of education. We are therefore measuring not only the effect of one more year of education, but at the same time the difference between people who expect that one more year of education translates into a significant wage return, so OLS estimation selects on the good outcomes. Statistically speaking, this means the error term is correlated with the regressors, so $E(\epsilon|x) \neq 0$ and $Cov(\epsilon, x)$ is probably positive. As a consequence, the OLS estimator of $\beta$ is neither consistent nor unbiased.

c) In the regression for education, presented in column (2) of Table 1, we used all the variables from the log-wage regression except experience and its square, and, as asked, added the parents' education variables, momed and daded. In linear models, there are two main requirements for using an IV:
- The instrument must be correlated with the endogenous explanatory variable, conditional on the other covariates (relevance).
- The instrument cannot be correlated with the error term in the explanatory equation, conditional on the other covariates (exogeneity); that is, the instrument cannot suffer from the same problem as the original regressor.
We calculated the F-statistic of the joint test on momed and daded and obtained 167.54, which suggests that our instruments are strong and supports their validity as instruments.

d) There was no squared-age variable in our database, so we created age762, which is simply the square of age76. Table 2 presents the three first-stage regressions needed to answer this question. They allow us to check the strength of our instruments through a joint F-test on our 4 instrumental variables. For the three regressions we obtain: ed76: F-value of 98.85; exp76: F-value of 1694.41; exp762: F-value of 1024.78.

These F-statistics show that our instruments (mother's and father's education, age and age squared) are strong. In Table 1, column (3), we report the IV regression of log(wage) on education and experience, using parents' education and age as instruments. The IV estimator for education is higher than the OLS estimate and has a larger standard error than in (1), while the estimator for experience is smaller than in the OLS estimation. Experience squared is no longer significant.
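The joint F-statistics quoted above can be reproduced by comparing restricted and unrestricted first-stage regressions. The sketch below is ours and uses the classical (homoskedastic) F for simplicity, whereas the reported numbers come from robust Stata output, so the values would differ slightly; the variable names follow the dataset, but the arrays are assumed to be already loaded:

```python
import numpy as np

def ols_ssr(y, X):
    """Residual sum of squares from an OLS fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    u = y - X @ beta
    return u @ u

def joint_f(y, X_controls, X_instr):
    """F-statistic for H0: the coefficients on all columns of X_instr are zero."""
    X_u = np.column_stack([X_controls, X_instr])   # unrestricted first stage
    q = X_instr.shape[1]                           # number of restrictions
    ssr_r, ssr_u = ols_ssr(y, X_controls), ols_ssr(y, X_u)
    return ((ssr_r - ssr_u) / q) / (ssr_u / (len(y) - X_u.shape[1]))

# e.g. first stage for ed76 with instruments momed and daded (question c), or the
# four instruments momed, daded, age76, age762 for each of ed76, exp76, exp762:
# controls = np.column_stack([np.ones(n), smsa76, south76, enroll76, black, kww, mar76])
# print(joint_f(ed76, controls, np.column_stack([momed, daded])))
```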

Table 2

VARIABLES                (1)           (2)           (3)
                         ed76          exp76         exp762
------------------------------------------------------------
momed                    0.154***      -0.154***     -3.071***
                         (0.0168)      (0.0168)      (0.370)
daded                    0.133***      -0.133***     -2.177***
                         (0.0145)      (0.0145)      (0.328)
age76                    0.512**       0.488*        -43.22***
                         (0.256)       (0.256)       (5.879)
age762                   -0.0106**     0.0106**      1.134***
                         (0.00445)     (0.00445)     (0.104)
smsa76                   0.300***      -0.300***     -7.102***
                         (0.0906)      (0.0906)      (2.019)
south76                  -0.146*       0.146*        5.123***
                         (0.0854)      (0.0854)      (1.820)
enroll76                 0.926***      -0.926***     -15.01***
                         (0.117)       (0.117)       (2.196)
black                    0.0766        -0.0766       -4.277*
                         (0.112)       (0.112)       (2.585)
kww                      0.127***      -0.127***     -2.705***
                         (0.00606)     (0.00606)     (0.146)
mar76                    0.0840***     -0.0840***    -0.778**
                         (0.0195)      (0.0195)      (0.384)
Constant                 -0.227        -5.773        554.1***
                         (3.631)       (3.631)       (81.49)
------------------------------------------------------------
Observations             2,956         2,956         2,956
R-squared                0.377         0.741         0.715

Robust standard errors in parentheses. *** p<0.01, ** p<0.05, * p<0.1

e) A test of overidentifying restrictions regresses the residuals from an IV or 2SLS regression on all instruments in Z. Under the null hypothesis that all instruments are uncorrelated with u, the test statistic has a large-sample $\chi^2(r)$ distribution, where r is the number of overidentifying restrictions. Under the assumption of i.i.d. errors this is known as the Sargan test, and it is routinely produced in Stata by ivreg2 for IV and 2SLS estimates. It can also be calculated after ivreg estimation with the overid command, which is part of the ivreg2 suite; after ivregress, the command estat overid provides the test. We then only need to compare the statistic reported by Stata with the critical value:

$nR^2 \sim \chi^2(R - K)$

$2956 \times 0.0009 = 2.6604 < \chi^2(11 - 10) = \chi^2(1) = 3.841$

The test statistic is lower than the critical value from the $\chi^2$ table (one restriction, 95% level), so the null hypothesis, which here is that all the instrumental variables are exogenous, cannot be rejected. To sum up, our instruments are strong and valid.
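By hand, the $nR^2$ statistic comes from regressing the 2SLS residuals on the full instrument matrix, as described above. A sketch of the mechanics (ours; it assumes the residual vector and the matrix Z of all instruments plus included exogenous regressors are available):

```python
import numpy as np
from scipy import stats

def sargan(resid, Z, n_overid):
    """n*R^2 from regressing 2SLS residuals on all instruments; chi2(r) under H0."""
    gamma, *_ = np.linalg.lstsq(Z, resid, rcond=None)
    fitted = Z @ gamma
    r2 = 1 - ((resid - fitted) ** 2).sum() / ((resid - resid.mean()) ** 2).sum()
    stat = len(resid) * r2
    return stat, stats.chi2.sf(stat, df=n_overid)

# with the numbers above: n = 2956, R^2 = 0.0009, one over-identifying restriction
print(2956 * 0.0009, "<", stats.chi2.ppf(0.95, df=1))   # 2.6604 < 3.841: do not reject
```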

f) We applied the same method as in e) and computed the over-identification test statistic:

$2956 \times 0.0012 = 3.5472 < \chi^2(12 - 10) = \chi^2(2) = 5.99$

The test statistic is lower than the critical value, so we cannot reject the null hypothesis that all our instrumental variables are exogenous. From the regression results in Table 1, column (4), the differences in the education and experience coefficients are small when the living-near-a-college variable is used as an additional instrument. To check whether this variable should be included, we can run the first-stage regressions for the instrumented variables with the dummy added: if the dummy is significant in explaining the instrumented variable and the joint F-test is high enough, we should include it to get a better IV model. We can also perform likelihood-ratio tests between the versions of the three first-stage regressions with and without nearc. None of the likelihood-ratio tests rejects the null hypothesis, so in this case it is not good to use nearc as an instrument. The F-statistics for the joint significance of the instruments are lower in the first two regressions, 88.48 for schooling and 1434.23 for experience, while the F for experience² increases to 1293.90. We can therefore say that nearc is not a good instrument, since it has no additional explanatory power.

g) The returns to education estimated with the IV model are higher than with OLS (0.0561 < 0.0679), but the standard errors of the IV estimator are much bigger than those of OLS. We expected the OLS estimator to give biased results with higher returns to education than the IV estimates, but it turns out to be the opposite. Even though it is more precise, the OLS estimator is biased, so it is not a good option.

Table 3

                        OLS (Table 1, column 1)   IV (Table 1, column 4)
Return to education     0.0561***                 0.0679***
Std. Error              (0.00437)                 (0.0113)

3. Simulation

a)-e) We start our analysis with the definition of the model: $y = \beta_0 + \beta_1 x + \epsilon$. As we know, x is an endogenous regressor, which means that $E(\epsilon|x) \neq 0$. Our model is defined as follows:

$\beta_0 = 1, \quad \beta_1 = 0.1$

$x = 12 \left( \frac{x_1 + x_2}{2} + 0.5 \right)$, where $x_1 \sim U(-0.5, 0.5)$ and $x_2 \sim U(-0.5, 0.5)$

$\epsilon = U(-0.5, 0.5) + 0.5 x_1$

We run a Monte Carlo simulation with different sample sizes and different strengths of the instruments, starting with a small sample of 20 observations. We can immediately observe that there is a problem of endogeneity in the OLS estimation: $E(\epsilon|x) \neq 0$, so the estimator b is biased and inconsistent. In what sense is there endogeneity? The error term is correlated with $x_1$, and $x_1$ enters our x variable. Moreover, our instrument $z_1$ is correlated with $x_2$, and consequently with x, but it is not correlated with the error term. The instrument $z_2$, on the other hand, is correlated with $x_1$ (and hence with x, like $z_1$) but also with the error term, since $x_1$ is correlated with $\epsilon$.

It is easy to see that, under this endogeneity, OLS overestimates the slope parameter and underestimates the intercept.

The underestimation of the intercept shows up in the table as $b_{0,OLS} - \beta_0$ (the deviation of the estimated intercept from its true value), which has a negative mean: -0.242.

Conversely, the overestimation of the slope shows up in the positive mean deviation $b_{1,OLS} - \beta_1 = 0.040$. Using the IV1 regression, which has only $z_1$ as instrumental variable, the deviations are roughly centred around 0: slightly positive for $b_0$ (0.051) and slightly negative for $b_1$ (-0.008). The problem is that, even though IV1 gives a more centred deviation, Fig. 1 shows that the IV estimator is less precise. To sum up: dropping $z_2$ as an instrumental variable and focusing only on $z_1$ makes the deviations less biased, but in exchange we get a less precise estimation (visible in the higher standard errors of the IV regression compared with OLS).
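The problem set does not spell out how $z_1$ and $z_2$ are built, only that $z_1$ loads on $x_2$ with weight 1.2 (see g) below) and that $z_2$ is a function of $x_1$; the sketch below is therefore our reconstruction of the Monte Carlo under those assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)
BETA0, BETA1 = 1.0, 0.1

def one_draw(n, w):
    """One sample from the DGP; w is the weight of x2 in the instrument z1."""
    x1 = rng.uniform(-0.5, 0.5, n)
    x2 = rng.uniform(-0.5, 0.5, n)
    x = 12 * ((x1 + x2) / 2 + 0.5)
    eps = rng.uniform(-0.5, 0.5, n) + 0.5 * x1     # endogeneity enters through x1
    y = BETA0 + BETA1 * x + eps
    z1 = w * x2 + rng.uniform(-0.5, 0.5, n)        # valid: built from x2 only (assumed form)
    z2 = x1 + rng.uniform(-0.5, 0.5, n)            # invalid: correlated with eps (assumed form)
    return y, x, z1, z2

def slopes(n=20, w=1.2, reps=1000):
    """Mean and std. dev. of the slope deviations b1 - beta1 for OLS and IV1."""
    dev = np.empty((reps, 2))
    for r in range(reps):
        y, x, z1, _ = one_draw(n, w)
        b_ols = np.cov(x, y)[0, 1] / np.cov(x, y)[0, 0]
        b_iv1 = np.cov(z1, y)[0, 1] / np.cov(z1, x)[0, 1]
        dev[r] = (b_ols - BETA1, b_iv1 - BETA1)
    return dev.mean(axis=0), dev.std(axis=0)

print(slopes(n=20, w=1.2))   # OLS deviation positive (biased), IV1 near zero but noisier
```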


We also studied the Sargan test, which tests the validity of the instrumental variables and is used for testing over-identifying restrictions. A model is over-identified if the number of instruments is larger than the number of endogenous regressors. The test examines the exogeneity of all the implemented instruments: if Sargan's null hypothesis is not rejected, our variables are valid instruments, because they are uncorrelated with the residuals. Our Sargan test shows that the null hypothesis is rejected more often with a stronger instrument $z_1$ and many observations, which implies that the weaker $z_1$ is, the less often the Sargan test detects the endogeneity of $z_2$.

If we look at Table 2 and Figure 1, we see that the IV2 estimation, which uses the two instruments $z_1$ and $z_2$, has the same problems as the OLS estimation: its estimators are also biased, and it is even less precise than OLS, since $z_2$ is a function of $x_1$ (which generates the endogeneity).

In any case, we reject the null hypothesis of the Sargan test in only 23% of the cases, which would suggest that the IV2 regression is correctly specified and that we can consider our instruments exogenous.

Fig. 1: N = 20, strong instrument

Table 1: N=20 (strong instrument)

Variable Obs Mean Std. Dev. Min Max

b0,OLS – β0 1000 -0.242 0.184 -1.194 0.400

b0,IV(1) – β0 1000 0.051 0.399 -1.401 2.953

b0,IV(2) – β0 1000 -0.155 0.268 -1.518 1.451

b1,OLS – β1 1000 0.040 0.028 -0.061 0.156

b1,IV(1) – β1 1000 -0.008 0.065 -0.455 0.238

b1,IV(2) – β1 1000 0.026 0.043 -0.246 0.259

Sargan 1000 0.162 0.369 0.000 1.000

b1,ex 1000 0.040 0.028 -0.061 0.156

ρxz 1000 0.603 0.132 0.073 0.901

F-test value 1000 13.194 9.443 0.097 77.242

In Table 1, the F-test has a mean of 13.194 for a sample of 20 observations (above the rule-of-thumb value of 10), which means that in the first-stage regression the coefficient of x on $z_1$ is significantly different from zero. The $\rho_{xz}$ value shows the correlation between the instrument and the endogenous regressor; at 0.603 it is quite strong, and consequently the instrument is quite strong too.
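For a single instrument, the first-stage F is just the squared t-statistic of $z_1$; a quick check on one draw, reusing one_draw from the sketch above (ours, under the same assumed instrument construction):

```python
import numpy as np

n = 20
y, x, z1, _ = one_draw(n, w=1.2)
rho_xz = np.corrcoef(x, z1)[0, 1]          # correlation between instrument and regressor

Z = np.column_stack([np.ones(n), z1])      # first stage: x on a constant and z1
b, *_ = np.linalg.lstsq(Z, x, rcond=None)
u = x - Z @ b
se = np.sqrt((u @ u) / (n - 2) / ((z1 - z1.mean()) ** 2).sum())
F = (b[1] / se) ** 2                       # with one instrument, F = t^2
print(rho_xz, F)
```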

f) From Table 2 we notice that, taking a larger sample with 2000 observations, our distributions become more concentrated.

For OLS this means the distribution is more tightly centred around the biased value, so its estimates are more precise around the wrong values; the estimation remains far from the true ones.

In IV1 we have a smaller standard deviation with the larger sample, even if it is still higher than in the OLS estimation → IV1 is consistent.

Also in the IV2 regression the estimators are more precise around the wrong values than with a sample of 20 observations → because of the endogeneity of $z_2$.

The Sargan test in this case tells us to reject the null hypothesis → the instruments are not exogenous (because of $z_2$). As the sample becomes larger, the endogeneity among the variables is detected more easily.

If we look at the correlation ($\rho_{xz}$) in the sample with N = 2000, we observe that the instrument is strong and that we can use it to approximate the ratio of the standard errors.


Fig. 2 shows that IV2 is around 0 and quite precise, while IV1 and OLS are far from 0 and consequently biased. It is also important to highlight that the scale used is very small; if we stretched the scale, the two distributions would be clearly distinct from each other.

Table 2: N = 2000, strong instrument

Fig. 2: N = 2000, strong instrument

g) We want to create a weak instrument. How can we do it? We decided to reduce the weight of the component that carries the influence of $x_2$ into $z_1$ from 1.2 to 0.05.
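Within our reconstructed simulation, this amounts to changing a single weight (still under the assumed form of $z_1$):

```python
# strong instrument: z1 = 1.2 * x2 + noise (as in the sketch above)
print(slopes(n=20,   w=1.2))
# weak instrument: shrink the x2 weight to 0.05
print(slopes(n=20,   w=0.05))    # IV1 deviation no longer centred, dispersion blows up
print(slopes(n=2000, w=0.05))
```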

Table 3: N = 20, weak instrument

Table 4: N = 2000, weak instrument

From Fig. 3 and Fig. 4 we can see that OLS does not change, whereas IV1, both with a sample of 20 observations and with a sample of 2000, no longer converges to the true value.

Furthermore, the standard deviations of our estimated coefficients increase dramatically.

Examining the IV2 regression, the surprising thing is that, with a weak instrument, it is more precise than IV1; on the other hand, it is still more biased than OLS.

Using a weak instrument, the correlation falls steeply (N = 20: from 0.603 to 0.044; N = 2000: from 0.611 to 0.05), so a much bigger sample would be needed to bring the standard deviation down.

In this case the Sargan test suggests accepting the null hypothesis of exogenous instruments (even though $z_2$ is endogenous).

Moreover, the F-test, as shown in Fig. 3 and Fig. 4, is far below the rule-of-thumb value of 10: it takes a value of 1.17 with a sample of 20 observations and 5.88 with a sample of 2000 (not unexpected, since we deliberately created a very weak instrument).

Comparing these graphs with those in point f), we see how fundamental the strength of an instrument is for our analyses: weak instruments cause a problem of severe inefficiency.

Fig. 3: N = 20, weak instrument

Fig. 4: N = 2000, weak instrument