Chapter 08 - Model Selection in Multiple Linear Regression Analysis

CHAPTER 8

Answers to End of Chapter Problems

8.1 a. For the average individual, holding the effects of average points per game, average rebounds, and player position constant, if the number of years a player has been in the NBA goes up by one year, salary increases by 16%.

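b. The model is

$$\ln(\text{Salary}_i) = \beta_0 + \beta_1 \text{Yrs}_i + \beta_2 \text{PPG}_i + \beta_3 \text{RPG}_i + \beta_4 F_i + \beta_5 G_i + \beta_6 (F_i \times \text{RPG}_i) + \beta_7 (G_i \times \text{RPG}_i) + \varepsilon_i$$

To test this hypothesis, perform a t-test of $\beta_6 = 0$ (the returns to rebounds are the same for Forwards and Centers), a t-test of $\beta_7 = 0$ (the returns to rebounds are the same for Guards and Centers), and an F-test of the joint hypothesis $\beta_6 = \beta_7 = 0$.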

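c. Note the question says ANY differences.

$$\ln(\text{Salary}_i) = \beta_0 + \beta_1 \text{Yrs}_i + \beta_2 \text{PPG}_i + \beta_3 \text{RPG}_i + \beta_4 F_i + \beta_5 G_i + \beta_6 \text{Foreign}_i + \beta_7 (\text{Foreign}_i \times \text{Yrs}_i) + \beta_8 (\text{Foreign}_i \times \text{PPG}_i) + \beta_9 (\text{Foreign}_i \times \text{RPG}_i) + \varepsilon_i$$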

Where the variable foreign = 1 if the player is foreign born and foreign = 0 if the player is born in the U.S.

Hypothesis:

$$H_0: \beta_6 = \beta_7 = \beta_8 = \beta_9 = 0$$

$$H_1: \text{at least one } \beta_i \text{ is not equal to } 0$$

Test statistic:

$$F\text{-stat} = \frac{(SSUnexplained_{restricted} - SSUnexplained_{unrestricted})/4}{SSUnexplained_{unrestricted}/(n-k-1)}$$

where the restricted model is the original model in the problem (or alternatively the model with the null hypothesis imposed).

Critical Value is $F_{\alpha, 4, n-k-1}$

Rejection Rule: Reject $H_0$ if F-stat > $F_{\alpha, 4, n-k-1}$
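The F-statistic above can be computed directly from the two unexplained sums of squares. The following is a minimal sketch in Python with NumPy, not part of the original solution: the function names and the simulated data are hypothetical, and the critical value would still be looked up in an F table.

```python
import numpy as np

def ss_unexplained(X, y):
    """Sum of squared residuals from an OLS fit of y on X (X includes a constant column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def joint_f_stat(X_restricted, X_unrestricted, y):
    """F-statistic for H0: every coefficient that appears only in the unrestricted model is zero."""
    n, p = X_unrestricted.shape            # p = k + 1 (k slopes plus the intercept)
    q = p - X_restricted.shape[1]          # number of restrictions
    ss_r = ss_unexplained(X_restricted, y)
    ss_u = ss_unexplained(X_unrestricted, y)
    return ((ss_r - ss_u) / q) / (ss_u / (n - p))

# Hypothetical simulated data: x2 and x3 are irrelevant by construction.
rng = np.random.default_rng(0)
n = 200
x1, x2, x3 = rng.normal(size=(3, n))
y = 1.0 + 0.5 * x1 + rng.normal(size=n)
const = np.ones(n)
X_r = np.column_stack([const, x1])
X_u = np.column_stack([const, x1, x2, x3])
f_stat = joint_f_stat(X_r, X_u, y)
print(f"F-stat = {f_stat:.3f}")
```

The number of restrictions is inferred from the difference in column counts, so the same function covers the 4-restriction case here and any other nested comparison.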

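d. This is a Davidson-MacKinnon test.

(1) Estimate the model

$$\ln(\text{Salary}_i) = \beta_0 + \beta_1 \text{Yrs}_i + \beta_2 \text{PPG}_i + \beta_3 \text{RPG}_i + \beta_4 F_i + \beta_5 G_i + \beta_6 \text{Foreign}_i + \beta_7 (\text{Foreign}_i \times \text{Yrs}_i) + \beta_8 (\text{Foreign}_i \times \text{PPG}_i) + \beta_9 (\text{Foreign}_i \times \text{RPG}_i) + \varepsilon_i$$

and obtain the predicted value $\widehat{\ln(\text{Salary}_i)}$.

(2) Add the predicted value from step (1) to the model

$$\ln(\text{Salary}_i) = \beta_0 + \beta_1 \text{Yrs}_i + \beta_2 \text{PPG}_i + \beta_3 \text{RPG}_i + \beta_4 F_i + \beta_5 G_i + \beta_6 (F_i \times \text{RPG}_i) + \beta_7 (G_i \times \text{RPG}_i) + \beta_8 \widehat{\ln(\text{Salary}_i)} + \varepsilon_i$$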

Copyright © 2014 McGraw-Hill Education. All rights reserved. No reproduction or distribution without the prior written consent of McGraw-Hill Education.

(3) Perform a t-test for the statistical significance of $\beta_8$. If it is statistically significant then the model from step (1) may be preferred.
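The three steps of the Davidson-MacKinnon test can be sketched as follows. This is an illustrative Python/NumPy version under stated assumptions: the helper names `ols` and `j_test_t_stat` are hypothetical, the data are simulated rather than the NBA data, and conventional (homoskedastic) standard errors are used.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients and conventional standard errors."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta, np.sqrt(np.diag(cov))

def j_test_t_stat(X_alt, X_base, y):
    """t-statistic on the alternative model's fitted values when added to the base model."""
    beta_alt, _ = ols(X_alt, y)
    y_hat_alt = X_alt @ beta_alt                    # step (1): predicted values
    X_aug = np.column_stack([X_base, y_hat_alt])    # step (2): augment the other model
    beta, se = ols(X_aug, y)
    return beta[-1] / se[-1]                        # step (3): t-test on the added term

# Hypothetical simulated data generated by the "alternative" model (regressors x and z).
rng = np.random.default_rng(1)
n = 300
x, w, z = rng.normal(size=(3, n))
y = 1.0 + 0.8 * x + 1.5 * z + rng.normal(size=n)
const = np.ones(n)
X_alt = np.column_stack([const, x, z])
X_base = np.column_stack([const, x, w])
t_stat = j_test_t_stat(X_alt, X_base, y)
print(f"t-stat on predicted value = {t_stat:.2f}")
```

Because the simulated data come from the alternative model, the t-statistic on the predicted value is large in absolute value, which is the case in which the model from step (1) may be preferred.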

8.2 a. The unrestricted model is

$$FR_i = \beta_0 + \beta_1 \text{Age}_i + \beta_2 \text{Educ}_i + \beta_3 \text{Urban}_i + \varepsilon_i$$

while the restricted model is

$$FR_i = \beta_0 + \beta_1 \text{Age}_i + \varepsilon_i$$

Hypothesis:

$$H_0: \beta_2 = \beta_3 = 0$$

$$H_1: \text{at least one } \beta_i \text{ is not equal to } 0$$

Test statistic:

$$F\text{-stat} = \frac{(SSUnexplained_{restricted} - SSUnexplained_{unrestricted})/2}{SSUnexplained_{unrestricted}/(n-k-1)}$$

where the restricted model is the model with the null hypothesis imposed.

Critical Value is $F_{\alpha, 2, n-k-1}$

Rejection Rule: Reject $H_0$ if F-stat > $F_{\alpha, 2, n-k-1}$

b. Set $\beta_1 - \beta_2 = \theta$, solve for $\beta_1$ to get $\beta_1 = \theta + \beta_2$, and then substitute for $\beta_1$ in the original model.

$$FR_i = \beta_0 + \beta_1 \text{Age}_i + \beta_2 \text{Educ}_i + \beta_3 \text{Urban}_i + \varepsilon_i$$

$$FR_i = \beta_0 + (\theta + \beta_2) \text{Age}_i + \beta_2 \text{Educ}_i + \beta_3 \text{Urban}_i + \varepsilon_i$$

$$FR_i = \beta_0 + \theta \text{Age}_i + \beta_2 \text{Age}_i + \beta_2 \text{Educ}_i + \beta_3 \text{Urban}_i + \varepsilon_i$$

$$FR_i = \beta_0 + \theta \text{Age}_i + \beta_2 (\text{Age}_i + \text{Educ}_i) + \beta_3 \text{Urban}_i + \varepsilon_i$$

This last equation isolates the parameters that need to be estimated: $\theta$, $\beta_2$, and $\beta_3$. A new variable needs to be created by adding the Age and Educ columns together, and then FR is regressed on Age, (Age + Educ), and Urban. The coefficient on Age is the estimate $\hat{\beta}_1 - \hat{\beta}_2$, the standard error on Age is the standard error for this hypothesis, the t-statistic on Age is the test statistic for this test, and the p-value on Age is the p-value for this test. Reject the null hypothesis that the two coefficients are equal if the p-value is less than the significance level $\alpha$.
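The substitution can be checked numerically: the coefficient on Age in the transformed regression should equal the difference between the Age and Educ coefficients in the original regression. The sketch below uses Python with NumPy; the simulated data and the coefficient values that generate them are hypothetical and stand in for the actual fertility data.

```python
import numpy as np

# Hypothetical simulated data standing in for the fertility-rate sample.
rng = np.random.default_rng(2)
n = 500
age = rng.uniform(15, 49, size=n)
educ = rng.uniform(0, 16, size=n)
urban = rng.integers(0, 2, size=n).astype(float)
fr = 5.0 - 0.05 * age - 0.10 * educ - 0.30 * urban + rng.normal(size=n)
const = np.ones(n)

# Original model: FR on Age, Educ, Urban.
X_orig = np.column_stack([const, age, educ, urban])
b_orig, *_ = np.linalg.lstsq(X_orig, fr, rcond=None)

# Transformed model: FR on Age, (Age + Educ), Urban.
X_new = np.column_stack([const, age, age + educ, urban])
b_new, *_ = np.linalg.lstsq(X_new, fr, rcond=None)

theta_hat = b_new[1]                 # coefficient on Age in the transformed model
diff_hat = b_orig[1] - b_orig[2]     # beta1_hat - beta2_hat from the original model
print(theta_hat, diff_hat)
```

The two printed values agree up to floating-point error, which is why the standard error and p-value reported for Age in the transformed regression can be used directly for the test.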

c. This is a Davidson-MacKinnon test.

(1) Estimate the model

$$\ln(FR_i) = \beta_0 + \beta_1 \text{Age}_i + \beta_2 \text{Educ}_i + \beta_3 \text{Urban}_i + \varepsilon_i$$

and obtain the predicted value $\widehat{\ln(FR_i)}$.

(2) Add the predicted value from step (1) to the model

$$FR_i = \beta_0 + \beta_1 \text{Age}_i + \beta_2 \text{Educ}_i + \beta_3 \text{Urban}_i + \beta_4 \widehat{\ln(FR_i)} + \varepsilon_i$$

(3) Perform a t-test for the statistical significance of $\beta_4$. If it is statistically significant then the model from step (1) may be preferred.

The reason that the semi-log model is more likely to lead to biased estimates is that the natural log is a non-linear function, and the estimates it yields (without a transformation) are already biased even if the true model is non-linear. The choice of specification should largely be made on the underlying economics. If economic theory says that when age (or education) goes up by one year the percentage change in the fertility rate is constant, then the semi-log model should be estimated. In that model, the coefficient on Age is interpreted as follows: on average, holding education and urban constant, if an individual gets one year older then the fertility rate increases (decreases) by $100 \times \beta_1$ percent. The coefficient on Education is interpreted as follows: on average, holding age and urban constant, if an individual gets one more year of education then the fertility rate increases (decreases) by $100 \times \beta_2$ percent. The coefficient on Urban is interpreted as follows: on average, holding age and education constant, if an individual lives in an urban area the fertility rate is $100 \times \beta_3$ percent higher (lower) relative to living in a rural area.
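Reading a semi-log coefficient as a percentage change of 100 times the coefficient is an approximation that works well when the coefficient is small; the exact percentage change implied by a one-unit increase is $100(e^{\beta} - 1)$. The short Python sketch below compares the two. The coefficient value is hypothetical, since the solution does not report estimates.

```python
import math

def pct_change_approx(beta):
    """Approximate percentage change in the level of y for a one-unit change in x."""
    return 100.0 * beta

def pct_change_exact(beta):
    """Exact percentage change in the level of y implied by a semi-log coefficient."""
    return 100.0 * (math.exp(beta) - 1.0)

beta_age = -0.05   # hypothetical coefficient on Age
approx = pct_change_approx(beta_age)
exact = pct_change_exact(beta_age)
print(f"approximate: {approx:.2f}%  exact: {exact:.2f}%")
```

For a dummy variable such as Urban, where the change is a discrete jump from 0 to 1, the gap between the two readings grows with the size of the coefficient.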

8.3 a. To find where pollution reaches a maximum (or where diminishing marginal returns set in), set

$$4000 - 0.25(2)\,\text{GDP}_i = 0$$

which holds when GDP per capita is $8000.
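The turning point follows from setting the derivative of the quadratic to zero. A small Python sketch, assuming (as the first-order condition above implies) a linear coefficient of 4000 and a squared-term coefficient of -0.25 on GDP per capita; the function name is hypothetical.

```python
def turning_point(b_linear, b_quadratic):
    """Value of x at which b_linear * x + b_quadratic * x**2 has zero slope."""
    return -b_linear / (2.0 * b_quadratic)

gdp_star = turning_point(4000.0, -0.25)
print(gdp_star)  # 8000.0
```

Because the squared-term coefficient is negative, the zero-slope point is a maximum rather than a minimum.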

b. If all of the other multiple linear regression assumptions hold, then the consequences of heteroskedasticity are that the OLS estimates are no longer BLUE, although they remain unbiased, and that all standard errors and hypothesis tests are incorrect.

c. This is chapter 9 material.

d. This is chapter 9 material.

e. Because the dependent variable hasn’t changed, you can compare the R-squared values between the two models and if one R-squared is clearly larger than the other then that model is preferred. You could also perform a Davidson MacKinnon test.

8.4 a. This is the two-step estimator for multiple linear regression analysis. First, a formal proof. The estimates are obtained by minimizing the sum of squared residuals, which amounts to taking the derivative with respect to $\hat{\beta}_0$, $\hat{\beta}_1$, and $\hat{\beta}_2$ and setting those equations equal to 0:

$$\min_{\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2} \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{1,i} - \hat{\beta}_2 x_{2,i})^2$$

yielding the normal equations

$$\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{1,i} - \hat{\beta}_2 x_{2,i}) = 0$$

$$\sum_{i=1}^{n} x_{1,i} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{1,i} - \hat{\beta}_2 x_{2,i}) = 0$$

$$\sum_{i=1}^{n} x_{2,i} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{1,i} - \hat{\beta}_2 x_{2,i}) = 0$$

Note that when $x_1$ is regressed on $x_2$, $x_1$ can be written as the sum of the predicted values and the residuals, or $x_{1,i} = \hat{x}_{1,i} + \hat{r}_{1i}$. Substitute this into the second normal equation to obtain

$$\sum_{i=1}^{n} (\hat{x}_{1,i} + \hat{r}_{1i}) (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{1,i} - \hat{\beta}_2 x_{2,i}) = 0$$

Because the sum of the predicted values times the residual is equal to zero, or $\sum_{i=1}^{n} \hat{x}_{1,i} e_i = 0$, where $e_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{1,i} - \hat{\beta}_2 x_{2,i}$ (this follows from the first and third normal equations, since $\hat{x}_{1,i}$ is a linear function of $x_{2,i}$), the equation reduces to

$$\sum_{i=1}^{n} \hat{r}_{1i} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{1,i} - \hat{\beta}_2 x_{2,i}) = 0$$

Now, because the $\hat{r}_{1i}$ are the residuals from the regression of $x_1$ on $x_2$, we have $\sum_{i=1}^{n} x_{2,i} \hat{r}_{1i} = 0$, and because the sum of residuals is always equal to 0, we have $\sum_{i=1}^{n} \hat{r}_{1i} = 0$. Therefore we are left with

$$\sum_{i=1}^{n} \hat{r}_{1i} (y_i - \hat{\beta}_1 (\hat{x}_{1,i} + \hat{r}_{1i})) = 0$$

and then using the fact that $\sum_{i=1}^{n} \hat{x}_{1,i} \hat{r}_{1i} = 0$ we get

$$\sum_{i=1}^{n} \hat{r}_{1i} (y_i - \hat{\beta}_1 \hat{r}_{1i}) = 0$$

Solving for $\hat{\beta}_1$:

$$\hat{\beta}_1 = \frac{\sum_i \hat{r}_{1i} y_i}{\sum_i \hat{r}_{1i}^2}$$
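The two-step result can be verified numerically: the coefficient on $x_1$ from the full regression should equal the ratio of sums built from the residuals of $x_1$ on $x_2$. The Python/NumPy sketch below uses hypothetical simulated data and is only a check of the algebra, not part of the original proof.

```python
import numpy as np

# Hypothetical simulated data with correlated regressors.
rng = np.random.default_rng(3)
n = 400
x2 = rng.normal(size=n)
x1 = 0.6 * x2 + rng.normal(size=n)
y = 2.0 + 1.0 * x1 - 0.5 * x2 + rng.normal(size=n)
const = np.ones(n)

# Full regression of y on a constant, x1, and x2.
X_full = np.column_stack([const, x1, x2])
b_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)

# Step 1: regress x1 on a constant and x2 and keep the residuals r1.
Z = np.column_stack([const, x2])
g, *_ = np.linalg.lstsq(Z, x1, rcond=None)
r1 = x1 - Z @ g

# Step 2: beta1_hat = sum(r1 * y) / sum(r1 ** 2).
b1_two_step = (r1 @ y) / (r1 @ r1)
print(b_full[1], b1_two_step)
```

The residuals `r1` sum to zero and are orthogonal to `x2`, which are exactly the two facts used in the derivation to drop the intercept and the $x_2$ term.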

In Venn diagram form, and less formally: when $x_1$ is regressed on $x_2$, the part the regression captures is the pink and dark orange area, and the residuals of that regression are the red plus yellow area. Then when y is regressed on those residuals (i.e. only the red and yellow part of $x_1$), only the red area is left.

b. The expression is

$$Var(\hat{\beta}_1) = \frac{\sum_i e_i^2 / (n-k-1)}{TSS_1 (1 - R_1^2)}$$

When $x_1$ and $x_2$ have a large amount of independent variation, $R_1^2$ is small, $(1 - R_1^2)$ is large, and 1 divided by that value is small (note that $R_1^2$ is bounded between 0 and 1), so the variance of $\hat{\beta}_1$ is small. If instead $x_1$ and $x_2$ have a small amount of independent variation, $R_1^2$ is large, $(1 - R_1^2)$ is small, and 1 divided by that value is large, so the variance of $\hat{\beta}_1$ is large.
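The effect of $R_1^2$ on the variance can be illustrated by tabulating the factor $1/(1 - R_1^2)$. This is a small Python sketch with hypothetical values of $R_1^2$; the function name is an assumption for illustration.

```python
def variance_inflation(r_squared_1):
    """Factor 1 / (1 - R1^2) that scales Var(beta1_hat)."""
    if not 0.0 <= r_squared_1 < 1.0:
        raise ValueError("R-squared must lie in [0, 1)")
    return 1.0 / (1.0 - r_squared_1)

for r2 in (0.0, 0.5, 0.9, 0.99):
    print(f"R1^2 = {r2:.2f} -> factor = {variance_inflation(r2):.1f}")
```

The factor is 1 when the regressors share no variation and grows without bound as $R_1^2$ approaches 1.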

c. No, including irrelevant variables does not cause the estimates to be biased. If being taller is strongly related to being married then $R_1^2$ is large, $(1 - R_1^2)$ is small, and 1 divided by that value is large.

8.5 a. Two new variables need to be created by multiplying INF by home runs and INF by batting average, and then the following regression model is estimated:

$$\ln(\text{salary}_i) = \beta_0 + \beta_1 \text{exp}_i + \beta_2 BA_i + \beta_3 RBI_i + \beta_4 HR_i + \beta_5 INF_i + \beta_6 ALLStar_i + \beta_7 (INF_i \times HR_i) + \beta_8 (INF_i \times BA_i) + \varepsilon_i$$

Testing these two hypotheses requires two t-tests. The first is

$H_0: \beta_7 \geq 0$ (infielders do not get paid less to hit home runs than outfielders)

$H_1: \beta_7 < 0$ (infielders get paid less to hit home runs than outfielders)

Reject $H_0$ if t-stat < $-t_{\alpha, n-k-1}$.

The second is

$H_0: \beta_8 \leq 0$ (infielders do not get paid more to have a high batting average than outfielders)

$H_1: \beta_8 > 0$ (infielders get paid more to have a high batting average than outfielders)

Reject $H_0$ if t-stat > $t_{\alpha, n-k-1}$.

Note that because these are one-sided tests, the critical value is $t_{\alpha, n-k-1}$ ($\alpha$ remains whole when obtaining the critical value). If the p-value approach is used, the two-sided p-value in the regression output needs to be divided by 2 (provided the estimate has the sign stated in $H_1$) and then compared to $\alpha$.
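Regression output usually reports a two-sided p-value. Because the t distribution is symmetric, the one-sided p-value is half the two-sided value when the t-statistic has the sign stated in the alternative hypothesis, and one minus that half otherwise. A minimal Python sketch of this conversion, with a hypothetical function name and hypothetical input values:

```python
def one_sided_p_value(two_sided_p, t_stat, alternative):
    """Convert a two-sided p-value to a one-sided one.

    alternative: "greater" for H1: beta > 0, "less" for H1: beta < 0.
    """
    if alternative not in ("greater", "less"):
        raise ValueError("alternative must be 'greater' or 'less'")
    sign_agrees = t_stat > 0 if alternative == "greater" else t_stat < 0
    half = two_sided_p / 2.0
    return half if sign_agrees else 1.0 - half

# Hypothetical regression output: t-stat of 1.75 with a two-sided p-value of 0.08.
p_greater = one_sided_p_value(0.08, 1.75, "greater")
p_less = one_sided_p_value(0.08, 1.75, "less")
print(p_greater, p_less)
```

In this hypothetical case the test of a positive coefficient rejects at the 5% level, while the test of a negative coefficient does not.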

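b. Define a new variable as Native = 1 if the player is native born and Native = 0 if the player is foreign born. Multiply this dummy variable by all independent variables that were originally in the model. The new model becomes

$$\begin{aligned}
\ln(\text{salary}_i) = {} & \beta_0 + \beta_1 \text{exp}_i + \beta_2 BA_i + \beta_3 RBI_i + \beta_4 HR_i + \beta_5 INF_i + \beta_6 ALLStar_i \\
& + \beta_7 \text{Native}_i + \beta_8 (\text{Native}_i \times \text{exp}_i) + \beta_9 (\text{Native}_i \times BA_i) + \beta_{10} (\text{Native}_i \times RBI_i) \\
& + \beta_{11} (\text{Native}_i \times HR_i) + \beta_{12} (\text{Native}_i \times INF_i) + \beta_{13} (\text{Native}_i \times ALLStar_i) + \varepsilon_i
\end{aligned}$$

To test for any differences, use an F-test.

Hypothesis:

$$H_0: \beta_7 = \beta_8 = \beta_9 = \beta_{10} = \beta_{11} = \beta_{12} = \beta_{13} = 0$$

$$H_1: \text{at least one } \beta_i \text{ is not equal to } 0$$

Test statistic:

$$F\text{-stat} = \frac{(SSUnexplained_{restricted} - SSUnexplained_{unrestricted})/7}{SSUnexplained_{unrestricted}/(n-k-1)}$$

where the restricted model is the original model in the problem (or alternatively the model with the null hypothesis imposed).

Critical Value is $F_{\alpha, 7, n-k-1}$

Rejection Rule: Reject $H_0$ if F-stat > $F_{\alpha, 7, n-k-1}$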

c. This is a Davidson-MacKinnon test.

(1) Estimate the model

$$\ln(\text{salary}_i) = \beta_0 + \beta_1 \ln(\text{exp}_i) + \beta_2 BA_i + \beta_3 INF_i + \beta_4 BA_i^2 + \beta_5 RBI_i^2 + \varepsilon_i$$

and obtain the predicted value $\widehat{\ln(\text{salary}_i)}$.

(2) Add the predicted value from step (1) to the model

$$\ln(\text{salary}_i) = \beta_0 + \beta_1 \text{exp}_i + \beta_2 BA_i + \beta_3 RBI_i + \beta_4 HR_i + \beta_5 INF_i + \beta_6 ALLStar_i + \beta_7 \widehat{\ln(\text{salary}_i)} + \varepsilon_i$$

(3) Perform a t-test for the statistical significance of $\beta_7$. If it is statistically significant then the model from step (1) may be preferred.

Because the left-hand-side variable doesn't change between the two specifications, the R-squared values of the two models can also be compared.

Answers to End of Chapter Exercises

E8.1 a.

The problem with ability is that it is hard to obtain a variable that appropriately measures it, even though ability is certainly a determinant of GPA. Individuals with a higher ability also typically have a higher GPA and vice versa. Omitted variable bias becomes an issue because ability is also related to hours studied, work, video games, and possibly even texts. The omission of a relevant variable causes the coefficient estimates to be biased. This means that all coefficient estimates are wrong on average and that all hypothesis tests and confidence intervals are also incorrect.

b.

The consequence of including an irrelevant variable is that the estimates are not biased, but the standard errors typically become inflated. In this case, the inclusion of the irrelevant variable did not change the overall decisions about statistical significance. Notice that when Eye Color was included the R-squared went up but the adjusted R-squared went down.

c. It is much better to include an irrelevant variable than to omit a relevant variable, because larger standard errors are much better than biased estimators. Most of the time omitted variables are not omitted because the researcher is sloppy and didn’t think to include them, but rather because data on those variables are not available.

E8.2 a. See Excel Worksheet

b.

From this regression we see that distance to the beach is not statistically significant but Missing is statistically significant, suggesting that the observations with missing data have a housing price that is, on average, $297,185.14 lower than the observations without missing data.

c. An easy way to test this hypothesis is to regress housing price on the Missing column alone, which yields a difference in means.

This regression suggests that the mean housing price for data without missing observations is $795,333.21 while the mean housing price for data with missing observations is $795,333.21 - $168,943.28 = $626,389.93. The p-value suggests that this difference in means is not statistically significant. Another way to see if the missing data causes issues is to perform the regression with only the data that have the distance to the beach observations.

The regression with only the 43 observations that have data on distance to the beach gives somewhat different results than the regression that accounted for the missing data. The beach distance variable is now statistically significant at the 5% level and suggests that for each additional mile a house is away from the beach the price drops by $17,497.39.
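The difference-in-means reading of a regression on a single dummy variable, used in part c, can be checked with a few lines of Python and NumPy. The prices below are hypothetical and are not the housing data from the exercise.

```python
import numpy as np

# Hypothetical prices (in thousands of dollars) and a missing-data dummy.
price = np.array([900.0, 750.0, 820.0, 610.0, 640.0, 700.0])
missing = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

# Regress price on a constant and the dummy.
X = np.column_stack([np.ones_like(price), missing])
coef, *_ = np.linalg.lstsq(X, price, rcond=None)
intercept, slope = coef

mean_not_missing = price[missing == 0].mean()
mean_missing = price[missing == 1].mean()
print(intercept, slope, mean_not_missing, mean_missing - mean_not_missing)
```

The intercept reproduces the mean of the group with the dummy equal to 0, and the slope reproduces the difference between the two group means, so the t-test on the slope is a test of the difference in means.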

E8.3 a. See graph below.

[Figure: scatter plot titled "Units Sold vs. MP Sales"; horizontal axis: Online MP; vertical axis: Units Sold]

The two potential outliers are Call of Duty: Black Ops 2 and Assassin's Creed 3.

b.

The coefficient on outlier is statistically significant at the 1% level suggesting that the two outliers have, on average, 1,162,603.11 more units sold than the 51 other observations.

Interacting this with Online MP we obtain the regression

In this regression, an outlier without Online MP has 5,001,950.17 more units sold than a non-outlier, and an outlier with Online MP has 7,424.4+501,950+1,266,616= 1,775,808 more units sold than video games that are not outliers and have no Online MP.

c. It doesn’t look like either outlier is there for a special reason, other than that both Call of Duty: Black Ops 2 and Assassin's Creed 3 are extremely popular video games.

E8.4 a.

In this regression, the only statistically significant independent variable is square feet. On average, holding bedrooms, bathrooms, lot size, and pool constant, if square feet goes up by one foot then the price increases by .062%. Even though square feet is statistically significant, it is not economically significant because the coefficient estimate is so small.

b.

In this regression, the statistically significant independent variables are log square feet and bedrooms at the 5% level, and bathrooms at the 10% level. On average, holding bedrooms, bathrooms, lot size, and pool constant, if square feet goes up by 1% then the price increases by 1.14%. On average, holding log square feet, bathrooms, lot size, and pool constant, if bedrooms goes up by 1% then the price decreases by .107%. On average, holding log square feet, bedrooms, lot size, and pool constant, if bathrooms goes up by 1% then the price increases by .113%.

c. Performing the Davidson-MacKinnon test

The predicted ln housing price is not statistically significant, which suggests that the model without the log square feet is preferred. Since the dependent variable in both models is the same, the R-squares can also be compared and the R-squared from the initial model is larger than the R-squared from the model with log square feet.

E8.5 Regression from step 1 of the RESET test.

Second regression for the RESET test

The yhat^2, yhat^3, and yhat^4 terms are individually statistically insignificant, but we need to test whether they are jointly statistically significant.

Hypothesis:

$$H_0: \beta_5 = \beta_6 = \beta_7 = 0$$

$$H_1: \text{at least one } \beta_i \text{ is not equal to } 0$$

Test statistic:

$$F\text{-stat} = \frac{(4.5536 - 4.432)/3}{4.432/58} = 0.5302$$

Critical Value is $F_{.05, 3, 58} = 2.746$

Rejection Rule: Reject $H_0$ if F-stat > 2.746

Decision: Because 0.5302 < 2.746 we fail to reject $H_0$ and conclude that the model without the quadratic terms is statistically preferred.
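The arithmetic of this test can be reproduced in a few lines of Python. The sums of squares, degrees of freedom, and critical value are taken from the text above; the function name is hypothetical. With the rounded inputs shown, the statistic comes out at roughly 0.53, in line with the reported 0.5302 up to rounding.

```python
def f_stat_from_ss(ss_restricted, ss_unrestricted, q, df_resid):
    """F-statistic from restricted and unrestricted unexplained sums of squares."""
    return ((ss_restricted - ss_unrestricted) / q) / (ss_unrestricted / df_resid)

f_stat = f_stat_from_ss(4.5536, 4.432, q=3, df_resid=58)
critical_value = 2.746            # value stated in the text
reject = f_stat > critical_value
print(round(f_stat, 2), reject)
```

Since the statistic is far below the critical value, the decision to fail to reject the null hypothesis does not depend on the rounding of the inputs.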
