
70-208 Regression Analysis Week 3 Dielman: Ch 4 (skip Sub-Sec 4.4.2, 4.6.2, and 4.6.3 and Sec 4.7), Sec 7.1

Upload: angela-stanley

Post on 17-Dec-2015


Page 1:

70-208 Regression Analysis

Week 3

Dielman: Ch 4 (skip Sub-Sec 4.4.2, 4.6.2, and 4.6.3 and Sec 4.7), Sec 7.1

Page 2:

Multiple Independent Variables

• We believe that both education and experience affect the salary you earn. Can linear regression still be used to capture this idea?

• Yes, of course

• The “linear” part of “linear regression” means that the regression coefficients cannot enter the eq’n in a nonlinear way (such as β1² * x1)

Page 3:

Multiple Independent Variables

• Salaryi = β0 + β1 * Educi + β2 * Experi + μi

• Graphing this equation requires the use of 3 dimensions, so the usefulness of graphical methods such as scatterplots and best-fit lines is somewhat limited now

• As the number of explanatory variables increases, the formulas for computing the estimates of the regression coefficients become increasingly complex
– So we will not cover how to solve them by hand

Page 4:

Multiple Independent Variables

• The equation that “best” describes the relationship btwn a dependent variable y and K independent variables x1, x2, … , xK can be written as:
– y = β0 + β1 * x1 + β2 * x2 + … + βK * xK + μ
– Note that I will mostly drop the “i” subscript moving fwd

• The criterion for “best” is the same as it was for simple (i.e. K = 1) regression – the sum of the squared differences btwn the true values of y and the predicted values yhat should be as small as possible

• β0,hat, β1,hat, β2,hat, … , βK,hat ensure that the sum of squared errors is minimized
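The coefficient estimates that minimize the sum of squared errors are exactly what regression software computes for you. As a sketch of what is happening under the hood (not part of the course's Excel workflow), here is the same least-squares problem solved with Python's numpy; the salary/education/experience numbers are made up purely for illustration:

```python
import numpy as np

# Hypothetical sample data (made up for illustration):
# education (years), experience (years), salary (thousands of dollars)
educ = np.array([12.0, 16.0, 14.0, 18.0, 12.0, 16.0])
exper = np.array([5.0, 2.0, 10.0, 7.0, 15.0, 9.0])
salary = np.array([40.0, 55.0, 52.0, 70.0, 50.0, 65.0])

# Design matrix [1, educ, exper]; the column of ones estimates beta0.
X = np.column_stack([np.ones_like(educ), educ, exper])

# lstsq finds the beta_hat vector minimizing the sum of squared errors,
# the same "best" criterion used in simple regression.
beta_hat, *_ = np.linalg.lstsq(X, salary, rcond=None)
print(beta_hat)  # [beta0_hat, beta1_hat, beta2_hat]
```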

Page 5:

Labeling β

• Sometimes we just use β0, β1, β2, … , βK to label the coefficients

• Other times, it is useful to be more specific. For example, if x1 represents “education level”, it is better to write β1 as βeduc.
– β0 is always written the same

• The first regression below is more helpful in seeing and presenting your work than the second regression, even if we knew that y was salary, x1 was education, etc
– Salary = β0 + βeduc * Educ + βexper * Exper + μ
– y = β0 + β1 * x1 + β2 * x2 + μ

• I will go back and forth with my labeling throughout the course. I just wanted you to understand the difference and why one way might be better in practice.

Page 6:

Multiple Independent Variables

• Ceteris paribus – all else equal

• In the case of simple regression, we interpreted the regression coefficient estimate as meaning how much the dependent variable increased when the independent variable went up one unit

• Implicit was the concept that the error term for any two individuals was equally distributed – in other words, that all else was equal

Page 7:

Multiple Independent Variables

• It is very possible that this is a bad implicit assumption

• That is one reason we like to add multiple explanatory variables. Once they are added, they are not part of the error term and can be explicitly accounted for when we interpret coefficient estimates

• What the hell do I mean by all of this?

Page 8:

Multiple Independent Variables

• Go back to the salary example

• Hopefully you all agree that education and experience are both highly likely to explain salary in statistically significant ways

• But what if we didn’t have experience data, so we just ran the regression on salary and education?

Page 9:

Multiple Independent Variables

• What we would like to run:
– Salary = β0 + β1 * Educ + β2 * Exper + μ

• What we do run:
– Salary = β0 + β1 * Educ + μ

• Which means that experience has now been sucked into the error term. If experience levels (conditional on education) differ in our sample data set, the implicit assumption that the errors are equally distributed across all observations is wrong!

• If we ran the 2nd regression written above, we would interpret β1,hat as the amount by which salary increases when education increases by one unit (implicitly saying all else, i.e. the “errors”, are equal, which I just argued is probably a poor assumption)

Page 10:

Multiple Independent Variables

• So now say we have the experience data and we can run the regression with 2 explanatory variables

• Now we would interpret β1,hat as the amount by which salary increases when education increases by one unit AND EXPERIENCE IS THE SAME (plus the remaining information captured by the errors is the same across all observations)

• So we explicitly take experience out of the error term and can now condition on it being the same when we interpret the education coefficient

Page 11:

Multiple Independent Variables

• But how well does the implicit, ceteris paribus, “error” assumption hold up even when both educ and exper are included?

• Maybe still not very good. Everything you can think of is still being captured by the error terms except for education and experience levels. If these somehow differ systematically across observations, the assumption of equal error distributions is still wrong!

Page 12:

Multiple Independent Variables

• What do I mean by “everything you can think of”? Very simply, anything else that might (or might not!) affect salary.
– Years of experience at current company
– Number of extended family members that work at same company
– Intelligence
– How many sick days you took over the past 5 years
– How many kids you have
– How many siblings you have
– How many different cities you’ve lived in
– How many hot dogs you eat each year
– Etc, etc, etc, blah, blah, blah

Page 13:

Multiple Independent Variables

• Let’s look at those more closely

– Years of experience at current company – Probably would have significant effect on salary. We should include this in the regression if we can get the data.

– Number of extended family members that work at same company – Might or might not have an effect on salary.

– Intelligence – Tough to measure, but could proxy for it using an IQ score. Very likely to affect salary, so it should be included in the regression, too.

– How many sick days you took over the past 5 years – Kind of a measure of effort, so I think it would matter.

– How many kids you have – Could matter, especially for women.
– How many siblings you have – Doubtful it would be significant.
– How many different cities you’ve lived in – Very unlikely to be significant.
– How many hot dogs you eat each year – I’m literally just making stuff up at this point, so I doubt this would affect salary (unless we are measuring the salaries of competitive eaters, so note that context can matter when “answering” these questions)

Page 14:

Multiple Independent Variables

• So what happens if we think intelligence matters but it wasn’t included in the regression as a separate explanatory variable?

• Then intelligence is rolled up into the error term. But if education and intelligence are highly correlated (smarter people have more years of education), then the errors are not the same across the individuals in the sample (E(μi|X) ≠ 0). In fact, those with higher education have “higher” error, by which I mean one component of the error term is systematically bigger for some individuals

• This would make our ceteris paribus assumption false and we would end up with biased estimators!

Page 15:

Multiple Independent Variables

• What if we include insignificant variables because we are afraid of getting biased estimates if we don’t throw everything in?

• Not really a problem. We will see how to evaluate whether there are any relevant gains to including additional variables. If there are, they should be kept in the regression. If the gains are negligible or even negative, drop those insignificant variables and fear not the repercussions of bias.

Page 16:

Multiple Independent Variables – Output

• Look at and interpret output

• Sales are dependent on Advertising and Bonus

• Run the regression:
– Saleshat = -516.4 + 2.47 * Adv + 1.86 * Bonus

• This equation can be interpreted as providing an estimate of mean sales for a given level of advertising and bonus payment.

• If advertising is held constant, mean sales tend to rise by $1860 (1.86 thousands of dollars) for each unit increase in Bonus. If bonus is held fixed, mean sales tend to rise by $2470 (2.47 thousands of dollars) for each unit increase in Adv.

Page 17:

Multiple Independent Variables – Output

• Notice in the Excel output that the dof of the Regression is now 2 (always used to be 1). This is because there are 2 explanatory variables. The SSE, MSR, F, etc are calculated basically the same way as before, which we will go over very soon.

• Look at Fig 4.7b on pg 141 of Dielman to see how Excel outputs all the regression information when multiple independent variables are included

Page 18:

Multiple Independent Variables – Prediction

• As in simple regression, when we run a multiple regression we can then predict, or estimate, values for y when we have values for every explanatory variable by solving for yhat

• Back to sales example with Adv and Bonus only
– Saleshat = -516.4 + 2.47 * Adv + 1.86 * Bonus

• Say Adv = 200 and Bonus = 150. What would we predict for Sales (i.e. what is Saleshat)?
– Plug in Adv = 200 and Bonus = 150
– Saleshat = -516.4 + 2.47 * 200 + 1.86 * 150 = 256.6
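Prediction is just arithmetic on the fitted equation. A one-function sketch in Python (the function name is mine, not from the text):

```python
# Fitted equation from the sales example:
# Saleshat = -516.4 + 2.47 * Adv + 1.86 * Bonus
def predict_sales(adv, bonus):
    """Plug values for every explanatory variable into the fitted eq'n."""
    return -516.4 + 2.47 * adv + 1.86 * bonus

print(round(predict_sales(200, 150), 1))  # -> 256.6
```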

Page 19:

Confidence Intervals and Hypothesis Testing

• The confidence interval on βk,hat when K explanatory variables are included is
– (βk,hat – tα/2,N-K-1 * sβk, βk,hat + tα/2,N-K-1 * sβk)
– Notice the dof change on the t-value

• Hypothesis testing on any one independent variable is the same as before. The default Excel test is shown below.
– H0 : βk = 0
– Ha : βk ≠ 0
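The interval can be checked numerically. A sketch using scipy's t distribution, applied to the Bonus coefficient from the sales example (N = 25 is assumed here, consistent with the 22 dof used on a later slide):

```python
from scipy import stats

def coef_ci(beta_hat, se, n, k, alpha=0.05):
    """CI for one coefficient: beta_hat +/- t(alpha/2, N-K-1) * se."""
    t = stats.t.ppf(1 - alpha / 2, n - k - 1)
    return beta_hat - t * se, beta_hat + t * se

# Bonus coefficient: estimate 1.856, std error 0.715, K = 2, N = 25.
lo, hi = coef_ci(1.856, 0.715, n=25, k=2)
print(round(lo, 3), round(hi, 3))  # interval excludes 0
```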

Page 20:

Hypothesis Testing

• If the null on the previous slide is not rejected, then the conclusion is that, once the effects of all other variables in the regression are included, xk is not linearly related to y. In other words, adding xk to the regression eq’n is of no help in explaining any additional variation in y left unexplained by the other explanatory variables. You can drop xk from the regression and still have the same “fit”.

Page 21:

Hypothesis Testing

• If the null is rejected, then there is evidence that xk and y are linearly related and that xk does help explain some of the variation in y not accounted for by the other variables

Page 22:

Hypothesis Testing

• Are Sales and Bonus linearly related?

• Use t-test
– H0 : βBON = 0
– Ha : βBON ≠ 0
– Dec rule → reject null if test stat more extreme than t-value and do not reject otherwise
– βBON,hat = 1.856 and sβBON = 0.715, so test stat = 1.856 / 0.715 = 2.596
– The t value with 22 dof (from N-K-1) for a two-tailed test with α = 0.05 is 2.074.
– Since 2.596 > 2.074, reject null
– Yes, they are linearly related (even when Advertising is also accounted for)
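The same test computed numerically (scipy again; tiny differences from a printed test stat can come from rounding of the inputs):

```python
from scipy import stats

# Slide values: beta_BON_hat = 1.856, s_beta = 0.715, dof = N - K - 1 = 22
test_stat = 1.856 / 0.715
t_crit = stats.t.ppf(1 - 0.05 / 2, 22)  # two-tailed critical value

# Reject the null when the test stat is more extreme than the t-value.
print(round(test_stat, 3), round(t_crit, 3), abs(test_stat) > t_crit)
```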

Page 23:

Hypothesis Testing

• Could have used p-value or CI to answer the question on previous slide
– Would have reached same conclusion
– Don’t use full F when testing just one variable (more explanation later)

Page 24:

Assessing the Fit

• Recall SST, SSR, and SSE
– SST = ∑ (yi – ybar)²
– SSR = ∑ (yi,hat – ybar)²
– SSE = ∑ (yi – yi,hat)²

• For SSR, dof is equal to number of explanatory variables K

• For SSE, dof is N – K – 1

• So SST has N – 1 dof

Page 25:

Assessing the Fit

• Recall that R2 = SSR / SST = 1 – (SSE / SST)

• It was a measure of the goodness of fit of the regression line and ranged from 0 to 1. If R2 was multiplied by 100, it represented the percentage of the variation in y explained by the regression.

• Drawback to R2 in multiple regression → As more explanatory variables are added, the value of R2 will never decrease even if the additional variables are explaining an insignificant proportion of the variation in y

Page 26:

Assessing the Fit

• From R2 = 1 – (SSE / SST), you can see that R2 gets increasingly closer to 1 since SSE falls any time any little tiny bit more variation in y is explained

• Addition of unnecessary explanatory variables, which add little, if anything, to the explanation of the variation in y, is not desirable

• An alternative measure is called adjusted R2, or Radj2
– “Adjusted” because it adjusts for the dof

Page 27:

Assessing the Fit

• Radj2 = 1 – (SSE / (N – K – 1)) / (SST / (N – 1))

• Now suppose an explanatory variable is added to the regression model that produces only a very small decrease in SSE. The divisor N – K – 1 also falls, since K has been increased by 1. So the ratio SSE / (N – K – 1) may actually increase – and Radj2 fall – if the decrease in SSE from the addition of another variable is not great enough to overcome the decrease in N – K – 1.
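This trade-off is easy to see numerically. A sketch with hypothetical SSE/SST values (made up, not from any example in the text):

```python
def r2_adj(sse, sst, n, k):
    """Adjusted R2 = 1 - (SSE / (N - K - 1)) / (SST / (N - 1))."""
    return 1 - (sse / (n - k - 1)) / (sst / (n - 1))

# Adding a variable (K: 2 -> 3) that barely lowers SSE:
# plain R2 rises, but adjusted R2 falls.
print(r2_adj(sse=200.0, sst=1000.0, n=25, k=2))
print(r2_adj(sse=199.5, sst=1000.0, n=25, k=3))
```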

Page 28:

Assessing the Fit

• Radj2 no longer represents the proportion of variation in y explained by the regression (that is still captured only by R2), but it is useful when comparing two regressions with different numbers of explanatory variables. A decrease in Radj2 from the addition of one or more explanatory variables signals that the added variable(s) was of little importance in the regression, so it can be dropped.

Page 29:

Assessing the Fit

• F = MSR / MSE

• MSR = SSR / K

• MSE = SSE / (N – K – 1)

• Full F statistic is used to test the following hypothesis:
– H0 : β1 = β2 = … = βK = 0
– Ha : At least one coefficient above is not equal to 0

Page 30:

Assessing the Fit

• Decision rule → reject null if F > fcrit(α; K, N-K-1) and do not reject otherwise

• Failing to reject the null implies that the explanatory variables in the regression equation are of little or no use in explaining the variation in y. Rejection of the null implies that at least one (but not necessarily all) of the explanatory variables helps explain the variation in y.
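The full F test can be sketched in a few lines (scipy for the critical value; the SSR/SSE numbers below are hypothetical, not from the sales data):

```python
from scipy import stats

def full_f(ssr, sse, n, k, alpha=0.05):
    """Full F statistic and decision vs fcrit(alpha; K, N-K-1)."""
    msr = ssr / k            # MSR = SSR / K
    mse = sse / (n - k - 1)  # MSE = SSE / (N - K - 1)
    f = msr / mse
    f_crit = stats.f.ppf(1 - alpha, k, n - k - 1)
    return f, f_crit, f > f_crit

# Hypothetical sums of squares, with N = 25 and K = 2.
f, f_crit, reject = full_f(ssr=5000.0, sse=200.0, n=25, k=2)
print(round(f, 1), round(f_crit, 2), reject)
```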

Page 31:

Assessing the Fit

• Rejection of the null does not mean that all pop’n regression coefficients are different from 0 (though this may be true), just that the regression is useful overall in explaining y.

• The full F test can be thought of as a global test designed to assess the overall fit of the model.

• That’s why full F cannot be used for hypothesis testing on a single variable in multiple regression, but it could be used for the hypothesis testing on the single explanatory variable in simple regression (since that variable was the whole, “global” model)

Page 32:

Sales Example

• Show the calculation of F on the Excel sheet
– Using SSE and SSR
– Using MSE and MSR

• Would we reject the null that all coefficients are equal to 0?
– YES

Page 33:

Comparing Two Regression Models

• Remember that the t-test can check whether each individual regression coefficient is significant and the full F test can check the overall fit of the regression by asking whether any coefficient is significant

• Partial F test is in between – it answers the question of whether some subset of coefficients is jointly significant or not

Page 34:

Comparing Two Regression Models

• Want to test whether variables xL+1, … , xK are useful in explaining any variation in y after taking into account variation already explained by x1, … , xL

• Full model has all K variables:
– y = β0 + β1 * x1 + β2 * x2 + … + βL * xL + βL+1 * xL+1 + … + βK * xK + μ

• Reduced model only has L variables:
– y = β0 + β1 * x1 + β2 * x2 + … + βL * xL + μ

Page 35:

Comparing Two Regression Models

• Is the full model significantly better than the reduced model at explaining the variation in y?

• H0 : βL+1 = … = βK = 0

• Ha : at least one of them isn’t equal to 0

• If null is not rejected, choose the reduced model

• If null is rejected, xL+1, … , xK contribute to explaining y, so use the full model

Page 36:

Comparing Two Regression Models

• To test the hypothesis, use the following partial F statistic
– Fpart = ((SSER – SSEF) / (K – L)) / (SSEF / (N – K – 1)), where the “R” stands for reduced model and “F” stands for full model

• SSER – SSEF is always greater than or equal to 0
– Full model includes K – L extra variables which, at worst, explain none of the variation in y and in all likelihood explain at least a little of it, so SSE falls
– This difference represents the additional amount of variation in y explained by adding xL+1, … , xK to the regression

Page 37:

Comparing Two Regression Models

• This measure of improvement is then divided by the number of additional variables included, K – L
– Thus the numerator of Fpart is the additional variation in y explained per additional explanatory variable used

• Reject null if Fpart > fcrit(α; K – L, N – K – 1) and do not reject otherwise

Page 38:

Sales Example Revisited

• Example 4.4, pg 152 of Dielman

• Let’s add two more variables to the sales example from earlier

• x3 is mkt share held by company in each territory and x4 is largest competitor’s sales in each territory

• So we already have the “reduced” model results – they were shown earlier, when just x1 (Adv) and x2 (Bonus) were included

Page 39:

Sales Example Revisited

• We need to see the full model results

• Notice that R2 is higher for the full model (remember, R2 can never fall when more variables are added) but Radj2 is actually lower
– This should be a clue that we will probably not reject the null on β3 and β4 when comparing the full and reduced models

Page 40:

Sales Example Revisited

• SSER = 181176, SSEF = 175855, K – L = 2, N – K – 1 = 20
– Note that this last value is the dof of SSE in the full model

• So Fpart = ((181176 – 175855) / 2) / (175855 / 20) = 0.303

• fcrit(0.05; 2, 20) = 3.49

• Since 0.303 < 3.49, do not reject null

• Conclude that β3 = β4 = 0, so x3 and x4 should not be included in the regression
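These numbers can be checked directly (scipy for the critical value; everything else is from the slide):

```python
from scipy import stats

# From the sales example: SSE_R = 181176, SSE_F = 175855,
# K - L = 2 extra variables, N - K - 1 = 20 dof in the full model.
sse_r, sse_f, extra, dof = 181176.0, 175855.0, 2, 20

f_part = ((sse_r - sse_f) / extra) / (sse_f / dof)
f_crit = stats.f.ppf(0.95, extra, dof)

print(round(f_part, 3), round(f_crit, 2))  # 0.303 < 3.49 -> do not reject
```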

Page 41:

Sales Example Revisited

• Notice that the values for β0,hat, βADV,hat, and βBON,hat changed when we added additional variables
– Saleshat = -516.4 + 2.47 * Adv + 1.86 * Bonus
– Saleshat = -593.5 + 2.51 * Adv + 1.91 * Bonus + 2.65 * Mkt_Shr – 0.121 * Compet

• This should not surprise you. Some of what was previously rolled up into μ has now been explicitly accounted for, and that changes the way the initial set of explanatory variables relate to Sales.

• Note that the inclusion of additional observations (i.e. we gather more data) could also adjust the estimates of β0,hat, etc

• Every regression is different! (like snowflakes.......)

Page 42:

Sales Example Revisited

• If we chose to stick with the “full” sales model, we would include the x3 and x4 variables in predicting Saleshat
– Even though they are insignificant, because the β0,hat, βADV,hat, and βBON,hat values changed with their inclusion, it would be wrong to make predictions without them (unless we re-ran the original regression where they were not even included)

• So what is Saleshat for Adv = 500, Bonus = 150, Mkt_Shr = 0.5, and Compet = 100?
– Saleshat = -593.5 + 2.51 * 500 + 1.91 * 150 + 2.65 * 0.5 – 0.121 * 100 = 937.2

Page 43:

Limits to K?

• There are K + 1 coefficients that need to be estimated (β0, β1, … , βK)

• We need at least K + 1 observations to estimate that many coefficients, so N ≥ K + 1

• Normally written as K ≤ N – 1

• This is a similar concept to one from an algebra class you’d have taken in middle school, where we needed at least M equations to solve for X unknowns (i.e. M ≥ X)
– Here, you can think of N as being similar to the number of equations and K + 1 as the number of unknowns to be solved for

Page 44:

Multicollinearity

• For a regression of y on K explanatory variables, it is hoped that the explanatory variables are highly correlated with the dependent variable

• However, it is not desirable for strong relationships to exist among the explanatory variables themselves

• When explanatory variables are correlated with one another, the problem of multicollinearity is said to exist

Page 45:

Multicollinearity

• Seriousness of problem depends on degree of correlation

• Some books list an additional assumption of OLS that the sample data X is not all the same value, and a follow-up assumption that X1 cannot directly determine X2
– The first situation in the last bullet hardly ever happens. As long as X varies in the population, the sample data will almost always vary unless the pop’n variation is minimal or the sample size is very small.
– The second point made in the last bullet expressly forbids perfect multicollinearity to occur between any 2 explanatory variables
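A quick way to spot multicollinearity is the correlation matrix of the explanatory variables. A sketch on simulated data (made up; x2 is deliberately built to be almost a linear function of x1):

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 2.0 * x1 + rng.normal(scale=0.1, size=100)  # nearly collinear with x1
x3 = rng.normal(size=100)                        # unrelated to x1

# Off-diagonal entries are pairwise correlations among the x's.
corr = np.corrcoef([x1, x2, x3])
print(corr.round(2))  # the (x1, x2) entry is close to 1
```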

Page 46:

Biggest Problem for MultiC

• The std errors of regression coefficients are large when there is high multicollinearity among explanatory variables

• The null hypo that the coefficients are 0 may not be rejected even when the associated variable is important in explaining variation in y

• Summary: Perfect collinearity is fatal for a regression. Any small degree of multicollinearity increases std errors and is thus somewhat undesirable, though basically unavoidable.
– We will look at one strategy for investigating multicollinearity and using it to inform our regression choices next (free preview: Fpart is useful)

Page 47:

Baseball Example

• Example comes from the Wooldridge text

• I believe baseball player salaries are determined by years in the league, avg games played per year, career batting average, avg home runs per year, and avg RBIs per year

• So the following regression is run:
– log(salary) = β0 + β1 * years + β2 * games_yr + β3 * cavg + β4 * hr_yr + β5 * rbi_yr + μ
– Ignore the log for now, that’s for next week. I just wanted to stay kosher with the example from my other book. Just think of it as “salary” if it really bothers you.

Page 48:

Baseball Example

• Results (standard errors in parentheses):
– β0 = 11.19 (0.29)
– β1 = 0.0689 (0.0121)
– β2 = 0.0126 (0.0026)
– β3 = 0.00098 (0.00110)
– β4 = 0.0144 (0.0161)
– β5 = 0.0108 (0.0072)

• Plus N = 353 and SSEF = 183.186

Page 49:

Baseball Example

• Simple t-test on the last three coefficients would say they are insignificant in explaining log(salary)

• But any baseball fan knows that batting avg, home runs, and RBIs definitely are big factors in determining player salaries (and team performance for that matter)

• So let’s run the reduced model where we drop out those three variables and check to see what the partial F statistic reveals

Page 50:

Baseball Example

• Results (standard errors in parentheses):
– β0 = 11.22 (0.11)
– β1 = 0.0713 (0.0125)
– β2 = 0.0202 (0.0013)

• Plus N = 353 and SSER = 198.311

Page 51:

Baseball Example

• So Fpart = 9.55 (do the math yourself later, you have everything you need), and we reject null that β3 = β4 = β5 = 0 (and thus that batting avg, home runs, and RBIs have no effect on salary)

• That may seem surprising in light of insignificant t-stats for all 3 in the full model regression

Page 52:

Baseball Example

• What is happening is that two variables, hr_yr and rbi_yr, are highly correlated (and less so for cavg), and this multicollinearity makes it difficult to uncover the partial effect of each variable
– This is reflected in the individual t-stats

• Fpart stat tests whether all 3 variables above are jointly significant, and multicollinearity between them is much less relevant for testing this hypo

• If we drop one of those variables, we would see the t-stat of the others increase by a lot (even to the point of significance). The point estimates might change up or down, but the standard errors would definitely fall.

Page 53:

Dummy Variables

• Dummy variables, or indicator variables, take on only two values → 0 or 1

• They indicate whether a sample observation from our data does (1) or does not (0) belong in a certain category
– You can think of them as “yes” (1) or “no” (0) variables

• Examples:
– Gender – 1 if female, 0 otherwise
– Race – 1 if white, 0 otherwise
– Employment – 1 if employed, 0 otherwise
– Education – 1 if college graduate, 0 otherwise

Page 54:

Dummy Variables

• Can also be used to capture deeper qualitative information
– Is person A a US citizen? (1 if yes, 0 if no)
– Is person A a baseball fan? (1 if yes, 0 if no)
– Does person A own a computer? (1 if yes, 0 if no)
– Is summer the favorite season of person A? (1 if yes, 0 if no)
– Does firm Z sell video games? (1 if yes, 0 if no)
– Has country Z signed a free trade agreement with Canada? (1 if yes, 0 if no)

Page 55:

Dummy Variables

• In regression analysis, we must always “leave out” one part of the indicator

• Use gender as the example here
– So Xmale = 1 if male, 0 otherwise might be included in the regression as an independent variable
– But we cannot also include Xfemale = 1 if female, 0 otherwise in the regression
– One “part” (here, the female indicator) must be left out
– Why is this? Think back to the perfect collinearity problem discussed earlier. We can always define “female” completely in terms of “male” (Xfemale = 1 – Xmale). So both cannot be included in the regression or we get an error.
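The “error” can be demonstrated: with both dummies and an intercept, the columns of the design matrix are linearly dependent, so there is no unique least-squares solution. A sketch (the 0/1 data are made up):

```python
import numpy as np

male = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 0.0])  # hypothetical sample
female = 1.0 - male          # perfectly determined by `male`
ones = np.ones_like(male)    # intercept column

X_ok = np.column_stack([ones, male])           # full column rank
X_bad = np.column_stack([ones, male, female])  # male + female == ones

# Both matrices have rank 2, but X_bad has 3 columns: perfect
# multicollinearity, so its 3 coefficients are not identified.
print(np.linalg.matrix_rank(X_ok), np.linalg.matrix_rank(X_bad))
```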

Page 56:

Dummy Variables

• The group whose indicator is omitted from the regression serves as the base-level group for comparison

• In the gender example, say I ran the following regression:
– Salary = β0 + β1 * Educ + β2 * Male + μ

Page 57:

Dummy Variables

• Then the base-level group is females

• The intercept for females is β0, while for males it is β0 + β2

• From where?
– Indicated group (males) → Salary = β0 + β1 * Educ + β2 * Male + μ = β0 + β1 * Educ + β2 + μ = (β0 + β2) + β1 * Educ + μ
– Non-indicated group (females) → Salary = β0 + β1 * Educ + β2 * Male + μ = (β0) + β1 * Educ + μ
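The intercept shift is easy to verify numerically. A sketch with hypothetical fitted coefficients (made up for illustration only):

```python
# Hypothetical estimates: beta0_hat, beta_educ_hat, beta_male_hat
b0, b_educ, b_male = 20.0, 3.0, 5.0

def salary_hat(educ, male):
    """Fitted equation with a male dummy (1 = male, 0 = female)."""
    return b0 + b_educ * educ + b_male * male

# Same education, different dummy: the gap is exactly b_male, i.e.
# the male intercept is b0 + b_male while the female intercept is b0.
print(salary_hat(16, 1) - salary_hat(16, 0))  # -> 5.0
```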

Page 58:

Dummy Variables

• If we wanted to answer the question of whether or not men and women earn the same salary once education has been accounted for, a simple t-test would do the trick
– H0 : β2 = 0
– Ha : β2 ≠ 0
– If we reject the null, then men and women earn different salaries even when education levels are accounted for (remember there’s all kinds of other stuff in μ though)

Page 59:

Dummy Variables

• How about a more complicated example of indicator variables?

• Suppose firms in a sample are categorized according to the exchange on which they are listed (NYSE, AMEX, or NASDAQ). We believe the exchange they are on may have some predictive power for the value of the firm.
– D1 = 1 if listed on NYSE, 0 otherwise

– D2 = 1 if listed on AMEX, 0 otherwise

– D3 = 1 if listed on NASDAQ, 0 otherwise


Dummy Variables

• Let NYSE be the base level, so leave its dummy out of the regression equation

• Include firm-level assets and number of employees as additional independent variables

• Value = β0 + β1 * D2 + β2 * D3 + β3 * Assets + β4 * Employees + μ

• Then the NYSE intercept is β0, AMEX is β0 + β1, and NASDAQ is β0 + β2
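This coding scheme can be written as a small (hypothetical) helper: the base-level exchange, NYSE, gets no indicator of its own and so maps to (0, 0).

```python
# Hypothetical helper mapping an exchange name to the two indicator
# columns (D2, D3) used in the regression; NYSE is the base level.
def exchange_dummies(exchange: str) -> tuple[int, int]:
    """Return (D2, D3): D2 = 1 if AMEX, D3 = 1 if NASDAQ; NYSE -> (0, 0)."""
    if exchange not in {"NYSE", "AMEX", "NASDAQ"}:
        raise ValueError(f"unknown exchange: {exchange}")
    return (int(exchange == "AMEX"), int(exchange == "NASDAQ"))

print(exchange_dummies("NYSE"))    # (0, 0) -- base level
print(exchange_dummies("AMEX"))    # (1, 0)
print(exchange_dummies("NASDAQ"))  # (0, 1)
```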


Dummy Variables

• When using indicator variables, the partial F statistic is used to test whether the variables are important as a group. The t-test on individual coefficients should not be used to decide whether individual indicator variables should be retained or dropped (except when there are only two groups represented, and thus only one indicator variable, such as the male/female salary regression a few slides back).
• The indicator variables are designed to have meaning as a group, and are either all retained or all dropped as a group. Dropping individual indicators changes the meaning of the remaining ones.
– Imagine dropping just D2 (AMEX) in the previous regression. Then D3 (NASDAQ) is kept, while the base-level group switches from D1 (NYSE) to simply “not D3” (which would include both NYSE and AMEX)


Dummy Variables – Sales Example

• This is Example 7.3 on pg 279 of Dielman
• Look at the relationship between the dependent variable (Sales) and a few independent variables (Advertising, Bonus)
• Let’s add variables indicating the region of the US in which sales are made
– South = 1 if territory is in the South, 0 otherwise
– West = 1 if territory is in the West, 0 otherwise
– Midwest = 1 if territory is in the Midwest, 0 otherwise


Dummy Variables – Sales Example

• Let Midwest be the base-level group
– So leave its indicator out of the regression
• Regression:
– Sales = β0 + β1 * Adv + β2 * Bonus + β3 * South + β4 * West + μ
• We find β3,hat = -258 and β4,hat = -210. What do these mean?
– Since β3,hat = -258, sales in the South are 258 units lower than sales in the Midwest (our comparison group), even when we condition on Adv and Bonus being the same (similarly for β4,hat)


Dummy Variables – Sales Example

• It would be inappropriate to run simple t-tests on those coefficients to determine their significance; we need to use the partial F test. Think about how the interpretation of all the indicators would change if we ran a t-test and decided to drop only β4 * West from the regression.
• To determine whether there is a significant difference in sales for territories in different regions, the following hypotheses should be tested:
– H0 : β3 = β4 = 0
– Ha : at least one of β3, β4 isn’t equal to 0


Dummy Variables – Sales Example

• The larger (full) model is again:
– Sales = β0 + β1 * Adv + β2 * Bonus + β3 * South + β4 * West + μ
• So if the null stating that both indicator coefficients are 0 is not rejected, it would mean no differences in sales exist between the regions, and the indicators can be dropped
• The simpler (reduced) model:
– Sales = β0 + β1 * Adv + β2 * Bonus + μ


Dummy Variables – Sales Example

• So Fpart = ((SSER – SSEF) / (K – L)) / MSEF = 17.3

• Decision is to reject the null since 17.3 > fcrit(0.05; 2, 20) ≈ 3.49

• Thus, at least one of the coefficients of the indicator variables is not 0. There are differences in average sales levels btwn the three regions. Keep the indicator variables in the regression.
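The computation above can be sketched as follows. The SSE values below are hypothetical, chosen only so the statistic lands near the reported 17.3; the degrees of freedom match the slide (K – L = 2 in the numerator, 20 in the denominator, with MSEF = SSEF / 20).

```python
# Partial F test sketch: SSE_reduced and SSE_full are assumed numbers,
# picked so F comes out near the slide's 17.3; df values match the slide.
from scipy.stats import f

def partial_f(sse_reduced, sse_full, df_num, df_den, alpha=0.05):
    """Return (F statistic, critical value, reject H0?)."""
    f_stat = ((sse_reduced - sse_full) / df_num) / (sse_full / df_den)
    f_crit = f.ppf(1 - alpha, df_num, df_den)
    return f_stat, f_crit, f_stat > f_crit

f_stat, f_crit, reject = partial_f(sse_reduced=1000.0, sse_full=366.5,
                                   df_num=2, df_den=20)
print(round(f_stat, 1), round(f_crit, 2), reject)  # 17.3 3.49 True
```

Since the statistic exceeds the critical value, the null is rejected and the region indicators are kept as a group.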


Suggested Problems from Dielman

• Pg 148, #1
• Pg 158, #3
• Pg 169, #7
• Pg 170, #11
• Pg 173, #17
• Pg 285, #1