
Example 12.2 Multicollinearity

Upload: chase-roche

Post on 26-Mar-2015


TRANSCRIPT

Page 1

Example 12.2

Multicollinearity

Page 2

The Problem

We want to explain a person’s height by means of foot length.

The response variable is Height, and the explanatory variables are Right and Left, the length of the right foot and the left foot, respectively.

What can occur when we regress Height on both Right and Left?

Page 3

Multicollinearity

The relationship between the explanatory variable X and the response variable Y is not always accurately reflected in the coefficient of X; it depends on which other X's are included or not included in the equation.

This is especially true when there is a linear relationship between two or more explanatory variables, in which case we have multicollinearity.

By definition, multicollinearity is the presence of a fairly strong linear relationship between two or more explanatory variables, and it can make estimation difficult.
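The definition above can be checked numerically. Below is a minimal Python sketch (the sample size, seed, and noise levels are assumptions, not values from the example) that builds two explanatory variables that are almost linearly related and confirms it with their correlation:

```python
import random
import statistics

def pearson(x, y):
    """Sample Pearson correlation between two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

random.seed(0)
# Right and Left foot lengths differ only by a small measurement error,
# so they are almost perfectly linearly related -- the multicollinearity setup.
right = [random.gauss(11.0, 1.0) for _ in range(100)]
left = [r + random.gauss(0.0, 0.1) for r in right]

r_rl = pearson(right, left)
print(f"corr(Right, Left) = {r_rl:.3f}")
```

A pairwise correlation near 1 (or, more generally, a large variance inflation factor) is the standard warning sign that estimates of the individual coefficients will be unstable.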

Page 4

Solution

Admittedly, there is no need to include both Right and Left in an equation for Height - either one would do - but we include both to make a point.

It is likely that there is a large correlation between height and foot size, so we would expect this regression equation to do a good job.

The R2 value will probably be large. But what about the coefficients of Right and Left? Here a problem arises.

Page 5

Solution -- continued

The coefficient of Right indicates the right foot's effect on Height in addition to the effect of the left foot. This additional effect is probably minimal. That is, after the effect of Left on Height has already been taken into account, the extra information provided by Right is probably minimal. But it goes the other way also: the extra effect of Left, in addition to that provided by Right, is probably minimal.

Page 6

HEIGHT.XLS

To show what can happen numerically, we generated a hypothetical data set of heights and left and right foot lengths in this file.

We did this so that, except for random error, height is approximately 32 plus 3.2 times foot length (all expressed in inches).

As shown in the table to the right, the correlations between Height and either Right or Left in our data set are quite large, and the correlation between Right and Left is very close to 1.
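HEIGHT.XLS itself is not reproduced in this transcript, but a comparable data set is easy to simulate from the recipe just described (the 32 plus 3.2 times foot length equation is from the example; the sample size, seed, and error standard deviations below are assumptions):

```python
import numpy as np

rng = np.random.default_rng(12345)
n = 100
foot = rng.normal(11.0, 1.0, n)                      # underlying foot length (inches)
right = foot + rng.normal(0.0, 0.1, n)               # right foot: small measurement error
left = foot + rng.normal(0.0, 0.1, n)                # left foot: nearly identical to Right
height = 32 + 3.2 * foot + rng.normal(0.0, 2.0, n)   # generating equation plus random error

corr = np.corrcoef([height, right, left])
print(np.round(corr, 3))  # row/column order: Height, Right, Left
```

As in the example's correlation table, Height correlates strongly with each foot length, and the correlation between Right and Left comes out very close to 1.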

Page 7

Solution -- continued

The regression output when both Right and Left are entered in the equation for Height appears in this table.

Page 8

Solution -- continued

This output tells a somewhat confusing story.

The multiple R and the corresponding R2 are about what we would expect, given the correlations between Height and either Right or Left.

In particular, the multiple R is close to the correlation between Height and either Right or Left. Also, the se value is quite good. It implies that predictions of height from this regression equation will typically be off by only about 2 inches.

Page 9

Solution -- continued

However, the coefficients of Right and Left are not at all what we might expect, given that we generated heights as approximately 32 plus 3.2 times foot length.

In fact, the coefficient of Left has the wrong sign - it is negative!

Besides this wrong sign, the tip-off that there is a problem is that the t-value of Left is quite small and the corresponding p-value is quite large.

Page 10

Solution -- continued

Judging by this, we might conclude that Height and Left are either not related or are related negatively. But we know from the table of correlations that both of these conclusions are false.

In contrast, the coefficient of Right has the “correct” sign, and its t-value and associated p-value do imply statistical significance, at least at the 5% level.

However, this happened mostly by chance; slight changes in the data could change the results completely.
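That sensitivity is easy to demonstrate on simulated data (the sample size, seed, and noise levels below are assumptions). The same feet get two height samples that differ only in the random error draw, yet the individual coefficients of Right and Left can swing noticeably between the two fits:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
foot = rng.normal(11.0, 1.0, n)
right = foot + rng.normal(0.0, 0.1, n)
left = foot + rng.normal(0.0, 0.1, n)
X = np.column_stack([np.ones(n), right, left])

def fit(height):
    """Least-squares coefficients (intercept, Right, Left)."""
    return np.linalg.lstsq(X, height, rcond=None)[0]

# Two height samples generated the same way, differing only in the error draw:
h1 = 32 + 3.2 * foot + rng.normal(0.0, 2.0, n)
h2 = 32 + 3.2 * foot + rng.normal(0.0, 2.0, n)
_, r1, l1 = fit(h1)
_, r2, l2 = fit(h2)
print(f"fit 1: Right = {r1:+.2f}, Left = {l1:+.2f}, sum = {r1 + l1:.2f}")
print(f"fit 2: Right = {r2:+.2f}, Left = {l2:+.2f}, sum = {r2 + l2:.2f}")
```

The individual coefficients are unstable (one may even come out negative), while their sum stays near the generating slope of 3.2.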

Page 11

Solution -- continued

The problem is that although both Right and Left are clearly related to Height, it is impossible for the least squares method to distinguish their separate effects.

Note that the regression equation does estimate the combined effect fairly well: the sum of the coefficients is 3.178, which is close to the coefficient of 3.2 we used to generate the data.

Therefore, the estimated equation will work well for predicting heights. It just does not have reliable estimates of the individual coefficients of Right and Left.
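The same point in code (again on simulated data; the sample size, seed, and noise levels are assumptions): the fitted equation's combined slope and its prediction error land near the values used to generate the data, even though the individual coefficients are unreliable:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100
foot = rng.normal(11.0, 1.0, n)
right = foot + rng.normal(0.0, 0.1, n)
left = foot + rng.normal(0.0, 0.1, n)
height = 32 + 3.2 * foot + rng.normal(0.0, 2.0, n)

# Fit Height on both Right and Left by least squares.
X = np.column_stack([np.ones(n), right, left])
coef, *_ = np.linalg.lstsq(X, height, rcond=None)
pred = X @ coef
rmse = float(np.sqrt(np.mean((height - pred) ** 2)))

print(f"sum of slopes  = {coef[1] + coef[2]:.2f}")  # near the generating 3.2
print(f"in-sample RMSE = {rmse:.2f} inches")        # near the error sd of 2
```

So the combined effect and the predictions are trustworthy; only the split of the effect between Right and Left is not.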

Page 12

Solution -- continued

To see what happens when either Right or Left is excluded from the regression equation, we show the results of simple regression.

When Right is the only variable in the equation, it becomes

Predicted Height = 31.546 + 3.195Right

The R2 and se values are 81.6% and 2.005, and the t-value and p-value for the coefficient of Right are now 21.34 and 0.000 - very significant.

Page 13

Solution -- continued

Similarly, when Left is the only variable in the equation, it becomes

Predicted Height = 31.526 + 3.197Left

The R2 and se values are 81.1% and 2.033, and the t-value and p-value for the coefficient of Left are 20.99 and 0.000 - again very significant.

Clearly, both of these equations tell almost identical stories, and they are much easier to interpret than the equation with both Right and Left included.
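The two simple regressions can be reproduced on simulated data in the same way (the simulation parameters are assumptions; the fitted numbers will differ slightly from the 31.5 + 3.2 values quoted above, but they tell the same story):

```python
import numpy as np

rng = np.random.default_rng(99)
n = 100
foot = rng.normal(11.0, 1.0, n)
right = foot + rng.normal(0.0, 0.1, n)
left = foot + rng.normal(0.0, 0.1, n)
height = 32 + 3.2 * foot + rng.normal(0.0, 2.0, n)

def simple_ols(x, y):
    """Intercept and slope of the least-squares line y = a + b*x."""
    b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    a = y.mean() - b * x.mean()
    return a, b

slopes = {}
for name, x in (("Right", right), ("Left", left)):
    a, b = simple_ols(x, height)
    slopes[name] = b
    print(f"Predicted Height = {a:.3f} + {b:.3f} {name}")
```

With only one foot length in the equation at a time, each slope is a stable estimate of the overall effect, and the two fitted equations are nearly identical.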