TRANSCRIPT
Multiple Regression
Dummy Variables
Multicollinearity
Interaction Effects
Heteroscedasticity
Lecture Objectives
You should be able to:
1. Convert categorical variables into dummies.
2. Identify and eliminate multicollinearity.
3. Use interaction terms and interpret their coefficients.
4. Identify heteroscedasticity.
I. Using Categorical Data: Dummy Variables
Y = Accidents per 10,000 Drivers   X1 = Age   X2 = Car Color

Obs   Accidents per 10,000 Drivers   Age   Car Color
  1              89                   17    Red
  2              70                   17    Black
  3              75                   17    Blue
  4              85                   18    Red
  5              74                   18    Black
  6              76                   18    Blue
  7              90                   19    Red
  8              78                   19    Black
  9              70                   19    Blue
 10              80                   20    Red
Consider insurance company data on accidents and their relationship to the age of the driver and the color of the car driven.
See spreadsheet for complete data.
Coding a Categorical Variable

Original Coding                           Alternate Coding
Accidents per                             Accidents per
10,000 Drivers   Age   Color              10,000 Drivers   Age   Color
      89          17     1                      89          17     3
      70          17     2                      70          17     1
      75          17     3                      75          17     2
      85          18     1                      85          18     3
      74          18     2                      74          18     1
      76          18     3                      76          18     2
      90          19     1                      90          19     3
      78          19     2                      78          19     1
      70          19     3                      70          19     2
      80          20     1                      80          20     3
This is the incorrect way: a numeric code forces an arbitrary ordering onto the colors, and output from the two ways of coding gives inconsistent results.
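As a rough check of this point, the sketch below (Python, using only the ten observations listed above rather than the full 27; variable names are mine) fits both numeric codings by least squares and confirms that their predictions disagree:

```python
# A minimal sketch, not the original spreadsheet: fit the two numeric codings
# of Color by ordinary least squares and compare the resulting predictions.
import numpy as np

accidents  = np.array([89, 70, 75, 85, 74, 76, 90, 78, 70, 80], dtype=float)
age        = np.array([17, 17, 17, 18, 18, 18, 19, 19, 19, 20], dtype=float)
color_orig = np.array([1, 2, 3, 1, 2, 3, 1, 2, 3, 1], dtype=float)  # Red=1, Black=2, Blue=3
color_alt  = np.array([3, 1, 2, 3, 1, 2, 3, 1, 2, 3], dtype=float)  # Black=1, Blue=2, Red=3

def ols_fitted(color):
    """Fitted accident rates from a regression on Age and a numeric Color code."""
    X = np.column_stack([np.ones_like(age), age, color])
    beta, *_ = np.linalg.lstsq(X, accidents, rcond=None)
    return X @ beta

print(np.allclose(ols_fitted(color_orig), ols_fitted(color_alt)))  # False: the codings disagree
```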
Original Coding: Partial Output and Forecasts

SUMMARY OUTPUT
Regression Statistics
Multiple R 0.817505768
R Square 0.668315681
Adjusted R Square 0.640675321
Standard Error 7.224556247
Observations 27
Coefficients
Intercept 151.0611
Age -3.3944
Color -5.0000
RESIDUAL OUTPUT
Observation   Predicted Accidents   Residuals
1 88.3556 0.6444
2 83.3556 -13.3556
3 78.3556 -3.3556
4 84.9611 0.0389
5 79.9611 -5.9611
6 74.9611 1.0389
7 81.5667 8.4333
8 76.5667 1.4333
9 71.5667 -1.5667
10 78.1722 1.8278
Modified Coding: Partial Output and Forecasts

SUMMARY OUTPUT
Regression Statistics
Multiple R 0.872347406
R Square 0.760989997
Adjusted R Square 0.741072497
Standard Error 6.132770959
Observations 27
Coefficients
Intercept 127.7278
Age -3.3944
Color 6.6667
RESIDUAL OUTPUT
Observation   Predicted Accidents   Residuals
1 90.0222 -1.0222
2 76.6889 -6.6889
3 83.3556 -8.3556
4 86.6278 -1.6278
5 73.2944 0.7056
6 79.9611 -3.9611
7 83.2333 6.7667
8 69.9000 8.1000
9 76.5667 -6.5667
10 79.8389 0.1611
Coding with Dummies

Original Dummies                                    Alternately Coded Dummies
Accidents per       X1    X2       X3               Accidents per       X1    X2        X3
10,000 Drivers      Age   D1 Red   D2 Black         10,000 Drivers      Age   D1 Black  D2 Blue
      89             17     1        0                    89             17      0        0
      70             17     0        1                    70             17      1        0
      75             17     0        0                    75             17      0        1
      85             18     1        0                    85             18      0        0
      74             18     0        1                    74             18      1        0
      76             18     0        0                    76             18      0        1
      90             19     1        0                    90             19      0        0
      78             19     0        1                    78             19      1        0
      70             19     0        0                    70             19      0        1
      80             20     1        0                    80             20      0        0

This is the correct way. Output from either way of coding gives the same forecasts.
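To see why either choice of baseline works, here is a minimal Python sketch (column names are mine; only the ten rows shown above are used) that builds the dummies both ways and checks that the fitted values agree:

```python
# A minimal sketch, not the original spreadsheet: build dummy variables for
# Color two ways (Blue as baseline vs. Red as baseline) and verify that both
# regressions produce identical fitted values.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Accidents": [89, 70, 75, 85, 74, 76, 90, 78, 70, 80],
    "Age":       [17, 17, 17, 18, 18, 18, 19, 19, 19, 20],
    "Color":     ["Red", "Black", "Blue"] * 3 + ["Red"],
})

def fit_predict(dummies):
    """Fitted values from a regression of Accidents on Age plus the given dummies."""
    X = np.column_stack([np.ones(len(df)), df["Age"], dummies])
    beta, *_ = np.linalg.lstsq(X, df["Accidents"].to_numpy(dtype=float), rcond=None)
    return X @ beta

# Original dummies: D1 = Red, D2 = Black (Blue is the baseline).
original = pd.get_dummies(df["Color"], dtype=float)[["Red", "Black"]]
# Alternate dummies: D1 = Black, D2 = Blue (Red is the baseline).
alternate = pd.get_dummies(df["Color"], dtype=float)[["Black", "Blue"]]

print(np.allclose(fit_predict(original), fit_predict(alternate)))  # True: same forecasts
```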
Original Dummy Coding

SUMMARY OUTPUT
Regression Statistics
Multiple R 0.882407076
R Square 0.778642248
Adjusted R Square 0.749769497
Standard Error 6.028895798
Observations 27
Coefficients
Intercept 138.8389
Age -3.3944
D1 Red 10.0000
D2 Black -3.3333
RESIDUAL OUTPUT
Observation   Predicted Accidents   Residuals
1 91.1333 -2.1333
2 77.8000 -7.8000
3 81.1333 -6.1333
4 87.7389 -2.7389
5 74.4056 -0.4056
6 77.7389 -1.7389
7 84.3444 5.6556
8 71.0111 6.9889
9 74.3444 -4.3444
10 80.9500 -0.9500
Modified Dummy Coding
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.882407076
R Square 0.778642248
Adjusted R Square 0.749769497
Standard Error 6.028895798
Observations 27
Coefficients
Intercept 148.8389
Age -3.3944
D1 Black -13.3333
D2 Blue -10.0000
RESIDUAL OUTPUT
Observation   Predicted Accidents   Residuals
1 91.1333 -2.1333
2 77.8000 -7.8000
3 81.1333 -6.1333
4 87.7389 -2.7389
5 74.4056 -0.4056
6 77.7389 -1.7389
7 84.3444 5.6556
8 71.0111 6.9889
9 74.3444 -4.3444
10 80.9500 -0.9500
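A quick check with the coefficients above shows why the forecasts agree. For a red car, the original coding predicts 138.8389 - 3.3944(Age) + 10.0000 = 148.8389 - 3.3944(Age), which is exactly the modified coding's prediction, since red is its baseline. For a black car, both codings reduce to 135.5056 - 3.3944(Age), and for a blue car both reduce to 138.8389 - 3.3944(Age). Only the choice of baseline changes the individual coefficients; the forecasts do not change.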
II. Multicollinearity
We wish to forecast the height of a person based on the length of his/her feet. Consider data as shown:
Height Right Left
77.31 11.59 11.54
67.58 9.57 9.63
70.40 8.97 8.98
64.84 9.39 9.46
77.03 12.05 12.03
79.66 11.39 11.41
72.37 10.55 10.61
73.18 10.31 10.33
77.60 11.81 11.81
71.40 9.92 9.88
Regression with Right Foot
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.903100477
R Square 0.815590472
Adjusted R Square 0.813800088
Standard Error 2.005327453
Observations 105
Coefficients
Intercept 31.5457001
Right 3.9936768
As right foot length increases by an inch, height increases on average by 3.99 inches.
Regression with Left Foot
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.900285778
R Square 0.810514482
Adjusted R Square 0.808674817
Standard Error 2.032739063
Observations 105
Coefficients
Intercept 31.52588005
Left 3.99585252
As left foot length increases by an inch, height increases on average by 3.99 inches.
Regression with Both
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.904210562
R Square 0.817596741
Adjusted R Square 0.814020206
Standard Error 2.004141795
Observations 105
Coefficients
Intercept 31.76029318
Right 8.528657878
Left -4.555977176
The fitted coefficients now say that as right foot length increases by an inch, height increases on average by 8.53 inches (holding the left foot constant!), while lengthening the left foot makes a person shorter by 4.56 inches!
The Reason? Multicollinearity.
Height Right Left
Height (y) 1.0000
Right (X1) 0.9031 1.0000
Left (X2) 0.9003 0.9990 1.0000
While both feet (Xs) are correlated with height (y), they are also highly correlated with each other (0.999). In other words, the second foot adds no extra information to the prediction of y. One of the two Xs is sufficient.
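A correlation matrix or a variance inflation factor (VIF) makes this easy to check. The sketch below is illustrative only: it uses the ten rows listed above rather than the full 105-observation sample, and the VIF calculation shown is the simple two-predictor case.

```python
# A minimal sketch: the near-perfect correlation between the two predictors
# is the signature of multicollinearity; the VIF makes it explicit.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Height": [77.31, 67.58, 70.40, 64.84, 77.03, 79.66, 72.37, 73.18, 77.60, 71.40],
    "Right":  [11.59,  9.57,  8.97,  9.39, 12.05, 11.39, 10.55, 10.31, 11.81,  9.92],
    "Left":   [11.54,  9.63,  8.98,  9.46, 12.03, 11.41, 10.61, 10.33, 11.81,  9.88],
})

print(df.corr().round(4))  # Right and Left correlate near 1.0 with each other

# VIF for Right: R-square from regressing Right on Left, converted to 1/(1 - R^2).
r2 = np.corrcoef(df["Right"], df["Left"])[0, 1] ** 2
print("VIF(Right) =", round(float(1 / (1 - r2)), 1))  # far above the usual cutoff of ~10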
III. Interaction Effects

Scores on a test of reflexes
Y = Score   X1 = Age   X2 = Gender

Obs   Score   Age   Gender
1 80 25 0
2 82 28 0
3 75 32 0
4 70 33 0
5 65 35 0
6 60 43 0
7 67 46 0
8 70 55 0
9 60 56 0
10 55 67 0
11 90 24 1
Do reflexes slow down with age? Are there gender differences?
A portion of the data is shown here.
Scatterplots with Age, Gender

[Chart: Reflex Scores (0-100) by Age, a scatterplot of Reflex Score against Age]
[Chart: Reflex Scores by Age, Gender, the same scatterplot with points marked by gender]
Does age seem related? How about Gender?
Correlation, Regression

Correlations   Score     Age      Gender
Score           1
Age            -0.8981    1
Gender         -0.1311    0.1406   1
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.898113
R Square 0.806608
Adjusted R Square 0.783856
Standard Error 6.547818
Observations 20
            Coefficients   Standard Error   t Stat     P-value
Intercept   103.8807       4.735902         21.93473   6.58346E-14
Age          -0.84478      0.101411         -8.33024   2.09518E-07
Gender       -0.13641      2.957645         -0.04612   0.963752398
Age is related, gender is not.
Interaction Term

      Y       X1    X2       X1*X2
Obs   Score   Age   Gender   Age*Gender
 1     80      25     0         0
 2     82      28     0         0
 3     75      32     0         0
 4     70      33     0         0
 5     65      35     0         0
 6     60      43     0         0
 7     67      46     0         0
..     ..      ..    ..        ..
11     90      24     1        24
12     87      28     1        28

A 2-way interaction term is the product of the two variables.
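A hedged Python sketch of the construction (column names are mine, and only the rows shown above are used, so the coefficients will not match the 20-observation output that follows):

```python
# A minimal sketch: add the Age*Gender product column and include it in the regression.
import numpy as np
import pandas as pd

# Rows of the reflex-score data listed above (partial data only).
df = pd.DataFrame({
    "Score":  [80, 82, 75, 70, 65, 60, 67, 70, 60, 55, 90, 87],
    "Age":    [25, 28, 32, 33, 35, 43, 46, 55, 56, 67, 24, 28],
    "Gender": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})
df["Age*Gender"] = df["Age"] * df["Gender"]  # the 2-way interaction term

X = np.column_stack([np.ones(len(df)), df["Age"], df["Gender"], df["Age*Gender"]])
beta, *_ = np.linalg.lstsq(X, df["Score"].to_numpy(dtype=float), rcond=None)
print(dict(zip(["Intercept", "Age", "Gender", "Age*Gender"], beta.round(3))))
```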
Regression with Interaction
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.943570838
R Square 0.890325927
Adjusted R Square 0.869762038
Standard Error 5.082689038
Observations 20
             Coefficients    Standard Error   t Stat     P-value
Intercept     90.10731707    5.389545         16.71891   1.49E-11
Age           -0.516840883   0.122483         -4.21968   0.000651
Gender        24.27622946    7.353093          3.301499  0.004505
Age*Gender    -0.558724118   0.159875         -3.49476   0.002997
How do we interpret the coefficient for the interaction term?
Meaning of Interaction
X1 and X2 are said to interact with each other if the impact of X1 on y changes as the value of X2 changes.
In this example, the impact of age (X1) on reflexes (y) is different for males and females (changing values of X2). Hence age and gender are said to interact.
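Concretely, with the fitted coefficients above, the slope of Score on Age is -0.517 when Gender = 0, but -0.517 + (-0.559) = -1.076 when Gender = 1. The interaction coefficient is therefore the difference between the two groups' age slopes: reflex scores decline with age roughly twice as fast in the group coded 1.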
Explain how this is different from multicollinearity.
IV. Heteroscedasticity

Lake Lanier Water Levels

[Chart: monthly water level (feet), roughly 1050 to 1075, from Jan-04 to Dec-08, with fitted trendline y = -0.012x + 1531, R2 = 0.7024]
Consider the water levels in Lake Lanier. There is a trend that can be used to forecast. However, the variability around the trendline is not consistent. The increase in variation makes the prediction margin of error unreliable.
Example 2: Income and Spending

[Chart: Spending on Luxury Goods as Percent of Income, a scatterplot of percent spent on luxury goods (0% to 30%) against income (0 to 140,000)]
As income grows, the ability to spend on luxury goods grows with it, and so does the variation in how much is actually spent. Once again, forecasts become less reliable due to changing variation (heteroscedasticity).
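One informal way to spot heteroscedasticity is to fit the regression and compare the residual spread in different parts of the X range. The sketch below uses simulated income data (not the chart's actual numbers), constructed so that the scatter widens with income:

```python
# A minimal sketch with simulated data: fit percent spent on income and compare
# the residual spread for low vs. high incomes. A large gap between the two
# spreads is an informal sign of heteroscedasticity.
import numpy as np

rng = np.random.default_rng(0)
income = rng.uniform(20_000, 140_000, size=200)
# Simulated so that the scatter grows with income (multiplicative noise).
pct_luxury = np.exp(0.7 + 0.00001 * income + rng.normal(0.0, 0.25, size=200))

X = np.column_stack([np.ones_like(income), income])
beta, *_ = np.linalg.lstsq(X, pct_luxury, rcond=None)
resid = pct_luxury - X @ beta

low = income < np.median(income)
print("residual std, lower-income half: ", round(float(resid[low].std()), 2))
print("residual std, higher-income half:", round(float(resid[~low].std()), 2))  # noticeably larger
```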
Solution
When heteroscedasticity is identified, data may need to be transformed (change to a log scale, for instance) to reduce its impact. The type of transformation needed depends on the data.
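Continuing the simulated example above, a log transformation of the response is one such fix. Because the simulated noise is multiplicative, regressing log(percent spent) on income leaves the residual spread roughly constant across income levels:

```python
# A minimal sketch (same simulated setup as above): regress log(percent spent)
# on income and compare the residual spread for low vs. high incomes again.
import numpy as np

rng = np.random.default_rng(0)
income = rng.uniform(20_000, 140_000, size=200)
pct_luxury = np.exp(0.7 + 0.00001 * income + rng.normal(0.0, 0.25, size=200))

X = np.column_stack([np.ones_like(income), income])
beta, *_ = np.linalg.lstsq(X, np.log(pct_luxury), rcond=None)
resid = np.log(pct_luxury) - X @ beta

low = income < np.median(income)
print("residual std, lower-income half: ", round(float(resid[low].std()), 3))
print("residual std, higher-income half:", round(float(resid[~low].std()), 3))  # now about the same
```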