dummy variables

Dummy Variables (K.R. Shanmugam, Madras School of Economics)

Introduction

Consider a simple two-variable regression:

$Y_i = \alpha + \beta X_i + u_i$

where Y is earnings or wages and X is job experience.

Data set: data on 50 employees.

Earnings Equation Results

Wages = 685.9993 + 129.7805 * Experience

Dependent Variable: WAGES
Method: Least Squares
Date: 03/16/07 Time: 20:50
Sample: 1 50
Included observations: 50

Variable    Coefficient   Std. Error   t-Statistic   Prob.
C           685.9993      99.41455     6.900391      0.0000
EXPER       129.7805      10.76783     12.05261      0.0000

R-squared            0.751637    Mean dependent var      1796.920
Adjusted R-squared   0.746463    S.D. dependent var      523.0871
S.E. of regression   263.3874    Akaike info criterion   14.02431
Sum squared resid    3329900.    Schwarz criterion       14.10079
Log likelihood       -348.6077   F-statistic             145.2654
Durbin-Watson stat   1.554684    Prob(F-statistic)       0.000000
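As a quick check on how the columns of the output relate to each other, the sketch below recomputes the t-statistics and the adjusted R-squared from the reported coefficients, standard errors, and R-squared. It only uses numbers printed in the output above; the variable names are chosen for this illustration.

```python
# Values copied from the regression output above.
coef = {"C": 685.9993, "EXPER": 129.7805}
std_err = {"C": 99.41455, "EXPER": 10.76783}
r_squared = 0.751637
n_obs = 50      # included observations
n_params = 2    # intercept and EXPER

# t-statistic = coefficient / standard error
t_stats = {name: coef[name] / std_err[name] for name in coef}

# Adjusted R-squared corrects R-squared for the number of estimated parameters.
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_params)

for name, t in t_stats.items():
    print(f"{name}: t = {t:.4f}")
print(f"Adjusted R-squared = {adj_r_squared:.6f}")
```

The recomputed values should agree with the t-Statistic column and the Adjusted R-squared entry up to rounding.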

Regression with intercept only

Suppose everybody has the same experience, that is, experience is a constant.

Then R² = 0 and the intercept = 1796.92. (What is this?)
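One way to see what the intercept-only estimate represents: with no regressor, least squares picks the constant that minimizes the sum of squared deviations, which is the sample mean of the dependent variable (1796.92 matches the "Mean dependent var" in the output). The sketch below illustrates this with made-up wages, not the lecture's data.

```python
# Hypothetical wages for five employees (illustrative, not the lecture's data).
wages = [1200.0, 1500.0, 1800.0, 2100.0, 2400.0]

mean_wage = sum(wages) / len(wages)

# With only an intercept, least squares minimizes sum((y - a)^2);
# the minimizer is the sample mean of y.
intercept = mean_wage

residual_ss = sum((y - intercept) ** 2 for y in wages)  # unexplained variation
total_ss = sum((y - mean_wage) ** 2 for y in wages)     # total variation
r_squared = 1 - residual_ss / total_ss

print(intercept, r_squared)
```

Because the fitted value is the mean for every observation, the residual and total sums of squares coincide, so R² is zero.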

Different Segments of Sample

In the data set, 25 respondents are male and the remaining 25 are female.

So the sample has two groups or segments. Can we treat them the same? No. In general, labor markets for different groups may be different, e.g., in agriculture, female workers get lower wages than male workers.

What are the Average Wages?

Male average wages = 1867.68
Female average wages = 1726.68
The averages differ, which suggests gender (sex) discrimination. What should we do?
Option 1: Analyze the male and female samples separately.
Option 2: Analyze them jointly, but take the gender difference into account.

Option 1: Separate Analysis

What is the difference?









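Earnings Functions for Male and Female Separately

[Figure: separate fitted earnings lines for the MALE and FEMALE samples. Estimated intercepts: 610.01 and 757.59; estimated slope: 130 in both.]

The slope is the same, but the intercept is different!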

Option 2: Joint Analysis

How do we take the gender differences into account?
Gender is a qualitative factor and not readily quantifiable.
Solution: a dummy variable, i.e., a specially constructed variable that represents the gender difference.
Implicit assumption: the regression lines for different groups differ only in intercept but have the same slope coefficient.

Option 2: Use of a dummy variable

Dummy Variable: Definition

A dummy variable is a variable we create artificially to incorporate the effect of a factor that is not readily quantifiable.

That is, dummy variables are a device for incorporating into the regression model certain variables that are not readily quantified, such as region, time, occupation, and ownership.

How do we create a Dummy?

For our case,
D1 = Gender = 1 if the respondent is male
            = 0 if the respondent is female

(It takes the value 1 for some observations to indicate the presence of a group/category and 0 for the remaining observations.)

A dummy is also called an indicator variable, binary variable, categorical variable, dichotomous variable, or qualitative variable.
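The coding rule can be sketched in a few lines. The labels below are hypothetical and only illustrate the 1/0 assignment, with female as the base group.

```python
# Hypothetical gender labels for six respondents.
gender_labels = ["male", "female", "female", "male", "male", "female"]

# D1 = 1 if the respondent is male, 0 if female (female is the base group).
d1 = [1 if label == "male" else 0 for label in gender_labels]

print(d1)  # [1, 0, 0, 1, 1, 0]
```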

Option 2: Single Model with Dummy

Regression model: $Y_i = \alpha_1 + \alpha_2 D_1 + \beta X_i + u_i$

Estimated relationship for the two groups:
$E(Y \mid X, D_1 = 0) = \alpha_1 + \beta X$ (for female)
$E(Y \mid X, D_1 = 1) = \alpha_1 + \alpha_2 + \beta X$ (for male)

That is, the slope is the same for both groups. The intercept varies: the original intercept ($\alpha_1$) is the intercept for female (the base group, with dummy value zero), and the intercept for male is $\alpha_1 + \alpha_2$.

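Diagrammatic Explanation

[Figure: two parallel lines plotted with Y on the vertical axis and X on the horizontal axis: $\alpha_1 + \beta X$ and $\alpha_1 + \alpha_2 + \beta X$. The intercepts $\alpha_1$ and $\alpha_2$ are marked on the Y axis.]

The constant term $\alpha_1$ is the intercept for the base group; $\alpha_1 + \alpha_2$ is the intercept for male; and $\alpha_2$, the coefficient of the dummy variable, measures the difference in intercepts.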

Option 2: Estimation Results

Wages = 607.86 + 151.92 GENDER + 130.03 Exper
t-statistics: (5.907) (2.111) (12.503); R² = 0.773

Since $\alpha_2$ is significant (p = 0.0401), there is a gender differential!



Variable    Coefficient   Std. Error   t-Statistic   Prob.
EXPER       130.0344      10.40046     12.50275      0.0000
GENDER      151.9227      71.95553     2.111342      0.0401

R-squared   0.773152      Mean dependent var   1796.920
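To make the interpretation concrete, the sketch below plugs the rounded estimates from the fitted equation into a small prediction function. The experience level of 10 is a hypothetical input; at any experience level the male-female gap equals the GENDER coefficient, because the slope is common to both groups.

```python
# Rounded estimates from the fitted equation above.
INTERCEPT = 607.86   # intercept for the base group (female)
B_GENDER = 151.92    # coefficient on GENDER (1 = male, 0 = female)
B_EXPER = 130.03     # common slope on experience

def predicted_wage(exper, gender):
    """Fitted wage for a given experience level and gender dummy."""
    return INTERCEPT + B_GENDER * gender + B_EXPER * exper

exper = 10  # hypothetical experience level
female = predicted_wage(exper, gender=0)
male = predicted_wage(exper, gender=1)
print(f"female: {female:.2f}, male: {male:.2f}, gap: {male - female:.2f}")
```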


Alternative Way: Changing base

Define GENDER1 = D2 = 1 if female; 0 if male

Variable    Coefficient   Std. Error   t-Statistic   Prob.
EXPER       130.0344      10.40046     12.50275      0.0000
GENDER1     -151.9227     71.95553     -2.111342     0.0401



Suppose we define two dummy variables as:
D1 = 1 if male, = 0 if female
D2 = 1 if female, = 0 if male

The regression equation can be specified as:
$Y_i = \alpha_1 D_1 + \alpha_2 D_2 + \beta X_i + u_i$, where D1 = GENDER and D2 = GENDER1.

What is the difference? The overall intercept term is missing. Why?

Alternative Way: Both Dummies

Dependent Variable: WAGES
Method: Least Squares
Date: 03/16/07 Time: 21:28
Sample: 1 50
Included observations: 50

Variable    Coefficient   Std. Error   t-Statistic   Prob.
EXPER       130.0344      10.40046     12.50275      0.0000
GENDER      759.7872      102.1789     7.435854      0.0000
GENDER1     607.8644      102.9013     5.907259      0.0000

Adjusted R-squared   0.763499    S.D. dependent var      523.0871
S.E. of regression   254.3842    Akaike info criterion   13.97369
Sum squared resid    3041432.    Schwarz criterion       14.08841
Log likelihood       -346.3423   Durbin-Watson stat      1.721055

Dummy Variable Trap

• If we include a constant term together with dummies for both groups, we face the problem of perfect multicollinearity (i.e., linear dependence exists among the columns of the X matrix, since D1 + D2 = 1 for every observation).
• This is known as the dummy variable trap.
• To avoid the dummy variable trap, we can either drop the dummy for one category, as in the earlier case, or include dummies for all categories without an intercept term.
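The linear dependence behind the trap can be checked directly. The sketch below uses four hypothetical observations, builds the columns for a constant, a male dummy, and a female dummy, and shows that the determinant of X'X is zero, so the normal equations have no unique solution. The helper `det3` is written for this illustration only.

```python
# Hypothetical gender codes for four respondents.
d1 = [1, 0, 1, 0]               # 1 if male
d2 = [1 - v for v in d1]        # 1 if female
const = [1] * len(d1)           # column of ones for the intercept

# Perfect collinearity: the constant column equals D1 + D2 in every row.
assert all(c == a + b for c, a, b in zip(const, d1, d2))

columns = [const, d1, d2]
# Gram matrix X'X: entry (i, j) is the inner product of columns i and j.
xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in columns] for ci in columns]

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

print(det3(xtx))  # 0: X'X is singular and cannot be inverted
```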

Rule

• With an overall intercept, use m-1 dummies for m groups or categories; without an intercept, use m dummies for m groups.
• If there is no intercept, the coefficients of the dummy variables measure the intercepts for the respective groups.
• Wages = 759.79 Gender + 607.86 Gender1 + 130.03 X
  t-statistics: (7.44) (5.91) (12.50)
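The equivalence of the two specifications can be illustrated numerically. The sketch below is a minimal pure-Python least-squares routine (normal equations solved by Gaussian elimination) applied to noise-free, hypothetical data in which females follow 600 + 130 X and males follow 750 + 130 X. The function names `solve` and `ols` and all data values are assumptions made for this example, not the lecture's data or software.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def ols(columns, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in columns] for ci in columns]
    xty = [sum(p * q for p, q in zip(ci, y)) for ci in columns]
    return solve(xtx, xty)

# Hypothetical, noise-free data: female intercept 600, male intercept 750, slope 130.
exper = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0]
male = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
female = [1.0 - d for d in male]
wages = [600.0 + 150.0 * d + 130.0 * x for d, x in zip(male, exper)]
ones = [1.0] * len(wages)

with_intercept = ols([ones, male, exper], wages)   # intercept plus m - 1 dummies
no_intercept = ols([male, female, exper], wages)   # m dummies, no intercept

print([round(b, 4) for b in with_intercept])
print([round(b, 4) for b in no_intercept])
```

In the first fit the coefficients are the base-group intercept, the male-female difference, and the slope; in the second, the two dummy coefficients are the group intercepts themselves, and the slope is unchanged.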

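Several Categories: Suppose we want to control for the seasons when we analyze the sales of umbrellas.

$Sales_t = \alpha + \beta p_t + \alpha_1 D_1 + \alpha_2 D_2 + \alpha_3 D_3 + u_t$

where
D1 = 1 if 1st quarter; 0 otherwise
D2 = 1 if 2nd quarter; 0 otherwise
D3 = 1 if 3rd quarter; 0 otherwise

(or) $Sales_t = \beta p_t + \alpha_1 D_1 + \alpha_2 D_2 + \alpha_3 D_3 + \alpha_4 D_4 + u_t$

where D4 = 1 if 4th quarter; 0 otherwise.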
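A minimal sketch of the m-1 coding for four quarters follows. The quarter labels are hypothetical, and the helper `quarter_dummies` is an illustrative name; the 4th quarter is taken as the base group, so its rows are all zeros.

```python
# Hypothetical quarter labels (1-4) for eight consecutive observations.
quarters = [1, 2, 3, 4, 1, 2, 3, 4]

def quarter_dummies(quarter, base=4):
    """Return indicators for every quarter except the base quarter."""
    return [1 if quarter == q else 0 for q in range(1, 5) if q != base]

rows = [quarter_dummies(q) for q in quarters]
for q, row in zip(quarters, rows):
    print(q, row)
```

Each row holds (D1, D2, D3); at most one entry is 1, and the base quarter is identified by all three being 0.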

Several Qualitative Variables

Suppose there are two qualitative factors: sex and race.

Define dummy variables as:
Gender = 1 if male and = 0 otherwise
Race = 1 if white and = 0 if black

With Two Qualitative Factors



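Variable    Coefficient   Std. Error   t-Statistic   Prob.
EXPER       129.6110      9.591059     13.51373      0.0000
GENDER      168.2290      66.56434     2.527314      0.0150
RACE        204.2518      67.05229     3.046157      0.0038

R-squared   0.811231      Mean dependent var   1796.920

C is the intercept for the base group on both factors (female and black).

Intercept for male = C + 168.23, and for white = C + 204.25. For a white male, both shifts apply: C + 168.23 + 204.25 = C + 372.48.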

With 3 Qualitative Factors

Example 3: Consumption function analysis.

Suppose there are three qualitative factors: sex, age of the household head, and education level of the head. Define dummy variables as:
D1 = 1 if sex is male and = 0 otherwise
D2 = 1 if age < 25 and = 0 otherwise
D3 = 1 if age is between 25 and 50 and = 0 otherwise
D4 = 1 if high school education and = 0 otherwise
D5 = 1 if H.Sc., degree and above and = 0 otherwise

Example 3: Base or Reference Groups

Sex: Female

Age: Above 50 years

Education: Below High School

Regression model: $C_t = \alpha + \beta Y_t + \alpha_1 D_1 + \alpha_2 D_2 + \alpha_3 D_3 + \alpha_4 D_4 + \alpha_5 D_5 + u_t$

$\alpha$ is the intercept term for the base groups:
• female head of household
• age of head above 50 years
• head's education below high school

Intercepts for Other Groups:

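Each of the following holds with the other factors at their base groups:

$\alpha + \alpha_1$: for a male household head

$\alpha + \alpha_2$: for age less than 25 years

$\alpha + \alpha_3$: for age between 25 and 50 years

$\alpha + \alpha_4$: for high school education

$\alpha + \alpha_5$: for above high school education

If the household head is male, aged 40 years, with high school education, what is the intercept? $\alpha + \alpha_1 + \alpha_3 + \alpha_4$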
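The intercept for any household can be read off mechanically from the dummy definitions. The sketch below codes D1 to D5 for a given head of household and lists the parameters that enter the intercept. The education labels and the plain-text names `alpha`, `alpha1`, ..., `alpha5` (standing for the intercept and the five dummy coefficients) are assumptions made for this illustration.

```python
def household_dummies(sex, age, education):
    """Code D1..D5; base groups are female, above 50, below high school.

    The education labels ("below_high_school", "high_school",
    "above_high_school") are hypothetical names chosen for this sketch.
    """
    return {
        "D1": int(sex == "male"),
        "D2": int(age < 25),
        "D3": int(25 <= age <= 50),
        "D4": int(education == "high_school"),
        "D5": int(education == "above_high_school"),
    }

def intercept_terms(dummies):
    """List the parameters that make up the intercept for this household."""
    return ["alpha"] + ["alpha" + name[1] for name, on in dummies.items() if on]

d = household_dummies("male", 40, "high_school")
print(" + ".join(intercept_terms(d)))  # alpha + alpha1 + alpha3 + alpha4
```

A household in all three base groups (female, above 50, below high school) has every dummy equal to zero, so its intercept is `alpha` alone.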
