dummy variables
Dummy Variables (K.R. Shanmugam, Madras School of Economics)
Introduction
Consider a simple two-variable regression:
$Y_i = \alpha + \beta X_i + u_i$
where Y is earnings or wages and X is job experience.
Data set: data on 50 employees.
Earnings Equation Results
Wages = 685.9993 + 129.7805 * Experience
Dependent Variable: WAGES
Method: Least Squares
Date: 03/16/07 Time: 20:50
Sample: 1 50
Included observations: 50

Variable    Coefficient   Std. Error   t-Statistic   Prob.
C           685.9993      99.41455     6.900391      0.0000
EXPER       129.7805      10.76783     12.05261      0.0000

R-squared            0.751637    Mean dependent var      1796.920
Adjusted R-squared   0.746463    S.D. dependent var      523.0871
S.E. of regression   263.3874    Akaike info criterion   14.02431
Sum squared resid    3329900.    Schwarz criterion       14.10079
Log likelihood       -348.6077   F-statistic             145.2654
Durbin-Watson stat   1.554684    Prob(F-statistic)       0.000000
Regression with intercept only
Suppose everybody has the same experience, that is, experience is a constant.
Then $R^2 = 0$ and the intercept = 1796.92. (What is this? It is the sample mean of wages, reported as "Mean dependent var" in the output.)
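The point can be checked numerically: with only a constant in the model, least squares picks the value that minimizes the sum of squared deviations, which is the sample mean. The sketch below uses a small made-up wage list (hypothetical numbers, not the 50-employee data) and helper names of our own choosing.

```python
from statistics import mean

def ols_intercept_only(y):
    """OLS with a constant only: the minimizer of sum((y - a)^2) is the mean."""
    return mean(y)

def r_squared(y, fitted):
    """R^2 = 1 - SS_res / SS_tot."""
    y_bar = mean(y)
    ss_tot = sum((v - y_bar) ** 2 for v in y)
    ss_res = sum((v - f) ** 2 for v, f in zip(y, fitted))
    return 1 - ss_res / ss_tot

# Hypothetical wages (illustration only).
wages = [1200.0, 1500.0, 1800.0, 2100.0, 2400.0]

intercept = ols_intercept_only(wages)
r2 = r_squared(wages, [intercept] * len(wages))
print(intercept, r2)
```

Because the fitted value is the same for every observation, the residual sum of squares equals the total sum of squares, so R-squared is zero.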
Different Segments of Sample
In the data set, 25 respondents are male and the remaining 25 are female.
So the sample has two groups or segments. Can we treat them the same? No. In general, labor markets for different groups may differ, e.g. in agriculture, women are paid lower wages than men.
What Are the Average Wages?
Male average wages = 1867.68
Female average wages = 1726.16
The gap suggests gender (sex) discrimination. What should we do?
Option 1: Analyze the male and female samples separately.
Option 2: Analyze them jointly, but take the gender difference into account.
Option 1: Separate Analysis
What is the difference?
Dependent Variable: WAGES
Method: Least Squares
Date: 03/16/07 Time: 20:51
Sample: 1 25
Included observations: 25

Variable    Coefficient   Std. Error   t-Statistic   Prob.
C           610.0101      128.9632     4.730111      0.0001
EXPER       129.7849      13.89676     9.339215      0.0000

R-squared            0.791328    Mean dependent var      1726.160
Adjusted R-squared   0.782256    S.D. dependent var      519.2505
S.E. of regression   242.2984    Akaike info criterion   13.89484
Sum squared resid    1350295.    Schwarz criterion       13.99235
Log likelihood       -171.6854   F-statistic             87.22095
Durbin-Watson stat   1.769131    Prob(F-statistic)       0.000000

Dependent Variable: WAGES
Method: Least Squares
Date: 03/16/07 Time: 20:51
Sample: 26 50
Included observations: 25

Variable    Coefficient   Std. Error   t-Statistic   Prob.
C           757.5909      145.1904     5.217912      0.0000
EXPER       130.2921      15.80774     8.242302      0.0000

R-squared            0.747074    Mean dependent var      1867.680
Adjusted R-squared   0.736077    S.D. dependent var      527.8150
S.E. of regression   271.1568    Akaike info criterion   14.11989
Sum squared resid    1691099.    Schwarz criterion       14.21740
Log likelihood       -174.4986   F-statistic             67.93554
Durbin-Watson stat   1.641091    Prob(F-statistic)       0.000000
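Earning Functions for Male and Female Separately
[Figure: wage-experience lines for the two groups. Female: intercept 610.01, slope about 130. Male: intercept 757.59, slope about 130.]
The slope is (almost) the same, but the intercept is different!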
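Option 1 amounts to fitting one simple regression per group and comparing the two fitted lines. A minimal sketch, assuming made-up data in which both groups share a slope but differ in intercept (the function name `simple_ols` and all numbers are ours, not the lecture's):

```python
def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope

# Hypothetical data: same slope (100 per year), different intercepts.
exper = [2, 4, 6, 8]
wages_female = [600 + 100 * x for x in exper]
wages_male = [750 + 100 * x for x in exper]

a_f, b_f = simple_ols(exper, wages_female)
a_m, b_m = simple_ols(exper, wages_male)
print(f"female: intercept={a_f:.2f}, slope={b_f:.2f}")
print(f"male:   intercept={a_m:.2f}, slope={b_m:.2f}")
```

Each subsample gets its own intercept and its own slope, so nothing forces the two slopes to agree. That they nearly agree in the lecture's data is what motivates the single-model approach.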
Option 2: Joint Analysis
How do we take the gender difference into account? Gender is a qualitative factor and not readily quantifiable.
Solution: a dummy variable, i.e. a specially constructed variable that represents the gender difference.
Implicit assumption: the regression lines for the different groups differ only in the intercept and have the same slope coefficient.
Option 2: Use of a dummy variable
Dummy Variable: Definition
A variable that we create artificially to incorporate the effect of a factor that is not readily quantifiable.
That is, dummy variables are a device for incorporating into the regression model factors that are not readily quantified, such as region, time, occupation and ownership.
How do we create a Dummy?
For our case, D1 = Gender = 1 if the respondent is male
                          = 0 if the respondent is female
(It takes the value 1 for some observations to indicate the presence of a group/category and 0 for the remaining observations.)
A dummy is also called an indicator variable, binary variable, categorical variable, dichotomous variable, or qualitative variable.
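Constructing the dummy is a one-line recoding of the qualitative column. A small sketch, assuming gender is stored as text labels (the label spellings and the helper `make_dummy` are illustrative assumptions):

```python
def make_dummy(values, category):
    """Return 1 where the observation belongs to `category`, else 0."""
    return [1 if v == category else 0 for v in values]

# Hypothetical respondents (illustration only).
gender = ["male", "female", "female", "male"]

d1 = make_dummy(gender, "male")    # 1 = male, 0 = female (female is the base)
print(d1)
```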
Option 2: Single Model with Dummy
Regression model: $Y_i = \alpha_1 + \alpha_2 D_{1i} + \beta X_i + u_i$
Estimated relationship for the two groups:
$E(Y \mid X, D_1 = 0) = \alpha_1 + \beta X$ (for female)
$E(Y \mid X, D_1 = 1) = \alpha_1 + \alpha_2 + \beta X$ (for male)
That is, the slope is the same for both groups.
The intercept varies: the original intercept ($\alpha_1$) is the intercept for female (the base group, with dummy value zero).
Intercept for male = $\alpha_1 + \alpha_2$
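To see the single model at work, the sketch below estimates an intercept, a dummy coefficient and a common slope by solving the normal equations on made-up data. The data are generated exactly from wage = 600 + 150*D1 + 100*experience, so OLS should recover those three numbers. The helpers `solve` and `ols` are our own minimal implementations, not a library API.

```python
def solve(A, b):
    """Solve A z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * v for a, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) z = X'y."""
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical data: 4 women (D1 = 0) and 4 men (D1 = 1).
exper = [2, 4, 6, 8, 2, 4, 6, 8]
d1 = [0, 0, 0, 0, 1, 1, 1, 1]
wages = [600 + 150 * d + 100 * x for d, x in zip(d1, exper)]

X = [[1, d, x] for d, x in zip(d1, exper)]   # columns: constant, dummy, experience
const, dummy_coef, slope = ols(X, wages)
print(f"base intercept={const:.2f}, dummy={dummy_coef:.2f}, slope={slope:.2f}")
```

The constant is the intercept of the base group (D1 = 0), and the dummy coefficient is the vertical shift for the other group.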
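Diagrammatic Explanation
[Figure: Y plotted against X, showing two parallel lines, $\alpha_1 + \beta X$ (female) and $\alpha_1 + \alpha_2 + \beta X$ (male), a vertical distance $\alpha_2$ apart.]
The constant term $\alpha_1$ is the intercept for the base group; $\alpha_1 + \alpha_2$ is the intercept for male; and $\alpha_2$, the coefficient of the dummy variable, measures the difference in intercepts.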
Option 2: Estimation Results
Wages = 607.86 + 151.92 GENDER + 130.03 EXPER
        (5.907)  (2.111)         (12.503)
(t-statistics in parentheses) $R^2 = 0.773$
Since $\alpha_2$ is significant at the 5% level (p = 0.0401), a gender differential exists!
Dependent Variable: WAGES
Method: Least Squares
Date: 03/16/07 Time: 20:50
Sample: 1 50
Included observations: 50

Variable    Coefficient   Std. Error   t-Statistic   Prob.
C           607.8644      102.9013     5.907259      0.0000
EXPER       130.0344      10.40046     12.50275      0.0000
GENDER      151.9227      71.95553     2.111342      0.0401

R-squared            0.773152    Mean dependent var      1796.920
Adjusted R-squared   0.763499    S.D. dependent var      523.0871
S.E. of regression   254.3842    Akaike info criterion   13.97369
Sum squared resid    3041432.    Schwarz criterion       14.08841
Log likelihood       -346.3423   F-statistic             80.09379
Durbin-Watson stat   1.721055    Prob(F-statistic)       0.000000
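The reported coefficients can be turned into group intercepts and predicted wages directly. The sketch below plugs in the three estimates from the output above; the function name and the 10-year experience level are illustrative choices of ours.

```python
# Coefficients copied from the regression output (C, GENDER, EXPER).
B_CONST, B_GENDER, B_EXPER = 607.8644, 151.9227, 130.0344

def predicted_wage(exper, gender):
    """Fitted wage; gender = 1 for male, 0 for female (the base group)."""
    return B_CONST + B_GENDER * gender + B_EXPER * exper

female_intercept = predicted_wage(0, 0)   # base-group intercept
male_intercept = predicted_wage(0, 1)     # base intercept + dummy coefficient
gap_at_10 = predicted_wage(10, 1) - predicted_wage(10, 0)

print(f"female intercept: {female_intercept:.2f}")
print(f"male intercept:   {male_intercept:.2f}")
print(f"male-female gap at 10 years: {gap_at_10:.2f}")
```

Because the slope is common to both groups, the predicted male-female gap is the dummy coefficient at every level of experience.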
Alternative Way: Changing base
Define GENDER1 = D2 = 1 if female
                    = 0 if male
Dependent Variable: WAGES
Method: Least Squares
Date: 03/24/07 Time: 23:48
Sample: 1 50
Included observations: 50

Variable    Coefficient   Std. Error   t-Statistic   Prob.
C           759.7872      102.1789     7.435854      0.0000
EXPER       130.0344      10.40046     12.50275      0.0000
GENDER1     -151.9227     71.95553     -2.111342     0.0401

R-squared            0.773152    Mean dependent var      1796.920
Adjusted R-squared   0.763499    S.D. dependent var      523.0871
S.E. of regression   254.3842    Akaike info criterion   13.97369
Sum squared resid    3041432.    Schwarz criterion       14.08841
Log likelihood       -346.3423   F-statistic             80.09379
Durbin-Watson stat   1.721055    Prob(F-statistic)       0.000000
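Why the two outputs agree apart from the constant and the sign of the dummy coefficient can be seen by substitution. In the short derivation below, the symbols $\alpha_1$ (intercept), $\alpha_2$ (coefficient of the male dummy) and $\beta$ (slope) are our notation for the male-dummy model, and the last line uses the reported estimates.

```latex
\begin{align*}
D_2 &= 1 - D_1 \\
Y &= \alpha_1 + \alpha_2 D_1 + \beta X \\
  &= \alpha_1 + \alpha_2 (1 - D_2) + \beta X \\
  &= (\alpha_1 + \alpha_2) - \alpha_2 D_2 + \beta X \\
607.8644 + 151.9227 &= 759.7871 \approx 759.7872
\end{align*}
```

So changing the base group moves the constant to the other group's intercept and flips the sign of the dummy coefficient, while the slope, the fit statistics and the t-statistic of the dummy (in absolute value) are unchanged.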
Suppose we define two dummy variables as:
D1 = 1 if male, = 0 if female
D2 = 1 if female, = 0 if male
The regression equation can be specified as:
$Y_i = \alpha_1 D_1 (= \text{Gender}) + \alpha_2 D_2 (= \text{Gender1}) + \beta X_i$
What is the difference? The overall intercept term is missing. Why?
Alternative Way: Both Dummies
Dependent Variable: WAGES
Method: Least Squares
Date: 03/16/07 Time: 21:28
Sample: 1 50
Included observations: 50

Variable    Coefficient   Std. Error   t-Statistic   Prob.
EXPER       130.0344      10.40046     12.50275      0.0000
GENDER      759.7872      102.1789     7.435854      0.0000
GENDER1     607.8644      102.9013     5.907259      0.0000

R-squared            0.773152    Mean dependent var      1796.920
Adjusted R-squared   0.763499    S.D. dependent var      523.0871
S.E. of regression   254.3842    Akaike info criterion   13.97369
Sum squared resid    3041432.    Schwarz criterion       14.08841
Log likelihood       -346.3423   Durbin-Watson stat      1.721055
Dummy Variable Trap
• If we include a constant term together with the dummies for all categories, we face the problem of perfect multicollinearity (i.e., linear dependence exists among the columns of the X matrix). This is known as the dummy variable trap.
• To avoid the dummy variable trap, we can either drop the dummy for one category, as in the earlier case, or include dummies for all categories without an intercept term.
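The linear dependence behind the trap is easy to verify: the male and female dummies add up to the constant column, so X'X is singular and the normal equations have no unique solution. A minimal sketch on five made-up observations (the helpers `gram` and `det` are ours):

```python
def gram(X):
    """Return X'X for a design matrix given as a list of rows."""
    k = len(X[0])
    return [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]

def det(M):
    """Determinant by Laplace expansion along the first row (small matrices only)."""
    if len(M) == 1:
        return M[0][0]
    total = 0
    for j, v in enumerate(M[0]):
        minor = [row[:j] + row[j + 1:] for row in M[1:]]
        total += (-1) ** j * v * det(minor)
    return total

# Hypothetical sample: D1 = 1 if male, D2 = 1 if female.
d1 = [1, 0, 1, 0, 0]
d2 = [1 - d for d in d1]

trap = det(gram([[1, a, b] for a, b in zip(d1, d2)]))    # constant + both dummies
drop_one = det(gram([[1, a] for a in d1]))               # constant + one dummy
no_const = det(gram([[a, b] for a, b in zip(d1, d2)]))   # both dummies, no constant

print(trap, drop_one, no_const)
```

A zero determinant in the first case is the trap. Either remedy, dropping one dummy or dropping the constant, gives a nonzero determinant.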
Rule
• With an overall intercept, use m-1 dummies for m groups or categories; without an intercept, use m dummies for m groups.
• If there is no intercept, the coefficients of the dummy variables measure the intercepts for the respective groups.
• Wages = 759.79 Gender + 607.86 Gender1 + 130.03 X
          (7.44)          (5.91)           (12.50)
  (t-statistics in parentheses)
Several Categories: Suppose we want to control for the season when we analyze the sales of umbrellas.
$Sales_t = \alpha + \beta p_t + \alpha_1 D_1 + \alpha_2 D_2 + \alpha_3 D_3$
where D1 = 1 if 1st quarter; 0 otherwise
D2 = 1 if 2nd quarter; 0 otherwise
D3 = 1 if 3rd quarter; 0 otherwise
(or) $Sales_t = \beta p_t + \alpha_1 D_1 + \alpha_2 D_2 + \alpha_3 D_3 + \alpha_4 D_4$
where D4 = 1 if 4th quarter; 0 otherwise
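With four quarters and an overall intercept, only three dummies are needed; the omitted quarter becomes the base. A small sketch of the coding step, assuming quarters are numbered 1 to 4 and taking the 4th quarter as the base (the function name and its `base` parameter are illustrative):

```python
def quarter_dummies(quarter, base=4):
    """Return the 0/1 dummies for every quarter except the base quarter."""
    if quarter not in (1, 2, 3, 4):
        raise ValueError("quarter must be 1, 2, 3 or 4")
    return {f"D{q}": int(quarter == q) for q in (1, 2, 3, 4) if q != base}

print(quarter_dummies(2))   # second quarter: only D2 is switched on
print(quarter_dummies(4))   # base quarter: all dummies are zero
```

For the no-intercept version with all four dummies, the same idea applies with no quarter omitted.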
Several Qualitative Variables
Suppose there are two qualitative factors: sex and race.
Define dummy variables as:
Gender = 1 if male and = 0 otherwise
Race = 1 if white and = 0 if black
With Two Qualitative Factors
Dependent Variable: WAGES
Method: Least Squares
Date: 03/16/07 Time: 20:58
Sample: 1 50
Included observations: 50

Variable    Coefficient   Std. Error   t-Statistic   Prob.
C           488.9543      102.5994     4.765664      0.0000
EXPER       129.6110      9.591059     13.51373      0.0000
GENDER      168.2290      66.56434     2.527314      0.0150
RACE        204.2518      67.05229     3.046157      0.0038

R-squared            0.811231    Mean dependent var      1796.920
Adjusted R-squared   0.798920    S.D. dependent var      523.0871
S.E. of regression   234.5626    Akaike info criterion   13.82994
Sum squared resid    2530902.    Schwarz criterion       13.98290
Log likelihood       -341.7485   F-statistic             65.89459
Durbin-Watson stat   1.433962    Prob(F-statistic)       0.000000
C is the intercept for the base group on both factors (female and black).
Intercept for male (holding race at its base) = C + 168.23; and for white (holding gender at its base) = C + 204.25.
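With two dummies there are four gender-race combinations, and each intercept is the constant plus whichever dummy coefficients are switched on. The sketch below tabulates them from the reported estimates; the function name is ours, and the model as estimated has no interaction term, so the two shifts simply add.

```python
# Coefficients copied from the regression output (C, GENDER, RACE).
C, B_GENDER, B_RACE = 488.9543, 168.2290, 204.2518

def group_intercept(gender, race):
    """Intercept for a group; gender = 1 if male, race = 1 if white."""
    return C + B_GENDER * gender + B_RACE * race

for label, g, r in [("female, black", 0, 0), ("male, black", 1, 0),
                    ("female, white", 0, 1), ("male, white", 1, 1)]:
    print(f"{label}: {group_intercept(g, r):.2f}")
```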
With 3 Qualitative Factors
Example 3: Consumption function analysis.
Suppose there are three qualitative factors: sex, age of the household head, and education level of the head.
Define dummy variables as:
D1 = 1 if sex is male and = 0 otherwise
D2 = 1 if age < 25 and = 0 otherwise
D3 = 1 if age is between 25 and 50 and = 0 otherwise
D4 = 1 if high school education and = 0 otherwise
D5 = 1 if H.Sc., degree and above and = 0 otherwise
Example 3: Base or Reference Groups
Sex: Female
Age: Above 50 years
Education: Below High School
Regression model: $C_t = \alpha + \beta Y_t + \alpha_1 D_1 + \alpha_2 D_2 + \alpha_3 D_3 + \alpha_4 D_4 + \alpha_5 D_5 + u_t$
$\alpha$ is the intercept term for the base group on all three factors: the household head is female, above 50 years of age, and educated below high school.
Intercepts for Other Groups (other factors at their base):
$\alpha + \alpha_1$ - for a male household head
$\alpha + \alpha_2$ - for age less than 25 years
$\alpha + \alpha_3$ - for age between 25 and 50 years
$\alpha + \alpha_4$ - for high school education
$\alpha + \alpha_5$ - for above high school education
If the household head is male, aged 40 years, with high school education, what is the intercept? $\alpha + \alpha_1 + \alpha_3 + \alpha_4$
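The same bookkeeping works for any household profile: start from the base intercept and add the coefficient of every dummy that equals 1. A small sketch, where the coefficient names `alpha`, `alpha1` to `alpha5`, the education labels, and the treatment of ages exactly 25 and 50 as "between 25 and 50" are all assumptions made for illustration:

```python
def intercept_terms(sex, age, education):
    """List the coefficients that make up the intercept for a household head."""
    terms = ["alpha"]                      # base: female, above 50, below high school
    if sex == "male":
        terms.append("alpha1")             # D1 = 1
    if age < 25:
        terms.append("alpha2")             # D2 = 1
    elif age <= 50:
        terms.append("alpha3")             # D3 = 1 (25 to 50, boundaries assumed)
    if education == "high school":
        terms.append("alpha4")             # D4 = 1
    elif education == "above high school":
        terms.append("alpha5")             # D5 = 1
    return terms

profile = intercept_terms("male", 40, "high school")
print(" + ".join(profile))
```

A female head above 50 with education below high school switches on no dummy, so her intercept is the base term alone.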