Class 5: Multiple Regression
Lionel Nesta
Observatoire Français des Conjonctures Economiques
SKEMA Ph.D. programme, 2010-2011
Introduction to Regression
Typically, the social scientist is dealing with multiple and complex webs of interactions between variables.
An immediate and appealing extension to simple linear regression is to extend the set of explanatory variables to other variables.
Multiple regressions include several explanatory variables in the empirical model:
\[ y_i = \alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + u_i \]
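Equivalently, in summation form:
\[ y_i = \alpha + \sum_{k=1}^{K} \beta_k x_{ik} + u_i \]
To minimize the sum of squared errors
\[ \min_{\hat{\alpha}, \hat{\beta}_1, \dots, \hat{\beta}_K} \sum_{i=1}^{n} \hat{u}_i^2 = \min \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \min \sum_{i=1}^{n} \Big( y_i - \hat{\alpha} - \sum_{k=1}^{K} \hat{\beta}_k x_{ik} \Big)^2 \]
Multivariate Least Square Estimator
Usually, the multivariate model is described in matrix notation:
\[ y_i = x_i \beta + u_i \quad \Longleftrightarrow \quad y = X\beta + u \]
With the following least squares solution:
\[ \hat{\beta} = (X'X)^{-1} X'y, \qquad cov(\hat{\beta}) = \sigma^2 (X'X)^{-1} \]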
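The least squares solution can be illustrated numerically. The following is a minimal sketch in plain Python (not Stata), assuming a small made-up dataset in which y is an exact linear function of two regressors. The helper names `transpose`, `matmul`, `solve` and `ols` are hypothetical; the sketch solves the normal equations (X'X) b = X'y directly instead of inverting X'X.

```python
def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Solve the linear system a z = b by Gauss-Jordan elimination with pivoting."""
    n = len(a)
    aug = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        # Move the row with the largest entry in this column to the pivot position
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def ols(x_rows, y):
    """Return [intercept, slope_1, ..., slope_K] solving (X'X) b = X'y."""
    X = [[1.0] + list(row) for row in x_rows]  # prepend the constant term
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(xi * yi for xi, yi in zip(col, y)) for col in Xt]
    return solve(XtX, Xty)


# Made-up data: y = 1 + 2*x1 + 3*x2 exactly (no error term)
x_rows = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8), (6, 4)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in x_rows]

beta_hat = ols(x_rows, y)
print([round(b, 6) for b in beta_hat])
```

Because the data contain no error term, the estimates should recover the intercept and the two slopes used to build y, up to floating-point rounding.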
Assumption OLS 1
Linearity: The model is linear in its parameters
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + u \]
It is possible to apply non-linear transformations to the variables (e.g. log of x) but not to the parameters, as in the following:
\[ y = \beta_0 + \beta_1^2 x_1 + u \]
OLS cannot estimate this.
Assumption OLS 2
Random Sampling: The n observations are a random sample of the whole population
There is no selection bias in the sample. The results pertain to the whole population
All observations are independent from one another (neither serial nor cross-sectional correlation)
Assumption OLS 3
No perfect Collinearity: There is no perfect collinearity between independent variables
No independent variable is constant. Each variable has variance which can be used with the variance of the dependent variable to compute the parameters.
No exact linear relationships amongst independent variables
Assumption OLS 4
Zero Conditional Mean: The error term u has an expected value of zero
\[ E(u \mid x_1, x_2, \dots, x_k) = 0 \]
Given any values of the independent variables (IV), the error term must have an expected value of zero.
In this case, all independent variables are exogenous.
Otherwise, at least one IV suffers from an endogeneity problem.
Sources of endogeneity
Wrong specification of the model
Omitted variable correlated with one RHS.
Measurement errors of RHS
Mutual causation between LHS and RHS
Simultaneity
Assumption OLS 5
Homoskedasticity: The variance of the error term, u, conditional on RHS, is the same for all values of RHS.
\[ Var(u \mid x_1, x_2, \dots, x_k) = \sigma_u^2 \]
Otherwise we speak of heteroskedasticity.
Assumption OLS 6
Normality of error term: The error term is independent of all RHS and follows a normal distribution with zero mean and variance σ²
\[ u \sim Normal(0, \sigma^2) \]
Assumptions OLS
OLS1 Linearity
OLS2 Random Sampling
OLS3 No perfect Collinearity
OLS4 Zero Conditional Mean
OLS5 Homoskedasticity
OLS6 Normality of error term
Theorem 1
OLS1 - OLS4: Unbiasedness of OLS. The expected value of each estimated parameter is equal to the true unknown value of β_j
\[ E(\hat{\beta}_j) = \beta_j, \quad j = 0, 1, 2, \dots, k \]
Theorem 2
OLS1 - OLS5: Variance of OLS estimate. The variance of the OLS estimator is
\[ Var(\hat{\beta}_j) = \frac{\sigma_u^2}{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2 \, (1 - R_j^2)} \]
… where R²_j is the R-squared from regressing x_j on all other independent variables. But how can we measure σ²_u?
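To see what the term (1 - R²_j) does to the variance in Theorem 2, a short worked illustration may help. The values of R²_j below are made up, and the ratio 1/(1 - R²_j) is what is commonly called the variance inflation factor (a name not used on the slides).

```latex
\[
\frac{1}{1 - R_j^2} =
\begin{cases}
1  & \text{if } R_j^2 = 0   \quad \text{($x_j$ uncorrelated with the other regressors)} \\
2  & \text{if } R_j^2 = 0.5 \\
10 & \text{if } R_j^2 = 0.9
\end{cases}
\]
```

Holding σ²_u and the variation in x_j fixed, the variance of the estimate of β_j is therefore ten times larger when R²_j = 0.9 than when x_j is uncorrelated with the other regressors. As R²_j approaches 1 the variance grows without bound, which is the limiting case excluded by OLS3.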
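Theorem 3
OLS1 - OLS5: The estimator of the error variance
\[ \hat{\sigma}_u^2 = \frac{\sum_i (y_i - \hat{y}_i)^2}{n - k - 1} = \frac{\sum_i \hat{u}_i^2}{n - k - 1} \]
is unbiased:
\[ E(\hat{\sigma}_u^2) = \sigma_u^2 \]
Its square root is the standard error of the regression. It is also called the standard error of the estimate or the root mean squared error (RMSE)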
Standard Error of Each Parameter
Combining theorems 2 and 3 yields:
\[ se(\hat{\beta}_j) = \frac{\hat{\sigma}_u}{\sqrt{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2 \, (1 - R_j^2)}} \]
Theorem 4
Under assumptions OLS1 - OLS5, the estimators
\[ \hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_k \]
are the Best Linear Unbiased Estimators (BLUE) of
\[ \beta_0, \beta_1, \dots, \beta_k \]
Assumptions OLS1 - OLS5 are known as the Gauss-Markov assumptions. The Gauss-Markov theorem states that under OLS1-5, OLS is the best linear unbiased estimation method
The estimates are unbiased (OLS1-4)
The estimates have the smallest variance among linear unbiased estimators (OLS5)
Theorem 5
Under assumptions OLS1 - OLS6, the standardized OLS estimates follow a t distribution:
\[ \frac{\hat{\beta}_j - \beta_j}{se(\hat{\beta}_j)} \sim t_{n-k-1} \]
Extension of theorem 5: Inference
We can define the confidence interval of β_j, at 95%:
\[ \hat{\beta}_j \pm t_{.025} \times \frac{\hat{\sigma}_u}{\sqrt{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2 \, (1 - R_j^2)}} \]
If the 95% CI does not include 0, then β_j is significantly different from 0.
Student t Test for H0: βj = 0
We are also in the position to infer on βj
H0: βj = 0
H1: βj ≠ 0
\[ t = \frac{\hat{\beta}_j - 0}{se(\hat{\beta}_j)} = \frac{\hat{\beta}_j}{se(\hat{\beta}_j)} \]
Rule of decision
Do not reject H0 if | t | < tα/2
Reject H0 if | t | ≥ tα/2
Summary
OLS1 Linearity
OLS2 Random Sampling
OLS3 No perfect Collinearity
OLS4 Zero Conditional Mean
OLS5 Homoskedasticity
OLS6 Normality of error term
T1: Unbiasedness (OLS1 - OLS4)
T2-T4: BLUE (OLS1 - OLS5)
T5: β̂ ~ t (OLS1 - OLS6)
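The knowledge production function
Application 1: seminal model
\[ PAT = f(RD, SIZE) \]
\[ PAT = A \cdot RD^{\beta_1} \cdot SIZE^{\beta_2} \cdot \exp(u) \]
\[ pat = \alpha + \beta_1 \, rd + \beta_2 \, size + u \]
where lowercase letters denote natural logarithms and α = log(A).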
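Application 1: baseline model

. reg lpat lrd lassets

      Source |       SS       df       MS              Number of obs =     431
-------------+------------------------------           F(  2,   428) =   38.54
       Model |  97.0696447     2  48.5348224           Prob > F      =  0.0000
    Residual |  538.941858   428  1.25920995           R-squared     =  0.1526
-------------+------------------------------           Adj R-squared =  0.1487
       Total |  636.011503   430  1.47909652           Root MSE      =  1.1221

------------------------------------------------------------------------------
        lpat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         lrd |   .6461714   .0868021     7.44   0.000       .47556    .8167828
     lassets |  -.3712237   .0722135    -5.14   0.000     -.513161   -.2292864
       _cons |  -.5909529   .3903255    -1.51   0.131    -1.358146    .1762404
------------------------------------------------------------------------------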
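Several figures in the Application 1 output follow directly from Theorems 3 and 5. As a hedged illustration, the sketch below recomputes them in plain Python from the sums of squares and from the lrd coefficient and standard error reported above. The critical value 1.9655 is an assumption (the approximate two-sided 5% value of the t distribution with 428 degrees of freedom), not a number taken from the output.

```python
import math

# Values copied from the Application 1 regression output
ss_model, df_model = 97.0696447, 2
ss_resid, df_resid = 538.941858, 428
coef_lrd, se_lrd = 0.6461714, 0.0868021

# Assumption: approximate two-sided 5% critical value of t with 428 df
T_CRIT = 1.9655

ss_total = ss_model + ss_resid

# Theorem 3: root MSE is the square root of SS_residual / (n - k - 1)
root_mse = math.sqrt(ss_resid / df_resid)

# Share of the total sum of squares explained by the model
r_squared = ss_model / ss_total

# F statistic: mean square of the model over mean square of the residual
f_stat = (ss_model / df_model) / (ss_resid / df_resid)

# Student t statistic for H0: beta = 0, and the 95% confidence interval
t_stat = coef_lrd / se_lrd
ci_low = coef_lrd - T_CRIT * se_lrd
ci_high = coef_lrd + T_CRIT * se_lrd

print(f"Root MSE  = {root_mse:.4f}")
print(f"R-squared = {r_squared:.4f}")
print(f"F         = {f_stat:.2f}")
print(f"t (lrd)   = {t_stat:.2f}")
print(f"95% CI    = [{ci_low:.4f}, {ci_high:.4f}]")
```

The printed values can be compared line by line with the Root MSE, R-squared, F, t and confidence interval columns of the Stata table.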
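Application 2: Changing specification
The knowledge production function
\[ PAT = f(RD, SIZE) \]
\[ PAT = A \cdot \left( \frac{RD}{SIZE} \right)^{\beta_1} \cdot SIZE^{\beta_2} \cdot \exp(u) \]
\[ pat = \alpha + \beta_1 \log\left( \frac{RD}{SIZE} \right) + \beta_2 \, size + u \]

. reg lpat lrdi lassets

      Source |       SS       df       MS              Number of obs =     431
-------------+------------------------------           F(  2,   428) =   38.54
       Model |  97.0696447     2  48.5348224           Prob > F      =  0.0000
    Residual |  538.941858   428  1.25920995           R-squared     =  0.1526
-------------+------------------------------           Adj R-squared =  0.1487
       Total |  636.011503   430  1.47909652           Root MSE      =  1.1221

------------------------------------------------------------------------------
        lpat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        lrdi |   .6461714   .0868021     7.44   0.000       .47556    .8167828
     lassets |   .2749477   .0337246     8.15   0.000     .2086614    .3412341
       _cons |  -.5909529   .3903255    -1.51   0.131    -1.358146    .1762404
------------------------------------------------------------------------------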
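The two specifications have identical fit (same R-squared, F and Root MSE) because the second is a reparametrization of the first. The short derivation below sketches why, assuming lrdi = lrd - lassets, i.e. the log of the ratio RD/SIZE. The superscripts (1) and (2) are introduced here only to tell the two applications apart.

```latex
\begin{align*}
pat &= \alpha + \beta_1^{(1)} \, rd + \beta_2^{(1)} \, size + u \\
    &= \alpha + \beta_1^{(1)} \, (rd - size) + \left( \beta_1^{(1)} + \beta_2^{(1)} \right) size + u \\
\Rightarrow \quad
\beta_1^{(2)} &= \beta_1^{(1)} = 0.6461714, \qquad
\beta_2^{(2)} = \beta_1^{(1)} + \beta_2^{(1)} = 0.6461714 - 0.3712237 = 0.2749477
\end{align*}
```

The sum matches the lassets coefficient reported in Application 2. The sign of the size coefficient therefore changes between the two tables only because its meaning changes: in Application 2 it measures the effect of size holding R&D intensity constant, not holding the level of R&D constant.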
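Application 3: Adding variables
\[ PAT = f(RD, SIZE, SPE) \]
\[ PAT = A \cdot \left( \frac{RD}{SIZE} \right)^{\beta_1} \cdot SIZE^{\beta_2} \cdot \exp(\beta_3 \, SPE + u) \]
\[ pat = \alpha + \beta_1 \log\left( \frac{RD}{SIZE} \right) + \beta_2 \, size + \beta_3 \, SPE + u \]

. reg lpat lrdi lassets spe

      Source |       SS       df       MS              Number of obs =     431
-------------+------------------------------           F(  3,   427) =   28.38
       Model |  105.748034     3  35.2493446           Prob > F      =  0.0000
    Residual |  530.263469   427  1.24183482           R-squared     =  0.1663
-------------+------------------------------           Adj R-squared =  0.1604
       Total |  636.011503   430  1.47909652           Root MSE      =  1.1144

------------------------------------------------------------------------------
        lpat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        lrdi |    .670643   .0866968     7.74   0.000     .5002375    .8410485
     lassets |   .2736255   .0334948     8.17   0.000     .2077903    .3394608
         spe |    .423136   .1600635     2.64   0.009     .1085255    .7377464
       _cons |  -.4877403   .3895845    -1.25   0.211    -1.253482    .2780017
------------------------------------------------------------------------------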
Qualitative variables used as independent variables
Qualitative variables
Dummy variables
Generating dummy variables using STATA
Interpretation of coefficients in OLS
Interaction effects between continuous and dummy variables
Qualitative variables
Qualitative variables provide information on discrete characteristics
The number of categories taken by qualitative variables is generally small.
These can be numerical values but each number denotes an attribute, i.e. a characteristic.
A qualitative variable may have several categories
Two categories: male, female
Three categories: nationality (French, German, Turkish)
More than three categories: sectors (car, chemical, steel, electronic equip., etc.)
Qualitative variables
There are several ways to code qualitative variables with n categories
Using one categorical variable
Producing n - 1 dummy variables
A dummy variable is a variable which takes values 0 or 1.
We also call them binary variables
We also call them dichotomous variables
Coding using one categorical variable
Two categories: we generate a categorical variable called “gender” set to 1 if the observation is a female, 2 if the observation is a male.
Three categories: we generate a categorical variable called “country” set to 1 if the observation is French, 2 if the observation is German, 3 if the observation is Turkish.
More than three categories: we generate a categorical variable called “sector” set to 1 if the observation is in the car industry, 2 for the chemical industry, 3 for the steel industry, 4 for the electronic equipment industry, etc.
This requires the use of labels in order to know to which category a given number pertains
Labelling variables
Labelling is tedious, boring and uninteresting.
But there are clear consequences when one must interpret the results
label variable: Describes a variable, qualitative or quantitative
label variable asset "real capital"
label define: Defines a label (meaning of numbers)
label define firm_type 1 "biotech" 0 "Pharma"
label values: Applies the label to a given variable
label values type firm_type
Labelling example
*************************************************************************************
******* CREATING INDUSTRY LABELS *********
*************************************************************************************
egen industrie = group(isic_oecd)
#delimit ;
label define induscode 1 "Text. Habill. & Cuir"
2 "Bois"
3 "Pap. Cart. & Imprim."
4 "Coke Raffin. Nucl."
5 "Chimie"
6 "Caoutc. Plast."
7 "Aut. Prod. min."
8 "Métaux de base"
9 "Travail des métaux"
10 "Mach. & Equip."
11 "Bureau & Inform."
12 "Mach. & Mat. Elec."
13 "Radio TV Telecom."
14 "Instrum. optique"
15 "Automobile"
16 "Aut. transp."
17 "Autres";
#delimit cr
label values industrie induscode
Exercise
1. Open SKEMA_BIO.dta
2. Create variable firm_type from type
3. Label variable firm_type
4. Define a label for firm_type and apply it
Dummy variables
Coding categorical variables using dummy variables only
Two categories.
We generate one dummy variable “female” set to 1 if the obs. is a female, 0 otherwise.
We generate one dummy variable “male” set to 1 if the obs. is a male, 0 otherwise.
But one of the dummy variables is simply redundant. When female = 0, then necessarily male = 1 (and vice versa).
Hence with two categories, we only need one dummy variable.
Three categories.
We generate one dummy variable “France” set to 1 if the obs. is French, 0 otherwise.
We generate one dummy variable “Germany” set to 1 if the obs. is German, 0 otherwise.
We generate one dummy variable “Turkey” set to 1 if the obs. is Turkish, 0 otherwise.
But one of the dummy variables is simply redundant. When France = 0 and Germany = 0, then Turkey = 1.
For a variable with n categories, we must create n - 1 dummy variables, each representing one particular category.
Generation of dummies with STATA
Using the if condition.
generate DEU = 0
replace DEU = 1 if country == "GERMANY"
generate LDF = 1 if size > 100
replace LDF = 0 if size < 101
Avoiding the use of the if condition.
generate FRA = country == "FRANCE"
* Caution: Stata treats a missing size as larger than 100; append "if !missing(size)" to leave such observations missing
generate LDF = size > 100
With n categories and n being large, generating dummy variables can become really tedious
Function tabulate has a very convenient extension, since it will generate n dummy variables at once.
tabulate varcat, gen(v_)
tabulate country, gen(c_)
Will create n dummy variables, with n being the number of countries in the dataset, c_1 being the first country, c_2 the second, c_3 the third, etc.
Reading coefficients of dummy variables
Remember! A coefficient tells us the increase in y associated with a one-unit increase in x, other things held constant (ceteris paribus).
Suppose the knowledge production function is
\[ y = \alpha + \beta \, biotech + u \]
with “y” being the number of patents and “biotech” being a dummy variable set to 1 for biotech firms, 0 otherwise.
If the firm is a biotech company, then the dummy variable “biotech” is equal to unity. Hence:
\[ \hat{y} = \hat{\alpha} + \hat{\beta} \times 1 = \hat{\alpha} + \hat{\beta} \]
If the firm is a pharma company, then the dummy variable “biotech” is equal to zero. Hence:
\[ \hat{y} = \hat{\alpha} + \hat{\beta} \times 0 = \hat{\alpha} \]
The coefficient reads as the variation in the dependent variable when the dummy variable is set to 1 relative to the situation where the dummy variable is set to 0.
With two categories, I must introduce one dummy variable.
With three categories, I must introduce two dummy variables.
With n categories, I must introduce (n-1) dummy variables.
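The reading of a dummy coefficient as a difference between two groups can be checked by hand. The sketch below, in plain Python, fits a regression of y on a single dummy using the simple-regression formulas and compares the result with the group means. The patent counts are hypothetical and are not taken from SKEMA_BIO.dta.

```python
from statistics import mean

# Hypothetical patent counts (illustration only)
pharma = [12, 8, 15, 9]      # biotech = 0
biotech = [30, 22, 27, 25]   # biotech = 1

y = pharma + biotech
d = [0] * len(pharma) + [1] * len(biotech)

# Simple OLS of y on the dummy d: slope = cov(d, y) / var(d)
d_bar, y_bar = mean(d), mean(y)
s_dy = sum((di - d_bar) * (yi - y_bar) for di, yi in zip(d, y))
s_dd = sum((di - d_bar) ** 2 for di in d)
beta_hat = s_dy / s_dd
alpha_hat = y_bar - beta_hat * d_bar

print("alpha_hat:", alpha_hat, "| mean of pharma group:", mean(pharma))
print("beta_hat: ", beta_hat, "| difference in group means:",
      mean(biotech) - mean(pharma))
```

In this sketch the intercept coincides with the mean of the reference group (dummy equal to 0) and the dummy coefficient with the difference between the two group means.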
Exercise
1. Regress the following model:
\[ PAT = \alpha + \beta \, biotech + u \]
2. Predict the number of patents for both biotech and pharma companies
3. Produce descriptive statistics of PAT for each type of company using the command table
4. What do you observe?
Reading coefficients of dummy variables
For semi-logarithmic forms (log Y), coefficient β, multiplied by 100, must be read as an approximation of the percent change in Y associated with a variation of 1 unit of the explanatory variable.
This approximation is acceptable for small β (|β| < 0.1). When β is large (|β| ≥ 0.1), the exact percent change in Y is:
100 × (e^β - 1)
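A quick numerical comparison shows how far the approximation drifts as β grows. The sketch below is plain Python; the function name `exact_percent_change` is hypothetical. It uses one small illustrative coefficient (0.05, made up) and the biotech coefficient 1.657062 reported in the Application 4 regression of this class.

```python
import math


def exact_percent_change(beta):
    """Exact percent change in Y for a one-unit change in x when the model is in log Y."""
    return 100.0 * (math.exp(beta) - 1.0)


# 0.05 is a made-up small coefficient; 1.657062 is the biotech coefficient of Application 4
for beta in (0.05, 1.657062):
    approx = 100.0 * beta
    exact = exact_percent_change(beta)
    print(f"beta = {beta}: approximation {approx:.1f}%, exact {exact:.1f}%")
```

For the small coefficient the two readings are close, whereas for the biotech dummy the approximation of about 166% understates the exact change by a wide margin.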
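Application 4: dummy variable
The knowledge production function
\[ PAT = f(RD, SIZE, SPE, BIO) \]
\[ PAT = A \cdot \left( \frac{RD}{SIZE} \right)^{\beta_1} \cdot SIZE^{\beta_2} \cdot \exp(\beta_3 \, SPE + \beta_4 \, BIO + u) \]
\[ pat = \alpha + \beta_1 \log\left( \frac{RD}{SIZE} \right) + \beta_2 \, size + \beta_3 \, SPE + \beta_4 \, BIO + u \]

. reg lpat lrdi lassets spe biotech

      Source |       SS       df       MS              Number of obs =     431
-------------+------------------------------           F(  4,   426) =   50.24
       Model |  203.874406     4  50.9686015           Prob > F      =  0.0000
    Residual |  432.137097   426  1.01440633           R-squared     =  0.3206
-------------+------------------------------           Adj R-squared =  0.3142
       Total |  636.011503   430  1.47909652           Root MSE      =  1.0072

------------------------------------------------------------------------------
        lpat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        lrdi |   .4924169   .0804249     6.12   0.000     .3343379     .650496
     lassets |   .5558656   .0417126    13.33   0.000     .4738775    .6378537
         spe |   .4212942   .1446661     2.91   0.004      .136946    .7056423
     biotech |   1.657062   .1684813     9.84   0.000     1.325904     1.98822
       _cons |  -5.464644   .6164752    -8.86   0.000    -6.676356   -4.252932
------------------------------------------------------------------------------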
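[Figure: ln(PAT) (vertical axis) against size (horizontal axis). Two parallel lines with the same slope β̂2. Pharma: α̂ + β̂2 × size. Biotech: (α̂ + β̂4) + β̂2 × size. The vertical distance between the two lines is β̂4.]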
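Application 5: Interacting variables
The knowledge production function
\[ PAT = f(RD, SIZE, SPE, BIO) \]
\[ PAT = A \cdot \left( \frac{RD}{SIZE} \right)^{\beta_1} \cdot SIZE^{\beta_2} \cdot \exp(\beta_3 \, SPE + \beta_4 \, BIO + \beta_5 \, BIO \times size + u) \]
\[ pat = \alpha + \beta_1 \log\left( \frac{RD}{SIZE} \right) + \beta_2 \, size + \beta_3 \, SPE + \beta_4 \, BIO + \beta_5 \, BIO \times size + u \]

. reg lpat lrdi lassets spe biotech bio_assets

      Source |       SS       df       MS              Number of obs =     431
-------------+------------------------------           F(  5,   425) =   41.02
       Model |  207.026736     5  41.4053471           Prob > F      =  0.0000
    Residual |  428.984767   425  1.00937592           R-squared     =  0.3255
-------------+------------------------------           Adj R-squared =  0.3176
       Total |  636.011503   430  1.47909652           Root MSE      =  1.0047

------------------------------------------------------------------------------
        lpat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        lrdi |   .4742035   .0808846     5.86   0.000     .3152199    .6331871
     lassets |    .619805   .0551395    11.24   0.000     .5114249    .7281852
         spe |   .4131693   .1443802     2.86   0.004     .1293812    .6969573
     biotech |   3.592252   1.107872     3.24   0.001      1.41466    5.769843
  bio_assets |  -.1435349    .081221    -1.77   0.078    -.3031798    .0161099
       _cons |  -6.482948   .8427254    -7.69   0.000    -8.139376   -4.826519
------------------------------------------------------------------------------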
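[Figure: ln(PAT) (vertical axis) against size (horizontal axis). Two lines with different intercepts and slopes. Pharma: α̂ + β̂2 × size, with slope β̂2. Biotech: (α̂ + β̂4) + (β̂2 + β̂5) × size, with slope β̂2 + β̂5. The difference in slopes is β̂5.]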
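As a worked reading of the interaction term, the size slopes implied by the Application 5 estimates can be computed for each type of firm. This is only arithmetic on the reported coefficients and says nothing about their precision.

```latex
\begin{align*}
\frac{\partial \, pat}{\partial \, size} &= \hat{\beta}_2 + \hat{\beta}_5 \, BIO \\
\text{Pharma } (BIO = 0): \quad & \hat{\beta}_2 = 0.619805 \\
\text{Biotech } (BIO = 1): \quad & \hat{\beta}_2 + \hat{\beta}_5 = 0.619805 - 0.1435349 = 0.4762701
\end{align*}
```

The point estimates thus suggest a flatter size slope for biotech firms. Since the bio_assets coefficient has a p-value of 0.078, the difference in slopes is significant at the 10% level but not at the 5% level.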
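Specification Tests
The knowledge production function
Specification Tests for Multiple OLS
\[ PAT = f(RD, SIZE, SPE) \]
\[ PAT = A \cdot \left( \frac{RD}{SIZE} \right)^{\beta_1} \cdot SIZE^{\beta_2} \cdot \exp(\beta_3 \, SPE + u) \]
\[ pat = \alpha + \beta_1 \log\left( \frac{RD}{SIZE} \right) + \beta_2 \, size + \beta_3 \, SPE + u \]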
Critical probability α such that: Pr(accept Ha | H0 is true) = α
Student t test: concerning the significance of one parameter
Fisher F test: concerning the significance of several parameters simultaneously (Wald test)
Non-linear restriction test: testing for non-linear relationships between parameters
Specification Tests for Multiple OLS
Testing linear combinations of parameters
Concerning one parameter only
H0 : lassets = 0.30
test lassets = 0.30
Test on several parameters
H0 : lassets = 0.30 and lrdi = 0.70
test (lassets = 0.3) (lrdi = 0.7)
H0 : lrdi = 2 * lassets
test lrdi = 2 * lassets
H0 : lrdi + lassets = 1
test lrdi + lassets = 1
lincom _b[lrdi] + _b[lassets] - 1
Testing non-linear combinations of parameters
Test on several parameters
H0 : lassets * lrdi = 0.2
testnl _b[lrdi] * _b[lassets] = 0.2
nlcom _b[lrdi] * _b[lassets] - 0.2
Review of Assumptions

OLS assumption                | Consistency when violated | Efficiency when violated                | Test
OLS1 Linearity                | -                         | -                                       | -
OLS2 Random Sampling          | Biased β                  | None                                    | None. Redo sampling & estimation
OLS3 No perfect Collinearity  | -                         | -                                       | -
OLS4 Zero Conditional Mean    | Biased β                  | Poorly estimated variance of β          | Link test, Omitted Variable test
OLS5 Homoskedasticity         | None                      | Underestimated variance of β            | Breusch-Pagan test
OLS6 Normality of error term  | None                      | Lack of reliability of the t test for β | Shapiro-Wilk test
Specification Tests for Multiple OLS
Specification tests on the validity of assumptions
Hypothesis OLS5: Homoskedasticity of residuals
Rule of thumb using graphs
Stata instruction: rvfplot
White Test
Stata instruction: estat imtest
Breusch-Pagan Test
Stata instruction: estat hettest
Hypothesis OLS5: Homoskedasticity of residuals: rvfplot
[Figure: rvfplot, residuals (vertical axis) plotted against fitted values (horizontal axis).]
Hypothesis OLS5: Homoskedasticity of residuals: estat imtest

. imtest

Cameron & Trivedi's decomposition of IM-test

---------------------------------------------------
              Source |       chi2     df      p
---------------------+-----------------------------
  Heteroskedasticity |      21.74      9    0.0097
            Skewness |       3.05      3    0.3840
            Kurtosis |      15.55      1    0.0001
---------------------+-----------------------------
               Total |      40.34     13    0.0001
---------------------------------------------------
Hypothesis OLS5: Homoskedasticity of residuals: estat hettest

. hettest

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
         Ho: Constant variance
         Variables: fitted values of lpat

         chi2(1)      =     2.83
         Prob > chi2  =   0.0927
Specification Tests for Multiple OLS
Specification tests on the validity of assumptions
Hypothesis OLS6: Normality of residuals
Rule of thumb using graphs
Stata instruction:
predict res, residual
kdensity res, normal
Formally using the Shapiro-Wilk Test
Stata instruction:
predict res, residual
swilk res
Hypothesis OLS6: Normality of residuals: kdensity
[Figure: Kernel density estimate of the residuals (horizontal axis: residuals, vertical axis: density) overlaid with a normal density; kernel = epanechnikov, bandwidth = 0.2971.]
Hypothesis OLS6: Normality of residuals: swilk

. swilk res

                   Shapiro-Wilk W test for normal data

    Variable |    Obs       W           V         z       Prob>z
-------------+--------------------------------------------------
         res |    431    0.98688      3.862     3.226    0.00063
Specification Tests for Multiple OLS
Specification tests on the validity of assumptions
There are no omitted variables (OLS4 on endogeneity)
Link test: Stata instruction linktest
Regress the DV on the prediction and its squared value
Variable _hat must be significant, but not _hatsq
Ramsey RESET test: Stata instruction ovtest
ovtest: regress the DV on powers (up to 4) of the fitted values of the LHS variable
ovtest, rhs: regress the DV on powers (up to 4) of the RHS variables
There are no omitted variables (OLS4 on endogeneity): linktest

. quietly: regress lpat lrdi lassets spe
. linktest

      Source |       SS       df       MS              Number of obs =     431
-------------+------------------------------           F(  2,   428) =   45.93
       Model |  112.393289     2  56.1966447           Prob > F      =  0.0000
    Residual |  523.618213   428  1.22340704           R-squared     =  0.1767
-------------+------------------------------           Adj R-squared =  0.1729
       Total |  636.011503   430  1.47909652           Root MSE      =  1.1061

------------------------------------------------------------------------------
        lpat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        _hat |   .2055605   .3574387     0.58   0.566    -.4969932    .9081141
      _hatsq |   .2707472   .1161699     2.33   0.020     .0424126    .4990817
       _cons |   .4943074   .2887035     1.71   0.088    -.0731457    1.061761
------------------------------------------------------------------------------
There are no omitted variables (OLS4 on endogeneity): ovtest

. quietly: regress lpat lrdi lassets spe
. ovtest

Ramsey RESET test using powers of the fitted values of lpat
       Ho:  model has no omitted variables
                 F(3, 424) =      2.34
                  Prob > F =      0.0732

\[ F_{k-1,\; n-m-k} = \frac{(R_1^2 - R_0^2) / (k - 1)}{(1 - R_1^2) / (n - m - k)} \]
where R²_0 and R²_1 are the R-squared of the original and of the augmented regression.
Exercise
1. Regress the following model
\[ pat = \alpha + \beta_1 \log\left( \frac{RD}{SIZE} \right) + \beta_2 \, size + \beta_3 \, SPE + \beta_4 \, BIO + u \]
2. Assuming OLS1-3 to be correct, test OLS4-6 and conclude
1. OLS4 on specification using linktest and ovtest
2. OLS5 on homoskedasticity using imtest and hettest
3. OLS6 on normality of errors using kdensity and the swilk test