
Class 5: Multiple Regression

Lionel Nesta

Observatoire Français des Conjonctures Economiques


SKEMA Ph.D. programme, 2010-2011

Introduction to Regression

Typically, the social scientist is dealing with multiple and complex webs of interactions between variables.

An immediate and appealing extension to simple linear regression is to extend the set of explanatory variables to other variables.

Multiple regressions include several explanatory variables in the empirical model:

$$y_i = \alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + u_i$$


Equivalently, in summation notation:

$$y_i = \alpha + \sum_{k=1}^{K} \beta_k x_{ik} + u_i$$

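Multivariate Least Square Estimator

To minimize the sum of squared errors:

$$\min_{\hat\alpha, \hat\beta_1, \dots, \hat\beta_K} \sum_{i=1}^{n} \hat u_i^2 = \min_{\hat\alpha, \hat\beta_1, \dots, \hat\beta_K} \sum_{i=1}^{n} (y_i - \hat y_i)^2 = \min_{\hat\alpha, \hat\beta_1, \dots, \hat\beta_K} \sum_{i=1}^{n} \Big( y_i - \hat\alpha - \sum_{k=1}^{K} \hat\beta_k x_{ik} \Big)^2$$

Usually, the multivariate model is described in matrix notation, where X includes a column of ones for the intercept:

$$y = X\beta + u$$

With the following least squares solution:

$$\hat\beta = (X'X)^{-1} X'y, \qquad \mathrm{cov}(\hat\beta) = \sigma^2 (X'X)^{-1}$$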
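As a rough illustration of the matrix solution, the sketch below rebuilds the estimator (X'X)^{-1}X'y by hand and compares it with what regress reports. It assumes Stata's shipped example dataset auto as a stand-in, since the course data are not reproduced here; the variables price, weight and length belong to that dataset, not to the lecture.

```stata
* Sketch only: the auto dataset and its variables are stand-ins, not course data
sysuse auto, clear

* Reference fit
regress price weight length
matrix b_reg = e(b)              // 1 x 3 row vector: weight, length, _cons
matrix V_reg = e(V)
scalar s2 = e(rmse)^2            // estimated error variance

* X'X and y'X; both commands append the constant as the last column
matrix accum XX = weight length
matrix vecaccum yX = price weight length

* beta_hat = (X'X)^-1 X'y
matrix b_ols = invsym(XX) * yX'

* cov(beta_hat) = sigma_hat^2 (X'X)^-1
matrix V_ols = s2 * invsym(XX)

matrix list b_reg
matrix list b_ols
```

The two coefficient vectors should agree up to numerical precision; regress itself uses a more careful numerical routine than this textbook formula.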

Assumption OLS 1

Linearity: the model is linear in its parameters.

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + u$$

It is possible to apply non-linear transformations to the variables (e.g. log of x) but not to the parameters, as in the following:

$$y = \beta_0 + \beta_1 x_1^{\beta_2} + u$$

OLS cannot estimate this.

Assumption OLS 2

Random sampling: the n observations are a random sample of the whole population.

There is no selection bias in the sample. The results pertain to the whole population.

All observations are independent from one another (neither serial nor cross-sectional correlation).

Assumption OLS 3

No perfect collinearity: there is no exact linear relationship among the independent variables.

No independent variable is constant. Each variable has some variance, which is used together with the variance of the dependent variable to compute the parameters.

Correlation among independent variables is allowed; only exact linear relationships are ruled out.
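A small sketch of what happens when this assumption fails: if one regressor is an exact multiple of another, Stata cannot separate the two effects and omits one of them. The auto dataset and the constructed variable weight2 are assumptions made for the illustration only.

```stata
* Sketch only: auto is a stand-in dataset; weight2 is built to be exactly collinear
sysuse auto, clear
generate weight2 = 2 * weight

* Stata notes that one of the two variables is omitted because of collinearity
regress price weight weight2

* Only two parameters are actually estimated: one slope and the constant
display "rank of the estimated model: " e(rank)
```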

Assumption OLS 4

Zero conditional mean: the error term u has an expected value of zero.

Given any values of the independent variables (IV), the error term must have an expected value of zero:

$$E(u \mid x_1, x_2, \dots, x_k) = 0$$

In this case, all independent variables are exogenous.

Otherwise, at least one IV suffers from an endogeneity problem.

Sources of endogeneity

Wrong specification of the model

    Omitted variable correlated with one of the RHS variables

    Measurement errors in the RHS variables

Mutual causation between LHS and RHS variables

    Simultaneity

Assumption OLS 5

Homoskedasticity: the variance of the error term, u, conditional on the RHS variables, is the same for all values of the RHS variables.

$$\mathrm{Var}(u \mid x_1, x_2, \dots, x_k) = \sigma_u^2$$

Otherwise we speak of heteroskedasticity.

Assumption OLS 6

Normality of the error term: the error term is independent of all RHS variables and follows a normal distribution with zero mean and variance σ².

$$u \sim \mathrm{Normal}(0, \sigma^2)$$

Assumptions OLS

OLS1 Linearity

OLS2 Random Sampling

OLS3 No perfect Collinearity

OLS4 Zero Conditional Mean

OLS5 Homoskedasticity

OLS6 Normality of error term

Theorem 1

OLS1 - OLS4: Unbiasedness of OLS. The expected value of each estimated parameter is equal to the true unknown value β_j:

$$E(\hat\beta_j) = \beta_j, \qquad j = 0, 1, 2, \dots, k$$

Theorem 2

OLS1 – OLS5: Variance of the OLS estimate. The variance of the OLS estimator is

$$\mathrm{Var}(\hat\beta_j) = \frac{\sigma_u^2}{\sum_{i=1}^{n} (x_{ij} - \bar x_j)^2 \, (1 - R_j^2)}$$

… where R_j² is the R-squared from regressing x_j on all other independent variables. But how can we measure σ_u²?

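Theorem 3

OLS1 – OLS5: An unbiased estimator of the error variance σ_u² is

$$\hat\sigma_u^2 = \frac{\sum_i \hat u_i^2}{n - k - 1} = \frac{\sum_i (y_i - \hat y_i)^2}{n - k - 1}, \qquad E(\hat\sigma_u^2) = \sigma_u^2$$

Its square root, σ̂_u, is the standard error of the regression. It is also called the standard error of the estimate or the root mean squared error (RMSE).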

Standard Error of Each Parameter

Combining theorems 2 and 3 yields:

$$se(\hat\beta_j) = \frac{\hat\sigma_u}{\sqrt{\sum_{i=1}^{n} (x_{ij} - \bar x_j)^2 \, (1 - R_j^2)}}$$
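The standard error formula can be checked piece by piece. The sketch below, again on the auto dataset as an assumed stand-in, takes the Root MSE from the main regression, gets R_j² from an auxiliary regression of one regressor on the other, and compares the hand-built standard error with the one reported by regress.

```stata
* Sketch only: auto and its variables are stand-ins, not course data
sysuse auto, clear

* Main regression: keep sigma_hat (Root MSE) and the reported standard error
regress price weight length
scalar sigma  = e(rmse)
scalar se_reg = _se[weight]

* Auxiliary regression of x_j on the other regressors gives R_j^2
quietly regress weight length
scalar R2j = e(r2)

* Total variation of x_j around its mean: sum of (x_ij - xbar_j)^2
quietly summarize weight
scalar SSTj = r(Var) * (r(N) - 1)

scalar se_manual = sigma / sqrt(SSTj * (1 - R2j))
display "regress: " se_reg "   formula: " se_manual
```

The closer R_j² is to 1, the larger the standard error, which is how near-collinearity inflates the variance of an estimate.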

Theorem 4

Under assumptions OLS1 – OLS5, the estimators

$$\hat\beta_0, \hat\beta_1, \dots, \hat\beta_k$$

are the Best Linear Unbiased Estimators (BLUE) of

$$\beta_0, \beta_1, \dots, \beta_k$$

Assumptions OLS1 – OLS5 are known as the Gauss-Markov assumptions. The Gauss-Markov theorem stipulates that under OLS1-5, OLS is the best linear unbiased estimation method:

The estimates are unbiased (OLS1-4)

The estimates have the smallest variance among linear unbiased estimators (OLS5)

Theorem 5

Under assumptions OLS1 – OLS6, the standardized OLS estimates follow a t distribution with n - k - 1 degrees of freedom:

$$\frac{\hat\beta_j - \beta_j}{se(\hat\beta_j)} \sim t_{n-k-1}$$

Extension of theorem 5: Inference

We can define the confidence interval of β_j, at 95%:

$$\hat\beta_j \pm t_{.025} \times \frac{\hat\sigma_u}{\sqrt{\sum_{i=1}^{n} (x_{ij} - \bar x_j)^2 \, (1 - R_j^2)}}$$

If the 95% CI does not include 0, then β_j is significantly different from 0.

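Student t Test for H0: βj = 0

We are also in a position to make inferences on βj:

H0: βj = 0

H1: βj ≠ 0

$$t = \frac{\hat\beta_j - 0}{se(\hat\beta_j)} = \frac{\hat\beta_j}{se(\hat\beta_j)}$$

Rule of decision

Accept (do not reject) H0 if | t | < tα/2

Reject H0 if | t | ≥ tα/2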

Summary

OLS1 Linearity

OLS2 Random Sampling

OLS3 No perfect Collinearity

OLS4 Zero Conditional Mean

OLS5 Homoskedasticity

OLS6 Normality of error term

T1: Unbiasedness (OLS1 - OLS4)

T2-T4: BLUE (OLS1 - OLS5)

T5: β̂ ~ t (OLS1 - OLS6)

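The knowledge production function

Application 1: seminal model

$$PAT = f(RD, SIZE)$$

$$PAT = A \cdot RD^{\beta_1} \cdot SIZE^{\beta_2} \cdot \exp(u)$$

$$pat = \alpha + \beta_1 \, rd + \beta_2 \, size + u$$

Lower-case names denote natural logs, with α = ln A.

Application 1: baseline model

. reg lpat lrd lassets

      Source |       SS       df       MS              Number of obs =     431
-------------+------------------------------           F(  2,   428) =   38.54
       Model |  97.0696447     2  48.5348224           Prob > F      =  0.0000
    Residual |  538.941858   428  1.25920995           R-squared     =  0.1526
-------------+------------------------------           Adj R-squared =  0.1487
       Total |  636.011503   430  1.47909652           Root MSE      =  1.1221

------------------------------------------------------------------------------
        lpat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         lrd |   .6461714   .0868021     7.44   0.000       .47556    .8167828
     lassets |  -.3712237   .0722135    -5.14   0.000     -.513161   -.2292864
       _cons |  -.5909529   .3903255    -1.51   0.131    -1.358146    .1762404
------------------------------------------------------------------------------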
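The t statistic, the confidence interval and the F statistic in this output can be recomputed by hand from the theorems above. The sketch below only uses numbers copied from the output (the lrd row and the sums of squares); the scalar names are made up for the illustration.

```stata
* Numbers copied from the regression output; scalar names are illustrative
scalar b_lrd  = .6461714
scalar se_lrd = .0868021
scalar dof    = 428                               // n - k - 1 = 431 - 2 - 1

scalar t_lrd = b_lrd / se_lrd                     // t statistic for H0: beta = 0
scalar tcrit = invttail(dof, 0.025)               // two-sided 5% critical value
scalar p_lrd = 2 * ttail(dof, abs(t_lrd))         // two-sided p-value

display "t = " %6.2f t_lrd "   critical value = " %6.4f tcrit "   p = " %6.4f p_lrd
display "95% CI: [" %8.6f (b_lrd - tcrit*se_lrd) ", " %8.6f (b_lrd + tcrit*se_lrd) "]"

* Overall F statistic (ratio of mean squares) and R-squared (Model SS / Total SS)
display "F  = " %6.2f ((97.0696447/2) / (538.941858/428))
display "R2 = " %6.4f (97.0696447 / 636.011503)
```

Up to rounding, these lines should reproduce the values printed by Stata for lrd (t = 7.44, interval from .47556 to .8167828) and for the model as a whole (F = 38.54, R-squared = 0.1526).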

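Application 2: Changing specification

$$PAT = f(RD, SIZE)$$

$$PAT = A \cdot \left(\frac{RD}{SIZE}\right)^{\beta_1} \cdot SIZE^{\beta_2} \cdot \exp(u)$$

$$pat = \alpha + \beta_1 \log\left(\frac{RD}{SIZE}\right) + \beta_2 \, size + u$$

. reg lpat lrdi lassets

------------------------------------------------------------------------------
        lpat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        lrdi |   .6461714   .0868021     7.44   0.000       .47556    .8167828
     lassets |   .2749477   .0337246     8.15   0.000     .2086614    .3412341
       _cons |  -.5909529   .3903255    -1.51   0.131    -1.358146    .1762404
------------------------------------------------------------------------------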
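Applications 1 and 2 are the same model written in two ways, which a short derivation makes visible. It assumes that lrdi is the log of R&D intensity, that is lrdi = lrd - lassets; the slides suggest this but do not state it. The symbol γ_2 is introduced here for the size coefficient of Application 1.

```latex
\begin{aligned}
pat &= \alpha + \beta_1 (rd - size) + \beta_2 \, size + u \\
    &= \alpha + \beta_1 \, rd + (\beta_2 - \beta_1) \, size + u \\
\hat\gamma_2 &= \hat\beta_2 - \hat\beta_1 = 0.2749477 - 0.6461714 = -0.3712237
\end{aligned}
```

This matches the coefficient on lassets in Application 1, while the coefficient on R&D and the constant are unchanged across the two outputs. Changing the specification changes what the size coefficient means (size effect at given R&D versus at given R&D intensity), not the fit of the model.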



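Application 3: Adding variables

$$PAT = f(RD, SIZE, SPE)$$

$$PAT = A \cdot \left(\frac{RD}{SIZE}\right)^{\beta_1} \cdot SIZE^{\beta_2} \cdot \exp(\beta_3 \, SPE + u)$$

$$pat = \alpha + \beta_1 \log\left(\frac{RD}{SIZE}\right) + \beta_2 \, size + \beta_3 \, SPE + u$$

. reg lpat lrdi lassets spe

------------------------------------------------------------------------------
        lpat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        lrdi |    .670643   .0866968     7.74   0.000     .5002375    .8410485
     lassets |   .2736255   .0334948     8.17   0.000     .2077903    .3394608
         spe |    .423136   .1600635     2.64   0.009     .1085255    .7377464
       _cons |  -.4877403   .3895845    -1.25   0.211    -1.253482    .2780017
------------------------------------------------------------------------------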

Qualitative variables used as independent variables

Qualitative variables as indep. variables

Qualitative variables

Dummy variables

Generating dummy variables using STATA

Interpretation of coefficients in OLS

Interaction effects between continuous and dummy var.

Qualitative variables

Qualitative variables provide information on discrete characteristics.

The number of categories taken by qualitative variables is generally small.

These can be numerical values, but each number denotes an attribute, a characteristic.

A qualitative variable may have several categories:

Two categories: male – female

Three categories: nationality (French, German, Turkish)

More than three categories: sectors (car, chemical, steel, electronic equip., etc.)

Qualitative variables

There are several ways to code qualitative variables with n categories:

Using one categorical variable

Producing n - 1 dummy variables

A dummy variable is a variable which takes values 0 or 1.

We also call them binary or dichotomous variables.

Coding using one categorical variable

Two categories: we generate a categorical variable called "gender" set to 1 if the observation is a female, 2 if the observation is a male.

Three categories: we generate a categorical variable called "country" set to 1 if the observation is French, 2 if the observation is German, 3 if the observation is Turkish.

More than three categories: we generate a categorical variable called "sector" set to 1 if the observation is in the car industry, 2 for the chemical industry, 3 for the steel industry, 4 for the electronic equipment industry, etc.

This requires the use of labels in order to know to which category a given number pertains.

Qualitative variables

Labelling variables

Labelling is tedious, boring and uninteresting.

But there are clear consequences when one must interpret the results

label variable: describes a variable, qualitative or quantitative
label variable asset "real capital"

label define: defines a label (meaning of numbers)
label define firm_type 1 "biotech" 0 "Pharma"

label values: applies the label to a given variable
label values type firm_type

Labelling example

*************************************************************************************

******* CREATING INDUSTRY LABELS *********

*************************************************************************************

egen industrie = group(isic_oecd)

#delimit ;

label define induscode 1 "Text. Habill. & Cuir"

2 "Bois"

3 "Pap. Cart. & Imprim."

4 "Coke Raffin. Nucl."

5 "Chimie"

6 "Caoutc. Plast."

7 "Aut. Prod. min."

8 "Métaux de base"

9 "Travail des métaux"

10 "Mach. & Equip."

11 "Bureau & Inform."

12 "Mach. & Mat. Elec."

13 "Radio TV Telecom."

14 "Instrum. optique"

15 "Automobile"

16 "Aut. transp."

17 "Autres";

#delimit cr

label values industrie induscode

Exercise

1. Open SKEMA_BIO.dta

2. Create variable firm_type from type

3. Label variable firm_type

4. Define a label for firm_type and apply it

Dummy variables

Coding categorical variables using dummy variables only

Two categories.

We generate one dummy variable "female" set to 1 if the obs. is a female, 0 otherwise.

We generate one dummy variable "male" set to 1 if the obs. is a male, 0 otherwise.

But one of the dummy variables is simply redundant. When female = 0, then necessarily male = 1 (and vice versa).

Hence with two categories, we only need one dummy variable.

Dummy variables

Coding categorical variables using dummy variables only

Three categories.

We generate one dummy variable "France" set to 1 if the obs. is French, 0 otherwise.

We generate one dummy variable "Germany" set to 1 if the obs. is German, 0 otherwise.

We generate one dummy variable "Turkish" set to 1 if the obs. is Turkish, 0 otherwise.

But one of the dummy variables is simply redundant. When France = 0 and Germany = 0, then Turkish = 1.

For a variable with n categories, we must create n - 1 dummy variables, each representing one particular category.

Generation of dummies with STATA

Using the if condition:

generate DEU = 0
replace DEU = 1 if country=="GERMANY"

generate LDF = 1 if size > 100
replace LDF = 0 if size <= 100

Avoiding the use of the if condition:

generate FRA = country=="FRANCE"
generate LDF = size > 100

With n categories and n being large, generating dummy variables can become really tedious.

The tabulate command has a very convenient option, gen(), which generates n dummy variables at once:

tabulate varcat, gen(v_)

tabulate country, gen(c_)

This creates n dummy variables, n being the number of countries in the dataset, with c_1 for the first country, c_2 for the second, c_3 for the third, etc.
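The commands above can be tried on a toy dataset. The five observations below are invented for the illustration, and the variable names LDF_naive, LDF and the prefix c_ are arbitrary. One point worth seeing in action: Stata treats a missing value as larger than any number, so size > 100 evaluates to 1 when size is missing unless missing values are excluded explicitly.

```stata
* Toy data, invented for the illustration
clear
input str10 country size
"FRANCE"   50
"GERMANY" 150
"TURKEY"  120
"GERMANY"  80
"FRANCE"    .
end

* Logical expressions return 1 (true) or 0 (false)
generate DEU = country == "GERMANY"

* Missing size counts as "larger than 100" here...
generate LDF_naive = size > 100

* ...so restrict the expression to non-missing observations
generate LDF = size > 100 if !missing(size)

* One dummy per category, in sorted order: c_1 FRANCE, c_2 GERMANY, c_3 TURKEY
tabulate country, generate(c_)

list, clean
```

Only n - 1 of the c_ variables should then enter a regression that includes a constant.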


Reading coefficients of dummy variables

Remember! A coefficient tells us the increase in y associated with a one-unit increase in x, other things held constant (ceteris paribus).

Suppose the knowledge production function is

$$y = \alpha + \beta \, biotech + u$$

with y being the number of patents and "biotech" being a dummy variable set to 1 for biotech firms, 0 otherwise.

If the firm is a biotech company, then the dummy variable "biotech" is equal to unity. Hence:

$$\hat y = \hat\alpha + \hat\beta \times 1 = \hat\alpha + \hat\beta$$

If the firm is a pharma company, then the dummy variable "biotech" is equal to zero. Hence:

$$\hat y = \hat\alpha + \hat\beta \times 0 = \hat\alpha$$

The coefficient reads as the variation in the dependent variable when the dummy variable is set to 1 relative to the situation where the dummy variable is set to 0.

With two categories, I must introduce one dummy variable.

With three categories, I must introduce two dummy variables.

With n categories, I must introduce (n-1) dummy variables.
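A minimal sketch of this reading, using the auto dataset as an assumed stand-in (its dummy foreign plays the role of "biotech" and price the role of y): with a single dummy as regressor, the constant is the mean of the group coded 0 and the constant plus the slope is the mean of the group coded 1.

```stata
* Sketch only: auto is a stand-in dataset; foreign is its 0/1 dummy
sysuse auto, clear
regress price foreign

* Group means computed directly
quietly summarize price if foreign == 0
scalar m0 = r(mean)
quietly summarize price if foreign == 1
scalar m1 = r(mean)

display "constant:         " _b[_cons] "   mean, dummy = 0: " m0
display "constant + slope: " _b[_cons] + _b[foreign] "   mean, dummy = 1: " m1
```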


Exercise

1. Regress the following model:

$$PAT = \alpha + \beta \, biotech + u$$

2. Predict the number of patents for both biotech and pharma companies

3. Produce descriptive statistics of PAT for each type of company using the command table

4. What do you observe?

For semi-logarithmic forms (log Y), 100 × β must be read as an approximation of the percent change in Y associated with a one-unit change in the explanatory variable.

This approximation is acceptable for small β (|β| < 0.1). When β is large (|β| ≥ 0.1), the exact percent change in Y is:

$$100 \times (e^{\beta} - 1)$$


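Application 4: dummy variable

$$PAT = f(RD, SIZE, SPE, BIO)$$

$$PAT = A \cdot \left(\frac{RD}{SIZE}\right)^{\beta_1} \cdot SIZE^{\beta_2} \cdot \exp(\beta_3 \, SPE + \beta_4 \, BIO + u)$$

$$pat = \alpha + \beta_1 \log\left(\frac{RD}{SIZE}\right) + \beta_2 \, size + \beta_3 \, SPE + \beta_4 \, BIO + u$$

. reg lpat lrdi lassets spe biotech

------------------------------------------------------------------------------
        lpat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        lrdi |   .4924169   .0804249     6.12   0.000     .3343379     .650496
     lassets |   .5558656   .0417126    13.33   0.000     .4738775    .6378537
         spe |   .4212942   .1446661     2.91   0.004      .136946    .7056423
     biotech |   1.657062   .1684813     9.84   0.000     1.325904     1.98822
       _cons |  -5.464644   .6164752    -8.86   0.000    -6.676356   -4.252932
------------------------------------------------------------------------------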
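The biotech coefficient in this output is far from small, so the 100 × β shortcut for log-dependent-variable models is badly off and the exact formula 100 × (e^β - 1) should be used. The sketch below compares the two, using the coefficient copied from the output and, for contrast, a hypothetical small coefficient of 0.05.

```stata
* Coefficient copied from the output; b_small is a hypothetical value for contrast
scalar b_bio   = 1.657062
scalar b_small = 0.05

display "biotech, approximation: " %6.1f (100 * b_bio)               " percent"
display "biotech, exact:         " %6.1f (100 * (exp(b_bio) - 1))    " percent"
display "small b, approximation: " %6.2f (100 * b_small)             " percent"
display "small b, exact:         " %6.2f (100 * (exp(b_small) - 1))  " percent"
```

Read this way, biotech firms file roughly 424 percent more patents than comparable pharma firms, that is a bit more than five times as many, rather than the 166 percent the shortcut would suggest.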


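[Figure: ln(PAT) plotted against size. Pharma line: \hat\alpha + \hat\beta_2 size. Biotech line: (\hat\alpha + \hat\beta_4) + \hat\beta_2 size. Both lines have slope \hat\beta_2; the vertical gap between them is \hat\beta_4.]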


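Application 5: Interacting variables

$$PAT = f(RD, SIZE, SPE, BIO)$$

$$PAT = A \cdot \left(\frac{RD}{SIZE}\right)^{\beta_1} \cdot SIZE^{\beta_2 + \beta_5 BIO} \cdot \exp(\beta_3 \, SPE + \beta_4 \, BIO + u)$$

$$pat = \alpha + \beta_1 \log\left(\frac{RD}{SIZE}\right) + \beta_2 \, size + \beta_3 \, SPE + \beta_4 \, BIO + \beta_5 \, BIO \times size + u$$

. reg lpat lrdi lassets spe biotech bio_assets

------------------------------------------------------------------------------
        lpat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        lrdi |   .4742035   .0808846     5.86   0.000     .3152199    .6331871
     lassets |    .619805   .0551395    11.24   0.000     .5114249    .7281852
         spe |   .4131693   .1443802     2.86   0.004     .1293812    .6969573
     biotech |   3.592252   1.107872     3.24   0.001      1.41466    5.769843
  bio_assets |  -.1435349    .081221    -1.77   0.078    -.3031798    .0161099
       _cons |  -6.482948   .8427254    -7.69   0.000    -8.139376   -4.826519
------------------------------------------------------------------------------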
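With an interaction term, the slope on size differs by group: it is the lassets coefficient for pharma firms and the sum of the lassets and bio_assets coefficients for biotech firms, here .619805 - .1435349 = .4762701. The sketch below shows the mechanics on the auto dataset, an assumed stand-in in which foreign plays the role of the dummy and weight the role of size; the name for_weight is arbitrary.

```stata
* Sketch only: auto is a stand-in dataset; for_weight is the interaction term
sysuse auto, clear
generate for_weight = foreign * weight

regress price weight foreign for_weight

* Slope for the group with dummy = 1, with its standard error
lincom weight + for_weight
scalar slope_foreign = r(estimate)

* Same slope from a regression restricted to that group
quietly regress price weight if foreign == 1
display "lincom: " slope_foreign "   subsample regression: " _b[weight]
```

Because the model lets both the intercept and the slope differ by group, the combined slope coincides with the one from the subsample regression; lincom adds the standard error needed to test it.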

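[Figure: ln(PAT) plotted against size. Pharma line: \hat\alpha + \hat\beta_2 size, with slope \hat\beta_2. Biotech line: (\hat\alpha + \hat\beta_4) + (\hat\beta_2 + \hat\beta_5) size, with slope \hat\beta_2 + \hat\beta_5. The intercepts differ by \hat\beta_4 and the slopes by \hat\beta_5.]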


Specification Tests


Specification Tests for Multiple OLS

$$PAT = f(RD, SIZE, SPE)$$

$$pat = \alpha + \beta_1 \log\left(\frac{RD}{SIZE}\right) + \beta_2 \, size + \beta_3 \, SPE + u$$


Critical probability α such that: Pr(accept Ha | H0 is true) = α

Student t test: concerning the significance of one parameter

Fisher F test: concerning the significance of several parameters simultaneously (Wald test)

Non linear restriction test: Testing for non-linear relationship between parameters

Specification Tests for Multiple OLS: testing linear combinations of parameters

Concerning one parameter only

H0 : lassets = 0.30
test lassets = 0.30

Test on several parameters

H0 : lassets = 0.30 and lrdi = 0.70
test (lassets = 0.3) (lrdi = 0.7)

H0 : lrdi = 2 * lassets
test lrdi = 2 * lassets

H0 : lrdi + lassets = 1
test lrdi + lassets = 1
lincom _b[lrdi] + _b[lassets] - 1

Specification Tests for Multiple OLS: testing non-linear combinations of parameters

Test on several parameters

H0 : lassets * lrdi = 0.2
testnl _b[lrdi] * _b[lassets] = 0.2
nlcom _b[lrdi] * _b[lassets] - 0.2
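The testing commands can be tried end to end on the auto dataset, used here as an assumed stand-in; the hypothesized values (5, -100, -400) are arbitrary and carry no meaning. The first lines also show the link between the Student and Fisher tests: for a single restriction, the F statistic is the square of the t statistic.

```stata
* Sketch only: auto is a stand-in dataset; hypothesized values are arbitrary
sysuse auto, clear
regress price weight length

* Student t test on one parameter; for a single restriction F = t^2
test weight = 0
scalar F_single = r(F)
scalar t_weight = _b[weight] / _se[weight]
display "F = " F_single "   t^2 = " t_weight^2

* Fisher F (Wald) test on several parameters simultaneously
test (weight = 5) (length = -100)

* Linear combination: test gives the p-value, lincom the estimate and its CI
test weight + length = 0
lincom weight + length

* Non-linear restriction: testnl tests it, nlcom estimates the expression
testnl _b[weight] * _b[length] = -400
nlcom _b[weight] * _b[length]
```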

Review of Assumptions

OLS assumption                 Consistency when violated   Efficiency when violated                   Test
-----------------------------------------------------------------------------------------------------------------------------------
OLS1 Linearity                 -                           -                                          -
OLS2 Random Sampling           Biased β                    None                                       None. Redo sampling & estimation
OLS3 No perfect Collinearity   -                           -                                          -
OLS4 Zero Conditional Mean     Biased β                    Poorly estimated variance of β             Link test; Omitted Variable test
OLS5 Homoskedasticity          None                        Underestimated variance of β               Breusch-Pagan test
OLS6 Normality of error term   None                        Lack of reliability of the t test for β    Shapiro-Wilk test

Specification Tests for Multiple OLS: specification tests on the validity of assumptions

Hypothesis OLS5: Homoskedasticity of residuals

Rule of thumb using graphs
Stata instruction: rvfplot

White test
Stata instruction: estat imtest

Breusch-Pagan test
Stata instruction: estat hettest
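A minimal sketch of the two formal tests, run on the auto dataset as an assumed stand-in. In both cases the null hypothesis is constant variance, so a small p-value speaks against OLS5.

```stata
* Sketch only: auto is a stand-in dataset
sysuse auto, clear
regress price weight length

* Breusch-Pagan / Cook-Weisberg test, based on the fitted values
estat hettest
scalar p_bp = r(p)

* Cameron & Trivedi decomposition; the white option adds White's test
estat imtest, white

display "Breusch-Pagan p-value: " %6.4f p_bp
```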


Hypothesis OLS5 : Homoskedasticity of residuals: rvfplot

[Figure: rvfplot output, residuals (vertical axis, about -2 to 3) against fitted values (horizontal axis, about -1 to 3).]


Hypothesis OLS5 : Homoskedasticity of residuals: estat imtest

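. imtest

Cameron & Trivedi's decomposition of IM-test

              Source |       chi2     df      p
---------------------+-----------------------------
  Heteroskedasticity |      21.74      9    0.0097
            Skewness |       3.05      3    0.3840
            Kurtosis |      15.55      1    0.0001
---------------------+-----------------------------
               Total |      40.34     13    0.0001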


Hypothesis OLS5 : Homoskedasticity of residuals: estat hettest

. hettest

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
         Ho: Constant variance
         Variables: fitted values of lpat

         chi2(1)      =     2.83
         Prob > chi2  =   0.0927

Specification tests on the validity of assumptions

Hypothesis OLS6: Normality of residuals

Rule of thumb using graphs

Stata instruction:
predict res, residual
kdensity res, normal

Formally using the Shapiro-Wilk test

Stata instruction:
predict res, residual
swilk res
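A minimal sketch of the normality check on the auto dataset, an assumed stand-in. The null hypothesis of the Shapiro-Wilk test is that the residuals are normally distributed, so a small p-value speaks against OLS6.

```stata
* Sketch only: auto is a stand-in dataset
sysuse auto, clear
quietly regress price weight length

* Residuals of the last regression
predict res, residuals

* Graphical check, as on the slide (opens a graph window):
* kdensity res, normal

* Formal test; W close to 1 is consistent with normality
swilk res
display "W = " %7.5f r(W) "   p-value = " %7.5f r(p)
```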



Hypothesis OLS6 : Normality of residuals: kdensity


[Figure: kernel density estimate of the residuals (horizontal axis, -4 to 4; density from 0 to .4) overlaid with a normal density; kernel = epanechnikov, bandwidth = 0.2971.]


Hypothesis OLS6 : Normality of residuals


. swilk res

                   Shapiro-Wilk W test for normal data

    Variable |    Obs       W           V         z       Prob>z
-------------+--------------------------------------------------
         res |    431    0.98688      3.862     3.226    0.00063


There are no omitted variables (OLS4 on endogeneity)

Link test: Stata instruction linktest

Regress the DV on the prediction and its squared value

Variable _hat must be significant, but not _hatsq

Ramsey RESET test: Stata instruction ovtest

Regress the DV on the regressors plus powers (up to the 4th) of the fitted values of the LHS variable (ovtest)

Or regress the DV on the regressors plus powers (up to the 4th) of the RHS variables (ovtest, rhs)
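A minimal sketch of both specification tests on the auto dataset, an assumed stand-in. The regression is re-run before the RESET test because linktest fits its own auxiliary regression in between.

```stata
* Sketch only: auto is a stand-in dataset
sysuse auto, clear

* Link test: _hat should be significant, _hatsq should not
quietly regress price weight length
linktest

* Ramsey RESET test using powers of the fitted values
quietly regress price weight length
estat ovtest
scalar p_reset = r(p)

* Variant using powers of the right-hand-side variables
estat ovtest, rhs

display "RESET p-value (fitted values): " %6.4f p_reset
```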



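There are no omitted variables (OLS4 on endogeneity): linktest

. quietly: regress lpat lrdi lassets spe

. linktest

------------------------------------------------------------------------------
        lpat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        _hat |   .2055605   .3574387     0.58   0.566    -.4969932    .9081141
      _hatsq |   .2707472   .1161699     2.33   0.020     .0424126    .4990817
       _cons |   .4943074   .2887035     1.71   0.088    -.0731457    1.061761
------------------------------------------------------------------------------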


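There are no omitted variables (OLS4 on endogeneity): ovtest

. quietly: regress lpat lrdi lassets spe

. ovtest

Ramsey RESET test using powers of the fitted values of lpat
       Ho:  model has no omitted variables
                 F(3, 424) =      2.34
                  Prob > F =      0.0732

$$F_{k-1,\; n-m-k} = \frac{(R_1^2 - R_0^2)/(k-1)}{(1 - R_1^2)/(n-m-k)}$$

Here R_0² and R_1² presumably denote the R-squared of the original and of the augmented regression. With n = 431 observations, m = 3 regressors and k = 4, the degrees of freedom are (3, 424), as in the output.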

Exercise

1. Regress the following model:

$$pat = \alpha + \beta_1 \log\left(\frac{RD}{SIZE}\right) + \beta_2 \, size + \beta_3 \, SPE + \beta_4 \, BIO + u$$

2. Assuming OLS1-3 to be correct, test OLS4-6 and conclude:

    1. OLS4 on specification, using linktest and ovtest

    2. OLS5 on homoskedasticity, using imtest and hettest

    3. OLS6 on normality of errors, using kdensity and the swilk test
