Linear Regression with Multiple Regressors


Page 1: Linear Regression with Multiple Regressors

Linear Regression with Multiple Regressors

Ming-Ching Luoh

2021.3.9.

Page 2: Linear Regression with Multiple Regressors

Outline

Omitted Variable Bias

The Multiple Regression Model

The OLS Estimator

Measures of Fit

Least Squares Assumptions

Sampling Distribution of OLS Estimators

Multicollinearity


Page 3: Linear Regression with Multiple Regressors

Omitted Variable Bias

OLS estimate of the Test Score/STR relation:

$$\widehat{\text{TestScore}} = \underset{(10.4)}{698.9} - \underset{(0.52)}{2.28} \times \text{STR}, \qquad R^2 = 0.05$$

(standard errors in parentheses)

● Is this a credible estimate of the causal effect on test scores of a change in the student-teacher ratio?

No, there are omitted confounding factors that bias the OLS estimator. STR could be "picking up" the effect of these confounding factors.

Page 4: Linear Regression with Multiple Regressors

● The bias in the OLS estimator that occurs as a result of an omitted factor is called omitted variable bias.

● For omitted variable bias to occur, the omitted factor "Z" must be

1. a determinant of Y, and
2. correlated with the regressor X.

Both conditions must hold for the omission of Z to result in omitted variable bias.

Page 5: Linear Regression with Multiple Regressors


In the test score example:

1. English language ability (Z) (measured as the percentage of students who are still learning English) plausibly affects standardized test scores: Z is a determinant of Y.

2. Immigrant communities tend to be less affluent and thus have smaller school budgets and higher STR: Z is correlated with X.

Accordingly, $\hat{\beta}_1$ is biased.

● What is the direction of the bias?

● What does common sense suggest?

● If common sense does not give a clear answer, there is a formula.

Page 6: Linear Regression with Multiple Regressors

A formula for omitted variable bias

Recall the equation

$$\hat{\beta}_1 = \beta_1 + \frac{\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})u_i}{\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2} \;\xrightarrow{p}\; \beta_1 + \rho_{Xu}\frac{\sigma_u}{\sigma_X}$$

because

$$\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2 \xrightarrow{p} \sigma_X^2 \qquad \text{and} \qquad \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})u_i \xrightarrow{p} \mathrm{Cov}(X_i, u_i) = \rho_{Xu}\sigma_X\sigma_u$$

Page 7: Linear Regression with Multiple Regressors

Omitted variable bias formula:

$$\hat{\beta}_1 \xrightarrow{p} \beta_1 + \rho_{Xu}\frac{\sigma_u}{\sigma_X}$$

● If an omitted factor Z is both

● a determinant of Y (that is, it is contained in u), and
● correlated with X,

then $\rho_{Xu} \neq 0$ and the OLS estimator $\hat{\beta}_1$ is not consistent.

● The larger is $|\rho_{Xu}|$, the larger is the bias.

● The direction of the bias in $\hat{\beta}_1$ depends on whether X and u are positively or negatively correlated.

Page 8: Linear Regression with Multiple Regressors

To be more specific, suppose that the true model is

$$Y_i = \beta_0 + \beta_1 X_i + \beta_2 Z_i + u_i, \qquad \mathrm{Cov}(X_i, u_i) = 0$$

The estimated model when Z is omitted is

$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i, \qquad \epsilon_i \equiv \beta_2 Z_i + u_i$$

The covariance between $X_i$ and the error term $\epsilon_i$ is (using $\mathrm{Cov}(X_i, u_i) = 0$)

$$\mathrm{Cov}(X_i, \epsilon_i) = \mathrm{Cov}(X_i, \beta_2 Z_i + u_i) = \beta_2 \mathrm{Cov}(X_i, Z_i)$$

Page 9: Linear Regression with Multiple Regressors

Therefore,

$$\hat{\beta}_1 \xrightarrow{p} \beta_1 + \frac{\mathrm{Cov}(X_i, \epsilon_i)}{\mathrm{Var}(X_i)} = \beta_1 + \beta_2\frac{\mathrm{Cov}(X_i, Z_i)}{\mathrm{Var}(X_i)}$$

Let PctEL (percentage of English learners) be Z. Since

● $\beta_2 < 0$: the effect of PctEL on TestScore is negative, and

● $\mathrm{Cov}(X_i, Z_i) > 0$: districts with high STR also have high PctEL,

we have $\hat{\beta}_1 < \beta_1$, i.e. $\hat{\beta}_1$ is more negative than $\beta_1$. The (negative) effect of STR on test scores is overestimated.
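To see the formula at work, here is a minimal simulation sketch (Python/NumPy; the coefficient values and the X-Z dependence are invented for illustration, loosely mimicking the STR/PctEL story). The slope of the short regression settles near $\beta_1 + \beta_2\,\mathrm{Cov}(X,Z)/\mathrm{Var}(X)$ rather than near $\beta_1$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical true model with beta2 < 0 and Cov(X, Z) > 0
beta0, beta1, beta2 = 700.0, -1.0, -0.65
X = rng.normal(20.0, 2.0, n)                     # plays the role of STR
Z = 2.0 * (X - 20.0) + rng.normal(0.0, 5.0, n)   # plays the role of PctEL
Y = beta0 + beta1 * X + beta2 * Z + rng.normal(0.0, 5.0, n)

# Short regression of Y on X alone: slope = Cov(X, Y) / Var(X)
b1_short = np.cov(X, Y)[0, 1] / np.var(X, ddof=1)

# Probability limit predicted by the omitted variable bias formula
b1_plim = beta1 + beta2 * np.cov(X, Z)[0, 1] / np.var(X, ddof=1)

print(b1_short, b1_plim)  # both near -2.3: more negative than beta1 = -1.0
```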

Page 10: Linear Regression with Multiple Regressors

Is this actually going on in the CA data?

● Districts with fewer English Learners have higher test scores.
● Districts with lower percent EL have smaller classes.
● Among districts with comparable PctEL, the effect of class size is small. (Recall the overall "test score gap" = 7.4.)

Page 11: Linear Regression with Multiple Regressors

Causality and regression analysis

● The test score/STR/fraction English Learners example shows that, if an omitted variable satisfies the two conditions for omitted variable bias, then the OLS estimator in the regression omitting that variable is biased and inconsistent. So, even if n is large, $\hat{\beta}_1$ will not be close to $\beta_1$.

● This raises a deeper question: how do we define $\beta_1$? That is, what precisely do we want to estimate when we run a regression?

Page 12: Linear Regression with Multiple Regressors

What precisely do we want to estimate when we run a regression? There are (at least) three possible answers to this question:

1. We want to estimate the slope of a line through a scatterplot as a simple summary of the data, to which we attach no substantive meaning.

This can be useful at times, but isn't very interesting intellectually and isn't what this course is about.

Page 13: Linear Regression with Multiple Regressors

2. We want to make forecasts, or predictions, of the value of Y for an entity not in the data set, for which we know the value of X.

Forecasting is an important job for economists, and excellent forecasts are possible using regression methods without needing to know causal effects.

Page 14: Linear Regression with Multiple Regressors

3. We want to estimate the causal effect on Y of a change in X. This is why we are interested in the class size effect. Suppose the school board decided to cut class size by 2 students per class. What would be the effect on test scores? This is a causal question (what is the causal effect on test scores of STR?), so we need to estimate this causal effect. Except when we discuss forecasting, the aim of this course is the estimation of causal effects using regression methods.

Page 15: Linear Regression with Multiple Regressors

What, precisely, is a causal effect?

● "Causality" is a complex concept! The common-sense definition of causality is not precise enough for our purposes.

● In this course, we take a practical approach to defining causality:

A causal effect is defined to be the effect measured in an ideal randomized controlled experiment.

Page 16: Linear Regression with Multiple Regressors

Ideal Randomized Controlled Experiment

● Ideal: subjects all follow the treatment protocol - perfect compliance, no errors in reporting, etc.

● Randomized: subjects from the population of interest are randomly assigned to a treatment or control group (so there are no confounding factors).

● Controlled: having a control group permits measuring the differential effect of the treatment.

● Experiment: the treatment is assigned as part of the experiment: the subjects have no choice, which means that there is no "reverse causality" in which subjects choose the treatment they think will work best.

Page 17: Linear Regression with Multiple Regressors

Back to the case of class size

What is an ideal randomized controlled experiment for measuring the effect on Test Score of reducing STR?

● In that experiment, students would be randomly assigned to classes, which would have different sizes.

● Because they are randomly assigned, all student characteristics (and thus $u_i$) would be distributed independently of $STR_i$.

● Thus, $E(u_i \mid STR_i) = 0$ - that is, LSA #1 holds in a randomized controlled experiment.

Page 18: Linear Regression with Multiple Regressors

How does our regression analysis of observational data differ from this ideal?

● The treatment is not randomly assigned.

● Consider PctEL - percent English learners - in the district. It plausibly satisfies the two criteria for omitted variable bias: Z = PctEL is

● a determinant of Y; and
● correlated with the regressor X.

● The "control" and "treatment" groups differ in a systematic way: corr(STR, PctEL) ≠ 0.

Page 19: Linear Regression with Multiple Regressors


Examples of omitted variable bias:

● Mozart Effect? Listening to Mozart for 10-15 minutes could raise IQ by 8 or 9 points. (Nature 1993)

● Students who take optional music or arts courses in high school have higher English and math test scores than those who don't.

Page 20: Linear Regression with Multiple Regressors

Three ways to overcome omitted variable bias.

● Run a randomized controlled experiment in which treatment (STR) is randomly assigned: then PctEL is still a determinant of Test Score, but PctEL is uncorrelated with STR. (But this is unrealistic in practice.)

● Adopt the "cross tabulation" approach, with finer gradations of STR and PctEL. (But soon we will run out of data, and what about other determinants like family income and parental education?)

● Use a regression in which the omitted variable (PctEL) is no longer omitted: include PctEL as an additional regressor in a multiple regression.

Page 21: Linear Regression with Multiple Regressors

The Multiple Regression Model

Consider the case of two regressors:

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i, \qquad i = 1, \dots, n$$

● X1, X2 are the two independent variables (regressors).

● (Yi , X1i , X2i) denote the ith observation on Y , X1, and X2.

● β0 = unknown population intercept.

● β1 = effect on Y of a change in X1, holding X2 constant.

● β2 = effect on Y of a change in X2, holding X1 constant.

● ui = "error term" (omitted factors).


Page 22: Linear Regression with Multiple Regressors

Interpretation of multiple regression coefficients

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i, \qquad i = 1, \dots, n$$

Consider changing X1 by ∆X1 while holding X2 constant. The population regression line before the change:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2$$

The population regression line after the change:

$$Y + \Delta Y = \beta_0 + \beta_1 (X_1 + \Delta X_1) + \beta_2 X_2$$
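Subtracting the first line from the second makes the interpretation explicit:

$$\Delta Y = \beta_1 \Delta X_1$$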

Page 23: Linear Regression with Multiple Regressors

That is,

$$\beta_1 = \frac{\Delta Y}{\Delta X_1}, \quad \text{holding } X_2 \text{ constant}$$

also,

$$\beta_2 = \frac{\Delta Y}{\Delta X_2}, \quad \text{holding } X_1 \text{ constant}$$

and

$$\beta_0 = \text{predicted value of } Y \text{ when } X_1 = X_2 = 0$$

Page 24: Linear Regression with Multiple Regressors

The OLS Estimator in Multiple Regression

With two regressors, the OLS estimator solves

$$\min_{\beta_0, \beta_1, \beta_2} \sum_{i=1}^{n} \big( Y_i - (\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i}) \big)^2$$

● The OLS estimator minimizes the sum of squared differences between the actual values of $Y_i$ and the prediction (predicted value) based on the estimated line.

● This minimization problem yields the OLS estimators of $\beta_0$, $\beta_1$ and $\beta_2$. (A NumPy sketch follows below.)
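As a concrete illustration, here is a minimal NumPy sketch of this minimization (the helper name ols and the setup are mine, not from the slides). np.linalg.lstsq solves exactly this least-squares problem:

```python
import numpy as np

def ols(y, *regressors):
    """Least-squares fit of y on a constant plus the given regressors.

    Returns the coefficient estimates (b0, b1, ..., bk) and the residuals.
    """
    X = np.column_stack([np.ones(len(y)), *regressors])
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # minimizes the sum of squared errors
    return beta_hat, y - X @ beta_hat
```

With two regressors, ols(Y, X1, X2) returns the estimates of β0, β1, and β2, plus the residuals used in the fit measures below.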

Page 25: Linear Regression with Multiple Regressors

Example: The California test score data

Regression of TestScore against STR:

$$\widehat{\text{TestScore}} = 698.9 - 2.28 \times \text{STR}$$

Now include percent English Learners in the district (PctEL):

$$\widehat{\text{TestScore}} = 696.0 - 1.10 \times \text{STR} - 0.65 \times \text{PctEL}$$

● What happens to the coefficient on STR?

● Why? (Note: corr(STR, PctEL) = 0.19.) (A statsmodels sketch follows below.)
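One might reproduce this comparison with statsmodels along the following lines, assuming the California data sit in a pandas DataFrame caschool with columns testscr, str_ratio, and el_pct (the DataFrame and its column names are assumptions, not from the slides):

```python
import statsmodels.formula.api as smf

# caschool: assumed DataFrame with test scores, student-teacher ratios, percent EL
short = smf.ols("testscr ~ str_ratio", data=caschool).fit()
full = smf.ols("testscr ~ str_ratio + el_pct", data=caschool).fit()

# The coefficient on str_ratio should fall in magnitude, from about -2.28 to -1.10
print(short.params["str_ratio"], full.params["str_ratio"])
```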


Page 27: Linear Regression with Multiple Regressors


Measures of Fit in Multiple Regression

Actual = predicted + residual: $Y_i = \hat{Y}_i + \hat{u}_i$

● SER (standard error of regression) = std. error of $\hat{u}_i$ (with d.f. correction)

● RMSE = std. error of $\hat{u}_i$ (without d.f. correction)

● $R^2$ = fraction of the sample variance of Y explained by X.

● $\bar{R}^2$ = "adjusted $R^2$"

Page 28: Linear Regression with Multiple Regressors

SER and RMSE

● As in regression with a single regressor, the SER and the RMSE are measures of the spread of the Y's around the regression line.

$$\text{SER} = \sqrt{\frac{1}{n - k - 1}\sum_{i=1}^{n}\hat{u}_i^2}$$

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\hat{u}_i^2}$$
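A short sketch of these two formulas (function and argument names are mine), given the residuals from a regression with k regressors:

```python
import numpy as np

def ser_and_rmse(residuals, k):
    """SER (with d.f. correction) and RMSE (without) from OLS residuals."""
    n = len(residuals)
    ssr = np.sum(residuals ** 2)
    ser = np.sqrt(ssr / (n - k - 1))  # degrees-of-freedom correction
    rmse = np.sqrt(ssr / n)           # no correction
    return ser, rmse
```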

Page 29: Linear Regression with Multiple Regressors

$R^2$ and $\bar{R}^2$

● The $R^2$ is the fraction of the sample variance of Y explained - same definition as in regression with a single regressor:

$$R^2 = \frac{ESS}{TSS} = 1 - \frac{SSR}{TSS}$$

where $ESS = \sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2$, $SSR = \sum_{i=1}^{n}\hat{u}_i^2$, $TSS = \sum_{i=1}^{n}(Y_i - \bar{Y})^2$.

● The $R^2$ always increases when you add another regressor. (Why? Adding a regressor can only lower the SSR, since the minimization is over a larger set of coefficients.)

Page 30: Linear Regression with Multiple Regressors

● The $\bar{R}^2$ (the "adjusted $R^2$") corrects this problem by "penalizing" you for including another regressor, so the $\bar{R}^2$ does not necessarily increase.

Adjusted $R^2$:

$$\bar{R}^2 = 1 - \left(\frac{n-1}{n-k-1}\right)\frac{SSR}{TSS}$$

● Facts about $\bar{R}^2$:

● $\frac{n-1}{n-k-1} > 1$, therefore $\bar{R}^2 < R^2$.

● Adding a regressor has two opposite effects. On the one hand, the SSR falls, which increases $\bar{R}^2$. On the other hand, the factor $\frac{n-1}{n-k-1}$ increases, which reduces $\bar{R}^2$.

● $\bar{R}^2$ can be negative.

● When n is large, $R^2$ and $\bar{R}^2$ are very close. (A short code sketch follows.)
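Tying the last two slides together, a minimal sketch (names are mine) that computes both measures from a fitted model's residuals:

```python
import numpy as np

def r2_and_adjusted_r2(y, residuals, k):
    """R^2 and adjusted R^2 for a fit with k regressors plus an intercept."""
    n = len(y)
    ssr = np.sum(residuals ** 2)
    tss = np.sum((y - np.mean(y)) ** 2)
    r2 = 1.0 - ssr / tss
    adj_r2 = 1.0 - (n - 1) / (n - k - 1) * ssr / tss
    return r2, adj_r2
```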

Page 31: Linear Regression with Multiple Regressors

Using the $R^2$ and $\bar{R}^2$

● The $R^2$ is useful because it quantifies the extent to which the regressors account for, or explain, the variation in the dependent variable.

● Nevertheless, heavy reliance on the $R^2$ (or $\bar{R}^2$) can be a trap. In applications, "maximize the $R^2$" is rarely the answer to any economically or statistically meaningful question.

● The decision about whether to include a variable in a multiple regression should be based on whether including that variable allows us to better estimate the causal effect of interest.

Page 32: Linear Regression with Multiple Regressors

The Least Squares Assumptions in Multiple Regression

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + u_i, \qquad i = 1, \dots, n$$

1. The conditional distribution of $u_i$ given the X's has a mean of zero.

2. $(X_{1i}, \dots, X_{ki}, Y_i)$, $i = 1, \dots, n$, are i.i.d.

3. Large outliers are unlikely.

4. There is no perfect multicollinearity.

Page 33: Linear Regression with Multiple Regressors

Assumption #1: The conditional mean of u given the included X's is zero.

● This has the same interpretation as in regression with a single regressor.

● If an omitted variable (1) belongs in the equation (so is in u) and (2) is correlated with an included X, then this condition fails.

● Failure of this condition leads to omitted variable bias.

● The solution - if possible - is to include the omitted variable in the regression.

Page 34: Linear Regression with Multiple Regressors

Assumption #2: $(X_{1i}, \dots, X_{ki}, Y_i)$, $i = 1, \dots, n$, are i.i.d.

This is satisfied automatically if the data are collected by simple random sampling.

Assumption #3: large outliers are rare.

$X_1, \dots, X_k$ and $u$ have finite fourth moments: $E(X_{1i}^4) < \infty, \dots, E(X_{ki}^4) < \infty$, $E(u_i^4) < \infty$.

This is the same assumption as we had before for a single regressor. As in the case of a single regressor, OLS can be sensitive to large outliers, so you need to check your data (scatterplots!) to make sure there are no crazy values (typos or coding errors).
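For example, a quick pre-regression outlier check might look like this (assuming a pandas DataFrame df with columns str_ratio and testscr; the names are assumptions):

```python
import matplotlib.pyplot as plt

# Eyeball each regressor against the dependent variable for crazy values
df.plot.scatter(x="str_ratio", y="testscr")
plt.show()
```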

Page 35: Linear Regression with Multiple Regressors

Assumption #4: There is no perfect multicollinearity.

The regressors are said to be perfectly multicollinear if one of the regressors is an exact linear function of the other regressors. Example: suppose you accidentally include STR twice.

Page 36: Linear Regression with Multiple Regressors

Perfect multicollinearity is when one of the regressors is an exact linear function of the other regressors.

● In the previous regression, β1 is the effect on TestScore of a unit change in STR, holding STR constant. (???) (A numerical sketch of this breakdown follows below.)

● Second example: regress TestScore on a constant, D, and B, where $D_i = 1$ if $STR \le 20$, $= 0$ otherwise; $B_i = 1$ if $STR > 20$, $= 0$ otherwise. So $B_i = 1 - D_i$ and there is perfect multicollinearity.

● Perfect multicollinearity usually reflects a mistake in the definition of the regressors.
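A small numerical sketch of the first example (all values invented): with STR entered twice, the regressor matrix loses full column rank, so the OLS normal equations have no unique solution.

```python
import numpy as np

rng = np.random.default_rng(1)
STR = rng.normal(20.0, 2.0, 50)

# Constant, STR, and STR again: the third column duplicates the second
X = np.column_stack([np.ones(50), STR, STR])
print(np.linalg.matrix_rank(X))  # 2, not 3: perfect multicollinearity

# X'X is singular, so solving the normal equations raises LinAlgError:
# np.linalg.solve(X.T @ X, X.T @ y)
```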

Page 37: Linear Regression with Multiple Regressors

The Distribution of the OLS Estimators

Under the four Least Squares Assumptions,

● The exact (finite sample) distribution of $\hat{\beta}_1$ has mean $\beta_1$, and $\mathrm{Var}(\hat{\beta}_1)$ is inversely proportional to n. So too for $\hat{\beta}_2$.

● Other than its mean and variance, the exact distribution of $\hat{\beta}_1$ is complicated.

● $\hat{\beta}_1$ is consistent: $\hat{\beta}_1 \xrightarrow{p} \beta_1$. (law of large numbers)

● $\dfrac{\hat{\beta}_1 - E(\hat{\beta}_1)}{\sqrt{\mathrm{Var}(\hat{\beta}_1)}}$ is approximately distributed N(0, 1). (CLT)

● So too for $\hat{\beta}_2, \dots, \hat{\beta}_k$. (A small Monte Carlo sketch follows.)
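A minimal Monte Carlo sketch of these claims (the design and parameter values are mine): draw many samples, compute $\hat{\beta}_1$ in each, and inspect the mean and spread of the draws.

```python
import numpy as np

rng = np.random.default_rng(4)

def beta1_hat(n):
    """One OLS slope estimate from a fresh sample of size n (true slope = 2)."""
    x = rng.normal(0.0, 1.0, n)
    y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, n)
    X = np.column_stack([np.ones(n), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

for n in (25, 100, 400):
    draws = np.array([beta1_hat(n) for _ in range(5000)])
    # The mean stays near 2, and the spread shrinks roughly like 1/sqrt(n)
    print(n, round(draws.mean(), 3), round(draws.std(), 3))
```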

Page 38: Linear Regression with Multiple Regressors

Multicollinearity

Some more examples of perfect multicollinearity:

● The example from earlier: we include STR twice.

● Second example: regress TestScore on a constant, D, and B, where $D_i = 1$ if $STR \le 20$, $= 0$ otherwise; $B_i = 1$ if $STR > 20$, $= 0$ otherwise, so $B_i = 1 - D_i$ and there is perfect multicollinearity.

● Would there still be perfect multicollinearity if the intercept (constant term) were somehow excluded (that is, omitted or suppressed) from this regression?

Page 39: Linear Regression with Multiple Regressors

The dummy variable trap

● Suppose you have a set of multiple binary (dummy) variables, which are mutually exclusive and exhaustive - that is, there are multiple categories and every observation falls in one and only one category.

● If you include all these dummy variables and a constant, you will have perfect multicollinearity - this is sometimes called the dummy variable trap.

● Why is there perfect multicollinearity? (See below.)
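The answer: because the categories are mutually exclusive and exhaustive, the dummies sum to one for every observation,

$$D_{1i} + D_{2i} + \cdots + D_{Gi} = 1 = X_{0i} \quad \text{for all } i,$$

so the dummy columns add up exactly to the constant regressor: one regressor is an exact linear function of the others.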

Page 40: Linear Regression with Multiple Regressors


Solutions to the dummy variable trap:

(1) Omit one of the groups, or

(2) Omit the intercept term.

● What are the implications of solutions (1) and (2) for the interpretation of the coefficients?

Page 41: Linear Regression with Multiple Regressors

● Perfect multicollinearity usually reflects a mistake in the definition of the regressors, or an oddity in the data.

● If you have perfect multicollinearity, your statistical software will let you know - either by crashing, giving an error message, or "dropping" one of the variables arbitrarily.

● The solution to perfect multicollinearity is to modify your list of regressors so that you no longer have perfect multicollinearity.

Page 42: Linear Regression with Multiple Regressors

Imperfect multicollinearity

● Imperfect and perfect multicollinearity are quite different despite the similarity of the names.

● Imperfect multicollinearity occurs when two or more regressors are very highly correlated.

● Why this term? If two regressors are very highly correlated, then their scatterplot will pretty much look like a straight line - they are collinear - but unless the correlation is exactly ±1, that collinearity is imperfect.

● Imperfect multicollinearity implies that one or more of the regression coefficients will be imprecisely estimated.

Page 43: Linear Regression with Multiple Regressors

● Intuition: the coefficient on X1 is the effect of X1 holding X2 constant; but if X1 and X2 are highly correlated, there is very little variation in X1 once X2 is held constant - so the data are pretty much uninformative about what happens when X1 changes but X2 doesn't, and the variance of the OLS estimator of the coefficient on X1 will be large.

● Imperfect multicollinearity (correctly) results in large standard errors for one or more of the OLS coefficients. (A simulation sketch follows.)

● Imperfect multicollinearity does not lead OLS estimators to be biased or inconsistent.
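A small simulation sketch of this intuition (the design and numbers are mine): as corr(X1, X2) rises toward 1, the standard error of the coefficient on X1 blows up, while the estimator remains centered on the truth.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200

def se_of_b1(rho):
    """Homoskedasticity-only SE of the coefficient on X1 when corr(X1, X2) = rho."""
    x1 = rng.normal(0.0, 1.0, n)
    x2 = rho * x1 + np.sqrt(1.0 - rho ** 2) * rng.normal(0.0, 1.0, n)
    y = 1.0 + 2.0 * x1 + 2.0 * x2 + rng.normal(0.0, 1.0, n)
    X = np.column_stack([np.ones(n), x1, x2])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    u = y - X @ b
    s2 = u @ u / (n - 3)  # residual variance with d.f. correction
    return np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])

for rho in (0.0, 0.90, 0.99):
    print(rho, se_of_b1(rho))  # the SE rises sharply as rho approaches 1
```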