
Multiple Linear Regression Model

• Until now, all we did was establish the relationship between 2 variables: 1 independent variable and 1 dependent variable

• But we would like to do better. Several independent variables together could be a better predictor, so we would like to establish a relationship between several independent variables and 1 dependent variable

$$X \rightarrow Y, \qquad Y = f(X)$$

$$Y = f(X_1, X_2, \ldots, X_n)$$

• Linear Regression estimates the “Line of Best Fit” (‘Line’ because we are talking about a Linear model)

Analysis of Relationship among several variables

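$$y_i = f(x_{1i}, x_{2i}, \ldots, x_{Ki})$$

$$\hat{y}_i = a + b_1 x_{1i} + b_2 x_{2i} + \cdots + b_K x_{Ki}$$

$$y_i = a + b_1 x_{1i} + b_2 x_{2i} + \cdots + b_K x_{Ki} + \varepsilon_i$$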

Analysis of Relationship among several variables

• bj for each Xj is the slope coefficient

– The slope coefficient measures how much the dependent variable, Y, changes when the independent variable, Xj, changes by one unit, holding all other independent variables constant.

• a is the intercept or constant for the model

– The intercept measures the value of Y if all X’s are 0

• The way to estimate this is through the process of ‘Least Squares’

• In practice, software programs are used to estimate the multiple regression model
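As a rough sketch of what such software does, the snippet below fits a two-variable model by least squares with NumPy. The data are made up for illustration, and the names `design`, `a`, `b1`, `b2` are assumptions rather than anything from the slides.

```python
import numpy as np

# Illustrative data (made up): 6 observations, 2 independent variables.
X = np.array([
    [1.0, 2.0],
    [2.0, 1.0],
    [3.0, 4.0],
    [4.0, 3.0],
    [5.0, 6.0],
    [6.0, 5.0],
])
# y is built exactly as y = 1 + 2*x1 + 0.5*x2, so the fit should
# recover a = 1, b1 = 2, b2 = 0.5.
y = 1.0 + 2.0 * X[:, 0] + 0.5 * X[:, 1]

# Prepend a column of ones so the first coefficient is the intercept a.
design = np.column_stack([np.ones(len(X)), X])

# Least squares: pick the coefficients that minimise the sum of squared residuals.
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
a, b1, b2 = coef
```

Because the example data contain no noise, the recovered coefficients match the ones used to build `y`; with real data the residuals would not be zero.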

Inference about Parameters

Parameter   Estimate    Standard Error   t Value   Pr>|t|
CONST1      -51.8861    27.1756          -1.91     0.0594
X1            0.02065    0.04028          0.51     0.6094
X2            0.47620    0.22749          2.09     0.0391
X3            0.07123    0.14718          0.48     0.6296
X4           -2.02110    1.10141         -1.84     0.0698
X5            0.00447    0.03138          0.14     0.8870
X6            3.79589    2.39372          1.59     0.1163
X7            0.26862    0.11720          2.29     0.0242
X8           -1.91116    1.06776         -1.79     0.0768
X9            3.26388    1.14130          2.86     0.0053
X10           3.64432    1.27539          2.86     0.0053

Assumptions of the Multiple Regression Model

• The relationship between the dependent variable, Y, and the independent variables, X1, X2, . . . , Xk, is linear.

• The independent variables (X1, X2, . . . , Xk) are not random. Also, no exact linear relation exists between two or more of the independent variables.

• The error term is normally distributed

• The expected value (mean) of the error term, conditioned on the independent variables, is 0

• The variance of the error term is the same for all observations

Inference about Model

• If the model is correctly specified, R² is an ideal measure of goodness of fit

• Adding a variable to a regression never decreases R² and almost always increases it (by construction)

• This fact can be exploited to get regressions with R² ~ 100% by adding variables, but this doesn’t mean that the model is any good

• Adjusted R² should therefore be reported
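The "by construction" point can be checked numerically. The sketch below uses synthetic data and a hypothetical helper `r_squared`; it adds five pure-noise variables to a one-variable regression and compares the two R² values.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of a least-squares fit of y on X plus an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(42)
n = 50
x = rng.normal(size=(n, 1))
y = 1.0 + 2.0 * x[:, 0] + rng.normal(size=n)
noise = rng.normal(size=(n, 5))  # unrelated to y by construction

r2_base = r_squared(x, y)
r2_padded = r_squared(np.column_stack([x, noise]), y)
```

`r2_padded` cannot be smaller than `r2_base`, even though the extra columns carry no information about `y`.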

Inference about Model

• Adjusted R2 is a measure of goodness of fit that accounts for additional explanatory variables.

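$$\bar{R}^2 = 1 - \frac{n-1}{n-k-1}\,(1 - R^2)$$

$$n - 1 \ge n - k - 1 \ \text{ for all } k \ge 0 \quad\Rightarrow\quad \frac{n-1}{n-k-1} \ge 1$$

$$\frac{n-1}{n-k-1}\,(1 - R^2) \ge 1 - R^2 \quad\Rightarrow\quad \bar{R}^2 \le R^2$$

where n is the number of observations and k is the number of independent variables.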
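A small sketch of the adjusted R² calculation, taking n as the number of observations and k as the number of independent variables. The function name and the example figures are illustrative assumptions.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 for n observations and k independent variables."""
    if n - k - 1 <= 0:
        raise ValueError("need more than k + 1 observations")
    return 1.0 - (n - 1) / (n - k - 1) * (1.0 - r2)

# Same R^2 of 0.90 on 101 observations, with 2 versus 10 variables.
few = adjusted_r2(0.90, n=101, k=2)
many = adjusted_r2(0.90, n=101, k=10)
```

With the same R², the model that uses more variables gets the lower adjusted R², and both values stay below 0.90.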


• Coefficients (b1, b2, ..., bk) are estimated with a confidence interval

• To know if a specific independent variable (xi) is influential in predicting the dependent variable (y), we test whether the corresponding coefficient (bi) is statistically different from 0 (i.e. we test the null hypothesis bi = 0).

• We do so by calculating the t-statistic for the coefficient

• If the t-stat is sufficiently large in absolute value, it indicates that bi is significantly different from 0, indicating that bi * xi plays a role in determining y
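As a check on the parameter table in these slides, each t value is the estimate divided by its standard error. The sketch below copies the table's figures and flags the coefficients whose reported p-value is below 0.05; the 5% cutoff is an assumption chosen for illustration.

```python
# (estimate, standard error, reported t value, reported p-value) from the table.
params = {
    "CONST1": (-51.8861, 27.1756, -1.91, 0.0594),
    "X1": (0.02065, 0.04028, 0.51, 0.6094),
    "X2": (0.47620, 0.22749, 2.09, 0.0391),
    "X3": (0.07123, 0.14718, 0.48, 0.6296),
    "X4": (-2.02110, 1.10141, -1.84, 0.0698),
    "X5": (0.00447, 0.03138, 0.14, 0.8870),
    "X6": (3.79589, 2.39372, 1.59, 0.1163),
    "X7": (0.26862, 0.11720, 2.29, 0.0242),
    "X8": (-1.91116, 1.06776, -1.79, 0.0768),
    "X9": (3.26388, 1.14130, 2.86, 0.0053),
    "X10": (3.64432, 1.27539, 2.86, 0.0053),
}

# t-statistic for the null hypothesis that the coefficient is 0.
t_stats = {name: est / se for name, (est, se, _, _) in params.items()}

# Coefficients significant at the (assumed) 5% level, using the reported p-values.
significant = [name for name, (_, _, _, p) in params.items() if p < 0.05]
```

Under that cutoff only X2, X7, X9 and X10 are significantly different from 0.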



Predicting the Dependent Variable

• To predict the value of a dependent variable using a multiple linear regression model, we follow these three steps:

– Obtain estimates of the regression parameters.

– Determine the assumed values of the independent variables.

– Compute the predicted value of the dependent variable from the estimated parameters and the assumed values.

• Estimate a model

• Assess the model’s predictive ability

• Assess the significance of each independent variable

• If they are satisfactory, we decide to use the existing model

• Else we re-estimate using different independent variables

Predictions

$$\hat{y}_i = a + b_1 x_{1i} + b_2 x_{2i} + \cdots + b_K x_{Ki}$$

• We then set the independent variables to the levels that we would like predictions for and derive the corresponding y

• This would be the predicted value of ‘y’ as a function of the changes to the existing values of the independent variables
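The prediction step can be sketched as a small function. The coefficient estimates and input levels below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def predict(intercept, slopes, x_values):
    """Predicted y = a + b1*x1 + ... + bK*xK for one set of assumed x values."""
    if len(slopes) != len(x_values):
        raise ValueError("one assumed value is needed per slope coefficient")
    return intercept + sum(b * x for b, x in zip(slopes, x_values))

# Hypothetical estimates (not from the slides): a = 1, b1 = 2, b2 = 0.5.
a_hat = 1.0
b_hat = [2.0, 0.5]

y_base = predict(a_hat, b_hat, [3.0, 4.0])  # 1 + 2*3 + 0.5*4
y_new = predict(a_hat, b_hat, [4.0, 4.0])   # x1 raised by one unit, x2 held constant
```

The difference `y_new - y_base` equals b1, which is the "holding all other independent variables constant" interpretation of a slope coefficient.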


Using Dummy Variables

• Sometimes our independent variable isn’t numeric

– E.g. establish the relationship between day of week and alcohol consumption

– Relationship between major and income

• More relevant examples from finance

– Industry effect on returns (do technology stocks earn higher returns on the same investments?)

– Do emerging markets provide greater returns for the same risks?


• A dummy variable represents a qualitative variable

• It takes on a value of 1 if a particular condition is true and 0 if that condition is false.

• In our examples, we would use dummies for the following

– Relationship between weekday and alcohol consumption

– Each weekday would have an associated dummy, set to 1 when the alcohol consumption recorded in y occurred on that day


Tequila shots   Mon   Tue   Wed   Thu   Fri   Sat   Sun
2               1     0     0     0     0     0     0
2               1     0     0     0     0     0     0
3               0     1     0     0     0     0     0
1               0     1     0     0     0     0     0
2               0     0     1     0     0     0     0
1               0     0     1     0     0     0     0
0               0     0     0     1     0     0     0
1               0     0     0     1     0     0     0
5               0     0     0     0     1     0     0
6               0     0     0     0     1     0     0
7               0     0     0     0     0     1     0
6               0     0     0     0     0     1     0
3               0     0     0     0     0     0     1
3               0     0     0     0     0     0     1
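A minimal sketch of how the dummy columns in the tequila table could be built and used, assuming NumPy. The regression here has no intercept: with all seven weekday dummies plus a constant, the regressors would have an exact linear relation, which the model assumptions rule out.

```python
import numpy as np

days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# (tequila shots, day) pairs taken from the table.
observations = [
    (2, "Mon"), (2, "Mon"), (3, "Tue"), (1, "Tue"), (2, "Wed"), (1, "Wed"),
    (0, "Thu"), (1, "Thu"), (5, "Fri"), (6, "Fri"), (7, "Sat"), (6, "Sat"),
    (3, "Sun"), (3, "Sun"),
]
y = np.array([shots for shots, _ in observations], dtype=float)

# One column per weekday: 1 if the observation falls on that day, else 0.
D = np.array([[1.0 if day == d else 0.0 for d in days] for _, day in observations])

# Least squares on the dummies alone (no intercept).
coef, *_ = np.linalg.lstsq(D, y, rcond=None)
day_effects = dict(zip(days, coef))
```

In this setup each coefficient is simply the average number of shots recorded on that day, for example 5.5 for Friday.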

Month-of-the-Year Effects on Small Stock Returns

• Suppose we want to test whether total returns to one small-stock index, the Russell 2000 Index, differ by month.

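• We can use dummy variables to estimate the following regression:

$$\text{Returns}_t = b_0 + b_1\,\text{Jan}_t + b_2\,\text{Feb}_t + \cdots + b_{11}\,\text{Nov}_t + \varepsilon_t$$

– December has no dummy, so b0 is the average December return and each bj measures the difference between that month’s average return and December’s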


Violations of Regression Assumptions

• Inference based on an estimated regression model rests on certain assumptions being met.

• Violations may cause the inferences made to be invalid.

Heteroskedasticity

• Heteroskedasticity occurs when the variance of the errors differs across observations.

– It does not affect the consistency of the regression coefficient estimates

– t-tests for the significance of individual regression coefficients are unreliable because heteroskedasticity introduces bias into estimators of the standard error of regression coefficients.
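One common way to correct the standard errors is White's heteroskedasticity-consistent (HC0) estimator. The slides do not name a specific method, so treat the sketch below as one possible approach, written with NumPy on synthetic data whose error spread grows with x.

```python
import numpy as np

def ols_with_robust_se(X, y):
    """OLS coefficients with classical and White (HC0) standard errors.

    X must already contain a column of ones for the intercept.
    """
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    # Classical covariance assumes one common error variance.
    classical_cov = (resid @ resid) / (n - p) * xtx_inv
    # HC0 covariance lets each observation have its own error variance.
    meat = X.T @ (X * resid[:, None] ** 2)
    robust_cov = xtx_inv @ meat @ xtx_inv
    return beta, np.sqrt(np.diag(classical_cov)), np.sqrt(np.diag(robust_cov))

rng = np.random.default_rng(0)
x = np.linspace(1.0, 10.0, 200)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5 * x)  # noise grows with x
X = np.column_stack([np.ones_like(x), x])

beta, se_classical, se_robust = ols_with_robust_se(X, y)
```

The coefficient estimates are the same under both approaches; only the standard errors, and therefore the t-statistics, differ.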

Regressions with Homoskedasticity

Regressions with Heteroskedasticity

Serial Correlation

• When regression errors are correlated across observations, we say that they are serially correlated (or autocorrelated).

– Serial correlation most typically arises in time-series regressions.

– The principal problem caused by serial correlation in a linear regression is that it leads to incorrect standard errors, and therefore to unreliable t-statistics in the tests of significance
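A widely used diagnostic for first-order serial correlation is the Durbin-Watson statistic. It is not mentioned in the slides, so it appears here as an illustrative assumption. Values near 2 suggest no serial correlation, values well below 2 suggest positive serial correlation, and values well above 2 suggest negative serial correlation.

```python
import numpy as np

def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Made-up residual series for illustration.
persistent = durbin_watson([1, 1, 1, -1, -1, -1])   # errors tend to keep their sign
alternating = durbin_watson([1, -1, 1, -1, 1, -1])  # errors flip sign every period
```

The persistent series gives a value well below 2 and the alternating series a value well above 2.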

Multicollinearity

• Multicollinearity occurs when two or more independent variables (or combinations of independent variables) are highly (but not perfectly) correlated with each other.

– It does not affect the consistency of the regression coefficients

– Estimates become extremely imprecise and unreliable

• The classic symptom of multicollinearity is a high R² even though the t-statistics on the estimated slope coefficients are not significant.

• The most direct solution to multicollinearity is excluding one or more of the regression variables.
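One way to decide which variable to exclude is the variance inflation factor (VIF): regress each independent variable on the others and compute 1 / (1 - R²). The slides do not mention VIF, so this is a sketch of one common diagnostic, written with NumPy on synthetic data.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (X has no intercept column)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    factors = []
    for j in range(k):
        target = X[:, j]
        # Regress column j on an intercept plus all the other columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

rng = np.random.default_rng(1)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.1, size=300)  # nearly a copy of x1
x3 = rng.normal(size=300)                  # unrelated to the other two

factors = vif(np.column_stack([x1, x2, x3]))
```

Here x1 and x2 get very large factors while x3 stays close to 1, which points to dropping one of x1 or x2.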

Problems in Regression & Solutions

Problem              Effect                          Solution
Heteroskedasticity   Incorrect standard errors       Correct for conditional heteroskedasticity
Serial Correlation   Incorrect t-values              Correct for serial correlation
Multicollinearity    High R² and low t-statistics    Remove 1 or more independent variables

Model Specification

• Model specification refers to the set of variables included in the regression and the regression equation’s functional form.

• Possible misspecifications include:

– One or more important variables could be omitted from regression.

– One or more of the regression variables may need to be transformed (for example, by taking the natural logarithm of the variable) before estimating the regression.

– The regression model pools data from different samples that should not be pooled.

Discrete Dependent Variable Models

• Discrete dependent variables are qualitative (dummy) variables that appear as the dependent variable rather than as independent variables.

• There are mainly 2 models

– The probit model, which is based on the normal distribution, estimates the probability that Y = 1 (a condition is fulfilled) given the value of the independent variable X

– The logit model is identical, except that it is based on the logistic distribution rather than the normal distribution
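The difference between the two models can be seen by passing the same linear index z = a + b*x through each distribution function. The coefficients below are hypothetical; in practice they would be estimated separately for each model and would generally differ.

```python
import math

def probit_probability(z):
    """P(Y = 1) under the probit model: standard normal CDF evaluated at z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logit_probability(z):
    """P(Y = 1) under the logit model: logistic CDF evaluated at z."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical fitted index (not from the slides): a = -1, b = 0.5, x = 3.
a, b, x = -1.0, 0.5, 3.0
z = a + b * x

p_probit = probit_probability(z)
p_logit = logit_probability(z)
```

Both functions map any real z into a probability between 0 and 1 and agree at z = 0, where each returns 0.5.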

Discrete Dependent Variable Models

• Logistic regression
