MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS


MULTIPLE REGRESSION

With multiple regression, we can analyze the association between more than one independent variable and our dependent variable.

Returning to our analysis of the determinants of loan rates, we also believe that the number of lines of credit the client currently employs is related to the loan rate charged. Accordingly, we model a multiple linear regression of the relationship:

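- General form: $Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \varepsilon_i$

- Specific form: $\text{Loan rate}_i = b_0 + b_1 \text{DTI}_i + b_2 \text{Open lines}_i + \varepsilon_i$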

where DTI is the debt-to-income ratio and Open lines is the number of existing lines of credit the client already possesses.
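A minimal sketch of how a model of this form could be estimated by ordinary least squares. The borrower data are hypothetical (the underlying sample is not reproduced here) and are generated without noise from made-up coefficients so that the fit is easy to check; the function name `ols` is illustrative only.

```python
def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    rows = [[1.0] + list(x) for x in X]  # prepend a column of ones for the intercept
    p = len(rows[0])
    # Augmented matrix [X'X | X'y]
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


# Hypothetical data: (DTI, open lines) for six borrowers
X = [(0.10, 1), (0.15, 4), (0.20, 2), (0.25, 5), (0.30, 3), (0.35, 6)]
# Loan rates generated from made-up coefficients b0 = 0.01, b1 = 0.70, b2 = 0.005
y = [0.01 + 0.70 * dti + 0.005 * lines for dti, lines in X]

b0, b1, b2 = ols(X, y)
print(f"b0={b0:.4f}, b1={b1:.4f}, b2={b2:.4f}")
```

Because the hypothetical loan rates contain no error term, the estimates simply recover the made-up coefficients; with real data the residuals would be nonzero and the standard errors would matter.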


MULTIPLE REGRESSION

Focus On: Calculations

• Coefficient estimates output

• The coefficient estimates are both positive, indicating that increases in DTI and in open lines are associated with increases in the loan rate. Only DTI, however, is statistically significant at the 5% level (95% confidence), as indicated by its t-stat of 4.1068 and p-value of 0.0045.

• A 1 percentage point increase in the debt-to-income ratio is associated with a 75.76 bp increase in the loan rate, holding the number of open lines constant.

Variable      Coefficients   Standard Error   t-Stat   p-Value
Intercept     0.0066         0.0352           0.1879   0.8563
DTI           0.7576         0.1845           4.1068   0.0045
Open lines    0.0059         0.0052           1.1429   0.2906
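As a quick check on how the t-Stat column relates to the other two, each t-stat for a zero null is the coefficient estimate divided by its standard error. The sketch below uses the rounded values shown in the output, so its results differ slightly from the reported t-stats (for example about 1.13 rather than 1.1429 for open lines), presumably because the software divides unrounded values.

```python
# Coefficient estimates and standard errors as shown in the output (rounded)
output = {
    "Intercept": (0.0066, 0.0352),
    "DTI": (0.7576, 0.1845),
    "Open lines": (0.0059, 0.0052),
}

# t-stat for H0: coefficient = 0 is the estimate divided by its standard error
t_stats = {name: coef / se for name, (coef, se) in output.items()}

for name, t in t_stats.items():
    print(f"{name:<12} t = {t:.4f}")
```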


MULTIPLE REGRESSION

Focus On: Hypothesis Testing

We can test the hypothesis that the true population slope coefficient for the association between open lines and loan rate is zero.

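1. Formulate hypothesis: H0: b2 = 0 versus Ha: b2 ≠ 0 (a two-tailed test)

2. Identify appropriate test statistic: $t = (\hat{b}_2 - b_2)/s_{\hat{b}_2}$, with n − (k + 1) = 10 − 3 = 7 degrees of freedom

3. Specify the significance level: 0.05, leading to a two-tailed critical value of 2.3646 for 7 degrees of freedom

4. Collect data and calculate test statistic: t = 1.1429, the coefficient estimate divided by its standard error, as reported in the regression output

5. Make the statistical decision: Fail to reject the null because 1.1429 < 2.3646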


MULTIPLE REGRESSION

Focus On: The p-Value Approach

• p-Values appear alongside the coefficient estimates on the regression output. We would reject a null hypothesis of a zero parameter value for b0 at any α level above 0.8563, for b1 at any α level above 0.0045, and for b2 at any α level above 0.2906; at any lower α level, we fail to reject.

• Conventionally accepted α levels are 0.10, 0.05, and 0.01, which leads us to reject the null hypothesis of a zero parameter value only for b1 and to conclude that only b1 is statistically significantly different from zero at generally accepted levels.


Variable      Coefficients   Standard Error   t-Stat   p-Value
Intercept     0.0066         0.0352           0.1879   0.8563
DTI           0.7576         0.1845           4.1068   0.0045
Open lines    0.0059         0.0052           1.1429   0.2906
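The p-value decision rule can be sketched in a few lines: reject the zero null for a coefficient whenever its p-value is below the chosen α. The p-values are those in the output above; the dictionary labels are illustrative.

```python
p_values = {"b0 (Intercept)": 0.8563, "b1 (DTI)": 0.0045, "b2 (Open lines)": 0.2906}
alphas = (0.10, 0.05, 0.01)

# Reject H0: coefficient = 0 whenever the p-value is below the chosen alpha
decisions = {
    name: {alpha: p < alpha for alpha in alphas}
    for name, p in p_values.items()
}

for name, by_alpha in decisions.items():
    rejected = [alpha for alpha, reject in by_alpha.items() if reject]
    print(name, "-> reject at", rejected if rejected else "none of the levels")
```

Only b1 is rejected, and it is rejected at all three conventional levels.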


MULTIPLE REGRESSION ASSUMPTIONS

Multiple linear regression has the same underlying assumptions as single independent variable linear regression and some additional ones.

1. The relationship between the dependent variable, Y, and the independent variables (X1, X2, . . . , Xk) is linear.

2. The independent variables (X1, X2, . . . , Xk) are not random. Also, no exact linear relation exists between two or more of the independent variables.

3. The expected value of the error term, conditioned on the independent variables, is zero.

4. The variance of the error term is the same for all observations.

5. The error term is uncorrelated across observations: $E(\varepsilon_i \varepsilon_j) = 0$, $j \neq i$.

6. The error term is normally distributed.


MULTIPLE REGRESSION PREDICTED VALUES


Returning to our multiple linear regression, what loan rate would we expect for a borrower with an 18% DTI and 3 open lines of credit?

$\widehat{\text{Loan rate}}_i = 0.0066 + 0.7576(0.18) + 0.0059(3) = 0.1607$, or about 16.07%.
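The same prediction as a small function, using the coefficient estimates from the two-variable regression output. The function and constant names are illustrative.

```python
# Coefficient estimates from the regression output
B0, B_DTI, B_OPEN_LINES = 0.0066, 0.7576, 0.0059

def predict_loan_rate(dti, open_lines):
    """Predicted loan rate (as a decimal) from the fitted two-variable model."""
    return B0 + B_DTI * dti + B_OPEN_LINES * open_lines

rate = predict_loan_rate(dti=0.18, open_lines=3)
print(f"{rate:.4f}")  # 0.1607
```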


UNCERTAINTY IN LINEAR REGRESSION

There are two sources of uncertainty in linear regression models:

1. Uncertainty associated with the random error term.

- The random error term itself contains uncertainty, which can be estimated from the standard error of the estimate for the regression equation.

2. Uncertainty associated with the parameter estimates.

- The estimated parameters also contain uncertainty because they are only estimates of the true underlying population parameters.

- For a single independent variable, as covered in the prior chapter, estimates of this uncertainty can be obtained.

- For multiple independent variables, the matrix algebra necessary to obtain such estimates is beyond the scope of this text.


MULTIPLE REGRESSION: ANOVA

Focus On: Regression Output

• The analysis of variance section of the output provides the F-test for the hypothesis that all the slope coefficients are jointly zero. The high value of the F-statistic (9.6104, with a significance of 0.0098) leads us to reject that null and conclude that at least one slope coefficient is nonzero.

• Combined with the coefficient estimates, this model suggests that the loan rate is fairly well described by the level of the debt-to-income ratio for the client, but that the number of outstanding open lines does not make a strong contribution to that understanding.

Source       df   SS       MSS      F        Significance F
Regression   2    0.0120   0.0060   9.6104   0.0098
Residual     7    0.0044   0.0006
Total        9    0.0164


F-TEST


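• The F-test for a multiple regression tests the null hypothesis that the slope coefficients, taken together as a group, are all zero. The test statistic is

$F = \dfrac{RSS/k}{SSE/[n-(k+1)]}$

where RSS is the regression sum of squares, SSE is the residual sum of squares, k is the number of slope coefficients, and n is the number of observations.

• From our regression output, this is F = 9.6104 (the regression mean square, 0.0060, divided by the residual mean square, 0.0006, computed from unrounded values),

which is greater than the critical value of F(0.05, 2, 7) = 4.7374, leading us to reject the null hypothesis that all slope coefficients are equal to zero.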
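A short sketch of the F-statistic arithmetic, using the sums of squares as displayed in the ANOVA output. Because those displayed values are rounded to four decimals, the result here is about 9.55 rather than the reported 9.6104, which was presumably computed from unrounded sums of squares. The residual degrees of freedom are n − (k + 1) = 7, matching the ANOVA table.

```python
# ANOVA inputs as displayed (rounded to four decimals)
ss_regression, ss_residual = 0.0120, 0.0044
n, k = 10, 2  # observations, slope coefficients

df_regression = k
df_residual = n - (k + 1)

ms_regression = ss_regression / df_regression  # mean regression sum of squares
ms_residual = ss_residual / df_residual        # mean squared error
f_stat = ms_regression / ms_residual

print(df_regression, df_residual)  # 2 7
print(round(f_stat, 2))            # 9.55
```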


R2 AND ADJUSTED R2


• Regression specification output from our example regression provides

- Multiple R is the correlation between the actual values of the dependent variable and the values predicted by the regression.

- R2 is the familiar coefficient of determination: the independent variables explain 73.3% of the variation in the dependent variable.

- Adjusted R2 is a more appropriate measure of fit when there are multiple independent variables because it adjusts for the number of independent variables; here it is 65.68%.

Regression Statistics
Multiple R       0.8562
R2               0.7330
Adjusted R2      0.6568
Standard Error   0.0250
Observations     10
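The regression statistics are linked by simple formulas, sketched below from the displayed values. Small differences from the reported figures (for example 0.6567 versus 0.6568 for adjusted R2) reflect rounding in the displayed inputs. The residual sum of squares is taken from the ANOVA output.

```python
import math

r2, n, k = 0.7330, 10, 2   # from the regression statistics
ss_residual = 0.0044       # residual sum of squares from the ANOVA output (rounded)

multiple_r = math.sqrt(r2)
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
see = math.sqrt(ss_residual / (n - k - 1))  # standard error of the estimate

print(f"{multiple_r:.4f} {adjusted_r2:.4f} {see:.4f}")
```

The adjustment factor (n − 1)/(n − k − 1) grows with the number of independent variables, which is why adjusted R2 can fall when a weak variable is added even though R2 cannot.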


INDICATOR VARIABLES

Often called “dummy variables,” indicator variables are used to capture qualitative aspects of the hypothesized relationship.

• Consider that a reliance on short-term sources of financing is also generally believed to be associated with riskier borrowers. The indicator variable for short-term reliance, STR, is coded as 1 when a borrower's existing borrowing consists predominantly of lines of credit and 0 otherwise. The hypothesized relationship is now

$\text{Loan rate}_i = b_0 + b_1 \text{DTI}_i + b_2 \text{Open lines}_i + b_3 \text{STR}_i + \varepsilon_i$


INDICATOR VARIABLES



Variable      Coefficients   Standard Error   t-Stat    p-Value
Intercept     –0.0138        0.0324           –0.4252   0.6855
DTI           0.6117         0.1781           3.4340    0.0139
Open lines    0.0265         0.0121           2.1958    0.0705
STR           –0.0681        0.0371           –1.8367   0.1159

Source       df   SS       MSS      F        Significance F
Regression   3    0.0136   0.0045   9.7037   0.0102
Residual     6    0.0028   0.0005
Total        9    0.0164


Regression Statistics (approximate, implied by the ANOVA sums of squares: R2 = 0.0136/0.0164)
Multiple R    0.91
R2            0.83
Adjusted R2   0.74
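A brief sketch of what the indicator variable does in the fitted model: switching STR from 0 to 1 shifts the predicted loan rate by the STR coefficient, leaving the slopes on DTI and open lines unchanged. The coefficients come from the output above; the borrower (18% DTI, 3 open lines) is a hypothetical example.

```python
# Coefficient estimates from the output that includes the STR indicator
COEF = {"intercept": -0.0138, "dti": 0.6117, "open_lines": 0.0265, "str": -0.0681}

def predict_loan_rate(dti, open_lines, short_term_reliance):
    """Predicted loan rate; short_term_reliance is a bool mapped to the 0/1 indicator."""
    str_dummy = 1 if short_term_reliance else 0
    return (COEF["intercept"] + COEF["dti"] * dti
            + COEF["open_lines"] * open_lines + COEF["str"] * str_dummy)

# Hypothetical borrower: 18% DTI, 3 open lines
without_str = predict_loan_rate(0.18, 3, short_term_reliance=False)
with_str = predict_loan_rate(0.18, 3, short_term_reliance=True)
print(f"{without_str:.4f} {with_str:.4f} {with_str - without_str:.4f}")
```

Note that the STR coefficient is not statistically significant at conventional levels (p-value 0.1159), so the shift should be read with caution.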


VIOLATIONS: HETEROSKEDASTICITY

The variance of the errors differs across observations (Assumption 4).

There are two types of heteroskedasticity:

- Unconditional heteroskedasticity, which presents no problems for statistical inference, and

- Conditional heteroskedasticity, wherein the error variance is correlated with the values of the independent variables. In this case:

- Parameter estimates are still consistent.

- F-tests and t-tests are unreliable because the standard errors are estimated incorrectly.


VIOLATIONS: SERIAL CORRELATION

There is correlation between the error terms (Assumption 5).

• The focus in this chapter is the case in which there is serial correlation but no lagged values of the dependent variable as independent variable(s).

- Parameter estimates are consistent, but the standard errors are incorrect.

- The F-test and t-tests are likely inflated with positive serial correlation, the most common case with financial variables.

• Parameter estimates remain consistent only as long as no lagged values of the dependent variable are used as independent variables.

- If lagged values of the dependent variable are included as independent variables, coefficient estimates are inconsistent.

- This is the statistical arena of time series (Chapter 10).


TESTING AND CORRECTING FOR VIOLATIONS

There are well-established tests for serial correlation and heteroskedasticity, as well as ways to correct for their impact.

• Testing for

- Heteroskedasticity: Use the Breusch–Pagan test

- Serial correlation: Use the Durbin–Watson test

• Correcting for

- Heteroskedasticity: Use robust (White) standard errors or generalized least squares

- Serial correlation: Use the Hansen correction

- This also corrects for heteroskedasticity.
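The Breusch–Pagan test named above can be sketched as follows: regress the squared residuals on the independent variable(s) and compare n times the R2 of that auxiliary regression with a chi-square critical value whose degrees of freedom equal the number of independent variables. This sketch assumes a single independent variable to keep the algebra short, and the data are hypothetical.

```python
def fit_line(x, y):
    """Intercept and slope of a one-variable least-squares regression."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def residuals(x, y):
    a, b = fit_line(x, y)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

def r_squared(x, y):
    mean_y = sum(y) / len(y)
    sse = sum(e ** 2 for e in residuals(x, y))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - sse / sst

def breusch_pagan(x, y):
    """n * R^2 from regressing the squared residuals on x."""
    squared = [e ** 2 for e in residuals(x, y)]
    return len(x) * r_squared(x, squared)

# Hypothetical data whose scatter widens as x grows
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [1.1, 1.9, 3.3, 3.6, 5.6, 5.2, 7.9, 6.8, 10.3, 8.4]

bp = breusch_pagan(x, y)
CHI2_CRIT_1DF_5PCT = 3.841  # chi-square critical value, 1 df, 5% level
print(f"BP = {bp:.3f}; reject homoskedasticity: {bp > CHI2_CRIT_1DF_5PCT}")
```

With more than one independent variable, the auxiliary regression would include all of them and the chi-square degrees of freedom would rise accordingly.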


TESTING FOR SERIAL CORRELATION

Focus On: Calculating the Durbin–Watson Statistic

You have recently estimated a regression model with 100 observations and two independent variables. Using the estimated errors, you have determined that the correlation between the error term and a first lagged value of the error term is 0.16. Do the observations exhibit positive serial correlation?

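- The test statistic is $DW \approx 2(1 - r) = 2(1 - 0.16) = 1.68$.

- The critical values from Appendix E are dl = 1.63 and du = 1.72.

- Because 1.68 > 1.63, we cannot reject the null hypothesis of no positive serial correlation; because 1.68 also lies below du = 1.72, the statistic falls in the inconclusive region.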

[Figure: Durbin–Watson decision regions, marking dl = 1.63 and du = 1.72, the inconclusive zone between them, and the rejection zones for positive and negative serial correlation.]
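The calculation and decision rule can be sketched as below. The first function shows the full Durbin–Watson statistic computed from a residual series; the second uses the large-sample approximation based on the first-order residual correlation, which is what this example relies on. Function names are illustrative.

```python
def durbin_watson(residuals):
    """DW = sum of squared changes in the residuals / sum of squared residuals."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

def dw_from_correlation(r):
    """Large-sample approximation: DW is roughly 2(1 - r)."""
    return 2 * (1 - r)

def classify_positive(dw, dl, du):
    """Decision for H0: no positive serial correlation."""
    if dw < dl:
        return "reject H0: positive serial correlation"
    if dw <= du:
        return "inconclusive"
    return "fail to reject H0"

dw = dw_from_correlation(0.16)
print(round(dw, 2), classify_positive(dw, dl=1.63, du=1.72))  # 1.68 inconclusive
```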


VIOLATIONS: MULTICOLLINEARITY

Multicollinearity occurs when two or more independent variables, or combinations of independent variables, are highly (but not perfectly) correlated with each other (compare Assumption 2, which rules out only exact linear relations).

• Common with financial data

- Estimates are still consistent, but imprecise and unreliable.

- One indicator that you may have a collinearity problem is the presence of a significant F-test but no (few) significant t-tests.

• There is no easy way to correct this violation; you may have to drop variables.

- The “story” here is critical.
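As a rough first screen only, one might look at the pairwise correlation between independent variables. The sketch below uses hypothetical data for two regressors. It would not detect collinearity that arises from combinations of three or more variables, so the F-test versus t-test pattern described above remains the more telling symptom.

```python
import math

def correlation(a, b):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical independent variables for eight borrowers
dti = [0.10, 0.14, 0.18, 0.22, 0.25, 0.30, 0.33, 0.38]
open_lines = [1, 2, 2, 3, 4, 4, 5, 6]

r = correlation(dti, open_lines)
print(f"correlation = {r:.3f}")
```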


SUMMARIZING VIOLATIONS AND SOLUTIONS

Problem              Effect                       Solution
Heteroskedasticity   Incorrect standard errors    Use robust standard errors
Serial correlation   Incorrect standard errors*   Use robust standard errors
Multicollinearity    High R2 and low t-stats      No theory-based solution


MODEL SPECIFICATION

• Models should

- Be grounded in financial or economic reasoning.

- Have variables that are an appropriate functional form for their nature.

- Have specifications that are parsimonious.

- Be in compliance with the regression assumptions.

- Be tested out-of-sample before applying them to decisions.


MODEL MISSPECIFICATIONS

A model is misspecified when it violates the assumptions underlying linear regression, its functional form is incorrect, or it contains time-series specification problems.

• Generally, model misspecification can result in invalid statistical inference when we are using linear regression.

• Misspecification has a number of possible sources:

1. Misspecified functional form can arise from several possible problems:

- Omitted variable bias.

- Incorrectly represented variables.

- Data that are pooled when they should not be.

2. Error term correlation with independent variables can arise from:

- Lagged values of the dependent variable as independent variables.

- Measurement error in the independent variables.

- Independent variables that are functions of the dependent variable.


AVOIDING MISSPECIFICATION

• If independent or dependent variables are nonlinear, use an appropriate transformation to make them linear.

- For example, use common size statements or log-based transformations.

• Avoid independent variables that are mathematical transformations of dependent variables.

• Don’t include spurious independent variables (no data mining).

• Perform diagnostic tests for violations of the linear regression assumptions.

- If violations are found, use appropriate corrections.

• Validate model estimations out-of-sample when possible.

• Ensure that data come from a single underlying population.

- The data collection process should be grounded in good sampling practice.


QUALITATIVE DEPENDENT VARIABLES

The dependent variable of interest may be a categorical variable representing the state of the subject we are analyzing.

• Dependent variables that take on ordinal or nominal values are better estimated using models developed for qualitative analysis.

- This approach is the dependent variable analog to indicator (dummy) variables as independent variables.

• Three broad categories

1. Probit: Based on the normal distribution, it estimates probability of the dependent variable outcome.

2. Logit: Based on the logistic distribution, it also estimates probability of the dependent variable outcome.

3. Discriminant Analysis: It estimates a linear function, which can then be used to assign the observation to the underlying categories.
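To illustrate the difference between the first two approaches: both map a linear function of the independent variables to a probability, but logit uses the logistic distribution function and probit uses the standard normal. The outcome (default versus no default), the single regressor, and the coefficients below are all hypothetical, chosen only to show the mapping.

```python
import math
from statistics import NormalDist

def logit_probability(z):
    """Logistic CDF: maps a linear index z to a probability."""
    return 1 / (1 + math.exp(-z))

def probit_probability(z):
    """Standard normal CDF: maps a linear index z to a probability."""
    return NormalDist().cdf(z)

# Hypothetical index for a default/no-default outcome: z = a + b * DTI
a, b = -3.0, 8.0  # made-up coefficients, not estimates
for dti in (0.10, 0.25, 0.40):
    z = a + b * dti
    print(f"DTI={dti:.2f}  logit={logit_probability(z):.3f}  probit={probit_probability(z):.3f}")
```

In practice the coefficients of either model are estimated by maximum likelihood rather than least squares, and the two models' coefficients are not directly comparable in scale.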


ECONOMIC MEANING AND MULTIPLE REGRESSION

Variable      Coefficients   Standard Error   t-Stat    p-Value
Intercept     –0.0138        0.0324           –0.4252   0.6855
DTI           0.6117         0.1781           3.4340    0.0139
Open lines    0.0265         0.0121           2.1958    0.0705
STR           –0.0681        0.0371           –1.8367   0.1159

Source       df   SS       MSS      F        Significance F
Regression   3    0.0136   0.0045   9.7037   0.0102
Residual     6    0.0028   0.0005
Total        9    0.0164


Regression Statistics (approximate, implied by the ANOVA sums of squares: R2 = 0.0136/0.0164)
Multiple R    0.91
R2            0.83
Adjusted R2   0.74


SUMMARY

• We are often interested in the relationship between more than two financial variables, and multiple linear regression allows us to model such relationships and subject our beliefs about them to rigorous testing.

• Financial data often exhibit characteristics that violate the underlying assumptions necessary for linear regression and its associated hypothesis tests to be meaningful.

• The main violations are

- Serial correlation.

- Conditional heteroskedasticity.

- Multicollinearity.

• We can test for each of these conditions and correct our estimations and hypothesis tests to account for their effects.