multiple regression analysis - university of miskolc

Miskolci Egyetem Gazdaságtudományi Kar

Üzleti Információgazdálkodási és Módszertani Intézet

Multiple Regression Analysis

Roland Szilágyi Ph.D., Associate Professor

• X (or X1, X2, … , Xp):

known variable(s) / independent variable(s) / predictor(s)

• Y: unknown variable / dependent variable

• causal relationship: X "causes" Y to change

Correlation: describes the strength of a relationship, the degree to which one variable is linearly related to another

Regression: shows us how to determine the nature of a relationship between two or more variables

Simple Linear Regression Model

• We model the relationship between two variables, X and Y

as a straight line.

• The model contains two parameters:

an intercept parameter,

a slope parameter.

Y = β0 + β1x + ε

where: y – dependent or response variable (the variable we

wish to explain or predict)

x – independent or predictor variable

ε – random error component

β0 – y-intercept of the line, i.e. the point at which the line intersects the y-axis

β1 – slope of the line

Y = deterministic component + random error

[Figure: the line y = β0 + β1x, marking the y-intercept β0, the slope β1, the deterministic component and the random error]

• We always assume that the mean value of the random error equals 0, so the mean value of y equals the deterministic component.

• It is possible to find many lines for which the sum of the errors is equal to 0, but there is one (and only one) line for which the SSE (sum of squares of the errors) is a minimum:

least squares line / regression line.

ŷi = b0 + b1xi

• The method of least squares gives us the best linear unbiased estimators (BLUE) of the regression parameters β0, β1.

• The least-squares estimators:

b0 estimates β0

b1 estimates β1

• Calculation of the estimators:

• The regression line:

Ŷ = b0 + b1x

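$$f(b_0, b_1) = \sum (y_i - b_0 - b_1 x_i)^2 \rightarrow \min$$

Least Squares Method

• The minimum is where the partial derivatives are equal to 0.

• The normal equations (with one x):

$$\sum y = n b_0 + b_1 \sum x$$

$$\sum xy = b_0 \sum x + b_1 \sum x^2$$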

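• The estimated regression line:

$$b_1 = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - \left(\sum x\right)^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$

ŷ = b0 + b1x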
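The least squares calculation for one independent variable can be sketched in a few lines of Python. This is only an illustration: the function name `fit_simple_regression` and the five data points are invented for the example, and the formulas are the closed-form solution of the two normal equations.

```python
def fit_simple_regression(x, y):
    """Solve the two normal equations for b0 and b1 (one independent variable)."""
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_xx = sum(xi * xi for xi in x)
    # Slope and intercept from the closed-form solution of the normal equations
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1


# Invented sample data, used only to illustrate the calculation
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

b0, b1 = fit_simple_regression(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
print(f"sum of residuals = {sum(residuals):.6f}")
```

With an intercept in the model, the residuals of the least squares line sum to 0 (up to rounding), which the last line lets you check.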

Multiple Linear Regression Model

• The multiple linear regression equation describes the relation between the independent variables (X1, X2, …, Xp) and the dependent variable.

• Y depends on:

• X1, X2, …, Xp (p independent variables)

• the error term (ε)

• β0, β1, …, βp regression coefficients.

Y = β0 + β1X1 + β2X2 + … + βpXp + ε

Y = deterministic component + random error

Least Squares Method

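• The method of least squares gives us the best linear unbiased estimators (BLUE) of the regression parameters (β0, β1, β2, …, βp).

$$f(b_0, b_1, b_2, \ldots, b_p) = \sum (y - b_0 - b_1 x_1 - b_2 x_2 - \ldots - b_p x_p)^2 \rightarrow \min$$

$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \ldots + b_p x_p$$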

Data Structure of Multiple Linear Regression

Multiple Linear Regression

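$$f(b_0, b_1, b_2, \ldots, b_p) = \sum (y - b_0 - b_1 x_1 - b_2 x_2 - \ldots - b_p x_p)^2 \rightarrow \min$$

The normal equations:

$$\sum y = n b_0 + b_1 \sum x_1 + b_2 \sum x_2 + \ldots + b_p \sum x_p$$

$$\sum x_1 y = b_0 \sum x_1 + b_1 \sum x_1^2 + b_2 \sum x_1 x_2 + \ldots + b_p \sum x_1 x_p$$

$$\sum x_2 y = b_0 \sum x_2 + b_1 \sum x_1 x_2 + b_2 \sum x_2^2 + \ldots + b_p \sum x_2 x_p$$

$$\vdots$$

$$\sum x_p y = b_0 \sum x_p + b_1 \sum x_1 x_p + b_2 \sum x_2 x_p + \ldots + b_p \sum x_p^2$$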

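The equation system in matrix form:

$$X^T y = X^T X b$$

$$b = (X^T X)^{-1} X^T y$$

where X is the matrix of the observed independent variables with a leading column of ones, y is the vector of the observed dependent variable and b is the vector of the coefficients (b0, b1, …, bp).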

With the help of these results we can estimate the regression equation (the empirical regression equation, i.e. the sample model).
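The matrix solution can be tried out with NumPy. The sketch below assumes NumPy is available; the function name `fit_multiple_regression` and the data are invented for illustration (y is built exactly as 1 + 2·x1 + 3·x2, so the coefficients are known in advance). It solves the normal equations with `np.linalg.solve` instead of forming the inverse explicitly, which gives the same b in exact arithmetic.

```python
import numpy as np


def fit_multiple_regression(X_raw, y):
    """Estimate b = (b0, b1, ..., bp) from the normal equations X'X b = X'y."""
    X_raw = np.asarray(X_raw, dtype=float)
    y = np.asarray(y, dtype=float)
    # Leading column of ones so that b0 (the intercept) is estimated too
    X = np.column_stack([np.ones(len(y)), X_raw])
    return np.linalg.solve(X.T @ X, X.T @ y)


# Invented data: y = 1 + 2*x1 + 3*x2 without any error term
X_raw = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 6], [6, 5]]
y = [9, 8, 19, 18, 29, 28]

b = fit_multiple_regression(X_raw, y)
print(np.round(b, 6))  # estimated b0, b1, b2
```

Because the example data contain no random error, the estimated coefficients reproduce the ones used to generate y.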

Interpretation of Parameters

The intercept (b0) can be interpreted as the value you would predict for the dependent variable if every Xi = 0. Whether this interpretation is meaningful depends, on the one hand, on whether 0 is among the possible Xi values and, on the other hand, on whether b0 is among the possible Yi values.

$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \ldots + b_p x_p$$

Interpretation of Parameters

In a geometrical sense, the coefficient bp is the slope of the regression line: it shows that the dependent variable changes by bp units on average for each one-unit increase in Xp, if the other independent variables remain constant.

$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \ldots + b_p x_p$$

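Residual variable

$$\sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2$$

Sy = Sŷ + Se

• Sy: total sum of squares (of y)

• Sŷ: sum of squares explained by regression

• Se: sum of squares of the errors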

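Analysis of Variance in Regression Analysis

Source | Sum of Squares | Df | Mean Sum of Squares | F
Regression | SSR = Sŷ = Σ(ŷi − ȳ)² | p | MSR = SSR/p | F = MSR/MSE
Residual | SSE = Se = Σ(yi − ŷi)² | n-p-1 | MSE = SSE/(n-p-1) |
Total | Sy = Σ(yi − ȳ)² | n-1 | |

$$S_y = S_{\hat{y}} + S_e, \qquad F = \frac{S_{\hat{y}} / p}{S_e / (n - p - 1)}$$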
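The entries of the ANOVA table can be computed directly from a fitted model. The following is a minimal sketch, assuming NumPy; the function name `anova_table`, the dictionary keys and the five data points are invented for the example (one independent variable, so p = 1).

```python
import numpy as np


def anova_table(X_raw, y):
    """Return the sums of squares, mean squares and F for a least squares fit."""
    X_raw = np.asarray(X_raw, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X_raw.shape
    X = np.column_stack([np.ones(n), X_raw])
    b = np.linalg.solve(X.T @ X, X.T @ y)
    y_hat = X @ b
    s_total = np.sum((y - y.mean()) ** 2)      # Sy
    s_regr = np.sum((y_hat - y.mean()) ** 2)   # Sy-hat, explained by regression
    s_error = np.sum((y - y_hat) ** 2)         # Se
    msr = s_regr / p
    mse = s_error / (n - p - 1)
    return {"S_total": s_total, "S_regression": s_regr, "S_error": s_error,
            "MSR": msr, "MSE": mse, "F": msr / mse}


# Invented sample data with one independent variable
X_raw = [[1.0], [2.0], [3.0], [4.0], [5.0]]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

table = anova_table(X_raw, y)
for name, value in table.items():
    print(f"{name:>13}: {value:.4f}")
```

The calculated F is then compared with the critical value of the F-distribution with p and n-p-1 degrees of freedom, taken from a table or a statistics package.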

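Model Testing

$$H_0: \beta_1 = \beta_2 = \ldots = \beta_p = 0$$

$$H_1: \exists j: \beta_j \neq 0$$

$$F \leq F_{1-\alpha}(\nu_1; \nu_2) \rightarrow H_0$$

$$F > F_{1-\alpha}(\nu_1; \nu_2) \rightarrow H_1$$

where ν1 = p and ν2 = n-p-1 are the degrees of freedom of the regression and of the residual.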

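Parameter testing

If tcalculated < tcritical → H0

If tcalculated > tcritical → H1

where tcritical is the critical value of Student's t-distribution at the chosen significance level, with the residual degrees of freedom (n-p-1).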
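A possible way to obtain the calculated t values is sketched below. The slides do not show the formula of the test statistic, so the sketch assumes the usual one: t_j = b_j / s_bj, where s_bj is the square root of MSE times the jth diagonal element of (XᵀX)⁻¹. The function name `t_statistics` and the data are invented; the critical value 3.182 is the two-sided 5% value of the t-distribution with 3 degrees of freedom, read from a table, and the comparison uses the absolute value of t.

```python
import numpy as np


def t_statistics(X_raw, y):
    """Return the coefficients b and their t statistics b_j / s_bj."""
    X_raw = np.asarray(X_raw, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X_raw.shape
    X = np.column_stack([np.ones(n), X_raw])
    xtx_inv = np.linalg.inv(X.T @ X)
    b = xtx_inv @ X.T @ y
    residuals = y - X @ b
    mse = residuals @ residuals / (n - p - 1)
    # Assumed standard error formula: sqrt(MSE * diagonal of (X'X)^-1)
    std_errors = np.sqrt(mse * np.diag(xtx_inv))
    return b, b / std_errors


# Invented sample data with one independent variable (n = 5, p = 1)
X_raw = [[1.0], [2.0], [3.0], [4.0], [5.0]]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

b, t_calculated = t_statistics(X_raw, y)
t_critical = 3.182  # table value: two-sided 5% level, 3 degrees of freedom

for j, t in enumerate(t_calculated):
    decision = "H1" if abs(t) > t_critical else "H0"
    print(f"b{j} = {b[j]:.3f}, t = {t:.3f} -> {decision}")
```

With a single independent variable the squared t value of the slope equals the F value of the model test, which is a convenient consistency check.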

Assumptions of the Multiple Linear Regression Model

Assumptions of the error term

• The expected value of the error term equals 0: E(ε│X1, X2, …, Xp) = 0

• Constant variance (homoscedasticity): Var(ε) = σ²

• The error term is uncorrelated across observations.

• Normally distributed error term.

Assumptions of the independent variables

• Linear independence.

• Fixed values, which do not change from sample to sample.

• There is no measurement error.

• The independent variable is uncorrelated with the error term.

1. The expected value of the error term equals 0: E(ε│X1, X2, …, Xp) = 0

2. Constant variance (homoscedasticity): Var(ε) = σ²

3. The error term is uncorrelated across observations.

4. Normally distributed error term.

1. E(ε│X1, X2, …Xp)=0

• The assumption means that the error term should be neutral, i.e. it should carry no systematic effect. If its expected value were not 0, the errors would contain a systematic tendency that could be integrated into the deterministic part of the model.

• If the regression model is estimated by least squares (with an intercept), the average residual will be 0.


2. Homoscedasticity (Var(ε) = σ²)

• The variance of the error term is the same for all observations.

Testing:

o Plots of the residuals versus the independent variables (or the predicted value ŷ, or time)

o Statistical tests: Goldfeld-Quandt test (especially when the heteroscedasticity is related to one of the independent variables)

Graphical tests for homoscedasticity

[Figure: residuals (e) plotted against the predicted values (ŷ); panels: homoscedastic residuals, heteroscedastic residuals]

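Goldfeld-Quandt test

• H0: σj² = σ²

H1: σj² ≠ σ²

• Steps:

1. Ranking: sort the cases by the y variable.

2. Subgroups: split the sorted cases into three subgroups of sizes (n − r)/2, r and (n − r)/2 (where r > 0 and (n − r)/2 > p).

3. Calculate the mean square errors (se²) from separate regressions on the 1st and 3rd subgroups.

4. F-test: H0 is accepted if the ratio of the two mean square errors lies between the critical values: F(α/2); ν1,ν2 < F < F(1-α/2); ν1,ν2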
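The four steps can be sketched in Python as follows. Several details are assumptions of this sketch and are not stated on the slides: the ratio is formed as the mean square error of the 3rd subgroup divided by that of the 1st, each subgroup regression has (n − r)/2 − p − 1 degrees of freedom (so the code requires this number to be positive), and the optional `sort_key` argument allows sorting by another variable than y. The function names and the data are invented, with an error whose spread grows with x.

```python
import numpy as np


def mean_square_error(X_raw, y):
    """Residual mean square of a least squares fit with intercept."""
    n, p = X_raw.shape
    X = np.column_stack([np.ones(n), X_raw])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ b
    return residuals @ residuals / (n - p - 1)


def goldfeld_quandt(X_raw, y, r, sort_key=None):
    """Return (F, df1, df2) after dropping the r middle cases."""
    X_raw = np.asarray(X_raw, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X_raw.shape
    m = (n - r) // 2                      # size of the 1st and 3rd subgroups
    if r <= 0 or m - p - 1 <= 0:
        raise ValueError("need r > 0 and enough cases in each subgroup")
    key = y if sort_key is None else np.asarray(sort_key, dtype=float)
    order = np.argsort(key, kind="stable")  # step 1: ranking
    first, third = order[:m], order[-m:]    # step 2: outer subgroups
    mse_first = mean_square_error(X_raw[first], y[first])   # step 3
    mse_third = mean_square_error(X_raw[third], y[third])
    df = m - p - 1
    return mse_third / mse_first, df, df    # step 4: F ratio


# Invented data: the size of the error grows with x (heteroscedastic)
x = np.arange(1.0, 13.0)
signs = np.where(np.arange(12) % 2 == 0, 1.0, -1.0)
y = 2.0 + 3.0 * x + signs * 0.1 * x

F, df1, df2 = goldfeld_quandt(x.reshape(-1, 1), y, r=4)
print(f"F = {F:.3f} with ({df1}, {df2}) degrees of freedom")
```

The resulting F is compared with the two critical values of the F-distribution from a table; a ratio far from 1 points to heteroscedasticity.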


The error term is uncorrelated across observations

• In the case of cross-sectional data, the observations meet the assumption of simple random sampling, thus we do not have to test this hypothesis.

• Before making estimations from time series data, we need to test the residuals for autocorrelation.

Causes of autocorrelation

• if not every important explanatory variable is included in the model (we cannot recognise the effect, there is no data, or the time series is short)

• if the model specification is wrong, e.g. the relationship is not linear, but we use linear regression

• non-random measurement errors

Plots to detect autocorrelation

[Figure: two residual plots showing autocorrelation. First panel: an independent variable is missing from the equation. Second panel: we should use another type of function.]

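The Durbin-Watson test

H0: ρ = 0 (no autocorrelation)

H1: ρ ≠ 0 (autocorrelation)

[Figure: decision scale of the test statistic d with the points 0, dl, du, 2, 4-du, 4-dl, 4; disturbing positive autocorrelation at the lower end, disturbing negative autocorrelation at the upper end]

Limits:

• Positive autocorrelation: d < dl

• Negative autocorrelation: d > 4-dl

• Weaker problem, no decision: dl < d < du or 4-du < d < 4-dl. In this case use more variables or a larger database.

• No problem: du < d < 4-du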

Decision table of the Durbin-Watson test

H1 | Accept H0: ρ = 0 | Reject H0 | No decision
Positive autocorrelation | d > du | d < dl | dl < d < du
Negative autocorrelation | d < 4-du | d > 4-dl | 4-du < d < 4-dl

Source: Kerékgyártó-Mundruczó [1999]
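The decision rule of the table can be written as a short function. The slides do not give the formula of the test statistic, so the sketch assumes the standard definition d = Σ(e_t − e_{t−1})² / Σe_t². The function names, the residual series and the limits dl = 1.0 and du = 1.5 are placeholders invented for the example; real limits have to be read from a Durbin-Watson table for the given n and p.

```python
def durbin_watson(residuals):
    """Standard Durbin-Watson statistic of a residual series in time order."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


def dw_decision(d, dl, du):
    """Classify d according to the decision table (dl < du)."""
    if d < dl:
        return "positive autocorrelation"
    if d < du:
        return "no decision"
    if d <= 4 - du:
        return "no autocorrelation"
    if d <= 4 - dl:
        return "no decision"
    return "negative autocorrelation"


# Placeholder limits, NOT taken from a real Durbin-Watson table
dl, du = 1.0, 1.5

slow_residuals = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]         # long runs of one sign
alternating_residuals = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # sign flips every step

for name, series in [("slow", slow_residuals),
                     ("alternating", alternating_residuals)]:
    d = durbin_watson(series)
    print(f"{name}: d = {d:.3f} -> {dw_decision(d, dl, du)}")
```

Residuals that keep their sign for long runs push d towards 0, while residuals that change sign at every step push it towards 4.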


Normally distributed errors

Testing:

• Plots

• Quantitative tests (goodness-of-fit tests): chi-square test, Kolmogorov-Smirnov test

Graphical testing

A plot of the values of the residuals against normally distributed values. The assumption is not violated when the figure is nearly linear.

Histogram of residuals

Goodness-of-fit test

H0: Pr(εj) = Pj (the distribution is normal)

H1: ∃j: Pr(εj) ≠ Pj

$$\chi^2_{critical} = \chi^2_{1-\alpha}(\nu)$$

where ν is the number of degrees of freedom of the test.

Assumptions of the independent variables

1. Linear independence (an independent variable should not be an exact linear combination of the other independent variables).

2. Fixed values, which do not change from sample to sample.

3. There is no measurement error.

4. The independent variable is uncorrelated with the error term.

Multicollinearity

• Testing:

• Xj = f(X1, X2, …, Xj-1, Xj+1, …, Xp) regression models:

– Multiple determination coefficient

– F-test (F > Fcrit)

– VIF indicator

VIF indicator

• Variance Inflation Factor: VIFj = 1 / (1 − Rj²), where Rj² is the multiple determination coefficient of the regression of Xj on the other independent variables

• VIF = 1 if Rj² = 0 (the jth independent variable does not correlate with the others)

• VIF → ∞ if Rj² → 1 (the jth independent variable is an exact linear combination of the other independent variables)

• Categories by the size of VIF: weak multicollinearity; strong, disturbing multicollinearity; very strong, harmful multicollinearity
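The VIF calculation can be sketched with NumPy by regressing each independent variable on the others and using VIFj = 1 / (1 − Rj²). The function name `variance_inflation_factors` and both data sets are invented for illustration: in the first one the two variables are uncorrelated, in the second one they are almost identical.

```python
import numpy as np


def variance_inflation_factors(X_raw):
    """Return VIF_j = 1 / (1 - R_j^2) for every column of X_raw."""
    X_raw = np.asarray(X_raw, dtype=float)
    n, p = X_raw.shape
    vifs = []
    for j in range(p):
        target = X_raw[:, j]
        others = np.delete(X_raw, j, axis=1)
        # Regress X_j on the other independent variables (with intercept)
        Z = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(Z, target, rcond=None)
        residuals = target - Z @ coef
        r_squared = 1.0 - residuals @ residuals / np.sum((target - target.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r_squared))
    return vifs


# Invented data: two uncorrelated independent variables
uncorrelated = [[1.0, 1.0], [2.0, -1.0], [3.0, -1.0], [4.0, 1.0]]
# Invented data: the second variable nearly repeats the first one
nearly_collinear = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.9], [5.0, 5.1]]

vif_uncorrelated = variance_inflation_factors(uncorrelated)
vif_collinear = variance_inflation_factors(nearly_collinear)
print("uncorrelated:    ", np.round(vif_uncorrelated, 3))
print("nearly collinear:", np.round(vif_collinear, 1))
```

For uncorrelated variables every VIF is 1, while the nearly collinear pair produces very large values.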

Correction for Multicollinearity

• We should find the offending independent variables and exclude them.

• We can combine the strongly correlated independent variables (creating principal components); the new variables will differ from the original independent variables, but they will contain the information content of the original variables.

Thanks for your attention!
