Warsaw Summer School 2015, OSU Study Abroad Program Regression


Page 1: Warsaw Summer School 2015, OSU Study Abroad Program Regression

Warsaw Summer School 2015, OSU Study Abroad Program

Regression

Page 2: Warsaw Summer School 2015, OSU Study Abroad Program Regression

Linear Relationship

A line is a mathematical function that can be expressed through the formula Y = a + bX, where Y and X are our variables.

Y, the dependent variable, is expressed as a linear function of the independent (explanatory) variable X.

Page 3: Warsaw Summer School 2015, OSU Study Abroad Program Regression

Linear Relationship

Page 4: Warsaw Summer School 2015, OSU Study Abroad Program Regression

Linear Relationship

The constant a is the value of Y at the point where the line Y = a + bX intersects the Y-axis (also called the intercept).

The slope b equals the change in Y for a one-unit increase in X (a one-unit increase in X corresponds to a change of b units in Y). The slope describes the rate of change in the Y-values as X increases.

Verbal interpretation of the slope of the line: “rise over run”, the rise divided by the run (the change in the vertical distance divided by the change in the horizontal distance).
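As a quick numeric check of these definitions (a minimal sketch with made-up numbers, not from the slides), consider the hypothetical line Y = 2 + 3X:

```python
# Hypothetical line Y = a + bX with a = 2 (intercept) and b = 3 (slope).
a, b = 2.0, 3.0

def line(x):
    return a + b * x

print(line(0))            # 2.0 -> at X = 0 the line crosses the Y-axis at a
print(line(2) - line(1))  # 3.0 -> a one-unit increase in X changes Y by b ("rise over run")
```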


Page 6: Warsaw Summer School 2015, OSU Study Abroad Program Regression

Cartesian Coordinate System

Variables X, Y and their linear function:

The formula Y = a + bX expresses the dependent (response) variable Y as a linear function of the independent (explanatory) variable X. The formula maps out a straight-line graph with slope b and Y-intercept a.

Page 7: Warsaw Summer School 2015, OSU Study Abroad Program Regression

Basics

Linear Relationship: Y = a + bX

The constant a is the value of Y when X = 0. For X = 0 we have: Y = a + b*0 = a

The constant a is the value of Y where the line Y = a + bX intersects the Y-axis.

The slope b equals the change in Y for a one-unit increase in X. This means that a one-unit increase in X corresponds to a change of b units in Y. Thus, the slope describes the rate of change in the Y-values as X increases. In general, rearranging the equation gives

b = (Y - a) / X

Page 8: Warsaw Summer School 2015, OSU Study Abroad Program Regression

Model vs Reality

The function Y = a + bX is a model

In reality, the observed data points do not fall exactly on a single line.

Page 9: Warsaw Summer School 2015, OSU Study Abroad Program Regression

The Scattergram and the Least Squares Method

The graphical plot of observed values (X, Y) is called a:

- scattergram

- scatter diagram

- scatterplot.

A regression function is a function that describes how the expected value of the dependent (response) variable Y changes according to the values of an independent (explanatory) variable X.

Page 10: Warsaw Summer School 2015, OSU Study Abroad Program Regression

Regression

This expected value is estimated by a linear function:

• Ŷ = a + bX

Ŷ = the predicted value of the dependent variable Y

a = the intercept (the value of Y when X = 0)

b = the regression coefficient (the slope), indicating the amount of change in Y given a unit change in X

X = the independent variable

Page 11: Warsaw Summer School 2015, OSU Study Abroad Program Regression

Regression

Ŷ = a + bX

b = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)²

a = Ȳ − bX̄

where X̄ and Ȳ are the means of X and Y.
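A minimal Python sketch of these two formulas, using made-up data (the variable names are mine, not from the slides):

```python
import numpy as np

# Made-up observations for illustration only.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

xbar, ybar = X.mean(), Y.mean()

# b = sum((X - xbar) * (Y - ybar)) / sum((X - xbar)^2)
b = np.sum((X - xbar) * (Y - ybar)) / np.sum((X - xbar) ** 2)
# a = ybar - b * xbar
a = ybar - b * xbar

print(b, a)  # slope and intercept of the least-squares line
```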

Page 12: Warsaw Summer School 2015, OSU Study Abroad Program Regression

Method of Least Squares

The prediction errors, called residuals, are defined as the differences between the observed and predicted values of Y:

E = Y − (a + bX) = Y − Ŷ

The regression line minimizes the sum of squared errors: SSE = Σ(Y − Ŷ)²

Page 13: Warsaw Summer School 2015, OSU Study Abroad Program Regression

Method of Least Squares

The method of least squares provides the prediction equation Ŷ = a + bX having the minimal value of SSE.

The least squares estimates a and b are the values determining the prediction equation for which the sum of squared errors SSE is a minimum.
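One way to see that the least-squares line really minimizes SSE is to perturb a and b and watch SSE grow. A sketch, reusing the made-up data from the previous example:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

# Least-squares estimates, as on the previous slides.
b = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
a = Y.mean() - b * X.mean()

def sse(a_, b_):
    residuals = Y - (a_ + b_ * X)   # E = Y - Yhat
    return np.sum(residuals ** 2)

print(sse(a, b))          # SSE of the least-squares line (the minimum)
print(sse(a + 0.5, b))    # shifting the intercept increases SSE
print(sse(a, b - 0.3))    # changing the slope increases SSE
```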

Page 14: Warsaw Summer School 2015, OSU Study Abroad Program Regression

Covariance

In regression analysis we ask: to what extent can we predict Y knowing our variable X? Prediction is possible when the values of X and Y go together, or co-vary.

Covariance is based on the sum of products, SP:
• SP = Σ(X − X̄)(Y − Ȳ)

The sum of squares for X:
• SSx = Σ(X − X̄)²

Note that in the regression equation of Y on X:
• Ŷ = a + bX
• b = SP / SSx
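The ratio b = SP / SSx is the same slope formula as before; dividing numerator and denominator by (n − 1) also shows it equals cov(X, Y) / var(X). A small check with the same made-up data:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

SP = np.sum((X - X.mean()) * (Y - Y.mean()))   # sum of products
SSx = np.sum((X - X.mean()) ** 2)              # sum of squares for X

print(SP / SSx)                                # slope b = SP / SSx
print(np.cov(X, Y)[0, 1] / np.var(X, ddof=1))  # same value via covariance / variance
```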

Page 15: Warsaw Summer School 2015, OSU Study Abroad Program Regression

Interpretation of b

The slope of the line, b, has the verbal interpretation “rise over run”-- that is, the rise divided by the run. This means that the change in the vertical distance is divided by the change in the horizontal distance.

The steeper the hill, the higher the slope. You go “up” more rapidly than you go over. The line can also have a negative slope.

When there is negative slope, you are going “downhill” rather than “uphill.”

• b > 0: positive relationship
• b < 0: negative relationship
• b = 0: no relationship


Page 17: Warsaw Summer School 2015, OSU Study Abroad Program Regression

Unstandardized and standardized coefficients

If both variables, the IV and the DV, are expressed as z-scores, the constant a equals zero.

We then obtain Beta coefficients, which tell us the following: how many standard deviation units the DV changes when the IV changes by one standard deviation.
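A sketch of this in Python (made-up data; z-scoring done by hand): after standardizing both variables, the fitted intercept is zero and the slope is the Beta coefficient.

```python
import numpy as np

# Made-up IV (X) and DV (Y) values.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

zX = (X - X.mean()) / X.std(ddof=1)   # IV in z-scores
zY = (Y - Y.mean()) / Y.std(ddof=1)   # DV in z-scores

beta = np.sum(zX * zY) / np.sum(zX ** 2)  # slope of zY on zX
a = zY.mean() - beta * zX.mean()          # intercept; zero since both means are 0

print(beta)  # SD change in the DV per one-SD change in the IV
print(a)     # ~0 (up to floating-point error)
```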

Page 18: Warsaw Summer School 2015, OSU Study Abroad Program Regression
Page 19: Warsaw Summer School 2015, OSU Study Abroad Program Regression

Two and more IVs

Ŷ = a + b1X1 + b2X2

Ŷ = β1X1 + β2X2

Ŷ = a + b1X1 + b2X2 + … + bk-1Xk-1 + bkXk

Ŷ = β1X1 + β2X2 + … + βk-1Xk-1 + βkXk
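A minimal sketch of fitting the two-IV equation, assuming made-up data; numpy's least-squares solver stands in for Stata here.

```python
import numpy as np

# Made-up data with two explanatory variables.
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
Y = np.array([3.1, 3.0, 6.2, 5.9, 9.1, 8.8])

# Design matrix: a column of ones for the intercept a, then X1 and X2.
D = np.column_stack([np.ones_like(X1), X1, X2])
coef, *_ = np.linalg.lstsq(D, Y, rcond=None)

a, b1, b2 = coef
print(a, b1, b2)  # Yhat = a + b1*X1 + b2*X2
```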

Page 20: Warsaw Summer School 2015, OSU Study Abroad Program Regression

Coefficients and variables

The estimated parameters b1, b2, …, bk are partial regression coefficients. They are different from the regression coefficients for the bivariate relationships between Y and each explanatory variable.

Three criteria for choosing the number of independent (explanatory) variables:

• (1) Theory

• (2) Parsimony

• (3) Sample size

Page 21: Warsaw Summer School 2015, OSU Study Abroad Program Regression

R2

Coefficient of determination (explained variance) for two variables:

• r² = [SS(total) − SS(error)] / SS(total)

Stata provides the value of the coefficient of determination for the model as a whole:

• R² = [SS(total) − SS(error)] / SS(total)
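Computing r² from its definition, continuing the made-up bivariate example:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

b = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
a = Y.mean() - b * X.mean()
Yhat = a + b * X

ss_total = np.sum((Y - Y.mean()) ** 2)   # SS(total)
ss_error = np.sum((Y - Yhat) ** 2)       # SS(error)

r2 = (ss_total - ss_error) / ss_total
print(r2)  # proportion of variance in Y explained by X
```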

Page 22: Warsaw Summer School 2015, OSU Study Abroad Program Regression

Sum of squares

R² is the proportion of variance explained by X1, X2, …, Xk.

Therefore, 1 − R² is the proportion of unexplained variance.

Page 23: Warsaw Summer School 2015, OSU Study Abroad Program Regression

Adjusted R-square

• Adjusted R-square is a modification of R-square that adjusts for the number of terms in a model. R-square always increases when a new term is added to a model, but adjusted R-square increases only if the new term improves the model more than would be expected by chance.
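The adjustment formula is spelled out later in these slides (in the Stata output notes): Adj R² = 1 − (1 − R²)(N − 1) / (N − k − 1). A one-function sketch with made-up inputs:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-square for n cases and k predictors,
    using the formula quoted in the Stata notes below."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Made-up values: the penalty grows with k relative to n.
print(adjusted_r2(0.98, n=5, k=1))   # slightly below the raw R-square
print(adjusted_r2(0.98, n=5, k=3))   # a larger penalty for more predictors
```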

Page 24: Warsaw Summer School 2015, OSU Study Abroad Program Regression

Sum of Squares

The Regression SUM of SQUARES is defined:

SS(regression) = SS(total) – SS(error)

Page 25: Warsaw Summer School 2015, OSU Study Abroad Program Regression

Mean square

The Regression MEAN SQUARE

MSS(regression) = SS(regression) / df(regression)

df(regression) = k, where k is the number of independent variables

The MEAN SQUARE ERROR

MSS(error) = SS(error) / df(error)

df(error) = n − (k + 1), where n is the number of cases and k is the number of variables.

Page 26: Warsaw Summer School 2015, OSU Study Abroad Program Regression

F

The null hypothesis

H0: b1 = b2 = … = bk = 0

• F = MSS(model) / MSS(error)

The sampling distribution of this statistic is the F-distribution
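A sketch of the F test on the made-up bivariate data (scipy's F distribution supplies the p-value; the data are not from the slides):

```python
import numpy as np
from scipy import stats

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
n, k = len(X), 1   # n cases, k independent variables

b = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
a = Y.mean() - b * X.mean()

ss_error = np.sum((Y - (a + b * X)) ** 2)
ss_total = np.sum((Y - Y.mean()) ** 2)
ss_model = ss_total - ss_error

F = (ss_model / k) / (ss_error / (n - (k + 1)))   # MSS(model) / MSS(error)
p = stats.f.sf(F, k, n - (k + 1))                 # right-tail p-value
print(F, p)
```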

Page 27: Warsaw Summer School 2015, OSU Study Abroad Program Regression

t

The test of H0: bk = 0 evaluates whether Y and Xk are statistically dependent, controlling for the other variables in the model.

We use the t statistic:

• t = b / σb, where σb is the standard error of b

• In the bivariate case, σb = s / √[Σ(X − X̄)²], where s² = SS(error) / (n − 2) is the estimated error variance.
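A bivariate sketch of the t test (same made-up data; the standard-error formula is the bivariate one given above):

```python
import numpy as np
from scipy import stats

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
n = len(X)

b = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
a = Y.mean() - b * X.mean()

ss_error = np.sum((Y - (a + b * X)) ** 2)
s2 = ss_error / (n - 2)                           # estimated error variance
se_b = np.sqrt(s2 / np.sum((X - X.mean()) ** 2))  # standard error of b

t = b / se_b
p = 2 * stats.t.sf(abs(t), df=n - 2)              # two-sided p-value for H0: b = 0
print(t, p)
```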

Page 28: Warsaw Summer School 2015, OSU Study Abroad Program Regression

ANOVA

ANALYSIS OF VARIANCE

• How much of the variance is explained by values of the nominal variable?

• Total sum of squared variation from the mean:

• SS(total) = Σ[X − XM(total)]²

Page 29: Warsaw Summer School 2015, OSU Study Abroad Program Regression

ANOVA

The between-group variation represents the squared deviations of every group mean from the total mean:

• SS(between) = Σ[XM(group) − XM(total)]²

The within-group sum of squares sums the squared deviations of every raw score from its group mean:

• SS(within) = Σ[X − XM(group)]²

Page 30: Warsaw Summer School 2015, OSU Study Abroad Program Regression

ANOVA

Mean Squares:

• MSS(between) = SS(between) / df(between), where df(between) = k − 1

• MSS(within) = SS(within) / df(within), where df(within) = N − k

Here k is the number of groups and N is the total number of cases.

Page 31: Warsaw Summer School 2015, OSU Study Abroad Program Regression

F

F-statistic

• F = MSS(between) / MSS(within)

• The larger the F-value, the greater the impact of group membership on the dependent variable.
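A one-way ANOVA sketch with made-up groups (note the between-group sum runs over every case, so each squared group-mean deviation is counted once per group member):

```python
import numpy as np

# Three made-up groups of scores.
groups = [np.array([2.0, 3.0, 4.0]),
          np.array([5.0, 6.0, 7.0]),
          np.array([8.0, 9.0, 10.0])]

scores = np.concatenate(groups)
grand_mean = scores.mean()
k, N = len(groups), len(scores)

# SS(between): squared deviation of each group mean from the grand mean,
# summed over every case in the group.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# SS(within): squared deviation of each score from its own group mean.
ss_within = sum(np.sum((g - g.mean()) ** 2) for g in groups)

F = (ss_between / (k - 1)) / (ss_within / (N - k))
print(F)
```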

Page 32: Warsaw Summer School 2015, OSU Study Abroad Program Regression

F

Compare:

• F = MSS(between) / MSS(within)   (ANOVA)

• F = MSS(regression) / MSS(error)   (regression ANOVA)

Page 33: Warsaw Summer School 2015, OSU Study Abroad Program Regression

Stata

Page 34: Warsaw Summer School 2015, OSU Study Abroad Program Regression

ANOVA

• Source - Model, Residual, and Total. The Total variance is partitioned into the variance which can be explained by the independent variables (Model) and the variance which is not explained by the independent variables (Residual, sometimes called Error). 

• SS - Sum of Squares associated with the three sources of variance, Total, Model and Residual.

• df - Degrees of freedom associated with the sources of variance. The total variance has N − 1 degrees of freedom. The model degrees of freedom equal the number of estimated coefficients (including the intercept) minus 1. The Residual degrees of freedom are the total DF minus the model DF.

• MS - Mean Squares, the Sum of Squares divided by their respective DF. 

Page 35: Warsaw Summer School 2015, OSU Study Abroad Program Regression

Regression

• Number of observations used in the regression analysis.

• The F-statistic is the Mean Square Model divided by the Mean Square Residual. The numbers in parentheses are the Model and Residual degrees of freedom.

• Prob > F - This is the p-value associated with the above F-statistic.  It is used in testing the null hypothesis that all of the model coefficients are 0.

• R-squared - R-Squared is the proportion of variance in the dependent variable which can be explained by the independent variables. 

• Adj R-squared - This is an adjustment of the R-squared that penalizes the addition of extraneous predictors to the model. Adjusted R-squared is computed as 1 − (1 − R²)(N − 1) / (N − k − 1), where k is the number of predictors.

• Root MSE - Root MSE is the standard deviation of the error term; it is the square root of the Mean Square Residual (the Mean Squared Error).