Warsaw Summer School 2015, OSU Study Abroad Program
Regression
Linear Relationship
A line is a mathematical function that can be expressed through the formula Y = a + bX, where Y and X are our variables.
Y, the dependent variable, is expressed as a linear function of the independent (explanatory) variable X.
Linear Relationship
The constant a = value of Y at the point at which the line Y = a + bX intersects the Y-axis (also called the intercept).
The slope b equals the change in Y for a one-unit increase in X (a one-unit increase in X corresponds to a change of b units in Y). The slope describes the rate of change in Y-values as X increases.
Verbal interpretation of the slope of the line: “Rise over run”: the rise divided by the run (the change in the vertical distance is divided by the change in the horizontal distance).
Cartesian Coordinate System
Variables X, Y and their linear function:
The formula Y = a + bX expresses the dependent (response) variable Y as a linear function of the independent (explanatory) variable X. The formula maps out a straight-line graph with slope b and Y-intercept a.
Basics
Linear Relationship: Y = a + bX
The constant a is the value of Y when X = 0. For X = 0 we have: Y = a + b*0 = a
The constant a is the value of Y where the line Y = a + bX intersects the Y-axis.
The slope b equals the change in Y for a one-unit increase in X. This means that a one-unit increase in X corresponds to a change of b units in Y. Thus, the slope describes the rate of change in the Y-values as X increases. Generally, for any point (X, Y) on the line with X ≠ 0,
b = (Y - a) / X
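The roles of a and b can be checked with a few lines of Python. This is only a sketch: the function name `line` and the values a = 2.0 and b = 0.5 are made up for illustration.

```python
def line(x, a, b):
    """Return Y = a + b*X for intercept a and slope b."""
    return a + b * x

a, b = 2.0, 0.5  # hypothetical intercept and slope

y_at_zero = line(0, a, b)                    # Y at X = 0 is the intercept a
rise = line(5, a, b) - line(4, a, b)         # change in Y for a one-unit increase in X
slope_from_point = (line(4, a, b) - a) / 4   # b = (Y - a) / X, valid for X != 0

print(y_at_zero, rise, slope_from_point)
```

Each of the last two quantities recovers the slope b, and the first recovers the intercept a.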
Model vs Reality
The function Y = a + bX is a model.
In reality, the observed points (X, Y) do not all fall on one line.
The Scattergram and the Least Squares Method
The graphical plot of observed values (X,Y) is called a
- scatter-gram
- scatter-diagram
- scatter-plot.
A regression function is a function that describes how the expected value of the dependent (response) variable Y changes according to the values of an independent (explanatory) variable X.
Regression
This expected value is estimated by a linear function:
• Ŷ = a + bX
Ŷ = predicted value for the dependent variable Y
a = the intercept (the value of Y when X = 0)
b = the regression coefficient (the slope), indicating the amount of change in Y given a unit change in X
X = the independent variable
Regression
Ŷ = a + bX
b = [Σ(X - X̄)(Y - Ȳ)] / Σ(X - X̄)²
a = Ȳ - bX̄
where X̄ and Ȳ are the means of X and Y.
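A minimal Python sketch of these two formulas follows. The function name `least_squares` and the five data points are made up for illustration; only the standard library is used.

```python
from statistics import mean

def least_squares(xs, ys):
    """Return (a, b) for the least squares line Y-hat = a + b*X."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Numerator of b: sum of products of deviations from the means.
    sp = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator of b: sum of squared deviations of X from its mean.
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    b = sp / ss_x
    a = y_bar - b * x_bar
    return a, b

xs = [1, 2, 3, 4, 5]  # made-up values of X
ys = [2, 4, 5, 4, 5]  # made-up values of Y
a, b = least_squares(xs, ys)
print(f"Y-hat = {a:.1f} + {b:.1f} * X")
```

For these numbers the means are 3 and 4, the numerator is 6 and the denominator is 10, so b = 0.6 and a = 4 - 0.6 * 3 = 2.2.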
Method of Least Squares
The prediction errors, called residuals, are defined as the differences between observed and predicted values of Y
E = Y - (a + bX) = Y - Ŷ
The regression line minimizes the sum of squared errors: SSE = Σ(Y - Ŷ)²
Method of Least Squares
The method of least squares provides the prediction equation Ŷ = a + bX having the minimal value of SSE.
The least squares estimates a and b are the values determining the prediction equation for which the sum of squared errors SSE is a minimum.
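The minimum property can be illustrated numerically. The sketch below, on made-up data, computes SSE for the least squares line and for a handful of nearby lines obtained by nudging the intercept and slope; the shift sizes are arbitrary choices.

```python
def sse(xs, ys, a, b):
    """Sum of squared errors for the line Y-hat = a + b*X."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4, 5]  # made-up data
ys = [2, 4, 5, 4, 5]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

# Least squares estimates of slope and intercept.
b_ls = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
        / sum((x - x_bar) ** 2 for x in xs))
a_ls = y_bar - b_ls * x_bar
sse_ls = sse(xs, ys, a_ls, b_ls)

# Nudge intercept and slope; every alternative line should fit worse.
shifts = [(da, db) for da in (-0.5, 0.0, 0.5) for db in (-0.2, 0.0, 0.2)
          if (da, db) != (0.0, 0.0)]
sse_alternatives = [sse(xs, ys, a_ls + da, b_ls + db) for da, db in shifts]

print(sse_ls, min(sse_alternatives))
```

Every alternative line has a larger SSE than the least squares line, which is what the definition requires.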
Covariance
In regression analysis we ask: to what extent can we predict Y knowing our variable X? Prediction means that the values of X and Y go together, or co-vary.
Covariation is measured by the sum of products of deviations, or SP (the sample covariance is SP divided by n - 1):
• SP = Σ(X - X̄)(Y - Ȳ)
Sum of squares for X:
• SSx = Σ(X - X̄)²
Note that in the regression equation of Y on X:
• Ŷ = bX + a
• b = SP / SSx
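The link between SP, SSx, and the slope can be sketched as follows, on made-up data. Dividing SP and SSx by n - 1 gives the sample covariance and the sample variance of X; the n - 1 terms cancel in the ratio, so b is unchanged.

```python
from statistics import mean, variance

xs = [1, 2, 3, 4, 5]  # made-up data
ys = [2, 4, 5, 4, 5]
n = len(xs)
x_bar, y_bar = mean(xs), mean(ys)

sp = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))  # sum of products
ss_x = sum((x - x_bar) ** 2 for x in xs)                      # sum of squares for X

cov_xy = sp / (n - 1)   # sample covariance of X and Y
var_x = ss_x / (n - 1)  # sample variance of X
b = sp / ss_x           # slope; identical to cov_xy / var_x

print(sp, ss_x, b)
```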
Interpretation of b
The slope of the line, b, has the verbal interpretation “rise over run”-- that is, the rise divided by the run. This means that the change in the vertical distance is divided by the change in the horizontal distance.
The steeper the hill, the higher the slope: you go “up” more rapidly than you go over. The line can also have a negative slope.
When there is a negative slope, you are going “downhill” rather than “uphill.”
• b > 0, positive relationship
• b < 0, negative relationship
• b = 0, no linear relationship
Unstandardized and standardized coefficients
If both variables, the IV and the DV, are expressed in z-scores, the constant a equals zero.
We then obtain Beta (standardized) coefficients, which tell us how many standard deviations the DV changes when the IV changes by one standard deviation.
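A small sketch of standardization for the bivariate case, on made-up data. It fits the same line twice, once on raw scores and once on z-scores, and also uses the standard identity Beta = b * (SD of X) / (SD of Y). The helper names are illustrative.

```python
from statistics import mean, stdev

def slope_intercept(xs, ys):
    """Least squares intercept and slope for Y-hat = a + b*X."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b

def z_scores(values):
    """Convert raw scores to z-scores using the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

xs = [1, 2, 3, 4, 5]  # made-up data
ys = [2, 4, 5, 4, 5]

a, b = slope_intercept(xs, ys)                            # unstandardized
a_z, beta = slope_intercept(z_scores(xs), z_scores(ys))   # standardized
beta_from_b = b * stdev(xs) / stdev(ys)                   # same Beta, from b

print(round(a_z, 6), round(beta, 4), round(beta_from_b, 4))
```

With z-scores the constant is zero (up to rounding error), and the two routes to Beta agree.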
Two and more IVs
Ŷ = a + b1X1 + b2X2
Ŷ = β1X1 + β2X2
Ŷ = a + b1X1 + b2X2 + ... + b(k-1)X(k-1) + bkXk
Ŷ = β1X1 + β2X2 + ... + β(k-1)X(k-1) + βkXk
Coefficients and variables
The estimated parameters b1, b2, ..., bk are partial regression coefficients. They are different from the regression coefficients for the bivariate relationships between Y and each explanatory variable.
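The difference between partial and bivariate coefficients can be seen in a two-predictor sketch. The data are made up and constructed so that Y = 1 + 2*X1 + 3*X2 exactly, which makes the result easy to check; the helper `cross` and the closed-form solution of the two normal equations are illustrative choices, not a general-purpose routine.

```python
from statistics import mean

def cross(us, vs):
    """Sum of products of deviations from the means of us and vs."""
    u_bar, v_bar = mean(us), mean(vs)
    return sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))

x1 = [1, 2, 3, 4, 5]                              # made-up predictor
x2 = [2, 1, 4, 3, 6]                              # made-up predictor, correlated with x1
y = [1 + 2 * u + 3 * v for u, v in zip(x1, x2)]   # exact by construction

s11, s22, s12 = cross(x1, x1), cross(x2, x2), cross(x1, x2)
s1y, s2y = cross(x1, y), cross(x2, y)

# Partial coefficients: solve the two normal equations for centered data.
det = s11 * s22 - s12 ** 2
b1 = (s22 * s1y - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det
a = mean(y) - b1 * mean(x1) - b2 * mean(x2)

# Bivariate coefficient of Y on X1 alone, ignoring X2.
b1_bivariate = s1y / s11

print(round(a, 6), round(b1, 6), round(b2, 6), round(b1_bivariate, 6))
```

Here the partial coefficient of X1 is 2, while the bivariate coefficient is 5, because X1 alone also picks up the effect of the correlated X2.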
Three criteria for choosing the number of independent (explanatory) variables:
• (1) Theory
• (2) Parsimony
• (3) Sample size
R²
Coefficient of determination (explained variance) for two variables
• r² = [SS(total) - SS(error)] / SS(total)
• Stata provides a value of the coefficient of determination for
• R² = [SS(total) - SS(error)] / SS(total)
Sum of squares
R² is the proportion of variance in Y explained by X1, X2, ..., Xk.
Therefore, 1 - R² is the proportion of unexplained variance.
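As a sketch, the coefficient of determination can be computed directly from the two sums of squares. The data are made up and the model has a single predictor.

```python
xs = [1, 2, 3, 4, 5]  # made-up data
ys = [2, 4, 5, 4, 5]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

# Least squares line.
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
     / sum((x - x_bar) ** 2 for x in xs))
a = y_bar - b * x_bar

ss_total = sum((y - y_bar) ** 2 for y in ys)                     # variation around the mean
ss_error = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))   # variation around the line

r2 = (ss_total - ss_error) / ss_total  # explained proportion
unexplained = 1 - r2                   # unexplained proportion

print(round(r2, 3), round(unexplained, 3))
```

For these numbers SS(total) = 6 and SS(error) = 2.4, so 60 percent of the variance is explained.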
Adjusted R-square
• Adjusted R-square is a modification of R-square that adjusts for the number of terms in a model. R-square always increases when a new term is added to a model, but adjusted R-square increases only if the new term improves the model more than would be expected by chance.
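The behaviour described above can be sketched with the usual formula, adjusted R-square = 1 - (1 - R-square)(n - 1) / (n - k - 1), for n cases and k predictors. The R-square values below are hypothetical, chosen only to show both outcomes.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-square for n cases and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical models fitted to the same 50 cases.
base = adjusted_r2(0.400, n=50, k=2)
weak_addition = adjusted_r2(0.405, n=50, k=3)    # R-square barely rises
strong_addition = adjusted_r2(0.500, n=50, k=3)  # R-square rises a lot

print(round(base, 4), round(weak_addition, 4), round(strong_addition, 4))
```

Adding a third term that barely raises R-square lowers the adjusted value, while a term that raises R-square substantially increases it.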
Sum of Squares
The Regression SUM of SQUARES is defined:
SS(regression) = SS(total) – SS(error)
Mean square
The Regression MEAN SQUARE
MSS(regression) = SS(regression) / df-v
df-v = k, where k is the number of independent variables
The MEAN SQUARE ERROR
MSS(error) = SS(error) / df-t
df-t = n - (k + 1), where n is the number of cases and k is the number of independent variables.
F
The null hypothesis
H0: b1 = b2 = … = bk = 0
• F = MSS(regression) / MSS(error)
Under H0, the sampling distribution of this statistic is the F-distribution with k and n - (k + 1) degrees of freedom.
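A sketch of the F computation from sums of squares follows. The function name and the input numbers are made up; the p-value lookup in the F-distribution is left out because it needs a statistics library.

```python
def f_statistic(ss_total, ss_error, n, k):
    """F = mean square regression / mean square error, for n cases and k predictors."""
    ss_regression = ss_total - ss_error
    ms_regression = ss_regression / k       # regression df = k
    ms_error = ss_error / (n - (k + 1))     # error df = n - (k + 1)
    return ms_regression / ms_error

# Made-up sums of squares from a model with one predictor and five cases.
f_value = f_statistic(ss_total=6.0, ss_error=2.4, n=5, k=1)
print(round(f_value, 3))
```

Here the regression mean square is 3.6 / 1 and the error mean square is 2.4 / 3, giving F = 4.5.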
t
The test of H0: bk = 0 evaluates whether Y and X are statistically dependent, ignoring other variables.
We use the t statistic
• t = b / σB, where σB is the standard error of b
In the bivariate case, σB = √[ (SS(error) / (n - 2)) / SSx ], where SSx = Σ(X - X̄)², and t has n - 2 degrees of freedom.
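The t statistic for a bivariate slope can be sketched as below, on made-up data. The sketch assumes the usual bivariate standard error, the square root of the mean square error divided by the sum of squares of X.

```python
from math import sqrt

xs = [1, 2, 3, 4, 5]  # made-up data
ys = [2, 4, 5, 4, 5]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

ss_x = sum((x - x_bar) ** 2 for x in xs)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ss_x
a = y_bar - b * x_bar
ss_error = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

mse = ss_error / (n - 2)  # mean square error, df = n - 2
se_b = sqrt(mse / ss_x)   # standard error of the slope (assumed bivariate formula)
t = b / se_b

print(round(se_b, 4), round(t, 4), round(t ** 2, 4))
```

With one predictor, t squared equals the F statistic of the regression, so the two tests agree.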
ANOVA
ANALYSIS OF VARIANCE
• How much of the variance is explained by values of the nominal variable?
• Total sum of squared variation from the mean:
• SS(total) = Σ [X – XM(total)]²
ANOVA
The between-group sum of squares is the sum, over all cases, of the squared deviations of each case's group mean from the total mean:
• SS(between) = Σ [XM(group) – XM(total)]²
The within-group sum of squares is the sum of the squared deviations of every raw score from its group mean:
• SS(within) = Σ [X – XM(group)]²
ANOVA
Mean Squares:
• MSS(between) = SS(between) / df(between), where df(between) = k – 1
• MSS(within) = SS(within) / df(within), where df(within) = N – k
Here k is the number of groups and N is the total number of cases.
F
F-statistic
• F = MSS(between) / MSS(within)
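• The larger the F-value, the larger the differences between group means relative to the variation within groups, and so the greater the impact of group membership on the dependent variable.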
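A one-way ANOVA can be sketched in a few lines. The three groups and their scores are made up; summing over cases is done here by weighting each group's squared deviation by its size.

```python
groups = {            # made-up scores for three groups
    "A": [4, 5, 6],
    "B": [7, 8, 9],
    "C": [1, 2, 3],
}
scores = [x for g in groups.values() for x in g]
grand_mean = sum(scores) / len(scores)
group_means = {name: sum(g) / len(g) for name, g in groups.items()}

# Between: squared deviation of each group mean from the total mean, once per case.
ss_between = sum(len(g) * (group_means[name] - grand_mean) ** 2
                 for name, g in groups.items())
# Within: squared deviation of each raw score from its own group mean.
ss_within = sum((x - group_means[name]) ** 2
                for name, g in groups.items() for x in g)

k, n = len(groups), len(scores)
ms_between = ss_between / (k - 1)
ms_within = ss_within / (n - k)
f_value = ms_between / ms_within

print(ss_between, ss_within, f_value)
```

For these scores SS(between) = 54 and SS(within) = 6, which add up to SS(total) = 60, and F = 27 / 1 = 27.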
F
Compare:
• F = MSS(between) / MSS(within)
• F = MSS(regression) / MSS(error) (Regression ANOVA)
Stata
ANOVA
• Source - Model, Residual, and Total. The Total variance is partitioned into the variance which can be explained by the independent variables (Model) and the variance which is not explained by the independent variables (Residual, sometimes called Error).
• SS - Sum of Squares associated with the three sources of variance, Total, Model and Residual.
• df - Degrees of freedom associated with the sources of variance. The total variance has N-1 degrees of freedom. The model degrees of freedom equal the number of estimated coefficients, including the intercept, minus 1, which is the number of predictors k. The Residual degrees of freedom is the DF total minus the DF model.
• MS - Mean Squares, the Sum of Squares divided by their respective DF.
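The partition described in these bullets can be sketched by building the three rows by hand. The data are made up, the model has one predictor, and the printed layout only imitates the Source / SS / df / MS columns; it is not Stata output.

```python
xs = [1, 2, 3, 4, 5]  # made-up data
ys = [2, 4, 5, 4, 5]
n, k = len(xs), 1     # number of cases, number of predictors
x_bar, y_bar = sum(xs) / n, sum(ys) / n

b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
     / sum((x - x_bar) ** 2 for x in xs))
a = y_bar - b * x_bar

ss_total = sum((y - y_bar) ** 2 for y in ys)
ss_residual = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
ss_model = ss_total - ss_residual

table = {
    "Model":    {"SS": ss_model,    "df": k},
    "Residual": {"SS": ss_residual, "df": n - k - 1},
    "Total":    {"SS": ss_total,    "df": n - 1},
}
for row in table.values():
    row["MS"] = row["SS"] / row["df"]  # mean square = SS / df

print(f"{'Source':<9}{'SS':>8}{'df':>4}{'MS':>8}")
for source, row in table.items():
    print(f"{source:<9}{row['SS']:>8.3f}{row['df']:>4d}{row['MS']:>8.3f}")
```

The Model and Residual rows add up to the Total row for both SS and df, but not for MS.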
Regression
• Number of obs - Number of observations used in the regression analysis.
• F - The F-statistic is the Mean Square Model divided by the Mean Square Residual. The numbers in parentheses are the Model and Residual degrees of freedom.
• Prob > F - This is the p-value associated with the above F-statistic. It is used in testing the null hypothesis that all of the model coefficients are 0.
• R-squared - R-Squared is the proportion of variance in the dependent variable which can be explained by the independent variables.
• Adj R-squared - This is an adjustment of the R-squared that penalizes the addition of extraneous predictors to the model. Adjusted R-squared is computed using the formula 1 - (1 - Rsq)(N - 1) / (N - k - 1), where k is the number of predictors.
• Root MSE - Root MSE is the standard deviation of the error term, and is the square root of the Mean Square Residual (or Mean Squared Error).