7/29/2019 Regression and Multivariate analysis
Least Squares Regression and Multiple Regression
Regression: A Simplified Example
X (predictor)   Y (criterion)
3               14
4               18
2               10
1               6
5               22
3               14
6               26
Let's find the best-fitting equation for predicting new, as yet unknown scores on Y from scores on X. The regression equation takes the form Y = a + bX + e, where Y is the dependent or criterion variable we're trying to predict; a is the intercept, or the point where the regression line crosses the Y axis; X is the independent or predictor variable; b is the weight by which we multiply the value of X (it is the slope of the regression line: how many units Y increases or decreases for every unit change in X); and e is an error term (basically an estimate of how much our prediction is off). a and b are often called regression coefficients. When Y is an estimated value it is usually symbolized as Ŷ (read "Y-hat").
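The slides use SPSS throughout; as a cross-check, the least-squares coefficients for the example data can be computed directly. This is a Python sketch, not part of the original slides:

```python
import numpy as np

# Example data from the slide (X = predictor, Y = criterion)
x = np.array([3, 4, 2, 1, 5, 3, 6], dtype=float)
y = np.array([14, 18, 10, 6, 22, 14, 26], dtype=float)

# Closed-form least-squares estimates for Y = a + bX:
#   b = cov(X, Y) / var(X),   a = mean(Y) - b * mean(X)
b = np.cov(x, y, bias=True)[0, 1] / np.var(x)
a = y.mean() - b * x.mean()
print(a, b)  # -> 2.0 4.0
```

These are exactly the a = 2 and b = 4 that SPSS reports for this data set later in the deck.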
Finding the Regression Line with SPSS
First let's use a scatterplot to visualize the relationship between X and Y. The first thing we notice is that the points appear to form a straight line, and that as X gets larger, Y gets larger, so it would appear that we have a strong, positive relationship between X and Y. Based on the way the points seem to fall, what do you think the value of Y would be for a person who obtained a score of 7 on X?
[Scatterplot: Y (0-30) plotted against X (0-7)]
Fitting a Line to the Scatterplot
Next let's fit a line to the scatterplot. Note that the points appear to be fit well by the straight line, and that the line crosses the Y axis (at the point called the intercept, or the constant a in our regression equation) at about the point y = 2. So it's a good guess that our regression equation will be something like y = 2 + some positive multiple of X, since the values of Y look to be about 4-5 times the size of X.
[Scatterplot: Y (0-30) plotted against X (0-7), with fitted regression line]
The Least Squares Solution to Finding the Regression Equation
Mathematically, the regression equation is the combination of constant a and weights b on the predictors (the Xs) which minimizes the sum, across all subjects, of the squared differences between their predicted scores (i.e., the scores they would get if the regression equation were doing the predicting) and their obtained scores (their actual scores) on the criterion Y. That is, it minimizes the error sum of squares, or residuals. This is known as the least squares solution.

The correlation between the obtained scores on the criterion or dependent variable, Y, and the scores predicted by the regression equation is expressed in the correlation coefficient, r, or, in the case of more than one independent variable, R.* Alternatively, it expresses the correlation between Y and the weighted combination of predictors. R ranges from zero to 1.
*SPSS uses R in the regression output even if there is only one predictor.
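To make the "minimizes the error sum of squares" idea concrete, here is a small check (a Python sketch, not from the slides): the least-squares line has a smaller SSE than any competing line.

```python
import numpy as np

x = np.array([3, 4, 2, 1, 5, 3, 6], dtype=float)
y = np.array([14, 18, 10, 6, 22, 14, 26], dtype=float)

def sse(a, b):
    """Error sum of squares (sum of squared residuals) for the line Y = a + bX."""
    return float(np.sum((y - (a + b * x)) ** 2))

print(sse(2.0, 4.0))  # least-squares line: 0.0 (a perfect fit for these data)
print(sse(2.0, 3.5))  # any other slope gives a larger SSE: 25.0
```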
Using SPSS to Calculate the Regression Equation
Download the data file simpleregressionexample.sav and open it in SPSS.

In Data Editor, go to Analyze / Regression / Linear, move X into the Independent box (in regression the independent variables are the predictor variables), move Y into the Dependent box, and click OK. The dependent variable, Y, is the one for which we are trying to find an equation that will predict new cases of Y given that we know X.
http://www-rcf.usc.edu/~mmclaugh/550x/DataFiles/simpleregressionexample.sav
Obtaining the Regression Equation from the SPSS Output
Coefficients(a)

Model 1        B       Std. Error   Beta    t    Sig.
(Constant)     2.000   .000                 .    .
X              4.000   .000         1.000   .    .

(B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.)
a. Dependent Variable: Y
This table gives us the regression coefficients. Look in the column called Unstandardized Coefficients. There are two values provided. The first one, labeled the Constant, is the intercept a, or the point at which the regression line crosses the Y axis. The second one, X, is the unstandardized regression weight, or the b from our regression equation. So this output tells us that the best-fitting equation for predicting Y from X is Y = 2 + (4)X. Let's check that out with a known value of X and Y. According to the equation, if X is 3, Y should be 2 + 4(3), or 14. How about when X = 5?
X Y
3 14
4 18
2 10
1 6
5 22
3 14
6 26
The constant representing the intercept is the value that the dependent variable would take when all the predictors are at a value of zero. In some treatments this is called B0 instead of a.
What is the Regression Equation when the Scores are in Standard (Z) Units?

When the scores on X and Y have been converted to Z scores, the intercept disappears (because the two sets of scores are expressed on the same scale) and the equation for predicting Y from X just becomes ZY = Beta(ZX), where Beta is the standardized coefficient reported in your SPSS regression procedure output.
Coefficients(a)

Model 1        B       Std. Error   Beta    t    Sig.
(Constant)     2.000   .000                 .    .
X              4.000   .000         1.000   .    .

a. Dependent Variable: Y
In the bivariate case, where there is only one X and one Y, the standardized beta weight will equal the correlation coefficient. Let's confirm this by seeing what would happen if we convert our raw scores to Z scores.
Regression Equation for Z scores
In SPSS I have converted X and Y to two new variables, ZX and ZY, expressed in standard score units. You achieve this by going to Analyze / Descriptive Statistics / Descriptives (don't do this now), moving the variables you want to convert into the Variables box, and selecting Save standardized values as variables. This creates the new variables expressed as Z scores. Note that if you rerun the linear regression analysis that we just did on the raw scores, in the output for the regression equation for predicting the standard scores on Y the constant has dropped out and the equation is now of the form ZY = Beta(ZX), where Beta is equal to 1. In this case the Z scores are identical on X and Y, although they certainly wouldn't always be.
Coefficients(a)

Model 1        B       Std. Error   Beta    t    Sig.
(Constant)     .000    .000                 .    .
Zscore(X)      1.000   .000         1.000   .    .

a. Dependent Variable: Zscore(Y)
Correlations

                                  Zscore(Y)   Zscore(X)
Zscore(Y)  Pearson Correlation    1           1.000**
           Sig. (2-tailed)        .           .
           N                      7           7
Zscore(X)  Pearson Correlation    1.000**     1
           Sig. (2-tailed)        .           .
           N                      7           7

**. Correlation is significant at the 0.01 level (2-tailed).
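The claim above, that in the bivariate case the standardized beta equals the Pearson r, can be checked numerically. This Python sketch standardizes the way SPSS's "save standardized values as variables" does (sample SD, ddof=1):

```python
import numpy as np

x = np.array([3, 4, 2, 1, 5, 3, 6], dtype=float)
y = np.array([14, 18, 10, 6, 22, 14, 26], dtype=float)

# Standardize using the sample standard deviation, as SPSS does
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)

# Slope of the regression of ZY on ZX (the intercept is exactly zero)
beta = np.cov(zx, zy, ddof=1)[0, 1] / np.var(zx, ddof=1)
r = np.corrcoef(x, y)[0, 1]
print(beta, r)  # both 1.0 (up to floating-point precision) for these data
```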
Meaning of Regression Weights

The regression weights or regression coefficients (the raw-score b's and the standardized Betas) can be interpreted as expressing the unique contribution of a variable: you can say they represent the amount of change in Y that you can expect to occur per unit change in Xi, where Xi is the ith variable in the predictive equation, when statistical control has been achieved for all of the other variables in the equation.

Let's consider an example from the raw-score regression equation Y = 2 + (b)X, where the weight b is 4: Y = 2 + (4)X. In predicting Y, what the weight b means is that for every unit change in X, Y increases by 4 units. Consider the data from this table and verify that this is the case. For example, if X = 1, Y = 6. Now make a unit change in X, so that X is 2, and Y becomes equal to 10. Increase X by a further unit to 3, and Y becomes equal to 14. Increase X by another unit to 4, and Y becomes equal to 18. So each unit change in X changes Y by 4 units (the value of the b weight). If the b weight were negative (e.g., y = 2 − bx), the value of y would decrease by 4 units for every unit increase in X.
X Y
3 14
4 18
2 10
1 6
5 22
3 14
6 26
Finding the Regression Equation for Some Real-World Data
Download the World95.sav data file and open it in SPSS Data Editor. We are going to find the regression equation for predicting the raw (unstandardized) scores on the dependent variable, Average Female Life Expectancy (Y), from Daily Calorie Intake (X). Another way to say this is that we are trying to find the regression of Y on X.
Go to Graphs / Chart Builder / OK. Under Choose From, select Scatter/Dot (top leftmost icon) and double-click to move it into the preview window. Drag Daily Calorie Intake onto the X axis box. Drag Average Female Life Expectancy onto the Y axis box and click OK. In the Output Viewer, double-click on the chart to bring up the Chart Editor; go to Elements and select Fit Line at Total, then select Linear and click Close.
http://www-rcf.usc.edu/~mmclaugh/550x/DataFiles/World95.sav
Scatterplot of Relationship between Female Life Expectancy and Daily Caloric Intake
From the scatterplot it would appear that there is a strong positive correlation between X and Y (as daily caloric intake increases, life expectancy increases), and X can be expected to be a good predictor of as-yet unknown cases of Y. Note, however, that there is a lot of scatter about the line, and we may need additional predictors to soak up some of the variance left over after this particular X has done its work. (Also consider loess regression: in the loess method, weighted least squares is used to fit linear or quadratic functions of the predictors at the centers of neighborhoods. The radius of each neighborhood is chosen so that the neighborhood contains a specified percentage of the data points.)
Finding the Regression Equation
Go to Analyze / Regression / Linear.

Move the Average Female Life Expectancy variable into the Dependent box and the Daily Calorie Intake variable into the Independent box.

Under Options, make sure Include constant in equation is checked and click Continue.

Under Statistics, check Estimates, Confidence intervals, and Model fit. Click Continue and then OK.

Compare your output to the next slide.
Interpreting the SPSS Regression Output
From your output you can obtain the regression equation for predicting Average Female Life Expectancy from Daily Calorie Intake. The equation is Y = 25.904 + .016X + e, where e is the error term. Thus for a country where the average daily calorie intake is 3000 calories, the average female life expectancy is about 25.904 + (.016)(3000), or 73.904 years. This is a raw-score regression equation.

If the data were expressed in standard scores, the equation would be ZY = .775ZX + e, and .775 is also the correlation between X and Y. This is a standard-score regression equation.
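The worked prediction can be reproduced with the coefficients from this output (a trivial Python check):

```python
# Coefficients from the SPSS output on this slide
a, b = 25.904, 0.016

# Predicted average female life expectancy for a 3000-calorie country
calories = 3000
y_hat = a + b * calories
print(y_hat)  # -> 73.904
```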
Coefficients(a)

Model 1                B        Std. Error   Beta   t        Sig.   95% CI for B: Lower   Upper
(Constant)             25.904   4.175               6.204    .000   17.583                34.225
Daily calorie intake   .016     .001         .775   10.491   .000   .013                  .019

a. Dependent Variable: Average female life expectancy
The B weights (a and b) are called unstandardized partial regression coefficients or weights. The Beta value is a standardized partial regression coefficient, or beta weight. The significance test of the constant is of little use: it just says that the constant differs significantly from zero (i.e., when X is zero, Y is not zero).
More Information from the SPSS Regression Output
There are some other questions we could ask about this regression:

(1) Is the regression equation a significant predictor of Y? (That is, is it good enough to reject the null hypothesis, which is more or less that the mean of Y is the best predictor of any given obtained Y?) To find this out we consult the ANOVA output which is provided and look for a significant value of F. In this case the regression equation is significant.

(2) How much of the variation in Y can be explained by the regression equation? To find this out we look for the value of R Square, which is .601.
ANOVA(b)

Model 1      Sum of Squares   df   Mean Square   F         Sig.
Regression   5792.910         1    5792.910      110.055   .000(a)
Residual     3842.477         73   52.637
Total        9635.387         74

a. Predictors: (Constant), Daily calorie intake
b. Dependent Variable: Average female life expectancy

Model Summary

Model 1   R         R Square   Adjusted R Square   Std. Error of the Estimate
          .775(a)   .601       .596                7.255

a. Predictors: (Constant), Daily calorie intake

Residual SS is the sum of squared deviations of the known values of Y from the predicted values of Y based on the equation. Regression SS is the sum of the squared deviations of the predicted variable about its mean.
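R Square can be recovered from the ANOVA table's sums of squares (a quick Python check; the values are those printed in the output above):

```python
# Sums of squares from the one-predictor ANOVA table
ss_regression = 5792.910
ss_total = 9635.387

# R Square = proportion of Y's variation explained by the regression
r_square = ss_regression / ss_total
print(round(r_square, 3))  # -> 0.601, matching the Model Summary
```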
How Much Error do We Have?
Just how good a job will our regression equation do in predicting new cases of Y? As it happens, the greater the departure of the obtained Y scores from the location where the regression equation predicted they should be, the larger the error.

If you created a distribution of all the errors of prediction (what are called the residuals, or the differences between observed and predicted scores for each case), the standard deviation of this distribution would be the standard error of estimate.

The standard error of estimate can be used to put confidence intervals or prediction intervals around predicted scores, to indicate the interval within which they might fall with a certain level of confidence (such as 95%).
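The standard error of estimate reported in the Model Summary can be recovered from the ANOVA table (a Python sketch; the residual SS and df come from the one-predictor output shown earlier):

```python
import math

# Residual sum of squares and residual df for the one-predictor model
ss_residual, df_residual = 3842.477, 73

# SEE = square root of the residual mean square
see = math.sqrt(ss_residual / df_residual)
print(round(see, 3))  # -> 7.255, the "Std. Error of the Estimate"
```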
Confidence Intervals in Regression
Coefficients(a)

Model 1                B        Std. Error   Beta   t        Sig.   95% CI for B: Lower   Upper
(Constant)             25.904   4.175               6.204    .000   17.583                34.225
Daily calorie intake   .016     .001         .775   10.491   .000   .013                  .019

a. Dependent Variable: Average female life expectancy
Look at the columns headed 95% Confidence Interval for B. These columns put confidence intervals, based on the standard errors of the coefficients, around the regression coefficients a and b. Thus, for example, in the table we can say with 95% confidence that the value of the constant a lies somewhere between 17.583 and 34.225, and the value of the (unstandardized) regression coefficient b lies somewhere between .013 and .019.
Model Summary

Model 1   R         R Square   Adjusted R Square   Std. Error of the Estimate
          .775(a)   .601       .596                7.255

a. Predictors: (Constant), Daily calorie intake
Looking at the Model Summary we can see that R (which in the bivariate case is also the standardized version of b) is .775. Thus, if ZX is the Z score corresponding to a particular calorie level, predicted life expectancy is .775(ZX), plus or minus roughly one standard error of estimate (7.255 years).

SEE = the SD of Y multiplied by the square root of the coefficient of nondetermination (1 − r²). It says what an error standard score of 1 is equal to in terms of Y units.
Multivariate Analysis
Multivariate analysis is a term applied to a related set of statistical techniques which seek to assess, and in some cases summarize or make more parsimonious, the relationships among a set of independent variables and a set of dependent variables.

Multivariate analysis seeks to answer questions such as:

Is there a linear combination of personal and intellectual traits that will maximally discriminate between people who will successfully complete the freshman year of college and people who drop out? What linear combination of characteristics of the tax return and the taxpayer best distinguishes between those whom it would and would not be worthwhile to audit? (Discriminant Analysis)

What are the underlying factors of a 94-item statistics test, and how can a more parsimonious measure of statistical knowledge be achieved? (Factor Analysis)

What are the effects of gender, ethnicity, and language spoken in the home, and their interaction, on a set of ten socio-economic status indicators? Even if none of these is significant by itself, will their linear combination yield significant effects? (MANOVA, Multiple Regression)
More Examples of Multivariate Analysis Questions
What are the underlying dimensions of judgment in a set of similarity and/or preference ratings of political candidates? (Multidimensional Scaling)

What is the incremental contribution of each of ten predictors of marital happiness? Should all of the variables be kept in the prediction equation? What is the maximum accuracy of prediction that can be achieved? (Stepwise Multiple Regression Analysis)

How do a set of univariate measures of nonverbal behavior combine to predict ratings of communicator attractiveness? (Multiple Regression)

What is the correlation between a set of measures assessing the attractiveness of a communicator and a second set of measures assessing the communicator's verbal skills? (Canonical Correlation)
An Example (sort of) of Multivariate Analysis: Multiple Regression
A good place to start in learning about multivariate analysis is with multiple regression. Perhaps it is not, strictly speaking, a multivariate procedure, since although there are multiple independent variables there is only one dependent variable.

Canonical correlation is perhaps a more classic multivariate procedure, with multiple dependent and independent variables.

Multiple regression is a relative of simple bivariate or zero-order correlation (two interval-level variables).

In multiple regression, the investigator is concerned with predicting a dependent or criterion variable from two or more independent variables. The regression equation (raw-score version) takes the form Y = a + b1X1 + b2X2 + b3X3 + ... + bnXn + e.

One motivation for doing this is to be able to predict the scores on cases for which measurements have not yet been obtained or might be difficult to obtain. The regression equation can be used to classify, rate, or rank new cases.
Coding Categorical Variables in Regression
In multiple regression, both the independent or predictor variables and the dependent or criterion variables are usually continuous (interval- or ratio-level measurement), although sometimes there will be concocted or dummy independent variables which are categorical (e.g., men and women are assigned scores of one or two on a dummy gender variable; or, for more categories, K−1 dummy variables are used, where 1 means "has the property" and 0 means "doesn't have the property").

Consider the race variable from one of our data sets, which has three categories: White, African-American, and Other. To code this variable for multiple regression, you create two dummy variables, White (Caucasian) and African-American. Each subject will get a score of either 1 or 0 on each of the two variables.
                               Caucasian   African-American
Subject 1 (Caucasian)          1           0
Subject 2 (African-American)   0           1
Subject 3 (Other)              0           0
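The dummy coding in the table can be produced programmatically; here is a sketch using pandas (the lowercase column names are illustrative, not from the slides):

```python
import pandas as pd

# Three subjects with a three-category race variable, as in the slide
df = pd.DataFrame({"race": ["White", "African-American", "Other"]})

# K - 1 = 2 dummy variables; "Other" is the reference category scored (0, 0)
df["white"] = (df["race"] == "White").astype(int)
df["african_american"] = (df["race"] == "African-American").astype(int)

print(df[["white", "african_american"]].values.tolist())
# -> [[1, 0], [0, 1], [0, 0]]
```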
Coding Categorical Variables in Regression, cont'd
You can use this same type of procedure to code assignment to levels of a treatment in an experiment, and thus you can use a factor from an experiment, such as interviewer status, as a predictor variable in a regression. For example, if you had an experiment with three levels of interviewer attire, you would create one dummy variable for the high-status attire condition and one for the medium-status attire condition. People in the low-status attire condition would get 0, 0 on both variables, where high-status condition subjects would get 1, 0 and medium-status condition subjects would get 0, 1 on the two variables, respectively.

                                             High Status   Medium Status
Subject 1 (high-status attire condition)     1             0
Subject 2 (medium-status attire condition)   0             1
Subject 3 (low-status attire condition)      0             0
Regression and Prediction
Most regression analyses look for a linear relationship between predictors and criterion, although nonlinear trends can be explored through regression procedures as well.

In multiple regression we attempt to derive an equation which is the weighted sum of two or more variables. The equation tells you how much weight to place on each of the variables to arrive at the optimal predictive combination.

The equation that is arrived at is the best combination of predictors for the sample from which it was derived. But how well will it predict new cases? Sometimes the regression equation is tested against a new sample of cases to see how well it holds up. The first sample is used for the derivation study (to derive the equation) and a second sample is used for cross-validation. If the second sample was part of the original sample, reserved for just this cross-validation purpose, then it is called a hold-out sample.
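The derivation/cross-validation idea can be sketched as follows (Python, with simulated data; the sample sizes and the true relation are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated sample: 100 cases, one predictor, true relation Y = 2 + 4X + noise
x = rng.normal(size=100)
y = 2 + 4 * x + rng.normal(size=100)

# Reserve 30 cases as a hold-out sample
order = rng.permutation(100)
derive, holdout = order[:70], order[70:]

# Derive the regression equation on the derivation sample only
b, a = np.polyfit(x[derive], y[derive], 1)

# Cross-validate: correlate predicted with obtained Y in the hold-out sample
y_hat = a + b * x[holdout]
r_cv = np.corrcoef(y_hat, y[holdout])[0, 1]
print(round(r_cv, 2))  # close to, but typically below, the derivation-sample R
```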
Simultaneous Multiple Regression Analysis
One of the most important notions in multiple regression analysis is the notion of statistical control, that is, mathematical operations to remove the effects of potentially confounding or "third" variables from the relationship between a predictor or IV and a criterion or DV. Terms you might hear which refer to this include:

Partialing
Controlling for
Residualizing
Holding constant
Meaning of Regression Weights
In multiple regression, when you have multiple predictors of the same dependent or criterion variable Y, the standardized regression coefficient Beta1 expresses the independent contribution of X1 to predicting variable Y when the effects of the other variables X2 through Xn are not a factor (have been statistically controlled for), and similarly for weights Beta2 through Betan.

These regression weights or coefficients can be tested for statistical significance, and it may then be possible to state with 95% (or 99%) confidence that the magnitude of a coefficient differs from zero, and thus that that particular predictor makes a contribution to predicting the criterion or dependent variable, Y, that is unrelated to the contribution of any of the other predictors.
Tests of the Predictors
The magnitude of the raw-score weights (usually symbolized by b1, b2, etc.) cannot be directly compared, since they are associated with variables that (usually) have different units of measurement.

It is common practice to compare the standardized regression weights (Beta1, Beta2, etc.) and make claims about the relative importance of the unique contribution of each predictor variable to predicting the criterion. It is also possible to do tests for the significance of the difference between two predictors: is one a significantly better predictor than the other?

These coefficients vary from sample to sample, so it's not prudent to generalize too much about the relative ability of two predictors to predict.

It's also the case that, in the context of the regression equation, the variable which is a good predictor is not the original variable, but rather a residualized version for which the effects of all the other variables have been held constant. So the magnitude of its contribution is relative to the other variables, and only holds for this particular combination of variables included in the predictive equation.
How Do We Find the Regression Weights (Beta Weights)?
Although this is not how SPSS would calculate them, we can get the Beta weights from the zero-order (pairwise) correlations between Y and the various predictor variables X1, X2, etc., and the intercorrelations among the latter.

Suppose we want to find the beta weights for an equation Y = Beta1X1 + Beta2X2. We need three correlations: the correlation between Y and X1, the correlation between Y and X2, and the correlation between X1 and X2.
How Do We Find the Regression Weights (Beta Weights)?, cont'd
Let's suppose we have the following data: r for Y and X1 = .776; r for Y and X2 = .869; and r for X1 and X2 = .682.

The formula for the standardized partial regression weight for X1 with the effects of X2 removed* is

Beta(X1Y.X2) = [r(X1Y) − r(X2Y) · r(X1X2)] / [1 − r²(X1X2)]

Substituting the correlations we already have into the formula, we find that the beta weight for the predictive effect of variable X1 on Y is equal to [.776 − (.869)(.682)] / [1 − (.682)²] = .342. To compute the second weight, Beta(X2Y.X1), we just switch the first and second terms in the numerator. Now let's see that in the context of an SPSS-calculated multiple regression.

*Read this as the Beta weight for the regression of Y on X1 when the effects of X2 have been removed.
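The hand calculation above can be scripted (Python; note the small rounding differences that come from using three-decimal correlations rather than SPSS's full precision):

```python
# Zero-order correlations given on the slide
r_yx1, r_yx2, r_x1x2 = 0.776, 0.869, 0.682

# Standardized partial regression weight for X1 with X2 controlled
beta1 = (r_yx1 - r_yx2 * r_x1x2) / (1 - r_x1x2 ** 2)
# For X2's weight, swap the first two terms in the numerator
beta2 = (r_yx2 - r_yx1 * r_x1x2) / (1 - r_x1x2 ** 2)

print(round(beta1, 3), round(beta2, 3))
# -> 0.343 0.635 (SPSS reports .342 and .636 from unrounded correlations)
```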
Multiple Regression using SPSS
Suppose we think that the ability of Daily Calorie Intake to predict Female Life Expectancy is not adequate, and we would like to achieve a more accurate prediction. One way to do this is to add additional variables to the equation and conduct a multiple regression analysis.

Suppose we have a suspicion that literacy rate might also be a good predictor, not only as a general measure of the state of the country's development but also as an indicator of the likelihood that individuals will have the wherewithal to access health and medical information. We have no particular reason to assume that literacy rate and calorie consumption are correlated, so we will assume for the moment that they will have separate and additive effects on female life expectancy.

Let's add literacy rate (People who Read %) as a second predictor (X2), so the equation we are now looking for is Y = a + b1X1 + b2X2, where Y = Female Life Expectancy, Daily Calorie Intake is X1, and Literacy Rate is X2.
Multiple Regression using SPSS: Steps to Set Up the Analysis
Download the World95.sav data file and open it in SPSS Data Editor.

In Data Editor go to Analyze / Regression / Linear and click Reset.

Put Average Female Life Expectancy into the Dependent box.

Put Daily Calorie Intake and People who Read % into the Independents box.

Under Statistics, select Estimates, Confidence Intervals, Model Fit, Descriptives, Part and Partial Correlations, R Squared Change, Collinearity Diagnostics, and click Continue.

Under Options, check Include Constant in the Equation, click Continue and then OK.

Compare your output to the next several slides.
http://www-rcf.usc.edu/~mmclaugh/550x/DataFiles/World95.sav
Interpreting Your SPSS Multiple Regression Output
Correlations (N = 74 for all cells; Sig. 1-tailed = .000 for all off-diagonal entries)

Pearson Correlation              Average female     Daily calorie   People who
                                 life expectancy    intake          read (%)
Average female life expectancy   1.000              .776            .869
Daily calorie intake             .776               1.000           .682
People who read (%)              .869               .682            1.000
First let's look at the zero-order (pairwise) correlations between Average Female Life Expectancy (Y), Daily Calorie Intake (X1), and People who Read (X2). Note that these are .776 for Y with X1 (rYX1), .869 for Y with X2 (rYX2), and .682 for X1 with X2 (rX1X2).
Examining the Regression Weights
Above are the raw (unstandardized) and standardized regression weights for the regression of female life expectancy on daily calorie intake and percentage of people who read. Consistent with our hand calculation, the standardized regression coefficient (beta weight) for daily caloric intake is .342. The beta weight for percentage of people who read is much larger, .636. What this weight means is that for every increase of one standard deviation on the people who read variable, Y (female life expectancy) will increase by .636 standard deviations. Note that both beta coefficients are significant at p < .001.
Coefficients(a)

Model 1                B        Std. Error   Beta   t       Sig.   95% CI: Lower   Upper    Zero-order   Partial   Part   Tolerance   VIF
(Constant)             25.838   2.882               8.964   .000   20.090          31.585
People who read (%)    .315     .034         .636   9.202   .000   .247            .383     .869         .738      .465   .535        1.868
Daily calorie intake   .007     .001         .342   4.949   .000   .004            .010     .776         .506      .250   .535        1.868

a. Dependent Variable: Average female life expectancy
R, R Square, and the SEE
Model Summary

Model 1   R         R Square   Adjusted R Square   Std. Error of the Estimate   R Square Change   F Change   df1   df2   Sig. F Change
          .905(a)   .818       .813                4.948                        .818              159.922    2     71    .000

a. Predictors: (Constant), People who read (%), Daily calorie intake
Above is the model summary, which has some important statistics. It gives us R and R Square for the regression of Y (female life expectancy) on the two predictors. R is .905, which is a very high correlation. R Square tells us what proportion of the variation in female life expectancy is explained by the two predictors: a very high .818. It also gives us the standard error of estimate, which we can use to put confidence intervals around the unstandardized regression coefficients.
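Both R Square and the SEE in this summary can be recovered from the two-predictor ANOVA sums of squares reported in this same output (a Python check):

```python
import math

# Two-predictor model: regression and residual sums of squares from the ANOVA
ss_regression, ss_residual = 7829.451, 1738.008
ss_total = ss_regression + ss_residual        # 9567.459
df_residual = 71                              # n - k - 1 = 74 - 2 - 1

r_square = ss_regression / ss_total
see = math.sqrt(ss_residual / df_residual)
print(round(r_square, 3), round(see, 3))  # -> 0.818 4.948
```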
F Test for the Significance of the Regression Equation
ANOVA(b)

Model 1      Sum of Squares   df   Mean Square   F         Sig.
Regression   7829.451         2    3914.726      159.922   .000(a)
Residual     1738.008         71   24.479
Total        9567.459         73

a. Predictors: (Constant), People who read (%), Daily calorie intake
b. Dependent Variable: Average female life expectancy
Next we look at the F test of the significance of the regression equation, which in standardized form is ZY = .342 ZX1 + .636 ZX2. Is this so much better a predictor of female life expectancy (Y) than simply using the mean of Y that the difference is statistically significant? The F test is a ratio of the mean square for the regression equation to the mean square for the residual (the departures of the actual scores on Y from what the regression equation predicted). In this case we have a very large value of F, which is significant at p < .001.
Confidence Intervals around the Regression Weights
Coefficients(a)

Model 1                B        Std. Error   Beta   t       Sig.   95% CI: Lower   Upper    Zero-order   Partial   Part
(Constant)             25.838   2.882               8.964   .000   20.090          31.585
Daily calorie intake   .007     .001         .342   4.949   .000   .004            .010     .776         .506      .250
People who read (%)    .315     .034         .636   9.202   .000   .247            .383     .869         .738      .465

a. Dependent Variable: Average female life expectancy
Finally, your output provides confidence intervals around the unstandardized regression coefficients. Thus we can say with 95% confidence that the unstandardized weight to apply to daily calorie intake to predict female life expectancy lies between .004 and .010, and that the unstandardized weight to apply to percentage of people who read lies between .247 and .383.
Multicollinearity
One of the requirements for a mathematical solution to the multiple regression problem is that the predictors or independent variables not be highly correlated.

If in fact two predictors are perfectly correlated, the analysis cannot be completed.

Multicollinearity (the case in which two or more of the predictors are too highly correlated) also leads to unstable partial regression coefficients which won't hold up when applied to a new sample of cases.

Further, if predictors are too highly correlated with each other, their shared variance with the dependent or criterion variable may be redundant, and it's hard to tell, using statistical procedures alone, which variable is producing the effect.

Moreover, the regression weights for the predictors would look much like their zero-order correlations with Y if the predictors were independent; if the predictors are highly correlated, this may produce regression weights that don't really reflect the independent contribution to prediction of each of the predictors.
Multicollinearity, cont'd
As a rule of thumb, bivariate zero-order correlations between predictors should not exceed .80. This is easy to check: run a complete analysis of all possible pairs of predictors using the correlation procedure
Also, no predictor should be totally accounted for by a combination of the other predictors
Look at tolerance levels. Tolerance for a predictor variable is equal to 1 - R2 for an equation in which that predictor is regressed on all of the other predictors. If the predictor is highly correlated with (explained by) the combination of the other predictors, it will have a low tolerance, approaching zero, because the R2 will be large
So, zero tolerance = BAD, tolerance near 1 = GOOD in terms of independence of a predictor
The best prediction occurs when the predictors are moderately independent of each other, but each is highly correlated with the dependent (criterion) variable Y
Some interpretive problems resulting from multicollinearity can be resolved using path analysis (see Chapter 3 in Grimm and Yarnold)
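The pairwise screening suggested above (checking that no two predictors correlate beyond .80) can be sketched with numpy on made-up data:

```python
import numpy as np

# Made-up predictor matrix: rows are cases, columns are predictors
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))
X[:, 2] = 0.9 * X[:, 0] + 0.1 * rng.normal(size=100)  # make two columns nearly redundant

R = np.corrcoef(X, rowvar=False)  # matrix of all pairwise zero-order correlations
too_high = [(i, j) for i in range(3) for j in range(i + 1, 3) if abs(R[i, j]) > 0.80]
# too_high flags the nearly redundant pair for inspection
```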
Multicollinearity Issues in our Current SPSS Problem
Correlations

                                           Average female   Daily calorie   People who
                                           life expectancy     intake        read (%)
Pearson        Average female life              1.000            .776           .869
Correlation    expectancy
               Daily calorie intake              .776           1.000           .682
               People who read (%)               .869            .682          1.000
Sig.           Average female life                 .             .000           .000
(1-tailed)     expectancy
               Daily calorie intake              .000              .            .000
               People who read (%)               .000            .000             .
N              (74 cases in every cell)            74              74             74
From our SPSS output we note that the correlation between our two predictors, Daily Calorie Intake (X1) and People who Read (X2), is rX1X2 = .682. This is a pretty high correlation for two predictors to be interpreted independently: it means each explains about half the variation in the other (.682 squared is about .47). If you look at the zero-order correlations of our Y variable, average female life expectancy, with the predictors, you note that the correlation with % people who read is quite high, rYX2 = .869 (and rYX1 = .776 for calorie intake). However, the multiple correlation R for the two-variable combination was .905, which is an improvement.
Multicollinearity Issues in our Current SPSS Problem, cont'd
The table below is excerpted from the more complete table on Slide 32. Look at the tolerance value. Recall that a tolerance near zero means very high multicollinearity (high intercorrelation among the predictors, which is bad). Tolerance is .535 for both variables (since there are only two, the value is the same whichever one is predicting the other)
VIF (variance inflation factor) is a statistic completely redundant with tolerance (it is 1/tolerance). The higher it is, the greater the multicollinearity. When there is no multicollinearity the value of VIF equals 1. Multicollinearity problems have to be dealt with (by getting rid of redundant predictor variables or other means) if VIF approaches 10 (which means that only about 10% of the variance in the predictor in question is not explained by the combination of the other predictors)
In the case of our two predictors, there is some indication of multicollinearity, but not enough to throw out one of the variables
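Tolerance and VIF can be computed directly from the definition given earlier: regress each predictor on all the others, take 1 - R2 as the tolerance, and invert it for VIF. A sketch (the function name is my own):

```python
import numpy as np

def tolerance_and_vif(X):
    """For each column of X, regress it on the remaining columns;
    tolerance = 1 - R^2 and VIF = 1 / tolerance."""
    n, p = X.shape
    results = []
    for j in range(p):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        target = X[:, j]
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1 - resid.var() / target.var()
        results.append((1 - r2, 1 / (1 - r2)))
    return results
```

With only two predictors, tolerance reduces to 1 minus the squared bivariate correlation, so for our pair 1 - .682^2 is about .535, exactly the SPSS value.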
Specification Errors
One type of specification error occurs when the relationship among the variables you are looking at is not linear: for example, you know that Y peaks at high and low levels of one or more predictors (a curvilinear relationship) but you are using linear regression anyhow. Options for nonlinear regression are available and should be used in such a case
Another type of specification error occurs when you have either underspecified or overspecified the model by (a) failing to include all relevant predictors (for example, including weight but not height in an equation for predicting obesity) or (b) including predictors which are not relevant. Most irrelevant predictors will not even show up in the final regression equation unless you insist on it, but they can affect the results if they are correlated with at least some of the other predictors
For proper specification nothing beats a good theory (as opposed to launching a fishing expedition)
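The curvilinear case is easy to see numerically: fit a straight line to made-up data where Y peaks at high and low levels of X and the fit fails, while adding a squared term captures the relationship:

```python
import numpy as np

x = np.linspace(-3, 3, 50)
y = x ** 2                          # Y peaks at high and low levels of X (U shape)

lin = np.polyfit(x, y, 1)           # straight-line fit: misses the curvature entirely
quad = np.polyfit(x, y, 2)          # fit with an x^2 term: captures it exactly here
lin_resid = y - np.polyval(lin, x)
quad_resid = y - np.polyval(quad, x)
```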
Types of Multiple Regression Analysis
So far we have looked at a standard or simultaneous multipleregression analysis where all of the predictor variables were enteredat the same time, that is, considered in combination with each othersimultaneously
But there are other types of multiple regression analyses which canyield some interesting results
Hierarchical regression analysis refers to the method of regression in which the variables are entered not all simultaneously but rather one at a time or a few at a time, and at each step the correlation of Y, the criterion variable, with the current set of predictors is calculated and evaluated. At each step the change in R square shows the incremental variance in Y accounted for by the most recently entered predictor, over and above the predictors already in the equation.
Tests can be done to determine the significance of the change in Rsquare at each step to see if each newly added predictor makes asignificant improvement in the predictive power of the regressionequation
The order in which variables are entered makes a difference to the outcome. The researcher determines the order on theoretical grounds (the exception is stepwise analysis)
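The F test for the change in R square at a given step has a standard form, sketched below (function name mine). Plugging in the values reported earlier for our example (R = .905 with both predictors, r = .869 for percentage who read alone) suggests that adding calorie intake makes a significant improvement:

```python
from scipy import stats

def r2_change_test(r2_full, r2_reduced, n, k_full, k_added):
    """F test for the increment in R^2 when k_added predictors enter the equation."""
    df2 = n - k_full - 1
    F = ((r2_full - r2_reduced) / k_added) / ((1 - r2_full) / df2)
    return F, stats.f.sf(F, k_added, df2)

F, p = r2_change_test(r2_full=0.905 ** 2, r2_reduced=0.869 ** 2,
                      n=74, k_full=2, k_added=1)
```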
Stepwise Multiple Regression
Stepwise multiple regression is a variant of hierarchical regression in which the order of entry is determined not by the researcher but by empirical criteria
In the forward inclusion version of stepwise regression theorder of entry is determined at each step by calculatingwhich variable will produce the greatest increase in Rsquare (the amount of variance in the dependent variable Yaccounted for) at that step
In the backward elimination version of stepwise multipleregression the analysis starts off with all of the predictors atthe first step and then eliminates them so that eachsuccessive step has fewer predictors in the equation.
Elimination is based on an empirical criterion that is thereverse of that for forward inclusion (the variable thatproduces the smallest decline in R square is removed ateach step)
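A greedy forward-inclusion loop can be sketched in a few lines. This illustrates the idea only, not SPSS's exact entry and removal criteria (which also involve significance thresholds for entering variables):

```python
import numpy as np

def forward_select(X, y, n_steps):
    """At each step, enter the predictor that raises R^2 the most."""
    n, p = X.shape
    chosen, remaining = [], list(range(p))
    for _ in range(n_steps):
        def r2_with(extra):
            cols = np.column_stack([np.ones(n)] + [X[:, j] for j in chosen + [extra]])
            coef, *_ = np.linalg.lstsq(cols, y, rcond=None)
            resid = y - cols @ coef
            return 1 - resid.var() / y.var()
        best = max(remaining, key=r2_with)
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

Backward elimination is the mirror image: start with all predictors in the equation and repeatedly drop the one whose removal costs the least R square.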
Reducing the Overall Level of Type I Error
One of the problems with doing multiple regression is that many significance tests are being conducted simultaneously, but for all practical purposes each test is treated as an independent one even though the data are related. When a large number of tests are done, the likelihood of Type I error (rejecting the null hypothesis when it is in fact true) increases
This is particularly problematic in stepwise regression, with its iterative process of assessing the significance of R square over and over again, not to speak of the significance of the individual regression coefficients
Therefore it is desirable to do something to reduce the increased chance of making Type I errors (finding significant results that aren't there), such as keeping the number of predictors to a minimum to reduce the number of significance tests performed, or dividing the usual alpha level by the number of predictors, or keeping the intercorrelation of the predictors as low as possible (avoiding use of redundant predictors, which would cause you to basically test the significance of the same relationship to Y over and over)
Reducing the Overall Level of Type I Error, cont'd
This may be of particular importance when the researcher is testing a theory which has a network of interlocking claims, such that the invalidation of one of them brings the whole thing tumbling down. An issue of HCR (July 2003) devoted several papers to exploring this question
As mentioned in class before, the Bonferroni procedure is sometimes used, but it's hard to swallow: you divide the usual alpha level of .05 by the number of tests you expect to perform, so if you are conducting thirty tests you have to set your alpha level at .05/30, or about .0017, for each test. With stepwise regression it's not clear in advance how many tests you will have to perform, although you can estimate it from the number of predictor variables you intend to start off with
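The Bonferroni arithmetic itself is a one-liner; with the slide's figures, thirty tests at an overall alpha of .05 give a per-test level of about .0017:

```python
def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level after Bonferroni correction."""
    return alpha / n_tests

per_test = bonferroni_alpha(0.05, 30)  # -> 0.001666..., i.e. about .0017 per test
```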