Unit 6: The basics of multiple regression
© Judith D. Singer, Harvard Graduate School of Education Unit 6/Slide 1
Unit 6: The basics of multiple regression
© Judith D. Singer, Harvard Graduate School of Education Unit 6/Slide 2
The S-030 roadmap: Where’s this unit in the big picture?
Building a solid foundation
• Unit 1: Introduction to simple linear regression
• Unit 2: Correlation and causality
• Unit 3: Inference for the regression model

Mastering the subtleties
• Unit 4: Regression assumptions: Evaluating their tenability
• Unit 5: Transformations to achieve linearity

Adding additional predictors
• Unit 6: The basics of multiple regression
• Unit 7: Statistical control in depth: Correlation and collinearity

Generalizing to other types of predictors and effects
• Unit 8: Categorical predictors I: Dichotomies
• Unit 9: Categorical predictors II: Polychotomies
• Unit 10: Interaction and quadratic effects

Pulling it all together
• Unit 11: Regression modeling in practice
© Judith D. Singer, Harvard Graduate School of Education Unit 6/Slide 3
In this unit, we’re going to learn about…
• Various representations of the multiple regression model
  – An algebraic representation
  – A three-dimensional graphic representation
  – A two-dimensional graphic representation
• Multiple regression: how it works and helps improve predictions
  – Estimating the parameters of the multiple regression model
  – Holding predictors constant: what does this really mean?
• Plotting the fitted multiple regression model
  – Deciding how to construct the plot
  – Choosing prototypical values
  – Learning how to actually construct the plot (and interpret it correctly!)
• R2 and the Analysis of Variance (ANOVA) in multiple regression
• Inference in multiple regression
  – The omnibus F-test in multiple regression
  – Individual t-tests
• How might we summarize multiple regression results in tables/figures?
• How do we test our regression assumptions?
© Judith D. Singer, Harvard Graduate School of Education Unit 6/Slide 4
US News Peer Ratings of Graduate Schools of Education (GSEs)
RQs: What predicts Peer Ratings?
• This unit (Unit 6): doctoral student characteristics
• Next unit (Unit 7): faculty research productivity
Learn more about the ratings at USNews.com (education school ratings methodology page)
© Judith D. Singer, Harvard Graduate School of Education Unit 6/Slide 5
[Stem-and-leaf display and boxplot of Peer Ratings (stems ×10): ratings run from 280 to 470, centered near 340]
A first look at the data: Peer Ratings, mean GREs and N Doc Grads
Ratings of US Graduate schools of education
ID  School         GRE    DocGrad  PeerRat  USNewsRat
 1  Harvard        6.625     60      450       100
 2  UCLA           5.780     53      410        97
 3  Stanford       6.775     38      470        95
 4  TC             6.045    193      440        92
 5  Vanderbilt     6.605     22      430        88
 6  Northwestern   6.770     10      390        83
 7  Berkeley       6.050     43      440        82
 8  Penn           6.040     61      380        82
 9  Michigan       6.090     38      430        79
10  Madison        5.800    106      430        79
11  NYU            5.960    112      360        77
12  MinneTC        5.750     89      390        73
13  Oregon         6.115     39      340        71
14  MichiganState  5.865     52      420        70
15  Indiana        5.960    110      390        69
16  UTAustin       5.865    102      400        69
17  Washington     5.930     37      370        68
18  Urbana         6.330     50      410        67
19  USC            5.695    119      360        67
20  BC             5.845     42      360        66
...
The UNIVARIATE Procedure
Variable: PeerRat

Mean    344.8276    Std Deviation         45.13190
Median  340.0000    Variance            2036.88853
Mode    300.0000    Range                190.00000
                    Interquartile Range   60.00000
Outcome: Peer Rating
RQs: What doctoral student characteristics predict variation in the peer ratings of GSEs?
Question predictor: Is it quality (GRE scores)?
Control predictor: Is it size (N doc grads)?
n = 87
Plot labels, from highest rating to lowest: Stanford; HGSE; TC, Berkeley; St Johns, Cincinnati, USF
© Judith D. Singer, Harvard Graduate School of Education Unit 6/Slide 6
Examining the predictors: Mean GRE scores and Number of doctoral grads
The UNIVARIATE Procedure
Variable: GRE

Mean    5.578966    Std Deviation        0.42899
Median  5.505000    Variance             0.18403
Mode    5.210000    Range                2.03000
                    Interquartile Range  0.57000
Question Predictor: Mean GRE scores
[Stem-and-leaf display and boxplot of mean GRE scores (stems ×0.1): scores run from 4.7 to 6.8, centered near 5.5]
The UNIVARIATE Procedure
Variable: DocGrad

Mean    45.67816    Std Deviation        33.03293
Median  38.00000    Variance           1091.17428
Mode    18.00000    Range               189.00000
                    Interquartile Range  40.00000
Control Predictor: Number of doctoral graduates
[Stem-and-leaf display and boxplot of number of doctoral graduates (stems ×10): counts run from 4 to 193, with TC (193) flagged as an extreme outlier]
GRE plot labels, high to low: Stanford, Northwestern; HGSE, Vanderbilt; Delaware; Auburn; St John's, Illinois State
DocGrad plot labels, high to low: TC; Georgia; HGSE (60); VA Comm, Cornell; UC Davis, UC Irvine
© Judith D. Singer, Harvard Graduate School of Education Unit 6/Slide 7
Simple linear regression of Peer Ratings on mean GRE scores
The REG Procedure
Dependent Variable: PeerRat
Analysis of Variance
                                Sum of        Mean
Source             DF          Squares      Square    F Value    Pr > F
Model               1            75507       75507      64.40    <.0001
Error              85            99665        1172.53242
Corrected Total    86           175172
Root MSE 34.24226 R-Square 0.4310
Parameter Estimates
                         Parameter     Standard
Variable        DF        Estimate        Error    t Value    Pr > |t|
Intercept        1       -40.51619     48.15953      -0.84      0.4025
GRE              1        69.07083      8.60722       8.02      <.0001

P̂eerRat = -40.52 + 69.07(GRE)
Effect is strong: 43.1% of the variation in ratings is associated with mean GRE scores
Effect is large and statistically significant: schools whose mean GRE scores are 100 points higher have peer ratings that are, on average, 69 points higher (p < 0.0001).
Tentative conclusion: student body quality has an effect (or at least it does, not controlling for size).
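To make the slope interpretation concrete, here is a small Python sketch (the function name is mine) that evaluates the fitted equation from the SAS output above; recall that GRE is coded in hundreds of points, so a 1-unit difference corresponds to 100 points.

```python
# Fitted simple regression from the SAS output above.
# GRE is coded in hundreds of points (e.g., 6.625 means 662.5).

def peer_rating_hat(gre):
    """Fitted peer rating: -40.52 + 69.07 * GRE (full-precision estimates)."""
    return -40.51619 + 69.07083 * gre

# Two schools 100 GRE points apart differ by the slope, ~69 rating points:
gap = peer_rating_hat(6.0) - peer_rating_hat(5.0)
print(round(gap, 2))   # 69.07
```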
© Judith D. Singer, Harvard Graduate School of Education Unit 6/Slide 8
Simple linear regression of Peer Ratings on program size
P̂eerRat = 247.63 + 18.90(L2Doc)
The REG Procedure
Dependent Variable: PeerRat
Analysis of Variance
                                Sum of        Mean
Source             DF          Squares      Square    F Value    Pr > F
Model               1            37702       37702      23.31    <.0001
Error              85           137470        1617.29328
Corrected Total    86           175172
Root MSE 40.21559 R-Square 0.2152
Parameter Estimates
                         Parameter     Standard
Variable        DF        Estimate        Error    t Value    Pr > |t|
Intercept        1       247.62760     20.58800      12.03      <.0001
L2Doc            1        18.90487      3.91546       4.83      <.0001
Effect is moderately strong: 21.5% of the variation in ratings is associated with number of doctoral graduates.
Effect is moderately large and statistically significant: programs that are twice as large have peer ratings that are an average of 18.9 points higher (p < 0.0001).
Plot label: TC
Conclusion: We should control for size when evaluating the effects of GRE scores.
© Judith D. Singer, Harvard Graduate School of Education Unit 6/Slide 9
From simple regression to multiple regression: Putting it all together
Y = β0 + β1X1 + β2X2 + … + βkXk + ε

PeerRat = β0 + β1(GRE) + β2(L2Doc) + ε

How does multiple regression help us?
1. Simultaneous consideration of many contributing factors
2. We explain more of the variation in Y
3. More accurate predictions (so the residuals are smaller)
4. A separate understanding of each predictor, controlling for the effects of the other predictors in the model (that is, holding all these other predictors constant)

P̂eerRat = -40.52 + 69.07(GRE)    P̂eerRat = 247.63 + 18.90(L2Doc)
More generally, let X1, X2, … Xk represent k predictors
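In code, the systematic part of the general model is just an intercept plus a weighted sum of the k predictors; a minimal Python sketch (the function name is mine):

```python
def fitted_value(b0, betas, xs):
    """b0 + b1*x1 + ... + bk*xk, for any number of predictors k."""
    return b0 + sum(b * x for b, x in zip(betas, xs))

# With k = 1 this reduces to the simple regression of ratings on GRE:
print(round(fitted_value(-40.51619, [69.07083], [6.625]), 1))   # 417.1
```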
© Judith D. Singer, Harvard Graduate School of Education Unit 6/Slide 10
What does the multiple regression model look like graphically?
Let’s go 3D!
School     GRE   L2Doc  PeerRat
StJohns    4.75  4.39   280
IllState   4.79  5.04   290
SDState    5.01  4.25   320
UNM        5.04  5.49   300
UCIrvine   5.20  3.00   310
Utah       5.22  4.86   320
UIChicago  5.33  4.58   350
Uarizona   5.34  5.88   360
Vcomm      5.39  2.00   300
Georgia    5.49  7.14   380
BU         5.61  4.52   340
Cornell    5.62  2.00   350
GMU        5.62  4.39   320
SUNYAlb    5.65  5.39   330
Madison    5.80  6.73   430
BC         5.85  5.39   360
MichSt     5.87  5.70   420
Uconn      5.89  5.70   340
Boulder    5.91  3.58   350
NYU        5.96  6.81   360
Iowa       6.01  6.02   360
TC         6.05  7.59   440
Harvard    6.62  5.91   450
Stanford   6.78  5.25   470
n = 24 for display purposes
Sorted by L2Doc (low to high):

Small schools
School     GRE   L2Doc  PeerRat
Vcomm      5.39  2.00   300
Cornell    5.62  2.00   350
UCIrvine   5.20  3.00   310
Boulder    5.91  3.58   350
SDState    5.01  4.25   320
StJohns    4.75  4.39   280
GMU        5.62  4.39   320

Medium schools
School     GRE   L2Doc  PeerRat
BU         5.61  4.52   340
UIChicago  5.33  4.58   350
Utah       5.22  4.86   320
IllState   4.79  5.04   290
Stanford   6.78  5.25   470
SUNYAlb    5.65  5.39   330
BC         5.85  5.39   360
UNM        5.04  5.49   300
MichSt     5.87  5.70   420
Uconn      5.89  5.70   340

Large schools
School     GRE   L2Doc  PeerRat
Uarizona   5.34  5.88   360
Harvard    6.62  5.91   450
Iowa       6.01  6.02   360
Madison    5.80  6.73   430
NYU        5.96  6.81   360
Georgia    5.49  7.14   380
TC         6.05  7.59   440

Notice: schools with the same GRE scores but different sizes; schools with the same size but different GRE scores; and schools with the same GRE scores AND size. (The first listing is sorted by GRE, low to high.)
© Judith D. Singer, Harvard Graduate School of Education Unit 6/Slide 11
What does the multiple regression model look like graphically?
Fitted regression plane
Observations OVER-predicted
(purple)
Observations UNDER-predicted (blue)
© Judith D. Singer, Harvard Graduate School of Education Unit 6/Slide 12
Returning to Flatland, Part I: A 3D graph drawn in 2D (using perspective)
[3D perspective plot: PeerRat (225 to 475) on the vertical axis vs. GRE (4.5 to 7) and L2Doc (3 to 7)]
Ratings are higher, on average, in schools with higher GRE scores, and this holds at each level of L2Doc (i.e., holding L2Doc constant).
Ratings are higher, on average, in larger schools, and this holds at each level of GRE (i.e., holding GRE constant).
PeerRat = β0 + β1(GRE) + β2(L2Doc) + ε
Note that this image has a different orientation than the one on the last slide.
© Judith D. Singer, Harvard Graduate School of Education Unit 6/Slide 13
Returning to Flatland, Part II: Projecting the 3D graph back into 2D
[Projection of the 3D plot: PeerRat (225 to 475) vs. GRE (4.5 to 7), with fitted lines drawn at L2Doc = 3, 4, 5, 6, and 7]
PeerRat = β0 + β1(GRE) + β2(L2Doc) + ε
Each of these lines describes the effect of GRE at a given value of L2Doc; notice that this effect is the same at all levels of L2Doc.
Notice that these lines are equidistant (or at least they appear to be so in perspective).
© Judith D. Singer, Harvard Graduate School of Education Unit 6/Slide 14
Returning to Flatland, Part II: Projecting the 3D graph back into 2D
[Side view of the 3D plot (PEER vs. GRE), next to a two-dimensional plot of the same prototypical fitted lines]
Looking at our 3D plot from the side, we can see how to move from the fitted plane to a two-dimensional representation of prototypical fitted lines.
Note that this image has a different orientation than the one on the last slide.
© Judith D. Singer, Harvard Graduate School of Education Unit 6/Slide 15
Multiple regression assumptions (with more than 1 predictor)
[Diagram: distributions of Y at combinations of X1 and X2]

Y = β0 + β1X1 + β2X2 + … + βkXk + ε
1. At each combination of the X's there is a distribution of Y. These distributions have a mean µ(Y|X1…Xk) and a variance σ²(Y|X1…Xk).
2. The straight line model is correct. The means of each of these distributions, the µ's, may be joined by a plane.
3. Homoscedasticity. The variances of each of these distributions, the σ²'s, are identical.
4. Independence of observations. Conditional on each combination of the X's, the values of Y are independent of each other (we still can't see this visually).
5. Normality. At each combination of the X's, the values of Y are normally distributed.
© Judith D. Singer, Harvard Graduate School of Education Unit 6/Slide 16
Multiple regression results: Regressing Peer Ratings on both L2Doc and GRE
The REG Procedure
Dependent Variable: PeerRat
Analysis of Variance

                                Sum of        Mean
Source             DF          Squares      Square    F Value    Pr > F
Model               2            99814       49907      55.63    <.0001
Error              84            75359         897.12759
Corrected Total    86           175172

Root MSE           29.95209    R-Square    0.5698
Dependent Mean    344.82759    Adj R-Sq    0.5596
Coeff Var           8.68611

Parameter Estimates

                         Parameter     Standard
Variable        DF        Estimate        Error    t Value    Pr > |t|
Intercept        1       -87.29494     43.07364      -2.03      0.0459
GRE              1        63.31660      7.60956       8.32      <.0001
L2Doc            1        15.34201      2.94746       5.21      <.0001
P̂eerRat = -87.29 + 63.32(GRE) + 15.34(L2Doc)
Interpretation of intercept: the value of Y when all X's = 0. When L2Doc = 0 and GRE = 0, the predicted Peer Rating is -87.29. Here, the intercept is not meaningful.
Interpretation of slope coefficients: the difference in Y per 1-unit difference in X, holding all other X's in the model constant. Holding mean GRE scores constant, schools with twice as many doctoral graduates have peer ratings that are an average of 15.34 points higher. Holding L2Doc constant, schools whose doctoral students have mean GRE scores that are 100 points higher have peer ratings that are 63.32 points higher.
Synonyms: "statistically controlling for," "partialling out," "holding constant"
© Judith D. Singer, Harvard Graduate School of Education Unit 6/Slide 17
Understanding the fitted multiple regression model algebraically & graphically
P̂eerRat = -87.29 + 63.32(GRE) + 15.34(L2Doc)
The controlled effect of GRE can be seen in the common slope (63.32) of these lines; the controlled effect of L2Doc can be seen in the common distance (15.34) between these lines.
Algebraically: plug in different values of L2Doc.

L2Doc = 3: P̂eerRat = -87.29 + 63.32(GRE) + 15.34(3) = -41.27 + 63.32(GRE)
L2Doc = 4: P̂eerRat = -87.29 + 63.32(GRE) + 15.34(4) = -25.93 + 63.32(GRE)
L2Doc = 5: P̂eerRat = -87.29 + 63.32(GRE) + 15.34(5) = -10.59 + 63.32(GRE)
L2Doc = 6: P̂eerRat = -87.29 + 63.32(GRE) + 15.34(6) = 4.75 + 63.32(GRE)

Adjacent intercepts differ by the L2Doc coefficient:
-25.93 - (-41.27) = 15.34
-10.59 - (-25.93) = 15.34
4.75 - (-10.59) = 15.34

Graphically: return to the plot from before.
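This substitution can be checked numerically; a short Python sketch using the slide's rounded coefficients:

```python
# Substituting L2Doc = 3, 4, 5, 6 into the fitted model (rounded
# coefficients) yields four partial equations whose intercepts differ
# by exactly the L2Doc coefficient.
b0, b_l2doc = -87.29, 15.34

intercepts = {d: round(b0 + b_l2doc * d, 2) for d in (3, 4, 5, 6)}
print(intercepts)   # {3: -41.27, 4: -25.93, 5: -10.59, 6: 4.75}

gaps = [round(intercepts[d + 1] - intercepts[d], 2) for d in (3, 4, 5)]
print(gaps)         # [15.34, 15.34, 15.34]
```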
© Judith D. Singer, Harvard Graduate School of Education Unit 6/Slide 18
Conceptualizing a 2D graph that will display our findings
Recall how we plotted a fitted simple regression line, e.g., F̂OSTIQ = 9.72 + 0.91(OWNIQ): substitute in any two values of X, compute the fitted points, here (80, 82.35) and (120, 118.67), and connect them.
In multiple regression, we use the same general approach, but because we have more than 1 predictor, we have to make two decisions. So let's discuss how we make these two decisions:
1. Decide which predictor you’d like to display on the X axis—in multiple regression, you have several predictors, but in a 2D graph, you only have 1 X axis
2. For all the other predictors (note: here we have only 1 other predictor, but usually we have more) identify prototypical values you’d like to use for plotting
Having made these 2 decisions, we then:
3. Systematically substitute in the prototypical values for those predictor(s), which yields a set of partial regression equations
4. Plot each partial regression equation as before (substitute in any 2 values for the remaining predictor, get the corresponding value of y-hat, plot the points, and connect them)
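The four steps can be sketched in Python (rounded coefficients from the fitted model above; the helper name is mine):

```python
b0, b_gre, b_l2doc = -87.29, 63.32, 15.34   # fitted MR model (rounded)

def partial_equation(l2doc):
    """Step 3: substitute a prototypical L2Doc value into the fitted model."""
    intercept = b0 + b_l2doc * l2doc
    return lambda gre: intercept + b_gre * gre

# Step 4: any two GRE values give two points; connect them to draw the line.
line_medium = partial_equation(5)   # "medium" schools, L2Doc = 5
points = [(gre, round(line_medium(gre))) for gre in (5.0, 6.5)]
print(points)   # [(5.0, 306), (6.5, 401)]
```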
© Judith D. Singer, Harvard Graduate School of Education Unit 6/Slide 19
Decision 1: Use sketches to select the predictor to display on the X axis
(note: I mean sketches that don’t need to be drawn perfectly to scale)
[Sketch: PeerRat vs. GRE, with separate lines for small, medium, and large L2Doc]
From the multiple regression equation, we know that the fitted lines corresponding to 1-unit differences in L2Doc will be 15.34 rating points apart.
P̂eerRat = -87.29 + 63.32(GRE) + 15.34(L2Doc)
[Sketch: PeerRat vs. L2Doc, with separate lines for low, medium, and high GREs]
From the multiple regression equation, we know that the fitted lines corresponding to 1-unit differences in GRE will be 63.32 rating points apart (and the slopes for L2Doc will be shallower on this graph).
Two general principles when deciding:
• It's usually easier to see/talk about a predictor displayed on the X axis (because its effect is seen through the slope).
• Corollary: usually put the question predictor on the X axis. You're typically less interested in control predictors and generally want to focus on question predictors.
We now need to define these prototypical values.
© Judith D. Singer, Harvard Graduate School of Education Unit 6/Slide 20
Decision 2: Helpful strategies for selecting prototypical values
The UNIVARIATE Procedure
Variable: L2Doc

Mean    5.141533    Std Deviation    1.10755
Median  5.247928    Variance         1.22666

Quantile      Estimate
100% Max       7.59246
 95%           6.78136
 90%           6.47573
 75% Q3        5.93074
 50% Median    5.24793
 25% Q1        4.39232
 10%           3.70044
  5%           3.32193
  0% Min       2.00000

[Stem-and-leaf display and boxplot of L2Doc]
Examine the distribution of the remaining predictors and consider selecting:
1. Substantively interesting values. This is easiest when the predictor has inherently appealing values (e.g., 8, 12, and 16 years of education in the US)
2. A range of percentiles. When there are no well-known values, consider using a range of percentiles (either the 25th, 50th and 75th or the 10th, 50th, and 90th)
3. The sample mean ± 0.5 (or 1) standard deviation. Best used with predictors with a symmetric distribution
4. The sample mean (on its own). If you don’t want to display a predictor’s effect but just control for it, using only the sample mean will yield a “controlled” fitted regression equation
Remember that exposition is easier if you select whole-number values (if the scale permits) or easily communicated fractions (e.g., ¼, ½, ¾, ⅛)
Mean = 5.14, sd = 1.11; 10th = 3.7, 25th = 4.4, 50th = 5.2, 75th = 5.9, 90th = 6.5
Use L2Doc = 4, 5, and 6:
2^4 = 16 (small); 2^5 = 32 (medium); 2^6 = 64 (large)
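Because L2Doc is the base-2 log of the number of doctoral graduates, the whole-number prototypical values map back to easily described program sizes; a quick Python check:

```python
# Back-transforming prototypical L2Doc values to raw counts of graduates.
sizes = {l2doc: 2 ** l2doc for l2doc in (4, 5, 6)}
print(sizes)              # {4: 16, 5: 32, 6: 64}

# The sample mean of 5.14 corresponds to roughly 35 graduates:
print(round(2 ** 5.14))   # 35
```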
© Judith D. Singer, Harvard Graduate School of Education Unit 6/Slide 21
Substitute in the prototypical values to graph the fitted MR equation
Three prototypical lines representing the relationship between Peer Ratings and GRE scores, holding L2Doc constant (again, notice the identical slopes but different intercepts).
P̂eerRat = -87.29 + 63.32(GRE) + 15.34(L2Doc)
Small schools (L2Doc = 4):  P̂eerRat = -87.29 + 63.32(GRE) + 15.34(4) = -25.93 + 63.32(GRE)
Medium schools (L2Doc = 5): P̂eerRat = -87.29 + 63.32(GRE) + 15.34(5) = -10.59 + 63.32(GRE)
Large schools (L2Doc = 6):  P̂eerRat = -87.29 + 63.32(GRE) + 15.34(6) = 4.75 + 63.32(GRE)
[Plot: three parallel fitted lines labeled Small, Medium, and Large]
The vertical distance between the parallel lines spaced 1 unit apart is the slope coefficient for L2Doc: 15.34.
© Judith D. Singer, Harvard Graduate School of Education Unit 6/Slide 22
What would happen if we put L2Doc on the X axis?
The UNIVARIATE Procedure
Variable: GRE

Mean    5.578966    Std Deviation    0.428999
Median  5.505000    Variance         0.184035

Quantile      Estimate
100% Max        6.775
 95%            6.540
 90%            6.050
 75% Q3         5.845
 50% Median     5.505
 25% Q1         5.275
 10%            5.040
  5%            4.945
  0% Min        4.745

[Stem-and-leaf display and boxplot of GRE]
Use GRE = 5, 5.5 and 6
5 (low GRE); 5.5 (~median GRE); 6 (high GRE)
P̂eerRat = -87.29 + 63.32(GRE) + 15.34(L2Doc)

Low GRE (5.0):  P̂eerRat = -87.29 + 63.32(5.0) + 15.34(L2Doc) = 229.31 + 15.34(L2Doc)
Med GRE (5.5):  P̂eerRat = -87.29 + 63.32(5.5) + 15.34(L2Doc) = 260.97 + 15.34(L2Doc)
High GRE (6.0): P̂eerRat = -87.29 + 63.32(6.0) + 15.34(L2Doc) = 292.63 + 15.34(L2Doc)
[Plot: three parallel fitted lines labeled Lo GRE, Med GRE, and Hi GRE]
The vertical distance between the parallel lines spaced 1 unit apart (Low to High) is the slope coefficient for GRE: 63.32.
© Judith D. Singer, Harvard Graduate School of Education Unit 6/Slide 23
Understanding how the fitted MR equation provides statistical control
P̂eerRat = -87.29 + 63.32(GRE) + 15.34(L2Doc)
[Plot: the Small, Medium, and Large prototypical lines; at GRE = 6.0 their fitted values are 354, 369, and 385]
Comparison 1: programs with identical mean GRE scores may have different ratings because of their size.
Comparison 2: programs of equal size may have different ratings because of the quality of their student body.
Comparing fitted values on a line drawn perpendicular to the X axis is holding GRE constant.
Points on the Large line: (5.0, 321), (6.0, 385), (7.0, 448).
Comparing values "on" any fitted line is holding L2Doc constant.
Comparison 3: there is more than one way to earn a specific rating (this is an unusual view of the data because it's reasoning backwards; we don't really ever want to hold Y constant, but nevertheless it's worth thinking about).
© Judith D. Singer, Harvard Graduate School of Education Unit 6/Slide 24
Towards understanding how and why MR can provide a better fit
[Plot legends: Lo GRE / Med GRE / Hi GRE; Small schools / Medium schools / Large schools]
Why multiple regression helps provide a better fit: if the additional predictor(s) improve(s) the quality of the fit, the observed values of Y (the y_i) will be closer to the predicted values of Y (the ŷ_i), the fitted values on the relevant line, i.e., the line corresponding to that specific combination of predictor values.
[Plot labels: Lo GRE, Med GRE, Hi GRE; Small, Medium, Large]
Why are these lines parallel? They're parallel because we assume that they are. This is known as a main effects assumption: we're assuming the effect of each predictor is the same regardless of the levels of the other predictor. Might this not be a correct assumption?
P̂eerRat = -87.29 + 63.32(GRE) + 15.34(L2Doc)

P̂eerRat = -40.52 + 69.07(GRE)    P̂eerRat = 247.63 + 18.90(L2Doc)

Note: in actuality, there are fitted lines for every value of L2Doc, not just for large, medium, and small schools.
© Judith D. Singer, Harvard Graduate School of Education Unit 6/Slide 25
Might these lines NOT be parallel?: Let’s imagine what else they might be
[Plot legends: Lo GRE / Med GRE / Hi GRE; Small schools / Medium schools / Large schools]
Right now, let's assume that the main effects assumption is correct.
What does it mean if the lines aren't parallel?
• It means that the effect of one predictor (say, the effect of L2Doc) differs by levels of the other predictor (here, GRE).
• This is called a statistical interaction, and in Unit 10 we'll learn how to test for it and modify the model if necessary.
Hmmm… the larger the school, the larger the effect of GRE? Hmmm… the better the student body, the larger the effect of program size?
© Judith D. Singer, Harvard Graduate School of Education Unit 6/Slide 26
From simple to multiple regression: R2 & the Analysis of Variance (ANOVA)
Reprise of R2 in simple linear regression, and R2 in multiple linear regression:

Total deviation:      (y_i - ȳ)
Regression deviation: (ŷ_i - ȳ)
Error deviation:      (y_i - ŷ_i)

R2 = SS Regression / SS Total

Analysis of Variance

Source               Sum of Squares        df                      Mean Square
Model (Regression)   SSR = Σ(ŷ_i - ȳ)²    # predictors            MSR = SSR / df_SSR
Error (Residual)     SSE = Σ(y_i - ŷ_i)²  (n - 1) - # predictors  MSE = SSE / df_SSE = V̂ar(Y|X) (residuals)
Total                SST = Σ(y_i - ȳ)²    n - 1                   MST = SST / df_SST = V̂ar(Y)
Note that this table and the formula for R2 apply in both simple and multiple regression—it’s only the fitted values of Y that change!
The residual is now the vertical distance between the observation and the fitted regression plane.
© Judith D. Singer, Harvard Graduate School of Education Unit 6/Slide 27
Comparing fitted values and residuals from simple and multiple regression models
ID  School        GRE    L2Doc  PeerRat  yhatgre  yhatdoc  yhatmr  residgre  residdoc  residmr
 1  Harvard       6.625  5.91   450      417.1    359.3    422.8     32.9      90.7     27.2
 2  UCLA          5.780  5.73   410      358.7    355.9    366.6     51.3      54.1     43.4
 3  Stanford      6.775  5.25   470      427.4    346.8    422.2     42.6     123.2     47.8
 4  TC            6.045  7.59   440      377.0    391.2    411.9     63.0      48.8     28.1
 5  Vanderbilt    6.605  4.46   430      415.7    331.9    399.3     14.3      98.1     30.7
 6  Northwestern  6.770  3.32   390      427.1    310.4    392.3    -37.1      79.6     -2.3
 7  Berkeley      6.050  5.43   440      377.4    350.2    379.0     62.6      89.8     61.0
 8  Penn          6.040  5.93   380      376.7    359.7    386.1      3.3      20.3     -6.1
 9  Michigan      6.090  5.25   430      380.1    346.8    378.8     49.9      83.2     51.2
10  Madison       5.800  6.73   430      360.1    374.8    383.2     69.9      55.2     46.8
...
81  Colorado      5.210  3.91   300      319.3    321.5    302.5    -19.3     -21.5     -2.5
82  UWMilw        5.030  4.25   330      306.9    327.9    296.4     23.1       2.1     33.6
83  Hofstra       5.910  3.70   290      367.7    317.6    343.7    -77.7     -27.6    -53.7
84  IllState      4.785  5.04   290      290.0    343.0    293.1      0.0     -53.0     -3.1
85  IndianaSt     4.955  4.17   290      301.7    326.5    290.4    -11.7     -36.5     -0.4
86  StJohns       4.745  4.39   280      287.2    330.7    280.5     -7.2     -50.7     -0.5
87  UVM           5.340  4.17   310      328.3    326.5    314.8    -18.3     -16.5     -4.8
P̂eerRat = -40.52 + 69.07(GRE)
P̂eerRat = 247.63 + 18.90(L2Doc)
P̂eerRat = -87.29 + 63.32(GRE) + 15.34(L2Doc)
Sum of Squared Errors: 99,665 (GRE only); 137,470 (L2Doc only); 75,359 (both)
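One row of the table (Harvard) can be recomputed in Python from the three fitted equations; the coefficients come from the SAS output on earlier slides, and Harvard's 60 doctoral graduates give L2Doc = log2(60):

```python
import math

gre, peerrat = 6.625, 450   # Harvard
l2doc = math.log2(60)       # 60 doctoral graduates -> about 5.91

yhatgre = -40.51619 + 69.07083 * gre
yhatdoc = 247.62760 + 18.90487 * l2doc
yhatmr  = -87.29494 + 63.31660 * gre + 15.34201 * l2doc

print(round(yhatgre, 1), round(yhatdoc, 1), round(yhatmr, 1))
# 417.1 359.3 422.8
print(round(peerrat - yhatmr, 1))   # 27.2, the smallest residual of the three
```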
© Judith D. Singer, Harvard Graduate School of Education Unit 6/Slide 28
Interpreting R2 and the Analysis of Variance in multiple regression
Analysis of Variance

                            Sum of        Mean
Source             DF      Squares      Square
Model               2        99814       49907
Error              84        75359         897.12759
Corrected Total    86       175172

Root MSE    29.95209    R-Square    0.5698
Analysis of Variance

Source               Sum of Squares    df    Mean Square
Model (Regression)     99,814           2     49,907.00
Error (Residual)       75,359          84        897.13
Total                 175,172          86      2,036.89
R2(Y|GRE) = 43.1%    R2(Y|L2Doc) = 21.5%

R2(Y|X1, X2, …, Xk) = r²(Y, Ŷ), the squared correlation between Y and Ŷ.
Here r(Y, Ŷ) = .755, and .755² = .570: 57.0% of the variation in Peer Ratings is associated with L2Doc and GRE.

R2(Y|X1, X2) ≠ R2(Y|X1) + R2(Y|X2) unless r(X1, X2) = 0 (more on this in Unit 7).

Root MST = √2,036.89 = 45.13, the estimated SD(Y).
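The quantities on this slide can be recomputed from the ANOVA sums of squares; a Python check:

```python
import math

ss_model, ss_total = 99814, 175172   # from the ANOVA table above

r_squared = ss_model / ss_total
print(round(r_squared, 4))                  # 0.5698

# R-square is also the squared correlation between y and y-hat:
print(round(math.sqrt(r_squared), 3))       # 0.755

# Root MST estimates SD(Y):
print(round(math.sqrt(ss_total / 86), 2))   # 45.13
```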
© Judith D. Singer, Harvard Graduate School of Education Unit 6/Slide 29
Statistical inference: Two distinct types of hypotheses we can test
Across all my predictors, is there anything going on, or would I do just as well without them?
Controlling for all other predictors in the model, does each individual predictor, Xj, have an effect?
Overall/Omnibus F-test:
H0: β1 = β2 = … = βk = 0 (regression doesn't help at all)
H1: some βj ≠ 0 (at least 1 predictor's effect is non-zero)

Individual t-tests:
H0: βj = 0 (this predictor has no controlled effect)
H1: βj ≠ 0 (this predictor has a controlled effect)
With only 1 predictor (that is, in simple linear regression), these two tests are identical.
In multiple regression, these two types of tests are decidedly different!
© Judith D. Singer, Harvard Graduate School of Education Unit 6/Slide 30
Towards a heuristic understanding of the omnibus F-test: comparing the regression decomposition when H0 is not true and when it is true
Regression decomposition if H0 is not true:
Total deviation (y_i - ȳ) = Regression deviation (ŷ_i - ȳ) + Error deviation (y_i - ŷ_i)
• Regression deviations are large; error deviations are small.
• SS Regression is large; SS Error is small.
• MSR is large; MSE is small.
• So, if MSR/MSE is large, reject H0.

Regression decomposition if H0 is true:
• Regression deviations are small; error deviations are large.
• SS Regression is small; SS Error is large.
• MSR is small; MSE is large.
• So, if MSR/MSE is small, fail to reject H0.

Omnibus F-test: F_obs = MSR / MSE
© Judith D. Singer, Harvard Graduate School of Education Unit 6/Slide 31
Conducting omnibus hypothesis tests in multiple regression
                                Sum of        Mean
Source             DF          Squares      Square    F Value    Pr > F
Model               2            99814       49907      55.63    <.0001
Error              84            75359         897.12759
Corrected Total    86           175172
F_obs = MSR / MSE = 49,907 / 897.13 = 55.63
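The same arithmetic in Python:

```python
ss_model, df_model = 99814, 2    # from the ANOVA table above
ss_error, df_error = 75359, 84

msr = ss_model / df_model        # 49907.0
mse = ss_error / df_error        # ~897.13
f_obs = msr / mse
print(round(f_obs, 2))           # 55.63
```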
Q: Is 55.63 "large enough" to reject H0?
Because F(2, 84) = 55.63 (p < 0.0001), we reject H0 that all β's = 0 and conclude that at least one βj ≠ 0.
Sound statistical practice: when reporting F-tests, be sure to provide not just the p-value but also both the numerator and denominator degrees of freedom.

Critical values of F_observed (α = .05)

df for              df for denominator (MSE)
numerator (MSR)       25      50     100     inf
    1               4.24    4.03    3.94    3.84
    2               3.39    3.18    3.09    3.00
    3               2.99    2.79    2.70    2.60
    4               2.76    2.56    2.46    2.37
    5               2.60    2.40    2.31    2.21
   10               2.24    2.03    1.93    1.83
   20               2.01    1.78    1.68    1.57
  120               1.77    1.64    1.38    1.22
 1000               1.72    1.56    1.30    1.00

When the numerator df = 1, F = t² (1.96² = 3.84); this relationship makes sense so that the omnibus F-test and the single-parameter t-test give identical results.
Omnibus F-test: across all my predictors, is there anything going on, or would I do just as well without them?
H0: β1 = β2 = … = βk = 0
H1: some βj ≠ 0
© Judith D. Singer, Harvard Graduate School of Education Unit 6/Slide 32
Conducting individual t-tests in multiple regression
                         Parameter     Standard
Variable        DF        Estimate        Error    t Value    Pr > |t|
Intercept        1       -87.29494     43.07364      -2.03      0.0459
GRE              1        63.31660      7.60956       8.32      <.0001
L2Doc            1        15.34201      2.94746       5.21      <.0001
Statistically controlling for L2Doc, there is an effect of GRE.
Statistically controlling for GRE, there is an effect of L2Doc.
Individual t-tests: controlling for all other predictors in the model, does each individual predictor, Xj, have an effect?
H0: βj = 0
H1: βj ≠ 0
Individual t-tests in multiple regression are analogous to those in single-variable regression; the key difference comes in our interpretation of the results.
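Each t statistic is simply the estimate divided by its standard error; recomputing from the Parameter Estimates table in Python:

```python
# Parameter estimates and standard errors from the SAS output above.
estimates = {
    "Intercept": (-87.29494, 43.07364),
    "GRE":       (63.31660, 7.60956),
    "L2Doc":     (15.34201, 2.94746),
}

t_values = {name: round(b / se, 2) for name, (b, se) in estimates.items()}
print(t_values)   # {'Intercept': -2.03, 'GRE': 8.32, 'L2Doc': 5.21}
```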
© Judith D. Singer, Harvard Graduate School of Education Unit 6/Slide 33
How might we summarize the results of these analyses?
Comparison of regression models predicting peer ratings of US Graduate Schools of Education (n=87) (US News and World Report, 2005)
Predictor                  Model A          Model B          Model C
Intercept                  -40.52           247.63           -87.29
                           (48.16)          (20.59)          (43.07)
                           -0.84            12.03***         -2.03*
Mean GRE scores            69.07                             63.32
                           (8.61)                            (7.61)
                           8.02***                           8.32***
Log2(N doctoral grads)                      18.90            15.34
                                            (3.92)           (2.95)
                                            4.83***          5.21***
R2 (%)                     43.1             21.5             57.0
F (df)                     64.40 (1, 85)    23.31 (1, 85)    55.63 (2, 84)
p                          <0.0001          <0.0001          <0.0001

Cell entries are estimated regression coefficients, (standard errors), and t-statistics.
* p<0.05, ** p<0.01, *** p<0.001
[Plot: prototypical lines labeled Small, Medium, and Large]
Definitions of program size: Small = 16 doctoral grads (2^4); Medium = 32 doctoral grads (2^5); Large = 64 doctoral grads (2^6)
© Judith D. Singer, Harvard Graduate School of Education Unit 6/Slide 34
Examining residuals to evaluate assumptions, as in simple regression
[Stem-and-leaf display and boxplot of studentized residuals (stems ×0.1): roughly symmetric around 0, with one low outlier near -2.7]
Plot labels: Berkeley; UC Davis, Ohio State; Delaware; Hofstra; USF, UHouston; UC Berkeley; Delaware (very high GREs)
Plot residuals (either raw or studentized) vs. each X.
Under-predicted: ratings are higher than we expected.
Over-predicted: ratings are lower than we expected.
(Vertical axis: studentized residuals.)
© Judith D. Singer, Harvard Graduate School of Education Unit 6/Slide 35
We should also plot residuals vs. fitted values
UC Berkeley: y_obs = 440, ŷ = 379
Delaware (very high GREs): y_obs = 310, ŷ = 390
• Possible nonlinearity?
• Might this improve when we add other predictors in Unit 7?
• Might this improve if we allow the effect of GRE to interact with L2Doc (in Unit 10)?
© Judith D. Singer, Harvard Graduate School of Education Unit 6/Slide 36
What’s the big takeaway from this unit?
• Multiple regression serves several purposes
  – We can more accurately explain the variation in the outcome Y by considering several predictors simultaneously
  – The basic principles of model fitting, data analysis, and inference remain essentially the same
• Inference in multiple regression focuses both on the overall model and on the role of individual predictors (controlling for other predictors in the model)
  – Omnibus F-tests tell us about the model as a whole
  – Individual t-tests provide information about an individual predictor when controlling for the other predictor(s) in the model
• We have to make wise decisions about how best to present findings
  – Multidimensional graphs would be ideal, but we usually find ourselves displaying our findings in just two dimensions
  – Different plots emphasize different messages; you need to learn how to think about what the prototypical plots will look like and make educated decisions about which plots to display
  – Tables can be helpful in presenting results from several models that include different combinations of predictors
© Judith D. Singer, Harvard Graduate School of Education Unit 6/Slide 37
Appendix: Annotated PC-SAS Code for fitting multiple regression models
Note that the handouts include only annotations for the needed additional code. For the complete program, check program “Unit 6—EdSchools analysis” on the website.
*-----------------------------------------------------------------*
 Fitting multiple regression model PEERRAT on L2DOC and GRE
*-----------------------------------------------------------------*;
proc reg data=one;
  model PeerRat=L2Doc GRE;
  output out=resdat1 r=residual student=student predicted=yhat;
run;
*-----------------------------------------------------------------*
 Univariate summary information on studentized residuals
 from multiple regression model PEERRAT on L2DOC GRE
*-----------------------------------------------------------------*;
proc univariate data=resdat1 plots;
  var student;
  id school;
run;
*-----------------------------------------------------------------*
 Plotting studentized residuals vs. each predictor and YHat
*-----------------------------------------------------------------*;
proc gplot data=resdat1;
  plot student*(L2Doc GRE yhat);
  symbol value='dot';
run;
*-----------------------------------------------------------------*
 Computing fitted values and residuals from the 3 models
*-----------------------------------------------------------------*;
data one;
  set one;
  yhatgre = -40.51619 + 69.07083*GRE;
  yhatdoc = 247.62760 + 18.90487*L2Doc;
  yhatmr  = -87.29494 + 63.31660*GRE + 15.34201*L2Doc;
  residgre = peerrat - yhatgre;
  residdoc = peerrat - yhatdoc;
  residmr  = peerrat - yhatmr;
run;
Proc reg allows you to fit multiple regression models by adding additional predictors to your model statement (following the equals "=" sign). The syntax for the output statement is similar, except that now you also need to ask for the predicted values (the fitted values of Y), to use in residual plots to explore assumption violations.
proc univariate can be used as usual (with the plots option) to analyze the new dataset RESDAT1 and to provide summary statistics for the residuals.
To examine the residual assumptions for the multiple regression model, use proc gplot to produce plots of the residuals vs. each predictor and vs. the predicted values of Y.
You can obtain these fitted values and residuals from the separate PROC REGs, but it's MUCH easier to just write code in a data step, which is what I did.
© Judith D. Singer, Harvard Graduate School of Education Unit 6/Slide 38
Glossary terms included in Unit 6
• Statistical control
• Interactions
• Main effects
• Omnibus F-test