Graphical ANOVA and Regression
W. John Braun
22 November 2013
University of Winnipeg – Winnipeg, MB
1
Outline of Presentation
Part I. Graphical ANOVA
A Sad Story
Part II. Graphical Regression via the QR Decomposition
2
Part I. Graphical ANOVA
A relatively high level of mathematical sophistication is
required of an individual in order to understand ANOVA
In the words of one introductory statistics textbook author:
“the details of ANOVA are a bit daunting” (Moore, 2010)
4
Required Concepts
• Null hypothesis/Alternative hypothesis
• Variance
• Stochastic independence
• Degrees of freedom
• Sums of Squares
• F-distribution
• p-value
Most introductory-level students do not fully comprehend
these concepts (Weinberg et al., 2010)
5
Why Students Fail to Understand ANOVA
A breakdown in the understanding of ANOVA can easily
occur at any point in this sequence
Students lose “sight of the big picture of statistics
(arguably, real data analysis and inference)” by the time
they have completed the first two thirds of a traditional
introductory statistics course (Tintle et al., 2011)
6
Helping Students Understand ANOVA
David Moore’s introductory-level statistics textbook defers
ANOVA formulas to an optional section at the end of the
ANOVA chapter.
His preferred focus:
The main idea of ANOVA is more accessible and much more important. Here it is: when we ask if a set of sample means gives evidence for differences among the population means, what matters is not how far apart the sample means are but how far apart they are relative to the variability of individual observations (Moore, 2010, p. 642).
7
Objective
To demonstrate a method for analyzing multiple samples
which is
• focussed on the essence of ANOVA
• understandable to students with little mathematical or
statistical background, e.g., introductory statistics
students
• a graphical complement to ANOVA
8
A Paper Airplane Experiment
We will introduce graphical ANOVA with the following
problem:
Does the distance travelled by a paper airplane after being thrown depend on the weight of the paper used in its construction?
9
A Paper Airplane Experiment*
12 paper airplanes of a single design were constructed
using
• 20 × 27 cm sheets of paper
• 3 different weights of paper (light, medium, heavy)
m = 3 treatment groups with n = 4 airplanes in each
• Response: distance travelled
• Factor of interest: weight of paper
*A better alternative is to run this as a classroom demonstration.
10
The Paper Airplane Experiment Data
Distance Travelled (m)

Light Paper   Medium Paper   Heavy Paper
    3.1           4.0            5.1
    3.3           3.5            3.1
    2.1           4.5            4.7
    1.9           6.1            5.3

Average Distance Travelled (m)

Light Paper   Medium Paper   Heavy Paper
   2.600         4.525          4.550
11
Class Discussion
Flight distances are different for different treatments, but
they are also different within treatments
What has caused the variation within each treatment?
Factors other than paper weight must be having an effect
Identifying these other factors is a worthwhile exercise,
because it connects “unexplained variation”, “error” or
“noise” with tangibles that the students can understand
12
Class Discussion
Some of the possibilities are: individual airplane
construction, initial throwing height, and initial thrust and
direction
These unmeasured factors probably vary slightly from
throw to throw and thus could account for some or all of
the variation observed within each treatment
Is there confounding?*
*This would be clear in a demonstration.
13
Class Discussion
The presence of unmeasured factors makes it difficult to
tell if differences in paper type have an effect on flight
distance
Are the differences in the treatment averages due only to
the unmeasured factors?
A bootstrap hypothesis test will be used to answer this
question
14
Graphical ANOVA: A Bootstrap Hypothesis Test
Create an artificial data set which is similar to the original data set but in which only the unmeasured factors are responsible for variation
From this artificial data set, repeatedly take samples of size 4, with replacement
Averages of these samples will vary, but only because of the unmeasured factors
If the observed averages vary more than the simulated averages, then there is evidence of a treatment effect: paper weight affects flight distance
A graphical test: compare the observed averages with a histogram of the simulated averages
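The steps above can be sketched in a few lines of Python. This is only an illustration, not the presenter's code: the slides do not say how the artificial data set is built, so the sketch assumes each group is shifted to the grand mean, leaving only within-group variation. The names `flights`, `artificial` and `simulated`, the seed and the 1000 resamples are all choices made for this example.

```python
import random
from statistics import mean

# Flight distances (m) from the paper airplane experiment
flights = {
    "light": [3.1, 3.3, 2.1, 1.9],
    "medium": [4.0, 3.5, 4.5, 6.1],
    "heavy": [5.1, 3.1, 4.7, 5.3],
}

observed = {name: mean(d) for name, d in flights.items()}
grand_mean = mean(x for d in flights.values() for x in d)

# Assumption: shift every group to the grand mean, so that only the
# within-group ("unmeasured factor") variation remains.
artificial = [x - observed[name] + grand_mean
              for name, d in flights.items() for x in d]

rng = random.Random(2013)   # fixed seed, for reproducibility only
n_per_group = 4
simulated = [mean(rng.choices(artificial, k=n_per_group))
             for _ in range(1000)]

# Numerical stand-in for reading the plot: how often is a simulated
# average at least as far from the grand mean as each observed average?
for name, avg in observed.items():
    gap = abs(avg - grand_mean)
    share = sum(abs(s - grand_mean) >= gap for s in simulated) / len(simulated)
    print(f"{name:6s} average {avg:.3f}  share at least as extreme: {share:.3f}")
```

A histogram of `simulated` with the three values in `observed` marked along the axis gives the graphical version of the same comparison.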
15
The Graphical ANOVA Plot
Observed (rug plot) and Simulated (histogram) Flight Distance Averages
[Figure: histogram of simulated treatment averages (x-axis: simulated treatment averages, 3.0 to 4.5; y-axis: frequency), with a rug plot marking the observed heavy, light and medium averages]
16
The Graphical ANOVA Plot
The light paper average is an outlier relative to the
histogram
There is strong evidence that the treatment means are
not all the same
Otherwise, the treatment averages should be located in
regions corresponding to higher histogram density
17
Use of Reference Plots
As with QQ-plots, one’s ability to use the graphical
ANOVA plot effectively will improve with experience
Assessing strength of evidence depends on the number of
treatments under study
Reference plots are a good way to gain some experience
before making a judgement based on given data
The following plots correspond to cases where the p-value
from the ANOVA F-test is in the vicinity of 0.05
18
Use of Reference Plots (2 samples)
[Figure: histogram of simulated treatment averages (x-axis: simulated treatment averages, −0.8 to 0.6; y-axis: frequency), with a rug plot marking the averages of treatments 1 and 2]
19
Use of Reference Plots (3 samples)
[Figure: histogram of simulated treatment averages (x-axis: simulated treatment averages, −0.8 to 0.6; y-axis: frequency), with a rug plot marking the averages of treatments 1 to 3]
20
Use of Reference Plots (4 samples)
[Figure: histogram of simulated treatment averages (x-axis: simulated treatment averages, −0.5 to 0.5; y-axis: frequency), with a rug plot marking the averages of treatments 1 to 4]
21
Use of Reference Plots (5 samples)
[Figure: histogram of simulated treatment averages (x-axis: simulated treatment averages, −0.5 to 0.5; y-axis: frequency), with a rug plot marking the averages of treatments 1 to 5]
22
Motor Vibration Example
5 different brands of bearings are compared in terms of the
amount of vibration they generate in an electric motor*
The graphical ANOVA plot shows clear evidence of a
treatment effect: the second brand average appears in the
extreme right tail
ANOVA p-value: 0.0001871
*The motor vibration data can be found in the Devore5 library (Bates, 2004)
23
Motor Vibration Example
[Figure: histogram of simulated treatment averages (x-axis: simulated treatment averages, 13.0 to 16.0; y-axis: frequency), with a rug plot marking the averages of Brands 1 to 5]
24
Agricultural Yield Example
This example concerns the effects of a fertilizer on
agricultural yield
2 treatments: a fertilizer treatment and a control
There are 8 measurements in each sample
ANOVA p-value: 0.847
25
Agricultural Yield Example
[Figure: histogram of simulated treatment averages (x-axis: simulated treatment averages, 1100 to 1250; y-axis: frequency), with a rug plot marking the Control.plots and Fertilizer.plots averages]
26
Conclusion to Part I
The technique we have described has been used by a
twelve-year-old who appeared to understand it
Other mathematically challenged individuals seem to have
found the method comprehensible
It also seems to have been understood by a 10-year-old,
but he was from Australia
The technique has a similar goal to the graphical ANOVA
plot which has been advocated by Box et al. (2005)
27
A Sad Story
A few years ago, a student writing the final exam in my third year regression course motioned to me, indicating that he had a question
His question concerned whether he had to invert the simple 3 × 3 matrix that arose in a 2 variable multiple regression question, because he had forgotten how to invert a matrix
It was a bit shocking that someone at this stage of a statistics undergraduate program could have such difficulties
When the exams were marked, it became evident that several students were incapable of inverting the given 3 × 3 matrix
I related the story to a visiting researcher expecting the usual groan of disbelief about the abilities of “students these days” but was rather taken aback at the response which was, “Why are you asking them to invert matrices in that course?” (Guttorp, 2009)
28
A Sad Story
I had to admit that I didn’t have a good answer, being fully aware that matrix inversion is generally a bad idea and that good statistical software will find least-squares solutions using stable procedures which avoid explicit matrix inversion
At the same time, other problems were evident in the teaching of this regression course
In particular, students seemed to struggle with the F-test for significance of regression and were not really clear on the usefulness of this test
Their focus was placed on the individual coefficient t-tests, and even with many warnings about the dangers of misinterpreting large R values and small t-test p-values as indicators of goodness of fit, students were still tempted to use these measures as criteria for assessing model validity
29
Part II. Graphical Regression via the QR Decomposition*
The usual multiple regression model is assumed:
$$y = X\beta + \varepsilon$$
where $y$ and $\varepsilon$ are vectors of length $n$, $\beta$ is a vector of
length $p$ and $X$ is an $n \times p$ matrix ($n > p$)
The columns of X are assumed to be linearly independent,
and if an intercept is assumed to be in the model, the first
column is a vector of 1’s
The other columns are assumed to be nonrandom
*I am borrowing liberally from Wood (2006), providing details on how one might elaborate in a typical undergraduate course.
31
Regression Analysis via QR
The elements of $\varepsilon$ are assumed to be uncorrelated random
variables with mean 0 and common variance $\sigma^2$
Thus, the expected value and variance of the vector $y$ are
$$E[y] = X\beta$$
and
$$\mathrm{Var}(y) = \sigma^2 I$$
where $I$ is the $n \times n$ identity matrix
32
The QR Decomposition of X
Given the model, it is always possible to find an $n \times n$ orthogonal matrix $Q$ and an $n \times p$ matrix $R$ such that
$$X = QR$$
where the bottom $n - p$ rows of $R$ consist of 0’s and the
top $p$ rows of $R$ constitute a nonsingular upper triangular
matrix $U$
That is, $Q^T Q = I$ and
$$R = \begin{bmatrix} U \\ 0 \end{bmatrix}$$
33
Parameter Estimation via QR
Multiplying the model equation on the left by $Q^T$ gives
$$Q^T y = R\beta + Q^T \varepsilon$$
Partitioning $Q$ as $[Q_1 \; Q_2]$, where $Q_1$ represents the first $p$
columns of $Q$, the above equation can be re-written as
$$Q_1^T y = U\beta + Q_1^T \varepsilon \tag{1}$$
and
$$Q_2^T y = Q_2^T \varepsilon \tag{2}$$
34
Parameter Estimation via QR
To estimate $\beta$, it is clear that model (1) contains all of the
needed information, and the ordinary least-squares estimate
must be the solution of the upper triangular linear system
$$U\hat\beta = Q_1^T y$$
since the solution of this must minimize the least-squares
objective function
$$(Q_1^T y - U\beta)^T (Q_1^T y - U\beta)$$
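A small numerical sketch may help make this concrete. It assumes NumPy is available and uses simulated data; the design matrix, the coefficient values and the helper `back_substitute` are illustrative choices, not part of the talk. The point is that the triangular system is solved by back substitution, with no matrix inverse formed.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 13, 4
# Simulated design matrix with an intercept column (illustrative data only)
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, 0.0, -1.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Full QR decomposition: Q is n x n, R is n x p with zero bottom rows
Q, R = np.linalg.qr(X, mode="complete")
Q1, Q2 = Q[:, :p], Q[:, p:]
U = R[:p, :]

def back_substitute(U, b):
    """Solve U x = b for upper triangular U without forming its inverse."""
    x = np.zeros(len(b))
    for i in range(len(b) - 1, -1, -1):
        x[i] = (b[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

beta_hat = back_substitute(U, Q1.T @ y)

# Cross-check against NumPy's own least-squares solver
beta_ref = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(beta_hat, beta_ref))
```

The diagonal of `U` may come out negative, as it does in the jojoba output later in the talk; the sign convention does not affect the estimate.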
35
Properties of the Parameter Estimates
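Unbiasedness of $\hat\beta = U^{-1} Q_1^T y$ is easily verified, since it
can be shown that
$$\hat\beta = \beta + U^{-1} Q_1^T \varepsilon.$$
This last result can also be used to see that
$$\mathrm{Var}(\hat\beta) = U^{-1} Q_1^T \mathrm{Var}(\varepsilon) Q_1 U^{-T} = \sigma^2 U^{-1} U^{-T}$$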
36
Parameter Estimation via QR
To estimate $\sigma^2$, model equation (2) is used. It is easily
shown that
$$E[Q_2^T y] = 0$$
and
$$\mathrm{Var}(Q_2^T y) = Q_2^T \sigma^2 I Q_2 = \sigma^2 I$$
Thus, $Q_2^T y$ is a vector of $n - p$ uncorrelated random variables
with mean 0 and variance $\sigma^2$
Hence, $y^T Q_2 Q_2^T y$ must be the sum of squares of such
random variables
37
Parameter Estimation via QR
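i.e.
$$y^T Q_2 Q_2^T y = \sum_{i=1}^{n-p} Z_i^2 \tag{3}$$
where $Z_1, Z_2, \ldots, Z_{n-p}$ are uncorrelated with mean 0 and
variance $\sigma^2$
Therefore,
$$E[y^T Q_2 Q_2^T y] = \sum_{i=1}^{n-p} \mathrm{Var}(Z_i) = (n - p)\sigma^2$$
and an unbiased estimator for $\sigma^2$ is
$$\hat\sigma^2 = \frac{y^T Q_2 Q_2^T y}{n - p}$$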
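As a quick check of this estimator, the following sketch (NumPy assumed, simulated data, all names illustrative) computes the variance estimate from the last n − p components of the rotated response and compares it with the familiar residual-sum-of-squares version.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 3
# Simulated regression data (illustrative only)
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([2.0, 1.0, -1.0]) + rng.normal(size=n)

Q, R = np.linalg.qr(X, mode="complete")
z = Q[:, p:].T @ y               # Q2^T y: the n - p "error" components
sigma2_hat = z @ z / (n - p)     # estimator from the slide

# Cross-check: residual sum of squares from an ordinary least-squares fit
beta_ref = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_ref
rss_based = resid @ resid / (n - p)
print(sigma2_hat, rss_based)
```

The two numbers agree up to rounding because multiplying by an orthogonal matrix preserves lengths, so the residual sum of squares equals the squared length of the last n − p rotated components.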
38
F statistics
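Given $r \in \{1, \ldots, p\}$, consider the (null) hypothesis that
$$\beta_{p-r} = \beta_{p-r+1} = \cdots = \beta_{p-1} = 0$$
From the QR decomposition of $X$ and model (1), we have, under the above hypothesis,
$$E[Q_1^T y] =
\begin{bmatrix}
U_{1,1}\beta_0 + U_{1,2}\beta_1 + \cdots + U_{1,p}\beta_{p-1} \\
U_{2,2}\beta_1 + \cdots + U_{2,p}\beta_{p-1} \\
\vdots \\
U_{p,p}\beta_{p-1}
\end{bmatrix}
=
\begin{bmatrix}
U_{1,1}\beta_0 + U_{1,2}\beta_1 + \cdots + U_{1,p-r}\beta_{p-r-1} \\
U_{2,2}\beta_1 + \cdots + U_{2,p-r}\beta_{p-r-1} \\
\vdots \\
U_{p-r,p-r}\beta_{p-r-1} \\
0 \\
\vdots \\
0
\end{bmatrix}$$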
39
F statistics
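Partition the vector $Q_1^T y$ as
$$Q_1^T y = \begin{bmatrix} q_{11} \\ q_{12} \end{bmatrix}$$
where $q_{12}$ denotes the last $r$ components of $Q_1^T y$
When the errors are normally distributed and the null
hypothesis is true, $q_{12}$ is a vector of $r$ independent normal
random variables with mean 0 and variance $\sigma^2$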
40
F statistics
Therefore,
$$\frac{q_{12}^T q_{12}}{\sigma^2}$$
must have a $\chi^2$ distribution on $r$ degrees of freedom, and
because $Q_1^T Q_2 = 0$, $q_{12}$ is independent of $\hat\sigma^2$, so
$$\frac{q_{12}^T q_{12} / r}{\hat\sigma^2}$$
must have an $F$ distribution on $r$ and $n - p$ degrees of
freedom, when the null hypothesis is true
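The statistic can be computed directly from the rotated response. The sketch below assumes NumPy and simulated data; the function names `f_statistic_qr` and `f_statistic_rss` are made up for this example, and the comparison function assumes r < p so that the reduced model keeps at least one column. It checks the QR version against the usual comparison of residual sums of squares for nested models.

```python
import numpy as np

def f_statistic_qr(X, y, r):
    """F statistic for H0: the last r coefficients are zero, via QR."""
    n, p = X.shape
    Q, _ = np.linalg.qr(X, mode="complete")
    q1 = Q[:, :p].T @ y           # Q1^T y
    q2 = Q[:, p:].T @ y           # Q2^T y
    q12 = q1[p - r:]              # last r components of Q1^T y
    sigma2_hat = q2 @ q2 / (n - p)
    return (q12 @ q12 / r) / sigma2_hat

def f_statistic_rss(X, y, r):
    """Same test from residual sums of squares of nested models (r < p)."""
    n, p = X.shape

    def rss(M):
        b = np.linalg.lstsq(M, y, rcond=None)[0]
        e = y - M @ b
        return e @ e

    full, reduced = rss(X), rss(X[:, :p - r])
    return ((reduced - full) / r) / (full / (n - p))

rng = np.random.default_rng(2)
n, p = 30, 4
# Simulated data in which the last two coefficients really are zero
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 0.5, 0.0, 0.0]) + rng.normal(size=n)

f_qr = f_statistic_qr(X, y, r=2)
f_rss = f_statistic_rss(X, y, r=2)
print(f_qr, f_rss)
```

A p-value would come from the upper tail of the F distribution on r and n − p degrees of freedom, which any statistics library can supply.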
41
F statistics
The F-test for significance of regression corresponds to the
case where $r = p - 1$
Then $E[q_{11}] = U_{1,1}\beta_0$. $E[q_{12}]$ is a $(p - 1)$-vector of 0’s
under the null hypothesis and must have at least one
nonzero component if the alternative is true
42
Graphical Complements to the F-test
There may be several ways of generalizing the ANOVA plot
to the regression (or ANCOVA) context; we show here how
to use the QR decomposition to obtain such a plot
Recall: $Q^T y$ is a random vector whose last $n - p$ components (i.e., $Q_2^T y$) have the same mean and variance
as the errors
This can be graphically displayed in a histogram
If the data are symmetrically distributed, graphing the
distribution of the absolute values of $Q_2^T y$ gives a more
precise estimate of the distribution of $|\varepsilon|$
43
Graphical Complements to the F-test
$Q_1^T y$ has expected value $U\beta$ and variance $\sigma^2 I$
If the $i$th component of the vector $U\beta$ is 0, then the $i$th
component of $Q_1^T y$ has the same mean and variance as the
errors.
Thus, we can informally check whether particular elements
of $U\beta$ are 0 by plotting the corresponding elements of
$|Q_1^T y|$ as a rug plot on the $|Q_2^T y|$ histogram
Evidence that an element of $U\beta$ is nonzero is indicated by
the corresponding point of $|Q_1^T y|$ plotting outside the range of the histogram
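The ingredients of such a plot are easy to compute. The following sketch assumes NumPy and simulated data in which only the first predictor matters; `qy_plot_data` is a hypothetical helper written for this example and is not the function shipped in the MPV library.

```python
import numpy as np

def qy_plot_data(X, y):
    """Return (|Q1^T y|, |Q2^T y|): rug points and histogram values."""
    n, p = X.shape
    Q, _ = np.linalg.qr(X, mode="complete")
    return np.abs(Q[:, :p].T @ y), np.abs(Q[:, p:].T @ y)

rng = np.random.default_rng(3)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
# Hypothetical model: only x1 has a nonzero coefficient
y = X @ np.array([0.0, 5.0, 0.0, 0.0]) + rng.normal(size=n)

rug, errors = qy_plot_data(X, y)

# Components 2..p of Q1^T y are the ones relevant to the test for
# significance of regression; flag those beyond the histogram range.
outside = [i + 1 for i in range(1, len(rug)) if rug[i] > errors.max()]
print("components outside the error range:", outside)
```

Passing `errors` to a histogram routine and drawing `rug[1:]` as tick marks on the same axis reproduces the layout described above.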
44
Illustrative Examples
13 observations on an experiment to produce a synthetic
analogue to jojoba oil
Response variable: y, yield
Predictors: $x_1$, $x_2$ and $x_3$ (temperature, catalyst, pressure)
Regression model:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon.$$
45
[Figure: Evidence of Regression in the Jojoba Oil Data. Histogram of the absolute errors (x-axis: errors, 0 to 40; y-axis: frequency), with rug points labelled 2, 3 and 4]
The Qy-plot for the synthetic jojoba oil data set where yield is regressed against
temperature, catalyst and pressure.
46
Jojoba Oil Qy-Plot
Histogram: distribution of the absolute values of the errors
Rug plot: three points, the second through fourth
components of $U\hat\beta$
The second point is outside the range of the error
distribution: strong evidence that at least one of the
predictor variables has a nonzero coefficient
The graphical result agrees with the formal F-test
(p-value: 0.0112)
47
A Null Case
40 observations
Response variable: simulated standard normals
Predictors: 9 simulated variables, correlated amongst
themselves but not with the response variable
Model:
$$y = \beta_0 + \sum_{j=1}^{9} \beta_j x_j + \varepsilon$$
The following Qy-plot shows that there is no evidence that
any of the coefficients is nonzero
48
[Figure: Evidence of Regression in Simulated Data Set. Histogram of the absolute errors (x-axis: errors, 0.0 to 2.5; y-axis: frequency), with rug points labelled 2 to 10]
The Qy-plot for the simulated data set with a response which is independent of all
9 predictor variables.
49
A Null Case
Individual t-tests must be interpreted with care when a
large number of simultaneous tests are conducted
The output from fitting the 9 variable regression:
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.08373    0.19360  -0.432   0.6685
x1          -1.73185    0.75376  -2.298   0.0287
x2           2.46466    1.44453   1.706   0.0983
x3          -3.51600    1.59061  -2.210   0.0348
x4           0.59918    0.29474   2.033   0.0510
x5           1.91034    1.03242   1.850   0.0741
x6          -2.13077    2.00931  -1.060   0.2974
x7           1.85654    0.79440   2.337   0.0263
x8           1.69608    1.51268   1.121   0.2711
x9          -0.64120    0.41033  -1.563   0.1286

Residual standard error: 1.161 on 30 degrees of freedom
Multiple R-squared: 0.2189, Adjusted R-squared: -0.01549
F-statistic: 0.9339 on 9 and 30 DF, p-value: 0.5109
50
A Null Case
If one is not careful, one might mistakenly infer that the
coefficients of x1, x3 and x7 are nonzero on the basis of
the low t-test p-values
The F-test p-value (0.5109) confirms our earlier assertion
that there is no evidence that any of the regression
coefficients are nonzero
The statistical analyst should proceed no further with any
kind of linear model for this data set, but given our
informal survey results, one must wonder how strong the
temptation would be to apply backward variable selection
in a situation like this
51
NFL Data
20 observations on 1976 National Football League (NFL)
performance
Response variable: number of games won in a 14 game
season
Predictor variables: rushing yards, passing yards, punting
average (yards/punt), field goal percentage (FGs
made/FGs attempted), turnover differential (turnovers
acquired − turnovers lost), penalty yards, percent rushing
(rushing plays/total plays), opponents’ rushing yards, and
opponents’ passing yards
52
[Figure: Evidence of Regression in NFL Data Set. Histogram of the absolute errors (x-axis: errors, 0 to 10; y-axis: frequency), with rug points labelled 2 to 10]
The Qy-plot for the NFL model for number of games won, given 9 predictors.
53
NFL Data
Two error points are separated from the rest of the error
distribution, a reminder that additional model diagnostics
should be studied before making any firm conclusions
Strong evidence that elements 2, 3 and 9 of $U\beta$ are
nonzero
Do not over-interpret the plot; it is not saying that the
three predictor coefficients ($\beta_1$, $\beta_2$ and $\beta_8$) are nonzero
The plot contains no information about which coefficients
are nonzero
54
The U Plot
Like the F-test, the Qy-plot does not provide information
about the specific elements of $\beta$
Examining the entries of $U$ can give some insight
Returning to the jojoba example, we saw evidence that the
2nd component of $U\beta$ is nonzero
Because $U$ is upper triangular,
$$U_{2,2}\beta_1 + U_{2,3}\beta_2 + U_{2,4}\beta_3 \neq 0. \tag{4}$$
55
The U Plot
The entries of U are:
> U
x0 x1 x2 x3
1 -3.605551 0.5547002 0.0000000 0.0000000
2 0.000000 2.7735010 0.7211103 0.7211103
3 0.000000 0.0000000 -2.7349589 0.1901308
4 0.000000 0.0000000 0.0000000 2.7283420
The absolute values of the U matrix can be readily
translated into a graph, row by row
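A rough text version of this idea, as a sketch only: the entries of U are copied from the output above, while the helper `text_u_plot`, the bar width and the choice to scale every bar by the largest entry of the whole matrix are assumptions of this example. The actual U plot in the MPV library may scale and draw things differently.

```python
# Entries of U for the jojoba oil example, copied from the output above
U = [
    [-3.605551, 0.5547002, 0.0000000, 0.0000000],
    [0.000000, 2.7735010, 0.7211103, 0.7211103],
    [0.000000, 0.0000000, -2.7349589, 0.1901308],
    [0.000000, 0.0000000, 0.0000000, 2.7283420],
]
labels = ["x0", "x1", "x2", "x3"]

def text_u_plot(U, labels, width=30):
    """One bar per entry of |U|, row by row, scaled to the largest entry."""
    biggest = max(abs(u) for row in U for u in row)
    lines = []
    for row_label, row in zip(labels, U):
        lines.append(f"row {row_label}")
        for col_label, u in zip(labels, row):
            bar = "#" * round(width * abs(u) / biggest)
            lines.append(f"  {col_label} {abs(u):9.6f} {bar}")
    return lines

print("\n".join(text_u_plot(U, labels)))
```

In the row for x1, the bar for x1 is several times longer than the bars for x2 and x3, which is the pattern the second row of the U plot is meant to convey.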
56
[Figure: four panels, one per row of U (labelled x0, x1, x2, x3), each showing the absolute values of that row's entries against the columns x0, x1, x2, x3]
The U plot for the jojoba oil data set where yield is regressed against three
predictors.
57
The U Plot
Our focus should be on the second row which indicates
that $U_{2,2}$ is large in magnitude while $U_{2,3}$ and $U_{2,4}$ are
smaller, but not negligible
Since the multiplier of $\beta_1$ in (4) is large, relative to the
other multipliers, it is believable that $\beta_1$ is nonzero
However, the fact that $U_{2,3}$ and $U_{2,4}$ are not negligible
makes it hard to be certain. Note that the Qy-plot leaves
the possibility open that the 3rd and 4th components of $U\beta$
are zero
58
The U Plot
That is, it is possible that
$$U_{4,4}\beta_3 = 0 \tag{5}$$
which would mean that $\beta_3 = 0$. In the same way, it is
plausible that
$$U_{3,3}\beta_2 + U_{3,4}\beta_3 = 0 \tag{6}$$
which would lead us to conclude that $\beta_2 = 0$. This
informal approach to sequential testing then leads us to
the possibility that $\beta_1$ is nonzero.
59
The U Plot for the NFL Data
We consider only the second, third and ninth elements of
the Uβ vector, since those elements are the ones that were
identified in the Qy-plot
60
[Figure: three panels, for the rows of U labelled x1, x2 and x8, each showing the absolute values of that row's entries against the columns x0 to x9]
The U plot for the NFL data set, with a focus on the second, third and ninth rows.
61
The U Plot
Working from the bottom to the top of this plot, we note
that $U_{9,9}$ and $U_{9,10}$ are nonzero
From the Qy-plot, it is evident that
$$U_{9,9}\beta_8 + U_{9,10}\beta_9 \neq 0$$
but there is little evidence that $\beta_9 \neq 0$, so it is plausible
that $\beta_8 \neq 0$
62
The U Plot
The second row of the U-plot indicates that $\beta_2 \neq 0$, since
the multiplier of $\beta_8$ appears to be very small
The top row of the U-plot is less clear, because the
multiplier of $\beta_8$ is clearly nontrivial; this indicates that
there is very weak evidence that $\beta_1 \neq 0$
This graphical analysis leads us to consider a model with
only $x_2$ and $x_8$ as predictors: passing yards and
opponents’ rushing yards
63
Concluding Remarks
Regression analysis can be taught, based on methods that
are actually employed in statistical software
The Qy-plot is a graphical technique which conveys the
meaning of the F-test for significance of regression
The U-plot can be used to visualize multicollinearity
A t-plot can also be constructed
Code for the three proposed plots can be found in the
MPV library
64