basic anova and glm
TRANSCRIPT
-
8/12/2019 Basic Anova and Glm
1/40
Basic Analysis of Variance and
the General Linear Model
Psy 420
Andrew Ainsworth
-
8/12/2019 Basic Anova and Glm
2/40
Why is it called analysis of variance
anyway?
If we are interested in group mean differences, why
are we looking at variance?
t-test only one place to look for variability
More groups, more places to look
Variance of group means around a central tendency
(grand meanignoring group membership) really
tells us, on average, how much each group isdifferent from the central tendency (and each other)
-
8/12/2019 Basic Anova and Glm
3/40
Why is it called analysis of
variance anyway?
Average mean variability around GM needs to
be compared to average variability of scores
around each group mean Variability in any distribution can be broken
down into conceptual parts:
total variability = (variability of each group
mean around the grand mean) + (variability of
each persons score around their group mean)
-
8/12/2019 Basic Anova and Glm
4/40
General Linear Model (GLM)
The basis for most inferential statistics
(e.g. 420, 520, 524, etc.)
Simple form of the GLM
score=grand mean + independent variable + error
Y
-
8/12/2019 Basic Anova and Glm
5/40
General Linear Model (GLM)
The basic idea is that everyone in the
population has the same score (the grand
mean) that is changed by the effects of anindependent variable (A) plus just random
noise (error)
Some levels of A raise scores from theGM, other levels lower scores from the
GM and yet others have no effect.
-
8/12/2019 Basic Anova and Glm
6/40
General Linear Model (GLM)
Error is the noise caused by other variables you arentmeasuring, havent controlled for or are unaware of.
Error like A will have different effects on scores but thishappens independently of A.
If error gets too large it will mask the effects of A and make itimpossible to analyze the effects of A
Most of the effort in research designs is done to try andminimize error to make sure the effect of A is not buried in
the noise. The error term is important because it gives us a yard stick
with which to measure the variability cause by the A effect.We want to make sure that the variability attributable to A isgreater than the naturally occurring variability (error)
-
8/12/2019 Basic Anova and Glm
7/40
GLM
Example of GLMANOVA backwards
We can generate a data set using the GLM formula
We start off with every subject at the GM (e.g. =5)
a1 a2
Case Score Case Score
s1
s2s3
s4
s5
5
55
5
5
s6
s7s8
s9
s10
5
55
5
5
-
8/12/2019 Basic Anova and Glm
8/40
GLM Then we add in the effect of A (a1 adds 2 points and
a2 subtracts 2 points)
a1 a2
Case Score Case Scores1
s2
s3
s4
s5
5 + 2 = 7
5 + 2 = 7
5 + 2 = 7
5 + 2 = 7
5 + 2 = 7
s6
s7
s8
s9
s10
52 = 3
52 = 3
52 = 3
52 = 3
5
2 = 3
135aY 2 15aY
1
2 245aY 22 45aY
1 7aY 3 3aY
-
8/12/2019 Basic Anova and Glm
9/40
GLM
Changes produced by the treatment represent
deviations around the GM
2 2 2
2 2 2 2
( ) [(7 5) (3 5) ]
5(2) 5( 2) 5[(2) ( 2) ] 40
jn Y GM n
or
-
8/12/2019 Basic Anova and Glm
10/40
GLM
Now if we add in some random variation (error)
a1 a2
Case Score Case Score SUM
s1
s2
s3
s4
s5
5 + 2 + 2 = 9
5 + 2 + 0 = 7
5 + 21 = 6
5 + 2 + 0 = 7
5 + 2
1 = 6
s6
s7
s8
s9
s10
5
2 + 0 = 3
522 = 1
52 + 0 = 3
52 + 1 = 4
5
2 + 1 = 4
135aY 2 15aY 50Y
1
2 251aY 22 51aY 2 302Y
1
7a
Y 3
3a
Y 5Y
-
8/12/2019 Basic Anova and Glm
11/40
GLM
Now if we calculate the variance for each group:
The average variance in this case is also going tobe 1.5 (1.5 + 1.5 / 2)
2
2 22
21
( ) 1551
5 1.51 4
a
N
YY
NsN
1
2 22
2 1
( ) 35251
5 1.51 4
a
N
YY
NsN
-
8/12/2019 Basic Anova and Glm
12/40
GLM
We can also calculate the total variability inthe data regardless of treatment group
The average variability of the two groups issmaller than the total variability.
2 22
2
1
( ) 50302
10 5.781 9
N
YY
NsN
-
8/12/2019 Basic Anova and Glm
13/40
Analysisdeviation approach
The total variability can be partitioned
into between group variability and
error.
ij ij j jY GM Y Y Y GM
-
8/12/2019 Basic Anova and Glm
14/40
Analysisdeviation approach
If you ignore group membership and
calculate the mean of all subjects this
is the grand mean and total variabilityis the deviation of all subjects around
this grand mean
Remember that if you just looked atdeviations it would most likely sum to
zero so
-
8/12/2019 Basic Anova and Glm
15/40
Analysisdeviation approach
2 2 2
/
ij j ij j
i j j i j
total bg wg
total A S A
Y GM n Y GM Y Y
SS SS SS
SS SS SS
-
8/12/2019 Basic Anova and Glm
16/40
Analysisdeviation approachA Score 2ijY GM 2jY GM 2ij jY Y
a1
9
7
6
7
6
16
4
1
4
1
(75)2= 4
4
0
1
0
1
a2
3
1
3
4
4
4
16
4
1
1
(35)2= 4
0
4
0
1
1
50Y 2 302Y 52 8 12
5Y 5(8) 40n
52 = 40 + 12
-
8/12/2019 Basic Anova and Glm
17/40
Analysisdeviation approach
degrees of freedom
DFtotal = N1 = 10 -1 = 9
DFA= a1 = 21 = 1
DFS/A= a(S1) = a(n1) = ana =
Na = 2(5)2 = 8
-
8/12/2019 Basic Anova and Glm
18/40
Analysisdeviation approach
Variance or Mean square
MStotal= 52/9 = 5.78
MSA= 40/1 = 40
MSS/A= 12/8 = 1.5
Test statistic
F = MSA/MSS/A= 40/1.5 = 26.67
Critical value is looked up with dfA, dfS/A
and alpha. The test is always non-
directional.
-
8/12/2019 Basic Anova and Glm
19/40
Analysisdeviation approach
ANOVA summary table
Source SS df MS F
A 40 1 40 26.67
S/A 12 8 1.5Total 52 9
-
8/12/2019 Basic Anova and Glm
20/40
Analysiscomputational
approach Equations
Under each part of the equations, you divide by the
number of scores it took to get the number in the
numerator
2
22 2
Y T
Y TSS SS Y Y
N an
2 2jA
a TSS
n an
2
2
/
j
S AaSS Y
n
-
8/12/2019 Basic Anova and Glm
21/40
Analysiscomputational
approach
Analysis of sample problem
250
302 5210TSS
2 2 235 15 5040
5 10
ASS
2 2
/
35 15302 12
5S ASS
-
8/12/2019 Basic Anova and Glm
22/40
Analysisregression
approach
Levels of A Cases Y X YX
a1
S1S2S3
S4S5
9
7
6
76
1
1
1
11
9
7
6
76
a2
S6S7S8
S9S10
3
1
3
44
-1
-1
-1
-1-1
-3
-1
-3
-4-4
Sum 50 0 20
Squares Summed 302 10
N 10
Mean 5
-
8/12/2019 Basic Anova and Glm
23/40
Analysisregression
approach
Y = a + bX + e
e = YY
-
8/12/2019 Basic Anova and Glm
24/40
Analysisregression
approach
Sums of squares
22
2 50( ) 302 5210
YSS Y Y
N
22
2 0( ) 10 1010
XSS X X
N
( )( ) (50)(0)
( ) 20 2010
Y XSP YX YX
N
-
8/12/2019 Basic Anova and Glm
25/40
Analysisregression
approach
( ) ( ) 52TotalSS SS Y
2 2( ) 20
40( ) 10regression
SP YXSS
SS X
( ) ( ) ( ) 52 40 12residual total regressionSS SS SS
-
8/12/2019 Basic Anova and Glm
26/40
Slope
Intercept
2
2
( )( )
( ) 202
( ) 10
Y X
YX SP YXNbSS XX
XN
5 2(0) 5a Y bX
-
8/12/2019 Basic Anova and Glm
27/40
-
8/12/2019 Basic Anova and Glm
28/40
Analysisregression
approach
Degrees of freedom
df(reg.)= # of predictors
df(total)= number of cases1
df(resid.)= df(total)df(reg.) =
91 = 8
-
8/12/2019 Basic Anova and Glm
29/40
Statistical Inference
and the F-test
Any type of measurement will include acertain amount of random variability.
In the F-test this random variability is seenin two places, random variation of eachperson around their group mean and eachgroup mean around the grand mean.
The effect of the IV is seen as adding furthervariation of the group means around theirgrand mean so that the F-test is really:
-
8/12/2019 Basic Anova and Glm
30/40
Statistical Inference
and the F-test
If there is no effect of the IV than theequation breaks down to just:
which means that any differences between the
groups is due to chance alone.
BG
WG
effect errorF
error
1
BG
WG
error
F error
-
8/12/2019 Basic Anova and Glm
31/40
Statistical Inference
and the F-test
The F-distribution is based on havinga between groups variation due to theeffect that causes the F-ratio to be
larger than 1. Like the t-distribution, there is not a
single F-distribution, but a family of
distributions. The F distribution isdetermined by both the degrees offreedom due to the effect and thedegrees of freedom due to the error.
-
8/12/2019 Basic Anova and Glm
32/40
Statistical Inference
and the F-test
-
8/12/2019 Basic Anova and Glm
33/40
Assumptions of the analysis
Robusta robust test is one that is
said to be fairly accurate even if the
assumptions of the analysis are notmet. ANOVA is said to be a fairly
robust analysis. With that said
-
8/12/2019 Basic Anova and Glm
34/40
Assumptions of the analysis
Normality of the sampling distribution of
means
This assumes that the sampling distribution ofeach level of the IV is relatively normal.
The assumption is of the sampling distribution
not the scores themselves
This assumption is said to be met when there
is relatively equal samples in each cell and the
degrees of freedom for error is 20 or more.
-
8/12/2019 Basic Anova and Glm
35/40
Assumptions of the analysis
Normality of the sampling distributionof means If the degrees of freedom for error are small
than:
The individual distributions should bechecked for skewness and kurtosis (see
chapter 2) and the presence of outliers. If the data does not meet the distributional
assumption than transformations will need tobe done.
-
8/12/2019 Basic Anova and Glm
36/40
Assumptions of the analysis
Independence of errorsthe size of the error forone case is not related to the size of the error inanother case.
This is violated if a subject is used more than once(repeated measures case) and is still analyzed withbetween subjects ANOVA
This is also violated if subjects are ran in groups. This
is especially the case if the groups are pre-existing This can also be the case if similar people exist within
a randomized experiment (e.g. age groups) and canbe controlled by using this variable as a blockingvariable.
-
8/12/2019 Basic Anova and Glm
37/40
Assumptions of the analysis
Homogeneity of Variancesince we are
assuming that each sample comes from
the same population and is only affected(or not) by the IV, we assume that each
groups has roughly the same variance
Each sample variance should reflect the
population variance, they should be equal toeach other
Since we use each sample variance to estimate
an average within cell variance, they need to be
roughly equal
-
8/12/2019 Basic Anova and Glm
38/40
Assumptions of the analysis
Homogeneity of Variance
An easy test to assess this assumption is:
2
largest
max 2
smallest
max 10, than the variances are roughly homogenous
SF
S
F
-
8/12/2019 Basic Anova and Glm
39/40
Assumptions of the analysis
Absence of outliers
Outliersa data point that doesnt really
belong with the others Either conceptually, you wanted to study
only women and you have data from a man
Or statistically, a data point does not cluster
with other data points and has undueinfluence on the distribution
This relates back to normality
-
8/12/2019 Basic Anova and Glm
40/40
Assumptions of the analysis
Absence of outliers