introduction to anova (single factor) part 1 2019 · 5 diatoms & heavy metals • effect of...
TRANSCRIPT
1
Reduced slides
Introduction to Analysis of Variance (ANOVA) – Part 1
Single factor
2
The logic of Analysis of Variance
• Is the variance explained by the model much greater (>>) than the residual variance?
• In regression models:
– variance explained by the regression model vs. unexplained variance
• In ANOVA models:
– is the variance explained by the factors much greater (>>) than the unexplained variance?
– in common language: is the variability among treatments greater than the variability within treatments?
ANOVA vs regression
• One factor ANOVA:
– 1 continuous response variable and 1 categorical predictor variable (factor)
• Compare with regression:
– 1 continuous response variable and 1 continuous predictor variable
3
Aims
• Measure relative contribution of different sources of variation (factors or combination of factors) to total variation in response variable
• Test hypotheses about group (treatment) population means for response variable
Data layout
Factor level (group)   1             2             …   i
Replicates             $y_{11}$      $y_{21}$      …   $y_{i1}$
                       $y_{1j}$      $y_{2j}$      …   $y_{ij}$
                       …             …             …   …
                       $y_{1n}$      $y_{2n}$      …   $y_{in}$
Population means       $\mu_1$       $\mu_2$       …   $\mu_i$
Sample means           $\bar{y}_1$   $\bar{y}_2$   …   $\bar{y}_i$
Grand mean $\bar{y}$ estimates $\mu$
4
Types of predictors (factors)
• Fixed factor:
– all levels or groups of interest are used in study
– conclusions are restricted to those groups
• Random factor:
– a random sample of all groups of interest is used in study
– typically individual groups are not of interest
– conclusions extrapolate to all possible groups
Linear model
Linear model for 1 factor ANOVA:
$y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$
where
$\mu$ = overall population mean
$\alpha_i$ = effect of ith treatment or group ($\mu_i - \mu$)
$\varepsilon_{ij}$ = random or unexplained error (variation not explained by treatment effects)
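A minimal sketch of what the linear model says about how one observation arises: an overall mean, plus a group effect, plus random error. The function name `simulate_one_factor`, the normal error distribution, and all numeric values are assumptions chosen for illustration, not part of the slides.

```python
import random

def simulate_one_factor(mu, alphas, n, sigma, seed=0):
    """Draw n replicates per group from y_ij = mu + alpha_i + eps_ij."""
    rng = random.Random(seed)
    # eps_ij is drawn as Normal(0, sigma), with the same sigma in every group
    return [[mu + a + rng.gauss(0.0, sigma) for _ in range(n)] for a in alphas]

# Hypothetical values: overall mean 16 and four treatment effects that sum to zero
mu = 16.0
alphas = [3.5, -1.0, 0.5, -3.0]
data = simulate_one_factor(mu, alphas, n=4, sigma=1.0)

# Sample estimates: group means, grand mean, and estimated effects (mean_i - grand mean)
group_means = [sum(g) / len(g) for g in data]
grand_mean = sum(group_means) / len(group_means)
estimated_effects = [m - grand_mean for m in group_means]
print(estimated_effects)
```

With equal replication the estimated effects sum to zero by construction, mirroring the definition of each effect as a group mean minus the overall mean.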
5
Diatoms & heavy metals
• Effect of heavy metals on species diversity of diatoms in streams in Colorado
• Response variable:
– species diversity of diatoms
• Predictor variable:
– heavy metal level
– categorical with 4 groups (background, low, medium, high)
• Replicates are “stations”
Null hypothesis
• H0: $\mu_1 = \mu_2 = \dots = \mu_i = \mu$
• No difference between population group (treatment) means
• Mean species diversity of diatoms is same for 4 heavy metals levels
6
H0 - fixed factor
• No effects of specific groups (treatments)
H0: $\alpha_1 = \alpha_2 = \dots = \alpha_i = \dots = 0$, where $\alpha_i = \mu_i - \mu$
• No effect of 4 heavy metal levels on diatom species diversity
Inference is only to these 4 heavy metal levels
Streams and diatoms
Does diatom diversity vary by stream?
7
H0 - random factor
• No variation among means of all possible groups (treatments)
H0: $\sigma_A^2 = 0$
H0: $\sum_{i=1}^{N} (\mu_i - \mu)^2 / (N - 1) = 0$
where groups i=1 to N (streams) are chosen randomly
• Test: No variation in diatom species diversity between randomly chosen streams
Inference is to all streams (within ??? region), sampled by N streams
Partitioning variation
• Variation in response variable partitioned into:
– variation explained by difference among groups (or treatments)
– variation not explained (residual variation, within group)
8
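Regression: Analysis of variance in Y
• Total variation (sum of squares) in Y: $\sum (y_i - \bar{y})^2$
• Variation in Y explained by regression (SSRegression): $\sum (\hat{y}_i - \bar{y})^2$
• Variation in Y unexplained by regression (SSResidual): $\sum (y_i - \hat{y}_i)^2$
[Figure: Y plotted against X with the least squares regression line; for an observation $(x_i, y_i)$ the deviation $(y_i - \bar{y})$ is split into $(\hat{y}_i - \bar{y})$ and $(y_i - \hat{y}_i)$]
$\sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2$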
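The regression partition can be checked numerically. The sketch below fits a least squares line to a small hypothetical data set (the `x` and `y` values are invented for illustration) and computes the three sums of squares; the function name `regression_sums_of_squares` is likewise an assumption.

```python
def regression_sums_of_squares(x, y):
    """Return (SS total, SS regression, SS residual) for a least squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Least squares slope and intercept
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    fitted = [intercept + slope * xi for xi in x]
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    ss_regression = sum((fi - y_bar) ** 2 for fi in fitted)
    ss_residual = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return ss_total, ss_regression, ss_residual

# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
ss_total, ss_regression, ss_residual = regression_sums_of_squares(x, y)
print(ss_total, ss_regression + ss_residual)  # the two values agree up to rounding error
```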
9
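Partitioning the Variance
[Figure: observations $y_{ij}$ plotted by group (1, 2, 3), four per group: $y_{11}$ to $y_{14}$, $y_{21}$ to $y_{24}$, $y_{31}$ to $y_{34}$]
Partitioning the Variance
[Figure: the same plot with group means $\bar{y}_1$, $\bar{y}_2$, $\bar{y}_3$ and grand mean $\bar{y}$ marked; observations $y_{21}$ to $y_{24}$ labelled]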
10
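Partitioning the Variance
[Figure: group means $\bar{y}_1$, $\bar{y}_2$, $\bar{y}_3$ and grand mean $\bar{y}$ by group, with the deviations for observation $y_{21}$ marked]
$(y_{ij} - \bar{y}) = (y_{ij} - \bar{y}_i) + (\bar{y}_i - \bar{y})$
• Within groups: $(y_{ij} - \bar{y}_i)$
• Between groups: $(\bar{y}_i - \bar{y})$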
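Partitioning the Variance: Sum of squares
[Figure: group means and grand mean by group, with observations $y_{21}$ to $y_{24}$ labelled]
$\sum\sum (y_{ij} - \bar{y})^2 = n \sum (\bar{y}_i - \bar{y})^2 + \sum\sum (y_{ij} - \bar{y}_i)^2$
Within groups (unexplained): $\sum\sum (y_{ij} - \bar{y}_i)^2$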
11
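Partitioning the Variance: Sum of squares
[Figure: group means and grand mean by group, with the deviation $(\bar{y}_i - \bar{y})$ marked]
$\sum\sum (y_{ij} - \bar{y})^2 = n \sum (\bar{y}_i - \bar{y})^2 + \sum\sum (y_{ij} - \bar{y}_i)^2$
Between groups (explained): $n \sum (\bar{y}_i - \bar{y})^2$, with n = 4 in this example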
ANOVA
SS Total = SS Between groups + SS Within groups (Residual)
$\sum\sum (y_{ij} - \bar{y})^2 = n \sum (\bar{y}_i - \bar{y})^2 + \sum\sum (y_{ij} - \bar{y}_i)^2$
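A short derivation shows why the total sum of squares splits into a within-group and a between-group part with no cross-product term. It assumes p groups with equal replication n, the same layout the df formulas in these slides use.

```latex
\begin{align*}
\sum_{i=1}^{p}\sum_{j=1}^{n} (y_{ij} - \bar{y})^2
  &= \sum_{i=1}^{p}\sum_{j=1}^{n} \left[ (y_{ij} - \bar{y}_i) + (\bar{y}_i - \bar{y}) \right]^2 \\
  &= \sum_{i=1}^{p}\sum_{j=1}^{n} (y_{ij} - \bar{y}_i)^2
     + n \sum_{i=1}^{p} (\bar{y}_i - \bar{y})^2
     + 2 \sum_{i=1}^{p} (\bar{y}_i - \bar{y}) \sum_{j=1}^{n} (y_{ij} - \bar{y}_i) \\
  &= \underbrace{\sum_{i=1}^{p}\sum_{j=1}^{n} (y_{ij} - \bar{y}_i)^2}_{\text{SS Within groups}}
     + \underbrace{n \sum_{i=1}^{p} (\bar{y}_i - \bar{y})^2}_{\text{SS Between groups}}
\end{align*}
```

The cross-product term vanishes because deviations from a group's own mean sum to zero within each group, that is, $\sum_{j} (y_{ij} - \bar{y}_i) = 0$ for every i.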
12
Mean squares
• Average sum-of-squared deviations
• Degrees of freedom:
– number of components minus 1
– df total [pn-1] = df groups [p-1] + df residual [p(n-1)]
• Mean square is a variance:
– SS divided by df
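ANOVA table
Source     SS                                    df       MS
Groups     $n \sum (\bar{y}_i - \bar{y})^2$      p-1      $n \sum (\bar{y}_i - \bar{y})^2 / (p-1)$
Residual   $\sum\sum (y_{ij} - \bar{y}_i)^2$     p(n-1)   $\sum\sum (y_{ij} - \bar{y}_i)^2 / [p(n-1)]$
Total      $\sum\sum (y_{ij} - \bar{y})^2$       pn-1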
13
Treatments (= groups) explain nothing, i.e. SSGroups equals zero
Replicate   Group1   Group2   Group3   Group4
1           16.0     15.0     16.0     17.0
2           15.0     17.0     16.0     16.0
3           17.0     16.0     17.0     15.0
4           16.0     16.0     15.0     16.0
Mean        16.0     16.0     16.0     16.0
Grand mean = 16.0
Treatments (= groups) explain everything, i.e. SSResidual equals zero
Replicate   Group1   Group2   Group3   Group4
1           19.5     15.0     16.5     13.0
2           19.5     15.0     16.5     13.0
3           19.5     15.0     16.5     13.0
4           19.5     15.0     16.5     13.0
Mean        19.5     15.0     16.5     13.0
Grand mean = 16.0
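The two extreme cases can be verified by computing the sums of squares directly from the tabulated values. This is a sketch; the function name `partition_ss` is an assumption, and each inner list holds one group (one column of the table).

```python
def partition_ss(groups):
    """Return (SS total, SS groups, SS residual) for equal-sized groups."""
    n = len(groups[0])
    all_values = [y for g in groups for y in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / n for g in groups]
    ss_total = sum((y - grand_mean) ** 2 for y in all_values)
    ss_groups = n * sum((m - grand_mean) ** 2 for m in group_means)
    ss_residual = sum((y - m) ** 2 for g, m in zip(groups, group_means) for y in g)
    return ss_total, ss_groups, ss_residual

# First table: every group mean equals the grand mean
explain_nothing = [
    [16.0, 15.0, 17.0, 16.0],  # Group1
    [15.0, 17.0, 16.0, 16.0],  # Group2
    [16.0, 16.0, 17.0, 15.0],  # Group3
    [17.0, 16.0, 15.0, 16.0],  # Group4
]
# Second table: no variation within any group
explain_everything = [[19.5] * 4, [15.0] * 4, [16.5] * 4, [13.0] * 4]

print(partition_ss(explain_nothing))     # (8.0, 0.0, 8.0)
print(partition_ss(explain_everything))  # (90.0, 90.0, 0.0)
```

In the first case all of SS Total is residual; in the second all of it is between groups.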
14
Testing ANOVA H0
• All population group means the same: $\mu_1 = \mu_2 = \dots = \mu_i = \dots = \mu_a = \mu$
• Fixed factor: H0: $\alpha_1 = \alpha_2 = \dots = \alpha_i = \dots = 0$
– Means that there is no variability across a fixed set of group means (limited inference)
• Random factor (A): H0: $\sigma_A^2 = 0$
– Means that there is no variability across all possible group means (broad inference)
Remember: Linear model for 1 factor ANOVA:
$y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$ and $\mu_i = \mu + \alpha_i$, where $\alpha_i$ can be positive or negative
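ANOVA table
Source     SS                                    df       MS                                             F
Groups     $n \sum (\bar{y}_i - \bar{y})^2$      p-1      $n \sum (\bar{y}_i - \bar{y})^2 / (p-1)$       MSGroups / MSResidual
Residual   $\sum\sum (y_{ij} - \bar{y}_i)^2$     p(n-1)   $\sum\sum (y_{ij} - \bar{y}_i)^2 / [p(n-1)]$
Total      $\sum\sum (y_{ij} - \bar{y})^2$       pn-1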
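As a sketch of how the table entries fit together, the function below fills in SS, df, MS and F for p groups of equal size n. The function name `one_way_anova` and the data are hypothetical, chosen so the arithmetic is easy to follow by hand.

```python
def one_way_anova(groups):
    """Single-factor ANOVA table entries for p groups of equal size n."""
    p = len(groups)
    n = len(groups[0])
    grand_mean = sum(sum(g) for g in groups) / (p * n)
    group_means = [sum(g) / n for g in groups]
    ss_groups = n * sum((m - grand_mean) ** 2 for m in group_means)
    ss_residual = sum((y - m) ** 2 for g, m in zip(groups, group_means) for y in g)
    df_groups, df_residual = p - 1, p * (n - 1)
    ms_groups = ss_groups / df_groups
    ms_residual = ss_residual / df_residual
    return {
        "SS": (ss_groups, ss_residual, ss_groups + ss_residual),  # groups, residual, total
        "df": (df_groups, df_residual, df_groups + df_residual),
        "MS": (ms_groups, ms_residual),
        "F": ms_groups / ms_residual,
    }

# Hypothetical data: 3 groups with 3 replicates each (group means 2, 4, 6)
table = one_way_anova([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [5.0, 6.0, 7.0]])
print(table["SS"])  # (24.0, 6.0, 30.0)
print(table["df"])  # (2, 6, 8)
print(table["F"])   # 12.0
```

Here MSGroups = 24 / 2 = 12 and MSResidual = 6 / 6 = 1, so the F-ratio is 12 on 2 and 6 df.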
15
F-ratio statistic
• F-ratio statistic is the ratio of 2 sample variances (i.e. 2 mean squares)
• Probability distribution of F-ratio is known:
– different distributions depending on df of the 2 variances
• If H0 is true and homogeneity of variances holds, F-ratio follows an F distribution
F distribution – a null distribution
[Figure: probability density P(F) plotted against F (0 to 5) for 3, 24 df]
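The idea of a null distribution can be illustrated by simulation: generate many data sets in which H0 is true and look at how the F-ratio varies. The sketch assumes 4 groups of 7 replicates (giving 3 and 24 df), normal errors, and an arbitrary seed. The cut-off 3.01 is the approximate tabulated 5% critical value for 3 and 24 df.

```python
import random

def f_ratio(groups):
    """F-ratio (MSGroups / MSResidual) for equal-sized groups."""
    p = len(groups)
    n = len(groups[0])
    grand_mean = sum(sum(g) for g in groups) / (p * n)
    means = [sum(g) / n for g in groups]
    ms_groups = n * sum((m - grand_mean) ** 2 for m in means) / (p - 1)
    ms_residual = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g) / (p * (n - 1))
    return ms_groups / ms_residual

def simulate_null_f(p=4, n=7, n_sims=5000, seed=1):
    """F-ratios from data sets in which all groups share one mean and variance."""
    rng = random.Random(seed)
    return [
        f_ratio([[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(p)])
        for _ in range(n_sims)
    ]

null_f = simulate_null_f()
mean_f = sum(null_f) / len(null_f)                 # close to 1 when H0 is true
tail = sum(f > 3.01 for f in null_f) / len(null_f)  # roughly 0.05
print(round(mean_f, 2), round(tail, 3))
```

Most simulated F-ratios fall near 1, and only about one in twenty exceeds the 5% critical value, which is what makes a large observed F-ratio evidence against H0.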
16
Expected mean squares
• If factor is fixed and homogeneity of variance assumption holds:
– MSGroups estimates $\sigma^2 + n \sum \alpha_i^2 / (p-1)$
– MSResidual estimates $\sigma^2$
F-ratio = MSGroups / MSResidual
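A simulation sketch of the fixed-factor expectation, under which MSGroups estimates $\sigma^2 + n \sum \alpha_i^2 / (p-1)$. The effects, error standard deviation, replication, seed and the helper name `ms_groups` are all hypothetical choices for illustration; the effects are chosen to sum to zero.

```python
import random

def ms_groups(groups):
    """Mean square among groups for equal-sized groups."""
    p = len(groups)
    n = len(groups[0])
    grand_mean = sum(sum(g) for g in groups) / (p * n)
    group_means = [sum(g) / n for g in groups]
    return n * sum((m - grand_mean) ** 2 for m in group_means) / (p - 1)

# Hypothetical fixed effects (summing to zero), error SD, replication, overall mean
alphas = [3.5, -1.0, 0.5, -3.0]
sigma, n, mu = 1.0, 4, 16.0
p = len(alphas)

# Expected MSGroups: sigma^2 + n * sum(alpha_i^2) / (p - 1) = 1 + 4 * 22.5 / 3
expected_ms = sigma ** 2 + n * sum(a ** 2 for a in alphas) / (p - 1)

# Average MSGroups over many simulated data sets from y_ij = mu + alpha_i + eps_ij
rng = random.Random(42)
n_sims = 2000
simulated = []
for _ in range(n_sims):
    data = [[mu + a + rng.gauss(0.0, sigma) for _ in range(n)] for a in alphas]
    simulated.append(ms_groups(data))
average_ms = sum(simulated) / n_sims

print(expected_ms)  # 31.0
print(average_ms)   # close to the expected value
```

Because MSResidual estimates only $\sigma^2$ (here 1), nonzero effects inflate the numerator of the F-ratio and push it above 1.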
Testing H0 - fixed factor
• If H0 is true:
– all $\alpha_i$'s = 0
– MSGroups and MSResidual both estimate $\sigma^2$
– so F-ratio ≈ 1
• If H0 is false:
– at least one $\alpha_i \neq 0$
– MSGroups estimates $\sigma^2$ + treatment effects
– so F-ratio > 1
F-ratio = MSGroups / MSResidual, where MSGroups estimates $\sigma^2 + n \sum \alpha_i^2 / (p-1)$ and MSResidual estimates $\sigma^2$
17
• If factor is fixed and homogeneity of variance assumption holds:
Mean square   Expected                                  Calculated
MSGroups      $\sigma^2 + n \sum \alpha_i^2 / (p-1)$    $n \sum (\bar{y}_i - \bar{y})^2 / (p-1)$
MSResidual    $\sigma^2$                                $\sum\sum (y_{ij} - \bar{y}_i)^2 / [p(n-1)]$
F-ratio = MSGroups / MSResidual
Expected mean squares (random factor)
• If factor is random and homogeneity of variance assumption holds:
– MSGroups estimates $\sigma^2 + n \sigma_A^2$
– MSResidual estimates $\sigma^2$
F-ratio = MSGroups / MSResidual
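Because MSGroups estimates $\sigma^2 + n \sigma_A^2$ and MSResidual estimates $\sigma^2$ for a random factor, subtracting and dividing by n gives a simple estimate of the added variance among groups. The sketch below assumes equal replication. The function name `variance_components`, the data, and the choice to truncate a negative estimate at zero are assumptions rather than anything stated in the slides.

```python
def variance_components(groups):
    """Estimate (sigma^2, sigma_A^2) from mean squares, assuming equal group sizes."""
    p = len(groups)
    n = len(groups[0])
    grand_mean = sum(sum(g) for g in groups) / (p * n)
    group_means = [sum(g) / n for g in groups]
    ms_groups = n * sum((m - grand_mean) ** 2 for m in group_means) / (p - 1)
    ms_residual = sum(
        (y - m) ** 2 for g, m in zip(groups, group_means) for y in g
    ) / (p * (n - 1))
    sigma2_hat = ms_residual
    # Solve E(MSGroups) = sigma^2 + n * sigma_A^2 for sigma_A^2; truncate at zero
    sigma2_a_hat = max((ms_groups - ms_residual) / n, 0.0)
    return sigma2_hat, sigma2_a_hat

# Hypothetical data: 3 randomly chosen streams with 3 stations each
streams = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [5.0, 6.0, 7.0]]
sigma2_hat, sigma2_a_hat = variance_components(streams)
print(sigma2_hat, sigma2_a_hat)
```

For these values MSGroups = 12 and MSResidual = 1, so the among-stream variance estimate is (12 - 1) / 3.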
18
Testing H0 - random factor
• If H0 is true:
– $\sigma_A^2 = 0$
– MSGroups and MSResidual both estimate $\sigma^2$
– so F-ratio ≈ 1
• If H0 is false:
– $\sigma_A^2 > 0$
– MSGroups estimates $\sigma^2$ plus added variance due to groups or treatments
– so F-ratio > 1
F-ratio = MSGroups / MSResidual, where MSGroups estimates $\sigma^2 + n \sigma_A^2$ and MSResidual estimates $\sigma^2$
• If factor is random and homogeneity of variance assumption holds:
Mean square   Expected                     Calculated
MSGroups      $\sigma^2 + n \sigma_A^2$    $n \sum (\bar{y}_i - \bar{y})^2 / (p-1)$
MSResidual    $\sigma^2$                   $\sum\sum (y_{ij} - \bar{y}_i)^2 / [p(n-1)]$
F-ratio = MSGroups / MSResidual
19
Full set of slides
20
Introduction to Analysis of Variance (ANOVA) – Part 1
Single factor
The logic of Analysis of Variance
• Is the variance explained by the model much greater (>>) than the residual variance?
• In regression models:
– variance explained by the regression model vs. unexplained variance
• In ANOVA models:
– is the variance explained by the factors much greater (>>) than the unexplained variance?
– in common language: is the variability among treatments greater than the variability within treatments?
21
ANOVA vs regression
• One factor ANOVA:
– 1 continuous response variable and 1 categorical predictor variable (factor)
• Compare with regression:
– 1 continuous response variable and 1 continuous predictor variable
Aims
• Measure relative contribution of different sources of variation (factors or combination of factors) to total variation in response variable
• Test hypotheses about group (treatment) population means for response variable
22
Terminology
• Factor (predictor variable):
– usually designated factor A
– number of levels/groups/treatments = p
• Number of replicates within each group:
– n
• Each observation:
– y
Data layout
Factor level (group)   1             2             …   i
Replicates             $y_{11}$      $y_{21}$      …   $y_{i1}$
                       $y_{1j}$      $y_{2j}$      …   $y_{ij}$
                       …             …             …   …
                       $y_{1n}$      $y_{2n}$      …   $y_{in}$
Population means       $\mu_1$       $\mu_2$       …   $\mu_i$
Sample means           $\bar{y}_1$   $\bar{y}_2$   …   $\bar{y}_i$
Grand mean $\bar{y}$ estimates $\mu$
23
Types of predictors (factors)
• Fixed factor:
– all levels or groups of interest are used in study
– conclusions are restricted to those groups
• Random factor:
– a random sample of all groups of interest is used in study
– typically individual groups are not of interest
– conclusions extrapolate to all possible groups
Linear model
Linear model for 1 factor ANOVA:
$y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$
where
$\mu$ = overall population mean
$\alpha_i$ = effect of ith treatment or group ($\mu_i - \mu$)
$\varepsilon_{ij}$ = random or unexplained error (variation not explained by treatment effects)
24
Compare with regression model
$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
• intercept is replaced by $\mu$
• slope is replaced by $\alpha_i$ (treatment effect):
– predictor variable is categorical rather than continuous
– still measures “effect” of predictor variable
Diatoms & heavy metals
• Effect of heavy metals on species diversity of diatoms in streams in Colorado
• Response variable:
– species diversity of diatoms
• Predictor variable:
– heavy metal level
– categorical with 4 groups (background, low, medium, high)
• Replicates are “stations”
25
Null hypothesis
• H0: $\mu_1 = \mu_2 = \dots = \mu_i = \mu$
• No difference between population group (treatment) means
• Mean species diversity of diatoms is same for 4 heavy metals levels
H0 - fixed factor
• No effects of specific groups (treatments)
H0: $\alpha_1 = \alpha_2 = \dots = \alpha_i = \dots = 0$, where $\alpha_i = \mu_i - \mu$
• No effect of 4 heavy metal levels on diatom species diversity
Inference is only to these 4 heavy metal levels
26
Streams and diatoms
Does diatom diversity vary by stream?
H0 - random factor
• No variation among means of all possible groups (treatments)
H0: $\sigma_A^2 = 0$
H0: $\sum_{i=1}^{N} (\mu_i - \mu)^2 / (N - 1) = 0$
where groups i=1 to N (streams) are chosen randomly
• Test: No variation in diatom species diversity between randomly chosen streams
Inference is to all streams (within ??? region), sampled by N streams
27
Basic assumption of ANOVA (single factor)
$\sigma_1^2 = \sigma_2^2 = \dots = \sigma_i^2 = \dots = \sigma^2$
where $\sigma_i^2$ = population variance of dependent variable ($y_i$) in each group (this is the within group variation)
Each group (or treatment) population has similar variance: the homogeneity of variance assumption
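One informal way to look at this assumption is to compare the sample variances within each group. The sketch below uses hypothetical data in which the third group is deliberately more variable. The helper names and the use of a largest-to-smallest variance ratio as a rough diagnostic are assumptions, not a procedure given in the slides.

```python
from statistics import variance

def group_variances(groups):
    """Sample variance (n - 1 denominator) within each group."""
    return [variance(g) for g in groups]

def variance_ratio(groups):
    """Largest within-group sample variance divided by the smallest.

    Undefined (division by zero) if any group has no variation at all.
    """
    v = group_variances(groups)
    return max(v) / min(v)

# Hypothetical data: the third group is much more variable than the other two
groups = [
    [16.0, 15.0, 17.0, 16.0],
    [15.0, 17.0, 16.0, 16.0],
    [12.0, 20.0, 14.0, 18.0],
]
print(group_variances(groups))
print(variance_ratio(groups))
```

A ratio near 1 is consistent with similar variances; a large ratio, as here, suggests the assumption deserves a closer look before relying on the F-ratio.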
Partitioning variation
• Variation in response variable partitioned into:
– variation explained by difference among groups (or treatments)
– variation not explained (residual variation, within group)
28
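Regression: Analysis of variance in Y
• Total variation (sum of squares) in Y: $\sum (y_i - \bar{y})^2$
• Variation in Y explained by regression (SSRegression): $\sum (\hat{y}_i - \bar{y})^2$
• Variation in Y unexplained by regression (SSResidual): $\sum (y_i - \hat{y}_i)^2$
[Figure: Y plotted against X with the least squares regression line; for an observation $(x_i, y_i)$ the deviation $(y_i - \bar{y})$ is split into $(\hat{y}_i - \bar{y})$ and $(y_i - \hat{y}_i)$]
$\sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2$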
29
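ANOVA
SS Total = SS Between groups + SS Within groups (Residual)
$\sum\sum (y_{ij} - \bar{y})^2 = n \sum (\bar{y}_i - \bar{y})^2 + \sum\sum (y_{ij} - \bar{y}_i)^2$
Partitioning the Variance
[Figure: observations $y_{ij}$ plotted by group (1, 2, 3), four per group: $y_{11}$ to $y_{14}$, $y_{21}$ to $y_{24}$, $y_{31}$ to $y_{34}$]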
30
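Partitioning the Variance
[Figure: observations by group with group means $\bar{y}_1$, $\bar{y}_2$, $\bar{y}_3$ and grand mean $\bar{y}$ marked; observations $y_{21}$ to $y_{24}$ labelled]
Partitioning the Variance
[Figure: group means and grand mean by group, with the deviations for observation $y_{21}$ marked]
$(y_{ij} - \bar{y}) = (y_{ij} - \bar{y}_i) + (\bar{y}_i - \bar{y})$
• Within groups: $(y_{ij} - \bar{y}_i)$
• Between groups: $(\bar{y}_i - \bar{y})$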
31
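Partitioning the Variance
[Figure: group means and grand mean by group, with observations $y_{21}$ to $y_{24}$ labelled]
$\sum\sum (y_{ij} - \bar{y})^2 = n \sum (\bar{y}_i - \bar{y})^2 + \sum\sum (y_{ij} - \bar{y}_i)^2$
Within groups: $\sum\sum (y_{ij} - \bar{y}_i)^2$
Partitioning the Variance
[Figure: group means and grand mean by group, with the deviation $(\bar{y}_i - \bar{y})$ marked]
$\sum\sum (y_{ij} - \bar{y})^2 = n \sum (\bar{y}_i - \bar{y})^2 + \sum\sum (y_{ij} - \bar{y}_i)^2$
Between groups: $n \sum (\bar{y}_i - \bar{y})^2$, with n = 4 in this example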
32
Mean squares
• Average sum-of-squared deviations
• Degrees of freedom:
– number of components minus 1
– df total [pn-1] = df groups [p-1] + df residual [p(n-1)]
• Mean square is a variance:
– SS divided by df
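ANOVA table
Source     SS                                    df       MS
Groups     $n \sum (\bar{y}_i - \bar{y})^2$      p-1      $n \sum (\bar{y}_i - \bar{y})^2 / (p-1)$
Residual   $\sum\sum (y_{ij} - \bar{y}_i)^2$     p(n-1)   $\sum\sum (y_{ij} - \bar{y}_i)^2 / [p(n-1)]$
Total      $\sum\sum (y_{ij} - \bar{y})^2$       pn-1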
33
Treatments (= groups) explain nothing, i.e. SSGroups equals zero
Replicate   Group1   Group2   Group3   Group4
1           16.0     15.0     16.0     17.0
2           15.0     17.0     16.0     16.0
3           17.0     16.0     17.0     15.0
4           16.0     16.0     15.0     16.0
Mean        16.0     16.0     16.0     16.0
Grand mean = 16.0
Treatments (= groups) explain everything, i.e. SSResidual equals zero
Replicate   Group1   Group2   Group3   Group4
1           19.5     15.0     16.5     13.0
2           19.5     15.0     16.5     13.0
3           19.5     15.0     16.5     13.0
4           19.5     15.0     16.5     13.0
Mean        19.5     15.0     16.5     13.0
Grand mean = 16.0
34
Testing ANOVA H0
• All population group means the same: $\mu_1 = \mu_2 = \dots = \mu_i = \dots = \mu_a = \mu$
• Fixed factor: H0: $\alpha_1 = \alpha_2 = \dots = \alpha_i = \dots = 0$
– Means that there is no variability across a fixed set of group means (limited inference)
• Random factor (A): H0: $\sigma_A^2 = 0$
– Means that there is no variability across all possible group means (broad inference)
Remember: Linear model for 1 factor ANOVA:
$y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$ and $\mu_i = \mu + \alpha_i$, where $\alpha_i$ can be positive or negative
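ANOVA table
Source     SS                                    df       MS                                             F
Groups     $n \sum (\bar{y}_i - \bar{y})^2$      p-1      $n \sum (\bar{y}_i - \bar{y})^2 / (p-1)$       MSGroups / MSResidual
Residual   $\sum\sum (y_{ij} - \bar{y}_i)^2$     p(n-1)   $\sum\sum (y_{ij} - \bar{y}_i)^2 / [p(n-1)]$
Total      $\sum\sum (y_{ij} - \bar{y})^2$       pn-1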
35
F-ratio statistic
• F-ratio statistic is the ratio of 2 sample variances (i.e. 2 mean squares)
• Probability distribution of F-ratio is known:
– different distributions depending on df of the 2 variances
• If H0 is true and homogeneity of variances holds, F-ratio follows an F distribution
F distribution
[Figure: probability density P(F) plotted against F (0 to 5) for 3, 24 df]
36
Expected mean squares
• If factor is fixed and homogeneity of variance assumption holds:
– MSGroups estimates $\sigma^2 + n \sum \alpha_i^2 / (p-1)$
– MSResidual estimates $\sigma^2$
F-ratio = MSGroups / MSResidual
Testing H0 - fixed factor
• If H0 is true:
– all $\alpha_i$'s = 0
– MSGroups and MSResidual both estimate $\sigma^2$
– so F-ratio ≈ 1
• If H0 is false:
– at least one $\alpha_i \neq 0$
– MSGroups estimates $\sigma^2$ + treatment effects
– so F-ratio > 1
F-ratio = MSGroups / MSResidual, where MSGroups estimates $\sigma^2 + n \sum \alpha_i^2 / (p-1)$ and MSResidual estimates $\sigma^2$
37
• If factor is fixed and homogeneity of variance assumption holds:
Mean square   Expected                                  Calculated
MSGroups      $\sigma^2 + n \sum \alpha_i^2 / (p-1)$    $n \sum (\bar{y}_i - \bar{y})^2 / (p-1)$
MSResidual    $\sigma^2$                                $\sum\sum (y_{ij} - \bar{y}_i)^2 / [p(n-1)]$
F-ratio = MSGroups / MSResidual
Expected mean squares (random factor)
• If factor is random and homogeneity of variance assumption holds:
– MSGroups estimates $\sigma^2 + n \sigma_A^2$
– MSResidual estimates $\sigma^2$
F-ratio = MSGroups / MSResidual
38
Testing H0 - random factor
• If H0 is true:
– $\sigma_A^2 = 0$
– MSGroups and MSResidual both estimate $\sigma^2$
– so F-ratio ≈ 1
• If H0 is false:
– $\sigma_A^2 > 0$
– MSGroups estimates $\sigma^2$ plus added variance due to groups or treatments
– so F-ratio > 1
F-ratio = MSGroups / MSResidual, where MSGroups estimates $\sigma^2 + n \sigma_A^2$ and MSResidual estimates $\sigma^2$
• If factor is random and homogeneity of variance assumption holds:
Mean square   Expected                     Calculated
MSGroups      $\sigma^2 + n \sigma_A^2$    $n \sum (\bar{y}_i - \bar{y})^2 / (p-1)$
MSResidual    $\sigma^2$                   $\sum\sum (y_{ij} - \bar{y}_i)^2 / [p(n-1)]$
F-ratio = MSGroups / MSResidual