Introduction to Biostatistical Analysis Using R
Statistics course for first-year PhD students
Lecturer: Lorenzo Marini
DAFNAE, University of Padova, Viale dell’Università 16, 35020 Legnaro, Padova.
E-mail: [email protected]
Tel.: +39 0498272807
http://www.biodiversity-lorenzomarini.eu/
Session 3
Lecture: Analysis of Variance (ANOVA)
Practical: ANOVA
Statistical modelling: more than one parameter
Nature of the response variable:
- NORMAL (continuous): General Linear Models
- POISSON, BINOMIAL …: Generalized Linear Models (GLM)
Nature of the explanatory variables (General Linear Models):
- Categorical: ANOVA (Session 3)
- Continuous: Regression (Session 4)
- Categorical + continuous: ANCOVA (not covered)
ANOVA: aov()
ASSUMPTIONS
Independence of cases - this is a requirement of the design.
Normality - the distribution in each cell is normal [hist(), qqnorm(), shapiro.test()]
Homogeneity of variances - the variance of data in groups should be the same (variance homogeneity with fligner.test()).
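The checks named above can be sketched in R as follows. This is a minimal illustration that uses the maize yields from the one-way ANOVA example of this lecture; the object names (maize, shapiro_p, fligner) are chosen here for illustration only.

```r
# Maize yields from the one-way ANOVA example (4 varieties, n = 5, 5, 4, 5)
maize <- data.frame(
  variety = factor(rep(c("V1", "V2", "V3", "V4"), times = c(5, 5, 4, 5))),
  yield = c(6.08, 5.70, 6.50, 5.86, 6.17,
            6.87, 6.77, 7.40, 6.63, 6.98,
            10.26, 10.21, 10.02, 9.65,
            8.79, 8.42, 8.31, 8.57, 9.03)
)

# Normality within each cell: one Shapiro-Wilk test per variety
shapiro_p <- sapply(split(maize$yield, maize$variety),
                    function(y) shapiro.test(y)$p.value)
print(round(shapiro_p, 3))

# Homogeneity of variances across groups (Fligner-Killeen test)
fligner <- fligner.test(yield ~ variety, data = maize)
print(fligner)

# Graphical check on the residuals (run in an interactive session):
# fit <- aov(yield ~ variety, data = maize)
# qqnorm(resid(fit)); qqline(resid(fit))
```

With only 4 or 5 observations per group these tests have little power, so the graphical check is usually worth doing as well.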
ANOVA tests mean differences between groups defined by categorical variables
Examples of factors: Fertirrigation (4 levels); drug (4 doses); Gender (♀♀, ♂♂)
One-way ANOVA: ONE factor with 2 or more levels
Multi-way ANOVA: 2 or more factors, each with 2 or more levels
One-way ANOVA step by step
1. Test normality
2. Test homogeneity of variance within each group
3. Run the ANOVA
4. Reject or fail to reject H0 (all the means are equal)
5. If H0 is rejected, 2 approaches:
5A. Multiple comparisons to test differences between the levels of the factor
5B. Model simplification working with contrasts
One-way ANOVA
One-way ANOVA is used to test for differences among two or more independent groups
Maize: 4 varieties (k)
y: productivity (NORMAL, CONTINUOUS)
x: variety (CATEGORICAL, four levels: x1, x2, x3, x4)
Model: y_i = a + b·x2 + c·x3 + d·x4, where x2, x3 and x4 are 0/1 indicators of varieties 2, 3 and 4
H0: µ1 = µ2 = µ3 = µ4
Ha: At least two means differ
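ANOVA model (y = a + b·x2 + c·x3 + d·x4): a is the mean of variety 1; b, c and d are the differences of the other means from it
a = µ1
b = µ2 - µ1
c = µ3 - µ1
d = µ4 - µ1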
        Var 1   Var 2   Var 3   Var 4
        6.08    6.87    10.26   8.79
        5.7     6.77    10.21   8.42
        6.5     7.4     10.02   8.31
        5.86    6.63    9.65    8.57
        6.17    6.98    -       9.03
n_i     5       5       4       5
µ_i     6.06    6.93    10.03   8.62
One-way ANOVA
Sum of squares (SS): deviance
SS Total = Σ (y_i - grand mean)^2
SS Factor = Σ n_i (group mean_i - grand mean)^2
SS Error (within group) = Σ (y_i - group mean_i)^2
Grand mean = 7.80
Number of observations: N = 19
Number of groups: k = 4
Degrees of freedom (df)
Total: N - 1
Group: k - 1
Error: N - k
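The sums of squares and degrees of freedom can be computed by hand in R. The sketch below uses the maize data from the table of this example; variable names are illustrative.

```r
# Maize yields (4 varieties, n = 5, 5, 4, 5)
maize <- data.frame(
  variety = factor(rep(c("V1", "V2", "V3", "V4"), times = c(5, 5, 4, 5))),
  yield = c(6.08, 5.70, 6.50, 5.86, 6.17,
            6.87, 6.77, 7.40, 6.63, 6.98,
            10.26, 10.21, 10.02, 9.65,
            8.79, 8.42, 8.31, 8.57, 9.03)
)

grand_mean  <- mean(maize$yield)                      # mean of all 19 values
group_means <- tapply(maize$yield, maize$variety, mean)
n_i         <- tapply(maize$yield, maize$variety, length)
own_mean    <- ave(maize$yield, maize$variety)        # group mean repeated per observation

ss_total  <- sum((maize$yield - grand_mean)^2)
ss_factor <- sum(n_i * (group_means - grand_mean)^2)
ss_error  <- sum((maize$yield - own_mean)^2)

N  <- nrow(maize)
k  <- nlevels(maize$variety)
df <- c(total = N - 1, factor = k - 1, error = N - k)

print(c(SS_total = ss_total, SS_factor = ss_factor, SS_error = ss_error))
print(df)
```

The two components add up to the total: ss_factor + ss_error equals ss_total up to rounding error.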
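One-way ANOVA: SS explanation
SS Total = SS Factor + SS Error
[Figure: deviations of the observations from the grand mean (SS Total), of the group means (mean1 to mean4) from the grand mean (SS Factor), and of the observations from their own group mean (SS Error)]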
One-way ANOVA
[Figure: group means µ1 to µ4 around the grand mean]
SS can be divided by the respective df to get a variance
MS = SS / df (mean squared deviation)
The pseudo-replication would work here!!!
SS Total = SS Factor + SS Error
MS Factor = SS Factor / df Factor
MS Error = SS Error / df Error
One-way ANOVA: F test (variance)
F = Factor MS / Error MS
If the k means are not equal, then the Factor MS in the population will be greater than the population’s Error MS
If the calculated F is large (e.g. P < 0.05), then we can reject H0
All we conclude is that at least two means are different!!!
Next step, either:
- A POSTERIORI MULTIPLE COMPARISONS
- WORKING WITH CONTRASTS
How to define the correct F test can be a difficult task with complex design (BE EXTREMELY CAREFUL!!!!)
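For the simple one-way case, the F ratio can be rebuilt from the mean squares and compared with what aov() reports. A minimal sketch on the maize data of this lecture (object names are illustrative):

```r
# Maize yields (4 varieties, n = 5, 5, 4, 5)
maize <- data.frame(
  variety = factor(rep(c("V1", "V2", "V3", "V4"), times = c(5, 5, 4, 5))),
  yield = c(6.08, 5.70, 6.50, 5.86, 6.17,
            6.87, 6.77, 7.40, 6.63, 6.98,
            10.26, 10.21, 10.02, 9.65,
            8.79, 8.42, 8.31, 8.57, 9.03)
)

fit <- aov(yield ~ variety, data = maize)
tab <- summary(fit)[[1]]          # ANOVA table: row 1 = factor, row 2 = residuals
print(tab)

ms_factor <- tab[["Mean Sq"]][1]
ms_error  <- tab[["Mean Sq"]][2]

# F = Factor MS / Error MS, with its upper-tail probability
f_by_hand <- ms_factor / ms_error
p_by_hand <- pf(f_by_hand, df1 = tab[["Df"]][1], df2 = tab[["Df"]][2],
                lower.tail = FALSE)
print(c(F = f_by_hand, P = p_by_hand))
```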
One-way ANOVA: contrasts
Contrasts are the essence of hypothesis testing and model simplification in analysis of variance and analysis of covariance.
They are used to compare means or groups of means with other means or groups of means
We use contrasts to carry out t tests AFTER having found a significant effect with the F test
- We can use contrasts in model simplification (merge similar factor levels)
- Often we can avoid post-hoc multiple comparisons
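A possible sketch of both uses of contrasts in R, on the maize data of this lecture. It assumes R's default treatment contrasts, where each coefficient is the difference from the first level. The merged factor variety_merged is a hypothetical simplification introduced here only to illustrate the technique.

```r
# Maize yields (4 varieties, n = 5, 5, 4, 5)
maize <- data.frame(
  variety = factor(rep(c("V1", "V2", "V3", "V4"), times = c(5, 5, 4, 5))),
  yield = c(6.08, 5.70, 6.50, 5.86, 6.17,
            6.87, 6.77, 7.40, 6.63, 6.98,
            10.26, 10.21, 10.02, 9.65,
            8.79, 8.42, 8.31, 8.57, 9.03)
)

# Treatment contrasts: intercept = mean of V1, other rows = t tests against V1
full <- lm(yield ~ variety, data = maize)
print(summary(full)$coefficients)

# Model simplification: can V1 and V2 be merged into a single level?
maize$variety_merged <- maize$variety
levels(maize$variety_merged)[levels(maize$variety_merged) %in% c("V1", "V2")] <- "V1V2"
reduced <- lm(yield ~ variety_merged, data = maize)

# F test of the simpler model against the full model
cmp <- anova(reduced, full)
print(cmp)
```

A small P value in the comparison means the merged model fits significantly worse, so the two levels should be kept separate.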
One-way ANOVA: multiple comparisons
If F calculated > F critical, then we can reject H0
At least two means are different!!!
A POSTERIORI MULTIPLE COMPARISONS (lots of methods!!!)
Multiple comparison procedures are then used to determine which means are different from which.
Comparing K means involves K(K - 1)/2 pairwise comparisons.
E.g. Tukey-Kramer, Duncan, Scheffé…
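As one example among the many methods, Tukey's honest significant differences are available in base R through TukeyHSD(). A sketch on the maize data of this lecture; with K = 4 varieties it returns the 4 x 3 / 2 = 6 pairwise comparisons.

```r
# Maize yields (4 varieties, n = 5, 5, 4, 5)
maize <- data.frame(
  variety = factor(rep(c("V1", "V2", "V3", "V4"), times = c(5, 5, 4, 5))),
  yield = c(6.08, 5.70, 6.50, 5.86, 6.17,
            6.87, 6.77, 7.40, 6.63, 6.98,
            10.26, 10.21, 10.02, 9.65,
            8.79, 8.42, 8.31, 8.57, 9.03)
)

fit <- aov(yield ~ variety, data = maize)

# All pairwise differences with family-wise 95% confidence intervals
tuk <- TukeyHSD(fit, "variety", conf.level = 0.95)
print(tuk)

# One row per pair: difference, lower and upper limits, adjusted P value
n_pairs <- nrow(tuk$variety)
```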
One-way ANOVA: nonparametric
If the assumptions are seriously violated, then one can opt for a nonparametric ANOVA
However, one-way ANOVA is quite robust even under non-normality and non-homogeneity of variance
Kruskal-Wallis test: kruskal.test()
- ANOVA by ranks
- If k = 2, it corresponds to the Mann-Whitney test
- If there are tied ranks, a correction term must be applied
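A minimal call on the maize data of this lecture could look like this (object names are illustrative):

```r
# Maize yields (4 varieties, n = 5, 5, 4, 5)
maize <- data.frame(
  variety = factor(rep(c("V1", "V2", "V3", "V4"), times = c(5, 5, 4, 5))),
  yield = c(6.08, 5.70, 6.50, 5.86, 6.17,
            6.87, 6.77, 7.40, 6.63, 6.98,
            10.26, 10.21, 10.02, 9.65,
            8.79, 8.42, 8.31, 8.57, 9.03)
)

# Rank-based test of the same null hypothesis (no difference among groups)
kw <- kruskal.test(yield ~ variety, data = maize)
print(kw)   # statistic on k - 1 = 3 degrees of freedom
```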
Multi-way ANOVA
Multi-way ANOVA is used when the experimenter wants to study the effects of two or more treatment variables.
ASSUMPTIONS
Independence of cases - this is a requirement of the design
Normality - the distributions in each of the groups are normal
Homogeneity of variances - the variance of data in groups should be the same
+ Equal replication (BALANCED AND ORTHOGONAL DESIGN)
If you use traditional general linear models, just one missing observation can strongly affect the results
            Dose 1    Dose 2    Dose 3
Low temp    -         10 obs    10 obs
High temp   10 obs    10 obs    8 obs
Fixed vs. random factors
If we consider more than one factor we have to distinguish two kinds of effects:
Fixed effects: factors are specifically chosen and under control; they are informative (e.g. sex, treatments, wet vs. dry, doses, sprayed or not sprayed)
Random effects: factors are chosen randomly within a large population; they are normally not informative (e.g. fields within a site, blocks within a field, split-plots within a plot, family, parent, brood, individuals within repeated measures)
Random effects mainly occur in two contrasting kinds of circumstances:
1. Observational studies with hierarchical structure
2. Designed experiments with different spatial or temporal dependence
Fixed vs. random factors
Why is it so important to identify fixed vs. random effects?
They affect the way the F-test is constructed in a multifactorial ANOVA. Misidentifying them leads to wrong conclusions
If we have both fixed and random effects, then we are working on MIXED MODELS
yi = µ + αi (fixed) + ri (random) + ε
You can find how to construct your F-test with different combinations of random and fixed effects and with different hierarchical structures
(choose a well-known sampling design!!!)
Factorial ANOVA: two or more factors
Factorial design: two or more factors are crossed. Each combination of the factors is equally replicated, and each level of a factor occurs in combination with every level of the other factors
[Figure: 4 fertilizers x 3 levels of irrigation, 10 observations in each of the 12 cells: orthogonal sampling]
Factorial ANOVA: Why?
Why use a factorial ANOVA? Why not just use multiple one-way ANOVAs?
- With n factors, you’d need to run n one-way ANOVAs, which would inflate your α-level (however, this could be corrected with a Bonferroni correction)
- The best reason is that a factorial ANOVA can detect interactions, something that multiple one-way ANOVAs cannot do
Factorial ANOVA: Interactions
Interaction: when the effects of one independent variable differ according to the levels of another independent variable
E.g. we are testing two factors, Gender (male and female) and Age (young, medium, and old), and their effect on performance
If males' performance differed as a function of age, i.e. males performed better or worse with age, but females' performance was the same across ages, we would say that Age and Gender interact, or that we have an Age x Gender interaction
[Figure: Performance against Age (Young to Old), one line for Male and one for Female]
It is necessary that the slopes differ from one another
Factorial ANOVA: Main effects
Main effects: the effect of a factor is independent of any other factors
This is what we were looking at with one-way ANOVAs: if we have a significant main effect of our factor, then we can say that the mean of at least one of the groups/levels of that factor is different from at least one of the other groups/levels
[Figure: Performance for Male and Female, and Performance against Age (Young to Old)]
It is necessary that the intercepts differ
Factorial ANOVA: Two-crossed fixed factor design
Two-crossed factor design, Treatment: 3 levels
[Figures: mean y against Treatment (1.0, 2.0, 3.0), one line for Clone A and one for Clone B]
Examples of ‘good’ ANOVA results
- Clone < 0.05, Treat < 0.05, C x T < 0.05
- Clone < 0.05, Treat < 0.05, C x T n.s.
- Clone n.s., Treat < 0.05, C x T n.s.
- Clone < 0.05, Treat n.s., C x T n.s.
Worst case
- Clone n.s., Treat n.s., C x T n.s.
Factorial ANOVA: Two-crossed factor design
Two crossed fixed effects: every level of each factor occurs in combination with every level of the other factors
Model 1: two fixed effects
Model 2: two random effects (uncommon situation)
Model 3: one random and one fixed effect
We can test main effects and interaction:
1. The main effect of each factor is the effect of that factor independent of (pooling over) the other factors
2. The interaction between factors (Factor 1 x Factor 2) is a measure of how the effects of one factor depend on the levels of one or more additional factors (synergistic and antagonistic effects of the factors)
We can only measure interaction effects in factorial (crossed) designs
Factorial ANOVA: Two-crossed fixed factor design
Two crossed fixed effects:
Response variable: weight gain in six weeks
Factor A: DIET (3 levels: barley, oats, wheat)
Factor B: SUPPLEMENT (4 levels: S1, S2, S3, S4)
DIET * SUPPLEMENT = 3 x 4 = 12 combinations:
barley+S1   barley+S2   barley+S3   barley+S4
oats+S1     oats+S2     oats+S3     oats+S4
wheat+S1    wheat+S2    wheat+S3    wheat+S4
We have 48 horses to test our two factors: 4 replicates per combination
Factorial ANOVA: Two-crossed fixed factor design
The 48 horses must be independent units to be replicates
Model terms: DIET, SUPPLEMENT, DIET*SUPPLEMENT
Weight gain by DIET and SUPPLEMENT:
          S1      S2      S3      S4
Barley    26.34   23.29   22.46   25.57
Oats      23.29   20.49   19.66   21.86
Wheat     19.63   17.40   17.01   19.66
F test for main effects and interaction
DIET: F_D = MS_D / MS_error
SUPPLEMENT: F_S = MS_S / MS_error
DIET*SUPPLEMENT: F_DxS = MS_DxS / MS_error
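A sketch of the two-way fixed-effects analysis in R. The individual horse records are not given in these slides, so the data below are SIMULATED: the twelve values of the table are used as cell means, and the within-cell noise (sd = 1.5) is an assumption made only so that the example runs.

```r
set.seed(1)

# Cell values from the DIET x SUPPLEMENT table, used here as cell means
cell_means <- matrix(c(26.34, 23.29, 22.46, 25.57,
                       23.29, 20.49, 19.66, 21.86,
                       19.63, 17.40, 17.01, 19.66),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(c("barley", "oats", "wheat"),
                                     c("S1", "S2", "S3", "S4")))

# 12 combinations x 4 replicates = 48 independent horses
horses <- expand.grid(rep = 1:4,
                      supplement = colnames(cell_means),
                      diet = rownames(cell_means))
idx <- cbind(as.character(horses$diet), as.character(horses$supplement))
horses$gain <- cell_means[idx] + rnorm(nrow(horses), sd = 1.5)  # assumed noise

# Main effects and interaction, all tested against the error MS
fit <- aov(gain ~ diet * supplement, data = horses)
tab <- summary(fit)[[1]]
print(tab)

# F by hand: MS of each term divided by the error MS (last row)
f_by_hand <- tab[["Mean Sq"]][1:3] / tab[["Mean Sq"]][4]
```

The degrees of freedom follow from the design: 2 for diet, 3 for supplement, 6 for the interaction and 36 for the error.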
MIXED MODELS: Fixed+random effects
Pure fixed effect models REQUIRE INDEPENDENCE
WHAT IF WE HAVE DEPENDENCE???
Mixed models account for the dependence in the data with appropriate random effects
Mixed models can deal with spatial or temporal dependence
MIXED MODELS: Split-plot and
The split-plot design is one of the most useful designs in agricultural and ecological research
The different treatments are applied to plots of different sizes organized in a hierarchical structure
We can consider random factors to account for the variability related to the environment in which we carry out the experiment
Mixed models can deal with spatial or temporal dependence
Factorial ANOVA: split-plot (mixed model)
[Figure: field layout of the split-plot experiment: blocks A, B, C, D, with the fertilizer sub-sub-plots labelled N, P and NP]
Fixed effects:
- Irrigation (yes or no)
- Seed-sowing density sub-plots (low, med, high)
- Fertilizer (N, P or NP)
Random effects:
- 4 blocks
- Irrigation within block
- Density within irrigation plots
Response variable: Crop production in each cell
Factorial ANOVA: mixed model
The split-plot design: HIERARCHICAL STRUCTURE
i) 4 Blocks
ii) irrigation plot (yes or no): Water / No water
iii) seed-sowing density sub-plots (low, med, high)
iv) 3 fertilizer sub-sub-plots (N, P, or NP)
Response: Crop production
Model formulation
Factorial ANOVA: mixed model aov()
Y ~ fixed effects + error terms
y ~ a*b*c + Error(a/b/c)
Yield ~ irrigation*density*fertilizer + Error(block/irrigation/density)
Fixed effects (Irrigation, Density, Fertilizer): Informative!!!
Error terms (Block and the plots nested within it): Uninformative
Here you can specify your sampling hierarchy
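The formula above can be tried on a balanced data set with the same structure. The crop yields are not given in these slides, so the values below are SIMULATED; the effect sizes and noise levels are arbitrary assumptions made only to obtain a runnable example.

```r
set.seed(42)

# 4 blocks x 2 irrigation plots x 3 density sub-plots x 3 fertilizer sub-sub-plots
crop <- expand.grid(fertilizer = c("N", "P", "NP"),
                    density    = c("low", "med", "high"),
                    irrigation = c("no", "yes"),
                    block      = c("A", "B", "C", "D"))

block_effect <- rnorm(4, sd = 2)                   # assumed random block effect
crop$yield <- 100 +
  10 * (crop$irrigation == "yes") +                # assumed irrigation effect
  block_effect[as.integer(crop$block)] +
  rnorm(nrow(crop), sd = 3)                        # assumed residual noise

# Fixed effects before Error(); the sampling hierarchy inside Error()
fit <- aov(yield ~ irrigation * density * fertilizer +
             Error(block / irrigation / density), data = crop)
print(summary(fit))

# One error stratum per level of the hierarchy
strata <- names(fit)
print(strata)
```

The summary shows one ANOVA table per stratum: irrigation is tested in the block:irrigation stratum, density in block:irrigation:density, and fertilizer in the Within stratum.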
Mixed models: traditional vs. REML
Mixed models using traditional ANOVA require a perfectly orthogonal and balanced design
(THEY WORK WELL WITH THE PROPER SAMPLING)
If something has gone wrong with the sampling:
- In R you can run mixed models with missing data and unbalanced (non-orthogonal) designs using REML estimation (packages lme4 or nlme)
- Avoid working with multi-way ANOVA in non-orthogonal sampling designs
Mixed model: REML
- REML: Residual Maximum Likelihood (vs. Maximum Likelihood)
- Handles unbalanced, non-orthogonal designs and multiple sources of error
- Packages: nlme (quite old)
- New alternative: lme4
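A minimal REML fit with the nlme package, assuming it is installed (it is normally distributed with R as a recommended package). The data are SIMULATED and deliberately unbalanced by dropping three observations; the treatment effect, block variance and object names are assumptions made for illustration.

```r
library(nlme)
set.seed(7)

# 6 blocks x 2 treatments x 3 replicates
d <- expand.grid(rep = 1:3,
                 treatment = c("control", "treated"),
                 block = paste0("B", 1:6))
block_effect <- rnorm(6, sd = 2)                   # assumed random block effect
d$y <- 10 + 3 * (d$treatment == "treated") +
  block_effect[as.integer(d$block)] + rnorm(nrow(d))

# Make the design unbalanced: lose three observations
d <- d[-c(1, 2, 10), ]

# Fixed effect of treatment, random intercept for block, estimated by REML
fit <- lme(y ~ treatment, random = ~ 1 | block, data = d, method = "REML")
print(summary(fit))
print(fixef(fit))      # estimated fixed effects
print(VarCorr(fit))    # between-block and residual variance components
```

An equivalent model in lme4 would be written with lmer() and the formula y ~ treatment + (1 | block).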
Mixed model: REML
When should I use REML?
For balanced data, both ANOVA and REML will give the same answer. However, ANOVA’s algorithm is much more efficient and so should be used whenever possible.
[Flowchart: Most efficient analysis]
Questions: Can you identify blocks (or a hierarchy of blocks) of similar experimental units? Are all factor combinations present and sample sizes equal?
Outcomes: Fixed effects ANOVA (or regression); Mixed effects ANOVA; Regression; REML
Generalized Linear Models (GLM)
We can use GLMs when the variance is not constant, and/or when the errors are not normally distributed.
[Figure: variance against mean for four error structures: Normal, Count data, Proportion data, Gamma]
A GLM has three important properties:
1. The error structure
2. The linear predictor
3. The link function
Generalized Linear Models (GLM)
Error structure
In GLM we can specify error structure different from the normal:
- Normal (gaussian)
- Poisson errors (poisson)
- Binomial errors (binomial)
- Gamma errors (Gamma)
glm(formula, family = …(link = …), data, …)
The link is chosen inside the family function; glm() itself has no separate link argument.
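A short sketch of how the error structure and the link are specified in a call to glm(). The counts are SIMULATED from a Poisson distribution with arbitrary coefficients, chosen only so that the example runs.

```r
set.seed(3)
x <- seq(0, 2, length.out = 30)
counts <- rpois(30, lambda = exp(0.5 + 0.8 * x))   # assumed true model

# Poisson errors with the log link (the link is an argument of the family)
fit_pois <- glm(counts ~ x, family = poisson(link = "log"))

# Normal errors with the default identity link: same estimates as lm()
fit_norm <- glm(counts ~ x, family = gaussian)
fit_lm   <- lm(counts ~ x)

print(family(fit_pois))   # shows the family and the link in use
print(family(fit_norm))
print(rbind(glm = coef(fit_norm), lm = coef(fit_lm)))
```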
Generalized Linear Models (GLM)
The model relates each observed y to a predicted value
The predicted value is obtained by a TRANSFORMATION of the value emerging from the linear predictor
Linear predictor (η) = predicted value output from a GLM
η_i = Σ_{j=1}^{p} x_ij β_j
β are the parameters estimated for the p explanatory variables
x are the values measured for the p explanatory variables
Fit of the model = comparison between the linear predictor and the TRANSFORMED y measured
The transformation is specified in the LINK FUNCTION
Generalized Linear Models (GLM)
Link functions (g)
The link function relates the mean value of y to its linear predictor: η = g(µ)
The value of η is obtained by transforming the value of y by the link function
The predicted value of y is obtained by applying the inverse link function to η
Typically the output of a GLM is η: we need to transform η to get the predicted values
Once the distribution is known, start with the canonical link function
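In R the two scales are available through predict(). The sketch below uses SIMULATED Poisson counts (arbitrary coefficients) and shows that the linear predictor is the design matrix times the coefficients, and that the predicted values are obtained by back-transforming with the inverse link.

```r
set.seed(5)
x <- seq(0, 2, length.out = 30)
counts <- rpois(30, lambda = exp(0.5 + 0.8 * x))   # assumed true model
fit <- glm(counts ~ x, family = poisson(link = "log"))

# Linear predictor eta: sum over j of x_ij * beta_j
eta         <- predict(fit, type = "link")
eta_by_hand <- as.vector(model.matrix(fit) %*% coef(fit))

# Predicted values: inverse link applied to eta (exp() for the log link)
mu         <- predict(fit, type = "response")
mu_by_hand <- family(fit)$linkinv(eta)

print(head(cbind(eta = eta, mu = mu)))
```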
Generalized Linear Models (GLM) glm()
Model fit
Deviance is the measure of goodness-of-fit of a GLM
Deviance = -2 (log-likelihood of current model - log-likelihood of saturated model)
Error      Deviance                                                  Variance
normal     Σ (y - mean)^2                                            1
poisson    2 Σ [y ln(y/mean) - (y - mean)]                           mean
binomial   2 Σ [y ln(y/mean) + (n - y) ln((n - y)/(n - mean))]       mean (n - mean) / n
Gamma      2 Σ [(y - mean)/mean - ln(y/mean)]                        mean^2
We aim at reducing the residual deviance
n is the size of the sample
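The Poisson row of the table and the general definition can be checked against deviance() in R. The counts are SIMULATED with arbitrary coefficients; the term y ln(y/mean) is taken as 0 when y = 0.

```r
set.seed(9)
x <- seq(0, 2, length.out = 30)
y <- rpois(30, lambda = exp(0.5 + 0.8 * x))        # assumed true model
fit <- glm(y ~ x, family = poisson(link = "log"))
mu <- fitted(fit)

# Poisson deviance: 2 * sum( y * ln(y / mean) - (y - mean) )
log_term    <- ifelse(y == 0, 0, y * log(y / mu))
dev_by_hand <- 2 * sum(log_term - (y - mu))

# General definition: -2 * (logLik of current model - logLik of saturated model)
ll_current   <- as.numeric(logLik(fit))
ll_saturated <- sum(dpois(y, lambda = y, log = TRUE))  # fitted value = observed y
dev_from_ll  <- -2 * (ll_current - ll_saturated)

print(c(by_hand = dev_by_hand, from_logLik = dev_from_ll, glm = deviance(fit)))
```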
Proportion data and binomial error
1. The data are strictly bounded (0-1)
2. The variance is non-constant (∩-shaped relation with the mean)
3. Errors are non-normal
[Figure: variance against p for proportion data]
Link function: Logit
ln(p/q) = a + bx, with q = 1 - p
p = e^(a + bx) / (1 + e^(a + bx))
[Figure: proportion p against x]
2 Examples
1. The first example concerns sex ratios in insects (the proportion of all individuals that are males). In the species in question, it has been observed that the sex ratio is highly variable, and an experiment was set up to see whether population density was involved in determining the fraction of males.
2. The data consist of numbers dead and initial batch size for five doses of pesticide application, and we wish to know what dose kills 50% of the individuals (or 90% or 95%, as required).
After fitting a binomial-logit GLM, check that: Residual df ≈ residual deviance
- YES: the model is adequate
- NO: fit a quasibinomial and check again
  - YES: the model is adequate
  - NO: change distribution
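A sketch of the second example (dose that kills 50% of the individuals). The original bioassay data are not reproduced in these slides, so the five doses, batch sizes and numbers dead below are HYPOTHETICAL values, and regressing on log(dose) is an assumption made for this illustration.

```r
# Hypothetical bioassay: five doses, 20 individuals per batch
dose  <- c(1, 2, 4, 8, 16)
batch <- rep(20, 5)
dead  <- c(2, 5, 10, 15, 19)

# Binomial errors, logit link; response = cbind(successes, failures)
fit <- glm(cbind(dead, batch - dead) ~ log(dose),
           family = binomial(link = "logit"))
print(summary(fit))

# Adequacy check: residual deviance should be close to the residual df
ratio <- deviance(fit) / df.residual(fit)
print(ratio)

# If the ratio is well above 1, the same model with a quasibinomial family
fit_q <- update(fit, family = quasibinomial)

# Dose killing a proportion p: solve logit(p) = a + b * log(dose)
b <- unname(coef(fit))
lethal_dose <- function(p) exp((qlogis(p) - b[1]) / b[2])
ld50 <- lethal_dose(0.5)
print(c(LD50 = ld50, LD90 = lethal_dose(0.9), LD95 = lethal_dose(0.95)))
```

The quasibinomial fit keeps the same coefficients and only rescales the standard errors by the estimated dispersion.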
Count data and Poisson error
1. The data are non-negative integers
2. The variance is non-constant (variance = mean)
3. Errors are non-normal
[Figure: variance against mean for count data]
The model is fitted with a log link (to ensure that the fitted values are bounded below) and Poisson errors (to account for the non-normality).
[Figure: log(x) against x for count data]
Example
1. In this example the response is a count of the number of plant species on plots that have different biomass (continuous variable) and different soil pH (high, mid and low).
After fitting a Poisson-log GLM, check that: Residual df ≈ residual deviance
- YES: the model is adequate
- NO: fit a quasipoisson and check again
  - YES: the model is adequate
  - NO: change distribution
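A sketch of the species-richness example with a Poisson-log GLM. The field data are not included in these slides, so the plots below are SIMULATED: the number of plots, the coefficients and the cut-off of 1.5 used for the dispersion ratio are assumptions made for illustration only.

```r
set.seed(11)
n <- 90

# Hypothetical plots: continuous biomass and three soil pH classes
plots <- data.frame(
  biomass = runif(n, 0, 10),
  pH = factor(rep(c("low", "mid", "high"), each = 30),
              levels = c("low", "mid", "high"))
)
ph_shift <- c(low = 0, mid = 0.4, high = 0.8)      # assumed pH effects (log scale)
lambda <- exp(2.5 - 0.15 * plots$biomass + ph_shift[as.character(plots$pH)])
plots$species <- rpois(n, lambda)

# Poisson errors with the log link
fit <- glm(species ~ biomass + pH, family = poisson(link = "log"), data = plots)
print(summary(fit))

# Adequacy check: residual deviance should be close to the residual df
ratio <- deviance(fit) / df.residual(fit)
print(ratio)

# Illustrative rule: refit with quasipoisson if the ratio is clearly above 1
if (ratio > 1.5) {
  fit <- update(fit, family = quasipoisson)
}
```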