
Post on 31-Jan-2016


Page 1: Introduction to Biostatistical Analysis Using R Statistics course for first-year PhD students

Introduction to Biostatistical Analysis Using R Statistics course for first-year PhD students

Lecturer: Lorenzo Marini
DAFNAE, University of Padova, Viale dell’Università 16, 35020 Legnaro, Padova.
E-mail: [email protected]
Tel.: +39 0498272807

http://www.biodiversity-lorenzomarini.eu/

Session 3

Lecture: Analysis of Variance (ANOVA)
Practical: ANOVA

Page 2

Statistical modelling: more than one parameter

Nature of the response variable:
- NORMAL (continuous): General Linear Models
- POISSON, BINOMIAL …: Generalized Linear Models (GLM)

Nature of the explanatory variables (General Linear Models):
- Categorical: ANOVA (Session 3)
- Continuous: Regression (Session 4)
- Categorical + continuous: ANCOVA (not covered)

Page 3

ANOVA: aov()

ANOVA tests mean differences between groups defined by categorical variables

ASSUMPTIONS

Independence of cases - this is a requirement of the design.

Normality - the distributions in each cell are normal [hist(), qqnorm(), shapiro.test()]

Homogeneity of variances - the variance of the data should be the same in all groups [fligner.test()]

One-way ANOVA: ONE factor with 2 or more levels (e.g. fertirrigation: 4 levels)

Multi-way ANOVA: 2 or more factors, each with 2 or more levels (e.g. drug: 4 doses, crossed with gender: ♀, ♂)
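The assumption checks listed above could be run roughly as follows. This is only a sketch: it uses R's built-in PlantGrowth data as a stand-in (not the course data) and checks normality on the model residuals, which is one common way to check "normality in each cell".

```r
# Built-in example data: plant weight under a control and two treatments
data(PlantGrowth)

# Fit the one-way model so that normality can be checked on its residuals
fit <- aov(weight ~ group, data = PlantGrowth)
res <- residuals(fit)

# Normality: histogram, normal Q-Q plot and Shapiro-Wilk test
hist(res, main = "Residuals")
qqnorm(res)
qqline(res)
sw <- shapiro.test(res)

# Homogeneity of variances among groups: Fligner-Killeen test
fk <- fligner.test(weight ~ group, data = PlantGrowth)

print(c(shapiro_p = sw$p.value, fligner_p = fk$p.value))
```

Large P values in both tests give no evidence against the assumptions; they do not prove that the assumptions hold.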

Page 4

One-way ANOVA step by step

1. Test normality

2. Test homogeneity of variance among groups

3. Run the ANOVA

4. Reject or fail to reject H0 (all the means are equal)

If H0 is rejected, there are 2 approaches:

5A. Multiple comparisons to test differences between the levels of the factor

5B. Model simplification working with contrasts

Page 5

One-way ANOVA

One-way ANOVA is used to test for differences among two or more independent groups

Maize: 4 varieties (k = 4)

y: productivity (NORMAL, CONTINUOUS)
x: variety (CATEGORICAL, four levels: x1, x2, x3, x4)

H0: µ1 = µ2 = µ3 = µ4

Ha: At least two means differ

ANOVA model

yi = a + b x2 + c x3 + d x4 (x2, x3 and x4 are 0/1 indicators for varieties 2 to 4)

a = µ1
b = µ2 - µ1
c = µ3 - µ1
d = µ4 - µ1
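How the model parameters relate to the group means can be checked numerically. The sketch below uses made-up productivity values (hypothetical, for illustration only) and R's default treatment contrasts, under which the intercept is the mean of the first level and each remaining coefficient is the difference between another level and the first.

```r
# Hypothetical productivity values for four maize varieties (illustration only)
maize <- data.frame(
  variety = factor(rep(c("x1", "x2", "x3", "x4"), each = 3)),
  y = c(6.1, 5.9, 6.3,  7.0, 6.8, 7.1,  10.2, 9.9, 10.1,  8.5, 8.7, 8.6)
)

# Linear model with default treatment contrasts: coefficients a, b, c, d
fit <- lm(y ~ variety, data = maize)
est <- coef(fit)

# Group means computed directly from the data
means <- sapply(split(maize$y, maize$variety), mean)

print(est)
print(means[1])              # matches the intercept a
print(means[-1] - means[1])  # matches b, c, d
```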

Page 6

One-way ANOVA

      Var 1   Var 2   Var 3   Var 4
      6.08    6.87    10.26   8.79
      5.70    6.77    10.21   8.42
      6.50    7.40    10.02   8.31
      5.86    6.63     9.65   8.57
      6.17    6.98      -     9.03
ni    5       5       4       5
µi    6.06    6.93    10.03   8.62

Grand mean = 7.80
Number of observations: N = 19
Number of groups: k = 4

Sum of squares (SS): deviance

SS Total = Σ(yi – grand mean)²

SS Factor = Σ ni(group mean i – grand mean)²

SS Error (within group) = Σ(yi – group mean i)²

Degrees of freedom (df)

Total: N – 1
Group: k – 1
Error: N – k
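The three sums of squares can be computed by hand from the maize values tabulated on this slide. This is a minimal sketch; the variable names are illustrative, and the fifth observation of Var 3 is treated as missing, in line with ni = 4.

```r
# Maize productivity data from the slide (Var 3 has only 4 observations)
y <- c(6.08, 5.70, 6.50, 5.86, 6.17,
       6.87, 6.77, 7.40, 6.63, 6.98,
       10.26, 10.21, 10.02, 9.65,
       8.79, 8.42, 8.31, 8.57, 9.03)
variety <- factor(rep(c("Var1", "Var2", "Var3", "Var4"), times = c(5, 5, 4, 5)))

N <- length(y)           # number of observations
k <- nlevels(variety)    # number of groups
grand_mean  <- mean(y)
group_means <- ave(y, variety)  # group mean repeated for every observation

ss_total  <- sum((y - grand_mean)^2)
# Summing over observations is the same as sum of n_i * (mean_i - grand mean)^2
ss_factor <- sum((group_means - grand_mean)^2)
ss_error  <- sum((y - group_means)^2)

print(c(total = ss_total, factor = ss_factor, error = ss_error))
print(c(df_total = N - 1, df_factor = k - 1, df_error = N - k))
```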

Page 7

One-way ANOVA: SS explanation

SS Total = SS Factor + SS Error

[Figure: panels SS Total, SS Factor and SS Error, showing the grand mean and the group means mean1 to mean4]

Page 8

One-way ANOVA

[Figure: group means µ1 to µ4 around the grand mean]

SS Total = SS Factor + SS Error

SS can be divided by the respective df to get a variance

MS = SS / df (mean squared deviation)

MS Factor = SS Factor / (k – 1)

MS Error = SS Error / (N – k)

Pseudo-replication would come into play here!!!

Page 9

One-way ANOVA: F test (variance)

F = Factor MS / Error MS

If the k means are not equal, then the Factor MS in the population will be greater than the population’s Error MS

If the calculated F is large (e.g. P < 0.05), then we can reject H0

All we can conclude is that at least two means are different!!!

Two ways to proceed:
- A POSTERIORI MULTIPLE COMPARISONS
- WORKING WITH CONTRASTS

Defining the correct F test can be a difficult task with complex designs (BE EXTREMELY CAREFUL!!!!)
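As an illustration of the F ratio, the sketch below rebuilds F and its P value from the mean squares for the maize data given earlier in the deck, then compares them with the values reported by anova(). The object names are illustrative.

```r
# Maize productivity data from the earlier slide (Var 3 has 4 observations)
y <- c(6.08, 5.70, 6.50, 5.86, 6.17,
       6.87, 6.77, 7.40, 6.63, 6.98,
       10.26, 10.21, 10.02, 9.65,
       8.79, 8.42, 8.31, 8.57, 9.03)
variety <- factor(rep(c("Var1", "Var2", "Var3", "Var4"), times = c(5, 5, 4, 5)))

# ANOVA table: row 1 is the factor, row 2 the residual (error) term
tab <- anova(lm(y ~ variety))

ms_factor <- tab[["Mean Sq"]][1]
ms_error  <- tab[["Mean Sq"]][2]

# F = Factor MS / Error MS, with (k - 1) and (N - k) degrees of freedom
f_value <- ms_factor / ms_error
p_value <- pf(f_value, df1 = tab[["Df"]][1], df2 = tab[["Df"]][2],
              lower.tail = FALSE)

print(c(F = f_value, P = p_value))
```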

Page 10

One-way ANOVA: contrasts

Contrasts are the essence of hypothesis testing and model simplification in analysis of variance and analysis of covariance.

They are used to compare means or groups of means with other means or groups of means

We use contrasts to carry out t tests AFTER having found a significant effect with the F test

- We can use contrasts in model simplification (merge similar factor levels)

- Often we can avoid post-hoc multiple comparisons
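Model simplification by merging factor levels might look like the sketch below. It uses the built-in PlantGrowth data as a stand-in, and the choice of which levels to merge (ctrl and trt1) is purely illustrative.

```r
data(PlantGrowth)
pg <- PlantGrowth

# Full model: t tests of each treatment against the control (treatment contrasts)
full <- lm(weight ~ group, data = pg)
print(summary(full)$coefficients)

# Simplified model: ctrl and trt1 merged into a single level
pg$group2 <- pg$group
levels(pg$group2) <- c("ctrl_trt1", "ctrl_trt1", "trt2")
reduced <- lm(weight ~ group2, data = pg)

# F test comparing the two nested models; a non-significant result
# would support keeping the simpler model
cmp <- anova(reduced, full)
print(cmp)
```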

Page 11

One-way ANOVA: multiple comparisons

If F calculated > F critical, then we can reject H0

At least two means are different!!!

A POSTERIORI MULTIPLE COMPARISONS (lots of methods!!!)

Multiple comparison procedures are then used to determine which means are different from which.

Comparing K means involves K(K − 1)/2 pairwise comparisons.

E.g. Tukey-Kramer, Duncan, Scheffé…
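One of the many available procedures is Tukey's honest significant difference, available in base R as TukeyHSD(). A minimal sketch on the built-in PlantGrowth data (a stand-in, not the course data):

```r
data(PlantGrowth)
fit <- aov(weight ~ group, data = PlantGrowth)

# Number of pairwise comparisons among K group means: K(K - 1)/2
K <- nlevels(PlantGrowth$group)
n_pairs <- K * (K - 1) / 2

# Tukey's HSD: one row per pair (difference, confidence limits, adjusted P)
tk <- TukeyHSD(fit)
print(tk$group)
```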

Page 12

One-way ANOVA: nonparametric

If the assumptions are seriously violated, then one can opt for a nonparametric ANOVA

However, one-way ANOVA is quite robust even under non-normality and non-homogeneity of variance

Kruskal-Wallis test: kruskal.test()

ANOVA by ranks

(if k = 2, then it corresponds to the Mann-Whitney test)

If there are tied ranks a correction term must be applied
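The correspondence with the Mann-Whitney test for k = 2 can be checked directly. The sketch below uses two small made-up samples without ties (hypothetical values) and compares kruskal.test() with wilcox.test() run in its normal-approximation form without continuity correction.

```r
# Two hypothetical samples without ties (illustration only)
x <- c(1.1, 2.3, 3.8, 4.0, 5.2)
y <- c(2.9, 6.1, 7.4, 8.8, 9.5)

# Kruskal-Wallis test with k = 2 groups
kw <- kruskal.test(list(x, y))

# Mann-Whitney (Wilcoxon rank-sum) test, normal approximation,
# no continuity correction
mw <- wilcox.test(x, y, exact = FALSE, correct = FALSE)

print(c(kruskal_p = kw$p.value, mann_whitney_p = mw$p.value))
```

With the default settings (exact test or continuity correction), wilcox.test() would give a slightly different P value.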

Page 13

Multi-way ANOVA

Multi-way ANOVA is used when the experimenter wants to study the effects of two or more treatment variables.

ASSUMPTIONS

Independence of cases - this is a requirement of the design

Normality - the distributions in each of the groups are normal

Homogeneity of variances - the variance of data in groups should be the same

+ Equal replication (BALANCED AND ORTHOGONAL DESIGN)

With traditional general linear models, even a single missing observation can strongly affect the results

            Dose 1    Dose 2    Dose 3
Low temp    -         10 obs    10 obs
High temp   10 obs    10 obs    8 obs
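One symptom of a non-orthogonal design is that the sequential sums of squares from aov() or anova() depend on the order in which the factors enter the model. The sketch below reuses the cell counts from the table above with a simulated response (no real data, no true effects).

```r
set.seed(1)

# Cell counts from the table: Low/Dose1 is missing, High/Dose3 has 8 obs
cells <- data.frame(
  temp = c("Low", "Low", "High", "High", "High"),
  dose = c("Dose2", "Dose3", "Dose1", "Dose2", "Dose3"),
  n    = c(10, 10, 10, 10, 8)
)
d <- cells[rep(seq_len(nrow(cells)), times = cells$n), c("temp", "dose")]
d$temp <- factor(d$temp)
d$dose <- factor(d$dose)
d$y <- rnorm(nrow(d), mean = 10)  # simulated response

# Same additive model, factors entered in a different order
a1 <- anova(lm(y ~ temp + dose, data = d))
a2 <- anova(lm(y ~ dose + temp, data = d))

# The SS attributed to temp changes with the order of the terms
print(c(temp_first = a1["temp", "Sum Sq"], temp_last = a2["temp", "Sum Sq"]))
```

In a balanced, orthogonal design the two values would be identical.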

Page 14

Fixed vs. random factors

If we consider more than one factor we have to distinguish two kinds of effects:

Fixed effects: factors are specifically chosen and under control, they are informative (E.g. sex, treatments, wet vs. dry, doses, sprayed or not sprayed)

Random effects: factors are chosen randomly within a large population, they are normally not informative (E.g. fields within a site, block within a field, split-plot within a plot, family, parent, brood, individuals within repeated measures)

Random effects mainly occur in two contrasting kinds of circumstances

1. Observational studies with hierarchical structure
2. Designed experiments with different spatial or temporal dependence

Page 15

Fixed vs. random factors

Why is it so important to identify fixed vs. random effects?

They affect the way the F-test is constructed in a multifactorial ANOVA. Misidentifying them leads to wrong conclusions

If we have both fixed and random effects, then we are working with MIXED MODELS

yi = µ + αi (fixed) + ri (random) + ε

You can look up how to construct your F-test for different combinations of random and fixed effects and for different hierarchical structures

(choose a well-known sampling design!!!)

Page 16

Factorial ANOVA: two or more factors

Factorial design: two or more factors are crossed. Each combination of the factors is equally replicated and each factor occurs in combination with every level of the other factors

Example: 4 fertilizers x 3 levels of irrigation = 12 combinations, with 10 replicates in each

Orthogonal sampling

Page 17

Factorial ANOVA: Why?

Why use a factorial ANOVA? Why not just use multiple one-way ANOVAs?

With n factors, you’d need to run n one-way ANOVAs, which would inflate your α-level
– However, this could be corrected with a Bonferroni correction

The best reason is that a factorial ANOVA can detect interactions, something that multiple one-way ANOVAs cannot do

Page 18

Factorial ANOVA: Interactions

Interaction: When the effects of one independent variable differ according to levels of another independent variable

E.g. We are testing two factors, Gender (male and female) and Age (young, medium, and old) and their effect on performance

If males' performance differed as a function of age, i.e. males performed better or worse with age, but females' performance was the same across ages, we would say that Age and Gender interact, or that we have an Age x Gender interaction

[Figure: Performance vs. Age (Young to Old), one line each for Male and Female]

It is necessary that the slopes differ from one another
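The Age x Gender example can be mimicked with simulated data. In this sketch all numbers (means, slope, noise) are assumptions chosen only to produce the pattern described: male performance declines with age while female performance stays flat.

```r
set.seed(42)

# Balanced design: 10 simulated individuals per Gender x Age combination
d <- expand.grid(
  rep    = 1:10,
  age    = factor(c("young", "medium", "old"),
                  levels = c("young", "medium", "old")),
  gender = factor(c("male", "female"))
)

# Assumed pattern: males lose 5 units per age class, females stay constant
slope <- ifelse(d$gender == "male", -5, 0)
d$performance <- 50 + slope * (as.integer(d$age) - 1) + rnorm(nrow(d), sd = 3)

# gender * age expands to both main effects plus the gender:age interaction
fit <- aov(performance ~ gender * age, data = d)
print(summary(fit))

# Non-parallel lines in this plot indicate an interaction
with(d, interaction.plot(age, gender, performance))
```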

Page 19

Factorial ANOVA: Main effects

Main effects: the effect of a factor is independent from any other factors

This is what we were looking at with one-way ANOVAs – if we have a significant main effect of our factor, then we can say that the mean of at least one of the groups/levels of that factor is different from at least one of the other groups/levels

[Figure: two panels, Performance for Male vs. Female and Performance for Young vs. Old]

It is necessary that the intercepts differ

Page 20

Factorial ANOVA: Two-crossed fixed factor design

Two-crossed factor design: Clone (CloneA, CloneB) x Treatment (3 levels)

[Figure: five panels of Mean y vs. Treatment, one line per clone]

Examples of ‘good’ ANOVA results

Effect   Panel 1   Panel 2   Panel 3   Panel 4
Clone    < 0.05    < 0.05    n.s.      < 0.05
Treat    < 0.05    < 0.05    < 0.05    n.s.
C x T    < 0.05    n.s.      n.s.      n.s.

Worst case

Effect
Clone    n.s.
Treat    n.s.
C x T    n.s.

Page 21

Factorial ANOVA: Two-crossed factor design

Two crossed fixed effects: every level of each factor occurs in combination with every level of the other factors

Model 1: two fixed effects
Model 2: two random effects (uncommon situation)
Model 3: one random and one fixed effect

We can test main effects and interaction:

1. The main effect of each factor is the effect of each factor independent of (pooling over) the other factors

2. The interaction between factors is a measure of how the effects of one factor depend on the levels of one or more additional factors (synergistic and antagonistic effects of the factors): Factor 1 x Factor 2

We can only measure interaction effects in factorial (crossed) designs

Page 22

Factorial ANOVA: Two-crossed fixed factor design

Two crossed fixed effects:

Response variable: weight gain in six weeks
Factor A: DIET (3 levels: barley, oats, wheat)
Factor B: SUPPLEMENT (4 levels: S1, S2, S3, S4)

DIET * SUPPLEMENT = 3 x 4 = 12 combinations

barley+S1   barley+S2   barley+S3   barley+S4
oats+S1     oats+S2     oats+S3     oats+S4
wheat+S1    wheat+S2    wheat+S3    wheat+S4

We have 48 horses to test our two factors: 4 replicates per combination

Page 23

Factorial ANOVA: Two-crossed fixed factor design

The 48 horses must be independent units to be replicates

Weight gain by DIET and SUPPLEMENT:

          S1      S2      S3      S4
Barley    26.34   23.29   22.46   25.57
Oats      23.29   20.49   19.66   21.86
Wheat     19.63   17.40   17.01   19.66

F test for main effects and interaction

DIET: $F_{D} = MS_{D} / MS_{error}$

SUPPLEMENT: $F_{S} = MS_{S} / MS_{error}$

DIET*SUPPLEMENT: $F_{D \times S} = MS_{D \times S} / MS_{error}$
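A two-way crossed fixed-effects analysis of this kind of design could be sketched as below. The individual horse data are not given in the slides, so the response is simulated around the tabulated values with an assumed residual standard deviation of 1.5; only the structure of the analysis is meant to carry over.

```r
set.seed(7)

diet <- c("barley", "oats", "wheat")
supp <- c("S1", "S2", "S3", "S4")

# Values from the DIET x SUPPLEMENT table, used here as simulation means
cell_means <- matrix(c(26.34, 23.29, 22.46, 25.57,
                       23.29, 20.49, 19.66, 21.86,
                       19.63, 17.40, 17.01, 19.66),
                     nrow = 3, byrow = TRUE, dimnames = list(diet, supp))

# 12 combinations x 4 replicate horses = 48 independent units
d <- expand.grid(rep = 1:4, supplement = supp, diet = diet)
idx <- cbind(as.character(d$diet), as.character(d$supplement))
d$gain <- cell_means[idx] + rnorm(nrow(d), sd = 1.5)  # assumed noise level

fit <- aov(gain ~ diet * supplement, data = d)
tab <- summary(fit)[[1]]
print(tab)

# Each F is the term's mean square divided by the error mean square (row 4)
f_diet <- tab[["Mean Sq"]][1] / tab[["Mean Sq"]][4]
```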

Page 24

MIXED MODELS: Fixed+random effects

WHAT IF WE HAVE DEPENDENCE???

Pure fixed effect models REQUIRE INDEPENDENCE

Mixed models account for the dependence in the data with appropriate random effects

Mixed models can deal with spatial or temporal dependence

Page 25

MIXED MODELS: Split-plot and

The split-plot design is one of the most useful designs in agricultural and ecological research

The different treatments are applied to plots of different sizes organized in a hierarchical structure

We can consider random factors to account for the variability related to the environment in which we carry out the experiment

Mixed models can deal with spatial or temporal dependence

Page 26

Factorial ANOVA: split-plot (mixed model)

[Figure: field layout with 4 blocks (A, B, C, D), each split into irrigation plots, density sub-plots and fertilizer sub-sub-plots (N, P, NP)]

Response variable: crop production in each cell

Fixed effects:
- Irrigation (yes or no)
- Seed-sowing density sub-plots (low, med, high)
- Fertilizer (N, P or NP)

Random effects:
- 4 blocks
- Irrigation within block
- Density within irrigation plots

Page 27

Factorial ANOVA: mixed model

The split-plot design

HIERARCHICAL STRUCTURE

i) 4 blocks
ii) irrigation plots (yes or no): water / no water
iii) seed-sowing density sub-plots (low, med, high)
iv) 3 fertilizer sub-sub-plots (N, P, or NP)

Response: crop production

[Figure: one block split into irrigation plots, density sub-plots and fertilizer sub-sub-plots]

Page 28

Factorial ANOVA: mixed model aov()

Model formulation

Y ~ fixed effects + error terms

y ~ a*b*c + Error(a/b/c)

yield ~ irrigation*density*fertilizer + Error(block/irrigation/density)

Fixed effects (irrigation*density*fertilizer): informative!!!

Error terms (Error(block/irrigation/density)): uninformative. Here you can specify your sampling hierarchy
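The model formulation above could be run on a balanced split-plot layout as in the sketch below. The yield values are simulated (the experiment's data are not in the slides), and the assumed irrigation effect and noise level are arbitrary.

```r
set.seed(3)

# Balanced layout: 4 blocks x 2 irrigation x 3 density x 3 fertilizer = 72 cells
d <- expand.grid(
  fertilizer = c("N", "P", "NP"),
  density    = c("low", "med", "high"),
  irrigation = c("no", "yes"),
  block      = c("A", "B", "C", "D")
)

# Simulated yield with an assumed irrigation effect (illustration only)
d$yield <- 100 + 5 * (d$irrigation == "yes") + rnorm(nrow(d), sd = 4)

# Fixed effects on the left, sampling hierarchy inside Error()
fit <- aov(yield ~ irrigation * density * fertilizer +
             Error(block / irrigation / density), data = d)

# One ANOVA table per error stratum; each effect is tested in its own stratum
print(summary(fit))
```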

Page 29

Mixed models: traditional vs. REML

Mixed models using traditional ANOVA require a perfectly orthogonal and balanced design

(THEY WORK WELL WITH THE PROPER SAMPLING)

If something has gone wrong with the sampling:

In R you can run mixed models with missing data and unbalanced (non-orthogonal) designs using REML estimation (packages lme4 or nlme)

Avoid working with multi-way ANOVA in non-orthogonal sampling designs

Page 30

Mixed model: REML

• REML: Residual Maximum Likelihood
– vs. Maximum Likelihood
– Unbalanced, non-orthogonal, multiple sources of error

• Packages
– nlme (quite old)

• New alternative
– lme4

Page 31

Mixed model: REML

When should I use REML?

For balanced data, both ANOVA and REML will give the same answer. However, ANOVA’s algorithm is much more efficient and so should be used whenever possible.

[Flowchart] Questions:
- Are all factor combinations present and sample sizes equal?
- Can you identify blocks (or a hierarchy of blocks) of similar experimental units?

Most efficient analysis (possible outcomes):
- Fixed effects ANOVA (or regression)
- Mixed effects ANOVA
- Regression
- REML
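When the design is unbalanced, a REML fit of the split-plot model could be sketched with lme() from the nlme package, as below. Everything about the data is an assumption: the yields are simulated with made-up variance components, and three rows are dropped to mimic missing observations.

```r
library(nlme)
set.seed(5)

d <- expand.grid(
  fertilizer = c("N", "P", "NP"),
  density    = c("low", "med", "high"),
  irrigation = c("no", "yes"),
  block      = c("A", "B", "C", "D")
)

# Identifiers of the nested units: irrigation plots and density sub-plots
plot_id    <- interaction(d$block, d$irrigation, drop = TRUE)
subplot_id <- interaction(d$block, d$irrigation, d$density, drop = TRUE)

# Simulated yield: assumed irrigation effect plus a random effect per level
d$yield <- 100 + 5 * (d$irrigation == "yes") +
  rnorm(nlevels(d$block), sd = 3)[as.integer(d$block)] +
  rnorm(nlevels(plot_id), sd = 2)[as.integer(plot_id)] +
  rnorm(nlevels(subplot_id), sd = 1.5)[as.integer(subplot_id)] +
  rnorm(nrow(d), sd = 2)

# Drop three observations so that the design is no longer balanced
d <- d[-c(5, 17, 40), ]

# Same fixed effects as the aov() formulation; nesting goes in 'random'
fit <- lme(yield ~ irrigation * density * fertilizer,
           random = ~ 1 | block / irrigation / density,
           data = d, method = "REML")
print(anova(fit))
```

The lme4 equivalent would use lmer() with a different formula syntax for the random terms.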

Page 32

Generalized Linear Models (GLM)

We can use GLMs when the variance is not constant, and/or when the errors are not normally distributed.

[Figure: four panels of Variance vs. Mean for Normal, Count data, Proportion data and Gamma]

A GLM has three important properties
1. The error structure
2. The linear predictor
3. The link function

Page 33

Generalized Linear Models (GLM)

Error structure

In a GLM we can specify an error structure different from the normal:

- Normal (gaussian)
- Poisson errors (poisson)
- Binomial errors (binomial)
- Gamma errors (Gamma)

glm(formula, family = …(link = …), data, …)

Page 34

Generalized Linear Models (GLM)

The model relates each observed y to a predicted value

The predicted value is obtained by a TRANSFORMATION of the value emerging from the linear predictor

Linear predictor (η) = predicted value output from a GLM

$\eta_i = \sum_{j=1}^{p} x_{ij} \beta_j$

β are the parameters estimated for the p explanatory variables
x are the values measured for the p explanatory variables

Fit of the model = comparison between the linear predictor and the TRANSFORMED y measured

The transformation is specified in the LINK FUNCTION

Page 35

Generalized Linear Models (GLM)

Link functions (g)

The link function relates the mean value of y (µ) to its linear predictor

$\eta = g(\mu)$

The value of η is obtained by transforming the value of y by the link function

The predicted value of y is obtained by applying the inverse link function to η

Typically the output of a GLM is η

We need to transform η to get the predicted values

Once the distribution is known, start with the canonical link function
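The relation between the linear predictor and the predicted values can be seen with predict(). The sketch below fits a Poisson GLM with its canonical log link to simulated counts (assumed coefficients, illustration only) and back-transforms η by hand.

```r
set.seed(10)

# Simulated counts whose mean increases with x (illustration only)
x <- seq(0, 10, length.out = 50)
y <- rpois(50, lambda = exp(0.5 + 0.2 * x))

# The link is an argument of the family function, not of glm() itself
fit <- glm(y ~ x, family = poisson(link = "log"))

eta <- predict(fit, type = "link")      # linear predictor (log scale)
mu  <- predict(fit, type = "response")  # predicted mean counts

# The inverse link maps eta back to the scale of y; for the log link it is exp()
linkinv <- family(fit)$linkinv
print(head(cbind(eta = eta, mu = mu, back_transformed = linkinv(eta))))
```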

Page 36

Generalized Linear Models (GLM) glm()

Model fit

Deviance is the measure of goodness-of-fit of a GLM

Deviance = -2 (log-likelihood of the current model - log-likelihood of the saturated model)

Error      Deviance                                                        Variance
normal     $\sum (y - \mu)^2$                                               $1$
poisson    $2 \sum [\, y \ln(y/\mu) - (y - \mu) \,]$                         $\mu$
binomial   $2 \sum [\, y \ln(y/\mu) + (n - y) \ln((n - y)/(n - \mu)) \,]$    $\mu (n - \mu)/n$
Gamma      $2 \sum [\, (y - \mu)/\mu - \ln(y/\mu) \,]$                        $\mu^2$

µ is the mean; n is the size of the sample

We aim at reducing the residual deviance
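The Poisson deviance can be reproduced by hand and compared with the value reported by glm(). A minimal sketch on simulated counts (assumed coefficients, illustration only), using the convention that y ln(y/µ) is 0 when y = 0:

```r
set.seed(11)

# Simulated count data (illustration only)
x <- runif(40, 0, 10)
y <- rpois(40, lambda = exp(0.3 + 0.15 * x))

fit <- glm(y ~ x, family = poisson)
mu  <- fitted(fit)

# Poisson deviance: 2 * sum( y * ln(y / mu) - (y - mu) )
log_term    <- ifelse(y == 0, 0, y * log(y / mu))
dev_by_hand <- 2 * sum(log_term - (y - mu))

print(c(by_hand = dev_by_hand, from_glm = deviance(fit)))
```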

Page 37

Proportion data and binomial error

1. The data are strictly bounded (0-1)
2. The variance is non-constant (∩-shaped relation with the mean)
3. Errors are non-normal

[Figure: Variance vs. p for proportion data]

Link function: Logit

$\ln(p/q) = a + bx$, with $q = 1 - p$

$p = \dfrac{e^{a+bx}}{1 + e^{a+bx}}$

[Figure: Proportion p vs. x, S-shaped curve]

Page 38

Proportion data and binomial error

2 Examples

1. The first example concerns sex ratios in insects (the proportion of all individuals that are males). In the species in question, it has been observed that the sex ratio is highly variable, and an experiment was set up to see whether population density was involved in determining the fraction of males.

2. The data consist of numbers dead and initial batch size for five doses of pesticide application, and we wish to know what dose kills 50% of the individuals (or 90% or 95%, as required).

After fitting a binomial-logit GLM, check that: residual deviance ≈ residual df
- YES: the model is adequate
- NO: fit a quasibinomial and check again
  - YES: the model is adequate
  - NO: change distribution
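The second example (dose that kills 50% of individuals) could be approached as in the sketch below. The bioassay numbers are hypothetical, made up for illustration; the use of log(dose) as the explanatory variable is also an assumption.

```r
# Hypothetical bioassay: five doses, 30 individuals per batch (illustration only)
dose  <- c(1, 2, 4, 8, 16)
batch <- rep(30, 5)
dead  <- c(2, 8, 15, 23, 29)

# Binomial GLM with logit link; the response is cbind(successes, failures)
fit <- glm(cbind(dead, batch - dead) ~ log(dose),
           family = binomial(link = "logit"))

# Overdispersion check: residual deviance should be close to residual df
ratio <- deviance(fit) / df.residual(fit)

# If the ratio is clearly above 1, refit with quasibinomial errors
# (same coefficients, standard errors scaled by the dispersion)
fit_q <- glm(cbind(dead, batch - dead) ~ log(dose),
             family = quasibinomial(link = "logit"))

# Dose killing 50%: logit(p) = 0, so log(dose) = -a / b
b    <- coef(fit)
ld50 <- unname(exp(-b[1] / b[2]))

print(c(deviance_df_ratio = ratio, LD50 = ld50))
```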

Page 39

Count data and Poisson error

1. The data are non-negative integers
2. The variance is non-constant (variance = mean)
3. Errors are non-normal

[Figure: Variance vs. Mean for count data]

The model is fitted with a log link (to ensure that the fitted values are bounded below) and Poisson errors (to account for the non-normality).

[Figure: log(x) vs. x for counts]

Page 40

Count data and Poisson error

Example

1. In this example the response is a count of the number of plant species on plots that have different biomass (continuous variable) and different soil pH (high, mid and low).

After fitting a Poisson-log GLM, check that: residual deviance ≈ residual df
- YES: the model is adequate
- NO: fit a quasipoisson and check again
  - YES: the model is adequate
  - NO: change distribution
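A Poisson GLM for an example of this kind could be set up as in the sketch below. The species counts are simulated (the original data are not in the slides), with assumed pH-specific intercepts and an assumed negative effect of biomass.

```r
set.seed(20)

# Simulated plots: 30 per soil pH class (illustration only)
d <- data.frame(
  biomass = runif(90, 0, 10),
  pH = factor(rep(c("low", "mid", "high"), each = 30),
              levels = c("low", "mid", "high"))
)
intercepts <- c(low = 2.5, mid = 3.0, high = 3.5)  # assumed values
lambda <- exp(intercepts[as.character(d$pH)] - 0.15 * d$biomass)
d$species <- rpois(nrow(d), lambda)

# Poisson errors with log link; biomass * pH allows a different slope per pH
fit <- glm(species ~ biomass * pH, family = poisson(link = "log"), data = d)

# Overdispersion check: residual deviance should be close to residual df
ratio <- deviance(fit) / df.residual(fit)

# If the ratio is clearly above 1, refit with quasipoisson errors
fit_q <- glm(species ~ biomass * pH, family = quasipoisson(link = "log"),
             data = d)

print(c(deviance_df_ratio = ratio))
```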