Modeling Wim Buysse RUFORUM 1 December 2006 Research Methods Group


Page 1

Modeling

Wim Buysse

RUFORUM 1 December 2006

Research Methods Group

Page 2

Part 1. General Linear Models

Page 3

General Linear Models

Dataset from

Page 4

General Linear Models

Dataset from

p. 89 - 95

Page 5

General Linear Models

Effects of three levels of sorbic acid (Sorbic) and six levels of water activity (Water) on survival of Salmonella typhimurium (Density)

Density = log(density/ml)

Page 6

General Linear Models

ANOVA approach

Page 7

General Linear Models

Results

Page 8

General Linear Models

The same data, but each treatment is presented as a ‘dummy variable’. (Warning: for educational purposes only.)

Page 9

General Linear Models

Regression with a first independent variable.

Page 10

General Linear Models

We add a second independent variable.

Page 11

General Linear Models

We add a third one.

Page 12

General Linear Models

We add a fourth one.

Page 13

General Linear Models

We continue to construct the model.

Page 14

General Linear Models

Finally, the results.

Page 15

General Linear Models

Comparison of the two approaches.

Page 16

General Linear Models

Comparison of the two approaches:

- They give the same results (in terms of SS).
- The approach to choose depends on what you want to know.
- The regression approach still works when the ANOVA approach is no longer possible (for instance when there are missing values).
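The claim that both approaches give the same sums of squares can be checked numerically. The sketch below is only an illustration under stated assumptions: the two groups of values are hypothetical (they are not the Salmonella data), the function names are invented for this example, and only a single 0/1 dummy variable is used.

```python
# Hypothetical responses for two treatments (NOT the Salmonella data).
control = [4.1, 3.8, 4.4, 4.0]
treated = [5.2, 5.6, 4.9, 5.5]


def mean(values):
    return sum(values) / len(values)


def anova_treatment_ss(groups):
    """Between-treatment sum of squares, as in a one-way ANOVA table."""
    all_values = [v for group in groups for v in group]
    grand_mean = mean(all_values)
    return sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)


def dummy_regression_ss(group0, group1):
    """Regression sum of squares when the treatment is coded as a 0/1 dummy."""
    y = group0 + group1
    x = [0] * len(group0) + [1] * len(group1)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope ** 2 * sxx


ss_anova = anova_treatment_ss([control, treated])
ss_regression = dummy_regression_ss(control, treated)
print(f"ANOVA treatment SS:  {ss_anova:.5f}")
print(f"Regression SS:       {ss_regression:.5f}")
```

With more than two treatments the same idea holds, but one dummy variable is needed for each treatment beyond the first, which is what the step-by-step slides build up.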

Page 17

Example: modelling approach with normally distributed data.

Protocol and dataset.

Page 18

Example: modelling approach with normally distributed data.

Data: Screening of suitable species for three-year fallow

file = Fallow N.xls

Protocol: p. 13

Page 19

Example: modelling approach with normally distributed data.

The analysis approach is described in chapter 19 of ‘Good statistical practice for natural resources research’.

Page 20

Modelling approach: general

5 steps:

1. (Visual) exploration to discover trends and relationships

2. Choose a possible model, based on:
  • The trend you see
  • Knowledge of the experimental design
  • Biological/scientific knowledge of the process

3. Fitting = estimation of parameters

4. Check = assessing the ‘fit’

5. Interpretation to answer the objectives.

Page 21

Expanding the model

ANOVA and regression

• Same calculations
• Data = pattern + noise = systematic component + random component
• Same assumptions
  • Systematic components are additive
  • Variability of the groups is similar
  • The random component is (roughly) normally distributed. The random variability of “y” around the systematic component is not affected by this systematic component.

Page 22

GENERAL LINEAR MODELS

Page 23

GENERAL LINEAR MODELS

Page 24

GENERAL LINEAR MODELS

Data = pattern + noise

Pattern: is explained by a linear combination of the independent variables

(Data ≈ N(m,v) and the variance is roughly constant across the different groups)

Noise: N(0,v), with a variance v that is roughly constant across the different groups
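Written as a formula, "data = pattern + noise" could look as follows. This is a sketch in notation that is not on the slides: the symbols \beta_j for the parameters, x_{ji} for the independent variables and \varepsilon_i for the noise are assumptions; v is the common variance from N(m,v).

```latex
y_i = \underbrace{\beta_0 + \beta_1 x_{1i} + \dots + \beta_p x_{pi}}_{\text{pattern}}
      + \underbrace{\varepsilon_i}_{\text{noise}},
\qquad \varepsilon_i \sim N(0, v) \text{ independently, with the same } v \text{ in every group.}
```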

Page 25

Expanding the model

If the data are not normally distributed or if the variance of the different groups is not similar:

Possible approach = transformation of the data = « linearising » the model

Problems:

- You no longer work on a scale that has a biological meaning.
- Transforming the standard errors back to the original scale is no longer possible.

Page 26

Expanding the model

Better solution: GENERAL LINEAR MODELS => GENERALIZED LINEAR MODELS

Fewer restrictions; two essential differences:

1. Data can be distributed according to a member of the exponential family of distributions = Normal, Binomial, Poisson, Gamma, Negative binomial

2. Link function: E(Y) itself no longer has to be a linear combination of the independent variables. Instead, a function of E(Y) (the link function) is modelled as a linear combination of the independent variables. (We do not transform the dependent variable but include the transformation in the model.)
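In symbols, the role of the link function could be sketched as below. The notation (g for the link, \mu for E(Y), \beta_j for the parameters) is an assumption and does not come from the slides; the three links shown are the usual choices for the Normal, binomial and Poisson distributions.

```latex
g\bigl(E(Y)\bigr) = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p

\begin{aligned}
\text{identity link (Normal):}   \quad & g(\mu) = \mu \\
\text{logit link (binomial):}    \quad & g(\mu) = \log\frac{\mu}{1-\mu} \\
\text{logarithm link (Poisson):} \quad & g(\mu) = \log \mu
\end{aligned}
```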

Page 27

Expanding the model

Better solution: GENERAL LINEAR MODELS => GENERALIZED LINEAR MODELS

Also:

- The systematic component (linear combination of independent variables) can include both continuous and categorical variables and even polynomials

But still:

- The variance is constant across the different groups (or has become constant because of the transformation through the link function)

Page 28

Generalised linear models

The statistical theory is more difficult, but the menus in GenStat and the way you interpret the output are very similar to what we know from ANOVA and regression.


Page 30

Example 1. Logistic regression

Example: cardio-vascular disease according to age

File: age and chd.xls

Page 31

Example 1. Logistic regression

Example: same data but according to age group

Page 32

Example 1. Logistic regression

Example: the linear regression is not an appropriate model and the predictions at the extremes will not be correct

Page 33

Example 1. Logistic regression

Example: χ2 test: limited information

Page 34

Example 1. Logistic regression

• Bernoulli process: an (independent) event that can have two possible outcomes (1 or 0, success or failure, …), with a given probability of success
  • Tossing a coin: head or tail; p = 0.5
  • Throwing a 6 with a die (success) compared to throwing any other number; p = 1/6
  • Conducting a survey: is the head of the household male or female? Calculate p from the proportion found in the collected data

• Screening of cardio-vascular diseases: p(disease) = 43 out of 100 individuals = 0.43

Page 35

Example 1. Logistic regression

• In GenStat

Page 36

Example 1. Logistic regression

• Logistic function

Page 37

Example 1. Logistic regression

• Logistic function
  • Sigmoid form
  • Linear in the middle
  • The probability is restricted between 0 and 1
  • Small values: flatten towards 0; large values: flatten towards 1
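These properties can be checked with a few lines of Python. This is a minimal sketch, assuming the standard logistic function p = 1 / (1 + exp(-eta)); the name `logistic` and the input values are chosen for illustration only.

```python
import math


def logistic(eta):
    """Map a value on the linear (logit) scale to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-eta))


# Small values flatten towards 0, large values flatten towards 1,
# and the curve is close to a straight line around eta = 0.
for eta in (-6, -2, -1, 0, 1, 2, 6):
    print(f"eta = {eta:>2}  ->  p = {logistic(eta):.4f}")
```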

Page 38

Example 1. Logistic regression

• GenStat output
• Similar, but ‘deviance’ instead of ‘variance’ and χ2 test instead of F test

Page 39

Example 1. Logistic regression

• GenStat output
• Model
• Logit(CHD) = -5.31 + 0.1109 AGE

Page 40

Example 1. Logistic regression

• Logit(CHD) = -5.31 + 0.1109 AGE
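To see what the fitted equation means on the probability scale, the logit can be transformed back with the logistic function. The sketch below uses the two coefficients reported on the slide; the function name `chd_probability` and the ages chosen are illustrative assumptions.

```python
import math

INTERCEPT = -5.31  # constant of the fitted model, from the slide
SLOPE = 0.1109     # change in logit(CHD) per extra year of age, from the slide


def chd_probability(age):
    """Estimated probability of CHD at a given age, back-transformed from the logit."""
    logit = INTERCEPT + SLOPE * age
    return 1.0 / (1.0 + math.exp(-logit))


for age in (30, 50, 70):
    print(f"age {age}: estimated probability of CHD = {chd_probability(age):.2f}")

# On the odds scale the effect of age is multiplicative.
odds_ratio_per_year = math.exp(SLOPE)
print(f"odds ratio per extra year of age = {odds_ratio_per_year:.3f}")
```

Each additional year of age multiplies the estimated odds of CHD by exp(0.1109), roughly 1.12, while the probability itself always stays between 0 and 1.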

Page 41

Example 1. Logistic regression

Page 42

Example 1. Logistic regression

• Binomial distribution: when we repeat the Bernoulli process, the order of success or failure can change

• Example: head of household in a survey

Page 43

Example 1. Logistic regression

• Calculation of probabilities if success = female-headed household with p = 0.2
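The calculation can be sketched in Python as follows. The probability p = 0.2 comes from the slide; the number of households (n = 5) is a hypothetical value chosen for illustration, and the function name is an assumption.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent Bernoulli trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


P_FEMALE_HEADED = 0.2  # probability of success, from the slide
N_HOUSEHOLDS = 5       # hypothetical sample size, not from the slides

probabilities = [
    binomial_pmf(k, N_HOUSEHOLDS, P_FEMALE_HEADED)
    for k in range(N_HOUSEHOLDS + 1)
]
for k, prob in enumerate(probabilities):
    print(f"P({k} female-headed households out of {N_HOUSEHOLDS}) = {prob:.4f}")
```

The factor `comb(n, k)` counts the different orders in which k successes can occur among n observations, which is the point made about the order of success and failure.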

Page 44

Example 1. Logistic regression

• Calculated probabilities for obtaining success

• We can now construct a frequency distribution of obtaining success

• Probability = long-run frequency = frequency when there are very many data

• = binomial distribution

Page 45

Example 1. Logistic regression

• Binomial distribution
• Counts of a categorical variable
• Example: experiment of survival of trees from different provenances
• File: survival trees.xls

Page 46

Example 1. Logistic regression

• Several approaches possible: approach 1

Page 47

Example 1. Logistic regression

• Several approaches possible: approach 1

Page 48

Example 1. Logistic regression

• Several approaches possible: approach 2

Page 49

Example 1. Logistic regression

• Several approaches possible: approach 2

Page 50

Example 1. Logistic regression

• Several approaches possible: approach 3

Page 51

Example 1. Logistic regression

• Several approaches possible: approach 3

Page 52

Example 1. Logistic regression

• The Bernoulli distribution is a special case of the binomial distribution

• There exist ‘families of distributions’.

Page 53

Example 1. Logistic regression

• There is of course a difference in the variability that is explained (approaches 1, 2 and 3).

Page 54

Example 2. Modelling counts

• We used logistic regression to analyse counts.
• Bernoulli distribution: distribution of success of events that follow a Bernoulli process (1 or 0, yes or no)
• Binomial distribution: distribution of possible (and independent) combinations of Bernoulli events
• So, more like analysis of proportions.
• Next: Poisson distribution: distribution of counts of Bernoulli events

Page 55

Example 2. Modelling counts

• Poisson distribution: distribution of counts of Bernoulli events

• BUT:
  • p is very small
  • n is very big
  • p*n < 5
  • Events happen randomly and independently of each other.
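Under these conditions the binomial probabilities are very close to Poisson probabilities with mean n*p. The sketch below illustrates this; the values n = 1000 and p = 0.002 are hypothetical and were chosen only so that p is small, n is big and p*n < 5.

```python
from math import comb, exp, factorial


def binomial_pmf(k, n, p):
    """Probability of k successes in n independent Bernoulli trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, mean):
    """Probability of a count of k under a Poisson distribution with the given mean."""
    return exp(-mean) * mean**k / factorial(k)


N = 1000   # hypothetical: very many trials
P = 0.002  # hypothetical: rare event, so N * P = 2 < 5

largest_gap = 0.0
for k in range(8):
    from_binomial = binomial_pmf(k, N, P)
    from_poisson = poisson_pmf(k, N * P)
    largest_gap = max(largest_gap, abs(from_binomial - from_poisson))
    print(f"k = {k}: binomial {from_binomial:.5f}   Poisson {from_poisson:.5f}")

print(f"largest difference: {largest_gap:.5f}")
```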

Page 56

Example 2. Modelling counts

• Poisson distribution = distribution of rare events
  • Number of civil airplane crashes (when there is no war) in the whole world during several years.
  • Number of infected seeds in seed lots that are certified by a controlling agency.
  • Number of individuals of a rare tree species in a square kilometre in the same Agro Ecological Zone.

Page 57

Example 2. Modelling counts

THUS

• The distribution that best describes counts is not automatically a Poisson distribution.

• It depends on the context.

Page 58

Example 2. Modelling counts

Some mathematical statistics

For a Poisson distribution the mean equals the variance, so the ratio variance/mean must be 1.

= Poisson index

In GenStat the Poisson index is calculated as (s² - m)/m, with s² the sample variance and m the sample mean, so it should be close to 0 for Poisson data.
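A small sketch of this calculation is given below. Both sets of counts are hypothetical, and the function name `poisson_index` is an assumption; the formula (s² - m)/m is the one quoted for GenStat.

```python
from statistics import mean, variance


def poisson_index(counts):
    """(s2 - m) / m: close to 0 when the variance is close to the mean."""
    m = mean(counts)
    s2 = variance(counts)  # sample variance (divisor n - 1)
    return (s2 - m) / m


# Hypothetical counts per sampling unit.
random_like = [0, 1, 2, 1, 0, 3, 1, 2, 0, 1]  # variance close to the mean
clumped = [0, 0, 0, 10, 0, 0, 12, 0, 0, 0]    # variance much larger than the mean

print(f"random-like counts: index = {poisson_index(random_like):.3f}")
print(f"clumped counts:     index = {poisson_index(clumped):.3f}")
```

An index well above 0, as for the clumped counts, suggests that the variability is larger than a Poisson distribution allows and that another distribution may describe the counts better.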

Page 59

Example 2. Modelling counts

We have already briefly seen other counts: the χ2 test

χ2 test: is there evidence of an association between two discrete variables?

H0: no association

H1: association
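As a reminder of how the test statistic is obtained, here is a minimal sketch of the Pearson χ2 calculation for a two-way table. The 2 x 2 table of counts is hypothetical and the function names are assumptions; 3.841 is the 5% critical value of the χ2 distribution with 1 degree of freedom.

```python
def expected_counts(table):
    """Expected counts under H0 (no association): row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


def pearson_chi2(table):
    """Sum over all cells of (observed - expected)^2 / expected."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Hypothetical 2 x 2 table of counts (rows and columns are two discrete variables).
table = [[20, 30],
         [30, 20]]

chi2 = pearson_chi2(table)
df = (len(table) - 1) * (len(table[0]) - 1)
CRITICAL_5_PERCENT_1_DF = 3.841

print(f"chi-square = {chi2:.3f} on {df} degree(s) of freedom")
print("evidence of association at the 5% level:", chi2 > CRITICAL_5_PERCENT_1_DF)
```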

Page 60

Example 2. Modelling counts

We could use another kind of probability to calculate the test statistic

Page 61

Example 2. Modelling counts

But now we look at the table in another way. If we consider the counts in the table as a variable, we could construct a frequency distribution.

Page 62

Example 2. Modelling counts

Long-run frequency distribution = probability distribution

We just expanded the binomial distribution into the multinomial distribution.

Binomial distribution:
• Independent observations
• p success = everywhere the same. The probability that an individual observation falls into a specific cell of the table is the same for all cells.

Multinomial distribution:
• + The number of total observations is fixed.

Page 63

Example 2. Modelling counts

If the total number of observations was not fixed => Poisson distribution

BUT

Thanks to a lot of difficult statistical theory, we can also use the Poisson distribution even if the total number of observations is fixed.

Page 64

Example 2. Modelling counts

CONCLUSION

Even though the context is important to decide whether we can use the Poisson distribution to analyse counts (‘distribution of rare events’)

Generally:

Analysis of ‘multiway contingency tables’ => Poisson distribution + logarithm link = LOGLINEAR MODELING
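For a two-way table, a loglinear model could be written as below. This is a sketch in notation that does not appear on the slides: n_{ij} is the count in row i and column j, \mu_{ij} its expected value, and the \lambda terms are the parameters for the row factor A, the column factor B and their interaction.

```latex
n_{ij} \sim \text{Poisson}(\mu_{ij}), \qquad
\log \mu_{ij} = \lambda + \lambda^{A}_{i} + \lambda^{B}_{j}
\quad \text{(no association between A and B)}

\log \mu_{ij} = \lambda + \lambda^{A}_{i} + \lambda^{B}_{j} + \lambda^{AB}_{ij}
\quad \text{(association, i.e. the interaction is added)}
```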

Page 65

Example 2. Modelling counts

Analysis of counts =
• Often we can use the Poisson distribution
• But not always

Page 66

Example 2. Loglinear modelling

Page 67

Example 2. Loglinear modelling

Adding interactions

Page 68

Example 2. Loglinear modelling

χ2 test = Loglinear modelling
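The link between the χ2 test and loglinear modelling can be illustrated for a two-way table. For the loglinear model with main effects only (no interaction), the fitted counts are row total * column total / grand total, the same expected counts as in the χ2 test, and the residual deviance is close to the Pearson χ2 statistic. The sketch below assumes a hypothetical 2 x 2 table; the function names are illustrative and no GenStat output is reproduced.

```python
from math import log


def independence_fit(table):
    """Fitted counts of the loglinear model without interaction (no association)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


def residual_deviance(table, fitted):
    """Poisson deviance: 2 * sum of observed * log(observed / fitted)."""
    return 2 * sum(
        o * log(o / f)
        for obs_row, fit_row in zip(table, fitted)
        for o, f in zip(obs_row, fit_row)
        if o > 0
    )


def pearson_chi2(table, fitted):
    """Pearson statistic: sum of (observed - fitted)^2 / fitted."""
    return sum(
        (o - f) ** 2 / f
        for obs_row, fit_row in zip(table, fitted)
        for o, f in zip(obs_row, fit_row)
    )


# Hypothetical 2 x 2 table of counts.
table = [[20, 30],
         [30, 20]]

fitted = independence_fit(table)
deviance = residual_deviance(table, fitted)
chi2 = pearson_chi2(table, fitted)
residual_df = (len(table) - 1) * (len(table[0]) - 1)

print(f"residual deviance = {deviance:.3f} on {residual_df} d.f.")
print(f"Pearson chi-square = {chi2:.3f} on {residual_df} d.f.")
```

Adding the interaction term to this model would reproduce the observed counts exactly, so the residual deviance would drop to 0.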

Page 69

Example 2. Loglinear modelling

Modelling of complex datasets:
• Adding or dropping terms and interactions in the model and changing their order
• Good model (‘good fit’) when the ‘residual deviance’ becomes almost equal to the number of degrees of freedom (or ‘mean deviance’ = 1)
• At that moment we can assume that the remaining residual variability is caused by the random variability (noise)
• Adding too many terms: ‘residual deviance’ => 0

Page 70

Example 2. Loglinear modelling

Example: lambs.xls