sadc course in statistics goodness-of-fit tests (and further issues) (session 16)
Post on 28-Mar-2015
SADC Course in Statistics
Goodness-of-fit tests (and further issues)
(Session 16)
2To put your footer here go to View > Header and Footer
Learning Objectives
By the end of this session, you will be able to
• conduct and interpret results from a chi-square test for testing the goodness-of-fit of data to a particular distribution
• understand how two-way contingency tables can be further examined by looking at their residuals
• present results from a standard chi-square test, paying attention to the table’s summary features
Goodness-of-fit tests
• In previous sessions, we have seen that many tests are based on the assumption of normality
• On some occasions, it is also important to ascertain whether the data follow other distributions, e.g. the binomial or Poisson distributions
• We shall now look at how the chi-square test can be applied to examine the extent to which assumptions concerning the distribution of a given variable hold
Goodness-of-fit tests
• The basic idea is first to calculate the probability of each possible value occurring
• e.g. the number of cows getting a disease on a farm which has 6 cows may be assumed to follow a binomial distribution.
• e.g. the number of visits made by a pregnant woman in a region to the region’s single antenatal clinic may be assumed to follow a Poisson distribution.
Can we check these assumptions before subjecting the data to tests based on them?
Goodness-of-fit test: Normal distn
• Because the Normal distribution applies to a continuous random variable, it is necessary to group the data and obtain observed frequencies in each group.
• The next step is to determine the probability of an observation falling in each group, and hence the expected value.
• The chi-square test can then be applied in the usual way: the d.f. being number of groups – 1 – number of parameters estimated in computing expected values.
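The grouping steps above can be sketched in Python as follows. This is a minimal illustration, not the course's own method: the function name normal_expected and the mean and standard deviation used here are hypothetical values chosen only to show the mechanics.

```python
from statistics import NormalDist

def normal_expected(upper_bounds, mean, sd, n):
    """Expected frequencies for the groups (-inf, b1], (b1, b2], ..., (bk, +inf)."""
    dist = NormalDist(mean, sd)
    cdf = [0.0] + [dist.cdf(b) for b in upper_bounds] + [1.0]
    # Probability of each group is the difference of successive CDF values
    return [n * (hi - lo) for lo, hi in zip(cdf, cdf[1:])]

# Hypothetical inputs, for illustration only (not taken from the slides)
bounds = [100, 125, 150, 175, 200, 225, 250]
expected = normal_expected(bounds, mean=160.0, sd=50.0, n=56)

groups = len(expected)               # 8 groups
params_estimated = 2                 # mean and sd estimated from the data
df = groups - 1 - params_estimated   # 8 - 1 - 2 = 5
```

The expected frequencies always add up to n, because the group probabilities cover the whole real line.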
An example: Normal distn
• Consider the total rainfall in June at a particular site from 1928 to 1983. Suppose we wish to test the assumption that these data follow a normal distribution
• A histogram for the data appears below.
[Figure: histogram of June rainfall totals. x-axis: Rainfall totals, in groups <=100, to 125, to 150, to 175, to 200, to 225, to 250, > 250; y-axis: Frequency, 0 to 14]
An example: Normal distn
Expected values are now calculated for each group, assuming a normal distribution.
The table shows observed and expected frequencies.
The chi-square value is 3.6 with d.f. = 5 (8 groups – 1 – 2 estimated parameters, the mean and standard deviation). P-value = 0.6083. Conclusions?
RainTotal   Observed   Expected
<=100           4        6.86
to 125         11        7.45
to 150         12       10.31
to 175          9       11.12
to 200          9        9.33
to 225          6        6.10
to 250          3        3.11
> 250           2        1.72
Totals         56       56
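As a check on the figures quoted for this table, the statistic and its p-value can be recomputed in Python. The sketch below uses only the standard library; the helper chi2_sf is a name introduced here, and it relies on the standard closed-form series for the upper tail of a chi-square distribution with integer degrees of freedom.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(chi-square with df d.f. > x), integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even d.f.: exp(-x/2) * sum_{r=0}^{df/2-1} (x/2)^r / r!
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= (x / 2) / r
            total += term
        return math.exp(-x / 2) * total
    # Odd d.f.: normal tail plus a finite series in sqrt(x)
    root = math.sqrt(x)
    tail = math.erfc(root / math.sqrt(2))              # 2 * P(Z > sqrt(x))
    density = math.exp(-x / 2) / math.sqrt(2 * math.pi)
    term, total = root, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return tail + 2 * density * total

observed = [4, 11, 12, 9, 9, 6, 3, 2]
expected = [6.86, 7.45, 10.31, 11.12, 9.33, 6.10, 3.11, 1.72]

statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1 - 2        # two parameters (mean, sd) estimated
p_value = chi2_sf(statistic, df)
```

With these inputs the statistic rounds to 3.6 and the p-value is close to 0.6, so there is no evidence against normality.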
An example: Binomial distn
• First recall (from Module H1) the form of the probability mass function for the binomial random variable with parameters n and p, where p is the probability of a “success” in a sequence of n trials, each trial having just 2 possible outcomes.
• The number of successes (X) in n trials has a binomial distribution.
$$P(X = k) = \frac{n!}{k!\,(n-k)!}\, p^{k} (1-p)^{n-k}, \qquad k = 0, 1, \ldots, n$$
• This formula gives the binomial probabilities, obtained also from Excel’s function Binomdist(x,n,p,false).
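For readers working outside Excel, the same probabilities can be obtained with a few lines of Python. This is a minimal sketch: binomial_pmf is a name introduced here, n = 6 echoes the farm with 6 cows mentioned earlier, and p = 0.3 is a hypothetical value used only for illustration.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for a binomial(n, p) random variable."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

# Illustrative values only: 6 cows, hypothetical disease probability 0.3
n, p = 6, 0.3
probs = [binomial_pmf(k, n, p) for k in range(n + 1)]
```

The list probs holds P(X = 0), ..., P(X = n), and its entries sum to 1.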
An example: Binomial distn
Suppose we have a binomial variable with observed values as shown (n = 7, p = 0.222).
Expected values can be derived using [P(X=k)]*404.
The chi-square value is 141.3 with d.f. = 4 (6 groups – 1 – 1), since p has been estimated from the data. P-value = 0.000.
k        Observed   Expected
0            81       69.7
1           130      139.2
2           129      119.2
3            37       56.7
4            14       16.2
5,6,7        23        3.0
Totals      404      404
What are your conclusions?
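The Expected column can be reproduced from the quoted values n = 7, p = 0.222 and a total of 404. The sketch below is one way to do it in Python; the helper name binomial_pmf is introduced here, and pooling k = 5, 6, 7 into a single group follows the layout of the table.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for a binomial(n, p) random variable."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

n, p, total = 7, 0.222, 404           # values quoted on the slide
probs = [binomial_pmf(k, n, p) for k in range(n + 1)]

# Pool k = 5, 6, 7 into one group, as in the table
grouped = probs[:5] + [sum(probs[5:])]
expected = [total * q for q in grouped]

df = len(expected) - 1 - 1            # 6 groups, one parameter (p) estimated
```

Rounded to one decimal place, expected agrees with the Expected column. The pooled group has an expected frequency of only about 3, which is worth bearing in mind when interpreting the large chi-square value.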
Other issues
There are two more issues to discuss concerning chi-square tests for testing the association between two categorical variables.
These relate to
• further examination of the table of frequencies when a significant result is found; and
• how to present the results
Example of Session 15
For the data below, we found a significant chi-square value, with p=0.0024, i.e. evidence that the proportions of diseased animals are not the same for all vaccines.
Vaccine   diseased   healthy   Total
A            43        237      280
B            52        198      250
C            25        245      270
D            48        212      260
E            57        233      290
Total       225       1125     1350
Question:
But what contributes most to the chi-square statistic?
i.e. which vaccine departs most from the overall Pr(diseased)=0.167?
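Before looking at individual cells, the overall test result quoted for this table can be reproduced. The following Python sketch computes the expected counts under the hypothesis of equal disease proportions, the chi-square statistic, and its p-value. It assumes the usual Pearson statistic without continuity correction, and uses the fact that the upper tail for 4 d.f. has a simple closed form.

```python
import math

# vaccine: (diseased, healthy), from the Session 15 example
table = {
    "A": (43, 237), "B": (52, 198), "C": (25, 245), "D": (48, 212), "E": (57, 233),
}
col_totals = [sum(row[j] for row in table.values()) for j in range(2)]
grand = sum(col_totals)
overall_pr_diseased = col_totals[0] / grand

chi2 = 0.0
for row in table.values():
    row_total = sum(row)
    for j, obs in enumerate(row):
        exp = row_total * col_totals[j] / grand   # expected count under H0
        chi2 += (obs - exp) ** 2 / exp

df = (len(table) - 1) * (2 - 1)                   # (rows - 1)(columns - 1) = 4
p_value = math.exp(-chi2 / 2) * (1 + chi2 / 2)    # exact upper tail when df = 4
```

The statistic comes to about 16.6 on 4 d.f., and the p-value rounds to the 0.0024 quoted above.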
Cell contributions to chi-square:
Vaccine   diseased   healthy
A          0.288      0.057
B          2.563      0.512
C          8.889      1.778
D          0.502      0.100
E          1.554      0.311
The table gives the contribution of each cell to the chi-square statistic, i.e. the values (O-E)^2/E.
Rule of thumb:
Focus on cells with values > 4 and, in larger tables, on those > 9.
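The cell contributions can be recomputed from the vaccine frequencies with a short Python sketch. The threshold of 4 is the rule of thumb stated above; the variable names are introduced here for illustration.

```python
# vaccine: (diseased, healthy)
observed = {
    "A": (43, 237), "B": (52, 198), "C": (25, 245), "D": (48, 212), "E": (57, 233),
}
labels = ("diseased", "healthy")
col_totals = [sum(row[j] for row in observed.values()) for j in range(2)]
grand = sum(col_totals)

contributions = {}
for vaccine, row in observed.items():
    row_total = sum(row)
    expected = [row_total * c / grand for c in col_totals]
    contributions[vaccine] = [(o - e) ** 2 / e for o, e in zip(row, expected)]

# Cells exceeding the rule-of-thumb threshold of 4
flagged = [
    (vaccine, labels[j])
    for vaccine, cells in contributions.items()
    for j, value in enumerate(cells)
    if value > 4
]
```

The computed values agree with the table to within rounding in the third decimal place, and only the diseased cell for vaccine C is flagged.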
Standardised residuals
Vaccine   diseased   healthy
A          -0.54       0.24
B           1.60      -0.72
C          -2.98       1.33
D           0.71      -0.33
E           1.25      -0.56
Better still, use standardised residuals so that signs are also included, i.e. use SR = (O-E)/√E.
Rule of thumb:
Focus on cells with |SR| > 2 or, in larger tables, on those with |SR| > 3.
Conclusion:
Vaccine C gives the greatest discrepancy from H0.
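A standardised residual is the signed square root of the corresponding cell contribution, so it can be computed in the same way. The Python sketch below assumes the simple form (O-E)/√E, which reproduces the diseased column of the table; other definitions (e.g. adjusted residuals) exist and would give different values.

```python
import math

# vaccine: (diseased, healthy)
observed = {
    "A": (43, 237), "B": (52, 198), "C": (25, 245), "D": (48, 212), "E": (57, 233),
}
labels = ("diseased", "healthy")
col_totals = [sum(row[j] for row in observed.values()) for j in range(2)]
grand = sum(col_totals)

residuals = {}
for vaccine, row in observed.items():
    row_total = sum(row)
    expected = [row_total * c / grand for c in col_totals]
    residuals[vaccine] = [(o - e) / math.sqrt(e) for o, e in zip(row, expected)]

# Cells whose residual exceeds 2 in absolute value
flagged = [
    (vaccine, labels[j])
    for vaccine, cells in residuals.items()
    for j, sr in enumerate(cells)
    if abs(sr) > 2
]
```

A negative residual means fewer animals than expected in that cell, so the large negative value for vaccine C in the diseased column indicates fewer diseased animals than H0 predicts.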
Presentation of results
Vaccine % DISEASED
C 9.3%
A 15.4%
D 18.5%
E 19.7%
B 20.8%
In this example, it would be appropriate to present a table of the percentage of animals diseased under each vaccine.
A table sorted from the most to the least effective vaccine makes the results easier to see. Note that there are more advanced methods, e.g. modelling, to make specific comparisons between the above percentages.
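A sorted summary table of this kind is straightforward to produce. The following Python sketch computes the percentage diseased for each vaccine and sorts from lowest to highest; the variable names are introduced here for illustration.

```python
# vaccine: (diseased, total animals)
counts = {
    "A": (43, 280), "B": (52, 250), "C": (25, 270), "D": (48, 260), "E": (57, 290),
}

percent_diseased = {v: 100 * d / n for v, (d, n) in counts.items()}

# Sort from the lowest to the highest percentage diseased
ranked = sorted(percent_diseased.items(), key=lambda item: item[1])

for vaccine, pct in ranked:
    print(f"{vaccine}  {pct:.1f}%")
```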
Presentation: Example from Sess 14
                    Usually sleep under a mosquito net?
Suffered malaria?   Yes             No              Total
Yes                 649 (62.5%)     3849 (55.8%)    4498 (56.6%)
No                  390 (37.5%)     3055 (44.2%)    3445 (43.4%)
Total               1039 (100.0%)   6904 (100.0%)   7943 (100%)
Recall these results from Session 14. The test of association gave p = 0.000 to three decimal places, i.e. p < 0.001.
Presentation and conclusions
Test results indicate that there is an association between use of a mosquito net and incidence of malaria. However, the resulting incidences are unexpected. Note: malaria incidence
for those using a net = 62.5%
for those not using a net = 55.8%.
This emphasises the danger of ignoring other factors that may affect malaria incidence, e.g. altitude, housing conditions, etc. Further, could it be that those who had malaria then started using mosquito nets?
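The two incidences and the test result can be reproduced from the four cell counts. The Python sketch below assumes the usual Pearson chi-square statistic without continuity correction (the slides do not say which variant was used), and uses the fact that with 1 d.f. the upper tail can be written with the complementary error function.

```python
import math

# Columns: usually sleeps under a net (Yes, No); rows: suffered malaria (Yes, No)
malaria_yes = (649, 3849)
malaria_no = (390, 3055)

col_totals = [y + n for y, n in zip(malaria_yes, malaria_no)]
incidence = [100 * y / t for y, t in zip(malaria_yes, col_totals)]

grand = sum(col_totals)
chi2 = 0.0
for row in (malaria_yes, malaria_no):
    for obs, col_total in zip(row, col_totals):
        exp = sum(row) * col_total / grand     # expected count under no association
        chi2 += (obs - exp) ** 2 / exp

p_value = math.erfc(math.sqrt(chi2 / 2))       # upper tail of chi-square, 1 d.f.
```

The incidences round to 62.5% and 55.8%, and the p-value is far below 0.001. The test establishes an association only; it says nothing about its direction or cause.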
Some final remarks
• Performing a chi-square analysis is simple, but it does not take account of other factors that may affect the results.
• More advanced (e.g. log-linear modelling) procedures do exist for exploring factors affecting a categorical response, here use of a bednet.
• Recall that the chi-square test is an approximation. This approximation is poor if the expected frequencies are very small (e.g. < 5). Try collapsing some rows or columns if this happens.
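Collapsing categories can be automated for a one-way table of ordered groups. The Python sketch below shows one possible scheme, merging a group into its left-hand neighbour when its expected frequency is below 5; the function name pool_small_groups and this merging rule are illustrative choices, not something prescribed in the session. It is applied to the rainfall table from the Normal distribution example.

```python
def pool_small_groups(observed, expected, minimum=5.0):
    """Merge adjacent groups until every expected frequency is >= minimum."""
    obs, exp = list(observed), list(expected)
    i = len(exp) - 1
    while i > 0:
        if exp[i] < minimum:
            # Fold group i into its left-hand neighbour
            obs[i - 1] += obs[i]
            exp[i - 1] += exp[i]
            del obs[i], exp[i]
        i -= 1
    if len(exp) > 1 and exp[0] < minimum:
        # The first group has no left-hand neighbour, so fold it to the right
        obs[1] += obs[0]
        exp[1] += exp[0]
        del obs[0], exp[0]
    return obs, exp

observed = [4, 11, 12, 9, 9, 6, 3, 2]
expected = [6.86, 7.45, 10.31, 11.12, 9.33, 6.10, 3.11, 1.72]

pooled_obs, pooled_exp = pool_small_groups(observed, expected)
```

Here the last three groups are merged into one, leaving 6 groups. The degrees of freedom must be recalculated from the new number of groups.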
Some practical work follows…