ep304: advanced statistical methods in epidemiologydl.lshtm.ac.uk/programme/epp/docs/examiner...

EP304: Advanced Statistical Methods in Epidemiology Examination

Friday 3 June 2011: 10.00 am – 12.15 pm

Candidates are advised to spend the first FIFTEEN minutes of this exam reading the question paper and planning their answers.

Candidates should answer ALL questions.

Use a SEPARATE answer book for each question and put a page number at the bottom of each page used in the answer book.

A formulae sheet and statistical tables are provided for use at the end of the paper.

A hand held calculator may be used when answering questions on this paper. The calculator may be pre-programmed before the examination. The make and type of machine must be stated clearly on the front cover of the answer book.

Question 1

A demographic surveillance site was established in 2008 in a district in southern Tanzania, with a total population of around 40000. A baseline census was completed in 2008, after which all individuals have been followed up to the end of 2010. Children born in the area since 2008 are included in the study population, as are children and adults who migrate into the study area. For individuals who leave the study area, or have died since the baseline census, their follow-up time stops on the date of migration or death.

There is only one health clinic in the study area that can diagnose and treat children with severe pneumonia. During the whole of 2009, identifying information was collected on all children in the study population who presented at the clinic with an episode of severe pneumonia. This meant that each child with an episode of severe pneumonia could be linked to the demographic surveillance data. For the purpose of the analysis here, a child's follow-up was censored when they reached 5 years old, and includes only person-time during 2009. Key information collected is summarised in Table 1.

Severe pneumonia is rare, but children can have repeat episodes and one child had 3 such episodes during 2009. A total of 7860 children, aged 5 years or under, were followed up for part or all of 2009.

Table 1

Variable      Coding

idno          Unique identifying number for each child

sex           1 = Male; 2 = Female

dob           Date of birth

residence     0 = peri-urban; 1 = rural

entry_date    For the first record for each child, the entry_date is 1 January 2009, or
              in-migration date if the child was born or moved into the study area after
              1 January 2009.
              If a child has a second record, then the entry_date for the second record is
              equal to the exit_date of the first record for the child (i.e. equal to the
              date of the first severe pneumonia event).
              If a child has a third record, then the entry_date for the third record is
              equal to the exit_date of the second record for the child.

exit_date     For records ending without a severe pneumonia event, the exit date is
              31 December 2009, or date of death or out-migration date or date of 5th
              birthday, if any of these events occurred before 31 December 2009.
              For records ending with a severe pneumonia event, the exit date is the date
              of the clinic attendance with severe pneumonia.

finexit       Last date the child was resident in the study population during 2009

spneumonia    Severe pneumonia; 0 = No; 1 = Yes

totepi        Total episodes of severe pneumonia during 2009;
              0 = none; 1 = one; 2 = two; 3 = three

recordno      1 for the first record for each child; 2 for the second record for a child;
              3 for the third record for a child; 4 for the fourth record for a child.
              Children with no episode of severe pneumonia have only one record.
              Children with 1 episode of severe pneumonia have two records.
              Children with 2 episodes of severe pneumonia have three records.
              Children with 3 episodes of severe pneumonia have four records.

a) The researchers first investigated how many children had each of 0, 1, 2 or 3 episodes of

severe pneumonia during 2009, by tabulating the variable totepi. The result is shown in

Table 2 below:

Table 2

. tab totepi if recordno==1

totepi | Freq. Percent Cum.

------------+-----------------------------------

0 | 7,594 96.62 96.62

1 | 250 3.18 99.80

2 | 15 0.19 99.99

3 | 1 0.01 100.00

------------+-----------------------------------

Total | 7,860 100.00

(i) What is the implication of the fact that children can have repeated episodes of

severe pneumonia, rather than at most one episode, for the statistical analysis?

Justify your answer, and explain the consequence for the results of the analysis if

this fact is ignored.

(10 marks)

(ii) Given the distribution of the total number of episodes per child in Table 2, how

important is it for the analysis to properly account for the fact that a child can have

repeated events? Justify your answer. (5 marks)

The following command was used in Stata as the first step to analysing the rate of severe

pneumonia in children aged <5 years old:

stset exit_date, origin(dob) time0(entry_date) fail(spneumonia) exit(finexit) scale(365.25) id(idno)

id: idno

failure event: spneumonia != 0 & spneumonia < .

obs. time interval: (entry_date, exit_date]

exit on or before: time finexit

t for analysis: (time-origin)/365.25

origin: time dob

---------------------------------------------------------------------------

7860 subjects

283 failures in multiple failure-per-subject data

5422.591 total analysis time at risk, at risk from t = 0

earliest observed entry t = 0

last observed exit t = 5

(iii) What is the interpretation of the value of “earliest observed entry t” and “last

observed exit t”? (5 marks)

(iv) Calculate the overall rate of severe pneumonia among children <5 years old, using

the output from the stset command. (5 marks)

(v) From the listing below, describe the follow-up, and experience of severe

pneumonia, of child idno 5462792. (5 marks)

. list idno entry_date exit_date finexit spneumonia if idno==5462792

+--------------------------------------------------------+

| idno entry_date exit_date finexit spneumonia |

|--------------------------------------------------------|

| 5462792 01jan2009 08jan2009 28jul2009 1 |

| 5462792 08jan2009 28jul2009 28jul2009 0 |

+--------------------------------------------------------+
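As a rough check on the arithmetic behind part (a)(iv), the sketch below divides the failures by the analysis time reported in the stset output and applies the error factor for a rate from the formulae sheet. The variable names are illustrative and not part of the exam. The interval treats all 283 episodes as independent, which is exactly the simplification that parts (a)(i) and (a)(ii) ask candidates to consider.

```python
import math

# Figures read from the stset output in part (a)
failures = 283            # episodes of severe pneumonia
person_years = 5422.591   # total analysis time at risk, in years

# Overall rate, expressed per 1000 person-years
rate_per_1000 = 1000 * failures / person_years

# Error factor for a rate: exp(1.96 * sqrt(1/e)), e = number of events
error_factor = math.exp(1.96 * math.sqrt(1 / failures))
lower = rate_per_1000 / error_factor
upper = rate_per_1000 * error_factor

print(f"Rate: {rate_per_1000:.1f} per 1000 py "
      f"(approx. 95% CI {lower:.1f} to {upper:.1f})")
```

The result is roughly 52 episodes per 1000 child-years.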

b) Next, the following analysis was done:

. strate residence, per(1000)

failure _d: spneumonia

analysis time _t: (exit_date-origin)/365.25

origin: time dob

exit on or before: time finexit

id: idno

Estimated rates (per 1000) and lower/upper bounds of 95% confidence intervals)

+----------------------------------------------------+

| residence D Y Rate Lower Upper |

|----------------------------------------------------|

| 0 183 2.2741 80.472 69.619 93.019 |

| 1 100 3.1485 31.761 26.108 38.638 |

+----------------------------------------------------+

. stmh residence, c(1,0)

RR estimate, and lower and upper 95% confidence limits

----------------------------------------------------------

RR chi2 P>chi2 [95% Conf. Interval]

----------------------------------------------------------

0.395 60.03 0.0000 0.309 0.504

----------------------------------------------------------

(i) Is there statistical evidence of an association between area of residence and the rate

of severe pneumonia? Justify your answer, including reference to the 95%

confidence intervals and p-values for the rates and rate ratios. (10 marks)
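The crude rate ratio reported by stmh can be reproduced approximately from the strate table alone. The sketch below uses the D and Y columns (Y is in thousands of person-years because of per(1000)) and the error factor for a rate ratio from the formulae sheet. The names are illustrative, and the interval is the simple error-factor approximation rather than the exact method used by Stata.

```python
import math

# D (events) and Y (thousands of person-years) from the strate output
events_periurban, kpy_periurban = 183, 2.2741   # residence = 0
events_rural, kpy_rural = 100, 3.1485           # residence = 1

rate_periurban = events_periurban / kpy_periurban   # per 1000 py
rate_rural = events_rural / kpy_rural               # per 1000 py

# Rate ratio comparing rural (1) with peri-urban (0), as in c(1,0)
rate_ratio = rate_rural / rate_periurban

# Error factor for a rate ratio: exp(1.96 * sqrt(1/e1 + 1/e2))
error_factor = math.exp(1.96 * math.sqrt(1 / events_rural + 1 / events_periurban))
lower = rate_ratio / error_factor
upper = rate_ratio * error_factor

print(f"RR = {rate_ratio:.3f} (approx. 95% CI {lower:.3f} to {upper:.3f})")
```

The values agree with the stmh output to three decimal places, which suggests that the reported interval is close to the error-factor approximation.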

The next step in the analysis was as follows:

. stsplit timeband, at(0,0.5,1,2,3,5)

. strate timeband, per(1000)

+--------------------------------------------------------+

| timeband D Y Rate Lower Upper |

|--------------------------------------------------------|

| 0 36 0.5496 65.5004 47.2473 90.8052 |

| .5 62 0.5283 117.3673 91.5049 150.5393 |

| 1 115 1.0671 107.7671 89.7659 129.3783 |

| 2 46 1.0895 42.2204 31.6242 56.3670 |

| 3 24 2.1881 10.9685 7.3518 16.3643 |

+--------------------------------------------------------+

(ii) What is the meaning of the variable “timeband”? To what time period does the

timeband .5 correspond? (5 marks)

(iii) From the output above, is there statistical evidence of an association between the

variable timeband and the rate of severe pneumonia? Justify your answer. (5

marks)

c) Next, the following analysis was done:

. stmh residence, by(timeband) c(1,0)

RR estimate, and lower and upper 95% confidence limits

+--------------------------------+

| timeband RR Lower Upper |

|--------------------------------|

| 0 0.45 0.23 0.88 |

| .5 0.36 0.21 0.61 |

| 1 0.37 0.25 0.54 |

| 2 0.38 0.21 0.70 |

| 3 0.33 0.14 0.79 |

+--------------------------------+

Overall estimate controlling for timeband

----------------------------------------------------------

RR chi2 P>chi2 [95% Conf. Interval]

----------------------------------------------------------

0.375 67.43 0.0000 0.294 0.478

----------------------------------------------------------

Approx test for unequal RRs (effect modification): chi2(4) = 0.41

Pr>chi2 = 0.9815

Based on this stratified analysis, is there evidence that the variable timeband confounds or

modifies the association between area of residence and the rate of severe pneumonia in

children aged under 5 years old? Justify your answer. (10 marks)

d) The following Poisson regression models, and corresponding likelihood ratio tests, were

then done:

Model A

xi: streg i.residence i.sex i.timeband, dist(exp)

------------------------------------------------------------------------------

_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval]

-------------+----------------------------------------------------------------

_Iresidenc~1 | .375984 .0467786 -7.86 0.000 .2946225 .4798138

_Isex_2 | 1.196 .1432106 1.49 0.135 .9458169 1.51236

_Itimeband_2 | 1.827776 .3830194 2.88 0.004 1.212131 2.756109

_Itimeband_3 | 1.658995 .3168413 2.65 0.008 1.140984 2.412187

_Itimeband_4 | .65924 .1467178 -1.87 0.061 .4261903 1.019726

_Itimeband_5 | .1634126 .0430692 -6.87 0.000 .097486 .2739233

------------------------------------------------------------------------------

. est store A

(i) By comparing the rate ratio for area of residence from Model A, with the rate ratio

for area of residence from the Mantel-Haenszel analysis stratified on timeband, does

child sex confound the association between area of residence and the rate of severe

pneumonia? Justify your answer. (5 marks)

Model B

. xi: streg i.residence i.sex timeband, dist(exp)

------------------------------------------------------------------------------

_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval]

-------------+----------------------------------------------------------------

_Iresidenc~1 | .3821039 .0475275 -7.73 0.000 .2994372 .4875926

_Isex_2 | 1.193832 .1429444 1.48 0.139 .9441129 1.509602

timeband | .5238446 .0303475 -11.16 0.000 .4676171 .586833

------------------------------------------------------------------------------

. est store B

. lrtest A B

Likelihood-ratio test LR chi2(3) = 56.68

(Assumption: B nested in A) Prob > chi2 = 0.0000

(ii) Interpret the likelihood ratio test comparing Model B to Model A. (5 marks)

e) A random-effects Poisson regression model was then fitted to the data, to properly

account for the clustering in the data.

. gen y=_t-_t0

Model C

. xi: xtpoisson _d i.residence i.timeband i.sex, e(y) re i(idno) irr

Random-effects Poisson regression Number of obs = 12909

Group variable: idno Number of groups = 7860

Random effects u_i ~ Gamma

Wald chi2(6) = 191.31

Log likelihood = -1531.503 Prob > chi2 = 0.0000

------------------------------------------------------------------------------

_d | IRR Std. Err. z P>|z| [95% Conf. Interval]

-------------+----------------------------------------------------------------

_Iresidenc~1 | .3769066 .0476705 -7.71 0.000 .2941544 .4829389

_Itimeband_2 | 1.814057 .3822294 2.83 0.005 1.200326 2.741591

_Itimeband_3 | 1.657297 .3200445 2.62 0.009 1.13507 2.419791

_Itimeband_4 | .6574241 .147549 -1.87 0.062 .4234539 1.020669

_Itimeband_5 | .1630945 .0432013 -6.85 0.000 .097044 .2741006

_Isex_2 | 1.189539 .1454658 1.42 0.156 .9360249 1.511715

y | (exposure)

-------------+----------------------------------------------------------------

/lnalpha | -.6312896 .718153 -2.038844 .7762645

-------------+----------------------------------------------------------------

alpha | .5319054 .3819895 .1301792 2.173338

------------------------------------------------------------------------------

Likelihood-ratio test of alpha=0: chibar2(01) = 2.69 Prob>=chibar2 = 0.051

(i) Comparing the output from Model A and Model C, does the conclusion about the

association between area of residence and the rate of severe pneumonia change

when the clustering in the data is accounted for? Justify your answer. (5 marks)

(ii) Based on Model C, is there statistical evidence of within-child correlation for the

rate of severe pneumonia? (5 marks)

f) To investigate whether maternal HIV infection was a risk factor for child severe

pneumonia, a case-control study was done. There were 266 children with one or more

episodes of severe pneumonia, with a total of 283 episodes during 2009. For each of the

283 episodes, the child was recruited as a case. 283 individually-matched control children

were selected from the DSS population, with one control child for each case. The control

child was matched on age in months, sex, and neighbourhood, and was selected within

one week of the date that the matched case presented at the health clinic – i.e.

“concurrent” sampling of controls was done.

(i) Why were control children selected with concurrent sampling? Are the exposure

odds ratios that will be calculated from this case-control study, estimates of the

disease risk ratio, odds ratio, or rate ratio? Justify your answer. (10 marks)

(ii) The association with maternal HIV was summarised using discordant pairs as

below:

                                 Case mother      Case mother
                                 HIV-negative     HIV-positive

Control mother HIV-negative          232               38

Control mother HIV-positive           11                2

Calculate the exposure odds ratio for the association between maternal HIV and

child severe pneumonia, comparing children with HIV-positive mothers to children

with HIV-negative mothers. (5 marks)
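For a 1:1 matched design, the odds ratio depends only on the discordant pairs. The sketch below works through part (f)(ii) under that standard result, using the error factor for matched pairs from the formulae sheet; the variable names are illustrative and not taken from the exam.

```python
import math

# Discordant pairs from the table in part (f)(ii)
case_exposed_only = 38      # case mother HIV-positive, control mother HIV-negative
control_exposed_only = 11   # case mother HIV-negative, control mother HIV-positive

# Matched odds ratio: ratio of the two discordant-pair counts
matched_or = case_exposed_only / control_exposed_only

# Error factor for a 1:1 matched odds ratio: exp(1.96 * sqrt(1/r + 1/s))
error_factor = math.exp(
    1.96 * math.sqrt(1 / case_exposed_only + 1 / control_exposed_only)
)
or_lower = matched_or / error_factor
or_upper = matched_or * error_factor

print(f"Matched OR = {matched_or:.2f} "
      f"(approx. 95% CI {or_lower:.2f} to {or_upper:.2f})")
```

The 232 and 2 concordant pairs do not enter the calculation.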

(iii) The following conditional logistic regression model was fitted. The variable casecon

is coded 0 for a control and 1 for a case, the variable sex is coded as before (1=male,

2=female), and caseid is the variable that uniquely identifies each case-control pair.

. xi: clogit casecon i.moth_hiv*i.sex, strata(caseid) or

Conditional (fixed-effects) logistic regression Number of obs = 566

LR chi2(2) = 17.20

Prob > chi2 = 0.0002

Log likelihood = -187.5612 Pseudo R2 = 0.0438

------------------------------------------------------------------------------

casecon | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]

-------------+----------------------------------------------------------------

_Imoth_hiv_1 | 6.99995 5.291451 2.57 0.010 1.590922 30.79932

_Isex_2 | (omitted)

_ImotXsex_~2 | .3809551 .3241933 -1.13 0.257 .0718621 2.019517

Based on the model above, in which the variable moth_hiv is coded 0 for HIV-

negative mothers and 1 for HIV-positive mothers, calculate the odds ratio for the

association between maternal HIV and severe child pneumonia, separately for boys

and girls.

(5 marks)
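One way to read the interaction model in part (f)(iii): with sex coded 1 = male as the baseline, the main-effect odds ratio for moth_hiv applies to boys, and the odds ratio for girls is that value multiplied by the interaction term. The sketch below illustrates this reading; the variable names are hypothetical.

```python
# Estimates read from the clogit output in part (f)(iii)
or_moth_hiv = 6.99995       # _Imoth_hiv_1: maternal HIV effect in the baseline sex (boys)
or_interaction = 0.3809551  # _ImotXsex_~2: multiplicative change in that effect for girls

# Stratum-specific odds ratios under the interaction model
or_boys = or_moth_hiv
or_girls = or_moth_hiv * or_interaction

print(f"OR in boys  = {or_boys:.2f}")
print(f"OR in girls = {or_girls:.2f}")
```

Because the pairs are matched on sex, the main effect of sex cannot be estimated, which is why _Isex_2 is shown as omitted.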

Question 2

Health care workers in South Africa are at higher risk of tuberculosis than the general

population and additional measures are needed to protect them. Isoniazid preventive therapy

consists of screening for tuberculosis followed by a course of isoniazid (usually 6-9 months)

for those without active tuberculosis. At an individual level, isoniazid preventive therapy has

been shown to reduce the subsequent risk of tuberculosis. The South African government

wants to know if providing isoniazid preventive therapy to all healthcare workers in the

country would reduce the high incidence of tuberculosis among healthcare workers. Hence,

researchers planned a cluster randomised controlled trial of isoniazid preventive therapy in

healthcare workers in South Africa, with healthcare facilities as the clusters.

a) (i) Give two advantages and two disadvantages of conducting this trial at the cluster

level rather than the individual level. (15 marks)

(ii) Give one advantage and one disadvantage of conducting this trial in a few, large

facilities, rather than many, small facilities. (10 marks)

(iii) This study was planned as an unpaired design. Give one advantage and one

disadvantage of pair-matching as an alternative study design. (10 marks)

b) (i) Define each of intra-cluster correlation and between-cluster variation and explain

how they are related to each other. How would you interpret each of (1) an intra-

cluster correlation coefficient of 0 and (2) an intra-cluster correlation coefficient of

1? (10 marks)

An estimate of the coefficient of variation was not available for this situation. The researchers

think that the best guess of the population mean and standard deviation of health-facility

tuberculosis rates were 5/100py and 1.5/100py, respectively.

(ii) Define the coefficient of variation k, and calculate the coefficient of variation k in

this situation. (5 marks)

(iii) Explain why the coefficient of variation of the true cluster-level means is important

in determining sample size calculations. In particular, why does a larger coefficient

of variation imply a larger number of clusters are required? (5 marks)

The researchers calculated that ten clusters per arm would be sufficient to detect a 50%

reduction in tuberculosis incidence from 5/100py in the control arm to 2.5/100py in the

intervention arm, assuming an unpaired design, k=0.35, 80% power and follow up of 250

healthcare workers for a one year period, after completion of isoniazid preventive therapy in

the intervention clusters and a similar time period for control clusters. (Note py = person-

years.)
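The figure of ten clusters per arm can be checked with a commonly used sample size formula for unmatched cluster randomised trials comparing rates (Hayes and Bennett). That formula does not appear in the exam paper, so its use here is an assumption, as are the two-sided 5% significance level and the function and variable names. The first lines also show the calculation of k asked for in part (b)(ii).

```python
import math

# Part (b)(ii): k = SD of true cluster rates / mean of true cluster rates
mean_rate = 5.0   # per 100 py, researchers' best guess
sd_rate = 1.5     # per 100 py
k_guess = sd_rate / mean_rate

def clusters_per_arm(rate0, rate1, py_per_cluster, k,
                     z_alpha=1.96, z_beta=0.8416):
    """Clusters per arm for an unpaired comparison of rates.

    Assumed formula (not from the exam paper):
    c = 1 + (z_a + z_b)^2 * [(r0 + r1)/y + k^2 (r0^2 + r1^2)] / (r0 - r1)^2
    """
    poisson_part = (rate0 + rate1) / py_per_cluster
    between_cluster_part = k ** 2 * (rate0 ** 2 + rate1 ** 2)
    return 1 + (z_alpha + z_beta) ** 2 * (
        poisson_part + between_cluster_part
    ) / (rate0 - rate1) ** 2

# Design values stated in the question: 5 vs 2.5 per 100 py, 250 py per cluster
c_required = clusters_per_arm(0.05, 0.025, 250, k=0.35)
c_no_clustering = clusters_per_arm(0.05, 0.025, 250, k=0.0)

print(f"k from best guess: {k_guess:.2f}")
print(f"Clusters per arm with k = 0.35: {c_required:.2f} "
      f"-> {math.ceil(c_required)}")
print(f"Clusters per arm with k = 0:    {c_no_clustering:.2f}")
```

Comparing the k = 0.35 and k = 0 results shows how between-cluster variation inflates the number of clusters needed, which is the point of part (b)(iii).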

The researchers conducted the trial and decided to analyse the trial using the cluster level

summaries.

c) (i) What would have been the consequences for the p-value and confidence interval if

this trial had been analysed using individual level data without taking into account

the cluster level? (5 marks)

(ii) Why do you think the researchers decided to analyse this trial using cluster level

summaries instead of a random effects model? (5 marks)

d) The twenty cluster-level rates (per 100 person years) were then analysed by study arm

using an unpaired t-test and the output is given below:

. ttest rates, by(study_arm)

Two-sample t test with equal variances

----------------------------------------------------------------------------------

Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]

-------------+--------------------------------------------------------------------

control | 10 5.714149 .3204526 1.01336 4.989234 6.439063

intervention | 10 3.139567 .0985532 .3116527 2.916624 3.36251

-------------+--------------------------------------------------------------------

combined | 20 4.426858 .3373994 1.508896 3.720673 5.133043

-------------+--------------------------------------------------------------------

diff | 2.574581 .335265 1.870216 3.278947

----------------------------------------------------------------------------------

diff = mean(control) - mean(intervention) t = 7.6792

Ho: diff = 0 degrees of freedom = 18

Ha: diff < 0 Ha: diff != 0 Ha: diff > 0

Pr(T < t) = 1.0000 Pr(|T| > |t|) = 0.0000 Pr(T > t) = 0.0000

(i) State and interpret the rate difference, calculated from the cluster level summaries,

and its confidence interval and p-value. (5 marks)

(ii) Calculate and interpret the rate ratio and an approximate 95% confidence interval

calculated from the cluster level summaries, using the following formula for the

confidence interval of the log rate ratio (logRR):

logRR ± 1.96 * √var(logRR)

where the variance of the log rate ratio (logRR) is given below, in which $s_1$ is the standard deviation in the intervention arm and $s_0$ is the standard deviation in the control arm, $c_1$ is the number of clusters in the intervention arm and $c_0$ is the number of clusters in the control arm, and $\bar{r}_1$ is the mean of the cluster-level rates in the intervention arm and $\bar{r}_0$ is the mean of the cluster-level rates in the control arm. (10 marks)

$$\mathrm{var}(\log RR) = \frac{s_0^2}{c_0\,\bar{r}_0^2} + \frac{s_1^2}{c_1\,\bar{r}_1^2}$$
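A minimal sketch of the calculation in part (d)(ii), assuming the variance of the log rate ratio is the sum over arms of s² / (c × r̄²), where s is the standard deviation of the cluster rates, c the number of clusters and r̄ the mean cluster rate in that arm. The function name and argument names are illustrative; the inputs come from the ttest output above.

```python
import math

def cluster_rate_ratio(mean1, sd1, c1, mean0, sd0, c0, z=1.96):
    """Rate ratio (arm 1 vs arm 0) and approximate CI from cluster summaries."""
    rr = mean1 / mean0
    # var(logRR) = s1^2 / (c1 * r1^2) + s0^2 / (c0 * r0^2)
    var_log_rr = sd1 ** 2 / (c1 * mean1 ** 2) + sd0 ** 2 / (c0 * mean0 ** 2)
    half_width = z * math.sqrt(var_log_rr)
    log_rr = math.log(rr)
    return rr, math.exp(log_rr - half_width), math.exp(log_rr + half_width)

# Intervention arm: mean 3.139567, SD 0.3116527; control arm: mean 5.714149, SD 1.01336
rr, rr_lower, rr_upper = cluster_rate_ratio(
    mean1=3.139567, sd1=0.3116527, c1=10,
    mean0=5.714149, sd0=1.01336, c0=10,
)

print(f"RR = {rr:.2f} (approx. 95% CI {rr_lower:.2f} to {rr_upper:.2f})")
```

With only ten clusters per arm, a t-based multiplier with 18 degrees of freedom would give a slightly wider interval than the 1.96 specified in the question.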

e) The researchers conducted a short, baseline interview with all participants in both

intervention and control clusters, to collect basic demographic information and to test for

HIV infection. The researchers thought that imbalance in any of these variables between

the two study arms may bias the comparison between the two arms.

(i) Why would imbalance be a greater concern in a cluster randomised trial than an

individually randomised trial with the same number of participants? (5 marks)

The researchers calculated ratio residuals at the cluster-level (i.e. the ratio of observed cases

to expected cases for each cluster), adjusting for age, gender and HIV status, and then carried

out an unpaired t-test on these ratio residuals (Stata output given below).

. ttest residuals, by(study_arm)

Two-sample t test with equal variances

----------------------------------------------------------------------------------

Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]

-------------+--------------------------------------------------------------------

control | 10 1.367021 .1557167 .4924194 1.014766 1.719277

intervention | 10 .7440783 .1015229 .3210435 .5144176 .9737389

-------------+--------------------------------------------------------------------

combined | 20 1.05555 .1152823 .515558 .8142612 1.296838

-------------+--------------------------------------------------------------------

diff | .622943 .1858886 .2324055 1.01348

----------------------------------------------------------------------------------

diff = mean(control) - mean(intervention) t = 3.3512

Ho: diff = 0 degrees of freedom = 18

Ha: diff < 0 Ha: diff != 0 Ha: diff > 0

Pr(T < t) = 0.9982 Pr(|T| > |t|) = 0.0036 Pr(T > t) = 0.0018

(ii) Does a mean ratio residual of 0.744 in the intervention arm imply that the

intervention increased or decreased the incidence rate of TB? Explain your answer.

(5 marks)

(iii) Calculate and interpret the adjusted rate ratio and its 95% confidence interval

calculated from the cluster-level summaries. (10 marks)
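For part (e)(iii), one common approach is to take the adjusted rate ratio as the ratio of the mean ratio residuals in the two arms, and to reuse the variance formula from part (d)(ii) with the residual means and standard deviations in place of the cluster rates. The sketch below assumes that approach; the names are illustrative and the inputs come from the ttest output on the residuals.

```python
import math

# Cluster-level ratio residuals (observed / expected cases) from the ttest output
mean_int, sd_int, c_int = 0.7440783, 0.3210435, 10   # intervention arm
mean_con, sd_con, c_con = 1.367021, 0.4924194, 10    # control arm

# Adjusted rate ratio: mean residual (intervention) / mean residual (control)
adj_rr = mean_int / mean_con

# Same variance form as in part (d)(ii), applied to the residuals (an assumption)
var_log_rr = (sd_int ** 2 / (c_int * mean_int ** 2)
              + sd_con ** 2 / (c_con * mean_con ** 2))
half_width = 1.96 * math.sqrt(var_log_rr)

adj_lower = math.exp(math.log(adj_rr) - half_width)
adj_upper = math.exp(math.log(adj_rr) + half_width)

print(f"Adjusted RR = {adj_rr:.2f} "
      f"(approx. 95% CI {adj_lower:.2f} to {adj_upper:.2f})")
```

The point estimate is close to the unadjusted rate ratio, but the interval is wider because the residuals vary more between clusters, relative to their mean, than the crude rates do.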

END OF QUESTIONS

(A formulae sheet and statistical tables follow)

SUMMARY OF STATISTICAL FORMULAE

MSc/Postgraduate Diploma Epidemiology, June 2011 examinations

This summary sheet includes formulae from all EP modules and so will include formulae which some students are not familiar with; students are only expected to be able to apply formulae covered in modules they have studied. Please note however that more basic formulae are not included here and students are expected to know these.

1) Single Sample:

a) Proportion, $p$, $SE(p) = \sqrt{\pi(1-\pi)/n}$, estimated as $SE(p) = \sqrt{p(1-p)/n}$, for confidence intervals

95% confidence interval for $\pi$: $p \pm 1.96 \times SE(p)$

$SE(p) = \sqrt{\pi_0(1-\pi_0)/n}$, for significance tests

Test hypothesis $\pi = \pi_0$: $z = \dfrac{p - \pi_0}{SE(p)}$

b) Mean, $\bar{x}$, $SE(\bar{x}) = \sigma/\sqrt{n}$, estimated as $SE(\bar{x}) = s/\sqrt{n}$

i) Large Sample

95% confidence interval for $\mu$: $\bar{x} \pm 1.96 \times SE(\bar{x})$

Test hypothesis $\mu = \mu_0$: $z = \dfrac{\bar{x} - \mu_0}{SE(\bar{x})}$

ii) Small Sample

95% confidence interval for $\mu$: $\bar{x} \pm t_{\nu,0.05} \times SE(\bar{x})$

where $\nu = n - 1$ and $t_{\nu,0.05}$ is the 2-tailed 5% point of a t-distribution with $\nu$ degrees of freedom (df)

Test hypothesis $\mu = \mu_0$: $t = \dfrac{\bar{x} - \mu_0}{SE(\bar{x})}$, df $= n - 1$

2) Two Independent Samples:

a) Difference in proportions, $p_1 - p_2$ (where $p_1 = r_1/n_1$ and $p_2 = r_2/n_2$)

95% confidence interval for $\pi_1 - \pi_2$: $(p_1 - p_2) \pm 1.96 \times SE(p_1 - p_2)$

where $SE(p_1 - p_2)$ is estimated as: $\sqrt{\dfrac{p_1(1-p_1)}{n_1} + \dfrac{p_2(1-p_2)}{n_2}}$

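Test hypothesis $\pi_1 = \pi_2$: $z = \dfrac{p_1 - p_2}{SE_{pooled}(p_1 - p_2)}$

where $SE_{pooled}(p_1 - p_2)$ is estimated as: $\sqrt{p(1-p)\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}$

and the common proportion, $p = \dfrac{r_1 + r_2}{n_1 + n_2}$

A slightly more conservative test uses a continuity correction, where

$z = \dfrac{|p_1 - p_2| - \left(\dfrac{1}{2n_1} + \dfrac{1}{2n_2}\right)}{SE_{pooled}(p_1 - p_2)}$,

or analyse as a contingency table (see 6 below)

b) Difference in means, $\bar{x}_1 - \bar{x}_2$

i) Large Samples: $SE(\bar{x}_1 - \bar{x}_2) = \sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}$, estimated as $SE(\bar{x}_1 - \bar{x}_2) = \sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}$

95% confidence interval for $\mu_1 - \mu_2$: $(\bar{x}_1 - \bar{x}_2) \pm 1.96 \times SE(\bar{x}_1 - \bar{x}_2)$

Test hypothesis $\mu_1 = \mu_2$: $z = \dfrac{\bar{x}_1 - \bar{x}_2}{SE(\bar{x}_1 - \bar{x}_2)}$

ii) Small Samples (where $\sigma_1 = \sigma_2 = \sigma$)

$SE(\bar{x}_1 - \bar{x}_2) = \sigma\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}$, estimated as $SE(\bar{x}_1 - \bar{x}_2) = s\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}$

where $s = \sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{(n_1 - 1) + (n_2 - 1)}}$

95% confidence interval for $\mu_1 - \mu_2$: $(\bar{x}_1 - \bar{x}_2) \pm t_{\nu,0.05} \times SE(\bar{x}_1 - \bar{x}_2)$

where $\nu = n_1 + n_2 - 2$ and $t_{\nu,0.05}$ is the 2-tailed 5% point of a t-distribution with $\nu$ degrees of freedom (df)

Test hypothesis $\mu_1 = \mu_2$: $t = \dfrac{\bar{x}_1 - \bar{x}_2}{s\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$, df $= n_1 + n_2 - 2$

3) Paired Samples

a. Difference in means, $\bar{x}_1 - \bar{x}_2$

Take differences in paired values; analyse differences using formulae for single sample mean [1(b)].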

b. Difference in proportions, $p_1 - p_2$, $SE(p_1 - p_2) = \dfrac{\sqrt{r + s}}{N}$

95% confidence interval for $\pi_1 - \pi_2$: $(p_1 - p_2) \pm 1.96 \times SE(p_1 - p_2)$

Test hypothesis $\pi_1 = \pi_2$: $X^2_{paired} = \dfrac{(|r - s| - 1)^2}{r + s}$, df = 1

where r and s are the number of discordant pairs, and N is the total number of pairs

4) r x c contingency table

Test hypothesis of no association: $X^2 = \sum \dfrac{(O - E)^2}{E}$, df = (r-1) × (c-1)

Where: O = observed number in a cell

E = expected number in a cell under the null hypothesis

r = number of rows, c = number of columns

5) 2 x c contingency table

Assign score to each column of table

Test hypothesis of no linear trend: $X^2 = \dfrac{(\bar{x}_1 - \bar{x}_2)^2}{s^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}$, df = 1

Where $\bar{x}_1$ = mean score for subjects in row 1 of table

$\bar{x}_2$ = mean score for subjects in row 2 of table

$n_1$ = number of subjects in row 1 of table

$n_2$ = number of subjects in row 2 of table

s = standard deviation of scores combining subjects in rows 1 and 2

6) 2 x 2 contingency table

  a      b      a+b
  c      d      c+d
  a+c    b+d    N

Test hypothesis of no association: $X^2 = \dfrac{N(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)}$, df = 1

A slightly more conservative test uses a continuity correction,

$X^2 = \dfrac{N\left(|ad - bc| - \tfrac{1}{2}N\right)^2}{(a+b)(c+d)(a+c)(b+d)}$, df = 1

7) Mantel-Haenszel χ2 test for several 2 x 2 tables:

Test hypothesis of no association: $X^2 = \dfrac{\left(\sum a_i - \sum E(a_i)\right)^2}{\sum V(a_i)}$, df = 1

where $E(a_i) = \dfrac{(a_i + b_i)(a_i + c_i)}{n_i}$ and $V(a_i) = \dfrac{(a_i + b_i)(c_i + d_i)(a_i + c_i)(b_i + d_i)}{n_i^2 (n_i - 1)}$

Mantel-Haenszel Odds Ratio = $\dfrac{\sum a_i d_i / n_i}{\sum b_i c_i / n_i}$, where the summation is over each of the strata.

8) Linear regression:

Equation of fitted line y = a + b x

95% confidence interval for $\beta$: $b \pm t_{\nu,0.05} \times SE(b)$

where $\nu = n - 2$ and $t_{\nu,0.05}$ is the 2-tailed 5% point of a t-distribution with $\nu$ degrees of freedom (df)

Test hypothesis of no linear association: $t = \dfrac{b}{SE(b)}$, df $= n - 2$

Alternatively, the same test in terms of the correlation coefficient r: $t = r\sqrt{\dfrac{n - 2}{1 - r^2}}$, d.f. $= n - 2$.

9) Likelihood Ratio Test

The likelihood ratio statistic (LRS) for testing for an association is calculated as: $LRS = 2 \times (L_1 - L_0)$, where $L_1$ is the log likelihood of the model with the exposure variable, and $L_0$ is the log likelihood of the model without the exposure variable. The LRS is then referred to the χ2 distribution, with the degrees of freedom equal to the number of parameters that were excluded from the model.

10) Population attributable risk & population attributable risk fraction

r0 is risk (or rate) in unexposed group, r1 is risk (or rate) in exposed group; r is risk (or rate) in total study population, p is proportion of exposed in the population, p1 is the proportion of exposed among cases, RR is risk ratio (rate ratio, odds ratio).

PAR = r – r0, or PAR = p(r1 – r0)

PAF = PAR / r

So PAF = (r – r0) / r, or PAF = p(RR–1) / [p(RR–1) + 1]

Also: PAF = [p1 (RR – 1)] / RR. For matched case control studies, this formula is used with RR the matched odds ratio. This formula is also used when adjusting for confounding, with RR the adjusted rate ratio (or odds ratio for exposure in a case control study) obtained by stratification or regression methods.

11) Risk Ratio and Odds Ratio: Error Factor (EF) for use in calculation of 95% confidence intervals:

              Outcome
Exposure      Yes    No
Yes           a      b
No            c      d

95% confidence limits for the risk ratio, RR, in cohort or cross-sectional studies, are given by (RR/EF) to (RR × EF), where EF is the error factor:

$EF = \exp\left(1.96 \times \sqrt{\frac{1}{a} - \frac{1}{a+b} + \frac{1}{c} - \frac{1}{c+d}}\right)$

95% confidence limits for the odds ratio, OR, for cross-sectional or unmatched case control studies, are given by (OR/EF) to (OR × EF), where EF is the error factor:

$EF = \exp\left(1.96 \times \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}\right)$

For 1:1 matched case control studies, the 95% confidence limits for the odds ratio are given by (OR/EF) to (OR × EF), where OR is the matched odds ratio and

$EF = \exp\left(1.96 \times \sqrt{\frac{1}{r} + \frac{1}{s}}\right)$, where r and s are the numbers of discordant pairs.

12) Rates and the Rate Ratio:

95% confidence limits for a rate, R, are given by (R/EF) to (R × EF), where EF is the error factor:

$EF = \exp\left(1.96 \times \sqrt{\frac{1}{e}}\right)$, where e is the number of events observed.

95% confidence limits for the rate ratio, RR, are given by (RR/EF) to (RR × EF), where EF is the error factor:

$EF = \exp\left(1.96 \times \sqrt{\frac{1}{e_1} + \frac{1}{e_2}}\right)$, where e1 and e2 are the numbers of events in the exposed and unexposed groups.

13) Vaccine efficacy: When the two groups being compared are vaccinated and unvaccinated individuals in a cohort study or randomized trial, vaccine efficacy is defined as: 100 × (1 – RR), where RR is the ratio of the incidence rate in the vaccinated group to the incidence rate in the unvaccinated group.
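The error-factor calculations in items 11 to 13 can be sketched in Python as below. The function names and the 2 × 2 table (a = 30, b = 70, c = 15, d = 85) are hypothetical; the formulae are the usual log-scale standard errors for each measure.

```python
import math


def ef_risk_ratio(a: int, b: int, c: int, d: int) -> float:
    """Error factor for a risk ratio from a 2 x 2 table."""
    return math.exp(1.96 * math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d)))


def ef_odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Error factor for an unmatched odds ratio."""
    return math.exp(1.96 * math.sqrt(1 / a + 1 / b + 1 / c + 1 / d))


def ef_matched_odds_ratio(r: int, s: int) -> float:
    """Error factor for a 1:1 matched odds ratio (r, s = discordant pairs)."""
    return math.exp(1.96 * math.sqrt(1 / r + 1 / s))


def ef_rate(e: int) -> float:
    """Error factor for a single rate based on e events."""
    return math.exp(1.96 * math.sqrt(1 / e))


def ef_rate_ratio(e1: int, e2: int) -> float:
    """Error factor for a rate ratio based on e1 and e2 events."""
    return math.exp(1.96 * math.sqrt(1 / e1 + 1 / e2))


def confidence_limits(estimate: float, ef: float) -> tuple:
    """95% limits: (estimate / EF, estimate x EF)."""
    return estimate / ef, estimate * ef


def vaccine_efficacy(rr: float) -> float:
    """Vaccine efficacy (%) = 100 x (1 - RR)."""
    return 100 * (1 - rr)


# Hypothetical table: 30/100 exposed and 15/100 unexposed have the outcome.
a, b, c, d = 30, 70, 15, 85
rr = (a / (a + b)) / (c / (c + d))
ef = ef_risk_ratio(a, b, c, d)
lower, upper = confidence_limits(rr, ef)
print(f"RR = {rr:.2f}, EF = {ef:.3f}, 95% CI {lower:.2f} to {upper:.2f}")
```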


Table A1 Areas in tail of the standard normal distribution.

Tabulated area: proportion of the area of the standard normal distribution that is above z.

     Second decimal place of z
z    0.00     0.01     0.02     0.03     0.04     0.05     0.06     0.07     0.08     0.09
0.0  0.5000   0.4960   0.4920   0.4880   0.4840   0.4801   0.4761   0.4721   0.4681   0.4641
0.1  0.4602   0.4562   0.4522   0.4483   0.4443   0.4404   0.4364   0.4325   0.4286   0.4247
0.2  0.4207   0.4168   0.4129   0.4090   0.4052   0.4013   0.3974   0.3936   0.3897   0.3859
0.3  0.3821   0.3783   0.3745   0.3707   0.3669   0.3632   0.3594   0.3557   0.3520   0.3483
0.4  0.3446   0.3409   0.3372   0.3336   0.3300   0.3264   0.3228   0.3192   0.3156   0.3121
0.5  0.3085   0.3050   0.3015   0.2981   0.2946   0.2912   0.2877   0.2843   0.2810   0.2776
0.6  0.2743   0.2709   0.2676   0.2643   0.2611   0.2578   0.2546   0.2514   0.2483   0.2451
0.7  0.2420   0.2389   0.2358   0.2327   0.2296   0.2266   0.2236   0.2206   0.2177   0.2148
0.8  0.2119   0.2090   0.2061   0.2033   0.2005   0.1977   0.1949   0.1922   0.1894   0.1867
0.9  0.1841   0.1814   0.1788   0.1762   0.1736   0.1711   0.1685   0.1660   0.1635   0.1611
1.0  0.1587   0.1562   0.1539   0.1515   0.1492   0.1469   0.1446   0.1423   0.1401   0.1379
1.1  0.1357   0.1335   0.1314   0.1292   0.1271   0.1251   0.1230   0.1210   0.1190   0.1170
1.2  0.1151   0.1131   0.1112   0.1093   0.1075   0.1056   0.1038   0.1020   0.1003   0.0985
1.3  0.0968   0.0951   0.0934   0.0918   0.0901   0.0885   0.0869   0.0853   0.0838   0.0823
1.4  0.0808   0.0793   0.0778   0.0764   0.0749   0.0735   0.0721   0.0708   0.0694   0.0681
1.5  0.0668   0.0655   0.0643   0.0630   0.0618   0.0606   0.0594   0.0582   0.0571   0.0559
1.6  0.0548   0.0537   0.0526   0.0516   0.0505   0.0495   0.0485   0.0475   0.0465   0.0455
1.7  0.0446   0.0436   0.0427   0.0418   0.0409   0.0401   0.0392   0.0384   0.0375   0.0367
1.8  0.0359   0.0351   0.0344   0.0336   0.0329   0.0322   0.0314   0.0307   0.0301   0.0294
1.9  0.0287   0.0281   0.0274   0.0268   0.0262   0.0256   0.0250   0.0244   0.0239   0.0233
2.0  0.02275  0.02222  0.02169  0.02118  0.02068  0.02018  0.01970  0.01923  0.01876  0.01831
2.1  0.01786  0.01743  0.01700  0.01659  0.01618  0.01578  0.01539  0.01500  0.01463  0.01426
2.2  0.01390  0.01355  0.01321  0.01287  0.01255  0.01222  0.01191  0.01160  0.01130  0.01101
2.3  0.01072  0.01044  0.01017  0.00990  0.00964  0.00939  0.00914  0.00889  0.00866  0.00842
2.4  0.00820  0.00798  0.00776  0.00755  0.00734  0.00714  0.00695  0.00676  0.00657  0.00639
2.5  0.00621  0.00604  0.00587  0.00570  0.00554  0.00539  0.00523  0.00508  0.00494  0.00480
2.6  0.00466  0.00453  0.00440  0.00427  0.00415  0.00402  0.00391  0.00379  0.00368  0.00357
2.7  0.00347  0.00336  0.00326  0.00317  0.00307  0.00298  0.00289  0.00280  0.00272  0.00264
2.8  0.00256  0.00248  0.00240  0.00233  0.00226  0.00219  0.00212  0.00205  0.00199  0.00193
2.9  0.00187  0.00181  0.00175  0.00169  0.00164  0.00159  0.00154  0.00149  0.00144  0.00139
3.0  0.00135  0.00131  0.00126  0.00122  0.00118  0.00114  0.00111  0.00107  0.00104  0.00100
3.1  0.00097  0.00094  0.00090  0.00087  0.00084  0.00082  0.00079  0.00076  0.00074  0.00071
3.2  0.00069  0.00066  0.00064  0.00062  0.00060  0.00058  0.00056  0.00054  0.00052  0.00050
3.3  0.00048  0.00047  0.00045  0.00043  0.00042  0.00040  0.00039  0.00038  0.00036  0.00035
3.4  0.00034  0.00032  0.00031  0.00030  0.00029  0.00028  0.00027  0.00026  0.00025  0.00024
3.5  0.00023  0.00022  0.00022  0.00021  0.00020  0.00019  0.00019  0.00018  0.00017  0.00017
3.6  0.00016  0.00015  0.00015  0.00014  0.00014  0.00013  0.00013  0.00012  0.00012  0.00011
3.7  0.00011  0.00010  0.00010  0.00010  0.00009  0.00009  0.00008  0.00008  0.00008  0.00008
3.8  0.00007  0.00007  0.00007  0.00006  0.00006  0.00006  0.00006  0.00005  0.00005  0.00005
3.9  0.00005  0.00005  0.00004  0.00004  0.00004  0.00004  0.00004  0.00004  0.00003  0.00003
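Entries of Table A1 can be reproduced from the complementary error function in Python's standard library. This is a minimal sketch; the function name `upper_tail_area` is an illustrative choice, not part of the examination material.

```python
import math


def upper_tail_area(z: float) -> float:
    """Proportion of the standard normal distribution lying above z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Reproduce a few tabulated values (z = row label + second decimal place).
for z in (0.00, 1.00, 1.96, 2.58):
    print(f"z = {z:.2f}: area above z = {upper_tail_area(z):.4f}")
```

Doubling the tabulated area gives the two-sided P value, for example 2 × 0.0250 = 0.05 at z = 1.96.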


Table A2 Percentage points of the t distribution.

        One-sided P value
        0.25    0.1     0.05    0.025   0.01    0.005   0.0025  0.001   0.0005
        Two-sided P value
d.f.    0.5     0.2     0.1     0.05    0.02    0.01    0.005   0.002   0.001
1       1.00    3.08    6.31    12.71   31.82   63.66   127.32  318.31  636.62
2       0.82    1.89    2.92    4.30    6.96    9.92    14.09   22.33   31.60
3       0.76    1.64    2.35    3.18    4.54    5.84    7.45    10.21   12.92
4       0.74    1.53    2.13    2.78    3.75    4.60    5.60    7.17    8.61
5       0.73    1.48    2.02    2.57    3.36    4.03    4.77    5.89    6.87
6       0.72    1.44    1.94    2.45    3.14    3.71    4.32    5.21    5.96
7       0.71    1.42    1.90    2.36    3.00    3.50    4.03    4.78    5.41
8       0.71    1.40    1.86    2.31    2.90    3.36    3.83    4.50    5.04
9       0.70    1.38    1.83    2.26    2.82    3.25    3.69    4.30    4.78
10      0.70    1.37    1.81    2.23    2.76    3.17    3.58    4.14    4.59
11      0.70    1.36    1.80    2.20    2.72    3.11    3.50    4.02    4.44
12      0.70    1.36    1.78    2.18    2.68    3.06    3.43    3.93    4.32
13      0.69    1.35    1.77    2.16    2.65    3.01    3.37    3.85    4.22
14      0.69    1.34    1.76    2.14    2.62    2.98    3.33    3.79    4.14
15      0.69    1.34    1.75    2.13    2.60    2.95    3.29    3.73    4.07
16      0.69    1.34    1.75    2.12    2.58    2.92    3.25    3.69    4.02
17      0.69    1.33    1.74    2.11    2.57    2.90    3.22    3.65    3.96
18      0.69    1.33    1.73    2.10    2.55    2.88    3.20    3.61    3.92
19      0.69    1.33    1.73    2.09    2.54    2.86    3.17    3.58    3.88
20      0.69    1.32    1.72    2.09    2.53    2.84    3.15    3.55    3.85
21      0.69    1.32    1.72    2.08    2.52    2.83    3.14    3.53    3.82
22      0.69    1.32    1.72    2.07    2.51    2.82    3.12    3.50    3.79
23      0.68    1.32    1.71    2.07    2.50    2.81    3.10    3.48    3.77
24      0.68    1.32    1.71    2.06    2.49    2.80    3.09    3.47    3.74
25      0.68    1.32    1.71    2.06    2.48    2.79    3.08    3.45    3.72
26      0.68    1.32    1.71    2.06    2.48    2.78    3.07    3.44    3.71
27      0.68    1.31    1.70    2.05    2.47    2.77    3.06    3.42    3.69
28      0.68    1.31    1.70    2.05    2.47    2.76    3.05    3.41    3.67
29      0.68    1.31    1.70    2.04    2.46    2.76    3.04    3.40    3.66
30      0.68    1.31    1.70    2.04    2.46    2.75    3.03    3.38    3.65
40      0.68    1.30    1.68    2.02    2.42    2.70    2.97    3.31    3.55
60      0.68    1.30    1.67    2.00    2.39    2.66    2.92    3.23    3.46
120     0.68    1.29    1.66    1.98    2.36    2.62    2.86    3.16    3.37
∞       0.67    1.28    1.65    1.96    2.33    2.58    2.81    3.09    3.29


Table A3 Percentage points of the χ2 distribution.

In the comparison of two proportions (2 × 2 χ2 or Mantel–Haenszel χ2 test) or in the assessment of a trend, the percentage points give a two-sided test. A one-sided test may be obtained by halving the P values. (Concepts of one- and two-sidedness do not apply to larger degrees of freedom, as these relate to tests of multiple comparisons.)

        P value
d.f.    0.5     0.25    0.1     0.05    0.025   0.01    0.005   0.001
1       0.45    1.32    2.71    3.84    5.02    6.63    7.88    10.83
2       1.39    2.77    4.61    5.99    7.38    9.21    10.60   13.82
3       2.37    4.11    6.25    7.81    9.35    11.34   12.84   16.27
4       3.36    5.39    7.78    9.49    11.14   13.28   14.86   18.47
5       4.35    6.63    9.24    11.07   12.83   15.09   16.75   20.52
6       5.35    7.84    10.64   12.59   14.45   16.81   18.55   22.46
7       6.35    9.04    12.02   14.07   16.01   18.48   20.28   24.32
8       7.34    10.22   13.36   15.51   17.53   20.09   21.96   26.13
9       8.34    11.39   14.68   16.92   19.02   21.67   23.59   27.88
10      9.34    12.55   15.99   18.31   20.48   23.21   25.19   29.59
11      10.34   13.70   17.28   19.68   21.92   24.73   26.76   31.26
12      11.34   14.85   18.55   21.03   23.34   26.22   28.30   32.91
13      12.34   15.98   19.81   22.36   24.74   27.69   29.82   34.53
14      13.34   17.12   21.06   23.68   26.12   29.14   31.32   36.12
15      14.34   18.25   22.31   25.00   27.49   30.58   32.80   37.70
16      15.34   19.37   23.54   26.30   28.85   32.00   34.27   39.25
17      16.34   20.49   24.77   27.59   30.19   33.41   35.72   40.79
18      17.34   21.60   25.99   28.87   31.53   34.81   37.16   42.31
19      18.34   22.72   27.20   30.14   32.85   36.19   38.58   43.82
20      19.34   23.83   28.41   31.41   34.17   37.57   40.00   45.32
21      20.34   24.93   29.62   32.67   35.48   38.93   41.40   46.80
22      21.34   26.04   30.81   33.92   36.78   40.29   42.80   48.27
23      22.34   27.14   32.01   35.17   38.08   41.64   44.18   49.73
24      23.34   28.24   33.20   36.42   39.36   42.98   45.56   51.18
25      24.34   29.34   34.38   37.65   40.65   44.31   46.93   52.62
26      25.34   30.43   35.56   38.89   41.92   45.64   48.29   54.05
27      26.34   31.53   36.74   40.11   43.19   46.96   49.64   55.48
28      27.34   32.62   37.92   41.34   44.46   48.28   50.99   56.89
29      28.34   33.71   39.09   42.56   45.72   49.59   52.34   58.30
30      29.34   34.80   40.26   43.77   46.98   50.89   53.67   59.70
40      39.34   45.62   51.81   55.76   59.34   63.69   66.77   73.40
50      49.33   56.33   63.17   67.50   71.42   76.15   79.49   86.66
60      59.33   66.98   74.40   79.08   83.30   88.38   91.95   99.61
70      69.33   77.58   85.53   90.53   95.02   100.43  104.22  112.32
80      79.33   88.13   96.58   101.88  106.63  112.33  116.32  124.84
90      89.33   98.65   107.57  113.15  118.14  124.12  128.30  137.21
100     99.33   109.14  118.50  124.34  129.56  135.81  140.17  149.45


EP304: Advanced Statistical Methods in Epidemiology

Examiner’s Report

Question 1

The question tested students’ understanding of survival analysis. Students were required to show knowledge and understanding of: classical (stratified) methods of analysis, including assessment of confounding and interactions; and Poisson regression, including assessment of confounding and use of random effects models to account for multiple episodes. A nested matched case-control study was also included, which required students to understand: the exposure odds ratio as an estimate of the population odds, risk or rate ratio; and methods of analysis, including conditional logistic regression.

(a) This was generally answered well. Part (i) required students to note that repeat events in the same child may not be independent, i.e. there may be within-child correlation in the event rate. Ignoring this could make standard errors too small, and hence confidence intervals too narrow and p-values too small. Part (ii) required noting that very few children had more than one event, so it may not be very important to account properly for repeated events within a child in the analysis. Part (iii) required students to realize that the origin was set at the date of birth, so follow-up time was based on age. Hence, the earliest observed entry of 0 showed that at least one child was followed from birth, and the last observed exit of 5 showed that at least one child remained in the study until they were 5 years old. Part (iv) was a calculation using the number of failures (283) and the total time at risk (5422.591 person-years) to give a rate of 283 / 5422.591 = 0.052 per year. Full marks were also given for 5.2 per 100 person-years or 52 per 1000 person-years. Part (v) needed a description that included the following points: follow-up started on 1 January 2009, and the child presented at the clinic with severe pneumonia on 8 January 2009. The child was then followed up until 28 July 2009 and did not have severe pneumonia when follow-up stopped.

Students who did not get full marks generally lost them because their answers were not thorough enough. For example, some stated that repeat events lead to dependent data but did not add that ignoring this would lead to smaller standard errors, and so to narrower confidence intervals and smaller p-values.
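The rate calculation in part (iv) can be sketched in Python as follows. The confidence interval is an addition for illustration only: it applies the error factor for a rate from the formulae sheet, EF = exp(1.96 × √(1/e)), and was not part of the required answer.

```python
import math

events = 283             # number of failures, from the question
person_years = 5422.591  # total time at risk, from the question

rate = events / person_years

# Illustrative 95% confidence interval using the error factor for a rate.
ef = math.exp(1.96 * math.sqrt(1 / events))
lower, upper = rate / ef, rate * ef

print(f"Rate = {rate:.3f} per year = {100 * rate:.1f} per 100 person-years")
print(f"95% CI: {100 * lower:.1f} to {100 * upper:.1f} per 100 person-years")
```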

(b) Part (i) was answered well. Students were expected to observe strong statistical evidence of an association and, for full marks, to note the following points: the rate of severe pneumonia was much lower among children living in the rural area than among those in the peri-urban area; the confidence intervals for the two rates did not overlap; the rate ratio was 0.40, with a 95% confidence interval from 0.31 to 0.50; and there was very strong evidence that the rate ratio differs from 1 (p<0.001). Part (ii) was answered less well: several students did not realise that the timeband variable referred to age, and almost none noted that it referred to current age. Timeband .5 corresponded to the period when a child was between 6 months and one year old. Part (iii) was answered well, with most students correctly citing as evidence of an association the very different rates in the 5 age groups, most of whose confidence intervals did not overlap.

(c) This section was answered well by most students, with marks generally lost for incomplete answers rather than lack of understanding. There was very little confounding: the crude and adjusted rate ratios were very similar (0.375 compared with 0.395). There was no evidence of effect modification: the age-specific rate ratios were very similar and the test for effect modification gave p=0.98.

(d) Part (i) was answered very well, with students correctly noting that child sex did not confound the association between area of residence and the rate of severe pneumonia: the rate ratio was 0.376 adjusted for child sex compared with 0.375 adjusted only for child age, which is a very small difference. Part (ii) was not answered so well, with students averaging half marks. Model A treats timeband as a categorical variable, while Model B assumes a linear association. The null hypothesis was that the association is linear, so a p-value of <0.001 leads us to reject the null hypothesis and conclude that there is strong evidence that the association between child age and the rate of severe pneumonia does not follow a linear trend (on the log(rate) scale), after adjusting for area of residence and sex. Many students lost a mark for not noting that the test was adjusted for area of residence and sex.

(e) This section was generally answered well. Part (i) required students to note that there was very little change to the conclusion: the point estimate of the rate ratio was very similar, as were the standard errors and the confidence intervals. In part (ii), the p-value given for the likelihood ratio test of alpha was 0.051, and many students had problems interpreting it, stating either “clearly no effect” or “evidence for an effect” rather than describing it as “weak evidence”. This also led some students to say there was clearly an important effect of clustering, when in reality the rate ratios and their standard errors did not change much (as very few children had multiple episodes).

(f) Part (i) was answered poorly by many students. Pneumonia could be a recurrent disease in this study, with children having repeat episodes, so it is appropriate to calculate a rate ratio rather than an odds or risk ratio. With concurrent sampling of controls, the exposure odds ratio estimates the rate ratio. Part (ii) required students to calculate the exposure odds ratio as 38/11 = 3.45. Part (iii) gave an odds ratio of 7.00 for boys and 6.999 × 0.3809 = 2.67 for girls. Parts (ii) and (iii) were generally answered well, but several students seemed to have run out of time and so got zero marks for this section.
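The arithmetic in parts (ii) and (iii) can be sketched in Python as below. Two things are assumptions rather than statements from the report: that 38 and 11 are the numbers of discordant pairs, and that 0.3809 is the interaction term between exposure and sex. The confidence interval is an illustrative addition using the matched-pairs error factor from the formulae sheet.

```python
import math

# Assumed discordant pairs: 38 with only the case exposed, 11 with only the control.
pairs_case_exposed = 38
pairs_control_exposed = 11
matched_or = pairs_case_exposed / pairs_control_exposed

# Illustrative 95% confidence interval, EF = exp(1.96 x sqrt(1/r + 1/s)).
ef = math.exp(1.96 * math.sqrt(1 / pairs_case_exposed + 1 / pairs_control_exposed))
lower, upper = matched_or / ef, matched_or * ef

# Part (iii): odds ratio in boys, and in girls via the (assumed) interaction term.
or_boys = 6.999
interaction_girls = 0.3809
or_girls = or_boys * interaction_girls

print(f"Matched OR = {matched_or:.2f} (95% CI {lower:.2f} to {upper:.2f})")
print(f"OR in boys = {or_boys:.2f}, OR in girls = {or_girls:.2f}")
```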


Question 2

The question tested students’ understanding of cluster randomised trials, including: design issues; intra-cluster correlation and between-cluster variation; sample size issues; and both unadjusted and adjusted cluster-level analysis.

(a) This was generally answered well. However, students often gave standard responses from the course material without considering whether these would apply in the given situation. Any reasonable answers were allowed for parts (i), (ii) and (iii).

For part (i) these included:

Advantages: there is likely to be a mass effect, which the trial would want to capture; it may be logistically easier to deliver the intervention in fewer healthcare facilities, rather than in all healthcare facilities as an individually randomised trial would require; and it avoids the contamination that would occur in an individually randomised trial, due to intervention and control people working together.

Disadvantages: cluster randomised trials need a larger sample size; there is a higher possibility of baseline imbalance between study arms; and the analysis is more complicated.

For part (ii) these included:

Advantages: it is logistically easier to do the study in a few, large clusters.

Disadvantages: a larger sample size is likely to be required with a few large clusters than with many small clusters.

For part (iii) these included:

Advantages: in a pair matched design, k is the between-cluster variation within matched pairs, which may be substantially smaller than the unmatched k, giving a smaller sample size.

Disadvantages: the number of effective observations is halved, so if the matching is not very effective, a larger sample size will be required; and if one cluster drops out of a study (e.g. a hospital is closed down), then both clusters in the pair are lost from the study.

(b) Part (i) was answered by almost all students, but with high variability in the quality of the answers. A good answer would consist of the following points:

Intra-cluster correlation is present if observations from two individuals in the same cluster are more alike than observations from two individuals in different clusters. Between-cluster variation is present if the true mean of the outcome variable differs among clusters.

If individuals within a cluster are no more alike than they are to individuals in a different cluster, then there will be no variation in the true mean value of the outcome variable between clusters. Conversely, if individuals in the same cluster are more alike than they are to individuals in a different cluster, then there will be between-cluster variation in the true mean of the outcome variable.

An intra-cluster correlation coefficient of 0 means that individuals in the same cluster are no more alike than they are to individuals in a different cluster. An intra-cluster correlation coefficient of 1 means that all individuals in the same cluster have the same value for the outcome variable, and this happens for all clusters.

Part (ii) required students to know that the formula was CoV (or “k”) = SD/mean and then to calculate this as 1.5/5 = 0.3. However, this was answered poorly or not at all by many students. Many gave the coefficient of variation as mean/SD instead of SD/mean.

Part (iii) was answered well by most students: if the between-cluster variation in the true TB incidence rate is large, then it will be harder to detect a difference between study arms against this “background noise”, so a larger sample size will be required to get the necessary power.
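The coefficient of variation in part (ii) is simply the between-cluster standard deviation divided by the mean. The sketch below shows the calculation from summary values and from a list of cluster-level rates; the three cluster rates are hypothetical and were chosen only so that their mean is 5 and their standard deviation is 1.5.

```python
import statistics


def coefficient_of_variation(sd: float, mean: float) -> float:
    """Between-cluster coefficient of variation, k = SD / mean."""
    return sd / mean


# Values quoted in part (ii): SD = 1.5, mean = 5.
k = coefficient_of_variation(1.5, 5.0)

# The same calculation from hypothetical cluster-level rates.
cluster_rates = [3.5, 5.0, 6.5]
k_from_rates = coefficient_of_variation(
    statistics.stdev(cluster_rates), statistics.mean(cluster_rates)
)

print(f"k = {k:.2f}, k from cluster rates = {k_from_rates:.2f}")
```

Inverting the ratio (mean/SD), the common error noted above, would give 3.33 rather than 0.3.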

(c) This section was answered well by most students. For part (i), students correctly noted that p-values will be too small and confidence intervals too narrow, so the evidence for an effect will be exaggerated. For part (ii), they noted that random effects models tend not to give reliable results with fewer than around 15 clusters per arm (compared with only 10 clusters per arm in this study).

(d) Part (i) was answered well. Full marks were given for noting all of the following points: the rate difference is 2.57, i.e. the rate in the intervention arm is 2.57/100py less than in the control arm; the 95% CI is 1.87 to 3.28, i.e. we are 95% confident that the true rate in the intervention arm is between 1.87/100py and 3.28/100py less than in the control arm; and the p-value of <0.001 suggests very strong evidence that this difference is not due to sampling variation.
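The interval quoted in part (i) is consistent with an unpaired t-test on cluster-level rates. The Python sketch below makes that assumption explicit: it takes the arm means (3.140 and 5.714 per 100py) and standard deviations (0.3117 and 1.013) that appear in the part (ii) calculation, treats them as summaries of the 10 cluster rates in each arm, and uses the 5% two-sided t value for 18 df from Table A2. How the output in the question was actually produced is not stated in the report.

```python
import math

# Cluster-level summaries (assumed to be mean and SD of the 10 cluster rates per arm).
mean_int, sd_int, clusters_int = 3.140, 0.3117, 10   # intervention arm
mean_con, sd_con, clusters_con = 5.714, 1.013, 10    # control arm

rate_difference = mean_con - mean_int

# Pooled variance and standard error of the difference between arm means.
df = clusters_int + clusters_con - 2
pooled_var = ((clusters_int - 1) * sd_int ** 2 + (clusters_con - 1) * sd_con ** 2) / df
se_difference = math.sqrt(pooled_var * (1 / clusters_int + 1 / clusters_con))

t_critical = 2.10  # Table A2: two-sided P = 0.05, 18 df
lower = rate_difference - t_critical * se_difference
upper = rate_difference + t_critical * se_difference

print(f"Rate difference = {rate_difference:.2f} per 100py "
      f"(95% CI {lower:.2f} to {upper:.2f})")
```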

However, part (ii) was answered poorly by most students. The necessary formulae were given, yet many students struggled with the calculations required, even if they correctly identified all the numbers to go into the formula. This may have been due to exam stress. The correct calculation was as follows:

The rate ratio is 3.140/5.714 = 0.550 and so the rate of tuberculosis in the intervention arm was 45.0% less than in the control arm. The variance of the log(rate ratio) was

$V = \frac{s_1^{2}}{c_1 r_1^{2}} + \frac{s_0^{2}}{c_0 r_0^{2}} = \frac{0.3117^{2}}{10 \times 3.140^{2}} + \frac{1.013^{2}}{10 \times 5.714^{2}} = 0.004128$

$EF = \exp(1.96 \times \sqrt{V}) = \exp(1.96 \times \sqrt{0.004128}) = 1.134$

RR/EF = 0.550/1.134 = 0.48.

RR × EF = 0.550 × 1.134 = 0.62.

Hence, we are 95% confident that the true rate ratio lies between 0.48 and 0.62.
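The calculation above can be reproduced with a short Python sketch. The function name and argument names are illustrative; the interpretation of the inputs (arm-level mean rate r, standard deviation s of the cluster rates, and number of clusters c, with 1 = intervention and 0 = control) is an assumption inferred from the numbers used.

```python
import math


def cluster_rate_ratio_ci(r1, s1, c1, r0, s0, c0):
    """Rate ratio with 95% limits from cluster-level summaries.

    V = s1^2 / (c1 * r1^2) + s0^2 / (c0 * r0^2) is the variance of the
    log(rate ratio); the error factor is EF = exp(1.96 * sqrt(V)).
    """
    rate_ratio = r1 / r0
    variance = s1 ** 2 / (c1 * r1 ** 2) + s0 ** 2 / (c0 * r0 ** 2)
    error_factor = math.exp(1.96 * math.sqrt(variance))
    return rate_ratio, variance, error_factor, rate_ratio / error_factor, rate_ratio * error_factor


rr, v, ef, lower, upper = cluster_rate_ratio_ci(
    r1=3.140, s1=0.3117, c1=10,   # intervention arm
    r0=5.714, s0=1.013, c0=10,    # control arm
)
print(f"RR = {rr:.3f}, V = {v:.6f}, EF = {ef:.3f}")
print(f"95% CI: {lower:.2f} to {upper:.2f}")
```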

(e) This section was answered poorly by most students (average mark of 6.7 out of 20). This is some of the most advanced material in the course, so perhaps this is not surprising.

Part (i) required students to note that there are only ten clusters in each arm, so the likelihood of imbalance arising by chance in at least one variable related to the outcome (e.g. gender, age, HIV status, prior history of TB) is much higher than in an individually randomised trial with 2,500 participants per study arm.

In part (ii), a mean ratio residual of 0.744 implies that the intervention has decreased the number of cases. The ratio residual is the ratio of observed to expected cases, where the expected cases are calculated assuming no intervention effect but taking account of the characteristics of the study participants in the cluster. Hence a mean ratio residual of 0.744 implies that 26% fewer cases were observed in the intervention arm than expected, given the characteristics of those in the intervention arm, and so the difference may be attributed to the intervention (or to imbalance in other variables that were not adjusted for).

Part (iii) required students to remember that they could use the same parts of the output in the same formula as they did for (d)(ii). The correct calculation is as follows:

The adjusted rate ratio is 0.7441/1.367 = 0.544 and so the rate of tuberculosis in the intervention arm was 45.6% less than in the control arm, adjusting for age, gender and HIV status. The variance of the log(rate ratio) was

$V = \frac{s_1^{2}}{c_1 r_1^{2}} + \frac{s_0^{2}}{c_0 r_0^{2}} = \frac{0.3210^{2}}{10 \times 0.7441^{2}} + \frac{0.4924^{2}}{10 \times 1.367^{2}} = 0.03158$

$EF = \exp(1.96 \times \sqrt{V}) = \exp(1.96 \times \sqrt{0.03158}) = 1.417$

RR/EF = 0.544/1.417 = 0.38.

RR × EF = 0.544 × 1.417 = 0.77.

Hence, we are 95% confident that the true adjusted rate ratio lies between 0.38 and 0.77.
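The adjusted analysis can be checked numerically with the Python sketch below. It assumes that 0.7441 and 1.367 are the mean ratio residuals in the intervention and control arms, that 0.3210 and 0.4924 are the standard deviations of the cluster-level ratio residuals, and that each arm has 10 clusters; the variable names are illustrative.

```python
import math

# Assumed cluster-level summaries of the ratio residuals (observed / expected cases).
mean_resid_int, sd_resid_int, clusters_int = 0.7441, 0.3210, 10   # intervention arm
mean_resid_con, sd_resid_con, clusters_con = 1.367, 0.4924, 10    # control arm

adjusted_rr = mean_resid_int / mean_resid_con

# Variance of the log(adjusted rate ratio), same form as the unadjusted analysis.
variance = (sd_resid_int ** 2 / (clusters_int * mean_resid_int ** 2)
            + sd_resid_con ** 2 / (clusters_con * mean_resid_con ** 2))
error_factor = math.exp(1.96 * math.sqrt(variance))
lower, upper = adjusted_rr / error_factor, adjusted_rr * error_factor
percent_reduction = 100 * (1 - adjusted_rr)

print(f"Adjusted RR = {adjusted_rr:.3f} ({percent_reduction:.1f}% lower rate)")
print(f"V = {variance:.5f}, EF = {error_factor:.3f}")
print(f"95% CI: {lower:.2f} to {upper:.2f}")
```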