epidemiological analysis - kupublicifsv.sund.ku.dk/~nk/epif14/epidemiological analysis...example...

Epidemiological analysis PhD-course in epidemiology

Lau Caspar Thygesen

Associate professor, PhD

25th February 2014

Age standardization

• Incidence and prevalence are strongly age-dependent

– Risks rising (e.g. chronic diseases) or declining (e.g. measles) with age

• Comparisons between populations and over time may be very misleading

• A single age-independent index representing a set of age-specific rates may be more appropriate

Mortality in Denmark and Greenland, men, 1975

Please interpret this table.

Direct standardization

IR (Denmark, standardized to the Greenlandic age distribution)

= 0.016 × 12.2 + 0.076 × 0.7 + 0.268 × 0.160 + 0.506 × 1.4 + 0.110 × 11.2 + 0.024 × 66.5

= 3.8
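
The direct standardization above is a weighted average of age-specific rates. A minimal sketch reproducing the slide's arithmetic follows; the weights and rates are copied from the formula, while the function name `direct_standardized_rate` is only an illustrative choice.

```python
# Age-distribution weights (Greenland) and age-specific rates (Denmark),
# in the order they appear in the slide's formula.
weights = [0.016, 0.076, 0.268, 0.506, 0.110, 0.024]
rates = [12.2, 0.7, 0.160, 1.4, 11.2, 66.5]


def direct_standardized_rate(weights, rates):
    """Weighted average of age-specific rates using standard-population weights."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("standard-population weights should sum to 1")
    return sum(w * r for w, r in zip(weights, rates))


standardized = direct_standardized_rate(weights, rates)
print(round(standardized, 1))  # 3.8
```

The same function can be reused with any other standard population by swapping the weights.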

Indirect standardization

Example

• Trend study of lung cancer incidence among women

• Denmark

• 1943-2010

[Figure: Lung Cancer Denmark Women. x-axis: year, 1943-2009; y-axis: 0-9; series: rateCrude]

[Figure: Lung Cancer Denmark Women. x-axis: year, 1943-2009; y-axis: 0-9; series: rateCrude, segi, scand]

Example 2

• Incidence of multiple sclerosis

• Denmark

• 1950-2004

• European Standard Population

Example indirect standardization

• 19,185 subjects (3,817 women) who attended outpatient clinics for alcohol abusers

• Copenhagen

• 1952-1992

• Compare the incidence of heart disease with the incidence rate in the greater Copenhagen area
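
Indirect standardization of this kind compares the observed number of cases in the cohort with the number expected if the reference population's age-specific rates applied to the cohort's person-years. The sketch below shows that calculation; every number in it is hypothetical and is not taken from the alcohol-clinic study, and the function names are illustrative.

```python
def expected_cases(person_years, reference_rates):
    """Cases expected if the reference rates applied to the cohort's person-years."""
    return sum(py * rate for py, rate in zip(person_years, reference_rates))


def standardized_incidence_ratio(observed, person_years, reference_rates):
    """Observed / expected cases (SIR, or SMR when the outcome is death)."""
    return observed / expected_cases(person_years, reference_rates)


# Hypothetical cohort, three age strata (NOT data from the study above).
person_years = [10_000, 20_000, 5_000]      # cohort person-years per stratum
reference_rates = [0.001, 0.004, 0.010]     # reference cases per person-year
observed = 210                              # cases observed in the cohort

expected = expected_cases(person_years, reference_rates)  # 10 + 80 + 50
sir = standardized_incidence_ratio(observed, person_years, reference_rates)
print(round(expected), round(sir, 2))
```

A ratio above 1 means more cases were observed than the reference rates would predict.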

Problems

• Direct standardisation can produce unreliable estimates when the calculations are based on small numbers

• Indirect standardisations from different populations cannot be directly compared – only compared to the standard

Compared to regression methods

• Regression-based methods are available but are rarely applied in practice

• When individual data are available (presence / absence of disease, age and sex), a logistic regression can be used to estimate the standardized rate

• The main advantage is that it allows adjustment by continuous variables in addition to categorical variables

Missing data

• What does missing mean?

• The pattern of missingness (nomenclature)

– How and why is it missing?

• Methods for handling

Missing values

• Common in research

– Nonresponse

– Loss to follow-up

– Lack of overlap between linked data sets (not so common)

What is item nonresponse?

• Unit Nonresponse vs. Item Nonresponse

Unit nonresponse (ID 458 gave no answers at all):

ID   Q1  Q2  Q3
456  1   1   2
457  4   2   1
458  ?   ?   ?
459  3   2   1

Item nonresponse (single answers are missing):

ID   Q1  Q2  Q3
456  1   1   2
457  4   ?   1
458  ?   2   1
459  3   2   ?

Unit Nonresponse Examples

• Person who is not at home

• Person who does not pick up the phone

• Person who hangs up on you

• Rat that dies before the study

• The country you could not get data on

• etc.

Item Nonresponse

• “I Don’t Know”

• Refusals to respond

• Questions left blank

• Failed measurement

• etc.

Best way to deal with Missing Data is not to have any

Minimizing Unit Nonresponse

• Call back if not home

• Refusal conversion

• Don’t mess up

• Clear and understandable questionnaire

• Polite request

• Incentives

Minimizing Item Nonresponse

• Well written questions

• Minimize misunderstandings

– cross-cultural example

– Standardized vs. non-standardized

• Minimize skip patterns

What kind of missing data should be modeled?

• If an item is missing from your dataset but you suspect that it has a true value

• I don’t know might simply mean I don’t know

– Don’t model it as if there was a true value

• Dead people (attrition)

The pattern of missingness (nomenclature)

• Ignorable

– MCAR - Missing Completely at Random

– MAR - Missing at Random

• Non-ignorable

– NMAR - Not Missing at Random

Missing completely at random

Missing Completely at Random: if the data are missing completely at random, then missing values cannot be predicted any better using the information in the data, observed or not

• Cause of missingness is a completely random process (like a coin flip)

• Cause uncorrelated with variables of interest

– Example: parents move

• No bias if cause omitted

• In the unlikely event that the process is missing completely at random, inferences based on complete cases are unbiased but inefficient, because we have lost some cases

Missing at random

• Missingness may be related to measured variables

• But no residual relationship with unmeasured variables

• No bias if you control for measured variables

• For example, if the highly educated are more likely to participate in a survey, then the process is missing at random as long as we know the educational level of all persons

• If data are missing at random, then inferences based on complete cases can be biased (unless the analysis controls for the measured variables that predict missingness) and are inefficient

Missing not at random

Non-Ignorable / NMAR: if the probability that a cell is missing depends on the unobserved value of the missing value

For example, individuals’ responses to income questions, where high-income people are more likely to refuse to answer survey questions about income, and other variables in the data set cannot predict which respondents have high income

If your missing data is non-ignorable, then inferences based on complete cases will be biased and inefficient
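
The three mechanisms can be compared in a small simulation. The sketch below is purely illustrative: the data-generating model, the missingness probabilities (0.5, 0.8, 0.2) and the seed are all assumptions, not values from the course. It estimates the mean of Y from complete cases under MCAR, MAR and NMAR, and shows that under MAR the measured variable X can be used to correct the estimate, here with inverse-probability weights whose true values are known only because the data are simulated.

```python
import random
import statistics

rng = random.Random(2014)  # fixed seed so the sketch is reproducible

# Simulated population: x is always observed, y is the variable with missing values.
rows = []
for _ in range(20000):
    x = rng.gauss(0, 1)
    y = x + rng.gauss(0, 1)  # the true mean of y is 0
    rows.append((x, y))


def observed_rows(p_missing):
    """Keep each row with probability 1 - p_missing(x, y)."""
    return [(x, y) for x, y in rows if rng.random() >= p_missing(x, y)]


def p_mar(x, y):
    """MAR: missingness depends only on the measured variable x."""
    return 0.8 if x > 0 else 0.2


mcar_rows = observed_rows(lambda x, y: 0.5)                     # coin flip
mar_rows = observed_rows(p_mar)
mnar_rows = observed_rows(lambda x, y: 0.8 if y > 0 else 0.2)   # depends on y itself

mcar_mean = statistics.fmean([y for _, y in mcar_rows])
mar_mean = statistics.fmean([y for _, y in mar_rows])
mnar_mean = statistics.fmean([y for _, y in mnar_rows])

# Under MAR the bias can be removed by using x, here via weights 1 / P(observed | x).
w = [1 / (1 - p_mar(x, y)) for x, y in mar_rows]
mar_weighted = sum(wi * y for wi, (_, y) in zip(w, mar_rows)) / sum(w)

print(f"MCAR complete cases: {mcar_mean:.2f}")
print(f"MAR complete cases:  {mar_mean:.2f}")
print(f"MAR weighted by x:   {mar_weighted:.2f}")
print(f"NMAR complete cases: {mnar_mean:.2f}")
```

In this setup the MCAR and weighted MAR estimates land close to the true mean of 0, while the unadjusted MAR and the NMAR complete-case means are clearly too low; no adjustment based on X alone repairs the NMAR case.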

Classical Missing Data Treatments

• Whatever you do, you are doing something

– Case Deletion

• Listwise (complete case analysis)

• Pairwise (available case analysis)

– Indicator variable (dummy variable)

– Single Imputation

• (Unconditional) Mean Imputation

• Conditional Mean Imputation (expected value)

– Weighting

Listwise Deletion

• Excludes the whole case

• Default in most software

• Works if mechanism is MCAR and if pattern and sample size allow (need to have enough complete cases)

• Can be biased

Multi-Item Pairwise Deletion

• An option for using all available information (correlation/covariance matrices)

• Different calculations may be based on different populations

• Very unpredictable bias
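
The difference between the two deletion strategies is easy to see by counting which cases each one keeps. The toy data set below is hypothetical (variable names `y`, `x1`, `x2` and all values are made up for illustration); `None` marks a missing value.

```python
# Hypothetical toy data; None marks a missing value.
rows = [
    {"y": 1.0, "x1": 2.0, "x2": None},
    {"y": 2.0, "x1": None, "x2": 1.0},
    {"y": 3.0, "x1": 4.0, "x2": 5.0},
    {"y": 4.0, "x1": 6.0, "x2": 7.0},
    {"y": None, "x1": 8.0, "x2": 9.0},
]


def listwise(rows):
    """Complete case analysis: keep only rows with no missing value at all."""
    return [r for r in rows if all(v is not None for v in r.values())]


def pairwise(rows, a, b):
    """Available case analysis: keep rows where both a and b are observed."""
    return [r for r in rows if r[a] is not None and r[b] is not None]


n_listwise = len(listwise(rows))
pair_counts = {
    (a, b): len(pairwise(rows, a, b))
    for a, b in [("y", "x1"), ("y", "x2"), ("x1", "x2")]
}
print(n_listwise, pair_counts)
```

Each pairwise statistic here rests on three cases, but on a different set of three each time, which is why different entries of a pairwise correlation matrix can describe different populations.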

Indicator method

• For each variable with missing values, create a missing-value indicator to accompany the variable in all analyses

• Assumes MCAR

• Even if the missing-value stratum is just a random sample of all subjects, that stratum will yield a confounded estimate of the exposure effect

Mean imputation

• Technique

– Calculate mean over cases that have values for Y

– Impute this mean where Y is missing

– Ditto for X1, X2, etc.

• Problems

– ignores relationships among X and Y

• underestimates covariances

(Unconditional) Mean Imputation

Scatterplots are from Joe Schafer’s website

Mean imputation

• Standard errors too low

• CI difficult to calculate
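
Why standard errors come out too low after mean imputation can be seen on a handful of values. The numbers below are hypothetical; the sketch fills the gaps with the observed mean and compares the sample variance before and after.

```python
import statistics

# Hypothetical measurements; None marks a missing value.
y = [2.0, 4.0, None, 6.0, None, 8.0]

observed = [v for v in y if v is not None]
mean_observed = statistics.fmean(observed)

# (Unconditional) mean imputation: every gap gets the observed mean.
imputed = [mean_observed if v is None else v for v in y]

var_observed = statistics.variance(observed)  # based on 4 real values
var_imputed = statistics.variance(imputed)    # based on 6 values, 2 of them made up

print(mean_observed, round(var_observed, 2), round(var_imputed, 2))
```

The mean is unchanged, but the imputed values add no spread while the sample size grows, so the variance (and any standard error computed from it) shrinks.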

Conditional mean imputation

• Technique & implicit models

– If Y is missing

• impute mean of cases with similar values for X1, X2

– Y = b0 + X1 b1 + X2 b2

– Likewise, if X2 is missing

• impute mean of cases with similar values for X1, Y

– X2 = g0 + X1 g1 + Y g2

– If both Y and X2 are missing

• impute means of cases with similar values for X1

– Y = d0 + X1 d1

– X2 = f0 + X1 f1

• Problem

– Ignores random components (no e)

→ Underestimates variances, se’s

Imputation of Expected Value

• Good for creating expected values

• Bad for multivariate analysis

– Decreases standard errors

– Creates overconfident outcomes

– Increases probability of Type I error

Problem with single imputation

• Underestimates se’s!

• Treats imputed values like observed values

– when they are actually less certain

• Ignores imputation variation

Imputation variation

• Sampling variation

– If you take a different sample

• you get different parameter estimates

– Standard errors reflect this

– One way to estimate sampling variation

• measure variation across multiple samples

• called “bootstrapping”

• Imputation variation

– If you impute different values

• you get different parameter estimates

– Standard errors should reflect this, too

– One way to estimate imputation variation

• measure variation across multiple imputed data sets

• called “multiple imputation”

Multiple Imputation

• Models both expected value and uncertainty

• Using the missing data model you specify, it simulates and imputes the missing values multiple times, creating M complete datasets

– (M=5 is usually OK. It is a good idea to simulate more)

• Analyze each dataset independently

• Combine the results to get unbiased estimates that reflect both uncertainty and expectation
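
The combining step is usually done with Rubin's rules: the pooled estimate is the average of the M estimates, and its variance adds the average within-imputation variance to the between-imputation variance inflated by (1 + 1/M). The sketch below assumes that rule; the five estimates and variances are hypothetical, and `pool_rubin` is an illustrative name rather than a function from any package.

```python
import math
import statistics


def pool_rubin(estimates, variances):
    """Pool one parameter across M imputed data sets with Rubin's rules."""
    m = len(estimates)
    q_bar = statistics.fmean(estimates)          # pooled point estimate
    within = statistics.fmean(variances)         # average sampling variance
    between = statistics.variance(estimates)     # variance across imputations
    total = within + (1 + 1 / m) * between       # total variance
    return q_bar, math.sqrt(total)


# Hypothetical results for one coefficient from M = 5 imputed data sets.
estimates = [1.0, 1.2, 0.8, 1.1, 0.9]
variances = [0.04, 0.04, 0.04, 0.04, 0.04]      # squared standard errors

pooled, pooled_se = pool_rubin(estimates, variances)
print(round(pooled, 2), round(pooled_se, 3))
```

The pooled standard error is larger than the standard error from any single imputed data set (0.2 here), which is exactly the imputation variation that single imputation ignores.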

Example

Multiple Imputation Simple Procedure

1. Impute using PROC MI

2. Do analysis: PROC REG, LOGISTIC, etc., using by _imputation_; in the procedure

3. Combine results using PROC MIANALYZE

PROC MI

• Typical syntax:

proc mi data=bmx out=impdat seed=33155;

var bmxbmi bmxht bmxwt bmxarmc bmxarml;

run;

• data= input data set: 1 copy of the data with missing values

• out= output data set: 5 copies of the data with imputed values (the imputed values differ across copies)

• seed= random seed; keep it the same to reproduce your results

• var variables with missing values that you need imputed, the other variables in the model, and any variables that may be helpful for imputation

PROC MI Sample Output

PROC MI Options

• nimpute=5 (number of imputations, default=5; nimpute=0 gives the missing patterns)

• minimum=0 0 0 0 and maximum=1 1 1 90 (set min & max; sometimes doesn’t converge as well)

• round=1 1 1 0.01 (round-off option)

Output dataset

Regression

• Fit your model as if data had no missing values, using by _imputation_;

• proc reg data=impdat outest=parmcov covout;

model bmxbmi=bmxht bmxwt bmxarmc bmxarml;

by _imputation_;

run;

• You’ll get nimpute (usually 5) sets of output

• Estimates, covariances, errors will be combined in MIANALYZE

• Need to generate parameter estimates and covariance data set (varies by procedure)

Parameter Est. & Covariance Matrix

• proc logistic data=impdat descending;
    model bmxbmi=bmxht bmxwt bmxarmc bmxarml /covb;
    by _imputation_;
    ods output ParameterEstimates=parmsdat CovB=covbdat;
  run;

• proc mixed data=impdat;
    model bmxbmi=bmxht bmxwt bmxarmc bmxarml /solution covb;
    by _imputation_;
    ods output covparms=parmcov;
  run;

Parameter Est. & Covariance Matrix

• proc genmod data=impdat;
    model bmxbmi=bmxht bmxwt bmxarmc bmxarml /covb;
    by _imputation_;
    ods output ParameterEstimates=parmsdat CovB=covbdat;
  run;

PROC MIANALYZE

• Syntax depends on what procedure you used in the previous step:

• proc mianalyze data=parmcov;
  (or) proc mianalyze parms=parmsdat covb=covbdat;
  (or) proc mianalyze parms=parmsdat xpxi=xpxidat;

  (then type this:)
  modeleffects intercept bmxht bmxwt bmxarmc bmxarml;
  run;

• Note the “var” statement is now “modeleffects”

• Note that the dependent variable is omitted

PROC MIANALYZE Output

STATA

*preparing dataset for multiple imputation

mi query

mi set mlong

mi describe, detail

mi register imputed total

set seed 29390

mi impute mvn total = i.smoking i.isced4 i.samliv3 i.s57a_ i.alder4 i.gender, add(20) force

mi describe, detail

*rounding the imputed binary values to the nearest integer

*replace bingedrinking = 0 if bingedrinking <0.5

*replace bingedrinking = 1 if bingedrinking >0.5

*replace change_new = round(change_new)

*examination of imputations: comparing main descriptive statistics from some imputations to those from the observed data

mi xeq 0 1 20: summarize total

mi estimate: xtmixed total i.gender group##month || username:, mle

mi estimate: mean total, over(sex group month)

Weighted regression

• Suppose that a national survey sampled 2000 subjects, 1000 men and 1000 women

• Responses were obtained from 500 men and 750 women

• If there are large differences between men and women, a simple average of the 1250 observations will be a distorted representation of the population mean

• By down-weighting women and up-weighting men we could obtain a more accurate picture of the population
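
The survey example can be worked through numerically. The sample and response counts below come from the slide; the group means (80 for men, 70 for women) are hypothetical and chosen only to make the distortion visible. Each respondent is weighted by sampled / responded for his or her group.

```python
sampled = {"men": 1000, "women": 1000}       # from the slide
responded = {"men": 500, "women": 750}       # from the slide
group_mean = {"men": 80.0, "women": 70.0}    # hypothetical outcome means

# Nonresponse weight: how many sampled subjects each respondent stands for.
weights = {g: sampled[g] / responded[g] for g in sampled}

# Simple average over the 1250 respondents (women are over-represented).
unweighted = (
    sum(responded[g] * group_mean[g] for g in sampled) / sum(responded.values())
)

# Weighted average restores the 1000 / 1000 balance of the original sample.
weighted = (
    sum(weights[g] * responded[g] * group_mean[g] for g in sampled)
    / sum(weights[g] * responded[g] for g in sampled)
)

print(weights, round(unweighted, 2), round(weighted, 2))
```

With these assumed means the unweighted average is 74 while the weighted average is 75, the value the full sample would have given if response depended only on sex.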

Values not missing at random (NMAR)

• Probability that values are missing depends on the missing values themselves

• e.g., the probability that weight Y is missing

– is higher for the overweight (depends on Y)

– is higher for women (depends on X1)

• and sometimes X1 is missing, too.

• Methods available – not today!
