to add make it clearer that excluding variables from a model because it is not “predictive”...


Upload: iris-cross

Post on 13-Dec-2015


TRANSCRIPT

Page 1: To Add Make it clearer that excluding variables from a model because it is not “predictive” removes all meaning from a CI since this is infinite repetitions

To Add

Make it clearer that excluding variables from a model because it is not “predictive” removes all meaning from a CI since this is infinite repetitions

Talk about stats being based on sampling variability – assumes sample is a random sample of some super-population (even if narrowly defined), but they are not a random sample, they self-select, so we couldn’t have infinite samples

Page 2:

Random Error I: p-values, confidence intervals, hypothesis testing, etc.

Matthew Fox

Advanced Epidemiology

Page 3:

Do you like/use p-values?

Page 4:

What is a relative risk?

What is a p-value?

Page 5:
Page 6:
Page 7:

Table 1 of a randomized trial of asbestos and LC

Factor        Asbestos   No Asbestos   p-value
Female        10%        25%           0.032
Smoking       60%        40%           0.351
>60 yrs       5%         7%            0.832
HBP           25%        24%           0.765
Alcohol use   37%        45%           0.152

Page 8:

Which result is more precise?

RR 2.0 (95% CI: 1.0 – 4.0)

RR 5.0 (95% CI: 2.5 – 10.0)

Page 9:

RR 2.0 (95% CI: 1.0 – 4.0)

What are the chances the true result is between 1.0 and 4.0?

Page 10:

In a randomized trial, could the finding be by chance?

If yes, what does it mean to be “by chance?” What is it that is caused by chance?

Page 11:

This Morning

Randomization
– Why do we do it?

P-values
– What are they?
– How do we calculate them?
– What do they mean?

Confidence Intervals

Page 12:

Last Session

Selection bias
– Results from selection into or out of the study related to both exposure and outcome
– Structural: conditioning on common effects
– Adjustment for selection proportions
– Weighting for LTFU

Matching
– In a case-control study, creates selection bias by design, which must be controlled in the analysis

Page 13:

“There’s a certain feeling of ease and pleasure for me as a scientist that any way you slice the data, it’s statistically significant,” said Dr. Anthony S. Fauci, a top AIDS expert in the United States government, which paid most of the trial’s costs.

Page 14:

Randomization

Randomization lends meaning to likelihoods, p-values and confidence intervals

It can reduce the probability of severe confounding to an acceptable level

But randomization does not prevent confounding

Page 15:

Greenland: Randomization, statistics and causal inference

Objective:
– Clarify the meaning and limitations of inferential statistics in the absence of randomization

Example: lidocaine therapy after acute MI
– Patient 1: doomed
– Patient 2: immune
– Lidocaine therapy assigned at random
– Two results are equally likely

Page 16:

Greenland: Randomization, statistics and causal inference

True RD = 0, so both possible results are confounded

Expectation = 0 = (1 + -1)/2
– Statistically unbiased (expectation equals truth)

Conclusions
– Randomization does not prevent confounding
– Randomization does provide a known probability distribution for the possible results under a specified hypothesis about the effect
– Statistical unbiasedness of randomized exposure corresponds to an average confounding of zero over the distribution of results

        Result 1                  Result 2
        lidocaine    placebo      lidocaine    placebo
        Patient 1    Patient 2    Patient 2    Patient 1
RD      1 - 0 = 1                 0 - 1 = -1
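
The two equally likely results can be enumerated directly. The sketch below is illustrative only: the dictionary of potential outcomes and the helper name `risk_difference` are assumptions made for the example, encoding "doomed" as dying under either treatment and "immune" as surviving under either.

```python
from itertools import permutations

# Potential outcomes (1 = dies, 0 = survives) as (under lidocaine, under placebo).
# "Doomed" dies either way and "immune" survives either way, so the true RD is 0.
potential = {"patient 1 (doomed)": (1, 1), "patient 2 (immune)": (0, 0)}

def risk_difference(treated, control):
    """Observed RD when `treated` gets lidocaine and `control` gets placebo."""
    return potential[treated][0] - potential[control][1]

# Randomization makes the two possible assignments equally likely.
results = [risk_difference(t, c) for t, c in permutations(potential, 2)]
expectation = sum(results) / len(results)

print(results)      # [1, -1]
print(expectation)  # 0.0
```

Each single result is confounded (1 or -1 rather than 0), yet the average over the randomization distribution equals the true RD.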

Page 17:

Probability Theory

With an assigned probability distribution, can calculate expectation

The expectation does not have to be in the set of possible outcomes
– Here, the expectation equals zero

Page 18:

Probability

If we randomize and assume the null is true (as we do when calculating p-values)
– We expect half of the subjects to be exposed and half the events to be among the exposed

If truly no effect of exposure, all data combinations, permutations are possible
– Everyone was either type 1 or 4
– All the events (deaths) would occur regardless of whether assigned the exposure or not

Page 19:

Probability Theory

The probability of each possible data result in a 2x2 table is:
– A function of the number of combinations (permutations implies order matters)
– Probability of each event is the number of ways to assign X subjects to exposure out of Y and A events out of B total events
– Assumes the margins are fixed

Page 20:

        E+     E-     Total
D+      ?      ?      100
D-      ?      ?      900
Total   500    500    1000

Fixed margins, how many parameters (cells) do I need to estimate to fill in the entire table?

Page 21:

Greenland: Randomization, statistics and causal inference

What comfort does this provide scientists trying to interpret a single result?

Can make probability of severe confounding small by increasing the sample size

        E+     E-     Total
D+      30     70     100
D-      470    430    900
Total   500    500    1000
Risk    0.06   0.14

RR 0.43   RD -0.08

Page 22:

Greenland: Randomization, statistics and causal inference

Given there were 100 cases and an even distribution of exposed and unexposed, how many cases would we expect to be exposed?

E+ E- Total

D+ 30 70 100

D- 470 430 900

Total 500 500 1000

Risk 0.06 0.14

RR 0.43 RD -0.08

Page 23:

Greenland: Randomization, statistics and causal inference

What comfort does this provide scientists trying to interpret a single result?

Can make probability of severe confounding small by increasing the sample size

$$P(X \le 30 \mid 100 \text{ deaths in } 1000 \text{ cases}) = \sum_{i=0}^{30} \frac{\binom{100}{i}\binom{900}{500-i}}{\binom{1000}{500}} < 0.0001$$

Probability under the null that randomization would yield a result with at least as much downward confounding as the observed result
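
As a rough check on this tail probability, the sum can be evaluated with the standard library. This is a minimal sketch assuming fixed margins (100 deaths, 500 exposed, 1000 subjects); the function name `hypergeom_pmf` is made up for the example.

```python
from math import comb

def hypergeom_pmf(x, n, M, N):
    """P(X = x): x exposed cases, n exposed, M total cases, N total subjects."""
    return comb(M, x) * comb(N - M, n - x) / comb(N, n)

# Expected exposed cases under the null with an even exposure split.
expected_exposed_cases = 100 * 500 / 1000

# Lower tail: 30 or fewer of the 100 deaths land in the exposed group.
p_lower = sum(hypergeom_pmf(i, n=500, M=100, N=1000) for i in range(31))

print(expected_exposed_cases)  # 50.0
print(p_lower < 0.0001)        # True
```

With 50 exposed deaths expected under the null, observing 30 or fewer is a very unlikely product of randomization alone.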

Page 24:

Back to the counterfactual

If the association we measure differs from the truth, even if by chance, what explains it?
– Unexposed can’t stand in for what would have happened to exposed had they been unexposed
  This is confounding
– But on average, zero confounding
– This gives us a probability distribution to calculate the probability of confounding explaining the results
  This is a p-value

Page 25:

Randomized trial of E on D in 4 patients

We find:

E+ E-

D+ 2 0

D- 0 2

Total 2 2

Page 26:

Randomized trial of E on D in 4 patients

If the null is true, what CST types must they be?

E+ E-

D+ 2 0

D- 0 2

Total 2 2

Page 27:

Hypergeometric distribution

The hypergeometric distribution:

$$P(X = x) = \frac{\binom{M}{x}\binom{N-M}{n-x}}{\binom{N}{n}}, \qquad \binom{M}{x} = \frac{M!}{x!\,(M-x)!}$$

Where X = random variable, x = exposed cases, n = exposed population, M = total cases, and N = total population.
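
Applied to the randomized trial in 4 patients (2 exposed, 2 cases), the formula gives the full null distribution. A minimal sketch using exact fractions; the function name is an assumption for the example.

```python
from fractions import Fraction
from math import comb

def hypergeom_pmf(x, n, M, N):
    """Exact P(X = x) with fixed margins: n exposed, M cases, N subjects."""
    return Fraction(comb(M, x) * comb(N - M, n - x), comb(N, n))

# Null distribution of exposed cases in the 4-patient trial.
dist = {x: hypergeom_pmf(x, n=2, M=2, N=4) for x in range(3)}

for x, p in dist.items():
    print(x, p)   # 0 1/6, then 1 2/3, then 2 1/6
```

The observed table (both cases exposed) is one of six equally likely assignments, so it has probability 1/6 under the null.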

Page 28:

Spreadsheet

Page 29:

Greenland: Randomization, statistics and causal inference

When treatment is assigned by the physician, expectation depends on physician behavior
– Expectation does not necessarily equal truth

For observational data we DON’T have a probability distribution for confounding
– When E isn’t randomized, statistics don’t provide valid probability statements about exposure effects, because p-values, CIs, & likelihoods are calculated with the assumption that all data interchanges are equally likely

        Result 1                  Result 2
        lidocaine    placebo      lidocaine    placebo
        Patient 1    Patient 2    Patient 2    Patient 1
RD      1 - 0 = 1                 0 - 1 = -1

Page 30:

Greenland: Randomization, statistics and causal inference

Alternatives
– Limit statistics to data description (e.g., visual summaries, tables of risks or rates, etc.)
– Influence analysis: explore degree to which effect estimates would change under small perturbations of the data, such as interchanging a few subjects
– Employ more elaborate statistical models
– Sensitivity analysis
– At the very least, interpret conventional statistics as minimum estimates of the error

Page 31:

(1) The p-value is:

Probability under the test hypothesis (usually the null) that a test statistic would be ≥ its observed value, assuming no bias in data collection or analysis
– Why the null? Our job is to measure
– 1-sided upper p-value is test stat ≥ observed value
– 1-sided lower p-value is test stat ≤ observed value
– Mid-p assigns only half the probability of the observation to the 1-sided upper p-value
– 2-sided p-value is twice the smaller of the 1-sided p-values
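
These definitions can be made concrete with an exact calculation. The sketch below assumes a 2x2 table with fixed margins, so the number of exposed cases follows a hypergeometric distribution under the null; the function names and the returned dictionary keys are illustrative choices, not part of the lecture.

```python
from fractions import Fraction
from math import comb

def pmf(x, n, M, N):
    """Exact hypergeometric P(X = x): n exposed, M cases, N subjects."""
    return Fraction(comb(M, x) * comb(N - M, n - x), comb(N, n))

def p_values(x_obs, n, M, N):
    """Exact 1-sided, mid-p, and 2-sided p-values for x_obs exposed cases."""
    lo, hi = max(0, n - (N - M)), min(n, M)          # feasible range of X
    probs = {x: pmf(x, n, M, N) for x in range(lo, hi + 1)}
    upper = sum(p for x, p in probs.items() if x >= x_obs)
    lower = sum(p for x, p in probs.items() if x <= x_obs)
    return {
        "upper": upper,
        "lower": lower,
        "mid_p_upper": upper - probs[x_obs] / 2,     # half weight on observed
        "two_sided": 2 * min(upper, lower),          # twice the smaller tail
    }

observed = p_values(x_obs=2, n=2, M=2, N=4)   # both cases exposed
middle = p_values(x_obs=1, n=2, M=2, N=4)     # one case in each arm
print(observed)
print(middle)
```

For the 4-patient trial with both cases exposed, the upper p-value is 1/6, the mid-p is 1/12, and the 2-sided p-value is 1/3. With one case in each arm, both tails equal 5/6, so twice the smaller tail is 5/3, which exceeds 1.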

Page 32:

(2) The p-value is not:

Probability that a test hypothesis (null hypothesis) is true
– Calculated assuming that test hypothesis is true
– Cannot calculate probability of an event that is assumed in the calculation

Probability of observing the result under the test hypothesis (null) [likelihood]
– Also includes probability of results more extreme

Page 33:

(3) The p-value is not:

An α-level (the Type 1 error rate)
– More on that later

A significance level
– Used to refer to both p-values and Type 1 error rates
– Should be avoided to prevent confusion

Page 34:

(4) The 2-sided p-value is not:

Because the 2-sided p-value is twice the smaller of the lower and upper 1-sided p-values (which may differ), and so may be > 1, it is not the:
– Probability that the data would show as strong an association as observed or stronger if the null hypothesis were true;
– Probability that a point estimate would be as far or further from the test value as observed

Page 35:

Significance testing:

Compares p-value to an arbitrary or conventional Type 1 error rate (α = 0.05)

Emphasizes decision making, not measurement
– Derives from agricultural and industrial applications of statistics
– Reflects the roots of epidemiology as the union of statistics and medicine

Page 36:

JNCI announces materials to “help journalists get it right”

Page 37:
Page 38:

Response

They acknowledge the definitions were incorrect, however:

“We were not convinced that working journalists would find these definitions user-friendly, so we sacrificed precision for utility. We will add references to standard textbooks for journalists who want to learn more.”

Page 39:

Frequentist Statistics

Page 40:

Alternatives to p-values

Two studies: which is more precise?
– RR 10.0, p = 0.039
– RR 1.3, p = 0.062

The p-value conflates the size of the effect and its precision
– RR 10.0, p = 0.039, 95% CI: 1.5-66.7
– RR 1.3, p = 0.062, 95% CI: 0.99-1.7

Page 41:

Frequentist intervals (1)

Definition:
– If the statistical model is correct and there is no bias, a confidence interval derived from a valid test will, over unlimited repetitions of the study, contain the true parameter with a frequency no less than its confidence level (e.g. 95%).

But the statistical model is only correct under randomization

CAN’T say that the probability the interval includes the truth equals the interval’s coverage probability (e.g., 95%).

Page 42:

Confidence Interval Simulation
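
A simulation of this kind could be sketched as follows. It is not the simulation shown in the lecture: it assumes the simplest possible setting (estimating a normal mean with known standard deviation), and every parameter value and name here is an arbitrary choice for illustration.

```python
import random
from statistics import NormalDist

def coverage(true_mean=0.0, sd=1.0, n=50, reps=2000, level=0.95, seed=1):
    """Fraction of repeated studies whose CI contains the true mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_width = z * sd / n ** 0.5        # known-SD interval for a mean
    hits = 0
    for _ in range(reps):
        sample_mean = sum(rng.gauss(true_mean, sd) for _ in range(n)) / n
        if sample_mean - half_width <= true_mean <= sample_mean + half_width:
            hits += 1
    return hits / reps

print(coverage())  # close to 0.95 over many repetitions
```

The 95% describes the long-run behavior of the procedure across repetitions. Any single interval either contains the true value or does not.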

Page 43:

Frequentist intervals (2)

Advantages
– Provides more information than significance tests or p-values: direction, magnitude, and variability
– Economical compared with p-value function

Disadvantages
– Less information than the p-value function
– Underlying assumptions (valid statistical model, no bias, repeated experiments)

Page 44:

Approximations: Test-based

$$SE[\ln(\widehat{RR})] = \frac{\ln(\widehat{RR})}{\chi}, \qquad CI_{RR} = \widehat{RR}^{\,(1 \pm z/\chi)}$$

$$SE(\widehat{RD}) = \frac{\widehat{RD}}{\chi}, \qquad CI_{RD} = \widehat{RD}\,(1 \pm z/\chi)$$

(χ: the test statistic; z: the standard normal quantile for the confidence level)

Page 45:

Approximations: Wald

$$CI_{RD} = \widehat{RD} \pm z \cdot SE(\widehat{RD})$$

$$CI_{RR} = e^{\ln(\widehat{RR}) \pm z \cdot SE[\ln(\widehat{RR})]}$$
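
As an illustration, the Wald limits can be computed for the earlier 2x2 table (30 of 500 exposed and 70 of 500 unexposed with the outcome). This is a sketch under the usual large-sample standard errors for a single unstratified table, which are assumed here; the function name is made up.

```python
from math import exp, log, sqrt
from statistics import NormalDist

def wald_intervals(a, n1, b, n0, level=0.95):
    """Wald limits for RD and RR: a cases of n1 exposed, b cases of n0 unexposed."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    r1, r0 = a / n1, b / n0
    rd, rr = r1 - r0, r1 / r0
    # Assumed large-sample standard errors for one crude 2x2 table.
    se_rd = sqrt(r1 * (1 - r1) / n1 + r0 * (1 - r0) / n0)
    se_ln_rr = sqrt(1 / a - 1 / n1 + 1 / b - 1 / n0)
    ci_rd = (rd - z * se_rd, rd + z * se_rd)
    ci_rr = (exp(log(rr) - z * se_ln_rr), exp(log(rr) + z * se_ln_rr))
    return rd, ci_rd, rr, ci_rr

rd, ci_rd, rr, ci_rr = wald_intervals(a=30, n1=500, b=70, n0=500)
print(f"RD {rd:.2f} (95% CI: {ci_rd[0]:.3f}, {ci_rd[1]:.3f})")
print(f"RR {rr:.2f} (95% CI: {ci_rr[0]:.2f}, {ci_rr[1]:.2f})")
```

For this table the RR limits come out at roughly 0.28 to 0.65. The interval for the RR is built on the log scale and then exponentiated, which is why it is not symmetric around 0.43.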

Page 46:

Standard Errors (basic over i strata):

$$\widehat{\operatorname{var}}[\ln(\widehat{OR}_i)] = \frac{1}{a_i} + \frac{1}{b_i} + \frac{1}{c_i} + \frac{1}{d_i}$$

$$\widehat{\operatorname{var}}[\ln(\widehat{RR}_i)] = \frac{c_i}{a_i N_{1i}} + \frac{d_i}{b_i N_{0i}}$$

$$\widehat{\operatorname{var}}[\ln(\widehat{IRR}_i)] = \frac{1}{a_i} + \frac{1}{b_i}$$

Page 47:

How do we measure precision?

Width of the confidence interval

Measured how?
– If I tell you the 95% CI for an RR is 2 to 8, can you tell me the point estimate?
– Sqrt(U*L)
– Difference measures, just subtract

Remember relative measures are on the log scale, so width of a CI is measured by the RATIO of the upper to the lower confidence limit
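
A small sketch of both ideas for ratio measures, assuming the interval is symmetric on the log scale (as Wald-type intervals are). The helper names are invented for the example.

```python
from math import sqrt

def point_estimate_from_ci(lower, upper):
    """Geometric mean of the limits: the midpoint on the log scale."""
    return sqrt(lower * upper)

def ci_ratio(lower, upper):
    """Precision of a ratio measure: upper limit divided by lower limit."""
    return upper / lower

print(point_estimate_from_ci(2, 8))   # 4.0
print(ci_ratio(1.0, 4.0))             # 4.0  (RR 2.0, 95% CI 1.0 to 4.0)
print(ci_ratio(2.5, 10.0))            # 4.0  (RR 5.0, 95% CI 2.5 to 10.0)
```

By this measure, RR 2.0 (1.0 to 4.0) and RR 5.0 (2.5 to 10.0) are equally precise, even though the second interval looks wider on the arithmetic scale.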

Page 48:
Page 49:

Frequentist Intervals (4): Interpretation

Page 50:

Conclusion about confidence intervals

A CI used for hypothesis testing is an abuse of the CI
– The goal is precision, not significance

The goal of epi is precision, not significance
– A precise null estimate is just as important as a precise significant estimate
– An imprecise, statistically significant estimate is as useless as a non-statistically significant, imprecise estimate

Page 51:

What results are published or highlighted in publications?

Find a publication with multiple results
Rank them in order of precision
Then see what is highlighted in the abstract

          Maternal age <35                       Maternal age >= 35
Parity    OR     LCL95  UCL95  width  rank       OR     LCL95  UCL95  width  rank
0         1                                      1
1         1.08   0.86   1.35   1.57   1          1.11   0.65   1.88   2.89   6
2         1.14   0.87   1.49   1.71   2          1.67   0.99   2.82   2.85   5
3         1.65   1.13   2.4    2.12   3          1.24   0.69   2.23   3.23   8
>=4       1.45   0.93   2.26   2.43   4          2.41   1.41   4.12   2.92   7
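
The width and rank columns can be reproduced from the confidence limits alone. A minimal sketch; the tuple layout is an arbitrary choice for the example, and the limits are copied from the table.

```python
# (maternal age group, parity, LCL95, UCL95), reference rows omitted.
rows = [
    ("<35", "1", 0.86, 1.35), ("<35", "2", 0.87, 1.49),
    ("<35", "3", 1.13, 2.40), ("<35", ">=4", 0.93, 2.26),
    (">=35", "1", 0.65, 1.88), (">=35", "2", 0.99, 2.82),
    (">=35", "3", 0.69, 2.23), (">=35", ">=4", 1.41, 4.12),
]

# Width of a ratio-measure CI is UCL / LCL; a smaller ratio is more precise.
ranked = sorted(rows, key=lambda r: r[3] / r[2])

for rank, (age, parity, lcl, ucl) in enumerate(ranked, start=1):
    print(rank, age, parity, round(ucl / lcl, 2))
```

The most precise estimates are the ones closest to the null in the younger group, which is the point of comparing the ranking with what an abstract chooses to highlight.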

Page 52:

CIPRA Trial

Trial of nurse- vs. doctor-managed HIV care
For primary results, co-investigators wanted p-values and confidence intervals
Didn’t want hypothesis testing, even though aware people would do it anyway
I fought initially, and lost the debate
Put in both

Page 53:

CIPRA Trial

Reviewer comment:

Table 3: Column with p-values can be dropped given that 95% confidence intervals are presented;

perhaps mark significance as * (e.g. for p<0.025) and ** (e.g. for p<0.005) after the 95% CI's.

Page 54:

Summary

Randomization gives meaning to statistics
– Gives a probability distribution for confounding

When randomization doesn’t hold, we have no probability distribution

P-values aren’t the probability of chance, null, etc.

CIs allow us to assess precision
– But are based on infinite repetitions
– Do not contain the true value with 95% probability