

Two-sample inference: Categorical data

Patrick Breheny

March 29

Patrick Breheny, STA 580: Biostatistics I



Separate vs. paired samples

Despite the fact that paired samples usually offer a more powerful design, separate samples are much more common, for a variety of reasons:

It is often impossible to pair samples: for example, in the Salk vaccine trial, a child must either be vaccinated or unvaccinated; there is no way to receive both

Even if theoretically possible, it is often impractical: for example, in studies of chronic diseases, we have to wait years to observe whether or not a person suffers a heart attack or develops breast cancer

Furthermore, in observational studies, we have no control over which subjects end up in which group




Lister’s experiment

In the 1860s, Joseph Lister conducted a landmark experiment to investigate the benefits of sterile technique in surgery

At the time, it was not customary for surgeons to wash their hands or instruments prior to operating on patients

Lister developed a new operating procedure in which surgeons were required to wash their hands, wear clean gloves, and disinfect surgical instruments with carbolic acid

This new procedure was compared to the old, non-sterile procedure, and Lister recorded the number of patients in each group that lived or died




Contingency tables

When the outcome of a two-sample study is categorical, the results can be summarized in a 2x2 table that lists the number of subjects in each sample that fell into each category

Putting Lister’s results in this form, we have:

              Survived
              Yes    No
    Sterile    34     6
    Control    19    16

This kind of table is called a contingency table, or sometimes a cross-classification table




Contingency tables (cont’d)

Customarily, the rows of a contingency table represent the treatment/exposure groups, while the columns represent the outcomes

All rows and columns must represent mutually exclusive categories; thus, each subject is located in one and only one cell of the table




Lister’s results

On the surface, Lister’s experiment seems encouraging: 46% of patients who received conventional treatment died, compared with only 15% of the patients who were operated on using the new sterile technique

However, if we calculate (separate, exact) confidence intervals for the proportion who die from each type of surgery, they overlap:

Sterile: (6%, 30%)

Control: (29%, 63%)





Differences between groups

It’s nice to know about the actual separate proportions that died for each type of surgery, but what we really want to know in this experiment is whether there is a difference between the two treatments

So, rather than ask two separate questions, each using part of the data and addressing part of the question of interest, a more powerful approach is to focus all of the analysis on the one primary question of interest





Setting up a hypothesis test

Let’s think about what a p-value is: the probability of seeing results as extreme or more extreme than what we saw if the null hypothesis were true

The null hypothesis here is that sterilization has no impact on the probability that a patient lives or dies, i.e., that it doesn’t make any difference which type of surgery the patients received

If that were true, then it is as if we only really had one group, and everyone lived or died with equal probability regardless of the sterility of their surgery





Setting up a hypothesis test (cont’d)

Thus, consider putting all the patients’ outcomes into a single urn without considering the type of surgery they received (i.e., this urn would contain 22 balls with “died” written on them and 53 balls with “survived” written on them)

Our sample of 40 patients who received sterile surgery would be like randomly drawing 40 balls out of the urn

How often would we see something as extreme or more extreme than only 6 out of 40 patients dying?





Performing the experiment

Making these draws 10,000 times, I got the following results:

[Figure: histogram of the 10,000 simulated draws; the x-axis values run from 4 to 19 and the y-axis shows frequency (0 to 2000)]





Calculating a p-value from the experiment

When I drew 40 balls from the combined urn, I drew exactly 6 “died” balls only 39 times (out of 10,000)

The only results “as extreme or more extreme” than 6 were:

    Number of “died” balls:   4   5   6   18   19
    Number of times drawn:    2   7  39   13    1

So I obtained a result as extreme or more extreme than the observed value a total of 62 times out of 10,000

From this experiment, then, I would calculate a p-value of 62/10000 = .0062
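The urn experiment can be sketched in a few lines of Python. This is only an illustration, not the code used for the slides: the function name, the seed, and the rule for "as extreme or more extreme" (at least as far from the expected number of deaths as the observed 6) are assumptions, and a different seed will give a slightly different answer.

```python
import random

def simulate_urn_p_value(n_died=22, n_survived=53, n_drawn=40,
                         observed=6, n_sims=10_000, seed=1):
    """Estimate a p-value by repeatedly drawing balls from the combined urn."""
    rng = random.Random(seed)  # fixed seed (arbitrary choice) for reproducibility
    urn = [1] * n_died + [0] * n_survived  # 1 = "died", 0 = "survived"
    expected = n_drawn * n_died / (n_died + n_survived)  # about 11.7 deaths
    cutoff = abs(observed - expected)
    extreme = 0
    for _ in range(n_sims):
        died = sum(rng.sample(urn, n_drawn))  # draw 40 balls without replacement
        # Assumed rule: count draws at least as far from the expected value as 6
        if abs(died - expected) >= cutoff:
            extreme += 1
    return extreme / n_sims

p_sim = simulate_urn_p_value()
print(f"simulated p-value: {p_sim:.4f}")
```

With this rule, the extreme results are exactly the draws with 6 or fewer, or 18 or more, "died" balls, which matches the table above.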





Fisher’s exact test

This approach to testing association in a 2x2 table is called Fisher’s exact test, after R.A. Fisher, probably the most influential statistician of the 20th century

Although we did the experiment using a simulation, Fisher worked out the exact probabilities that would result from this experiment

Even before Fisher, however, another famous statistician (Karl Pearson) invented an approximate version of this test

Pearson’s invention, the χ²-test, is one of the earliest (1900) and most widely used statistical tests
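As a rough sketch of what "exact probabilities" means here, the urn draws follow a hypergeometric distribution, so the simulation can be replaced by a direct calculation. The function below is illustrative (its name and the two-sided rule of summing every table no more likely than the observed one are assumptions, though that rule is a common convention).

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = comb(total, row1)

    def prob(k):
        # Hypergeometric probability that row 1 holds k of the col1 subjects
        return comb(col1, k) * comb(total - col1, row1 - k) / denom

    lo = max(0, row1 - (total - col1))
    hi = min(row1, col1)
    p_obs = prob(a)
    # Assumed rule: add up all tables that are no more likely than the observed one
    return sum(prob(k) for k in range(lo, hi + 1)
               if prob(k) <= p_obs * (1 + 1e-9))

# Lister's table: rows are sterile/control, columns are survived/died
p_exact = fisher_exact_two_sided(34, 6, 19, 16)
print(f"Fisher's exact p-value: {p_exact:.4f}")
```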





The χ² curve

The χ² test involves a curve we haven’t seen yet called the χ² curve

Before we get to the test, let’s take a quick look at where the χ² curve comes from

Suppose that we generated a lot of random observations from the normal distribution; the histogram of these observations would look like the normal curve

Now suppose that we took those observations, squared them, and then made a histogram





The χ² curve (with 1 degree of freedom)

[Figure: density of z²; x-axis: z² (0 to 3), y-axis: density (0 to 6)]





The χ² curve and hypothesis testing

What are the implications for hypothesis testing?

Well, suppose we were performing a z-test: normally, we would calculate a test statistic z and find the area under the normal curve outside ±z

But what if we squared z instead?

$$z^2 = \frac{(x - \mu_0)^2}{SE^2}$$





The χ² curve and hypothesis testing (cont’d)

Now, the area under the normal curve above +z will lie above z², and so will the area under the normal curve below −z

The area to the right of z² now contains both tails of the original normal curve

To summarize: regardless of whether x is far above µ0 or far below µ0, z² will be large and we will naturally get a two-sided test

So, an alternative way to calculate p-values for z-tests is to find the area to the right of z² on the χ² curve (and don’t double it)
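A quick simulation can make this concrete. The sketch below squares a large number of standard normal draws and compares the share landing to the right of z² with the usual two-sided normal tail area; the seed, the number of draws, and the choice z = 1.96 are arbitrary assumptions made for illustration.

```python
import math
import random

rng = random.Random(0)  # arbitrary seed
z_obs = 1.96            # illustrative test statistic
draws = [rng.gauss(0, 1) for _ in range(100_000)]

# Share of squared normal draws that land to the right of z_obs squared
right_of_z2 = sum(z * z > z_obs ** 2 for z in draws) / len(draws)

# Two-sided normal tail area, 2 * P(Z > |z_obs|), via the complementary error function
two_sided = math.erfc(abs(z_obs) / math.sqrt(2))

print(f"area right of z^2 (simulated): {right_of_z2:.4f}")
print(f"two-sided normal p-value:      {two_sided:.4f}")
```

The two numbers should agree up to simulation noise, with no doubling needed on the squared scale.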





Graphical representation

[Figure: two panels. Left: density of z, for z from −4 to 4. Right: density of z², for z² from 0 to 6.]





The motivation behind a χ²-test

In a sense, squaring the z-test statistic and comparing it to the χ² curve is a χ² test

However, this is not what people usually mean when they talk about χ² tests

Usually, the term χ²-test is reserved for an analysis of categorical outcomes in which you compare the observed number of times each category occurs with the number of times it would be expected to occur under the null hypothesis

The more disagreement there is between observed and expected results, the further we will be in the right-hand tail of the χ² curve, and the lower our p-value will be





The χ²-statistic

Specifically, this is done by calculating the χ²-statistic: letting the subscript i denote the possible categories,

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

where O_i and E_i are the observed and expected number of times category i occurs/should occur

The numerator should look familiar: it’s the difference between the observed value of something and its expected value under the null

The denominator, on the other hand, looks weird

However, it just so happens that when you’re counting occurrences of a category, its SD² is approximately equal to the expected value





The χ²-test procedure

The procedure for performing a χ²-test is as follows:

#1 Create a table of expected counts based on the null hypothesis

#2 Calculate the χ²-statistic

#3 Determine the area to the right of χ² on the χ² curve





The χ²-test: Lister’s experiment

Let’s use the χ²-test to determine how unlikely Lister’s results would have been if sterile technique had no impact on fatal complications from surgery

#1 Create a table of expected counts based on the null hypothesis

In the experiment, ignoring group affiliation, 22 out of 75 patients died

Thus, under the null, we would expect 22/75 = 29.3% of the patients in each group to die:

              Survived
              Yes     No
    Sterile   28.3   11.7
    Control   24.7   10.3





The χ²-test: Lister’s experiment (cont’d)

#2 Calculate the χ²-statistic (using the unrounded expected counts):

$$\chi^2 = \frac{(34 - 28.3)^2}{28.3} + \frac{(6 - 11.7)^2}{11.7} + \frac{(19 - 24.7)^2}{24.7} + \frac{(16 - 10.3)^2}{10.3} = 8.50$$

#3 The area to the right of 8.50 is 1 − .996 = .004

There is only a 0.4% probability of seeing such a large association by chance alone; this is compelling evidence that sterile surgical technique saves lives
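The three steps can be sketched in Python as follows. The function name is an assumption, no continuity correction is applied, and the tail area relies on the fact that for 1 degree of freedom the area to the right of a value x on the χ² curve equals erfc(√(x/2)).

```python
import math

def chi_square_2x2(table):
    """Return (expected counts, chi-square statistic, p-value) for a 2x2 table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Step 1: expected counts under the null (row total * column total / n)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    # Step 2: sum of (observed - expected)^2 / expected over the four cells
    stat = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))
    # Step 3: area to the right of stat on the chi-square curve with 1 df
    p_value = math.erfc(math.sqrt(stat / 2))
    return expected, stat, p_value

# Lister's table: rows are sterile/control, columns are survived/died
expected, stat, p_value = chi_square_2x2([[34, 6], [19, 16]])
print([[round(e, 1) for e in row] for row in expected])
print(f"chi-square = {stat:.2f}, p = {p_value:.4f}")
```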





Fisher’s exact test and the χ²-test

So far, everything we’ve talked about is testing the same null hypothesis with the same data, so one would expect similar results

Indeed, the p-values are very similar:

Experiment: p = .006

Fisher’s exact test: p = .005

χ²-test: p = .004





Fisher’s exact test vs. the χ²-test

How should you decide to use one versus the other?

The usual advice is that the χ² approximation may be inaccurate if the expected count in any cell is under 5

Typically, the difference between the χ² test and Fisher’s exact test is small for 2x2 tables, but the two can be extremely different for larger tables

Because it often doesn’t matter, which test is performed is usually a matter of preference:

Some prefer the χ²-test for tradition’s sake

Others prefer Fisher’s exact test because, as the name implies, it is exact
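The rule of thumb about expected counts is easy to check mechanically. The helper names below are hypothetical, and the threshold of 5 is simply the usual advice quoted above rather than a hard rule.

```python
def smallest_expected_count(table):
    """Smallest expected cell count under the null for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return min(r * c / n for r in row_totals for c in col_totals)

def suggested_test(table, threshold=5):
    """Apply the rule of thumb: prefer the exact test when any expected count is small."""
    if smallest_expected_count(table) < threshold:
        return "Fisher's exact test"
    return "chi-square test"

lister = [[34, 6], [19, 16]]
print(round(smallest_expected_count(lister), 1))
print(suggested_test(lister))
```

For Lister’s table every expected count is above 10, which is consistent with the two tests giving similar p-values.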





Study designs that can be analyzed with χ²-tests

One reason that χ²-tests are so popular is that they can be used to analyze a wide variety of study designs

In addition to controlled experiments, they are widely used in epidemiology, where investigators must conduct observational studies

Observational studies in epidemiology fall into three categories: prospective studies, retrospective studies, and cross-sectional studies





Study designs (cont’d)

χ²-tests and Fisher’s exact test can be used to analyze all of these studies

Best of all (some might say), you don’t even need to think about your study design or where your data came from; if it can be expressed as a 2x2 contingency table, the χ²-test is always appropriate

One caveat: there are a lot of tables in this world that have two rows and two columns; that does not necessarily make them contingency tables, so please don’t try to perform χ²-tests on them!





Prospective studies

We have said that the double-blind, randomized controlled trial is the gold standard of biomedical research

When this is not possible (or ethical), the prospective study (also called a cohort study) is the next best thing

In a prospective study, investigators collect a sample, classify individuals in some way, and then wait to see if the individuals develop a condition

The classification is usually based on exposure to a risk factor such as smoking or obesity





Risk factors for breast cancer

For example, the CDC tracked 6,168 women in the hopes of finding risk factors that led to breast cancer

One risk factor they looked at was the age at which the woman gave birth to her first child:

                     Cancer
                     No      Yes
    Before age 25    4475    65
    25 or older      1597    31





Risk factors for breast cancer (cont’d)

Performing a χ²-test on the data, we obtain p = .19

This study provides only rather weak evidence that the risk of developing breast cancer depends on the age at which a woman gives birth to her first child
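As a check on the reported p-value, here is a minimal sketch using the shortcut form of the χ²-statistic for a 2x2 table, n(ad − bc)² divided by the product of the four margins. The function name is an assumption, and no continuity correction is applied.

```python
import math

def chi_square_shortcut(a, b, c, d):
    """Chi-square statistic and p-value (1 df) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    stat = n * (a * d - b * c) ** 2 / margins
    # Area to the right of stat on the chi-square curve with 1 degree of freedom
    return stat, math.erfc(math.sqrt(stat / 2))

# Rows: first birth before age 25 / at 25 or older; columns: no cancer / cancer
stat_bc, p_bc = chi_square_shortcut(4475, 65, 1597, 31)
print(f"chi-square = {stat_bc:.2f}, p = {p_bc:.2f}")
```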





Retrospective studies

Not all researchers have the resources to follow thousands of people for decades to see if they develop a rare disease

Instead, they often try the more feasible approach of collecting a sample of people with the condition of interest, a second sample of people without the condition of interest, and then asking them if they were exposed to a risk factor in the past

For example, a much cheaper way to conduct the study of breast cancer risk factors would be to find 50 women with breast cancer, 50 women without breast cancer, and ask them when they had their first child

This approach is called a retrospective, or case-control, study





Fluoride poisoning in Alaska

In 1992, an outbreak of illness occurred in an Alaskan community

The CDC suspected fluoride poisoning from one of the town’s water supplies

                                Case    Control
    Drank from supply            33        4
    Didn’t drink from supply      5       46





Fluoride poisoning in Alaska (cont’d)

Testing whether this could be due to chance, the χ²-test gives us p = 6 × 10⁻¹³

The observed association was certainly not due to chance

But the association still may be due to factors besides fluoride poisoning





Recall bias

For example, people who got sick may have been thinking much harder about what they ate and drank than people who didn’t

This is called recall bias, and it is an important source of bias in retrospective studies

Furthermore, because researchers must gather separate samples of cases and controls, these studies are more prone to sampling biases than prospective studies





Electromagnetic field example

For example, retrospective studies have been performed investigating links between childhood leukemia and exposure to electromagnetic fields (EMF)

Families with low socioeconomic status are more likely to live near electromagnetic fields

Families with low socioeconomic status are also less likely to participate in studies as controls

Socioeconomic status does not affect the participation of cases, however (cases are usually eager to participate)

This results in an observed association between EMF and leukemia potentially arising entirely due to bias





Cross-sectional studies

The weakest type of observational study is the cross-sectional study

In a cross-sectional study, the investigator simply gathers a single sample and cross-classifies them depending on whether they have the risk factor or not and whether they have the disease or not

Cross-sectional studies are the easiest to carry out, but are subject to all sorts of hidden biases





Circulatory disease and respiratory disease

For example, one study surveyed 257 hospitalized individuals and determined whether each individual suffered from a disease of the respiratory system, a disease of the circulatory system, or both

Their results:

                          Respiratory disease
                          Yes     No
    Circulatory   Yes       7     29
    disease       No       13    208

Could this association be due to chance?

Not likely; χ² = 7.9, so p = .005





Circulatory disease and respiratory disease (cont’d)

Okay, so it’s probably not due to chance

But does that mean that you are more likely to get a respiratory disease if you have a circulatory disease?

The same study surveyed nonhospitalized individuals as well:

                          Respiratory disease
                          Yes     No
    Circulatory   Yes      15    142
    disease       No      189   2181





Circulatory disease and respiratory disease (cont’d)

The evidence in favor of an association is now nonexistent: p = .48

What’s going on?

The issue isn’t a poor sampling design: both samples were gathered carefully and are representative of their respective populations

Instead, the issue is that cross-sectional studies are very susceptible to selection bias
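To see the contrast between the two samples side by side, the sketch below computes, for each table, the proportion with respiratory disease by circulatory status together with the χ²-test. The helper name is hypothetical and no continuity correction is used.

```python
import math

def summarize(a, b, c, d):
    """For the table [[a, b], [c, d]], return row proportions, chi-square, and p (1 df)."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(stat / 2))
    # Proportion with respiratory disease among those with / without circulatory disease
    return a / (a + b), c / (c + d), stat, p_value

# Rows: circulatory disease yes/no; columns: respiratory disease yes/no
hosp = summarize(7, 29, 13, 208)
general = summarize(15, 142, 189, 2181)

for label, (p_with, p_without, stat, p) in [("hospitalized", hosp),
                                             ("nonhospitalized", general)]:
    print(f"{label}: {p_with:.1%} vs {p_without:.1%}, "
          f"chi-square = {stat:.2f}, p = {p:.3f}")
```

Among hospitalized patients the two proportions differ sharply, while in the nonhospitalized sample they are nearly equal.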





Selection bias in cross-sectional studies

In this example, the bias was that if a patient has both a circulatory disease and a respiratory disease, then he or she is much more likely to be hospitalized and to be included in the cross-sectional study

There are many other examples:

Suppose we obtained a cross-sectional sample of factory workers to see if they had developed asthma at a higher rate than non-factory workers

Workers who developed asthma from working in the factory may be more likely to quit their job, and less likely to be included in our sample

Suppose we notice an association between milk drinking and peptic ulcers

Is it because milk drinking causes ulcers, or because ulcer sufferers like to drink milk in order to relieve their symptoms?





Summary

In conclusion, prospective studies are the most trustworthy type of observational study, but like any observational study, they are subject to confounding

Retrospective studies are often much more feasible, but potentially subject to recall bias and unrepresentative sampling

Cross-sectional studies provide a quick snapshot of an association, but need to be interpreted with care





Hypothesis tests and confidence intervals

Fisher’s exact test and the χ²-test can be used to calculate p-values and assess evidence against the null

However, we would also like to be able to measure and place confidence intervals on effect sizes to determine practical significance

The null hypothesis of a χ²-test can be loosely described as saying that there is “no association” between treatment/exposure and the outcome

What do we mean by “association”?

Suppose we want a confidence interval ... what do we want a confidence interval of?





Difference in proportions

One way of measuring the strength of an association for categorical data is to look at the difference in proportions

For example, in Lister’s experiment, 46% of the patients who received the conventional surgery died, but only 15% of the patients who received the sterile surgery died

The difference in these percentages is 31%





Flaws with the difference in proportions

However, differences in proportions are not informative for rare events

For example, in a rather famous study that made front-page headlines in the New York Times, 0.9% of subjects taking aspirin suffered heart attacks, compared to 1.7% of placebo subjects

The difference in proportions, 0.8%, doesn’t sound front-page-of-the-New-York-Times-worthy





The relative risk

Instead, for proportions, we often describe the strength of an association using ratios

Saying that the probability of suffering a heart attack was about twice as large (1.7/0.9 = 1.9) for the placebo group as for the aspirin group is much more attention-grabbing

Similarly for Lister’s experiment: the risk of dying from surgery is about three times lower (46/15 = 3.1) if sterile technique is used

This ratio is called the relative risk, and it is usually more informative than the difference of proportions





Flaws with the relative risk

The relative risk is a good measure of the strength of an association, but it too has flaws

One is that it’s asymmetric

For example, the relative risk of dying is 46/15 = 3.1 times greater with the nonsterile surgery, but the relative risk of living is only 85/54 = 1.57 times greater with the sterile surgery
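The asymmetry is easy to reproduce from the raw counts in Lister’s table. The function name below is an assumption; note that working from unrounded counts gives 3.05 for the relative risk of dying, whereas the 3.1 quoted above comes from the rounded percentages.

```python
def relative_risk(events_a, n_a, events_b, n_b):
    """Ratio of the event proportion in group A to that in group B."""
    return (events_a / n_a) / (events_b / n_b)

# Lister's data: 16 of 35 control patients died; 6 of 40 sterile patients died
rr_death = relative_risk(16, 35, 6, 40)      # dying: control vs. sterile
rr_survival = relative_risk(34, 40, 19, 35)  # living: sterile vs. control

print(f"relative risk of dying (control vs. sterile):  {rr_death:.2f}")
print(f"relative risk of living (sterile vs. control): {rr_survival:.2f}")
```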





Relative risks and retrospective studies

Another flaw is that it doesn’t work with retrospective studies

For example, consider the results of a classic case-control study of the relationship between smoking and lung cancer published in 1950:

                 Cases    Controls
    Smoker        688       650
    Nonsmoker      21        59

Is the probability of developing lung cancer given that a person smoked 688/(688 + 650) = 51%?





Relative risks and retrospective studies (cont’d)

Absolutely not; this isn’t even remotely accurate

By design, this study included 709 people with lung cancer and 709 without; the fact that about 50% of smokers had lung cancer doesn’t mean anything

For retrospective and cross-sectional studies, then, we cannot calculate a relative risk

This would require an estimate of the probability of developing a disease given that an individual was exposed to a risk factor, which we can only get from a prospective study

Instead, retrospective studies give us the probability of being exposed to a risk factor given that you have developed the disease





Odds

A slightly different measure of association, the odds ratio, gets around both of these flaws

Instead of taking the ratio of the probabilities, the odds ratio is a ratio of the odds of developing the disease given risk factor exposure to the odds given a lack of exposure

The odds of an event is the ratio of the number of times the event occurs to the number of times the event fails to occur

For example, if the probability of an event is 50%, then the odds are 1; in speech, people usually say that “the odds are 1 to 1”

If the probability of an event is 75%, then the odds are 3; “the odds are 3 to 1”
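In terms of a probability p, the definition above amounts to odds = p / (1 − p). A small sketch (the function names are illustrative, and exact fractions are used only to avoid rounding noise):

```python
from fractions import Fraction

def odds_from_probability(p):
    """Odds of an event that occurs with probability p."""
    return p / (1 - p)

def probability_from_odds(o):
    """Probability of an event whose odds are o."""
    return o / (1 + o)

print(odds_from_probability(Fraction(1, 2)))  # odds of 1, i.e. "1 to 1"
print(odds_from_probability(Fraction(3, 4)))  # odds of 3, i.e. "3 to 1"
print(probability_from_odds(Fraction(3)))     # back to a probability of 3/4
```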





The symmetry of the odds ratio

As advertised, the odds ratio possesses the symmetry that the relative risk does not

For example, in Lister’s experiment the odds of dying were 6/34 = .176 for the sterile group and 16/19 = .842 for the control group

The relative odds of dying with the control surgery is therefore .842/.176 = 4.77

On the other hand, the odds of surviving were 34/6 = 5.67 for the sterile group and 19/16 = 1.19 for the control group

The relative odds of surviving with the sterile surgery is therefore 5.67/1.19 = 4.77





An easier formula for the odds ratio

Summarizing this reasoning into a formula, if our table looks like

    a   b
    c   d

then

$$\mathrm{OR} = \frac{ad}{bc}$$

Because of this formula, the odds ratio was originally called the “cross-product ratio”
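A minimal sketch of the cross-product formula applied to Lister’s table, together with a check that it agrees with the ratio of odds computed the long way (the function name is an assumption):

```python
def odds_ratio(a, b, c, d):
    """Cross-product ratio ad / bc for the table [[a, b], [c, d]]."""
    return (a * d) / (b * c)

# Lister's table: rows are sterile/control, columns are survived/died
or_survival_sterile = odds_ratio(34, 6, 19, 16)

# The long way: odds of dying in the control group over odds of dying in the sterile group
or_death_control = (16 / 19) / (6 / 34)

# Swapping the rows gives the reciprocal odds ratio
or_survival_control = odds_ratio(19, 16, 34, 6)

print(f"{or_survival_sterile:.2f} {or_death_control:.2f} {or_survival_control:.3f}")
```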





There are two odds ratios

Keep in mind that there are two odds ratios, depending on how we ordered the rows and columns of the table, and that they will be reciprocals of one another

When calculating and interpreting odds ratios, be sure you know which group has the higher odds of developing the disease

In Lister’s experiment, the odds ratio for surviving with the sterile surgery was 4.77, but the odds ratio for surviving with the control surgery was 1/4.77 = 0.210

NOTE: When writing about an odds ratio less than 1, it is customary to write, for example, that “the sterile procedure reduced the odds of death by 79%”





Odds ratios and retrospective studies

The symmetry of the odds ratio works wonders when it comes to retrospective studies

So, in our case-control study of lung cancer and smoking, the odds ratio for smoking given lung cancer is

$$\mathrm{OR} = \frac{688 \cdot 59}{21 \cdot 650} = 2.97$$

However, this is also the odds ratio for lung cancer given smoking

This is a minor miracle: we have managed to obtain a prospective measure of association from a retrospective study!

Hence the popularity of the odds ratio: it can be used for any study design (prospective, retrospective, cross-sectional) that results in a 2x2 contingency table
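The equality of the two odds ratios can be verified directly from the case-control table. In the sketch below (variable names are illustrative), the first ratio compares the odds of smoking between cases and controls, and the second compares the in-sample odds of being a case between smokers and nonsmokers.

```python
smoker_cases, smoker_controls = 688, 650
nonsmoker_cases, nonsmoker_controls = 21, 59

# Odds of smoking among cases, divided by odds of smoking among controls
or_smoking_given_cancer = ((smoker_cases / nonsmoker_cases)
                           / (smoker_controls / nonsmoker_controls))

# Odds of being a case among smokers, divided by the same odds among nonsmokers
or_cancer_given_smoking = ((smoker_cases / smoker_controls)
                           / (nonsmoker_cases / nonsmoker_controls))

print(f"{or_smoking_given_cancer:.2f} {or_cancer_given_smoking:.2f}")
```

Both expressions reduce to the same cross-product, (688 · 59) / (21 · 650), which is why the direction of conditioning does not matter.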





Interpretation of odds ratios

One important caveat when dealing with odds ratios is that they tend to sound larger than they really are

For example, in the Nexium trial, the healing rates were 93% vs. 89%

The odds ratio, however, is 1.6

A 60% increase in the odds of healing sounds quite a bit more clinically significant than it really is
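A short calculation shows how the same two healing rates look under each measure of association. Only the two percentages come from the example above; the variable names are hypothetical.

```python
rate_high, rate_low = 0.93, 0.89  # healing rates from the example above

difference = rate_high - rate_low
relative_risk = rate_high / rate_low
odds_ratio = (rate_high / (1 - rate_high)) / (rate_low / (1 - rate_low))

print(f"difference in proportions: {difference:.2f}")
print(f"relative risk:             {relative_risk:.3f}")
print(f"odds ratio:                {odds_ratio:.2f}")
```

The odds ratio of roughly 1.6 corresponds to a difference of only 4 percentage points and a relative risk under 1.05.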





The χ²-test and measures of association

Note that when the difference between two proportions equals 0, the relative risk equals 1 and the odds ratio equals 1

Furthermore, when the relative risk of disease given exposure equals 1, the relative risk of exposure given disease equals 1

Thus, any one of these may be thought of as the null hypothesis of the χ²-test

This is why the null hypothesis is often loosely described as there being no “association” between exposure and disease





The odds ratio and the central limit theorem

So far in this class, we’ve calculated confidence intervals for averages and percentages

For both statistics, the central limit theorem guarantees that if the sample size is big enough, their sampling distribution will look relatively normal

The odds ratio, however, has no such guarantee





Simulation: The sampling distribution of the odds ratio

[Figure: histogram of simulated odds ratios; x-axis: odds ratio (0 to 4), y-axis: frequency]





The log transform

However, look at the sampling distribution of the logarithm of the odds ratio:

[Figure: histogram of simulated log odds ratios; x-axis: log odds ratio (−3 to 3), y-axis: frequency]





Confidence intervals for the log-odds

The log of the odds ratio (sometimes called the “log-odds”) is quite normal-looking and amenable to finding confidence intervals for

Thus, the procedure that has been developed for constructing approximate confidence intervals for the odds ratio actually constructs confidence intervals for the log of the odds ratio

Getting a confidence interval for the odds ratio itself then requires an extra step of converting the confidence interval back to the odds ratio scale





The standard error of the log-odds

It can be shown that a good estimate of the standard error of the log of the odds ratio is

$$\mathrm{SE}_{\mathrm{lOR}} = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}$$

where a, b, c, and d are the four entries in the contingency table





Confidence intervals for the odds ratio: procedure

The first three steps for constructing confidence intervals for the odds ratio should look familiar; the fourth will be new:

#1 Estimate the standard error of the log-odds,

$$\mathrm{SE}_{\mathrm{lOR}} = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}$$

#2 Determine the values that contain the middle x% of the normal curve, $\pm z_{x\%}$

#3 Calculate the confidence interval for the log-odds:

$$(L, U) = \left(\log(\mathrm{OR}) - z_{x\%} \cdot \mathrm{SE}_{\mathrm{lOR}},\ \log(\mathrm{OR}) + z_{x\%} \cdot \mathrm{SE}_{\mathrm{lOR}}\right)$$

#4 Convert the confidence interval from step 3 back into the odds ratio scale to obtain the confidence interval for the odds ratio:

$$(e^{L}, e^{U})$$

NOTE: “log” above refers to the natural log (sometimes abbreviated “ln”), not the base-10 logarithm




Confidence intervals for the odds ratio: example

To see how this works in practice, let’s calculate a 95% confidence interval for the odds ratio of surviving with sterile surgery in Lister’s experiment

#1 Estimate the standard error of the log-odds:

$$\mathrm{SE}_{\mathrm{lOR}} = \sqrt{\frac{1}{34} + \frac{1}{6} + \frac{1}{19} + \frac{1}{16}} = 0.56$$

#2 As usual, ±1.96 contains the middle 95% of the normal distribution

#3 Recall that the sample odds ratio was 4.77, so the log of the sample odds ratio is 1.56; the 95% confidence interval for the log odds ratio is therefore

$$(1.56 - 1.96(0.56),\ 1.56 + 1.96(0.56)) = (0.47, 2.66)$$





Confidence intervals for the odds ratio: example (cont’d)

#4 The 95% confidence interval for the odds ratio is therefore

$$(e^{0.47}, e^{2.66}) = (1.60, 14.2)$$

Note that the confidence interval doesn’t include 1; this agrees with our test of significance

Note also that this confidence interval is asymmetric (its right half is much longer than its left half); this would be impossible to achieve without the log transform

Now we have an idea of the possible clinical significance of sterile technique: it may be lowering the odds of surgical death by a factor of about 1.6, or by a factor of 14, with a factor of around 5 being the most likely
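The four steps of the procedure can be sketched as a single Python function. The function name and the default z = 1.96 (for a 95% interval) are assumptions made for illustration; the arithmetic follows the steps listed earlier.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Approximate confidence interval for the odds ratio of [[a, b], [c, d]]."""
    log_or = math.log((a * d) / (b * c))          # natural log of the odds ratio
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # step 1: SE of the log odds ratio
    lower = log_or - z * se                        # steps 2-3: interval on the log scale
    upper = log_or + z * se
    return math.exp(lower), math.exp(upper)        # step 4: back to the odds ratio scale

# Lister's table: rows are sterile/control, columns are survived/died
ci_low, ci_high = odds_ratio_ci(34, 6, 19, 16)
print(f"95% CI for the odds ratio: ({ci_low:.2f}, {ci_high:.1f})")
```

Because the interval is symmetric on the log scale and then exponentiated, it comes out asymmetric around the sample odds ratio of 4.77.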

