Hypothesis Testing - Categorical Data


Page 1: Hypothesis Testing - Categorical Data


Hypothesis Testing - Categorical Data

Ryan Miller

Page 2: Hypothesis Testing - Categorical Data


Introduction

Previously, we introduced hypothesis testing, a general statistical approach used to measure the compatibility of sample data with a null model. Hypothesis testing is a multi-step process:

1) Specify an appropriate null model
2) Find the corresponding null distribution
3) Use the null distribution to calculate a p-value
4) Use the p-value to make a decision regarding the plausibility of the null model

Our introduction glossed over Step #2 (and to some degree Step #1), which are arguably the most challenging aspects of hypothesis testing. This presentation will focus on those steps for categorical data.

Page 3: Hypothesis Testing - Categorical Data


Transmission Disequilibrium

- In genetics, a common question is whether a certain gene is linked to a certain trait
- This is a challenging question, as humans have over 30,000 genes and countless traits that likely depend upon numerous factors (gene-environment interactions, etc.)
- One approach is to find child-parent pairs where the child has the trait of interest and the parent is heterozygous for the gene of interest (i.e., they have one copy of each version of the gene)
- Under normal circumstances, the parent is equally likely to pass on either version of the gene
- Thus, if a gene is unrelated to the trait, we’d expect 50% of children with the trait to have either version of the gene


Page 6: Hypothesis Testing - Categorical Data


Type I Diabetes - Introduction

- A study published in Genetic Epidemiology collected data on 124 children with Type 1 diabetes whose parent was heterozygous for the gene FP’1
- Among these 124 children, 78 had the “class 1” version of FP’1, while 46 did not
- Is this sufficient evidence to link FP’1 with Type 1 diabetes?

Page 7: Hypothesis Testing - Categorical Data


Type I Diabetes - Hypothesis Testing

- Step #1: Specify an appropriate null model
  - The proportion of all children with Type 1 diabetes and a heterozygous parent that have the “class 1” version of FP’1 is 50%, expressed statistically as H0 : p = 0.5
- Step #2: Find the corresponding null distribution
  - Option #1: Simulation
  - Option #2: Binomial distribution
  - Option #3: Normal distribution

Page 11: Hypothesis Testing - Categorical Data


The Simulation Approach

- Simulation is a general approach that can be used to generate a null distribution in a wide variety of scenarios
- Proper use of this approach preserves as many aspects of the data as possible while satisfying the null model
- In our example, the observed data are 78 of 124 children (p̂ = 0.63) with the “class 1” gene and the null model is H0 : p = 0.5; how might you use simulation to generate a null distribution for this scenario?

Page 13: Hypothesis Testing - Categorical Data


The Simulation Approach

- Coin flips provide a suitable null model for the random process of each child having (or not having) the “class 1” gene
- Thus, the proportion of “heads” across a large number of sets of 124 coin flips can be used as the null distribution
- The rbinom() function allows us to simulate sets of coin flips; can you use it to generate a null distribution and find the p-value of this test?

Page 14: Hypothesis Testing - Categorical Data


The Simulation Approach (solution)

### Generate 1000 sets of n = 124 coin-flips
set.seed(123)
null_results <- rbinom(1000, size = 124, prob = 0.5)
hist(null_results/124, breaks = 15)

[Histogram of null_results/124 (x-axis: proportions from 0.35 to 0.65; y-axis: Frequency, 0 to 200)]

Page 15: Hypothesis Testing - Categorical Data


The Simulation Approach (solution)


## Calculate the two-sided p-value
upper_tail <- sum(null_results/124 >= 78/124)/1000
2*upper_tail

## [1] 0.004
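For readers who want to replicate the simulation outside of R, a minimal Python sketch follows (the variable names are mine, and 10,000 simulated sets are used rather than 1,000):

```python
import random

random.seed(123)

n, p_null, observed = 124, 0.5, 78
sims = 10_000

# Simulate many sets of 124 fair coin flips under H0: p = 0.5
null_results = [sum(random.random() < p_null for _ in range(n))
                for _ in range(sims)]

# Two-sided p-value: double the proportion of simulated results at least
# as extreme as the observed 78 successes
upper_tail = sum(r >= observed for r in null_results) / sims
p_value = 2 * upper_tail
print(p_value)  # a small p-value near the exact value of about 0.005
```

As with the R version, a different seed or number of simulations will move the p-value slightly.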

Page 16: Hypothesis Testing - Categorical Data


Simulation - Advantages/Disadvantages

Advantages:

- Extremely general
  - Can be used for lots of different statistical summaries, not just proportions/averages

Disadvantages:

- Approximate
  - Different simulation seeds will result in slightly different null distributions
- Computationally involved
  - Lots of simulations are needed to get a stable null distribution
- Requires creativity (sometimes)
  - It’s not always easy to determine how to generate data under the null model while preserving all of the key aspects of the original study


Page 18: Hypothesis Testing - Categorical Data


The Binomial Distribution Approach

- You may have recognized that simulating sets of 124 coin flips is unnecessary
- The binomial distribution can be used to find the exact probability of each possible result that could arise from this null model
- How might you find the two-sided p-value (for this scenario) using the binomial distribution (via the pbinom() function)?

upper_tail <- pbinom(78 - 1, size = 124, prob = .5,
                     lower.tail = FALSE)
2*upper_tail

## [1] 0.005161225

binom.test(78, n = 124)$p.value

## [1] 0.005161225


Page 20: Hypothesis Testing - Categorical Data


Comments on the use of pbinom()

- It might seem odd to subtract 1 from the number of “successes” in pbinom()
- But remember, the p-value calculation involves every possible outcome at least as extreme as the observed outcome
- The argument lower.tail = FALSE tells pbinom() to calculate P(X > x), which doesn’t include P(X = x)
- Thus, subtracting 1 is a quick fix that starts the summation at the observed number of successes
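To see concretely what that summation contains, here is a hedged Python sketch (not from the slides) that accumulates the exact binomial probabilities starting at the observed count:

```python
from math import comb

n, p0, observed = 124, 0.5, 78

# P(X >= 78) for X ~ Binomial(124, 0.5): the sum starts at the observed
# count itself, which is why the R call is pbinom(78 - 1, ...,
# lower.tail = FALSE)
upper_tail = sum(comb(n, k) * p0 ** k * (1 - p0) ** (n - k)
                 for k in range(observed, n + 1))

p_value = 2 * upper_tail
print(p_value)  # approximately 0.005161225, matching binom.test(78, n = 124)
```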

Page 21: Hypothesis Testing - Categorical Data


Binomial - Advantages/Disadvantages

Advantages:

- Exact
  - Eliminates the uncertainty/randomness involved in simulation

Disadvantages:

- Computationally involved (somewhat)
  - Behind the scenes, R is summing a lot of different binomial probabilities
- Not generalizable
  - Only useful when analyzing a single proportion


Page 23: Hypothesis Testing - Categorical Data


The Normal Distribution Approach (Z-Test)

- A final approach uses the Central Limit Theorem to come up with a Normal model for the null distribution
- Recall that if H0 : p = 0.5 and n = 124, then the CLT suggests:

p̂ ∼ N(0.5, √(0.5(1 − 0.5)/124))

Based upon this distribution, we might consider the standardized Z-value of our observed sample proportion:

Z = (p̂ − p0)/SE = (0.63 − 0.50)/√(0.5(1 − 0.5)/124) = 2.9

So, our sample proportion is 2.9 standard deviations higher than what we’d expect under the null model; let’s formalize how unusual this is with a p-value


Page 26: Hypothesis Testing - Categorical Data


The Normal Distribution Approach (Z-Test)

The Z-test in R:

Z <- 2.9
upper_tail <- pnorm(Z, mean = 0, sd = 1,
                    lower.tail = FALSE)
2*upper_tail

## [1] 0.003731627

Recognize we could do the same test without any standardization:

phat <- 0.63
upper_tail <- pnorm(phat, mean = 0.5,
                    sd = sqrt(.5*.5/124),
                    lower.tail = FALSE)
2*upper_tail

## [1] 0.003788718


Page 28: Hypothesis Testing - Categorical Data


Z-Test - Advantages/Disadvantages

Advantages:

- Generalizable
  - Can be used for proportions/averages and linear combinations of them
- Computationally easy
  - Statisticians could easily use the Z-test prior to modern computing

Disadvantages:

- Approximate
  - Uses a large-sample theoretical result that might not be accurate for finite samples


Page 30: Hypothesis Testing - Categorical Data


Comments on these Three Approaches

- In a real statistical analysis you’d choose only one of these three approaches to use/report
- Notice the p-values were very similar:
  - Simulation yielded p = 0.0040
  - The exact binomial test yielded p = 0.0051
  - The Z-test yielded p = 0.0038
- Regardless of the statistical test, the conclusion is that the “class 1” gene seems related to Type I diabetes
- These similarities shouldn’t be surprising
  - Simulation is an approximation of the exact binomial
  - The Central Limit Theorem normal result will be reasonably accurate when n∗p0 ≥ 10 and n∗(1 − p0) ≥ 10


Page 32: Hypothesis Testing - Categorical Data


Non-Binary Categorical Variables

- In the previous example, we were able to reduce the scenario to a test on a single proportion: the proportion of child-parent pairs where the child had the “class 1” gene
- However, not every application involving categorical data can be summarized using a single proportion
- Below is the distribution of correct answers for 400 randomly selected AP Stats Exam questions:

A   B   C   D   E
85  90  79  78  68

1. If correct AP Exam answers are truly random, what proportion of these answers would you expect to be “A’s”?
2. Would a z-test on the proportion of “A” answers in this sample provide enough evidence to decide if the AP Exam’s answers are randomly distributed?


Page 35: Hypothesis Testing - Categorical Data


AP Exam Answers

- First, you’d expect (on average) 1/5 of answers in a random sample to be “A”
- Second, a z-test based upon p̂A and the null proportion of 0.2 would not be sufficient - even if the proportion of “A” answers aligns with what we’d expect, the proportions of other answers might not
- Thus, you need four proportions to describe these five outcomes (why not a fifth?)

Page 36: Hypothesis Testing - Categorical Data


AP Exam Answers

- Four different hypothesis tests seem like overkill for such a simple frequency table
- Instead, it’s more sensible to do a single test of the hypotheses:

H0 : pA = pB = pC = pD = pE = 0.2
HA : pi ≠ 0.2 for at least one i ∈ {A, B, C, D, E}

- To see if we can come up with a test of this hypothesis, we’ll begin by assuming the null hypothesis is true
- So, had we randomly sampled 400 AP questions under this null hypothesis, what frequencies would you expect for each answer choice?

Page 37: Hypothesis Testing - Categorical Data


Expected Counts

- The most likely frequencies under the null hypothesis are called the expected counts
- For the AP Exam data, they are:

A   B   C   D   E
80  80  80  80  80

- In general, we calculate the expected count for each of the i possible categories as:

Expectedᵢ = n ∗ pᵢ

- This is easy with the AP Exam data because the proportions, pᵢ, are the same for every category (under the null hypothesis)
  - Note that this won’t always be the case


Page 39: Hypothesis Testing - Categorical Data


Chi-Square Testing

- To evaluate H0 : pA = pB = pC = pD = pE = 0.2 we can compare the observed counts with those we’d expect if the null hypothesis was true:

Answer          A   B   C   D   E
Expected Count  80  80  80  80  80
Observed Count  85  90  79  78  68

- In this framework, we seek to answer the question: “If the null hypothesis is true, do the observed counts deviate from the expected counts by more than we’d reasonably expect due to random chance?”
- Think about how you’d summarize the distance between the observed and expected counts
  - Is the distance between 79 and 80 the same as the distance between 80 and 79?
  - Is it the same as the distance between 4 and 5?


Page 41: Hypothesis Testing - Categorical Data


The Chi-Square Statistic

- We evaluate H0 (as previously defined) using the Chi-Square Test; the test statistic is given below:

χ² = Σᵢ (observedᵢ − expectedᵢ)² / expectedᵢ

- Like other test statistics, it compares the observed data to what we’d expect under the null hypothesis, while standardizing the differences
  - What’s different is that we must sum over the variable’s i categories
  - Also different is that the numerator is squared so that positive and negative deviations won’t cancel each other out


Page 43: Hypothesis Testing - Categorical Data


The Chi-Square Distribution

- The Chi-Square test requires us to learn a new distribution, the χ² curve
- Fortunately, the χ² distribution is related to the standard normal distribution
  - Suppose we generated lots of data from the standard normal distribution; the histogram of these data would look like the normal curve
  - Now suppose we took these observations and squared them; this histogram looks like the χ² curve (with df = 1)
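This squaring relationship is easy to verify by simulation. The Python sketch below (illustrative, not from the slides) squares a large sample of standard normal draws and checks that their empirical 95th percentile lands near the χ² (df = 1) critical value of roughly 3.84:

```python
import random

random.seed(1)

# Square a large sample of standard normal draws
z_squared = sorted(random.gauss(0, 1) ** 2 for _ in range(100_000))

# Their empirical 95th percentile should approximate the chi-square
# (df = 1) critical value, which equals 1.96^2, about 3.84
q95 = z_squared[int(0.95 * len(z_squared))]
print(round(q95, 2))
```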


Page 45: Hypothesis Testing - Categorical Data


The Chi-Square Distribution

[Figure: the standard normal density (x-axis: Z) with Area = 0.05 shaded, alongside the Chi-Square density with df = 1 (x-axis: Z^2) with Area = 0.05 shaded]

Page 46: Hypothesis Testing - Categorical Data


The Chi-Square Distribution

- The relationship between the χ² distribution and the normal distribution is clearly illustrated by looking at the test statistic for the Z-test:

z = (observed − null value)/SE

z² = (observed − null value)²/SE²

χ² = Σ (observed − expected count)²/expected count

- Essentially, the χ² test is just a squared version of the z-test
  - This makes the test naturally two-sided, even though we only calculate p-values using the right tail of the χ² curve
  - Under H0, the SE of each category count is approximately the square root of the expected value of that count


Page 48: Hypothesis Testing - Categorical Data


Degrees of Freedom

- There are many different χ² distributions depending upon how many unique categories we must sum over
- Letting k denote the number of categories of a categorical variable, the χ² test statistic for testing a single categorical variable has k − 1 degrees of freedom
  - This is because the category proportions are constrained to sum to 1
- The mean and standard deviation of the χ² curve both depend upon its degrees of freedom
- We can use pchisq() to calculate areas under the various different χ² curves in R
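As a sanity check on these tail areas, the χ² upper-tail probability has a simple closed form when the degrees of freedom are even. The Python helper below (the function name is mine) uses the identity P(χ² > x) = e^(−x/2) Σ (x/2)^k / k!, summing k from 0 to df/2 − 1, and reproduces the p-value from the AP Exam example:

```python
from math import exp, factorial

def chi2_upper_tail(x, df):
    """P(X > x) for X ~ chi-square with an EVEN number of df,
    via the Poisson-sum identity."""
    assert df % 2 == 0, "this closed form only covers even df"
    return exp(-x / 2) * sum((x / 2) ** k / factorial(k)
                             for k in range(df // 2))

# AP Exam example: statistic 3.425 on k - 1 = 4 degrees of freedom
p = chi2_upper_tail(3.425, df=4)
print(round(p, 4))  # 0.4894, agreeing with pchisq(3.425, df = 4, lower.tail = FALSE)
```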


Page 50: Hypothesis Testing - Categorical Data


Performing the Chi-Square Test (Quick Example)

1. State the null hypothesis:

H0 : pA = pB = pC = pD = pE = 0.2

2. Calculate the expected counts under the null:

EA = 0.2 ∗ 400 = 80, EB = 0.2 ∗ 400 = 80, . . .

3. Calculate the χ² test statistic:

χ² = Σᵢ (observedᵢ − expectedᵢ)² / expectedᵢ
   = (85 − 80)²/80 + (90 − 80)²/80 + (79 − 80)²/80 + (78 − 80)²/80 + (68 − 80)²/80
   = 3.425

4. Locate the χ² test statistic on the χ² distribution with k − 1 degrees of freedom to find the p-value: p = 0.49


Page 54: Hypothesis Testing - Categorical Data


Chi-Squared Test in R

Using the χ² test statistic:

pchisq(3.425, df = 4, lower.tail = FALSE)

## [1] 0.4893735

Using the sample data directly:

observed <- c(85, 90, 79, 78, 68)
chisq.test(observed, p = c(.2, .2, .2, .2, .2))

##
## Chi-squared test for given probabilities
##
## data: observed
## X-squared = 3.425, df = 4, p-value = 0.4894

Page 55: Hypothesis Testing - Categorical Data


Another Example

- Pools of prospective jurors are supposed to be drawn at random from the eligible adults in that community
- The American Civil Liberties Union (ACLU) studied the racial composition of the jury pools for a sample of 10 trials in Alameda County, California
- The 1453 individuals included in these jury pools are summarized below. For comparison, census data describing the eligible jurors in the county is included

Race/Ethnicity        White  Black  Hispanic  Asian  Other
Number in jury pools  780    117    114       384    58
Census percentage     54%    18%    12%       15%    1%

Directions: Use a Chi-Squared test to determine whether the racial composition of jury pools in Alameda County differs from what is expected based upon the census


Page 57: Hypothesis Testing - Categorical Data


Example - Solution

H0 : pw = 0.54, pb = 0.18, ph = 0.12, pa = 0.15, po = 0.01
HA : At least one pi differs from those specified in H0

Race/Ethnicity   White             Black             Hispanic          Asian            Other
Observed Count   780               117               114               384              58
Expected Count   1453∗.54 = 784.6  1453∗.18 = 261.5  1453∗.12 = 174.4  1453∗.15 = 218   1453∗.01 = 14.5

χ² = Σᵢ (observedᵢ − expectedᵢ)² / expectedᵢ
   = (780 − 784.6)²/784.6 + (117 − 261.5)²/261.5 + (114 − 174.4)²/174.4 + (384 − 218)²/218 + (58 − 14.5)²/14.5
   = 357

- The p-value of this test is near zero and provides strong evidence that the jury pools don’t match the racial proportions of the census
- Comparing the observed vs. expected counts, it appears that Blacks and Hispanics are underrepresented while Asians and Others are overrepresented in the jury pools
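The hand computation above can also be delegated to a few lines of code; this Python sketch (variable names are illustrative) reproduces the statistic:

```python
observed = [780, 117, 114, 384, 58]            # White, Black, Hispanic, Asian, Other
census_props = [0.54, 0.18, 0.12, 0.15, 0.01]
n = sum(observed)                              # 1453 prospective jurors

# Expected counts under H0 come from the census percentages
expected = [n * p for p in census_props]

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_sq))  # about 357, on k - 1 = 4 degrees of freedom
```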

Page 58: Hypothesis Testing - Categorical Data


Summary

- This presentation covered methods for hypothesis tests involving a single categorical variable
- If we could summarize the variable using a single proportion, we could use:
  - Simulation
  - Exact Binomial
  - Z-test
- If summarizing the variable requires multiple proportions, we might use:
  - Simulation (not shown)
  - Exact Multinomial (not shown)
  - Chi-Squared Goodness of Fit test
- The next presentation will cover methods for testing relationships between two categorical variables