categorical data analysis ii lecture 14 - github pages


Categorical Data Analysis II

14.1

Lecture 14
Categorical Data Analysis II
Text: Chapter 10

STAT 8020 Statistical Methods II
October 8, 2020

Whitney Huang
Clemson University


Hypothesis Testing for p

1. State the null and alternative hypotheses:

$H_0: p = p_0$ vs. $H_a: p > p_0$, $p \neq p_0$, or $p < p_0$

2. Compute the test statistic:

$$z_{obs} = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0(1 - p_0)}{n}}}$$

3. Make the decision of the test: Rejection Region / P-Value Methods

4. Draw the conclusion of the test:

We (do/do not) have enough statistical evidence to conclude that ($H_a$ in words) at the $\alpha$ significance level.


Bird Flu Example Revisited

Among 900 randomly selected registered voters nationwide, 63% of them are somewhat or very concerned about the spread of bird flu in the United States. Conduct a hypothesis test at the .01 level to assess the research hypothesis: p > .6.
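The four testing steps can be sketched in Python for this example. This is a minimal illustration using only the standard library, not the course's own code; the variable names are chosen here for readability.

```python
from math import sqrt
from statistics import NormalDist

# Values taken from the bird flu example
n = 900        # sample size
p_hat = 0.63   # sample proportion
p0 = 0.60      # null value in H0: p = p0
alpha = 0.01   # significance level

# Test statistic: the standard error is evaluated at p0, not at p_hat
z_obs = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

std_normal = NormalDist()
z_crit = std_normal.inv_cdf(1 - alpha)   # upper-tail critical value for Ha: p > p0
p_value = 1 - std_normal.cdf(z_obs)      # upper-tail P-value

reject_h0 = z_obs > z_crit
print(f"z_obs = {z_obs:.3f}, z_crit = {z_crit:.3f}, p-value = {p_value:.4f}")
print("Reject H0" if reject_h0 else "Fail to reject H0")
```

With these numbers z_obs is about 1.84, which is below the critical value of about 2.33, so the rejection region and P-value methods both lead to not rejecting $H_0$ at the .01 level.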


Recap: Inference for p

Point estimate:

$$\hat{p} = \frac{x}{n}$$

where x is the number of “successes” in a sample with sample size n, and the probability of success, p, is the parameter of interest

100(1 − α)% confidence interval:

$$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$

Hypothesis Testing:

$H_0: p = p_0$ vs. $H_a: p > p_0$, $p \neq p_0$, or $p < p_0$

$$z^* = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0(1 - p_0)}{n}}}$$

Under $H_0: p = p_0$, $z^* \sim N(0, 1)$


Another CI for p: Wilson Score Confidence Interval

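The actual coverage probability of the 100(1 − α)% CI

$$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$

usually falls below $1 - \alpha$.

E.B. Wilson proposed one solution in 1927.

Idea: Solve

$$\frac{p - \hat{p}}{\sqrt{\dfrac{p(1 - p)}{n}}} = \pm z_{\alpha/2}$$

for p

$$\Rightarrow (p - \hat{p})^2 = z_{\alpha/2}^2\,\frac{p(1 - p)}{n}$$

100(1 − α)% Wilson Score Confidence Interval:

$$\frac{X + \dfrac{z_{\alpha/2}^2}{2}}{n + z_{\alpha/2}^2} \pm \frac{z_{\alpha/2}}{n + z_{\alpha/2}^2}\sqrt{\frac{X(n - X)}{n} + \frac{z_{\alpha/2}^2}{4}}$$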


Example

Suppose we would like to estimate p, the probability of being vegetarian (among all CU students). We take a sample with sample size n = 25 and none of them are vegetarian (i.e., x = 0). Construct a 95% CI for p.
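For this example, the Wald and Wilson score intervals can be compared numerically. The sketch below implements the two formulas from the preceding slides with the Python standard library. The function names `wald_ci` and `wilson_ci` are illustrative, and clipping the Wilson endpoints to [0, 1] is an added convenience, not part of the slide formula.

```python
from math import sqrt
from statistics import NormalDist

def wald_ci(x, n, conf=0.95):
    """Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    p_hat = x / n
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

def wilson_ci(x, n, conf=0.95):
    """Wilson score interval, written in terms of the count x."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    center = (x + z**2 / 2) / (n + z**2)
    half = z / (n + z**2) * sqrt(x * (n - x) / n + z**2 / 4)
    # Clip to [0, 1] to absorb floating-point round-off at the boundary
    return max(0.0, center - half), min(1.0, center + half)

wald = wald_ci(0, 25)
wilson = wilson_ci(0, 25)
print(f"Wald:   ({wald[0]:.4f}, {wald[1]:.4f})")
print(f"Wilson: ({wilson[0]:.4f}, {wilson[1]:.4f})")
```

With x = 0 the Wald interval collapses to the single point 0, while the Wilson interval runs from 0 to roughly 0.13.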


Rule of Three: An Approximate 95% CI for p When $\hat{p} = 0$ or 1

When $\hat{p} = 0$, we have

$$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} = 0 \pm z_{\alpha/2} \times 0 = (0, 0)$$

Similarly, when $\hat{p} = 1$, we have

$$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} = 1 \pm z_{\alpha/2} \times 0 = (1, 1)$$

These Wald CIs degenerate to a single point, which does not reflect the estimation uncertainty. Here we could apply the rule of three to obtain an approximate 95% CI:

$(0, 3/n)$, if $\hat{p} = 0$

$(1 - 3/n, 1)$, if $\hat{p} = 1$
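The rule of three can be sketched and checked against the bound it approximates. The justification used here is a standard one that is not stated on the slide: when x = 0, the largest p still compatible with the data at the 5% level solves $(1 - p)^n = 0.05$, so $p = 1 - 0.05^{1/n} \approx -\ln(0.05)/n \approx 3/n$. The function names are illustrative.

```python
def rule_of_three_ci(x, n):
    """Approximate 95% CI for p when the sample proportion is 0 or 1."""
    if x == 0:
        return (0.0, 3 / n)
    if x == n:
        return (1 - 3 / n, 1.0)
    raise ValueError("rule of three applies only when x = 0 or x = n")

def exact_upper_bound(n, alpha=0.05):
    """Solve (1 - p)^n = alpha for p (the bound the rule approximates)."""
    return 1 - alpha ** (1 / n)

# Vegetarian example: n = 25, x = 0
rule = rule_of_three_ci(0, 25)
exact = exact_upper_bound(25)
print(f"Rule of three: (0, {rule[1]:.4f}); exact bound: {exact:.4f}")
```

For n = 25 the rule gives an upper limit of 0.12, close to the exact bound of about 0.113.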


Comparing Two Population Proportions p1 and p2

We are often interested in comparing two groups, e.g., does a particular treatment increase the survival probability for cancer patients?

We would like to infer $p_1 - p_2$, the difference between two population proportions ⇒ point estimate, interval estimate, hypothesis testing


Notation

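Parameters

$p_1$, $p_2$: population proportions

$p_1 - p_2$: the difference between two population proportions

Sample Statistics

$n_1$, $n_2$: sample sizes

$\hat{p}_1 = \frac{x_1}{n_1}$, $\hat{p}_2 = \frac{x_2}{n_2}$: sample proportions

$$\Rightarrow \hat{p}_1 - \hat{p}_2 = \frac{x_1}{n_1} - \frac{x_2}{n_2}$$

$$se(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$$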


Point/Interval Estimation for p1 − p2

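Point estimate:

$$\hat{p}_1 - \hat{p}_2 = \frac{X_1}{n_1} - \frac{X_2}{n_2}$$

100(1 − α)% CI based on CLT:

$$\hat{p}_1 - \hat{p}_2 \pm z_{\alpha/2}\sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$$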


Hypothesis Testing for p1 − p2

1. State the null and alternative hypotheses:

$H_0: p_1 - p_2 = 0$ vs. $H_a: p_1 - p_2 > 0$, $p_1 - p_2 \neq 0$, or $p_1 - p_2 < 0$

2. Compute the test statistic:

$$z_{obs} = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\dfrac{\bar{p}(1 - \bar{p})}{n_1} + \dfrac{\bar{p}(1 - \bar{p})}{n_2}}},$$

where $\bar{p} = \frac{x_1 + x_2}{n_1 + n_2}$

3. Make the decision of the test: Rejection Region / P-Value Methods

4. Draw the conclusion of the test:

We (do/do not) have enough statistical evidence to conclude that ($H_a$ in words) at the $\alpha$ significance level.


Example

A simple random sample of 100 CU graduate students is taken and it is found that 79 “strongly agree” that they would recommend their current graduate program. A simple random sample of 85 USC graduate students is taken and it is found that 52 “strongly agree” that they would recommend their current graduate program. At the 5% level, can we conclude that the proportion of “strongly agree” is higher at CU?
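A possible way to carry out this test is sketched below, using the pooled test statistic from the previous slide. The helper name `two_proportion_z` is hypothetical, and the code is an illustration with the standard library, not the course's own solution.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2):
    """Pooled z statistic for H0: p1 - p2 = 0."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    p_bar = (x1 + x2) / (n1 + n2)   # pooled proportion under H0
    se = sqrt(p_bar * (1 - p_bar) / n1 + p_bar * (1 - p_bar) / n2)
    return (p1_hat - p2_hat) / se

# Group 1 = CU (79 of 100), group 2 = USC (52 of 85); Ha: p1 - p2 > 0
alpha = 0.05
z_obs = two_proportion_z(79, 100, 52, 85)
z_crit = NormalDist().inv_cdf(1 - alpha)   # upper-tail critical value
p_value = 1 - NormalDist().cdf(z_obs)      # upper-tail P-value
reject_h0 = z_obs > z_crit

print(f"z_obs = {z_obs:.3f}, z_crit = {z_crit:.3f}, p-value = {p_value:.4f}")
```

Here z_obs is about 2.66, which exceeds the critical value of about 1.645, so the data support a higher proportion at CU at the 5% level.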


Binomial Experiments and Inference for p

Fixed number of n trials (sample size), each trial is an independent event (simple random sample)

Binary outcomes (“success/failure”), where the probability of success, p, for each trial is constant

The number of successes $X \sim \text{Bin}(n, p)$

We use a random sample x to infer p, the population proportion, using $\hat{p} = \frac{x}{n}$


Multinomial Experiments and Inference for $p = (p_1, \cdots, p_K)$

Fixed number of n trials, each trial is an independent event

K possible outcomes, each with probability $p_k$, $k = 1, \cdots, K$, where $\sum_{k=1}^{K} p_k = 1$

$(X_1, X_2, \cdots, X_K) \sim \text{Multi}(n, p_1, p_2, \cdots, p_K)$

We use a random sample $x = (x_1, x_2, \cdots, x_K)$ to infer $\{p_k\}_{k=1}^{K}$, the event probabilities

Question: How many parameters here?


Example: Multinomial Probability

Suppose that in a three-way election for a large country, candidate 1 received 20% of the votes, candidate 2 received 35% of the votes, and candidate 3 received 45% of the votes. If ten voters are selected randomly, what is the probability that there will be exactly two supporters for candidate 1, three supporters for candidate 2 and five supporters for candidate 3 in the sample?

$$P(X_1 = 2, X_2 = 3, X_3 = 5) = \frac{10!}{2!\,3!\,5!}(0.2)^2(0.35)^3(0.45)^5 \approx 0.08$$
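The multinomial probability above can be checked with a few lines of Python. The function `multinomial_pmf` is an illustrative helper written for this note, not a library call.

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """P(X_1 = x_1, ..., X_K = x_K) for a multinomial with n = sum(counts)."""
    n = sum(counts)
    coef = factorial(n)
    for x in counts:
        coef //= factorial(x)   # each step leaves an integer: n! / (x_1! ... x_K!)
    return coef * prod(p**x for p, x in zip(probs, counts))

# Election example: two, three and five supporters among ten sampled voters
prob = multinomial_pmf([2, 3, 5], [0.20, 0.35, 0.45])
print(f"P(X1 = 2, X2 = 3, X3 = 5) = {prob:.4f}")
```

The coefficient is 10!/(2! 3! 5!) = 2520, and the probability comes out just under 0.08, matching the slide.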


Example: Estimating Multinomial Parameters

Suppose we randomly select ten voters and find two supporters for candidate 1, three supporters for candidate 2 and five supporters for candidate 3 in the sample. What would be our best guess for the population proportion of votes each candidate would receive?
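One natural answer is the vector of sample proportions $\hat{p}_k = x_k / n$, which is also the maximum likelihood estimate for a multinomial model. The short sketch below computes it; the variable names are illustrative.

```python
# Observed counts for candidates 1, 2 and 3 in the sample of ten voters
counts = [2, 3, 5]
n = sum(counts)

# Sample proportions: the estimate of (p_1, p_2, p_3)
p_hat = [x / n for x in counts]
print(p_hat)

# The estimates sum to 1, so only K - 1 = 2 of them are free parameters
total = sum(p_hat)
```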


Pearson’s $\chi^2$ Test

The Hypotheses:

$H_0: p_1 = p_{1,0};\ p_2 = p_{2,0};\ \cdots;\ p_K = p_{K,0}$

$H_a$: At least one is different

The Test Statistic:

$$\chi^2_* = \sum_{k=1}^{K} \frac{(O_k - E_k)^2}{E_k},$$

where $O_k$ is the observed frequency for the kth event and $E_k$ is the expected frequency under $H_0$

The Null Distribution: $\chi^2_* \sim \chi^2_{df = K - 1}$

Assumption: $np_k > 5$, $k = 1, \cdots, K$


$\chi^2$-Distribution


Example: Testing Mendel’s Theories (pp 22–23, “Categorical Data Analysis” 2nd Ed by Alan Agresti)

“Among its many applications, Pearson’s test was used in genetics to test Mendel’s theories of natural inheritance. Mendel crossed pea plants of pure yellow strain (dominant strain) with plants of pure green strain. He predicted that second generation hybrid seeds would be 75% yellow and 25% green. One experiment produced n = 8023 seeds, of which $X_1 = 6022$ were yellow and $X_2 = 2001$ were green.”

Use Pearson’s $\chi^2$ test to assess Mendel’s hypothesis.
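A minimal sketch of this test in Python follows. It uses a fact that holds only for df = 1: a $\chi^2_1$ variable is the square of a standard normal, so the P-value is `erfc(sqrt(x / 2))`. That shortcut is an assumption of this sketch and does not extend to other degrees of freedom.

```python
from math import erfc, sqrt

observed = [6022, 2001]      # yellow, green
null_probs = [0.75, 0.25]    # Mendel's predicted proportions under H0

n = sum(observed)
expected = [n * p for p in null_probs]   # E_k = n * p_{k,0}

# Pearson statistic: sum of (O_k - E_k)^2 / E_k
chi2_obs = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1

# Upper-tail P-value, valid for df = 1 only
p_value = erfc(sqrt(chi2_obs / 2))

print(f"expected = {expected}, chi2 = {chi2_obs:.4f}, df = {df}, p-value = {p_value:.2f}")
```

The statistic is about 0.015 with a P-value near 0.90, so the data are consistent with Mendel's 75%/25% prediction.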


Color Preference Example

In child psychology, color preference by young children is used as an indicator of emotional state. In a study of 112 children, each was asked to choose a “favorite” color from the 7 colors indicated below. Test if there is evidence of a preference at the 5% level.

| Color | Blue | Red | Green | White | Purple | Black | Other |
|---|---|---|---|---|---|---|---|
| Frequency | 13 | 14 | 8 | 17 | 25 | 15 | 20 |
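One way to set this up is as a Pearson goodness-of-fit test. The sketch below assumes that "no preference" means all seven colors are equally likely ($p_k = 1/7$), which is an interpretation rather than something stated on the slide. The critical value 12.592 is the standard $\chi^2$ table entry for α = 0.05 and df = 6, hard-coded because the Python standard library has no $\chi^2$ quantile function.

```python
observed = {"Blue": 13, "Red": 14, "Green": 8, "White": 17,
            "Purple": 25, "Black": 15, "Other": 20}

n = sum(observed.values())   # 112 children
k = len(observed)            # 7 colors
expected = n / k             # equal expected count under H0: p_k = 1/7

# Pearson statistic: sum of (O_k - E_k)^2 / E_k
chi2_obs = sum((o - expected) ** 2 / expected for o in observed.values())
df = k - 1

critical_value = 12.592      # chi-square table value, alpha = 0.05, df = 6
reject_h0 = chi2_obs > critical_value

print(f"chi2 = {chi2_obs:.2f}, df = {df}, reject H0: {reject_h0}")
```

Under this setup each expected count is 16 and the statistic is 11.0, below the critical value, so there is not enough evidence of a color preference at the 5% level.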


An Example of Bivariate Categorical Data

A psychologist is interested in whether or not handedness is related to gender. She collected data on handedness for 100 individuals and the data set is summarized in the table below

| | Right-handed | Left-handed | Total |
|---|---|---|---|
| Males | 43 | 9 | 52 |
| Females | 44 | 4 | 48 |
| Total | 87 | 13 | 100 |

Grand total: 100

Marginal total for males: 52

Marginal total for females: 48

Marginal total for right-handed: 87

Marginal total for left-handed: 13

This is an example of a contingency table


Contingency Tables

Bivariate categorical data is typically displayed in a contingency table

The number in each cell is the frequency for each category level combination

Contingency table for the previous example:

| | Right-handed | Left-handed | Total |
|---|---|---|---|
| Males | 43 | 9 | 52 |
| Females | 44 | 4 | 48 |
| Total | 87 | 13 | 100 |

For a given contingency table, we want to test whether the two variables have a relationship ⇒ $\chi^2$-Test


$\chi^2$-Test for Independence

1. Define the null and alternative hypotheses:

$H_0$: there is no relationship between the 2 variables

$H_a$: there is a relationship between the 2 variables

2. (If necessary) Calculate the marginal totals, and the grand total

3. Calculate the expected cell frequencies:

$$\text{Expected cell frequency} = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}}$$

4. Calculate the partial $\chi^2$ values ($\chi^2$ value for each cell of the table):

$$\text{Partial } \chi^2 \text{ value} = \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$


$\chi^2$-Test for Independence Cont’d

5. Calculate the $\chi^2$ statistic:

$$\chi^2_{obs} = \sum \text{partial } \chi^2 \text{ value}$$

6. Calculate the degrees of freedom (df):

$$df = (\#\text{ of rows} - 1) \times (\#\text{ of columns} - 1)$$

7. Find the $\chi^2$ critical value with respect to $\alpha$

8. Draw the conclusion:

Reject $H_0$ if $\chi^2_{obs}$ is bigger than the $\chi^2$ critical value ⇒ There is statistical evidence that there is a relationship between the two variables at the $\alpha$ level
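Steps 2 through 6 can be sketched as a small Python function and tried on the handedness table from the earlier slide. The function name `chi2_independence` is hypothetical. For steps 7 and 8 the sketch assumes α = 0.05 (the slide does not fix a level for this table) and hard-codes the table value $\chi^2_{0.05, df=1} = 3.841$.

```python
def chi2_independence(table):
    """Return expected counts, partial chi-square values, statistic and df."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    # Step 3: expected = row total * column total / grand total
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

    # Step 4: partial chi-square value for each cell
    partial = [[(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
               for obs_row, exp_row in zip(table, expected)]

    # Steps 5 and 6: statistic and degrees of freedom
    chi2 = sum(sum(row) for row in partial)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return expected, partial, chi2, df

# Rows: males, females; columns: right-handed, left-handed
handedness = [[43, 9], [44, 4]]
expected, partial, chi2_obs, df = chi2_independence(handedness)

critical_value = 3.841   # assumed alpha = 0.05, df = 1, from a chi-square table
reject_h0 = chi2_obs > critical_value
print(f"chi2 = {chi2_obs:.3f}, df = {df}, reject H0: {reject_h0}")
```

For this table the statistic is about 1.78, so under the assumed 5% level $H_0$ would not be rejected.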


Handedness/Gender Example Revisited

| | Right-handed | Left-handed | Total |
|---|---|---|---|
| Males | 43 | 9 | 52 |
| Females | 44 | 4 | 48 |
| Total | 87 | 13 | 100 |

Is the percentage of left-handed men in the population different from the percentage of left-handed women?
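This question can also be treated as a two-sided comparison of two proportions. The sketch below applies the pooled z-test to the left-handed counts. For a 2×2 table the square of this z statistic equals the Pearson $\chi^2$ statistic, a known identity that the code prints so it can be compared with a $\chi^2$ calculation on the same table.

```python
from math import sqrt
from statistics import NormalDist

# Left-handed counts and group sizes from the table
x1, n1 = 9, 52    # males
x2, n2 = 4, 48    # females

p1_hat, p2_hat = x1 / n1, x2 / n2
p_bar = (x1 + x2) / (n1 + n2)   # pooled proportion under H0: p1 = p2

se = sqrt(p_bar * (1 - p_bar) * (1 / n1 + 1 / n2))
z_obs = (p1_hat - p2_hat) / se

# Two-sided P-value for Ha: p1 != p2
p_value = 2 * (1 - NormalDist().cdf(abs(z_obs)))

print(f"z = {z_obs:.3f}, z^2 = {z_obs**2:.3f}, p-value = {p_value:.3f}")
```

The result is z of about 1.33 (so z² is about 1.78) with a two-sided P-value near 0.18, which is not significant at conventional levels.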


Example

A 2011 study was conducted in Kalamazoo, Michigan. The objective was to determine if parents’ marital status affects children’s marital status later in their life. In total, 2,000 children were interviewed. The columns refer to the parents’ marital status. Use the contingency table below to conduct a $\chi^2$ test from beginning to end. Use α = .10

| (Observed) | Married | Divorced | Total |
|---|---|---|---|
| Married | 581 | 487 | |
| Divorced | 455 | 477 | |
| Total | | | |


Example Cont’d

1. Define the null and alternative hypotheses:

$H_0$: there is no relationship between parents’ marital status and children’s marital status.

$H_a$: there is a relationship between parents’ marital status and children’s marital status

2. Calculate the marginal totals, and the grand total

| (Observed) | Married | Divorced | Total |
|---|---|---|---|
| Married | 581 | 487 | 1068 |
| Divorced | 455 | 477 | 932 |
| Total | 1036 | 964 | 2000 |


Example Cont’d

3. Calculate the expected cell counts

| (Expected) | Married | Divorced |
|---|---|---|
| Married | $\frac{1068 \times 1036}{2000} = 553.224$ | $\frac{1068 \times 964}{2000} = 514.776$ |
| Divorced | $\frac{932 \times 1036}{2000} = 482.776$ | $\frac{932 \times 964}{2000} = 449.224$ |

4. Calculate the partial $\chi^2$ values

| partial $\chi^2$ | Married | Divorced |
|---|---|---|
| Married | $\frac{(581 - 553.224)^2}{553.224} = 1.39$ | $\frac{(487 - 514.776)^2}{514.776} = 1.50$ |
| Divorced | $\frac{(455 - 482.776)^2}{482.776} = 1.60$ | $\frac{(477 - 449.224)^2}{449.224} = 1.72$ |

5. Calculate the $\chi^2$ statistic

$\chi^2 = 1.39 + 1.50 + 1.60 + 1.72 = 6.21$

6. Calculate the degrees of freedom (df)

The df is $(2 - 1) \times (2 - 1) = 1$

7. Find the $\chi^2$ critical value with respect to α from the $\chi^2$ table

The $\chi^2_{\alpha = 0.1, df = 1} = 2.71$

8. Draw your conclusion:

We reject $H_0$ and conclude that there is a relationship between parents’ marital status and children’s marital status.


Example

The following contingency table contains enrollment data for a random sample of students from several colleges at Purdue University during the 2006-2007 academic year. The table lists the number of male and female students enrolled in each college. Use the two-way table to conduct a $\chi^2$ test from beginning to end. Use α = .01

| (Observed) | Female | Male | Total |
|---|---|---|---|
| Liberal Arts | 378 | 262 | 640 |
| Science | 99 | 175 | 274 |
| Engineering | 104 | 510 | 614 |
| Total | 581 | 947 | 1528 |


Example Cont’d

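| (Expected) | Female | Male |
|---|---|---|
| Liberal Arts | $\frac{640 \times 581}{1528} = 243.35$ | $\frac{640 \times 947}{1528} = 396.65$ |
| Science | $\frac{274 \times 581}{1528} = 104.18$ | $\frac{274 \times 947}{1528} = 169.82$ |
| Engineering | $\frac{614 \times 581}{1528} = 233.46$ | $\frac{614 \times 947}{1528} = 380.54$ |

| partial $\chi^2$ | Female | Male |
|---|---|---|
| Lib Arts | $\frac{(378 - 243.35)^2}{243.35} = 74.50$ | $\frac{(262 - 396.65)^2}{396.65} = 45.71$ |
| Sci | $\frac{(99 - 104.18)^2}{104.18} = 0.26$ | $\frac{(175 - 169.82)^2}{169.82} = 0.16$ |
| Eng | $\frac{(104 - 233.46)^2}{233.46} = 71.79$ | $\frac{(510 - 380.54)^2}{380.54} = 44.05$ |

$\chi^2 = 74.50 + 45.71 + 0.26 + 0.16 + 71.79 + 44.05 = 236.47$

The df = $(3 - 1) \times (2 - 1) = 2$ ⇒ Critical value $\chi^2_{\alpha = .01, df = 2} = 9.21$

Therefore we reject $H_0$ (at .01 level) and conclude that there is a relationship between gender and major.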
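The hand calculation above can be reproduced with a short script. This is an illustrative Python sketch rather than the R code used in the lecture. The P-value line relies on a fact specific to df = 2, where the $\chi^2$ upper-tail probability is $e^{-x/2}$.

```python
from math import exp

# Observed counts by college: (female, male)
observed = {
    "Liberal Arts": (378, 262),
    "Science": (99, 175),
    "Engineering": (104, 510),
}

col_totals = [sum(col) for col in zip(*observed.values())]   # [581, 947]
grand_total = sum(col_totals)                                # 1528

chi2_obs = 0.0
for college, row in observed.items():
    row_total = sum(row)
    for o, col_total in zip(row, col_totals):
        e = row_total * col_total / grand_total   # expected cell count
        chi2_obs += (o - e) ** 2 / e              # partial chi-square value

df = (len(observed) - 1) * (len(col_totals) - 1)
p_value = exp(-chi2_obs / 2)   # upper-tail probability, valid for df = 2 only

print(f"chi2 = {chi2_obs:.2f}, df = {df}, p-value = {p_value:.3g}")
```

The script gives a statistic of about 236.47 on 2 degrees of freedom, far above the critical value 9.21, in agreement with the conclusion on the slide.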


R Code & Output


Take Another Look at the Example

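| (Proportion) | Female | Male | Total |
|---|---|---|---|
| Liberal Arts | .59 (.65) | .41 (.28) | (.42) |
| Science | .36 (.17) | .64 (.18) | (.18) |
| Engineering | .17 (.18) | .83 (.54) | (.40) |
| Total | .38 | .62 | 1 |

Entries are row proportions (the share of each college that is female or male); values in parentheses are column proportions (the share of each gender enrolled in that college).

Rejecting $H_0$ ⇒ conditional probabilities are not consistent with marginal probabilities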


Example: Comparing Two Population Proportions

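Let $p_1 = P(\text{Female} \mid \text{Liberal Arts})$ and $p_2 = P(\text{Female} \mid \text{Science})$.

$n_1 = 640$, $X_1 = 378$, $n_2 = 274$, $X_2 = 99$

$H_0: p_1 - p_2 = 0$ vs. $H_a: p_1 - p_2 \neq 0$

$$z_{obs} = \frac{.59 - .36}{\sqrt{\dfrac{.52 \times .48}{640} + \dfrac{.52 \times .48}{274}}} = 6.36 > z_{0.025} = 1.96$$

We do have enough statistical evidence to conclude that $p_1 \neq p_2$ at the .05 significance level.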


R Code & Output


Example: Test for Homogeneity

Let $p_1 = P(\text{Liberal Arts})$, $p_2 = P(\text{Science})$, $p_3 = P(\text{Engineering})$

The Hypotheses:

$H_0: p_1 = p_2 = p_3 = \frac{1}{3}$

$H_a$: At least one is different

The Test Statistic:

$$\chi^2_{obs} = \frac{(640 - 509.33)^2}{509.33} + \frac{(274 - 509.33)^2}{509.33} + \frac{(614 - 509.33)^2}{509.33} = 33.52 + 108.73 + 21.51 = 163.76 > \chi^2_{.05, df = 2} = 5.99$$

Reject $H_0$ at the .05 level


R Code & Output
