Chapter 10. Categorical Data


Page 1: Chapter 10. Categorical Data

Chapter 10. Categorical Data

Page 2: Chapter 10. Categorical Data

Categorical or Count Data

• The levels of the variable of interest are identified by name or rank only, and we are interested in the number of observations occurring at each level of the variable.

• Data obtained from these types of variables are called categorical or count data.

Page 3: Chapter 10. Categorical Data

Case Study: Franchise Expansion and Location Selection (1)

• A questionnaire was designed to determine customers’ satisfaction with their stay at each of the four groups’ motels.

– The two key areas of interest are guests’ ratings of building quality (room, restaurant, exercise facility, etc.) and service quality.

– Both are rated on a 1 to 5 scale, with 1 denoting poor, 3 denoting average, and 5 denoting excellent, in the guests’ opinions.

– 500 recent guests of each group were randomly selected and sent a questionnaire.

– Frequencies of building ratings and frequencies of service ratings are tabulated.

Page 4: Chapter 10. Categorical Data

Inferences About a Population Proportion π (1)

• Probability distribution of the binomial random variable:

  P(y) = [n! / (y!(n-y)!)] π^y (1-π)^(n-y)

  where
  π = probability of success on a trial,
  1 - π = probability of failure on a trial,
  n = number of trials,
  y = number of successes in n trials, and
  n! = n(n-1)(n-2)···(3)(2)(1).

• The point estimate of the binomial parameter: letting y denote the number of successes in the n sample trials, the sample proportion is

  π̂ = y/n

Page 5: Chapter 10. Categorical Data

Inferences About a Population Proportion π (2)

• The random variable y possesses a mound-shaped probability distribution that can be approximated by using a normal curve when

  nπ ≥ 5 and n(1-π) ≥ 5  (equivalently, n ≥ 5/min(π, 1-π))

• In a similar way, the distribution of π̂ = y/n can be approximated by a normal distribution with a mean and a standard error as given here:

  μ_π̂ = π,  σ_π̂ = sqrt(π(1-π)/n)

• Confidence interval for π with confidence coefficient (1-α):

  π̂ ± z_{α/2} σ̂_π̂,  or  (π̂ - z_{α/2} σ̂_π̂, π̂ + z_{α/2} σ̂_π̂)

Page 6: Chapter 10. Categorical Data

Example 10.1

• A new genetic treatment of 870 patients with a particular type of cancer resulted in 330 patients surviving at least 5 years after treatment. Estimate the proportion of all patients with the specified type of cancer who would survive at least 5 years after being administered this treatment. Use a 90% confidence interval.

Page 7: Chapter 10. Categorical Data

Answer to Example 10.1 (1)

• The confidence interval for π is based on a normal approximation to a binomial, which is appropriate provided n is sufficiently large. The rule we have specified is that both nπ and n(1-π) should be at least 5, but since π is the unknown parameter, we will require that nπ̂ and n(1-π̂) be at least 5.

  π̂ = 330/870 = 0.38

  σ̂_π̂ = sqrt((0.38 × 0.62)/870) = 0.016

  90% C.I.: 0.38 ± 1.645 × 0.016 = 0.38 ± 0.026, or 0.354 to 0.406
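The interval in Example 10.1 can be reproduced numerically. A minimal sketch in Python (the variable names are mine, not from the text); note that working with the unrounded proportion gives a lower limit of about 0.352 rather than the 0.354 obtained on the slide, which rounds π̂ to 0.38 before computing the standard error:

```python
import math

# Example 10.1: 330 of 870 patients survived at least 5 years
y, n = 330, 870
z_90 = 1.645  # z-value for a 90% confidence interval

pi_hat = y / n                              # sample proportion
se = math.sqrt(pi_hat * (1 - pi_hat) / n)   # estimated standard error of pi_hat
lower, upper = pi_hat - z_90 * se, pi_hat + z_90 * se

print(round(pi_hat, 2), round(lower, 3), round(upper, 3))
```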

Page 8: Chapter 10. Categorical Data

Adjustment for π Estimation

• A problem that arises in the estimation of π occurs when π is very close to zero or one.

• Some adjustments for π estimation are required for this case.

• One of the proposed adjustments is to use:

  π̂_adj = (y + 3/8)/(n + 3/4)

  so that π̂_adj = (3/8)/(n + 3/4) when y = 0 and π̂_adj = (n + 3/8)/(n + 3/4) when y = n.

  When y = 0, the 100(1-α)% C.I. is (0, 1 - (α/2)^(1/n)).
  When y = n, the 100(1-α)% C.I. is ((α/2)^(1/n), 1).
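A minimal sketch of the adjusted estimate and the two special-case intervals (function names are mine; the y = 0 and y = n formulas are the ones stated above):

```python
def adjusted_estimate(y, n):
    """Adjusted point estimate of pi: (y + 3/8)/(n + 3/4)."""
    return (y + 3/8) / (n + 3/4)

def extreme_ci(y, n, alpha=0.05):
    """100(1-alpha)% CI for pi when y = 0 or y = n."""
    bound = (alpha / 2) ** (1 / n)
    if y == 0:
        return (0.0, 1 - bound)
    if y == n:
        return (bound, 1.0)
    raise ValueError("use the usual normal-approximation interval for 0 < y < n")

# e.g. no successes in 20 trials:
print(adjusted_estimate(0, 20), extreme_ci(0, 20))
```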

Page 9: Chapter 10. Categorical Data

Sample Size Estimation (1)

• Sample size calculations for estimating π follow very closely the procedures we developed for inferences about µ.

• The required sample size for a 100(1-α)% confidence interval for π of the form π̂ ± E (where E is specified) is found by solving the expression

  z_{α/2} σ_π̂ = E

for n. The result is:

  n = z²_{α/2} π(1-π) / E²

• Since π is not known, either substitute an educated guess or use π = 0.5. Use of π = 0.5 will generate the largest possible sample size for the specified confidence interval width, 2E, and thus will give a conservative answer to the required sample size.
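The formula n = z²_{α/2} π(1-π)/E² translates directly into code; a sketch (names are mine), shown with the conservative π = 0.5 and the setting of Example 10.3 (E = 0.02, 95% confidence):

```python
import math

def sample_size(E, z, pi=0.5):
    """Required n for a 100(1-alpha)% CI of the form pi_hat +/- E."""
    return math.ceil(z**2 * pi * (1 - pi) / E**2)

# Within 0.02 with 95% confidence, conservative pi = 0.5:
print(sample_size(E=0.02, z=1.96))
```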

Page 10: Chapter 10. Categorical Data

Example 10.3

• A large public opinion polling agency plans to conduct a national survey to determine the proportion of employed adults who fear losing their job within the next year. How many workers must the agency poll to estimate this proportion to within 0.02 using a 95% confidence interval?

Page 11: Chapter 10. Categorical Data

Example 10.4

• Sports car owners in a town complain that the state vehicle inspection station judges their cars differently from family-style cars. Previous records indicate that 30% of all passenger cars fail the inspection on the first time through. In a random sample of 150 sports cars, 60 failed the inspection on the first time through. Is there sufficient evidence to indicate that the percentage of first failures for sports cars is higher than the percentage for all passenger cars? (α=0.05)

H0: π = 0.30 vs. Ha: π > 0.30

  π̂ = 60/150 = 0.40

  Sample-size check: nπ0 = 150 × 0.30 = 45 ≥ 5 and n(1-π0) = 150 × 0.70 = 105 ≥ 5

  σ_π̂ = sqrt(π0(1-π0)/n) = sqrt((0.30 × 0.70)/150) = 0.037

  z = (0.40 - 0.30)/0.037 = 2.70

Conclusion: Sports cars at the vehicle inspection station have a first-failure rate greater than 0.30 (p-value 0.0035).
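The z test of Example 10.4 can be sketched as follows (variable names are mine). Using the unrounded standard error gives z ≈ 2.67 and p ≈ 0.0038; the slide rounds σ_π̂ to 0.037 first, which yields z = 2.70 and p ≈ 0.0035:

```python
import math

# Example 10.4: H0: pi = 0.30 vs Ha: pi > 0.30, y = 60 failures in n = 150
pi0, y, n = 0.30, 60, 150

pi_hat = y / n                               # 0.40
se = math.sqrt(pi0 * (1 - pi0) / n)          # standard error under H0
z = (pi_hat - pi0) / se
p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability

print(round(z, 2), round(p_value, 4))
```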

Page 12: Chapter 10. Categorical Data

Computer Analysis Issue

• Most computer packages do not include the test in Example 10.4. However, by coding successes as 1s and failures as 0s, we can approximate this test. This trick works exactly if the package includes a z test. Specify σ as sqrt(π0(1-π0)). If the package includes a one-sample t test, the result will be slightly different, but often close enough.

Page 13: Chapter 10. Categorical Data

Sample-Size Requirement

• If either nπ0 or n(1-π0) is less than about 2, treat the results of a z test very skeptically. If nπ0 and n(1-π0) are at least 5, the z test should be reasonably accurate.

Page 14: Chapter 10. Categorical Data

Inferences about the Difference between Two Population Proportions π1 - π2

• Comparison of two binomial parameters

  – Notation for comparing two binomial proportions

  – Inferences about two binomial proportions are usually phrased in terms of their difference π1 - π2, and we use the difference in sample proportions π̂1 - π̂2 as part of a confidence interval or statistical test.

• Mean and standard error of π̂1 - π̂2:

  μ_{π̂1-π̂2} = π1 - π2

  σ_{π̂1-π̂2} = sqrt(π1(1-π1)/n1 + π2(1-π2)/n2)

Page 15: Chapter 10. Categorical Data

Example 10.5

• A company test-markets a new product in the Grand Rapids, Michigan, and Wichita, Kansas, metropolitan areas. The company’s advertising in the Grand Rapids area is based almost entirely on television commercials. In Wichita, the company spends a roughly equal dollar amount on a balanced mix of television, radio, newspaper, and magazine ads. Two months after the campaign begins, the company conducts surveys to determine consumer awareness of the product.

Calculate a 95% confidence interval for the regional difference in the proportion of all consumers who are aware of the product.

Page 16: Chapter 10. Categorical Data

Answer to Example 10.5

  π̂1 = 413/527 = 0.784
  π̂2 = 392/608 = 0.645

The estimated standard error is

  sqrt((0.784)(0.216)/527 + (0.645)(0.355)/608) = 0.0264

95% confidence interval:

  (0.784 - 0.645) - 1.96(0.0264) ≤ π1 - π2 ≤ (0.784 - 0.645) + 1.96(0.0264)

  or 0.087 ≤ π1 - π2 ≤ 0.191
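The interval for Example 10.5 can be checked numerically; a minimal sketch (variable names are mine):

```python
import math

# Example 10.5: product awareness in two test markets
y1, n1 = 413, 527   # Grand Rapids (TV only)
y2, n2 = 392, 608   # Wichita (mixed media)

p1, p2 = y1 / n1, y2 / n2
se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
diff = p1 - p2
lower, upper = diff - 1.96 * se, diff + 1.96 * se

print(round(lower, 3), round(upper, 3))
```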

Page 17: Chapter 10. Categorical Data

Rule for Sample Sizes

• The confidence interval method is based on the normal approximation to the binomial distribution. In Chapter 4, we indicated as a general rule that nπ̂ and n(1-π̂) should both be at least 5 to use this normal approximation. For this confidence interval to be used, the rule should hold for each sample. In practice, sample sizes that come even close to violating this rule aren’t very useful because they lead to excessively wide confidence intervals.

• The reason for confidence intervals that seem very wide and unhelpful is that each measurement conveys very little information.

Page 18: Chapter 10. Categorical Data

Statistical Test for the Difference between Two Population Proportions

Page 19: Chapter 10. Categorical Data

Inferences about Several Proportions: Chi-Square Goodness-of-Fit Test

• The experiment consists of n identical trials.
• Each trial results in one of k outcomes.
• The probability that a single trial results in outcome i is πi, for i = 1, 2, …, k, and remains constant from trial to trial. (Note: Σ πi = 1.)
• The trials are independent.
• We are interested in ni, the number of trials resulting in outcome i. (Note: Σ ni = n.)

Multinomial distribution:

  P(n1, n2, …, nk) = [n! / (n1! n2! ··· nk!)] π1^{n1} π2^{n2} ··· πk^{nk}

Page 20: Chapter 10. Categorical Data

Example 10.7

• Previous experience with the breeding of a particular herd of cattle suggests that the probability of obtaining one healthy calf from a mating is 0.83. Similarly, the probabilities of obtaining zero or two healthy calves are, respectively, 0.15 and 0.02. A farmer breeds three dams from the herd; find the probability of obtaining exactly three healthy calves.

Page 21: Chapter 10. Categorical Data

Answer to Example 10.7

• n = 3 trials and k = 3 outcomes (0, 1, or 2 healthy calves per mating). Three healthy calves result either from each mating producing one calf, or from one mating producing zero, one producing one, and one producing two:

  P = (0.83)³ + 3!(0.15)(0.83)(0.02) = 0.572 + 0.015 = 0.59
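The two multinomial terms above can be computed directly; a minimal sketch (the helper function is mine):

```python
from math import factorial

def multinomial_prob(counts, probs):
    """P(n1,...,nk) = n!/(n1!...nk!) * prod(pi_i ** n_i)."""
    n = sum(counts)
    coef = factorial(n)
    for c in counts:
        coef //= factorial(c)
    p = 1.0
    for c, pi in zip(counts, probs):
        p *= pi ** c
    return coef * p

# Example 10.7: outcomes per mating are 0, 1, or 2 healthy calves
probs = (0.15, 0.83, 0.02)                        # P(0), P(1), P(2)
three_ones = multinomial_prob((0, 3, 0), probs)   # each mating yields one calf
one_of_each = multinomial_prob((1, 1, 1), probs)  # a 0, a 1, and a 2
total = three_ones + one_of_each
print(round(total, 2))
```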

Page 22: Chapter 10. Categorical Data

Chi-Square Goodness-of-Fit Test (1)

• The primary interest in the multinomial distribution is as a probability model underlying statistical tests about the probabilities π1, π2, …, πk.

• We will hypothesize specific values for the πs and then determine whether the sample data agree with the hypothesized values.

• One way to test such a hypothesis is to examine the observed number of trials resulting in each outcome and to compare this to the number we would expect to result in each outcome.

Page 23: Chapter 10. Categorical Data

Chi-Square Goodness-of-Fit Test (2) --- Expected Number of Outcomes

• In a multinomial experiment in which each trial can result in one of k outcomes, the expected number of outcomes of type i in n trials is nπi, where πi is the probability that a single trial results in outcome i.

• In 1900, Karl Pearson proposed the following test statistic to test the specified probabilities:

  χ² = Σi [(ni - Ei)² / Ei]

where ni represents the number of trials resulting in outcome i and Ei represents the number of trials we would expect to result in outcome i when the hypothesized probabilities represent the actual probabilities assigned to each outcome.

Page 24: Chapter 10. Categorical Data

Chi-Square Goodness-of-Fit Test (3) --- Expected Number of Outcomes

• Cell probabilities: π1, π2, …, πk
• Observed cell counts: n1, n2, …, nk
• Expected cell counts: E1, E2, …, Ek

• If the hypothesized π-values are correct, the observed cell counts ni should not deviate greatly from the expected cell counts Ei, and the computed value of

  χ² = Σi [(ni - Ei)² / Ei]

should be small. Similarly, when one or more of the hypothesized cell probabilities are incorrect, the observed and expected cell counts will differ substantially, making χ² large.

• The distribution of the quantity χ² can be approximated by a chi-square distribution provided that the expected cell counts Ei are fairly large.

• The chi-square goodness-of-fit test based on k specified cell probabilities will have k-1 degrees of freedom.

Page 25: Chapter 10. Categorical Data

Chi-Square Goodness-of-Fit Test (1)

Page 26: Chapter 10. Categorical Data

Chi-Square Goodness-of-Fit Test (2)

• Cochran indicates that the approximation should be adequate if no Ei is less than 1 and no more than 20% of the Eis are less than 5. The values of n/k that provide adequate approximations for the chi-square goodness-of-fit test statistic tend to decrease as k increases.

• Agresti (1990) discusses situations in which the chi-square approximation tends to be poor for studies having small observed cell counts even if the expected cell counts are moderately large.

• Cochran’s guidelines determine whether the chi-square goodness-of-fit test statistic can be adequately approximated with a chi-square distribution. When some of the Eis are too small, there are several alternatives. Researchers can combine levels of the categorical variable to increase the observed cell counts. However, combining categories should not be done unless there is a natural way to redefine the levels of the categorical variable that does not change the nature of the hypothesis to be tested. When it is not possible to obtain observed cell counts large enough to permit the chi-square approximation, Agresti (1990) discusses exact methods to test the hypotheses.

Page 27: Chapter 10. Categorical Data

Example 10.8

• A laboratory is comparing a test drug to a standard drug preparation that is useful in the maintenance of patients suffering from high blood pressure. Over many clinical trials at many different locations, the standard therapy was administered to patients with comparable hypertension. The lab then classified the responses to therapy for this large patient group into one of four response categories. Table 10.1 lists the categories and percentages of patients treated on the standard preparation who have been classified in each category.

• The lab then conducted a clinical trial with a random sample of 200 patients with high blood pressure. Use the sample data in Table 10.2 to test the hypothesis that the cell probabilities associated with the test preparation are identical to those for the standard. Use α = 0.05.

Table 10.1 Table 10.2

Page 28: Chapter 10. Categorical Data

Answer to Example 10.8

H0: π1 = 0.50, π2 = 0.25, π3 = 0.10, π4 = 0.15
Ha: At least one of the cell probabilities is different from the hypothesized value.

  χ² = Σi [(ni - Ei)² / Ei]
     = (120-100)²/100 + (60-50)²/50 + (10-20)²/20 + (10-30)²/30
     = 4 + 2 + 5 + 13.33 = 24.33

R.R.: Reject H0 if χ² > 7.815 (df = 3, α = 0.05).

• Conclusion: At least one of the cell probabilities differs from that specified under H0. It appears that a much higher proportion of patients treated with the test preparation falls into the moderate and marked improvement categories.
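The goodness-of-fit computation for Example 10.8 is a direct translation of the χ² formula; a minimal sketch (names are mine):

```python
# Example 10.8: goodness-of-fit for four response categories, n = 200
observed = [120, 60, 10, 10]
hypothesized = [0.50, 0.25, 0.10, 0.15]
n = sum(observed)

expected = [n * pi for pi in hypothesized]   # 100, 50, 20, 30
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Compare to the tabled chi-square value 7.815 (df = 3, alpha = 0.05)
print(round(chi_sq, 2))
```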

Page 29: Chapter 10. Categorical Data

Chi-Square Goodness-of-Fit Test (3)

• This goodness-of-fit test has been used extensively over the years to test various scientific theories. Unlike previous statistical tests, however, the hypothesis of interest is the null hypothesis, not the research (or alternative) hypothesis.

• If a scientist has a set theory and wants to show that sample data conform to or “fit” that theory, he wants to accept H0. From our previous work, there is the potential for committing a Type II error in accepting H0. Here, as with other tests, the calculation of β probabilities is difficult.

• In general, for a goodness-of-fit test, the potential for committing a Type II error is high if n is small or if k, the number of categories, is large. Even if the expected cell counts Ei conform to our recommendations, the probability of a Type II error could be large. Therefore, the results of a chi-square goodness-of-fit test should be viewed with some skepticism. Don’t automatically accept the null hypothesis as fact given that H0 was not rejected.

Page 30: Chapter 10. Categorical Data

The Poisson Distribution (1)

• In 1837, S.D. Poisson developed a discrete probability distribution, suitably called the Poisson distribution, which provides a good approximation to the binomial when π is small and n is large but nπ is less than 5. The probability of observing y successes in the n trials is given by the formula

  P(y) = μ^y e^{-μ} / y!

• For approximating binomial probabilities using the Poisson distribution, take

  μ = nπ

Page 31: Chapter 10. Categorical Data

Example 10.9

• Refer to the clinical trial mentioned at the beginning of this section, where n = 1000 patients were treated with a new drug. Compute the probability that none of a sample of n = 1000 patients experiences a particular side effect when π = 0.001.

• Suppose that after a clinical trial of a new medication involving 1000 patients, no patient experienced nausea. Would it be reasonable to infer that less than 0.001 of the entire population would experience this side effect while taking the drug?

• Certainly not. The probability of observing y = 0 in n = 1000 trials assuming π = 0.001 is 0.368. Since this probability is quite large, it would not be wise to infer that π < 0.001.

  μ = nπ = 1000 × 0.001 = 1

  P(y = 0) = 1^0 · e^{-1} / 0! = e^{-1} = 1/2.71828 = 0.3679
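The Poisson calculation in Example 10.9 can be sketched as follows (the helper function is mine); comparing against the exact binomial value (1 - 0.001)^1000 shows how close the approximation is here:

```python
import math

def poisson_prob(y, mu):
    """P(y) = mu**y * exp(-mu) / y!"""
    return mu ** y * math.exp(-mu) / math.factorial(y)

# Example 10.9: n = 1000 trials, pi = 0.001, so mu = n*pi = 1
mu = 1000 * 0.001
approx = poisson_prob(0, mu)     # Poisson approximation
exact = (1 - 0.001) ** 1000      # exact binomial P(y = 0)
print(round(approx, 4), round(exact, 4))
```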

Page 32: Chapter 10. Categorical Data

The Poisson Distribution (2)

• Although the Poisson distribution provides a useful approximation to the binomial under certain conditions, the application of the Poisson distribution is not limited to these situations.

• The Poisson distribution has been useful in finding the probability of y occurrences of an event that occurs randomly over an interval of time, volume, space, and so on, provided certain assumptions are met.

– Events occur one at a time: two or more events do not occur precisely at the same time.

– The occurrence of an event in a given period is independent of the occurrence of the event in a nonoverlapping period; that is, the occurrence (or nonoccurrence) of an event during one period does not change the probability of an event occurring in some later period.

In many discussions of this topic, a third assumption is added:

– The expected number of events during any one period is the same as that during any other period.

Page 33: Chapter 10. Categorical Data

Tests Using Poisson Distribution

• To check the assumption that the data follow a Poisson probability distribution, we make use of the goodness-of-fit test with the test statistic:

  χ² = Σi [(ni - Ei)² / Ei]

• There are two types of null hypothesis. The first hypothesis is that the data arise from a Poisson distribution with µ = µ0 (H0: µ = µ0).

  – The quantity ni denotes the number of observations in cell i, and Ei is the expected number of observations in cell i obtained from the probabilities for a Poisson distribution with mean µ0. The computed value of the test statistic is then compared to the tabulated chi-square value with suitable α and df = k-1, where k is the number of cells.

• The second null hypothesis we might be interested in is less specific. We test H0: The observed cell counts all come from a common Poisson distribution with mean µ (unspecified).

  – Here Ei is the expected number of observations in cell i obtained from the probabilities for a Poisson distribution with a mean estimated from the sample data. The rejection region is then located for df = k-2.

Page 34: Chapter 10. Categorical Data

Example 10.11

• Environmental engineers often utilize information contained in the number of different alga species and the number of cell clumps per species to measure the health of a lake. Those lakes exhibiting only a few species but many cell clumps are classified as oligotrophic. In one such investigation, a lake sample was analyzed under a microscope to determine the number of clumps of cells per microscope field. These data are summarized here for 150 fields examined under a microscope. Here yi denotes the number of cell clumps per field and ni denotes the number of fields with yi cell clumps.

• Use α = 0.05 to test the null hypothesis that the sample data were drawn from a Poisson probability distribution.

Page 35: Chapter 10. Categorical Data

Answer to Example 10.11 (1)

• We must first estimate the Poisson parameter µ and then compute the expected cell counts. The Poisson mean µ is estimated by using the sample mean ȳ:

  ȳ = Σ yi ni / n = 486/150 = 3.24 ≈ 3.3

• Note that the sample mean was computed to be 3.3 by using all the sample data before the 13 largest values were collapsed into the final cell. This is why the sample mean computed here was rounded up to 3.3.

• The Poisson probabilities for y = 0, 1, …, 7 or more are shown here:

Page 36: Chapter 10. Categorical Data

Answer to Example 10.11 (2)

• Substituting these values into the test statistic, we have

  χ² = Σi [(ni - Ei)² / Ei]
     = (6-5.55)²/5.55 + (23-18.30)²/18.30 + ··· + (13-7.65)²/7.65
     = 6.93

• The tabulated value of chi-square for α = 0.05 and df = k - 2 = 6 is 12.59. Since the computed value of chi-square does not exceed 12.59, we have insufficient evidence to reject the null hypothesis that the data were collected from a Poisson distribution.

Page 37: Chapter 10. Categorical Data

Contingency Tables: Tests for Independence (1)

• Dependence of variables means that one variable has some value for predicting the other. With sample data, there usually appears to be some degree of dependence. In this section, we develop a χ2 test that assesses whether the perceived dependence in sample data may be a fluke---the result of random variability rather than real dependence.

• An implicit assumption of our discussion surrounding the test of independence is that the data result from a single random sample from the whole population. Often, separate random samples are taken from the subpopulations defined by the column (or row) variable.

Page 38: Chapter 10. Categorical Data

Case Study (1)

• Suppose that a personnel manager for a large firm wants to assess the popularity of three alternative flexible time-scheduling (flextime) plans among clerical workers in four different offices.

• The null hypothesis for this χ2 test is independence.

• The computation of the expected values Ei under the null hypothesis is different for the independence test than for the goodness-of-fit test.

Page 39: Chapter 10. Categorical Data

Case Study (2)

• The null hypothesis of independence does not specify numerical values for the row probabilities πi. and column probabilities π.j, so these probabilities must be estimated by the row and column relative frequencies.

• Under the hypothesis of independence, the estimated expected value in row i and column j is:

  Ê_ij = n π̂_i. π̂_.j = n (n_i./n)(n_.j/n) = (n_i.)(n_.j)/n

that is, the row total multiplied by the column total, divided by the grand total.

• In general, for a table with r rows and c columns, (r-1)(c-1) of the cell entries are free to vary. This number represents the degrees of freedom for the χ2 test of independence.
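The formula Ê_ij = (row total)(column total)/n is easy to compute for any table; a minimal sketch with illustrative data (the counts below are mine, not from the text):

```python
# Expected cell counts under independence: E_ij = (row total)(col total)/n
table = [
    [10, 20, 30],
    [20, 30, 40],
]

row_totals = [sum(row) for row in table]            # 60, 90
col_totals = [sum(col) for col in zip(*table)]      # 30, 50, 70
n = sum(row_totals)                                 # 150

expected = [[r * c / n for c in col_totals] for r in row_totals]
print(expected)
```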

Page 40: Chapter 10. Categorical Data

Example 10.12

• Suppose that in the flexible time-scheduling illustration, a random sample of 216 workers yields the following frequencies. Calculate a table of Ê_ij values.

• Carry out the χ2 test of independence for the data. First use α = 0.05; then obtain a bound for the p-value.

Page 41: Chapter 10. Categorical Data

Answer to Example 10.12

Page 42: Chapter 10. Categorical Data

Likelihood Ratio Statistic

• There is an alternative χ2 statistic, called the likelihood ratio statistic, that is often shown in computer outputs. It is defined as:

  likelihood ratio χ² = 2 Σ_ij n_ij ln[ n_ij / ((n_i n_j)/n) ]

where n_i is the total frequency in row i, n_j is the total in column j, and ln is the natural logarithm.

• Although it isn’t at all obvious, this form of the χ2 independence test is approximately equal to the Pearson form. There is some reason to believe that the Pearson χ2 yields a better approximation to table values, so we prefer to rely on it rather than on the likelihood ratio form.
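The approximate equality of the two forms can be seen numerically; a minimal sketch on an illustrative 2×2 table (the counts are mine, not from the text):

```python
import math

# Pearson and likelihood ratio chi-square for a 2x2 table
table = [[30, 20],
         [20, 30]]

row = [sum(r) for r in table]
col = [sum(c) for c in zip(*table)]
n = sum(row)

pearson, lratio = 0.0, 0.0
for i in range(2):
    for j in range(2):
        e = row[i] * col[j] / n        # expected count under independence
        o = table[i][j]
        pearson += (o - e) ** 2 / e
        lratio += 2 * o * math.log(o / e)

print(round(pearson, 3), round(lratio, 3))
```

For this table the Pearson statistic is 4.0 and the likelihood ratio statistic about 4.03, illustrating how close the two forms typically are.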

Page 43: Chapter 10. Categorical Data

Strength of Association

• Rejection of the null hypothesis indicates only that the apparent association is not reasonably attributable to chance. It does not indicate anything about the strength or type of association.

Page 44: Chapter 10. Categorical Data

Test of Homogeneity (1)

• In the flextime example (Example 10.12), the data might have resulted from separate samples of respective sizes 24, 81, 66, and 45 from the four offices rather than from a single random sample of 216 workers.

• In general, suppose the column categories represent c distinct subpopulations. Random samples of sizes n1, n2, …, nc are selected from these subpopulations. The observations from each subpopulation are then classified into the r values of a categorical variable represented by the r rows in the contingency table.

• The research hypothesis is that there is a difference among the distributions (π1j, π2j, …, πrj) of subpopulation units into the r levels of the categorical variable. The null hypothesis is that the set of r proportions for each subpopulation is the same for all c subpopulations. Thus, the null hypothesis is given by

  H0: (π11, π21, …, πr1) = (π12, π22, …, πr2) = ··· = (π1c, π2c, …, πrc)

• This test is called a test of homogeneity of distributions. The mechanics of the test of homogeneity and the test of independence are identical. However, note that the sampling scheme and conclusions are different.

Page 45: Chapter 10. Categorical Data

Test of Homogeneity (2)

• The objective of the test of homogeneity is to determine whether the distribution of the subpopulation units to the values of the categorical variable is the same for all c subpopulations.

Page 46: Chapter 10. Categorical Data

Example 10.14

• Random samples of 200 individuals from major oil-producing and natural gas-producing states, 200 from coal states, and 400 from other states participate in a poll of attitudes toward five possible energy policies. Each respondent indicates the most preferred alternative from among the following:
(1) primarily emphasize conservation;
(2) primarily emphasize domestic oil and gas exploration;
(3) primarily emphasize investment in solar-related energy;
(4) primarily emphasize nuclear energy development and safety;
(5) primarily reduce environmental restrictions and emphasize coal-burning activities.

Page 47: Chapter 10. Categorical Data

Answer to Example 10.14

Page 48: Chapter 10. Categorical Data

Conclusion about the χ2 Test

• The χ2 test described in this section has a limited but important purpose. This test only assesses whether the data indicate a statistically detectable (significant) relation among various categories.

• It does not measure how strong the apparent relation might be. A weak relation in a large data set may be detectable (significant); a strong relation in a small data set may be nonsignificant.

Page 49: Chapter 10. Categorical Data

Measuring Strength of Relation (1)

• The simplest method for assessing the strength of a relation is simple percentage analysis.

• For example, a firm planning to market a cleaning product commissions a market research study of the leading current product. The variables of interest are the frequency of use and the rating of the leading product.

Page 50: Chapter 10. Categorical Data

Measuring Strength of Relation (2)

• One natural analysis of the data takes the frequencies of use as given and looks at the ratings as functions of use. The analysis essentially looks at conditional probabilities of the rating factor, given the use factor, but it recognizes that the data are only a sample, not the population.

• The proportions are quite different for the three use categories, which indicates that rating is related to use.

                          Rating
  Use          Fair           Good           Excellent      Total
  Rare         64 (0.1975)    123 (0.3796)   137 (0.4228)    324
  Occasional   131 (0.2539)   256 (0.4961)   129 (0.2500)    516
  Frequent     209 (0.4918)   171 (0.4024)   45 (0.1059)     425
  Totals       404            550            311            1265

Page 51: Chapter 10. Categorical Data

Measuring Strength of Relation (3)

• Another way to analyze relations in data is to consider predictability. The stronger the relation exhibited in data, the better one can predict the value of one variable from the value of the other.

• We need to distinguish between a dependent variable --- the variable one is trying to predict --- and an independent variable --- the variable one is using to make the prediction.

• The simplest prediction rule is to predict the most common value (the mode) of the dependent variable; this rule is the basis of the λ(lambda) predictability measure.

                    Rating
  Use          Fair   Good   Excellent   Total
  Rare          64    123    137          324
  Occasional   131    256    129          516
  Frequent     209    171     45          425
  Totals       404    550    311         1265

Use known: predict the modal rating within each use category; the total number of errors is

  (64 + 123) + (131 + 129) + (171 + 45) = 187 + 260 + 216 = 663

Use unknown: predict the overall modal rating (Good); the total number of errors is

  1265 - 550 = 715

Page 52: Chapter 10. Categorical Data

Measuring Strength of Relation (4)

• The λ measure indicates how much better we predict. When use is the independent variable and rating is the dependent variable, the difference in prediction errors is 715 - 663 = 52. We take this difference as a fraction of the errors made not knowing the independent variable --- namely, 715. If use is the independent variable, then

  λ = (errors with independent variable unknown - errors with independent variable known) / (errors with independent variable unknown)
    = (715 - 663)/715 = 0.073

• The value of λ ranges between 0 and 1; λ = 1 indicates that, at least in the sample, the dependent variable is predicted perfectly given the independent variable. A value of λ = 0 occurs if there is independence in the data, or if the same value of the dependent variable is always predicted.
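The λ computation for the use/rating data can be sketched as follows (variable names and the dict layout are mine; the counts are those tabulated above):

```python
# Lambda predictability measure: use (independent) predicting rating (dependent)
table = {
    "Rare":       [64, 123, 137],
    "Occasional": [131, 256, 129],
    "Frequent":   [209, 171, 45],
}

n = sum(sum(row) for row in table.values())                # 1265
col_totals = [sum(vals) for vals in zip(*table.values())]  # 404, 550, 311

# Errors predicting the modal rating within each use category:
errors_known = sum(sum(row) - max(row) for row in table.values())
# Errors predicting the overall modal rating (Good):
errors_unknown = n - max(col_totals)

lam = (errors_unknown - errors_known) / errors_unknown
print(errors_known, errors_unknown, round(lam, 3))
```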

Page 53: Chapter 10. Categorical Data

Example 10.15

• An internal survey of samples of clerical workers, supervisory personnel, and junior managers includes their opinions on a proposed flextime schedule.

                        Level
  Opinion           Clerical   Supervisor   Junior Manager   Total
  Strongly oppose    5          8            9                22
  Oppose             8         10            5                23
  Favor             14          7            4                25
  Strongly favor    13          5            2                20
  Totals            40         30           20                90

Page 54: Chapter 10. Categorical Data

Answer to Example 10.15 (1)

Page 55: Chapter 10. Categorical Data

Answer to Example 10.15 (2)

Page 56: Chapter 10. Categorical Data

Answer to Example 10.15 (3)

Page 57: Chapter 10. Categorical Data

Answer to Example 10.15 (4)

• Percentage analyses and values of λ play a fundamentally different role than does the χ2 test. The point of a χ2 test is to see how much evidence there is that there is a relation, whatever the size may be.

• The point of percentage analyses and λ is to see how strong the relation appears to be, taking the data at face value. The two types of analyses are complementary.