Chapter 16: Analysis of Categorical Data
LO1 Use the chi-square goodness-of-fit test to analyze probabilities of multinomial distribution trials along a single dimension.
LO2 Use the chi-square test of independence to perform contingency analysis.
Learning Objectives
• In Chapter 5, the binomial distribution was used to analyze experiments or trials that had only two possible outcomes.
• An extension of this problem is the multinomial distribution, in which more than two possible outcomes can occur.
• The χ² goodness-of-fit test is used to analyze probabilities of multinomial distribution trials along a single dimension.
• It compares the expected (theoretical) frequencies of categories from a population distribution with the observed (actual) frequencies from a distribution to determine whether there is a difference between what was expected and what was observed.
χ² Goodness-of-Fit Test
LO1
• Hypothesize
  – Step 1: State the hypotheses
• Test
  – Step 2: Identify the appropriate statistical test for the problem
  – Step 3: Set the α value
  – Step 4: Determine the degrees of freedom
  – Step 5: Determine the expected frequencies
  – Step 6: Calculate the observed value of chi-square
• Action
  – Step 7: Decide whether to reject or fail to reject the null hypothesis
• Business Implication
  – Use the information to answer research questions
Formulating Test of Hypothesis
LO1
• When the expected value of a category is small, a large chi-square value can be obtained erroneously, leading to a Type I error.
• To control for this potential error, the chi-square goodness-of-fit test should not be used when any of the expected frequencies is less than 5.
• If the observed data produce expected values of less than 5, combining adjacent categories (when meaningful) to create larger frequencies may be possible.
Small Expected Values of a Category
LO1
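The pooling rule above can be sketched in Python. This is only an illustration under stated assumptions: the function name `pool_small_categories` and the sample frequencies are hypothetical, and the left-to-right pooling strategy is one simple choice. Whether neighbouring categories can be combined meaningfully is still a judgment call for the analyst.

```python
def pool_small_categories(observed, expected, minimum=5.0):
    """Combine adjacent categories until each pooled expected frequency
    is at least `minimum` (hypothetical helper, not from the textbook)."""
    pooled_obs, pooled_exp = [], []
    run_obs, run_exp = 0, 0.0
    for f_o, f_e in zip(observed, expected):
        run_obs += f_o
        run_exp += f_e
        if run_exp >= minimum:
            # The running group is large enough: close it and start a new one.
            pooled_obs.append(run_obs)
            pooled_exp.append(run_exp)
            run_obs, run_exp = 0, 0.0
    if run_exp > 0:
        # A leftover tail that is still too small joins the last closed group.
        if pooled_exp:
            pooled_obs[-1] += run_obs
            pooled_exp[-1] += run_exp
        else:
            pooled_obs.append(run_obs)
            pooled_exp.append(run_exp)
    return pooled_obs, pooled_exp


# Hypothetical frequencies: the first and the last two expected values are below 5.
observed = [3, 9, 14, 11, 4, 2]
expected = [2.1, 8.4, 15.0, 10.5, 4.6, 2.4]

pooled_obs, pooled_exp = pool_small_categories(observed, expected)
print(pooled_obs)
print(pooled_exp)
```

After pooling, k in the degrees-of-freedom formula is the number of pooled categories, not the original number.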
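• The test statistic for a chi-square goodness-of-fit test is
  $\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$, with $df = k - 1 - c$
  where f_o = observed frequency, f_e = expected frequency, k = number of categories, and c = number of parameters estimated from the sample data.
χ² Goodness-of-Fit Test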
LO1
Milk Sales Data for Demonstration Problem 16.1
Month Litres of Milk
January 1,610
February 1,585
March 1,649
April 1,590
May 1,540
June 1,397
July 1,410
August 1,350
September 1,495
October 1,564
November 1,602
December 1,655
TOTAL 18,447
LO1
Hypotheses and Decision Rules for Demonstration Problem 16.1
H0: The monthly figures for milk sales are uniformly distributed.
Ha: The monthly figures for milk sales are not uniformly distributed.
$\alpha = .01$
$df = k - 1 - c = 12 - 1 - 0 = 11$
$\chi^2_{.01,11} = 24.725$
If $\chi^2_{Cal} > 24.725$, reject H0.
If $\chi^2_{Cal} \le 24.725$, do not reject H0.
LO1
Calculations for Demonstration Problem 16.1
$f_e = \frac{18447}{12} = 1537.25$
$\chi^2_{Cal} = 74.37$

Month        fo       fe         (fo − fe)² / fe
January      1,610    1,537.25    3.44
February     1,585    1,537.25    1.48
March        1,649    1,537.25    8.12
April        1,590    1,537.25    1.81
May          1,540    1,537.25    0.00
June         1,397    1,537.25   12.80
July         1,410    1,537.25   10.53
August       1,350    1,537.25   22.81
September    1,495    1,537.25    1.16
October      1,564    1,537.25    0.47
November     1,602    1,537.25    2.73
December     1,655    1,537.25    9.02
Totals      18,447   18,447      74.37
LO1
• The observed chi-square value of 74.37 is greater than the critical value of 24.725.
• The decision is to reject the null hypothesis. The data provide enough evidence to indicate that the distribution of milk sales is not uniform.
Calculations for Demonstration Problem 16.1
LO1
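The calculation for Demonstration Problem 16.1 can be reproduced with a short Python sketch. The variable names are illustrative, and the critical value is copied from the slide, not computed. Because the script does not round each term to two decimals before summing, its total can differ from 74.37 in the second decimal place.

```python
# Monthly milk sales (litres) from Demonstration Problem 16.1.
milk_sales = {
    "January": 1610, "February": 1585, "March": 1649, "April": 1590,
    "May": 1540, "June": 1397, "July": 1410, "August": 1350,
    "September": 1495, "October": 1564, "November": 1602, "December": 1655,
}

observed = list(milk_sales.values())
n = sum(observed)          # total litres sold
k = len(observed)          # number of categories (months)

# Under a uniform distribution every month has the same expected frequency.
expected = n / k

# Chi-square statistic: sum of (fo - fe)^2 / fe over all categories.
chi_square = sum((f_o - expected) ** 2 / expected for f_o in observed)

# df = k - 1 - c, with c = 0 because no parameter is estimated from the data.
df = k - 1 - 0

critical_value = 24.725    # chi-square table value for alpha = .01, df = 11 (from the slide)
reject_null = chi_square > critical_value

print(f"fe = {expected:.2f}, chi-square = {chi_square:.2f}, df = {df}")
print("Reject H0" if reject_null else "Do not reject H0")
```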
Bank Customer Arrival Data for Demonstration Problem 16.2
Number of Arrivals    Observed Frequencies
0                      7
1                     18
2                     25
3                     17
4                     12
≥5                     5
LO1
Hypotheses and Decision Rules for Demonstration Problem 16.2
H0: The frequency distribution of arrivals is Poisson.
Ha: The frequency distribution of arrivals is not Poisson.
$\alpha = .05$
$df = k - 1 - c = 6 - 1 - 1 = 4$ (c = 1 because the mean arrival rate is estimated from the data)
$\chi^2_{.05,4} = 9.4877$
If $\chi^2_{Cal} > 9.4877$, reject H0.
If $\chi^2_{Cal} \le 9.4877$, do not reject H0.
LO1
Calculations for Demonstration Problem 16.2:
Estimating the Mean Arrival Rate
Mean arrival rate:
$\lambda = \frac{\sum f \cdot X}{\sum f} = \frac{192}{84} = 2.3$ customers per minute

Number of Arrivals X    Observed Frequencies f    f·X
0                        7                          0
1                       18                         18
2                       25                         50
3                       17                         51
4                       12                         48
≥5                       5                         25
Total                   84                        192
LO1
Calculations for Demonstration Problem 16.2: Poisson Probabilities for λ = 2.3
$n = \sum f = 84$

Number of Arrivals X    Expected Probabilities P(X)    Expected Frequencies n·P(X)
0                       0.1003                          8.42
1                       0.2306                         19.37
2                       0.2652                         22.28
3                       0.2033                         17.08
4                       0.1169                          9.82
≥5                      0.0838                          7.04
LO1
χ² Calculations for Demonstration Problem 16.2
$\chi^2_{Cal} = 1.74$

Number of Arrivals X    Observed Frequencies fo    Expected Frequencies n·P(X)    (fo − fe)² / fe
0                        7                          8.42                          0.24
1                       18                         19.37                          0.10
2                       25                         22.28                          0.33
3                       17                         17.08                          0.00
4                       12                          9.82                          0.48
≥5                       5                          7.04                          0.59
Total                   84                         84.00                          1.74
LO1
• The observed chi-square value of 1.74 is less than the critical value of 9.4877.
• The decision is not to reject the null hypothesis. The data do not provide enough evidence to indicate that the distribution of bank arrivals is not Poisson.
Calculations for Demonstration Problem 16.2
LO1
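The steps of Demonstration Problem 16.2 can be sketched in Python as follows. The names are illustrative. As on the slides, the open-ended "5 or more" category is treated as exactly 5 when estimating the mean, the mean is rounded to 2.3, and the critical value is copied from the slide instead of being computed. Since the sketch keeps unrounded probabilities, the individual terms may differ slightly from the slide's table.

```python
import math

# Observed frequencies for 0, 1, 2, 3, 4 and "5 or more" arrivals per minute.
observed = [7, 18, 25, 17, 12, 5]
arrival_values = [0, 1, 2, 3, 4, 5]   # assumption: ">= 5" counted as 5, as on the slide

n = sum(observed)

# Estimate the mean arrival rate from the sample and round it as the slide does.
lam = round(sum(x * f for x, f in zip(arrival_values, observed)) / n, 1)


def poisson_pmf(x, rate):
    """Poisson probability of exactly x arrivals for the given mean rate."""
    return math.exp(-rate) * rate ** x / math.factorial(x)


# Probabilities for 0..4 arrivals; the last category takes the remaining tail.
probabilities = [poisson_pmf(x, lam) for x in range(5)]
probabilities.append(1.0 - sum(probabilities))

expected = [n * p for p in probabilities]

chi_square = sum((f_o - f_e) ** 2 / f_e for f_o, f_e in zip(observed, expected))

# df = k - 1 - c, with c = 1 because lambda was estimated from the data.
df = len(observed) - 1 - 1

critical_value = 9.4877   # critical value quoted on the slide
reject_null = chi_square > critical_value

print(f"lambda = {lam}, chi-square = {chi_square:.2f}, df = {df}")
print("Reject H0" if reject_null else "Do not reject H0")
```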
• Used to analyze the frequencies of two variables with multiple categories to determine whether the two variables are independent.
χ² Test of Independence
Qualitative Variables
Nominal Data
LO2
χ² Test of Independence: Investment Example
• Where do you reside?
  A. Large town  B. Medium town  C. Small town  D. Rural area
• Which type of financial investment are you most likely to make today?
  E. Stocks  F. Bonds  G. Treasury bills

Contingency Table
                         Type of Financial Investment
                         E      F      G      Total
Geographic    A                        O13    nA
Region        B                               nB
              C                               nC
              D                               nD
              Total      nE     nF     nG     N
LO2
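χ² Test of Independence: Investment Example

Contingency Table
                         Type of Financial Investment
                         E      F      G      Total
Geographic    A                 e12           nA
Region        B                               nB
              C                               nC
              D                               nD
              Total      nE     nF     nG     N

If A and F are independent:
$P(A \cap F) = P(A) \cdot P(F)$
$P(A) = \frac{n_A}{N}, \quad P(F) = \frac{n_F}{N}$
$P(A \cap F) = \frac{n_A}{N} \cdot \frac{n_F}{N}$
$e_{AF} = N \cdot P(A \cap F) = N \left(\frac{n_A}{N}\right)\left(\frac{n_F}{N}\right) = \frac{n_A \, n_F}{N}$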
LO2
χ² Test of Independence: Formulas

Expected frequencies:
$e_{ij} = \frac{n_i \, n_j}{N}$
where i = the row, j = the column, n_i = the total of row i, n_j = the total of column j, N = the total of all frequencies

Calculated (observed) χ²:
$\chi^2 = \sum \sum \frac{(f_o - f_e)^2}{f_e}$
where df = (r − 1)(c − 1), r = the number of rows, c = the number of columns
LO2
χ² Test of Independence: Gasoline Preference Versus Income Category
LO2
Contingency Table for the Gas Consumer Example
LO2
Gasoline Preference Versus Income Category: Expected Frequencies
Observed frequencies, with expected frequencies in parentheses:

                          Type of Gasoline
Income                    Regular     Premium     Extra Premium    Total
Less than $30,000         85          16          6                107
                          (66.15)     (24.46)     (16.40)
$30,000 to $49,999        102         27          13               142
                          (87.78)     (32.46)     (21.76)
$50,000 to $99,000        36          22          15               73
                          (45.13)     (16.69)     (11.19)
At least $100,000         15          23          25               63
                          (38.95)     (14.40)     (9.65)
Total                     238         88          59               385

$e_{ij} = \frac{n_i \, n_j}{N}$
$e_{11} = \frac{(107)(238)}{385} = 66.15$
$e_{12} = \frac{(107)(88)}{385} = 24.46$
$e_{13} = \frac{(107)(59)}{385} = 16.40$
LO2
Gasoline Preference Versus Income Category: χ² Calculation
LO2
• The observed chi-square value of 70.78 is greater than the critical value of 16.8119.
• The decision is to reject the null hypothesis. The data provide enough evidence to indicate that the type of gasoline preferred is not independent of income.
Gasoline Preference Versus Income Category
LO2
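The gasoline example can be checked with a short Python sketch that builds the expected frequencies from the row and column totals and then sums the chi-square terms. The variable names are illustrative, and the critical value is copied from the slide. The script keeps unrounded expected frequencies, so its statistic comes out slightly below the 70.78 obtained from expected values rounded to two decimals.

```python
# Observed frequencies: rows are income categories, columns are gasoline types
# (Regular, Premium, Extra Premium), as in the contingency table.
observed = [
    [85, 16, 6],     # Less than $30,000
    [102, 27, 13],   # $30,000 to $49,999
    [36, 22, 15],    # $50,000 to $99,000
    [15, 23, 25],    # At least $100,000
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected frequency under independence: e_ij = (row total * column total) / N.
expected = [
    [r * c / grand_total for c in col_totals]
    for r in row_totals
]

# Chi-square statistic: double sum of (fo - fe)^2 / fe over all cells.
chi_square = sum(
    (f_o - f_e) ** 2 / f_e
    for obs_row, exp_row in zip(observed, expected)
    for f_o, f_e in zip(obs_row, exp_row)
)

# df = (r - 1)(c - 1)
df = (len(observed) - 1) * (len(observed[0]) - 1)

critical_value = 16.8119   # critical value quoted on the slide
reject_null = chi_square > critical_value

print(f"e11 = {expected[0][0]:.2f}, chi-square = {chi_square:.2f}, df = {df}")
print("Reject H0" if reject_null else "Do not reject H0")
```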
Gasoline Preference Versus Income Category: Minitab Output
LO2
• Chi-square tests indicate whether two distributions are the same or not. They do not tell you in what specific way they differ.
• The chi-square test of independence indicates whether two variables are independent or not. It does not tell you the nature of the relationship between the two variables.
• Chi-square techniques are an outgrowth of the binomial distribution and the inferential techniques for analyzing population proportions.
• Both the chi-square test of independence and the chi-square goodness-of-fit test require that expected values be greater than or equal to 5. If they are not, combine adjacent rows or columns until all expected values are 5 or greater.
Important Points of Interest
LO2
COPYRIGHT
Copyright © 2014 John Wiley & Sons Canada, Ltd. All rights reserved. Reproduction or translation of this work beyond that permitted by Access Copyright (The Canadian Copyright Licensing Agency) is unlawful. Requests for further information should be addressed to the Permissions Department, John Wiley & Sons Canada, Ltd. The purchaser may make back-up copies for his or her own use only and not for distribution or resale. The author and the publisher assume no responsibility for errors, omissions, or damages caused by the use of these programs or from the use of the information contained herein.