CHAPTER 5
5.1 INTRODUCTORY CHI-SQUARE TEST
Objectives: To study the methods of analyzing categorical data.
The chi-square test covers 3 methods of analysis:
Goodness-of-fit Test: To test the assumption that a variable follows a certain distribution.
Independence Test: To test whether two variables are dependent on one another.
Homogeneity Test: To test whether several populations are homogeneous in some characteristic.
Goodness-of-fit test:
In the goodness-of-fit test, chi-square analysis is applied:
i. To examine whether sample data could have been drawn from a population having a specific probability distribution
ii. To compare an observed distribution to an expected distribution
The goodness-of-fit test procedure is appropriate when the following conditions are met:
i. The sampling method is simple random sampling
ii. The population is at least 10 times as large as the sample
iii. The variable under study is categorical
iv. The expected value for each level of the variable is at least 5
Test procedure to run the Goodness-of-fit test:
1. State the null hypothesis $H_0$ and the alternative hypothesis $H_1$.
2. Determine:
i. The level of significance, $\alpha$
ii. The degrees of freedom, $df = k - p - 1$, where
$k$ = the number of levels of the categorical variable
$p$ = the number of unknown parameters that need to be estimated from the data.
If there is no unknown parameter, then the degrees of freedom are $k - 1$, where $p = 0$.
If there is one unknown parameter, then the degrees of freedom are $k - 2$, where $p = 1$.
3. Find the value of $\chi^2_{\alpha,\,k-p-1}$ from the table of the chi-square distribution.
4. Calculate the value of $\chi^2_{\text{calculated}}$ using the formula below:
$$\chi^2_{\text{calculated}} = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i}$$
where
$o_i$ = observed frequency of the $i$-th category
$e_i$ = expected frequency of the $i$-th category
$e_i = nP(X_i)$ and $n = o_1 + o_2 + \dots + o_k$
Category     1       2       …   k
Frequency    $o_1$   $o_2$   …   $o_k$
5. Determine the rejection region:
i. Critical value approach: Reject $H_0$ if $\chi^2_{\text{calculated}} > \chi^2_{\alpha,df}$
ii. p-value approach: Reject $H_0$ if p-value $< \alpha$
6. Make a decision.
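The goodness-of-fit procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `goodness_of_fit`, its parameter `n_estimated` (standing for p) and the three-category counts are hypothetical and do not come from the slides. The critical value is taken from a chi-square table rather than computed.

```python
def goodness_of_fit(observed, probs, n_estimated=0):
    """Return (expected, statistic, df) for a chi-square goodness-of-fit test.

    observed    : list of observed frequencies o_1..o_k
    probs       : list of hypothesised probabilities P(X_1)..P(X_k)
    n_estimated : p, the number of parameters estimated from the data
    """
    n = sum(observed)                     # n = o_1 + o_2 + ... + o_k
    expected = [n * p for p in probs]     # e_i = n * P(X_i)
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - n_estimated - 1  # df = k - p - 1
    return expected, statistic, df


# Hypothetical data: 60 observations over three equally likely categories.
observed = [18, 22, 20]
probs = [1 / 3, 1 / 3, 1 / 3]
expected, statistic, df = goodness_of_fit(observed, probs)

critical_value = 5.991  # chi-square table value for alpha = 0.05, df = 2
reject_h0 = statistic > critical_value  # critical value approach
print(round(statistic, 3), df, reject_h0)
```

Passing the critical value in by hand mirrors the table lookup in step 3. A statistics library could supply it instead, but that choice is left open here.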
Example 5.1:
The authority claims that the proportions of road accidents occurring in this country according to the categories User Attitude (A), Mechanical Fault (M), Insufficient Sign Board (I) and Fate (F) are 60%, 20%, 15% and 5% respectively. A study by an independent body shows the following data:
Category     A     M    I    F    Total
Frequency    130   35   30   5    200
Can we accept the claim at significance level $\alpha = 0.05$?
Solution:
1. $n = 200$
$H_0: P(A) = 0.6,\ P(M) = 0.2,\ P(I) = 0.15,\ P(F) = 0.05$
$H_1:$ At least one $P(i)$ differs, for $i = A, M, I$ and $F$
2. $\alpha = 0.05$, $df = k - 1 = 4 - 1 = 3$
3. From the chi-square distribution table, $\chi^2_{0.05,3} = 7.815$; reject $H_0$ if $\chi^2_{\text{calculated}} > \chi^2_{0.05,3} = 7.815$.
4. With $e_i = nP(X_i)$:
Category   $o_i$   $e_i$                  $(o_i - e_i)^2 / e_i$
A          130     0.6 × 200 = 120        0.833
M          35      0.2 × 200 = 40         0.625
I          30      0.15 × 200 = 30        0.000
F          5       0.05 × 200 = 10        2.500
$$\chi^2_{\text{calculated}} = \frac{(130-120)^2}{120} + \frac{(35-40)^2}{40} + \frac{(30-30)^2}{30} + \frac{(5-10)^2}{10} = 0.833 + 0.625 + 0.000 + 2.500 = 3.958$$
5. Rejection region: $\chi^2_{\text{calculated}} > \chi^2_{0.05,3} = 7.815$
6. Since $\chi^2_{\text{calculated}} = 3.958 < \chi^2_{0.05,3} = 7.815$, we do not reject $H_0$ and conclude that we have no evidence to reject the claim.
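As a check on the arithmetic in Example 5.1, the same computation can be reproduced in Python. This is a minimal sketch: the variable names are illustrative, and the critical value 7.815 is the table value quoted in the example, not something the code computes.

```python
# Observed accident counts and claimed proportions from Example 5.1.
observed = {"A": 130, "M": 35, "I": 30, "F": 5}
claimed = {"A": 0.60, "M": 0.20, "I": 0.15, "F": 0.05}

n = sum(observed.values())  # total number of accidents in the study

# Contribution of each category: (o_i - e_i)^2 / e_i with e_i = n * P(X_i)
contributions = {}
for category, o in observed.items():
    e = n * claimed[category]
    contributions[category] = (o - e) ** 2 / e

chi2_calculated = sum(contributions.values())
critical_value = 7.815  # chi-square table, alpha = 0.05, df = 3
reject_h0 = chi2_calculated > critical_value

for category, value in contributions.items():
    print(category, round(value, 3))
print("chi-square:", round(chi2_calculated, 3), "reject H0:", reject_h0)
```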
Exercise 5.1:
The number of students playing truant in a school over 200 school days is shown below:
No. of truancy   0    1    2    3    4    5
No. of days      12   32   45   50   35   26
If X is a random variable representing the number of students playing truant per day, test the hypothesis that X follows the Poisson distribution with mean 3 per day at $\alpha = 0.01$.
Exercise 5.2:
The probabilities of blood phenotypes A, B, AB and O in the population of all Caucasians in the US are 0.41, 0.10, 0.04 and 0.45 respectively. To determine whether or not the actual population proportions fit this set of reported probabilities, a random sample of 200 Americans was selected and their phenotypes were recorded. The observed cell counts are given below. Test the goodness of fit of these blood phenotype probabilities at $\alpha = 0.10$.
Blood Phenotype   A    B    AB   O
Observed          89   18   12   81
The Chi-Square Test for Homogeneity
The homogeneity test is used to determine whether several populations are similar or equal or homogeneous in some characteristics.
This test is applied to a single categorical variable from two or more different populations.
The test procedure is appropriate when the following conditions are met:
i. For each population, the sampling method is simple random sampling
ii. Each population is at least 10 times as large as its sample
iii. The variable under study is categorical
iv. If the sample data are displayed in a contingency table (populations x category levels), the expected value for each cell of the table is at least 5.
Two-dimensional contingency table layout:
                              Column Variable
Row Variable    Category B1    Category B2    …    Category Bc    Total
Category A1     $o_{11}$       $o_{12}$       …    $o_{1c}$       $n_{1\cdot}$
Category A2     $o_{21}$       $o_{22}$       …    $o_{2c}$       $n_{2\cdot}$
Category …      …              …              …    …              …
Category Ar     $o_{r1}$       $o_{r2}$       …    $o_{rc}$       $n_{r\cdot}$
Total           $n_{\cdot 1}$  $n_{\cdot 2}$  …    $n_{\cdot c}$  $n$
The above is an (r x c) contingency table, where r denotes the number of categories of the row variable and c denotes the number of categories of the column variable.
$o_{ij}$ is the observed frequency in cell (i, j)
$n_{i\cdot}$ is the total frequency for row category i
$n_{\cdot j}$ is the total frequency for column category j
$n$ is the grand total frequency for all cells (i, j)
The expected frequency for cell (i, j) is
$$e_{ij} = \frac{(i^{\text{th}}\ \text{row total}) \times (j^{\text{th}}\ \text{column total})}{\text{grand total}} = \frac{n_{i\cdot}\, n_{\cdot j}}{n}$$
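The expected-frequency rule (row total times column total, divided by the grand total) can be sketched in Python as follows. The function name `expected_frequencies` and the 2 x 2 table are hypothetical and serve only to illustrate the rule.

```python
def expected_frequencies(table):
    """Expected cell frequencies e_ij = (row total i * column total j) / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]  # zip(*table) walks the columns
    n = sum(row_totals)                             # grand total
    return [[r * c / n for c in col_totals] for r in row_totals]


# Hypothetical 2 x 2 table of observed frequencies (not from the slides).
table = [[10, 20],
         [30, 40]]
expected = expected_frequencies(table)
print(expected)
```

In this illustration the expected frequencies keep the same row and column totals as the observed table, which is a quick sanity check on any hand calculation.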
Test procedure to run the Chi-square test for homogeneity:
1. State the null hypothesis $H_0$ and the alternative hypothesis $H_1$.
Eg:
$H_0$: The proportions of the ROW variable are the SAME across the COLUMN variable
$H_1$: The proportions of the ROW variable are NOT the SAME across the COLUMN variable
2. Determine:
i. The level of significance, $\alpha$
ii. The degrees of freedom, $df = (r - 1)(c - 1)$, where
$r$ = number of rows
$c$ = number of columns
3. Find the value of $\chi^2_{\alpha,df}$ from the table of the chi-square distribution. Determine the rejection region:
i. Critical value approach: Reject $H_0$ if $\chi^2_{\text{calculated}} > \chi^2_{\alpha,df}$
ii. p-value approach: Reject $H_0$ if p-value $< \alpha$
4. Calculate the value of $\chi^2_{\text{calculated}}$ using the formula below:
$$\chi^2_{\text{calculated}} = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$$
where
$o_{ij}$ = observed frequency of the $i$-th row and $j$-th column
$e_{ij}$ = expected frequency of the $i$-th row and $j$-th column
5. Make a decision.
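The procedure above can be sketched as a small Python function that returns the test statistic and the degrees of freedom for any r x c table. The function name `chi_square_contingency` and the 2 x 2 counts are hypothetical. The critical value 3.841 is the standard table value for alpha = 0.05 with 1 degree of freedom, entered by hand.

```python
def chi_square_contingency(table):
    """Return (statistic, df) for an r x c table of observed frequencies."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected frequency e_ij
            statistic += (o - e) ** 2 / e          # (o_ij - e_ij)^2 / e_ij

    df = (len(table) - 1) * (len(table[0]) - 1)    # df = (r - 1)(c - 1)
    return statistic, df


# Hypothetical counts: two populations (rows), two category levels (columns).
table = [[20, 30],
         [30, 20]]
statistic, df = chi_square_contingency(table)

critical_value = 3.841  # chi-square table, alpha = 0.05, df = 1
reject_h0 = statistic > critical_value
print(statistic, df, reject_h0)
```

The same function applies unchanged to a test for independence, since only the hypotheses differ between the two tests.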
Example 5.2:
Four machines manufacture cylindrical steel pins. The pins are subjected to a diameter specification. A pin may meet the specification or it may be too thin or too thick. Pins are sampled from each machine and the number of pins in each category is counted. The table below presents the results. Test at $\alpha = 0.01$ whether the categories of pins are similar for all machines.
             Too thin   OK     Too thick
Machine 1    10         102    8
Machine 2    34         161    5
Machine 3    12         79     9
Machine 4    10         60     10
Solution:
Construct a contingency table:
             Too thin   OK     Too thick   Total
Machine 1    10         102    8           120
Machine 2    34         161    5           200
Machine 3    12         79     9           100
Machine 4    10         60     10          80
Total        66         402    32          500
Calculation of the expected frequency:
$$e_{ij} = \frac{(i^{\text{th}}\ \text{row total}) \times (j^{\text{th}}\ \text{column total})}{\text{grand total}} = \frac{n_{i\cdot}\, n_{\cdot j}}{n}$$
Testing procedure:
1. $H_0$: The proportion of pins that are too thin, OK, or too thick is the same for all machines
$H_1$: The proportion of pins that are too thin, OK, or too thick is not the same for all machines
2. $\alpha = 0.01$
$r = 4$, $c = 3$. So $df = (r - 1)(c - 1) = (4 - 1)(3 - 1) = 6$
3. From the table of chi-square: $\chi^2_{0.01,6} = 16.812$ and we reject $H_0$ if $\chi^2_{\text{calculated}} > \chi^2_{0.01,6}$
4. Using the observed and expected frequencies in the contingency table, we calculate $\chi^2_c$ using the formula given:
$$\chi^2_c = \sum_{i}\sum_{j} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}, \qquad e_{ij} = \frac{(i^{\text{th}}\ \text{row total}) \times (j^{\text{th}}\ \text{column total})}{\text{grand total}}$$
Cell (i, j)   $o_{ij}$   $e_{ij}$                       $(o_{ij} - e_{ij})^2 / e_{ij}$
(1, 1)        10         120 × 66 / 500 = 15.84         (10 − 15.84)² / 15.84 = 2.1531
(1, 2)        102        120 × 402 / 500 = 96.48        (102 − 96.48)² / 96.48 = 0.3158
(1, 3)        8          120 × 32 / 500 = 7.68          (8 − 7.68)² / 7.68 = 0.0133
(2, 1)        34         200 × 66 / 500 = 26.40         (34 − 26.40)² / 26.40 = 2.1879
(2, 2)        161        200 × 402 / 500 = 160.80       (161 − 160.80)² / 160.80 = 0.0002
(2, 3)        5          200 × 32 / 500 = 12.80         (5 − 12.80)² / 12.80 = 4.7531
(3, 1)        12         100 × 66 / 500 = 13.20         (12 − 13.20)² / 13.20 = 0.1091
(3, 2)        79         100 × 402 / 500 = 80.40        (79 − 80.40)² / 80.40 = 0.0244
(3, 3)        9          100 × 32 / 500 = 6.40          (9 − 6.40)² / 6.40 = 1.0563
(4, 1)        10         80 × 66 / 500 = 10.56          (10 − 10.56)² / 10.56 = 0.0297
(4, 2)        60         80 × 402 / 500 = 64.32         (60 − 64.32)² / 64.32 = 0.2901
(4, 3)        10         80 × 32 / 500 = 5.12           (10 − 5.12)² / 5.12 = 4.6513
$\chi^2_c = 15.5844$
5. Since $\chi^2_c = 15.5844 < \chi^2_{0.01,6} = 16.812$, we fail to reject $H_0$ and conclude that the proportion of pins that are too thin, OK, or too thick is the same for all machines.
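The figures in Example 5.2 can be checked with a short Python sketch. The variable names are illustrative. The critical value 16.812 is the table value quoted in the example, and the check on the smallest expected frequency corresponds to condition iv of the test.

```python
# Observed pin counts from Example 5.2 (columns: too thin, OK, too thick).
observed = [
    [10, 102, 8],   # Machine 1
    [34, 161, 5],   # Machine 2
    [12, 79, 9],    # Machine 3
    [10, 60, 10],   # Machine 4
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# e_ij = (row total i * column total j) / grand total
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2_c = sum(
    (o - e) ** 2 / e
    for o_row, e_row in zip(observed, expected)
    for o, e in zip(o_row, e_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

smallest_expected = min(min(row) for row in expected)  # should be at least 5
critical_value = 16.812  # chi-square table, alpha = 0.01, df = 6
reject_h0 = chi2_c > critical_value

print(row_totals, col_totals, n)
print("chi-square:", round(chi2_c, 4), "df:", df, "reject H0:", reject_h0)
```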
Exercise 5.3:
200 female owners and 200 male owners of Proton cars are selected at random and the colours of their cars are noted. The following data shows the results:
                 Car Colour
Gender    Black   Dull   Bright
Male      40      110    50
Female    20      80     100
Use a 1% significance level to test whether the proportions of colour preference are the same for females and males.
Chi-Square Test for Independence
This test is applied to a single population that has two categorical variables, to determine whether there is a significant association between the two variables.
Eg: In an election survey, voters might be classified by gender (female and male) and voting preference (Democrat, Republican or Independent). This test is used to determine whether gender is related to voting preference.
The test is appropriate if the following conditions are met:
i. The sampling method is simple random sampling
ii. Each population is at least 10 times as large as the sample
iii. The variable under study is categorical
iv. If the sample data are displayed in a contingency table (population x category levels), the expected value for each cell of the table is at least 5.
Note: The procedure for the Chi-square test for independence is the same as for the Chi-square test for homogeneity.
The only difference between these two tests is in the statement of the null and alternative hypotheses. The rest of the procedure is the same for both tests.
This test is useful in testing the following hypotheses:
$H_0$: The ROW and COLUMN variables are INDEPENDENT
$H_1$: The ROW and COLUMN variables are NOT INDEPENDENT
Example 5.3:
Insomnia is a disease where a person finds it hard to sleep at night. A study is conducted to determine whether the two attributes, smoking habit and insomnia, are dependent. The following data set was obtained:
                     Insomnia
Habit           Yes   No
Non-smokers     10    70
Ex-smokers      8     32
Smokers         22    38
Use a 5% significance level to conduct the study.
Solution:
                     Insomnia
Habit           Yes   No    Total
Non-smokers     10    70    80
Ex-smokers      8     32    40
Smokers         22    38    60
Total           40    140   180
1. $H_0$: Smoking habits and insomnia are independent
$H_1$: Smoking habits and insomnia are not independent
2. $\alpha = 0.05$
$r = 3$, $c = 2$. So $df = (r - 1)(c - 1) = (3 - 1)(2 - 1) = 2$
3. From the table of chi-square: $\chi^2_{0.05,2} = 5.991$ and we reject $H_0$ if $\chi^2_{\text{calculated}} > \chi^2_{0.05,2}$
4. Using the observed and expected frequencies in the contingency table, we calculate $\chi^2_c$ using the formula given:
$$\chi^2_c = \sum_{i}\sum_{j} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$$
Cell (i, j)   $o_{ij}$   $e_{ij}$                      $(o_{ij} - e_{ij})^2 / e_{ij}$
(1, 1)        10         80 × 40 / 180 = 17.78         (10 − 17.78)² / 17.78 = 3.40
(1, 2)        70         80 × 140 / 180 = 62.22        (70 − 62.22)² / 62.22 = 0.97
(2, 1)        8          40 × 40 / 180 = 8.89          (8 − 8.89)² / 8.89 = 0.09
(2, 2)        32         40 × 140 / 180 = 31.11        (32 − 31.11)² / 31.11 = 0.03
(3, 1)        22         60 × 40 / 180 = 13.33         (22 − 13.33)² / 13.33 = 5.64
(3, 2)        38         60 × 140 / 180 = 46.67        (38 − 46.67)² / 46.67 = 1.61
$\chi^2_c = 3.40 + 0.97 + 0.09 + 0.03 + 5.64 + 1.61 = 11.74$
5. Since $\chi^2_c = 11.74 > \chi^2_{0.05,2} = 5.991$, we reject $H_0$ and conclude that smoking habit and insomnia are not independent.
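Example 5.3 can also be checked in Python, this time using the p-value approach as well. This is a sketch under one stated assumption: for exactly 2 degrees of freedom the upper-tail probability of the chi-square distribution is exp(-x / 2), so the shortcut below does not apply to other degrees of freedom. Because the code keeps unrounded expected frequencies, its statistic (about 11.73) may differ slightly from hand-rounded figures.

```python
import math

# Observed counts from Example 5.3 (columns: insomnia yes, insomnia no).
observed = [
    [10, 70],   # Non-smokers
    [8, 32],    # Ex-smokers
    [22, 38],   # Smokers
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi2_c = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n  # expected frequency e_ij
        chi2_c += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(observed[0]) - 1)

alpha = 0.05
critical_value = 5.991            # chi-square table, alpha = 0.05, df = 2
p_value = math.exp(-chi2_c / 2)   # upper-tail probability, valid only when df == 2

reject_by_critical_value = chi2_c > critical_value
reject_by_p_value = p_value < alpha
print(round(chi2_c, 2), df, round(p_value, 4))
print(reject_by_critical_value, reject_by_p_value)
```

Both approaches lead to the same decision here, as they always should for the same data and significance level.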
Exercise 5.4:
A study is conducted to determine whether students' academic performance is independent of their involvement in co-curricular activities. The following data set was obtained:
                                  Academic Performance
Co-curricular Activities    Low   Fair   Good
Inactive                    40    80     60
Active                      30    90     60
Use a 5% significance level to conduct the study.
Exercise 5.5:
A total of n = 309 furniture defects were recorded and the defects were classified into four types: A, B, C, D. At the same time, each piece of furniture was identified by the production shift in which it was manufactured. These counts are presented in the table below. Test at the 5% significance level whether the type of defect and the production shift are independent.
                         Shift
Type of Defects    1     2     3
A                  15    26    33
B                  21    31    17
C                  45    34    49
D                  13    5     20