Post on 02-Jan-2016
CHAPTER 5: INTRODUCTORY CHI-SQUARE TEST
• This chapter introduces a new probability distribution called the chi-square distribution.
• The chi-square distribution will be used in carrying out hypothesis tests to analyze whether:
i. A sample could have come from a given type of POPULATION DISTRIBUTION.
ii. Two nominal (categorical) variables could be INDEPENDENT of, or HOMOGENEOUS with, each other.
• The chi-square tests that will be discussed are:
i. Goodness-of-fit Test
ii. The Chi-square Test For Homogeneity
iii. The Chi-square Test For Independence
1) Goodness-of-fit Test
• In the Goodness-of-fit test, chi-square analysis is applied to examine whether sample data could have been drawn from a population having a specific probability distribution.
• In the Goodness-of-fit test, the test procedures are appropriate when the following conditions are met:
i. The sampling method is simple random sampling.
ii. The population is at least 10 times as large as the sample.
iii. The variable under study is categorical (qualitative variable).
iv. The expected value ($e_i$) for each level of the variable is at least 5.
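Condition iv can be checked before running the test. The following is a minimal sketch, assuming the hypothesised proportions and the sample size are known; the helper name `expected_counts_ok` and the numbers in the usage lines are illustrative and do not come from the slides.

```python
def expected_counts_ok(n, proportions, minimum=5):
    """Return (ok, expected) where expected[i] = n * p_i.

    ok is True only if every expected frequency is at least `minimum`.
    """
    expected = [n * p for p in proportions]
    return all(e >= minimum for e in expected), expected


# Hypothetical case: 50 observations spread over four categories.
ok, expected = expected_counts_ok(50, [0.5, 0.3, 0.15, 0.05])
print(ok, expected)  # the last category expects only 2.5 observations, so ok is False
```

When the check fails, a common remedy is to pool small categories or collect a larger sample before applying the test.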
• The frequency distribution table layout:

Category | 1 | 2 | … | k
Frequency | $o_1$ | $o_2$ | … | $o_k$

where
$k$ = number of levels of the categorical variable
$o_i$ = observed frequency for the $i$-th level, for $i = 1, 2, \ldots, k$

• Test procedure to run the Goodness-of-fit test:
1. State the null hypothesis $H_0$ and the alternative hypothesis $H_1$.
2. Determine:
i. The level of significance, $\alpha$
ii. The degrees of freedom, $df = k - 1$, where $k$ = number of levels of the categorical variable
Find the value of $\chi^2_{\alpha, df}$ from the table of chi-square distribution.
3. Calculate the value of $\chi^2_{calculated}$ using the formula below:

$$\chi^2_{calculated} = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i}$$

where
$o_i$ = observed frequency for the $i$-th level
$e_i$ = expected frequency for the $i$-th level, $e_i = n P(X_i)$ for $i = 1, 2, \ldots, k$
$n = o_1 + o_2 + \cdots + o_k$ = total observations
4. Determine the rejection region to reject $H_0$:
i. If $\chi^2_{calculated} > \chi^2_{\alpha, df}$, or
ii. If $p$-value $< \alpha$.
5. Make decision/conclusion.
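The p-value route in step 4 needs the upper-tail probability of the chi-square distribution, which a printed table of critical values does not give directly. The sketch below is one way to compute it with only the Python standard library, assuming the degrees of freedom are a positive integer; `chi2_sf` is a hypothetical helper name, not something defined in the slides.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(chi-square with df degrees of freedom > x).

    Uses the closed-form series that holds for positive integer df.
    """
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df = 2m: exp(-x/2) * sum_{j=0}^{m-1} (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    # Odd df = 2m + 1:
    # erfc(sqrt(x/2)) + exp(-x/2) * sum_{j=1}^{m} (x/2)^(j - 1/2) / Gamma(j + 1/2)
    total = 0.0
    for j in range(1, (df - 1) // 2 + 1):
        total += half ** (j - 0.5) / math.gamma(j + 0.5)
    return math.erfc(math.sqrt(half)) + math.exp(-half) * total


# Sanity check against a standard table entry: 7.815 is the 5% critical value for df = 3.
print(chi2_sf(7.815, 3))  # close to 0.05
```

Rejecting $H_0$ when `chi2_sf(statistic, df) < alpha` gives the same decision as comparing the statistic with the table value.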
Example: The authority claims that the proportions of road accidents occurring in this country according to the categories User Attitude (A), Mechanical Fault (M), Insufficient Sign Board (I) and Fate (F) are 60%, 20%, 15% and 5% respectively. A study by an independent body shows the following data.

Category | A | M | I | F | Total
Frequency | 130 | 35 | 30 | 5 | 200

Can we accept the claim at significance level $\alpha = 0.05$?

Solution:
$n = 200$, $k = 4$
1. $H_0$: $P(A) = 0.6$, $P(M) = 0.2$, $P(I) = 0.15$, $P(F) = 0.05$ (claim)
$H_1$: At least one $P(i)$ differs, for $i = A, M, I$ and $F$
2. $\alpha = 0.05$, $df = k - 1 = 4 - 1 = 3$, $\chi^2_{0.05, 3} = 7.815$
3. Calculate $\chi^2_c$ with $e_i = n P(X_i)$:

$o_i$ | $e_i = nP(X_i)$ | $(o_i - e_i)^2 / e_i$
$o_A = 130$ | $e_A = 200(0.6) = 120$ | $(130 - 120)^2 / 120 = 0.833$
$o_M = 35$ | $e_M = 200(0.2) = 40$ | $(35 - 40)^2 / 40 = 0.625$
$o_I = 30$ | $e_I = 200(0.15) = 30$ | $(30 - 30)^2 / 30 = 0.000$
$o_F = 5$ | $e_F = 200(0.05) = 10$ | $(5 - 10)^2 / 10 = 2.5$

$\chi^2_c = 3.958$
4. Since $\chi^2_c = 3.958 < \chi^2_{0.05, 3} = 7.815$, we do not reject $H_0$.
5. We conclude that we have no evidence to reject the claim.
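The arithmetic of the road-accident example can be reproduced in a few lines. This is a sketch rather than part of the original solution: the function name `goodness_of_fit` is an illustrative choice, and the critical value 7.815 is simply copied from the chi-square table for $\alpha = 0.05$, $df = 3$.

```python
def goodness_of_fit(observed, proportions):
    """Return (chi-square statistic, degrees of freedom) for a goodness-of-fit test."""
    n = sum(observed)
    expected = [n * p for p in proportions]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, len(observed) - 1


observed = [130, 35, 30, 5]          # categories A, M, I, F
claimed = [0.60, 0.20, 0.15, 0.05]   # proportions under H0
critical = 7.815                     # table value for alpha = 0.05, df = 3

statistic, df = goodness_of_fit(observed, claimed)
print(round(statistic, 3), df)       # 3.958 3
print("reject H0" if statistic > critical else "do not reject H0")
```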
Example: The number of students playing truant in a school over 200 school days is shown below.

No. of truancy | 0 | 1 | 2 | 3 | 4 | ≥5
No. of days | 12 | 32 | 45 | 50 | 35 | 26

If X is a random variable representing the number of students playing truant per day, test the hypothesis that X follows the Poisson distribution with mean 3 per day at $\alpha = 0.01$.

Solution:
$n = 200$, $k = 6$, mean $= 3$
1) $H_0$: X follows the Poisson distribution with mean 3 per day (claim)
$H_1$: X does not follow the Poisson distribution with mean 3 per day
2) $\alpha = 0.01$, $df = k - 1 = 5$, $\chi^2_{0.01, 5} = 15.086$
3)

No. of truancy | No. of days, $O_i$ | $P(X_i)$ | $e_i = nP(X_i)$ | $(O_i - e_i)^2 / e_i$
0 | 12 | 0.0498 | 200(0.0498) = 9.96 | 0.4178
1 | 32 | 0.1493 | 200(0.1493) = 29.86 | 0.1534
2 | 45 | 0.2241 | 200(0.2241) = 44.82 | 0.0007
3 | 50 | 0.2240 | 200(0.2240) = 44.80 | 0.6036
4 | 35 | 0.1681 | 200(0.1681) = 33.62 | 0.0566
≥5 | 26 | 0.1847 | 200(0.1847) = 36.94 | 3.2399
Total | 200 | | | $\chi^2 = 4.472$

4) Since 4.472 < 15.086, we do not reject $H_0$.
5) We conclude that there is not enough evidence to reject the claim.
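The Poisson probabilities in the table above appear to come from a cumulative table rounded to four decimals. As a cross-check, the sketch below computes them from the Poisson formula $P(X = x) = e^{-3} 3^x / x!$ and pools everything from 5 upwards into the last class; the variable names are illustrative. Because the probabilities are not rounded, the statistic should land close to, but not exactly on, 4.472.

```python
import math


def poisson_pmf(x, mean):
    """P(X = x) for a Poisson random variable with the given mean."""
    return math.exp(-mean) * mean ** x / math.factorial(x)


observed = [12, 32, 45, 50, 35, 26]   # days with 0, 1, 2, 3, 4 and 5-or-more truants
mean = 3
n = sum(observed)                     # 200 school days

# Probabilities for 0..4, then the pooled upper tail P(X >= 5).
probabilities = [poisson_pmf(x, mean) for x in range(5)]
probabilities.append(1 - sum(probabilities))

expected = [n * p for p in probabilities]
statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1                # the mean is fixed by H0, so no parameter is estimated

print([round(e, 2) for e in expected])
print(round(statistic, 3), df)
```

The decision is unchanged: the statistic stays far below the critical value 15.086.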
2) The Chi-Square Test for Homogeneity
• The homogeneity test is used to determine whether several populations are similar or equal or homogeneous in some characteristics.
• This test is applied to a single categorical variable from two or more different populations.
• The test procedure is appropriate when the following conditions are satisfied:
i. For each population, the sampling method is simple random sampling.
ii. Each population is at least 10 times as large as its sample.
iii. The variable under study is categorical.
iv. If sample data are displayed in a contingency table (population x category levels), the expected value ($e_{ij}$) for each cell of the table is at least 5.
Two-dimensional contingency table layout:

Row Variable \ Column Variable | Category B1 | Category B2 | … | Category Bc | Total
Category A1 | $o_{11}$ | $o_{12}$ | … | $o_{1c}$ | $n_{1\cdot}$
Category A2 | $o_{21}$ | $o_{22}$ | … | $o_{2c}$ | $n_{2\cdot}$
… | … | … | … | … | …
Category Ar | $o_{r1}$ | $o_{r2}$ | … | $o_{rc}$ | $n_{r\cdot}$
Total | $n_{\cdot 1}$ | $n_{\cdot 2}$ | … | $n_{\cdot c}$ | $n$

• The above is an (r x c) contingency table, where r denotes the number of categories of the row variable and c denotes the number of categories of the column variable.
• $o_{ij}$ is the observed frequency in cell (i, j)
• $n_{i\cdot}$ is the total frequency for row category i
• $n_{\cdot j}$ is the total frequency for column category j
• $n$ is the grand total frequency for all cells (i, j)
Test procedure to run the Chi-square test for homogeneity:
1. State the null hypothesis $H_0$ and the alternative hypothesis $H_1$.
Eg:
$H_0$: The proportions of the COLUMN variable categories are the SAME for all ROW variable categories
$H_1$: The proportions of the COLUMN variable categories are NOT the SAME for all ROW variable categories
2. Determine:
i. The level of significance, $\alpha$
ii. The degrees of freedom, $df = (r - 1)(c - 1)$, where $r$ = number of rows and $c$ = number of columns
Find the value of $\chi^2_{\alpha, df}$ from the table of chi-square distribution.
3. Calculate the value of $\chi^2_{calculated}$ using the formula below:

$$\chi^2_{calculated} = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$$

where
$o_{ij}$ = observed frequency of the $i$-th row and $j$-th column
$e_{ij}$ = expected frequency of the $i$-th row and $j$-th column

$$e_{ij} = \frac{(i\text{-th row total})(j\text{-th column total})}{\text{grand total}} = \frac{n_{i\cdot}\, n_{\cdot j}}{n}$$

4. Determine the rejection region to reject $H_0$:
i. If $\chi^2_{calculated} > \chi^2_{\alpha, df}$, or
ii. If $p$-value $< \alpha$ (p-value approach).
5. Make decision.
Example: Four machines manufacture cylindrical steel pins. The pins are subjected to a diameter specification. A pin may meet the specification or it may be too thin or too thick. Pins are sampled from each machine and the number of pins in each category is counted. The table below presents the results. Test at $\alpha = 0.01$.

Machine | Too thin | OK | Too thick
Machine 1 | 10 | 102 | 8
Machine 2 | 34 | 161 | 5
Machine 3 | 12 | 79 | 9
Machine 4 | 10 | 60 | 10

Solution:
1. $H_0$: The proportion of pins that are too thin, OK, or too thick is the same for all machines
$H_1$: The proportion of pins that are too thin, OK, or too thick is not the same for all machines
2. $\alpha = 0.01$
$r = 4$, $c = 3$, so $df = (r - 1)(c - 1) = (4 - 1)(3 - 1) = 6$
From the table of chi-square: $\chi^2_{0.01, 6} = 16.812$
3. Construct a contingency table:

Machine | Too thin | OK | Too thick | Total
Machine 1 | 10 | 102 | 8 | 120
Machine 2 | 34 | 161 | 5 | 200
Machine 3 | 12 | 79 | 9 | 100
Machine 4 | 10 | 60 | 10 | 80
Total | 66 | 402 | 32 | 500

Calculation of the expected frequency:

$$e_{ij} = \frac{(i\text{-th row total})(j\text{-th column total})}{\text{grand total}} = \frac{n_{i\cdot}\, n_{\cdot j}}{n}$$

Using the observed and expected frequencies in the contingency table, we calculate $\chi^2_c$ using the formula given:

$o_{ij}$ | $e_{ij}$ | $(o_{ij} - e_{ij})^2 / e_{ij}$
$o_{11} = 10$ | $e_{11} = 120(66)/500 = 15.84$ | $(10 - 15.84)^2 / 15.84 = 2.1531$
$o_{12} = 102$ | $e_{12} = 120(402)/500 = 96.48$ | $(102 - 96.48)^2 / 96.48 = 0.3158$
$o_{13} = 8$ | $e_{13} = 120(32)/500 = 7.68$ | $(8 - 7.68)^2 / 7.68 = 0.0133$
$o_{21} = 34$ | $e_{21} = 200(66)/500 = 26.40$ | $(34 - 26.40)^2 / 26.40 = 2.1879$
$o_{22} = 161$ | $e_{22} = 200(402)/500 = 160.80$ | $(161 - 160.80)^2 / 160.80 = 0.0002$
$o_{23} = 5$ | $e_{23} = 200(32)/500 = 12.80$ | $(5 - 12.80)^2 / 12.80 = 4.7531$
$o_{31} = 12$ | $e_{31} = 100(66)/500 = 13.20$ | $(12 - 13.20)^2 / 13.20 = 0.1091$
$o_{32} = 79$ | $e_{32} = 100(402)/500 = 80.40$ | $(79 - 80.40)^2 / 80.40 = 0.0244$
$o_{33} = 9$ | $e_{33} = 100(32)/500 = 6.40$ | $(9 - 6.40)^2 / 6.40 = 1.0563$
$o_{41} = 10$ | $e_{41} = 80(66)/500 = 10.56$ | $(10 - 10.56)^2 / 10.56 = 0.0297$
$o_{42} = 60$ | $e_{42} = 80(402)/500 = 64.32$ | $(60 - 64.32)^2 / 64.32 = 0.2901$
$o_{43} = 10$ | $e_{43} = 80(32)/500 = 5.12$ | $(10 - 5.12)^2 / 5.12 = 4.6513$

$\chi^2_c = 15.5844$
4. Since $\chi^2_c = 15.5844 < \chi^2_{0.01, 6} = 16.812$, we fail to reject $H_0$.
5. We conclude that there is not enough evidence that the proportion of pins that are too thin, OK, or too thick differs between machines.
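The twelve cell calculations in the pin example follow one pattern, so they are easy to automate. The sketch below assumes the observed counts are stored as a list of rows; `chi_square_table` is a hypothetical helper name, and 16.812 is the table value for $\alpha = 0.01$, $df = 6$ used above.

```python
def chi_square_table(table):
    """Return (statistic, df, expected) for an r x c table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # e_ij = (row total i) * (column total j) / grand total
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, expected


pins = [
    [10, 102, 8],    # Machine 1: too thin, OK, too thick
    [34, 161, 5],    # Machine 2
    [12, 79, 9],     # Machine 3
    [10, 60, 10],    # Machine 4
]
statistic, df, expected = chi_square_table(pins)
print(round(statistic, 3), df)
print("reject H0" if statistic > 16.812 else "fail to reject H0")
```

The returned `expected` table also makes it easy to confirm that every cell meets the minimum expected count of 5; here the smallest is 5.12 (Machine 4, too thick).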
Exercise: 200 female owners and 200 male owners of Proton cars are selected at random and the colour of their cars is noted. The following data shows the results:

Gender | Black | Dull | Bright
Male | 40 | 110 | 50
Female | 20 | 80 | 100

Use a 1% significance level to test whether the proportions of colour preference are the same for females and males.
3) Chi-Square Test for Independence
• This test is applied to a single population which has 2 categorical variables.
• It is used to determine whether there is a significant association between the 2 categorical variables.
• Eg: In an election survey, voters might be classified by gender (female and male) and voting preference (Democrat, Republican or Independent). This test is used to determine whether gender is related to voting preference.
• The test is appropriate if the following conditions are met:
i. The sampling method is simple random sampling.
ii. The population is at least 10 times as large as the sample.
iii. The variables under study are categorical.
iv. If sample data are displayed in a contingency table (row variable x column variable), the expected value for each cell of the table is at least 5.
• Note: The procedure for the Chi-square test for independence is the same as for the Chi-square test for homogeneity. The only difference between these two tests is in the statement of the null and alternative hypotheses. The rest of the procedure is the same for both tests.
This test is useful in testing the following hypotheses:
$H_0$: ROW and COLUMN variables are INDEPENDENT
$H_1$: ROW and COLUMN variables are NOT INDEPENDENT
Example: Insomnia is a disease in which a person finds it hard to sleep at night. A study is conducted to determine whether the two attributes, smoking habit and insomnia, are dependent. The following data set was obtained.

Habit | Insomnia: Yes | Insomnia: No
Non-smokers | 10 | 70
Ex-smokers | 8 | 32
Smokers | 22 | 38

Use a 5% significance level to conduct the study.

Solution: The contingency table

Habit | Insomnia: Yes | Insomnia: No | Total
Non-smokers | 10 | 70 | 80
Ex-smokers | 8 | 32 | 40
Smokers | 22 | 38 | 60
Total | 40 | 140 | 180

1. $H_0$: Smoking habit and insomnia are independent
$H_1$: Smoking habit and insomnia are not independent
2. $\alpha = 0.05$
$r = 3$, $c = 2$, so $df = (r - 1)(c - 1) = (3 - 1)(2 - 1) = 2$
$\chi^2_{0.05, 2} = 5.991$
3. Using the observed and expected frequencies in the contingency table, we calculate $\chi^2_c$ using the formula given:

$o_{ij}$ | $e_{ij}$ | $(o_{ij} - e_{ij})^2 / e_{ij}$
$o_{11} = 10$ | $e_{11} = 80(40)/180 = 17.78$ | $(10 - 17.78)^2 / 17.78 = 3.40$
$o_{12} = 70$ | $e_{12} = 80(140)/180 = 62.22$ | $(70 - 62.22)^2 / 62.22 = 0.97$
$o_{21} = 8$ | $e_{21} = 40(40)/180 = 8.89$ | $(8 - 8.89)^2 / 8.89 = 0.09$
$o_{22} = 32$ | $e_{22} = 40(140)/180 = 31.11$ | $(32 - 31.11)^2 / 31.11 = 0.03$
$o_{31} = 22$ | $e_{31} = 60(40)/180 = 13.33$ | $(22 - 13.33)^2 / 13.33 = 5.64$
$o_{32} = 38$ | $e_{32} = 60(140)/180 = 46.67$ | $(38 - 46.67)^2 / 46.67 = 1.61$

$\chi^2_c = 11.74$
4. Since $\chi^2_c = 11.74 > \chi^2_{0.05, 2} = 5.991$, we reject $H_0$.
5. We conclude that smoking habit and insomnia are not independent.
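Recomputing the insomnia example cell by cell shows which cells drive the result. The sketch below works from the observed table only and keeps the expected frequencies unrounded, so the total comes out near 11.73 rather than the hand-rounded figure. The p-value line relies on the fact that for $df = 2$ the chi-square upper-tail probability is exactly $e^{-x/2}$; the dictionary layout is an illustrative choice.

```python
import math

observed = {
    "Non-smokers": [10, 70],   # insomnia: yes, no
    "Ex-smokers": [8, 32],
    "Smokers": [22, 38],
}

col_totals = [sum(col) for col in zip(*observed.values())]
grand_total = sum(col_totals)

statistic = 0.0
for habit, row in observed.items():
    row_total = sum(row)
    for o, col_total in zip(row, col_totals):
        e = row_total * col_total / grand_total      # expected count under independence
        contribution = (o - e) ** 2 / e
        statistic += contribution
        print(f"{habit:12s} o={o:3d} e={e:6.2f} contribution={contribution:.2f}")

# For df = 2 the chi-square upper-tail probability has the closed form exp(-x/2).
p_value = math.exp(-statistic / 2)
print(round(statistic, 2), p_value < 0.05)           # 11.73 True
```

The largest contributions come from the non-smokers and smokers who report insomnia, which is where the observed counts depart most from what independence would predict.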
Exercise: A study is conducted to determine whether students' academic performance is independent of their activity in co-curricular activities. The following data set was obtained:

Co-curricular Activities | Academic Performance: Low | Fair | Good
Inactive | 40 | 80 | 60
Active | 30 | 90 | 60

Use a 5% significance level to conduct the study.