chi square


Upload: yam-salem

Post on 22-Oct-2014


Two-Way Tables and the Chi-Square Test

When the analysis of categorical data involves more than one variable, two-way tables (also known as contingency tables) are employed. These tables provide a foundation for statistical inference, where statistical tests question the relationship between the variables on the basis of the data observed.

Example

In the dataset "Popular Kids," students in grades 4-6 were asked whether good grades, athletic ability, or popularity was most important to them. A two-way table separating the students by grade and by choice of most important factor is shown below:

                  Grade
Goals    |    4     5     6   Total
---------------------------------
Grades   |   49    50    69    168
Popular  |   24    36    38     98
Sports   |   19    22    28     69
---------------------------------
Total    |   92   108   135    335

To investigate possible differences among the students' choices by grade, it is useful to compute the column percentages for each choice, as follows:

               Grade
Goals    |    4     5     6
---------------------------
Grades   |   53    46    51
Popular  |   26    33    28
Sports   |   21    20    21
---------------------------
Total    |  100   100   100

The second column contains a small rounding error (its percentages sum to 99, not 100). Judging from the column percentages, there does not appear to be much variation in preference across the three grades.
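The column percentages above can be reproduced with a short calculation. The following is a minimal Python sketch, not part of the original material; the variable names are illustrative, and percentages are rounded to whole numbers as in the table.

```python
# Observed counts from the "Popular Kids" table
# (rows: goals; columns: grades 4, 5, 6).
observed = {
    "Grades":  [49, 50, 69],
    "Popular": [24, 36, 38],
    "Sports":  [19, 22, 28],
}

# Column totals: number of students in each grade.
column_totals = [sum(col) for col in zip(*observed.values())]

# Column percentages: each cell divided by its column total,
# rounded to a whole percent.
column_percentages = {
    goal: [round(100 * count / total)
           for count, total in zip(counts, column_totals)]
    for goal, counts in observed.items()
}

for goal, percentages in column_percentages.items():
    print(f"{goal:8s}", percentages)

# Sums of the rounded percentages per grade; rounding can move a sum off 100.
percentage_sums = [sum(col) for col in zip(*column_percentages.values())]
print("Sums:   ", percentage_sums)
```

The sum for grade 5 comes out at 99 rather than 100, which is the rounding effect mentioned in the text.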

Data source: Chase, M.A. and Dummer, G.M. (1992), "The Role of Sports as a Social Determinant for Children," Research Quarterly for Exercise and Sport, 63, 418-424. Dataset available through the Statlib Data and Story Library (DASL).

The chi-square test provides a method for testing the association between the row and column variables in a two-way table. The null hypothesis H0 assumes that there is no association between the variables (in other words, one variable does not vary according to the other variable), while the alternative hypothesis Ha claims that some association does exist. The alternative hypothesis does not specify the type of association, so close attention to the data is required to interpret the information provided by the test.

The chi-square test is based on a test statistic that measures the divergence of the observed data from the values that would be expected under the null hypothesis of no association. This requires calculation of the expected values based on the data. The expected value for each cell in a two-way table is equal to (row total*column total)/n, where n is the total number of observations included in the table.

Example

Continuing from the above example with the two-way table for students' choice of grades, athletic ability, or popularity by grade, the expected values are calculated as shown below:

Original Table

                  Grade
Goals    |    4     5     6   Total
---------------------------------
Grades   |   49    50    69    168
Popular  |   24    36    38     98
Sports   |   19    22    28     69
---------------------------------
Total    |   92   108   135    335

Expected Values

                Grade
Goals    |     4      5      6
---------------------------
Grades   |  46.1   54.2   67.7
Popular  |  26.9   31.6   39.5
Sports   |  18.9   22.2   27.8

The first cell in the expected values table, Grade 4 with "grades" chosen to be most important, is calculated to be 168*92/335 = 46.1, for example.
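The expected-value rule, (row total * column total)/n, can be applied to every cell at once. The sketch below is a hedged Python illustration; the function name `expected_counts` is an assumption introduced here, not something from the source.

```python
# Observed counts (rows: Grades, Popular, Sports; columns: grades 4, 5, 6).
observed = [
    [49, 50, 69],
    [24, 36, 38],
    [19, 22, 28],
]


def expected_counts(table):
    """Return the expected count for each cell: (row total * column total) / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


expected = expected_counts(observed)
for row in expected:
    print([round(value, 1) for value in row])
```

The expected counts keep the same row and column totals as the observed table, so they also sum to n = 335.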

Once the expected values have been computed (done automatically in most software packages), the chi-square test statistic is computed as

X² = Σ (observed - expected)² / expected

where the squares of the differences between the observed and expected values in each cell, divided by the expected value, are added across all of the cells in the table.

The distribution of the statistic X² is chi-square with (r-1)(c-1) degrees of freedom, where r represents the number of rows in the two-way table and c represents the number of columns. The distribution is denoted χ²(df), where df is the number of degrees of freedom.

The chi-square distribution is defined for all positive values. The P-value for the chi-square test is P(χ² > X²), the probability of observing a value at least as extreme as the test statistic for a chi-square distribution with (r-1)(c-1) degrees of freedom.

Example

The chi-square statistic for the above example is computed as follows:

X² = (49 - 46.1)²/46.1 + (50 - 54.2)²/54.2 + (69 - 67.7)²/67.7 + .... + (28 - 27.8)²/27.8
   = 0.18 + 0.33 + 0.03 + .... + 0.01 = 1.51

The degrees of freedom are equal to (3-1)(3-1) = 2*2 = 4, so we are interested in the probability P(χ² > 1.51) = 0.8244 on 4 degrees of freedom. A P-value this large gives no evidence of an association between the choice of most important factor and the grade of the student: the differences between the observed values and the values expected under the null hypothesis are negligible.
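The statistic and its P-value can be checked numerically. The following Python sketch uses only the standard library; the function names are illustrative assumptions. For the upper-tail probability it relies on a closed form that holds only when the degrees of freedom are even (here df = 4), namely P(χ² > x) = exp(-x/2) * Σ (x/2)^k / k! for k = 0, ..., df/2 - 1. A statistics package would normally be used for general df.

```python
import math

# Observed counts (rows: Grades, Popular, Sports; columns: grades 4, 5, 6).
observed = [[49, 50, 69], [24, 36, 38], [19, 22, 28]]


def chi_square_statistic(table):
    """Return (X², df) for a two-way table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n
            stat += (obs - exp) ** 2 / exp
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


def upper_tail_even_df(x, df):
    """P(chi-square(df) > x); this closed form is valid only for even df."""
    if df % 2 != 0:
        raise ValueError("this closed form requires an even number of df")
    half = x / 2
    return math.exp(-half) * sum(
        half ** k / math.factorial(k) for k in range(df // 2)
    )


stat, df = chi_square_statistic(observed)
p_value = upper_tail_even_df(stat, df)
print(f"X² = {stat:.2f}, df = {df}, P-value = {p_value:.4f}")
```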

Example

The "Popular Kids" dataset also divided the students' responses into "Urban," "Suburban," and "Rural" school areas. Is there an association between the type of school area and the students' choice of good grades, athletic ability, or popularity as most important?

A two-way table for student goals and school area appears as follows:

                    School Area
Goals    |  Rural  Suburban  Urban   Total
--------------------------------------------
Grades   |    57       87      24     168
Popular  |    50       42       6      98
Sports   |    42       22       5      69
--------------------------------------------
Total    |   149      151      35     335

The corresponding column percentages are the following:

                 School Area
Goals    |  Rural  Suburban  Urban
-----------------------------------
Grades   |    38       58      69
Popular  |    34       28      17
Sports   |    28       14      14
-----------------------------------
Total    |   100      100     100


[Figure: bar plots comparing the percentages of students' choices by school area.]

From the table and corresponding graphs, it appears that the emphasis on grades increases as the school areas become more urban, while the emphasis on popularity decreases. Is this association significant?

Using the MINITAB "CHIS" command to perform a chi-square test on the tabular data gives the following results:

Chi-Square Test

Expected counts are printed below observed counts

         Rural  Suburban  Urban  Total
    1       57        87     24    168
         74.72     75.73  17.55

    2       50        42      6     98
         43.59     44.17  10.24

    3       42        22      5     69
         30.69     31.10   7.21

Total      149       151     35    335

Chi-Sq = 4.203 + 1.679 + 2.369 +
         0.943 + 0.107 + 1.755 +
         4.168 + 2.663 + 0.677 = 18.564
DF = 4, P-Value = 0.001

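The P-value of 0.001 is highly significant, indicating that some association between the variables is present. The test does not say which cells are responsible, but together with the column percentages it supports the conclusion that the urban students' increased emphasis on grades is not due to random variation.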
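Because the test only reports that some association exists, it can help to look at the per-cell terms that make up the statistic. The following Python sketch recomputes the nine contributions from the MINITAB output and lists the largest ones; the variable names are illustrative and not taken from the source.

```python
rows = ["Grades", "Popular", "Sports"]
cols = ["Rural", "Suburban", "Urban"]
observed = [
    [57, 87, 24],
    [50, 42, 6],
    [42, 22, 5],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# One entry per cell: (contribution, goal, area, direction of the deviation).
contributions = []
for i, goal in enumerate(rows):
    for j, area in enumerate(cols):
        expected = row_totals[i] * col_totals[j] / n
        term = (observed[i][j] - expected) ** 2 / expected
        direction = "above" if observed[i][j] > expected else "below"
        contributions.append((term, goal, area, direction))

chi_sq = sum(term for term, *_ in contributions)
print(f"Chi-Sq = {chi_sq:.3f}")

# The three cells that contribute most to the statistic.
for term, goal, area, direction in sorted(contributions, reverse=True)[:3]:
    print(f"{goal:8s} {area:9s} {term:6.3f}  observed {direction} expected")
```

Sorting the terms shows which cells depart most from the counts expected under no association, which is a useful complement to reading the column percentages.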

Data source: Chase, M.A. and Dummer, G.M. (1992), "The Role of Sports as a Social Determinant for Children," Research Quarterly for Exercise and Sport, 63, 418-424. Dataset available through the Statlib Data and Story Library (DASL).

The chi-square index in the Statlib Data and Story Library (DASL) provides several other examples of the use of the chi-square test in categorical data analysis.


The Chi Square Statistic

Types of Data:

There are basically two types of random variables, and they yield two types of data: numerical and categorical. A chi square (X²) statistic is used to investigate whether distributions of categorical variables differ from one another. Categorical variables yield data in categories, and numerical variables yield data in numerical form. Responses to such questions as "What is your major?" or "Do you own a car?" are categorical because they yield data such as "biology" or "no." In contrast, responses to such questions as "How tall are you?" or "What is your G.P.A.?" are numerical. Numerical data can be either discrete or continuous. The table below may help you see the differences between these types of data.

Data Type               | Question Type              | Possible Responses
---------------------------------------------------------------------------
Categorical             | What is your sex?          | male or female
Numerical (discrete)    | How many cars do you own?  | two or three
Numerical (continuous)  | How tall are you?          | 72 inches

Notice that discrete data arise from a counting process, while continuous data arise from a measuring process.

The Chi Square statistic compares the tallies or counts of categorical responses between two (or more) independent groups. (Note: Chi square tests can only be used on actual counts and not on percentages, proportions, means, etc.)

2 x 2 Contingency Table

There are several types of chi square tests depending on the way the data was collected and the hypothesis being tested. We'll begin with the simplest case: a 2 x 2 contingency table. If we set the 2 x 2 table to the general notation shown below in Table 1, using the letters a, b, c, and d to denote the contents of the cells, then we would have the following table:

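Table 1. General notation for a 2 x 2 contingency table.

                        Variable 1
Variable 2  |  Data type 1   Data type 2   Totals
---------------------------------------------------------
Category 1  |       a             b         a + b
Category 2  |       c             d         c + d
---------------------------------------------------------
Total       |     a + c         b + d       a + b + c + d = N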

For a 2 x 2 contingency table the Chi Square statistic is calculated by the formula:

Chi square = N(ad - bc)² / [(a + b)(c + d)(a + c)(b + d)]

Note: the four factors in the denominator are the four totals from the table rows and columns.

Suppose you conducted a drug trial on a group of animals and you hypothesized that the animals receiving the drug would survive better than those that did not receive the drug. You conduct the study and collect the following data:

Ho: The survival of the animals is independent of drug treatment.

Ha: The survival of the animals is associated with drug treatment.

 

Table 2. Number of animals that survived a treatment.

             |  Dead   Alive   Total
--------------------------------------
Treated      |   36      14      50
Not treated  |   30      25      55
--------------------------------------
Total        |   66      39     105

Applying the formula above we get:

Chi square = 105[(36)(25) - (14)(30)]² / [(50)(55)(39)(66)] = 3.418
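The 2 x 2 shortcut is algebraically the same as summing (O - E)²/E over the four cells. The Python sketch below computes the statistic both ways for the drug-trial data as a check; the function names are hypothetical and introduced only for illustration.

```python
def chi_square_2x2(a, b, c, d):
    """Shortcut formula for a 2 x 2 table with cells a, b (row 1) and c, d (row 2)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def chi_square_general(table):
    """Sum of (observed - expected)² / expected over all cells."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum(
        (obs - row_totals[i] * col_totals[j] / n) ** 2
        / (row_totals[i] * col_totals[j] / n)
        for i, row in enumerate(table)
        for j, obs in enumerate(row)
    )


# Table 2: treated (36 dead, 14 alive), not treated (30 dead, 25 alive).
shortcut = chi_square_2x2(36, 14, 30, 25)
general = chi_square_general([[36, 14], [30, 25]])
print(f"shortcut = {shortcut:.3f}, general = {general:.3f}")
```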


Before we can proceed we need to know how many degrees of freedom we have. When a comparison is made between one sample and another, a simple rule is that the degrees of freedom equal (number of columns minus one) x (number of rows minus one), not counting the totals for rows or columns. For our data this gives (2-1) x (2-1) = 1.

We now have our chi square statistic (x² = 3.418), our predetermined alpha level of significance (0.05), and our degrees of freedom (df = 1). Entering the Chi square distribution table with 1 degree of freedom and reading along the row, we find that our value of x² (3.418) lies between 2.706 and 3.841. The corresponding probability is 0.05 < P < 0.10. This is above the conventionally accepted significance level of 0.05 or 5%, so the null hypothesis that the two distributions are the same cannot be rejected. In other words, only when the computed x² statistic exceeds the critical value in the table for a 0.05 probability level can we reject the null hypothesis of equal distributions. Since our x² statistic (3.418) did not exceed the critical value for the 0.05 probability level (3.841), we fail to reject the null hypothesis that the survival of the animals is independent of drug treatment (i.e. the data do not show a significant effect of the drug on survival).
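Instead of bracketing the probability with table entries, the P-value for 1 degree of freedom can be computed directly. The sketch below is an illustration, not the source's method: it uses the identity P(χ² > x) = erfc(sqrt(x/2)), which holds only for df = 1 because such a variable is the square of a standard normal variable. The function name is an assumption.

```python
import math


def upper_tail_df1(x):
    """P(chi-square(1) > x); valid only for 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))


p_drug_trial = upper_tail_df1(3.418)
print(f"P-value for x² = 3.418 with df = 1: {p_drug_trial:.4f}")

# The table's critical values for df = 1 should map back to their alpha levels.
for critical, alpha in [(2.706, 0.10), (3.841, 0.05), (6.635, 0.01)]:
    print(f"x² = {critical}: P = {upper_tail_df1(critical):.4f} (table alpha {alpha})")
```

The computed P-value falls between 0.05 and 0.10, in agreement with the range read from the table.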

Table 3. Chi Square distribution table.

                  probability level (alpha)
Df  |    0.5     0.10     0.05     0.02     0.01    0.001
-----------------------------------------------------------
1   |  0.455    2.706    3.841    5.412    6.635   10.827
2   |  1.386    4.605    5.991    7.824    9.210   13.815
3   |  2.366    6.251    7.815    9.837   11.345   16.268
4   |  3.357    7.779    9.488   11.668   13.277   18.465
5   |  4.351    9.236   11.070   13.388   15.086   20.517


 

Chi Square Goodness of Fit (One Sample Test)

This test allows us to compare a collection of categorical data with some theoretical expected distribution. This test is often used in genetics to compare the results of a cross with the theoretical distribution based on genetic theory. Suppose you performed a simple monohybrid cross between two individuals that were heterozygous for the trait of interest.

Aa x Aa

The results of your cross are shown in Table 4.

 

Table 4. Results of a monohybrid cross between two heterozygotes for the 'a' gene.

        |   A     a   Totals
------------------------------
A       |  10    42     52
a       |  33    15     48
------------------------------
Totals  |  43    57    100

The phenotypic ratio is 85 of the A-type to 15 of the a-type (homozygous recessive). In a monohybrid cross between two heterozygotes, however, we would have predicted a 3:1 ratio of phenotypes. In other words, we would have expected to get 75 A-type and 25 a-type. Are our results different?


Calculate the chi square statistic x² by completing the following steps:

1. For each observed number in the table subtract the corresponding expected number (O - E).

2. Square the difference [ (O - E)² ].

3. Divide the squares obtained for each cell in the table by the expected number for that cell [ (O - E)² / E ].

4. Sum all the values for (O - E)² / E. This is the chi square statistic.

For our example, the calculation would be:

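        |  Observed   Expected   (O - E)   (O - E)²   (O - E)²/E
------------------------------------------------------------------
A-type  |     85         75         10       100         1.33
a-type  |     15         25        -10       100         4.00
------------------------------------------------------------------
Total   |    100        100                              5.33

x² = 5.33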

We now have our chi square statistic (x² = 5.33), our predetermined alpha level of significance (0.05), and our degrees of freedom (df = 1). Entering the Chi square distribution table with 1 degree of freedom and reading along the row, we find that our value of x² (5.33) lies between 3.841 and 5.412. The corresponding probability is 0.02 < P < 0.05. This is smaller than the conventionally accepted significance level of 0.05 or 5%, so the null hypothesis that the two distributions are the same is rejected. In other words, when the computed x² statistic exceeds the critical value in the table for a 0.05 probability level, we can reject the null hypothesis of equal distributions. Since our x² statistic (5.33) exceeded the critical value for the 0.05 probability level (3.841), we can reject the null hypothesis that the observed values of our cross are the same as the theoretical distribution of a 3:1 ratio.


To put this into context, it means that we do not have a 3:1 ratio of A_ to aa offspring.
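The four goodness-of-fit steps can be written out directly. The Python sketch below derives the expected counts from the 3:1 ratio and compares the statistic with the 0.05 critical value for 1 degree of freedom taken from Table 3; the dictionary layout and names are illustrative assumptions.

```python
# Observed phenotype counts from the cross and the theoretical 3:1 ratio.
observed = {"A-type": 85, "a-type": 15}
ratio = {"A-type": 3, "a-type": 1}

total = sum(observed.values())
ratio_sum = sum(ratio.values())

# Expected counts under the 3:1 hypothesis.
expected = {kind: total * ratio[kind] / ratio_sum for kind in observed}

# Steps 1-4: subtract, square, divide by expected, then sum.
chi_sq = sum(
    (observed[kind] - expected[kind]) ** 2 / expected[kind] for kind in observed
)

# Degrees of freedom: number of classes minus one.
df = len(observed) - 1

# Critical value for alpha = 0.05 and df = 1, copied from Table 3.
critical_05_df1 = 3.841
reject_null = chi_sq > critical_05_df1

print(f"expected = {expected}")
print(f"x² = {chi_sq:.2f}, df = {df}, reject 3:1 ratio at 0.05: {reject_null}")
```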


 

Chi Square Test of Independence


For a contingency table that has r rows and c columns, the chi square test can be thought of as a test of independence. In a test of independence the null and alternative hypotheses are:

Ho: The two categorical variables are independent.

Ha: The two categorical variables are related.

We can use the equation Chi Square = the sum of all the (fo - fe)² / fe

Here fo denotes the frequency of the observed data and fe is the frequency of the expected values. The general table would look something like the one below:

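               |  Category I   Category II   Category III   Row Totals
-------------------------------------------------------------------------
Sample A       |      a             b              c          a+b+c
Sample B       |      d             e              f          d+e+f
Sample C       |      g             h              i          g+h+i
-------------------------------------------------------------------------
Column Totals  |    a+d+g         b+e+h          c+f+i        a+b+c+d+e+f+g+h+i = N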

Now we need to calculate the expected values for each cell in the table, and we can do that using the row total times the column total divided by the grand total (N). For example, for cell a the expected value would be (a+b+c)(a+d+g)/N.

Once the expected values have been calculated for each cell, we can use the same procedure as before for a simple 2 x 2 table.

Observed   Expected   |O - E|   (O - E)²   (O - E)²/E

Suppose you have the following categorical data set.


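Table. Incidence of three types of malaria in three tropical regions.

           |  Asia   Africa   South America   Totals
-----------------------------------------------------
Malaria A  |   31      14          45            90
Malaria B  |    2       5          53            60
Malaria C  |   53      45           2           100
-----------------------------------------------------
Totals     |   86      64         100           250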

 

We could now set up the following table:

Observed   Expected   |O - E|   (O - E)²   (O - E)²/E
--------------------------------------------------------
   31        30.96      0.04      0.0016    0.0000516
   14        23.04      9.04     81.72      3.546
   45        36.00      9.00     81.00      2.25
    2        20.64     18.64    347.45     16.83
    5        15.36     10.36    107.33      6.99
   53        24.00     29.00    841.00     35.04
   53        34.40     18.60    345.96     10.06
   45        25.60     19.40    376.36     14.70
    2        40.00     38.00   1444.00     36.10

Chi Square = 125.516

Degrees of Freedom = (c - 1)(r - 1) = 2(2) = 4


Reject Ho because 125.516 is greater than 9.488, the critical value for alpha = 0.05 with 4 degrees of freedom.

Thus, we would reject the null hypothesis that there is no relationship between location and type of malaria. Our data tell us there is a relationship between type of malaria and location, but that's all it says.
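The whole procedure for an r x c table, from observed counts to the reject decision, can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function name is hypothetical, and the critical values are the alpha = 0.05 column of Table 3, so the lookup only covers 1 to 5 degrees of freedom.

```python
# Critical values for alpha = 0.05, copied from Table 3 (df = 1 to 5 only).
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}

# Rows: Malaria A, B, C; columns: Asia, Africa, South America.
observed = [
    [31, 14, 45],
    [2, 5, 53],
    [53, 45, 2],
]


def independence_test(table, critical_values):
    """Return (chi square, df, reject) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi_sq = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Expected count: row total times column total over grand total.
            exp = row_totals[i] * col_totals[j] / n
            chi_sq += (obs - exp) ** 2 / exp
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi_sq, df, chi_sq > critical_values[df]


chi_sq, df, reject = independence_test(observed, CRITICAL_05)
print(f"Chi Square = {chi_sq:.2f}, df = {df}, reject Ho: {reject}")
```

The result differs from 125.516 only in the third decimal place, because the table above sums terms that were rounded first.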


This page was created as part of the Mathbeans Project. The java applets were created by David Eck and modified by Jim Ryan. The Mathbeans Project is funded by a grant from the National Science Foundation DUE-9950473.

Frequency Distributions


One important set of statistical tests allows us to test for deviations of observed frequencies from expected frequencies. To introduce these tests, we will start with a simple, non-biological example. We want to determine if a coin is fair. In other words, are the odds of flipping the coin heads-up the same as tails-up? We collect data by flipping the coin 200 times. The coin landed heads-up 108 times and tails-up 92 times. At first glance, we might suspect that the coin is biased because heads resulted more often than tails. However, we have a more quantitative way to analyze our results: a chi-squared test.

To perform a chi-square test (or any other statistical test), we first must establish our null hypothesis. In this example, our null hypothesis is that the coin is equally likely to land heads-up or tails-up every time. The null hypothesis allows us to state expected frequencies. For 200 tosses, we would expect 100 heads and 100 tails.

The next step is to prepare a table as follows.

          |  Heads   Tails   Total
------------------------------------
Observed  |   108      92     200
Expected  |   100     100     200
------------------------------------
Total     |   208     192     400

The Observed values are those we gather ourselves. The expected values are the frequencies expected, based on our null hypothesis. We total the rows and columns as indicated. It's a good idea to make sure that the row totals equal the column totals (both total to 400 in this example).

Using probability theory, statisticians have devised a way to determine if a frequency distribution differs from the expected distribution. To use this chi-square test, we first have to calculate chi-squared.

Chi-squared = Σ (observed - expected)² / expected

We have two classes to consider in this example, heads and tails.


Chi-squared = (108 - 100)²/100 + (92 - 100)²/100 = (8)²/100 + (-8)²/100 = 0.64 + 0.64 = 1.28

Now we have to consult a table of critical values of the chi-squared distribution. Here is a portion of such a table.

df/prob. |    0.99     0.95    0.90    0.80   0.70   0.50   0.30   0.20   0.10    0.05
----------------------------------------------------------------------------------------
1        | 0.00016   0.0039   0.016   0.064   0.15   0.46   1.07   1.64   2.71    3.84
2        |    0.02     0.10    0.21    0.45   0.71   1.39   2.41   3.22   4.60    5.99
3        |    0.12     0.35    0.58    1.00   1.42   2.37   3.66   4.64   6.25    7.82
4        |    0.30     0.71    1.06    1.65   2.20   3.36   4.88   5.99   7.78    9.49
5        |    0.55     1.14    1.61    2.34   3.00   4.35   6.06   7.29   9.24   11.07

The left-most column lists the degrees of freedom (df). We determine the degrees of freedom by subtracting one from the number of classes. In this example, we have two classes (heads and tails), so our degrees of freedom is 1. Our chi-squared value is 1.28. Move across the row for 1 df until we find critical numbers that bound our value. In this case, they are 1.07 (corresponding to a probability of 0.30) and 1.64 (corresponding to a probability of 0.20). We can interpolate our value of 1.28 to estimate a probability of about 0.26. This value does not mean that there is a 74% chance that our coin is biased. It means that, if the coin were fair, a deviation from 100 heads at least as large as the one we observed would occur about 26% of the time. In biological applications, a probability ≤ 5% is usually adopted as the standard for rejecting the null hypothesis. At that threshold, a result this extreme would arise by chance alone only 1 time in 20. Because the probability we obtained in the coin example is greater than 0.05 (about 0.26), we do not reject the null hypothesis, and the data are consistent with a fair coin.
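The interpolation step can be made explicit and compared with a directly computed probability. The Python sketch below interpolates linearly between the two bracketing table entries for 1 df, then computes the tail probability with erfc(sqrt(x/2)), an identity that applies only when df = 1. The variable names are illustrative, and linear interpolation is only an approximation because the table is not linear between entries.

```python
import math

# Chi-squared for 108 heads and 92 tails against 100 expected of each.
chi_sq = (108 - 100) ** 2 / 100 + (92 - 100) ** 2 / 100

# Bracketing entries from the 1 df row of the table: (critical value, probability).
lower_x, lower_p = 1.07, 0.30
upper_x, upper_p = 1.64, 0.20

# Linear interpolation between the two table entries.
fraction = (chi_sq - lower_x) / (upper_x - lower_x)
p_interpolated = lower_p + fraction * (upper_p - lower_p)

# Direct calculation, valid only for 1 degree of freedom.
p_direct = math.erfc(math.sqrt(chi_sq / 2))

print(f"chi-squared    = {chi_sq:.2f}")
print(f"interpolated P = {p_interpolated:.3f}")
print(f"direct P       = {p_direct:.3f}")
```

Both approaches give a probability of roughly 0.26, well above the 0.05 threshold.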

We have gathered data to see if the percentage of elevated cholesterol (> 220 ppm) is the same in girls and boys. Our sample is all the sixth-graders in the state of Maine.

The resulting data are as follows: of 7,532 boys, 397 had elevated cholesterol; of 7,955 girls, 242 had elevated cholesterol.