inference for categorical data
DESCRIPTION
Inference for Categorical Data. William P. Wattles, Ph.D., Francis Marion University. Continuous vs. Categorical: continuous (measurement) variables have many values; categorical variables have only certain values representing different categories.
TRANSCRIPT
1
Inference for Categorical Data
William P. Wattles, Ph.D.
Francis Marion University
2
Continuous vs. Categorical
• Continuous (measurement) variables have many values
• Categorical variables have only certain values representing different categories
• Ordinal: a type of categorical variable with a natural order (e.g., year of college)
• Nominal: a type of categorical variable with no order (e.g., brand of cola)
3
Categorical Data
• Tells which category an individual is in rather than telling how much.
• Sex, race, and occupation are naturally categorical.
• A quantitative variable can be grouped to form a categorical variable.
• Analyze with counts or percents.
4
Describing relationships in categorical data
• No single graph portrays the relationship
• No single number summarizes the relationship either
• Convert counts to proportions or percents
5
Prediction
6
Prediction
7
Moving from descriptive to Inferential
• Chi Square Inference involves a test of independence.
• If the variables are independent, knowledge of one variable tells you nothing about the other.
8
Moving from descriptive to Inferential
• Inference involves expected counts.
– Expected count = the count that would occur if the variables are independent
9
Inference for two-way tables
• Chi Square test of independence.
• For more than two groups
• Cannot compare multiple groups one at a time.
10
To Analyze Categorical Data
• First obtain counts
• In Excel can do this with a pivot table
• Put data in a matrix or two-way table
11
Matrix or two-way table
          Republican  Democrat  Independent
Male          18         43         14
Female        39         23         18
12
Inference for two-way tables
• Expected count
• The count that would occur if the variables are independent
13
Matrix or two-way table
• Rows
• Columns
• Distribution: how often each outcome occurred
• Marginal distribution: count for all entries in a row or column
14
Row and column totals
          Republican  Democrat  Independent  Total
Male          18         43         14         75
Female        39         23         18         80
Total         57         66         32        155
15
          Republican  Democrat  Independent  Total  Percent
Male                                           75     48%
Female                                         80     52%
Total         57         66         32        155
Percent       37%        43%        21%
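The marginal percents on this slide follow directly from the row and column totals. A minimal Python sketch, assuming the counts from the party-by-sex table; the variable names are illustrative and not from the slides:

```python
# Observed counts from the two-way table (rows: sex, columns: party).
observed = {
    "Male":   {"Republican": 18, "Democrat": 43, "Independent": 14},
    "Female": {"Republican": 39, "Democrat": 23, "Independent": 18},
}

# Row totals, column totals, and the grand total.
row_totals = {sex: sum(counts.values()) for sex, counts in observed.items()}
parties = list(next(iter(observed.values())))
col_totals = {p: sum(observed[sex][p] for sex in observed) for p in parties}
grand_total = sum(row_totals.values())

# Marginal distributions, rounded to whole-number percents as on the slide.
row_percents = {sex: round(100 * n / grand_total) for sex, n in row_totals.items()}
col_percents = {p: round(100 * n / grand_total) for p, n in col_totals.items()}

print(row_totals, col_totals, grand_total)
print(row_percents, col_percents)
```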
16
Expected counts
• 37% of all subjects (57 of 155) are Republicans
• If independent, 37% of females should be Republican (expected value)
• 37% of 80 = 29
• 37% of 75 = 28
17
Expected counts rounded
          Republican  Democrat  Independent  total
Male          28         32         15         75
Female        29         34         17         80
total         57         66         32        155
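Each expected count is the row total times the column total divided by the grand total, which is the same as applying the overall column percent to each row. A short Python sketch of that arithmetic, assuming the observed party-by-sex counts from the earlier slide (variable names are illustrative):

```python
# Observed counts: rows are Male, Female; columns are Republican, Democrat, Independent.
observed = [
    [18, 43, 14],
    [39, 23, 18],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Under independence: expected = row total * column total / grand total.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Round to whole counts, as the slide does.
expected_rounded = [[round(e) for e in row] for row in expected]

for row in expected_rounded:
    print(row)
```

The unrounded values are the ones to carry into the chi-square calculation; rounding is only for display.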
18
Observed vs. Expected
Observed
          Republican  Democrat  Independent  total
Male          18         43         14         75
Female        39         23         18         80
total         57         66         32        155
Expected
          Republican  Democrat  Independent  total
Male          28         32         15         75
Female        29         34         17         80
total         57         66         32        155
19
Chi-Square
• Chi-square: a measure of how far the observed counts are from the expected counts
20
Chi-square test of independence
$$X^2 = \sum \frac{(f_o - f_e)^2}{f_e}$$
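The statistic adds up (observed − expected)² / expected over every cell of the table. A minimal Python sketch, assuming the party-by-sex counts from the earlier slides and unrounded expected counts; the layout of the printout is illustrative:

```python
# Observed counts: rows are Male, Female; columns are Republican, Democrat, Independent.
observed = [
    [18, 43, 14],
    [39, 23, 18],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected counts under independence (kept unrounded).
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Per-cell contribution: (observed - expected)^2 / expected.
contributions = [
    [(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]
chi_square = sum(sum(row) for row in contributions)

for row in contributions:
    print([round(c, 3) for c in row])
print("chi-square:", round(chi_square, 2))
```

Printing the contributions shows which cells drive the statistic; here the Republican and Democrat columns contribute far more than the Independent column.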
21
Chi Square test of independence with SPSS
22
Chi Square test of independence with SPSS
23
Chi Square
24
Chi-square test of independence
• Degrees of Freedom
• df = (number of rows − 1) × (number of columns − 1)
• Compare the observed and expected counts.
• P-value comes from comparing the chi-square statistic with critical values for a chi-square distribution
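As a rough check on these two steps, the sketch below computes df from the table dimensions and an upper-tail p-value from the statistic, using only the Python standard library. The function names are illustrative, and the tail formula is the standard recurrence for integer degrees of freedom rather than anything shown in the slides. The inputs are the statistics reported on the Marital Status, Olive Oil, and Business Majors slides.

```python
import math

def degrees_of_freedom(n_rows, n_cols):
    """df = (rows - 1) * (columns - 1) for a two-way table."""
    return (n_rows - 1) * (n_cols - 1)

def chi_square_p_value(statistic, df):
    """Upper-tail probability of a chi-square distribution with integer df."""
    half = statistic / 2
    # Start from the closed form for df = 2 (even) or df = 1 (odd) ...
    if df % 2 == 0:
        a, tail = 1.0, math.exp(-half)
    else:
        a, tail = 0.5, math.erfc(math.sqrt(half))
    # ... then add one term for each further step of 2 in df.
    while a < df / 2:
        tail += half ** a * math.exp(-half) / math.gamma(a + 1)
        a += 1
    return tail

# (label, rows, columns, reported chi-square) taken from the slides.
reported = [
    ("Marital Status", 4, 4, 67.491),
    ("Olive Oil", 3, 3, 1.552),
    ("Business Majors", 4, 2, 10.827),
]
for label, rows, cols, statistic in reported:
    df = degrees_of_freedom(rows, cols)
    print(label, df, round(chi_square_p_value(statistic, df), 4))
```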
25
Example
• Has the percent of majors changed by school?
26
Data collection
http://www.fmarion.edu/about/FactBook2004/2005 Fall 2004 Graduates by Major
27
28
29
Chi Square
30
Marital Status, page 543
job grade  single  married  divorced  widowed
1            58      874       15        8
2           222     3927       70       20
3            50     2396       34       10
4             7      533        7        4
31
Marital Status, page 543
Test Statistics      Value    df  p-value
Pearson Chi-Square   67.491   9   0.0000
32
Olive Oil, page 578
                     Olive Oil
               low    medium   high
Colon cancer    398     397     430
rectal          250     241     217
controls       1368    1377    1409
33
Olive Oil, page 578
Test Statistics                  Value   df  p-value
Pearson Chi-Square               1.552   4   0.817
Continuity Adjusted Chi-Square   1.396   4   0.845
Likelihood Ratio Chi-Square      1.549   4   0.818
34
Business Majors, page 563
                Female  Male
Accounting        68     56
Administration    91     40
Economics          5      6
Finance           61     59
35
Business Majors, page 563
Test Statistics      Value    df  p-value
Pearson Chi-Square   10.827   3   0.013
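The reported statistic and degrees of freedom can be checked by hand from the Business Majors counts. A sketch in Python, where `chi_square_independence` is a hypothetical helper name rather than a library function:

```python
def chi_square_independence(table):
    """Return (Pearson chi-square statistic, df) for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Expected count if major and sex were independent.
            exp = row_totals[i] * col_totals[j] / n
            statistic += (obs - exp) ** 2 / exp
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df

# Rows: Accounting, Administration, Economics, Finance; columns: Female, Male.
business_majors = [
    [68, 56],
    [91, 40],
    [5, 6],
    [61, 59],
]
statistic, df = chi_square_independence(business_majors)
print(round(statistic, 3), df)
```

The same helper can be pointed at the other two-way tables in the deck by swapping in their counts.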
36
Exam Three
• 37 multiple choice questions, 4 short answer
• T-tests and chi square on Excel
• General questions about analyzing categorical data and t-tests
• Review from earlier this term
37
Inference as a decision
• We must decide if the null hypothesis is true.
• We cannot know for sure.
• We choose an arbitrary standard that is conservative and set alpha at .05
• Our decision will be either correct or incorrect.
38
Type I and Type II errors
                Ho is really True            Ho is really False
We reject Ho    Type I Error (false alarm)   Correct decision
We accept Ho    Correct decision             Type II Error (miss)
39
Type I error
• If we reject Ho when in fact Ho is true, this is a Type I error
• Statistical procedures are designed to minimize the probability of a Type I error, because such errors are more serious for science.
• With a Type I error we erroneously conclude that an independent variable works.
40
Type II error
• If we accept Ho when in fact Ho is false, this is a Type II error.
• A Type II error is serious to the researcher.
• The power of a test is the probability that Ho will be rejected when it is, in fact, false.
41
Probability
                Ho is really True   Ho is really False
We reject Ho    p = α               p = 1 − β
We accept Ho    p = 1 − α           p = β
42
Power
• The goal of any scientific research is to reject Ho when Ho is false.
• To increase power:
– a. increase sample size
– b. increase alpha
– c. decrease sample variability
– d. increase the difference between the means
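The four levers above can be illustrated numerically. The sketch below is only an illustration under simplifying assumptions: it uses a one-sided, one-sample z-test with a known standard deviation, not the t-tests or chi-square tests covered elsewhere in the deck, and every number in it is made up for the example.

```python
from statistics import NormalDist

def z_test_power(effect, sd, n, alpha):
    """Power of a one-sided, one-sample z-test (an assumed, simplified setting)."""
    std_normal = NormalDist()
    critical_z = std_normal.inv_cdf(1 - alpha)  # rejection cutoff under Ho
    shift = effect * n ** 0.5 / sd              # how far the true mean moves the test statistic
    return 1 - std_normal.cdf(critical_z - shift)

# Hypothetical baseline: true difference of 2 points, sd of 10, n = 50, alpha = .05.
baseline = z_test_power(effect=2.0, sd=10.0, n=50, alpha=0.05)

# Change one lever at a time.
more_subjects = z_test_power(effect=2.0, sd=10.0, n=100, alpha=0.05)  # a. larger sample
larger_alpha = z_test_power(effect=2.0, sd=10.0, n=50, alpha=0.10)    # b. larger alpha
less_spread = z_test_power(effect=2.0, sd=5.0, n=50, alpha=0.05)      # c. less variability
bigger_gap = z_test_power(effect=4.0, sd=10.0, n=50, alpha=0.05)      # d. larger difference

for label, value in [("baseline", baseline), ("a", more_subjects),
                     ("b", larger_alpha), ("c", less_spread), ("d", bigger_gap)]:
    print(label, round(value, 3))
```

In this setting each change raises power above the baseline, though raising alpha does so at the cost of more Type I errors.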
43
Categorical data example
• African-American students more likely to register via the web.
44
Table
Variable                      White             African-American
Students University-Wide      n      Percent    n      Percent
Register on the Web           447    34%        284    44%
Register with other method    876    66%        356    56%
Total                         1323              640
45
Web Registration by Race
[Bar chart: percent registering via the web by year (2000, 2001) for White and African-American students; vertical axis 0% to 60%; data labels 34%, 25%, 44%, 29%]
46
Categorical Data Example
• African-American students university-wide (44%) were more likely than white students (34%) to use web registration, X²(1, N = 1963) = 20.7, p < .001.
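The reported result can be reproduced from the counts in the registration table. A minimal Python sketch, assuming the uncorrected Pearson statistic (no continuity correction) and using the fact that, for 1 degree of freedom, the upper-tail probability equals erfc(√(X²/2)); variable names are illustrative:

```python
import math

# Rows: web registration, other method; columns: White, African-American.
observed = [
    [447, 284],
    [876, 356],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Percent of each group that registered on the web.
web_percent = [round(100 * observed[0][j] / col_totals[j]) for j in range(2)]

# Pearson chi-square against the counts expected under independence.
chi_square = sum(
    (observed[i][j] - row_totals[i] * col_totals[j] / n) ** 2
    / (row_totals[i] * col_totals[j] / n)
    for i in range(2)
    for j in range(2)
)

# A 2 x 2 table has (2 - 1) * (2 - 1) = 1 degree of freedom.
df = (len(row_totals) - 1) * (len(col_totals) - 1)
p_value = math.erfc(math.sqrt(chi_square / 2))

print(web_percent, n, round(chi_square, 1), df, p_value < 0.001)
```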
47
48
Smoking among French Men
• Do these data show a relationship between education and smoking in French men?
49
50
51
The End
52
Benford’s Law page 550
• Faking data?
53
Problem 20.14
Digit  ratio  Observed
1      0.301  6
2      0.176  4
3      0.125  6
4      0.097  7
5      0.079  3
6      0.067  5
7      0.058  6
8      0.051  4
9      0.046  4
54
Digit  ratio  Expected  Observed
1      0.301  13.545    6
2      0.176  7.92      4
3      0.125  5.625     6
4      0.097  4.365     7
5      0.079  3.555     3
6      0.067  3.015     5
7      0.058  2.61      6
8      0.051  2.295     4
9      0.046  2.07      4
55
Expected  Observed  (O − E)²/E
13.545    6         4.20280731
7.92      4         1.94020202
5.625     6         0.025
4.365     7         1.59065865
3.555     3         0.08664557
3.015     5         1.30687396
2.61      6         4.40310345
2.295     4         1.26667756
2.07      4         1.7994686
Total               16.6214371
56
Significance test
chitest p = 0.03430
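The spreadsheet steps for Problem 20.14 can be retraced in a few lines. A Python sketch, assuming the Benford ratios and observed counts from the tables above, with df = 9 − 1 = 8 for a goodness-of-fit test over nine digits. The p-value uses the closed form for an even number of degrees of freedom instead of Excel's chitest; variable names are illustrative.

```python
import math

# Benford's Law proportions for leading digits 1-9, and the observed counts.
benford_ratio = [0.301, 0.176, 0.125, 0.097, 0.079, 0.067, 0.058, 0.051, 0.046]
observed = [6, 4, 6, 7, 3, 5, 6, 4, 4]

n = sum(observed)
expected = [ratio * n for ratio in benford_ratio]

# Per-digit contribution (observed - expected)^2 / expected, then the total.
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi_square = sum(contributions)

# Goodness of fit: df = number of categories - 1.
df = len(observed) - 1

# Upper tail of a chi-square distribution, valid because df is even here.
half = chi_square / 2
p_value = math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))

print(n, round(chi_square, 4), df, round(p_value, 4))
```

Several expected counts here fall below 5, so the chi-square approximation should be read with some caution.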
57
Example
• Survey2 Berk & Carey page 261