Intermediate R - Analysis of Categorical Data
Analysis of Categorical Data
Violeta I. Bartolome
Senior Associate Scientist-Biometrics
PBGB-CRIL
Types of data
• Categorical data (classification data)
  o Nominal
  o Ordinal
• Quantitative data (measurement or scale data)
  o Interval
  o Ratio
Categorical Data
• The objects being studied are grouped into categories based on
  some qualitative trait.
• They are often recorded as counts of objects in each category.
• Examples:
  o Hair color
    • blonde, brown, red, black
  o Growth duration
    • early, medium, long
  o Smoking status
    • smoker, non-smoker
Nominal Data
• A type of categorical data in which objects fall into
  unordered categories.
• Examples:
  o Rice variety group
    • indica, japonica, javanica
  o Gender
    • male, female
  o Smoking status
    • smoker, non-smoker
Ordinal Data
• A type of categorical data in which order is important.
• Examples:
  o Growth duration
    • early, medium, long
  o Degree of resistance
    • resistant, moderately resistant, susceptible
  o Nitrogen rate
    • none, low, high
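In R, the nominal/ordinal distinction can be carried by the factor type. The sketch below is illustrative only: the sample values and the variable names `variety` and `duration` are made up for this example and are not data from the slides.

```r
# Hypothetical sample values, for illustration only.
variety  <- factor(c("indica", "japonica", "javanica", "indica"))
duration <- factor(c("medium", "early", "long", "early"),
                   levels  = c("early", "medium", "long"),
                   ordered = TRUE)

is.ordered(variety)    # nominal: categories carry no order
is.ordered(duration)   # ordinal: levels are ranked early < medium < long
table(duration)        # counts are listed in level order
duration < "long"      # order comparisons are defined for ordered factors
```

Declaring the levels explicitly matters here: without `levels =`, R would sort them alphabetically (early, long, medium), which is not the intended ranking.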
Binary Data
• A type of categorical data in which there are only two categories.
• Binary data can be either nominal or ordinal.
• Examples:
  o Gender
    • male, female
  o Attendance
    • present, absent
  o Insect state after treatment application
    • dead, alive
Quantitative Data
• The objects being studied are “measured” on some quantitative trait.
• The resulting data are a set of numbers.
• Types of quantitative data
  o Interval data – ordered values with comparable distances between
    them (e.g. temperature, IQ)
  o Ratio data – interval data with a true zero point as origin
    (e.g. grain yield, age, number of trees in a forest)
Example: Adoption of Nitrogen Fertilizer
Level of      Adoption
Education     No     Yes    Total
Low           51     22      73
High           6     21      27
Total         57     43     100
Is there an association between level of
education and adoption of nitrogen fertilizer?
If there is no association, the observed frequencies should be
close to the expected frequencies.
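Expected frequencies:
Level of      Adoption
Education     No       Yes      Total
Low           41.61    31.39     73
High          15.39    11.61     27
Total         57       43       100
Each expected frequency is (row total × column total) / grand total:
$$E_{\text{Low,No}} = \frac{57 \times 73}{100} = 41.61 \qquad E_{\text{Low,Yes}} = \frac{43 \times 73}{100} = 31.39$$
$$E_{\text{High,No}} = \frac{57 \times 27}{100} = 15.39 \qquad E_{\text{High,Yes}} = \frac{43 \times 27}{100} = 11.61$$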
Chi-square test compares the observed and expected frequencies.
$$\chi^2 = \sum \frac{(|O - E| - 0.5)^2}{E} = 16.3596, \quad P < 0.0001$$
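The arithmetic behind the expected frequencies and the corrected chi-square statistic can be reproduced by hand in R. This is a minimal sketch using the counts from the adoption table; the object names (`observed`, `expected`, `chi_sq`) are chosen for this example only.

```r
# Observed counts from the adoption example (rows: education, columns: adoption).
observed <- matrix(c(51, 22,
                      6, 21),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Education = c("Low", "High"),
                                   Adoption  = c("No", "Yes")))

# Expected count = row total * column total / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-square with continuity correction: subtract 0.5 from each |O - E|.
chi_sq  <- sum((abs(observed - expected) - 0.5)^2 / expected)
p_value <- pchisq(chi_sq, df = 1, lower.tail = FALSE)

round(expected, 2)
round(chi_sq, 4)
p_value
```

A 2 x 2 table has (2 - 1)(2 - 1) = 1 degree of freedom, which is why `df = 1` is used for the p-value.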
chisq.test()
chisq.test(x, # a vector or matrix
y = NULL, # a vector, ignored if x is a matrix
correct = TRUE, # a logical indicating whether to
# apply continuity correction when
# computing the test statistic for
# 2x2 tables: one half is subtracted
# from all |O-E| differences. No
# correction is done if
# simulate.p.value = TRUE
simulate.p.value = FALSE, # a logical indicating whether to
# compute p-values by Monte Carlo
# simulation
B = 2000) # an integer specifying the number
# of replicates used in the Monte
# Carlo test
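As a usage sketch, the call below applies `chisq.test()` to the adoption table from the earlier example. The matrix name `adoption` is an assumption made for this illustration; the arguments are the ones listed above.

```r
# Adoption example as a 2 x 2 matrix of counts.
adoption <- matrix(c(51, 22,
                      6, 21),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Education = c("Low", "High"),
                                   Adoption  = c("No", "Yes")))

test_corrected   <- chisq.test(adoption)                   # correct = TRUE is the default
test_uncorrected <- chisq.test(adoption, correct = FALSE)  # no continuity correction

test_corrected$statistic   # chi-square statistic
test_corrected$p.value     # p-value
test_corrected$expected    # expected frequencies under independence
```

The continuity correction shrinks each |O - E| by one half, so the corrected statistic is always the smaller of the two.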
fisher.test()
fisher.test(x, # a vector or matrix
y = NULL, # a vector, ignored if x is a matrix
or = 1, # the hypothesized odds ratio.
# Only used in the 2 by 2 case
conf.int = TRUE, # logical indicating if a confidence
# interval should be computed
conf.level = 0.95, # confidence level for the returned
# confidence interval. Only used in
# the 2 by 2 case if conf.int = TRUE.
simulate.p.value = FALSE,
B = 2000)
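A minimal usage sketch for `fisher.test()`, again on the adoption table. The slides do not show this call, so treat it as an illustration of the arguments above rather than the presenter's script; the matrix name `adoption` is an assumption.

```r
adoption <- matrix(c(51, 22,
                      6, 21),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Education = c("Low", "High"),
                                   Adoption  = c("No", "Yes")))

result <- fisher.test(adoption, conf.level = 0.95)

result$p.value    # exact two-sided p-value
result$estimate   # conditional maximum likelihood estimate of the odds ratio
result$conf.int   # confidence interval for the odds ratio (2 by 2 case only)
```

The exact test is mainly useful when expected counts are small, where the chi-square approximation becomes unreliable.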
Sample data set
Read data and tabulate
Frequencies to Percentage
The second argument of the prop.table function is the margin index:
1 for rows and 2 for columns.
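The tabulation and percentage steps might look like the sketch below. The original data file is not shown in the slides, so the records are rebuilt here from the cell counts of the adoption table; the data frame name `survey` and its column names are assumptions.

```r
# Rebuild one record per farmer from the cell counts (assumed layout).
counts <- c(51, 22, 6, 21)
survey <- data.frame(
  education = factor(rep(c("Low", "Low", "High", "High"), times = counts),
                     levels = c("Low", "High")),
  adoption  = factor(rep(c("No", "Yes", "No", "Yes"), times = counts),
                     levels = c("No", "Yes"))
)

# Tabulate frequencies.
freq <- table(Education = survey$education, Adoption = survey$adoption)
addmargins(freq)                       # frequencies with row and column totals

# Convert frequencies to percentages.
row_pct <- prop.table(freq, 1) * 100   # margin 1: each row sums to 100
col_pct <- prop.table(freq, 2) * 100   # margin 2: each column sums to 100
round(row_pct, 1)
round(col_pct, 1)
```

Row percentages answer "what share of each education group adopted?", which is usually the comparison of interest when education is the explanatory variable.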
Chi-square test
A significant test indicates that adoption and level of education
are not independent.
Logistic Regression
• Used to predict a two-category outcome from a set of
  independent variables
• Response variable – binary: 0, 1
• Can handle more than one independent variable, which can be
  o Categorical
  o Quantitative
  o Mixture of both
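Logit Model
$$\ln\left(\frac{p}{1-p}\right) = \alpha + \beta x$$
$$p = \frac{\exp(\alpha + \beta x)}{1 + \exp(\alpha + \beta x)} \quad \text{or} \quad p = \frac{1}{1 + \exp(-\alpha - \beta x)}$$
p = probability of being an adopter
x = 1 if high level of education
x = 0 if low level of education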
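The two expressions for p are algebraically the same function, and the logit undoes it. A small numerical check, assuming arbitrary illustrative values for the linear predictor α + βx (the vector `eta` below is hypothetical):

```r
logit       <- function(p)   log(p / (1 - p))            # ln(p / (1 - p))
inv_logit_a <- function(eta) exp(eta) / (1 + exp(eta))   # first form of p
inv_logit_b <- function(eta) 1 / (1 + exp(-eta))         # second form of p

eta <- c(-2, 0, 1.5)   # hypothetical values of alpha + beta * x

all.equal(inv_logit_a(eta), inv_logit_b(eta))   # the two forms agree
all.equal(inv_logit_a(eta), plogis(eta))        # base R's built-in inverse logit
all.equal(logit(inv_logit_a(eta)), eta)         # logit recovers alpha + beta * x
```

Base R already provides these as `qlogis()` (logit) and `plogis()` (inverse logit), so the hand-written versions are only there to mirror the formulas.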
Logistic model
Odds – ratio of the probability of success to the probability of failure, p/(1 − p)
Logistic Regression in R using 2 x 2 data
Test significance of the model
The test is significant, indicating non-independence between
adoption and level of education.
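The slide's script is not reproduced in the transcript. One way to fit this model from the 2 x 2 counts is sketched below; the data frame layout and the names `adoption_data`, `yes`, `no`, and `fit` are assumptions, and the likelihood-ratio test is one common choice for testing model significance.

```r
# Grouped binomial data: one row per education level.
adoption_data <- data.frame(
  x   = c(0, 1),     # 0 = low education, 1 = high education
  yes = c(22, 21),   # adopters
  no  = c(51, 6)     # non-adopters
)

# Logistic regression on (successes, failures) pairs.
fit <- glm(cbind(yes, no) ~ x, family = binomial, data = adoption_data)
coef(fit)                     # intercept (alpha) and slope (beta)

# Likelihood-ratio test of the model against the intercept-only model.
anova(fit, test = "Chisq")
lr_stat <- fit$null.deviance - fit$deviance
lr_df   <- fit$df.null - fit$df.residual
lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
lr_p
```

With a single binary predictor the model has as many parameters as there are groups, so the fitted probabilities equal the observed proportions in each education level.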
R Output
$$\ln(\text{odds}) = -0.8408 + 2.0935x$$
$$\frac{p}{1-p} = e^{-0.8408}\, e^{2.0935x}$$
$$p = \frac{(0.43136)(8.113)^x}{1 + (0.43136)(8.113)^x}$$
low: x = 0; p = 0.3014
high: x = 1; p = 0.7778
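The step from the reported coefficients to the two probabilities can be checked numerically. A short sketch using only the estimates quoted above (the object names are chosen for this example):

```r
b0 <- -0.8408   # reported intercept
b1 <-  2.0935   # reported coefficient of x

x    <- c(low = 0, high = 1)
odds <- exp(b0) * exp(b1)^x   # odds = e^b0 * (e^b1)^x
p    <- odds / (1 + odds)     # convert odds to probability

round(exp(c(b0, b1)), 4)      # baseline odds and odds ratio
round(p, 4)
```

Here `exp(b1)` is the odds ratio: the odds of adoption for the high-education group are about eight times those of the low-education group.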
Another Example
Insecticide   Pulmonary Ailment
Rate          No      Yes    Total
Low           5376    55     5431
High          5392    96     5488
$$\chi^2 = 10.86, \quad P < 0.001$$
This is strong evidence that the rate of insecticide has
an effect on pulmonary ailment.
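The reported statistic can be checked with `chisq.test()` on the counts in the table. Note that `correct = FALSE` is an assumption here: it is chosen because the statistic without continuity correction is the one that rounds to the reported 10.86. The matrix name `ailment` is also just for this example.

```r
ailment <- matrix(c(5376, 55,
                    5392, 96),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(Rate    = c("Low", "High"),
                                  Ailment = c("No", "Yes")))

result <- chisq.test(ailment, correct = FALSE)  # assumed: no continuity correction
result$statistic
result$p.value

# Proportion with a pulmonary ailment at each insecticide rate.
round(prop.table(ailment, 1)[, "Yes"], 4)
```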
Effect of insecticide usage (low, high) and smoking history on
the incidence of pulmonary ailments in farm workers.
Never Smoked:
Insecticide   Pulmonary Ailment
Rate          No      Yes    Total
Low           5376    55     5431
High          5392    96     5488
Past Smokers:
Insecticide   Pulmonary Ailment
Rate          No      Yes    Total
Low           4310    63     4373
High          4276    105    4381
Current Smokers:
Insecticide   Pulmonary Ailment
Rate          No      Yes    Total
Low           1192    21     1213
High          1188    37     1225
R=1 if using high rate of insecticide, R=0 if using low rate of insecticide
P=1 if past smoker, P=0 if not a past smoker
C=1 if current smoker, C=0 if not a current smoker
Logit Model with more than 1 Independent Variable
$$\ln(\text{odds}) = b_0 + b_1 R + b_2 P + b_3 C$$
From R:
b0 = -4.574   b1 = 0.541   b2 = 0.335   b3 = 0.219
$$\ln(\text{odds}) = -4.574 + 0.541R + 0.335P + 0.219C$$
$$\frac{p}{1-p} = e^{-4.574}\, e^{0.541R}\, e^{0.335P}\, e^{0.219C}$$
R Script for logistic regression with more than 1 independent variable
Test significance of model
Result indicates that the model is significant.
Chi-square to test effects
Indicates that rate of insecticide and past smoking have an effect on
pulmonary ailment. Current smoking has no significant effect on
pulmonary ailment.
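The script itself is not included in the transcript, so the following is a sketch of how such a model could be fitted from the three 2 x 2 tables. The data frame layout and all object names are assumptions. The indicator coding follows the table of conditions in this document, where current smokers carry both P = 1 and C = 1.

```r
# Grouped data: one row per combination of smoking history and insecticide rate.
ailment <- data.frame(
  smoking = rep(c("never", "past", "current"), each = 2),
  R   = rep(c(0, 1), times = 3),   # 1 = high insecticide rate
  P   = c(0, 0, 1, 1, 1, 1),       # coded 1 for past and current smokers
  C   = c(0, 0, 0, 0, 1, 1),       # coded 1 for current smokers only
  yes = c(55, 96, 63, 105, 21, 37),
  no  = c(5376, 5392, 4310, 4276, 1192, 1188)
)

fit <- glm(cbind(yes, no) ~ R + P + C, family = binomial, data = ailment)
round(coef(fit), 4)

# Overall significance: likelihood-ratio test against the intercept-only model.
lr_stat <- fit$null.deviance - fit$deviance
lr_df   <- fit$df.null - fit$df.residual
lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# Chi-square tests of the individual effects, added sequentially.
anova(fit, test = "Chisq")
```

Because `anova()` tests terms in the order they enter the formula, `drop1(fit, test = "Chisq")` is an alternative when each effect should be tested adjusted for all the others.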
Parameter estimates
$$\ln(\text{odds}) = -4.5739 + 0.5407R + 0.3346P + 0.2188C$$
$$\frac{p}{1-p} = e^{-4.5739}\, e^{0.5407R}\, e^{0.3346P}\, e^{0.2188C}$$
$$\frac{p}{1-p} = (0.01032)(1.71721)^R (1.39738)^P (1.24482)^C$$
For non-smokers (P=0, C=0)
$$\frac{p}{1-p} = (0.01032)(1.71721)^R$$
$$\hat{\theta} = 1.71721 \quad \text{(odds ratio, high versus low rate)}$$
Rate    PA: No    Yes    Total
Low     5376      55     5431
High    5392      96     5488
R = 0: odds = 0.01032, so p̂ = 0.01032 / 1.01032 = 0.0102; p_actual = 55/5431 = 0.0101
R = 1: odds = (0.01032)(1.71721) = 0.0177, so p̂ = 0.0177 / 1.0177 = 0.0174
Values are very close, an indication that the model fits well.
For past smokers (P=1, C=0)
$$\frac{p}{1-p} = (0.01442)(1.71721)^R$$
$$\hat{\theta} = 1.71721 \quad \text{(odds ratio, high versus low rate)}$$
Rate    PA: No    Yes    Total
Low     4310      63     4373
High    4276      105    4381
R = 0: odds = 0.01442, so p̂ = 0.01442 / 1.01442 = 0.0142; p_actual = 63/4373 = 0.0144
R = 1: odds = (0.01442)(1.71721) = 0.0248, so p̂ = 0.0248 / 1.0248 = 0.0242
Predict probabilities of success
Conditions                   R  P  C  Est. p   Actual p (Yes/Total)
Low Rate, Never Smoked       0  0  0  0.0102   55/5431  = 0.0101
Low Rate, Past Smoker        0  1  0  0.0142   63/4373  = 0.0144
Low Rate, Current Smoker     0  1  1  0.0176   21/1213  = 0.0173
High Rate, Never Smoked      1  0  0  0.0174   96/5488  = 0.0175
High Rate, Past Smoker       1  1  0  0.0242   105/4381 = 0.0240
High Rate, Current Smoker    1  1  1  0.0299   37/1225  = 0.0302
On average, the expected probability of having a pulmonary ailment is
higher for those who use a high rate of insecticide. The probability
increases further if the farmer is a past or current smoker.
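The estimated probabilities can be recomputed from the reported parameter estimates, and the observed proportions from the counts in the three tables. A sketch, with the data frame `cells` and its column names chosen for this example:

```r
# Reported parameter estimates.
b <- c(intercept = -4.5739, R = 0.5407, P = 0.3346, C = 0.2188)

cells <- data.frame(
  condition = c("Low Rate, Never Smoked", "Low Rate, Past Smoker",
                "Low Rate, Current Smoker", "High Rate, Never Smoked",
                "High Rate, Past Smoker", "High Rate, Current Smoker"),
  R     = c(0, 0, 0, 1, 1, 1),
  P     = c(0, 1, 1, 0, 1, 1),
  C     = c(0, 0, 1, 0, 0, 1),
  yes   = c(55, 63, 21, 96, 105, 37),
  total = c(5431, 4373, 1213, 5488, 4381, 1225)
)

# ln(odds) = b0 + b1*R + b2*P + b3*C, then p = odds / (1 + odds).
log_odds       <- b[["intercept"]] + b[["R"]] * cells$R +
                  b[["P"]] * cells$P + b[["C"]] * cells$C
cells$est_p    <- plogis(log_odds)
cells$actual_p <- cells$yes / cells$total

print(data.frame(condition = cells$condition,
                 est_p     = round(cells$est_p, 4),
                 actual_p  = round(cells$actual_p, 4)))
```

If a fitted `glm` object is available, `predict(fit, newdata, type = "response")` returns the same estimated probabilities without writing out the linear predictor.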
Another Example
Effect of two types of treatment to control disease incidence.
Treatment   No Disease   With Disease   Total
A           15            5             20
B           14            6             20
Control      3           17             20
Total       32           28             60
Create two dummy binary variables for treatment:
T1=1 if treatment=A, T1=0 otherwise
T2=1 if treatment=B, T2=0 otherwise
Logit model:
$$\ln\left(\frac{p}{1-p}\right) = b_0 + b_1 T_1 + b_2 T_2$$
R script for logistic regression with an independent variable with more than 2 levels
Read data
Logit model
A vs control
B vs control
Test significance of model
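Since the script is not reproduced in the transcript, here is a sketch of these steps using the counts from the treatment table. The data frame and its column names are assumptions; the dummy variables T1 and T2 follow the definitions given for this example.

```r
# Read data: grouped counts per treatment (assumed layout).
disease <- data.frame(
  treatment    = c("A", "B", "Control"),
  with_disease = c(5, 6, 17),
  no_disease   = c(15, 14, 3)
)

# Dummy variables: control is the reference level.
disease$T1 <- as.numeric(disease$treatment == "A")
disease$T2 <- as.numeric(disease$treatment == "B")

# Logit model: T1 compares A vs control, T2 compares B vs control.
fit <- glm(cbind(with_disease, no_disease) ~ T1 + T2,
           family = binomial, data = disease)
summary(fit)$coefficients

# Test significance of model (likelihood-ratio test, 2 degrees of freedom).
lr_stat <- fit$null.deviance - fit$deviance
lr_df   <- fit$df.null - fit$df.residual
lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
lr_p
```

Declaring `treatment` as a factor with "Control" as its first level and using `~ treatment` would give the same fit, with R creating the dummy variables internally.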
From R Output:
b0 = 1.735   b1 = -2.833   b2 = -2.582
$$\ln\left(\frac{p}{1-p}\right) = 1.735 - 2.833\,T_1 - 2.582\,T_2$$
$$p = \frac{(5.6695)(0.0588)^{T_1}(0.0756)^{T_2}}{1 + (5.6695)(0.0588)^{T_1}(0.0756)^{T_2}}$$
Treatment   T1   T2   Expected Probabilities
A           1    0    0.25
B           0    1    0.30
Control     0    0    0.85
The probability of disease incidence is highest when no treatment is applied.
Between the two treatments, it is slightly higher with treatment B
than with treatment A.
Script to compute expected probabilities
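The original script is not shown in the transcript. A minimal sketch that reproduces the expected probabilities from the reported coefficients follows; the data frame `design` and its column names are assumptions.

```r
# Reported coefficients from the R output.
b0 <-  1.735
b1 <- -2.833   # A vs control
b2 <- -2.582   # B vs control

design <- data.frame(
  treatment = c("A", "B", "Control"),
  T1 = c(1, 0, 0),
  T2 = c(0, 1, 0)
)

# p = exp(eta) / (1 + exp(eta)), with eta = b0 + b1*T1 + b2*T2.
design$expected_p <- plogis(b0 + b1 * design$T1 + b2 * design$T2)
print(data.frame(design[, c("treatment", "T1", "T2")],
                 expected_p = round(design$expected_p, 2)))
```

Because the model has one parameter per treatment group, these probabilities coincide with the observed proportions 5/20, 6/20, and 17/20.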
Logistic Models with quantitative Independent Variable
• Example: Effect of farm size on the adoption of
  improved fallow
  o Y: adoption (1=Yes, 0=No)
  o X: farm size
[Figure: use of improved fallow (0 or 1) plotted against farm size (0 to 9)]
• Why ordinary (linear) regression analysis is not appropriate when
  the dependent variable is binary:
  o It may produce predicted values that are negative or greater
    than 1.
  o Predicted values of Y can take a continuous range of values,
    but Y can only be 0 or 1.
• What should be done?
  o Fit a model on the probability of adoption
[Figure: probability of adoption (0 to 1) plotted against farm size (0 to 9)]
• Probability of adoption increases as farm size increases.
• The curve is not a straight line but an S-shaped curve.
• Model for this curve is:
$$p = \frac{\exp(\alpha + \beta x)}{1 + \exp(\alpha + \beta x)}$$
• Estimate α and β using the logit regression equation:
$$\ln\left(\frac{p}{1-p}\right) = \alpha + \beta x$$
R script for logistic regression with quantitative independent variable
Read data
Logit model
Parameter estimates
$$\ln\left(\frac{p}{1-p}\right) = -2.503 + 0.453x$$
$$p = \frac{(0.0818)(1.573)^x}{1 + (0.0818)(1.573)^x}$$
If farm size = 5 acres, p = 0.44; that is, the probability
that the farmer is an adopter is 0.44.
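The farm-size data set is not included in the transcript, so the model cannot be refitted here. The sketch below only evaluates the fitted curve from the reported estimates (-2.503 and 0.453); the object names are chosen for this example.

```r
b0 <- -2.503   # reported intercept
b1 <-  0.453   # reported coefficient of farm size

# Predicted probability of adoption for a 5-acre farm.
p_at_5 <- plogis(b0 + b1 * 5)
round(p_at_5, 2)

# Predicted probabilities over the farm sizes shown on the plot axis (0 to 9).
farm_size <- 0:9
p_adopt   <- plogis(b0 + b1 * farm_size)
round(setNames(p_adopt, farm_size), 2)

# Each additional acre multiplies the odds of adoption by exp(b1).
round(exp(b1), 3)
```

With the original data in a data frame, the same estimates would come from a call of the form `glm(y ~ x, family = binomial)`, where `y` is the 0/1 adoption indicator and `x` the farm size.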
Modeling Responses with More than Two Categories
• Re-organize or ignore some of the categories
  temporarily, to reduce to a binary response.
• Divide categories into a series of binary categories.
• Use multinomial logistic regression as an extension
  of binary logistic regression.
• Use log-linear models if all variables are categorical.
Thank you!