intermediate r - analysis of categorical data

Analysis of Categorical Data

Violeta I. Bartolome

Senior Associate Scientist-Biometrics

PBGB-CRIL

Types of data

▪ Categorical data (classification data)

o Nominal

o Ordinal

▪ Quantitative data (measurement or scale data)

o Interval

o Ratio

Categorical Data

▪ The objects being studied are grouped into categories based on some qualitative trait.

▪ They are often recorded as counts of objects in each category.

▪ Examples:

o Hair color

• blonde, brown, red, black

o Growth duration

• early, medium, long

o Smoking status

• smoker, non-smoker

Nominal Data

▪ A type of categorical data in which objects fall into unordered categories.

▪ Examples:

o Rice variety group

• indica, japonica, javanica

o Gender

• Male, Female

o Smoking status

• smoker, non-smoker

Ordinal Data

▪ A type of categorical data in which order is important.

▪ Examples:

o Growth duration

• early, medium, long

o Degree of resistance

• resistant, moderately resistant, susceptible

o Nitrogen Rate

• none, low, high

Binary Data

▪ A type of categorical data in which there are only two categories.

▪ Binary data can be either nominal or ordinal.

▪ Examples:

o Gender

• male, female

o Attendance

• present, absent

o Insect state after treatment application

• dead, alive

Quantitative Data

▪ The objects being studied are “measured” based on some quantitative trait.

▪ The resulting data are a set of numbers.

▪ Types of quantitative data

o Interval data: ordered values whose distances are comparable (e.g. temperature, IQ)

o Ratio data: interval data with a true zero point as origin (e.g. grain yield, age, number of trees in a forest)

Example: Adoption of Nitrogen Fertilizer

Level of         Adoption
Education       No    Yes    Total
Low             51     22       73
High             6     21       27
Total           57     43      100

Is there an association between level of education and adoption of nitrogen fertilizer?

If there is no association, the observed frequencies should be close to the expected frequencies.

Expected frequencies:

Level of         Adoption
Education       No                       Yes                      Total
Low             57 * 73 / 100 = 41.61    43 * 73 / 100 = 31.39       73
High            57 * 27 / 100 = 15.39    43 * 27 / 100 = 11.61       27
Total           57                       43                         100
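The expected frequencies can be reproduced in R with a few lines. This is a minimal sketch, assuming the observed counts from the adoption table; the object names `observed` and `expected` are illustrative and not from the original slides.

```r
# Observed counts (rows: level of education, columns: adoption)
observed <- matrix(c(51, 22,
                      6, 21),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Education = c("Low", "High"),
                                   Adoption  = c("No", "Yes")))

# Expected count = (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
print(round(expected, 2))
```

The `outer()` call forms every row-total by column-total product at once, which is the same arithmetic as the four cell-by-cell calculations.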

Chi-square test compares the observed and expected frequencies.

\[ \chi^2 = \sum \frac{(|O - E| - 0.5)^2}{E} = 16.3596, \quad P < 0.0001 \]
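As a check on the arithmetic, the continuity-corrected chi-square statistic can be computed by hand in R. This is a sketch under the assumption that the adoption counts are entered as a 2 x 2 matrix; the names `chi_sq` and `p_value` are illustrative.

```r
observed <- matrix(c(51, 22,
                      6, 21), nrow = 2, byrow = TRUE)
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Continuity correction: subtract 0.5 from every |O - E| before squaring
chi_sq  <- sum((abs(observed - expected) - 0.5)^2 / expected)

# A 2 x 2 table has (2 - 1) * (2 - 1) = 1 degree of freedom
p_value <- pchisq(chi_sq, df = 1, lower.tail = FALSE)

print(round(chi_sq, 4))
print(p_value)
```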

chisq.test()

chisq.test(x,                         # a vector or matrix
           y = NULL,                  # a vector, ignored if x is a matrix
           correct = TRUE,            # a logical indicating whether to
                                      # apply continuity correction when
                                      # computing the test statistic for
                                      # 2x2 tables: one half is subtracted
                                      # from all |O-E| differences. No
                                      # correction is done if
                                      # simulate.p.value = TRUE
           simulate.p.value = FALSE,  # a logical indicating whether to
                                      # compute p-values by Monte Carlo
                                      # simulation
           B = 2000)                  # an integer specifying the number
                                      # of replicates used in the Monte
                                      # Carlo test

fisher.test()

fisher.test(x,                        # a vector or matrix
            y = NULL,                 # a vector, ignored if x is a matrix
            or = 1,                   # the hypothesized odds ratio.
                                      # Only used in the 2 by 2 case
            conf.int = TRUE,          # logical indicating if a confidence
                                      # interval should be computed
            conf.level = 0.95,        # confidence level for the returned
                                      # confidence interval. Only used in
                                      # the 2 by 2 case if conf.int = TRUE.
            simulate.p.value = FALSE,
            B = 2000)
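Both functions accept a contingency table directly. The following sketch applies them to the adoption table, assuming the counts are entered by hand as a matrix; the object names `adoption`, `chi_result` and `fisher_result` are illustrative.

```r
adoption <- matrix(c(51, 22,
                      6, 21),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Education = c("Low", "High"),
                                   Adoption  = c("No", "Yes")))

# Chi-square test; correct = TRUE (the default) applies the continuity correction
chi_result <- chisq.test(adoption)

# Fisher's exact test with a 95% confidence interval for the odds ratio
fisher_result <- fisher.test(adoption, conf.level = 0.95)

print(chi_result$statistic)    # continuity-corrected X-squared
print(chi_result$expected)     # expected frequencies under independence
print(fisher_result$p.value)
print(fisher_result$estimate)  # conditional estimate of the odds ratio
print(fisher_result$conf.int)
```

Fisher's exact test is often preferred when some expected frequencies are small; here all expected counts exceed 5, so the two tests should lead to the same conclusion.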

Sample data set

Read data and tabulate

Frequencies to Percentage

The second argument of the prop.table function is the margin index: 1 for rows and 2 for columns.
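A small sketch of tabulating case-level data and converting frequencies to percentages with `prop.table`. The data frame `survey` and its column names are hypothetical; it is rebuilt here from the adoption counts because the original data file is not part of these slides.

```r
# Hypothetical case-level data rebuilt from the adoption counts
survey <- data.frame(
  education = rep(c("Low", "Low", "High", "High"), times = c(51, 22, 6, 21)),
  adoption  = rep(c("No", "Yes", "No", "Yes"),     times = c(51, 22, 6, 21))
)

# Tabulate frequencies (levels are sorted alphabetically)
freq <- table(Education = survey$education, Adoption = survey$adoption)

row_pct <- prop.table(freq, 1) * 100  # margin 1: each row sums to 100
col_pct <- prop.table(freq, 2) * 100  # margin 2: each column sums to 100

print(freq)
print(round(row_pct, 1))
print(round(col_pct, 1))
```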

Chi-square test

The test is significant, indicating that adoption and level of education are not independent.

Logistic Regression

▪ Used to predict a two-category outcome from a set of independent variables

▪ Response variable is binary: 0, 1

▪ Can handle more than one independent variable, which can be

o Categorical

o Quantitative

o Mixture of both

Logit Model

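\[ \ln\left(\frac{p}{1-p}\right) = \alpha + \beta x \]

p = probability of being an adopter

x = 1 if high level of education

x = 0 if low level of education

Logistic model

\[ p = \frac{\exp(\alpha + \beta x)}{1 + \exp(\alpha + \beta x)} \quad \text{or} \quad p = \frac{1}{1 + \exp(-\alpha - \beta x)} \]

Odds: ratio of success to failure, p / (1 - p)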

Logistic Regression in R using 2 x 2 data

Test significance of the model

The test is significant, indicating non-independence between adoption and level of education.
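One way to fit this model in R from the 2 x 2 counts is `glm` with a two-column (successes, failures) response. This is a sketch, not necessarily the script shown in the original slides; the data frame `adopt` and the use of `anova` for the model test are assumptions.

```r
# Grouped binomial data: one row per level of education
adopt <- data.frame(x   = c(0, 1),    # 0 = low, 1 = high level of education
                    yes = c(22, 21),  # adopters
                    no  = c(51, 6))   # non-adopters

fit <- glm(cbind(yes, no) ~ x, family = binomial, data = adopt)
print(coef(fit))

# Likelihood-ratio chi-square test of the model against the intercept-only model
lr_test <- anova(fit, test = "Chisq")
print(lr_test)
```

Because there are only two groups and two parameters, the fitted probabilities reproduce the observed proportions 22/73 and 21/27 exactly.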

R Output

\[ \ln(\text{odds}) = -0.8408 + 2.0935x \]

\[ \frac{p}{1-p} = e^{-0.8408} e^{2.0935x} \]

\[ p = \frac{(0.43136)(8.113)^x}{1 + (0.43136)(8.113)^x} \]

low: x = 0; p = 0.3014

high: x = 1; p = 0.7778
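The step from coefficients to odds and probabilities can be verified numerically. A minimal sketch using the two estimates from the R output; the variable names `a` and `b` are illustrative.

```r
a <- -0.8408  # intercept from the R output
b <-  2.0935  # coefficient of x from the R output

x    <- c(low = 0, high = 1)
odds <- exp(a) * exp(b)^x  # odds of adoption at each level of education
p    <- odds / (1 + odds)  # identical to plogis(a + b * x)

print(round(exp(c(a, b)), 4))  # baseline odds and odds ratio
print(round(p, 4))
```

The value `exp(b)` is the odds ratio: the odds of adoption for highly educated farmers divided by the odds for farmers with a low level of education.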

Another Example

Insecticide      Pulmonary Ailment
Rate              No     Yes    Total
Low             5376      55     5431
High            5392      96     5488

\[ \chi^2 = 10.86, \quad P < 0.001 \]

This shows strong evidence that the rate of insecticide has an effect on pulmonary ailment.
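The test for the insecticide table can be rerun as follows. This is a sketch with the counts entered by hand; the object name `ailment` is illustrative. The reported statistic of 10.86 appears to correspond to the test without continuity correction, so both versions are computed for comparison.

```r
ailment <- matrix(c(5376, 55,
                    5392, 96),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(Rate    = c("Low", "High"),
                                  Ailment = c("No", "Yes")))

uncorrected <- chisq.test(ailment, correct = FALSE)
corrected   <- chisq.test(ailment)  # default: with continuity correction

print(uncorrected$statistic)
print(uncorrected$p.value)
print(corrected$statistic)
```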

Effect of insecticide usage (low, high) and smoking history on the incidence of pulmonary ailments in farm workers.

Never Smoked:

Rate        No     Yes    Total
Low       5376      55     5431
High      5392      96     5488

Past Smokers:

Rate        No     Yes    Total
Low       4310      63     4373
High      4276     105     4381

Current Smokers:

Rate        No     Yes    Total
Low       1192      21     1213
High      1188      37     1225

R=1 if using high rate of insecticide, R=0 if using low rate of insecticide

P=1 if past smoker, P=0 if not a past smoker

C=1 if current smoker, C=0 if not a current smoker

Logit Model with more than 1 Independent Variable

\[ \ln(\text{odds}) = b_0 + b_1 R + b_2 P + b_3 C \]

From R:

b0 = -4.574   b1 = 0.541   b2 = 0.335   b3 = 0.219

\[ \ln(\text{odds}) = -4.574 + 0.541R + 0.335P + 0.219C \]

\[ \frac{p}{1-p} = e^{-4.574} e^{0.541R} e^{0.335P} e^{0.219C} \]

R Script for logistic regression with more than 1 independent variable

Test significance of model

The result indicates that the model is significant.

Chi-square to test effects

The tests indicate that rate of insecticide and past smoking have an effect on pulmonary ailment. Current smoking has no significant effect on pulmonary ailment.
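A sketch of how such a model could be fitted from the three 2 x 2 tables. The data frame `farm` is hypothetical and built by hand; the coding of current smokers as P = 1 and C = 1 follows the R, P, C columns of the predicted-probability table in these slides. Using `anova` for the overall test and `drop1` for the per-variable tests is an assumption, not necessarily what the original script used.

```r
farm <- data.frame(
  R   = c(0, 1, 0, 1, 0, 1),  # 1 = high rate of insecticide
  P   = c(0, 0, 1, 1, 1, 1),  # 1 = past or current smoker
  C   = c(0, 0, 0, 0, 1, 1),  # 1 = current smoker
  yes = c(55, 96, 63, 105, 21, 37),
  no  = c(5376, 5392, 4310, 4276, 1192, 1188)
)

fit      <- glm(cbind(yes, no) ~ R + P + C, family = binomial, data = farm)
null_fit <- glm(cbind(yes, no) ~ 1,         family = binomial, data = farm)

print(round(coef(fit), 4))

# Overall significance of the model (likelihood-ratio chi-square)
model_test <- anova(null_fit, fit, test = "Chisq")
print(model_test)

# Likelihood-ratio chi-square test for each variable, given the others
effect_tests <- drop1(fit, test = "Chisq")
print(effect_tests)
```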

Parameter estimates

\[ \ln(\text{odds}) = -4.5739 + 0.5407R + 0.3346P + 0.2188C \]

\[ \frac{p}{1-p} = e^{-4.5739} e^{0.5407R} e^{0.3346P} e^{0.2188C} \]

\[ \frac{p}{1-p} = (0.01032)(1.71721)^R (1.39738)^P (1.24482)^C \]

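For non-smokers (P=0, C=0)

\[ \frac{p}{1-p} = (0.01032)(1.71721)^R \]

\[ \hat{\theta} = 1.71721 \] (estimated odds ratio, high vs. low rate)

R = 0: odds = 0.01032, so \hat{p} = 0.01032 / 1.01032 = 0.0102

actual p = 55 / 5431 = 0.0101

Values are very close, an indication that the model fits well.

R = 1: odds = (0.01032)(1.71721) = 0.0177, so \hat{p} = 0.0177 / 1.0177 = 0.0174

Pulmonary ailment (PA):

Rate        No     Yes    Total
Low       5376      55     5431
High      5392      96     5488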

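\[ \frac{p}{1-p} = (0.01032)(1.71721)^R (1.39738)^P (1.24482)^C \]

For past smokers (P=1, C=0)

\[ \frac{p}{1-p} = (0.01442)(1.71721)^R \]

\[ \hat{\theta} = 1.71721 \] (estimated odds ratio, high vs. low rate)

R = 0: odds = 0.01442, so \hat{p} = 0.01442 / 1.01442 = 0.0142

actual p = 63 / 4373 = 0.0144

R = 1: odds = (0.01442)(1.71721) = 0.0248, so \hat{p} = 0.0248 / 1.0248 = 0.0242

Pulmonary ailment (PA):

Rate        No     Yes    Total
Low       4310      63     4373
High      4276     105     4381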

\[ \frac{p}{1-p} = (0.01032)(1.71721)^R (1.39738)^P (1.24482)^C \]

Predict probabilities of success

Conditions                    R   P   C   Est. p   Actual p
High Rate, Never Smoked       1   0   0   0.0174   0.0174
High Rate, Past Smoker        1   1   0   0.0242   0.0240
High Rate, Current Smoker     1   1   1   0.0299   0.0299
Low Rate, Never Smoked        0   0   0   0.0102   0.0101
Low Rate, Past Smoker         0   1   0   0.0142   0.0144
Low Rate, Current Smoker      0   1   1   0.0176   0.0176

On average, the expected probability of having a pulmonary ailment is higher for those who use a high rate of insecticide. The probability increases further if the farmer is a past or current smoker.
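The estimated probabilities follow directly from the parameter estimates. A minimal sketch, assuming the four coefficients reported earlier; the data frame `conditions` and its labels are illustrative.

```r
# Parameter estimates from the fitted logit model
b <- c(b0 = -4.5739, R = 0.5407, P = 0.3346, C = 0.2188)

conditions <- data.frame(
  label = c("High rate, never smoked", "High rate, past smoker",
            "High rate, current smoker", "Low rate, never smoked",
            "Low rate, past smoker", "Low rate, current smoker"),
  R = c(1, 1, 1, 0, 0, 0),
  P = c(0, 1, 1, 0, 1, 1),
  C = c(0, 0, 1, 0, 0, 1)
)

# Linear predictor ln(odds), then the inverse logit to get probabilities
log_odds <- with(conditions,
                 b[["b0"]] + b[["R"]] * R + b[["P"]] * P + b[["C"]] * C)
conditions$est_p <- round(plogis(log_odds), 4)

print(conditions)
```

With a fitted `glm` object, `predict(fit, newdata = conditions, type = "response")` would give the same probabilities without typing the coefficients by hand.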

Another Example

Effect of two types of treatment to control disease incidence.

Treatment    No Disease    With Disease    Total
A                    15               5       20
B                    14               6       20
Control               3              17       20
Total                32              28       60

Create two dummy binary variables for treatment:

T1 = 1 if treatment = A, T1 = 0 otherwise

T2 = 1 if treatment = B, T2 = 0 otherwise

Logit model:

\[ \ln\left(\frac{p}{1-p}\right) = b_0 + b_1 T_1 + b_2 T_2 \]

R script for logistic regression with an independent variable with more than 2 levels

Read data

Logit model

A vs control

B vs control

Test significance of model
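A sketch of these steps with the treatment data entered by hand rather than read from a file. The data frame `disease` and its column names are hypothetical, and the model test via `anova` is one possible choice.

```r
disease <- data.frame(
  treatment    = c("A", "B", "Control"),
  with_disease = c(5, 6, 17),
  no_disease   = c(15, 14, 3)
)

# Dummy variables with the control as the reference level
disease$T1 <- as.numeric(disease$treatment == "A")  # A vs control
disease$T2 <- as.numeric(disease$treatment == "B")  # B vs control

fit <- glm(cbind(with_disease, no_disease) ~ T1 + T2,
           family = binomial, data = disease)
print(round(coef(fit), 3))

# Test significance of the model against the intercept-only model
null_fit   <- update(fit, . ~ 1)
model_test <- anova(null_fit, fit, test = "Chisq")
print(model_test)
```

An alternative that avoids hand-made dummies is to model `relevel(factor(treatment), ref = "Control")`, which lets R build the same two contrasts.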

From R Output:

b0 = 1.735   b1 = -2.833   b2 = -2.582

\[ \ln\left(\frac{p}{1-p}\right) = 1.735 - 2.833 T_1 - 2.582 T_2 \]

\[ p = \frac{(5.669)(0.0588)^{T_1}(0.0756)^{T_2}}{1 + (5.669)(0.0588)^{T_1}(0.0756)^{T_2}} \]

Treatment    T1    T2    Expected Probabilities
A             1     0    0.25
B             0     1    0.30
Control       0     0    0.85

The probability of disease incidence is highest if no treatment is applied. However, the probability of disease incidence is slightly higher with treatment B than with treatment A.

Script to compute expected probabilities
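The expected probabilities can be computed from the three coefficients with the inverse logit. A minimal sketch, assuming the rounded estimates reported from the R output; the data frame `treatments` is illustrative.

```r
# Coefficients from the R output
b <- c(b0 = 1.735, b1 = -2.833, b2 = -2.582)

treatments <- data.frame(treatment = c("A", "B", "Control"),
                         T1 = c(1, 0, 0),
                         T2 = c(0, 1, 0))

# plogis() is the inverse logit: exp(z) / (1 + exp(z))
treatments$expected_p <- plogis(b[["b0"]] +
                                b[["b1"]] * treatments$T1 +
                                b[["b2"]] * treatments$T2)

print(transform(treatments, expected_p = round(expected_p, 2)))
```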

Logistic Models with Quantitative Independent Variable

▪ Example: Effect of farm size on the adoption of improved fallow

o Y: adoption (1=Yes, 0=No)

o X: farm size

[Figure: use of improved fallow (0 or 1) plotted against farm size (0 to 9)]

▪ Why ordinary (linear) regression analysis is not appropriate when the dependent variable is binary:

o It may produce predicted values that are negative or greater than 1.

o Predicted values of Y can assume a continuous range of values, but Y can only be 0 or 1.

▪ What should be done?

o Fit a model on the probability of adoption.
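The first problem is easy to demonstrate. The sketch below uses a small hypothetical data set (ten made-up farms, not the workshop data) and compares predictions from a linear fit and a logistic fit at the smallest and largest farm sizes.

```r
# Hypothetical data: adoption tends to increase with farm size
farm_size <- 0:9
adopt     <- c(0, 0, 0, 1, 0, 1, 0, 1, 1, 1)

linear   <- lm(adopt ~ farm_size)
logistic <- glm(adopt ~ farm_size, family = binomial)

ends <- data.frame(farm_size = c(0, 9))

pred_linear   <- predict(linear, newdata = ends)
pred_logistic <- predict(logistic, newdata = ends, type = "response")

print(pred_linear)    # falls below 0 and above 1
print(pred_logistic)  # stays strictly between 0 and 1
```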

[Figure: probability of adoption (0 to 1) plotted against farm size (0 to 9)]

▪ Probability of adoption increases as farm size increases.

▪ The curve is not a straight line but an S-shaped curve.

▪ Model for this curve is:

\[ p = \frac{\exp(\alpha + \beta x)}{1 + \exp(\alpha + \beta x)} \]

▪ Estimate α and β using the logit regression equation:

\[ \ln\left(\frac{p}{1-p}\right) = \alpha + \beta x \]

R script for logistic regression with quantitative independent variable

Read data

Logit model

\[ \ln\left(\frac{p}{1-p}\right) = -2.503 + 0.453x \]

\[ p = \frac{(0.0818)(1.573)^x}{1 + (0.0818)(1.573)^x} \]

If farm size = 5 acres, p = 0.44; that is, the probability that the farmer is an adopter is 0.44.
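The probability at a given farm size can be checked from the two estimates. A minimal sketch; the function name `adoption_prob` is illustrative.

```r
alpha <- -2.503  # intercept of the fitted logit model
beta  <-  0.453  # coefficient of farm size

# Probability of adoption as a function of farm size
adoption_prob <- function(farm_size) plogis(alpha + beta * farm_size)

print(round(exp(c(alpha, beta)), 4))  # baseline odds and odds multiplier per unit of farm size
print(round(adoption_prob(5), 2))
print(round(adoption_prob(0:9), 2))   # traces the S-shaped curve
```

Each additional unit of farm size multiplies the odds of adoption by `exp(beta)`, which is the 1.573 appearing in the formula for p.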

Parameter estimates

Modeling Responses with More than Two Categories

▪ Re-organize or ignore some of the categories temporarily, to reduce to a binary response.

▪ Divide categories into a series of binary categories.

▪ Use multinomial logistic regression as an extension of binary logistic regression.

▪ Use log-linear models if all variables are categorical.

Thank you!
