Analysis of Categorical Data
Nick Jackson
University of Southern California, Department of Psychology
10/11/2013


Page 1: Analysis of Categorical Data

Analysis of Categorical Data

Nick Jackson
University of Southern California
Department of Psychology
10/11/2013

Page 2: Analysis of Categorical Data

Overview

Data Types
Contingency Tables
Logit Models
◦ Binomial
◦ Ordinal
◦ Nominal

Page 3: Analysis of Categorical Data

Things not covered (but still fit into the topic)

Matched pairs/repeated measures
◦ McNemar's Chi-Square

Reliability
◦ Cohen's Kappa
◦ ROC

Poisson (count) models

Categorical SEM
◦ Tetrachoric Correlation

Bernoulli Trials

Page 4: Analysis of Categorical Data

Data Types (Levels of Measurement)

Discrete/Categorical/Qualitative vs. Continuous/Quantitative

Nominal/Multinomial:
◦ Properties: values arbitrary (no magnitude), no direction (no ordering)
◦ Example: Race: 1=AA, 2=Ca, 3=As
◦ Measures: mode, relative frequency

Rank Order/Ordinal:
◦ Properties: values semi-arbitrary (no magnitude?), have direction (ordering)
◦ Example: Likert scales: 1-5, Strongly Disagree to Strongly Agree
◦ Measures: mode, relative frequency, median. Mean?

Binary/Dichotomous/Binomial:
◦ Properties: 2 levels; a special case of ordinal or multinomial
◦ Examples: Gender (multinomial), Disease (Y/N)
◦ Measures: mode, relative frequency. Mean?

Page 5: Analysis of Categorical Data

Contingency Tables

Often called two-way tables or cross-tabs
Have dimensions I x J
Can be used to test hypotheses of association between categorical variables

2 x 3 Table:
                  Age Groups
Gender    <40 Years   40-50 Years   >50 Years
Female        25           68           63
Male         240          223          201

Code 1.1

Page 6: Analysis of Categorical Data

Contingency Tables: Test of Independence

Chi-Square Test of Independence (χ²)
◦ Calculate χ²
◦ Determine DF: (I-1) * (J-1)
◦ Compare to the χ² critical value for the given DF

2 x 3 Table:
                  Age Groups
Gender    <40 Years   40-50 Years   >50 Years
Female        25           68           63      R1=156
Male         240          223          201      R2=664
           C1=265       C2=331       C3=264     N=820

χ² = Σᵢ₌₁ⁿ (Oᵢ − Eᵢ)² / Eᵢ

Where: Oᵢ = observed frequency, Eᵢ = expected frequency,
n = number of cells in the table

Eᵢ,ⱼ = (Rᵢ * Cⱼ) / N

Page 7: Analysis of Categorical Data

Contingency Tables: Test of Independence

Pearson Chi-Square Test of Independence (χ²)
◦ H0: no association
◦ HA: association… where, how?

Not appropriate when an expected (Eᵢ) cell frequency is < 5
◦ Use Fisher's exact test instead

2 x 3 Table:
                  Age Groups
Gender    <40 Years   40-50 Years   >50 Years
Female        25           68           63      R1=156
Male         240          223          201      R2=664
           C1=265       C2=331       C3=264     N=820

χ² (df=2) = 23.39, p < 0.001

Code 1.2

Page 8: Analysis of Categorical Data

Contingency Tables: 2x2

                         Disorder (Outcome)
Risk Factor/Exposure      Yes    No
Yes                        a      b     a+b
No                         c      d     c+d
                          a+c    b+d    a+b+c+d

Page 9: Analysis of Categorical Data

Contingency Tables: Measures of Association

              Depression
Alcohol Use    Yes    No
Yes           a=25   b=10    35
No            c=20   d=45    65
               45     55    100

Contrasting probability:

Relative Risk (RR) = P(D|A) / P(D|Ā) = 0.714 / 0.308 = 2.31

Individuals who used alcohol were 2.31 times more likely to have depression than those who did not use alcohol.

Contrasting odds:

Odds Ratio (OR) = Odds(D|A) / Odds(D|Ā) = 2.5 / 0.44 = 5.62

The odds of depression were 5.62 times greater in alcohol users compared to nonusers.
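The RR and OR computations above can be sketched with the standard library (the slide's 2.31 reflects rounding the intermediate probabilities first):

```python
# Risk ratio and odds ratio from the 2 x 2 depression-by-alcohol table.
a, b = 25, 10   # alcohol users: depressed, not depressed
c, d = 20, 45   # nonusers:      depressed, not depressed

p_exposed   = a / (a + b)           # P(D|A)  = 25/35 ~ 0.714
p_unexposed = c / (c + d)           # P(D|~A) = 20/65 ~ 0.308
rr = p_exposed / p_unexposed        # ~ 2.32

odds_exposed   = a / b              # 2.5
odds_unexposed = c / d              # ~ 0.44
or_ = odds_exposed / odds_unexposed # = (a*d)/(b*c) = 5.625

print(round(rr, 2), round(or_, 2))
```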

Page 10: Analysis of Categorical Data

Why Odds Ratios?

[Figure: OR and RR plotted against the overall probability of depression (0 to .5); the two measures agree when the outcome is rare and diverge as it becomes common.]

              Depression
Alcohol Use    Yes      No
Yes           a=25    b=10*i    (25 + 10*i)
No            c=20    d=45*i    (20 + 45*i)
               45     55*i      (45 + 55*i)

i = 1 to 45
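The scaling scheme in this table can be reproduced directly: because b and d are both multiplied by i while a and c stay fixed, the odds ratio (a*d)/(b*c) never changes, while the relative risk climbs toward it as depression becomes rare. A quick sketch:

```python
# As i grows, depression becomes rarer overall; the OR is invariant to
# this scaling, and the RR approaches the OR (the rare-outcome case).
a, c = 25, 20
for i in (1, 5, 45):
    b, d = 10 * i, 45 * i
    overall = (a + c) / (a + b + c + d)   # overall P(depression)
    rr = (a / (a + b)) / (c / (c + d))
    or_ = (a / b) / (c / d)               # = (a*d)/(b*c), constant
    print(f"i={i:2d}  P={overall:.3f}  RR={rr:.3f}  OR={or_:.3f}")
```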

Page 11: Analysis of Categorical Data

The Generalized Linear Model

General Linear Model (LM)
◦ Continuous outcomes (DV)
◦ Linear regression, t-test, Pearson correlation, ANOVA, ANCOVA

Generalized Linear Model (GLM)
◦ John Nelder and Robert Wedderburn
◦ Maximum likelihood estimation
◦ Continuous, categorical, and count outcomes
◦ Distribution families and link functions
◦ Error distributions that are not normal

Page 12: Analysis of Categorical Data

Logistic Regression

"This is the most important model for categorical response data" –Agresti (Categorical Data Analysis, 2nd Ed.)

Binary response
Predicting probability (related to the probit model)
Assume (the usual):
◦ Independence
◦ NOT homoscedasticity or normal errors
◦ Linearity (in the log odds)
◦ Also… adequate cell sizes

Page 13: Analysis of Categorical Data

Logistic Regression: The Model

In terms of the probability of success π(x)

In terms of logits (log odds): the logit transform gives us a linear equation
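The model equations on this slide were images and did not survive transcription; the standard forms, consistent with the logit output on the following slides, are:

```latex
% Probability of success:
\pi(x) = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}

% Logit (log-odds) form -- the transform that yields a linear equation:
\operatorname{logit}\bigl(\pi(x)\bigr) = \log\frac{\pi(x)}{1 - \pi(x)} = \alpha + \beta x
```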

Page 14: Analysis of Categorical Data

Logistic Regression: Example

              Freq.   Percent
Not Depressed  672     81.95
Depressed      148     18.05

The output as logits
◦ Logits: H0: β=0

Y=Depressed      Coef    SE     Z      P       CI
α (_constant)   -1.51   0.091  -16.7  <0.001  -1.69, -1.34

Conversion to odds: e^(-1.51) = 0.22; also = 0.1805/0.8195 = 0.22
Conversion to probability: 0.22 / (1 + 0.22) = 0.18
What does H0: β=0 mean?

Code 2.1
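The logit-to-odds-to-probability conversions above, as a quick standard-library check:

```python
import math

# Converting the intercept-only logit (-1.51) to odds and probability,
# matching the sample frequencies on this slide.
logit = -1.51
odds = math.exp(logit)     # e^-1.51 ~ 0.22, also 0.1805/0.8195
prob = odds / (1 + odds)   # ~ 0.18, the sample proportion depressed
print(round(odds, 2), round(prob, 2))
```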

Page 15: Analysis of Categorical Data

Logistic Regression: Example

The output as ORs
◦ Odds ratios: H0: OR=1

Y=Depressed      OR      SE     Z      P       CI
α (_constant)   0.220   0.020  -16.7  <0.001  0.184, 0.263

              Freq.   Percent
Not Depressed  672     81.95
Depressed      148     18.05

◦ Conversion to probability: 0.220 / (1 + 0.220) = 0.18
◦ Conversion to logit (log odds!): Ln(OR) = logit; Ln(0.220) = -1.51

Code 2.2

Page 16: Analysis of Categorical Data

Logistic Regression: Example

Logistic regression with a single continuous predictor:

Y=Depressed      Coef    SE     Z      P       CI
α (_constant)   -2.24   0.489  -4.58  <0.001  -3.20, -1.28
β (age)          0.013  0.009   1.52   0.127  -0.004, 0.030

AS LOGITS:
Interpretation: a 1 unit increase in age results in a 0.013 increase in the log-odds of depression. Hmmmm… I have no concept of what a log-odds is; interpret it as something else. The logit is > 0, so as age increases the risk of depression increases.

AS OR: OR = e^0.013 = 1.013. For a 1 unit increase in age, the odds of depression are multiplied by 1.013. We could also say: for a 1 unit increase in age there is a 1.3% increase in the odds of depression [(OR - 1) * 100 = % change].

Code 2.3
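The exponentiation step in the interpretation above, as a quick check:

```python
import math

# The age coefficient as an odds ratio: exponentiate the logit slope.
beta_age = 0.013
or_age = math.exp(beta_age)      # ~ 1.013
pct_change = (or_age - 1) * 100  # ~ 1.3% increase in the odds per year
print(round(or_age, 3), round(pct_change, 1))
```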

Page 17: Analysis of Categorical Data

Logistic Regression: GOF

Overall Model Likelihood-Ratio Chi-Square
• Omnibus test for the model
• Overall model fit? Relative to other models
• Compares the specified model with the null model (no predictors)
• χ² = -2*(LL0 - LL1), DF = K parameters estimated
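The LR chi-square formula above, with hypothetical log-likelihood values purely to illustrate the arithmetic:

```python
# Likelihood-ratio chi-square comparing a fitted model with the null
# (intercept-only) model. LL values here are hypothetical.
ll_null   = -390.0   # LL0: null-model log-likelihood (hypothetical)
ll_fitted = -388.0   # LL1: fitted-model log-likelihood (hypothetical)
lr_chi2 = -2 * (ll_null - ll_fitted)  # df = number of added parameters
print(lr_chi2)
```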

Page 18: Analysis of Categorical Data

18

Logistic Regression: GOF (Summary Measures) Pseudo-R2

◦ Not the same meaning as linear regression. ◦ There are many of them (Cox and Snell/McFadden)◦ Only comparable within nested models of the same outcome.

Hosmer-Lemeshow◦ Models with Continuous Predictors ◦ Is the model a better fit than the NULL model. X2

◦ H0: Good Fit for Data, so we want p>0.05◦ Order the predicted probabilities, group them (g=10) by quantiles, Chi-Square of

Group * Outcome using. Df=g-2

◦ Conservative (rarely rejects the null) Pearson Chi-Square

◦ Models with categorical predictors◦ Similar to Hosmer-Lemeshow

ROC-Area Under the Curve◦ Predictive accuracy/Classification

Code 2.4
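The Hosmer-Lemeshow steps above can be sketched directly: order cases by predicted probability, split them into g groups, then compare observed with expected event counts via a chi-square statistic on g - 2 df. The data below are synthetic and purely illustrative:

```python
import random

# Hosmer-Lemeshow sketch on synthetic (predicted probability, outcome) pairs.
random.seed(1)
cases = []
for _ in range(200):
    p = random.uniform(0.05, 0.95)       # a predicted probability
    y = 1 if random.random() < p else 0  # a simulated outcome
    cases.append((p, y))

g = 10
cases.sort()                             # order by predicted probability
size = len(cases) // g
hl = 0.0
for k in range(g):
    group = cases[k * size:(k + 1) * size]
    observed = sum(y for _, y in group)  # observed events in the group
    expected = sum(p for p, _ in group)  # expected events in the group
    n_k = len(group)
    hl += (observed - expected) ** 2 / (expected * (1 - expected / n_k))

df = g - 2
print(round(hl, 2), df)
```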

Page 19: Analysis of Categorical Data

Logistic Regression: GOF (Diagnostic Measures)

Outliers in Y (outcome)
◦ Pearson residuals: square root of the contribution to the Pearson χ²
◦ Deviance residuals: square root of the contribution to the likelihood-ratio test statistic of a saturated model vs. the fitted model

Outliers in X (predictors)
◦ Leverage (hat matrix/projection matrix): maps the influence of observed on fitted values

Influential observations
◦ Pregibon's delta-beta influence statistic
◦ Similar to Cook's D in linear regression

Detecting problems
◦ Residuals vs. predictors
◦ Leverage vs. residuals
◦ Boxplot of delta-beta

Code 2.5

Page 20: Analysis of Categorical Data

Logistic Regression: GOF

Y=Depressed      Coef    SE     Z      P       CI
α (_constant)   -2.24   0.489  -4.58  <0.001  -3.20, -1.28
β (age)          0.013  0.009   1.52   0.127  -0.004, 0.030

log( π(depressed) / (1 − π(depressed)) ) = α + β1(age)

L-R χ² (df=1): 2.47, p=0.1162

McFadden's R²: 0.0030

H-L GOF:
Number of groups: 10
H-L χ²: 7.12
DF: 8
P: 0.5233

Page 21: Analysis of Categorical Data

Logistic Regression: Diagnostics

Linearity in the log-odds
◦ Use a lowess (loess) plot
◦ Depressed vs. Age

[Figure: lowess smoother and logit-transformed smooth of Depressed (logit) against age, ages 20-80, bandwidth = .8]

Code 2.6

Page 22: Analysis of Categorical Data

Logistic Regression: Example

Logistic regression with a single categorical predictor:

Y=Depressed      OR      SE     Z      P       CI
α (_constant)   0.545   0.091  -3.63  <0.001  0.392, 0.756
β (male)        0.299   0.060  -5.99  <0.001  0.202, 0.444

AS OR:
Interpretation: the odds of depression for males are 0.299 times the odds for females.

We could also say: the odds of depression are (1 - 0.299 = 0.701) 70.1% lower in males compared to females.

Or why not just make males the reference group so the OR is greater than 1? Or we could take the inverse and accomplish the same thing: 1/0.299 = 3.34.

Code 2.7

Page 23: Analysis of Categorical Data

Ordinal Logistic Regression

Also called ordered logistic or the proportional odds model
An extension of the binary logistic model
>2 ordered responses
New assumption!
◦ Proportional odds: the predictor's effect on the outcome is the same across levels of the outcome

BMI3GRP (1=Normal Weight, 2=Overweight, 3=Obese)
Bmi3grp (1 vs 2,3) = β(age)
Bmi3grp (1,2 vs 3) = β(age)

Page 24: Analysis of Categorical Data

Ordinal Logistic Regression

The Model
◦ A latent variable model (Y*)
◦ j = number of levels − 1
◦ From the equation we can see that the odds ratio is assumed to be independent of the category j
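The equation itself was an image lost in transcription; the cumulative-logit form it describes (sign conventions vary by software; Stata subtracts xβ from the cutpoints) is:

```latex
% Cumulative logit for category j, with j = 1, ..., J-1 thresholds:
\operatorname{logit}\, P(Y \le j \mid x) = \alpha_j + \beta x
% \beta carries no j subscript, so the odds ratio e^{\beta} is the same
% at every cutpoint: the proportional odds assumption.
```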

Page 25: Analysis of Categorical Data

Ordinal Logistic Regression Example

AS LOGITS:
Y=bmi3grp           Coef     SE      Z      P       CI
β1 (age)           -0.026   0.006   -4.15  <0.001  -0.038, -0.014
β2 (blood_press)    0.012   0.005    2.48   0.013   0.002, 0.021
Threshold1/cut1    -0.696   0.6678                 -2.004, 0.613
Threshold2/cut2     0.773   0.6680                 -0.536, 2.082

For a 1 unit increase in blood pressure there is a 0.012 increase in the log-odds of being in a higher BMI category.

AS OR:
Y=bmi3grp           OR      SE      Z      P       CI
β1 (age)           0.974   0.006   -4.15  <0.001  0.962, 0.986
β2 (blood_press)   1.012   0.005    2.48   0.013   1.002, 1.022
Threshold1/cut1   -0.696   0.6678                 -2.004, 0.613
Threshold2/cut2    0.773   0.6680                 -0.536, 2.082

For a 1 unit increase in blood pressure the odds of being in a higher BMI category are 1.012 times greater.

Code 3.1
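The fitted thresholds and slopes can be turned into predicted category probabilities. A sketch assuming Stata-style ologit parameterization, P(Y ≤ j) = invlogit(cut_j − xβ), with a hypothetical covariate pattern:

```python
import math

# Predicted BMI-category probabilities from the fitted ordinal model.
# The age and blood-pressure values chosen here are hypothetical.
def invlogit(z):
    return 1 / (1 + math.exp(-z))

b_age, b_bp = -0.026, 0.012
cut1, cut2 = -0.696, 0.773

age, bp = 50.0, 120.0        # hypothetical person
xb = b_age * age + b_bp * bp

p_le1 = invlogit(cut1 - xb)  # P(normal weight)
p_le2 = invlogit(cut2 - xb)  # P(normal or overweight)
probs = (p_le1, p_le2 - p_le1, 1 - p_le2)  # the three category probabilities
print([round(p, 3) for p in probs])
```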

Page 26: Analysis of Categorical Data

Ordinal Logistic Regression: GOF

Assessing the proportional odds assumption
◦ Brant Test of Parallel Regression
  H0: proportional odds, thus want p > 0.05
  Tests each predictor separately and overall
◦ Score Test of Parallel Regression
  H0: proportional odds, thus want p > 0.05
◦ Approximate Likelihood-Ratio Test
  H0: proportional odds, thus want p > 0.05

Code 3.2

Page 27: Analysis of Categorical Data

Ordinal Logistic Regression: GOF

Pseudo-R²

Diagnostic measures
◦ Performed on the j−1 binomial logistic regressions

Code 3.3

Page 28: Analysis of Categorical Data

Multinomial Logistic Regression

Also called multinomial logit or polytomous logistic regression
Same assumptions as the binary logistic model
>2 non-ordered responses
◦ Or you've failed to meet the proportional odds assumption of the ordinal logistic model

Page 29: Analysis of Categorical Data

Multinomial Logistic Regression

The Model
◦ j = levels of the outcome
◦ J = reference level
◦ where x is a fixed setting of an explanatory variable
◦ Notice how it appears we are estimating a relative risk and not an odds ratio. It's actually an OR.
◦ Similar to conducting separate binary logistic models, but with better Type I error control
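The baseline-category logit equations (also lost as images in transcription) take the form:

```latex
% Baseline-category logits with reference level J:
\log\frac{\pi_j(x)}{\pi_J(x)} = \alpha_j + \beta_j x, \qquad j = 1, \dots, J-1
```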

Page 30: Analysis of Categorical Data

Multinomial Logistic Regression Example

Does degree of supernatural belief indicate a religious preference?

Y=religion (ref=Catholic(1))    OR      SE     Z      P       CI
Protestant (2)
  β (supernatural)             1.126   0.090   1.47   0.141   0.961, 1.317
  α (_constant)                1.219   0.097   2.49   0.013   1.043, 1.425
Evangelical (3)
  β (supernatural)             1.218   0.117   2.06   0.039   1.010, 1.469
  α (_constant)                0.619   0.059  -5.02  <0.001   0.512, 0.746

AS OR:
For a 1 unit increase in supernatural belief, there is a (OR − 1 = % change) 21.8% increase in the odds of being Evangelical rather than Catholic.

Code 4.1
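The fitted ORs can be converted back to coefficients and used to predict category probabilities via the softmax over baseline-category logits. A sketch with a hypothetical supernatural-belief score:

```python
import math

# Predicted religion probabilities from the fitted multinomial model
# (reference = Catholic). Coefficients are recovered by logging the
# reported ORs; the belief score x is hypothetical.
a2, b2 = math.log(1.219), math.log(1.126)  # Protestant vs Catholic
a3, b3 = math.log(0.619), math.log(1.218)  # Evangelical vs Catholic

x = 2.0                        # hypothetical belief score
num = [1.0,                    # Catholic (reference): exp(0)
       math.exp(a2 + b2 * x),  # Protestant
       math.exp(a3 + b3 * x)]  # Evangelical
total = sum(num)
probs = [v / total for v in num]  # softmax over the three categories
print([round(p, 3) for p in probs])
```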

Page 31: Analysis of Categorical Data

Multinomial Logistic Regression GOF

Limited GOF tests
◦ Look at the LR chi-square and compare nested models
◦ "Essentially, all models are wrong, but some are useful" –George E.P. Box

Pseudo-R²

Similar to ordinal
◦ Perform tests on the j−1 binomial logistic regressions

Page 32: Analysis of Categorical Data

32

Resources“Categorical Data Analysis” by Alan Agresti

UCLA Stat Computing:http://www.ats.ucla.edu/stat/