Analysis of Categorical Data
Nick Jackson
University of Southern California, Department of Psychology
10/11/2013


Page 1: Analysis of Categorical Data

Analysis of Categorical Data

Nick Jackson
University of Southern California
Department of Psychology
10/11/2013

Page 2: Analysis of Categorical Data

Overview

Data Types
Contingency Tables
Logit Models
◦ Binomial
◦ Ordinal
◦ Nominal

Page 3: Analysis of Categorical Data

Things not covered (but still fit into the topic)

Matched pairs/repeated measures
◦ McNemar's Chi-Square

Reliability
◦ Cohen's Kappa
◦ ROC

Poisson (count) models

Categorical SEM
◦ Tetrachoric Correlation

Bernoulli Trials

Page 4: Analysis of Categorical Data

Data Types (Levels of Measurement)

Discrete/Categorical/Qualitative vs. Continuous/Quantitative

Nominal/Multinomial:
◦ Properties: values arbitrary (no magnitude), no direction (no ordering)
◦ Example: Race: 1=AA, 2=Ca, 3=As
◦ Measures: mode, relative frequency

Rank Order/Ordinal:
◦ Properties: values semi-arbitrary (no magnitude?), have direction (ordering)
◦ Example: Likert scales: 1-5, Strongly Disagree to Strongly Agree
◦ Measures: mode, relative frequency, median. Mean?

Binary/Dichotomous/Binomial:
◦ Properties: 2 levels; a special case of ordinal or multinomial
◦ Examples: Gender (multinomial), Disease (Y/N)
◦ Measures: mode, relative frequency. Mean?

Page 5: Analysis of Categorical Data

Contingency Tables

Often called two-way tables or cross-tabs
Have dimensions I x J
Can be used to test hypotheses of association between categorical variables

2 x 3 Table:
                  Age Groups
Gender    <40 Years   40-50 Years   >50 Years
Female        25           68           63
Male         240          223          201

Code 1.1

Page 6: Analysis of Categorical Data

Contingency Tables: Test of Independence

Chi-Square Test of Independence (χ²)
◦ Calculate χ²
◦ Determine DF: (I-1) * (J-1)
◦ Compare to the χ² critical value for the given DF

2 x 3 Table:
                  Age Groups
Gender    <40 Years   40-50 Years   >50 Years
Female        25           68           63      R1=156
Male         240          223          201      R2=664
           C1=265       C2=331       C3=264     N=820

χ² = Σᵢ₌₁ⁿ (Oᵢ − Eᵢ)² / Eᵢ

Where: Oᵢ = observed frequency, Eᵢ = expected frequency,
n = number of cells in the table

Eᵢ,ⱼ = (Rᵢ * Cⱼ) / N

Page 7: Analysis of Categorical Data

Contingency Tables: Test of Independence

Pearson Chi-Square Test of Independence (χ²)
◦ H0: no association
◦ HA: association… where, how?

Not appropriate when an expected (Eᵢ) cell frequency is < 5
◦ Use Fisher's exact test instead

2 x 3 Table:
                  Age Groups
Gender    <40 Years   40-50 Years   >50 Years
Female        25           68           63      R1=156
Male         240          223          201      R2=664
           C1=265       C2=331       C3=264     N=820

χ² (df=2) = 23.39, p < 0.001

Code 1.2

Page 8: Analysis of Categorical Data

Contingency Tables: 2x2

                         Disorder (Outcome)
Risk Factor/Exposure      Yes    No
Yes                        a      b     a+b
No                         c      d     c+d
                          a+c    b+d    a+b+c+d

Page 9: Analysis of Categorical Data

Contingency Tables: Measures of Association

              Depression
Alcohol Use    Yes    No
Yes           a=25   b=10    35
No            c=20   d=45    65
               45     55    100

Contrasting probability:

Relative Risk (RR) = P(D|A) / P(D|Ā) = 0.714 / 0.308 = 2.31

Individuals who used alcohol were 2.31 times more likely to have depression than those who did not use alcohol.

Contrasting odds:

Odds Ratio (OR) = Odds(D|A) / Odds(D|Ā) = 2.5 / 0.44 = 5.62

The odds of depression were 5.62 times greater in alcohol users compared to nonusers.
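The RR and OR computations above can be sketched with the standard library (the slide's 2.31 reflects rounding the intermediate probabilities first):

```python
# Risk ratio and odds ratio from the 2 x 2 depression-by-alcohol table.
a, b = 25, 10   # alcohol users: depressed, not depressed
c, d = 20, 45   # nonusers:      depressed, not depressed

p_exposed   = a / (a + b)           # P(D|A)  = 25/35 ~ 0.714
p_unexposed = c / (c + d)           # P(D|~A) = 20/65 ~ 0.308
rr = p_exposed / p_unexposed        # ~ 2.32

odds_exposed   = a / b              # 2.5
odds_unexposed = c / d              # ~ 0.44
or_ = odds_exposed / odds_unexposed # = (a*d)/(b*c) = 5.625

print(round(rr, 2), round(or_, 2))
```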

Page 10: Analysis of Categorical Data

Why Odds Ratios?

[Figure: OR and RR plotted against the overall probability of depression (0 to .5); the two measures agree when the outcome is rare and diverge as it becomes common.]

              Depression
Alcohol Use    Yes      No
Yes           a=25    b=10*i    (25 + 10*i)
No            c=20    d=45*i    (20 + 45*i)
               45     55*i      (45 + 55*i)

i = 1 to 45
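The scaling scheme in this table can be reproduced directly: because b and d are both multiplied by i while a and c stay fixed, the odds ratio (a*d)/(b*c) never changes, while the relative risk climbs toward it as depression becomes rare. A quick sketch:

```python
# As i grows, depression becomes rarer overall; the OR is invariant to
# this scaling, and the RR approaches the OR (the rare-outcome case).
a, c = 25, 20
for i in (1, 5, 45):
    b, d = 10 * i, 45 * i
    overall = (a + c) / (a + b + c + d)   # overall P(depression)
    rr = (a / (a + b)) / (c / (c + d))
    or_ = (a / b) / (c / d)               # = (a*d)/(b*c), constant
    print(f"i={i:2d}  P={overall:.3f}  RR={rr:.3f}  OR={or_:.3f}")
```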

Page 11: Analysis of Categorical Data

The Generalized Linear Model

General Linear Model (LM)
◦ Continuous outcomes (DV)
◦ Linear regression, t-test, Pearson correlation, ANOVA, ANCOVA

Generalized Linear Model (GLM)
◦ John Nelder and Robert Wedderburn
◦ Maximum likelihood estimation
◦ Continuous, categorical, and count outcomes
◦ Distribution families and link functions
◦ Error distributions that are not normal

Page 12: Analysis of Categorical Data

Logistic Regression

"This is the most important model for categorical response data" –Agresti (Categorical Data Analysis, 2nd Ed.)

Binary response
Predicting probability (related to the probit model)
Assume (the usual):
◦ Independence
◦ NOT homoscedasticity or normal errors
◦ Linearity (in the log odds)
◦ Also… adequate cell sizes

Page 13: Analysis of Categorical Data

Logistic Regression: The Model

In terms of the probability of success π(x)

In terms of logits (log odds): the logit transform gives us a linear equation
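The model equations on this slide were images and did not survive transcription; the standard forms, consistent with the logit output on the following slides, are:

```latex
% Probability of success:
\pi(x) = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}

% Logit (log-odds) form -- the transform that yields a linear equation:
\operatorname{logit}\bigl(\pi(x)\bigr) = \log\frac{\pi(x)}{1 - \pi(x)} = \alpha + \beta x
```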

Page 14: Analysis of Categorical Data

Logistic Regression: Example

              Freq.   Percent
Not Depressed  672     81.95
Depressed      148     18.05

The output as logits
◦ Logits: H0: β=0

Y=Depressed      Coef    SE     Z      P       CI
α (_constant)   -1.51   0.091  -16.7  <0.001  -1.69, -1.34

Conversion to odds: e^(-1.51) = 0.22; also = 0.1805/0.8195 = 0.22
Conversion to probability: 0.22 / (1 + 0.22) = 0.18
What does H0: β=0 mean?

Code 2.1
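The logit-to-odds-to-probability conversions above, as a quick standard-library check:

```python
import math

# Converting the intercept-only logit (-1.51) to odds and probability,
# matching the sample frequencies on this slide.
logit = -1.51
odds = math.exp(logit)     # e^-1.51 ~ 0.22, also 0.1805/0.8195
prob = odds / (1 + odds)   # ~ 0.18, the sample proportion depressed
print(round(odds, 2), round(prob, 2))
```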

Page 15: Analysis of Categorical Data

Logistic Regression: Example

The output as ORs
◦ Odds ratios: H0: OR=1

Y=Depressed      OR      SE     Z      P       CI
α (_constant)   0.220   0.020  -16.7  <0.001  0.184, 0.263

              Freq.   Percent
Not Depressed  672     81.95
Depressed      148     18.05

◦ Conversion to probability: 0.220 / (1 + 0.220) = 0.18
◦ Conversion to logit (log odds!): Ln(OR) = logit; Ln(0.220) = -1.51

Code 2.2

Page 16: Analysis of Categorical Data

Logistic Regression: Example

Logistic regression with a single continuous predictor:

Y=Depressed      Coef    SE     Z      P       CI
α (_constant)   -2.24   0.489  -4.58  <0.001  -3.20, -1.28
β (age)          0.013  0.009   1.52   0.127  -0.004, 0.030

AS LOGITS:
Interpretation: a 1 unit increase in age results in a 0.013 increase in the log-odds of depression. Hmmmm… I have no concept of what a log-odds is; interpret it as something else. The logit is > 0, so as age increases the risk of depression increases.

AS OR: OR = e^0.013 = 1.013. For a 1 unit increase in age, the odds of depression are multiplied by 1.013. We could also say: for a 1 unit increase in age there is a 1.3% increase in the odds of depression [(OR - 1) * 100 = % change].

Code 2.3
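The exponentiation step in the interpretation above, as a quick check:

```python
import math

# The age coefficient as an odds ratio: exponentiate the logit slope.
beta_age = 0.013
or_age = math.exp(beta_age)      # ~ 1.013
pct_change = (or_age - 1) * 100  # ~ 1.3% increase in the odds per year
print(round(or_age, 3), round(pct_change, 1))
```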

Page 17: Analysis of Categorical Data

Logistic Regression: GOF

Overall Model Likelihood-Ratio Chi-Square
• Omnibus test for the model
• Overall model fit? Relative to other models
• Compares the specified model with the null model (no predictors)
• χ² = -2*(LL0 - LL1), DF = K parameters estimated
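The LR chi-square formula above, with hypothetical log-likelihood values purely to illustrate the arithmetic:

```python
# Likelihood-ratio chi-square comparing a fitted model with the null
# (intercept-only) model. LL values here are hypothetical.
ll_null   = -390.0   # LL0: null-model log-likelihood (hypothetical)
ll_fitted = -388.0   # LL1: fitted-model log-likelihood (hypothetical)
lr_chi2 = -2 * (ll_null - ll_fitted)  # df = number of added parameters
print(lr_chi2)
```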

Page 18: Analysis of Categorical Data

18

Logistic Regression: GOF (Summary Measures) Pseudo-R2

◦ Not the same meaning as linear regression. ◦ There are many of them (Cox and Snell/McFadden)◦ Only comparable within nested models of the same outcome.

Hosmer-Lemeshow◦ Models with Continuous Predictors ◦ Is the model a better fit than the NULL model. X2

◦ H0: Good Fit for Data, so we want p>0.05◦ Order the predicted probabilities, group them (g=10) by quantiles, Chi-Square of

Group * Outcome using. Df=g-2

◦ Conservative (rarely rejects the null) Pearson Chi-Square

◦ Models with categorical predictors◦ Similar to Hosmer-Lemeshow

ROC-Area Under the Curve◦ Predictive accuracy/Classification

Code 2.4
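The Hosmer-Lemeshow steps above can be sketched directly: order cases by predicted probability, split them into g groups, then compare observed with expected event counts via a chi-square statistic on g - 2 df. The data below are synthetic and purely illustrative:

```python
import random

# Hosmer-Lemeshow sketch on synthetic (predicted probability, outcome) pairs.
random.seed(1)
cases = []
for _ in range(200):
    p = random.uniform(0.05, 0.95)       # a predicted probability
    y = 1 if random.random() < p else 0  # a simulated outcome
    cases.append((p, y))

g = 10
cases.sort()                             # order by predicted probability
size = len(cases) // g
hl = 0.0
for k in range(g):
    group = cases[k * size:(k + 1) * size]
    observed = sum(y for _, y in group)  # observed events in the group
    expected = sum(p for p, _ in group)  # expected events in the group
    n_k = len(group)
    hl += (observed - expected) ** 2 / (expected * (1 - expected / n_k))

df = g - 2
print(round(hl, 2), df)
```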

Page 19: Analysis of Categorical Data

Logistic Regression: GOF (Diagnostic Measures)

Outliers in Y (outcome)
◦ Pearson residuals: square root of the contribution to the Pearson χ²
◦ Deviance residuals: square root of the contribution to the likelihood-ratio test statistic of a saturated model vs. the fitted model

Outliers in X (predictors)
◦ Leverage (hat matrix/projection matrix): maps the influence of observed on fitted values

Influential observations
◦ Pregibon's delta-beta influence statistic
◦ Similar to Cook's D in linear regression

Detecting problems
◦ Residuals vs. predictors
◦ Leverage vs. residuals
◦ Boxplot of delta-beta

Code 2.5

Page 20: Analysis of Categorical Data

Logistic Regression: GOF

Y=Depressed      Coef    SE     Z      P       CI
α (_constant)   -2.24   0.489  -4.58  <0.001  -3.20, -1.28
β (age)          0.013  0.009   1.52   0.127  -0.004, 0.030

log( π(depressed) / (1 − π(depressed)) ) = α + β1(age)

L-R χ² (df=1): 2.47, p=0.1162

McFadden's R²: 0.0030

H-L GOF:
Number of groups: 10
H-L χ²: 7.12
DF: 8
P: 0.5233

Page 21: Analysis of Categorical Data

Logistic Regression: Diagnostics

Linearity in the log-odds
◦ Use a lowess (loess) plot
◦ Depressed vs. Age

[Figure: lowess smoother and logit-transformed smooth of Depressed (logit) against age, ages 20-80, bandwidth = .8]

Code 2.6

Page 22: Analysis of Categorical Data

Logistic Regression: Example

Logistic regression with a single categorical predictor:

Y=Depressed      OR      SE     Z      P       CI
α (_constant)   0.545   0.091  -3.63  <0.001  0.392, 0.756
β (male)        0.299   0.060  -5.99  <0.001  0.202, 0.444

AS OR:
Interpretation: the odds of depression for males are 0.299 times the odds for females.

We could also say: the odds of depression are (1 - 0.299 = 0.701) 70.1% lower in males compared to females.

Or why not just make males the reference group so the OR is greater than 1? Or we could take the inverse and accomplish the same thing: 1/0.299 = 3.34.

Code 2.7

Page 23: Analysis of Categorical Data

Ordinal Logistic Regression

Also called ordered logistic or the proportional odds model
An extension of the binary logistic model
>2 ordered responses
New assumption!
◦ Proportional odds: the predictor's effect on the outcome is the same across levels of the outcome

BMI3GRP (1=Normal Weight, 2=Overweight, 3=Obese)
Bmi3grp (1 vs 2,3) = β(age)
Bmi3grp (1,2 vs 3) = β(age)

Page 24: Analysis of Categorical Data

Ordinal Logistic Regression

The Model
◦ A latent variable model (Y*)
◦ j = number of levels − 1
◦ From the equation we can see that the odds ratio is assumed to be independent of the category j
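The equation itself was an image lost in transcription; the cumulative-logit form it describes (sign conventions vary by software; Stata subtracts xβ from the cutpoints) is:

```latex
% Cumulative logit for category j, with j = 1, ..., J-1 thresholds:
\operatorname{logit}\, P(Y \le j \mid x) = \alpha_j + \beta x
% \beta carries no j subscript, so the odds ratio e^{\beta} is the same
% at every cutpoint: the proportional odds assumption.
```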

Page 25: Analysis of Categorical Data

Ordinal Logistic Regression Example

AS LOGITS:
Y=bmi3grp           Coef     SE      Z      P       CI
β1 (age)           -0.026   0.006   -4.15  <0.001  -0.038, -0.014
β2 (blood_press)    0.012   0.005    2.48   0.013   0.002, 0.021
Threshold1/cut1    -0.696   0.6678                 -2.004, 0.613
Threshold2/cut2     0.773   0.6680                 -0.536, 2.082

For a 1 unit increase in blood pressure there is a 0.012 increase in the log-odds of being in a higher BMI category.

AS OR:
Y=bmi3grp           OR      SE      Z      P       CI
β1 (age)           0.974   0.006   -4.15  <0.001  0.962, 0.986
β2 (blood_press)   1.012   0.005    2.48   0.013   1.002, 1.022
Threshold1/cut1   -0.696   0.6678                 -2.004, 0.613
Threshold2/cut2    0.773   0.6680                 -0.536, 2.082

For a 1 unit increase in blood pressure the odds of being in a higher BMI category are 1.012 times greater.

Code 3.1
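The fitted thresholds and slopes can be turned into predicted category probabilities. A sketch assuming Stata-style ologit parameterization, P(Y ≤ j) = invlogit(cut_j − xβ), with a hypothetical covariate pattern:

```python
import math

# Predicted BMI-category probabilities from the fitted ordinal model.
# The age and blood-pressure values chosen here are hypothetical.
def invlogit(z):
    return 1 / (1 + math.exp(-z))

b_age, b_bp = -0.026, 0.012
cut1, cut2 = -0.696, 0.773

age, bp = 50.0, 120.0        # hypothetical person
xb = b_age * age + b_bp * bp

p_le1 = invlogit(cut1 - xb)  # P(normal weight)
p_le2 = invlogit(cut2 - xb)  # P(normal or overweight)
probs = (p_le1, p_le2 - p_le1, 1 - p_le2)  # the three category probabilities
print([round(p, 3) for p in probs])
```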

Page 26: Analysis of Categorical Data

Ordinal Logistic Regression: GOF

Assessing the proportional odds assumption
◦ Brant Test of Parallel Regression
  H0: proportional odds, thus want p > 0.05
  Tests each predictor separately and overall
◦ Score Test of Parallel Regression
  H0: proportional odds, thus want p > 0.05
◦ Approximate Likelihood-Ratio Test
  H0: proportional odds, thus want p > 0.05

Code 3.2

Page 27: Analysis of Categorical Data

Ordinal Logistic Regression: GOF

Pseudo-R²

Diagnostic measures
◦ Performed on the j−1 binomial logistic regressions

Code 3.3

Page 28: Analysis of Categorical Data

Multinomial Logistic Regression

Also called multinomial logit or polytomous logistic regression
Same assumptions as the binary logistic model
>2 non-ordered responses
◦ Or you've failed to meet the proportional odds assumption of the ordinal logistic model

Page 29: Analysis of Categorical Data

Multinomial Logistic Regression

The Model
◦ j = levels of the outcome
◦ J = reference level
◦ where x is a fixed setting of an explanatory variable
◦ Notice how it appears we are estimating a relative risk and not an odds ratio. It's actually an OR.
◦ Similar to conducting separate binary logistic models, but with better Type I error control
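The baseline-category logit equations (also lost as images in transcription) take the form:

```latex
% Baseline-category logits with reference level J:
\log\frac{\pi_j(x)}{\pi_J(x)} = \alpha_j + \beta_j x, \qquad j = 1, \dots, J-1
```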

Page 30: Analysis of Categorical Data

Multinomial Logistic Regression Example

Does degree of supernatural belief indicate a religious preference?

Y=religion (ref=Catholic(1))    OR      SE     Z      P       CI
Protestant (2)
  β (supernatural)             1.126   0.090   1.47   0.141   0.961, 1.317
  α (_constant)                1.219   0.097   2.49   0.013   1.043, 1.425
Evangelical (3)
  β (supernatural)             1.218   0.117   2.06   0.039   1.010, 1.469
  α (_constant)                0.619   0.059  -5.02  <0.001   0.512, 0.746

AS OR:
For a 1 unit increase in supernatural belief, there is a (OR − 1 = % change) 21.8% increase in the odds of being Evangelical rather than Catholic.

Code 4.1
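The fitted ORs can be converted back to coefficients and used to predict category probabilities via the softmax over baseline-category logits. A sketch with a hypothetical supernatural-belief score:

```python
import math

# Predicted religion probabilities from the fitted multinomial model
# (reference = Catholic). Coefficients are recovered by logging the
# reported ORs; the belief score x is hypothetical.
a2, b2 = math.log(1.219), math.log(1.126)  # Protestant vs Catholic
a3, b3 = math.log(0.619), math.log(1.218)  # Evangelical vs Catholic

x = 2.0                        # hypothetical belief score
num = [1.0,                    # Catholic (reference): exp(0)
       math.exp(a2 + b2 * x),  # Protestant
       math.exp(a3 + b3 * x)]  # Evangelical
total = sum(num)
probs = [v / total for v in num]  # softmax over the three categories
print([round(p, 3) for p in probs])
```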

Page 31: Analysis of Categorical Data

Multinomial Logistic Regression GOF

Limited GOF tests
◦ Look at the LR chi-square and compare nested models
◦ "Essentially, all models are wrong, but some are useful" –George E.P. Box

Pseudo-R²

Similar to ordinal
◦ Perform tests on the j−1 binomial logistic regressions

Page 32: Analysis of Categorical Data

32

Resources“Categorical Data Analysis” by Alan Agresti

UCLA Stat Computing:http://www.ats.ucla.edu/stat/