categorical data analysis & logistic regression


Upload: judith-holt

Post on 02-Jan-2016


Page 1: Categorical Data Analysis &  Logistic Regression

BIBS

SEOUL NATIONAL UNIVERSITY Bioinformatics & Biostatistics Lab.

Categorical Data Analysis & Logistic Regression

Department of Statistics and Information Science, University of Suwon

Jin-Heum Kim

Marketing Lab Partners Co., Ltd.

Senior Researcher

Eun-Kyung Lee

Page 2: Categorical Data Analysis &  Logistic Regression

SNU BIBS

BIOSTATISTICS FOR BIOINFORMATICS

Outline

Two-way contingency tables: RR, Odds ratio, Chi-square tests

Three-way contingency tables: Conditional independence, Homogeneous association, Common odds ratio

Logistic regression: Dichotomous response

Logistic regression: Polytomous response

Page 3: Categorical Data Analysis &  Logistic Regression

First example: Aspirin & heart attacks

Clinical trial: $2 \times 2$ table of aspirin use and MI

Test whether regular intake of aspirin reduces mortality from cardiovascular disease

Prospective sampling design: Cohort studies, Clinical trials

Data set

             Myocardial Infarction
Group        Yes        No       Total
Placebo      189    10,845      11,034
Aspirin      104    10,933      11,037

Page 4: Categorical Data Analysis &  Logistic Regression

Second example: Smoking & heart attacks

Case-control study: $2 \times 2$ table of smoking status and MI

Compare ever-smokers with nonsmokers in terms of the proportion who suffered MI

Retrospective sampling design: Case-control study, Cross-sectional design

Remark: Observational studies vs. experimental study

Data set

Ever-Smoker    Myocardial Infarction    Controls
Yes                              172         173
No                                90         346
Total                            262         519

Page 5: Categorical Data Analysis &  Logistic Regression

Comparing proportions in $2 \times 2$ table

Difference: $\pi_1 - \pi_2$

Relative risk: $\pi_1 / \pi_2$

Useful when both proportions are close to 0 or 1

$p_1 = 0.10,\ p_2 = 0.01 \Rightarrow p_1 - p_2 = 0.09,\ p_1 / p_2 = 10$: RR is more informative

$p_1 = 0.410,\ p_2 = 0.401 \Rightarrow p_1 - p_2 = 0.009,\ p_1 / p_2 = 1.02$

$\pi_1 / \pi_2 = 1$: Response is independent of group

Page 6: Categorical Data Analysis &  Logistic Regression

Example (revisited)

1st example: $p_1 - p_2 = 0.0171 - 0.0094 = 0.0077$, 95% CI = (0.005, 0.011)

Taking aspirin diminishes heart attack

$p_1 / p_2 = 1.82$, 95% CI = (1.43, 2.3)

Risk of MI is at least 43% higher for the placebo group

2nd example: $p_1 - p_2$, $p_1 / p_2$: Not estimable, meaningless even though possible to compute

Estimate proportions in the reverse direction

Proportion of smoking given MI status: $172/262 = 0.656$ (suffered MI), $173/519 = 0.333$ (did not suffer MI)
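The interval estimates quoted for the aspirin trial can be reproduced with a short sketch. It assumes the usual large-sample Wald interval for the difference of proportions and a log-scale interval for the relative risk; the slide does not state which method was used, and the function names are illustrative.

```python
import math

def risk_difference_ci(x1, n1, x2, n2, z=1.96):
    """Wald interval for p1 - p2 from two independent binomial samples."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff, diff - z * se, diff + z * se

def relative_risk_ci(x1, n1, x2, n2, z=1.96):
    """Interval for p1 / p2, built on the log scale and then exponentiated."""
    p1, p2 = x1 / n1, x2 / n2
    rr = p1 / p2
    se_log = math.sqrt((1 - p1) / x1 + (1 - p2) / x2)
    return rr, math.exp(math.log(rr) - z * se_log), math.exp(math.log(rr) + z * se_log)

# Aspirin trial: placebo is group 1, aspirin is group 2.
diff, diff_lo, diff_hi = risk_difference_ci(189, 11034, 104, 11037)
rr, rr_lo, rr_hi = relative_risk_ci(189, 11034, 104, 11037)

print(f"difference = {diff:.4f}, 95% CI = ({diff_lo:.3f}, {diff_hi:.3f})")
print(f"relative risk = {rr:.2f}, 95% CI = ({rr_lo:.2f}, {rr_hi:.2f})")
```

The same functions should not be applied to the smoking data: in a case-control design the MI proportions within each smoking group are fixed by the sampling scheme, which is why the slide calls them not estimable.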

Page 7: Categorical Data Analysis &  Logistic Regression

Association measure: odds ratio

Def'n: $\theta = \dfrac{\pi_1 / (1 - \pi_1)}{\pi_2 / (1 - \pi_2)}$

Meaning

$\theta = 1$: When two variables are independent, i.e., $\pi_1 = \pi_2$

$1 < \theta < \infty$: When odds of success (in row 1) > (in row 2)

$0 < \theta < 1$: When odds of success (in row 1) < (in row 2)

Remark: When both variables are response, $\theta = \dfrac{\pi_{11} \pi_{22}}{\pi_{12} \pi_{21}}$ (called cross-product ratio) using joint probabilities

Page 8: Categorical Data Analysis &  Logistic Regression

Properties of odds ratio

Values of $\theta$ farther from 1 in a given direction represent stronger association

When one value is the inverse of the other, the two values of $\theta$ represent the same strength of association, but in opposite directions, e.g., $\theta_1 = 4$ and $\theta_2 = 0.25 = 1/4$

Not changed when the table orientation reverses: Unnecessary to identify one classification as a response variable

Page 9: Categorical Data Analysis &  Logistic Regression

Example (revisited)

1st example: $\hat\theta = \dfrac{p_1 / (1 - p_1)}{p_2 / (1 - p_2)} = \dfrac{n_{11} n_{22}}{n_{12} n_{21}} = 1.832$, 95% CI = (1.44, 2.33)

Estimated odds is 83% higher for the placebo group

2nd example: $\hat\theta = 3.8$

Rough estimate of RR = 3.8 ($\hat\theta \approx \mathrm{RR}$)

Women who had ever smoked were about four times as likely to suffer MI as women who had never smoked
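The two sample odds ratios can be checked with a minimal sketch. It assumes the standard large-sample interval on the log scale, with standard error $\sqrt{1/n_{11} + 1/n_{12} + 1/n_{21} + 1/n_{22}}$; the helper name is illustrative.

```python
import math

def odds_ratio_ci(n11, n12, n21, n22, z=1.96):
    """Sample odds ratio (cross-product ratio) with a Wald interval on the log scale."""
    theta = (n11 * n22) / (n12 * n21)
    se_log = math.sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)
    lower = math.exp(math.log(theta) - z * se_log)
    upper = math.exp(math.log(theta) + z * se_log)
    return theta, lower, upper

# Aspirin trial: rows = placebo / aspirin, columns = MI yes / no.
aspirin = odds_ratio_ci(189, 10845, 104, 10933)
# Smoking study: rows = ever-smoker yes / no, columns = MI cases / controls.
smoking = odds_ratio_ci(172, 173, 90, 346)

print("aspirin: theta = %.3f, 95%% CI = (%.2f, %.2f)" % aspirin)
print("smoking: theta = %.2f, 95%% CI = (%.2f, %.2f)" % smoking)
```

Because the odds ratio is unchanged when rows and columns are swapped, it remains meaningful for the case-control smoking data even though the relative risk is not directly estimable there.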

Page 10: Categorical Data Analysis &  Logistic Regression

Independence tests

Hypothesis: $H_0: \pi_{ij} = \pi_{i+} \pi_{+j}$ for all $i, j$

Two chi-square tests

Under $H_0$, estimated expected frequency $\hat\mu_{ij} = \dfrac{n_{i+} n_{+j}}{n}$

Pearson's $X^2 = \sum \dfrac{(n_{ij} - \hat\mu_{ij})^2}{\hat\mu_{ij}}$

Likelihood ratio (LR) statistic $G^2 = 2 \sum n_{ij} \log\left(\dfrac{n_{ij}}{\hat\mu_{ij}}\right)$

For a large sample, $X^2, G^2$ follow a chi-squared null distribution with $df = (I - 1)(J - 1)$

Remark: When $\hat\mu_{ij} \ge 5$, the chi-squared approximation is good. If not, apply Fisher's exact test
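As an illustration, both statistics can be computed for the aspirin table from the first example. This is a sketch using only the standard library; the p-value helper covers the $df = 1$ case only, using the identity $P(\chi^2_1 > x) = \mathrm{erfc}(\sqrt{x/2})$.

```python
import math

def independence_tests(table):
    """Return Pearson X^2, likelihood-ratio G^2 and df for an I x J table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    x2 = g2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            x2 += (observed - expected) ** 2 / expected
            if observed > 0:  # a zero cell contributes nothing to G^2
                g2 += 2 * observed * math.log(observed / expected)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return x2, g2, df

def chi2_pvalue_df1(x):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))

aspirin_table = [[189, 10845], [104, 10933]]
x2, g2, df = independence_tests(aspirin_table)
print(f"X^2 = {x2:.2f}, G^2 = {g2:.2f}, df = {df}, p = {chi2_pvalue_df1(x2):.1e}")
```

All four expected counts exceed 5 here, so the chi-squared approximation is reasonable for this table.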

Page 11: Categorical Data Analysis &  Logistic Regression

Example: AZT use & AIDS

Development of AIDS symptoms in AZT use and race

Study on the effects of AZT in slowing the development of AIDS symptoms

Data set

                     Symptoms
Race     AZT Use     Yes     No     Total
White    Yes          14     93       107
         No           32     81       113
Black    Yes          11     52        63
         No           12     43        55

Page 12: Categorical Data Analysis &  Logistic Regression

Three interests in $2 \times 2 \times K$ table

Conditional independence? When controlling for race, are AZT treatment and development of AIDS symptoms independent?

Use Cochran-Mantel-Haenszel (CMH) test: Summarize the information from $K$ partial tables

Homogeneous association? Are the odds ratios of AZT treatment and development of AIDS symptoms common for each race?

Use Breslow-Day test

Common odds ratio? Use Mantel-Haenszel estimate

Page 13: Categorical Data Analysis &  Logistic Regression

Example (AZT use & AIDS revisited)

CMH = 6.8 ($df$ = 1) with $p$-value = 0.0091: Not independent!

Breslow-Day = 1.39 ($df$ = 1) with $p$-value = 0.2384: Homogeneous association!

Common odds ratio = 0.49: For each race, estimated odds of developing symptoms are half as high for those who took AZT
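The CMH statistic and the Mantel-Haenszel common odds ratio for the AZT data can be sketched as follows. The sketch assumes the CMH statistic without a continuity correction, which is what reproduces the reported value of about 6.8; the Breslow-Day test is not implemented here, and the function names are illustrative.

```python
import math

# One 2x2 partial table per race: (n11, n12, n21, n22) with
# rows = AZT use yes / no and columns = symptoms yes / no.
partial_tables = [
    (14, 93, 32, 81),  # White
    (11, 52, 12, 43),  # Black
]

def cmh_statistic(tables):
    """Cochran-Mantel-Haenszel statistic (no continuity correction), df = 1."""
    deviation = variance = 0.0
    for n11, n12, n21, n22 in tables:
        n = n11 + n12 + n21 + n22
        row1, row2 = n11 + n12, n21 + n22
        col1, col2 = n11 + n21, n12 + n22
        deviation += n11 - row1 * col1 / n  # observed minus expected under independence
        variance += row1 * row2 * col1 * col2 / (n * n * (n - 1))
    return deviation ** 2 / variance

def mantel_haenszel_odds_ratio(tables):
    """Mantel-Haenszel estimate of the common odds ratio across partial tables."""
    numerator = sum(n11 * n22 / (n11 + n12 + n21 + n22) for n11, n12, n21, n22 in tables)
    denominator = sum(n12 * n21 / (n11 + n12 + n21 + n22) for n11, n12, n21, n22 in tables)
    return numerator / denominator

cmh = cmh_statistic(partial_tables)
p_value = math.erfc(math.sqrt(cmh / 2))  # chi-squared upper tail, df = 1
common_or = mantel_haenszel_odds_ratio(partial_tables)
print(f"CMH = {cmh:.2f}, p = {p_value:.4f}, common odds ratio = {common_or:.2f}")
```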

Page 14: Categorical Data Analysis &  Logistic Regression

Overview of types of generalized linear models (GLMs)

Three components: Random component (response variable), Linear predictor (linear combination of covariates), Link function

Types of GLMs

Random Component    Link                 Systematic Component    Model
Normal              Identity             Continuous              Regression
Normal              Identity             Categorical             Analysis of variance
Normal              Identity             Mixed                   Analysis of covariance
Binomial            Logit                Mixed                   Logistic regression
Poisson             Log                  Mixed                   Loglinear
Multinomial         Generalized logit    Mixed                   Multinomial response

Page 15: Categorical Data Analysis &  Logistic Regression

Logistic regression with a quantitative covariate

Model: $\mathrm{logit}[\pi(x)] = \log\dfrac{\pi(x)}{1 - \pi(x)} = \alpha + \beta x$

Other representations

Odds = $\dfrac{\pi(x)}{1 - \pi(x)} = e^{\alpha + \beta x}$

Odds at level $x + 1$ equals the odds at $x$ multiplied by $e^{\beta}$

$\pi(x) = \dfrac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$

Curve ascends ($\beta > 0$) or descends ($\beta < 0$)

The rate of change increases as $|\beta|$ increases
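A small numerical check of these representations, with made-up coefficients (the values of `alpha` and `beta` below are hypothetical and chosen only for illustration):

```python
import math

def pi(x, alpha, beta):
    """Success probability under logit[pi(x)] = alpha + beta * x."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta * x)))

def odds(x, alpha, beta):
    """Odds of success, pi(x) / (1 - pi(x))."""
    p = pi(x, alpha, beta)
    return p / (1.0 - p)

alpha, beta = -3.0, 0.5  # hypothetical parameter values

# A one-unit increase in x multiplies the odds by exp(beta), whatever x is.
ratio_at_3 = odds(4.0, alpha, beta) / odds(3.0, alpha, beta)
ratio_at_9 = odds(10.0, alpha, beta) / odds(9.0, alpha, beta)

# The slope of the curve at x is beta * pi(x) * (1 - pi(x)); it is steepest
# where pi(x) = 0.5, i.e. at x = -alpha / beta, where it equals beta / 4.
x_half = -alpha / beta
steepest_slope = beta * pi(x_half, alpha, beta) * (1 - pi(x_half, alpha, beta))

print(ratio_at_3, ratio_at_9, math.exp(beta), steepest_slope)
```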

Page 16: Categorical Data Analysis &  Logistic Regression

Example: Horseshoe crabs

Binary response: $Y = 1$ if a female crab has at least one satellite; $Y = 0$ otherwise

Covariate: female crab's width

Data set

Width            Number Cases    Number Having Satellites
< 23.25                    14                           5
23.25 - 24.25              14                           4
24.25 - 25.25              28                          17
25.25 - 26.25              39                          21
26.25 - 27.25              22                          15
27.25 - 28.25              24                          20
28.25 - 29.25              18                          15
> 29.25                    14                          14
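To show how a logistic regression can be fitted to grouped counts like these, here is a minimal Newton-Raphson sketch. It rests on a loud assumption: each width interval is represented by a single value (interval midpoints, with the two open-ended intervals extended by one centimetre), because the slides do not give the individual widths. The resulting estimates therefore only approximate a fit to the ungrouped data, and the function name is illustrative.

```python
import math

cases = [14, 14, 28, 39, 22, 24, 18, 14]
satellites = [5, 4, 17, 21, 15, 20, 15, 14]
# ASSUMPTION: representative width per interval (midpoints; not from the slides).
widths = [22.75, 23.75, 24.75, 25.75, 26.75, 27.75, 28.75, 29.75]

def fit_grouped_logistic(x, y, n, iterations=25):
    """Newton-Raphson ML fit of logit(pi) = alpha + beta * x to grouped binomial counts."""
    center = sum(ni * xi for ni, xi in zip(n, x)) / sum(n)  # centring aids stability
    xc = [xi - center for xi in x]
    a = b = 0.0
    for _ in range(iterations):
        p = [1.0 / (1.0 + math.exp(-(a + b * xi))) for xi in xc]
        # Score vector (first derivatives of the log-likelihood).
        u_a = sum(yi - ni * pi for yi, ni, pi in zip(y, n, p))
        u_b = sum((yi - ni * pi) * xi for yi, ni, pi, xi in zip(y, n, p, xc))
        # Information matrix (negative second derivatives).
        w = [ni * pi * (1.0 - pi) for ni, pi in zip(n, p)]
        i_aa = sum(w)
        i_ab = sum(wi * xi for wi, xi in zip(w, xc))
        i_bb = sum(wi * xi * xi for wi, xi in zip(w, xc))
        det = i_aa * i_bb - i_ab ** 2
        # Newton step: parameters += inverse(information) * score.
        a += (i_bb * u_a - i_ab * u_b) / det
        b += (i_aa * u_b - i_ab * u_a) / det
    return a - b * center, b  # intercept converted back to the original width scale

alpha_hat, beta_hat = fit_grouped_logistic(widths, satellites, cases)
print(f"logit[pi(x)] = {alpha_hat:.3f} + {beta_hat:.3f} x")
```

In practice a statistical package would be used for this; the loop is written out only to make the estimation step visible.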

Page 17: Categorical Data Analysis &  Logistic Regression


Example: Horseshoe crabs

Page 18: Categorical Data Analysis &  Logistic Regression

Goodness-of-fit tests

Working model: $M$, number of settings: $s$, number of parameters in $M$: $p$

Hypothesis: $H_0$: $M$ fits the data

Pearson's statistic: $X^2(M) = \sum (\text{observed} - \text{fitted})^2 / \text{fitted}$

Deviance statistic: $G^2(M) = 2 \sum \text{observed} \cdot \log(\text{observed} / \text{fitted})$

$X^2, G^2$ approximately follow a chi-square null distribution with $df = s - p$

Page 19: Categorical Data Analysis &  Logistic Regression

Inference for parameters

Interval estimation: $\hat\beta \pm 1.96 \, \mathrm{SE}(\hat\beta)$

Two significance tests for $H_0: \beta = 0$

Wald test: Use $z = \hat\beta / \mathrm{SE}(\hat\beta)$ (or $z^2$)

Likelihood ratio test: Use $-2(L_0 - L_1)$, $L$: log-likelihood function

Two tests have a large-sample chi-squared null distribution with $df = 1$

Page 20: Categorical Data Analysis &  Logistic Regression

Example (Horseshoe crabs revisited)

Fitted model: $\mathrm{logit}[\hat\pi(x)] = -12.3508 + 0.4972x$

$\hat\beta > 0$: $\hat\pi(x)$ is larger at larger width

There is a 64% increase in estimated odds of a satellite for each centimeter increase in width ($e^{\hat\beta} = 1.64$)

Goodness of fit: $X^2 = 5.3$ ($df$ = 6) with $p$-value = 0.506; $G^2 = 6.2$ ($df$ = 6) with $p$-value = 0.4012

95% CI for $\beta$ = (0.298, 0.697)

Significance test: Wald = 23.9 ($df$ = 1) with $p$-value < 0.0001; LRT = 31.3 ($df$ = 1) with $p$-value < 0.0001
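The quantities on this slide follow from the fitted coefficients by simple arithmetic, as the sketch below shows. The standard error is not printed on the slide; here it is backed out of the reported 95% interval under the assumption that the interval is a Wald interval, and the widths of 26 and 27 cm are arbitrary evaluation points.

```python
import math

alpha_hat, beta_hat = -12.3508, 0.4972  # fitted model reported on the slide
ci_beta = (0.298, 0.697)                # reported 95% CI for beta

# Multiplicative effect on the odds per 1 cm of width, with its interval.
odds_multiplier = math.exp(beta_hat)
ci_odds = tuple(math.exp(limit) for limit in ci_beta)

# ASSUMPTION: the CI is beta_hat +/- 1.96 * SE, so SE = half-width / 1.96.
se_beta = (ci_beta[1] - ci_beta[0]) / (2 * 1.96)
wald_chi_square = (beta_hat / se_beta) ** 2

def pi_hat(width):
    """Estimated probability that a crab of the given width has a satellite."""
    return 1.0 / (1.0 + math.exp(-(alpha_hat + beta_hat * width)))

print(f"exp(beta) = {odds_multiplier:.2f}, CI = ({ci_odds[0]:.2f}, {ci_odds[1]:.2f})")
print(f"Wald chi-square = {wald_chi_square:.1f}")
print(f"pi_hat(26) = {pi_hat(26):.3f}, pi_hat(27) = {pi_hat(27):.3f}")
```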

Page 21: Categorical Data Analysis &  Logistic Regression

Logistic regression with qualitative predictors: AIDS symptoms data

Use indicator variables for representing categories of predictors ($x$: AZT use, 1 = yes, 0 = no; $z$: race, 1 = white, 0 = black)

$\mathrm{logit}[\pi(x, z)] = \alpha + \beta_1 x + \beta_2 z$

Logits implied by indicator variables

x    z    Logit
0    0    $\alpha$
1    0    $\alpha + \beta_1$
0    1    $\alpha + \beta_2$
1    1    $\alpha + \beta_1 + \beta_2$

Page 22: Categorical Data Analysis &  Logistic Regression

Logistic regression with qualitative predictors: AIDS symptoms data

$\beta_1$ = difference between two logits (i.e., log of odds ratio) at a fixed category of $z$

$e^{\beta_1} = \dfrac{\text{odds of success at } x = 1}{\text{odds of success at } x = 0}$

Homogeneous association model

Page 23: Categorical Data Analysis &  Logistic Regression

Equivalence of $2 \times 2 \times K$ contingency table & logistic regression

Conditional independence: CMH test vs. $H_0: \beta_1 = 0$

Homogeneous association: Breslow-Day test vs. Goodness-of-fit test

Common odds ratio estimate: Mantel-Haenszel estimate vs. $e^{\hat\beta_1}$

Page 24: Categorical Data Analysis &  Logistic Regression

Computer Output for Model with AIDS Symptoms Data

Log Likelihood    -167.5756

Analysis of Maximum Likelihood Estimates

Parameter    Estimate    Std Error    Wald Chi-Square    Pr > ChiSq
Intercept     -1.0736       0.2629            16.6705        <.0001
azt           -0.7195       0.2790             6.6507        0.0099
race           0.0555       0.2886             0.0370        0.8476

LR Statistics

Source    DF    Chi-Square    Pr > ChiSq
azt        1          6.87        0.0088
race       1          0.04        0.8473

Obs    race    azt     y      n     pi_hat      lower      upper
1         1      1    14    107    0.14962    0.09897    0.21987
2         1      0    32    113    0.26540    0.19668    0.34774
3         0      1    11     63    0.14270    0.08704    0.22519
4         0      0    12     55    0.25472    0.16953    0.36396
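The `pi_hat` column and the Wald statistic for `azt` can be recomputed from the printed estimates, which also illustrates the link to the Mantel-Haenszel common odds ratio of 0.49. This is only a sketch that re-derives numbers from the output above; it does not refit the model.

```python
import math

# Estimates and standard error copied from the output above.
intercept, b_azt, b_race = -1.0736, -0.7195, 0.0555
se_azt = 0.2790

def pi_hat(azt, race):
    """Estimated probability of symptoms for indicator values azt and race."""
    eta = intercept + b_azt * azt + b_race * race
    return 1.0 / (1.0 + math.exp(-eta))

# Keys are (race, azt) in the same order as the Obs rows 1 to 4.
fitted = {(race, azt): pi_hat(azt, race) for race in (1, 0) for azt in (1, 0)}

# exp(beta_1) is the model-based common odds ratio for AZT use within each race.
odds_ratio_azt = math.exp(b_azt)
wald_azt = (b_azt / se_azt) ** 2

for (race, azt), p in fitted.items():
    print(f"race={race} azt={azt} pi_hat={p:.5f}")
print(f"exp(beta_azt) = {odds_ratio_azt:.3f}, Wald chi-square = {wald_azt:.2f}")
```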

Page 25: Categorical Data Analysis &  Logistic Regression

Logistic regression with mixed predictors: Horseshoe crabs data

$\mathrm{logit}[\pi(x)] = \alpha + \beta_1 c_1 + \beta_2 c_2 + \beta_3 c_3 + \beta_4 x$

For color = medium light, $(c_1, c_2, c_3) = (1, 0, 0)$

For color = medium, $(c_1, c_2, c_3) = (0, 1, 0)$

For color = medium dark, $(c_1, c_2, c_3) = (0, 0, 1)$

Controlling for $x$,

$e^{\beta_1} = \dfrac{\text{odds of success of a medium-light crab}}{\text{odds of success of a dark crab}}$,

$e^{\beta_2} = \dfrac{\text{odds of success of a medium crab}}{\text{odds of success of a dark crab}}$,

$e^{\beta_3} = \dfrac{\text{odds of success of a medium-dark crab}}{\text{odds of success of a dark crab}}$

Page 26: Categorical Data Analysis &  Logistic Regression

Computer Output for Model for Horseshoe Crabs Data

                                       Likelihood Ratio 95%
Parameter    Estimate    Std. Error    Confidence Limits         Chi-Square    Pr > ChiSq
intercept    -12.7151        2.7618    -18.4564    -7.5788            21.20        <.0001
c1             1.3299        0.8525     -0.2738     3.1354             2.43        0.1188
c2             1.4023        0.5484      0.3527     2.5260             6.54        0.0106
c3             1.1061        0.5921     -0.0279     2.3138             3.49        0.0617
width          0.4680        0.1055      0.2713     0.6870            19.66        <.0001

LR Statistics

Source    DF    Chi-Square    Pr > ChiSq
width      1         26.40        <.0001
color      3          7.00        0.0720
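To read this output, it can help to turn the estimates into odds ratios against the dark (baseline) color and into fitted probabilities. In the sketch below the coding of dark as $(c_1, c_2, c_3) = (0, 0, 0)$ is inferred from the odds-ratio definitions on the previous slide, and the width of 26 cm is an arbitrary illustration value.

```python
import math

# Estimates copied from the output above.
coef = {"intercept": -12.7151, "c1": 1.3299, "c2": 1.4023, "c3": 1.1061, "width": 0.4680}

# Indicator coding; dark is taken to be the baseline (assumed from the slide text).
color_dummies = {
    "medium light": (1, 0, 0),
    "medium": (0, 1, 0),
    "medium dark": (0, 0, 1),
    "dark": (0, 0, 0),
}

def pi_hat(color, width):
    """Estimated probability of at least one satellite for a given color and width."""
    c1, c2, c3 = color_dummies[color]
    eta = (coef["intercept"] + coef["c1"] * c1 + coef["c2"] * c2
           + coef["c3"] * c3 + coef["width"] * width)
    return 1.0 / (1.0 + math.exp(-eta))

# Odds of a satellite relative to a dark crab of the same width.
odds_vs_dark = {
    "medium light": math.exp(coef["c1"]),
    "medium": math.exp(coef["c2"]),
    "medium dark": math.exp(coef["c3"]),
}

for color, ratio in odds_vs_dark.items():
    print(f"{color}: odds ratio vs dark = {ratio:.2f}")
for color in color_dummies:
    print(f"{color}: pi_hat at width 26 = {pi_hat(color, 26.0):.3f}")
```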

Page 27: Categorical Data Analysis &  Logistic Regression


Estimated probabilities for primary food choice

Page 28: Categorical Data Analysis &  Logistic Regression

Logistic regression: polytomous

Model categorical responses with more than two categories

Two ways

Use generalized logits function for nominal response

Use cumulative logits function for ordinal response

Notation

$J$: number of categories

$\pi_1, \ldots, \pi_J$: response probabilities with $\sum_{j=1}^{J} \pi_j = 1$

Page 29: Categorical Data Analysis &  Logistic Regression

Generalized logit model: nominal response

Baseline-category logit: Pair each category with a baseline category

$\mathrm{logit}_{jJ} = \log\dfrac{\pi_j}{\pi_J}, \quad j = 1, \ldots, J - 1$, when $J$ is the baseline

Model with a predictor $x$: $\mathrm{logit}_{jJ} = \alpha_j + \beta_j x, \quad j = 1, \ldots, J - 1$

The effects vary according to the category paired with the baseline

These $J - 1$ pairs of categories determine equations for all other pairs of categories

E.g., for a pair of categories $a, b$: $\log\dfrac{\pi_a}{\pi_b} = \log\dfrac{\pi_a / \pi_J}{\pi_b / \pi_J} = (\alpha_a - \alpha_b) + (\beta_a - \beta_b) x$

Remark: Parameter estimates for a given pair of categories are the same no matter which category is the baseline

Page 30: Categorical Data Analysis &  Logistic Regression

Example: Alligator food choice

59 alligators sampled in Lake George, Florida

Response: Primary food type found in alligator's stomach

Fish (1), Invertebrate (2), Other (3, baseline category)

Predictor: alligator length $x$, which varies from 1.24 to 3.89 (m)

ML prediction equations

$\log(\hat\pi_1 / \hat\pi_3) = 1.618 - 0.110x$; $\log(\hat\pi_2 / \hat\pi_3) = 5.697 - 2.465x$

$\log(\hat\pi_1 / \hat\pi_2) = -4.08 + 2.355x$

Larger alligators seem to select fish rather than invertebrates

Independence test: Food choice & length

$H_0: \beta_1 = \beta_2 = 0$: LRT = 16.8006 ($df$ = 2) with $p$-value = 0.0002
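The prediction equations can be turned into estimated response probabilities, and the third equation can be recovered from the first two. The sketch below assumes the signs $1.618 - 0.110x$ and $5.697 - 2.465x$, which are the ones consistent with the stated fish-versus-invertebrate equation; the function name is illustrative.

```python
import math

# Baseline-category logit coefficients, baseline = Other.
ALPHA = {"fish": 1.618, "invertebrate": 5.697}
BETA = {"fish": -0.110, "invertebrate": -2.465}

def food_probabilities(length):
    """Estimated probabilities of each primary food type at a given length (m)."""
    exp_eta = {food: math.exp(ALPHA[food] + BETA[food] * length) for food in ALPHA}
    denominator = 1.0 + sum(exp_eta.values())  # the baseline contributes exp(0) = 1
    probabilities = {food: value / denominator for food, value in exp_eta.items()}
    probabilities["other"] = 1.0 / denominator
    return probabilities

# Fish versus invertebrate: subtract the two baseline equations.
alpha_fish_vs_invert = ALPHA["fish"] - ALPHA["invertebrate"]
beta_fish_vs_invert = BETA["fish"] - BETA["invertebrate"]

smallest = food_probabilities(1.24)
largest = food_probabilities(3.89)

# For df = 2 the chi-squared upper tail has the closed form exp(-x / 2).
lrt_p_value = math.exp(-16.8006 / 2)

print(alpha_fish_vs_invert, beta_fish_vs_invert)
print(smallest)
print(largest)
print(f"p-value = {lrt_p_value:.4f}")
```

At the smallest observed length the invertebrate category dominates, while at the largest length fish dominates, which matches the slide's reading of the positive slope in the fish-versus-invertebrate equation.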

Page 31: Categorical Data Analysis &  Logistic Regression

Cumulative logit model: ordinal response

Logit of a cumulative probability

$\mathrm{logit}[P(Y \le j)] = \log\dfrac{P(Y \le j)}{1 - P(Y \le j)}, \quad j = 1, \ldots, J - 1$

Categories 1 to $j$: combined, Categories $j + 1$ to $J$: combined

Cumulative proportional odds model with a predictor $x$

$\mathrm{logit}[P(Y \le j)] = \alpha_j + \beta x, \quad j = 1, \ldots, J - 1$

The effect of $x$ is identical for all $J - 1$ cumulative logits

Any one curve for $P(Y \le j)$ is identical to any of the others shifted to the right or shifted to the left

For $x_1, x_2$, the log of the odds ratio is

$\mathrm{logit}[P(Y \le j \mid x = x_2)] - \mathrm{logit}[P(Y \le j \mid x = x_1)] = \beta (x_2 - x_1)$

Proportional to the difference between $x$ values

Same for each cumulative probability

Page 32: Categorical Data Analysis &  Logistic Regression

Example: Political ideology & party affiliation

Response: Political ideology with five-point ordinal scale

Predictors: Political party (Democratic, Republican)

                                      Political Ideology
Political Party    Very Liberal    Slightly Liberal    Moderate    Slightly Conservative    Very Conservative
Democratic                   80                  81         171                       41                   55
Republican                   30                  46         148                       84                   99

Page 33: Categorical Data Analysis &  Logistic Regression

Example: Political ideology & party affiliation

Parameter inference: $\hat\beta = 0.975$, $e^{0.975} = 2.65$

Democrats tend to be more liberal than Republicans

$H_0: \beta = 0$, Wald = 57.1 ($df$ = 1) with $p$-value < 0.0001

Strong evidence of an association

95% CI for $\beta$ = (0.72, 1.23) or for $e^{\beta}$ = (2.1, 3.4)

Odds of a response toward the liberal end are at least twice as high for Democrats as for Republicans

Goodness-of-fit: $G^2 = 3.7$ ($df$ = 3) with $p$-value = 0.2957: Good adequacy!
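One informal way to see why the proportional odds model fits well is to compute the sample cumulative odds ratio at each of the four cutpoints and compare it with the single model-based value $e^{0.975} = 2.65$. This sketch only uses the counts in the table; it does not fit the model.

```python
import math

# Counts ordered from Very Liberal to Very Conservative.
democratic = [80, 81, 171, 41, 55]
republican = [30, 46, 148, 84, 99]

def cumulative_odds(counts):
    """Sample odds of Y <= j versus Y > j for j = 1, ..., J - 1."""
    total = sum(counts)
    running = 0
    result = []
    for count in counts[:-1]:
        running += count
        result.append(running / (total - running))
    return result

# Odds ratio (Democratic versus Republican) at each cutpoint j.
sample_odds_ratios = [
    dem / rep
    for dem, rep in zip(cumulative_odds(democratic), cumulative_odds(republican))
]
model_odds_ratio = math.exp(0.975)

print([round(ratio, 2) for ratio in sample_odds_ratios])
print(round(model_odds_ratio, 2))
```

The four sample ratios all fall between roughly 2.2 and 2.9, so a single common value is plausible, in line with the non-significant goodness-of-fit statistic.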

Page 34: Categorical Data Analysis &  Logistic Regression

Other logit forms for ordinal response categories

Adjacent-categories logit

$\log\dfrac{\pi_{j+1}}{\pi_j}, \quad j = 1, \ldots, J - 1$

Adjacent-categories logits determine the logits for all pairs of response categories

Continuation-ratio logit

Form 1: $\log\dfrac{\pi_1}{\pi_2}, \ \log\dfrac{\pi_1 + \pi_2}{\pi_3}, \ \ldots, \ \log\dfrac{\pi_1 + \cdots + \pi_{J-1}}{\pi_J}$

Contrast each category with a grouping of categories from lower levels of response scale

Form 2: $\log\dfrac{\pi_1}{\pi_2 + \cdots + \pi_J}, \ \log\dfrac{\pi_2}{\pi_3 + \cdots + \pi_J}, \ \ldots, \ \log\dfrac{\pi_{J-1}}{\pi_J}$

Contrast each category with a grouping of categories from higher levels of response scale
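The three logit forms differ only in which probabilities are grouped, which a short sketch makes concrete. The probability vector below is hypothetical and the function names are illustrative.

```python
import math

def adjacent_category_logits(p):
    """log(p[j+1] / p[j]) for each pair of neighbouring categories."""
    return [math.log(p[j + 1] / p[j]) for j in range(len(p) - 1)]

def continuation_ratio_lower(p):
    """Form 1: all lower categories grouped in the numerator, next category below."""
    return [math.log(sum(p[: j + 1]) / p[j + 1]) for j in range(len(p) - 1)]

def continuation_ratio_higher(p):
    """Form 2: each category against the group of all higher categories."""
    return [math.log(p[j] / sum(p[j + 1:])) for j in range(len(p) - 1)]

probabilities = [0.1, 0.2, 0.4, 0.2, 0.1]  # hypothetical, J = 5

adjacent = adjacent_category_logits(probabilities)
form1 = continuation_ratio_lower(probabilities)
form2 = continuation_ratio_higher(probabilities)

# Adjacent-category logits determine the logit for any pair of categories:
# summing them from category 1 to J gives log(pi_J / pi_1).
log_last_vs_first = sum(adjacent)

print(adjacent, form1, form2, log_last_vs_first)
```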

Page 35: Categorical Data Analysis &  Logistic Regression


Summary

Two-way contingency tables: RR, Odds ratio, Chi-square tests

Three-way contingency tables: Conditional independence, Homogeneous association, Common odds ratio

Logistic regression: Dichotomous response

Logistic regression: Polytomous response

Page 36: Categorical Data Analysis &  Logistic Regression


References

Agresti, A. (1996). An Introduction to Categorical Data Analysis, Wiley: New York (Also the 2nd edition is available)

Stokes, M.E., Davis, C.S., and Koch, G.G. (2000). Categorical Data Analysis Using The SAS System, Second Ed., SAS Inc.: Cary