Download - Multilevel binary ordinal Athens2005 - UniFI - DiSIA · 2018-12-19 · L. Grilli – Multilevel binary and ordinal - Athens 2005 10 The standard linear regression model as a GLM Y

Athens 2005 1

Multilevel models for binary and ordinal responses

Leonardo Grilli

Email: [email protected]
Web: http://www.ds.unifi.it/grilli/

Department of Statistics “G. Parenti” – University of Florence

L. Grilli – Multilevel binary and ordinal - Athens 2005

Outline

Introduction

Binary response
• standard logit model
• multilevel logit model

Ordinal response
• standard proportional odds model
• multilevel proportional odds model


Qualitative responses

P(Y=y | X=x)

Main types of qualitative response variable Y:

• binary or dichotomous (y = 0,1): e.g. employed/unemployed
• ordinal (y = 1,2,…,C): e.g. level of satisfaction
• nominal or polytomous (y = 1,2,…,C): e.g. type of job


Models for qualitative response

Two alternative modelling strategies:

(a) Generalized linear models (GLM)

(b) Latent response models
• One latent variable + a set of thresholds (if Y is binary or ordinal)
• C-1 latent variables (if Y is nominal)

These are two different ways of extending the linear model to the case of a qualitative response.
The two strategies lead to equivalent models, the difference being in the interpretation.


Binary response:

standard logit model


Binary response

Example: model for the decision to buy a given product

Y =1 if the consumer decides to buy

Y =0 if the consumer decides not to buy

x vector of covariates (gender, age, education, etc.) that may help “explain” the decision

Wish to regress Y on x



Binary response

If Y assumes only two values (0 and 1, say) its distribution is (necessarily) Bernoulli, i.e. Binomial with n=1

$Y_i \mid \mathbf{x}_i \overset{iid}{\sim} Bin(1, \pi_i) \iff f(y_i \mid \mathbf{x}_i) = \pi_i^{y_i} (1-\pi_i)^{1-y_i}$, where $\pi_i = P(Y_i = 1 \mid \mathbf{x}_i)$

$E(Y_i \mid \mathbf{x}_i) = \pi_i \qquad Var(Y_i \mid \mathbf{x}_i) = \pi_i (1-\pi_i)$

The variance is entirely determined by the mean!

(indeed in binary response models the variance is not estimated)

Binary response

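Let’s first try a linear model

$Y_i = \mathbf{x}_i'\boldsymbol{\beta} + \varepsilon_i$, so that $\mathbf{x}_i'\boldsymbol{\beta} = E(Y_i \mid \mathbf{x}_i) = \pi_i$

There are some problems!

• the linear predictor is not constrained to the unit interval: $\mathbf{x}_i'\boldsymbol{\beta} \notin [0,1]$ is possible
• non-Normal and heteroscedastic errors:

$\varepsilon_i = \begin{cases} -\mathbf{x}_i'\boldsymbol{\beta} & \text{if } Y_i = 0 \\ 1 - \mathbf{x}_i'\boldsymbol{\beta} & \text{if } Y_i = 1 \end{cases}$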


GLM (Generalized Linear Models) (Nelder and Wedderburn, 1972)

Given n independent responses $Y_i$ with covariate vectors $\mathbf{x}_i$ and conditional means $\mu_i = E(Y_i \mid \mathbf{x}_i)$

1. Linear predictor: $\eta_i = \mathbf{x}_i'\boldsymbol{\beta}$

2. Link function g(.): $g(\mu_i) = \eta_i$ or $\mu_i = g^{-1}(\eta_i)$

3. Density of $Y_i$ in the exponential family:

$f(y_i \mid \theta_i, \phi) = \exp\{[y_i\theta_i - b(\theta_i)]\,\phi^{-1} + c(y_i, \phi)\}$

Key idea: bringing the mean on a scale on which to apply a linear model


The standard linear regression model as a GLM

Y continuous – linear regression:

$Y_i = \mathbf{x}_i'\boldsymbol{\beta} + \varepsilon_i$

$\mu_i = \eta_i$: identity link

$\varepsilon_i$ ~ independent and Normal (possibly heteroscedastic)


GLM for a binary response

$Y_i \mid \mathbf{x}_i \sim Bin(1, \pi_i) \Rightarrow \mu_i = \pi_i \in (0,1)$, but $\eta_i \in (-\infty, +\infty)$ when $g(\mu_i) = \eta_i$

We need a link g(.) such that g:(0,1) → (–∞,+∞)
Every inverse cdf (cumulative distribution function) is a candidate

$g(z) = \mathrm{logit}(z) = \log\frac{z}{1-z}$: logit link (inverse logistic cdf)

$g(z) = \Phi^{-1}(z)$: probit link (inverse Normal cdf)

[Figure: cdf F(b'X) plotted against b'X, for b'X between -30 and 30]


probit or logit?

Usually probit and logit yield nearly the same fit.
The difference may be appreciable when the probabilities are extreme (i.e. near 0 or 1), since the logit has heavier tails than the probit.

logit pros:
• Closed form
• Canonical link (→ various properties, e.g. the existence of sufficient statistics)
• Interpretation in terms of odds

probit pros:
• In the formulation with latent response and a threshold, probit corresponds to a Normal latent response



probit or logit?

probit and logit have different measurement scales
• probit ⇔ standard Normal ⇒ σ = 1
• logit ⇔ standard logistic ⇒ σ = π/√3 ≅ 1.81

Even when probit and logit yield approximately the same fit the values of the slopes are different:

$\beta_{logit} \approx 1.81\, \beta_{probit}$


Odds and logit

The logit link applies to $E(Y_i \mid \mathbf{x}_i) = \pi_i$, i.e. the probability of success

Definition: the odds (of Yi=1 given xi) are

$odds_i = \frac{\pi_i}{1-\pi_i}$

Definition: the logit is the logarithm of the odds

$\mathrm{logit}(\pi_i) = \log\frac{\pi_i}{1-\pi_i}$

Scales:
π_i:    < 0.5      = 0.5     > 0.5
odds:   0 … 1      1         1 … +∞
logit:  -∞ … 0     0         0 … +∞


Odds Ratio

Definition: Given two units A and B with probabilities of success $\pi_A$ and $\pi_B$, the Odds Ratio (OR) of B on A is

$OR = \frac{\pi_B / (1-\pi_B)}{\pi_A / (1-\pi_A)}$

The OR is a measure of association: let $\mathbf{x}_A = (x_1, \ldots, x_k, \ldots, x_p)$ and $\mathbf{x}_B = (x_1, \ldots, x_k + 1, \ldots, x_p)$, then

• OR < 1 ⇔ negative effect of $x_k$ on π
• OR = 1 ⇔ no effect of $x_k$ on π
• OR > 1 ⇔ positive effect of $x_k$ on π


Odds Ratio and logit

$\log(OR) = \log\left(\frac{\pi_B/(1-\pi_B)}{\pi_A/(1-\pi_A)}\right) = \log\left(\frac{\pi_B}{1-\pi_B}\right) - \log\left(\frac{\pi_A}{1-\pi_A}\right) = \mathrm{logit}(\pi_B) - \mathrm{logit}(\pi_A)$

The logarithm of the OR is the difference between two logits!
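As a quick numerical check of the identity above, the following Python sketch computes the odds, the logits and the OR for two illustrative probabilities. The values 0.20 and 0.50 are assumptions chosen for the example, not figures from the slides.

```python
import math

def odds(p):
    """Odds of success for a probability p in (0, 1)."""
    return p / (1.0 - p)

def logit(p):
    """Logarithm of the odds."""
    return math.log(odds(p))

# Illustrative probabilities of success for units A and B (assumed values).
pi_a, pi_b = 0.20, 0.50

odds_ratio = odds(pi_b) / odds(pi_a)          # OR of B on A
log_or = math.log(odds_ratio)
diff_of_logits = logit(pi_b) - logit(pi_a)    # should equal log(OR)

print(odds_ratio, log_or, diff_of_logits)
```

The OR is greater than 1 here because unit B has the higher probability of success.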


logit model (with a single x)

If $\mathrm{logit}(\pi_i) = \alpha + \beta x_i$ and $x_B = x_A + d$, then

$\log(OR) = \mathrm{logit}(\pi_B) - \mathrm{logit}(\pi_A) = [\alpha + \beta(x_A + d)] - [\alpha + \beta x_A] = \beta d$

β = effect of a unit increment of x on the logit scale



$\mathrm{logit}(\pi_i) = \alpha + \beta x_i$

$\exp(\beta d) = [\exp(\beta)]^d$ is the OR between two units which differ by a d-increment in the covariate

exp(β) is the OR in the special case of a unit increment (i.e. d=1)

If x is a dummy 0-1 variable, exp(β) is the only OR that makes sense

If x is a continuous covariate, the OR can be computed for any d-increment (and the unit increment may not be the most useful one to compute)
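The d-increment rule can be checked numerically. The sketch below, with hypothetical values for α, β, x_A and d, computes the OR directly from the two probabilities and compares it with exp(βd) and with exp(β) raised to the power d.

```python
import math

def inv_logit(eta):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-eta))

def odds(p):
    return p / (1.0 - p)

# Hypothetical parameter and covariate values, for illustration only.
alpha, beta = -1.0, 0.3
x_a, d = 2.0, 5.0

pi_a = inv_logit(alpha + beta * x_a)
pi_b = inv_logit(alpha + beta * (x_a + d))

or_direct = odds(pi_b) / odds(pi_a)   # OR from the two probabilities
or_formula = math.exp(beta * d)       # exp(beta * d)
or_power = math.exp(beta) ** d        # [exp(beta)]^d

print(or_direct, or_formula, or_power)
```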




$\pi(x) = \frac{1}{1 + \exp(-(\alpha + \beta x))}$

[Figure: two panels plotting p(x) against x (from 0 to 14), one for β>0 and one for β<0]

• The sign of β determines if π(x) is increasing or decreasing

• The rate of variation increases with |β|

Around π = 0.5 the curve is nearly linear



Effect of x on the probability of Y=1

[Figure: p(x) plotted against x (from 0 to 14)]

$\frac{\partial \pi(x)}{\partial x} = \frac{\partial g^{-1}(\eta)}{\partial x} = \left\{\frac{\partial g^{-1}(\eta)}{\partial \eta}\right\} \left\{\frac{\partial \eta}{\partial x}\right\} = \pi(x)[1 - \pi(x)]\,\beta$ (slope of the tangent in x)

when π = 0.5 the slope is maximum and equal to 0.5 ⋅ 0.5 ⋅ β = β/4

e.g. if the estimate of β is 0.20, then for an individual with probability of success of 0.5 a unit increase in the covariate would imply an approximate increment of 0.20/4=0.05, leading to a probability of success of about 0.55
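The β/4 rule is a first-order approximation, so it is worth comparing it with the exact change in probability. A minimal sketch using the β = 0.20 of the example above (the helper name `inv_logit` is introduced here for illustration):

```python
import math

def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

beta = 0.20        # estimate used in the example above
pi0 = 0.5          # starting probability of success
eta0 = math.log(pi0 / (1.0 - pi0))   # logit of 0.5, i.e. 0

# Linear approximation: pi0 + slope * 1, with slope = pi0 * (1 - pi0) * beta
approx = pi0 + pi0 * (1.0 - pi0) * beta

# Exact probability after a unit increase of the covariate
exact = inv_logit(eta0 + beta)

print(approx, exact)
```

The two values agree to about three decimal places here. The approximation gets worse for larger |β| or for starting probabilities far from 0.5.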


Specification with latent response and threshold

• Assume there exists a latent continuous response Y*

• A threshold model determines the observed response Y:

$Y_i = 1 \iff Y_i^* > 0$, so that P(Yi=1) = P(Yi*>0)

[Figure: density of the latent response (horizontal axis y from -4 to 4, vertical axis density)]

• Model for the latent response: linear regression

$Y_i^* = \beta_0 + \beta_1 x_i + \varepsilon_i$ with $\varepsilon_i \overset{iid}{\sim} F(\cdot)$



Latent response – GLM equivalence:

$P(Y_i = 1) = P(Y_i^* > 0)$
$\quad = P(\beta_0 + \beta_1 x_i + \varepsilon_i > 0)$
$\quad = P(\varepsilon_i > -\beta_0 - \beta_1 x_i)$
$\quad = P(-\varepsilon_i \le \beta_0 + \beta_1 x_i)$
$\quad = F(\beta_0 + \beta_1 x_i)$

(conditional on the covariates)

F is the cdf of -ε (equal to the cdf of ε if symmetrical)

Therefore $\pi_i = F(\eta_i)$, so F is the inverse of the link!



The variance of the latent variable is fixed:

$F(\cdot)$ Normal $\Rightarrow Var(\varepsilon_i) = 1$; $\quad F(\cdot)$ Logistic $\Rightarrow Var(\varepsilon_i) = \pi^2/3$

Now let us assume that the variance of the latent variable is an arbitrary value:

$F(\cdot)$ Normal $\Rightarrow Var(\varepsilon_i) = \sigma^2 \times 1$; $\quad F(\cdot)$ Logistic $\Rightarrow Var(\varepsilon_i) = \sigma^2 \times \pi^2/3$

Then manipulating the prob. as in the previous slide it follows that

$P(Y_i = 1) = P(Y_i^* > 0) = P(-\varepsilon_i \le \beta_0 + \beta_1 x_i) = P\left(-\frac{\varepsilon_i}{\sigma} \le \frac{\beta_0}{\sigma} + \frac{\beta_1}{\sigma} x_i\right)$

So the estimable quantities are in fact RATIOS between the parameters of the linear model for the latent response (β0 and β1) AND the standard deviation of the latent response (σ)
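The identification problem can be seen numerically. In the sketch below (probit case, with made-up parameter values) two parameter sets that share the same ratios β0/σ and β1/σ give exactly the same probability of success, so the data cannot tell them apart.

```python
import math

def normal_cdf(z):
    """Standard Normal cdf, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_success(beta0, beta1, sigma, x):
    """P(Y* > 0) when Y* = beta0 + beta1*x + eps and eps ~ N(0, sigma^2)."""
    return normal_cdf((beta0 + beta1 * x) / sigma)

x = 0.8  # arbitrary covariate value (assumption for the example)

p_unit_scale = prob_success(0.5, 1.0, 1.0, x)   # sigma = 1
p_rescaled = prob_success(1.5, 3.0, 3.0, x)     # everything multiplied by 3

print(p_unit_scale, p_rescaled)
```

This is why the probit model fixes σ = 1 and the logit model fixes σ = π/√3.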




Model            Link F^-1        Distrib. of εi       Variance of εi
logit            logit            standard logistic    π²/3
probit           probit           standard Normal      1
compl. log-log   compl. log-log   Gumbel               π²/6




An alternative specification

$Y_i = 1 \iff Y_i^* > \gamma$

P(Yi=1) = P(Yi*> γ)

i.e. the threshold γ is not fixed to 0 but it is an estimable parameter. However a constraint on the model for Y* is needed

[Figure: density of the latent response (horizontal axis y from -4 to 4, vertical axis density), with the threshold γ marked]

$Y_i^* = \beta_1 x_i + \varepsilon_i$ with $\varepsilon_i \overset{iid}{\sim} F(\cdot)$

To avoid collinearity (non identification) the intercept of Y* is fixed to 0


Binary response:

multilevel logit model


Introduction to multilevel logit models

Definition

“cluster-specific” vs “population-average” effects

Random intercept model

ICC

Estimation

• Snijders & Bosker §14.1-14.2, 14.3.2-14.3.3
• Skrondal & Rabe-Hesketh ch. 9


Random effects GLM for a binary response (GLMM)

Components of a GLMM (Generalized Linear Mixed Model)

1. GLM for the distribution of Y conditioned on the random effects

2. distribution of the random effects

Remark: the marginal distribution of Y (marginal w.r.t. the random effects) does not follow a GLM!!!


Random effects GLM for a binary response (GLMM)

individual: i = 1,2,…,nj; cluster: j = 1,2,…,J

GLM for Y|u:

(1) linear predictor: $\eta_{ij} = \mathbf{x}_{ij}'\boldsymbol{\beta} + u_j$

(2) logit link: $\mathrm{logit}(\mu_{ij}) = \eta_{ij}$

(3) distribution: $Y_{ij} \mid \mathbf{x}_{ij}, u_j \overset{iid}{\sim} Bin(1, \pi_{ij})$

f(u): $u_j \overset{iid}{\sim} N(0, \sigma_u^2)$

• The β are the conditional effects of the covariates, given the value of the random effects u: cluster-specific effects

• The marginal effects of the covariates are obtained by integrating w.r.t. the random effects u


cluster-specific vs population-average effects

cluster-specific model (random intercept)

$P(Y_{ij} = 1 \mid x_{ij}, u_j) = \frac{1}{1 + \exp(-(\beta_0 + \beta_1 x_{ij} + u_j))}$

population-average model (constant intercept)

$P(Y_{ij} = 1 \mid x_{ij}) = \frac{1}{1 + \exp(-(\gamma_0 + \gamma_1 x_{ij}))}$

γ1 < β1: the effect of x is attenuated!

[Figure: probability (vertical axis, 0 to 1) plotted against x (0 to 100)]

see Skrondal & Rabe-Hesketh §4.8 and the paper of Ritz & Spiegelman



Estimating conditional probabilities

$P(Y_{ij} = 1 \mid x_{ij}, u_j) = \frac{1}{1 + \exp(-(\beta_0 + \beta_1 x_{ij} + u_j))}$

Fit the random effects model and
• choose a value of $x_{ij}$
• plug in the estimates of the fixed effects $\hat\beta_0, \hat\beta_1$
• choose a value of $u_j$, for example
  - zero → hypothetical mean cluster
  - a low value (e.g. $-2\hat\sigma_u$) → hypothetical “bad” cluster
  - a high value (e.g. $+2\hat\sigma_u$) → hypothetical “good” cluster
  - an EB residual $\hat u_j^{EB}$ → j-th cluster of the sample


Estimating marginal probabilities

Two ways to estimate $P(Y_{ij} = 1 \mid x_{ij})$:

1) fit a model without random effects (population-averaged) and plug in the estimates

$\hat P(Y_{ij} = 1 \mid x_{ij}) = \frac{1}{1 + \exp(-(\hat\gamma_0 + \hat\gamma_1 x_{ij}))}$

or

2) fit a model with random effects (cluster-specific), plug in the estimates and compute the integral

$\hat P(Y_{ij} = 1 \mid x_{ij}) = E_{u_j}\left[\hat P(Y_{ij} = 1 \mid x_{ij}, u_j)\right] = \int \frac{1}{1 + \exp(-(\hat\beta_0 + \hat\beta_1 x_{ij} + u_j))}\, \phi(u_j; 0, \hat\sigma_u^2)\, du_j$
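The integral in method 2 has no closed form, but it is one-dimensional and easy to approximate. The sketch below uses a plain midpoint rule (not the Gaussian quadrature used by the software discussed later) and hypothetical values for β0, β1 and σu. It also shows that the effect of x on the marginal logit scale is smaller than the cluster-specific β1.

```python
import math

def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def logit(p):
    return math.log(p / (1.0 - p))

def normal_pdf(u, sigma):
    return math.exp(-0.5 * (u / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def marginal_prob(beta0, beta1, sigma_u, x, n_steps=4000, width=8.0):
    """Midpoint-rule approximation of E_u[P(Y=1 | x, u)], u ~ N(0, sigma_u^2)."""
    lower = -width * sigma_u
    step = 2.0 * width * sigma_u / n_steps
    total = 0.0
    for k in range(n_steps):
        u = lower + (k + 0.5) * step
        total += inv_logit(beta0 + beta1 * x + u) * normal_pdf(u, sigma_u) * step
    return total

# Hypothetical cluster-specific estimates, for illustration only.
beta0, beta1, sigma_u = 0.0, 1.0, 2.0

cond_at_mean_cluster = inv_logit(beta0 + beta1 * 1.0)   # x = 1, u_j = 0
marg_at_one = marginal_prob(beta0, beta1, sigma_u, 1.0)
marg_at_zero = marginal_prob(beta0, beta1, sigma_u, 0.0)

# Change in the marginal logit for a unit increase of x (attenuated effect)
marginal_slope = logit(marg_at_one) - logit(marg_at_zero)

print(cond_at_mean_cluster, marg_at_one, marginal_slope)
```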


Null random intercept logit model

$\pi_j = P(Y_{ij} = 1 \mid u_j) = \frac{1}{1 + \exp(-(\beta_0 + u_j))}$

$\mathrm{logit}(\pi_j) = \log\left(\frac{\pi_j}{1 - \pi_j}\right) = \beta_0 + u_j$

$u_j \sim N(0, \sigma_u^2)$

$\beta_0$: population mean of logits or logit of the mean cluster


Random intercept logit model with covariates

$\pi_{ij} = P(Y_{ij} = 1 \mid x_{ij}, u_j) = \frac{1}{1 + \exp(-(\beta_0 + \beta_1 x_{ij} + u_j))}$

$\mathrm{logit}(\pi_{ij}) = \log\left(\frac{\pi_{ij}}{1 - \pi_{ij}}\right) = \beta_0 + \beta_1 x_{ij} + u_j$

• Cluster-level covariates can be inserted
• Individual-level covariates can have a random coefficient
• Cross-level interaction terms can be inserted


ICC in binary response models

Specification with a continuous latent response

$Y_{ij}^* = \beta_0 + \beta_1 x_{ij} + u_j + \varepsilon_{ij}$, with $u_j \overset{iid}{\sim} N(0, \sigma_u^2)$ and $\varepsilon_{ij} \overset{iid}{\sim} F$

The total error uj+εij has variance:
• $\sigma_u^2 + 1$ in the probit model
• $\sigma_u^2 + \pi^2/3$ in the logit model

The (residual) ICC is the between/total variance ratio:
• $\rho = \sigma_u^2 / (\sigma_u^2 + 1)$ in the probit model
• $\rho = \sigma_u^2 / (\sigma_u^2 + \pi^2/3)$ in the logit model


ICC in binary response models

For two individuals of the same cluster, the two responses are conditionally independent given the random effects:

$Corr(Y_{ij}^*, Y_{i'j}^* \mid x_{ij}, x_{i'j}, u_j) = 0$

Marginally w.r.t. the random effects, the correlation (in the latent responses) between the same two individuals is equal to the (residual) ICC:

$Corr(Y_{ij}^*, Y_{i'j}^* \mid x_{ij}, x_{i'j}) = \rho$



Likelihood

Binomial conditional prob.

$Y_{ij} \mid x_{ij}, u_j \overset{iid}{\sim} Bin(1, \pi_{ij})$, with $\pi_{ij} = P(Y_{ij} = 1 \mid x_{ij}, u_j)$

Conditional likelihood j-th cluster

$L_j(\beta_0, \beta_1 \mid u_j) = \prod_{i=1}^{n_j} \pi_{ij}^{y_{ij}} (1 - \pi_{ij})^{1 - y_{ij}}$

Marginal likelihood j-th cluster

$L_j(\beta_0, \beta_1, \sigma_u^2) = \int L_j(\beta_0, \beta_1 \mid u_j)\, \phi(u_j; 0, \sigma_u^2)\, du_j$
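To make the two likelihood contributions concrete, the sketch below evaluates the conditional likelihood of one cluster and then integrates out the random intercept. A simple midpoint rule stands in for the quadrature methods discussed next, and the data and parameter values are invented for the example.

```python
import math

def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def normal_pdf(u, sigma):
    return math.exp(-0.5 * (u / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def conditional_lik(y, x, beta0, beta1, u):
    """Product of Bernoulli probabilities for one cluster, given u_j = u."""
    lik = 1.0
    for y_i, x_i in zip(y, x):
        p = inv_logit(beta0 + beta1 * x_i + u)
        lik *= p if y_i == 1 else (1.0 - p)
    return lik

def marginal_lik(y, x, beta0, beta1, sigma_u, n_steps=2000, width=8.0):
    """Midpoint-rule approximation of the integral over u_j ~ N(0, sigma_u^2)."""
    lower = -width * sigma_u
    step = 2.0 * width * sigma_u / n_steps
    total = 0.0
    for k in range(n_steps):
        u = lower + (k + 0.5) * step
        total += conditional_lik(y, x, beta0, beta1, u) * normal_pdf(u, sigma_u) * step
    return total

# Invented cluster of three individuals and invented parameter values.
y_cluster = [1, 0, 1]
x_cluster = [0.2, -0.5, 1.0]
lik_cluster = marginal_lik(y_cluster, x_cluster, 0.3, 0.8, 1.5)

# Sanity check on a one-observation cluster: the two outcomes sum to 1.
lik_y1 = marginal_lik([1], [0.0], 0.0, 1.0, 1.5)
lik_y0 = marginal_lik([0], [0.0], 0.0, 1.0, 1.5)

print(lik_cluster, lik_y1, lik_y0)
```

The full likelihood is the product of these marginal contributions over the J clusters.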


Likelihood: how to solve intractable integrals

• Taylor expansion of the link (MQL, PQL): MLwiN (+bootstrap), HLM

• ML with numerical integration: aML, MIXOR, NLMIXED, GLLAMM, Mplus

• Laplace approximations: HLM

• Gibbs sampling: WinBUGS, MLwiN

The convergence of the algorithm depends on: the data at hand, the complexity of the model, the initial values, the specific options of the algorithm (e.g. the number of quadrature points)


PQL (Penalized Quasi-Likelihood)

(PQL clearly better than MQL, but sometimes it does not converge!)

Pros
• Computationally efficient
• Good performance when f(y|u) is approximately Normal (e.g. Poisson with mean >=7, large cluster sizes, proportions with large denominators)

Cons
• Underestimation of random parameters (and thus attenuation of fixed parameters) for binary responses with small clusters or large ICC
• No standard likelihood (⇒ no LRT test)


ML (Maximum Likelihood) with Gaussian quadrature

Ordinary (non-adaptive) Gaussian quadrature:
• underestimation of the variance components when ICC is high

Adaptive Gaussian quadrature:
• needs calculation of the residuals at each iteration in order to tune the grid for each cluster
• w.r.t. ordinary quadrature each iteration takes longer, but fewer iterations are needed
• accurate estimates are always obtainable


ML (Maximum Likelihood) with Gaussian quadrature

Pros
• Accurate estimates
• Good performance even with small clusters
• Performance can be evaluated by changing the number of quadrature points

Cons
• Inefficient for continuous Y
• Computational time can be very long

Warning: the time is roughly proportional to the number of quadrature points, a number that rapidly increases as the model becomes more complex: for example, using 8 quadrature points per dimension

• 1 random intercept + 1 random slope ⇒ 8^2 = 64 q.points
• 1 random intercept + 2 random slopes ⇒ 8^3 = 512 q.points


An example of multilevel logit model:

Contraception in Brazil



Contraception in Brazil: aims of the research

How much of the individual-level variability in the use of contraceptives is due to the social context in which the women live?

Is it possible to explain the differences due to the social context?

Angeli A., Rampichini C., Salvini S. (1996) La contraccezione in Brasile: un’analisi attraverso un modello a componenti di varianza. Dept. of Statistics of Florence, Working Papers n. 59


Data

DHS 1986 Brazil:

women in union aged 35-44

Y: use of contraceptives (0=never, 1=at least once)

Hierarchical structure:
• Women: 1156 level 1 units
• Area of residence: 47 level 2 units


Data

1156 records, 18 variables

Id: woman id
Area: area of residence
Uso: 1 = use of contraceptives

Individual covariates:
• Age at interview
• Education
• Number of children and interaction with education
• Listening to the radio (every day or not)
• Education of the mate

Contextual covariates:
• Infant mortality rate
• Average number of desired children
• Percentage having a job
• Percentage knowing the biology of ovulation
• Percentage knowing how to get contraceptives


Reading data in STATA

infile id area uso eta primaria diplau figli primfigli diplfigli radio istrm1 istrm2 intercept tasso lavora ovul trova mfigli using brasile.txt

save brasile.dta,replace


Preliminary analysis

Area proportions

Overall proportion π= 0.8201

Area mean prop. πj =E(Yij | area=j)

min (πj)=0.33, max(πj)=1.00

tabulate area uso, chi2 row


Testing heterogeneity

p-value<0.001

There is significant heterogeneity among the areas

Chi2 =160.08

df=46

(chi2 option)



Null model with GLLAMM

Sort the data

sort area id

Model specification

yij ~ Bin(1,πij)
logit(πij) = β0 + uj

gllamm uso, i(area) family(binomial) link(logit) nip(5) adapt trace dots

uso : response variable
area : variable identifying level 2 units
nip(5) adapt : 5-point adaptive quadrature


Results of null model

β0: fixed intercept
σu²: variance between areas

Estimated probability for uj=0 (different from E(πj)!):

1/[1+exp(-β0)] = 0.8318

matrix a=e(b)
matrix list a
di exp(a[1,1])/(1+exp(a[1,1]))

πj for high u: 1/[1+exp(-(β0 + 2σu))] = 0.9680
πj for low u: 1/[1+exp(-(β0 – 2σu))] = 0.4473


Model with radio

Inserting radio (fixed effect)

$\pi_{ij} = P(Y_{ij} = 1 \mid x_{ij}, u_j) = \frac{1}{1 + \exp[-(\beta_0 + \beta_1 x_{ij} + u_j)]}$

$\mathrm{logit}(\pi_{ij}) = \log\left(\frac{\pi_{ij}}{1 - \pi_{ij}}\right) = \beta_0 + \beta_1 x_{ij} + u_j$

gllamm uso radio, i(area) family(binomial) link(logit) nip(5) adapt from(a) trace dots

from(a): initial values from previous model

Results of model with radio

Between variance: nearly the same as before

Better model fit: LRT = 2*(517.55307-509.8697) = 15.4

radio=0 1/(1+exp(-_b[_cons])) =0.76

radio=1 1/(1+exp(-_b[_cons]-_b[_radio])) =0.86

Estimated probability using contraceptives for uj=0
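The figures on this slide can be reproduced with a few lines of Python. The sketch assumes that the two values in the LRT are the negative log-likelihoods of the null and radio models, that the test has 1 degree of freedom (one added coefficient), and it uses the coefficient values quoted on the Odds slide (1.1596 and 0.6835). The p-value is not reported in the slides.

```python
import math

def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

# Assumed to be -logL of the null model and of the model with radio.
neg_loglik_null = 517.55307
neg_loglik_radio = 509.8697

lrt = 2.0 * (neg_loglik_null - neg_loglik_radio)

# Chi-square survival function with 1 df: P(X > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(lrt / 2.0))

# Coefficients as quoted on the Odds slide.
b_cons, b_radio = 1.1596, 0.6835
p_radio0 = inv_logit(b_cons)             # radio = 0, u_j = 0
p_radio1 = inv_logit(b_cons + b_radio)   # radio = 1, u_j = 0

print(round(lrt, 1), p_value, round(p_radio0, 2), round(p_radio1, 2))
```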


Odds

For x=1 and u=0 the odds of Y=1 is

π(1)/[1-π(1)]=exp(β0 + β1)

=exp(1.1596+0.6835)= 6.316

for a woman listening to the radio every day and living in a mean area, it is about 6 times more probable to use contraceptives than not to use them


Odds

For x=1 the odds of Y=1 is a function of u:

π(1|u)/[1-π(1|u)] = exp(β0 + β1 + u)

Mean area (u=0): exp(1.1596+0.6835) = 6.316

Low area (u=-2σu): exp(1.1596+0.6835-2*0.8939) = 1.057

High area (u=+2σu): exp(1.1596+0.6835+2*0.8939) = 37.75

[Figure: odds (radio=1), from 0 to 40, plotted against u from -2 to 2]
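The three odds above follow directly from the estimates shown on the slide (β0 = 1.1596, β1 = 0.6835, σu = 0.8939). A short sketch that recomputes them; the function name `odds_radio` is introduced here for illustration:

```python
import math

# Estimates as reported on the slide.
beta0, beta1, sigma_u = 1.1596, 0.6835, 0.8939

def odds_radio(u):
    """Odds of Y=1 for a woman with radio=1 living in an area with effect u."""
    return math.exp(beta0 + beta1 + u)

odds_mean = odds_radio(0.0)
odds_low = odds_radio(-2.0 * sigma_u)
odds_high = odds_radio(+2.0 * sigma_u)

print(round(odds_mean, 3), round(odds_low, 3), round(odds_high, 2))
```

Because u enters the exponent, moving two standard deviations in either direction multiplies or divides the odds by the same factor, exp(2σu).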



Odds Ratio

OR=(π(1)/[1-π(1)])/(π(0)/[1-π(0)])=exp(β)

log OR =β

OR is a measure of association between Y and X which does not depend on u

OR(radio)=1.9838

The odds of using contraceptives are about 2 times higher for a woman listening to the radio every day (whichever area she lives in)


Inserting other covariates

(fitted with MIXOR)

model        -2logL    n.par.   σu²      ρ
null         1035.15   2        0.8312   0.202
radio        1019.78   3        0.8033   0.196
individual   934.69    11       0.628    0.160
contextual   893.89    14       0.024    0.007

Residual ICC (on the latent response):

$\rho = Corr(Y_{ij}^*, Y_{i'j}^* \mid x_{ij}, x_{i'j}) = \frac{\sigma_u^2}{\sigma_u^2 + \pi^2/3}$
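The ρ column of the table can be recomputed from the σu² column with the residual ICC formula for the logit model. A minimal sketch; the dictionary keys simply mirror the model names in the table:

```python
import math

def icc_logit(var_u):
    """Residual ICC on the latent scale for a random intercept logit model."""
    return var_u / (var_u + math.pi ** 2 / 3.0)

# Between-area variances as reported in the table.
var_u_by_model = {
    "null": 0.8312,
    "radio": 0.8033,
    "individual": 0.628,
    "contextual": 0.024,
}

icc_by_model = {name: round(icc_logit(v), 3) for name, v in var_u_by_model.items()}
print(icc_by_model)
```

The contextual covariates reduce the between-area variance to almost zero, which is why the residual ICC falls from about 0.20 to below 0.01.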


Ordinal response:

standard proportional odds model


Ordinal responses

Y can assume C distinct values (categories) $y_c$, c = 1,2,…,C

The categories are ordered

$y_1 < y_2 < \ldots < y_c < \ldots < y_C$

As a convention, the category $y_c$ is labelled with the number c

Examples:
• Severity of the symptoms: none, light, serious
• Result of a test: normal, borderline, abnormal
• Satisfaction: low, intermediate, high


Probabilities to be modelled

Mass points: $P(Y = 1), P(Y = 2), \ldots, P(Y = C-1)$, with $P(Y = C) = 1 - \sum_{c=1}^{C-1} P(Y = c)$

Cumulative probabilities: $P(Y \le 1), P(Y \le 2), \ldots, P(Y \le C-1)$, with $P(Y \le C) = 1$

With C categories there are C-1 free probabilities, e.g. the first C-1 mass points of the distribution, or the first C-1 cumulative probabilities of the distribution


Cumulative GLM

Given the ordinal nature of Y it is convenient to build the model on the cumulative probabilities

Following the GLM approach

$g(P(Y_i \le c)) = \gamma_c - \eta_i, \qquad c = 1, \ldots, C-1$

• $\eta_i = \boldsymbol{\beta}'\mathbf{x}_i$: linear predictor (the same for all the cumulative probabilities)
• $\gamma_c$: specific intercept (threshold) of the c-th cumulative prob.
• $g(\cdot)$: link function

A cumulative GLM for an ordinal Y with C categories is made of C-1 submodels, one for each cumulative prob. (except the last one)



Cumulative GLM

$g(P(Y_i \le c)) = \gamma_c - \eta_i, \qquad c = 1, \ldots, C-1$

What is the relationship among the C-1 thresholds γc ?

As the cumulative probabilities are non-decreasing by construction, the thresholds must also be non-decreasing:

$\gamma_1 \le \gamma_2 \le \ldots \le \gamma_{C-1}$

Why does the linear predictor have a minus sign?

To interpret the coefficients in the usual way: in fact, with the minus sign, increasing the value of a covariate with a positive coefficient amounts to increasing the probability of a high category (i.e. a category in the right end of the scale)


Cumulative GLM

$g(P(Y_i \le c)) = \gamma_c - \eta_i, \qquad c = 1, \ldots, C-1$

How to compute the probability of a specific category c ?

By difference (hence the name difference model):

$P(Y_i = c) = P(Y_i \le c) - P(Y_i \le c-1) = g^{-1}(\gamma_c - \eta_i) - g^{-1}(\gamma_{c-1} - \eta_i)$
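The difference rule is easy to sketch in code for the logit link. The thresholds and linear predictor values below are hypothetical. The last cumulative probability is set to 1 and an implicit P(Y ≤ 0) = 0 starts the differences.

```python
import math

def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def category_probs(thresholds, eta):
    """P(Y = c), c = 1..C, from C-1 non-decreasing thresholds and eta."""
    cumulative = [inv_logit(gamma_c - eta) for gamma_c in thresholds] + [1.0]
    probs = []
    previous = 0.0
    for cum in cumulative:
        probs.append(cum - previous)   # P(Y <= c) - P(Y <= c-1)
        previous = cum
    return probs

# Hypothetical thresholds for C = 4 categories.
thresholds = [-1.0, 0.5, 2.0]

probs_low_eta = category_probs(thresholds, 0.0)
probs_high_eta = category_probs(thresholds, 1.0)

print([round(p, 3) for p in probs_low_eta])
print([round(p, 3) for p in probs_high_eta])
```

Raising the linear predictor moves probability mass towards the highest category, which is the reason for the minus sign in front of η.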


Cumulative GLM

$g(P(Y_i \le c)) = \gamma_c - \eta_i, \qquad c = 1, \ldots, C-1$

What is the consequence of having the same linear predictor for all the categories?

A given covariate has an effect on the cumulative probabilities equal for all the categories of Y (so called parallel regressions assumption)

Such an assumption is clearly violated for a covariate that is not associated with a shift in the scale, but rather with an “extremization” of the responses (e.g. the individuals with certain features might use only the extremes of the scale)


Logit cumulative GLM: the proportional odds model

$g(P(Y_i \le c)) = \gamma_c - \eta_i, \qquad c = 1, \ldots, C-1$

If g() is the logit function, the cumulative GLM is called “proportional odds”. The odds of exceeding category c are

$\frac{P(Y_i > c)}{P(Y_i \le c)} = \frac{1 - P(Y_i \le c)}{P(Y_i \le c)} = \frac{1 - 1/\left(1 + \exp(-(\gamma_c - \boldsymbol{\beta}'\mathbf{x}_i))\right)}{1/\left(1 + \exp(-(\gamma_c - \boldsymbol{\beta}'\mathbf{x}_i))\right)} = \exp(-(\gamma_c - \boldsymbol{\beta}'\mathbf{x}_i)) = \exp(\boldsymbol{\beta}'\mathbf{x}_i - \gamma_c)$

Similarly, the odds of not exceeding category c are

$\frac{P(Y_i \le c)}{P(Y_i > c)} = \exp(\gamma_c - \boldsymbol{\beta}'\mathbf{x}_i)$

Same expression but with reversed signs!


Logit cumulative GLM: the proportional odds model

With reference to the odds of exceeding a category (or equivalently the odds of not exceeding), any two individuals have proportional odds, i.e. the ratio of the odds is the same for all the categories of Y

Let us consider two individuals A and B with the same values of the covariates with the exception of the r-th covariate, for which individual B has a value exceeding by 1 the value of individual A, so the difference in the linear predictor is

$\boldsymbol{\beta}'\mathbf{x}_B - \boldsymbol{\beta}'\mathbf{x}_A = \beta_r$

With reference to the odds of exceeding a category

$\frac{P(Y_B > c)/P(Y_B \le c)}{P(Y_A > c)/P(Y_A \le c)} = \frac{\exp(\boldsymbol{\beta}'\mathbf{x}_B - \gamma_c)}{\exp(\boldsymbol{\beta}'\mathbf{x}_A - \gamma_c)} = \exp\left((\boldsymbol{\beta}'\mathbf{x}_B - \gamma_c) - (\boldsymbol{\beta}'\mathbf{x}_A - \gamma_c)\right) = \exp(\beta_r)$

So the Odds Ratio is exp(βr) for any category c: this is the proportional odds property!
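The proportional odds property can be verified numerically by computing the odds of exceeding each category from the cumulative probabilities themselves. The thresholds, the coefficient β_r and the linear predictor of individual A below are hypothetical values.

```python
import math

def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def odds_exceeding(gamma_c, eta):
    """P(Y > c) / P(Y <= c) under the logit cumulative model."""
    p_not_exceeding = inv_logit(gamma_c - eta)
    return (1.0 - p_not_exceeding) / p_not_exceeding

# Hypothetical thresholds, coefficient and linear predictors.
thresholds = [-1.0, 0.5, 2.0]
beta_r = 0.7
eta_a = 0.4
eta_b = eta_a + beta_r   # B exceeds A by 1 on the r-th covariate

odds_ratios = [
    odds_exceeding(g, eta_b) / odds_exceeding(g, eta_a) for g in thresholds
]

print(odds_ratios, math.exp(beta_r))
```

The odds themselves differ across categories, but their ratio between B and A is the same for every threshold.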


Specification with latent response and a set of thresholds

• Underlying the observed value Y for the i-th individual there is a continuous latent response Y*

• A threshold mechanism determines the observed response:

$\{Y_i = c\} \iff \{\gamma_{c-1} < Y_i^* \le \gamma_c\}$

$P(Y_i = c) = P(\gamma_{c-1} < Y_i^* \le \gamma_c)$

[Figure: density of the latent response (horizontal axis y from -4 to 4, vertical axis density), cut by the thresholds]

• The latent response is modelled with a linear regression model without intercept:

$Y_i^* = \boldsymbol{\beta}'\mathbf{x}_i + \varepsilon_i$, with $\varepsilon_i \overset{iid}{\sim} F$




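A latent response model is equivalent to a cumulative GLM:

$P(Y_i \le c) = P(Y_i^* \le \gamma_c) = P(\boldsymbol{\beta}'\mathbf{x}_i + \varepsilon_i \le \gamma_c) = P(\varepsilon_i \le \gamma_c - \boldsymbol{\beta}'\mathbf{x}_i) = F(\gamma_c - \boldsymbol{\beta}'\mathbf{x}_i)$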

This relationship makes clear why in a cumulative GLM the estimated regression coefficients are approximately invariant to collapsing of the categories (warning: in principle the invariance is perfect, but in practice if the model is not adequate for the data at hand the estimates may change a lot)




Model                Link F^-1        Distrib. of εi       Variance of εi
proportional odds    logit            standard logistic    π²/3
ordinal probit       probit           standard Normal      1
ordinal c. log-log   compl. log-log   Gumbel               π²/6


Ordinal response:

multilevel proportional odds model


Random intercept two-level ordinal response model

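Representation with continuous latent response and a set of thresholds

$Y_{ij}^* = \boldsymbol{\beta}'\mathbf{x}_{ij} + u_j + \varepsilon_{ij}$, with $u_j \overset{iid}{\sim} N(0, \sigma_u^2)$ and $\varepsilon_{ij} \overset{iid}{\sim} F$

Estimable parameters:
• regression coefficients β (same number as the covariates)
• level 2 variance: $\sigma_u^2$
• C-1 thresholds: $\gamma_1, \ldots, \gamma_{C-1}$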


ICC in ordinal response models

Representation with a continuous latent response

$Y_{ij}^* = \beta_0 + \beta_1 x_{ij} + u_j + \varepsilon_{ij}$, with $u_j \overset{iid}{\sim} N(0, \sigma_u^2)$ and $\varepsilon_{ij} \overset{iid}{\sim} F$

The total error uj+εij has variance:
• $\sigma_u^2 + 1$ in the ordinal probit model
• $\sigma_u^2 + \pi^2/3$ in the proportional odds model

The (residual) ICC is the between/total variance ratio:
• $\rho = \sigma_u^2 / (\sigma_u^2 + 1)$ in the ordinal probit model
• $\rho = \sigma_u^2 / (\sigma_u^2 + \pi^2/3)$ in the proportional odds model


ICC in ordinal response models

For two individuals of the same cluster, the two responses are conditionally independent given the random effects:

$Corr(Y_{ij}^*, Y_{i'j}^* \mid x_{ij}, x_{i'j}, u_j) = 0$

Marginally w.r.t. the random effects, the correlation (in the latent responses) between the same two individuals is equal to the (residual) ICC:

$Corr(Y_{ij}^*, Y_{i'j}^* \mid x_{ij}, x_{i'j}) = \rho$



Multilevel ordinal response models

The issues that arise when introducing random effects in an ordinal response model are the same already noted in the binary response case, e.g.

• cluster-specific vs. population-average effects
• marginal vs. conditional probabilities
• estimation algorithms approximating the integrals

Snijders & Bosker §14.4, Skrondal & Rabe-Hesketh ch. 10


Example of multilevel proportional odds model:

Tobacco information programme TVSFP


Tobacco information programme TVSFP

Data collected during the programme “Television School and Family Smoking Prevention and Cessation”

The schools in the sample were randomized to 4 types of treatment defined by crossing two factors:

• CC: dummy indicator for classroom intervention
• TV: dummy indicator for television intervention

Hierarchical structure: students in classes, classes in schools

Hedeker and Gibbons (1996), MIXOR manual
Rabe-Hesketh et al. (2004), GLLAMM manual


Ordinal response model

Response variable THK

------------
thk | Freq.
----+-------
  1 |   259
  2 |   277
  3 |   269
  4 |   294
------------

Score defined as the number of correct answers to 7 questions on tobacco knowledge after the intervention, collapsed into 4 categories (higher means better knowledge)


Ordinal response model

Covariates
• CC: indicator for classroom intervention
• TV: indicator for television intervention
• CCTV: interaction CC*TV
• PRETHK: pre-intervention value of THK

Variable  | Obs    Mean       Std. Dev.   Min   Max
----------+---------------------------------------------
prethk    | 1600   2.069375   1.26018     0     6
cc        | 1600   .476875    .4996211    0     1
tv        | 1600   .499375    .5001559    0     1

CC and TV are randomized at school level


Reading and collapsing the data

When both the response and the covariates can assume few distinct values there are several individuals with the same value for Y and x

Collapsing reduces the size of the dataset and thus the computational time

infile school class thk a2 const prethk cc tv cctv using tvsfpors.dat

gen cons=1
collapse (count) wt1=cons, by(thk prethk cc tv cctv school class)



Two-level ordinal model: students in classes

ηijk=β0+β1PRETHKijk+ β2CCk + β3TVk + β4CCTVk + ujk

ujk ~N(0,τ2), i student, j class, k school

Response THKF

Linear predictor

gllamm thk prethk cc tv cctv, i(class) family(binomial) link(ologit) weight(wt) nip(10) trace dots

link(ologit): ordinal logit link
weight(wt): weights corresponding to level 1 units (students)


Results

The level of knowledge before intervention (prethk) is a good predictor of the knowledge after intervention

Only the classroom intervention (CC) has an effect

3 thresholds(Y has 4 categories)

Variance between classes=0.1888, ρ=0.1888/(0.1888+π2/3)=0.054


Checking the performance of Gaussian quadrature

• Fit the model again with more quadrature points
• Fit the model again with adaptive quadrature (option adapt)

Otherwise a quick method is to

• Evaluate the likelihood using more quadrature points (option eval)

In the TVSFP data the logL is about the same using 20 and 30 points, so the approximation yielded by 10-point quadrature seems to be adequate


Dropping TV and CCTV

Save the results of previous model

estimates store a
matrix a=e(b)

Initial values from the previous model

gllamm thk prethk cc, i(class) family(binomial) link(ologit) weight(wt) nip(10) from(a) trace


Interpretation of the parameters: odds ratio

A and B with the same covariate values with the exception of the r-th covariate, for which unit B has a value exceeding by 1 that of unit A

Odds Ratio of B on A $= \frac{P(Y_B > c \mid u_{jk} = 0)/P(Y_B \le c \mid u_{jk} = 0)}{P(Y_A > c \mid u_{jk} = 0)/P(Y_A \le c \mid u_{jk} = 0)} = \exp(\beta_r)$

It does not depend on the threshold c!

Example: odds ratio of CC=1 on CC=0 conditions being equal on PRETHK and ujk

lincom cc, eform

( 1) [thk]cc = 0
------------------------------------------------------------
thk | exp(b)    Std. Err.   z      P>|z|   [95% Conf. Interval]
----+-------------------------------------------------------
(1) | 2.04235   .2562923    5.69   0.000   1.597033   2.61184
------------------------------------------------------------


Interpretation of the parameters: odds of exceeding a category

$\frac{P(Y_i > c)}{P(Y_i \le c)} = \exp(\boldsymbol{\beta}'\mathbf{x}_i - \gamma_c)$

Example: for a student with PRETHKijk=0, CCk=1 and ujk=0 the odds of exceeding category c=1 is

lincom [thk]cc - [_cut11]_cons, eform

( 1) [thk]cc - [_cut11]_cons = 0
------------------------------------------------------------
thk | exp(b)    Std. Err.   z      P>|z|   [95% Conf. Interval]
----+-------------------------------------------------------
(1) | 2.43737   .2963014    7.33   0.000   1.920633   3.093134
------------------------------------------------------------

Similarly, for c=2 odds=0.68, c=3 odds=0.20



Interpretation of the parameters: probability of a category

$P(Y_{ijk} = c \mid u_{jk} = 0) = P(\gamma_{c-1} < Y_{ijk}^* \le \gamma_c \mid u_{jk} = 0)$
$\quad = P(Y_{ijk}^* \le \gamma_c \mid u_{jk} = 0) - P(Y_{ijk}^* \le \gamma_{c-1} \mid u_{jk} = 0)$
$\quad = \frac{1}{1 + \exp(-(\gamma_c - \boldsymbol{\beta}'\mathbf{x}_{ijk}))} - \frac{1}{1 + \exp(-(\gamma_{c-1} - \boldsymbol{\beta}'\mathbf{x}_{ijk}))}$

E.g. PRETHKijk=0 and ujk=0

Category   CC=0   CC=1
1          0.46   0.29
2          0.29   0.30
3          0.16   0.24
4          0.09   0.17
TOT        1.00   1.00
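The table can be reproduced, up to rounding, from the odds of exceeding each category reported earlier (2.43737, 0.68 and 0.20 for CC=1) and the odds ratio of CC (2.04235). The sketch relies on the identity P(Y ≤ c) = 1/(1 + odds of exceeding c). Dividing the CC=1 odds by the odds ratio to obtain the CC=0 odds is an application of the proportional odds property, not a figure taken from the slides.

```python
def category_probs_from_odds(odds_exceeding):
    """Category probabilities from the odds of exceeding c = 1, ..., C-1."""
    cumulative = [1.0 / (1.0 + o) for o in odds_exceeding] + [1.0]
    probs = []
    previous = 0.0
    for cum in cumulative:
        probs.append(cum - previous)   # P(Y <= c) - P(Y <= c-1)
        previous = cum
    return probs

# Odds of exceeding c = 1, 2, 3 for PRETHK = 0, CC = 1, u = 0 (from the slides).
odds_cc1 = [2.43737, 0.68, 0.20]

# Odds ratio of CC = 1 on CC = 0 (from the lincom output).
or_cc = 2.04235
odds_cc0 = [o / or_cc for o in odds_cc1]

probs_cc1 = category_probs_from_odds(odds_cc1)
probs_cc0 = category_probs_from_odds(odds_cc0)

print([round(p, 2) for p in probs_cc0])
print([round(p, 2) for p in probs_cc1])
```

Small discrepancies in the second decimal are possible because two of the three odds are only available rounded to two decimals.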


Three-level ordinal model: students in classes in schools

ηijk = β0 + β1PRETHKijk + β2CCk + β3TVk + β4CCTVk + ujk + vk

ujk ~ N(0,τ²), vk ~ N(0,ψ²), i student, j class, k school

gllamm thk prethk cc tv cctv, i(class school) family(binomial) link(ologit) weight(wt) nip(10) trace

LRT shows that the variance between schools is not significant ⇒ school level can be dropped
