
Psyc 945 Class 14a: 1 of 52

Generalized Multilevel Models

Today’s Topics:
• Introduction to Generalized Multilevel Models

• Models for Binary Outcomes

• Other Generalized Models

• Estimation in Generalized Multilevel Models

Psyc 945 Class 14a2 of 52

The Two Sides of ANY Model

• Model for the Means:

– Aka Fixed Effects, Structural Part of Model

– What you are used to caring about for testing hypotheses

– How the expected outcome for a given observation varies as a function of values on predictor variables

• Model for the Variances:
– Aka Random Effects and Residuals, Stochastic Part of Model

– What you are used to making assumptions about instead

– How model residuals are related and distributed across cases

– Validity of the tests of the predictors depends on having the ‘right’ model for the variances (where ‘right’ means ‘least wrong’)

• Estimates will usually be OK because they come from the model for the means

• Standard errors (and p-values) of estimates can be compromised


Psyc 945 Class 14a3 of 52

A World View of Models

• Statistical models can be broadly organized as:

– General (normal outcome) vs. Generalized (not normal outcome)

– One dimension of sampling (one variance term per outcome) vs. multiple dimensions of sampling (multiple variance terms)

• Fixed effects only vs. mixed (fixed and random effects = multilevel)

• All models have fixed effects, and then:
– General Linear Models: fixed effects, no random effects

– General Linear Mixed Models: fixed and random effects

– Generalized Linear Models: fixed effects through link functions, no random effects

– Generalized Linear Mixed Models: fixed and random effects through link functions

• “Linear” means the fixed effects predict the link-transformed DV in a linear combination of (effect*predictor) + (effect*predictor)…

Psyc 945 Class 14a4 of 52

What kind of outcome? Generalized vs. General

• Generalized Linear Models = General Linear Models whose residuals follow some not-normal distribution and in which a link-transformed Y is predicted instead of Y

• Many kinds of non-normally distributed outcomes have some kind of generalized linear model to go with them:
– Binary (dichotomous)

– Unordered categorical (nominal)

– Ordered categorical (ordinal)

– Counts (discrete, positive values)

– Censored (piled up and cut off at one end – left or right)

– Zero-inflated (pile of 0’s, then some distribution after)

– Continuous but skewed data (pile on one end, long tail)

The two categorical types above (nominal and ordinal) are often called “multinomial,” though inconsistently


Psyc 945 Class 14a5 of 52

3 Parts of a Generalized Linear Model

• Link Function (main difference from GLM):
– How a non-normal outcome gets transformed into something we can predict that is more continuous (unbounded)

– For outcomes that are already normal, general linear models are just a special case with an “identity” link function (Y * 1)

• Model for the Means (“Structural Model”):
– How predictors linearly relate to the link-transformed outcome

– New link-transformed Yi = β0 + β1xi + β2zi

• Model for the Variance (“Sampling/Stochastic Model”):
– If the errors aren’t normally distributed, then what are they?

– Family of alternative distributions at our disposal that map onto what the distribution of errors could possibly look like

Psyc 945 Class 14a6 of 52

Examples of Generalized Models

• Binary (dichotomous) → Logit, Probit, (Complementary) Log-Log

• Unordered categorical (nominal) → Baseline Category Logit

• Ordered categorical (ordinal) → Cumulative or Adjacent Category Logit

• Counts (discrete, positive values) → Poisson or Negative Binomial (for too little/much variance)

• Zero-inflated (pile of 0’s, then some distribution after) → ZIP/ZINB: Poisson/Negative Binomial + excess of 0’s, or Two-Part Model: Logit for Whether, Normal/Log for How Much?

• Continuous but Skewed → Log-Normal or Gamma

• Censored (piled up and cut off at one end) → Tobit


Psyc 945 Class 14a7 of 52

Means and Variances by DV Type

• Means:
– Quantitative DV: mean = sum of items / n = μy
– Binary DV: mean = number correct / n = πy, estimated as py

• Variances:
– Quantitative DV: Var(Y) = Σ(Yi − Ȳ)² / (n − 1) = sy², summing over i = 1 to n
– Binary DV: Var(Y) = py(1 − py) = pyqy = σy² (sample estimate sy²)

• In binary DVs, the variance IS determined by the mean (py)

[Figure: Var(Y) = py(1 − py) plotted as a function of py]

If 3+ options get used, the variance is NOT determined by the mean

Psyc 945 Class 14a8 of 52

A General Linear Model for Binary Outcomes?

• If Yi is a binary (0 or 1) outcome…
– Expected mean is the proportion of people who have a 1 (or “p”, the probability of Yi = 1 in the sample)

• The probability of having a 1 is what we’re trying to predict for each person, given the values of his/her predictors

• General linear model: Yi = β0 + β1xi + β2zi + ei

– β0 = expected probability when all predictors are 0

– β’s = expected change in probability for a one-unit change in the predictor

– ei = difference between observed and predicted values

– Model becomes Yi = (predicted probability of 1) + ei


Psyc 945 Class 14a9 of 52

A General Linear Model With Binary Outcomes?

• But if Yi is binary, then ei can only be 2 things:
– ei = Observed Yi minus Predicted Yi

• If Yi = 0 then ei = (0 − predicted probability)

• If Yi = 1 then ei = (1 − predicted probability)

– Mean of errors would still be 0…

– But variance of errors can’t possibly be constant over levels of X like we assume in general linear models

• The mean and variance of a binary outcome are dependent!

• This means that because the conditional mean of Y (p, the predicted probability that Y = 1) is dependent on X, so is the error variance

Psyc 945 Class 14a10 of 52

A General Linear Model With Binary Outcomes?

How can we have a linear relationship between X & Y?
• Probability of a 1 is bounded between 0 and 1, but predicted probabilities from a linear model aren’t bounded → impossible values
• Linear relationship needs to ‘shut off’ somehow → made non-linear

[Figure: two plots of Prob(Y = 1) against the X predictor, in which a straight prediction line runs outside the 0 to 1 range]


Psyc 945 Class 14a11 of 52

3 Problems with General* Linear Models for Binary Outcomes

• *General = model for continuous, normal outcome

• Restricted range (e.g., 0 to 1 for binary item)
– Predictors should not be linearly related to observed outcome → effects of predictors need to be ‘shut off’ at some point to keep predicted values of binary outcome within range

• Variance is dependent on the mean, and not estimated
– Fixed and random parts are related → so residuals can’t have constant variance

• Residuals have a limited number of possible values
– Predicted values can each only be off in two ways → so residuals can’t be normally distributed

Psyc 945 Class 14a12 of 52

Generalized Models for Binary Outcomes

• Rather than modeling the probability of a 1 directly, we need to transform it into a more continuous variable with a link function, for example:

– We could transform probability into an odds ratio:
• Odds ratio: (p / 1-p) = prob(1) / prob(0)

• If p = .7, then Odds(1) = 2.33; Odds(0) = .429

• Odds scale is way skewed, asymmetric, and ranges from 0 to +∞

– Nope, that’s not helpful

– Take natural log of odds ratio → called “logit” link
• LN (p / 1-p) = natural log of (prob(1) / prob(0))

• If p = .7, then LN(Odds(1)) = .847; LN(Odds(0)) = -.847

• Logit scale is now symmetric about 0 → DING
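For readers who want to see the transformation in action, here is a minimal sketch (not from the original slides) that converts probabilities to odds and logits and back in a SAS DATA step; the data set and variable names are hypothetical.

DATA logit_demo;
  INPUT p @@;
  odds   = p / (1 - p);                    * odds of a 1 versus a 0;
  logit  = LOG(odds);                      * natural log of the odds = the logit link;
  p_back = EXP(logit) / (1 + EXP(logit));  * inverse link recovers the original probability;
  DATALINES;
.01 .10 .30 .50 .70 .90 .99
;
RUN;
PROC PRINT DATA=logit_demo; RUN;

For p = .7 this reproduces the odds of 2.33 and logit of .847 used above, and it also shows that a probability of .01 corresponds to a logit of about -4.6.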


Psyc 945 Class 14a13 of 52

Turning Probability into Logits

• Logit is a nonlinear transformation of probability:
– Equal intervals in logits are NOT equal in probability
– The logit ranges from −∞ to +∞ and is symmetric about prob = .5 (logit = 0)
– This solves the problem of using a linear model
• The model will be linear with respect to the logit, which translates into nonlinear with respect to probability (i.e., it shuts off as needed)

Probability: p
Logit: LN (p / 1-p)
Zero-point on each scale: Prob = .5, Odds = 1, Logit = 0

Psyc 945 Class 14a14 of 52

Transforming Probabilities to Logits

[Figure: probability (0 to 1) plotted against the logit (-4 to 4), an S-shaped curve]

Probability   Logit
0.99           4.6
0.90           2.2
0.50           0.0
0.10          -2.2

Can you guess what a probability of .01 would be on the logit scale?


Psyc 945 Class 14a15 of 52

Nonlinearity in Prediction

• The relationship between X and the probability of response = 1 is “nonlinear” → an s-shaped logistic curve whose shape and location are dictated by the estimated fixed effects
– Linear with respect to the logit, nonlinear with respect to probability

• The logit version of the model will be easier to explain; the probability version of the prediction will be easier to show.

[Figure: two panels plotting the predicted response against Predictor X, with B0 = 0 and B1 = 1]

Psyc 945 Class 14a16 of 52

Predicting Logits, Odds, & Prob: Different but Equivalent Equations

• Coefficients for each Y form of the model:
– Logit: LN(pi/1-pi) = β0 + β1xi + β2zi
• Predictor effects are linear and additive like in regression, but what does a ‘change in the logit’ mean anyway?

– Odds: (pi/1-pi) = exp(β0) * exp(β1xi) * exp(β2zi)
• A compromise: effects of predictors are multiplicative

– Prob: P(yi=1) = exp(β0) * exp(β1xi) * exp(β2zi) / [1 + exp(β0) * exp(β1xi) * exp(β2zi)]

• Effects of predictors on probability are nonlinear and non-additive (no “one-unit change” language allowed)


Psyc 945 Class 14a17 of 52

Example: Categorical Predictor

• Logit: LN(p/1-p) = β0 + β1xs

– Logit Yi = -.847 + 1.695(Ms) → note: additive

– Log odds for women = -.847, for men = .847

• Odds: (p/1-p) = exp(β0) * exp(β1xs)
– Odds Yi = exp(-.847) * exp(1.695(Ms=1)) → note: multiplicative

– Multiply, then exp: Odds Ys = .429 * 5.444 = .429 for W, 2.333 for M

• Prob: P(yi=1) = exp(β0)*exp(β1xs) / [1 + (exp(β0)*exp(β1xs))]

• Prob(Yi=1) = (.429 * 5.444) / [1 + (.429 * 5.444)] = .70

• So, for men, probability(admit) = .70, odds = 2.334, log odds = 0.847
  For women, probability(admit) = .30, odds = 0.429, log odds = -0.847

Variables in the Equation (Step 1)
            B       S.E.    Wald    df   Sig.   Exp(B)
Gender      1.695   .976    3.015   1    .082   5.444
Constant    -.847   .690    1.508   1    .220   .429
Variable(s) entered on step 1: Gender.

DV = “Admission” (0 = no, 1 = yes); IV = Sex (0 = F, 1 = M)

Psyc 945 Class 14a18 of 52

Example: Continuous Predictor

• Logit: LN(p/1-p)s = β0 + β1xs
– Logit Ys = 1.776 - .275(WAISs=1) = 1.50 → note: additive

– For every one-unit change in WAIS, log odds go down by .275

• Odds: (p/1-p)s = exp(β0 + β1xs) or exp(β0) * exp(β1xs)
– Odds Ys = exp(1.776) * exp(-.275(WAISs=1)) → note: multiplicative
– Multiply, then exp: Odds Ys = 5.907 * .760 = 4.49

• Prob: P(1)s = exp(β0 + β1xs) / [1 + exp(β0 + β1xs)]
– Prob(Y=1)s = [5.907 * .760(WAISs=1)] / [1 + 5.907 * .760(WAISs=1)] = .82

• So, if WAIS = 10, prob(senility) = .86, odds = 5.91, log odds = 1.78
• So, if WAIS = 11, prob(senility) = .82, odds = 4.49, log odds = 1.50
• So, if WAIS = 16, prob(senility) = .53, odds = 1.13, log odds = 0.13
• So, if WAIS = 17, prob(senility) = .46, odds = 0.86, log odds = -0.15

Variables in the Equation (Step 1)
            B       S.E.    Wald    df   Sig.   Exp(B)
WAIS        -.275   .103    7.120   1    .008   .760
Constant    1.776   1.067   2.771   1    .096   5.907
Variable(s) entered on step 1: WAIS.

DV = “Senility” (0 = no, 1 = yes); IV = WAIS (centered so that 0 = WAIS of 10)
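As a check on the arithmetic above, the sketch below (not part of the original example) plugs the fitted coefficients from this output into the three equivalent equations; the data set name is hypothetical and WAIS is coded 0 at a score of 10, as on the slide.

DATA wais_pred;
  b0 = 1.776; b1 = -0.275;        * estimates copied from the output above;
  DO wais = 10 TO 17;
    x     = wais - 10;            * predictor coded 0 at WAIS = 10;
    logit = b0 + b1*x;            * additive on the logit scale;
    odds  = EXP(logit);           * multiplicative on the odds scale;
    prob  = odds / (1 + odds);    * nonlinear on the probability scale;
    OUTPUT;
  END;
RUN;
PROC PRINT DATA=wais_pred; VAR wais logit odds prob; RUN;

Running this reproduces the probabilities of .86, .82, .53, and .46 listed above for WAIS = 10, 11, 16, and 17.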


Psyc 945 Class 14a19 of 52

More Models for Binary Outcomes: Logit vs. Probit (“Ogive”)

• 2 Alternative Link Functions (both new Y’s go from −∞ to +∞)
– Logit link: binary Y → ln(p/1-p) → logit is new transformed Y

– Probit link: binary Y → Φ⁻¹(p) = value of the standard normal curve below which the observed proportion is found (the z-score for the area to the left for that probability) → the z-score is the new Y

• Same Model for the Means:
– Main effects and interactions of predictors as desired…

– No analog to odds ratios in probit, however

• Different Model for the Variances: in both models, ei ~ Bernoulli, but new transformed Y’s have a known variance of:
– Logit: π²/3, or 3.29 (from the logistic distribution)

– Probit: 1 (from the standard normal distribution)

Psyc 945 Class 14a20 of 52

“Threshold Model” Concept

Another way these models are explained is with the “threshold concept”

Underlying the observed 0/1 response is really a pretend continuous variable called y*, such that: if y*< 0 then y = 0 and if y*≥ 0 then y = 1, so the model predicts the probability that y* ≥ threshold at which 0 becomes 1

Accordingly, the difference between logit and probit is that the continuous underlying variable y* has a variance of 3.29 (SD = 1.8, logit) or 1.0 (probit)

[Figure: distribution of the transformed y*, with a threshold separating y = 0 from y = 1; the probit version has SD = 1 and the logit version has SD = 1.8]

Rescale to equate model coefficients: BL = 1.7 * BP
You’d think it would be 1.8 to rescale, but it’s 1.7…


Psyc 945 Class 14a21 of 52

More Models for Binary DVs:

Logit = Probit * 1.7
LINK=LOGIT, DIST=BINARY in SAS GLIMMIX, but no probit (both are in Mplus via CATEGORICAL ARE)

Log-Log is for DVs in which 1 is more frequent
LINK=LOGLOG, DIST=BINARY in SAS GLIMMIX (not in Mplus)

Complementary Log-Log is for DVs in which 0 is more frequent
LINK=CLOGLOG, DIST=BINARY in SAS GLIMMIX (not in Mplus)
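To make the GLIMMIX options above concrete, here is a minimal single-level sketch; the data set admit_dat and the variables admit (0/1) and gender are hypothetical stand-ins, not code from the slides.

PROC GLIMMIX DATA=admit_dat;
  MODEL admit (EVENT='1') = gender / SOLUTION LINK=LOGIT DIST=BINARY;
RUN;
* Swapping LINK=LOGIT for LINK=LOGLOG or LINK=CLOGLOG requests the other binary links named above;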

Psyc 945 Class 14a22 of 52

Cumulative Logit Model for c Ordered Categories (Ordinal)

• Models the probability of lower vs. higher cumulative categories via c-1 submodels (e.g., if c = 4):
0 vs. 1,2,3    0,1 vs. 2,3    0,1,2 vs. 3

• Submodels can be phrased with intercepts or thresholds:
– Intercept = logit of probability of higher response (=thres*-1)

– Threshold = logit of probability of lower response (=int*-1)

• Logit of 0 = 1 – int1, or = thres1 – 0
Logit of 1 = int1 – int2, or = thres1 – thres2
Logit of 2 = int2 – int3, or = thres2 – thres3
Logit of 3 = int3 – 0, or = 1 – thres3



Psyc 945 Class 14a23 of 52

Cumulative Logit Model for c Ordered Categories (Ordinal)

• Because the middle probabilities are not given directly, it is called an “indirect” or “difference” model

• Can do this in SAS GLIMMIX (LINK=CLOGIT, DIST=MULT) and in Mplus with CATEGORICAL ARE (= “graded response model” in IRT)

• In SAS, also cumulative versions of log-log (LINK=CUMLOGLOG) and complementary log-log (LINK=CUMCLL), DIST=MULT for each

• Cumulative submodels are each estimated using all the data (good), but ordering is assumed (maybe bad)

• Effects of predictors are assumed the same across submodels → “Proportional odds assumption”
– Is testable in Mplus or SAS PROC GLIMMIX via nominal models

– Is testable more flexibly in SAS PROC NLMIXED
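As a rough illustration of the options named above, here is a minimal single-level sketch of a cumulative logit model; the data set surveydat, ordinal outcome rating, and predictor x are hypothetical stand-ins.

PROC GLIMMIX DATA=surveydat;
  MODEL rating = x / SOLUTION LINK=CLOGIT DIST=MULT;
RUN;

Consistent with the proportional odds assumption described above, c-1 intercepts are estimated (one per cumulative submodel) but only a single effect of x.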

Psyc 945 Class 14a24 of 52

Adjacent Category Logit Model for c Ordered Categories (Ordinal)

• Models the probability of a response in any category vs. the next via c-1 submodels (e.g., if c = 4):

0 vs. 1 1 vs. 2 2 vs. 3

• Submodels can be phrased with intercepts or thresholds:
– Intercept = logit of probability of higher response (=thres*-1)

– Threshold = logit of probability of lower response (=int*-1)

• Logit of 0 = 1 – int1 or = thres1 – 0
Logit of 1 = 1 – int2 or = thres1 – 0
Logit of 2 = 1 – int3 or = thres2 – 0
Logit of 3 = int3 – 0 or = 1 – thres3



Psyc 945 Class 14a25 of 52

Adjacent Category Logit Model for c Ordered Categories (Ordinal)

• Because the middle probabilities are given directly, it is called a “direct” or “divide by total” model

• Not available in Mplus or SAS PROC GLIMMIX
– But can specify it yourself using NLMIXED

– Is “partial credit model” in IRT

• Submodels are each estimated using only the data for those adjacent categories (bad), but ordering is not assumed (good)

• Effects of predictors are assumed the same across submodels → “Proportional odds assumption”
– Is testable more flexibly in SAS PROC NLMIXED

Psyc 945 Class 14a26 of 52

Baseline Category Logit Model for c Ordered Categories (Nominal)

• Models the probability of a response in any category vs. the reference category via c-1 submodels (e.g., if c = 4):

0 vs. 1 0 vs. 2 0 vs. 3

• Assumes no order whatsoever; is “direct” model
– Can do this in SAS GLIMMIX (LINK=GLOGIT, DIST=MULT) and in Mplus with NOMINAL ARE (= “nominal response model” in IRT)

• Useful for testing proportional odds assumption because all fixed and random effects are estimated separately per submodel, but may be hard to estimate as a result



Psyc 945 Class 14a27 of 52

Generalized Multilevel Models for Non-Normal Outcomes

• Same components as generalized models:
– Link function to transform the DV into something continuous

– Linear predictive model (linear for link-transformed DV)

– Alternative distribution of errors assumed

• The difference is that we will add random effects (i.e., additional piles of variance) to address dependency in longitudinal or clustered data
– Piles of variance are ADDED TO, not EXTRACTED FROM, the original residual variance pile when it is fixed to a known value

– Thus, some concepts translate exactly from general multilevel models, but some don’t

Psyc 945 Class 14a28 of 52

GEE: An Alternative Approach

• GEE = Generalized Estimating Equations → “Marginal Model” or “Population Average Model”

– Assumes outcome at each time point is univariate normal, but it does not assume multivariate normality of outcomes over time

– Uses link functions for discrete outcomes (logit, probit, etc)

– Treats variances model as ‘nuisance’ → ‘deals with’ effects of correlation on SEs instead of modeling it (no random effects; all in R matrix)

– Will provide accurate tests of fixed effects EVEN IF the variances model is way off, so it works well for non-normally distributed outcomes (no assumptions made about distribution of residuals)…

– … EXCEPT that it assumes MCAR (which never holds) instead of MAR

– … And that it can’t handle unbalanced time

– … And it’s based on quasi-likelihood, so model comparisons are problematic (stay tuned)

– … So never mind.


Psyc 945 Class 14a29 of 52

Empty Multilevel Logistic Model for Binary Outcomes

• Level 1: Logit(yti) = β0i

• Level 2: β0i = γ00 + U0i

• Composite: Logit(yti) = γ00 + U0i

• eti residual variance is not estimated → π²/3 = 3.29
– (Known) residual is in model for actual Y, not prob(Y) or logit(Y)

• Logistic ICC = Var(U0i) / [Var(U0i) + Var(eti)] = Var(U0i) / [Var(U0i) + 3.29]

• Can do -2LL difference test to see if Var(U0i) > 0 (although the ICC is somewhat problematic due to non-constant residual variance)

Note what’s NOT in level 1…
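A minimal sketch of this empty model in SAS GLIMMIX, assuming a hypothetical long-format data set longdat with a 0/1 outcome y and a person identifier id:

PROC GLIMMIX DATA=longdat METHOD=QUAD;
  CLASS id;
  MODEL y (EVENT='1') = / SOLUTION LINK=LOGIT DIST=BINARY;
  RANDOM INTERCEPT / SUBJECT=id;
RUN;
* Logistic ICC = Var(U0i) / (Var(U0i) + 3.29), using the random intercept variance from the covariance parameter estimates table;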

Psyc 945 Class 14a30 of 52

Random Linear Time Model for Binary Outcomes

• Level 1: Logit(yti) = β0i + β1i(timeti)

• Level 2: β0i = γ00 + U0i

β1i = γ10 + U1i

• Combined: Logit(yti) = (γ00 + U0i) + (γ10 + U1i)(timeti)

• eti residual variance is still not estimated → π²/3 = 3.29

• Can test new fixed or random effects with -2LL difference tests (or Wald test p-values for fixed effects)
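Continuing the hypothetical GLIMMIX sketch from the previous slide, adding the fixed and random linear time slope would look like this (time is assumed to be a variable in longdat):

PROC GLIMMIX DATA=longdat METHOD=QUAD;
  CLASS id;
  MODEL y (EVENT='1') = time / SOLUTION LINK=LOGIT DIST=BINARY;
  RANDOM INTERCEPT time / SUBJECT=id TYPE=UN;
RUN;
* Because METHOD=QUAD provides a true likelihood, the -2LL from this model can be compared with the -2LL from the empty model;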


Psyc 945 Class 14a31 of 52

Random Linear Time Model for Ordinal Outcomes (c=3)

• L1: Logit(yti1) = β0i1 + β1i1(timeti)
  Logit(yti2) = β0i2 + β1i2(timeti)

• L2: β0i1 = γ001 + U0i1    β1i1 = γ101 + U1i1

  β0i2 = γ002 + U0i2    β1i2 = γ102 + U1i2

• Assumes proportional odds → γ001 ≠ γ002, but γ101 = γ102, U0i1 = U0i2, and U1i1 = U1i2

– Testable via nominal model (all unequal) or using NLMIXED to write a custom model in which some can be constrained

– eti residual variance is still not estimated → π²/3 = 3.29

Psyc 945 Class 14a32 of 52

New Interpretation of Fixed Effects

• In general linear mixed models, the fixed effects are interpreted as the ‘average’ effect for the sample
– γ00 is ‘sample average’ intercept

– U0i is ‘individual deviation from sample average’

• What ‘average’ means in generalized linear mixed models is different, because the natural log is a nonlinear function:
– So the mean of the logs ≠ log of the means

– Therefore, the fixed effects are not the ‘sample average’ effect; they are the effect specifically for Ui = 0

• Fixed effects are conditional on the random effects

• This gets called a “unit-specific” or “subject-specific” model

• This distinction does not exist for normally distributed outcomes


Psyc 945 Class 14a33 of 52

New Comparisons across Models without Estimated Residual Variance

• NEW RULE: Coefficients cannot be compared across models, because they are not on the same scale! (see Bauer, 2009)

• e.g., if residual variance = 3.29 in binary models:
– When adding a random intercept to an empty model, the total variation in the outcome has increased → the fixed effects will increase in size because they are unstandardized slopes

– Residual variance can’t decrease due to effects of level-1 predictors, so all other estimates have to go up to compensate

• If Xti is uncorrelated with other X’s and is a pure level-1 variable (ICC ≈ 0), then fixed effects and SD(U0i) will increase by the same factor

– Random effects variances can decrease, though, so level-2 effects should be on the same scale across models if level-1 is the same

γmixed ≈ γfixed × √[ (Var(U0i) + 3.29) / 3.29 ]

Psyc 945 Class 14a34 of 52

A Little Bit about Estimation

• Goal: End up with maximum likelihood estimates for all model parameters (because they are consistent, efficient)
– When we have a V matrix based on multivariate normally distributed eti residuals at level 1 and multivariate normally distributed Ui terms at level 2, ML is easy

– When we have a V matrix based on multivariate Bernoulli distributed eti residuals at level-1 and multivariate normally distributed Ui terms at level 2, ML is much harder

• Same with any other kind of model for a “not normal” level-1 residual

• ML does not assume normality unless you fit a “normal” model!

• 3 main families of estimation approaches:
– Quasi-Likelihood methods (“marginal/penalized quasi ML”)

– Numerical Integration (“adaptive Gaussian quadrature”)

– Also Bayesian methods (MCMC, newly available in SAS or Mplus)


Psyc 945 Class 14a35 of 52

2 Main Types of Estimation

• Quasi-Likelihood methods → older methods
– “Marginal QL” → approximation around fixed part of model
– “Penalized QL” → approximation around fixed + random parts

– These both underestimate variances (MQL more so than PQL)

– 2nd-order PQL is supposed to be better than 1st-order MQL

– QL methods DO NOT PERMIT MODEL -2LL COMPARISONS

– HLM program adds Laplace approximation to QL, which then does permit -2LL comparisons (also in SAS GLIMMIX and STATA xtmelogit)

• ML via Numerical Integration → more available nowadays
– Better estimates, but can take for-freaking-ever

– DOES permit regular model -2LL comparisons

– Will blow up with many random effects (which make the model exponentially more complex, especially in these models)

– Good idea to use PQL to get start values first to then use in integration

Psyc 945 Class 14a36 of 52

ML via Numerical Integration

• Gold standard of estimation for non-normal outcomes

• Relies on two assumptions of independence:
– Level-1 residuals are independent after controlling for level-2 random effects → “local independence” in CFA/IRT contexts
• This means that the joint probability (likelihood) of two observations is just the probability of each multiplied together

– Level-2 units are independent (no additional clustering or nesting)
• You can add Level-3 random effects to handle dependence, but then the assumption is “independent given L3 random effects”

• Doesn’t assume it knows the individual U values, but does assume that the U values have a multivariate normal distribution (cannot be modified in most programs)


Psyc 945 Class 14a37 of 52

ML via Numerical Integration

• Step 1: Select starting values for all fixed effects

• Step 2: Compute the likelihood of each observation given the current parameter values
– Model tells you probability(y=1) given model parameters…

– Likelihood (binary data given parameters) = ∏ p^y (1-p)^(1-y)

• But what about the U’s? We don’t have them yet… no problem!

• Computing the likelihood value for each set of possible parameters involves removing the individual U values from the equation – by integrating across possible U values for each Level-2 unit
– “Integration” is like summing the area under the curve

– Integration is accomplished by “Gaussian Quadrature” → summing up rectangles that approximate the integral for each Level-2 unit

Psyc 945 Class 14a38 of 52

ML via Numerical Integration

• Step 2 (still): Divide the U distribution into rectangles → “Gaussian Quadrature” (# rectangles = # “quadrature points”)

– Can either divide the whole distribution into rectangles, or take the most likely section for each level-2 unit and rectangle that

• This is “adaptive quadrature” and is computationally more demanding, but gives more accurate results with fewer rectangles

The likelihood of each level-2 unit’s outcomes at each U rectangle is then weighted by that rectangle’s probability of being observed (from the multivariate normal distribution). The weighted likelihoods are then summed across all rectangles…

ta da! “numerical integration”


Psyc 945 Class 14a39 of 52

Example of Numeric Integration: Binary DV, Fixed Linear, Random Intercept Model

1. Start with values for fixed effects: intercept γ00 = 0.5, time slope γ10 = 1.5

2. Compute likelihood for real data based on fixed effects and plausible U0i values (-2, 0, 2) using the model: logit(yti=1) = γ00 + γ10(timeti) + U0i

• Here for one person at two occasions with yti=1 at both occasions

Columns: Logit(yti) | Prob (IF yti=1) | 1-Prob (IF yti=0) | Likelihood if both y=1 | Theta prob | Theta width | Product per Theta

U0i = -2
  Time 0:  0.5 + 1.5(0) - 2 = -1.5   Prob = 0.18   1-Prob = 0.82   Likelihood = 0.091213   Theta prob = 0.05   width = 2   Product = 0.00912
  Time 1:  0.5 + 1.5(1) - 2 =  0.0   Prob = 0.50   1-Prob = 0.50

U0i = 0
  Time 0:  0.5 + 1.5(0) + 0 =  0.5   Prob = 0.62   1-Prob = 0.38   Likelihood = 0.54826    Theta prob = 0.40   width = 2   Product = 0.43861
  Time 1:  0.5 + 1.5(1) + 0 =  2.0   Prob = 0.88   1-Prob = 0.12

U0i = 2
  Time 0:  0.5 + 1.5(0) + 2 =  2.5   Prob = 0.92   1-Prob = 0.08   Likelihood = 0.90752    Theta prob = 0.05   width = 2   Product = 0.09075
  Time 1:  0.5 + 1.5(1) + 2 =  4.0   Prob = 0.98   1-Prob = 0.02

Overall Likelihood (Sum of Products over All Thetas): 0.53848

(do this for each occasion, then multiply this whole thing over all people) 

(repeat with new values of fixed effects until find highest overall likelihood) 
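The same three-rectangle sum can be reproduced in a few lines; the following DATA step is a sketch added for illustration (it is not part of the original slides) and simply re-computes the table above.

DATA quad_demo;
  ARRAY u[3]     _TEMPORARY_ (-2 0 2);          * rectangle locations for U0i;
  ARRAY uprob[3] _TEMPORARY_ (0.05 0.40 0.05);  * normal-curve height of each rectangle;
  width = 2;                                    * rectangle width;
  g00 = 0.5; g10 = 1.5;                         * fixed effect values currently being tried;
  marginal = 0;
  DO r = 1 TO 3;
    p0 = 1 / (1 + EXP(-(g00 + g10*0 + u[r])));  * prob of y=1 at time 0 for this U0i;
    p1 = 1 / (1 + EXP(-(g00 + g10*1 + u[r])));  * prob of y=1 at time 1 for this U0i;
    marginal = marginal + p0*p1*uprob[r]*width; * y=1 at both occasions, weighted by the rectangle;
  END;
  PUT marginal=;                                * prints about 0.538, matching the overall likelihood above;
RUN;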

Psyc 945 Class 14a40 of 52

ML via Numerical Integration

• Step 3: Decide if the algorithm should stop

– Algorithm should stop once values of the likelihood ‘don’t change much’ from iteration to iteration

• “Convergence criterion” is typically a tiny number like 0.000001

– However, if the change in likelihood value from the previous iteration to the current iteration is larger than your criterion…

• Step 4: Choose new parameter values (repeat as needed)

– Many methods are available to make the selection:
• Newton-Raphson: uses the shape of the likelihood function to find parameter values closer to the maximum; provides SEs based on observed information (which comes from second derivatives of the LL)

• Fisher Scoring: same idea, but SEs based on expected information (which is generally not recommended)

• EM algorithm: U’s are treated as missing data (SEs based on expected info)

• Mplus alternates between approaches to maximize efficiency


Psyc 945 Class 14a41 of 52

Tobit Models: Censored Outcomes

• For outcomes with ceiling or floor effects
– Can be “Right censored” and/or “left censored”

– Also “inflated” or not → inflation = binary variable in which 1 = censored, 0 = not censored

• Model assumes continuous normal distribution for y* instead (y without missing responses; no transformation otherwise)

• In Mplus (not in GLIMMIX), use CENSORED ARE (with options):
– CENSORED ARE y1 (a) y2 (b) y3 (ai) y4 (bi);

• y1 is censored from above (right); y2 is censored from below (left)

• y3 is censored from above (right) and has inflation variable (inflated: y3#1)

• y4 is censored from below (left) and has inflation variable (inflated: y4#1)

– So, can predict distribution of y1-y4, as well as whether or not y3 and y4 are censored (“inflation”) as separate outcomes

• y3 ON x; x predicts value of Y if at censoring point or above

• y3#1 ON x; x predicts whether Y is censored (1) or not (0)

Psyc 945 Class 14a42 of 52

Models for Proportions

• DVs may be bounded from 0-1, even if they aren’t binary (% correct)

• Need logit link to prevent predicted Y from going out of 0-1 bounds
– Predicted Y “shuts off” as it approaches 0 or 1, same as for binary DVs

• But because Y is not binary, it uses the binomial distribution for its residuals, such that Y is modeled as: Y = #events / #trials
– Can be predicted as 2 variables in SAS GLIMMIX when #trials varies:
• MODEL events/trials = (predictors) / SOLUTION LINK=LOGIT DIST=BIN;

– Or can be predicted directly as a proportion in SAS GLIMMIX otherwise:
• MODEL accuracy = (predictors) / SOLUTION LINK=LOGIT DIST=BIN;

– Normal distribution for residuals only works ok if the mean ~ .50

• In single-level models, you can also estimate a “dispersion” or “scale” parameter that allows the residual variance to exceed what it is predicted to be (from p*[1-p]), but this doesn’t seem to be possible using true ML in GLIMMIX for multilevel models…
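Putting the events/trials form above into a full multilevel sketch, with the hypothetical names accdat, events, trials, x, and id standing in for real ones:

PROC GLIMMIX DATA=accdat METHOD=QUAD;
  CLASS id;
  MODEL events/trials = x / SOLUTION LINK=LOGIT DIST=BIN;
  RANDOM INTERCEPT / SUBJECT=id;
RUN;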


Psyc 945 Class 14a43 of 52

Models for Skewed Continuous DVs….

Log-Normal: y* = LN(yi), so LN(ei) – but not ei – is assumed normally distributed

-- LINK=IDENTITY, DIST=LOGNORMAL in SAS GLIMMIX (not in Mplus)
-- Or just log-transform Y, fit the exact same model using MIXED!
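A minimal sketch of the “just log-transform Y” route, assuming a hypothetical long-format data set rtdat with the RT_sec outcome used later in this lecture, a time predictor, and a person identifier id:

DATA rtdat; SET rtdat;
  logRT = LOG(RT_sec);   * natural log of the positively skewed outcome;
RUN;
PROC MIXED DATA=rtdat COVTEST;
  CLASS id;
  MODEL logRT = time / SOLUTION;
  RANDOM INTERCEPT time / SUBJECT=id TYPE=UN;
RUN;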

Psyc 945 Class 14a44 of 52

Models for Skewed Continuous DVs….

Gamma: y* = LN(y), and then LN(e) is assumed gamma distributed with “shape” and “scale” parameters
-- LINK=LOG, DIST=GAMMA in SAS GLIMMIX (not in Mplus)


Psyc 945 Class 14a45 of 52

RT Data from Chapter 15

Psyc 945 Class 14a46 of 52

SAS Code to Make this Type of Plot

ODS RTF FILE="&filesave./RT Histogram.rtf";

PROC UNIVARIATE DATA=&datafile.;
  VAR RT_sec;
  HISTOGRAM RT_sec / OUTHISTOGRAM=RTdistN  NORMAL(COLOR=red W=5 MU=est SIGMA=est) HAXIS=AXIS2;
  HISTOGRAM RT_sec / OUTHISTOGRAM=RTdistLN LOGNORMAL(COLOR=purple W=5 SIGMA=est ZETA=est) HAXIS=AXIS2;
  HISTOGRAM RT_sec / OUTHISTOGRAM=RTdistG  GAMMA(COLOR=blue W=5 ALPHA=est SIGMA=est) HAXIS=AXIS2;
  * Settings for X-axis;
  AXIS2 LABEL=(HEIGHT=1.5 "RT in Seconds") ORDER=(0 TO 60 BY 5);
RUN;

ODS RTF CLOSE;

* Rename predicted curve variable, merge into single data set, export to excel;
DATA RTdistN;  SET RTdistN;  Normal=_EXPPCT_;    DROP _CURVE_ _EXPPCT_; RUN;
DATA RTdistLN; SET RTdistLN; LogNormal=_EXPPCT_; DROP _CURVE_ _EXPPCT_; RUN;
DATA RTdistG;  SET RTdistG;  Gamma=_EXPPCT_;     DROP _CURVE_ _EXPPCT_; RUN;
DATA RTdist;   MERGE RTdistN RTdistLN RTdistG;   BY _MIDPT_; RUN;


Psyc 945 Class 14a47 of 52

Models for Count Outcomes

• Counts: non-negative integer unbounded responses
– e.g., how many cigarettes did you smoke this week?

• Poisson and negative binomial models
– y* = LN(y), which makes the count stay positive

– Residuals follow 1 of 2 distributions:
• Poisson distribution in which λ = Mean = Variance

• Negative binomial distribution that includes a new α “scaling” or “dispersion” parameter that allows the variance to be bigger than the mean → variance = λ(1 + λα) if α > 0

– Negative binomial mixes gamma distribution with Poisson

– Poisson is nested within negative binomial (can test whether α ≠ 0)

– In Mplus: COUNT ARE y1 (p) y2 (nb);

– SAS GLIMMIX: LINK=LOG, DIST=Poisson or DIST=Negbin
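A minimal sketch of the GLIMMIX count syntax above, assuming a hypothetical data set cigdat with a count outcome cigs, a predictor x, and a person identifier id:

PROC GLIMMIX DATA=cigdat METHOD=QUAD;
  CLASS id;
  MODEL cigs = x / SOLUTION LINK=LOG DIST=POISSON;
  RANDOM INTERCEPT / SUBJECT=id;
RUN;
* Refitting with DIST=NEGBIN adds the dispersion parameter, and the Poisson and negative binomial -2LL values can then be compared as described above;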

Psyc 945 Class 14a48 of 52

Models for Count DVs….

Poisson: y* = LN(y), and LN(e) is assumed Poisson distributed such that mean = variance, α dispersion = 0

Negative Binomial: y* = LN(y), and LN(e) is assumed Gamma distributed, such that varianceNB = varianceP(1 + varianceP*α)


Psyc 945 Class 14a49 of 52

Modeling Zeros in Count Data

• No zeros → zero-truncated negative binomial

– e.g., how many days were you in the hospital? (has to be >0)

– In Mplus only: COUNT ARE y1 (nbt);

• Too many zeros → zero-inflated Poisson or NB
– e.g., # cigarettes smoked when asked in non-smokers too

– In Mplus only: COUNT ARE y2 (pi) y3 (nbi);
• Refer to “inflation” variable as y2#1 or y3#1

– Tries to distinguish two kinds of zeros
• “Structural zeros” – would never do it

– Inflation is predicted as logit of being a structural zero

• “Expected zeros” – could do it, just didn’t (part of regular count)

– Count with expected zeros predicted by Poisson or NB

– Poisson or NB without inflation is nested within the same models with inflation (and Poisson is nested within NB)

Psyc 945 Class 14a50 of 52

Zero-Inflated Poisson and Negative Binomial Distributions

Zero-inflated distributions have extra “structural zeros” not expected from Poisson or NB (“stretched Poisson”) distributions.

This can be tricky to estimate and interpret because the model distinguishes between kinds of zeros rather than zero or not...

Image borrowed from Atkins & Gallop, 2007


Psyc 945 Class 14a51 of 52

Modeling Zeros More Generally

• Other more direct ways of dealing with too many zeros: split distribution into (0 or not) and (if not 0, how much)
– Negative binomial “hurdle” (or “zero-altered” negative binomial)

• In Mplus: COUNT ARE y1 (nbh);

• 0 or not: predicted by logit of being a 0 (“0” is the higher category)

• How much: predicted by non-zero NB for counts (integers only)

– In Mplus: Two-part model uses the DATA TWOPART: command
• NAMES ARE y1-y4; list outcomes to be split into 2 parts

• CUTPOINT IS 0; where to split observed outcomes

• BINARY ARE b1-b4; create names for “0 or not” part

• CONTINUOUS ARE c1-c4; create names for “how much” part

• TRANSFORM IS LOG; transformation of continuous part

• 0 or not: predicted by logit of being NOT 0 (“something” is the 1)

• How much: predicted as continuous distribution (like log or normal)

Psyc 945 Class 14a52 of 52

Summary: Differences in Generalized Mixed Models

• Analyze link-transformed DV (e.g., via logit, log, log-log…)
– Linear relationship between X’s and transformed Y
– Nonlinear relationship between X’s and original Y
• Original e residuals are assumed to follow some non-normal distribution

• In models for binary or count data, Level-1 residual variance is set
– So it can’t go down after adding level-1 predictors, which means that the scale of everything else has to go UP to compensate

– Scale of model will also be different after adding random effects for the same reason – the total variation in the model is now bigger

– Fixed effects may not be comparable across models as a result

• Estimation is trickier and takes longer
– Numerical integration is best but may blow up in complex models

– Start values are often essential (can get those with MSPL estimator)