Categorical Data Analysis
DESCRIPTION
Categorical Data Analysis. Week 2. Binary Response Models: binary and binomial responses (binary: y assumes values of 0 or 1; binomial: y is the number of "successes" in n "trials"); Bernoulli and binomial distributions; the transformational approach; the linear probability model.
Categorical Data Analysis
Week 2
Binary Response Models binary and binomial responses
binary: y assumes values of 0 or 1 binomial: y is number of “successes” in n “trials”
distributions
Bernoulli: \Pr(y \mid p) = p^y (1-p)^{1-y}
Binomial: \Pr(y \mid n, p) = \binom{n}{y} p^y (1-p)^{n-y}
Transformational Approach: linear probability model
use grouped data (events/trials): p_i = y_i / n_i
"identity" link: p_i = \eta_i
linear predictor: \eta_i = \mathbf{x}_i'\boldsymbol{\beta}
problems of prediction outside [0,1]
The Logit Model
logit transformation:
\mathrm{logit}(p_i) = \log\!\left(\frac{p_i}{1-p_i}\right) = \mathbf{x}_i'\boldsymbol{\beta}
inverse logit:
p_i = \frac{\exp(\mathbf{x}_i'\boldsymbol{\beta})}{1+\exp(\mathbf{x}_i'\boldsymbol{\beta})}
ensures that p_i is in [0,1] for all values of x and \beta.
The Logit Model
odds and odds ratios are the key to understanding and interpreting this model
the log odds transformation is a “stretching” transformation to map probabilities to the real line
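As a quick numeric illustration of the stretch, Stata's built-in logit() and invlogit() functions map between the two scales (a minimal sketch):
display logit(.5)        // 0: even odds sit at the center of the real line
display logit(.9)        // 2.197
display logit(.99)       // 4.595: probabilities near 1 are stretched far out
display invlogit(2.197)  // .9, recovering the probability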
Odds and Probabilities
[Figure: probability (0 to 1) plotted against odds (0 to 30)]
Probabilities and Log Odds
[Figure: probability (0 to 1) plotted against log(odds) (−6 to 6)]
The Logit Transformation: properties of logit
[Figure: p (0 to 1) against the logit (−6 to 6), with a linear reference line]
Odds, Odds Ratios, and Relative Risk
odds of "success" is the ratio: \frac{p}{1-p}
consider two groups with success probabilities p_1 and p_2
odds ratio (OR) is a measure of the odds of success in group 1 relative to group 2:
OR = \frac{p_1/(1-p_1)}{p_2/(1-p_2)}
Odds Ratio
2 × 2 table:
        y = 0   y = 1
x = 0   50      15
x = 1   15      20
OR is the cross-product ratio (compare x = 1 group to x = 0 group):
\widehat{OR} = \frac{50 \times 20}{15 \times 15} = 4.44
odds of y = 1 are about 4 times higher when x = 1 than when x = 0
Odds Ratio: equivalent interpretation
odds of y = 1 are 0.225 times as high when x = 0 as when x = 1
odds of y = 1 are 1 − 0.225 = 0.775, i.e., 77.5% lower when x = 0 than when x = 1:
\widehat{OR}^{-1} = \frac{15 \times 15}{50 \times 20} = 0.225
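Both cross-product ratios are easy to verify in Stata:
display (50*20)/(15*15)    // 4.44
display (15*15)/(50*20)    // 0.225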
Log Odds Ratios
Consider the model:
\mathrm{logit}(p_i) = \beta_0 + \beta_1 D_i
D is a dummy variable coded 1 if group 1 and 0 otherwise.
group 1: \mathrm{logit}(p_i) = \beta_0 + \beta_1    group 2: \mathrm{logit}(p_i) = \beta_0
LOR: \beta_1    OR: \exp(\beta_1)
Relative Risk
similar to OR, but works with rates
rate: r = \#\text{Events} / \text{Exposure}
relative risk or rate ratio (RR) is the rate in group 1 relative to group 2:
RR = r_1 / r_2
OR \rightarrow RR as p \rightarrow 0.
Tutorial: odds and odds ratios
consider the following data
Tutorial: odds and odds ratios
read table:
clear
input educ psex f
0 0 873
0 1 1190
1 0 533
1 1 1208
end
label define edlev 0 "HS or less" 1 "Col or more"
label val educ edlev
label var educ education
Tutorial: odds and odds ratios compute odds:
verify by hand
tabodds psex educ [fw=f]
educ        cases   controls   odds      [95% Conf. Interval]
HS or l~s   1190    873        1.36312   1.24911   1.48753
Col or ~e   1208    533        2.26642   2.04681   2.50959

Test of homogeneity (equal odds): chi2(1) = 55.48   Pr>chi2 = 0.0000
Score test for trend of odds:     chi2(1) = 55.48   Pr>chi2 = 0.0000
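To verify by hand, each group's odds is simply cases over controls:
display 1190/873    // 1.36312 (HS or less)
display 1208/533    // 2.26642 (Col or more)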
Tutorial: odds and odds ratios compute odds ratios:
verify by hand
tabodds psex educ [fw=f], or
educ        Odds Ratio   chi2    P>chi2   [95% Conf. Interval]
HS or l~s   1.000000     .       .        .          .
Col or ~e   1.662674     55.48   0.0000   1.452370   1.903429

Test of homogeneity (equal odds): chi2(1) = 55.48   Pr>chi2 = 0.0000
Score test for trend of odds:     chi2(1) = 55.48   Pr>chi2 = 0.0000
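The reported OR is the ratio of the two group odds computed above:
display (1208/533)/(1190/873)    // 1.662674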
Tutorial: odds and odds ratios
stat facts: variances of functions
used in statistical significance tests and in forming confidence intervals
basic rule for variances of linear transformations: if g(x) = a + bx is a linear function of x, then
\mathrm{var}[g(x)] = b^2 \, \mathrm{var}(x)
this is a trivial case of the delta method applied to a single variable
the delta method for the variance of a nonlinear function g(x) of a single variable is
\mathrm{var}[g(x)] \approx [g'(x)]^2 \, \mathrm{var}(x)
Tutorial: odds and odds ratios
stat facts: variances of odds and odds ratios
we can use the delta method to find the variance of the odds and the odds ratio
from the asymptotic (large-sample theory) perspective it is best to work with log odds and log odds ratios
the log odds ratio converges to normality at a faster rate than the odds ratio, so statistical tests may be more appropriate on log odds ratios (nonlinear functions of p)
\mathrm{var}(\log \widehat{\mathrm{odds}}) = \left[\frac{1}{\hat{p}(1-\hat{p})}\right]^2 \mathrm{var}(\hat{p})
Tutorial: odds and odds ratios
stat facts:
the log odds ratio is the difference in the log odds for two groups
the groups are independent, and the variance of a difference is the sum of the variances:
\mathrm{var}(\log \widehat{OR}) = \mathrm{var}(\log \widehat{\mathrm{odds}}_1) + \mathrm{var}(\log \widehat{\mathrm{odds}}_2)
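With cell counts a, b, c, d in the 2 × 2 table, this sum reduces to the familiar 1/a + 1/b + 1/c + 1/d formula for the variance of the log OR; a quick check against the model-based standard error reported in the glm output below:
display sqrt(1/1190 + 1/873 + 1/533 + 1/1208)    // .0685 = SE of the log OR
display .1138634/1.662674                        // .0685 = SE(OR)/OR, the delta method in reverse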
Tutorial: odds and odds ratios
data structures: grouped or individual level
note: use frequency weights to handle grouped data, or "expand" the data by the frequency weights, resulting in individual-level data
model results from either data structure are the same
expand the data and verify the results (see the check below):
expand f
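After expanding, an unweighted fit should reproduce the frequency-weighted results (a minimal check):
logit psex educ, or    // matches logit psex educ [fw=f], or on the grouped data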
Tutorial: odds and odds ratios
statistical modeling
logit model (logit): logit psex educ [fw=f], or
logit model (glm):   glm psex educ [fw=f], f(b) eform
Tutorial: odds and odds ratios statistical modeling (#1)
logit model (glm):
Generalized linear models                 No. of obs       = 3804
Optimization     : ML                     Residual df      = 3802
                                          Scale parameter  = 1
Deviance         = 4955.871349            (1/df) Deviance  = 1.303491
Pearson          = 3804                   (1/df) Pearson   = 1.000526
Variance function: V(u) = u*(1-u)         [Bernoulli]
Link function    : g(u) = ln(u/(1-u))     [Logit]
                                          AIC              = 1.303857
Log likelihood   = -2477.935675           BIC              = -26387.09

                    OIM
psex   Odds Ratio   Std. Err.     z     P>|z|   [95% Conf. Interval]
educ   1.662674     .1138634    7.42    0.000   1.453834   1.901512
Tutorial: odds and odds ratios
statistical modeling (#2) some ideas from alternative normalizations
what parameters will this model produce? what is the interpretation of the "constant"?
gen cons = 1
glm psex cons educ [fw=f], nocons f(b) eform
Tutorial: odds and odds ratios
statistical modeling (#2)
Generalized linear models                 No. of obs       = 3804
Optimization     : ML                     Residual df      = 3802
                                          Scale parameter  = 1
Deviance         = 4955.871349            (1/df) Deviance  = 1.303491
Pearson          = 3804                   (1/df) Pearson   = 1.000526
Variance function: V(u) = u*(1-u)         [Bernoulli]
Link function    : g(u) = ln(u/(1-u))     [Logit]
                                          AIC              = 1.303857
Log likelihood   = -2477.935675           BIC              = -26387.09

                    OIM
psex   Odds Ratio   Std. Err.     z     P>|z|   [95% Conf. Interval]
cons   1.363116     .0607438    6.95    0.000   1.249111   1.487525
educ   1.662674     .1138634    7.42    0.000   1.453834   1.901512
Tutorial: odds and odds ratios
statistical modeling (#3)
what parameters does this model produce? how do you interpret them?
gen lowed = educ == 0
gen hied = educ == 1
glm psex lowed hied [fw=f], nocons f(b) eform
Tutorial: odds and odds ratios
statistical modeling (#3)
Generalized linear models                 No. of obs       = 3804
Optimization     : ML                     Residual df      = 3802
                                          Scale parameter  = 1
Deviance         = 4955.871349            (1/df) Deviance  = 1.303491
Pearson          = 3804                   (1/df) Pearson   = 1.000526
Variance function: V(u) = u*(1-u)         [Bernoulli]
Link function    : g(u) = ln(u/(1-u))     [Logit]
                                          AIC              = 1.303857
Log likelihood   = -2477.935675           BIC              = -26387.09

                    OIM
psex   Odds Ratio   Std. Err.     z     P>|z|   [95% Conf. Interval]
lowed  1.363116     .0607438    6.95    0.000   1.249111   1.487525
hied   2.266417     .1178534   15.73    0.000   2.046809   2.509586
are these odds ratios?
Tutorial: prediction fitted probabilities (after most recent model)
predict p, mu
tab educ [fw=f], sum(p) nostandard nofreq
            Summary of predicted mean psex
education   Mean         Obs.
HS or les   .57682985    2063
Col or mo   .69385409    1741
Total       .63038905    3804
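These fitted means are the inverse logits of the group odds estimated above (a quick check):
display invlogit(ln(1.363116))    // .5768 for HS or less
display invlogit(ln(2.266417))    // .6939 for Col or more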
Probit Model
inverse probit is the CDF for a standard normal variable:
p_i = \Phi(\mathbf{x}_i'\boldsymbol{\beta}) = \int_{-\infty}^{\mathbf{x}_i'\boldsymbol{\beta}} \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}\, du
link function:
\mathrm{probit}(p_i) = \Phi^{-1}(p_i) = \mathbf{x}_i'\boldsymbol{\beta}
Probit Transformation
[Figure: p (0 to 1) against the probit (−3 to 3)]
Interpretation: probit coefficients
interpreted on a standard normal scale (no log odds-ratio interpretation)
"scaled" versions of logit coefficients: \beta_{\text{probit}} \approx \frac{\sqrt{3}}{\pi}\,\beta_{\text{logit}}
probit models are more common in certain disciplines (economics)
analogy with linear regression (normal latent variable)
more easily extended to multivariate distributions
Example: Grouped Data Swedish mortality data revisited
logit model
y       Coef.       Std. Err.     z      P>|z|   [95% Conf. Interval]
A2      .1147916    .21511       0.53    0.594   -.3068163   .5363995
A3      -.8384579   .2006439    -4.18    0.000   -1.231713   -.445203
P2      .5271214    .120775      4.36    0.000   .2904068    .763836
_cons   -4.017514   .1922715   -20.90    0.000   -4.394359   -3.640669

probit model
y       Coef.       Std. Err.     z      P>|z|   [95% Conf. Interval]
A2      .0497241    .087904      0.57    0.572   -.1225646   .2220128
A3      -.3247921   .0807731    -4.02    0.000   -.4831045   -.1664797
P2      .2098432    .0472825     4.44    0.000   .1171712    .3025151
_cons   -2.101865   .0778879   -26.99    0.000   -2.254522   -1.949207
Swedish Historical Mortality Data predictions
         Logit               Probit
A        P=1      P=2        P=1      P=2
1        19.0     10.0       19.1     9.9
2        61.0     32.0       61.9     31.6
3        143.0    60.0       141.1    61.4
sum      325                 325.1
Programming
Stata: generalized linear model (glm)
glm y A2 A3 P2, family(b n) link(probit)
glm y A2 A3 P2, family(b n) link(logit)
idea of glm is to make the model linear in the link
old days: Iteratively Reweighted Least Squares (IRLS); now: Fisher scoring, Newton-Raphson
both approaches yield MLEs
Generalized Linear Models
applies to a broad class of models
iterative fitting (repeated updating), except for the linear model: update parameters, weights W, and predicted values m
models differ in terms of W and m and assumptions about the distribution of y
common distributions for y include: normal, binomial, and Poisson
common links include: identity, logit, probit, and log
\boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} + (\mathbf{X}'\mathbf{W}^{(t)}\mathbf{X})^{-1}\mathbf{X}'(\mathbf{y} - \mathbf{m}^{(t)})
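A minimal Mata sketch of this update for the logit model, assuming the grouped data have been expanded to individual level; illustrative only, not the internal algorithm of glm:
mata
// Fisher scoring / IRLS for a logit fit (sketch)
y = st_data(., "psex")
X = st_data(., "educ"), J(rows(y), 1, 1)   // predictor plus constant
b = J(cols(X), 1, 0)                       // start at beta = 0
for (t = 1; t <= 20; t++) {
    m = invlogit(X*b)                      // predicted values m
    w = m :* (1 :- m)                      // Bernoulli weights W
    b = b + invsym(cross(X, w, X)) * cross(X, y - m)
}
b                                          // converges to the MLEs
end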
Latent Variable Approach example: insect mortality
suppose a researcher exposes insects to dosage levels (u) of an insecticide and observes whether the “subject” lives or dies at that dosage.
the response is expected to depend on the insect’s tolerance (c) to that dosage level.
the insect dies if u > c and survives if u < c
tolerance is not observed (survival is observed)
\Pr(y_i = 1) = \Pr(u_i > c_i)
Latent Variables
u and c are continuous latent variables
examples: women's employment: u is the market wage and c is the reservation wage; migration: u is the benefit of moving and c is the cost of moving
the observed outcome y = 1 or y = 0 reveals the individual's preference, which is assumed to maximize a rational individual's utility function
Latent Variables
Assume linear utility and criterion functions:
u = \mathbf{x}'\boldsymbol{\beta}_u + \varepsilon_u
c = \mathbf{x}'\boldsymbol{\beta}_c + \varepsilon_c
\Pr(y = 1) = \Pr(u > c) = \Pr[\varepsilon_c - \varepsilon_u < \mathbf{x}'(\boldsymbol{\beta}_u - \boldsymbol{\beta}_c)]
over-parameterization = identification problem: we can identify differences in components but not the separate components
Latent Variables
constraints: \boldsymbol{\beta} = \boldsymbol{\beta}_u - \boldsymbol{\beta}_c and \varepsilon = \varepsilon_c - \varepsilon_u
Then:
\Pr(y = 1) = \Pr(\varepsilon < \mathbf{x}'\boldsymbol{\beta}) = F(\mathbf{x}'\boldsymbol{\beta})
where F(\cdot) is the CDF of \varepsilon
Latent Variables and Standardization
Need to standardize the mean and variance of \varepsilon
binary dependent variables lack inherent scales
the magnitude of \beta is only in reference to the mean and variance of \varepsilon, which are unknown
redefine \varepsilon to a common standard:
\varepsilon^* = \frac{\varepsilon - a}{b}
where a and b are two chosen constants.
Standardization for Logit and Probit Models
standardization implies
\Pr(y = 1) = F^*\!\left(\frac{\mathbf{x}'\boldsymbol{\beta} - a}{b}\right)
where F^*(\cdot) is the cdf of \varepsilon^*
location a and scale b need to be fixed; setting a = 0 and b = 1 with
F^*(\cdot) = \Phi(\cdot) \Rightarrow \text{probit model}
Standardization for Logit and Probit Models
distribution of \varepsilon is standardized
standard normal \rightarrow probit
standard logistic \rightarrow logit
both distributions have a mean of 0; the variances differ:
\sigma^2_{\text{probit}} = 1
\sigma^2_{\text{logit}} = \frac{\pi^2}{3}
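Because the two latent scales differ only through these variances, coefficients from the two models fitted to the same data differ roughly by the ratio of the latent standard deviations (a rule of thumb, not an exact identity):
\beta_{\text{logit}} \approx \frac{\pi}{\sqrt{3}}\,\beta_{\text{probit}} \approx 1.81\,\beta_{\text{probit}}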
Extending the Latent Variable Approach
observed y is a dichotomous (binary) 0/1 variable
continuous latent variable: linear predictor + residual:
y_i^* = \mathbf{x}_i'\boldsymbol{\beta} + \varepsilon_i
observed outcome:
y_i = \begin{cases} 1 & \text{if } y_i^* > 0 \\ 0 & \text{otherwise} \end{cases}
Notation
conditional means of latent variables are obtained from the index function:
E(y_i^* \mid \mathbf{x}_i) = \mathbf{x}_i'\boldsymbol{\beta}
obtain probabilities from inverse link functions
logit model: p_i = \Lambda(\mathbf{x}_i'\boldsymbol{\beta})
probit model: p_i = \Phi(\mathbf{x}_i'\boldsymbol{\beta})
ML
likelihood function (where n_i = 1 if the data are binary):
L(\boldsymbol{\beta}) = \prod_i F(\mathbf{x}_i'\boldsymbol{\beta})^{y_i}\,[1 - F(\mathbf{x}_i'\boldsymbol{\beta})]^{n_i - y_i}
log-likelihood function:
\log L(\boldsymbol{\beta}) = \sum_{i=1}^{n} y_i \log F(\mathbf{x}_i'\boldsymbol{\beta}) + (n_i - y_i)\log[1 - F(\mathbf{x}_i'\boldsymbol{\beta})]
Assessing Models
definitions:
L_0: null model (intercept only)
L_f: saturated model (a parameter for each cell)
L_c: current model
grouped data (events/trials) deviance (likelihood ratio statistic):
G^2 = -2\log\frac{L_c}{L_f} = 2(\log L_f - \log L_c)
Deviance
grouped data: if cell sizes are reasonably large, the deviance is distributed as chi-square
individual-level data: L_f = 1 and \log L_f = 0, so the deviance is not a "fit" statistic:
G^2 = -2\log L_c
Deviance
deviance is like a residual sum of squares: larger values indicate poorer models; larger models have smaller deviance
G_1^2: deviance for the more constrained model (Model 1)
G_2^2: deviance for the less constrained model (Model 2)
assume that Model 1 is a constrained version of Model 2.
Difference in Deviance
evaluate competing "nested" models using a likelihood ratio statistic:
\Delta G^2 = G_1^2 - G_2^2 \sim \chi^2_{df_1 - df_2}
model chi-square is a special case:
G^2_{\text{Model}} = G_0^2 - G_c^2 = -2\log L_0 - (-2\log L_c)
SAS, Stata, R, etc. report different statistics
Other Fit Statistics
BIC & AIC (useful for non-nested models)
basic idea of an information criterion (IC): penalize log L for the number of parameters (AIC/BIC) and/or the size of the sample (BIC):
IC = -2\log L + 2\,s\,df_m
AIC: s = 1; BIC: s = \tfrac{1}{2}\log n (n = sample size); df_m is the number of model parameters
Hypothesis Tests/Inference
single parameter: MLEs are asymptotically normal \rightarrow Z-test of H_0: \beta = 0
multi-parameter: likelihood ratio tests (after fitting); Wald tests (test constraints from the current model), e.g. H_0: \beta_1 = \beta_2 = 0
Hypothesis Tests/Inference
Wald test (tests a vector of restrictions)
a set of r parameters are all equal to 0: H_0: \boldsymbol{\beta}_r = \mathbf{0}, where \boldsymbol{\beta}_r is the parameter subset
a set of r parameters are linearly restricted: H_0: \mathbf{R}\boldsymbol{\beta} = \mathbf{q}, where \mathbf{R} is the restriction matrix and \mathbf{q} is the constraint vector
Interpreting Parameters
odds ratios: consider the model
y_i^* = \beta_0 + \beta_1 x_i + \beta_2 d_i + \varepsilon_i
where x is a continuous predictor and d is a dummy variable
suppose that d denotes sex and x denotes income and the problem concerns voting, where y* is the propensity to vote
results: logit(p_i) = -1.92 + 0.012 x_i + 0.67 d_i
Interpreting Parameters
for d (dummy variable coded 1 for female) the odds ratio is straightforward:
\frac{p_f/(1-p_f)}{p_m/(1-p_m)} = \exp(\hat{\beta}_2) = \exp(0.67) = 1.95
holding income constant, women's odds of voting are nearly twice those of men
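A one-line check:
display exp(0.67)    // 1.954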
Interpreting Parameters
for x (continuous variable for income in thousands of dollars) the odds ratio is a multiplicative effect
suppose we increase income by 1 unit ($1,000):
\frac{\exp[\hat{\beta}_1(x+1)]}{\exp[\hat{\beta}_1 x]} = \exp(\hat{\beta}_1) = 1.01
suppose we increase income by c units (c × $1,000):
\frac{\exp[\hat{\beta}_1(x+c)]}{\exp[\hat{\beta}_1 x]} = \exp(\hat{\beta}_1 c)
Interpreting Parameters
if income is increased by $10,000, this increases the odds of voting by about 13%:
(e^{10 \times 0.012} - 1) \times 100\% = 12.75\%
a note on percent change in odds:
if the estimate of \beta > 0, the percent increase in odds for a unit change in x is (e^{\hat{\beta}} - 1) \times 100\%
if the estimate of \beta < 0, the percent decrease in odds for a unit change in x is (1 - e^{\hat{\beta}}) \times 100\%
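Again verifiable directly:
display (exp(10*0.012) - 1)*100    // 12.75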
Marginal Effects
marginal effect: effect of a change in x on the change in probability:
\frac{\partial \Pr(y_i = 1 \mid \mathbf{x}_i)}{\partial x_{ik}} = \frac{\partial F(\mathbf{x}_i'\boldsymbol{\beta})}{\partial x_{ik}} = f(\mathbf{x}_i'\boldsymbol{\beta})\,\beta_k
where f(\cdot) is the pdf corresponding to the cdf F(\cdot)
often we evaluate f(\cdot) at the mean of x.
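In current Stata, marginal effects at the means are available after any logit fit via the built-in margins command (a sketch; the examples below use the older margeff and prgen add-ons):
logit hsg blk female mhs nonint inc nsibs urban so wtest
margins, dydx(*) atmeans    // f(xb)*b_k evaluated at the sample means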
Marginal Effect for a Change in a Continuous Variable
Marginal Effect of a Change in a Dummy Variable
if x is a continuous variable and z is a dummy variable, the marginal effect of a change in z from 0 to 1 is the difference:
F(\beta_0 + \beta_1 x + \beta_2) - F(\beta_0 + \beta_1 x)
Example: logit models for high school graduation
odds ratios (constant is baseline odds)
LR Test
Model 3 vs. 2:
G^2(1) = 2(\log L_3 - \log L_2) = 2(-1038.39 - (-1240.70)) = 404.64
Wald Test
Test equality of parental education effects: H_0: mhs = fhs and H_0: mcol = fcol
logit hsg blk hsp female nonint inc nsibs mhs mcol fhs fcol wtest
test mhs=fhs
test mcol=fcol

. test mhs=fhs
 ( 1)  mhs - fhs = 0
       chi2(  1) =    0.01
     Prob > chi2 =    0.9177

. test mcol=fcol
 ( 1)  mcol - fcol = 0
       chi2(  1) =    1.18
     Prob > chi2 =    0.2770

cannot reject H_0 of equal parental education effects on HS graduation
Basic Estimation Commands (Stata)
* model 0 - null model
qui logit hsg
est store m0
* model 1 - race, sex, family structure
qui logit hsg blk hsp female nonint
est store m1
* model 1a - race X family structure interactions
qui xi: logit hsg blk hsp female nonint i.nonint*i.blk i.nonint*i.hsp
est store m1a
lrtest m1 m1a
* model 2 - SES
qui xi: logit hsg blk hsp female nonint inc nsibs mhs mcol fhs fcol
est store m2
* model 3 - Indiv
qui xi: logit hsg blk hsp female nonint inc nsibs mhs mcol fhs fcol wtest
est store m3
lrtest m2 m3
estimation commands and model tests
Fit Statistics etc.
* some 'hand' calculations with saved results
scalar ll = e(ll)
scalar npar = e(df_m)+1
scalar nobs = e(N)
scalar AIC = -2*ll + 2*npar
scalar BIC = -2*ll + log(nobs)*npar
scalar list AIC
scalar list BIC
* or use automated fitstat routine
fitstat
* output as a table
estout1 m0 m1 m2 m3 using modF07, replace star stfmt(%9.2f %9.0f %9.0f) ///
 stats(ll N df_m) eform
Analysis of Deviance
. lrtest m0 m1
Likelihood-ratio test                LR chi2(4)  = 118.45
(Assumption: m0 nested in m1)        Prob > chi2 = 0.0000

. lrtest m1 m2
Likelihood-ratio test                LR chi2(6)  = 283.71
(Assumption: m1 nested in m2)        Prob > chi2 = 0.0000

. lrtest m2 m3
Likelihood-ratio test                LR chi2(1)  = 404.64
(Assumption: m2 nested in m3)        Prob > chi2 = 0.0000
BIC and AIC (using fitstat)
Measures of Fit for logit of hsg

Log-Lik Intercept Only:   -1441.781    Log-Lik Full Model:         -1038.377
D(3293):                   2076.754    LR(11):                       806.807
                                       Prob > LR:                      0.000
McFadden's R2:                0.280    McFadden's Adj R2:              0.271
ML (Cox-Snell) R2:            0.217    Cragg-Uhler(Nagelkerke) R2:     0.372
McKelvey & Zavoina's R2:      0.473    Efron's R2:                     0.252
Variance of y*:               6.240    Variance of error:              3.290
Count R2:                     0.857    Adj Count R2:                   0.096
AIC:                          0.636    AIC*n:                       2100.754
BIC:                     -24607.056    BIC':                        -717.672
BIC used by Stata:         2173.993    AIC used by Stata:           2100.754
Marginal Effects
[Figure: predicted Pr(y=1) (0 to 1) against Test Score (−4 to 4) for white/intact, white/nonintact, black/intact, black/nonintact — Marginal Effect of Test Score on High School Graduation, Income Quartile 1]
Marginal Effects
[Figure: predicted Pr(y=1) (0 to 1) against Test Score (−4 to 4) for white/intact, white/nonintact, black/intact, black/nonintact — Marginal Effect of Test Score on High School Graduation, Income Quartile 4]
Generate Income Quartiles
* quartiles for income distribution
qui sum adjinc, det
gen incQ1 = adjinc < r(p25)
gen incQ2 = adjinc >= r(p25) & adjinc < r(p50)
gen incQ3 = adjinc >= r(p50) & adjinc < r(p75)
gen incQ4 = adjinc >= r(p75)
gen incQ = 1 if incQ1==1
replace incQ = 2 if incQ2==1
replace incQ = 3 if incQ3==1
replace incQ = 4 if incQ4==1
tab incQ
Fit Model for Each Quartile; calculate predictions
* look at marginal effects of test score on graduation by selected groups
* (1) model (income quartiles)
local i = 1
while `i' < 5 {
 logit hsg blk female mhs nonint nsibs urban so wtest if incQ==`i'
 margeff
 cap drop wm*
 cap drop bm*
 prgen wtest, x(blk=0 female=0 mhs=1 nonint=0) gen(wmi) from(-3) to(3)
 prgen wtest, x(blk=0 female=0 mhs=1 nonint=1) gen(wmn) from(-3) to(3)
 label var wmip1 "white/intact"
 label var wmnp1 "white/nonintact"
 prgen wtest, x(blk=1 female=0 mhs=1 nonint=0) gen(bmi) from(-3) to(3)
 prgen wtest, x(blk=1 female=0 mhs=1 nonint=1) gen(bmn) from(-3) to(3)
 label var bmip1 "black/intact"
 label var bmnp1 "black/nonintact"
Graph
 set scheme s2mono
 twoway (line wmip1 wmix, sort xtitle("Test Score") ytitle("Pr(y=1)")) ///
  (line wmnp1 wmix, sort) (line bmip1 wmix, sort) (line bmnp1 wmix, sort), ///
  subtitle("Marginal Effect of Test Score on High School Graduation" ///
  "Income Quartile `i'") saving(wtgrph`i', replace)
 graph export wtgrph`i'.eps, as(eps) replace
 local i = `i' + 1
}
Fitted Probabilities
logit hsg blk female mhs nonint inc nsibs urban so wtest
prtab nonint blk female

logit: Predicted probabilities of positive outcome for hsg

                   female = 0            female = 1
nonint          blk=0     blk=1       blk=0     blk=1
0               0.9111    0.9740      0.9258    0.9786
1               0.8329    0.9480      0.8585    0.9569
Fitted Probabilities
predicted values:
\hat{p} = \frac{\exp(\mathbf{x}'\hat{\boldsymbol{\beta}})}{1+\exp(\mathbf{x}'\hat{\boldsymbol{\beta}})}
evaluate fitted probabilities at the sample mean values of x (or other fixed quantities)
averaging fitted probabilities over subgroup-specific models will produce marginal probabilities:
\bar{p}_j = \frac{1}{n_j}\sum_{i=1}^{n_j} \hat{\Lambda}(\mathbf{x}_{ij}'\hat{\boldsymbol{\beta}}_j) = \bar{y}_j
Observed & Fitted Probabilities

                      male                  female
family type      white     black       white     black
intact
  observed       0.90      0.86        0.91      0.89
  fitted         0.91      0.97        0.93      0.98
  n              776       224         749       234
nonintact
  observed       0.71      0.74        0.81      0.82
  fitted         0.83      0.95        0.86      0.96
  n              220       207         196       231
Total            996       431         945       465
Alternative Probability Model
complementary log-log (cloglog or CLL)
standard extreme-value distribution for u:
f(u) = \exp(u)\exp[-\exp(u)]
F(u) = 1 - \exp[-\exp(u)]
cloglog model:
\Pr(y_i = 1) = 1 - \exp[-\exp(\mathbf{x}_i'\boldsymbol{\beta})]
cloglog link function:
\log\{-\log[1 - \Pr(y_i = 1)]\} = \mathbf{x}_i'\boldsymbol{\beta}
Extreme-Value Distribution
properties:
mean of u (Euler's constant): \gamma \approx 0.5772
variance of u: \pi^2/6
the difference of two independent extreme-value variables yields a logistic variable:
u_1 - u_2 \sim \text{logistic}(0, \pi^2/3)
CLL Transformation
[Figure: p (0 to 1) against the CLL transform (−6 to 2)]
CLL Model
no "practical" differences from logit and probit models
often suited for survival data and other applications
interpretation of coefficients: exp(\beta) is a relative risk or hazard ratio, not an OR
glm: binomial distribution for y with a cloglog link; or use the cloglog command directly
CLL and Logit Model Compared
            logit       cloglog
blk 3.658*** 1.987***
female 1.218 1.128*
mhs 1.438** 1.161*
nonint 0.487*** 0.710***
inc 1.635** 1.236**
nsibs 0.938** 0.965**
urban 0.887 0.942
so 1.269 1.115
wtest 5.151*** 2.171***
_cons 6.851*** 1.891***
log L -838.92 -833.96
N 2837 2837
df 9 9
Cloglog and Logit Model Compared
logit
                   OIM
d     Odds Ratio   Std. Err.     z     P>|z|   [95% Conf. Interval]
A2    1.12164      .2412759     0.53   0.594   .7357857   1.709839
A3    .4323768     .0867538    -4.18   0.000   .2917924   .6406942
P2    1.694049     .2045987     4.36   0.000   1.336971   2.146494

cloglog
                   OIM
d     exp(b)       Std. Err.     z     P>|z|   [95% Conf. Interval]
A2    1.119414     .2380893     0.53   0.596   .7378156   1.698375
A3    .4350801     .0864137    -4.19   0.000   .2947864   .642142
P2    1.684947     .2016957     4.36   0.000   1.332581   2.130487

more agreement when modeling rare events
Extensions: Multilevel Data
what is multilevel data? individuals are “nested” in a larger context:
children in families, kids in schools etc.
[Diagram: individuals nested within context 1, context 2, context 3]
Multilevel Data
i.i.d. assumptions?
the outcomes for units in a given context could be associated
the standard model would treat all outcomes (regardless of context) as independent
multilevel methods account for the within-cluster dependence
this is a general problem with binomial responses: we assume that trials are independent, which might not be realistic; non-independence will inflate the variance (overdispersion)
Multilevel Data
example (in book): 40 universities as units of analysis
for each university we observe the number of graduates (n) and the number receiving post-doctoral fellowships (y)
we could compute proportions (MLEs); some proportions would be "better" estimates, as they would have higher precision or lower variance
example: the data y1/n1 = 2/5 and y2/n2 = 20/50 give identical estimates of p but variances of 0.048 and 0.0048 respectively, so the 2nd estimate is more precise than the 1st
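The variances come from var(p̂) = p̂(1 − p̂)/n, e.g.:
display .4*(1-.4)/5     // .048
display .4*(1-.4)/50    // .0048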
Multilevel Data
multilevel models allow for improved predictions of individual probabilities
the MLE is unaltered if it is precise; the MLE is moved toward the average if it is imprecise (shrinkage)
the multilevel estimate of p would be a weighted average of the MLE and the average over all MLEs (the weight w is based on the variance of each MLE and the variance over all the MLEs):
\tilde{p}_i = w_i \hat{p}_i + (1 - w_i)\bar{p}
we are generally less interested in the p's and more interested in the model parameters and variance components
Shrinkage Estimation: primitive approach
assume we have a set of estimates (MLEs) \hat{p}_i
our best estimate of the variance of each MLE is
\mathrm{var}(\hat{p}_i) = \frac{\hat{p}_i(1-\hat{p}_i)}{n_i}
this is the within variance (no pooling); if it is large, the MLE is a poor estimate, and a better estimate might be the average of the MLEs (pooling the estimates)
we can average the MLEs and estimate the between variance as
\mathrm{var}(\bar{p}) = \frac{1}{N}\sum_i (\hat{p}_i - \bar{p})^2
Shrinkage Estimation: primitive approach
we can then estimate a weight w_i:
w_i = \frac{\mathrm{var}(\bar{p})}{\mathrm{var}(\bar{p}) + \mathrm{var}(\hat{p}_i)} = \frac{\text{between-group variance}}{\text{total variance}}
a revised estimate of p_i takes account of the precision to form a precision-weighted average; precision is a function of n_i, so more weight is given to more precise MLEs:
\tilde{p}_i = w_i \hat{p}_i + (1 - w_i)\bar{p}
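A tiny numeric sketch with hypothetical values: within variance .048 from the small program above, plus an assumed between variance of .02 and an assumed grand mean of .45:
display .02/(.02 + .048)           // w = .29 for the imprecise MLE
display .294*.40 + (1-.294)*.45    // .435: the estimate is pulled toward the grand mean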
Shrinkage: a primitive approach
[Figure: observed and shrunken probabilities (0.2 to 0.8) for the 40 universities]
Shrinkage
[Figure: observed and EB (empirical Bayes) probabilities (0.2 to 0.8) for the 40 universities — results from a full Bayesian (multilevel) analysis]
Extension: Multilevel Models
assumptions:
within-context and between-context variation in outcomes
individuals within the same context share the same "random error" specific to that context
models are hierarchical: individuals (level-1), contexts (level-2)
Multilevel Models: Background
linear mixed model for continuous y (multilevel, random coefficients, etc.)
level-1 model and level-2 sub-models (hierarchical):
y_{ij} = \beta_{0i} + \beta_{1i} z_{ij} + \varepsilon_{ij}
\beta_{0i} = \gamma_{00} + \gamma_{01} x_i + u_{0i}
\beta_{1i} = \gamma_{10} + \gamma_{11} x_i + u_{1i}
Multilevel Models: Background
linear mixed model assumptions: level-1 and level-2 residuals:
\varepsilon_{ij} \sim \text{Normal}(0, \sigma^2)
\begin{pmatrix} u_0 \\ u_1 \end{pmatrix} \sim \text{MVN}\!\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \boldsymbol{\Sigma}_u\right), \quad \text{where } \boldsymbol{\Sigma}_u = \begin{pmatrix} \tau_0^2 & \tau_{01} \\ \tau_{01} & \tau_1^2 \end{pmatrix}
Multilevel Models: Background
composite form:
y_{ij} = \underbrace{\gamma_{00} + \gamma_{01} x_i + \gamma_{10} z_{ij} + \gamma_{11} x_i z_{ij}}_{\text{fixed effects (incl. cross-level interaction)}} + \underbrace{u_{0i} + u_{1i} z_{ij}}_{\text{random effects (level-2)}} + \varepsilon_{ij}
the random effects plus \varepsilon_{ij} form the composite residual
Multilevel Models: Background
variance components:
within group: \mathrm{var}(\varepsilon_{ij})
between group: \mathrm{var}(u_{0i} + u_{1i} z_{ij})
total: \mathrm{var}(u_{0i} + u_{1i} z_{ij} + \varepsilon_{ij})
Multilevel Models: Background
general form (linear mixed model):
y_{ij} = \mathbf{x}_{ij}'\boldsymbol{\beta} + \mathbf{z}_{ij}'\mathbf{u}_i + \varepsilon_{ij}
x: variables associated with fixed coefficients; z: variables associated with random coefficients
Multilevel Models: Logit Models
binomial model (random effect):
\mathrm{logit}(p_{ij}) = \mathbf{x}_{ij}'\boldsymbol{\beta} + u_i, \quad u_i \sim \text{Normal}(0, \sigma_u^2)
assumptions: u increases or decreases the expected response for individual j in context i independently of x
all individuals in context i share the same value of u; also called a random intercept model:
\beta_{0i} = \beta_0 + u_i
Multilevel Models
a hierarchical model:
\mathrm{logit}(p_{ij}) = \beta_{0i} + \beta_1 z_{ij} \quad \text{and} \quad \beta_{0i} = \gamma_{00} + \gamma_{01} x_i + u_i
z is a level-1 variable; x is a level-2 variable
the random intercept varies among level-2 units
note: the level-1 residual variance is fixed (why?)
Multilevel Models
a general expression:
\mathrm{logit}(p_{ij}) = \mathbf{x}_{ij}'\boldsymbol{\beta} + \mathbf{z}_{ij}'\mathbf{u}_i
x are variables associated with "fixed" coefficients; z are variables associated with "random" coefficients
u is a multivariate normal vector of level-2 residuals: the mean of u is 0; the covariance of u is \boldsymbol{\Sigma}_u
Multilevel Models
random effects vs. random coefficients: random effects u; random coefficients \beta + u
variance components: interested in level-2 variation in u
prediction: E(y) is not equal to E(y|u); model-based predictions need to consider the random effects:
E(y_{ij} \mid \mathbf{x}_{ij}, u_i) = \Lambda(\mathbf{x}_{ij}'\boldsymbol{\beta} + u_i)
Multilevel Models: Generalized Linear Mixed Models (GLMM)
Conditional Expectation: E(y_{ij} \mid \mathbf{x}_{ij}, u_i) = \Lambda(\mathbf{x}_{ij}'\boldsymbol{\beta} + u_i)
Marginal Expectation: E(y_{ij} \mid \mathbf{x}_{ij}) = E[E(y_{ij} \mid \mathbf{x}_{ij}, u_i)] = \int_u \Lambda(\mathbf{x}_{ij}'\boldsymbol{\beta} + u)\, g(u)\, du
requires numerical integration or simulation
Data Structure
a multilevel data structure requires a "context" id to identify individuals belonging to the same context
NLSY sibling data contain a "family id" (constructed by the researcher)
data are unbalanced (we do not require clusters to be the same size)
small clusters will contribute less information to the estimation of variance components than larger clusters
it is OK to have clusters of size 1 (i.e., an individual is a context unto themselves)
clusters of size 1 contribute to the estimation of fixed effects but not to the estimation of variance components
Example: clustered data siblings nested in families
y is 1st premarital birth for NLSY women
select sib-ships of size > 2
null model (random intercept):
xtlogit fpmbir, i(famid)
or
xtmelogit fpmbir || famid:
Example: clustered data
random intercept: xtlogit
Log likelihood = -228.59345                      Prob > chi2 = .

fpmbir      Coef.       Std. Err.     z     P>|z|   [95% Conf. Interval]
_cons       -2.888895   .3318566    -8.71   0.000   -3.539322   -2.238468
/lnsig2u    1.083066    .3992351                    .30058      1.865553
sigma_u     1.71864     .3430707                    1.162171    2.541556
rho         .4730808    .0995195                    .2910546    .662556

Likelihood-ratio test of rho=0: chibar2(01) = 20.58   Prob >= chibar2 = 0.000
Example: clustered data
random intercept: xtmelogit
Integration points = 7                           Wald chi2(0) = .
Log likelihood = -228.51781                      Prob > chi2 = .

fpmbir      Coef.       Std. Err.     z     P>|z|   [95% Conf. Interval]
_cons       -2.917541   .3479598    -8.38   0.000   -3.59953    -2.235552

Random-effects Parameters    Estimate    Std. Err.   [95% Conf. Interval]
famid: Identity
  sd(_cons)                  1.752456    .3601534    1.171423    2.621685

LR test vs. logistic regression: chibar2(01) = 20.73   Prob>=chibar2 = 0.0000
Variance Component
add predictors (mostly level-2)
Integration points = 7                           Wald chi2(6) = 22.48
Log likelihood = -215.39646                      Prob > chi2 = 0.0010

fpmbir      Odds Ratio   Std. Err.     z     P>|z|   [95% Conf. Interval]
nonint      3.356608     1.435222     2.83   0.005   1.451921    7.759938
nsibs       1.112501     .1032876     1.15   0.251   .9274119    1.33453
medu        .8050785     .060073     -2.91   0.004   .6955425    .9318647
inc         .8848917     .2858459    -0.38   0.705   .4698153    1.666683
consprot    1.614657     .6110603     1.27   0.206   .7690355    3.390111
weekly      .885648      .296273     -0.36   0.717   .4597391    1.706125

Random-effects Parameters    Estimate    Std. Err.   [95% Conf. Interval]
famid: Identity
  sd(_cons)                  1.451511    .3515003    .9030084    2.333182
Variance Component
the conditional variance of u is 2.107
proportionate reduction in error (PRE):
\text{PRE} = \frac{\sigma^2_{u(r)} - \sigma^2_{u(c)}}{\sigma^2_{u(r)}} = \frac{3.062 - 2.107}{3.062} = 0.312
a 31% reduction in level-2 variance when level-2 predictors are accounted for
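Checking the arithmetic with the reported variance components:
display (3.062 - 2.107)/3.062    // .312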
Random Effects
we can examine the distribution of random effects
[Figure: density of the random effects for famid: _cons (−1 to 3)]
Random Effects
we can examine the distribution of random effects

. sum u, detail
              random effects for famid: _cons
      Percentiles    Smallest
 1%   -.7111417      -.9210778
 5%   -.5100672      -.9210778
10%   -.388522       -.8339383    Obs            653
25%   -.2422871      -.8339383    Sum of Wgt.    653
50%   -.1484184                   Mean           .1132598
                      Largest     Std. Dev.      .7006756
75%   -.0689377      2.431446
90%   1.337971       2.431446     Variance       .4909462
95%   1.523062       2.583755     Skewness       1.688026
99%   2.405483       2.583755     Kurtosis       4.818971
Random Effects Distribution
90th percentile: u90 = 1.338
10th percentile: u10 = −0.388
the risk for a family at the 90th percentile is
exp(1.338 − (−0.388)) = exp(1.726) ≈ 5.62
times higher than for a family at the 10th percentile
even if families are compositionally identical on covariates, we can assess this hypothetical differential in risks
Growth Curve Models
growth models: individuals are level-2 units; repeated measures over time on individuals (level-1)
models imply that logits vary across individuals: the intercept (conditional average logit) varies, the slope (conditional average effect of time) varies, and change is usually assumed to be linear
use GLMM
complications due to dimensionality: the intercept and slope may co-vary (necessitating a more complex model), and more
Growth Curve Models
multilevel logit model for change over time (T is time, strictly increasing):
\mathrm{logit}(p_{ij}) = \beta_{0i} + \beta_{1i} T_{ij}
fixed and random coefficients (with covariates):
\beta_{0i} = \gamma_{00} + \gamma_{01} X_i + u_{0i}
\beta_{1i} = \gamma_{10} + \gamma_{11} X_i + u_{1i}
assume that u_0 and u_1 are bivariate normal
Multilevel Logit Models for Change
Example: log odds of employment of black men in the U.S., 1982-1988 (NLSY) (consider 5 years in this period)
time is coded 0, 1, 3, 4, 6
dependent variable is: not-working, not-in-school
unconditional growth (no covariates except T); conditional growth (add covariates)
note: cross-level interactions are implied by the composite model:
\mathrm{logit}(p_{ij}) = \gamma_{00} + \gamma_{01} X_i + \gamma_{10} T_{ij} + \gamma_{11} X_i T_{ij} + u_{0i} + u_{1i} T_{ij}
Fitting Multilevel Model for Change
programming
Stata (unconditional growth):
xtmelogit y year || id: year, var cov(un)
Stata (conditional growth):
xtmelogit y year south unem unemyr inc hs || id: year, var cov(un)
Fitting Multilevel Model for Change
Mixed-effects logistic regression         Number of obs      = 3430
Group variable: id                        Number of groups   = 686
                                          Obs per group: min = 5, avg = 5.0, max = 5
Integration points = 7                    Wald chi2(1) = 24.94
Log likelihood = -1916.0409               Prob > chi2 = 0.0000

y       Coef.       Std. Err.     z     P>|z|   [95% Conf. Interval]
year    -.1467877   .0293921    -4.99   0.000   -.2043952   -.0891801
_cons   -.8742502   .0972809    -8.99   0.000   -1.064917   -.6835831

Random-effects Parameters    Estimate    Std. Err.   [95% Conf. Interval]
id: Unstructured
  var(year)                  .0552714    .0241599    .0234654    .1301886
  var(_cons)                 1.796561    .4330881    1.120075    2.881622
  cov(year,_cons)            -.0517392   .0789636    -.206505    .1030266

LR test vs. logistic regression: chi2(3) = 250.61   Prob > chi2 = 0.0000
Fitting Multilevel Logit Model for Change
Mixed-effects logistic regression         Number of obs      = 3430
Group variable: id                        Number of groups   = 686
                                          Obs per group: min = 5, avg = 5.0, max = 5
Integration points = 7                    Wald chi2(6) = 123.80
Log likelihood = -1868.0104               Prob > chi2 = 0.0000

y       Coef.       Std. Err.     z     P>|z|   [95% Conf. Interval]
year    -.0921512   .0281795    -3.27   0.001   -.1473819   -.0369205
south   -.6523682   .1283314    -5.08   0.000   -.9038931   -.4008434
unem    1.014915    .2408795     4.21   0.000   .5428002    1.48703
unemyr  -.1120936   .0641975    -1.75   0.081   -.2379184   .0137313
inc     -.5732738   .1872211    -3.06   0.002   -.9402205   -.2063271
hs      -.785545    .1242026    -6.32   0.000   -1.028978   -.5421124
_cons   -.0612559   .1285939    -0.48   0.634   -.3132954   .1907836

Random-effects Parameters    Estimate    Std. Err.   [95% Conf. Interval]
id: Unstructured
  var(year)                  .0433477    .0219905    .016038     .1171612
  var(_cons)                 1.304833    .3648705    .7542816    2.257233
  cov(year,_cons)            -.0622441   .0708861    -.2011783   .07669

LR test vs. logistic regression: chi2(3) = 140.20   Prob > chi2 = 0.0000
Logits: Observed, Conditional, and Marginal
the log odds of idleness decreases with time and shows variation in level and change
Composite Residuals in a Growth Model
composite residual:
r_{ij} = u_{0i} + u_{1i} T_{ij} + \varepsilon_{ij}
composite residual variance:
\mathrm{var}(r_{ij}) = \tau_0^2 + \tau_1^2 T_j^2 + 2\tau_{01} T_j + \pi^2/3
covariance of composite residuals:
\mathrm{cov}(r_{ij}, r_{ij'}) = \tau_0^2 + \tau_1^2 T_j T_{j'} + \tau_{01}(T_j + T_{j'})
Model
the covariance term is 0 (from either model), which results in a simplified interpretation and easier estimation via variance components (the default option)
significant variation in slopes and initial levels
other results:
the log odds of idleness decrease over time (negative slope)
the other covariates except county unemployment have significant effects on the odds of idleness
the main effects are interpreted as effects on the initial logits (at t = 0, the 1982 baseline)
the interaction of time and unemployment rate captures the effect of the county unemployment rate in 1982 on the change in the log odds of idleness
the positive effect implies that higher county unemployment tends to dampen the change in odds
IRT Models
Item Response Theory models account for an individual-level random effect on a set of items (i.e., ability)
items are assumed to tap a single latent construct (aptitude on a specific subject)
item difficulty: test items are assumed to be ordered on a difficulty scale (easier to harder)
expected patterns emerge whereby if a more difficult item is answered correctly, the easier items are likely to have been answered correctly as well
IRT Models
1-parameter logistic (Rasch) model:
\mathrm{logit}(p_{ij}) = \theta_i - b_j
p_{ij}: individual i's probability of a correct response on the jth item; \theta_i: individual i's ability; b_j: item j's difficulty
properties:
an individual's ability parameter is invariant with respect to the item
the difficulty parameter is invariant with respect to an individual's ability
higher ability or lower item difficulty leads to a higher probability of a correct response
both ability and difficulty are measured on the same scale
ICC
the item characteristic curve (item response curve) depicts the probability of a correct response as a function of an examinee's ability or trait level
curves are shifted rightward with increasing item difficulty
assume that item 3 is more difficult than item 2 and item 2 is more difficult than item 1
at a given ability, the probability of a correct response decreases across items as the threshold \theta = b_j shifts rightward, reflecting increasing item difficulty

IRT Models: ICC (3 Items)
[Figure: three ICCs; the slopes of the item characteristic curves are equal where ability = item difficulty (\theta = b_j)]
Estimation as GLMM
specification:
\mathrm{logit}(p_{ij}) = \mathbf{x}_{ij}'\boldsymbol{\beta} + u_i
set up a person-item data structure; define x as a set of item dummy variables
change the signs on \beta to reflect "difficulty"
fit the model without an intercept to estimate all item difficulties
normalization is common:
\sum_{j=1}^{J} b_j = 0 \quad \text{and} \quad \sigma_u^2 = 1.0
1PL Estimation
Stata (data set up):
clear
set memory 128m
infile junk y1-y5 f using LSAT.dat
drop if junk==11 | junk==13
expand f
drop f junk
gen cons = 1
collapse (sum) wt2=cons, by(y1-y5)
gen id = _n
sort id
reshape long y, i(id) j(item)
1PL Estimation
Stata (model set up):
gen i1 = 0
gen i2 = 0
gen i3 = 0
gen i4 = 0
gen i5 = 0
replace i1 = 1 if item == 1
replace i2 = 1 if item == 2
replace i3 = 1 if item == 3
replace i4 = 1 if item == 4
replace i5 = 1 if item == 5
** 1PL: constrain sd=1
constraint 1 [id1]_cons = 1
gllamm y i1-i5, i(id) weight(wt) nocons family(binom) cons(1) link(logit) adapt
1PL Estimation
Stata (output):
number of level 1 units = 5000
number of level 2 units = 1000
Condition Number = 1.8420141
gllamm model with constraints: ( 1) [id1]_cons = 1
log likelihood = -2473.0543

       Coef.      Std. Err.     z     P>|z|   [95% Conf. Interval]
i1     2.871972   .1287498   22.31    0.000   2.619627   3.124317
i2     1.063026   .0821146   12.95    0.000   .902084    1.223967
i3     .2576052   .0765907    3.36    0.001   .1074903   .4077202
i4     1.388057   .086496    16.05    0.000   1.218528   1.557586
i5     2.218779   .104828    21.17    0.000   2.01332    2.424238

Variances and covariances of random effects
***level 2 (id)
  var(1): 1 (0)
1PL Estimation
Stata (parameter normalization):
* normalized solution
* [1 -- standard 1PL] [2 -- coefs sum to 0] [var = 1]
mata
 bALL = st_matrix("e(b)")
 b = -bALL[1,1..5]
 mb = mean(b')
 bs = b:-mb
 ("MML Estimates", "IRT parameters", "B-A Normalization")
 (-b', b', bs')
end
1PL Estimation
Stata (normalized solution)
param   MML Estimates   IRT     Normalized
1       2.87            -2.87   -1.31
2       1.06            -1.06   0.50
3       0.26            -0.26   1.30
4       1.39            -1.39   0.17
5       2.22            -2.22   -0.66
IRT: Extensions
2-parameter logistic (2PL) model:
\mathrm{logit}(p_{ij}) = \mathbf{x}_{ij}'\boldsymbol{\beta} + \lambda_j u_i = a_j(\theta_i - b_j)
\lambda_j is a factor loading on the random effect
b_j = -\beta_j/\lambda_j
a_j = \lambda_j: item discrimination parameters
normalization: \sum_j b_j = 0 and \prod_j a_j = 1
IRT: Extensions
2-parameter logistic (2PL) model
item discrimination parameters reveal differences in an item's utility in distinguishing different ability levels among examinees
high values denote items that are more useful in separating examinees into different ability levels; low values denote items that are less useful in distinguishing examinees in terms of ability
ICCs corresponding to this model can intersect, as they differ in location and slope
a steeper slope of the ICC is associated with a better discriminating item
IRT: Extensions
2-parameter logistic (2PL) model
[Figure: ICCs under the 2PL model]

IRT: Extensions
2-parameter logistic (2PL) model
Stata (estimation):
eq id: i1 i2 i3 i4 i5
constraint 1 [id1_1]i1 = 1
gllamm y i1-i5, i(id) weight(wt) nocons family(binom) link(logit) frload(1) eqs(id) cons(1) adapt
matrix list e(b)
* normalized solutions
* (1 standard 2PL)
mata
 bALL = st_matrix("e(b)")
 b = bALL[1,1..5]
 c = bALL[1,6..10]
 a = -b:/c
 ("MML Estimates-Dif", "IRT Parameters")
 (b', a')
 ("MML Discrimination Parameters")
 (c')
end
IRT: Extensions
2-parameter logistic (2PL) model
Stata (estimation):
* Bock and Aitkin Normalization (p. 164 corrected)
mata
 bALL = st_matrix("e(b)")
 b = -bALL[1,1..5]
 c = bALL[1,6..10]
 lc = ln(c)
 mb = mean(b')
 mc = mean(lc')
 bs = b:-mb
 cs = exp(lc:-mc)
 ("B-A Normalization DIFFICULTY", "B-A Normalization DISCRIMINATION")
 (bs', cs')
end
IRT: 2PL (1)
log likelihood = -2466.6533

       Coef.      Std. Err.     z     P>|z|   [95% Conf. Interval]
i1     2.773234   .205743    13.48    0.000   2.369985   3.176483
i2     .9901996   .0900182  11.00     0.000   .8137672   1.166632
i3     .24915     .0762746   3.27     0.001   .0996546   .3986454
i4     1.284755   .0990363  12.97     0.000   1.090647   1.478862
i5     2.053265   .1353574  15.17     0.000   1.78797    2.318561

Variances and covariances of random effects
***level 2 (id)
  var(1): 1 (0)
loadings for random effect 1
  i1: .82565942 (.25811315)
  i2: .72273928 (.18667773)
  i3: .890914   (.2328178)
  i4: .68836241 (.18513868)
  i5: .65684452 (.20990788)
IRT: 2PL (2)
B-A (Bock-Aitkin) Normalization
item    Difficulty Parameter   Discrimination Parameter
1       -1.30                  1.10
2       0.48                   0.96
3       1.22                   1.18
4       0.19                   0.92
5       -0.58                  0.87
check   0                      1
item 3 has the highest difficulty and the greatest discrimination
1PL and 2PL
[Figures comparing the fitted 1PL and 2PL models]
Binary Response Models for Event Occurrence
discrete-time event-history models
purpose: model the probability of an event occurring at some point in time: Pr(event at t | event has not yet occurred by t)
life table: events & trials
observe the number of events occurring to those who remain at risk as time passes
takes account of the changing composition of the sample as time passes
Life Table
Life Table: observe
R_j: number at risk in time interval j (R_0 = n), where the number at risk in interval j is adjusted over time:
R_j = R_{j-1} - D_{j-1} - W_{j-1}
D_j: events in time interval j (D_0 = 0)
W_j: removed from risk (censored) in time interval j (W_0 = 0) (removed from risk due to other, unrelated causes)
Life Table
other key quantities:
discrete-time hazard (event probability in interval j):
\hat{p}_j = \frac{D_j}{R_j}
surviving fraction (survivor function in interval j):
\hat{S}_j = \prod_{k=1}^{j} (1 - \hat{p}_k)
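A hypothetical two-interval check of these formulas:
display 10/100                    // p1 = .10 with 100 at risk and 10 events
display 8/80                      // p2 = .10 with 80 still at risk and 8 events
display (1 - 10/100)*(1 - 8/80)   // S2 = .81, the surviving fraction after interval 2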
Discrete-Time Hazard Models
statistical concepts:
discrete random variable T_i (individual i's event or censoring time)
pdf of T (probability that individual i experiences the event in period j):
f(t_{ij}) = \Pr(T_i = j)
cdf of T (probability that individual i experiences the event in period j or earlier):
F(t_{ij}) = \Pr(T_i \le j) = \sum_{k=1}^{j} f(t_{ik})
survivor function (probability that individual i survives past period j):
S(t_{ij}) = \Pr(T_i > j) = 1 - F(t_{ij})
Discrete-Time Hazard Models
statistical concepts:
discrete hazard: the conditional probability of event occurrence in interval j for individual i, given that the event has not already occurred to that individual by interval j:
p_{ij} = \Pr(T_i = j \mid T_i \ge j)
Discrete-Time Hazard Models
equivalent expression using binary data
binary data: d_{ij} = 1 if individual i experiences an event in interval j, 0 otherwise
use the sequence of binary values at each interval to form a history of the process for individual i up to the time the event occurs
discrete hazard:
p_{ij} = \Pr(d_{ij} = 1 \mid d_{i,j-1} = 0, d_{i,j-2} = 0, \ldots, d_{i1} = 0)
Discrete-Time Hazard Models
modeling (logit link):
p_{ij} = \frac{\exp(\alpha_j + \mathbf{x}_{ij}'\boldsymbol{\beta})}{1 + \exp(\alpha_j + \mathbf{x}_{ij}'\boldsymbol{\beta})}
modeling (complementary log-log link):
p_{ij} = 1 - \exp[-\exp(\alpha_j + \mathbf{x}_{ij}'\boldsymbol{\beta})]
non-proportional effects:
\mathrm{logit}(p_{ij}) = \alpha_j + \mathbf{x}_{ij}'\boldsymbol{\beta}_j
Data Structure: person-level data vs. person-period form
Data Structure: binary sequences
Estimation
contributions to the likelihood:
L_i = \begin{cases} \Pr(T_i = j) = f(t_{ij}) & \text{if } d_{ij} = 1, \\ \Pr(T_i > j) = S(t_{ij}) & \text{if } d_{ij} = 0. \end{cases}
contribution to log L for an individual with an event in period j:
\log L_i = \log p_{ij} + \sum_{k=1}^{j-1} \log(1 - p_{ik})
contribution to log L for an individual censored in period j:
\log L_i = \sum_{k=1}^{j} \log(1 - p_{ik})
combined:
\log L = \sum_{i=1}^{n} \sum_{k=1}^{j_i} \left[ d_{ik} \log p_{ik} + (1 - d_{ik}) \log(1 - p_{ik}) \right]
Example: dropping out of Ph.D. programs (large US university)
data: 6,964 individual histories spanning 20 years
dropout cannot be distinguished from other types of leaving (transfer to another program, etc.)
model the logit hazard of leaving the originally-entered program as a function of the following:
time in program (the time-dependent baseline hazard)
female and percent female in program
race/ethnicity (black, Hispanic, Asian)
marital status
GRE score
also add a program-specific random effect (multilevel)
clear
set memory 512m
infile CID devnt I1-I5 female pctfem black hisp asian married gre using DT28432.dat
logit devnt I1-I5, nocons or
est store m1
logit devnt I1-I5 female pctfem, nocons or
est store m2
logit devnt I1-I5 female pctfem black hisp asian, nocons or
est store m3
logit devnt I1-I5 female pctfem black hisp asian married, nocons or
est store m4
logit devnt I1-I5 female pctfem black hisp asian married gre, nocons or