
Propensity Score Weighting with Multilevel Data

Fan Li

Department of Statistical Science, Duke University

October 25, 2012

Joint work with Alan Zaslavsky and Mary Beth Landrum

Introduction

- In comparative effectiveness studies, the goal is usually to estimate the causal (treatment) effect.
- In noncausal descriptive studies, a common goal is to conduct an unconfounded comparison of two populations.
- Population-based observational studies are increasingly important sources for both types of studies.
- Proper adjustment for differences between treatment groups is crucial to both descriptive and causal analyses.
- Regression has long been the standard method.
- Propensity score (Rosenbaum and Rubin, 1983) is a robust alternative to regression adjustment, applicable to both causal and descriptive studies.

Multilevel data

- Propensity score has been developed and applied in cross-sectional settings.
- Data in medical care and health policy research are often multilevel.
- Subjects are grouped in natural clusters, e.g., geographical areas, hospitals, health service providers.
- Significant within- and between-cluster variations.
- Ignoring cluster structure often leads to invalid inference.
  - Inaccurate standard errors.
  - Cluster-level effects could be confounded with individual-level effects.
- Hierarchical regression models provide a unified framework to study multilevel data.

Propensity score methods for multilevel data

- Propensity score methods for multilevel data have been less explored.
- Arpino and Mealli (2011) investigated propensity score matching in multilevel settings.
- We focus on propensity score weighting strategies:
  1. Investigate the performance of different modeling and weighting strategies in the presence of model misspecification due to cluster-level confounders.
  2. Clarify the differences and connections between causal and unconfounded descriptive comparisons.

Notations

Setting: (1) two-level structure, (2) treatment assigned at individual level.

- Cluster: $h = 1, \ldots, H$.
- Units within cluster $h$: $k = 1, \ldots, n_h$, where $n_h$ is the cluster sample size; total sample size: $n = \sum_h n_h$.
- "Treatment" (individual-level): $Z_{hk}$, 0 control, 1 treatment.
- Covariates: $X_{hk} = (U_{hk}, V_h)$, with $U_{hk}$ individual-level and $V_h$ cluster-level.
- Outcome: $Y_{hk}$.

Controlled descriptive comparisons

- "Assignment": a nonmanipulable state defining membership in one of two groups.
- Objective: an unconfounded comparison of the observed outcomes between the groups.
- Estimand: average controlled difference (ACD), the difference in the means of $Y$ in two groups with balanced covariate distributions.

  $$\pi_{ACD} = E_X[E(Y \mid X, Z = 1) - E(Y \mid X, Z = 0)].$$

- Examples: comparing outcomes among populations of different races, or of patients treated in two different years.

Causal comparisons

- Assignment: a potentially manipulable intervention.
- Objective: causal effect, i.e., a comparison of the potential outcomes under treatment versus control in a common set of units.
- Assuming SUTVA, each unit has two potential outcomes: $Y_{hk}(0), Y_{hk}(1)$.
- Estimand: average treatment effect (ATE)

  $$\pi_{ATE} = E[Y(1) - Y(0)].$$

- Examples: evaluating the treatment effect of a drug, therapy or policy for a given population.
- Under the assumption of "unconfoundedness", $(Y(0), Y(1)) \perp Z \mid X$, we have

  $$\pi_{ATE} = \pi_{ACD}.$$
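The equality of the two estimands can be traced arm by arm. The sketch below additionally assumes overlap, $0 < \Pr(Z = 1 \mid X) < 1$, and consistency, $Y = Y(Z)$; neither is stated on the slide, so treat both as assumptions of this illustration.

```latex
\begin{align*}
E[Y(1)] &= E_X\{E[Y(1) \mid X]\} \\
        &= E_X\{E[Y(1) \mid X, Z = 1]\} && \text{(unconfoundedness)} \\
        &= E_X\{E[Y \mid X, Z = 1]\}    && \text{(consistency)}.
\end{align*}
```

The same steps with $Z = 0$ give $E[Y(0)] = E_X\{E[Y \mid X, Z = 0]\}$, and subtracting the two lines yields $\pi_{ATE} = \pi_{ACD}$.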

Propensity score

- Propensity score: $e(X) = \Pr(Z = 1 \mid X)$.
- Balancing property: balancing the propensity score also balances the covariates of the different groups,

  $$X \perp Z \mid e(X).$$

- Using the propensity score is a two-step procedure:
  - Step 1: estimate the propensity score, e.g., by logistic regression.
  - Step 2: estimate the "treatment" effect by incorporating the estimated propensity score (matching, weighting, stratification, etc.).
- Propensity score is a less parametric alternative to regression adjustment.

Propensity score weighting

- Foundation of propensity score weighting:

  $$E\left[\frac{ZY}{e(X)} - \frac{(1 - Z)Y}{1 - e(X)}\right] = \pi_{ACD} = \pi_{ATE} = \pi.$$

- Inverse-probability (Horvitz-Thompson) weight:

  $$w_{hk} = \begin{cases} \dfrac{1}{e(X_{hk})}, & \text{for } Z_{hk} = 1, \\[1ex] \dfrac{1}{1 - e(X_{hk})}, & \text{for } Z_{hk} = 0. \end{cases}$$

- HT weighting balances (in expectation) the weighted distribution of covariates in the two groups.
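As a minimal sketch of the weighting identity, the following simulates a single confounder and compares the unadjusted difference in means with the HT-weighted one. The data-generating values (slope 0.5 in the propensity, constant effect 2) and the function names `ht_weights` and `ht_estimate` are hypothetical choices for this illustration, and the true propensity score is used in place of an estimate.

```python
import math
import random


def ht_weights(z, e):
    """Horvitz-Thompson weights: 1/e for treated units, 1/(1-e) for controls."""
    return [1.0 / ei if zi == 1 else 1.0 / (1.0 - ei) for zi, ei in zip(z, e)]


def ht_estimate(y, z, e):
    """Sample analogue of E[ZY/e(X) - (1-Z)Y/(1-e(X))]."""
    w = ht_weights(z, e)
    treated = sum(wi * yi for wi, yi, zi in zip(w, y, z) if zi == 1)
    control = sum(wi * yi for wi, yi, zi in zip(w, y, z) if zi == 0)
    return (treated - control) / len(y)


# Hypothetical data: one confounder x, known propensity e(x), constant effect 2.
rng = random.Random(0)
n = 50000
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
e = [1.0 / (1.0 + math.exp(-0.5 * xi)) for xi in x]
z = [1 if rng.random() < ei else 0 for ei in e]
y = [1.0 + xi + 2.0 * zi + rng.gauss(0.0, 1.0) for xi, zi in zip(x, z)]

n1 = sum(z)
# Unadjusted difference in means, confounded by x.
naive = (sum(yi for yi, zi in zip(y, z) if zi == 1) / n1
         - sum(yi for yi, zi in zip(y, z) if zi == 0) / (n - n1))
# HT-weighted estimate, which should land near the true effect of 2.
weighted = ht_estimate(y, z, e)
```

In this setup the unadjusted difference overstates the effect because treated units tend to have larger `x`, while the weighted estimate removes that imbalance.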

Step 1: Propensity score models

- Marginal model, ignoring multilevel structure:

  $$\mathrm{logit}(e_{hk}) = \delta_0 + X_{hk}\alpha.$$

- Fixed effects model, adding a cluster-specific main effect $\delta_h$:

  $$\mathrm{logit}(e_{hk}) = \delta_h + U_{hk}\alpha.$$

  - Key: the cluster membership is a nominal covariate.
  - $\delta_h$ absorbs the effects of both observed and unobserved cluster-level covariates $V_h$.
  - Estimates a balancing score without knowledge of $V_h$, but might lead to larger variance than the propensity score estimated under a correct model with fully observed $V_h$.
- The Neyman-Scott problem: inconsistent estimates of $\delta_h, \alpha$ given a large number of small clusters.

Step 1: Propensity score models

- Random effects model:

  $$\mathrm{logit}(e_{hk}) = \delta_h + X_{hk}\alpha, \quad \text{with } \delta_h \sim N(\delta_0, \sigma_\delta^2).$$

- Due to the shrinkage of random effects, $V_h$ needs to be included.
- Pros: "borrowing information" across clusters, works better with many small clusters.
- Cons: produces a biased estimate if the $\delta_h$ are correlated with the covariates.
- Does not guarantee balance within each cluster.
- Random slopes can be incorporated.
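To make the contrast between the marginal and fixed effects models concrete, consider the simplest case with no covariates. There the intercept-only logistic fit returns the overall treated share $n_1/n$ for every unit, and a logistic fit with one intercept per cluster returns the treated share within each cluster. The sketch below computes both directly rather than calling a model-fitting routine; the function names and toy data are hypothetical.

```python
from collections import defaultdict


def marginal_ps(z):
    """Intercept-only logistic MLE: every unit gets the overall treated share."""
    p = sum(z) / len(z)
    return [p] * len(z)


def fixed_effects_ps(z, cluster):
    """Cluster-intercept logistic MLE with no covariates: treated share per cluster."""
    n_h = defaultdict(int)
    n_h1 = defaultdict(int)
    for zi, h in zip(z, cluster):
        n_h[h] += 1
        n_h1[h] += zi
    return [n_h1[h] / n_h[h] for h in cluster]


# Hypothetical toy data: two clusters with very different treatment rates.
cluster = ["a"] * 4 + ["b"] * 4
z = [1, 1, 1, 0, 1, 0, 0, 0]

e_ma = marginal_ps(z)                 # same value for all units
e_fe = fixed_effects_ps(z, cluster)   # differs between cluster "a" and "b"
```

A cluster in which all units share the same treatment status would get a fitted score of 0 or 1 under the fixed effects version, which makes the corresponding HT weights undefined; such clusters need separate handling.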

Step 2: Estimators for the ACD or the ATE

Two types of propensity-score-weighted estimators:

1. Nonparametric: applying the weights directly to the observed outcomes.
2. Parametric: applying the weights to the fitted outcomes from a parametric model (doubly-robust estimators).

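Nonparametric Estimators

- Marginal estimator, ignoring clustering:

  $$\hat\pi_{ma} = \sum_{Z_{hk}=1} \frac{Y_{hk} w_{hk}}{w_1} - \sum_{Z_{hk}=0} \frac{Y_{hk} w_{hk}}{w_0},$$

  where $w_{hk}$ is the HT weight using the estimated propensity score, and $w_z = \sum_{h,k:\, Z_{hk}=z} w_{hk}$ for $z = 0, 1$.

- Cluster-weighted estimator: (1) obtain the cluster-specific ATE; and (2) average over clusters:

  $$\hat\pi_{cl} = \frac{\sum_h w_h \hat\pi_h}{\sum_h w_h},$$

  where

  $$\hat\pi_h = \frac{\sum_{k \in h:\, Z_{hk}=1} Y_{hk} w_{hk}}{w_{h1}} - \frac{\sum_{k \in h:\, Z_{hk}=0} Y_{hk} w_{hk}}{w_{h0}},$$

  and $w_{hz} = \sum_{k \in h:\, Z_{hk}=z} w_{hk}$ for $z = 0, 1$.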
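The two estimators can be sketched in a few lines. The slide does not define the cluster weight $w_h$; the code below assumes it is the sum of the unit weights in cluster $h$, which is an assumption of this illustration. The toy data are hypothetical: both clusters have a treatment effect of 1, but the cluster with the higher baseline outcome also has the higher treatment rate.

```python
from collections import defaultdict


def marginal_estimator(y, z, w):
    """Weighted difference in means over the pooled sample (pi_ma)."""
    w1 = sum(wi for wi, zi in zip(w, z) if zi == 1)
    w0 = sum(wi for wi, zi in zip(w, z) if zi == 0)
    m1 = sum(wi * yi for wi, yi, zi in zip(w, y, z) if zi == 1) / w1
    m0 = sum(wi * yi for wi, yi, zi in zip(w, y, z) if zi == 0) / w0
    return m1 - m0


def cluster_weighted_estimator(y, z, w, cluster):
    """Weighted average of within-cluster weighted differences (pi_cl).

    Assumption: w_h is the sum of the unit weights in cluster h.
    Every cluster must contain both treated and control units.
    """
    groups = defaultdict(list)
    for i, h in enumerate(cluster):
        groups[h].append(i)
    num = 0.0
    den = 0.0
    for idx in groups.values():
        yh = [y[i] for i in idx]
        zh = [z[i] for i in idx]
        wh = [w[i] for i in idx]
        w_h = sum(wh)
        num += w_h * marginal_estimator(yh, zh, wh)
        den += w_h
    return num / den


# Hypothetical toy data: effect 1 in both clusters, baseline 0 in "a" and 10 in "b".
cluster = ["a"] * 4 + ["b"] * 4
z = [1, 0, 0, 0, 1, 1, 1, 0]
y = [1.0, 0.0, 0.0, 0.0, 11.0, 11.0, 11.0, 10.0]
w = [2.0] * 8  # HT weights under a marginal p.s. of n1/n = 0.5

pi_ma = marginal_estimator(y, z, w)
pi_cl = cluster_weighted_estimator(y, z, w, cluster)
```

With these inputs the pooled comparison mixes the between-cluster baseline difference into the estimate, whereas the cluster-weighted version compares treated and control units only within the same cluster.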

Parametric doubly-robust (DR) estimators

- DR estimator (Robins and colleagues): replace the observed outcomes by fitted outcomes from a parametric model.

  $$\hat\pi_{dr} = \sum_{h,k} \hat I_{hk} / n,$$

  where

  $$\hat I_{hk} = \left[\frac{Z_{hk} Y_{hk}}{\hat e_{hk}} - \frac{(Z_{hk} - \hat e_{hk})\, \hat Y^1_{hk}}{\hat e_{hk}}\right] - \left[\frac{(1 - Z_{hk}) Y_{hk}}{1 - \hat e_{hk}} + \frac{(Z_{hk} - \hat e_{hk})\, \hat Y^0_{hk}}{1 - \hat e_{hk}}\right],$$

  and $\hat Y^z$ is the fitted outcome from a parametric model in group $z$.

- Double-robustness: in large samples, $\hat\pi_{dr}$ is consistent if either the p.s. model or the outcome model is correct, but not necessarily both.
- The SE of these estimators can be obtained via the delta method.

DR estimators: outcome model

- Marginal model, ignoring clustering:

  $$Y_{hk} = \eta_0 + Z_{hk}\gamma + X_{hk}\beta + \varepsilon_{hk}.$$

- Fixed effects model, adjusting for cluster-level main effects $\eta_h$ and covariates:

  $$Y_{hk} = \eta_h + Z_{hk}\gamma + U_{hk}\beta + \varepsilon_{hk}.$$

- Random effects model, assuming cluster-specific effects $\eta_h \sim N(0, \sigma_\eta^2)$:

  $$Y_{hk} = \eta_h + Z_{hk}\gamma + X_{hk}\beta + \varepsilon_{hk}.$$

- Different DR estimators: combinations of different choices of propensity score model and outcome model.

Large sample properties

- We investigate the large-sample bias of the nonparametric estimators in the presence of intra-cluster influence, in the setting with no observed covariates.
- $n_{hz}$: number of subjects in group $Z = z$ in cluster $h$.
- $n_z = \sum_h n_{hz}$, $n_h = n_{h1} + n_{h0}$, $n = n_1 + n_0$.
- Assume a Bernoulli assignment mechanism with varying rates by cluster:

  $$Z_{hk} \sim \mathrm{Bernoulli}(p_h). \qquad (1)$$

- Assume an outcome model with cluster-specific random intercepts $\eta_h$ and constant treatment effect $\pi$,

  $$Y_{hk} = \eta_h + Z_{hk}\pi + \tau d_h + \varepsilon_{hk}, \quad \eta_h \sim N(\eta_0, \sigma_\eta^2), \quad \varepsilon_{hk} \sim N(0, \sigma_\varepsilon^2), \qquad (2)$$

  where $d_h = n_{h1}/n_h$ is the cluster-specific proportion treated.

Nonparametric marginal estimator with marginal p.s.

- Violation of unconfoundedness and/or SUTVA.
- Under the marginal propensity score model, $\hat e_{hk} = n_1/n$ for all units.
- Then for the nonparametric marginal estimator:

  $$\hat\pi^{ma}_{ma} = \pi + \tau \frac{V}{V_0} + \sum_h \eta_h \left(\frac{n_{h1}}{n_1} - \frac{n_{h0}}{n_0}\right) + \left(\sum_{h,k:\, Z_{hk}=1} \frac{\varepsilon_{hk}}{n_1} - \sum_{h,k:\, Z_{hk}=0} \frac{\varepsilon_{hk}}{n_0}\right),$$

  where
  - $V = \sum_h \frac{n_h}{n}\left(d_h^2 - (n_1/n)^2\right)$: weighted sample variance of $\{d_h\}$;
  - $V_0 = (n_1 n_0)/n^2$: maximum of $V$, attainable if each cluster is assigned to either all treatment or all control (treatment assigned at cluster level).
- The expectations of the last two terms are asymptotically 0.
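The coefficient of $\tau$ can be checked numerically. With equal weights, the $\tau d_h$ part of the marginal estimator is the average of $d_h$ over treated units minus its average over control units, and this should equal $V/V_0$ when $V$ is taken with weights $n_h/n$ so that its maximum is $V_0$. The cluster sizes and treated counts below are hypothetical, as are the function names.

```python
def assignment_variation(n_h, n_h1):
    """Return (V, V0) for cluster sizes n_h and treated counts n_h1."""
    n = sum(n_h)
    n1 = sum(n_h1)
    n0 = n - n1
    d = [a / b for a, b in zip(n_h1, n_h)]
    v = sum(nh / n * (dh ** 2 - (n1 / n) ** 2) for nh, dh in zip(n_h, d))
    v0 = n1 * n0 / n ** 2
    return v, v0


def tau_coefficient(n_h, n_h1):
    """Treated-group average of d_h minus control-group average of d_h."""
    n1 = sum(n_h1)
    n0 = sum(n_h) - n1
    d = [a / b for a, b in zip(n_h1, n_h)]
    mean_treated = sum(a * dh for a, dh in zip(n_h1, d)) / n1
    mean_control = sum((b - a) * dh for a, b, dh in zip(n_h1, n_h, d)) / n0
    return mean_treated - mean_control


# Hypothetical cluster sizes and treated counts.
n_h = [10, 20, 30, 40]
n_h1 = [2, 10, 24, 8]
v, v0 = assignment_variation(n_h, n_h1)
ratio = v / v0
coef = tau_coefficient(n_h, n_h1)

# Cluster-level assignment (each d_h is 0 or 1) attains the maximum V = V0.
v_max, v0_max = assignment_variation([10, 20], [10, 0])
```

The two quantities `ratio` and `coef` agree up to floating-point error, and the second call illustrates the boundary case in which the bias factor $V/V_0$ equals 1.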

Nonparametric marginal estimator with marginal p.s.

- Therefore, under the generating model (1) and (2), the large-sample bias of $\hat\pi^{ma}_{ma}$ is $\tau V/V_0$.
- Interpreting the result:
  1. $\tau$ measures the variation in the outcome generating mechanism between clusters;
  2. $V/V_0$ measures the variation in the treatment assignment mechanism between clusters.

  Both are ignored in $\hat\pi^{ma}_{ma}$, introducing bias.
- We also analyze the nonparametric estimators that account for clustering in at least one stage; all lead to consistent estimates.
- Conclusion: with violations of unconfoundedness due to unobserved cluster effects, (1) ignoring clustering in both stages of p.s. weighting leads to biased estimates, but (2) accounting for clustering in at least one stage leads to consistent estimates.

Simulation Design

- The asymptotics do not apply to a "large number of small clusters".
- The Neyman-Scott problem: MLEs for $\delta_h$ and $\alpha$ are inconsistent due to the growing number of parameters (clusters).
- The primary role of the p.s. is to balance covariates; even an inconsistent p.s. estimate may still work well, as long as it balances covariates.
- Simulated under balanced designs:
  (i) Small number of large clusters: $H = 30$, $n_h = 400$,
  (ii) Large number of medium-sized clusters: $H = 200$, $n_h = 20$,
  (iii) Large number of small clusters: $H = 400$, $n_h = 10$.
- Additional goal: investigate the impact of an unmeasured cluster-level confounder in more realistic settings with observed covariates.

Simulation Design

- Two individual-level covariates:

  $$U_1 \sim N(1, 1); \quad U_2 \sim \mathrm{Bernoulli}(0.4).$$

- One cluster-level covariate $V$:
  (i) uncorrelated with $U$: $V \sim N(1, 2)$;
  (ii) correlated with the p.s. model: $V_h = 1 + 2\delta_h + u$, $u \sim N(0, 0.5)$.
- True treatment assignment:

  $$\mathrm{logit}\, \Pr(Z_{hk} = 1 \mid X_{hk}) = \delta_h + X_{hk}\alpha,$$

  True parameters: $\delta_h \sim N(0, 1)$ and $\alpha = (-1, -.5, \alpha_3)$.
- True (potential) outcome model, a random effects model with interaction ($\eta_h \sim N(0, 1)$, $\gamma_h \sim N(0, 1)$, $\varepsilon_{hk} \sim N(0, \sigma_y^2)$):

  $$Y_{hk} = \eta_h + X_{hk}\beta + Z_{hk}(V_h\kappa + \gamma_h) + \varepsilon_{hk},$$

  True parameters: $\kappa = 2$ and $\beta = (1, .5, \beta_3)$.
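The data-generating process for scenario (i) of the cluster-level covariate might be sketched as follows. The slides leave $\alpha_3$, $\beta_3$ and $\sigma_y$ unspecified, so the values used here (all 1.0) are placeholders, and the second argument of $N(\cdot, \cdot)$ is read as a variance; all of these are assumptions of this illustration, as is the function name `simulate`.

```python
import math
import random


def simulate(H, n_h, alpha3=1.0, beta3=1.0, kappa=2.0, sigma_y=1.0, seed=0):
    """Generate one balanced two-level data set with both potential outcomes.

    alpha3, beta3 and sigma_y are placeholder values, not taken from the slides.
    """
    rng = random.Random(seed)
    rows = []
    for h in range(H):
        delta = rng.gauss(0.0, 1.0)          # cluster effect in the assignment model
        eta = rng.gauss(0.0, 1.0)            # cluster intercept in the outcome model
        gamma = rng.gauss(0.0, 1.0)          # cluster-specific treatment effect part
        v = rng.gauss(1.0, math.sqrt(2.0))   # scenario (i): V ~ N(1, 2), variance 2
        for _ in range(n_h):
            u1 = rng.gauss(1.0, 1.0)
            u2 = 1 if rng.random() < 0.4 else 0
            lin = delta - 1.0 * u1 - 0.5 * u2 + alpha3 * v
            p = 1.0 / (1.0 + math.exp(-lin))
            z = 1 if rng.random() < p else 0
            y0 = eta + 1.0 * u1 + 0.5 * u2 + beta3 * v + rng.gauss(0.0, sigma_y)
            y1 = y0 + v * kappa + gamma
            rows.append({"h": h, "u1": u1, "u2": u2, "v": v, "z": z,
                         "y": y1 if z == 1 else y0, "y0": y0, "y1": y1})
    return rows


# Design (i): small number of large clusters.
rows = simulate(H=30, n_h=400)
# True effect for this replicate: average of Y(1) - Y(0) over simulated units.
true_pi = sum(r["y1"] - r["y0"] for r in rows) / len(rows)
```

Keeping both potential outcomes in each row makes the per-replicate true value of $\pi$ directly computable, which is how the estimators' bias would then be measured.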

Simulation Design

- Under each scenario, 500 replicates are generated.
- For each simulation,
  1. Calculate the true value of $\pi$ by $\sum_{h,k} [Y_{hk}(1) - Y_{hk}(0)]/n$ from the simulated units.
  2. Estimate the propensity score using the three models with $U_1, U_2$, omitting the cluster-level $V$.
  3. Fit the three outcome models with $U_1, U_2$, omitting the cluster-level $V$.
  4. Obtain the nonparametric estimates and the DR estimates for $\pi$ from the above.
- For comparison, estimates from the true models (benchmark) with $U_1, U_2, V$ are also calculated.
- Random effects models are fitted via the lmer function in the R package lme4 (penalized likelihood estimation).

Simulation Results

Rows give the p.s. model: BE = benchmark, MA = marginal, FE = fixed effects, RE = random effects.

                nonparametric           doubly-robust
                marginal    cluster     benchmark   marginal    fixed       random
(H, n_h)  p.s.  bias  mse   bias  mse   bias  mse   bias  mse   bias  mse   bias  mse
(30,400)  BE    .15   .22   .07   .10   .029  .037  .23   .33   .10   .13   .029  .037
          MA    4.46  4.50  .27   .32   .023  .029  4.46  4.51  .28   .34   .024  .030
          FE    .15   .22   .07   .10   .030  .039  .23   .35   .10   .14   .030  .039
          RE    .17   .23   .07   .09   .029  .037  .25   .34   .10   .13   .029  .037
(200,20)  BE    .49   .54   .16   .19   .041  .052  .82   .88   .13   .19   .044  .055
          MA    4.46  4.48  .29   .32   .036  .045  4.46  4.48  .16   .20   .046  .057
          FE    .54   .59   .12   .15   .046  .058  .67   .75   .14   .18   .048  .060
          RE    1.24  1.25  .13   .16   .040  .050  1.59  1.61  .12   .15   .044  .054
(400,10)  BE    .59   .63   .20   .23   .043  .053  1.06  1.10  .13   .17   .052  .065
          MA    4.49  4.50  .31   .33   .038  .048  4.49  4.51  .16   .18   .067  .081
          FE    1.00  1.03  .12   .15   .047  .059  1.25  1.29  .15   .18   .055  .068
          RE    1.84  1.85  .17   .20   .040  .051  2.32  2.33  .13   .16   .056  .069

Simulation Summaries

1. Estimators that ignore clustering in both propensity score and outcome models have much larger bias and RMSE than all the others.
2. The choice of the outcome model appears to have a larger impact on the results than the choice of the p.s. model.
3. Among the DR estimators, the ones with the benchmark outcome model and the ones with the random effects outcome model perform the best.
4. Given the same propensity score model, the nonparametric cluster-weighted estimator generally gives smaller bias than the DR estimator with the fixed effects outcome model.
5. When the number of clusters increases and the size of each cluster decreases, bias and RMSE increase considerably in all the estimators. The largest increase is observed in the nonparametric marginal estimators.

Application: racial disparity in health care

- Disparity: racial differences in care attributed to operations of the health care system.
- Breast cancer screening data in different health insurance plans are collected from the Centers for Medicare and Medicaid Services (CMS).
- Focus on comparison between whites and blacks.
- Focus on the plans with at least 25 whites and 25 blacks: 64 plans with a total sample size of 75012.
- Subsample 3000 subjects from large (>3000) clusters to restrict the impact of extremely large clusters, resulting in a sample size of 56480.

Racial disparity data

- "Treatment" $Z_{hk}$: black race (1 = black, 0 = white).
- Outcome $Y_{hk}$: whether or not the enrollee received screening for breast cancer.
- Goal: investigate racial disparity in breast cancer screening, an unconfounded descriptive comparison.
- Cluster-level covariates $V_h$: geographical code, non/for-profit status, practice model.
- Individual-level covariates $X_{hk}$: age category, eligibility for Medicaid, poor neighborhood.
- Analysis: fit the three p.s. models and the three outcome models, and obtain estimates.

Balance Check: different propensity score models

Figure: Histograms of cluster-specific weighted proportions of black enrollees. Panels: unweighted; marginal p.s. model; fixed effects p.s. model; random effects p.s. model. Horizontal axis: cluster-specific proportion of blacks (0.0 to 1.0); vertical axis: frequency.

All p.s. models suggest: living in a poor neighborhood, being eligible for Medicaid and enrollment in a for-profit insurance plan are significantly associated with black race.

Results

                weighted                     doubly-robust
p.s. model      marginal      clustered      marginal      fixed         random
marginal        -4.96 (.79)   -1.73 (.83)    -4.43 (.85)   -2.15 (.41)   -1.65 (.43)
fixed           -2.49 (.92)   -1.78 (.81)    -1.93 (.82)   -2.21 (.42)   -1.96 (.41)
random          -2.56 (.91)   -1.78 (.82)    -2.00 (.44)   -2.22 (.39)   -1.95 (.39)

Table: Average controlled difference, in percentage points, in the proportion receiving breast cancer screening between blacks and whites (standard errors in parentheses).

- All estimators show that the rate of receiving breast cancer screening is significantly lower among blacks than among whites with similar characteristics.
- Ignoring clustering in both stages doubled the estimates relative to analyses that account for clustering in at least one stage.
- Between-cluster variation is large.
- DR estimates have smaller SE.

Conclusion

- Propensity score is a powerful tool to balance covariates, for both causal and descriptive comparisons.
- We introduce and compare several propensity score weighting methods for multilevel data.
- Using analytical derivations and simulations, we show that
  1. Ignoring the multilevel structure in both stages of p.s. weighting generally leads to severe bias for ATE/ACD.
  2. Exploiting the multilevel structure, either parametrically or nonparametrically, in at least one stage can greatly reduce the bias.
- In complex observational data, correctly specifying the outcome model may be challenging. Propensity score methods are more robust.
- Interference between units in a cluster may exist (SUTVA does not hold); further research is needed.
