

  • Propensity Score Weighting for Causal Inference with Multiple Treatments

    Fan Li

    Department of Statistical Science, Duke University

    JSM 2019, Denver

    July 30, 2019

    Joint work with Fan (Frank) Li, Yale University

  • Introduction

    - Causal inference literature has largely focused on binary treatments

    - Multiple (or multi-valued) treatments are increasingly common

    - Examples:

      - multiple treatment options for one medical condition

      - disparities between more than two races

    - Key: Balancing covariate distributions across treatment groups to remove confounding, by design or analysis

    - One common approach is weighting

    - Main idea: Weight each treatment group to create a pseudo-population (the target population) where the covariate distributions are balanced

  • Standard Setup

    - Data: a random sample of $n$ units drawn from a population

    - Treatments: $Z_i \in \mathcal{Z} = \{1, \ldots, J\}$ with $J \ge 3$

    - For each unit $i$, a set of potential outcomes $\{Y_i(1), \ldots, Y_i(J)\}$; only $Y_i(Z_i)$ is observed

    - Observed data: pre-treatment variables (covariates) $X_i$, treatment status $Z_i$ and $Y_i = Y_i(Z_i)$

    - Estimand: for unordered nominal treatments, pairwise average treatment effect (pATE)

      $$\tau^{\mathrm{pATE}}_{j,j'} = E[Y(j) - Y(j')], \quad j \neq j'$$

    - Generalized Propensity Score (GPS) is a common technique to estimate $\tau$ in observational studies

  • Generalized Propensity Score (GPS) (Imbens, 2000)

    Definition: the Generalized Propensity Score (GPS) is the conditional probability of being assigned to a treatment group given the covariates:

      $$e_j(X) \equiv \Pr(Z = j \mid X)$$

    - Each unit has $J$ GPSs: $e = \{e_1, \ldots, e_J\}$, and $\sum_{j=1}^{J} e_j(X) = 1$ for all $X \in \mathcal{X}$

    - In practice, $J - 1$ scores are adequate to characterize each unit, but not fewer

  • Causal Assumptions (Imbens, 2000)

    - (Weak Unconfoundedness) The assignment is weakly unconfounded if

      $$Y(j) \perp 1\{Z = j\} \mid X, \quad \forall\, j \in \mathcal{Z}.$$

    - $\Rightarrow\; Y(j) \perp 1\{Z = j\} \mid e_j(X)$ for all $j \in \mathcal{Z}$

    - (Overlap) The probability of being assigned to any treatment group is positive: $e_j(X) = \Pr(Z = j \mid X) > 0$ for all $X \in \mathcal{X}$ and $j \in \mathcal{Z}$

    - Similar to binary treatments, can use weighting to identify

      $$m_j = E[Y(j)] = E_X\left\{ E\left[ \frac{1\{Z = j\}\, Y}{e_j(X)} \,\middle|\, X \right] \right\}$$
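A short sketch of why the weighted mean identifies $m_j$, using only the two assumptions above and iterated expectations. The first step relies on the usual convention that the observed outcome equals the potential outcome under the received treatment, $1\{Z = j\}\,Y = 1\{Z = j\}\,Y(j)$, which the slide leaves implicit:

$$
\begin{aligned}
E\left[\frac{1\{Z=j\}\,Y}{e_j(X)}\right]
&= E\left[\frac{E[\,1\{Z=j\}\,Y(j) \mid X\,]}{e_j(X)}\right] \\
&= E\left[\frac{E[\,1\{Z=j\} \mid X\,]\; E[\,Y(j) \mid X\,]}{e_j(X)}\right] && \text{(weak unconfoundedness)} \\
&= E\left[\frac{e_j(X)\, E[\,Y(j) \mid X\,]}{e_j(X)}\right] = E[Y(j)] = m_j && \text{(overlap: } e_j(X) > 0\text{)}
\end{aligned}
$$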

  • Inverse Probability Weighting (IPW)

    - The generalized propensity score $e_j(X) = \Pr(Z = j \mid X)$

    - Inverse probability weights:

      $$\{w_1(X_i), \ldots, w_J(X_i)\} = \left\{ \frac{1}{e_1(X_i)}, \ldots, \frac{1}{e_J(X_i)} \right\}$$

    - A moment (ratio) estimator of the pairwise ATE, consistent under the assumptions above:

      $$\hat{\tau}^{\mathrm{pATE}}_{j,j'} = \frac{\sum_{i=1}^{n} 1(Z_i = j)\, Y_i / e_j(X_i)}{\sum_{i=1}^{n} 1(Z_i = j) / e_j(X_i)} - \frac{\sum_{i=1}^{n} 1(Z_i = j')\, Y_i / e_{j'}(X_i)}{\sum_{i=1}^{n} 1(Z_i = j') / e_{j'}(X_i)}$$

    - IPW balances the weighted distribution of pre-treatment covariates across multiple groups relative to the combined population

  • IPW: Challenges

    - Target population of IPW: the combined population from all $J$ treatment groups

    - What if the study sample is a convenience sample that may not represent any meaningful population?

    - May correspond to the effect of an infeasible intervention

    - Operationally

      - Extreme GPS (close to 0) leads to bias and excessive variance, exacerbated in multiple treatments

      - Trimming used to remove units with extreme GPS: may lose a large portion of the sample, sensitive to cutoff

      - Matching (Yang et al., 2016): ambiguous target population

  • Weighting Beyond IPW

    - Key problem:

      - substantial overlap in the centroid of the propensity distribution: comparative effectiveness information is of high scientific interest since the treatment option is uncertain

      - lack of overlap in the edges of the propensity distribution: comparative effectiveness information is less relevant

    - This work: extend the overlap weighting framework of Li, Morgan, Zaslavsky (2018) to multiple treatments

    - Provide a unified framework, the balancing weights, to allow user-specified target populations

    - Develop a new weighting scheme, the generalized overlap weighting, to emphasize the “overlapped population”

  • General Estimands (Li and Li, 2019)

    - Define $m_j(X) = E[Y(j) \mid X]$ as the mean function

    - Assume the population density corresponding to the observed sample, $f(X)$, exists w.r.t. a base measure $\mu$

    - Consider a target population with a different density $g(X) \propto f(X)\, h(X)$, for a pre-specified $h(\cdot)$, the tilting function

    - Average potential outcome in the target population

      $$m^h_j \equiv E_h[Y(j)] = \frac{\int_{\mathcal{X}} m_j(X)\, f(X)\, h(X)\, \mu(dX)}{\int_{\mathcal{X}} f(X)\, h(X)\, \mu(dX)}.$$

    - Estimand: $\tau^h(a) = \sum_{j=1}^{J} a_j m^h_j$ for $a = (a_1, \ldots, a_J)$

    - Setting $h(X) = 1$, the pairwise ATE is a special case

  • Balancing Weights

    - Recall

      $$f_j(X) = f(X \mid Z = j) \propto f(X)\, e_j(X), \quad \forall\, j \in \mathcal{Z}$$

    - For a given $h(X)$, to estimate $m^h_j$, we can weight $f_j(X)$ to the target population using weights

      $$w_j(X) \propto \frac{f(X)\, h(X)}{f(X)\, e_j(X)} = \frac{h(X)}{e_j(X)}, \quad \forall\, j \in \mathcal{Z}$$

    - The class of weights $\{h(X)/e_1(X), \ldots, h(X)/e_J(X)\}$ is called the balancing weights: they balance the weighted distributions of covariates across the $J$ comparison groups:

      $$f_j(X)\, w_j(X) = f(X)\, h(X) \propto g(X), \quad \forall\, j \in \mathcal{Z}$$

    - IPW is a special case with $h(X) = 1$

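  • Balancing Weights: Nominal Treatments

    - Choice of coefficient $a$ determines the causal contrast: often choose $a$ to define pairwise comparisons

    - Choice of $h$ determines the target population and weights

    | Target population | Tilting function $h(X)$ | Weights $\{w_j(X),\ j \in \mathcal{Z}\}$ |
    |---|---|---|
    | Combined | $1$ | $\{1/e_j(X),\ j \in \mathcal{Z}\}$ |
    | Treated ($j'$th group) | $e_{j'}(X)$ | $\{e_{j'}(X)/e_j(X),\ j \in \mathcal{Z}\}$ |
    | Trimming | $1\{X \in C\}$ | $\{1\{X \in C\}/e_j(X),\ j \in \mathcal{Z}\}$ |
    | Matching | $\min_{1 \le l \le J}\{e_l(X)\}$ | $\{\min_l\{e_l(X)\}/e_j(X),\ j \in \mathcal{Z}\}$ |
    | Overlap | $1 \big/ \sum_{l=1}^{J} 1/e_l(X)$ | $\left\{ \dfrac{1/e_j(X)}{\sum_{l=1}^{J} 1/e_l(X)},\ j \in \mathcal{Z} \right\}$ |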
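The table can be read as a recipe: pick a tilting function, then divide by each GPS. The sketch below is one way to write that down in plain Python for a single unit. The function name `balancing_weights`, its arguments, and the 0-based group index are illustrative choices, not part of the talk.

```python
def balancing_weights(e, target="overlap", ref=None, in_region=True):
    """Return [w_1, ..., w_J] with w_j = h(X) / e_j(X) for one unit.

    e         : list of the unit's J generalized propensity scores (all > 0)
    target    : "combined", "treated", "trimming", "matching" or "overlap"
    ref       : 0-based index of the reference group (used by "treated" only)
    in_region : whether X lies in the trimming region C (used by "trimming" only)
    """
    if target == "combined":        # h = 1, i.e. IPW
        h = 1.0
    elif target == "treated":       # h = e_{j'}(X)
        h = e[ref]
    elif target == "trimming":      # h = 1{X in C}
        h = 1.0 if in_region else 0.0
    elif target == "matching":      # h = min_l e_l(X)
        h = min(e)
    elif target == "overlap":       # h = 1 / sum_l 1/e_l(X)
        h = 1.0 / sum(1.0 / el for el in e)
    else:
        raise ValueError(f"unknown target: {target}")
    return [h / ej for ej in e]


# Example unit with J = 3 (made-up GPS values)
e = [0.5, 0.3, 0.2]
ipw = balancing_weights(e, "combined")
att = balancing_weights(e, "treated", ref=0)
matching = balancing_weights(e, "matching")
overlap = balancing_weights(e, "overlap")
print(ipw, att, matching, overlap)
```

With the proportionality constant set to one as here, the overlap weights of a unit sum to one across groups, while the IPW weights grow without bound as any GPS approaches zero.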

  • Moment Estimator

    - General principle: estimating the average of the potential outcomes separately for each treatment level with the balancing weights, $w_j(X) = h(X)/e_j(X)$

    - Moment weighting estimator

      $$\hat{m}^h_j = \hat{E}_h[Y(j)] = \frac{\sum_{i=1}^{n} 1(Z_i = j)\, Y_i\, w_j(X_i)}{\sum_{i=1}^{n} 1(Z_i = j)\, w_j(X_i)}, \qquad \hat{\tau}^h(a) = \sum_{j=1}^{J} a_j \hat{m}^h_j$$

    - Theorem 1. Under weak unconfoundedness, for any $h$ and $a$, $\hat{\tau}^h(a)$ is a consistent estimator of $\tau^h(a)$
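A minimal sketch of the moment weighting estimator, assuming the GPS values are already available for every unit (in practice they would be estimated). The function names, the 0-based treatment labels, and the toy data are all made up for illustration.

```python
def weighted_means(z, y, gps, h):
    """Estimate m_j^h for every arm with the normalized moment estimator.

    z   : treatment labels in {0, ..., J-1}
    y   : observed outcomes
    gps : per-unit GPS vectors [e_1(X_i), ..., e_J(X_i)]
    h   : tilting function mapping a GPS vector to a non-negative number
    """
    n_arms = len(gps[0])
    num = [0.0] * n_arms
    den = [0.0] * n_arms
    for zi, yi, ei in zip(z, y, gps):
        w = h(ei) / ei[zi]          # balancing weight of the arm actually received
        num[zi] += w * yi
        den[zi] += w
    # Assumes every arm has at least one unit with positive weight.
    return [s / d for s, d in zip(num, den)]


def contrast(m_hat, a):
    """tau_hat^h(a) = sum_j a_j * m_hat_j."""
    return sum(aj * mj for aj, mj in zip(a, m_hat))


def h_combined(e):
    return 1.0                       # IPW


def h_overlap(e):
    return 1.0 / sum(1.0 / el for el in e)


# Toy data: six units, three arms (values are illustrative only)
z = [0, 0, 1, 1, 2, 2]
y = [1.0, 3.0, 2.0, 4.0, 5.0, 7.0]
gps = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.3, 0.4, 0.3],
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
    [0.3, 0.3, 0.4],
]
a = [1, -1, 0]                       # arm 0 versus arm 1
tau_ipw = contrast(weighted_means(z, y, gps, h_combined), a)
tau_overlap = contrast(weighted_means(z, y, gps, h_overlap), a)
print(tau_ipw, tau_overlap)
```

Passing the tilting function as an argument keeps the estimator identical across target populations; only `h` changes.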

  • Large-sample Properties

    - Theorem 2. Under mild regularity conditions, the expectation of the conditional variance converges

      $$n \cdot E\{V[\hat{\tau}^h(a) \mid Z, X]\} \to Q(a, h) \equiv \int_{\mathcal{X}} \left( \sum_{j=1}^{J} a_j^2\, v_j(X)/e_j(X) \right) h^2(X)\, f(X)\, \mu(dX) \Big/ C_h^2,$$

      where $v_j(X) = V[Y(j) \mid X]$ and $C_h \equiv \int_{\mathcal{X}} h(X)\, f(X)\, \mu(dX)$.

    - Corollary 1. Under homoscedasticity, $v_j(X) = v$, the tilting function

      $$\tilde{h}(X) \propto \frac{1}{\sum_{j=1}^{J} a_j^2 / e_j(X)}$$

      gives the smallest asymptotic variance for the weighted estimator $\hat{\tau}^h(a)$ among all $h$'s, and $\min_h Q(a, h) = v / C_{\tilde{h}}$.
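One way to see Corollary 1 is a Cauchy-Schwarz argument; this is a sketch and not necessarily the proof used by the authors. Write $k(X) = \sum_{j=1}^{J} a_j^2 / e_j(X)$ (a shorthand introduced here), so that under $v_j(X) = v$ we have $Q(a, h) = v \int_{\mathcal{X}} k\, h^2 f\, d\mu \big/ \left(\int_{\mathcal{X}} h f\, d\mu\right)^2$. Then

$$
\left( \int_{\mathcal{X}} h f\, d\mu \right)^2
= \left( \int_{\mathcal{X}} \big(h \sqrt{k}\big) \frac{1}{\sqrt{k}}\, f\, d\mu \right)^2
\le \int_{\mathcal{X}} k\, h^2 f\, d\mu \cdot \int_{\mathcal{X}} \frac{1}{k}\, f\, d\mu ,
$$

which gives

$$
Q(a, h) \ge \frac{v}{\int_{\mathcal{X}} k(X)^{-1} f(X)\, \mu(dX)} = \frac{v}{C_{\tilde{h}}}, \qquad \tilde{h}(X) = \frac{1}{k(X)},
$$

with equality exactly when $h \sqrt{k} \propto 1/\sqrt{k}$, that is, when $h \propto 1/k$.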

  • Generalized Overlap Weights (Li and Li, 2019)

    - Nominal treatments: pairwise comparison

    - Choose $a \in S = \{\lambda_j - \lambda_{j'} : j \neq j'\}$, where $\lambda_j$ is the $J \times 1$ unit vector with one at the $j$th position and zero elsewhere

    - Choose $h$ to minimize the total asymptotic variance of the weighting estimators for all pairwise comparisons

      $$\tilde{h}(X) = \arg\min_h \sum_{j \neq j'} Q(\lambda_j - \lambda_{j'}, h) \propto \frac{1}{\sum_{l=1}^{J} 1/e_l(X)}$$

    - Generalized overlap weights

      $$w_j(X) \propto \frac{1}{e_j(X)} \times \frac{1}{\sum_{l=1}^{J} 1/e_l(X)}, \quad j \in \mathcal{Z}$$

    - For $J = 2$, $\tilde{h}(X) \propto e(X)\{1 - e(X)\}$: the overlap weights for binary treatments (Li, Morgan and Zaslavsky, 2018)
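A sketch of why the sum over pairs leads to this tilting function. It assumes homoscedasticity, $v_j(X) = v$, as in Corollary 1, and sums over ordered pairs; both are assumptions of this sketch rather than statements on the slide. For $a = \lambda_j - \lambda_{j'}$ we have $a_l^2 = 1$ for $l \in \{j, j'\}$ and $0$ otherwise, so

$$
\sum_{j \neq j'} Q(\lambda_j - \lambda_{j'}, h)
= \frac{v}{C_h^2} \int_{\mathcal{X}} \sum_{j \neq j'} \left( \frac{1}{e_j(X)} + \frac{1}{e_{j'}(X)} \right) h^2(X)\, f(X)\, \mu(dX)
= \frac{2 (J - 1)\, v}{C_h^2} \int_{\mathcal{X}} \left( \sum_{l=1}^{J} \frac{1}{e_l(X)} \right) h^2(X)\, f(X)\, \mu(dX).
$$

This has the form of $Q(a, h)$ in Corollary 1 with every $a_j^2$ replaced by the same constant, so the minimizer is $\tilde{h}(X) \propto 1 / \sum_{l=1}^{J} 1/e_l(X)$. For $J = 2$, with $e_1 = e$ and $e_2 = 1 - e$,

$$
\frac{1}{1/e + 1/(1 - e)} = \frac{e (1 - e)}{(1 - e) + e} = e (1 - e).
$$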

  • Generalized Overlap Weights (Li and Li, 2019)

    - Recall

      $$\tilde{h}(X) \propto \frac{1}{\sum_{l=1}^{J} 1/e_l(X)}, \qquad w_j(X) \propto \frac{1/e_j(X)}{\sum_{l=1}^{J} 1/e_l(X)}$$

    - The maximum of $\tilde{h}$ is attained when $e_j(X) = 1/J$ for all $j$: substantial probability to receive each treatment

    - Target population: subpopulation with the most overlap in covariates among all groups

    - Target estimand: pairwise average treatment effect among the overlap population (pATO)
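The properties of the tilting function stated above can be checked numerically. The sketch below evaluates it on a grid over the three-arm probability simplex; the grid resolution and the helper name `overlap_tilt` are arbitrary choices made for illustration.

```python
def overlap_tilt(e):
    """Generalized overlap tilting function h(e) = 1 / sum_l (1 / e_l)."""
    return 1.0 / sum(1.0 / el for el in e)


# J = 2: the function reduces to e * (1 - e)
e2 = 0.3
binary_gap = abs(overlap_tilt([e2, 1 - e2]) - e2 * (1 - e2))

# J = 3: scan a grid of GPS vectors with strictly positive entries summing to 1
steps = 30
grid = [
    (i / steps, j / steps, (steps - i - j) / steps)
    for i in range(1, steps - 1)
    for j in range(1, steps - i)
]
best = max(grid, key=overlap_tilt)

# Near an edge of the simplex the function is close to the smallest GPS
edge = [0.01, 0.495, 0.495]

print(binary_gap, best, overlap_tilt(best), overlap_tilt(edge))
```

On this grid the maximizer is the uniform vector, where the function equals $1/J^2$; near the edge it is driven by the smallest score.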

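  • Optimal Tilting Function: Ternary Plot

    - For $J = 3$, visualize $h(e_1(X), e_2(X), e_3(X))$ over a two-dimensional probability simplex

    [Figure: ternary plot of $h$ over the simplex with a color scale for $h$ (ticks 0.0, 2.5, 5.0, 7.5, 10.0); annotated points $e = (1/3, 1/3, 1/3)$ and $e \approx (0, 1/2, 1/2)$]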

  • Generalized Overlap Weights: Statistical Advantages

    - Maximum total efficiency for pairwise comparisons among all balancing weights

    - Weights are by construction bounded and robust to extreme propensities (prevalent with multiple treatments)

    - Avoid ad hoc trimming decisions: continuously down-weighting the units along the “edges”

    - Simulations confirmed that causal comparisons enabled by generalized overlap weights are consistently more efficient than ad hoc trimming methods
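To make "bounded by construction" concrete, here is a short check with the proportionality constant set to one (a normalization chosen for illustration):

$$
w_j(X) = \frac{1/e_j(X)}{\sum_{l=1}^{J} 1/e_l(X)} \in (0, 1), \qquad \sum_{j=1}^{J} w_j(X) = 1,
$$

whereas the IPW weight $1/e_j(X)$ is unbounded as $e_j(X) \to 0$. For instance, at $e = (0.01, 0.495, 0.495)$ the IPW weight for group 1 is $100$, while the generalized overlap weight is $100 / (100 + 2/0.495) \approx 0.961$.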

  • Generalized Overlap Weights: Statistical Advantages

    With generalized overlap weights, the pairwise ATO approximates

    - Pairwise ATT (fixing a reference group $j$): if the propensity to treatment $j$ is small compared to others, $e_j(X) \approx 0$:

      $$h(X) \propto \prod_{l=1}^{J} e_l(X) \Big/ \sum_{k=1}^{J} \prod_{l \neq k} e_l(X) \approx \prod_{l=1}^{J} e_l(X) \Big/ \prod_{l \neq j} e_l(X) = e_j(X)$$

    - Pairwise ATE: if the treatment groups are almost balanced in size and covariate distribution so that $e_j(X) \approx 1/J$, $h(X) \approx 1$

    Adaptiveness enables defining a scientific question that may be best answered nonparametrically by the available data

  • Generalized Overlap Weights: Scientific Relevance

    - Generalized overlap weights emphasize the population closest to the population enrolled in a multi-arm randomized trial

    - By design, $e_j(X) = 1/J$ and $h(X) = 1$

    - The overlap population and pairwise ATO are also of substantive interest

    - In medical studies, patients at clinical equipoise: patients whose treatment decisions are most uncertain

    - In policy studies, units whose treatment assignment would be most responsive to a policy shift as new information is obtained

  • Simulated Example

    - Consider $J = 3$ groups with total sample size $n = 1500$

    - Generate $Z \mid X$ from a multinomial logistic model

    - Specify response function $Y(j) \mid X$ for all $j$

    - Consider adequate overlap and lack of overlap (the latter shown below)

    [Figure: densities of the GPS $e_1(X)$, $e_2(X)$, $e_3(X)$ (x-axis: GPS, 0.0 to 1.0; y-axis: density), plotted by group $Z = 1$, $Z = 2$, $Z = 3$]
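A sketch of how such a data set could be generated with the Python standard library. The slide does not give the covariate distribution, coefficients, or response model, so everything numeric below (two standard-normal covariates, the `betas`, the `scale` knob that weakens overlap, the outcome model) is a hypothetical stand-in.

```python
import math
import random


def gps_multinomial_logit(x, betas):
    """GPS under a multinomial logit with the first arm as baseline."""
    scores = [0.0] + [b0 + sum(b * xk for b, xk in zip(bs, x)) for b0, bs in betas]
    top = max(scores)                          # subtract the max for numerical stability
    expd = [math.exp(s - top) for s in scores]
    total = sum(expd)
    return [v / total for v in expd]


def simulate(n=1500, seed=2019, scale=1.0):
    """Draw (x, z, y, gps) for n units; larger `scale` means less overlap."""
    rng = random.Random(seed)
    # Hypothetical intercepts and slopes for arms 2 and 3 (arm 1 is the baseline)
    betas = [(0.2, [0.5 * scale, -0.3 * scale]), (-0.1, [-0.4 * scale, 0.6 * scale])]
    arm_effect = [0.0, 1.0, 2.0]               # hypothetical arm-specific intercepts
    data = []
    for _ in range(n):
        x = [rng.gauss(0, 1), rng.gauss(0, 1)]
        e = gps_multinomial_logit(x, betas)
        z = rng.choices(range(3), weights=e)[0]  # treatment label in {0, 1, 2}
        y = arm_effect[z] + x[0] + x[1] + rng.gauss(0, 1)
        data.append((x, z, y, e))
    return data


adequate = simulate(scale=1.0)
lacking = simulate(scale=3.0)
for name, data in [("adequate", adequate), ("lacking", lacking)]:
    smallest = min(e[z] for _, z, _, e in data)
    print(name, "smallest GPS of a received treatment:", smallest)
```

The smallest GPS of a received treatment is the quantity that drives the largest IPW weight, which is why the two scenarios behave so differently under IPW.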

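  • Simulated Example

    | Method | Abs. bias $\tau_{1,2}$ | Abs. bias $\tau_{1,3}$ | Abs. bias $\tau_{2,3}$ | RMSE $\tau_{1,2}$ | RMSE $\tau_{1,3}$ | RMSE $\tau_{2,3}$ |
    |---|---|---|---|---|---|---|
    | IPW | 0.19 | 0.02 | 0.17 | 1.04 | 0.61 | 1.16 |
    | Optimal Trimming | 0.03 | 0.01 | 0.01 | 0.38 | 0.28 | 0.47 |
    | Overlap | 0.01 | 0.01 | 0.00 | 0.28 | 0.23 | 0.35 |

    | Method | 95% coverage $\tau_{1,2}$ | 95% coverage $\tau_{1,3}$ | 95% coverage $\tau_{2,3}$ |
    |---|---|---|---|
    | IPW | 0.79 | 0.88 | 0.91 |
    | Optimal Trimming | 0.93 | 0.90 | 0.91 |
    | Overlap | 0.95 | 0.94 | 0.94 |

    - Generalized overlap weights: smallest bias, largest efficiency and nominal coverage

    - Similar findings with $J = 4$ and $J = 6$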

  • Example: GARFIELD-AF

    - Goal: study the effectiveness of oral anticoagulants on health outcomes

    - Treatments: vitamin K antagonists (VKA), direct thrombin inhibitors (DTI) and factor Xa inhibitors (FXA)

    - Estimate the GPS by multinomial logistic regression

    [Figure: densities of the estimated GPS in three panels (Propensity to VKA, Propensity to DTI, Propensity to FXA; x-axis: estimated GPS, 0.0 to 1.0; y-axis: density), plotted by VKA, DTI and FXA group]

  • Example: GARFIELD-AF

    [Figure: Boxplots of population standardized difference (PSD) and absolute standardized difference (ASD) for all covariates. Panels: Max PSD and Max ASD (y-axis 0.0 to 1.0), each comparing Unweighted, IPW, Trimming and Overlap]

  • Example: GARFIELD-AF

    - Estimates (CIs) for causal risk ratio for overall mortality

    | Method | DTI-VKA | FXA-VKA | FXA-DTI |
    |---|---|---|---|
    | IPW | 0.87 (0.66, 1.14) | 0.67 (0.51, 0.88) | 0.78 (0.55, 1.10) |
    | Trimming | 0.87 (0.64, 1.20) | 0.73 (0.57, 0.94) | 0.84 (0.59, 1.19) |
    | Overlap | 0.84 (0.66, 1.09) | 0.74 (0.61, 0.90) | 0.88 (0.67, 1.15) |

    - Overall similar since mortality ∼ 4%; IPW exaggerates the effect vs. randomized trials (Providência et al, 2014)

    - Optimal trimming excludes 1936, 71 and 252 patients in the VKA, DTI and FXA groups, respectively

  • Conclusion

    - We provided a unified framework of balancing weights to balance covariates for any target population

    - We proposed the generalized overlap weights for pairwise causal comparisons: statistical efficiency and scientific relevance

    - Takeaways:

      - consider scientifically appropriate target population

      - should not automatically focus on IPW (ATE)

      - generalized overlap weights exemplify observational studies analyzed like randomized trials

  • Acknowledgements

    - Fan Li (Yale Biostatistics)

    - Alan Zaslavsky (Harvard Health Care Policy)

    - Laine Thomas (Duke Biostatistics and Bioinformatics)

    - Karen Pieper (Duke Clinical Research Institute)

  • References

    - Imbens, GW. (2000). The role of the propensity score in estimating dose-response functions. Biometrika, 87, 706–710.

    - Li, F, Thomas, LE, and Li, F. (2019). Addressing extreme propensity scores via the overlap weights. American Journal of Epidemiology, 188(1), 250–257.

    - Yang, S, Imbens, GW, Cui, Z, Faries, DE, and Kadziola, Z. (2016). Propensity score matching and subclassification in observational studies with multi-level treatments. Biometrics, 72, 1055–1065.

    - Li, F, and Li, F. (2019). Propensity score weighting for causal inference with multiple treatments. Annals of Applied Statistics, forthcoming. arXiv:1808.05339.

    - Li, F, Morgan, LK, and Zaslavsky, AM. (2018). Balancing covariates via propensity score weighting. Journal of the American Statistical Association, 113, 390–400.
