propensity scores making sense of non-randomized observational data

52
PROPENSITY SCORES MAKING SENSE OF NON-RANDOMIZED OBSERVATIONAL DATA Atul Sharma MD, MSc, FRCPC(ret) Biostatistical Consulting Unit April 2014

Upload: aulani

Post on 24-Feb-2016

58 views

Category:

Documents


1 download

DESCRIPTION

Propensity Scores Making Sense of Non-Randomized Observational Data. Atul Sharma MD, MSc, FRCPC(ret) Biostatistical Consulting Unit April 2014. Propensity score 1996 - 2013. RCT – the gold standard. R.A. Fisher: The Design of Experiments, 1925 Introduced the concept of randomization . - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Propensity Scores  Making Sense of Non-Randomized Observational Data

PROPENSITY SCORES MAKING SENSE OF NON-RANDOMIZED OBSERVATIONAL DATA

Atul Sharma MD, MSc, FRCPC(ret)Biostatistical Consulting Unit

April 2014

Page 2: Propensity Scores  Making Sense of Non-Randomized Observational Data

Propensity score 1996 - 2013

Page 3: Propensity Scores  Making Sense of Non-Randomized Observational Data

RCT – the gold standard

R.A. Fisher: The Design of Experiments, 1925• Introduced the concept of randomization

Goal: estimate treatment effect in ith individual ti = Yi1 – Yi0, where • Yi1 = outcome of ith patient with treatment, Yi0 = outcome with control• Usually, we observe only one outcome i.e. “counterfactual” is hypothetical • This is the “fundamental problem of causal inference “ -> randomization

Neyman-Ruben Counterfactual Framework• Ruben and Rosenbaum, 1983• Formalizes conditions for valid RCT

Page 4: Propensity Scores  Making Sense of Non-Randomized Observational Data

• With randomization, average treatment effect estimated by difference in means or proportions between treatment & control groups i.e.

• Randomization ensures groups balanced for all important covariates -> unbiased estimate of average treatment effect. Treatment assignment is ‘strongly ignorable’

• 1:1 randomization allows direct comparison of groups with simple statistics e.g t-tests for means; RR, OR, NNT for proportions

• With more complex study designs e.g. 1:k randomization, weighted averages still permit unbiased estimates of average treatment effect

• All such estimates rely implicitly on knowing the probability of assignment to treatment group = propensity score (e.g. ½ )

Neyman-Ruben Theory Counterfactual Framework

Page 5: Propensity Scores  Making Sense of Non-Randomized Observational Data

Sometimes randomization fails:

• Selection biases arise when there is an imbalance between groups for observed or unobserved covariate, and probability of treatment now depends on covariates e.g. older or sicker patients more likely to receive treatment

• Confounding occurs there is an imbalance between treatment groups on observed or unobserved covariates that influence outcome e.g. older or sicker patients more likely to receive treatment and more likely to die

• Potential confounders: Covariates correlated with treatment and outcomes

Neyman-Ruben Theory Counterfactual Framework

Page 6: Propensity Scores  Making Sense of Non-Randomized Observational Data

• Imbalance may be due to failure of randomization, non-compliance, unblinded treatment assignment.

• Imbalance also arises as a result of non-randomized treatment assignment chosen by patient/physician or institutional/social policies i.e. • ‘natural experiments’ or ‘quasi-experimental designs’

• A non-randomized study can yield unbiased estimates of treatment effects

Neyman-Ruben Theory Counterfactual Framework

Page 7: Propensity Scores  Making Sense of Non-Randomized Observational Data

• Stratification (limited to a few key covariates e.g. age, gender). For example, matching on 10 binary factors e.g disease history -> 1024 strata

• Matching case-control (also limited to a few key covariates). Given covariates, treatment assignment is ‘strongly ignorable’

• Regression adjustment : Models effects of confounders on outcome directly• Can adjust for multiple confounders, but effectiveness depends on

correctly specifying the model e.g. non-linear terms, interactions, etc. Misspecification may actually worsen bias

• Overfitting may improve fit with current data (‘internal validity’), but limit generalizability to larger population (‘external validity’). Principle of parsimony limits model complexity

• Almost impossible to assess effectiveness of ‘bias control’ (unlike PS)

Conventional methods for adjusting for imbalance:

Page 8: Propensity Scores  Making Sense of Non-Randomized Observational Data

• PS is probability of assignment to treatment given confounders, summarizes all observed confounders in a single number (scalar)

• The ‘balancing property’ tells us that conditioning on the PS by stratifying, matching or regression adjustment will improve balance between groups

• Alternatively, • The ‘fundamental theorem’: as sample size increases, treatment is

assignment independent of observed covariates after conditioning on propensity score (Ruben and Rosenbaum, 1983)

• Given PS, treatment assignment is said to be ‘strongly ignorable’

Balancing Property of the Propensity Score:

Page 9: Propensity Scores  Making Sense of Non-Randomized Observational Data

• Advantages:• PS estimated by familiar methods (e.g. logistic regression)• Balances multiple covariates simultaneously• Model of treatment assignment applies only to current data, no external validity• Similarly, less need for parsimony. Over-fitting is not a problem if balancing works• After conditioning on PS, balance can be assessed easily and directly

• Caveats:• Unlike randomization, no assurance that unobserved covariates will balance• You can only balance observed/ measured covariates• Works best with large N, as imbalance may be unavoidable with small N• Take care to avoid covariates that are really treatment outcomes (mediators).

Adjusting for covariates that are effected by treatment may worsen bias

Propensity scores in the real world:

Page 10: Propensity Scores  Making Sense of Non-Randomized Observational Data

A.F. Connors et al. The Effectiveness of Right Heart Catheterization in the Initial Care of Critically Ill Patients. JAMA 1996, Sep 18; 276(11): 889-97

Objective Examine impact of RHC use during 1st 24h of ICU care & outcome i.e. survival, length of stay, intensity of care, cost of care

Design Prospective cohort study

Setting 5 US teaching hospitals, 1989 – 1994.

Subjects 5735 critically ill adult patients in ICU for 1 of 9 pre-specified disease categories: acute respiratory failure, COPD, CHF, cirrhosis, coma, metastatic colon cancer, non-small cell lung cancer, multi-organ failure with malignancy or sepsis

Treatment RHC at discretion of physician, possibly confounded with patient factors related to the outcome e.g. BP

Outcome(s) Survival, costs, intensity of care, hospital stay

Dataset and codebook available at http://biostat.mc.vanderbilt.edu/wiki/Main/DataSets

Page 11: Propensity Scores  Making Sense of Non-Randomized Observational Data

A.F. Connors et al. The Effectiveness of Right Heart Catheterization in the Initial Care of Critically Ill Patients. JAMA 1996, Sep 18; 276(11): 889-97

• RCT is often impractical for ethical, logistic, financial reasons e.g. “ the popularity of this procedure and the widespread belief that it is beneficial makes the performance of an RCT difficult”

• Treatment selection at discretion of MD, likely to be confounded with patient characteristics related to the outcome e.g. patients with low BP more likely to receive RHC and more likely to die

• 7 critical care specialists identified 50 variable likely to influence treatment

Dependent variable is RHC in first 24h (2184 cases vs 3551 controls)

Independent variables are age, sex, race, education, income, medical insurance, 9 primary disease categories, 10 admission diagnoses, 12 comorbid conditions, pre-admission ADL and DASI, day 1 DNR, cancer, 2 mo. survival probability, APACHE III APS score, Glasgow Coma Score, weight, temperature, BP, respiratory rate, heart rate, PaO2/FiO2, PaCO2, pH, WBC, hematocrit, sodium, potassium, creatinine, bilirubin, albumin, urine output

Great care taken to identify all confounders that might bias outcomes through literature review, expert panels

Page 12: Propensity Scores  Making Sense of Non-Randomized Observational Data

A.F. Connors et al. The Effectiveness of Right Heart Catheterization in the Initial Care of Critically Ill Patients. JAMA 1996, Sep 18; 276(11): 889-97

Pre-treatment balance: 8 select covariates

How do we compare groups that are fundamentally different?

Page 13: Propensity Scores  Making Sense of Non-Randomized Observational Data

A.F. Connors et al. The Effectiveness of Right Heart Catheterization in the Initial Care of Critically Ill Patients. JAMA 1996, Sep 18; 276(11): 889-97

Kaplan Meier Survival, death by day 30

Page 14: Propensity Scores  Making Sense of Non-Randomized Observational Data

A.F. Connors et al. The Effectiveness of Right Heart Catheterization in the Initial Care of Critically Ill Patients. JAMA 1996, Sep 18; 276(11): 889-97 Cox proportional hazards regression• lt = hazard function = probability of death at time t given survival until t• log(lt) = l0 + b1 x RHC • e b1 = hazard ratio for RHCUnadjusted Cox PH model: • coxph(survival~swang1, data=rhc))

Given selection biases, patients with RHC sicker (lower BP, higher creatinine, higher APACHE scores), HR likely overestimate of true risk of treatment

Page 15: Propensity Scores  Making Sense of Non-Randomized Observational Data

Estimating propensity scores• Logistic regression for binary outcome e.g. RCH vs No RHC (swang1)• LHS p = probability of treatment assignment = propensity score• RHS covariates may be numeric or categorical (0/1 dummy variables)

• b’s or OR = eb measure impact of each predictor on probability of RHC• For each subject, fitted value of p or linear predictor may be used as PS

Page 16: Propensity Scores  Making Sense of Non-Randomized Observational Data

Estimating propensity scores

model1 = glm(swang1~age + sex + race + edu + income + ninsclas + cat1 + das2d3pc + dnr1 + ca + surv2md1 + aps1 + scoma1 + wtkilo1 + temp1 + meanbp1 + resp1 + hrt1 + pafi1 + paco21 + ph1 +wblc1 + hema1 + sod1 + pot1 + crea1 + bili1 + alb1 + resp + card + neuro + gastr + renal + meta + hema + seps + trauma + ortho + cardiohx + chfhx + dementhx + psychhx + chrpulhx + renalhx + liverhx + gibledhx + malighx + immunhx + transhx + amihx, data=rhc, family=binomial)

prop_score = fitted.values(model1)

Logistic regression: RHC+/RHC- adjusted for 50 risk factors

• 30 of 50 were significant independent predictors of treatment (p < 0.05)• OR = eb measures impact of predictors on probability of RHC

• in general, sicker patients more likely to undergo RHC

Page 17: Propensity Scores  Making Sense of Non-Randomized Observational Data

Estimating propensity scores

Does LR model for PS adequately discriminate treatment groups? Check ROC

• Data: 3551 controls < 2184 cases• Area under the curve: 0.80• 95% CI: 0.79-0.81

If discrimination inadequate, refine model:• Add covariates, nonlinear terms, interactions • Alternate PS estimation procedure

• Probit regression• Discriminant analysis

Page 18: Propensity Scores  Making Sense of Non-Randomized Observational Data

Estimating propensity scores:• Compare frequency distributions• PS should discriminate, but overlap is essential!

Page 19: Propensity Scores  Making Sense of Non-Randomized Observational Data

Using Propensity Scores to balance covariates in observational data:

1. Stratify by quintile (Ruben and Rosenbaum, 1983): • Each quintile analyzed separately as if from RCT (~meta-analysis)• With perfect stratification, average treatment effect across strata is unbiased

estimate of true treatment effect • 5 groups eliminate 90% of bias from observed confounders

2. Covariate (regression) adjustment: outcome vs both treatment and PS quintile • Continuous numeric outcomes: ordinary linear regression• Binary categorical outcomes: logistic regression• Survival analysis: Cox proportional hazards regression

3. Match cases and controls by PS • Compare groups as if from RCT

• Almost always reduces sample size (unlike 1 ,2)

Page 20: Propensity Scores  Making Sense of Non-Randomized Observational Data

Stratify by Propensity Score:

• As few as 5 strata can eliminate 90% bias (Ruben and Rosenbaum 1983)

centiles= quantile(prop_score, probs=(0,0.2,0.4,0.6,0.8,1))

ps_quintiles = cut(prop_score, centiles, include.lowest=TRUE);

Balancing property of propensity score:• Like randomization, treated and control patients in each PS quintile have similar

distribution of measured covariates• Unlike randomization, does not mean they will be similar for unobserved covariates• Does not imply that pairs matched by PS will be similar on covariates

Page 21: Propensity Scores  Making Sense of Non-Randomized Observational Data

Assess balance by quintile:

Page 22: Propensity Scores  Making Sense of Non-Randomized Observational Data

Assess balance by quintile:

Page 23: Propensity Scores  Making Sense of Non-Randomized Observational Data

Assess balance by quintile:

Page 24: Propensity Scores  Making Sense of Non-Randomized Observational Data

Are covariates related to RHC after PS adjustment? • Covariate balance after stratification must be assessed formally

• Simple t-tests affected by sample-size differences (N=5735 vs 1147)

• Differences between RHC+/RHC- should be tested using methods unaffected by sample size

• Recommended : Regress each covariate on RHC ± PS quintile• covariate ~ treatment ± ps_quintile

• If covariate unrelated to RHC after adjustment for PS, strata balanced

Page 25: Propensity Scores  Making Sense of Non-Randomized Observational Data

Are covariates related to RHC after PS adjustment?

Linear regression: age vs RHC status ± PS quintile (categorical, 5 levels):• lm(age~swang1, data=rhc) • lm(age~swang1 + ps_quintiles, data=rhc)

Logistic regression: sex vs RHC status ± PS quintile:• glm(sex~swang1, family = binomial, data=rhc)• glm(sex~swang1 + ps_quintiles, family = binomial, data=rhc)

After adjusting for PS quintile, cases and controls are well balanced. • May need to refine treatment model by adding covariates, non-linear terms,

interactions - iteratively until balanced

Page 26: Propensity Scores  Making Sense of Non-Randomized Observational Data

Stratify by Quintile : Analyze each separately

• Unadjusted Cox PH model applied to each stratum

Sometimes meaningful to report strata individually (distinct profiles, outcomes)

Average across strata (HR=1.24) is an ‘unbiased estimate of true treatment effect’

For most analyses, formulae exist for calculating pooled significance tests adjusted for sample sizes and variances in each stratum

• coxph(survival ~ swang1+strata(ps_quintiles), data=rhc)

Page 27: Propensity Scores  Making Sense of Non-Randomized Observational Data

Stratify by Quintile : Analyze each separately

• Unadjusted Cox model applied to each stratum• Individual strata can be reported separately (sometimes meaningful)

• coxph(survival~ swang1, data=rhc3)

PS Quintile 3

Page 28: Propensity Scores  Making Sense of Non-Randomized Observational Data

Using Propensity Scores to balance covariates in observational data:

1. Stratify by quintile (Ruben and Rosenbaum, 1983): • Each quintile analyzed separately as if from RCT (~meta-analysis)• With perfect stratification, average treatment effect across strata is unbiased

estimate of true treatment effect • 5 groups eliminate 90% of bias from observed confounders

2. Covariate (regression) adjustment: outcome vs both treatment and PS quintile • Continuous numeric outcomes: ordinary linear regression• Binary categorical outcomes: logistic regression• Survival analysis: Cox proportional hazards regression

3. Match cases and controls by PS • Compare groups as if from RCT

• Almost always reduces sample size

Page 29: Propensity Scores  Making Sense of Non-Randomized Observational Data

Regression Adjustment using propensity scores• Regression of outcome on treatment with adjustment for PS quintile

Cox proportional hazards models: Unadjusted : coxph(survival~swang1, data=rhc)Quintile PS : coxph(survival~swang1+ps_quintile, data=rhc)Linear PS : coxph(survival~swang1+prop_score, data=rhc)

Page 30: Propensity Scores  Making Sense of Non-Randomized Observational Data

Regression Adjustment using propensity scores

In weighted regression, underrepresented subjects given greater weight so that weighted sample more representative

• Inverse-Probability-Treatment Weights (IPTW) • treatment effect in population whose risk factor distribution equals all study subjects• regression weights : 1/PS for controls and 1/(1-PS) for treated• iptw=ifelse(swang1==TRUE,1/(1-prop_score), 1/prop_score))

• Standardized Mortality Estimator (SMR) Weights • treatment effect in population whose risk factor distribution equals treated subjects*• regression weights : PS/(1-PS) for controls and 1 for treated• smr=ifelse(swang1==TRUE,1, prop_score/(1-prop_score))

• Weighted regression• coxph(survival ~ swang1, weights=smr, data=rhc)

• Results disappointing:• Freedman, Berk 2008: better just to fit the model without weights

Page 31: Propensity Scores  Making Sense of Non-Randomized Observational Data

Regression Adjustment without propensity scoresAdjusted Cox PH model: model effect of confounders on outcome directlycoxph(survival~ swang1 + age + sex + race + edu + income + ninsclas + cat1 + das2d3pc + dnr1 + ca + surv2md1 + aps1 + scoma1 + wtkilo1 + temp1 + + meanbp1 + resp1 + hrt1 + pafi1 + paco21 + ph1 + wblc1 + hema1 + sod1 + pot1 + crea1 + bili1 + alb1 + resp + card + neuro + gastr + renal + meta + hema + seps + trauma + ortho + cardiohx + chfhx + dementhx + psychhx + chrpulhx + renalhx + liverhx + gibledhx + malighx + immunhx + transhx + amihx, data=rhc)

• Misspecification and over-fitting may limit generalizability (external validity)• Difficult to assess adequacy of balance when modeling outcome directly• May provide reassuring confirmation of bias adjustment by PS

Page 32: Propensity Scores  Making Sense of Non-Randomized Observational Data

Using Propensity Scores to balance covariates in observational data:

1. Stratify by quintile (Ruben and Rosenbaum, 1983): • Each quintile analyzed separately as if from RCT (~meta-analysis)• With perfect stratification, average treatment effect across strata is unbiased

estimate of true treatment effect • 5 groups eliminate 90% of bias from observed confounders

2. Covariate (regression) adjustment: outcome vs both treatment and PS quintile • Continuous numeric outcomes: ordinary linear regression• Binary categorical outcomes: logistic regression• Survival analysis: Cox proportional hazards regression

3. Match cases and controls by PS • Compare groups as if from RCT

• Almost always reduces sample size

Page 33: Propensity Scores  Making Sense of Non-Randomized Observational Data

Propensity score matching:

• Create matched case-control pairs by matching on PS

• Confirm balance before and after matching by comparing treatment groups with procedure independent of sample size (e.g. standardized differences). If inadequate, refine PS model or matching procedure iteratively until it works

• Compare groups as if from randomized study

• Caveats:• ‘Balancing property’ of PS does not ensure matched pairs will

match on covariates, only groups on average• Matching constraints usually reduce sample size (stratification

and regression adjustment use all cases and controls)• Requires specialized software

Page 34: Propensity Scores  Making Sense of Non-Randomized Observational Data

Propensity score matching:

• Matching on PS:

• Nearest neighbour: matches each case to closest control

• Caliper matching: only if PS within a pre-specified caliper distance, typically 0.2 sd’s

• Matching on original covariates

• Minimize sum of distances between cases and controls

i. Euclidean distance:

ii. Mahalanobis distance weights the contributions by inverse covariance, to avoid over-counting correlated measures, generally preferable

iii. Exact matching for categorical variables e.g. gender, disease status

Page 35: Propensity Scores  Making Sense of Non-Randomized Observational Data

Propensity score matching:

Suggested approach:• start with nearest neighbor matching without replacement• ± caliper ≤ 0.20 SD’s• ± exact matching for priority covariates (gender, disease)• check balance using standardized differences• consider optimal matching (effective, but intensive)• experiment with other options until cases and controls balanced• may need to use specialized packages (Stata, R)

Options include: • matching ± replacement (re-use controls; less bias, more analysis)• one:many, many:one or variable ratio matching• greedy (closest) vs optimal matching (minimizes sum of distances)• almost exact matching, fine balance (without matching)

Page 36: Propensity Scores  Making Sense of Non-Randomized Observational Data

Propensity score matching in R:Matching http://sekhon.berkeley.edu/matchingSekhon, J. S. (2011). Multivariate and propensity score matching software with automated balance optimization: The Matching package for R. Journal of Statistical Software 42(7). • Uses automated procedure to select matches, based on univariate and

multivariate balance diagnostics. Primarily 1:M matching (where M is a positive integer), allows matching with or without replacement, caliper, exact

MatchIt http://gking.harvard.edu/matchitHo, D.E., Imai, K., King, G., and Stuart, E.A. (2011). MatchIt: Nonparametric preprocessing for parameteric causal inference. Journal of Statistical Software 42(8) • Two-step process: does matching, then user does outcome analysis. Wide array

of estimation procedures and matching methods available: nearest neighbor, Mahalanobis, caliper, exact, full, optimal, subclassification

optmatch http://cran.r-project.org/web/packages/optmatch/index.htmlHansen, B.B., and Fredrickson, M. (2009). • optmatch: Functions for optimal matching. Variable ratio, optimal, and full

matching.

Page 37: Propensity Scores  Making Sense of Non-Randomized Observational Data

Propensity score matching in Stata:psmatch2 http://ideas.repec.org/c/boc/bocode/s432001.htmlLeuven, E. and Sianesi, B. (2003). Stata module to perform full Mahalanobis and propensity score matching, common support graphing, and covariate imbalance testing.• Allows k:1 matching, kernel weighting, Mahalanobis matching

pscore http://www.lrz-muenchen.de/~sobecker/pscore.htmlBecker, S.O. and Ichino, A. (2002). Estimation of average treatment effects based on propensity scores (2002) The Stata Journal 2(4): 358-377.• k:1 matching, radius (caliper) matching, and stratification (subclassification)

match http://www.economics.harvard.edu/faculty/imbens/software_imbensAbadie, A., Drukker, D., Herr, J. L., and Imbens, G. W. (2004). Implementing matching estimators for average treatment effects in Stata. The Stata Journal 4(3): 290-311. • Primarily k:1 matching (with replacement)

cem http://gking.harvard.edu/cem/Iacus, S.M., King, G., and Porro, G. (2008). Matching for Causal Inference Without Balance Checking. • Implements coarsened exact matching

Page 38: Propensity Scores  Making Sense of Non-Randomized Observational Data

Propensity score matching in SAS:Local and global optimal propensity score matching Coca-Perraillon, M. (2007). Local and global optimal propensity score matching. In SAS Global Forum 2007. Paper 185-2007. • Variety of matching methods. No built in diagnostics.

Greedy matching (1:1 nearest neighbor)Parsons, L. S. (2001). Reducing bias in a propensity score matched-pair sample using greedy matching techniques. In SAS SUGI 26, Paper 214-26. Parsons, L.S. (2005). Using SAS software to perform a case-control match on propensity score in an observational study. In SAS SUGI 30, Paper 225-25.Kosanke, J., and Bergstralh, E. (2004): Match 1 or more controls to cases using the GREEDY algorithm

1:1 Mahalanbois matching within propensity score calipersFeng, W.W., Jun, Y., and Xu, R. (2005). A method/macro based on propensity score and Mahalanobis distance to reduce bias in treatment comparison in observational study. www.lexjansen.com/pharmasug/2006/publichealthresearch/pr05.pdf

Page 39: Propensity Scores  Making Sense of Non-Randomized Observational Data

Propensity Score Matching

Using R Match() function to implement• Nearest neighbour match on PS with caliper = 0.1 (PS within 0.025)• Exact matching on gender, admission diagnosis (9 disease categories)

Original number of observations.............. 5735 Original number of treated obs............... 2184 Matched number of observations............... 1461 Matched number of observations (unweighted). 1461

Number of obs dropped by 'exact' or 'caliper' 723 cases, 2090 controls

Balance checks before and after matching are mandatory• Test for balance must be independent of sample size• Given numeric covariate x, standardized mean difference =• For proportions p,

• Most authors equate balance with a standardized mean difference < 10% • Full range pre- and post- match test statistics output by MatchBalance()

Page 40: Propensity Scores  Making Sense of Non-Randomized Observational Data

Propensity Score MatchingOutput of MatchBalance() function includes tests for equality, treatment vs controls• means, standardized mean differences (< 10%)• variances (treatment vs controls) (0.5-2.0, Rubin 2001)• empiric quantiles (quantile-quantile plot)• empiric cumulative distribution function

***** (V5) meanbp1 ***** Before Matching After Matchingmean treatment........ 68.198 73.523 mean control.......... 84.869 73.673 std mean diff......... -48.685 -0.42029

mean raw eQQ diff..... 16.673 1.6547 med raw eQQ diff..... 11 1 max raw eQQ diff..... 49 74

mean eCDF diff........ 0.092555 0.0092787 med eCDF diff........ 0.06949 0.0058179 max eCDF diff........ 0.21172 0.039014

var ratio (Tr/Co)..... 0.77589 1.0215 T-test p-value........ < 2.22e-16 0.90308 KS Naive p-value...... < 2.22e-16 0.21612 KS Statistic.......... 0.21172 0.039014

Page 41: Propensity Scores  Making Sense of Non-Randomized Observational Data

Propensity Score Matching: Assessing matching balance dot chart effective for display of many simultaneous covariates

Treatment and control groups balance when Standardized Mean Differences < 10%

Page 42: Propensity Scores  Making Sense of Non-Randomized Observational Data

Propensity Score MatchingOutput of MatchBalance() function includes tests for equality, treatment vs controls• means, standardized mean differences (< 10%)• variances (treatment vs controls) (0.5-2.0, Rubin 2001)• empiric quantiles (quantile-quantile plot)• empiric cumulative distribution function

***** (V5) meanbp1 ***** Before Matching After Matchingmean treatment........ 68.198 73.523 mean control.......... 84.869 73.673 std mean diff......... -48.685 -0.42029

mean raw eQQ diff..... 16.673 1.6547 med raw eQQ diff..... 11 1 max raw eQQ diff..... 49 74

mean eCDF diff........ 0.092555 0.0092787 med eCDF diff........ 0.06949 0.0058179 max eCDF diff........ 0.21172 0.039014

var ratio (Tr/Co)..... 0.77589 1.0215 T-test p-value........ < 2.22e-16 0.90308 KS Naive p-value...... < 2.22e-16 0.21612 KS Statistic.......... 0.21172 0.039014

Page 43: Propensity Scores  Making Sense of Non-Randomized Observational Data

Propensity Score Matching: Assessing matching balance Equality of distributions may be assessed by QQ plots, eCDF, KS statistic

Quantile – quantile plot: Creatinine (mmol/l)

Empiric CDF: Creatinine (mmol/l)Formally compared via KS statistic

Page 44: Propensity Scores  Making Sense of Non-Randomized Observational Data

Propensity Score Matching

Cox PH regression: 1461 matched pairscoxph(survival ~ swang1, data = rhcm)

Page 45: Propensity Scores  Making Sense of Non-Randomized Observational Data

Testing the fundamental assumption:

“Estimated treatment effects are unbiased if there are no unobserved confounders i.e. all relevant covariates have been included”

Sensitivity Analysis seeks to test robustness of conclusions to hidden biases from unobserved confounders

Connors. 1996:• Asked 13 practicing clinicians to identify 10 variables with the greatest

influence on their decision to use RHC in practice. None missing in study

• From logistic regression, identified 4 physiologic variables with the largest effect on the decision to treat: PaO2/FiO2, mean BP, pulse, and resp. rate. Individually omitted from PS model -> hazard ratio ± 0.01

• Formal ‘Rosenbaum bounds’ to estimate how much hidden bias would need to be present to invalidate conclusions, implemented in rbounds libraries for R, Stata.

Page 46: Propensity Scores  Making Sense of Non-Randomized Observational Data

• Matched patients have same PS and same probability of treatment (RHC) if no unmeasured confounders

• Assume true probabilities differ due to an unmeasured confounder that correlates perfectly with outcome (worst-case scenario). Let G = odds ratio for its effect on treatment assignment (in logistic regression for PS)

• How large does G need to be to invalidate the conclusion RCH increased mortality?

• For binary outcome: 2 x 2 table for survival in 1461 matched pairs

• RHC dies RHC lives Control dies 651 274 Control lives 356 180

Rosenbaum Sensitivity Test for matched pairs:

Discordant outcomes in 274+356 = 730 of 1461 matched pairs

McNemar test for matched pairs: Under Ho , how likely is it that two discordant cells (274 vs 356 ) arose from same binomial distribution by chance?

Page 47: Propensity Scores  Making Sense of Non-Randomized Observational Data

• For binary outcome: 2 x 2 table for survival in 1461 matched pairs

RHC dies RHC lives Control dies 651 274 Control lives 356 180 ← McNemar p= 0.001

binarysens(274,356,Gamma=1.2,GammaInc=0.02) ← varies G -> effect on McNemar p-value

Rosenbaum Sensitivity Test p-value of estimate .... 0.0006 Gamma Lower bound Upper bound 1.00 0.00062 0.00062 ← no confounding by unobserved covariate 1.02 0.00025 0.00143 1.04 0.00010 0.00308 1.06 0.00004 0.00619 1.08 0.00001 0.01170 1.10 0.00000 0.02082 1.12 0.00000 0.03503 1.14 0.00000 0.05590 ← McNemar p > 0.05

Rosenbaum Sensitivity Test for matched pairs:

Unmeasured confounder needs to increase odds of RHC by 1.14 to nullify the increase in mortality with treatment

Page 48: Propensity Scores  Making Sense of Non-Randomized Observational Data

• Becker et al, 2007: “These are worst case scenarios…a critical value of G = 1.15 does not mean there is no effect of treatment on the outcome. However, the results are sensitive to deviations from the unconfoundedness assumption, and we advise some caution when interpreting”

• Connors et al used a ‘less worst-case’ scenario to estimate G needed to ‘invert’ HR i.e. reduce it from 1.2 to 0.8 and cited a 3-fold increase in odds of RHC due to unobserved confounders.

• rbounds now the standard measure of sensitivity to hidden biases in matched studies

• psens() performs similar analysis for numeric outcomes based on Wilcoxon rank-sum test (≠ McNemar)

• On Friday, we’ll demonstrate both binarysens() and psens() to perform sensitivity analysis for different outcomes

Sensitivity analysis: Interpreting Rosenbaum bounds

Page 49: Propensity Scores  Making Sense of Non-Randomized Observational Data

• Multiple control groups:• consistency between control groups that differ in unobserved confounders• e.g. different ways to escape treatment e.g. declined vs denied treatment

• Coherence among outcomes: • e.g. length or cost of hospitalization rather than survival, looking for consistency • subgroup analysis: looking for consistency or plausible variations e.g. elderly,

women, whites, shock/sepsis/ post-op comparable

• Dose response relationships: looking for a consistent effect• e.g. SVC pressure line vs. full RHC

• Known effects• e.g. outcomes known to be unaffected by treatment

Devices to increase robustness

Page 50: Propensity Scores  Making Sense of Non-Randomized Observational Data

A brief review

Quasi-experimental designs require as much care and attention as an RCT• Carefully identify all potential confounders by

• Clinical judgment• Literature review• Expert panels

• Prioritize confounders to identify those which must be balanced (a priori)• Avoid controlling for post-treatment covariates

After estimating PS, make sure treatment model discriminates. Add covariates, non-linear terms, or interactions until it does. Make sure that group scores overlap

After conditioning on PS by stratification, regression, or matching, make sure that groups are balanced using measures independent of sample size. If needed, revise the treatment model or tighten up matching criteria

Assess robustness of conclusions to hidden bias due to unobserved confounders:• Have external panel of experts review choice of predictors• Re-analyze after omitting important confounders• Estimate Rosenbaum bounds i.e. G = how much hidden bias needed to

invalidate conclusions

Page 51: Propensity Scores  Making Sense of Non-Randomized Observational Data

Useful References:

• Rosenbaum, P.R., & Rubin, D.B. The central role of the propensity score in observational studies for causal effects. Biometrika 1983; 70(1), 41-55

• A.F. Connors et al. The Effectiveness of Right Heart Catheterization in the Initial Care of Critically Ill Patients. JAMA 1996; 276(11): 889-97

• Rosenbaum, P.R. (2010). Design of Observational Studies. New York: Springer

• Pattanayaka, C.W. et al. Propensity Score Methods for Creating Covariate Balance in Observational Studies. Rev Esp Cardiol. 2011; 64(10):897–903

• Stuart, E. A. Matching methods for causal inference: A review and a look forward. Statistical Science 20. 2010; 25: 1-21.

• Stukel, T.A. et al. Analysis of Observational Studies in the Presence of Treatment Selection Bias. JAMA 2007; 297(3) 278-85

• J.S. Sekhon. Multivariate and Propensity Score Matching Software with Automated Balance Optimization: The Matching Package for R. J Stat Software. 2011; 42(7): 1-52

Page 52: Propensity Scores  Making Sense of Non-Randomized Observational Data

QUESTIONS?

Thank-you for your patience: