TRANSCRIPT
Applying Propensity Score and
Mediation Analyses to Program and Policy Evaluation
Morning: Propensity Score Analysis
2014 MCH Epi/CityMatCH Conference AMCHP Pre-Conference Training
Kristin Rankin, PhD (krankin@uic.edu), Amanda Bennett, PhD (acaven3@uic.edu), Deb Rosenberg, PhD (drose@uic.edu)
Division of Epidemiology and Biostatistics
School of Public Health, University of Illinois at Chicago (UIC-SPH)
Propensity Score Analysis Outline
1. Background and Rationale for Using Propensity Score (PS) Methods for Program Evaluation
2. Practical Example: Multnomah County Home Visiting Program Evaluation
3. Methods for Performing a PS Analysis (4 steps)
4. Application of Methods: Breastfeeding and Child Development
5. Benefits/Drawbacks/Challenges of PS Analyses
6. Group Exercise – Discussion Questions Based on Meghea, et al (2013) reading
Goal of Propensity Score Methods
Propensity score analysis methods aim to mimic a randomized clinical trial (RCT) within the context of an observational study
The goal of propensity score analysis is to generate an estimate of the causal effect of the program or policy on its intended outcomes by matching on covariate patterns to approximate the counterfactual
To do this, the propensity score is used as a balancing score with the goal of rendering the treatment assignment “ignorable”
Propensity Score Definition
Propensity scores, developed by Rosenbaum and Rubin (1983), are the predicted probabilities from a regression model of this form:
Program Participation (yes/no)= pool of observed confounders
SAS Code:
proc logistic data=analysis desc;
  class discrete_factors / param=ref ref=first;
  model exposure = <pool of observed baseline factors, higher order terms, and interactions>;
  output out=predvalues p=propscore;
run;
Absolute and Relative Measures of Effect

Absolute (ATE/ATT): Risk difference / attributable risk = Risk of OC_program – Risk of OC_control
Relative (ATERATIO/ATTRATIO): Relative risk = Risk of OC_program / Risk of OC_control

It is recommended that both relative and absolute measures be reported; they provide complementary information
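The two measures come from the same pair of group risks. A small Python illustration with hypothetical risks (the numbers below are made up for the example):

```python
# Hypothetical risks of the outcome (OC) in each group; illustration only.
risk_program = 0.08   # 8% of program participants experience the outcome
risk_control = 0.12   # 12% of non-participants experience the outcome

# Absolute measure: risk difference / attributable risk.
risk_difference = risk_program - risk_control   # about 4 fewer cases per 100

# Relative measure: relative risk.
relative_risk = risk_program / risk_control     # about one-third lower risk
```

Reporting both answers two different questions: how many cases the program averts (absolute) and how strongly it scales the baseline risk (relative).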
Goal of Propensity Score Methods
Propensity score methods allow for estimating the ATE or ATT in a way that separates the design from the analysis in an observational study
By balancing covariate patterns between program participants (“treated”) and non-participants, the association between the program and observed baseline covariates is made null, thereby eliminating confounding by those measured covariates
Austin 2011
Propensity Score Analysis: Four Step Process
Propensity score analysis is a multi-step, iterative process including two
different models:
1. Generate propensity score (Model I)
2. Use propensity scores to select comparable groups
3. Check covariate balance across groups
4. Estimate causal effect of program on outcome using propensity matched groups (Model II)
Home Visitation at Multnomah County Health Department

Program goals:
1. Promote family bonding and parent-child attachment
2. Improve pregnancy and birth outcomes
3. Help families adopt healthy behaviors during pregnancy and early life
Nurse-Family Partnership
Healthy Birth Initiative
General Field
Priority criteria
Low income
Teen woman
Has medical risks
Homeless
Black/African-Amer.
Research Questions
1. What is the effect of Multnomah County Health Department’s home-visiting (HV) program on pregnancy outcomes?
2. Do the results vary by method of analysis: propensity-score matching vs. conventional logistic regression?
Step 1: Estimate propensity score (PS) for HV and non-HV women in the original unmatched data
[Figure: Distribution of propensity scores for ECS and non-ECS women in unmatched data (received ECS services <300 days before baby's birth)]

Non-HV group: N = 17,712; PS range: 0–0.60; median: 0.01
HV group: N = 1,743; PS range: 0–0.63; median: 0.24
log odds(HV participation) = β0 + β1(Age) + β2(Education) + β3(Race/ethnicity) + β4(Medicaid) + β5(WIC) + β6(Parity) + β7(Medically high risk) + β8(Smoking) + Σi βi(significant interaction term i)
Steps 2&3: Match 1:1 on PS using Greedy 5→1 digit matching without replacement, then check balance

[Figure: standardized differences before and after matching; below the 0.1 line is good (Normand, 2001)]
Step 4: Estimate Program Effect in Matched Sample – Likelihood of HV women experiencing each outcome as compared to non-HV women (relative risk), by method of analysis

[Table: crude, adjusted*, and PS-matched relative risks for Small for Gestational Age, Preterm Birth, and Adequate Prenatal Care]

* Adjusted RR calculated from multivariable logistic regression models treating the outcome as the dependent variable and receipt of HV as the main independent variable. The model controlled for maternal age, education, race/ethnicity, OHP status, WIC participation during pregnancy, parity, medically high-risk status, and smoking during pregnancy.

n = 17,712 for crude and adjusted analyses; n = 1,693 for PS-matched analysis
Step 4: Likelihood of HV women having received adequate PNC
compared to matched non-HV women, by minimum number of visits
Note: Re-matching was performed for each comparison, using the propensity scores generated from the original model
Step 1: Generate Propensity Score Variable Selection
Choose a pool of measured confounders between program participation and outcome(s) of interest
Decisions about inclusion should be based on theory or prior empirical findings rather than empirical associations with exposures or outcomes in your own data
Include as many variables as possible that are related to the exposure/program participation and/or outcome, as long as the variable is not affected by the program or in the causal pathway between program and outcome
Step 1: Generate Propensity Score Model Specification
Concerns about collinearity and model fit do not apply in the context of the propensity score model
Never use model selection procedures such as stepwise selection, or remove non-significant variables, when generating propensity scores
If limited by small sample size, prioritize variables strongly related to outcomes, but then check balance on all covariates
Step 1: Generate Propensity Score Model Specification
Include interactions and higher order terms (polynomials) in the model, when appropriate, to get optimal balance between program participants and the comparison group across confounders
Accuracy of the propensity score model is less important than the balance on covariates obtained
Model specification is an iterative process with balance checking
Step 1: Generate Propensity Score Missing values
Before modeling to generate propensity scores, delete observations with missing values on outcomes of interest to avoid unmatched exposed individuals
Consider using techniques such as single or multiple imputation for confounders to minimize loss of sample size/generalizability due to missing values
Including the imputed value plus an indicator for missingness controls for the covariate plus the pattern of missing data, which may also confound the relationship of interest
Stuart 2010
Step 1: Generate Propensity Score
Excerpt of Propensity Scores for Sample
Obs.   Program          Propensity score (predicted probability)
811    Program = Yes    0.77917
812    Program = Yes    0.79674
813    Program = No     0.17937
814    Program = No     0.41324
815    Program = No     0.83309
816    Program = No     0.36290
817    Program = No     0.82015
818    Program = No     0.78867
819    Program = No     .
820    Program = No     0.11435
821    Program = Yes    0.47309
822    Program = Yes    0.77425
823    Program = Yes    0.88204
Step 1: Generate Propensity Score Assessing Common Support
“Common Support” is the overlap in the distribution of propensity scores for program participants compared to non-participants
Sturmer, et al 2006, J Clin Epidemiol
Step 1: Generate Propensity Score Assessing Common Support
Lack of common support leads to:
Extrapolation beyond the data for any observation whose propensity score lies outside the range of scores for individuals in the other group (program participants or non-participants)
Loss of external validity: If there are individuals in the sample who fall outside of area of common support, a matched sample may not be representative (examine characteristics of excluded individuals to assess this)
If common support does not hold, the dataset cannot be used to generate the ATE
(but can be used for ATT)
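The overlap region can be identified mechanically: take the intersection of the two groups' score ranges and flag anyone outside it. A Python sketch with made-up propensity scores (illustration only; in practice these come from Model I):

```python
# Hypothetical propensity scores from Model I; illustration only.
ps_treated = [0.15, 0.30, 0.45, 0.62]
ps_control = [0.02, 0.10, 0.20, 0.35, 0.50]

# The region of common support is the overlap of the two groups' ranges.
low = max(min(ps_treated), min(ps_control))    # highest of the two minimums
high = min(max(ps_treated), max(ps_control))   # lowest of the two maximums

# Individuals outside the overlap cannot be compared without extrapolation;
# examine who they are before excluding them (external validity).
keep_treated = [p for p in ps_treated if low <= p <= high]
keep_control = [p for p in ps_control if low <= p <= high]
```

Here the treated person at 0.62 and the controls at 0.02 and 0.10 fall outside common support and would be dropped from a matched analysis.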
Step 1: Generate Propensity Score Distribution of Propensity Scores
Medical Home= NO
Medical Home = YES
Example: Medical home as “program”
Step 1: Generate Propensity Score SAS programming code
title 'Step 1: Model to generate propensity score';
proc logistic data = &dataset;
  class &CatVars &design / param=ref;
  model &Exposure = &Confounders &Polynomials &Interactions &weight &design;
  output out=predvalues p=pscore;
  where &subset=1; /* eligible sample with non-missing values for OC */
run;
This process results in a new dataset called predvalues with all of the original variables and data, but an added variable called pscore with a value between 0 and 1 for each individual
Note: weight and design variables only apply with complex sample survey data (see Special Topic for this later in slide set)
Step 1: Generate Propensity Score SAS programming code
title 'Step 1: Examine PS distributions for common support';
proc univariate data=predvalues;
class &Exposure;
var pscore;
histogram pscore;
run;
Step 2: Use propensity scores to select comparable groups
Once generated and assessed for common support, propensity scores may be used in one of four ways:
a) As a covariate in a model with exposure status predicting outcome (not recommended)
b) As values on which to stratify/subclassify data to form more comparable groups;
c) As weights (inverse of propensity score); or
d) As values on which to match a program participant (exposed) to non-participant (unexposed), then conduct matched analysis to estimate the exposure-outcome relationship (the program effect)
The ultimate goal is to create optimal balance on baseline covariates (weighting or matching usually perform better with respect to this goal)
Step 2: Selecting Comparable Groups Stratification/Subclassification
Stratification on propensity score (e.g. quintiles) can be used; this yields multiple effect estimates that can be combined into a single estimate
Full matching is a similar approach, but creates variable matches of k:1 treatment to controls and k:1 controls to treatment
Estimates ATE
Step 2: Selecting Comparable Groups Propensity Score Matching
Propensity score matching results in a new, matched sample of program participants and controls. It requires:
1) Defining “closeness” to determine a good match between individuals
2) Implementing a matching method, given the “closeness” measure
Estimates ATT
If sample size and makeup allows, PS matching can be combined with exact matching if there are variables for which balance is not easily obtained or if there are important stratifiers Example: First exact match by gender and race, then within propensity scores after that
Stuart 2010
Step 2: Selecting Comparable Groups Propensity Score Matching
Defining Closeness
The caliper width is the maximum acceptable difference between the propensity scores of a program participant and the control chosen as a match
Simulation studies have consistently shown that 0.2 * the standard deviation of the linear propensity score (the logit of the propensity score) performs well as a caliper width
Matching Techniques
o Greedy matching
o Nearest neighbor
o Optimal matching
Austin 2013
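The caliper rule is a one-line computation once the scores are on the logit scale. A Python sketch with hypothetical propensity scores (the deck's own code is in SAS):

```python
import math

# Hypothetical propensity scores for the whole sample; illustration only.
pscores = [0.10, 0.22, 0.35, 0.48, 0.60]

# Work on the logit (linear propensity score) scale.
logits = [math.log(p / (1 - p)) for p in pscores]

# Sample standard deviation of the logits.
n = len(logits)
mean = sum(logits) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in logits) / (n - 1))

# Rule of thumb from simulation studies: caliper = 0.2 * SD of logit(PS).
caliper = 0.2 * sd
```

Matches whose logit-scale distance exceeds this caliper would be rejected rather than accepted as poor matches.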
Step 2: Selecting Comparable Groups Propensity Score Matching
One example of Greedy Matching (SAS macro): Greedy 5→1 Digit
• HV match with non-HV on 5 digits? Ex: 0.12345 and 0.12345 — if yes, match; if no, try...
• HV match with non-HV on 4 digits? Ex: 0.12345 and 0.1234x — if yes, match; if no, try...
• HV match with non-HV on 3 digits? Ex: 0.12345 and 0.123xx — if yes, match; if no, try...
• ... continuing down to...
• HV match with non-HV on 1 digit? Ex: 0.12345 and 0.1xxxx
• First, “best matches” are made, then the caliper width increases incrementally for unmatched cases, up to 0.1
• At each stage, the non-program participant with the “closest” propensity score is selected as the match to the program participant; the match is randomly selected in the case of ties
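The deck uses the SAS Greedy 5→1 digit macro for this. As a rough illustration of the digit-matching idea only (a simplified sketch, not the macro: no random tie-breaking, made-up scores), in Python:

```python
# Hypothetical participants and controls: id -> propensity score.
treated = {"t1": 0.12345, "t2": 0.50010, "t3": 0.74000}
controls = {"c1": 0.12345, "c2": 0.50470, "c3": 0.29000}

pairs = {}
unmatched_controls = dict(controls)

# Pass 1 requires agreement to 5 decimal places; each later pass relaxes
# the precision by one digit (a widening caliper), so the best matches
# are made first and are never broken by later, looser passes.
for digits in range(5, 0, -1):
    for t_id, t_ps in treated.items():
        if t_id in pairs:
            continue  # already matched in an earlier (stricter) pass
        for c_id, c_ps in list(unmatched_controls.items()):
            if round(t_ps, digits) == round(c_ps, digits):
                pairs[t_id] = c_id             # match without replacement
                del unmatched_controls[c_id]
                break
```

Here t1 matches c1 exactly on the first pass, t2 picks up c2 only once precision relaxes to two digits, and t3 never finds a control within even one digit, so it stays unmatched.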
Step 2: Selecting Comparable Groups Propensity Score Matching
Matching Techniques
Optimal: selects the best match first (but harder to implement)
Nearest neighbor: relies on sort order for selection of matched controls (random sort order usually works almost as well as optimal and is easy to implement)
Choice of matching technique should be based on the ultimate goal of achieving an optimal middle ground between exchangeability (balance/bias reduction) and inclusion of program participants (generalizability)
Multiple matching techniques can be attempted for one analysis to determine which is best at achieving balance
Step 2: Selecting Comparable Groups Propensity Score Matching
Matching with or without replacement
Matching with replacement can be useful when the number of controls is small relative to the number of treated
However, several issues discourage matching with replacement:
o controls are not independent (frequency weights are needed, plus specialized techniques for estimating variance)
o a small number of controls may provide the whole comparison group; the number of times each control appears should be monitored
o Austin (2014) found greater variability and no improvement in bias reduction when matching with replacement
With few controls per program participant, it is probably better to consider PS weighting or stratification rather than matching with replacement
Stuart 2010 Austin 2014
Step 2: Selecting Comparable Groups Propensity Score Matching
Software solutions for PS Matching (see Resources)
PSMATCH2 (Stata): flexible and user-controlled with regard to matching techniques
SAS macro for nearest neighbor matching within a user-defined caliper, without replacement
GREEDY (5→1 digit) macro in SAS: performs one-to-one nearest neighbor within-caliper matching without replacement
Step 2: Selecting Comparable Groups SAS programming code
For PS Subclassification/Statification: Quintiles of Prop. Score
title 'Step 2: Define quintiles of propensity score for stratification';
proc means data=predvalues p20 p40 p60 p80;
var pscore;
run;
Step 2: Selecting Comparable Groups SAS programming code
For PS weighting: creating weights
title 'Step 2: Calculate weights for estimating ATE or ATT';
data predvalues;
set predvalues;
PSweightATE = (&Exposure/pscore) + ((1-&Exposure)/(1-pscore));
PSweightATT = &Exposure + ((1-&Exposure)*(pscore/(1-pscore)));
run;
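The two weight formulas in the SAS step above carry over directly to any language. A Python illustration with hypothetical exposure indicators Z and propensity scores e:

```python
# Hypothetical data: exposure indicator Z and propensity score e; illustration only.
people = [
    {"Z": 1, "e": 0.8},  # participant with high propensity
    {"Z": 1, "e": 0.4},
    {"Z": 0, "e": 0.8},  # non-participant who "looks like" a participant
    {"Z": 0, "e": 0.2},
]

for p in people:
    Z, e = p["Z"], p["e"]
    # ATE weight: inverse probability of the treatment actually received.
    p["w_ate"] = Z / e + (1 - Z) / (1 - e)
    # ATT weight: participants weighted 1; non-participants weighted by
    # the propensity odds e/(1-e), so controls resembling participants count more.
    p["w_att"] = Z + (1 - Z) * (e / (1 - e))
```

Note how the non-participant with e = 0.8 gets a large weight under both schemes: weighting up the rare controls who resemble participants is exactly how these weights build a comparable pseudo-population.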
Step 2: Selecting Comparable Groups SAS programming code
For PS Matching: Selecting caliper width (0.2 * sd of logit of pscore)
title 'Step 2: Select caliper width and random sort data before matching';
title2 'Create random number for each record and variable for logit of PS';
data predvalues;
  set predvalues;
  SORTER=RANUNI(-3);
  logitPscore = log(pscore/(1-pscore));
run;

title3 'Calculate std deviation for logit of PS to determine caliper width';
proc means data=predvalues std n;
  var logitPscore; /* multiply std dev by 0.2 */
run;

title4 'Perform random sort of data';
proc sort data=predvalues;
  by sorter;
run;
Step 3: Check covariate balance across groups
Standardized differences are preferred to significance testing because they are in units of the pooled standard deviation, so they allow comparisons on the same scale and are not influenced by sample size

Standardized difference (means):
  d = (mean_treated – mean_untreated) / sqrt[(s_treated² + s_untreated²)/2]
  where s_treated and s_untreated are the sample standard deviations of the covariate in the treated and untreated subjects, respectively

Standardized difference (proportions):
  d = (p_treated – p_untreated) / sqrt[{p_treated(1 – p_treated) + p_untreated(1 – p_untreated)}/2]

% bias reduction:
  100 × (StdDif_unmatched – StdDif_matched) / StdDif_unmatched

By convention, a standardized difference of >= 0.1 indicates imbalance, but there is no consensus yet
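These diagnostics are simple to compute by hand. A Python sketch with hypothetical group summaries (all numbers below are made up for the example):

```python
import math

# Hypothetical summaries for one continuous covariate (e.g., maternal age).
mean_t, sd_t = 24.1, 5.2   # treated
mean_c, sd_c = 26.8, 5.9   # untreated

# Standardized difference for a continuous covariate.
d_means = (mean_t - mean_c) / math.sqrt((sd_t**2 + sd_c**2) / 2)

# Standardized difference for a proportion (e.g., smoking prevalence).
p_t, p_c = 0.30, 0.18
d_props = (p_t - p_c) / math.sqrt((p_t * (1 - p_t) + p_c * (1 - p_c)) / 2)

# Percent bias reduction after matching (hypothetical before/after values).
d_unmatched, d_matched = 0.48, 0.05
pct_bias_reduction = 100 * (d_unmatched - d_matched) / d_unmatched
```

Both standardized differences here exceed the conventional 0.1 threshold, so this hypothetical unmatched sample would be flagged as imbalanced on both covariates.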
Step 3: Check covariate balance across groups
Recommended that interactions and higher order terms also be compared across treatment groups
For continuous variables, can also compare spread between groups using side-by-side box plots or another graphical method (to check balance on variance in addition to the mean)
Step 3: Check covariate balance across groups
Three strategies for assessing balance: choose the propensity score model and matching method that...
1. yields the smallest standardized differences across the largest number of baseline covariates;
2. minimizes the standardized differences of a few particularly important covariates;
3. results in the fewest number of “large” (>0.25) standardized differences

If groups are not balanced, re-specify the model and re-generate propensity scores:
o consider adding interaction terms or higher order terms to the model for those variables that were not balanced
o consider other matching strategies
Stuart 2010
Example: Checking Covariate Balance Before Propensity Score Matching

                      Before PS Match
Selected              Exposed (n = 524)   Not Exposed (n = 1,001)   Absolute Standardized
Variables             mean (SD)           mean (SD)                 Difference*
Age
  0-5                 0.38 (0.02)         0.28 (0.02)                4.83
  6-11                0.31 (0.02)         0.36 (0.02)                2.39
  12-17               0.31 (0.02)         0.37 (0.02)                2.97
Race/Ethnicity
  NH White            0.68 (0.02)         0.39 (0.02)               13.79
  NH African Amer.    0.14 (0.01)         0.21 (0.02)                4.27
  Hispanic            0.12 (0.01)         0.32 (0.02)               10.41
  Other/Multiracial   0.07 (0.01)         0.07 (0.01)                0.00
etc.
Example: Checking Covariate Balance After Propensity Score Matching

                      After PS Match
Selected              Exposed (n = 482)   Not Exposed (n = 482)   Standardized   % Bias
Variables             mean (SD)           mean (SD)               Difference*    Reduction
Age
  0-5                 0.30 (0.46)         0.28 (0.45)             0.04           99.1%
  6-11                0.26 (0.44)         0.31 (0.46)             0.11           95.4%
  12-17               0.44 (0.50)         0.41 (0.49)             0.06           98.0%
Race/Ethnicity
  NH White            0.55 (0.50)         0.52 (0.50)             0.06           99.6%
  NH African Amer.    0.22 (0.41)         0.23 (0.42)             0.02           99.4%
  Hispanic            0.16 (0.36)         0.17 (0.38)             0.03           99.7%
  Other               0.07 (0.26)         0.08 (0.27)             0.04            0.0%
Step 4: Estimate Effect of Program Stratification and Weighting
Once balance is established, calculate a measure of association for the effect of the program on the desired outcome(s). Analytic methods vary by strategy...

Stratification/Subclassification:
1. Calculate the crude measure of effect (RD, RR, HR) within strata of the propensity score
2. Combine estimates using a weighted average, with each stratum weighted according to its sample size as a proportion of the whole sample (ATE) or according to its proportion of program participants (ATT)

PS Weighting: calculate the crude measure of effect using the weights as previously specified
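The ATE/ATT distinction for stratification is just a choice of weights in the weighted average. A Python sketch with hypothetical quintile-specific risk differences (all counts and effects below are invented for the example):

```python
# Hypothetical quintile-specific risk differences and stratum sizes.
strata = [
    {"rd": -0.020, "n_total": 400, "n_treated": 40},
    {"rd": -0.035, "n_total": 400, "n_treated": 90},
    {"rd": -0.050, "n_total": 400, "n_treated": 150},
    {"rd": -0.041, "n_total": 400, "n_treated": 210},
    {"rd": -0.030, "n_total": 400, "n_treated": 300},
]

N = sum(s["n_total"] for s in strata)    # whole sample
T = sum(s["n_treated"] for s in strata)  # program participants

# ATE: weight each stratum by its share of the whole sample.
ate = sum(s["rd"] * s["n_total"] / N for s in strata)

# ATT: weight each stratum by its share of program participants.
att = sum(s["rd"] * s["n_treated"] / T for s in strata)
```

Because participants here are concentrated in the high-propensity strata, the ATT weights those strata's effects more heavily than the ATE does.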
Step 4: Estimate Effect of Program Propensity Score Matching
The matched design needs to be taken into account only to correctly estimate the standard error of the program effect (conditional logistic regression is not necessary), since matched pairs are no longer statistically independent (controversial, but generally recommended)
Measures of association themselves will be statistically unbiased since program participants (the exposed) are being matched to non-participants (the unexposed); this is in contrast to matching in a case control study which imposes a new selection bias that must then be addressed by using conditional logistic regression
Step 4: Estimate Effect of Program Propensity Score Matching
Multivariable regression is not necessary since matching on the propensity scores has addressed confounding, so either a simple 2x2 table or a crude GEE model can be used. This 2x2 table must reflect the matched data structure.
                            Unexposed Experiences Outcome?
                            Yes          No
Exposed           Yes       a            b            a + b
Experiences
Outcome?          No        c            d            c + d

                            a + c        b + d        a + b + c + d (n pairs)
Step 4: Estimate Effect of Program Propensity Score Matching
Computations Based on a Simple 2 x 2 Table
Organized for Matched Pairs
Relative Risk (RR) = (a+b)/(a+c)
SE (lnRR) = sqrt [(b+c) / {(a+b)(a+c)}]
95% CI = exp[lnRR ± (1.96*SE)]

Risk Difference (RD) / Attributable Risk (AR) = (b−c)/n
SE (RD) = sqrt[(b+c) − (b−c)²/n] / n
95% CI = RD ± 1.96(SE)
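As a check on these matched-pair computations (with the exposed-group risk (a+b)/n in the RR numerator), a Python sketch with hypothetical cell counts:

```python
import math

# Hypothetical matched-pair counts: a = both members have the outcome,
# b = exposed only, c = unexposed only, d = neither; n = number of pairs.
a, b, c, d = 30, 25, 40, 405
n = a + b + c + d

# Relative risk: exposed risk (a+b)/n over unexposed risk (a+c)/n.
rr = (a + b) / (a + c)
se_ln_rr = math.sqrt((b + c) / ((a + b) * (a + c)))
rr_ci = (rr * math.exp(-1.96 * se_ln_rr), rr * math.exp(1.96 * se_ln_rr))

# Risk difference (attributable risk): only discordant pairs contribute.
rd = (b - c) / n
se_rd = math.sqrt((b + c) - (b - c) ** 2 / n) / n
rd_ci = (rd - 1.96 * se_rd, rd + 1.96 * se_rd)
```

Only the discordant cells b and c drive both the RD and the standard errors, which is the hallmark of a matched-pair analysis.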
Step 4: Estimate Effect of Program Propensity Score Matching
/* SAS code for restructuring data from one observation per infant to one
   observation per matched pair to create the matched 2x2 table */
data Unexp (rename=(Outcome=UnexpOutcome));
  set smatchall;
  where Exposure=0;
run;
proc sort data=Unexp; by matchto; run;

data Exp (rename=(Outcome=ExpOutcome));
  set smatchall;
  where Exposure=1;
run;
proc sort data=Exp; by matchto; run;

data matchedpair;
  merge Unexp Exp;
  by matchto;
run;

proc freq data=matchedpair order=formatted;
  table ExpOutcome*UnexpOutcome / norow nocol;
  exact mcnem;
run;
matchto is a variable indicating the ID that each matched pair shares
Step 4: Estimate Effect of Program Propensity Score Matching
SAS Code to run Generalized Estimating Equations (GEE) model for Relative Risks
(Use dataset with one observation per individual but a variable to indicate a unique ID for each matched pair)
proc genmod data=smatchall desc;
class matchto;
model Outcome = Exposure/dist=bin link=log; /*log binomial model*/
repeated subject=matchto/type=IND corrw covb;
estimate 'Exp vs Unexp' exposure 1 /exp;
run;
matchto is a variable indicating the ID that each matched pair shares
Step 4: Estimate Effect of Program Sensitivity Analysis
For propensity score analyses to result in true causal effect of program, there is a strong ignorability assumption:
o Sufficient overlap of program participants and controls
o Unconfounded treatment assignment
Sensitivity analysis can be used to assess whether treatment assignment is unconfounded after balancing on observed covariates
o Test effect of program on baseline measurement of outcome or other related variable, if available (See Hillemeier 2014)
o Quantify extent to which unmeasured confounding may explain findings (See Jiang 2011 and Meghea 2013)
Special Topic: Subpopulation Analysis
Sometimes effect modification of program effects (differential effectiveness) by specific characteristics is suspected and/or subgroup analyses are part of the evaluation plan
For subpopulation analysis, it is best to stratify at the beginning of analysis and generate separate propensity scores for each stratum; then analyze within strata
Examples: Dose-specific analysis, race-specific differences, age-group differences
Recall from Multnomah County Home Visiting Evaluation…
Likelihood of HV women having received adequate PNC compared to matched non-HV women, by minimum number of visits
Note: Re-matching was performed for each comparison, using the propensity scores generated from the original model
Special Topic: >2 Category Programs
For multiple treatment groups, generalized logit modeling (Imbens 2000 and Imai 2004) can be used to produce propensity scores after which propensity score weighting can be applied
For PS matching, could generate propensity score and match to a common referent group for each category
o this is unsatisfying since control group selected for matching will likely be different for each level of program
Special Topic: Survey Data
Complex sample surveys involve weighting and survey design variables
Model 1: Generating Propensity Score
Include survey weight, strata and other design variables as predictors in the regression equation, rather than as cluster, strata and weight variables
SEs are not of interest for Model 1, so there is no need to account for design variables in the analysis to adjust SEs
Model 2: Estimating effect of program
Incorporate weights, clustering and stratification variables to accurately estimate variance and provide population representative results (for PS weighting, simply multiply survey weights by PS weights)
DuGoff 2014
Application of Methods:
Examining the Effect of Breastfeeding Duration on Early Child Development (NSCH 2007)
Main Variables
Exposure
Dichotomous Breastfeeding Duration: ≥ 6 months/still (extended breastfeeding) vs <6 months/never breastfed
Outcome
Summary measure plus domains of a child's risk for developmental delay
High Risk = 2+ concerns predictive of delay
Moderate Risk = 1 concern predictive of delay
Low/No Risk = 0 concerns or concerns not predictive of delay
Questions were adapted from the Parents’ Evaluation of Developmental Status (PEDS ©), a standardized screening tool used clinically with parents
Covariates
Child factors: Sex
Race/ethnicity
Age
Birth Order
Birthweight
Maternal factors: Age at child’s birth
Education level
Marital status/ Cohabitation
Country of Birth
Family Factors: Family structure
Father’s education
Income as % FPL
Primary language
Smoker in household
Geographic Factors: Residence in an MSA
Region of the U.S.
Step 1: Propensity Score Estimation
Used a logistic regression model to compute the propensity score (PS), with breastfeeding ≥ 6 months vs <6 months/never as the dependent variable and all covariates as independent variables; examined the distribution of propensity scores for each group
Propensity scores are the predicted probabilities from a regression model of this form:
Exposure = pool of observed covariates
Propensity Score Distributions
Breastfed <6 months/never
Breastfed ≥ 6 months
Range: 0.0357 – 0.9056
Range: 0.0294 – 0.8937
Step 2: Selecting Comparable Groups
Matched each child breastfed ≥ 6 months to a child breastfed <6 months/never on PS (SAS matching algorithm)
Nearest neighbor matching algorithm
Caliper width of 0.2 * std dev of logit (prop. score)
1:1 ratio of unexposed to exposed
Without replacement
Produced weights from propensity scores as an alternative to PS matching
Step 3: Balance Checking
Performed balance diagnostics to compare covariate distributions for exposed and unexposed in original and matched samples and in weighted sample
Calculated absolute standardized differences
Assured that matched sample had standardized differences for each covariate under 0.10
Step 4: Estimate Measure of Effect
1) Propensity score matched analysis using generalized estimating equations for polytomous regression/generalized logit model and binary regression (ATT)
2) Propensity score weighted analysis (ATE & ATT)
3) Traditional multivariable regression (ATE)
Step 5: Comparison of Analytic Methods
Qualitatively compared results across analytic methods for propensity score analysis and traditional multivariable generalized logit regression modeling
For several outcomes, the results diverged across methods
Results from multivariable regression, PS weighting and PS matching are not directly comparable (ATE vs ATT)
Step 5: Comparison of Analytic Methods
PS matching sets a high bar for group equivalence by matching on observed covariates but also limiting analysis to individuals who have a potential match on PS
This means that findings apply only to the effect of exposure on the exposed, meaning the segment of the population more likely to breastfeed (ATT)
In addition, a high proportion of exposed were lost in the matching process, which makes generalizability even less clear
Comparison of Propensity Score Analysis to Traditional Regression Approaches
Model I: the process of generating propensity scores
Because selection of covariates occurs when specifying Model 1, the process is blind to outcome status, which forces the researcher to think about and check covariate balance before looking at outcomes
Because Model 1 for generating the propensity scores is not focused on reliability of estimates or statistical testing, it permits adjustment for many covariates, as sample size allows
Comparison of Propensity Score Analysis to Traditional Regression Approaches
Model I: the process of generating propensity scores continued
While Model 1 can include many variables regardless of their statistical significance, the number of observations lost due to missing values likely increases as the number of variables used increases.
Must consider how to approach the issue of missing data on covariates of interest (complete-case analysis, separate dummy variable for missing, imputation) – multiple imputation approaches have been used in more recent work (See Foster et al 2012)
Comparison of Propensity Score Analysis to Traditional Regression Approaches
Model II: Estimating the exposure-outcome relationship
In usual regression modeling, the final model contains one or more "exposure" variables and a relatively few covariates; Model II in propensity score matching is typically a crude model with the exposure as the single independent variable with weighting, or a matched 2x2 table is used
Having a crude model (fewer degrees of freedom) is especially useful if sample size is small or the outcome is rare. If exposure is rare, however, modeling many covariates in Model I to generate the propensity scores may not be possible
Comparison of Propensity Score Analysis to Traditional Regression Approaches
Model II: Estimating the exposure-outcome relationship continued
A mis-specified model in usual regression may lead to inaccurate conclusions, while controlling for confounding using propensity scores is less prone to this issue, as long as balance on covariates has been achieved
Since only one exposure-outcome association is examined (all other variables are "hidden" as part of the propensity score), the analysis and reporting of results is likely to be more focused than from a traditional regression modeling approach
Comparison of Propensity Score Analysis to Traditional Regression Approaches
Generalizability
Propensity score analysis calls us to be explicit about who findings apply to:
ATE: Average program effect in target population
ATT: Average program effect among those likely to be in the program
ATU: Average program effect among those not likely to be in the program
Propensity Score Analysis – Is it Worth it?
Nine (13%) of 69 articles in the medical literature from 1998 to 2003 showed meaningful differences in effect sizes between results of regular regression and propensity score methods; since the true effect estimate is unknown, it is not clear what this means (Stürmer 2006)
Eight (10%) of 78 associations (from 43 studies) reported different results between regression and PS methods; for all eight, PS methods were non-significant while regression results were significant; on average, estimates were 6.4% closer to the null with PS methods (Shah 2005)
Propensity Score Analysis – Is it Worth it?
Transparency: “Design-based approach” to removing confounding
rather than “analysis-based approach” (Austin 2011)
Because selection of covariates occurs when specifying the model for the propensity score, the process is blind to outcome status
• Forces the researcher to rely on a conceptual model to identify appropriate covariates
• Allows for balance checking and assessment of common support before ever looking at outcome(s)
• The analysis and reporting of results is more focused
Propensity Score Matching – Is it Worth it?
Transparency: Balance diagnostics allow for more critical
assessment of exchangeability between program participants and control group
Explicitly assess degree to which confounding has
been removed using standardized differences and bias reduction
Resources: Methods Articles
Austin PC. An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies. Multivariate Behavioral Research 46: 399-424, 2011.
Austin PC. Comparing paired vs non-paired statistical methods of analyses when making inferences about absolute risk reductions in propensity-score matched samples. Statistics in Medicine 30: 1292-1301, 2011.
Austin PC. A Comparison of 12 Algorithms for Matching on the Propensity Score. Statistics in Medicine 33: 1057-1069, 2014.
DuGoff EH, Schuler M, Stuart EA. Generalizing Observational Study Results: Applying Propensity Score Methods to Complex Surveys. Health Services Research 49(1): 284-303, 2014.
Imbens G. The role of the propensity score in estimating dose-response functions. Biometrika 87(3): 706-710, 2000.
Imai K, van Dyk DA. Causal inference with general treatment regimes: Generalizing the propensity score. Journal of the American Statistical Association 99(467): 854-866, 2004.
Oakes JM, Johnson P. Propensity Score Matching for Social Epidemiology. In: Oakes JM, Kaufman JS (Eds.), Methods in Social Epidemiology. San Francisco, CA: Jossey-Bass.
Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika 70(1): 41-55, 1983.
Shah BR, Laupacis A, Hux JE, Austin PC. Propensity score methods gave similar results to traditional regression modeling in observational studies: a systematic review. Journal of Clinical Epidemiology 58(6), 2005.
Stuart EA. Matching Methods for Causal Inference: A Review and a Look Forward. Statistical Science 25(1): 1-21, 2010.
Stürmer T, Joshi M, Glynn RJ, Avorn J, Rothman KJ, Schneeweiss S. A Review of Propensity Score Methods Yielded Increasing Use, Advantages in Specific Settings, but not Substantially Different Estimates Compared with Conventional Multivariable Methods. Journal of Clinical Epidemiology 59(5): 437-447, 2006.
Williamson E, Morley R, Lucas A, Carpenter J. Propensity scores: From naïve enthusiasm to intuitive understanding. Statistical Methods in Medical Research 21(3): 273-293, 2011.
Resources
Some MCH Applications
Bird TM, Bronstein JM, Hall RW, Lowery CL, Nugent R, Mays GP. Late preterm infants: birth outcomes and health care utilization in the first year. Pediatrics (2): e311-9, Epub 2010 Jul 5.
Brandt S, Gale S, Tager IB. Estimation of treatment effect of asthma case management using propensity score methods. American Journal of Managed Care 16(4): 257-264, 2010.
Foster EM, Jiang M, Gibson-Davis CM. The Effect of the WIC Program on the Health of Newborns. Health Services Research 45(4): 1083-1104, 2010.
Hillemeier MM, et al. Effects of Maternity Care Coordination on Pregnancy Outcomes: Propensity-Weighted Analyses. Maternal and Child Health Journal, 2014.
Jiang M, Foster EM, Gibson-Davis CM. Breastfeeding and the Child Cognitive Outcomes: A Propensity Score Matching Approach. Maternal and Child Health Journal 15: 1296-1307, 2011.
Meghea CI, et al. Medicaid home visitation and maternal and infant healthcare utilization. American Journal of Preventive Medicine 45(4): 441-447, 2013.
Okamoto M, Ishigami H, Tokimoto K, Matsuoka M, Tango R. Early Parenting Program as Intervention Strategy for Emotional Distress in First-Time Mothers: A Propensity Score Analysis. Maternal and Child Health Journal, Epub 2012.
Ounpraseuth S, Gauss CH, Bronstein J, Lowery C, Nugent R, Hall R. Evaluating the Effect of Hospital and Insurance Type on the Risk of 1-Year Mortality of Very Low Birthweight Infants. Medical Care 50(4): 353-360, 2012.
Redding S, et al. Pathways Community Care Coordination in Low Birth Weight Prevention. Maternal and Child Health Journal, 2014.
78
Resources
Software
SAS GREEDY MACRO – code and documentation: http://www2.sas.com/proceedings/sugi26/p214-26.pdf
STATA PSMATCH2:
E. Leuven and B. Sianesi. (2003). "PSMATCH2: Stata module to perform full Mahalanobis and propensity score matching, common support graphing, and covariate imbalance testing". http://ideas.repec.org/c/boc/bocode/s432001.html
Other Matching Programs and Information on Sensitivity Analyses for Unmeasured Confounders: http://www.biostat.jhsph.edu/~estuart/propensityscoresoftware.html
Drawbacks of Traditional Regression Modeling for Estimating Program Effects
For categorical outcomes, adjusted program effects from multivariable regression models are conditional estimates (conditional on the covariates in the model), not marginal or population-average estimates; because they depend on the covariate pattern, they do not directly estimate the counterfactual contrast for the population
Conditional measures of effect depend on the covariate pattern
• Potentially a different relative risk for each covariate pattern
• Interpreted as the average effect of treatment on the individual, whereas the marginal effect is the average effect of treatment on the population outcome
• The marginal effect is the one estimated in RCTs
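The gap between conditional and marginal estimates can be seen with a small numeric sketch. The logistic model and coefficients below are hypothetical (not from any study in this training): even with no confounding at all, the stratum-specific (conditional) odds ratio differs from the marginal, population-average odds ratio because the odds ratio is non-collapsible.

```python
import math

def expit(x):
    return 1 / (1 + math.exp(-x))

def odds(p):
    return p / (1 - p)

# Hypothetical model: a binary risk factor Z with 50% prevalence that is NOT
# a confounder -- treatment T is assigned independently of Z (as in an RCT).
# P(Y=1 | T, Z) = expit(-1 + 1.5*T + 2*Z)
beta_t = 1.5  # conditional log-odds ratio, identical in both Z strata

# Marginal risks: average the model-based risk over the distribution of Z
p_treated = 0.5 * expit(-1 + beta_t) + 0.5 * expit(-1 + beta_t + 2)
p_control = 0.5 * expit(-1) + 0.5 * expit(-1 + 2)

conditional_or = math.exp(beta_t)                # same in every covariate stratum
marginal_or = odds(p_treated) / odds(p_control)  # population-average contrast

print(round(conditional_or, 2), round(marginal_or, 2))  # prints: 4.48 3.41
```

The marginal odds ratio (3.41) sits closer to the null than the conditional one (4.48) even though nothing here is confounded; matched or weighted propensity score analyses target the marginal, RCT-like quantity, while covariate-adjusted regression reports the conditional one.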
80
Discussion Questions: Groups 1-3
1. Describe the characteristics of the program that is being evaluated (e.g. staffing, inputs, activities, desired outcomes)
2. What general method do authors use to address selection bias? Describe the characteristics of the matching process.
3. Comment on the authors’ ability to achieve covariate balance across program clients and comparison group (see Table 2). How was balance assessed?
4. Do results represent ATEs or ATTs? Do the authors make this clear when interpreting results?
5. (Optional) What did authors do to assess the assumption of ignorability, specifically the possibility that unmeasured confounding may be influencing their results? Do you agree with the following statement at the end of the first paragraph on p446: “Most of the favorable MIHP effects were robust to potential unobserved confounders.”
82
Discussion Questions: Groups 4-6
1. What are the selection forces at play that threaten the validity of the evaluation?
2. What general method do authors use to address selection bias? Describe the characteristics of the matching process.
3. Do we have any information to assess common support?
4. Comment on the measures of effect reported by the authors and the associated strengths and limitations of those measures
5. (Optional) What did authors do to assess the assumption of ignorability, specifically the possibility that unmeasured confounding may be influencing their results? Do you agree with the following statement at the end of the first paragraph on p446: “Most of the favorable MIHP effects were robust to potential unobserved confounders.”
83
Discussion Questions: Groups 7-10
1. What outcomes are the focus of this evaluation and where do you think those fit in the program’s logic model?
2. What general method do authors use to address selection bias? Describe the characteristics of the matching process.
3. What was the potential impact of missing data and unmatched MIHP participants on the accuracy and generalizability of the evaluation findings?
4. Comment on the authors’ general conclusion that an increase in participation in MIHP-like programs due to Medicaid expansion may enhance prenatal service coverage. Is this supported by their findings?
5. (Optional) What did authors do to assess the assumption of ignorability, specifically the possibility that unmeasured confounding may be influencing their results? Do you agree with the following statement at the end of the first paragraph on p446: “Most of the favorable MIHP effects were robust to potential unobserved confounders.”
84