Applying Propensity Score and Mediation Analyses to Program and Policy Evaluation
Morning: Propensity Score Analysis
2014 MCH Epi/CityMatCH Conference AMCHP Pre-Conference Training
Kristin Rankin, PhD ([email protected]); Amanda Bennett, PhD ([email protected]); Deb Rosenberg, PhD ([email protected])
Division of Epidemiology and Biostatistics, School of Public Health, U. of IL at Chicago (UIC-SPH)



Propensity Score Analysis Outline

1. Background and Rationale for Using Propensity Score (PS) Methods for Program Evaluation

2. Practical Example: Multnomah County Home Visiting Program Evaluation

3. Methods for Performing a PS Analysis (4 steps)

4. Application of Methods: Breastfeeding and Child Development

5. Benefits/Drawbacks/Challenges of PS Analyses

6. Group Exercise – Discussion Questions Based on Meghea, et al (2013) reading


Goal of Propensity Score Methods

Propensity score analysis methods aim to mimic a randomized clinical trial (RCT) within the context of an observational study

The goal of propensity score analysis is to generate an estimate of the causal effect of the program or policy on its intended outcomes by matching on covariate patterns to approximate the counterfactual

To do this, the propensity score is used as a balancing score with the goal of rendering the treatment assignment “ignorable”


Propensity Score Definition

Propensity scores, developed by Rosenbaum and Rubin (1983), are the predicted probabilities from a regression model of this form:

Program Participation (yes/no)= pool of observed confounders

SAS Code: proc logistic data=analysis desc;

class discrete_factors / param=ref ref=first; model exposure = pool of observed baseline factors +higher order terms and

interactions; output out=predvalues p=propscore;

run;
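The same model can be sketched outside SAS. Below is a hypothetical Python/numpy illustration (the data, function name, and fitting details are mine, not the course's) that fits the logistic model of participation on covariates by Newton-Raphson (IRLS) and returns the predicted probabilities, i.e., the propensity scores:

```python
import numpy as np

def fit_propensity_scores(X, treated, n_iter=25):
    """Fit a logistic regression of treatment on covariates via
    Newton-Raphson (IRLS) and return the predicted probabilities
    (the propensity scores)."""
    X1 = np.column_stack([np.ones(len(X)), X])  # add intercept column
    beta = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X1 @ beta))
        W = p * (1.0 - p)                       # IRLS weights
        # Newton step: beta += (X'WX)^-1 X'(y - p)
        beta += np.linalg.solve((X1 * W[:, None]).T @ X1,
                                X1.T @ (treated - p))
    return 1.0 / (1.0 + np.exp(-X1 @ beta))

# toy data: treatment is more likely at higher values of the first covariate
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
treated = (X[:, 0] + rng.normal(size=500) > 0).astype(float)
pscore = fit_propensity_scores(X, treated)
```

In practice higher-order terms and interactions would be appended as extra columns of X, mirroring the SAS model statement above.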

Measures of Effect for Program Evaluation

Austin, 2011a DuGoff 2014


Absolute:
ATE/ATT
Risk difference/Attributable risk = Risk of OC_program − Risk of OC_control

Relative:
ATE_ratio/ATT_ratio
Relative risk = Risk of OC_program / Risk of OC_control

Absolute and Relative Measures of Effect

It is recommended that both relative and absolute measures be reported - they provide complementary information
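As a hypothetical worked example (the risks below are made up, not from any evaluation in this deck), the two complementary measures are computed from the same pair of risks:

```python
# Absolute and relative effect measures from outcome risks in the
# program and comparison groups (illustrative numbers only).
risk_program = 0.12   # risk of outcome among program participants
risk_control = 0.20   # risk of outcome among non-participants

risk_difference = risk_program - risk_control   # absolute measure (RD)
relative_risk = risk_program / risk_control     # relative measure (RR)
```

Here the program looks modest on the absolute scale (8 fewer cases per 100) but sizable on the relative scale (a 40% reduction), which is why reporting both is recommended.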


Goal of Propensity Score Methods

Propensity score methods allow for estimating the ATE or ATT in a way that separates the design from the analysis in an observational study

By balancing covariate patterns between program participants (“treated”) and non-participants, the association between the program and observed baseline covariates is made null, thereby eliminating confounding by those measured covariates

Austin 2011


Propensity Score Analysis: Four Step Process

Propensity score analysis is a multi-step, iterative process including two different models:

1. Generate propensity score (Model I)

2. Use propensity scores to select comparable groups

3. Check covariate balance across groups

4. Estimate causal effect of program on outcome using propensity matched groups (Model II)

Practical Example of

Propensity Score Analysis:

Multnomah County

Home Visiting Program Evaluation


Home-Visitation at MultCo. HD

Program goals:
1. Promote family bonding and parent-child attachment
2. Improve pregnancy and birth outcomes
3. Help families adopt healthy behaviors during pregnancy and early life

Programs: Nurse-Family Partnership; Healthy Birth Initiative; General Field

Priority criteria: low income; teen woman; has medical risks; homeless; Black/African-American


Research Questions

1. What is the effect of Multnomah County Health Department’s home-visiting (HV) program on pregnancy outcomes?

2. Do the results vary by method of analysis: propensity-score matching vs. conventional logistic regression?


Study Population


Step 1: Estimate propensity score (PS) for HV and non-HV women in the original unmatched data

[Figure: Distribution of propensity scores for ECS and non-ECS women in the unmatched data. X-axis: estimated probability of having received ECS services <300 days before baby's birth; y-axis: percent.]

Non-HV group: N = 17,712; PS range: 0-0.60; median: 0.01
HV group: N = 1,743; PS range: 0-0.63; median: 0.24

log odds(HV participation)= β0+ β1(Age) + β2(Education) + β3(Race/ethnicity) + β4(Medicaid) + β5(WIC) + β6(Parity) + β7(Medically high risk) + β8(Smoking) + βi(significant_interaction_termsi)+ ε


Steps 2&3: Match 1:1 on PS using Greedy 5→1 matching without replacement, then check balance

[Figure: standardized differences before and after matching; values below the 0.1 reference line indicate good balance (Normand, 2001)]

Step 4: Estimate Program Effect in Matched Sample – Likelihood of HV women experiencing each outcome as compared to non-HV women (relative risk), by method of analysis

* Adjusted RR calculated from multivariable logistic regression models treating the outcome as the dependent variable and receipt of HV as the main independent variable. The model controlled for maternal age, education, race/ethnicity, OHP status, WIC participation during pregnancy, parity, medically-high risk status, and smoking during pregnancy.

[Figure: Crude, adjusted, and PS-matched relative risks for Small for Gestational Age, Preterm Birth, and Adequate prenatal care; n = 17,712 for crude and adjusted analyses; n = 1,693 for PS-matched analysis]


Step 4: Likelihood of HV women having received adequate PNC compared to matched non-HV women, by minimum number of visits

Note: Re-matching was performed for each comparison, using the propensity scores generated from the original model


Methods for Performing a Propensity Score Analysis:

Four Steps


Step 1: Generate Propensity Score Variable Selection

Choose a pool of measured confounders of the association between program participation and the outcome(s) of interest

Base inclusion decisions on theory or prior empirical findings rather than on empirical associations with exposures or outcomes in your own data

Include as many variables as possible that are related to program participation and/or the outcome, as long as the variable is not affected by the program and is not on the causal pathway between program and outcome


Step 1: Generate Propensity Score Model Specification

Concerns about collinearity and model fit do not apply in the context of the propensity score model

Never use model selection procedures such as stepwise selection, and do not remove non-significant variables when generating propensity scores

If limited by small sample size, prioritize variables strongly related to outcomes, but then check balance on all covariates


Step 1: Generate Propensity Score Model Specification

Include interactions and higher order terms (polynomials) in the model, when appropriate, to get optimal balance between program participants and the comparison group across confounders

Accuracy of the propensity score model is less important than the covariate balance obtained

Model specification is an iterative process with balance checking


Step 1: Generate Propensity Score Missing values

Before modeling to generate propensity scores, delete observations with missing values on outcomes of interest to avoid unmatched exposed individuals

Consider techniques such as single or multiple imputation for confounders to minimize loss of sample size/generalizability due to missing values

Including the imputed value plus an indicator for missingness controls for the covariate and for the pattern of missing data, which may also confound the relationship of interest

Stuart 2010


Step 1: Generate Propensity Score

Excerpt of Propensity Scores for Sample

Obs.   Program   Propensity score (predicted probability)
811    Yes       0.77917
812    Yes       0.79674
813    No        0.17937
814    No        0.41324
815    No        0.83309
816    No        0.36290
817    No        0.82015
818    No        0.78867
819    No        .
820    No        0.11435
821    Yes       0.47309
822    Yes       0.77425
823    Yes       0.88204


Step 1: Generate Propensity Score Assessing Common Support

“Common Support” is the overlap in the distribution of propensity scores for program participants compared to non-participants

Sturmer, et al 2006, J Clin Epidemiol


Step 1: Generate Propensity Score Assessing Common Support

Lack of common support leads to:

• Extrapolation beyond the data for any observation whose propensity score lies outside the range of scores for individuals in the other group (program participants or non-participants)

• Loss of external validity: if individuals in the sample fall outside the area of common support, a matched sample may not be representative (examine characteristics of excluded individuals to assess this)

If common support does not hold, the dataset cannot be used to generate the ATE (but can be used for the ATT)
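One common way to operationalize this check is to trim the sample to the overlapping range of the two score distributions. A hypothetical Python sketch (function name and data are mine):

```python
import numpy as np

def trim_to_common_support(pscore, treated):
    """Keep only observations whose propensity score lies within the
    overlapping (common support) range of the treated and control
    score distributions."""
    t, c = pscore[treated == 1], pscore[treated == 0]
    lo = max(t.min(), c.min())   # highest of the two group minimums
    hi = min(t.max(), c.max())   # lowest of the two group maximums
    return (pscore >= lo) & (pscore <= hi)

pscore = np.array([0.05, 0.10, 0.30, 0.55, 0.70, 0.20, 0.40, 0.90])
treated = np.array([0,    0,    0,    1,    1,    1,    1,    1])
keep = trim_to_common_support(pscore, treated)
```

As the slide notes, the characteristics of the dropped observations should be examined, since trimming changes who the estimate applies to.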


Step 1: Generate Propensity Score Distribution of Propensity Scores

Example: Medical home as “program”

[Figure: distributions of propensity scores for Medical Home = NO vs Medical Home = YES]


Step 1: Generate Propensity Score SAS programming code

title 'Step 1: Model to generate propensity score';

proc logistic data = &dataset;
class &CatVars &design / param=ref;
model &Exposure = &Confounders &Polynomials &Interactions &weight &design;
output out=predvalues p=pscore;
where &subset=1; /*eligible sample with non-missing values for OC*/

run;

This process results in a new dataset called predvalues with all of the original variables and data, but an added variable called pscore with a value between 0 and 1 for each individual

Note: weight and design variables only apply with complex sample survey data (see Special Topic for this later in slide set)


Step 1: Generate Propensity Score SAS programming code

title 'Step 1: Examine PS distributions for common support';

proc univariate data=predvalues;

class &Exposure;

var pscore;

histogram pscore;

run;


Step 2: Use propensity scores to select comparable groups

Once generated and assessed for common support, propensity scores may be used in one of four ways:

a) As a covariate in a model with exposure status predicting outcome (not recommended)

b) As values on which to stratify/subclassify data to form more comparable groups;

c) As weights (inverse of propensity score); or

d) As values on which to match a program participant (exposed) to non-participant (unexposed), then conduct matched analysis to estimate the exposure-outcome relationship (the program effect)

Ultimate goal is to create optimal balance on baseline covariates (usually weighting or matching perform better with respect to this goal)


Step 2: Selecting Comparable Groups Stratification/Subclassification

Stratification on the propensity score (e.g., quintiles) can be used; this yields multiple effect estimates that can be combined into a single estimate

Full matching is a similar approach, but creates matched sets with variable ratios of treated to controls (k:1 or 1:k)

Estimates the ATE


Step 2: Selecting Comparable Groups Propensity Score Weighting


Step 2: Selecting Comparable Groups Propensity Score Matching

Propensity score matching creates a new, matched sample of program participants and controls. It requires:

1) Defining “closeness” to determine a good match between individuals
2) Implementing a matching method, given the “closeness” measure

Estimates the ATT

If sample size and makeup allow, PS matching can be combined with exact matching when there are variables for which balance is not easily obtained or when there are important stratifiers
Example: First exact match on gender and race, then match on propensity score within those strata

Stuart 2010


Step 2: Selecting Comparable Groups Propensity Score Matching

Defining Closeness
The caliper width is the maximum acceptable difference between the propensity scores of a program participant and the control chosen as their match
Simulation studies have consistently shown that 0.2 * the standard deviation of the linear propensity score (logit of the propensity score) performs well as a caliper width

Matching Techniques
o Greedy matching
o Nearest neighbor
o Optimal matching

Austin 2013

Step 2: Selecting Comparable Groups Propensity Score Matching

One example of Greedy Matching (SAS macro): Greedy 5→1 Digit

• First, “best matches” are made on all 5 digits of the propensity score (e.g., HV 0.12345 with non-HV 0.12345); cases still unmatched are retried with the caliper widened incrementally: 4 digits (0.1234x), then 3 (0.123xx), and so on down to 1 digit (0.1xxxx), i.e., a caliper of up to 0.1

• At each stage, the non-program participant with the “closest” propensity score is selected as the match to the program participant; the match is randomly selected in the case of ties
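The greedy digit-matching logic can be sketched in Python. This is a hypothetical illustration, not the SAS GREEDY macro itself; ties here are broken by closeness and then index rather than at random:

```python
def greedy_digit_match(ps_treated, ps_control):
    """Greedy 5->1 digit matching without replacement: first pair
    treated units and controls whose scores agree when rounded to 5
    decimal places, then progressively relax to 1 decimal place."""
    unmatched_t = dict(enumerate(ps_treated))
    available_c = dict(enumerate(ps_control))
    pairs = []
    for digits in (5, 4, 3, 2, 1):
        for ti, tp in sorted(unmatched_t.items()):
            # candidate controls agreeing to `digits` decimal places
            cands = [ci for ci, cp in available_c.items()
                     if round(tp, digits) == round(cp, digits)]
            if cands:
                # take the closest candidate (ties broken by index here,
                # where the macro would pick at random)
                ci = min(cands, key=lambda c: abs(available_c[c] - tp))
                pairs.append((ti, ci))
                del unmatched_t[ti], available_c[ci]
    return pairs

pairs = greedy_digit_match([0.123, 0.500, 0.900],
                           [0.124, 0.512, 0.200])
```

In this toy run the first treated unit matches at the 2-digit stage, the second only at the 1-digit stage, and the third (0.900) goes unmatched because no control lies within the widest caliper.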


Step 2: Selecting Comparable Groups Propensity Score Matching

Matching Techniques
Optimal: Selects the best match first (but harder to implement)
Nearest neighbor: Relies on sort order for selection of matched controls (a random sort order usually works almost as well as optimal and is easy to implement)

Choice of matching technique should be based on the ultimate goal of achieving an optimal middle ground between exchangeability (balance/bias reduction) and inclusion of program participants (generalizability)

Multiple matching techniques can be attempted for one analysis to determine which best achieves balance

Step 2: Selecting Comparable Groups Propensity Score Matching

Matching with or without replacement

Matching with replacement can be useful when the number of controls is small relative to the number of treated. However, several issues discourage its use:
o controls are not independent (frequency weights are needed, plus specialized techniques for estimating variance)
o a small number of controls may provide the whole comparison group; the number of times a control appears should be monitored
o Austin (2014) found greater variability and no improvement in bias reduction when matching with replacement

With few controls per program participant, it is probably better to consider PS weighting or stratification rather than matching with replacement

Stuart 2010 Austin 2014


Step 2: Selecting Comparable Groups Propensity Score Matching

Software solutions for PS Matching (See Resources)

PSMATCH2 (Stata): flexible and user-controlled with regard to matching techniques

SAS macro for nearest neighbor matching within a user-defined caliper, without replacement

GREEDY (5→1 digit) macro in SAS: performs one-to-one nearest neighbor within-caliper matching without replacement


Step 2: Selecting Comparable Groups SAS programming code

For PS Subclassification/Stratification: Quintiles of Prop. Score

title 'Step 2: Define quintiles of propensity score for stratification';

proc means data=predvalues p20 p40 p60 p80;

var pscore;

run;


Step 2: Selecting Comparable Groups SAS programming code

For PS weighting: creating weights

title 'Step 2: Calculate weights for estimating ATE or ATT';

data predvalues;

set predvalues;

PSweightATE = (&Exposure/pscore) + ((1-&Exposure)/(1-pscore));

PSweightATT = &Exposure + ((1-&Exposure)*(pscore/(1-pscore)));

run;
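The same two weight formulas translate directly to Python (a hypothetical numpy sketch with made-up scores, mirroring the data step above):

```python
import numpy as np

# ATE and ATT weights computed from the propensity score, using the
# same formulas as the SAS data step (illustrative values only)
pscore = np.array([0.2, 0.5, 0.8, 0.4])
exposure = np.array([1, 0, 1, 0])

# ATE: 1/pscore for the exposed, 1/(1-pscore) for the unexposed
w_ate = exposure / pscore + (1 - exposure) / (1 - pscore)
# ATT: 1 for the exposed, pscore/(1-pscore) (the odds) for the unexposed
w_att = exposure + (1 - exposure) * pscore / (1 - pscore)
```

Note how the ATT weights leave the exposed untouched (weight 1) and reweight the unexposed to resemble the exposed, which is exactly why weighting targets the "treated" population.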


Step 2: Selecting Comparable Groups SAS programming code

For PS Matching: Selecting caliper width (0.2 * sd of logit of pscore)

title 'Step 2: Select caliper width and random sort data before matching';
title2 'Create random number for each record and variable for logit of PS';
data predvalues;
set predvalues;
SORTER=RANUNI(-3);
logitPscore=log(pscore/(1-pscore));
run;

title3 'Calculate std deviation for logit of PS to determine caliper width';
proc means data=predvalues std n;
var logitPscore; /*multiply std dev by 0.2*/
run;

title4 'Perform random sort of data';
proc sort data=predvalues;
by sorter;
run;
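The caliper computation, shown as a hypothetical Python translation (the scores are made up; `ddof=1` gives the sample standard deviation, as PROC MEANS reports):

```python
import numpy as np

# Caliper width = 0.2 * standard deviation of the logit of the
# propensity score (the rule of thumb cited above; Austin 2013)
pscore = np.array([0.10, 0.25, 0.40, 0.55, 0.70])
logit_ps = np.log(pscore / (1 - pscore))          # linear propensity score
caliper = 0.2 * logit_ps.std(ddof=1)              # sample std dev * 0.2
```

Matches would then be accepted only when the two members' logit scores differ by less than `caliper`.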


Step 3: Check covariate balance across groups

Standardized differences are preferred to significance testing because they are in units of the pooled standard deviation, allowing comparisons on the same scale, and because they are not influenced by sample size

Standardized Difference (means):
d = (x̄_treated − x̄_control) / sqrt[(s²_treated + s²_control)/2]
where s_treated and s_control are the sample standard deviations of the covariate in the treated and untreated subjects, respectively

Standardized Difference (proportions):
d = (p̂_treated − p̂_control) / sqrt[(p̂_treated(1−p̂_treated) + p̂_control(1−p̂_control))/2]

% Bias Reduction:
100 × (StdDif_unmatched − StdDif_matched) / StdDif_unmatched

By convention, a standardized difference of >= 0.1 indicates imbalance, though there is no firm consensus on a cutoff
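These formulas can be checked with a short Python sketch. The proportions below borrow the NH White values from the example balance tables purely as inputs (the sketch and function names are mine, and its outputs are not meant to reproduce the tables' reported figures, which may be on a different scale):

```python
import math

def std_diff_proportions(p_t, p_c):
    """Absolute standardized difference for a binary covariate."""
    return abs(p_t - p_c) / math.sqrt((p_t * (1 - p_t) + p_c * (1 - p_c)) / 2)

def pct_bias_reduction(d_unmatched, d_matched):
    """Percent of the unmatched imbalance removed by matching."""
    return 100 * (d_unmatched - d_matched) / d_unmatched

d_before = std_diff_proportions(0.68, 0.39)  # e.g., NH White before matching
d_after  = std_diff_proportions(0.55, 0.52)  # e.g., NH White after matching
```

Here `d_before` is well above the conventional 0.1 threshold, `d_after` falls below it, and roughly 90% of the bias is removed by matching.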

Step 3: Check covariate balance across groups

Recommended that interactions and higher order terms also be compared across treatment groups

For continuous variables, can also compare spread between groups using side-by-side box plots or another graphical method (to check balance on variance in addition to the mean)


Step 3: Check covariate balance across groups

Three strategies for assessing balance: choose the propensity score model and matching method that…
1. yields the smallest standardized differences across the largest number of baseline covariates;
2. minimizes the standardized differences of a few particularly important covariates; or
3. results in the fewest number of “large” (>0.25) standardized differences

If groups are not balanced, re-specify the model and re-generate propensity scores:
o consider adding interaction terms or higher order terms for those variables that were not balanced
o consider other matching strategies

Stuart 2010


Example: Checking Covariate Balance Before Propensity Score Matching

Selected            Exposed (n = 524)   Not Exposed (n = 1,001)   Absolute Standardized
Variables           mean (SD)           mean (SD)                 Difference*
Age
  0-5               0.38 (0.02)         0.28 (0.02)               4.83
  6-11              0.31 (0.02)         0.36 (0.02)               2.39
  12-17             0.31 (0.02)         0.37 (0.02)               2.97
Race/Ethnicity
  NH White          0.68 (0.02)         0.39 (0.02)               13.79
  NH African Amer.  0.14 (0.01)         0.21 (0.02)               4.27
  Hispanic          0.12 (0.01)         0.32 (0.02)               10.41
  Other/Multiracial 0.07 (0.01)         0.07 (0.01)               0.00
etc.

Example: Checking Covariate Balance After Propensity Score Matching

Selected            Exposed (n = 482)   Not Exposed (n = 482)   Standardized   % Bias
Variables           mean (SD)           mean (SD)               Difference*    Reduction
Age
  0-5               0.30 (0.46)         0.28 (0.45)             0.04           99.1%
  6-11              0.26 (0.44)         0.31 (0.46)             0.11           95.4%
  12-17             0.44 (0.50)         0.41 (0.49)             0.06           98.0%
Race/Ethnicity
  NH White          0.55 (0.50)         0.52 (0.50)             0.06           99.6%
  NH African Amer.  0.22 (0.41)         0.23 (0.42)             0.02           99.4%
  Hispanic          0.16 (0.36)         0.17 (0.38)             0.03           99.7%
  Other             0.07 (0.26)         0.08 (0.27)             0.04           0.0%

Step 4: Estimate Effect of Program Stratification and Weighting

Once balance is established, calculate a measure of association for the effect of the program on the desired outcome(s). Analytic methods vary by strategy:

Stratification/Subclassification:
1. Calculate the crude measure of effect (RD, RR, HR) within strata of the propensity score
2. Combine estimates using a weighted average, with each stratum weighted according to its sample size as a proportion of the whole sample (ATE) or according to its proportion of program participants (ATT)

PS Weighting: Calculate the crude measure of effect using the weights as previously specified
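The stratified pooling step can be written out in a few lines. This is a hypothetical Python sketch with made-up stratum estimates; it follows the slide's weighted-average recipe literally (in practice effect estimates are often pooled on the log scale):

```python
# Combining stratum-specific estimates after stratifying on the
# propensity score (e.g., quintiles). ATE weights each stratum by its
# share of the total sample; ATT by its share of program participants.
strata = [
    # (risk ratio in stratum, total n in stratum, n program participants)
    (0.90, 200, 20),
    (0.80, 200, 50),
    (0.70, 200, 90),
]
n_total = sum(n for _, n, _ in strata)
n_part = sum(p for _, _, p in strata)

rr_ate = sum(rr * n / n_total for rr, n, _ in strata)   # sample-weighted
rr_att = sum(rr * p / n_part for rr, _, p in strata)    # participant-weighted
```

Because participants concentrate in the strata where the program looks most effective here, the ATT (≈0.756) is farther from the null than the ATE (0.80): the two estimands answer different questions.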


Step 4: Estimate Effect of Program Propensity Score Matching

The matched design needs to be taken into account only to correctly estimate the standard error of the program effect (conditional logistic regression is not necessary), since matched pairs are no longer statistically independent (controversial, but generally recommended)

The measures of association themselves are statistically unbiased, because program participants (the exposed) are matched to non-participants (the unexposed); this contrasts with matching in a case-control study, which imposes a new selection bias that must then be addressed with conditional logistic regression


Step 4: Estimate Effect of Program Propensity Score Matching

Multivariable regression is not necessary, since matching on the propensity scores has addressed confounding; either a simple 2x2 table or a crude GEE model can be used. The 2x2 table must reflect the matched data structure.


                           Unexposed Experiences Outcome?
                                Yes       No
Exposed         Yes              a         b        a + b
Experiences
Outcome?        No               c         d        c + d

                               a + c     b + d     a + b + c + d (n pairs)

Step 4: Estimate Effect of Program Propensity Score Matching

Computations Based on a Simple 2 x 2 Table

Organized for Matched Pairs

Relative Risk (RR) = (a+b)/(a+c)

SE (lnRR) = sqrt [(b+c) / {(a+b)(a+c)}]

95% CI = exp[lnRR ± (1.96*SE)]

Risk Difference (RD) / Attributable Risk (AR) = (b−c)/n

SE (RD) = sqrt[(b+c) − (b−c)²/n] / n

95% CI = RD ± 1.96(SE)
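These matched-pair computations are easy to verify numerically. A hypothetical Python sketch with made-up cell counts, using the table orientation above (rows = exposed member's outcome, columns = unexposed member's, so the exposed risk margin is a+b and the unexposed margin is a+c):

```python
import math

# Matched-pair 2x2 cell counts (hypothetical); n = number of pairs
a, b, c, d = 30, 25, 15, 130
n = a + b + c + d

rr = (a + b) / (a + c)                        # exposed vs unexposed risk
se_lnrr = math.sqrt((b + c) / ((a + b) * (a + c)))
rr_ci = (math.exp(math.log(rr) - 1.96 * se_lnrr),
         math.exp(math.log(rr) + 1.96 * se_lnrr))

rd = (b - c) / n                              # matched-pair risk difference
se_rd = math.sqrt((b + c) - (b - c) ** 2 / n) / n
rd_ci = (rd - 1.96 * se_rd, rd + 1.96 * se_rd)
```

Note that only the discordant pairs (b and c) drive the risk difference and both standard errors, which is the hallmark of matched-pair analysis.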


Step 4: Estimate Effect of Program Propensity Score Matching

/*SAS code for restructuring data from one observation per infant to one observation per matched pair to create matched 2x2 table*/

data Unexp (rename=(Outcome=UnexpOutcome));
set smatchall; where Exposure=0; run;

proc sort data=Unexp; by matchto; run;

data Exp (rename=(Outcome=ExpOutcome));
set smatchall; where Exposure=1; run;

proc sort data=Exp; by matchto; run;

data matchedpair;

merge Unexp Exp;

by matchto; run;

proc freq data=matchedpair order=formatted;

table ExpOutcome*UnexpOutcome/norow nocol;

exact mcnem;

run;


matchto is a variable indicating the ID that each matched pair shares

Step 4: Estimate Effect of Program Propensity Score Matching

SAS Code to run Generalized Estimating Equations (GEE) model for Relative Risks

(Use dataset with one observation per individual but a variable to indicate a unique ID for each matched pair)

proc genmod data=smatchall desc;

class matchto;

model Outcome = Exposure/dist=bin link=log; /*log binomial model*/

repeated subject=matchto/type=IND corrw covb;

estimate 'Exp vs Unexp' exposure 1 /exp;

run;


matchto is a variable indicating the ID that each matched pair shares

Step 4: Estimate Effect of Program Sensitivity Analysis

For propensity score analyses to result in true causal effect of program, there is a strong ignorability assumption:

o Sufficient overlap of program participants and controls

o Unconfounded treatment assignment

Sensitivity analysis can be used to assess whether treatment assignment is unconfounded after balancing on observed covariates

o Test effect of program on baseline measurement of outcome or other related variable, if available (See Hillemeier 2014)

o Quantify extent to which unmeasured confounding may explain findings (See Jiang 2011 and Meghea 2013)


Special Topic: Subpopulation Analysis

Sometimes effect modification of program effects (differential effectiveness) by specific characteristics is suspected and/or subgroup analyses are part of the evaluation plan

For subpopulation analysis, it is best to stratify at the beginning of analysis and generate separate propensity scores for each stratum; then analyze within strata

Examples: Dose-specific analysis, race-specific differences, age-group differences


Recall from Multnomah County Home Visiting Evaluation…

Likelihood of HV women having received adequate PNC compared to matched non-HV women, by minimum number of visits

Note: Re-matching was performed for each comparison, using the propensity scores generated from the original model


Special Topic: >2 Category Programs

For multiple treatment groups, generalized logit modeling (Imbens 2000; Imai 2004) can be used to produce propensity scores, after which propensity score weighting can be applied

For PS matching, one could generate a propensity score for each category and match to a common referent group
o this is unsatisfying, since the control group selected by matching will likely differ for each level of the program


Special Topic: Survey Data

Complex sample surveys involve weighting and survey design variables

Model 1: Generating the Propensity Score
Include the survey weight, strata, and other design variables as predictors in the regression equation, rather than as cluster, strata, and weight variables
SEs are not of interest for Model 1, so there is no need to account for design variables to adjust SEs

Model 2: Estimating effect of program

Incorporate weights, clustering and stratification variables to accurately estimate variance and provide population representative results (for PS weighting, simply multiply survey weights by PS weights)

DuGoff 2014 56

Application of Methods:

Examining the Effect of Breastfeeding Duration on Early Child Development (NSCH 2007)


Main Variables

Exposure

Dichotomous Breastfeeding Duration: ≥ 6 months/still (extended breastfeeding) vs <6 months/never breastfed

Outcome

Summary measure plus domains of a child's risk for developmental delay

High Risk = 2+ concerns predictive of delay

Moderate Risk = 1 concern predictive of delay

Low/No Risk = 0 concerns or concerns not predictive of delay

Questions were adapted from the Parents’ Evaluation of Developmental Status (PEDS ©), a standardized screening tool used clinically with parents


Covariates

Child factors: Sex

Race/ethnicity

Age

Birth Order

Birthweight

Maternal factors: Age at child’s birth

Education level

Marital status/ Cohabitation

Country of Birth

Family Factors: Family structure

Father’s education

Income as % FPL

Primary language

Smoker in household

Geographic Factors: Residence in an MSA

Region of the U.S.


Step 1: Propensity Score Estimation

Used a logistic regression model with breastfeeding ≥ 6 months vs <6 months/never as the dependent variable and all covariates as independent variables to compute the propensity score (PS); examined the distribution of propensity scores for each group

Propensity scores are the predicted probabilities from a regression model of this form:

Exposure = pool of observed covariates


Propensity Score Distributions

Breastfed <6 months/never

Breastfed ≥ 6 months

Range: 0.0357 – 0.9056

Range: 0.0294 – 0.8937

Step 2: Selecting Comparable Groups

Matched each child breastfed ≥ 6 months to a child breastfed <6 months/never on PS (SAS matching algorithm)

Nearest neighbor matching algorithm

Caliper width of 0.2 * std dev of logit (prop. score)

1:1 ratio of unexposed to exposed

Without replacement

Produced weights from propensity scores as an alternative to PS matching


Step 3: Balance Checking

Performed balance diagnostics to compare covariate distributions for exposed and unexposed in original and matched samples and in weighted sample

Calculated absolute standardized differences

Verified that the matched sample had standardized differences for each covariate under 0.10

Step 4: Estimate Measure of Effect

1) Propensity score matched analysis using generalized estimating equations for polytomous regression/generalized logit model and binary regression (ATT)

2) Propensity score weighted analysis (ATE & ATT)

3) Traditional multivariable regression (ATE)

Step 5: Comparison of Analytic Methods

Qualitatively compared results across analytic methods for propensity score analysis and traditional multivariable generalized logit regression modeling

For several outcomes, the results diverged across methods

Results from multivariable regression, PS weighting and PS matching are not directly comparable (ATE vs ATT)


Step 5: Comparison of Analytic Methods

PS matching sets a high bar for group equivalence by matching on observed covariates but also limiting analysis to individuals who have a potential match on PS

This means that findings apply only to the effect of exposure on the exposed, meaning the segment of the population more likely to breastfeed (ATT)

In addition, a high proportion of exposed individuals were lost in the matching process, which makes generalizability even less clear


Propensity Score Analysis:

Benefits, Drawbacks,

and Challenges


Comparison of Propensity Score Analysis to Traditional Regression Approaches

Model I: the process of generating propensity scores

Because selection of covariates occurs when specifying Model 1, the process is blind to outcome status, which forces the researcher to think about and check covariate balance before looking at outcomes

Because Model 1 for generating the propensity scores is not focused on reliability of estimates or statistical testing, it permits adjustment for many covariates, as sample size allows


Comparison of Propensity Score Analysis to Traditional Regression Approaches

Model I: the process of generating propensity scores continued

While Model 1 can include many variables regardless of their statistical significance, the number of observations lost due to missing values likely increases as the number of variables used increases.

Must consider how to approach the issue of missing data on covariates of interest (complete-case analysis, separate dummy variable for missing, imputation) – multiple imputation approaches have been used in more recent work (See Foster et al 2012)


Comparison of Propensity Score Analysis to Traditional Regression Approaches

Model II: Estimating the exposure-outcome relationship

In usual regression modeling, the final model contains one or more "exposure" variables and relatively few covariates; Model II in propensity score matching is typically a crude model with the exposure as the single independent variable (with weighting), or a matched 2x2 table is used

Having a crude model (fewer degrees of freedom) is especially useful if sample size is small or the outcome is rare. If exposure is rare, however, modeling many covariates in Model I to generate the propensity scores may not be possible

Comparison of Propensity Score Analysis to Traditional Regression Approaches

Model II: Estimating the exposure-outcome relationship continued

A mis-specified model in usual regression may lead to inaccurate conclusions, while controlling for confounding using propensity scores is less prone to this issue, as long as balance on covariates has been achieved

Since only one exposure-outcome association is examined (all other variables are "hidden" as part of the propensity score), the analysis and reporting of results is likely to be more focused than from a traditional regression modeling approach


Comparison of Propensity Score Analysis to Traditional Regression Approaches

Generalizability

Propensity score analysis calls us to be explicit about who findings apply to:

ATE: Average program effect in target population

ATT: Average program effect among those likely to be in the program

ATU: Average program effect among those not likely to be in the program


Propensity Score Analysis – Is it Worth it?

Nine (13%) of 69 articles in the medical literature between 1998-2003 showed meaningful differences in effect sizes between conventional regression and propensity score methods; since the true effect estimate is unknown, it is not clear how to interpret this (Sturmer 2006)

Eight (10%) of 78 associations (from 43 studies) reported different results between regression and PS methods; in all of these, the PS results were non-significant while the regression results were significant; on average, estimates were 6.4% closer to the null with PS methods (Shah 2005)


Propensity Score Analysis – Is it Worth it?

Transparency: a “design-based approach” to removing confounding rather than an “analysis-based approach” (Austin 2011)

Because selection of covariates occurs when specifying the model for the propensity score, the process is blind to outcome status

• Forces the researcher to rely on a conceptual model to identify appropriate covariates

• Allows for balance checking and assessment of common support before ever looking at outcome(s)

• The analysis and reporting of results is more focused


Propensity Score Matching – Is it Worth it?

Transparency: Balance diagnostics allow for a more critical assessment of exchangeability between program participants and the control group

Explicitly assess the degree to which confounding has been removed using standardized differences and percent bias reduction
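For a continuous covariate, the standardized difference is the difference in group means divided by the pooled standard deviation; values below about 0.1 are commonly read as adequate balance (Austin 2011), and percent bias reduction compares the standardized difference before and after matching. An illustrative Python sketch with simulated before/after samples (all numbers hypothetical):

```python
import numpy as np

def std_diff(a, b):
    """Standardized difference: (mean_a - mean_b) / pooled SD."""
    pooled_sd = np.sqrt((np.var(a, ddof=1) + np.var(b, ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled_sd

rng = np.random.default_rng(2)
treated   = rng.normal(0.5, 1, 400)   # hypothetical covariate, program clients
controls  = rng.normal(0.0, 1, 400)   # all potential controls (pre-matching)
matched_c = rng.normal(0.45, 1, 400)  # hypothetical matched controls

d_before = std_diff(treated, controls)
d_after = std_diff(treated, matched_c)
bias_reduction = 100 * (1 - abs(d_after) / abs(d_before))
print(d_before, d_after, bias_reduction)
```

Unlike significance tests of group differences, the standardized difference does not depend on sample size, which is why it is the preferred balance diagnostic.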


Resources: Methods Articles

Austin PC. An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies. Multivariate Behavioral Research 46: 399-424, 2011.

Austin PC. Comparing paired vs non-paired statistical methods of analyses when making inferences about absolute risk reductions in propensity-score matched samples. Statistics in Medicine 30: 1292-1301, 2011.

Austin PC. A Comparison of 12 Algorithms for Matching on the Propensity Score. Statistics in Medicine 33: 1057-1069, 2014.

DuGoff EH, Schuler M, Stuart EA. Generalizing Observational Study Results: Applying Propensity Score Methods to Complex Surveys. Health Services Research 49(1): 284-303, 2014.

Imbens G. The role of the propensity score in estimating dose-response functions. Biometrika 87(3): 706-710, 2000.

Imai K, van Dyk DA. Causal inference with general treatment regimes: Generalizing the propensity score. Journal of the American Statistical Association 99(467): 854-866, 2004.

Oakes JM, Johnson P. Propensity Score Matching for Social Epidemiology. In: Oakes JM, Kaufman JS (Eds.), Methods in Social Epidemiology. San Francisco, CA: Jossey-Bass.

Shah BR, Laupacis A, Hux JE, Austin PC. Propensity score methods gave similar results to traditional regression modeling in observational studies: a systematic review. Journal of Clinical Epidemiology 58(6), 2005.

Stürmer T, Joshi M, Glynn RJ, Avorn J, Rothman KJ, Schneeweiss S. A Review of Propensity Score Methods Yielded Increasing Use, Advantages in Specific Settings, but not Substantially Different Estimates Compared with Conventional Multivariable Methods. Journal of Clinical Epidemiology 59(5): 437-447, 2006.

Williamson E, Morley R, Lucas A, Carpenter J. Propensity scores: From naïve enthusiasm to intuitive understanding. Statistical Methods in Medical Research 21(3): 273-293, 2011.

Resources: Some MCH Applications

Bird TM, Bronstein JM, Hall RW, Lowery CL, Nugent R, Mays GP. Late preterm infants: birth outcomes and health care utilization in the first year. Pediatrics (2): e311-9, Epub 2010 Jul 5.

Brandt S, Gale S, Tager IB. Estimation of treatment effect of asthma case management using propensity score methods. American Journal of Managed Care 16(4): 257-64, 2010.

Foster EM, Jiang M, Gibson-Davis CM. The Effect of the WIC Program on the Health of Newborns. Health Services Research 45(4): 1083-1104, 2010.

Hillemeier MM, et al. Effects of Maternity Care Coordination on Pregnancy Outcomes: Propensity-Weighted Analyses. Maternal and Child Health Journal: 1-7, 2014.

Jiang M, Foster EM, Gibson-Davis CM. Breastfeeding and the Child Cognitive Outcomes: A Propensity Score Matching Approach. Maternal and Child Health Journal 15: 1296-1307, 2011.

Meghea CI, et al. Medicaid home visitation and maternal and infant healthcare utilization. American Journal of Preventive Medicine 45(4): 441-447, 2013.

Okamoto M, Ishigami H, Tokimoto K, Matsuoka M, Tango R. Early Parenting Program as Intervention Strategy for Emotional Distress in First-Time Mothers: A Propensity Score Analysis. Maternal and Child Health Journal, Epub 2012.

Ounpraseuth S, Gauss CH, Bronstein J, Lowery C, Nugent R, Hall R. Evaluating the Effect of Hospital and Insurance Type on the Risk of 1-Year Mortality of Very Low Birthweight Infants. Medical Care 50(4): 353-360, 2012.

Redding S, et al. Pathways Community Care Coordination in Low Birth Weight Prevention. Maternal and Child Health Journal: 1-8, 2014.


Resources

Software

SAS GREEDY MACRO – code and documentation: http://www2.sas.com/proceedings/sugi26/p214-26.pdf

Stata PSMATCH2: Leuven E, Sianesi B (2003). "PSMATCH2: Stata module to perform full Mahalanobis and propensity score matching, common support graphing, and covariate imbalance testing". http://ideas.repec.org/c/boc/bocode/s432001.html

Other Matching Programs and Information on Sensitivity Analyses for Unmeasured Confounders: http://www.biostat.jhsph.edu/~estuart/propensityscoresoftware.html

Supplementary Slides


Drawbacks of Traditional Regression Modeling for Estimating Program Effects

For categorical outcomes, adjusted program effects from multivariable regression models are not marginal (population-average) estimates; they are conditional estimates, conditional on the covariates in the model. Because they depend on the covariates, they do not directly estimate the counterfactual contrast in the population

Conditional measures of effect depend on the covariate pattern, so there is potentially a different relative risk for each covariate pattern. A conditional estimate is interpreted as the average effect of treatment on the individual, whereas the marginal effect is the average effect of treatment on the population outcome; the marginal effect is the one estimated in RCTs


Group Exercise Questions Based on Meghea, et al (2013) reading


Discussion Questions: Groups 1-3

1. Describe the characteristics of the program that is being evaluated (e.g. staffing, inputs, activities, desired outcomes)

2. What general method do authors use to address selection bias? Describe the characteristics of the matching process.

3. Comment on the authors’ ability to achieve covariate balance across program clients and comparison group (see Table 2). How was balance assessed?

4. Do results represent ATEs or ATTs? Do the authors make this clear when interpreting results?

5. (Optional) What did authors do to assess the assumption of ignorability, specifically the possibility that unmeasured confounding may be influencing their results? Do you agree with the following statement at the end of the first paragraph on p446: “Most of the favorable MIHP effects were robust to potential unobserved confounders.”


Discussion Questions: Groups 4-6

1. What are the selection forces at play that threaten the validity of the evaluation?

2. What general method do authors use to address selection bias? Describe the characteristics of the matching process.

3. Do we have any information to assess common support?

4. Comment on the measures of effect reported by the authors and the associated strengths and limitations of those measures

5. (Optional) What did authors do to assess the assumption of ignorability, specifically the possibility that unmeasured confounding may be influencing their results? Do you agree with the following statement at the end of the first paragraph on p446: “Most of the favorable MIHP effects were robust to potential unobserved confounders.”


Discussion Questions: Groups 7-10

1. What outcomes are the focus of this evaluation and where do you think those fit in the program’s logic model?

2. What general method do authors use to address selection bias? Describe the characteristics of the matching process.

3. What was the potential impact of missing data and unmatched MIHP participants on the accuracy and generalizability of the evaluation findings?

4. Comment on the authors’ general conclusion that an increase in participation in MIHP-like programs due to Medicaid expansion may enhance prenatal service coverage. Is this supported by their findings?

5. (Optional) What did authors do to assess the assumption of ignorability, specifically the possibility that unmeasured confounding may be influencing their results? Do you agree with the following statement at the end of the first paragraph on p446: “Most of the favorable MIHP effects were robust to potential unobserved confounders.”
