8/10/2019 Wooldridge Session 5
Difference-in-Differences Estimation
Jeff Wooldridge, Michigan State University
Programme Evaluation for Policy Analysis, Institute for Fiscal Studies
June 2012
1. The Basic Methodology
2. How Should We View Uncertainty in DD Settings?
3. Inference with a Small Number of Groups
4. Multiple Groups and Time Periods
5. Individual-Level Panel Data
6. Semiparametric and Nonparametric Approaches
7. Synthetic Control Methods for Comparative Case Studies
1. The Basic Methodology
Standard case: outcomes are observed for two groups for two time periods. One of the groups is exposed to a treatment in the second period but not in the first period. The second group is not exposed to the treatment during either period. Structure can apply to repeated cross sections or panel data.
With repeated cross sections, let A be the control group and B the treatment group. Write

y = β₀ + β₁·dB + δ₀·d2 + δ₁·d2·dB + u, (1)

where y is the outcome of interest, dB = 1 for the treatment group, and d2 = 1 for the second time period.
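Equation (1) can be checked numerically: because the regression is saturated in the two dummies and their interaction, the OLS coefficient on d2·dB equals the difference of the four cell-mean differences. A minimal sketch with simulated repeated cross sections (the sample size, coefficients, and true effect of 1.5 are assumptions of the illustration, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated repeated cross sections: dB = treatment-group dummy, d2 = second-period dummy.
n = 2000
dB = rng.integers(0, 2, n)
d2 = rng.integers(0, 2, n)
y = 0.5 + 0.3 * dB + 0.2 * d2 + 1.5 * d2 * dB + rng.normal(0, 1, n)  # true DD effect = 1.5

# OLS on equation (1): y = b0 + b1*dB + d0*d2 + d1*(d2*dB) + u
X = np.column_stack([np.ones(n), dB, d2, d2 * dB])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
dd_ols = coef[3]

# The same number as a "difference of differences" of the four cell means
cell = lambda g, t: y[(dB == g) & (d2 == t)].mean()
dd_means = (cell(1, 1) - cell(1, 0)) - (cell(0, 1) - cell(0, 0))
```

With the model saturated in the dummies, dd_ols and dd_means agree to machine precision.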
Can refine the definition of treatment and control groups. Example: change in state health care policy aimed at the elderly. Could use data only on people in the state with the policy change, both before and after the change, with the control group being people 55 to 65 (say) and the treatment group being people over 65. This DD analysis assumes that the paths of health outcomes for the younger and older groups would not be systematically different in the absence of intervention.
Instead, use the same two groups from another (untreated) state as an additional control. Let dE be a dummy equal to one for someone over 65 and dB be the dummy for living in the treatment state:

y = β₀ + β₁·dB + β₂·dE + β₃·dB·dE + δ₀·d2 + δ₁·d2·dB + δ₂·d2·dE + δ₃·d2·dB·dE + u. (3)
The OLS estimate δ̂₃ is

δ̂₃ = [(ȳ_{B,E,2} − ȳ_{B,E,1}) − (ȳ_{B,N,2} − ȳ_{B,N,1})] − [(ȳ_{A,E,2} − ȳ_{A,E,1}) − (ȳ_{A,N,2} − ȳ_{A,N,1})], (4)

where the A subscript means the state not implementing the policy and the N subscript represents the non-elderly. This is the difference-in-difference-in-differences (DDD) estimate.
Can add covariates to either the DD or DDD analysis to (hopefully) control for compositional changes. Even if the intervention is independent of observed covariates, adding those covariates may improve precision of the DD or DDD estimate.
2. How Should We View Uncertainty in DD Settings?
Standard approach: All uncertainty in inference enters through sampling error in estimating the means of each group/time period combination. Long history in analysis of variance.
Recently, different approaches have been suggested that focus on different kinds of uncertainty, perhaps in addition to sampling error in estimating means. Bertrand, Duflo, and Mullainathan (2004), Donald and Lang (2007), Hansen (2007a,b), and Abadie, Diamond, and Hainmueller (2007) argue for additional sources of uncertainty.
In fact, in the new view, the additional uncertainty swamps the
sampling error in estimating group/time period means.
One way to view the uncertainty introduced in the DL framework (a perspective explicitly taken by ADH) is that our analysis should better reflect the uncertainty in the quality of the control groups.
ADH show how to construct a synthetic control group (for California) using pre-treatment characteristics of other states (that were not subject to cigarette smoking restrictions) to choose the best weighted average of states in constructing the control.
Issue: In the standard DD and DDD cases, the policy effect is just identified in the sense that we do not have multiple treatment or control groups assumed to have the same mean responses. So, for example, the Donald and Lang approach does not allow inference in such cases.
Example from Meyer, Viscusi, and Durbin (1995) on estimating the effects of benefit generosity on the length of time a worker spends on workers' compensation. MVD have the standard DD before-after setting.
. reg ldurat afchnge highearn afhigh if ky, robust

Linear regression                             Number of obs =   5626
                                              F(3, 5622)    =  38.97
                                              Prob > F      = 0.0000
                                              R-squared     = 0.0207
                                              Root MSE      = 1.2692

------------------------------------------------------------------------------
             |               Robust
      ldurat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     afchnge |   .0076573   .0440344     0.17   0.862     -.078667    .0939817
    highearn |   .2564785   .0473887     5.41   0.000     .1635785    .3493786
      afhigh |   .1906012    .068982     2.76   0.006     .0553699    .3258325
       _cons |   1.125615   .0296226    38.00   0.000     1.067544    1.183687
------------------------------------------------------------------------------
. reg ldurat afchnge highearn afhigh if mi, robust

Linear regression                             Number of obs =   1524
                                              F(3, 1520)    =   5.65
                                              Prob > F      = 0.0008
                                              R-squared     = 0.0118
                                              Root MSE      = 1.3765

------------------------------------------------------------------------------
             |               Robust
      ldurat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     afchnge |   .0973808   .0832583     1.17   0.242    -.0659325    .2606941
    highearn |   .1691388   .1070975     1.58   0.114    -.0409358    .3792133
      afhigh |   .1919906   .1579768     1.22   0.224     -.117885    .5018662
       _cons |   1.412737   .0556012    25.41   0.000     1.303674      1.5218
------------------------------------------------------------------------------
3. Inference with a Small Number of Groups
Suppose we have aggregated data on few groups (small G) and large group sizes (each M_g is large). Some of the groups are subject to a policy intervention.
How is the sampling done? With random sampling from a large population, no clustering is needed.
Sometimes we have random sampling within each segment (group) of the population. Except for the relative dimensions of G and M_g, the resulting data set is essentially indistinguishable from a data set obtained by sampling entire clusters.
The problem of proper inference when M_g is large relative to G (the Moulton (1990) problem) has been recently studied by Donald and Lang (2007).
DL treat the parameters associated with the different groups as outcomes of random draws.
Simplest case: a single regressor that varies only by group:

y_gm = α + β·x_g + c_g + u_gm (5)
     = θ_g + β·x_g + u_gm. (6)

(6) has a common slope, β, but an intercept, θ_g, that varies across g. Donald and Lang focus on (5), where c_g is assumed to be independent of x_g with zero mean. Define the composite error v_gm = c_g + u_gm.
Standard pooled OLS inference applied to (5) can be badly biased because it ignores the cluster correlation. And we cannot use fixed effects.
DL propose studying the regression in averages:

ȳ_g = α + β·x_g + v̄_g, g = 1, ..., G. (7)

If we add some strong assumptions, we can perform inference on (7) using standard methods. In particular, assume that M_g = M for all g, c_g|x_g ~ Normal(0, σ²_c), and u_gm|x_g, c_g ~ Normal(0, σ²_u). Then v̄_g is independent of x_g and v̄_g ~ Normal(0, σ²_c + σ²_u/M). Because we assume independence across g, (7) satisfies the classical linear model assumptions.
So, we can just use the "between" regression,

ȳ_g on 1, x_g, g = 1, ..., G. (8)

With the same group sizes, this is identical to pooled OLS across g and m.
Conditional on the x_g, β̂ inherits its distribution from {v̄_g : g = 1, ..., G}, the within-group averages of the composite errors.
We can use inference based on the t_{G−2} distribution to test hypotheses about β, provided G > 2.
If G is small, the requirements for a significant t statistic using the t_{G−2} distribution are much more stringent than if we use the t_{M₁+M₂+...+M_G−2} distribution (traditional approach).
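The DL procedure on (8) is easy to sketch: average the data to the group level, run OLS of ȳ_g on (1, x_g), and base the t statistic on G − 2 degrees of freedom. All numbers below (G, M, the variances, the true slope of 2) are assumptions of the illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
G, M = 10, 200
beta = 2.0                                   # assumed true slope
xg = rng.normal(size=G)                      # regressor varying only by group
cg = rng.normal(scale=0.5, size=G)           # group effect c_g, independent of x_g
y = np.array([1.0 + beta * xg[g] + cg[g] + rng.normal(size=M) for g in range(G)])

# Donald-Lang: OLS of the group means on (1, x_g), inference with G - 2 df.
ybar = y.mean(axis=1)
X = np.column_stack([np.ones(G), xg])
coef, res_ss, *_ = np.linalg.lstsq(X, ybar, rcond=None)
beta_hat = coef[1]
sigma2 = res_ss[0] / (G - 2)                 # residual variance on G - 2 df
se_beta = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
t_stat = beta_hat / se_beta                  # compare to a t critical value with G - 2 df
```

The point is the degrees of freedom: the same t statistic judged against t₈ here, rather than t with roughly G·M degrees of freedom, is a much higher hurdle.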
Using the between regression is not the same as using cluster-robust standard errors for pooled OLS. Those are not justified and, anyway, we would use the wrong df in the t distribution.
So the DL method uses a standard error from the aggregated regression and degrees of freedom G − 2.
We can apply the DL method without normality of the u_gm if the group sizes are large, because Var(v̄_g) = σ²_c + σ²_u/M_g, so that ū_g is a negligible part of v̄_g. But we still need to assume c_g is normally distributed.
If z_gm appears in the model, then we can use the averaged equation

ȳ_g = α + β·x_g + z̄_g·γ + v̄_g, g = 1, ..., G, (9)

provided G > K + L + 1. If c_g is independent of (x_g, z̄_g) with a homoskedastic normal distribution, and the group sizes are large, inference can be carried out using the t_{G−K−L−1} distribution. Regressions like (9) are reasonably common, at least as a check on results using disaggregated data, but usually with larger G than just a handful.
If G = 2, should we give up? Suppose x_g is binary, indicating treatment and control (g = 2 is the treatment, g = 1 is the control). The DL estimate of β is the usual one: β̂ = ȳ₂ − ȳ₁. But in the DL setting, we cannot do inference (there are zero df). So, the DL setting rules out the standard comparison of means.
Can we still obtain inference on estimated policy effects using randomized or quasi-randomized interventions when the policy effects are just identified? Not according to the DL approach.
If y_gm = Δw_gm, the change of some variable over time, and x_g is binary, then application of the DL approach to

Δw_gm = α + β·x_g + c_g + u_gm

leads to a difference-in-differences estimate: β̂ = Δw̄₂ − Δw̄₁. But inference is not available no matter the sizes of M₁ and M₂.
β̂ = Δw̄₂ − Δw̄₁ has been a workhorse in the quasi-experimental literature, and obtaining inference in the traditional setting is straightforward [Card and Krueger (1994), for example].
Even when the DL approach can be applied, should we? Suppose G = 4 with two control groups (x₁ = x₂ = 0) and two treatment groups (x₃ = x₄ = 1). DL involves the OLS regression ȳ_g on 1, x_g, g = 1, ..., 4; inference is based on the t₂ distribution.
Can show the DL estimate is

β̂ = (ȳ₃ + ȳ₄)/2 − (ȳ₁ + ȳ₂)/2. (10)

With random sampling from each group, β̂ is approximately normal even with moderate group sizes M_g. In effect, the DL approach rejects usual inference based on means from large samples because it may not be the case that μ₁ = μ₂ and μ₃ = μ₄.
Why not tackle mean heterogeneity directly? Could just define the treatment effect as

(μ₃ + μ₄)/2 − (μ₁ + μ₂)/2,

or weight by population frequencies.
With large group sizes, and whether or not G is especially large, we can put the problem into an MD framework, as done by Loeb and Bound (1996), who had G = 36 cohort-division groups and many observations per group. For each group g, write

y_gm = δ_g + z_gm·γ_g + u_gm. (11)

Assume random sampling within group and independence across groups. OLS estimates within group are √M_g-asymptotically normal.
The presence of x_g can be viewed as putting restrictions on the intercepts:

δ_g = α + x_g·β, g = 1, ..., G, (12)

where we think of x_g as fixed, observed attributes of heterogeneous groups. With K attributes we must have G ≥ K + 1 to determine α and β. In the first stage, obtain the δ̂_g, either by group-specific regressions or by pooling to impose some common slope elements in the γ_g.
Let V̂ be the G × G estimated (asymptotic) variance of δ̂. Let X be the G × (K + 1) matrix with rows (1, x_g). The MD estimator is

(α̂, β̂)′ = (X′V̂⁻¹X)⁻¹X′V̂⁻¹δ̂. (13)

Asymptotics are as the M_g get large, and (α̂, β̂)′ has an asymptotic normal distribution; its estimated asymptotic variance is (X′V̂⁻¹X)⁻¹.
When separate group regressions are used, the δ̂_g are independent and V̂ is diagonal.
The estimator looks like GLS, but inference is with G (the number of rows in X) fixed and the M_g growing.
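Equation (13) is a weighted least squares problem in the first-stage intercepts. A minimal sketch (the function and example data are illustrative, not from Loeb and Bound):

```python
import numpy as np

def minimum_distance(gamma_hat, V_hat, X):
    """Equation (13): MD estimator of (alpha, beta) from first-stage group
    intercepts gamma_hat with estimated asymptotic variance V_hat.
    Rows of X are (1, x_g); V_hat is diagonal when the first stage runs a
    separate regression for each group."""
    Vinv = np.linalg.inv(V_hat)
    A = X.T @ Vinv @ X
    est = np.linalg.solve(A, X.T @ Vinv @ gamma_hat)
    avar = np.linalg.inv(A)  # estimated asymptotic variance (X'V^-1 X)^-1
    return est, avar

# Example: when gamma_hat is exactly linear in x_g, the restrictions hold
# exactly and MD recovers (alpha, beta) regardless of the weighting.
G = 5
xg = np.arange(G, dtype=float)
X = np.column_stack([np.ones(G), xg])
est, avar = minimum_distance(0.5 + 1.5 * xg, 0.1 * np.eye(G), X)
```

With G > K + 1 the restrictions are overidentified, and the weighted residual from this fit is the basis for the overidentification test mentioned on the next slide.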
Can test the overidentification restrictions. If we reject, we can go back to the DL approach, applied to the δ̂_g. With large group sizes, can analyze

δ̂_g = α + x_g·β + c_g, g = 1, ..., G (14)

as a classical linear model because δ̂_g = δ_g + O_p(M_g^{−1/2}), provided c_g is homoskedastic, normally distributed, and independent of x_g.
Alternatively, can define the parameters of interest in terms of the δ_g, as in the treatment effects case.
4. Multiple Groups and Time Periods

With many time periods and groups, the setup in BDM (2004) and Hansen (2007a) is useful. At the individual level,

y_igt = η_t + α_g + x_gt·β + z_igt·γ_gt + v_gt + u_igt, i = 1, ..., M_gt, (15)

where i indexes individual, g indexes group, and t indexes time. Full set of time effects, η_t; full set of group effects, α_g; group/time period covariates (policy variables), x_gt; individual-specific covariates, z_igt; unobserved group/time effects, v_gt; and individual-specific errors, u_igt. Interested in β.
As in cluster sample cases, can write

y_igt = δ_gt + z_igt·γ_gt + u_igt, i = 1, ..., M_gt; (16)

a model at the individual level where intercepts and slopes are allowed to differ across all (g, t) pairs. Then, think of δ_gt as

δ_gt = η_t + α_g + x_gt·β + v_gt. (17)

Think of (17) as a model at the group/time period level.
As discussed by BDM, a common way to estimate and perform inference in the individual-level equation

y_igt = η_t + α_g + x_gt·β + z_igt·γ + v_gt + u_igt

is to ignore v_gt, so the individual-level observations are treated as independent. When v_gt is present, the resulting inference can be very misleading.
BDM and Hansen (2007b) allow serial correlation in {v_gt : t = 1, 2, ..., T} but assume independence across g.
We cannot replace η_t + α_g with a full set of group/time interactions because that would eliminate x_gt.
If we view β in δ_gt = η_t + α_g + x_gt·β + v_gt as ultimately of interest, which is usually the case because x_gt contains the aggregate policy variables, there are simple ways to proceed. We observe x_gt, η_t is handled with year dummies, and α_g just represents group dummies. The problem, then, is that we do not observe δ_gt.
But we can use OLS on the individual-level data to estimate the δ_gt in

y_igt = δ_gt + z_igt·γ_gt + u_igt, i = 1, ..., M_gt,

assuming E(z_igt′u_igt) = 0 and the group/time period sample sizes, M_gt, are reasonably large.
Sometimes one wishes to impose some homogeneity in the slopes, say, γ_gt = γ_g or even γ_gt = γ, in which case pooling across groups and/or time can be used to impose the restrictions.
However we obtain the δ̂_gt, proceed as if the M_gt are large enough to ignore the estimation error in the δ̂_gt; instead, the uncertainty comes through v_gt in δ_gt = η_t + α_g + x_gt·β + v_gt.
The minimum distance (MD) approach effectively drops v_gt and views δ_gt = η_t + α_g + x_gt·β as a set of deterministic restrictions to be imposed on δ_gt. Inference using the efficient MD estimator uses only sampling variation in the δ̂_gt.
Here, proceed ignoring estimation error, and act as if

δ̂_gt = η_t + α_g + x_gt·β + v_gt. (18)

We can apply the BDM findings and Hansen (2007a) results directly to this equation. Namely, if we estimate (18) by OLS, which means full year and group effects, along with x_gt, then the OLS estimator has satisfying large-sample properties as G and T both increase, provided {v_gt : t = 1, 2, ..., T} is a weakly dependent time series for all g.
Simulations in BDM and Hansen (2007a) indicate cluster-robust inference works reasonably well when v_gt follows a stable AR(1) model and G is moderately large.
Hansen (2007b) shows how to improve efficiency by using feasible GLS, modeling v_gt as, say, an AR(1) process.
Naive estimators of the serial correlation parameters are seriously biased due to the panel structure with group fixed effects. Can remove much of the bias and improve FGLS.
Important practical point: FGLS estimators that exploit serial correlation require strict exogeneity of the covariates, even with large T. Policy assignment might depend on past shocks.
5. Individual-Level Panel Data

Let w_it be a binary indicator, which is unity if unit i participates in the program at time t. Consider

y_it = δ₀·d2_t + τ·w_it + c_i + u_it, t = 1, 2, (19)

where d2_t = 1 if t = 2 and zero otherwise, c_i is an unobserved effect, and τ is the treatment effect. Remove c_i by first differencing:

y_i2 − y_i1 = δ₀ + τ·(w_i2 − w_i1) + (u_i2 − u_i1) (20)
Δy_i = δ₀ + τ·Δw_i + Δu_i. (21)

If E(Δw_i·Δu_i) = 0, OLS applied to (21) is consistent.
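A quick numerical check of (20)-(21): first differencing removes c_i even when selection into treatment depends on it. The data-generating numbers (the effect of 1.2, the selection rule) are assumptions of the sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 1000
tau = 1.2                                        # assumed true treatment effect
c = rng.normal(size=N)                           # unobserved effect
w2 = (c + rng.normal(size=N) > 0).astype(float)  # treatment in period 2 depends on c
y1 = 1.0 + c + rng.normal(size=N)                # period 1: nobody treated (w_i1 = 0)
y2 = 1.5 + tau * w2 + c + rng.normal(size=N)     # period 2

# Equation (21): dy_i = delta0 + tau*dw_i + du_i; with w_i1 = 0, dw_i = w_i2,
# so the OLS estimate is the difference in mean changes, equation (22).
dy = y2 - y1
tau_fd = dy[w2 == 1].mean() - dy[w2 == 0].mean()
```

Even though w2 is strongly correlated with c, the estimate is consistent because c drops out of the differenced equation.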
If w_i1 = 0 for all i, the OLS estimate is

τ̂_FD = Δȳ_{treat} − Δȳ_{control}, (22)

which is a DD estimate except that we difference the means of the same units over time.
It is not more general to regress y_i2 on 1, w_i2, y_i1, i = 1, ..., N, even though this appears to free up the coefficient on y_i1. Why? Under (19) with w_i1 = 0 we can write

y_i2 = δ₀ + τ·w_i2 + y_i1 + u_i2 − u_i1. (23)

Now, if E(u_i2|w_i2, c_i, u_i1) = 0, then u_i2 is uncorrelated with y_i1, but y_i1 and u_i1 are correlated. So y_i1 is correlated with u_i2 − u_i1 = Δu_i.
In fact, if we add the standard no serial correlation assumption, E(u_i1·u_i2|w_i2, c_i) = 0, and write the linear projection w_i2 = π₀ + π₁·y_i1 + r_i2, then can show that

plim τ̂_LDV = τ + π₁·σ²_{u1}/σ²_{r2},

where

π₁ = Cov(c_i, w_i2)/(σ²_c + σ²_{u1}).

For example, if w_i2 indicates a job training program and less productive workers are more likely to participate (π₁ < 0), then the regression y_i2 (or Δy_i2) on 1, w_i2, y_i1 underestimates the effect.
If more productive workers participate, regressing y_i2 (or Δy_i2) on 1, w_i2, y_i1 overestimates the effect of job training.
Following Angrist and Pischke (2009), suppose we use the FD estimator when, in fact, unconfoundedness of treatment holds conditional on y_i1 (and the treatment effect is constant). Then we can write

y_i2 = α + τ·w_i2 + ρ·y_i1 + e_i2,
E(e_i2) = 0, Cov(w_i2, e_i2) = Cov(y_i1, e_i2) = 0.
Write the equation as

Δy_i2 = α + τ·w_i2 + (ρ − 1)·y_i1 + e_i2
      = α + τ·w_i2 + θ·y_i1 + e_i2.

Then, of course, the FD estimator generally suffers from omitted variable bias if ρ ≠ 1. We have

plim τ̂_FD = τ + θ·Cov(w_i2, y_i1)/Var(w_i2).

If θ < 0 (ρ < 1) and Cov(w_i2, y_i1) < 0 (workers observed with low first-period earnings are more likely to participate), then plim τ̂_FD > τ, and so FD overestimates the effect.
We might expect ρ to be close to unity for processes such as earnings, which tend to be persistent. (ρ measures persistence without conditioning on unobserved heterogeneity.)
As an algebraic fact, if θ̂ < 0 (as it usually will be, even if ρ = 1) and w_i2 and y_i1 are negatively correlated in the sample, then τ̂_FD > τ̂_LDV. But this does not tell us which estimator is consistent.
If either θ̂ is close to zero or w_i2 and y_i1 are weakly correlated, adding y_i1 can have a small effect on the estimate of τ.
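The FD-versus-LDV comparison can be simulated. In the sketch below the LDV model is the true one (unconfoundedness holds given y_i1, with ρ = 0.5), treatment is more likely after a low y_i1, and FD overestimates the effect exactly as the omitted-variable formula predicts. All numbers are assumptions of the illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
N = 20000
tau, rho = 1.0, 0.5               # assumed true effect and persistence (rho < 1)

y1 = rng.normal(size=N)
# Treatment more likely after a low y_i1, so Cov(w2, y1) < 0.
w2 = (-y1 + 0.5 * rng.normal(size=N) > 0).astype(float)
# Unconfoundedness holds conditional on y1: the LDV regression is correct here.
y2 = 0.2 + tau * w2 + rho * y1 + rng.normal(size=N)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

tau_ldv = ols(np.column_stack([np.ones(N), w2, y1]), y2)[1]   # consistent here
tau_fd = ols(np.column_stack([np.ones(N), w2]), y2 - y1)[1]   # biased upward:
# plim tau_fd = tau + (rho - 1) * Cov(w2, y1) / Var(w2) > tau in this design.
```

Reversing the thought experiment (making (19) the truth) would flip which estimator is biased, which is the bracketing point made in the text.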
With many time periods and arbitrary treatment patterns, we can use

y_it = η_t + τ·w_it + x_it·β + c_i + u_it, t = 1, ..., T, (24)

which accounts for aggregate time effects and allows for controls, x_it. Estimation by FE or FD to remove c_i is standard, provided the policy indicator, w_it, is strictly exogenous: correlation between w_it and u_ir for any t and r causes inconsistency in both estimators (with FE having advantages for larger T if u_it is weakly dependent).
What if designation is correlated with unit-specific trends? Correlated random trend model:

y_it = c_i + g_i·t + η_t + τ·w_it + x_it·β + u_it, (25)

where g_i is the trend for unit i. A general analysis allows arbitrary correlation between (c_i, g_i) and w_it, which requires at least T ≥ 3. If we first difference, we get, for t = 2, ..., T,

Δy_it = g_i + Δη_t + τ·Δw_it + Δx_it·β + Δu_it. (26)

Can difference again or estimate (26) by FE.
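Differencing (26) a second time removes both c_i and the unit-specific trend g_i, which matters when treatment timing is correlated with g_i. A minimal sketch (the effect of 0.7 and the timing rule are assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
N, T = 500, 5
beta = 0.7                                   # assumed true effect
c = rng.normal(size=N)                       # unit effects
g = rng.normal(scale=0.3, size=N)            # unit-specific trends
start = np.where(g > 0, 2, 4)                # steeper-trend units treated earlier
t_idx = np.arange(T)
w = (t_idx[None, :] >= start[:, None]).astype(float)       # treatment paths (N, T)
y = c[:, None] + g[:, None] * t_idx[None, :] + beta * w + rng.normal(0, 1, (N, T))

# First difference removes c_i (leaving g_i, as in (26));
# a second difference removes g_i as well.
ddy = np.diff(np.diff(y, axis=1), axis=1).ravel()
ddw = np.diff(np.diff(w, axis=1), axis=1).ravel()
X = np.column_stack([np.ones(ddy.size), ddw])
beta_dd = np.linalg.lstsq(X, ddy, rcond=None)[0][1]
```

A single difference would leave g_i in the error, and here g_i is correlated with treatment timing; the double difference eliminates that source of bias.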
Can derive panel data approaches using the counterfactual framework from the treatment effects literature.
For each (i, t), let y_it(1) and y_it(0) denote the counterfactual outcomes, and assume there are no covariates. Unconfoundedness, conditional on unobserved heterogeneity, can be stated as

E[y_it(0)|w_i, c_i] = E[y_it(0)|c_i] (27)
E[y_it(1)|w_i, c_i] = E[y_it(1)|c_i], (28)

where w_i = (w_i1, ..., w_iT) is the time sequence of all treatments.
Suppose the gain from treatment only depends on t:

E[y_it(1)|c_i] = E[y_it(0)|c_i] + τ_t. (29)
Then

E[y_it|w_i, c_i] = E[y_it(0)|c_i] + τ_t·w_it, (30)

where y_it = (1 − w_it)·y_it(0) + w_it·y_it(1). If we assume

E[y_it(0)|c_i] = η_{t0} + c_{i0}, (31)

then

E[y_it|w_i, c_i] = η_{t0} + c_{i0} + τ_t·w_it, (32)

an estimating equation that leads to FE or FD (often with τ_t = τ).
If we add strictly exogenous covariates and allow the gain from treatment to depend on x_it and an additive unobserved effect a_i, we get

E[y_it|w_i, x_i, c_i] = η_{t0} + τ_t·w_it + x_it·β₀ + w_it·x_it·λ_t + c_{i0} + a_i·w_it, (33)

a correlated random coefficient model because the coefficient on w_it is τ_t + a_i. Can eliminate a_i (and c_{i0}). Or, with τ_t = τ, can estimate the τ_i = τ + a_i and then use

τ̂ = N⁻¹ Σ_{i=1}^{N} τ̂_i. (34)
With T ≥ 3, can also get to a random trend model, where g_i·t is added to (25). Then, can difference followed by a second difference, or apply fixed effects estimation to the first differences. With τ_t = τ,

Δy_it = Δη_t + τ·Δw_it + Δx_it·β₀ + Δ(w_it·x_it)·λ_t + a_i·Δw_it + g_i + Δu_it. (35)

Might ignore a_i·Δw_it, using the results on the robustness of the FE estimator in the presence of certain kinds of random coefficients, or, again, estimate τ_i = τ + a_i for each i and form (34).
As in the simple T = 2 case, using unconfoundedness conditional on unobserved heterogeneity and strictly exogenous covariates leads to different strategies than assuming unconfoundedness conditional on past responses and outcomes of other covariates.
In the latter case, we might estimate propensity scores, for each t, as

P(w_it = 1|y_{i,t−1}, ..., y_{i1}, w_{i,t−1}, ..., w_{i1}, x_it).
6. Semiparametric and Nonparametric Approaches

Consider the setup of Heckman, Ichimura, Smith, and Todd (1997) and Abadie (2005), with two time periods. No units treated in the first time period. Without an i subscript, Y_t(w) is the counterfactual outcome for treatment level w, w = 0, 1, at time t. Parameter: the average treatment effect on the treated,

τ_att = E[Y₁(1) − Y₁(0) | W = 1]. (36)

W = 1 means treatment in the second time period.
Along with Y₀(1) = Y₀(0) (no counterfactual in time period zero), the key unconfoundedness assumption is

E[Y₁(0) − Y₀(0) | X, W] = E[Y₁(0) − Y₀(0) | X]. (37)

Also, the (partial) overlap assumption is critical for τ_att:

P(W = 1 | X) < 1, (38)

or the full overlap assumption for τ_ate = E[Y₁(1) − Y₁(0)]:

0 < P(W = 1 | X) < 1.
Under (37) and (38),

τ_att = E[ (W − p(X))·(Y₁ − Y₀) / (ρ·(1 − p(X))) ], (39)

where Y_t, t = 0, 1, are the observed outcomes (for the same unit), ρ = P(W = 1) is the unconditional probability of treatment, and p(X) = P(W = 1|X) is the propensity score.
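The weighting expression in (39) has a direct sample analog. The sketch below uses the true propensity score, which is known by construction in a simulation (in practice p(X) would be estimated); the effect of 2.0 and the logistic score are assumptions of the illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
N = 20000
X = rng.uniform(-1, 1, N)
p = 1.0 / (1.0 + np.exp(-X))                 # true propensity score p(X)
W = (rng.uniform(size=N) < p).astype(float)
tau_att = 2.0                                # assumed constant treatment effect
# dY = Y1 - Y0; the untreated trend depends on X only, so (37) holds.
dY = 0.5 + X + tau_att * W + rng.normal(size=N)

rho = W.mean()                               # estimate of P(W = 1)
att_hat = np.mean((W - p) * dY / (rho * (1.0 - p)))   # sample analog of (39)
```

Control observations enter with negative weight (W − p(X)) / (ρ(1 − p(X))), which is how the common X-specific trend is netted out.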
If we add

E[Y₁(1) − Y₀(1) | X, W] = E[Y₁(1) − Y₀(1) | X], (41)

a similar approach works for τ_ate:

τ̂_ate = N⁻¹ Σ_{i=1}^{N} (W_i − p̂(X_i))·ΔY_i / [p̂(X_i)·(1 − p̂(X_i))]. (42)
7. Synthetic Control Methods for Comparative Case Studies

Abadie, Diamond, and Hainmueller (2007) argue that in policy analysis at the aggregate level, there is little or no estimation uncertainty: the goal is to determine the effect of a policy on an entire population, and the aggregate is measured without error (or very little error). Application: the effect of California's tobacco control program on state-wide smoking rates.
ADH focus on the uncertainty in choosing a suitable control for California among other states (that did not implement comparable policies over the same period).
ADH suggest using many potential control groups (38 states or so) to create a single synthetic control group.
Two time periods: one before the policy and one after. Let y_it be the outcome for unit i in time t, with i = 1 the treated unit. Suppose there are J possible control units, and index these as 2, ..., J + 1. Let x_i be observed covariates for unit i that are not (or would not be) affected by the policy; x_i may contain period t = 2 covariates provided they are not affected by the policy.
Generally, we can estimate the effect of the policy as

y_12 − Σ_{j=2}^{J+1} ŵ_j·y_j2, (43)

where the ŵ_j are nonnegative weights that add up to one. How to choose the weights to best estimate the intervention effect?
ADH propose choosing the weights so as to minimize the distance between (y_11, x_1) and Σ_{j=2}^{J+1} ŵ_j·(y_j1, x_j), say. That is, they match on functions of the pre-treatment outcomes and the predictors of post-treatment outcomes.
ADH propose permutation methods for inference, which require estimating a placebo treatment effect for each region, using the same synthetic control method as for the region that underwent the intervention.