microeconometrics lecture notes

Post on 26-Dec-2015


Microeconometrics Lecture Notes by Daniel Millimet


ECO 7377: Microeconometrics

Daniel L. Millimet

Southern Methodist University

Fall 2011

DL Millimet (SMU) ECO 7377 Fall 2011 1 / 407

Introduction

Applied research in economics can be loosely classified into two types:

1. Descriptive analysis
2. Causal analysis

While the first is important and useful, the second is of primary interest.

Causal analysis is needed to predict the impact of changing circumstances or policies, or to evaluate existing policies or interventions.

Prior to conducting, or when reviewing, causal analyses, the questions that need to be answered are:

1. What is the causal relationship of interest? [Is it economically interesting?]
2. What is the identification strategy?
3. What is the method of statistical inference?

Several statistical issues are confronted when answering these questions in economic research.

Specification of the causal relationship of interest entails more than just defining x and y; many different parameters could be estimated.

- Heterogeneous vs. homogeneous effects
- Know what you are estimating
- To whom does it apply?
- What question does it answer?

Statistical inference is often difficult and overlooked.

- Spherical vs. non-spherical errors
- Derivation/computation of estimated asymptotic variances of estimators

Identification of the causal relationship of interest frequently encounters:

- Selection issues
  - Self-selection (endogeneity)
  - Sample selection (missing data, attrition)
- Measurement issues
  - Classical vs. non-classical error
  - Dependent vs. independent variable
  - Continuous vs. discrete variables
- Modeling issues
  - Functional form (P, SNP, NP)
  - Role of space (spillovers, spatial correlation)
  - Consistency with theory

Dissertation considerations (applied work):

What is the question? Is it economically interesting?

What is the identification strategy (if the question is causal)?

- Selection on observables vs. unobservables
- Parameter of interest

What is the data requirement? Is it feasible?

Has it been done? Is there value added?

- Tension between "hot" topics and ability to contribute

Dissertation Writing Advice

Be organized

- Outline the paper before writing
- Most papers have a common structure
  - Abstract: Very important. Be concise. No abbreviations or notation. Include the motivation and the punchline.
  - Intro: Outline the question. Explain why we care, and what is new in the paper. Give a slightly longer summary than the abstract of what is done in the paper, and emphasize the major findings.
  - Lit review (may be incorporated in the intro if short)
  - Theoretical model: Be only as complicated as necessary. Understand the ramifications of assumptions. If the innovation is in the empirics, theory is only needed if it adds something not well understood.
  - Empirical model: Be clear. Understand where identification comes from. Consider relevant specification tests. Acknowledge deficiencies and the circumstances under which estimates are inconsistent.
  - Data: Explain the sample selection criteria and variables used. If building on an existing literature, note any differences between the sample selection criteria and those used in existing papers.
  - Results: Be sure to spend enough time discussing the actual results. If results differ from the existing literature, try to pin down the reason(s) why.
  - Conclusion: Emphasize the importance of new findings, as well as shortcomings of the current paper. Discuss potential future work still to be done. End on a positive note.
- Put discussions in relevant sections
  - Avoid discussing the same point in multiple locations
  - Discuss data in the data section; discuss results in the results section; most econometric issues belong in the empirical model section

Be considerate to your readers

- Invest the time to proofread the paper many times; if you are unwilling to go through your paper carefully, why should others invest their time?
  - Pascal: "The letter I have written today is longer than usual because I lacked the time to make it shorter."
  - Quintilian: "One should aim not at being possible to understand, but at being impossible to misunderstand."
- Spell check, grammar check, check formatting issues, check spacing, check indenting, etc.
- Define notation, abbreviations, etc.
- Avoid redundant notation, excessive notation, awkward notation, etc.
- Avoid overly critical remarks about other papers; other authors are not idiots, and may be your referees
- Tables should be easy to read and self-explanatory (the need to refer back to the text should be kept to a minimum); include notes under the tables to explain things; avoid using abbreviations for variable names unless necessary
- References should be double-checked; be sure they are accurate and all are included in the bibliography

Be professional (this is not a term paper)

- Avoid unsubstantiated claims, sweeping or grand statements, and generalizations
- Be upfront; do not hide assumptions/restrictions hoping they will be overlooked, and justify their use
- Do not be unnecessarily complex in order to feel smart or show off (see Siegfried 1970)
  - Da Vinci: "Simplicity is the ultimate sophistication."
  - Einstein: "Any fool can make things bigger, more complex, and more violent. It takes a touch of genius - and a lot of courage - to move in the opposite direction."
  - Fowler: "Any one who wishes to become a good writer should endeavour, before he allows himself to be tempted by the more showy qualities, to be direct, simple, brief, vigorous, and lucid."
  - Mingus: "Making the simple complicated is commonplace; making the complicated simple, awesomely simple, that's creativity."
  - Jefferson: "The most valuable of all talents is that of never using two words when one will do."
- Avoid contractions
- Be consistent with the use of "I" or "we" if the paper uses first person, and consistent with present vs. past tense

Plagiarism

- Be careful, be ethical!
- Give credit where credit is due; cite others' ideas (in parentheses, not footnotes)
  - Milton: "Copy from one, it's plagiarism; copy from two, it's research."
  - Donatus: "Perish those who said our good things before we did."
  - Kuralt: "I could tell you which writer's rhythms I am imitating. It's not exactly plagiarism, it's falling in love with good language and trying to imitate it."
- Any statement in a paper should fit one of the following categories: (i) factual (agreeable to any reader), or (ii) debatable, in which case it should be backed by references, supported by the work done in the paper itself, or written in the appropriate conditional language: "If one believes X, then Y."
  - But any statement should be in your own words, or should be in quotations

What to include?

- Dissertation chapters can/should be longer than papers submitted for publication
- Chapters may include greater detail on:
  - Literature review
  - Data construction
  - Empirical methodology

Bootstrap: Introduction

General structure of estimation

  population $\Rightarrow \theta$
  $\downarrow$
  random sample $\Rightarrow \hat{\theta}$

Problem: $\hat{\theta}$ is an estimate; its distribution must be assessed for proper inference

Solutions

- Asymptotic theory
- Simulation methods $\Rightarrow$ bootstrap

Stata: -bootstrap-, -bsample-

Idea

- Re-sample (with replacement) from the random sample multiple times and assess the distribution of the estimates

  population $\Rightarrow \theta$
  $\downarrow$
  random sample $\Rightarrow \hat{\theta}$
  $\downarrow$
  bootstrap sample $\Rightarrow \hat{\theta}^*$

- Results in a vector of estimates, $\hat{\theta}^*_b$, $b = 1, \ldots, B$, where B is the number of bootstrap repetitions

Many different bootstrap methods

- Parametric vs. nonparametric
- Resampling algorithms
  - iid
  - Block/cluster
  - Sub-sampling (M/N)
- Imposing the null or not imposing

Bootstrap: Confidence Intervals

Consider a regression model

  $y_i = x_i \beta + \varepsilon_i$

Problem: given sample estimates, $\hat{\beta}$, need to obtain std errors or confidence intervals

There are two common sampling methods

1. Resampling the data
2. Resampling the errors

Data

- Resample (with replacement) observations $(y_i, x_i)$ $\Rightarrow \{y^*_i, x^*_i\}_{i=1}^N$
- Estimate the original model (OLS) on the re-sampled data set $\Rightarrow \hat{\beta}^*$
- Repeat B times $\Rightarrow \hat{\beta}^*_b$, $b = 1, \ldots, B$

Residuals

- Given $\hat{\beta}$ from OLS on the original sample, obtain residuals $\Rightarrow \hat{\varepsilon}_i$, $i = 1, \ldots, N$
- Resample (with replacement) a vector of N residuals $\Rightarrow \hat{\varepsilon}^*_i$, $i = 1, \ldots, N$
  - This represents a random draw from the (nonparametric) empirical distribution of the residuals
- Alternative (parametric):
  - Estimate $\hat{\sigma}^2 = \frac{1}{N-K} \sum_i \hat{\varepsilon}_i^2$
  - Draw N random numbers, $\hat{\varepsilon}^*_i$, $i = 1, \ldots, N$, from $N(0, \hat{\sigma}^2)$
- Generate $y^*_i = x_i \hat{\beta} + \hat{\varepsilon}^*_i$ (which imposes $\beta = \hat{\beta}$)
- Regress $y^*$ on x by OLS $\Rightarrow \hat{\beta}^*$
- Repeat B times $\Rightarrow \hat{\beta}^*_b$, $b = 1, \ldots, B$

Resampling data is typically preferred since it is less model dependent

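What to do with $\hat{\beta}^*_b$, $b = 1, \ldots, B$? Several options...

Obtain std error for the original sample estimate, $\hat{\beta}$, given by

  $se(\hat{\beta}) = \sqrt{\frac{1}{B-1} \sum_b \left( \hat{\beta}^*_b - \bar{\beta}^* \right)^2}$

where $\bar{\beta}^*$ is the mean of the B bootstrap estimates

Obtain symmetric CI using normal approximation

  $\beta \in \left\{ \hat{\beta} \pm t_{1-\alpha/2,\,B-1} \, se(\hat{\beta}) \right\}$

Obtain asymmetric CI using percentile method

  $\beta \in \left\{ \hat{\beta}^*_{\alpha/2}, \; \hat{\beta}^*_{1-\alpha/2} \right\}$

where the subscript refers to the quantile of the empirical distribution of $\hat{\beta}^*$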
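The data-resampling steps and the first two interval types can be sketched in a few lines. This is a minimal illustration, not the course's Stata workflow: the simulated data, the seed, the helper names (`ols_slope`, `pairs_bootstrap`, `quantile`), and the choice of B = 999 are all assumptions made for the example, and the symmetric interval uses a standard normal critical value in place of the $t_{1-\alpha/2,\,B-1}$ quantile.

```python
import random
import statistics

def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x (with intercept)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx

def pairs_bootstrap(x, y, B, rng):
    """Resample (y_i, x_i) pairs with replacement; return B slope estimates."""
    n = len(x)
    draws = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]
        draws.append(ols_slope([x[i] for i in idx], [y[i] for i in idx]))
    return draws

def quantile(sorted_vals, q):
    """Simple empirical quantile of an already sorted list."""
    k = min(len(sorted_vals) - 1, max(0, int(q * len(sorted_vals))))
    return sorted_vals[k]

# Hypothetical data: true slope of 2, N = 200
rng = random.Random(7377)
x = [rng.gauss(0, 1) for _ in range(200)]
y = [1.0 + 2.0 * xi + rng.gauss(0, 1) for xi in x]

beta_hat = ols_slope(x, y)
boot = sorted(pairs_bootstrap(x, y, B=999, rng=rng))

se_boot = statistics.stdev(boot)                  # uses the 1/(B-1) divisor
z = statistics.NormalDist().inv_cdf(0.975)        # normal critical value (assumption)
normal_ci = (beta_hat - z * se_boot, beta_hat + z * se_boot)
percentile_ci = (quantile(boot, 0.025), quantile(boot, 0.975))
```

The percentile interval is read directly off the sorted bootstrap draws, so it need not be symmetric around `beta_hat`.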

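Obtain asymmetric bias-corrected and accelerated CIs (BCa)

- Calculate

  $z_0 = \Phi^{-1}\left( \frac{1}{B} \sum_b I\left( \hat{\beta}^*_b \le \hat{\beta} \right) \right)$ (median bias)

  $a = \frac{\sum_i \left( \bar{\beta}_J - \hat{\beta}_{J(i)} \right)^3}{6 \left[ \sum_i \left( \bar{\beta}_J - \hat{\beta}_{J(i)} \right)^2 \right]^{3/2}}$ (acceleration parameter)

  where $\hat{\beta}_{J(i)}$ is the jackknife estimate (omitting obs i from the original sample) and $\bar{\beta}_J$ is the mean of the jackknife estimates

- Calculate lower and upper quantiles

  $p_1 = \Phi\left[ z_0 + \frac{z_0 - z_{1-\alpha/2}}{1 - a\left( z_0 - z_{1-\alpha/2} \right)} \right]; \quad p_2 = \Phi\left[ z_0 + \frac{z_0 + z_{1-\alpha/2}}{1 - a\left( z_0 + z_{1-\alpha/2} \right)} \right]$

  where $z_{1-\alpha/2}$ is the $(1-\alpha/2)$th quantile of the std normal distribution

- CI given by $\beta \in \left\{ \hat{\beta}^*_{p_1}, \; \hat{\beta}^*_{p_2} \right\}$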

Notes:

- BC CI obtained by setting $a = 0$
- BCa requires $B > 1000$
- $z_0 = 0$ when $\hat{\beta}$ = median of $\hat{\beta}^*$
- $a$ reflects the rate of change of the standard error of $\hat{\beta}$ with respect to the true value, $\beta$
  - The standard normal approximation assumes that the standard error is invariant with respect to the true value
  - The acceleration parameter corrects for deviations in practice
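As a rough sketch of how the BCa pieces fit together, the function below computes $z_0$, $a$, $p_1$, and $p_2$ for a generic statistic. The function name `bca_interval`, the skewed (exponential) test data, the seed, and the small guard that keeps the bootstrap share away from 0 and 1 are assumptions added for illustration; they are not part of the notes.

```python
import random
import statistics

def bca_interval(data, stat, B, alpha, rng):
    """Return (lower, upper, z0, a) for a BCa interval of stat(data)."""
    n = len(data)
    nd = statistics.NormalDist()
    theta_hat = stat(data)

    # Bootstrap distribution of the statistic (iid resampling)
    boot = sorted(stat([data[rng.randrange(n)] for _ in range(n)]) for _ in range(B))

    # Median-bias correction z0
    share = sum(b <= theta_hat for b in boot) / B
    share = min(max(share, 1 / (B + 1)), B / (B + 1))  # guard (assumption): avoid inv_cdf(0 or 1)
    z0 = nd.inv_cdf(share)

    # Acceleration parameter a from leave-one-out (jackknife) estimates
    jack = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    jbar = statistics.fmean(jack)
    a = sum((jbar - j) ** 3 for j in jack) / (6 * sum((jbar - j) ** 2 for j in jack) ** 1.5)

    # Adjusted lower and upper quantile levels
    z = nd.inv_cdf(1 - alpha / 2)
    p1 = nd.cdf(z0 + (z0 - z) / (1 - a * (z0 - z)))
    p2 = nd.cdf(z0 + (z0 + z) / (1 - a * (z0 + z)))

    lower = boot[min(B - 1, int(p1 * B))]
    upper = boot[min(B - 1, int(p2 * B))]
    return lower, upper, z0, a

# Hypothetical skewed sample; the statistic is the sample mean
rng = random.Random(11)
data = [rng.expovariate(1.0) for _ in range(100)]
lo, hi, z0, a = bca_interval(data, statistics.fmean, B=2000, alpha=0.05, rng=rng)
```

Setting `a = 0` inside the function would give the bias-corrected (BC) interval mentioned in the notes.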

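Obtain asymmetric CI using bootstrap-t

- When estimating the model on the re-sampled data, collect the t-statistics obtained from testing $H_0: \beta = \hat{\beta}$

  $t^* = \frac{\hat{\beta}^* - \hat{\beta}}{se(\hat{\beta}^*)}$

- Yields $t^*_b$, $b = 1, \ldots, B$
- Define $t^*_\alpha$ such that

  $\frac{1}{B} \sum_b I\left( t^*_b \le t^*_\alpha \right) = \alpha$

  $\Rightarrow t^*_\alpha$ is the $\alpha$th quantile of the empirical distribution of $t^*$

- CI given by

  $\beta \in \left\{ \hat{\beta} - t^*_{1-\alpha/2} \, se(\hat{\beta}), \; \hat{\beta} - t^*_{\alpha/2} \, se(\hat{\beta}) \right\}$

- Notes
  - Method assumes $se(\hat{\beta}^*)$ is known based on asymptotic theory
  - If unknown, then use double bootstrap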

Obtain asymmetric CI using bootstrap-t with double bootstrap

- Estimate the original model by OLS $\Rightarrow \hat{\beta}$
- Obtain bootstrap samples, estimate by OLS, and form $t^*$ given by

  $t^* = \frac{\hat{\beta}^* - \hat{\beta}}{se(\hat{\beta}^*)}$

- Since the denominator is not known, resample from the bootstrap sample $B_2$ times $\Rightarrow \hat{\beta}^{**}_b$, $b = 1, \ldots, B_2$
- Obtain the estimated std error of $\hat{\beta}^*$ as the std deviation of the $B_2$ estimates
- Repeat the process $B_1$ times
- Obtain CI as above, but with $se(\hat{\beta}^*)$ replaced by the std deviation of the $B_2$ estimates of $\hat{\beta}^{**}$

Example: $x \sim N(0, 1)$, $N = 1000$, $\bar{x} \overset{a}{\sim} N(0, 0.001)$

[Figure: four panels comparing the bootstrap and asymptotic distributions of the sample mean, for Reps = 20, 100, 500, and 1000; horizontal axis from -.2 to .2.]

Bootstrap: Imposing the Null

Goal: estimate the model, derive some estimate or test statistic, and test whether the true value of the parameter is equal to some value, or derive a p-value associated with the test statistic

Strategy

- When re-sampling the data, generate new data sets where the null is true (imposed)
- Estimate the original model on the re-sampled data
- Compare the value of the test statistics obtained from the re-sampled data sets with the value of the test statistic from the original sample
- If the test statistic from the original sample is very different (statistically), then it is unlikely the null is true in the original sample

Regression example

Model

  $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, $\varepsilon_i \sim N(0, \sigma^2)$

Hypothesis of interest:

  $H_0: \beta_1 = 0$
  $H_1: \beta_1 \neq 0$

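Algorithm

- Estimate the model on the original data $\Rightarrow \hat{\beta}_0, \hat{\beta}_1 \Rightarrow t_{\beta_1}$ (t-statistic for $\beta_1$)
- Obtain the residuals $\Rightarrow \hat{\varepsilon}_i$, $i = 1, \ldots, N$
- Resample (with replacement) a vector of N residuals $\Rightarrow \hat{\varepsilon}^*_i$, $i = 1, \ldots, N$
  - This represents a random draw from the (nonparametric) empirical distribution of the residuals
- Alternative (parametric):
  - Estimate $\hat{\sigma}^2 = \frac{1}{N-K} \sum_i \hat{\varepsilon}_i^2$
  - Draw N random numbers, $\hat{\varepsilon}^*_i$, $i = 1, \ldots, N$, from $N(0, \hat{\sigma}^2)$
- Generate $y^*_i = \hat{\beta}_0 + 0 \cdot x_i + \hat{\varepsilon}^*_i = \hat{\beta}_0 + \hat{\varepsilon}^*_i$ (which imposes $\beta_1 = 0$)
- Regress $y^*$ on x by OLS $\Rightarrow t^*_{\beta_1}$
- Repeat B times $\Rightarrow t^*_{\beta_1,b}$, $b = 1, \ldots, B$
- Obtain p-value as

  p-value $= \frac{1}{B} \sum_b I\left( |t^*_{\beta_1,b}| > |t_{\beta_1}| \right)$

- Reject the null if p-value $< \alpha$, where $\alpha$ ($< 0.5$) is the significance level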
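The algorithm above (nonparametric residual version) can be sketched as follows. The helper names `ols_fit` and `null_bootstrap_pvalue`, the simulated data, the seed, and B = 499 are assumptions chosen for illustration only.

```python
import math
import random
import statistics

def ols_fit(x, y):
    """Return (b0, b1, t-statistic for b1, residuals) from OLS of y on x."""
    n = len(x)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    b0 = my - b1 * mx
    resid = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in resid) / (n - 2)      # N - K with K = 2
    return b0, b1, b1 / math.sqrt(s2 / sxx), resid

def null_bootstrap_pvalue(x, y, B, rng):
    """Residual bootstrap p-value for H0: beta1 = 0, with the null imposed."""
    b0, _, t_obs, resid = ols_fit(x, y)
    n = len(x)
    exceed = 0
    for _ in range(B):
        e_star = [resid[rng.randrange(n)] for _ in range(n)]
        y_star = [b0 + e for e in e_star]         # y* = b0 + 0*x + e*, so beta1 = 0 holds
        t_star = ols_fit(x, y_star)[2]
        exceed += abs(t_star) > abs(t_obs)
    return exceed / B

# Hypothetical data: one sample with a true slope of 1, one with a true slope of 0
rng = random.Random(24)
x = [rng.gauss(0, 1) for _ in range(100)]
y_alt = [1.0 + 1.0 * xi + rng.gauss(0, 1) for xi in x]
y_null = [1.0 + rng.gauss(0, 1) for _ in x]

p_alt = null_bootstrap_pvalue(x, y_alt, B=499, rng=rng)
p_null = null_bootstrap_pvalue(x, y_null, B=499, rng=rng)
```

With the slope truly nonzero, the bootstrap t-statistics generated under the null should rarely exceed the observed one, so `p_alt` is expected to be small.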

Distributional example

Want to test equality of the CDFs of two random variables (e.g., wages of job training participants and non-participants)

Data sample

- $x_i$, $i = 1, \ldots, N$, is a random sample of one variable (participants), with CDF $F(x)$
- $y_i$, $i = 1, \ldots, M$, is a random sample of another variable (non-participants), with CDF $G(y)$

Hypothesis of interest:

  $H_0: F = G$
  $H_1: F \neq G$

Algorithm

- Estimate the empirical CDF in each sample: $\hat{F}(x)$ and $\hat{G}(y)$
- Compute test statistic

  $d = \sqrt{\frac{NM}{N+M}} \max_{z \in Supp(X,Y)} \left\{ \hat{F}(z) - \hat{G}(z) \right\}$

- Pool data, re-sample (with replacement), sample size $= N + M$ $\Rightarrow q_1, \ldots, q_{N+M}$
- Split the sample: treat the first N obs as from F and the final M obs as from G (imposes $F = G$)
- Compute $d^*$
- Repeat B times $\Rightarrow d^*_b$, $b = 1, \ldots, B$
- Obtain p-value as

  p-value $= \frac{1}{B} \sum_b I\left( d^*_b > d \right)$

- Reject the null if p-value $< \alpha$, where $\alpha$ ($< 0.5$) is the significance level
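A compact sketch of the pooled re-sampling test follows. It uses the signed difference $\hat{F}(z) - \hat{G}(z)$ exactly as written in the statistic above; replacing it with an absolute difference would give the usual two-sided form. The function names, the simulated samples, the seed, and B = 300 are illustrative assumptions.

```python
import bisect
import math
import random

def d_stat(x, y):
    """sqrt(NM/(N+M)) * max over pooled points of F_hat(z) - G_hat(z)."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    gap = max(
        bisect.bisect_right(xs, z) / n - bisect.bisect_right(ys, z) / m
        for z in xs + ys
    )
    return math.sqrt(n * m / (n + m)) * gap

def pooled_bootstrap_pvalue(x, y, B, rng):
    """Bootstrap p-value for H0: F = G, imposing the null by pooling."""
    n = len(x)
    pooled = list(x) + list(y)
    size = len(pooled)
    d_obs = d_stat(x, y)
    exceed = 0
    for _ in range(B):
        q = [pooled[rng.randrange(size)] for _ in range(size)]
        exceed += d_stat(q[:n], q[n:]) > d_obs    # first N "from F", last M "from G"
    return d_obs, exceed / B

# Hypothetical samples: y is shifted to the right of x, so F lies above G
rng = random.Random(26)
x = [rng.gauss(0, 1) for _ in range(150)]
y = [rng.gauss(1, 1) for _ in range(150)]
d_obs, p_value = pooled_bootstrap_pvalue(x, y, B=300, rng=rng)
```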

Bootstrap: Other Issues

Non-iid data

All previous discussion assumes iid data, since re-sampling occurs without regard to any dependence across observations

If there exists some sort of dependence in the data, then resample blocks or clusters of data

Example #1: Time series data with serial correlation

- Model

  $y_t = x_t \beta + \varepsilon_t$, $t = 1, \ldots, T$

- Resample blocks of length $l$ by drawing obs randomly from $t = 1, \ldots, T - l$
- If obs $t'$ is chosen for the bootstrap sample, also include obs $t = t' + 1, \ldots, t' + (l - 1)$
- Draw $T/l$ starting obs (one per block) so the final bootstrap sample size remains T
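The block re-sampling step can be sketched as an index generator. The function name is hypothetical, the code uses 0-based positions, it assumes $l$ divides $T$, and it allows every start position for which a full block fits inside the sample.

```python
import random

def block_bootstrap_indices(T, l, rng):
    """Draw T // l blocks of l consecutive time indices (0-based), with replacement."""
    n_blocks = T // l                              # assumes l divides T
    idx = []
    for _ in range(n_blocks):
        start = rng.randrange(T - l + 1)           # any start where a full block fits
        idx.extend(range(start, start + l))
    return idx

rng = random.Random(27)
idx = block_bootstrap_indices(T=20, l=5, rng=rng)
# The bootstrap series is then [(y[t], x[t]) for t in idx] for the original data y, x
```

Within each block the original time ordering is preserved, which is what retains the short-run serial dependence in the bootstrap sample.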

Example #2: Panel data

- For example, individuals within hhs, or employees within firms, or individuals over time
- Model

  $y_{if} = x_{if} \beta + \varepsilon_{if}$, $i = 1, \ldots, N$

  where i represents individuals and f represents firms
- Several individuals are sampled from each of $F < N$ firms
- Generate bootstrap samples by resampling (with replacement) the F firms
- If firm f is chosen for the bootstrap sample, include all employees i from that firm
- If an identical number of employees from each firm are in the sample, then bootstrap samples are still of size N

Blocks/clusters are chosen such that data are iid across blocks
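A minimal sketch of cluster (firm-level) re-sampling is given below. The data layout, a list of `(firm_id, outcome)` pairs, and the function name are assumptions for illustration. Each drawn firm is given a fresh cluster label so that a firm drawn twice is treated as two separate clusters.

```python
import random

def cluster_bootstrap(rows, rng):
    """Resample firms with replacement, keeping all employees of each drawn firm.

    rows: list of (firm_id, outcome) pairs.
    Returns a list of (new_cluster_id, firm_id, outcome) triples.
    """
    by_firm = {}
    for firm, outcome in rows:
        by_firm.setdefault(firm, []).append(outcome)
    firms = list(by_firm)
    sample = []
    for new_id in range(len(firms)):
        drawn = firms[rng.randrange(len(firms))]
        sample.extend((new_id, drawn, outcome) for outcome in by_firm[drawn])
    return sample

# Hypothetical balanced panel: 4 firms with 3 employees each (N = 12)
rows = [(f, 10 * f + i) for f in range(4) for i in range(3)]
rng = random.Random(28)
sample = cluster_bootstrap(rows, rng)
```

Because the panel in this example is balanced, every bootstrap sample has the original size N; with unequal firm sizes the sample size would vary across draws.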

Sub-sampling (Politis and Romano 1992, 1994)

- M of N re-sampling with or without replacement
- Evaluate a statistic of interest at subsamples of the data
- Use these subsampled values to build up an estimated sampling distribution
- The consistency properties of this sampling distribution hold for dependent data under very weak assumptions, and even in situations where the bootstrap collapses

Jackknife estimation

Leave-one-out estimation

Algorithm

- Estimate the model using the original sample $\Rightarrow \hat{\beta}$ (if OLS model, say)
- Omit obs i and re-estimate the model on the sample of $N - 1$ obs $\Rightarrow \hat{\beta}_{(i)}$
- Repeat, omitting each i once (implies N estimations)
- Standard error obtained as

  $se(\hat{\beta}) = \sqrt{\frac{N-1}{N} \sum_i \left( \hat{\beta}_{(i)} - \bar{\beta}_{(\cdot)} \right)^2}$

  where $\bar{\beta}_{(\cdot)}$ is the mean of the N leave-one-out estimates

In some situations, the delete-d jackknife achieves superior performance
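The leave-one-out standard error is short enough to write out directly. The function name and the simulated data are assumptions for illustration. For the sample mean, the jackknife standard error reduces algebraically to the familiar $s/\sqrt{N}$, which gives a convenient check.

```python
import math
import random
import statistics

def jackknife_se(data, stat):
    """Leave-one-out jackknife standard error of stat(data)."""
    n = len(data)
    leave_one_out = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    centre = statistics.fmean(leave_one_out)       # mean of the N leave-one-out estimates
    return math.sqrt((n - 1) / n * sum((b - centre) ** 2 for b in leave_one_out))

rng = random.Random(30)
data = [rng.gauss(0, 1) for _ in range(50)]

se_jack = jackknife_se(data, statistics.fmean)
se_usual = statistics.stdev(data) / math.sqrt(len(data))   # s / sqrt(N)
```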

Failure of the bootstrap or jackknife ...

Resampling methods are not guaranteed to work; theoretical justification is needed

The most common case of failure occurs when the parameter of interest is a non-smooth function of the data (e.g., median vs. mean)

Example: $x \sim N(0, 1)$, $N = 1000$, $x_{med} \overset{a}{\sim} N(0, 0.00157)$

[Figure: four panels comparing the bootstrap and asymptotic distributions of the sample median, for Reps = 20, 100, 500, and 1000; horizontal axis from roughly -.1 to .1.]

How to choose B?

Andrews & Buchinsky

Davidson & MacKinnon


Causation: Introduction

The general goal of most (applied) econometrics exercises is to distinguish between causation and correlation

Many empirical questions of concern to economists and/or policymakers pertain to the causal effect of a program or policy

The statistical and econometric literature analyzing causation has seen tremendous growth over the past several decades

The central problem concerns evaluation of the causal effect of exposure to a treatment or program by a set of units on some outcome

- In economics, these units are economic agents such as individuals, hhs, firms, geographical areas, etc.
- The effect of an exposure is only well-defined if the comparison is also defined; typically the comparison is defined as "not exposed", but sometimes it is not obvious (particularly with non-binary treatments)

Philosophy of causality...

- Rich literature in analytic philosophy on causality
- Two main approaches to defining causality:
  - Regularity approaches: Hume: "We may define a cause to be an object followed by another, and where all the objects, similar to the first, are followed by objects similar to the second." (from An Enquiry Concerning Human Understanding, section VII)
  - Counterfactual approaches: Hume: "Or, in other words, where, if the first object had not been, the second never had existed." (from An Enquiry Concerning Human Understanding, section VII)

Regularity approach: a minimal constant conjunction between the two objects (Suppes: a probabilistic association between the two objects, which cannot be explained away by other factors)

- Basic idea behind Granger causality
- Difficulty: what are the other factors? Limiting to only observable factors is unsatisfying... if some factors are unobservable, then what?
- Example...
  - C is a potential cause of E if $\Pr(E|C) > \Pr(E|\text{not } C)$
  - May be spurious if there exists some factor B s.t. $\Pr(E|C) > \Pr(E|\text{not } C)$ and $\Pr(E|C,B) = \Pr(E|\text{not } C,B)$ (e.g., E = wages, C = educ, B = ability)
  - May also be a spurious zero correlation if there exists some factor B s.t. $\Pr(E|C) = \Pr(E|\text{not } C)$ and $\Pr(E|C,B) > \Pr(E|\text{not } C,B)$ (e.g., E = wages, C = training, B = shock)
  - B is known as a confounder or confounding variable

Be wary: correlation does not imply causation, as things are not always as they seem ...

and the truth may be difficult to see ...

Counterfactual approach: Lewis (1973) proposes to imagine a range of possible worlds

- Holland (1986, 2003): a treatment (cause) is a potential manipulation that one can imagine
  - "NO CAUSATION WITHOUT MANIPULATION"
  - Gender, race are not treatments?!? (see Greiner and Rubin 2011)
- Imbens and Wooldridge (2009):
  - "A CRITICAL FEATURE IS THAT, IN PRINCIPLE, EACH UNIT CAN BE EXPOSED TO MULTIPLE LEVELS OF THE TREATMENT."
- Angrist and Pischke (2009): a treatment should be manipulable conditional on other factors $\Rightarrow \Pr(C|B), \Pr(\text{not } C|B) \in (0, 1)$
  - "NO FUNDAMENTALLY UNIDENTIFIED QUESTIONS"
  - Example: school start age = biological age - time in school; if B = {bio age, time in school}, then school start age is not an identifiable treatment

Microeconometrics today emphasizes the counterfactual view

- Greiner & Rubin (2011): "For analysts from a variety of fields, the intensely practical goal of causal inference is to discover what would happen if we changed the world in some way."

Econometric methods are categorized by the type of selection involved

Selection types

- Selection on observables: all potential Bs are observed
- Selection on unobservables: some potential Bs are unobserved

Causation: Potential Outcomes Model

Most causal research is couched in the potential outcomes framework

Typically referred to as the Rubin Causal Model (RCM); attributed to Neyman (1923, 1935), Fisher (1935), Roy (1951), Quandt (1972, 1988), Rubin (1974)

Notation

  $y_{1i}$ = outcome of observation i with treatment
  $y_{0i}$ = outcome of observation i without treatment
  $D_i$ = treatment indicator, with $D_i = 1$ if treated and $D_i = 0$ if untreated

$\{y_{1i}, y_{0i}, D_i\}$ is a draw from the population of interest

$\{y_1, y_0, D\}$ is a sample from the population of interest

Notes

- Key insight is to model not just the observed outcome for each unit i, but also the unobserved potential outcomes
- Implicit in this representation is the Stable Unit Treatment Value Assumption (SUTVA, Rubin 1978), which assumes that the outcome of obs i with and without the treatment does not vary depending on the treatment assignment of all other agents (rules out general equilibrium or indirect effects)
  - Allows one to write potential outcomes solely as a function of own treatment assignment

    $y_{0i} \equiv y_i(D_1, D_2, \ldots, D_{i-1}, 0, D_{i+1}, \ldots, D_N) = y_i(0)$
    $y_{1i} \equiv y_i(D_1, D_2, \ldots, D_{i-1}, 1, D_{i+1}, \ldots, D_N) = y_i(1)$

  - Imbens & Wooldridge (2009) provide some references to papers looking at GE effects; see also Ferracci et al. (2009), Heckman et al. (1999), Lewis (1963)
- Also implicit, and sometimes lumped into SUTVA, is the assumption that the mechanism for assigning treatments does not affect potential outcomes (rules out Hawthorne effects, whereby agents may act differently if they know they are being observed)

Parameters of interest

$\Delta_i = y_{1i} - y_{0i}$ = treatment effect for obs i

- This is a random variable as it is obs-specific
- Can summarize the distribution of this variable by focusing on different aspects

  $\Delta^{ATE} = E[\Delta_i] = E[y_1 - y_0]$
  $\Delta^{ATT} = E[\Delta_i | D = 1] = E[y_1 - y_0 | D = 1]$
  $\Delta^{ATU} = E[\Delta_i | D = 0] = E[y_1 - y_0 | D = 0]$

Notes: Different parameters answer different questions, may be useful for different policy conclusions, and may require different assumptions to identify

Three other parameters that often appear

1. Local Average Treatment Effect (Imbens & Angrist 1994, Angrist et al. 1996)
   - Defined as $\Delta^{LATE} = E[y_1 - y_0 | i \in \Omega]$, where $\Omega$ refers to some specified subpopulation
2. Marginal Treatment Effect (Heckman & Vytlacil 1999, 2001, 2005, 2007)
   - Defined later
3. Policy Relevant Treatment Effect (Heckman & Vytlacil 2001)
   - Defined as $\Delta^{PRTE} = E[y^P - y^{NP}]$, where P (NP) refers to the state where the program is fully (not) implemented
   - With the program, all agents have access to the program, but may choose not to participate
   - Implies

     $\Delta^{PRTE} = E[y_1^P | D^P = 1] \Pr(D^P = 1) + E[y_0^P | D^P = 0] \Pr(D^P = 0) - E[y^{NP}]$
     $\quad = E[y_1 - y_0 | D^P = 1] \Pr(D^P = 1)$

     where $y_0^P$, $y_1^P$, and $y^{NP}$ are the three potential outcomes, $D^P$ is the treatment indicator in the world with the program, and the second line follows if one assumes policy invariance (i.e., potential outcomes are unaffected by the existence of the program)

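Relationship among the parameters

- Let

  $y_{1i} = E[y_1] + \upsilon_{i1}$
  $y_{0i} = E[y_0] + \upsilon_{i0}$

- This implies

  $\Delta_i = y_{1i} - y_{0i}$
  $\quad = E[y_1 - y_0] + \upsilon_{i1} - \upsilon_{i0}$
  $\quad = \Delta^{ATE} + \upsilon_{i1} - \upsilon_{i0}$

  and

  $\Delta^{ATT} = \Delta^{ATE} + E[\upsilon_{i1} - \upsilon_{i0} | D = 1]$
  $\Delta^{ATU} = \Delta^{ATE} + E[\upsilon_{i1} - \upsilon_{i0} | D = 0]$

  where $E[\upsilon_{i1} - \upsilon_{i0} | D = j]$ is the average, obs-specific gain from treatment for group j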

Can re-define any of the above parameters for a sub-population defined on the basis of attributes, x

  $\Delta^{ATE}(x) = E[y_1 - y_0 | x]$
  $\Delta^{ATT}(x) = E[y_1 | x, D = 1] - E[y_0 | x, D = 1]$
  $\Delta^{ATU}(x) = E[y_1 | x, D = 0] - E[y_0 | x, D = 0]$

The previous unconditional parameters are obtained by integrating over the distribution of x in the relevant population

  $\Delta^{ATE} = \int \Delta^{ATE}(x) f(x) \, dx$
  $\Delta^{ATT} = \int \Delta^{ATT}(x) f(x | D = 1) \, dx$
  $\Delta^{ATU} = \int \Delta^{ATU}(x) f(x | D = 0) \, dx$

Aside

While the preceding parameters, based on differences in expectations, are the near universal focus in economics, this need not be the case

Can also define treatment effects based on ratios

  $\Delta^{RATE} = E[y_1] / E[y_0]$
  $\Delta^{RATT} = E[y_1 | D = 1] / E[y_0 | D = 1]$
  $\Delta^{RATU} = E[y_1 | D = 0] / E[y_0 | D = 0]$

These are referred to as relative treatment effects (and the prior parameters are referred to as absolute or differenced treatment effects)

Note, however, that relative effects lack a bit of intuitive appeal, since if we define $\Delta_i = y_{1i} / y_{0i}$, then $E[\Delta_i] = E[y_{1i} / y_{0i}] \neq E[y_1] / E[y_0]$, and the same holds for RATT and RATU

Evaluation Problem

Only observe one potential outcome at a point in time for any observation

Implies...

  Attributes of i: $\{y_{1i}, y_{0i}, D_i\}$
  Observed for i:  $\{y_i, D_i\}$

where $y_i = D_i y_{1i} + (1 - D_i) y_{0i}$ = observed outcome for observation i

The missing potential outcome is the missing counterfactual

- Holland (1986) refers to this as the fundamental problem of causal inference
- Because of this, the central issue in the RCM is the relationship between treatment assignment and potential outcomes
  - Typically referred to as the treatment assignment rule
  - Growing literature on assignment rules (Manski 2000, 2004; Pepper 2002, 2003; Dehejia 2005; Lechner & Smith 2007)

Example #1... ATT

- Consider estimating $\Delta^{ATT} = E[y_1 | D = 1] - E[y_0 | D = 1]$
- $E[y_1 | D = 1]$ can be estimated from the data, but one does not observe $E[y_0 | D = 1]$
- If one uses outcomes of the untreated, we can define $\tilde{\Delta}^{ATT} = E[y_1 | D = 1] - E[y_0 | D = 0]$
- Which implies selection bias equal to

  $\Delta^{ATT} = E[y_1 | D = 1] - E[y_0 | D = 0] + E[y_0 | D = 0] - E[y_0 | D = 1]$
  $\Rightarrow$ bias $= \tilde{\Delta}^{ATT} - \Delta^{ATT} = E[y_0 | D = 1] - E[y_0 | D = 0]$

- This is generally non-zero, and may be decomposed into 3 components (Heckman et al. 1996, 1998):
  1. Self-selection into treatment in a manner related to the outcome in the untreated state
  2. Observables, x, impacting the outcome may not overlap at certain values across the treatment and control groups
  3. Even with overlap, the distribution of x may vary across the treatment and control groups

Example #2... ATE

- Consider estimating $\Delta^{ATE} = E[y_1] - E[y_0]$
- Neither unconditional expectation can be estimated from the data
- If one uses conditional expectations, we can define

  $\tilde{\Delta}^{ATE} = E[y_1 | D = 1] - E[y_0 | D = 0]$

- Which implies selection bias equal to

  $\tilde{\Delta}^{ATE} - \Delta^{ATE} = E[y_1 | D = 1] - E[y_0 | D = 0] - (E[y_1] - E[y_0])$
  $\Rightarrow$ bias $= (E[y_1 | D = 1] - E[y_1 | D = 0])[1 - \Pr(D = 1)] + (E[y_0 | D = 1] - E[y_0 | D = 0]) \Pr(D = 1)$

  which is a weighted average of the selection bias for the ATT and ATU

Question: How does one circumvent the missing counterfactual problem to estimate $\Delta^{ATE}$, $\Delta^{ATT}$, $\Delta^{ATU}$, or any other summary statistic of the distribution of $\Delta$?
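Because simulated data contain both potential outcomes, the bias expressions in the two examples can be checked numerically. The sketch below is an illustration under assumed values: the data-generating process (an unobserved "ability" term that raises both the untreated outcome and the probability of treatment), the sample size, and the seed are all hypothetical.

```python
import random
import statistics

rng = random.Random(2011)
units = []
for _ in range(5000):
    ability = rng.gauss(0, 1)                        # hypothetical unobservable
    y0 = 1.0 + ability + rng.gauss(0, 1)             # outcome without treatment
    y1 = y0 + 0.5 + 0.5 * ability                    # outcome with treatment (heterogeneous gain)
    d = 1 if ability + rng.gauss(0, 1) > 0 else 0    # self-selection related to ability
    units.append((y1, y0, d))

def mean(values):
    return statistics.fmean(list(values))

p = mean(d for _, _, d in units)                     # Pr(D = 1)
m11 = mean(y1 for y1, _, d in units if d == 1)       # E[y1 | D = 1]
m10 = mean(y1 for y1, _, d in units if d == 0)       # E[y1 | D = 0], never observed in practice
m01 = mean(y0 for _, y0, d in units if d == 1)       # E[y0 | D = 1], never observed in practice
m00 = mean(y0 for _, y0, d in units if d == 0)       # E[y0 | D = 0]

ate = mean(y1 - y0 for y1, y0, _ in units)
att = m11 - m01
atu = m10 - m00
naive = m11 - m00                                    # treated mean minus untreated mean

bias_att = naive - att                               # should equal m01 - m00
bias_ate = naive - ate                               # should equal the weighted average below
weighted = (m11 - m10) * (1 - p) + (m01 - m00) * p
```

In this setup the naive comparison overstates both the ATT and the ATE, because units with high untreated outcomes are more likely to select into treatment.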

Early Example of Potential Outcomes: Roy Model (Roy 1951)

As noted previously, at the heart of the RCM is the interplay between assignment of treatments, potential outcomes, and observed outcomes

The problem is one of self-selection; highlighted in a very clever fashion in Roy (1951)

The specific issue in Roy (1951) was occupational choice

- Individuals have potential earnings associated with different occupation choices
- Realized earnings reflect the chosen occupation

Example

Suppose $(y_0, y_1)' \sim N\left( (0, 1)', \Sigma \right)$

Unconditional outcome distributions look like

[Figure: Unconditional Distributions of Potential Outcomes. Kernel densities of y0 and y1 over the support. Simulated data, 1000 obs, rho = 0.7.]

Conditional distributions

Depends on

- Who selects into the treatment or control group, and
- Correlation of potential outcomes

Positive correlation in the above example ($\rho \approx 0.7$)

Positive selection: Assume those above the mean in the y1 distribution select into treatment

[Figure: Conditional Distributions of Potential Outcomes. Kernel densities of yy0 and yy1 over the support. Simulated data, 1000 obs, rho = 0.7; positive selection into treatment.]

Negative selection: Assume those below the mean in the y1 distribution select into treatment

[Figure: Conditional Distributions of Potential Outcomes. Kernel densities of yy0 and yy1 over the support. Simulated data, 1000 obs, rho = 0.7; negative selection into treatment.]

Random assignment:

[Figure: Conditional Distributions of Potential Outcomes. Kernel densities of yy0 and yy1 over the support. Simulated data, 1000 obs, rho = 0.7; random assignment into treatment.]

Lesson to be learned: observed distributions are not the unconditional distributions

Roy Model

Two occupations: hunter, fisher

Potential incomes

  $y_d = g_d(x) + \upsilon_d$, $d = 0$ (h), $1$ (f)

Decision rule

  $D = I(y_1 - y_0 > 0)$
  $\quad = I(g_1(x) - g_0(x) + \upsilon_1 - \upsilon_0 > 0)$

Observed income

  $y = D y_1 + (1 - D) y_0$

Treatment assignment depends on observables, x, and unobservables, $\upsilon_1 - \upsilon_0$

Notes:

1. $Cov(D, \upsilon_1 - \upsilon_0) \neq 0$ is referred to as essential heterogeneity (Heckman et al. 2006)
2. $Cov(D, \upsilon_1 - \upsilon_0) \neq 0 \Rightarrow Cov(D, D(\upsilon_1 - \upsilon_0)) \neq 0$
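A small simulation makes the income-maximizing decision rule concrete. The sketch assumes no observables (so $g_0 = 0$ and $g_1 = 1$), unit variances, and a correlation of 0.7 between the two potential incomes; the variances, the seed, and the variable names are assumptions for illustration.

```python
import math
import random
import statistics

rng = random.Random(1951)
rho = 0.7
people = []
for _ in range(1000):
    v0 = rng.gauss(0, 1)
    v1 = rho * v0 + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)   # Corr(v0, v1) = rho
    y0, y1 = 0.0 + v0, 1.0 + v1                                  # potential incomes
    d = 1 if y1 - y0 > 0 else 0                                  # choose the higher income
    people.append((y0, y1, d, v1 - v0))

gains_treated = [y1 - y0 for y0, y1, d, _ in people if d == 1]
gains_untreated = [y1 - y0 for y0, y1, d, _ in people if d == 0]
att = statistics.fmean(gains_treated)      # positive by construction of the decision rule
atu = statistics.fmean(gains_untreated)    # non-positive by construction
ate = statistics.fmean(gains_treated + gains_untreated)

# Sample covariance between D and the unobserved gain v1 - v0
d_bar = statistics.fmean(d for _, _, d, _ in people)
g_bar = statistics.fmean(g for _, _, _, g in people)
cov_d_gain = statistics.fmean((d - d_bar) * (g - g_bar) for _, _, d, g in people)
```

Since individuals sort on their own gain, the ATT exceeds the ATE, which in turn exceeds the ATU, and the covariance between D and $\upsilon_1 - \upsilon_0$ is positive.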

Generalized Roy Model

Replace the income maximization decision rule with a more general rule

Decision rule

  $D = I(h(x) - u > 0)$

When D is a voluntary program (e.g., job training), u may reflect (i) costs of participation and (ii) foregone earnings (opportunity costs)

Implies that treatment assignment depends on observables, x, and unobservables, u

- Essential heterogeneity implies $Corr(u, \upsilon_d) \neq 0 \;\; \forall d$

Moving Forward

Guided by the potential outcomes framework, figure out conditions under which different estimators may provide consistent estimates of the ATE, ATT, ATU, etc.

Key points:

- Given the missing counterfactual problem, any estimator of the causal effects of a treatment must rely on some assumptions
- Different estimators rely on different assumptions and thus should not be expected to yield similar estimates unless the identifying assumptions of each hold in the data
- While extraneous assumptions may be testable (overidentifying restrictions), not all assumptions can be tested
- Different estimators estimate different aspects of the distribution of $\Delta$ and thus answer different questions

CausationRandom Experiments

First solution is to randomize treatment assignment

Generally speaking, randomization is the preferred solution; oftencalled the gold standard

Reason: randomization ensures that treatment assignment isindependent of potential outcomes in expectation

Freedman (2006): Experiments o¤er more reliable evidence oncausation than observational studies.

Imbens (2009): More generally, and this is the key point, in a situationwhere one has control over the assignment mechanism, there is little togain, and much to lose, by giving that up through allowing individuals tochoose their own treatment regime. Randomization ensures exogeneity ofkey variables, where in a corresponding observational study one wouldhave to worry about their endogeneity.


That said, not everyone is convinced by experiments (without doing some more mental work)

"Much of the criticism about experiments is about the difficulty of generalizing from the evaluation of one particular program to predicting what would happen to this program in a different context. Clearly, without theory to guide us on why a result extends from a context to another, it is difficult to jump directly to a policy conclusion. However, when experiments are motivated by a theory, the results of experiments (not only on the final outcomes, but on the entire chain of intermediate outcomes that led to the endpoint of interest) serve as a test of some of the implications of that theory. The combination of data points then eventually provides sufficient evidence to make policy recommendations."

Duflo (2010), http://www.aeaweb.org/econwhitepapers/white_papers/Esther_Duflo.pdf


"From an ex post evaluation standpoint, a carefully planned experiment using random assignment of program status represents the ideal scenario, delivering highly credible causal inference. But from an ex ante evaluation standpoint, the causal inferences from a randomized experiment may be a poor forecast of what were to happen if the program were to be scaled up."

DiNardo & Lee (2011)

Ex post evaluation answers the question: What happened? (descriptive)

Ex ante evaluation answers the question: What would happen? (predictive)


Randomization may occur at different stages
1 Population-level: randomize among agents in the population; typically not feasible since it would entail "compelling" treatment by some
2 Eligibility-level: randomize among the population of eligibles by randomly denying eligibility to a subset
3 Application-level: randomize among the population of program applicants by randomly accepting/rejecting a subset

Stage at which randomization occurs generally affects what can be learned unless additional assumptions are made


Assumptions (with population-level randomization)
(A.i) $\{y, D\}$ is iid sample from the population
(A.ii) $y_0, y_1 \perp D$
(A.iii) $\Pr(D = 1) \in (0, 1)$

Notes
- (A.i) implies SUTVA
- (A.ii) implies $E[y_1 \mid D = 1] = E[y_1 \mid D = 0] = E[y_1]$; similarly for $E[y_0]$
- (A.ii) also implies $\Delta^{ATE} = \Delta^{ATT} = \Delta^{ATU}$ since

$$\underbrace{E[y_1 - y_0]}_{ATE} = \underbrace{E[y_1 - y_0 \mid D = 1]}_{ATT} = \underbrace{E[y_1 - y_0 \mid D = 0]}_{ATU}$$

- (A.ii) relies on perfect compliance; imperfect compliance may invalidate the assumption if such non-compliance is related to potential outcomes
  - Difference in experimental means based on initial assignment still yields estimate of intent to treat under imperfect compliance; may actually be more policy relevant
- (A.iii) ensures all agents have some probability of receiving and not receiving the treatment
- Population-level randomization is feasible if compensation is offered to ensure compliance and this compensation does not affect $y_0$ and $y_1$


Estimation

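$$
\begin{aligned}
\hat{\Delta}^{ATE} &= \widehat{E}[y_i \mid D = 1] - \widehat{E}[y_i \mid D = 0] \\
&= \frac{\sum_{i=1}^{N} y_i I[D_i = 1]}{\sum_{i=1}^{N} I[D_i = 1]} - \frac{\sum_{i=1}^{N} y_i I[D_i = 0]}{\sum_{i=1}^{N} I[D_i = 0]} \\
&\xrightarrow{p} E[y_i \mid D = 1] - E[y_i \mid D = 0] \\
&= E[D_i y_{1i} + (1 - D_i) y_{0i} \mid D = 1] - E[D_i y_{1i} + (1 - D_i) y_{0i} \mid D = 0] \\
&= E[y_{1i} \mid D = 1] - E[y_{0i} \mid D = 0] \\
&= E[y_{1i}] - E[y_{0i}] \\
&= \Delta^{ATE}
\end{aligned}
$$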
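The difference-in-means estimator can be illustrated with a small simulation. This is only a sketch: the data-generating process (normal potential outcomes, a true ATE of 2, assignment probability 0.5) and the function name `diff_in_means` are assumptions made for the example, not part of the notes.

```python
import numpy as np

def diff_in_means(y, d):
    """Mean outcome of the treated minus mean outcome of the untreated."""
    y = np.asarray(y, dtype=float)
    d = np.asarray(d).astype(bool)
    return y[d].mean() - y[~d].mean()

# Hypothetical data-generating process with heterogeneous gains.
rng = np.random.default_rng(0)
n = 100_000
y0 = rng.normal(0.0, 1.0, n)
y1 = y0 + rng.normal(2.0, 1.0, n)   # individual gains average 2, so the true ATE is 2
d = rng.random(n) < 0.5             # randomized assignment, independent of (y0, y1)
y = np.where(d, y1, y0)             # only one potential outcome is observed

ate_hat = diff_in_means(y, d)
print(round(ate_hat, 3))
```

Because assignment is drawn independently of the potential outcomes, the estimate should be close to the assumed ATE; replacing the assignment rule with one that depends on `y1 - y0` would break this.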


Properties
- Unbiased
- Consistent
- Asymptotically normal
- Nonparametrically identified: no parametric or functional form assumptions needed

Notes
- (A.ii) may be replaced by a mean independence assumption:
  $E[y_j \mid D = j] = E[y_j]$, $j = 0, 1$
- Randomization succeeds by balancing (in expectation) both observable and unobservable attributes of participants in the treatment and control group
- Randomization can be assessed by testing for differences in the joint distribution of predetermined attributes across the treatment and control groups
- Randomization at the eligibility or application stage only yields an estimate of the ATT, which does not equal the ATE unless (i) treatment effects are homogeneous or (ii) agents do not become eligible or apply due to unobserved, observation-specific gains to the treatment, $\upsilon_1 - \upsilon_0$


Selection on Observables

Randomization is typically not feasible in economics

Applied economists typically must rely on observational (or non-experimental) data

Data structure is now given by...

attributes of i: $\{y_{1i}, y_{0i}, D_i, x_i\}$
observed for i: $\{y_i, D_i, x_i\}$

where $x_i$ is a vector of observable attributes of i


Selection on Observables
Strong Ignorability

Assumptions

(A.i) iid sample: $\{y, D, x\}$ is iid sample from the population

(A.ii) Conditional independence or unconfoundedness: $y_0, y_1 \perp D \mid x$

(A.iii) Common support or overlap: $\Pr(D = 1 \mid x) \in (0, 1)$

Note: CIA is sometimes referred to as the selection on observables (or observed variables) assumption because if D is a deterministic function of x, then CIA will hold. However, the CIA is broader than this case; D may also depend on unobservables as long as these unobservables are not correlated with potential outcomes.


Notes...

(A.i) implies SUTVA

(A.ii) implies

$$\Pr(D_i = 1 \mid x_i, y_{1i}, y_{0i}) = \Pr(D_i = 1 \mid x_i)$$

(A.iii) ensures one observes agents with a particular x in both the treatment and control groups

(A.ii), (A.iii) $\Rightarrow$ strong ignorability (Rosenbaum & Rubin 1983)

- xs must be pre-determined (i.e., unaffected by treatment assignment); if some xs are directly affected by D or by the anticipation of D, then conditioning on them will mask (at least) some of the effect of the treatment
- Implies estimation under strong ignorability requires that an instrument exist, although it need not be observed (or even known), such that conditional on x, D is random rather than deterministic
- There may not exist any vector x in a particular data set for a particular treatment such that strong ignorability holds
- There is some tension between (A.ii) and (A.iii); some xs may perfectly predict treatment assignment (invalidating CS), but their omission may invalidate CIA... hence, the need for the implicit IV


Nonparametric identication

Estimation

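$$
\begin{aligned}
\hat{\Delta}^{ATE}(x) &= \widehat{E}[y_i \mid x_i = x, D = 1] - \widehat{E}[y_i \mid x_i = x, D = 0] \\
&= \frac{\sum_{i=1}^{N} y_i I[x_i = x, D_i = 1]}{\sum_{i=1}^{N} I[x_i = x, D_i = 1]} - \frac{\sum_{i=1}^{N} y_i I[x_i = x, D_i = 0]}{\sum_{i=1}^{N} I[x_i = x, D_i = 0]} \\
&\xrightarrow{p} E[y_i \mid x_i = x, D = 1] - E[y_i \mid x_i = x, D = 0] \\
&= E[y_{1i} \mid x_i = x, D = 1] - E[y_{0i} \mid x_i = x, D = 0] \\
&= E[y_{1i} \mid x_i = x] - E[y_{0i} \mid x_i = x]
\end{aligned}
$$

and then

$$\hat{\Delta}^{ATE} = E[\hat{\Delta}^{ATE}(x)] = \int \hat{\Delta}^{ATE}(x) f(x)\,dx = \frac{1}{N} \sum_i \hat{\Delta}^{ATE}(x_i)$$

Similar story for other parameters, except final step uses $f(x \mid D = 1)$ or $f(x \mid D = 0)$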
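For a single discrete covariate, the cell-by-cell estimator above can be sketched as follows. The function name `cell_effects` and the six-observation data set are hypothetical; the sketch assumes overlap holds in every cell and raises an error otherwise.

```python
import numpy as np

def cell_effects(y, d, x):
    """For each observation, the treated-minus-untreated mean outcome in its own x cell."""
    y = np.asarray(y, dtype=float)
    d = np.asarray(d).astype(bool)
    x = np.asarray(x)
    delta = np.empty(len(y))
    for value in np.unique(x):
        cell = x == value
        if not (cell & d).any() or not (cell & ~d).any():
            raise ValueError(f"no overlap in cell x={value}")
        delta[cell] = y[cell & d].mean() - y[cell & ~d].mean()
    return delta

# Hypothetical data: two cells, with different shares of treated units.
y = [3, 1, 1, 6, 8, 4]
d = [1, 0, 0, 1, 1, 0]
x = [0, 0, 0, 1, 1, 1]

delta = cell_effects(y, d, x)
treated = np.asarray(d).astype(bool)
ate_hat = delta.mean()            # averages over f(x)
att_hat = delta[treated].mean()   # averages over f(x | D = 1)
atu_hat = delta[~treated].mean()  # averages over f(x | D = 0)
```

Averaging the same cell-level effects over different covariate distributions is what separates the ATE, ATT and ATU here.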


Caveats
- If x takes on many values (even if still discrete), there may be small sample size for any particular value, x, leading to high variance for $\hat{\Delta}^{ATE}(x)$
- If x is continuous, then this estimator cannot be used since the probability of observing more than one obs with the same value of x is zero
- Possible solution: functional form assumptions


Final Note

CIA is not testable except by conducting random experiments for comparison

One common "test" employed entails testing for differences in pre-treatment outcomes conditional on x between the to-be-treated and the controls

- Intuition: if D is uncorrelated with unobservables related to the outcome conditional on x, then pre-treatment outcomes should be unrelated to (future) D conditional on x
- Heckman et al. (1999) refers to this as the alignment fallacy
- In particular, a test based on outcomes more than one period in the past is misleading if shocks are serially correlated and agents self-select into the treatment group due to an adverse shock in the period directly before treatment
- In general, test is useful if it rejects the independence of D and y conditional on x in periods prior to treatment; if it fails to reject, then the test is ambiguous


Selection on Observables
Strong Ignorability: Regression

Previous results showed that

$$
\begin{aligned}
\Delta^{ATE}(x) &= E[y_{1i} \mid x_i = x] - E[y_{0i} \mid x_i = x] \\
&= E[y_i \mid x_i = x, D = 1] - E[y_i \mid x_i = x, D = 0]
\end{aligned}
$$

Implies key is to estimate the regression function $E[y_i \mid x_i, D_i]$


Assumptions

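(A.iv) Separability:

$$y_{0i} = \mu_0(x_i) + \upsilon_{0i}$$
$$y_{1i} = \mu_1(x_i) + \upsilon_{1i}$$

where $E[\upsilon_1 \mid x] = E[\upsilon_0 \mid x] = E[\upsilon_1 - \upsilon_0 \mid x] = 0$

(A.v) Functional forms:

(A.va) Constant treatment effect

$$\mu_0(x_i) = \alpha_0 + x_i \beta$$
$$\mu_1(x_i) = \alpha_1 + x_i \beta$$

(A.vb) Heterogeneous treatment effects

$$\mu_0(x_i) = \alpha_0 + x_i \beta_0$$
$$\mu_1(x_i) = \alpha_1 + x_i \beta_1$$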


Implications...

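Given (A.i), (A.ii), (A.iv), and (A.va) ...

$$E[y_i \mid x_i, D = 0] = \alpha_0 + x_i \beta + E[\upsilon_{0i} \mid x_i, D = 0]$$
$$E[y_i \mid x_i, D = 1] = \alpha_1 + x_i \beta + E[\upsilon_{1i} \mid x_i, D = 1]$$

implies

$$
\begin{aligned}
\Delta^{ATE}(x) &= E[y_i \mid x_i = x, D = 1] - E[y_i \mid x_i = x, D = 0] \\
&= \alpha_1 - \alpha_0 \\
&= \Delta^{ATE} = \Delta^{ATT} = \Delta^{ATU}
\end{aligned}
$$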


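Given (A.i), (A.ii), (A.iv), and (A.vb) ...

$$E[y_i \mid x_i, D = 0] = \alpha_0 + x_i \beta_0 + E[\upsilon_{0i} \mid x_i, D = 0]$$
$$E[y_i \mid x_i, D = 1] = \alpha_1 + x_i \beta_1 + E[\upsilon_{1i} \mid x_i, D = 1]$$

implies

$$
\begin{aligned}
\Delta^{ATE}(x) &= E[y_i \mid x_i = x, D = 1] - E[y_i \mid x_i = x, D = 0] \\
&= (\alpha_1 - \alpha_0) + x(\beta_1 - \beta_0)
\end{aligned}
$$

and

$$\Delta^{ATE} = \int \Delta^{ATE}(x) f(x)\,dx = (\alpha_1 - \alpha_0) + E[x](\beta_1 - \beta_0)$$

$$\Delta^{ATT} = \int \Delta^{ATE}(x) f(x \mid D = 1)\,dx = (\alpha_1 - \alpha_0) + E[x \mid D = 1](\beta_1 - \beta_0)$$

$$\Delta^{ATU} = \int \Delta^{ATE}(x) f(x \mid D = 0)\,dx = (\alpha_1 - \alpha_0) + E[x \mid D = 0](\beta_1 - \beta_0)$$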


Estimation... Given (A.i), (A.ii), (A.iv), and (A.va)

Estimate via OLS

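$$
\begin{aligned}
y_i &\equiv y_{0i} + D_i (y_{1i} - y_{0i}) \\
&= \alpha_0 + x_i \beta + \upsilon_{0i} + D_i (\alpha_1 + x_i \beta + \upsilon_{1i} - \alpha_0 - x_i \beta - \upsilon_{0i}) \\
&= \alpha_0 + x_i \beta + (\alpha_1 - \alpha_0) D_i + [\upsilon_{0i} + D_i (\upsilon_{1i} - \upsilon_{0i})] \\
&= \alpha_0 + x_i \beta + \Delta^{ATE} D_i + \tilde{\upsilon}_i
\end{aligned}
$$

Coefficient on D is an unbiased estimate of the causal parameter, and

$$\Delta^{ATE} = \Delta^{ATT} = \Delta^{ATU}$$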


Estimation... Given (A.i), (A.ii), (A.iv), and (A.vb) ...

Estimate via OLS

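$$
\begin{aligned}
y_i &= \alpha_0 + x_i \beta_0 + (\alpha_1 - \alpha_0) D_i + x_i D_i (\beta_1 - \beta_0) + [\upsilon_{0i} + D_i (\upsilon_{1i} - \upsilon_{0i})] \\
&= \alpha_0 + x_i \beta_0 + \tilde{\alpha}_1 D_i + x_i D_i \tilde{\beta}_1 + \tilde{\upsilon}_i
\end{aligned}
$$

Estimates given by

$$\hat{\Delta}^{ATE}(x) = \hat{\tilde{\alpha}}_1 + x \hat{\tilde{\beta}}_1$$
$$\hat{\Delta}^{ATE} = \hat{\tilde{\alpha}}_1 + \bar{x} \hat{\tilde{\beta}}_1$$
$$\hat{\Delta}^{ATT} = \hat{\tilde{\alpha}}_1 + \bar{x}_1 \hat{\tilde{\beta}}_1$$
$$\hat{\Delta}^{ATU} = \hat{\tilde{\alpha}}_1 + \bar{x}_0 \hat{\tilde{\beta}}_1$$

where $\bar{x}_j = \sum_i x_i I[D_i = j] / \sum_i I[D_i = j]$, $j = 0, 1$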


Alternatively, estimate via OLS

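$$
\begin{aligned}
y_i &= \alpha_0 + (x_i - \bar{x}) \beta_0 + (\alpha_1 - \alpha_0) D_i + (x_i - \bar{x}) D_i (\beta_1 - \beta_0) + [\upsilon_{0i} + D_i (\upsilon_{1i} - \upsilon_{0i})] \\
&= \alpha_0 + (x_i - \bar{x}) \beta_0 + \tilde{\alpha}_1 D_i + (x_i - \bar{x}) D_i \tilde{\beta}_1 + \tilde{\upsilon}_i
\end{aligned}
$$

Estimates given by

$$\hat{\Delta}^{ATE}(x) = \hat{\tilde{\alpha}}_1 + (x - \bar{x}) \hat{\tilde{\beta}}_1$$
$$\hat{\Delta}^{ATE} = \hat{\tilde{\alpha}}_1$$
$$\hat{\Delta}^{ATT} = \hat{\tilde{\alpha}}_1 + \bar{x}_1 \hat{\tilde{\beta}}_1$$
$$\hat{\Delta}^{ATU} = \hat{\tilde{\alpha}}_1 + \bar{x}_0 \hat{\tilde{\beta}}_1$$

where $\bar{x}_j = \sum_i (x_i - \bar{x}) I[D_i = j] / \sum_i I[D_i = j]$, $j = 0, 1$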
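The demeaned-interaction regression can be sketched with ordinary least squares in NumPy. The function name `regression_effects`, the single covariate and the noiseless data are assumptions chosen so that the estimates can be checked by hand; standard errors are not computed.

```python
import numpy as np

def regression_effects(y, d, x):
    """OLS of y on [1, x - xbar, D, (x - xbar) * D]; returns ATE, ATT, ATU estimates."""
    y = np.asarray(y, dtype=float)
    d = np.asarray(d, dtype=float)
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()                                  # demeaned covariate
    X = np.column_stack([np.ones_like(y), xc, d, xc * d])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    alpha1_tilde, beta1_tilde = coef[2], coef[3]
    return {
        "ate": alpha1_tilde,                                    # coefficient on D
        "att": alpha1_tilde + xc[d == 1].mean() * beta1_tilde,  # uses mean of demeaned x among treated
        "atu": alpha1_tilde + xc[d == 0].mean() * beta1_tilde,  # uses mean of demeaned x among untreated
    }

# Hypothetical noiseless data: y0 = 1 + 2x, y1 = 2 + 2.5x, so the gain at x is 1 + 0.5x.
x = np.arange(8.0)
d = np.array([0, 1, 0, 1, 0, 1, 1, 0])
y = 1 + 2 * x + d * (1 + 0.5 * x)

effects = regression_effects(y, d, x)
```

With the covariate demeaned, the coefficient on D is read directly as the ATE, which is the practical appeal of this parameterization.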


Notes
- Inclusion of $\bar{x}$ on the RHS of the regression introduces the problem of a generated regressor; OLS std errors are incorrect, but effect is generally minor
- Standard errors of estimators obtained via delta method or bootstrap
- Prior to implementing regression approach, it is useful to examine the normalized differences in x across the treatment and control groups
  - Normalized difference for a particular x is given by

    $$\Delta_x = \frac{\bar{x}_1 - \bar{x}_0}{\sqrt{\sigma^2_{x1} + \sigma^2_{x0}}}$$

  - If $|\Delta_x| > 0.25$, regression results are sensitive to functional form assumptions in (A.va) and (A.vb); see Imbens & Wooldridge (2009)
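The normalized difference is simple to compute for one covariate. In this sketch the function name `normalized_difference` and the data are hypothetical, and the group variances are estimated with the usual sample variance (`ddof=1`), which is an assumption about the intended estimator.

```python
import numpy as np

def normalized_difference(x, d):
    """(mean of x among treated - mean among untreated) / sqrt(sum of the two group variances)."""
    x = np.asarray(x, dtype=float)
    d = np.asarray(d).astype(bool)
    x1, x0 = x[d], x[~d]
    return (x1.mean() - x0.mean()) / np.sqrt(x1.var(ddof=1) + x0.var(ddof=1))

# Hypothetical covariate values for three treated and three untreated units.
x = [2, 4, 6, 1, 2, 3]
d = [1, 1, 1, 0, 0, 0]
delta_x = normalized_difference(x, d)
flag = abs(delta_x) > 0.25   # rule of thumb cited in the notes
```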


Selection on Observables
Strong Ignorability: Matching

Preliminaries
- Matching methods were quite popular, and still are to a large extent
- (Incorrectly) viewed by many as a "magic bullet" to the estimation of treatment effects, as a way to "mimic" randomized experiments
- In practice, only as good as the underlying assumptions
- Matching when identifying assumptions are violated may yield worse estimate than without matching

Assumptions required: (A.i), (A.ii), and (A.iii)
- Technicality #1: only need $y_0 \perp D \mid x$ to estimate ATT; $y_1 \perp D \mid x$ to estimate ATU
- Technicality #2: (really) only need $E[y_j \mid x, D = j] = E[y_j \mid x, D = j']$, $j, j' = 0, 1$ to estimate ATE; $E[y_0 \mid x, D = 1] = E[y_0 \mid x, D = 0]$ to estimate ATT; $E[y_1 \mid x, D = 0] = E[y_1 \mid x, D = 1]$ to estimate ATU


Comparison to regression approach
- No functional form assumptions: if CIA holds, but (A.va) or (A.vb) do not, then matching will be consistent and OLS will not
- Matching weights observations differently, giving more weight to those deemed most similar
- Matching requires, and thus highlights problems due to, CS

[Figure: scatter plot of y against x showing untreated and treated units, with a fitted regression line for each group. E[y|x,D=0] = 1 + 1x; E[y|x,D=1] = 1.5 + 2.5x; sigma = 0.25]

  - CS is violated, but OLS simply extrapolates from each group to estimate the missing counterfactual at a particular value of x
  - If linear regression specification is not globally accurate, then regression may yield severe bias (see earlier discussion on normalized differences)


The fallacy (perhaps!) of extrapolation


Estimation

Parameters

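$$\Delta^{ATE} = E[y_1 - y_0]$$
$$\Delta^{ATT} = E[y_1 - y_0 \mid D = 1]$$
$$\Delta^{ATU} = E[y_1 - y_0 \mid D = 0]$$

Infeasible estimators

$$\hat{\Delta}^{ATE} = \frac{1}{N} \sum_i (y_{1i} - y_{0i})$$

$$\hat{\Delta}^{ATT} = \frac{1}{\sum_i I[D_i = 1]} \sum_i (y_{1i} - y_{0i}) I[D_i = 1]$$

$$\hat{\Delta}^{ATU} = \frac{1}{\sum_i I[D_i = 0]} \sum_i (y_{1i} - y_{0i}) I[D_i = 0]$$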


Feasible estimators

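$$\hat{\Delta}^{ATT} = \frac{1}{\sum_i I[D_i = 1]} \sum_i (y_{1i} - \hat{y}_{0i}) I[D_i = 1]$$

$$\hat{\Delta}^{ATU} = \frac{1}{\sum_i I[D_i = 0]} \sum_i (\hat{y}_{1i} - y_{0i}) I[D_i = 0]$$

$$\hat{\Delta}^{ATE} = \frac{\sum_i I[D_i = 1]}{N} \hat{\Delta}^{ATT} + \frac{\sum_i I[D_i = 0]}{N} \hat{\Delta}^{ATU}$$

where $\hat{y}_{0i}$, $\hat{y}_{1i}$ are estimates of the missing counterfactuals, obtained as

$$\hat{y}_{0i} = \frac{1}{\sum_{l \in \{D_l = 0\}} \omega_{il}} \sum_{l \in \{D_l = 0\}} \omega_{il} y_{0l}$$

$$\hat{y}_{1i} = \frac{1}{\sum_{l \in \{D_l = 1\}} \omega_{il}} \sum_{l \in \{D_l = 1\}} \omega_{il} y_{1l}$$

where $\omega_{il}$ = weight given to observation l by observation i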


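Feasible estimation accomplished by replacing the missing counterfactual with a weighted average of outcomes from the corresponding group

Formally, all matching estimators take the form

$$\hat{\Delta}^{ATT} = \frac{1}{N_1} \sum_{i \in \{D_i = 1\}} \left( y_{1i} - \frac{1}{\sum_{l \in \{D_l = 0\}} \omega_{il}} \sum_{l \in \{D_l = 0\}} \omega_{il} y_{0l} \right)$$

$$\hat{\Delta}^{ATU} = \frac{1}{N_0} \sum_{i \in \{D_i = 0\}} \left( \frac{1}{\sum_{l \in \{D_l = 1\}} \omega_{il}} \sum_{l \in \{D_l = 1\}} \omega_{il} y_{1l} - y_{0i} \right)$$

$$\hat{\Delta}^{ATE} = \frac{N_1}{N} \hat{\Delta}^{ATT} + \frac{N_0}{N} \hat{\Delta}^{ATU}$$

where $N_j = \sum_i I[D_i = j]$, $j = 0, 1$

Matching estimators differ in terms of how the weights are specified and what exactly is matched on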

Selection on Observables
Strong Ignorability: Matching (Weighting Schemes)

Exact matching or cell matching

Assuming x contains only discrete variables, assign positive weight only to observations with identical values of x

Let there be K distinct values (or combinations) of xs indexed by k = 1, ..., K (i.e., K cells)

$N_{0k}$, $N_{1k}$ = the number of untreated, treated obs in cell k

Estimators given by

$$\hat{\Delta}^{ATT} = \sum_k \frac{N_{1k}}{N_1} \left( \sum_{i \in k \cap \{D_i = 1\}} \frac{y_{1i}}{N_{1k}} - \sum_{l \in k \cap \{D_l = 0\}} \frac{y_{0l}}{N_{0k}} \right)$$

$$\hat{\Delta}^{ATU} = \sum_k \frac{N_{0k}}{N_0} \left( \sum_{l \in k \cap \{D_l = 1\}} \frac{y_{1l}}{N_{1k}} - \sum_{i \in k \cap \{D_i = 0\}} \frac{y_{0i}}{N_{0k}} \right)$$

which reflect different weighted averages of the average treatment effect within the K cells

Estimator is subject to curse of dimensionality

With high dimensional x, or if x contains continuous variables, inexact matching algorithms are useful

Asymptotically, all inexact matching estimators are equivalent since the "inexactness" disappears as $N \to \infty$

In finite samples, different inexact matching algorithms may yield quite different estimates

A newly proposed middle ground between exact and inexact matching is known as coarsened exact matching (CEM)
- Intuition: "round" x to fewer distinct values, then match exactly on the coarsened data
- Developed by King et al.
- See -cem- in Stata


Inexact matching

Requires a measure of distance between any two observations, i and l

- Euclidean-type distance metrics are of the form

  $$d_{il} = (x_i - x_l)' W (x_i - x_l)$$

  where common choices for W are
  1 $W = I$ (identity matrix)
  2 $W = \Sigma^{-1}$, where $\Sigma$ is the sample variance-covariance matrix of x (Mahalanobis metric)
  3 W is a diagonal matrix with the inverse of the variance of each x along the diagonal, zeros on the off-diagonal (Abadie & Imbens 2002, 2006)
  4 Zhao (2004) proposes other alternatives
- Propensity score methods compute the distance based on differences in the probability of being in the treatment group given x

  $$p(x) = \Pr(D = 1 \mid x) \in [0, 1]$$

  where distance between two observations is

  $$d_{il} = |p(x_i) - p(x_l)|$$

- $y_0, y_1 \perp D \mid x \Rightarrow y_0, y_1 \perp D \mid p(x)$, which follows from the fact that $D \perp x \mid p(x)$ (Rosenbaum & Rubin 1983)
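The quadratic-form distance and its common weighting matrices can be sketched as below. The helper names `quad_distance` and `weight_matrices` are hypothetical, and the diagonal choice uses inverse variances, which is an assumption about the intended scaling.

```python
import numpy as np

def quad_distance(xi, xl, W):
    """d_il = (x_i - x_l)' W (x_i - x_l)."""
    diff = np.asarray(xi, dtype=float) - np.asarray(xl, dtype=float)
    return float(diff @ W @ diff)

def weight_matrices(X):
    """Three common choices of W computed from an (N x K) covariate matrix."""
    X = np.asarray(X, dtype=float)
    cov = np.cov(X, rowvar=False)               # sample variance-covariance matrix
    return {
        "identity": np.eye(X.shape[1]),
        "mahalanobis": np.linalg.inv(cov),
        "inverse_variance": np.diag(1.0 / np.diag(cov)),
    }

# Hypothetical covariates; the two columns are uncorrelated in this sample.
X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 4.0], [2.0, 4.0]])
W = weight_matrices(X)
d_identity = quad_distance(X[0], X[3], W["identity"])
d_mahalanobis = quad_distance(X[0], X[3], W["mahalanobis"])
d_inverse_variance = quad_distance(X[0], X[3], W["inverse_variance"])
```

Because the sample covariance is diagonal in this example, the Mahalanobis and inverse-variance distances coincide; with correlated covariates they generally differ.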


Euclidean-type distance metrics and the propensity score are both a means to circumvent dimensionality as d is a scalar

No one method is superior; goal is to balance the xs ... discussed later (Ho et al. 2007)

- In this sense, matching is not an estimator per se, but can be viewed as a way of pre-processing the data prior to applying some estimator
- Similar to a type of outlier analysis

Given $d_{il}$, several weighting schemes are frequently used
- Let $C(0)$ represent a neighborhood around 0 for each i
- Observations given positive weight by i are those included in the set $A_i$ where

  $$A_i = \{l \mid D_l \neq D_i,\; d_{il} \in C(0)\}$$

Focusing on propensity score estimators, we can re-write this as

$$A_i = \{l \mid D_l \neq D_i,\; p(x_l) \in C(p(x_i))\}$$

where $C(p(x_i))$ represents a neighborhood around $p(x_i)$


Single nearest neighbor matching

Sets

$$C(p(x_i)) = \min_l |d_{il}|$$

$\Rightarrow$

$$\omega_{il} = \begin{cases} 1 & \text{if } l \in A_i \\ 0 & \text{otherwise} \end{cases}$$

Intuition: l has the closest propensity score to i, but with different treatment assignment


k-nearest neighbor matching

Sets

$$C(p(x_i)) = k\text{-}\min_l |d_{il}|$$

$\Rightarrow$

$$\omega_{il} = \begin{cases} 1/k & \text{if } l \in A_i \\ 0 & \text{otherwise} \end{cases}$$

Intuition: compute the average of the k closest obs to i in terms of propensity score, but with different treatment assignment than i
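Nearest neighbor matching on the propensity score can be sketched for the ATT as follows. This assumes the propensity scores are already estimated, matches with replacement, and ignores ties; the function name `knn_att` and the data are hypothetical.

```python
import numpy as np

def knn_att(y, d, p, k=1):
    """ATT from k-nearest-neighbor matching (with replacement) on the propensity score p."""
    y = np.asarray(y, dtype=float)
    d = np.asarray(d).astype(bool)
    p = np.asarray(p, dtype=float)
    y1, p1 = y[d], p[d]
    y0, p0 = y[~d], p[~d]
    dist = np.abs(p1[:, None] - p0[None, :])      # |p(x_i) - p(x_l)| for treated i, untreated l
    nearest = np.argsort(dist, axis=1)[:, :k]     # indices of the k closest untreated obs
    y0_hat = y0[nearest].mean(axis=1)             # equal weights 1/k on the matched controls
    return float((y1 - y0_hat).mean())

# Hypothetical data: two treated and three untreated observations.
y = [10, 20, 4, 6, 15]
d = [1, 1, 0, 0, 0]
p = [0.5, 0.8, 0.45, 0.6, 0.9]

att_1nn = knn_att(y, d, p, k=1)
att_2nn = knn_att(y, d, p, k=2)
```

Setting `k=1` gives single nearest neighbor matching; larger `k` averages over more controls at the cost of using more distant matches.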


Caliper or radius matching (Cochran & Rubin 1973)

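Sets

$$C(p(x_i)) = \{p(x_l) \mid |d_{il}| < \varepsilon\}$$

for a specified value of $\varepsilon$ $\Rightarrow$

$$\omega_{il} = \begin{cases} 1/k_i & \text{if } l \in A_i \\ 0 & \text{otherwise} \end{cases}$$

Intuition: compute the average over all $k_i$ obs that differ from i in terms of propensity score by less than $\varepsilon$, but with different treatment assignment than i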


Kernel matching (Smith & Todd 2005)

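Sets

$$C(p(x_i)) = \left\{ p(x_l) : \left| \frac{p(x_l) - p(x_i)}{a_N} \right| \leq \varepsilon \right\}$$

$\Rightarrow$

$$\omega_{il} = \begin{cases} \dfrac{G\left( \frac{p(x_l) - p(x_i)}{a_N} \right)}{\sum_{l' \in \{D_{l'} = 0\}} G\left( \frac{p(x_{l'}) - p(x_i)}{a_N} \right)} & \text{if } l \in A_i \\ 0 & \text{otherwise} \end{cases}$$

where $G(\cdot)$ is the kernel function and $a_N$ is the bandwidth

Intuition: compute a weighted average over all $k_i$ obs that receive positive weight given the choice of $G(\cdot)$ and $a_N$, but with different treatment assignment than i

- $G(\cdot)$ must integrate to one, $a_N \to 0$ as $N \to \infty$, and $a_N N \to \infty$
- Ex: quartic kernel ($\varepsilon = 1$)

  $$G(s) = \begin{cases} \frac{15}{16} (1 - s^2)^2 & \text{if } |s| \leq 1 \\ 0 & \text{otherwise} \end{cases}$$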
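Kernel matching with the quartic kernel can be sketched for the ATT as below. The propensity scores and the bandwidth are taken as given, and treated observations with no untreated observation inside the bandwidth are dropped, which is one possible convention rather than a prescribed one; the names `quartic` and `kernel_att` are hypothetical.

```python
import numpy as np

def quartic(s):
    """Quartic (biweight) kernel: 15/16 * (1 - s^2)^2 on |s| <= 1, zero elsewhere."""
    s = np.asarray(s, dtype=float)
    return np.where(np.abs(s) <= 1, 15 / 16 * (1 - s**2) ** 2, 0.0)

def kernel_att(y, d, p, bandwidth):
    """ATT using kernel-weighted averages of untreated outcomes as counterfactuals."""
    y = np.asarray(y, dtype=float)
    d = np.asarray(d).astype(bool)
    p = np.asarray(p, dtype=float)
    y1, p1 = y[d], p[d]
    y0, p0 = y[~d], p[~d]
    g = quartic((p0[None, :] - p1[:, None]) / bandwidth)   # rows: treated i, columns: untreated l
    total = g.sum(axis=1)
    ok = total > 0                                          # treated obs with at least one match
    y0_hat = (g[ok] @ y0) / total[ok]                       # weights sum to one within each row
    return float((y1[ok] - y0_hat).mean())

# Hypothetical data: one treated observation and three untreated observations.
y = [10, 4, 6, 100]
d = [1, 0, 0, 0]
p = [0.5, 0.5, 0.6, 0.9]
att_kernel = kernel_att(y, d, p, bandwidth=0.2)
```

The untreated observation at p = 0.9 lies outside the bandwidth and therefore receives zero weight, so its extreme outcome does not affect the estimate.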


Local linear matching (Smith & Todd 2005)

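Sets

$$C(p(x_i)) = \left\{ p(x_l) : \left| \frac{p(x_l) - p(x_i)}{a_N} \right| \leq \varepsilon \right\}$$

$\Rightarrow$

$$\omega_{il} = \begin{cases} \dfrac{G_{il} \sum_{l' \in \{D_{l'} = 0\}} G_{il'} (p_{l'} - p_i)^2 - [G_{il} (p_l - p_i)] \left[ \sum_{l' \in \{D_{l'} = 0\}} G_{il'} (p_{l'} - p_i) \right]}{\sum_{l \in \{D_l = 0\}} G_{il} \sum_{l' \in \{D_{l'} = 0\}} G_{il'} (p_{l'} - p_i)^2 - \left[ \sum_{l' \in \{D_{l'} = 0\}} G_{il'} (p_{l'} - p_i) \right]^2} & \text{if } l \in A_i \\ 0 & \text{otherwise} \end{cases}$$

where $G_{il} = G\left( \frac{p_l - p_i}{a_N} \right)$

Intuition: similar to kernel matching, but differs in handling of weights assigned to obs when obs are distributed asymmetrically around i or when there are gaps in the distribution of the propensity score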


Stratification or interval matching

Differs from above schemes (although it can be written as a matching estimator)

Unit interval is divided into K intervals, the average outcome of treated and untreated is computed within each interval, and $\hat{\Delta}^{ATE}(k)$ is computed within each interval

Finally

$$\hat{\Delta}^{ATT} = \sum_k \frac{N_{1k}}{N_1} \hat{\Delta}^{ATE}(k)$$

$$\hat{\Delta}^{ATU} = \sum_k \frac{N_{0k}}{N_0} \hat{\Delta}^{ATE}(k)$$

$$\hat{\Delta}^{ATE} = \frac{\sum_i I[D_i = 1]}{N} \hat{\Delta}^{ATT} + \frac{\sum_i I[D_i = 0]}{N} \hat{\Delta}^{ATU}$$

Stata: -psmatch2- or -nnmatch-
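Stratification on the propensity score can be sketched as follows. Equal-width strata on the unit interval and the rule of skipping strata that lack either treated or untreated observations are assumptions of this example, as are the function name `stratification_effects` and the data.

```python
import numpy as np

def stratification_effects(y, d, p, n_strata):
    """ATT, ATU and ATE from within-stratum differences in mean outcomes."""
    y = np.asarray(y, dtype=float)
    d = np.asarray(d).astype(bool)
    p = np.asarray(p, dtype=float)
    # Equal-width strata on [0, 1]; p = 1 is assigned to the last stratum.
    stratum = np.minimum((p * n_strata).astype(int), n_strata - 1)
    att_sum = atu_sum = 0.0
    n1 = n0 = 0
    for k in range(n_strata):
        treated = (stratum == k) & d
        untreated = (stratum == k) & ~d
        if treated.any() and untreated.any():
            delta_k = y[treated].mean() - y[untreated].mean()
            att_sum += treated.sum() * delta_k      # weight N_1k
            atu_sum += untreated.sum() * delta_k    # weight N_0k
            n1 += treated.sum()
            n0 += untreated.sum()
    att, atu = att_sum / n1, atu_sum / n0
    ate = (n1 * att + n0 * atu) / (n1 + n0)
    return att, atu, ate

# Hypothetical data with two strata: [0, 0.5) and [0.5, 1].
y = [5, 7, 10, 14, 3, 8, 3]
d = [1, 1, 1, 1, 0, 0, 0]
p = [0.1, 0.3, 0.7, 0.8, 0.2, 0.6, 0.4]
att_s, atu_s, ate_s = stratification_effects(y, d, p, n_strata=2)
```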


Selection on Observables
Strong Ignorability: Matching (Comparison of Matching Methods)

Asymptotically, all methods are consistent if assumptions hold and bandwidth satisfies the requisite criteria

In finite samples, choice may matter

Single nearest neighbor matching minimizes bias since it only uses the closest match; however, Frölich's (2004) MC analysis shows it fares poorly in practice

If sample size is large and the propensity score is evenly dispersed across the unit interval, k-neighbor matching may be ideal

If sample size is large and the propensity score is asymmetrically distributed, kernel matching may be ideal (weights obs according to closeness)

If many obs have a propensity score close to the boundary (zero or one), LL matching may be ideal

Stratification methods face problem of arbitrarily choosing K


Selection on Observables
Strong Ignorability: Matching (Regression Adjustment)

Various methods combine matching estimators with regression methods

Regression then matching (Smith & Todd 2005)
- Regress $y_i$ on (some) $x_i$ for treated and untreated samples, obtain residuals, and use residuals to compute matching estimators

Matching then regression (Ho et al. 2007)
- Match to obtain missing counterfactual for each obs, then regress $y_i$ on $D_i$ and (some) $x_i$ using matched sample
- Standard errors are an issue here, as the usual OLS SEs are incorrect (more below)


Selection on Observables
Strong Ignorability: Matching in Practice

Several practical issues are confronted when implementing matching estimators

1 Restriction to the common support
2 Does inexact matching balance the covariates, x?
3 Which variables belong in x?
4 Inference
5 Failure of CIA


Selection on Observables
Strong Ignorability: Matching (Common Support)

Defined as

$$S_p = \{p(x) : f(p \mid D = 1) > 0 \text{ and } f(p \mid D = 0) > 0\}$$

Matching estimates are only defined at values of $p(x) \in S_p$

In practice, may want to exclude obs outside $S_p$

To do so requires an estimate

$$\hat{S}_p = \{p(x) : \hat{f}(p \mid D = 1) > 0 \text{ and } \hat{f}(p \mid D = 0) > 0\}$$

Smith & Todd (2005) recommend using NP density estimators to estimate $f(\cdot)$

$$\hat{f}(p \mid D = j) = \sum_{i \in \{D_i = j\}} G\left( \frac{p(x_i) - p}{a_N} \right), \quad j = 0, 1$$

- See -kdensity- in Stata


Imprecise alternative

$$\hat{S}_p = \left\{ p(x) : p \in \left[ \max\left\{ \min_{i \in \{D_i = 0\}} p(x_i),\; \min_{i \in \{D_i = 1\}} p(x_i) \right\},\; \min\left\{ \max_{i \in \{D_i = 0\}} p(x_i),\; \max_{i \in \{D_i = 1\}} p(x_i) \right\} \right] \right\}$$

- Simpler alternative
- Excludes obs just outside the CS for whom close matches exist
- Does not address "holes" in the interior of the distribution
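The min/max rule can be sketched in a few lines. The function name `minmax_common_support` and the propensity scores are hypothetical.

```python
import numpy as np

def minmax_common_support(p, d):
    """Bounds [max of group minima, min of group maxima] and a mask of obs inside them."""
    p = np.asarray(p, dtype=float)
    d = np.asarray(d).astype(bool)
    lower = max(p[d].min(), p[~d].min())
    upper = min(p[d].max(), p[~d].max())
    keep = (p >= lower) & (p <= upper)
    return lower, upper, keep

# Hypothetical propensity scores for three treated and three untreated observations.
p = [0.3, 0.6, 0.95, 0.05, 0.4, 0.8]
d = [1, 1, 1, 0, 0, 0]
lower, upper, keep = minmax_common_support(p, d)
```

Here the treated observation at 0.95 and the untreated observation at 0.05 fall outside the estimated support and would be excluded, regardless of how close their nearest match is.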

Note: imposing the CS changes interpretation of the parameters being estimated (e.g., $\hat{\Delta}^{ATE}$ becomes the ATE for treated individuals with a propensity score in a particular region)

Trimming: Smith & Todd (2005) recommend reducing the CS to

$$\hat{S}_p = \{p(x) : \hat{f}(p \mid D = 1) > q \text{ and } \hat{f}(p \mid D = 0) > q\}, \quad q \in (0, 1)$$

Dealing with limited overlap; see Crump et al. (2009)


Selection on Observables
Strong Ignorability: Matching (Balancing)

Matching mimics a randomized experiment in that conditioning on p(x) should balance x across the treated and untreated groups

Equivalently, the problem is reduced to a series of quasi-random experiments at each value of p(x)... hence, an IV exists which exogenously determines treatment assignment conditional on p(x)

Rosenbaum & Rubin (1983) prove that

$$x \perp D \mid p(x)$$

which implies

$$E[x \mid p(x), D = 0] = E[x \mid p(x), D = 1]$$

This holds regardless of whether CIA holds

Balancing tests seek to gauge this

Note: this highlights that p(x) is simply a means to balance the xs; the goal of p(x) is not to "model" treatment choice (more below)


Stratification tests (e.g., Dehejia & Wahba 1999, 2002)
- Estimate the propensity score
- Divide the data into K intervals based on $\hat{p}(x)$
- Test for equal means (or other moments) of each x across the treated and control group within each stratum
  - See -ttest- in Stata
- Test xs individually or jointly using Hotelling $T^2$ test
  - See -hotel- in Stata
- Add higher order or interaction terms of xs failing the test, and repeat
- Problem: how to choose K?
  - Too small $\to$ typically always reject equality
  - Too large $\to$ rarely reject equality


Standardized differences
- Average difference in each x, where weights from matching are used, normalized by the pooled SD of x in the full sample
- Example: $\Delta^{ATT}$

  $$SDIFF(x_m) = 100 \cdot \frac{\frac{1}{N_1} \sum_{i \in \{D_i = 1\}} \left( x_{mi} - \sum_{l \in \{D_l = 0\}} \omega_{il} x_{ml} \right)}{\sqrt{\frac{Var_{i \in \{D_i = 1\}}(x_{mi}) + Var_{l \in \{D_l = 0\}}(x_{ml})}{2}}}$$

- Problem: how large is too large? Rosenbaum & Rubin (1985) suggest 20 is large
- Perhaps criteria should be more strict for variables thought to be more important in particular application
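The standardized difference for one covariate can be sketched as below. The weight matrix is taken as given from some matching step and is normalized within each row, and the variances use `ddof=1`; both are assumptions of this example, as are the function name `standardized_difference` and the data.

```python
import numpy as np

def standardized_difference(x, d, weights):
    """100 * mean matched gap in x / sqrt of the average of the two group variances."""
    x = np.asarray(x, dtype=float)
    d = np.asarray(d).astype(bool)
    w = np.asarray(weights, dtype=float)          # rows: treated i, columns: untreated l
    x1, x0 = x[d], x[~d]
    w = w / w.sum(axis=1, keepdims=True)          # normalize so each row sums to one
    gap = (x1 - w @ x0).mean()
    pooled_sd = np.sqrt((x1.var(ddof=1) + x0.var(ddof=1)) / 2)
    return 100 * gap / pooled_sd

# Hypothetical covariate and matching weights for two treated and three untreated obs.
x = [2, 4, 1, 3, 5]
d = [1, 1, 0, 0, 0]
weights = [[1, 0, 0],    # first treated obs matched to the first untreated obs
           [0, 1, 1]]    # second treated obs matched to the other two, equally weighted
sdiff = standardized_difference(x, d, weights)
```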


Hotelling $T^2$ test
- Test joint null of equal (weighted) means across treatment and control group
- Example: $\Delta^{ATT}$

  $$T^2 = (\bar{x}_1 - \bar{x}_0)' \Sigma^{-1} (\bar{x}_1 - \bar{x}_0)$$

  where $\bar{x}_1$ = vector of (unweighted) means from treatment group and $\bar{x}_0$ = vector of weighted means from untreated group, weighted by $\omega_{il}$
- Test may be conservative since estimation of weights is not accounted for

Regression-based test
- Estimate propensity score
- Regress each x on a polynomial of p(x), D, and D interacted with the same polynomial of p(x)...

  $$x_i = \phi_0 + \sum_{s=1}^{S} \phi_s p(x_i)^s + \pi_0 D_i + \sum_{s=1}^{S} \pi_s D_i p(x_i)^s + \eta_i$$

  and test $H_0 : \pi_0 = \pi_1 = \cdots = \pi_S = 0$
- Regression may be unweighted or weighted, assigning weight $\omega_l = \sum_{i \in \{D_i = 1\}} \omega_{il}$ to each untreated obs (when focus is on $\Delta^{ATT}$)


Selection on Observables
Strong Ignorability: Matching (Variable Selection)

CIA is a strong assumption that places great demands on the data

Two issues
- What variables to include in x?
- What functional form to use; should x include higher order, interaction terms of the variables?

CIA will certainly hold if x includes all variables that determine both outcomes and participation, but is this required?

Rubin and Thomas (1996) favor including variables in the propensity score model unless there is consensus that they do not belong

HIT (1997), HIST (1998), Heckman and Smith (1999), Lechner (2002), Smith & Todd (2005)
- Estimators are sensitive to variables included in x
- Bias likely to result if x is too crude


Brookhart et al. (2006)
- Variables related to outcomes should always be included
- Variables weakly related to the outcome, even if strongly related to treatment assignment, should be excluded as their inclusion results in higher mean squared error of the treatment effect estimate

Zhao (2007)
- Including irrelevant variables $\nRightarrow$ biased estimates
- Over-fitting the propensity score model may be counterproductive

Wooldridge (2009), Pearl (2009)
- Consider classes of variables whose inclusion leads to bias
- Primary example is of instrumental variables

Hirano et al. (2003)
- Using the true propensity score is inefficient even when it is known
- May imply that over-fitting the propensity score model may have little negative consequence in practice

Note: goal of the PS model is not to find the best predictor of D
- Generally, variables that impact participation and not outcomes should be excluded; inclusion will exacerbate the CS problem
- Pseudo-$R^2$ criteria should not be used to judge the PS model

Millimet & Tchernis (2009)
- MC analysis of matching and weighting estimators (discussed later)
- Estimate propensity score using a series logit estimator

  $$\Pr(D = 1) = \frac{\exp\left( \theta_0 + \sum_{s=1}^{S} \theta_s x^s \right)}{1 + \exp\left( \theta_0 + \sum_{s=1}^{S} \theta_s x^s \right)}$$

  where for sufficiently large S and appropriate coefficients, $\theta$, any participation function may be approximated
- SLE $\Rightarrow$ $\hat{\theta}$ estimated via ML
- Assess impact of
  - Including irrelevant and excluding relevant higher order terms of variables that impact outcomes and participation
  - Including irrelevant and excluding relevant higher order terms of variables that impact outcomes only
  - Including irrelevant and excluding relevant higher order terms of variables that impact participation only
- Little impact of over-fitting
  - Asymptotic variance of nonparametric estimators is dominated by bias terms (Ichimura & Linton 2005)
  - Over-fitting minimizes the bias
  - Also, normalized weighting estimator is preferable (discussed later)


DiNardo & Lee (2011) criticize us and show instances where adding x may exacerbate bias
- Their examples are instances where the CIA does not hold, but one applies an estimator that requires the CIA (such as matching)
- Thus, the matching estimator is already biased
- In this case, adding an additional covariate may increase or decrease the bias even if x belongs in the model
- That said, this is not the case examined in our work; we assume CIA holds

Shaikh et al. (2009) propose a specification test of the propensity score model
- Informal test based on an eyeball comparison of the distribution of p(x) in the treatment and control groups
- Formal test procedure also provided


Selection on Observables
Strong Ignorability: Matching (Standard Errors)

Non-smooth matching estimators
- Correct standard errors are not feasible in this case
- Usual t-test for difference in mean outcomes across matched treated and untreated group ignores estimation of propensity score and nature of matching
- Problem due to estimation of the propensity score disappears asymptotically
- Eichler & Lechner (2001) suggest that N must be in the 1000s before this bias disappears

Bootstrap methods are feasible for smooth matching estimators (e.g., kernel matching), but there is no formal evidence

Abadie & Imbens (2006) provide asymptotic standard errors for non-propensity score matching estimators; work in progress focuses on propensity score matching estimators

Must be careful when bootstrapping data with choice-based sampling


Selection on Observables
Strong Ignorability: Matching (Misc. Implementation Issues)

Replacement?
- Single, k-nearest neighbor matching may be done with or without replacement
- Without replacement implies results are sensitive to the sort order of the data
- With replacement reduces bias (by improving match quality), but is less efficient (by using less of the data)

Estimation of propensity score
- Typically probit or logit is used $\Rightarrow$ semiparametric estimator
- NP methods are available as well


Bandwidth Selection
- In NP work, bandwidth choice is typically much more important than choice of kernel function
- Methods generally fall into three categories
  1 Ad hoc combined with sensitivity analysis
  2 Rule-of-thumb approaches (Silverman 1986)

    $$a_N \approx 1.06\, \sigma N^{-1/5}$$

  3 Data driven methods (e.g., cross-validation)
- Leave-one-out cross-validation (e.g., $\Delta^{ATT}$)
  - Perform a NP regression of y on p(x) using all untreated obs except l and a candidate bandwidth, $a_b$
  - Predict $\hat{y}_l$
  - Repeat for all l, $l = 1, ..., N_0$
  - Calculate MSE

    $$MSE(a_b) = \frac{1}{N_0} \sum_{l \in \{D_l = 0\}} (y_l - \hat{y}_l)^2$$

  - Repeat for all candidate bandwidths $a_b$, $b = 1, ..., B$
  - Choose $a_b$ to minimize $MSE(a_b)$
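The leave-one-out steps above can be sketched with a Nadaraya-Watson (local constant) regression. The Gaussian kernel is an assumption made so that every observation has neighbors with positive weight; the function names `loo_mse` and `choose_bandwidth`, the candidate grid and the data are all hypothetical.

```python
import numpy as np

def loo_mse(y0, p0, bandwidth):
    """Leave-one-out MSE of a kernel regression of y on p among untreated obs."""
    y0 = np.asarray(y0, dtype=float)
    p0 = np.asarray(p0, dtype=float)
    s = (p0[:, None] - p0[None, :]) / bandwidth
    g = np.exp(-0.5 * s**2)            # Gaussian kernel weights (unnormalized)
    np.fill_diagonal(g, 0.0)           # leave observation l out of its own prediction
    total = g.sum(axis=1)
    if np.any(total == 0):             # no usable neighbors at this bandwidth
        return np.inf
    y_hat = (g @ y0) / total
    return float(np.mean((y0 - y_hat) ** 2))

def choose_bandwidth(y0, p0, candidates):
    """Return the candidate bandwidth with the smallest leave-one-out MSE."""
    scores = [loo_mse(y0, p0, a) for a in candidates]
    return candidates[int(np.argmin(scores))]

# Hypothetical untreated sample: outcomes vary smoothly with the propensity score.
p0 = np.linspace(0.0, 1.0, 21)
y0 = np.sin(2 * np.pi * p0)
candidates = [0.02, 0.1, 0.5]
best = choose_bandwidth(y0, p0, candidates)
```

A very wide bandwidth averages away the variation in y and is penalized by the cross-validation criterion, which is why the grid search is informative here.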


Selection on Observables
Strong Ignorability: Matching (Sensitivity to Unobservables)

CIA is not testable

Applied literature does/should assess the robustness of matching estimators

Several currently available techniques
- Rosenbaum bounds
- Simulation methods (Ichino et al. 2008)
- Minimum bias approach (Millimet & Tchernis 2011)
- Difference-in-differences matching
- Assuming SOO = SOU (Altonji et al. 2005; discussed later)
- Bayesian sensitivity analysis (de Luna & Lundin 2009)


Rosenbaum Bounds

Method of assessing sensitivity of matching estimator to an unobserved confounder (Rosenbaum 2002)

Assume

$$p(x_i) = F(x_i \beta + \gamma u_i) = \frac{\exp(x_i \beta + \gamma u_i)}{1 + \exp(x_i \beta + \gamma u_i)}$$

where u is an unobserved binary variable and F is the logistic CDF

Implications

- Odds for obs i are

  $$\frac{p(x_i)}{1 - p(x_i)} = \exp(x_i \beta + \gamma u_i)$$

- Odds ratio for obs i relative to obs i'

  $$\frac{p(x_i) / (1 - p(x_i))}{p(x_{i'}) / (1 - p(x_{i'}))} = \frac{\exp(x_i \beta + \gamma u_i)}{\exp(x_{i'} \beta + \gamma u_{i'})} = \exp\{\gamma (u_i - u_{i'})\} \quad \text{if } x_i = x_{i'}$$

- Thus, two observationally identical obs have different probabilities of being treated if $\gamma \neq 0$ and $u_i \neq u_{i'}$


How does inference regarding the treatment effect parameters change as $\gamma$ and $u_i - u_{i'}$ change?
- Since $u$ is binary, $u_i - u_{i'} \in \{-1, 0, 1\}$
- Implies
  $$\frac{1}{\exp\{\gamma\}} \leq \frac{p(x_i)/[1 - p(x_i)]}{p(x_{i'})/[1 - p(x_{i'})]} \leq \exp\{\gamma\}$$
  where
  - $\exp\{\gamma\} = 1 \Rightarrow$ no selection bias
  - $\exp\{\gamma\} \to \infty \Rightarrow$ greater selection bias
- Rosenbaum bounds compute bounds on the significance level of the matching estimate as $\exp\{\gamma\}$ changes values
  - If matching estimate is statistically insignificant even when $\exp\{\gamma\} \approx 1$, then treatment effect is not robust
  - If matching estimate is statistically significant even when $\exp\{\gamma\}$ is large, then treatment effect is not sensitive to hidden bias

Stata: -rbounds-, -mhbounds-
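To make the bounding idea concrete, here is a minimal sketch for the simplest case, a one-sided sign test on matched pairs. It assumes that, when the odds of treatment within a pair differ by at most $\Gamma = \exp\{\gamma\}$, the probability that the treated unit has the higher outcome under the null lies between $1/(1+\Gamma)$ and $\Gamma/(1+\Gamma)$. The pair counts are hypothetical, and this is not necessarily the test statistic used by the Stata commands above.

```python
from math import comb


def binom_upper_tail(n, k, prob):
    """P(X >= k) for X ~ Binomial(n, prob)."""
    return sum(comb(n, j) * prob ** j * (1.0 - prob) ** (n - j) for j in range(k, n + 1))


def sign_test_bounds(n_pairs, n_positive, gamma_odds):
    """Lower and upper bounds on the one-sided sign-test p-value for a given
    Gamma = exp(gamma), the maximum odds ratio of treatment within a pair."""
    p_plus = gamma_odds / (1.0 + gamma_odds)
    p_minus = 1.0 / (1.0 + gamma_odds)
    lower = binom_upper_tail(n_pairs, n_positive, p_minus)
    upper = binom_upper_tail(n_pairs, n_positive, p_plus)
    return lower, upper


# Hypothetical matched sample: 100 pairs, treated outcome higher in 65 of them.
n_pairs, n_positive = 100, 65
bounds = {g: sign_test_bounds(n_pairs, n_positive, g) for g in (1.0, 1.25, 1.5, 2.0)}
for g, (lo, hi) in bounds.items():
    print(f"Gamma = {g:4.2f}: p-value in [{lo:.4f}, {hi:.4f}]")
```

The value of $\Gamma$ at which the upper bound first crosses the chosen significance level indicates how much hidden bias would be needed to overturn the conclusion.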

Ichino et al. (2008) Approach

Nannicini (2007) and Ichino et al. (2008) propose an alternative method of assessing the robustness of ATT estimates obtained under CIA

The sensitivity analysis is performed by comparing the baseline matching estimate to estimates obtained after additionally conditioning upon a simulated confounder

The distribution of the simulated variable can be constructed to capture different hypotheses regarding the nature of potential confounders

Setup
- The parameter of interest is the $\Delta^{ATT} \equiv E[y_1 - y_0 \mid D = 1]$
- Accordingly, $y_0 \perp D \mid x$ denotes the required CIA
- Suppose that this condition is not met, but if an unobservable, $U$, is added then a stronger CIA holds
  $$y_0 \perp D \mid x, U$$
- Implies
  $$E[y_0 \mid D = 1, x] \neq E[y_0 \mid D = 0, x]$$
  $$E[y_0 \mid D = 1, x, U] = E[y_0 \mid D = 0, x, U]$$

Solution
- Simulate the potential confounder and use it as a matching covariate
  - For simplicity, the potential outcomes and the confounding variable are assumed to be binary
  - Conditional independence of $U$ and $x$ is also assumed
  - Hence, the distribution of $U$ is fully characterized by the choice of the following four parameters
    $$p_{ij} \equiv \Pr(U = 1 \mid D = i, y = j) = \Pr(U = 1 \mid D = i, y = j, x)$$
    with $i, j \in \{0, 1\}$
  - Given the parameters $p_{ij}$, a value of $U$ is simulated for each observation depending on $D$, $y$
- $\Delta^{ATT}$ is then estimated with $U$ as an additional matching covariate

For a given set of the parameters $p_{ij}$, many simulations are performed, $\Delta^{ATT}$ computed for each simulation, and the mean/sd of the estimates reported

Choosing $p_{ij}$...
- It is essential to consider useful potential confounders
- Calibrated confounders: choose $p_{ij}$ to make the distribution of $U$ similar to the empirical distribution of observable binary covariates
- Killer confounders: search over different $p_{ij}$ for the existence of a $U$ which makes $\Delta^{ATT} = 0$
- One can also simulate other meaningful confounders by setting the parameters $p_{ij}$ and $p_i$, where $p_i$ can be computed as
  $$p_i \equiv \Pr(U = 1 \mid D = i) = \sum_{j=0}^{1} p_{ij} \Pr(y = j \mid D = i)$$
  with $i \in \{0, 1\}$
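The simulation step can be sketched as follows. This is only an illustration of drawing $U$ from the four $p_{ij}$ and checking the implied $p_i$; the data-generating values, the chosen $p_{ij}$, and the function names are hypothetical, and the matching step that would follow is omitted.

```python
import random


def simulate_confounder(D, y, p, rng):
    """Draw U_i ~ Bernoulli(p[(D_i, y_i)]) for every observation."""
    return [1 if rng.random() < p[(d, yi)] else 0 for d, yi in zip(D, y)]


def implied_p_i(D, y, p, i):
    """p_i = sum_j p_ij * Pr(y = j | D = i), with Pr(.) replaced by sample shares."""
    y_given_d = [yi for d, yi in zip(D, y) if d == i]
    share_y1 = sum(y_given_d) / len(y_given_d)
    return p[(i, 0)] * (1.0 - share_y1) + p[(i, 1)] * share_y1


# Hypothetical binary treatment and outcome.
rng = random.Random(1)
n = 20000
D = [1 if rng.random() < 0.4 else 0 for _ in range(n)]
y = [1 if rng.random() < (0.6 if d == 1 else 0.3) else 0 for d in D]

# Hypothetical parameters p_ij, keyed by (i, j) = (D, y).
p = {(0, 0): 0.3, (0, 1): 0.6, (1, 0): 0.5, (1, 1): 0.8}
U = simulate_confounder(D, y, p, rng)

implied = {i: implied_p_i(D, y, p, i) for i in (0, 1)}
simulated = {
    i: sum(u for u, d in zip(U, D) if d == i) / sum(1 for d in D if d == i)
    for i in (0, 1)
}
print(implied, simulated)
```

In a full sensitivity analysis, `U` would be appended to the matching covariates, the ATT re-estimated, and the whole draw repeated many times for each configuration of $p_{ij}$.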

Common case
- Typical scenario in applied work has $\hat{\Delta}^{ATT} > 0$ in baseline model
- Thus, concern centers on potential confounder that has both a positive effect on the untreated outcome and on the selection into treatment
- Ichino et al. prove that
  1. $p_{01} > p_{00} \Rightarrow$
     $$\Pr(y_0 = 1 \mid D = 0, U = 1, x) > \Pr(y_0 = 1 \mid D = 0, U = 0, x)$$
     where $p_{01} \equiv \Pr(U = 1 \mid D = 0, y = 1)$ and $p_{00} \equiv \Pr(U = 1 \mid D = 0, y = 0)$
  2. $p_1 > p_0 \Rightarrow$
     $$\Pr(D = 1 \mid U = 1, x) > \Pr(D = 1 \mid U = 0, x)$$
     where $p_1 \equiv \Pr(U = 1 \mid D = 1)$ and $p_0 \equiv \Pr(U = 1 \mid D = 0)$
- Accordingly, by choosing $p_{01} > p_{00}$ and setting $p_1 > p_0$, a confounder is simulated such that it has a positive effect on both $y_0$ and $D$ even after conditioning on $x$

What do these $p$'s represent?
- The differences
  $$d = p_{01} - p_{00}, \qquad s = p_1 - p_0$$
  only depict the sign of $U$'s outcome and selection effects
- The size of these effects must be evaluated after conditioning on $x$ to account for the association between $U$ and $x$ that shows up in the data
- Thus, at every iteration, logit models for $\Pr(y = 1 \mid D = 0, U, x)$ and $\Pr(D = 1 \mid U, x)$ are estimated
  - The average odds ratio of $U$ is reported as the outcome and selection effects of the simulated confounder
    $$\Gamma \equiv \frac{\Pr(y = 1 \mid D = 0, U = 1, x) / \Pr(y = 0 \mid D = 0, U = 1, x)}{\Pr(y = 1 \mid D = 0, U = 0, x) / \Pr(y = 0 \mid D = 0, U = 0, x)}$$
    $$\Lambda \equiv \frac{\Pr(D = 1 \mid U = 1, x) / \Pr(D = 0 \mid U = 1, x)}{\Pr(D = 1 \mid U = 0, x) / \Pr(D = 0 \mid U = 0, x)}$$
  - $\Gamma$ and $\Lambda$ reflect the strength of $U$

Stata: -sensatt-

Minimum Bias Approach

Intuition: Trim the sample on the basis of $p(x)$ to minimize the bias from a failure of CIA

Assume (A.iv) plus unobservables are trivariate normal: $\upsilon_0, \upsilon_1, u \sim N_3(0, \Sigma)$, where

$$\Sigma = \begin{bmatrix} \sigma_0^2 & \rho_{01}\sigma_0\sigma_1 & \rho_{0u}\sigma_0 \\ & \sigma_1^2 & \rho_{1u}\sigma_1 \\ & & 1 \end{bmatrix}$$

and $u$ is the error from the treatment assignment equation

$$D_i^* = h(x_i) - u_i$$

where $D^*$ is latent treatment assignment

The bias of the ATT at some value of the propensity score, $p(x)$, is given by

$$B_{ATT}[p(x)] = \hat{\tau}_{ATT}[p(x)] - \tau_{ATT}[p(x)] = -\rho_{0u}\sigma_0 \frac{\phi(\Phi^{-1}(p(x)))}{p(x)[1 - p(x)]}$$

where
- $\rho_{0u}$ = selection on unobservables affecting outcome in untreated state
- $\phi$ and $\Phi$ are standard normal PDF and CDF
- $\hat{\tau}_{ATT}$ is some propensity score based estimator

$B_{ATT}[p(x)]$ is minimized (in absolute value) at $p(x) = 0.5$
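A quick numerical check of where the ATT bias is smallest in magnitude. This is a sketch under the trivariate normal setup above; the value used for $\rho_{0u}\sigma_0$ and the grid of propensity scores are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def att_bias(p, rho0u_sigma0):
    """B_ATT[p] = -rho_0u * sigma_0 * phi(Phi^{-1}(p)) / (p * (1 - p))."""
    z = STD_NORMAL.inv_cdf(p)
    return -rho0u_sigma0 * STD_NORMAL.pdf(z) / (p * (1.0 - p))


# Evaluate |B_ATT| on a grid of propensity scores 0.01, 0.02, ..., 0.99.
grid = [k / 100 for k in range(1, 100)]
abs_bias = {p: abs(att_bias(p, rho0u_sigma0=0.3)) for p in grid}
p_star = min(abs_bias, key=abs_bias.get)
print(p_star, abs_bias[p_star])
```

Because $\rho_{0u}\sigma_0$ only scales the bias, the location of the minimum does not depend on it, which is why trimming around 0.5 is feasible for the ATT without estimating the error correlations.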

For the ATE,

$$B_{ATE}[p(x)] = -\{\rho_{0u}\sigma_0 + [1 - p(x)]\rho_{\delta u}\sigma_\delta\} \frac{\phi(\Phi^{-1}(p(x)))}{p(x)[1 - p(x)]}$$

where
- $\delta = \upsilon_1 - \upsilon_0$ = unobserved, individual-specific gain from treatment
- $\rho_{\delta u}$ = selection on unobserved, individual-specific gains

$\Rightarrow$ The bias-minimizing propensity score, $p^*(x)$, depends on the error correlation structure

Similar results in Black & Smith (2004), Heckman and Navarro-Lozano (2004)

Minimum-biased (MB) estimation technique
- Stage 1: Estimate the propensity score (e.g., probit model)
- Stage 2: Retain only those observations with a propensity score, $\hat{p}(x_i)$, within a fixed neighborhood around $p^*(x)$, the bias-minimizing propensity score
- Stage 3: Estimate the ATE or ATT using any propensity-score based estimator that relies on CI using this sub-sample

Notes:
- Estimator is biased, but it minimizes the bias
- For ATT... this is straightforward as we know that $p^*(x) = 0.5$
- For ATE... $p^*(x)$ is unknown, depends on error correlations
- If treatment effect is heterogeneous, then interpretation changes; may not be economically interesting

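For ATE, add Stage 1.5: Estimate the error correlations
- Feasible if one also imposes (A.va) or (A.vb)
- Estimate via OLS (discussed in more detail later)
  $$y_i = \alpha_0 + (\alpha_1 - \alpha_0)D_i + x_i\beta_0 + x_i D_i(\beta_1 - \beta_0) + \beta_{\lambda 0}(1 - D_i)\left[\frac{\phi(x_i\gamma)}{1 - \Phi(x_i\gamma)}\right] + \beta_{\lambda 1} D_i\left[-\frac{\phi(x_i\gamma)}{\Phi(x_i\gamma)}\right] + \eta_i$$
  where $\phi(\cdot)/\Phi(\cdot)$ is the inverse Mills' ratio and
  $$\beta_{\lambda 0} = \rho_{0u}\sigma_0$$
  $$\beta_{\lambda 1} = \rho_{0u}\sigma_0 + \rho_{\delta u}\sigma_\delta$$
- Replacing $\gamma$ with $\hat{\gamma}$ from the first-stage probit yields consistent estimates of $\rho_{0u}\sigma_0$ and $\rho_{\delta u}\sigma_\delta$

Millimet & Tchernis (2009) find that trimming is inefficient when CIA holds, but is more robust to (some) mis-specifications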

Difference-in-Differences Matching

All matching estimators are biased if unobservables invalidate the CIA

Formally (e.g., $\Delta^{ATT}$)

$$\Delta^{ATT}(p(x)) = \{E[y_1 \mid p(x), D = 1] - E[y_0 \mid p(x), D = 0]\} + \{E[y_0 \mid p(x), D = 0] - E[y_0 \mid p(x), D = 1]\}$$

where matching estimators are based on

$$\tilde{\Delta}^{ATT}(p(x)) = E[y_1 \mid p(x), D = 1] - E[y_0 \mid p(x), D = 0]$$

which implies

$$\text{bias} = \tilde{\Delta}^{ATT}(p(x)) - \Delta^{ATT}(p(x)) = \underbrace{E[y_0 \mid p(x), D = 1]}_{\text{Counterfactual}} - \underbrace{E[y_0 \mid p(x), D = 0]}_{\text{Observed}}$$

which is zero under CIA

Rearranging terms yields

$$\Delta^{ATT}(p(x)) = \tilde{\Delta}^{ATT}(p(x)) - \text{bias}$$

This suggests a bias-corrected estimator is feasible if the bias can be consistently estimated

Might assume the bias equals the difference in mean outcomes prior to treatment

$$\text{bias} = E[y_{0t} \mid p(x), D = 1] - E[y_{0t} \mid p(x), D = 0] \overset{?}{=} E[y_{0t'} \mid p(x), D = 1] - E[y_{0t'} \mid p(x), D = 0]$$

where $t' < t$, $t'$ precedes the treatment, $t$ is post-treatment

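Implies

$$
\begin{aligned}
\tilde{\tilde{\Delta}}^{ATT}(p(x)) &= \tilde{\Delta}^{ATT}(p(x)) - \text{bias} \\
&= E[y_{1t} \mid p(x), D = 1] - E[y_{0t} \mid p(x), D = 0] - \{E[y_{0t'} \mid p(x), D = 1] - E[y_{0t'} \mid p(x), D = 0]\} \\
&= E[y_{1t} - y_{0t'} \mid p(x), D = 1] - E[y_{0t} - y_{0t'} \mid p(x), D = 0]
\end{aligned}
$$

and $\tilde{\tilde{\Delta}}^{ATT}(p(x)) = \Delta^{ATT}(p(x))$ requires

$$E[y_{0t} - y_{0t'} \mid p(x), D = 1] = E[y_{0t} - y_{0t'} \mid p(x), D = 0]$$

which is different than the original CIA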

Implementation: difference the data $\forall i$, then match

DID matching requires the original CIA be replaced with

$$\Delta y_0, \Delta y_1 \perp D \mid p(x)$$

Intuition:
- DID matching requires the change in potential outcomes to be independent of treatment assignment given the PS
- Equivalently, there are no time varying unobservables correlated with both outcomes and treatment assignment given $x$

Smith & Todd (2005) find DID matching to be more robust, but conclusions are application-specific

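Selection on Observables
Strong Ignorability: Inverse Propensity Score Weighting (IPW) Estimators

Alternative to matching estimators, but still rely on knowing/estimating the propensity score

Identities

$$
\begin{aligned}
E\left[\frac{Dy}{p(x)}\right] &= E\left[\frac{Dy_1}{p(x)}\right] = E\left[E\left[\frac{Dy_1}{p(x)} \,\Big|\, x\right]\right] = E\left[\frac{1}{p(x)} E[Dy_1 \mid x]\right] \\
&\overset{CIA}{=} E\left[\frac{1}{p(x)} E[D \mid x]\, E[y_1 \mid x]\right] = E\left[\frac{p(x)}{p(x)} E[y_1 \mid x]\right] \\
&= E\left[E[y_1 \mid x]\right] = E[y_1]
\end{aligned}
$$

and, similarly,

$$E\left[\frac{(1 - D)y}{1 - p(x)}\right] = E[y_0]$$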

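Parameters of interest (Horvitz & Thompson 1952)

$$\Delta^{ATE} = E\left[\frac{Dy}{p(x)} - \frac{(1 - D)y}{1 - p(x)}\right] = E\left[\frac{D - p(x)}{p(x)[1 - p(x)]}\, y\right]$$

$$\Delta^{ATT} = \frac{1}{E[p(x)]} E\left[p(x)\left\{\frac{Dy}{p(x)} - \frac{(1 - D)y}{1 - p(x)}\right\}\right] = \frac{1}{E[p(x)]} E\left[\frac{D - p(x)}{1 - p(x)}\, y\right]$$

$$\Delta^{ATU} = \frac{1}{E[1 - p(x)]} E\left[[1 - p(x)]\left\{\frac{Dy}{p(x)} - \frac{(1 - D)y}{1 - p(x)}\right\}\right] = \frac{1}{E[1 - p(x)]} E\left[\frac{D - p(x)}{p(x)}\, y\right]$$

Proof: Wooldridge (2002, p. 613)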

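Estimation

Unnormalized estimators

$$\hat{\Delta}^{ATE} = \frac{1}{N} \sum_i \left[\frac{D_i y_i}{\hat{p}(x_i)} - \frac{(1 - D_i) y_i}{1 - \hat{p}(x_i)}\right] = \frac{1}{N} \sum_i \left\{\frac{[D_i - \hat{p}(x_i)]\, y_i}{\hat{p}(x_i)[1 - \hat{p}(x_i)]}\right\}$$

$$\hat{\Delta}^{ATT} = \left[\frac{1}{N} \sum_i \hat{p}(x_i)\right]^{-1} \frac{1}{N} \sum_i \hat{p}(x_i)\left[\frac{D_i y_i}{\hat{p}(x_i)} - \frac{(1 - D_i) y_i}{1 - \hat{p}(x_i)}\right] = \left[\frac{1}{N} \sum_i \hat{p}(x_i)\right]^{-1} \frac{1}{N} \sum_i \left\{\frac{[D_i - \hat{p}(x_i)]\, y_i}{1 - \hat{p}(x_i)}\right\}$$

$$\hat{\Delta}^{ATU} = \left[\frac{1}{N} \sum_i [1 - \hat{p}(x_i)]\right]^{-1} \frac{1}{N} \sum_i [1 - \hat{p}(x_i)]\left[\frac{D_i y_i}{\hat{p}(x_i)} - \frac{(1 - D_i) y_i}{1 - \hat{p}(x_i)}\right] = \left[\frac{1}{N} \sum_i [1 - \hat{p}(x_i)]\right]^{-1} \frac{1}{N} \sum_i \left\{\frac{[D_i - \hat{p}(x_i)]\, y_i}{\hat{p}(x_i)}\right\}$$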

Normalized estimators (Hirano and Imbens 2001)
- $\hat{\Delta}^{ATE}$ is the difference in two weighted averages, where weights are
  $$\frac{D_i}{N \hat{p}(x_i)} \quad \text{and} \quad \frac{1 - D_i}{N[1 - \hat{p}(x_i)]}$$
- Problem: weights may not sum to unity
- HI assign weights normalized by the sum of propensity scores for treated and untreated groups
- Unnormalized estimator assigns equal weights of $1/N$ to each observation
- Normalized estimator (e.g., $\hat{\Delta}^{ATE}$)
  $$\hat{\Delta}^{ATE} = \left[\sum_i \frac{D_i y_i}{\hat{p}(x_i)} \Big/ \sum_i \frac{D_i}{\hat{p}(x_i)}\right] - \left[\sum_i \frac{(1 - D_i) y_i}{1 - \hat{p}(x_i)} \Big/ \sum_i \frac{1 - D_i}{1 - \hat{p}(x_i)}\right]$$
- Tends to be more stable in practice as it restricts the weights to sum to 1; Millimet & Tchernis (2009), Busso et al. (2011) find it performs better

Standard errors obtained via bootstrap
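A minimal sketch comparing the unnormalized and normalized ATE weighting estimators. To keep it short it treats the propensity score as known (in practice it would be estimated, e.g., by probit or logit); the data-generating process and function names are hypothetical, with a true ATE of 2.

```python
import math
import random


def ipw_ate_unnormalized(D, y, p_hat):
    """(1/N) * sum_i [ D_i y_i / p_i - (1 - D_i) y_i / (1 - p_i) ]."""
    n = len(y)
    return sum(d * yi / p - (1 - d) * yi / (1 - p) for d, yi, p in zip(D, y, p_hat)) / n


def ipw_ate_normalized(D, y, p_hat):
    """Difference of two weighted means whose weights sum to one within each group."""
    treated_num = sum(d * yi / p for d, yi, p in zip(D, y, p_hat))
    treated_den = sum(d / p for d, p in zip(D, p_hat))
    control_num = sum((1 - d) * yi / (1 - p) for d, yi, p in zip(D, y, p_hat))
    control_den = sum((1 - d) / (1 - p) for d, p in zip(D, p_hat))
    return treated_num / treated_den - control_num / control_den


# Hypothetical data: selection on the observable x only, constant effect of 2.
rng = random.Random(42)
n = 5000
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
p_true = [1.0 / (1.0 + math.exp(-xi)) for xi in x]  # logistic propensity score
D = [1 if rng.random() < p else 0 for p in p_true]
y = [xi + 2.0 * d + rng.gauss(0.0, 1.0) for xi, d in zip(x, D)]

ate_unnormalized = ipw_ate_unnormalized(D, y, p_true)
ate_normalized = ipw_ate_normalized(D, y, p_true)
naive_difference = (
    sum(yi for yi, d in zip(y, D) if d == 1) / sum(D)
    - sum(yi for yi, d in zip(y, D) if d == 0) / (n - sum(D))
)
print(ate_unnormalized, ate_normalized, naive_difference)
```

The naive difference in means is included only as a reference point: because treated units have higher $x$ on average, it overstates the effect, while both weighting estimators should land near 2.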

Selection on Observables
Strong Ignorability: Regression (Again)

Use propensity score as control variable in regression

Assumptions

(A.vi) $E[y_1 - y_0 \mid x]$ is uncorrelated with $Var(D \mid x) = p(x)[1 - p(x)]$
(A.vii) $E[y_1 \mid p(x)]$, $E[y_0 \mid p(x)]$ are linear in $p(x)$

(A.vi) has no good interpretation

(A.vii) replaces the functional form assumptions discussed in the previous regression approach

Estimation

Given (A.ii) and (A.vi)...
- Estimate via OLS
  $$y_i = \alpha_0 + \tilde{\alpha}_1 D_i + \gamma \hat{p}(x_i) + \varepsilon_i$$
- Estimates given by
  $$\hat{\Delta}^{ATE} = \hat{\Delta}^{ATT} = \hat{\Delta}^{ATU} = \hat{\tilde{\alpha}}_1$$
  which is consistent and asymptotically normal if $\hat{p}(x_i)$ is consistent and asymptotically normal
- Proof: See Wooldridge (2002)

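Given (A.ii) and (A.vii)...
- Estimate via OLS
  $$y_i = \alpha_0 + \tilde{\alpha}_1 D_i + \gamma_0 \hat{p}(x_i) + \gamma_1 [\hat{p}(x_i) - \hat{\mu}_p] D_i + \tilde{\varepsilon}_i$$
  where $\hat{\mu}_p = \frac{1}{N} \sum_i \hat{p}(x_i)$
- Estimates given by
  $$\hat{\Delta}^{ATE}(x) = \hat{\tilde{\alpha}}_1 + \hat{\gamma}_1 [\hat{p}(x) - \hat{\mu}_p]$$
  $$\hat{\Delta}^{ATE} = \hat{\tilde{\alpha}}_1$$
  $$\hat{\Delta}^{ATT} = \hat{\tilde{\alpha}}_1 + \hat{\gamma}_1 \bar{x}_1$$
  $$\hat{\Delta}^{ATU} = \hat{\tilde{\alpha}}_1 + \hat{\gamma}_1 \bar{x}_0$$
  where $\bar{x}_j = \sum_i [\hat{p}(x_i) - \hat{\mu}_p]\, I[D_i = j] \big/ \sum_i I[D_i = j]$, $j = 0, 1$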

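Given (A.ii) and a weaker version of (A.vii)...
- Estimate via OLS
  $$y_i = \alpha_0 + \tilde{\alpha}_1 D_i + \sum_{k=1}^{K} \gamma_{0k}\, \hat{p}(x_i)^k + \sum_{k=1}^{K} \gamma_{1k} \left[\hat{p}(x_i)^k - \hat{\mu}_p^k\right] D_i + \tilde{\varepsilon}_i$$
  where $\hat{\mu}_p^k = \frac{1}{N} \sum_i \hat{p}(x_i)^k$, $k = 1, ..., K$, and $K$ is a low order number
- Estimates given by
  $$\hat{\Delta}^{ATE}(x) = \hat{\tilde{\alpha}}_1 + \sum_{k=1}^{K} \hat{\gamma}_{1k} \left[\hat{p}(x)^k - \hat{\mu}_p^k\right]$$
  $$\hat{\Delta}^{ATE} = \hat{\tilde{\alpha}}_1$$
  $$\hat{\Delta}^{ATT} = \hat{\tilde{\alpha}}_1 + \sum_{k=1}^{K} \hat{\gamma}_{1k} \bar{x}_1^k$$
  $$\hat{\Delta}^{ATU} = \hat{\tilde{\alpha}}_1 + \sum_{k=1}^{K} \hat{\gamma}_{1k} \bar{x}_0^k$$
  where $\bar{x}_j^k = \sum_i \left[\hat{p}(x_i)^k - \hat{\mu}_p^k\right] I[D_i = j] \big/ \sum_i I[D_i = j]$, $j = 0, 1$; $k = 1, ..., K$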

Selection on Observables
Strong Ignorability: Double-Robust Estimators

Robins and Rotnitzky (1995), Lunceford and Davidian (2004), and others discuss DR estimators

DR estimators combine regression and weighting estimators and are double robust because they are consistent as long as either the regression specification for the outcome or the propensity score specification is correct

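Estimation

OLS estimation

$$y_i = \alpha_0 + x_i\beta + \tilde{\alpha}_1 D_i + \theta_0 \frac{D_i}{\hat{p}(x_i)} + \theta_1 \frac{1 - D_i}{1 - \hat{p}(x_i)} + \tilde{\varepsilon}_i$$

$$\hat{\Delta}^{ATE} = \hat{\tilde{\alpha}}_1 + \frac{1}{N} \sum_i \left[\hat{\theta}_0 \frac{D_i}{\hat{p}(x_i)} - \hat{\theta}_1 \frac{1 - D_i}{1 - \hat{p}(x_i)}\right]$$

$$\hat{\Delta}^{ATT} = \hat{\tilde{\alpha}}_1 + \frac{1}{N_1} \sum_{i: D_i = 1} \left[\hat{\theta}_0 \frac{D_i}{\hat{p}(x_i)} - \hat{\theta}_1 \frac{1 - D_i}{1 - \hat{p}(x_i)}\right]$$

$$\hat{\Delta}^{ATU} = \hat{\tilde{\alpha}}_1 + \frac{1}{N_0} \sum_{i: D_i = 0} \left[\hat{\theta}_0 \frac{D_i}{\hat{p}(x_i)} - \hat{\theta}_1 \frac{1 - D_i}{1 - \hat{p}(x_i)}\right]$$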

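WLS estimation: ATE

$$y_i = \alpha_0 + x_i\beta + \tilde{\alpha}_1 D_i + \tilde{\upsilon}_i$$

where weights are

$$\lambda_i = \sqrt{\frac{D_i}{\hat{p}(x_i)} + \frac{1 - D_i}{1 - \hat{p}(x_i)}}$$

and different weights are used for ATT, ATU (given above)

Augmented IPW: ATE (Lunceford and Davidian 2004; Glynn and Quinn 2010)

$$\hat{\Delta}^{ATE} = \frac{1}{N} \sum_i \left[\frac{D_i y_i - (D_i - \hat{p}(x_i))\, g_1(x_i)}{\hat{p}(x_i)} - \frac{(1 - D_i) y_i + (D_i - \hat{p}(x_i))\, g_0(x_i)}{1 - \hat{p}(x_i)}\right]$$

where $g_0(x_i)$ and $g_1(x_i)$ are estimated via separate OLS regressions of $y$ on $x$
- See -dr- in Stata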
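The augmented IPW formula can be illustrated with a short simulation. This is a sketch, not the -dr- implementation: it assumes a single covariate, fits $g_0$ and $g_1$ by simple OLS within each treatment group, and uses a hypothetical data-generating process with a true ATE of 2. The second call deliberately supplies a wrong propensity score (0.5 for everyone) to show the double-robustness property when the outcome regressions are correct.

```python
import math
import random


def fit_line(x, y):
    """OLS of y on a constant and one regressor; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def aipw_ate(D, y, x, p_hat):
    """Augmented IPW estimate of the ATE with linear g_1(x) and g_0(x)."""
    a1, b1 = fit_line([xi for xi, d in zip(x, D) if d == 1],
                      [yi for yi, d in zip(y, D) if d == 1])
    a0, b0 = fit_line([xi for xi, d in zip(x, D) if d == 0],
                      [yi for yi, d in zip(y, D) if d == 0])
    total = 0.0
    for d, yi, xi, p in zip(D, y, x, p_hat):
        g1, g0 = a1 + b1 * xi, a0 + b0 * xi
        total += (d * yi - (d - p) * g1) / p - ((1 - d) * yi + (d - p) * g0) / (1 - p)
    return total / len(y)


# Hypothetical data: y0 = x + e, y1 = 2 + 1.5 x + e, so ATE = 2 since E[x] = 0.
rng = random.Random(3)
n = 5000
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
p_true = [1.0 / (1.0 + math.exp(-xi)) for xi in x]
D = [1 if rng.random() < p else 0 for p in p_true]
y = [(2.0 + 1.5 * xi if d == 1 else xi) + rng.gauss(0.0, 1.0) for xi, d in zip(x, D)]

ate_correct_ps = aipw_ate(D, y, x, p_true)
ate_wrong_ps = aipw_ate(D, y, x, [0.5] * n)  # misspecified propensity score
print(ate_correct_ps, ate_wrong_ps)
```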

Selection on Observables
Strong Ignorability: Decomposition of Treatment Effects

Flores & Flores-Lagunes (2009) provide a framework to decompose $\Delta^k$ into a direct effect of $D$ and an indirect effect that operates through some causal mechanism, $S$

Setup
- $S \in \{0, 1\}$ is a post-treatment, mechanism variable
- $S_0, S_1$ are potential values of $S$ associated with $D = 0$ and $D = 1$
- $S = D S_1 + (1 - D) S_0$ is the realized value of $S$

Example: $D = 1$ if student $i$ attends a private HS, 0 otherwise; $S = 1$ if student $i$ obtains a college degree, 0 otherwise; $y$ = earnings as an adult

Composite potential outcomes for $y$ are defined as $y(D, S_{D'})$, $D, D' \in \{0, 1\}$
- $y(1, S_1)$ = potential outcome associated with $D = 1$ and $S_1$, the realized value of the mechanism variable, $S$, when $D = 1$
- $y(0, S_0)$ = potential outcome associated with $D = 0$ and $S_0$, the realized value of the mechanism variable, $S$, when $D = 0$
- $y(1, S_0)$ = potential outcome associated with $D = 1$ and $S_0$, the realized value of the mechanism variable, $S$, when $D = 0$

Decomposing $\Delta^{ATE}$

$$\Delta^{ATE} = E[y(1, S_1)] - E[y(0, S_0)] = \underbrace{\{E[y(1, S_1)] - E[y(1, S_0)]\}}_{A} + \underbrace{\{E[y(1, S_0)] - E[y(0, S_0)]\}}_{B}$$

where $A$ represents the indirect effect of $D$ on $y$ operating through $S$ and $B$ represents the direct effect of $D$ on $y$ fixing $S$ at the non-treatment value

Authors refer to
- $A$ as the individual causal mechanism effect
- $B$ as the net average treatment effect

Note, $B$ still reflects two effects of $D$ on $y$
1. Effects of $D$ on $y$ operating independently of $S$
2. Effects of $D$ on $y$ operating through a change in the return to $S$ (i.e., even though the level of $S$ is held fixed, the effect of $S$ on $y$ may change due to $D$)

Assumptions

(DTE.i) Independence of Treatment: $y(1, S_1), y(0, S_0), y(1, S_0), S_0, S_1 \perp D$
(DTE.ii) Conditional Independence of Potential Mechanisms: $y(1, S_1), y(0, S_0), y(1, S_0) \perp \{S_0, S_1\} \mid x$
(DTE.iii) Constant Functional Form: If $E[y(1, S_1) \mid S_1 = s_1, x] = f_1(s_1, x)$, then $E[y(1, S_0) \mid S_0 = s_0, x] = f_1(s_0, x)$

(DTE.iii) implies that the functional form relating $S$ and $x$ to $y$ when $D = 1$ is the same regardless of whether $S = S_1$ or $S = S_0$

Under (DTE.i) to (DTE.iii), $\Delta^{ATE}$ and $B$ can be estimated, and then $A$ can be backed out

Extension to the case where (DTE.i) only holds conditional on $x$ is also presented

Selection on Observables
Non-Binary Treatments: Multi-Valued Treatments

Suppose the treatment can take on many discrete values

$$D \in \Omega = \{d_0, d_1, d_2, ..., d_J\}$$

$\Rightarrow$ e.g., years of education

$y_j$ = potential outcome for treatment $j = 0, 1, ..., J$

Parameters of interest

$$\Delta^{ATE}_{j,j'} = E[y_j - y_{j'}], \quad j, j' \in \Omega$$
$$\tilde{\Delta}^{ATE}_{j,j'} = E[y_j - y_{j'} \mid D \in \{j, j'\}], \quad j, j' \in \Omega$$
$$\Delta^{ATT}_{j,j'} = E[y_j - y_{j'} \mid D = j], \quad j, j' \in \Omega$$

Dose-response function reflects the unconditional expectation of potential outcomes at each dose

$$E[y_j] \quad \forall j \in \Omega$$

Now, there are $J$ missing counterfactuals
- $D_{ji}$ = indicator if obs $i$ receives treatment $j$
  $$D_{ji} = \begin{cases} 1 & \text{if } D_i = j \\ 0 & \text{otherwise} \end{cases}$$
- $y_i$ = observed outcome for $i$
  $$y_i = \sum_{j=0}^{J} y_{ji} D_{ji}$$

Identification of the dose-response function
- Unconditional independence
  $$\{y_j\}_{j \in \Omega} \perp D$$
- Strong unconfoundedness (Rosenbaum & Rubin 1983)
  $$\{y_j\}_{j \in \Omega} \perp D \mid x$$
  $\Rightarrow$ treatment assignment is conditionally independent of all potential outcomes
- Weak unconfoundedness (Imbens 2000)
  $$y_j \perp D_j \mid x \quad \forall j \in \Omega$$
  $\Rightarrow$ assignment to any particular treatment is conditionally independent of that treatment's potential outcome

Implication of weak unconfoundedness

$$E[y_j \mid x] = E[y \mid D_j = 1, x] = E[y \mid D = j, x]$$

$$\Rightarrow E[y_j] = E[E[y \mid D = j, x]]$$

$\Rightarrow$ one may estimate the conditional dose-response function by estimating the mean outcome given treatment assignment and $x$, and then obtain the population dose-response function by averaging over the distribution of $x$

$$\Rightarrow E[y_j - y_{j'}] = E[E[y_j - y_{j'} \mid x]] = E\left[E[y \mid D = j, x] - E[y \mid D = j', x]\right]$$

Example
- Let $x$ = gender $(M, F)$
- $\Omega$ = years of schooling $(0, 1, ..., 21)$
- $E[y_j]$ obtained by
  - Computing average value of $y$ for sub-sample with $D_{ji} = 1$ and $x = M$ $\Rightarrow \bar{y}_{Mj}$
  - Computing average value of $y$ for sub-sample with $D_{ji} = 1$ and $x = F$ $\Rightarrow \bar{y}_{Fj}$
  - Obtaining portion of $M$ and $F$ in entire sample $\Rightarrow p_M, p_F$
  - Compute $p_M \bar{y}_{Mj} + p_F \bar{y}_{Fj}$
- Obtain $E[y_{j'}]$ similarly
- Compute the difference
- Other parameters can be estimated by using the proportions of $M$ and $F$ in various sub-samples (e.g., $D = j, j'$ only)
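The gender example amounts to averaging cell means with population shares as weights. A small sketch with made-up records (the numbers and the function name are purely illustrative):

```python
def dose_response(records, dose):
    """E[y_j] estimated as sum over x of Pr(x) * mean(y | D = dose, x).

    records is a list of (x, D, y) tuples; every x-cell must contain at least
    one observation with D = dose (the overlap requirement).
    """
    n = len(records)
    total = 0.0
    for group in sorted({x for x, _, _ in records}):
        share = sum(1 for x, _, _ in records if x == group) / n
        cell = [y for x, d, y in records if x == group and d == dose]
        total += share * (sum(cell) / len(cell))
    return total


# Hypothetical sample: (gender, years of schooling, outcome).
records = [
    ("M", 12, 30.0), ("M", 12, 34.0), ("M", 16, 50.0),
    ("F", 12, 28.0), ("F", 12, 26.0),
    ("F", 16, 44.0), ("F", 16, 48.0), ("F", 16, 46.0),
]
ey_12 = dose_response(records, 12)  # p_M * ybar_M,12 + p_F * ybar_F,12
ey_16 = dose_response(records, 16)
print(ey_12, ey_16, ey_16 - ey_12)
```

Here $p_M = 3/8$ and $p_F = 5/8$ are shares in the entire sample, not within a schooling level, which is what distinguishes this from a raw comparison of means by $D$.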

Generalized propensity score
- Definition
  $$r(j, x) = \Pr(D = j \mid x) = E[D_j \mid x]$$
- $r(j, x)$ may be estimated given data on $D$, $x$ (MNL, MNP, ordered logit/probit)
- Imbens (2000) shows that weak unconfoundedness $\Rightarrow$
  $$y_j \perp D_j \mid r(j, x) \quad \forall j \in \Omega$$
  and
  $$E[y_j \mid r(j, x)] = E[y \mid D_j = 1, r(j, x)] = E[y \mid D = j, r(j, x)]$$
  and
  $$E[y_j] = E[E[y \mid D = j, r(j, x)]]$$
- The above result requires $r(j, x) > 0$ along the entire support of $x$

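Estimation
- Given weak unconfoundedness and assuming $r(j, x) > 0$ for the entire support of $x$, then
  $$E\left[\frac{D_j y}{r(j, x)}\right] = E[y_j]$$
- Estimator
  $$\widehat{E[y_j]} = \frac{1}{N} \sum_i \left[\frac{D_{ji}\, y_i}{\hat{r}(j, x_i)}\right]$$
  which is analogous to the weighting estimator defined previously in the binary treatment case
- Analogous normalized weighting estimator given by
  $$\widehat{E[y_j]} = \left[\sum_i \frac{D_{ji}\, y_i}{\hat{r}(j, x_i)}\right] \left[\sum_i \frac{D_{ji}}{\hat{r}(j, x_i)}\right]^{-1}$$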

Selection on Observables
Non-Binary Treatments: Continuous Treatments

Suppose $\Omega$ is an interval $[\underline{d}, \bar{d}]$, and $D$ has a continuous dbn on $\Omega$
$\Rightarrow$ e.g., income

$y_j$ = potential outcome for treatment $j \in \Omega$

$D_j$ is not useful since $j$ takes on an infinite number of values

Weak unconfoundedness can be re-stated as

$$y_j \perp D \mid x \quad \forall j \in \Omega$$

in contrast to strong unconfoundedness which requires $\{y_j\}_{j \in \Omega}$, the full set of potential outcomes, to be conditionally independent

Generalized propensity score
- Now defined as the conditional density of $D$ given $x$
  $$r(j, x) = f(j \mid x)$$
- Implication (Hirano & Imbens 2004)
  $$y_j \perp D \mid r(j, x) \quad \forall j \in \Omega$$
- Estimation based on
  $$E[y_j] = E[E[y \mid D = j, r(j, x)]]$$
- Since $D$ is continuous, estimation entails
  - Estimation of $r(j, x)$
  - Estimate $E[y \mid D = j, r(j, x)]$ by regressing $y$ on $D$ and $\hat{r}(j, x)$
  - Average $\hat{E}[y \mid D = j, r(j, x)]$ over the dbn of $x$ (at a fixed value of $j$)

Weighting estimator version: see Robins (1998), Hernan et al. (2000)

See -doseresponse- in Stata

Stratification estimator version (Imai & van Dyk 2004)
- Regress $D$ on $x$ via OLS $\Rightarrow \hat{\theta} = \hat{E}[D \mid x] = x\hat{\beta}$
- Split sample into $K$ strata of equal size based on $\hat{\theta}$
- Within each stratum, model $y$ as a function of $D$ (and perhaps $x$ to further control for differences in $x$)
  - $y$ continuous: regress $y$ on $D$ and $x$
  - $y$ binary: probit/logit
  - $y$ ordered: oprobit/ologit
  - $y$ count: poisson, NB
  $\Rightarrow \hat{\Delta}^{ATE}_k$ given by coefficient on $D$
- Obtain overall $\Delta^{ATE}$ as
  $$\hat{\Delta}^{ATE} = \sum_k \frac{N_k}{N} \hat{\Delta}^{ATE}_k$$
- Generalizable to multiple treatment case (e.g., two continuous treatments: income, educ)

Selection on Observables
Dynamic Matching

Pertains to situations where agents receive an initial treatment or not, and then have the option of receiving a second treatment if they receive the first treatment

Many employment or job training programs, or treatments within schools, operate in this manner

Need to carefully consider the parameter of interest in these applications, as well as CIA at different stages of the problem

See work by Lechner (2009, JBES), Lechner and Miquel (2010, EE), Cooley et al. (2010), or Behrman et al. (2004, ReStat)

Selection on Observables
Regression Discontinuity

This estimator returns us to the class of binary treatments

First introduced in Thistlethwaite & Campbell (1960)

Two classes of models: sharp, fuzzy

Sharp RD is a selection on observables estimator, but is not based on strong ignorability (in fact, it precludes it)

Fuzzy RD is a selection on unobservables estimator (discussed later in the course)

Note: Recent work also on Regression Kink Design (Card, Lee, & Pei 2009)

RD setup
- Agents self-select into treatment group
- Selection done at least in part on the basis of an observed continuous variable, $s$
  - $s$ is referred to as the score, running variable, or forcing variable
- $s$ may directly impact potential outcomes as well
- There exists a discrete jump in $\Pr(D = 1)$ at a known value, $\bar{s}$

Thus, $s$ and $\bar{s}$ are both known to the econometrician

Sharp RD model

(SRD.i) Treatment assignment is a deterministic function of $s$ (with a known threshold, $\bar{s}$)
$$D_i = D(s_i) = \begin{cases} 1 & \text{if } s_i > \bar{s} \\ 0 & \text{otherwise} \end{cases}$$
(SRD.ii) Positive density at the threshold: $f_S(\bar{s}) > 0$
(SRD.iii) Outcomes are continuous in $s$ at least around $\bar{s}$
(SRD.iv) For each agent, the dbn of $s$ is continuous at least around $\bar{s}$

Notes
- (SRD.ii) implies we see agents near $\bar{s}$
- (SRD.iii) precludes discontinuities in $y$ at $\bar{s}$ due to other reasons besides changes in $D$
- (SRD.iv) implies that agents cannot perfectly manipulate $s$ to ensure $s \gtrless \bar{s}$
  - This is crucial to give the setup the interpretation of a random experiment in the neighborhood of $\bar{s}$

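Notes (cont.)
- $y_0, y_1 \perp D \mid s$ follows from (SRD.i)
- All RD estimators require existence of following limits
  $$D^+ = \lim_{s \downarrow \bar{s}} \Pr(D = 1 \mid s)$$
  $$D^- = \lim_{s \uparrow \bar{s}} \Pr(D = 1 \mid s)$$
  and $D^+ \neq D^-$
  - (SRD.i) implies $D^+ = 1$ and $D^- = 0$
- Common support condition is necessarily violated since
  $$\Pr(D = 1 \mid s_i) = \begin{cases} 1 & \text{if } s_i > \bar{s} \\ 0 & \text{otherwise} \end{cases}$$
  which implies that $\Pr(D = 1 \mid s) \notin (0, 1)$ $\forall s$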

Parameter of interest

$$\Delta^{ATE}(\bar{s}) = E[y_1 - y_0 \mid \bar{s}] = \lim_{s \downarrow \bar{s}} E[y \mid s] - \lim_{s \uparrow \bar{s}} E[y \mid s]$$

DiNardo & Lee (2011) advocate a different interpretation
- Argue that RD estimates a weighted average of $\Delta_i$ where the weights are proportional to the probability that an agent's $s_i$ is in the neighborhood of $\bar{s}$

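Estimation

Use only sub-sample with $s_i \in \{\bar{s} - \delta, \bar{s} + \delta\}$ for small $\delta$
- Similar $s$ $\Rightarrow$ similar observations
- Compute mean difference in outcomes across treatment groups

$$
\begin{aligned}
\hat{\Delta}^{ATE}(\bar{s}) &= \hat{E}[y_i \mid s_i \in \{\bar{s}, \bar{s} + \delta\}, D = 1] - \hat{E}[y_i \mid s_i \in \{\bar{s} - \delta, \bar{s}\}, D = 0] \\
&= \frac{\sum_{i=1}^{N} y_i\, I[s_i \in \{\bar{s}, \bar{s} + \delta\}, D_i = 1]}{\sum_{i=1}^{N} I[s_i \in \{\bar{s}, \bar{s} + \delta\}, D_i = 1]} - \frac{\sum_{i=1}^{N} y_i\, I[s_i \in \{\bar{s} - \delta, \bar{s}\}, D_i = 0]}{\sum_{i=1}^{N} I[s_i \in \{\bar{s} - \delta, \bar{s}\}, D_i = 0]} \\
&\overset{p}{\to} E[y_i \mid s_i \in \{\bar{s}, \bar{s} + \delta\}, D = 1] - E[y_i \mid s_i \in \{\bar{s} - \delta, \bar{s}\}, D = 0] \\
&= E[y_{1i} \mid s_i \in \{\bar{s}, \bar{s} + \delta\}, D = 1] - E[y_{0i} \mid s_i \in \{\bar{s} - \delta, \bar{s}\}, D = 0] \\
&= E[y_{1i} \mid s_i \in \{\bar{s}, \bar{s} + \delta\}] - E[y_{0i} \mid s_i \in \{\bar{s} - \delta, \bar{s}\}] \\
&\neq \lim_{s \downarrow \bar{s}} E[y \mid s] - \lim_{s \uparrow \bar{s}} E[y \mid s] \quad \text{for fixed } \delta > 0
\end{aligned}
$$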

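This is essentially a kernel estimator with a uniform kernel over the interval $\{\bar{s}, \bar{s} + \delta\}$ or $\{\bar{s} - \delta, \bar{s}\}$, which entails a non-negligible bias for $\delta > 0$

Example: If $y$ is increasing in $s$, then
- $\hat{E}[y_i \mid s_i \in \{\bar{s}, \bar{s} + \delta\}, D = 1]$ will overestimate $\lim_{s \downarrow \bar{s}} E[y \mid s]$
- $\hat{E}[y_i \mid s_i \in \{\bar{s} - \delta, \bar{s}\}, D = 0]$ will underestimate $\lim_{s \uparrow \bar{s}} E[y \mid s]$
$\Rightarrow \hat{\Delta}^{ATE}(\bar{s})$ will be biased up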

Regression approach
- Model
  $$y_i = \Delta D_i + \varepsilon_i$$
  where $D$ = treatment indicator, $\Delta$ = parameter of interest
- Model is not estimable via OLS since $Cov(D, \varepsilon) \neq 0$
- However, $E[\varepsilon \mid D, s] = E[\varepsilon \mid s]$
- Implies $\Delta$ is estimable if the model is augmented with a sufficiently flexible function of $s$ to proxy for $E[\varepsilon \mid s]$
  $$y_i = \Delta D_i + k(s_i) + \eta_i$$
  where $Cov(D, \eta) = 0$
- What is $k(s)$?
  - Linear: $k(s) = \theta s$ (Goldberger 1972; Cain 1975)
  - Quadratic: $k(s) = \theta_1 s + \theta_2 s^2$ (Berk & Rauma 1983; van der Klaauw 2000)
  - Semiparametric: $k(s) = \sum_{m=1}^{M} \theta_m s^m$, with $M$ chosen by cross-validation (Trochim 1984; van der Klaauw 2000)

Example:

[Figure: scatter of outcome against score (0 to 1), with fitted values from OLS of y on D and fitted values from OLS of y on s & D]

Note: S~U(0,1); D(s)=I(s>0.5); y=s+D+e; delta = 1
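The design in the note can be simulated directly to compare the two regressions. This is a sketch that assumes the error $e$ is $N(0, 0.3^2)$ (the note does not state its distribution) and uses a small hand-rolled OLS routine; all names are hypothetical.

```python
import random


def ols(X, y):
    """OLS coefficients from the normal equations (X'X) b = X'y,
    solved by Gauss-Jordan elimination with partial pivoting."""
    k = len(X[0])
    A = [
        [sum(row[a] * row[b] for row in X) for b in range(k)]
        + [sum(row[a] * yi for row, yi in zip(X, y))]
        for a in range(k)
    ]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [v - factor * w for v, w in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Simulated sharp RD: S ~ U(0,1), D = I(s > 0.5), y = s + D + e, true delta = 1.
rng = random.Random(7)
n = 2000
s = [rng.random() for _ in range(n)]
D = [1.0 if si > 0.5 else 0.0 for si in s]
y = [si + d + rng.gauss(0.0, 0.3) for si, d in zip(s, D)]

delta_y_on_D = ols([[1.0, d] for d in D], y)[1]                     # omits k(s)
delta_y_on_s_D = ols([[1.0, si, d] for si, d in zip(s, D)], y)[2]   # k(s) linear in s
print(delta_y_on_D, delta_y_on_s_D)
```

Regressing $y$ on $D$ alone attributes the difference in average $s$ across the cutoff (0.75 versus 0.25) to the treatment, so the first estimate should sit near 1.5 rather than 1; adding $s$ removes that bias in this linear design.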

Notes
- Testing of some of the underlying assumptions is feasible
  - Examine the density of $s$ to look for evidence of discontinuity at $\bar{s}$, suggesting manipulation by agents (McCrary 2008)
  - Look for existence of discontinuities in predetermined variables at $\bar{s}$ (similar to assessing balancing of predetermined variables in randomized experiments)
- If treatment effect is heterogeneous, then RD estimates a unique parameter (discussed above) that may be uninteresting
  - This is an example of a local average treatment effect (LATE)
  - May be a policy relevant parameter if the question is the impact of a marginal change in an eligibility cut-off, $\bar{s}$
- Applications: financial aid, GED, Clean Air Act attainment status
- See -rd- in Stata

Selection on Observables
Distributional Approaches

Analysis to this point has focused on mean effects of treatments

Averages may mask a lot of heterogeneity

Distributional methods seek to assess the effects of treatments on other quantities

Traditional approach is quantile regression (QR)

More recent approaches have been couched in the potential outcomes framework and focus on quantile treatment effects (QTE)

Selection on Observables
Distributional Approaches: Quantile Regression

Motivation
- QR provides a convenient linear framework for assessing the impact of changes in a vector of covariates on the quantiles of the dependent variable
- Equivalently, QR allows estimation of linear conditional quantile functions
- Analogous to linear regression, which estimates the conditional mean function
- Common applications
  - Studies of wage determination
  - Studies of student achievement

Notation
- $F(y)$ = CDF of $y$
- $Q_\theta(y)$ = $\theta$th quantile of the random variable, $y$, given by
  $$Q_\theta(y) = \inf\{y : F(y) \geq \theta\}$$

(Unconditional) quantiles as a minimization problem
- Prior to discussing QR, it is useful to view unconditional quantiles as a solution to a minimization problem
- Example: median
  $$Q_{0.5}(y) = \arg\min_b \sum_i |y_i - b|$$
  - Solution depends on the sign of the residuals, not the magnitude
  - $y = \{99, 100, 101\} \Rightarrow Q_{0.5}(y) = 100$; $y = \{99, 100, 150\} \Rightarrow Q_{0.5}(y) = 100$ as increasing $b$ closer to 150 reduces that residual, but increases the sum of the other two residuals by twice as much
  - Implies median is less sensitive to outliers than the mean
- General formula for any quantile $\theta \in (0, 1)$
  $$Q_\theta(y) = \arg\min_b \left\{\sum_{i: y_i \geq b} \theta\, |y_i - b| + \sum_{i: y_i < b} (1 - \theta)\, |y_i - b|\right\}$$
  - Quantiles other than the median are defined as the arg min of a weighted sum of the absolute residuals
  - Intuition: say $\theta = 0.75$ and $b$ = median, then problem puts more weight on residuals above $b$, which pushes the solution to the minimization problem above the median
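The minimization view can be checked on the small samples from the slide. This sketch relies on the fact that a minimizer of the weighted absolute-residual loss can always be found among the observed values, so a search over data points suffices; the function names are hypothetical.

```python
def check_loss(b, ys, theta):
    """Weighted sum of absolute residuals: theta above b, (1 - theta) below b."""
    return sum(theta * (y - b) if y >= b else (1.0 - theta) * (b - y) for y in ys)


def sample_quantile(ys, theta):
    """Smallest observed value that minimizes the check loss."""
    return min(sorted(ys), key=lambda b: check_loss(b, ys, theta))


median_a = sample_quantile([99, 100, 101], 0.5)
median_b = sample_quantile([99, 100, 150], 0.5)   # the outlier leaves the median unchanged
upper_b = sample_quantile([99, 100, 150], 0.75)   # more weight on residuals above b
print(median_a, median_b, upper_b)
```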

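QR model (Koenker & Bassett 1978)
- Replace $b$ in previous problem with a linear function of covariates
  $$\hat{\beta}_\theta = \arg\min_\beta \frac{1}{N} \left\{\sum_{i: y_i \geq x_i\beta} \theta\, |y_i - x_i\beta| + \sum_{i: y_i < x_i\beta} (1 - \theta)\, |y_i - x_i\beta|\right\}$$
  which may be rewritten as
  $$\hat{\beta}_\theta = \arg\min_\beta \frac{1}{N} \left\{\sum_i \rho_\theta(\varepsilon_{\theta i})\right\}$$
  where $\rho_\theta(\varepsilon_{\theta i})$ is known as the check function, defined as
  $$\rho_\theta(\varepsilon_{\theta i}) = [\theta - I(\varepsilon_{\theta i} < 0)]\, \varepsilon_{\theta i}$$
  and $\varepsilon_{\theta i}$ is the residual for $i$ and $\theta$
- Preceding objective fn is equivalent (after some algebra) to
  $$\hat{\beta}_\theta = \arg\min_\beta \frac{1}{N} \sum_i \left[\theta - \frac{1}{2} + \frac{1}{2}\, \text{sgn}(y_i - x_i\beta)\right](y_i - x_i\beta)$$
- Error distribution
  - Key assumption: $Q_\theta(\varepsilon_\theta \mid x) = 0$
  - No other assumption about the distribution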

Estimation

The objective fn is not differentiable $\Rightarrow$ standard optimization methods are not viable

Solved using linear programming methods

GMM estimation is also feasible (Buchinsky 1998)

Special case: median regression
- Corresponds to QR model with $\theta = 0.5$; $\hat{\beta}$ obtained from
  $$\hat{\beta}_{0.5} = \arg\min_\beta \frac{1}{N} \left\{\sum_i |y_i - x_i\beta|\right\}$$
- Analogous to OLS, but $\hat{\beta}$ minimizes the sum of absolute errors instead of sum of squared errors
- Also known as LAD (Least Absolute Deviations) estimator
- Useful alternative to OLS, particularly when the distribution of the error term is symmetric (so the conditional mean and median are equal), yet outliers are a concern
- Also useful when $y$ is imputed for some obs

Inference

Using a GMM framework, can show

$$\sqrt{N}(\hat{\beta}_\theta - \beta_\theta) \to N(0, \Lambda_\theta)$$

where

$$\Lambda_\theta = \omega^2(\theta)(x'x)^{-1}$$

$$\omega^2(\theta) = \frac{\theta(1 - \theta)}{f^2(F^{-1}(\theta))}$$

and $f(F^{-1}(\theta))$ denotes the density of the error distribution evaluated at the $\theta$th quantile

Intuition:
- Estimation of the $\theta$th conditional quantile uses only obs near the $\theta$th quantile
- Asymptotically, obs are added in this range in a manner proportional to $f(F^{-1}(\theta))$ assuming iid errors

Utilizing the asymptotic formula for inference is difficult in practice

Bootstrap methods provide a simpler alternative (Buchinsky 1998)

Results

Parameters of interest are the partial derivatives of the conditional quantile fn w.r.t. $x$

$$\frac{\partial E[Q_\theta(y \mid x)]}{\partial x_k}$$

which equals $\beta_{\theta k}$ if $x$ enters linearly

Presentation of results
- Difficult as there are a large number of results that are possible to obtain (i.e., $\beta_{\theta k}$, $k = 1, ..., K$ and $\theta \in (0, 1)$)
- Possibilities
  - Typical table of coefficient estimates at several quantiles (typically $\theta$ = 0.10, 0.25, 0.50, 0.75, and 0.90)
  - Graph the conditional quantile fns against $x_k$ if there is one $x$ that is the focus of the paper (again, typically for a few quantiles)
  - Graph $\hat{\beta}_{\theta k}$ vs. $\theta$ for several different $x$'s on one graph (only works if $x_k$ enters linearly)

Sequential estimation
- In practice, one typically wishes to estimate $\hat{\beta}_\theta$ for multiple values of $\theta$
- Estimates are not independent since they are obtained from the same data
- Estimation one equation at a time, however, is efficient unless there are cross-equation restrictions (e.g., one might wish for a type of "smooth coefficient" model)

Stata: -qreg-, -bsqreg-, -sqreg-, -grqreg- (for graphing), -qcount- (for count data models), -lqreg- (for logistic models)

Selection on Observables
Distributional Approaches: Quantile Treatment Effects

Notation
- $y_{1i}, y_{0i}$ = potential outcomes for $i$
- $D_i$ = binary indicator of treatment assignment
- $F_j(y) \equiv \Pr[y_{ji} \leq y]$, $j = 0, 1$ = CDFs of potential outcomes
- $y_j^\theta = \inf\{y_j : F_j(y) \geq \theta\}$ = quantiles of potential outcome dbns

Parameters of interest

$$\Delta^{QTE}_\theta = E[y_1^\theta - y_0^\theta], \quad \theta \in (0, 1)$$
$$\Delta^{QTT}_\theta = E[y_1^\theta - y_0^\theta \mid D = 1], \quad \theta \in (0, 1)$$
$$\Delta^{QTU}_\theta = E[y_1^\theta - y_0^\theta \mid D = 0], \quad \theta \in (0, 1)$$

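Interpretation

Constant treatment effect assumption
- $y_{1i} = y_{0i} + \Delta$ $\forall i$
- Implies $F_1^{-1}(\theta) = F_0^{-1}(\theta) + \Delta$

[Figure: CDFs F(y) of y1 and y0 plotted over y from -4 to 4]

NOTE: y0~N(0,1); y1=y0+1

- $\Delta^{QTE}_\theta = \Delta^{QTT}_\theta = \Delta^{QTU}_\theta = \Delta$ $\forall \theta \in (0, 1)$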

Heterogeneous treatment effects
I $y_{1i} = y_{0i} + \Delta_i$
I Perfect rank correlation (Heckman et al. 1997)
F Definition: $F_1(y_{1i}) = F_0(y_{0i}) \;\; \forall i$
F Intuition: each observation lies in the identical quantile in both potential outcome dbns, which implies that y1 is a monotone transformation of y0
F Implication: $\Delta^{QTE}_\theta = E[y_1^\theta - y_0^\theta] = Q_\theta(\Delta)$, which is the θth quantile of the dbn of Δ, which implies that QTEs identify the distribution of the treatment effect, BUT this requires a strong assumption about the joint dbn of potential outcomes
I No perfect rank correlation
F No assumption about the joint dbn of potential outcomes
F Implication: $\Delta^{QTE}_\theta = E[y_1^\theta - y_0^\theta] \neq Q_\theta(\Delta)$, which implies that QTEs identify the difference in the two marginal dbns of the potential outcomes, BUT say nothing about the dbn of actual treatment effects ... QTEs reflect the effect of D on quantiles of the potential outcome dbns, NOT on observations at particular quantiles.

Example #1...

ID   y0   y1   Δ
1    1    2    1
2    2    4    2
3    3    6    3
4    4    8    4
5    5    10   5

Rank preservation holds; $\Delta_i$ varies

CDFs of y0, y1 are not identical ⇒ $\Delta^{QTE}_\theta$ varies with θ

$\Delta^{QTE}_\theta = Q_\theta(\Delta)$

Example #2...

ID   y0   y1   Δ
1    1    1    0
2    2    4    2
3    3    3    0
4    4    2    -2
5    5    5    0

Rank preservation is violated; $\Delta_i$ varies

CDFs of y0, y1 are identical ⇒ $\Delta^{QTE}_\theta = 0 \;\; \forall \theta$

$\Delta^{QTE}_\theta \neq Q_\theta(\Delta)$
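The two examples can be checked numerically. The sketch below is illustrative only: it uses the five hypothetical observations from the tables, and it treats the k-th order statistic as the sample quantile (an assumption made here for simplicity, since both columns have five observations). The function names are not from the notes.

```python
# Toy potential outcomes copied from Examples #1 and #2. Both columns are
# "observed" only because the data are hypothetical.
example_1 = {"y0": [1, 2, 3, 4, 5], "y1": [2, 4, 6, 8, 10]}
example_2 = {"y0": [1, 2, 3, 4, 5], "y1": [1, 4, 3, 2, 5]}


def qte_by_rank(y0, y1):
    """Difference between the k-th order statistics of y1 and y0.

    This compares the two marginal distributions quantile by quantile.
    """
    return [b - a for a, b in zip(sorted(y0), sorted(y1))]


def quantiles_of_effect(y0, y1):
    """Order statistics of the unit-level effects Delta_i = y1_i - y0_i."""
    return sorted(b - a for a, b in zip(y0, y1))


qte_1 = qte_by_rank(**example_1)
delta_1 = quantiles_of_effect(**example_1)
qte_2 = qte_by_rank(**example_2)
delta_2 = quantiles_of_effect(**example_2)

print("Example 1: QTE =", qte_1, " quantiles of Delta =", delta_1)
print("Example 2: QTE =", qte_2, " quantiles of Delta =", delta_2)
```

In Example #1 the two lists coincide, as rank preservation implies. In Example #2 the QTEs are all zero even though individual effects range from -2 to 2.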

Estimation

Identification assumptions: strong ignorability (CIA, CS)

$y_i = D_i y_{1i} + (1 - D_i) y_{0i}$ = observed outcome

$\hat\Delta_\theta$ obtained using sample analogues of $y_1^\theta$ and $y_0^\theta$

Obtain $\hat F_j(y)$, $j = 0, 1$

$$\hat F_j(y) = \frac{1}{\sum_i I(D_i = j)} \sum_i I(D_i = j)\, I(y_i \leq y) \quad \text{(unconditional)}$$

$$\hat F_j(y) = \frac{\sum_{i \in j} \hat\omega_i\, I(y_i \leq y)}{\sum_{i \in j} \hat\omega_i} \quad \text{(covariates)}$$

$$\hat\omega_i = \frac{D_i}{\hat p_i(x_i)} + \frac{1 - D_i}{1 - \hat p_i(x_i)} \quad \text{(QTE)}$$

$$\hat\omega_i = D_i + \frac{\hat p_i(x_i)(1 - D_i)}{1 - \hat p_i(x_i)} \quad \text{(QTT)}$$

$$\hat\omega_i = \frac{[1 - \hat p_i(x_i)] D_i}{\hat p_i(x_i)} + 1 - D_i \quad \text{(QTU)}$$

where $\hat p_i(x_i)$ is the propensity score and x is the vector such that CIA holds

$\hat y_1^\theta = \inf\{y : \hat F_1(y) > \theta\}$; similarly for $\hat y_0^\theta$

Implies $\hat\Delta^{QT}_\theta = \hat y_1^\theta - \hat y_0^\theta$

Inference based on bootstrap
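A minimal sketch of the weighted estimator above, assuming the propensity scores have already been estimated and are passed in as a list `p`. The helper names (`weighted_quantile`, `qte_ipw`) and the toy data are hypothetical and chosen only for illustration. The QTE weights reduce to 1/p for treated observations and 1/(1 - p) for controls.

```python
def weighted_quantile(y, w, theta):
    """inf{y : F_hat(y) > theta}, with F_hat the weighted empirical CDF."""
    pairs = sorted(zip(y, w))
    total = sum(w)
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative / total > theta:
            return value
    return pairs[-1][0]  # theta at or above the top of the CDF


def qte_ipw(y, d, p, theta):
    """Difference in IPW-weighted quantiles between treated and controls."""
    y1 = [yi for yi, di in zip(y, d) if di == 1]
    w1 = [1.0 / pi for di, pi in zip(d, p) if di == 1]
    y0 = [yi for yi, di in zip(y, d) if di == 0]
    w0 = [1.0 / (1.0 - pi) for di, pi in zip(d, p) if di == 0]
    return weighted_quantile(y1, w1, theta) - weighted_quantile(y0, w0, theta)


# Hypothetical data: five treated and five control observations, with a
# constant propensity score so the weights do not alter the ranking.
y = [2, 4, 6, 8, 10, 1, 2, 3, 4, 5]
d = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
p = [0.5] * 10

median_qte = qte_ipw(y, d, p, theta=0.5)
print("Estimated QTE at the median:", median_qte)
```

Bootstrap inference would re-estimate the propensity score and the quantiles on each resample; that step is omitted here.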

Test of equal CDFs (Abadie 2002)
I Equivalent to test for $H_o : \Delta_\theta = 0 \;\; \forall \theta \in (0, 1)$
I Utilize Kolmogorov-Smirnov statistic

$$d_{eq} = \sqrt{\frac{N}{2}}\, \sup |F_1(y) - F_0(y)|$$

I Compute $\hat d_{eq} = \sqrt{\frac{N}{2}} \max_k \left\{ |\hat F_1(y_k) - \hat F_0(y_k)| \right\}$ for a grid of points, $k = 1, ..., K$, in the support of $y_i$
I Inference for test of equality using bootstrap

Stata: -dbn- (my code)
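The grid computation can be sketched as follows. This is an illustration under stated assumptions: unweighted empirical CDFs are used, the scaling constant N is passed in by the caller exactly as it appears in the formula above, and the function names are hypothetical. The bootstrap step for inference is not shown.

```python
import math


def ecdf(sample, point):
    """Unweighted empirical CDF of `sample` evaluated at `point`."""
    return sum(1 for v in sample if v <= point) / len(sample)


def ks_equality_stat(y1, y0, grid, n):
    """sqrt(n / 2) times the largest absolute CDF gap over the grid."""
    gap = max(abs(ecdf(y1, yk) - ecdf(y0, yk)) for yk in grid)
    return math.sqrt(n / 2) * gap


# Hypothetical samples: identical distributions, then fully separated ones.
same = ks_equality_stat([1, 2, 3, 4], [1, 2, 3, 4], grid=range(1, 5), n=8)
apart = ks_equality_stat([5, 6, 7, 8], [1, 2, 3, 4], grid=range(1, 9), n=8)
print(same, apart)
```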

Selection on Observables
Distributional Approaches: Stochastic Dominance

In the event the QTEs differ in sign or significance across the dbn, may be interested in "ranking" dbns

Definitions

I First Order Stochastic Dominance: $Y_1$ FSD $Y_0$ iff

$$F_1(y) \leq F_0(y) \;\; \forall y \in \mathcal{Y}$$

with strict inequality for some y (where $\mathcal{Y}$ is the union of the supports for $Y_1$ and $Y_0$), or

$$y_1^\theta \geq y_0^\theta \;\; \forall \theta \in [0, 1]$$

with strict inequality for some θ

I Second Order Stochastic Dominance: $Y_1$ SSD $Y_0$ iff

$$\int_{-\infty}^{y} F_1(t)\,dt \leq \int_{-\infty}^{y} F_0(t)\,dt \;\; \forall y \in \mathcal{Y}, \text{ or}$$

$$\int_0^\theta y_1^t\,dt \geq \int_0^\theta y_0^t\,dt \;\; \forall \theta \in [0, 1]$$

with strict inequality for some y or θ

Example: FSD... (y1 ~ N(1, 1); y0 ~ N(0, 1))

[Figure: left panel plots the Control and Treatment CDFs, F(x), over the support from -4 to 4; right panel plots the quantile treatment effect (Treatment - Control) against the quantile, 0 to 100.]

Example: SSD... (y1 ~ N(0.25, 0.25); y0 ~ N(0, 1))

[Figure: left panel plots the Control and Treatment CDFs, F(x), over the support from -4 to 4; right panel plots the quantile treatment effect (Treatment - Control) against the quantile, 0 to 100.]

FSD ⇒ SSD

Third and higher order rankings exist

Any two dbns can be ranked at some order of SD

Implications
I Notation
F $W_1$ = class of social welfare fns that are increasing in y
F $W_2$ = sub-class of $W_1$ that includes all social welfare fns that are also concave in y
I X FSD Y ⇒ X is at least as preferred by all welfare functions in $W_1$, with strict inequality holding for some welfare function in the class
I X SSD Y ⇒ X is at least as preferred by all welfare functions in $W_2$, with strict inequality holding for some welfare function in the class

Test statistics

$$d = \min \sup_{z \in \mathcal{Y}} [F(z) - G(z)]$$

$$s = \min \sup_{z \in \mathcal{Y}} \int_{-\infty}^{z} [F(t) - G(t)]\,dt$$

where min is taken over $F - G$ and $G - F$

Tests are based on estimates of d and s using the empirical CDFs
I Unconditional, or
I Inverse propensity score weighted

Inference using bootstrap (simple and/or more complex methods)
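A small sketch of the sample analogue of d, assuming unconditional (unweighted) empirical CDFs evaluated at the pooled sample points. The function names and data are hypothetical. A non-positive value means one empirical CDF lies weakly below the other everywhere on the grid, which is consistent with first-order dominance in the sample; whether that carries over to the population is what the bootstrap is for.

```python
def ecdf(sample, point):
    """Unweighted empirical CDF of `sample` evaluated at `point`."""
    return sum(1 for v in sample if v <= point) / len(sample)


def fsd_statistic(sample_f, sample_g):
    """min over (F - G, G - F) of the sup of the empirical CDF difference."""
    grid = sorted(set(sample_f) | set(sample_g))
    sup_fg = max(ecdf(sample_f, z) - ecdf(sample_g, z) for z in grid)
    sup_gf = max(ecdf(sample_g, z) - ecdf(sample_f, z) for z in grid)
    return min(sup_fg, sup_gf)


# Hypothetical samples: a pure location shift, then two crossing CDFs.
d_shift = fsd_statistic([2, 3, 4, 5, 6], [1, 2, 3, 4, 5])
d_cross = fsd_statistic([1, 5], [2, 3])
print(d_shift, d_cross)
```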

Selection on Unobservables

When all xs required for CIA to hold are not observed, then one enters into "selection on unobservables" world

Implies unobservable attributes of obs i are correlated with both potential outcomes and treatment assignment of obs i

In general, this implies

$$E[y_j \mid x, D = j] \neq E[y_j \mid x, D = j'], \quad j, j' = 0, 1$$

In a regression framework, with functional form assumptions, this implies

$$y_i = D_i y_{1i} + (1 - D_i) y_{0i} = \alpha_0 + x_i\beta_0 + (\alpha_1 - \alpha_0) D_i + x_i D_i (\beta_1 - \beta_0) + [\upsilon_{0i} + D_i(\upsilon_{1i} - \upsilon_{0i})]$$

where SOU results if
I $Cov(D, \upsilon_0) \neq 0$ ⇒ selection on unobservables impacting outcome in untreated state, or
I $Cov(D, \upsilon_1 - \upsilon_0) \neq 0$ ⇒ presence of and selection on unobserved, obs-specific gains from treatment

Possible solutions
1 Bound treatment effects ("set identification" as opposed to "point identification") under minimal assumptions
2 Utilize panel data
3 Utilize exclusion restrictions (i.e., instrumental variables)
4 Model dependence between treatment and unobservables ⇒ control function approach
5 Other methods that "find" identification elsewhere

Selection on Unobservables
Bounding Treatment Effects

Recall, the ATE

$$\Delta^{ATE}(x) = E[y_1 - y_0 \mid x] = E[y_1 \mid x] - E[y_0 \mid x]$$
$$= \{E[y_1 \mid x, D = 1]\Pr(D = 1 \mid x) + E[y_1 \mid x, D = 0]\Pr(D = 0 \mid x)\} - \{E[y_0 \mid x, D = 1]\Pr(D = 1 \mid x) + E[y_0 \mid x, D = 0]\Pr(D = 0 \mid x)\}$$
$$= \{g_1(x) - E[y_0 \mid x, D = 1]\}\,p(x) + \{E[y_1 \mid x, D = 0] - g_0(x)\}[1 - p(x)]$$

where p(x), the propensity score, and $g_j(x)$, $j = 0, 1$, are all observable from the data

Similar derivation for other two primary mean treatment effect parameters

$$\Delta^{ATT}(x) = g_1(x) - E[y_0 \mid x, D = 1]$$
$$\Delta^{ATU}(x) = E[y_1 \mid x, D = 0] - g_0(x)$$

Thus, without additional information, no parameter is identified

Early bounding approach outlined in Smith and Welch (1986)
I Objective was to estimate the average wage for blacks accounting for selection into LF

$$E[w] = E[w \mid LF = 1]\Pr(LF = 1) + E[w \mid LF = 0]\Pr(LF = 0)$$

where $E[w \mid LF = 0]$ is not observed
I Solution: $E[w \mid LF = 0] = \gamma E[w \mid LF = 1]$, $\gamma \in [0.5, 1]$
I In treatment effects context, can specify $E[y_d \mid D = d'] = \gamma E[y_d \mid D = d]$ for different values of γ, where $d \neq d'$
I Rosenbaum (2002) summarizes other papers that bound causal effects by varying the unobserved parameters

More recent approaches focus on adding assumptions to tighten thebounds on the parameter of interest

Notation (Lechner 1999; Manski 1990)
I $L_1, L_0$ = lower bounds of the support of y1, y0, respectively
I $U_1, U_0$ = upper bounds of the support of y1, y0, respectively
I $B^L_k, B^U_k$ = lower, upper bounds, respectively, of treatment effect k (k = ATE, ATT, or ATU)
I $w_k = B^U_k - B^L_k$ = width of bounds for treatment effect k

Trivial case
I No additional information

$$B^L_k = L_1 - U_0$$
$$B^U_k = U_1 - L_0$$
$$w_k = (U_1 - L_0) - (L_1 - U_0) = (U_1 - L_1) + (U_0 - L_0)$$

I Example: y is binary (e.g., employment after job training program)

$$L_1 = L_0 = 0, \quad U_1 = U_0 = 1$$
$$B^L_k = -1, \quad B^U_k = 1, \quad w_k = 2$$

Tightening bounds with data

Use sample data
I p(x), $g_0(x)$, $g_1(x)$ may be consistently estimated from the data by
F Sample means
F Nonparametric smoothing methods
F Parametric methods

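New bounds with sample data
I $\Delta^{ATE}(x)$

$$B^L_{ATE} = \{\hat g_1(x) - U_0\}\,\hat p(x) + \{L_1 - \hat g_0(x)\}[1 - \hat p(x)]$$
$$B^U_{ATE} = \{\hat g_1(x) - L_0\}\,\hat p(x) + \{U_1 - \hat g_0(x)\}[1 - \hat p(x)]$$
$$w_{ATE} = (U_1 - L_1)[1 - \hat p(x)] + (U_0 - L_0)\,\hat p(x)$$

I $\Delta^{ATT}(x)$

$$B^L_{ATT} = \hat g_1(x) - U_0$$
$$B^U_{ATT} = \hat g_1(x) - L_0$$
$$w_{ATT} = U_0 - L_0$$

I $\Delta^{ATU}(x)$

$$B^L_{ATU} = L_1 - \hat g_0(x)$$
$$B^U_{ATU} = U_1 - \hat g_0(x)$$
$$w_{ATU} = U_1 - L_1$$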
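The bounds above are simple arithmetic once the three estimable quantities are in hand. The sketch below assumes a binary outcome by default (support bounds 0 and 1) and takes hypothetical values for the estimates of g1(x), g0(x), and p(x); the function name and inputs are illustrative only.

```python
def worst_case_bounds(g1, g0, p, low1=0.0, up1=1.0, low0=0.0, up0=1.0):
    """No-assumption bounds on ATE, ATT, and ATU as (lower, upper) pairs.

    g1, g0: estimated mean outcomes among treated and untreated.
    p: estimated propensity score.
    low*/up*: bounds of the support of y1 and y0.
    """
    ate = ((g1 - up0) * p + (low1 - g0) * (1 - p),
           (g1 - low0) * p + (up1 - g0) * (1 - p))
    att = (g1 - up0, g1 - low0)
    atu = (low1 - g0, up1 - g0)
    return {"ATE": ate, "ATT": att, "ATU": atu}


# Hypothetical estimates for a binary outcome.
bounds = worst_case_bounds(g1=0.6, g0=0.4, p=0.5)
for name, (lower, upper) in bounds.items():
    print(name, round(lower, 3), round(upper, 3), "width", round(upper - lower, 3))
```

With a binary outcome each interval has width one and contains zero, whatever the data say.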

Example: y is binary ⇒ $w_k = 1 \;\; \forall k$ (sample data cuts width in half)

Note: Bounds necessarily include zero
I Cannot rule out zero average treatment effect
I Can exclude some extreme values
I Full characterization of the bounds should also account for uncertainty in the variables belonging in x and the model used to estimate $g_0(x)$, $g_1(x)$, and p(x) (Heckman et al. 1999)
F While bounds conditional on x and a model, m, all have width one, the exact bounds are affected
I Kreider, Pepper, and co-authors incorporate measurement error in D into the bounds (discussed later)

Tightening bounds with assumptions

Assume $\Delta^{ATT}(x) = \Delta^{ATU}(x)$
I Calculate bounds for $\Delta^{ATT}(x)$ and $\Delta^{ATU}(x)$
I New bounds include only the intersection of the two bounds
I Example

$$\Delta^{ATT}(x) \in [-0.25, 0.75]$$
$$\Delta^{ATU}(x) \in [-0.75, 0.25]$$

then new bounds are $[-0.25, 0.25]$
I Note: still necessarily include zero since bounds on $\Delta^{ATT}(x)$, $\Delta^{ATU}(x)$ both include zero

Level-set restrictions: treatment effects are constant $\forall x \in X_0 \subseteq X$ (the support of x)
I Calculate bounds for $\Delta_k(x) \;\; \forall x \in X_0$
I New bounds include only the intersection of these bounds
I Example ($\Delta^{ATE}$)

$$\Delta^{ATE}(x_a) \in [-0.25, 0.75]$$
$$\Delta^{ATE}(x_b) \in [-0.75, 0.25]$$

where $x_a, x_b \in X_0$, then new bounds are $[-0.25, 0.25]$
I Note: still necessarily include zero since bounds on $\Delta_k(x)$ include zero $\forall x$
I Formally

$$B^L_k(X_0) = \sup_{x \in X_0} B^L_k(x)$$
$$B^U_k(X_0) = \inf_{x \in X_0} B^U_k(x)$$
$$w_k(X_0) = B^U_k(X_0) - B^L_k(X_0)$$
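The intersection rule is the same whether the intervals come from different values of x in a level set or from different parameters assumed equal. A minimal sketch, with a hypothetical function name and the interval values from the example:

```python
def intersect_bounds(intervals):
    """Largest lower bound, smallest upper bound, and the resulting width."""
    lower = max(lo for lo, _ in intervals)
    upper = min(up for _, up in intervals)
    return lower, upper, upper - lower


# Intervals from the example: bounds at x_a and x_b within the level set.
combined = intersect_bounds([(-0.25, 0.75), (-0.75, 0.25)])
print(combined)
```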

Level-set restrictions: expected outcomes are constant $\forall x \in X_{0,1} \subseteq X$ (for y1) and $\forall x \in X_{0,0} \subseteq X$ (for y0)
I Implies

$$E[y_1 \mid x] \text{ is constant } \forall x \in X_{0,1}$$
$$E[y_0 \mid x] \text{ is constant } \forall x \in X_{0,0}$$

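⇒ Bounds become

$$B^L_{ATE}(x_0) = \sup_{x \in X_{0,1}} \{\hat g_1(x)\hat p(x) + L_1[1 - \hat p(x)]\} - \inf_{x \in X_{0,0}} \{\hat g_0(x)[1 - \hat p(x)] + U_0\,\hat p(x)\}$$
$$B^U_{ATE}(x_0) = \inf_{x \in X_{0,1}} \{\hat g_1(x)\hat p(x) + U_1[1 - \hat p(x)]\} - \sup_{x \in X_{0,0}} \{\hat g_0(x)[1 - \hat p(x)] + L_0\,\hat p(x)\}$$
$$B^L_{ATT}(x_0) = \sup_{x \in X_{0,1}} \{\hat g_1(x)\} - \inf_{x \in X_{0,0}} \{U_0\}$$
$$B^U_{ATT}(x_0) = \inf_{x \in X_{0,1}} \{\hat g_1(x)\} - \sup_{x \in X_{0,0}} \{L_0\}$$
$$B^L_{ATU}(x_0) = \sup_{x \in X_{0,1}} \{L_1\} - \inf_{x \in X_{0,0}} \{\hat g_0(x)\}$$
$$B^U_{ATU}(x_0) = \inf_{x \in X_{0,1}} \{U_1\} - \sup_{x \in X_{0,0}} \{\hat g_0(x)\}$$

where $x_0 \in X_{0,1} \cap X_{0,0}$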

Assumption: positive selection
I Implies

$$E[y_1 \mid x, D = 1] > E[y_0 \mid x, D = 1]$$

which means that the treated only join the treatment group if there are non-negative gains on average
I Bounds become

$$B^L_{ATE} = \{L_1 - \hat g_0(x)\}[1 - \hat p(x)]$$
$$B^U_{ATE} = \{\hat g_1(x) - L_0\}\,\hat p(x) + \{U_1 - \hat g_0(x)\}[1 - \hat p(x)]$$
$$B^L_{ATT} = 0$$
$$B^U_{ATT} = \hat g_1(x) - L_0$$

I Does not affect bounds on $\Delta^{ATU}(x)$

Combining assumptions, restrictions

$$B^L_{k,combine} = \max_{p \in \Psi} \{B^L_{k,p}\}$$
$$B^U_{k,combine} = \min_{p \in \Psi} \{B^U_{k,p}\}$$

where Ψ is the set of restrictions being combined

Inference via bootstrap
I Yields confidence intervals for the bounds, not the treatment effect
I For example, a 90% CI implies that the probability that the true bounds lie in the CI is 90%; the probability that the true treatment effect lies in the CI is even higher (see also Imbens & Manski (2004))

Tightening bounds (again)

Manski (1990), Manski & Pepper (2000) consider additional assumptions

1 Instrument
$$E[y_j \mid z] = E[y_j], \quad j = 0, 1$$

2 Monotone Instrument
$$z_1 \leq z \leq z_2 \Rightarrow E[y_j \mid Z = z_1] \leq E[y_j \mid Z = z] \leq E[y_j \mid Z = z_2], \quad j = 0, 1$$

3 Monotone Treatment Selection
$$E[y_j \mid D = 1] \geq E[y_j \mid D = 0], \quad j = 0, 1$$

4 Monotone Treatment Response
$$y_0 \leq y_1 \Rightarrow E[y_0] \leq E[y_1]$$

where x is omitted for notational convenience

Use of an instrument
I $E[y_j \mid z] = E[y_j]$, $j = 0, 1$, implies

$$E[y_j] \in \Big[ \sup_z \{E[y \mid D = j, Z = z]\Pr(D = j \mid Z = z) + L_j \Pr(D \neq j \mid Z = z)\},$$
$$\inf_z \{E[y \mid D = j, Z = z]\Pr(D = j \mid Z = z) + U_j \Pr(D \neq j \mid Z = z)\} \Big]$$

I Bounds for $\Delta^{ATE}$ become

$$B^L_{ATE} = \sup_z \{\hat g_1(z)\hat p(z) + L_1[1 - \hat p(z)]\} - \inf_z \{\hat g_0(z)[1 - \hat p(z)] + U_0\,\hat p(z)\}$$
$$B^U_{ATE} = \inf_z \{\hat g_1(z)\hat p(z) + U_1[1 - \hat p(z)]\} - \sup_z \{\hat g_0(z)[1 - \hat p(z)] + L_0\,\hat p(z)\}$$

I Bounds are tighter than worst case bounds if $p(z) \neq \Pr(D = 1)$; i.e., z is correlated with treatment assignment

Use of a monotone instrument (MIV)
I $z_1 \leq z \leq z_2 \Rightarrow E[y_j \mid Z = z_1] \leq E[y_j \mid Z = z] \leq E[y_j \mid Z = z_2]$, $j = 0, 1$
F Weaker assumption than the prior, mean independence assumption
F Implies that potential outcomes are non-decreasing in z
I Implies

$$E[y_j] \in \Bigg[ \sum_{z \in Z} \Pr(Z = z) \Big( \sup_{z_1 \leq z} \{E[y \mid D = j, Z = z_1]\Pr(D = j \mid Z = z_1) + L_j \Pr(D \neq j \mid Z = z_1)\} \Big),$$
$$\sum_{z \in Z} \Pr(Z = z) \Big( \inf_{z_2 \geq z} \{E[y \mid D = j, Z = z_2]\Pr(D = j \mid Z = z_2) + U_j \Pr(D \neq j \mid Z = z_2)\} \Big) \Bigg]$$

I Bounds derived based on this

Monotone treatment selection (MTS)
I $E[y_j \mid D = 1] \geq E[y_j \mid D = 0]$, $j = 0, 1$, implies that the treated group has weakly higher potential outcomes in all treatment states
I Plausible in certain cases when one does not condition on x and x is correlated with both D and $y_j$ in the same direction
I Implies

$$E[y_j] \in \big[ E[y \mid D = j]\Pr(D \geq j) + L_j \Pr(D < j),\; E[y \mid D = j]\Pr(D \leq j) + U_j \Pr(D > j) \big]$$

Monotone treatment response (MTR)
I $y_0 \leq y_1 \Rightarrow E[y_0] \leq E[y_1]$ implies we know the sign of the treatment effect (inclusive of zero)
I Implies $\Delta^{ATE} \geq 0$
I Stronger than the positive selection assumption previously as that only applied to the sub-sample with D = 1

MIV can be combined with MTS, MTR

Methodology can also be combined with assumptions concerning measurement error (discussed later)

Stata: -bpbounds- (related)

Selection on Unobservables
Altonji et al. Approach

Altonji et al. (2005) offer two approaches to assess the sensitivity of estimates obtained under SOO assumption when this assumption is false

Approach #1 is applicable to the case of a binary outcome

Approach #2 is applicable regardless of type of outcome

Krauth (2011) attempts to extend the approach

Approach #1: Bivariate probit model

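Model

$$y^*_i = x_i\beta + \tau D_i + \varepsilon_i$$
$$D^*_i = x_i\gamma + \mu_i$$

where $\varepsilon, \mu \sim N(0, 0, 1, 1, \rho)$ and

$$y = \begin{cases} 1 & \text{if } y^* > 0 \\ 0 & \text{otherwise} \end{cases} \qquad D = \begin{cases} 1 & \text{if } D^* > 0 \\ 0 & \text{otherwise} \end{cases}$$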

Estimation by ML

$$\ln L = \sum_{i:\{y=1,D=1\}} \ln[\Phi_2(x_i\beta + \tau,\, x_i\gamma,\, \rho)] + \sum_{i:\{y=1,D=0\}} \ln[\Phi_2(x_i\beta,\, -x_i\gamma,\, -\rho)]$$
$$+ \sum_{i:\{y=0,D=1\}} \ln[\Phi_2(-x_i\beta - \tau,\, x_i\gamma,\, -\rho)] + \sum_{i:\{y=0,D=0\}} \ln[\Phi_2(-x_i\beta,\, -x_i\gamma,\, \rho)]$$

Model is technically identified with no exclusion restriction, but treat ρ as unidentified

Assessing treatment effect as ρ varies provides evidence of sensitivity to selection on unobservables

Constrain ρ > 0 ⇒ positive selection; ρ < 0 ⇒ negative selection

Approach #2: SOU relative to SOO

Intuition is to assess how much SOU, relative to the amount of SOO, is needed to fully explain the observed positive association between D and y

If

(AET.i) Random observables: x is a random subset of all factors, w, influencing y

(AET.ii) Equally important factors: the number of elements in w is large and no single factor has an undue influence on y

(AET.iii) Relationship between x and unobservables: slightly weaker technical assumption than independence between x and remaining elements of w

then one should expect the amount of selection controlled for by x to equal the amount of selection on unobservables

Implies that if the amount of SOU needed to explain the observed association is less than amount of SOO, the estimated treatment effect should not be viewed as robust

Model for outcome

$$y_i = x_i\beta + \tau D_i + \varepsilon_i$$

The (normalized) amount of SOU is given by

$$\frac{E[\varepsilon \mid D = 1] - E[\varepsilon \mid D = 0]}{Var(\varepsilon)}$$

The (normalized) amount of SOO, ignoring the impact of D, is given by

$$\frac{E[x\beta \mid D = 1] - E[x\beta \mid D = 0]}{Var(x\beta)}$$

The goal is to assess how large SOU must be relative to SOO to fully account for the positive treatment effect estimated under exogeneity

Express actual treatment participation as

$$D_i = x_i\gamma + \mu_i$$

plim of OLS estimator of τ is

$$\text{plim}\, \hat\tau = \tau + \frac{Cov(\mu, \varepsilon)}{Var(\mu)} = \tau + \frac{Var(D)}{Var(\mu)} \{E[\varepsilon \mid D = 1] - E[\varepsilon \mid D = 0]\}$$

Under the assumption that SOO = SOU, the asymptotic bias term is

$$\frac{Cov(\mu, \varepsilon)}{Var(\mu)} = \frac{Var(D)}{Var(\mu)} \left\{ \frac{E[x\beta \mid D = 1] - E[x\beta \mid D = 0]}{Var(x\beta)} \right\} Var(\varepsilon)$$

This bias can be consistently estimated under $H_o : \tau = 0$

The ratio $\hat\tau / \widehat{bias}$ indicates how much larger SOU needs to be relative to SOO to entirely explain the treatment effect

A small ratio ⇒ treatment effect is highly sensitive to selection on unobservables; a ratio >> 1 implies treatment effect is robust

Algorithm:
1 Estimate Var(D) from sample
2 Estimate treatment eqtn via LPM ⇒ $\widehat{Var}(\mu)$
3 Estimate outcome eqtn via OLS restricting τ = 0 ⇒ $x\hat\beta$, $\widehat{Var}(x\hat\beta)$, $\widehat{Var}(\varepsilon)$
4 Obtain sample means of $x\hat\beta$ in treatment and control groups ⇒ $\hat E[x\hat\beta \mid D = 1]$, $\hat E[x\hat\beta \mid D = 0]$
5 Estimate outcome eqtn via OLS ⇒ $\hat\tau$
6 Compute ratio of $\hat\tau / \widehat{bias}$
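The six steps can be sketched with a single covariate so that every regression has a closed form. This is an illustration, not the authors' implementation: the simulated data, the true effect of 1.0, and all function names are assumptions made here. Step 5 uses the Frisch-Waugh result that the OLS coefficient on D equals the slope of y on the residual from the treatment equation.

```python
import random


def mean(values):
    return sum(values) / len(values)


def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def simple_ols(x, y):
    """Intercept and slope from regressing y on a constant and one covariate."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope


def aet_ratio(y, d, x):
    var_d = variance(d)                                  # step 1
    a, g = simple_ols(x, d)                              # step 2: LPM for D
    mu_hat = [di - a - g * xi for di, xi in zip(d, x)]
    var_mu = variance(mu_hat)
    b0, b1 = simple_ols(x, y)                            # step 3: tau = 0 imposed
    xb = [b0 + b1 * xi for xi in x]
    var_xb = variance(xb)
    var_eps = variance([yi - fit for yi, fit in zip(y, xb)])
    xb_treated = mean([fit for fit, di in zip(xb, d) if di == 1])  # step 4
    xb_control = mean([fit for fit, di in zip(xb, d) if di == 0])
    # step 5: coefficient on D from OLS of y on (1, x, D), via Frisch-Waugh
    tau_hat = sum(m * yi for m, yi in zip(mu_hat, y)) / sum(m * m for m in mu_hat)
    bias = var_d / var_mu * (xb_treated - xb_control) / var_xb * var_eps
    return tau_hat, bias, tau_hat / bias                 # step 6


# Simulated data (hypothetical): selection on the observable x only.
random.seed(0)
n = 5000
x = [random.gauss(0, 1) for _ in range(n)]
d = [1 if xi + random.gauss(0, 1) > 0 else 0 for xi in x]
y = [xi + 1.0 * di + random.gauss(0, 1) for xi, di in zip(x, d)]

tau_hat, bias_hat, ratio = aet_ratio(y, d, x)
print("tau_hat:", round(tau_hat, 3), "bias:", round(bias_hat, 3), "ratio:", round(ratio, 3))
```

With several covariates the closed-form regressions would be replaced by a general OLS routine; the structure of the steps is unchanged.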

Notes:
I If y is binary, perhaps estimate the outcome eqtn via probit in step 3 ⇒ Var(ε) = 1
I AET methods have relatively little to say about economic significance of treatment effect unless one makes assumptions about amount of SOU

Selection on Unobservables
Panel Data

Refer to ECO 6375 for panel data refresher...

Panel data is useful for addressing selection on unobservables that are invariant along a certain dimension

Thus, panel data methods provide a solution to selection on unobservables in only certain situations

Notation
I Population regression fn given by $E[y \mid x_1, ..., x_K, c]$
I $x_k$, $k = 1, ..., K$, are observable (to the econometrician)
I c is an unobservable (to the econometrician) variable

Assuming linearity: $E[y \mid x_1, ..., x_K, c] = \beta_0 + x\beta + c$

Error form of the model

$$y = \beta_0 + x\beta + c + \varepsilon$$

where c is the unobserved effect and ε is the idiosyncratic error

Time-series or cross-section models are forced to include c in the error term (referred to as the composite error)

$$y_i = \beta_0 + x_i\beta + \tilde\varepsilon_i, \quad \tilde\varepsilon_i = c_i + \varepsilon_i$$
$$y_t = \beta_0 + x_t\beta + \tilde\varepsilon_t, \quad \tilde\varepsilon_t = c_t + \varepsilon_t$$

Model

$$y_{it} = \beta_0 + x_{it}\beta + c_i + \varepsilon_{it}$$

I Unobserved effect is assumed to be time invariant (assuming a traditional panel where t represents time)
I x may include time dummies or time trend, etc.

Problem: given presence of $c_i$, how can we recover consistent estimates of $\beta_0$, β?

Estimation techniques
I Assuming Cov(x, c) = 0
F Pooled OLS (POLS)
F Random effects (RE)
I Assuming Cov(x, c) ≠ 0
F Least squares dummy variable model (LSDV)
F Fixed effects (FE)
F First-differencing (FD)

Selection on Unobservables
Panel Data: Treatment Effects Models

Structural model

$$y_{it} = c_i + \lambda_t + x_{it}\beta + \tau D_{it} + \varepsilon_{it}, \quad i = 1, ..., N; \; t = 1, ..., T$$

where $\lambda_t$ are time dummies

Special case
I Setup
F T = 2
F $D_{i1} = 0 \;\; \forall i$
F $D_{i2} \in \{0, 1\} \;\; \forall i$
F Assume no xs
I FE or FD estimation ⇒

$$\tau = E[\Delta y \mid D_2 = 1] - E[\Delta y \mid D_2 = 0]$$

I Known as difference-in-differences estimator

Visual representation of special case

$$y_{it} = c_i + \lambda_t + x_{it}\beta + \tau D_{it} + \varepsilon_{it}$$

I Expected outcomes by period and treatment status

         t = 1        t = 2
D = 0    c0 + λ1      c0 + λ2
D = 1    c1 + λ1      c1 + λ2 + τ

I Implies

$$E[\Delta y \mid D_2 = 1] = (c_1 + \lambda_2 + \tau) - (c_1 + \lambda_1) = \tau + \lambda_2 - \lambda_1$$
$$E[\Delta y \mid D_2 = 0] = (c_0 + \lambda_2) - (c_0 + \lambda_1) = \lambda_2 - \lambda_1$$

which implies

$$\tau = E[\Delta y \mid D_2 = 1] - E[\Delta y \mid D_2 = 0]$$
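A small numerical check of the 2x2 table, using hypothetical values for c0, c1, λ1, λ2, and τ and no noise. The sketch also computes the before-after and cross-section comparisons, which pick up the time effect and the group effect, respectively, in addition to τ. The function name is illustrative.

```python
def mean(values):
    return sum(values) / len(values)


def three_estimators(pre_treated, post_treated, pre_control, post_control):
    """Before-after, cross-section, and difference-in-differences estimates."""
    before_after = mean(post_treated) - mean(pre_treated)
    cross_section = mean(post_treated) - mean(post_control)
    did = before_after - (mean(post_control) - mean(pre_control))
    return before_after, cross_section, did


# Hypothetical parameter values for the expected outcomes in the table.
c0, c1, lam1, lam2, tau = 1.0, 3.0, 0.5, 1.5, 3.0
estimates = three_estimators(
    pre_treated=[c1 + lam1],
    post_treated=[c1 + lam2 + tau],
    pre_control=[c0 + lam1],
    post_control=[c0 + lam2],
)
print("before-after, cross-section, DID:", estimates)
```

Only the DID estimate recovers τ = 3; the before-after estimate equals τ + (λ2 - λ1) and the cross-section estimate equals τ + (c1 - c0).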

[Figure: Illustration of Three Common Estimators (Before-After Estimator, Cross-Section Estimator, DID). Outcomes Y0 and Y1 plotted against Period (-1, 0, 1).]

Beyond the special case
I Special case is useful to gain the intuition, not required
I In general, as long as $D_{it}$ is time-varying for some units i, then τ can be estimated by any panel data method given the required assumptions are met
I If selection into treatment is only on observables (not $c_i$), then POLS or RE may be consistent and efficient
I If selection into treatment is also on time invariant unobservables ($c_i$), then POLS and RE are inconsistent, but FE or FD are consistent if other assumptions are met
I Important to remember: FE/FD is not a magic bullet (Duflo et al. 2004)
F FE and FD require strict exogeneity; rules out Ashenfelter's Dip ⇒ $Cov(D_{it}, \varepsilon_{it-1}) \neq 0$
F Rules out selection on contemporaneous shocks ⇒ $Cov(D_{it}, \varepsilon_{it}) \neq 0$
F Key: requires treated and untreated to follow same time trend in absence of treatment
F Diff-in-diff-in-diff may be an option
I With heterogeneous treatment effects, FE identifies the ATT

Timing issues (LaPorte & Windmeijer 2005)

Previous model restricts D to a one-time intercept shift, τ

In certain applications, agent may anticipate treatment and alter behavior prior to actual treatment; or, response may occur with a lag; or, some combination of both

Examples: policy changes announced, but not implemented until future date; or, lags in adjustment to policy changes

General structural model

$$y_{it} = c_i + \lambda_t + x_{it}\beta + \sum_{l=1}^{L_0} \delta_{-l} D^{-l}_{it} + \delta_0 D_{it} + \sum_{l=1}^{L_1} \delta_l D^{l}_{it} + \varepsilon_{it}$$

where

$D^{-l}_{it} = D_{it+l}$ (treatment assignment l periods in future)

$D^{l}_{it} = D_{it-l}$ (treatment assignment l periods in past)

$\delta_{-l}$ reflects anticipatory effects of treatment
$\delta_l$ reflects lagged effects of treatment
$\delta_0$ reflects instantaneous effects of treatment

Specification test

If anticipatory and/or lagged effects occur, but "simple" model of one-time effect is estimated, then FE and FD will yield (statistically) different estimates

$$E[\hat\delta_{FD}] = \delta_0 - \delta_{-1}$$
$$E[\hat\delta_{FE}] = \sum_t \omega_t (\delta_{0+} - \delta_{-})$$

where

$\delta_{0+}$ = average of $\delta_0, \delta_1, ..., \delta_{L_1}$
$\delta_{-}$ = average of $\delta_{-1}, ..., \delta_{-L_0}$

and $\omega_t$ are weights

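$H_o : \delta_{FD} = \delta_{FE} \iff H_o : \phi = 0$

$$\begin{bmatrix} y_{it} - y_{it-1} \\ y_{it} - \bar y_i \end{bmatrix} = \begin{bmatrix} x_{it} - x_{it-1} \\ x_{it} - \bar x_i \end{bmatrix}\beta + \begin{bmatrix} D_{it} - D_{it-1} \\ D_{it} - \bar D_i \end{bmatrix}\delta + \begin{bmatrix} 0 \\ D_{it} - \bar D_i \end{bmatrix}\phi + \begin{bmatrix} \eta_{it} \\ \tilde\eta_{it} \end{bmatrix}$$

Estimate via OLS, look at confidence interval on $\hat\phi$

Lee and Huang (2011) extend the existing literature on dynamic treatment effects to allow for anticipatory behavior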

Autoregressive Model

Fixed effects models require $D_{it}$ to be time-varying for some i

If D is time invariant ∀i, it is still possible to identify the effect of the program under the common treatment effect assumption

Structural model

$$y_{it} = \lambda_t + x_{it}\beta + \tau D_i + \varepsilon_{it}$$
$$\varepsilon_{it} = \rho\,\varepsilon_{it-1} + \eta_{it}$$

where $\eta_{it}$ is iid with mean zero and τ is the homogeneous treatment effect

Quasi-FD yields

$$y_{it} = \tilde\lambda_t + (x_{it} - \rho x_{it-1})\beta + (1 - \rho)\tau D_i + \rho y_{it-1} + \eta_{it}$$

OLS is consistent if (i) x are strictly exogenous and (ii) D is uncorrelated with η (e.g., post-treatment shocks are not forecastable and therefore do not affect past treatment decision)

Comparative Case Study Approach

Provides an alternative to DD when
I Treatment occurs at an aggregate level
I Typically only a single observation is treated and lengthy history of pre-treatment data are available for the treated and the pool of controls

Examples:
I Mariel Cuban Boat Lift (Card 1980)
I State minimum wage (Card & Krueger 1994)

Solution
I Construct a synthetic control which is a weighted average of available controls to estimate the missing counterfactual in post-treatment period(s)
I Weights are chosen by matching pre-treatment covariates and outcomes
I Allows for differential time trends in treatment and control observations
F By matching pre-treatment outcomes, one is implicitly matching on the time-invariant unobserved effect
F Thus, does not matter if unobserved effect has differential effects over time if the time-specific effect is a common factor

Model
I $y_{it}$ is observed outcome for obs i, $i = 1, ..., J + 1$, in period $t = 1, ..., T_o, ..., T$
I Obs 1 is treated; remaining 2, ..., J + 1 are never treated
I Timing of treatment effects
1 No Anticipatory Effects: $T_o$ is period prior to obs 1 being treated
2 Anticipatory Effects: $T_o$ is period prior to any anticipatory effects for obs 1 beginning
I Outcomes in the absence of treatment

$$y_{it} = y^N_{it} = \delta_t + \theta_t Z_i + \lambda_t u_i + \varepsilon_{it}$$

I Outcomes with treatment

$$y_{it} = y^I_{it} = y^N_{it} + \alpha_{it}$$

Synthetic control is defined as

$$\sum_{j=2}^{J+1} \omega_j y_{jt} = \sum_{j=2}^{J+1} \omega_j (\delta_t + \theta_t Z_j + \lambda_t u_j + \varepsilon_{jt})$$

where $\omega_j$ is the weight given to control j and
I $\sum_{j=2}^{J+1} \omega_j = 1$
I $\omega_j \geq 0 \;\; \forall j$

Conditional on choice of weights, $\omega^*_j$, period-specific treatment effect is estimated as

$$\hat\alpha_{1t} = y_{1t} - \sum_{j=2}^{J+1} \omega^*_j y_{jt}$$

Requires a SUTVA-type assumption that the treatment does not impact outcomes in the control pool
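Given a set of weights, the period-specific estimate is just the gap between the treated unit and the weighted average of controls. The sketch below takes the weights as given (choosing them is a separate optimization step that is not shown); the function name and the small outcome series are hypothetical.

```python
def synthetic_gap(y_treated, y_controls, weights):
    """alpha_hat_1t = y_1t minus the weighted average of control outcomes.

    y_treated: outcomes for the treated unit, one per period.
    y_controls: one outcome series per control unit, same length as y_treated.
    weights: non-negative weights that sum to one, one per control.
    """
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to one")
    gaps = []
    for t, y1t in enumerate(y_treated):
        synthetic = sum(w * series[t] for w, series in zip(weights, y_controls))
        gaps.append(y1t - synthetic)
    return gaps


# Hypothetical data: two pre-treatment periods followed by one treated period.
gaps = synthetic_gap(
    y_treated=[1.0, 2.0, 5.0],
    y_controls=[[0.0, 2.0, 2.0], [2.0, 2.0, 4.0]],
    weights=[0.5, 0.5],
)
print(gaps)
```

The pre-treatment gaps are zero here by construction, which is what a good set of weights aims for; the post-treatment gap is the estimated effect.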

Weights are chosen to match moments of the data in periods $t \leq T_o$
I Define

$$\bar y^K_i = \sum_{s=1}^{T_o} k_s y_{is}$$

where $K = (k_1, ..., k_{T_o})$ is a vector of weights and thus $\bar y^K_i$ represents a particular linear combination of pre-treatment outcomes for obs i
I Given M unique linear combinations, define the vector of pre-treatment outcomes for obs 1 as

$$X_1 = (Z_1', \bar y^{K_1}_1, ..., \bar y^{K_M}_1)'$$

with dimension R × 1
I Define the R × J matrix of variables for the remaining obs i, $i = 2, ..., J + 1$, as $X_0$, where column j is given by

$$(Z_{j+1}', \bar y^{K_1}_{j+1}, ..., \bar y^{K_M}_{j+1})'$$

I Weights are chosen to minimize some distance function

$$\|X_1 - X_0 W\|_V = \sqrt{(X_1 - X_0 W)' V (X_1 - X_0 W)}$$

where V is a R × R symmetric, positive semidefinite matrix
I In practice, V is chosen to minimize the MSE of the pre-intervention predictions

Inference is handled by
I Re-doing the analysis, treating obs i, $i = 2, ..., J + 1$, as "treated" after period $T_o$ and the remaining obs as the pool of potential controls
I This yields a dbn of treatment effect estimates under $H_o$ of no treatment effect
I If actual estimates of $\hat\alpha_{1t}$ look very different, this is evidence of a statistically meaningful treatment effect

Code is available in Stata at http://www.mit.edu/~jhainm/synthpage.html.

Example: Abadie et al. (2010)


Selection on Unobservables
Instrumental Variables

Refer to ECO 6374 for refresher on basics...

Terminology
I "Structural" model

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$

I First-stage model

$$x_i = \pi_0 + \pi_1 z_i + u_i$$

I Reduced form model

$$y_i = (\beta_0 + \beta_1\pi_0) + \beta_1\pi_1 z_i + (\varepsilon_i + \beta_1 u_i) = \tilde\pi_0 + \tilde\pi_1 z_i + \tilde\varepsilon_i$$

Goal: devise alternative estimation technique to obtain consistent estimates when $E[\varepsilon \mid x] \neq 0$
I Solution: identify β from exogenous variation in x isolated using instruments, z
I z is a valid IV for x iff

(IV.i) First-stage: $E[z'x] \neq 0$
(IV.ii) Exogeneity: $E[z'\varepsilon] = 0$
(IV.iii) Exclusion: $E[y \mid x, z] = E[y \mid x]$

where z and x are both N × K matrices
I Exogenous xs serve as instruments for themselves
I Need unique instrument for each endogenous var

Stata: -ivreg2-, -xtivreg2-

Several issues remain under scrutiny in the literature
1 Choice of estimation technique
2 Properties and inference with weak IVs ⇒ $E[z'x] \approx 0$
3 Properties and inference with endogenous IVs ⇒ $E[z'\varepsilon] \neq 0$

Selection on Unobservables
Estimators

1 IV
2 Two-Stage Least Squares (TSLS or 2SLS)
3 Nagar
4 Split-sample or Two-Sample IV (data set #1: $\{x, z\}_{i=1}^{N_1}$; data set #2: $\{y, z\}_{i=1}^{N_2}$)
5 JIVE
6 LIML
7 Fuller (modified LIML)
8 GMM

Selection on Unobservables
Estimators: IV Estimator

Estimator is given by

$$y = x\beta + \varepsilon$$
$$\Rightarrow z'y = z'x\beta + z'\varepsilon \rightarrow \beta = (z'x)^{-1}z'y \text{ if } z'\varepsilon = 0$$
$$\Rightarrow \hat\beta_{IV} = (z'x)^{-1}z'y$$

Estimated asymptotic variance is given by

$$\widehat{Var}(\hat\beta_{IV}) = \hat\sigma^2 (z'x)^{-1}(z'z)(x'z)^{-1}; \quad \hat\sigma^2 = \frac{1}{N - K} \sum_i \hat\varepsilon_i^2$$
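With one endogenous regressor, one instrument, and a constant, the IV slope reduces to Cov(z, y) / Cov(z, x), which also equals the reduced-form slope divided by the first-stage slope. The sketch below illustrates this on simulated data; the data-generating process (true slope 2, error correlated with x through u) and all names are assumptions made for the example.

```python
import random


def mean(values):
    return sum(values) / len(values)


def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / len(a)


# Simulated data (hypothetical): x is endogenous because u enters both x and eps.
random.seed(1)
n = 5000
z = [random.gauss(0, 1) for _ in range(n)]
u = [random.gauss(0, 1) for _ in range(n)]
e = [random.gauss(0, 1) for _ in range(n)]
x = [zi + ui for zi, ui in zip(z, u)]
eps = [0.8 * ui + 0.6 * ei for ui, ei in zip(u, e)]
y = [1.0 + 2.0 * xi + epsi for xi, epsi in zip(x, eps)]

beta_ols = cov(x, y) / cov(x, x)      # inconsistent: Cov(x, eps) != 0
beta_iv = cov(z, y) / cov(z, x)       # slope element of (z'x)^{-1} z'y
first_stage = cov(z, x) / cov(z, z)   # pi_1
reduced_form = cov(z, y) / cov(z, z)  # pi_tilde_1 = beta_1 * pi_1

print("OLS:", round(beta_ols, 3), "IV:", round(beta_iv, 3))
print("reduced form / first stage:", round(reduced_form / first_stage, 3))
```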

Selection on Unobservables
Estimators: Two-Stage Least Squares

IV estimator requires 1 instrument per endogenous variable; otherwise $z'x$ is a L × K matrix (L > K) with rank = K, and the inverse does not exist

Discarding additional IVs is probably inefficient

TSLS is an alternative estimator that does not face this problem

In multivariate regression, this is formalized as
I First-stage

$$\hat x = z(z'z)^{-1}z'x$$

and replacing z with $\hat x$ in the IV estimator
I Estimator now given by

$$\hat\beta_{TSLS} = (\hat x'\hat x)^{-1}\hat x'y = [x'z(z'z)^{-1}z'x]^{-1}x'z(z'z)^{-1}z'y$$

Notes ...

In a multiple regression...
I With multiple endogenous vars, need at least as many IVs as endogenous xs; do not interpret this IV for this x, that IV for that x
I Where the second-stage contains other exogenous vars, these vars must be included in the first-stage

If strictly more IVs than endogenous vars, then
I Model is overidentified (as opposed to exactly identified)
I Enables additional tests for instrument validity

Estimators are CAN, but biased
I Intuition behind the bias is that the first-stage OLS estimates, $\hat\theta$, are correlated with the error term from the structural model, ε, which implies that the fitted values, $\hat x$, are also correlated with ε

Incorrectly treating other covariates in the model as exogenous ⇒ inconsistent estimates if instrument(s) are correlated with these covariates

Selection on Unobservables
Estimators: JIVE, SSIV, Nagar

Breaking the correlation between $\hat\theta$ and ε is the motivation behind JIVE and SSIV

SSIV (Angrist & Krueger 1992, 1995)
I Approach
F Divide sample into two groups: $i = 1, ..., N_1$ and $i = N_1 + 1, ..., N$
F Estimate first-stage using $N_2$ obs, $i = N_1 + 1, ..., N$
F Predict $\hat x$ out-of-sample for first $N_1$ obs
F Estimate second-stage using first $N_1$ obs
I Estimators

$$\hat\beta_{SSIV} = (\hat x_{21}'\hat x_{21})^{-1}\hat x_{21}'y_1$$
$$\hat\beta_{USSIV} = (\hat x_{21}'x_1)^{-1}\hat x_{21}'y_1$$

where $\hat x_{21} = z_1(z_2'z_2)^{-1}z_2'x_2$ and subscript 1 (2) refers to estimation on $i = 1, ..., N_1$ ($i = N_1 + 1, ..., N$)
I SSIV uses OLS in the second-stage; USSIV stands for Unbiased SSIV and uses IV in the second-stage

JIVE
I Approach
F Estimate first-stage using N - 1 obs
F Predict $\hat x$ out-of-sample for the excluded obs
F Repeat for all N obs and estimate second-stage using all N obs
I Estimators

$$\hat\beta_{JIVE} = (\hat x_{-i}'\hat x_{-i})^{-1}\hat x_{-i}'y$$
$$\hat\beta_{UJIVE} = (\hat x_{-i}'x)^{-1}\hat x_{-i}'y = (x'C_J'x)^{-1}x'C_J'y$$

where $\hat x_{-i}$ is the matrix whose i th row is $z_i\hat\pi_{-i}$, $\hat\pi_{-i}$ is the vector of first-stage coeffs with obs i removed, and $C_J = (I - D_{P_z})^{-1}(P_z - D_{P_z})$, $D_{P_z} = diag(P_z)$, and $P_z = z(z'z)^{-1}z'$
I JIVE uses OLS in the second-stage; UJIVE stands for Unbiased JIVE and uses IV in the second-stage
I Stata: -jive-

Nagar estimator is a bias-corrected TSLS estimator
I Nagar (1959), Hahn & Hausman (2002)
I Estimator given by

$$\hat\beta_N = \left[ x'\left(P_z - \frac{K}{N} I_N\right) x \right]^{-1} x'\left(P_z - \frac{K}{N} I_N\right) y$$

where K = # IVs and $P_z = z(z'z)^{-1}z'$
I Hahn & Hausman (2002) discuss the poor performance of the Nagar estimator when the model is close to being unidentified

Selection on Unobservables
Estimators: LIML, Fuller, and k-Class Estimators

k-class estimators can all be written as

$$\hat\beta_k = [x'(I_N - kM_z)x]^{-1}x'(I_N - kM_z)y$$

for different values of k, where $M_z = I_N - z(z'z)^{-1}z'$

k = 0 ⇒ OLS

k = 1 ⇒ TSLS

k = λ ⇒ LIML

$k = \lambda - \frac{\alpha}{N - L}$ ⇒ Fuller

$k = 1 + \frac{L - K}{N}$ ⇒ Nagar

For LIML, λ is a minimum eigenvalue

For Fuller, α is user-specified (typically 1) and L = # included + excluded instruments

For Nagar, L - K = # over-identifying restrictions
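The k-class formula can be illustrated in the simplest possible setting: one regressor, one instrument, no constant, so that every matrix product is a scalar. This is only a sketch under those assumptions, with hypothetical data and function names; it checks that k = 0 reproduces OLS and k = 1 reproduces the just-identified IV/TSLS estimate.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def k_class(y, x, z, k):
    """Scalar k-class estimate [x'(I - k M_z) x]^{-1} x'(I - k M_z) y."""
    zz = dot(z, z)
    x_m_x = dot(x, x) - dot(x, z) ** 2 / zz          # x' M_z x
    x_m_y = dot(x, y) - dot(x, z) * dot(z, y) / zz   # x' M_z y
    return (dot(x, y) - k * x_m_y) / (dot(x, x) - k * x_m_x)


# Hypothetical data, chosen so the arithmetic is easy to follow by hand.
x = [1.0, 2.0, 3.0, 4.0]
z = [1.0, 0.0, 1.0, 2.0]
y = [2.0, 3.0, 7.0, 8.0]

beta_k0 = k_class(y, x, z, k=0.0)   # OLS: x'y / x'x
beta_k1 = k_class(y, x, z, k=1.0)   # TSLS: z'y / z'x when just identified
print(beta_k0, beta_k1)
```

LIML and Fuller would require computing λ as a minimum eigenvalue, which is not attempted here.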

Selection on Unobservables
IV: Specification Tests

Much specification testing is required when utilizing IV in applied research

Types of tests available
I Tests of endogeneity: $E[x'\varepsilon] \overset{?}{=} 0$
I Tests of instrument relevance: $E[z'x] \overset{?}{=} 0$
I Tests of overidentification: $E[z'\varepsilon] \overset{?}{=} 0$ (partial test only)
I Tests for weak instruments: $E[z'x] \approx 0$

Covered in ECO 6374

With weak IVs, some recommend LIML, others Fuller, others UJIVE, others TSLS (which tends to have a larger bias, similar RMSE)

Selection on Unobservables
IV: Imperfect Instruments

Recent work has explored what can be learned if z is an imperfect instrumental variable (IIV)

Two possible imperfections:
1 z is also endogenous
2 z is not excludable from the second-stage

Nevo & Rosen (2010) and Ashley (2009) address endogeneity

Conley et al. (2010) address excludability

Note: These are intimately related since if z is incorrectly treated as excludable, then it will be correlated with the second-stage composite error that now includes the error and z

Nevo & Rosen (2010) ...

Setup

I Model given by

$$y_i = \beta x_i + w_i\delta + \varepsilon_i$$

where x is a single endogenous regressor, w is exogenous (or alternatively is endogenous with valid instruments), and z is a $1 \times k_z$ vector of imperfect instruments for x

I z is an imperfect IV (IIV) in the sense that it is also correlated with $\varepsilon$
I Assumptions:

(IIV.i) Sign of correlation: $\rho_{x\varepsilon}\rho_{z_j\varepsilon} \geq 0$, $j = 1, ..., k_z$
(IIV.ii) Degree of endogeneity: $|\rho_{x\varepsilon}| \geq |\rho_{z_j\varepsilon}|$, $j = 1, ..., k_z$
(IIV.iii) True model: $y_i = \beta x_i + w_i\delta + \varepsilon_i$

(IIV.ii) contrasts with the classical IV assumption that $\rho_{z_j\varepsilon} = 0$

Define

$$\lambda_j = \frac{\rho_{z_j\varepsilon}}{\rho_{x\varepsilon}}$$

which is in the unit interval under (IIV.i), (IIV.ii)

If $\lambda_j$ were known, then a valid IV for x is

$$V_j(\lambda_j) = \sigma_x z_j - \lambda_j\sigma_{z_j}x$$

However, $\Lambda = [\lambda_1 \cdots \lambda_{k_z}]$ is unknown, but lies in the unit cube in $R^{k_z}$-space

Intuitively, searching over feasible values of $\Lambda$, one may bound $\beta$

Consider $k_z = 1$

I Partial out the effects of w by defining

$$\tilde{y}_i = y_i - w_i[(w'w)^{-1}w'y]$$
$$\tilde{x}_i = x_i - w_i[(w'w)^{-1}w'x]$$

(Note: If w is endogenous with valid IVs, then the OLS coeffs are replaced by IV coeffs.)

I Under (IIV.i) (IIV.iii) and assuming without loss of generality thatρx ε 0, obtain the following bounds:

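F Case I. $(\sigma_{z\tilde{x}}\sigma_x - \sigma_{x\tilde{x}}\sigma_z)\sigma_{z\tilde{x}} > 0$

$$\beta \in \begin{cases} [\beta^{IV}_{V(1)}, \beta^{IV}_{z}] & \text{if } \sigma_{z\tilde{x}} < 0 \\ [\beta^{IV}_{z}, \beta^{IV}_{V(1)}] & \text{if } \sigma_{z\tilde{x}} > 0 \end{cases}$$

F Case II. $(\sigma_{z\tilde{x}}\sigma_x - \sigma_{x\tilde{x}}\sigma_z)\sigma_{z\tilde{x}} \leq 0$

$$\beta \in \begin{cases} [\max\{\beta^{IV}_{z}, \beta^{IV}_{V(1)}\}, \infty) & \text{if } \sigma_{z\tilde{x}} < 0 \\ (-\infty, \min\{\beta^{IV}_{z}, \beta^{IV}_{V(1)}\}] & \text{if } \sigma_{z\tilde{x}} > 0 \end{cases}$$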

Additional work to bound $\delta$ is also possible

Extension to $k_z > 1$

I Bounds can be tightened by obtaining bounds for each z individually and then computing the final bounds as the intersection of the $k_z$ bounds

I Formally

F For each $z_j$, obtain $B_j = [\beta^l_j, \beta^u_j]$

F Final bounds given by

$$\beta \in \left[\max_j\{\beta^l_j\}, \min_j\{\beta^u_j\}\right]$$

F In Case II, these bounds are one-sided; one trick may be to try and define a new IV that is a weighted average of two of the IVs such that $(\sigma_{q\tilde{x}}\sigma_x - \sigma_{x\tilde{x}}\sigma_q)\sigma_{q\tilde{x}} > 0$, where $q_i = \gamma z_{ji} + (1-\gamma)z_{j'i}$

I Need to be careful, though, and make sure different z's estimate the same parameter (discussed later)

Conley et al. (2010) ...

Setup

$$y_i = x_i\beta + z_i\gamma + \varepsilon_i$$

$$x_i = z_i\pi + u_i$$

where x is a $k_x$-dimensional vector of endogenous regressors, z is a $k_z$-dimensional vector of instruments, $k_z \geq k_x$, and $E[z'\varepsilon] = 0$

Classical IV requires the assumption that $\gamma = 0$

I With $k_x = k_z = 1$, we have

$$\mathrm{plim}\,\hat{\beta}_{IV} = \beta + \frac{\sigma_{z\tilde{\varepsilon}}}{\sigma_{xz}} = \beta + \frac{\gamma\sigma^2_z}{\pi\sigma^2_z} = \beta + \frac{\gamma}{\pi}$$

where $\tilde{\varepsilon}_i = z_i\gamma + \varepsilon_i$ is the composite error

I Thus, IV is asymptotically biased when $\gamma \neq 0$ and the bias is decreasing in $\pi$ and increasing in $\gamma$

I Authors refer to deviations from $\gamma = 0$ as plausible exogeneity

Approach

I Track estimates $\hat{\beta}(\gamma) = \hat{\beta}_{IV} - \gamma/\hat{\pi}$ for different values of $\gamma$
I Estimates will be more sensitive to $\gamma$ the weaker the first-stage relationship

Authors present several possible methods of inference, only some presented here

Method #1. Union of CIs with $\gamma$ Support Assumption

I Suppose the true value of $\gamma = \gamma_0 \in \mathcal{G}_{k_z}$, with known bounds
I If $\gamma_0$ were known, then IV/TSLS applied to

$$y_i - z_i\gamma_0 = x_i\beta + \varepsilon_i$$

using z as instruments is consistent for $\beta$

I With $\gamma_0$ unknown, but contained in $\mathcal{G}_{k_z}$, one can

F Apply IV/TSLS to a grid of values for $\gamma$ from $\mathcal{G}_{k_z}$
F For each value, $\gamma_s$, $s = 1, ..., S$, obtain the $(1-\alpha)\%$ CI for $\beta$
F Compute a final CI as the union of these S CIs

$$CI(1-\alpha) = \bigcup_{\gamma \in \mathcal{G}_{k_z}} CI(1-\alpha, \gamma)$$

which has an asymptotic coverage probability $\geq 1-\alpha$

F If some prior info, may want to weight different $\gamma$s differently
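
The union-of-CIs procedure can be sketched for the scalar case $k_x = k_z = 1$. This is a minimal illustration under assumptions of my own (homoskedastic standard errors, a simulated design, a support of $[0, 0.2]$ for $\gamma$, and hypothetical helper names `iv_ci` and `union_ci`); it is not the authors' Stata code.

```python
import numpy as np

def iv_ci(y, x, z, crit=1.96):
    """Just-identified IV of y on x with instrument z; returns (beta, lower, upper)."""
    y, x, z = y - y.mean(), x - x.mean(), z - z.mean()   # demeaning absorbs the intercept
    beta = (z @ y) / (z @ x)
    resid = y - beta * x
    sigma2 = resid @ resid / len(y)
    se = np.sqrt(sigma2 * (z @ z)) / abs(z @ x)          # homoskedastic IV std error
    return beta, beta - crit * se, beta + crit * se

def union_ci(y, x, z, gamma_grid):
    """Apply IV to y - gamma*z for each gamma on the grid and take the union of the CIs."""
    cis = [iv_ci(y - g * z, x, z) for g in gamma_grid]
    return min(c[1] for c in cis), max(c[2] for c in cis), cis

# Simulated data (assumed): true beta = 1, true gamma = 0.1, first-stage pi = 1
rng = np.random.default_rng(6)
n = 2000
z = rng.normal(size=n)
u = rng.normal(size=n)
eps = 0.5 * u + rng.normal(size=n)
x = 1.0 * z + u
y = 1.0 * x + 0.1 * z + eps

gamma_grid = np.linspace(0.0, 0.2, 21)                   # assumed support for gamma
lo, hi, cis = union_ci(y, x, z, gamma_grid)
b_iv = iv_ci(y, x, z)[0]
print("naive IV:", b_iv, "union CI:", (lo, hi))
```

In this just-identified case each grid estimate equals $\hat{\beta}_{IV} - \gamma/\hat{\pi}$ exactly, so the width of the union grows with the assumed range of $\gamma$ and shrinks with first-stage strength.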


Method #2. $\gamma$ Local-to-Zero Approximation

I $\gamma$ is treated as unknown, but coming from a known dbn

$$\gamma = \frac{\eta}{\sqrt{N}}, \quad \eta \sim G$$

where prior info on $\gamma$ translates to knowing the dbn G

I The normalization by $\sqrt{N}$ ensures that uncertainty about z being a valid instrument and sampling error are of the same order and so both factor into the asymptotic dbn of $\hat{\beta}$

I Assuming $\gamma \sim N(\mu_\gamma, \Omega_\gamma)$ leads to the following approximate dbn

$$\hat{\beta} \overset{approx}{\sim} N(\beta + A\mu_\gamma, V_{IV} + A\Omega_\gamma A')$$

where $A = (x'z(z'z)^{-1}z'x)^{-1}x'z$

I If $\mu_\gamma = 0$, then this approach simply leads to a revised variance for the IV/TSLS estimator

Stata ado files available on Conley's website

Selection on Unobservables
IV: Heterogeneous Treatment Effects

Assume a binary endogenous regressor, D, and a binary instrument, z

Motivation arises from the fact that the treatment effect may vary across i and agents may act on observation-specific gains when making the treatment decision

Admitting this possibility implies that one must think more carefully about what parameter one is estimating

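Linear model

Setup (from earlier potential outcomes framework)

$$\begin{aligned} y_i &\equiv y_{0i} + D_i(y_{1i} - y_{0i}) \\ &= \alpha_0 + \tilde{x}_i\beta + \upsilon_{0i} + D_i(\alpha_1 + \tilde{x}_i\beta + \upsilon_{1i} - \alpha_0 - \tilde{x}_i\beta - \upsilon_{0i}) \\ &= \alpha_0 + \tilde{x}_i\beta + (\alpha_1 - \alpha_0 + \upsilon_{1i} - \upsilon_{0i})D_i + \upsilon_{0i} \\ &\equiv x_i\beta + \Delta_i D_i + \varepsilon_i \end{aligned}$$

Define $\Delta_i = (\alpha_1 - \alpha_0) + (\upsilon_{1i} - \upsilon_{0i}) \equiv \Delta + \Delta^*_i$, so that $\Delta^*_i$ is the deviation of the individual effect from $\Delta$

Substitution implies

$$y_i = x_i\beta + \Delta D_i + (\Delta^*_i D_i + \varepsilon_i)$$

where $\Delta^*_i D_i + \varepsilon_i$ is the composite error term, which differs from the usual error term for the treated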

A valid IV in the homogeneous treatment effects setup requires

$$E[\varepsilon_i | x_i, D_i, z_i] = E[\varepsilon_i | x_i, D_i]$$

but now

$$E[\Delta^*_i D_i + \varepsilon_i | x_i, D_i, z_i] = E[\Delta^*_i D_i + \varepsilon_i | x_i, D_i]$$

is required, where $\Delta^*_i \equiv \Delta_i - \Delta$ is the individual-specific deviation from the common effect

Thus, z must be

I Correlated with $D_i$ (as usual)
I Uncorrelated with the error term from the structural model and individual-specific gains (or losses) from treatment

F Not possible unless (i) $\Delta^*_i = 0\ \forall i$ (implying a constant treatment effect) or (ii) $\Delta^*_i \perp D_i | x_i$ (implying that agents either do not know or do not act on specific gains ... no essential heterogeneity)

F Model with $\Delta^*_i$ and $D_i$ correlated known as Correlated Random Coefficients (CRC) model

Much more restrictive requirement

I Example: if z is an exogenous variable representing the cost of participation in the treatment (e.g., distance to job training center), then high z will lead to no participation unless the benefit from participation, $\Delta_i$, is very high; if z is low, one will participate if $\Delta_i$ is low or high $\Rightarrow$ positive correlation between z and $\Delta_i$ conditional on $D_i$

If z is uncorrelated with $\varepsilon$, but correlated with $\Delta_i$, then IV estimates are still useful, but identify a different parameter

Parameter known as local average treatment effect (LATE)

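Formally, given the model (ignoring x)

$$y_i = \alpha + \Delta D_i + (\Delta^*_i D_i + \varepsilon_i)$$

where $\Delta^*_i \equiv \Delta_i - \Delta$, and an instrument, z, we have

$$\mathrm{plim}\,\hat{\Delta}_{OLS} = \frac{Cov(y, D)}{Var(D)} = \Delta + \frac{Cov(\varepsilon, D) + Cov(\Delta^* D, D)}{Var(D)} \neq \Delta$$

$$\mathrm{plim}\,\hat{\Delta}_{IV} = \frac{Cov(y, z)}{Cov(D, z)} = \Delta + \frac{Cov(\varepsilon, z) + Cov(\Delta^* D, z)}{Cov(D, z)} = \Delta + \frac{Cov(\Delta^* D, z)}{Cov(D, z)} \neq \Delta$$

where the last inequality holds unless (i) $\Delta^*_i = 0\ \forall i$ or (ii) $\Delta^*_i \perp D_i | x_i$ (as stated above)

How do we interpret $\hat{\Delta}_{IV}$?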

LATE

Assume a binary endogenous regressor, D, and a binary instrument, z, and no other covariates (for simplicity)

Four potential subpopulations

Type                  z = 0    z = 1
Never Takers (NT)     D = 0    D = 0
Defiers (DF)          D = 1    D = 0
Compliers (C)         D = 0    D = 1
Always Takers (AT)    D = 1    D = 1

Compliers are the key, as their treatment status varies with the instrument

Recall, the Wald estimator

$$\hat{\Delta}_{IV} = \frac{E[y | z = 1] - E[y | z = 0]}{\Pr(D = 1 | z = 1) - \Pr(D = 1 | z = 0)}$$

Numerator terms may be expressed as

$$E[y | z = j] = \left\{ \begin{array}{l} E[y_1 | AT]\Pr(AT) + E[y_j | C]\Pr(C) \\ +\, E[y_{(1-j)} | DF]\Pr(DF) \\ +\, E[y_0 | NT]\Pr(NT) \end{array} \right\}, \quad j = 0, 1$$

Denominator terms may be expressed as

$$\Pr[D = 1 | z = j] = \left\{ \begin{array}{l} \Pr[D = 1 | z = j, AT]\Pr(AT) \\ +\, \Pr[D = 1 | z = j, C]\Pr(C) \\ +\, \Pr[D = 1 | z = j, DF]\Pr(DF) \\ +\, \Pr[D = 1 | z = j, NT]\Pr(NT) \end{array} \right\}, \quad j = 0, 1$$

$$= \begin{cases} \Pr(AT) + \Pr(C) & \text{if } j = 1 \\ \Pr(AT) + \Pr(DF) & \text{if } j = 0 \end{cases}$$

Wald estimator reduces to

$$\hat{\Delta}_{IV} = \frac{\{E[y_1 | C]\Pr(C) + E[y_0 | DF]\Pr(DF)\} - \{E[y_0 | C]\Pr(C) + E[y_1 | DF]\Pr(DF)\}}{\Pr(C) - \Pr(DF)}$$

which is a weighted average of the treatment effect for compliers and the negative of the treatment effect for defiers

Assumptions

(LATE.i) Independence: $\{y_0, y_1, D_0, D_1\} \perp z$, where $D_j$, $j = 0, 1$, are potential treatment assignments
(LATE.ii) Exclusion: $E[y_0 | z] = E[y_0]$; $E[y_1 | z] = E[y_1]$
(LATE.iii) First-Stage/Compliers: $\Pr(C) > 0 \Rightarrow \Pr(D = 1 | z)$ is a non-trivial function of z
(LATE.iv) Monotonicity: $\Pr(D_i = 1 | z_i = 1) \geq \Pr(D_i = 1 | z_i = 0)\ \forall i \Rightarrow \Pr(DF) = 0$

Imposing these assumptions $\Rightarrow$

$$\hat{\Delta}_{IV} = \hat{\Delta}_{LATE} = E[y_1 - y_0 | C]$$

which is a parameter defined with respect to a particular instrument
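
A small simulation may help fix ideas. The sketch below is illustrative only: the type shares and type-specific effects are assumptions chosen so that the complier effect (1.0) differs from the population average effect (1.3), and there are no defiers, so monotonicity holds by construction.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# Assumed population: 20% always takers, 30% never takers, 50% compliers
types = rng.choice(["AT", "NT", "C"], size=n, p=[0.2, 0.3, 0.5])
z = rng.integers(0, 2, size=n)                       # randomly assigned binary instrument

# Treatment: AT always treated, NT never treated, compliers follow z
d = np.where(types == "AT", 1, np.where(types == "C", z, 0))

# Heterogeneous effects by type (assumed values)
effect = np.where(types == "C", 1.0, np.where(types == "AT", 4.0, 0.0))
y0 = rng.normal(size=n)
y = y0 + d * effect

# Wald estimator: reduced-form difference over first-stage difference
wald = (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())
ate = effect.mean()
print("Wald/IV estimate:", wald, "population ATE:", ate)
```

The Wald estimate settles near the complier effect rather than the population ATE, which is the sense in which the parameter is "local" to the instrument.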

Comments

I LATE is a well-defined economic parameter
I Whether it is an interesting parameter is a different matter
I Not possible to know who are the compliers in the data
I Interpretation is similar, but derivation more complex, if D or z is continuous

F Continuous z estimates the local instrumental variable (LIV) parameter (Heckman and Vytlacil 1999)

I With multiple instruments, things become thorny ... different instruments, even if all valid, potentially identify different parameters!

F No reason why different IV estimates should be the same
F Using multiple IVs yields a weighted average of different LATEs

DiNardo & Lee (2011) provide an alternative interpretation of the IV estimand

I They replace the monotonicity assumption with what they call a probabilistic monotonicity assumption

I The result is that $\hat{\Delta}_{IV}$ is shown to be a weighted average of $\Delta_i$ where the weights are proportional to the increase in the probability of treatment, $\Pr(D_i = 1 | z_i = 1) - \Pr(D_i = 1 | z_i = 0)$

F Under the monotonicity assumption,

$$\Pr(D_i = 1 | z_i = 1) - \Pr(D_i = 1 | z_i = 0) = \begin{cases} 0 & \text{if type} = AT, NT \\ 1 & \text{if type} = C \end{cases}$$

so that only compliers receive positive weight

F This follows from the assumption that D is a deterministic fn of z
F Probabilistic monotonicity relaxes this assumption and allows D to be a nondecreasing fn of z (conditional on type)

Not possible to infer anything about $\Delta_{ATE}$, $\Delta_{ATT}$, or $\Delta_{ATU}$ without additional assumptions about how compliers compare to the rest of the population

I Vytlacil et al. (2009) working on when one can learn the sign of $\Delta_{ATE}$
I DiNardo & Lee (2011) discuss extrapolating to the $\Delta_{ATE}$
I Heckman et al. (2010) propose two tests of the CRC assumption

$$H_o: \Delta_i \perp D_i | x_i$$

F Test #1 based on comparison of different (valid) IV estimates; under $H_o$ different IVs provide consistent estimates of the same parameter even if they lead to different sub-populations of compliers

F Test #2 based on testing for a linear relationship between y and the estimated propensity score conditional on x

Selection on Unobservables
IV: Finding Instruments

Economic theory ... what determines participation, but not outcomes?

Exogenous variation in program availability (across space or over time) ... must be exogenous

Natural experiments ... twins, sex composition, miscarriages, Mariel Cuban boatlift, Russian immigration to Israel

Randomized experiments (even if imperfect compliance) ... Project STAR

Fuzzy regression discontinuity design

Recall from sharp RD case that we require the existence of the following limits, where $\bar{s}$ denotes the threshold

$$D^+ = \lim_{s \downarrow \bar{s}} \Pr(D = 1 | s)$$

$$D^- = \lim_{s \uparrow \bar{s}} \Pr(D = 1 | s)$$

and $D^+ \neq D^-$

I Sharp RD setup implies $D^+ = 1$ and $D^- = 0$
I Fuzzy RD setup implies $1 \geq D^+ > D^- \geq 0$

Formally

(FRD.i) Treatment assignment is a discontinuous function of s (with a known threshold, $\bar{s}$)

$$D_i = D(s_i, \upsilon_i)$$

where

$$\Pr(D = 1) = \Pr(D = 1 | s \geq \bar{s})\Pr(s \geq \bar{s}) + \Pr(D = 1 | s < \bar{s})\Pr(s < \bar{s})$$

(FRD.ii) Positive density at the threshold: $f_S(\bar{s}) > 0$
(FRD.iii) Outcomes are continuous in s at least around $\bar{s}$ and do not depend on whether $s \gtrless \bar{s}$
(FRD.iv) For each agent, the dbn of s is continuous at least around $\bar{s}$

Notes

I Endogenous treatment variable, D, depends on observed score variable, s, and stochastic element

I Discrete jump in $\Pr(D = 1)$ at the threshold, $\bar{s}$
I Example: $\Pr(D = 1) = \max\{0, 0.5s + 0.25 \cdot I(s > 0.5) + \upsilon\}$

[Figure: Pr(D=1) plotted against the score, with both axes running from 0 to 1]

Implies $D_i = E[D | s_i] + \upsilon_i$, where $Cov(\varepsilon, \upsilon) \neq 0$

OLS estimation of

$$y_i = x_i\beta + \Delta D_i + f(s_i) + \varepsilon_i$$

where x is a vector of exogenous controls, is biased, even with a flexible function of s included

Solution

I Estimate propensity score, where $f(s)$ is included along with the indicator $I(s > \bar{s})$, with $\bar{s}$ the threshold $\Rightarrow \widehat{p(D)}$
I Estimate by OLS

$$y_i = x_i\beta + \Delta\widehat{p(D_i)} + f(s_i) + \varepsilon_i$$

I Equivalent to TSLS, with $I(s > \bar{s})$ as the instrument, when $f(s)$ is chosen parametrically
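
The TSLS version of the fuzzy RD estimator can be sketched as follows. Everything in the data-generating process is an assumption made for illustration (threshold at 0.5, linear $f(s)$, a true effect of 2, no additional controls x), and `tsls` is a hypothetical helper, not a library routine.

```python
import numpy as np

def tsls(y, X, Z):
    """Two-stage least squares with regressors X and instruments Z."""
    Xhat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]       # first-stage fitted values
    return np.linalg.solve(Xhat.T @ X, Xhat.T @ y)

rng = np.random.default_rng(2)
n = 20_000
s = rng.uniform(size=n)                                   # score variable
above = (s > 0.5).astype(float)                           # I(s > threshold)
v = rng.normal(scale=0.5, size=n)                         # unobservable driving take-up
d = (0.5 * (s - 0.5) + 0.6 * above + v > 0.3).astype(float)   # fuzzy assignment
eps = 0.8 * v + rng.normal(scale=0.5, size=n)             # Cov(eps, v) != 0
delta = 2.0
y = 1.0 + delta * d + 1.5 * s + eps                       # f(s) linear by assumption

X = np.column_stack([np.ones(n), s, d])
Z = np.column_stack([np.ones(n), s, above])               # threshold indicator as the IV
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
b_iv = tsls(y, X, Z)
print("OLS:", b_ols[2], "TSLS:", b_iv[2])
```

With a parametric $f(s)$ the whole sample is used; in practice one would often restrict attention to a window around the threshold, which this sketch does not do.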


Interpretation

I Typical interpretation: RD identifies the LATE at the threshold, $\bar{s}$
I DiNardo & Lee (2011) interpret the estimated parameter as a weighted average of $\Delta_i$ where the weights are proportional to (i) the probability of $s_i$ being in the neighborhood of $\bar{s}$ and (ii) the influence of crossing the threshold, $\bar{s}$, on the probability of receiving the treatment

Selection on Unobservables
Methods Not Requiring Exclusion Restrictions

Several methods exist that do not rely on a typical exclusion restriction for identification

1 Heckman bivariate normal selection model
2 Millimet & Tchernis (2011) bias-corrected estimator
3 Higher moments
4 Covariance restrictions

All such methods must replace the assumption concerning an exclusion restriction with some other identifying assumption (there is no such thing as a free lunch)

Selection on Unobservables
Heckman Bivariate Normal Selection Model

Requires fairly strong parametric assumptions to circumvent the selection on unobservables problem

Also useful to solve problems of non-random sample selection (discussed later)

Treatment effects model with common effect

Setup

$$y_{0i} = x_i\beta_0 + \varepsilon_i$$

$$y_{1i} = x_i\beta_1 + \varepsilon_i$$

$$y_i = D_i y_{1i} + (1 - D_i)y_{0i}$$

$$D^*_i = z_i\gamma + u_i$$

$$D_i = \begin{cases} 1 & \text{if } D^*_i > 0 \\ 0 & \text{if } D^*_i \leq 0 \end{cases}$$

Notes

I $\varepsilon_i$ = common error component (or common effect) in both potential outcome equations

I $\beta$s allowed to differ across outcome equations
I $D^*_i$ = latent indicator of treatment status
I Model rules out selection on observables assumption since unobservables associated with treatment status, u, are correlated with unobservables affecting outcomes conditional on x

Assumptions

(BVN.i) $\varepsilon, u \sim N_2(0, 0, \sigma^2_\varepsilon, \sigma^2_u, \rho)$
(BVN.ii) $\varepsilon, u \perp x, z$
(BVN.iii) $\sigma^2_u = 1$

Parameters of interest

I Given the setup, individual-specific treatment effect is given by

$$\Delta_i = y_{1i} - y_{0i} = x_i(\beta_1 - \beta_0)$$

I Average treatment effects are

$$\Delta_{ATE} = E[\Delta_i] = E[x_i](\beta_1 - \beta_0)$$

$$\Delta_{ATT} = E[\Delta_i | D_i = 1] = E[x_i | D_i = 1](\beta_1 - \beta_0)$$

$$\Delta_{ATU} = E[\Delta_i | D_i = 0] = E[x_i | D_i = 0](\beta_1 - \beta_0)$$

I Implies consistent estimates of all three parameters require consistent estimates of $\beta_0$, $\beta_1$

I Two naïve options:

F Split sample into D = 1 and D = 0, and regress y on x via OLS in each sub-sample

F Pool sample, regress y on x, Dx

I Under selection on unobservables, neither option produces consistent estimates

Conditional expectations (following from the properties of conditional normal random variables)

I Of the outcome in the treated state for the treated

$$E[y_i | D_i = 1, x_i, z_i] = x_i\beta_1 + E[\varepsilon_i | u_i > -z_i\gamma] = x_i\beta_1 + \rho\sigma_\varepsilon\left[\frac{\phi(z_i\gamma)}{\Phi(z_i\gamma)}\right] = x_i\beta_1 + \rho\sigma_\varepsilon[\lambda(z_i\gamma)]$$

where $\lambda(\cdot)$ is known as the Inverse Mills' Ratio

I Of the outcome in the untreated state for the untreated

$$E[y_i | D_i = 0, x_i, z_i] = x_i\beta_0 + E[\varepsilon_i | u_i \leq -z_i\gamma] = x_i\beta_0 + \rho\sigma_\varepsilon\left[\frac{-\phi(z_i\gamma)}{1 - \Phi(z_i\gamma)}\right]$$

I Given $Corr(\varepsilon, u) \neq 0$, error term is no longer well-behaved

Estimation: Method #1

Estimate the outcome equation for the treated and the untreated separately via OLS

Consistent estimates of $\beta_0$, $\beta_1$ require inclusion of the selection terms

Selection terms are estimable by

1 Estimating a probit model for treatment assignment $\Rightarrow \hat{\gamma}$
2 Estimating the selection terms

$$\frac{\phi(z_i\hat{\gamma})}{\Phi(z_i\hat{\gamma})} \quad \text{and} \quad \frac{-\phi(z_i\hat{\gamma})}{1 - \Phi(z_i\hat{\gamma})}$$

3 Including these as additional covariates in each second-stage regression

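Upon estimation of $\hat{\beta}_0$, $\hat{\beta}_1$ ...

I Predict $\hat{y}_{1i}$, $\hat{y}_{0i}\ \forall i$
I Estimate treatment effect parameters

$$\hat{\Delta}_{ATE} = \overline{\hat{y}_{1i} - \hat{y}_{0i}}, \quad \hat{\Delta}_{ATT} = \overline{\hat{y}_{1i} - \hat{y}_{0i}}, \quad \hat{\Delta}_{ATU} = \overline{\hat{y}_{1i} - \hat{y}_{0i}}$$

where ATE computes the mean for the entire sample, and the latter two compute means using only the treated and untreated, respectively

I Equivalently,

$$\hat{\Delta}_{ATE} = \bar{x}(\hat{\beta}_1 - \hat{\beta}_0), \quad \hat{\Delta}_{ATT} = \bar{x}_1(\hat{\beta}_1 - \hat{\beta}_0), \quad \hat{\Delta}_{ATU} = \bar{x}_0(\hat{\beta}_1 - \hat{\beta}_0)$$

where $\bar{x}$ is the sample mean, and $\bar{x}_k$, $k = 0, 1$, is the sample mean in the sub-sample with $D = k$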
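
The two-step procedure (Method #1) can be sketched end to end. This is a minimal illustration, not the Stata -treatreg- implementation: `probit_fit` is a hand-rolled Fisher-scoring probit written for this example, the simulated design (including an excluded variable in z) is an assumption, and no standard errors are computed.

```python
import numpy as np
from math import erf

def norm_pdf(a):
    return np.exp(-0.5 * a**2) / np.sqrt(2.0 * np.pi)

def norm_cdf(a):
    return 0.5 * (1.0 + np.vectorize(erf)(a / np.sqrt(2.0)))

def probit_fit(d, z, iters=100, tol=1e-10):
    """Probit MLE by Fisher scoring (hypothetical helper for this sketch)."""
    g = np.zeros(z.shape[1])
    for _ in range(iters):
        idx = z @ g
        p = np.clip(norm_cdf(idx), 1e-10, 1.0 - 1e-10)
        w = norm_pdf(idx) / (p * (1.0 - p))
        score = z.T @ (w * (d - p))
        info = (z * (w * norm_pdf(idx))[:, None]).T @ z
        step = np.linalg.solve(info, score)
        g = g + step
        if np.max(np.abs(step)) < tol:
            break
    return g

def heckman_two_step(y, d, x, z):
    idx = z @ probit_fit(d, z)
    lam1 = norm_pdf(idx) / norm_cdf(idx)               # selection term, treated
    lam0 = -norm_pdf(idx) / (1.0 - norm_cdf(idx))      # selection term, untreated
    t = d == 1
    b1 = np.linalg.lstsq(np.column_stack([x[t], lam1[t]]), y[t], rcond=None)[0]
    b0 = np.linalg.lstsq(np.column_stack([x[~t], lam0[~t]]), y[~t], rcond=None)[0]
    diff = b1[:-1] - b0[:-1]                           # beta_1 - beta_0
    return {"ATE": x.mean(axis=0) @ diff,
            "ATT": x[t].mean(axis=0) @ diff,
            "ATU": x[~t].mean(axis=0) @ diff,
            "rho_sigma": (b1[-1], b0[-1])}             # both estimate rho * sigma_eps

# Simulated data (assumed): rho = 0.6, sigma_eps = 1, true ATE = 1
rng = np.random.default_rng(1)
n = 20_000
x1, z2 = rng.normal(size=n), rng.normal(size=n)
x = np.column_stack([np.ones(n), x1])
z = np.column_stack([np.ones(n), x1, z2])
u = rng.normal(size=n)
eps = 0.6 * u + np.sqrt(1.0 - 0.36) * rng.normal(size=n)
d = (z @ np.array([0.2, 0.5, 1.0]) + u > 0).astype(float)
y = np.where(d == 1, x @ np.array([2.0, 1.5]), x @ np.array([1.0, 1.0])) + eps

res = heckman_two_step(y, d, x, z)
print(res)
```

The two entries of `rho_sigma` are the coefficients on the selection terms in the treated and untreated regressions; under the common-effect model both should be close to $\rho\sigma_\varepsilon$.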


Estimation: Method #2

Estimate a single outcome equation with no restriction

$$y_i = x_i\beta_0 + x_iD_i(\beta_1 - \beta_0) + \beta_{\lambda 1}D_i\left[\frac{\phi(z_i\gamma)}{\Phi(z_i\gamma)}\right] + \beta_{\lambda 0}(1 - D_i)\left[\frac{-\phi(z_i\gamma)}{1 - \Phi(z_i\gamma)}\right] + \eta_i$$

This does not impose the restriction that the coefficient on both selection terms should be the same: $\rho\sigma_\varepsilon$

Thus, testing $H_o: \beta_{\lambda 0} = \beta_{\lambda 1}$ constitutes a specification test of the underlying model

Note

$$\eta_i = \varepsilon_i - \beta_{\lambda 1}D_i\left[\frac{\phi(z_i\gamma)}{\Phi(z_i\gamma)}\right] - \beta_{\lambda 0}(1 - D_i)\left[\frac{-\phi(z_i\gamma)}{1 - \Phi(z_i\gamma)}\right] = \varepsilon_i - D_iE[\varepsilon_i | D_i = 1] - (1 - D_i)E[\varepsilon_i | D_i = 0]$$

which is a well-behaved error term since the portion of the error term that is correlated with treatment assignment now appears in the model in the form of the selection correction terms

Estimation: Method #3

Estimate a single outcome equation imposing the restriction that $\beta_{\lambda 0} = \beta_{\lambda 1}$

$$y_i = x_i\beta_0 + x_iD_i(\beta_1 - \beta_0) + \beta_\lambda\left\{D_i\left[\frac{\phi(z_i\gamma)}{\Phi(z_i\gamma)}\right] + (1 - D_i)\left[\frac{-\phi(z_i\gamma)}{1 - \Phi(z_i\gamma)}\right]\right\} + \eta_i$$

Efficiency gain if, in fact, the restriction is true

Estimation: Method #4

Maximum likelihood estimation of the system of three equations

Above estimators are known as the control function approach since selection terms control for selection on unobservables

ML is not a control function approach, but rather directly incorporates the covariance structure of the errors into the estimation by jointly estimating the system of equations

Benefits: yields an estimate of $\rho$ along with a std error, more efficient if parametric assumptions are true

Cost: results are less robust if parametric assumptions of the model are violated

Comments

There is no instrument or exclusion restriction required for identification

I Identification arises from the non-linearity of the selection correction terms, which in turn arises from the assumption of bivariate normality

I Exclusion restrictions (a variable in z not in x) would be nice

Semi-parametric versions exist

I Relax dependence on bivariate normality
I Require exclusion restrictions
I One version includes a polynomial of the propensity score in the regression model; motivation is to include a flexible functional form to capture the selection terms without reliance on bivariate normality

Bivariate probit treatment effects model

I Similar to above models, except outcome of interest is binary (e.g., employment following a job training program)

I Similar estimation to above by ML, except likelihood is based on a bivariate probit model (same as in Altonji et al. (2005) unconstrained bivariate probit model)

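Aside:

Typical IV estimator can also be implemented using a control function approach

I TSLS estimator of the model

$$y_i = \beta_1x_{1i} + x_{2i}\beta_2 + \varepsilon_i$$

$$x_{1i} = z_i\pi_1 + x_{2i}\pi_2 + u_i$$

is equivalent to OLS estimation of

$$y_i = \beta_1x_{1i} + x_{2i}\beta_2 + u_i + \tilde{\varepsilon}_i$$

where $u_i$ enters as an additional regressor (with its own coefficient) and is replaced with the OLS estimate of the first-stage residual

I Since $\hat{u}_i = x_{1i} - z_i\hat{\pi}_1 - x_{2i}\hat{\pi}_2$, this is not linearly independent of $x_1$ and $x_2$ unless $\pi_1 \neq 0$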
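
The equivalence between TSLS and the control function regression is an exact algebraic identity in the linear model, which a short numerical check can confirm. The simulated design and variable names below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
x2 = np.column_stack([np.ones(n), rng.normal(size=n)])    # exogenous regressors
z = rng.normal(size=(n, 2))                                # excluded instruments
u = rng.normal(size=n)
eps = 0.7 * u + rng.normal(size=n)                         # endogeneity via u
x1 = z @ np.array([1.0, -0.5]) + x2 @ np.array([0.3, 0.4]) + u
y = 1.5 * x1 + x2 @ np.array([1.0, 2.0]) + eps

X = np.column_stack([x1, x2])                              # structural regressors
Zfull = np.column_stack([z, x2])                           # all instruments

# First stage and its residual
pi_hat = np.linalg.lstsq(Zfull, x1, rcond=None)[0]
u_hat = x1 - Zfull @ pi_hat

# TSLS
Xhat = Zfull @ np.linalg.lstsq(Zfull, X, rcond=None)[0]
b_tsls = np.linalg.solve(Xhat.T @ X, Xhat.T @ y)

# Control function: OLS of y on x1, x2 and the first-stage residual
b_cf = np.linalg.lstsq(np.column_stack([X, u_hat]), y, rcond=None)[0]

print("TSLS:", b_tsls)
print("Control function:", b_cf[:3], "coef on u_hat:", b_cf[3])
```

The coefficient on `u_hat` is a by-product: a t-test on it is a common regression-based check for endogeneity of $x_1$, though the OLS standard errors from this second step are not the correct TSLS standard errors.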


Treatment effects model without the common effect assumption

Relaxation of common effect assumption allows for heterogeneous effects of the treatment even conditional on x

Setup

$$y_{0i} = x_i\beta_0 + \varepsilon_{0i}$$

$$y_{1i} = x_i\beta_1 + \varepsilon_{1i} = x_i\beta_1 + [(\varepsilon_{1i} - \varepsilon_{0i}) + \varepsilon_{0i}] = x_i\beta_1 + [\delta_i + \varepsilon_{0i}]$$

$$y_i = D_iy_{1i} + (1 - D_i)y_{0i}$$

$$D^*_i = z_i\gamma + u_i$$

$$D_i = \begin{cases} 1 & \text{if } D^*_i > 0 \\ 0 & \text{if } D^*_i \leq 0 \end{cases}$$

Notes

I $\delta_i$ = obs-specific gain to treatment (conditional on x)
I $\Delta_i = y_{1i} - y_{0i} = x_i(\beta_1 - \beta_0) + \delta_i$ (heterogeneous treatment effects given x)

I Selection into treatment may depend on either $\varepsilon_{0i}$ (untreated outcome level given x) or $\delta_i$ (obs-specific gains given x)

I Otherwise, intuition is identical to common effect version

Assumptions (replaces (BVN.i))

(BVN.i′) $\varepsilon_0, \varepsilon_1, u \sim N(0, \Sigma)$, where

$$\Sigma = \begin{bmatrix} \sigma^2_{\varepsilon_0} & \rho_{01} & \rho_{0u} \\ & \sigma^2_{\varepsilon_1} & \rho_{1u} \\ & & 1 \end{bmatrix}$$

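Conditional expectations

$$E[\varepsilon_{0i} | D_i = 1, x_i, z_i] = \rho_{0u}\sigma_{\varepsilon_0}\left[\frac{\phi(z_i\gamma)}{\Phi(z_i\gamma)}\right]$$

$$E[\delta_i | D_i = 1, x_i, z_i] = \rho_{\delta u}\sigma_\delta\left[\frac{\phi(z_i\gamma)}{\Phi(z_i\gamma)}\right]$$

$$E[\varepsilon_{0i} | D_i = 0, x_i, z_i] = \rho_{0u}\sigma_{\varepsilon_0}\left[\frac{-\phi(z_i\gamma)}{1 - \Phi(z_i\gamma)}\right]$$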

Estimation

Generalization of the previous two-step approach in the common effect model

Estimating equation

$$y_i = x_i\beta_0 + x_iD_i(\beta_1 - \beta_0) + \tilde{\beta}_{\lambda 1}D_i\left[\frac{\phi(z_i\gamma)}{\Phi(z_i\gamma)}\right] + \beta_{\lambda 0}(1 - D_i)\left[\frac{-\phi(z_i\gamma)}{1 - \Phi(z_i\gamma)}\right] + \zeta_i$$

where

$$\tilde{\beta}_{\lambda 1} = \rho_{0u}\sigma_{\varepsilon_0} + \rho_{\delta u}\sigma_\delta$$

$$\beta_{\lambda 0} = \rho_{0u}\sigma_{\varepsilon_0}$$

Selection terms obtained by estimating first-stage probit model for D

ML estimation of entire model is feasible, but it requires estimation of a trivariate normal dbn (computationally difficult)

$\rho_{01}$ is not identified since never observe $y_1$ and $y_0$ for same i

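Upon estimation of $\hat{\beta}_0$, $\hat{\beta}_1$ ...

I Predict $\hat{y}_{1i}$, $\hat{y}_{0i}\ \forall i$
I Estimate $\hat{\Delta}_{ATE}$

$$\hat{\Delta}_{ATE} = \overline{\hat{y}_{1i} - \hat{y}_{0i}} = \bar{x}(\hat{\beta}_1 - \hat{\beta}_0)$$

where ATE computes mean for entire sample

I ATT is given by

$$\Delta_{ATT} = E_{x_i | D_i = 1}[x_i(\beta_1 - \beta_0)] + E_{\delta_i | D_i = 1}[\delta_i] = E_{x_i | D_i = 1}[x_i(\beta_1 - \beta_0)] + E_{z_i | D_i = 1}\left[\rho_{\delta u}\sigma_\delta\frac{\phi(z_i\gamma)}{\Phi(z_i\gamma)}\right]$$

F If there is no selection on unobservable gains, then $\rho_{\delta u} = 0 \Rightarrow$ common effect model

F $\tilde{\beta}_{\lambda 1} - \beta_{\lambda 0} = \rho_{\delta u}\sigma_\delta \Rightarrow \widehat{\rho_{\delta u}\sigma_\delta} = \hat{\tilde{\beta}}_{\lambda 1} - \hat{\beta}_{\lambda 0}$, which gives the sign of the selection on gains (which one expects to be positive if obs know their unobservable gains)

F Estimate obtained by replacing expectations with sample averages within the treatment group

I ATU obtained in similar fashion, but average over x, z in control group

Stata: -treatreg-, -biprobit-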

Selection on Unobservables
Millimet & Tchernis (2011)

Builds on the minimum biased approach (discussed earlier) by offering a bias-corrected procedure

Recall, under certain assumptions the bias of the ATT, ATE at some value of the propensity score, $p(x)$, is given by

$$B_{ATT}[p(x)] = \rho_{0u}\sigma_0\frac{\phi(\Phi^{-1}(p(x)))}{p(x)[1 - p(x)]}$$

$$B_{ATE}[p(x)] = \{\rho_{0u}\sigma_0 + [1 - p(x)]\rho_{\delta u}\sigma_\delta\}\frac{\phi(\Phi^{-1}(p(x)))}{p(x)[1 - p(x)]}$$

where

I $\rho_{0u}$ = selection on unobservables affecting outcome in untreated state
I $\rho_{\delta u}$ = selection on unobserved, individual-specific gains

$B_{ATT}[p(x)]$ is minimized at $p(x) = 0.5$; $B_{ATE}[p(x)]$ does not have a unique minimum

Minimum-biased (MB) estimation technique

I Stage 1: Estimate the propensity score (e.g., probit model)
I Stage 2: Retain only those observations with a propensity score, $\widehat{p(x_i)}$, within a fixed neighborhood around $p^*(x)$, the bias-minimizing propensity score

I Stage 3: Estimate the ATE or ATT using any propensity-score based estimator that relies on CIA using this sub-sample

For ATE, add Stage 1.5: Estimate the error correlations using Heckman BVN model

BC estimator amends the previous MB estimator by removing the estimated bias

$$\hat{\Delta}^k_{BC} = \hat{\Delta}^k - \int \widehat{B_k[p(x_i)]}f_k(x)dx, \quad k = ATE, ATT, ATU$$

where $f_k(x)$ is the appropriate dbn needed to estimate parameter k

Millimet & Tchernis (2011) find some benefit to this estimator, particularly in large samples, using MC

Selection on Unobservables
Higher Moments: Lewbel (2010) approach

Originally proposed as a solution to measurement error, but potentially applicable to more general dependence between x and $\varepsilon$ (Lewbel 1997, 2010)

Setup

I Structural model

$$y_i = \beta_1D_i + x_i\beta_2 + \varepsilon_i$$

I First-stage model

$$D_i = x_i\pi + u_i$$

where

F x includes the intercept
F $Cov(\varepsilon, u) \neq 0$

D may be discrete or continuous

Potential instruments for D include $(z_i - \bar{z})u_i$, where $z \subseteq x$

Estimation requires consistently estimating the first-stage and replacing u with $\hat{u}$

Validity of the IVs requires

(HM.i) $E[z'u^2] \neq 0$
(HM.ii) $E[z'\varepsilon u] = 0$

Restrictions are satisfied if, say,

$$\varepsilon_i = \theta_i + \tilde{\varepsilon}_i$$
$$u_i = \theta_i + \tilde{u}_i$$

where $\theta_i$ is a homoskedastic common factor and the sole source of correlation between $\varepsilon$ and u, and $\tilde{u}$ is heteroskedastic with variance depending on z
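
The heteroskedasticity-based instrument can be illustrated with a simulation that follows the common-factor structure just described. The specific design (variance of $\tilde{u}$ equal to $\exp(x_1)$, true $\beta_1 = 1$, $z$ equal to the single non-constant element of x) is an assumption made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20_000
x1 = rng.normal(size=n)
x = np.column_stack([np.ones(n), x1])              # x includes the intercept
theta = rng.normal(size=n)                         # homoskedastic common factor
eps = theta + rng.normal(size=n)                   # structural error
u = theta + np.exp(0.5 * x1) * rng.normal(size=n)  # heteroskedastic first-stage error
d = x @ np.array([0.5, 1.0]) + u
y = 1.0 * d + x @ np.array([1.0, -1.0]) + eps      # true beta_1 = 1

# Step 1: first-stage OLS residuals
u_hat = d - x @ np.linalg.lstsq(x, d, rcond=None)[0]

# Step 2: constructed instrument (z - zbar) * u_hat, with z = x1
iv = (x1 - x1.mean()) * u_hat

# Step 3: just-identified IV with instruments [iv, x] for regressors [d, x]
X = np.column_stack([d, x])
Z = np.column_stack([iv, x])
b_iv = np.linalg.solve(Z.T @ X, Z.T @ y)
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print("OLS beta_1:", b_ols[0], "constructed-IV beta_1:", b_iv[0])
```

The instrument has power only because $E[z'u^2] \neq 0$; with homoskedastic u the constructed variable would be irrelevant and the estimator would break down.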


Selection on Unobservables
Higher Moments: Klein & Vella (2009, 2010); Farré et al. (2010)

Setup as in the prior model

I Structural model

$$y_i = \beta_1D_i + x_i\beta_2 + \varepsilon_i$$

I First-stage model

$$D_i = x_i\pi + u_i$$

where

F x includes the intercept
F $Cov(\varepsilon, u) \neq 0$

D may be discrete or continuous

Identification assumptions

(KV.i) $\varepsilon_i = S_\varepsilon(z_i)\varepsilon^*_i$ and/or $u_i = S_u(z_i)u^*_i$, where $z \subseteq x$, such that $S_\varepsilon(z_i)/S_u(z_i)$ varies across i
(KV.ii) $E[\varepsilon^*_iu^*_i] = \rho$, which is constant

Under (KV.i) and (KV.ii), the structural model may be re-written as

$$y_i = \beta_1D_i + x_i\beta_2 + \rho\left[\frac{S_\varepsilon(z_i)}{S_u(z_i)}u_i\right] + \tilde{\varepsilon}_i$$

where $\tilde{\varepsilon}_i$ is now a well-behaved error term

The term in brackets acts as a control function since it controls for selection bias such that, conditional on this term and x, D is no longer correlated with the error term

Klein & Vella (2009) propose a semiparametric estimator of the model

Farré et al. (2010) outline a parametric estimator

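Parametric Estimation

Assuming

$$S_\varepsilon(z_i) = \sqrt{\exp(z_i\theta_\varepsilon)}$$

$$S_u(z_i) = \sqrt{\exp(z_i\theta_u)}$$

the structural model becomes

$$y_i = \beta_1D_i + x_i\beta_2 + \rho\left[\frac{\sqrt{\exp(z_i\theta_\varepsilon)}}{\sqrt{\exp(z_i\theta_u)}}u_i\right] + \tilde{\varepsilon}_i$$

Estimate the first-stage by OLS $\Rightarrow \hat{u}$

Estimate by OLS

$$\ln(\hat{u}^2_i) = z_i\theta_u + \tilde{u}_i$$

and form $\hat{S}_u(z_i) = \sqrt{\exp(z_i\hat{\theta}_u)}$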

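Substitute $\hat{u}$ and $\hat{S}_u(z_i)$ into the structural model and estimate the remaining parameters by NLS

$$y_i = \beta_1D_i + x_i\beta_2 + \rho\left[\frac{\sqrt{\exp(z_i\theta_\varepsilon)}}{\hat{S}_u(z_i)}\hat{u}_i\right] + \tilde{\varepsilon}_i$$

While one could stop, performance is perhaps improved by adding additional steps

I Given NLS estimates of $\beta_1$ and $\beta_2 \Rightarrow \hat{\varepsilon}$
I Estimate by OLS

$$\ln(\hat{\varepsilon}^2_i) = z_i\theta_\varepsilon + \tilde{\tilde{\varepsilon}}_i$$

and form $\hat{S}_\varepsilon(z_i) = \sqrt{\exp(z_i\hat{\theta}_\varepsilon)}$

I Estimate by OLS

$$y_i = \beta_1D_i + x_i\beta_2 + \rho\left[\frac{\hat{S}_\varepsilon(z_i)}{\hat{S}_u(z_i)}\hat{u}_i\right] + \tilde{\varepsilon}_i$$

Obtain std errors via bootstrap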

Selection on Unobservables
Higher Moments: Klein & Vella (2009)

Setup

I Structural model

$$y_i = \beta_1D_i + x_i\beta_2 + \varepsilon_i$$

I First-stage model

$$D_i = x_i\pi_1 + u_i$$

where x contains an intercept

When D is binary, one may estimate the first-stage via probit and form an instrument using the propensity score, $\widehat{p(x)}$

Even with no exclusion restriction, $\widehat{p(x)}$ is correlated with D and linearly independent of x (since $\widehat{p(x)} = \Phi(x\hat{\pi})$)

However, most of this non-linearity occurs in the tails

Additional non-linearity of the IV may be induced if one uses a heteroskedastic probit to form the IV

I $\sigma_u$ is modeled as $\exp(x\delta)$
I $\widehat{p(x)} = \Phi(x\hat{\pi}/\exp(x\hat{\delta}))$
I Additional non-linearity is roughly equivalent to using higher-order terms of x as exclusion restrictions

Klein & Vella (2009) also propose a semiparametric version

Selection on Unobservables
Higher Moments: Vella & Verbeek (1997); Rummery et al. (1999)

Vella and Verbeek (1997) propose an alternative IV strategy that may also be valid with heteroskedastic errors

Known as Rank Order IV

Setup as in the prior models

$$y_i = \beta_1D_i + x_i\beta_2 + \varepsilon_i$$

$$D_i = x_i\pi + u_i$$

where

I x includes the intercept
I $Cov(\varepsilon, u) \neq 0$

D may be discrete or continuous

Identification assumptions

(ROIV.i) An agent's level of unobserved heterogeneity responsible for $Cov(\varepsilon, u) \neq 0$ does not impact y, but rather only the agent's relative position or rank order matters

(ROIV.ii) Data can be partitioned into subsets such that agents may be paired across subsets in a manner leading to pairs with identical ranks in their respective subsets but different levels of D

For example, if y is wages, D is participation in a training program, and endogeneity is due to unobserved work ethic, then

I (ROIV.i) implies that the level of one's work ethic does not impact wages, but only the fraction of workers whose work ethic one's own exceeds

F I.e., one's level of work ethic is irrelevant, only one's percentile in the dbn of work ethic matters

I (ROIV.ii) implies we can divide the data (say, by region) such that across regions individuals at the same percentile of the dbn of work ethic within their region have different values of D

To proceed, partition the data into mutually exclusive groups, $s = 1, ..., S$, on the basis of some attribute, $q_i$ (which may be a subset of x)

Notation

I Define $F(\cdot | q_i)$ as the CDF of u given q
I Let $c_i = F(u_i | q_i)$ be the rank order of obs i in its partition

(ROIV.i) may be expressed formally as

$$E[\varepsilon_i | x_i, D_i, u_i, q_i] = E[\varepsilon_i | u_i, q_i] = E[\varepsilon_i | c_i] = m(c_i)$$

where $m(\cdot)$ is some fn mapping c to y

I This condition states that $E[\varepsilon_i | u_i, q_i]$ depends on u and q only through the rank order, c

I Vella & Verbeek (1997) refer to this as the order restriction

The order restriction is useful for identifying the model since it implies that agents from different partitions, $q_i \neq q_j$, but with identical rank orders, $c_i = c_j$, are identical along the unobserved dimension responsible for the endogeneity

To be useful, however, this requires an additional assumption, (ROIV.ii), such that these comparable pairs of agents have different values of D

Estimation

Re-write the structural model as

$$y_i = \beta_1D_i + x_i\beta_2 + m(c_i) + \tilde{\varepsilon}_i$$

where $\tilde{\varepsilon}$ is now a well-behaved error term; $m(c)$ is another example of a control function, but c and $m(\cdot)$ are unknown

Estimate $c_i$ by

I Estimating the first-stage model via OLS $\Rightarrow \hat{u}_i$
I Estimating $\hat{c}_i$ nonparametrically using the empirical CDF within each of the S partitions based on q

Approximate $m(c)$ using a finite-order polynomial in $\hat{c}$

Alternatively, one may estimate the original structural model

$$y_i = \beta_1D_i + x_i\beta_2 + \varepsilon_i$$

by IV with the instrument given by the residual, $\hat{\eta}$, obtained after OLS estimation of the model

$$D_i = \theta_0 + \theta_1c_i + \eta_i$$

Selection on Unobservables
Covariance Restrictions

Setup

I Structural model

$$y_i = \beta_0 + \beta_1D_i + x_i\beta_2 + \varepsilon_i$$

I First-stage model

$$D_i = \pi_0 + x_i\pi_1 + u_i$$

I Reduced form model

$$y_i = (\beta_0 + \beta_1\pi_0) + x_i(\beta_1\pi_1 + \beta_2) + (\varepsilon_i + \beta_1u_i) = \tilde{\beta}_0 + \tilde{\beta}_1x_i + \tilde{\upsilon}_i$$

With no IV, estimable quantities include: $\pi_0, \pi_1, \tilde{\beta}_0, \tilde{\beta}_1$

I These four quantities are functions of five structural parameters: $\pi_0, \pi_1, \beta_0, \beta_1, \beta_2$

I Thus, the model is under-identified

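What about the covariance matrix of the system of reduced form eqtns? $\beta_1$ also shows up there

$$y_i = \tilde{\beta}_0 + \tilde{\beta}_1x_i + (\varepsilon_i + \beta_1u_i)$$

$$D_i = \pi_0 + x_i\pi_1 + u_i$$

Assume $\varepsilon, u \sim N(0, 0, \sigma_\varepsilon, \sigma_u, \rho)$, then $\tilde{\upsilon}, u$ are also mean zero with covariance matrix

$$\Sigma = \begin{bmatrix} \sigma^2_\varepsilon + \beta^2_1\sigma^2_u + 2\beta_1\rho\sigma_\varepsilon\sigma_u & \rho\sigma_\varepsilon\sigma_u + \beta_1\sigma^2_u \\ & \sigma^2_u \end{bmatrix} = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ & \Sigma_{22} \end{bmatrix}$$

Three quantities are estimable based on MLE of the system: $\Sigma_{11}, \Sigma_{12}, \Sigma_{22}$

I These 3 quantities are functions of 4 structural parameters: $\beta_1, \sigma_\varepsilon, \sigma_u, \rho$

I Thus, the model remains under-identified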

Intuition: place restrictions on other parameters in $\Sigma$ in order to identify $\beta_1$ from the cov matrix; intercept and slope parameters are all identified then as well
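
As a worked illustration of how a single restriction closes the counting gap, suppose (purely as an assumed example, not a restriction advocated here) that one imposes $\rho = 0$ in the two-equation system with $\Sigma_{11} = \sigma^2_\varepsilon + \beta^2_1\sigma^2_u + 2\beta_1\rho\sigma_\varepsilon\sigma_u$, $\Sigma_{12} = \rho\sigma_\varepsilon\sigma_u + \beta_1\sigma^2_u$, and $\Sigma_{22} = \sigma^2_u$. The three estimable quantities then determine the three remaining structural parameters:

$$\begin{aligned} \sigma^2_u &= \Sigma_{22} \\ \beta_1 &= \frac{\Sigma_{12}}{\Sigma_{22}} \\ \sigma^2_\varepsilon &= \Sigma_{11} - \beta^2_1\Sigma_{22} = \Sigma_{11} - \frac{\Sigma^2_{12}}{\Sigma_{22}} \end{aligned}$$

With $\rho = 0$ there is no endogeneity, so this simply recovers the OLS answer. The same logic applies for any other imposed value $\rho = \rho_0$: three equations in the three unknowns $\beta_1, \sigma_\varepsilon, \sigma_u$, although the solution is then no longer available in such a simple closed form.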

Model is then estimated via ML

lnL = ∑i12ln jΣ1j 1

2ε0iΣ

1εi

where εi is the vector of errors for obs i

Note: If D is instead modelled as a LDV, then the likelihood must be factored appropriately to account for the fact that one eqtn has a discrete outcome

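Realistic restrictions may be easier to devise if one adds additional outcomes that also depend on the same endogenous regressor

- Ex: K = 2

\[ y_{1i} = \tilde{\beta}_{10} + \tilde{\beta}_{11} x_i + (\varepsilon_{1i} + \beta_{11} u_i) \]

\[ y_{2i} = \tilde{\beta}_{20} + \tilde{\beta}_{21} x_i + (\varepsilon_{2i} + \beta_{21} u_i) \]

\[ D_i = \pi_0 + x_i\pi_1 + u_i \]

which entails

\[ \Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} & \Sigma_{13} \\ & \Sigma_{22} & \Sigma_{23} \\ & & \Sigma_{33} \end{bmatrix} \]

with elements

\[ \Sigma_{11} = \sigma_{\varepsilon_1}^2 + \beta_{11}^2\sigma_u^2 + 2\beta_{11}\rho_1\sigma_{\varepsilon_1}\sigma_u \]
\[ \Sigma_{12} = \rho_{12}\sigma_{\varepsilon_1}\sigma_{\varepsilon_2} + \beta_{11}\rho_2\sigma_{\varepsilon_2}\sigma_u + \beta_{21}\rho_1\sigma_{\varepsilon_1}\sigma_u + \beta_{11}\beta_{21}\sigma_u^2 \]
\[ \Sigma_{13} = \rho_1\sigma_{\varepsilon_1}\sigma_u + \beta_{11}\sigma_u^2 \]
\[ \Sigma_{22} = \sigma_{\varepsilon_2}^2 + \beta_{21}^2\sigma_u^2 + 2\beta_{21}\rho_2\sigma_{\varepsilon_2}\sigma_u \]
\[ \Sigma_{23} = \rho_2\sigma_{\varepsilon_2}\sigma_u + \beta_{21}\sigma_u^2 \]
\[ \Sigma_{33} = \sigma_u^2 \]

where $\rho_k$ is the correlation between $\varepsilon_k$ and $u$, and $\rho_{12}$ is the correlation between $\varepsilon_1$ and $\varepsilon_2$

- If $y_1, y_2$ are similar (e.g., two anthropometric measures), might impose $\rho_1 = \rho_2$ and might have a strong prior for $\rho_{12}$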

Types of restrictions

Altonji et al. (2005)-type restrictions: impose values for $\rho$ and track estimates of $\beta_1$

Factor Structure

- Add additional outcomes
- Decompose errors as

\[ \varepsilon_{ki} = \lambda_k\mu_i + \eta_{ki}, \quad k = 1, ..., K \]

\[ u_i = \lambda_u\mu_i + \xi_i \]

  where $\mu$ has unit var (normalization, not an assumption), $\eta, \xi, \mu$ are assumed to be independent, and $\lambda$ are known as factor loadings
- Factor structure assumes all cross-eqtn correlation is through $\mu$
- Parameters to be estimated from $\Sigma$: $\sigma_{\eta_k}, \lambda_k, \beta_{1k}, \lambda_u, \sigma_\xi$
  - This is $3K + 2$ parameters in total
  - Estimable quantities from $\Sigma$ is $(K + 1)K/2$
  - $(K + 1)K/2 \geq 3K + 2 \Rightarrow K \geq 6$

Hogan and Rigobon (2003), Rigobon (2003) propose an Identification through Heteroskedasticity estimator that is very similar

Selection on Unobservables
Distributional Approaches

Relatively recent work has begun to address endogeneity in the context of distributional models

Other estimators not discussed here

1. Fixed effect QR models (Koenker 2004)
2. Nonparametric bounds applied to QR models (Giustinelli 2011)

Selection on Unobservables
Distributional Approaches: Changes-in-Changes

Recall, standard DID strategy

- Assume treatment group observed pre- and post-intervention
- Assume control group observed in same time periods
- Assume treatment and control groups follow same time trend absent treatment
- Estimate treatment effect by the additional change over time in the treatment group relative to the control group

Idea is extendable beyond just average treatment effects

Model does require panel data or repeated cross-sections

Setup (Athey & Imbens 2006)

Notation

- Individual i belongs to a group $G_i \in \{0, 1\}$, where G = 1 is treatment group
- Individual i observed at time $T_i \in \{0, 1\}$
- $y_i^N, y_i^I$ = potential outcomes in non-treated (N), treated (intervention, I) states
- $y_i = (1 - I_i)y_i^N + I_i y_i^I$ = observed outcome, where $I_i$ = treatment (intervention) indicator
- $I_i = G_i T_i$

Standard DID

- Untreated outcome

\[ y_i^N = \alpha + \beta T_i + \gamma G_i + \varepsilon_i \]

- Constant treatment effect assumption

\[ \tau = y_i^I - y_i^N \]

- Combining above two assumptions yields

\[ y_i = \alpha + \beta T_i + \gamma G_i + \tau I_i + \varepsilon_i \]

  where
  - $\tau$ = ATE with constant treatment effect assumption
  - $\tau$ = ATT with heterogeneous treatment effect assumption
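With two groups and two periods, the OLS estimate of $\tau$ in the saturated regression above equals the difference of the two within-group changes in mean outcomes. The sketch below illustrates that arithmetic; the function name `did_estimate` and the argument layout are assumptions for illustration only.

```python
import numpy as np

def did_estimate(y, g, t):
    """Difference-in-differences from the four group-by-period cell means."""
    y, g, t = (np.asarray(a, dtype=float) for a in (y, g, t))

    def cell_mean(group, period):
        return y[(g == group) & (t == period)].mean()

    change_treated = cell_mean(1, 1) - cell_mean(1, 0)   # change over time, G = 1
    change_control = cell_mean(0, 1) - cell_mean(0, 0)   # change over time, G = 0
    return change_treated - change_control
```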


Generalizing the standard model

- Untreated outcome

\[ y_i^N = h(U_i, T_i) \]

  where
  - $h(u, t)$ is increasing in u
  - $u_i$ = unobservable attribute of i
  - $y^N$ is identical across individuals within a time period with identical u, irrespective of G
- Dbn of u may vary by G, but not over time within G, $u_i \perp T_i | G_i$
- In the absence of treatment...
  - Any differences in outcomes across groups are entirely due to diffs in the dbn of u across groups
  - Any changes in outcomes within groups over time are due to diffs in $h(u, 0)$ and $h(u, 1)$ [i.e., since unobservables do not change over time, the effect of unobservables on the untreated outcome must change over time]
- Treated outcome

\[ y_i^I = h^I(U_i, T_i) \]

  where $h^I(u, t)$ is increasing in u

Changes-in-changes model

Notation

- Conditional dbns

\[ y_{gt}^N \sim y^N | G = g, T = t \]
\[ y_{gt}^I \sim y^I | G = g, T = t \]
\[ y_{gt} \sim y | G = g, T = t \]
\[ U_g \sim U | G = g \]

- Inverse CDFs

\[ F_Y^{-1}(q) = \inf\{y : F_Y(y) \geq q\} \]

Goal

- Devise set of assumptions to identify dbn of $y_{11}^N$, $F_{y^N,11}$, which is (one of) the distributions of missing counterfactuals
- Observable dbns include: $F_{y^N,10}$, $F_{y^I,11}$, $F_{y^N,00}$, and $F_{y^N,01}$

Assumptions

(CIC.i) Model: $y^N = h(U, T)$
(CIC.ii) Strict monotonicity: $h(u, t)$ is strictly increasing in u for $t \in \{0, 1\}$
(CIC.iii) Time invariance within groups: $U \perp T | G$
(CIC.iv) Support: $\mathbb{U}_1 \subseteq \mathbb{U}_0$

Estimator

Counterfactual CDF

\[ \hat{F}_{y^N,11}(y) = F_{y,10}\left(F_{y,00}^{-1}\left(F_{y,01}(y)\right)\right) \]

which is estimable using empirical CDFs

Treatment effect estimate

\[ \tau_q^{CIC} = F_{y^I,11}^{-1}(q) - \hat{F}_{y^N,11}^{-1}(q) \]

Note, $\tau_q^{CIC}$ is the difference in two QTE (Firpo 2007) estimates

\[ \tau_q^{CIC} = \Delta_{q,1}^{QTE} - \Delta_{q',0}^{QTE} \]

where

- $\Delta_{q,1}^{QTE}$ is change over time in y at quantile q for G = 1 group
- $\Delta_{q',0}^{QTE}$ is change over time in y at quantile $q'$ for G = 0 group, where $q'$ is the quantile in the G = 0, T = 0 dbn corresponding to the value of y associated with quantile q in the G = 1, T = 0 dbn
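The estimator above can be sketched with empirical CDFs and empirical quantiles. Inverting the counterfactual CDF gives the counterfactual quantile as $F_{y,01}^{-1}(F_{y,00}(F_{y,10}^{-1}(q)))$, which is what the code computes. This is an illustrative sketch for continuous outcomes only; the function names (`ecdf`, `quantile`, `cic_qte`) and the argument order are assumptions, not part of the original notes.

```python
import numpy as np

def ecdf(sample, y):
    """Empirical CDF of `sample` evaluated at y."""
    s = np.sort(np.asarray(sample, dtype=float))
    return np.searchsorted(s, y, side="right") / s.size

def quantile(sample, q):
    """Generalized inverse of the empirical CDF: inf{y : F(y) >= q}."""
    s = np.sort(np.asarray(sample, dtype=float))
    idx = np.clip(np.ceil(q * s.size).astype(int) - 1, 0, s.size - 1)
    return s[idx]

def cic_qte(y00, y01, y10, y11, q):
    """Changes-in-changes effect at quantile q; y_gt holds outcomes for G = g, T = t."""
    y10_q = quantile(y10, q)                   # q-th quantile, treated group, pre-period
    q_prime = ecdf(y00, y10_q)                 # its rank q' in the control pre-period dbn
    counterfactual = quantile(y01, q_prime)    # counterfactual quantile of y^N for G = 1, T = 1
    return quantile(y11, q) - counterfactual
```

A usage sketch: pass the four outcome arrays and a quantile such as `q=0.5` to obtain the median CIC effect for the treatment group.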


Alternative estimator

- QDID treatment effect estimator

\[ \tau_q^{QDID} = F_{y^I,11}^{-1}(q) - \hat{F}_{y^N,11}^{-1}(q) \]

  where

\[ \hat{F}_{y^N,11}^{-1}(q) = F_{y,10}^{-1}(q) + \left[F_{y,01}^{-1}(q) - F_{y,00}^{-1}(q)\right] \]

  which corresponds to

\[ \tau_q^{QDID} = \Delta_{q,1}^{QTE} - \Delta_{q,0}^{QTE} \]

  where $\Delta_{q,1}^{QTE}$, $\Delta_{q,0}^{QTE}$ is change over time in y at quantile q for G = 1, 0, respectively
- Relies on (perhaps) unrealistic assumptions

Counterfactual CDF for control group

\[ \hat{F}_{y^I,01}(y) = F_{y,00}\left(F_{y,10}^{-1}\left(F_{y,11}(y)\right)\right) \]

Treatment effect estimate

\[ \tau_{q,0}^{CIC} = \hat{F}_{y^I,01}^{-1}(q) - F_{y^N,01}^{-1}(q) \]

Notes

Athey & Imbens (2006) discuss extensions to

- Discrete outcomes
- Multiple groups and multiple time periods
- Incorporating covariates
  - Semiparametric specification of potential outcomes

\[ y^N = h(u, t) + x\beta \]
\[ y^I = h^I(u, t) + x\beta \]

    where $U \perp T, X | G$
  - OLS estimation of outcomes

\[ y_i = D_i\delta + x_i\beta + \varepsilon_i \]

    where $D = [GT \;\; (1 - G)T \;\; G(1 - T) \;\; (1 - G)(1 - T)]$
  - Perform CIC estimation on

\[ \hat{y}_i = y_i - x_i\hat{\beta} = D_i\hat{\delta} + \hat{\varepsilon}_i \]

  - Inverse propensity score weighting alternative?

Panel data allows additional flexibility, but repeated cross sections are sufficient

Inference

- Athey & Imbens (2006) prove asymptotic normality, and derive the asymptotic variance
- Bootstrap alternative?

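Selection on Unobservables
Distributional Approaches: IV Quantile Regression

Recall, QR model (Koenker & Bassett 1978)

- Assuming linear conditional quantiles, estimation is

\[ (\hat{\beta}_\theta, \hat{\Delta}_\theta) = \arg\min_{\beta,\Delta} \frac{1}{N}\left\{ \sum_{i: y_i \geq \Delta D_i + x_i\beta} \theta\,|y_i - \Delta D_i - x_i\beta| + \sum_{i: y_i < \Delta D_i + x_i\beta} (1 - \theta)\,|y_i - \Delta D_i - x_i\beta| \right\} \]

- May be rewritten as

\[ (\hat{\beta}_\theta, \hat{\Delta}_\theta) = \arg\min_{\beta,\Delta} \frac{1}{N}\left\{ \sum_i \rho_\theta(\varepsilon_{\theta i}) \right\} \]

  where $\rho_\theta(\varepsilon_{\theta i})$ is the check function, defined as

\[ \rho_\theta(\varepsilon_{\theta i}) = [\theta - I(\varepsilon_{\theta i} < 0)]\,\varepsilon_{\theta i} \]

  and $\varepsilon_{\theta i}$ is the residual for i and $\theta$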
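To make the check function concrete, the sketch below evaluates $\rho_\theta$ and shows that, in the intercept-only case, minimizing the summed check loss over candidate values recovers a sample quantile. The function names are assumptions for illustration; a full QR fit with covariates would use a linear programming or dedicated quantile regression routine instead of this brute-force search.

```python
import numpy as np

def check_loss(resid, theta):
    """rho_theta(e) = [theta - 1(e < 0)] * e, applied elementwise."""
    resid = np.asarray(resid, dtype=float)
    return (theta - (resid < 0)) * resid

def sample_quantile_by_check_loss(y, theta):
    """Intercept-only QR: pick the observed value minimizing the summed check loss."""
    y = np.asarray(y, dtype=float)
    losses = [check_loss(y - candidate, theta).sum() for candidate in y]
    return y[int(np.argmin(losses))]
```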


Parameters of interest are the partial derivatives of the conditional quantile fn w.r.t. x

\[ \frac{\partial E[Q_\theta(y|x, D)]}{\partial x_k} \]

which equals $\beta_{\theta k}$ if x enters linearly

For discrete regressors, parameters give the expected change in the conditional quantile fn

\[ \Delta_\theta = E[Q_\theta(y|x, D = 1)] - E[Q_\theta(y|x, D = 0)] \]

QR model is biased and inconsistent if D is endogenous

Recall, potential outcomes setup

- $y_d$, d = 0, 1, are potential outcomes associated with D = 0, 1, respectively
- $q(d, x, \theta)$ = conditional quantile fn of potential outcomes
- $\Delta_\theta = q(1, x, \theta) - q(0, x, \theta)$ = QTE (parameter of interest)

IV-QR model (Chernozhukov & Hansen 2005, 2006)

Express conditional quantile fn as

\[ y_d = q(d, x, u_d), \quad u_d \sim U[0, 1] \]

where $q(d, x, \theta)$ is the conditional $\theta$th-quantile of potential outcome, $y_d$

Linear (in parameters) conditional quantile fn implies

\[ q(D_i, x_i, \theta) = \Delta_\theta D_i + x_i\beta_\theta \]

Assumptions

(IV-QR.i) Potential outcomes: given X = x, for each d, $y_d = q(d, x, u_d)$, where $u_d \sim U[0, 1]$ and $q(d, x, \theta)$ is strictly increasing in $\theta$
(IV-QR.ii) Independence: given X = x, $\{u_d\} \perp Z$
(IV-QR.iii) Selection: given X = x, Z = z, $D \equiv \delta(z, x, \upsilon)$ for unknown fn $\delta(\cdot)$ and random vector, $\upsilon$
(IV-QR.iv) Rank similarity: given X = x, Z = z, $u_d \sim u_{d'}$ $\forall d, d'$
(IV-QR.v) Observed data: $y \equiv q(D, x, u_D)$, $D \equiv \delta(z, x, \upsilon)$, x, and z

Note: rank similarity is a bit weaker than rank invariance (whereby $U_d = U_{d'}$ $\forall d, d'$), and requires that $U_d$ and $U_{d'}$ are equal in expectation only (thus, they may be considered equal ex ante, but are allowed to differ ex post)

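Estimation

Consider the objective fn

\[ \frac{1}{N}\left\{ \sum_i \rho_\theta(\varepsilon_{\theta i}) \right\} \]

where

\[ \rho_\theta(\varepsilon_{\theta i}) = [\theta - I(\varepsilon_{\theta i} < 0)]\,\varepsilon_{\theta i} \]

\[ \varepsilon_{\theta i} = y_i - \Delta_\theta D_i - x_i\beta_\theta - \hat{\Phi}_i\gamma_\theta \]

and $\hat{\Phi}_i$ is the predicted value from the first-stage regression of D on x, z

Given a correctly specified structural model, $\gamma_\theta$ should equal zero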

Algorithm

1. Define a grid of possible values of $\Delta$, $\{\Delta_j, j = 1, ..., J\}$
2. For each $\theta$ and each grid value $\Delta_j$, estimate a QR model with $y_i - \Delta_j D_i$ as the dependent variable and x, $\hat{\Phi}_i$ as covariates
3. Obtain estimates $\hat{\beta}_{\theta j}, \hat{\gamma}_{\theta j}$, j = 1, ..., J
4. Choose $\hat{\Delta}_\theta = \Delta_j$ and $\hat{\beta}_\theta = \hat{\beta}_{\theta j}$ to minimize $|\hat{\gamma}_{\theta j}|$

Inference via sub-sampling or typical, nonparametric iid bootstrap, as in QR model

Can test interesting hypotheses ($\Delta_\theta = 0$, $\Delta_\theta$ constant $\forall\theta$, SD, exogeneity)

Easily extendable to multiple endogenous variables, but the grid search grows exponentially

Selection on Unobservables
Distributional Approaches: Stochastic Dominance

Recall, previous definitions for stochastic dominance

- First Order Stochastic Dominance: $Y_1$ FSD $Y_0$ iff

\[ F_1(y) \leq F_0(y) \quad \forall y \in \aleph \]

  with strict inequality for some y (where $\aleph$ is the union of the supports for $Y_1$ and $Y_0$), or

\[ y_1^\theta \geq y_0^\theta \quad \forall\theta \in [0, 1] \]

  with strict inequality for some $\theta$
- Second Order Stochastic Dominance: $Y_1$ SSD $Y_0$ iff

\[ \int_{-\infty}^{y} F_1(t)\,dt \leq \int_{-\infty}^{y} F_0(t)\,dt \quad \forall y \in \aleph \]

  with strict inequality for some y, or

\[ \int_0^\theta y_1^t\,dt \geq \int_0^\theta y_0^t\,dt \quad \forall\theta \in [0, 1] \]

  with strict inequality for some $\theta$

Recall, previous tests for stochastic dominance

- Test statistics

\[ d = \min \sup_{z \in \aleph} [F(z) - G(z)] \]

\[ s = \min \sup_{z \in \aleph} \int_{-\infty}^{z} [F(t) - G(t)]\,dt \]

  where min is taken over $F - G$ and $G - F$
- Tests are based on estimates of d and s using the empirical CDFs
  - Unconditional, or
  - Inverse propensity score weighted
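As an illustration of the first-order statistic, the sketch below computes the sample analog of d from two unweighted empirical CDFs evaluated on the pooled support. It covers only the unconditional case and only the point estimate, not the inference procedure; the function names are assumptions for illustration.

```python
import numpy as np

def ecdf_on_grid(sample, grid):
    """Empirical CDF of `sample` evaluated at each point of `grid`."""
    s = np.sort(np.asarray(sample, dtype=float))
    return np.searchsorted(s, grid, side="right") / s.size

def fsd_statistic(sample_f, sample_g):
    """d = min{ sup[F - G], sup[G - F] } over the union of the two supports."""
    grid = np.union1d(sample_f, sample_g)
    diff = ecdf_on_grid(sample_f, grid) - ecdf_on_grid(sample_g, grid)
    return min(diff.max(), (-diff).max())
```

A value of d at or below zero is consistent with one distribution first-order dominating the other; a strictly positive value means the empirical CDFs cross.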

Previous methods assume selection on observables

Failure of this assumption invalidates causal conclusions


Solution (Abadie 2002; Imbens & Rubin 1997)

With a binary IV, Z, the potential distributions of the outcome variable are identified for the subpopulation of compliers

$Z_i$ satisfies the following three assumptions:

- Independence: $\{y_{0i}, y_{1i}, D_{0i}, D_{1i}\} \perp Z_i$
- Correlation: $\Pr(Z_i = 1) \in (0, 1)$ and $\Pr(D_{0i} = 1) < \Pr(D_{1i} = 1)$
- Monotonicity: $\Pr(D_{0i} \leq D_{1i}) = 1$

where:

- $y_0, y_1$ are potential outcomes (subscripts refer to treatment status)
- $D_0, D_1$ are potential treatments (subscripts refer to instrument status)

SD tests comparing the distribution of outcomes across the samples with Z = 0 and Z = 1 identify the causal effect of D on y for compliers

Define the empirical CDF of potential outcomes for compliers as

\[ \hat{F}_1^C(y) = E[I(Y_{1i} \leq y) \,|\, D_{1i} = 1, D_{0i} = 0] \]
\[ \hat{F}_0^C(y) = E[I(Y_{0i} \leq y) \,|\, D_{1i} = 1, D_{0i} = 0] \]

Abadie (2002) shows

\[ \hat{F}_1^C(y) - \hat{F}_0^C(y) = K[\hat{F}_1(y) - \hat{F}_0(y)] \]

where

- $\hat{F}_1(y), \hat{F}_0(y)$ are empirical CDFs for the Z = 1, Z = 0 samples
- $K = 1/(E[D|Z = 1] - E[D|Z = 0]) < \infty$

Implies SD tests on $\hat{F}_1(y), \hat{F}_0(y)$ yield valid inference for the SD rankings of $\hat{F}_1^C(y), \hat{F}_0^C(y)$

Different Zs yield different results if the treatment effect varies across the population

Data Issues

Data issues are a fact of life

Frequently encountered are problems pertaining to missing or contaminated data

Sample selection concerns missing data on the dependent variable

Contaminated data refers to a scenario where one is interested in the marginal distribution of a potentially mismeasured variable

Measurement error more generally refers to mismeasured dependent or independent variables

Data Issues
Sample Selection

Population model

\[ y_i = x_i\beta + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2) \]

Given a random sample, $\{y_i, x_i\}_{i=1}^N$, then OLS is consistent and efficient if the usual assumptions are satisfied

Problem arises when data on y is only available for a non-random sample

- Let $S_i = 1$ if $y_i$ is observed; $S_i = 0$ if $y_i$ is unobserved

Note: While exposition is using cross-section, a common source of (non-random) selection is attrition in panel data; particularly important in firm-level studies where attrition may be due to firms exiting the market

Example: Certain subpopulations may not be representative of the population

Implies following data structure

- Have data on a random sample, $\{y_i, x_i, S_i\}_{i=1}^N$, but $y_i = .$ if $S_i = 0$
- Can only use $M \equiv \sum_i S_i$ observations to estimate any model
- Examples
  - Wages only observed for workers
  - Firm profits only observed for firms that remain in business
  - Test scores only observed for test takers
  - House prices only observed for houses on the market (sold?)

Issue

- Is OLS still unbiased and consistent?
- Answer: depends

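Heckman Model (Heckman 1979)

Setup

\[ y_i = x_i\beta + \varepsilon_i \]
\[ S_i^* = z_i\gamma + u_i \]
\[ S_i = \begin{cases} 1 & \text{if } S_i^* > 0 \\ 0 & \text{if } S_i^* \leq 0 \end{cases} \]
\[ y_i = . \text{ if } S_i = 0 \]
\[ \varepsilon_i, u_i \sim N_2(0, 0, \sigma_\varepsilon^2, 1, \rho) \]

x, z are exogenous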

Problem

- $E[y^*|x] = x\beta$, but

\[ E[y|x, S = 1] = E[y|x, z, u > -z\gamma] = x\beta + E[\varepsilon|x, z, u > -z\gamma] = x\beta + E[\varepsilon|u > -z\gamma] = x\beta + \rho\sigma_\varepsilon\frac{\phi(z\gamma)}{\Phi(z\gamma)} \]

  where $\phi(z\gamma)/\Phi(z\gamma)$ is the Inverse Mills' Ratio from before
- Implies that $E[y|x, S = 1] = x\beta$ iff $\rho = 0$
- OLS estimation of

\[ y_i = x_i\beta + \tilde{\varepsilon}_i \]

  using only M observations omits the IMR term, which implies that

\[ \tilde{\varepsilon}_i = \rho\sigma_\varepsilon\,\phi(z_i\gamma)/\Phi(z_i\gamma) + \varepsilon_i \]

  which is not mean zero, and is not independent of x, unless $\rho = 0$

Solution

- Estimate IMR (using i = 1, ..., N)
  - Estimate probit model, where S is dependent variable and z are the covariates $\Rightarrow \hat{\gamma}$
  - Obtain

\[ \widehat{IMR}_i = \frac{\phi(z_i\hat{\gamma})}{\Phi(z_i\hat{\gamma})} \]

- Regress $y_i$ on $x_i$, $\widehat{IMR}_i$ via OLS (using i = 1, ..., M)
- Known as Heckman two-step method
- Test of endogenous selection

\[ H_o: \beta_\lambda = 0 \]
\[ H_a: \beta_\lambda \neq 0 \]

  where $\beta_\lambda$ is the coefficient on the IMR
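The two steps above can be sketched with NumPy alone. This is an illustrative implementation under stated assumptions, not the routine behind Stata's -heckman-: the probit is fit by a hand-written Fisher scoring loop, `z` and `x` are assumed to already contain a constant column, and the function names are hypothetical. It returns point estimates only; the standard errors from the second-step OLS would still need the correction for the estimated IMR.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def norm_pdf(v):
    return np.exp(-0.5 * v**2) / math.sqrt(2.0 * math.pi)

def norm_cdf(v):
    return 0.5 * (1.0 + _erf(v / math.sqrt(2.0)))

def probit_fit(z, s, iterations=25):
    """Probit of s on z by Fisher scoring (z must include a constant column)."""
    gamma = np.zeros(z.shape[1])
    for _ in range(iterations):
        index = z @ gamma
        cdf = np.clip(norm_cdf(index), 1e-10, 1.0 - 1e-10)
        pdf = norm_pdf(index)
        weight = pdf / (cdf * (1.0 - cdf))
        score = z.T @ (weight * (s - cdf))             # gradient of the log-likelihood
        info = z.T @ (z * (weight * pdf)[:, None])     # expected information matrix
        gamma = gamma + np.linalg.solve(info, score)
    return gamma

def heckman_two_step(y, x, z, s):
    """OLS coefficients on [x, IMR] using only the selected (s == 1) observations."""
    gamma_hat = probit_fit(z, s)                       # step 1: all N observations
    index = z @ gamma_hat
    imr = norm_pdf(index) / norm_cdf(index)
    sel = s == 1
    design = np.column_stack([x[sel], imr[sel]])       # step 2: M selected observations
    coef, *_ = np.linalg.lstsq(design, y[sel], rcond=None)
    return coef                                        # last element estimates rho * sigma_eps
```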


Notes

- Usual OLS standard errors are incorrect since IMR is predicted; must account for additional uncertainty due to estimation of $\gamma$
- Other complications in derivation of standard errors
- Need an exclusion restriction(s)
  - A variable in z not in x
  - Otherwise model is identified from non-linearity of IMR, which arises solely from the assumption of joint normality
  - However, even though technically identified from the non-linearity, substantial collinearity in practice makes identification questionable
- Model can be estimated in one-step by ML
  - More efficient if model assumptions are valid
  - Less robust in general since more dependent on functional form assumptions

Stata: -heckman-, -heckman2-

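QR alternative

Assume the latent outcome is

\[ y_i^* = x_i\beta + u_i \]

$y^*$ is unobserved; instead observe

\[ y_i = \begin{cases} y_i^* & \text{if observed} \\ . & \text{otherwise} \end{cases} \]

QR model estimated using data on $\{\tilde{y}_i, x_i\}$, where

\[ \tilde{y}_i = \begin{cases} y_i & \text{if observed} \\ \min\{y_i\} & \text{otherwise} \end{cases} \]

yields

\[ \hat{\beta}_\theta = \arg\min_\beta \frac{1}{N}\left\{ \sum_i \rho_\theta(\tilde{y}_i - x_i\beta) \right\} \]

which is consistent as long as all missing values of $y_i^* \leq Q_\theta(y^*|x)$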

More generally, QR model estimated using data on $\{\tilde{y}_i, x_i\}$, where

\[ \tilde{y}_i = \begin{cases} y_i & \text{if observed} \\ \text{imputed value} & \text{otherwise} \end{cases} \]

yields

\[ \hat{\beta}_\theta = \arg\min_\beta \frac{1}{N}\left\{ \sum_i \rho_\theta(\tilde{y}_i - x_i\beta) \right\} \]

which is consistent as long as imputed values lie on the correct side of $Q_\theta(y^*|x)$

Example:

[Figure: ystar plotted against x on [0, 1], with the 'true' OLS fitted line, the 'true' LAD fitted line, the OLS fitted line using y>0 only, and the LAD fitted line]

NOTE: x~U[0,1]; ystar=-0.25+x+e; e~N(0,0.25^2); y=ystar if ystar>0. LAD fitted line obtained by first replacing y=10 if ystar>true LAD line, -10 otherwise.

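Multiple selection criteria

Setup

\[ y_i = x_i\beta + \varepsilon_i \]
\[ S_{1i}^* = z_{1i}\gamma_1 + u_{1i} \]
\[ S_{1i} = \begin{cases} 1 & \text{if } S_{1i}^* > 0 \\ 0 & \text{if } S_{1i}^* \leq 0 \end{cases} \]
\[ S_{2i}^* = z_{2i}\gamma_2 + u_{2i} \]
\[ S_{2i} = \begin{cases} 1 & \text{if } S_{2i}^* > 0 \\ 0 & \text{if } S_{2i}^* \leq 0 \end{cases} \]
\[ y_i = . \text{ if } S_{1i}S_{2i} \neq 1 \]
\[ \varepsilon_i, u_{1i}, u_{2i} \sim N_3(0, 0, 0, \sigma_\varepsilon^2, 1, 1, \rho_{\varepsilon 1}, \rho_{\varepsilon 2}, \rho_{12}) \]

x, z are exogenous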

Estimation

- Same as above, except with two IMR terms

\[ \widehat{IMR}_{1i} = \frac{\phi(z_{1i}\hat{\gamma}_1)}{\Phi(z_{1i}\hat{\gamma}_1)}; \quad \widehat{IMR}_{2i} = \frac{\phi(z_{2i}\hat{\gamma}_2)}{\Phi(z_{2i}\hat{\gamma}_2)} \]

- Coefficients on each IMR term are $\rho_{\varepsilon 1}\sigma_\varepsilon$ and $\rho_{\varepsilon 2}\sigma_\varepsilon$

Examples

- Grameen Bank: only observe outcome of credit amount if village contains a bank, and income makes one eligible
- Child care: only observe price paid for child care if work and use market-based day care

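Regime switching models

Setup

\[ S_i^* = z_i\gamma + u_i \]
\[ S_i = \begin{cases} 1 & \text{if } S_i^* > 0 \\ 0 & \text{if } S_i^* \leq 0 \end{cases} \]
\[ y_i = \begin{cases} x_i\beta_1 + \varepsilon_{1i} & \text{if } S_i = 1 \\ x_i\beta_0 + \varepsilon_{0i} & \text{if } S_i = 0 \end{cases} \]

which is the previous model for treatment effects

Applicable to any situation where one thinks determinants of the outcome (i.e., $\beta$) differ across groups or regimes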

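May be extended to multiple regimes

\[ S_i^* = z_i\gamma + u_i \]

\[ S_i = \begin{cases} 0 & \text{if } S_i^* \leq 0 \\ 1 & \text{if } S_i^* \in (0, \alpha_1] \\ 2 & \text{if } S_i^* \in (\alpha_1, \alpha_2] \\ \vdots & \\ K & \text{if } S_i^* > \alpha_{K-1} \end{cases} \]

\[ y_i = \begin{cases} x_i\beta_0 + \varepsilon_{0i} & \text{if } S_i = 0 \\ x_i\beta_1 + \varepsilon_{1i} & \text{if } S_i = 1 \\ \vdots & \\ x_i\beta_K + \varepsilon_{Ki} & \text{if } S_i = K \end{cases} \]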

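Estimate each regime separately

\[ y_i = x_i\beta_k + \rho_{u\varepsilon_k}\sigma_{\varepsilon_k}\widehat{IMR}_{ki} + \eta_{ki} \]

where

\[ \widehat{IMR}_{ki} = \begin{cases} \dfrac{-\phi(z_i\hat{\gamma})}{1 - \Phi(z_i\hat{\gamma})} & \text{if } S_i = 0 \\[2ex] \dfrac{\phi(\alpha_{k-1} - z_i\hat{\gamma}) - \phi(\alpha_k - z_i\hat{\gamma})}{\Phi(\alpha_k - z_i\hat{\gamma}) - \Phi(\alpha_{k-1} - z_i\hat{\gamma})} & \text{if } S_i = k \in \{1, 2, ..., K - 1\} \\[2ex] \dfrac{\phi(\alpha_{K-1} - z_i\hat{\gamma})}{1 - \Phi(\alpha_{K-1} - z_i\hat{\gamma})} & \text{if } S_i = K \end{cases} \]

and $\alpha_0 = 0$ and $\gamma$ is estimated via ordered probit

Examples

- Wages by firm size (Main & Reilly 1993)
- Various outcomes by education or household size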

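Regime switching models with unknown switch point

Setup

\[ S_i = z_i\gamma + u_i \]
\[ S_i^* = \begin{cases} 1 & \text{if } S_i > c \\ 0 & \text{if } S_i \leq c \end{cases} \]
\[ y_i = \begin{cases} x_i\beta_1 + \varepsilon_{1i} & \text{if } S_i^* = 1 \\ x_i\beta_0 + \varepsilon_{0i} & \text{if } S_i^* = 0 \end{cases} \]

where S is observed, but c is unknown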

Estimation

- ML, where c is unknown parameter
- Grid search:
  - Estimate model for several plausible values of c
  - $\hat{c}$ and resulting estimates $\hat{\beta}$ are those that minimize total SSE
- Examples
  - Wages of PT vs. FT (Hotchkiss 1991)
  - Outcomes of DCs vs. LDCs
  - Stock market performance of large vs. small firms

Separate literature on selection models with panel data

Bounding distributions (Blundell et al. 2007)

Notation

- $W^*$ = latent outcome variable
- E = selection indicator
- W = outcome variable, where

\[ W = \begin{cases} W^* & \text{if } E = 1 \\ . & \text{otherwise} \end{cases} \]

- X = covariate vector

Goal: bound CDF $F(w|x)$ given observable CDF $F(w|x, E = 1)$

Examples:

- Dbn of wages under full employment
- Dbn of child health under full HI coverage
- Dbn of student achievement under universal attendance at public schools
- Dbn of test scores on college entrance exams with full participation

Worst case bounds

Identity

\[ F(w|x) = F(w|x, E = 1)p(x) + F(w|x, E = 0)[1 - p(x)] \]

where $p(x) \equiv \Pr(E = 1|x)$

$F(w|x, E = 0)$ is unknown, but must lie in unit interval

Replacing $F(w|x, E = 0)$ with zero and one yields

\[ F(w|x, E = 1)p(x) \leq F(w|x) \leq F(w|x, E = 1)p(x) + [1 - p(x)] \]

Example (ignoring x):

- $F(10|E = 1) = 0.4$
- $\Pr(E = 1) = 0.9$
- $\Rightarrow F(10) \in [0.36, 0.46]$
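The worst-case bounds are simple arithmetic on the observed CDF value and the selection probability. The following sketch reproduces the example above; the function name `worst_case_cdf_bounds` is an assumption for illustration.

```python
def worst_case_cdf_bounds(f_observed, p_observed):
    """Bounds on F(w|x) given F(w|x, E=1) and p(x) = Pr(E=1|x)."""
    lower = f_observed * p_observed                       # missing CDF set to zero
    upper = f_observed * p_observed + (1.0 - p_observed)  # missing CDF set to one
    return lower, upper

# Values from the example: F(10|E=1) = 0.4 and Pr(E=1) = 0.9
lower, upper = worst_case_cdf_bounds(0.4, 0.9)
```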


Can be rewritten in terms of bounds on quantiles

\[ w^{q,l}(x) \leq w^q(x) \leq w^{q,u}(x) \]

where

- $w^q(x)$ = qth quantile of $F(w|x)$
- $w^{q,l}(x)$ is the value of w that solves

\[ q = F(w|x, E = 1)p(x) + [1 - p(x)] \;\Leftrightarrow\; w = F^{-1}\left( \frac{q - [1 - p(x)]}{p(x)} \,\Big|\, x, E = 1 \right) \]

- $w^{q,u}(x)$ is the value of w that solves

\[ q = F(w|x, E = 1)p(x) \;\Leftrightarrow\; w = F^{-1}\left( \frac{q}{p(x)} \,\Big|\, x, E = 1 \right) \]

Example

- q = 0.5, p(x) = 0.9
- $w^{q,l}(x) = F^{-1}(q''|x, E = 1)$, where $q'' = (0.5 - 0.1)/0.9 = 0.4/0.9 \approx 0.44$
- $w^{q,u}(x) = F^{-1}(q'|x, E = 1)$, where $q' = 0.5/0.9 \approx 0.56$
- $\Rightarrow$ bounds on the median are given by the values of the observed conditional dbn at the 44th and 56th quantiles

Notes

- Bounds cannot be used to determine if selection is non-random; only assess the possible consequences
- Bounds only estimable for $q \in [1 - p(x), p(x)]$
- Bounds converge to point estimates as $p(x) \to 1$
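The quantile version of the worst-case bounds maps the target quantile q into two ranks of the observed conditional distribution. A minimal sketch follows; the function name is an assumption for illustration, and the guard simply encodes the estimability condition stated in the notes.

```python
def worst_case_quantile_ranks(q, p_observed):
    """Ranks in the observed (E=1) dbn that bound the q-th quantile of F(w|x)."""
    if not (1.0 - p_observed <= q <= p_observed):
        raise ValueError("bounds are only defined for q in [1 - p, p]")
    lower_rank = (q - (1.0 - p_observed)) / p_observed   # gives w^{q,l}
    upper_rank = q / p_observed                          # gives w^{q,u}
    return lower_rank, upper_rank

# Values from the example: q = 0.5 and p(x) = 0.9
lower_rank, upper_rank = worst_case_quantile_ranks(0.5, 0.9)
```

The bounds on the quantile itself are then the observed conditional quantiles evaluated at these two ranks.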


Positive selection

Stochastic dominance

- One characterization of positive selection is to assume that

\[ F(w|x, E = 1) \text{ FSD } F(w|x, E = 0) \;\Leftrightarrow\; F(w|x, E = 1) \leq F(w|x, E = 0) \quad \forall w, \forall x \]

- Equivalent to $\Pr(E = 1|W \leq w, x) \leq \Pr(E = 1|W > w, x)$
- Bounds on $F(w|x)$ become

\[ F(w|x, E = 1) \leq F(w|x) \leq F(w|x, E = 1)p(x) + [1 - p(x)] \]

  since the missing term, $F(w|x, E = 0)$, is now bounded from below at $F(w|x, E = 1)$

Example (ignoring x):

- $F(10|E = 1) = 0.4$
- $\Pr(E = 1) = 0.9$
- $\Rightarrow F(10) \in [0.4, 0.46]$ whereas the worst-case bounds were [0.36, 0.46]

Median restriction

- Weaker characterization is to assume (conditional on x) that $w^{0.5(E=1)} \geq w^{0.5(E=0)}$
- Equivalent to $\Pr(E = 1|W \leq w^{0.5(E=1)}, x) \leq \Pr(E = 1|W > w^{0.5(E=1)}, x)$
- Bounds on $F(w|x)$ become

\[ F(w|x, E = 1)p(x) \leq F(w|x) \leq F(w|x, E = 1)p(x) + [1 - p(x)] \quad \text{if } w < w^{0.5(E=1)} \]

\[ F(w|x, E = 1)p(x) + 0.5[1 - p(x)] \leq F(w|x) \leq F(w|x, E = 1)p(x) + [1 - p(x)] \quad \text{if } w \geq w^{0.5(E=1)} \]

- Bounds are tightened (relative to worst case) only above the median since the missing term, $F(w|x, E = 0)$, is now bounded from below at 0.5 for $w \geq w^{0.5(E=1)}$ (instead of zero)

Exclusion restriction

Conditional independence

- Assume z satisfies

\[ F(w|x, z) = F(w|x) \quad \forall w, x, z \]

- Bounds on $F(w|x)$ become

\[ \max_z \{F(w|x, z, E = 1)p(x, z)\} \leq F(w|x) \leq \min_z \{F(w|x, z, E = 1)p(x, z) + [1 - p(x, z)]\} \]

- If conditional independence is not true, bounds may cross; failure of bounds to cross does not prove conditional independence holds

Monotonicity

- Higher values of z improve the dbn in a FSD sense

\[ F(w|x, z') \leq F(w|x, z'') \quad \forall w, x, z', z'' \text{ s.t. } z' > z'' \]

- Bounds on $F(w|x, z_1)$ become

\[ \max_{z \geq z_1} \{F(w|x, z, E = 1)p(x, z)\} \leq F(w|x, z_1) \leq \min_{z \leq z_1} \{F(w|x, z, E = 1)p(x, z) + [1 - p(x, z)]\} \]

- Bounds on $F(w|x)$ obtained by integrating over the dbn of z; entails computing the weighted average of the upper and lower bounds across the different values, $z_1$, where the weights are the sample proportions, $\Pr(z = z_1|x)$

Bounding differences in QTEs across groups accounting for non-random selection

Notation

- $D \in \{0, 1\}$ indexes groups
- $T \in \{0, 1\}$ indexes time period

Bounds on QTEs across groups in a given time period

\[ w^{q,l}(1, T) - w^{q,u}(0, T) \leq w^q(1, T) - w^q(0, T) \leq w^{q,u}(1, T) - w^{q,l}(0, T) \]

Bounds on QTEs across time for a given group

\[ w^{q,l}(D, 1) - w^{q,u}(D, 0) \leq w^q(D, 1) - w^q(D, 0) \leq w^{q,u}(D, 1) - w^{q,l}(D, 0) \]

Bounds on diff-QTEs across groups

\[ [w^q(1, 1) - w^q(0, 1)] - [w^q(1, 0) - w^q(0, 0)] \in [LB, UB] \]

where

\[ LB = [w^{q,l}(1, 1) - w^{q,u}(0, 1)] - [w^{q,u}(1, 0) - w^{q,l}(0, 0)] \]
\[ UB = [w^{q,u}(1, 1) - w^{q,l}(0, 1)] - [w^{q,l}(1, 0) - w^{q,u}(0, 0)] \]

- Example: Change in median wage gap across males and females over period T = 0 to T = 1
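The LB and UB formulas combine the four pairs of quantile bounds so that each term enters at its least or most favorable value. A small sketch of that bookkeeping follows; the dictionary layout keyed by (group, period) and the function name are assumptions for illustration.

```python
def diff_qte_bounds(bounds):
    """bounds[(d, t)] = (lower, upper) bound on the q-th quantile for group d, period t."""
    l11, u11 = bounds[(1, 1)]
    l01, u01 = bounds[(0, 1)]
    l10, u10 = bounds[(1, 0)]
    l00, u00 = bounds[(0, 0)]
    lb = (l11 - u01) - (u10 - l00)   # smallest period-1 gap minus largest period-0 gap
    ub = (u11 - l01) - (l10 - u00)   # largest period-1 gap minus smallest period-0 gap
    return lb, ub
```

When every pair of quantile bounds collapses to a point (no selection), LB and UB coincide with the usual difference-in-differences of quantiles.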


Level set restrictions

- Assume diff-QTE, $[w^q(1, 1) - w^q(0, 1)] - [w^q(1, 0) - w^q(0, 0)]$, is constant across different values of some covariate $x \in X$
- Calculate $LB(x), UB(x)$ $\forall x \in X$
- New LB, UB given by

\[ LB = \max_{x \in X} LB(x) \]
\[ UB = \min_{x \in X} UB(x) \]

Test statistics derived in Blundell et al. for bounds crossings, whether observed conditional distribution, $F(w|x, E = 1)$, lies in the bounds

Inference via bootstrap

Bounding differences in average treatment effects across groups accounting for non-random selection

Lechner and Melly (2007)

Imai (2008)

Lee (2009)

Huber and Mellace (2011)

Data Issues
Contamination

Horowitz and Manski (1995); see also Chen et al. (JEL 2011)

Goal is to bound the marginal distribution of $y^*$, where

\[ y_i = d_i y_i^* + (1 - d_i)\tilde{y}_i \]

where $y^*$ is the true value, $\tilde{y}$ is the mismeasured value, and d = 1 in the absence of contamination (0 otherwise)

Add more!

Data Issues
Measurement Error

Refer to ECO 6374 for refresher on basics...

Problem: sometimes (often!) data are measured imprecisely; see Bound et al. (2001), Millimet (2011)

Data Issues
ME: Classical Errors-in-Variables (CEV) model

Continuous dependent variable

\[ \underbrace{y_i}_{\text{observed}} = \underbrace{y_i^*}_{\text{actual}} + \underbrace{\mu_i}_{\text{ME}} \]

- Assumptions

(CEV.i) True model: $y_i^* = \alpha + \beta x_i^* + \varepsilon_i$
(CEV.ii) Normality and Mean Zero: $\mu_i \sim N(0, \sigma_\mu^2)$
(CEV.iii) Independence: $Cov(x^*, \mu) = 0$

- Implications
  - OLS unbiased, consistent
  - Standard errors are correct
  - Lower $R^2$, higher standard errors due to extra noise in the data

Continuous independent variable

\[ \underbrace{x_i}_{\text{observed}} = \underbrace{x_i^*}_{\text{actual}} + \underbrace{\mu_i}_{\text{ME}} \]

- Assumptions (in addition to previous assumptions)

(CEV.iv) Independence: $Cov(\mu, \varepsilon) = 0$

- Implications
  - OLS biased, inconsistent unless $\beta = 0$
  - $\hat{\beta}_{OLS}$ suffers from attenuation bias
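The attenuation result can be seen in one line. The following is the standard textbook derivation under (CEV.i)-(CEV.iv) for the bivariate case with a correctly measured dependent variable; the symbol $\sigma_{x^*}^2$ for the variance of the true regressor is notation introduced here for illustration.

```latex
\operatorname{plim}\hat{\beta}_{OLS}
  = \frac{Cov(x, y)}{Var(x)}
  = \frac{Cov(x^* + \mu,\; \alpha + \beta x^* + \varepsilon)}{Var(x^* + \mu)}
  = \beta\,\frac{\sigma_{x^*}^2}{\sigma_{x^*}^2 + \sigma_\mu^2}
```

Since the ratio lies strictly between zero and one whenever $\sigma_\mu^2 > 0$, the OLS estimate is biased toward zero, and the bias vanishes only if $\beta = 0$ or there is no measurement error.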


Data Issues
ME: Binary Dependent Variable (Hausman et al. 1998)

True model

\[ D_i^* = x_i\beta + \varepsilon_i \]

where * on a variable indicates correctly measured

Given a random sample $\{D_i^*, x_i\}_{i=1}^N$, assume logit model is consistent and efficient

- Logit probabilities

\[ \Pr(D^* = 1|x) = \frac{\exp(x_i\beta)}{1 + \exp(x_i\beta)} \]

\[ \Pr(D^* = 0|x) = \frac{1}{1 + \exp(x_i\beta)} \]

- Estimation by ML

\[ \ln L = \sum_i \{ I[D^* = 1]\ln[\Pr(D^* = 1|x)] + I[D^* = 0]\ln[\Pr(D^* = 0|x)] \} \]

With measurement error, do not observe $D_i^*$

- Instead one observes $D_i$
- Introduce following notation

\[ \alpha_0 \equiv \Pr(D_i = 1|D_i^* = 0) \]
\[ \alpha_1 \equiv \Pr(D_i = 0|D_i^* = 1) \]

- $\alpha_0, \alpha_1$ dependent on $D^*$, but not on $x_i$

Estimation

- Probabilities of observed responses

\[ \Pr(D = 1|x) = \Pr(D_i = 1|D_i^* = 0)\Pr(D_i^* = 0|x) + \Pr(D_i = 1|D_i^* = 1)\Pr(D_i^* = 1|x) = \alpha_0 + (1 - \alpha_0 - \alpha_1)\frac{\exp(x_i\beta)}{1 + \exp(x_i\beta)} \]

\[ \Pr(D = 0|x) = 1 - \Pr(D = 1|x) = 1 - \alpha_0 - (1 - \alpha_0 - \alpha_1)\frac{\exp(x_i\beta)}{1 + \exp(x_i\beta)} \]

- Estimation by ML

\[ \ln L = \sum_i \{ I[D = 1]\ln[\Pr(D = 1|x)] + I[D = 0]\ln[\Pr(D = 0|x)] \} \]

- Extension to probit is trivial
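The misclassification-adjusted log-likelihood is a small modification of the standard logit likelihood. The sketch below only evaluates it for given parameters; maximization would be handed to a numerical optimizer with $\alpha_0 + \alpha_1 < 1$ enforced. The function name and the parameter ordering (alpha0, alpha1, then beta) are assumptions for illustration.

```python
import numpy as np

def misclassified_logit_loglik(params, d, x):
    """Log-likelihood with misclassification rates; d is the observed 0/1 response."""
    alpha0, alpha1 = params[0], params[1]
    beta = np.asarray(params[2:], dtype=float)
    d = np.asarray(d, dtype=float)
    x = np.asarray(x, dtype=float)
    true_prob = 1.0 / (1.0 + np.exp(-(x @ beta)))              # Pr(D* = 1 | x)
    obs_prob = alpha0 + (1.0 - alpha0 - alpha1) * true_prob    # Pr(D = 1 | x)
    return float(np.sum(d * np.log(obs_prob) + (1.0 - d) * np.log(1.0 - obs_prob)))
```

Setting both misclassification rates to zero returns the ordinary logit log-likelihood, which is a convenient sanity check.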


Identification

- In linear probability model (LPM), conditional expectation given by

\[ E[D|x] = \Pr(D_i = 1|D_i^* = 0)\Pr(D_i^* = 0|x) + \Pr(D_i = 1|D_i^* = 1)\Pr(D_i^* = 1|x) \]
\[ = \alpha_0 + (1 - \alpha_0 - \alpha_1)(x_i\beta) \]
\[ = \alpha_0 + (1 - \alpha_0 - \alpha_1)(\beta_0 + \tilde{x}_i\beta_1) \]
\[ = [\alpha_0 + (1 - \alpha_0 - \alpha_1)\beta_0] + \tilde{x}_i(1 - \alpha_0 - \alpha_1)\beta_1 \]

  which makes clear that identification of $\alpha_0$, $\alpha_1$, and $\beta$ arises from non-linearity of probit/logit, in addition to ...
- Monotonicity assumption: $\alpha_0 + \alpha_1 < 1$
- Semiparametric alternatives available

Data Issues
ME: Binary Independent Variable

True model

\[ y_i = \alpha + \beta D_i^* + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma_{\varepsilon\varepsilon}) \]

where * on a variable indicates correctly measured

Given a random sample $\{y_i, D_i^*\}_{i=1}^N$, assume OLS is consistent and efficient

With measurement error, do not observe $D_i^*$

Instead one observes $D_i$ where

\[ \underbrace{D_i}_{\text{observed}} = \underbrace{D_i^*}_{\text{true}} + \underbrace{\mu_i}_{\text{ME}} \]

which implies that $\mu \in \{0, 1\}$ if $D^* = 0$, and $\mu \in \{0, -1\}$ if $D^* = 1$

Thus, measurement error is

- Not normally distributed (violates CEV.ii)
- Negatively correlated with $D^*$ (violates CEV.iii)

Assumptions

(BME.i) Non-differential classification errors: $E[y|D^*] = E[y|D^*, D]$
(BME.ii) $D^* \perp \varepsilon$
(BME.iii) $Cov(D^*, D) > 0$
(BME.iv) $Cov(D^*, \mu) < 0$

Given (BME.i)-(BME.iv), asymptotic bias given by

\[ \operatorname{plim}\hat{\beta}_{OLS} = \left( \frac{\sigma_{D^*D^*} + \sigma_{D^*\mu}}{\sigma_{D^*D^*} + 2\sigma_{D^*\mu} + \sigma_{\mu\mu}} \right)\beta \]

Results in attenuation bias for $\beta$ if $\sigma_{D^*\mu} + \sigma_{\mu\mu} > 0$

Likely true for any mismeasured bounded variable

Millimet (2011) conducts MC study comparing common treatment effect estimators ($\Delta = 1$)

Partial solutions (Aigner 1973; Bollinger 1996; Black et al. 2000)

Reverse regression

- Estimate via OLS

\[ D_i = \pi_0 + \pi_1 y_i + \upsilon_i \]

- plim given by

\[ \operatorname{plim}\hat{\pi}_{1,OLS}^{-1} = \frac{\beta^2\sigma_{D^*D^*} + \sigma_{\varepsilon\varepsilon}}{\beta(\sigma_{D^*D^*} + \sigma_{D^*\mu})} \]

  which is biased up in absolute value
- Implies $\hat{\beta}_{D^*,OLS} \in [\hat{\beta}_{OLS}, \hat{\pi}_{1,OLS}^{-1}]$, where $\hat{\beta}_{D^*,OLS}$ is the OLS estimate if $D^*$ were observed (Frisch bounds)
- If $R^2$ is low, then bounds obtained using reverse regression may be uninformative
- IV estimation also yields an upper bound (not a consistent estimate!), that may be more informative in many cases
- Inconsistency of IV results from fact that any instrument correlated with $D^*$ will most likely be correlated with $\mu$ since $Cov(D^*, \mu) \neq 0$

Improved lower bound obtained by estimating

\[ y_i = \alpha + \beta_0 I[D_i = 0, D_i' = 1] + \beta_1 I[D_i = 1, D_i' = 0] + \beta_2 I[D_i = 1, D_i' = 1] + \eta_i \]

where $D_i'$ is a second mis-measured indicator

- If the measurement errors are independent conditional on actual treatment assignment, $D_i^*$, then

\[ 0 < E[\hat{\beta}_{OLS}] < E[\hat{\beta}_{2,OLS}] < |\beta| \]

Bound $\hat{\beta}_{D^*,OLS}$ under various assumptions concerning severity of measurement error (papers by Kreider and Pepper)

Full Solutions

Point estimates possible using method-of-moments framework

Brachet (2008) proposes following algorithm

1. Estimate Hausman et al. misclassification probit, including an instrument z in the first-stage
2. Replace D with $\Pr(D_i^* = 1|x, z)$ in second-stage

McCarthy & Tchernis (2011) consider a similar approach in a Bayesian framework

Partial solutions (Kreider & Pepper 2007)

Utilize a non-regression approach to bound the effect of a mis-measured binary treatment

Authors do not wish to invoke (BME.i), which implies that mis-reporting is independent of outcomes conditional on the truth

Notation

- $y \in \{0, 1\}$ is a binary outcome (correctly measured)
- $D^* \in \{0, 1\}$ is the true binary treatment
- $D \in \{0, 1\}$ is the reported binary treatment
- $Z \in \{0, 1\}$, where Z = 1 if $D = D^*$ and 0 otherwise

Estimand of interest: $\Delta = \Pr(y = 1|D^* = 1) - \Pr(y = 1|D^* = 0)$

Data provides an estimate of $\Pr(y = 1|D)$

Manipulation yields

\[ \Pr(y = 1|D^* = 1) = \frac{\Pr(y = 1, D^* = 1)}{\Pr(D^* = 1)} = \frac{\Pr(y = 1, D = 1) + \Pr(y = 1, D = 0, Z = 0) - \Pr(y = 1, D = 1, Z = 0)}{\Pr(D = 1) + \Pr(D = 0, Z = 0) - \Pr(D = 1, Z = 0)} \]

where $\Pr(D = 1, Z = 0)$ is a false positive and $\Pr(D = 0, Z = 0)$ is a false negative

Data provide estimates of $\Pr(y = 1, D = 1)$, $\Pr(D = 1)$

Other elements are unknown, but bounded by the unit interval

Lower-Bound Accurate Reporting Rate

- Assume $\Pr(Z = 1) \geq v$
- Can show that

\[ \Pr(y = 1|D^* = 1) \in \left[ \frac{\Pr(y = 1, D = 1) - \delta}{\Pr(D = 1) - 2\delta + (1 - v)}, \; \frac{\Pr(y = 1, D = 1) + \gamma}{\Pr(D = 1) + 2\gamma - (1 - v)} \right] \]

  where

\[ \delta = \begin{cases} \min\{(1 - v), \Pr(y = 1, D = 1)\} & \text{if } \Pr(y = 1, D = 1) - \Pr(y = 0, D = 1) - (1 - v) \leq 0 \\ \max\{0, (1 - v) - \Pr(y = 0, D = 0)\} & \text{otherwise} \end{cases} \]

\[ \gamma = \begin{cases} \min\{(1 - v), \Pr(y = 1, D = 0)\} & \text{if } \Pr(y = 1, D = 1) - \Pr(y = 0, D = 1) + (1 - v) \leq 0 \\ \max\{0, (1 - v) - \Pr(y = 0, D = 1)\} & \text{otherwise} \end{cases} \]

- Bounds for $\Pr(y = 1|D^* = 0)$ are obtained by replacing D with $1 - D$
- Bounds for each term obtained by replacing elements with sample analogs
- Bounds for $\Delta$ obtained using relevant upper and lower bounds for each term
- When v = 1, bounds collapse to a point estimate

Partial Verification
- Might assume a lower bound for accuracy among some sub-group whose status is more certain, $W = 1$
- Assume $\Pr(Z = 1 \mid W = 1) \geq v_w$
- Can show that

$$
\Pr(y = 1 \mid D^* = 1) \in \left[
\frac{\Pr(y = 1, D = 1, W = 1) - \delta}{\Pr(D = 1, W = 1) + \Pr(y = 0, W = 0) - 2\delta + (1 - v_w)\Pr(W = 1)}, \;
\frac{\Pr(y = 1, D = 1, W = 1) + \Pr(y = 1, W = 0) + \gamma}{\Pr(D = 1, W = 1) + \Pr(y = 1, W = 0) + 2\gamma - (1 - v_w)\Pr(W = 1)}
\right]
$$

where

$$
\delta = \begin{cases}
\min\{(1 - v_w)\Pr(W = 1), \Pr(y = 1, D = 1)\} & \text{if } \alpha \leq 0 \\
\max\{0, (1 - v_w)\Pr(W = 1) - \Pr(y = 0, D = 0, W = 1)\} & \text{otherwise}
\end{cases}
$$

$$
\gamma = \begin{cases}
\min\{(1 - v_w)\Pr(W = 1), \Pr(y = 1, D = 0)\} & \text{if } \alpha' \leq 0 \\
\max\{0, (1 - v_w)\Pr(W = 1) - \Pr(y = 0, D = 1, W = 1)\} & \text{otherwise}
\end{cases}
$$

$$
\alpha = \Pr(y = 1, D = 1, W = 1) - \Pr(y = 0, D = 1, W = 1) - \Pr(y = 0, W = 0) - (1 - v_w)\Pr(W = 1)
$$

$$
\alpha' = \Pr(y = 1, D = 1, W = 1) - \Pr(y = 0, D = 1, W = 1) + \Pr(y = 1, W = 0) + (1 - v_w)\Pr(W = 1)
$$

- If $v_w = 1$, then one has full verification for an observed sub-sample $\Rightarrow$ bounds are tightened


Combine the prior assumptions with a Monotone IV assumption to possibly further tighten the bounds

MIV Assumption
- $\exists \, x$ s.t.

$$
x_0 \in [x_1, x_2] \Rightarrow \Pr(y = 1 \mid D^*, x_0) \in [\Pr(y = 1 \mid D^*, x_1), \Pr(y = 1 \mid D^*, x_2)]
$$

- Implies that $\Pr(y = 1 \mid D^*, x)$ is weakly monotonically increasing in $x$
- Proceed by
  - Computing bounds conditional on different values of $x$
  - Obtaining unconditional bounds by integrating over the distribution of $x$
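
The two steps listed above can be sketched in Python for a monotone instrument that takes a few ordered discrete values. The sketch assumes bounds on $\Pr(y = 1 \mid D^*, x)$ have already been computed at each value of $x$. The tightening step (largest lower bound at or below $x$, smallest upper bound at or above $x$) is the usual way a monotonicity restriction of this kind is imposed in the partial identification literature; it is an assumption here, since the slides do not spell it out. The function name `miv_bounds` and all numbers are hypothetical.

```python
import numpy as np


def miv_bounds(lower_by_x, upper_by_x, weights):
    """Unconditional bounds under a monotone IV with ordered discrete support.

    lower_by_x, upper_by_x: conditional bounds at x_1 < x_2 < ... < x_K
    weights: Pr(x = x_k), used to integrate over the distribution of x
    """
    lower = np.asarray(lower_by_x, dtype=float)
    upper = np.asarray(upper_by_x, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()

    # Monotonicity: the value at x is at least every lower bound at x' <= x ...
    tight_lower = np.maximum.accumulate(lower)
    # ... and at most every upper bound at x' >= x.
    tight_upper = np.minimum.accumulate(upper[::-1])[::-1]

    return float(w @ tight_lower), float(w @ tight_upper)


# Hypothetical conditional bounds at three values of x.
miv_lo, miv_hi = miv_bounds([0.2, 0.1, 0.4], [0.6, 0.5, 0.9], [0.3, 0.3, 0.4])
print(miv_lo, miv_hi)
```

Without the monotonicity step the same inputs would simply be averaged, so comparing the two intervals shows how much identifying power the MIV assumption adds in a given application.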

Kreider & Hill (2009), Kreider et al. (2011) combine this methodology on reporting errors with prior methods on bounding treatment effects under SOU

Imai & Yamamoto (2010) offer a similar analysis in political science


Partial solutions (Battistin & Sianesi 2009)

Consider ME of a binary or multi-valued treatment in the context of propensity score estimators

Setup

(MPS.i) CIA given no ME: $y_0, y_1 \perp D^* \mid x$

(MPS.ii) CS given no ME: $p^*(x) = \Pr(D^* = 1 \mid x) \in (0, 1) \;\; \forall x$

- $D^*$ is not observed; instead $D$ is, where $D_i \neq D^*_i$ for at least some $i$

Estimation based on $D$ yields

$$
\hat{\Delta}_{ATE} = E\{E[y \mid D = 1, x] - E[y \mid D = 0, x]\}
$$

where the outer expectation is over $S$, where

$$
S = \{x : p(x) = \Pr(D = 1 \mid x) \in (0, 1)\}
$$

In contrast, estimation based on $D^*$ $\Rightarrow$ $\hat{\Delta}^*_{ATE}$

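Notation
- (Mis)classification probabilities given by

$$
\lambda_{jj'}(x) = \Pr(D^* = j \mid D = j', x), \quad j, j' \in \{0, 1\}
$$

  - $\lambda_{10}$ = proportion of incorrect reported zeros
  - $\lambda_{01}$ = proportion of incorrect reported ones

- Condensed notation for correct reporting rates

$$
\lambda_{00}(x) = \lambda_0(x) = \Pr(D^* = 0 \mid D = 0, x)
$$

$$
\lambda_{11}(x) = \lambda_1(x) = \Pr(D^* = 1 \mid D = 1, x)
$$

- Matrix of (mis)classification probabilities can be written in terms of $\lambda_0, \lambda_1$

$$
\Lambda(x) = \begin{bmatrix} \lambda_0(x) & 1 - \lambda_0(x) \\ 1 - \lambda_1(x) & \lambda_1(x) \end{bmatrix}
$$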


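Assumptions

(MPS.iii) Non-differential classification errors: $E[y \mid D^*, x] = E[y \mid D^*, D, x]$

(MPS.iv) Informative reported treatment status: $\lambda_0(x) + \lambda_1(x) - 1 \neq 0$

Outcomes conditional on $D$ can be written as a weighted average of outcomes conditional on $D^*$

$$
\begin{bmatrix} E[y \mid D = 0, x] \\ E[y \mid D = 1, x] \end{bmatrix}
= \Lambda(x)
\begin{bmatrix} E[y \mid D^* = 0, x] \\ E[y \mid D^* = 1, x] \end{bmatrix}
\;\Rightarrow\;
\begin{bmatrix} E[y \mid D^* = 0, x] \\ E[y \mid D^* = 1, x] \end{bmatrix}
= \Lambda^{-1}(x)
\begin{bmatrix} E[y \mid D = 0, x] \\ E[y \mid D = 1, x] \end{bmatrix}
$$

provided $\det[\Lambda(x)] = \lambda_0(x) + \lambda_1(x) - 1 \neq 0$

Two cases satisfy (MPS.iv)
- Minimal classification errors: $\lambda_0(x) + \lambda_1(x) > 1$
- Severe classification errors: $\lambda_0(x) + \lambda_1(x) < 1$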
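
The inversion on this slide is a 2x2 linear system, which the following Python sketch solves numerically at a single value of $x$. It assumes the correct-reporting rates $\lambda_0$ and $\lambda_1$ are known; in practice they are not, which is why the approach leads to bounds. The function name `true_conditional_means` and all numbers are hypothetical.

```python
import numpy as np


def true_conditional_means(m_reported, lam0, lam1):
    """Recover (E[y|D*=0,x], E[y|D*=1,x]) from (E[y|D=0,x], E[y|D=1,x]).

    lam0 = Pr(D*=0 | D=0, x), lam1 = Pr(D*=1 | D=1, x).
    """
    det = lam0 + lam1 - 1.0
    if np.isclose(det, 0.0):
        raise ValueError("(MPS.iv) fails: lam0 + lam1 - 1 = 0, so Lambda is singular")
    Lambda = np.array([[lam0, 1.0 - lam0],
                       [1.0 - lam1, lam1]])
    return np.linalg.solve(Lambda, np.asarray(m_reported, dtype=float))


# Hypothetical true means by true treatment status, and reporting rates.
m_true = np.array([2.0, 3.0])
lam0, lam1 = 0.9, 0.8

# Means by REPORTED status implied by the weighted-average relation.
Lambda = np.array([[lam0, 1.0 - lam0],
                   [1.0 - lam1, lam1]])
m_reported = Lambda @ m_true

recovered = true_conditional_means(m_reported, lam0, lam1)
raw_gap = m_reported[1] - m_reported[0]     # contrast based on reported D
true_gap = m_true[1] - m_true[0]            # contrast based on true D*
print(recovered, raw_gap, (lam0 + lam1 - 1.0) * true_gap)
```

The last two printed numbers agree, which illustrates that the contrast based on reported status is the true contrast scaled by $\det[\Lambda(x)]$.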


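The bias when using $D$ is

$$
\Delta_{ATE}(x) = [\lambda_0(x) + \lambda_1(x) - 1] \, \Delta^*_{ATE}(x)
$$

Implications:
- $\Delta_{ATE}(x)$ is unbiased if $\lambda_0 = \lambda_1 = 1$
- $\Delta_{ATE}(x)$ suffers from attenuation bias if $\lambda_0(x) + \lambda_1(x) > 1$
- $\Delta_{ATE}(x)$ suffers from attenuation bias AND $\mathrm{sgn}[\Delta_{ATE}(x)] \neq \mathrm{sgn}[\Delta^*_{ATE}(x)]$ if $\lambda_0(x) + \lambda_1(x) < 1$
- $\Delta_{ATE}(x) = -\Delta^*_{ATE}(x)$ if $\lambda_0 = \lambda_1 = 0$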


The bias of the unconditional ATE, $\Delta_{ATE}$, also depends on the erroneous determination of the CS

- Can show that

$$
p(x) = \frac{p^*(x) - [1 - \lambda_0(x)]}{\lambda_0(x) + \lambda_1(x) - 1}
$$

- This implies that boundary values of $p(x)$ can be obtained even if $p^*(x) \in (0, 1)$ if

$$
p(x) = 0 \iff \lambda_0(x) = 1 - p^*(x)
$$

$$
p(x) = 1 \iff \lambda_1(x) = p^*(x)
$$

To ensure one does not utilize a different CS based on $D$, must assume

(MPS.v) $\lambda_0(x) \neq 1 - p^*(x)$ and $\lambda_1(x) \neq p^*(x)$
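
A small numerical check makes the common support problem concrete. The Python sketch below evaluates the reported propensity score implied by a true propensity score and a pair of correct-reporting rates, $p(x) = \{p^*(x) - [1 - \lambda_0(x)]\} / [\lambda_0(x) + \lambda_1(x) - 1]$, and shows the two boundary cases that (MPS.v) rules out. The function name `reported_propensity` and the numbers are hypothetical.

```python
def reported_propensity(p_true, lam0, lam1):
    """Reported propensity score p(x) implied by p*(x), lam0(x), lam1(x)."""
    det = lam0 + lam1 - 1.0
    if det == 0.0:
        raise ValueError("(MPS.iv) fails: lam0 + lam1 - 1 = 0")
    return (p_true - (1.0 - lam0)) / det


lam0, lam1 = 0.9, 0.8

interior = reported_propensity(0.5, lam0, lam1)         # strictly inside (0, 1)
at_zero = reported_propensity(1.0 - lam0, lam0, lam1)   # lam0 = 1 - p*(x)
at_one = reported_propensity(lam1, lam0, lam1)          # lam1 = p*(x)
print(interior, at_zero, at_one)
```

In the last two cases the true propensity score lies strictly inside the unit interval, yet the reported one sits on the boundary, so an analysis based on reported status would wrongly drop that value of $x$ from the common support.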


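Estimation
- Under (MPS.i)-(MPS.v)

$$
\Delta^*_{ATE} = \int_S \omega(x) \Delta_{ATE}(x) f(x) \, dx
= \Delta_{ATE} + \int_S [\omega(x) - 1] \Delta_{ATE}(x) f(x) \, dx
$$

where

$$
\omega(x) = \frac{\Pr(D = 1)}{\Pr(D^* = 1)} \left\{ 1 + \frac{1}{p(x)} \cdot \frac{1 - \lambda_0(x)}{\lambda_0(x) + \lambda_1(x) - 1} \right\}
$$

$$
\Pr(D^* = 1) = \int_S [1 - \lambda_0(x)] f(x) \, dx + \int_S [\lambda_0(x) + \lambda_1(x) - 1] p(x) f(x) \, dx
$$

- Shows that $\Delta^*_{ATE}$ can be obtained from an appropriately weighted average of $\Delta_{ATE}(x)$
- Weights depend on $\lambda_0(x)$, $\lambda_1(x)$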


Notes
- Bounds obtained by computing $\hat{\Delta}_{ATE}(\lambda_0, \lambda_1)$ over a grid of values and obtaining the lower and upper bounds
  - Restrictions on possible values of the $\lambda$s can be imposed based on prior info
  - $\hat{\Delta}_{ATE}(\lambda_0, \lambda_1)$ can be obtained using any propensity-score-based estimator
  - In their paper, they use a (5 strata) stratification estimator and assume $(\lambda_0, \lambda_1)$ are stratum-specific
- Extension to multi-valued treatments provided as well
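
The grid idea in these notes can be illustrated with a deliberately simplified Python sketch. It is not the stratum-specific procedure used by Battistin & Sianesi. It assumes instead that $(\lambda_0, \lambda_1)$ are constant in $x$, that the common support is unaffected, and that only the "minimal classification errors" region $\lambda_0 + \lambda_1 > 1$ is admissible, so that the corrected effect is the estimate based on reported status divided by $\lambda_0 + \lambda_1 - 1$. The function name `ate_bounds_over_grid`, the raw estimate, and the grid are hypothetical.

```python
import numpy as np


def ate_bounds_over_grid(delta_raw, lam0_grid, lam1_grid):
    """Lower and upper bounds on the corrected effect over a grid of (lam0, lam1).

    delta_raw: estimate based on the reported treatment D
    Assumes lam0, lam1 do not vary with x (illustrative simplification).
    """
    lower, upper = np.inf, -np.inf
    for lam0 in lam0_grid:
        for lam1 in lam1_grid:
            det = lam0 + lam1 - 1.0
            if det <= 0.0:
                continue  # prior restriction: keep only minimal classification errors
            corrected = delta_raw / det
            lower = min(lower, corrected)
            upper = max(upper, corrected)
    return lower, upper


# Prior information (hypothetical): at least 80% of reports are correct in each arm.
grid = np.linspace(0.8, 1.0, 21)
lower, upper = ate_bounds_over_grid(0.10, grid, grid)
print(lower, upper)
```

Tightening the prior restriction on the $\lambda$s narrows the grid and therefore the bounds; with both rates fixed at one the interval collapses to the raw estimate.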


Data Issues
ME: Missing Binary Independent Variable

Molinari (2010) applies a similar bounding approach to analyze the case where $D$ is missing, possibly non-randomly, due to subject non-response

- Examples:
  - Respondents refuse to answer questions concerning drug use, welfare use, etc.


Millimet (2011) MC study also compares common treatment effect estimators when y or x is measured with error (do not forget the rest of the data! ... ∆ = 1)


Data Issues
ME: Persistence of Treatment Effects

Often neglected in applied research is the question of whether treatment effects are persistent

Clearly relevant for policymakers; an investment that improves outcomes for one period only has different benefits than an investment that yields a permanent improvement in outcomes

Jacob et al. (2010) propose an interesting method to estimate the degree of persistence in a treatment effect (under certain circumstances)

Method relies on preceding analysis of measurement error


Setup

$$
y_{it} = y^L_{it} + y^S_{it}
$$

where $y$ is the outcome, which is decomposed into a LR component, $y^L$, and a SR component, $y^S$

- The two components are given by

$$
y^S_{it} = \tau_S D_{it} + \varepsilon^S_{it}
$$

$$
y^L_{it} = \delta y^L_{it-1} + \tau_L D_{it} + \varepsilon^L_{it}
$$

where $D$ is a treatment (binary, discrete, or continuous)

- Interpretation of parameters
  - $\delta$ = persistence of the LR component of $y$ (by definition, the SR component completely decays each period)
  - $\tau_S$, $\tau_L$ = the (common) treatment effect on $y^S$, $y^L$

Goal: say something about $\delta$, $\tau_S$, and $\tau_L$


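Consider trying to estimate the LR component equation

$$
y^L_{it} = \delta y^L_{it-1} + \tau_L D_{it} + \varepsilon^L_{it}
$$

Problem: $y^L_{it}$, $y^L_{it-1}$ are unobserved; only $y$ is observed

Some algebra yields

$$
y_{it} - y^S_{it} = \delta (y_{it-1} - y^S_{it-1}) + \tau_L D_{it} + \varepsilon^L_{it}
$$

$$
\Rightarrow \; y_{it} = \delta y_{it-1} + \tau_L D_{it} + [y^S_{it} - \delta y^S_{it-1} + \varepsilon^L_{it}]
$$

Notes
- $\mathrm{Cov}(y_{it-1}, y^S_{it-1}) \neq 0$ ... $y^S_{it-1}$ is analogous to ME in the desired covariate, $y^L_{it-1}$
- $\mathrm{Cov}(D_{it}, y^S_{it}) \neq 0$ if $\tau_S \neq 0$

Circumvent this second issue by incorporating $D_{it}$ into the error term

$$
y_{it} = \delta y_{it-1} + [\tau_L D_{it} + y^S_{it} - \delta y^S_{it-1} + \varepsilon^L_{it}]
= \delta y_{it-1} + \upsilon_{it}
$$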


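Comparison of estimators ...

OLS yields

$$
\mathrm{plim} \, \hat{\delta}_{OLS} = \delta \left( \frac{\sigma^2_{y^L}}{\sigma^2_{y^L} + \sigma^2_{y^S}} \right) < \delta
$$

using the CEV formula discussed previously

IV using $y_{it-2}$ as an instrument

$$
\mathrm{plim} \, \hat{\delta}_{IV,1} = \delta
$$

if $\mathrm{Cov}(y_{it-2}, \varepsilon^L_{it}) = \mathrm{Cov}(y_{it-2}, \varepsilon^S_{it}) = \mathrm{Cov}(y_{it-2}, \varepsilon^S_{it-1}) = \mathrm{Cov}(y_{it-2}, D_{it}) = 0$, implying that $y_{it-2}$ is predetermined and uncorrelated with future treatment status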


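IV using $D_{it-1}$ as an instrument

$$
\mathrm{plim} \, \hat{\delta}_{IV,2} = \frac{\mathrm{Cov}(y_{it}, D_{it-1})}{\mathrm{Cov}(y_{it-1}, D_{it-1})}
= \frac{\mathrm{Cov}(\delta y_{it-1} + \tau_L D_{it} + y^S_{it} - \delta y^S_{it-1} + \varepsilon^L_{it}, D_{it-1})}{\mathrm{Cov}(y_{it-1}, D_{it-1})}
= \delta + \frac{\mathrm{Cov}(\tau_L D_{it} + y^S_{it} - \delta y^S_{it-1} + \varepsilon^L_{it}, D_{it-1})}{\mathrm{Cov}(y_{it-1}, D_{it-1})}
$$

- Assume $\mathrm{Cov}(D_{it-1}, D_{it}) = \mathrm{Cov}(D_{it-1}, \varepsilon^S_{it}) = \mathrm{Cov}(D_{it-1}, \varepsilon^L_{it}) = 0$
- But, $\mathrm{Cov}(D_{it-1}, y^S_{it-1}) \neq 0$ $\Rightarrow$ $D_{it-1}$ is not a valid IV

$$
\mathrm{plim} \, \hat{\delta}_{IV,2} = \delta + \frac{\mathrm{Cov}(-\delta y^S_{it-1}, D_{it-1})}{\mathrm{Cov}(y_{it-1}, D_{it-1})}
= \delta \left( 1 - \frac{\mathrm{Cov}(y^S_{it-1}, D_{it-1})}{\mathrm{Cov}(y_{it-1}, D_{it-1})} \right)
= \delta \left( 1 - \frac{\tau_S \mathrm{Var}(D_{it-1})}{(\tau_S + \tau_L) \mathrm{Var}(D_{it-1})} \right)
= \delta \, \frac{\tau_L}{\tau_S + \tau_L}
$$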


Notes:
- Combination of OLS and IV1 can estimate the relative contribution of $y^L$ to $y$
- Combination of IV1 and IV2 can estimate the relative contribution of $D$ to the LR component
- $x$s can be incorporated by redefining $\varepsilon^L_{it} = x_{it} \beta_L + \tilde{\varepsilon}^L_{it}$
- Model requires $\mathrm{Cov}(D_{it-1}, D_{it}) = 0$, ruling out treatments which persist themselves (e.g., treaties)
  - Examples (perhaps): class size, R&D (?)
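
The comparison of OLS, IV1, and IV2 can be checked with a small Monte Carlo. The Python sketch below simulates the two-component model with an i.i.d. binary treatment (so that $\mathrm{Cov}(D_{it-1}, D_{it}) = 0$ holds by construction) and standard normal errors, then computes the three estimators as ratios of sample covariances. The function name `simulate_estimators`, the parameter values, the sample size, and the burn-in length are all hypothetical. The simulation does not try to reproduce the OLS expression exactly, because in this design $y^L$ and $y^S$ are both driven by $D_{it}$ and are therefore correlated; it only illustrates that OLS is attenuated.

```python
from collections import deque

import numpy as np


def simulate_estimators(n=200_000, delta=0.6, tau_s=1.0, tau_l=0.5,
                        burn_in=20, seed=0):
    """Simulate y = yL + yS and return (OLS, IV1, IV2) estimates of delta."""
    rng = np.random.default_rng(seed)
    y_long = np.zeros(n)
    y_hist = deque(maxlen=3)   # keeps y at t-2, t-1, t
    d_hist = deque(maxlen=3)   # keeps D at t-2, t-1, t

    for _ in range(burn_in + 3):
        d = rng.integers(0, 2, size=n).astype(float)   # i.i.d. binary treatment
        y_long = delta * y_long + tau_l * d + rng.normal(size=n)
        y_short = tau_s * d + rng.normal(size=n)
        y_hist.append(y_long + y_short)
        d_hist.append(d)

    y_t2, y_t1, y_t = y_hist
    d_t1 = d_hist[1]

    def cov(a, b):
        return np.cov(a, b)[0, 1]

    ols = cov(y_t, y_t1) / cov(y_t1, y_t1)    # regress y_t on y_{t-1}
    iv1 = cov(y_t, y_t2) / cov(y_t1, y_t2)    # instrument: y_{t-2}
    iv2 = cov(y_t, d_t1) / cov(y_t1, d_t1)    # instrument: D_{t-1}
    return ols, iv1, iv2


ols, iv1, iv2 = simulate_estimators()
long_run_share = iv2 / iv1    # estimates tau_L / (tau_S + tau_L)
print(ols, iv1, iv2, long_run_share)
```

With these parameter values IV1 should be close to $\delta = 0.6$ and IV2 close to $\delta \tau_L / (\tau_S + \tau_L) = 0.2$, so their ratio recovers the share of the treatment effect that operates through the long-run component.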


In conclusion, listen to the words of Sims (2010):

Natural, quasi-, and computational experiments, as well as regression discontinuity design (RDD), can all, when well applied, be useful, but none are panaceas... Because we are not an experimental science, we face difficult problems of inference. The same data generally are subject to multiple interpretations. It is not that we learn nothing from data, but that we have at best the ability to use data to narrow the range of substantive disagreement. We are always combining the objective information in the data with judgment, opinion and/or prejudice to reach conclusions...

Natural experiments, difference-in-difference, and regression discontinuity design are good ideas. They have not taken the con out of econometrics -- in fact, as with any popular econometric technique, they in some cases have become the vector by which con is introduced into applied studies. Furthermore, over-enthusiasm about these methods, when it leads to claims that single-equation linear model with sandwiched errors are all we ever really need, can lead to our training applied economists who do not understand how to fully model a dataset.

In light of these sentiments, recall the points made at the start of this course:

Prior to conducting, or when reviewing, causal analyses, questions that need to be answered:

1 What is the causal relationship of interest? [Is it economically interesting?]
2 What is the identification strategy?
3 What parameter are you actually estimating?
4 To whom does the parameter apply?
5 What question does the analysis answer?
6 What is the method of statistical inference?

While applied work is open to multiple interpretations, these interpretations and objections to research are lessened when one is precise in answering these questions.
