microeconometrics lecture notes

ECO 7377 Microeconometrics

Daniel L. Millimet

Southern Methodist University

Fall 2011

DL Millimet (SMU) ECO 7377 Fall 2011 1 / 407

Introduction

Applied research in economics can be loosely classified into two types:

1. Descriptive analysis
2. Causal analysis

While the first is important and useful, the second is of primary interest

Causal analysis is needed to predict the impact of changing circumstances or policies, or for the evaluation of existing policies or interventions

Prior to conducting, or when reviewing, causal analyses, questions that need to be answered:

1. What is the causal relationship of interest? [Is it economically interesting?]
2. What is the identification strategy?
3. What is the method of statistical inference?

Several statistical issues are confronted when answering these questions in economic research:

Specification of the causal relationship of interest entails more than just defining x and y ... lots of parameters could be estimated
- Heterogeneous vs. homogeneous effects
- Know what you are estimating
- To whom does it apply?
- What question does it answer?

Statistical inference is often difficult and overlooked
- Spherical vs. non-spherical errors
- Derivation/computation of estimated asymptotic variances of estimators

Identification of the causal relationship of interest frequently encounters
- Selection issues
  - Self-selection (endogeneity)
  - Sample selection (missing data, attrition)
- Measurement issues
  - Classical vs. non-classical error
  - Dependent vs. independent variable
  - Continuous vs. discrete variables
- Modeling issues
  - Functional form (P, SNP, NP)
  - Role of space (spillovers, spatial correlation)
  - Consistency with theory

Dissertation considerations (applied work):

What's the question? Is it economically interesting?

What's the identification strategy (if question is causal)?
- Selection on observables vs. unobservables
- Parameter of interest

What's the data requirement? Is it feasible?

Has it been done? Is there value added?
- Tension between "hot" topics and ability to contribute

Dissertation Writing Advice

Be organized
- Outline paper before writing
- Most papers have a common structure
  - Abstract: Very important. Be concise. No abbreviations, notation. Include the motivation, punchline.
  - Intro: Outline the question. Explain why we care, and what is new in the paper. Give a slightly longer summary than the abstract of what is done in the paper, and emphasize the major findings.
  - Lit review (may be incorporated in intro if short)
  - Theoretical model: Be only as complicated as necessary. Understand ramifications of assumptions. If innovation is in the empirics, theory is only needed if it adds something not well understood.
  - Empirical model: Be clear. Understand where identification comes from. Consider relevant specification tests. Acknowledge deficiencies, circumstances under which estimates are inconsistent.
  - Data: Explain the sample selection criteria and variables used. If building on an existing literature, note any differences between the sample selection criteria and those used in existing papers.
  - Results: Be sure to spend enough time discussing the actual results. If results differ from existing literature, try to pin down the reason(s) why.
  - Conclusion: Emphasize importance of new findings, as well as shortcomings of the current paper. Discuss potential future work still to be done. End on a positive note.
- Put discussions in relevant sections
  - Avoid discussing the same point in multiple locations
  - Discuss data in data section; discuss results in results section; most econometric issues belong in the empirical model section

Be considerate to your readers
- Invest the time to proofread the paper many times; if you are unwilling to go through your paper carefully, why should others invest their time?
  - Pascal: "The letter I have written today is longer than usual because I lacked the time to make it shorter."
  - Quintilian: "One should aim not at being possible to understand, but at being impossible to misunderstand."
- Spell check, grammar check, check formatting issues, check spacing, check indenting, etc.
- Define notation, abbreviations, etc.
- Avoid redundant notation, excessive notation, awkward notation, etc.
- Avoid overly critical remarks about other papers; other authors are not idiots, and may be your referees
- Tables should be easy to read, and self-explanatory (need to refer back to the text should be kept to a minimum); include notes under the tables to explain things; avoid using abbreviations for variable names unless necessary
- References should be double-checked; be sure they are accurate and all are included in the bibliography

Be professional (this is not a term paper)
- Avoid unsubstantiated claims, sweeping or grand statements, and generalizations
- Be upfront; do not hide assumptions/restrictions hoping they will be overlooked, and justify their use
- Do not be unnecessarily complex in order to feel smart or show off (see Siegfried 1970)
  - Da Vinci: "Simplicity is the ultimate sophistication."
  - Einstein: "Any fool can make things bigger, more complex, and more violent. It takes a touch of genius, and a lot of courage, to move in the opposite direction."
  - Fowler: "Any one who wishes to become a good writer should endeavour, before he allows himself to be tempted by the more showy qualities, to be direct, simple, brief, vigorous, and lucid."
  - Mingus: "Making the simple complicated is commonplace; making the complicated simple, awesomely simple, that's creativity."
  - Jefferson: "The most valuable of all talents is that of never using two words when one will do."
- Avoid contractions
- Be consistent with the use of "I" or "we" if the paper uses first person; be consistent with present vs. past tense

Plagiarism
- Be careful, be ethical!
- Give credit where credit is due; cite others' ideas (in parentheses, not footnotes)
  - Milton: "Copy from one, it's plagiarism; copy from two, it's research."
  - Donatus: "Perish those who said our good things before we did."
  - Kuralt: "I could tell you which writer's rhythms I am imitating. It's not exactly plagiarism, it's falling in love with good language and trying to imitate it."
- Any statement in a paper should fit one of the following categories: (i) factual (agreeable to any reader), or (ii) debatable (but then references in support, or it should be supported by the work done in the paper itself, or it should be written in the appropriate language: "If one believes X, then Y.")
  - But, any statement should be in your own words, or should be in quotations

What to include?
- Dissertation chapters can/should be longer than papers submitted for publication
- Chapters may include greater detail on:
  - Literature review
  - Data construction
  - Empirical methodology

Bootstrap: Introduction

General structure of estimation

$$\text{population} \Rightarrow \theta$$
$$\downarrow$$
$$\text{random sample} \Rightarrow \hat{\theta}$$

Problem: $\hat{\theta}$ is an estimate; need to assess its distribution (dbn) for proper inference

Solutions
- Asymptotic theory
- Simulation methods $\Rightarrow$ bootstrap

Stata: -bootstrap-, -bsample-

Idea
- Re-sample (with replacement) from the random sample multiple times and assess the dbn of the estimates

$$\text{population} \Rightarrow \theta$$
$$\downarrow$$
$$\text{random sample} \Rightarrow \hat{\theta}$$
$$\downarrow$$
$$\text{bootstrap sample} \Rightarrow \hat{\theta}^*$$

- Results in a vector of estimates, $\hat{\theta}^*_b$, $b = 1, \ldots, B$, where $B$ is the number of bootstrap repetitions

Many different bootstrap methods
- Parametric vs. nonparametric
- Resampling algorithms
  - iid
  - Block/cluster
  - Sub-sampling (M/N)
- Imposing the null or not imposing
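The resampling idea can be sketched in a few lines of Python (standard library only). This is a minimal illustration, not the Stata implementation: the function name `bootstrap_estimates`, the simulated N(0, 1) sample, and the choice of the sample mean as the estimator are all assumptions made for the example.

```python
import random
import statistics

def bootstrap_estimates(sample, estimator, B=500, seed=0):
    """Return B estimates, each computed on an iid resample drawn with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    return [estimator(rng.choices(sample, k=n)) for _ in range(B)]

# Hypothetical "random sample": 1000 draws from N(0, 1).
data_rng = random.Random(1)
x = [data_rng.gauss(0.0, 1.0) for _ in range(1000)]

theta_hat = statistics.fmean(x)                        # original-sample estimate
theta_star = bootstrap_estimates(x, statistics.fmean)  # theta*_b, b = 1, ..., B
boot_se = statistics.stdev(theta_star)                 # spread of the bootstrap estimates
asy_se = statistics.stdev(x) / len(x) ** 0.5           # textbook asymptotic std error

print(f"theta_hat={theta_hat:.4f}  bootstrap se={boot_se:.4f}  asymptotic se={asy_se:.4f}")
```

For a smooth statistic such as the mean, the two standard errors should be close; the bootstrap becomes more useful when no convenient asymptotic formula is available.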

Bootstrap: Confidence Intervals

Consider a regression model

$$y_i = x_i \beta + \varepsilon_i$$

Problem: given sample estimates, $\hat{\beta}$, need to obtain std errors or confidence intervals

There are two common sampling methods
1. Resampling the data
2. Resampling the errors

Data
- Resample (with replacement) observations $(y_i, x_i)$ $\Rightarrow$ $\{y_i^*, x_i^*\}_{i=1}^N$
- Estimate the original model (OLS) on the re-sampled data set $\Rightarrow$ $\hat{\beta}^*$
- Repeat $B$ times $\Rightarrow$ $\hat{\beta}^*_b$, $b = 1, \ldots, B$

Residuals
- Given $\hat{\beta}$ from OLS on original sample, obtain residuals $\Rightarrow$ $\hat{\varepsilon}_i$, $i = 1, \ldots, N$
- Resample (with replacement) a vector of $N$ residuals $\Rightarrow$ $\hat{\varepsilon}_i^*$, $i = 1, \ldots, N$
  - This represents a random draw from the (nonparametric) empirical dbn of the residuals
- Alternative (parametric):
  - Estimate $\hat{\sigma}^2 = \frac{1}{N-K} \sum_i \hat{\varepsilon}_i^2$
  - Draw $N$ random numbers, $\hat{\varepsilon}_i^*$, $i = 1, \ldots, N$, from $N(0, \hat{\sigma}^2)$
- Generate $y_i^* = x_i \hat{\beta} + \hat{\varepsilon}_i^*$ (which imposes $\beta = \hat{\beta}$)
- Regress $y^*$ on $x$ by OLS $\Rightarrow$ $\hat{\beta}^*$
- Repeat $B$ times $\Rightarrow$ $\hat{\beta}^*_b$, $b = 1, \ldots, B$

Resampling data is typically preferred since it is less model dependent
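The two resampling schemes can be compared side by side. The sketch below assumes a single regressor plus an intercept and uses hypothetical helper names (`ols`, `pairs_bootstrap`, `residual_bootstrap`) and simulated data; it illustrates the steps listed above rather than any particular package routine.

```python
import random

def ols(x, y):
    """Intercept and slope from a one-regressor OLS fit."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return my - b1 * mx, b1

def pairs_bootstrap(x, y, B, rng):
    """Resample (y_i, x_i) pairs with replacement; return B slope estimates."""
    n = len(x)
    slopes = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]
        slopes.append(ols([x[i] for i in idx], [y[i] for i in idx])[1])
    return slopes

def residual_bootstrap(x, y, B, rng):
    """Hold x fixed, resample OLS residuals, rebuild y*; return B slope estimates."""
    b0, b1 = ols(x, y)
    resid = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
    slopes = []
    for _ in range(B):
        e_star = rng.choices(resid, k=len(x))
        y_star = [b0 + b1 * xi + ei for xi, ei in zip(x, e_star)]  # imposes beta = beta_hat
        slopes.append(ols(x, y_star)[1])
    return slopes

# Simulated data (assumption): y = 1 + 2x + e with standard normal x and e.
rng = random.Random(42)
x = [rng.gauss(0, 1) for _ in range(200)]
y = [1.0 + 2.0 * xi + rng.gauss(0, 1) for xi in x]
pairs_slopes = pairs_bootstrap(x, y, 300, rng)
resid_slopes = residual_bootstrap(x, y, 300, rng)
print(sum(pairs_slopes) / 300, sum(resid_slopes) / 300)
```

With homoskedastic errors, as simulated here, both schemes should give similar answers; they can diverge when the error variance depends on x, which is one reason resampling the data is less model dependent.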

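What to do with $\hat{\beta}^*_b$, $b = 1, \ldots, B$? Several options...

Obtain std error for original sample estimate, $\hat{\beta}$, given by

$$se(\hat{\beta}) = \sqrt{\frac{1}{B-1} \sum_b \left(\hat{\beta}^*_b - \bar{\hat{\beta}}^*\right)^2}$$

where $\bar{\hat{\beta}}^*$ is the mean of the $B$ bootstrap estimates

Obtain symmetric CI using normal approximation

$$\beta \in \left\{\hat{\beta} \pm t_{1-\alpha/2,\,B-1}\, se(\hat{\beta})\right\}$$

Obtain asymmetric CI using percentile method

$$\beta \in \left\{\hat{\beta}^*_{\alpha/2},\ \hat{\beta}^*_{1-\alpha/2}\right\}$$

where subscript refers to the quantile of the empirical dbn of $\hat{\beta}^*$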
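Given a vector of bootstrap estimates, the first two options reduce to a standard deviation and two empirical quantiles. In the sketch below the helper names are hypothetical, the statistic is a sample mean, and a standard normal critical value stands in for $t_{1-\alpha/2,\,B-1}$ because the Python standard library has no t quantile function; with B in the hundreds the difference is small.

```python
import math
import random
import statistics

def empirical_quantile(sorted_vals, q):
    """Smallest draw with at least a share q of the draws at or below it."""
    k = max(0, min(len(sorted_vals) - 1, math.ceil(q * len(sorted_vals)) - 1))
    return sorted_vals[k]

def bootstrap_cis(beta_hat, draws, alpha=0.05):
    se = statistics.stdev(draws)  # divides by B - 1, as in the formula above
    z = statistics.NormalDist().inv_cdf(1 - alpha / 2)  # normal stand-in for the t quantile
    s = sorted(draws)
    return {
        "se": se,
        "normal": (beta_hat - z * se, beta_hat + z * se),
        "percentile": (empirical_quantile(s, alpha / 2),
                       empirical_quantile(s, 1 - alpha / 2)),
    }

# Simulated example (assumption): the statistic is the mean of 400 draws from N(5, 4).
rng = random.Random(3)
x = [rng.gauss(5.0, 2.0) for _ in range(400)]
beta_hat = statistics.fmean(x)
draws = [statistics.fmean(rng.choices(x, k=len(x))) for _ in range(1000)]
cis = bootstrap_cis(beta_hat, draws)
print(cis)
```

The normal interval is symmetric around the original estimate by construction; the percentile interval need not be.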

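Obtain asymmetric bias-corrected and accelerated CIs (BCa)
- Calculate

$$z_0 = \Phi^{-1}\left\{\frac{1}{B} \sum_b I\left(\hat{\beta}^*_b \le \hat{\beta}\right)\right\} \quad \text{(median bias)}$$

$$a = \frac{\sum_i \left(\bar{\beta}_J - \hat{\beta}_{J(i)}\right)^3}{6\left[\sum_i \left(\bar{\beta}_J - \hat{\beta}_{J(i)}\right)^2\right]^{3/2}} \quad \text{(acceleration parameter)}$$

where $\hat{\beta}_{J(i)}$ is the jackknife estimate (omitting obs $i$ from original sample) and $\bar{\beta}_J$ is the mean of the jackknife estimates
- Calculate lower and upper quantiles

$$p_1 = \Phi\left[z_0 + \frac{z_0 - z_{1-\alpha/2}}{1 - a\left(z_0 - z_{1-\alpha/2}\right)}\right]; \quad p_2 = \Phi\left[z_0 + \frac{z_0 + z_{1-\alpha/2}}{1 - a\left(z_0 + z_{1-\alpha/2}\right)}\right]$$

where $z_{1-\alpha/2}$ is the $(1-\alpha/2)$th quantile of the std normal distribution
- CI given by $\beta \in \left\{\hat{\beta}^*_{p_1},\ \hat{\beta}^*_{p_2}\right\}$

Notes:
- BC CI obtained by setting $a = 0$
- BCa requires $B > 1000$
- $z_0 = 0$ when $\hat{\beta}$ = median of $\hat{\beta}^*$
- $a$ reflects the rate of change of the standard error of $\hat{\beta}$ with respect to the true value, $\beta$
  - The standard normal approximation assumes that the standard error is invariant with respect to the true value
  - The acceleration parameter corrects for deviations in practice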
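A compact sketch of the BCa calculation follows. It uses the usual form of the adjusted quantiles, in which $z_0$ also enters additively inside $\Phi$ (so that $a = 0$ gives the BC interval), and applies it to the mean of a skewed sample. The function name `bca_interval`, the exponential data, and the clamping of the share passed to the inverse normal CDF are assumptions made for the illustration.

```python
import math
import random
import statistics

N01 = statistics.NormalDist()

def bca_interval(sample, estimator, draws, alpha=0.05):
    """Return (lower, upper, z0, a) for a BCa interval built from bootstrap draws."""
    theta_hat = estimator(sample)
    B = len(draws)
    # Median-bias correction: share of bootstrap draws at or below the estimate.
    share = sum(d <= theta_hat for d in draws) / B
    share = min(max(share, 1 / (B + 1)), B / (B + 1))  # keep inv_cdf finite (assumption)
    z0 = N01.inv_cdf(share)
    # Acceleration from leave-one-out (jackknife) estimates.
    jack = [estimator(sample[:i] + sample[i + 1:]) for i in range(len(sample))]
    jbar = statistics.fmean(jack)
    num = sum((jbar - j) ** 3 for j in jack)
    den = 6.0 * sum((jbar - j) ** 2 for j in jack) ** 1.5
    a = num / den if den > 0 else 0.0
    # Adjusted quantile levels p1 and p2.
    z = N01.inv_cdf(1 - alpha / 2)
    p1 = N01.cdf(z0 + (z0 - z) / (1 - a * (z0 - z)))
    p2 = N01.cdf(z0 + (z0 + z) / (1 - a * (z0 + z)))
    s = sorted(draws)
    pick = lambda p: s[max(0, min(B - 1, math.ceil(p * B) - 1))]
    return pick(p1), pick(p2), z0, a

# Skewed simulated sample (assumption): 100 draws from an exponential distribution.
rng = random.Random(5)
sample = [rng.expovariate(1.0) for _ in range(100)]
theta_hat = statistics.fmean(sample)
draws = [statistics.fmean(rng.choices(sample, k=len(sample))) for _ in range(2000)]
lower, upper, z0, a = bca_interval(sample, statistics.fmean, draws)
print(f"BCa 95% CI: ({lower:.3f}, {upper:.3f}), z0={z0:.3f}, a={a:.3f}")
```

For right-skewed data the acceleration parameter is positive, which shifts both quantile levels upward relative to the plain percentile interval.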

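Obtain asymmetric CI using bootstrap-t
- When estimating the model on the re-sampled data, collect the t-statistics obtained from testing $H_0: \beta = \hat{\beta}$

$$t^* = \frac{\hat{\beta}^* - \hat{\beta}}{se(\hat{\beta}^*)}$$

- Yields $t^*_b$, $b = 1, \ldots, B$
- Define $t^*_\alpha$ by

$$\frac{1}{B} \sum_b I\left(t^*_b \le t^*_\alpha\right) = \alpha$$

$\Rightarrow$ $t^*_\alpha$ is the $\alpha$th quantile of the empirical dbn of $t^*$
- CI given by

$$\beta \in \left\{\hat{\beta} - t^*_{1-\alpha/2}\, se(\hat{\beta}),\ \hat{\beta} - t^*_{\alpha/2}\, se(\hat{\beta})\right\}$$

(since $t^*_{\alpha/2}$ is typically negative, the upper limit lies above $\hat{\beta}$)
- Notes
  - Method assumes $se(\hat{\beta})$ is known based on asymptotic theory
  - If unknown, then use double bootstrap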
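A minimal bootstrap-t sketch for a sample mean, where the standard error in each resample comes from the usual analytic formula (so no nested bootstrap is needed). The function name and simulated data are assumptions; the interval uses the lower and upper empirical quantiles of the bootstrap t-statistics with quantiles defined from the left tail.

```python
import math
import random
import statistics

def bootstrap_t_interval(sample, B=999, alpha=0.05, seed=0):
    """Bootstrap-t CI for the mean, using the analytic std error in every resample."""
    rng = random.Random(seed)
    n = len(sample)
    mean_hat = statistics.fmean(sample)
    se_hat = statistics.stdev(sample) / math.sqrt(n)
    t_star = []
    for _ in range(B):
        s = rng.choices(sample, k=n)
        se_b = statistics.stdev(s) / math.sqrt(n)
        t_star.append((statistics.fmean(s) - mean_hat) / se_b)  # tests H0: beta = beta_hat
    t_star.sort()
    pick = lambda p: t_star[max(0, min(B - 1, math.ceil(p * B) - 1))]
    # Upper-tail quantile sets the lower limit and vice versa.
    return mean_hat - pick(1 - alpha / 2) * se_hat, mean_hat - pick(alpha / 2) * se_hat

# Simulated skewed sample (assumption).
rng = random.Random(9)
sample = [rng.expovariate(1.0) for _ in range(200)]
mean_hat = statistics.fmean(sample)
lo_t, hi_t = bootstrap_t_interval(sample)
print(f"bootstrap-t 95% CI: ({lo_t:.3f}, {hi_t:.3f})")
```

When no analytic standard error is available, the two `stdev(...) / sqrt(n)` lines would be replaced by an inner bootstrap, which is the double bootstrap.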

Obtain asymmetric CI using bootstrap-t with double bootstrap
- Estimate original model by OLS $\Rightarrow$ $\hat{\beta}$
- Obtain bootstrap samples, estimate by OLS, form $t^*$ given by

$$t^* = \frac{\hat{\beta}^* - \hat{\beta}}{se(\hat{\beta}^*)}$$

- Since denominator is not known, resample from the bootstrap sample $B_2$ times $\Rightarrow$ $\hat{\beta}^{**}_b$, $b = 1, \ldots, B_2$
- Obtain the estimated std error of $\hat{\beta}^*$ as the std deviation of the $B_2$ estimates
- Repeat process $B_1$ times
- Obtain CI as above, but with $se(\hat{\beta})$ replaced by the std deviation of the $B_1$ estimates of $\hat{\beta}^*$

Example: $x \sim N(0, 1)$, $N = 1000$, $\bar{x} \overset{a}{\sim} N(0, 0.001)$

[Figure: bootstrap vs. asymptotic distribution of the sample mean; four panels for Reps = 20, 100, 500, 1000; horizontal axis from -0.2 to 0.2.]

Bootstrap: Imposing the Null

Goal: estimate the model, derive some estimate or test statistic, and you wish to test whether the true value of the parameter is equal to some value or derive a p-value associated with the test statistic

Strategy
- When re-sampling the data, generate new data sets where the null is true (imposed)
- Estimate the original model on the re-sampled data
- Compare the value of the test statistics obtained from the re-sampled data sets with the value of the test statistic from the original sample
- If the test statistic from the original sample is very different (statistically), then it is unlikely the null is true in the original sample

Regression example

Model

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2)$$

Hypothesis of interest:

$$H_0: \beta_1 = 0$$
$$H_1: \beta_1 \ne 0$$

Algorithm
- Estimate model on original data $\Rightarrow$ $\hat{\beta}_0, \hat{\beta}_1$ $\Rightarrow$ $t_{\beta_1}$ (t-statistic for $\beta_1$)
- Obtain the residuals $\Rightarrow$ $\hat{\varepsilon}_i$, $i = 1, \ldots, N$
- Resample (with replacement) a vector of $N$ residuals $\Rightarrow$ $\hat{\varepsilon}_i^*$, $i = 1, \ldots, N$
  - This represents a random draw from the (nonparametric) empirical dbn of the residuals
- Alternative (parametric):
  - Estimate $\hat{\sigma}^2 = \frac{1}{N-K} \sum_i \hat{\varepsilon}_i^2$
  - Draw $N$ random numbers, $\hat{\varepsilon}_i^*$, $i = 1, \ldots, N$, from $N(0, \hat{\sigma}^2)$
- Generate $y_i^* = \hat{\beta}_0 + 0 \cdot x_i + \hat{\varepsilon}_i^* = \hat{\beta}_0 + \hat{\varepsilon}_i^*$ (which imposes $\beta_1 = 0$)
- Regress $y^*$ on $x$ by OLS $\Rightarrow$ $t^*_{\beta_1}$
- Repeat $B$ times $\Rightarrow$ $t^*_{\beta_1,b}$, $b = 1, \ldots, B$
- Obtain p-value as

$$p\text{-value} = \frac{1}{B} \sum_b I\left(|t^*_{\beta_1,b}| > |t_{\beta_1}|\right)$$

- Reject null if $p < \alpha < 0.5$, where $\alpha$ is the significance level
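The regression algorithm above can be sketched as follows, using the nonparametric residual resampling variant. The helper names (`slope_t`, `null_bootstrap_pvalue`) and the two simulated data sets (one with a true slope of 0.5, one with a true slope of 0) are assumptions for the illustration.

```python
import math
import random

def slope_t(x, y):
    """Return (b0, b1, residuals, t-statistic for b1) from a simple OLS regression."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    b0 = my - b1 * mx
    resid = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in resid) / (n - 2)  # N - K with K = 2
    return b0, b1, resid, b1 / math.sqrt(s2 / sxx)

def null_bootstrap_pvalue(x, y, B=499, seed=0):
    """Residual bootstrap p-value for H0: beta1 = 0, with the null imposed on y*."""
    rng = random.Random(seed)
    b0, _, resid, t_obs = slope_t(x, y)
    exceed = 0
    for _ in range(B):
        e_star = rng.choices(resid, k=len(x))
        y_star = [b0 + e for e in e_star]  # y* = b0 + 0*x + e*, so beta1 = 0 holds
        exceed += abs(slope_t(x, y_star)[3]) > abs(t_obs)
    return exceed / B

# Simulated data (assumption).
rng = random.Random(21)
x = [rng.gauss(0, 1) for _ in range(200)]
y_alt = [1.0 + 0.5 * xi + rng.gauss(0, 1) for xi in x]  # null is false
y_null = [1.0 + rng.gauss(0, 1) for _ in x]             # null is true
p_alt = null_bootstrap_pvalue(x, y_alt)
p_null = null_bootstrap_pvalue(x, y_null)
print(f"p-value, slope 0.5: {p_alt:.3f}; p-value, slope 0: {p_null:.3f}")
```

The first p-value should be essentially zero; the second is a draw from a roughly uniform distribution, as expected when the null holds.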

Distributional example

Want to test equality of CDFs of two random variables (e.g., wages of job training participants and non-participants)

Data sample
- $x_i$, $i = 1, \ldots, N$, is random sample of one variable (participants), with CDF $F(x)$
- $y_i$, $i = 1, \ldots, M$, is random sample of another variable (non-participants), with CDF $G(y)$

Hypothesis of interest:

$$H_0: F = G$$
$$H_1: F \ne G$$

Algorithm
- Estimate empirical CDF in each sample: $\hat{F}(x)$ and $\hat{G}(y)$
- Compute test statistic

$$d = \sqrt{\frac{NM}{N+M}} \max_{z \in Supp(X,Y)} \left|\hat{F}(z) - \hat{G}(z)\right|$$

- Pool data, re-sample (with replacement), sample size = $N + M$ $\Rightarrow$ $q_1, \ldots, q_{N+M}$
- Split the sample: denote first $N$ obs from $F$; final $M$ obs from $G$ (imposes $F = G$)
- Compute $d^*$
- Repeat $B$ times $\Rightarrow$ $d^*_b$, $b = 1, \ldots, B$
- Obtain p-value as

$$p\text{-value} = \frac{1}{B} \sum_b I\left(d^*_b > d\right)$$

- Reject null if $p < \alpha < 0.5$, where $\alpha$ is the significance level
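A sketch of the pooled-resampling test for equality of two distributions. It uses the two-sided (absolute value) form of the scaled maximum distance between the empirical CDFs; the function names and the simulated "wage" samples, which differ by a location shift, are assumptions.

```python
import bisect
import math
import random

def ks_stat(x, y):
    """Scaled maximum absolute distance between the two empirical CDFs."""
    n, m = len(x), len(y)
    xs, ys = sorted(x), sorted(y)
    d = 0.0
    for z in xs + ys:  # the maximum is attained at an observed point
        f_hat = bisect.bisect_right(xs, z) / n
        g_hat = bisect.bisect_right(ys, z) / m
        d = max(d, abs(f_hat - g_hat))
    return math.sqrt(n * m / (n + m)) * d

def ks_bootstrap_pvalue(x, y, B=300, seed=0):
    """Pool, resample N + M obs with replacement, split: this imposes F = G."""
    rng = random.Random(seed)
    n = len(x)
    pooled = list(x) + list(y)
    d_obs = ks_stat(x, y)
    exceed = 0
    for _ in range(B):
        q = rng.choices(pooled, k=len(pooled))
        exceed += ks_stat(q[:n], q[n:]) > d_obs  # first N from "F", final M from "G"
    return exceed / B

# Simulated samples (assumption): the second group is shifted up by one unit.
rng = random.Random(13)
participants = [rng.gauss(0.0, 1.0) for _ in range(150)]
nonparticipants = [rng.gauss(1.0, 1.0) for _ in range(150)]
p_shift = ks_bootstrap_pvalue(participants, nonparticipants)
print(f"bootstrap p-value: {p_shift:.3f}")
```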

Bootstrap: Other Issues

Non-iid data

All previous discussion assumes iid data since re-sampling occurs without regard to any dependence across observations

If there exists some sort of dependence in the data, then resample blocks or clusters of data

Example #1: Time series data with serial correlation
- Model

$$y_t = x_t \beta + \varepsilon_t, \quad t = 1, \ldots, T$$

- Resample blocks of length $l$ by drawing obs randomly from $t = 1, \ldots, T - l$
- If obs $t'$ is chosen for the bootstrap sample, also include obs $t = t' + 1, \ldots, t' + (l - 1)$
- Draw $T/l$ obs so final bootstrap sample size remains $T$

Example #2: Panel data
- For example, individuals within households, or employees within firms, or individuals over time
- Model

$$y_{if} = x_{if} \beta + \varepsilon_{if}, \quad i = 1, \ldots, N$$

where $i$ represents individuals and $f$ represents firms
- Several individuals are sampled from each of $F < N$ firms
- Generate bootstrap samples by resampling (with replacement) the $F$ firms
- If firm $f$ is chosen for the bootstrap sample, include all employees $i$ from that firm
- If identical number of employees from each firm are in the sample, then bootstrap samples are still of size $N$

Blocks/clusters are chosen such that data are iid across blocks
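Both examples amount to changing what gets drawn: block starting points in the time-series case, firm identifiers in the panel case. The sketch below uses hypothetical helper names and toy data. One deliberate choice: block starts are drawn from the 0-based range 0, ..., T - l so that the final observation can also be included.

```python
import random

def block_bootstrap_indices(T, l, rng):
    """Draw T // l random block starts; each start brings its l - 1 successors along."""
    idx = []
    for _ in range(T // l):
        start = rng.randrange(T - l + 1)  # 0-based start; the block ends by obs T - 1
        idx.extend(range(start, start + l))
    return idx

def cluster_bootstrap_sample(rows, rng):
    """rows are (firm_id, y, x); resample firms and keep every row of a drawn firm."""
    by_firm = {}
    for row in rows:
        by_firm.setdefault(row[0], []).append(row)
    firms = list(by_firm)
    drawn = rng.choices(firms, k=len(firms))  # F firms, with replacement
    return [row for f in drawn for row in by_firm[f]]

demo_rng = random.Random(0)

# Time series: T = 100 observations, blocks of length 5.
idx = block_bootstrap_indices(100, 5, demo_rng)

# Panel: 10 firms with 4 employees each (toy data, assumption).
rows = [(f, demo_rng.gauss(0, 1), demo_rng.gauss(0, 1)) for f in range(10) for _ in range(4)]
boot_rows = cluster_bootstrap_sample(rows, demo_rng)
print(len(idx), len(boot_rows))
```

With a balanced panel the cluster bootstrap sample keeps the original size N; with unbalanced firms its size varies across repetitions.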

Sub-sampling (Politis and Romano 1992, 1994)

M of N re-sampling with or without replacement

Evaluate a statistic of interest at subsamples of the data

Use these subsampled values to build up an estimated sampling distribution

The consistency properties of this sampling distribution hold for dependent data under very weak assumptions and even in situations where the bootstrap collapses

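Jackknife estimation

Leave-one-out estimation

Algorithm
- Estimate model using original sample $\Rightarrow$ $\hat{\beta}$ (if OLS model, say)
- Omit obs $i$ and re-estimate model on sample of $N - 1$ obs $\Rightarrow$ $\hat{\beta}_{(i)}$
- Repeat omitting each $i$ once (implies $N$ estimations)
- Standard error obtained as

$$se(\hat{\beta}) = \sqrt{\frac{N-1}{N} \sum_i \left(\hat{\beta}_{(i)} - \bar{\hat{\beta}}_{(\cdot)}\right)^2}$$

where $\bar{\hat{\beta}}_{(\cdot)}$ is the mean of the $N$ leave-one-out estimates

In some situations, delete-d jackknife achieves superior performance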
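The leave-one-out algorithm is short enough to write out directly. In this sketch the function name `jackknife_se` and the simulated data are assumptions. For the sample mean, the jackknife standard error coincides algebraically with the usual $s/\sqrt{N}$, which makes a convenient check.

```python
import math
import random
import statistics

def jackknife_se(sample, estimator):
    """Leave-one-out standard error: sqrt((N-1)/N * sum of squared deviations)."""
    n = len(sample)
    loo = [estimator(sample[:i] + sample[i + 1:]) for i in range(n)]  # N re-estimations
    loo_mean = statistics.fmean(loo)
    return math.sqrt((n - 1) / n * sum((b - loo_mean) ** 2 for b in loo))

# Simulated sample (assumption).
rng = random.Random(17)
sample = [rng.gauss(0.0, 1.0) for _ in range(200)]
jk_se = jackknife_se(sample, statistics.fmean)
usual_se = statistics.stdev(sample) / math.sqrt(len(sample))
print(f"jackknife se={jk_se:.6f}  s/sqrt(N)={usual_se:.6f}")
```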

Failure of the bootstrap or jackknife ...

Resampling methods are not guaranteed to work; theoretical justification is needed

Most common case of failure occurs when parameter of interest is a non-smooth function of the data (e.g., median vs. mean)

Example: $x \sim N(0, 1)$, $N = 1000$, $x_{med} \overset{a}{\sim} N(0, 0.00157)$

[Figure: four panels for Reps = 20, 100, 500, 1000; horizontal axis from roughly -0.1 to 0.1.]

How to choose B?

Andrews & Buchinsky

Davidson & MacKinnon

Causation: Introduction

General goal of most (applied) econometrics exercises is to distinguish between causation and correlation

Many empirical questions of concern to economists and/or policymakers pertain to the causal effect of a program or policy

Statistical and econometric literature analyzing causation has seen tremendous growth over the past several decades

Central problem concerns evaluation of the causal effect of exposure to a treatment or program by a set of units on some outcome
- In economics, these units are economic agents such as individuals, households, firms, geographical areas, etc.
- The effect of an exposure is only well-defined if the comparison is also defined; typically the comparison is defined as "not exposed," but sometimes it is not obvious (particularly with non-binary treatments)

Philosophy of causality...
- Rich literature in analytic philosophy on causality
- Two main approaches to defining causality:
  - Regularity approaches: Hume: "We may define a cause to be an object followed by another, and where all the objects, similar to the first, are followed by objects similar to the second." (from An Enquiry Concerning Human Understanding, section VII)
  - Counterfactual approaches: Hume: "Or, in other words, where, if the first object had not been, the second never had existed." (from An Enquiry Concerning Human Understanding, section VII)

Regularity approach: a minimal constant conjunction between the two objects (Suppes: a probabilistic association between the two objects, which cannot be explained away by other factors)
- Basic idea behind Granger causality
- Difficulty: what are the other factors? Limiting to only observable factors is unsatisfying... if some factors are unobservable, then what?
- Example...
  - $C$ is a potential cause of $E$ if $\Pr(E|C) > \Pr(E|\text{not } C)$
  - May be spurious if there exists some factor $B$ s.t. $\Pr(E|C) > \Pr(E|\text{not } C)$ and $\Pr(E|C,B) = \Pr(E|\text{not } C,B)$ (e.g., $E$ = wages, $C$ = educ, $B$ = ability)
  - May also be a spurious zero correlation if there exists some factor $B$ s.t. $\Pr(E|C) = \Pr(E|\text{not } C)$ and $\Pr(E|C,B) > \Pr(E|\text{not } C,B)$ (e.g., $E$ = wages, $C$ = training, $B$ = shock)
  - $B$ is known as a confounder or confounding variable
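The first spurious case can be reproduced with a small simulation. All probabilities below are made-up values for illustration: the binary factor B raises both the chance of C and the chance of E, while C has no effect on E once B is given.

```python
import random

rng = random.Random(2011)
n = 40000
draws = []
for _ in range(n):
    b = rng.random() < 0.5                    # confounder (e.g., high ability)
    c = rng.random() < (0.8 if b else 0.2)    # B makes C more likely
    e = rng.random() < (0.7 if b else 0.3)    # E depends on B only, not on C
    draws.append((b, c, e))

def pr_e(rows):
    """Empirical Pr(E) within a subset of the simulated rows."""
    return sum(e for _, _, e in rows) / len(rows)

# Unconditional comparison: Pr(E|C) vs Pr(E|not C).
gap = pr_e([r for r in draws if r[1]]) - pr_e([r for r in draws if not r[1]])

# Conditional comparisons: the same gap within each value of B.
gap_b1 = (pr_e([r for r in draws if r[0] and r[1]])
          - pr_e([r for r in draws if r[0] and not r[1]]))
gap_b0 = (pr_e([r for r in draws if not r[0] and r[1]])
          - pr_e([r for r in draws if not r[0] and not r[1]]))

print(f"Pr(E|C) - Pr(E|not C) = {gap:.3f}")
print(f"given B = 1: {gap_b1:.3f}; given B = 0: {gap_b0:.3f}")
```

With these values the population gap is 0.62 - 0.38 = 0.24, yet it vanishes within each level of B, so the association between C and E is explained away by the confounder.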

Be wary: correlation does not imply causation as things are not always as they seem ...

and the truth may be difficult to see ...

Counterfactual approach: Lewis (1973) proposes to imagine a range of "possible worlds"
- Holland (1986, 2003): a treatment (cause) is a potential manipulation that one can imagine
  - "NO CAUSATION WITHOUT MANIPULATION"
  - Gender, race are not treatments?!? (see Greiner and Rubin 2011)
- Imbens and Wooldridge (2009):
  - "A CRITICAL FEATURE IS THAT, IN PRINCIPLE, EACH UNIT CAN BE EXPOSED TO MULTIPLE LEVELS OF THE TREATMENT."
- Angrist and Pischke (2009): a treatment should be manipulatable conditional on other factors $\Rightarrow$ $\Pr(C|B), \Pr(\text{not } C|B) \in (0, 1)$
  - NO "FUNDAMENTALLY UNIDENTIFIED QUESTIONS"
  - Example: school start age = biological age - time in school; if $B$ = {bio age, time in school}, then school age is not an identifiable treatment

Microeconometrics today emphasizes the counterfactual view
- Greiner & Rubin (2011): "For analysts from a variety of fields, the intensely practical goal of causal inference is to discover what would happen if we changed the world in some way."

Econometric methods are categorized by the type of selection involved

Selection types
- Selection on observables: all potential $B$'s are observed
- Selection on unobservables: some potential $B$'s are unobserved

Causation: Potential Outcomes Model

Most causal research is couched in the potential outcomes framework

Typically referred to as the Rubin Causal Model (RCM); attributed to Neyman (1923, 1935), Fisher (1935), Roy (1951), Quandt (1972, 1988), Rubin (1974)

Notation

$$y_{1i} = \text{outcome of observation } i \text{ with treatment}$$
$$y_{0i} = \text{outcome of observation } i \text{ without treatment}$$
$$D_i = \text{treatment indicator} = \begin{cases} 1 & \text{treated} \\ 0 & \text{untreated} \end{cases}$$

$\{y_{1i}, y_{0i}, D_i\}$ is a draw from the population of interest

$\{y_1, y_0, D\}$ is a sample from the population of interest

Notes
- Key insight is to model not just the observed outcome for each unit $i$, but also the unobserved potential outcomes
- Implicit in this representation is the Stable Unit Treatment Value Assumption (SUTVA, Rubin 1978), which assumes that outcome of obs $i$ with and without the treatment does not vary depending on the treatment assignment of all other agents (rules out general equilibrium or indirect effects)
  - Allows one to write potential outcomes solely as a function of own treatment assignment

$$y_{0i} \equiv y_i(D_1, D_2, \ldots, D_{i-1}, 0, D_{i+1}, \ldots, D_N) = y_i(0)$$
$$y_{1i} \equiv y_i(D_1, D_2, \ldots, D_{i-1}, 1, D_{i+1}, \ldots, D_N) = y_i(1)$$

  - Imbens & Wooldridge (2009) provide some references to papers looking at GE effects; see also Ferracci et al. (2009), Heckman et al. (1999), Lewis (1963)
- Also implicit and sometimes lumped into SUTVA is the assumption that the mechanism for assigning treatments does not affect potential outcomes (rules out Hawthorne effects, whereby agents may act differently if they know they are being observed)

Parameters of interest

$\Delta_i = y_{1i} - y_{0i}$ = treatment effect for obs $i$
- This is a random variable as it is obs-specific
- Can summarize the distribution of this variable by focusing on different aspects

$$\Delta^{ATE} = E[\Delta_i] = E[y_1 - y_0]$$
$$\Delta^{ATT} = E[\Delta_i | D = 1] = E[y_1 - y_0 | D = 1]$$
$$\Delta^{ATU} = E[\Delta_i | D = 0] = E[y_1 - y_0 | D = 0]$$

Notes: Different parameters answer different questions, may be useful for different policy conclusions, and may require different assumptions to identify

Three other parameters that often appear
1. Local Average Treatment Effect (Imbens & Angrist 1994, Angrist et al. 1996)
   - Defined as $\Delta^{LATE} = E[y_1 - y_0 | i \in \Omega]$, where $\Omega$ refers to some specified subpopulation
2. Marginal Treatment Effect (Heckman & Vytlacil 1999, 2001, 2005, 2007)
   - Defined later
3. Policy Relevant Treatment Effect (Heckman & Vytlacil 2001)
   - Defined as $\Delta^{PRTE} = E[y^P - y^{NP}]$, where $P$ ($NP$) refers to the state where the program is fully (not) implemented
   - With the program, all agents have access to the program, but may choose not to participate
   - Implies

$$\Delta^{PRTE} = E[y_1^P | D^P = 1] \Pr(D^P = 1) + E[y_0^P | D^P = 0] \Pr(D^P = 0) - E[y^{NP}]$$
$$= E[y_1 - y_0 | D^P = 1] \Pr(D^P = 1)$$

where $y_0^P$, $y_1^P$, and $y^{NP}$ are the three potential outcomes, $D^P$ is the treatment indicator in the world with the program, and the second line follows if one assumes policy invariance (i.e., potential outcomes are unaffected by the existence of the program)

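Relationship among the parameters
- Let

$$y_{1i} = E[y_1] + \upsilon_{i1}$$
$$y_{0i} = E[y_0] + \upsilon_{i0}$$

- This implies

$$\Delta_i = y_{1i} - y_{0i} = E[y_1 - y_0] + \upsilon_{i1} - \upsilon_{i0} = \Delta^{ATE} + \upsilon_{i1} - \upsilon_{i0}$$
$$\Delta^{ATT} = \Delta^{ATE} + E[\upsilon_{i1} - \upsilon_{i0} | D = 1]$$
$$\Delta^{ATU} = \Delta^{ATE} + E[\upsilon_{i1} - \upsilon_{i0} | D = 0]$$

where $E[\upsilon_{i1} - \upsilon_{i0} | D = j]$ is the average, obs-specific gain from treatment for group $j$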

Can re-define any of the above parameters for sub-population defined on the basis of attributes, $x$

$$\Delta^{ATE}(x) = E[y_1 - y_0 | x]$$
$$\Delta^{ATT}(x) = E[y_1 | x, D = 1] - E[y_0 | x, D = 1]$$
$$\Delta^{ATU}(x) = E[y_1 | x, D = 0] - E[y_0 | x, D = 0]$$

The previous unconditional parameters are obtained by integrating over the dbn of $x$ in the relevant population

$$\Delta^{ATE} = \int \Delta^{ATE}(x) f(x)\, dx$$
$$\Delta^{ATT} = \int \Delta^{ATT}(x) f(x | D = 1)\, dx$$
$$\Delta^{ATU} = \int \Delta^{ATU}(x) f(x | D = 0)\, dx$$

While the preceding parameters, based on differences in expectations, are the near universal focus in economics, this need not be the case

Can also define treatment effects based on ratios

$$\Delta^{RATE} = E[y_1] / E[y_0]$$
$$\Delta^{RATT} = E[y_1 | D = 1] / E[y_0 | D = 1]$$
$$\Delta^{RATU} = E[y_1 | D = 0] / E[y_0 | D = 0]$$

These are referred to as relative treatment effects (and prior parameters are referred to as absolute or differenced treatment effects)

Note, however, that relative effects lack a bit of intuitive appeal since if we define $\Delta_i = y_{1i}/y_{0i}$, then $E[\Delta_i] = E[y_{1i}/y_{0i}] \ne E[y_1]/E[y_0]$ and same for RATT and RATU

Evaluation Problem

Only observe one potential outcome at a point in time for any observation

Implies...

Attributes of $i$: $\{y_{1i}, y_{0i}, D_i\}$
Observed for $i$: $\{y_i, D_i\}$

where $y_i = D_i y_{1i} + (1 - D_i) y_{0i}$ = observed outcome for observation $i$

Missing potential outcome is the missing counterfactual
- Holland (1986) refers to this as the fundamental problem of causal inference
- Because of this, the central issue in the RCM is the relationship between treatment assignment and potential outcomes
  - Typically referred to as the treatment assignment rule
  - Growing literature on assignment rules (Manski 2000, 2004; Pepper 2002, 2003; Dehejia 2005; Lechner & Smith 2007)

Example #1... ATT
- Consider estimating $\Delta^{ATT} = E[y_1 | D = 1] - E[y_0 | D = 1]$
- $E[y_1 | D = 1]$ can be estimated from the data, but one does not observe $E[y_0 | D = 1]$
- If one uses outcomes of the untreated, we can define $\tilde{\Delta}^{ATT} = E[y_1 | D = 1] - E[y_0 | D = 0]$
- Which implies selection bias equal to

$$\Delta^{ATT} = E[y_1 | D = 1] - E[y_0 | D = 0] + E[y_0 | D = 0] - E[y_0 | D = 1]$$
$$\Rightarrow \text{bias} = \tilde{\Delta}^{ATT} - \Delta^{ATT} = E[y_0 | D = 1] - E[y_0 | D = 0]$$

- This is generally non-zero, and may be decomposed into 3 components (Heckman et al. 1996, 1998):
  1. Self-selection into treatment in a manner related to outcome in the untreated state
  2. Observables, $x$, impacting outcome may not overlap at certain values across the treatment and control groups
  3. Even with overlap, the distribution of $x$ may vary across the treatment and control groups

Example #2... ATE
- Consider estimating $\Delta^{ATE} = E[y_1] - E[y_0]$
- Neither unconditional expectation can be estimated from the data
- If one uses conditional expectations, we can define

$$\tilde{\Delta}^{ATE} = E[y_1 | D = 1] - E[y_0 | D = 0]$$

- Which implies selection bias equal to

$$\tilde{\Delta}^{ATE} - \Delta^{ATE} = E[y_1 | D = 1] - E[y_0 | D = 0] - (E[y_1] - E[y_0])$$
$$\Rightarrow \text{bias} = (E[y_1 | D = 1] - E[y_1 | D = 0])[1 - \Pr(D = 1)] + (E[y_0 | D = 1] - E[y_0 | D = 0]) \Pr(D = 1)$$

which is a weighted average of the selection bias for the ATT and ATU

Question: How does one circumvent the missing counterfactual problem to estimate $\Delta^{ATE}$, $\Delta^{ATT}$, $\Delta^{ATU}$, or any other summary statistic of the distribution of $\Delta$?
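Because a simulation can record both potential outcomes for every unit, the two bias expressions above can be checked directly. The data generating process below is an assumption chosen for illustration: the gain from treatment has mean 1, and units with a high untreated outcome are more likely to select into treatment.

```python
import random
import statistics

rng = random.Random(7)
n = 20000
rows = []
for _ in range(n):
    y0 = rng.gauss(0.0, 1.0)
    y1 = y0 + 1.0 + rng.gauss(0.0, 1.0)            # heterogeneous gain, mean 1
    d = 1 if y0 + rng.gauss(0.0, 1.0) > 0 else 0   # selection related to y0
    rows.append((y1, y0, d))

def avg(values):
    return statistics.fmean(list(values))

treated = [r for r in rows if r[2] == 1]
control = [r for r in rows if r[2] == 0]
p = len(treated) / n                               # sample Pr(D = 1)

ate = avg(r[0] - r[1] for r in rows)
att = avg(r[0] - r[1] for r in treated)
atu = avg(r[0] - r[1] for r in control)

# What the data alone deliver: E[y1|D=1] - E[y0|D=0].
naive = avg(r[0] for r in treated) - avg(r[1] for r in control)

# Bias formulas from the two examples (they use the unobserved outcomes).
bias_att = avg(r[1] for r in treated) - avg(r[1] for r in control)
bias_ate = ((avg(r[0] for r in treated) - avg(r[0] for r in control)) * (1 - p)
            + (avg(r[1] for r in treated) - avg(r[1] for r in control)) * p)

print(f"ATE={ate:.3f} ATT={att:.3f} ATU={atu:.3f} naive={naive:.3f}")
print(f"naive - ATT = {naive - att:.3f}, formula = {bias_att:.3f}")
print(f"naive - ATE = {naive - ate:.3f}, formula = {bias_ate:.3f}")
```

In this design the treatment effect is unrelated to selection, so ATE, ATT, and ATU are all close to 1, yet the naive comparison overstates them because the treated would have had higher outcomes even without treatment.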

Early Example of Potential Outcomes: Roy Model (Roy 1951)

As noted previously, at the heart of the RCM is the interplay between assignment of treatments, potential outcomes, and observed outcomes

Problem is one of self-selection; highlighted in a very clever fashion in Roy (1951)

Specific issue in Roy (1951) was occupational choice
- Individuals have potential earnings associated with different occupation choices
- Realized earnings reflect the chosen occupation

Example

Suppose

$$\begin{pmatrix} y_0 \\ y_1 \end{pmatrix} \sim N\left(\begin{pmatrix} 0 \\ 1 \end{pmatrix}, \Sigma\right)$$

Unconditional outcome distributions look like

[Figure: Unconditional Distributions of Potential Outcomes. Kernel densities (kdensity) of y0 and y1 over their support; simulated data, 1000 obs, rho = 0.7.]

Conditional distributions

Depends on
- Who selects into treatment or control group, and
- Correlation of potential outcomes

Positive correlation in above example ($\rho = 0.7$)

Positive selection: Assume those above the mean in $y_1$ distribution select into treatment

[Figure: Conditional Distributions of Potential Outcomes. Kernel densities (kdensity) of yy0 and yy1; simulated data, 1000 obs, rho = 0.7; positive selection into treatment.]

Negative selection: Assume those below the mean in $y_1$ distribution select into treatment

[Figure: Kernel densities (kdensity) of yy0 and yy1; simulated data, 1000 obs, rho = 0.7; negative selection into treatment.]

Random assignment:

[Figure: Kernel densities (kdensity) of yy0 and yy1; simulated data, 1000 obs, rho = 0.7; random assignment into treatment.]

Lesson to be learned: observed distributions are not the unconditional distributions

Roy Model

Two occupations: hunter, fisher

Potential incomes

$$y_d = g_d(x) + \upsilon_d, \quad d = 0 \text{ (h)}, 1 \text{ (f)}$$

Decision rule

$$D = I(y_1 - y_0 > 0) = I(g_1(x) - g_0(x) + \upsilon_1 - \upsilon_0 > 0)$$

Observed income

$$y = D y_1 + (1 - D) y_0$$

Treatment assignment depends on observables, $x$, and unobservables, $\upsilon_1 - \upsilon_0$

Notes:
1. $Cov(D, \upsilon_1 - \upsilon_0) \ne 0$ referred to as essential heterogeneity (Heckman et al. 2006)
2. $Cov(D, \upsilon_1 - \upsilon_0) \ne 0 \Rightarrow Cov(D, D(\upsilon_1 - \upsilon_0)) \ne 0$
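A small simulation of the Roy decision rule makes the essential heterogeneity point concrete. The design is an assumption that mirrors the earlier example: constant $g_0 = 0$ and $g_1 = 1$, unit-variance unobservables with correlation 0.7, and income-maximizing occupational choice.

```python
import math
import random
import statistics

rng = random.Random(1951)
n, rho = 5000, 0.7
units = []
for _ in range(n):
    z0, z1 = rng.gauss(0, 1), rng.gauss(0, 1)
    v0 = z0
    v1 = rho * z0 + math.sqrt(1 - rho ** 2) * z1   # Corr(v0, v1) = rho
    y0 = 0.0 + v0                                   # income as hunter, g0(x) = 0
    y1 = 1.0 + v1                                   # income as fisher, g1(x) = 1
    d = 1 if y1 - y0 > 0 else 0                     # choose the higher-income occupation
    units.append((y1, y0, d, d * y1 + (1 - d) * y0))

gain = [u[0] - u[1] for u in units]
ate = statistics.fmean(gain)
att = statistics.fmean([u[0] - u[1] for u in units if u[2] == 1])
atu = statistics.fmean([u[0] - u[1] for u in units if u[2] == 0])

# Observed incomes of fishers vs. the unconditional mean of y1.
y1_observed = statistics.fmean([u[3] for u in units if u[2] == 1])
y1_all = statistics.fmean([u[0] for u in units])

print(f"ATE={ate:.3f}  ATT={att:.3f}  ATU={atu:.3f}")
print(f"mean y1 among fishers={y1_observed:.3f}  mean y1 overall={y1_all:.3f}")
```

Since agents select on their own gain, the ordering ATT > ATE > ATU holds by construction, and the ATU is non-positive: those who stay hunters are exactly those who would not gain from switching.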

Generalized Roy Model

Replace income maximization decision rule with a more general rule

Decision rule

$$D = I(h(x) - u > 0)$$

When $D$ is a voluntary program (e.g., job training), $u$ may reflect (i) costs of participation and (ii) foregone earnings (opportunity costs)

Implies that treatment assignment depends on observables, $x$, and unobservables, $u$
- Essential heterogeneity implies $Corr(u, \upsilon_d) \ne 0 \ \forall d$

Moving Forward

Guided by the potential outcomes framework, figure out conditions under which different estimators may provide consistent estimates of the ATE, ATT, ATU, etc.

Key points:
- Given the missing counterfactual problem, any estimator of the causal effects of a treatment must rely on some assumptions
- Different estimators rely on different assumptions and thus should not be expected to yield similar estimates unless the identifying assumptions of each hold in the data
- While extraneous assumptions may be testable ("overidentifying restrictions"), not all assumptions can be tested
- Different estimators estimate different aspects of the dbn of $\Delta$ and thus answer different questions

Causation: Random Experiments

First solution is to randomize treatment assignment

Generally speaking, randomization is the preferred solution; often called the "gold standard"

Reason: randomization ensures that treatment assignment is independent of potential outcomes in expectation

Freedman (2006): "Experiments offer more reliable evidence on causation than observational studies."

Imbens (2009): "More generally, and this is the key point, in a situation where one has control over the assignment mechanism, there is little to gain, and much to lose, by giving that up through allowing individuals to choose their own treatment regime. Randomization ensures exogeneity of key variables, where in a corresponding observational study one would have to worry about their endogeneity."

That said, not everyone is convinced by experiments (without doing some more mental work)

"Much of the criticism about experiments is about the difficulty of generalizing from the evaluation of one particular program to predicting what would happen to this program in a different context. Clearly, without theory to guide us on why a result extends from a context to another, it is difficult to jump directly to a policy conclusion. However, when experiments are motivated by a theory, the results of experiments (not only on the final outcomes, but on the entire chain of intermediate outcomes that led to the endpoint of interest) serve as a test of some of the implications of that theory. The combination of data points then eventually provides sufficient evidence to make policy recommendations."

Duflo (2010), http://www.aeaweb.org/econwhitepapers/white_papers/Esther_Duflo.pdf

"From an ex post evaluation standpoint, a carefully planned experiment using random assignment of program status represents the ideal scenario, delivering highly credible causal inference. But from an ex ante evaluation standpoint, the causal inferences from a randomized experiment may be a poor forecast of what were to happen if the program were to be scaled up."

DiNardo & Lee (2011)

Ex post evaluation answers the question: What happened? (descriptive)

Ex ante evaluation answers the question: What would happen? (predictive)

Randomization may occur at different stages
1. Population-level: randomize among agents in the population; typically not feasible since it would entail compelling treatment by some
2. Eligibility-level: randomize among the population of eligibles by randomly denying eligibility to a subset
3. Application-level: randomize among the population of program applicants by randomly accepting/rejecting a subset

Stage at which randomization occurs generally affects what can be learned unless additional assumptions are made

Assumptions (with population-level randomization)

(A.i) $\{y, D\}$ is iid sample from the population
(A.ii) $y_0, y_1 \perp D$
(A.iii) $\Pr(D = 1) \in (0, 1)$

Notes
- (A.i) implies SUTVA
- (A.ii) implies $E[y_1 | D = 1] = E[y_1 | D = 0] = E[y_1]$; similarly for $E[y_0]$
- (A.ii) also implies $\Delta^{ATE} = \Delta^{ATT} = \Delta^{ATU}$ since

$$\underbrace{E[y_1 - y_0]}_{ATE} = \underbrace{E[y_1 - y_0 | D = 1]}_{ATT} = \underbrace{E[y_1 - y_0 | D = 0]}_{ATU}$$

- (A.ii) relies on perfect compliance; imperfect compliance may invalidate the assumption if such non-compliance is related to potential outcomes
  - Difference in experimental means based on initial assignment still yields estimate of intent to treat under imperfect compliance; may actually be more policy relevant
- (A.iii) ensures all agents have some probability of receiving and not receiving the treatment
- Population-level randomization is feasible if compensation is offered to ensure compliance and this compensation does not affect $y_0$ and $y_1$

Estimation

b∆ATE = \E[yi jD = 1] \E[yi jD = 0]

=∑Ni=1 yi I[Di = 1]

∑Ni=1 I[Di = 1]

∑Ni=1 yi I[Di = 0]

∑Ni=1 I[Di = 0]

p! E[yi jD = 1] E[yi jD = 0]= E[Diy1i + (1Di )y0i jD = 1]

E[Diy1i + (1Di )y0i jD = 0]= E[y1i jD = 1] E[y0i jD = 0]= E[y1i ] E[y0i ]= ∆ATE

Properties

- Unbiased
- Consistent
- Asymptotically normal
- Nonparametrically identified: no parametric or functional form assumptions needed
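The difference-in-means estimator above can be illustrated with a small simulation. This is only a sketch under assumed values: the data generating process (standard normal $y_0$, a heterogeneous gain with mean 1, assignment probability 0.5) and the function names are hypothetical and not part of the notes.

```python
import random


def diff_in_means(y, d):
    """Difference in sample means between treated (d=1) and control (d=0) units."""
    treated = [yi for yi, di in zip(y, d) if di == 1]
    control = [yi for yi, di in zip(y, d) if di == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


def simulate(n=20000, seed=0):
    """Hypothetical DGP: D is assigned by a coin flip, independent of (y0, y1)."""
    rng = random.Random(seed)
    y, d, gains = [], [], []
    for _ in range(n):
        y0 = rng.gauss(0.0, 1.0)
        gain = 1.0 + rng.gauss(0.0, 0.5)      # y1 - y0, heterogeneous across agents
        di = 1 if rng.random() < 0.5 else 0   # random assignment, so (A.ii) holds
        y.append(y0 + di * gain)              # observed outcome y = y0 + D (y1 - y0)
        d.append(di)
        gains.append(gain)
    return y, d, sum(gains) / n               # last element: sample ATE


y, d, sample_ate = simulate()
estimate = diff_in_means(y, d)
```

Because assignment is independent of the potential outcomes, `estimate` should be close to `sample_ate` even though the gains differ across agents.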

Notes

- (A.ii) may be replaced by a mean independence assumption ...

\[
E[y_j \mid D = j] = E[y_j], \quad j = 0, 1
\]

- Randomization succeeds by balancing (in expectation) both observable and unobservable attributes of participants in the treatment and control groups
- Randomization can be assessed by testing for differences in the joint distribution of predetermined attributes across the treatment and control groups
- Randomization at the eligibility or application stage yields only an estimate of the ATT, which does not equal the ATE unless (i) treatment effects are homogeneous or (ii) agents do not become eligible or apply due to unobserved, observation-specific gains to the treatment, $\upsilon_1 - \upsilon_0$

Selection on Observables

Randomization is typically not feasible in economics

Applied economists typically must rely on observational (or non-experimental) data

Data structure is now given by...

attributes of i: $\{y_{1i}, y_{0i}, D_i, x_i\}$
observed for i: $\{y_i, D_i, x_i\}$

where $x_i$ is a vector of observable attributes of i

Selection on Observables
Strong Ignorability

Assumptions

(A.i) iid sample: $\{y, D, x\}$ is an iid sample from the population

(A.ii) Conditional independence (CIA) or unconfoundedness: $y_0, y_1 \perp D \mid x$

(A.iii) Common support (CS) or overlap: $\Pr(D = 1 \mid x) \in (0, 1)$

Note: CIA is sometimes referred to as the selection on observables (or observed variables) assumption because if D is a deterministic function of x, then CIA will hold. However, the CIA is broader than this case; D may also depend on unobservables as long as these unobservables are not correlated with potential outcomes.

Notes...

- (A.i) implies SUTVA
- (A.ii) implies

\[
\Pr(D_i = 1 \mid x_i, y_{1i}, y_{0i}) = \Pr(D_i = 1 \mid x_i)
\]

- (A.iii) ensures one observes agents with a particular x in both the treatment and control groups
- (A.ii), (A.iii) $\Rightarrow$ strong ignorability (Rosenbaum & Rubin 1983)
- x's must be pre-determined (i.e., unaffected by treatment assignment); if some x's are directly affected by D or by the anticipation of D, then conditioning on them will mask (at least) some of the effect of the treatment
- Implies estimation under strong ignorability requires that an instrument exist, although it need not be observed (or even known), so that conditional on x, D is random rather than deterministic
- There may not exist any vector x in a particular data set for a particular treatment such that strong ignorability holds
- There is some tension between (A.ii) and (A.iii); some x's may perfectly predict treatment assignment (invalidating CS), but their omission may invalidate CIA... hence, the need for the implicit IV

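Nonparametric identification

Estimation

\[
\begin{aligned}
\hat{\Delta}_{ATE}(x) &= \hat{E}[y_i \mid x_i = x, D = 1] - \hat{E}[y_i \mid x_i = x, D = 0] \\
&= \frac{\sum_{i=1}^{N} y_i \, I[x_i = x, D_i = 1]}{\sum_{i=1}^{N} I[x_i = x, D_i = 1]} - \frac{\sum_{i=1}^{N} y_i \, I[x_i = x, D_i = 0]}{\sum_{i=1}^{N} I[x_i = x, D_i = 0]} \\
&\xrightarrow{p} E[y_i \mid x_i = x, D = 1] - E[y_i \mid x_i = x, D = 0] \\
&= E[y_{1i} \mid x_i = x, D = 1] - E[y_{0i} \mid x_i = x, D = 0] \\
&= E[y_{1i} \mid x_i = x] - E[y_{0i} \mid x_i = x]
\end{aligned}
\]

and then

\[
\hat{\Delta}_{ATE} = E[\hat{\Delta}_{ATE}(x)] = \int \hat{\Delta}_{ATE}(x) f(x) \, dx = \frac{1}{N} \sum_i \hat{\Delta}_{ATE}(x_i)
\]

Similar story for other parameters, except the final step uses $f(x \mid D = 1)$ or $f(x \mid D = 0)$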
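For a discrete x, the two steps above (cell-specific differences in means, then averaging over the distribution of x) can be sketched as follows. The function name and the toy data in the usage note are hypothetical, and the code assumes every cell contains both treated and untreated observations.

```python
from collections import defaultdict


def cell_ate(y, d, x):
    """ATE for discrete x: within-cell differences in means, weighted by cell shares."""
    cells = defaultdict(lambda: {0: [], 1: []})
    for yi, di, xi in zip(y, d, x):
        cells[xi][di].append(yi)
    n = len(y)
    ate = 0.0
    for xi, groups in cells.items():
        if not groups[0] or not groups[1]:
            # Common support fails in this cell: Delta_ATE(x) is not identified.
            raise ValueError(f"no overlap in cell x={xi!r}")
        delta_x = (sum(groups[1]) / len(groups[1])
                   - sum(groups[0]) / len(groups[0]))
        # Weight by the empirical share of the cell, i.e. by f(x).
        ate += (len(groups[0]) + len(groups[1])) / n * delta_x
    return ate
```

With `y = [3, 1, 1, 10, 10, 6]`, `d = [1, 0, 0, 1, 1, 0]` and `x = ["a", "a", "a", "b", "b", "b"]`, the within-cell effects are 2 and 4 with equal cell shares, while the raw difference in means is 5 because treated units are concentrated in the high-outcome cell.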

Caveats

- If x takes on many values (even if still discrete), there may be a small sample size for any particular value, x, leading to high variance for $\hat{\Delta}_{ATE}(x)$
- If x is continuous, then this estimator cannot be used since the probability of observing more than one obs with the same value of x is zero
- Possible solution: functional form assumptions

Final Note

CIA is not testable except by conducting random experiments for comparison

One common "test" employed entails testing for differences in pre-treatment outcomes conditional on x between the to-be-treated and the controls

- Intuition: if D is uncorrelated with unobservables related to the outcome conditional on x, then pre-treatment outcomes should be unrelated to (future) D conditional on x
- Heckman et al. (1999) refer to this as the alignment fallacy
- In particular, a test based on outcomes more than one period in the past is misleading if shocks are serially correlated and agents self-select into the treatment group due to an adverse shock in the period directly before treatment
- In general, the test is useful if it rejects the independence of D and y conditional on x in periods prior to treatment; if it fails to reject, then the test is ambiguous

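Selection on Observables
Strong Ignorability: Regression

Previous results showed that

\[
\begin{aligned}
\Delta_{ATE}(x) &= E[y_{1i} \mid x_i = x] - E[y_{0i} \mid x_i = x] \\
&= E[y_i \mid x_i = x, D = 1] - E[y_i \mid x_i = x, D = 0]
\end{aligned}
\]

Implies the key is to estimate the regression function $E[y_i \mid x_i, D_i]$

Assumptions

(A.iv) Separability:

\[
\begin{aligned}
y_{0i} &= \mu_0(x_i) + \upsilon_{0i} \\
y_{1i} &= \mu_1(x_i) + \upsilon_{1i}
\end{aligned}
\]

where $E[\upsilon_1 \mid x] = E[\upsilon_0 \mid x] = E[\upsilon_1 - \upsilon_0 \mid x] = 0$

(A.v) Functional forms:

(A.va) Constant treatment effect

\[
\begin{aligned}
\mu_0(x_i) &= \alpha_0 + x_i \beta \\
\mu_1(x_i) &= \alpha_1 + x_i \beta
\end{aligned}
\]

(A.vb) Heterogeneous treatment effects

\[
\begin{aligned}
\mu_0(x_i) &= \alpha_0 + x_i \beta_0 \\
\mu_1(x_i) &= \alpha_1 + x_i \beta_1
\end{aligned}
\]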

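Implications...

Given (A.i), (A.ii), (A.iv), and (A.va) ...

\[
\begin{aligned}
E[y_i \mid x_i, D = 0] &= \alpha_0 + x_i \beta + E[\upsilon_{0i} \mid x_i, D = 0] \\
E[y_i \mid x_i, D = 1] &= \alpha_1 + x_i \beta + E[\upsilon_{1i} \mid x_i, D = 1]
\end{aligned}
\]

where the conditional error means are zero by (A.ii) and (A.iv), which implies

\[
\begin{aligned}
\Delta_{ATE}(x) &= E[y_i \mid x_i = x, D = 1] - E[y_i \mid x_i = x, D = 0] \\
&= \alpha_1 - \alpha_0 \\
&= \Delta_{ATE} = \Delta_{ATT} = \Delta_{ATU}
\end{aligned}
\]

Given (A.i), (A.ii), (A.iv), and (A.vb) ...

\[
\begin{aligned}
E[y_i \mid x_i, D = 0] &= \alpha_0 + x_i \beta_0 + E[\upsilon_{0i} \mid x_i, D = 0] \\
E[y_i \mid x_i, D = 1] &= \alpha_1 + x_i \beta_1 + E[\upsilon_{1i} \mid x_i, D = 1]
\end{aligned}
\]

implies

\[
\begin{aligned}
\Delta_{ATE}(x) &= E[y_i \mid x_i = x, D = 1] - E[y_i \mid x_i = x, D = 0] \\
&= (\alpha_1 - \alpha_0) + x(\beta_1 - \beta_0) \\
\Delta_{ATE} &= \int \Delta_{ATE}(x) f(x) \, dx = (\alpha_1 - \alpha_0) + E[x](\beta_1 - \beta_0) \\
\Delta_{ATT} &= \int \Delta_{ATE}(x) f(x \mid D = 1) \, dx = (\alpha_1 - \alpha_0) + E[x \mid D = 1](\beta_1 - \beta_0) \\
\Delta_{ATU} &= \int \Delta_{ATE}(x) f(x \mid D = 0) \, dx = (\alpha_1 - \alpha_0) + E[x \mid D = 0](\beta_1 - \beta_0)
\end{aligned}
\]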

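Estimation... Given (A.i), (A.ii), (A.iv), and (A.va)

Estimate via OLS

\[
\begin{aligned}
y_i &\equiv y_{0i} + D_i (y_{1i} - y_{0i}) \\
&= \alpha_0 + x_i \beta + \upsilon_{0i} + D_i (\alpha_1 + x_i \beta + \upsilon_{1i} - \alpha_0 - x_i \beta - \upsilon_{0i}) \\
&= \alpha_0 + x_i \beta + (\alpha_1 - \alpha_0) D_i + [\upsilon_{0i} + D_i (\upsilon_{1i} - \upsilon_{0i})] \\
&= \alpha_0 + x_i \beta + \Delta_{ATE} D_i + \tilde{\upsilon}_i
\end{aligned}
\]

The coefficient on D is an unbiased estimate of the causal parameter, and

\[
\Delta_{ATE} = \Delta_{ATT} = \Delta_{ATU}
\]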

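Estimation... Given (A.i), (A.ii), (A.iv), and (A.vb) ...

Estimate via OLS

\[
\begin{aligned}
y_i &= \alpha_0 + x_i \beta_0 + (\alpha_1 - \alpha_0) D_i + x_i D_i (\beta_1 - \beta_0) + [\upsilon_{0i} + D_i (\upsilon_{1i} - \upsilon_{0i})] \\
&= \alpha_0 + x_i \beta_0 + \tilde{\alpha}_1 D_i + x_i D_i \tilde{\beta}_1 + \tilde{\upsilon}_i
\end{aligned}
\]

Estimates given by

\[
\begin{aligned}
\hat{\Delta}_{ATE}(x) &= \hat{\tilde{\alpha}}_1 + x \hat{\tilde{\beta}}_1 \\
\hat{\Delta}_{ATE} &= \hat{\tilde{\alpha}}_1 + \bar{x} \hat{\tilde{\beta}}_1 \\
\hat{\Delta}_{ATT} &= \hat{\tilde{\alpha}}_1 + \bar{x}_1 \hat{\tilde{\beta}}_1 \\
\hat{\Delta}_{ATU} &= \hat{\tilde{\alpha}}_1 + \bar{x}_0 \hat{\tilde{\beta}}_1
\end{aligned}
\]

where $\bar{x}_j = \sum_i x_i I[D_i = j] / \sum_i I[D_i = j]$, $j = 0, 1$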

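Alternatively, estimate via OLS

\[
\begin{aligned}
y_i &= \alpha_0 + (x_i - \bar{x}) \beta_0 + (\alpha_1 - \alpha_0) D_i + (x_i - \bar{x}) D_i (\beta_1 - \beta_0) + [\upsilon_{0i} + D_i (\upsilon_{1i} - \upsilon_{0i})] \\
&= \alpha_0 + (x_i - \bar{x}) \beta_0 + \tilde{\alpha}_1 D_i + (x_i - \bar{x}) D_i \tilde{\beta}_1 + \tilde{\upsilon}_i
\end{aligned}
\]

where, after demeaning, the intercept and the coefficient on $D_i$ absorb $\bar{x} \beta_0$ and $\bar{x}(\beta_1 - \beta_0)$, respectively

Estimates given by

\[
\begin{aligned}
\hat{\Delta}_{ATE}(x) &= \hat{\tilde{\alpha}}_1 + (x - \bar{x}) \hat{\tilde{\beta}}_1 \\
\hat{\Delta}_{ATE} &= \hat{\tilde{\alpha}}_1 \\
\hat{\Delta}_{ATT} &= \hat{\tilde{\alpha}}_1 + \bar{x}_1 \hat{\tilde{\beta}}_1 \\
\hat{\Delta}_{ATU} &= \hat{\tilde{\alpha}}_1 + \bar{x}_0 \hat{\tilde{\beta}}_1
\end{aligned}
\]

where $\bar{x}_j = \sum_i (x_i - \bar{x}) I[D_i = j] / \sum_i I[D_i = j]$, $j = 0, 1$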
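A minimal sketch of the heterogeneous-effects regression with a single covariate is given below. It uses the fact that a fully interacted OLS regression reproduces separate OLS fits within the treated and untreated samples, so the coefficient gaps play the role of $\tilde{\alpha}_1$ and $\tilde{\beta}_1$ in the non-demeaned specification. Function names are hypothetical, and no standard errors are computed.

```python
def simple_ols(x, y):
    """Intercept and slope from a one-regressor OLS fit."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def regression_effects(y, d, x):
    """ATE, ATT and ATU implied by separate linear fits for D=1 and D=0."""
    x1 = [xi for xi, di in zip(x, d) if di == 1]
    y1 = [yi for yi, di in zip(y, d) if di == 1]
    x0 = [xi for xi, di in zip(x, d) if di == 0]
    y0 = [yi for yi, di in zip(y, d) if di == 0]
    a1, b1 = simple_ols(x1, y1)
    a0, b0 = simple_ols(x0, y0)
    gap_a, gap_b = a1 - a0, b1 - b0          # alpha-tilde_1 and beta-tilde_1
    xbar = sum(x) / len(x)
    xbar1, xbar0 = sum(x1) / len(x1), sum(x0) / len(x0)
    return {
        "ATE": gap_a + xbar * gap_b,         # evaluated at the full-sample mean of x
        "ATT": gap_a + xbar1 * gap_b,        # evaluated at the treated mean of x
        "ATU": gap_a + xbar0 * gap_b,        # evaluated at the untreated mean of x
    }
```

The three parameters differ only through the point at which the interaction term is evaluated, which is why they coincide when $\beta_1 = \beta_0$.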

Notes

- Inclusion of $\bar{x}$ on the RHS of the regression introduces the problem of a generated regressor; OLS std errors are incorrect, but the effect is generally minor
- Standard errors of estimators obtained via delta method or bootstrap
- Prior to implementing the regression approach, it is useful to examine the normalized differences in x across the treatment and control groups
  - Normalized difference for a particular x is given by

\[
\Delta_x = \frac{\bar{x}_1 - \bar{x}_0}{\sqrt{\sigma^2_{x1} + \sigma^2_{x0}}}
\]

  - If $|\Delta_x| > 0.25$, regression results are sensitive to the functional form assumptions in (A.va) and (A.vb); see Imbens & Wooldridge (2009)
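The normalized difference is simple to compute directly. The sketch below assumes sample variances with an $n - 1$ divisor, which the notes do not specify; the function name is hypothetical.

```python
import math


def normalized_difference(x, d):
    """(mean_1 - mean_0) / sqrt(var_1 + var_0) for one covariate x."""
    def mean_var(values):
        m = sum(values) / len(values)
        # Sample variance with n - 1 divisor (an assumption of this sketch).
        return m, sum((v - m) ** 2 for v in values) / (len(values) - 1)

    m1, v1 = mean_var([xi for xi, di in zip(x, d) if di == 1])
    m0, v0 = mean_var([xi for xi, di in zip(x, d) if di == 0])
    return (m1 - m0) / math.sqrt(v1 + v0)
```

Unlike a t-statistic, this measure does not grow mechanically with the sample size, so the 0.25 rule of thumb can be applied in samples of any size.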

Selection on Observables
Strong Ignorability: Matching

Preliminaries

- Matching methods were quite popular, and still are to a large extent
- (Incorrectly) viewed by many as a "magic bullet" for the estimation of treatment effects, as a way to "mimic" randomized experiments
- In practice, only as good as the underlying assumptions
- Matching when identifying assumptions are violated may yield a worse estimate than without matching

Assumptions required: (A.i), (A.ii), and (A.iii)

- Technicality #1: only need $y_0 \perp D \mid x$ to estimate ATT; $y_1 \perp D \mid x$ to estimate ATU
- Technicality #2: (really) only need $E[y_j \mid x, D = j] = E[y_j \mid x, D = j']$, $j, j' = 0, 1$ to estimate ATE; $E[y_0 \mid x, D = 1] = E[y_0 \mid x, D = 0]$ to estimate ATT; $E[y_1 \mid x, D = 0] = E[y_1 \mid x, D = 1]$ to estimate ATU

Comparison to regression approach

- No functional form assumptions: if CIA holds, but (A.va) or (A.vb) do not, then matching will be consistent and OLS will not
- Matching weights observations differently, giving more weight to those deemed most similar
- Matching requires, and thus highlights problems due to, CS

[Figure: scatter of treated and untreated units against x, with a fitted regression line for each group. $E[y \mid x, D = 0] = 1 + 1x$; $E[y \mid x, D = 1] = 1.5 + 2.5x$; sigma = 0.25. Caption: The fallacy (perhaps!) of extrapolation]

  - CS is violated, but OLS simply extrapolates from each group to estimate the missing counterfactual at a particular value of x
  - If the linear regression specification is not globally accurate, then regression may yield severe bias (see earlier discussion on normalized differences)

Estimation

Parameters

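\[
\begin{aligned}
\Delta_{ATE} &= E[y_1 - y_0] \\
\Delta_{ATT} &= E[y_1 - y_0 \mid D = 1] \\
\Delta_{ATU} &= E[y_1 - y_0 \mid D = 0]
\end{aligned}
\]

Infeasible estimators

\[
\begin{aligned}
\hat{\Delta}_{ATE} &= \frac{1}{N} \sum_i (y_{1i} - y_{0i}) \\
\hat{\Delta}_{ATT} &= \frac{1}{\sum_i I[D_i = 1]} \sum_i (y_{1i} - y_{0i}) \, I[D_i = 1] \\
\hat{\Delta}_{ATU} &= \frac{1}{\sum_i I[D_i = 0]} \sum_i (y_{1i} - y_{0i}) \, I[D_i = 0]
\end{aligned}
\]

Feasible estimators

\[
\begin{aligned}
\hat{\Delta}_{ATT} &= \frac{1}{\sum_i I[D_i = 1]} \sum_i (y_{1i} - \hat{y}_{0i}) \, I[D_i = 1] \\
\hat{\Delta}_{ATU} &= \frac{1}{\sum_i I[D_i = 0]} \sum_i (\hat{y}_{1i} - y_{0i}) \, I[D_i = 0] \\
\hat{\Delta}_{ATE} &= \frac{\sum_i I[D_i = 1]}{N} \hat{\Delta}_{ATT} + \frac{\sum_i I[D_i = 0]}{N} \hat{\Delta}_{ATU}
\end{aligned}
\]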

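where $\hat{y}_{0i}$, $\hat{y}_{1i}$ are estimates of the missing counterfactuals, obtained as

\[
\begin{aligned}
\hat{y}_{0i} &= \frac{1}{\sum_{l \in \{D_l = 0\}} \omega_{il}} \sum_{l \in \{D_l = 0\}} \omega_{il} \, y_{0l} \\
\hat{y}_{1i} &= \frac{1}{\sum_{l \in \{D_l = 1\}} \omega_{il}} \sum_{l \in \{D_l = 1\}} \omega_{il} \, y_{1l}
\end{aligned}
\]

where $\omega_{il}$ = weight given to observation l by observation i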

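Feasible estimation is accomplished by replacing the missing counterfactual with a weighted average of outcomes from the corresponding group

Formally, all matching estimators take the form

\[
\begin{aligned}
\hat{\Delta}_{ATT} &= \frac{1}{N_1} \sum_{i \in \{D_i = 1\}} \left( y_{1i} - \frac{1}{\sum_{l \in \{D_l = 0\}} \omega_{il}} \sum_{l \in \{D_l = 0\}} \omega_{il} \, y_{0l} \right) \\
\hat{\Delta}_{ATU} &= \frac{1}{N_0} \sum_{i \in \{D_i = 0\}} \left( \frac{1}{\sum_{l \in \{D_l = 1\}} \omega_{il}} \sum_{l \in \{D_l = 1\}} \omega_{il} \, y_{1l} - y_{0i} \right) \\
\hat{\Delta}_{ATE} &= \frac{N_1}{N} \hat{\Delta}_{ATT} + \frac{N_0}{N} \hat{\Delta}_{ATU}
\end{aligned}
\]

where $N_j = \sum_i I[D_i = j]$, $j = 0, 1$

Matching estimators differ in terms of how the weights are specified and what exactly is matched on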

Selection on Observables
Strong Ignorability: Matching (Weighting Schemes)

Exact matching or cell matching

Assuming x contains only discrete variables, assign positive weight only to observations with identical values of x

Let there be K distinct values (or combinations) of x's indexed by $k = 1, ..., K$ (i.e., K cells)

$N_{0k}$, $N_{1k}$ = the number of untreated, treated obs in cell k

Estimators given by

\[
\begin{aligned}
\hat{\Delta}_{ATT} &= \sum_k \frac{N_{1k}}{N_1} \left[ \sum_{i \in k \cap \{D_i = 1\}} \frac{y_{1i}}{N_{1k}} - \sum_{l \in k \cap \{D_l = 0\}} \frac{y_{0l}}{N_{0k}} \right] \\
\hat{\Delta}_{ATU} &= \sum_k \frac{N_{0k}}{N_0} \left[ \sum_{l \in k \cap \{D_l = 1\}} \frac{y_{1l}}{N_{1k}} - \sum_{i \in k \cap \{D_i = 0\}} \frac{y_{0i}}{N_{0k}} \right]
\end{aligned}
\]

which reflect different weighted averages of the average treatment effect within the K cells

Estimator is subject to the curse of dimensionality

With high dimensional x, or if x contains continuous variables, inexact matching algorithms are useful

Asymptotically, all inexact matching estimators are equivalent since the "inexactness" disappears as $N \to \infty$

In finite samples, different inexact matching algorithms may yield quite different estimates

A newly proposed middle ground between exact and inexact matching is known as coarsened exact matching (CEM)

- Intuition: "round" x to fewer distinct values, then match exactly on the coarsened data
- Developed by King et al.
- See -cem- in Stata

Inexact matching

Requires a measure of distance between any two observations, i and l

- Euclidean-type distance metrics are of the form

\[
d_{il} = (x_i - x_l)' W (x_i - x_l)
\]

where common choices for W are

1 $W = I$ (identity matrix)
2 $W = \Sigma^{-1}$, where $\Sigma$ is the sample variance-covariance matrix of x (Mahalanobis metric)
3 W is a diagonal matrix with the inverse of the variances of x along the diagonal, zeros on the off-diagonal (Abadie & Imbens 2002, 2006)
4 Zhao (2004) proposes other alternatives

- Propensity score methods compute the distance based on differences in the probability of being in the treatment group given x

\[
p(x) = \Pr(D = 1 \mid x) \in [0, 1]
\]

where the distance between two observations is

\[
d_{il} = |p(x_i) - p(x_l)|
\]

- $y_0, y_1 \perp D \mid x \Rightarrow y_0, y_1 \perp D \mid p(x)$, which follows from the fact that $D \perp x \mid p(x)$ (Rosenbaum & Rubin 1983)

Euclidean-type distance metrics and the propensity score are both a means to circumvent dimensionality as d is a scalar

No one method is superior; the goal is to balance the x's ... discussed later (Ho et al. 2007)

- In this sense, matching is not an estimator per se, but can be viewed as a way of pre-processing the data prior to applying some estimator
- Similar to a type of outlier analysis

Given $d_{il}$, several weighting schemes are frequently used

- Let $C(0)$ represent a neighborhood around 0 for each i
- Observations given positive weight by i are those included in the set $A_i$ where

\[
A_i = \{ l \mid D_l \neq D_i, \; d_{il} \in C(0) \}
\]

Focusing on propensity score estimators, we can re-write this as

\[
A_i = \{ l \mid D_l \neq D_i, \; p(x_l) \in C(p(x_i)) \}
\]

where $C(p(x_i))$ represents a neighborhood around $p(x_i)$

Single nearest neighbor matching

Sets

\[
C(p(x_i)) = \min_l |d_{il}| \quad \Rightarrow \quad \omega_{il} = \begin{cases} 1 & \text{if } l \in A_i \\ 0 & \text{otherwise} \end{cases}
\]

Intuition: l has the closest propensity score to i, but with a different treatment assignment

k-nearest neighbor matching

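Sets

\[
C(p(x_i)) = k\text{-}\min_l |d_{il}| \quad \Rightarrow \quad \omega_{il} = \begin{cases} 1/k & \text{if } l \in A_i \\ 0 & \text{otherwise} \end{cases}
\]

Intuition: compute the average of the k closest obs to i in terms of propensity score, but with a different treatment assignment than i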
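The nearest neighbor schemes above can be sketched for the ATT as follows. This is an illustration only: it takes the propensity scores `p` as given rather than estimating them, matches with replacement, breaks ties by position in the data, and uses a hypothetical function name.

```python
def knn_att(y, d, p, k=1):
    """ATT from k-nearest-neighbor matching on the propensity score, with replacement."""
    treated = [i for i, di in enumerate(d) if di == 1]
    controls = [i for i, di in enumerate(d) if di == 0]
    if k > len(controls):
        raise ValueError("k exceeds the number of untreated observations")
    total = 0.0
    for i in treated:
        # A_i: the k untreated obs with the smallest |p(x_l) - p(x_i)|.
        nearest = sorted(controls, key=lambda l: abs(p[l] - p[i]))[:k]
        y0_hat = sum(y[l] for l in nearest) / k   # each neighbor gets weight 1/k
        total += y[i] - y0_hat
    return total / len(treated)
```

Setting `k=1` gives single nearest neighbor matching; larger `k` averages over more controls, trading some match quality for lower variance.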

Caliper or radius matching (Cochran & Rubin 1973)

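Sets

\[
C(p(x_i)) = \{ p(x_l) \mid |d_{il}| < \varepsilon \}
\]

for a specified value of $\varepsilon$

\[
\Rightarrow \quad \omega_{il} = \begin{cases} 1/k_i & \text{if } l \in A_i \\ 0 & \text{otherwise} \end{cases}
\]

Intuition: compute the average over all $k_i$ obs that differ from i in terms of propensity score by less than $\varepsilon$, but with a different treatment assignment than i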

Kernel matching (Smith & Todd 2005)

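\[
C(p(x_i)) = \left| \frac{p(x_l) - p(x_i)}{a_N} \right|
\]

\[
\omega_{il} = \begin{cases} \dfrac{G\left( \frac{p(x_l) - p(x_i)}{a_N} \right)}{\sum_{l' \in \{D_{l'} = 0\}} G\left( \frac{p(x_{l'}) - p(x_i)}{a_N} \right)} & \text{if } l \in A_i \\ 0 & \text{otherwise} \end{cases}
\]

where $G(\cdot)$ is the kernel function and $a_N$ is the bandwidth

Intuition: compute a weighted average over all $k_i$ obs that receive positive weight given the choice of $G(\cdot)$ and $a_N$, but with a different treatment assignment than i

- $G(\cdot)$ must integrate to one, $a_N \to 0$ as $N \to \infty$, and $a_N N \to \infty$
- Ex: quartic kernel ($\varepsilon = 1$)

\[
G(s) = \begin{cases} \frac{15}{16} (1 - s^2)^2 & \text{if } |s| \leq 1 \\ 0 & \text{otherwise} \end{cases}
\]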
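Kernel matching with the quartic kernel can be sketched for the ATT as below. The propensity scores and bandwidth are taken as given, the function names are hypothetical, and treated observations with no untreated observation inside the bandwidth are dropped, which is one possible convention rather than the only one.

```python
def quartic(s):
    """Quartic (biweight) kernel: 15/16 (1 - s^2)^2 on |s| <= 1, zero elsewhere."""
    return 15.0 / 16.0 * (1.0 - s * s) ** 2 if abs(s) <= 1 else 0.0


def kernel_att(y, d, p, bandwidth):
    """ATT from kernel matching on the propensity score."""
    treated = [i for i, di in enumerate(d) if di == 1]
    controls = [i for i, di in enumerate(d) if di == 0]
    effects = []
    for i in treated:
        weights = [quartic((p[l] - p[i]) / bandwidth) for l in controls]
        total_w = sum(weights)
        if total_w == 0:
            continue  # no untreated obs within the bandwidth: i is unmatched
        # Counterfactual: kernel-weighted average of untreated outcomes.
        y0_hat = sum(w * y[l] for w, l in zip(weights, controls)) / total_w
        effects.append(y[i] - y0_hat)
    if not effects:
        raise ValueError("no treated observation could be matched")
    return sum(effects) / len(effects)
```

Dividing by `total_w` normalizes the weights to sum to one for each treated observation, matching the ratio form of $\omega_{il}$ above.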

Local linear matching (Smith & Todd 2005)

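\[
C(p(x_i)) = \left| \frac{p(x_l) - p(x_i)}{a_N} \right|
\]

\[
\omega_{il} = \begin{cases} \dfrac{G_{il} \sum_{l' \in \{D_{l'} = 0\}} G_{il'} (p_{l'} - p_i)^2 - [G_{il} (p_l - p_i)] \left[ \sum_{l' \in \{D_{l'} = 0\}} G_{il'} (p_{l'} - p_i) \right]}{\sum_{l \in \{D_l = 0\}} G_{il} \sum_{l' \in \{D_{l'} = 0\}} G_{il'} (p_{l'} - p_i)^2 - \left[ \sum_{l' \in \{D_{l'} = 0\}} G_{il'} (p_{l'} - p_i) \right]^2} & \text{if } l \in A_i \\ 0 & \text{otherwise} \end{cases}
\]

where $G_{il} = G\left( \frac{p_l - p_i}{a_N} \right)$

Intuition: similar to kernel matching, but differs in the handling of weights assigned to obs when obs are distributed asymmetrically around i or when there are gaps in the distribution of the propensity score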

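Stratification or interval matching

Differs from above schemes (although it can be written as a matching estimator)

Unit interval is divided into K intervals indexed by k, the average outcome of treated and untreated is computed within each interval, and $\hat{\Delta}_{ATE}(k)$ is computed within each interval

Finally

\[
\begin{aligned}
\hat{\Delta}_{ATT} &= \sum_k \frac{N_{1k}}{N_1} \hat{\Delta}_{ATE}(k) \\
\hat{\Delta}_{ATU} &= \sum_k \frac{N_{0k}}{N_0} \hat{\Delta}_{ATE}(k) \\
\hat{\Delta}_{ATE} &= \frac{\sum_i I[D_i = 1]}{N} \hat{\Delta}_{ATT} + \frac{\sum_i I[D_i = 0]}{N} \hat{\Delta}_{ATU}
\end{aligned}
\]

Stata: -psmatch2- or -nnmatch-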
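The stratification estimator can be sketched with equal-width intervals of the propensity score. Two choices here are assumptions of the sketch rather than part of the notes: strata lacking either treated or untreated observations are dropped, and $N_1$, $N_0$ are then counted over the retained strata only so that the weights still sum to one.

```python
def stratified_effects(y, d, p, n_strata=5):
    """ATT, ATU and ATE from interval matching on equal-width strata of p."""
    strata = {}
    for yi, di, pi in zip(y, d, p):
        k = min(int(pi * n_strata), n_strata - 1)   # p = 1 goes in the last stratum
        strata.setdefault(k, {0: [], 1: []})[di].append(yi)
    # Keep only strata where both groups are observed.
    usable = [g for g in strata.values() if g[0] and g[1]]
    if not usable:
        raise ValueError("no stratum contains both treated and untreated obs")
    n1 = sum(len(g[1]) for g in usable)
    n0 = sum(len(g[0]) for g in usable)
    att = atu = 0.0
    for g in usable:
        delta_k = sum(g[1]) / len(g[1]) - sum(g[0]) / len(g[0])
        att += len(g[1]) / n1 * delta_k   # weight N_1k / N_1
        atu += len(g[0]) / n0 * delta_k   # weight N_0k / N_0
    ate = (n1 * att + n0 * atu) / (n1 + n0)
    return {"ATT": att, "ATU": atu, "ATE": ate}
```

The ATT and ATU differ only in whether the stratum-level effects are weighted by the treated or the untreated counts.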

Selection on Observables
Strong Ignorability: Matching (Comparison of Matching Methods)

Asymptotically, all methods are consistent if the assumptions hold and the bandwidth satisfies the requisite criteria

In finite samples, choice may matter

Single nearest neighbor matching minimizes bias since it only uses the closest match; however, Frölich's (2004) MC analysis shows it fares poorly in practice

If sample size is large and the propensity score is evenly dispersed across the unit interval, k-nearest neighbor matching may be ideal

If sample size is large and the propensity score is asymmetrically distributed, kernel matching may be ideal (weights obs according to closeness)

If many obs have a propensity score close to the boundary (zero or one), LL matching may be ideal

Stratification methods face the problem of arbitrarily choosing K

Selection on Observables
Strong Ignorability: Matching (Regression Adjustment)

Various methods combine matching estimators with regression methods

Regression then matching (Smith & Todd 2005)

- Regress $y_i$ on (some) $x_i$ for treated and untreated samples, obtain residuals, and use residuals to compute matching estimators

Matching then regression (Ho et al. 2007)

- Match to obtain the missing counterfactual for each obs, then regress $y_i$ on $D_i$ and (some) $x_i$ using the matched sample
- Standard errors are an issue here, as the usual OLS SEs are incorrect (more below)

Selection on Observables
Strong Ignorability: Matching in Practice

Several practical issues are confronted when implementing matching estimators

1 Restriction to the common support
2 Does inexact matching balance the covariates, x?
3 Which variables belong in x?
4 Inference
5 Failure of CIA

Selection on Observables
Strong Ignorability: Matching (Common Support)

Defined as

\[
S_p = \{ p(x) : f(p \mid D = 1) > 0 \text{ and } f(p \mid D = 0) > 0 \}
\]

Matching estimates are only defined at values of $p(x) \in S_p$

In practice, may want to exclude obs outside $S_p$

To do so requires an estimate

\[
\hat{S}_p = \{ p(x) : \hat{f}(p \mid D = 1) > 0 \text{ and } \hat{f}(p \mid D = 0) > 0 \}
\]

Smith & Todd (2005) recommend using NP density estimators to estimate $f(\cdot)$

\[
\hat{f}(p \mid D = j) = \sum_{i \in \{D_i = j\}} G\left( \frac{p(x_i) - p}{a_N} \right), \quad j = 0, 1
\]

- See -kdensity- in Stata

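Imprecise alternative

\[
\hat{S}_p = \left\{ p(x) : p \in \left[ \max\left\{ \min_{i \in \{D_i = 0\}} p(x_i), \; \min_{i \in \{D_i = 1\}} p(x_i) \right\}, \; \min\left\{ \max_{i \in \{D_i = 0\}} p(x_i), \; \max_{i \in \{D_i = 1\}} p(x_i) \right\} \right] \right\}
\]

- Simpler alternative
- Excludes obs just outside the CS for whom close matches exist
- Does not address "holes" in the interior of the distribution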
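The min-max rule is easy to implement. The sketch below takes the propensity scores as given and uses hypothetical function names; it returns the bounds of the overlap region and a flag for each observation.

```python
def minmax_support(p, d):
    """Lower and upper bounds of the min-max common support region."""
    p1 = [pi for pi, di in zip(p, d) if di == 1]
    p0 = [pi for pi, di in zip(p, d) if di == 0]
    lower = max(min(p0), min(p1))   # larger of the two group minima
    upper = min(max(p0), max(p1))   # smaller of the two group maxima
    return lower, upper


def on_support(p, d):
    """True for observations whose propensity score lies inside the region."""
    lower, upper = minmax_support(p, d)
    return [lower <= pi <= upper for pi in p]
```

As the bullets above note, this rule only trims the tails: an interior interval with no treated or no untreated observations passes the check unchanged.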

Note: imposing the CS changes the interpretation of the parameters being estimated (e.g., $\hat{\Delta}_{ATE}$ becomes the ATE for treated individuals with a propensity score in a particular region)

Trimming: Smith & Todd (2005) recommend reducing the CS to

\[
\hat{S}_p = \{ p(x) : \hat{f}(p \mid D = 1) > q \text{ and } \hat{f}(p \mid D = 0) > q \}, \quad q \in (0, 1)
\]

Dealing with limited overlap; see Crump et al. (2009)

Selection on Observables
Strong Ignorability: Matching (Balancing)

Matching mimics a randomized experiment in that conditioning on $p(x)$ should balance x across the treated and untreated groups

Equivalently, the problem is reduced to a series of quasi-random experiments at each value of $p(x)$... hence, an IV exists which exogenously determines treatment assignment conditional on $p(x)$

Rosenbaum & Rubin (1983) prove that

\[
x \perp D \mid p(x)
\]

which implies

\[
E[x \mid p(x), D = 0] = E[x \mid p(x), D = 1]
\]

This holds regardless of whether CIA holds

Balancing tests seek to gauge this

Note: this highlights that $p(x)$ is simply a means to balance the x's; the goal of $p(x)$ is not to "model" treatment choice (more below)

Stratification tests (e.g., Dehejia & Wahba 1999, 2002)

- Estimate the propensity score
- Divide the data into K intervals based on $\widehat{p(x)}$
- Test for equal means (or other moments) of each x across the treated and control group within each stratum
  - See -ttest- in Stata
- Test x's individually or jointly using the Hotelling $T^2$ test
  - See -hotel- in Stata
- Add higher order or interaction terms of x's failing the test, and repeat
- Problem: how to choose K?
  - Too small $\to$ typically always reject equality
  - Too large $\to$ rarely reject equality

Standardized differences

- Average difference in each x, where weights from matching are used, normalized by the pooled SD of x in the full sample
- Example: $\Delta_{ATT}$

\[
SDIFF(x_m) = 100 \cdot \frac{\frac{1}{N_1} \sum_{i \in \{D_i = 1\}} \left[ x_{mi} - \sum_{l \in \{D_l = 0\}} \omega_{il} x_{ml} \right]}{\sqrt{\dfrac{Var_{i \in \{D_i = 1\}}(x_{mi}) + Var_{l \in \{D_l = 0\}}(x_{ml})}{2}}}
\]

- Problem: how large is too large? Rosenbaum & Rubin (1985) suggest 20 is large
- Perhaps criteria should be more strict for variables thought to be more important in a particular application
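The standardized difference for the ATT can be computed from any set of matching weights. In this sketch `weights[i]` is a hypothetical dictionary mapping each matched untreated index l to $\omega_{il}$, assumed to be normalized so the weights sum to one for each treated i; variances use an $n - 1$ divisor, which is also an assumption.

```python
import math


def sample_var(values):
    """Sample variance with n - 1 divisor."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def sdiff_att(x, d, weights):
    """Standardized difference (in percent) of covariate x after matching."""
    treated = [i for i, di in enumerate(d) if di == 1]
    controls = [i for i, di in enumerate(d) if di == 0]
    # Average gap between each treated x and its weighted matched-control x.
    gap = sum(
        x[i] - sum(w * x[l] for l, w in weights[i].items()) for i in treated
    ) / len(treated)
    # Pooled variance uses the full (unmatched) treated and untreated samples.
    pooled = (sample_var([x[i] for i in treated])
              + sample_var([x[l] for l in controls])) / 2.0
    return 100.0 * gap / math.sqrt(pooled)
```

Computing the same statistic before matching (with each treated observation compared to the unweighted untreated mean) gives a baseline against which the post-matching value can be judged.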

Hotelling $T^2$ test

- Test joint null of equal (weighted) means across treatment and control group
- Example: $\Delta_{ATT}$

\[
T^2 = (\bar{x}_1 - \bar{x}_0)' \Sigma^{-1} (\bar{x}_1 - \bar{x}_0)
\]

where $\bar{x}_1$ = vector of (unweighted) means from the treatment group and $\bar{x}_0$ = vector of weighted means from the untreated group, weighted by $\omega_{il}$

- Test may be conservative since estimation of the weights is not accounted for

Regression-based test

- Estimate propensity score
- Regress each x on a polynomial of $p(x)$, D, and D interacted with the same polynomial of $p(x)$...

\[
x_i = \phi_0 + \sum_{s=1}^{S} \phi_s \, p(x_i)^s + \pi_0 D_i + \sum_{s=1}^{S} \pi_s D_i \, p(x_i)^s + \eta_i
\]

and test $H_0 : \pi_0 = \pi_1 = \cdots = \pi_S = 0$

- Regression may be unweighted or weighted, assigning weight $\omega_l = \sum_{i \in \{D_i = 1\}} \omega_{il}$ to each untreated obs (when focus is on $\Delta_{ATT}$)

Selection on Observables
Strong Ignorability: Matching (Variable Selection)

CIA is a strong assumption that places great demands on the data

Two issues

- What variables to include in x?
- What functional form to use; should x include higher order, interaction terms of the variables?

CIA will certainly hold if x includes all variables that determine both outcomes and participation, but is this required?

Rubin and Thomas (1996) favor including variables in the propensity score model unless there is consensus that they do not belong

HIT (1997), HIST (1998), Heckman and Smith (1999), Lechner (2002), Smith & Todd (2005)

- Estimators are sensitive to variables included in x
- Bias likely to result if x is too crude

Brookhart et al. (2006)

- Variables related to outcomes should always be included
- Variables weakly related to the outcome, even if strongly related to treatment assignment, should be excluded as their inclusion results in higher mean squared error of the treatment effect estimate

Zhao (2007)

- Including irrelevant variables does not imply biased estimates
- Over-fitting the propensity score model may be counterproductive

Wooldridge (2009), Pearl (2009)

- Consider classes of variables whose inclusion leads to bias
- Primary example is of instrumental variables

Hirano et al. (2003)

- Using the true propensity score is inefficient even when it is known
- May imply that over-fitting the propensity score model may have little negative consequence in practice

Note: the goal of the PS model is not to find the best predictor of D

- Generally, variables that impact participation and not outcomes should be excluded; inclusion will exacerbate the CS problem
- Pseudo-$R^2$ criteria should not be used to judge the PS model

Millimet & Tchernis (2009)

- MC analysis of matching and weighting estimators (discussed later)
- Estimate propensity score using a series logit estimator

\[
\Pr(D = 1) = \frac{\exp\left( \theta_0 + \sum_{s=1}^{S} \theta_s x^s \right)}{1 + \exp\left( \theta_0 + \sum_{s=1}^{S} \theta_s x^s \right)}
\]

where for sufficiently large S and appropriate coefficients, $\theta$, any participation function may be approximated

- SLE $\Rightarrow$ $\hat{\theta}$ estimated via ML
- Assess impact of
  - Including irrelevant and excluding relevant higher order terms of variables that impact outcomes and participation
  - Including irrelevant and excluding relevant higher order terms of variables that impact outcomes only
  - Including irrelevant and excluding relevant higher order terms of variables that impact participation only
- Little impact to over-fitting
  - Asymptotic variance of nonparametric estimators is dominated by bias terms (Ichimura & Linton 2005)
  - Over-fitting minimizes the bias
  - Also, normalized weighting estimator is preferable (discussed later)

DiNardo & Lee (2011) criticize us and show instances where adding x may exacerbate bias

- Their examples are instances where the CIA does not hold, but one applies an estimator that requires the CIA (such as matching)
- Thus, the matching estimator is already biased
- In this case, adding an additional covariate may increase or decrease the bias even if x belongs in the model
- That said, this is not the case examined in our work; we assume CIA holds

Shaikh et al. (2009) propose a specification test of the propensity score model

- Informal test based on an eyeball comparison of the distribution of $p(x)$ in the treatment and control groups
- Formal test procedure also provided

Selection on Observables
Strong Ignorability: Matching (Standard Errors)

Non-smooth matching estimators

- Correct standard errors are not feasible in this case
- Usual t-test for diff in mean outcomes across matched treated and untreated group ignores estimation of the propensity score and the nature of matching
- Problem due to estimation of the propensity score disappears asymptotically
- Eichler & Lechner (2001) suggest that N must be in the 1000s before this bias disappears

Bootstrap methods are feasible for smooth matching estimators (e.g., kernel matching), but there is no formal evidence

Abadie & Imbens (2006) provide asymptotic standard errors for non-propensity score matching estimators; work in progress focuses on propensity score matching estimators

Must be careful when bootstrapping data with choice-based sampling

Selection on Observables
Strong Ignorability: Matching (Misc. Implementation Issues)

Replacement?

- Single, k-nearest neighbor matching may be done with or without replacement
- Without replacement implies results are sensitive to the sort order of the data
- With replacement reduces bias (by improving match quality), but is less efficient (by using less of the data)

Estimation of propensity score

- Typically probit or logit is used $\Rightarrow$ semiparametric estimator
- NP methods are available as well

Bandwidth Selection

- In NP work, bandwidth choice is typically much more important than choice of kernel function
- Methods generally fall into three categories

1 Ad hoc combined with sensitivity analysis
2 Rule-of-thumb approaches (Silverman 1986)

\[
a_N \approx 1.06 \, \sigma N^{-1/5}
\]

3 Data driven methods (e.g., cross-validation)

- Leave-one-out cross-validation (e.g., $\Delta_{ATT}$)
  - Perform a NP regression of y on $p(x)$ using all untreated obs except l and a candidate bandwidth, $a_b$
  - Predict $\hat{y}_l$
  - Repeat for all l, $l = 1, ..., N_0$
  - Calculate MSE

\[
MSE(a_b) = \frac{1}{N_0} \sum_{l \in \{D_l = 0\}} (y_l - \hat{y}_l)^2
\]

  - Repeat for all candidate bandwidths $a_b$, $b = 1, ..., B$
  - Choose $a_b$ to minimize $MSE(a_b)$
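The leave-one-out steps above can be sketched with a Nadaraya-Watson (kernel-weighted mean) regression on the untreated sample. The choice of Nadaraya-Watson and the quartic kernel is an assumption of this sketch, as are the function names; a candidate bandwidth that leaves some observation without neighbors is assigned an infinite MSE.

```python
def quartic(s):
    """Quartic kernel, zero outside |s| <= 1."""
    return 15.0 / 16.0 * (1.0 - s * s) ** 2 if abs(s) <= 1 else 0.0


def nw_predict(p_train, y_train, p_eval, a):
    """Kernel-weighted mean of y_train at p_eval; None if no obs has weight."""
    w = [quartic((pj - p_eval) / a) for pj in p_train]
    total = sum(w)
    if total == 0:
        return None
    return sum(wj * yj for wj, yj in zip(w, y_train)) / total


def loo_mse(p0, y0, a):
    """Leave-one-out MSE over untreated obs for candidate bandwidth a."""
    sq_errors = []
    for l in range(len(p0)):
        pred = nw_predict(p0[:l] + p0[l + 1:], y0[:l] + y0[l + 1:], p0[l], a)
        if pred is None:
            return float("inf")   # bandwidth too small to predict obs l
        sq_errors.append((y0[l] - pred) ** 2)
    return sum(sq_errors) / len(sq_errors)


def choose_bandwidth(p0, y0, candidates):
    """Candidate bandwidth with the smallest leave-one-out MSE."""
    return min(candidates, key=lambda a: loo_mse(p0, y0, a))
```

Here `p0` and `y0` are lists holding the propensity scores and outcomes of the untreated observations only, since for the ATT it is the untreated outcome regression that supplies the counterfactual.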

Selection on Observables
Strong Ignorability: Matching (Sensitivity to Unobservables)

CIA is not testable

Applied literature does/should assess the robustness of matching estimators

Several currently available techniques

- Rosenbaum bounds
- Simulation methods (Ichino et al. 2008)
- Minimum bias approach (Millimet & Tchernis 2011)
- Difference-in-differences matching
- Assuming SOO = SOU (Altonji et al. 2005; discussed later)
- Bayesian sensitivity analysis (de Luna & Lundin 2009)

Rosenbaum Bounds

Method of assessing sensitivity of matching estimator to an unobserved confounder (Rosenbaum 2002)

Assume

\[
p(x_i) = F(x_i \beta + \gamma u_i) = \frac{\exp(x_i \beta + \gamma u_i)}{1 + \exp(x_i \beta + \gamma u_i)}
\]

where u is an unobserved binary variable and F is the logistic CDF

Implications

- Odds for obs i are

\[
\frac{p(x_i)}{1 - p(x_i)} = \exp(x_i \beta + \gamma u_i)
\]

- Odds ratio for obs i relative to obs i'

\[
\frac{p(x_i) / [1 - p(x_i)]}{p(x_{i'}) / [1 - p(x_{i'})]} = \frac{\exp(x_i \beta + \gamma u_i)}{\exp(x_{i'} \beta + \gamma u_{i'})} = \exp\{\gamma (u_i - u_{i'})\} \quad \text{if } x_i = x_{i'}
\]

- Thus, two observationally identical obs have different probabilities of being treated if $\gamma \neq 0$ and $u_i \neq u_{i'}$

How does inference regarding the treatment effect parameters change as $\gamma$ and $u_i - u_{i'}$ change?

- Since u is binary, $u_i - u_{i'} \in \{-1, 0, 1\}$
- Implies

\[
\frac{1}{\exp\{\gamma\}} \leq \frac{p(x_i) / [1 - p(x_i)]}{p(x_{i'}) / [1 - p(x_{i'})]} \leq \exp\{\gamma\}
\]

  - $\exp\{\gamma\} = 1 \Rightarrow$ no selection bias
  - $\exp\{\gamma\} \to \infty \Rightarrow$ greater selection bias
- Rosenbaum bounds compute bounds on the significance level of the matching estimate as $\exp\{\gamma\}$ changes values
  - If the matching estimate is statistically insignificant even when $\exp\{\gamma\} \approx 1$, then the treatment effect is not robust
  - If the matching estimate is statistically significant even when $\exp\{\gamma\}$ is large, then the treatment effect is not sensitive to hidden bias

Stata: -rbounds-, -mhbounds-
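To make the idea of bounding a significance level concrete, the sketch below uses the simplest case, a sign test on matched pairs, rather than the signed-rank or Mantel-Haenszel statistics behind the Stata commands above; that substitution and the function names are assumptions of the example. With n matched pairs (no ties) and t pairs in which the treated unit has the higher outcome, the odds-ratio bound implies that, under the null of no effect, the probability the treated unit has the higher outcome lies between $1/(1 + e^{\gamma})$ and $e^{\gamma}/(1 + e^{\gamma})$.

```python
from math import comb


def binom_upper_tail(t, n, prob):
    """P(X >= t) for X ~ Binomial(n, prob)."""
    return sum(comb(n, k) * prob ** k * (1 - prob) ** (n - k)
               for k in range(t, n + 1))


def rosenbaum_sign_bounds(t, n, gamma_odds):
    """Lower and upper bounds on the one-sided sign-test p-value.

    gamma_odds is exp(gamma) >= 1, the maximal odds ratio of treatment
    between two observationally identical units.
    """
    p_low = 1.0 / (1.0 + gamma_odds)
    p_high = gamma_odds / (1.0 + gamma_odds)
    return binom_upper_tail(t, n, p_low), binom_upper_tail(t, n, p_high)
```

At `gamma_odds = 1` the two bounds coincide with the usual sign-test p-value; raising `gamma_odds` until the upper bound crosses the chosen significance level shows how much hidden bias the conclusion can tolerate.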

Ichino et al. (2008) Approach

Nannicini (2007) and Ichino et al. (2008) propose an alternative method of assessing the robustness of ATT estimates obtained under CIA

The sensitivity analysis is performed by comparing the baseline matching estimate to estimates obtained after additionally conditioning upon a simulated confounder

The distribution of the simulated variable can be constructed to capture different hypotheses regarding the nature of potential confounders

Setup

- The parameter of interest is $\Delta_{ATT} \equiv E[y_1 - y_0 \mid D = 1]$
- Accordingly, $y_0 \perp D \mid x$ denotes the required CIA
- Suppose that this condition is not met, but if an unobservable, U, is added then a stronger CIA holds

\[
y_0 \perp D \mid x, U
\]

- Implies

\[
\begin{aligned}
E[y_0 \mid D = 1, x] &\neq E[y_0 \mid D = 0, x] \\
E[y_0 \mid D = 1, x, U] &= E[y_0 \mid D = 0, x, U]
\end{aligned}
\]

Solution

- Simulate the potential confounder and use it as a matching covariate
  - For simplicity, the potential outcomes and the confounding variable are assumed to be binary
  - Conditional independence of U and x is also assumed
  - Hence, the distribution of U is fully characterized by the choice of the following four parameters

\[
p_{ij} \equiv \Pr(U = 1 \mid D = i, y = j) = \Pr(U = 1 \mid D = i, y = j, x)
\]

with $i, j \in \{0, 1\}$

  - Given the parameters $p_{ij}$, a value of U is simulated for each observation depending on D, y
- $\Delta_{ATT}$ is then estimated with U as an additional matching covariate

For a given set of the parameters $p_{ij}$, many simulations are performed, $\Delta_{ATT}$ is computed for each simulation, and the mean/sd of the estimates is reported

Choosing $p_{ij}$...

- It is essential to consider useful potential confounders
- Calibrated confounders: choose $p_{ij}$ to make the distribution of U similar to the empirical distribution of observable binary covariates
- Killer confounders: search over different $p_{ij}$ for the existence of a U which makes $\Delta_{ATT} = 0$
- One can also simulate other meaningful confounders by setting the parameters $p_{ij}$ and $p_{i\cdot}$, where $p_{i\cdot}$ can be computed as

\[
p_{i\cdot} \equiv \Pr(U = 1 \mid D = i) = \sum_{j=0}^{1} p_{ij} \Pr(y = j \mid D = i)
\]

with $i \in \{0, 1\}$

Common case

- Typical scenario in applied work has $\hat{\Delta}_{ATT} > 0$ in the baseline model
- Thus, concern centers on a potential confounder that has a positive effect both on the untreated outcome and on the selection into treatment
- Ichino et al. prove that

1 $p_{01} > p_{00} \Rightarrow$

\[
\Pr(y_0 = 1 \mid D = 0, U = 1, x) > \Pr(y_0 = 1 \mid D = 0, U = 0, x)
\]

where $p_{01} \equiv \Pr(U = 1 \mid D = 0, y = 1)$ and $p_{00} \equiv \Pr(U = 1 \mid D = 0, y = 0)$

2 $p_{1\cdot} > p_{0\cdot} \Rightarrow$

\[
\Pr(D = 1 \mid U = 1, x) > \Pr(D = 1 \mid U = 0, x)
\]

where $p_{1\cdot} \equiv \Pr(U = 1 \mid D = 1)$ and $p_{0\cdot} \equiv \Pr(U = 1 \mid D = 0)$

- Accordingly, by choosing $p_{01} > p_{00}$ and setting $p_{1\cdot} > p_{0\cdot}$, a confounder is simulated such that it has a positive effect on both $y_0$ and D even after conditioning on x

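What do these p's represent?

- The differences

\[
\begin{aligned}
d &= p_{01} - p_{00} \\
s &= p_{1\cdot} - p_{0\cdot}
\end{aligned}
\]

only depict the sign of U's outcome and selection effects

- The size of these effects must be evaluated after conditioning on x to account for the association between U and x that shows up in the data
- Thus, at every iteration, logit models for $\Pr(y = 1 \mid D = 0, U, x)$ and $\Pr(D = 1 \mid U, x)$ are estimated
  - The average odds ratio of U is reported as the outcome and selection effects of the simulated confounder

\[
\begin{aligned}
\Gamma &\equiv \frac{\Pr(y = 1 \mid D = 0, U = 1, x) \, / \, \Pr(y = 0 \mid D = 0, U = 1, x)}{\Pr(y = 1 \mid D = 0, U = 0, x) \, / \, \Pr(y = 0 \mid D = 0, U = 0, x)} \\
\Lambda &\equiv \frac{\Pr(D = 1 \mid U = 1, x) \, / \, \Pr(D = 0 \mid U = 1, x)}{\Pr(D = 1 \mid U = 0, x) \, / \, \Pr(D = 0 \mid U = 0, x)}
\end{aligned}
\]

  - $\Gamma$ and $\Lambda$ reflect the strength of U

Stata: -sensatt-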

Minimum Bias Approach

Intuition: Trim the sample on the basis of $p(x)$ to minimize the bias from a failure of CIA

Assume (A.iv) plus unobservables are trivariate normal: $\upsilon_0, \upsilon_1, u \sim N_3(0, \Sigma)$, where

\[
\Sigma = \begin{bmatrix} \sigma_0^2 & \rho_{01} \sigma_0 \sigma_1 & \rho_{0u} \sigma_0 \\ & \sigma_1^2 & \rho_{1u} \sigma_1 \\ & & 1 \end{bmatrix}
\]

and u is the error from the treatment assignment equation

\[
D_i^* = h(x_i) - u_i
\]

where $D^*$ is latent treatment assignment

The bias of the ATT at some value of the propensity score, $p(x)$, is given by

\[
\begin{aligned}
B_{ATT}[p(x)] &= \hat{\tau}_{ATT}[p(x)] - \tau_{ATT}[p(x)] \\
&= -\rho_{0u} \sigma_0 \frac{\phi(\Phi^{-1}(p(x)))}{p(x)[1 - p(x)]}
\end{aligned}
\]

where

- $\rho_{0u}$ = selection on unobservables affecting outcome in untreated state
- $\phi$ and $\Phi$ are standard normal PDF and CDF
- $\hat{\tau}_{ATT}$ is some propensity score based estimator

$B_{ATT}[p(x)]$ is minimized (in absolute value) at $p(x) = 0.5$

For the ATE,

\[
B_{ATE}[p(x)] = -\{\rho_{0u} \sigma_0 + [1 - p(x)] \rho_{\delta u} \sigma_\delta\} \frac{\phi(\Phi^{-1}(p(x)))}{p(x)[1 - p(x)]}
\]

- $\delta = \upsilon_1 - \upsilon_0$ = unobserved, individual-specific gain from treatment
- $\rho_{\delta u}$ = selection on unobserved, individual-specific gains

$\Rightarrow$ The bias-minimizing propensity score, $p^*(x)$, depends on the error correlation structure

Similar results in Black & Smith (2004), Heckman and Navarro-Lozano (2004)
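The shape of the two bias expressions can be explored numerically. The sketch below assumes the sign convention implied by a latent index $D^* = h(x) - u$ with $u$ standard normal, so that both biases carry a leading minus sign; the parameter values and function names are purely illustrative.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def bias_scale(p):
    """phi(Phi^{-1}(p)) / (p (1 - p)), the common factor in both bias terms."""
    z = _STD_NORMAL.inv_cdf(p)
    return _STD_NORMAL.pdf(z) / (p * (1.0 - p))


def att_bias(p, rho_0u_sigma_0):
    """Bias of a CIA-based ATT estimator at propensity score p."""
    return -rho_0u_sigma_0 * bias_scale(p)


def ate_bias(p, rho_0u_sigma_0, rho_du_sigma_d):
    """Bias of a CIA-based ATE estimator at propensity score p."""
    return -(rho_0u_sigma_0 + (1.0 - p) * rho_du_sigma_d) * bias_scale(p)


grid = [i / 100 for i in range(1, 100)]
# Illustrative values: rho_0u * sigma_0 = 0.3, rho_du * sigma_d = -0.4.
p_star_att = min(grid, key=lambda p: abs(att_bias(p, 0.3)))
p_star_ate = min(grid, key=lambda p: abs(ate_bias(p, 0.3, -0.4)))
```

For the ATT the minimizer sits at 0.5 whatever the value of $\rho_{0u} \sigma_0$, whereas for the ATE it moves with the error correlations; with the illustrative values above the term in braces vanishes at $p(x) = 0.25$.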

Minimum-biased (MB) estimation techniqueI Stage 1: Estimate the propensity score (e.g., probit model)I Stage 2: Retain only those observations with a propensity score,[p(xi ), within a xed neighborhood around p(x), the bias-minimizingpropensity score

I Stage 3: Estimate the ATE or ATT using any propensity-score basedestimator that relies on CI using this sub-sample

Notes:I Estimator is biased, but it minimizes the biasI For ATT... this is straightforward as we know that p(x) = 0.5I For ATE... p(x) is unknown, depends on error correlationsI If treatment e¤ect is heterogeneous, then interpretation changes; maynot be economically interesting

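For ATE, add Stage 1.5: Estimate the error correlations
- Feasible if one also imposes (A.va) or (A.vb)
- Estimate via OLS (discussed in more detail later)

$$y_i = \alpha_0 + (\alpha_1-\alpha_0)D_i + x_i\beta_0 + x_iD_i(\beta_1-\beta_0) + \beta_{\lambda 0}(1-D_i)\left[\frac{\phi(x_i\gamma)}{1-\Phi(x_i\gamma)}\right] + \beta_{\lambda 1}D_i\left[-\frac{\phi(x_i\gamma)}{\Phi(x_i\gamma)}\right]$$

where $\phi(\cdot)/\Phi(\cdot)$ is the inverse Mills' ratio and

$$\beta_{\lambda 0} = \rho_{0u}\sigma_0, \qquad \beta_{\lambda 1} = \rho_{0u}\sigma_0 + \rho_{\delta u}\sigma_\delta$$

- Replacing $\gamma$ with $\hat\gamma$ from the first-stage probit yields consistent estimates of $\rho_{0u}\sigma_0$ and $\rho_{\delta u}\sigma_\delta$

Millimet & Tchernis (2009) find that trimming is inefficient when CIA holds, but is more robust to (some) mis-specifications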
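The bias formulas above can be explored numerically. The following is a minimal sketch, not the authors' implementation: it evaluates the common factor $\phi(\Phi^{-1}(p))/\{p(1-p)\}$ and searches a grid for the propensity score with the smallest absolute bias. The function names, the grid search, and the numerical values of $\rho_{0u}\sigma_0$ and $\rho_{\delta u}\sigma_\delta$ are all illustrative assumptions; the sign convention assumes $D^* = h(x) - u$.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution


def bias_factor(p):
    """phi(Phi^{-1}(p)) / (p * (1 - p)), the factor shared by both bias formulas."""
    z = _STD_NORMAL.inv_cdf(p)
    return _STD_NORMAL.pdf(z) / (p * (1.0 - p))


def att_bias(p, rho0u_sigma0):
    """Bias of the ATT at propensity score p (hypothetical parameter value)."""
    return -rho0u_sigma0 * bias_factor(p)


def ate_bias(p, rho0u_sigma0, rhodu_sigmad):
    """Bias of the ATE at propensity score p (hypothetical parameter values)."""
    return -(rho0u_sigma0 + (1.0 - p) * rhodu_sigmad) * bias_factor(p)


def bias_minimizing_p(bias_fn, grid_size=999):
    """Grid search over (0, 1) for the p with the smallest absolute bias."""
    grid = [(k + 1) / (grid_size + 1) for k in range(grid_size)]
    return min(grid, key=lambda p: abs(bias_fn(p)))


# ATT: the minimizer does not depend on the (assumed) size of rho_0u * sigma_0.
p_star_att = bias_minimizing_p(lambda p: att_bias(p, 0.3))

# ATE: the minimizer moves with the (assumed) error correlation structure.
p_star_ate = bias_minimizing_p(lambda p: ate_bias(p, 0.1, -0.4))
```

In this sketch the ATT minimizer sits at 0.5 whatever value is assumed for $\rho_{0u}\sigma_0$, while the ATE minimizer shifts once $\rho_{\delta u}\sigma_\delta \neq 0$, which is why Stage 1.5 is needed for the ATE.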

Difference-in-Differences Matching

All matching estimators are biased if unobservables invalidate the CIA

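Formally (e.g., $\Delta_{ATT}$)

$$\Delta_{ATT}(p(x)) = \{E[y_1 \mid p(x), D=1] - E[y_0 \mid p(x), D=0]\} + \{E[y_0 \mid p(x), D=0] - E[y_0 \mid p(x), D=1]\}$$

where matching estimators are based on

$$\tilde\Delta_{ATT}(p(x)) = E[y_1 \mid p(x), D=1] - E[y_0 \mid p(x), D=0]$$

which implies

$$\text{bias} = \tilde\Delta_{ATT}(p(x)) - \Delta_{ATT}(p(x)) = \underbrace{E[y_0 \mid p(x), D=1]}_{\text{Counterfactual}} - \underbrace{E[y_0 \mid p(x), D=0]}_{\text{Observed}}$$

which is zero under CIA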

Rearranging terms yields

$$\Delta_{ATT}(p(x)) = \tilde\Delta_{ATT}(p(x)) - \text{bias}$$

This suggests a bias-corrected estimator is feasible if the bias can be consistently estimated

Might assume the bias equals the difference in mean outcomes prior to treatment

$$\text{bias} = E[y_{0t} \mid p(x), D=1] - E[y_{0t} \mid p(x), D=0] \stackrel{?}{=} E[y_{0t'} \mid p(x), D=1] - E[y_{0t'} \mid p(x), D=0]$$

where $t' < t$, $t'$ precedes the treatment, $t$ is post-treatment

Implies

$$\begin{aligned}
\tilde{\tilde\Delta}_{ATT}(p(x)) &= \tilde\Delta_{ATT}(p(x)) - \text{bias} \\
&= E[y_{1t} \mid p(x), D=1] - E[y_{0t} \mid p(x), D=0] - \{E[y_{0t'} \mid p(x), D=1] - E[y_{0t'} \mid p(x), D=0]\} \\
&= E[y_{1t} - y_{0t'} \mid p(x), D=1] - E[y_{0t} - y_{0t'} \mid p(x), D=0]
\end{aligned}$$

and $\tilde{\tilde\Delta}_{ATT}(p(x)) = \Delta_{ATT}(p(x))$ requires

$$E[y_{0t} - y_{0t'} \mid p(x), D=1] = E[y_{0t} - y_{0t'} \mid p(x), D=0]$$

which is different than the original CIA

Implementation: difference the data $\forall i$, then match

DID matching requires the original CIA be replaced with

$$\Delta y_0, \Delta y_1 \perp D \mid p(x)$$

Intuition:
- DID matching requires the change in potential outcomes to be independent of treatment assignment given the PS
- Equivalently, there are no time varying unobservables correlated with both outcomes and treatment assignment given $x$

Smith & Todd (2005) find DID matching to be more robust, but conclusions are application-specific
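The "difference, then match" recipe can be sketched as follows. This is only an illustration under stated assumptions: single nearest-neighbor matching with replacement on an already estimated propensity score, with each unit stored as a hypothetical tuple `(pscore, y_pre, y_post)`. Any other matching estimator could be substituted in the second step.

```python
def did_matching_att(treated, controls):
    """ATT from nearest-neighbor DID matching on the propensity score.

    Each unit is a tuple (pscore, y_pre, y_post); the layout is an
    illustrative assumption, not part of the original notes.
    """
    gaps = []
    for p_i, pre_i, post_i in treated:
        # Step 2: nearest control in terms of the estimated propensity score.
        p_j, pre_j, post_j = min(controls, key=lambda c: abs(c[0] - p_i))
        # Step 1 (differencing) is applied to both units before comparing them.
        gaps.append((post_i - pre_i) - (post_j - pre_j))
    return sum(gaps) / len(gaps)


# Made-up data: (pscore, outcome before treatment, outcome after treatment).
treated_units = [(0.60, 10, 15), (0.40, 8, 12)]
control_units = [(0.58, 5, 7), (0.41, 3, 4), (0.10, 0, 9)]
att_hat = did_matching_att(treated_units, control_units)
```

Because outcomes are differenced before matching, time-invariant level differences between matched units (here, treated units start from higher pre-treatment outcomes) drop out of the estimate.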

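Selection on Observables
Strong Ignorability: Inverse Propensity Score Weighting (IPW) Estimators

Alternative to matching estimators, but still rely on knowing/estimating the propensity score

Identities

$$\begin{aligned}
E\left[\frac{Dy}{p(x)}\right] &= E\left[\frac{Dy_1}{p(x)}\right] = E\left[E\left[\frac{Dy_1}{p(x)} \,\Big|\, x\right]\right] = E\left[\frac{1}{p(x)} E[Dy_1 \mid x]\right] \\
&\stackrel{CIA}{=} E\left[\frac{1}{p(x)} E[D \mid x]\, E[y_1 \mid x]\right] = E\left[\frac{p(x)}{p(x)} E[y_1 \mid x]\right] \\
&= E[E[y_1 \mid x]] = E[y_1]
\end{aligned}$$

and, similarly,

$$E\left[\frac{(1-D)y}{1-p(x)}\right] = E[y_0]$$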

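Parameters of interest (Horvitz & Thompson 1952)

$$\Delta_{ATE} = E\left[\frac{Dy}{p(x)} - \frac{(1-D)y}{1-p(x)}\right] = E\left[\frac{[D-p(x)]\,y}{p(x)[1-p(x)]}\right]$$

$$\Delta_{ATT} = \frac{1}{E[p(x)]} E\left[p(x)\left\{\frac{Dy}{p(x)} - \frac{(1-D)y}{1-p(x)}\right\}\right] = \frac{1}{E[p(x)]} E\left[\frac{[D-p(x)]\,y}{1-p(x)}\right]$$

$$\Delta_{ATU} = \frac{1}{E[1-p(x)]} E\left[[1-p(x)]\left\{\frac{Dy}{p(x)} - \frac{(1-D)y}{1-p(x)}\right\}\right] = \frac{1}{E[1-p(x)]} E\left[\frac{[D-p(x)]\,y}{p(x)}\right]$$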

Proof: Wooldridge (2002, p. 613)

Estimation

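Unnormalized estimators

$$\hat\Delta_{ATE} = \frac{1}{N}\sum_i \left[\frac{D_i y_i}{\hat p(x_i)} - \frac{(1-D_i) y_i}{1-\hat p(x_i)}\right] = \frac{1}{N}\sum_i \frac{[D_i - \hat p(x_i)]\,y_i}{\hat p(x_i)[1-\hat p(x_i)]}$$

$$\hat\Delta_{ATT} = \frac{1}{\frac{1}{N}\sum_i \hat p(x_i)} \cdot \frac{1}{N}\sum_i \hat p(x_i)\left[\frac{D_i y_i}{\hat p(x_i)} - \frac{(1-D_i) y_i}{1-\hat p(x_i)}\right] = \frac{1}{\frac{1}{N}\sum_i \hat p(x_i)} \cdot \frac{1}{N}\sum_i \frac{[D_i - \hat p(x_i)]\,y_i}{1-\hat p(x_i)}$$

$$\hat\Delta_{ATU} = \frac{1}{\frac{1}{N}\sum_i [1-\hat p(x_i)]} \cdot \frac{1}{N}\sum_i [1-\hat p(x_i)]\left[\frac{D_i y_i}{\hat p(x_i)} - \frac{(1-D_i) y_i}{1-\hat p(x_i)}\right] = \frac{1}{\frac{1}{N}\sum_i [1-\hat p(x_i)]} \cdot \frac{1}{N}\sum_i \frac{[D_i - \hat p(x_i)]\,y_i}{\hat p(x_i)}$$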

Normalized estimators (Hirano and Imbens 2001)

- $\hat\Delta_{ATE}$ is the difference in two weighted averages, where weights are $\frac{D_i}{N\hat p(x_i)}$ and $\frac{1-D_i}{N[1-\hat p(x_i)]}$
- Problem: weights may not sum to unity
- HI assign weights normalized by the sum of propensity scores for treated and untreated groups
- Unnormalized estimator assigns equal weights of $1/N$ to each observation
- Normalized estimator (e.g., $\hat\Delta_{ATE}$)

$$\hat\Delta_{ATE} = \left[\sum_i \frac{D_i y_i}{\hat p(x_i)} \Big/ \sum_i \frac{D_i}{\hat p(x_i)}\right] - \left[\sum_i \frac{(1-D_i) y_i}{1-\hat p(x_i)} \Big/ \sum_i \frac{1-D_i}{1-\hat p(x_i)}\right]$$

- Tends to be more stable in practice as it restricts the weights within each group to sum to 1; Millimet & Tchernis (2009), Busso et al. (2011) find it performs better

Standard errors obtained via bootstrap
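The difference between the unnormalized and normalized ATE estimators is easy to see in code. The sketch below takes the estimated propensity scores as given (estimating them, e.g. by probit, is outside its scope); the function names and the toy data are illustrative assumptions.

```python
def ipw_ate(d, y, p):
    """Unnormalized IPW estimate of the ATE."""
    n = len(y)
    return sum(di * yi / pi - (1 - di) * yi / (1 - pi)
               for di, yi, pi in zip(d, y, p)) / n


def ipw_ate_normalized(d, y, p):
    """Normalized IPW estimate of the ATE: weights sum to one within each group."""
    w1 = [di / pi for di, pi in zip(d, p)]
    w0 = [(1 - di) / (1 - pi) for di, pi in zip(d, p)]
    mean1 = sum(w * yi for w, yi in zip(w1, y)) / sum(w1)
    mean0 = sum(w * yi for w, yi in zip(w0, y)) / sum(w0)
    return mean1 - mean0


# Balanced toy sample: the two estimators coincide.
balanced = ([1, 1, 0, 0], [3, 5, 1, 2], [0.5, 0.5, 0.5, 0.5])

# Unbalanced toy sample with the same constant propensity score: they diverge,
# because the unnormalized weights no longer sum to one within each group.
unbalanced = ([1, 0, 0, 0], [4, 1, 1, 1], [0.5, 0.5, 0.5, 0.5])
```

With the unbalanced toy sample the normalized estimator returns the difference in group means, while the unnormalized estimator is pulled toward zero by weights that sum to 0.5 and 1.5 instead of 1.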

Selection on Observables
Strong Ignorability: Regression (Again)

Use propensity score as control variable in regression

Assumptions

(A.vi) $E[y_1 - y_0 \mid x]$ is uncorrelated with $Var(D \mid x) = p(x)[1-p(x)]$
(A.vii) $E[y_1 \mid p(x)]$, $E[y_0 \mid p(x)]$ are linear in $p(x)$

(A.vi) has no good interpretation

(A.vii) replaces the functional form assumptions discussed in the previous regression approach

Estimation

Given (A.ii) and (A.vi)...
- Estimate via OLS

$$y_i = \alpha_0 + \tilde\alpha_1 D_i + \gamma\,\hat p(x_i) + \varepsilon_i$$

- Estimates given by

$$\hat\Delta_{ATE} = \hat\Delta_{ATT} = \hat\Delta_{ATU} = \hat{\tilde\alpha}_1$$

which is consistent and asymptotically normal if $\hat p(x_i)$ is consistent and asymptotically normal
- Proof: See Wooldridge (2002)

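Given (A.ii) and (A.vii)...
- Estimate via OLS

$$y_i = \alpha_0 + \tilde\alpha_1 D_i + \gamma_0\,\hat p(x_i) + \gamma_1 [\hat p(x_i) - \hat\mu_p] D_i + \tilde\varepsilon_i$$

where $\hat\mu_p = \frac{1}{N}\sum_i \hat p(x_i)$
- Estimates given by

$$\begin{aligned}
\hat\Delta_{ATE}(x) &= \hat{\tilde\alpha}_1 + \hat\gamma_1 [\hat p(x) - \hat\mu_p] \\
\hat\Delta_{ATE} &= \hat{\tilde\alpha}_1 \\
\hat\Delta_{ATT} &= \hat{\tilde\alpha}_1 + \hat\gamma_1 \bar x_1 \\
\hat\Delta_{ATU} &= \hat{\tilde\alpha}_1 + \hat\gamma_1 \bar x_0
\end{aligned}$$

where $\bar x_j = \sum_i [\hat p(x_i) - \hat\mu_p]\, I[D_i = j] \big/ \sum_i I[D_i = j]$, $j = 0, 1$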

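Given (A.ii) and a weaker version of (A.vii)...
- Estimate via OLS

$$y_i = \alpha_0 + \tilde\alpha_1 D_i + \sum_{k=1}^K \gamma_{0k}\,\hat p(x_i)^k + \sum_{k=1}^K \gamma_{1k} [\hat p(x_i)^k - \hat\mu_p^k] D_i + \tilde\varepsilon_i$$

where $\hat\mu_p^k = \frac{1}{N}\sum_i \hat p(x_i)^k$, $k = 1, ..., K$, and $K$ is a low order number
- Estimates given by

$$\begin{aligned}
\hat\Delta_{ATE}(x) &= \hat{\tilde\alpha}_1 + \sum_{k=1}^K \hat\gamma_{1k} [\hat p(x)^k - \hat\mu_p^k] \\
\hat\Delta_{ATE} &= \hat{\tilde\alpha}_1 \\
\hat\Delta_{ATT} &= \hat{\tilde\alpha}_1 + \sum_{k=1}^K \hat\gamma_{1k} \bar x_1^k \\
\hat\Delta_{ATU} &= \hat{\tilde\alpha}_1 + \sum_{k=1}^K \hat\gamma_{1k} \bar x_0^k
\end{aligned}$$

where $\bar x_j^k = \sum_i [\hat p(x_i)^k - \hat\mu_p^k]\, I[D_i = j] \big/ \sum_i I[D_i = j]$, $j = 0, 1$; $k = 1, ..., K$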

Selection on Observables
Strong Ignorability: Double-Robust Estimators

Robins and Rotnitzky (1995), Lunceford and Davidian (2004), and others discuss DR estimators

DR estimators combine regression and weighting estimators and are double robust because they are consistent as long as either the regression specification for the outcome or the propensity score specification is correct

Estimation

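OLS estimation

$$y_i = \alpha_0 + x_i\beta + \tilde\alpha_1 D_i + \theta_0 \frac{D_i}{\hat p(x_i)} + \theta_1 \frac{1-D_i}{1-\hat p(x_i)} + \tilde\varepsilon_i$$

$$\hat\Delta_{ATE} = \hat{\tilde\alpha}_1 + \frac{1}{N}\sum_i \left[\hat\theta_0 \frac{D_i}{\hat p(x_i)} - \hat\theta_1 \frac{1-D_i}{1-\hat p(x_i)}\right]$$

$$\hat\Delta_{ATT} = \hat{\tilde\alpha}_1 + \frac{1}{N_1}\sum_{i:D_i=1} \left[\hat\theta_0 \frac{D_i}{\hat p(x_i)} - \hat\theta_1 \frac{1-D_i}{1-\hat p(x_i)}\right]$$

$$\hat\Delta_{ATU} = \hat{\tilde\alpha}_1 + \frac{1}{N_0}\sum_{i:D_i=0} \left[\hat\theta_0 \frac{D_i}{\hat p(x_i)} - \hat\theta_1 \frac{1-D_i}{1-\hat p(x_i)}\right]$$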

WLS estimation: ATE

$$y_i = \alpha_0 + x_i\beta + \tilde\alpha_1 D_i + \tilde\upsilon_i$$

where weights are

$$\sqrt{\frac{D_i}{\hat p(x_i)} + \frac{1-D_i}{1-\hat p(x_i)}}$$

and different weights are used for ATT, ATU (given above)

Augmented IPW: ATE (Lunceford and Davidian 2004; Glynn and Quinn 2010)

$$\hat\Delta_{ATE} = \frac{1}{N}\sum_i \left[\frac{D_i y_i - [D_i - \hat p(x_i)]\,g_1(x_i)}{\hat p(x_i)} - \frac{(1-D_i) y_i + [D_i - \hat p(x_i)]\,g_0(x_i)}{1-\hat p(x_i)}\right]$$

where $g_0(x_i)$ and $g_1(x_i)$ are estimated via separate OLS regressions of $y$ on $x$
- See -dr- in Stata
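The augmented IPW formula can be checked numerically. In the sketch below the propensity scores and the fitted values $g_1(x_i)$, $g_0(x_i)$ are passed in as plain lists; how they are estimated is left open, and the function name and toy numbers are illustrative assumptions rather than what Stata's -dr- does internally.

```python
def aipw_ate(d, y, p, g1, g0):
    """Augmented IPW estimate of the ATE from pre-computed fitted values."""
    total = 0.0
    for di, yi, pi, g1i, g0i in zip(d, y, p, g1, g0):
        treated_part = (di * yi - (di - pi) * g1i) / pi
        control_part = ((1 - di) * yi + (di - pi) * g0i) / (1 - pi)
        total += treated_part - control_part
    return total / len(y)


# Toy case 1: the outcome regressions fit perfectly (y equals g1 when D = 1
# and g0 when D = 0), while the propensity scores are arbitrary.
d = [1, 0, 1, 0]
g1 = [5.0, 6.0, 7.0, 8.0]
g0 = [1.0, 2.0, 3.0, 4.0]
y = [5.0, 2.0, 7.0, 4.0]
arbitrary_p = [0.2, 0.7, 0.5, 0.9]
ate_perfect_outcome_model = aipw_ate(d, y, arbitrary_p, g1, g0)

# Toy case 2: with g1 = g0 = 0 the formula collapses to unnormalized IPW.
zeros = [0.0, 0.0, 0.0, 0.0]
ate_no_outcome_model = aipw_ate([1, 1, 0, 0], [3, 5, 1, 2], [0.5] * 4, zeros, zeros)
```

The first toy case illustrates one half of double robustness in a finite sample: when the outcome model fits exactly, the estimate equals the average of $g_1 - g_0$ regardless of the propensity scores supplied.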

Selection on Observables
Strong Ignorability: Decomposition of Treatment Effects

Flores & Flores-Lagunes (2009) provide a framework to decompose $\Delta_k$ into a direct effect of $D$ and an indirect effect that operates through some causal mechanism, $S$

Setup
- $S \in \{0, 1\}$ is a post-treatment, mechanism variable
- $S_0, S_1$ are potential values of $S$ associated with $D = 0$ and $D = 1$, respectively
- $S = DS_1 + (1-D)S_0$ is the realized value of $S$

Example: $D = 1$ if student $i$ attends a private HS, 0 otherwise; $S = 1$ if student $i$ obtains a college degree, 0 otherwise; $y$ = earnings as an adult

Composite potential outcomes for $y$ are defined as $y(D, S_{D'})$, $D, D' \in \{0, 1\}$
- $y(1, S_1)$ = potential outcome associated with $D = 1$ and $S_1$, the realized value of the mechanism variable, $S$, when $D = 1$

Decomposing $\Delta_{ATE}$

$$\Delta_{ATE} = E[y(1,S_1)] - E[y(0,S_0)] = \underbrace{\{E[y(1,S_1)] - E[y(1,S_0)]\}}_{A} + \underbrace{\{E[y(1,S_0)] - E[y(0,S_0)]\}}_{B}$$

where $A$ represents the indirect effect of $D$ on $y$ operating through $S$ and $B$ represents the direct effect of $D$ on $y$ fixing $S$ at the non-treatment value

Authors refer to
- $A$ as the individual causal mechanism effect
- $B$ as the net average treatment effect

Note, $B$ still reflects two effects of $D$ on $y$
1. Effects of $D$ on $y$ operating independently of $S$
2. Effects of $D$ on $y$ operating through a change in the return to $S$ (i.e., even though the level of $S$ is held fixed, the effect of $S$ on $y$ may change due to $D$)

Assumptions

(DTE.i) Independence of Treatment: $y(1,S_1), y(0,S_0), y(1,S_0), S_0, S_1 \perp D$
(DTE.ii) Conditional Independence of Potential Mechanisms: $y(1,S_1), y(0,S_0), y(1,S_0) \perp \{S_0, S_1\} \mid x$
(DTE.iii) Constant Functional Form: If $E[y(1,S_1) \mid S_1 = s_1, x] = f_1(s_1, x)$, then $E[y(1,S_0) \mid S_0 = s_0, x] = f_1(s_0, x)$

(DTE.iii) implies that the functional form relating $S$ and $x$ to $y$ when $D = 1$ is the same regardless of whether $S = S_1$ or $S = S_0$

Under (DTE.i)-(DTE.iii), $\Delta_{ATE}$ and $B$ can be estimated, and then $A$ can be backed out

Extension to the case where (DTE.i) only holds conditional on $x$ is also presented

Selection on Observables
Non-Binary Treatments: Multi-Valued Treatments

Suppose the treatment can take on many discrete values

$$D \in \Omega = \{d_0, d_1, d_2, ..., d_J\}$$

$\Rightarrow$ e.g., years of education

$y_j$ = potential outcome for treatment $j = 0, 1, ..., J$

Parameters of interest

$$\begin{aligned}
\Delta^{ATE}_{j,j'} &= E[y_j - y_{j'}], \quad j, j' \in \Omega \\
\tilde\Delta^{ATE}_{j,j'} &= E[y_j - y_{j'} \mid D \in \{j, j'\}], \quad j, j' \in \Omega \\
\Delta^{ATT}_{j,j'} &= E[y_j - y_{j'} \mid D = j], \quad j, j' \in \Omega
\end{aligned}$$

Dose-response function reflects the unconditional expectation of potential outcomes at each dose

$$E[y_j] \quad \forall j \in \Omega$$

Now, there are $J$ missing counterfactuals
- $D_{ji}$ = indicator if obs $i$ receives treatment $j$

$$D_{ji} = \begin{cases} 1 & \text{if } D_i = j \\ 0 & \text{otherwise} \end{cases}$$

- $y_i$ = observed outcome for $i$

$$y_i = \sum_{j=0}^J y_{ji} D_{ji}$$

Identification of the dose-response function
- Unconditional independence

$$\{y_j\}_{j \in \Omega} \perp D$$

- Strong unconfoundedness (Rosenbaum & Rubin 1983)

$$\{y_j\}_{j \in \Omega} \perp D \mid x$$

$\Rightarrow$ treatment assignment is conditionally independent of all potential outcomes
- Weak unconfoundedness (Imbens 2000)

$$y_j \perp D_j \mid x \quad \forall j \in \Omega$$

$\Rightarrow$ assignment to any particular treatment is conditionally independent of that treatment's potential outcome

Implication of weak unconfoundedness

$$E[y_j \mid x] = E[y \mid D_j = 1, x] = E[y \mid D = j, x] \;\Rightarrow\; E[y_j] = E[E[y \mid D = j, x]]$$

$\Rightarrow$ one may estimate the conditional dose-response function by estimating the mean outcome given treatment assignment and $x$, and then obtain the population dose-response function by averaging over the distribution of $x$

$$\Rightarrow\; E[y_j - y_{j'}] = E[E[y_j - y_{j'} \mid x]] = E\left[E[y \mid D = j, x] - E[y \mid D = j', x]\right]$$

Example
- Let $x$ = gender $(M, F)$
- $\Omega$ = years of schooling $(0, 1, ..., 21)$
- $E[y_j]$ obtained by
  - Computing average value of $y$ for sub-sample with $D_{ji} = 1$ and $x = M$ $\Rightarrow \bar y_{Mj}$
  - Computing average value of $y$ for sub-sample with $D_{ji} = 1$ and $x = F$ $\Rightarrow \bar y_{Fj}$
  - Obtaining portion of $M$ and $F$ in entire sample $\Rightarrow p_M, p_F$
  - Compute $p_M \bar y_{Mj} + p_F \bar y_{Fj}$
- Obtain $E[y_{j'}]$ similarly
- Compute the difference
- Other parameters can be estimated by using the proportions of $M$ and $F$ in various sub-samples (e.g., $D = j, j'$ only)
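The gender example amounts to a cell-mean calculation, sketched below for any discrete $x$. The data layout (a list of `(x, d, y)` tuples), the function name, and the made-up numbers are assumptions for illustration only.

```python
from collections import defaultdict


def dose_response(data, j):
    """Estimate E[y_j] = sum over x of Pr(x) * mean(y | D = j, x) for discrete x.

    `data` is a list of (x, d, y) tuples (an assumed layout).
    """
    n = len(data)
    cell_count = defaultdict(int)   # observations with each value of x
    cell_sum = defaultdict(float)   # sum of y among observations with D = j, by x
    cell_n = defaultdict(int)       # observations with D = j, by x
    for x, d, y in data:
        cell_count[x] += 1
        if d == j:
            cell_sum[x] += y
            cell_n[x] += 1
    total = 0.0
    for x, count in cell_count.items():
        if cell_n[x] == 0:
            # The cell mean is undefined: no one with this x received dose j.
            raise ValueError(f"no observations with D={j} and x={x!r}")
        total += (count / n) * (cell_sum[x] / cell_n[x])
    return total


# Made-up sample: (gender, years of schooling, outcome).
sample = [("M", 12, 20.0), ("M", 12, 30.0), ("M", 16, 40.0),
          ("F", 12, 10.0), ("F", 16, 50.0), ("F", 16, 70.0)]
effect_16_vs_12 = dose_response(sample, 16) - dose_response(sample, 12)
```

The `ValueError` branch reflects the overlap requirement: the averaging is only defined when every value of $x$ is observed at the dose of interest.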

Generalized propensity score
- Definition

$$r(j, x) = \Pr(D = j \mid x) = E[D_j \mid x]$$

- $r(j, x)$ may be estimated given data on $D$, $x$ (MNL, MNP, ordered logit/probit)
- Imbens (2000) shows that weak unconfoundedness $\Rightarrow$

$$y_j \perp D_j \mid r(j, x) \quad \forall j \in \Omega$$

$$\Rightarrow\; E[y_j \mid r(j, x)] = E[y \mid D_j = 1, r(j, x)] = E[y \mid D = j, r(j, x)]$$

and

$$E[y_j] = E[E[y \mid D = j, r(j, x)]]$$

- The above result requires $r(j, x) > 0$ along the entire support of $x$

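Estimation
- Given weak unconfoundedness and assuming $r(j, x) > 0$ for the entire support of $x$, then

$$E\left[\frac{D_j y}{r(j, x)}\right] = E[y_j]$$

- Estimator

$$\widehat{E[y_j]} = \frac{1}{N}\sum_i \left[\frac{D_{ji}\, y_i}{\hat r(j, x_i)}\right]$$

which is analogous to the weighting estimator defined previously in the binary treatment case
- Analogous normalized weighting estimator given by

$$\widehat{E[y_j]} = \left[\sum_i \frac{D_{ji}\, y_i}{\hat r(j, x_i)}\right] \Big/ \left[\sum_i \frac{D_{ji}}{\hat r(j, x_i)}\right]$$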

Selection on Observables
Non-Binary Treatments: Continuous Treatments

Suppose $\Omega$ is an interval $[\underline{d}, \overline{d}]$, and $D$ has a continuous dbn on $\Omega$
$\Rightarrow$ e.g., income

$y_j$ = potential outcome for treatment $j \in \Omega$

$D_j$ is not useful since $j$ takes on an infinite number of values

Weak unconfoundedness can be re-stated as

$$y_j \perp D \mid x \quad \forall j \in \Omega$$

in contrast to strong unconfoundedness which requires $\{y_j\}_{j \in \Omega}$, the full set of potential outcomes, to be conditionally independent

Generalized propensity score
- Now defined as the conditional density of $D$ given $x$

$$r(j, x) = f(j \mid x)$$

- Implication (Hirano & Imbens 2004)

$$y_j \perp D \mid r(j, x) \quad \forall j \in \Omega$$

- Estimation based on

$$E[y_j] = E[E[y \mid D = j, r(j, x)]]$$

- Since $D$ is continuous, estimation entails
  - Estimation of $r(j, x)$
  - Estimate $E[y \mid D = j, r(j, x)]$ by regressing $y$ on $D$ and $\hat r(j, x)$
  - Average $\hat E[y \mid D = j, r(j, x)]$ over the dbn of $x$ (at a fixed value of $j$)

Weighting estimator version: see Robins (1998), Hernan et al. (2000)

See -doseresponse- in Stata

Stratification estimator version (Imai & van Dyk 2004)

- Regress $D$ on $x$ via OLS $\Rightarrow \hat\theta = \hat E[D \mid x] = x\hat\beta$
- Split sample in $K$ strata of equal size based on $\hat\theta$
- Within each stratum, model $y$ as a function of $D$ (and perhaps $x$ to further control for differences in $x$)
  - $y$ continuous: regress $y$ on $D$ and $x$
  - $y$ binary: probit/logit
  - $y$ ordered: oprobit/ologit
  - $y$ count: poisson, NB

  $\Rightarrow \hat\Delta_{ATE,k}$ given by coefficient on $D$
- Obtain overall $\Delta_{ATE}$ by averaging across strata (a simple average, since the strata are of equal size)

$$\hat\Delta_{ATE} = \frac{1}{K}\sum_{k=1}^K \hat\Delta_{ATE,k}$$

- Generalizable to multiple treatment case (e.g., two continuous treatments: income, educ)

Selection on Observables
Dynamic Matching

Pertains to situations where agents receive an initial treatment or not, and then have the option of receiving a second treatment if they receive the first treatment

Many employment or job training programs, or treatments within schools, operate in this manner

Need to carefully consider the parameter of interest in these applications, as well as CIA at different stages of the problem

See work by Lechner (2009, JBES), Lechner and Miquel (2010, EE), Cooley et al. (2010), or Behrman et al. (2004, ReStat)

Selection on Observables
Regression Discontinuity

This estimator returns us to the class of binary treatments

First introduced in Thistlethwaite & Campbell (1960)

Two classes of models: sharp, fuzzy

Sharp RD is a selection on observables estimator, but is not based on strong ignorability (in fact, it precludes it)

Fuzzy RD is a selection on unobservables estimator (discussed later in the course)

Note: Recent work also on Regression Kink Design (Card, Lee, & Pei 2009)

RD setup
- Agents self-select into treatment group
- Selection done at least in part on the basis of an observed continuous variable, $s$
  - $s$ is referred to as the score, running variable, or forcing variable
- $s$ may directly impact potential outcomes as well
- There exists a discrete jump in $\Pr(D = 1)$ at a known value, $\bar s$

Thus, $s$ and $\bar s$ are both known to the econometrician

Sharp RD model

(SRD.i) Treatment assignment is a deterministic function of $s$ (with a known threshold, $\bar s$)

$$D_i = D(s_i) = \begin{cases} 1 & \text{if } s_i > \bar s \\ 0 & \text{otherwise} \end{cases}$$

(SRD.ii) Positive density at the threshold: $f_S(\bar s) > 0$
(SRD.iii) Outcomes are continuous in $s$ at least around $\bar s$
(SRD.iv) For each agent, the dbn of $s$ is continuous at least around $\bar s$

Notes
- (SRD.ii) implies we see agents near $\bar s$
- (SRD.iii) precludes discontinuities in $y$ at $\bar s$ due to other reasons besides changes in $D$
- (SRD.iv) implies that agents cannot perfectly manipulate $s$ to ensure $s \gtrless \bar s$
  - This is crucial to give the setup the interpretation of a random experiment in the neighborhood of $\bar s$

Notes (cont.)
- $y_0, y_1 \perp D \mid s$ follows from (SRD.i)
- All RD estimators require existence of following limits

$$D^+ = \lim_{s \downarrow \bar s} \Pr(D = 1 \mid s), \qquad D^- = \lim_{s \uparrow \bar s} \Pr(D = 1 \mid s)$$

and $D^+ \neq D^-$
  - (SRD.i) implies $D^+ = 1$ and $D^- = 0$
- Common support condition is necessarily violated since

$$\Pr(D = 1 \mid s_i) = \begin{cases} 1 & \text{if } s_i > \bar s \\ 0 & \text{otherwise} \end{cases}$$

which implies that $\Pr(D = 1 \mid s) \notin (0, 1)$ $\forall s$

Parameter of interest

$$\Delta_{ATE}(\bar s) = E[y_1 - y_0 \mid \bar s] = \lim_{s \downarrow \bar s} E[y \mid s] - \lim_{s \uparrow \bar s} E[y \mid s]$$

DiNardo & Lee (2011) advocate a different interpretation
- Argue that RD estimates a weighted average of $\Delta_i$ where the weights are proportional to the probability that an agent's $s_i$ is in the neighborhood of $\bar s$

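Estimation

Use only sub-sample with $s_i \in [\bar s - \delta, \bar s + \delta]$ for small $\delta$
- Similar $s$ $\Rightarrow$ similar observations
- Compute mean difference in outcomes across treatment groups

$$\begin{aligned}
\hat\Delta_{ATE}(\bar s) &= \hat E[y_i \mid s_i \in (\bar s, \bar s + \delta], D = 1] - \hat E[y_i \mid s_i \in [\bar s - \delta, \bar s], D = 0] \\
&= \frac{\sum_{i=1}^N y_i\, I[s_i \in (\bar s, \bar s + \delta], D_i = 1]}{\sum_{i=1}^N I[s_i \in (\bar s, \bar s + \delta], D_i = 1]} - \frac{\sum_{i=1}^N y_i\, I[s_i \in [\bar s - \delta, \bar s], D_i = 0]}{\sum_{i=1}^N I[s_i \in [\bar s - \delta, \bar s], D_i = 0]} \\
&\xrightarrow{p} E[y_i \mid s_i \in (\bar s, \bar s + \delta], D = 1] - E[y_i \mid s_i \in [\bar s - \delta, \bar s], D = 0] \\
&= E[y_{1i} \mid s_i \in (\bar s, \bar s + \delta], D = 1] - E[y_{0i} \mid s_i \in [\bar s - \delta, \bar s], D = 0] \\
&= E[y_{1i} \mid s_i \in (\bar s, \bar s + \delta]] - E[y_{0i} \mid s_i \in [\bar s - \delta, \bar s]] \\
&\neq \lim_{s \downarrow \bar s} E[y \mid s] - \lim_{s \uparrow \bar s} E[y \mid s] \quad \text{for fixed } \delta > 0
\end{aligned}$$

This is essentially a kernel estimator with a uniform kernel over the interval $(\bar s, \bar s + \delta]$ or $[\bar s - \delta, \bar s]$, which entails a non-negligible bias for $\delta > 0$

Example: If $y$ is increasing in $s$, then
- $\hat E[y_i \mid s_i \in (\bar s, \bar s + \delta], D = 1]$ will overestimate $\lim_{s \downarrow \bar s} E[y \mid s]$
- $\hat E[y_i \mid s_i \in [\bar s - \delta, \bar s], D = 0]$ will underestimate $\lim_{s \uparrow \bar s} E[y \mid s]$

$\Rightarrow \hat\Delta_{ATE}(\bar s)$ will be biased up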

Regression approach
- Model

$$y_i = \Delta D_i + \varepsilon_i$$

where $D$ = treatment indicator, $\Delta$ = parameter of interest
- Model is not estimable via OLS since $Cov(D, \varepsilon) \neq 0$
- However, $E[\varepsilon \mid D, s] = E[\varepsilon \mid s]$
- Implies $\Delta$ is estimable if the model is augmented with a sufficiently flexible function of $s$ to proxy for $E[\varepsilon \mid s]$

$$y_i = \Delta D_i + k(s_i) + \eta_i$$

where $Cov(D, \eta) = 0$
- What is $k(s)$?
  - Linear: $k(s) = \theta s$ (Goldberger 1972; Cain 1975)
  - Quadratic: $k(s) = \theta_1 s + \theta_2 s^2$ (Berk & Rauma 1983; van der Klaauw 2000)
  - Semiparametric: $k(s) = \sum_{m=1}^M \theta_m s^m$, with $M$ chosen by cross-validation (Trochim 1984; van der Klaauw 2000)

Example:

[Figure: outcome plotted against score (0 to 1), with fitted values from OLS of y on D and from OLS of y on s & D]

Note: $S \sim U(0,1)$; $D(s) = I(s > 0.5)$; $y = s + D + e$; $\Delta = 1$
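The figure's data-generating process can be re-created in a few lines. The following is a sketch under assumptions that the note does not specify: the error $e$ is drawn as $N(0, 0.5^2)$, the sample size is 20,000, and the seed is arbitrary. The small `ols` helper is a hand-rolled normal-equations solver written for this illustration, not a library routine.

```python
import random


def ols(columns, y):
    """OLS coefficients of y on an intercept plus the given regressor columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Augmented normal equations [X'X | X'y].
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(k)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[r][k] / A[r][r] for r in range(k)]


def simulate(n=20000, seed=0):
    """S ~ U(0,1); D = I(s > 0.5); y = s + D + e, with e ~ N(0, 0.5^2) assumed."""
    rng = random.Random(seed)
    s = [rng.random() for _ in range(n)]
    d = [1.0 if si > 0.5 else 0.0 for si in s]
    y = [si + di + rng.gauss(0.0, 0.5) for si, di in zip(s, d)]
    return s, d, y


s, d, y = simulate()
naive_delta = ols([d], y)[1]          # OLS of y on D only
controlled_delta = ols([d, s], y)[1]  # OLS of y on D and s, i.e. linear k(s)
```

In this design the regression of $y$ on $D$ alone picks up the difference in mean scores across the cutoff (0.75 versus 0.25) on top of $\Delta = 1$, whereas adding $s$ as a control removes that component because $k(s)$ is truly linear here.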

Notes
- Testing of some of the underlying assumptions is feasible
  - Examine the density of $s$ to look for evidence of discontinuity at $\bar s$, suggesting manipulation by agents (McCrary 2008)
  - Look for existence of discontinuities in predetermined variables at $\bar s$ (similar to assessing balancing of predetermined variables in randomized experiments)
- If treatment effect is heterogeneous, then RD estimates a unique parameter (discussed above) that may be uninteresting
  - This is an example of a local average treatment effect (LATE)
  - May be a policy relevant parameter if the question is the impact of a marginal change in an eligibility cut-off, $\bar s$
- Applications: financial aid, GED, Clean Air Act attainment status
- See -rd- in Stata

Selection on Observables
Distributional Approaches

Analysis to this point has focused on mean effects of treatments

Averages may mask a lot of heterogeneity

Distributional methods seek to assess the effects of treatments on other quantities

Traditional approach is quantile regression (QR)

More recent approaches have been couched in the potential outcomes framework and focus on quantile treatment effects (QTE)

Selection on Observables
Distributional Approaches: Quantile Regression

Motivation
- QR provides a convenient linear framework for assessing the impact of changes in a vector of covariates on the quantiles of the dependent variable
- Equivalently, QR allows estimation of linear conditional quantile functions
- Analogous to linear regression, which estimates the conditional mean function
- Common applications
  - Studies of wage determination
  - Studies of student achievement

Notation
- $F(y)$ = CDF of $y$
- $Q_\theta(y)$ = $\theta$th quantile of the random variable, $y$, given by

$$Q_\theta(y) = \inf\{y : F(y) \geq \theta\}$$

(Unconditional) quantiles as a minimization problem
- Prior to discussing QR, it is useful to view unconditional quantiles as a solution to a minimization problem
- Example: median

$$Q_{0.5}(y) = \arg\min_b \sum_i |y_i - b|$$

  - Solution depends on the sign of the residuals, not the magnitude
  - $y = \{99, 100, 101\} \Rightarrow Q_{0.5}(y) = 100$; $y = \{99, 100, 150\} \Rightarrow Q_{0.5}(y) = 100$ as increasing $b$ closer to 150 reduces that residual, but increases the sum of the other two residuals by twice as much
  - Implies median is less sensitive to outliers than the mean
- General formula for any quantile $\theta \in (0, 1)$

$$Q_\theta(y) = \arg\min_b \left\{\sum_{i: y_i \geq b} \theta |y_i - b| + \sum_{i: y_i < b} (1 - \theta) |y_i - b|\right\}$$

  - Quantiles other than the median are defined as the arg min of a weighted sum of the absolute residuals
  - Intuition: say $\theta = 0.75$ and $b$ = median, then problem puts more weight on residuals above $b$, which pushes the solution to the minimization problem above the median
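The minimization view of a quantile can be verified directly. The sketch below evaluates the weighted absolute loss at each observed value and returns a minimizer; restricting the search to data points is a simplification that relies on the loss being piecewise linear in $b$, and the function names are illustrative.

```python
def check_loss(b, ys, theta):
    """Weighted absolute loss: theta*|y - b| for y >= b, (1 - theta)*|y - b| below."""
    return sum(theta * (y - b) if y >= b else (1.0 - theta) * (b - y) for y in ys)


def quantile_by_minimization(ys, theta):
    """Return a data point that minimizes the check loss (first one if tied)."""
    return min(sorted(ys), key=lambda b: check_loss(b, ys, theta))


# The two samples from the median example: the outlier does not move the median.
median_a = quantile_by_minimization([99, 100, 101], 0.5)
median_b = quantile_by_minimization([99, 100, 150], 0.5)

# A higher theta puts more weight on residuals above b and pushes the solution up.
upper = quantile_by_minimization([1, 2, 3, 4, 5], 0.9)
lower = quantile_by_minimization([1, 2, 3, 4, 5], 0.25)
```

When the minimizer is not unique (for example, the median of an even number of points), this sketch returns the smallest minimizing data point, which matches the $\inf$ in the definition of $Q_\theta(y)$.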

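QR model (Koenker & Bassett 1978)
- Replace $b$ in previous problem with a linear function of covariates

$$\hat\beta_\theta = \arg\min_\beta \left\{\sum_{i: y_i \geq x_i\beta} \theta |y_i - x_i\beta| + \sum_{i: y_i < x_i\beta} (1 - \theta) |y_i - x_i\beta|\right\}$$

which may be rewritten as

$$\hat\beta_\theta = \arg\min_\beta \frac{1}{N}\left\{\sum_i \rho_\theta(\varepsilon_{\theta i})\right\}$$

where $\rho_\theta(\varepsilon_{\theta i})$ is known as the check function, defined as

$$\rho_\theta(\varepsilon_{\theta i}) = [\theta - I(\varepsilon_{\theta i} < 0)]\,\varepsilon_{\theta i}$$

and $\varepsilon_{\theta i}$ is the residual for $i$ and $\theta$
- Preceding objective fn is equivalent (after some algebra) to

$$\hat\beta_\theta = \arg\min_\beta \frac{1}{N}\sum_i \left[\theta - \tfrac{1}{2} + \tfrac{1}{2}\,\text{sgn}(y_i - x_i\beta)\right](y_i - x_i\beta)$$

- Error distribution
  - Key assumption: $Q_\theta(\varepsilon_\theta \mid x) = 0$
  - No other assumption about the distribution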

Estimation

The objective fn is not differentiable $\Rightarrow$ standard optimization methods are not viable

Solved using linear programming methods

GMM estimation is also feasible (Buchinsky 1998)

Special case: median regression
- Corresponds to QR model with $\theta = 0.5$; $\hat\beta$ obtained from

$$\hat\beta_{0.5} = \arg\min_\beta \frac{1}{N}\left\{\sum_i |y_i - x_i\beta|\right\}$$

- Analogous to OLS, but $\hat\beta$ minimizes the sum of absolute errors instead of sum of squared errors
- Also known as LAD (Least Absolute Deviations) estimator
- Useful alternative to OLS, particularly when the distribution of the error term is symmetric (so the conditional mean and median are equal), yet outliers are a concern
- Also useful when $y$ is imputed for some obs

Inference

Using a GMM framework, can show

$$\sqrt{N}(\hat\beta_\theta - \beta_\theta) \rightarrow N(0, \Lambda_\theta)$$

$$\Lambda_\theta = \omega^2(\theta)(x'x)^{-1}, \qquad \omega^2(\theta) = \frac{\theta(1 - \theta)}{f^2(F^{-1}(\theta))}$$

and $f(F^{-1}(\theta))$ denotes the density of the error distribution evaluated at the $\theta$th quantile

Intuition:
- Estimation of the $\theta$th conditional quantile uses only obs near the $\theta$th quantile
- Asymptotically, obs are added in this range in a manner proportional to $f(F^{-1}(\theta))$ assuming iid errors

Utilizing the asymptotic formula for inference is difficult in practice

Bootstrap methods provide a simpler alternative (Buchinsky 1998)

Results

Parameters of interest are the partial derivatives of the conditional quantile fn w.r.t. $x$

$$\frac{\partial E[Q_\theta(y \mid x)]}{\partial x_k}$$

which equals $\beta_{\theta k}$ if $x$ enters linearly

Presentation of results
- Difficult as there are a large number of results that are possible to obtain (i.e., $\beta_{\theta k}$, $k = 1, ..., K$ and $\theta \in (0, 1)$)
- Possibilities
  - Typical table of coefficient estimates at several quantiles (typically $\theta$ = 0.10, 0.25, 0.50, 0.75, and 0.90)
  - Graph the conditional quantile fns against $x_k$ if there is one $x$ that is the focus of the paper (again, typically for a few quantiles)
  - Graph $\hat\beta_{\theta k}$ vs. $\theta$ for several different $x$'s on one graph (only works if $x_k$ enters linearly)

Sequential estimation
- In practice, one typically wishes to estimate $\hat\beta_\theta$ for multiple values of $\theta$
- Estimates are not independent since they are obtained from the same data
- Estimation one equation at a time, however, is efficient unless there are cross-equation restrictions (e.g., one might wish for a type of "smooth coefficient" model)

Stata: -qreg-, -bsqreg-, -sqreg-, -grqreg- (for graphing), -qcount- (for count data models), -lqreg- (for logistic models)

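Selection on Observables
Distributional Approaches: Quantile Treatment Effects

Notation
- $y_{1i}, y_{0i}$ = potential outcomes for $i$
- $D_i$ = binary indicator of treatment assignment
- $F_j(y) \equiv \Pr[y_{ji} \leq y]$, $j = 0, 1$ = CDFs of potential outcomes
- $y_j^\theta = \inf\{y_j : F_j(y) \geq \theta\}$ = quantiles of potential outcome dbns

Parameters of interest

$$\begin{aligned}
\Delta^{QTE}_\theta &= E[y_1^\theta - y_0^\theta], \quad \theta \in (0, 1) \\
\Delta^{QTT}_\theta &= E[y_1^\theta - y_0^\theta \mid D = 1], \quad \theta \in (0, 1) \\
\Delta^{QTU}_\theta &= E[y_1^\theta - y_0^\theta \mid D = 0], \quad \theta \in (0, 1)
\end{aligned}$$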

Interpretation

Constant treatment effect assumption
- $y_{1i} = y_{0i} + \Delta$ $\forall i$
- Implies $F_1^{-1}(\theta) = F_0^{-1}(\theta) + \Delta$

[Figure: distributions of $y_0$ and $y_1$ over the range -4 to 4. NOTE: $y_0 \sim N(0,1)$; $y_1 = y_0 + 1$]

- $\Delta^{QTE}_\theta = \Delta^{QTT}_\theta = \Delta^{QTU}_\theta = \Delta$ $\forall \theta \in (0, 1)$

Heterogeneous treatment effects
- $y_{1i} = y_{0i} + \Delta_i$
- Perfect rank correlation (Heckman et al. 1997)
  - Definition: $F_1(y_{1i}) = F_0(y_{0i})$ $\forall i$
  - Intuition: each observation lies in the identical quantile in both potential outcome dbns, which implies that $y_1$ is a monotone transformation of $y_0$
  - Implication: $\Delta^{QTE}_\theta = E[y_1^\theta - y_0^\theta] = Q_\theta(\Delta)$, which is the $\theta$th quantile of the dbn of $\Delta$, which implies that QTEs identify the distribution of the treatment effect, BUT this requires a strong assumption about the joint dbn of potential outcomes
- No perfect rank correlation
  - No assumption about the joint dbn of potential outcomes
  - Implication: $\Delta^{QTE}_\theta = E[y_1^\theta - y_0^\theta] \neq Q_\theta(\Delta)$, which implies that QTEs identify the difference in the two marginal dbns of the potential outcomes, BUT say nothing about the dbn of actual treatment effects ... QTEs reflect the effect of $D$ on quantiles of the potential outcome dbns, NOT on observations at particular quantiles.

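Example #1...

ID   y0   y1   ∆
1    1    2    1
2    2    4    2
3    3    6    3
4    4    8    4
5    5    10   5

- Rank preservation holds; $\Delta_i$ varies
- CDF of $y_0$, $y_1$ are not identical $\Rightarrow \Delta^{QTE}_\theta$ varies with $\theta$
- $\Delta^{QTE}_\theta = Q_\theta(\Delta)$

Example #2...

ID   y0   y1   ∆
1    1    1    0
2    2    4    2
3    3    3    0
4    4    2    -2
5    5    5    0

- Rank preservation is violated; $\Delta_i$ varies
- CDF of $y_0$, $y_1$ are identical $\Rightarrow \Delta^{QTE}_\theta = 0$ $\forall \theta$
- $\Delta^{QTE}_\theta \neq Q_\theta(\Delta)$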
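Both examples can be reproduced with a few lines of code. The sketch below uses the empirical version of $\inf\{y : F(y) \geq \theta\}$ and compares the QTE with the corresponding quantile of the individual effects $\Delta_i$; the helper names and the chosen values of $\theta$ are illustrative.

```python
import math


def quantile(values, theta):
    """inf{y : F(y) >= theta} for the empirical CDF of `values`."""
    ordered = sorted(values)
    index = max(math.ceil(theta * len(ordered)) - 1, 0)
    return ordered[index]


def qte(y0, y1, theta):
    """Difference between the theta-th quantiles of the two marginal distributions."""
    return quantile(y1, theta) - quantile(y0, theta)


y0 = [1, 2, 3, 4, 5]
y1_example1 = [2, 4, 6, 8, 10]   # rank preservation holds
y1_example2 = [1, 4, 3, 2, 5]    # rank preservation is violated
delta1 = [b - a for a, b in zip(y0, y1_example1)]
delta2 = [b - a for a, b in zip(y0, y1_example2)]

thetas = [0.1, 0.5, 0.9]
qte1 = [qte(y0, y1_example1, t) for t in thetas]
qte2 = [qte(y0, y1_example2, t) for t in thetas]
q_delta1 = [quantile(delta1, t) for t in thetas]
q_delta2 = [quantile(delta2, t) for t in thetas]
```

In Example #1 the two lists agree at every $\theta$ tried, while in Example #2 the QTE is zero throughout even though the individual effects range from -2 to 2.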

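Estimation

Identification assumptions: strong ignorability (CIA, CS)

$y_i = D_i y_{1i} + (1 - D_i) y_{0i}$ = observed outcome

$\hat\Delta_\theta$ obtained using sample analogues of $y_1^\theta$ and $y_0^\theta$

Obtain $\hat F_j(y)$, $j = 0, 1$

$$\hat F_j(y) = \frac{1}{\sum_i I(D_i = j)} \sum_i I(D_i = j)\, I(y_i \leq y) \quad \text{(unconditional)}$$

$$\hat F_j(y) = \frac{\sum_{i \in j} \hat\omega_i\, I(y_i \leq y)}{\sum_{i \in j} \hat\omega_i} \quad \text{(covariates)}$$

$$\begin{aligned}
\hat\omega_i &= \frac{D_i}{\hat p_i(x_i)} + \frac{1 - D_i}{1 - \hat p_i(x_i)} \quad \text{(QTE)} \\
\hat\omega_i &= D_i + \frac{\hat p_i(x_i)(1 - D_i)}{1 - \hat p_i(x_i)} \quad \text{(QTT)} \\
\hat\omega_i &= \frac{[1 - \hat p_i(x_i)] D_i}{\hat p_i(x_i)} + (1 - D_i) \quad \text{(QTU)}
\end{aligned}$$

where $\hat p_i(x_i)$ is the propensity score and $x$ is the vector such that CIA holds

$\hat y_1^\theta = \inf\{y : \hat F_1(y) \geq \theta\}$; similarly for $\hat y_0^\theta$

Implies $\hat\Delta_\theta = \hat y_1^\theta - \hat y_0^\theta$

Inference based on bootstrap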

Test of equal CDFs (Abadie 2002)
- Equivalent to test for $H_0: \Delta_\theta = 0$ $\forall \theta \in (0, 1)$
- Utilize Kolmogorov-Smirnov statistic

$$d_{eq} = \sqrt{\frac{N}{2}} \sup_y |F_1(y) - F_0(y)|$$

- Compute $\hat d_{eq} = \sqrt{\frac{N}{2}} \max_k \left|\hat F_1(y_k) - \hat F_0(y_k)\right|$ for a grid of points, $k = 1, ..., K$, in the support of $y_i$
- Inference for test of equality using bootstrap

Stata: -dbn- (my code)
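The test statistic itself is simple to compute; the bootstrap needed for inference is not shown. The sketch below uses unweighted empirical CDFs, takes $N$ to be the combined sample size (an assumption about the notation), and evaluates the gap at every observed value unless a grid is supplied. It is not the -dbn- code.

```python
import math


def ecdf(sample, y):
    """Empirical CDF of `sample` evaluated at y."""
    return sum(1 for v in sample if v <= y) / len(sample)


def ks_equality_stat(y1, y0, grid=None):
    """sqrt(N/2) * max over grid points of |F1_hat - F0_hat|, with N = len(y1) + len(y0)."""
    n = len(y1) + len(y0)
    points = grid if grid is not None else sorted(set(y1) | set(y0))
    gap = max(abs(ecdf(y1, y) - ecdf(y0, y)) for y in points)
    return math.sqrt(n / 2.0) * gap


stat_identical = ks_equality_stat([1, 2, 3], [1, 2, 3])
stat_disjoint = ks_equality_stat([10, 11, 12], [1, 2, 3])
```

An inverse propensity score weighted version would replace `ecdf` with the weighted CDF estimator, leaving the rest of the calculation unchanged.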

Selection on Observables
Distributional Approaches: Stochastic Dominance

In the event the QTEs differ in sign or significance across the dbn, may be interested in "ranking" dbns

Definitions
- First Order Stochastic Dominance: $Y_1$ FSD $Y_0$ iff

$$F_1(y) \leq F_0(y) \quad \forall y \in \mathcal{Y}$$

with strict inequality for some $y$ (where $\mathcal{Y}$ is the union of the supports for $Y_1$ and $Y_0$), or

$$y_1^\theta \geq y_0^\theta \quad \forall \theta \in [0, 1]$$

with strict inequality for some $\theta$
- Second Order Stochastic Dominance: $Y_1$ SSD $Y_0$ iff

$$\int_{-\infty}^y F_1(t)\,dt \leq \int_{-\infty}^y F_0(t)\,dt \quad \forall y \in \mathcal{Y}, \text{ or } \int_0^\theta y_1^t\,dt \geq \int_0^\theta y_0^t\,dt \quad \forall \theta \in [0, 1]$$

with strict inequality for some $y$ or $\theta$

Example: FSD... ($y_1 \sim N(1, 1)$; $y_0 \sim N(0, 1)$)

[Figure: Control and Treatment distributions plotted against Support (-4 to 4) and against Quantile (0 to 100)]

Example: SSD... ($y_1 \sim N(0.25, 0.25)$; $y_0 \sim N(0, 1)$)

[Figure: Control and Treatment distributions plotted against Support (-4 to 4) and against Quantile (0 to 100)]

FSD $\Rightarrow$ SSD

Third and higher order rankings exist

Any two dbns can be ranked at some order of SD

Implications
- Notation
  - $W_1$ = class of social welfare fns that are increasing in $y$
  - $W_2$ = sub-class of $W_1$ that includes all social welfare fns that are also concave in $y$
- $X$ FSD $Y$ $\Rightarrow$ $X$ is at least as preferred by all welfare functions in $W_1$, with strict inequality holding for some welfare function in the class
- $X$ SSD $Y$ $\Rightarrow$ $X$ is at least as preferred by all welfare functions in $W_2$, with strict inequality holding for some welfare function in the class

Test statistics

$$d = \min \sup_{z \in \mathcal{Y}} [F(z) - G(z)]$$

$$s = \min \sup_{z \in \mathcal{Y}} \int_{-\infty}^z [F(t) - G(t)]\,dt$$

where min is taken over $F - G$ and $G - F$

Tests are based on estimates of $d$ and $s$ using the empirical CDFs
- Unconditional, or
- Inverse propensity score weighted

Inference using bootstrap (simple and/or more complex methods)

Selection on Unobservables

When not all $x$'s required for CIA to hold are observed, then one enters into "selection on unobservables" world

Implies unobservable attributes of obs $i$ are correlated with both potential outcomes and treatment assignment of obs $i$

In general, this implies

$$E[y_j \mid x, D = j] \neq E[y_j \mid x, D = j'], \quad j, j' = 0, 1$$

In a regression framework, with functional form assumptions, this implies

$$\begin{aligned}
y_i &= D_i y_{1i} + (1 - D_i) y_{0i} \\
&= \alpha_0 + x_i\beta_0 + (\alpha_1 - \alpha_0) D_i + x_i D_i (\beta_1 - \beta_0) + [\upsilon_{0i} + D_i(\upsilon_{1i} - \upsilon_{0i})]
\end{aligned}$$

where SOU results if
- $Cov(D, \upsilon_0) \neq 0$ $\Rightarrow$ selection on unobservables impacting outcome in untreated state, or
- $Cov(D, \upsilon_1 - \upsilon_0) \neq 0$ $\Rightarrow$ presence of and selection on unobserved, obs-specific gains from treatment

Possible solutions
1. Bound treatment effects ("set identification" as opposed to "point identification") under minimal assumptions
2. Utilize panel data
3. Utilize exclusion restrictions (i.e., instrumental variables)
4. Model dependence between treatment and unobservables $\Rightarrow$ control function approach
5. Other methods that "find" identification elsewhere

Selection on Unobservables
Bounding Treatment Effects

Recall, the ATE

$$\begin{aligned}
\Delta_{ATE}(x) &= E[y_1 - y_0 \mid x] = E[y_1 \mid x] - E[y_0 \mid x] \\
&= \{E[y_1 \mid x, D=1]\Pr(D=1 \mid x) + E[y_1 \mid x, D=0]\Pr(D=0 \mid x)\} \\
&\quad - \{E[y_0 \mid x, D=1]\Pr(D=1 \mid x) + E[y_0 \mid x, D=0]\Pr(D=0 \mid x)\} \\
&= \{g_1(x) - E[y_0 \mid x, D=1]\}\,p(x) + \{E[y_1 \mid x, D=0] - g_0(x)\}[1 - p(x)]
\end{aligned}$$

where $p(x)$, the propensity score, and $g_j(x)$, $j = 0, 1$, are all observable from the data

Similar derivation for other two primary mean treatment effect parameters

$$\begin{aligned}
\Delta_{ATT}(x) &= g_1(x) - E[y_0 \mid x, D=1] \\
\Delta_{ATU}(x) &= E[y_1 \mid x, D=0] - g_0(x)
\end{aligned}$$

Thus, without additional information, no parameter is identified

Early bounding approach outlined in Smith and Welch (1986)
- Objective was to estimate the average wage for blacks accounting for selection into LF

$$E[w] = E[w \mid LF = 1]\Pr(LF = 1) + E[w \mid LF = 0]\Pr(LF = 0)$$

where $E[w \mid LF = 0]$ is not observed
- Solution: $E[w \mid LF = 0] = \gamma E[w \mid LF = 1]$, $\gamma \in [0.5, 1]$
- In treatment effects context, can specify $E[y_d \mid D = d'] = \gamma E[y_d \mid D = d]$ for different values of $\gamma$, where $d \neq d'$
- Rosenbaum (2002) summarizes other papers that bound causal effects by varying the unobserved parameters

More recent approaches focus on adding assumptions to tighten the bounds on the parameter of interest

Notation (Lechner 1999; Manski 1990)
- $L_1, L_0$ = lower bounds of the support of $y_1, y_0$, respectively
- $U_1, U_0$ = upper bounds of the support of $y_1, y_0$, respectively
- $B^L_k, B^U_k$ = lower, upper bounds, respectively, of treatment effect $k$ ($k = ATE, ATT$, or $ATU$)
- $w_k = B^U_k - B^L_k$ = width of bounds for treatment effect $k$

Trivial case
- No additional information

$$\begin{aligned}
B^L_k &= L_1 - U_0 \\
B^U_k &= U_1 - L_0 \\
w_k &= (U_1 - L_0) - (L_1 - U_0) = (U_1 - L_1) + (U_0 - L_0)
\end{aligned}$$

- Example: $y$ is binary (e.g., employment after job training program)

$$L_1 = L_0 = 0, \quad U_1 = U_0 = 1, \quad B^L_k = -1, \quad B^U_k = 1, \quad w_k = 2$$

Tightening bounds with data

Use sample data
- $p(x)$, $g_0(x)$, $g_1(x)$ may be consistently estimated from the data by
  - Sample means
  - Nonparametric smoothing methods
  - Parametric methods

New bounds with sample data
- $\Delta_{ATE}(x)$

$$\begin{aligned}
B^L_{ATE} &= \{\hat g_1(x) - U_0\}\hat p(x) + \{L_1 - \hat g_0(x)\}[1 - \hat p(x)] \\
B^U_{ATE} &= \{\hat g_1(x) - L_0\}\hat p(x) + \{U_1 - \hat g_0(x)\}[1 - \hat p(x)] \\
w_{ATE} &= (U_1 - L_1)[1 - \hat p(x)] + (U_0 - L_0)\hat p(x)
\end{aligned}$$

- $\Delta_{ATT}(x)$

$$B^L_{ATT} = \hat g_1(x) - U_0, \qquad B^U_{ATT} = \hat g_1(x) - L_0, \qquad w_{ATT} = U_0 - L_0$$

- $\Delta_{ATU}(x)$

$$B^L_{ATU} = L_1 - \hat g_0(x), \qquad B^U_{ATU} = U_1 - \hat g_0(x), \qquad w_{ATU} = U_1 - L_1$$

Example: $y$ is binary $\Rightarrow w_k = 1$ $\forall k$ (sample data cuts width in half)

Note: Bounds necessarily include zero
- Cannot rule out zero average treatment effect
- Can exclude some extreme values
- Full characterization of the bounds should also account for uncertainty in the variables belonging in $x$ and the model used to estimate $g_0(x)$, $g_1(x)$, and $p(x)$ (Heckman et al. 1999)
  - While bounds conditional on $x$ and a model, $m$, all have width one, the exact bounds are affected
- Kreider, Pepper, and co-authors incorporate measurement error in $D$ into the bounds (discussed later)
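The bounds using sample data are simple functions of three estimated quantities. The sketch below evaluates them at a single value of $x$; the function name, the default support of $[0, 1]$ (a binary outcome), and the numerical inputs are illustrative assumptions.

```python
def manski_bounds(g1, g0, p, l1=0.0, u1=1.0, l0=0.0, u0=1.0):
    """Bounds on ATE, ATT and ATU given estimates of g1(x), g0(x) and p(x).

    l1, u1 (l0, u0) are the lower and upper limits of the support of y1 (y0).
    """
    ate = ((g1 - u0) * p + (l1 - g0) * (1.0 - p),
           (g1 - l0) * p + (u1 - g0) * (1.0 - p))
    att = (g1 - u0, g1 - l0)
    atu = (l1 - g0, u1 - g0)
    return {"ATE": ate, "ATT": att, "ATU": atu}


# Hypothetical estimates for a binary outcome at some x.
bounds = manski_bounds(g1=0.7, g0=0.4, p=0.5)
widths = {k: hi - lo for k, (lo, hi) in bounds.items()}
```

With a binary outcome every interval has width one and contains zero, as noted above; only the location of the interval is informed by the data.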

Tightening bounds with assumptions

Assume $\Delta_{ATT}(x) = \Delta_{ATU}(x)$
- Calculate bounds for $\Delta_{ATT}(x)$ and $\Delta_{ATU}(x)$
- New bounds include only the intersection of the two bounds
- Example

$$\Delta_{ATT}(x) \in [-0.25, 0.75], \qquad \Delta_{ATU}(x) \in [-0.75, 0.25]$$

then new bounds are $[-0.25, 0.25]$
- Note: still necessarily include zero since bounds on $\Delta_{ATT}(x)$, $\Delta_{ATU}(x)$ both include zero

Level-set restrictions: treatment effects are constant $\forall x \in X_0 \subset X$ (the support of $x$)
- Calculate bounds for $\Delta_k(x)$ $\forall x \in X_0$
- New bounds include only the intersection of these bounds
- Example ($\Delta_{ATE}$)

$$\Delta_{ATE}(x_a) \in [-0.25, 0.75], \qquad \Delta_{ATE}(x_b) \in [-0.75, 0.25]$$

where $x_a, x_b \in X_0$, then new bounds are $[-0.25, 0.25]$
- Note: still necessarily include zero since bounds on $\Delta_k(x)$ include zero $\forall x$
- Formally

$$\begin{aligned}
B^L_k(X_0) &= \sup_{x \in X_0} B^L_k(x) \\
B^U_k(X_0) &= \inf_{x \in X_0} B^U_k(x) \\
w_k(X_0) &= B^U_k(X_0) - B^L_k(X_0)
\end{aligned}$$

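Level-set restrictions: expected outcomes are constant $\forall x \in X_{0,1} \subset X$ (for $y_1$) and $\forall x \in X_{0,0} \subset X$ (for $y_0$)
- Implies

$$E[y_1 \mid x] \text{ is constant } \forall x \in X_{0,1}, \qquad E[y_0 \mid x] \text{ is constant } \forall x \in X_{0,0}$$

$\Rightarrow$ Bounds become

$$\begin{aligned}
B^L_{ATE}(x_0) &= \sup_{x \in X_{0,1}} \{\hat g_1(x)\hat p(x) + L_1[1 - \hat p(x)]\} - \inf_{x \in X_{0,0}} \{\hat g_0(x)[1 - \hat p(x)] + U_0 \hat p(x)\} \\
B^U_{ATE}(x_0) &= \inf_{x \in X_{0,1}} \{\hat g_1(x)\hat p(x) + U_1[1 - \hat p(x)]\} - \sup_{x \in X_{0,0}} \{\hat g_0(x)[1 - \hat p(x)] + L_0 \hat p(x)\} \\
B^L_{ATT}(x_0) &= \sup_{x \in X_{0,1}} \{\hat g_1(x)\} - \inf_{x \in X_{0,0}} \{U_0\} \\
B^U_{ATT}(x_0) &= \inf_{x \in X_{0,1}} \{\hat g_1(x)\} - \sup_{x \in X_{0,0}} \{L_0\} \\
B^L_{ATU}(x_0) &= \sup_{x \in X_{0,1}} \{L_1\} - \inf_{x \in X_{0,0}} \{\hat g_0(x)\} \\
B^U_{ATU}(x_0) &= \inf_{x \in X_{0,1}} \{U_1\} - \sup_{x \in X_{0,0}} \{\hat g_0(x)\}
\end{aligned}$$

where $x_0 \in X_{0,1} \cap X_{0,0}$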

Assumption: positive selection
- Implies

$$E[y_1 \mid x, D = 1] \geq E[y_0 \mid x, D = 1]$$

which means that the treated only join the treatment group if there are non-negative gains on average
- Bounds become

$$\begin{aligned}
B^L_{ATE} &= \{L_1 - \hat g_0(x)\}[1 - \hat p(x)] \\
B^U_{ATE} &= \{\hat g_1(x) - L_0\}\hat p(x) + \{U_1 - \hat g_0(x)\}[1 - \hat p(x)] \\
B^L_{ATT} &= 0 \\
B^U_{ATT} &= \hat g_1(x) - L_0
\end{aligned}$$

- Does not affect bounds on $\Delta_{ATU}(x)$

Combining assumptions, restrictions

$$B^L_{k,combine} = \max_{p \in \Psi} \{B^L_{k,p}\}$$

$$B^U_{k,combine} = \min_{p \in \Psi} \{B^U_{k,p}\}$$

where $\Psi$ is the set of restrictions being combined

Inference via bootstrap
- Yields confidence intervals for the bounds, not the treatment effect
- For example, a 90% CI implies that the probability that the true bounds lie in the CI is 90%; the probability that the true treatment effect lies in the CI is even higher (see also Imbens & Manski (2004))

Tightening bounds (again)

Manski (1990), Manski & Pepper (2000) consider additionalassumptions

1 Instrument

$$E[y_j|z] = E[y_j], \quad j = 0, 1$$

2 Monotone Instrument

$$z_1 \le z \le z_2 \Rightarrow E[y_j|Z=z_1] \le E[y_j|Z=z] \le E[y_j|Z=z_2], \quad j = 0, 1$$

3 Monotone Treatment Selection

$$E[y_j|D=1] \ge E[y_j|D=0], \quad j = 0, 1$$

4 Monotone Treatment Response

$$y_0 \le y_1 \Rightarrow E[y_0] \le E[y_1]$$

where $x$ is omitted for notational convenience

Use of an instrument
- $E[y_j|z] = E[y_j]$, $j = 0, 1$, implies

$$E[y_j] \in \Big[\sup_z \{E[y|D=j, Z=z]\Pr(D=j|Z=z) + L_j\Pr(D \ne j|Z=z)\},\ \inf_z \{E[y|D=j, Z=z]\Pr(D=j|Z=z) + U_j\Pr(D \ne j|Z=z)\}\Big]$$

- Bounds for $\Delta_{ATE}$ become

$$B^L_{ATE} = \sup_z \{\hat{g}_1(z)\hat{p}(z) + L_1[1-\hat{p}(z)]\} - \inf_z \{\hat{g}_0(z)[1-\hat{p}(z)] + U_0\hat{p}(z)\}$$

$$B^U_{ATE} = \inf_z \{\hat{g}_1(z)\hat{p}(z) + U_1[1-\hat{p}(z)]\} - \sup_z \{\hat{g}_0(z)[1-\hat{p}(z)] + L_0\hat{p}(z)\}$$

- Bounds are tighter than the worst-case bounds if $p(z) \ne \Pr(D=1)$; i.e., $z$ is correlated with treatment assignment
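A minimal sketch of the instrument-based bounds on the ATE, assuming a discrete instrument and that the cell quantities $\hat{g}_1(z)$, $\hat{g}_0(z)$, $\hat{p}(z)$ have already been estimated. The function name, the dictionary layout, and the numbers in the example are all hypothetical.

```python
from typing import Dict, Tuple

Cell = Tuple[float, float, float]  # (g1, g0, p) for one value of z


def ate_bounds_with_instrument(
    cells: Dict[object, Cell], l0: float, u0: float, l1: float, u1: float
) -> Tuple[float, float]:
    """Bounds on the ATE under mean independence of y_j from a discrete z.

    g1 = E[y | D=1, Z=z], g0 = E[y | D=0, Z=z], p = Pr(D=1 | Z=z);
    [l_j, u_j] is the logical range of y_j.
    """
    vals = list(cells.values())
    lower_y1 = max(g1 * p + l1 * (1 - p) for g1, _, p in vals)  # sup over z
    upper_y1 = min(g1 * p + u1 * (1 - p) for g1, _, p in vals)  # inf over z
    lower_y0 = max(g0 * (1 - p) + l0 * p for _, g0, p in vals)  # sup over z
    upper_y0 = min(g0 * (1 - p) + u0 * p for _, g0, p in vals)  # inf over z
    return lower_y1 - upper_y0, upper_y1 - lower_y0


# Hypothetical binary outcome (range [0, 1]) and binary instrument.
cells = {0: (0.6, 0.4, 0.2), 1: (0.6, 0.4, 0.8)}
lo, hi = ate_bounds_with_instrument(cells, 0.0, 1.0, 0.0, 1.0)
```

In this made-up example the propensity score moves from 0.2 to 0.8 across values of $z$, and the width of the bounds falls from one (the worst case for a binary outcome) to 0.4. With a single value of $z$ the function returns the worst-case bounds.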

Use of a monotone instrument (MIV)
- $z_1 \le z \le z_2 \Rightarrow E[y_j|Z=z_1] \le E[y_j|Z=z] \le E[y_j|Z=z_2]$, $j = 0, 1$
  - Weaker assumption than the prior, mean independence assumption
  - Implies that potential outcomes are non-decreasing in $z$
- Implies

$$E[y_j] \in \Big[\sum_{z \in Z} \Pr(Z=z)\Big(\sup_{z_1 \le z} \{E[y|D=j, Z=z_1]\Pr(D=j|Z=z_1) + L_j\Pr(D \ne j|Z=z_1)\}\Big),$$

$$\qquad\quad \sum_{z \in Z} \Pr(Z=z)\Big(\inf_{z_2 \ge z} \{E[y|D=j, Z=z_2]\Pr(D=j|Z=z_2) + U_j\Pr(D \ne j|Z=z_2)\}\Big)\Big]$$

- Bounds derived based on this

Monotone treatment selection (MTS)
- $E[y_j|D=1] \ge E[y_j|D=0]$, $j = 0, 1$, implies that the treated group has weakly higher potential outcomes in all treatment states
- Plausible in certain cases when one does not condition on $x$ and $x$ is correlated with both $D$ and $y_j$ in the same direction
- Implies

$$E[y_j] \in \big[E[y|D=j]\Pr(D \ge j) + L_j\Pr(D < j),\ E[y|D=j]\Pr(D \le j) + U_j\Pr(D > j)\big]$$

Monotone treatment response (MTR)
- $y_0 \le y_1 \Rightarrow E[y_0] \le E[y_1]$ implies we know the sign of the treatment effect (inclusive of zero)
- Implies $\Delta_{ATE} \ge 0$
- Stronger than the earlier positive selection assumption, as that applied only to the sub-sample with $D = 1$

MIV can be combined with MTS, MTR

Methodology can also be combined with assumptions concerning measurement error (discussed later)

Stata: -bpbounds- (related)

Selection on Unobservables: Altonji et al. Approach

Altonji et al. (2005) offer two approaches to assess the sensitivity of estimates obtained under the SOO assumption when this assumption is false

Approach #1 is applicable to the case of a binary outcome

Approach #2 is applicable regardless of type of outcome

Krauth (2011) attempts to extend the approach

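Approach #1: Bivariate probit model

$$y^*_i = x_i\beta + \tau D_i + \varepsilon_i$$

$$D^*_i = x_i\gamma + \mu_i$$

where $\varepsilon, \mu \sim N(0, 0, 1, 1, \rho)$ and

$$y_i = \begin{cases} 1 & \text{if } y^*_i > 0 \\ 0 & \text{otherwise} \end{cases} \qquad D_i = \begin{cases} 1 & \text{if } D^*_i > 0 \\ 0 & \text{otherwise} \end{cases}$$

Estimation by ML

$$\ln L = \sum_{i:\{y=1,D=1\}} \ln[\Phi_2(x_i\beta + \tau,\ x_i\gamma,\ \rho)] + \sum_{i:\{y=1,D=0\}} \ln[\Phi_2(x_i\beta,\ -x_i\gamma,\ -\rho)]$$

$$\qquad + \sum_{i:\{y=0,D=1\}} \ln[\Phi_2(-x_i\beta - \tau,\ x_i\gamma,\ -\rho)] + \sum_{i:\{y=0,D=0\}} \ln[\Phi_2(-x_i\beta,\ -x_i\gamma,\ \rho)]$$

Model is technically identified with no exclusion restriction, but treat $\rho$ as unidentified

Assessing the treatment effect as $\rho$ varies provides evidence of sensitivity to selection on unobservables

Constrain $\rho > 0 \Rightarrow$ positive selection; $\rho < 0 \Rightarrow$ negative selection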

Approach #2: SOU relative to SOO

Intuition is to assess how much SOU, relative to the amount of SOO,is needed to fully explain the observed positive association between Dand y

(AET.i) Random observables: $x$ is a random subset of all factors, $w$, influencing $y$

(AET.ii) Equally important factors: the number of elements in $w$ is large and no single factor has an undue influence on $y$

(AET.iii) Relationship between $x$ and unobservables: slightly weaker technical assumption than independence between $x$ and the remaining elements of $w$

If (AET.i)-(AET.iii) hold, then one should expect the amount of selection controlled for by $x$ to equal the amount of selection on unobservables

Implies that if the amount of SOU needed to explain the observed association is less than the amount of SOO, the estimated treatment effect should not be viewed as robust

Model for outcome

$$y_i = x_i\beta + \tau D_i + \varepsilon_i$$

The (normalized) amount of SOU is given by

$$\frac{E[\varepsilon|D=1] - E[\varepsilon|D=0]}{Var(\varepsilon)}$$

The (normalized) amount of SOO, ignoring the impact of $D$, is given by

$$\frac{E[x\beta|D=1] - E[x\beta|D=0]}{Var(x\beta)}$$

The goal is to assess how large SOU must be relative to SOO to fully account for the positive treatment effect estimated under exogeneity

Express actual treatment participation as

$$D_i = x_i\gamma + \mu_i$$

plim of the OLS estimator of $\tau$ is

$$\text{plim } \hat{\tau} = \tau + \frac{Cov(\mu, \varepsilon)}{Var(\mu)} = \tau + \frac{Var(D)}{Var(\mu)}\{E[\varepsilon|D=1] - E[\varepsilon|D=0]\}$$

Under the assumption that SOO = SOU, the asymptotic bias term is

$$\frac{Cov(\mu, \varepsilon)}{Var(\mu)} = \frac{Var(D)}{Var(\mu)}\left\{\frac{E[x\beta|D=1] - E[x\beta|D=0]}{Var(x\beta)}\,Var(\varepsilon)\right\}$$

This bias can be consistently estimated under $H_o: \tau = 0$

The ratio $\hat{\tau}/\widehat{bias}$ indicates how much larger SOU needs to be relative to SOO to entirely explain the treatment effect

A small ratio $\Rightarrow$ treatment effect is highly sensitive to selection on unobservables; a ratio $\gg 1$ implies the treatment effect is robust

Algorithm:
1 Estimate $Var(D)$ from the sample
2 Estimate the treatment eqtn via LPM $\Rightarrow \widehat{Var}(\mu)$
3 Estimate the outcome eqtn via OLS restricting $\tau = 0$ $\Rightarrow x\hat{\beta}$, $\widehat{Var}(x\hat{\beta})$, $\widehat{Var}(\varepsilon)$
4 Obtain sample means of $x\hat{\beta}$ in the treatment and control groups $\Rightarrow \hat{E}[x\hat{\beta}|D=1]$, $\hat{E}[x\hat{\beta}|D=0]$
5 Estimate the outcome eqtn via OLS $\Rightarrow \hat{\tau}$
6 Compute the ratio $\hat{\tau}/\widehat{bias}$

Notes:
- If $y$ is binary, perhaps estimate the outcome eqtn in step 3 via probit $\Rightarrow Var(\varepsilon) = 1$
- AET methods have relatively little to say about the economic significance of the treatment effect unless one makes assumptions about the amount of SOU
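The six steps can be sketched in plain Python for the simplest case of a single covariate plus a constant. This is an illustration under stated assumptions, not the authors' code: the helper names `ols_fit` and `aet_ratio` are hypothetical, population-style variances are used throughout (the degrees-of-freedom corrections cancel in the ratios), and step 5 uses the Frisch-Waugh result that the coefficient on $D$ equals the regression of the step-3 residual on the step-2 residual.

```python
from statistics import fmean, pvariance


def ols_fit(y, x):
    """OLS of y on a constant and one regressor; returns (intercept, slope)."""
    xbar, ybar = fmean(x), fmean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


def aet_ratio(y, d, x):
    """Return (tau_hat, bias_hat, tau_hat / bias_hat) for binary d, scalar x."""
    y, d, x = [float(v) for v in y], [float(v) for v in d], [float(v) for v in x]
    var_d = pvariance(d)                                    # step 1
    a, g = ols_fit(d, x)                                    # step 2: LPM for D
    mu = [di - a - g * xi for di, xi in zip(d, x)]
    var_mu = pvariance(mu)
    b0, b1 = ols_fit(y, x)                                  # step 3: tau = 0 imposed
    xb = [b0 + b1 * xi for xi in x]
    eps = [yi - fi for yi, fi in zip(y, xb)]
    var_xb, var_eps = pvariance(xb), pvariance(eps)
    m1 = fmean([f for f, di in zip(xb, d) if di == 1.0])    # step 4
    m0 = fmean([f for f, di in zip(xb, d) if di == 0.0])
    # step 5: coefficient on D in OLS of y on (1, x, D), via Frisch-Waugh
    tau_hat = sum(m * e for m, e in zip(mu, eps)) / sum(m * m for m in mu)
    bias_hat = (var_d / var_mu) * ((m1 - m0) / var_xb) * var_eps
    return tau_hat, bias_hat, tau_hat / bias_hat            # step 6


# Hypothetical data: y = 1 + 2x + 3D exactly, so tau_hat is 3.
x = [0, 1, 2, 3, 4, 5, 6, 7]
d = [0, 0, 1, 0, 1, 1, 0, 1]
y = [1 + 2 * xi + 3 * di for xi, di in zip(x, d)]
tau_hat, bias_hat, ratio = aet_ratio(y, d, x)
```

With several covariates the same steps apply with a general OLS routine in place of `ols_fit`.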

Selection on Unobservables: Panel Data

Refer to ECO 6375 for panel data refresher...

Panel data is useful for addressing selection on unobservables that are invariant along a certain dimension

Thus, panel data methods provide a solution to selection on unobservables in only certain situations

Notation
- Population regression fn given by $E[y|x_1, ..., x_K, c]$
- $x_k$, $k = 1, ..., K$, are observable (to the econometrician)
- $c$ is an unobservable (to the econometrician) variable

Assuming linearity: $E[y|x_1, ..., x_K, c] = \beta_0 + x\beta + c$

Error form of the model

$$y = \beta_0 + x\beta + c + \varepsilon$$

where $c$ is the unobserved effect and $\varepsilon$ is the idiosyncratic error

Time-series or cross-section models are forced to include $c$ in the error term (referred to as the composite error)

$$y_i = \beta_0 + x_i\beta + \tilde{\varepsilon}_i, \quad \tilde{\varepsilon}_i = c_i + \varepsilon_i$$

$$y_t = \beta_0 + x_t\beta + \tilde{\varepsilon}_t, \quad \tilde{\varepsilon}_t = c_t + \varepsilon_t$$

Model

$$y_{it} = \beta_0 + x_{it}\beta + c_i + \varepsilon_{it}$$

- Unobserved effect is assumed to be time invariant (assuming a traditional panel where $t$ represents time)
- $x$ may include time dummies or time trend, etc.

Problem: given the presence of $c_i$, how can we recover consistent estimates of $\beta_0$, $\beta$?

Estimation techniques
- Assuming $Cov(x, c) = 0$
  - Pooled OLS (POLS)
  - Random effects (RE)
- Assuming $Cov(x, c) \ne 0$
  - Least squares dummy variable model (LSDV)
  - Fixed effects (FE)
  - First-differencing (FD)

Selection on Unobservables: Panel Data: Treatment Effects Models

Structural model

$$y_{it} = c_i + \lambda_t + x_{it}\beta + \tau D_{it} + \varepsilon_{it}, \quad i = 1, ..., N;\ t = 1, ..., T$$

where $\lambda_t$ are time dummies

Special case
- Setup
  - $T = 2$
  - $D_{i1} = 0\ \forall i$
  - $D_{i2} \in \{0, 1\}\ \forall i$
  - Assume no $x$'s
- FE or FD estimation $\Rightarrow$

$$\tau = E[\Delta y|D_2 = 1] - E[\Delta y|D_2 = 0]$$

- Known as the difference-in-differences estimator

Visual representation of special case

$$y_{it} = c_i + \lambda_t + x_{it}\beta + \tau D_{it} + \varepsilon_{it}$$

- Expected outcomes by period and treatment status

             t = 1          t = 2
    D = 0    c0 + λ1        c0 + λ2
    D = 1    c1 + λ1        c1 + λ2 + τ

- Implies

$$E[\Delta y|D_2 = 1] = (c_1 + \lambda_2 + \tau) - (c_1 + \lambda_1) = \tau + \lambda_2 - \lambda_1$$

$$E[\Delta y|D_2 = 0] = (c_0 + \lambda_2) - (c_0 + \lambda_1) = \lambda_2 - \lambda_1$$

which implies

$$\tau = E[\Delta y|D_2 = 1] - E[\Delta y|D_2 = 0]$$
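The two-period special case can be sketched directly as a difference in mean first differences. The function name and the data below are hypothetical; the example builds outcomes with no idiosyncratic error so that the estimator recovers $\tau$ exactly.

```python
from statistics import fmean


def did_estimate(y1, y2, d2):
    """Difference-in-differences with T = 2 and no one treated in period 1.

    y1, y2: outcomes in periods 1 and 2; d2: 0/1 treatment status in period 2.
    """
    dy = [b - a for a, b in zip(y1, y2)]            # first differences remove c_i
    treated = [v for v, d in zip(dy, d2) if d == 1]
    control = [v for v, d in zip(dy, d2) if d == 0]
    return fmean(treated) - fmean(control)          # common time effect differences out


# Hypothetical data: unit effects c_i, lambda_1 = 0, lambda_2 = 1.5, tau = 2.
c = [1.0, 2.0, 5.0, 6.0]
d2 = [0, 0, 1, 1]
y1 = list(c)
y2 = [ci + 1.5 + 2.0 * di for ci, di in zip(c, d2)]
tau_hat = did_estimate(y1, y2, d2)
```

Note that the treated units have larger $c_i$ in this example, so a simple cross-section comparison in period 2 would overstate the effect while the differenced comparison does not.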

[Figure: Illustration of Three Common Estimators. Recoverable labels: Before-After Estimator, Cross-Section Estimator; horizontal axis: Period.]

Beyond the special case
- Special case is useful to gain the intuition, but is not required
- In general, as long as $D_{it}$ is time-varying for some units $i$, then $\tau$ can be estimated by any panel data method provided the required assumptions are met
- If selection into treatment is only on observables (not $c_i$), then POLS or RE may be consistent and efficient
- If selection into treatment is also on time invariant unobservables ($c_i$), then POLS and RE are inconsistent, but FE or FD are consistent if other assumptions are met
- Important to remember: FE/FD is not a magic bullet (Duflo et al. 2004)
  - FE and FD require strict exogeneity; rules out Ashenfelter's Dip $\Rightarrow Cov(D_{it}, \varepsilon_{it-1}) \ne 0$
  - Rules out selection on contemporaneous shocks $\Rightarrow Cov(D_{it}, \varepsilon_{it}) \ne 0$
  - Key: requires treated and untreated to follow the same time trend in the absence of treatment
  - Diff-in-diff-in-diff may be an option
- With heterogeneous treatment effects, FE identifies the ATT

Timing issues (LaPorte & Windmeijer 2005)

Previous model restricts $D$ to a one-time intercept shift, $\tau$

In certain applications, the agent may anticipate treatment and alter behavior prior to actual treatment; or, the response may occur with a lag; or, some combination of both

Examples: policy changes announced, but not implemented until a future date; or, lags in adjustment to policy changes

General structural model

$$y_{it} = c_i + \lambda_t + x_{it}\beta + \sum_{l=1}^{L_0} \delta_{-l}D^{-l}_{it} + \delta_0 D_{it} + \sum_{l=1}^{L_1} \delta_l D^l_{it} + \varepsilon_{it}$$

$D^{-l}_{it} = D_{it+l}$ (treatment assignment $l$ periods in the future)

$D^l_{it} = D_{it-l}$ (treatment assignment $l$ periods in the past)

$\delta_{-l}$ reflects anticipatory effects of treatment

$\delta_l$ reflects lagged effects of treatment

$\delta_0$ reflects instantaneous effects of treatment

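Specification test

If anticipatory and/or lagged effects occur, but the simple model of a one-time effect is estimated, then FE and FD will yield (statistically) different estimates

$$E[\hat{\delta}_{FD}] = \delta_0 - \delta_{-1}$$

$$E[\hat{\delta}_{FE}] = \sum_t \omega_t(\delta_{0+} - \delta_{-})$$

$\delta_{0+}$ = average of $\delta_0, \delta_1, ..., \delta_{L_1}$; $\delta_{-}$ = average of $\delta_{-1}, ..., \delta_{-L_0}$

and $\omega_t$ are weights

$H_o: \delta_{FD} = \delta_{FE} \iff H_o: \phi = 0$ in the stacked regression

$$\begin{bmatrix} y_{it} - y_{it-1} \\ y_{it} - \bar{y}_i \end{bmatrix} = \begin{bmatrix} x_{it} - x_{it-1} \\ x_{it} - \bar{x}_i \end{bmatrix}\beta + \begin{bmatrix} D_{it} - D_{it-1} \\ D_{it} - \bar{D}_i \end{bmatrix}\delta + \begin{bmatrix} 0 \\ D_{it} - \bar{D}_i \end{bmatrix}\phi + \begin{bmatrix} \eta_{it} \\ \tilde{\eta}_{it} \end{bmatrix}$$

Estimate via OLS, look at the confidence interval on $\hat{\phi}$

Lee and Huang (2011) extend the existing literature on dynamic treatment effects to allow for anticipatory behavior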

Autoregressive Model

Fixed effects models require $D_{it}$ to be time-varying for some $i$

If $D$ is time invariant $\forall i$, it is still possible to identify the effect of the program under the common treatment effect assumption

Structural model

$$y_{it} = \lambda_t + x_{it}\beta + \tau D_i + \varepsilon_{it}$$

$$\varepsilon_{it} = \rho\varepsilon_{it-1} + \eta_{it}$$

where $\eta_{it}$ is iid with mean zero and $\tau$ is the homogeneous treatment effect

Quasi-FD yields

$$y_{it} = \tilde{\lambda}_t + (x_{it} - \rho x_{it-1})\beta + (1 - \rho)\tau D_i + \rho y_{it-1} + \eta_{it}$$

OLS is consistent if (i) $x$ are strictly exogenous and (ii) $D$ is uncorrelated with $\eta$ (e.g., post-treatment shocks are not forecastable and therefore do not affect the past treatment decision)

Comparative Case Study Approach

Provides an alternative to DD when
- Treatment occurs at an aggregate level
- Typically only a single observation is treated and a lengthy history of pre-treatment data is available for the treated unit and the pool of controls

Examples:
- Mariel Cuban Boat Lift (Card 1990)
- State minimum wage (Card & Krueger 1994)

Solution
- Construct a synthetic control, which is a weighted average of the available controls, to estimate the missing counterfactual in the post-treatment period(s)
- Weights are chosen by matching pre-treatment covariates and outcomes
- Allows for differential time trends in treatment and control observations
  - By matching pre-treatment outcomes, one is implicitly matching on the time-invariant unobserved effect
  - Thus, it does not matter if the unobserved effect has differential effects over time if the time-specific effect is a common factor

Model
- $y_{it}$ is the observed outcome for obs $i$, $i = 1, ..., J+1$, in period $t = 1, ..., T_o, ..., T$
- Obs 1 is treated; the remaining obs, $2, ..., J+1$, are never treated
- Timing of treatment effects
  1 No Anticipatory Effects: $T_o$ is the period prior to obs 1 being treated
  2 Anticipatory Effects: $T_o$ is the period prior to any anticipatory effects for obs 1 beginning
- Outcomes in the absence of treatment

$$y_{it} = y^N_{it} = \delta_t + \theta_t Z_i + \lambda_t u_i + \varepsilon_{it}$$

- Outcomes with treatment

$$y_{it} = y^I_{it} = y^N_{it} + \alpha_{it}$$

Synthetic control is defined as

$$\sum_{j=2}^{J+1} \omega_j y_{jt} = \sum_{j=2}^{J+1} \omega_j(\delta_t + \theta_t Z_j + \lambda_t u_j + \varepsilon_{jt})$$

where $\omega_j$ is the weight given to control $j$ and
- $\sum_{j=2}^{J+1} \omega_j = 1$
- $\omega_j \ge 0\ \forall j$

Conditional on the choice of weights, $\omega_j$, the period-specific treatment effect is estimated as

$$\hat{\alpha}_{1t} = y_{1t} - \sum_{j=2}^{J+1} \omega_j y_{jt}$$

Requires a SUTVA-type assumption that the treatment does not impact outcomes in the control pool

Weights are chosen to match moments of the data in periods $t \le T_o$
- Define

$$y^K_i = \sum_{s=1}^{T_o} k_s y_{is}$$

  where $K = (k_1, ..., k_{T_o})$ is a vector of weights and thus $y^K_i$ represents a particular linear combination of pre-treatment outcomes for obs $i$
- Given $M$ unique linear combinations, define the vector of pre-treatment outcomes for obs 1 as

$$X_1 = (Z_1', y^{K_1}_1, ..., y^{K_M}_1)'$$

  with dimension $R \times 1$
- Define the $R \times J$ matrix of variables for the remaining obs $i$, $i = 2, ..., J+1$, as $X_0$, where column $j$ is given by

$$(Z_{j+1}', y^{K_1}_{j+1}, ..., y^{K_M}_{j+1})'$$

- Weights are chosen to minimize some distance function

$$\|X_1 - X_0 W\|_V = \sqrt{(X_1 - X_0 W)'V(X_1 - X_0 W)}$$

  where $V$ is an $R \times R$ symmetric, positive semidefinite matrix
- In practice, $V$ is chosen to minimize the MSE of the pre-intervention predictions
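To make the weighting step concrete, here is a deliberately small sketch with only two controls, so the weights $(\omega, 1-\omega)$ can be found by a grid search on $[0, 1]$. It assumes $V$ is the identity matrix and matches on pre-treatment outcomes only; both function names and all numbers are hypothetical, and this is not the algorithm used by the Stata code referenced in these notes.

```python
def synthetic_weight(x1, x0_a, x0_b, grid=101):
    """Grid-search the weight on control A (control B gets 1 - w).

    x1, x0_a, x0_b: pre-treatment characteristics of the treated unit and the
    two controls. Distance is squared Euclidean (V = identity, an assumption).
    """
    best_w, best_dist = 0.0, float("inf")
    for i in range(grid):
        w = i / (grid - 1)                    # keeps w in [0, 1], weights sum to one
        dist = sum((t - (w * a + (1 - w) * b)) ** 2
                   for t, a, b in zip(x1, x0_a, x0_b))
        if dist < best_dist:
            best_w, best_dist = w, dist
    return best_w


def treatment_effects(y1_post, ya_post, yb_post, w):
    """Period-specific effects: treated outcome minus synthetic control outcome."""
    return [t - (w * a + (1 - w) * b)
            for t, a, b in zip(y1_post, ya_post, yb_post)]


# Hypothetical pre-period outcomes; the treated path equals 0.3*A + 0.7*B.
pre_a, pre_b = [10.0, 20.0, 30.0], [0.0, 10.0, 40.0]
pre_treated = [3.0, 13.0, 37.0]
w = synthetic_weight(pre_treated, pre_a, pre_b)
effects = treatment_effects([52.0], [40.0], [50.0], w)
```

With more than two controls the grid search would be replaced by a constrained quadratic optimizer over the simplex.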

Inference is handled by
- Re-doing the analysis, treating each obs $i$, $i = 2, ..., J+1$, in turn as if it were treated after period $T_o$, with the remaining obs as the pool of potential controls
- This yields a dbn of treatment effect estimates under $H_o$ of no treatment effect
- If the actual estimates of $\hat{\alpha}_{1t}$ look very different, this is evidence of a statistically meaningful treatment effect

Code is available in Stata at http://www.mit.edu/~jhainm/synthpage.html.

Example: Abadie et al. (2010)

Selection on Unobservables: Instrumental Variables

Refer to ECO 6374 for refresher on basics...

Terminology
- "Structural" model

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$

- First-stage model

$$x_i = \pi_0 + \pi_1 z_i + u_i$$

- Reduced form model

$$y_i = (\beta_0 + \beta_1\pi_0) + \beta_1\pi_1 z_i + (\varepsilon_i + \beta_1 u_i) = \tilde{\pi}_0 + \tilde{\pi}_1 z_i + \tilde{\varepsilon}_i$$

Goal: devise an alternative estimation technique to obtain consistent estimates when $E[\varepsilon|x] \ne 0$

- Solution: identify $\beta$ from exogenous variation in $x$ isolated using instruments, $z$
- $z$ is a valid IV for $x$ iff

(IV.i) First-stage: $E[z'x] \ne 0$
(IV.ii) Exogeneity: $E[z'\varepsilon] = 0$
(IV.iii) Exclusion: $E[y|x, z] = E[y|x]$

  where $z$ and $x$ are both $N \times K$ matrices
- Exogenous $x$'s serve as instruments for themselves
- Need a unique instrument for each endogenous var

Stata: -ivreg2-, -xtivreg2-

Several issues remain under scrutiny in the literature
1 Choice of estimation technique
2 Properties and inference with weak IVs $\Rightarrow E[z'x] \approx 0$
3 Properties and inference with endogenous IVs $\Rightarrow E[z'\varepsilon] \ne 0$

Selection on Unobservables: Estimators

1 IV
2 Two-Stage Least Squares (TSLS or 2SLS)
3 Nagar
4 Split-sample or Two-Sample IV (data set #1: $\{x, z\}_{i=1}^{N_1}$; data set #2: $\{y, z\}_{i=1}^{N_2}$)
5 JIVE
6 LIML
7 Fuller (modified LIML)
8 GMM

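Selection on Unobservables: Estimators: IV Estimator

Estimator is given by

$$y = x\beta + \varepsilon$$

$$\Rightarrow z'y = z'x\beta + z'\varepsilon \ \to\ \beta = (z'x)^{-1}z'y \ \text{ if } z'\varepsilon = 0$$

$$\Rightarrow \hat{\beta}_{IV} = (z'x)^{-1}z'y$$

Estimated asymptotic variance is given by

$$\widehat{Var}(\hat{\beta}_{IV}) = \hat{\sigma}^2(z'x)^{-1}(z'z)(x'z)^{-1}; \qquad \hat{\sigma}^2 = \frac{1}{N-K}\sum_i \hat{\varepsilon}_i^2$$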

Selection on Unobservables: Estimators: Two-Stage Least Squares

IV estimator requires 1 instrument per endogenous variable; otherwise $z'x$ is an $L \times K$ matrix ($L > K$) with rank $= K$, and the inverse does not exist

Discarding additional IVs is probably inefficient

TSLS is an alternative estimator that does not face this problem

In multivariate regression, this is formalized as
- First-stage

$$\hat{x} = z(z'z)^{-1}z'x$$

  and replacing $z$ with $\hat{x}$ in the IV estimator
- Estimator now given by

$$\hat{\beta}_{TSLS} = (\hat{x}'\hat{x})^{-1}\hat{x}'y = [x'z(z'z)^{-1}z'x]^{-1}x'z(z'z)^{-1}z'y$$

Notes ...

In a multiple regression...
- With multiple endogenous vars, need at least as many IVs as endogenous $x$'s; do not interpret the instruments as "this IV for this $x$, that IV for that $x$"
- Where the second-stage contains other exogenous vars, these vars must be included in the first-stage

If strictly more IVs than endogenous vars, then
- Model is overidentified (as opposed to exactly identified)
- Enables additional tests for instrument validity

Estimators are CAN, but biased
- Intuition behind the bias is that the first-stage OLS estimates, $\hat{\theta}$, are correlated with the error term from the structural model, $\varepsilon$, which implies that the fitted values, $\hat{x}$, are also correlated with $\varepsilon$

Incorrectly treating other covariates in the model as exogenous $\Rightarrow$ inconsistent estimates if the instrument(s) are correlated with these covariates

Selection on Unobservables: Estimators: JIVE, SSIV, Nagar

Breaking the correlation between $\hat{\theta}$ and $\varepsilon$ is the motivation behind JIVE and SSIV

SSIV (Angrist & Krueger 1992, 1995)
- Approach
  - Divide sample into two groups: $i = 1, ..., N_1$ and $i = N_1 + 1, ..., N$
  - Estimate first-stage using $N_2$ obs, $i = N_1 + 1, ..., N$
  - Predict $\hat{x}$ out-of-sample for the first $N_1$ obs
  - Estimate second-stage using the first $N_1$ obs
- Estimators

$$\hat{\beta}_{SSIV} = (\hat{x}_{21}'\hat{x}_{21})^{-1}\hat{x}_{21}'y$$

$$\hat{\beta}_{USSIV} = (\hat{x}_{21}'x_1)^{-1}\hat{x}_{21}'y$$

  where $\hat{x}_{21} = z_1(z_2'z_2)^{-1}z_2'x_2$ and subscript 1 (2) refers to estimation on $i = 1, ..., N_1$ ($i = N_1 + 1, ..., N$)
- SSIV uses OLS in the second-stage; USSIV stands for Unbiased SSIV and uses IV in the second-stage

JIVE
- Approach
  - Estimate first-stage using $N - 1$ obs
  - Predict $\hat{x}$ out-of-sample for the excluded obs
  - Repeat for all $N$ obs and estimate second-stage using all $N$ obs
- Estimators

$$\hat{\beta}_{JIVE} = (\hat{x}_{-i}'\hat{x}_{-i})^{-1}\hat{x}_{-i}'y$$

$$\hat{\beta}_{UJIVE} = (\hat{x}_{-i}'x)^{-1}\hat{x}_{-i}'y = (x'C_J'x)^{-1}x'C_J'y$$

  where $\hat{x}_{-i}$ is the matrix whose $i$th row is $z_i\hat{\pi}_{-i}$, $\hat{\pi}_{-i}$ is the vector of first-stage coeffs with obs $i$ removed, and $C_J = (I - D_{P_z})^{-1}(P_z - D_{P_z})$, $D_{P_z} = diag(P_z)$, and $P_z = z(z'z)^{-1}z'$
- JIVE uses OLS in the second-stage; UJIVE stands for Unbiased JIVE and uses IV in the second-stage
- Stata: -jive-

Nagar estimator is a bias-corrected TSLS estimator
- Nagar (1959), Hahn & Hausman (2002)
- Estimator given by

bβN = x 0 Pz KNIN

where $K$ = # IVs and $P_z = z(z'z)^{-1}z'$
- Hahn & Hausman (2002) discuss the poor performance of the Nagar estimator when the model is close to being unidentified

Selection on Unobservables: Estimators: LIML, Fuller, and k-Class Estimators

k-class estimators can all be written as

$$\hat{\beta}_k = [x'(I_N - kM_z)x]^{-1}x'(I_N - kM_z)y$$

for different values of $k$, where $M_z = I_N - z(z'z)^{-1}z'$

$k = 0 \Rightarrow$ OLS

$k = 1 \Rightarrow$ TSLS

$k = \lambda \Rightarrow$ LIML

$k = \lambda - \dfrac{\alpha}{N - L} \Rightarrow$ Fuller

$k = 1 + \dfrac{L - K}{N} \Rightarrow$ Nagar

For LIML, $\lambda$ is a minimum eigenvalue

For Fuller, $\alpha$ is user-specified (typically 1) and $L$ = # included + excluded instruments

For Nagar, $L - K$ = # over-identifying restrictions
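The nesting of OLS and TSLS within the k-class can be checked numerically. The sketch below assumes one endogenous regressor, one instrument, and a constant that is partialled out by demeaning, so every matrix product collapses to a scalar sum; the function name and the data are hypothetical.

```python
from statistics import fmean


def k_class(y, x, z, k):
    """Scalar k-class estimator with a single regressor x and instrument z.

    All variables are demeaned first (the constant is partialled out), so
    x'M_z x = sxx - szx^2 / szz and x'M_z y = sxy - szx * szy / szz.
    """
    yd = [v - fmean(y) for v in y]
    xd = [v - fmean(x) for v in x]
    zd = [v - fmean(z) for v in z]
    sxx = sum(a * a for a in xd)
    sxy = sum(a * b for a, b in zip(xd, yd))
    szz = sum(a * a for a in zd)
    szx = sum(a * b for a, b in zip(zd, xd))
    szy = sum(a * b for a, b in zip(zd, yd))
    x_m_x = sxx - szx ** 2 / szz            # x' M_z x
    x_m_y = sxy - szx * szy / szz           # x' M_z y
    return (sxy - k * x_m_y) / (sxx - k * x_m_x)


# Hypothetical data.
z = [0.0, 1.0, 2.0, 3.0, 4.0]
x = [1.0, 0.0, 3.0, 2.0, 5.0]
y = [2.0, 1.0, 4.0, 6.0, 5.0]
beta_ols = k_class(y, x, z, 0.0)    # equals sxy / sxx
beta_tsls = k_class(y, x, z, 1.0)   # equals szy / szx, the simple IV estimator
```

Intermediate or data-dependent values of `k` (LIML, Fuller, Nagar) can be passed to the same function once `k` has been computed.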

Selection on Unobservables: IV: Specification Tests

Much specification testing is required when utilizing IV in applied research

Types of tests available
- Tests of endogeneity: $E[x'\varepsilon] \overset{?}{=} 0$
- Tests of instrument relevance: $E[z'x] \overset{?}{=} 0$
- Tests of overidentification: $E[z'\varepsilon] \overset{?}{=} 0$ (partial test only)
- Tests for weak instruments: $E[z'x] \approx 0$

Covered in ECO 6374

With weak IVs, some recommend LIML, others Fuller, others UJIVE, others TSLS (which tends to have a larger bias, similar RMSE)

Selection on Unobservables: IV: Imperfect Instruments

Recent work has explored what can be learned if $z$ is an imperfect instrumental variable (IIV)

Two possible imperfections:
1 $z$ is also endogenous
2 $z$ is not excludable from the second-stage

Nevo & Rosen (2010) and Ashley (2009) address endogeneity

Conley et al. (2010) address excludability

Note: These are intimately related since if $z$ is incorrectly treated as excludable, then it will be correlated with the second-stage composite error that now includes the error and $z$

Nevo & Rosen (2010) ...

Setup
- Model given by

$$y_i = \beta x_i + w_i\delta + \varepsilon_i$$

  where $x$ is a single endogenous regressor, $w$ is exogenous (or alternatively is endogenous with valid instruments), and $z$ is a $1 \times k_z$ vector of imperfect instruments for $x$
- $z$ is an imperfect IV (IIV) in the sense that it is also correlated with $\varepsilon$
- Assumptions:

(IIV.i) Sign of correlation: $\rho_{x\varepsilon}\rho_{z_j\varepsilon} \ge 0$, $j = 1, ..., k_z$
(IIV.ii) Degree of endogeneity: $|\rho_{x\varepsilon}| \ge |\rho_{z_j\varepsilon}|$, $j = 1, ..., k_z$
(IIV.iii) True model: $y_i = \beta x_i + w_i\delta + \varepsilon_i$

(IIV.ii) contrasts with the classical IV assumption that $\rho_{z_j\varepsilon} = 0$

Define

$$\lambda_j = \frac{\rho_{z_j\varepsilon}}{\rho_{x\varepsilon}}$$

which is in the unit interval under (IIV.i), (IIV.ii)

If $\lambda_j$ were known, then a valid IV for $x$ is

$$V_j(\lambda_j) = \sigma_x z_j - \lambda_j\sigma_{z_j}x$$

However, $\Lambda = [\lambda_1 \cdots \lambda_{k_z}]$ is unknown, but lies in the unit cube in $\mathbb{R}^{k_z}$-space

Intuitively, searching over feasible values of $\Lambda$, one may bound $\beta$

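Consider $k_z = 1$
- Partial out the effects of $w$ by defining

$$\tilde{y}_i = y_i - w_i[(w'w)^{-1}w'y]$$

$$\tilde{x}_i = x_i - w_i[(w'w)^{-1}w'x]$$

  (Note: If $w$ is endogenous with valid IVs, then the OLS coeffs are replaced by IV coeffs.)
- Under (IIV.i)-(IIV.iii) and assuming without loss of generality that $\rho_{x\varepsilon} \ge 0$, obtain the following bounds:
  - Case I. $(\sigma_{z\tilde{x}}\sigma_x - \sigma_{x\tilde{x}}\sigma_z)\sigma_{z\tilde{x}} > 0$

$$\beta \in \begin{cases} [\beta^{IV}_{V(1)},\ \beta^{IV}_z] & \text{if } \sigma_{z\tilde{x}} < 0 \\ [\beta^{IV}_z,\ \beta^{IV}_{V(1)}] & \text{if } \sigma_{z\tilde{x}} > 0 \end{cases}$$

  - Case II. $(\sigma_{z\tilde{x}}\sigma_x - \sigma_{x\tilde{x}}\sigma_z)\sigma_{z\tilde{x}} \le 0$

$$\beta \in \begin{cases} [\max\{\beta^{IV}_z, \beta^{IV}_{V(1)}\},\ \infty) & \text{if } \sigma_{z\tilde{x}} < 0 \\ (-\infty,\ \min\{\beta^{IV}_z, \beta^{IV}_{V(1)}\}] & \text{if } \sigma_{z\tilde{x}} > 0 \end{cases}$$

Additional work to bound $\delta$ is also possible

Extension to $k_z > 1$
- Bounds can be tightened by obtaining bounds for each $z$ individually and then computing the final bounds as the intersection of the $k_z$ bounds
- Formally
  - For each $z_j$, obtain $B_j = [\beta^l_j, \beta^u_j]$
  - Final bounds given by

$$\beta \in \big[\max_j\{\beta^l_j\},\ \min_j\{\beta^u_j\}\big]$$

  - In Case II, these bounds are one-sided; one trick may be to try to define a new IV that is a weighted average of two of the IVs such that $(\sigma_{q\tilde{x}}\sigma_x - \sigma_{x\tilde{x}}\sigma_q)\sigma_{q\tilde{x}} > 0$, where $q_i = \gamma z_{ji} + (1 - \gamma)z_{j'i}$
- Need to be careful, though, and make sure different $z$'s estimate the same parameter (discussed later)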

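Conley et al. (2010) ...

Setup

$$y_i = x_i\beta + z_i\gamma + \varepsilon_i$$

$$x_i = z_i\pi + u_i$$

where $x$ is a $k_x$-dimensional vector of endogenous regressors, $z$ is a $k_z$-dimensional vector of instruments, $k_z \ge k_x$, and $E[z'\varepsilon] = 0$

Classical IV requires the assumption that $\gamma = 0$
- With $k_x = k_z = 1$, we have

$$\text{plim } \hat{\beta}_{IV} = \beta + \frac{\sigma_{z\tilde{\varepsilon}}}{\sigma_{xz}} = \beta + \frac{\gamma\sigma^2_z}{\pi\sigma^2_z} = \beta + \frac{\gamma}{\pi}$$

  where $\tilde{\varepsilon}_i = z_i\gamma + \varepsilon_i$ is the composite error
- Thus, IV is asymptotically biased when $\gamma \ne 0$ and the bias is decreasing in $\pi$ and increasing in $\gamma$
- Authors refer to deviations from $\gamma = 0$ as plausible exogeneity

Approach
- Track estimates $\hat{\beta}(\gamma) = \hat{\beta}_{IV} - \gamma/\hat{\pi}$ for different values of $\gamma$
- Estimates will be more sensitive to $\gamma$ the weaker the first-stage relationship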
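A small sketch of tracking $\hat{\beta}(\gamma)$ over a grid of $\gamma$ values, for the scalar case $k_x = k_z = 1$ with a constant. The helper names are hypothetical and the data are constructed without noise so the result can be checked by hand; this is not the authors' Stata implementation.

```python
from statistics import fmean


def cov(a, b):
    """Sample covariance (divisor n; the divisor cancels in the ratios below)."""
    abar, bbar = fmean(a), fmean(b)
    return sum((ai - abar) * (bi - bbar) for ai, bi in zip(a, b)) / len(a)


def beta_path(y, x, z, gammas):
    """Return [(gamma, beta_hat(gamma))] with beta_hat(gamma) = beta_IV - gamma / pi_hat."""
    beta_iv = cov(z, y) / cov(z, x)     # simple IV estimator
    pi_hat = cov(z, x) / cov(z, z)      # first-stage slope
    return [(g, beta_iv - g / pi_hat) for g in gammas]


# Hypothetical data: x = 1 + 4z and y = 2x + 0.5z, so beta = 2, gamma = 0.5, pi = 4.
z = [0.0, 1.0, 2.0, 3.0]
x = [1.0 + 4.0 * zi for zi in z]
y = [2.0 * xi + 0.5 * zi for xi, zi in zip(x, z)]
path = beta_path(y, x, z, [0.0, 0.25, 0.5])
```

At $\gamma = 0$ the path returns the uncorrected IV estimate (2.125 here), and at the true $\gamma = 0.5$ it returns $\beta = 2$. A smaller first-stage slope would make the path steeper in $\gamma$.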

Authors present several possible methods of inference, only some of which are presented here

Method #1. Union of CIs with $\gamma$ Support Assumption
- Suppose the true value of $\gamma = \gamma_0 \in G_{k_z}$, with known bounds
- If $\gamma_0$ were known, then IV/TSLS applied to

$$y_i - z_i\gamma_0 = x_i\beta + \varepsilon_i$$

  using $z$ as instruments is consistent for $\beta$
- With $\gamma_0$ unknown, but contained in $G_{k_z}$, one can
  - Apply IV/TSLS to a grid of values for $\gamma$ from $G_{k_z}$
  - For each value, $\gamma_s$, $s = 1, ..., S$, obtain the $(1 - \alpha)$% CI for $\beta$
  - Compute a final CI as the union of these $S$ CIs

$$CI(1 - \alpha) = \bigcup_{\gamma \in G_{k_z}} CI(1 - \alpha, \gamma)$$

    which has an asymptotic coverage probability of at least $1 - \alpha$
  - If some prior info is available, may want to weight different $\gamma$'s differently

Method #2. $\gamma$ Local-to-Zero Approximation
- $\gamma$ is treated as unknown, but coming from a known dbn

$$\gamma = \frac{\eta}{\sqrt{N}}, \quad \eta \sim G$$

  where prior info on $\gamma$ translates to knowing the dbn $G$
- The normalization by $\sqrt{N}$ ensures that uncertainty about $z$ being a valid instrument and sampling error are of the same order and so both factor into the asymptotic dbn of $\hat{\beta}$
- Assuming $\gamma \sim N(\mu_\gamma, \Omega_\gamma)$ leads to the following approximate dbn

$$\hat{\beta} \overset{approx}{\sim} N(\beta + A\mu_\gamma,\ V_{IV} + A\Omega_\gamma A')$$

  where $A = (x'z(z'z)^{-1}z'x)^{-1}x'z$
- If $\mu_\gamma = 0$, then this approach simply leads to a revised variance for the IV/TSLS estimator

Stata ado files available on Conley's website

Selection on Unobservables: IV: Heterogeneous Treatment Effects

Assume a binary endogenous regressor, $D$, and a binary instrument, $z$

Motivation arises from the fact that the treatment effect may vary across $i$ and agents may act on observation-specific gains when making the treatment decision

Admitting this possibility implies that one must think more carefully about what parameter one is estimating

Linear model

Setup (from earlier potential outcomes framework)

$$y_i \equiv y_{0i} + D_i(y_{1i} - y_{0i})$$

$$= \alpha_0 + \tilde{x}_i\beta + \upsilon_{0i} + D_i(\alpha_1 + \tilde{x}_i\beta + \upsilon_{1i} - \alpha_0 - \tilde{x}_i\beta - \upsilon_{0i})$$

$$= \alpha_0 + \tilde{x}_i\beta + (\alpha_1 - \alpha_0 + \upsilon_{1i} - \upsilon_{0i})D_i + \upsilon_{0i}$$

$$\equiv x_i\beta + \Delta_i D_i + \varepsilon_i$$

Define $\Delta_i = (\alpha_1 - \alpha_0) + (\upsilon_{1i} - \upsilon_{0i}) \equiv \Delta + \Delta^*_i$

Substitution implies

$$y_i = x_i\beta + \Delta D_i + (\Delta^*_i D_i + \varepsilon_i)$$

where $\Delta^*_i D_i + \varepsilon_i$ is the composite error term, which differs from the usual error term for the treated

A valid IV in the homogeneous treatment effects setup requires

$$E[\varepsilon_i|x_i, D_i, z_i] = E[\varepsilon_i|x_i, D_i]$$

but now

$$E[\Delta^*_i D_i + \varepsilon_i|x_i, D_i, z_i] = E[\Delta^*_i D_i + \varepsilon_i|x_i, D_i]$$

is required

Thus, $z$ must be
- Correlated with $D_i$ (as usual)
- Uncorrelated with the error term from the structural model and individual-specific gains (or losses) from treatment
  - Not possible unless (i) $\Delta^*_i = 0\ \forall i$ (implying a constant treatment effect) or (ii) $\Delta^*_i \perp D_i|x_i$ (implying that agents either do not know or do not act on specific gains ... no essential heterogeneity)
  - Model with $\Delta^*_i$ and $D_i$ correlated is known as the Correlated Random Coefficients (CRC) model

Much more restrictive requirement
- Example: if $z$ is an exogenous variable representing the cost of participation in the treatment (e.g., distance to job training center), then high $z$ will lead to no participation unless the benefit from participation, $\Delta_i$, is very high; if $z$ is low, one will participate if $\Delta_i$ is low or high $\Rightarrow$ positive correlation between $z$ and $\Delta_i$ conditional on $D_i$

If $z$ is uncorrelated with $\varepsilon$, but correlated with $\Delta_i$, then IV estimates are still useful, but identify a different parameter

Parameter known as the local average treatment effect (LATE)

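Formally, given the model (ignoring $x$)

$$y_i = \alpha + \Delta D_i + (\Delta^*_i D_i + \varepsilon_i)$$

and an instrument, $z$, we have

$$\text{plim } \hat{\Delta}_{OLS} = \frac{Cov(y, D)}{Var(D)} = \Delta + \frac{Cov(\varepsilon, D) + Cov(\Delta^* D, D)}{Var(D)} \ne \Delta$$

$$\text{plim } \hat{\Delta}_{IV} = \frac{Cov(y, z)}{Cov(D, z)} = \Delta + \frac{Cov(\varepsilon, z) + Cov(\Delta^* D, z)}{Cov(D, z)} = \Delta + \frac{Cov(\Delta^* D, z)}{Cov(D, z)} \ne \Delta$$

where the last inequality holds unless (i) $\Delta^*_i = 0\ \forall i$ or (ii) $\Delta^*_i \perp D_i|x_i$ (as stated above)

How do we interpret $\hat{\Delta}_{IV}$?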

Assume a binary endogenous regressor, $D$, and a binary instrument, $z$, and no other covariates (for simplicity)

Four potential subpopulations

                          z = 0      z = 1
    Never Takers (NT)     D = 0      D = 0
    Defiers (DF)          D = 1      D = 0
    Compliers (C)         D = 0      D = 1
    Always Takers (AT)    D = 1      D = 1

Compliers are the key, as their treatment status varies with the instrument

Recall, the Wald estimator

$$\hat{\Delta}_{IV} = \frac{E[y|z=1] - E[y|z=0]}{\Pr(D=1|z=1) - \Pr(D=1|z=0)}$$

Numerator terms may be expressed as

$$E[y|z=j] = E[y_1|AT]\Pr(AT) + E[y_j|C]\Pr(C) + E[y_{(1-j)}|DF]\Pr(DF) + E[y_0|NT]\Pr(NT), \quad j = 0, 1$$

Denominator terms may be expressed as

$$\Pr[D=1|z=j] = \Pr[D=1|z=j, AT]\Pr(AT) + \Pr[D=1|z=j, C]\Pr(C) + \Pr[D=1|z=j, DF]\Pr(DF) + \Pr[D=1|z=j, NT]\Pr(NT), \quad j = 0, 1$$

$$= \begin{cases} \Pr(AT) + \Pr(C) & \text{if } j = 1 \\ \Pr(AT) + \Pr(DF) & \text{if } j = 0 \end{cases}$$

Wald estimator reduces to

$$\hat{\Delta}_{IV} = \frac{\{E[y_1|C]\Pr(C) + E[y_0|DF]\Pr(DF)\} - \{E[y_0|C]\Pr(C) + E[y_1|DF]\Pr(DF)\}}{\Pr(C) - \Pr(DF)}$$

which is a weighted average of the treatment effect for compliers and the negative of the treatment effect for defiers

Assumptions

(LATE.i) Independence: $\{y_0, y_1, D_0, D_1\} \perp z$, where $D_j$, $j = 0, 1$, are potential treatment assignments
(LATE.ii) Exclusion: $E[y_0|z] = E[y_0]$; $E[y_1|z] = E[y_1]$
(LATE.iii) First-Stage/Compliers: $\Pr(C) > 0 \Rightarrow \Pr(D=1|z)$ is a non-trivial function of $z$
(LATE.iv) Monotonicity: $\Pr(D_i=1|z_i=1) > \Pr(D_i=1|z_i=0)\ \forall i \Rightarrow \Pr(DF) = 0$

Imposing these assumptions $\Rightarrow$

$$\hat{\Delta}_{IV} = \hat{\Delta}_{LATE} = E[y_1 - y_0|C]$$

which is a parameter defined with respect to a particular instrument
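A short numerical check of the LATE result, using a made-up population with no defiers. The function name `wald_estimate` and all numbers are hypothetical. Gains differ by type (never takers, compliers, always takers), yet the Wald ratio returns the complier gain only.

```python
from statistics import fmean


def wald_estimate(y, d, z):
    """Wald estimator: reduced-form difference divided by first-stage difference."""
    y1 = [yi for yi, zi in zip(y, z) if zi == 1]
    y0 = [yi for yi, zi in zip(y, z) if zi == 0]
    d1 = [di for di, zi in zip(d, z) if zi == 1]
    d0 = [di for di, zi in zip(d, z) if zi == 0]
    return (fmean(y1) - fmean(y0)) / (fmean(d1) - fmean(d0))


# Hypothetical population: 2 never takers, 4 compliers, 2 always takers,
# observed once with z = 0 and once with z = 1 (z is as good as random).
types = ["NT"] * 2 + ["C"] * 4 + ["AT"] * 2
gain = {"NT": -5.0, "C": 3.0, "AT": 10.0}   # y1 - y0 by type; y0 = 1 for everyone

y, d, z = [], [], []
for zi in (0, 1):
    for t in types:
        di = 1 if t == "AT" or (t == "C" and zi == 1) else 0   # monotonicity holds
        y.append(1.0 + gain[t] * di)
        d.append(di)
        z.append(zi)

late_hat = wald_estimate(y, d, z)
```

The population ATE in this example is $(2 \cdot (-5) + 4 \cdot 3 + 2 \cdot 10)/8 = 2.75$, which differs from the complier effect of 3 that the Wald estimator recovers.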

Comments
- LATE is a well-defined economic parameter
- Whether it is an interesting parameter is a different matter
- Not possible to know who the compliers are in the data
- Interpretation is similar, but derivation more complex, if $D$ or $z$ is continuous
  - Continuous $z$ estimates the local instrumental variable (LIV) parameter (Heckman and Vytlacil 1999)
- With multiple instruments, things become thorny ... different instruments, even if all valid, potentially identify different parameters!
  - No reason why different IV estimates should be the same
  - Using multiple IVs yields a weighted average of different LATEs

DiNardo & Lee (2011) provide an alternative interpretation of the IV estimand
- They replace the monotonicity assumption with what they call a probabilistic monotonicity assumption
- The result is that $\hat{\Delta}_{IV}$ is shown to be a weighted average of $\Delta_i$ where the weights are proportional to the increase in $\Pr(D_i=1|z_i=1) - \Pr(D_i=1|z_i=0)$
  - Under the monotonicity assumption,

$$\Pr(D_i=1|z_i=1) - \Pr(D_i=1|z_i=0) = \begin{cases} 0 & \text{if type} = AT, NT \\ 1 & \text{if type} = C \end{cases}$$

    so that only compliers receive positive weight
  - This follows from the assumption that $D$ is a deterministic fn of $z$
  - Probabilistic monotonicity relaxes this assumption and allows $D$ to be a nondecreasing fn of $z$ (conditional on type)

Not possible to infer anything about $\Delta_{ATE}$, $\Delta_{ATT}$, or $\Delta_{ATU}$ without additional assumptions about how compliers compare to the rest of the population
- Vytlacil et al. (2009) working on when one can learn the sign of $\Delta_{ATE}$
- DiNardo & Lee (2011) discuss extrapolating to the $\Delta_{ATE}$
- Heckman et al. (2010) propose two tests of the CRC assumption

$$H_o: \Delta_i \perp D_i|x_i$$

  - Test #1 based on comparison of different (valid) IV estimates; under $H_o$ different IVs provide consistent estimates of the same parameter even if they lead to different sub-populations of compliers
  - Test #2 based on testing for a linear relationship between $y$ and the estimated propensity score conditional on $x$

Selection on Unobservables: IV: Finding Instruments

Economic theory ... what determines participation, but not outcomes?

Exogenous variation in program availability (across space or over time) ... must be exogenous

Natural experiments ... twins, sex composition, miscarriages, Mariel Cuban boatlift, Russian immigration to Israel

Randomized experiments (even if imperfect compliance) ... Project STAR

Fuzzy regression discontinuity design

Recall from the sharp RD case that we require the existence of the following limits, where $\bar{s}$ denotes the threshold

$$D^+ = \lim_{s \downarrow \bar{s}} \Pr(D=1|s)$$

$$D^- = \lim_{s \uparrow \bar{s}} \Pr(D=1|s)$$

and $D^+ \ne D^-$
- Sharp RD setup implies $D^+ = 1$ and $D^- = 0$
- Fuzzy RD setup implies $1 \ge D^+ > D^- \ge 0$

Formally

(FRD.i) Treatment assignment is a discontinuous function of $s$ (with a known threshold, $\bar{s}$)

$$D_i = D(s_i, \upsilon_i)$$

$$\Pr(D=1) = \Pr(D=1|s \ge \bar{s})\Pr(s \ge \bar{s}) + \Pr(D=1|s < \bar{s})\Pr(s < \bar{s})$$

(FRD.ii) Positive density at the threshold: $f_S(\bar{s}) > 0$
(FRD.iii) Outcomes are continuous in $s$ at least around $\bar{s}$ and do not depend on whether $s \gtrless \bar{s}$
(FRD.iv) For each agent, the dbn of $s$ is continuous at least around $\bar{s}$

Notes
- Endogenous treatment variable, $D$, depends on the observed score variable, $s$, and a stochastic element
- Discrete jump in $\Pr(D=1)$ at $\bar{s}$
- Example: $\Pr(D=1) = \max\{0, 0.5s + 0.25\,I(s > 0.5) + \upsilon\}$

[Figure: plot of the example; horizontal axis runs from 0 to 1.]

Implies $D_i = E[D|s_i] + \upsilon_i$, where $Cov(\varepsilon, \upsilon) \ne 0$

OLS estimation of

$$y_i = x_i\beta + \Delta D_i + f(s_i) + \varepsilon_i$$

where $x$ is a vector of exogenous controls, is biased, even with a flexible function of $s$ included

Solution
- Estimate the propensity score, where $f(s)$ is included along with the indicator $I(s > \bar{s})$ for crossing the threshold $\bar{s}$ $\Rightarrow \widehat{p(D)}$
- Estimate by OLS

$$y_i = x_i\beta + \Delta\widehat{p(D_i)} + f(s_i) + \varepsilon_i$$

- Equivalent to TSLS, with $I(s > \bar{s})$ as the instrument, when $f(s)$ is chosen parametrically

Interpretation
- Typical interpretation: RD identifies the LATE at the threshold $\bar{s}$
- DiNardo & Lee (2011) interpret the estimated parameter as a weighted average of $\Delta_i$ where the weights are proportional to (i) the probability of $s_i$ being in the neighborhood of $\bar{s}$ and (ii) the influence of crossing the threshold, $\bar{s}$, on the probability of receiving the treatment

Selection on Unobservables: Methods Not Requiring Exclusion Restrictions

Several methods exist that do not rely on a typical exclusion restriction for identification

1 Heckman bivariate normal selection model
2 Millimet & Tchernis (2011) bias-corrected estimator
3 Higher moments
4 Covariance restrictions

All such methods must replace the assumption concerning an exclusion restriction with some other identifying assumption (there is no such thing as a free lunch)

Selection on Unobservables
Heckman Bivariate Normal Selection Model

Requires fairly strong parametric assumptions to circumvent the selection on unobservables problem

Also useful to solve problems of non-random sample selection (discussed later)

Treatment effects model with common effect

$$y_{0i} = x_i\beta_0 + \varepsilon_i$$

$$y_{1i} = x_i\beta_1 + \varepsilon_i$$

$$y_i = D_i y_{1i} + (1 - D_i) y_{0i}$$

$$D_i^* = z_i\gamma + u_i$$

$$D_i = \begin{cases} 1 & \text{if } D_i^* > 0 \\ 0 & \text{if } D_i^* \leq 0 \end{cases}$$

Notes
I $\varepsilon_i$ = common error component (or common effect) in both potential outcome equations
I $\beta$s allowed to differ across outcome equations
I $D_i^*$ = latent indicator of treatment status
I Model rules out selection on observables assumption since unobservables associated with treatment status, $u$, are correlated with unobservables affecting outcomes conditional on $x$

Assumptions

(BVN.i) $\varepsilon, u \sim N_2(0, 0, \sigma_\varepsilon^2, \sigma_u^2, \rho)$
(BVN.ii) $\varepsilon, u \perp x, z$
(BVN.iii) $\sigma_u^2 = 1$

Parameters of interest
I Given the setup, individual-specific treatment effect is given by

$$\Delta_i = y_{1i} - y_{0i} = x_i(\beta_1 - \beta_0)$$

I Average treatment effects are

$$\Delta^{ATE} = E[\Delta_i] = E[X_i](\beta_1 - \beta_0)$$

$$\Delta^{ATT} = E[\Delta_i \mid D_i = 1] = E[X_i \mid D_i = 1](\beta_1 - \beta_0)$$

$$\Delta^{ATU} = E[\Delta_i \mid D_i = 0] = E[X_i \mid D_i = 0](\beta_1 - \beta_0)$$

I Implies consistent estimates of all three parameters require consistent estimates of $\beta_0, \beta_1$
I Two naïve options:
F Split sample into $D = 1$ and $D = 0$, and regress $y$ on $x$ via OLS in each sub-sample
F Pool sample, regress $y$ on $x, Dx$
I Under selection on unobservables, neither option produces consistent estimates

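Conditional expectations (following from the properties of conditional normal random variables)

I Of the outcome in the treated state for the treated

$$E[y_i \mid D_i = 1, x_i, z_i] = x_i\beta_1 + E[\varepsilon_i \mid u_i > -z_i\gamma] = x_i\beta_1 + \rho\sigma_\varepsilon\left[\frac{\phi(z_i\gamma)}{\Phi(z_i\gamma)}\right] = x_i\beta_1 + \rho\sigma_\varepsilon[\lambda(z_i\gamma)]$$

where $\lambda(\cdot)$ is known as the Inverse Mills' Ratio

I Of the outcome in the untreated state for the untreated

$$E[y_i \mid D_i = 0, x_i, z_i] = x_i\beta_0 + E[\varepsilon_i \mid u_i \leq -z_i\gamma] = x_i\beta_0 + \rho\sigma_\varepsilon\left[\frac{-\phi(z_i\gamma)}{1 - \Phi(z_i\gamma)}\right]$$

I Given $Corr(\varepsilon, u) \neq 0$, error term is no longer well-behaved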
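
The two selection terms can be computed with the standard library alone. The sketch below is illustrative only (the function names are hypothetical); it evaluates the term for the treated and the term for the untreated at a few values of the index $z_i\gamma$.

```python
import math

def norm_pdf(c):
    """Standard normal density phi(c)."""
    return math.exp(-0.5 * c * c) / math.sqrt(2.0 * math.pi)

def norm_cdf(c):
    """Standard normal CDF Phi(c)."""
    return 0.5 * (1.0 + math.erf(c / math.sqrt(2.0)))

def selection_term_treated(index):
    """phi(index) / Phi(index), i.e. E[u | u > -index] for standard normal u."""
    return norm_pdf(index) / norm_cdf(index)

def selection_term_untreated(index):
    """-phi(index) / (1 - Phi(index)), i.e. E[u | u <= -index] for standard normal u."""
    return -norm_pdf(index) / (1.0 - norm_cdf(index))

for index in (-1.0, 0.0, 1.0):
    print(index,
          round(selection_term_treated(index), 4),
          round(selection_term_untreated(index), 4))
```

A quick sanity check on the signs: weighting the two terms by $\Phi(z_i\gamma)$ and $1 - \Phi(z_i\gamma)$ gives zero, as it must since $u$ has mean zero unconditionally.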

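Estimation: Method #1

Estimate the outcome equation for the treated and the untreated separately via OLS

Consistent estimates of $\beta_0, \beta_1$ require inclusion of the selection terms

Selection terms are estimable by
1 Estimating a probit model for treatment assignment $\Rightarrow$ $\hat\gamma$
2 Estimating the selection terms

$$\frac{\phi(z_i\hat\gamma)}{\Phi(z_i\hat\gamma)}, \qquad \frac{-\phi(z_i\hat\gamma)}{1 - \Phi(z_i\hat\gamma)}$$

3 Including these as additional covariates in each second-stage regression

Upon estimation of $\hat\beta_0, \hat\beta_1$ ...
I Predict $\hat y_{1i}, \hat y_{0i}$ $\forall i$
I Estimate treatment effect parameters

$$\hat\Delta^{ATE} = \overline{\hat y_{1i} - \hat y_{0i}}, \qquad \hat\Delta^{ATT} = \overline{\hat y_{1i} - \hat y_{0i}}, \qquad \hat\Delta^{ATU} = \overline{\hat y_{1i} - \hat y_{0i}}$$

where ATE computes mean for entire sample, and latter two compute means using only the treated and untreated, respectively

I Equivalently,

$$\hat\Delta^{ATE} = \bar{x}(\hat\beta_1 - \hat\beta_0), \qquad \hat\Delta^{ATT} = \bar{x}_1(\hat\beta_1 - \hat\beta_0), \qquad \hat\Delta^{ATU} = \bar{x}_0(\hat\beta_1 - \hat\beta_0)$$

where $\bar{x}$ is the sample mean, and $\bar{x}_k$, $k = 0, 1$, is the sample mean in the sub-sample with $D = k$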

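Estimate a single outcome equation with no restriction

$$y_i = x_i\beta_0 + x_i D_i(\beta_1 - \beta_0) + \beta_{\lambda 1} D_i\left[\frac{\phi(z_i\gamma)}{\Phi(z_i\gamma)}\right] + \beta_{\lambda 0}(1 - D_i)\left[\frac{-\phi(z_i\gamma)}{1 - \Phi(z_i\gamma)}\right] + \eta_i$$

This does not impose the restriction that the coefficient on both selection terms should be the same: $\rho\sigma_\varepsilon$

Thus, testing $H_0: \beta_{\lambda 0} = \beta_{\lambda 1}$ constitutes a specification test of the underlying model

The error term is

$$\eta_i = \varepsilon_i - \beta_{\lambda 1} D_i\left[\frac{\phi(z_i\gamma)}{\Phi(z_i\gamma)}\right] - \beta_{\lambda 0}(1 - D_i)\left[\frac{-\phi(z_i\gamma)}{1 - \Phi(z_i\gamma)}\right] = \varepsilon_i - D_i E[\varepsilon_i \mid D_i = 1] - (1 - D_i)E[\varepsilon_i \mid D_i = 0]$$

which is a well-behaved error term since the portion of the error term that is correlated with treatment assignment now appears in the model in the form of the selection correction terms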

Estimate a single outcome equation imposing the restriction that $\beta_{\lambda 0} = \beta_{\lambda 1}$

$$y_i = x_i\beta_0 + x_i D_i(\beta_1 - \beta_0) + \beta_\lambda\left\{D_i\left[\frac{\phi(z_i\gamma)}{\Phi(z_i\gamma)}\right] + (1 - D_i)\left[\frac{-\phi(z_i\gamma)}{1 - \Phi(z_i\gamma)}\right]\right\} + \eta_i$$

Efficiency gain if, in fact, the restriction is true

Maximum likelihood estimation of the system of three equations

Above estimators are known as control function approach since selection terms control for selection on unobservables

ML is not a control function approach, but rather directly incorporates the covariance structure of the errors into the estimation by jointly estimating the system of equations

Benefits: yields an estimate of $\rho$ along with a std error, more efficient if parametric assumptions are true

Cost: results are less robust if parametric assumptions of the model are violated

Comments

There is no "instrument" or exclusion restriction required for identification

I Identification arises from the non-linearity of the selection correction terms, which in turn arises from the assumption of bivariate normality
I Exclusion restrictions (a variable in $z$ not in $x$) would be nice

Semi-parametric versions exist
I Relax dependence on bivariate normality
I Require exclusion restrictions
I One version includes a polynomial of the propensity score in the regression model; motivation is to include a flexible functional form to capture the selection terms without reliance on bivariate normality

Bivariate probit treatment effects model
I Similar to above models, except outcome of interest is binary (e.g., employment following a job training program)
I Similar estimation to above by ML, except likelihood is based on a bivariate probit model (same as in Altonji et al. (2005) unconstrained bivariate probit model)

Aside:

Typical IV estimator can also be implemented using a control function approach

I TSLS estimator of the model

$$y_i = \beta_1 x_{1i} + x_{2i}\beta_2 + \varepsilon_i$$

$$x_{1i} = z_i\pi_1 + x_{2i}\pi_2 + u_i$$

is equivalent to OLS estimation of

$$y_i = \beta_1 x_{1i} + x_{2i}\beta_2 + u_i + \tilde\varepsilon_i$$

where $u_i$ is replaced with the OLS estimate of the first-stage residual

I Since $\hat u_i = x_{1i} - z_i\hat\pi_1 - x_{2i}\hat\pi_2$, this is not linearly independent of $x_2$ unless $\pi_1 \neq 0$

Treatment effects model without the common effect assumption

Relaxation of common effect assumption allows for heterogeneous effects of the treatment even conditional on $x$

$$y_{0i} = x_i\beta_0 + \varepsilon_{0i}$$

$$y_{1i} = x_i\beta_1 + \varepsilon_{1i} = x_i\beta_1 + [(\varepsilon_{1i} - \varepsilon_{0i}) + \varepsilon_{0i}] = x_i\beta_1 + [\delta_i + \varepsilon_{0i}]$$

$$y_i = D_i y_{1i} + (1 - D_i) y_{0i}$$

$$D_i^* = z_i\gamma + u_i$$

$$D_i = \begin{cases} 1 & \text{if } D_i^* > 0 \\ 0 & \text{if } D_i^* \leq 0 \end{cases}$$

Notes
I $\delta_i$ = obs-specific gain to treatment (conditional on $x$)
I $\Delta_i = y_{1i} - y_{0i} = x_i(\beta_1 - \beta_0) + \delta_i$ (heterogeneous treatment effects given $x$)
I Selection into treatment may depend on either $\varepsilon_{0i}$ (untreated outcome level given $x$) or $\delta_i$ (obs-specific gains given $x$)
I Otherwise, intuition is identical to common effect version

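Assumptions (replaces (BVN.i))

(BVN.i') $\varepsilon_0, \varepsilon_1, u \sim N(0, \Sigma)$, where

$$\Sigma = \begin{bmatrix} \sigma_{\varepsilon_0}^2 & \rho_{01} & \rho_{0u} \\ & \sigma_{\varepsilon_1}^2 & \rho_{1u} \\ & & \sigma_u^2 \end{bmatrix}$$

Conditional expectations

$$E[\varepsilon_{0i} \mid D_i = 1, x_i, z_i] = \rho_{0u}\sigma_{\varepsilon_0}\left[\frac{\phi(z_i\gamma)}{\Phi(z_i\gamma)}\right]$$

$$E[\delta_i \mid D_i = 1, x_i, z_i] = \rho_{\delta u}\sigma_\delta\left[\frac{\phi(z_i\gamma)}{\Phi(z_i\gamma)}\right]$$

$$E[\varepsilon_{0i} \mid D_i = 0, x_i, z_i] = \rho_{0u}\sigma_{\varepsilon_0}\left[\frac{-\phi(z_i\gamma)}{1 - \Phi(z_i\gamma)}\right]$$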

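Estimation

Generalization of the previous two-step approach in the common effect model

Estimating equation

$$y_i = x_i\beta_0 + x_i D_i(\beta_1 - \beta_0) + \tilde\beta_{\lambda 1} D_i\left[\frac{\phi(z_i\gamma)}{\Phi(z_i\gamma)}\right] + \beta_{\lambda 0}(1 - D_i)\left[\frac{-\phi(z_i\gamma)}{1 - \Phi(z_i\gamma)}\right] + \zeta_i$$

where

$$\tilde\beta_{\lambda 1} = \rho_{0u}\sigma_{\varepsilon_0} + \rho_{\delta u}\sigma_\delta$$

$$\beta_{\lambda 0} = \rho_{0u}\sigma_{\varepsilon_0}$$

Selection terms obtained by estimating first-stage probit model for $D$

ML estimation of entire model is feasible, but it requires estimation of a trivariate normal dbn (computationally difficult)

$\rho_{01}$ is not identified since never observe $y_1$ and $y_0$ for same $i$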

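Upon estimation of $\hat\beta_0, \hat\beta_1$ ...
I Predict $\hat y_{1i}, \hat y_{0i}$ $\forall i$
I Estimate $\hat\Delta^{ATE}$

$$\hat\Delta^{ATE} = \overline{\hat y_{1i} - \hat y_{0i}} = \bar{x}(\hat\beta_1 - \hat\beta_0)$$

where ATE computes mean for entire sample

I ATT is given by

$$\Delta^{ATT} = E_{x_i \mid D_i = 1}[x_i(\beta_1 - \beta_0)] + E_{\delta_i \mid D_i = 1}[\delta_i] = E_{x_i \mid D_i = 1}[x_i(\beta_1 - \beta_0)] + E_{z_i \mid D_i = 1}\left[\rho_{\delta u}\sigma_\delta\frac{\phi(z_i\gamma)}{\Phi(z_i\gamma)}\right]$$

F If there is no selection on unobservable gains, then $\rho_{\delta u} = 0$ $\Rightarrow$ common effect model
F $\tilde\beta_{\lambda 1} - \beta_{\lambda 0} = \rho_{\delta u}\sigma_\delta$ $\Rightarrow$ $\widehat{\rho_{\delta u}\sigma_\delta} = \hat{\tilde\beta}_{\lambda 1} - \hat\beta_{\lambda 0}$, which gives the sign of the selection on gains (which one expects to be positive if obs know their unobservable gains)
F Estimate obtained by replacing expectations with sample averages within the treatment group

I ATU obtained in similar fashion, but average over $x, z$ in control group

Stata: -treatreg-, -biprobit-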

Selection on Unobservables
Millimet & Tchernis (2011)

Builds on the minimum biased approach (discussed earlier) by offering a bias-corrected procedure

Recall, under certain assumptions the bias of the ATT, ATE at some value of the propensity score, $p(x)$, is given by

$$B_{ATT}[p(x)] = \rho_{0u}\sigma_0\,\frac{\phi(\Phi^{-1}(p(x)))}{p(x)[1 - p(x)]}$$

$$B_{ATE}[p(x)] = \{\rho_{0u}\sigma_0 + [1 - p(x)]\rho_{\delta u}\sigma_\delta\}\,\frac{\phi(\Phi^{-1}(p(x)))}{p(x)[1 - p(x)]}$$

I $\rho_{0u}$ = selection on unobservables affecting outcome in untreated state
I $\rho_{\delta u}$ = selection on unobserved, individual-specific gains

$B_{ATT}[p(x)]$ is minimized at $p(x) = 0.5$; $B_{ATE}[p(x)]$ does not have a unique minimum
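
The claim that the ATT bias is smallest at a propensity score of one half can be checked numerically. The sketch below evaluates the ATT bias expression on a grid; the values chosen for $\rho_{0u}$ and $\sigma_0$ are hypothetical and only scale the function.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def att_bias(p, rho_0u=0.5, sigma_0=1.0):
    """rho_0u * sigma_0 * phi(Phi^{-1}(p)) / (p * (1 - p)), for 0 < p < 1."""
    index = STD_NORMAL.inv_cdf(p)
    return rho_0u * sigma_0 * STD_NORMAL.pdf(index) / (p * (1.0 - p))

grid = [k / 100 for k in range(5, 96)]      # propensity scores 0.05, 0.06, ..., 0.95
p_star = min(grid, key=att_bias)            # grid point with the smallest bias
print(p_star, round(att_bias(p_star), 4), round(att_bias(0.1), 4))
```

With a positive $\rho_{0u}$ the bias is positive everywhere on the grid, so the grid minimizer is also the point of smallest absolute bias.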

Minimum-biased (MB) estimation technique
I Stage 1: Estimate the propensity score (e.g., probit model)
I Stage 2: Retain only those observations with a propensity score, $\widehat{p(x_i)}$, within a fixed neighborhood around $p^*(x)$, the bias-minimizing propensity score
I Stage 3: Estimate the ATE or ATT using any propensity-score based estimator that relies on CIA using this sub-sample

For ATE, add Stage 1.5: Estimate the error correlations using Heckman BVN model

BC estimator amends the previous MB estimator by removing the estimated bias

$$\hat\Delta^k_{BC} = \hat\Delta^k - \int \widehat{B_k}[p(x_i)]\,f_k(x)\,dx, \qquad k = ATE, ATT, ATU$$

where $f_k(x)$ is the appropriate dbn needed to estimate parameter $k$

Millimet & Tchernis (2011) find some benefit to this estimator, particularly in large samples, using MC

Selection on Unobservables
Higher Moments: Lewbel (2010) approach

Originally proposed as a solution to measurement error, but potentially applicable to more general dependence between $x$ and $\varepsilon$ (Lewbel 1997, 2010)

Setup
I Structural model

$$y_i = \beta_1 D_i + x_i\beta_2 + \varepsilon_i$$

I First-stage model

$$D_i = x_i\pi + u_i$$

F $x$ includes the intercept
F $Cov(\varepsilon, u) \neq 0$

$D$ may be discrete or continuous

Potential instruments for $D$ include $(z_i - \bar{z})u_i$, where $z \subseteq x$

Estimation requires consistently estimating the first-stage and replacing $u$ with $\hat u$

Validity of the IVs requires

(HM.i) $E[z'u^2] \neq 0$
(HM.ii) $E[z'\varepsilon u] = 0$

Restrictions are satisfied if, say,

$$\varepsilon_i = \theta_i + \tilde\varepsilon_i$$

$$u_i = \theta_i + \tilde u_i$$

where $\theta_i$ is a homoskedastic common factor and the sole source of correlation between $\varepsilon$ and $u$, and $\tilde u$ is heteroskedastic with variance depending on $z$

Selection on Unobservables
Higher Moments: Klein & Vella (2009, 2010); Farré et al. (2010)

Setup as in the prior model
I Structural model

$$y_i = \beta_1 D_i + x_i\beta_2 + \varepsilon_i$$

I First-stage model

$$D_i = x_i\pi + u_i$$

F $x$ includes the intercept
F $Cov(\varepsilon, u) \neq 0$

Identification assumptions

(KV.i) $\varepsilon_i = S_\varepsilon(z_i)\varepsilon_i^*$ and/or $u_i = S_u(z_i)u_i^*$, where $z \subseteq x$, such that $S_\varepsilon(z_i)/S_u(z_i)$ varies across $i$
(KV.ii) $E[\varepsilon_i^* u_i^*] = \rho$, which is constant

Under (KV.i) and (KV.ii), the structural model may be re-written as

$$y_i = \beta_1 D_i + x_i\beta_2 + \rho\left[\frac{S_\varepsilon(z_i)}{S_u(z_i)}u_i\right] + \tilde\varepsilon_i$$

where $\tilde\varepsilon_i$ is now a well-behaved error term

The term in brackets acts as a control function since it controls for selection bias such that, conditional on this term and $x$, $D$ is no longer correlated with the error term

Klein & Vella (2009) propose a semiparametric estimator of the model

Farré et al. (2010) outline a parametric estimator

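Parametric Estimation

Assuming

$$S_\varepsilon(z_i) = \sqrt{\exp(z_i\theta_\varepsilon)}$$

$$S_u(z_i) = \sqrt{\exp(z_i\theta_u)}$$

the structural model becomes

$$y_i = \beta_1 D_i + x_i\beta_2 + \rho\left[\frac{\sqrt{\exp(z_i\theta_\varepsilon)}}{\sqrt{\exp(z_i\theta_u)}}u_i\right] + \tilde\varepsilon_i$$

Estimate the first-stage by OLS $\Rightarrow$ $\hat u$

Estimate by OLS

$$\ln(\hat u_i^2) = z_i\theta_u + \tilde u_i$$

and form $\hat S_u(z_i) = \sqrt{\exp(z_i\hat\theta_u)}$

Substitute $\hat u$ and $\hat S_u(z_i)$ into the structural model and estimate the remaining parameters by NLS

$$y_i = \beta_1 D_i + x_i\beta_2 + \rho\left[\frac{\sqrt{\exp(z_i\theta_\varepsilon)}}{\hat S_u(z_i)}\hat u_i\right] + \tilde\varepsilon_i$$

While one could stop, performance is perhaps improved by adding additional steps

I Given NLS estimates of $\beta_1$ and $\beta_2$ $\Rightarrow$ $\hat\varepsilon$
I Estimate by OLS

$$\ln(\hat\varepsilon_i^2) = z_i\theta_\varepsilon + \tilde{\tilde\varepsilon}_i$$

and form $\hat S_\varepsilon(z_i) = \sqrt{\exp(z_i\hat\theta_\varepsilon)}$

I Estimate by OLS

$$y_i = \beta_1 D_i + x_i\beta_2 + \rho\left[\frac{\hat S_\varepsilon(z_i)}{\hat S_u(z_i)}\hat u_i\right] + \tilde\varepsilon_i$$

Obtain std errors via bootstrap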

Selection on Unobservables
Higher Moments: Klein & Vella (2009)

I First-stage model

$$D_i = x_i\pi_1 + u_i$$

where $x$ contains an intercept

When $D$ is binary, one may estimate the first-stage via probit and form an instrument using the propensity score, $\widehat{p(x)}$

Even with no exclusion restriction, $\widehat{p(x)}$ is correlated with $D$ and linearly independent of $x$ (since $\widehat{p(x)} = \Phi(x\hat\pi)$)

However, most of this non-linearity occurs in the tails

Additional non-linearity of the IV may be induced if one uses a heteroskedastic probit to form the IV

I $\sigma_u$ is modeled as $\exp(x\delta)$
I $\widehat{p(x)} = \Phi(x\hat\pi / \exp(x\hat\delta))$
I Additional non-linearity is roughly equivalent to using higher-order terms of $x$ as exclusion restrictions

Klein & Vella (2009) also propose a semiparametric version

Selection on Unobservables
Higher Moments: Vella & Verbeek (1997); Rummery et al. (1999)

Vella and Verbeek (1997) propose an alternative IV strategy that may also be valid with heteroskedastic errors

Known as Rank Order IV

Setup as in the prior models

$$D_i = x_i\pi + u_i$$

where
I $x$ includes the intercept
I $Cov(\varepsilon, u) \neq 0$

Identification assumptions

(ROIV.i) An agent's level of unobserved heterogeneity responsible for $Cov(\varepsilon, u) \neq 0$ does not impact $y$, but rather only the agent's relative position or rank order matters
(ROIV.ii) Data can be partitioned into subsets such that agents may be paired across subsets in a manner leading to pairs with identical ranks in their respective subsets but different levels of $D$

For example, if $y$ is wages, $D$ is participation in a training program, and endogeneity is due to unobserved work ethic, then

I (ROIV.i) implies that the level of one's work ethic does not impact wages, but only the fraction of workers whose work ethic one's own exceeds
F I.e., one's level of work ethic is irrelevant, only one's percentile in the dbn of work ethic matters
I (ROIV.ii) implies we can divide the data (say, by region) such that across regions individuals at the same percentile of the dbn of work ethic within their region have different values of $D$

To proceed, partition the data into mutually exclusive groups, $s = 1, ..., S$, on the basis of some attribute, $q_i$ (which may be a subset of $x$)

Notation
I Define $F(\cdot \mid q_i)$ as the CDF of $u$ given $q$
I Let $c_i = F(u_i \mid q_i)$ be the rank order of obs $i$ in its partition

(ROIV.i) may be expressed formally as

$$E[\varepsilon_i \mid x_i, D_i, u_i, q_i] = E[\varepsilon_i \mid u_i, q_i] = E[\varepsilon_i \mid c_i] = m(c_i)$$

where $m(\cdot)$ is some fn mapping $c$ to $y$

I This condition states that $E[\varepsilon_i \mid u_i, q_i]$ depends only on $u$ and $q$ through the rank order, $c$
I Vella & Verbeek (1997) refer to this as the order restriction

The order restriction is useful for identifying the model since it implies that agents from different partitions, $q_i \neq q_j$, but with identical rank orders, $c_i = c_j$, are identical along the unobserved dimension responsible for the endogeneity

To be useful, however, requires an additional assumption, (ROIV.ii), such that these comparable pairs of agents have different values of $D$

Estimation

Re-write the structural model as

$$y_i = \beta_1 D_i + x_i\beta_2 + m(c_i) + \tilde\varepsilon_i$$

where $\tilde\varepsilon$ is now a well-behaved error term; $m(c)$ is another example of a control function, but $c$ and $m(\cdot)$ are unknown

Estimate $c_i$ by
I Estimating the first-stage model via OLS $\Rightarrow$ $\hat u_i$
I Estimate $\hat c_i$ nonparametrically using the empirical CDF within each of the $S$ partitions based on $q$

Approximate $m(c)$ using a finite-order polynomial in $\hat c$

Alternatively, one may estimate the original structural model

$$y_i = \beta_1 D_i + x_i\beta_2 + \varepsilon_i$$

by IV with the instrument given by the residual, $\hat\eta$, obtained after OLS estimation of the model

$$D_i = \theta_0 + \theta_1 c_i + \eta_i$$

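Selection on Unobservables
Covariance Restrictions

I Structural model

$$y_i = \beta_0 + \beta_1 D_i + x_i\beta_2 + \varepsilon_i$$

I First-stage model

$$D_i = \pi_0 + x_i\pi_1 + u_i$$

I Reduced form model

$$y_i = (\beta_0 + \beta_1\pi_0) + x_i(\beta_1\pi_1 + \beta_2) + (\varepsilon_i + \beta_1 u_i) = \tilde\beta_0 + \tilde\beta_1 x_i + \tilde\upsilon_i$$

With no IV, estimable quantities include: $\pi_0, \pi_1, \tilde\beta_0, \tilde\beta_1$

I These four quantities are functions of five structural parameters: $\pi_0, \pi_1, \beta_0, \beta_1, \beta_2$
I Thus, the model is under-identified

What about the covariance matrix of the system of reduced form eqtns? $\beta_1$ also shows up there

$$y_i = \tilde\beta_0 + \tilde\beta_1 x_i + (\varepsilon_i + \beta_1 u_i)$$

$$D_i = \pi_0 + x_i\pi_1 + u_i$$

Assume $\varepsilon, u \sim N(0, 0, \sigma_\varepsilon, \sigma_u, \rho)$, then $\tilde\upsilon, u$ are also mean zero with covariance matrix

$$\begin{bmatrix} \sigma_\varepsilon^2 + \beta_1^2\sigma_u^2 + 2\beta_1\rho\sigma_\varepsilon\sigma_u & \rho\sigma_\varepsilon\sigma_u + \beta_1\sigma_u^2 \\ & \sigma_u^2 \end{bmatrix} = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ & \Sigma_{22} \end{bmatrix}$$

Three quantities are estimable based on MLE of the system: $\Sigma_{11}, \Sigma_{12}, \Sigma_{22}$

I These 3 quantities are functions of 4 structural parameters: $\beta_1, \sigma_\varepsilon, \sigma_u, \rho$
I Thus, the model remains under-identified

Intuition: place restrictions on other parameters in $\Sigma$ in order to identify $\beta_1$ from the cov matrix; intercept and slope parameters are all identified then as well

Model is then estimated via ML

$$\ln L = \sum_i \left[\frac{1}{2}\ln|\Sigma^{-1}| - \frac{1}{2}\varepsilon_i'\Sigma^{-1}\varepsilon_i\right]$$

where $\varepsilon_i$ is the vector of errors for obs $i$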

Note: If D is instead modelled as a LDV, then the likelihood must befactored appropriately to account for the fact that one eqtn has adiscrete outcome

Realistic restrictions may be easier to devise if one adds additionaloutcomes that also depend on the same endogenous regressor

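I Ex: $K = 2$

$$y_{1i} = \tilde\beta_{10} + \tilde\beta_{11}x_i + (\varepsilon_{1i} + \beta_{11}u_i)$$

$$y_{2i} = \tilde\beta_{20} + \tilde\beta_{21}x_i + (\varepsilon_{2i} + \beta_{21}u_i)$$

$$D_i = \pi_0 + x_i\pi_1 + u_i$$

which entails a covariance matrix with elements

$$\Sigma_{11} = \sigma_{\varepsilon_1}^2 + \beta_{11}^2\sigma_u^2 + 2\beta_{11}\rho_1\sigma_{\varepsilon_1}\sigma_u$$

$$\Sigma_{12} = \rho_{12}\sigma_{\varepsilon_1}\sigma_{\varepsilon_2} + \beta_{11}\rho_2\sigma_{\varepsilon_2}\sigma_u + \beta_{21}\rho_1\sigma_{\varepsilon_1}\sigma_u + \beta_{11}\beta_{21}\sigma_u^2$$

$$\Sigma_{13} = \rho_1\sigma_{\varepsilon_1}\sigma_u + \beta_{11}\sigma_u^2$$

$$\Sigma_{22} = \sigma_{\varepsilon_2}^2 + \beta_{21}^2\sigma_u^2 + 2\beta_{21}\rho_2\sigma_{\varepsilon_2}\sigma_u$$

$$\Sigma_{23} = \rho_2\sigma_{\varepsilon_2}\sigma_u + \beta_{21}\sigma_u^2$$

$$\Sigma_{33} = \sigma_u^2$$

I If $y_1, y_2$ are similar (e.g., two anthropometric measures), might impose $\rho_1 = \rho_2$ and might have a strong prior for $\rho_{12}$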

Types of restrictions

Altonji et al. (2005)-type restrictions: impose values for $\rho$ and track estimates of $\beta_1$

Factor Structure
I Add additional outcomes
I Decompose errors as

$$\varepsilon_{ki} = \lambda_k\mu_i + \eta_{ki}, \qquad k = 1, ..., K$$

$$u_i = \lambda_u\mu_i + \xi_i$$

where $\mu$ has unit var (normalization, not an assumption), $\eta, \xi, \mu$ are assumed to be independent, and $\lambda$ are known as factor loadings

I Factor structure assumes all cross-eqtn correlation is through $\mu$
I Parameters to be estimated from $\Sigma$: $\sigma_{\eta_k}, \lambda_k, \beta_{1k}, \lambda_u, \sigma_\xi$
F This is $3K + 2$ parameters in total
F Estimable quantities from $\Sigma$ is $(K + 1)K/2$
F $(K + 1)K/2 \geq 3K + 2 \Rightarrow K \geq 6$

Hogan and Rigobon (2003), Rigobon (2003) propose an "Identification through Heteroskedasticity" estimator that is very similar

Selection on Unobservables
Distributional Approaches

Relatively recent work has begun to address endogeneity in the context of distributional models

Other estimators not discussed here
1 Fixed effect QR models (Koenker 2004)
2 Nonparametric bounds applied to QR models (Giustinelli 2011)

Selection on Unobservables
Distributional Approaches: Changes-in-Changes

Recall, standard DID strategy
I Assume treatment group observed pre- and post-intervention
I Assume control group observed in same time periods
I Assume treatment and control groups follow same time trend absent treatment
I Estimate treatment effect by the additional change over time in the treatment group relative to the control group

Idea is extendable beyond just average treatment effects

Model does require panel data or repeated cross-sections

Setup (Athey & Imbens 2005)

Notation
I Individual $i$ belongs to a group $G_i \in \{0, 1\}$, where $G = 1$ is treatment group
I Individual $i$ observed at time $T_i \in \{0, 1\}$
I $y_i^N, y_i^I$ = potential outcomes in non-treated ($N$), treated (intervention, $I$) states
I $y_i = (1 - I_i)y_i^N + I_i y_i^I$ = observed outcome, where $I_i$ = treatment (intervention) indicator
I $I_i = G_i T_i$

Standard DID
I Untreated outcome

$$y_i^N = \alpha + \beta T_i + \gamma G_i + \varepsilon_i$$

I Constant treatment effect assumption

$$\tau = y_i^I - y_i^N$$

I Combining above two assumptions yields

$$y_i = \alpha + \beta T_i + \gamma G_i + \tau I_i + \varepsilon_i$$

F $\tau$ = ATE with constant treatment effect assumption
F $\tau$ = ATT with heterogeneous treatment effect assumption
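
As a small numerical illustration of the standard DID model, the sketch below computes the difference in mean changes across the four group-by-period cells. The data are hypothetical and generated without noise from $\alpha = 1$, $\beta = 2$, $\gamma = 3$, $\tau = 5$, so the estimator should return $\tau$.

```python
def cell_mean(rows, g, t):
    """Mean outcome in the cell with group g and period t."""
    cell = [y for y, gi, ti in rows if gi == g and ti == t]
    return sum(cell) / len(cell)

def did_estimate(rows):
    """(Change over time in the G=1 group) minus (change over time in the G=0 group)."""
    change_treated = cell_mean(rows, 1, 1) - cell_mean(rows, 1, 0)
    change_control = cell_mean(rows, 0, 1) - cell_mean(rows, 0, 0)
    return change_treated - change_control

# Hypothetical noiseless data: y = alpha + beta*T + gamma*G + tau*G*T.
rows = [(1 + 2 * t + 3 * g + 5 * g * t, g, t)
        for g in (0, 1) for t in (0, 1) for _ in range(3)]
print(did_estimate(rows))
```

This is numerically the same as the OLS coefficient on the interaction $I_i = G_i T_i$ in the combined regression, because that regression is saturated in $G$ and $T$.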

Generalizing the standard model
I Untreated outcome

$$y_i^N = h(U_i, T_i)$$

where
F $h(u, t)$ is increasing in $u$
F $u_i$ = unobservable attribute of $i$
F $y^N$ is identical across individuals within a time period with identical $u$, irrespective of $G$

I Dbn of $u$ may vary by $G$, but not over time within $G$, $u_i \perp T_i \mid G_i$
I In the absence of treatment...
F Any differences in outcomes across groups are entirely due to diffs in the dbn of $u$ across groups
F Any changes in outcomes within groups over time are due to diffs in $h(u, 0)$ and $h(u, 1)$ [i.e., since unobservables do not change over time, the effect of unobservables on the untreated outcome must change over time]

I Treated outcome

$$y_i^I = h^I(U_i, T_i)$$

where $h^I(u, t)$ is increasing in $u$

Changes-in-changes model

Notation
I Conditional dbns

$$y_{gt}^N \sim y^N \mid G = g, T = t$$

$$y_{gt}^I \sim y^I \mid G = g, T = t$$

$$y_{gt} \sim y \mid G = g, T = t$$

$$U_g \sim U \mid G = g$$

I Inverse CDFs

$$F_y^{-1}(q) = \inf\{y : F_Y(y) \geq q\}$$

Goal
I Devise set of assumptions to identify dbn of $y_{11}^N$, $F_{y^N,11}$, which is (one of) the distributions of missing counterfactuals
I Observable dbns include: $F_{y^N,10}$, $F_{y^I,11}$, $F_{y^N,00}$, and $F_{y^N,01}$

Assumptions

(CIC.i) Model: $y^N = h(U, T)$
(CIC.ii) Strict monotonicity: $h(u, t)$ is strictly increasing in $u$ for $t \in \{0, 1\}$
(CIC.iii) Time invariance within groups: $U \perp T \mid G$
(CIC.iv) Support: $\mathbb{U}_1 \subseteq \mathbb{U}_0$

Estimator

Counterfactual CDF

$$\hat F_{y^N,11}(y) = F_{y,10}(F_{y,00}^{-1}(F_{y,01}(y)))$$

which is estimable using empirical CDFs

Treatment effect estimate

$$\tau_q^{CIC} = F_{y^I,11}^{-1}(q) - \hat F_{y^N,11}^{-1}(q)$$

Note, $\tau_q^{CIC}$ is the difference in two QTE (Firpo 2007) estimates

$$\tau_q^{CIC} = \Delta_{q,1}^{QTE} - \Delta_{q',0}^{QTE}$$

where
I $\Delta_{q,1}^{QTE}$ is change over time in $y$ at quantile $q$ for $G = 1$ group
I $\Delta_{q',0}^{QTE}$ is change over time in $y$ at quantile $q'$ for $G = 0$ group, where $q'$ is the quantile in the $G = 0, T = 0$ dbn corresponding to the value of $y$ associated with quantile $q$ in the $G = 1, T = 0$ dbn
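
A minimal sketch of the CIC calculation with empirical CDFs, using the inverse-CDF definition given in the notation. It works with the quantile form of the counterfactual, $F_{y,01}^{-1}(F_{y,00}(F_{y,10}^{-1}(q)))$, which is the inverse of the counterfactual CDF. The function names and the tiny data set are hypothetical.

```python
import bisect
import math

def ecdf(sample):
    """Return F(y) = share of the sample less than or equal to y."""
    ordered = sorted(sample)
    return lambda y: bisect.bisect_right(ordered, y) / len(ordered)

def inverse_cdf(sample):
    """Return F^{-1}(q) = inf{y : F(y) >= q} for the empirical CDF."""
    ordered = sorted(sample)
    n = len(ordered)
    def quantile(q):
        # Small tolerance guards against floating-point error in q * n.
        index = max(math.ceil(q * n - 1e-12) - 1, 0)
        return ordered[min(index, n - 1)]
    return quantile

def cic_effect(y00, y01, y10, y11, q):
    """Observed minus counterfactual q-th quantile for the G=1, T=1 cell."""
    counterfactual = inverse_cdf(y01)(ecdf(y00)(inverse_cdf(y10)(q)))
    return inverse_cdf(y11)(q) - counterfactual

# Hypothetical data: untreated outcomes double between periods in the control group.
y00, y01 = [1, 2, 3, 4], [2, 4, 6, 8]
y10, y11 = [3, 4], [7, 10]
print(cic_effect(y00, y01, y10, y11, 0.5), cic_effect(y00, y01, y10, y11, 1.0))
```

In this example the counterfactual quantiles for the treated group are 6 and 8 (their period-0 values doubled), so the estimated effects are 1 at the median and 2 at the top.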

Alternative estimator
I QDID treatment effect estimator

$$\tau_q^{QDID} = F_{y^I,11}^{-1}(q) - \hat F_{y^N,11}^{-1}(q)$$

where

$$\hat F_{y^N,11}^{-1}(q) = F_{y,10}^{-1}(q) + [F_{y,01}^{-1}(q) - F_{y,00}^{-1}(q)]$$

which corresponds to

$$\tau_q^{QDID} = \Delta_{q,1}^{QTE} - \Delta_{q,0}^{QTE}$$

where $\Delta_{q,1}^{QTE}$, $\Delta_{q,0}^{QTE}$ is change over time in $y$ at quantile $q$ for $G = 1, 0$, respectively

I Relies on (perhaps) unrealistic assumptions

Counterfactual CDF for control group

$$F_{y^I,01}(y) = F_{y,00}(F_{y,10}^{-1}(F_{y,11}(y)))$$

Treatment effect estimate

$$\tau_{q,0}^{CIC} = F_{y^I,01}^{-1}(q) - F_{y^N,01}^{-1}(q)$$

Athey & Imbens (2006) discuss extensions to
I Discrete outcomes
I Multiple groups and multiple time periods
I Incorporating covariates

F Semiparametric specification of potential outcomes

$$y^N = h(u, t) + x\beta$$

$$y^I = h^I(u, t) + x\beta$$

where $U \perp T, X \mid G$

F OLS estimation of outcomes

$$y_i = D_i\delta + x_i\beta + \varepsilon_i$$

where $D = [GT \;\; (1 - G)T \;\; G(1 - T) \;\; (1 - G)(1 - T)]$

F Perform CIC estimation on

$$\hat y_i = y_i - x_i\hat\beta = D_i\hat\delta + \hat\varepsilon_i$$

F Inverse propensity score weighting alternative?

Panel data allows additional flexibility, but repeated cross sections are sufficient

Inference
I Athey & Imbens (2006) prove asymptotic normality, and derive the asymptotic variance
I Bootstrap alternative?

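Selection on Unobservables
Distributional Approaches: IV Quantile Regression

Recall, QR model (Koenker & Bassett 1978)
I Assuming linear conditional quantiles, estimation is

$$(\hat\beta_\theta, \hat\Delta_\theta) = \arg\min_{\beta,\Delta}\left[\sum_{i: y_i > \Delta D_i + x_i\beta}\theta\,|y_i - \Delta D_i - x_i\beta| + \sum_{i: y_i < \Delta D_i + x_i\beta}(1 - \theta)\,|y_i - \Delta D_i - x_i\beta|\right]$$

I May be rewritten as

$$(\hat\beta_\theta, \hat\Delta_\theta) = \arg\min_{\beta,\Delta}\sum_i \rho_\theta(\varepsilon_{\theta i})$$

where $\rho_\theta(\cdot)$ is the check function, defined as

$$\rho_\theta(\varepsilon) = \varepsilon\,[\theta - I(\varepsilon < 0)]$$

and $\varepsilon_{\theta i}$ is the residual for $i$ and $\theta$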
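
A short sketch of the check function in its standard form, residual times ($\theta$ minus an indicator for a negative residual), together with a check that minimizing the summed check loss over a constant recovers a sample quantile. The helper names and the sample are hypothetical, and the "regression" has only an intercept.

```python
def check_loss(residual, theta):
    """Check function: residual * (theta - 1[residual < 0])."""
    return residual * (theta - (1.0 if residual < 0 else 0.0))

def constant_quantile_fit(sample, theta):
    """Minimize the summed check loss over constants, searching over the sample values."""
    return min(sample, key=lambda b: sum(check_loss(y - b, theta) for y in sample))

sample = [1, 2, 3, 4, 100]
print(check_loss(2.0, 0.5), check_loss(-4.0, 0.25))
print(constant_quantile_fit(sample, 0.5))
```

Positive residuals are weighted by $\theta$ and negative residuals by $1 - \theta$, which matches the two sums in the first form of the objective; the outlier at 100 does not move the fitted median.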

Parameters of interest are the partial derivatives of the conditional quantile fn w.r.t. $x$

$$\frac{\partial E[Q_\theta(y \mid x, D)]}{\partial x_k}$$

which equals $\beta_{\theta k}$ if $x$ enters linearly

For discrete regressors, parameters give the expected change in the conditional quantile fn

$$\Delta_\theta = E[Q_\theta(y \mid x, D = 1)] - E[Q_\theta(y \mid x, D = 0)]$$

QR model is biased and inconsistent if $D$ is endogenous

Recall, potential outcomes setup
I $y_d$, $d = 0, 1$, are potential outcomes associated with $D = 0, 1$, respectively
I $q(d, x, \theta)$ = conditional quantile fn of potential outcomes
I $\Delta_\theta = q(1, x, \theta) - q(0, x, \theta)$ = QTE (parameter of interest)

IV-QR model (Chernozhukov & Hansen 2005, 2006)

Express conditional quantile fn as

$$y_d = q(d, x, u_d), \qquad u_d \sim U[0, 1]$$

where $q(d, x, \theta)$ is the conditional $\theta$th-quantile of potential outcome, $y_d$

Linear (in parameters) conditional quantile fn implies

$$q(d, x, \theta) = \Delta_\theta D_i + x_i\beta_\theta$$

Assumptions

(IV-QR.i) Potential outcomes: given $X = x$, for each $d$, $y_d = q(d, x, u_d)$, where $u_d \sim U[0, 1]$ and $q(d, x, \theta)$ is strictly increasing in $\theta$
(IV-QR.ii) Independence: given $X = x$, $\{u_d\} \perp Z$
(IV-QR.iii) Selection: given $X = x, Z = z$, $D \equiv \delta(z, x, \upsilon)$ for unknown fn $\delta(\cdot)$ and random vector, $\upsilon$
(IV-QR.iv) Rank similarity: given $X = x, Z = z$, $u_d \sim u_{d'}$ $\forall d, d'$
(IV-QR.v) Observed data: $y = q(d, x, u_d)$, $D \equiv \delta(z, x, \upsilon)$, $x$, and $z$

Note: rank similarity is a bit weaker than rank invariance (whereby $U_d = U_{d'}$ $\forall d, d'$), and requires only that $U_d$ and $U_{d'}$ are equal in expectation (thus, they may be considered equal ex ante, but are allowed to differ ex post)

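Estimation

Consider the objective fn

$$\sum_i \rho_\theta(\varepsilon_{\theta i})$$

where

$$\varepsilon_{\theta i} = y_i - \Delta_\theta D_i - x_i\beta_\theta - \hat\Phi_i\gamma_\theta$$

and $\hat\Phi_i$ is the predicted value from the first-stage regression of $D$ on $x, z$

Given correctly specified structural model, $\gamma_\theta$ should equal zero

Algorithm
1 Define a grid of possible values of $\Delta$, $\{\Delta_j, j = 1, ..., J\}$
2 For each $\theta$ and each $\Delta_j$, estimate a QR model with $y_i - \Delta_j D_i$ as the dependent variable and $x, \hat\Phi_i$ as covariates
3 Obtain estimates $\hat\beta_{\theta j}, \hat\gamma_{\theta j}$, $j = 1, ..., J$
4 Choose $\hat\Delta_\theta = \Delta_j$ and $\hat\beta_\theta = \hat\beta_{\theta j}$ to minimize $|\hat\gamma_{\theta j}|$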
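
The grid-search algorithm can be sketched in a deliberately stripped-down special case. The assumptions here are mine, not the notes': there are no covariates $x$, the instrument is binary and is used directly in place of the first-stage fitted value, and the treatment effect is constant. With only a constant and a binary regressor, the QR coefficient on the instrument is simply the difference in $\theta$-quantiles across the two instrument groups, so no QR solver is needed. All names and parameter values are hypothetical.

```python
import math
import random

def sample_quantile(values, theta):
    """Empirical theta-quantile: smallest value whose ECDF is at least theta."""
    ordered = sorted(values)
    index = max(math.ceil(theta * len(ordered)) - 1, 0)
    return ordered[index]

def simulate(n=20000, effect=1.0, seed=0):
    """Binary instrument z, endogenous binary d, outcome y = effect * d + u."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = 1 if rng.random() < 0.5 else 0
        u = rng.random()                    # rank variable, U[0, 1]
        v = 0.5 * (u + rng.random())        # selection unobservable, correlated with u
        d = 1 if v < 0.3 + 0.4 * z else 0   # instrument shifts treatment take-up
        rows.append((effect * d + u, d, z))
    return rows

def ivqr_grid_search(rows, theta, grid):
    """Pick the grid value whose implied coefficient on the instrument is closest to zero."""
    best_delta, best_gap = None, float("inf")
    for delta in grid:
        # Coefficient on z in a QR of (y - delta * d) on a constant and z:
        # the difference in theta-quantiles between the z = 1 and z = 0 groups.
        q1 = sample_quantile([y - delta * d for y, d, z in rows if z == 1], theta)
        q0 = sample_quantile([y - delta * d for y, d, z in rows if z == 0], theta)
        if abs(q1 - q0) < best_gap:
            best_delta, best_gap = delta, abs(q1 - q0)
    return best_delta

grid = [j * 0.05 for j in range(41)]        # candidate values 0, 0.05, ..., 2
print(ivqr_grid_search(simulate(), 0.5, grid))
```

At the true effect, the adjusted outcome equals the rank variable, which is independent of the instrument, so the quantile gap should be near zero there; with covariates one would replace the quantile comparison by an actual QR fit at each grid point.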

Inference via sub-sampling or typical, nonparametric iid bootstrap, as in QR model

Can test interesting hypotheses ($\Delta_\theta = 0$, $\Delta_\theta$ constant $\forall\theta$, SD, exogeneity)

Easily extendable to multiple endogenous variables, but grid search increases exponentially

Selection on Unobservables
Distributional Approaches: Stochastic Dominance

Recall, previous definitions for stochastic dominance
I First Order Stochastic Dominance: $Y_1$ FSD $Y_0$ iff

$$F_1(y) \leq F_0(y) \quad \forall y \in \mathcal{Y}$$

with strict inequality for some $y$ (where $\mathcal{Y}$ is the union of the supports for $Y_1$ and $Y_0$), or

$$y_1^\theta \geq y_0^\theta \quad \forall\theta \in [0, 1]$$

with strict inequality for some $\theta$

I Second Order Stochastic Dominance: $Y_1$ SSD $Y_0$ iff

$$\int_{-\infty}^{y} F_1(t)\,dt \leq \int_{-\infty}^{y} F_0(t)\,dt \quad \forall y \in \mathcal{Y}$$

with strict inequality for some $y$, or

$$\int_0^\theta y_1^t\,dt \geq \int_0^\theta y_0^t\,dt \quad \forall\theta \in [0, 1]$$

with strict inequality for some $\theta$

Recall, previous tests for stochastic dominance
I Test statistics

$$d = \min\sup_{z \in \mathcal{Y}}\,[F(z) - G(z)]$$

$$s = \min\sup_{z \in \mathcal{Y}}\int_{-\infty}^{z}[F(t) - G(t)]\,dt$$

where min is taken over $F - G$ and $G - F$

I Tests are based on estimates of $d$ and $s$ using the empirical CDFs
F Unconditional, or
F Inverse propensity score weighted
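
A minimal sketch of the first-order statistic using unconditional empirical CDFs. Because empirical CDFs are step functions, the supremum only needs to be checked at the pooled sample points. The function names and samples are hypothetical, and no critical values or weighting are included.

```python
import bisect

def ecdf(sample):
    """Return F(z) = share of the sample less than or equal to z."""
    ordered = sorted(sample)
    return lambda z: bisect.bisect_right(ordered, z) / len(ordered)

def fsd_statistic(sample_f, sample_g):
    """min{ sup_z [F(z) - G(z)], sup_z [G(z) - F(z)] } over the pooled sample points."""
    cdf_f, cdf_g = ecdf(sample_f), ecdf(sample_g)
    support = sorted(set(sample_f) | set(sample_g))
    sup_f_minus_g = max(cdf_f(z) - cdf_g(z) for z in support)
    sup_g_minus_f = max(cdf_g(z) - cdf_f(z) for z in support)
    return min(sup_f_minus_g, sup_g_minus_f)

print(fsd_statistic([1, 2, 3], [2, 3, 4]))   # second sample is the first shifted right
print(fsd_statistic([1, 4], [2, 3]))         # the two empirical CDFs cross
```

A value of zero (or below) is consistent with one distribution dominating the other, while a strictly positive value means the CDFs cross and no first-order ranking exists in the sample.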

Previous methods assume selection on observables

Failure of this assumption invalidates causal conclusions

Solution (Abadie 2002; Imbens & Rubin 1997)

With a binary IV, $Z$, the potential distributions of the outcome variable are identified for the subpopulation of compliers

$Z_i$ satisfies the following three assumptions:
I Independence: $\{y_{0i}, y_{1i}, D_{0i}, D_{1i}\} \perp Z_i$
I Correlation: $\Pr(Z_i = 1) \in (0, 1)$ and $\Pr(D_{0i} = 1) < \Pr(D_{1i} = 1)$
I Monotonicity: $\Pr(D_{0i} \leq D_{1i}) = 1$

where:
F $y_0, y_1$ are potential outcomes (subscripts refer to treatment status)
F $D_0, D_1$ are potential treatments (subscripts refer to instrument status)

SD tests comparing the distribution of outcomes across the samples with $Z = 0$ and $Z = 1$ identify the causal effect of $D$ on $y$ for compliers

Define the empirical CDF of potential outcomes for compliers as

$$\hat F_1^C(y) = E[I(Y_{1i} \leq y) \mid D_{1i} = 1, D_{0i} = 0]$$

$$\hat F_0^C(y) = E[I(Y_{0i} \leq y) \mid D_{1i} = 1, D_{0i} = 0]$$

Abadie (2002) shows

$$\hat F_1^C(y) - \hat F_0^C(y) = K[\hat F_1(y) - \hat F_0(y)]$$

where
I $\hat F_1(y), \hat F_0(y)$ are empirical CDFs for the $Z = 1$, $Z = 0$ samples
I $K = 1/(E[D \mid Z = 1] - E[D \mid Z = 0]) < \infty$

Implies SD tests on $\hat F_1(y), \hat F_0(y)$ yield valid inference for the SD rankings of $\hat F_1^C(y), \hat F_0^C(y)$

Different $Z$s yield different results if the treatment effect varies across the population

Data Issues

Data issues are a fact of life

Frequently encountered are problems pertaining to missing or contaminated data

Sample selection concerns missing data on the dependent variable

Contaminated data refers to a scenario where one is interested in the marginal distribution of a potentially mismeasured variable

Measurement error more generally refers to mismeasured dependent or independent variables

Data Issues
Sample Selection

Population model

$$y_i = x_i\beta + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2)$$

Given a random sample, $\{y_i, x_i\}_{i=1}^N$, then OLS is consistent and efficient if the usual assumptions are satisfied

Problem arises when data on $y$ is only available for a non-random sample

I Let $S_i = 1$ if $y_i$ is observed; $S_i = 0$ if $y_i$ is unobserved

Note: While exposition is using cross-section, a common source of (non-random) selection is attrition in panel data; particularly important in firm-level studies where attrition may be due to firms exiting the market

Example: Certain subpopulations may not be representative of the population

Implies following data structure
I Have data on a random sample, $\{y_i, x_i, S_i\}_{i=1}^N$, but $y_i = .$ if $S_i = 0$
I Can only use $M \equiv \sum_i S_i$ observations to estimate any model
I Examples
F Wages only observed for workers
F Firm profits only observed for firms that remain in business
F Test scores only observed for test takers
F House prices only observed for houses on the market (sold?)

Issue
I Is OLS still unbiased and consistent?
I Answer: depends

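Heckman Model (Heckman 1979)

$$y_i = x_i\beta + \varepsilon_i$$

$$S_i^* = z_i\gamma + u_i$$

$$S_i = \begin{cases} 1 & \text{if } S_i^* > 0 \\ 0 & \text{if } S_i^* \leq 0 \end{cases}$$

$$y_i = . \text{ if } S_i = 0$$

$$\varepsilon_i, u_i \sim N_2(0, 0, \sigma_\varepsilon^2, 1, \rho)$$

$x, z$ are exogenous

Problem
I $E[y \mid x] = x\beta$, but

$$E[y \mid x, S = 1] = E[y \mid x, z, u] = x\beta + E[\varepsilon \mid x, z, u] = x\beta + E[\varepsilon \mid u > -z\gamma] = x\beta + \rho\sigma_\varepsilon\frac{\phi(z\gamma)}{\Phi(z\gamma)}$$

where $\phi(z\gamma)/\Phi(z\gamma)$ is the Inverse Mills' Ratio from before

I Implies that $E[y \mid x, S = 1] = x\beta$ iff $\rho = 0$
I OLS estimation of

$$y_i = x_i\beta + \tilde\varepsilon_i$$

using only $M$ observations omits the IMR term, which implies that

$$\tilde\varepsilon_i = \rho\sigma_\varepsilon\,\phi(z_i\gamma)/\Phi(z_i\gamma) + \varepsilon_i$$

which is not mean zero, and is not independent of $x$, unless $\rho = 0$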

Solution
I Estimate IMR (using $i = 1, ..., N$)
F Estimate probit model, where $S$ is dependent variable and $z$ are the covariates $\Rightarrow$ $\hat\gamma$
F Obtain

$$IMR_i = \frac{\phi(z_i\hat\gamma)}{\Phi(z_i\hat\gamma)}$$

I Regress $y_i$ on $x_i$, $IMR_i$ via OLS (using $i = 1, ..., M$)
I Known as Heckman two-step method
I Test of endogenous selection

$$H_0: \beta_\lambda = 0$$

$$H_a: \beta_\lambda \neq 0$$

where $\beta_\lambda$ is the coefficient on the IMR
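
The two-step procedure can be sketched end to end with the standard library. This is an illustration under stated assumptions, not a substitute for Stata's -heckman-: the data are simulated with hypothetical parameter values (true $\beta = (1, 2)$, $\rho = 0.8$, $\sigma_\varepsilon = 1$), the probit is fit by a hand-written Fisher scoring loop, and no standard errors are computed. The variable `w` plays the role of an exclusion restriction.

```python
import math
import random

def pdf(c):
    return math.exp(-0.5 * c * c) / math.sqrt(2.0 * math.pi)

def cdf(c):
    return 0.5 * (1.0 + math.erf(c / math.sqrt(2.0)))

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def ols(rows, y):
    """OLS coefficients from the normal equations."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

def probit(rows, s, iterations=15):
    """Probit MLE via Fisher scoring: gamma <- gamma + info^{-1} score."""
    k = len(rows[0])
    gamma = [0.0] * k
    for _ in range(iterations):
        score = [0.0] * k
        info = [[0.0] * k for _ in range(k)]
        for r, si in zip(rows, s):
            index = dot(gamma, r)
            p = min(max(cdf(index), 1e-10), 1.0 - 1e-10)
            f = pdf(index)
            for i in range(k):
                score[i] += r[i] * f * (si - p) / (p * (1.0 - p))
                for j in range(k):
                    info[i][j] += r[i] * r[j] * f * f / (p * (1.0 - p))
        gamma = [g + d for g, d in zip(gamma, solve(info, score))]
    return gamma

rng = random.Random(0)
n, rho = 6000, 0.8
z_rows, x_rows, y, s = [], [], [], []
for _ in range(n):
    x, w, u = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
    eps = rho * u + math.sqrt(1.0 - rho ** 2) * rng.gauss(0, 1)
    x_rows.append([1.0, x])
    z_rows.append([1.0, x, w])             # w appears in z but not in x
    s.append(1 if 0.5 * x + w + u > 0 else 0)
    y.append(1.0 + 2.0 * x + eps)

gamma_hat = probit(z_rows, s)               # step 1: probit on all N observations
keep = [i for i in range(n) if s[i] == 1]   # the M observations with y observed
y_sel = [y[i] for i in keep]
imr = [pdf(dot(gamma_hat, z_rows[i])) / cdf(dot(gamma_hat, z_rows[i])) for i in keep]
naive = ols([x_rows[i] for i in keep], y_sel)
two_step = ols([x_rows[i] + [m] for i, m in zip(keep, imr)], y_sel)   # step 2
print([round(b, 2) for b in naive])
print([round(b, 2) for b in two_step])
```

In this design the OLS fit on the selected sample should overstate the intercept, while the two-step coefficients should sit near $(1, 2)$ with the IMR coefficient near $\rho\sigma_\varepsilon = 0.8$.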

Notes
I Usual OLS standard errors are incorrect since IMR is predicted; must account for additional uncertainty due to estimation of $\gamma$
I Other complications in derivation of standard errors
I Need an exclusion restriction(s)
F A variable in $z$ not in $x$
F Otherwise model is identified from non-linearity of IMR, which arises solely from the assumption of joint normality
F However, even though technically identified from the non-linearity, substantial collinearity in practice makes identification questionable
I Model can be estimated in one-step by ML
F More efficient if model assumptions are valid
F Less robust in general since more dependent on functional form assumptions

Stata: -heckman-, -heckman2-

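QR alternative

Assume the latent outcome is

$$y_i^* = x_i\beta + u_i$$

$y^*$ is unobserved; instead observe

$$y_i = \begin{cases} y_i^* & \text{if observed} \\ . & \text{otherwise} \end{cases}$$

QR model estimated using data on $\{\tilde y_i, x_i\}$, where

$$\tilde y_i = \begin{cases} y_i & \text{if observed} \\ \min\{y_i\} & \text{otherwise} \end{cases}$$

yields

$$\hat\beta_\theta = \arg\min_\beta \sum_i \rho_\theta(\tilde y_i - x_i\beta)$$

which is consistent as long as all missing values of $y_i^* \leq Q_\theta(y^* \mid x)$

More generally, QR model estimated using data on $\{\tilde y_i, x_i\}$, where

$$\tilde y_i = \begin{cases} y_i & \text{if observed} \\ \text{imputed value} & \text{otherwise} \end{cases}$$

yields

$$\hat\beta_\theta = \arg\min_\beta \sum_i \rho_\theta(\tilde y_i - x_i\beta)$$

which is consistent as long as imputed values lie on the correct side of $Q_\theta(y^* \mid x)$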
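
The reason imputation works for quantiles can be seen in the simplest case with no covariates, where the QR fit is just a sample quantile. In the hypothetical data below the latent outcome is observed only when positive, and every missing value lies below the latent median, so filling the gaps with the smallest observed value leaves the median unchanged even though the mean of the observed values is badly off.

```python
import math

def sample_quantile(values, theta):
    """Empirical theta-quantile: smallest value whose ECDF is at least theta."""
    ordered = sorted(values)
    return ordered[max(math.ceil(theta * len(ordered)) - 1, 0)]

latent = [-3.0, -1.5, -0.2, 0.4, 1.1, 1.9, 2.5, 3.3, 4.0]   # hypothetical latent outcomes
observed = [v for v in latent if v > 0]                        # outcome seen only when positive
floor = min(observed)
imputed = [v if v > 0 else floor for v in latent]              # fill missing with the observed minimum

print(sample_quantile(latent, 0.5), sample_quantile(imputed, 0.5))
print(round(sum(latent) / len(latent), 3), round(sum(observed) / len(observed), 3))
```

The same logic would fail for a low quantile such as $\theta = 0.1$ here, since some missing values then lie above the target quantile.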

Example:

[Figure: ystar plotted against x (0 to 1) with fitted lines: 'true' OLS, 'true' LAD, OLS using y>0 only, and LAD]

NOTE: x~U[0,1]; ystar=-0.25+x+e; e~N(0,0.25^2); y=ystar if ystar>0. LAD fitted line obtained by first replacing y=10 if ystar>true LAD line, -10 otherwise.

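Multiple selection criteria

$$y_i = x_i\beta + \varepsilon_i$$

$$S_{1i}^* = z_{1i}\gamma_1 + u_{1i}, \qquad S_{1i} = \begin{cases} 1 & \text{if } S_{1i}^* > 0 \\ 0 & \text{if } S_{1i}^* \leq 0 \end{cases}$$

$$S_{2i}^* = z_{2i}\gamma_2 + u_{2i}, \qquad S_{2i} = \begin{cases} 1 & \text{if } S_{2i}^* > 0 \\ 0 & \text{if } S_{2i}^* \leq 0 \end{cases}$$

$$y_i = . \text{ if } S_{1i}S_{2i} \neq 1$$

$$\varepsilon_i, u_{1i}, u_{2i} \sim N_3(0, 0, 0, \sigma_\varepsilon^2, 1, 1, \rho_{\varepsilon 1}, \rho_{\varepsilon 2}, \rho_{12})$$

$x, z$ are exogenous

Estimation
I Same as above, except with two IMR terms

$$IMR_{1i} = \frac{\phi(z_{1i}\hat\gamma_1)}{\Phi(z_{1i}\hat\gamma_1)}; \qquad IMR_{2i} = \frac{\phi(z_{2i}\hat\gamma_2)}{\Phi(z_{2i}\hat\gamma_2)}$$

I Coefficients on each IMR term are $\rho_{\varepsilon 1}\sigma_\varepsilon$ and $\rho_{\varepsilon 2}\sigma_\varepsilon$

Examples
I Grameen Bank: only observe outcome of credit amount if village contains a bank, and income makes one eligible
I Child care: only observe price paid for child care if work and use market-based day care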

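Regime switching models

$$S_i^* = z_i\gamma + u_i$$

$$S_i = \begin{cases} 1 & \text{if } S_i^* > 0 \\ 0 & \text{if } S_i^* \leq 0 \end{cases}$$

$$y_i = \begin{cases} x_i\beta_1 + \varepsilon_{1i} & \text{if } S_i = 1 \\ x_i\beta_0 + \varepsilon_{0i} & \text{if } S_i = 0 \end{cases}$$

which is the previous model for treatment effects

Applicable to any situation where one thinks determinants of the outcome (i.e., $\beta$) differ across groups or regimes

May be extended to multiple regimes

$$S_i^* = z_i\gamma + u_i$$

$$S_i = \begin{cases} 0 & \text{if } S_i^* \leq 0 \\ 1 & \text{if } S_i^* \in (0, \alpha_1] \\ 2 & \text{if } S_i^* \in (\alpha_1, \alpha_2] \\ \vdots \\ K & \text{if } S_i^* > \alpha_{K-1} \end{cases}$$

$$y_i = \begin{cases} x_i\beta_0 + \varepsilon_{0i} & \text{if } S_i = 0 \\ x_i\beta_1 + \varepsilon_{1i} & \text{if } S_i = 1 \\ \vdots \\ x_i\beta_K + \varepsilon_{Ki} & \text{if } S_i = K \end{cases}$$

Estimate each regime separately

$$y_i = x_i\beta_k + \rho_{u\varepsilon_k}\sigma_{\varepsilon_k}\widehat{IMR}_{ki} + \eta_{ki}$$

$$\widehat{IMR}_k = \begin{cases} \dfrac{-\phi(z_i\hat\gamma)}{1 - \Phi(z_i\hat\gamma)} & \text{if } S_i = 0 \\[2ex] \dfrac{\phi(\alpha_{k-1} - z_i\hat\gamma) - \phi(\alpha_k - z_i\hat\gamma)}{\Phi(\alpha_k - z_i\hat\gamma) - \Phi(\alpha_{k-1} - z_i\hat\gamma)} & \text{if } S_i = k \in \{1, 2, ..., K - 1\} \\[2ex] \dfrac{\phi(\alpha_{K-1} - z_i\hat\gamma)}{1 - \Phi(\alpha_{K-1} - z_i\hat\gamma)} & \text{if } S_i = K \end{cases}$$

and $\alpha_0 = 0$ and $\gamma$ is estimated via ordered probit

Examples
I Wages by firm size (Main & Reilly 1993)
I Various outcomes by education or household size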

Regime switching models with unknown switch point

Si = ziγ+ ui

1 if Si > c0 if Si 6 c

xi β1 + ε1i if Si = 1xi β0 + ε0i if Si = 0

where S is observed, but c is unknown

Estimation
I ML, where c is an unknown parameter
I Grid search:
F Estimate model for several plausible values of c
F $\hat{c}$ and resulting estimates $\hat{\beta}$ are those that minimize total SSE
I Examples
F Wages of PT vs. FT (Hotchkiss 1991)
F Outcomes of DCs vs. LDCs
F Stock market performance of large vs. small firms
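The grid search can be sketched in a few lines. This is a simplified illustration, assuming a single regressor plus a constant, plain OLS in each regime (no selection correction), and a user-supplied grid; the helper names are hypothetical:

```python
def ols_sse(x, y):
    """Sum of squared residuals from regressing y on a constant and x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))


def grid_search_switch_point(s_star, x, y, grid):
    """Return (c_hat, total SSE) minimizing the pooled SSE over candidate c."""
    best_c, best_sse = None, float("inf")
    for c in grid:
        upper = [(a, b) for s, a, b in zip(s_star, x, y) if s > c]
        lower = [(a, b) for s, a, b in zip(s_star, x, y) if s <= c]
        # Skip splits where a regime cannot identify a slope.
        if len({a for a, _ in upper}) < 2 or len({a for a, _ in lower}) < 2:
            continue
        total = ols_sse(*zip(*upper)) + ols_sse(*zip(*lower))
        if total < best_sse:
            best_c, best_sse = c, total
    return best_c, best_sse
```

The regime-specific coefficients at $\hat{c}$ are then obtained by re-running the two regressions on the split implied by `best_c`.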

Separate literature on selection models with panel data

Bounding distributions (Blundell et al. 2007)

Notation
I $W^*$ = latent outcome variable
I $E$ = selection indicator
I $W$ = outcome variable, where

$$W = \begin{cases} W^* & \text{if } E = 1 \\ . & \text{otherwise} \end{cases}$$

I $X$ = covariate vector

Goal: bound CDF $F(w|x)$ given observable CDF $F(w|x, E=1)$

Examples:
I Dbn of wages under full employment
I Dbn of child health under full HI coverage
I Dbn of student achievement under universal attendance at public schools
I Dbn of test scores on college entrance exams with full participation

Worst case bounds

Identity

$$F(w|x) = F(w|x, E=1)p(x) + F(w|x, E=0)[1 - p(x)]$$

where $p(x) \equiv \Pr(E = 1|x)$

$F(w|x, E=0)$ is unknown, but must lie in unit interval

Replacing $F(w|x, E=0)$ with zero and one yields

$$F(w|x, E=1)p(x) \le F(w|x) \le F(w|x, E=1)p(x) + [1 - p(x)]$$

Example (ignoring x):
I $F(10|E=1) = 0.4$
I $\Pr(E=1) = 0.9$
$\Rightarrow F(10) \in [0.36, 0.46]$

Can be rewritten in terms of bounds on quantiles

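$$w_{q,l}(x) \le w_q(x) \le w_{q,u}(x)$$

where
I $w_q(x)$ = qth quantile of $F(w|x)$
I $w_{q,l}(x)$ is the value of w that solves

$$q = F(w|x, E=1)p(x) + [1 - p(x)] \quad \Leftrightarrow \quad w = F^{-1}\left( \frac{q - [1 - p(x)]}{p(x)} \,\Big|\, x, E=1 \right)$$

I $w_{q,u}(x)$ is the value of w that solves

$$q = F(w|x, E=1)p(x) \quad \Leftrightarrow \quad w = F^{-1}\left( \frac{q}{p(x)} \,\Big|\, x, E=1 \right)$$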

Example
I $q = 0.5$, $p(x) = 0.9$
I $w_{q,l}(x) = F^{-1}(q''|x, E=1)$, where $q'' = (0.5 - 0.1)/0.9 = 0.4/0.9 \approx 0.44$
I $w_{q,u}(x) = F^{-1}(q'|x, E=1)$, where $q' = 0.5/0.9 \approx 0.56$
$\Rightarrow$ bounds on the median are given by the values of the observed conditional dbn at (roughly) the 44th and 56th percentiles

Notes
I Bounds cannot be used to determine if selection is non-random; only assess the possible consequences
I Bounds only estimable for $q \in [1 - p(x), p(x)]$
I Bounds converge to point estimates as $p(x) \to 1$
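The mapping from a target quantile to the two quantile levels of the observed distribution is simple arithmetic. A small sketch (the function name is illustrative) that also encodes the estimability condition $q \in [1 - p(x), p(x)]$:

```python
def quantile_bound_levels(q, p):
    """Quantile levels of F(w | x, E=1) that bound the q-th quantile of F(w | x).

    q : target quantile of the full distribution
    p : selection probability Pr(E = 1 | x)
    Returns (level for lower bound, level for upper bound), or None when the
    bounds are not estimable, i.e. q lies outside [1 - p, p].
    """
    lower_level = (q - (1.0 - p)) / p
    upper_level = q / p
    if lower_level < 0.0 or upper_level > 1.0:
        return None
    return lower_level, upper_level
```

For the example above, `quantile_bound_levels(0.5, 0.9)` gives levels of about 0.444 and 0.556; the bounds themselves are the corresponding empirical quantiles of the observed outcomes.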

Positive selection

Stochastic dominance
I One characterization of positive selection is to assume that

$$F(w|x, E=1) \ \text{FSD} \ F(w|x, E=0) \ \Leftrightarrow \ F(w|x, E=1) \le F(w|x, E=0) \quad \forall w, \forall x$$

I Equivalent to $\Pr(E=1|W^* \le w, x) \le \Pr(E=1|W^* > w, x)$
I Bounds on $F(w|x)$ become

$$F(w|x, E=1) \le F(w|x) \le F(w|x, E=1)p(x) + [1 - p(x)]$$

since the missing term, $F(w|x, E=0)$, is now bounded from below at $F(w|x, E=1)$

Example (ignoring x):
I $F(10|E=1) = 0.4$
I $\Pr(E=1) = 0.9$
$\Rightarrow F(10) \in [0.4, 0.46]$ whereas the worst-case bounds were $[0.36, 0.46]$

Median restriction
I Weaker characterization is to assume (conditional on x) that $w_{0.5(E=1)} > w_{0.5(E=0)}$
I Equivalent to $\Pr(E=1|W^* \le w_{0.5(E=1)}, x) \le \Pr(E=1|W^* > w_{0.5(E=1)}, x)$
I Bounds on $F(w|x)$ become

$$F(w|x, E=1)p(x) \le F(w|x) \le F(w|x, E=1)p(x) + [1 - p(x)] \quad \text{if } w < w_{0.5(E=1)}$$

$$F(w|x, E=1)p(x) + 0.5[1 - p(x)] \le F(w|x) \le F(w|x, E=1)p(x) + [1 - p(x)] \quad \text{if } w \ge w_{0.5(E=1)}$$

I Bounds are tightened (relative to worst case) only above the median since the missing term, $F(w|x, E=0)$, is now bounded from below at 0.5 for $w \ge w_{0.5(E=1)}$ (instead of zero)
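The worst-case, stochastic dominance, and median restriction bounds differ only in the lower limit placed on the unobserved $F(w|x, E=0)$. A short sketch (function and argument names are illustrative) makes that explicit:

```python
def cdf_bounds(f_selected, p, missing_lower=0.0):
    """Bounds on F(w | x) given F(w | x, E=1) and p(x) = Pr(E=1 | x).

    missing_lower is the assumed lower limit on the unobserved F(w | x, E=0):
      0.0        -> worst-case bounds
      f_selected -> positive selection via first-order stochastic dominance
      0.5        -> median restriction, for w at or above the E=1 median
    The upper limit on the unobserved term is always 1.
    """
    lower = f_selected * p + missing_lower * (1.0 - p)
    upper = f_selected * p + (1.0 - p)
    return lower, upper
```

With $F(10|E=1) = 0.4$ and $\Pr(E=1) = 0.9$, the three assumptions give lower bounds of 0.36, 0.40, and 0.41 (the last applying only if 10 lies at or above the median of the selected group); the upper bound is 0.46 throughout.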

Exclusion restriction

Conditional independence
I Assume z satisfies

$$F(w|x, z) = F(w|x) \quad \forall w, x, z$$

I Bounds on $F(w|x)$ become

$$\max_z \{F(w|x, z, E=1)p(x, z)\} \le F(w|x) \le \min_z \{F(w|x, z, E=1)p(x, z) + [1 - p(x, z)]\}$$

I If conditional independence is not true, bounds may cross; failure of bounds to cross does not prove conditional independence holds
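Under the exclusion restriction the bounds are the intersection of the worst-case bounds across values of z. A minimal sketch, assuming a discrete z and taking the cell-level inputs as given (names are illustrative):

```python
def exclusion_restriction_bounds(cells):
    """Intersect worst-case bounds across values of the excluded variable z.

    cells : list of (F(w | x, z, E=1), p(x, z)) pairs, one per value of z
    Returns (lower, upper, crossed); crossed=True means lower > upper, which is
    evidence against the conditional independence assumption.
    """
    lower = max(f * p for f, p in cells)
    upper = min(f * p + (1.0 - p) for f, p in cells)
    return lower, upper, lower > upper
```

A value of z with a high selection probability tends to drive both bounds, since its worst-case interval is the narrowest.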

Monotonicity
I Higher values of z improve the dbn in a FSD sense

$$F(w|x, z') \le F(w|x, z'') \quad \forall w, x, z', z'' \ \text{s.t.} \ z' > z''$$

I Bounds on $F(w|x, z_1)$ become

$$\max_{z \ge z_1} \{F(w|x, z, E=1)p(x, z)\} \le F(w|x, z_1) \le \min_{z \le z_1} \{F(w|x, z, E=1)p(x, z) + [1 - p(x, z)]\}$$

I Bounds on $F(w|x)$ obtained by integrating over the dbn of z; entails computing the weighted average of the upper and lower bounds across the different values, $z_1$, where the weights are sample proportion, $\Pr(z = z_1|x)$

Bounding differences in QTEs across groups accounting for non-random selection

Notation
I $D \in \{0, 1\}$ indexes groups
I $T \in \{0, 1\}$ indexes time period

Bounds on QTEs across groups in a given time period

$$w_{q,l}(1,T) - w_{q,u}(0,T) \le w_q(1,T) - w_q(0,T) \le w_{q,u}(1,T) - w_{q,l}(0,T)$$

Bounds on QTEs across time for a given group

$$w_{q,l}(D,1) - w_{q,u}(D,0) \le w_q(D,1) - w_q(D,0) \le w_{q,u}(D,1) - w_{q,l}(D,0)$$

Bounds on diff-QTEs across groups

$$[w_q(1,1) - w_q(0,1)] - [w_q(1,0) - w_q(0,0)] \in [LB, UB]$$

$$LB = [w_{q,l}(1,1) - w_{q,u}(0,1)] - [w_{q,u}(1,0) - w_{q,l}(0,0)]$$

$$UB = [w_{q,u}(1,1) - w_{q,l}(0,1)] - [w_{q,l}(1,0) - w_{q,u}(0,0)]$$

I Example: Change in median wage gap across males and females over period T = 0 to T = 1
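The diff-in-diff bounds follow mechanically from the four sets of quantile bounds. A short sketch (the function name and argument order are illustrative), with each argument a `(lower, upper)` pair for $w_q(D, T)$:

```python
def diff_qte_bounds(b11, b01, b10, b00):
    """Bounds on [w_q(1,1) - w_q(0,1)] - [w_q(1,0) - w_q(0,0)].

    bDT = (w_{q,l}(D,T), w_{q,u}(D,T)) for group D and period T.
    """
    lb = (b11[0] - b01[1]) - (b10[1] - b00[0])
    ub = (b11[1] - b01[0]) - (b10[0] - b00[1])
    return lb, ub
```

For instance, with bounds (10, 12), (8, 9), (9, 10), and (7, 8) for cells (1,1), (0,1), (1,0), and (0,0), the diff-QTE lies in [-2, 3]. Because the four sets of bounds are combined in the least favorable way, the resulting interval is considerably wider than any single cell's interval.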

Level set restrictions
I Assume diff-QTE, $[w_q(1,1) - w_q(0,1)] - [w_q(1,0) - w_q(0,0)]$, is constant across different values of some covariate $x \in X$
I Calculate $LB(x), UB(x) \ \forall x \in X$
I New LB, UB given by

$$LB = \max_{x \in X} LB(x)$$

$$UB = \min_{x \in X} UB(x)$$

Test statistics derived in Blundell et al. for bounds crossings, whether observed conditional distribution, $F(w|x, E=1)$, lies in the bounds

Inference via bootstrap

Bounding differences in average treatment effects across groups accounting for non-random selection

Lechner and Melly (2007)

Imai (2008)

Lee (2009)

Huber and Mellace (2011)

Data Issues
Contamination

Horowitz and Manski (1995); see also Chen et al. (JEL 2011)

Goal is to bound the marginal distribution of $y^*$, where

$$y_i = d_i y^*_i + (1 - d_i)\tilde{y}_i$$

where $y^*$ is the true value, $\tilde{y}$ is the mismeasured value, and $d = 1$ in the absence of contamination (0 otherwise)

Add more!

Data Issues
Measurement Error

Refer to ECO 6374 for refresher on basics...

Problem: sometimes (often!) data are measured imprecisely; see Bound et al. (2001), Millimet (2011)

Data Issues
ME: Classical Errors-in-Variables (CEV) model

Continuous dependent variable

$$\underbrace{y_i}_{\text{observed}} = \underbrace{y^*_i}_{\text{actual}} + \underbrace{\mu_i}_{\text{ME}}$$

I Assumptions
(CEV.i) True model: $y^*_i = \alpha + \beta x_i + \varepsilon_i$
(CEV.ii) Normality and Mean Zero: $\mu_i \sim N(0, \sigma^2_\mu)$
(CEV.iii) Independence: $Cov(x, \mu) = 0$
I Implications
F OLS unbiased, consistent
F Standard errors are correct
F $\downarrow R^2$, $\uparrow$ standard errors due to extra noise in the data

Continuous independent variable

$$\underbrace{x_i}_{\text{observed}} = \underbrace{x^*_i}_{\text{actual}} + \underbrace{\mu_i}_{\text{ME}}$$

I Assumptions (in addition to previous assumptions)
(CEV.iv) Independence: $Cov(\mu, \varepsilon) = 0$
I Implications
F OLS biased, inconsistent unless $\beta = 0$
F $\hat{\beta}_{OLS}$ suffers from attenuation bias
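A quick simulation illustrates the contrast between the two CEV cases. This is a sketch under assumed values (standard normal regressor and errors, intercept of 1, a single regressor); under the standard CEV result the slope with a mismeasured regressor converges to $\beta \sigma^2_{x^*}/(\sigma^2_{x^*} + \sigma^2_\mu)$:

```python
import random


def ols_slope(x, y):
    """Slope from a simple regression of y on a constant and x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simulate_cev(n, beta, sd_x, sd_me, seed=0):
    """Return (slope with ME in y only, slope with ME in x only)."""
    rng = random.Random(seed)
    x_true = [rng.gauss(0.0, sd_x) for _ in range(n)]
    y_true = [1.0 + beta * x + rng.gauss(0.0, 1.0) for x in x_true]
    y_obs = [y + rng.gauss(0.0, sd_me) for y in y_true]  # ME in dependent variable
    x_obs = [x + rng.gauss(0.0, sd_me) for x in x_true]  # ME in independent variable
    return ols_slope(x_true, y_obs), ols_slope(x_obs, y_true)
```

With `beta = 2` and equal variances for $x^*$ and $\mu$, the first slope should be close to 2 and the second close to 1, i.e. attenuated by one half.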

Data Issues
ME: Binary Dependent Variable (Hausman et al. 1998)

True model

$$D^*_i = x_i\beta + \varepsilon_i$$

where * on a variable indicates correctly measured

Given a random sample $\{D^*_i, x_i\}_{i=1}^N$, assume logit model is consistent and efficient

I Logit probabilities

$$\Pr(D^* = 1|x) = \frac{\exp(x_i\beta)}{1 + \exp(x_i\beta)}$$

$$\Pr(D^* = 0|x) = \frac{1}{1 + \exp(x_i\beta)}$$

I Estimation by ML

$$\ln L = \sum_i \{ I[D^* = 1] \ln[\Pr(D^* = 1|x)] + I[D^* = 0] \ln[\Pr(D^* = 0|x)] \}$$

With measurement error, do not observe $D^*_i$
I Instead one observes $D_i$
I Introduce following notation

$$\alpha_0 \equiv \Pr(D_i = 1|D^*_i = 0)$$

$$\alpha_1 \equiv \Pr(D_i = 0|D^*_i = 1)$$

I $\alpha_0, \alpha_1$ dependent on $D^*$, but not on $x_i$

Estimation
I Probabilities of observed responses

$$\Pr(D = 1|x) = \Pr(D_i = 1|D^*_i = 0)\Pr(D^*_i = 0|x) + \Pr(D_i = 1|D^*_i = 1)\Pr(D^*_i = 1|x) = \alpha_0 + (1 - \alpha_0 - \alpha_1)\frac{\exp(x_i\beta)}{1 + \exp(x_i\beta)}$$

$$\Pr(D = 0|x) = 1 - \Pr(D = 1|x) = 1 - \alpha_0 - (1 - \alpha_0 - \alpha_1)\frac{\exp(x_i\beta)}{1 + \exp(x_i\beta)}$$

I Estimation by ML

$$\ln L = \sum_i \{ I[D = 1] \ln[\Pr(D = 1|x)] + I[D = 0] \ln[\Pr(D = 0|x)] \}$$

I Extension to probit is trivial
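The modified likelihood is easy to code directly. A minimal sketch of the log-likelihood only (maximization over $\beta, \alpha_0, \alpha_1$ would be handed to a numerical optimizer, not shown); the function names and data layout are illustrative assumptions:

```python
import math


def logistic(v):
    """Logit CDF: exp(v) / (1 + exp(v))."""
    return 1.0 / (1.0 + math.exp(-v))


def prob_observed_one(index, alpha0, alpha1):
    """Pr(D = 1 | x) when the true response is misclassified."""
    return alpha0 + (1.0 - alpha0 - alpha1) * logistic(index)


def misclassification_loglik(beta, alpha0, alpha1, rows, d):
    """Log-likelihood of observed responses d given covariate rows.

    rows : list of covariate lists (include a 1.0 for the intercept)
    d    : list of observed 0/1 responses
    """
    total = 0.0
    for row, d_i in zip(rows, d):
        index = sum(b * x for b, x in zip(beta, row))
        p_one = prob_observed_one(index, alpha0, alpha1)
        total += math.log(p_one) if d_i == 1 else math.log(1.0 - p_one)
    return total
```

Setting `alpha0 = alpha1 = 0` recovers the standard logit log-likelihood, which is a convenient check on any implementation.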

Identification
I In linear probability model (LPM), conditional expectation given by

$$E[D|x] = \Pr(D_i = 1|D^*_i = 0)\Pr(D^*_i = 0|x) + \Pr(D_i = 1|D^*_i = 1)\Pr(D^*_i = 1|x)$$

$$= \alpha_0 + (1 - \alpha_0 - \alpha_1)(x_i\beta)$$

$$= \alpha_0 + (1 - \alpha_0 - \alpha_1)(\beta_0 + \tilde{x}_i\beta_1)$$

$$= [\alpha_0 + (1 - \alpha_0 - \alpha_1)\beta_0] + \tilde{x}_i(1 - \alpha_0 - \alpha_1)\beta_1$$

which makes clear that identification of $\alpha_0$, $\alpha_1$, and $\beta$ arises from non-linearity of probit/logit, in addition to ...

I Monotonicity assumption: $\alpha_0 + \alpha_1 < 1$
I Semiparametric alternatives available

Data Issues
ME: Binary Independent Variable

True model

$$y_i = \alpha + \beta D^*_i + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma_{\varepsilon\varepsilon})$$

where * on a variable indicates correctly measured

Given a random sample $\{y_i, D^*_i\}_{i=1}^N$, assume OLS is consistent and efficient

With measurement error, do not observe $D^*_i$

Instead one observes $D_i$ where

$$\underbrace{D_i}_{\text{observed}} = \underbrace{D^*_i}_{\text{true}} + \underbrace{\mu_i}_{\text{ME}}$$

which implies that $\mu \in \{0, 1\}$ if $D^* = 0$, and $\mu \in \{0, -1\}$ if $D^* = 1$

Thus, measurement error is
I Not normally distributed (violates CEV.ii)
I Negatively correlated with $D^*$ (violates CEV.iii)

Assumptions

(BME.i) Non-differential classification errors: $E[y|D^*] = E[y|D^*, D]$
(BME.ii) $D^* \perp \varepsilon$
(BME.iii) $Cov(D^*, D) > 0$
(BME.iv) $Cov(D^*, \mu) < 0$

Given (BME.i) - (BME.iv), the probability limit of OLS is

$$\text{plim}\,\hat{\beta}_{OLS} = \beta \, \frac{\sigma_{D^*D^*} + \sigma_{D^*\mu}}{\sigma_{D^*D^*} + 2\sigma_{D^*\mu} + \sigma_{\mu\mu}}$$

Results in attenuation bias for $\beta$ if $\sigma_{D^*\mu} + \sigma_{\mu\mu} > 0$

Likely true for any mismeasured bounded variable

Millimet (2011) conducts MC study comparing common treatment effect estimators ($\Delta = 1$)
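The attenuation factor can be computed exactly for a misclassified binary regressor. The sketch below is an illustration under an added assumption not stated on the slide: misclassification rates depend only on the true status, with $\alpha_0 = \Pr(D=1|D^*=0)$ and $\alpha_1 = \Pr(D=0|D^*=1)$. It builds the joint distribution of $(D^*, \mu)$ and evaluates the ratio $(\sigma_{D^*D^*} + \sigma_{D^*\mu})/(\sigma_{D^*D^*} + 2\sigma_{D^*\mu} + \sigma_{\mu\mu})$:

```python
def misclassification_moments(p, alpha0, alpha1):
    """Return (var(D*), cov(D*, mu), var(mu)) for a misclassified binary D*.

    p      : Pr(D* = 1)
    alpha0 : Pr(D = 1 | D* = 0), so mu = +1 with this probability when D* = 0
    alpha1 : Pr(D = 0 | D* = 1), so mu = -1 with this probability when D* = 1
    """
    cells = [  # (D*, mu, probability)
        (0, 0, (1 - p) * (1 - alpha0)),
        (0, 1, (1 - p) * alpha0),
        (1, 0, p * (1 - alpha1)),
        (1, -1, p * alpha1),
    ]
    mean_d = sum(d * w for d, _, w in cells)
    mean_mu = sum(m * w for _, m, w in cells)
    var_d = sum((d - mean_d) ** 2 * w for d, _, w in cells)
    cov_dm = sum((d - mean_d) * (m - mean_mu) * w for d, m, w in cells)
    var_mu = sum((m - mean_mu) ** 2 * w for _, m, w in cells)
    return var_d, cov_dm, var_mu


def attenuation_factor(p, alpha0, alpha1):
    """Ratio multiplying beta in the plim of OLS on the mismeasured D."""
    var_d, cov_dm, var_mu = misclassification_moments(p, alpha0, alpha1)
    return (var_d + cov_dm) / (var_d + 2 * cov_dm + var_mu)
```

For example, with $\Pr(D^*=1) = 0.5$ and 10% misclassification in each direction, the covariance is negative (as in BME.iv) and the factor is 0.8, so OLS recovers 80% of $\beta$ in the limit.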

Partial solutions (Aigner 1973; Bollinger 1996; Black et al. 2000)

Reverse regression
I Estimate via OLS

$$D_i = \pi_0 + \pi_1 y_i + \upsilon_i$$

I plim given by

$$\text{plim}\,\hat{\pi}^{-1}_{1,OLS} = \frac{\beta^2\sigma_{D^*D^*} + \sigma_{\varepsilon\varepsilon}}{\beta(\sigma_{D^*D^*} + \sigma_{D^*\mu})}$$

which is biased up in absolute value

I Implies $\hat{\beta}_{D^*,OLS} \in [\hat{\beta}_{OLS}, \hat{\pi}^{-1}_{1,OLS}]$, where $\hat{\beta}_{D^*,OLS}$ is the OLS estimate if $D^*$ were observed (Frisch bounds)
I If $R^2$ is low, then bounds obtained using reverse regression may be uninformative
I IV estimation also yields an upper bound (not a consistent estimate!), that may be more informative in many cases
I Inconsistency of IV results from fact that any instrument correlated with $D^*$ will most likely be correlated with $\mu$ since $Cov(D^*, \mu) \neq 0$

Improved lower bound obtained by estimating

$$y_i = \alpha + \beta_0 I[D_i = 0, D'_i = 1] + \beta_1 I[D_i = 1, D'_i = 0] + \beta_2 I[D_i = 1, D'_i = 1] + \eta_i$$

where $D'_i$ is a second mis-measured indicator
I If the measurement errors are independent conditional on actual treatment assignment, $D^*_i$, then

$$0 < E[\hat{\beta}_{OLS}] < E[\hat{\beta}_{2,OLS}] < |\beta|$$

Bound $\hat{\beta}_{D^*,OLS}$ under various assumptions concerning severity of measurement error (papers by Kreider and Pepper)

Full Solutions

Point estimates possible using method-of-moments framework

Brachet (2008) proposes following algorithm
1 Estimate Hausman et al. misclassification probit, including an instrument z in the first-stage
2 Replace $D$ with $\Pr(D^*_i = 1|x, z)$ in second-stage

McCarthy & Tchernis (2011) consider a similar approach in a Bayesian framework

Partial solutions (Kreider & Pepper 2007)

Utilize a non-regression approach to bound the effect of a mis-measured binary treatment

Authors do not wish to invoke (BME.i), which implies that mis-reporting is independent of outcomes conditional on the truth

Notation
I $y \in \{0, 1\}$ is a binary outcome (correctly measured)
I $D^* \in \{0, 1\}$ is the true binary treatment
I $D \in \{0, 1\}$ is the reported binary treatment
I $Z \in \{0, 1\}$, where $Z = 1$ if $D^* = D$ and 0 otherwise

Estimand of interest: $\Delta = \Pr(y = 1|D^* = 1) - \Pr(y = 1|D^* = 0)$

Data provides an estimate of $\Pr(y = 1|D)$

Manipulation yields

$$\Pr(y = 1|D^* = 1) = \frac{\Pr(y = 1, D^* = 1)}{\Pr(D^* = 1)} = \frac{\Pr(y = 1, D = 1) + \Pr(y = 1, D = 0, Z = 0) - \Pr(y = 1, D = 1, Z = 0)}{\Pr(D = 1) + \Pr(D = 0, Z = 0) - \Pr(D = 1, Z = 0)}$$

where $\Pr(D = 1, Z = 0)$ is a false positive and $\Pr(D = 0, Z = 0)$ is a false negative

Data provide estimates of $\Pr(y = 1, D = 1)$, $\Pr(D = 1)$

Other elements are unknown, but bounded by the unit interval

Lower-Bound Accurate Reporting Rate
I Assume $\Pr(Z = 1) \ge v$
I Can show that

$$\Pr(y = 1|D^* = 1) \in \left[ \frac{\Pr(y = 1, D = 1) - \delta}{\Pr(D = 1) - 2\delta + (1 - v)}, \ \frac{\Pr(y = 1, D = 1) + \gamma}{\Pr(D = 1) + 2\gamma - (1 - v)} \right]$$

$$\delta = \begin{cases} \min\{(1 - v), \Pr(y = 1, D = 1)\} & \text{if } \Pr(y = 1, D = 1) - \Pr(y = 0, D = 1) - (1 - v) \le 0 \\ \max\{0, (1 - v) - \Pr(y = 0, D = 0)\} & \text{otherwise} \end{cases}$$

$$\gamma = \begin{cases} \min\{(1 - v), \Pr(y = 1, D = 0)\} & \text{if } \Pr(y = 1, D = 1) - \Pr(y = 0, D = 1) + (1 - v) \le 0 \\ \max\{0, (1 - v) - \Pr(y = 0, D = 1)\} & \text{otherwise} \end{cases}$$

I Bounds for $\Pr(y = 1|D^* = 0)$ are obtained by replacing $D$ with $1 - D$
I Bounds for each term obtained by replacing elements with sample analogs
I Bounds for $\Delta$ obtained using relevant upper and lower bounds for each term
I When $v = 1$, bounds collapse to a point estimate
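The bounds above translate directly into code. A sketch that takes the four joint probabilities of $(y, D)$ as inputs (in practice, sample analogs); the function and argument names are illustrative, and no guard is included for a zero denominator:

```python
def accurate_reporting_bounds(p11, p01, p10, p00, v):
    """Bounds on Pr(y=1 | D*=1) when at least a share v of reports is accurate.

    pyd = Pr(y = y, D = d): p11 = Pr(y=1, D=1), p01 = Pr(y=0, D=1),
                            p10 = Pr(y=1, D=0), p00 = Pr(y=0, D=0)
    """
    slack = 1.0 - v          # maximum share of misreports
    p_d1 = p11 + p01         # Pr(D = 1)

    if p11 - p01 - slack <= 0:
        delta = min(slack, p11)
    else:
        delta = max(0.0, slack - p00)

    if p11 - p01 + slack <= 0:
        gamma = min(slack, p10)
    else:
        gamma = max(0.0, slack - p01)

    lower = (p11 - delta) / (p_d1 - 2.0 * delta + slack)
    upper = (p11 + gamma) / (p_d1 + 2.0 * gamma - slack)
    return lower, upper
```

As an illustration, with joint probabilities (0.25, 0.25, 0.10, 0.40) the naive estimate of $\Pr(y=1|D=1)$ is 0.5; allowing up to 10% misreporting (`v = 0.9`) widens this to roughly [0.375, 0.625], which shows how quickly even modest reporting error erodes identification.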

Partial Verification
I Might assume a lower bound for accuracy among some sub-group whose status is more certain, $W = 1$
I Assume $\Pr(Z = 1|W = 1) \ge v_w$
I Can show that

$$\Pr(y = 1|D^* = 1) \in \left[ \frac{\Pr(y = 1, D = 1, W = 1) - \delta}{\Pr(D = 1, W = 1) + \Pr(y = 0, W = 0) - 2\delta + (1 - v_w)\Pr(W = 1)}, \ \frac{\Pr(y = 1, D = 1, W = 1) + \Pr(y = 1, W = 0) + \gamma}{\Pr(D = 1, W = 1) + \Pr(y = 1, W = 0) + 2\gamma - (1 - v_w)\Pr(W = 1)} \right]$$

$$\delta = \begin{cases} \min\{(1 - v_w)\Pr(W = 1), \Pr(y = 1, D = 1)\} & \text{if } \alpha \le 0 \\ \max\{0, (1 - v_w)\Pr(W = 1) - \Pr(y = 0, D = 0, W = 1)\} & \text{otherwise} \end{cases}$$

$$\gamma = \begin{cases} \min\{(1 - v_w)\Pr(W = 1), \Pr(y = 1, D = 0)\} & \text{if } \alpha' \le 0 \\ \max\{0, (1 - v_w)\Pr(W = 1) - \Pr(y = 0, D = 1, W = 1)\} & \text{otherwise} \end{cases}$$

$$\alpha = \Pr(y = 1, D = 1, W = 1) - \Pr(y = 0, D = 1, W = 1) - \Pr(y = 0, W = 0) - (1 - v_w)\Pr(W = 1)$$

$$\alpha' = \Pr(y = 1, D = 1, W = 1) - \Pr(y = 0, D = 1, W = 1) + \Pr(y = 1, W = 0) + (1 - v_w)\Pr(W = 1)$$

I If $v_w = 1$, then one has full verification for an observed sub-sample $\rightarrow$ bounds are tightened

Combine the prior assumptions with a Monotone IV assumption to possibly further tighten the bounds

MIV Assumption
I $\exists \, x$ s.t.

$$x_0 \in [x_1, x_2] \Rightarrow \Pr(y = 1|D^*, x_0) \in [\Pr(y = 1|D^*, x_1), \Pr(y = 1|D^*, x_2)]$$

I Implies that $\Pr(y = 1|D^*, x)$ is weakly monotonically increasing in x
I Proceed by
F Computing bounds conditional on different values of x
F Obtaining unconditional bounds by integrating over the dbn of x

Kreider & Hill (2009), Kreider et al. (2011) combine this methodology on reporting errors with prior methods on bounding treatment effects under SOU

Imai & Yamamoto (2010) offer a similar analysis in political science

Partial solutions (Battistin & Sianesi 2009)

Consider ME of a binary or multi-valued treatment in the context of propensity score estimators

Setup

(MPS.i) CIA given no ME: $y_0, y_1 \perp D^* \mid x$
(MPS.ii) CS given no ME: $p^*(x) = \Pr(D^* = 1|x) \in (0, 1) \ \forall x$

I $D^*$ is not observed, instead $D$ is, where $D_i \neq D^*_i$ for at least some i

Estimation based on $D^*$ yields

$$\hat{\Delta}^*_{ATE} = E\{E[y|D^* = 1, x] - E[y|D^* = 0, x]\}$$

where the outer expectation is over $S^*$, where

$$S^* = \{x : p^*(x) = \Pr(D^* = 1|x) \in (0, 1)\}$$

In contrast, estimation based on $D$ $\Rightarrow$ $\hat{\Delta}_{ATE}$

Notation
I (Mis)classification probabilities given by

$$\lambda_{jj'}(x) = \Pr(D^* = j|D = j', x), \quad j, j' \in \{0, 1\}$$

F $\lambda_{10}$ = proportion of incorrect reported zeros
F $\lambda_{01}$ = proportion of incorrect reported ones

I Condensed notation for correct reporting rates

$$\lambda_{00}(x) = \lambda_0(x) = \Pr(D^* = 0|D = 0, x)$$

$$\lambda_{11}(x) = \lambda_1(x) = \Pr(D^* = 1|D = 1, x)$$

I Matrix of (mis)classification probabilities can be written in terms of $\lambda_0, \lambda_1$

$$\Lambda(x) = \begin{bmatrix} \lambda_0(x) & 1 - \lambda_0(x) \\ 1 - \lambda_1(x) & \lambda_1(x) \end{bmatrix}$$

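Assumptions

(MPS.iii) Non-differential classification errors: $E[y|D^*, x] = E[y|D^*, D, x]$
(MPS.iv) Informative reported treatment status: $\lambda_0(x) + \lambda_1(x) - 1 \neq 0$

Outcomes conditional on $D$ can be written as a weighted average of outcomes conditional on $D^*$

$$\begin{bmatrix} E[y|D = 0, x] \\ E[y|D = 1, x] \end{bmatrix} = \Lambda(x) \begin{bmatrix} E[y|D^* = 0, x] \\ E[y|D^* = 1, x] \end{bmatrix}$$

$$\Rightarrow \begin{bmatrix} E[y|D^* = 0, x] \\ E[y|D^* = 1, x] \end{bmatrix} = \Lambda^{-1}(x) \begin{bmatrix} E[y|D = 0, x] \\ E[y|D = 1, x] \end{bmatrix}$$

provided $\det[\Lambda(x)] = \lambda_0(x) + \lambda_1(x) - 1 \neq 0$

Two cases satisfy (MPS.iv)
I Minimal classification errors: $\lambda_0(x) + \lambda_1(x) > 1$
I Severe classification errors: $\lambda_0(x) + \lambda_1(x) < 1$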

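The bias when using $D$ is

$$\Delta_{ATE}(x) = [\lambda_0(x) + \lambda_1(x) - 1] \, \Delta^*_{ATE}(x)$$

Implications:
I $\Delta_{ATE}(x)$ is unbiased if $\lambda_0 = \lambda_1 = 1$
I $\Delta_{ATE}(x)$ suffers from attenuation bias if $\lambda_0(x) + \lambda_1(x) > 1$
I $\Delta_{ATE}(x)$ suffers from attenuation bias AND $\text{sgn}[\Delta_{ATE}(x)] \neq \text{sgn}[\Delta^*_{ATE}(x)]$ if $\lambda_0(x) + \lambda_1(x) < 1$
I $\Delta_{ATE}(x) = -\Delta^*_{ATE}(x)$ if $\lambda_0 = \lambda_1 = 0$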
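The relationship between observed and true conditional means is a 2x2 linear system, so both directions can be checked numerically. A sketch for a single value of x, taking $\lambda_0, \lambda_1$ as known (in practice they are not, which is why bounds over a grid of values are used); function names are illustrative:

```python
def observed_means(lam0, lam1, m0_true, m1_true):
    """Apply Lambda(x): map E[y | D*=d, x] into E[y | D=d, x]."""
    m0_obs = lam0 * m0_true + (1.0 - lam0) * m1_true
    m1_obs = (1.0 - lam1) * m0_true + lam1 * m1_true
    return m0_obs, m1_obs


def recover_true_means(lam0, lam1, m0_obs, m1_obs):
    """Apply the inverse of Lambda(x); requires lam0 + lam1 - 1 != 0 (MPS.iv)."""
    det = lam0 + lam1 - 1.0
    if det == 0.0:
        raise ValueError("reported status is uninformative: lam0 + lam1 = 1")
    m0_true = (lam1 * m0_obs - (1.0 - lam0) * m1_obs) / det
    m1_true = (-(1.0 - lam1) * m0_obs + lam0 * m1_obs) / det
    return m0_true, m1_true
```

With $\lambda_0 = 0.9$, $\lambda_1 = 0.8$ and true means (1, 3), the observed means are (1.2, 2.6): the observed contrast of 1.4 equals $(0.9 + 0.8 - 1) \times 2$, and inverting recovers (1, 3).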

The bias of the unconditional ATE, $\Delta_{ATE}$, also depends on the erroneous determination of the CS

I Can show that

$$p(x) = \frac{p^*(x) - [1 - \lambda_0(x)]}{\lambda_0(x) + \lambda_1(x) - 1}$$

I This implies that boundary values of $p(x)$ can be obtained even if $p^*(x) \in (0, 1)$ if

$$p(x) = 0 \Leftrightarrow \lambda_0(x) = 1 - p^*(x)$$

$$p(x) = 1 \Leftrightarrow \lambda_1(x) = p^*(x)$$

To ensure one does not utilize a different CS based on $D$, must assume

(MPS.v) $\lambda_0(x) \neq 1 - p^*(x)$ and $\lambda_1(x) \neq p^*(x)$

Estimation
I Under (MPS.i) - (MPS.v)

$$\Delta^*_{ATE} = \int_S \omega(x)\Delta_{ATE}(x)f(x)dx = \Delta_{ATE} + \int_S [\omega(x) - 1]\Delta_{ATE}(x)f(x)dx$$

ω(x) =Pr(D = 1)Pr(D = 1)

1 λ0(x)λ0(x) + λ1(x) 1

$$\Pr(D^* = 1) = \int_S [1 - \lambda_0(x)]f(x)dx + \int_S [\lambda_0(x) + \lambda_1(x) - 1]p(x)f(x)dx$$

I Shows that $\Delta^*_{ATE}$ can be obtained from an appropriately weighted average of $\Delta_{ATE}(x)$
I Weights depend on $\lambda_0(x)$, $\lambda_1(x)$

Notes
I Bounds obtained by computing $\hat{\Delta}_{ATE}(\lambda_0, \lambda_1)$ over a grid of values and obtaining the lower and upper bounds
F Restrictions on possible values of $\lambda$s can be imposed based on prior info
F $\hat{\Delta}_{ATE}(\lambda_0, \lambda_1)$ can be obtained using any propensity-score based estimator
F In their paper, they use a (5 strata) stratification estimator and assume $(\lambda_0, \lambda_1)$ are stratum-specific
I Extension to multi-valued treatments provided as well

Data Issues
ME: Missing Binary Independent Variable

Molinari (2010) applies similar bounding approach to analyze the case where D is missing, possibly non-randomly, due to subject non-response

I Examples:
F Respondents refuse to answer questions concerning drug use, welfare use, etc.

Millimet (2011) MC study also compares common treatment effect estimators when y or x is measured with error (do not forget the rest of the data! ... $\Delta = 1$)

Data Issues
ME: Persistence of Treatment Effects

Often neglected in applied research is the question of whether treatment effects are persistent

Clearly relevant for policymakers; an investment that improves outcomes for one period only has different benefits than an investment that yields a permanent improvement in outcomes

Jacob et al. (2010) propose an interesting method to estimate the degree of persistence in a treatment effect (under certain circumstances)

Method relies on preceding analysis of measurement error

Setup

$$y_{it} = y^L_{it} + y^S_{it}$$

where y is the outcome, which is decomposed into a LR component, $y^L$, and a SR component, $y^S$

I The two components are given by

$$y^S_{it} = \tau_S D_{it} + \varepsilon^S_{it}$$

$$y^L_{it} = \delta y^L_{it-1} + \tau_L D_{it} + \varepsilon^L_{it}$$

where D is a treatment (binary, discrete, or continuous)

I Interpretation of parameters
F $\delta$ = persistence of the LR component of y (by definition, the SR component completely decays each period)
F $\tau_S, \tau_L$ = the (common) treatment effect on $y^S, y^L$

Goal: say something about $\delta$, $\tau_S$, and $\tau_L$

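Consider trying to estimate the LR component equation

$$y^L_{it} = \delta y^L_{it-1} + \tau_L D_{it} + \varepsilon^L_{it}$$

Problem: $y^L_{it}$, $y^L_{it-1}$ are unobserved; only y is observed

Some algebra yields

$$y_{it} - y^S_{it} = \delta(y_{it-1} - y^S_{it-1}) + \tau_L D_{it} + \varepsilon^L_{it}$$

$$\Rightarrow y_{it} = \delta y_{it-1} + \tau_L D_{it} + [y^S_{it} - \delta y^S_{it-1} + \varepsilon^L_{it}]$$

Notes
I $Cov(y_{it-1}, y^S_{it-1}) \neq 0$ ... $y^S_{it-1}$ is analogous to ME in the desired covariate, $y^L_{it-1}$
I $Cov(D_{it}, y^S_{it}) \neq 0$ if $\tau_S \neq 0$

Circumvent this second issue by incorporating $D_{it}$ into the error term

$$y_{it} = \delta y_{it-1} + [\tau_L D_{it} + y^S_{it} - \delta y^S_{it-1} + \varepsilon^L_{it}] = \delta y_{it-1} + \upsilon_{it}$$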

Comparison of estimators ...

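OLS yields

$$\text{plim}\,\hat{\delta}_{OLS} = \delta \, \frac{\sigma^2_{y^L}}{\sigma^2_{y^L} + \sigma^2_{y^S}}$$

using the CEV formula discussed previously

IV using $y_{it-2}$ as an instrument

$$\text{plim}\,\hat{\delta}_{IV,1} = \delta$$

if $Cov(y_{it-2}, \varepsilon^L_{it}) = Cov(y_{it-2}, \varepsilon^S_{it}) = Cov(y_{it-2}, \varepsilon^S_{it-1}) = Cov(y_{it-2}, D_{it}) = 0$, implying that $y_{it-2}$ is predetermined and uncorrelated with future treatment status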

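IV using $D_{it-1}$ as an instrument

$$\text{plim}\,\hat{\delta}_{IV,2} = \frac{Cov(y_{it}, D_{it-1})}{Cov(y_{it-1}, D_{it-1})} = \frac{Cov(\delta y_{it-1} + \tau_L D_{it} + y^S_{it} - \delta y^S_{it-1} + \varepsilon^L_{it}, D_{it-1})}{Cov(y_{it-1}, D_{it-1})}$$

$$= \delta + \frac{Cov(\tau_L D_{it} + y^S_{it} - \delta y^S_{it-1} + \varepsilon^L_{it}, D_{it-1})}{Cov(y_{it-1}, D_{it-1})}$$

I Assume $Cov(D_{it-1}, D_{it}) = Cov(D_{it-1}, \varepsilon^S_{it}) = Cov(D_{it-1}, \varepsilon^L_{it}) = 0$
I But, $Cov(D_{it-1}, y^S_{it-1}) \neq 0 \Rightarrow D_{it-1}$ is not a valid IV

$$\text{plim}\,\hat{\delta}_{IV,2} = \delta + \frac{Cov(-\delta y^S_{it-1}, D_{it-1})}{Cov(y_{it-1}, D_{it-1})} = \delta\left[1 - \frac{Cov(y^S_{it-1}, D_{it-1})}{Cov(y_{it-1}, D_{it-1})}\right]$$

$$= \delta\left[1 - \frac{\tau_S Var(D_{it-1})}{(\tau_S + \tau_L)Var(D_{it-1})}\right] = \delta \, \frac{\tau_L}{\tau_S + \tau_L}$$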

Notes:
I Combination of OLS and IV1 can estimate the relative contribution of $y^L$ to y
I Combination of IV1 and IV2 can estimate the relative contribution of D to the LR component
I x's can be incorporated by redefining $\varepsilon^L_{it} = x_{it}\beta^L + \tilde{\varepsilon}^L_{it}$
I Model requires $Cov(D_{it-1}, D_{it}) = 0$, ruling out treatments which persist themselves (e.g., treaties)
F Examples (perhaps): class size, R&D (?)
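A small simulation helps fix ideas about the three estimators. This is an illustrative sketch, not the Jacob et al. (2010) implementation: it assumes one long series in place of a panel, an i.i.d. binary treatment (so $Cov(D_{it-1}, D_{it}) = 0$ holds by construction), and standard normal shocks:

```python
import random


def cov(a, b):
    """Sample covariance (divisor n) of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((u - mean_a) * (w - mean_b) for u, w in zip(a, b)) / n


def simulate_outcomes(periods, delta, tau_s, tau_l, seed=0):
    """Simulate y_t = y^L_t + y^S_t with an i.i.d. Bernoulli(0.5) treatment."""
    rng = random.Random(seed)
    y, d = [], []
    y_long = 0.0
    for _ in range(periods):
        d_t = 1.0 if rng.random() < 0.5 else 0.0
        y_long = delta * y_long + tau_l * d_t + rng.gauss(0.0, 1.0)
        y_short = tau_s * d_t + rng.gauss(0.0, 1.0)
        y.append(y_long + y_short)
        d.append(d_t)
    return y, d


def persistence_estimates(y, d):
    """Return (OLS, IV with y_{t-2}, IV with D_{t-1}) estimates of delta."""
    y_t, y_lag1, y_lag2, d_lag1 = y[2:], y[1:-1], y[:-2], d[1:-1]
    ols = cov(y_t, y_lag1) / cov(y_lag1, y_lag1)
    iv1 = cov(y_t, y_lag2) / cov(y_lag1, y_lag2)
    iv2 = cov(y_t, d_lag1) / cov(y_lag1, d_lag1)
    return ols, iv1, iv2
```

With $\delta = 0.8$ and $\tau_S = \tau_L = 1$, IV1 should be close to 0.8 and IV2 close to $\delta\tau_L/(\tau_S + \tau_L) = 0.4$, so the ratio IV2/IV1 recovers the long-run share of the treatment effect; OLS falls below IV1 because of the measurement-error-like role of $y^S_{it-1}$.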

In conclusion, listen to the words of Sims (2010):

Natural, quasi-, and computational experiments, as well as regression discontinuity design (RDD), can all, when well applied, be useful, but none are panaceas... Because we are not an experimental science, we face difficult problems of inference. The same data generally are subject to multiple interpretations. It is not that we learn nothing from data, but that we have at best the ability to use data to narrow the range of substantive disagreement. We are always combining the objective information in the data with judgment, opinion and/or prejudice to reach conclusions...

Natural experiments, difference-in-difference, and regression discontinuity design are good ideas. They have not taken the con out of econometrics; in fact, as with any popular econometric technique, they in some cases have become the vector by which "con" is introduced into applied studies. Furthermore, over-enthusiasm about these methods, when it leads to claims that single-equation linear model with sandwiched errors are all we ever really need, can lead to our training applied economists who do not understand how to fully model a dataset.

In light of these sentiments, recall the points made at the start of this course:

Prior to conducting, or when reviewing, causal analyses, questions that need to be answered:

1 What is the causal relationship of interest? [Is it economically interesting?]
2 What is the identification strategy?
3 What parameter are you actually estimating?
4 To whom does the parameter apply?
5 What question does the analysis answer?
6 What is the method of statistical inference?

While applied work is open to multiple interpretations, these interpretations and objections to research are lessened when one is precise in answering these questions.
