Understanding and misunderstanding randomized controlled trials
Angus Deaton and Nancy Cartwright
Princeton University
Durham University and UC San Diego
This version, August 2016
We acknowledge helpful discussions with many people over the many years this paper has been in preparation. We would particularly like to note comments from seminar participants at Princeton, Columbia and Chicago, the CHESS research group at Durham, as well as discussions with Orley Ashenfelter, Anne Case, Nick Cowen, Hank Farber, Bo Honoré, and Julian Reiss. Ulrich Mueller had a major influence on shaping Section 1 of the paper. We have benefited from generous comments on an earlier version by Tim Besley, Chris Blattman, Sylvain Chassang, Steven Durlauf, Jean Drèze, William Easterly, Jonathan Fuller, Lars Hansen, Jim Heckman, Jeff Hammer, Macartan Humphreys, Helen Milner, Suresh Naidu, Lant Pritchett, Dani Rodrik, Burt Singer, Richard Zeckhauser, and Steve Ziliak. Cartwright’s research for this paper has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant agreement No 667526 K4U). Deaton acknowledges financial support through the National Bureau of Economic Research, Grants 5R01AG040629-02 and P01AG05842-14 and through Princeton University’s Roybal Center, Grant P30AG024928.
ABSTRACT
RCTs are valuable tools whose use is spreading in economics and in other social sciences. They are seen as desirable aids in scientific discovery and for generating evidence for policy. Yet some of the enthusiasm for RCTs appears to be based on misunderstandings: that randomization provides a fair test by equalizing everything but the treatment and so allows a precise estimate of the treatment alone; that randomization is required to solve selection problems; that lack of blinding does little to compromise inference; and that statistical inference in RCTs is straightforward, because it requires only the comparison of two means. None of these statements is true. RCTs do indeed require minimal assumptions and can operate with little prior knowledge, an advantage when persuading distrustful audiences, but a crucial disadvantage for cumulative scientific progress, where randomization adds noise and undermines precision. The lack of connection between RCTs and other scientific knowledge makes it hard to use them outside of the exact context in which they are conducted. Yet, once they are seen as part of a cumulative program, they can play a role in building general knowledge and useful predictions, provided they are combined with other methods, including conceptual and theoretical development, to discover not “what works,” but why things work. Unless we are prepared to make assumptions, and to stand on what we know, making statements that will be incredible to some, all the credibility of RCTs is for naught.
Introduction
Randomized trials are currently much used in economics and are widely considered to be a desirable method of empirical analysis and discovery. There is a long history of such trials in the subject. There were four large federally sponsored negative income tax trials in the 1960s and 1970s. In the mid-1970s, there was a famous, and still frequently cited, trial on health insurance, the Rand health experiment. There was then a period during which randomized controlled trials (RCTs) received less attention from academic economists; even so, randomized trials on welfare, social policy, labor markets, and education have continued since the mid-1970s, some with substantial involvement and discussion by academic economists, see Greenberg and Shroder (2004).
Recent randomized trials in economic development have attracted attention, and the idea that such trials can discover “what works” has been widely adopted in economics, as well as in political science, education, and social policy. Among both researchers and the general public, RCTs are perceived to yield causal inferences and parameter estimates that are more credible than those from other empirical methods that do not involve the comparison of randomly selected treatment and control groups. RCTs are seen as largely exempt from many of the econometric problems that characterize observational studies. When RCTs are not feasible, researchers often mimic randomized designs by using observational data to construct two groups that, as far as possible, are identical and differ only in their exposure to treatment.
The preference for randomized trials has spread beyond trialists to the general public and the media, which typically reports favorably on them. They are seen as accurate, objective, and largely independent of “expert” knowledge that is often regarded as manipulable, politically biased, or otherwise suspect. There are now “What Works” centers using and recommending RCTs in a huge range of areas of social concern across Europe and the Anglophone world, such as the US Department of Education’s What Works Clearing House, The Campbell Collaboration (parallel to the Cochrane Collaboration in health), the Scottish Intercollegiate Guidelines Network (SIGN), the US Department of Health and Human Services Child Welfare Information Gateway, the US Social and Behavioral Sciences Team, and others. The British government has established eight new (well-financed) What Works Centers similar to the National Institute for Health and Care Excellence (NICE), with more planned. They extend NICE’s evaluation of health treatment into aging, early intervention, education, crime, local economic growth, Scottish service delivery, poverty, and wellbeing. These centers see randomized controlled trials as their preferred tool. There is a widespread desire for careful evaluation, to support what is sometimes called the “audit society,” and everyone assents to the idea that policy should be based on evidence of effectiveness, for which randomized trials appear to be ideally suited. Trials are easily, if not very precisely, explained along the lines that random selection generates two otherwise identical groups, one treated and one not; results are easy to compute, since all we need is the comparison of two averages; and, unlike other methods, the approach seems to require no specialized understanding of the subject matter. It seems a truly general tool that (nominally) works in the same way in agriculture, medicine, sociology, economics, politics, and education. It is supposed to require no prior knowledge, whether suspect or not, which is seen as a great advantage.
In this paper, we present two sets of arguments, one on conducting RCTs and on how to interpret the results, and one on how to use the results once we have them. Although we do not care for the terms, for reasons that will become apparent, the two sections correspond roughly to internal and external validity.
Randomized controlled trials are often useful, and have been important sources of empirical evidence for causal claims and evaluation of effectiveness in many fields. Yet many of the popular interpretations (not only among the general public, but also among trialists) are incomplete and sometimes misleading, and these misunderstandings can lead to unwarranted trust in the impregnability of results from RCTs, to a lack of understanding of their limitations, and to mistaken claims about how widely their results can be used. All these, in turn, can lead to flawed policy recommendations.
Among the misunderstandings are the following: (a) randomization ensures a fair trial by ensuring that, at least with high probability, treatment and control groups differ only in the treatment; (b) RCTs provide not only unbiased estimates of average treatment effects, but also precise estimates; (c) randomization is necessary to solve the selection problem; (d) lack of blinding, which is common in social science experiments, does not seriously compromise inference; (e) statistical inference in RCTs, which requires only the simple comparison of means, is straightforward, so that standard significance tests are reliable.
While many of the problems of RCTs are shared with observational studies, some are unique, for example the fact that randomizing itself can change outcomes independently of treatment. More generally, it is almost never the case that an RCT can be judged superior to a well-conducted observational study simply by virtue of being an RCT. The idea that all methods have their flaws, but RCTs always have fewest, is one of the deepest and most pernicious misunderstandings.
In the second part of the paper, we discuss the uses and limitations of results from RCTs for making policy. The non-parametric and theory-free nature of RCTs, which is arguably an advantage in estimation, is a serious disadvantage when we try to use the results outside of the context in which they were obtained. Much of the literature, in economic development and elsewhere, perhaps inspired by Campbell and Stanley’s (1963) famous “primacy of internal validity,” assumes that internal validity is enough to guarantee the usefulness of the estimates in different contexts. Without understanding RCTs within the context of the knowledge that we already possess about the world, much of it obtained by other methods, we do not know how to use trial results. But once the commitment has been made to seeing RCTs within this broader structure of knowledge and inference, and when they are designed to fit within it, they can play a useful role in building general knowledge and policy predictions; for example, an RCT can be a good way of estimating a key policy magnitude. The broader context within which RCTs need to be set includes not only models of economic structure, but also the previous experience that policymakers have accumulated about local settings and implementation. Most importantly for economic development, the use of RCT results should be sensitive to what people want, both individually and collectively. RCTs should not become yet another technical fix that is imposed on people by bureaucrats or foreigners; RCT results need to be incorporated into a democratic process of public reasoning, Sen (2011). Greenberg, Shroder, and Onstott (1999) document that, even before the recent wave of RCTs in development, most RCTs in economics have been carried out by rich people on poor people, and the fact should make us especially sensitive to avoid charges of paternalism.
Section 1: Interpreting the results of RCTs
1.1 Prolog
RCTs were first popularized by Fisher’s agricultural trials in the 1930s and are today often described by the Rubin counterfactual causal model, which itself traces back to Neyman in 1923, see Freedman (2006) for a description of the history. Each unit i (a person, a pupil, a school, an agricultural plot) is assumed to have two possible outcomes, $Y_{i0}$ and $Y_{i1}$, the former occurring if there is no treatment at the time in question, the latter if the unit is treated. The difference between the two outcomes, $Y_{i1} - Y_{i0}$, is the individual treatment effect, which we shall denote $\beta_i$. Treatment effects are typically different for different units. No unit can be both treated and untreated at the same time, so only one or other of the outcomes occurs; the other is counterfactual so that individual treatment effects are in principle unobservable.
We note parenthetically that while we use the counterfactual framework here, we do not endorse it, nor argue against other approaches that do not use it, such as the Cowles commission econometric framework where the causal relations are coded as structural equations, see also Pearl (2009). Imbens and Wooldridge (2009, Introduction) provide an eloquent defense of the Rubin formulation, emphasizing the credibility that comes from a theory-free specification with unlimited heterogeneity in treatment effects. Heckman and Vytlacil (2007, Introduction) make an equally eloquent case against, noting that the treatments in RCTs are often unclearly specified and that the treatment effects are hard to link to invariant parameters that would be useful elsewhere.
The basic theorem governing RCTs is a remarkable one. It states that the average treatment effect is the average outcome in the treatment group minus the average outcome in the control group. While we cannot observe the individual treatment effects, we can observe their mean. The estimate of the average treatment effect (ATE) is simply the difference between the means in the two groups, and it has a standard error that can be estimated and used to make significance statements according to the statistical theory that applies to the difference of two means, on which more below in Section 1.3. The difference in means is an unbiased estimator of the mean treatment effect.
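The unbiasedness claim can be checked numerically. The following is a minimal sketch, not part of the original argument: the potential outcomes, the sample size of 40, and every name in the code are hypothetical. It re-randomizes one fixed study sample many times and compares the average of the difference-in-means estimates with the true ATE, which the simulation knows only because it can see both potential outcomes.

```python
import random
import statistics


def difference_in_means(y0, y1, treated):
    """Mean observed outcome of treated units minus that of control units."""
    treated = set(treated)
    treat_outcomes = [y1[i] for i in range(len(y1)) if i in treated]
    control_outcomes = [y0[i] for i in range(len(y0)) if i not in treated]
    return statistics.fmean(treat_outcomes) - statistics.fmean(control_outcomes)


rng = random.Random(0)
n = 40

# Hypothetical potential outcomes with heterogeneous treatment effects beta_i.
y0 = [rng.gauss(10, 3) for _ in range(n)]
beta = [rng.gauss(2, 4) for _ in range(n)]
y1 = [a + b for a, b in zip(y0, beta)]
true_ate = statistics.fmean(beta)  # unobservable in any real trial

# Re-randomize the same study sample many times.
estimates = []
for _ in range(10000):
    treated = rng.sample(range(n), n // 2)
    estimates.append(difference_in_means(y0, y1, treated))

mean_estimate = statistics.fmean(estimates)
spread = statistics.pstdev(estimates)
print(f"true ATE: {true_ate:.3f}")
print(f"mean of estimates over randomizations: {mean_estimate:.3f}")
print(f"standard deviation of estimates: {spread:.3f}")
```

Under these assumptions the mean of the estimates sits close to the true ATE, while the spread of individual estimates is far from negligible: unbiasedness is a statement about the average over randomizations, not about any single trial.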
The theorem is remarkable because it requires so few assumptions; no model is required, no assumptions about covariates are needed, the treatment effects can be heterogeneous, and nothing is assumed about the shapes of statistical distributions other than the statistical question of the existence of the mean of the counterfactual outcome values. In terms of one of our running themes, it requires no expert knowledge, or no acceptance of priors, expert or otherwise. The theorem also has its limitations; the proof uses the fact that the difference in two means is the mean of the individual differences, i.e. the treatment effects. This is not true for the median (the difference in two medians is not the median of the differences which is the median treatment effect). It also does not allow us to estimate any percentile of the distribution of treatment effects, or its variance. (Quantile estimates of treatment effects are not the quantiles of the distribution of treatment effects, but the differences in the quantiles of the two marginal distributions of treatments and controls; the two measures coincide if the experiment has no effect on ranks, an assumption that would be convenient but is hard to justify, at least in general.) All of these statistics can be of interest for policy but RCTs are not informative about them, or at least not without further assumptions, for example on the distribution of treatment effects, see Heckman, Smith, and Clements (1997), and much of the attraction of RCTs is the absence of such assumptions.
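A three-unit example makes the contrast between means and medians concrete. The numbers are hypothetical and chosen only for illustration; in a real trial only one of the two potential outcomes is observed per unit, so at best the two marginal distributions are recoverable.

```python
import statistics

# Hypothetical potential outcomes for three units.
y0 = [1, 2, 3]
beta = [4, 0, 2]                        # individual treatment effects
y1 = [a + b for a, b in zip(y0, beta)]  # [5, 2, 5]

# Means: the difference of the marginal means equals the mean effect.
diff_of_means = statistics.mean(y1) - statistics.mean(y0)
mean_of_effects = statistics.mean(beta)

# Medians: the difference of the marginal medians is not the median effect.
diff_of_medians = statistics.median(y1) - statistics.median(y0)
median_of_effects = statistics.median(beta)

print(diff_of_means, mean_of_effects)      # 2 and 2
print(diff_of_medians, median_of_effects)  # 3 and 2
```

The treatment here reorders the units (the unit with the lowest untreated outcome ends up tied for the highest treated outcome), which is exactly the rank change that breaks the link between quantile differences and quantiles of the treatment effect.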
The basic theorem tells us that the difference in means is an unbiased estimator of the average treatment effect but says nothing about the variance of this estimator. In general, a biased estimator that is typically closer to the truth will often be better than an unbiased estimator that is typically wide of the truth. There is nothing to say that a non-RCT estimator, in spite of bias, might not have a lower mean squared error (MSE), one measure of the distance of the estimate from the truth, or a lower value of a “loss function” that defines the loss to the experimenter of missing the target.
It is useful to think of the average treatment effect from an RCT in terms of sampling from a finite population, as when the Bureau of the Census estimates average income of the US population in 2013. For the RCT, the population is the population of units whose average treatment effect is of interest; note the importance of defining the population of interest because, given the heterogeneity of treatment effects, the average treatment effect will vary across different populations, just as average incomes differ across different subpopulations of the US. Finite population sampling theory tells us how to get accurate estimates of means from samples; in the RCT case, the sample is the study sample, both treatments and controls. In principle, the study sample could be a random sample of the parent population of interest, in which case it is representative of it, but that is seldom the case. Because the estimate is population specific, it is not (or need not be) thought of as the parameter of a super-population, or otherwise generalizable in any way. Average income in the US in 2013 may be of interest in its own right; but it will not be the same as average income in 2014, nor will it be the same as average income of whites, or of the populations of Wyoming or New York. Exactly the same is true of the estimate of an average treatment effect; it applies to the study sample in which the trial was done, at the time when it was done, and its use outside of those confines, though often possible, requires argument and justification. Without such an argument, we cannot claim that an ATE is “the” mean treatment effect any more than that average income in the US in 2013 is “the” average income of the US in any other year. Of course, knowing average income in 2013 can be useful for making other calculations, such as an estimate of income in 2014, or of a subpopulation that we know is richer or poorer; the fact that an estimate does not universally generalize does not make it useless. We shall return to these issues in Section 2.
1.2. Precision, balance, and randomization
1.2.1 Precision and bias
We should like our estimate of the average treatment effect to be as close to the truth as possible. One way to assess closeness is the mean square error (MSE), defined as
$$\mathrm{MSE} = E(\hat{\theta} - \theta)^2 \qquad (1)$$
where $\theta$ is the true average treatment effect, and $\hat{\theta}$ is its estimate from a particular trial. The expectation is taken over repeated randomizations of treatments and controls using the same study population. It is also standard to rewrite (1) as
$$\mathrm{MSE} = E\big(\hat{\theta} - E(\hat{\theta})\big)^2 + \big(E(\hat{\theta}) - \theta\big)^2 = \mathrm{var}(\hat{\theta}) + \mathrm{bias}(\hat{\theta}, \theta)^2 \qquad (2)$$
so that mean square error is the sum of the variance of the estimator, which we typically know something about from the estimated standard error, and the square of the bias, which in the case of a(n ideal) randomized controlled trial is zero. The elementary, but crucial point is that, while it is certainly good that the bias is zero, that fact does nothing to make the distance from the truth as small as it might be, which is what we really care about. An unbiased estimator that is nearly always wide of the target is not as useful as one that is always near to it, even if, on average, it is off center. More generally, it will often be desirable to trade in some unbiasedness for greater precision. Experiments are often expensive, so we cannot always rely on large samples to bring the estimate close to the truth and resolve these issues for us. Much of this Section is concerned with how to design experiments to maximize precision.
Unbiasedness alone cannot therefore justify the often-expressed preference for RCTs over other estimators. The minimalist assumptions required for an RCT to be unbiased are an attraction although this advantage usually comes at the cost of lowered precision, as we shall see in this Section, and of difficulties in knowing how to use the result, as we shall see in Section 2. Yet there is an often expressed belief that RCTs are somehow guaranteed to be precise, simply because they are RCTs. Occasionally bias and precision are explicitly confused; the JPAL website, in its explanation of why it is good to randomize, says that RCTs “are generally considered the most rigorous and, all else equal, produce the most accurate (i.e. unbiased) results.” Shadish, Cook, and Campbell (2002, p. 276), in what is (rightly) considered one of the bibles of causal inference in social science, state without qualification that “randomized experiments provide a precise answer about whether a treatment worked” (p. 276) and “The randomized experiment is often the preferred method for obtaining a precise and statistically unbiased estimate of the effects of an intervention,” (p. 277) our italics.
Contrast this with Cronbach et al (1980) who quote Kendall’s (1957) pastiche of Longfellow, “Hiawatha designs an experiment,” where Hiawatha’s insistence on unbiasedness leads to his never hitting the target and to his eventual banishment.
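The trade-off between bias and precision in the decomposition of mean square error into variance plus squared bias can be illustrated with a small simulation. This is a sketch under invented assumptions: the true effect, the two noise levels, and the size of the bias are all hypothetical, and neither estimator is meant to represent any particular study design.

```python
import random
import statistics


def mse_decomposition(estimates, theta):
    """Return (mse, variance, squared bias) of estimates around the truth theta."""
    mean_est = statistics.fmean(estimates)
    variance = statistics.pvariance(estimates, mu=mean_est)
    bias_sq = (mean_est - theta) ** 2
    mse = statistics.fmean([(e - theta) ** 2 for e in estimates])
    return mse, variance, bias_sq


rng = random.Random(42)
theta = 1.0    # hypothetical true ATE
draws = 20000

# An unbiased but noisy estimator versus a slightly biased but tight one.
unbiased = [theta + rng.gauss(0, 2.0) for _ in range(draws)]
biased = [theta + 0.5 + rng.gauss(0, 0.5) for _ in range(draws)]

mse_u, var_u, bias2_u = mse_decomposition(unbiased, theta)
mse_b, var_b, bias2_b = mse_decomposition(biased, theta)

print(f"unbiased: MSE={mse_u:.3f} variance={var_u:.3f} bias^2={bias2_u:.4f}")
print(f"biased:   MSE={mse_b:.3f} variance={var_b:.3f} bias^2={bias2_b:.4f}")
```

With these made-up settings the biased estimator has the smaller mean square error by a wide margin, which is Hiawatha's problem in numerical form.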
1.2.2 Balance and precision in a linear all-cause model
A useful way to think about precision and what an RCT does and does not do is to use a schematic linear causal model of the form:
$$Y_i = \beta_i T_i + \sum_{j=1}^{J} \gamma_j x_{ij} \qquad (3)$$
where, as before, $Y_i$ is the outcome for unit i, $T_i$ is a dichotomous (1,0) treatment dummy indicating whether or not i is treated, and $\beta_i$ is the individual treatment effect of the treatment on i. The x’s are the observed or unobserved other causes of the outcome, and we suppose that (3) captures all the causes of $Y_i$. J may be very large. Because the heterogeneity of the individual treatment effects $\beta_i$ is unrestricted, we allow the possibility that the treatment interacts with the x’s or other variables, so that the effects of T can depend on any other variables, and we shall have occasion to make this explicit below. An obvious and important example is when the treatment is effective only in the presence of a particular value of one of the x’s.
We do not need i subscripts on the $\gamma$’s that control the effects of the other causes; if their effects differ across individuals, we include the interactions of individual characteristics with the original x’s as new x’s. Given that the x’s can be unobservable, this is not restrictive. Because the $\beta$’s can depend on the x’s, the effects of the x’s on the outcome can depend on $T_i$, or, equivalently, the effects of treatment can depend on covariates.
In an experiment, with or without randomization, we can represent the treatment group as having $T_i = 1$ and the control group as having $T_i = 0$. So when we subtract the average outcomes among the controls from the average outcomes among the treatments, we will get
$$\bar{Y}_1 - \bar{Y}_0 = \bar{\beta}_1 + \sum_{j=1}^{J} \gamma_j (\bar{x}_{ij}^{1} - \bar{x}_{ij}^{0}) = \bar{\beta}_1 + (\bar{S}_1 - \bar{S}_0) \qquad (4)$$
The first term on the far right hand side, which is the average treatment effect, is what we want, but the second term or error term, which is the sum of the net average balances of other causes across the two groups, will generally be non-zero (because of selection or many other reasons) and needs to be dealt with somehow. We get what we want when the means of all the other causes are identical in the two groups, or more precisely when the sum of their net differences $\bar{S}_1 - \bar{S}_0$ is zero; this is the case of perfect balance. With perfect balance, the difference between the two means is exactly equal to the average of the treatment effect among the treated, so that we have the ultimate precision and we know the answer exactly, at least in this linear case.
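The decomposition in (4) is an accounting identity, and it holds however units come to be treated. The sketch below checks it on made-up data in which treatment is deliberately not randomized. The three causes, their coefficients, and the selection rule are hypothetical, as are all variable names.

```python
import random
import statistics

rng = random.Random(1)
n = 200
gamma = [1.5, -2.0, 0.7]   # hypothetical effects of J = 3 other causes

x = [[rng.gauss(0, 1) for _ in gamma] for _ in range(n)]
beta = [rng.gauss(1.0, 0.5) for _ in range(n)]   # heterogeneous treatment effects

# Non-random assignment: units with a high first cause tend to select in.
T = [1 if row[0] + rng.gauss(0, 1) > 0 else 0 for row in x]

# Net contribution of the other causes, and the outcome from the linear model (3).
S = [sum(g * xj for g, xj in zip(gamma, row)) for row in x]
Y = [beta[i] * T[i] + S[i] for i in range(n)]


def group_mean(values, group):
    """Average of values over units whose treatment status equals group."""
    return statistics.fmean([v for v, t in zip(values, T) if t == group])


difference_in_means = group_mean(Y, 1) - group_mean(Y, 0)
avg_effect_treated = group_mean(beta, 1)          # first term on the right of (4)
imbalance = group_mean(S, 1) - group_mean(S, 0)   # error term in (4)

print(f"difference in means:        {difference_in_means:.4f}")
print(f"avg effect among treated:   {avg_effect_treated:.4f}")
print(f"net imbalance of causes:    {imbalance:.4f}")
```

The first printed number equals the sum of the other two up to rounding. Because selection here loads on a cause with a positive coefficient, the imbalance term is not small, and the raw difference in means is a poor guide to the treatment effect.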
1.2.3 Balancing acts: real and magical
How do we get balance, or something close to it? What, exactly, is the role of randomization? In a laboratory experiment, where there is good background knowledge of the other causes, the experimenter has a good chance of controlling all of the other causes, aiming to ensure that the last term in (4) is close to zero. Failing such knowledge and control, an alternative is matching, frequently used in statistical, medical, and econometric work. For each treatment, a match is found that is as close as possible on all suspected causes, so that, once again, the last term in (4) can be kept small. Again, when we have a good idea of the causes, matching may also deliver a precise estimate. Of course, when there are important unknown or unobservable causes, neither laboratory control nor matching offers protection.
What does randomization do? Because the treatments and controls come from the same underlying distribution, randomization guarantees, by construction, that the last term on the right in (4) is zero in expectation at baseline (much can happen to disturb this beyond baseline). This is true whether or not the causes are observed. If the RCT is repeated many times on the same trial population, then the last term will be zero when averaged over an infinite number of (entirely hypothetical) trials. Of course, this does nothing to make it zero in any one trial where the difference in means will be equal to the average treatment effect among those treated plus a term that reflects the imbalance in the net effects of the other causes. We do not know the size of this error term, and there is nothing in the randomization that limits its size; by chance, there can be one (or more) important excluded cause(s) that is very unequally distributed between treatment and controls. This imbalance will vary over replications of the trial, and its average size will ideally be captured by the standard error of the estimated ATE, which gives us some idea of how likely we are to be away from the truth. Getting the standard error and associated significance statements right is therefore of great importance.
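The distinction between balance in expectation and balance in a single trial can be seen by randomizing one fixed trial population repeatedly and recording the imbalance in the net effect of the other causes each time. The sketch is illustrative only: the population of 50 units, the scale of the net effects, and the cutoff of 1.0 for a "large" imbalance are all assumptions made for the example.

```python
import random
import statistics

rng = random.Random(7)
n = 50

# Hypothetical net effect of all other causes for each unit in the trial population.
net_other = [rng.gauss(0, 5) for _ in range(n)]
total = sum(net_other)

imbalances = []
for _ in range(10000):
    treated = rng.sample(range(n), n // 2)
    sum_treated = sum(net_other[i] for i in treated)
    mean_treated = sum_treated / (n // 2)
    mean_control = (total - sum_treated) / (n - n // 2)
    imbalances.append(mean_treated - mean_control)

average_imbalance = statistics.fmean(imbalances)   # close to zero
typical_size = statistics.pstdev(imbalances)       # not close to zero
share_large = sum(abs(d) > 1.0 for d in imbalances) / len(imbalances)

print(f"average imbalance over randomizations: {average_imbalance:.3f}")
print(f"standard deviation of the imbalance:   {typical_size:.3f}")
print(f"share of trials with |imbalance| > 1:  {share_large:.2f}")
```

Averaged over the hypothetical replications the imbalance is near zero, yet a substantial share of individual randomizations are off by more than one unit of the outcome. Any one trial is a single draw from this distribution.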
Exactly what randomization does is frequently lost in the practical literature, and there is often a confusion between perfect control, on the one hand (as in a laboratory experiment or perfect matching with no unobservable causes), and control in expectation, which is what RCTs do. We suspect that at least some of the popular and professional enthusiasm for RCTs, as well as the belief that they are precise by construction, comes from misunderstandings about balance. These misunderstandings are not so much among the trialists who, when pressed, will give a correct account, but come from imprecise statements by trialists that are taken as gospel by the lay audience that the trialists are keen to reach.
Such a misunderstanding is well captured by the following quote from the World Bank’s online manual on impact evaluation:
“We can be very confident that our estimated average impact, given as the difference between the outcome under treatment (the mean outcome of the randomly assigned treatment group) and our estimate of the counterfactual (the mean outcome of the randomly assigned comparison group) constitute the true impact of the program, since by construction we have eliminated all observed and unobserved factors that might otherwise plausibly explain the difference in outcomes.” Gertler et al (2011) (our italics.)
This statement confuses actual balance in any single trial with balance in expectation over many entirely hypothetical trials. If the statement above were true, and if all factors were indeed controlled (and no imbalances were introduced post randomization), the difference would be an exact measure of the average treatment effect, at least in the absence of measurement error. We should not only be confident of our estimate; we would know the truth, as the quote says.
A similar quote comes from John List, one of the most imaginative and successful scholars who use RCTs:
“complications that are difficult to understand and control represent key reasons to conduct experiments, not a point of skepticism. This is because randomization acts as an instrumental variable, balancing unobservables across control and treatment groups.” Al-Ubaydli and List (2013) (italics in the original.)
And from Dean Karlan, founder and President of Yale’s Innovations for Poverty Action, which runs development RCTs around the world:
“As in medical trials, we isolate the impact of an intervention by randomly assigning subjects to treatments and control groups. This makes it so that all those other factors which could influence the outcome are present in treatment and control, and thus any difference in outcome can be confidently attributed to the intervention.” Karlan, Goldberg and Copestake (2009)
And from the medical literature, from a distinguished psychiatrist who is deeply skeptical of RCTs,
“The beauty of a randomized trial is that the researcher does not need to understand all the factors that influence outcomes. Say that an undiscovered genetic variation makes certain people unresponsive to medication. The randomizing process will ensure, or make it highly probable, that the arms of the trial contain equal numbers of subjects with that variation. The result will be a fair test.” (Kramer, 2016, p. 18)
Claims are even made that RCTs reveal knowledge without possibility of error. Judy Gueron, the long-time president of MDRC, which has been running RCTs on US government policy for 45 years, asks why federal and state officials were prepared to support randomization in spite of frequent difficulties and in spite of the availability of other methods, and concludes that it was because “they wanted to learn the truth,” Gueron and Rolston (2013, 429). There are many statements of the form “We know that [project X] worked because it was evaluated with a randomized trial,” Dynarski (2015).
Many writers are more cautious, and modify statements about treatment and control groups being identical with terms such as “statistically identical,” “reasonably similar” or do not differ “systematically.” And we have no doubt that all of the authors quoted above understand the need for these qualifications. But to the uninformed reader, the qualified statements are unlikely to be differentiated from the unqualified statements quoted above. Nor is it always clear what some of these terms mean. For example, if two people are selected at random from a population, and it so happens that one is female and one male, in what sense are they statistically identical? While it is true that they were randomly selected from the same parent distribution, which provides the basis for inference, the calculation of standard errors, and significance statements, it does nothing to help with balance or precision in any given trial.
1.2.4 Sample size and statistical inference in unbalanced trials
Is a single trial more likely to be balanced, and thus more precise, when the sample size is large? Indeed, as the sample size tends to infinity, the means of the x’s in the treatment and control groups will become arbitrarily close. Yet this is of little help in finite samples as Fisher (1926) noted: “Most experimenters on carrying out a random assignment will be shocked to find how far from equally the plots distribute themselves,” quoted in Morgan and Rubin (2012). Even with very large sample sizes, if there are a large number of causes, balance on each cause may be infeasible. Vandenbroucke (2004) notes that there are three million base pairs in the human genome, many or all of which could be relevant prognostic factors for the biological outcome that we are seeking to influence.
However, as (4) makes clear, we do not need balance on all causes, only on their net effect, the term $\bar{S}_1 - \bar{S}_0$, which does not require balance on each cause individually. Yet there is no guarantee that even the net effect will be small. For example, there may be only one omitted unobserved cause whose effect is large, one single base pair say, so that if that one cause is unbalanced across treatments and controls, the fact that there is individual or even net balance on other less important causes is not going to help.
Statements about large samples guaranteeing balance are not useful without guidelines about how large is large enough, and such statements cannot be made without knowledge of other causes and how they affect outcomes.
A simple case illustrates. Suppose that there is one hidden cause in (3), a binary variable x that is unity with probability p and 0 otherwise. With n controls and n treatments, the fraction with x=1 in each group has variance $p(1-p)/n$, so the difference in fractions between the two groups has mean 0 and variance $2p(1-p)/n$. With n=100 and p=0.5, the standard error around 0 is about 0.07 so that, if this unobserved confounder has a large effect on the outcome, the imbalance could easily mask the effect of treatment, or be mistaken as evidence for the effectiveness of a truly ineffective treatment.
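The size of the chance imbalance in this hidden-cause example can be checked by simulation. The sketch assumes that each unit's hidden cause is an independent Bernoulli(p) draw, in which case the difference between two group fractions of size n has variance 2p(1-p)/n by the usual binomial calculation. The number of replications and the 10 percentage point cutoff are choices made only for the illustration.

```python
import math
import random
import statistics

rng = random.Random(3)
n, p = 100, 0.5
reps = 5000


def fraction_with_cause(size):
    """Share of a group of the given size in which the hidden cause x equals 1."""
    return sum(rng.random() < p for _ in range(size)) / size


# Difference in the fraction with x = 1 between treatment and control groups.
diffs = [fraction_with_cause(n) - fraction_with_cause(n) for _ in range(reps)]

simulated_sd = statistics.pstdev(diffs)
binomial_sd = math.sqrt(2 * p * (1 - p) / n)
share_over_10_points = sum(abs(d) > 0.10 for d in diffs) / reps

print(f"simulated standard deviation: {simulated_sd:.4f}")
print(f"sqrt(2p(1-p)/n):              {binomial_sd:.4f}")
print(f"share of trials with a gap above 10 points: {share_over_10_points:.3f}")
```

Gaps of more than ten percentage points in the prevalence of the hidden cause are not rare at this sample size, so a confounder with a large enough effect on the outcome can move the estimated ATE materially in a single trial.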
Lack of balance in the above example or in the net effect of either observables or non-observables in (4) does not compromise the inference in an RCT in the sense of obtaining a standard error for the unbiased ATE, see Senn (2013) for a particularly clear statement. The randomization does not guarantee balance but it provides the basis for making probability statements about the various possible outcomes, which is also clear in the example in the previous paragraph. This was also Fisher’s argument for randomization. Senn writes “the probability calculation applied to a clinical trial automatically makes an allowance for the fact that the groups will almost certainly be unbalanced.” (italics in the original.) If the design is such that, even with perfect randomization, successive replications tend to generate large imbalances, the resulting imprecision of the ATE will show up in its standard error. Of course, the usefulness of this requires that the calculated standard errors permit correct significance statements, which, as we shall see in the next subsection, is often far from straightforward. In the example above, an extreme, but entirely possible, case occurs when, by chance, the unobserved confounder is perfectly correlated with the treatment; unless there are actual replications, the false certainty that such an experiment provides will be reinforced by false significance tests.
1.2.4 Testing for balance
In practice, trialists in economics (and in some other disciplines) usually carry out a statistical test for balance after randomization but before analysis, presumably with the aim of taking some appropriate action if balance fails. The first table of the paper typically presents the sample means of observable covariates (the observable x’s in (3), which are either causes in their own right or interact with the $\beta$’s) for the control and treatment groups, together with their differences, and tests for whether or not they are significantly different from zero, either variable by variable, or jointly. These tests are appropriate if we are concerned that the random number generator might have failed (because we are drawing playing cards, rolling dice, or spinning bottle tops, though presumably not if the randomization is done by a random number generator, always supposing that there is such a thing as randomness, Singer and Pincus (1998)), or if we are worried that the randomization is undermined by non-blinded subjects or trialists systematically undermining the allocation. Otherwise, as the next paragraph shows, the test makes no sense and is not informative, which does not seem to stop it being routinely used.
If we write $\mu_0$ and $\mu_1$ for the (vectors of) population means (i.e. the means over all possible randomizations) of the observed x’s in the control and treatment groups at the point of assignment, the null hypothesis is (presumably, as judged by the typical balance test) that the two vectors are identical, with the alternative being that they are not. But if the randomization has been correctly done, the null hypothesis is true by construction, see e.g. Altman (1985) and Senn (1994), which may help explain why it so rarely fails in practice. Indeed, although we cannot “test” it, we know that the null hypothesis is also true for the unobservable components of x. Note the contrast with the statements quoted above claiming that RCTs guarantee balance on causes across treatment and control groups. Those statements refer to balance of causes at the point of assignment in any single trial, which is not guaranteed by randomization, whereas the balance tests are about the balance of causes at the point of assignment in expectation over many trials, which is guaranteed by randomization. The confusion is perhaps understandable, but it is confusion nevertheless. Of course, it makes sense to look for balance between observed covariates using some more appropriate distance measure, for example the normalized difference in means, Imbens and Wooldridge (2009, equation 3).
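A normalized difference can be computed in a few lines. The sketch below assumes the form in which the gap in group means is divided by the square root of the sum of the two sample variances; that this is exactly the cited equation is our assumption, and the covariate values are invented. A two-sample t-statistic is shown alongside only to illustrate why a descriptive distance measure behaves differently from a test statistic.

```python
import math
import statistics


def normalized_difference(treated, control):
    """Gap in means divided by sqrt(s1^2 + s0^2); a scale-free measure of imbalance."""
    diff = statistics.fmean(treated) - statistics.fmean(control)
    return diff / math.sqrt(statistics.variance(treated) + statistics.variance(control))


def t_statistic(treated, control):
    """Two-sample t-statistic; for a fixed gap it grows with the sample size."""
    diff = statistics.fmean(treated) - statistics.fmean(control)
    se = math.sqrt(statistics.variance(treated) / len(treated)
                   + statistics.variance(control) / len(control))
    return diff / se


# Hypothetical baseline covariate values in the two arms.
treated = [2.0, 4.0, 6.0]
control = [1.0, 3.0, 5.0]

nd_small = normalized_difference(treated, control)
t_small = t_statistic(treated, control)

# The same values repeated 50 times: same gap, much larger sample.
nd_large = normalized_difference(treated * 50, control * 50)
t_large = t_statistic(treated * 50, control * 50)

print(f"n=3 per arm:   normalized difference {nd_small:.3f}, t {t_small:.3f}")
print(f"n=150 per arm: normalized difference {nd_large:.3f}, t {t_large:.3f}")
```

The normalized difference stays of the same order as the sample grows, while the t-statistic rises roughly with the square root of the sample size, so the former describes how far apart the groups are and the latter mostly how many units were enrolled.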
1.2.5 Methods for balancing
One procedure to improve balance is to adapt the design before randomization, for example by stratification. Fisher, who, as the quote above illustrates, was well aware of the loss of precision from randomization, argued for “blocking” (stratification) in agricultural trials or for using Latin Squares, both of which restrict the amount of imbalance. Stratification, to be useful, requires some prior understanding of the factors that are likely to be important, and so it takes us away from the “no knowledge required,” or “no priors accepted” appeal of RCTs. But as Scriven (1974, 103) notes: “cause hunting, like lion hunting, is only likely to be successful if we have a considerable amount of relevant background knowledge,” or even more strongly, “no causes in, no causes out,” Cartwright (1994, Chapter 2). Stratification in RCTs, as in other forms of sampling, is a standard method for using background knowledge to increase the precision of an estimator. It has the further advantage that it allows for the exploration of different average treatment effects in different strata which can be useful in adapting or transporting the results to other locations, see Section 2.
Stratification is not possible when there are too many covariates, or if each has many
values, so that there are more cells than can be filled given the sample size. An alternative is to
re-randomize, repeating the randomization until the distance between the observed covariates
in the treatment and control groups is less than some predetermined criterion. Morgan and Rubin
(2012) suggest the Mahalanobis D-statistic, and use Fisher’s randomization inference (to be
discussed further below) to calculate standard errors that take the re-randomization into account.
Another alternative, widely adopted in practice, is to adjust for covariates by running a regression
(or covariance) analysis, with the outcome on the left-hand side and the treatment dummy and the
covariates as explanatory variables, including possible interactions between covariates and
treatment dummies.
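The re-randomization idea can be sketched as follows. This is an illustration under stated assumptions, not the procedure of Morgan and Rubin: it handles exactly two covariates so that the covariance matrix can be inverted by hand, it uses equal-sized arms, and the threshold, seed, and simulated covariates are hypothetical.

```python
import random
import statistics


def mean_vec(rows):
    return [statistics.mean(col) for col in zip(*rows)]


def cov2(rows):
    """2x2 sample covariance matrix for rows of (x1, x2)."""
    x1, x2 = zip(*rows)
    m1, m2 = statistics.mean(x1), statistics.mean(x2)
    c12 = sum((a - m1) * (b - m2) for a, b in rows) / (len(rows) - 1)
    return [[statistics.variance(x1), c12], [c12, statistics.variance(x2)]]


def mahalanobis(rows, treated_idx):
    """Mahalanobis distance between the covariate means of the two arms."""
    treated_set = set(treated_idx)
    t = [r for i, r in enumerate(rows) if i in treated_set]
    c = [r for i, r in enumerate(rows) if i not in treated_set]
    d = [a - b for a, b in zip(mean_vec(t), mean_vec(c))]
    (a, b), (_, e) = cov2(rows)
    det = a * e - b * b
    # Quadratic form d' S^{-1} d for the 2x2 matrix S = [[a, b], [b, e]].
    quad = (e * d[0] ** 2 - 2 * b * d[0] * d[1] + a * d[1] ** 2) / det
    return quad / (1 / len(t) + 1 / len(c))


def rerandomize(rows, threshold, rng, max_draws=10_000):
    """Redraw the allocation until the distance falls below the threshold."""
    n = len(rows)
    for draw in range(1, max_draws + 1):
        treated_idx = rng.sample(range(n), n // 2)
        if mahalanobis(rows, treated_idx) < threshold:
            return treated_idx, draw
    raise RuntimeError("no acceptable allocation found")


rng = random.Random(0)
rows = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(40)]  # simulated covariates
treated, draws = rerandomize(rows, threshold=1.0, rng=rng)
print(draws, round(mahalanobis(rows, treated), 3))
```

Since allocations are accepted only when they pass the balance criterion, any subsequent inference has to be based on that restricted set of allocations rather than on all possible ones.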
Freedman (2008) has analyzed this method and argues “if adjustment made a substantial
difference, we would suggest much caution when interpreting the results.” But a substantial
difference is exactly what we would like to see, at least some of the time, if the adjustment
moves the estimate closer to the truth. Freedman shows that the adjusted estimate of the ATE
is biased in finite samples, with the bias depending on the correlation between the squared
treatment effect and the covariates. There is also no general guarantee that the regression
adjustment will generate a more precise estimate, although it will do so if there are equal numbers
of treatments and controls or if the treatment effects are constant over units (in which case
there will also be no bias). Even with bias, the regression adjustment is attractive if it does
indeed trade off bias for precision, though presumably not to RCT purists for whom unbiasedness
is the sine qua non. Note again that the increased precision, when it exists, comes from using
prior knowledge about the variables that are likely to be important for the outcome. That the
background knowledge or theory is widely shared and understood will also provide some
protection against data mining by searching through covariates in the search for (perhaps falsely)
estimated precision.
1.2.6 Should we randomize?
The tension between randomization and precision goes back to the early debate between Fisher
and Student (Gosset), who never accepted Fisher’s arguments for randomization, see also Ziliak
(2014). In his debate with Fisher about agricultural trials, Student argued that randomization
ignored relevant prior information, for example about how likely confounders would be
distributed across the test plots, so that randomization wasted resources and led to unnecessarily
poor estimates. This general question of whether randomization is desirable has been reopened
in recent papers by Kasy (2016), Banerjee, Chassang, and Snowberg (2016) and Banerjee,
Chassang, Montero, and Snowberg (2016).
Refer back to the MSE introduced above, and consider designing an experiment that will
make this as small as possible. Unfortunately, this is not generally possible; for example, the
“estimator” of 3, say, for the ATE has the lowest possible mean-squared error if the true ATE is
actually 3. Instead, we need to average the MSE over a distribution of possible ATEs. This leads to
a decision theory approach to estimation whereby a Bayesian econometrician will estimate the
ATE by choosing the allocation of treatment and controls so as to minimize the expected value
of a loss function, the MSE being one example. Such an approach requires us to specify a prior
on the ATE, or more generally, on the expectation of outcomes conditional on the covariates.
These priors are formal versions of the issue that has already come up repeatedly, that to get
good estimators, we need to know something about how the covariates affect the outcome.
Kasy (2016) solves this problem for the case of expected MSE and shows that randomization is
undesirable; it simply adds noise and makes the MSE larger. He uses a non-parametric prior that
has proved useful in a number of other applications (we could presumably do even better if we
were prepared to commit further), and he provides code to implement his method, which shows
a 20 percent reduction in MSE compared with randomization (14 percent for stratified
randomization) for the well-known Tennessee STAR class-size experiment.
Banerjee et al. propose a more general loss function and prove the comparable theorem,
that randomization leads to larger losses than the optimal non-random purposive assignment.
These authors recommend randomization on other grounds, which we will discuss below, but
agree that, for standard statistical efficiency or maximization of expected utility, randomization
should not be used in experimental design. Student was right.
Several points should be noted. First, the anti-randomization theorem is not a justification
of any non-experimental design, for example one that compares outcomes of those who do
or do not self-select into treatment. Selection effects are real enough, and if selection is based
on unobservable causes, comparison of treated and controls will be biased. One acceptable
non-random scheme is to use the observable covariates to divide the study sample into cells within
which all observations have the same value and then divide each cell into treatments and
controls. Within each cell, or for those units on which we have no information, we can choose any
way we like, including randomly, though randomization has no advantage or disadvantage. Such
allocations rule out self-selection (or doctor or program administrator selection) where the
individual (doctor, or administrator) has information not visible to the person assigning treatments
and controls. The key is that the person who makes the assignment (the analyst) uses all of the
information that he or she possesses, and that once this has been taken into account, all units
are interchangeable conditional on that information, so that assignment beyond that does not
matter. Of course, the program administrators must enforce the analyst’s assignment, so that
private information that they or the units possess is not allowed to affect the assignment,
conditional on the information used by the analyst. Given this, selection on unobservables is ruled
out, and does not affect the results. Randomization is not required to eliminate selection bias.
Whether it is really possible for the analyst to assign arbitrarily is an open question, as is
whether “randomization” from a random-number generator will do so. Even machine-generated
sequences have causes, and even if the analyst has only a set of uninformative labels for the
units, those too must come from somewhere, so that it is possible that those causes are linked
to the unobserved causes in the experiment. We do not attempt to deal here with these deep
issues on the meaning of randomization, but see Singer and Pincus (1998).
According to Chalmers (2001) and Bothwell and Podolsky (2016), the development of
randomization in medicine originated with Bradford-Hill, who used randomization in the first RCT
in medicine (the streptomycin trial) because it prevented doctors selecting patients on the
basis of perceived need (or against perceived need, leaning over backward as it were), an
argument more recently echoed by Worrall (2007). Randomization serves this purpose, but so do
other non-discretionary schemes; what is required is that the hidden information not affect the
allocation. While it is true that doctors cannot be allowed to make the assignment, it is not true
that randomization is the only scheme that can be enforced.
Second, the ideal rules by which units are allocated to treatment or control depend on
the covariates, and on the investigators’ priors about how the covariates affect the outcomes.
This opens up all sorts of methods of inference that are excluded by pure randomization. For
example, the hypothetico-deductive method works by using theory to make a prediction that
can be taken to the data; here the predictions would be of the form that a unit with
characteristics x will respond in a particular way to treatment, falsification of which can be tested
by an appropriate allocation of units to treatment. Banerjee, Chassang and Snowberg (2016)
provide such examples.
Third, randomization, by running roughshod over prior information from theory and
from the covariates, is wasteful and even unethical when it unnecessarily exposes people, or
unnecessarily many people, to possible harm in a risky experiment; see Worrall (2002) for an
egregious case of how an unthinking demand for randomization and the refusal to accept prior
information put children’s lives directly at risk.
Fourth, the non-random methods use prior information, which is why they do better
than randomization. This is both an advantage and a disadvantage, depending on one’s
perspective. If prior information is not widely accepted, or is seen as non-credible by those we are
seeking to persuade, we will generate more credible estimates if we do not use those priors. Indeed,
this is why Banerjee, Chassang and Snowberg (2016) recommend randomized designs, including
in medicine and in development economics. They develop a theory of an investigator who is
facing an adversarial audience that will challenge any prior information and can even potentially
veto results that are based on it (think administrative agencies or journal referees). The
experimenter trades off his or her own desire for precision (and preventing possible harm to subjects),
which uses prior information, against the wishes of the audience, who want nothing of the
priors. Even then, the approval of this audience is only ex ante; once the fully randomized
experiment has been done, nothing stops critics arguing that, in fact, the randomization did not offer a
fair test. Among doctors who use RCTs, and especially meta-analysis, such arguments are
(appropriately) common; see again Kramer (2016).
As we noted in the Introduction, much of the public has come to question expert prior
knowledge, and Banerjee, Chassang, Montero and Snowberg (2016) have provided an elegant
(positive) account of why RCTs will flourish in such an environment. In cases where there is good
reason to doubt the good faith of experimenters, as in some pharmaceutical trials,
randomization will indeed be the appropriate response. But we believe such arguments are deeply
destructive for scientific endeavor and should be resisted as a general prescription for scientific
research. Economists and other social scientists know a great deal, and there are many areas of
theory and prior knowledge that are jointly endorsed by large numbers of knowledgeable
researchers. Such information needs to be built on and incorporated into new knowledge, not
discarded in the face of aggressive know-nothing ignorance. The systematic refusal to use prior
knowledge and the associated preference for RCTs are recipes for preventing cumulative
scientific progress. In the end, it is also self-defeating; to quote Rodrik (2016) “the promise of RCTs as
theory-free learning machines is a false one.”
1.3 Statistical inference in RCTs
If we are to interpret the results of an RCT as demonstrating the causal effect of the treatment
in the trial population, we must be able to tell whether the difference between the control and
treatment means could have come about by chance. Any conclusion about causality is hostage
to our ability to calculate standard errors and accurate p-values. But this is not generally
possible without assumptions that go beyond those needed to support the basic theorem of RCTs. In
particular, it has long been known that the mean (and a fortiori the difference between two
means) is a statistic that is sensitive to outliers. Indeed Bahadur and Savage (1956)
demonstrate that, without restrictions on the parent distributions, standard t-tests are inherently
unreliable.
The key problem here is skewness; standard t-tests break down in distributions with
large skewness, see Lehmann and Romano (2005, p. 466–8). In consequence, RCTs will not work
well when the distribution of the individual treatment effects is strongly asymmetric, at least if
the standard two-sample t-statistics (or equivalently White’s (1980) heteroskedasticity-robust
regression t-values) are used. While we may be willing to assume that treatment effects are
symmetric in some cases, the need for such an assumption, which requires prior knowledge about
the specific process being studied, undermines the argument that RCTs are largely assumption
free and do not depend on such knowledge. There is a deep irony here. In the search for
robustness and the desire to do away with unnecessary assumptions, the RCT can deliver the mean of
the ATE, yet the mean (as opposed to the median, which cannot be estimated by an RCT) does
not permit robust probability statements about the estimates of the ATE.
How difficult is it to maintain symmetry? And how badly is inference affected when the
distribution of treatment effects is not symmetric? In economics, many trials have outcomes
valued in money. Does an anti-poverty innovation (for example microfinance) increase the
incomes of the participants? Income itself is not symmetrically distributed, and this might be
true of the treatment effects too, if there are a few people who are talented but
credit-constrained entrepreneurs and who have treatment effects that are large and positive, while
the vast majority of borrowers fritter away their loans, or at best make positive but modest
profits. Another important example is expenditures on healthcare. Most people have zero
expenditure in any given period, but among those who do incur expenditures, a few individuals
spend huge amounts that account for a large share of the total. Indeed, in the famous Rand
health experiment, Manning, Newhouse et al. (1987, 1988), there is a single very large outlier.
The authors realize that the comparison of means across treatment arms is fragile, and,
although they do not see their problem exactly as described here, they obtain their preferred
estimates using a structural approach that is designed to explicitly model the skewness of
expenditures.
In some cases, it will be appropriate to deal with outliers by trimming, eliminating
observations that have large effects on the estimates. But if the experiment is a project evaluation
designed to estimate the net benefits of a policy, the elimination of genuine outliers, as in the
Rand Health Experiment, will vitiate the analysis. It is precisely the outliers that make or break
the program.
1.3.1 Spurious statistical significance: an illustrative example
We consider an example that illustrates what can happen in a realistic but simplified case. There
is a parent population, or population of interest, defined as the collection of units for which we
would like to estimate an average treatment effect. It might be all villages in India, or all
recipients of food subsidies, or all users of healthcare in the US. From this population we have a
sample that is available for randomization, the trial or experimental sample; in a randomized
controlled trial, this will subsequently be randomly divided into treatments and controls. Ideally,
the trial sample would be randomly selected from the parent population, so that the sample
average treatment effect would be an unbiased estimator of the population average treatment
effect; indeed in some cases the complete population of interest is available for the trial. Clearly,
in these ideal cases, it is straightforward to use standard sampling theory to generalize the trial
results from the sample to the population. However, for a number of practical and conceptual
reasons, the trial sample is rarely either the whole population or a randomly selected subset,
see Shadish et al. (2002, pp. 341–8) for a good discussion of both practical and theoretical
obstacles.
In our illustrative example, there is a parent population each member of which has his or
her own treatment effect; these are continuously distributed with a shifted lognormal
distribution with zero mean so that the population average treatment effect is zero. The individual
treatment effects β are distributed so that β + e^{0.5} ∼ Λ(0,1), for standardized lognormal
distribution Λ; the shift e^{0.5} is the mean of Λ(0,1), which is why β has mean zero. We have
something like a microfinance trial in mind, where there is a long positive tail of rare individuals
who can do amazing things with credit, while most people cannot use it effectively. A trial
(experimental) sample of 2n individuals is randomly drawn from the parent and is randomly split
between n treatments and n controls. In the absence of treatment, everyone in the sample records
zero, so the sample average treatment effect in any one trial is simply the mean outcome among
the n treatments. For values of n equal to 25, 50, 100, 200, and 500 we draw 100
trial/experimental samples each of size 2n; with five values of n, this gives us 500
trial/experimental samples in all. For each of these 500 samples, we randomize into n controls
and n treatments, estimate the ATE and its estimated t-value (using the standard two-sample
t-value, or equivalently, by running a regression with robust t-values), and then repeat 1,000
times, so we have 1,000 ATE estimates and t-values for each of the 500 trial samples; these
allow us to assess the distribution of ATE estimates and their nominal t-values for each trial.

Table 1: RCTs with skewed treatment effects

Sample size    Mean of ATE    Mean of nominal    Fraction null rejected
(n per arm)    estimates      t-values           (percent)
     25          0.0268         -0.4274              13.54
     50          0.0266         -0.2952              11.20
    100         -0.0018         -0.2600               8.71
    200          0.0184         -0.1748               7.09
    500         -0.0024         -0.1362               6.06

Note: 1,000 randomizations on each of 100 draws of the trial sample randomly drawn from a
lognormal distribution of treatment effects shifted to have a zero mean.

The results are shown in Table 1. Each row corresponds to a sample size. In each row,
we show the results of 100,000 individual trials, composed of 1,000 replications on each of the
100 trial (experimental) samples. The columns are averaged over all 100,000 trials.
The last column shows the fractions of times the true null is rejected and is the key
result. When there are only 50 treatments and 50 controls (row 2), the (true) null is rejected 11.2
percent of the time, instead of the 5 percent that we would like and expect if we were unaware
of the problem. When there are 500 units in each arm, the rejection rate is 6.06 percent, much
closer to the nominal 5 percent.
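The design of this experiment can be sketched in a few lines. The code below is an illustration under our own assumptions and is not the program used to produce Table 1: it uses far fewer samples and randomizations, a fixed critical value of 1.96 rather than a table of t-values, and hypothetical function names and seed, so its numbers will differ from those in the table.

```python
import math
import random


def mean_var(xs):
    """Sample mean and (n - 1)-denominator variance."""
    n = len(xs)
    m = sum(xs) / n
    return m, sum((x - m) ** 2 for x in xs) / (n - 1)


def draw_effects(size, rng):
    """Treatment effects: standard lognormal shifted to have mean zero."""
    return [rng.lognormvariate(0.0, 1.0) - math.exp(0.5) for _ in range(size)]


def simulate(n, n_samples, n_randomizations, rng, critical=1.96):
    """Return (mean ATE estimate, mean t-value, percent of nulls rejected)."""
    ates, t_values, rejections = [], [], 0
    for _ in range(n_samples):
        effects = draw_effects(2 * n, rng)       # one trial sample of size 2n
        for _ in range(n_randomizations):
            treated = rng.sample(effects, n)     # a random half is treated
            ate, var = mean_var(treated)         # controls all record zero
            t = ate / math.sqrt(var / n)         # control variance is zero
            ates.append(ate)
            t_values.append(t)
            rejections += abs(t) > critical
    trials = n_samples * n_randomizations
    return sum(ates) / trials, sum(t_values) / trials, 100 * rejections / trials


rng = random.Random(2016)
for n in (25, 50, 100):
    mean_ate, mean_t, pct = simulate(n, n_samples=100, n_randomizations=50, rng=rng)
    print(f"n={n:4d}  mean ATE={mean_ate:8.4f}  mean t={mean_t:8.4f}  rejected={pct:5.2f}%")
```

Even at this reduced scale the qualitative pattern should be visible: the average ATE estimate is close to zero, the average t-value is negative, and the true null is rejected more often than the nominal 5 percent.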
Why does the standard application of the t-distribution give such strange results when
all we are doing is estimating a mean? The problem cases are when the trial sample happens to
contain one or more outliers, something that is always a risk given the long positive tail of the
parent distribution. When this happens, everything depends on whether the outlier is among
the treatments or the controls; in effect the outliers become the sample, reducing the effective
number of degrees of freedom.

Figure 1: Estimates of an ATE with an outlier in the trial sample
[Figure: histogram; vertical axis: density; horizontal axis: 1,000 estimates of average treatment effect]

Figure 1 illustrates the estimated average treatment effects from an extreme case from
the simulations with 100 observations in total, the second row of Table 1; the histogram shows
the 1,000 estimates of the ATE. The trial sample has a single large outlying treatment effect of
48.3; the mean (s.d.) of the other 99 observations is -0.51 (2.1); when the outlier is in the
treatment group, we get the right-hand side of the figure, when it is not, we get the left-hand
side. On the right-hand side, when the outlier is among the treatment group, the dispersion
across outcomes is large, as is the estimated standard error, and so those outcomes rarely reject
the null using the standard table of t-values. The over-rejections come from the left-hand side
of the figure when the outlier is in the control group, the outcomes are not so dispersed, and
the t-values can be large, negative, and significant. While these cases of bimodal distributions
may not be common, and depend on large outliers, they illustrate the process that generates
the over-rejections and spurious significance.
We could escape these problems if we could calculate the median treatment effect, but
RCTs cannot (without further assumption) identify the median, only the mean, and it is the
mean that is at risk because of the Bahadur-Savage theorem. Note too that there is only
moderate comfort to be taken in large sample sizes. While the last row is certainly better than the
others, there are still many trial samples that are going to give sample average effects that are
significant, even when the number we want is zero. The proof of the Bahadur-Savage theorem
works by noting that for any sample size, it is always possible to find an outlier that will give a
misleading t-value. Nor is there an escape here by using the Fisher exact method for inference;
the Fisher method tests the null hypothesis that all of the treatment effects are zero whereas
what we are interested in here, at least if we want to do project evaluation or cost-benefit
analysis, is that the average treatment effect is zero.
The problems illustrated above, that stem from the Bahadur-Savage theorem, are
certainly not confined to RCTs, and occur more generally in econometric and statistical work.
However, the analysis here illustrates that the simplicity of ideal RCTs, subtracting one mean from
another, brings no exemption from troublesome problems of inference. Escape from these
issues, as in the Rand Health Experiment, requires explicit modeling, or might be best handled by
estimating quantiles of the treatment distribution, which again requires additional assumptions.
Our reading of the literature on RCTs in development suggests that they are not exempt
from these concerns. Many development trials are run on (sometimes very) small samples, they
have treatment effects where asymmetry is hard to rule out (especially when the outcomes are
in money) and they often give results that are puzzling, or at least not easily interpreted in
terms of economic theory. Neither Banerjee and Duflo (2012) nor Karlan and Appel (2011), who
cite many RCTs, raise concerns about misleading inference, treating all results as solid. No doubt
there are behaviors in the world that are inconsistent with standard economics, and some can
be explained by standard biases in behavioral economics, but it would also be good to be
suspicious of the significance tests before accepting that an unexpected finding is well supported and
theory should be revised. Replication of results in different settings may be helpful, if they are
the right kind of places (see our discussion in Section 2), but it hardly solves the problem given
that the asymmetry may be in the same direction in different settings (and seems likely to be so
in just those settings that are sufficiently like the original trial setting to be of use for inference
about the trial population), and that the “significant” t-values will show departures from the
null in the same direction, thus replicating spurious findings.
1.2.11: Significance tests: Fisher-Behrens, robust inference, and multiple hypotheses
Skewness of treatment effects is not the only threat to accurate significance tests. The
two-sample t-statistic is computed by dividing the ATE by the estimated standard error whose
square is given by

$$\hat{\sigma}^2 = \frac{(n_1-1)^{-1}\sum_{i\in 1}(Y_i-\hat{\mu}_1)^2}{n_1} + \frac{(n_0-1)^{-1}\sum_{i\in 0}(Y_i-\hat{\mu}_0)^2}{n_0} \qquad (5)$$

where 0 refers to controls and 1 to treatments, so that there are $n_1$ treatments and $n_0$
controls, and $\hat{\mu}_1$ and $\hat{\mu}_0$ are the two means. As has been long known, this t-statistic is not
distributed as Student’s t if the two variances (treatment and control) are not identical; this is known
as the Behrens–Fisher problem. In extreme cases, when one of the variances is zero, the
t-statistic has effective degrees of freedom half of that of the nominal degrees of freedom, so that
the test statistic has thicker tails than allowed for, and there will be too many rejections when
the null is true.
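The halving of the degrees of freedom can be checked numerically. The sketch below computes the two-sample t-statistic with the squared standard error taken as the sum of each arm's sample variance divided by its size, and pairs it with the Welch-Satterthwaite approximation to the effective degrees of freedom. That approximation is a standard device that we bring in for illustration; it is not named in the text, and the data are hypothetical.

```python
import math
import statistics


def two_sample_t(treated, control):
    """Return (t-statistic, Welch-Satterthwaite effective degrees of freedom)."""
    n1, n0 = len(treated), len(control)
    v1, v0 = statistics.variance(treated), statistics.variance(control)
    se2 = v1 / n1 + v0 / n0                       # squared standard error
    t = (statistics.mean(treated) - statistics.mean(control)) / math.sqrt(se2)
    # Approximate effective degrees of freedom when variances differ.
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v0 / n0) ** 2 / (n0 - 1))
    return t, df


# Extreme case: the control arm has zero variance.
t, df = two_sample_t([1.0, 3.0, 2.0, 6.0], [0.0, 0.0, 0.0, 0.0])
print(round(t, 3), df)   # nominal degrees of freedom would be 4 + 4 - 2 = 6
```

With equal arm sizes and one variance equal to zero, the approximation returns n - 1 against a nominal 2(n - 1), which is the halving described above; with equal variances it returns the nominal value.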
In a remarkable recent paper, Young (2016) argues that this problem gets much worse
when the trial results are analyzed by regressing outcomes not only on the treatment dummy,
but also on additional controls, some of which might interact with the treatment dummy. Again
the problem concerns outliers in combination with the use of clustered or robust standard
errors. When the design matrix is such that the maximal influence is large, so that for some
observations outcomes have large influence on their own predicted values, there is a reduction in the
effective degrees of freedom for the t-value(s) of the average treatment effect(s) leading to
spurious findings of significance.
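The influence of an observation's outcome on its own predicted value is the corresponding diagonal element of the regression hat matrix. The sketch below computes these elements for a small design with an intercept, a treatment dummy, and one covariate containing an outlier. It is our own illustration rather than Young's procedure, and the data and function names are hypothetical.

```python
def solve(a, b):
    """Solve a z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def leverages(X):
    """Diagonal of the hat matrix: h_ii = x_i' (X'X)^{-1} x_i."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    return [sum(xi * zi for xi, zi in zip(row, solve(xtx, row))) for row in X]


# Columns: intercept, treatment dummy, covariate (the last unit is an outlier).
dummy = [1, 0, 1, 0, 1, 0, 1, 0]
covariate = [0.1, -0.2, 0.3, 0.0, -0.1, 0.2, 0.1, 8.0]
X = [[1.0, float(d), x] for d, x in zip(dummy, covariate)]
h = leverages(X)
print([round(v, 3) for v in h])
```

The leverages sum to the number of regressors, so a single observation with leverage close to one leaves little independent information for the remaining coefficients.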
Young looks at 2,003 regressions reported in 53 RCT papers in the American Economic
Association journals and recalculates the significance of the estimates using Fisher’s
randomization inference applied to the authors’ original data; see again Imbens and Wooldridge (2009) for
a good modern account of Fisher’s method. In 30 to 40 percent of the estimated treatment
effects in individual equations with coefficients that are reported as significant, he cannot reject
the null of no effect; the fraction of spuriously significant results increases further when he
simultaneously tests for all results in each paper. These spurious findings come in part from the
well-known problem of multiple-hypothesis testing, both within regressions with several
treatments and across regressions. Within regressions, treatments are largely orthogonal, but
authors tend to emphasize significant t-values even when the corresponding F-tests are
insignificant. Across equations, results are often strongly correlated, so that, at worst, different
regressions are reporting variants of the same result, thus spuriously adding to the “kill count” of
significant effects. At the same time, the pervasiveness of observations with high influence
generates spurious significance on its own.
Our sense is that these issues are being taken more seriously in recent work, especially
as concerns multiple hypothesis testing. Young himself is a strong proponent of RCTs in general
and believes that randomization inference will yield correct inferences. Yet randomization
inference can only test the null that all treatment effects are zero, that the experiment does nothing
to anyone, whereas many investigators are interested in the weaker hypothesis that the
average treatment effect is zero. This simply makes matters worse since the stronger hypothesis
implies the weaker hypothesis and there are presumably undiscovered cases where the ATE is
spuriously significant, even when the Fisher test rejects that all treatment effects are zero. Note
that testing does not always match logic; it is possible to reject the null that the ATE is zero even
when we can simultaneously accept the (joint) hypothesis that all treatment effects are zero;
this is familiar from OLS regression, where an F-test can show joint insignificance, even when a
t-test of some linear combination is significant.
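A minimal sketch of randomization inference may help fix ideas. Under the sharp null that the treatment does nothing to anyone, every unit's outcome would have been the same under any allocation, so the observed difference in means can be compared with its distribution over reshuffled treatment labels. The Monte Carlo version below, with hypothetical data, function names, and seed, is an illustration and not Young's implementation.

```python
import random
import statistics


def diff_in_means(outcomes, treated_flags):
    t = [y for y, d in zip(outcomes, treated_flags) if d]
    c = [y for y, d in zip(outcomes, treated_flags) if not d]
    return statistics.mean(t) - statistics.mean(c)


def randomization_p_value(outcomes, treated_flags, rng, draws=2000):
    """Two-sided Monte Carlo p-value for the sharp null of no effect on any unit."""
    observed = abs(diff_in_means(outcomes, treated_flags))
    flags = list(treated_flags)
    extreme = 0
    for _ in range(draws):
        rng.shuffle(flags)   # reassign labels; outcomes stay fixed under the sharp null
        extreme += abs(diff_in_means(outcomes, flags)) >= observed
    return (extreme + 1) / (draws + 1)


outcomes = [10, 11, 12, 13, 0, 1, 2, 3]
flags = [1, 1, 1, 1, 0, 0, 0, 0]
print(randomization_p_value(outcomes, flags, random.Random(0)))
```

The reshuffling is valid only because the sharp null fixes every unit's outcome; under the weaker null of a zero average effect, outcomes would change with the allocation and the reference distribution above would no longer apply.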
It is clear that, as of now, all reported significance levels from RCT results in economics
should be treated with considerable caution. Greater care about skewness and outliers would
help, as would greater use of the Fisher method and of procedures that deal correctly with
multiple hypothesis testing. Yet if the null hypothesis is that the average treatment effect is zero, as
in most project evaluation, the Fisher test is not available, so that we currently do not have a
reliable set of procedures. Robust or clustered standard errors are necessary to allow for the
possibility that treatment changes variances, and the inclusion of covariates is necessary to
control for imbalance in finite samples.
1.3 Blinding
Blinding is rarely possible in economics or social science trials, and this is one of the major
differences from most (although not all) RCTs in medicine, where blinding is standard, both for
those receiving the treatment and those administering it. Indeed, the ability to blind has been
one of the key arguments in favor of randomization, from Bradford-Hill in the 1950s, see
Chalmers (2003), to welfare trials today, Gueron and Rolston (2013). Consider first the blinding
of subjects. Subjects in social RCTs usually know whether they are receiving the treatment or not
and so can react to their assignment in ways that can affect the outcome other than through the
operation of the treatment; in econometric language, this is akin to a violation of exclusion
restrictions, or a failure of exogeneity. In terms of (1), there is a pathway from the treatment
assignment to another unobserved cause, which will result in a biased ATE. This is not to argue in
favor of instrumental variables over RCTs, or vice versa, but simply to note that, without
blinding, RCTs do not automatically solve the selection problem any more than IV estimation
automatically solves the selection problem. In both cases, the exogeneity (exclusion restriction)
argument needs to be explicitly made and justified. Yet the literature in economics gives great
attention to the validity of exclusion restrictions in IV estimation, while tending to shrug off the
essentially identical problems with lack of blinding in RCTs.
Note also that knowledge of their assignment may cause people to want to cross over
from treatment to control, or vice versa, to drop out of the program, or to change their behavior
in the trial depending on their assignment. In extreme cases, only those members of the trial
sample who expect to benefit from the treatment will accept treatment. Consider, for example,
a trial in which children are randomly allocated to two schools that teach in different languages,
Russian or English, as happened during the breakup of the former Yugoslavia. The children (and
their parents) know their allocation, and the more educated, wealthier, and less ideologically
committed parents whose children are assigned to the Russian-medium schools can (and did)
remove their children to private English-medium schools. In a comparison of those who
accepted their assignments, the effects of the language of instruction will be distorted in favor of the
English schools by differences in family characteristics. This is a case where, even if the random
number generator is fully functional, a later balance test will show systematic differences in
observable background characteristics between the treatment and control groups; even if the
balance test is passed, there may still be selection on unobservables for which we cannot test.
More generally, when people know their allocation, when they have a stake in the
outcome, and when the treatment effect is different for different people, there are incentives and
opportunities for selection in response to the randomization, and that selection can
contaminate the estimated average treatment effect, see Heckman (1997) who makes the same point in
the context of instrumental variables. Those who were randomized by a lottery into going to
Vietnam will have different treatment effects depending on their labor market prospects, and
those with better prospects are more likely to resist the draft. As we shall see in the next
subsection, various statistical corrections are available for a few of the selection problems
non-blinding presents, but all rely on the kind of assumptions that, while common in observational
studies, RCTs are designed to avoid. Our own view is that assumptions and the use of prior
knowledge are what we need to make progress in any kind of analysis, including RCTs whose
promise of assumption-free learning is always likely to be illusory.
There may be a tendency in economics to focus on the selection bias effects of
non-blinding because some solutions are available, but selection bias is not the only serious source
of bias in social and medical trials. Concerns about the placebo, Pygmalion, Hawthorne, John
Henry, and 'teacher/therapist' effects are widespread across studies of medical and social
interventions. This literature argues that double blinding should be replaced by quadruple blinding;
blinding should extend beyond participants and investigators and include those who measure
outcomes and those who analyze the data, all of whom may be affected by both conscious and
unconscious bias. The need for blinding in those who assess outcomes is particularly important
in any cases where outcomes are not determined by strictly prescribed procedures whose
application is transparent and checkable but require elements of judgment; a good example is
therapists who are asked to assess the extent of depression in clinical trials of anti-depressants, see
Kramer (2016).
The lesson here is that blinding matters and is very often missing. There is no reason to
suppose that a poorly blinded trial with random assignment trumps better blinded studies with
alternative allocation mechanisms, or matched studies.
1.13 What do RCTs do in practice?
The execution of an RCT will often deviate from its design. People may not accept their
assignment, controls may manage to get treatment, and vice versa, and people may accept their
assignment, but drop out before the completion of the study. In some designs, the trial works by
giving people incentives to participate, for example by mailing them a voucher that gives them
subsidized access to a school or to a savings product. If the aim is to evaluate the voucher
scheme itself, no new issue arises. However, if the aim is to find out what the education or
savings program does, and the voucher is simply a device to induce variation, much depends on
whether or not people decide to use the voucher which, like attrition and crossover, is subject
to purposive decisions by the subjects inducing differences between treatments and controls.
Everything depends on the purpose of the trial. In the example above, we may want to
evaluate the voucher program, or we may want to find out what the savings product does for
people. We are sometimes interested in establishing causality, and sometimes in estimating an
average treatment effect; in the economics literature, some writers define internal validity as
getting the ATE right, while others, following the original definition of the term, define internal
validity as getting causality right. Sometimes the trial limits itself to establishing causality (or to
estimating an ATE) in only the trial sample, but some trials are more ambitious, and try to
establish causality (or estimate an ATE) for a broader population of interest. When, as is common in
economics trials, no limits are placed on the heterogeneity of treatment responses, different
trial samples and different populations will generally have different ATEs and may have different
causal outcomes, e.g. if the treatment has an effect in one population but none or the opposite
effect in another. Our view is that the target of the trial, including the population of interest,
needs to be defined in advance. Otherwise, almost any estimated number can be interpreted as
a valid ATE for some population, we allow deviations from the design to define our target, and
we have no way of knowing whether apparently contradictory results are really contradictory or
are correct for the population on which they were derived. Differences in results, between
different RCTs and between RCTs and observational studies, may owe less to the selection effects
that RCTs are designed to remove, than to the fact that we are comparing non-comparable
people, Heckman, Lalonde, and Smith (1999, p. 2082). Without a clear idea of how to characterize
the population of individuals in the trial, whether we are looking for an ATE or to identify
causality, and for which groups enrolled in the trial the results are supposed to hold, we have no basis
for thinking about how to use the trial results in other contexts.
To illustrate some of the issues, consider a simple RCT in which a treatment T is
administered to a trial sample that is split between a treatment group of size n and a control group of
size n, but in which only a fraction p of the treatment group accepts its assignment, with the
fraction (1 − p) receiving no treatment. Suppose that the parameter of interest is the ATE in the
original population, from which the trial sample was drawn randomly. Denote by β the hypothetical
ideal ATE estimate that would have been calculated if everyone had accepted assignment; as we
have seen, this is an unbiased estimator of the parameter of interest for both the trial sample
and the parent population. β cannot be calculated, but there are various options.
Option one is to ignore the original assignment and calculate the difference in means
between those who received the treatment and those who did not, including among the latter
those who were intended to receive it but did not. Denote this (“as treated”) estimate β1.
Alternatively, option two is to compare the average outcome among those who were intended to
be treated and those who were intended to be controls. Denote this estimate, the “intent to
treat” (ITT) estimator, β2. It is easy to show that one set of conditions for β1 = β is that those
who were treated have the same ATE as those who were intended to be treated, and that those
who broke their assignment have the same untreated mean as those who were assigned to be
controls, conditions that may hold in some applications, for example where the treatment
effects are identical.
The ITT estimator, β2, will typically be closer to zero than is β, and it will certainly be
so if the average treatment effect among those who break their assignment is the same as the
overall ATE, in which case β2 = pβ. For these reasons, the ITT is often described as yielding a
conservative estimate and is routinely advocated in medical trials even though it is an
attenuated estimator of the ATE. A third estimator, β3, the local average treatment estimator
(LATE), is computed by running a regression of outcomes on an (actual) treatment dummy using the
treatment assignment as an instrumental variable. In this case, the LATE is simply the ITT, scaled
up by the reciprocal of p, so that β3 = β2 / p. From the above, the LATE is β if the average
treatment effect of those who break their assignment is the same as the average treatment
effect in general, so that the ITT estimator is biased down by counting those who should have
been treated as if they were controls. More generally, and with additional assumptions, Imbens
and Angrist (1994) show that the LATE is the average treatment effect among those who were
induced to accept the treatment by their assignment to treatment status, which can be a very
different object from the original target of investigation. These various estimators, the ATE, the
ITT, and the LATE, are all averages over different groups; more formally, Heckman and Vytlacil
(2005) define a marginal treatment effect (MTE) as the ATE for those on the margin of
treatment (whatever the assignment mechanism) and show that the other estimators can be
thought of as averages of the MTEs over different populations.
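The relations among the three estimators can be checked numerically. The Python sketch below is an illustration only and is not taken from any of the cited studies: it simulates a trial with n units per arm, normally distributed untreated outcomes and individual effects, and one-sided non-compliance (controls never obtain the treatment). The function name simulate_trial, the distributions, and all parameter values are assumptions chosen for the example. When acceptance is unrelated to the individual effect, the ITT should come out near pβ and both the as-treated estimate and the LATE near β; when those with the largest effects are the ones who accept, the LATE moves away from β.

```python
import random
from statistics import mean


def simulate_trial(n=20000, p=0.6, beta=2.0, selective=False, seed=0):
    """Return (as_treated, itt, late) for one simulated trial.

    Hypothetical setup: each unit is (untreated outcome, individual effect),
    and only a fraction p of the assigned-to-treatment arm accepts treatment.
    """
    rng = random.Random(seed)
    arm_t = [(rng.gauss(0.0, 1.0), rng.gauss(beta, 1.0)) for _ in range(n)]
    arm_c = [(rng.gauss(0.0, 1.0), rng.gauss(beta, 1.0)) for _ in range(n)]

    if selective:
        # Units with the largest individual effects accept; the rest break assignment.
        cutoff = sorted(b for _, b in arm_t)[int((1 - p) * n)]
        takes = [b >= cutoff for _, b in arm_t]
    else:
        # Acceptance is unrelated to the individual effect.
        takes = [rng.random() < p for _ in arm_t]

    y_t = [y0 + b * d for (y0, b), d in zip(arm_t, takes)]  # assigned to treatment
    y_c = [y0 for y0, _ in arm_c]                           # assigned to control

    received = [y for y, d in zip(y_t, takes) if d]
    not_received = [y for y, d in zip(y_t, takes) if not d] + y_c

    as_treated = mean(received) - mean(not_received)  # beta_1
    itt = mean(y_t) - mean(y_c)                       # beta_2
    late = itt / (sum(takes) / len(takes))            # beta_3 = beta_2 / p
    return as_treated, itt, late


if __name__ == "__main__":
    print("acceptance unrelated to effect:", simulate_trial(selective=False))
    print("largest effects accept:        ", simulate_trial(selective=True))
```

In the second call the as-treated and LATE figures both describe those who chose to accept, not the parent population, which is the point made in the text about averages over different groups.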
In general, and unless we are prepared to say more about the heterogeneity in the
treatment effects, the three estimators will give different results because they are averages over
different populations. Economists tend to believe that people act in their own interest, at least
in part, so it is not attractive to believe that those who break their assignments have the same
distribution of treatment effects as do those who accept them. In Heckman’s (1992) analogy,
people are not like agricultural plots, which are in no position to evade the treatment when they
see it coming. Such purposive behavior will generally also affect the composition of the trial
sample compared with the parent population, with those who agree to participate different
from those who do not. For example, people may dislike randomization because of the risks it
entails, or people may seek to enter trials in the hope that they will receive a beneficial
treatment that is otherwise unavailable. A famous example in economics is the Ashenfelter (1978)
pre-program “dip,” where those who enter trials of training programs tend to be those whose
earnings have fallen immediately prior to enrolment; see also Heckman and Smith (1999).
People who participate in drug trials are more likely to be sick than those who do not, or are likely
to be those who have failed on standard medication. Another example is Chyn’s (2016) evidence
that those who applied for vouchers in the Moving to Opportunity experiment and were thus
eligible for randomization (and only a quarter of those who were eligible actually did so) were
those who were already making unusual efforts on their children’s behalf. These parents had
effectively substituted for part of the better environment, so that the ATE from the trial
understates the benefits to the average child of moving. Similar phenomena occur in medicine. In the
1954 trials of the Salk polio vaccine in the US, the rates of infection, while lowest among the
treated children, were higher in the control children than in the general population at risk, so
that the parents of those who selected into the trial presumably had some idea that they might
have been exposed, Hausman and Wise (1985, p. 193–4). In this case, the average treatment
effect in the trial sample exaggerates the ATE in the general population, which is what we want
to know for public policy.
Given the non-parametric spirit of RCTs, and the unwillingness of many trialists to make
assumptions or to incorporate prior information, the only way forward is to be very clear about
the purpose of the trial and, in particular, which average we are trying to estimate. For those
who focus on internal validity in terms of establishing causality by finding an ATE significantly
different from zero, the definition of the population seems to be a secondary concern. The idea
seems to be that if causality is established in some population, that finding is important in itself,
with the task of exploring its applicability to other populations left as a secondary matter. For
the many economic or cost–benefit analyses where the ATE is the parameter of interest, the
population of interest is definitional, and the inference needs to focus on a path from the results
of the trial to the parameter of interest. This is often difficult or even impossible without
additional assumptions and/or modeling of behavior, including the decision to participate in the
trial, and among participants, the decision not to drop out. Manski (1990, 1995, 2003) has shown
that, without additional evidence, the population ATE is not (point) identified from the trial
results, and has developed non-parametric bounds (an interval estimate) for the ATE. As with the
ITT, these bounds are sometimes tight enough to be informative, though the interval defined by
the bounds will often contain zero; see Manski (2013) for a discussion aimed at a broad
audience. Faced with this, many scholars are prepared to make assumptions or to build models that
give more precise results.
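As a rough illustration of why such bounds can be wide, the Python sketch below computes worst-case bounds in the simplest case: an outcome known to lie in an interval [lo, hi], with some outcomes unobserved in each arm because of dropout. This is a simplified textbook version of the idea and not a reproduction of Manski's derivations; the function names and the numbers are hypothetical.

```python
def mean_bounds(observed, n_missing, lo, hi):
    """Bounds on an arm's mean when n_missing outcomes are unobserved.

    Every missing outcome is set to lo for the lower bound and to hi for the
    upper bound, so no assumption is made about why units dropped out.
    """
    n = len(observed) + n_missing
    total = sum(observed)
    return (total + lo * n_missing) / n, (total + hi * n_missing) / n


def ate_bounds(treated_obs, treated_missing, control_obs, control_missing,
               lo=0.0, hi=1.0):
    """Worst-case interval for the ATE: pair the extremes of the two arms."""
    t_lo, t_hi = mean_bounds(treated_obs, treated_missing, lo, hi)
    c_lo, c_hi = mean_bounds(control_obs, control_missing, lo, hi)
    return t_lo - c_hi, t_hi - c_lo


# Hypothetical binary outcome, 100 units per arm, 30 dropouts in each arm.
treated = [1.0] * 50 + [0.0] * 20
control = [1.0] * 35 + [0.0] * 35
lower, upper = ate_bounds(treated, 30, control, 30)
print(round(lower, 2), round(upper, 2))
```

With these made-up numbers the observed difference in means is positive, yet the interval runs from about -0.15 to 0.45 and so contains zero. Its width equals the sum of the two dropout shares times (hi - lo), whatever the observed outcomes are.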
RCTs may tell us about causality, even when they do not deliver a good estimate of the
ATE. For example, if the ITT estimate is significantly different from zero, the treatment has a
causal effect for at least some individuals in the population. The same is true if the LATE is
significantly different from zero; again the treatment is causal for some sub-population, even if we
may have difficulty characterizing it or accepting it as the population of interest. From this, we
also learn that, provided we had a population with the right distribution of βi's and governed
by the same potential outcome equation, the treatment would produce the effect in at least
some individuals there.
Section 2: Using the results of randomized controlled trials
2.1 Introduction
Suppose we have the results of a well-conducted RCT. We have estimated an average treatment
effect, and our standard error gives us reason to believe that the effect did not come about by
chance. We thus have good warrant that the treatment causes the effect in our sample
population, up to the limits of statistical inference. What are such findings good for? How should we
use them?
The literature in economics, as indeed in medicine and in social policy, has paid more
attention to obtaining results than to whether and how they should be adapted for use, often
assuming that findings can be used “as is.” Much effort is devoted to demonstrating causality and
estimating effect sizes in study populations, both in empirical work (more and better RCTs, or
substitutes for RCTs, such as instrumental variables or regression discontinuity models) and
in theoretical statistical work, for example on the conditions under which we can estimate
an average treatment effect, or a local average treatment effect, and what these estimates
mean. There is less theoretical or empirical work to guide us on how and for what purposes to use
the findings of RCTs, such as the conditions under which the same results hold outside of the
original settings, how they might be adapted for use elsewhere, or how they might be used for
formulating, testing, understanding, or probing hypotheses beyond the immediate relation
between the treatment and the outcome investigated in the study.
Yet it cannot be that knowing how to use results is less important than knowing how to
demonstrate them. Any chain of evidence is only as strong as its weakest link, so that a rigorously
established effect whose applicability is justified by a loose declaration of simile warrants little
more than an estimate that was plucked out of thin air. If trials are to be useful, we need paths
to their use that are as carefully constructed as are the trials themselves.
It is sometimes assumed that a parameter, once well established, is invariant across
settings. The parameter may be difficult to estimate, because of selection or other issues, and it
may be that only a well-conducted RCT can provide a credible estimate of it. If so, internal
validity is all that is required, and debate about using the results becomes a debate about the conduct
of the study. The argument for the “primacy of internal validity,” Shadish, Cook, and Campbell
(2002), is reasonable as a warning that bad RCTs are unlikely to generalize, but it is sometimes
incorrectly taken to imply that results of an internally valid trial will automatically or often apply
‘as is’ elsewhere, or that this is the default assumption failing arguments to the contrary. An
invariance argument is often made in medicine, where it is sometimes plausible that a particular
procedure or drug works the same way everywhere, though see Horton (2000) for a strong
dissent and Rothwell (2005) for examples on both sides of the question. We should also note the
recent movement to ensure that testing of drugs includes women and minorities because
members of those groups suppose that the results of trials on mostly healthy young white males do
not apply to them.
2.2 Using results, transportability, and external validity
Suppose a trial has established a result in a specific setting, and we are interested in using the
result outside the original context. If “the same” result holds elsewhere, we say we have
external validity, otherwise not. External validity may refer just to the transportability of the causal
connection, or go further and require replication of the magnitude of the average treatment
effect. Either way, the result holds (everywhere, or widely, or in some specific elsewhere) or it
does not.
This binary concept of external validity is often unhelpful; it both overstates and
understates the value of the results from an RCT. It directs us toward simple extrapolation (whether
the same result will hold elsewhere) or simple generalization (whether it holds universally or
at least widely) and away from possibly more complex but more useful applications of the
evidence. Just as internal validity says nothing about whether or not a trial result will hold
elsewhere, the failure of external validity interpreted as simple generalization or extrapolation says
little about the value of the trial.
First, there are several uses of RCTs that do not require transportability beyond the
original context; we discuss these in the next subsection. Second, there are often good reasons to
expect that the results from a well-conducted, informative, and potentially useful RCT will not
apply elsewhere in any simple way. Even successful replication by itself tells us little either for or
against simple generalization or extrapolation. Without further understanding and analysis,
even multiple replications cannot provide much support for, let alone guarantee, the conclusion
that the next will work in the same way. Nor do failures of replication make the original result
useless. We can often learn much from coming to understand why replication failed and use
that knowledge to make appropriate use of the original findings, not by expecting replication,
but by looking for how the factors that caused the original result might be expected to operate
differently in different settings. Third, and particularly important for scientific progress, the RCT
result can be incorporated into a network of evidence and hypotheses that test or explore
claims that look very different from the results reported from the RCT. We shall give examples
below of extremely useful RCTs that are not externally valid in the (usual) sense that their
results do not hold elsewhere, whether in a specific target setting or in the more sweeping sense
of holding everywhere.
Bertrand Russell’s chicken provides an excellent example of the limitations to
straightforward extrapolation from repeated successful replication. The bird infers, based on multiply
repeated evidence, that when the farmer comes in the morning, he feeds her. The inference
serves her well until Christmas morning, when he wrings her neck and serves her for Christmas
dinner. Of course, our chicken did not base her inference on an RCT. But had we constructed
one for her, we would have obtained exactly the same result that she did. Her problem was not
her methodology, but rather that she was studying surface relations, and that she did not
understand the social and economic structure that gave rise to the causal relations that she
observed. So she did not know how widely or how long they would obtain. Russell notes, “more
refined views as to the uniformity of nature would have been useful to the chicken” (1912, p.
44). We often act as if the methods of investigation that served the chicken so badly will do
perfectly well for us.
Establishing causality does nothing in and of itself to guarantee generalizability. Nor
does the ability of an ideal RCT to eliminate bias from selection or from omitted variables mean
that the resulting ATE will apply anywhere else. The issue is worth mentioning only because of
the enormous weight that is currently attached in economics to the discovery and labeling of
causal relations, a weight that is hard to justify for effects that may have only local applicability,
what might (perhaps provocatively) be labeled ‘anecdotal causality’. The operation of a cause
generally requires the presence of support or helping factors, without which a cause that
produces the targeted effect in one place, even though it may be present and have the capacity to
operate elsewhere, will remain latent and inoperative. What Mackie (1974) called INUS causality
(Insufficient but Non-redundant parts of a condition that is itself Unnecessary but Sufficient for a
contribution to the outcome) is often the kind of causality we see; a standard example is a
house burning down because the television was left on, although televisions do not operate in
this way without helping factors, such as wiring faults, the presence of tinder, and so on. This is
standard fare in epidemiology, which uses the term “causal pie” to refer to the case where a set
of causes are jointly but not separately sufficient for an effect. We can rewrite (3) in the form
$$Y_i = \beta_i T_i + \sum_{j=1}^{J} \gamma_j x_{ij} = \left( \sum_{k=1}^{K} \theta_k w_{ik} \right) T_i + \sum_{j=1}^{J} \gamma_j x_{ij} \qquad (6)$$
where θk controls how wik affects individual i’s treatment effect βi. The “helping” or “support”
factors for the treatment are represented by the interactive variables wik, among which may be
included some x’s. Since the ATE is the average of the βi's, two populations will have the same
ATE only if, except by accident, they have the same average for the support factors necessary
for the treatment to work. These are however just the kind of factors that are likely to be
differently distributed in different populations, and indeed we do generally find different ATEs in
different development (and other social policy) RCTs in different places even in the cases where
(unusually) they all point in the same direction.
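A small numerical sketch may make the role of the support factors concrete. Under the linear form in (6), each individual effect is a θ-weighted sum of that individual's w's, so the ATE is the same weighted sum of the population means of the w's. The Python fragment below is illustrative only: the two θ values and the two sites are invented, with the first w a constant and the second a hypothetical indicator of whether a functioning clinic is nearby.

```python
def individual_effect(theta, w_i):
    """beta_i = sum over k of theta_k * w_ik, the bracketed term in (6)."""
    return sum(t * w for t, w in zip(theta, w_i))


def average_effect(theta, population):
    """ATE: the average of the individual effects in a population."""
    effects = [individual_effect(theta, w_i) for w_i in population]
    return sum(effects) / len(effects)


# Hypothetical coefficients, identical in both sites (same governing equation).
THETA = (0.5, 2.0)

# Each tuple is (constant, clinic nearby?). Only the mix of w's differs by site.
site_a = [(1, 1)] * 9 + [(1, 0)] * 1   # clinics nearly everywhere
site_b = [(1, 1)] * 2 + [(1, 0)] * 8   # clinics rare

ate_a = average_effect(THETA, site_a)
ate_b = average_effect(THETA, site_b)
print(round(ate_a, 2), round(ate_b, 2))
```

Here the causal structure is identical in the two sites, yet the ATE is about 2.3 in one and 0.9 in the other, purely because the support factor is distributed differently.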
Causal processes often require highly specialized economic, cultural, or social structures
to enable them to work. Consider the Rube Goldberg machine that is rigged up so that flying a
kite sharpens a pencil, Cartwright and Hardie (2012, p. 77), or another where a long chain of ropes
and pulleys causes the insertion of food into the mouth to activate a face-wiping napkin. These
are causal machines, but they are specially constructed to give a kind of causality that operates
extremely locally and has no general applicability. The underlying structure affords a very
specific form of (6) that will not describe causal processes elsewhere. Neither the same ATE nor the
same qualitative causal relations can be expected to hold where the specific form for (6) is
different.
Indeed, we continually attempt to design systems that will generate causal relations
that we like and that will rule out causal relations that we do not like. Healthcare systems are
designed to prevent nurses and doctors from making errors; cars are designed so that drivers cannot
start them in reverse; work schedules for pilots are designed so they do not fly too many
consecutive hours without rest because alertness and performance are compromised.
As in the Rube Goldberg machines and in the design of cars and work schedules, the
economic structure and equilibrium may differ in ways that support different kinds of causal
relations and thus render a trial in one setting useless in another. For example, a trial that relies
on providing incentives for personal promotion is of no use in a state in which a political system
locks people into their social and economic positions. Conditional cash transfers cannot improve
child health in the absence of functioning clinics. Policies targeted at men may not work for
women. We use a lever to toast our bread, but levers only operate to toast bread in a toaster;
we cannot brown toast by pressing an accelerator, even if the principle of the lever is the same
in both a toaster and a car. If we misunderstand the setting, if we do not understand why the
treatment in our RCT works, we run the same risks as Russell’s chicken.
2.3 When RCTs speak for themselves: no transportability required
For some things we want to learn, an RCT is enough by itself. An RCT may disprove a general
theoretical proposition to which it provides a counterexample. The test might be of the general
proposition itself (a simple refutation test), or of some consequence of it that is susceptible to
testing using an RCT (a complex refutation test). Of course, counterexamples are often
challenged (for example, it is not the general proposition that caused the rejection, but a special
feature of the trial) but here we are on familiar inferential turf. An RCT may also confirm a
prediction of a theory, and although this does not confirm the theory, it is evidence in its favor,
especially if the prediction seems inherently unlikely in advance. Once again, this is familiar
territory, and there is nothing unique about an RCT; it is simply one among many possible testing
procedures. Even when there is no theory, or very weak theory, an RCT, by demonstrating
causality in some population, can be thought of as proof of concept, that the treatment is capable of
working somewhere. This is one of the arguments for the importance of internal validity.
Another case where no transportation is called for is when an RCT is used for evaluation,
for example to satisfy donors that the project they funded actually achieved its aims in the
population in which it was conducted. Even so, for such evaluations, say by the World Bank, to be
global public goods requires the development of arguments and guidelines that justify using the
results in some way elsewhere; the global public good is not an automatic by-product of the
Bank fulfilling its fiduciary responsibility. When the components of treatments change across
studies, evaluations need not lead to cumulative knowledge. Or as Heckman et al (1999, p. 1934)
note, “the data produced from them [social experiments] are far from ideal for estimating the
structural parameters of behavioral models. This makes it difficult to generalize findings across
experiments or to use experiments to identify the policy-invariant structural parameters that
are required for econometric policy evaluation.” Of course, when we ask exactly what those
invariant structural parameters are, whether they exist, and how they should be modeled, we
open up major fault lines in modern applied economics. For example, we do not intend to
endorse intertemporal dynamic models of behavior as the only way of recovering the parameters
that we need. We also recognize that the usefulness of simple price theory is not as universally
accepted as it once was. But the point remains that we need something, some regularity, and
that the something needed can rarely be recovered by simply generalizing across trials.
A third non-problematic and important use of an RCT is when the parameter of interest
is the average treatment effect in a well-defined population from which the sample trial
population (from which treatments and controls are randomly assigned) is itself a random sample. In
this case the sample average treatment effect (SATE) is an unbiased estimator of the population
average treatment effect (PATE) that, by assumption, is our target; see Imbens (2004) for these
terms. We refer to this as the “public health” case; like many public health interventions, the
target is the average, “population health,” not the health of individuals. One major (and widely
recognized) danger of the public-health-style uses of RCTs is that the scaling up from (even a
random) sample to the population will not go through in any simple way if the outcomes of
individuals or groups of individuals change the behavior of others, which will be common in
economic examples but perhaps less common in health. There is also an issue of timing if the results
are to be implemented sometime after the trial.
In economics, a ‘public-health-style’ example is the imposition of a commodity tax,
where the total tax revenue is of interest and we do not care who pays the tax. Indeed, theory
can often identify a specific, well-defined magnitude whose measurement is key for the policy;
see Deaton and Ng (1998) for an example of what Chetty (2009) calls a “sufficient” statistic. In
this case, the behavior of a random sample of individuals might well provide a good guide to the
tax revenue that can be expected. Another case comes from work on poverty programs where
the interest of the sponsors is in the consequences for the budget of the state responsible for
the program; we discuss these cases at the end of this Section. Even here, it is easy to imagine
behavioral effects coming into play that drive a wedge between the trial and its full scale
implementation, for example if compliance is higher when the scheme is widely publicized, or if
government agencies implement the scheme differently from trialists.
2.4 Transporting results laterally and globally
The program of RCTs in development economics, as in other areas of social science, has the
broader goal of finding out “what works.” At its most ambitious, this aims for universal reach,
and the development literature frequently argues that “credible impact evaluations are global
public goods in the sense that they can offer reliable guidance to international organizations,
governments, donors, and nongovernmental organizations (NGOs) beyond national borders,”
Kremer and Duflo (2008, p. 93). Sometimes the results of a single RCT are advocated as having
wide applicability, with especially strong endorsement when there is at least one replication.
For example, Kremer and Holla (2009) use a Kenyan trial as the basis for a blanket statement
without context restriction, “Provision of free school uniforms, for example, leads to 10%-15%
reductions in teen pregnancy and dropout rates.” Kremer and Duflo (2008), writing about
another trial, are more cautious, citing two evaluations, and restricting themselves to India: “One
can be relatively confident about recommending the scaling-up of this program, at least in India,
on the basis of these estimates, since the program was continued for a period of time, was
evaluated in two different contexts, and has shown its ability to be rolled out on a large scale.”
Of course, the problem of generalization extends beyond RCTs, to both “fully
controlled” laboratory experiments and to most non-experimental findings. For example, ever since
Alfred Marshall thought of it while sunbathing, economists have used the concept of an
elasticity (as in the income elasticity of the demand for food, or the price elasticity of the supply of
cotton) and have transported elasticities, which are conveniently dimensionless, from one
context to another, as numerical estimates, or in ranges, such as high, medium, or low. Articles
that collect such estimates are widely cited even though, as has long been known, the
invariance of elasticities is not guaranteed in practice and is sometimes inconsistent with choice
theory. Our argument here is that evidence from RCTs, like evidence on elasticities, is not
automatically simply generalizable, and that its internal validity, when it exists, does not provide it with
any unique invariance across context. We shall also argue that specific features of RCTs, such as
their freedom from parametric assumptions, although advantageous in estimation, can be a
serious handicap in use.
Most advocates of RCTs understand that “what works” needs to be qualified to “what
works under which circumstances,” and try to say something about what those circumstances
might be, for example, by replicating RCTs in different places, and thinking intelligently about
the differences in outcomes when they find them. Sometimes this is done in a systematic way,
for example by having multiple treatments within the same trial so that it is possible to estimate
a “response surface” that links outcomes to various combinations of treatments; see Greenberg
and Schroder (2004) or Shadish et al (2002). For example, the RAND health experiment had
multiple treatments, allowing investigation, not only of whether health insurance increased
expenditures, but how much it did so under different circumstances. Some of the negative income tax
experiments (NITs) in the 1960s and 1970s were designed to estimate response surfaces, with
the number of treatments and controls in each arm optimized to maximize precision of
estimated response functions subject to an overall cost limit, Conlisk (1973). Experiments on
time-of-day pricing for electricity had a similar structure, see Aigner (1985).
The MDRC experiments have also been analyzed across cities in an effort to link city
features to the results of the RCTs within them, Bloom, Hill, and Riccio (2005). Unlike the RAND and
NIT examples, these are ex post analyses of completed trials; the same is true of Vivalt (2015),
who assembles evidence on a large number of trials, and finds, for the collection of trials she
studied, that development-related RCTs run by government agencies typically find smaller
(standardized) effect sizes than RCTs run by academics or by NGOs. Bold et al (2013), who ran
parallel RCTs on an intervention implemented either by an NGO or by the government of Kenya,
found similar results there. Note that these analyses have a different purpose from those
meta-analyses that assume that different trials estimate the same parameter up to noise and average
in order to increase precision.
Although there are issues with all of these methods of investigating differences across
trials, without some discipline it is too easy to come up with “just-so” or fairy stories that
account for almost any differences. We risk a procedure that, if a result is replicated in full or in
part in at least two places, puts that treatment into the “it works” box and, if the result does not
replicate, causally interprets the difference in a way that allows at least some of the findings to
survive.
How can we think about this more seriously? How can we do better than simple
generalization and simple extrapolation? Many writers have emphasized the role of theory in
transporting and using the results of trials, and we shall discuss this further in the next subsection.
But statistical approaches are also widely used; these are designed to deal with the possibility
that treatment effects vary systematically with other variables. Referring back to (6), suppose
that the βi's, the individual treatment effects, are functions of a set of K observable or
unobservable support variables, wik, and that the non-vacuous w’s may even represent different
features in different places. It is then clear that, provided the distribution of the w values is the
same in the new circumstances as the old, the ATE in the original trial will hold in the new
circumstances. In general, of course, this condition will not hold, nor do we have any obvious
way of checking it unless we know what the support factors are in both places.
One procedure to deal with interactions is post-experimental stratification, which
parallels post-survey stratification in sample surveys. The trial is broken up into subgroups that have
the same combination of known, observable w’s, the ATEs within each of the subgroups
calculated, and then reassembled according to the configuration of w’s in the new context. For
example, if the treatment effects vary with age, the age-specific ATEs can be estimated, and the
age distribution in the new context used to reweight the age-specific ATEs to give a new, overall,
ATE. This can be used to estimate the ATE in a new context, or to correct estimates to the
parent population when the trial sample is not a random sample of the parent. Of course, this
method will only work in special cases; for example, if we only know some of the w’s, there is no
reason to suppose that reweighting for those alone will give a useful correction.
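The mechanics of this reweighting can be sketched in a few lines of Python. The example is hypothetical throughout: the two age strata, the eight trial records, and the target shares are invented, and the helper names stratum_ates and reweighted_ate are not from any library. It assumes that every stratum in the trial contains both treated and control units and that stratum membership is the only thing that matters for the treatment effect.

```python
from collections import defaultdict


def stratum_ates(records):
    """Treated-minus-control difference in mean outcomes within each stratum.

    records: iterable of (stratum, treated, outcome) tuples.
    """
    cells = defaultdict(lambda: {True: [], False: []})
    for stratum, treated, outcome in records:
        cells[stratum][bool(treated)].append(outcome)
    return {s: sum(c[True]) / len(c[True]) - sum(c[False]) / len(c[False])
            for s, c in cells.items()}


def reweighted_ate(ates, target_shares):
    """Average the stratum ATEs using the stratum shares of the target context."""
    missing = [s for s in target_shares if s not in ates]
    if missing:
        # The trial is silent about strata it never sampled.
        raise ValueError(f"no trial evidence for strata: {missing}")
    total = sum(target_shares.values())
    return sum(ates[s] * share for s, share in target_shares.items()) / total


trial = [
    ("young", True, 3.0), ("young", True, 3.0), ("young", False, 2.0), ("young", False, 2.0),
    ("old", True, 7.0), ("old", True, 5.0), ("old", False, 2.0), ("old", False, 2.0),
]
ates = stratum_ates(trial)
trial_ate = reweighted_ate(ates, {"young": 0.5, "old": 0.5})     # trial's own mix
target_ate = reweighted_ate(ates, {"young": 0.25, "old": 0.75})  # older target context
print(ates, trial_ate, target_ate)
```

With these numbers the stratum ATEs are 1.0 and 4.0, so the overall figure moves from 2.5 in the trial to 3.25 in the older target context. The ValueError branch reflects the limitation that a stratum absent from the trial cannot be reweighted into existence.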
Other methods also work when there are too many w’s for stratification, for example by
estimating the probability of each observation in the population being included in the trial
sample as a function of the w’s, then weighting each observation by the inverse of these propensity
scores. A good reference for these methods is Stuart et al (2011), or in economics, Angrist
(2004) and Hotz, Imbens, and Mortimer (2005).
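A minimal sketch of inverse-propensity weighting follows, under strong and clearly artificial assumptions: a single observed w, a logistic model for inclusion in the trial that happens to be correctly specified, w observed for everyone in the population, and an individual effect equal to 2w so that the population ATE is 1. The pure-Python logistic fit, the function names, and every number are illustrative assumptions, not the procedure of any of the papers cited.

```python
import math
import random


def fit_logit(xs, ds, iters=15):
    """Newton's method for a logistic regression of ds (0/1) on a constant and xs."""
    a = b = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, d in zip(xs, ds):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            r, v = d - p, p * (1.0 - p)
            g0 += r
            g1 += r * x
            h00 += v
            h01 += v * x
            h11 += v * x * x
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


def weighted_mean(values, weights):
    return sum(v * k for v, k in zip(values, weights)) / sum(weights)


def run_example(n_pop=40000, seed=1):
    """Return (unweighted trial ATE, inverse-propensity-weighted ATE)."""
    rng = random.Random(seed)
    w = [rng.random() for _ in range(n_pop)]
    # Units with large w are more likely to end up in the trial.
    in_trial = [rng.random() < 1.0 / (1.0 + math.exp(2.0 - 3.0 * wi)) for wi in w]
    a, b = fit_logit(w, [1.0 if d else 0.0 for d in in_trial])

    y1, k1, y0, k0 = [], [], [], []
    for wi, d in zip(w, in_trial):
        if not d:
            continue
        weight = 1.0 + math.exp(-(a + b * wi))  # 1 / estimated inclusion probability
        noise = rng.gauss(0.0, 0.5)
        if rng.random() < 0.5:                  # randomization within the trial
            y1.append(noise + 2.0 * wi)
            k1.append(weight)
        else:
            y0.append(noise)
            k0.append(weight)
    naive = sum(y1) / len(y1) - sum(y0) / len(y0)
    ipw = weighted_mean(y1, k1) - weighted_mean(y0, k0)
    return naive, ipw


if __name__ == "__main__":
    print(run_example())
```

In this setup the unweighted trial estimate overstates the population ATE because high-w units are over-represented, while the weighted estimate should land close to 1. The correction works here only because the single w that drives both inclusion and the effect is observed and correctly modeled.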
There are yet further reasons why these methods do not always work. As with any form
of reweighting, the variables used to construct the weights must be present in both the original
and new context. If treatment effects vary by sex, we cannot predict the outcomes for men
using a trial sample that is entirely female. If we are to carry a result forward in time, we may not
be able to extrapolate from a period of low inflation to a period of high inflation; as Hotz et al
(2005) note, it will typically be necessary to rule out such “macro” effects, whether over time, or
over locations. Reweighting also depends on assuming that the same governing equation (6) covers the
trial and the target population. If they differ not only by what causal factors are present in what
proportions but also in how (if at all) the causes contribute to the effects, re-weighting the effect
sizes that occur in trial sub-populations will not produce good predictions about target
population outcomes.
It should be clear from this that reweighting works only when the observable factors
used for reweighting include all and only genuine interactive causes; we need data on all the
relevant interactive factors. But as Muller (2015) notes, this takes us back to the situation that
RCTs are designed to avoid, where we need to start from a complete and correct specification of
the causal structure. RCTs can avoid this in estimation (which is one of their strengths,
supporting their credibility) but the benefit vanishes as soon as we try to carry their results to a new
context.
Pearl and Bareinboim (2014) use Pearl’s do-calculus to provide a fuller formal analysis
for transportability of causal empirical findings across populations. They define transportability
as “a license to transfer causal effects learned in RCTs to a new population, in which only
observational studies can be conducted,” Pearl and Bareinboim (2015, p. 1). They consider both
qualitative causal relations, which they represent in directed acyclic graphs, and probabilistic facts,
such as the conditional probability of the outcome on a treatment conditional on some third
factor. They then provide theorems about what the relationship between the causal and
probabilistic facts in two populations must be if it is to be possible to infer a particular causal fact,
such as the ATE, about population 2 from causal and probabilistic information about population
1 coupled with purely probabilistic information about population 2. Not surprisingly, for many
things we should like to know about population 2, knowledge of even the full structure on
population 1 will not suffice. Inferences to facts about a new population require not only that the
facts we suppose about population 1 (like an ATE) are well grounded, that the RCT was well
conducted, that the statistical inference is sound, but that we have equally good grounding for
other assumptions we need about the relation between the two populations. For example, using
the result described above for directly transporting the ATE from a trial population to some
other (simple extrapolation), we need good grounds to suppose both that the average of the net
effect of the interactive factors is the same in both populations and also that the same
governing equation describes both populations.
This discussion leads to a number of points. First, we cannot get to general claims by
simple generalization; there is no warrant for the convenient assumption that the ATE estimated
in a specific RCT is an invariant parameter. We need to think through the causal chain that has
generated the RCT result, and the underlying structures that support this causal chain, whether
that causal chain might operate in a new setting and how it would do so with different joint
distributions of the causal variables; we need to know why and whether that why will apply
elsewhere. While it is true that there exist general causal claims (the force of gravity, or that people
respond to incentives) they use relatively abstract concepts and operate at a much higher level
than the claims that can be reasonably inferred from a typical RCT, and cannot, by themselves,
guarantee the outcomes that we are considering here. That transportation is far from automatic
also tells us why (even ideal) RCTs of similar interventions can be expected to give different
answers in different settings. Such differences do not necessarily reflect methodological failings
and will hold across perfectly executed RCTs just as they do across observational studies.
Second, thoughtful pre-experimental stratification in RCTs is likely to be valuable, or
failing that, subgroup analysis, because it can provide information that may be useful for
generalization or transportation. For example, Kremer and Holla (2009) note that, in their trials,
school attendance is surprisingly sensitive to small subsidies, which they suggest is because
there are a large number of students and parents who are on the (financial) margin between
attending and not attending school; if this is indeed the mechanism for their results, a good
variable for stratification would be the fraction of people near the relevant cutoff. We also need to
know that the same mechanism works in any new setting where we consider using small
subsidies to increase school attendance.
Third, we need to be explicit about causal structure, even if that means more model
building and more (or different) assumptions than advocates of RCTs are often comfortable
with. To be clear, modeling causal structure does not necessarily commit us to the elaborate and
often incredible assumptions that characterize some structural modeling in economics, but
there is no escape from thinking about the way things work, the why as well as the what.
Fourth, we will typically need to know more than the results of the RCT itself, for
example about differences in social, economic, and cultural structures and about the joint
distributions of causal variables, knowledge that will often only be available through a range of
empirical strategies including observational studies. We will also need to be able to characterize the
population to which the original RCT and its ATE applied because how the population is
described is commonly taken to be some indication of which other populations the results are
likely to be exportable to and which not. Many medical and psychological journals are explicit about
this. For instance, the rules for submission recommended by the International Committee of
Medical Journal Editors, ICMJE (2015, p. 14) insist that article abstracts “Clearly describe the
selection of observational or experimental participants (healthy individuals or patients, including
controls), including eligibility and exclusion criteria and a description of the source population.”
The problems of characterizing the population here go beyond those we faced in considering
a LATE. An RCT is conducted on a population of specific individuals. The results obtained,
whether we think in terms of an ATE or in terms of establishing causality, are features of that
population, of those very individuals at that very time, not any other population with any
different individuals that might, for example, satisfy one of the infinite set of descriptions that the
trial population satisfies. How is the description of the population that is used in reporting the
results to be chosen? For choose we must: the alternative to describing is naming, identifying
each individual in the study by name, which is cumbersome and unhelpful and often unethical.
This same issue is confronted already in study design. Apart from special cases, like post
hoc evaluation for payment-for-results, we are not especially concerned to learn about the very
population enrolled in the trial. Most experiments are, and should be, conducted with an eye to
what the results can help us learn about other populations. This cannot be done without
substantial assumptions about what might be and what might not be relevant to the
production of the outcome studied. (For example, the ICMJE guidelines go on to say: “Because the
relevance of such variables as age, sex, or ethnicity is not always known at the time of study
design, researchers should aim for inclusion of representative populations into all study types and
at a minimum provide descriptive data for these and other relevant demographic variables,”
p. 14.) So both intelligent study design and responsible reporting of study results involve
substantial background assumptions. Of course this is true for all studies, not just RCTs. But RCTs require
special conditions if they are to be conducted at all and especially if they are to be conducted
successfully (local agreements, compliant subjects, affordable administrators, people
competent to measure and record outcomes reliably, a setting where random allocation is morally and
politically acceptable, etc.), whereas observational data are often more readily and widely
available. In the case of RCTs, there is danger that these kinds of considerations have too much
effect. This is especially worrisome where the features the study population should have are not
justified, made explicit, or subjected to serious critical review. Such careful description of the
study population is uncommon in economics, whether in RCTs or many observational studies.
The need for observational knowledge is one of many reasons why it is counterproductive to
insist that RCTs are the unique gold standard, or that some categories of evidence should be
prioritized over others; these strategies leave us helpless in using RCTs beyond their original
context. The results of RCTs must be integrated with other knowledge, including the practical
wisdom of policymakers, if they are to be useable outside the context in which they were
constructed. Contrary to much practice in medicine as well as in economics, conflicts between
RCTs and observational results need to be explained, for example by reference to the different
populations in each, a process that will sometimes yield important evidence, including on the
range of applicability of the RCT itself. While the validity of the RCT will sometimes provide an
understanding of why the observational study found a different answer, there is no basis (or
excuse) for the common practice of dismissing the observational study simply because it was not
an RCT and therefore must be invalid. It is a basic tenet of scientific advance that new findings
must be able to explain previous results, even results that are now thought to be invalid;
methodological prejudice is not an explanation.
These considerations can be seen in practice in the range of randomized controlled trials in
economics, which we shall explore in the final subsection below.
2.5 Using theory for generalization
Economists have been combining theory and randomized controlled trials since the early
experiments. Orcutt and Orcutt (1968) laid out the inspiration for the income tax trials using a
simple, static theory of labor supply. According to this, people choose how to divide their time
between work and leisure in an environment in which they receive a minimum G if they do not work,
and where they receive an additional amount $(1 - t)w$ for each hour they work, where w is the
wage rate, and t is a tax rate. The trials assigned different combinations of G and t to
different trial groups, so that the results traced out the labor supply function, allowing
estimation of the parameters of preferences, which could then be used in a wide range of policy
calculations, for example to raise revenue at minimum utility loss to workers.
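The budget constraint described above can be written out as a minimal sketch. The function name, the wage of 10, and the two treatment arms are hypothetical values chosen for illustration; they are not taken from the actual income tax trials.

```python
def disposable_income(hours: float, wage: float, guarantee: float, tax_rate: float) -> float:
    """Income under the static scheme: G if no work, plus (1 - t) * w per hour worked."""
    return guarantee + (1.0 - tax_rate) * wage * hours


# Hypothetical treatment arms: each assigns a different (G, t) combination.
arms = {"low G, low t": (100.0, 0.3), "high G, high t": (200.0, 0.7)}

for label, (guarantee, tax_rate) in arms.items():
    # Trace the budget line at 0, 20 and 40 hours for an assumed wage of 10.
    incomes = [disposable_income(h, 10.0, guarantee, tax_rate) for h in (0, 20, 40)]
    print(label, incomes)
```

Varying G shifts the budget line up or down, while varying t changes its slope; observing hours worked across arms is what lets the trials trace out the labor supply function.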
Following these early trials, there has been a long and continuing tradition of using trial
results, together with the baseline data collected for the trial, to fit structural models that
are to be used more generally. Early examples include Moffitt (1979) on labor supply and Wise
(1985) on housing; a more recent example is Heckman, Pinto and Savelyev (2013) for the Perry
preschool program. Development economics examples include Attanasio, Meghir and Santiago (2012),
Attanasio et al (2015), Todd and Wolpin (2006) and Duflo, Hanna and Ryan (2012). These structural
models sometimes require formidable auxiliary assumptions on functional forms or the
distributions of unobservables, which makes many economists reluctant to embrace them, but they
have compensating advantages, including the ability to integrate theory and evidence, to make
out-of-sample predictions, and to analyze welfare (which always requires some understanding of
why things happen), and the use of RCT evidence allows the relaxation of at least some of the
assumptions that are needed for identification. In this way, the structural models borrow
credibility from the RCTs and in return help set the RCT results within a coherent framework.
Without some such interpretation, the welfare implications of RCT results can be problematic;
knowing how people in general (let alone just people in the trial population, which is what, as
we keep repeating, the trial results tell us about) respond to some policy is rarely enough to
tell whether or not they are made better off. What works is not equivalent to what should be.
In many papers, Heckman has developed ways to model how the beliefs and interests of
participants affect their participation in, behavior during, and their outcomes in trials, for
example using a Roy model of choice; see e.g. Heckman and Smith (1995), and more recently
Chassang, Padró i Miguel, and Snowberg (2012) and Chassang et al (2015). The modeling of beliefs
and behavior allows predictions about the results of trials that differ from the base trial, or
where the risk and reward structures are different. Beyond that, and in line with a running theme
of this Section, thinking about how to handle new situations can be incorporated into the design
of the original trial so as to provide the information needed for transportation.
Light-touch theory can do much to extend and to use RCT results. In both the RAND Health
Experiment and negative income tax experiments, an immediate issue concerned the difference
between short-run and long-run responses; indeed, differences between immediate and ultimate
effects occur in a wide range of RCTs. Both health and tax RCTs aimed to discover what would
happen if consumers/workers were permanently faced with higher or lower prices/wages, but the
trials could only run for a limited period. A temporarily high tax rate on earnings was
effectively a “fire sale” on leisure, so that the experiment provided an opportunity to take a
vacation and make up the earnings later, an incentive that would be absent in a permanent scheme.
How do we get from the short-run responses that come from the trial to the long-run responses
that we want to know? Metcalf (1973) and Ashenfelter (1978) provided answers for the income tax
experiments, as did Arrow (1975) for the RAND Health Experiment.
Arrow’s analysis illustrates how to use both structure and observational data to transport and
adapt results from one setting to another. He models the health experiment as a two-period model,
in which the price of medical care is lowered in the first period only, and shows how to derive
what we want, which is the response in the first period if prices were lowered by the same
proportion in both periods. The magnitude that we want is $S$, the compensated price derivative
of medical care in period 1 in the face of identical increases in $p_1$ and $p_2$ in both periods
1 and 2, and this is equal to $s_{11} + s_{12}$, the sum of the derivatives of period 1’s demand
with respect to the two prices. The trial gives only $s_{11}$. But if we have post-trial data on
medical services for both treatments and controls, we can infer $s_{21}$, the effect of the
experimental price manipulation on post-experimental care. Choice theory, in the form of Slutsky
symmetry, allows Arrow to use this to infer $s_{12}$ and thus $S$. He contrasts this with
Metcalf’s alternative solution, which makes different assumptions: that two-period preferences
are intertemporally additive, in which case the long-run elasticity can be obtained from
knowledge of the income elasticity of post-experimental medical care, which would have to come
from an observational analysis. These two alternative approaches show how we can choose, based on
our willingness to make assumptions and on the data we have, a suitable combination of
(elementary and transparent) theoretical assumptions and observational data in order to adapt and
use the trial results. Such analysis can also help design the original trial by clarifying what
we need to know in order to be able to use the results of a temporary treatment to estimate the
permanent effects that we need. Ashenfelter provides a third solution, noting that the two-period
model is formally identical to a two-person model, so that we can use information on two-person
labor supply to tell us about the dynamics.
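The chain of reasoning in Arrow’s solution can be laid out in three lines. This is only a restatement of the argument in the text, using the same symbols; the annotations on the right are ours.

```latex
\begin{align*}
S &= s_{11} + s_{12} && \text{(the long-run response we want)} \\
s_{12} &= s_{21} && \text{(Slutsky symmetry, an assumption from choice theory)} \\
\Rightarrow\quad S &= s_{11} + s_{21} && \text{(trial estimate plus post-trial estimate)}
\end{align*}
```

The first term on the last line comes from the trial itself and the second from post-trial data on treatments and controls, so the step that cannot be checked within the trial is the symmetry assumption in the middle line.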
Theory can often allow us to reclassify new or unknown situations as analogous to situations
where we already have background knowledge. One frequently useful way of doing this is to recast
the new policy as equivalent to a change in the budget constraint that respondents face. The
consequences of a new policy may be easier to predict if we can reduce it to equivalent changes
in income and prices, whose effects are often well understood and well studied. Todd and Wolpin
(2008) make this point and provide examples. In the labor supply case, an increase in the tax
rate t has the same effect as a decrease in the wage rate w, so that we can rely on previous
literature to predict what will happen when tax rates are changed. In the case of Mexico’s
PROGRESA conditional cash transfer program, Todd and Wolpin note that the subsidies paid to
parents if their children go to school can be thought of as a combination of a reduction in
children’s wage rates and an increase in parents’ income, which allows them to predict the
results of the conditional cash experiment with limited additional assumptions. If this works, as
it partially does in their analysis, the trial helps consolidate previous knowledge and
contributes to an evolving body of theory and empirical, including trial, evidence.
The program of thinking about policy changes as equivalent to price and income changes has a
long history in economics; much of rational choice theory can be so interpreted, see Deaton and
Muellbauer (1980) for many examples. When this conversion is credible, and when a trial on some
apparently unrelated topic can be modeled as equivalent to a change in prices and incomes, and
when we can assume that people in different settings respond in relevantly similar ways to
changes in prices and incomes, we have a ready-made framework for incorporating the trial results
into previous knowledge, as well as for extending the trial results and using them elsewhere. Of
course, all depends on the validity and credibility of the theory; people may not in fact think
of a tax increase as a decrease in the price of leisure, and behavioral economics is full of
examples where apparently equivalent stimuli generate non-equivalent outcomes. The embrace of
behavioral economics by many of the current generation of trialists may account for their limited
willingness to use conventional choice theory in this way; unfortunately, behavioral economics
does not yet offer a replacement for the general framework of choice theory that is so useful in
this regard.
Theory can also help with the problem we raised of delineating the population to which the trial
results immediately apply and with thinking about moving from this population to the population
of interest. Ashenfelter’s (1978) analysis is again a good illustration and predates much similar
work in later literature. The income tax experiments offered participation in the trial to a
random sample of the population of interest. Because there was no blinding and no compulsion,
people who were randomized into the treatment group were free to choose to refuse treatment. As
in many subsequent analyses, Ashenfelter supposes that people choose to participate if it is in
their interest to do so, depending on what has become known in the RCT and Instrumental Variables
literature as their own idiosyncratic “gain.” The simple labor supply model gives an approximate
condition: if the treatment increases the tax rate from $t_0$ to $t_1$ with an offsetting
increase in G, then an individual assigned to the experimental group will decline to participate
if
$$(t_1 - t_0)\,w_0 h_0 + \tfrac{1}{2}\,s_{00}\,(t_1 - t_0) > G_1 - G_0 \qquad (7)$$
where subscript 1 refers to the treatment situation, 0 to the control, $h_0$ is hours worked, and
$s_{00}$ is the (negative) utility-constant response of hours worked to the tax rate. If there is
no substitution, the second term on the left-hand side is zero, and people will accept treatment
if the increase in G more than makes up for the increase in taxes payable, the “breakeven”
condition. In consequence, those with higher earnings are less likely to accept treatment. Some
better-off people with high substitution effects will also accept treatment if the opportunity to
buy more cheap leisure is sufficient enticement.
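The participation condition can be illustrated with a small sketch. It codes the inequality as printed in (7); the function name and all numerical values are hypothetical, and the substitution term is assumed to be expressed in the same monetary units as the other terms.

```python
def declines_treatment(t0: float, t1: float, w0: float, h0: float,
                       s00: float, g0: float, g1: float) -> bool:
    """True if the condition as printed in (7) holds, i.e. the person declines."""
    extra_tax = (t1 - t0) * w0 * h0        # additional taxes at pre-trial earnings
    substitution = 0.5 * s00 * (t1 - t0)   # zero or negative, since s00 <= 0
    return extra_tax + substitution > g1 - g0


# Hypothetical scheme: tax rate rises from 0 to 0.5, guarantee rises by 150.
scheme = dict(t0=0.0, t1=0.5, g0=0.0, g1=150.0)

high_earner = declines_treatment(w0=10.0, h0=40.0, s00=0.0, **scheme)          # 200 > 150
low_earner = declines_treatment(w0=10.0, h0=20.0, s00=0.0, **scheme)           # 100 > 150
high_substitution = declines_treatment(w0=10.0, h0=40.0, s00=-240.0, **scheme) # 140 > 150
print(high_earner, low_earner, high_substitution)
```

With no substitution the high earner is above breakeven and declines while the low earner accepts; giving the same high earner a strong substitution response pulls the left-hand side below the increase in G, so that person accepts too.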
The selective acceptance of treatment limits the analyst’s ability to learn about the better-off
or low-substitution people who decline treatment but who would have to accept it if the policy
were actually implemented. Both the ITT estimator and the “as treated” estimator that compares
the treated and the untreated are affected, not just by the labor supply effects that the trial
is designed to induce, but by the kind of selection effects that randomization is designed to
eliminate. Of course, the analysis that leads to (7) can perhaps help us say something about this
and help us adjust the trial estimates back to what we would like to know. Yet this is no easy
matter because selection depends, not only on observables, such as pre-experimental earnings and
hours worked, but on (much harder to observe) labor supply responses that likely vary across
individuals. Paraphrasing Ashenfelter, we cannot estimate the effects of a permanent compulsory
negative income tax program from a transitory voluntary trial without strong assumptions or
additional evidence.
Much of the modern literature, for example on training programs, wrestles with the issue of
exactly who is represented by the RCT results, see again Heckman, Lalonde and Smith (1999). When
people are allowed to reject their randomly assigned treatment according to their own (real or
perceived) individual advantage, we have come a long way from the random allocation in the
standard conception of a randomized controlled trial. Moreover, the absence of blinding is common
in social and economic RCTs, and while there are trials, such as welfare trials, that effectively
compel people to accept their assignments, and some where the treatment is generous enough to do
so, there are trials where subjects have much freedom and, in those cases, it is less than
obvious to us what role, if any, randomization plays in warranting the results.
2.6 Scaling up: using the average for populations
A typical RCT, especially in the development context, is small-scale and local, for example in a
few schools, clinics, or farms in a particular geographic, cultural, socio-economic setting. If
successful according to a cost-effectiveness criterion, for example, it is a candidate for
scaling-up, applying the same intervention for a much larger area, often a whole country, or
sometimes even beyond, as when some treatment is considered for all relevant World Bank projects.
The fact that the intervention might work differently at scale has long been noted in the
economics literature, e.g. Garfinkel and Manski (1992), Heckman (1992), and Moffitt (1992), and
is recognized in the recent review by Banerjee and Duflo (2009). We want here to emphasize the
pervasiveness of such effects (failure of the trial results to replicate at a larger scale is
likely to be the rule rather than the exception) as well as to note once again that, as in
failures of transportability, this should not be taken as an argument against using RCTs, but
only against the idea that effects at scale are likely to be the same as in the trial. Using RCT
results is not the same as assuming the same results hold in all circumstances.
An example of what are often called general equilibrium effects comes from agriculture. Suppose
an RCT demonstrates that in the study population a new way of using fertilizer or insecticide had
a substantial positive effect on, say, cocoa yields, so that farmers who used the new methods saw
increases in production and in incomes compared to those in the control group. If the procedure
is scaled up to the whole country, or to all cocoa farmers worldwide, the price will drop, and if
the demand for cocoa is price inelastic (as is usually thought to be the case, at least in the
short run) cocoa farmers’ incomes will fall. Indeed, the conventional wisdom for many crops is
that farmers do best when the harvest is small, not large. Of course, these considerations might
not be decisive in deciding whether or not to promote the innovation, and there may still be
long-term gains if, for example, some farmers find something better to do than growing cocoa. But
the basic point is that the scaled-up effect in this case is opposite in sign to the trial
effect. The problem here is not with the trial results, which can be usefully incorporated into a
more comprehensive market model that includes the responses estimated by the trial. The problem
arises only if we assume that the aggregate looks like the individual. That other ingredients of
the aggregate model must come from observational studies should not be a criticism, even for
those who favor RCTs; it is simply the price of doing serious analysis.
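The sign reversal in the cocoa example can be checked with a toy calculation. This is a minimal sketch under assumptions that are ours, not the paper’s: demand has constant price elasticity, price is normalized to 1 at the baseline harvest, and the innovation raises yields by 20 percent.

```python
def market_revenue(quantity: float, demand_elasticity: float) -> float:
    """Total farm revenue under constant-elasticity demand.

    Price is normalized to 1 at quantity 1, so p = q ** (-1 / elasticity),
    where elasticity is the absolute value of the price elasticity of demand.
    """
    price = quantity ** (-1.0 / demand_elasticity)
    return price * quantity


yield_gain = 1.2  # hypothetical 20 percent increase in output

# In the trial, treated farmers are too few to move the price, which stays at 1.
trial_income = 1.0 * yield_gain

# At scale, every farmer produces more and the price falls along the demand curve.
scaled_income_inelastic = market_revenue(yield_gain, demand_elasticity=0.5)
scaled_income_elastic = market_revenue(yield_gain, demand_elasticity=2.0)

print(trial_income, scaled_income_inelastic, scaled_income_elastic)
```

With an elasticity below one, scaled-up revenue falls below its baseline of 1 even though the trial shows an income gain; with an elasticity above one, it rises. The trial estimate of the yield response is an input to this calculation, while the elasticity has to come from elsewhere.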
There are many possible interventions that alter supply or demand whose effect, in aggregate,
will change a price or a wage that is held constant in the original RCT. Education will change
the supplies of skilled versus unskilled labor, with implications for relative wage rates.
Conditional cash transfers increase the demand for (and perhaps supply of) schools and clinics,
which will change prices or waiting lines, or both. There are interactions between people that
will operate only at scale. Giving one child a voucher to go to private school might improve her
future, but doing so for everyone can decrease the quality of education for those children who
are left in the public schools, see the contrasting studies of Angrist et al (1999) and Hsieh and
Urquiola (2002). Educational or training programs may benefit those who are treated, but harm
those left behind; if the control group is selected from the latter, the RCT may generate a
positive result in spite of hurting some and helping none; Crépon et al (2014) recognize the
issue and show how to adapt an RCT to deal with it.
Scaling up can also disturb the political equilibrium. An exploitative government may not allow
the mass transfer of money from abroad to a powerless segment of the population, though it may
permit a small-scale RCT of cash transfers. Provision of healthcare by foreign NGOs may be
successful in trials, but have unintended negative consequences at scale because of general
equilibrium effects on the supply of healthcare personnel, or because it disturbs the nature of
the contract between the people and a government that is using tax revenue to provide services.
In India, the government spends large sums on food subsidies through a system (the PDS) that is
both corrupt and inefficient, with much of the grain that is procured failing to find its way to
the intended beneficiaries. Localized RCTs on whether or not families are better off with cash
transfers are not informative about how politicians would change the amount of the transfer if
faced with unanticipated inflation, or, at least as important, about whether the government could
cut procurement from relatively wealthy and politically powerful farmers. Without a political and
general equilibrium analysis, it is impossible to think about the effects of replacing food
subsidies with cash transfers, see e.g. Basu (2010).
Even in medicine, where biological interactions between people are less common than are social
interactions in social science, interactions can be important; infectious diseases are an
example, and immunization programs affect the dynamics of disease transmission through herd
immunity, so that the effects on an individual depend on how many others are vaccinated, Fine and
Clarkson (1986), Manski (2013, p52). The usual, if seldom correct, conception of an RCT in
medicine is of a biological process (for example, the administration of aspirin after a heart
attack) where the effect is thought to be similar across individuals, and where there are no
interactions. Yet even here, the social and economic setting affects how drugs are actually used
and the same issues can arise; the distinction between efficacy and effectiveness in clinical
trials is in part recognition of the fact.
2.7 Drilling down: using the average for individuals
Just as there are issues with scaling-up, it is not obvious how to use the results from RCTs at
the level of individual units, even individual units that were actually (or potentially) included
in the trial. A well-conducted RCT delivers an average treatment effect for a well-defined
population but, in general, that average does not apply to everyone. It is not true, for example,
as argued in JAMA’s “Users’ guide to the medical literature” that “if the patient would have been
enrolled in the study had she been there, that is she meets all of the inclusion criteria and
doesn’t violate any of the exclusion criteria, there is little question that the results are
applicable,” Guyatt et al (1994). Even more misleading are the often-heard statements that an RCT
with an average treatment effect insignificantly different from zero has shown that the treatment
works for no one, though such a conclusion would be better supported by a Fisher randomization
test.
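A toy calculation shows why a zero average does not imply that the treatment works for no one. The individual effects below are hypothetical, and in a real trial they are never observed directly, since each unit is seen under only one arm; the sketch simply posits them to make the arithmetic visible.

```python
from statistics import mean

# Hypothetical individual treatment effects: half the population gains 2 units,
# the other half loses 2 units.
individual_effects = [2.0] * 50 + [-2.0] * 50

average_treatment_effect = mean(individual_effects)
share_helped = sum(effect > 0 for effect in individual_effects) / len(individual_effects)

print(average_treatment_effect, share_helped)
```

The average treatment effect is exactly zero while half the population benefits, so a trial that recovered this average perfectly would still be silent about who is helped and who is harmed.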
These issues are familiar to physicians practicing evidence-based medicine whose guidelines
require “integrating individual clinical expertise with the best available external clinical
evidence from systematic research,” Sackett et al (1996). Exactly what this means is unclear;
physicians know much more about their patients than is allowed for in the ATE from the RCT
(though, once again, stratification in the trial is likely to be helpful) and they often have
intuitive expertise from long practice that they rely on to help them identify features in a
particular patient that are likely to affect the effectiveness of a given treatment for that
patient. But there is an odd balance being struck here. These judgments are deemed admissible in
dealing with the individual patient, at least for discussion with the patient as possible
considerations, but they don’t add up to evidence to be made publicly available, with the usual
cautions about credibility, by the standards adopted by most EBM sites. It is also true that
physicians can have prejudices and “knowledge” that might be anything but. Clearly, there are
situations where forcing practitioners to follow the average will do better, even for individual
patients, and others where the opposite is true, see Kahneman and Klein (2009).
Whether or not averages are useful to individuals is an issue in social science research too.
Imagine two schools, St Joseph’s and St Mary’s, both of which were included in an RCT of a
classroom innovation, or at least were eligible to be so. The innovation is successful on
average, but should the schools adopt it? Should St Mary’s be influenced by a previous attempt in
St Joseph’s that was judged a failure? Many would dismiss this experience as anecdotal and ask
how St Joseph’s could have known that it was a failure without benefit of “rigorous” evidence.
Yet if St Mary’s is like St Joseph’s, with a similar mix of pupils, a similar curriculum, and
similar academic standing, might not St Joseph’s experience be more relevant to what might happen
at St Mary’s than is the positive average from the RCT? And might it not be a good idea for the
teachers and governors of St Mary’s to go to St Joseph’s and find out what happened and why? They
may be able to observe the mechanism of the failure, if such it was, and figure out whether the
same problems would apply for them, or whether they might be able to adapt the innovation to make
it work for them, perhaps even more successfully than the positive average in the trial.
Once again, these questions are unlikely to have simple answers in practice; but, as with
transportability, there is no serious alternative to trying. Assuming that the average works for
you will often be wrong, and it will at least sometimes be possible to do better. As in the
medical case, the advice to individual schools often lacks specificity. For example, the US
Institute of Education Sciences has provided a “user-friendly” guide to practices supported by
rigorous evidence, US Department of Education (2003). The advice, which is very similar to
recommendations in development economics, is that the intervention be demonstrated effective
through well-designed RCTs in more than one site of implementation, and that “the trials should
demonstrate the intervention’s effectiveness in school settings similar to yours” (2003, p.17).
No operational definition of “similar” is provided.
We note finally that these caveats, which apply to individuals (or schools) even if they were in
the trial, provide another reason why the concept of “external” validity is unhelpful. The real
issue is how to use the findings of a trial in new settings, including settings included in the
trial; external validity in the sense of invariance of the ATE emphasizes simple replication,
which guarantees nothing, while ignoring the possibility that lack of replication can be a key to
understanding.
2.8 Examples and illustrations from economics
Our arguments in this Section should not be controversial, yet we believe that they represent an
approach that is different from most current practice. To document this and to fill out the
arguments, we provide some examples. While these are occasionally critical, our purpose is
constructive; indeed, we believe that misunderstandings about how to use RCTs have artificially
limited their usefulness, as well as alienated some who would otherwise use them.
Conditional cash transfers (CCTs) are interventions that have been tested using RCTs (and other
RCT-like methods) and are often cited as a leading example of how an evaluation with strong
internal validity leads to a rapid spread of the policy, e.g. Angrist and Pischke (2010) among
many others. Think through the causal chain that is required for CCTs to be successful: people
must like money, they must like (or at least not object too much to) their children being
educated and vaccinated, there must exist schools and clinics that are close enough and well
enough staffed to do their job, and the government or agency that is running the scheme must care
about the wellbeing of families and their children. That such conditions hold in a wide range of
(although certainly not all) countries makes it unsurprising that CCTs “work” in many
replications, though they certainly will not work in places where the schools and clinics do not
exist, Levy (2001), nor in places where people strongly oppose education or vaccination.
Similarly, given that the helping factors will operate with different strengths and
effectiveness in different places, it is also not surprising that the size of the ATE differs
from place to place; for example, Vivalt’s AidGrade website lists 29 estimates from a range of
countries of the standardized (divided by local standard deviation of the outcome) effects of
conditional cash transfers on school attendance; all but four show the expected positive effect,
and the range runs from -8 to +38 percentage points. Even in this leading case, where we might
reasonably conclude that CCTs “work” in getting children into school, it would be hard to
calculate credible cost-effectiveness numbers, or to come to a general conclusion about whether
CCTs are more or less cost effective than other possible policies. Both costs and effect sizes
can be expected to differ in new settings, just as they have in observed ones, making these
predictions difficult.
The range of estimates illustrates that the simple view of external validity (that the ATE
should transport from one place to another) is not well defined. AidGrade uses standardized
measures of effect size divided by standard deviation of outcome at baseline, as does the major
multi-country study by Banerjee et al (2015). But we might prefer measures that have an economic
interpretation, such as additional months of schooling per $100 spent (for example if a donor is
trying to decide where to spend, see below). Nutrition might be measured by height, or by the log
of height. Even if the ATE by one measure carries across, it will only do so using another
measure if the relationship between the two measures is the same in both situations. This is
exactly the sort of thing that a formal analysis of transportability forces us to think about.
(Note also that the ATE in the original RCT can differ depending on whether the outcome is
measured in levels or in logs; the two ATEs could even have different signs.)
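The possibility of opposite signs in levels and logs is easy to verify numerically. The outcome values below are invented for illustration, with two units per arm, and are not drawn from any study.

```python
from math import log
from statistics import mean

# Hypothetical outcomes: treatment raises a very low outcome and lowers a high one.
control_outcomes = [1.0, 100.0]
treated_outcomes = [4.0, 90.0]

ate_levels = mean(treated_outcomes) - mean(control_outcomes)
ate_logs = (mean([log(y) for y in treated_outcomes])
            - mean([log(y) for y in control_outcomes]))

print(ate_levels, ate_logs)
```

In levels the difference in means is negative, because the fall from 100 to 90 outweighs the rise from 1 to 4; in logs it is positive, because the proportional gain at the bottom dominates. Which scale is appropriate is a substantive choice that the randomization itself cannot settle.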
Deworming is surely more complicated than conditional cash transfers though not because anyone
disputes the desirability of removing parasitical worms or the biological efficacy of the
medicines, at least if they are repeatedly and effectively administered; that is the part of the
causal process that is transportable from one place to another. Yet nutritional or school
attendance outcomes depend on reinfection from one person to another, which depends on local
customs about defecation (which vary from place to place and are subject to religious and
cultural factors), particularly on the extent of open defecation and the density of population,
on whether or not children wear shoes, and on the availability and use of public and private
sanitation; this last was crucial in the elimination of hookworm in the southern states of the
U.S. according to Stiles (1939). Temperature may also be important; indeed, such “macro”
variables are likely to be important in a wide range of medical, employment, and production
trials, Rosenzweig and Udry (2016). There are two prominent positive studies in the economics
literature, one in Kenya, Miguel and Kremer (2004) and one in India, Bobonis, Miguel and
Puri-Sharma (2006); these are often cited as examples of the power of RCTs to come up with the
“right” answer, for example by Karlan and Appel (2008). Yet the Cochrane Collaboration review of
deworming and schooling, Taylor-Robinson et al (2015), which reviews one trial (from India)
covering more than a million participants, and 44 others covering 67,672 participants, including
Miguel and Kremer (2004), concludes that there is “substantial evidence” that deworming shows no
benefit in nutritional status, hemoglobin, cognition, school performance or death. The validity
of this meta-analysis is disputed by Croke et al (2016). A replication, Aiken et al (2015), and a
reanalysis (using different methods) of Miguel and Kremer’s original data by Davey et al (2015)
concluded that the study “provided some evidence, but with high risk of bias,” provoking a
lengthy exchange, Hicks et al (2015) and Hargreaves et al (2015). Most of the differences in
results come from different methodological choices, themselves largely based on disciplinary
traditions, rather than from the effects of mistakes or errors. In an impressive and clear
reanalysis, Humphreys (2015) argues that one puzzling feature of Miguel and Kremer’s results is
the absence of any clear effect of deworming on health, as was the case in the large Indian RCT.
Yet the effects of deworming on education, which are the main target of the paper, presumably
work through health, so that the absence of health effects (a failure of expected mediators) is a
puzzle, see also Miguel, Kremer and Hicks (2015), and Ahuja et al (2015). Recall too our earlier
discussion of the difficulty of interpreting the standard errors of the original study in the
absence of randomization.
It is not our purpose here to try to adjudicate these competing claims but rather to relate this
work to our general argument. First, it is not clear that there is a right answer to be
discovered; given the causal chains involved, deworming might be helpful in one place but
unhelpful in another. Yet the focus of the debate is almost entirely on internal validity, on
whether the original studies were correctly done. The Cochrane review, in line with this, and in
line with much meta-analysis of trials, seems to suppose that there is a single effect to be
uncovered that, once established, will be invariant to local and environmental differences.
External validity, it seems, is implied by internal validity. Indeed, Chalmers, one of the
founders of the Cochrane Collaboration, has explicitly argued (in response to one of us) that, in
the absence of strong reasons to the contrary, results should be taken as applicable everywhere,
Pettigrew and Chalmers (2011).
Second, the debate makes it clear that the practice of RCTs in economic development has done
little to fulfill the original promise that their simplicity (how hard is it to subtract one mean
from another?) would dispose of the methodological and econometric disputes that characterize so
many observational studies and were thought to be one of their main flaws. While RCTs tend to
take some contentious issues of identification off the table, they leave much to be disputed,
including the handling of factors that interact with treatment effects, the appropriate level of
randomization, the calculation of standard errors, the choice of outcome measure, the inclusion
criteria for the sample, placebo and Hawthorne effects, and much more. The claim that RCTs cut
through the usual econometric disputes to deliver to policymakers a simple, convincing, and
easily understood answer is simply false. The deworming debates are perhaps the leading
illustration.
Much of the development literature, like the medical literature, works with the view of external
validity that, unless there is evidence to the contrary, the direction and size of treatment
effects can be transported from one place to another. The J-PAL website reports its findings
under a general head of policy relevance, subdivided by a selection of topics. Under each topic,
there is a list of relevant RCTs from a range of different settings around the world. These are
conveniently converted into a common cost-effectiveness measure so that, for example, under
‘education’, subhead ‘student participation’, there are four studies from Africa: one on
informing parents about the returns to education in Madagascar, and three from Kenya, on
deworming, on school uniforms, and on merit scholarships. The units of measurement are additional
years of student education per $100, and among these four studies, the average effect sizes of
spending $100 are 20.7 years, 13.9 years, 0.71 years and 0.27 years respectively. (Note that this
is a different, and superior, standardization from the effect size standardization discussed
above.)
What can we conclude from such comparisons? For a philanthropic donor interested in education,
and if marginal and average effects are the same, they might indicate that the best place to
devote a marginal dollar is in Madagascar, where it would be used to inform parents about the
value of education. This is certainly useful, but it is not as useful as statements that
information or deworming programs are everywhere more cost-effective than programs involving
school uniforms or scholarships, or if not everywhere, at least over some domain, and it is this
second kind of comparison that would genuinely fulfill the promise of “finding out what works.”
But such comparisons only make sense if we can transport the results from one place to another,
if the Kenyan results also hold in Madagascar, Mali, or Namibia, or some other list of African or
non-African places. J-PAL’s manual for cost-effectiveness, Dhaliwal et al (2012), explains in
(entirely appropriate) detail how to handle variation in costs across sites, noting variable
factors such as population density, prices, exchange rates, discount rates, inflation, and bulk
discounts. But it gives short shrift to cross-site variation in the size of average treatment
effects, which play an equal part in the calculations of cost effectiveness. The manual briefly
notes that diminishing returns (or the “last-mile” problem) might be important in theory, but
argues that the baseline levels of outcomes are likely to be similar in the pilot and replication
areas, so that the average treatment effect can be safely transported as is. All of this lacks a
justification for transportability, some understanding of when results transport, when they do
not, or better still, how they should be modified to make them transportable.
One of the largest and most technically impressive of the development RCTs is by Banerjee et al
(2015), which tests a “graduation” program designed to permanently lift extremely poor people
from poverty by providing them with a gift of a productive asset (from guinea-pigs, (regular-)
pigs, sheep, goats, or chickens depending on locale), training and support, life skills coaching,
as well as support for consumption, saving, and health services; the idea is that this package of
aid can help people break out of poverty traps in a way that would not be possible with one
intervention at a time. Comparable versions of the program were tested in Ethiopia, Ghana,
Honduras, India, Pakistan, and Peru and, excepting Honduras (where the chickens died), the trials
find largely positive and persistent effects, with similar (standardized) effect sizes, for a
range of outcomes (economic, mental and physical health, and female empowerment). One site apart,
essentially everyone accepted their assignment, so that many of the familiar caveats do not
apply. Replication of positive ATEs over such a wide range of places certainly provides proof of
concept for such a scheme. Yet Bauchet, Morduch, and Ravi (2015) fail to replicate the result in
South India, where the control group got access to much the same benefits, what Heckman, Hohman,
and Smith (2000) call ‘substitution bias’. Even so, the results are important because, although
there is a longstanding interest in poverty traps, many economists have long been skeptical of
their existence or of the idea that they could be sprung by such aid-based policies. In this
sense, the study is an important contribution to the theory of economic development; it tests a
theoretical proposition and will (or should) change minds about it.
A number of difficulties remain. As the authors note, such trials cannot tell us which
component of the treatment accounted for the results, or which might be dispensable (a much
more expensive multifactorial trial would be required), though it seems likely in practice that
the costliest component, the repeated visits for training and support, will be the first to
be cut by cash-strapped politicians or administrators. And as noted, it is unclear what should
count as (simple) replication in international comparisons; it is hard to think of the uses of
standardized effect sizes, except to document that effects exist everywhere and that they are
similarly large relative to local variation in such things.
The effect size (the average treatment effect expressed in numbers of standard deviations
of the original outcome), though conveniently dimensionless, has little to recommend it.
As with much of RCT practice, it strips out any economic content (no rates of return, or benefits
minus costs) and it removes any discipline on what is being compared. Apples and oranges
become immediately comparable, as do treatments whose inclusion in a meta-analysis is limited
only by the imagination of the analysts in claiming similarity. In psychology, where the concept
originated, there are endless disputes about what should and should not be pooled in a
meta-analysis. Beyond that, as argued by Simpson (2016), restrictions on the trial sample, often
good practice to reduce background noise and to help detect an effect, will reduce the baseline
standard deviation and inflate the effect size. More generally, effect sizes are open to
manipulation by exclusion rules. It makes no sense to claim replicability on the basis of effect
sizes, let alone to use them to rank projects.
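
The inflation of effect sizes by sample restriction can be illustrated with a short simulation.
This is a sketch under invented assumptions, not a reconstruction of Simpson's (2016) analysis:
the data-generating process, the exclusion rule (keep units whose baseline characteristic lies
within half a standard deviation of the mean), the treatment effect `tau`, and all function names
are hypothetical. The treatment effect in original units is the same in both samples; only the
denominator changes.

```python
import random
import statistics

def effect_size(treated, control):
    """ATE divided by the standard deviation of the control-group outcome."""
    ate = statistics.fmean(treated) - statistics.fmean(control)
    return ate / statistics.stdev(control)

def simulate(restrict, n=40_000, tau=0.2, seed=0):
    """Simulate one randomized trial and return its standardized effect size."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)          # baseline characteristic
        if restrict and abs(x) > 0.5:    # exclusion rule: keep a homogeneous sample
            continue
        y0 = x + rng.gauss(0.0, 0.5)     # outcome without treatment
        if rng.random() < 0.5:           # coin-flip assignment
            treated.append(y0 + tau)     # constant effect of tau in original units
        else:
            control.append(y0)
    return effect_size(treated, control)

full = simulate(restrict=False)
narrow = simulate(restrict=True)
print(f"effect size, full sample:       {full:.2f}")
print(f"effect size, restricted sample: {narrow:.2f}")
```

Under these assumptions the restricted sample has a much smaller outcome standard deviation, so
the same effect of 0.2 units appears as a standardized effect roughly twice as large.
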
The graduation study can be taken as the closest to fulfilling the “finding out what
works” aim of the RCT movement in development. Yet it is silent on perhaps the crucial aspect
for policy, which is that the trial was run entirely in partnership with NGOs, whereas what we
would like to know is whether it could be replicated by governments, including those
governments that are incapable of getting doctors, nurses, and teachers to show up to clinics or
schools, Chaudhury et al (2005), Banerjee, Deaton and Duflo (2004), or of regulating the quality
of medical care in either the public or private sectors, Filmer, Hammer and Pritchett (2000) or
Das and Hammer (2005). In fact, we already know a great deal about “what works.” Vaccinations
work, maternal and child healthcare services work, and classroom teaching works. Yet
knowing this does not get those things done. Adding another program that works under ideal
conditions is useful only where such conditions exist, and it would likely be unnecessary where
they do. Finding out what works is not the magic key to economic development. Technical
knowledge, though always worth having, requires suitable institutions if it is to do any good.
A similar point is documented in the contrast between a successful trial that used cameras
and threats of wage reductions to incentivize attendance of teachers in schools run by an
NGO in Rajasthan in India, Duflo, Hanna, and Ryan (2012), and the subsequent failure of a
follow-up program in the same state to tackle mass absenteeism of health workers, Banerjee,
Duflo, and Glennerster (2008). In the schools, the cameras and timekeeping worked as intended,
and teacher attendance increased. In the clinics, there was a short-run effect on nurse
attendance, but it was quickly eliminated. (The ability of agents eventually to undermine policies
that are initially effective is common enough and not easily handled within an RCT.) In both trials,
there were incentives to improve attendance, and there were incentives to find a way to
sabotage the monitoring and restore workers to their accustomed positions; the force of these
incentives is a “high-level” cause, like gravity, or the principle of the lever, that works in much
the same way everywhere. For the clinics, some sabotage was direct (the smashing of cameras)
and some was subtler, when government supervisors provided official, though essentially
specious, reasons for missing work. We can only conjecture why the causality was switched in the
move from NGO to government; we suspect that working for a highly-respected local NGO is a
different contract from working for the government, where not showing up for work is widely (if
informally) understood to be part of the deal. The incentive lever works when it is wired up
right, as with the NGOs, but not when the wiring cuts it out, as with the government. Knowing
“what works” in the sense of the treatment effect on the trial population is of limited value
without understanding the political and institutional environment in which it is set. This
underlines the need to understand the underlying social, economic, and cultural structures,
including the incentives and agency problems that inhibit service delivery, that are required to
support the causal pathways that we should like to see at work.
Trials in economic development are susceptible to the critique that they take place in
artificial environments. Drèze (2016) notes, based on extensive experience in India, “when a
foreign agency comes in with its heavy boots and suitcases of dollars to administer a ‘treatment,’
whether through a local NGO or government or whatever, there is a lot going on other than the
treatment.” There is also the suspicion that a treatment that works does so because of the
presence of the “treators,” often from abroad, rather than because of the people who will be called
to work it in reality.
There is also much to be learned from many years of economic trials in the United
States, particularly from the work of the Manpower Demonstration Research Corporation (now
known by its initials MDRC), from the early income tax trials, as well as from the RAND Health
Experiment. Following the income tax trials, MDRC has run many randomized trials since the
1970s, mostly for the Federal government but also for individual states and for Canada; see the
thorough and informative account by Gueron and Rolston (2013) for the factual information
underlying the following discussion. MDRC’s program, like that of J-PAL in development, is
intended to find out “what works” in the state and federal welfare programs. These programs are
conditional cash transfers in which poor recipients are given cash provided they satisfy certain
conditions, which are often the subject of the trial. Should there be work requirements? Should
there be remedial education before work requirements? What are the benefits and costs of
various alternatives, both to the recipients and to the local and federal taxpayers? All of these
programs are deeply politicized, with sharply different views over both facts and desirability.
Many engaged in these disputes feel certain of what should be done and what its consequences
will be so that, by their lights, control groups are unethical because they deprive some people of
what the advocates “know” will be certain benefits. Given this, it is perhaps surprising that RCTs
have become the accepted norm for this kind of policy evaluation in the US.
The reasons owe much to political institutions, as well as to the common faith that RCTs
can reveal the truth. At the Federal level, prospective policies are vetted by the non-partisan
Congressional Budget Office, which makes its own estimates of the budgetary implications of
the program. Ideologues whose programs are scored poorly by the CBO have an incentive to support
an RCT, not to convince themselves, but to convince their opponents; once again, RCTs are
especially valuable when your opponents do not share your prior. And control groups are easier to
put in place when there are insufficient funds to cover the whole population. There was also a
widespread and largely uncritical belief that RCTs always give the right answer, at least for the
budgetary implications, which, rather than the wellbeing of the recipients, were often the
primary (and indeed sometimes the only) concern; note that all of these trials are on poor people
by rich people who are typically more concerned with cost than with the wellbeing of the poor,
Greenberg, Shroder and Onstott (1999). MDRC’s trials could therefore be effective dispute
reconciliation mechanisms both for those who saw the need for evidence and for those who did
not (except instrumentally). Note that the outcome here fits with our “public health” case; what
the politicians need to know is not the outcomes for individuals, or even how the outcomes in
one state might transport to another, but the average budgetary cost in a specific place for each
poor person treated, something that a good RCT conducted on a representative sample of the
target population is equipped to deliver, at least in the absence of general equilibrium effects,
timing effects, etc.
These RCTs by MDRC and other contractors deserve much credit. They have demonstrated
both the feasibility of large-scale social trials, including the possibility of randomization in
these settings (where many participants were hostile to the idea), and their usefulness to
policy makers. They also seem to have changed beliefs, for example in favor of the desirability of
work requirements as a condition of welfare, even among many of those who were originally
opposed. There are also limitations; the trials appear to have had at best a limited influence on
scientific thinking about behavior in labor markets. The results of similar programs have often
been different across different sites, and there has to date been no firm understanding of why;
indeed, the trials are not designed to reveal this, Moffitt (2004). Finally, and perhaps crucially for
the potential contribution to economic science, there has been little success in understanding
either the underlying structures or chains of causation, in spite of a determined effort from the
very beginning to peer into the black boxes. Without such mechanisms, transportability is
always in doubt, it is impossible for policy makers or academics to purposively improve the
policies, and the contributions to cumulative science are severely limited.
The RAND health experiment, Manning et al (1988a, b), provides a different but equally
instructive story, if only because its results have permeated the academic and policy discussions
about healthcare ever since. It was originally designed to test the question of whether more
generous insurance would cause people to use more medical care and, if so, by how much. The
incentive effects are hardly in doubt today; the immortality of the study comes rather from the
fact that its multi-arm (response surface) design allowed the calculation of an elasticity for the
study population, that medical expenditures decreased by 0.1 to 0.2 percent for every percent
increase in the copayment. According to Aron-Dine, Einav, and Finkelstein (2013), it is
this dimensionless and thus apparently transportable number that has been used ever since to
discuss the design of healthcare policy; the elasticity has come to be treated as a universal
constant. Ironically, they argue that the estimate cannot be replicated in recent studies, and it is
even unclear that it is firmly based on the original evidence. This account points, once again, to
the central importance of transportability for the long-term usefulness of a trial.
Here, the simple direct transportability of the result seems to have been largely illusory though,
as we have argued, this does not mean that more complex constructions based on the results of
the trial would not have done better.
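
To see what treating the elasticity as a universal constant amounts to in practice, consider the
following sketch. The baseline spending of 1000 and the 10 percent rise in the copayment are
hypothetical, as is the function name `projected_spending`; only the elasticity range of –0.1 to
–0.2 comes from the text. The projection is a simple linear approximation and is not the method
of the original study.

```python
def projected_spending(base_spending: float, copay_rise_pct: float,
                       elasticity: float) -> float:
    """Constant-elasticity projection (linear approximation):
    percent change in spending = elasticity * percent change in copayment."""
    return base_spending * (1.0 + elasticity * copay_rise_pct / 100.0)

# Elasticity range quoted in the text; baseline and copayment rise are invented.
for elasticity in (-0.1, -0.2):
    after = projected_spending(base_spending=1000.0, copay_rise_pct=10.0,
                               elasticity=elasticity)
    print(f"elasticity {elasticity:+.1f}: projected spending {after:.0f}")
```

The arithmetic is trivial, which is part of the appeal; the projection is only as good as the
assumption that the same elasticity holds in the new population, at the new copayment level, and
under the new insurance design, and that is precisely the assumption in dispute.
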
Conclusions
RCTs are the ultimate in credible estimation of average treatment effects in the population
being studied because they make so few assumptions about heterogeneity, causal structure,
choice of variables, and functional form. They are truly nonparametric. And indeed, this is
sometimes just what we want, particularly where we have little credible prior information. RCTs are
often convenient ways to introduce experimenter-controlled variance (if you want to see what
happens, then kick it and see, twist the lion’s tail), but note that many experiments, including
many of the most important (and Nobel Prize winning) experiments in economics, do not and
did not use randomization, Harrison (2013), Svorencik (2015). But the credibility of the results,
even internally, can be undermined by excessive heterogeneity in responses, and especially
when the distribution of effects is asymmetric, where inference on means can be hazardous.
Ironically, the price of the credibility in RCTs is that all we get are means. Yet, in the presence of
outliers, means themselves do not provide the basis for reliable inference. And randomization in
and of itself does nothing unless the details are right; purposive selection into the experimental
population, like purposive selection into and out of assignment, undermines inference in just
the same way as does selection in observational studies. Lack of blinding, whether of
participants, trialists, data collectors, or analysts, undermines inference by permitting factors other
than the treatment to affect the outcome, akin to a failure of exclusion restrictions in
instrumental variable analysis.
The lack of structure can become seriously disabling when we try to use RCT results
outside of a few contexts, such as program evaluation, hypothesis testing, or establishing proof
of concept. Beyond that, we are in trouble. We cannot use the results to help make predictions
elsewhere without more structure, without more prior information, and without having some
idea of what makes treatment effects vary from place to place, or time to time. There is no
option but to commit to some causal structure if we are to know how to use RCT evidence
elsewhere, or to use the estimates out of the original context. Simple generalization and simple
extrapolation just do not cut the mustard. This is true of any study, experimental or observational.
But observational studies are familiar with, and routinely work with, the sort of assumptions
that RCTs claim to avoid, so that if the aim is to use empirical evidence, any credibility advantage
that RCTs have in estimation is no longer operative.
Yet once that commitment has been made, RCT evidence can be extremely useful,
pinning down part of a structure, helping to build stronger understanding and knowledge, and
helping to assess welfare consequences. As our examples show, this can often be done without
committing to the full complexity of what are often thought of as structural models. Yet without
the structure that allows us to place RCT results in context, or to understand the mechanisms
behind those results, not only can we not transport whether “it works” elsewhere, but we
cannot do the standard stuff of economics, which is to say whether or not the intervention is
actually welfare improving; see Harrison (2014) for a vivid account that sharply identifies this and
other issues. Without knowing why things happen and why people do things, we run the risk of
worthless casual (“fairy story”) causal theorizing and have essentially given up on one of the
central tasks of economics.
We must back away from the refusal to theorize, from the exultation in our ability to
handle unlimited heterogeneity, and actually SAY something. Perhaps paradoxically, unless we
are prepared to make assumptions, and to say what we know, making statements that will be
incredible to some, all the credibility of the RCT is for naught.
In the specific context of development that has concerned us here, RCTs have proven
their worth in providing proofs of concept and in testing predictions that some policies must
always work or can never work. But, as elsewhere in economics, we cannot find out why
something works by simply demonstrating that it does work, no matter how often, which leaves us
uninformed as to whether the policy should be implemented. Beyond that, small-scale
demonstration RCTs are not capable of telling us what would happen if these policies were
implemented to scale, of capturing unintended consequences that typically cannot be included in the
protocols, or of modeling what will happen if schemes are implemented by governments, whose
motives and operating principles are different from those of the NGOs who typically run trials.
While it is true that abstract knowledge is always likely to be beneficial to economic development,
successful development depends on institutions and on politics, matters on which RCTs have
little to say. In the end, RCTs are one of the many external technical fixes that have meandered
off and on the development stage since the Second World War, including building infrastructure,
getting prices right, and service delivery, none of which have faced up to the essential domestic
political foundations for development.
Citations
Ahuja, Amrita, Sarah Baird, Joan Hamory Hicks, Michael Kremer, Edward Miguel, and Shawn Powers, 2015, “When should governments subsidize health? The case of mass deworming,” World Bank Economic Review, 29, S9–S24.
Aigner, Dennis J., 1985, “The residential electricity time-of-use pricing experiments. What have we learned?” in David A. Wise and Jerry A. Hausman, Social experimentation, Chicago, IL. Chicago University Press for National Bureau of Economic Research, 11–54.
Aiken, Alexander M., Calum Davey, James R. Hargreaves and Richard J. Hayes, 2015, “Re-analysis of health and educational impacts of a school-based deworming programme in western Kenya: a pure replication,” International Journal of Epidemiology, 0(0), 1–9.
Al-Ubaydli, Omar, and John A. List, 2013, “On the generalizability of experimental results in economics,” in G. Frechette and A. Schotter, Methods of modern experimental economics, Oxford University Press.
Altman, Douglas G., 1985, “Comparability of randomized groups,” Journal of the Royal Statistical Society, Series D (The Statistician), 34(1), Statistics in health, 125–36.
Angrist, Joshua D., 2004, “Treatment effect heterogeneity in theory and practice,” Economic Journal, 114, C52–C83.
Angrist, Joshua D., Eric Bettinger, Erik Bloom, Elizabeth King and Michael Kremer, 2002, “Vouchers for private schooling in Colombia: evidence from a randomized natural experiment,” American Economic Review, 92(5), 1535–58.
Angrist, Joshua D., and Jörn-Steffen Pischke, 2010, “The credibility revolution in empirical economics: how better research design is taking the con out of econometrics,” Journal of Economic Perspectives, 24(2), 3–30.
Aron-Dine, Aviva, Liran Einav, and Amy Finkelstein, 2013, “The RAND health insurance experiment, three decades later,” Journal of Economic Perspectives, 27(1), 197–222.
Arrow, Kenneth J., 1975, “Two notes on inferring long run behavior from social experiments,” Document No. P-5546, Santa Monica, CA. Rand Corporation.
Ashenfelter, Orley, 1978, “Estimating the effect of training programs on earnings,” Review of Economics and Statistics, 60(1), 47–57.
Ashenfelter, Orley, 1978, “The labor supply response of wage earners,” in John L. Palmer and Joseph A. Pechman, eds., Welfare in rural areas: the North Carolina–Iowa Income Maintenance Experiment, Washington, DC. The Brookings Institution. 109–38.
Attanasio, Orazio, Costas Meghir, and Ana Santiago, 2012, “Education choices in Mexico: using a structural model and a randomized experiment to evaluate PROGRESA,” Review of Economic Studies, 79(1), 37–66.
Attanasio, Orazio, Sarah Cattan, Emla Fitzsimons, Costas Meghir, and Marta Rubio Codina, 2015, “Estimating the production function for human capital: results from a randomized controlled trial in Colombia,” London. Institute for Fiscal Studies, Working Paper no W15/06.
Bahadur, R. R., and Leonard J. Savage, 1956, “The non-existence of certain statistical procedures in nonparametric problems,” Annals of Mathematical Statistics, 27, 1115–22.
Banerjee, Abhijit, Sylvain Chassang, Sergio Montero, and Erik Snowberg, 2016, “A theory of experimenters,” processed, July 2016.
Banerjee, Abhijit, Sylvain Chassang, and Erik Snowberg, 2016, “Decision theoretic approaches to experiment design and external validity,” Cambridge, MA. NBER Working Paper no 22167, April.
Banerjee, Abhijit, Angus Deaton, and Esther Duflo, 2004, “Health care delivery in rural Rajasthan,” Economic and Political Weekly, 39(9), 944–9.
Banerjee, Abhijit, and Esther Duflo, 2012, Poor economics: a radical rethinking of the way to fight global poverty, Public Affairs.
Banerjee, Abhijit, Esther Duflo, Nathanael Goldberg, Dean Karlan, Robert Osei, William Parienté, Jeremy Shapiro, Bram Thuysbaert, and Christopher Udry, 2015, “A multifaceted program causes lasting progress for the very poor: evidence from six countries,” Science, 348(6236), 1260799.
Banerjee, Abhijit, Esther Duflo, and Rachel Glennerster, 2008, “Putting a band-aid on a corpse: incentives for nurses in the Indian public health care system,” Journal of the European Economic Association, 6(2–3), 487–500.
Banerjee, Abhijit V., and Ruimin He, 2003, “The World Bank of the future,” American Economic Review, 93(2), 39–44.
Bauchet, Jonathan, Jonathan Morduch and Shamika Ravi, 2015, “Failure vs displacement: why an innovative anti-poverty program showed no net impact in South India,” Journal of Development Economics, 116, 1–16.
Basu, Kaushik, 2010, “The economics of foodgrain management in India,” Ministry of Finance, Delhi. http://finmin.nic.in/workingpaper/Foodgrain.pdf
Bloom, Howard S., Carolyn J. Hill, and James A. Riccio, 2005, “Modeling cross-site experimental differences to find out why program effectiveness varies,” in Howard S. Bloom, ed., Learning more from social experiments: evolving analytical approaches, New York, NY. Russell Sage.
Bobonis, Gustavo, Edward Miguel, and Charu Puri-Sharma, 2006, “Anemia and school participation,” Journal of Human Resources, 41(4), 692–721.
Bold, Tessa, Mwangi Kimenyi, Germano Mwabu, Alice Ng’ang’a and Justin Sandefur, 2013, “Scaling up what works: experimental evidence on external validity in Kenyan education,” Washington, DC. Center for Global Development, Working Paper 321.
Bothwell, Laura E., and Scott H. Podolsky, 2016, “The emergence of the randomized, controlled trial,” New England Journal of Medicine, 375(6), 501–4. doi:10.1056/NEJMp1604635
Campbell, D. T., and J. C. Stanley, 1963, Experimental and quasi-experimental designs for research. Chicago. Rand McNally.
Cartwright, Nancy, 1994, Nature’s capacities and their measurement. Oxford. Clarendon Press.
Cartwright, Nancy, and Jeremy Hardie, 2012, Evidence based policy: a practical guide to doing it better, Oxford. Oxford University Press.
Chalmers, Iain, 2001, “Comparing like with like: some historical milestones in the evolution of methods to create unbiased comparison groups in therapeutic experiments,” International Journal of Epidemiology, 30, 1156–64.
Chalmers, Iain, 2003, “Fisher and Bradford Hill: theory and pragmatism?” International Journal of Epidemiology, 32, 922–24.
Chassang, Sylvain, Gerard Padró i Miguel, and Erik Snowberg, 2012, “Selective trials: a principal–agent approach to randomized controlled experiments,” American Economic Review, 102(4), 1279–1309.
Chassang, Sylvain, Erik Snowberg, Ben Seymour, and Cayley Bowles, 2015, “Accounting for behavior in treatment effects: new applications for blind trials,” PLoS One, 10(6), e0127227. doi:10.1371/journal.pone.0127227.
Chaudhury, Nazmul, Jeffrey Hammer, Michael Kremer, Karthik Muralidharan and F. Halsey Rogers, 2005, “Missing in action: teacher and health worker absence in developing countries,” Journal of Economic Perspectives, 19(4), 91–116.
Chyn, Eric, 2016, “Moved to opportunity: the long-run effect of public housing demolition on labor market outcomes of children,” University of Michigan. http://www-personal.umich.edu/~ericchyn/Chyn_Moved_to_Opportunity.pdf
Conlisk, John, 1973, “Choice of response functional form in designing subsidy experiments,” Econometrica, 41(4), 643–56.
Crépon, Bruno, Esther Duflo, Marc Gurgand, Roland Rathelot, and Philippe Zamora, 2014, “Do labor market policies have displacement effects? Evidence from a clustered randomized experiment,” Quarterly Journal of Economics, 128(2), 531–80.
Croke, Kevin, Joan Hamory Hicks, Eric Hsu, Michael Kremer, and Edward Miguel, 2016, “Does mass deworming affect children’s nutrition? Meta analysis, cost effectiveness, and statistical power,” Cambridge, MA. NBER Working Paper No. 22382 (July).
Cronbach, Lee J., S. R. Ambron, S. M. Dornbusch, R. D. Hess, R. C. Hornick, D. C. Phillips, D. F. Walker, and S. S. Weiner, 1980, Towards reform of program evaluation, San Francisco, Jossey-Bass.
Das, Jishnu and Jeffrey Hammer, 2005, “Which doctor? Combining vignettes and item response to measure clinical competence,” Journal of Development Economics, 78, 348–83.
Davey, Calum, Alexander M. Aiken, Richard J. Hayes, and James R. Hargreaves, 2015, “Re-analysis of health and educational impacts of a school-based deworming programme in western Kenya: a statistical replication of a cluster quasi-randomized stepped wedge trial,” International Journal of Epidemiology, 0(0), 1–12.
Deaton, Angus, and John Muellbauer, 1980, Economics and consumer behavior, New York. Cambridge University Press.
Dhaliwal, Iqbal, Esther Duflo, Rachel Glennerster, and Caitlin Tulloch, 2012, “Comparative cost-effectiveness analysis to inform policy in developing countries: a general framework with applications for education,” J-PAL, MIT, December 3rd. http://www.povertyactionlab.org/publication/cost-effectiveness
Drèze, Jean, 2016, Personal email communication.
Duflo, Esther, Rema Hanna, and Stephen P. Ryan, 2012, “Incentives work: getting teachers to come to school,” American Economic Review, 102(4), 1241–78.
Duflo, Esther, and Michael Kremer, 2008, “Use of randomization in the evaluation of development effectiveness,” in William Easterly, ed., Reinventing foreign aid. Washington, DC. Brookings, 93–120.
Dynarski, Susan, 2015, “Helping the poor in education: the power of a simple nudge,” New York Times, Jan 17, 2015.
Fine, Paul E. M., and Jacqueline A. Clarkson, 1986, “Individual versus public priorities in the determination of optimal vaccination policies,” American Journal of Epidemiology, 124(6), 1012–20.
Fisher, Ronald A., 1926, “The arrangement of field experiments,” Journal of the Ministry of Agriculture of Great Britain, 33, 503–13.
Filmer, Deon, Jeffrey Hammer, and Lant Pritchett, 2000, “Weak links in the chain: a diagnosis of health policy in poor countries,” World Bank Research Observer, 15(2), 199–204.
Freedman, David A., 2006, “Statistical models for causation: what inferential leverage do they provide?” Evaluation Review, 30, 691–713.
Freedman, David A., 2008, “On regression adjustments to experimental data,” Advances in Applied Mathematics, 40, 180–93.
Garfinkel, Irwin, and Charles F. Manski, 1992, “Introduction,” in Irwin Garfinkel and Charles F. Manski, eds., Evaluating welfare and training programs, Cambridge, MA. Harvard University Press. 1–22.
Gertler, Paul J., Sebastian Martinez, Patrick Premand, Laura B. Rawlings, and Christel M. J. Vermeersch, Impact evaluation in practice, Washington, DC. The World Bank.
Glewwe, Paul, Michael Kremer, Sylvie Moulin, and Eric Zitzewitz, 2004, “Retrospective vs. prospective analyses of school inputs: the case of flip-charts in Kenya,” Journal of Development Economics, 74, 251–68.
Greenberg, David and Mark Shroder, 2004, The digest of social experiments (3rd ed.), Washington, DC. Urban Institute Press.
Greenberg, David, Mark Shroder, and Matthew Onstott, 1999, “The social experiment market,” Journal of Economic Perspectives, 13(3), 157–72.
Gueron, Judith M., and Howard Rolston, 2013, Fighting for reliable evidence, New York, Russell Sage.
Guyatt, Gordon, David L. Sackett and Deborah J. Cook for the Evidence-Based Medicine Working Group, 1994, “Users’ guides to the medical literature II: how to use an article about therapy or prevention. B. What were the results and will they help me in caring for my patients?” Journal of the American Medical Association, 271(1), 59–63.
Hargreaves, James R., Alexander M. Aiken, Calum Davey, and Richard J. Hayes, 2015, “Authors’ response to: deworming externalities and school impacts in Kenya,” International Journal of Epidemiology, 0(0), 1–3.
Harrison, Glenn W., 2013, “Field experiments and methodological intolerance,” Journal of Economic Methodology, 20(2), 103–17.
Harrison, Glenn W., 2014, “Impact evaluation and welfare evaluation,” European Journal of Development Research, 26, 39–45.
Hausman, Jerry A., and David A. Wise, 1985, “Technical problems in social experimentation: cost versus ease of analysis,” in Jerry A. Hausman and David A. Wise, eds., Social Experimentation, Chicago, IL. Chicago University Press. 187–220.
Heckman, James J., 1992, “Randomization and social policy evaluation,” in Charles F. Manski and Irwin Garfinkel, eds., Evaluating welfare and training programs, Cambridge, MA. Harvard University Press. 547–70.
Heckman, James J., 1997, “Instrumental variables: a study of implicit behavioral assumptions used in making program evaluations,” Journal of Human Resources, 32(3), 441–62.
Heckman, James J., Neil Hohman, and Jeffrey Smith, with the assistance of Michael Khoo, 2000, “Substitution and dropout bias in social experiments: a study of an influential social experiment,” Quarterly Journal of Economics, 115(2), 651–94.
Heckman, James J., Robert J. Lalonde, and Jeffrey A. Smith, 1999, “The economics and econometrics of active labor markets,” Chapter 31 in Ashenfelter, Orley and David Card, eds., Handbook of labor economics, Amsterdam. North-Holland, 3(A), 1866–2097.
Heckman, James J., Rodrigo Pinto, and Peter Savelyev, 2013, “Understanding the mechanisms through which an influential early childhood program boosted adult outcomes,” American Economic Review, 103(6), 2052–86.
Heckman, James J., Jeffrey Smith, and Nancy Clements, 1997, “Making the most out of programme evaluations and social experiments: accounting for heterogeneity in programme impacts,” Review of Economic Studies, 64(4), 487–535.
Heckman, James J., and Edward Vytlacil, 2005, “Structural equations, treatment effects, and econometric policy evaluation,” Econometrica, 73(3), 669–738.
Heckman, James J. and Edward J. Vytlacil, 2007, “Econometric evaluation of social programs, Part 1: causal models, structural models, and econometric policy evaluation,” Chapter 70 in James J. Heckman and Edward E. Leamer, eds., Handbook of Econometrics, 6B, 4779–874.
Hicks, Joan Hamory, Michael Kremer, and Edward Miguel, 2015, “Commentary: deworming externalities and schooling impacts in Kenya: a comment on Aiken et al (2015) and Davey et al. (2015),” International Journal of Epidemiology, 0(0), 1–4.
Horton, Richard, 2000, “Common sense and figures: the rhetoric of validity in medicine: Bradford Hill memorial lecture 1999,” Statistics in Medicine, 19, 3149–64.
Hotz, V. Joseph, Guido W. Imbens and Julie H. Mortimer, 2005, “Predicting the efficacy of future training programs using past experience at other locations,” Journal of Econometrics, 125, 241–70.
Hsieh, Chang-tai and Miguel Urquiola, 2006, “The effects of generalized school choice on achievement and stratification: evidence from Chile’s voucher program,” Journal of Public Economics, 90, 1477–1503.
Humphreys, Macartan, 2015, “What has been learned from the deworming replications: a nonpartisan view,” Columbia University, Aug. http://www.columbia.edu/~mh2245/w/worms.html
Imbens, Guido W., 2004, “Nonparametric estimation of average treatment effects under exogeneity: a review,” Review of Economics and Statistics, 86(1), 4–29.
Imbens, Guido W., 2010, “Better LATE than nothing: some comments on Deaton (2009) and Heckman and Urzua,” Journal of Economic Literature, 48(2), 399–423.
Imbens, Guido W. and Joshua D. Angrist, 1994, “Identification and estimation of local average treatment effects,” Econometrica, 62(2), 467–75.
Imbens, Guido W., and Jeffrey M. Wooldridge, 2009, “Recent developments in the econometrics of program evaluation,” Journal of Economic Literature, 47(1), 5–86.
International Committee of Medical Journal Editors, 2015, Recommendations for the conduct, reporting, editing, and publication of scholarly work in medical journals, http://www.icmje.org/icmje-recommendations.pdf (accessed August 20, 2016).
Kahneman, Daniel and Gary Klein, 2009, “Conditions for intuitive expertise: a failure to disagree,” American Psychologist, 64(6), 515–26.
Karlan, Dean and Jacob Appel, 2011, More than good intentions: how a new economics is helping to solve global poverty, Dutton.
Karlan,Dean,NathanealGoldbergandJamesCopestake,2009,“Randomizedcontrolledtrialsarethebestwaytomeasureimpactofmicrofinanceprogramsandimprovemicrofinanceproductdesigns,”EnterpriseDevelopmentandMicrofinance,20(3),167–76.
Kasy,Maximilian,2016,“Whyexperimentersmightnotwanttorandomize,andwhattheycoulddoinstead,”PoliticalAnalysis,1–15doi:10.1093/pan/mpw012
Kendall,MauriceG.,1959,“Hiawathadesignsanexperiment,”AmericanStatistician,13(5),23–4.
Kramer,Peter,2016,Ordinarilywell:thecaseforantidepressants,Farrar,Straus,andGiroux.Kremer,Michael,andAlakaHolla,2009,“Improvingeducationinthedevelopingworld:what
havewelearnedfromrandomizedevaluations?”AnnualReviewofEconomics,1,513–42. Lehman,Erich.L.,andJosephP.Romano,2005,Testingstatisticalhypotheses(thirdedition),
NewYork.Springer.Levy,Santiago,2006,Progressagainstpoverty:sustainingMexico’sProgresa-Oportunidades
program,Washington,DC.Brookings.Mackie,JohnL.,1974,Thecementoftheuniverse:astudyofcausation,Oxford.OxfordUniversi-
tyPress.Manning,WillardG.,JosephP.Newhouse,NaihuaDuan,EmmettKeelerandArleenLeibowitz,
1988a,“Healthinsuranceandthedemandformedicalcare:evidencefromarandomizedex-periment,”AmericanEconomicReview,77(3),251–77.
Manning,WillardG.,JosephP.Newhouse,NaihuaDuan,EmmettKeeler,BernadetteBenjamin,ArleenLeibowitz,M.SusanMarquis,andJackZwanziger,1988b,Healthinsuranceandthedemandformedicalcare:evidencefromarandomizedexperiment,SantaMonica,CA.RAND.
Manski,CharlesF.,1990,“Nonparametricboundsontreatmenteffects”AmericanEconomicReview,80(2),319–23.
Manski,CharlesF.,1995,Identificationproblemsinthesocialsciences,Cambridge,MA.HarvardUniversityPress.
Manski,CharlesF.,2003,Partialidentificationofprobabilitydistributions,NewYork.Springer.Manski,CharlesF.,2013,Publicpolicyinanuncertainworld:analysisanddecisions,Cambridge,
MA.HarvardUniversityPress.Metcalfe,CharlesE.,1973,“Makinginferencesfromcontrolledincomemaintenanceexperi-
ments,”AmericanEconomicReview,63(3),478–83.Miguel,Edward,andMichaelKremer,2004,“Worms:identifyingimpactsoneducationand
healthinthepresenceoftreatmentexternalities,”Econometrica,72(1),159–217.Miguel,Edward,MichaelKremer,andJoanHamoryHicks,2015,“CommentonMacartanHum-
phreys’andotherrecentdiscussionsoftheMiguelandKremer(2004)study,”Berkeley,Dec.http://emiguel.econ.berkeley.edu/assets/miguel_research/63/Worms-Comment_2015-12-21.pdf
Moffitt, Robert, 1979, “The labor supply response in the Gary experiment,” Journal of Human Resources, 14(4), 477–87.
Moffitt, Robert, 1992, “Evaluation methods for program entry effects,” Chapter 6 in Charles Manski and Irwin Garfinkel, Evaluating welfare and training programs, Cambridge, MA. Harvard University Press, 231–52.
Moffitt, Robert, 2004, “The role of randomized field trials in social science research: a perspective from evaluations of reforms of social welfare programs,” American Behavioral Scientist, 47(5), 506–40.
Morgan, Kari Lock, and Donald B. Rubin, 2012, “Rerandomization to improve covariate balance in experiments,” Annals of Statistics, 40(2), 1263–82.
Muller, Seán M., 2015, “Causal interaction and external validity: obstacles to the policy relevance of randomized evaluations,” World Bank Economic Review, 29, S217–S225.
Orcutt, Guy H., and Alice G. Orcutt, 1968, “Incentive and disincentive experimentation for income maintenance policy purposes,” American Economic Review, 58(4), 754–72.
Pearl, Judea, 2009, Causality: models, reasoning, and inference, 2nd edition, Cambridge. Cambridge University Press.
Pettigrew, Mark, and Iain Chalmers, 2011, “Use of research evidence in practice,” Lancet, 378(9804), 1696.
Rodrik, Dani, 2006, personal email communication.
Rosenzweig, Mark and Christopher Udry, 2016, “External validity in a stochastic world,” Cambridge, MA. NBER Working Paper 22449 (July).
Rothwell, Peter M., 2005, “External validity of randomized controlled trials: ‘to whom do the results of the trial apply’,” Lancet, 365, 82–93.
Russell, Bertrand, 2008 [1912], The problems of philosophy, Rockville, MD. Arc Manor.
Sackett, David L., William M. C. Rosenberg, J. A. Muir Gray, R. Brian Haynes and W. Scott Richardson, 1996, “Evidence based medicine: what it is and what it isn’t,” British Medical Journal, 312 (January 13), 71–2.
Scriven, Michael, 1974, “Evaluation perspectives and procedures,” in W. James Popham, ed., Evaluation in education—current applications, Berkeley, CA. McCutchan Publishing Corporation.
Sen, Amartya K., 2011, The idea of justice, Cambridge, MA. Harvard University Press.
Senn, Stephen, 1994, “Testing for baseline balance in clinical trials,” Statistics in Medicine, 13, 1715–26.
Senn, Stephen, 2013, “Seven myths of randomization in clinical trials,” Statistics in Medicine, 32, 1439–50.
Shadish, William R., Thomas D. Cook, and Donald T. Campbell, 2002, Experimental and quasi-experimental designs for generalized causal inference, Boston, MA. Houghton Mifflin.
Simpson, Adrian, 2016, “Comparing and combining standardized effect sizes: the misdirection of public policy,” Working Paper, University of Durham (July).
Singer, Burton H., and Steve Pincus, 1998, “Irregular arrays and randomization,” Proceedings of the National Academy of Sciences of the USA, 95, 1363–8.
Stiles, Charles Wardell, 1939, “Early history, in part esoteric, of the hookworm (uncinariasis) campaign in our southern United States,” Journal of Parasitology, 25(4), 283–308.
Stuart, Elizabeth A., Stephen R. Cole, Catharine P. Bradshaw, and Philip J. Leaf, 2011, “The use of propensity scores to assess the generalizability of results from randomized trials,” Journal of the Royal Statistical Society A, 174(2), 369–86.
Svorencik, Andrej, 2015, The experimental turn in economics: a history of experimental economics, Utrecht School of Economics, Dissertation Series #29, http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2560026
Taylor-Robinson, David C., Nicola Maayan, Karla Soares-Weiser, Sarah Donegan, and Paul Garner, 2015, “Deworming drugs for soil-transmitted intestinal worms in children: effects on nutritional indicators, haemoglobin, and school performance (review),” The Cochrane Collaboration. Wiley. http://onlinelibrary.wiley.com/doi/10.1002/14651858.CD000371.pub6/abstract
Todd, Petra E., and Kenneth J. Wolpin, 2006, “Assessing the impact of a school subsidy program in Mexico: using a social experiment to validate a dynamic behavioral model of child schooling and fertility,” American Economic Review, 96(5), 1384–1417.
Todd, Petra E., and Kenneth J. Wolpin, 2008, “Ex ante evaluation of social programs,” Annales d’Economie et de la Statistique, 91/92, 263–91.
U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance, 2003, Identifying and implementing educational practices supported by rigorous evidence: a user friendly guide, Washington, DC. Institute of Education Sciences.
Vandenbroucke, Jan P., 2004, “When are observational studies as credible as randomized controlled trials?” The Lancet, 363, 1728–31.
Vivalt, Eva, 2015, “How much can we generalize from impact evaluations?” NYU, unpublished. http://evavivalt.com/wp-content/uploads/2014/10/Vivalt-JMP-10.27.14.pdf
White, Halbert, 1980, “A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity,” Econometrica, 48(4), 817–38.
Wise, David A., 1985, “A behavioral model versus experimentation: the effects of housing subsidies on rent,” in P. Brucker and R. Pauly, eds., Methods of Operations Research, 50, Verlag Anton Hain, 441–89.
Worrall, John, 2002, “What Evidence in Evidence-Based Medicine?” Philosophy of Science, 69, S316–S330.
Worrall, John, 2007, “Evidence in medicine and evidence-based medicine,” Philosophy Compass, 2/6, 981–1022.
Young, Alwyn, 2016, “Channeling Fisher: randomization tests and the statistical insignificance of seemingly significant experimental results,” London School of Economics, Working Paper, Feb.
Ziliak, Stephen T., 2014, “Balanced versus randomized field experiments in economics: why W. S. Gosset aka ‘Student’ matters,” Review of Behavioral Economics, 1, 167–208.