Using replications and prediction markets to estimate the...
TRANSCRIPT
Can We Trust Scientific Results? Using Replications and Prediction Markets to Estimate the Reproducibility of Science
Anna Dreber Almenberg, Stockholm School of Economics
Reproducibility and Replication Workshop, Sept 3, 2018
Other false results
• P-values and power
• Publication bias
• Researcher degrees of freedom
• How to assess reproducibility?
Replication and prediction projects
• Psychology
• Experimental economics
• Nature and Science social science experiments
• Peer beliefs from prediction markets and surveys
• Joint work with lots of people
Open Science Framework: Reproducibility Project: Psychology (RPP)
• Studies from 3 top psychology journals
• 100 replications completed by the project deadline
• Replicate yes/no
  – Yes: Same direction and p<0.05 in two-sided test
• “High” power (average of 92% to detect 100% of original effect size)
• 270 authors
RPP outcomes
Open Science Collaboration (2015). “Estimating the Reproducibility of Psychological Science.” Science, 349(6251).
Experimental Economics Replication Project (EERP)
• 18 papers in 2 top economics journals
  – All experimental papers 2011-2014
  – Only main effects, no interactions, testing a hypothesis
  – 90% power to detect original effect size at p<0.05
• Same replication criteria as earlier
  – Same direction and p<0.05
• Replication and analysis plans publicly known on project website and pre-registered at OSF and sent to the original authors
Camerer et al. (2016) “Evaluating replicability of laboratory experiments in economics.” Science
Social Science Replication Project (SSRP)
• 21 studies in Nature (4) and Science (17) published 2010-2015
  – Between or within subject designs with clear hypothesis and students or other accessible subject pools
  – First study in papers with more than one study; we chose the central result
Camerer et al. (2018) “Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015.” Nature Human Behaviour
Social Science Replication Project (SSRP)
• High power and two-stage procedure
  – Stage 1: 90% power to detect 75% of the original effect size. If result does not replicate, move to Stage 2
    • Replication sample sizes on average three times as large as original sample sizes
  – Stage 2: 90% power to detect 50% of the original effect size in the 2 stages pooled
    • Replication sample sizes on average six times as large as original sample sizes
  – 90% power to detect 50% of the original effect size based on the RPP replication effect sizes being on average about 50% of the original effect sizes
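The two-stage power targets can be made concrete with a rough sample-size calculation. The sketch below is not the SSRP procedure itself: it assumes a two-sided, two-sample z-test on a standardized mean difference, and the function name `n_per_group` and the original effect size of 0.5 are hypothetical choices for illustration.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, power=0.90, alpha=0.05):
    """Approximate per-group n for a two-sided, two-sample z-test.

    d is the standardized mean difference to detect (hypothetical input).
    Uses the normal approximation n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


original_d = 0.5  # hypothetical original effect size, not taken from the talk
for fraction in (1.0, 0.75, 0.5):
    # Stage 1 targets 75% of the original effect, Stage 2 targets 50%.
    print(f"{fraction:.0%} of original effect: n per group = "
          f"{n_per_group(fraction * original_d)}")
```

Under this approximation the required sample grows with the inverse square of the target effect, so aiming at 50% of the original effect needs roughly four times the sample needed for the full effect.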
SSRP: To replicate or to not replicate
• Statistical significance criterion
  – Same direction and p<0.05
• Prediction intervals
  – How many replicated effects lie in 95% prediction interval which takes into account the variability in both original and replication study?
• Small Telescopes approach
  – Is the replication effect size significantly smaller than a ‘small effect’ in the original study with a one-sided test at p<0.05? Small effect defined as the effect size the original study would have had 33% power to detect
• Bayes Factor
  – Compares the predictive performance of the null hypothesis against that of an alternative hypothesis in which the uncertainty about the true effect size is quantified by a prior distribution
• And more
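As one illustration of the prediction-interval criterion, the sketch below assumes effect sizes are expressed as correlation coefficients and uses the Fisher z-transformation, so that the interval width reflects sampling variability in both the original and the replication study. The function names and the example numbers are hypothetical, and the project's exact computation may differ.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist


def prediction_interval(r_orig, n_orig, n_rep, level=0.95):
    """Prediction interval for a replication correlation (Fisher z scale)."""
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    # Standard error combines the uncertainty of both studies.
    se = sqrt(1 / (n_orig - 3) + 1 / (n_rep - 3))
    centre = atanh(r_orig)
    return tanh(centre - z_crit * se), tanh(centre + z_crit * se)


def within_interval(r_orig, n_orig, r_rep, n_rep):
    """True if the replication estimate falls inside the interval."""
    lo, hi = prediction_interval(r_orig, n_orig, n_rep)
    return lo <= r_rep <= hi


# Hypothetical numbers, not taken from any SSRP study.
print(prediction_interval(0.4, 50, 150))
print(within_interval(0.4, 50, 0.2, 150))
```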
Statistical significance criterion: 13/21 replicate in Stage 2
Mean relative effect size: 46%. For 13 studies that replicated: 74%; for the rest, 0.3%
Different conclusions on only one of the replications compared to statistical significance criterion
14/21 replicate for Prediction intervals, 12/21 for Small Telescopes approach
The default Bayes factor is >1 and provides evidence in favor of an effect in the direction of the original study for the same 13/21 studies that replicated according to the statistical significance criterion. Strong to extreme evidence for 9/21
Higher power: What do we learn?
• Original studies overestimate the effect sizes of true positives
• Replication effect sizes about 75% of the original effect size
  – Similar result with Bayesian mixture model
• Meta-analyses of true results will overestimate effect sizes on average
• RPP and EERP probably had less power than intended
Our prediction markets
• 2 sets of markets on 44 RPP studies
  – 2 weeks and about 45 participants each time, USD 100
• 1 set of markets on 18 EERP studies
  – 10 days, 97 participants, USD 50
• 1 set of markets on 21 SSRP studies
  – 2 weeks, about 200 participants, USD
  – 2 treatments
Our prediction markets
• One central hypothesis for each study
• Binary outcomes
• Participants traded contracts that pay $1 (or $0.5) if the study is replicated and $0 otherwise
• Price: predicted probability of the outcome occurring
  – With some caveats
• Logarithmic scoring rule, long and short selling
• Participants get replication reports
• Prices start at 50
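To show how a price can be read as a probability and why it starts at 50, here is a minimal sketch of a market maker based on a logarithmic market scoring rule. It assumes the "logarithmic scoring rule" on the slide refers to this kind of automated market maker; the liquidity parameter `b` and all function names are hypothetical.

```python
from math import exp, log


def lmsr_cost(q_yes, q_no, b):
    """Cost function C(q) = b * ln(exp(q_yes / b) + exp(q_no / b))."""
    return b * log(exp(q_yes / b) + exp(q_no / b))


def lmsr_price(q_yes, q_no, b):
    """Current price of the 'replicates' contract, between 0 and 1."""
    e_yes = exp(q_yes / b)
    return e_yes / (e_yes + exp(q_no / b))


def cost_to_buy(q_yes, q_no, shares_yes, b):
    """What a trader pays to move the outstanding 'yes' shares by shares_yes.

    A negative shares_yes corresponds to selling short.
    """
    return lmsr_cost(q_yes + shares_yes, q_no, b) - lmsr_cost(q_yes, q_no, b)


b = 100.0  # hypothetical liquidity parameter
print(lmsr_price(0, 0, b))       # no trades yet: price is 0.5
print(cost_to_buy(0, 0, 50, b))  # cost of buying 50 'yes' shares
print(lmsr_price(50, 0, b))      # price after that purchase
```

With no shares outstanding both outcomes are priced equally, which matches markets that open at 50. Buying pushes the price up and short selling pushes it down.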
Pre-market survey
• “How likely do you think it is that this hypothesis will be replicated (on a scale from 0% to 100%)?”
• “How well do you know this topic? (not at all, slightly, moderately, very well, extremely well)”
  – 1-5
• Slightly more complicated for EERP
Prediction market results RPP
Dreber A, T Pfeiffer, J Almenberg, S Isaksson, B Wilson, Y Chen, BA Nosek & M Johannesson (2015). "Using Prediction Markets to Estimate the Reproducibility of Scientific Research." Proceedings of the National Academy of Sciences, 112: 15343-15347.
• Mean market price: 55% (range 13 to 88%)
• 41/44 studies finished
  – 16/41 successful replications
  – 25/41 failed replications
• Market predicts 29/41 (71%) correctly
• Significantly higher than 50%
  – One-sample binomial test p=0.012
• Survey predicts 23/40 (57%) correctly
  – Not significantly different from 50%
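The one-sample binomial test behind these comparisons can be recomputed from the counts on the slide. The sketch below assumes an exact two-sided test against a 50% hit rate; the function name is hypothetical and the original analysis may have used different software or conventions.

```python
from math import comb


def binomial_two_sided_p(successes, n, p0=0.5):
    """Exact two-sided binomial p-value.

    Sums the probability of every outcome that is no more likely
    under the null than the observed count.
    """
    pmf = [comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(n + 1)]
    observed = pmf[successes]
    # Small tolerance so ties are not lost to floating-point rounding.
    return sum(p for p in pmf if p <= observed * (1 + 1e-9))


print(binomial_two_sided_p(29, 41))  # market: 29 of 41 correct
print(binomial_two_sided_p(23, 40))  # survey: 23 of 40 correct
```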
Prediction market results EERP
• Market and survey equally successful
• All prices (and beliefs) >50
• Average prediction: 75%
• Survey average: 71%
  – Neither different from 61%, and not different from each other
Camerer et al. 2016
Prediction market results SSRP
• Mean prediction market belief of replication is 63.4%
  – [range of 23.1% to 95.5%, 95% CI = (53.7%, 73.0%)]
• Mean survey belief is 60.6%
  – [range of 27.8% to 81.5%, 95% CI = (53.0%, 68.2%)]
• Actual replication rate is 61.9%
• Both prediction market beliefs and survey beliefs are also highly correlated with a successful replication
  – Market: Spearman correlation coefficient 0.842, 95% CI = (0.645, 0.934), p<0.001, n=21
  – Survey: Spearman correlation coefficient 0.761, 95% CI = (0.491, 0.898), p<0.001, n=21
Positive predictive value (PPV)
• Prediction market prices can also be used to estimate a probability for each hypothesis to be true (the PPV)
• The price (PM) reflects the probability that a published result will be replicated, not the PPV
• Market participants know statistical power and significance levels for each study
  – Power always <100%
  – 0.05 p-value cut-off
• Combining the PM, power and significance level allows us to estimate the PPV or probability for a hypothesis to be true
Priors and posteriors
• Probability that a hypothesis is true (PPV)
  – Can be assigned at three different stages based on the market price and info on power and significance of original study and replication: before original study, after original study, after replication
$p_1$: prior at time of replication
$p_E$: probability of observing positive evidence in the replication
$p_M$: final market price
If positive outcome in original study: $p_E = p_1 \beta_1 + (1 - p_1)\alpha_1$.
If $p_M = p_E$, probability $p_1$ can be reconstructed as $p_1 = (p_M - \alpha_1) / (\beta_1 - \alpha_1)$.
Etc. for $p_0$ and $p_2$.
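The reconstruction of p1 from the market price can be sketched in a few lines. This reads β1 as the replication's power and α1 as its significance level, which is how the surrounding slides describe them. The update to p2 is a plain application of Bayes' rule and is an assumption about what "etc." covers, not the authors' exact procedure; the function names and example inputs are hypothetical.

```python
def prior_from_market_price(p_market, alpha, power):
    """Invert pM = p1 * power + (1 - p1) * alpha to recover p1.

    The result only lies in [0, 1] when alpha <= p_market <= power.
    """
    return (p_market - alpha) / (power - alpha)


def posterior_after_replication(p1, alpha, power, replicated):
    """Bayes update of p1 given the replication outcome (assumed form of p2)."""
    if replicated:
        true_pos = p1 * power
        return true_pos / (true_pos + (1 - p1) * alpha)
    false_neg = p1 * (1 - power)
    return false_neg / (false_neg + (1 - p1) * (1 - alpha))


# Hypothetical inputs: market price 0.55, alpha 0.05, power 0.92.
p1 = prior_from_market_price(0.55, alpha=0.05, power=0.92)
print(p1)
print(posterior_after_replication(p1, 0.05, 0.92, replicated=True))
print(posterior_after_replication(p1, 0.05, 0.92, replicated=False))
```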
Probability of hypothesis being true at 3 stages of testing for RPP
  – Initial priors are low (median 8.8%)
  – Positive result in initial publication moves prior to intermediate level (median 56%)
  – If successful replication, probability moves up (median 98%)
  – If failed replication, probability close to initial prior (median 6.3%)
Whiskers: range. Boxes: 1st to 3rd quartiles. Thick lines: medians. Dreber et al. 2015 PNAS
What have we learned?
• Common false interpretation of p<0.05: 95% probability of hypothesis being true
• For this to be the case, a p<0.05 finding needs to be supported in a high-powered replication
• Are the incentives for replications appropriate?
• There is something systematic about results that fail to replicate
• Why are so many false results published?
  – Researcher degrees of freedom
  – Prediction market results: People savvier now than before? See forthcoming book chapter by Camerer, Dreber and Johannesson for more
Researcher degrees of freedom
Ioannidis 2005 Why Most Published Research Findings Are False; Simmons, Nelson and Simonsohn 2011 False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant; Gelman and Loken 2013 The Garden of Forking Paths
[Figure: Histogram of p-values. x-axis: Observed p-value (0.01 to 0.1); y-axis: Frequency (0 to 600)]
P-hacking
• Many ways to get p<0.05
  1. Stop collecting data once p<.05
  2. Analyze many measures, but report only those with p<.05
  3. Collect and analyze many conditions, but only report those with p<.05
  4. Use covariates to get p<.05
  5. Exclude participants to get p<.05
  6. Transform the data to get p<.05
Simmons, JP, LD Nelson, U Simonsohn, 2011, False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant. Psychological Science 22(11): 1359-1366.
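The first item on the list, stopping data collection once p<.05, can be illustrated with a small simulation. This is a sketch under simplifying assumptions: the null hypothesis is true, observations are standard normal with known variance, and the sample is checked every 10 observations up to 50. All names and settings are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def p_value(sample):
    """Two-sided z-test of mean 0, assuming a known standard deviation of 1."""
    z = sum(sample) / sqrt(len(sample))
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(peek, n_sims=2000, n_max=50, step=10, seed=1):
    """Share of simulated null experiments that reach p < 0.05."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        data = [rng.gauss(0, 1) for _ in range(n_max)]  # no true effect
        # With peeking, test after every `step` observations and stop
        # at the first p < 0.05; otherwise test once at the full sample.
        looks = range(step, n_max + 1, step) if peek else [n_max]
        if any(p_value(data[:n]) < 0.05 for n in looks):
            hits += 1
    return hits / n_sims


print("single test at n=50:", false_positive_rate(peek=False))
print("stop at first p<.05:", false_positive_rate(peek=True))
```

Both runs use the same simulated data, so any gap between the two rates comes from the extra looks alone.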
Forking
• Study: We want to understand altruism from a dictator game. Hypothesis that the gender of the recipient matters
• So many forks!
• Effect could be that everybody gives more to men
• Everybody gives more to women
• Men give more to men
• Men give more to women
• Or women give more to men
• Or women give more to women
• Or men and women have opposite effects
• Maybe interacts with age
• How define altruism? How much you gave or whether you gave?
• What tests, parametric or non-parametric?
• Etc – even if hypothesis is more precise
• Enough with one test and you are forking
Priors
• Probability of a hypothesis to be true (“prior”)
• Typically subjective and inaccessible
• Combination of low prior, low power and p<0.05 can be very misleading
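A short worked example shows how misleading that combination can be. It uses the standard Bayes-rule expression for the probability that a significant finding is true, writing the prior as $\pi$ (a symbol introduced here), power as $\beta$ and the significance level as $\alpha$. The numbers are hypothetical and chosen only for illustration.

```latex
\[
\mathrm{PPV} = \frac{\pi \beta}{\pi \beta + (1 - \pi)\,\alpha}
\]
% Hypothetical values: prior 0.10, power 0.20, significance level 0.05
\[
\mathrm{PPV} = \frac{0.10 \times 0.20}{0.10 \times 0.20 + 0.90 \times 0.05}
             = \frac{0.020}{0.065} \approx 0.31
\]
```

With these inputs, a result with p<0.05 would be true less than a third of the time, far from the 95% that the cut-off is often taken to imply.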
Other thoughts
• Pre-analysis plans
• Less obsession with p-values?
• p<0.005
  – See Benjamin et al. 2017 “Redefine Statistical Significance” Nature Human Behaviour
• Higher power
• Team science
  – Munafo et al. 2017 Nature Human Behaviour
Next
• Decision markets: Let the market decide which replications to perform or hypotheses to test
  – Highest degree of uncertainty in the outcome?
  – Highest disagreement?
  – Highest chances of observing a suspected effect?
  – Better survey measures
  – PNAS papers in “social sciences”