Can We Trust Scientific Results? Using Replications and Prediction Markets to Estimate the Reproducibility of Science
Anna Dreber Almenberg, Stockholm School of Economics
Reproducibility and Replication Workshop, Sept 3 2018



False results

Other false results

•  P-values and power
•  Publication bias
•  Researcher degrees of freedom

•  How to assess reproducibility?

Replication and prediction projects
•  Psychology
•  Experimental economics
•  Nature and Science social science experiments
•  Peer beliefs from prediction markets and surveys

•  Joint work with lots of people

Open Science Framework: Reproducibility Project: Psychology (RPP)

•  Studies from 3 top psychology journals
•  100 replications completed by the project deadline
•  Replicate yes/no
  –  Yes: Same direction and p<0.05 in two-sided test
•  “High” power (average of 92% to detect 100% of original effect size)
•  270 authors

RPP outcomes

Open Science Collaboration (2015). “Estimating the Reproducibility of Psychological Science.” Science, 349(6251).

Experimental Economics Replication Project (EERP)

•  18 papers in 2 top economics journals
  –  All experimental papers 2011-2014
  –  Only main effects, no interactions, testing a hypothesis
  –  90% power to detect original effect size at p<0.05
•  Same replication criteria as earlier
  –  Same direction and p<0.05
•  Replication and analysis plans publicly known on project website and pre-registered at OSF and sent to the original authors

Camerer et al. (2016) “Evaluating replicability of laboratory experiments in economics.” Science

Camerer et al. Science 2016: 11/18 replicate

Social Science Replication Project (SSRP)

•  21 studies in Nature (4) and Science (17) published 2010-2015
  –  Between or within subject designs with clear hypothesis and students or other accessible subject pools
  –  First study in papers with more than one study; we chose the central result

Camerer et al. (2018) “Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015.” Nature Human Behaviour

Social Science Replication Project (SSRP)

•  High power and two-stage procedure
  –  Stage 1: 90% power to detect 75% of the original effect size. If result does not replicate, move to Stage 2
    •  Replication sample sizes on average three times as large as original sample sizes
  –  Stage 2: 90% power to detect 50% of the original effect size in the 2 stages pooled
    •  Replication sample sizes on average six times as large as original sample sizes
  –  90% power to detect 50% of the original effect size based on the RPP replication effect sizes being on average about 50% of the original effect sizes
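The two-stage design rests on a standard relationship: for fixed power and significance level, the required sample size grows with the inverse square of the effect size to be detected. The sketch below illustrates that scaling with a normal approximation for a two-group comparison. The function name `n_per_group` and the original effect size `d_original = 0.5` are illustrative assumptions, not values from the project. The three-fold and six-fold averages on the slide also depend on how the original studies were powered, which this sketch does not model.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, power=0.90, alpha=0.05):
    """Approximate per-group n for a two-sided, two-sample z-test on a
    standardized effect size d (normal approximation, equal group sizes)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the target power
    return 2 * ((z_alpha + z_power) / d) ** 2


d_original = 0.5  # hypothetical original effect size (Cohen's d)
for fraction in (1.00, 0.75, 0.50):
    n = n_per_group(fraction * d_original)
    print(f"{fraction:.0%} of original effect: about {ceil(n)} per group")
```

Under this approximation, halving the target effect size quadruples the required sample, which is why powering for 50% of the original effect is so much more demanding than powering for 100%.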


SSRP: To replicate or to not replicate
•  Statistical significance criterion
  –  Same direction and p<0.05
•  Prediction intervals
  –  How many replicated effects lie in 95% prediction interval which takes into account the variability in both original and replication study?
•  Small Telescopes approach
  –  Is the replication effect size significantly smaller than a ‘small effect’ in the original study with a one-sided test at p<0.05? Small effect defined as the effect size the original study would have had 33% power to detect
•  Bayes Factor
  –  Compares the predictive performance of the null hypothesis against that of an alternative hypothesis in which the uncertainty about the true effect size is quantified by a prior distribution
•  And more
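As an illustration of the prediction-interval criterion, here is a minimal sketch that assumes effect sizes are expressed as correlation coefficients and uses the Fisher z-transformation, so that the interval width reflects sampling error in both the original and the replication study. The function name and all numbers are hypothetical, and the project's exact procedure may differ.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist


def prediction_interval(r_orig, n_orig, n_rep, level=0.95):
    """Interval in which a replication correlation is expected to fall,
    given the original estimate and both sample sizes (Fisher z scale)."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    # Standard error combines the uncertainty of both studies.
    se = sqrt(1 / (n_orig - 3) + 1 / (n_rep - 3))
    center = atanh(r_orig)
    return tanh(center - z_crit * se), tanh(center + z_crit * se)


# Hypothetical study: original r = 0.40 with n = 50, replication n = 150.
low, high = prediction_interval(r_orig=0.40, n_orig=50, n_rep=150)
r_rep = 0.15  # hypothetical replication estimate
print(f"95% prediction interval: ({low:.2f}, {high:.2f})")
print("replication inside interval:", low <= r_rep <= high)
```

A small original study produces a wide interval, so this criterion can count a replication as consistent with the original even when the replication effect is much smaller.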

Statistical significance criterion: 13/21 replicate in Stage 2

Mean relative effect size: 46%. For the 13 studies that replicated: 74%; for the rest: 0.3%

Different conclusions on only one of the replications compared to statistical significance criterion

14/21 replicate for prediction intervals, 12/21 for Small Telescopes approach

The default Bayes factor is >1 and provides evidence in favor of an effect in the direction of the original study for the same 13/21 studies that replicated according to the statistical significance criterion. Strong to extreme evidence for 9/21
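The three relative effect sizes reported above are linked by a weighted average: the overall mean is the study-count-weighted mean of the two subgroup means. A quick consistency check using only the figures on the slide:

```python
replicated, total = 13, 21
mean_rel_replicated = 0.74  # reported mean for the 13 studies that replicated
mean_rel_rest = 0.003       # reported mean for the remaining 8 studies

# Weight each subgroup mean by its number of studies.
overall = (
    replicated * mean_rel_replicated + (total - replicated) * mean_rel_rest
) / total
print(f"overall mean relative effect size: {overall:.1%}")
```

The result rounds to the 46% stated on the slide.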

Higher power: What do we learn?

•  Original studies overestimate the effect sizes of true positives
•  Replication effect sizes about 75% of the original effect size
  –  Similar result with Bayesian mixture model
•  Meta-analyses of true results will overestimate effect sizes on average
•  RPP and EERP probably had less power than intended

“Could gambling save science?”

Hanson 1995 Social Epistemology

Our prediction markets

•  2 sets of markets on 44 RPP studies
  –  2 weeks and about 45 participants each time, USD 100
•  1 set of markets on 18 EERP studies
  –  10 days, 97 participants, USD 50
•  1 set of markets on 21 SSRP studies
  –  2 weeks, about 200 participants, USD
  –  2 treatments

Our prediction markets
•  One central hypothesis for each study
•  Binary outcomes
•  Participants traded contracts that pay $1 (or $0.5) if the study is replicated and $0 otherwise
•  Price: predicted probability of the outcome occurring
  –  With some caveats
•  Logarithmic scoring rule, long and short selling
•  Participants get replication reports
•  Prices start at 50
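To make the pricing mechanism concrete, the sketch below assumes that the "logarithmic scoring rule" on the slide refers to Hanson's logarithmic market scoring rule (LMSR) for a binary contract. The liquidity parameter `b` and the trade size are hypothetical, and the function names are illustrative rather than taken from any trading platform.

```python
from math import exp, log


def lmsr_cost(q_yes, q_no, b):
    """LMSR cost function for outstanding 'yes' and 'no' shares."""
    return b * log(exp(q_yes / b) + exp(q_no / b))


def lmsr_price_yes(q_yes, q_no, b):
    """Current price of a 'yes' share, readable as a probability."""
    e_yes, e_no = exp(q_yes / b), exp(q_no / b)
    return e_yes / (e_yes + e_no)


def trade_cost(q_yes, q_no, delta_yes, b):
    """Amount a trader pays to buy delta_yes 'yes' shares
    (a negative delta_yes is a short sale and returns a negative cost)."""
    return lmsr_cost(q_yes + delta_yes, q_no, b) - lmsr_cost(q_yes, q_no, b)


b = 100.0  # hypothetical liquidity parameter
start_price = lmsr_price_yes(0, 0, b)   # no shares outstanding
cost = trade_cost(0, 0, 50, b)          # a trader buys 50 'yes' shares
new_price = lmsr_price_yes(50, 0, b)
print(f"start {start_price:.2f}, after trade {new_price:.2f}, cost {cost:.2f}")
```

With no shares outstanding the price is 0.5, matching markets that open at 50. Buying moves the price up and short selling moves it down, so a trader profits in expectation only by moving the price toward their own belief.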

Pre-market survey

•  “How likely do you think it is that this hypothesis will be replicated (on a scale from 0% to 100%)?”
•  “How well do you know this topic? (not at all, slightly, moderately, very well, extremely well)”
  –  1-5
•  Slightly more complicated for EERP

RPP trading interface: Consensus Point


Prediction market results RPP

Dreber A, T Pfeiffer, J Almenberg, S Isaksson, B Wilson, Y Chen, BA Nosek & M Johannesson (2015). "Using Prediction Markets to Estimate the Reproducibility of Scientific Research." Proceedings of the National Academy of Sciences, 112: 15343-15347.

•  Mean market price: 55% (range 13 to 88%)
•  41/44 studies finished
  –  16/41 successful replications
  –  25/41 failed replications
•  Market predicts 29/41 (71%) correctly
•  Significantly higher than 50%
  –  One-sample binomial test p=0.012
•  Survey predicts 23/40 (57%) correctly
  –  Not significantly different from 50%
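The binomial tests above can be reproduced with an exact calculation using only the counts on the slide. This sketch assumes a two-sided test against a 50% null, which appears consistent with the reported p=0.012. The function name is illustrative.

```python
from math import comb


def binom_two_sided_p(successes, n):
    """Exact two-sided p-value against a 50% null.
    The null distribution is symmetric, so the upper tail is doubled."""
    k = max(successes, n - successes)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p_market = binom_two_sided_p(29, 41)  # market: 29 of 41 correct
p_survey = binom_two_sided_p(23, 40)  # survey: 23 of 40 correct
print(f"market p = {p_market:.3f}, survey p = {p_survey:.3f}")
```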

Prediction market results EERP

•  Market and survey equally successful
•  All prices (and beliefs) >50
•  Average prediction: 75%
•  Survey average: 71%
  –  Neither different from 61%, and not different from each other

Camerer et al. 2016

Prediction markets results SSRP

•  Mean prediction market belief of replication is 63.4%
  –  [range of 23.1% to 95.5%, 95% CI = (53.7%, 73.0%)]
•  Mean survey belief is 60.6%
  –  [range of 27.8% to 81.5%, 95% CI = (53.0%, 68.2%)]
•  Actual replication rate is 61.9%
•  Both prediction market beliefs and survey beliefs are also highly correlated with a successful replication
  –  Market: Spearman correlation coefficient 0.842, 95% CI = (0.645, 0.934), p<0.001, n=21
  –  Survey: Spearman correlation coefficient 0.761, 95% CI = (0.491, 0.898), p<0.001, n=21

Prediction markets results SSRP

From treatment 2

Positive predictive value (PPV)
•  Prediction market prices can also be used to estimate a probability for each hypothesis to be true (the PPV)
•  The price (PM) reflects the probability that a published result will be replicated, not the PPV
•  Market participants know statistical power and significance levels for each study
  –  Power always <100%
  –  0.05 p-value cut-off
•  Combining the PM, power and significance level allows us to estimate the PPV or probability for a hypothesis to be true

Priors and posteriors

•  Probability that a hypothesis is true (PPV)
  –  Can be assigned at three different stages based on the market price and info on power and significance of original study and replication: before original study, after original study, after replication

$p_1$: prior at time of replication
$p_E$: probability of observing positive evidence in the replication
$p_M$: final market price
If positive outcome in original study: $p_E = p_1 \beta_1 + (1 - p_1) \alpha_1$.
If $p_M = p_E$, probability $p_1$ can be reconstructed as $p_1 = (p_M - \alpha_1) / (\beta_1 - \alpha_1)$.
Etc. for $p_0$ and $p_2$.
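The reconstruction above can be sketched in a few lines. This assumes β1 is the statistical power and α1 the significance level of the replication, which the slide does not spell out. The "etc." step is filled in here with a plain Bayes update, also an assumption, and the market price, power, and significance level in the example are hypothetical.

```python
def prior_from_market_price(p_m, power, alpha):
    """Invert p_E = p1*power + (1 - p1)*alpha, assuming p_M = p_E."""
    return (p_m - alpha) / (power - alpha)


def posterior(prior, power, alpha, positive):
    """Bayes update of the probability that the hypothesis is true
    after a positive (significant) or negative replication result."""
    if positive:
        if_true, if_false = prior * power, (1 - prior) * alpha
    else:
        if_true, if_false = prior * (1 - power), (1 - prior) * (1 - alpha)
    return if_true / (if_true + if_false)


# Hypothetical study: final market price 0.55, replication power 0.92.
p_m, power, alpha = 0.55, 0.92, 0.05
p1 = prior_from_market_price(p_m, power, alpha)
p2_success = posterior(p1, power, alpha, positive=True)
p2_failure = posterior(p1, power, alpha, positive=False)
print(f"p1 = {p1:.3f}, after success = {p2_success:.3f}, "
      f"after failure = {p2_failure:.3f}")
```

Running the same update backwards from p1, with the original study's power and significance level, would give an estimate of the prior p0 held before the original study.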

Probability of hypothesis being true at 3 stages of testing for RPP

  –  Initial priors are low (median 8.8%)
  –  Positive result in initial publication moves prior to intermediate level (median 56%)
  –  If successful replication, probability moves up (median 98%)
  –  If failed replication, probability close to initial prior (median 6.3%)

Whiskers: range. Boxes: 1st to 3rd quartiles. Thick lines: medians. Dreber et al. 2015 PNAS

What have we learned?
•  Common false interpretation of p<0.05: 95% probability of hypothesis being true
•  For this to be the case, a p<0.05 finding needs to be supported in a high-powered replication
•  Are the incentives for replications appropriate?
•  There is something systematic about results that fail to replicate
•  Why are so many false results published?
  –  Researcher degrees of freedom
  –  Prediction market results: People savvier now than before? See forthcoming book chapter by Camerer, Dreber and Johannesson for more

Researcher degrees of freedom

Ioannidis 2005 Why Most Published Research Findings Are False; Simmons, Nelson and Simonsohn 2011 False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant; Gelman and Loken 2013 The Garden of Forking Paths

[Figure: Histogram of p-values. x-axis: observed p-value (0.01 to 0.1); y-axis: frequency (0 to 600)]

P-hacking

•  Many ways to get p<0.05
1.  Stop collecting data once p<.05
2.  Analyze many measures, but report only those with p<.05
3.  Collect and analyze many conditions, but only report those with p<.05
4.  Use covariates to get p<.05
5.  Exclude participants to get p<.05
6.  Transform the data to get p<.05

Simmons, JP, LD Nelson, U Simonsohn, 2011, False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant. Psychological Science 22(11): 1359-1366.
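The first item on the list, stopping data collection once p<.05, can be illustrated with a small simulation. This is a sketch under stated assumptions, not the procedure from Simmons et al.: data are drawn from a true null (standard normal with mean zero), the test is a z-test with known variance, and the sample sizes, number of simulations, and seed are arbitrary.

```python
import random
from math import erfc, sqrt


def p_value(total, n):
    """Two-sided z-test p-value for mean = 0, assuming unit variance."""
    z = total / sqrt(n)
    return erfc(abs(z) / sqrt(2))


def false_positive_rate(peek, n_start=10, n_max=50, sims=2000, seed=1):
    """Share of null simulations that end with p < 0.05.
    peek=True tests after every observation from n_start onward and stops
    at the first significant result; peek=False tests once, at n_max."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(sims):
        total, n = 0.0, 0
        while n < n_max:
            total += rng.gauss(0.0, 1.0)
            n += 1
            may_test = n == n_max or (peek and n >= n_start)
            if may_test and p_value(total, n) < 0.05:
                hits += 1
                break
    return hits / sims


fixed_rate = false_positive_rate(peek=False)
peeking_rate = false_positive_rate(peek=True)
print(f"fixed n: {fixed_rate:.3f}, optional stopping: {peeking_rate:.3f}")
```

With a fixed sample size the false-positive rate stays near the nominal 5%. Repeated testing with optional stopping pushes it well above that level, even though every individual test uses the 0.05 cut-off.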


P-hacking in economics

Brodeur et al. 2016 AEJ: AE

Forking
•  Study: We want to understand altruism from a dictator game. Hypothesis that the gender of the recipient matters
•  So many forks!
•  Effect could be that everybody gives more to men
•  Everybody gives more to women
•  Men give more to men
•  Men give more to women
•  Or women give more to men
•  Or women give more to women
•  Or men and women have opposite effects
•  Maybe interacts with age
•  How to define altruism? How much you gave or whether you gave?
•  What tests, parametric or non-parametric?
•  Etc., even if hypothesis is more precise
•  Enough with one test and you are forking
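A rough way to see why the number of forks matters is to treat each fork as a separate, independent test at the 0.05 level and ask how likely at least one of them is to come out significant under the null. This is a deliberate simplification: real forks are correlated, and, as the last bullet notes, the problem arises even when only one test is actually run. The function name is illustrative.

```python
def chance_of_false_positive(k, alpha=0.05):
    """Probability that at least one of k independent null tests
    gives p < alpha."""
    return 1 - (1 - alpha) ** k


for k in (1, 5, 10, 20):
    print(f"{k:2d} forks: {chance_of_false_positive(k):.0%}")
```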

Priors

•  Probability of a hypothesis to be true (“prior”)
•  Typically subjective and inaccessible
•  Combination of low prior, low power and p<0.05 can be very misleading
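A worked example shows how misleading that combination can be. The numbers are hypothetical: a prior of p = 0.10, power of β = 0.20, and a significance level of α = 0.05, with β denoting power as in the formula p_E = p_1β_1 + (1 – p_1)α_1 earlier in the talk. The probability that a significant finding reflects a true hypothesis then follows from Bayes' rule:

```latex
\[
\mathrm{PPV}
  = \frac{p\,\beta}{p\,\beta + (1 - p)\,\alpha}
  = \frac{0.10 \times 0.20}{0.10 \times 0.20 + 0.90 \times 0.05}
  = \frac{0.020}{0.065}
  \approx 0.31
\]
```

Under these assumptions, a result with p<0.05 would have roughly a one-in-three chance of being true, far from the 95% that the p-value is often taken to imply.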

Other thoughts

•  Pre-analysis plans
•  Less obsession with p-values?
•  p<0.005
  –  See Benjamin et al. 2017 “Redefine Statistical Significance” Nature Human Behaviour
•  Higher power
•  Team science
  –  Munafo et al. 2017 Nature Human Behaviour

Next

•  Decision markets: Let the market decide which replications to perform or hypotheses to test
  –  Highest degree of uncertainty in the outcome?
  –  Highest disagreement?
  –  Highest chances of observing a suspected effect?
  –  Better survey measures
  –  PNAS papers in “social sciences”

Time-reversal heuristic