"the statistical replication crisis: paradoxes and scapegoats”

78
The Statistical Replication Crisis: Paradoxes and Scapegoats. Deborah G. Mayo, Virginia Tech & LSE. 10 May 2016


TRANSCRIPT

Page 1: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

The Statistical Replication Crisis: Paradoxes and Scapegoats

Deborah G. Mayo, Virginia Tech & LSE

10 May 2016

Page 2: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

2

American Statistical Association (ASA): Statement on P-values

“The statistical community has been deeply concerned about issues of reproducibility and replicability of scientific conclusions. …. much confusion and even doubt about the validity of science is arising. Such doubt can lead to radical choices such as…to ban P-values…

Misunderstanding or misuse of statistical inference is only one cause of the “reproducibility crisis” (Peng, 2015), but to our community, it is an important one.” (ASA 2016)

Page 3: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

3

Reforms without philosophy of statistics are blind

O Replication crises (in social science and biology) led to programs to restore credibility: fraud busting, reproducibility studies

O Taskforces, journalistic reforms, and debunking treatises

O Proposed methodological reforms: many welcome (preregistration), some quite radical

O The situation gives a practical spin to debates in philosophy of science and statistics

Page 4: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

4

Replication crisis in social psychology

O Diederik Stapel, the social psychologist who fabricated his data (2011)

O Investigating Stapel revealed a culture of verification bias; selective reporting was so common that such moves came to be called questionable research practices (QRPs)

O A string of high profile cases followed, as did proposed reforms, replication research

Page 5: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

5

“I see a train-wreck looming,” warned Daniel Kahneman, calling for a “daisy chain” of replication in Sept. 2012

OSC: Reproducibility Project: Psychology: 2011-15 (Science 2015): crowd-sourced effort to replicate 100 articles (led by Brian Nosek, U. VA)

Page 6: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

6

I was a philosophical observer at the ASA P-value “pow wow”

Page 7: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

7

“Don’t throw out the error control baby with the bad statistics bathwater” The American Statistician

Page 8: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

8

O Statistical significance tests are a small part of a rich set of “techniques for systematically appraising and bounding the probabilities … of seriously misleading interpretations of data” (Birnbaum 1970, p. 1033)

O These I call error statistical methods (or sampling theory)

Page 9: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

9

One rock in a shifting scene

O Birnbaum calls it the “one rock in a shifting scene” in statistical practice

O “Misinterpretations and abuses of tests, warned against by the very founders of the tools, shouldn’t be the basis for supplanting them with methods unable or less able to assess, control, and alert us to erroneous interpretations of data”

Page 10: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

10

Error Statistics

O Statistics: Collection, modeling, drawing inferences from data to claims about aspects of processes

O The inference may be in error

O It’s qualified by a claim about the methods’ capabilities to control and alert us to erroneous interpretations (error probabilities)

Page 11: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

11

“p-value. …to test the conformity of the particular data under analysis with H0 in some respect: …we find a function t = t(y) of the data, to be called the test statistic, such that

• the larger the value of t, the more inconsistent are the data with H0;

• the random variable T = t(Y) has a (numerically) known probability distribution when H0 is true.

…the p-value corresponding to any t as p = p(t) = P(T ≥ t; H0)”

(Mayo and Cox 2006, p. 81)
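To make the definition concrete, here is a minimal sketch (not from the slides) of this p-value computation for a one-sided z-test of H0: μ = μ0 with σ known; the function name, data, and parameter values are illustrative:

```python
import numpy as np
from scipy.stats import norm

def p_value(y, mu0, sigma):
    """p(t) = P(T >= t; H0), with t(y) the standardized sample mean."""
    n = len(y)
    t = (np.mean(y) - mu0) / (sigma / np.sqrt(n))  # test statistic t = t(y)
    return norm.sf(t)                              # survival function: P(T >= t) under H0

rng = np.random.default_rng(1)
y = rng.normal(loc=0.2, scale=1.0, size=50)        # data with a small true effect
print(p_value(y, mu0=0.0, sigma=1.0))
```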

Page 12: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

12

O “Clearly, if even larger differences than t occur fairly frequently under H0 (p-value is not small), there’s scarcely evidence of incompatibility 

O But a small p-value doesn’t warrant inferring a genuine effect H, let alone a scientific conclusion H*–as the ASA document correctly warns (Principle 3)”

 

Page 13: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

13

A paradox for significance test critics

Critic: It’s much too easy to get small P-values.

You: Why do they find it so difficult to replicate the small P-values others found?* 

Is it easy or is it hard?

*(Only 36 of 100 psychology experiments yielded small P-values in the Open Science Collaboration’s replication project in psychology)

Page 14: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

14

O R.A. Fisher: it’s easy to lie with statistics by selective reporting (he called it the “political principle”)

O Sufficient finagling (helped by Big Data)—cherry-picking, P-hacking, significance seeking—may practically guarantee a preferred claim C gets support, even if it’s unwarranted by evidence

O Note: rejecting a null is taken as support for some non-null claim C

Page 15: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

15

Severity Requirement: “Mere supporting instances are as a rule too cheap to be worth having” (Popper 1983, 130)

O If data x0 agree with a claim C, but the test procedure had little or no capability of finding flaws with C (even if the claim is incorrect), then x0 provide poor evidence for C

O Such a test fails a minimal requirement for a stringent or severe test

O My account: severe testing based on error statistics

Page 16: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

16

By and large the ASA doc highlights classic foibles

“In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result” (Fisher 1935, 14)

(“isolated” low P-value ≠> H: statistical effect)

Page 17: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

17

Statistical ≠> substantive (H ≠> H*)

“[A]ccording to Fisher, rejecting the null hypothesis is not equivalent to accepting the efficacy of the cause in question. The latter...requires obtaining more significant results when the experiment, or an improvement of it, is repeated at other laboratories or under other conditions” (Gigerenzer et al. 1989, 95-6)

Page 18: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

18

O Flaws in alternative H* have not been probed by the test,

O The inference from a statistically significant result to H* fails to pass with severity 

O “Merely refuting the null hypothesis is too weak to corroborate” substantive H*, “we have to have ‘Popperian risk’, ‘severe test’ [as in Mayo]” (Meehl and Waller 2002, 184)

Page 19: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

19

The ASA’s Six Principles

O (1) P-values can indicate how incompatible the data are with a specified statistical model

O (2) P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone

O (3) Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold

O (4) Proper inference requires full reporting and transparency

O (5) A p-value, or statistical significance, does not measure the size of an effect or the importance of a result

O (6) By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis

Page 20: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

20

The ASA’s Six Principles

O (1) P-values can indicate how incompatible the data are with a specified statistical model

O (2) P-values do NOT measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone

O (3) Scientific conclusions and business or policy decisions should NOT be based only on whether a p-value passes a specific threshold

O (4) Proper inference requires full reporting and transparency

O (5) A p-value, or statistical significance, does NOT measure the size of an effect or the importance of a result

O (6) By itself, a p-value does NOT provide a good measure of evidence regarding a model or hypothesis

Page 21: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

21

Two main views of the role of probability in inference (not in ASA doc)

O Probabilism. To assign a degree of probability, confirmation, support or belief in a hypothesis, given data x0. (e.g., Bayesian, likelihoodist)—with regard for inner coherency

O Performance. Ensure long-run reliability of methods, coverage probabilities (frequentist, behavioristic Neyman-Pearson)

Page 22: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

22

What happened to using probability to assess error probing capacity and severity?

O Neither “probabilism” nor “performance” directly captures it

O Good long-run performance is a necessary, not a sufficient, condition for severity

O That’s why frequentist methods can be shown to have howlers

 

Page 23: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

23

O Problems with selective reporting, cherry picking, stopping when the data look good, and P-hacking are not problems about long runs

O It’s that we cannot say about the case at hand that it has done a good job of avoiding the sources of misinterpreting data

Page 24: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

24

A claim C is not warranted _______

O Probabilism: unless C is true or probable (gets a probability boost, is made comparatively firmer)

O Performance: unless it stems from a method with low long-run error

O Probativism (severe testing): unless something (a fair amount) has been done to probe ways we can be wrong about C

Page 25: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

25

O If you assume probabilism is required for inference, then error probabilities are relevant for inference only by misinterpretation. False!

O I claim, error probabilities play a crucial role in appraising well-testedness

O It’s crucial to be able to say, C is highly believable or plausible but poorly tested

O Probabilists can allow for the distinct task of severe testing (you can have your cake and eat it)

Page 26: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

26

The ASA doc gives no sense of different tools for different jobs

O To use an eclectic toolbox in statistics, it’s important not to expect agreement on numbers from methods evaluating different things

O A p-value isn’t ‘invalid’ because it does not supply “the probability of the null hypothesis, given the finding” (the posterior probability of H0) (Trafimow and Marks,* 2015)

*Editors of the journal Basic and Applied Social Psychology ban P-values: don’t ask, don’t tell

Page 27: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

27

O Principle 2 says a p-value ≠ posterior but one doesn’t get the sense of its role in error probability control

O It’s not that I’m keen to defend many common uses of significance tests

O The criticisms are often based on misunderstandings; consequently so are many “reforms”

[Aside: Principle 2 says “P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.”]

Page 28: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

28

Biasing selection effects:

O One function of severity is to identify which selection effects are problematic (not all are)

O Biasing selection effects: when data or hypotheses are selected, generated, interpreted (or a test criterion is specified), such that the minimal severity requirement is violated, seriously altered or incapable of being assessed

Page 29: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

29

Nominal vs actual significance levels

The ASA correctly warns that “[c]onducting multiple analyses of the data and reporting only those with certain p-values” leads to spurious p-values (Principle 4)

“Suppose that twenty sets of differences have been examined, that one difference seems large enough to test and that this difference turns out to be ‘significant at the 5 percent level.’ ….The actual level of significance is not 5 percent, but 64 percent!” (Selvin, 1970, p. 104)

(From Morrison and Henkel’s The Significance Test Controversy, 1970!)
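Selvin’s 64 percent is easy to verify. A one-line check, assuming 20 independent tests of true nulls, each at the 0.05 level, with only the most significant reported:

```python
# Nominal vs. actual: the chance that at least one of 20 independent
# tests of true nulls reaches the 0.05 level.
p_actual = 1 - (1 - 0.05) ** 20
print(round(p_actual, 2))  # 0.64, not the nominal 0.05
```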

Page 30: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

30

O They were clear on the fallacy: blurring the “computed” or “nominal” significance level, and the “actual” level

O There are many more ways you can be wrong when hunting (the sample space differs); this may be OK for exploratory work

O Here’s a case where a p-value report is invalid

Page 31: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

31

You report: Such results would be difficult to achieve under the assumption of H0

When in fact such results are common under the assumption of H0

(Formally):

O You say Pr(P-value ≤ Pobs; H0) = Pobs (small)

O But in fact Pr(P-value ≤ Pobs; H0) = high
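A small simulation makes the same point (assumptions are illustrative: 20 independent true-null tests, with the smallest p-value reported):

```python
# Under H0 each p-value is Uniform(0,1); hunting through 20 of them
# and reporting the minimum makes "P-value <= 0.05" common, not rare.
import numpy as np

rng = np.random.default_rng(0)
reported = rng.uniform(size=(100_000, 20)).min(axis=1)  # smallest of 20 looks
print((reported <= 0.05).mean())  # ~0.64: Pr(reported P <= 0.05; H0) is high
```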

Page 32: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

32

Scapegoating

O Nowadays, we’re likely to see the tests blamed

O My view: Tests don’t kill inferences, people do

O Even worse are those statistical accounts where the abuse vanishes!

Page 33: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

33

On some views, taking account of biasing selection effects “defies scientific sense”

“Two problems that plague frequentist inference: multiple comparisons and multiple looks, or, as they are more commonly called, data dredging and peeking at the data. The frequentist solution to both problems involves adjusting the P-value…

But adjusting the measure of evidence because of considerations that have nothing to do with the data defies scientific sense, belies the claim of ‘objectivity’ that is often made for the P-value” (Goodman 1999, p. 1010)

(To his credit, he’s open about this; he co-directs the Meta-Research Innovation Center at Stanford)

Page 34: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

34

Technical activism isn’t free of philosophy

Ben Goldacre (of Bad Science), in a 2016 Nature article, is puzzled that bad statistical practices continue even in the face of the new “technical activism”:

“The editors at Annals of Internal Medicine,… repeatedly (but confusedly) argue that it is acceptable to identify “prespecified outcomes” [from results] produced after a trial began; ….they say that their expertise allows them to permit — and even solicit — undeclared outcome-switching.” (2016, 7)

Page 35: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

35

His paper: “Make journals report clinical trials properly”

O He shouldn’t close his eyes to the possibility that some of the pushback he’s seeing has a basis in statistical philosophy!

Page 36: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

36

Likelihood Principle (LP)

The vanishing act links to a pivotal disagreement in the philosophy of statistics battles

In probabilisms, the import of the data is via the ratios of likelihoods of hypotheses

P(x0;H1)/P(x0;H0) for x0 fixed

O They condition on the actual data

O Error probabilities take into account other outcomes that could have occurred but did not (sampling distribution)

Page 37: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

37

All error probabilities violate the LP (even without selection effects):

“Sampling distributions, significance levels, power, all depend on something more [than the likelihood function]–something that is irrelevant in Bayesian inference–namely the sample space.” (Lindley 1971, p. 436) 

“The LP implies…the irrelevance of predesignation, of whether a hypothesis was thought of beforehand or was introduced to explain known effects.” (Rosenkrantz 1977, p. 122)
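A classic textbook illustration of this disagreement (not from the slides; the numbers are the standard example): observing 9 heads in 12 Bernoulli trials under two designs, fixed n = 12 (binomial) versus tossing until the 3rd tail (negative binomial). The likelihood ratios agree, so the LP treats the two as evidentially identical, yet the error probabilities differ because the sample spaces differ:

```python
from scipy.stats import binom, nbinom

theta0, theta1 = 0.5, 0.75  # null and an alternative value of theta

# Likelihood ratios P(x0; H1)/P(x0; H0): equal in the two designs, since
# each design's constant cancels and both likelihoods are proportional
# to theta^9 * (1 - theta)^3.
lr_binomial = binom.pmf(9, 12, theta1) / binom.pmf(9, 12, theta0)
lr_negbinom = (nbinom.pmf(9, 3, 1 - theta1) /  # 9 heads before the 3rd tail
               nbinom.pmf(9, 3, 1 - theta0))
print(lr_binomial, lr_negbinom)                # both ~4.81

# Error probabilities consult the sample space, so they differ:
print(binom.sf(8, 12, theta0))      # P(X >= 9; H0) ~= 0.073 (fixed n = 12)
print(nbinom.sf(8, 3, 1 - theta0))  # P(K >= 9; H0) ~= 0.033 (stop at 3rd tail)
```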

Page 38: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

38

Another paradox enters

O ASA report: “In view of the prevalent misuses of and misconceptions concerning p-values, some statisticians prefer to supplement or even replace p-values with other approaches. These include…confidence intervals; Bayesian methods,…likelihood ratios or Bayes Factors…”

O However, the same p-hacked hypothesis can occur in likelihood ratios, Bayes factors, and Bayesian updating, with one big difference:

Page 39: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

39

Does principle 4 hold for other approaches?

O “The direct grounds to criticize inferences as flouting error statistical control is lost”

O An 11th hour point of controversy: whether to retain “full reporting and transparency” (principle 4) for all methods

O Or should it apply only to “p-values and related statistics”?

Page 40: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

40

How might probabilists block intuitively unwarranted inferences (without error probabilities)? A subjective Bayesian might say:

If our beliefs were mixed into the interpretation of the evidence, we wouldn’t declare there’s statistical evidence of some unbelievable claim (distinguishing shades of grey and being politically moderate, ovulation and voting preferences)

Page 41: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

41

Rescued by beliefs

O That could work in some cases (it still wouldn’t show what researchers had done wrong)—battle of beliefs

O Besides, researchers sincerely believe their hypotheses 

O Now you’ve got two sources of flexibility, priors and biasing selection effects

Page 42: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

42

No help with our most important problem

O How to distinguish the warrant for a single hypothesis H with different methods

(e.g., one has biasing selection effects, another, pre-registered results and precautions)?

Page 43: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

43

Most Bayesians are “conventional”

O Eliciting subjective priors is too difficult; scientists are reluctant to allow subjective beliefs to overshadow data

O Default, or reference priors are supposed to prevent prior beliefs from influencing the posteriors (O-Bayesians, 2006)

O A classic conundrum: no general non-informative prior exists, so most are conventional

Page 44: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

44

O “The priors are not to be considered expressions of uncertainty, ignorance, or degree of belief. Conventional priors may not even be probabilities…” (Cox and Mayo 2010, p. 299)

O Conventional Bayesian Reforms are touted as free of selection effects

O “Bayes factors can be used in the complete absence of a sampling plan” (Bayarri, Benjamin, Berger, Sellke 2016)  

Page 45: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

45

Granted, some are prepared to abandon the LP for model testing

In an attempted meeting of the minds Andrew Gelman and Cosma Shalizi say:

O “[C]rucial parts of Bayesian data analysis, such as model checking, can be understood as ‘error probes’ in Mayo’s sense,” which might be seen as using modern statistics to implement the Popperian criteria of severe tests (2013, 10)

O Fisherian tests are fine here

Page 46: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

46

The Problem is with so-called NHST

O NHSTs encourage the move from H to H*: they supposedly allow moving from the statistical to the substantive

O If defined that way, they exist only as abuses of tests

O ASA doc ignores Neyman-Pearson (N-P) tests

Page 47: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

47

Neyman-Pearson (N-P) tests:

Null and alternative hypotheses H0 and H1 that exhaust the parameter space

O So the fallacy of rejection (moving from H to H*) is impossible

O Rejecting the null only indicates statistical alternatives

Page 48: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

48

P-values don’t report effect sizes (Principle 5)

Who ever said to just report a P-value?

O “Tests should be accompanied by interpretive tools that avoid the fallacies of rejection and non-rejection. These correctives can be articulated in either Fisherian or Neyman-Pearson terms” (Mayo and Cox 2006, Mayo and Spanos 2006)

Page 49: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

49

To avoid inferring a discrepancy beyond what’s warranted:

Large n problem.

O Severity tells us: an α-significant difference is indicative of less of a discrepancy from the null if it results from a larger (n1) rather than a smaller (n2) sample size (n1 > n2)
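A minimal severity sketch of the large-n point (assumed setup: test T+ with H0: μ ≤ 0 vs. H1: μ > 0, σ = 1; function names and numbers are illustrative):

```python
import numpy as np
from scipy.stats import norm

def sev_mu_greater(m0, mu1, sigma, n):
    """SEV(mu > mu1) = P(M < m0; mu = mu1), with m0 the observed mean."""
    return norm.cdf((m0 - mu1) / (sigma / np.sqrt(n)))

mu0, sigma, mu1 = 0.0, 1.0, 0.2
for n in (100, 10_000):
    m0 = mu0 + 1.96 * sigma / np.sqrt(n)  # a just-significant result (alpha = .025)
    print(n, round(sev_mu_greater(m0, mu1, sigma, n), 3))
# n = 100:    SEV(mu > 0.2) ~= 0.484, poor warrant for so large a discrepancy
# n = 10,000: SEV(mu > 0.2) ~= 0.000, the "burnt toast" alarm of the next slide
```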

Page 50: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

50

O What’s more indicative of a large effect (fire), a fire alarm that goes off with burnt toast or one so insensitive that it doesn’t go off unless the house is fully ablaze?

[The larger sample size is like the one that goes off with burnt toast]

Page 51: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

51

What about the fallacy of non-significant results?

This gets to my experience with the Open Science Collaboration replication project in psychology

O Non-Replication occurs with non-significant results, but there’s much confusion in interpreting them

O Given the focus on NHST, many take nonsignificance as uninformative

Page 52: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

52

O Insignificant results don’t warrant inferring a 0 discrepancy

O Use the same severity reasoning to rule out discrepancies that very probably would have resulted in a larger difference than observed: set upper bounds

O “If you very probably would have observed a more impressive (smaller) p-value than you did, if μ > μ1 (where μ1 = μ0 + γ), then the data are good evidence that μ ≤ μ1”

O Akin to power analysis (Cohen, Neyman) but sensitive to x0
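A sketch of this upper-bound reasoning (assumed setup: T+ with H0: μ ≤ 0, σ = 1, n = 100, and an observed mean m0 = 0.1, i.e., a non-significant z of 1; all numbers illustrative):

```python
import numpy as np
from scipy.stats import norm

def sev_mu_leq(m0, mu1, sigma, n):
    """SEV(mu <= mu1) = P(M > m0; mu = mu1): the probability of a worse
    fit with H0 (a smaller p-value) were the discrepancy mu1 to exist."""
    return norm.sf((m0 - mu1) / (sigma / np.sqrt(n)))

m0, sigma, n = 0.1, 1.0, 100
for mu1 in (0.15, 0.2, 0.3):
    print(mu1, round(sev_mu_leq(m0, mu1, sigma, n), 3))
# mu <= 0.30 passes with severity ~0.977, mu <= 0.20 with ~0.841,
# mu <= 0.15 with only ~0.691: large discrepancies are ruled out, not all.
```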

Page 53: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

53

Improves on confidence intervals

This is akin to confidence intervals (which are dual to tests), but we get around their shortcomings:

O We do not fix a single confidence level

O The evidential warrant for different points in any interval is distinguished

O We go beyond the “performance goal” to give an inferential construal

Page 54: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

54

Fraudbusting

O Uri Simonsohn exposed suspicious data in the work of social psychologists Dirk Smeesters, Lawrence Sanna (2011-12); Jens Förster (2014-15)

O Detecting something suspicious in particular data

O “Fabricated data detected by statistics alone” (Simonsohn 2013)

Page 55: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

55

P-values can’t be trusted except when used to argue that P-values can’t be trusted

O Paradoxically, significance test critics agree with fraudbusting using tests they purport to reject

O “The methodology here is not new. It goes back to Fisher (founder of modern statistics) in the 30’s. … The tests of goodness of fit were again and again, too good. …using a trick invented by Fisher for the purpose, and well known to all statisticians.” (Richard D. Gill)

Page 56: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

56

OSC: Reproducibility Project: Psychology: 2011-15 (Science 2015) (led by Brian Nosek, U. VA)

O Crowd-sourced: replicators selected 100 articles from three journals (2008) to try to replicate using the same method as the initial research: direct replication

O Aimed to avoid fraudbusting in closed committees

Page 57: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

57

Repligate: Witch hunting or good science?

O Open replications also have pushback

O Daniel Gilbert (psychology at Harvard): “so-called replicators” are “shameless little bullies” and “second stringers” who engage in tactics “out of Senator Joe McCarthy’s playbook”

O Daniel Kahneman: New Rule: consult with the original authors for “hidden secrets” that only the original authors would know

Page 58: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

58

Work completed in Aug 2015

O Does a negative replication mean the original was a false positive?

O Advantages of the Replications:

–Preregistered, designed to have high power

–Free of “perverse incentives” of usual research: guaranteed to be published

– No file drawers

Page 59: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

59

Vindication for the P-value?

“The findings also offered some support for the oft-criticized statistical tool known as the P-value …a low P value was fairly predictive of which psychology studies could be replicated.” (Smithsonian, 2015)

O I deny all we need is to lower the required P-value (as some reformers claim)

O “…replication attempts find it difficult to get small p-values with preregistered results. This shows the problem isn’t p-values but failing to adjust them for cherry picking, multiple testing, etc.”

Page 60: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

60

Highly impressive effort…but

O Fidelity questions (replications differ from originals): a new round in Science 2016

O My blogpost: Non-significance is the new significance (a new perverse incentive?)

O My real problem…. They stick to what might be called “purely statistical” issues: can we get a low P-value or not?

Page 61: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

61

Non-replications construed as simply weaker effects

O One of the non-replications: cleanliness and morality: does unscrambling soap words make you less judgmental?

“Ms. Schnall had 40 undergraduates unscramble some words. One group unscrambled words that suggested cleanliness (pure, immaculate, pristine), while the other group unscrambled neutral words. They were then presented with a number of moral dilemmas, like whether it’s cool to eat your dog after it gets run over by a car. …

Page 62: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

62

…Subjects who had unscrambled clean words weren’t as harsh on the guy who chows down on his chow.” (Chronicle of Higher Education)

(situated cognition effect?)

O By focusing on the P-values, they ignore the larger question of the methodological adequacy of the leap from the statistical to the substantive.

O Are they even measuring the phenomenon they intend? Is the result due to the “treatment”?

Page 63: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

63

Coming to a philosophy department near you (experimental philosophy)

2 that didn’t replicate in psychology:

O Belief in free will and cheating

O Physical distance (of points plotted) and emotional closeness

I encourage meta-methodological scrutiny by philosophers

Page 64: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

64

Concluding remarks

O I end my commentary: “Failing to understand the correct (if limited) role of simple significance tests threatens to throw the error control baby out with the bad statistics bathwater

O Don’t scapegoat methods based on cookbook uses long lampooned”

O Makes no sense to banish tools for testing assumptions, fraudbusting, replication research

Page 65: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

65

O Recognize different roles of probability: probabilism, long run performance, probativism (severe testing)

O Don’t expect an agreement on numbers from methods evaluating different things

O Probabilisms may enable rather than block illicit inferences due to biasing selection effects

O Main paradox of the “replication crisis”

Page 66: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

66

Future ASA project

O Look at the “other approaches” (Bayes factors, LRs, Bayesian updating)

O What is it for a replication to succeed or fail on those approaches?

(can’t be just a matter of prior beliefs in the hypotheses)

Page 67: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

67

Error statistical improvements are needed

O An inferential construal of error probabilities wasn’t clearly given (Birnbaum): my goal

O It’s not long-run error control (performance), but severely probing flaws today 

O The account needs to say more about how to use background information

Page 68: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

68

Recognize that often better statistics cannot help

O One hypothesis must always be: our results point to the inability of our study to severely probe the phenomenon of interest

O Be ready to admit questionable science: being unable to distinguish a poorly run study from a poor hypothesis

Page 69: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

69

2 that didn’t replicate

O Belief in free will and cheating

“It found that participants who read a passage arguing that their behavior is predetermined were more likely than those who had not read the passage to cheat on a subsequent test.”

O Physical distance (of points plotted) and emotional closeness

“Volunteers asked to plot two points that were far apart on graph paper later reported weaker emotional attachment to family members, compared with subjects who had graphed points close together.” (NYT)

Page 70: "The Statistical Replication Crisis: Paradoxes and Scapegoats”
Page 71: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

71

Mayo and Cox (2010): Frequentist Principle of Evidence (FEV); Mayo and Spanos (2006): SEV

FEV/SEV (insignificant result): A moderate P-value is evidence of the absence of a discrepancy δ from H0, only if there is a high probability the test would have given a worse fit with H0 (i.e., d(X) > d(x0)) were a discrepancy δ to exist

FEV/SEV (significant result): d(X) > d(x0) is evidence of a discrepancy δ from H0, if and only if there is a high probability the test would have d(X) < d(x0) were a discrepancy as large as δ absent

Page 72: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

72

Test T+: Normal testing: H0: μ ≤ μ0 vs. H1: μ > μ0, σ known

(FEV/SEV): If d(x) is not statistically significant, then μ < M0 + kε σ/√n passes the test T+ with severity (1 – ε)

(FEV/SEV): If d(x) is statistically significant, then μ > M0 – kε σ/√n passes the test T+ with severity (1 – ε)

where M0 is the observed sample mean and P(d(X) > kε) = ε
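A quick numerical check of these two bounds (a sketch; the setup μ0 = 0, σ = 1, n = 25, ε = 0.05 and the observed means are illustrative):

```python
import numpy as np
from scipy.stats import norm

mu0, sigma, n, eps = 0.0, 1.0, 25, 0.05
k = norm.isf(eps)            # k_eps with P(d(X) > k_eps) = eps, ~1.645
se = sigma / np.sqrt(n)      # sigma/sqrt(n) = 0.2

M0 = 0.2                     # observed mean, not significant (d = 1.0)
print(norm.sf((M0 - (M0 + k * se)) / se))   # SEV(mu <= M0 + k*se) = 0.95

M0 = 0.5                     # observed mean, significant (d = 2.5)
print(norm.cdf((M0 - (M0 - k * se)) / se))  # SEV(mu > M0 - k*se) = 0.95
```

Both lines print 1 – ε = 0.95, matching the stated severities.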

Page 73: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

73

References

O Armitage, P. 1962. “Contribution to Discussion.” In The Foundations of Statistical Inference: A Discussion, edited by L. J. Savage. London: Methuen.

O Bayarri, M., Benjamin, D., Berger, J., and Sellke, T. (forthcoming). “Rejection Odds and Rejection Ratios: A Proposal for Statistical Practice in Testing Hypotheses.” Journal of Mathematical Psychology. Invited paper for special issue on “Bayesian hypothesis testing.”

O Berger, J. O. 2003. “Could Fisher, Jeffreys and Neyman Have Agreed on Testing?” and “Rejoinder.” Statistical Science 18(1): 1-12; 28-32.

O Berger, J. O. 2006. “The Case for Objective Bayesian Analysis.” Bayesian Analysis 1(3): 385-402.

O Birnbaum, A. 1970. “Statistical Methods in Scientific Inference (letter to the Editor).” Nature 225(5237) (March 14): 1033.

O Box, G. 1983. “An Apology for Ecumenism in Statistics.” In Scientific Inference, Data Analysis, and Robustness, edited by G. E. P. Box, T. Leonard, and C. F. J. Wu, 51-84. New York: Academic Press.

O Cox, D. R. and Hinkley, D. 1974. Theoretical Statistics. London: Chapman and Hall.

O Cox, D. R. and Mayo, D. G. 2010. “Objectivity and Conditionality in Frequentist Inference.” In Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, edited by Deborah G. Mayo and Aris Spanos, 276-304. Cambridge: Cambridge University Press.

O Efron, B. 2013. “A 250-Year Argument: Belief, Behavior, and the Bootstrap.” Bulletin of the American Mathematical Society 50(1): 126-46.

Page 74: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

74

O Fisher, R. A. 1935. The Design of Experiments. Edinburgh: Oliver and Boyd.

O Fisher, R. A. 1955. “Statistical Methods and Scientific Induction.” Journal of the Royal Statistical Society, Series B (Methodological) 17(1) (January 1): 69-78.

O Gelman, A. and Shalizi, C. 2013. “Philosophy and the Practice of Bayesian Statistics” and “Rejoinder.” British Journal of Mathematical and Statistical Psychology 66(1): 8-38; 76-80.

O Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., Beatty, J., and Krüger, L. 1989. The Empire of Chance. Cambridge: Cambridge University Press.

O Gilbert, D. Twitter post: https://twitter.com/dantgilbert/status/470199929626193921

O Gill, R. D. 2014. Comment on “Suspicion of Scientific Misconduct by Jens Förster” by Neuroskeptic, May 6, 2014, on the Discover Magazine blog: http://blogs.discovermagazine.com/neuroskeptic/2014/05/06/suspicion-misconduct-forster/#.Vynr3j-scQ0

O Goldacre, B. 2008. Bad Science. HarperCollins Publishers.

O Goldacre, B. 2016. “Make journals report clinical trials properly.” Nature 530(7588): 7; online 04 Feb 2016.

O Goodman, S. N. 1999. “Toward Evidence-Based Medical Statistics. 2: The Bayes Factor.” Annals of Internal Medicine 130: 1005-1013.

O Handwerk, B. 2015. “Scientists Replicated 100 Psychology Studies, and Fewer than Half Got the Same Results.” Smithsonian Magazine (August 27, 2015). http://www.smithsonianmag.com/science-nature/scientists-replicated-100-psychology-studies-and-fewer-half-got-same-results-180956426/?no-ist

O Hasselman, F. and Mayo, D. 2015, April 17. “seveRity” (R program). Retrieved from osf.io/k6w3h

Page 75: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

75

O Levelt Committee, Noort Committee, Drenth Committee. 2012. “Flawed Science: The Fraudulent Research Practices of Social Psychologist Diederik Stapel.” Stapel Investigation: Joint Tilburg/Groningen/Amsterdam investigation of the publications by Mr. Stapel. (https://www.commissielevelt.nl/)

O Lindley, D. V. 1971. “The Estimation of Many Parameters.” In Foundations of Statistical Inference, edited by V. P. Godambe and D. A. Sprott, 435-455. Toronto: Holt, Rinehart and Winston.

O Mayo, D. G. 1996. Error and the Growth of Experimental Knowledge. Science and Its Conceptual Foundations. Chicago: University of Chicago Press.

O Mayo, D. G. 2016. “Don’t Throw Out the Error Control Baby with the Bad Statistics Bathwater: A Commentary.” The American Statistician, online March 7, 2016. http://www.tandfonline.com/doi/pdf/10.1080/00031305.2016.1154108

O Mayo, D. G. Error Statistics Philosophy blog: errorstatistics.com

O Mayo, D. G. and Cox, D. R. 2010. “Frequentist Statistics as a Theory of Inductive Inference.” In Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science, edited by D. Mayo and A. Spanos, 1-27. Cambridge: Cambridge University Press. This paper appeared in The Second Erich L. Lehmann Symposium: Optimality, 2006, Lecture Notes-Monograph Series, Volume 49, Institute of Mathematical Statistics, pp. 247-275.

O Mayo, D. G. and Spanos, A. 2006. “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction.” British Journal for the Philosophy of Science 57(2) (June 1): 323-357.

O Mayo, D. G. and Spanos, A. 2011. “Error Statistics.” In Philosophy of Statistics, edited by Prasanta S. Bandyopadhyay and Malcolm R. Forster, 7:152-198. Handbook of the Philosophy of Science. The Netherlands: Elsevier.

Page 76: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

76

O Meehl, P. E. and Waller, N. G. 2002. “The Path Analysis Controversy: A New Statistical Approach to Strong Appraisal of Verisimilitude.” Psychological Methods 7(3): 283-300.

O Morrison, D. E. and Henkel, R. E., eds. 1970. The Significance Test Controversy: A Reader. Chicago: Aldine De Gruyter.

O Open Science Collaboration (Nosek, B. et al.). 2015. “Estimating the Reproducibility of Psychological Science.” Science 349(6251).

O Pearson, E. S. and Neyman, J. 1930. “On the Problem of Two Samples.” In Joint Statistical Papers by J. Neyman and E. S. Pearson, 99-115. Berkeley: University of California Press. First published in Bull. Acad. Pol. Sci. 73-96.

O Rosenkrantz, R. 1977. Inference, Method and Decision: Towards a Bayesian Philosophy of Science. Dordrecht, The Netherlands: D. Reidel.

O Savage, L. J. 1962. The Foundations of Statistical Inference: A Discussion. London: Methuen.

O Selvin, H. 1970. “A Critique of Tests of Significance in Survey Research.” In The Significance Test Controversy, edited by D. Morrison and R. Henkel, 94-106. Chicago: Aldine De Gruyter.

O Simonsohn, U. 2013. “Just Post It: The Lesson from Two Cases of Fabricated Data Detected by Statistics Alone.” Psychological Science 24(10): 1875-1888.

O Smithsonian Magazine (see Handwerk).

O Trafimow, D. and Marks, M. 2015. “Editorial.” Basic and Applied Social Psychology 37(1): 1-2.

O Wasserstein, R. and Lazar, N. 2016. “The ASA’s Statement on P-values: Context, Process, and Purpose.” The American Statistician.

Link to ASA statement & commentaries (under supplemental): http://amstat.tandfonline.com/doi/abs/10.1080/00031305.2016.1154108

Page 77: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

77

Jimmy Savage on the LP:

“According to Bayes' theorem,…. if y is the datum of some other experiment, and if it happens that P(x|µ) and P(y|µ) are proportional functions of µ (that is, constant multiples of each other), then each of the two data x and y have exactly the same thing to say about the values of µ…” (Savage 1962, p. 17)

Page 78: "The Statistical Replication Crisis: Paradoxes and Scapegoats”

78

Abstract: Mounting failures of replication in the social and biological sciences give a practical spin to statistical foundations in the form of the question: How can we attain reliability when Big Data methods make illicit cherry-picking and significance seeking so easy? Researchers, professional societies, and journals are increasingly getting serious about methodological reforms to restore scientific integrity – some are quite welcome (e.g., pre-registration), while others are quite radical. The American Statistical Association convened members from differing tribes of frequentists, Bayesians, and likelihoodists to codify misuses of P-values. Largely overlooked are the philosophical presuppositions of both criticisms and proposed reforms. Paradoxically, alternative replacement methods may enable rather than reveal illicit inferences due to cherry-picking, multiple testing, and other biasing selection effects. Crowd-sourced reproducibility research in psychology is helping to change the reward structure but has its own shortcomings. Focusing on purely statistical considerations, it tends to overlook problems with artificial experiments. Without a better understanding of the philosophical issues, we can expect the latest reforms to fail.