LLNL-PRES-670181 This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC
Everything Wrong with Statistics (and How to Fix It)
Kristin P. Lennox
Director of Statistical Consulting
July 29, 2015
Lawrence Livermore National Laboratory LLNL-PRES-670181 2
Crisis!
PLoS Medicine | www.plosmedicine.org 0696
Essay
Open access, freely available online
August 2005 | Volume 2 | Issue 8 | e124
Published research findings are sometimes refuted by subsequent evidence, with ensuing confusion and disappointment. Refutation and controversy is seen across the range of research designs, from clinical trials and traditional epidemiological studies [1–3] to the most modern molecular research [4,5]. There is increasing concern that in modern research, false findings may be the majority or even the vast majority of published research claims [6–8]. However, this should not be surprising. It can be proven that most claimed research findings are false. Here I will examine the key factors that influence this problem and some corollaries thereof.
Modeling the Framework for False Positive Findings
Several methodologists have pointed out [9–11] that the high rate of nonreplication (lack of confirmation) of research discoveries is a consequence of the convenient, yet ill-founded strategy of claiming conclusive research findings solely on the basis of a single study assessed by formal statistical significance, typically for a p-value less than 0.05. Research is not most appropriately represented and summarized by p-values, but, unfortunately, there is a widespread notion that medical research articles should be interpreted based only on p-values. Research findings are defined here as any relationship reaching formal statistical significance, e.g., effective interventions, informative predictors, risk factors, or associations. “Negative” research is also very useful. “Negative” is actually a misnomer, and the misinterpretation is widespread. However, here we will target relationships that investigators claim exist, rather than null findings.
As has been shown previously, the probability that a research finding is indeed true depends on the prior probability of it being true (before doing the study), the statistical power of the study, and the level of statistical significance [10,11]. Consider a 2 × 2 table in which research findings are compared against the gold standard of true relationships in a scientific field. In a research field both true and false hypotheses can be made about the presence of relationships. Let R be the ratio of the number of “true relationships” to “no relationships” among those tested in the field. R is characteristic of the field and can vary a lot depending on whether the field targets highly likely relationships or searches for only one or a few true relationships among thousands and millions of hypotheses that may be postulated. Let us also consider, for computational simplicity, circumscribed fields where either there is only one true relationship (among many that can be hypothesized) or the power is similar to find any of the several existing true relationships. The pre-study probability of a relationship being true is R/(R + 1). The probability of a study finding a true relationship reflects the power 1 − β (one minus the Type II error rate). The probability of claiming a relationship when none truly exists reflects the Type I error rate, α. Assuming that c relationships are being probed in the field, the expected values of the 2 × 2 table are given in Table 1. After a research finding has been claimed based on achieving formal statistical significance, the post-study probability that it is true is the positive predictive value, PPV. The PPV is also the complementary probability of what Wacholder et al. have called the false positive report probability [10]. According to the 2 × 2 table, one gets PPV = (1 − β)R/(R − βR + α). A research finding is thus
The Essay section contains opinion pieces on topics of broad interest to a general medical audience.
Why Most Published Research Findings Are False
John P. A. Ioannidis
Citation: Ioannidis JPA (2005) Why most published research findings are false. PLoS Med 2(8): e124.
Copyright: © 2005 John P. A. Ioannidis. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abbreviation: PPV, positive predictive value
John P. A. Ioannidis is in the Department of Hygiene and Epidemiology, University of Ioannina School of Medicine, Ioannina, Greece, and Institute for Clinical Research and Health Policy Studies, Department of Medicine, Tufts-New England Medical Center, Tufts University School of Medicine, Boston, Massachusetts, United States of America.
Competing Interests: The author has declared that no competing interests exist.
DOI: 10.1371/journal.pmed.0020124
Summary: There is increasing concern that most current published research findings are false. The probability that a research claim is true may depend on study power and bias, the number of other studies on the same question, and, importantly, the ratio of true to no relationships among the relationships probed in each scientific field. In this framework, a research finding is less likely to be true when the studies conducted in a field are smaller; when effect sizes are smaller; when there is a greater number and lesser preselection of tested relationships; where there is greater flexibility in designs, definitions, outcomes, and analytical modes; when there is greater financial and other interest and prejudice; and when more teams are involved in a scientific field in chase of statistical significance. Simulations show that for most study designs and settings, it is more likely for a research claim to be false than true. Moreover, for many current scientific fields, claimed research findings may often be simply accurate measures of the prevailing bias. In this essay, I discuss the implications of these problems for the conduct and interpretation of research.
It can be proven that most claimed research findings are false.
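The PPV formula quoted in the excerpt, PPV = (1 − β)R/(R − βR + α), can be checked numerically. The following is a minimal sketch in Python; the function name and the example values of R, α, and β are assumptions chosen for illustration and are not taken from the essay's tables.

```python
def positive_predictive_value(R, alpha=0.05, beta=0.2):
    """Post-study probability that a claimed finding is true.

    R     : ratio of true relationships to no relationships among those tested
    alpha : Type I error rate
    beta  : Type II error rate (power is 1 - beta)

    Implements PPV = (1 - beta) * R / (R - beta * R + alpha).
    """
    return (1 - beta) * R / (R - beta * R + alpha)


# Illustrative (assumed) pre-study odds, from even odds down to a long shot.
for R in (1.0, 0.1, 0.01):
    ppv = positive_predictive_value(R)
    print(f"R = {R:<5} PPV = {ppv:.3f}")
```

Setting PPV > 1/2 in the formula and rearranging gives (1 − β)R > α, so under these assumed error rates a claimed finding is more likely true than false only when R exceeds α/(1 − β).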
What’s going on?
Statistics is popular and important!
Statisticians are rare.
Statistics training isn’t working.
STAT 101 is Procedural
1. Check your data type
2. Select inference method
3. Calculate required sample statistics
4. Look up critical values
…
N. Report result
Real Statistics Isn’t
Comprehensive Plan for Reform of All Statistics
1) Show the problems with “cookbook statistics”
2) Demonstrate real statistical thinking
3) Help as needed
Golden Rules of Statistics (What Statisticians REALLY Do)
• Know thy problem.
• Know thy tools.
• Know thy data.
Know Thy Problem
STAT 101: Determine the appropriate analysis by looking at the data.
E.g., two numeric variables = linear regression
Appropriate data AND appropriate analysis depend on the real world problem.
The Million Dollar Binomial Distribution
Know Thy Tools
STAT 101: Statistical methods are selected according to the appropriateness to the data and correctness of assumptions.
STAT 101: Statistical procedures, used correctly, yield unambiguous results.
Statistical models work the same way that other scientific and engineering models work. Their validity depends on context, and they may be open to interpretation.
Statistical Methods are Based on Models
$y_i = b_0 + b_1 x_i + \varepsilon_i$
[Figure: scatter plot of y (0 to 150) against x (0 to 50); density plot over the range −4 to 4]
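To make the model on this slide concrete, the sketch below simulates data from y_i = b_0 + b_1 x_i + ε_i and recovers the coefficients by ordinary least squares. It is an illustration only: the coefficient values, the noise level, the Gaussian form of ε, and the function names are all assumptions, not values read from the slide.

```python
import random


def simulate(n, b0, b1, sigma, rng):
    """Draw n points from y = b0 + b1 * x + eps, with eps ~ Normal(0, sigma)."""
    xs = [rng.uniform(0, 50) for _ in range(n)]
    ys = [b0 + b1 * x + rng.gauss(0, sigma) for x in xs]
    return xs, ys


def least_squares(xs, ys):
    """Closed-form ordinary least squares fit of a straight line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


rng = random.Random(0)  # fixed seed so the run is repeatable
xs, ys = simulate(n=50, b0=10.0, b1=2.5, sigma=10.0, rng=rng)
b0_hat, b1_hat = least_squares(xs, ys)
print(f"intercept estimate: {b0_hat:.2f}, slope estimate: {b1_hat:.2f}")
```

The fit itself does not use the Gaussian assumption; that assumption matters for the intervals and tests usually reported alongside it, which is where the model behind the method comes into play.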
Statistical Methods are Based on Models
$y_i = b_0 + b_1 x_i + \varepsilon_i$
[Figure: scatter plot of y (0 to 150) against x (0 to 50); density plot over the range −5 to 15]
A Wise Man Once Said…
“Essentially, all models are wrong, but some are useful. ” – George E. P. Box
How to Evaluate Explosives Safety
A METHOD FOR OBTAINING AND ANALYZING SENSITIVITY DATA*
W. J. DIXON, University of Oregon
AND
A. M. MOOD, Iowa State College
The standard method of dealing with sensitivity of dosage-mortality data is the probit technique developed by Bliss and Fisher. This paper provides an alternative technique based on a special system for obtaining such data. It has some advantages when observations must be taken on individuals rather than groups of individuals, and it may be preferred in certain other situations.
INTRODUCTION
Experimental investigations often deal with continuous variables which cannot be measured in practice. For example, in testing the sensitivity of explosives to shock, a common procedure is to drop a weight on specimens of the same explosive mixture from various heights. There are heights at which some specimens will explode, and others will not, and it is assumed that those which will not explode would explode were the weight dropped from a sufficiently greater height. It is supposed, therefore, that there is a critical height associated with each specimen, and that the specimen will explode when the weight is dropped from a greater height and will not explode when the weight is dropped from a lesser height. The population of specimens is thus characterized by a continuous variable - the critical height - which cannot be measured. All one can do is select some height arbitrarily and determine whether the critical height for a given specimen is less than or greater than the selected height.
This situation arises in many fields of research. Thus in testing insecticides, a critical dose is associated with each insect, but one cannot measure it. He can only try some dose and observe whether or not the insect is killed, that is, observe whether the critical dose for that insect is less than or greater than the chosen dose. The same difficulty arises in pharmaceutical research dealing with germicides, anesthetics,
* This paper is in part an adaptation of a memorandum submitted to the Applied Mathematics Panel by the Statistical Research Group, Princeton University. The Statistical Research Group operated under a contract with the Office of Scientific Research and Development, and was directed by the Applied Mathematics Panel of the National Defense Research Committee.
[Figure: Up-and-Down Test Demo. Normalized height (−2 to 2) against test number (0 to 60), with each test marked x or o]
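The procedure described in the excerpt can be sketched as a short simulation. This is an illustration under stated assumptions, not Dixon and Mood's analysis: critical heights are assumed to be normally distributed, the step size is assumed equal to the standard deviation, and the estimate shown is a crude average of the tested heights rather than the estimator from the paper. All names and parameter values are hypothetical.

```python
import random


def up_and_down(n_tests, start, step, mu, sigma, rng):
    """Simulate an up-and-down sensitivity test.

    Each specimen has a hidden critical height drawn from Normal(mu, sigma).
    It explodes if the drop height exceeds its critical height. After an
    explosion the next drop is one step lower, otherwise one step higher.
    Returns a list of (height, exploded) pairs.
    """
    height = start
    record = []
    for _ in range(n_tests):
        critical = rng.gauss(mu, sigma)  # never observed directly in practice
        exploded = critical < height
        record.append((height, exploded))
        height = height - step if exploded else height + step
    return record


rng = random.Random(1)  # fixed seed so the run is repeatable
record = up_and_down(n_tests=2000, start=2.0, step=1.0, mu=0.0, sigma=1.0, rng=rng)

# Crude estimate of the mean critical height: the average tested height.
mean_height = sum(h for h, _ in record) / len(record)
print(f"average tested height: {mean_height:.3f} (true mean is 0.0)")
```

Because the rule moves down after an explosion and up after a non-explosion, the tested heights cluster around the middle of the distribution of critical heights, which is why the design is informative about the mean and carries little direct information about the tails.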
How NOT to Evaluate Explosives Safety
“…the up and down method is particularly effective for estimating the mean. It is not a good method for estimating small or large percentage points (for example, the height at which 99 per cent of specimens explode) unless normality of the distribution is assured.” – Dixon and Mood
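The caution in the quotation can be illustrated with a small calculation. The sketch below compares the height at which a given fraction of specimens would explode under two distributions with the same mean and standard deviation. The logistic distribution is used here only as a hypothetical alternative to the normal; it is an assumption made for illustration and is not discussed in the quoted paper.

```python
import math
from statistics import NormalDist


def normal_quantile(p, mean=0.0, sd=1.0):
    """Height below which a fraction p of critical heights fall, normal case."""
    return NormalDist(mean, sd).inv_cdf(p)


def logistic_quantile(p, mean=0.0, sd=1.0):
    """Same quantile for a logistic distribution with matching mean and sd."""
    scale = sd * math.sqrt(3) / math.pi  # logistic variance is scale^2 * pi^2 / 3
    return mean + scale * math.log(p / (1 - p))


for p in (0.5, 0.99, 0.999):
    print(f"p = {p}: normal {normal_quantile(p):.3f}, logistic {logistic_quantile(p):.3f}")
```

The two distributions agree at the median and drift apart in the tails, so an extreme percentage point extrapolated from a well-estimated mean and standard deviation is only as good as the distributional assumption behind it.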
A Note on Statistical Significance
(The following statements reflect only the author’s opinion, and should not be construed to reflect those of LLNL, the Applied Statistics Group, or any other person, statistician or not, living or dead.)
• There isn’t anything wrong with p-values … but p=0.0501 is the same as p=0.0499.
• There isn’t anything wrong with statistical hypothesis testing … but it isn’t the right tool for making all decisions.
These procedures aren’t broken. They are misused.
This does not mean that you should keep using them.
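One way to see why p=0.0501 and p=0.0499 carry the same evidence is to convert each back to the test statistic that produced it. The sketch below assumes a two-sided z-test, which is an illustrative choice and not something specified on the slide; the function name is hypothetical.

```python
from statistics import NormalDist


def z_for_two_sided_p(p):
    """Absolute z statistic that yields a two-sided p-value of p."""
    return NormalDist().inv_cdf(1 - p / 2)


z_a = z_for_two_sided_p(0.0499)
z_b = z_for_two_sided_p(0.0501)
print(f"z for p=0.0499: {z_a:.4f}")
print(f"z for p=0.0501: {z_b:.4f}")
print(f"difference: {z_a - z_b:.4f}")
```

Under this assumption the two statistics differ by roughly two thousandths of a standard error, far less than the sampling noise in any realistic study, yet a fixed 0.05 cutoff would label one result significant and the other not.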
Know Thy Data
Parametric models are (of course) sensitive to assumptions, but purely data driven approaches are far more robust to “cookbook” approaches.
There are multiple cautions and caveats when using “big data” approaches. The most important is that you have to start with the right data.
Jackie’s Improbable Sister
Jackie is a girl in a family with two children. What is the probability that Jackie has a sister?
A. 1/2 B. 1/3 C. 0 or 1, but we don’t know which
How did we find Jackie?
Option A: 1/2
1) Pick a two child family at random.
2) Pick a child from the family at random.
Two girls have sisters and two girls have brothers.
Option B: 1/3
1) Pick a two child family with at least one girl at random.
2) Report one girl’s name for each family.
Of three possible families, only one has girls with sisters.
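Both answers can be reproduced by enumerating the equally likely cases under each sampling scheme. The sketch below assumes that each child is independently a boy or a girl with equal probability; the function and variable names are hypothetical.

```python
from fractions import Fraction
from itertools import product

# The four equally likely two-child families, as (older, younger).
FAMILIES = list(product("BG", repeat=2))


def option_a():
    """Pick a family at random, then a child at random; the child is a girl."""
    girls = 0
    girls_with_sister = 0
    for family in FAMILIES:
        for i in (0, 1):  # each (family, child) pair is equally likely
            if family[i] == "G":
                girls += 1
                if family[1 - i] == "G":
                    girls_with_sister += 1
    return Fraction(girls_with_sister, girls)


def option_b():
    """Pick a family with at least one girl at random; report one girl."""
    with_girl = [f for f in FAMILIES if "G" in f]
    two_girls = [f for f in with_girl if f == ("G", "G")]
    return Fraction(len(two_girls), len(with_girl))


print("Option A:", option_a())
print("Option B:", option_b())
```

The data ("Jackie is a girl in a two-child family") are identical in both cases. Only the way Jackie was found differs, and that alone changes the answer.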
Real (and Expensive) Problem 1948
BY DECLAN BUTLER
When influenza hit early and hard in the United States this year, it quietly claimed an unacknowledged victim: one of the cutting-edge techniques being used to monitor the outbreak. A comparison with traditional surveillance data showed that Google Flu Trends, which estimates prevalence from flu-related Internet searches, had drastically overestimated peak flu levels. The glitch is no more than a temporary setback for a promising strategy, experts say, and Google is sure to refine its algorithms. But as flu-tracking techniques based on mining of web data and on social media proliferate, the episode is a reminder that they will complement, but not substitute for, traditional epidemiological surveillance networks.
“It is hard to think today that one can provide disease surveillance without existing systems,” says Alain-Jacques Valleron, an epidemiologist at the Pierre and Marie Curie University in Paris, and founder of France’s Sentinelles monitoring network. “The new systems depend too much on old existing ones to be able to live without them,” he adds.
This year’s US flu season started around November and seems to have peaked just after Christmas, making it the earliest flu season since 2003. It is also causing more serious illness and deaths than usual, particularly among the elderly, because, just as in 2003, the predominant strain this year is H3N2, the most virulent of the three main seasonal flu strains.
Traditional flu monitoring depends in part on national networks of physicians who report cases of patients with influenza-like illness (ILI), a diffuse set of symptoms, including high fever, that is used as a proxy for flu. That estimate is then refined by testing a subset of people with these symptoms to determine how many have flu and not some other infection.
With its creation of the Sentinelles network in 1984, France was the first country to computerize its surveillance. Many countries have since developed similar networks; the US system, overseen by the Centers for Disease Control and Prevention (CDC) in Atlanta, Georgia, includes some 2,700 health-care centres that record about 30 million patient visits annually.
But the near-global coverage of the Internet and burgeoning social-media platforms such as Twitter have raised hopes that these technologies could open the way to easier, faster estimates of ILI, spanning larger populations.
The mother of these new systems is Google’s, launched in 2008. Based on research by Google and the CDC, it relies on data mining records of flu-related search terms entered in Google’s search engine, combined with computer modelling. Its estimates have almost exactly matched the CDC’s own surveillance data over time, and it delivers them several days faster than the CDC can. The system has since been rolled out to 29 countries worldwide, and has been extended to include surveillance for a second disease, dengue.
Google Flu Trends has continued to perform remarkably well, and researchers in many countries have confirmed that its ILI estimates are accurate. But the latest US flu season seems to have confounded its algorithms. Its estimate for the Christmas national peak of flu is almost double the CDC’s (see ‘Fever peaks’), and some of its state data show even larger discrepancies.
It is not the first time that a flu season has tripped Google up. In 2009, Flu Trends had to tweak its algorithms after its models badly underestimated ILI in the United States at the start of the H1N1 (swine flu) pandemic, a glitch attributed to changes in people’s search behaviour as a result of the exceptional nature of the pandemic (see http://doi.org/djw73f).
Google would not comment on this year’s
EPIDEMIOLOGY
When Google got flu wrong
US outbreak foxes a leading web-based method for tracking seasonal flu.
The latest US influenza season is more severe and has caused more deaths than usual.
JOHN ANGELILLO/UPI/NEWSCOM
14 FEBRUARY 2013 | VOL 494 | NATURE | 155
NEWS IN FOCUS
© 2013 Macmillan Publishers Limited. All rights reserved
2013 1954
To summarize…
Don’t:
Do:
• Know thy problem.
• Know thy tools.
• Know thy data.
When in doubt:
The LLNL Statistical Consulting Service provides up to 4 hours of assistance free of charge for LLNL projects.
https://data-analytics.llnl.gov/statistical_consultants
Thank you! σ
Image sources:
Wikipedia: Betty Crocker Cookbook, Salk Polio Vaccine
Wikipedia (CC BY-SA 3.0): George Box
Harry S. Truman Library: Bernard Dickmann with Harry S. Truman
Library of Congress: Chicago Tribune Headline
WPClipart: Plain Unicorn
LLNL: NIF, Drop Hammer, Sigma the Statistics Unicorn