Download - Everything wrong with statistics (and how to fix it)

LLNL-PRES-670181 This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC

Everything Wrong with Statistics (and How to Fix It)

Kristin P. Lennox
Director of Statistical Consulting
July 29, 2015


Crisis!

PLoS Medicine | www.plosmedicine.org 0696

Essay

Open access, freely available online

August 2005 | Volume 2 | Issue 8 | e124

Published research findings are sometimes refuted by subsequent evidence, with ensuing confusion and disappointment. Refutation and controversy is seen across the range of research designs, from clinical trials and traditional epidemiological studies [1–3] to the most modern molecular research [4,5]. There is increasing concern that in modern research, false findings may be the majority or even the vast majority of published research claims [6–8]. However, this should not be surprising. It can be proven that most claimed research findings are false. Here I will examine the key factors that influence this problem and some corollaries thereof.

Modeling the Framework for False Positive Findings

Several methodologists have pointed out [9–11] that the high rate of nonreplication (lack of confirmation) of research discoveries is a consequence of the convenient, yet ill-founded strategy of claiming conclusive research findings solely on the basis of a single study assessed by formal statistical significance, typically for a p-value less than 0.05. Research is not most appropriately represented and summarized by p-values, but, unfortunately, there is a widespread notion that medical research articles should be interpreted based only on p-values. Research findings are defined here as any relationship reaching formal statistical significance, e.g., effective interventions, informative predictors, risk factors, or associations. “Negative” research is also very useful. “Negative” is actually a misnomer, and the misinterpretation is widespread. However, here we will target relationships that investigators claim exist, rather than null findings.

As has been shown previously, the probability that a research finding is indeed true depends on the prior probability of it being true (before doing the study), the statistical power of the study, and the level of statistical significance [10,11]. Consider a 2 × 2 table in which research findings are compared against the gold standard of true relationships in a scientific field. In a research field both true and false hypotheses can be made about the presence of relationships. Let R be the ratio of the number of “true relationships” to “no relationships” among those tested in the field. R is characteristic of the field and can vary a lot depending on whether the field targets highly likely relationships or searches for only one or a few true relationships among thousands and millions of hypotheses that may be postulated. Let us also consider, for computational simplicity, circumscribed fields where either there is only one true relationship (among many that can be hypothesized) or the power is similar to find any of the several existing true relationships. The pre-study probability of a relationship being true is R⁄(R + 1). The probability of a study finding a true relationship reflects the power 1 − β (one minus the Type II error rate). The probability of claiming a relationship when none truly exists reflects the Type I error rate, α. Assuming that c relationships are being probed in the field, the expected values of the 2 × 2 table are given in Table 1. After a research finding has been claimed based on achieving formal statistical significance, the post-study probability that it is true is the positive predictive value, PPV. The PPV is also the complementary probability of what Wacholder et al. have called the false positive report probability [10]. According to the 2 × 2 table, one gets PPV = (1 − β)R⁄(R − βR + α). A research finding is thus

The Essay section contains opinion pieces on topics of broad interest to a general medical audience.

Why Most Published Research Findings Are False John P. A. Ioannidis

Citation: Ioannidis JPA (2005) Why most published research findings are false. PLoS Med 2(8): e124.

Copyright: © 2005 John P. A. Ioannidis. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abbreviation: PPV, positive predictive value

John P. A. Ioannidis is in the Department of Hygiene and Epidemiology, University of Ioannina School of Medicine, Ioannina, Greece, and Institute for Clinical Research and Health Policy Studies, Department of Medicine, Tufts-New England Medical Center, Tufts University School of Medicine, Boston, Massachusetts, United States of America. E-mail: [email protected]

Competing Interests: The author has declared that no competing interests exist.

DOI: 10.1371/journal.pmed.0020124

Summary

There is increasing concern that most current published research findings are false. The probability that a research claim is true may depend on study power and bias, the number of other studies on the same question, and, importantly, the ratio of true to no relationships among the relationships probed in each scientific field. In this framework, a research finding is less likely to be true when the studies conducted in a field are smaller; when effect sizes are smaller; when there is a greater number and lesser preselection of tested relationships; where there is greater flexibility in designs, definitions, outcomes, and analytical modes; when there is greater financial and other interest and prejudice; and when more teams are involved in a scientific field in chase of statistical significance. Simulations show that for most study designs and settings, it is more likely for a research claim to be false than true. Moreover, for many current scientific fields, claimed research findings may often be simply accurate measures of the prevailing bias. In this essay, I discuss the implications of these problems for the conduct and interpretation of research.

It can be proven that most claimed research findings are false.
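The PPV formula quoted in the excerpt, PPV = (1 − β)R⁄(R − βR + α), can be evaluated directly. The sketch below is only an illustration: the function name `ppv` and the two example fields (a well-powered study of a 1:1 field, and an underpowered study of a 1:10 field) are assumptions chosen for this example, not values taken from the essay.

```python
def ppv(R, power, alpha=0.05):
    """Positive predictive value from the quoted formula.

    R     : ratio of true relationships to no relationships in the field
    power : 1 - beta, the probability of detecting a true relationship
    alpha : Type I error rate
    """
    beta = 1.0 - power
    return (1.0 - beta) * R / (R - beta * R + alpha)


# Hypothetical field 1: half of tested relationships are true, power 0.80.
print(round(ppv(R=1.0, power=0.80), 3))

# Hypothetical field 2: one true relationship per ten nulls, power 0.20.
print(round(ppv(R=0.1, power=0.20), 3))
```

Under these assumed inputs, the second field has a PPV below one half even though every claimed finding passed the 0.05 threshold, which is the situation the essay describes.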


What’s going on?

Statistics is popular and important!

Statisticians are rare.

Statistics training isn’t working.

σ


STAT 101 is Procedural

1. Check your data type
2. Select inference method
3. Calculate required sample statistics
4. Look up critical values
…
N. Report result


Real Statistics Isn’t


Comprehensive Plan for Reform of All Statistics

1) Show the problems with “cookbook statistics”

2) Demonstrate real statistical thinking

3) Help as needed


Golden Rules of Statistics (What Statisticians REALLY Do)

•  Know thy problem.
•  Know thy tools.
•  Know thy data.



Know Thy Problem

STAT 101: Determine the appropriate analysis by looking at the data.

E.g. two numeric variables = linear regression

Appropriate data AND appropriate analysis depend on the real world problem.


The Million Dollar Binomial Distribution



Know Thy Tools

STAT 101: Statistical methods are selected according to the appropriateness to the data and correctness of assumptions.

STAT 101: Statistical procedures, used correctly, yield unambiguous results.

Statistical models work the same way that other scientific and engineering models work. Their validity depends on context, and they may be open to interpretation.


Statistical Methods are Based on Models

y_i = b_0 + b_1 x_i + ε_i

[Figure: scatter plot of y (0 to 150) against x (0 to 50), alongside a density plot over x from −4 to 4.]


Statistical Methods are Based on Models

y_i = b_0 + b_1 x_i + ε_i

[Figure: scatter plot of y (0 to 150) against x (0 to 50), alongside a density plot over x from −5 to 15.]
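The two slides above pair the same regression equation with two different density plots. One way to read that contrast is sketched below, under assumptions that are not in the slides: the true line (intercept 0, slope 3), the error scale, and the shifted exponential used as the skewed case are all hypothetical choices. The sketch fits ordinary least squares to both data sets and then compares the skewness of the residuals.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


def skewness(values):
    """Moment-based sample skewness (0 for a symmetric distribution)."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def normal_error(rng):
    # Symmetric errors with mean 0 and standard deviation 10 (assumed scale).
    return rng.gauss(0.0, 10.0)


def skewed_error(rng):
    # Shifted exponential: mean 0, standard deviation 10, long right tail.
    return rng.expovariate(1.0 / 10.0) - 10.0


def simulate(error_draw, n=2000, b0=0.0, b1=3.0, seed=0):
    """Simulate y_i = b0 + b1 * x_i + e_i, fit a line, inspect residuals."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 50.0) for _ in range(n)]
    ys = [b0 + b1 * x + error_draw(rng) for x in xs]
    a, b = fit_line(xs, ys)
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    return b, skewness(residuals)


slope_normal, skew_normal = simulate(normal_error)
slope_skewed, skew_skewed = simulate(skewed_error)
print("normal errors: slope %.2f, residual skewness %.2f" % (slope_normal, skew_normal))
print("skewed errors: slope %.2f, residual skewness %.2f" % (slope_skewed, skew_skewed))
```

In this setup both fits recover a slope close to the assumed value, so the fitted line alone does not reveal which error model generated the data. Any method that leans on the error distribution, such as a prediction interval, would need the residuals checked first.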


A Wise Man Once Said…

“Essentially, all models are wrong, but some are useful.” – George E. P. Box


How to Evaluate Explosives Safety

A METHOD FOR OBTAINING AND ANALYZING SENSITIVITY DATA*

W. J. DIXON, University of Oregon

AND

A. M. MOOD, Iowa State College

The standard method of dealing with sensitivity of dosage-mortality data is the probit technique developed by Bliss and Fisher. This paper provides an alternative technique based on a special system for obtaining such data. It has some advantages when observations must be taken on individuals rather than groups of individuals, and it may be preferred in certain other situations.

INTRODUCTION

EXPERIMENTAL investigations often deal with continuous variables which cannot be measured in practice. For example, in testing the sensitivity of explosives to shock, a common procedure is to drop a weight on specimens of the same explosive mixture from various heights. There are heights at which some specimens will explode, and others will not, and it is assumed that those which will not explode would explode were the weight dropped from a sufficiently greater height. It is supposed, therefore, that there is a critical height associated with each specimen, and that the specimen will explode when the weight is dropped from a greater height and will not explode when the weight is dropped from a lesser height. The population of specimens is thus characterized by a continuous variable - the critical height - which cannot be measured. All one can do is select some height arbitrarily and determine whether the critical height for a given specimen is less than or greater than the selected height.

This situation arises in many fields of research. Thus in testing insecticides, a critical dose is associated with each insect, but one cannot measure it. He can only try some dose and observe whether or not the insect is killed, that is, observe whether the critical dose for that insect is less than or greater than the chosen dose. The same difficulty arises in pharmaceutical research dealing with germicides, anesthetics,

* This paper is in part an adaptation of a memorandum submitted to the Applied Mathematics Panel by the Statistical Research Group, Princeton University. The Statistical Research Group operated under a contract with the Office of Scientific Research and Development, and was directed by the Applied Mathematics Panel of the National Defense Research Committee.


[Figure: Up-and-Down Test Demo. Normalized height (−2 to 2) plotted against test number (0 to 60); each test is marked x or o.]


How NOT to Evaluate Explosives Safety


“…the up and down method is particularly effective for estimating the mean. It is not a good method for estimating small or large percentage points (for example, the height at which 99 per cent of specimens explode) unless normality of the distribution is assured.” – Dixon and Mood

[Figure: Up-and-Down Test Demo. Normalized height (−2 to 2) plotted against test number (0 to 60); each test is marked x or o.]
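The up-and-down procedure described in the Dixon and Mood excerpt can be simulated in a few lines. This is a minimal sketch, assuming normally distributed critical heights in normalized units, a fixed step of 0.5, and a starting height of 2; these values and the function name `up_and_down` are hypothetical. The average of the tested heights is used here only as a rough stand-in for the mean, not as Dixon and Mood's own estimator.

```python
import random


def up_and_down(n_tests=60, start=2.0, step=0.5, mean=0.0, sd=1.0, seed=0):
    """Simulate an up-and-down sensitivity test.

    Each specimen has an unobservable critical height drawn from
    Normal(mean, sd). The specimen explodes if the drop height exceeds it.
    After an explosion the next height is lowered by one step; after a
    non-explosion it is raised by one step.
    Returns a list of (height, exploded) pairs.
    """
    rng = random.Random(seed)
    height = start
    record = []
    for _ in range(n_tests):
        critical = rng.gauss(mean, sd)
        exploded = height > critical
        record.append((height, exploded))
        height += -step if exploded else step
    return record


record = up_and_down()
heights = [h for h, _ in record]
print("tests run:", len(record))
print("average tested height:", round(sum(heights) / len(heights), 2))
print("lowest and highest tested height:", min(heights), max(heights))
```

Because the rule steers the drop height back toward the middle of the distribution, very few tests land in the tails. That is consistent with the quoted caution: the data say a lot about the mean and little about the height at which 99 per cent of specimens explode, so any estimate of that height rests on the assumed distribution.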


A Note on Statistical Significance
(the following statements reflect only the author’s opinion, and should not be construed to reflect those of LLNL, the Applied Statistics Group, or any other person, statistician or not, living or dead)

•  There isn’t anything wrong with p-values … but p=0.0501 is the same as p=0.0499

•  There isn’t anything wrong with statistical hypothesis testing … but it isn’t the right tool for making all decisions.

These procedures aren’t broken. They are misused.

This does not mean that you should keep using them.
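To see how little separates p=0.0501 from p=0.0499, consider a two-sided z-test. The sketch below is an illustration under assumptions: the z-test setting and the two test statistics (1.959 and 1.961) are hypothetical values chosen to land on either side of 0.05.

```python
import math


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Two hypothetical studies whose test statistics differ by 0.002.
p_first = two_sided_p(1.959)
p_second = two_sided_p(1.961)

print("z = 1.959 -> p = %.4f" % p_first)
print("z = 1.961 -> p = %.4f" % p_second)
print("difference in p: %.4f" % (p_first - p_second))
```

A rigid 0.05 cutoff would label one of these studies significant and the other not, although the evidence in the two is practically identical.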



Know Thy Data

Parametric models are (of course) sensitive to assumptions, but purely data driven approaches are far more robust to “cookbook” approaches.

There are multiple cautions and caveats when using “big data” approaches. The most important is that you have to start with the right data.


Jackie’s Improbable Sister

Jackie is a girl in a family with two children. What is the probability that Jackie has a sister?

A. 1/2
B. 1/3
C. 0 or 1, but we don’t know which


Jackie’s Improbable Sister

Jackie is a girl in a family with two children. What is the probability that Jackie has a sister?

A. 1/2
B. 1/3

How did we find Jackie?



Option A: 1/2

1) Pick a two child family at random.

2) Pick a child from the family at random.

Two girls have sisters and two girls have brothers.



Option B: 1/3

1) Pick a two child family with at least one girl at random.

2) Report one girl’s name for each family.

Of three possible families, only one has girls with sisters.
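Both answers can be checked by simulating the two ways of finding Jackie. This is a minimal sketch, assuming each child is independently a girl or a boy with equal probability; the function name and the number of simulated families are hypothetical.

```python
import random


def simulate_jackie(n_families=200_000, seed=0):
    """Estimate P(Jackie has a sister) under the two sampling schemes."""
    rng = random.Random(seed)
    a_girls = a_sisters = 0        # Option A counters
    b_families = b_sisters = 0     # Option B counters

    for _ in range(n_families):
        family = [rng.choice("GB"), rng.choice("GB")]

        # Option A: pick a child at random; keep the draw if it is a girl.
        picked = rng.randrange(2)
        if family[picked] == "G":
            a_girls += 1
            a_sisters += family[1 - picked] == "G"

        # Option B: keep only families with at least one girl,
        # then report one girl per family.
        if "G" in family:
            b_families += 1
            b_sisters += family == ["G", "G"]

    return a_sisters / a_girls, b_sisters / b_families


option_a, option_b = simulate_jackie()
print("Option A estimate: %.3f" % option_a)
print("Option B estimate: %.3f" % option_b)
```

The families are the same in both schemes; only the selection rule changes, and that is enough to move the answer from about 1/2 to about 1/3.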


Real (and Expensive) Problem 1948


NEWS IN FOCUS

EPIDEMIOLOGY

When Google got flu wrong

US outbreak foxes a leading web-based method for tracking seasonal flu.

By Declan Butler

When influenza hit early and hard in the United States this year, it quietly claimed an unacknowledged victim: one of the cutting-edge techniques being used to monitor the outbreak. A comparison with traditional surveillance data showed that Google Flu Trends, which estimates prevalence from flu-related Internet searches, had drastically overestimated peak flu levels. The glitch is no more than a temporary setback for a promising strategy, experts say, and Google is sure to refine its algorithms. But as flu-tracking techniques based on mining of web data and on social media proliferate, the episode is a reminder that they will complement, but not substitute for, traditional epidemiological surveillance networks.

“It is hard to think today that one can provide disease surveillance without existing systems,” says Alain-Jacques Valleron, an epidemiologist at the Pierre and Marie Curie University in Paris, and founder of France’s Sentinelles monitoring network. “The new systems depend too much on old existing ones to be able to live without them,” he adds.

This year’s US flu season started around November and seems to have peaked just after Christmas, making it the earliest flu season since 2003. It is also causing more serious illness and deaths than usual, particularly among the elderly, because, just as in 2003, the predominant strain this year is H3N2 — the most virulent of the three main seasonal flu strains.

Traditional flu monitoring depends in part on national networks of physicians who report cases of patients with influenza-like illness (ILI) — a diffuse set of symptoms, including high fever, that is used as a proxy for flu. That estimate is then refined by testing a subset of people with these symptoms to determine how many have flu and not some other infection.

With its creation of the Sentinelles network in 1984, France was the first country to computerize its surveillance. Many countries have since developed similar networks — the US system, overseen by the Centers for Disease Control and Prevention (CDC) in Atlanta, Georgia, includes some 2,700 health-care centres that record about 30 million patient visits annually.

But the near-global coverage of the Internet and burgeoning social-media platforms such as Twitter have raised hopes that these technologies could open the way to easier, faster estimates of ILI, spanning larger populations.

The mother of these new systems is Google’s, launched in 2008. Based on research by Google and the CDC, it relies on data mining records of flu-related search terms entered in Google’s search engine, combined with computer modelling. Its estimates have almost exactly matched the CDC’s own surveillance data over time — and it delivers them several days faster than the CDC can. The system has since been rolled out to 29 countries worldwide, and has been extended to include surveillance for a second disease, dengue.

Google Flu Trends has continued to perform remarkably well, and researchers in many countries have confirmed that its ILI estimates are accurate. But the latest US flu season seems to have confounded its algorithms. Its estimate for the Christmas national peak of flu is almost double the CDC’s (see ‘Fever peaks’), and some of its state data show even larger discrepancies.

It is not the first time that a flu season has tripped Google up. In 2009, Flu Trends had to tweak its algorithms after its models badly underestimated ILI in the United States at the start of the H1N1 (swine flu) pandemic — a glitch attributed to changes in people’s search behaviour as a result of the exceptional nature of the pandemic (see http://doi.org/djw73f).

Google would not comment on this year’s

[Photo: The latest US influenza season is more severe and has caused more deaths than usual. Credit: John Angelillo/UPI/Newscom]

NATURE.COM: See maps showing reports of flu-like symptoms in France: go.nature.com/w954hn

14 February 2013 | Vol 494 | Nature | 155

© 2013 Macmillan Publishers Limited. All rights reserved

2013 1954


To summarize…


Don’t:



Do:


The LLNL Statistical Consulting Service provides up to 4 hours of assistance free of charge for LLNL projects.

When in doubt:

[email protected]

https://data-analytics.llnl.gov/statistical_consultants

Thank you! σ


Image sources:

Wikipedia: Betty Crocker Cookbook, Salk Polio Vaccine

Wikipedia (CC BY-SA 3.0): George Box

Harry S. Truman Library: Bernard Dickmann with Harry S. Truman

Library of Congress: Chicago Tribune Headline

Plain Unicorn: WPClipart

LLNL: NIF, Drop Hammer, Sigma the Statistics Unicorn
