False Discovery Rate, Part I: introduction and challenges
TRANSCRIPT
E. Roquain
Laboratory LPMA, Université Pierre et Marie Curie (Paris 6), France
Point de Vue, 3rd February 2014
E. Roquain FDR: introduction, challenges and perspectives. Part I. 1 / 33
1 Introduction
2 False discovery rate control
3 FDR in other statistical issues
1 Introduction
A "multiple testing joke" (http://xkcd.com)
Multiplicity problem
P(make at least one false discovery) ≫ P(the i-th is a false discovery)
A correction is needed to assess significance!
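The gap between the two probabilities can be made concrete with a small sketch. It assumes (this is not stated on the slide) that all m tests are independent, that every null hypothesis is true, and that each test is run at level alpha; the function name is hypothetical.

```python
def prob_at_least_one_false_discovery(m: int, alpha: float) -> float:
    """P(at least one false discovery) among m independent true nulls,
    each tested at level alpha (assumption: independence, all nulls true)."""
    return 1.0 - (1.0 - alpha) ** m


if __name__ == "__main__":
    # A single test keeps the error at alpha; many tests do not.
    for m in (1, 20, 100):
        print(m, prob_at_least_one_false_discovery(m, 0.05))
```

With alpha = 0.05 the probability is 0.05 for one test but already well above one half for twenty tests, which is the multiplicity problem in its simplest form.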
Some other examples
Paradoxes due to large scale experiments
Probable facts appear significant
Science-wise multiplicity issue - [Ioannidis (2005, PLoS Medicine)]
[Slide shows the first page of the essay; text as extracted:]
PLoS Medicine (www.plosmedicine.org), August 2005, Volume 2, Issue 8, e124. Essay. Open access, freely available online.
Why Most Published Research Findings Are False
John P. A. Ioannidis
Summary: There is increasing concern that most current published research findings are false. The probability that a research claim is true may depend on study power and bias, the number of other studies on the same question, and, importantly, the ratio of true to no relationships among the relationships probed in each scientific field. In this framework, a research finding is less likely to be true when the studies conducted in a field are smaller; when effect sizes are smaller; when there is a greater number and lesser preselection of tested relationships; where there is greater flexibility in designs, definitions, outcomes, and analytical modes; when there is greater financial and other interest and prejudice; and when more teams are involved in a scientific field in chase of statistical significance. Simulations show that for most study designs and settings, it is more likely for a research claim to be false than true. Moreover, for many current scientific fields, claimed research findings may often be simply accurate measures of the prevailing bias. In this essay, I discuss the implications of these problems for the conduct and interpretation of research.
Published research findings are sometimes refuted by subsequent evidence, with ensuing confusion and disappointment. Refutation and controversy is seen across the range of research designs, from clinical trials and traditional epidemiological studies [1–3] to the most modern molecular research [4,5]. There is increasing concern that in modern research, false findings may be the majority or even the vast majority of published research claims [6–8]. However, this should not be surprising. It can be proven that most claimed research findings are false. Here I will examine the key factors that influence this problem and some corollaries thereof.
Modeling the Framework for False Positive Findings
Several methodologists have pointed out [9–11] that the high rate of nonreplication (lack of confirmation) of research discoveries is a consequence of the convenient, yet ill-founded strategy of claiming conclusive research findings solely on the basis of a single study assessed by formal statistical significance, typically for a p-value less than 0.05. Research is not most appropriately represented and summarized by p-values, but, unfortunately, there is a widespread notion that medical research articles should be interpreted based only on p-values. Research findings are defined here as any relationship reaching formal statistical significance, e.g., effective interventions, informative predictors, risk factors, or associations. "Negative" research is also very useful. "Negative" is actually a misnomer, and the misinterpretation is widespread. However, here we will target relationships that investigators claim exist, rather than null findings.
As has been shown previously, the probability that a research finding is indeed true depends on the prior probability of it being true (before doing the study), the statistical power of the study, and the level of statistical significance [10,11]. Consider a 2 × 2 table in which research findings are compared against the gold standard of true relationships in a scientific field. In a research field both true and false hypotheses can be made about the presence of relationships. Let R be the ratio of the number of "true relationships" to "no relationships" among those tested in the field. R is characteristic of the field and can vary a lot depending on whether the field targets highly likely relationships or searches for only one or a few true relationships among thousands and millions of hypotheses that may be postulated. Let us also consider, for computational simplicity, circumscribed fields where either there is only one true relationship (among many that can be hypothesized) or the power is similar to find any of the several existing true relationships. The pre-study probability of a relationship being true is R/(R + 1). The probability of a study finding a true relationship reflects the power 1 − β (one minus the Type II error rate). The probability of claiming a relationship when none truly exists reflects the Type I error rate, α. Assuming that c relationships are being probed in the field, the expected values of the 2 × 2 table are given in Table 1. After a research finding has been claimed based on achieving formal statistical significance, the post-study probability that it is true is the positive predictive value, PPV. The PPV is also the complementary probability of what Wacholder et al. have called the false positive report probability [10]. According to the 2 × 2 table, one gets PPV = (1 − β)R/(R − βR + α). A research finding is thus
The Essay section contains opinion pieces on topics of broad interest to a general medical audience.
Citation: Ioannidis JPA (2005) Why most published research findings are false. PLoS Med 2(8): e124.
Copyright: © 2005 John P. A. Ioannidis. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abbreviation: PPV, positive predictive value
John P. A. Ioannidis is in the Department of Hygiene and Epidemiology, University of Ioannina School of Medicine, Ioannina, Greece, and Institute for Clinical Research and Health Policy Studies, Department of Medicine, Tufts-New England Medical Center, Tufts University School of Medicine, Boston, Massachusetts, United States of America.
Competing Interests: The author has declared that no competing interests exist.
DOI: 10.1371/journal.pmed.0020124
Pull quote: It can be proven that most claimed research findings are false.
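The PPV formula quoted in the essay excerpt above, PPV = (1 − β)R/(R − βR + α), can be evaluated directly. The following is only a sketch: the function name and the numerical inputs are illustrative assumptions, not values taken from the essay.

```python
def ppv(R: float, alpha: float, beta: float) -> float:
    """Positive predictive value PPV = (1 - beta) R / (R - beta R + alpha).

    R     : ratio of true relationships to no relationships in the field
    alpha : Type I error rate
    beta  : Type II error rate (power is 1 - beta)
    """
    return (1.0 - beta) * R / (R - beta * R + alpha)


if __name__ == "__main__":
    # Illustrative inputs only (assumptions): a field with few true
    # relationships versus a field where half of the hypotheses are true.
    print(ppv(R=0.03, alpha=0.05, beta=0.0))
    print(ppv(R=1.0, alpha=0.05, beta=0.2))
```

When R is small, even a perfectly powered study (β = 0) yields a PPV below one half, which is the quantitative content of the essay's title.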
Science-wise multiplicity issue - [Ioannidis (2005, PLoS Medicine)]
[Talk Benjamini Southampton (2013)]; modeling:
Try many experiments
⇓
1000 pure noise | 30 perfect signal
⇓
publish results with a p-value ≤ 0.05
⇓
≃ 50 false discoveries | 30 true discoveries
• Jager and Leek (2013). An estimate of the science-wise false discovery rate and application to the top medical literature: ≃ 14%
• Ioannidis (2014). Discussion: Why "An estimate of the science-wise false discovery rate and application to the top medical literature" is false
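The arithmetic behind the 1000 / 30 toy model can be sketched as follows. Two assumptions are made explicit in the code: p-values of pure-noise experiments are uniform on [0, 1], and a "perfect signal" experiment is always published. The variable names are hypothetical.

```python
n_noise = 1000      # experiments with no effect (pure noise)
n_signal = 30       # experiments with a "perfect" signal
threshold = 0.05    # publication rule: p-value <= 0.05

# Assumption: under pure noise the p-value is uniform, so a fraction
# `threshold` of the noise experiments passes the publication rule.
expected_false = n_noise * threshold
# Assumption: a perfect signal is always detected.
expected_true = n_signal * 1.0

# Expected proportion of false discoveries among published results.
false_discovery_proportion = expected_false / (expected_false + expected_true)

if __name__ == "__main__":
    print(expected_false, expected_true, false_discovery_proportion)
```

In this toy model more than half of the published results are false, even though each individual test is run at level 0.05.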
Multiplicity in microarray [Hedenfalk et al. (2001)]
BRCA1 vs BRCA2
[Figure: gene expression data, axis label "genes"]
• expression level (activity)
• genes differentially activated?
• 1 test for each gene
• thousands of genes
• number of replications ≪ dimension
• correlations
Other applications
• Neuroimaging (fMRI): activated regions?
• Econometrics: winning strategies?
• Astronomy: directions with stars?
Canonical setting
• $X_i$ = avg group 2 − avg group 1 (rescaled) for gene $i$
• Gaussian model:
$$\begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_m \end{pmatrix} = \mu \begin{pmatrix} H_1 \\ H_2 \\ \vdots \\ H_m \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_m \end{pmatrix},$$
with $\mu > 0$, $H \in \{0,1\}^m$ (fixed) and $\varepsilon \sim \mathcal{N}(0, \Gamma)$ ($\Gamma_{i,i} = 1$).
• $\Gamma$ = dependence structure = $I_m$ for now
Question: for each $i$, $H_i = 0$ or $H_i = 1$?
Multiple testing: favors the "0" decision
Individual decision and errors
• Test statistic: $X_i$
• p-value: $p_i = \bar\Phi(X_i)$, with $\bar\Phi(z) = P(Z \ge z)$, $Z \sim \mathcal{N}(0,1)$
  $p_i$ such that
  if $H_i = 0$, $p_i \sim U(0,1)$
  if $H_i = 1$, $p_i \sim \bar\Phi(\bar\Phi^{-1}(\cdot) - \mu)$ (as a c.d.f.)
• Choose $\hat H_i = 1\{p_i \le t\}$ for some threshold $t$
• Two errors:
                 $\hat H_i = 0$      $\hat H_i = 1$
  $H_i = 0$      true negative       false positive
  $H_i = 1$      false negative      true positive
• False positive more annoying
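A minimal sketch of the canonical setting with independent noise: it draws X = µH + ε, turns each X_i into a one-sided p-value, and counts the four outcomes of the decision 1{p_i ≤ t}. The function names, the seed and the parameter values (borrowed from Picture 1) are assumptions made for illustration.

```python
import random
from statistics import NormalDist


def simulate_p_values(m, m0, mu, seed=0):
    """Draw X_i = mu * H_i + eps_i with eps_i i.i.d. N(0, 1) and return (H, p).

    The first m0 hypotheses are true nulls (H_i = 0), the rest have H_i = 1.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    H = [0] * m0 + [1] * (m - m0)
    X = [mu * h + rng.gauss(0.0, 1.0) for h in H]
    # One-sided p-value: upper tail probability of the standard normal at X_i.
    p = [1.0 - std_normal.cdf(x) for x in X]
    return H, p


def confusion_counts(H, p, t):
    """Count true/false positives/negatives for the decision 1{p_i <= t}."""
    counts = {"true negative": 0, "false positive": 0,
              "false negative": 0, "true positive": 0}
    for h, p_i in zip(H, p):
        rejected = p_i <= t
        if h == 0:
            counts["false positive" if rejected else "true negative"] += 1
        else:
            counts["true positive" if rejected else "false negative"] += 1
    return counts


H, p = simulate_p_values(m=100, m0=50, mu=2.0)
counts = confusion_counts(H, p, t=0.1)

if __name__ == "__main__":
    print(counts)
```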
Picture 1. m = 100; m_0 = 50; µ = 2; Γ = I_m
[Figure: plot with both axes from 0 to 1; points labelled 0 (H_i = 0) or 1 (H_i = 1)]
Picture 2. m = 100; m_0 = 95; µ = 3; Γ = I_m
[Figure: plot with both axes from 0 to 1; points labelled 0 (H_i = 0) or 1 (H_i = 1)]
Picture 3. m = 100; m_0 = 50; µ = 0.01; Γ = I_m
[Figure: plot with both axes from 0 to 1; points labelled 0 (H_i = 0) or 1 (H_i = 1)]
Picture 4. m = 100; m_0 = 95; µ = 0.01; Γ = I_m
[Figure: plot with both axes from 0 to 1; points labelled 0 (H_i = 0) or 1 (H_i = 1)]
Proceeding as for a single test? t ≡ α = 0.1
[Figure: the same kind of plot with the constant threshold t = 0.1, first with both axes from 0 to 1, then zoomed to the vertical range 0 to 0.20; points labelled 0 or 1]
Union bound Bonferroni? t ≡ α/m = 0.1/100
[Figure: two data sets plotted with horizontal axis from 0 to 1 and vertical axis from 0 to 0.20; points labelled 0 or 1]
Do something in between! t_ℓ = αℓ/m = 0.1ℓ/100
[Figure: the same two data sets plotted with horizontal axis from 0 to 1 and vertical axis from 0 to 0.20; points labelled 0 or 1]
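The three candidate thresholds discussed above (constant α, Bonferroni α/m, and the intermediate sequence αℓ/m) can be compared side by side. This is a sketch with a hypothetical function name; m = 100 and α = 0.1 are the values shown on the slides.

```python
def candidate_thresholds(m, alpha):
    """Return the three threshold sequences indexed by rank l = 1, ..., m."""
    uncorrected = [alpha] * m                             # t = alpha
    bonferroni = [alpha / m] * m                          # t = alpha / m
    in_between = [alpha * l / m for l in range(1, m + 1)] # t_l = alpha * l / m
    return uncorrected, bonferroni, in_between


uncorrected, bonferroni, in_between = candidate_thresholds(m=100, alpha=0.1)

if __name__ == "__main__":
    for l in (1, 10, 50, 100):
        print(l, uncorrected[l - 1], bonferroni[l - 1], in_between[l - 1])
```

The intermediate sequence starts at the Bonferroni threshold (ℓ = 1) and ends at the uncorrected one (ℓ = m), which is why it sits "in between".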
Smart! ... and rigorous?
2 False discovery rate control
BH procedure
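p-value view:
$$\hat k = \max\{0 \le i \le m : p_{(i)} \le \alpha i/m\}, \qquad \hat t = \alpha \hat k/m$$
c.d.f. view:
$$\hat t = \max\{t \in [0,1] : G_m(t) \ge t/\alpha\}$$
Decision:
$$\hat H_i = 1\{p_i \le \hat t\} = 1\{X_i \ge \bar\Phi^{-1}(\hat t)\}$$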
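The p-value view of the BH procedure translates almost line by line into code. The sketch below is one possible rendering, not the author's implementation: it takes the largest rank k with p_(k) ≤ αk/m (0 if there is none), sets the threshold to αk/m, and rejects every p-value below it. The function names and the example p-values are made up for illustration.

```python
def bh_threshold(p_values, alpha):
    """BH threshold alpha * k / m, where k is the largest rank i such that
    the i-th smallest p-value is <= alpha * i / m (k = 0 if none)."""
    m = len(p_values)
    k_hat = 0
    for i, p in enumerate(sorted(p_values), start=1):
        if p <= alpha * i / m:
            k_hat = i  # keep the LARGEST such rank (step-up)
    return alpha * k_hat / m


def bh_reject(p_values, alpha):
    """Decisions 1{p_i <= t_hat} in the original order of the p-values."""
    t_hat = bh_threshold(p_values, alpha)
    return [p <= t_hat for p in p_values]


# Made-up p-values: 0.03 fails its own comparison (0.03 > 0.1 * 1 / 4)
# but is still rejected because a larger rank passes.
example_p = [0.03, 0.04, 0.05, 0.9]
example_decisions = bh_reject(example_p, alpha=0.1)

if __name__ == "__main__":
    print(bh_threshold(example_p, alpha=0.1), example_decisions)
```

The example illustrates the step-up character of the rule: the decision depends on the whole sorted sequence, not on each p-value compared with its own threshold.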
False discovery rate control
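For a decision $\hat H_i = 1\{p_i \le t\}$ ($\forall i$),
$$\mathrm{FDP}(t) = \frac{\#\{i : H_i = 0, \hat H_i = 1\}}{\#\{i : \hat H_i = 1\}} \quad \left(\tfrac{0}{0} = 0\right)$$
$$\mathrm{FDR}(t) = E[\mathrm{FDP}(t)]$$
Theorem [Benjamini and Hochberg (1995)] [Benjamini and Yekutieli (2001)]
If $\Gamma = I_m$ and $\hat t$ is the threshold of the BH procedure, then $\forall \mu, H$,
$$\mathrm{FDR}(\hat t) = (m_0/m)\,\alpha \le \alpha$$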
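The identity FDR = (m_0/m)α under independence can be checked numerically. The following Monte Carlo sketch makes several assumptions that are not on the slide: α = 0.1, 2000 replications, a fixed seed, and the parameter values m = 50, m_0 = 25, µ = 3. All function names are hypothetical.

```python
import random
from statistics import NormalDist


def bh_threshold(p_values, alpha):
    """BH threshold alpha * k / m with k the largest rank passing the test."""
    m = len(p_values)
    k_hat = 0
    for i, p in enumerate(sorted(p_values), start=1):
        if p <= alpha * i / m:
            k_hat = i
    return alpha * k_hat / m


def fdp(H, p_values, t):
    """False discovery proportion at threshold t, with the convention 0/0 = 0."""
    rejected_labels = [h for h, p in zip(H, p_values) if p <= t]
    if not rejected_labels:
        return 0.0
    return rejected_labels.count(0) / len(rejected_labels)


def estimate_fdr(m, m0, mu, alpha, n_rep, seed=0):
    """Average FDP of BH over n_rep independent Gaussian data sets."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    H = [0] * m0 + [1] * (m - m0)
    total = 0.0
    for _ in range(n_rep):
        p = [1.0 - std_normal.cdf(mu * h + rng.gauss(0.0, 1.0)) for h in H]
        total += fdp(H, p, bh_threshold(p, alpha))
    return total / n_rep


estimate = estimate_fdr(m=50, m0=25, mu=3.0, alpha=0.1, n_rep=2000)
target = 25 / 50 * 0.1  # (m0 / m) * alpha

if __name__ == "__main__":
    print("Monte Carlo FDR:", estimate, "theoretical value:", target)
```

Individual FDP values fluctuate from one data set to the next; only their average is controlled, which is what the estimate approximates.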
Simulations. m = 50; m_0 = 25; µ = 3; Γ = I_m
[Figure: ten simulated data sets shown in turn, horizontal axis from 0 to 1, vertical axis from 0 to 0.4; points labelled 0 or 1]
FDP(BH) over the ten runs: 0, 0.16, 0.0833, 0.08, 0.12, 0.167, 0.0435, 0, 0, 0.04
Benjamini and Hochberg (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing
[Slide shows an excerpt from Benjamini (2010, JRSSB), page 411, "False Discovery Rate"; text as extracted:]
procedure such that FDR ≤ α. However, later work has blurred this distinction: there are methods that, given a rejection procedure R, estimate the unknown quantity E(V/R).
Impact
In many ways, Benjamini and Hochberg (1995) is a very successful paper. Its influence is clear from its 4967 citations (according to the Web of Science at the time of this session), which are still on the rise each year as can be seen in Fig. 1. Although 607 of these are in the area of statistics and probability, the majority of these publications are in the life sciences, from genetics to biochemistry, from oncology to plant sciences, reflecting in large part the use of FDR in microarray-related research. Importantly, citations in other high dimensional application areas, such as neural imaging, are on the rise also, showing its ability to be applied in many diverse types of application. The list of the 10 highest cited papers that cite Benjamini and Hochberg (1995), which is shown in Table 1, is particularly interesting, because it includes six statistical papers, suggesting that further theoretical and methodological developments of the method have had significant influence.
Fig. 1. Rapidly increasing number of citations of Benjamini and Hochberg (1995), suggesting that its influence has not yet reached its peak (note that the figure for 2009 is only partially shown) [axes: year, 1996 to 2008; number of citations, 0 to 1200]
Table 1. 10 most cited papers that cite Benjamini and Hochberg (1995)
Rank  Article citing Benjamini and Hochberg (1995)  Number of citations
1     Tusher et al. (2001)                          3723
2     Storey and Tibshirani (2003)                  1412
3     Weisberg et al. (2003)                        1187
4     Genovese et al. (2002)                        1020
5     Storey (2002)                                 726
6     Wilkinson (1999)                              652
7     Benjamini and Yekutieli (2001)                584
8     Wacholder et al. (2004)                       486
9     Patti et al. (2003)                           479
10    Dudoit et al. (2002)                          459
[Benjamini (2010, JRSSB)]
now > 20,000 citations on Google Scholar!
3 FDR in other statistical issues
Why should FDR thresholding be adaptive to sparsity?
[Figure: two data sets plotted with horizontal axis from 0 to 1 and vertical axis from 0 to 0.20; points labelled 0 or 1]
[Linnemann] - increasing signal strength
Adaptation to unknown sparsity
$\hat t$ seems "adaptive" to the "quantity" of signal in the data
• Classification: where is the signal? [Bogdan et al. (2011)], [Neuvial and R. (2012)]
• Detection: is there some signal? [Ingster (2002)], [Donoho and Jin (2004)], etc.
• Estimation: what is the value $EX$ of the signal?
  $\widehat{EX}_i = X_i \, 1\{|X_i| \ge \hat t\}$ (hard thresholding)
  [Abramovich et al. (2006)], [Donoho and Jin (2006)]
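Hard thresholding with a data-driven threshold can be sketched as follows. The slide does not say how the threshold is calibrated on the scale of |X_i|; the sketch assumes the FDR-thresholding recipe of applying BH to the two-sided p-values 2(1 − Φ(|X_i|)) and keeping X_i exactly when its p-value is rejected. The function name and the example data are hypothetical.

```python
from statistics import NormalDist


def fdr_hard_threshold(x, alpha):
    """Keep x_i when BH rejects its two-sided p-value, set it to 0 otherwise.

    Assumption: unit noise variance, two-sided p-values 2 * P(Z >= |x_i|).
    """
    std_normal = NormalDist()
    m = len(x)
    p = [2.0 * (1.0 - std_normal.cdf(abs(x_i))) for x_i in x]
    # BH step-up: largest rank k with p_(k) <= alpha * k / m.
    k_hat = 0
    for i, p_i in enumerate(sorted(p), start=1):
        if p_i <= alpha * i / m:
            k_hat = i
    t_hat = alpha * k_hat / m
    # Rejecting p_i <= t_hat is the same as keeping large |x_i|.
    return [x_i if p_i <= t_hat else 0.0 for x_i, p_i in zip(x, p)]


observations = [0.3, -0.5, 4.2, 0.1, -3.9, 0.8]  # made-up data
estimate = fdr_hard_threshold(observations, alpha=0.1)

if __name__ == "__main__":
    print(estimate)
```

The more large observations there are, the larger k and the lower the cut-off on |X_i|, which is the sense in which the threshold adapts to the amount of signal.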
Classification
$$X_i \sim \pi_{0,m}\,\mathcal{N}(0,1) + (1-\pi_{0,m})\,\mathcal{N}(\mu_m,1), \quad 1 \le i \le m, \text{ i.i.d.}$$
but $\pi_{0,m} \to 1$ (sparse) and $\mu_m \to \infty$ (compensates sparsity).
• training set = null distribution known (one-class classification)
• Classification rule $h_m : \mathbb{R} \to \{0,1\}$
• Risk
$$R_m(h_m) = (1-\pi_{0,m})^{-1}\, E\Big( m^{-1} \sum_{i=1}^{m} 1\{h_m(X_i) \ne H_i\} \Big)$$
• Classification boundary in (sparsity, signal) space such that
  Above the boundary, $\exists h_m : R_m(h_m) \to 0$ (perfect classification)
  Under the boundary, $\forall h_m, R_m(h_m) \to 1$ (unclassifiable)
Classification boundary
[Figure: classification boundary in the (β, r) plane; horizontal axis β from 0.5 to 1.0, vertical axis r from 0 to 1; regions labelled "Perfect classification" and "Unclassifiable"]
$\mu_m = \sqrt{2 r \log m}$, $\pi_{0,m} = 1 - m^{-\beta}$
BH: $h^{BH}_m(x) = 1\{x \ge \bar\Phi^{-1}(\hat t)\}$ with $\alpha_m \propto (\log m)^{-1/2}$
• Classification boundary attained by BH.
  On the boundary: risk BH ∼ Bayes risk.
[Bogdan et al. (2011)], [Neuvial and R. (2012)]
Detection: is there some signal?
Same model
$$X_i \sim \pi_{0,m}\,\mathcal{N}(0,1) + (1-\pi_{0,m})\,\mathcal{N}(\mu_m,1), \quad 1 \le i \le m, \text{ i.i.d.}$$
but $\pi_{0,m} \to 1$ (sparse) and $\mu_m \to \infty$ (compensates sparsity).
• Test $H_0$: "$\mathcal{N}(0, I_m)$" against $H_1$: "mixture".
• Risk $R_m(T) = P_{H_0}(T(X) = 1) + P_{H_1}(T(X) = 0)$
• Detection boundary in (sparsity, signal) space such that
  Above the boundary, $\exists T : R_m(T) \to 0$ (perfect detection)
  Under the boundary, $\forall T, R_m(T) \to 1$ (undetectable)
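Detection boundary
[Figure: detection boundary in the (β, r) plane; horizontal axis β from 0.5 to 1.0, vertical axis r from 0 to 1; regions labelled "Perfect classification", "Perfect detection" and "Undetectable"]
$\mu_m = \sqrt{2 r \log m}$, $\pi_{0,m} = 1 - m^{-\beta}$
$T^{BH} = 1\{\exists i : p_{(i)} \le \alpha_m i/m\}$ with $\alpha_m \propto (\log m)^{-1/2}$
• Detection boundary attained by BH when $\beta \in (3/4, 1)$
• Better to use "higher criticism"
$$\max_i \left\{ \frac{\sqrt{m}\,\big(i/m - p_{(i)}\big)}{\sqrt{p_{(i)}\,(1 - p_{(i)})}} \right\}$$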
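The higher criticism statistic above compares each sorted p-value p_(i) with its expected rank fraction i/m, standardised by the binomial standard deviation. A minimal sketch follows; skipping p-values equal to 0 or 1 (where the denominator vanishes) is a choice made here, not something stated on the slide, and the function name and example data are hypothetical.

```python
import math


def higher_criticism(p_values):
    """max over i of sqrt(m) * (i/m - p_(i)) / sqrt(p_(i) * (1 - p_(i)))."""
    m = len(p_values)
    best = -math.inf
    for i, p in enumerate(sorted(p_values), start=1):
        if 0.0 < p < 1.0:  # assumption: skip degenerate p-values
            value = math.sqrt(m) * (i / m - p) / math.sqrt(p * (1.0 - p))
            best = max(best, value)
    return best


# Made-up inputs: an evenly spread grid (no signal) and the same grid
# with its five smallest entries replaced by very small p-values.
null_like = [(i - 0.5) / 100 for i in range(1, 101)]
with_signal = [1e-6] * 5 + null_like[5:]

hc_null = higher_criticism(null_like)
hc_signal = higher_criticism(with_signal)

if __name__ == "__main__":
    print(hc_null, hc_signal)
```

A handful of very small p-values is enough to inflate the statistic, which is why it is suited to detecting sparse signals.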
LASSO and FDR
Regression with orthogonal design:
$$X \sim \mathcal{N}(\beta, I_m)$$
[Bogdan et al. (2013)]: sorted $\ell_1$ penalized estimator (SLOPE)
$$\hat\beta = \arg\min_{\beta \in \mathbb{R}^m} \Big\{ \tfrac{1}{2}\,\|X - \beta\|^2 + \sum_{k=1}^{m} \lambda_k\, |\beta|_{(k)} \Big\}$$
where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_m$; $|\beta|_{(1)} \ge |\beta|_{(2)} \ge \cdots \ge |\beta|_{(m)}$
Selection with $\{i : \hat\beta_i \ne 0\}$:
• $\lambda_k = \lambda = \bar\Phi^{-1}(\alpha/(2m)) \simeq \sqrt{2 \log m}$: Bonferroni
• $\lambda_k = \bar\Phi^{-1}(\alpha k/(2m)) \simeq \sqrt{2 \log(m/k)}$: BH!
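The two penalty sequences can be computed from standard normal upper quantiles. This sketch only builds the λ sequences; it does not solve the SLOPE optimisation problem. The function names and the values m = 1000, α = 0.1 are illustrative assumptions.

```python
from statistics import NormalDist


def upper_quantile(u):
    """Upper-tail quantile of N(0, 1): the z with P(Z >= z) = u, for 0 < u < 1."""
    # By symmetry of the normal law this equals minus the lower u-quantile.
    return -NormalDist().inv_cdf(u)


def bonferroni_lambdas(m, alpha):
    """Constant sequence lambda_k = upper_quantile(alpha / (2m))."""
    return [upper_quantile(alpha / (2 * m))] * m


def bh_lambdas(m, alpha):
    """Decreasing sequence lambda_k = upper_quantile(alpha * k / (2m))."""
    return [upper_quantile(alpha * k / (2 * m)) for k in range(1, m + 1)]


lam_bonferroni = bonferroni_lambdas(m=1000, alpha=0.1)
lam_bh = bh_lambdas(m=1000, alpha=0.1)

if __name__ == "__main__":
    for k in (1, 10, 100, 1000):
        print(k, lam_bonferroni[k - 1], lam_bh[k - 1])
```

The BH-type sequence starts at the Bonferroni level for k = 1 and then decreases, so larger coefficients are penalised more heavily than smaller ones.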
Outlook
Some conclusions for FDR
⊕ Very simple
⊕ Trade-off type I / power
⊕ Adaptive to sparsity
Some issues
! Sensitive to null hypothesis
! Choosing α
! Calibrating test statistics
Main challenge
What about dependence?