today: safe testingpdg/teaching/inflearn/slides6.pdftoday: safe testing 1. reproducibility...
TRANSCRIPT
![Page 1: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/1.jpg)
TODAY: Safe Testing
1. Reproducibility Crisis/Problems with p-values
2. The S-Value
3. Optional Continuation
4. Optional Stopping vs Optional Continuation
5. Gambling Interpretation, Again
6. Types of S-Values / GROW S-Values
[Next Week No Lecture – Homework due Mo May 11]
![Page 2: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/2.jpg)
Safe Testing
Peter Grünwald
Centrum Wiskunde & Informatica – Amsterdam
Mathematical Institute – Leiden University
with Rianne de Heide,
Wouter Koolen, Judith
ter Schure, Alexander
Ly, Rosanne Turner
![Page 3: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/3.jpg)
Slate Sep 10th 2016: yet another classic finding in
psychology—that you can smile your way to
happiness—just blew up…
Reproducibility Crisis
Cover Story of
Economist (2013),
Science (2014)
![Page 4: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/4.jpg)
Reasons for Reproducibility Crisis
1. Publication Bias
2. Problems with Hypothesis Testing Methodology
![Page 5: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/5.jpg)
![Page 6: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/6.jpg)
![Page 7: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/7.jpg)
Xkcd.org
![Page 8: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/8.jpg)
Reasons for Reproducibility Crisis
1. Publication Bias
2. Problems with Hypothesis Testing Methodology
![Page 9: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/9.jpg)
Replication Crisis in
Science somehow related to use of p-values and
significance testing…
![Page 10: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/10.jpg)
Replication Crisis in
Science somehow related to use of p-values and
significance testing…
![Page 11: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/11.jpg)
Replication Crisis in
Science somehow related to use of p-values and
significance testing…
![Page 12: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/12.jpg)
Replication Crisis in
Science somehow related to use of p-values and
significance testing…
![Page 13: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/13.jpg)
Replication Crisis in
Science somehow related to use of p-values and
significance testing…
![Page 14: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/14.jpg)
• Suppose reseach group A tests medication, gets
‘almost significant’ result.
• ...whence group B tries again on new data. How to
combine their test results?
• Standard Method 1: sweep data together, recompute p-
value. This is not correct; type-I error guarantee does not
hold any more
• Standard method 2: use Fisher’s method for
combining p-values. Again not correct, since tests
cannot be viewed as independent
• Standard method 3: multiply p-values. Just plain
wrong – a mortal sin!
• With the type of “p-value” introduced here, despite
dependence, evidences can still be safely multiplied
P-value Problem:
Combining Dependent Tests
![Page 15: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/15.jpg)
• Suppose reseach group A tests medication, gets
‘almost significant’ result.
• Sometimes group A can’t resist to test a few
more subjects themselves...
• A recent survey revealed that 55% of psychologists have
succumbed to this practice (and then treat data as if large
sample size was determined in advance)
• But isn’t this just cheating?
• Not clear: what if you submit a paper and the referee
asks you to test a couple more subjects? Should you
refuse because it invalidates your p-values!?
P-value Problem (b):
Extending Your Test
![Page 16: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/16.jpg)
S is the new P
• We propose a generic replacement of
the 𝑝-value that we call the 𝑆-value
• 𝑆-values handle optional continuation
(to the next test (and the next, and ..))
without any problems
(can simply multiply S-values of
individual tests, despite dependencies)
![Page 17: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/17.jpg)
S is the new P
• We propose a generic replacement of
the 𝑝-value that we call the 𝑆-value
• 𝑆-values handle optional continuation
(to the next test (and the next, and ..))
without any problems
(can simply multiply S-values of
individual tests, despite dependencies)
S-values have Fisherian, Neymanian and Bayes-
Jeffreys’ aspects to them, all at the same time
Cf. J. Berger (2003, IMS Medaillion Lecture): Could
Neyman, Fisher and Jeffreys have agreed on testing?
![Page 18: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/18.jpg)
S-Values: General Definition
• Let 𝐻0 = 𝑃𝜃 𝜃 ∈ Θ0} represent the null hypothesis
• Assume data 𝑋1, 𝑋2, … are i.i.d. under all 𝑃 ∈ 𝐻0 .
• Let 𝐻1= 𝑃𝜃 𝜃 ∈ Θ1} represent alternative hypothesis
• An S-value for sample size 𝑛 is a function
such that for all 𝑃0 ∈ 𝐻0 , we have
![Page 19: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/19.jpg)
First Interpretation: p-values
• Proposition: Let S be an S-value. Then 𝑆−1 𝑋𝑛 is a
conservative p-value, i.e. p-value with wiggle room:
• for all 𝑃 ∈ 𝐻0, all 0 ≤ 𝛼 ≤ 1 ,
• Proof: just Markov’s inequality!
![Page 20: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/20.jpg)
Safe Tests
• The Safe Test against 𝐻0 at level 𝛼 based on S-
value S is defined as the test which rejects 𝐻0 if
S 𝑋𝑛 ≥1
𝛼
• Since 𝑆−1 is a conservative 𝑝-valuue...
• ....the safe test which rejects 𝐻0 iff 𝑆(𝑋𝑛) ≥ 20 , i.e.
𝑆−1 𝑋𝑛 ≤ 0.05 , has Type-I Error Bound of 0.05
![Page 21: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/21.jpg)
Safe Testing and Bayes
• Bayes factor hypothesis testing
with 𝐻0 = 𝑝𝜃 𝜃 ∈ Θ0} vs 𝐻1 = 𝑝𝜃 𝜃 ∈ Θ1} :
Evidence in favour of 𝐻1 measured by
where
(Jeffreys ‘39)
![Page 22: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/22.jpg)
Safe Testing and Bayes, simple 𝑯𝟎
Bayes factor hypothesis testing
between 𝐻0 = { 𝑝0} and 𝐻1 = 𝑝𝜃 𝜃 ∈ Θ1} :
Bayes factor of form
Note that (no matter what prior 𝑊1 we chose)
![Page 23: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/23.jpg)
Safe Testing and Bayes, simple 𝑯𝟎
Bayes factor hypothesis testing
between 𝐻0 = { 𝑝0} and 𝐻1 = 𝑝𝜃 𝜃 ∈ Θ1} :
Bayes factor of form
Note that (no matter what prior 𝑊1 we chose)
The Bayes Factor for Simple 𝑯𝟎
is an S-value!
![Page 24: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/24.jpg)
Default S-Value ≠ Neyman
1. 𝐻0 and 𝐻1 are point hypotheses – then default S-
value is:
... the safe test based on 𝑆 looks a bit like, but is not a
standard Neyman-Pearson test.
Safe Test: reject if 𝑆 𝑋𝜏 ≥ 1/𝛼
NP: reject if 𝑆 𝑋𝜏 ≥ 1/𝐵 with 𝐵 s.t. 𝑃0 𝑆 𝑋𝜏 ≥ 𝐵 = 𝛼
more conservative
![Page 25: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/25.jpg)
Safe Tests are Safe
under optional continuation
• Suppose we observe data (𝑋1, 𝑌1), 𝑋2, 𝑌2 , …
• 𝑌𝑖: side information
...coming in batches of size 𝑛1, 𝑛2, … , 𝑛𝑘. Let
• We first evaluate some S-value 𝑆1 on (𝑋1, … , 𝑋𝑛1).
• If outcome is in certain range (e.g. promising but not
conclusive) and 𝑌𝑛1has certain values (e.g. ‘boss has
money to collect more data’) then....
we evaluate some S-value 𝑆2 on 𝑋𝑛1+1, … , 𝑋𝑁2 ,
otherwise we stop.
![Page 26: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/26.jpg)
Safe Tests are Safe
• We first evaluate 𝑆1.
• If outcome is in certain range and 𝑌𝑛1 has certain
values then we evaluate 𝑆2 ; otherwise we stop.
• If outcome of 𝑆2 is in certain range and 𝑌𝑁2 has
certain values then we compute 𝑆3 , else we stop.
• ...and so on
• ...when we finally stop, after say 𝐾 data batches, we
report as final result the product
• First Result, Informally: any 𝑺 composed of S-
values in this manner is itself an S-value,
irrespective of the stop/continue rule used!
![Page 27: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/27.jpg)
Safe Tests are Safe
• 𝑆𝑗 may be same function as 𝑆𝑗−1, e.g. (simple 𝐻0)
• But choice of 𝑗th S-value 𝑆𝑗 may also depend on
previous 𝑋𝑁𝑗 , 𝑌𝑁𝑗 , e.g.
and then (full compatibility with Bayesian updating)
![Page 28: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/28.jpg)
Safe Tests are Safe
Let 𝑆1 be S-value on . For 𝑗 = 1,2,… , let
be any collection of S-values defined on
Let be arbitrary
stop/continue strategy, and:
Define if
Define if
else
else
and so on...
Define if
![Page 29: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/29.jpg)
Safe Tests are Safe
Theorem:
Suppose that for all 𝑖, 𝑋𝑖 ⊥ 𝑋𝑖−1, 𝑌𝑖−1 . Then
𝑆, the end-product of all employed S-values
𝑆1, 𝑆𝑔1 , 𝑆𝑔2, … is itself an S-value
• Technically, the process is a
nonnegative supermartingale (Ville ‘39) and the theorem is
proved using Doob’s optional stopping theorem
![Page 30: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/30.jpg)
Safe Tests are Safe
Theorem:
𝑆 , the end-product of all employed S-values
𝑆1, 𝑆𝑔1 , 𝑆𝑔2, … is itself an S-value
Corollary: Type-I Error Guarantee Preserved
under Optional Continuation
Suppose we combine S-values with arbitrary
stop/continue strategy and reject 𝐻0 when final 𝑆 has
𝑆−1 ≤ 0.05 . Then resulting test is a safe test and our
Type-I Error is guaranteed to be below 0.05!
![Page 31: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/31.jpg)
Safe Tests are Safe
Theorem:
𝑆 , the end-product of all employed S-values
𝑆1, 𝑆𝑔1 , 𝑆𝑔2, … is itself an S-value
Corollary: Type-I Error Guarantee Preserved
under Optional Continuation
Suppose we combine S-values with arbitrary
stop/continue strategy and reject 𝐻0 when final 𝑆 has
𝑆−1 ≤ 0.05 . Then resulting test is a safe test and our
Type-I Error is guaranteed to be below 0.05!
![Page 32: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/32.jpg)
Generalizing the Result
Theorem says:
• Let where
s.t.
, is S-value:
• Suppose that for all 𝑖, 𝑋𝑖 ⊥ 𝑋𝑖−1, 𝑌𝑖−1 .
• Let 𝜏 be smallest 𝑗 such that 𝑔𝑗 𝑍𝑁𝑗 = 𝑠𝑡𝑜𝑝
• Then is an S-value
![Page 33: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/33.jpg)
Generalizing the Result
Theorem says:
• Let where
For all 𝑗, let and let …
, is S-value:
• s.t.
• Suppose that for all 𝑖, 𝑋𝑖 ⊥ 𝑋𝑖−1, 𝑌𝑖−1 .
• Let 𝜏 be smallest 𝑗 such that 𝑔𝑗 𝑍𝑁𝑗 = 𝑠𝑡𝑜𝑝
• Then is an S-value
![Page 34: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/34.jpg)
Generalizing the Result
Theorem says: for all 𝑗,
• Let 𝑋⟨𝑗⟩ = (𝑋𝑁𝑗−1 +1, … , 𝑋𝑁𝑗) ; 𝑋⟨𝑗⟩ = (𝑋1, … , 𝑋𝑁𝑗
);
similarly for 𝑌, 𝑍
• Let be a function such that
• Let 𝜏 be a stopping time for the process 𝑍 1 , 𝑍 2 , …
• Then is an S-value
![Page 35: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/35.jpg)
Optional Stopping vs
Continuation
• Let 𝑋⟨𝑗⟩ = (𝑋𝑁𝑗−1 +1, … , 𝑋𝑁𝑗) ; 𝑋⟨𝑗⟩ = (𝑋1, … , 𝑋𝑁𝑗
)
• Let be a function such that
• Let 𝜏 be a stopping time for the process 𝑍 1 , 𝑍 2 , …
• Then is an S-value
Optional Continuation: we have batches of data
𝑍⟨1⟩, 𝑍⟨2⟩, … and we can do optional stopping at the batch-
level (and obtain an S-Value and preserve Type I error
guarantees)
Traditional Optional Stopping: we take 𝑍⟨𝑗 ⟩ = 𝑍𝑗 for all 𝑗.
![Page 36: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/36.jpg)
Optional Stopping vs
Continuation
• In many but certainly not all cases, we can also do
optional stopping based on S-values.
• Suppose that 𝐻0 = {𝑃0}, , no 𝑌𝑖 ‘s
• Then and we can do optional
stopping at each 𝑗 and not just optional continuation
between ‘blocks’
• …but if 𝐻0 composite then sometimes the only
conditional S-value satisfying
is given by the trivial S-value 𝑆𝑗+1 = 1 . Then OS
impossible (but OC with batches of size 𝑛𝑗 ≫ 1 still
possible)
![Page 37: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/37.jpg)
(Again:) Safe Testing = Gambling!
• At time 1 you can buy ticket 1 for 1$. It pays off
𝑆1(𝑋1, … , 𝑋𝑛1) $ after 𝑛1 steps
• At time 2 you can buy ticket 2 for 1$. It pays off
𝑆2(𝑋𝑛1+1, … , 𝑋𝑁2) $ after 𝑛2 further steps.... and so on.
You may buy multiple and fractional nrs of tickets.
Kelly (1956)
![Page 38: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/38.jpg)
Safe Testing = Gambling!
• At time 1 you can buy ticket 1 for 1$. It pays off
𝑆1(𝑋1, … , 𝑋𝑛1) $ after 𝑛1 steps
• At time 2 you can buy ticket 2 for 1$. It pays off
𝑆2(𝑋𝑛1+1, … , 𝑋𝑁2) $ after 𝑛2 further steps.... and so on.
You may buy multiple and fractional nrs of tickets.
• You start by investing 1$ in ticket 1.
![Page 39: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/39.jpg)
Safe Testing = Gambling!
• At time 1 you can buy ticket 1 for 1$. It pays off
𝑆1(𝑋1, … , 𝑋𝑛1) $ after 𝑛1 steps
• At time 2 you can buy ticket 2 for 1$. It pays off
𝑆2(𝑋𝑛1+1, … , 𝑋𝑁2) $ after 𝑛2 further steps.... and so on.
You may buy multiple and fractional nrs of tickets.
• You start by investing 1$ in ticket 1.
• After 𝑛1 outcomes you either stop with end capital 𝑆1or you continue and buy 𝑆1 tickets of type 2.
![Page 40: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/40.jpg)
Safe Testing = Gambling!
• At time 1 you can buy ticket 1 for 1$. It pays off
𝑆1(𝑋1, … , 𝑋𝑛1) $ after 𝑛1 steps
• At time 2 you can buy ticket 2 for 1$. It pays off
𝑆2(𝑋𝑛1+1, … , 𝑋𝑁2) $ after 𝑛2 further steps.... and so on.
You may buy multiple and fractional nrs of tickets.
• You start by investing 1$ in ticket 1.
• After 𝑛1 outcomes you either stop with end capital 𝑆1or you continue and buy 𝑆1 tickets of type 2. After 𝑁2 =𝑛1 + 𝑛2 outcomes you stop with end capital 𝑆1 ⋅ 𝑆2 or
you continue and buy 𝑆1 ⋅ 𝑆2 tickets of type 3, and so
on..
![Page 41: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/41.jpg)
Safe Testing = Gambling!
• You start by investing 1$ in ticket 1.
• After 𝑛1 outcomes you either stop with end capital 𝑆1or you continue and buy 𝑆1 tickets of type 2. After
𝑁2 = 𝑛1 + 𝑛2 outcomes you stop with end capital 𝑆1 ⋅𝑆2 or you continue and buy 𝑆1 ⋅ 𝑆2 tickets of type 3,
and so on...
• 𝑺 is simply your end capital
• Your don’t expect to gain money, no matter what the
stop/continuation rule since none of individual
gambles 𝑺𝒌 are strictly favorable to you
![Page 42: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/42.jpg)
Safe Testing = Gambling!
• You start by investing 1$ in ticket 1.
• After 𝑛1 outcomes you either stop with end capital 𝑆1or you continue and buy 𝑆1 tickets of type 2. After
𝑁2 = 𝑛1 + 𝑛2 outcomes you stop with end capital 𝑆1 ⋅𝑆2 or you continue and buy 𝑆1 ⋅ 𝑆2 tickets of type 3,
and so on...
• 𝑺 is simply your end capital
• Your don’t expect to gain money, no matter what the
stop/continuation rule since none of individual
gambles 𝑺𝒌 are strictly favorable to you
• Hence a large value of 𝑺 indicates that something
very unlikely has happened under 𝐻0 ...
![Page 43: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/43.jpg)
Default S-Value ≠ Neyman
1. 𝐻0 and 𝐻1 are point hypotheses – then default S-
value is:
... the safe test based on 𝑆 looks a bit like, but is not a
standard Neyman-Pearson test.
Safe Test: reject if 𝑆 𝑋𝜏 ≥ 1/𝛼
NP: reject if 𝑆 𝑋𝜏 ≥ 1/𝐵 with 𝐵 s.t. 𝑃0 𝑆 𝑋𝜏 ≥ 𝐵 = 𝛼
more conservative
![Page 44: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/44.jpg)
SafeTests & Neyman-Pearson, again
• Let 𝑝 be a strict 𝑝-value: for all 𝑃 ∈ 𝐻0, 𝑃 𝑝 ≤ 𝛼 = 𝛼
• Let 𝑆 =1
𝛼if 𝑝 ≤ 𝛼 , and 𝑆 =0 otherwise
• Then for all 𝑃 ∈ 𝐻0,
...so 𝑆 is an S-value, and obviously, the safe test based
on 𝑆 rejects iff 𝑝 ≤ 𝛼. It thus implements the Neyman-
Pearson test at significance level 𝛼.
![Page 45: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/45.jpg)
• Let 𝑝 be a strict 𝑝-value: for all 𝑃 ∈ 𝐻0, 𝑃 𝑝 ≤ 𝛼 = 𝛼
• Let 𝑆 =1
𝛼if 𝑝 ≤ 𝛼 , and 𝑆 =0 otherwise
• Then for all 𝑃 ∈ 𝐻0,
...so 𝑆 is an S-value, and obviously, the safe test based
on 𝑆 rejects iff 𝑝 ≤ 𝛼. t thus implements the Neyman-
Pearson test at significance level 𝛼.
...but it is a very silly S-value to use! With
probability 𝜶, you loose all your capital, and you will
never make up for that in the future!
SafeTests & Neyman-Pearson, again
![Page 46: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/46.jpg)
Safe Tests and Neyman-Pearson,
again
• The Safe Test based on an S-Value that is a
likelihood ratio is not a Neyman-Pearson test (it is
more conservative)
• Neyman-Pearson tests (that only report ‘reject’
and ‘accept’, and not the p-value) are (other) Safe
Tests, but useless ones corresponding to
irresponsible gambling...
![Page 47: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/47.jpg)
Some S-Values are
Better than Others
• The Trivial S-Value 𝑆 = 1 is valid, but useless
• The Neyman-Pearson S-value is valid, but extremely
dangerous to use!
• We need some idea of ‘optimal S-value’
![Page 48: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/48.jpg)
How to design S-Values?
• Suppose we are willing to admit that we’ll only be
able to tell 𝐻0 and 𝐻1 apart if 𝑃 ∈ 𝐻0 ∪ 𝐻1′ for some
𝐻1′ ⊂ 𝐻1 that excludes points that are ‘too close’ to 𝐻0
e.g.
• We can then look for the GROW (growth-optimal in
worst-case) S-value achieving
![Page 49: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/49.jpg)
GROW: an analogue of Power
• The GROW (growth-optimal in worst-case) S-value
relative to 𝐻1,𝛿 is the S-value achieving
where the supremum is over all 𝑆-values relative to 𝐻0• ...so we don’t expect to gain anything when investing
in 𝑆 under 𝐻0• ...but among all such 𝑆 we pick the one(s) that make
us rich fastest if we keep reinvesting in new gambles
under 𝐻1
![Page 50: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/50.jpg)
Main Theorem
(will be made precise in 3 weeks)
• Under ‘hardly any conditions’ on 𝐻0 and 𝐻1 a GROW
S-value exists! [G., De Heide, Koolen, 2019]
![Page 51: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/51.jpg)
• Jerzy Neyman: alternative exists, “inductive . .
... behaviour”, ‘significance level’ and power
Sir Ronald Fisher: test statistic rather than
alternative, p-value indicates “unlikeliness”
• Sir Harold Jeffreys: Bayesian, alternative exists,
absolutely no p-values
J. Berger (2003, IMS Medaillion Lecture): Could
Neyman, Fisher and Jeffreys have agreed on testing?
... Using S-Values we can unify/correct the central
ideas
Three Philosophies of Testing
![Page 52: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/52.jpg)
“Fisherian” Example
2. Ryabko & Monarev’s (2005)
Compression-based randomness test
R&M checked whether sequences generated by
famous random number generators can be
compressed by standard data compressors such as
gzip and rar
Answer: yes! 200 bits compression for file of 10
megabytes
![Page 53: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/53.jpg)
Additional Background Slides
![Page 54: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/54.jpg)
Null Hypothesis Testing
• Let 𝐻0 = 𝑃𝜃 𝜃 ∈ Θ0} represent the null hypothesis
• For simplicity, today we assume data 𝑋1, 𝑋2, … are
i.i.d. under all 𝑃 ∈ 𝐻0 .
• Let 𝐻1= 𝑃𝜃 𝜃 ∈ Θ1} represent alternative hypothesis
• Example: testing whether a coin is fair
Under 𝑃𝜃 , data are i.i.d. Bernoulli 𝜃
Θ0 =1
2, Θ1 = 0,1 ∖
1
2
Standard test would measure frequency of 1s
![Page 55: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/55.jpg)
Null Hypothesis Testing
• Let 𝐻0 = 𝑃𝜃 𝜃 ∈ Θ0} represent the null hypothesis
• Let 𝐻1= 𝑃𝜃 𝜃 ∈ Θ1} represent alternative hypothesis
• Example: testing whether a coin is fair
Under 𝑃𝜃 , data are i.i.d. Bernoulli 𝜃
Θ0 =1
2, Θ1 = 0,1 ∖
1
2
Standard test would measure frequency of 1s
Simple 𝐻0
![Page 56: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/56.jpg)
Null Hypothesis Testing
• Let 𝐻0 = 𝑃𝜃 𝜃 ∈ Θ0} represent the null hypothesis
• Let 𝐻1= 𝑃𝜃 𝜃 ∈ Θ1} represent alternative hypothesis
• Example: t-test (most used test world-wide)
𝐻0: 𝑋𝑖 ∼𝑖.𝑖.𝑑. 𝑁 0, 𝜎2 vs.
𝐻1 : 𝑋𝑖 ∼𝑖.𝑖.𝑑. 𝑁 𝜇, 𝜎2 for some 𝜇 ≠ 0
𝜎2 unknown (‘nuisance’) parameter
𝐻0 = 𝑃𝜎 𝜎 ∈ 0,∞ }
𝐻1 = 𝑃𝜎,𝜇 𝜎 ∈ 0,∞ , 𝜇 ∈ ℝ ∖ 0 }
![Page 57: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/57.jpg)
Null Hypothesis Testing
• Let 𝐻0 = 𝑃𝜃 𝜃 ∈ Θ0} represent the null hypothesis
• Let 𝐻1= 𝑃𝜃 𝜃 ∈ Θ1} represent alternative hypothesis
• Example: t-test (most used test world-wide)
𝐻0: 𝑋𝑖 ∼𝑖.𝑖.𝑑. 𝑁 0, 𝜎2 vs.
𝐻1 : 𝑋𝑖 ∼𝑖.𝑖.𝑑. 𝑁 𝜇, 𝜎2 for some 𝜇 ≠ 0
𝜎2 unknown (‘nuisance’) parameter
𝐻0 = 𝑃𝜎 𝜎 ∈ 0,∞ }
𝐻1 = 𝑃𝜎,𝜇 𝜎 ∈ 0,∞ , 𝜇 ∈ ℝ ∖ 0 }
Composite 𝐻0
![Page 58: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping](https://reader036.vdocuments.us/reader036/viewer/2022071013/5fcb953eac84254d782ab2b5/html5/thumbnails/58.jpg)
Standard Method:
p-value, significance
• Let 𝐻0 = 𝑃𝜃 𝜃 ∈ Θ0} represent the null hypothesis
• A (“nonstrict”) p-value is a random variable (!) such
that, for all 𝜃 ∈ Θ0 ,