Statistical inference in neuroimaging
If you liked it you should’ve put a p-value on it
… or not.
Chris Gorgolewski Max Planck Institute for Human Cognitive and Brain Sciences
SIGNAL DETECTION THEORY
Signal and noise
False positive and false negative errors
Power
Signal detection theory
Types of errors
Vocabulary
• Type I error – false positive
• Type II error – false negative
• False positive rate
• False negative rate
• Statistical power = 1 – false negative rate
• Sensitivity = Power
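The vocabulary above can be made concrete with a toy simulation (the effect size of 2, the sample sizes, and α = 0.05 are all made-up values for illustration):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n_null, n_signal, alpha = 1000, 1000, 0.05

# Voxels with no effect (pure noise) vs. voxels with a true effect of size 2
null_z = rng.normal(0, 1, n_null)
signal_z = rng.normal(2, 1, n_signal)

# One-sided threshold for alpha = 0.05 (z ~ 1.64): inference = thresholding
z_thresh = norm.ppf(1 - alpha)

fpr = np.mean(null_z > z_thresh)     # false positive rate (Type I errors)
fnr = np.mean(signal_z <= z_thresh)  # false negative rate (Type II errors)
power = 1 - fnr                      # power = sensitivity = 1 - FNR
```

Lowering the effect size (the SNR) in this sketch raises the false negative rate, which is exactly the "lower SNR = we miss more stuff" point below.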
Inference = thresholding
Inference = thresholding
Signal to Noise ratio
Looking in the wrong places
Lower SNR = we miss more stuff
Lower SNR = higher FDR threshold
VOXELWISE TESTS
P-maps
Multiple comparison
FWE correction: Bonferroni, permutations
FDR correction: B-H, Local FDR
Hypothesis testing
• Distinguish between two hypotheses
1. H0 – there is no difference between groups
2. H1 – there is a difference between groups
• Or…
1. H0 – there is no relation between two variables
2. H1 – there is some relation between the two variables
From statistical values to p-values
• Various procedures give us statistical values
– T-tests (one sample, two sample, paired etc.)
– F-Tests
– Correlation tests (r values)
• What is a p value?
P value
• P value = the probability that, if we repeat our experiment (with all its analyses) and there is no true effect, we will obtain this statistical value or a greater one.
t, z, F to p
OK back to neuroimaging
• Assuming that we are doing a mass univariate analysis (we look at each voxel independently), we have a t-map
• Now using a theoretical distribution (given the degrees of freedom) we can turn it into a p-map
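As a sketch of that t-to-p conversion (the toy t-map and df = 19 are assumed values, not from any real dataset):

```python
import numpy as np
from scipy import stats

# Toy t-map, one value per voxel; df = 19 would correspond to e.g. a
# one-sample test over 20 subjects (both numbers are assumptions)
rng = np.random.default_rng(42)
t_map = rng.normal(0, 1, size=(4, 4, 4))
df = 19

# One-sided p-map: for each voxel, the probability of a t-value this
# large or larger under the null t-distribution
p_map = stats.t.sf(t_map, df=df)
```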
Inference!
• We take our p-map and discard all voxels with values > 0.05
– “The value for which P=0.05, or 1 in 20, is 1.96 or nearly 2; it is convenient to take this point as a limit in judging whether a deviation ought to be considered significant or not. Deviations exceeding twice the standard deviation are thus formally regarded as significant.”
• We are done – right?
Not quite done yet…
• Let me generate two vectors of values and test using a t-test if they are different
• What is the probability that P(t) < 0.05?
– Well… 0.05
• Let me generate another set of values… and another… 100 pairs of vectors
• What is the probability that at least one of the tests is significant?
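A quick simulation of that thought experiment (the number of repetitions and the vector lengths are hypothetical choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_experiments, n_tests, n_samples, alpha = 500, 100, 20, 0.05

hits = 0
for _ in range(n_experiments):
    # 100 independent two-sample t-tests on pure noise (no real effect)
    a = rng.normal(size=(n_tests, n_samples))
    b = rng.normal(size=(n_tests, n_samples))
    _, p = stats.ttest_ind(a, b, axis=1)
    hits += np.any(p < alpha)          # did at least one test come out "significant"?

empirical = hits / n_experiments
analytic = 1 - (1 - alpha) ** n_tests  # probability of at least one false positive
```

With 100 tests the analytic answer is 1 − 0.95¹⁰⁰ ≈ 0.99, which is why dead salmon "activate".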
The Salmon of Doubt
Correcting for multiple comparisons
• Bonferroni correction (based on Boole's inequality)
– Divide your p-threshold by the number of tests you have performed
– Or multiply your p-values by the number of tests you have performed
Bonferroni is a Family Wise Error correction
It guarantees that the chance of getting at least one false positive across all the tests is less than your p-threshold
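A minimal sketch of the correction in the "multiply your p-values" form (the example p-values are made up):

```python
import numpy as np

def bonferroni(p_values, alpha=0.05):
    """Bonferroni FWE correction: multiply each p-value by the number of tests."""
    p = np.asarray(p_values, dtype=float)
    corrected = np.minimum(p * p.size, 1.0)  # cap at 1, since these are probabilities
    return corrected, corrected < alpha

# Hypothetical p-values from four tests
corrected, significant = bonferroni([0.0004, 0.01, 0.03, 0.2])
```

Here 0.03 × 4 = 0.12 no longer survives the 0.05 threshold, while the two smallest p-values still do.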
Permutation based FWE correction
• The assumptions behind the theoretical distributions are often not met
• There are many dependencies between voxels
– The tests are not independent, so the Bonferroni correction can be overly conservative
• We can however establish an empirical distribution
Permutation based FWE correction
1. Break the relation: shuffle the participants between the groups
2. Perform the test
3. Save the maximum statistical value across voxels
4. Repeat
Permutation based FWE correction
Our FWE-corrected p-value is the percentage of permutations that yielded statistical values higher than the original (unshuffled) one
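The steps above can be sketched as a max-statistic permutation test on toy data (group sizes, voxel count, and the planted effect are all hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_per_group, n_voxels, n_perm = 12, 50, 1000

# Toy data: two groups of 12, with a real effect planted in the first 5 voxels
group_a = rng.normal(0, 1, (n_per_group, n_voxels))
group_b = rng.normal(0, 1, (n_per_group, n_voxels))
group_b[:, :5] += 2.0
data = np.vstack([group_a, group_b])
labels = np.array([0] * n_per_group + [1] * n_per_group)

def t_map(data, labels):
    t, _ = stats.ttest_ind(data[labels == 0], data[labels == 1], axis=0)
    return t

observed = np.abs(t_map(data, labels))

# Null distribution of the *maximum* |t| across voxels
max_null = np.empty(n_perm)
for i in range(n_perm):
    perm = rng.permutation(labels)                  # 1. break the group assignment
    max_null[i] = np.abs(t_map(data, perm)).max()   # 2.-3. test, keep the max

# 4. FWE-corrected p-value per voxel: fraction of permutations whose
# maximum |t| reached that voxel's observed statistic
p_fwe = (max_null[None, :] >= observed[:, None]).mean(axis=1)
```

Using the maximum across voxels in each permutation is what makes this a family-wise correction rather than a per-voxel one.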
False Discovery Rate
• Even conceptually FWE correction seems conservative
– At least one test out of 60 000?
• Is there a more intuitive way of looking at this?
False Discovery Rate
I present a number of voxels that I think show a strong effect, but I admit that a certain percentage of them might be false positives.
False Discovery Rate
Percentage of false positive voxels among all significant voxels.
FDR procedures
• Benjamini-Hochberg procedure
– With its variant for dependent tests (Benjamini-Yekutieli)
• Efron's local FDR procedure
– Explicit modeling of the signal distribution
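A sketch of the Benjamini-Hochberg step-up procedure (the example p-values are invented):

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """B-H step-up: find the largest rank k with p_(k) <= (k/m) * q
    and call the k smallest p-values significant."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    passes = p[order] <= np.arange(1, m + 1) / m * q
    significant = np.zeros(m, dtype=bool)
    if passes.any():
        k = np.nonzero(passes)[0].max()   # largest passing rank (0-based)
        significant[order[:k + 1]] = True
    return significant

# Hypothetical p-values from six tests
sig = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.2, 0.9])
```

Note that the threshold grows with the rank, so B-H is less strict than Bonferroni for all but the smallest p-value.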
Interim Summary
• FWE corrections
– Bonferroni – simple but struggles with dependencies (over conservative)
– Permutations – less dependent on assumptions, but time consuming
• FDR corrections
– B-H – simple but also struggles with dependencies
– Local FDR – data driven, but can fail in case of low SNR
CLUSTER EXTENT TESTS
Test how big the blobs are
Random field theory
Smoothness estimation
Permutation test
The problem of the cluster-forming threshold
Fun fact: FWE with RFT
Intuition
If we are interested in continuous regions of activation, why are we looking at voxels and not blobs?
Aww, patterns!
No wait… it’s just smooth noise…
What contributes to expected cluster size?
How likely is it to get a cluster of this size from pure noise?
It depends… on:
1. cluster forming threshold
2. smoothness of the map
3. size of the map
Where do we get those parameters?
1. cluster forming threshold
– Arbitrary decision
2. smoothness of the map
– Estimated from the residuals of the GLM
3. size of the map
– Calculated from the mask
Permutation based cluster extent probability
1. Break the relation: shuffle the participants between the groups
2. Perform the test
3. Threshold the map to get clusters
4. Save the sizes of all clusters
5. Repeat
Permutation based cluster extent probability
Our cluster-extent p-value is the percentage of permutations that yielded cluster sizes bigger than the original (unshuffled) one
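The cluster-extent permutation procedure can be sketched on a 1-D toy "brain" (the cluster-forming threshold of |t| > 2, the group sizes, and the planted 20-voxel effect are all arbitrary choices):

```python
import numpy as np
from scipy import stats, ndimage

rng = np.random.default_rng(3)
n_per_group, n_voxels, n_perm = 15, 200, 500
cft = 2.0  # cluster-forming threshold on |t| (an arbitrary decision)

# 1-D toy "brain": two groups, a 20-voxel-wide true effect at voxels 80-99
group_a = rng.normal(0, 1, (n_per_group, n_voxels))
group_b = rng.normal(0, 1, (n_per_group, n_voxels))
group_b[:, 80:100] += 1.5
data = np.vstack([group_a, group_b])
labels = np.array([0] * n_per_group + [1] * n_per_group)

def max_cluster_size(data, labels):
    t, _ = stats.ttest_ind(data[labels == 0], data[labels == 1], axis=0)
    clusters, n_clusters = ndimage.label(np.abs(t) > cft)  # threshold, find blobs
    if n_clusters == 0:
        return 0
    return int(np.bincount(clusters.ravel())[1:].max())

observed_max = max_cluster_size(data, labels)

# Null distribution of the maximum cluster size under shuffled labels
null_max = np.array([max_cluster_size(data, rng.permutation(labels))
                     for _ in range(n_perm)])
p_cluster = (null_max >= observed_max).mean()
```

This toy data is unsmoothed; with realistically smooth maps the null clusters get bigger, which is why smoothness estimation matters above.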
Cluster forming threshold conundrum
HONORABLE MENTIONS
TFCE
Mixture models
Threshold Free Cluster Enhancement
Spatially Regularized Mixture Models
IMPLEMENTATIONS
SPM
FSL
AFNI
SPM
• RFT based voxelwise FWE correction
• Smoothness estimation
• Cluster extent p-values
• Peak height p-values
• Permutation tests through SnPM toolbox
FSL
• RFT based voxelwise FWE correction
• Smoothness estimation
• Cluster extent p-values
• FDR
• Permutation tests through randomise
– Including TFCE
AFNI
• Cluster extent p-values (3dClustSim)
– Simulations are not permutations
• Smoothness estimation (3dFWHMx)
Interim summary
Clusterwise methods allow us to find surprising patterns in terms of spatially consistent clusters instead of individual voxels.
LIMITATIONS OF P-VALUES
P-VALUES ARE MEANINGLESS
FORGET ALL I SAID SO FAR
WE ARE ALL DOOMED
P-value paradox
• There are no two entities or groups that are truly identical
• There are no two variables that are in no way unrelated
• We just fail to obtain enough samples to see it
– Or our tools are not sensitive enough
More samples more “significance”
• The more subjects you have in your study, the more likely it is that you will find something significant
• The same applies to scan length, and field strength
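A toy demonstration of this point: with a tiny, fixed true effect, the rejection rate climbs toward 1 as the sample size grows (the effect size and sample sizes here are made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def rejection_rate(n, effect=0.05, n_sims=200, alpha=0.05):
    """How often a two-sample t-test flags a tiny true effect as significant."""
    rejections = 0
    for _ in range(n_sims):
        a = rng.normal(0, 1, n)
        b = rng.normal(effect, 1, n)
        rejections += stats.ttest_ind(a, b).pvalue < alpha
    return rejections / n_sims

small_n = rejection_rate(50)      # tiny effect, small sample: rarely "significant"
large_n = rejection_rate(10000)   # same tiny effect, huge sample: almost always
```

The effect of d = 0.05 never changed; only our ability to detect it did.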
H0 is never true
we just fail to show that
P-value failure
• P-values do not tell us much about the actual size of the effect
• Nor do they tell us about the predictive power of the relation we found
The interesting question
Is PCC involved in autism?
vs.
Given the cortical thickness of a subject's PCC, how well am I able to predict his or her diagnosis?
Why does this matter
• More subjects, longer scans, stronger scans – everything is significant
– We are getting there
• Lack of faith in science from the public
– Poor reproducibility
What needs to be done
We need more replications
We need to start reporting null results
What you can do
• Report effect sizes and their confidence intervals – for all tests/voxels, not just those significant
• Share the unthresholded statistical maps – It only takes 5 minutes on neurovault.org
• Report all the tests you have performed – not just the significant ones
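As a sketch of reporting an effect size with its confidence interval, here is Cohen's d for two independent groups (the simulated groups and true effect are invented, and the SE formula is the usual large-sample approximation):

```python
import numpy as np
from scipy import stats

def cohens_d_with_ci(a, b, alpha=0.05):
    """Cohen's d for two independent samples with an approximate normal CI."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * np.var(a, ddof=1) +
                  (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    d = (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)
    # Large-sample approximation to the standard error of d
    se = np.sqrt((na + nb) / (na * nb) + d ** 2 / (2 * (na + nb)))
    z = stats.norm.ppf(1 - alpha / 2)
    return d, (d - z * se, d + z * se)

# Hypothetical "patients" vs. "controls" with a true effect of d = 0.8
rng = np.random.default_rng(5)
a = rng.normal(0.8, 1, 40)
b = rng.normal(0.0, 1, 40)
d, (lo, hi) = cohens_d_with_ci(a, b)
```

The interval communicates both the size and the uncertainty of the effect, which a bare p-value does not.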
http://dx.doi.org/10.1016/j.neuron.2012.05.001
If you liked it you should’ve convinced a skeptical researcher to try to replicate your results.