Statistical inference in neuroimaging
If you liked it you should’ve put a p-value on it
… or not.
Chris Gorgolewski Max Planck Institute for Human Cognitive and Brain Sciences
SIGNAL DETECTION THEORY
Signal and noise
False positive and false negative errors
Power
Signal detection theory
Types of errors
Vocabulary
• Type I error – false positive
• Type II error – false negative
• False positive rate
• False negative rate
• Statistical power = 1 – false negative rate
• Sensitivity = Power
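The vocabulary above can be made concrete with a toy simulation (the effect size of 2, the sample sizes, and α = 0.05 are all made-up values for illustration):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n_null, n_signal, alpha = 1000, 1000, 0.05

# Voxels with no effect (pure noise) vs. voxels with a true effect of size 2
null_z = rng.normal(0, 1, n_null)
signal_z = rng.normal(2, 1, n_signal)

# One-sided threshold for alpha = 0.05 (z ~ 1.64): inference = thresholding
z_thresh = norm.ppf(1 - alpha)

fpr = np.mean(null_z > z_thresh)     # false positive rate (Type I errors)
fnr = np.mean(signal_z <= z_thresh)  # false negative rate (Type II errors)
power = 1 - fnr                      # power = sensitivity = 1 - FNR
```

Lowering the effect size (the SNR) in this sketch raises the false negative rate, which is exactly the "lower SNR = we miss more stuff" point below.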
Inference = thresholding
Inference = thresholding
Signal to Noise ratio
Looking in the wrong places
Lower SNR = we miss more stuff
Lower SNR = higher FDR threshold
VOXELWISE TESTS
P-maps
Multiple comparison
FWE correction: Bonferroni, permutations
FDR correction: B-H, Local FDR
Hypothesis testing
• Distinguish between two hypotheses
1. H0 – there is no difference between groups
2. H1 – there is a difference between groups
• Or…
1. H0 – there is no relation between two variables
2. H1 – there is some relation between the two variables
From statistical values to p-values
• Various procedures give us statistical values
– T-tests (one sample, two sample, paired etc.)
– F-Tests
– Correlation tests (r values)
• What is a p value?
P value
• P value = the probability that, if we repeat our experiment (with all its analyses) and there is no true effect, we will obtain this statistical value or a greater one.
t, z, F to p
OK back to neuroimaging
• Assuming that we are doing a mass univariate analysis (we look at each voxel independently), we have a t-map
• Now using a theoretical distribution (given the degrees of freedom) we can turn it into a p-map
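As a sketch of that t-to-p conversion (the toy t-map and df = 19 are assumed values, not from any real dataset):

```python
import numpy as np
from scipy import stats

# Toy t-map, one value per voxel; df = 19 would correspond to e.g. a
# one-sample test over 20 subjects (both numbers are assumptions)
rng = np.random.default_rng(42)
t_map = rng.normal(0, 1, size=(4, 4, 4))
df = 19

# One-sided p-map: for each voxel, the probability of a t-value this
# large or larger under the null t-distribution
p_map = stats.t.sf(t_map, df=df)
```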
Inference!
• We take our p-map and discard all voxels with values > 0.05
– “The value for which P=0.05, or 1 in 20, is 1.96 or nearly 2; it is convenient to take this point as a limit in judging whether a deviation ought to be considered significant or not. Deviations exceeding twice the standard deviation are thus formally regarded as significant.”
• We are done – right?
Not quite done yet…
• Let me generate two vectors of values and test using a t-test if they are different
• What is the probability that P(t) < 0.05?
– Well… 0.05
• Let me generate another set of values… and another… 100 pairs of vectors
• What is the probability that at least one of the tests is significant?
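A quick simulation of that thought experiment (the number of repetitions and the vector lengths are hypothetical choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_experiments, n_tests, n_samples, alpha = 500, 100, 20, 0.05

hits = 0
for _ in range(n_experiments):
    # 100 independent two-sample t-tests on pure noise (no real effect)
    a = rng.normal(size=(n_tests, n_samples))
    b = rng.normal(size=(n_tests, n_samples))
    _, p = stats.ttest_ind(a, b, axis=1)
    hits += np.any(p < alpha)          # did at least one test come out "significant"?

empirical = hits / n_experiments
analytic = 1 - (1 - alpha) ** n_tests  # probability of at least one false positive
```

With 100 tests the analytic answer is 1 − 0.95¹⁰⁰ ≈ 0.99, which is why dead salmon "activate".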
The Salmon of Doubt
Correcting for multiple comparisons
• Bonferroni correction (based on Boole's inequality)
– Divide your p-threshold by the number of tests you have performed
– Or multiply your p-values by the number of tests you have performed
Bonferroni is a Family Wise Error correction
It guarantees that the chance of getting at least one false positive across all the tests is less than your p-threshold
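A minimal sketch of the correction in the "multiply your p-values" form (the example p-values are made up):

```python
import numpy as np

def bonferroni(p_values, alpha=0.05):
    """Bonferroni FWE correction: multiply each p-value by the number of tests."""
    p = np.asarray(p_values, dtype=float)
    corrected = np.minimum(p * p.size, 1.0)  # cap at 1, since these are probabilities
    return corrected, corrected < alpha

# Hypothetical p-values from four tests
corrected, significant = bonferroni([0.0004, 0.01, 0.03, 0.2])
```

Here 0.03 × 4 = 0.12 no longer survives the 0.05 threshold, while the two smallest p-values still do.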
Permutation based FWE correction
• The assumptions behind the theoretical distributions are often not met
• There are many dependencies between voxels
– The tests are not independent, so the Bonferroni correction can be overly conservative
• We can however establish an empirical distribution
Permutation based FWE correction
1. Break the relation: shuffle the participants between the groups
2. Perform the test
3. Save the maximum statistical value across voxels
4. Repeat
Permutation based FWE correction
Our FWE-corrected p-value is the percentage of permutations that yielded statistical values higher than the original (unshuffled) one
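The steps above can be sketched as a max-statistic permutation test on toy data (group sizes, voxel count, and the planted effect are all hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_per_group, n_voxels, n_perm = 12, 50, 1000

# Toy data: two groups of 12, with a real effect planted in the first 5 voxels
group_a = rng.normal(0, 1, (n_per_group, n_voxels))
group_b = rng.normal(0, 1, (n_per_group, n_voxels))
group_b[:, :5] += 2.0
data = np.vstack([group_a, group_b])
labels = np.array([0] * n_per_group + [1] * n_per_group)

def t_map(data, labels):
    t, _ = stats.ttest_ind(data[labels == 0], data[labels == 1], axis=0)
    return t

observed = np.abs(t_map(data, labels))

# Null distribution of the *maximum* |t| across voxels
max_null = np.empty(n_perm)
for i in range(n_perm):
    perm = rng.permutation(labels)                  # 1. break the group assignment
    max_null[i] = np.abs(t_map(data, perm)).max()   # 2.-3. test, keep the max

# 4. FWE-corrected p-value per voxel: fraction of permutations whose
# maximum |t| reached that voxel's observed statistic
p_fwe = (max_null[None, :] >= observed[:, None]).mean(axis=1)
```

Using the maximum across voxels in each permutation is what makes this a family-wise correction rather than a per-voxel one.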
False Discovery Rate
• Even conceptually FWE correction seems conservative
– At least one test out of 60 000?
• Is there a more intuitive way of looking at this?
False Discovery Rate
I present a number of voxels that I think show a strong effect, but I admit that a certain percentage of them might be false positives.
False Discovery Rate
Percentage of false positive voxels among all significant voxels.
FDR procedures
• Benjamini-Hochberg procedure
– With its variant for dependent tests (Benjamini-Yekutieli)
• Efron's local FDR procedure
– Explicit modeling of the signal distribution
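A sketch of the Benjamini-Hochberg step-up procedure (the example p-values are invented):

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """B-H step-up: find the largest rank k with p_(k) <= (k/m) * q
    and call the k smallest p-values significant."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    passes = p[order] <= np.arange(1, m + 1) / m * q
    significant = np.zeros(m, dtype=bool)
    if passes.any():
        k = np.nonzero(passes)[0].max()   # largest passing rank (0-based)
        significant[order[:k + 1]] = True
    return significant

# Hypothetical p-values from six tests
sig = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.2, 0.9])
```

Note that the threshold grows with the rank, so B-H is less strict than Bonferroni for all but the smallest p-value.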
Interim Summary
• FWE corrections
– Bonferroni – simple but struggles with dependencies (over conservative)
– Permutations – less dependent on assumptions, but time consuming
• FDR corrections
– B-H – simple but also struggles with dependencies
– Local FDR – data driven, but can fail in case of low SNR
CLUSTER EXTENT TESTS
Test how big the blobs are
Random field theory
Smoothness estimation
Permutation test
The problem of the cluster-forming threshold
Fun fact: FWE with RFT
Intuition
If we are interested in continuous regions of activation, why are we looking at voxels and not blobs?
Aww, patterns!
No wait… it’s just smooth noise…
What contributes to expected cluster size?
How likely is it to get a cluster of this size from pure noise?
It depends… on:
1. cluster forming threshold
2. smoothness of the map
3. size of the map
Where do we get those parameters?
1. cluster forming threshold
– Arbitrary decision
2. smoothness of the map
– Estimated from the residuals of the GLM
3. size of the map
– Calculated from the mask
Permutation based cluster extent probability
1. Break the relation: shuffle the participants between the groups
2. Perform the test
3. Threshold the map to get clusters
4. Save the sizes of all clusters
5. Repeat
Permutation based cluster extent probability
Our cluster-extent p-value is the percentage of permutations that yielded cluster sizes bigger than the original (unshuffled) one
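The cluster-extent permutation procedure can be sketched on a 1-D toy "brain" (the cluster-forming threshold of |t| > 2, the group sizes, and the planted 20-voxel effect are all arbitrary choices):

```python
import numpy as np
from scipy import stats, ndimage

rng = np.random.default_rng(3)
n_per_group, n_voxels, n_perm = 15, 200, 500
cft = 2.0  # cluster-forming threshold on |t| (an arbitrary decision)

# 1-D toy "brain": two groups, a 20-voxel-wide true effect at voxels 80-99
group_a = rng.normal(0, 1, (n_per_group, n_voxels))
group_b = rng.normal(0, 1, (n_per_group, n_voxels))
group_b[:, 80:100] += 1.5
data = np.vstack([group_a, group_b])
labels = np.array([0] * n_per_group + [1] * n_per_group)

def max_cluster_size(data, labels):
    t, _ = stats.ttest_ind(data[labels == 0], data[labels == 1], axis=0)
    clusters, n_clusters = ndimage.label(np.abs(t) > cft)  # threshold, find blobs
    if n_clusters == 0:
        return 0
    return int(np.bincount(clusters.ravel())[1:].max())

observed_max = max_cluster_size(data, labels)

# Null distribution of the maximum cluster size under shuffled labels
null_max = np.array([max_cluster_size(data, rng.permutation(labels))
                     for _ in range(n_perm)])
p_cluster = (null_max >= observed_max).mean()
```

This toy data is unsmoothed; with realistically smooth maps the null clusters get bigger, which is why smoothness estimation matters above.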
Cluster forming threshold conundrum
HONORABLE MENTIONS
TFCE
Mixture models
Threshold Free Cluster Enhancement
Spatially Regularized Mixture Models
IMPLEMENTATIONS
SPM
FSL
AFNI
SPM
• RFT based voxelwise FWE correction
• Smoothness estimation
• Cluster extent p-values
• Peak height p-values
• Permutation tests through SnPM toolbox
FSL
• RFT based voxelwise FWE correction
• Smoothness estimation
• Cluster extent p-values
• FDR
• Permutation tests through randomise
– Including TFCE
AFNI
• Cluster extent p-values (3dClustSim)
– Simulations are not permutations
• Smoothness estimation (3dFWHMx)
Interim summary
Clusterwise methods allow us to find surprising patterns in terms of spatially consistent clusters instead of individual voxels.
LIMITATIONS OF P-VALUES
P-VALUES ARE MEANINGLESS
FORGET ALL I SAID SO FAR
WE ARE ALL DOOMED
P-value paradox
• There are no two entities or groups that are truly identical
• There are no two variables that are in no way unrelated
• We just fail to obtain enough samples to see it
– Or our tools are not sensitive enough
More samples more “significance”
• The more subjects you have in your study, the more likely it is that you will find something significant
• The same applies to scan length, and field strength
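A toy demonstration of this point: with a tiny, fixed true effect, the rejection rate climbs toward 1 as the sample size grows (the effect size and sample sizes here are made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def rejection_rate(n, effect=0.05, n_sims=200, alpha=0.05):
    """How often a two-sample t-test flags a tiny true effect as significant."""
    rejections = 0
    for _ in range(n_sims):
        a = rng.normal(0, 1, n)
        b = rng.normal(effect, 1, n)
        rejections += stats.ttest_ind(a, b).pvalue < alpha
    return rejections / n_sims

small_n = rejection_rate(50)      # tiny effect, small sample: rarely "significant"
large_n = rejection_rate(10000)   # same tiny effect, huge sample: almost always
```

The effect of d = 0.05 never changed; only our ability to detect it did.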
H0 is never true
we just fail to show that
P-value failure
• P-values do not tell us much about the actual size of the effect
• Nor do they tell us about the predictive power of the relation we found
The interesting question
Is PCC involved in autism?
vs.
Given the cortical thickness of a subject's PCC, how well am I able to predict his or her diagnosis?
Why does this matter
• More subjects, longer scans, stronger scans – everything is significant
– We are getting there
• Lack of faith in science from the public
– Poor reproducibility
What needs to be done
We need more replications
We need to start reporting null results
What you can do
• Report effect sizes and their confidence intervals – for all tests/voxels, not just those significant
• Share the unthresholded statistical maps – It only takes 5 minutes on neurovault.org
• Report all the tests you have performed – not just the significant ones
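As a sketch of reporting an effect size with its confidence interval, here is Cohen's d for two independent groups (the simulated groups and true effect are invented, and the SE formula is the usual large-sample approximation):

```python
import numpy as np
from scipy import stats

def cohens_d_with_ci(a, b, alpha=0.05):
    """Cohen's d for two independent samples with an approximate normal CI."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * np.var(a, ddof=1) +
                  (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    d = (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)
    # Large-sample approximation to the standard error of d
    se = np.sqrt((na + nb) / (na * nb) + d ** 2 / (2 * (na + nb)))
    z = stats.norm.ppf(1 - alpha / 2)
    return d, (d - z * se, d + z * se)

# Hypothetical "patients" vs. "controls" with a true effect of d = 0.8
rng = np.random.default_rng(5)
a = rng.normal(0.8, 1, 40)
b = rng.normal(0.0, 1, 40)
d, (lo, hi) = cohens_d_with_ci(a, b)
```

The interval communicates both the size and the uncertainty of the effect, which a bare p-value does not.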
http://dx.doi.org/10.1016/j.neuron.2012.05.001
If you liked it you should’ve convinced a skeptical researcher to try to replicate your results.