
Page 1

Multiple testing adjustments

European Molecular Biology Laboratory
Predoc Bioinformatics Course

17th Nov 2009

Tim Massingham, tim.massingham@ebi.ac.uk

Page 2

Motivation

We have already come across several cases where we need to correct p-values.

         Exp 2   Exp 3   Exp 4   Exp 5   Exp 6
Exp 1    0.027   0.033   0.409   0.330   0.784
Exp 2            0.117   0.841   0.985   0.004
Exp 3                    0.869   0.927   0.001
Exp 4                            0.245   0.021
Exp 5                                    0.004

Pairwise gene expression data

What happens if we perform several vaccine trials?

Page 3

Motivation

10 new vaccines are trialled. Declare a vaccine a success if its test has a p-value of less than 0.05.

If none of the vaccines work, what is our chance of success?

Page 4

Motivation

10 new vaccines are trialled. Declare a vaccine a success if its test has a p-value of less than 0.05.

Each trial has a probability of 0.05 of "success" (a false positive).
Each trial has a probability of 0.95 of "failure" (a true negative).

P(at least one "success") = 1 - P(no "successes")
                          = 1 - (P(a single trial is unsuccessful))^10
                          = 1 - 0.95^10
                          ≈ 0.40

If none of the vaccines work, what is our chance of a “success”?

Rule of thumb: multiply the size of the test by the number of tests.
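As a quick check of this arithmetic in R (a minimal sketch; the 0.05 size and 10 trials are the numbers from the example above):

# Chance of at least one false positive among 10 independent trials,
# each performed at size 0.05
n <- 10
alpha <- 0.05
1 - (1 - alpha)^n    # ~0.40
n * alpha            # rule-of-thumb bound: 0.5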

Page 5

Motivation

More extreme example: test entire population for disease

A mixture: some of the population have the disease, some don't. We want to find the individuals with the disease.

                             Test report
                             Healthy            Diseased
True status    Healthy       True negative      False positive
               Diseased      False negative     True positive

Family-Wise Error Rate: control the probability that any false positive occurs.

False Discovery Rate: control the proportion of false positives among the discoveries.

FDR = (# false positives) / (# positives) = (# false positives) / (# true positives + # false positives)
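A tiny numerical illustration of the definition (the counts here are hypothetical, not from the slide):

# Suppose the test reports 1000 people as diseased, of whom 950 really are
false_pos <- 50
true_pos  <- 950
false_pos / (true_pos + false_pos)    # FDR = 0.05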

Page 6

Cumulative distribution

Simple examination by eye: the cumulative distribution should be approximately linear.

• Rank the data
• Plot rank against p-value

[Figure: cumulative distribution of p-values, with p-value (0 to 1) on the x-axis and rank (1 to n) on the y-axis]

N.B. The ranks are often scaled to (0,1] by dividing by the largest rank.

The curve starts at (0,1), ends at (1,n) and never decreases.
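A minimal R sketch of this eyeball check; the p-values here are simulated under the null purely for illustration:

# Simulated null p-values: their cumulative distribution should be roughly linear
pvalues <- runif(1000)
plot(sort(pvalues), seq_along(pvalues) / length(pvalues),
     type = "s", xlab = "p-value", ylab = "rank (scaled to (0,1])")
abline(0, 1, lty = 2)    # uniform p-values should lie close to this line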

Page 7

Cumulative distribution

Five sets of uniformly distributed p-values

Non-uniformly distributed data. Excess of extreme p-values (small)

Examples are for 910 p-values. A one-sided Kolmogorov-Smirnov test could be used if desired.
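In R, such a test might look like the following sketch, assuming the 910 p-values are stored in a vector called pvalues (the name is mine); alternative = "greater" should correspond to an excess of small p-values:

# One-sample, one-sided Kolmogorov-Smirnov test against the Uniform(0,1) distribution
ks.test(pvalues, "punif", alternative = "greater")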

Page 8

A little set theory

[Venn diagram: three overlapping circles labelled "Test 1 false positive", "Test 2 false positive" and "Test 3 false positive". The region outside all three circles is "no test gives a false positive"; the central intersection is "all tests give a false positive".]

Represent all possible outcomes of three tests in a Venn diagram

Areas are probabilities of events happening

Page 9

A little set theory

[Venn diagram: P(any test gives a false positive) is the area of the union of the three circles.]

Page 10

A little set theory

[Venn diagram: adding the areas of the three circles counts the overlapping regions more than once.]

P(any test gives a false positive) ≤ Σᵢ P(test i gives a false positive)

Page 11

Bonferroni adjustment

P(any test gives a false positive) ≤ Σᵢ P(test i gives a false positive)

We want to control the left-hand side; we know how to control each term on the right-hand side (the size of each test).

Keep things simple: do all the tests at the same size.

If we have n tests, each at size α/n, then

P(any test gives a false positive) ≤ n × α/n = α

Page 12

Bonferroni adjustment

If we have n tests, each at size α/n, then

P(any test gives a false positive) ≤ n × α/n = α

Family-Wise Error Rate

Bonferroni adjustment (correction): for a FWER of less than α, perform all the tests at size α/n.

Equivalently: multiply the p-values of all the tests by n (to a maximum of 1) to give adjusted p-values.
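In R this is a one-liner; a minimal sketch with made-up p-values:

pvalues <- c(0.001, 0.02, 0.4)                  # hypothetical p-values
pmin(length(pvalues) * pvalues, 1)              # multiply by n, cap at 1
p.adjust(pvalues, method = "bonferroni")        # the built-in equivalent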

Page 13

Example 1

Look at deviations from Chargaff's 2nd parity rule: the A and T content of the genomes of 910 bugs.

Many show significant deviations

First 9 p-values:
3.581291e-66 3.072432e-12 1.675474e-01 6.687600e-01 1.272040e-05 1.493775e-23 2.009552e-26 1.024890e-14 1.519195e-24

Unadjusted p-values:
p-value < 0.05   764
p-value < 0.01   717
p-value < 1e-5   559

Bonferroni adjusted p-values:
p-value < 0.05   582
p-value < 0.01   560
p-value < 1e-5   461

First 9 adjusted p-values:
3.258975e-63 2.795913e-09 1.000000e+00 1.000000e+00 1.157556e-02 1.359335e-20 1.828692e-23 9.326496e-12 1.382467e-21

Page 14

Aside: p-values measure evidence

We have shown that many bugs deviate substantially from Chargaff's 2nd rule.

p-values tell us that there is significant evidence for a deviation

[Box plot summarising the distribution across genomes: lower quantile, median and upper quantile.]

There are lots of bases, and so we have the ability to detect small deviations from 50%.

Powerful test

1st Qu.   Median   3rd Qu.
0.4989    0.4999   0.5012
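To see why the test is so powerful, here is a hypothetical illustration in R; the number of bases is assumed, and only the 0.4989 quartile comes from the slide:

# With millions of bases, even a tiny deviation of the A fraction from 0.5
# gives a small p-value
n_at <- 2e6                              # assumed number of A+T bases in one genome
n_a  <- round(0.4989 * n_at)             # A count at the lower quartile above
binom.test(n_a, n_at, p = 0.5)$p.value   # ~2e-3, despite a deviation of only 0.0011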

Page 15

Bonferroni is conservative

Conservative: the actual size of the test is less than the bound.

P(any test gives a false positive) ≤ n × α/n = α

Not too bad for independent tests

Worst when the tests are positively correlated:
• Applying the same test to subsets of the data
• Applying similar tests to the same data

More subtle problem

Mixture of blue and red circles

Null hypothesis: Is blue

Red circles are never false positives

Page 16

Bonferroni is conservative


If experiment really is different from null, then

P(test gives false positive) = 0

The p-value is over-adjusted.

Number of potential false positives may be less than number of tests

Page 17

Holm’s method

Holm (1979) suggests repeatedly applying Bonferroni.

Initial Bonferroni: split the tests into an "insignificant" set and a "significant" set.

No false positive? We have been overly strict, so apply Bonferroni only to the insignificant set.
False positive? More won't hurt, so we may as well test again.

Step 2: apply Bonferroni to the insignificant set; move any new rejections into the significant set.
Step 3: repeat.

Stop when “insignificant” set does not shrink further
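A minimal R sketch of this iterative view (the function name and example p-values are mine; in practice p.adjust(..., method = "holm") is the way to do it):

# Repeatedly apply Bonferroni to the still-insignificant set until it stops shrinking
holm_by_iteration <- function(pvalues, alpha = 0.05) {
  insignificant <- rep(TRUE, length(pvalues))
  repeat {
    n_insig <- sum(insignificant)
    if (n_insig == 0) break
    # Bonferroni applied only to the current insignificant set
    newly_sig <- insignificant & (pvalues < alpha / n_insig)
    if (!any(newly_sig)) break         # the insignificant set did not shrink
    insignificant[newly_sig] <- FALSE
  }
  !insignificant                       # TRUE for tests declared significant
}

pvalues <- c(0.001, 0.010, 0.020, 0.300)
holm_by_iteration(pvalues)
# For these values this agrees with p.adjust(pvalues, method = "holm") < 0.05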

Page 18

Example 2

Bonferroni adjusted p-values:
p-value < 0.05   582
p-value < 0.01   560
p-value < 1e-5   461

First 9 Bonferroni adjusted p-values:
3.258975e-63 2.795913e-09 1.000000e+00 1.000000e+00 1.157556e-02 1.359335e-20 1.828692e-23 9.326496e-12 1.382467e-21

Return to the Chargaff data: 910 bugs, but more than half are significantly different after adjustment.

There is strong evidence that we’ve over-corrected

First 9 Holm adjusted p-values:
2.915171e-63 1.591520e-09 1.000000e+00 1.000000e+00 4.452139e-03 9.903730e-21 1.390610e-23 5.623765e-12 1.019380e-21

Holm adjusted p-values:
p-value < 0.05   606 (+24)
p-value < 0.01   574 (+14)
p-value < 1e-5   472 (+12)

We gained a couple of percent more, but notice that the gains tail off.

Page 19

Hochberg's method

Consider a pathological case:

Apply same test to same data multiple times

# Ten identical p-values
pvalues <- rep(0.01, 10)
# None are significant with Bonferroni
p.adjust(pvalues, method = "bonferroni")
#  0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
# None are significant with Holm
p.adjust(pvalues, method = "holm")
#  0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
# Hochberg recovers the correctly adjusted p-values
p.adjust(pvalues, method = "hochberg")
#  0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01

First 9 Hochberg adjusted p-values:
2.915171e-63 1.591520e-09 9.972469e-01 9.972469e-01 4.452139e-03 9.903730e-21 1.390610e-23 5.623765e-12 1.019380e-21

Hochberg adjusted p-values:
p-value < 0.05   606
p-value < 0.01   574
p-value < 1e-5   472

The Hochberg adjustment gives the same significance counts as Holm for the Chargaff data… but it requires additional assumptions.

Page 20

False Discovery Rates

New methods, dating back to 1995

Gaining popularity in literature but mainly used for large data sets

Useful for enriching data sets for further analysis

Recap
FWER: control the probability of any false positive occurring.
FDR: control the proportion of false positives that occur.

“q-value” is proportion of significant tests expected to be false positives

q-value times number significant = expected number of false positives

Methods
• Benjamini & Hochberg (1995)
• Benjamini & Yekutieli (2001)
• Storey (2002, 2003), a.k.a. the "positive false discovery rate"
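The Benjamini & Hochberg and Benjamini & Yekutieli adjustments are both available through p.adjust; a minimal sketch with made-up p-values:

pvalues <- c(0.001, 0.008, 0.039, 0.041, 0.6)   # hypothetical p-values
p.adjust(pvalues, method = "BH")                # Benjamini & Hochberg q-values
p.adjust(pvalues, method = "BY")                # Benjamini & Yekutieli version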

Page 21

Example 3

Returning once more to the Chargaff data.

First 9 FDR q-values:
3.359768e-65 7.114283e-12 1.891664e-01 6.931340e-01 2.063380e-05 5.481191e-23 8.350193e-26 2.569283e-14 5.760281e-24

FDR q-values:
q-value < 0.05   759
q-value < 0.01   713
q-value < 1e-5   547

Q-values have a different interpretation from p-values. Use q-values to get the expected number of false positives:
q-value < 0.05: expect 38 false positives (759 × 0.05)
q-value < 0.01: expect 7 false positives (713 × 0.01)
q-value < 1e-5: expect 1/200 of a false positive (547 × 1e-5)
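The same arithmetic written out in R (the counts are taken from the slide above):

# Expected false positives = q-value cut-off x number of tests below that cut-off
qvalue_cutoffs <- c(0.05, 0.01, 1e-5)
n_significant  <- c(759, 713, 547)
qvalue_cutoffs * n_significant          # ~38, ~7 and ~0.005 expected false positives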

Page 22

Summary

Holm is always better than Bonferroni

Hochberg can be better but has additional assumptions

FDR is a more powerful approach - it finds more things significant:
• it controls a different criterion
• it is more useful for exploratory analyses than for publications

A little question
Suppose results are published if the p-value is less than 0.01; what proportion of the scientific literature is wrong?