AffyDEComp: towards a benchmark for differential
expression methods
Richard Pearson
School of Computer Science
University of Manchester
Overview
Why benchmark DE methods?
The Golden Spike data set
AffyDEComp
Conclusions
Recommendations
The need for benchmarks
Microarray analysis has many stages
Competing methods at each stage
Methodologists good at showing superiority
Results can appear contradictory
Confused end users: choice driven by…
What they are familiar with
What colleagues use
What was used in their favourite paper
…and not by a scientific comparison
Benchmarking requirements
Methods: a set we wish to compare
Benchmark data: where truth is known
Metrics: by which to compare methods
Affycomp
Methods: Summarisation methods
Benchmark data: various spike-in studies
Metrics: various, including, e.g., area under ROC curve for a fold change classifier
Affycomp doesn’t compare DE methods
A benchmark for DE methods
Methods:
DE methods depend on summarisation
Compare summarisation/DE combinations
Benchmark data:
Affycomp spike-ins have few DE genes
Golden spike data has many DE genes, but also a few “issues”!
Metrics:
Based around areas under ROC curves
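As a rough illustration of the metric named above, the sketch below builds a ROC curve from per-probeset scores and computes the area under it with the trapezoid rule. The function names (`roc_points`, `auc`) and the toy scores are invented for this example and are not taken from AffyDEComp; the only assumption is that a higher score means stronger evidence of DE.

```python
def roc_points(scores, is_tp):
    """Return (false positive rate, true positive rate) points.

    Probesets are called positive in order of decreasing score;
    probesets with tied scores are added to the curve together.
    """
    n_pos = sum(1 for t in is_tp if t)
    n_neg = len(is_tp) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one true positive and one true negative")

    ranked = sorted(zip(scores, is_tp), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Consume every probeset that shares the current score.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points


def auc(points):
    """Area under a ROC curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy data (invented): six probesets, three of them truly DE.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
toy_is_tp = [True, True, False, True, False, False]
toy_auc = auc(roc_points(toy_scores, toy_is_tp))  # 8/9 for this toy ranking
```

An AUC of 1 means every true positive outranks every true negative, 0.5 is what a random ranking would give on average, and values below 0.5 mean the ranking is worse than random.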
The Golden Spike data
3 “sample”, 3 “control” arrays
Many RNAs “spiked-in” at known levels
“DE”, “Equal” and “Empty” probesets.
Controversial data set
Non-uniform null p-value distributions - use ROC
Spike-in concentrations high - unrepresentative
“DE” spike-ins all up-regulated - unrepresentative
Concentrations and FC confounded - loess
Different FC between “Equal” and “Empty”
“Empty” have higher FC than “Equal”
Most analyses have treated both Empty and Equal as True Negatives - to what effect?
“Empty” have higher FC than “Equal”
To illustrate how analysis choices affect results, I’ll treat Empty and Equal as true negatives (TN) and DE <= 1.2 as true positives (TP)
2-sided test
Large apparent difference between methods
Can you guess which paper used this chart?
2-sided test
Large apparent difference between methods
Are TP correctly identified as up-regulated?
1-sided test of up-regulation
Probesets identified as up-regulated not TP
1-sided test of down-regulation
DE probesets are mostly being identified as down-regulated, despite the fact that they are in truth up-regulated
We appear to be identifying TP as down-regulated
DE <= 1.2 lower than Empty
TP are identified as down-regulated because most TN are “Empty” which have higher FC than DE <= 1.2
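The effect described above can be reproduced on toy numbers. The sketch below assumes each probeset has a signed statistic (positive meaning up-regulated in the sample arrays) and that the toy true positives sit below the bulk of the true negatives, as happens when most TN are Empty. All values and names here are invented for illustration; none come from the Golden Spike data.

```python
def pairwise_auc(scores, is_tp):
    """AUC as the fraction of (TP, TN) pairs in which the TP scores higher.

    Ties count as half a win. This equals the area under the ROC curve.
    """
    pos = [s for s, t in zip(scores, is_tp) if t]
    neg = [s for s, t in zip(scores, is_tp) if not t]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented signed statistics: the three "TP" lie below the five "TN".
signed_stat = [-2.1, -1.8, -2.5, 0.3, -0.4, 0.1, 0.6, -0.2]
is_tp = [True, True, True, False, False, False, False, False]

# Three ways of turning a signed statistic into a ranking score.
two_sided_score = [abs(s) for s in signed_stat]  # any change
up_score = list(signed_stat)                     # up-regulation only
down_score = [-s for s in signed_stat]           # down-regulation only

auc_two_sided = pairwise_auc(two_sided_score, is_tp)
auc_up = pairwise_auc(up_score, is_tp)
auc_down = pairwise_auc(down_score, is_tp)
```

On these toy values the two-sided and down-regulation rankings separate TP from TN perfectly, while the up-regulation ranking puts every TP last. A good-looking two-sided ROC curve can therefore hide the fact that the TP are being detected in the wrong direction.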
Remove “empty” probesets
We can remedy this by using just Equal probesets as our TN…
…bearing in mind that this makes the data somewhat atypical
Up-regulation - Empty in TN
Probesets identified as up-regulated generally not TP when using Empty in TN
Up-regulation - TN Equal
Probesets identified as up-regulated more likely to be TP when using only Equal as TN
Down-regulation - Empty in TN
DE probesets are mostly being identified as down-regulated, despite the fact that they are in truth up-regulated
We appear to be identifying TP as down-regulated when including Empty in TN
Down-regulation - TN Equal
We generally don’t identify TP as down-regulated when excluding Empty in TN
“Recommended” test
We recommend using just Equal as TN, and all DE as TP
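A minimal sketch of how this labelling choice might be encoded, assuming each probeset carries one of the category names "DE", "Equal" or "Empty". The function `label_probesets` and the probeset identifiers are hypothetical; probesets whose category is in neither set are simply left out of the ROC calculation.

```python
def label_probesets(categories, tp_categories=("DE",), tn_categories=("Equal",)):
    """Map probeset id -> True (true positive) or False (true negative).

    Probesets in neither category set are dropped, so with the default
    arguments "Empty" probesets play no part in the comparison.
    """
    labels = {}
    for probeset, category in categories.items():
        if category in tp_categories:
            labels[probeset] = True
        elif category in tn_categories:
            labels[probeset] = False
    return labels


# Hypothetical probeset ids and categories.
toy_categories = {
    "ps_1": "DE",
    "ps_2": "Equal",
    "ps_3": "Empty",
    "ps_4": "DE",
}

recommended = label_probesets(toy_categories)
# The alternative used in most earlier analyses: Empty also counted as TN.
with_empty = label_probesets(toy_categories, tn_categories=("Equal", "Empty"))
```

Keeping the TN choice as an argument makes it easy to rerun the same comparison with and without Empty probesets and see how much the conclusions depend on it.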
Recommended Up-reg
Using our recommendations, tests of up-regulation generally find TP, as expected
Recommended Down-reg
Using our recommendations, tests of down-regulation generally don’t find TP, as expected
Analysis decisions to make
Summarisation method
DE method
Direction of DE (recommend up)
Choice of true negatives (equal only)
Choice of true positives (all DE)
Post-summarisation normalisation (loess using equal only)
Type of ROC chart (standard ROC)
Proportion of x-axis to display (all)
AffyDEComp - charts
AffyDEComp - comparison
AUCs - recommended choices
Conclusions
First step towards a reliable benchmark for DE
Golden Spike data has some value if use of empty probesets is revisited
Certain combinations of summarisation/DE methods seem poor
Keep it open (Bioconductor) - because science should be reproducible!
Recommendations
Create a new spike-in data set where
Spike-in concentrations are realistic
DE spike-ins both up- and down-regulated
Concentrations and FC not confounded
Larger number of arrays
Benchmarks using regulatory information
Benchmarks for Illumina data
Benchmarks for SNP chips (GWA studies)
manchester.ac.uk/bioinformatics/affydecomp