Gene set enrichment analysis
Petri Törönen, petri(dot)toronen(at)helsinki.fi
What, Why, How…
• Gene expression data/analysis
• Problems with gene expression data analysis
• Earlier solutions
• My solution
• Comparisons
• Conclusions / Warnings
Genome-wide gene expression
• Genome-wide Gene Expression (GE) analysis is a standard lab tool
• Various methods
• Aim to understand biological differences across the samples at gene level
• If you don’t work with GE data:
– Gene Set Methods can be used with most other large-scale data sets
Typical pipelines
Pipeline 1:
– Generate the GE data
– Pre-processing (normalization etc.)
– Define Differentially Expressed genes
– Find over-represented biological processes
– Draw biological conclusions
Pipeline 2:
– Generate the GE data
– Pre-processing (normalization etc.)
– Define Differentially Expressed genes
– Cluster selected genes
– Draw biological conclusions
Pipeline 3:
– Generate the GE data
– Pre-processing (normalization etc.)
– Define Differentially Expressed genes
– Generate a classification of samples using GE profiles of genes
– Classify unknown samples
– Draw biological conclusions
What can go wrong?
• Is the definition of Differentially Expressed genes always reasonable?
– datasets with large noise levels
– p-value thresholds
– sudden jump to significant regulation
– genes with weak regulation
• Is the set of Diff. Expr. genes the main goal?
=> Biological processes are usually more informative.
What can go wrong?
Analysis of data with one threshold: a biological process with weak regulation goes unnoticed.
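A toy illustration of this point, with made-up numbers: every score, the threshold and the set size below are assumptions chosen for the sketch, not values from the talk. No single pathway gene passes the per-gene cut-off, yet the pathway as a whole is clearly shifted relative to the background.

```python
import math
import statistics

# Hypothetical per-gene scores (e.g. t-like statistics); all values are assumed.
# Background: 1050 genes spread evenly between -2.0 and 2.0.
background = [((i % 21) - 10) / 5 for i in range(1050)]
# A pathway of 20 genes, each only weakly up-regulated.
pathway = [1.2] * 20

threshold = 2.5  # strict single cut-off for calling a gene differentially expressed

# Gene-level view: how many pathway genes pass the threshold?
pathway_hits = sum(score > threshold for score in pathway)

# Set-level view: z-score of the pathway mean against the background spread.
background_sd = statistics.pstdev(background)
set_z = (statistics.mean(pathway) - statistics.mean(background)) / (
    background_sd / math.sqrt(len(pathway))
)

print("pathway genes above threshold:", pathway_hits)
print("z-score of the pathway mean:", round(set_z, 2))
```

The set-level z-score here is only a quick stand-in for a proper gene set score; the point is that the signal lives in the group, not in any one gene.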
Solution
• Analyze sets of genes instead of genes
• Gene Set: genes belonging to the same pathway, biological process, complex and/or Gene Ontology class
• Benefits: a group of genes is less sensitive to error than a single gene
• Benefits: easy interpretation of the results
• Something to support the gene-based analysis
Gene set analysis pipeline
Inputs shown in the figure: expression data, class data, sample labels, pre-defined gene sets
Gene level:
– Generate the GE data
– Pre-processing (normalization etc.)
– Define a continuous Diff. Expr. score for genes
– Generate permuted data
Gene set level:
– Calculate a gene set score for each gene set (real data)
– Calculate the gene set score for each gene set (permuted data)
– Look for gene sets that show a stronger signal in real data than in permuted data
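The pipeline above can be sketched in a few lines. This is a minimal sketch under stated assumptions: the gene score (difference of class means), the set score (mean gene score) and all function names are placeholders chosen for illustration, and the toy expression values are invented.

```python
import random
import statistics

def diff_scores(expr, labels):
    """Gene level: continuous score = mean(class 1) - mean(class 0) for each gene."""
    scores = {}
    for gene, values in expr.items():
        class1 = [v for v, lab in zip(values, labels) if lab == 1]
        class0 = [v for v, lab in zip(values, labels) if lab == 0]
        scores[gene] = statistics.mean(class1) - statistics.mean(class0)
    return scores

def set_score(scores, gene_set):
    """Gene set level: placeholder score, the mean gene score of the set."""
    return statistics.mean(scores[g] for g in gene_set)

def permutation_p(expr, labels, gene_set, n_perm=200, seed=0):
    """Compare the real set score with scores from label-permuted data."""
    rng = random.Random(seed)
    observed = set_score(diff_scores(expr, labels), gene_set)
    hits = 0
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)  # permute the sample labels
        permuted = set_score(diff_scores(expr, shuffled), gene_set)
        if abs(permuted) >= abs(observed):
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Invented toy data: 10 samples, genes g1 and g2 are higher in class 1.
low = [1.0, 0.9, 1.1, 1.0, 0.95]
high = [2.0, 2.1, 1.9, 2.0, 2.05]
expr = {
    "g1": low + high,
    "g2": [v + 3.0 for v in low + high],
    "g3": [1.0, 1.2, 0.8, 1.1, 0.9, 1.0, 0.8, 1.2, 0.9, 1.1],
}
labels = [0] * 5 + [1] * 5
observed, p_value = permutation_p(expr, labels, gene_set={"g1", "g2"})
print(observed, p_value)
```

Any of the gene set scores discussed later could be dropped in for `set_score` without changing the permutation loop.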
Methods for gene set scoring
• Average-based methods
• Rank-based methods
• Other methods (omitted here)
Average based methods
• Calculate the average regulation of gene set (Tian et al. PNAS)
• Can something go wrong with it?
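One way the average can fail, sketched with invented scores: when a gene set contains both up- and down-regulated genes, the signed average cancels out. The absolute-value variant is shown only for contrast and is an assumption of this sketch, not a method from the talk.

```python
import statistics

# Invented diff. expr. scores for two gene sets of four genes each.
coherent = [1.5, 1.8, 2.1, 1.6]    # all genes regulated in the same direction
mixed = [2.0, -2.0, 1.8, -1.8]     # strong regulation, but in both directions

coherent_avg = statistics.mean(coherent)
mixed_avg = statistics.mean(mixed)                       # cancels to zero
mixed_abs_avg = statistics.mean(abs(s) for s in mixed)   # regulation is still there

print(coherent_avg, mixed_avg, mixed_abs_avg)
```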
Rank based methods
• Steps:
– order genes by differential expression
– test every possible threshold in the ordered list
– look for over(/under)-representation of the gene set above the threshold
– select the strongest score
• Expression values are (often) discarded!
• Examples: Iterative Group Analysis, Kolmogorov-Smirnov test (KS), modified KS (Gene Set Enrichment Analysis package, MIT)
Figure: ordered gene expression data with a threshold marking the analyzed subset, next to the analyzed gene classes. Black = class member, white = not a member.
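The steps listed for rank-based methods can be sketched as follows. This sketch assumes a hypergeometric over-representation test at each threshold; the function names and the ten-gene example are hypothetical. Note that only the gene order enters the calculation, which is exactly the point about discarded expression values.

```python
from math import comb

def hypergeom_tail(k, n_top, n_set, n_total):
    """P(X >= k) when n_top genes are drawn from n_total and n_set belong to the set."""
    upper = min(n_top, n_set)
    favourable = sum(
        comb(n_set, i) * comb(n_total - n_set, n_top - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(n_total, n_top)

def best_threshold(ranked_genes, gene_set):
    """Test every threshold in the ordered list and keep the strongest score."""
    n_total = len(ranked_genes)
    n_set = sum(g in gene_set for g in ranked_genes)
    best_p, best_pos = 1.0, 0
    hits = 0
    for n_top, gene in enumerate(ranked_genes, start=1):
        hits += gene in gene_set
        p = hypergeom_tail(hits, n_top, n_set, n_total)
        if p < best_p:
            best_p, best_pos = p, n_top
    return best_p, best_pos

# Hypothetical ranked list: genes already ordered by differential expression.
ranked = [f"g{i}" for i in range(1, 11)]
best_p, best_pos = best_threshold(ranked, gene_set={"g1", "g2", "g3"})
print(best_p, best_pos)
```

Because the minimum is taken over many thresholds, `best_p` is not a valid p-value on its own; its significance still has to be judged against permuted data.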
Permutations
• Needed to evaluate significance
• Two types:
– Row Randomization: mix the gene set / gene class labels
– Column Randomization: mix the sample labels that are used to calculate diff. expr.
• Column Randomization preferred
Figure panels: Row rand., Col. rand.
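The two permutation types differ in what gets shuffled. A minimal sketch, with hypothetical function names and toy inputs:

```python
import random

def column_randomization(sample_labels, rng):
    """Shuffle the sample labels; diff. expr. scores must then be recalculated."""
    shuffled = list(sample_labels)
    rng.shuffle(shuffled)
    return shuffled

def row_randomization(gene_ids, gene_set, rng):
    """Replace the gene set with a random set of genes of the same size."""
    return set(rng.sample(list(gene_ids), len(gene_set)))

rng = random.Random(0)
labels = ["case"] * 4 + ["control"] * 4
genes = [f"g{i}" for i in range(10)]

perm_labels = column_randomization(labels, rng)
random_set = row_randomization(genes, {"g0", "g1", "g2"}, rng)
print(perm_labels)
print(sorted(random_set))
```

One commonly cited reason for preferring column randomization (not stated on the slide) is that shuffling sample labels keeps the correlations between genes intact, whereas drawing random gene sets does not.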
Summary of methods
• Average-based methods are weak with non-coherent regulation
• Rank-based methods usually omit gene expression values => the steps between all genes are treated as equally significant
My brilliant proposal
• Combine two method groups:
– Order genes with diff. expr. scores
– Test every threshold position
– At each threshold calculate a difference score (Diff)
• Scale the difference with STD and average estimates (Törönen et al. 2009)
• Get a Z-score scaling for the difference => Gene Set Z-score (GSZ)
My brilliant proposal
• An over-representation (hypergeometric) score weighted with the diff. expr. score
• GSZ compares the Diff to the mean and STD obtained when the class is randomly distributed in the ordered list.
• Considers both the variance in the expression values and the variance in the number of gene set members in the list
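A rough sketch of the idea, not the published GSZ implementation. Two assumptions are made here and should be read as such: the difference at each threshold is taken as the sum of scores of set members minus the sum of scores of non-members above the threshold, and the mean and STD are estimated by simulation (placing the set at random positions in the ordered list) instead of with the analytic estimates used in the paper. All names are hypothetical.

```python
import random
import statistics

def diff_profile(scores_sorted, member_flags):
    """Running Diff: sum of member scores minus sum of non-member scores."""
    profile, running = [], 0.0
    for score, is_member in zip(scores_sorted, member_flags):
        running += score if is_member else -score
        profile.append(running)
    return profile

def gsz_like_profile(scores, gene_set, n_random=500, seed=0):
    """Z-score of Diff at every threshold against random placements of the set."""
    rng = random.Random(seed)
    genes = sorted(scores, key=scores.get, reverse=True)
    ordered = [scores[g] for g in genes]
    flags = [g in gene_set for g in genes]
    observed = diff_profile(ordered, flags)

    # Null model: the same number of members at random positions in the list.
    null_profiles = []
    for _ in range(n_random):
        shuffled = flags[:]
        rng.shuffle(shuffled)
        null_profiles.append(diff_profile(ordered, shuffled))

    z_profile = []
    for i, obs in enumerate(observed):
        column = [profile[i] for profile in null_profiles]
        sd = statistics.pstdev(column)
        z_profile.append((obs - statistics.mean(column)) / sd if sd > 0 else 0.0)
    return z_profile

# Invented scores: 20 genes, the gene set sits at the top of the list.
scores = {f"g{i}": 2.0 - 0.2 * i for i in range(20)}
z = gsz_like_profile(scores, gene_set={"g0", "g1", "g2", "g3"})
print(max(z))
```

The maximum of the profile can then be used as the gene set score, and compared with the same quantity from column-permuted data.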
My brilliant proposal
• Many popular Gene Set scoring methods are variants of the GSZ method:
– hypergeometric testing
– Pearson correlation
– Max-Mean (Efron, Tibshirani)
– Random Sets (Newton et al.)
Figure: GSZ profile from the ALL data (Chiaretti et al.) for one GO class vs. 7 quantiles (0, 5, 25, 50, 75, 95, 100) from 500 permutations. Different positions correspond to other competing methods.
Evaluation
• Stability of the scores as the threshold goes through the gene list?
• Red line: strongest signal from positive data (across all GO classes)
• Blue lines: various quantiles (same as before) across all GO classes
• Compare with KS and modified KS (right column; MIT, PNAS and Nature Gen.)
• Same data, same permutations!
Figure: GSZ with different parameter values. The third box shows the default parameter values. Pay attention to the stability of the blue lines.
More evaluation
• GSZ is also stable against gene set size variations
– most methods are not
• Several Gene Set scoring methods were tested with artificial positive and random datasets
– GSZ showed the best overall ability to separate the two dataset types
• Methods were evaluated by splitting the real data into two halves and testing how well the results match
– GSZ was best in predicting its own results from the other half
– GSZ was best in predicting the summary of all methods from the other half
More evaluation
• Compare different gene set scoring functions
• Test with two popular datasets against GO classes
• Calculate the empirical -log(p-values) for the strongest GO classes from each method
• Blue line = GSZ, green line = T-test, red = KS, magenta = iGA, cyan = modified KS
Figure panels: ALL data set, p53 data set; pooled data, class data
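An empirical -log(p-value) can be computed from permutation scores along these lines. The sketch assumes base-10 logarithms and the common "+1" correction that keeps the p-value above zero; neither choice is stated on the slide, and the function name is hypothetical.

```python
import math

def empirical_neg_log10_p(observed, permuted_scores):
    """-log10 of the fraction of permuted scores at least as large as the observed one."""
    n_at_least = sum(score >= observed for score in permuted_scores)
    p = (n_at_least + 1) / (len(permuted_scores) + 1)  # +1 avoids p = 0
    return -math.log10(p)

# Invented example: 99 permuted scores, none reaching the observed score.
permuted = [i / 10 for i in range(99)]
neg_log_p = empirical_neg_log10_p(10.0, permuted)
print(neg_log_p)
```

With this estimator the largest reachable value is limited by the number of permutations, which is one reason enough samples and permutations are needed.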
More evaluation
• Select biologically relevant GO classes as biologically positive
• Look at how many such classes each method finds across the top ranks (GSZ = blue line)
Figure: the ALL dataset. GSZ outperforms the others at bigger ranks. Similar results were obtained with the p53 dataset.
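The curve behind this kind of comparison is a cumulative count. A minimal sketch, with hypothetical class names and a made-up ranking:

```python
from itertools import accumulate

def positives_by_rank(ranked_classes, positive_classes):
    """Cumulative number of biologically positive classes within the top k ranks."""
    return list(accumulate(int(c in positive_classes) for c in ranked_classes))

# Invented ranking of gene classes from one scoring method.
ranking = ["class_a", "class_b", "class_c", "class_d"]
curve = positives_by_rank(ranking, positive_classes={"class_a", "class_c"})
print(curve)
```

Plotting one such curve per scoring method against the rank gives the kind of comparison described above.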
Comparison with other programs
• Selected SignalPathway (green line), GSEA (cyan) and GSA (black) for comparison
• Evaluation was again done using the biologically positive classes
• Comparing programs is less clear (more variables)
Figure: again the ALL dataset. Similar results with p53. GSZ outperforms others at large
Summary
• GSZ, a weighted over-representation score
• Math link to many other popular methods
• Stable across GO class sizes and across gene list positions
• Good performance in artificial datasets
• Best performance in many evaluations from two real datasets
Other applications
• siRNA data vs. gene IDs (discussed)
• Linkage data vs. biological processes (discussed)
• BLAST result list vs. descriptions (in usage)
• BLAST result list vs. GO classes (in usage)
Warnings
• Quality of gene expression data
• Enough samples for permutations
• Each gene should occur only once in the expression data
• Filter genes without annotations (with GO data)
• Use Column Permutations
• Quality of gene sets / annotations
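Two of these warnings translate directly into data clean-up steps. The sketch below is one possible way to do it: collapsing duplicate genes by the median is an assumption of this example (other rules are possible), and the function names and toy values are hypothetical.

```python
import statistics
from collections import defaultdict

def collapse_duplicates(rows):
    """rows: iterable of (gene_id, score). Keep one score per gene (here: the median)."""
    grouped = defaultdict(list)
    for gene_id, score in rows:
        grouped[gene_id].append(score)
    return {gene: statistics.median(values) for gene, values in grouped.items()}

def drop_unannotated(scores, gene_sets):
    """Keep only genes that belong to at least one gene set."""
    annotated = set().union(*gene_sets.values())
    return {gene: score for gene, score in scores.items() if gene in annotated}

# Invented input: gene "a" was measured twice, gene "c" has no annotation.
rows = [("a", 1.0), ("a", 3.0), ("b", 2.0), ("c", 0.5)]
gene_sets = {"set_1": {"a", "b"}}

unique_scores = collapse_duplicates(rows)
filtered_scores = drop_unannotated(unique_scores, gene_sets)
print(filtered_scores)
```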