A Factor Graph Model for Minimal Gene Set Enrichment Analysis
Diana Uskat
Computational Biology - Gene Center Munich
24.03.2010 Diana Uskat - Gene Center Munich 2
Problem Outline:
• Single gene analysis of microarray experiments entails a large multiple testing problem
• Even after appropriate multiple testing correction, the result is usually a long list of differentially expressed genes
• Interpretation is difficult by hand
Possible improvement: Gene set enrichment analysis
1. Group genes into different biologically meaningful categories (Gene Ontology, KEGG Pathways, Transcription factor targets)
2. Use a statistical method for finding those categories which are enriched for differentially expressed genes
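Step 2 can be illustrated with a small sketch. The slides do not say which test is meant; the one-sided hypergeometric test used here is one common choice and is an assumption, as are the function name and the toy numbers.

```python
from math import comb

def hypergeom_enrichment_pvalue(n_genes, n_de, category_size, n_de_in_category):
    """One-sided hypergeometric p-value for a single category.

    Probability of seeing at least `n_de_in_category` differentially
    expressed (DE) genes in a category of `category_size` genes when
    `n_de` of the `n_genes` measured genes are DE.
    """
    upper = min(n_de, category_size)
    total = comb(n_genes, category_size)
    tail = sum(
        comb(n_de, k) * comb(n_genes - n_de, category_size - k)
        for k in range(n_de_in_category, upper + 1)
    )
    return tail / total

# Hypothetical numbers: 20 genes, 5 of them DE, a category with 4 genes, 3 of them DE
p_value = hypergeom_enrichment_pvalue(20, 5, 4, 3)
print(f"enrichment p-value: {p_value:.4f}")
```

Running such a test once per category is exactly what creates the second multiple testing problem when there are thousands of overlapping categories.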
Motivation
[Figure: cutout of the Gene Ontology; graph from Ontologizer by S. Bauer, J. Gagneur, P. N. Robinson (NAR 2010)]
Established Methods:
• GSEA (Subramanian, Tamayo)
• topGO (Alexa)
• Globaltest (Goeman, Mansmann)
• GOstats (Falcon, Gentleman)
Drawbacks:
• There are often 1000s of overlapping categories, and genes can belong to multiple categories → difficult new multiple testing problem
• Group testing often returns a large number of significant categories → identification of biologically relevant categories is difficult
Motivation
Minimal Gene Set Enrichment
Idea (Bauer, Gagneur et al., Nucleic Acids Research 2010)
• Search for a sparse explanation, i.e. a minimal number of categories that explain the data (sufficiently well)
• Use a simplistic probabilistic graphical model relating categories and genes, and do Bayesian inference on the marginal posterior for each category
[Figure: two copies of a bipartite graph linking categories T1, T2, T3 to genes E1, E2, E3; an edge means e.g. “gene E3 is element of category T3”; coloured nodes are „on“. Panels: “Correct explanation” and “Correct minimal explanation”]
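The difference between an explanation and a minimal explanation can be sketched by brute force. The category memberships below are hypothetical (the figure's edges are not recoverable from the transcript), and "explanation" is read here as a set of categories whose members are exactly the active genes, which is a simplification of the probabilistic model.

```python
from itertools import combinations

# Hypothetical memberships: category -> genes it contains
members = {
    "T1": {"E1", "E2"},
    "T2": {"E1", "E2", "E3"},
    "T3": {"E3"},
}
active_genes = {"E1", "E2", "E3"}  # genes that are "on"

def explanations(members, active_genes):
    """All category subsets whose union is exactly the set of active genes."""
    categories = sorted(members)
    found = []
    for size in range(1, len(categories) + 1):
        for subset in combinations(categories, size):
            covered = set().union(*(members[c] for c in subset))
            if covered == active_genes:
                found.append(subset)
    return found

all_explanations = explanations(members, active_genes)
smallest = min(len(s) for s in all_explanations)
minimal_explanations = [s for s in all_explanations if len(s) == smallest]
print(all_explanations)
print(minimal_explanations)
```

Exhaustive enumeration only works for toy sizes, which is one reason for doing inference on marginal posteriors instead.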
Minimal Gene Set Enrichment
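[Figure: Bayesian network with three layers: categories T1, T2, T3; genes E1, E2, E3; observations (data) D1, D2, D3]
The model
A Bayesian Network factorization of the full posterior (Posterior ∝ Likelihood × Prior):
$$\Pr(T, E \mid D) \propto \Pr(D \mid E)\,\Pr(E \mid T)\,\Pr(T)$$
Main trick: Use a prior favoring sparse solutions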
Factor Graphs
[Figure: graph with categories T1, T2, T3, genes E1, E2, E3 and observations D1, D2, D3]
• Graphical model (Kschischang IEEE, 2001)
• Bipartite graph with factor nodes and variable nodes
• Each factor node encodes a function of its neighbouring variables
• Efficient computation of marginal distribution with the sum-product algorithm (if factor graph is a tree...)
Our method: Factor Graphs
[Figure: the graph extended with factor nodes f1, f2, f3]
The full posterior factorizes over the factor graph:
$$\Pr(T, E \mid D) \propto \prod_{j \in J} f_j(E_j) \cdot \prod_{j \in J} g_j\bigl(E_j, T_{\mathrm{next}(g_j)}\bigr) \cdot f_T(T)$$
• $f_j(E_j)$: Pr(D|E), given by the dataset
[Figure: the factor graph with additional factor nodes g1, ..., g6 between categories and genes]
• Factors g: E is only active if at least one parent is active
[Figure: the factor graph with an additional prior factor node fT on the categories]
The prior factor in Pr(T,E|D) is
$$f_T(T) = \prod_{j=1}^{N} p^{T_j}\,(1-p)^{1-T_j}, \qquad 0 < p < 0.5$$
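To see how the three kinds of factors interact, the marginal posterior of each category can be computed by brute-force enumeration on a toy graph. This is only a sketch: the memberships, the likelihood values and p = 0.1 are hypothetical, and the g factor follows the slide's constraint literally (a gene may be on only if at least one parent category is on).

```python
from itertools import product

# Hypothetical toy structure: category -> member genes
members = {"T1": ["E1"], "T2": ["E1", "E2", "E3"], "T3": ["E3"]}
genes = ["E1", "E2", "E3"]
cats = list(members)
parents = {g: [c for c in cats if g in members[c]] for g in genes}

# f_j(E_j) = Pr(D_j | E_j): hypothetical values, all three genes look DE
lik = {g: {0: 0.1, 1: 0.9} for g in genes}
p = 0.1  # prior probability that a category is on (p < 0.5 favours sparsity)

def f_T(t):
    """Sparse prior: product over categories of p^T_j * (1-p)^(1-T_j)."""
    k = sum(t.values())
    return p ** k * (1 - p) ** (len(t) - k)

def g_factor(e_j, t, gene):
    """Zero weight if the gene is on although none of its parents is on."""
    if e_j == 1 and not any(t[c] for c in parents[gene]):
        return 0.0
    return 1.0

def category_marginals():
    """Pr(T_c = 1 | D) for every category, summing over all (T, E) states."""
    z = 0.0
    on = {c: 0.0 for c in cats}
    for t_bits in product([0, 1], repeat=len(cats)):
        t = dict(zip(cats, t_bits))
        for e_bits in product([0, 1], repeat=len(genes)):
            w = f_T(t)
            for gene, e_j in zip(genes, e_bits):
                w *= lik[gene][e_j] * g_factor(e_j, t, gene)
            z += w
            for c in cats:
                if t[c]:
                    on[c] += w
    return {c: on[c] / z for c in cats}

marginals = category_marginals()
print(marginals)
```

With these toy numbers the single category that covers all three genes receives most of the posterior mass, while the two partially overlapping categories stay close to their prior.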
Estimation Methods for Factor Graphs
[Figure: the complete factor graph with variable nodes T1, T2, T3, E1, E2, E3, D1, D2, D3 and factor nodes f1, f2, f3, g1, ..., g6, fT]
Computation of posterior for T,E:
• Message-Passing Algorithm: Sum-Product-Algorithm
• Stops at the correct result after one round if the graph has a tree structure
• No guarantees if the graph has cycles (e.g., oscillation may occur), however it often works well in practice
Principle:
• Start in leaf nodes
• Message propagation:
– variable to factor node: product of the incoming messages („Product“)
– factor to variable node: factor times incoming messages, summed over the other variables („Sum“)
• Termination: Compute the marginal distribution of the variable nodes
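The message-passing principle can be sketched on a small tree-shaped factor graph with binary variables. The graph (one category T1 with two genes E1, E2), the factor values and all function names are hypothetical. The recursion only terminates on trees; it does not cover the loopy case mentioned above.

```python
from itertools import product

def g(e, t):
    """A gene can be on only if its category is on."""
    return 0.0 if (e == 1 and t == 0) else 1.0

# factor name -> (scope, function); all numeric values are hypothetical
factors = {
    "fT": (("T1",), lambda t: 0.1 if t else 0.9),  # sparse prior on T1
    "g1": (("E1", "T1"), g),
    "g2": (("E2", "T1"), g),
    "f1": (("E1",), lambda e: 0.9 if e else 0.1),  # Pr(D1 | E1)
    "f2": (("E2",), lambda e: 0.2 if e else 0.8),  # Pr(D2 | E2)
}
variables = ["T1", "E1", "E2"]

def neighbours(var):
    """Names of all factors whose scope contains `var`."""
    return [name for name, (scope, _) in factors.items() if var in scope]

def var_to_factor(var, factor):
    """Product of the messages from all other neighbouring factors."""
    msg = [1.0, 1.0]
    for other in neighbours(var):
        if other != factor:
            incoming = factor_to_var(other, var)
            msg = [msg[x] * incoming[x] for x in (0, 1)]
    return msg

def factor_to_var(factor, var):
    """Factor times incoming messages, summed over the other variables."""
    scope, fn = factors[factor]
    others = [u for u in scope if u != var]
    incoming = {u: var_to_factor(u, factor) for u in others}
    msg = [0.0, 0.0]
    for x in (0, 1):
        for values in product((0, 1), repeat=len(others)):
            assignment = dict(zip(others, values))
            assignment[var] = x
            term = fn(*(assignment[u] for u in scope))
            for u in others:
                term *= incoming[u][assignment[u]]
            msg[x] += term
    return msg

def marginal(var):
    """Normalised product of all messages arriving at `var`."""
    belief = [1.0, 1.0]
    for factor in neighbours(var):
        incoming = factor_to_var(factor, var)
        belief = [belief[x] * incoming[x] for x in (0, 1)]
    z = sum(belief)
    return [b / z for b in belief]

print({v: marginal(v)[1] for v in variables})
```

The recursion recomputes messages for every query, which keeps the sketch short; a practical implementation would store each message once and, on graphs with cycles, iterate the updates instead.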
24.03.2010 Diana Uskat - Gene Center Munich 9
Application: Yeast Salt Stress
• Categories: Transcription factors (with their targets) instead of GO categories
• Given:
– List of transcription factors with their corresponding genes
– List of genes (their p-values) from a yeast salt stress experiment
• Question: Which transcription factors are active during salt stress?
• Task: Find a set of transcription factors that are most likely to be active
[Figure: bipartite graph of transcription factors TF1, TF2 and genes g1, ..., g5; an edge means e.g. “g2 is target of TF2”]
Results
~2,000 genes
118 transcription factors
Graph obtained from re-analysis of Harbison TF binding data (Nature, 2004) by MacIsaac et al. (BMC Bioinformatics, 2006)
Previously known transcription factors involved in salt stress (Capaldi et al., Nat. Genet. 2008; Wu and Chen, Bioinform. Biol. Insights 2009)
Differentially phosphorylated transcription factors (Soufi et al., Mol. BioSyst. 2009)
YML081W
DAL81
STB4
HSF1
UME6
SNT2
RGT1
MET28
MSN2
GAL4
SKO1
Summary and Outlook
• Todo: scalability and speed
• Lists of (meaningful) gene sets are better than lists of genes
• Search for biologically meaningful explanations requires a new minimal model (MGSE) for gene set enrichment analysis
• We use factor graphs for parameter estimation
• Wide application to GO analysis, TF-target analysis, Pathway enrichment
Acknowledgments
Gene Center Munich:
Achim Tresch, Theresa Niederberger, Björn Schwalb, Sebastian Dümcke
Collaborating Partners:
Gene Center Munich:
Patrick Cramer, Christian Miller, Daniel Schulz, Dietmar Martin, Andreas Mayer
EMBL Heidelberg:
Julien Gagneur (talk Nov. 2009, working group conference of the GMDS „AG Statistische Methoden in der Bioinformatik, Munich“)