Efficient Algorithms for Detecting Signaling Pathways in
Protein Interaction Networks
Jacob Scott, Trey Ideker,
Richard M. Karp, Roded Sharan
RECOMB 2005
Outline
• Motivation
• Theoretical foundations
• Biological extensions
• Implementation
• Validation techniques
• Results from yeast
Motivation
• Post-genomics, want to understand organisms’ protein-protein interaction network
• Model network as a probabilistic graph, with edge weights representing probabilities
• Interested in protein signaling cascades
– Show up as simple paths in the graph
• Want to find biologically interesting paths efficiently
– Score paths, with high scores reflecting importance
– Extended graph algorithms provide speed
– Automated modelling of signal transduction networks as baseline (Steffen et al 2002)
Theoretical Foundation
• Finding long, simple paths is NP-Hard
– Reduce from TSP
– Once we find these paths, want the best (lightest) ones
• Need for paths to be simple is what drives hardness
• Color-Coding is a randomized, dynamic-programming based algorithm for finding paths of fixed length
– Developed by Alon et al (1995)
• Randomly color graph and require paths be colorful (exactly one vertex of each color)
– Number of colors = length of paths
– A colorful path is always simple
Color-Coding
• Colorful paths can be found with dynamic programming
• Key point: a colorful path of length k contains a colorful path of length k-1.
• Store path information at each node for each subset of k colors
– Only 2^k color subsets, rather than O(n^k) node subsets
• Runtime is O(2^k k m) << O(k n^k) brute force
• Space is O(2^k n) << O(k n^k) brute force
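The dynamic program above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the dictionary-based edge format, and the choice to minimize an additive weight are assumptions made for the example. The table entry for (v, mask) plays the role of W(v, S) on the slides: the lightest colorful path ending at v that uses exactly the colors in S.

```python
import math
import random

def best_colorful_path(edges, k, coloring):
    """Lightest colorful path on k vertices under one fixed coloring.

    edges: dict mapping (u, v) -> weight for an undirected graph (assumed format).
    coloring: dict mapping vertex -> color in range(k).
    Returns (weight, path), or (math.inf, None) if no colorful path exists.
    """
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    # table[(v, mask)]: lightest colorful path ending at v whose vertices
    # use exactly the colors in the bitmask `mask`.
    table = {(v, 1 << coloring[v]): (0.0, (v,)) for v in adj}
    for _ in range(k - 1):
        nxt = {}
        for (v, mask), (w, path) in table.items():
            for u, w_uv in adj[v]:
                bit = 1 << coloring[u]
                if mask & bit:
                    continue  # color already used, so the path would not be colorful
                key = (u, mask | bit)
                cand = (w + w_uv, path + (u,))
                if key not in nxt or cand[0] < nxt[key][0]:
                    nxt[key] = cand
        table = nxt
    # After k - 1 extensions every surviving entry uses all k colors.
    return min(table.values(), default=(math.inf, None), key=lambda t: t[0])

def color_coding(edges, k, trials, seed=0):
    """Repeat random colorings and keep the lightest path found."""
    rng = random.Random(seed)
    vertices = sorted({x for e in edges for x in e})
    best = (math.inf, None)
    for _ in range(trials):
        coloring = {v: rng.randrange(k) for v in vertices}
        cand = best_colorful_path(edges, k, coloring)
        if cand[0] < best[0]:
            best = cand
    return best
```

Because each extension only checks a color bitmask rather than the set of visited vertices, the table has at most 2^k entries per vertex, which is where the 2^k factor in the runtime and space bounds comes from.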
Coloring Example
• Two different colorings on toy graph, k=3
• In coloring I, W(A,RGB) is built C->BC->ABC
• In coloring II, W(A,RGB) is built G->BG->ABG
• ABC is not colorful in coloring II
[Figure: toy graph on vertices A to H, shown under coloring I and coloring II]
Monte Carlo Details
• A colorful path is simple, but a simple path may not be colorful under a given coloring
• Solution: run multiple independent trials
• After one trial, for paths of length k, a given simple path is colorful with probability k!/k^k > e^(-k)
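As a rough guide to how many trials are needed, a fixed simple path on k vertices is colorful under a uniformly random k-coloring with probability k!/k^k, which exceeds e^(-k). The sketch below turns that into a trial count for a target success probability; the function names and the 99.9% default are illustrative assumptions, not taken from the authors' code.

```python
import math

def colorful_probability(k):
    """Probability that a fixed k-vertex path gets k distinct colors."""
    return math.factorial(k) / k**k

def trials_needed(k, success=0.999):
    """Smallest number of independent colorings so that a fixed path is
    colorful in at least one of them with probability >= success."""
    p = colorful_probability(k)
    if p >= 1.0:
        return 1  # k = 1: every coloring works
    return math.ceil(math.log(1.0 - success) / math.log(1.0 - p))
```

Since the per-trial probability shrinks roughly like e^(-k), the trial count grows exponentially in the path length, which is consistent with longer paths taking much longer to search.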
Adding Biology
• Color-Coding gives an algorithmic basis, now introduce biologically motivated extensions
• Can set the start or end of path by type
– E.g. screening by Gene Ontology categories
• Can force the inclusion of a protein on the path by giving it a unique color
• Using counters, can specify “path must contain between x and y proteins of a given type”
– Computational cost multiplicative in y per counter
Adding Biology - Segmented Paths
• Pathways may be ordered
– E.g. signaling pathways going from the membrane to nuclear proteins and finally to transcription factors
• Assign each protein an integer label based on biological information, build path out of ordered sequences of labeled proteins
– Now only need to constrain color collisions among proteins with the same label
– If path length is about equally split among labels, probability of correct coloring rises
• Modifications allow for inability to assign proteins to unique labels
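To see why segmenting helps, one can compare success probabilities under a simplifying assumption that is not spelled out on the slides: each label gets its own palette, with as many colors as the path has vertices of that label. A path is then colorful when every segment is colorful within its own palette, so the per-segment probabilities multiply. The function names below are hypothetical.

```python
import math

def colorful_probability(k):
    """Probability that k fixed vertices receive k distinct colors out of k."""
    return math.factorial(k) / k**k

def segmented_probability(segment_sizes):
    """Probability that every segment is colorful within its own palette,
    assuming segments are colored independently (illustrative assumption)."""
    return math.prod(colorful_probability(s) for s in segment_sizes)

# Compare one palette of 8 colors with three palettes of sizes 3, 3 and 2.
unsegmented = colorful_probability(8)
segmented = segmented_probability([3, 3, 2])
```

Under this assumption the segmented probability is larger whenever the path is split into more than one segment, so fewer trials are needed for the same confidence.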
Adding Biology - More Structures
• Modifications to the Color-Coding recurrence allow for the discovery of structures beyond simple paths
– Example: Two-terminal series-parallel graphs
• Capture parallel signaling pathways
Example two-terminal series-parallel graph
Generating Edge Weights
• So far, have glossed over how weights (probabilities) on the protein graph are assigned
• Here, use our previous work, generate logistic function of three variables (for a pair of proteins)
– Number of times interaction between them was experimentally observed
– Pearson correlation coefficient of expressions (for corresponding genes)
– Their small world clustering coefficient
• Used gold-standard data from MIPS to train the relative weighting
• Taking log of weights makes path score additive
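A minimal sketch of this weighting scheme follows. The coefficients are placeholders chosen for illustration, not the values fitted on MIPS, and the use of the negative log (so that lighter paths are more reliable) is an assumption consistent with looking for the lightest paths.

```python
import math

def interaction_reliability(n_obs, expr_corr, clustering,
                            coef=(-2.0, 1.0, 1.5, 1.0)):
    """Logistic function of the three features for one protein pair.
    `coef` holds hypothetical (intercept, b_obs, b_corr, b_clust) values."""
    b0, b1, b2, b3 = coef
    z = b0 + b1 * n_obs + b2 * expr_corr + b3 * clustering
    return 1.0 / (1.0 + math.exp(-z))

def path_score(reliabilities):
    """Additive path score: sum of -log(p) over the edges of a path.
    Minimizing it is equivalent to maximizing the product of the p's."""
    return sum(-math.log(p) for p in reliabilities)
```

With edge weights defined as -log(p), the dynamic program only needs to add weights along a path, and exp(-score) recovers the product of the edge probabilities.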
Application
• Tested our simple path implementation with the yeast interaction network
– ~4,500 vertices, ~14,500 edges
– Based on interaction data from Database of Interacting Proteins (Feb 2004)
– Runtimes varied from minutes (length 8) to under two hours (length 10)
– Much faster than brute force for longer paths (14x for paths of length 9)
– Focus on paths from membrane proteins to transcription factors
Validation Techniques
• Three methods of validation
• Two statistical
– Functional enrichment p-value based on how many proteins in the path are similar (by GO category)
– Weight p-value compares weights of paths to those found when the protein graph undergoes random degree-preserving shuffling
• Lastly, search for expected pathways
– MAP-Kinase, ubiquitin-ligation
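The weight p-value can be sketched as below. Double edge swaps are one standard way to shuffle a graph while preserving every vertex degree; whether the authors used exactly this procedure, and the add-one correction in the empirical p-value, are assumptions made for the example, as are the function names.

```python
import random

def degree_preserving_shuffle(edges, swaps, seed=0):
    """Randomize an undirected simple graph by double edge swaps.
    Each accepted swap rewires a-b, c-d into a-d, c-b, so degrees are kept."""
    rng = random.Random(seed)
    edge_list = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edge_list}
    for _ in range(swaps):
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c  # try either orientation of the second edge
        if len({a, b, c, d}) < 4:
            continue  # would create a self-loop
        if frozenset((a, d)) in present or frozenset((c, b)) in present:
            continue  # would create a duplicate edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {frozenset((a, d)), frozenset((c, b))}
        edge_list[i], edge_list[j] = (a, d), (c, b)
    return edge_list

def empirical_p_value(observed, random_scores):
    """Share of random-network path weights at least as light as the
    observed weight, with an add-one correction (assumed convention)."""
    hits = sum(1 for s in random_scores if s <= observed)
    return (hits + 1) / (len(random_scores) + 1)
```

In use, one would run the path search on many shuffled copies of the network, collect the best path weights, and pass them as `random_scores` alongside the weight of a path found in the real network.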
MAP-Kinase and Ubiquitin-Ligation
• Concentrated on three MAPK pathways (same as Steffen et al)
– Pheromone response
– Filamentous growth
– Cell wall integrity
• Looked for shorter (length 4-6) ubiquitin-ligation pathways
– Started at a cullin, ended at an F-Box
– High functional enrichment under ubiquitin GO category
Statistical Results (CDFs)
• 100 best paths of length 8 @ 99.9% success
• 100 normal, 2000 random paths used for weight p-value
MAPK Recovery Results
A) Cell wall integrity pathway in yeast
MID2 RHO1 PKC1 BCK1 MKK1/2 SLT2 RLM1
B) Best path of length 7 found from MID2 to RLM1
MID2 ROM2 RHO1 PKC1 MKK1 SLT2 RLM1
C) Pheromone response signaling pathway in yeast
STE2/3 STE4/18 CDC42 STE20 STE11 STE7 FUS3 DIG1/2 STE12
D) Best path of length 9 found from STE2/3 to STE12
STE3 AKR1 STE4 CDC24 BEM1 STE5 STE7 KSS1 STE12
Additional MAPK Recovery Results
STE2/3 STE4/18 CDC42 STE20 STE11 STE7 FUS3 DIG1/2 STE12
Pheromone response signaling pathway in yeast
[Figure: Pheromone response pathway assembly network. Nodes: STE3, STE50, GPA1, FAR1, CDC24, BEM1, STE11, CDC42, STE4/18, AKR1, KSS1, STE5, STE12, DIG1/2, FUS3, STE7]
Conclusion
• Presented efficient, color-coding based algorithms for finding simple paths
– Added biological extensions, other structures
• Integrated our well-founded reliability scores
• Applied our algorithms to yeast
– Showed that 60% of discovered pathways were significantly enriched
– Recovered known MAP-Kinase, ubiquitin-ligation pathways
Simple vs. Segmented CDFs
• Simple: 54%
• Segmented: 72%
[Figure: CDF plot; x-axis: p-value (functional enrichment)]
References
• Steffen, M., Petti, A., Aach, J., D’haeseleer, P., Church, G.: Automated modelling of signal transduction networks. BMC Bioinformatics 3 (2002) 34–44
• Alon, N., Yuster, R., Zwick, U.: Color-coding. J. ACM 42 (1995) 844–856