
Efficient Algorithms for Detecting Signaling Pathways in Protein Interaction Networks

Jacob Scott, Trey Ideker, Richard M. Karp, Roded Sharan

RECOMB 2005

Outline

• Motivation

• Theoretical foundations

• Biological extensions

• Implementation

• Validation techniques

• Results from yeast

Motivation

• Post-genomics, want to understand organisms’ protein-protein interaction network

• Model network as a probabilistic graph, with edge weights representing probabilities

• Interested in protein signaling cascades

– Show up as simple paths in the graph

• Want to find biologically interesting paths efficiently

– Score paths, with high scores reflecting importance

– Extended graph algorithms provide speed

– Automated modelling of signal transduction networks as baseline (Steffen et al 2002)

Theoretical Foundation

• Finding long, simple paths is NP-Hard

– Reduce from TSP

– Once we find these paths, want the best (lightest) ones

• Need for paths to be simple is what drives hardness

• Color-Coding is a randomized, dynamic-programming based algorithm for finding paths of fixed length

– Developed by Alon et al (1995)

• Randomly color graph and require paths be colorful (exactly one vertex of each color)

– Number of colors = length of paths

– A colorful path is always simple

Color-Coding

• Colorful paths can be found with dynamic programming

• Key point: a colorful path of length k contains a colorful path of length k-1.

• Store path information at each node for each subset of k colors

– Only 2^k color subsets, rather than O(n^k) node subsets

• Runtime is O(2^k k m) << O(k n^k) brute force

• Space is O(2^k n) << O(k n^k) brute force
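The dynamic program described on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes an undirected graph, additive edge weights where lighter is better, and color subsets stored as bitmasks. The function name `best_colorful_path` and its arguments are hypothetical.

```python
from math import inf

def best_colorful_path(n, edges, k, colors):
    """Lightest colorful path on k vertices under one fixed coloring.

    n      -- number of vertices, labeled 0..n-1
    edges  -- iterable of (u, v, weight) undirected edges
    colors -- colors[v] is an integer in range(k)
    Returns (weight, path), or (inf, None) if no colorful path exists.
    """
    adj = [[] for _ in range(n)]
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))

    # W[v][S] = (weight, predecessor) of the lightest colorful path that
    # ends at v and uses exactly the colors in bitmask S.
    W = [dict() for _ in range(n)]
    for v in range(n):
        W[v][1 << colors[v]] = (0.0, None)

    # Increasing numeric order visits every S after S minus one color.
    for S in range(1, 1 << k):
        for v in range(n):
            bit = 1 << colors[v]
            if not S & bit or S == bit:
                continue
            prev = S ^ bit  # colors used before reaching v
            best = None
            for u, w in adj[v]:
                if prev in W[u]:
                    cand = W[u][prev][0] + w
                    if best is None or cand < best[0]:
                        best = (cand, u)
            if best is not None:
                W[v][S] = best

    full = (1 << k) - 1
    end = min((v for v in range(n) if full in W[v]),
              key=lambda v: W[v][full][0], default=None)
    if end is None:
        return inf, None

    # Walk the predecessor pointers back to recover the path.
    path, S, v = [], full, end
    while v is not None:
        path.append(v)
        u = W[v][S][1]
        S ^= 1 << colors[v]
        v = u
    return W[end][full][0], path[::-1]
```

Because the predecessor's color set never contains the color of the current vertex, the extended path cannot revisit a vertex, which is why no explicit "visited" set is needed.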

Coloring Example

• Two different colorings on toy graph, k=3

• In coloring I, W(A,RGB) is built C->BC->ABC

• In coloring II, W(A,RGB) is built G->BG->ABG

• ABC is not colorful in coloring II

[Figure: toy graph on vertices A through H, shown under colorings I and II]

Monte Carlo Details

• A colorful path is simple, but a simple path may not be colorful under a given coloring

• Solution: run multiple independent trials

• After one trial, for paths of length k, a given simple path is colorful with probability k!/k^k > e^-k

Adding Biology

• Color-Coding gives an algorithmic basis, now introduce biologically motivated extensions

• Can set the start or end of path by type

– E.g. screening by Gene Ontology categories

• Can force the inclusion of a protein on the path by giving it a unique color

• Using counters, can specify “path must contain between x and y proteins of a given type”

– Computational cost multiplicative in y per counter
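The unique-color idea for forcing a protein onto the path can be sketched as a change to the random coloring step. This is an assumed illustration (the name `coloring_with_required` is hypothetical): one color is reserved for the required protein and every other protein draws from the remaining k-1 colors, so any path that uses all k colors must pass through the required protein.

```python
import random

def coloring_with_required(vertices, k, required, rng=random):
    """Random k-coloring in which only `required` carries color k-1.

    Assumes k >= 2. All other vertices get a color in range(k - 1).
    """
    reserved = k - 1
    colors = {v: rng.randrange(k - 1) for v in vertices if v != required}
    colors[required] = reserved
    return colors
```

Under this scheme the remaining k-1 path vertices must receive distinct colors out of k-1, so a qualifying path is colorful with probability (k-1)!/(k-1)^(k-1) per trial.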

Adding Biology - Segmented Paths

• Pathways may be ordered

– Signaling pathways going from the membrane, to nuclear proteins and finally transcription factors

• Assign each protein an integer label based on biological information, build path out of ordered sequences of labeled proteins

– Now only need to constrain color collisions among proteins with the same label

– If path length is about equally split among labels, probability of correct coloring rises

• Modifications allow for inability to assign proteins to unique labels
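To see why splitting the path among labels helps, here is a small calculation under an assumed scheme, which may differ from the authors' exact construction: each label class is colored independently with as many colors as its segment has vertices, so the per-trial success probability is the product of the per-segment probabilities. The function names are illustrative.

```python
from math import factorial

def colorful_probability(k):
    """Chance that k vertices receive k distinct colors out of k."""
    return factorial(k) / k**k

def segmented_probability(segment_lengths):
    """Per-trial success when each segment is colored independently."""
    p = 1.0
    for k_i in segment_lengths:
        p *= colorful_probability(k_i)
    return p
```

Comparing `segmented_probability([8])` with `segmented_probability([4, 4])` shows the gain for a length-8 path split evenly between two labels.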

Adding Biology - More Structures

• Modifications to the Color-Coding recurrence allow discovery of structures beyond simple paths

– Example: Two-terminal series-parallel graphs

• Capture parallel signaling pathways

Example two-terminal series-parallel graph

Generating Edge Weights

• So far, have glossed over how weights (probabilities) on the protein graph are assigned

• Here, use our previous work: a logistic function of three variables (for a pair of proteins)

– Number of times the interaction between them was experimentally observed

– Pearson correlation coefficient of expression (for the corresponding genes)

– Their small-world clustering coefficient

• Trained the relative weighting of these variables on MIPS data (gold standard)

• Taking log of weights makes path score additive
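The weighting scheme on this slide can be illustrated with a short sketch. The coefficients below are placeholders chosen for illustration only; they are not the values fitted on MIPS. The function names are hypothetical too. The second function shows why taking logs makes the path score additive: the sum of negative log probabilities equals the negative log of their product.

```python
from math import exp, log

def interaction_probability(observations, expr_corr, clustering,
                            coef=(-1.0, 0.8, 1.5, 2.0)):
    """Logistic function of the three per-pair variables.

    coef = (intercept, w_observations, w_expr_corr, w_clustering);
    the default values are made up for this example.
    """
    b0, b1, b2, b3 = coef
    z = b0 + b1 * observations + b2 * expr_corr + b3 * clustering
    return 1.0 / (1.0 + exp(-z))

def path_score(edge_probabilities):
    """Additive path score: lower means a more reliable path."""
    return sum(-log(p) for p in edge_probabilities)
```

With this convention, the "lightest" path found by the search is the one whose edge probabilities have the largest product.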

Application

• Tested our simple path implementation with the yeast interaction network

– ~4,500 vertices, ~14,500 edges

– Based on interaction data from Database of Interacting Proteins (Feb 2004)

– Runtimes varied from minutes (length 8) to under two hours (length 10)

– Much faster than brute force for longer paths (14x for paths of length 9)

– Focus on paths from membrane proteins to transcription factors

Validation Techniques

• Three methods of validation

• Two statistical

– Functional enrichment p-value based on how many proteins in the path are similar (by GO category)

– Weight p-value compares weights of paths to those found when the protein graph undergoes random degree-preserving shuffling

• Lastly, search for expected pathways

– MAP-Kinase, ubiquitin-ligation
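The weight p-value relies on a degree-preserving shuffle of the protein graph. One common way to do this, assumed here rather than taken from the talk, is repeated double-edge swaps: pick two edges a-b and c-d and rewire them to a-d and c-b when that creates no self-loop or duplicate edge. The sketch ignores edge weights, and the add-one correction in `empirical_p_value` is a convention chosen for this example. All names are illustrative.

```python
import random

def degrees(edges):
    """Degree of every vertex in an undirected edge list."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg

def degree_preserving_shuffle(edges, swaps, rng):
    """Attempt `swaps` double-edge swaps; every vertex keeps its degree."""
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    for _ in range(swaps):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        if len({a, b, c, d}) < 4:
            continue  # would create a self-loop or change nothing
        if frozenset((a, d)) in present or frozenset((c, b)) in present:
            continue  # would create a duplicate edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {frozenset((a, d)), frozenset((c, b))}
        edges[i], edges[j] = (a, d), (c, b)
    return edges

def empirical_p_value(observed, random_scores):
    """Share of random path scores at least as light as the observed one."""
    hits = sum(1 for s in random_scores if s <= observed)
    return (hits + 1) / (len(random_scores) + 1)
```

In use, the path search would be rerun on each shuffled graph and the resulting best-path scores passed to `empirical_p_value` as `random_scores`.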

MAP-Kinase and Ubiquitin-Ligation

• Concentrated on three MAPK pathways (same as Steffen et al)

– Pheromone response

– Filamentous growth

– Cell wall integrity

• Looked for shorter (length 4-6) ubiquitin-ligation pathways

– Started at a cullin, ended at an F-Box

– High functional enrichment under ubiquitin GO category

Statistical Results (CDFs)

• 100 best paths of length 8 @ 99.9% success

• 100 normal, 2000 random paths used for weight p-value

MAPK Recovery Results

A) Cell wall integrity pathway in yeast: MID2, RHO1, PKC1, BCK1, MKK1/2, SLT2, RLM1

B) Best path of length 7 found from MID2 to RLM1: MID2, ROM2, RHO1, PKC1, MKK1, SLT2, RLM1

C) Pheromone response signaling pathway in yeast: STE2/3, STE4/18, CDC42, STE20, STE11, STE7, FUS3, DIG1/2, STE12

D) Best path of length 9 found from STE2/3 to STE12: STE3, AKR1, STE4, CDC24, BEM1, STE5, STE7, KSS1, STE12

Additional MAPK Recovery Results

Pheromone response signaling pathway in yeast: STE2/3, STE4/18, CDC42, STE20, STE11, STE7, FUS3, DIG1/2, STE12

[Figure: pheromone response pathway assembly network. Nodes: STE3, GPA1, STE50, FAR1, CDC24, REM1, STE11, CDC42, STE4/18, AKR1, KSS1, STE5, STE12, DIG1/2, FUS3, STE7]

Conclusion

• Presented efficient, color-coding based algorithms for finding simple paths

– Added biological extensions, other structures

• Integrated our well-founded reliability scores

• Applied our algorithms to yeast

– Showed that 60% of discovered pathways were significantly enriched

– Recovered known MAP-Kinase and ubiquitin-ligation pathways

Simple vs. Segmented CDFs

[Figure: CDFs of functional enrichment p-value for simple and segmented paths. Simple: 54%, Segmented: 72%]

References

• Steffen, M., Petti, A., Aach, J., D’haeseleer, P., Church, G.: Automated modelling of signal transduction networks. BMC Bioinformatics 3 (2002) 34–44

• Alon, N., Yuster, R., Zwick, U.: Color-coding. J. ACM 42 (1995) 844–856
