9. Lecture WS 2004/05

Bioinformatics III

Graph Layout in Cellular Networks

www.cytoscape.org



Task: visualize cellular interaction data

e.g.

- protein interaction data (undirected): nodes – proteins; edges – interactions

- metabolic pathways (directed): nodes – substances; edges – reactions

- regulatory networks (directed): nodes – transcription factors + regulated proteins; edges – regulatory interactions

- co-localization (undirected): nodes – proteins; edges – co-localization information

- homology (undirected/directed): nodes – proteins; edges – sequence similarity (BLAST score)



Visualisation: intuitive approach to understand graphs

http://www.it.usyd.edu.au/~aquigley/3dfade/

Graph-like structures are pervasive:

- route maps of airline companies

- infrastructure of computer networks

- the relationships between people who work in the same company, etc.

- cellular interactions ...

One way to understand the information coded in these graphs is to draw

graphical representations of them. Since drawing by hand is tedious and

error-prone, it is natural to expect computers to draw graphs automatically,

assigning spatial coordinates to nodes and connecting them with edges.

Some graphs, such as flight route maps, are not hard to draw because the

precise locations of the nodes (cities) are already given.

For other graphs, such information is not available and computers need to

determine where to plot the nodes and how to draw the edges that connect

the nodes.



Force-directed algorithm for graph layout

http://www.hpc.unm.edu/~sunls/research/treelayout/node1.html

Various graph layout algorithms have been

developed to solve this visualisation task.

In 1984, Peter Eades proposed a graph layout heuristic [A heuristic for graph drawing. Congressus Numerantium, 42:149-160, 1984] which is called the "Spring Embedder" algorithm.

Edges are replaced by springs and vertices are replaced by rings that connect the springs. A layout can be found by simulating the dynamics of such a physical system.

This method and other methods that involve similar simulations to compute the layout are called "force-directed" algorithms.



Force-directed algorithm

http://www.it.usyd.edu.au/~aquigley/3dfade/

Edges can be modeled as a gravitational (or electrostatic) attraction between the nodes they connect, while all nodes repel each other electrostatically.

It is also possible for the system to simulate unnatural forces acting on the bodies that have no direct physical analogy, for example by using a logarithmic distance measure rather than a Euclidean one.
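The force model described above can be sketched in a few lines. This is a minimal 2D illustration in the spirit of a spring embedder, not the implementation of any particular tool: adjacent nodes are coupled by a logarithmic spring, non-adjacent nodes repel with an inverse-square force. The function name, the constants `c1` to `c4`, and the `max_step` cap are assumptions chosen for illustration.

```python
import math
import random

def spring_embedder(nodes, edges, iterations=200,
                    c1=2.0, c2=1.0, c3=1.0, c4=0.1, max_step=1.0, seed=0):
    """Return {node: [x, y]} after a fixed number of force-directed iterations.

    nodes is a list; edges is a list of (u, v) pairs. All constants are
    illustrative assumptions: c2 is the natural spring length, c4 the step size.
    """
    rng = random.Random(seed)
    pos = {v: [rng.random(), rng.random()] for v in nodes}
    springs = {frozenset(e) for e in edges if e[0] != e[1]}  # self loops carry no force
    for _ in range(iterations):
        force = {v: [0.0, 0.0] for v in nodes}
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                dx = pos[v][0] - pos[u][0]
                dy = pos[v][1] - pos[u][1]
                d = max(math.hypot(dx, dy), 1e-6)
                if frozenset((u, v)) in springs:
                    f = c1 * math.log(d / c2)   # positive: pull together
                else:
                    f = -c3 / (d * d)           # negative: push apart
                fx, fy = f * dx / d, f * dy / d
                force[u][0] += fx
                force[u][1] += fy
                force[v][0] -= fx
                force[v][1] -= fy
        for v in nodes:
            fx, fy = c4 * force[v][0], c4 * force[v][1]
            norm = math.hypot(fx, fy)
            if norm > max_step:                 # cap the displacement per step
                fx, fy = fx * max_step / norm, fy * max_step / norm
            pos[v][0] += fx
            pos[v][1] += fy
    return pos
```

For two nodes joined by a single edge, the layout settles at the natural spring length `c2`; the double loop over node pairs is what makes each iteration quadratic in the number of nodes.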



Force-directed algorithm


Because of the underlying analogy to a physical system, the force directed graph

layout methods tend to meet various aesthetic standards, such as

- efficient space filling,

- uniform edge length (when equal weights and repulsions are used)

- symmetry, and

- the capability of rendering the layout process with smooth animation (visual continuity).

Having these nice features, force-directed graph layout has become the "workhorse" of layout algorithms.

It has been successfully adapted to many domains with variations of

implementation.



Scaling


Force directed layout methods commonly have computational scaling problems.

When there are more than a few thousand vertexes in the graph, the running time

of the layout computation can become unacceptable.

This is because, in each step of the simulation, the repulsive force between each pair of unconnected vertices needs to be computed. There are $V(V-1)/2 - E \approx \tfrac{1}{2}V^2 - E$ such pairs, so each step has a running time of $O(V^2)$.

Here V is the number of vertexes and E is the number of edges in the graph.

This complexity is hard to escape for general graphs without hierarchical structure.



Protein interaction graphs

Ju et al. Bioinformatics 19, 317 (2003)

Most protein interaction data have the following characteristics:

(1) When visualized as a graph, the data yields a disconnected graph with many

connected components

(2) The data yields a nonplanar graph with a large number of edge crossings that

cannot be removed in a 2D drawing

(3) The number of interactions per protein varies widely within the same data set (broad degree distribution p(k))

(4) The data often contain protein interactions corresponding to self loops

These characteristics demand a robust layout algorithm.



InterViewer: Example of force-directed layout algorithm


InterViewer does not place the initial nodes randomly, but on the surface of a sphere. It runs for a fixed number of iterations.

The original algorithm has complexity $O(N^2)$ per time step, where N is the number of nodes. Using multipole methods, this can be reduced to $O(N \log N)$.

Time may also be saved by introducing a cut-off, e.g. only computing interactions with nodes in the neighboring cells, and by updating the neighbor list infrequently.



Application for protein interaction graphs


Visualisation of the

MIPS interaction data.

In 3D, this graph

contains no edge-

crossings.



Aim: analyze and visualize homologies within the protein universe :-)

50 genomes, 145,579 proteins, about $21 \times 10^9$ BLASTP pairwise sequence comparisons.

Expect that fusion proteins ("Rosetta Stone proteins") will link proteins of related function.

Need to visualize an extremely large network! Develop a stepwise scheme.



LGL

Adai et al. J. Mol. Biol. 340, 179 (2004)

(1) separate original network into connected sets

(2) generate coordinates for each node in each connected set (using a force-directed layout algorithm and a recipe for the sequential layout of nodes guided by a minimum spanning tree of the network).

(3) integrate connected sets into one coordinate system via a funnel process:

the connected sets are sorted in descending size by the number of vertices.

The first connected set is placed at the bottom of a potential funnel and other

sets are placed one at a time on the rim of the potential funnel and allowed to

fall towards the bottom where they are frozen in space upon collision with the

previous sets.

We concentrate on step (2) in the following
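Step (1) and the ordering used in step (3) can be sketched as follows. This is a minimal breadth-first search illustration, not the LGL source code; the function name is an assumption.

```python
from collections import deque

def connected_sets(nodes, edges):
    """Split an undirected graph into connected sets, largest first."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        component = []
        queue = deque([start])
        seen.add(start)
        while queue:                      # breadth-first search from start
            u = queue.popleft()
            component.append(u)
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        components.append(component)
    # the funnel process handles the sets in descending order of size
    components.sort(key=len, reverse=True)
    return components
```

The first element of the returned list is the set that would be placed at the bottom of the funnel.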



Minimum Spanning Tree

Given: an undirected graph $G = (V, E)$ where for each edge $(u,v) \in E$ there exists a weight $w(u,v)$ specifying the cost to connect u and v.

Find an acyclic subset $T \subseteq E$ that connects all of the nodes and whose total weight

$$w(T) = \sum_{(u,v) \in T} w(u,v)$$

is minimized.

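Popular algorithms by Kruskal and Prim.

Both are greedy algorithms that make the best choice available at each step.

Greedy strategies do not in general guarantee a globally optimal solution, but for the minimum spanning tree problem both algorithms provably yield a tree of minimum total weight.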

[Cormen]



Kruskal’s algorithm

Consider edges in sorted order by weight.

The arrow points to the edge under consideration at each step.

[Cormen]



Kruskal’s algorithm (II)

Running time O(E log V)

[Cormen]
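A compact sketch of Kruskal's algorithm with a union-find structure is given here for illustration. Edges are assumed to be given as `(weight, u, v)` triples; the function name and input format are choices made for this example.

```python
def kruskal(nodes, weighted_edges):
    """Return the edges (u, v, w) of a minimum spanning forest."""
    parent = {v: v for v in nodes}

    def find(v):
        # follow parent pointers to the set representative (with path halving)
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    # consider edges in sorted order by weight
    for w, u, v in sorted(weighted_edges, key=lambda e: e[0]):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:          # edge joins two different trees: keep it
            parent[root_u] = root_v
            tree.append((u, v, w))
    return tree
```

Sorting the edges dominates the cost, which is consistent with the O(E log V) running time quoted above.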



Intuitive description of LGL


Successive iterations of the layout. The MST determines the order of placement of

the nodes. The root node could be chosen randomly or based on its centrality in the

network (e.g. minimizing the sum of distances to all other nodes). All other nodes

are assigned a level according to their edge-based distance in the MST from the

root node.

Level one vertices (red circles) are placed randomly on a sphere around the root

node (black circle). The system is allowed to iterate through time satisfying attractive

and repulsive forces until at rest.

Level two nodes (blue circles) are placed randomly on spheres directed away from

the current layout. Again, the system is allowed to evolve through time till at rest.

This process is iterated for the entire graph.
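The level assignment can be sketched as a breadth-first traversal of the MST from the chosen root. This is an illustrative helper under the assumption that the MST is given as a list of `(u, v)` edges; the force-directed relaxation between shells is not shown.

```python
from collections import deque

def mst_levels(tree_edges, root):
    """Edge-based distance of every node from the root within the tree."""
    adj = {}
    for u, v in tree_edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    level = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for w in adj.get(u, []):
            if w not in level:
                level[w] = level[u] + 1
                queue.append(w)
    return level

def placement_order(level):
    """Group nodes into shells: level 0 (root), level 1, level 2, ..."""
    shells = {}
    for node, lvl in level.items():
        shells.setdefault(lvl, []).append(node)
    return [shells[lvl] for lvl in sorted(shells)]
```

Each shell returned by `placement_order` would be added to the layout in turn and relaxed before the next shell is placed.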



What is the role of fusion proteins?


A protein homology map summarizes the results of billions of sequence comparisons by modeling

the proteins as vertices in a network, and the statistically significant sequence similarities as edges

connecting the relevant proteins. In this manner, proteins within a sequence family (such as A, A′, A″, and AB; or B, B′ and AB) are all or mostly connected to each other, forming a cluster in the map.

Fusion proteins (such as AB) serve to connect their component proteins' families. The structure of

the resulting map reflects historic genetic events, such as gene fusions, fissions, and duplications,

which are responsible for producing the modern-day genes. The map simultaneously represents

homology relationships (edges), remote homologies (proteins not directly connected but in the same

cluster), and non-homologous functional relationships (adjacent clusters and clusters linked by

fusion proteins).



LGL Algorithm for very large biological networks


The complete protein homology map. A layout of the entire protein homology

map; a total of 11,516 connected sets containing 111,604 proteins (vertices)

with 1,912,684 edges. The largest connected set is shown more clearly in the

inset and is enlarged further on the right side.



Map of gene function


A map of gene function emerges from ~21 billion gene sequence

comparisons. Proteins are drawn as points, with

lines connecting proteins with similar sequences,

and are arranged so that homologous proteins

are adjacent in the Figure.

The size of each cluster is proportional to the

number of proteins in that sequence family.

Fusion proteins force their component proteins'

respective families to be close together in the

Figure, and thereby serve to organize the

proteins in the map according to their functions.

The resulting broad trends of protein function are

labeled, as are several of the most extensive

sequence families. A–C indicate specific regions

that are magnified later.

Only the largest connected network

component is drawn, containing 30,727

proteins (vertices) and 1,206,654

significant sequence similarities (edges),

and representing ~4 billion sequence

comparisons.



Functionally related gene families form adjacent clusters


Three examples illustrate spatial localization of protein function in the map, specifically

A, the linkage of the tryptophan synthase α family to the functionally coupled but non-homologous β family by the yeast tryptophan synthase αβ fusion protein,

B, protein subunits of the pyruvate synthase and alpha-ketoglutarate ferredoxin oxidoreductase complexes,

C, metabolic enzymes, particularly those of acetyl CoA and amino acid metabolism.



Colocalization


Neighboring proteins tend to be in the

same cellular system. The tendency

for proteins to operate in the same

cellular system, as defined by the

percentage of matching assignments

into the 18 COG database pathways,

is plotted against the spatial

separation in multiples of a typical

cluster size.

The functional similarity decays exponentially with distance, proportional to $e^{-0.26 d}$, where d is the separation measured in multiples of a typical cluster diameter.
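To get a feeling for this decay rate, the small calculation below evaluates the fitted function and the separation at which the similarity has dropped to one half. Only the rate 0.26 comes from the text; the function and variable names are illustrative.

```python
import math

DECAY_RATE = 0.26  # exponent of the fitted decay, per typical cluster diameter

def relative_similarity(d):
    """Functional similarity at separation d, relative to its value at d = 0."""
    return math.exp(-DECAY_RATE * d)

# separation at which the similarity has dropped to one half
half_distance = math.log(2) / DECAY_RATE

print(f"similarity at d = 1: {relative_similarity(1):.2f}")
print(f"half-distance: {half_distance:.2f} cluster diameters")
```

The half-distance is ln 2 / 0.26, i.e. between two and three cluster diameters.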



Comparison with other layout maps


A comparison of LGL with map layouts

produced by other algorithms. The layout of

the protein homology map by LGL (A) is

contrasted with the layout of the same

network by the spring-force algorithm only,

lacking the minimal spanning tree

calculation and iterative layout procedure

(B), and with the layout by the approach of

InterViewer (C). InterViewer

collapses equivalent nodes into single

nodes, thereby simplifying the graph, and is

one of the few available graph layout

programs that scales to such large

networks. The layout from LGL reveals

more of the internal graph structure than

the other approaches tested.



Modularity in molecular networks?

A functional module is, by definition, a discrete entity whose function is

separable from those of other modules.

This separation depends on chemical isolation, which can originate from

spatial localization or from chemical specificity.

E.g. a ribosome concentrates the reactions involved in making a polypeptide

into a single particle, thus spatially isolating its function.

A signal transduction system is an extended module that achieves its isolation

through the specificity of the initial binding of the chemical signal to receptor

proteins, and of the interactions between signalling proteins within the cell.

Hartwell et al. Nature 402, C47 (1999)



Modularity in molecular networks

Modules can be insulated from or connected to each other.

Insulation allows the cell to carry out many diverse reactions without cross-talk

that would harm the cell.

Connectivity allows one function to influence another.

The higher-level properties of cells, such as their ability to integrate information

from multiple sources, will be described by the pattern of connections among their

functional modules.

Hartwell et al. Nature 402, C47 (1999)



Organization of large-scale molecular networks

Organization of molecular networks revealed by large-scale experiments:

- power-law distribution of the node degree k (i.e. the number of edges of a node), $P(k) \propto k^{-\gamma}$

- small-world property (i.e. a high clustering coefficient and a small shortest-path length between every pair of nodes)

- anticorrelation in the node degree of connected nodes (i.e. highly interacting nodes tend to be connected to low-interacting ones)

These properties become evident when hundreds or thousands of molecules and

their interactions are studied together.

On the other end of the spectrum: recently discovered motifs that consist of 3-4

nodes.



Mesoscale properties of networks

Most relevant processes in biological networks correspond to the mesoscale

(5-25 genes or proteins) not to the entire network.

However, it is computationally enormously expensive to study mesoscale

properties of biological networks.

E.g. a network of 1000 nodes contains on the order of $10^{23}$ possible 10-node sets.

Spirin & Mirny analyzed combined network of protein interactions with data from

CELLZOME, MIPS, BIND: 6500 interactions.
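The number of possible 10-node sets in a 1000-node network is a binomial coefficient and can be checked directly with the standard library:

```python
import math

# number of ways to choose 10 nodes out of 1000
n_sets = math.comb(1000, 10)
print(f"{n_sets:.1e}")
```

Exhaustively scoring every such set is clearly infeasible, which motivates the heuristic search methods discussed next.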



Identify connected subgraphs

The network of protein interactions is typically presented as an undirected graph

with proteins as nodes and protein interactions as undirected edges.

Aim: identify highly connected subgraphs (clusters) that have more interactions

within themselves and fewer with the rest of the graph.

A fully connected subgraph, or clique, that is not a part of any other clique is an

example of such a cluster.

In general, clusters need not be fully connected.

Measure the density of connections by

$$Q = \frac{2m}{n(n-1)}$$

where n is the number of proteins in the cluster and m is the number of interactions between them.

Spirin, Mirny, PNAS 100, 12123 (2003)
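The density measure Q = 2m / (n(n-1)) can be computed as below. This is a small illustrative helper; the function name and the edge-list input format are assumptions.

```python
def cluster_density(cluster, edges):
    """Q = 2m / (n (n - 1)) for the sub-graph induced by `cluster`."""
    members = set(cluster)
    n = len(members)
    if n < 2:
        return 0.0
    # count each undirected interaction inside the cluster once; skip self loops
    inside = {frozenset((u, v)) for u, v in edges
              if u != v and u in members and v in members}
    m = len(inside)
    return 2 * m / (n * (n - 1))
```

A clique gives Q = 1, and a chain of three proteins gives Q = 2/3.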



(method I) Identify all fully connected subgraphs (cliques)

Generally, finding all cliques of a graph is an NP-hard problem.

Because the protein interaction graph is so far very sparse (the number of interactions (edges) is similar to the number of proteins (nodes)), this can be done quickly.

To find cliques of size n one needs to enumerate only the cliques of size n-1.

The search for cliques starts with n = 4: pick all (known) pairs of edges (6500 × 6500 pairs of protein interactions) successively.

For every pair A-B and C-D check whether there are edges between A and C, A and

D, B and C, and B and D. If these edges are present, ABCD is a clique.

For every clique identified, ABCD, pick all known proteins successively.

For every picked protein E, if all of the interactions E-A, E-B, E-C, and E-D are known,

then ABCDE is a clique with size 5.

Continue for n = 6, 7, ... The largest clique found in the protein-interaction network

has size 14. Spirin, Mirny, PNAS 100, 12123 (2003)
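The idea of growing cliques of size n from those of size n-1 can be sketched as follows. As a simplifying assumption, this sketch starts from single edges (cliques of size 2) rather than from pairs of edges; it is an illustration, not the authors' code, and it is only practical for sparse graphs.

```python
def enumerate_cliques(edges):
    """Return {size: set of cliques}, each clique a frozenset of nodes."""
    adj = {}
    for u, v in edges:
        if u != v:                       # self loops cannot extend a clique
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    by_size = {}
    current = {frozenset((u, v)) for u in adj for v in adj[u]}
    size = 2
    while current:
        by_size[size] = current
        bigger = set()
        for clique in current:
            # a node extends the clique only if it interacts with every member
            common = set.intersection(*(adj[v] for v in clique))
            for extra in common:
                bigger.add(clique | {extra})
        current = bigger
        size += 1
    return by_size
```

The result still contains redundant entries, since every clique of size n contains n cliques of size n-1.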



(I) Identify all fully connected subgraphs (cliques)

These results include, however, many redundant cliques.

For example, the clique with size 14 contains 14 cliques with size 13.

To find all nonredundant subgraphs, mark all proteins comprising the clique of size 14, and out of all subgraphs of size 13 pick those that contain at least one protein that is not marked.

After all redundant cliques of size 13 are removed, proceed to remove redundant cliques of size 12, etc.

In total, only 41 nonredundant cliques with sizes 4 - 14 were found.
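Removing redundant cliques amounts to keeping only those that are not contained in a larger one. The sketch below uses an explicit subset test against every larger clique already kept, which is a slightly different formulation from the marking procedure described above; the function name is illustrative.

```python
def nonredundant_cliques(cliques):
    """Keep only cliques that are not contained in a larger kept clique."""
    kept = []
    # process from largest to smallest so that containers are seen first
    for clique in sorted((frozenset(c) for c in cliques), key=len, reverse=True):
        if not any(clique <= bigger for bigger in kept):
            kept.append(clique)
    return kept
```

Applied to the output of a full clique enumeration, this leaves the maximal cliques only.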




(method II) Superparamagnetic Clustering (SPC)

SPC uses an analogy to the physical properties of an inhomogeneous ferromagnetic model to find tightly connected clusters on a large graph.

Every node on the graph is assigned a Potts spin variable $S_i = 1, 2, \ldots, q$.

The value of this spin variable $S_i$ undergoes thermal fluctuations, which are determined by the temperature T and the spin values on the neighboring nodes.

Energetically, two nodes connected by an edge are favored to have the same spin value. Therefore, the spin at each node tends to align itself with the majority of its neighbors.

When such a Potts spin system reaches equilibrium for a given temperature T, a high correlation between the fluctuating $S_i$ and $S_j$ at nodes i and j indicates that nodes i and j belong to the same cluster.
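The following toy simulation illustrates the principle. It is a sketch under simplifying assumptions: all couplings are set to 1, spins are updated one at a time with Metropolis moves (SPC implementations typically use cluster-flip Monte Carlo instead), and the correlation is estimated as the fraction of sweeps in which two nodes carry the same spin. All names and default parameters are illustrative.

```python
import math
import random
from itertools import combinations

def potts_correlations(nodes, edges, T, q=10, sweeps=500, burn_in=200, seed=0):
    """Estimate P(S_i == S_j) for every node pair at temperature T > 0."""
    rng = random.Random(seed)
    adj = {v: set() for v in nodes}
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    spin = {v: rng.randrange(q) for v in nodes}
    same = {pair: 0 for pair in combinations(nodes, 2)}
    for sweep in range(burn_in + sweeps):
        for _ in range(len(nodes)):
            v = rng.choice(nodes)
            new = rng.randrange(q)
            # energy = -(number of edges whose end points share a spin value)
            aligned_now = sum(spin[w] == spin[v] for w in adj[v])
            aligned_new = sum(spin[w] == new for w in adj[v])
            delta = aligned_now - aligned_new
            if delta <= 0 or rng.random() < math.exp(-delta / T):
                spin[v] = new
        if sweep >= burn_in:
            for a, b in same:
                same[(a, b)] += spin[a] == spin[b]
    return {pair: count / sweeps for pair, count in same.items()}
```

Pairs of nodes inside a tightly connected group show a correlation close to 1 at low temperature; scanning T and thresholding the correlations yields candidate clusters.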




(II) Superparamagnetic Clustering (SPC)

The protein-interaction network is represented by a graph where every pair of interacting proteins is an edge of length 1.

The simulations are run for temperatures ranging from 0 to 1 in units of the coupling strength.

The network splits into monomers at temperatures between 0.7 and 0.8, whereas larger clusters only exist for temperatures between 0.1 and 0.7.

Clusters are recorded at all temperature values.

The overlapping clusters are then merged and redundant ones are removed.




(method III) Monte Carlo Simulation

Use MC to find a tight subgraph of a predetermined number of nodes M.

At time t = 0, a random set of M nodes is selected.

For each pair of nodes i, j from this set, the shortest path $L_{ij}$ between i and j on the graph is calculated.

Denote the sum of all shortest paths $L_{ij}$ from this set as $L_0$.

At every time step one of the M nodes is picked at random, and one node is picked at random out of all its neighbors.

The new sum of all shortest paths, $L_1$, is calculated for the set in which the original node is replaced by this neighbor.

If $L_1 < L_0$, accept the replacement with probability 1.

If $L_1 > L_0$, accept the replacement with probability

$$\exp\left(-\frac{L_1 - L_0}{T}\right)$$

where T is the effective temperature.
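One Monte Carlo step of this search can be sketched as below. This is an illustration under assumptions: the graph is stored as a dictionary of neighbor sets, node labels are sortable, unreachable pairs count as infinitely distant, and a proposal that picks a neighbor already in the set is simply skipped. All function names are illustrative.

```python
import math
from collections import deque

def build_adjacency(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set())
        adj.setdefault(v, set())
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return adj

def bfs_distances(adj, source):
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def path_sum(adj, subset):
    """Sum of shortest-path lengths over all pairs of nodes in subset."""
    members = list(subset)
    total = 0
    for i, u in enumerate(members):
        dist = bfs_distances(adj, u)
        for v in members[i + 1:]:
            total += dist.get(v, math.inf)
    return total

def acceptance_probability(l0, l1, T):
    """Metropolis rule: 1 if the sum does not grow, exp(-(L1 - L0)/T) otherwise."""
    if l1 <= l0:
        return 1.0
    return math.exp(-(l1 - l0) / T)

def mc_step(adj, subset, T, rng):
    """Try to replace one node of the set by one of its neighbors."""
    old = rng.choice(sorted(subset))
    neighbors = sorted(adj[old])
    if not neighbors:
        return subset
    new = rng.choice(neighbors)
    if new in subset:
        return subset
    candidate = (subset - {old}) | {new}
    p = acceptance_probability(path_sum(adj, subset), path_sum(adj, candidate), T)
    return candidate if rng.random() < p else subset
```

Repeated calls such as `s = mc_step(adj, s, T, random.Random(0))` in a loop, while keeping track of the set with the smallest path sum seen so far, give the basic search; the periodic jump to an unconnected node is not included here.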



(III) Monte Carlo Simulation

Every tenth time step an attempt is made to replace one of the nodes from

the current set with a node that has no edges to the current set to avoid

getting caught in an isolated disconnected subgraph.

This process is repeated

(i) until the original set converges to a complete subgraph, or

(ii) for a predetermined number of steps,

after which the tightest subgraph (the subgraph corresponding to the smallest

L0) is recorded.

The recorded clusters are merged and redundant clusters are removed.




Optimal temperature in MC simulation

For every cluster size there is an

optimal temperature that gives the

fastest convergence to the tightest

subgraph.


Time to find a clique with size 7 in MC steps

per site as a function of temperature T.

The region with optimal temperature is

shown in Inset.

The required time increases sharply as the

temperature goes to 0, but has a relatively

wide plateau in the region 3 < T < 7.

Simulations suggest that the choice of temperature T ≈ M would be safe for any cluster size M.



Comparison of SPC and Monte Carlo methods

Comparison of clusters found with SPC (blue) and MC simulation (red).

Reasonable overlap (ca. one third of all clusters are found by both methods), but the two methods appear to be complementary.



Comparison of SPC and Monte Carlo methods

The SPC method is best at detecting high-Q value clusters with relatively few links with the outside world. An example is the TRAPP complex, a fully connected clique of size 10 with just 7 links with outside proteins.

This cluster was perfectly detected by SPC, whereas the MC simulation was able to find smaller pieces of this cluster separately rather than the whole cluster.

By contrast, MC simulations are better suited for finding very "outgoing" cliques.

The Lsm complex, a clique of size 11, includes 3 proteins with more interactions outside the complex than inside. This complex was easily found by MC, but was not detected as a stand-alone cluster by SPC.



Merging Overlapping Clusters

A simple statistical test shows that nodes which have only one link to a cluster are statistically insignificant. Remove such statistically insignificant members first.

Then merge overlapping clusters:

For every cluster $A_i$, find all clusters $A_k$ that overlap with this cluster by at least one protein.

For every such cluster, calculate the Q value of the possible merged cluster $A_i \cup A_k$. Record the cluster $A_{best(i)}$ which gives the highest Q value if merged with $A_i$.

After the best match is found for every cluster, every cluster $A_i$ is replaced by the merged cluster $A_i \cup A_{best(i)}$ unless the Q value of $A_i \cup A_{best(i)}$ is below a certain threshold value $Q_C$.

This process continues until there are no more overlapping clusters or until merging any of the remaining clusters would produce a cluster with a Q value lower than $Q_C$.
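A simplified variant of this merging loop is sketched below. As an assumption made for brevity, it merges only the single best overlapping pair per round instead of replacing every cluster by its best merge at once; the stopping criterion (no overlapping pair whose union reaches the threshold) is the same. Function names and the threshold parameter `q_c` are illustrative.

```python
from itertools import combinations

def cluster_density(cluster, edges):
    members = set(cluster)
    n = len(members)
    if n < 2:
        return 0.0
    inside = {frozenset((u, v)) for u, v in edges
              if u != v and u in members and v in members}
    return 2 * len(inside) / (n * (n - 1))

def merge_overlapping(clusters, edges, q_c):
    """Greedily merge overlapping clusters while the union keeps Q >= q_c."""
    clusters = {frozenset(c) for c in clusters}
    while True:
        best = None
        for a, b in combinations(clusters, 2):
            if a & b:                               # share at least one protein
                q = cluster_density(a | b, edges)
                if q >= q_c and (best is None or q > best[0]):
                    best = (q, a, b)
        if best is None:
            return clusters
        _, a, b = best
        clusters = (clusters - {a, b}) | {a | b}
```

Every round reduces the number of clusters by at least one, so the loop terminates.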




Statistical significance of complexes and modules

Number of complete cliques (Q = 1) as

a function of clique size enumerated in

the network of protein interactions

(red) and in randomly rewired graphs (blue, averaged over more than 1,000 graphs in which the number of interactions of each protein is preserved).

Inset shows the same plot in log-

normal scale. Note the dramatic

enrichment in the number of cliques in

the protein-interaction graph

compared with the random graphs.

Most of these cliques are parts of

bigger complexes and modules.
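A randomly rewired graph that preserves the number of interactions of each protein can be generated by repeated edge swaps. The sketch below illustrates this standard technique; it is not taken from the cited study, and it assumes an input edge list without self loops or duplicate edges.

```python
import random

def rewire_preserving_degree(edges, n_swaps, seed=0):
    """Randomize a graph by edge swaps a-b, c-d -> a-d, c-b."""
    rng = random.Random(seed)
    edge_list = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edge_list}
    done = attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:            # pick one of the two possible swaps
            c, d = d, c
        if len({a, b, c, d}) < 4:         # would create a self loop or change nothing
            continue
        if frozenset((a, d)) in present or frozenset((c, b)) in present:
            continue                      # would create a duplicate edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {frozenset((a, d)), frozenset((c, b))}
        edge_list[i], edge_list[j] = (a, d), (c, b)
        done += 1
    return edge_list
```

Counting cliques in many such rewired graphs gives the null distribution against which the real network is compared.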




Statistical significance of complexes and modules


Distribution of Q of clusters found by the MC search

method.

Red bars: original network of protein interactions.

Blue curves: randomly rewired graphs.

Clusters in the protein network have many more

interactions than their counterparts in the random

graphs.



Architecture of protein network

Fragment of the protein network. Nodes

and interactions in discovered clusters

are shown in bold. Nodes are colored by

functional categories in MIPS:

red, transcription regulation;

blue, cell-cycle/cell-fate control;

green, RNA processing; and

yellow, protein transport.

Complexes shown are the SAGA/TFIID

complex (red), the anaphase-promoting

complex (blue), and the TRAPP complex

(yellow).




Discovered functional modules


Examples of discovered functional modules.

(A) A module involved in cell-cycle regulation. This module consists of cyclins (CLB1-4 and

CLN2) and cyclin-dependent kinases (CKS1 and CDC28) and a nuclear import protein (NIP29).

Although they have many interactions, these proteins are not present in the cell at the same

time.

(B) Pheromone signal transduction pathway in the network of protein–protein interactions. This

module includes several MAPK (mitogen-activated protein kinase) and MAPKK (mitogen-

activated protein kinase kinase) kinases, as well as other proteins involved in signal

transduction. These proteins do not form a single complex; rather, they interact in a specific

order.



Architecture of protein network

Comparison of discovered complexes and

modules with complexes derived

experimentally (BIND and Cellzome) and

complexes catalogued in MIPS.

Discovered complexes are sorted by the

overlap with the best-matching experimental

complex. The overlap is defined as the

number of common proteins divided by the

number of proteins in the best-matching

experimental complex.

The first 31 complexes match exactly, and

another 11 have overlap above 65%.
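The overlap measure defined above is straightforward to compute. In this sketch the best-matching experimental complex is taken to be the one with the highest overlap, which is an assumption; the function names are illustrative.

```python
def overlap(discovered, experimental):
    """Common proteins divided by the size of the experimental complex."""
    discovered, experimental = set(discovered), set(experimental)
    return len(discovered & experimental) / len(experimental)

def best_match(discovered, catalogue):
    """Experimental complex with the highest overlap (assumed definition)."""
    return max(catalogue, key=lambda complex_: overlap(discovered, complex_))
```

Note that the measure is normalized by the experimental complex only, so a discovered cluster that contains an entire experimental complex plus additional proteins still scores 1.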

Inset shows the overlap as a function of the

size of the discovered complex. Note that

discovered complexes of all sizes match very

well with known experimental complexes.

Discovered complexes that do not match with

experimental ones constitute our predictions.




Robustness of clusters found

Model effect of false positives in

experimental data: randomly reconnect,

remove or add 10-50% of interactions

in network.

Cluster recovery probability as a

function of the fraction of altered links.

Black curves correspond to the case

when a fraction of links are rewired.

Red, removed;

green, added.

Circles represent the probability to

recover 75% of the original cluster;

triangles represent the probability to

recover 50%.


Noise in the form of removal or addition of links has a less detrimental effect than random rewiring. About 75% of clusters can still be found when 10% of links are rewired.



Summary

Here: analysis of meso-scale properties demonstrated the presence of highly

connected clusters of proteins in a network of protein interactions. Strong support

for suggested modular architecture of biological networks.

Distinguish 2 types of clusters: protein complexes and dynamic functional modules.

Both complexes and modules have more interactions among their members than

with the rest of the network.

Dynamic modules are elusive to experimental purification because they are not

assembled as a complex at any single point in time.

Computational analysis allows detection of such modules by integrating pairwise

molecular interactions that occur at different times and places.

However, computational analysis alone cannot distinguish between

complexes and modules or between transient and simultaneous interactions.



Summary

Most of the discovered complexes and modules come from traditional studies,

rather than from large-scale experiments.

This suggests that although large-scale proteomic studies provide a wealth of protein interaction data, the scarcity of the data (and its contamination with false positives) makes such studies less valuable for the identification of functional modules.
