9. Lecture WS 2004/05
Bioinformatics III 1
Graph Layout in Cellular Networks
www.cytoscape.org
Task: visualize cellular interaction data
e.g.
- protein interaction data (undirected): nodes = proteins, edges = interactions
- metabolic pathways (directed): nodes = substances, edges = reactions
- regulatory networks (directed): nodes = transcription factors + regulated proteins, edges = regulatory interactions
- co-localization (undirected): nodes = proteins, edges = co-localization information
- homology (undirected/directed): nodes = proteins, edges = sequence similarity (BLAST score)
Visualisation: intuitive approach to understand graphs
http://www.it.usyd.edu.au/~aquigley/3dfade/
Graph-like structures are pervasive:
- route maps of airline companies
- infrastructure of computer networks
- the relationships between people who work in the same company etc.
- cellular interactions ...
One way to understand the information coded in these graphs is to draw
graphical representations of them. Since drawing by hand is tedious and
error-prone, it is natural to expect computers to draw graphs automatically,
assigning spatial coordinates to nodes and connecting them with edges.
Graphs, such as the flight route maps, are not hard to draw since the
precise locations of the nodes (cities) are already given.
For other graphs, such information is not available and computers need to
determine where to plot the nodes and how to draw the edges that connect
the nodes.
Force-directed algorithm for graph layout
http://www.hpc.unm.edu/~sunls/research/treelayout/node1.html
Various graph layout algorithms have been
developed to solve this visualisation task.
20 years ago, Peter Eades proposed a graph
layout heuristic [A heuristic for graph
drawing. Congressus Numerantium, 42:149-160,
1984], which is called the "Spring
Embedder" algorithm.
Edges are replaced by springs and vertices
are replaced by rings that connect the
springs. A layout can be found by simulating
the dynamics of such a physical system.
This method and other methods that
involve similar simulations to compute the
layout are called "force-directed"
algorithms.
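The spring analogy can be sketched in a few lines. This is a minimal illustration and not Eades' original formulation: it assumes Hooke-type springs with a rest length along edges, an inverse-square repulsion between all node pairs, and a fixed step size. The function name `spring_layout` and all parameter names and default values are hypothetical.

```python
import math
import random

def spring_layout(nodes, edges, iterations=500, rest_length=1.0,
                  k_spring=1.0, k_repel=0.1, step=0.05, seed=0):
    """Toy 2D spring embedder; all constants are illustrative assumptions."""
    rng = random.Random(seed)
    pos = {v: [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for v in nodes}
    for _ in range(iterations):
        force = {v: [0.0, 0.0] for v in nodes}
        # Repulsion between every pair of nodes (inverse-square, assumed).
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                dx = pos[u][0] - pos[v][0]
                dy = pos[u][1] - pos[v][1]
                dist = math.hypot(dx, dy) or 1e-9
                f = k_repel / dist ** 2
                force[u][0] += f * dx / dist
                force[u][1] += f * dy / dist
                force[v][0] -= f * dx / dist
                force[v][1] -= f * dy / dist
        # Hooke-type spring along every edge, relaxing towards rest_length.
        for u, v in edges:
            dx = pos[v][0] - pos[u][0]
            dy = pos[v][1] - pos[u][1]
            dist = math.hypot(dx, dy) or 1e-9
            f = k_spring * (dist - rest_length)
            force[u][0] += f * dx / dist
            force[u][1] += f * dy / dist
            force[v][0] -= f * dx / dist
            force[v][1] -= f * dy / dist
        # Move every node a small step along its net force.
        for v in nodes:
            pos[v][0] += step * force[v][0]
            pos[v][1] += step * force[v][1]
    return pos
```

For example, `spring_layout(["a", "b", "c"], [("a", "b"), ("b", "c")])` returns a dictionary of 2D coordinates; `nodes` is expected to be a list. The pairwise repulsion loop is the part that dominates the running time for large graphs.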
Force-directed algorithm
http://www.it.usyd.edu.au/~aquigley/3dfade/
The edges can be modeled as gravitational (or electrostatic) attraction and
all nodes have an electrical repulsion between them.
It is also possible for the system to simulate unnatural forces acting on the
bodies, which have no direct physical analogy, for example the use of a
logarithmic distance measure rather than Euclidean.
Force-directed algorithm
http://www.hpc.unm.edu/~sunls/research/treelayout/node1.html
Because of the underlying analogy to a physical system, force-directed graph
layout methods tend to meet various aesthetic standards, such as
- efficient space filling,
- uniform edge length (when equal weights and repulsions are used),
- symmetry, and
- the capability of rendering the layout process with smooth animation (visual
continuity).
Having these nice features, the force-directed graph layout has become
the "workhorse" of layout algorithms.
It has been successfully adapted to many domains with variations of
implementation.
Scaling
http://www.hpc.unm.edu/~sunls/research/treelayout/node1.html
Force-directed layout methods commonly have computational scaling problems.
When there are more than a few thousand vertices in the graph, the running time
of the layout computation can become unacceptable.
This is because in each step of the simulation the repulsive force
between each pair of unconnected vertices needs to be computed, costing a
running time of $O(0.5\,V^2 - E)$.
Here V is the number of vertices and E is the number of edges in the graph.
This complexity is hard to escape for general graphs without hierarchical structure.
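The quadratic term can be made explicit by counting vertex pairs. A short derivation, assuming a simple graph (no self loops, no duplicate edges):

```latex
\[
\underbrace{\binom{V}{2}}_{\text{all vertex pairs}}
- \underbrace{E}_{\text{connected pairs}}
= \frac{V(V-1)}{2} - E
\approx 0.5\,V^2 - E \quad \text{for large } V .
\]
```

With hypothetical values V = 5000 and E = 10000, this gives about 1.25 × 10^7 repulsive force evaluations per simulation step.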
Protein interaction graphs
Ju et al. Bioinformatics 19, 317 (2003)
Most protein interaction data have the following characteristics:
(1) When visualized as a graph, the data yields a disconnected graph with many
connected components
(2) The data yields a nonplanar graph with a large number of edge crossings that
cannot be removed in a 2D drawing
(3) The number of interactions per protein varies widely within the same data set (degree distribution p(k))
(4) The data often contain protein interactions corresponding to self loops
These characteristics demand a robust layout algorithm.
InterViewer: Example of force-directed layout algorithm
Ju et al. Bioinformatics 19, 317 (2003)
InterViewer does not place the initial nodes
randomly, but on the surface of a
sphere. It runs a fixed number of iterations.
The original algorithm has complexity
$O(N^2)$ per time step, with N the number of nodes.
When using multipole methods, this
can be reduced to $O(N \log N)$.
Time may also be saved by introducing
a cut-off, e.g. only computing
interactions with nodes in neighboring
cells, and updating the neighbor list infrequently.
Application for protein interaction graphs
Ju et al. Bioinformatics 19, 317 (2003)
Visualisation of the
MIPS interaction data.
In 3D, this graph
contains no edge crossings.
Aim: analyze and visualize homologies within the protein universe :-)
50 genomes → 145,579 proteins → 21 × 10^9 BLASTP pairwise sequence
comparisons.
Expect that fusion proteins ("Rosetta Stone proteins") will link proteins of
related function.
Need to visualize an extremely large network! Develop a stepwise scheme.
LGL
Adai et al. J. Mol. Biol. 340, 179 (2004)
(1) separate original network into connected sets
(2) generate coordinates for each node in each connected set
(using a force-directed layout algorithm and a recipe for the sequential layout of
nodes guided by a minimum spanning tree of the network).
(3) integrate connected sets into one coordinate system via a funnel process:
the connected sets are sorted in descending size by the number of vertices.
The first connected set is placed at the bottom of a potential funnel and other
sets are placed one at a time on the rim of the potential funnel and allowed to
fall towards the bottom where they are frozen in space upon collision with the
previous sets.
We concentrate on step (2) in the following.
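Step (1) and the size ordering used in step (3) can be sketched as follows. This is an illustrative depth-first search, not the LGL implementation; the function name `connected_sets` and the edge-list input format are assumptions.

```python
from collections import defaultdict

def connected_sets(nodes, edges):
    """Return the connected components of an undirected graph, largest first."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        # Depth-first search collects everything reachable from `start`.
        seen.add(start)
        stack = [start]
        component = []
        while stack:
            u = stack.pop()
            component.append(u)
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        components.append(component)
    # Descending size, as required before the sets enter the funnel.
    components.sort(key=len, reverse=True)
    return components
```

Each returned component would then be laid out on its own in step (2) before being dropped into the funnel.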
Minimum Spanning Tree
Given: undirected graph G = (V,E),
where for each edge $(u,v) \in E$
there exists a weight w(u,v) specifying
the cost to connect u and v.
Find an acyclic subset $T \subseteq E$ that
connects all of the nodes and
whose total weight
$$w(T) = \sum_{(u,v) \in T} w(u,v)$$
is minimized.
Popular algorithms by Kruskal and Prim.
Both are greedy algorithms making the
best choice at the moment.
In general, a greedy strategy carries no guarantee of finding the best global
solution; for the minimum spanning tree problem, however, both algorithms
provably return an optimal tree.
[Cormen]
Kruskal’s algorithm
Consider edges in sorted order by weight.
The arrow points to the edge under consideration at each step.
[Cormen]
Kruskal’s algorithm (II)
Running time O(E log V)
[Cormen]
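A compact sketch of Kruskal's algorithm, assuming nodes are numbered 0 .. num_nodes-1 and edges are given as (weight, u, v) tuples. The union-find helper with path halving is one common way to detect cycles; the names are chosen for this example only. Sorting the edges dominates the cost, which gives the O(E log V) running time.

```python
def kruskal(num_nodes, weighted_edges):
    """Return the edges (u, v, w) of a minimum spanning forest."""
    parent = list(range(num_nodes))

    def find(x):
        # Follow parent pointers to the set representative (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    # Consider edges in sorted order by weight.
    for w, u, v in sorted(weighted_edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            # The edge joins two different trees, so it cannot close a cycle.
            parent[root_u] = root_v
            tree.append((u, v, w))
    return tree
```

On a connected graph the result has exactly num_nodes - 1 edges.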
Intuitive description of LGL
Adai et al. J. Mol. Biol. 340, 179 (2004)
Successive iterations of the layout. The MST determines the order of placement of
the nodes. The root node could be chosen randomly or based on its centrality in the
network (e.g. minimizing the sum of distances to all other nodes). All other nodes
are assigned a level according to their edge-based distance in the MST from the
root node.
Level one vertices (red circles) are placed randomly on a sphere around the root
node (black circle). The system is allowed to iterate through time satisfying attractive
and repulsive forces until at rest.
Level two nodes (blue circles) are placed randomly on spheres directed away from
the current layout. Again, the system is allowed to evolve through time till at rest.
This process is iterated for the entire graph.
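The level assignment that drives the placement order can be sketched as a breadth-first search over the MST. This only illustrates the ordering step, not the force simulation; `placement_levels` is a hypothetical name, and the tree is assumed to be given as a list of undirected edges.

```python
from collections import defaultdict, deque

def placement_levels(tree_edges, root):
    """Map each node to its edge-based distance from the root in the tree."""
    adj = defaultdict(list)
    for u, v in tree_edges:
        adj[u].append(v)
        adj[v].append(u)
    level = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in level:
                # First visit in a BFS gives the shortest edge distance.
                level[w] = level[u] + 1
                queue.append(w)
    return level
```

Nodes would then be added to the layout level by level (all level-one nodes, then all level-two nodes, and so on), letting the system relax after each batch.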
What is the role of fusion proteins?
Adai et al. J. Mol. Biol. 340, 179 (2004)
A protein homology map summarizes the results of billions of sequence comparisons by modeling
the proteins as vertices in a network, and the statistically significant sequence similarities as edges
connecting the relevant proteins. In this manner, proteins within a sequence family (such as A, A′, A″,
and AB; or B, B′ and AB) are all or mostly connected to each other, forming a cluster in the map.
Fusion proteins (such as AB) serve to connect their component proteins' families. The structure of
the resulting map reflects historic genetic events, such as gene fusions, fissions, and duplications,
which are responsible for producing the modern-day genes. The map simultaneously represents
homology relationships (edges), remote homologies (proteins not directly connected but in the same
cluster), and non-homologous functional relationships (adjacent clusters and clusters linked by
fusion proteins).
LGL Algorithm for very large biological networks
Adai et al. J. Mol. Biol. 340, 179 (2004)
The complete protein homology map. A layout of the entire protein homology
map; a total of 11,516 connected sets containing 111,604 proteins (vertices)
with 1,912,684 edges. The largest connected set is shown more clearly in the
inset and is enlarged further on the right side.
Map of gene function
Adai et al. J. Mol. Biol. 340, 179 (2004)
The map emerges from ~21 billion gene sequence
comparisons. Proteins are drawn as points, with
lines connecting proteins with similar sequences,
and are arranged so that homologous proteins
are adjacent in the Figure.
The size of each cluster is proportional to the
number of proteins in that sequence family.
Fusion proteins force their component proteins'
respective families to be close together in the
Figure, and thereby serve to organize the
proteins in the map according to their functions.
The resulting broad trends of protein function are
labeled, as are several of the most extensive
sequence families. A–C indicate specific regions
that are magnified later.
Only the largest connected network
component is drawn, containing 30,727
proteins (vertices) and 1,206,654
significant sequence similarities (edges),
and representing ~4 billion sequence
comparisons.
Functionally related gene families form adjacent clusters
Adai et al. J. Mol. Biol. 340, 179 (2004)
Three examples illustrate spatial
localization of protein function in the map,
specifically
A, the linkage of the tryptophan synthase α
family to the functionally coupled but non-homologous
β family by the yeast
tryptophan synthase αβ fusion protein,
B, protein subunits of the pyruvate
synthase and alpha-ketoglutarate
ferredoxin oxidoreductase complexes,
C, metabolic enzymes, particularly those of
acetyl CoA and amino acid metabolism.
Colocalization
Adai et al. J. Mol. Biol. 340, 179 (2004)
Neighboring proteins tend to be in the
same cellular system. The tendency
for proteins to operate in the same
cellular system, as defined by the
percentage of matching assignments
into the 18 COG database pathways,
is plotted against the spatial
separation in multiples of a typical
cluster size.
The functional similarity decays
exponentially with distance,
proportional to the function $e^{-0.26 d}$,
where d is the separation measured in typical cluster diameters.
Comparison with other layout maps
Adai et al. J. Mol. Biol. 340, 179 (2004)
A comparison of LGL with map layouts
produced by other algorithms. The layout of
the protein homology map by LGL (A) is
contrasted with the layout of the same
network by the spring-force algorithm only,
lacking the minimal spanning tree
calculation and iterative layout procedure
(B), and with the layout by the approach of
InterViewer (C). InterViewer
collapses equivalent nodes into single
nodes, thereby simplifying the graph, and is
one of the few available graph layout
programs that scales to such large
networks. The layout from LGL reveals
more of the internal graph structure than
the other approaches tested.
Modularity in molecular networks?
A functional module is, by definition, a discrete entity whose function is
separable from those of other modules.
This separation depends on chemical isolation, which can originate from
spatial localization or from chemical specificity.
E.g. a ribosome concentrates the reactions involved in making a polypeptide
into a single particle, thus spatially isolating its function.
A signal transduction system is an extended module that achieves its isolation
through the specificity of the initial binding of the chemical signal to receptor
proteins, and of the interactions between signalling proteins within the cell.
Hartwell et al. Nature 402, C47 (1999)
Modularity in molecular networks
Modules can be insulated from or connected to each other.
Insulation allows the cell to carry out many diverse reactions without cross-talk
that would harm the cell.
Connectivity allows one function to influence another.
The higher-level properties of cells, such as their ability to integrate information
from multiple sources, will be described by the pattern of connections among their
functional modules.
Hartwell et al. Nature 402, C47 (1999)
Organization of large-scale molecular networks
Organization of molecular networks revealed by large-scale experiments:
- power-law or similar distribution of the node degree k (i.e. the number of edges of a node), e.g. $P(k) \propto k^{-\gamma}$
- small-world property (i.e. a high clustering coefficient and a small shortest path
between every pair of nodes)
- anticorrelation in the node degree of connected nodes (i.e. highly interacting
nodes tend to be connected to low-interacting ones)
These properties become evident when hundreds or thousands of molecules and
their interactions are studied together.
On the other end of the spectrum: recently discovered motifs that consist of 3-4
nodes.
Mesoscale properties of networks
Most relevant processes in biological networks correspond to the mesoscale
(5-25 genes or proteins) not to the entire network.
However, it is computationally enormously expensive to study mesoscale
properties of biological networks.
e.g. a network of 1000 nodes contains approximately 10^23 possible 10-node sets.
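The size of this search space is the binomial coefficient "1000 choose 10". A quick check using only the Python standard library (the variable name is arbitrary):

```python
import math

# Number of distinct 10-node subsets of a 1000-node network.
n_sets = math.comb(1000, 10)
print(f"{n_sets:.2e}")
```

The exact count lies between 10^23 and 10^24, which is why exhaustive enumeration of mesoscale subsets is out of reach.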
Spirin & Mirny analyzed combined network of protein interactions with data from
CELLZOME, MIPS, BIND: 6500 interactions.
Identify connected subgraphs
The network of protein interactions is typically presented as an undirected graph
with proteins as nodes and protein interactions as undirected edges.
Aim: identify highly connected subgraphs (clusters) that have more interactions
within themselves and fewer with the rest of the graph.
A fully connected subgraph, or clique, that is not a part of any other clique is an
example of such a cluster.
In general, clusters need not be fully connected.
Measure density of connections by
$$Q = \frac{2m}{n(n-1)}$$
where n is the number of proteins in the cluster
and m is the number of interactions between them.
Spirin, Mirny, PNAS 100, 12123 (2003)
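The density Q = 2m / (n(n-1)) is the fraction of possible pairs within the cluster that actually interact, so Q = 1 for a clique. A minimal sketch, assuming each undirected interaction appears once in the edge list and self loops are ignored; the function name `cluster_density` is hypothetical.

```python
def cluster_density(cluster, edges):
    """Q = 2m / (n(n-1)) for the subgraph induced by `cluster`."""
    members = set(cluster)
    n = len(members)
    if n < 2:
        raise ValueError("density is undefined for fewer than 2 nodes")
    # m counts interactions with both endpoints inside the cluster.
    m = sum(1 for u, v in edges
            if u != v and u in members and v in members)
    return 2 * m / (n * (n - 1))
```

For a triangle this returns 1.0; for a three-node path it returns 2/3.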
(method I) Identify all fully connected subgraphs (cliques)
Generally, finding all cliques of a graph is an NP-hard problem.
Because the protein interaction graph is so far very sparse (the number of interactions
(edges) is similar to the number of proteins (nodes)), this can be done quickly.
To find cliques of size n one needs to enumerate only the cliques of size n-1.
The search for cliques starts with n = 4: pick all (known) pairs of edges (6500 × 6500
protein interactions) successively.
For every pair A-B and C-D check whether there are edges between A and C, A and
D, B and C, and B and D. If these edges are present, ABCD is a clique.
For every clique identified, ABCD, pick all known proteins successively.
For every picked protein E, if all of the interactions E-A, E-B, E-C, and E-D are known,
then ABCDE is a clique with size 5.
Continue for n = 6, 7, ... The largest clique found in the protein-interaction network
has size 14.
Spirin, Mirny, PNAS 100, 12123 (2003)
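The idea of growing cliques of size n from cliques of size n-1 can be sketched as follows. Unlike the procedure described above, this sketch starts from single edges (cliques of size 2) rather than from pairs of edges at n = 4, and it only tests candidates adjacent to all current members; both are simplifications made for illustration. The function name `grow_cliques` is hypothetical.

```python
from collections import defaultdict

def grow_cliques(edges, max_size):
    """Return {size: set of cliques (frozensets)} for sizes 2 .. max_size."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self loops
            adj[u].add(v)
            adj[v].add(u)
    cliques = {2: {frozenset((u, v)) for u, v in edges if u != v}}
    for n in range(3, max_size + 1):
        bigger = set()
        for clique in cliques[n - 1]:
            # A node extends the clique only if it neighbours every member.
            common = set.intersection(*(adj[p] for p in clique))
            for extra in common:
                bigger.add(clique | {extra})
        if not bigger:
            break
        cliques[n] = bigger
    return cliques
```

Using frozensets means that the same clique reached through different extension orders is stored only once.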
(I) Identify all fully connected subgraphs (cliques)
These results include, however, many redundant cliques.
For example, the clique with size 14 contains 14 cliques with size 13.
To find all nonredundant subgraphs, mark all proteins comprising the clique of size
14, and out of all subgraphs of size 13 pick those that have at least one protein
other than marked.
After all redundant cliques of size 13 are removed, proceed to remove redundant
twelves etc.
In total, only 41 nonredundant cliques with sizes 4 - 14 were found.
Spirin, Mirny, PNAS 100, 12123 (2003)
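One simple way to discard redundant cliques is a subset test from largest to smallest. Note that this swaps in a plain subset check for the marking scheme described above, so it is a related illustration and not the published procedure; `nonredundant_cliques` is a hypothetical name.

```python
def nonredundant_cliques(cliques):
    """Keep only cliques that are not contained in a larger kept clique."""
    kept = []
    # Process the largest cliques first so that subsets are seen later.
    for clique in sorted((frozenset(c) for c in cliques),
                         key=len, reverse=True):
        if not any(clique <= other for other in kept):
            kept.append(clique)
    return kept
```

The `<=` test also removes exact duplicates, since a set is a subset of itself.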
(method II) Superparamagnetic Clustering (SPC)
SPC uses an analogy to the physical properties of an inhomogeneous ferromagnetic
model to find tightly connected clusters on a large graph.
Every node on the graph is assigned a Potts spin variable Si = 1, 2, ..., q.
The value of this spin variable Si performs thermal fluctuations, which are
determined by the temperature T and the spin values on the neighboring nodes.
Energetically, 2 nodes connected by an edge are favored to have the same spin
value. Therefore, the spin at each node tends to align itself with the majority of its
neighbors.
When such a Potts spin system reaches equilibrium for a given temperature T,
high correlation between fluctuating Si and Sj at nodes i and j would indicate that
nodes i and j belong to the same cluster.
Spirin, Mirny, PNAS 100, 12123 (2003)
(II) Superparamagnetic Clustering (SPC)
The protein-interaction network is represented by a graph where every pair of
interacting proteins is an edge of length 1.
The simulations are run for temperatures ranging from 0 to 1 in units of the
coupling strength.
The network splits into monomers at temperatures between 0.7 and 0.8,
whereas larger clusters only exist for temperatures between 0.1 and 0.7.
Clusters are recorded at all temperature values.
The overlapping clusters are then merged and redundant ones are removed.
Spirin, Mirny, PNAS 100, 12123 (2003)
(method III) Monte Carlo Simulation
Use MC to find a tight subgraph of a predetermined number of nodes M.
At time t = 0, a random set of M nodes is selected.
For each pair of nodes i,j from this set, the shortest path Lij between i and j on the
graph is calculated.
Denote the sum of all shortest paths Lij from this set as L0.
At every time step one of M nodes is picked at random, and one node is picked at
random out of all its neighbors.
The new sum of all shortest paths, L1, is calculated if the original node were to be
replaced by this neighbor.
If L1 < L0, accept replacement with probability 1.
If L1 > L0, accept replacement with probability
$$\exp\left(-\frac{L_1 - L_0}{T}\right)$$
where T is the effective temperature.
Spirin, Mirny, PNAS 100, 12123 (2003)
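The Monte Carlo search can be sketched as a Metropolis loop over node sets. This is a simplified illustration under several assumptions that are not from the source: the graph is a dict mapping each node to a set of neighbours, node labels are sortable, the replacement is drawn only from neighbours not already in the set (so the set size stays M), unreachable pairs are penalized with a distance equal to the number of nodes, and the periodic escape move is omitted. All function names are hypothetical.

```python
import math
import random
from collections import deque

def bfs_dist(adj, src):
    """Shortest path lengths (in edges) from src to all reachable nodes."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def path_sum(adj, nodes):
    """Sum of shortest path lengths over all pairs in `nodes`."""
    nodes = list(nodes)
    total = 0
    for i, u in enumerate(nodes):
        dist = bfs_dist(adj, u)
        for v in nodes[i + 1:]:
            total += dist.get(v, len(adj))  # penalty if unreachable (assumed)
    return total

def mc_tight_subgraph(adj, M, T, steps, seed=0):
    """Metropolis search for an M-node set with a small path sum."""
    rng = random.Random(seed)
    current = set(rng.sample(sorted(adj), M))
    L0 = path_sum(adj, current)
    best, best_L = set(current), L0
    for _ in range(steps):
        old = rng.choice(sorted(current))
        candidates = sorted(w for w in adj[old] if w not in current)
        if not candidates:
            continue
        new = rng.choice(candidates)
        trial = (current - {old}) | {new}
        L1 = path_sum(adj, trial)
        # Accept improvements always, otherwise with prob. exp(-(L1-L0)/T).
        if L1 <= L0 or rng.random() < math.exp(-(L1 - L0) / T):
            current, L0 = trial, L1
            if L0 < best_L:
                best, best_L = set(current), L0
    return best, best_L
```

A complete subgraph on M nodes has path sum M(M-1)/2, which is the smallest possible value and can serve as a stopping criterion.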
(III) Monte Carlo Simulation
Every tenth time step an attempt is made to replace one of the nodes from
the current set with a node that has no edges to the current set to avoid
getting caught in an isolated disconnected subgraph.
This process is repeated
(i) until the original set converges to a complete subgraph, or
(ii) for a predetermined number of steps,
after which the tightest subgraph (the subgraph corresponding to the smallest
L0) is recorded.
The recorded clusters are merged and redundant clusters are removed.
Spirin, Mirny, PNAS 100, 12123 (2003)
Optimal temperature in MC simulation
For every cluster size there is an
optimal temperature that gives the
fastest convergence to the tightest
subgraph.
Spirin, Mirny, PNAS 100, 12123 (2003)
Time to find a clique with size 7 in MC steps
per site as a function of temperature T.
The region with optimal temperature is
shown in Inset.
The required time increases sharply as the
temperature goes to 0, but has a relatively
wide plateau in the region 3 < T < 7.
Simulations suggest that the choice of
temperature T ≈ M would be safe for any
cluster size M.
Comparison of SPC and Monte Carlo methods
Comparison of clusters found with
SPC (blue) and MC simulation
(red).
Reasonable overlap (ca. one third
of all clusters are found by both
methods), but both methods
seem complementary.
Spirin, Mirny, PNAS 100, 12123 (2003)
Comparison of SPC and Monte Carlo methods
The SPC method is best at detecting high-Q value clusters with relatively few links
to the outside world. An example is the TRAPP complex, a fully connected clique
of size 10 with just 7 links to outside proteins.
This cluster was perfectly detected by SPC, whereas the MC simulation was able to
find smaller pieces of this cluster separately rather than the whole cluster.
By contrast, MC simulations are better suited for finding very "outgoing" cliques.
The Lsm complex, a clique of size 11, includes 3 proteins with more interactions
outside the complex than inside. This complex was easily found by MC, but was not
detected as a stand-alone cluster by SPC.
Spirin, Mirny, PNAS 100, 12123 (2003)
Merging Overlapping Clusters
A simple statistical test shows that nodes which have only one link to a cluster are
statistically insignificant. Clean such statistically insignificant members first.
Then merge overlapping clusters:
For every cluster Ai find all clusters Ak that overlap with this cluster by at least one
protein.
For every such cluster calculate the Q value of a possible merged cluster
Ai ∪ Ak. Record the cluster Abest(i) which gives the highest Q value if merged with Ai.
After the best match is found for every cluster, every cluster Ai is replaced by the
merged cluster Ai ∪ Abest(i) unless the Q value of Ai ∪ Abest(i) is below a certain threshold value
QC.
This process continues until there are no more overlapping clusters or until merging
any of the remaining clusters will make a cluster with Q value lower than QC.
Spirin, Mirny, PNAS 100, 12123 (2003)
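A simplified variant of the merging loop is sketched below. Instead of replacing every cluster by its best merge simultaneously, it greedily merges the single overlapping pair with the highest merged Q in each round, as long as that Q stays at or above the threshold; this is an assumption made to keep the example short, and the cleaning of single-link members is omitted. The names `density`, `merge_overlapping` and `q_min` are hypothetical.

```python
def density(cluster, edges):
    """Q = 2m / (n(n-1)) for the subgraph induced by `cluster`."""
    n = len(cluster)
    if n < 2:
        return 0.0
    m = sum(1 for u, v in edges
            if u != v and u in cluster and v in cluster)
    return 2 * m / (n * (n - 1))

def merge_overlapping(clusters, edges, q_min):
    """Greedily merge overlapping clusters while the merged Q >= q_min."""
    clusters = [frozenset(c) for c in clusters]
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if clusters[i] & clusters[j]:  # share at least one protein
                    merged = clusters[i] | clusters[j]
                    q = density(merged, edges)
                    if q >= q_min and (best is None or q > best[0]):
                        best = (q, i, j, merged)
        if best is None:
            return clusters
        _, i, j, merged = best
        # Replace the two merged clusters by their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        if merged not in clusters:
            clusters.append(merged)
```

Each round removes at least one cluster, so the loop terminates after at most len(clusters) - 1 merges.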
Statistical significance of complexes and modules
Number of complete cliques (Q = 1) as
a function of clique size enumerated in
the network of protein interactions
(red) and in randomly rewired graphs
(blue, averaged over 1,000 graphs where the
number of interactions for each protein
is preserved).
Inset shows the same plot in log-normal
scale. Note the dramatic
enrichment in the number of cliques in
the protein-interaction graph
compared with the random graphs.
Most of these cliques are parts of
bigger complexes and modules.
Spirin, Mirny, PNAS 100, 12123 (2003)
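A common way to generate random graphs that preserve the number of interactions of each protein is repeated edge swapping: two edges A-B and C-D are replaced by A-D and C-B. Whether the cited study used exactly this procedure is an assumption; the sketch also assumes a simple input graph (no self loops, no duplicate edges), and the function names are hypothetical.

```python
import random
from collections import Counter

def degree_counts(edges):
    """Number of edges incident to each node."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

def rewire_preserving_degrees(edges, n_swaps, seed=0):
    """Randomize an undirected simple graph by degree-preserving swaps."""
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    done = attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        a, b = edges[i]
        c, d = edges[j]
        # Reject swaps that would create self loops or duplicate edges.
        if len({a, b, c, d}) < 4:
            continue
        if frozenset((a, d)) in present or frozenset((c, b)) in present:
            continue
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {frozenset((a, d)), frozenset((c, b))}
        edges[i], edges[j] = (a, d), (c, b)
        done += 1
    return edges
```

Counting cliques in many such rewired graphs gives the random baseline against which the enrichment in the real network is judged.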
Statistical significance of complexes and modules
Spirin, Mirny, PNAS 100, 12123 (2003)
Distribution of Q of clusters found by the MC search
method.
Red bars: original network of protein interactions.
Blue curves: randomly rewired graphs.
Clusters in the protein network have many more
interactions than their counterparts in the random
graphs.
Architecture of protein network
Fragment of the protein network. Nodes
and interactions in discovered clusters
are shown in bold. Nodes are colored by
functional categories in MIPS:
red, transcription regulation;
blue, cell-cycle/cell-fate control;
green, RNA processing; and
yellow, protein transport.
Complexes shown are the SAGA/TFIID
complex (red), the anaphase-promoting
complex (blue), and the TRAPP complex
(yellow).
Spirin, Mirny, PNAS 100, 12123 (2003)
Discovered functional modules
Spirin, Mirny, PNAS 100, 12123 (2003)
Examples of discovered functional modules.
(A) A module involved in cell-cycle regulation. This module consists of cyclins (CLB1-4 and
CLN2) and cyclin-dependent kinases (CKS1 and CDC28) and a nuclear import protein (NIP29).
Although they have many interactions, these proteins are not present in the cell at the same
time.
(B) Pheromone signal transduction pathway in the network of protein–protein interactions. This
module includes several MAPK (mitogen-activated protein kinase) and MAPKK (mitogen-
activated protein kinase kinase) kinases, as well as other proteins involved in signal
transduction. These proteins do not form a single complex; rather, they interact in a specific
order.
Architecture of protein network
Comparison of discovered complexes and
modules with complexes derived
experimentally (BIND and Cellzome) and
complexes catalogued in MIPS.
Discovered complexes are sorted by the
overlap with the best-matching experimental
complex. The overlap is defined as the
number of common proteins divided by the
number of proteins in the best-matching
experimental complex.
The first 31 complexes match exactly, and
another 11 have overlap above 65%.
Inset shows the overlap as a function of the
size of the discovered complex. Note that
discovered complexes of all sizes match very
well with known experimental complexes.
Discovered complexes that do not match with
experimental ones constitute our predictions.
Spirin, Mirny, PNAS 100, 12123 (2003)
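The overlap measure defined above (common proteins divided by the size of the best-matching experimental complex) can be written down directly. In this sketch the best-matching complex is taken to be the one that maximizes the overlap, which is an assumption; `best_overlap` is a hypothetical name.

```python
def best_overlap(discovered, experimental_complexes):
    """Largest fraction of any experimental complex recovered by `discovered`."""
    discovered = set(discovered)
    best = 0.0
    for reference in experimental_complexes:
        reference = set(reference)
        if reference:
            # Common proteins divided by the size of the experimental complex.
            best = max(best, len(discovered & reference) / len(reference))
    return best
```

A value of 1.0 means that every protein of some experimental complex is contained in the discovered cluster.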
Robustness of clusters found
Model effect of false positives in
experimental data: randomly reconnect,
remove or add 10-50% of interactions
in network.
Cluster recovery probability as a
function of the fraction of altered links.
Black curves correspond to the case
when a fraction of links are rewired.
Red, removed;
green, added.
Circles represent the probability to
recover 75% of the original cluster;
triangles represent the probability to
recover 50%.
Spirin, Mirny, PNAS 100, 12123 (2003)
Noise in the form of removal or addition of
links has a less deteriorating effect than
random rewiring. About 75% of clusters
can still be found when 10% of links are
rewired.
Summary
Here: analysis of meso-scale properties demonstrated the presence of highly
connected clusters of proteins in a network of protein interactions. Strong support
for suggested modular architecture of biological networks.
Distinguish 2 types of clusters: protein complexes and dynamic functional modules.
Both complexes and modules have more interactions among their members than
with the rest of the network.
Dynamic modules are elusive to experimental purification because they are not
assembled as a complex at any single point in time.
Computational analysis allows detection of such modules by integrating pairwise
molecular interactions that occur at different times and places.
However, computational analysis alone does not allow one to distinguish between
complexes and modules or between transient and simultaneous interactions.
Summary
Most of the discovered complexes and modules come from traditional studies,
rather than from large-scale experiments.
This suggests that although large-scale proteomic studies provide a wealth of
protein interaction data, the scarcity of the data (and its contamination with false
positives) makes such studies less valuable for identification of functional modules.