Title: Graph Layout in Cellular Networks
1Graph Layout in Cellular Networks
www.cytoscape.org
2Task: visualize cellular interaction data
e.g.
- protein interaction data (undirected): nodes = proteins, edges = interactions
- metabolic pathways (directed): nodes = substances, edges = reactions
- regulatory networks (directed): nodes = transcription factors and regulated proteins, edges = regulatory interactions
- co-localization (undirected): nodes = proteins, edges = co-localization information
- homology (undirected/directed): nodes = proteins, edges = sequence similarity (BLAST score)
3Visualisation: an intuitive approach to understanding graphs
Graph-like structures are pervasive:
- route maps of airline companies
- infrastructure of computer networks
- relationships between people who work in the same company
- cellular interactions
- ...
One way to understand the information coded in these graphs is to draw
graphical representations of them. Since drawing by hand is tedious and
error-prone, it is natural to expect computers to draw graphs automatically,
assigning spatial coordinates to nodes and connecting them with edges.
Some graphs, such as flight route maps, are not hard to draw since the
precise locations of the nodes (cities) are already given. For other graphs,
such information is not available, and the computer must determine where to
plot the nodes and how to draw the edges that connect them.
http//www.it.usyd.edu.au/aquigley/3dfade/
4Force-directed algorithm for graph layout
Various graph layout algorithms have been developed to solve this
visualisation task. In 1984, Peter Eades proposed a graph layout heuristic
(A heuristic for graph drawing. Congressus Numerantium 42:149-160, 1984),
which is called the "Spring Embedder" algorithm. Edges are replaced by
springs and vertices are replaced by rings that connect the springs. A
layout can be found by simulating the dynamics of such a physical system.
This method and other methods that involve similar simulations to compute
the layout are called "Force Directed" algorithms.
http//www.hpc.unm.edu/sunls/research/treelayout/
node1.html
5Force-directed algorithm
The edges can be modeled as gravitational (or
electrostatic) attraction and all nodes have an
electrical repulsion between them. It is also
possible for the system to simulate unnatural
forces acting on the bodies, which have no direct
physical analogy, for example the use of a
logarithmic distance measure rather than
Euclidean.
http//www.it.usyd.edu.au/aquigley/3dfade/
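The force model described above can be sketched in a few lines. This is a minimal illustration, not Eades' original heuristic: it assumes Hooke-type springs along edges and an inverse-square repulsion between all node pairs, and the function name and every constant (`k_spring`, `k_repel`, `rest_length`, `step_size`) are hypothetical choices made for the example.

```python
import math
import random


def spring_layout(nodes, edges, steps=200, k_spring=0.1, k_repel=0.5,
                  rest_length=1.0, step_size=0.1, seed=0):
    """Return {node: (x, y)} after a fixed number of force-directed steps.

    nodes is a list of hashable labels, edges a list of (u, v) pairs.
    All constants are illustrative assumptions, not values from the slides.
    """
    rng = random.Random(seed)
    pos = {v: [rng.uniform(-1, 1), rng.uniform(-1, 1)] for v in nodes}
    for _ in range(steps):
        force = {v: [0.0, 0.0] for v in nodes}
        # Repulsion between every pair of nodes (inverse-square law).
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                dx = pos[u][0] - pos[v][0]
                dy = pos[u][1] - pos[v][1]
                dist = math.hypot(dx, dy) or 1e-9
                f = k_repel / dist ** 2
                force[u][0] += f * dx / dist
                force[u][1] += f * dy / dist
                force[v][0] -= f * dx / dist
                force[v][1] -= f * dy / dist
        # Spring force along every edge, pulling towards the rest length.
        for u, v in edges:
            dx = pos[v][0] - pos[u][0]
            dy = pos[v][1] - pos[u][1]
            dist = math.hypot(dx, dy) or 1e-9
            f = k_spring * (dist - rest_length)
            force[u][0] += f * dx / dist
            force[u][1] += f * dy / dist
            force[v][0] -= f * dx / dist
            force[v][1] -= f * dy / dist
        # Move every node a small step along its net force.
        for v in nodes:
            pos[v][0] += step_size * force[v][0]
            pos[v][1] += step_size * force[v][1]
    return {v: tuple(p) for v, p in pos.items()}
```

The double loop over node pairs is what makes each step quadratic in the number of nodes.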
6Force-directed algorithm
Because of the underlying analogy to a physical system, force-directed graph
layout methods tend to meet various aesthetic standards, such as
- efficient space filling,
- uniform edge length (when equal weights and repulsions are used),
- symmetry, and
- the capability of rendering the layout process with smooth animation
  (visual continuity).
Having these nice features, the force-directed graph layout has become the
"workhorse" of layout algorithms. It has been successfully adapted to many
domains with variations of implementation.
http//www.hpc.unm.edu/sunls/research/treelayout/
node1.html
7Scaling
Force-directed layout methods commonly have computational scaling problems.
When there are more than a few thousand vertices in the graph, the running
time of the layout computation can become unacceptable. This is because in
each step of the simulation the repulsive force between each pair of
unconnected vertices needs to be computed, costing a running time of
O(0.5 × V^2 - E), i.e. roughly the number of unconnected vertex pairs. Here
V is the number of vertices and E is the number of edges in the graph. This
complexity is hard to escape for general graphs without hierarchical
structure.
http//www.hpc.unm.edu/sunls/research/treelayout/
node1.html
8Protein interaction graphs
Most protein interaction data have the following characteristics:
(1) When visualized as a graph, the data yield a disconnected graph with
many connected components.
(2) The data yield a nonplanar graph with a large number of edge crossings
that cannot be removed in a 2D drawing.
(3) The number of interactions per protein varies widely within the same
set of data (degree distribution p(k)).
(4) The data often contain protein interactions corresponding to self loops.
→ demands a robust algorithm.
Ju et al. Bioinformatics 19, 317 (2003)
9InterViewer Example of force-directed layout
algorithm
InterViewer does not place the initial nodes randomly, but on the surface of
a sphere. It runs a fixed number of iterations. The original algorithm has
complexity O(N^2) per timestep, with N the number of nodes. When using
multipole methods, this can be reduced to O(N log N). Time may also be saved
by introducing a cut-off, e.g. only computing interactions with the
neighboring cells, and by updating the neighbor list infrequently.
Ju et al. Bioinformatics 19, 317 (2003)
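The cut-off idea can be illustrated with a simple cell list. This is a generic sketch and not InterViewer's implementation: it assumes 3D points, a cubic grid whose cell edge equals the cut-off, and a hypothetical function name `pairs_within_cutoff`.

```python
import math
from collections import defaultdict
from itertools import product


def pairs_within_cutoff(points, cutoff):
    """Return sorted index pairs (i, j), i < j, with distance <= cutoff.

    Points are binned into cubic cells of edge length `cutoff`, so only
    the same cell and its 26 neighbors need to be searched.
    """
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        cells[tuple(math.floor(c / cutoff) for c in p)].append(idx)
    pairs = set()
    for cell, members in cells.items():
        for offset in product((-1, 0, 1), repeat=3):
            neighbor = tuple(c + o for c, o in zip(cell, offset))
            # .get() avoids creating new cells while iterating.
            for i in members:
                for j in cells.get(neighbor, ()):
                    if i < j and math.dist(points[i], points[j]) <= cutoff:
                        pairs.add((i, j))
    return sorted(pairs)
```

Repulsive forces would then be evaluated only for the returned pairs, and the pair list could be rebuilt every few iterations rather than at every step.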
10Application for protein interaction graphs
Visualisation of the MIPS interaction data. In
3D, this graph contains no edge-crossings.
Ju et al. Bioinformatics 19, 317 (2003)
11Aim: analyze and visualize homologies within the protein universe
50 genomes → 145,579 proteins → 21 × 10^9 BLASTP pairwise sequence
comparisons. Expect that fusion proteins (Rosetta Stone proteins) will link
proteins of related function. Need to visualize an extremely large network!
Develop a stepwise scheme.
12LGL
(1) Separate the original network into connected sets.
(2) Generate coordinates for each node in each connected set (using a
force-directed layout algorithm and a recipe for the sequential layout of
nodes guided by a minimum spanning tree of the network).
(3) Integrate the connected sets into one coordinate system via a funnel
process: the connected sets are sorted in descending size by the number of
vertices. The first connected set is placed at the bottom of a potential
funnel, and the other sets are placed one at a time on the rim of the
potential funnel and allowed to fall towards the bottom, where they are
frozen in space upon collision with the previous sets.
We concentrate on step (2) in the following.
Adai et al. J. Mol. Biol. 340, 179 (2004)
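Step (1) and the size ordering used in step (3) can be sketched with a breadth-first search. This is a minimal illustration, not the LGL code; the function name `connected_sets` is hypothetical.

```python
from collections import defaultdict, deque


def connected_sets(nodes, edges):
    """Split an undirected graph into connected sets, largest first."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            u = queue.popleft()
            component.append(u)
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        components.append(component)
    # Descending size: the order in which sets would enter the funnel.
    components.sort(key=len, reverse=True)
    return components
```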
13Minimum Spanning Tree
Given an undirected graph G = (V,E) where for each edge (u,v) ∈ E there is a
weight w(u,v) specifying the cost to connect u and v. Find an acyclic
subset of edges T ⊆ E that connects all of the nodes and whose total weight
is minimized.
Popular algorithms are those of Kruskal and Prim. Both are greedy
algorithms that make the best choice at the moment. Greedy strategies in
general carry no guarantee of finding the best global solution, but for the
minimum spanning tree problem both algorithms provably do.
Cormen
14Kruskal's algorithm
Consider edges in sorted order by weight. The
arrow points to the edge under consideration at
each step.
Cormen
15Kruskal's algorithm (II)
Running time: O(E log V)
Cormen
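Kruskal's algorithm can be sketched as follows. The sketch assumes nodes are numbered 0..num_nodes-1 and edges are given as (weight, u, v) tuples; it uses a simple union-find with path halving, which is one common way to detect cycles and not necessarily the variant shown in the textbook figures.

```python
def kruskal(num_nodes, weighted_edges):
    """Return the edges of a minimum spanning tree (or forest).

    weighted_edges is an iterable of (weight, u, v) tuples.
    """
    parent = list(range(num_nodes))

    def find(x):
        # Follow parent pointers to the set representative,
        # shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    # Consider edges in sorted order by weight.
    for weight, u, v in sorted(weighted_edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the edge joins two different trees
            parent[root_u] = root_v
            tree.append((weight, u, v))
    return tree
```

Sorting the edges dominates the cost, which is where the O(E log V) bound comes from.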
16Intuitive description of LGL
Successive iterations of the layout. The MST
determines the order of placement of the nodes.
The root node could be chosen randomly or based
on its centrality in the network (e.g. minimizing
the sum of distances to all other nodes). All
other nodes are assigned a level according to
their edge-based distance in the MST from the
root node. Level one vertices (red circles) are
placed randomly on a sphere around the root node
(black circle). The system is allowed to iterate
through time satisfying attractive and repulsive
forces until at rest. Level two nodes (blue
circles) are placed randomly on spheres directed
away from the current layout. Again, the system
is allowed to evolve through time till at rest.
This process is iterated for the entire graph.
Adai et al. J. Mol. Biol. 340, 179 (2004)
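The level assignment that drives the placement order can be sketched with a breadth-first search over the MST edges. This covers only the ordering, not the force simulation, and the function names are hypothetical.

```python
from collections import defaultdict, deque


def mst_levels(tree_edges, root):
    """Map each node to its edge-based distance from the root in the tree."""
    adj = defaultdict(list)
    for u, v in tree_edges:
        adj[u].append(v)
        adj[v].append(u)
    level = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in level:
                level[w] = level[u] + 1
                queue.append(w)
    return level


def placement_order(tree_edges, root):
    """Group nodes by level: root first, then level one, level two, ..."""
    by_level = defaultdict(list)
    for node, lvl in mst_levels(tree_edges, root).items():
        by_level[lvl].append(node)
    return [sorted(by_level[lvl]) for lvl in sorted(by_level)]
```

Each returned group would be added to the layout in turn, with the force simulation run to rest before the next group is placed.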
17What is the role of fusion proteins?
A protein homology map summarizes the results of
billions of sequence comparisons by modeling the
proteins as vertices in a network, and the
statistically significant sequence similarities
as edges connecting the relevant proteins. In
this manner, proteins within a sequence family
(such as A, A', A'', and AB or B, B' and AB) are
all or mostly connected to each other, forming a
cluster in the map. Fusion proteins (such as AB)
serve to connect their component proteins'
families. The structure of the resulting map
reflects historic genetic events, such as gene
fusions, fissions, and duplications, which are
responsible for producing the modern-day genes.
The map simultaneously represents homology
relationships (edges), remote homologies
(proteins not directly connected but in the same
cluster), and non-homologous functional
relationships (adjacent clusters and clusters
linked by fusion proteins).
Adai et al. J. Mol. Biol. 340, 179 (2004)
18LGL Algorithm for very large biological networks
The complete protein homology map. A layout of
the entire protein homology map: a total of
11,516 connected sets containing 111,604 proteins
(vertices) with 1,912,684 edges. The largest
connected set is shown more clearly in the inset
and is enlarged further on the right side.
Adai et al. J. Mol. Biol. 340, 179 (2004)
19Map of gene function
emerges from 21 billion gene sequence
comparisons. Proteins are drawn as points, with
lines connecting proteins with similar sequences,
and are arranged so that homologous proteins are
adjacent in the Figure. The size of each cluster
is proportional to the number of proteins in that
sequence family. Fusion proteins force their
component proteins' respective families to be
close together in the Figure, and thereby serve
to organize the proteins in the map according to
their functions. The resulting broad trends of
protein function are labeled, as are several of
the most extensive sequence families. A-C
indicate specific regions that are magnified
later.
Only the greatest connected network component is
drawn, containing 30,727 proteins (vertices) and
1,206,654 significant sequence similarities
(edges), and representing 4 billion sequence
comparisons.
Adai et al. J. Mol. Biol. 340, 179 (2004)
20Functionally related gene families form adjacent
clusters
Three examples illustrate spatial localization of protein function in the
map, specifically: A, the linkage of the tryptophan synthase α family to the
functionally coupled but non-homologous β family by the yeast tryptophan
synthase αβ fusion protein; B, protein subunits of the pyruvate synthase and
alpha-ketoglutarate ferredoxin oxidoreductase complexes; C, metabolic
enzymes, particularly those of acetyl CoA and amino acid metabolism.
Adai et al. J. Mol. Biol. 340, 179 (2004)
21Colocalization
Neighboring proteins tend to be in the same
cellular system. The tendency for proteins to
operate in the same cellular system, as defined
by the percentage of matching assignments into
the 18 COG database pathways, is plotted against
the spatial separation in multiples of a typical
cluster size. The functional similarity decays
exponentially with distance, proportional to
e^(-0.26 d), where d is the separation measured
in typical cluster diameters.
Adai et al. J. Mol. Biol. 340, 179 (2004)
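To get a feeling for the decay constant, the exponential can be evaluated directly. The snippet assumes only the functional form e^(-0.26 d) quoted on the slide; since the slide gives a proportionality, only relative values are meaningful, and the function names are hypothetical.

```python
import math

DECAY_RATE = 0.26  # per typical cluster diameter, as quoted on the slide


def relative_similarity(d):
    """Functional similarity at separation d, relative to d = 0."""
    return math.exp(-DECAY_RATE * d)


def half_distance():
    """Separation at which the similarity has dropped to one half."""
    return math.log(2) / DECAY_RATE
```

Under this form the similarity halves after somewhat less than three cluster diameters.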
22Comparison with other layout maps
A comparison of LGL with map layouts produced by
other algorithms. The layout of the protein
homology map by LGL (A) is contrasted with the
layout of the same network by the spring-force
algorithm only, lacking the minimal spanning tree
calculation and iterative layout procedure (B),
and with the layout by the approach of
InterViewer (C). InterViewer collapses equivalent
nodes into single nodes, thereby simplifying the
graph, and is one of the few available graph
layout programs that scales to such large
networks. The layout from LGL reveals more of the
internal graph structure than the other
approaches tested.
Adai et al. J. Mol. Biol. 340, 179 (2004)
23Modularity in molecular networks?
A functional module is, by definition, a discrete
entity whose function is separable from those of
other modules. This separation depends on
chemical isolation, which can originate from
spatial localization or from chemical
specificity. E.g. a ribosome concentrates the
reactions involved in making a polypeptide into a
single particle, thus spatially isolating its
function. A signal transduction system is an
extended module that achieves its isolation
through the specificity of the initial binding of
the chemical signal to receptor proteins, and of
the interactions between signalling proteins
within the cell.
Hartwell et al. Nature 402, C47 (1999)
24Modularity in molecular networks
Modules can be insulated from or connected to
each other. Insulation allows the cell to carry
out many diverse reactions without cross-talk
that would harm the cell. Connectivity allows
one function to influence another. The
higher-level properties of cells, such as their
ability to integrate information from multiple
sources, will be described by the pattern of
connections among their functional modules.
Hartwell et al. Nature 402, C47 (1999)
25Organization of large-scale molecular networks
Organization of molecular networks revealed by large-scale experiments:
- power-law distribution P(k) ∝ k^(-γ), or a similar distribution, of the
  node degree k (i.e. the number of edges of a node)
- small-world property (i.e. a high clustering coefficient and a small
  shortest path between every pair of nodes)
- anticorrelation in the node degree of connected nodes (i.e. highly
  interacting nodes tend to be connected to low-interacting ones)
These properties become evident when hundreds or thousands of molecules and
their interactions are studied together.
At the other end of the spectrum: recently discovered motifs that consist of
3-4 nodes.
26Mesoscale properties of networks
Most relevant processes in biological networks correspond to the mesoscale
(5-25 genes or proteins), not to the entire network. However, it is
computationally enormously expensive to study mesoscale properties of
biological networks; e.g. a network of 1000 nodes contains on the order of
1 × 10^23 possible 10-node sets. Spirin & Mirny analyzed a combined network
of protein interactions with data from CELLZOME, MIPS, and BIND: 6500
interactions.
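The size of this search space is a binomial coefficient and can be checked directly. The helper name is hypothetical.

```python
import math


def count_node_sets(num_nodes, set_size):
    """Number of distinct node subsets of the given size."""
    return math.comb(num_nodes, set_size)


n_sets = count_node_sets(1000, 10)
print(f"{n_sets:.2e}")  # a number of order 10^23
```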
27Identify connected subgraphs
The network of protein interactions is typically presented as an undirected
graph with proteins as nodes and protein interactions as undirected edges.
Aim: identify highly connected subgraphs (clusters) that have more
interactions within themselves and fewer with the rest of the graph. A fully
connected subgraph, or clique, that is not a part of any other clique is an
example of such a cluster. In general, clusters need not be fully connected.
Measure the density of connections by
Q = 2m / (n(n - 1)),
where n is the number of proteins in the cluster and m is the number of
interactions between them. A clique has Q = 1.
Spirin, Mirny, PNAS 100, 12123 (2003)
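A small sketch of the density measure, assuming it is defined as Q = 2m / (n(n - 1)), i.e. the fraction of possible pairs within the cluster that actually interact. The function name and the convention of returning 0 for clusters with fewer than two proteins are choices made for this example.

```python
from itertools import combinations


def cluster_density(cluster, edges):
    """Q = 2m / (n(n - 1)) for the subgraph induced by `cluster`."""
    members = set(cluster)
    n = len(members)
    if n < 2:
        return 0.0  # convention chosen here; Q is undefined for n < 2
    edge_set = {frozenset(e) for e in edges}
    # m counts the interactions with both partners inside the cluster.
    m = sum(1 for pair in combinations(members, 2)
            if frozenset(pair) in edge_set)
    return 2 * m / (n * (n - 1))
```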
28(method I) Identify all fully connected subgraphs
(cliques)
Generally, finding all cliques of a graph is an NP-hard problem. Because the
protein interaction graph is so far very sparse (the number of interactions
(edges) is similar to the number of proteins (nodes)), this can be done
quickly. To find cliques of size n one needs to enumerate only the cliques
of size n-1. The search for cliques starts with n = 4: pick all (known)
pairs of edges (6500 × 6500 protein interactions) successively. For every
pair A-B and C-D, check whether there are edges between A and C, A and D,
B and C, and B and D. If these edges are present, ABCD is a clique. For
every clique ABCD identified, pick all known proteins successively. For
every picked protein E, if all of the interactions E-A, E-B, E-C, and E-D
are known, then ABCDE is a clique of size 5. Continue for n = 6, 7, ... The
largest clique found in the protein-interaction network has size 14.
Spirin, Mirny, PNAS 100, 12123 (2003)
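The "size n from size n-1" idea can be sketched as follows. As a simplification, this version starts from single edges (cliques of size 2) rather than from pairs of edges at n = 4, and it is only practical for sparse graphs; the function names are hypothetical.

```python
from collections import defaultdict


def grow_cliques(cliques, adj):
    """Extend every clique of size n-1 to all cliques of size n."""
    larger = set()
    for clique in cliques:
        # A node can be added only if it is linked to every member.
        common = set.intersection(*(adj[v] for v in clique))
        for extra in common - clique:
            larger.add(clique | {extra})
    return larger


def cliques_by_size(edges):
    """Return {size: set of cliques (frozensets)} for all sizes >= 2."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # self loops are ignored
            adj[u].add(v)
            adj[v].add(u)
    level = {frozenset((u, v)) for u, v in edges if u != v}
    result = {}
    size = 2
    while level:
        result[size] = level
        level = grow_cliques(level, adj)
        size += 1
    return result
```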
29(I) Identify all fully connected subgraphs
(cliques)
These results include, however, many redundant
cliques. For example, the clique with size 14
contains 14 cliques with size 13. To find all
nonredundant subgraphs, mark all proteins
comprising the clique of size 14, and out of all
subgraphs of size 13 pick those that have at
least one protein other than marked. After all
redundant cliques of size 13 are removed, proceed
to remove redundant twelves etc. In total, only
41 nonredundant cliques with sizes 4 - 14 were
found.
Spirin, Mirny, PNAS 100, 12123 (2003)
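One possible reading of this marking procedure is sketched below. It assumes that proteins of every kept clique are marked before the next smaller size is processed; the slide does not spell this out, so treat that detail, and the function name, as assumptions.

```python
from itertools import groupby


def nonredundant_cliques(cliques):
    """Keep a clique only if it has a protein not marked by larger ones."""
    marked = set()
    kept = []
    ordered = sorted((frozenset(c) for c in cliques), key=len, reverse=True)
    for _, group in groupby(ordered, key=len):
        # Test the whole size class against the marks left by larger cliques.
        survivors = [c for c in group if c - marked]
        kept.extend(survivors)
        for c in survivors:
            marked |= c
    return kept
```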
30(method II) Superparamagnetic Clustering (SPC)
SPC uses an analogy to the physical properties of an inhomogeneous
ferromagnetic model to find tightly connected clusters on a large graph.
Every node on the graph is assigned a Potts spin variable
S_i = 1, 2, ..., q. The value of this spin variable S_i undergoes thermal
fluctuations, which are determined by the temperature T and the spin values
on the neighboring nodes. Energetically, two nodes connected by an edge are
favored to have the same spin value. Therefore, the spin at each node tends
to align itself with the majority of its neighbors. When such a Potts spin
system reaches equilibrium for a given temperature T, a high correlation
between the fluctuating S_i and S_j at nodes i and j indicates that nodes i
and j belong to the same cluster.
Spirin, Mirny, PNAS 100, 12123 (2003)
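For orientation, the energy of a standard q-state Potts model on a graph with edge set E can be written as below. This is the textbook form with a single uniform coupling J > 0, given here as an assumption; the slide does not state the Hamiltonian, and an SPC implementation may use edge-dependent couplings.

```latex
H(\{S\}) = -J \sum_{(i,j) \in E} \delta_{S_i, S_j},
\qquad S_i \in \{1, \dots, q\}
```

Each edge whose endpoints share a spin value lowers the energy by J, which is why connected nodes tend to align.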
31(II) Superparamagnetic Clustering (SPC)
The protein-interaction network is represented by
a graph where every pair of interacting proteins
is an edge of length 1. The simulations are run
for temperatures ranging from 0 to 1 in units of
the coupling strength. The network splits into
monomers at temperatures between 0.7 and 0.8,
whereas larger clusters only exist for
temperatures between 0.1 and 0.7. Clusters are
recorded at all values of the temperature. The
overlapping clusters are then merged and
redundant ones are removed.
Spirin, Mirny, PNAS 100, 12123 (2003)
32(method III) Monte Carlo Simulation
Use MC to find a tight subgraph of a predetermined number of nodes M. At
time t = 0, a random set of M nodes is selected. For each pair of nodes i, j
from this set, the shortest path L_ij between i and j on the graph is
calculated. Denote the sum of all shortest paths L_ij from this set as L0.
At every time step one of the M nodes is picked at random, and one node is
picked at random out of all its neighbors. The new sum of all shortest
paths, L1, is calculated as if the original node were replaced by this
neighbor. If L1 < L0, accept the replacement with probability 1. If
L1 > L0, accept the replacement with probability exp(-(L1 - L0)/T), where T
is the effective temperature.
Spirin, Mirny, PNAS 100, 12123 (2003)
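A single step of such a search can be sketched as follows. The sketch assumes a Metropolis-style acceptance probability exp(-(L1 - L0)/T), draws the replacement only from neighbors that are not already in the set, and treats unreachable pairs as infinitely distant; these choices and the function names are assumptions made for the example. `adj` is a dict that maps every node to the set of its neighbors.

```python
import math
import random
from collections import deque


def shortest_path_sum(subset, adj):
    """Sum of shortest-path lengths over all node pairs in `subset`."""
    subset = list(subset)
    total = 0
    for i, source in enumerate(subset):
        # Breadth-first search on the full graph from `source`.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        for target in subset[i + 1:]:
            total += dist.get(target, math.inf)
    return total


def mc_step(current, adj, temperature, rng):
    """Try to replace one node of the set by one of its neighbors."""
    current = set(current)
    old = rng.choice(sorted(current))
    candidates = sorted(adj[old] - current)
    if not candidates:
        return current
    new = rng.choice(candidates)
    proposal = (current - {old}) | {new}
    l0 = shortest_path_sum(current, adj)
    l1 = shortest_path_sum(proposal, adj)
    # Always accept improvements; accept a worse set with Boltzmann weight.
    if l1 <= l0 or rng.random() < math.exp(-(l1 - l0) / temperature):
        return proposal
    return current
```

Repeated calls with a `random.Random` instance as `rng` perform the walk; the tightest set seen so far would be recorded separately.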
33(III) Monte Carlo Simulation
Every tenth time step an attempt is made to
replace one of the nodes from the current set
with a node that has no edges to the current set
to avoid getting caught in an isolated
disconnected subgraph. This process is repeated
(i) until the original set converges to a
complete subgraph, or (ii) for a predetermined
number of steps, after which the tightest
subgraph (the subgraph corresponding to the
smallest L0) is recorded. The recorded clusters
are merged and redundant clusters are removed.
Spirin, Mirny, PNAS 100, 12123 (2003)
34Optimal temperature in MC simulation
For every cluster size there is an optimal
temperature that gives the fastest convergence to
the tightest subgraph.
Time to find a clique of size 7, in MC steps per site, as a function of
temperature T. The region with optimal temperature is shown in the inset.
The required time increases sharply as the temperature goes to 0, but has a
relatively wide plateau in the region 3 < T < 7. Simulations suggest that
the choice of temperature T ≈ M would be safe for any cluster size M.
Spirin, Mirny, PNAS 100, 12123 (2003)
35Comparison of SPC and Monte Carlo methods
Comparison of clusters found with SPC (blue) and
MC simulation (red). Reasonable overlap (ca. one
third of all clusters are found by both methods)
but both methods seem complementary.
Spirin, Mirny, PNAS 100, 12123 (2003)
36Comparison of SPC and Monte Carlo methods
The SPC method is best at detecting high-Q value
clusters with relatively few links with the
outside world. An example is the TRAPP complex, a
fully connected clique of size 10 with just 7
links with outside proteins. This cluster was
perfectly detected by SPC, whereas the MC
simulation was able to find smaller pieces of
this cluster separately rather than the whole
cluster. By contrast, MC simulations are better
suited for finding very outgoing cliques. The
Lsm complex, a clique of size 11, includes 3
proteins with more interactions outside the
complex than inside. This complex was easily
found by MC, but was not detected as a
stand-alone cluster by SPC.
Spirin, Mirny, PNAS 100, 12123 (2003)
37Merging Overlapping Clusters
A simple statistical test shows that nodes which have only one link to a
cluster are statistically insignificant. Clean such statistically
insignificant members first. Then merge overlapping clusters: for every
cluster A_i, find all clusters A_k that overlap with this cluster by at
least one protein. For every such cluster, calculate the Q value of the
possible merged cluster A_i ∪ A_k. Record the cluster A_best(i) which gives
the highest Q value if merged with A_i. After the best match is found for
every cluster, every cluster A_i is replaced by the merged cluster
A_i ∪ A_best(i), unless the Q value of A_i ∪ A_best(i) is below a certain
threshold value Q_C. This process continues until there are no more
overlapping clusters or until merging any of the remaining clusters would
make a cluster with a Q value lower than Q_C.
Spirin, Mirny, PNAS 100, 12123 (2003)
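The merging loop can be sketched as below. The sketch omits the initial cleaning of weakly attached members, assumes the density Q = 2m / (n(n - 1)), and additionally drops clusters that end up contained in a larger one; the function names and these details are assumptions made for the example.

```python
from itertools import combinations


def density(cluster, edge_set):
    """Q = 2m / (n(n - 1)) for the subgraph induced by `cluster`."""
    n = len(cluster)
    if n < 2:
        return 0.0
    m = sum(1 for pair in combinations(cluster, 2)
            if frozenset(pair) in edge_set)
    return 2 * m / (n * (n - 1))


def merge_round(clusters, edge_set, q_threshold):
    """Merge every cluster with its best overlapping partner, if allowed."""
    merged = set()
    for a in clusters:
        best, best_q = None, -1.0
        for b in clusters:
            if b != a and a & b:  # overlap by at least one protein
                q = density(a | b, edge_set)
                if q > best_q:
                    best, best_q = a | b, q
        if best is not None and best_q >= q_threshold:
            merged.add(best)
        else:
            merged.add(a)
    # Remove clusters that are strictly contained in another cluster.
    return {c for c in merged if not any(c < d for d in merged)}


def merge_clusters(clusters, edges, q_threshold):
    """Repeat merging rounds until the set of clusters stops changing."""
    edge_set = {frozenset(e) for e in edges}
    current = {frozenset(c) for c in clusters}
    while True:
        updated = merge_round(current, edge_set, q_threshold)
        if updated == current:
            return current
        current = updated
```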
38Statistical significance of complexes and modules
Number of complete cliques (Q = 1) as a function
of clique size, enumerated in the network of
protein interactions (red) and in randomly
rewired graphs (blue, averaged over >1,000 graphs
in which the number of interactions for each
protein is preserved). Inset shows the same plot in
log-normal scale. Note the dramatic enrichment in
the number of cliques in the protein-interaction
graph compared with the random graphs. Most of
these cliques are parts of bigger complexes and
modules.
Spirin, Mirny, PNAS 100, 12123 (2003)
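Randomly rewired graphs that preserve each protein's number of interactions can be generated by repeated edge swaps. The following is a generic sketch of that technique, not necessarily the procedure used by the authors; it assumes the input has no self loops or duplicate edges, and the function names are hypothetical.

```python
import random
from collections import Counter


def degrees(edges):
    """Number of interactions per node."""
    counts = Counter()
    for u, v in edges:
        counts[u] += 1
        counts[v] += 1
    return counts


def rewire(edges, num_swaps, seed=0):
    """Swap endpoints of random edge pairs: A-B, C-D becomes A-D, C-B."""
    rng = random.Random(seed)
    edge_list = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edge_list}
    for _ in range(num_swaps):
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if len({a, b, c, d}) < 4:
            continue  # the swap would create a self loop
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in present or new2 in present:
            continue  # the swap would create a duplicate edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        edge_list[i] = (a, d)
        edge_list[j] = (c, b)
    return edge_list
```

Each accepted swap leaves all four node degrees unchanged, so the degree of every protein is preserved by construction.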
39Statistical significance of complexes and modules
Distribution of Q of clusters found by the MC
search method. Red bars: original network of
protein interactions. Blue curves: randomly
rewired graphs. Clusters in the protein network
have many more interactions than their
counterparts in the random graphs.
Spirin, Mirny, PNAS 100, 12123 (2003)
40Architecture of protein network
Fragment of the protein network. Nodes and
interactions in discovered clusters are shown in
bold. Nodes are colored by functional categories
in MIPS: red, transcription regulation; blue,
cell-cycle/cell-fate control; green, RNA
processing; and yellow, protein transport.
Complexes shown are the SAGA/TFIID complex
(red), the anaphase-promoting complex (blue), and
the TRAPP complex (yellow).
Spirin, Mirny, PNAS 100, 12123 (2003)
41Discovered functional modules
Examples of discovered functional modules. (A) A
module involved in cell-cycle regulation. This
module consists of cyclins (CLB1-4 and CLN2) and
cyclin-dependent kinases (CKS1 and CDC28) and a
nuclear import protein (NIP29). Although they
have many interactions, these proteins are not
present in the cell at the same time. (B)
Pheromone signal transduction pathway in the
network of protein-protein interactions. This
module includes several MAPK (mitogen-activated
protein kinase) and MAPKK (mitogen-activated
protein kinase kinase) kinases, as well as other
proteins involved in signal transduction. These
proteins do not form a single complex; rather,
they interact in a specific order.
Spirin, Mirny, PNAS 100, 12123 (2003)
42Architecture of protein network
Comparison of discovered complexes and modules
with complexes derived experimentally (BIND and
Cellzome) and complexes catalogued in MIPS.
Discovered complexes are sorted by the overlap
with the best-matching experimental complex. The
overlap is defined as the number of common
proteins divided by the number of proteins in the
best-matching experimental complex. The first 31
complexes match exactly, and another 11 have
overlap above 65%. Inset shows the overlap as a
function of the size of the discovered complex.
Note that discovered complexes of all sizes match
very well with known experimental complexes.
Discovered complexes that do not match with
experimental ones constitute our predictions.
Spirin, Mirny, PNAS 100, 12123 (2003)
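The overlap measure can be written down directly. This sketch assumes the best-matching experimental complex is the one that maximizes the overlap; the function name is hypothetical.

```python
def best_overlap(discovered, experimental_complexes):
    """Largest overlap of a discovered complex with any experimental one.

    Overlap = number of common proteins divided by the size of the
    experimental complex.
    """
    found = set(discovered)
    best = 0.0
    for complex_ in experimental_complexes:
        reference = set(complex_)
        if reference:
            best = max(best, len(found & reference) / len(reference))
    return best
```

Note that with this definition a discovered complex that contains an entire experimental complex scores 1, even if it has additional members.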
43Robustness of clusters found
Model the effect of false positives in the experimental data: randomly
reconnect, remove or add 10-50% of the interactions in the network.
Cluster recovery probability as a function of the fraction of altered
links. Black curves correspond to the case when a fraction of links are
rewired; red, removed; green, added. Circles represent the probability to
recover 75% of the original cluster; triangles represent the probability to
recover 50%.
Noise in the form of removal or addition of links has a less deteriorating
effect than random rewiring. About 75% of clusters can still be found when
10% of links are rewired.
Spirin, Mirny, PNAS 100, 12123 (2003)
44Summary
Here, analysis of meso-scale properties demonstrated the presence of highly
connected clusters of proteins in a network of protein interactions. This
gives strong support for the suggested modular architecture of biological
networks. Distinguish 2 types of clusters: protein complexes and dynamic
functional modules. Both complexes and modules have more interactions among
their members than with the rest of the network. Dynamic modules are elusive
to experimental purification because they are not assembled as a complex at
any single point in time. Computational analysis allows detection of such
modules by integrating pairwise molecular interactions that occur at
different times and places. However, computational analysis alone does not
make it possible to distinguish between complexes and modules or between
transient and simultaneous interactions.
45Summary
Most of the discovered complexes and modules come
from traditional studies, rather than from
large-scale experiments. This suggests that
although large-scale proteomic studies provide a
wealth of protein interaction data, the scarcity
of the data (and its contamination with false
positives) makes such studies less valuable for
identification of functional modules.