Title: 1 Scalefree networks: mathematical properties
11 Scale-free networks mathematical properties
Random graphs classical field in graph theory.
Well studied analytically and numerically. Scale-
free networks quite new. Properties were mostly
studied numerically and heuristically
(sofar). Nice review (suggested reading of
today) Graph Theory Approaches to Protein
Interaction Data Analysis, Nataa Prulj.
2Erdös-Renyi model
n nodes (vertices) joined by edges that have been
chosen and placed between pairs of nodes
uniformly at random. Gn,p each possible edge
in the graph on n nodes is present with
probability p and absent with probability 1
p. Average number of edges in Gn,p Each edge
connects two vertices ? average degree of a
vertex
3Erdös-Renyi model components
Erdös and Renyi studied how the expected topology
of a random graph with n nodes changes as a
function of the number of edges m. When m is
small, the graph is likely fragmented into many
small connected components having vertex sets of
size at most O(log n). As m increases the
components grow at first by linking to isolated
nodes, and later by fusing with other
components. A transition happens at m n/2,
when many clusters cross-link spontaneously to
form a unique largest component called the giant
component. Its vertex set size is much larger
than the vertex set sizes of any other
components. It contains O(n) nodes, while the
second largest component contains O(log n)
nodes. In statistical physics, this phenomenon
is called percolation.
4Erdös-Renyi model shortest path length
- The shortest path length between any pairs of
nodes in the giant component grows like log n. - Therefore, these graphs are called small
worlds. - The properties of random graphs have been studied
very extensively. - Literature B. Bollobas. Random Graphs. Academic,
London, 1985. - However, random graphs are no adequate models for
real-world networks because - real networks appear to have a power-law degree
distribution, - (while random graphs have Poisson distribution)
and - (2) real networks have strong clustering while
the clustering coefficient of a random graph is C
p, independent of whether two vertices have a
common neighbor.
5Generalized Random Graphs
- Aim allow a power-law degree distribution in a
graph while leaving all other aspects as in the
random graph model. - Given a degree sequence (e.g. power-law
distribution) one can generate a random graph by
assigning to a vertex i a degree ki from the
given degree sequence. Then choose pairs of
vertices uniformly at random to make edges so
that the assigned degrees remain preserved. - When all degrees have been used up to make edges,
the resulting graph is a random member of the set
of graphs with the desired degree distribution. - Problem method does not allow to specify
clustering coefficient. - On the other hand, this property makes it
possible to exactly determine many properties of
these graphs in the limit of large n. - E.g. almost all random graphs with a fixed degree
distribution and no nodes of degree smaller than
2 have a unique giant component.
6Barabasi scale-free model
Input (n0, m, t) where n0 is the initial number
of vertices, m (m? n0) is the number of added
edges every time one new vertex is added to the
graph, and t is the number of iterations. Algorit
hm a) Start with n0 isolated nodes. b) Every
time we add one new node v, m edges will be
linked to the existing nodes from v with a
preferential attachment probability where ki
is the number of links at the i-th
node. Eventually, the graph has (n0 t) nodes
and (mt) edges. Problem of pure mathematicians
with this algorithm how to start from n0 0?
7Properties of Barabasi-Albert scale-free model
P(k) ? k-? with ? 3. Real networks often show
? ? 2.1 2.4 Observation if either growth or
preferential attachment is eliminated, the
resulting network does not exhibit scale-free
properties. The average path length in the
BA-model is proportional to ln n/ln ln n which is
shorter than in random graphs ? scale-free
networks are ultrasmall worlds. Observation
non-trivial correlations clustering between the
degrees of connected nodes. Numerical result for
AB-model C ? n-0.75. No analytical predictions of
C sofar.
8Properties of scale-free models
Scale-free networks are resistant to random
failures (robustness) because a few high-degree
hubs dominate their topology a deliberate node
that fails probably has a small degree, and thus
not severly affects the rest of the
network. However, scale-free networks are quite
vulnerable to attacks on the hubs. See example
of last lecture about lethality of gene
deletions in yeast. These properties have been
confirmed numerically and analytically by
studying the average path length and the size of
the giant component.
9Properties of Barabasi-Albert scale-free model
- BA-model is a minimal model that captures the
mechanisms responsible for the power-law degree
distribution observed in real networks. - A discrepany is the fixed exponent of the
predicted power-law distribution (? 3). - Does the BA-model describe the true biological
evolution of networks? - Recent efforts
- study variants with cleaner mathematical
properties (Bollobas, LCD-model) - include effects of adding or re-wiring edges,
- allow nodes to age so that they can no longer
accept new edges - or vary forms of preferential attachment.
- These models also predict exponential and
truncated power-law degree distribution in some
parameter regimes.
102 Scale-free behavior in protein domain networks
- Domains are fundamental units of protein
structure. - Most proteins only contain one single domain.
- Some sequences appear as multidomain proteins. On
average, they have 2-3 domains, but can have up
to 130 domains! - Most new sequences show homologies to parts of
known protein sequences - most proteins may have descended from relatively
few ancestral types. - Sequence of large proteins often seem to have
evolved by joining preexisting domains in new
combinations, domain shuffling - domain duplication or domain insertion.
Wuchty Mol. Biol. Evol. 18, 1694 (2001)
11Protein domain database SMART
http//smart.embl-heidelberg.de/ contains 153
signalling domains 176 nuclear domains, e.g. HLH
domains 225 extracellular domains 115 other
domains
Wuchty Mol. Biol. Evol. 18, 1694 (2001)
12Protein Domain databases
Prosite (http//expasy.proteome.org.au/prosite/)
contains 1400 biologically significant motifs and
profiles. Pfam (http//www.sanger.ac.uk/Software/
Pfam/index.shtml) collection of
multiple-sequence alignments of protein families
and profile HMMs. Curated documentation on 2500
families. ProDom (http//www.toulouse.inra.fr/pro
dom.html) contains all 160.000 protein domain
families that can be automatically generated from
SwissProt and TrEMBL databases. Here, only
consider families with ?10 members ? 6000 ProDom
families. InterPro Proteome Analysis of 41
nonredundant proteomes of genomes of archaea,
bacteria, and eukaryotes (http//www.ebi.ac.uk/pro
teome) yields domains which appear along with
other domains in a protein sequence ? vertices
links.
Wuchty Mol. Biol. Evol. 18, 1694 (2001)
13Protein Domain databases
Prosite (http//expasy.proteome.org.au/prosite/)
contains 1400 biologically significant motifs and
profiles.
P(number of links to other domains)
Wuchty Mol. Biol. Evol. 18, 1694 (2001)
number of links to other domains
14Which are highly connected domains?
The majority of highly connected InterPro domains
appear in signalling pathways. List of the 10
best linked domains in various species.
Number of links increases. Number of signalling
domains (PH, SH3), their ligands (proline-rich
extensions), and receptors (GPCR/RHODOPSIN)
increases.
? evolutionary trend toward compartementalization
of the cell and multicellularity demands a higher
degree of organization.
Wuchty Mol. Biol. Evol. 18, 1694 (2001)
15Evolutionary Aspects
- BA-model of scale-free networks is constructed by
preferential attachment of newly added vertices
to already well connected ones. - Fell and Wagner (2000) argued that vertices with
many connections in metabolic network were
metabolites originating very early in the course
of evolution where they shaped a core metabolism. - Analogously, highly connected domains could have
also originated very early. - Is this true?
No. Majority of highly connected domains in
Methanococcus and in E.coli are concerned with
maintanced of metabolism. None of the highly
connected domains of higher organisms is found
here. On the other hand, helicase C has roughly
similar degrees of connection in all organisms.
Wuchty Mol. Biol. Evol. 18, 1694 (2001)
16Conclusions
- Expansion of protein families in multcellular
vertebrates coincides with higher connectivity of
the respective domains. - Extensive shuffling of domains to increase
combinatorial diversity might provide protein
sets which are sufficient to preserve cellular
procedures without dramatically expanding the
absolute size of the protein complement. - greater proteome complexity of higher eukaryotes
is not simply a consequence of the genome size,
but must also be a consequence of innovations in
domain arrangements. - highly linked domains represent functional
centers in various different cellular aspects. - They could be treated as evolutionary hubs
which help to organize the domain space.
Wuchty Mol. Biol. Evol. 18, 1694 (2001)
173 How did complexity evolve II?
At the molecular level, biological complexity
involves networks of ligand-protein,
protein-protein, and protein-nucleic acid
interactions in metabolism, signal transduction,
gene regulation, protein synthesis etc. As
organismal complexity increases, more control is
required for the positive and negative regulation
of genes. ? Complexity correlates with an
increase in both the ratio and absolute number of
transcription factors.
Amoutzias et al. EMBO reports 5, 1 (2004)
18Evolution of complex genetic networks
Duplication of genes is predominant factor for
the generation of new members of a protein
family. Duplicated gene creates redundancy if
multiple proteins have the same or overlapping
function. Alternatively, due to reduced
selective pressure, one of the gene copies can
become nonfunctional or acquire new function.
What is more important? Duplication of single
genes or duplication of large gene clusters
building blocks ?
Here Empirical evidence for scale-free protein
networks emerging through single-gene duplication.
Amoutzias et al. EMBO reports 5, 1 (2004)
19bHLH protein family
bHLH protein family ancient class of eukaryotic
transcription factors found in fungi, plants,
and animals. bHLHs may form homo- and
heterodimers. They form complex protein-protein
interaction network. Very conserved 60-residue
basic region helix loop helix motif bHLHs
dimerize into 4-helix bundle and recognize DNA
with basic regions. Additional regions
responsible for activation or repression of
target gene activity.
http//www.biochem.ucl.ac.uk/bsm/pdbsum/
20Two possible patterns of network evolution
A Evolution of a heterodimerization network by
single-gene duplication.
B Evolution of a heterodimerization network by
large-scale gene duplication.
Amoutzias et al. EMBO reports 5, 1 (2004)
21bHLH heterodimerization network
A phylogenetic analysis of human bHLH proteins B
domain architecture
Amoutzias et al. EMBO reports 5, 1 (2004)
22Topology of bHLH heterodimerization network
Topology based on protein-protein interaction
data (from literature).
Networks with hubs E2A, Arnt, Max. E2A and Arnt
sub-networks are connected. Max is distinct.
Hubs are shown as circles.
Amoutzias et al. EMBO reports 5, 1 (2004)
23Topology of bHLH heterodimerization network
P(k) follows a scale-free behavior! Relative
connectivity of hubs is higher (? 1) than
reported for most other networks.
This high connectivity appears to result from
gene duplication generating new, peripheral
proteins that interact prefe-rentially with the
hub.
Amoutzias et al. EMBO reports 5, 1 (2004)
24Topology of bHLH heterodimerization network
The hub proteins are usually widely expressed in
different tissues and organs. They
heterodimerize with peripheral proteins with more
limited expression pattern ? specific effects.
Topology of Max network parallels that of
E2A/arnt superfamily 2 hubs connected by Mad
family (repressor) vs. 2 families connected by
HES family (repressors).
Hubs are shown as circles.
Amoutzias et al. EMBO reports 5, 1 (2004)
25Phylogenetic relationships
- Parallelism between E2A/arnt and Max families
also reflected in phylogetic relationship - in Max (and E2A/arnt) network, the 2 hubs
(families) are not clustered together - the bridge linking the 2 hubs is a family of
repressor proteins that are phylogenetically
quite distant from the hubs. - ? 2 bHLH networks show evolutionary convergence.
Amoutzias et al. EMBO reports 5, 1 (2004)
26Conclusions
- For the evolution of networks based on one kind
of binding domain, - a model of single-gene duplication, followed by
domain rearrangements, point mutations and
ongoing gene duplication is sufficient to
generate quite complex interaction patterns,
which mediate activation and repression. - Seems first example where hub-based network with
scale-free properties is based on real-data
phylogenies. - (2) Compelling symmetry between the 2 networks
E2A/arnt and Max.
Amoutzias et al. EMBO reports 5, 1 (2004)
27Watts-Strogartz model
Input (n, k, p) where n is the number of
vertices, k is the distance in which each vertex
is connected initially to its neighbors by
undirected edges, and p(0? p ? 1) is the
probability of rewiring each edge. Algorithm a)
start with a ring lattice with n noces, each has
the kth nearest neighbors Thus, the degree of
each vertex is 2k and the ring has (nk) edges. b)
Replace original edges by random ones based on
the probability p.