Title: Random sequence-matching model for emergent gene-regulatory networks
1Random sequence-matching model for emergent
gene-regulatory networks
- Ayse Erzan
- Istanbul Technical University, Gürsey Institute,
- Collegium Budapest
- Duygu Balcan (ITÜ) Muhittin Mungan (BÜ)
- Alkan Kabakçioglu (Padova) Ayse H. Bilge (ITÜ)
- Yasemin Sengün (ITÜ)
2outline
- Random and real networks
- central dogma of gene regulation
- RNA interference and more
- sequence matching model for gene regulatory networks
- simulations and analytical results
- comparison with experiments
- outcomes of similar models
3classical random networks
- Erdős and Rényi
- (Publ. Math. Inst. Hung. Acad. Sci. 5, 17 (1960))
- N vertices
- N(N-1)/2 possible connections
- each present with probability p
- degree distribution
- Poissonian for large N
- $P(k) \simeq e^{-z} z^k / k!$
- $z = \langle k \rangle = pN$, $z_c = 1$
- Average minimal path length
- $l_{ER} \sim \ln N / \ln z = (1 + \ln p / \ln N)^{-1}$
- Clustering coefficient
- $C_{ER} = z/N = p$
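The estimates on this slide can be evaluated directly. The following is a minimal sketch, not part of the original talk; the function names `er_estimates` and `poisson_degree` are illustrative, and the path-length estimate is only meaningful above the threshold $z > 1$.

```python
import math

def er_estimates(n_nodes, p):
    """Classical Erdos-Renyi estimates for N vertices and link probability p."""
    z = p * n_nodes                                   # mean degree z = <k> = pN
    path_length = math.log(n_nodes) / math.log(z)     # l_ER ~ ln N / ln z (needs z > 1)
    clustering = z / n_nodes                          # C_ER = z/N = p
    return z, path_length, clustering

def poisson_degree(k, z):
    """Large-N degree distribution P(k) = exp(-z) z^k / k!."""
    return math.exp(-z) * z**k / math.factorial(k)

z, l_er, c_er = er_estimates(1000, 0.01)
print(z, l_er, c_er)
```

For N = 1000 and p = 0.01 this gives a mean degree of 10, a path length of about 3 and a clustering coefficient equal to p.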
4Random Networks
- Probability of a connection between any two nodes is the same, p
- N nodes: each has on average Np connections
- Small world property: distances between nodes grow very weakly with N
- The most highly connected nodes directly reach 25% of the rest
5naturally occurring networks
- Social and economic networks
- Citation and collaborative networks
- Technological networks
- www, communications networks
- Biological networks
- Neural networks, food networks
- Co-evolutionary networks, genomic networks
- R. Albert and A.-L. Barabási, Rev. Mod. Phys. 74, 47 (2002)
- S.N. Dorogovtsev and J.F.F. Mendes, Adv. Phys. 51, 1079 (2002)
6Real Networks
- considerable number of very highly connected nodes
- their first neighbors make up 60% of the total
- most frequent are nodes with very few connections (1)
- Small world!
7small world / scale free networks
- High clustering coefficient
- $\langle C_i \rangle = \langle 2 E_i / [k_i (k_i - 1)] \rangle$
- $> C_{ER} = z/N$
- Short average minimum path length $\langle l_{min} \rangle$
- (comparable to an ER network for the same C and N; differs from regular lattices)
- Scale free degree distribution
- $P(k) \sim k^{-\gamma}$, cutoff $k_c$
- a realisation
- Barabási-Albert model of preferential attachment
- growing network in which the probability of attachment of a new edge to vertex i is $\propto k_i$, giving $P(k) \sim k^{-3}$ (exact)
- (Models with preferential attachment: $\gamma \ge 2$)
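Preferential attachment as described on this slide can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not code from the talk: the seed graph is a complete core of m+1 nodes, every new node brings m edges, and the names `barabasi_albert` and `stubs` are hypothetical.

```python
import random

def barabasi_albert(n_nodes, m=2, seed=0):
    """Grow a network where a new edge attaches to vertex i with probability ~ k_i."""
    rng = random.Random(seed)
    # Assumed seed graph: a complete core of m + 1 nodes.
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    # Every node appears in `stubs` once per edge end, so a uniform draw
    # from `stubs` selects a node with probability proportional to its degree.
    stubs = [v for edge in edges for v in edge]
    for new in range(m + 1, n_nodes):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        for t in targets:
            edges.append((new, t))
            stubs.extend((new, t))
    return edges

edges = barabasi_albert(200, m=2, seed=1)
print(len(edges))
```

Sampling from the list of edge ends avoids recomputing the degrees at every step.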
8Genotype → Phenotype
- Genomic networks: a network of interactions that control and modulate genetic expression
- the output (expressed proteins) is capable of great variability
- a dynamical system (e.g., Wagner, PNAS 1994):
- $s_i(t+1) = \mathrm{sign}\left[ \sum_j w_{ij} s_j(t) + h_i \right]$
- w may be sparse, but incorporates correlations of all orders
- scale free degree distribution?
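One synchronous step of the threshold dynamics $s_i(t+1) = \mathrm{sign}[\sum_j w_{ij} s_j(t) + h_i]$ can be written as below. This is a minimal sketch; the convention sign(0) = +1 and the two-gene example matrix are assumptions made only for illustration.

```python
def sign(x):
    # Assumed convention: sign(0) = +1.
    return 1 if x >= 0 else -1

def update(state, w, h):
    """One synchronous step s_i(t+1) = sign(sum_j w_ij s_j(t) + h_i)."""
    n = len(state)
    return [sign(sum(w[i][j] * state[j] for j in range(n)) + h[i])
            for i in range(n)]

# Hypothetical two-gene circuit: gene 1 activates gene 0, gene 0 represses gene 1.
w = [[0, 1], [-1, 0]]
h = [0, 0]
state = [1, 1]
for _ in range(4):
    state = update(state, w, h)
    print(state)
```

With this matrix the expression pattern cycles with period four, a simple example of the dynamical variability mentioned on the slide.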
9Genotype → Phenotype
- Nucleic acids
- 4-letter alphabet
- can replicate
- DNA: Adenine, Guanine, Thymine, Cytosine
- Watson-Crick base pairing: A-T, C-G
- RNA: Adenine, Guanine, Uracil, Cytosine; A-U, C-G
- Amino acids
- 20 word dictionary
- → proteins
- Coded for by the nucleic acids
- 3 nucleotides = 1 codon on the mRNA → 1 amino acid
- not all combinations correspond to amino acids
- some combinations are degenerate
10gene regulation networks - transcription regulatory networks: the central dogma
[Figure: DNA carrying promoter1/gene1, promoter2/gene2, promoter3/gene3; transcription to RNA (mRNA); translation at the ribosome (tRNA, rRNA) into an amino acid chain; proteins (structural and regulatory); transcription factors.]
Adapted from Alvis Brazma, www.ebi.ac.uk/microarray/research/networks/genetics
11(from S. Maslov)
- Data from the Regulon Database: 606 interactions, 424 operons
- Out degree: $1 < k_{out} < 85$ (broader)
- In degree: $1 < k_{in} < 6$
12(from S. Maslov)
- data obtained from literature search: 1449 regulations, 689 proteins
- $k_{out} < 96$, $k_{in} < 40$
- $n(k_{out}) \sim k_{out}^{-2.5}$, $\gamma \approx 2.5$
13- A. Wagner, Mol. Biol. Evol. 18, 1283 (2001).
- Duplication and divergence of genes
- interaction between their regulatory proteins
14Transcription Regulatory genomic and Protein Interaction Networks (interactions between regulatory proteins)
- for a review of properties see
- R.V. Solé and R. Pastor-Satorras, in Handbook of Graphs and Networks (Bornholdt and Schuster eds., Wiley-VCH, Berlin 2002)
- Previous wisdom
- out degree distribution scale free with $\gamma \approx 2.5$ !!?
- A. Wagner, Mol. Biol. Evol. 18, 1283 (2001); Jeong, Mason, Barabási, Oltvai, Nature 411, 41 (2001); Maslov and Sneppen, Science 296, 910 (2002)
- narrower in-degree distribution than out-degree distribution?
- small world with non-classical clustering?
15RNA interference
16New paradigm? Post Transcriptional Gene
Suppression (PTGS)
17RNA can bind directly on similar DNA
sequences and silence genes at the
transcriptional stage
18Basic mechanism of sequence matching (lock-and-key combinations)
- Watson-Crick base pairing between nucleic acids
- DNA: Adenine, Guanine, Thymine, Cytosine; A-T, C-G
- RNA: Adenine, Guanine, Uracil, Cytosine; A-U, C-G
- stabilisation, replication and transcription of DNA
- RNA interference (siRNA binding to mRNA or chr. DNA)
- binding of regulatory proteins on to mRNA
- D. Balcan, AE, Eur. Phys. J. B 38, 253 (2004)
- M. Mungan, A. Kabakcioglu, D. Balcan, AE, q-bio.MN/0406049
19- three-dimensional architecture (secondary structure) - also sequence dependent
- amino-acid recognition by tRNA
- amino-acid binding by rRNA in the ribosome
- binding of transcription factors to promoter regions
- Greater generality for modeling genomic interactions?
- Stay tuned!
20emergent gene expression networks?
21sequence matching → gene regulation?
- Model: connectivity matrix of the genomic network
- $w_{ij} = 1$ iff the string $G_i$ is embedded inside the string $G_j$ ($G_i \subset G_j$, $l_i \le l_j$); $w_{ij} = 0$ otherwise.
[Figure: example strings 1101, 2011000101201000110211, 201010, 112; labels: interference (suppression), directed, $k_{in} = 1$, $k_{out} = 2$.]
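The connectivity rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' simulation code: the chromosome is a random string over {0, 1, 2} in which the delimiter 2 occurs with probability p and the coding symbols 0 and 1 are equally likely, genes are the non-empty fragments bounded by a 2 on both sides, and self-links are excluded. All function names are hypothetical.

```python
import random

def random_chromosome(length, p, seed=0):
    """Random string over {0,1,2}; '2' (stop-start sign) has probability p."""
    rng = random.Random(seed)
    q_half = (1.0 - p) / 2.0            # assumed: 0 and 1 equally likely
    return "".join(rng.choices("012", weights=[q_half, q_half, p], k=length))

def extract_genes(chromosome):
    """Genes: non-empty runs of 0/1 enclosed between two '2' delimiters."""
    fragments = chromosome.split("2")
    return [g for g in fragments[1:-1] if g]

def connectivity(genes):
    """w[i][j] = 1 iff gene i is embedded in gene j (i != j assumed)."""
    n = len(genes)
    return [[1 if i != j and genes[i] in genes[j] else 0 for j in range(n)]
            for i in range(n)]

genes = ["1", "1101", "0011010"]
for row in connectivity(genes):
    print(row)
```

In this small example the single-bit gene links to both longer genes, and 1101 links to 0011010, which contains it.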
22connected network
- Transitivity: if $w_{ij} = w_{jk} = 1$ then $w_{ik} = 1$ ⇒ preferential linking, clustering
- Congruence: if $l_i = l_j$ then $w_{ij} = w_{ji}$
- $k_{in}(i) = \sum_j w_{ji}$, $k_{out}(i) = \sum_j w_{ij}$
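The degree definitions and the transitivity property can be checked on a small connectivity matrix. This is an illustrative sketch; the example matrix corresponds to the hypothetical genes 1, 1101 and 0011010, and the helper names are not from the talk.

```python
def degrees(w):
    """Return (k_in, k_out) with k_in(i) = sum_j w_ji and k_out(i) = sum_j w_ij."""
    n = len(w)
    k_out = [sum(w[i]) for i in range(n)]
    k_in = [sum(w[j][i] for j in range(n)) for i in range(n)]
    return k_in, k_out

def is_transitive(w):
    """True if w_ij = w_jk = 1 implies w_ik = 1 (ignoring i == k)."""
    n = len(w)
    for i in range(n):
        for j in range(n):
            if not w[i][j]:
                continue
            for k in range(n):
                if w[j][k] and k != i and not w[i][k]:
                    return False
    return True

# Hypothetical example: genes "1", "1101", "0011010" under the embedding rule.
w = [[0, 1, 1], [0, 0, 1], [0, 0, 0]]
print(degrees(w), is_transitive(w))
```

Substring embedding is transitive by construction, so any matrix built from the model rule should pass this check.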
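23simulations: clustering coefficient
- $C_i = 2E(i) / \{k(i)[k(i)-1]\}$
- number of edges connecting nearest neighbours / total number of possible connections
- for incoming or outgoing bonds to the site i
- $\langle C_{out} \rangle = 0.034$
- $\langle C_{in} \rangle = 0.648$
- $\langle C \rangle = 0.534$, to be compared with $\langle z \rangle / \langle s \rangle$
- non-classical behaviour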
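A possible way to compute $C_i$ separately for outgoing and incoming bonds is sketched below. The slide does not say how edges among the neighbours are counted in the directed case; here a pair of neighbours is assumed to count as connected if an edge exists in either direction, and nodes with fewer than two neighbours are assigned zero. Both choices are assumptions.

```python
from itertools import combinations

def clustering(w, i, direction="out"):
    """C_i = 2 E(i) / (k(i) (k(i) - 1)) over the out- or in-neighbours of i."""
    n = len(w)
    if direction == "out":
        nbrs = [j for j in range(n) if w[i][j]]
    else:
        nbrs = [j for j in range(n) if w[j][i]]
    k = len(nbrs)
    if k < 2:
        return 0.0                      # assumed convention for k < 2
    # Assumed: a neighbour pair is linked if an edge exists in either direction.
    e = sum(1 for a, b in combinations(nbrs, 2) if w[a][b] or w[b][a])
    return 2.0 * e / (k * (k - 1))

w = [[0, 1, 1], [0, 0, 1], [0, 0, 0]]
print(clustering(w, 0, "out"), clustering(w, 2, "in"))
```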
24giant cluster breaks up for $p < p_c(L)$
- (L: length of the chromosome, p: frequency of stop-start signs)
- N (number of genes) too small, genes too long
- percolation threshold $p_c$: exponent -3/4 (preliminary)
25extremely small world networks!
- cluster radius: average minimum path length
- directed edges (in or out): $l_{min} = 1$ (transitivity!)
- undirected edges: $l_{min} = 1$, $l_{max} \le 4$, e.g. 11111 - 1 - 001101 - 0 - 00000
- $\langle l_{min} \rangle$ depends very weakly on p for fixed L
- $p_c < p < 1/2$; most genes of length unity
- $l_{min}$ undefined for $p \le p_c(L)$
- L = 15000: $\langle l_{min} \rangle = 1.66$; $\langle l_{min} \rangle \to 1.87$ as $p \to p_c$
26simulations network robust under random
mutations
- random point mutations
- x? (0 , 1) x? mod 2 (x?
1) - x? 2 x? ? 1 x?
- random walk steps taken by
- STOP and GO signs
- long range modifications due to
change in reading frame
27Degree distribution: preliminary simulation results
- peaks geometrically spaced for small $k_{out}$ (log-periodic), periodic for large $k_{out}$
- last peak: the size distribution of the giant cluster (single bit genes connect to almost all others)
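28distribution of out-going bonds at peak maxima - a single sample
- $n_m(k_{out}) \sim k_{out}^{-\gamma}$, $\gamma = 0.45 \pm 0.06$
- averaging over 500 runs, L = 15 000, p = 0.05
29$n \sim k_{out}^{-\gamma}$, $\gamma \approx 0.9$
30$n_m \sim k_{out}^{-\gamma}$: maxima of the peaks
- $\gamma \approx 0.9$ for small k
- $\gamma \approx 0.4$ for large k
- no double scaling for p = 0.05: $\gamma \approx 0.45$
31$n(k) \sim k^{-\gamma}$, $\gamma \in (1.1, 1.8)$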
32Simulation results: crossover in the scaling behaviour of the degree distribution
[Figure: degree distribution, analytical result (line) and simulation; the crossover degree $d_c$ is marked.]
33Analytical calculations
1. The matching probabilities
- Probability that a given string of length l is reproduced in a randomly chosen string of length k, for an alphabet of r letters:
- $p(l, k) = 1 - (1 - r^{-l})^{k-l+1} \approx r^{-l} (k-l+1)$ for l large
- neglecting correlations between overlaps
- $r^{l}$: number of l-strings with r letters
- $(k-l+1)$: number of shifted l-substrings in a k-string (1 for k = l)
- very good approximation for $r^{-l} (k-l) \ll 1$
34Computing the matching probabilities
- strings x (of length l) and y of length $k \ge l$
- $y_{a,l}$: substring of y, of length l, that has been shifted by a
- $U(x, y_{a,l})$: Hamming distance between x and $y_{a,l}$ (U = 0: match; $U \ne 0$: nomatch)
- $1 - f_a(x, y; \beta) = 1 - \exp[-\beta U(x, y_{a,l})] \to 0$ or 1 for $\beta \to \infty$ (counts nomatches)
- $p(l, k; x, \beta) = 1 - (\text{number of nomatches} / r^k)$, summed over y
- $p(l, k; x, \beta) = 1 - r^{-k} \sum_y \prod_{a=0}^{k-l} [1 - f_a(x, y; \beta)]$ (all nomatch for any shift a)
- Cluster expansion. Do the x averages: 2-pt averages over the f factorise; approximate all higher orders with factorised ones
- for $k \ge l$ get $p(l, k) = 1 - (1 - z^l)^{k-l+1} \approx z^l (k-l+1)$ for l large
- $z = [1 + (r-1) e^{-\beta}] / r$
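The finite-temperature result can be evaluated numerically. The sketch below simply codes $z = [1 + (r-1)e^{-\beta}]/r$ and $p(l,k) = 1 - (1 - z^l)^{k-l+1}$; the function names are illustrative.

```python
import math

def z_factor(beta, r):
    """Single-site match weight z = (1 + (r - 1) exp(-beta)) / r."""
    return (1.0 + (r - 1) * math.exp(-beta)) / r

def p_match_beta(l, k, beta, r=2):
    """Factorised matching probability p(l, k) = 1 - (1 - z^l)^(k - l + 1)."""
    if l > k:
        return 0.0
    return 1.0 - (1.0 - z_factor(beta, r) ** l) ** (k - l + 1)

print(z_factor(0.0, 4), z_factor(50.0, 2), p_match_beta(4, 16, 2.0))
```

At $\beta = 0$ every comparison counts as a match (z = 1), while for large $\beta$ the exact-matching value z = 1/r is recovered.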
35matching probability for r = 2
- $p(l, k) = 1 - (1 - 2^{-l})^{k-l+1} \approx 2^{-l} (k-l+1)$ for $l \le k$; 0 otherwise
[Figure: p(l, k) versus l; symbols: exact enumeration, lines: above expression. Curves with embedding string k = 16, 14, 12, 10, 8, 6, 4, 2 from top to bottom, $k \ge l$.]
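The comparison between the closed expression and exact enumeration can be reproduced for short strings. This is a brute-force sketch, practical only for small k, since it visits all $r^k$ embedding strings; the function names are illustrative and the alphabet is assumed to have at most ten letters.

```python
from itertools import product

def p_match(l, k, r=2):
    """Closed form p(l, k) = 1 - (1 - r^-l)^(k - l + 1), zero for l > k."""
    if l > k:
        return 0.0
    return 1.0 - (1.0 - r ** (-l)) ** (k - l + 1)

def p_exact(x, k, r=2):
    """Fraction of all r^k strings of length k that contain x (needs r <= 10)."""
    alphabet = "0123456789"[:r]
    hits = sum(1 for y in product(alphabet, repeat=k) if x in "".join(y))
    return hits / r ** k

for k in (2, 3, 4):
    print(k, p_match(2, k), p_exact("01", k))
```

The two agree exactly for k = l and differ slightly for k > l, where the correlations between overlapping shifts neglected in the closed form come into play.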
362. Understanding the sequence matching data
- Matching l with d: long genes ↔ small degree
373. Calculation of the out-degree distribution
- number of out-edges from a randomly chosen gene of length l to genes of length k: $X_{lk} = \sum_\alpha X_{lk}(\alpha)$, with $\alpha$ labelling the different realisations of genes of length k
- the $X_{lk}(\alpha)$ are independent random variables, so $X_{lk}$ is binomially distributed with parameter p(l, k); Poisson for small p(l, k) (large l)
- total number of out-edges from a randomly chosen gene of length l: $X_l = \sum_k X_{lk}$
- Gaussian distributed via the Central Limit Theorem, with mean $\langle X_l \rangle$ and variance $\langle X_l^2 \rangle - \langle X_l \rangle^2$
- $X_l$ Poisson for large l
38mean out-degree for genes of length l, for the model with exponential gene length distribution
- $\langle n(k) \rangle = L p^2 q^k$, with $q = 1 - p$ the probability of a coding element
- $d_l \equiv \langle X_l \rangle = \sum_{k \ge l} \langle X_{lk} \rangle = \sum_{k \ge l} p(l, k) \langle n(k) \rangle = L p (q z)^l / (p + q z^l) \sim (qz)^l$
- variance of the out-degree distribution for length l:
- $\sigma_l^2 = \langle X_l^2 \rangle - \langle X_l \rangle^2 = d_l \, p (1 - z^l) / [1 - q (1 - z^l)^2] \to d_l$ for large l
- for large l, $d_l \approx \sigma_l^2$: Poissonian
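The closed form for the mean out-degree can be checked against the direct sum over gene lengths. The sketch below assumes $\langle n(k) \rangle = L p^2 q^k$ and $p(l,k) = 1 - (1 - z^l)^{k-l+1}$ as on the slides; the truncation `kmax` and the function names are illustrative choices.

```python
def n_mean(k, L, p):
    """Mean number of genes of length k: L p^2 q^k with q = 1 - p."""
    return L * p * p * (1.0 - p) ** k

def d_closed(l, L, p, z):
    """Closed form d_l = L p (q z)^l / (p + q z^l)."""
    q = 1.0 - p
    return L * p * (q * z) ** l / (p + q * z ** l)

def d_sum(l, L, p, z, kmax=5000):
    """Direct sum over k >= l of p(l, k) <n(k)>, truncated at kmax."""
    return sum(n_mean(k, L, p) * (1.0 - (1.0 - z ** l) ** (k - l + 1))
               for k in range(l, kmax))

def var_closed(l, L, p, z):
    """sigma_l^2 = d_l p (1 - z^l) / (1 - q (1 - z^l)^2)."""
    q = 1.0 - p
    u = 1.0 - z ** l
    return d_closed(l, L, p, z) * p * u / (1.0 - q * u * u)

L, p, z = 15000, 0.05, 0.5
for l in (2, 5, 10, 20):
    print(l, d_closed(l, L, p, z), d_sum(l, L, p, z), var_closed(l, L, p, z))
```

For large l the variance approaches the mean, which is the Poissonian limit stated on the slide.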
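39out-degree distribution for small l (large d): scaling behaviour of the envelope
- peak height $h_l$: $h_l \sigma_l \sim n(l)$, so $h_l \sim n(l) / \sigma_l = L p^2 q^l / d_l^{1/2}$
- with $d_l \sim (qz)^l$: $h_l \sim (q/z)^{l/2}$
- $h(d) \sim d^{-\mu}$: $(qz)^{-\mu} = (q/z)^{1/2}$ gives $\mu = \frac{1}{2} (\ln z - \ln q) / (\ln z + \ln q) \approx \frac{1}{2} - p / \ln r$ (for $z = 1/r$)
[Figure: plot of the envelope h.]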
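40out-degree distribution for large l (small d)
- $P(X_l = d) = d_l^d \exp(-d_l) / d!$ (Poisson)
- $P(d) = \sum_l n(l) P(X_l = d) = L p \sum_l p q^l d_l^d \exp(-d_l) / d! \sim \int_0^\infty dx \, x^{d - \mu - 1/2} e^{-x} / d!$ for large l
- $P(d) \sim \Gamma(d + \frac{1}{2} - \mu) / \Gamma(d + 1) \sim d^{-\mu - 1/2}$ ($\Gamma$: Gamma function)
- where $\mu = \frac{1}{2} (\ln z - \ln q) / (\ln z + \ln q) \approx \frac{1}{2} - p / \ln r$
- Scaling exponent $\gamma_1 = \mu + \frac{1}{2} \approx 1 - p / \ln r$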
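The two exponents can be evaluated numerically. The sketch below assumes the form $\mu = \frac{1}{2}(\ln z - \ln q)/(\ln z + \ln q)$, which is the one that follows from $d_l \sim (qz)^l$ and that reproduces the quoted approximation $\frac{1}{2} - p/\ln r$; the symbol names `mu` and `gamma1` are labels chosen here, not necessarily the notation of the original slides.

```python
import math

def exponents(p, r, beta=math.inf):
    """Envelope exponent mu and small-d exponent gamma1 = mu + 1/2."""
    q = 1.0 - p
    z = (1.0 + (r - 1) * math.exp(-beta)) / r   # z = 1/r for exact matching
    mu = 0.5 * (math.log(z) - math.log(q)) / (math.log(z) + math.log(q))
    gamma1 = mu + 0.5
    return mu, gamma1

mu, gamma1 = exponents(p=0.05, r=2)
print(mu, 0.5 - 0.05 / math.log(2))
print(gamma1, 1.0 - 0.05 / math.log(2))
```

For p = 0.05 and r = 2 the exact and approximate values agree to within about 0.01, and both lie close to the simulation estimates quoted earlier in the talk (about 0.45 and 0.9).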
41out-degree distribution: finite size effects
- dotted: full Gaussian distribution taken for $P(X_l = d)$
- solid lines: finite size correction $d_l^{out} = (\sigma_l^{out})^2$, $P(X_l = d)$ Poisson
- Thus both for large and small l, $P(d) = L p \sum_l p q^l d_l^d \exp(-d_l) / d!$ provides a good representation
42Note: either for $\beta \to 0$ or for a unique letter (r = 1) the out-degree distribution is simply controlled by the length distribution, in which case the exponent is -1 !!
434. Crossover in the scaling behaviour
- peaks well separated for $l < l_c \approx 8$
- $d_l \sim (q/2)^l$, $\sigma_l \sim \sqrt{d_l} \to 0$ slower than $d_l$
- crossover occurs where $d_l - d_{l+1} \sim \sigma_l$
- More precisely, $d_c \approx 6.6$ (from requiring that the minimum between the two Gaussian peaks centered at $d_l$ and $d_{l+1}$ vanish)
44in-degree distribution: superposition of two peaks
- first peak: $A (z - z_0)^2 \exp[-B (z - z_0)]$
- second peak: Gaussian, circular very near the maximum
455. Simulation and analytical results: the in-degree distribution
- Solid line: finite size effect taken care of by inserting $d_l^{in} = (\sigma_l^{in})^2$
46The in-degree distribution
- The second peak can be obtained accurately from
- $d_l^{in} = \sum_{k \le l} n(k) \, p(k, l)$
- $(\sigma_l^{in})^2 = \sum_{k \le l} n(k) \, p(k, l) [1 - p(k, l)]$
- $p(d^{in}) = \sum_l p q^l [2\pi (\sigma_l^{in})^2]^{-1/2} \exp[-(d - d_l^{in})^2 / 2 (\sigma_l^{in})^2]$
47modelling gene interactions
- A. Kabakcioglu, M. Mungan, D. Balcan, AE, preprint
- sequence matching also operates in the case of transcriptional gene interaction
- claim: secondary structures (conformations) of transcription factors are determined by their amino acid sequence, coded for by the corresponding DNA sequence
- the different folds expose precise regulatory sites, which are recognized by regulatory sequences on the genome
48Experimental data from expression of mRNA in DNA array
- M. Gustafsson, M. Hörnquist, A. Lombardi, Large-scale reverse engineering by Lasso, q-bio.MN/040312
- on data from P.T. Spellman et al., Mol. Biol. Cell 9, 3273 (1998), from microarray experiments
- Yeast data
- (Saccharomyces cerevisiae)
49Expected model out-degree distribution, with
Gaussian RS length distribution
50Model with a Gaussian RS length distribution: single realisation, adjustable parameters $\langle l \rangle$, $\sigma_l$, and Yeast data
51Comparison of network of a single realisation of
the model chromosome and yeast microarray
experiment
52Consensus data (http//cgsigma.cshl/org ) for the length distribution of Regulatory Segments
- RS length: Gaussian distribution with parameters fixed by comparison with the out-degree of the yeast data
53Single realisation for two independent sets of Regulatory Sequences associated with each node of the network, $S_i$, $S_i'$
- Connectivity rule: $S_i \subset S_j'$
- Note: expected distributions will not change
54Advances in Artificial Life: 5th Eur. Conf. (ECAL99), Vol. 1674, LNAI, Springer
promoter seq. of length p 4 2
NL/4 p
Thanks Chrisantha!
55averaged over 20 genomes - oscillatory behavior
from superposition of Poisson peaks
56Evolution of gene networks by gene duplication
- Wagner, PNAS 91, 4387 (1994); Vazquez, Flammini, Maritan and Vespignani, cond-mat/0108043; Solé, Pastor-Satorras, Smith and Kepler, Adv. Complex Syst. 5, 43 (2002)
- take random network
- duplicate gene with connections
- take out the connections with a fixed probability and establish a new connection to a random node with a second fixed probability
- → scale free proteomic model
- $\gamma \approx 2.5$; C and minimum path length compare well with data
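The duplication steps listed above can be sketched as follows. This is an illustrative reading of the bullet points, not the implementation of any of the cited papers: the graph is undirected, and the names `p_remove` and `p_add` are hypothetical stand-ins for the two probabilities on the slide.

```python
import random

def duplicate_and_diverge(adj, p_remove, p_add, rng):
    """Duplicate a random node with its links, then prune and add links."""
    nodes = list(adj)
    parent = rng.choice(nodes)
    child = max(nodes) + 1
    adj[child] = set()
    # Copy each of the parent's connections, dropping it with probability p_remove.
    for nb in list(adj[parent]):
        if rng.random() >= p_remove:
            adj[child].add(nb)
            adj[nb].add(child)
    # Establish new connections to other nodes with probability p_add each.
    for other in nodes:
        if other not in adj[child] and rng.random() < p_add:
            adj[child].add(other)
            adj[other].add(child)
    return child

rng = random.Random(0)
adj = {0: {1, 2}, 1: {0}, 2: {0}}
for _ in range(50):
    duplicate_and_diverge(adj, p_remove=0.5, p_add=0.02, rng=rng)
print(len(adj), max(len(v) for v in adj.values()))
```

The adjacency sets are updated symmetrically, so the graph stays undirected throughout the growth.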
57Sequence similarity
58Gaussian network, evolution by duplication of
randomly chosen RSs, mutation (Yasemin Sengün)
59Summary
- random gene interaction network model with sequence-matching for
- arbitrary alphabet
- finite temperature (partial matching)
- out-degree distribution: power law for small d, log-periodic for large d
- exponents $\gamma_1 \approx 1 - p / \ln r$ (small d), $\mu \approx 0.5 - p / \ln r$ (envelope, large d), universal for small p
- single realisations compare well with experiment
- not scale free - crossover behaviour?