Title: Codons, Genes and Networks
1 Codons, Genes and Networks
- Bioinformatics service
- Math_at_Bio group
- of M.Gromov
Andrei Zinovyev
2 Plan of the talk
- Part I: 7-clusters structure of genome (codons and genes)
- Part II: Coding and non-coding DNA scaling laws (genes and networks)
3 Part I: 7-clusters genome structure
- Dr. Tatyana Popova
- RD Centre in
- Biberach,
- Germany
- Prof. Alexander Gorban
- Centre for
- Mathematical
- Modelling
4 Genomic sequence as a text in an unknown language
..cgtggtgagctgatgctagggacgcacgtggtgagctgatgctaggga
cgacgtggtgagctgatgctagggacgc
5 From text to geometry
cgtggtgagctgatgctagggacgcacgtggtgagctgatgctagggacg
acgtggtgagctgatgctagggacgc 107 cgtggtgagctgatgc
tagggacgcac ggtgagctgatgctagggacgcacact tgagctgatg
ctagggacgcacaattc gtgagctgatgctagggacgcacggtg
gagctgatgctagggacgcacaagtga
fragment length: 200-400
10000-20000 fragments
$R^N$
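The step from text to geometry can be sketched as follows: cut the sequence into overlapping fragments and describe each fragment by its 64 triplet frequencies, so that every fragment becomes one point in a 64-dimensional space. This is a minimal sketch, not the author's code. The window width of 300 lies in the 200-400 range given on the slide, but the step of 30, the use of non-overlapping triplets, and the function names are assumptions.

```python
from itertools import product

# All 64 triplets in a fixed order: AAA, AAC, ..., TTT.
TRIPLETS = ["".join(p) for p in product("ACGT", repeat=3)]
INDEX = {t: i for i, t in enumerate(TRIPLETS)}


def triplet_frequencies(fragment):
    """Frequencies of non-overlapping triplets, read from the fragment start."""
    counts = [0] * 64
    for pos in range(0, len(fragment) - 2, 3):
        idx = INDEX.get(fragment[pos:pos + 3].upper())
        if idx is not None:  # skip triplets containing letters other than ACGT
            counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * 64


def sequence_to_points(sequence, width=300, step=30):
    """Slide a window along the sequence; each window gives one point in R^64."""
    starts = range(0, len(sequence) - width + 1, step)
    return [triplet_frequencies(sequence[s:s + width]) for s in starts]


# Toy input built by repeating the fragment shown on the slide.
seq = "cgtggtgagctgatgctagggacgca" * 20
points = sequence_to_points(seq)
```

Each row of `points` sums to 1; the rows are the data cloud that the next slide visualizes.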
6 Method of visualization: principal components analysis
$R^N$
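A minimal sketch of the projection, assuming NumPy and a plain SVD-based PCA (the slide does not say which implementation was used; `pca_project` is a hypothetical helper name). It takes a table of points, one row per fragment, and returns their coordinates on the first principal components.

```python
import numpy as np


def pca_project(points, n_components=2):
    """Project the rows of `points` onto their first principal components."""
    data = np.asarray(points, dtype=float)
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return centered @ vt[:n_components].T, explained[:n_components]


# Synthetic cloud, NOT genomic data: 200 points in R^5 with one dominant axis.
rng = np.random.default_rng(0)
cloud = rng.normal(size=(200, 5)) * np.array([10.0, 3.0, 1.0, 1.0, 1.0])
projection, explained = pca_project(cloud)
```

Plotting the two columns of `projection` against each other gives the kind of 2D picture shown for the genomes on the following slides.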
7 Caulobacter crescentus
8 First explanation
cgtggtgagctgatgctagggrcgcacgtggtgagctgatgctagggrcg
acgtggtgagctgatgctagggrcgc
9 Basic 7-cluster structure
gtgagctgatgctagggrcgcacgtggtgagc
10 Non-coding parts
Point mutations: insertions, deletions
a
gtgagctgatgctagggr cgcacgaat
11 The flower-like 7-cluster structure is flat
12 Seven classes vs Seven clusters
Georgia Institute of Technology
Stanford
TIGR
Lomsadze A., Ter-Hovhannisyan V., Chernoff Y.O., Borodovsky M. Gene identification in novel eukaryotic genomes by self-training algorithm. Nucleic Acids Research, 2005, Vol. 33, No. 20.
Hong-Yu Ou, Feng-Biao Guo and Chun-Ting Zhang. Analysis of nucleotide distribution in the genome of Streptomyces coelicolor A3(2) using the Z curve method. FEBS Letters, 2003, 540(1-3), 188-194.
Audic S. and Claverie J. Self-identification of protein-coding regions in microbial genomes. Proc Natl Acad Sci U S A, 1998, 95(17), 10026-31.
13 Computational gene prediction
14 Mean-field approximation for triplet frequencies
$F_{IJK}$: frequency of triplet $IJK$ ($I, J, K \in \{A, C, G, T\}$)
$F_{AAA}, F_{AAT}, F_{AAC}, \ldots, F_{GGC}, F_{GGG}$
64 numbers = position-specific letter frequencies (12 numbers) + correlations
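The reduction from 64 numbers to 12 can be sketched as follows: sum the triplet frequencies over the other two positions to get the frequency of each letter at each triplet position, then multiply those back together. This is a minimal sketch under the product form $P^1_I P^2_J P^3_K$ that appears later in the talk; the dictionary layout and function names are assumptions.

```python
from itertools import product

LETTERS = "ACGT"


def position_frequencies(F):
    """F maps each triplet such as "ATG" to its frequency (up to 64 numbers).

    Returns P, where P[p][x] is the frequency of letter x at
    triplet position p (3 positions x 4 letters = 12 numbers).
    """
    P = [{x: 0.0 for x in LETTERS} for _ in range(3)]
    for triplet, freq in F.items():
        for p, letter in enumerate(triplet):
            P[p][letter] += freq
    return P


def mean_field(F):
    """Product approximation: M[IJK] = P1[I] * P2[J] * P3[K]."""
    P = position_frequencies(F)
    return {
        i + j + k: P[0][i] * P[1][j] * P[2][k]
        for i, j, k in product(LETTERS, repeat=3)
    }


# Toy distribution, NOT real codon usage.
F = {"ATG": 0.5, "GCC": 0.25, "GAT": 0.25}
M = mean_field(F)
```

The difference `F[t] - M[t]` for each triplet `t` is what the slide calls correlations: the part of the triplet distribution that the 12 position-specific frequencies cannot reproduce.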
15 Why hexagonal symmetry?
GC-content = $P_C + P_G$
-0
0-
16 Genome codon usage and mean-field approximation
correct frameshift
ggtgaATG gat gct agg gtc gca cgc TAAtgagct
64 frequencies $F_{IJK}$
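To illustrate why the reading frame matters, the sketch below splits the example sequence from the slide into triplets starting at the ATG (the correct frame) and at the two shifted positions. It is only an illustration: the helper name `codons_in_frame` is hypothetical, and locating the gene by searching for the upper-case ATG relies on how this particular example is written.

```python
def codons_in_frame(sequence, offset):
    """Non-overlapping triplets of `sequence`, starting at `offset`."""
    seq = sequence.lower()
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]


# Example from the slide, with the spaces removed.
example = "ggtgaATGgatgctagggtcgcacgcTAAtgagct"
start = example.index("ATG")  # the gene start is written in upper case

# Shift 0 is the correct frame; shifts 1 and 2 read the same text out of frame.
frames = {shift: codons_in_frame(example, start + shift) for shift in range(3)}
```

Counting the triplets in `frames[0]` over all genes of a genome gives the codon usage; the same count for the shifted frames gives different distributions even though the underlying text is identical.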
17 $P^i_J$ (frequency of letter $J$ at codon position $i$) are linear functions of GC-content
eubacteria
archaea
18 THE MYSTERY OF TWO STRAIGHT LINES ???
$R^{64}$
$R^{12}$
$F_{IJK} = P^1_I P^2_J P^3_K$ + correlations
19 Codon usage signature
0-
20 19 possible eubacterial signatures
21 Example: palindromic signatures
22 Four symmetry types of the basic 7-cluster structure
23 S. coelicolor (GC 72%)
24 Using branching principal components to analyze 7-clusters genome structures
25 Using branching principal components to analyze 7-clusters genome structures
Streptomyces coelicolor
Fusobacterium nucleatum
Bacillus halodurans
Escherichia coli
26 Web-site
cluster structures in genomic sequences
http://www.ihes.fr/zinovyev/7clusters
27 Papers (type Zinovyev in Google)
Gorban A, Zinovyev A. PCA deciphers genome. 2005. Arxiv preprint.
Gorban A, Popova T, Zinovyev A. Codon usage trajectories and 7-cluster structure of 143 complete bacterial genomic sequences. 2005. Physica A 353, 365-387.
Gorban A, Popova T, Zinovyev A. Four basic symmetry types in the universal 7-cluster structure of microbial genomic sequences. 2005. In Silico Biology 5, 0025.
Gorban A, Zinovyev A, Popova T. Seven clusters in genomic triplet distributions. 2003. In Silico Biology 3, 0039.
Zinovyev A, Gorban A, Popova T. Self-Organizing Approach for Automated Gene Identification. 2003. Open Systems and Information Dynamics 10 (4).
28 Part II: Coding and non-coding DNA scaling laws
Dr. Sebastian Ahnert, Cavendish Laboratory, University of Cambridge
Dr. Thomas Fink, Bioinformatics service
29 C-value and G-value paradox
- Neither genome length nor gene number accounts for the complexity of an organism
- Drosophila melanogaster (fruit fly): C = 120 Mb
- Podisma pedestris (mountain grasshopper): C = 1650 Mb
30 Non-linear growth of regulation
The amount of regulation scales non-linearly with the number of genes: every new gene with a new function requires specific regulation, but the regulators also need to be regulated.
[Figure: log number of regulatory genes vs log number of genes, for bacteria and archaea; lines of slope 1 and slope 1.96]
Mattick, J. S. Nature Reviews Genetics 5, 316-323 (2004).
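A slope such as the 1.96 quoted for this plot is the exponent of a power law, and it can be estimated by a straight-line fit in log-log coordinates. The sketch below shows that fit on synthetic numbers; the data are made up to follow an exact power law and are not taken from the slide or from Mattick's paper, and `loglog_slope` is a hypothetical helper name.

```python
import math


def loglog_slope(xs, ys):
    """Least-squares slope of log10(y) against log10(x)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    sxx = sum((u - mx) ** 2 for u in lx)
    return sxy / sxx


# Synthetic data, NOT from the slide: an exact power law with exponent 1.96.
genes = [500, 1000, 2000, 4000, 8000]
regulators = [1e-5 * g**1.96 for g in genes]
slope = loglog_slope(genes, regulators)
```

A slope near 2 means that doubling the number of genes roughly quadruples the number of regulatory genes, which is the non-linear growth the slide refers to.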
31 Complexity ceiling for prokaryotes
- Adding a new function $\Delta S$ requires adding a regulatory overhead $\Delta R$; the total increase is $\Delta N = \Delta R + \Delta S$
- Since $R \sim N^2$, at some point $\Delta R > \Delta S$,
- i.e. the gain from a new function is too expensive for an organism: it requires too much regulation to be integrated
There is a maximum possible genome length for prokaryotes (10 Mb)
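The crossover point can be made explicit with a short worked step. It assumes the exact form $R = aN^2$ with a hypothetical constant $a$; the slide itself only states the quadratic scaling.

```latex
\begin{align*}
R = aN^2 \;&\Rightarrow\; \Delta R \approx 2aN\,\Delta N,\\
\Delta S = \Delta N - \Delta R \;&\approx\; (1 - 2aN)\,\Delta N,\\
\Delta R > \Delta S \;&\iff\; 2aN > 1 - 2aN \;\iff\; N > \frac{1}{4a}.
\end{align*}
```

Under the same assumption, $\Delta S$ reaches zero at $N = 1/(2a)$: beyond that size every added gene would have to be a regulator, which is one way to read the ceiling.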
32 How did eukaryotes bypass this limitation?
- Presumably, they invented a cheaper (digital) regulatory system, based on RNA
- This regulatory information is stored in the non-coding DNA
33 Simple model: accelerated networks
Node is a gene ($c$ genes)
Edge is a regulation ($n$ edges)
$n = a c^2$
Connectivity > $k_{max}$: deficit of regulations is taken from non-coding DNA
Connectivity < $k_{max}$: regulators are only proteins
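The model can be sketched in a few lines. The sketch takes connectivity to mean the number of edges per gene, n / c, and assumes each gene can carry at most k_max regulations through proteins; both are assumptions, since the slide does not define connectivity precisely. The function names and the numbers in the example are hypothetical.

```python
def gene_ceiling(a, k_max):
    """Largest gene number c for which connectivity n / c = a * c stays within k_max."""
    return k_max / a


def regulation_deficit(c, a, k_max):
    """Edges that cannot be carried by protein regulators alone.

    Assumption (not from the slide): protein regulators supply at most
    k_max edges per gene, so the capacity of c genes is k_max * c.
    """
    required = a * c**2      # edges demanded by the accelerated network
    capacity = k_max * c     # edges available at connectivity k_max
    return max(0.0, required - capacity)


# Hypothetical parameter values, chosen only to give round numbers.
a, k_max = 0.001, 5
c_max = gene_ceiling(a, k_max)
deficit = regulation_deficit(10000, a, k_max)
```

Above the ceiling the deficit equals `a * c * (c - c_max)`, which grows quadratically with the number of genes.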
34 How much regulation does a genome need to take from non-coding DNA?
$c_{max}$ (prokaryotic ceiling)
These regulations must be encoded in the non-coding part of the genome, therefore
$N$: non-coding DNA length
$C$: coding DNA length
$C_{prok}$: ceiling for prokaryotes (10 Mb)
$b$: some coefficient
35 Observation: coding length vs non-coding
$b = 1$
Minimum non-coding length needed for the deficit regulation
36 Hypothesis
- Prokaryotes
- <Non-coding length> = $a$ <Coding length>
- $a$ = 5-15% (little constant add-on: promoters, UTRs)
- 15% ≈ 1/7
- Eukaryotes
- $N_{reg} = \frac{b}{2}\,\frac{C}{C_{maxprok}}\,(C - C_{maxprok}) \sim C^2$,
- $C_{maxprok}$ = 10 Mb, $b$ = 1
- This is the amount necessary for regulation, but repeats, genome parasites, etc., might make a genome much bigger
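The eukaryotic formula is easy to evaluate directly. The sketch below implements it with the parameter values given on this slide (a ceiling of 10 Mb and b = 1). Returning zero below the ceiling is an assumption of the sketch, and the function name is hypothetical.

```python
def min_regulatory_noncoding(C, C_maxprok=10.0, b=1.0):
    """N_reg = (b / 2) * (C / C_maxprok) * (C - C_maxprok), all lengths in Mb.

    Returns 0 at or below the prokaryotic ceiling, where no deficit arises
    (an assumption of this sketch; the slide gives the formula for eukaryotes).
    """
    if C <= C_maxprok:
        return 0.0
    return 0.5 * b * (C / C_maxprok) * (C - C_maxprok)


# Coding length of 48 Mb, the value used for human later in the talk.
n_reg_human = min_regulatory_noncoding(48.0)
```

For C = 48 Mb this evaluates to about 91 Mb, close to the 87 Mb quoted for human later in the talk; the slides do not state the exact inputs behind that figure.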
37 This is only a hypothesis, but
- Prediction of $N_{reg}$ for human
- $N_{reg}$ = 87 Mb (3% of genome length)
- $C$ = 48 Mb (1.7%)
- $N_{reg} + C$: 4.7%
38 Thank you for your attention