Title: Data are not homogenous: lessons from completed microbial genomes
1Data are not homogenous: lessons from completed
microbial genomes.
- James O. McInerney,
- National University of Ireland, Maynooth,
- Co. Kildare, Ireland.
- http://www.may.ie/academic/biology/jmbioinformatics.shtml
- and
- The Natural History Museum, Cromwell Road, London
SW7 5BD, UK.
2We wish to suggest a structure for the salt of
deoxyribose nucleic acid (DNA). This structure
has novel features which are of considerable
biological interest.
Watson and Crick, Nature, 1953
3What was the relationship between DNA and
proteins?
- Nucleotide to amino acid?
- Doublet code?
- Triplet code?
- Intermediate device?
4Attempted to solve this problem by combinatorics
- George Gamow (Big Bang theory)
- F.H.C. Crick
5Attempted to solve it biochemically
6Universal Genetic code
7Universal Genetic code
8Universal Genetic code
9Universal Genetic code
10The rate of random fixation of neutral mutations
in evolution (per species per generation) is
equal to the rate of occurrence of neutral
mutations (per gamete per generation)
Kimura, M. Nature 217, 624 (1968)
11As far as is known, synonymous mutations are
truly neutral with respect to natural selection.
King, J.L. and Jukes, T.H. Non-Darwinian
Evolution. Science 164, 788-798 (1969).
12Evidence that synonymous codons were not all used
with equal frequency
Fiers et al. (1975). A-protein gene of bacteriophage MS2. Nature 256,
273-278.

UUU Phe  6   UCU Ser  5   UAU Tyr  4   UGU Cys  0
UUC Phe 10   UCC Ser  6   UAC Tyr 12   UGC Cys  3
UUA Leu  8   UCA Ser  8   UAA Ter      UGA Ter
UUG Leu  6   UCG Ser 10   UAG Ter      UGG Trp 12
CUU Leu  6   CCU Pro  5   CAU His  2   CGU Arg  7
CUC Leu  9   CCC Pro  5   CAC His  3   CGC Arg  6
CUA Leu  5   CCA Pro  4   CAA Gln  9   CGA Arg  6
CUG Leu  2   CCG Pro  3   CAG Gln  9   CGG Arg  3
AUU Ile  1   ACU Thr 11   AAU Asn  2   AGU Ser  4
AUC Ile  8   ACC Thr  5   AAC Asn 15   AGC Ser  3
AUA Ile  7   ACA Thr  5   AAA Lys  5   AGA Arg  3
AUG Met  7   ACG Thr  6   AAG Lys  9   AGG Arg  4
GUU Val  8   GCU Ala  6   GAU Asp  8   GGU Gly 15
GUC Val  7   GCC Ala 12   GAC Asp  5   GGC Gly  6
GUA Val  7   GCA Ala  7   GAA Glu  5   GGA Gly  2
GUG Val  9   GCG Ala 10   GAG Glu 12   GGG Gly  5
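RSCU (relative synonymous codon usage), which the later slides rely on, is conventionally the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. The following is a minimal Python sketch using the Ile and Phe counts from the MS2 table above. The function name `rscu` and the restriction to two amino acids are illustrative choices, not part of the original slides.

```python
from collections import defaultdict

# Illustrative subset of the universal code: only the Ile and Phe families.
CODON_TO_AA = {
    "AUU": "Ile", "AUC": "Ile", "AUA": "Ile",
    "UUU": "Phe", "UUC": "Phe",
}

def rscu(counts, codon_to_aa=CODON_TO_AA):
    """Return RSCU per codon: count * (family size) / (family total)."""
    totals = defaultdict(int)
    family_size = defaultdict(int)
    for codon, aa in codon_to_aa.items():
        totals[aa] += counts.get(codon, 0)
        family_size[aa] += 1
    values = {}
    for codon, aa in codon_to_aa.items():
        if totals[aa] == 0:
            continue  # amino acid absent from the gene: RSCU is undefined
        values[codon] = counts.get(codon, 0) * family_size[aa] / totals[aa]
    return values

# Counts copied from the MS2 A-protein table.
ms2_counts = {"AUU": 1, "AUC": 8, "AUA": 7, "UUU": 6, "UUC": 10}
ms2_rscu = rscu(ms2_counts)
for codon in sorted(ms2_rscu):
    print(codon, round(ms2_rscu[codon], 3))
```

A value of 1.0 means a codon is used exactly as often as expected under equal usage; the RSCU values within one amino acid always sum to the number of synonymous codons.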
13Multivariate reduction
- Attempts to reduce a high-dimensional space to a lower-dimensional one.
- In other words, it tries to simplify the data set.
- Many of the variables might co-vary, so there might be only one, or a few, sources of variation in the dataset.
- A gene can be represented by a 59-dimensional vector (universal code).
- A genome consists of hundreds (or thousands) of these genes.
- Variation in the variables (RSCU values) might be governed by only a small number of factors.
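To illustrate the idea that a few factors can govern many co-varying variables, here is a minimal sketch of principal components analysis via singular value decomposition. The data are synthetic (random numbers driven by one hidden factor), not real RSCU values, and the function name `pca` is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_codons = 200, 59  # 59 variables per gene, as on the slide

# Synthetic data (an assumption for illustration): a single hidden factor
# drives all 59 variables, plus a small amount of independent noise.
factor = rng.normal(size=(n_genes, 1))
loadings = rng.normal(size=(1, n_codons))
data = factor @ loadings + 0.1 * rng.normal(size=(n_genes, n_codons))

def pca(matrix):
    """Return gene scores on each axis and the fraction of variance per axis."""
    centred = matrix - matrix.mean(axis=0)
    u, s, _vt = np.linalg.svd(centred, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    scores = u * s  # coordinates of each gene on the principal axes
    return scores, explained

scores, explained = pca(data)
print("variance explained by the first three axes:", explained[:3].round(3))
```

With data built this way, almost all of the variation falls on the first axis, which is the situation the slide describes: 59 dimensions, but effectively one source of variation.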
14Location of a hypothetical gene encoded only by
isoleucine codons in its three-dimensional space
[Figure: three-dimensional plot with axes AUA, AUU and AUC, each running from the origin to 2.00; values 0.66, 1.00 and 1.33 are marked]
15Location of a collection of genes encoded only
by isoleucine codons in their three-dimensional
space
[Figure: three-dimensional plot with axes AUA, AUU and AUC, each running from the origin to 2.00; values 0.66, 1.00 and 1.33 are marked]
16GCUA (General Codon Usage Analysis)
McInerney, J.O. (1998) Bioinformatics 14(4)
- Computer program for analysing codon and amino acid usage data.
- Written in the ANSI C programming language.
- Runs on all operating systems (e.g. MacOS, SGI, SUN, DEC etc.).
- Performs the usual calculations, such as the number of times a codon is used, RSCU, amino acid usage etc.
- The most important functions are the multivariate analysis functions.
- GCUA performs Correspondence Analysis (CA) and Principal Components Analysis (PCA) on both codon usage (RSCU) and amino acid usage (aa) data.
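As a rough illustration of what correspondence analysis computes, the sketch below applies the textbook formulation (singular value decomposition of standardised residuals) to a small table of counts. It is not GCUA's implementation. The toy counts, the use of raw counts rather than RSCU values, and the function name are all assumptions made for the example.

```python
import numpy as np

def correspondence_analysis(table):
    """Textbook CA: SVD of the standardised residuals of a contingency table.

    Assumes every row and every column has a non-zero total.
    """
    table = np.asarray(table, dtype=float)
    p = table / table.sum()          # correspondence matrix
    r = p.sum(axis=1)                # row masses (genes)
    c = p.sum(axis=0)                # column masses (codons)
    expected = np.outer(r, c)
    residuals = (p - expected) / np.sqrt(expected)
    u, s, vt = np.linalg.svd(residuals, full_matrices=False)
    row_coords = (u * s) / np.sqrt(r)[:, None]     # genes on each axis
    col_coords = (vt.T * s) / np.sqrt(c)[:, None]  # codons on each axis
    inertia = s**2                                 # variation captured per axis
    return row_coords, col_coords, inertia

# Hypothetical counts: 4 genes (rows) x 3 isoleucine codons (columns).
toy_counts = [[10, 2, 1],
              [8, 3, 2],
              [2, 9, 6],
              [1, 10, 7]]
row_coords, col_coords, inertia = correspondence_analysis(toy_counts)
print("share of inertia per axis:", (inertia / inertia.sum()).round(3))
print("gene positions on Axis 1:", row_coords[:, 0].round(3))
```

Genes with similar codon usage receive similar coordinates, so plotting the first column of `row_coords` against the second gives the kind of Axis 1 versus Axis 2 plot used in the following slides.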
17GCUA: General Codon Usage Analysis
19Correspondence analysis of codon usage in
Escherichia coli.
[Figure: plot of Axis 1 against Axis 2]
20Correspondence analysis of codon usage in
Escherichia coli.
[Figure: plot of Axis 1 against Axis 2; label: Highly-expressed genes]
21Correspondence analysis of codon usage in
Escherichia coli.
[Figure: plot of Axis 1 against Axis 2; label: "Lowly-expressed" genes]
22Correspondence analysis of codon usage in
Escherichia coli.
[Figure: plot of Axis 1 against Axis 2; label: Recently-acquired genes]
23Prokaryotic genome evolution as assessed by
multivariate analysis of codon usage patterns
- McInerney, J.O. Microbial and Comparative
Genomics (1997) 2(1).
24Organisms
- Haemophilus influenzae (1,830,137 bp)
- Mycoplasma genitalium (580,070 bp)
- Methanococcus jannaschii (1,664,976 bp)
25Haemophilus influenzae
26Mycoplasma genitalium
27Methanococcus jannaschii
28Mycoplasma genitalium genome.
Outer circle: genes on the outer strand. Second
circle: genes on the inner strand.
29Mycoplasma genitalium
[Figure: plot of Axis 1 against Axis 2]
30Mycoplasma genitalium
[Figure: plot of Axis 1 against Axis 2; label: Highly-expressed genes]
31M. genitalium
[Figure: GC3s (0 to 0.5) plotted against Axis 1]
32M. genitalium
[Figure: GC3s (0 to 0.5) plotted against Axis 1; label: Highly-expressed genes; regression line y = 0.003x + 0.233, r = 0.875]
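GC3s, plotted against Axis 1 in the two slides above, is usually defined as the G+C content at the third position of synonymous codons, so codons with no synonymous alternative are left out. The sketch below assumes the universal genetic code. M. genitalium reads UGA as tryptophan, so for that genome the two sets would need adjusting. The function name and the example sequence are hypothetical.

```python
# Assumes the universal code; these sets are not valid for M. genitalium,
# where UGA (TGA) encodes Trp rather than a stop.
STOPS = {"TAA", "TAG", "TGA"}
NON_DEGENERATE = {"ATG", "TGG"}  # Met and Trp have a single codon each

def gc3s(cds):
    """Fraction of synonymous codons that end in G or C."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    informative = [c for c in codons
                   if c not in STOPS and c not in NON_DEGENERATE]
    if not informative:
        raise ValueError("sequence contains no synonymous codons")
    gc_ending = sum(1 for c in informative if c[2] in "GC")
    return gc_ending / len(informative)

# Hypothetical five-codon sequence: ATG GCT GCC AAA TAA
example_gc3s = gc3s("ATGGCTGCCAAATAA")
print(round(example_gc3s, 3))
```

Computing this value for every gene and plotting it against the gene's Axis 1 coordinate reproduces the kind of comparison shown in the figure.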
33M. genitalium
[Figure: Axis 1 plotted against chromosome position]
34M. genitalium
[Figure: Axis 1 plotted against chromosome position; regression coefficient r = 0.718]
35Base composition changes in M. genitalium
[Figure: GC3 (0 to 50) plotted against percentage distance from the origin (0 to 50); the origin of replication is marked]
36Replicational and transcriptional selection on
codon usage in Borrelia burgdorferi
McInerney, J.O. (1998). Proceedings of the
National Academy of Sciences USA.
37Borrelia burgdorferi
38Lyme disease
39Lyme disease II - this time it's personal!
42B. burgdorferi
- Linear chromosome of 910,725 bp
- At least 17 linear and circular plasmids with a combined size of 530,000 bp
- Many of the plasmid-borne ORFs are of unknown function or are only hypothesised to encode real genes.
- Loss of significant genes for cellular biosynthetic reactions (similar situation to that seen in M. genitalium).
43CoA of codon usage in B. burgdorferi
[Figure: plot of Axis 1 against Axis 2]
44CoA of codon usage in B. burgdorferi
[Figure: plot of Axis 1 against Axis 2]
45B. burgdorferi
          Leading     Lagging                 Leading     Lagging
AA  Codon      N RSCU      N RSCU   AA  Codon      N RSCU      N RSCU
Phe UUU    12116 1.88   4146 1.64   Ser UCU     5926 2.38   1571 1.50
    UUC      756 0.12    903 0.36       UCC      574 0.23    370 0.35
Leu UUA     7918 2.39   4031 2.49       UCA     2865 1.15   2081 1.98
    UUG     4319 1.30    958 0.59       UCG      532 0.21    196 0.19
Leu CUU     6325 1.91   2360 1.46   Pro CCU     2272 2.03    926 1.39
    CUC      224 0.07    388 0.24       CCC      662 0.59    425 0.64
    CUA      775 0.23   1664 1.03       CCA     1318 1.18   1207 1.81
    CUG      332 0.10    324 0.20       CCG      232 0.21    108 0.16
Ile AUU    12007 2.00   5015 1.20   Thr ACU     2929 1.88   1310 1.06
    AUC      845 0.14   1365 0.33       ACC      806 0.52    655 0.53
    AUA     5165 0.86   6161 1.47       ACA     2118 1.36   2817 2.27
Met AUG     3444 1.00   1692 1.00       ACG      390 0.25    174 0.14
Val GUU     7778 2.59   1187 1.45   Ala GCU     4395 2.08   1337 1.25

          Leading     Lagging                 Leading     Lagging
AA  Codon      N RSCU      N RSCU   AA  Codon      N RSCU      N RSCU
Tyr UAU     7043 1.77   2570 1.27   Cys UGU     1034 1.54    241 0.87
    UAC      928 0.23   1480 0.73       UGC      311 0.46    312 1.13
ter UAA        3 0.00      0 0.00   ter UGA        0 0.00      0 0.00
ter UAG        2 0.00      0 0.00   Trp UGG      885 1.00    543 1.00
His CAU     1770 1.67    827 1.22   Arg CGU      455 0.40     50 0.13
    CAC      352 0.33    532 0.78       CGC      192 0.17     59 0.16
Gln CAA     2936 1.51   2313 1.83       CGA      348 0.30    122 0.32
    CAG      943 0.49    217 0.17       CGG       99 0.09     20 0.05
Asn AAU    11290 1.80   5574 1.38   Ser AGU     3344 1.35    804 0.77
    AAC     1273 0.20   2495 0.62       AGC     1676 0.67   1272 1.21
Lys AAA    12380 1.42  10585 1.82   Arg AGA     4213 3.67   1761 4.69
    AAG     5102 0.58   1064 0.18       AGG     1585 1.38    241 0.64
Asp GAU     9509 1.75   2565 1.33   Gly GGU     3383 1.30    552 0.51
    GAC     1338 0.25   1303 0.67       GGC     1620 0.62    681 0.63
Glu GAA     7952 1.31   6240 1.76       GGA     3666 1.40   2650 2.44
    GAG     4151 0.69    856 0.24       GGG     1770 0.68    458 0.42
46B. burgdorferi
          Leading     Lagging
AA  Codon      N RSCU      N RSCU
Phe UUU    12116 1.88   4146 1.64
Significantly higher (95% confidence)
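The slide does not say which test produced the 95% result. One plausible check, shown here purely as an assumption, is a Pearson chi-square test on the 2x2 table of Phe codon counts (UUU and UUC) for the two strands, using the counts from the codon usage table for B. burgdorferi. The function name is illustrative.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Phe codon counts (UUU, UUC) taken from the leading/lagging table.
leading = (12116, 756)
lagging = (4146, 903)

statistic = chi_square_2x2(*leading, *lagging)
CRITICAL_95_DF1 = 3.841  # chi-square critical value: 1 degree of freedom, alpha = 0.05

print(f"chi-square = {statistic:.1f}")
print("significant at the 95% level:", statistic > CRITICAL_95_DF1)
```

A statistic above the critical value indicates that the UUU/UUC split differs between the strands by more than chance would explain at that confidence level.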
47B. burgdorferi
- Sarah French
- Inserted highly expressed ribosomal RNA genes downstream of an E. coli promoter on a plasmid.
- One clone (CF78) had the rRNA gene in the sense orientation (on the leading strand).
- The other (CF95) had it in the antisense orientation (on the lagging strand).
48Replication fork movement - codirectional
49Replication fork movement - antisense
50Positions of replication forks after 6 minutes
        Between origin and rrnB   In rrnB     Beyond rrnB
CF78    3 (9%)                    2 (6%)      30 (86%)
CF95    6 (16%)                   26 (68%)    6 (16%)
51B. burgdorferi
- B. burgdorferi has 66% of its genes on the leading strands
- Almost all of the highly-expressed genes are on the leading strands
- There is a significant difference in codon usage between the two strands
52B. burgdorferi
- Codon selection in B. burgdorferi is the result
of replicational selection, transcriptional
selection and mutational bias (not GC mutational
bias, rather AC mutational bias)
53Plasmodium falciparum Gametocyte
Courtesy London School of Hygiene and Tropical
Medicine.
54Malaria vector - Anopheline mosquito
55Plasmodium lifecycle
56P. falciparum
- 14 Chromosomes
- From 0.65 Mb to 3.4 Mb in size
- Chromosome 2
- 947,103 bp in length
- 210 predicted ORFs
- 82% AT base composition overall
- Origin of replication not identified by similarity searches
- Distribution of genes is fairly even on both Watson and Crick strands
57Replication in Plasmodium
- Two theories
- Multiple origins of replication.
- Mathematically derived (Janse et al., 1986, Mol. Biochem. Parasitol.).
- Single origin of replication in each chromosome.
- Anecdotal - "what we see in some other organisms"
58Multiple origins theory
- During microgametogenesis, the entire haploid genome is replicated in about 3.2 minutes.
- If we assume a replication rate of 50 bases/second, then at least 1300 origins of replication are required (one every 19.2 kb).
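The figures on this slide can be checked with simple arithmetic. The sketch below assumes that each origin fires once and that replication proceeds in both directions from it (two forks per origin). That assumption is not stated on the slide, but it is what makes the 19.2 kb spacing come out. The implied genome size is derived from the slide's own numbers and is not an independent measurement.

```python
REPLICATION_TIME_S = 192        # 3.2 minutes, from the slide
FORK_RATE_BASES_PER_S = 50      # assumed replication rate, from the slide
N_ORIGINS = 1300                # minimum number of origins, from the slide

def span_per_origin(time_s, fork_rate, forks_per_origin=2):
    """Bases replicated from one origin, assuming bidirectional forks."""
    return time_s * fork_rate * forks_per_origin

span = span_per_origin(REPLICATION_TIME_S, FORK_RATE_BASES_PER_S)
implied_genome = N_ORIGINS * span

print(f"each origin covers {span / 1000:.1f} kb")
print(f"{N_ORIGINS} origins cover about {implied_genome / 1e6:.1f} Mb")
```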
62P. falciparum
A3
C3
63P. falciparum
G3
T3
64Implications for single origins of replication.
- For chromosomes 2 and 3, replication must proceed 60 times faster than previously thought.
- If the same holds true for the largest chromosomes, then replication must proceed 170 times faster.
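The order of magnitude of these speed-ups can be reproduced with a rough calculation. The sketch assumes a single origin at the centre of the chromosome with two forks, the 3.2-minute replication window and the 50 bases/second baseline from the earlier slides. The inputs are the chromosome 2 length (947,103 bp) and the largest chromosome size (3.4 Mb) quoted earlier. These assumptions may differ from those behind the slide's own figures, so the results agree only approximately.

```python
REPLICATION_TIME_S = 192   # 3.2 minutes, from the multiple-origins slide
BASELINE_RATE = 50         # bases per second, from the multiple-origins slide

def required_fork_rate(chromosome_bp, time_s, forks=2):
    """Fork speed needed to copy a chromosome from one central origin."""
    return chromosome_bp / (forks * time_s)

chromosomes = {
    "chromosome 2": 947_103,          # bp, from the P. falciparum slide
    "largest chromosome": 3_400_000,  # 3.4 Mb, from the P. falciparum slide
}

fold_increase = {}
for name, size_bp in chromosomes.items():
    rate = required_fork_rate(size_bp, REPLICATION_TIME_S)
    fold_increase[name] = rate / BASELINE_RATE
    print(f"{name}: {rate:.0f} bases/s, about {fold_increase[name]:.0f}x the baseline")
```

An origin placed away from the centre would require a faster fork on the longer arm, which is one way the required speed-up could exceed the value computed here.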
65Gene recognition
pA = 0.2, pC = 0.4, pG = 0.1, pT = 0.3
S
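The slide gives only four base probabilities and the symbol S. One common way such probabilities are used in gene recognition, assumed here because the transcript does not say, is to score a sequence under a model in which each base is drawn independently. The sketch computes the log-likelihood of a sequence under that assumption. The function name and the example sequence are hypothetical.

```python
import math

# Base probabilities from the slide.
BASE_PROBS = {"A": 0.2, "C": 0.4, "G": 0.1, "T": 0.3}

def log_likelihood(sequence, probs=BASE_PROBS):
    """Natural-log probability of a sequence under an independent-base model."""
    return sum(math.log(probs[base]) for base in sequence.upper())

example = "ACGT"  # hypothetical sequence
example_ll = log_likelihood(example)
print(f"P({example}) = {math.exp(example_ll):.4f}")
```

Working in log space avoids numerical underflow for long sequences. Comparing scores under two different sets of probabilities (for example coding versus non-coding) is the usual next step, but the transcript gives no second model.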
66Acknowledgements
- London
- Prof T. Martin Embley,
- Dr. Mark Wilkinson,
- Dr. Robert Hirt
- Maynooth
- Chris Creevey,
- David Fitzpatrick.