Title: Day 5: Comparative genome analysis
1 Day 5: Comparative genome analysis
- Added value of complete genomes
- More sequences
- Large-scale pattern detection in genomes
  - Function/orthology prediction by bi-directional best hit approaches
- More than bags of genes (?)
  - Presence/absence/variation of pathways
  - Prediction of new pathways
2 Exponential growth of the number of sequenced genomes, with a doubling time of 16 months
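As a small worked example of what a 16-month doubling time implies, the sketch below computes the fold increase after a given number of months. The function name and the time spans printed are illustration choices, not data from the slide.

```python
def growth_factor(months: float, doubling_time_months: float = 16.0) -> float:
    """Fold increase after `months`, assuming a fixed doubling time."""
    return 2.0 ** (months / doubling_time_months)


if __name__ == "__main__":
    # Arbitrary example time spans (in months).
    for months in (12, 16, 48, 120):
        print(f"{months:>4} months: x{growth_factor(months):.1f}")
```

With a fixed doubling time, the fold increase depends only on the elapsed time, not on the starting count.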
3 Analyze and compare genomes at various levels:
- DNA (e.g. GC content (we actually do not need sequencing for that), dinucleotide frequencies, coding densities of leading/lagging strands, GC skew, etc.)
- Protein-coding potential (e.g. coding density)
- Presence/absence/size of protein families
- Presence/absence of genes, comparing at the level of orthologs
- Gene order evolution
4 Strand asymmetries (genes)
Asymmetry of the density of coding/non-coding in B. subtilis (Kunst et al., Nature 1997)
5 Strand asymmetries (nucleotide frequencies)
GC skew (inner circle, (G-C)/(G+C)) in a complete genome (Nitrosomonas europaea)
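A minimal sketch of how GC skew, (G-C)/(G+C), could be computed in windows along a sequence. The window and step sizes and the toy sequence in the usage note are assumptions for illustration; a real analysis would read the genome from a sequence file.

```python
def gc_skew(seq: str, window: int = 1000, step: int = 1000) -> list[float]:
    """Return (G - C) / (G + C) for successive windows along `seq`."""
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        # Windows without any G or C get a skew of 0 by convention here.
        skews.append((g - c) / (g + c) if g + c else 0.0)
    return skews


def cumulative(values: list[float]) -> list[float]:
    """Running sum of the window skews (often plotted along the genome)."""
    total, out = 0.0, []
    for value in values:
        total += value
        out.append(total)
    return out
```

For example, `gc_skew("GGGC" * 5 + "GCCC" * 5, window=20, step=20)` gives a positive skew for the G-rich first half and a negative skew for the C-rich second half. In many bacterial genomes the sign of the skew switches near the origin and terminus of replication.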
6 The number of ORFs per nucleotide is more or less constant in prokaryotes (Doolittle, Nature 2002).
Interesting exceptions:
- 1100 pseudogenes in M. leprae
- Overprediction of ORFs in A. pernix
7 Functional elements in the human genome
3.4 × 10^9 nt, 20,000 protein genes
- tRNA, rRNA: 0.5%
- Coding regions (proteins): 1.7%
- Satellite DNA (centromeres, telomeres): 12%
- Non-translated RNA genes: Xist, H19, His-1, bic, microRNAs, etc.
- Regulatory elements: promoters, enhancers, etc.
- Transposable elements (LINEs, SINEs, ...): 40-45%
- Introns: 34%
- Intergenic DNA: 52%
- 86% no (known) function
8 Gene family size distribution for Bacillus subtilis (Kunst et al., Nature 1997)
9 A power law in the gene family size distribution: $Y \propto X^{b}$
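A minimal sketch of how the exponent b of such a power law could be estimated, by least-squares regression on log-transformed values. The function name and the toy numbers in the usage note are invented for illustration and are not the B. subtilis data.

```python
import math


def fit_power_law(sizes, counts):
    """Fit log(count) = log(a) + b * log(size) by least squares.

    Returns (a, b) such that count is approximately a * size ** b.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), slope
```

For instance, `fit_power_law([1, 2, 4, 8], [1000, 250, 62.5, 15.625])` recovers an exponent close to -2 for these synthetic values. Log-log regression is the simplest approach; for real data, maximum-likelihood estimators are generally considered more reliable.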
10 Buchnera metabolism, deduced from the genome (Shigenobu et al., Nature, 2000)
11 Genome annotation of Buchnera: classifying functions into functional categories (Shigenobu et al., Nature, 2000)
12 Evolution of gene content
- 1) Quantitative approaches: count the number of genes that two genomes share (orthology) and relate that to their phylogenetic distance.
  - Is there a rate of gene content evolution? (quantitative trends)
  - Can we actually reconstruct genome evolution? (what happened when, and what are the primary processes?)
- 2) Qualitative approaches: interpret the differences between two genomes in terms of the functions of the encoded proteins.
  - To what extent can we explain the differences between the phenotypes in terms of the genomes' gene content?
  - Are there functional patterns, e.g. in genome size variation? (qualitative trends)
13 Rate of genome evolution in terms of gene content (Huynen and Bork, PNAS, 1998)
14 Genome phylogeny based on gene content
- Count the number of shared orthologs between genomes using the bi-directional best, significant, hit approach (include fusion/fission)
- Create a similarity matrix by dividing the number of shared orthologs by the size of the smaller genome
- Create a distance-based phylogeny from the similarity matrix
Snel et al., 1999, Nat. Genet. 21:108; Huynen et al., 1999, Science 286:1441a
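The first two steps above can be sketched as follows. This is a simplified illustration under stated assumptions: the hit tables are hypothetical and assumed to be already filtered for significance, gene fusion/fission is not handled, and turning the similarity into a distance as 1 - similarity is a choice made here, not necessarily the transformation used in the cited papers.

```python
def best_hits(hits):
    """hits: {(query, subject): score} for one search direction.

    Returns {query: best-scoring subject}.
    """
    best = {}
    for (query, subject), score in hits.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(hits_ab, hits_ba):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    best_ab, best_ba = best_hits(hits_ab), best_hits(hits_ba)
    return {(a, b) for a, b in best_ab.items() if best_ba.get(b) == a}


def gene_content_distance(n_shared, size_a, size_b):
    """1 - (shared orthologs / size of the smaller genome)."""
    return 1.0 - n_shared / min(size_a, size_b)


# Hypothetical toy hit tables (gene names and scores are made up).
hits_ab = {("a1", "b1"): 200, ("a1", "b2"): 50,
           ("a2", "b2"): 120, ("a3", "b1"): 90}
hits_ba = {("b1", "a1"): 200, ("b2", "a2"): 120, ("b2", "a1"): 50}
orthologs = bidirectional_best_hits(hits_ab, hits_ba)
```

Here `a3` hits `b1`, but `b1` hits `a1` best, so only two pairs are counted as orthologs. Applying `gene_content_distance` to every genome pair gives a matrix that could be passed to a distance method such as neighbor joining.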
16 Convergence in gene content is also visible in phylogenetic trees that are based on the number of shared genes instead of the fraction of shared genes. Large bacterial genomes (E. coli, B. subtilis, M. tuberculosis, Synechocystis) cluster together, and so do small genomes (R. prowazekii, C. trachomatis).
17 Shared gene content between Archaea and Bacteria depends on genome size
18 The topology of a genome phylogeny based on gene content shows a high similarity to 16S and 23S rRNA phylogenies.
So what!! "(...) If instances of lateral gene transfer can no longer be dismissed as the exceptions that prove the rule it must be admitted that (...) unless organisms are construed as either less or more than the sum of their genes there is no unique organismal phylogeny" (W. F. Doolittle, Science 284, 2124-2128)
19 Horizontal (lateral) gene transfer
- The evolutionary history of a gene is not always consistent with the history of the species
- Discovering horizontal gene transfer by:
  - Relative levels of sequence identity.
  - Comparing the phylogenetic tree of the species (SSU rRNA) with that of the gene in question. Be careful, however! The sequences have to be orthologous to each other. Ancient gene duplications followed by differential loss can also give rise to trees that look like horizontal gene transfer.
  - Codon usage that differs from that of the other genes in the genome
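The codon-usage criterion can be sketched as below: compare the codon frequencies of one gene with those of a background set of genes from the same genome. The distance measure (half the summed absolute frequency differences) and the function names are assumptions chosen for simplicity; published methods use a variety of statistics.

```python
from collections import Counter


def codon_freqs(cds: str) -> dict[str, float]:
    """Relative codon frequencies of an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def codon_usage_distance(gene: str, background: str) -> float:
    """Distance between 0 (identical usage) and 1 (no codon in common)."""
    f, g = codon_freqs(gene), codon_freqs(background)
    return 0.5 * sum(abs(f.get(c, 0.0) - g.get(c, 0.0))
                     for c in set(f) | set(g))
```

A gene whose distance to the genome-wide background is unusually large would be a candidate for closer inspection. Short genes give noisy frequencies, and highly expressed genes can also deviate, so this is a hint rather than proof of transfer.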
20 No apparent horizontal gene transfer in the evolution of leucyl-tRNA synthetase (the phylogeny of the sequences fits more or less the species phylogeny).
[Figure: phylogenetic tree with groups labeled Eukaryotes, Mitochondria, Archaea, Bacteria]
21 Apparent horizontal gene transfer to the parasites Bbu (B. burgdorferi) and Mge, Mpe (Mycoplasmas) from the Eukaryotes, represented by Cel (C. elegans) and Sce (S. cerevisiae)
22 Relatively few families do not display any horizontal gene transfer. This has led to the discussion of whether we can actually talk about a genome phylogeny (see the Doolittle quote). We argue that there is a strong, dominant phylogenetic signal in gene content, and thus one can speak about a genome phylogeny. But that is of course open for discussion.
23 Reconstructing the course of genome evolution via a parsimonious approach. Primary processes:
- Gene gain
  - invention
  - gene duplication
  - horizontal gene transfer
- Gene loss
  - accumulation of mutations (pseudogene)
  - gene deletion
- Gene fusion/fission
24 Determining the relative contribution of these processes to genome evolution requires reconstructing the most likely evolution per orthologous group of proteins and adding up the results. Thus we also explicitly reconstruct the ancestors of the present genomes.
NB: These approaches are based on the size of orthologous groups, not on phylogenetic trees. Because these methods are not based on trees, we need an HGT penalty to make a distinction between HGT and multiple losses.
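To illustrate the role of an HGT penalty, here is a simplified parsimony sketch, not the method used in the course. It assumes a known species tree (nested tuples with species names as leaves), reduces one orthologous group to presence/absence instead of group size, and charges every gain (invention or HGT alike) `gain_cost` and every loss `loss_cost`. All names and cost values are hypothetical.

```python
INF = float("inf")


def subtree_costs(tree, present, gain_cost, loss_cost):
    """Return (cost if gene absent at this node, cost if present)."""
    if isinstance(tree, str):  # leaf: state is fixed by the observation
        return (INF, 0.0) if tree in present else (0.0, INF)
    absent_cost, present_cost = 0.0, 0.0
    for child in tree:
        c_absent, c_present = subtree_costs(child, present,
                                            gain_cost, loss_cost)
        # Parent lacks the gene: child lacks it too, or gains it.
        absent_cost += min(c_absent, c_present + gain_cost)
        # Parent has the gene: child keeps it, or loses it.
        present_cost += min(c_absent + loss_cost, c_present)
    return absent_cost, present_cost


def min_event_cost(tree, present, gain_cost=2.0, loss_cost=1.0):
    """Cheapest gain/loss scenario explaining the presence pattern."""
    return min(subtree_costs(tree, present, gain_cost, loss_cost))


# Hypothetical species tree; the gene is found only in species A and D.
species_tree = ((("A", "B"), "C"), ("D", "E"))
```

With `gain_cost=2.0` the cheapest scenario for a gene present in A and D is an ancestral gene followed by three losses (in B, C and E); with `gain_cost=1.0` two independent gains become cheaper. This is the trade-off the penalty is meant to control.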
25 Gene content evolution is a highly dynamic process. Even in the evolution towards the largest genome (e.g. E. coli), a large number of genes have been lost.
26 Rope as a metaphor to describe an organismal lineage (Gary Olsen). Individual fibers: genes that travel for some time in a lineage. While no individual fiber present at the beginning might be present at the end, the rope (or the organismal lineage) nevertheless has continuity.
27 However, the genome as a whole will acquire the character of the incoming genes (the rope turns solidly red over time).
28 Qualitative differential genome analysis
- Find pathogen-specific proteins that can serve as drug targets
- Relate the differences between genomes to the differences in the phenotypes
29 Interpreting the differences between genomes in terms of the functions of their genes: H. influenzae genome
Huynen et al., 1997, Trends Genet 13, 389
30 Three-way comparisons
Huynen et al., 1998, FEBS Lett 426, 1-5
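A three-way comparison can be sketched with plain set operations, assuming each genome has already been reduced to a set of orthologous-group identifiers. The identifiers and the function name below are hypothetical.

```python
def three_way(a: set, b: set, c: set) -> dict[str, set]:
    """Split three gene-content sets into the seven Venn-diagram regions."""
    return {
        "A only": a - b - c,
        "B only": b - a - c,
        "C only": c - a - b,
        "A and B only": (a & b) - c,
        "A and C only": (a & c) - b,
        "B and C only": (b & c) - a,
        "A, B and C": a & b & c,
    }


# Made-up orthologous-group identifiers for three genomes.
genome_a = {"og1", "og2", "og3", "og4"}
genome_b = {"og1", "og2", "og5"}
genome_c = {"og1", "og3", "og6"}
regions = three_way(genome_a, genome_b, genome_c)
```

Genes in a region such as "A only" are the natural starting point when looking for functions specific to one organism, for example a pathogen compared with two non-pathogenic relatives.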
31 Although we can qualitatively interpret the variations in shared gene content in terms of the phenotypes of the species, quantitatively they depend on the relative phylogenetic positions of the species. The closer two species are, the larger the fraction of their genes they share.
32 Correlation between the amount of regulation per gene and the size of the genome. Small genomes tend to lose their regulation → they have few alternative modes of action and live in relatively constant environments.
33 Large genomes spend relatively many proteins on regulation, few on cell division and other household functions (van Nimwegen, Trends in Genetics, 2003)
34 A bottom-up approach to superfamily distribution: supralinearly behaving families tend to be involved in gene regulation (60 to 80), linearly behaving families tend to be involved in metabolism (82 to 87), and logarithmically behaving families do not show a specific preponderance of functional classes (Orengo et al., TIG 2004)
35 The number of regulatory genes versus the number of metabolic genes
The derivatives of the curves above
The difference between the two numbers is maximal when the genome size is about 4800 → maximum amount of metabolic versatility for minimum number of regulators
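The location of that maximum can be illustrated with a short derivation. As an assumption for illustration only (the slide does not give the fitted curves), take the number of metabolic genes to grow linearly, $M(n) = b\,n$, and the number of regulatory genes to grow supralinearly, $R(n) = a\,n^{\alpha}$ with $\alpha > 1$, where $n$ is the genome size and $a$, $b$, $\alpha$ are hypothetical positive fit parameters.

```latex
\begin{align}
D(n) &= M(n) - R(n) = b\,n - a\,n^{\alpha} \\
\frac{dD}{dn} &= b - a\,\alpha\,n^{\alpha-1} = 0
\quad\Longrightarrow\quad
n^{*} = \left(\frac{b}{a\,\alpha}\right)^{\frac{1}{\alpha-1}} \\
\frac{d^{2}D}{dn^{2}} &= -a\,\alpha\,(\alpha-1)\,n^{\alpha-2} < 0
\end{align}
```

The difference is therefore largest where the two derivatives are equal, $M'(n^{*}) = R'(n^{*})$. For a quadratic regulatory curve ($\alpha = 2$) this reduces to $n^{*} = b/(2a)$.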
36 Gene order evolution
- Establish orthologous relations between pairs of genomes (e.g. Smith-Waterman (S-W) best bidirectional hit approach)
- Put them in a dotplot and color by the relative direction of transcription (green for the same relative direction, red for the opposite direction)
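The second step can be sketched as follows, assuming the ortholog pairs are already known and each gene is described by a position and a strand. The `Gene` type, the gene names and the coordinates are hypothetical; plotting itself is left out so that the example has no dependencies.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    position: int  # rank or coordinate along the genome
    strand: int    # +1 or -1


def dotplot_points(genome_a, genome_b, orthologs):
    """Return (x, y, color) for each ortholog pair.

    Green: same relative direction of transcription; red: opposite.
    """
    points = []
    for name_a, name_b in orthologs:
        gene_a, gene_b = genome_a[name_a], genome_b[name_b]
        color = "green" if gene_a.strand == gene_b.strand else "red"
        points.append((gene_a.position, gene_b.position, color))
    return points


# Made-up example: two ortholog pairs, one of them inverted.
genome_a = {"a1": Gene(0, +1), "a2": Gene(1, -1)}
genome_b = {"b1": Gene(5, +1), "b2": Gene(4, +1)}
points = dotplot_points(genome_a, genome_b, [("a1", "b1"), ("a2", "b2")])
```

The resulting list can be handed to any scatter-plot routine, using the third element of each tuple as the point color.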
38 Evolution of genome organization
- In prokaryotes, genome inversions centered around the origin/terminus of replication are a major source of genome rearrangements.
- This suggests that both replication forks are in close contact: comparative genome analysis provides support for a hypothesis about genome replication.
  - A close proximity of the forks would increase the probability of reciprocal recombination or transposition between sequences at the two forks. That the forks are near each other is also consistent with the 'replication factory' model based on immunolocalization of components of the replication machinery in Bacillus subtilis (Tillier and Collins, 2000, Nat. Genet.)
- Prokaryotic genomes tend to be shuffled in a comparatively short time (relative to the total time in the evolutionary tree)
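A toy model of an inversion centered on the origin of replication, to show why such inversions keep each gene's distance to the origin while moving it to the other side. The genome representation is an illustration convention, not something the slides specify: a list of (gene, strand) pairs on a circular chromosome, with the origin assumed to lie between the last and the first element.

```python
def invert_around_origin(genome, k):
    """Invert the 2k genes centered on the origin of a circular genome."""
    n = len(genome)
    if not 1 <= k <= n // 2:
        raise ValueError("k must be between 1 and len(genome) // 2")
    # The k genes before the origin followed by the k genes after it.
    segment = genome[-k:] + genome[:k]
    # Reverse the order and flip the strand of every gene in the segment.
    flipped = [(name, -strand) for name, strand in reversed(segment)]
    return flipped[k:] + genome[k:n - k] + flipped[:k]


def distance_to_origin(genome, name):
    """Number of genes between `name` and the origin."""
    index = [gene for gene, _ in genome].index(name)
    return min(index, len(genome) - 1 - index)


before = [(g, +1) for g in "abcdef"]
after = invert_around_origin(before, 2)
```

In this example gene `a` moves from the first to the last position and changes strand, but it stays directly next to the origin.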
39 Newport and Yan, Current Biology, 1996
40 Rapid shuffling of genomes (compared to 16S rRNA identity)
41 Some species pairs (MP-MG, CP-CT) show a significantly lower rate of genome shuffling than others. A possible explanation is the absence of the protein RecA from these genomes. RecA is involved in recombination → absence of recombination would slow down genome shuffling.
42 Further Reading
- Comparative genome analysis: Eppinger M, Baar C, Raddatz G, Huson DH, Schuster SC (2004) Comparative analysis of four Campylobacterales. Nat Rev Microbiol 2(11):872-85.
- Gene order evolution by inversion: Suyama M, Bork P (2001) Evolution of prokaryotic gene order: genome rearrangements in closely related species. Trends Genet 17(1):10-3.
- Scaling of gene functional classes: van Nimwegen E (2003) Scaling laws in the functional content of genomes. Trends Genet 19(9):479-84.