Title: Pr
1 Exploring Evolutionary Trends in Proteomes
Fredj Tekaia, Edouard Yeramian
Institut Pasteur
tekaia_at_pasteur.fr
2 Complete genomes: 2434 projects, 520 published (01-03-07)
1086 Bacteria, 59 Archaea, 696 eukaryotes, 73 metagenomes
[Figure labels: 433, 36, 46]
Tree of life: 3 phylogenetic domains
Lifestyles: mesophiles, (hyper)thermophiles, psychrophiles, extreme conditions, ...
http://www.genomesonline.org/
3 Data-driven exploratory analyses, as opposed to model-driven methods.
In the post-genomic era, multidimensional data resulting from large-scale genome comparisons are available.
Multivariate analysis methods are particularly helpful for the discovery of evolutionary trends associated with such data.
4 Methodology
Matrix $T = (k_{ij})$, $k_{ij} > 0$
$F_\alpha(i_s) = \lambda_\alpha^{-1/2} \sum_{j=1}^{p} f^{i_s}_j \, G_\alpha(j)$
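The formula above reads like the correspondence-analysis transition formula that places a (possibly supplementary) row i_s on factor axis alpha from the column coordinates G. The sketch below is a minimal illustration of that reading, not the authors' implementation: it computes only the first axis, uses a plain power iteration instead of a full decomposition, and the function names and the toy table are hypothetical.

```python
import math

def ca_first_axis(table, n_iter=500):
    """First correspondence-analysis axis of a table with positive entries.

    Returns (lam, F, G): eigenvalue, row coordinates, column coordinates.
    Assumption: a power iteration is good enough for this small sketch.
    """
    total = sum(sum(row) for row in table)
    n, p = len(table), len(table[0])
    P = [[k / total for k in row] for row in table]         # relative frequencies
    r = [sum(row) for row in P]                             # row masses
    c = [sum(P[i][j] for i in range(n)) for j in range(p)]  # column masses
    # Standardized residuals S_ij = (f_ij - r_i c_j) / sqrt(r_i c_j)
    S = [[(P[i][j] - r[i] * c[j]) / math.sqrt(r[i] * c[j]) for j in range(p)]
         for i in range(n)]
    # Power iteration on S^T S: v tends to the leading right singular vector
    v = [float(j + 1) for j in range(p)]
    for _ in range(n_iter):
        Sv = [sum(S[i][j] * v[j] for j in range(p)) for i in range(n)]
        w = [sum(S[i][j] * Sv[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Sv = [sum(S[i][j] * v[j] for j in range(p)) for i in range(n)]
    lam = sum(x * x for x in Sv)                            # eigenvalue of axis 1
    G = [math.sqrt(lam) * v[j] / math.sqrt(c[j]) for j in range(p)]
    F = [Sv[i] / math.sqrt(r[i]) for i in range(n)]
    return lam, F, G

def project_row(counts, lam, G):
    """Transition formula: F(i_s) = lam^(-1/2) * sum_j f_j^{i_s} * G(j)."""
    s = sum(counts)
    return sum((k / s) * g for k, g in zip(counts, G)) / math.sqrt(lam)

# Hypothetical toy table: 4 rows (species) x 3 columns (variables)
toy = [[10, 2, 3], [4, 8, 1], [2, 3, 9], [5, 5, 5]]
lam, F, G = ca_first_axis(toy)
supplementary = project_row([7, 1, 2], lam, G)  # a row not used to build the axis
```

Projecting an active row with `project_row` gives back its own coordinate in `F`, which is the property that makes supplementary rows comparable with the active ones.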
6 1. Evolution of Proteomes: Signatures and Trends in Amino Acid Compositions
2. Genome Trees from Whole Proteome Comparisons
7 Evolution of Proteomes: Signatures and Trends in Amino Acid Compositions
8 Mining the wealth of information contained in complete genomes, to decipher the genomic characteristics linked to the adaptive evolution of organisms in extreme conditions such as high or low temperatures, has long been a matter of interest:
Kreil DP, Ouzounis CA (2001). Identification of thermophilic species by the amino acid compositions deduced from their genomes. Nucleic Acids Res., 29: 1608-15.
Tekaia F, Yeramian E, Dujon B (2002). Amino acid composition of genomes, lifestyles of organisms, and evolutionary trends: a global picture with correspondence analysis. Gene, 297: 51-60.
Suhre K, Claverie JM (2003). Genomic correlates of hyperthermostability, an update. J. Biol. Chem., 278: 17198-202.
Hickey DA, Singer GA (2004). Genomic and proteomic adaptations to growth at high temperature. Genome Biol., 5: 117. Epub 2004.
Brocchieri L (2004). Environmental signatures in proteome properties. Proc Natl Acad Sci U S A., 101: 8257-8.
Cavicchioli R (2006). Cold-adapted archaea. Nat. Rev. Microbiology, 4: 331-3.
Lobry JR, Necsulea A (2006). Synonymous codon usage and its potential link with optimal growth temperature in prokaryotes. Gene, 385: 128-36.
Zeldovich KB, Berezovsky IN, Shakhnovich EI (2007). Protein and DNA Sequence Determinants of Thermophilic Adaptation. PLoS Comput Biol., 3: e5.
9The significant number of available completely
sequenced genomes with different lifestyles
offers an unprecedented opportunity to explore
species evolution.
Among simple analyses: amino acid composition of proteomes.
Which universal properties can be deduced from
amino acid compositions of proteomes? Are there
specific properties associated with lifestyles
and with phylogeny? What are the underlying
evolutionary trends?
10 Outline
Methodology
Species considered and data analysed
Species and amino acids distributions
Amino acids distribution and comparison with theoretical and experimental model chronologies of amino acids recruitment into the genetic code
Example: application to predicting candidate thermostable proteins in Aspergillus fumigatus.
11 Methodology
In the post-genomic era we are confronted with many examples of multidimensional data that result from large-scale genome comparisons.
Multivariate analysis methods are particularly
helpful for the discovery of evolutionary trends
associated with such data.
12 Methodology
Matrix $T = (k_{ij})$, $k_{ij} > 0$
$F_\alpha(i_s) = \lambda_\alpha^{-1/2} \sum_{j=1}^{p} f^{i_s}_j \, G_\alpha(j)$
13 Previous work showed
[Figure: 54 species; labels: Growth t, GC]
Tekaia, F., Yeramian, E. and Dujon, B. 2002. Gene 297: 51-60.
14 Amino Acid composition of 208 proteomes (OGT: optimal growth temperature), including:
20 hyperthermophiles (HTH) (OGT > 60°C, up to 120°C)
7 thermophiles (TH) (OGT > 50°C, up to 60°C)
8 psychrophiles (PSYC) (OGT -10°C, up to 15°C)
173 mesophiles (BMES), including 53 eukaryotes (EUK)
Data table: 222 (208 + 14 sup) vs 23 (20 aa + pol, char, hyd)
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genomeprj + specific sites
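As a rough illustration of how one row of such a data table could be obtained, the sketch below computes the percent composition of the 20 standard amino acids over a list of protein sequences. The function name and the toy sequences are hypothetical, and the three aggregate columns (pol, char, hyd) are left out because their class definitions are not given here.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes

def aa_composition(proteins):
    """Percent composition of the 20 standard amino acids over a proteome.

    Assumption: any other symbol (X, B, Z, *, ...) is simply ignored.
    """
    counts = Counter()
    for seq in proteins:
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}

# Hypothetical toy "proteome" of two short sequences
row = aa_composition(["MKVLA", "MKXEE"])
```

Stacking one such row per species gives the species-by-amino-acid table that the analysis starts from.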
15 Amino Acid composition
[Data table schematic; labels: 208, ..., 13]
Correspondence Analysis was used to explore relationships between species and amino acids.
16 [Schematic: result files of the proteome comparisons]
P1 (proteome 1): bestp1np, allp1np, segmatchp1np; bestnpp1, allnpp1, segmatchnpp1
Pn (proteome n): bestpnnp, allpnnp, segmatchpnnp; bestnppn, allnppn, segmatchnppn
bestnppi: np1 size pij e-value1 HS/IS/NS
allnppi: np1 size pij e-value1 HS/IS/NS np1 size pik e-value HS/IS/NS
Paralogs, Orthologs
The expected number of HSPs with score at least S is given by $E = K m n e^{-\lambda S}$, where m and n are the sequence and database lengths.
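The expectation formula above can be evaluated directly. The sketch below is only a numerical illustration: the values of K and lambda are placeholders, since the real ones depend on the scoring system and are not given here.

```python
import math

def expected_hsps(score, m, n, K, lam):
    """Expected number of HSPs with score >= `score`: E = K * m * n * exp(-lam * score)."""
    return K * m * n * math.exp(-lam * score)

# Placeholder parameters (assumptions, not values from the slides)
K_DEMO, LAMBDA_DEMO = 0.1, 0.3
e_low = expected_hsps(50, m=300, n=500_000, K=K_DEMO, lam=LAMBDA_DEMO)
e_high = expected_hsps(100, m=300, n=500_000, K=K_DEMO, lam=LAMBDA_DEMO)
```

E falls exponentially with the score and grows linearly with the database length, so the same alignment score becomes less significant as more proteomes are added to the comparison.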
19 Statistical characterization of the observed groups
Mean amino acid compositions of the 3 groups were compared using:
- One-way analysis of variance
- Newman-Keuls multiple comparison test
to detect significant differences at the probability level p < 0.001.
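A minimal sketch of the first of these two steps, the one-way analysis of variance, is given below for a single amino acid. It returns only the F statistic and its degrees of freedom; obtaining the p-value requires the F distribution, and the Newman-Keuls post-hoc test is not implemented here. The three groups of values are hypothetical.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of samples (lists of floats)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-group and within-group sums of squares
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical percentages of one amino acid in three groups of proteomes
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_b, df_w = one_way_anova_f(groups)
```

The test would be repeated for each of the 23 columns of the data table, which is one reason for using a strict threshold such as p < 0.001.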
20 Mean aa composition in (hyper)thermophiles, prokaryotic mesophiles-psychrophiles and eukaryotes (sig. different at p < 0.001)
21 AA physico-chemical properties in (hyper)thermophiles, prokaryotic mesophiles-psychrophiles and eukaryotes (sig. different at p < 0.001)
22 Amino acid signatures (p < 0.001)
HTH-TH BMES-PSYC EUK
V(Val) V(Val) V (Val)
H(His) H(His) H (His)
S (Ser) S (Ser) S (Ser)
pol pol pol
pol-char pol-char pol-char
Y (Tyr)
E (Glu a)
Q (Gln)
T (Thr)
D (Asp a)
G (Gly)
I (Ile)
L (Leu)
C (Cys)
Hyd
R (Arg), M (Met), F (Phe), K (Lys), N (Asn) and W (Trp) show no significant difference (at p < 0.001).
23Species evolutionary trends
25 Comparison with model chronologies of amino acids recruitment into the genetic code
Comparison of amino acid distribution with recent models of:
Jordan et al. Nature 433: 633-638 (2005)
Trifonov, J. Biomol. Struct. Dyn. 22: 1-11 (2004)
and with ancient amino acids:
Miller's experiments. Science 117, 528-529 (1953)
Analysis of the Murchison meteorite (1983)
26 Model of Jordan et al. 2005. A universal trend of amino acid gain and loss in protein evolution. Nature 433: 633-8.
They analysed 15 sets of three-way alignments of orthologous proteins encoded by triplets of closely related genomes from 15 taxa representing all three domains of life (Bacteria, Archaea and Eukaryota), and used phylogenies to polarize amino acid substitutions.
All amino acids with declining frequencies are thought to be among the first incorporated into the genetic code; conversely, all amino acids with increasing frequencies, except Ser, were probably recruited late.
27 Following observed frequencies, they subdivided amino acids into what they called:
4 strong losers: Pro, Ala, Glu and Gly (decline in at least 13 taxa/15), thought to be among the first incorporated into the genetic code, i.e. the most ancient aa.
5 strong gainers: Cys, Met, His, Ser and Phe (accrue in 14/15 taxa), probably recruited late, i.e. the most recent aa.
1 weak loser: Lys (lost in 10 taxa/15).
4 weak gainers: Asn, Thr, Ile (accrue in 11 taxa/15) and Val (accrues slowly in all taxa).
In contrast, the remaining six amino acids (Arg, Asp, Gln, Trp, Leu and Tyr) evolve more erratically.
Jordan et al. 2005.
29 Model of Trifonov, E.N. 2004. The triplet code from first principles. J. Biomol. Struct. Dyn. 22: 1-11.
A consensus chronology of amino acids is built on the basis of 60 different criteria, each offering a certain temporal order.
The chronology results in the consensus order (the number after each letter is its rank; shared ranks are ties):
G1 (Gly), A2 (Ala), D3 (Asp), V4 (Val), P5 (Pro), S6 (Ser), E7 (Glu), (L8 (Leu), T8 (Thr)), R10 (Arg), (I11 (Ile), Q11 (Gln), N11 (Asn)), H14 (His), K15 (Lys), C16 (Cys), F17 (Phe), Y18 (Tyr), M19 (Met), W20 (Trp).
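One simple way to see how the two chronologies relate is to look up the consensus ranks listed above for the strong losers and strong gainers of Jordan et al. The sketch below uses only ranks and class memberships quoted in these slides; the comparison by mean rank is an illustrative choice, not a method taken from the source.

```python
# Consensus ranks as listed above (ties share a rank)
TRIFONOV_RANK = {
    "Gly": 1, "Ala": 2, "Asp": 3, "Val": 4, "Pro": 5, "Ser": 6, "Glu": 7,
    "Leu": 8, "Thr": 8, "Arg": 10, "Ile": 11, "Gln": 11, "Asn": 11,
    "His": 14, "Lys": 15, "Cys": 16, "Phe": 17, "Tyr": 18, "Met": 19, "Trp": 20,
}

# Classes of Jordan et al. (2005) as quoted in the slides
STRONG_LOSERS = ["Pro", "Ala", "Glu", "Gly"]
STRONG_GAINERS = ["Cys", "Met", "His", "Ser", "Phe"]

def mean_rank(amino_acids):
    """Mean consensus rank of a list of three-letter amino acid codes."""
    return sum(TRIFONOV_RANK[aa] for aa in amino_acids) / len(amino_acids)

losers_rank = mean_rank(STRONG_LOSERS)    # 3.75
gainers_rank = mean_rank(STRONG_GAINERS)  # about 14.4
```

The strong losers sit near the start of the consensus order and the strong gainers near its end, with Ser (rank 6) as the visible exception among the gainers.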
30 Trifonov, E.N. (2004). The triplet code from first principles. J. Biomol. Struct. Dyn. 22: 1-11.
31Comparison with ancient amino acids
32 Miller/Urey Experiment 1953
By the 1950s, scientists were in hot pursuit of the origin of life. The scientific community was examining what kind of environment would be needed to allow life to begin. In 1953, Miller took molecules which were believed to represent the major components of the early Earth's atmosphere and put them into a closed system.
Miller's experiment showed that organic compounds such as amino acids, which are essential to cellular life, could be made easily under the conditions that scientists believed to be present on the early Earth.
33Miller, S.L. Science 117, 528-529. (1953)
Production of aa under possible primitive earth
conditions.
34Murchison meteorite 09-28-1969
The Murchison meteorite fall occurred on
September 28, 1969 over Murchison, Australia.
Over 100 kilograms of this meteorite have been
found. This meteorite is of possible cometary
origin due to its high water content of 12%. An
abundance of amino acids found within this
meteorite has led to intense study by researchers
as to its origins. More than 92 different amino
acids have been identified within the Murchison
meteorite to date. Nineteen of these are found on
Earth. The remaining amino acids have no apparent
terrestrial source.
36 Conclusions
Simple description of amino acid compositions of proteomes (free from a priori model) revealed fundamental evolutionary properties:
segregation of eukaryotes
segregation of hyperthermophiles
non-discrimination of psychrophiles.
Amino acid signatures for hyperthermophiles and for eukaryotes.
37 Conclusions...
Amino acids distribution is consistent with suggested model chronologies of their recruitment into the genetic code.
Correspondence Analysis helped to reveal these properties.
38General Conclusion
Amino acids are significant markers for species
evolution.
39 References
Murtagh, F. 2005. Correspondence Analysis And Data Coding With Java And R. Chapman & Hall/CRC. 248 p. ISBN 1584885289.
Jordan, I.K., Kondrashov, F.A., Adzhubei, I.A., Wolf, Y.I., Koonin, E.V., Kondrashov, A.S., Sunyaev, S. 2005. A universal trend of amino acid gain and loss in protein evolution. Nature 433: 633-638.
Tekaia, F., Yeramian, E. and Dujon, B. 2002. Amino acid composition of genomes, lifestyles of organisms, and evolutionary trends: a global picture with correspondence analysis. Gene 297: 51-60.
Tekaia, F. and Yeramian, E. 2005. Genome Trees from Conservation Profiles. PLoS Comput Biol. 1(7): e75.
Tekaia, F. and Yeramian, E. 2006. Evolution of Proteomes: Fundamental signatures and global trends in amino acid composition. BMC Genomics. 7: 307.
Trifonov, E.N. 2004. The triplet code from first principles. J. Biomol. Struct. Dyn. 22: 1-11.
See also http://www.pasteur.fr/tekaia/sacso.html
40Genome Trees from Whole Proteome Comparisons
41Outline
Species tree construction and difficulties
Post genome era species tree construction
Conservation profiles
Genome tree construction based on conservation
profiles
Conclusions
References.
42 Species tree - Tree Of Life
16S/18S rRNA tree (Woese 1990)
Woese and others have used rRNA comparisons to construct a Tree Of Life showing the evolutionary relationships of a wide variety of organisms.
The Tree Of Life has long served as a useful tool for describing the history and relationships of organisms over evolutionary time. One species is represented as a branching point, or node, on the tree, and the branches represent paths of descent from a parental node.
43 Martin & Embley. Nature 431: 152-5 (2004)
The two-empire proposal, separating eukaryotes from prokaryotes and eubacteria from archaebacteria. Mayr, E. PNAS 95: 9720-23 (1998).
The three-domain proposal based on the ribosomal RNA tree. Woese et al. PNAS 87: 4576-4579 (1990).
The ring of life, incorporating lateral gene transfer but preserving the prokaryote-eukaryote divide. Rivera & Lake JA. Nature 431: 152-5 (2004).
The three-domain proposal, with continuous lateral gene transfer among domains. Doolittle. Science 284: 2124-8 (1999).
45 Pennisi, E. (1998). Genome data shake tree of life. Science 280: 672-4.
New genome sequences are mystifying evolutionary biologists by revealing unexpected connections between microbes thought to have diverged hundreds of millions of years ago.
This suggests constructing species trees from their whole gene content.
48 Complete genomes: 2434 projects, 520 published (01-03-07)
1086 Bacteria, 59 Archaea, 696 eukaryotes, 73 metagenomes
[Figure labels: 433, 36, 46]
Tree of life
Abundance of genome data is raising expectations to accurately depict the evolutionary history of all genomes.
Idea: construct a species tree from many genes instead of only one gene.
http://www.genomesonline.org/
50 Problems with species tree construction
Main difficulties in species tree construction include extensive incongruence between alternative phylogenies generated from single-gene data sets:
- Genes don't evolve at the same rate nor in the same way
- the evolutionary history inferred from one gene may be different from what another gene appears to show.
51 Alternative solutions: integrative methods
Supertree
The supertree approach estimates phylogenies for subsets of genes with good overlap, then combines these subtree estimates into a supertree.
Depends on the ability to distinguish between orthologs and paralogs.
Supertree approaches are controversial, in part because the methodology results in a degree of disconnection between the underlying genetic data and the final tree produced.
52 Phylogenomic tree (based on concatenation of a gene sample common to the considered species)
- genes don't evolve at the same rate nor in the same way
- a limited number of genes are shared among all species
The tree of one percent (2006). Dagan and Martin. Genome Biology, 7: 118.
53 More generally, these methods suffer difficulties related to phylogenetic tree construction:
- global sequence alignment (quality, gaps, ...)
- different evolutionary histories of genes
- substitution saturation ...
and, more seriously, from gene sampling difficulties.
54 Gene tree - Species tree: The gene sampling problem
Adapted from Linder, Moret, Nakhleh, Warnow.
True species tree
55 Gene tree - Species tree: The gene sampling problem
All red orthologs have been lost in the 3 species.
56 Gene tree - Species tree: The gene sampling problem
All versions of the gene are in the 3 species
57 The genome tree is another alternative for constructing a species tree. The concept of genome tree is based on overall gene content similarity (considering more than single-gene information).
59 Systematic Analysis of Completely Sequenced Organisms
In silico species specific comparisons (Tekaia & Dujon. J. Mol. Evol. 1999)
(27 eucaryal, 19 archaeal and 33 bacterial species; 541880 proteins)
60 Systematic Analysis of Completely Sequenced Organisms
In silico species specific comparisons (27 eucaryal, 19 archaeal and 33 bacterial species; 541880 proteins)
Degree of ancestral duplication and of
ancestral conservation between pairs of species
Families of paralogs (Partition-MCL)
Families of orthologs (Partition-MCL)
Distribution of orthologous families according
to the three domains of life
Determination of the protein dictionary
(orthologs)
Determination of protein conservation profiles
61 Genome trees: data matrices
$T = \{T_{ij}\}$, $i = 1,\dots,n$; $j = 1,\dots,n$; n is the number of surveyed species and $T_{ij}$ is the overall similarity score between species j and i.
Ancestral duplication and ancestral conservation:
$T = \{T_{ij} = w_{ij}\}$, $w_{ij} = (\text{number of proteins in } j \text{ conserved in } i) / \mathrm{size}(j)$, $i = 1,\dots,n$; $j = 1,\dots,n$.
n = 99 species and T corresponds to 541880 total proteins.
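The definition of w_ij above can be sketched as follows. The input layout is hypothetical: `homologs[j]` maps each protein of species j to the set of species in which a homolog was detected. A protein is assumed to list its own species only when it has a paralog there, so that the diagonal reflects duplication and the off-diagonal terms reflect conservation.

```python
def conservation_matrix(species, homologs):
    """T[(i, j)] = (number of proteins of j conserved in i) / size(j)."""
    T = {}
    for i in species:
        for j in species:
            conserved = sum(1 for found_in in homologs[j].values() if i in found_in)
            T[(i, j)] = conserved / len(homologs[j])
    return T

# Hypothetical toy data: two species with four proteins each
species = ["A", "B"]
homologs = {
    "A": {"a1": {"B"}, "a2": set(), "a3": {"A", "B"}, "a4": {"A"}},
    "B": {"b1": {"A"}, "b2": set(), "b3": set(), "b4": set()},
}
T = conservation_matrix(species, homologs)
```

Because each column is scaled by size(j), the matrix is not symmetric in general: here half of the proteins of A are conserved in B, but only a quarter of those of B are conserved in A.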
63 species are clustered into 3 phylogenetic domains
bacterial species cluster with archaeal species
similar species cluster together
whole genome species clustering tree
very low resolution of deep clustering
64 Genome trees: data matrices
$T = \{T_{ij}\}$, $i = 1,\dots,n$; $j = 1,\dots,n$; n is the number of surveyed species and $T_{ij}$ is the overall similarity score between species j and i.
Shared orthologous genes:
$s_{ij}$ = number of shared orthologs between i and j
$T = \{T_{ij} = s_{ij} / \mathrm{size}(j)\}$, $i = 1,\dots,n$; $j = 1,\dots,n$
66Shared orthologous genes
67 3 phylogenetic domains
bacterial species cluster with archaeal species
similar species cluster together
better resolution of deep species clustering.
68 Large scale comparative analysis of predicted
proteomes revealed significant evolutionary
processes
Expansion, Exchange and Deletion are noise. They
should be eliminated or at least reduced.
69 To overcome some of these limitations, we consider genome tree construction from protein conservation profiles, and attempt to reduce noisy evolutionary processes.
70 Conservation profiles
99 species (B: 33, A: 19, E: 27), 541880 proteins
p = 0111111000111111111000110110111101001111101111
A conservation profile is an n-component binary vector describing a protein conservation pattern across n species. Components are 0 or 1, according to the absence or presence of homologs.
Main interesting properties of conservation profiles:
Conservation profiles are signatures of evolutionary relationships.
A conservation profile is the trace of protein evolutionary histories jointly captured in a set of n species (multidimensional feature).
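A conservation profile as defined above can be written down in a few lines. In this sketch the species names and the set of species holding a homolog are hypothetical inputs; how homologs are detected is outside its scope.

```python
def conservation_profile(found_in, species):
    """Binary string with '1' where a homolog is present and '0' where it is absent.

    `species` fixes the order of the n components; `found_in` is the set of
    species in which the protein has a homolog.
    """
    return "".join("1" if s in found_in else "0" for s in species)

# Hypothetical example with n = 4 species
species = ["sp1", "sp2", "sp3", "sp4"]
profile = conservation_profile({"sp1", "sp3"}, species)  # "1010"
```

Keeping the species order fixed for all proteins is what makes profiles comparable component by component.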
72 Distinct conservation profiles
original total proteins (99 species): 541880
non-specific proteins, i.e. conservation profiles (82%): 442460
distinct conservation profiles (42%): 184130
(one representative from each set of identical conservation profiles)
Effect of the duplication process is reduced.
This set is indicative of the various observed evolutionary histories.
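The two reduction steps above (dropping species-specific proteins, then keeping one representative per set of identical profiles) can be sketched as below. It assumes that each profile carries a 1 for the protein's own species, so that a species-specific protein has exactly one 1; that convention and the toy profiles are assumptions made for the illustration.

```python
def distinct_profiles(profiles):
    """Return (non_specific, distinct) from a list of binary profile strings.

    Assumption: a profile with a single '1' belongs to a species-specific protein.
    """
    non_specific = [p for p in profiles if p.count("1") >= 2]
    distinct = sorted(set(non_specific))  # one representative per identical set
    return non_specific, distinct

# Hypothetical toy profiles over 4 species
profiles = ["1000", "1100", "1100", "0110", "0010", "1111", "1111", "1111"]
non_specific, distinct = distinct_profiles(profiles)
# 6 non-specific profiles remain, of which 3 are distinct
```

Collapsing identical profiles means that a family expanded by duplication counts once, which is how the effect of the duplication process is reduced.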
74 Genome tree construction: data matrices
184130 d.c.prof (various evolutionary histories)
Jaccard similarity scores between species:
$s_{ij} = N_{11} / (N_{11} + N_{01} + N_{10})$, where $N_{11}$, $N_{01}$, $N_{10}$ are respectively the total occurrences of (1,1), (0,1) and (1,0) between i and j.
$T = \{T_{ij} = s_{ij}\}$, $i = 1,\dots,n$; $j = 1,\dots,n$
75 Genome trees: data matrices
$T = \{T_{ij}\}$, $i = 1,\dots,n$; $j = 1,\dots,n$; n is the number of surveyed species and $T_{ij}$ is the overall similarity score between species j and i.
Jaccard similarity scores:
$s_{ij} = N_{11} / (N_{11} + N_{01} + N_{10})$
$s_{ij}$ = Jaccard similarity score between species i and j
$T = \{T_{ij} = 100\, s_{ij}\}$, $i = 1,\dots,n$; $j = 1,\dots,n$
184130 d.c.prof
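The Jaccard score between two species can be computed directly from the columns of the profile table. The sketch below assumes profiles are stored as binary strings with one character per species; the function names and the toy profiles are hypothetical.

```python
def jaccard(profiles, i, j):
    """s_ij = N11 / (N11 + N01 + N10) over components i and j of all profiles."""
    n11 = n10 = n01 = 0
    for p in profiles:
        a, b = p[i] == "1", p[j] == "1"
        if a and b:
            n11 += 1
        elif a:
            n10 += 1
        elif b:
            n01 += 1
    denom = n11 + n01 + n10
    return n11 / denom if denom else 0.0  # (0,0) pairs are ignored by definition

def similarity_matrix(profiles, n):
    """T[i][j] = 100 * s_ij for n species."""
    return [[100 * jaccard(profiles, i, j) for j in range(n)] for i in range(n)]

# Hypothetical toy set of distinct conservation profiles over 3 species
profiles = ["110", "100", "101", "011", "111"]
T = similarity_matrix(profiles, 3)
```

Joint absences (0,0) do not enter the score, so two species are not made to look similar merely because both lack a protein.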
76 Profiles tree
Tekaia F, Yeramian E. (2005). PLoS Comput Biol. 1(7): e75
77 Partial trees associated with domains of life
A- Restrictions on lines:
$T_E = \{T_{ij} = 100\, s_{ij}\}$, $i = 1,\dots,p$; $j = 1,\dots,n$
B- Restrictions on lines and columns:
$T_E = \{T_{ij} = 100\, s_{ij}\}$, $i = 1,\dots,p$; $j = 1,\dots,p$
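The two restrictions can be read as selecting rows only (case A) or rows and columns (case B) of the similarity matrix. The sketch below illustrates that reading on a hypothetical 3 x 3 matrix in which the first two species are taken to belong to the domain of interest.

```python
def restrict(T, rows, cols=None):
    """Sub-matrix of T restricted to `rows`, and to `cols` when given."""
    if cols is None:
        cols = range(len(T[0]))
    return [[T[i][j] for j in cols] for i in rows]

# Hypothetical similarity matrix for n = 3 species; p = 2 species in the domain
T = [[100, 40, 10],
     [40, 100, 20],
     [10, 20, 100]]
domain = [0, 1]
case_a = restrict(T, domain)          # p x n: domain species against all species
case_b = restrict(T, domain, domain)  # p x p: domain species against each other
```

In case A the species of the domain are still described by their similarities to every surveyed species, whereas case B keeps only the relationships inside the domain.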
79 Conclusions: Methodology
Species classification is not an easy task!
Species tree construction should take into account the whole information included in the genomes.
Methods that take into account whole-genome information are still needed.
The correspondence analysis method might be helpful in revealing evolutionary trends embedded in the multidimensional relationships obtained from large-scale genome comparisons.
80 Conclusions...
Conservation profiles represent the most conserved and meaningful evolutionary signals jointly captured in a set of species.
Thus they should correspond to the most accurate type of markers for species classification.
In principle, the profiles tree derived from distinct conservation profiles should considerably minimize genome acquisition effects and should reflect less noisy phylogenetic signals.
The profiles tree presents evidence of conservation of stable phylogenetic relationships and reveals unconventional species clustering.
The profiles tree corresponds to the classification of the evolutionary scenarios.
81 References
Tekaia, F. and Dujon, B. (1999). Pervasiveness of gene conservation and persistence of duplicates in cellular genomes. Journal of Molecular Evolution, 49: 591-600.
Tekaia, F., Lazcano, A. and Dujon, B. (1999). Genome tree as revealed from whole proteome comparisons. Genome Res. 1217-25.
Tekaia, F., Yeramian, E. and Dujon, B. (2002). Amino acid composition of genomes, lifestyles of organisms, and evolutionary trends: a global picture with correspondence analysis. Gene 297: 51-60.
Tekaia, F. and Yeramian, E. (2005). Genome Trees from Conservation Profiles. PLoS Comput Biol. 1(7): e75.
Tekaia F, Latgé JP. (2005). Aspergillus fumigatus: saprophyte or pathogen? Curr Opin Microbiol. 8: 385-92. Review.
Tekaia, F. and Yeramian, E. (2006). Evolution of Proteomes: Fundamental signatures and global trends in amino acid composition. BMC Genomics. 7: 307.
Systematic analysis of completely sequenced organisms: http://www.pasteur.fr/tekaia/sacso.html
82 References
Bininda-Emonds ORP (2005). Supertree Construction in the Genomic Age. Methods in Enzymology 395: 745-757.
Bininda-Emonds ORP, John L. Gittleman, Mike A. Steel (2002). The (super)Tree Of Life: Procedures, Problems, and Prospects. Annual Review of Ecology and Systematics, Vol. 33: 265-289.
Dagan, T. and Martin, W. (2006). The tree of one percent. Genome Biology, 7: 118.
Delsuc F, Brinkmann H, Philippe H. (2005). Phylogenomics and the reconstruction of the tree of life. Nat Rev Genet. 6: 361-75. Review.
Doolittle (1999). Science 284: 2124-8.
Driskell, et al. (2004). Science, 306: 1172-1174.
http://www.genomesonline.org/gold.cgi (list of genome projects)
Keith A. Crandall and Jennifer E. Buhay (2004). Science, 306: 1144-1145.
Linder, Moret, Nakhleh, and Warnow. http://compbio.unm.edu/networks1.ppt
Martin & Embley (2004). Nature 431: 152-5.
MCL, a cluster algorithm for graphs: http://micans.org/mcl/
Pennisi, E. (1998). Genome data shake tree of life. Science 280: 672-4.
Rivera & Lake JA. (2004). Nature 431: 152-5.
Raoult et al. (2004). Science, 306: 1344-1350.
Snel, Bork, Huynen (1999). Genome phylogeny based on gene content. Nature Genetics 21, 108-110.
Snel B, Huynen MA, Dutilh BE (2005). Genome trees and the nature of genome evolution. Annu Rev Microbiol. 59: 191-209. Review.
Woese et al. (1990). PNAS 87: 4576-4579.