Title: PowerPoint Presentation
1 Large-scale proteome comparisons: Genome trees
Fredj Tekaia, Institut Pasteur, tekaia_at_pasteur.fr
2 Tree of life
Complete genomes (http://www.genomesonline.org/):
- 1387 projects
- 261 published (01-03-05)
- 654 prokaryotes
- 472 eukaryotes
Figure labels: 207, 21, 33
3 Cumulative number of available completely sequenced genomes
The number of completely sequenced genomes spanning the three domains of life is growing at a rapid rate.
List and references: GOLD
4 Genome sequencing projects
Several web-based resources document the progress of genome sequencing projects and the reference publication of each completely sequenced genome, including:
GOLD, Genomes Online Database: http://wit.integratedgenomics.com/GOLD/
GNN, Genome News Network: http://www.genomenewsnetwork.org/index.php
5 Resources for genomes
There are two main resources for genomes:
EBI, European Bioinformatics Institute: http://www.ebi.ac.uk/genomes/
NCBI, National Center for Biotechnology Information: http://www.ncbi.nlm.nih.gov
Many other resources are maintained by sequencing institutions:
Sanger, The Wellcome Trust Sanger Institute: http://www.sanger.ac.uk/
TIGR, The Institute for Genomic Research: http://www.tigr.org
Genolevures: http://cbi.labri.fr/Genolevures/index.php
6 Definitions
Genome: the complete set of DNA contained in a cell. The genome size is the total number of its DNA bases.
Gene: a particular DNA sequence located at a specific position on a chromosome that codes for a specific function.
Protein: a sequence of amino acids ordered according to the DNA sequence of the gene that codes for it.
Proteome: the set of proteins in an organism.
Genomics: the exhaustive study of genomes: genetic material, genes, their functions, their organization, ...
7 Chronology of completely sequenced genomes
1977: first viral genome (5386 base pairs encoding 11 genes). Sanger et al. sequence bacteriophage φX174.
1981: human mitochondrial genome, 16,500 base pairs (encodes 13 proteins, 2 rRNAs, 22 tRNAs).
1986: chloroplast genome, 156,000 base pairs (most are 120 kb to 200 kb).
8 1995: first genome of a free-living organism, the bacterium Haemophilus influenzae, by TIGR: 1830 kb, 1713 genes.
1996: first archaeal genome, Methanococcus jannaschii DSM 2661, by TIGR: 1664 kb, 1773 genes.
1997: first eukaryotic genome, Saccharomyces cerevisiae S288C, by an international collaboration: 16 chromosomes, 12,057 kb, 6000 genes.
1998: first multicellular organism, the nematode Caenorhabditis elegans: 97 Mb, 19,000 genes.
9 1999: first human chromosome, chromosome 22 (49 Mb, 673 genes).
10 2001: draft sequence of the human genome (x Mb, 28000 genes).
2002: Plasmodium falciparum (22.9 Mb, 5334 genes).
2002: mouse genome (x Mb, 28000 genes).
2004: fish, draft Tetraodon nigroviridis genome (x Mb, 28000 genes).
11 How big are genomes?
Viral genomes: 1 kb to 350 kb (Mimivirus: 1.2 Mb)
Bacterial genomes: 0.5 Mb to 13 Mb
Eukaryotic genomes: 8 Mb to 670 Gb
DOGS: http://www.cbs.dtu.dk/databases/DOGS/abbr_table.bysize.txt
12 Comparative genomics
Analysing the genetic material of different species helps us understand the similarities and differences between genomes, their evolution, and the evolution of their genes.
Intra-genomic comparisons help us understand the degree of duplication (of genome regions and genes) and the organization of genes.
Inter-genomic comparisons help us understand the degree of similarity between genomes and the degree of conservation between genes.
Both contribute to understanding gene and genome evolution.
13 Evolution
15 Evolutionary processes include (figure labels: ancestor, species genome, and selection)
16 Gene duplications are traditionally considered to be a major evolutionary source of new protein functions.
Understanding how duplications happened, and how important this evolutionary process is, is a key goal of genome analysis.
Some examples
17 Why compare genomes? Analysing and comparing the genetic material of different species helps us understand the similarities and differences between genomes and their evolution, and study the evolution and functions of their genes.
- Genome structure: global statistics, repeats, rearrangements, synteny, breakpoints, ...
- Coding regions: number of genes and proteins, paralogs, orthologs, ...
- Non-coding regions: regulatory elements, ...
- Duplicated and conserved regions, ...
18Colours reveal Duplications
19-20 Kellis et al. Nature, 2004
21-23 Nature Reviews Genetics 3: 827-837 (2002). Splitting pairs: the diverging fates of duplicated genes.
24-25 Hurles M (2004) Gene Duplication: The Genomic Trade in Spare Parts. PLoS Biol 2(7): e206.
26 Genome duplication. a, Distribution of Ks values of duplicated genes in Tetraodon (left) and Takifugu (right) genomes. Duplicated genes broadly belong to two categories, depending on their Ks value being below or higher than 0.35 substitutions per site since the divergence between the two puffer fish (arrows). b, Global distribution of ancient duplicated genes (Ks > 0.35) in the Tetraodon genome. The 21 Tetraodon chromosomes are represented in a circle in numerical order and each line joins duplicated genes at their respective position on a given pair of chromosomes.
Jaillon et al. Nature 431, 946-957. 2004.
27-28 Jaillon et al. Nature 431, 946-957. 2004.
29 Genome comparisons: 3 lines of research
- 1. Comparing genomes to understand the similarities and differences between them.
- 2. Comparing genomes to predict the functions of the genes, ... of a new genome, and thereby to study and understand evolution.
- 3. Developing efficient algorithms for comparing entire genome sequences.
30 The genome of Escherichia coli: 4,639,221 bp
A 24.6%, C 25.4%, G 25.3%, T 24.5%
U00096 ECOLI Escherichia coli K-12 MG1655 complete genome
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGG
ATTAAAAAAAGAGTGTCTGATAGCAGC TTCTGAACTGGTTACCTGCCGT
GAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA T
ATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCAT
GAAACGCATTAGCACCACC ATTACCACCACCATCACCATTACCACAGGT
AACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAG CCCGCACCT
GACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATG
CGAGTGTTGAA GTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTC
TGCGTGTTGCCGATATTCTGGAAAGCAATGCC AGGCAGGGGCAGGTGGC
CACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGA
TTG AAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGC
CGAACGTATTTTTGCCGAACTTTT GACGGGACTCGCCGCCGCCCAGCCG
GGGTTCCCGCTGGCGCAATTGAAAACTTTCGTCGATCAGGAATTT GCCC
AAATAAAACATGTCCTGCATGGCATTAGTTTGTTGGGGCAGTGCCCGGAT
AGCATCAACGCTGCGC TGATTTGCCGTGGCGAGAAAATGTCGATCGCCA
TTATGGCCGGCGTATTAGAAGCGCGCGGTCACAACGT TACTGTTATCGA
TCCGGTCGAAAAACTGCTGGCAGTGGGGCATTACCTCGAATCTACCGTCG
ATATTGCT GAGTCCACCCGCCGTATTGCGGCAAGCCGCATTCCGGCTGA
TCACATGGTGCTGATGGCAGGTTTCACCG CCGGTAATGAAAAAGGCGAA
CTGGTGGTGCTTGGACGCAACGGTTCCGACTACTCTGCTGCGGTGCTGGC
TGCCTGTTTACGCGCCGATTGTTGCGAGATTTGGACGGACGTTGACGGG
GTCTATACCTGCGACCCGCGT CAGGTGCCCGATGCGAGGTTGTTGAAGT
CGATGTCCTACCAGGAAGCGATGGAGCTTTCCTACTTCGGCG CTAAAGT
TCTTCACCCCCGCACCATTACCCCCATCGCCCAGTTCCAGATCCCTTGCC
TGATTAAAAATAC CGGAAATCCTCAAGCACCAGGTACGCTCATTGGTGC
CAGCCGTGATGAAGACGAATTACCGGTCAAGGGC ATTTCCAATCTGAAT
AACATGGCAATGTTCAGCGTTTCTGGTCCGGGGATGAAAGGGATGGTCGG
CATGG CGGCGCGCGTCTTTGCAGCGATGTCACGCGCCCGTATTTCCGTG
GTGCTGATTACGCAATCATCTTCCGA ATACAGCATCAGTTTCTGCGTTC
CACAAAGCGACTGTGTGCGAGCTGAACGGGCAATGCAGGAAGAGTTC TA
CCTGGAACTGAAAGAAGGCTTACTGGAGCCGCTGGCAGTGACGGAACGGC
TGGCCATTATCTCGGTGG TAGGTGATGGTATGCGCACCTTGCGTGGGAT
CTCGGCGAAATTCTTTGCCGCACTGGCCCGCGCCAATAT CAACATTGTC
GCCATTGCTCAGGGATCTTCTGAACGCTCAATCTCTGTCGTGGTAAATAA
CGATGATGCG ACCACTGGCGTGCGCGTTACTCATCAGATGCTGTTCAAT
ACCGATCAGGTTATCGAAGTGTTTGTGATTG GCGTCGGTGGCGTTGGCG
GTGCGCTGCTGGAGCAACTGAAGCGTCAGCAAAGCTGGCTGAAGAATAAA
CA TATCGACTTACGTGTCTGCGGTGTTGCCAACTCGAAGGCTCTGCTCA
CCAATGTACATGGCCTTAATCTG GAAAACTGGCAGGAAGAACTGGCGCA
AGCCAAAGAGCCGTTTAATCTCGGGCGCTTAATTCGCCTCGTGA AAGAA
TATCATCTGCTGAACCCGGTCATTGTTGACTGCACTTCCAGCCAGGCAGT
GGCGGATCAATATGC CGACTTCCTGCGCGAAGGTTTCCACGTTGTCACG
CCGAACAAAAAGGCCAACACCTCGTCGATGGATTAC TACCATCAGTTGC
GTTATGCGGCGGAAAAATCGCGGCGTAAATTCCTCTATGACACCAACGTT
GGGGCTG GATTACCGGTTATTGAGAACCTGCAAAATCTGCTCAATGCAG
GTGATGAATTGATGAAGTTCTCCGGCAT TCTTTCTGGTTCGCTTTCTTA
TATCTTCGGCAAGTTAGACGAAGGCATGAGTTTCTCCGAGGCGACCACG
CTGGCGCGGGAAATGGGTTATACCGAACCGGACCCGCGAGATGATCTTTC
TGGTATGGATGTGGCGCGTA AACTATTGATTCTCGCTCGTGAAACGGGA
CGTGAACTGGAGCTGGCGGATATTGAAATTGAACCTGTGCT GCCCGCAG
AGTTTAACGCCGAGGGTGATGTTGCCGCTTTTATGGCGAATCTGTCACAA
CTCGACGATCTC TTTGCCGCGCGCGTGGCGAAGGCCCGTGATGAAGGAA
AAGTTTTGCGCTATGTTGGCAATATTGATGAAG ATGGCGTCTGCCGCGT
GAAGATTGCCGAAGTGGATGGTAATGATCCGCTGTTCAAAGTGAAAAATG
GCGA AAACGCCCTGGCCTTCTATAGCCACTATTATCAGCCGCTGCCGTT
GGTACTGCGCGGATATGGTGCGGGC AATGACGTTACAGCTGCCGGTGTC
TTTGCTGATCTGCTACGTACCCTCTCATGGAAGTTAGGAGTCTGAC ATG
GTTAAAGTTTATGCCCCGGCTTCCAGTGCCAATATGAGCGTCGGGTTTGA
TGTGCTCGGGGCGGCGG TGACACCTGTTGATGGTGCATTGCTCGGAGAT
GTAGTCACGGTTGAGGCGGCAGAGACATTCAGTCTCAA ...........
.................................
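The base percentages quoted for the E. coli genome can be recomputed from any DNA string. The following is a minimal sketch, not the author's tool: the function name `base_composition` is hypothetical, and the fragment used is only the first few dozen bases of the sequence shown on the slide, so its percentages will differ from the whole-genome values.

```python
from collections import Counter


def base_composition(sequence):
    """Return the percentage of A, C, G and T in a DNA string.

    Characters other than A, C, G, T (spaces, dots, line breaks left
    over from the slide layout) are ignored.
    """
    bases = [b for b in sequence.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("no A/C/G/T characters found")
    counts = Counter(bases)
    total = len(bases)
    return {b: 100.0 * counts[b] / total for b in "ACGT"}


# First bases of the U00096 sequence as printed on the slide (spaces kept).
fragment = "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGG ATTAAAAAAAGAGTGTCTGATAGCAGC"
for base, pct in base_composition(fragment).items():
    print(f"{base}: {pct:.1f}%")
```

Run on the complete genome file, the same function should give values close to the A, C, G, T percentages listed on the slide.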
31 thrA: enzyme, amino acid biosynthesis (threonine)
MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAP
AKITNHLVAMIEKTISGQ DALPNISDAERIFAELLTGLAAAQPGFPLAQ
LKTFVDQEFAQIKHVLHGISLLGQCPD SINAALICRGEKMSIAIMAGVL
EARGHNVTVIDPVEKLLAVGHYLESTVDIAESTRRI AASRIPADHMVLM
AGFTAGNEKGELVVLGRNGSDYSAAVLAACLRADCCEIWTDVDGV YTCD
PRQVPDARLLKSMSYQEAMELSYFGAKVLHPRTITPIAQFQIPCLIKNTG
NPQA PGTLIGASRDEDELPVKGISNLNNMAMFSVSGPGMKGMVGMAARV
FAAMSRARISVVL ITQSSSEYSISFCVPQSDCVRAERAMQEEFYLELKE
GLLEPLAVTERLAIISVVGDGM RTLRGISAKFFAALARANINIVAIAQG
SSERSISVVVNNDDATTGVRVTHQMLFNTDQ VIEVFVIGVGGVGGALLE
QLKRQQSWLKNKHIDLRVCGVANSKALLTNVHGLNLENWQ EELAQAKEP
FNLGRLIRLVKEYHLLNPVIVDCTSSQAVADQYADFLREGFHVVTPNKK
ANTSSMDYYHQLRYAAEKSRRKFLYDTNVGAGLPVIENLQNLLNAGDELM
KFSGILSG SLSYIFGKLDEGMSFSEATTLAREMGYTEPDPRDDLSGMDV
ARKLLILARETGRELEL ADIEIEPVLPAEFNAEGDVAAFMANLSQLDDL
FAARVAKARDEGKVLRYVGNIDEDGV CRVKIAEVDGNDPLFKVKNGENA
LAFYSHYYQPLPLVLRGYGAGNDVTAAGVFADLLR TLSWKLGV
32 Intra-genomic comparisons
Simple descriptions and compositional analysis (gene sizes, gene organization, amino acid and nucleotide composition, ...)
Genome content
Rate of duplication (paralogs, gene families)
33 Intra-genomic comparisons: simple description (the genes, the distribution of their sizes, their nucleotide and amino acid compositions, ...); description of the gene content of a genome, its specific genes, the degree of gene duplication, the gene families, and the organization of genes within the genome.
34 Inter-genomic comparisons
Compositional comparisons between species (nucleotide and amino acid compositions)
Gene and protein conservation between species (rate of conservation)
Orthologs, families of orthologs
Specific and non-specific genes
Genes exclusively conserved in one species or in a subset of species (or in domains)
Gene dictionary. Gene conservation profiles. Genome tree construction. Multiple genome alignments.
35 Inter-genomic comparisons: composition in bases, codons, amino acids, ...; the degree of conservation between genomes; orthologs and families of orthologs. Gene dictionary, gene conservation profiles, construction of genome trees. Multiple alignment of genomes.
37Amino Acid composition
38Tekaia, F., Yeramian, E. and Dujon B. (2002)
Gene. 297 pp. 51-60.
39 (Figure labels: Growth t, GC, 2005)
40PE, PPE families
41Protein size statistics
42Proteome comparisons Methodology
43 Comparison files (figure labels):
P1 (proteome 1): bestp1np, allp1np, segmatchp1np; bestnpp1, allnpp1, segmatchnpp1
Pn (proteome n): bestpnnp, allpnnp, segmatchpnnp; bestnppn, allnppn, segmatchnppn
SPECSO
bestnppi: np1 size pij e-value1 HS/IS/NS
allnppi: np1 size pij e-value1 HS/IS/NS np1 size pik e-value HS/IS/NS
100 species: E 28, A 19, B 53
Paralogs, Orthologs
The expected number of HSPs with score at least S is given by $E = K m n e^{-\lambda S}$, where m and n are the sequence and database lengths.
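The E-value formula above can be evaluated directly. This is a small sketch, assuming the function name `expected_hsps` and the numerical values of K and lambda, which are illustrative only and are not taken from the slides; in practice both depend on the scoring matrix and gap penalties used by BLAST.

```python
import math


def expected_hsps(score, m, n, K, lam):
    """Expected number of HSPs with score >= `score`: E = K * m * n * exp(-lam * score)."""
    return K * m * n * math.exp(-lam * score)


# Illustrative (assumed) parameters: a 300-residue query against a
# database of 541880 proteins averaging 350 residues each.
K, lam = 0.041, 0.267
m, n = 300, 541880 * 350
for s in (40, 80, 120):
    print(s, expected_hsps(s, m, n, K, lam))
```

The loop shows the main property of the formula: E decreases exponentially as the score S increases, and grows linearly with the size of the search space m x n.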
46Example
Comparing S. cerevisiae (SC) genome with C.
elegans (CE) genome
47SC vs SC
49SC/CE
CE/SC
Orthologs
51 Partitions/MCL Clustering
A set of genes defines a "partition" if and only if:
- a) each member of the set has at least one significant match with another member of the set;
- b) no member of the set has significant matches with members not included in the set;
- c) the set is minimal.
MCL: Markov Cluster algorithm. Stijn van Dongen, A cluster algorithm for graphs. http://micans.org/mcl/
Each gene is identified by its partition and its MCL cluster.
Figure labels: P7.1, P4.1, P4.2.C3.1, P7.1.C3.1
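Conditions a) to c) above can be read as describing the connected components, with at least two members, of the graph whose edges are significant matches. The sketch below follows that reading, which is an interpretation rather than the author's stated implementation; the function name `partitions` and the gene names are hypothetical.

```python
from collections import defaultdict


def partitions(matches):
    """Group genes into partitions from (gene_a, gene_b) significant matches.

    Returns a list of sets, largest first. Genes whose only match is a
    self-match are left out, since condition a) requires a match with
    another member of the set.
    """
    neighbours = defaultdict(set)
    for a, b in matches:
        if a != b:
            neighbours[a].add(b)
            neighbours[b].add(a)

    seen, result = set(), []
    for start in neighbours:
        if start in seen:
            continue
        # Depth-first walk collecting everything reachable from `start`.
        component, stack = set(), [start]
        while stack:
            gene = stack.pop()
            if gene in component:
                continue
            component.add(gene)
            stack.extend(neighbours[gene] - component)
        seen |= component
        result.append(component)
    return sorted(result, key=len, reverse=True)


hits = [("g1", "g2"), ("g2", "g3"), ("g4", "g5"), ("g6", "g6")]
for index, members in enumerate(partitions(hits), start=1):
    print(f"P{len(members)}.{index}", sorted(members))
```

The printed labels only imitate the P<size>.<index> style of the figure labels; the actual numbering scheme used on the slides is not specified.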
52 Markov Cluster (MCL) algorithm http://micans.org/mcl/
Traditionally, most methods deal with similarity relationships in a pairwise manner, while graph theory allows classification of proteins into families based on a global treatment of all relationships in similarity space simultaneously.
Similarities between proteins are arranged in a matrix that represents a connection graph.
Nodes of the graph represent proteins, and edges represent the sequence similarities that connect them.
A weight is assigned to each edge by taking -log10(E-value) obtained from a BLAST comparison.
53 These weights are transformed into probabilities associated with a transition from one protein to another within this graph.
This matrix is passed through iterative rounds of matrix multiplication (expansion) and matrix inflation until there is little or no net change in the matrix. The final matrix is then interpreted as a protein family clustering.
The inflation parameter of the MCL algorithm controls the granularity of these clusters.
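The expansion and inflation rounds described above can be sketched in a few lines of plain Python. This is a toy illustration, not the mcl program itself: the function names, the cap of 200 on edge weights, the self-loop weight (the largest weight in each column, or 1.0 for an isolated node), the convergence tolerance and the rule for reading clusters from the final matrix are all assumptions made for this example.

```python
import math


def edge_weight(evalue, cap=200.0):
    """Edge weight -log10(E-value); E-values reported as 0 get the (assumed) cap."""
    return cap if evalue <= 0 else min(-math.log10(evalue), cap)


def normalise_columns(matrix):
    """Scale every column to sum to 1 (transition probabilities)."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def mcl(weights, inflation=2.0, max_rounds=100, tol=1e-9):
    """Cluster a symmetric weight matrix with expansion/inflation rounds."""
    n = len(weights)
    m = [row[:] for row in weights]
    for j in range(n):  # add self-loops (assumed weighting, see lead-in)
        m[j][j] = max(weights[i][j] for i in range(n)) or 1.0
    m = normalise_columns(m)
    for _ in range(max_rounds):
        # Expansion: matrix multiplication m * m.
        expanded = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
                    for i in range(n)]
        # Inflation: entry-wise power, then column re-normalisation.
        inflated = normalise_columns([[x ** inflation for x in row] for row in expanded])
        change = max(abs(inflated[i][j] - m[i][j]) for i in range(n) for j in range(n))
        m = inflated
        if change < tol:
            break
    # Each row with non-zero entries lists the members of one cluster.
    clusters = []
    for row in m:
        members = frozenset(j for j, x in enumerate(row) if x > 1e-6)
        if members and members not in clusters:
            clusters.append(members)
    return [sorted(c) for c in clusters]


# Hypothetical BLAST hits: (protein index, protein index, E-value).
hits = [(0, 1, 1e-50), (0, 2, 1e-50), (1, 2, 1e-50), (3, 4, 1e-30)]
weights = [[0.0] * 5 for _ in range(5)]
for a, b, e in hits:
    weights[a][b] = weights[b][a] = edge_weight(e)
print(mcl(weights))
```

Raising `inflation` tends to produce smaller, tighter clusters, which is the granularity control mentioned on the slide.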
54 blastp proteome specific comparisons
all protein significant hits
Adapted from Enright et al. NAR 2002.
55Example of Partition/MCL clustering
P6 19 Total number of distinct ORFs
6 --------------------
56Example of Partition/MCL clustering
P6 22 Total number of distinct ORFs
6 --------------------
57Large scale predicted proteome comparisons
58 Gene Dictionary
Table: 541880 predicted proteins x 100 species
60 Ancestral weight matrix
Wii: weight of ancestral duplication
Wij: weight of ancestral conservation of i in j
nsi: nonspecific genes in species i
(Figure: matrix over species i and j, labelled Wii, Wij, Wjj, nsi and nsj.)
62 Ancestral duplication
Figure panels: A, B, E
mean: 52.1, 30., 38.4
std: 17.8, 11.7, 11.2
63 Specific and nonspecific proteins
Large-scale proteome comparisons allow estimation of the proportions of specific and non-specific proteins.
Specific proteins (genes) are proteins that have no match outside their own proteome (no homolog in other species).
Non-specific proteins (genes) are proteins that are conserved in at least one other species (they have homologs outside their own proteome).
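The two definitions above translate into a simple classification once significant matches are known. The following is a minimal sketch under stated assumptions: the function name `split_specific`, the protein identifiers and the species labels are hypothetical, and a match is treated as symmetric (if q matches s, then s is also counted as having a homolog in q's species).

```python
def split_specific(species_of, hits):
    """Split proteins into (specific, non_specific) sets.

    species_of: dict mapping protein id -> species.
    hits: iterable of (query, subject) significant matches.
    """
    non_specific = set()
    for query, subject in hits:
        if species_of[query] != species_of[subject]:
            non_specific.add(query)
            non_specific.add(subject)
    specific = set(species_of) - non_specific
    return specific, non_specific


species_of = {"sc1": "SC", "sc2": "SC", "sc3": "SC", "ce1": "CE", "ce2": "CE"}
hits = [("sc1", "ce1"), ("sc2", "sc3"), ("sc1", "sc2")]
specific, non_specific = split_specific(species_of, hits)
print(sorted(specific), sorted(non_specific))
```

Note that sc2 and sc3 match each other (paralogs) but have no match outside their own proteome, so they remain specific under this definition.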
64Specific and nonspecific proportions
mean 76.2 84.3 87.6
65Species specific genes
66Orthologs
100 species 367143 orthologs
67Structural orthologs according to the 3 domains
of life (100 species 367143 orthologous genes)
2
Total Partitions 37826
note 6 include genes from at least 2 domains
of life.
68Evolution by Module
69Evolution by Module (A. gambiae paralogs)
70GST orthologs
71Genome trees
73 The two-empire proposal, separating eukaryotes from prokaryotes and eubacteria from archaebacteria. Mayr, E. PNAS 95: 9720-9723. (1998).
The three-domain proposal based on the ribosomal RNA tree. Woese et al. PNAS 87: 4576-4579. (1990)
The three-domain proposal, with continuous lateral gene transfer among domains. Doolittle. Science 284: 2124-2128. (1999)
The ring of life, incorporating lateral gene transfer but preserving the prokaryote-eukaryote divide. Rivera MC and Lake JA. Nature 431: 152-155. (2004)
Martin and Embley, Nature 431: 134-137. 2004
74 The 1.2-Megabase Genome Sequence of Mimivirus. Didier Raoult, Stéphane Audic, Catherine Robert, Chantal Abergel, Patricia Renesto, Hiroyuki Ogata, Bernard La Scola, Marie Suzan, Jean-Michel Claverie. Science 306: 1344-1350. (2004)
The tree was inferred with the use of a maximum likelihood method based on the concatenated sequences of seven universally conserved protein sequences: arginyl-tRNA synthetase, methionyl-tRNA synthetase, tyrosyl-tRNA synthetase, RNA polymerase II largest subunit, RNA polymerase II second largest subunit, PCNA, and 5'-3' exonuclease. The alignment contains 3164 sites without insertions and deletions. Bootstrap percentages are shown along the branches.
75 Evolutionary biology: Early evolution comes full circle. Martin W, Embley TM. Nature 431: 134-137. (2004)
The ring of life provides evidence for a genome fusion origin of eukaryotes. Rivera MC, Lake JA. Nature 431: 152-155. (2004)
Our analyses indicate that the eukaryotic genome resulted from a fusion of two diverse prokaryotic genomes, and therefore at the deepest levels linking prokaryotes and eukaryotes, the tree of life is actually a ring of life.
76 Genomic Databases and the Tree of Life. Keith A. Crandall and Jennifer E. Buhay. Science 306: 1144-1145. (2004)
Prospects for Building the Tree of Life from Large Sequence Databases. Amy C. Driskell, Cécile Ané, J. Gordon Burleigh, Michelle M. McMahon, Brian C. O'Meara, Michael J. Sanderson. Science 306: 1172-1174. (2004)
77 Species tree: 16S/18S rRNA tree (Woese 1990)
Main difficulties include extensive incongruence between alternative phylogenies generated from single-gene data sets.
Alternative solutions are integrative methods:
- supertree (consensus tree from a set of individual gene phylogenetic trees)
- phylogenomic tree based on the concatenation of a gene sample common to the considered species
(These methods suffer from difficulties related to phylogenetic tree construction: difficulties with global sequence alignment, variations in substitution rates between species, ...)
78 Genome trees The concept of genome tree is
based on overall gene content similarity
Genome trees consider more than single gene
information
80Evolutionary processes include
Ancestor
species genome
and selection
81 Universal tree (Woese 1990): 16S rRNA (most conserved sequences)
Main difficulties include extensive incongruence between alternative phylogenies generated from single-gene data sets.
Can a tree take into account the whole make-up of the species' genomes?
82 Species trees: tentative construction
Supertrees:
- Consensus tree from a set of gene phylogenetic trees
- Phylogenetic trees based on a concatenated gene sample common to all considered species
- Still gene trees!
Genome trees:
- based on overall genome content similarity
- a mixed reflection of species phylogenetic relationships and other evolutionary processes, including gene acquisition and gene loss.
83 Genome trees: data matrices
$T = (T_{ij})$, $i = 1, \dots, n$; $j = 1, \dots, n$, where n is the number of surveyed species and $T_{ij}$ is the overall similarity score between species j and i.
Ancestral duplication and ancestral conservation (541880 total proteins):
$T_{ij} = w_{ij} = (\text{number of proteins in } j \text{ conserved in } i) / \text{size}(j)$
Shared orthologous genes (442460 non-specific proteins):
$s_{ij}$ = shared orthologs between i and j; $T_{ij} = s_{ij} / \text{size}(j)$
Distinct shared conservation profiles (28365 / 184130 distinct conservation profiles):
$s_{ij}$ = distinct shared conservation profiles between i and j; $T_{ij} = s_{ij} / s_{jj}$
85 Whole-genome species clustering tree:
- species are clustered into 3 phylogenetic domains
- bacterial species cluster with archaeal species
- similar species cluster together
- low resolution of deep clustering
- evolutionary side effects are taken into account
Tekaia, F., Lazcano, A. and B. Dujon (1999). Genome Res. 1217-25.
86 Shared orthologous genes (partial)
87 Shared orthologous genes tree:
- 3 phylogenetic domains
- bacterial species cluster with archaeal species
- similar species cluster together
- better resolution of deep species clustering
- evolutionary side effects (HGT, duplication, loss) are not completely eliminated
88 Conservation profiles
p = 011111100011111111100011011011110100111111101111
A conservation profile is an n-component vector describing a protein's conservation pattern across n species. Components are 0 or 1, according to the absence or presence of homologs.
A conservation profile is the trace of protein evolutionary histories jointly captured in a set of species (a multidimensional feature).
Conservation profiles are signatures of evolutionary relationships.
Considering distinct conservation profiles reduces the effects of noisy evolutionary processes (less noisy phylogenetic signals).
Each conservation profile brings an equal amount of information, regardless of the size of the set of genes that have identical conservation profiles.
Conservation profiles give evidence of evolutionary history in a set of species.
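A conservation profile as defined above is straightforward to build once the species containing a homolog of each protein are known. The sketch below is illustrative only: the function name `conservation_profile`, the four species labels and the protein ids are hypothetical, and the profiles are stored as strings of 0 and 1.

```python
from collections import Counter


def conservation_profile(homolog_species, species_order):
    """Return a 0/1 string: 1 where the protein has a homolog in that species."""
    return "".join("1" if s in homolog_species else "0" for s in species_order)


species = ["S1", "S2", "S3", "S4"]
# Species in which each protein has a homolog (its own species included).
homologs = {
    "p1": {"S1", "S2"},
    "p2": {"S1", "S2"},
    "p3": {"S1", "S2", "S3", "S4"},
    "p4": {"S3"},
}
profiles = {p: conservation_profile(hs, species) for p, hs in homologs.items()}
distinct = Counter(profiles.values())  # distinct profile -> number of proteins
for profile, count in distinct.items():
    print(profile, count)
```

Keeping only the keys of `distinct` gives each profile the same weight regardless of how many proteins share it, which is the reduction the slide describes.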
90 541880 proteins, 100 species
442460 non-specific proteins, i.e. conservation profiles
184130 distinct conservation profiles
Drastic reduction: 28365 distinct conservation profiles associated with at least 2 proteins from distinct species
92 Occurrences of shared conservation profiles
$T_{ij} = s_{ij} / s_{jj}$, where $s_{ij}$ is the number of occurrences of distinct shared conservation profiles between species i and j.
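The ratio above can be computed from a list of profiles. This sketch rests on one assumption that the slides do not spell out: a distinct profile is counted as shared between species i and j when it has a 1 in both positions, so that s_jj is the number of distinct profiles present in species j. The function name `profile_similarity` is hypothetical.

```python
def profile_similarity(profiles):
    """Return T with T[i][j] = s_ij / s_jj over distinct 0/1 profile strings."""
    distinct = sorted(set(profiles))
    n = len(distinct[0])
    # s[i][j]: number of distinct profiles with a 1 at both positions i and j.
    s = [[sum(1 for p in distinct if p[i] == "1" and p[j] == "1")
          for j in range(n)] for i in range(n)]
    return [[s[i][j] / s[j][j] if s[j][j] else 0.0 for j in range(n)]
            for i in range(n)]


T = profile_similarity(["110", "110", "011", "111"])
for row in T:
    print([round(x, 3) for x in row])
```

Because the denominator depends on j, T is generally not symmetric: in this toy input T[0][1] and T[1][0] differ.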
(Figure: matrix of 0/1 conservation profiles, one row per profile, with columns S1 ... Sn grouped into E, A and B.)
94Profiles
Conservation
Orthologs
95Paralogs - orthologs Evolution by modules
- MEME - MAST segmatch (blast) - PAML
- topology
- quantitative analysis
96 CP6300 orthologs (transcription factor)
PagCP6300.html
CP12790 orthologs (maltase) agCP12790.orth.html
97 Table 1: Internet resources for whole-genome comparative analysis and associated tools
Resource: URL
UCSC Genome Bioinformatics: http://genome.ucsc.edu/
Ensembl: http://www.ensembl.org/
MapViewer: http://www.ncbi.nlm.nih.gov/mapview/
VISTA Genome Browser: http://pipeline.lbl.gov/
K-BROWSER: http://hanuman.math.berkeley.edu/cgi-bin/kbrowser2
Comparative Regulatory Genomics: http://corg.molgen.mpg.de/
GALA: http://www.bx.psu.edu/
EnsMart: http://www.ensembl.org/EnsMart/
ETOPE: http://www.bx.psu.edu/
PipMaker and MultiPipMaker: http://www.bx.psu.edu/
VISTA server: http://www-gsd.lbl.gov/vista/
MAVID server: http://baboon.math.berkeley.edu/mavid/
zPicture server: http://zpicture.dcode.org/
rVISTA server: http://rvista.dcode.org/
98 Tekaia, F. and B. Dujon (1999). Pervasiveness of gene conservation and persistence of duplicates in cellular genomes. Journal of Molecular Evolution 49: 591-600.
Tekaia, F., Lazcano, A. and B. Dujon (1999). Genome tree as revealed from whole proteome comparisons. Genome Res. 1217-25.
Tekaia, F., Gordon, S.V., Garnier, T., Brosch, R., Barrell, B.G. and S.T. Cole (1999). Analysis of the proteome of Mycobacterium tuberculosis in silico. Tubercle and Lung Disease 79: 329-342.
Genolevures program:
- F. Tekaia, G. Blandin, A. Malpertuy, et al. (2000). Methods and strategies used for sequence analysis and annotation. FEBS 487(1): 17-30.
- A. Malpertuy, F. Tekaia, S. Casaregola, et al. (2000). Yeast specific genes. FEBS 487(1): 113-121.
- G. Blandin, P. Durrens, F. Tekaia, et al. (2000). The genome of Saccharomyces cerevisiae revisited. FEBS 487(1): 31-36.
Tekaia, F., Yeramian, E. and Dujon B. (2002). Amino acid composition of genomes, lifestyle of organisms and evolutionary trends: a global picture with correspondence analysis. Gene 297: 51-60.
Tekaia, F., Yeramian, E. (in prep). Genome tree based on conservation profiles.
99 Systematic analysis of completely sequenced organisms: http://www.pasteur.fr/tekaia/sacso.html