Title: What has genomics contributed to animal phylogeny?
1. What has genomics contributed to animal phylogeny?
Hervé Philippe, Département de Biochimie, Centre Robert Cedergren, Université de Montréal, Succursale Centre-Ville, Montréal, Québec H3C 3J7, Canada
2. Cambrian explosion: a paleontological perspective
[Images: Marrella, Aysheaia, Halkieria, Pikaia]
3. Cambrian explosion: a molecular perspective
Molecular phylogenies should resolve series of speciation events separated by a few million years.
[Figure: timeline from 700 to 0 MYa, with the Cambrian explosion marked]
4. Lack of resolution in molecular phylogenetics
- (1) Inadequate selection of sequences (non-orthologous, saturated, etc.)
- (2) Inadequate tree reconstruction method
- (3) Inadequate taxon sampling
- (4) Rapid diversification of species
- Points (1), (2) and (3) are always mixed
- A (simplistic) theoretical overview
- Analyses of several case studies
- A molecular dating approach
5. Cambrian explosion: a molecular perspective
Bootstrap support ≥ 95% requires ≥ 3 substitutions on the corresponding branch (Felsenstein, 1985)
[Figure: timeline from 700 to 0 MYa, with the Cambrian explosion and a branch of duration ΔT marked]
18S ribosomal RNA (1000 positions): 100 substitutions over 500 MY → resolution for branches with ΔT ≥ 15 MY
146 genes (Delsuc et al. 2006, 33,800 positions): 7000 substitutions over 500 MY → resolution for branches with ΔT ≥ 0.25 MY
50 genes (Rokas et al. 2005, 12,060 positions): 2400 substitutions over 500 MY → resolution for branches with ΔT ≥ 0.7 MY
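The arithmetic behind these thresholds can be sketched as follows. This is a minimal illustration that assumes substitutions accumulate at a constant rate over the 500 MY, so that the shortest resolvable branch is the time needed to accumulate three substitutions. The function name `min_resolvable_interval` is hypothetical. The results land close to the rounded figures quoted on the slide.

```python
def min_resolvable_interval(substitutions, time_span_my, threshold=3):
    """Shortest branch duration (in MY) expected to carry `threshold` substitutions.

    Assumes a constant rate: rate = substitutions / time_span_my,
    so interval = threshold / rate.
    """
    return threshold * time_span_my / substitutions


# Substitution counts and the 500 MY time span are the ones quoted on the slide.
datasets = {
    "18S rRNA (1000 positions)": 100,
    "146 genes (Delsuc et al. 2006)": 7000,
    "50 genes (Rokas et al. 2005)": 2400,
}

for name, subs in datasets.items():
    interval = min_resolvable_interval(subs, 500)
    print(f"{name}: branches of at least {interval:.2f} MY")
```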
6. Phylogenetic signal
true history
7. Comparison of ML SSU and LSU trees (A and B, respectively)
Medina M. et al. PNAS 2001; 98: 9707-9712
8. 50 genes (12,060 amino acid positions), ML rtREV+I+Γ / MP bootstrap support
Rokas et al. (2005) Animal evolution and the molecular signature of radiations compressed in time. Science, 310: 1933-1938
9. Phylogenetic signal
true history
10. Non-phylogenetic signal
Sequences evolve according to a very complex and heterogeneous process, which our tree reconstruction methods approximate as best they can using elaborate models of sequence evolution.
Real complexities: the mutation process is not homogeneous over time and across the genome, population structure is not homogeneous over time, and selective pressures are not homogeneous over time and across the genome → nucleotide compositions are heterogeneous across species, the evolutionary rate is heterogeneous across positions and over time (heterotachy), the substitution process is heterogeneous across positions and over time, positions are inter-dependent, etc.
All the complexities that are not adequately handled by our oversimplified models of sequence evolution can cause systematic biases, which are referred to here as non-phylogenetic signal.
11. Phylogenetic signal and non-phylogenetic signal
[Figure: signal supporting alternatives 1, 2 and 3 (phylogenetic signal = true history), shown for 1000 positions and for 12,000 positions]
12. Systematic errors (inconsistency)
Systematic error: the error in phylogenetic estimates that is due to the failure of the reconstruction method to account fully for multiple substitutions (in a probabilistic framework, for the properties of the data).
Systematic errors will not disappear with phylogenomics, and may indeed become more apparent.
[Figure: four-taxon trees with taxa A, B, C and D illustrating LONG BRANCH ATTRACTION (Felsenstein, 1978)]
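13. Multiple substitutions at the same position
[Figure: substitutions along the branches of a tree (G→C, A→C, A→G) leaving unrelated taxa with the same state (C or A), which produces a tree building artefact]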
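How quickly multiple substitutions at the same position erase the record of change can be illustrated with the Jukes-Cantor (JC69) nucleotide model. This model is swapped in here purely for illustration and is not mentioned on the slides. Under JC69 the expected fraction of differing sites saturates at 0.75 however many substitutions have occurred. The function names below are hypothetical.

```python
import math


def jc69_observed_fraction(d):
    """Expected fraction of differing sites after d substitutions per site (JC69)."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


def jc69_distance(p):
    """Substitutions per site estimated from an observed fraction p (JC69 correction)."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) under JC69")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# The gap between true and observed change grows with divergence.
for d in (0.1, 0.5, 1.0, 2.0):
    p = jc69_observed_fraction(d)
    print(f"true d = {d:.1f}  observed p = {p:.3f}  hidden = {d - p:.3f}")
```

A method that takes observed differences at face value underestimates long branches most strongly, which is one ingredient of the long branch attraction artefact.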
14. [Figure: tree with ML / MP bootstrap support values 99/56, 94/55, 97/51, 72/36, 84/54, 100/75, 43/74]
50 genes (12,060 amino acid positions), ML rtREV+I+Γ / MP bootstrap support
Rokas et al. (2005) Science, 310: 1933-1938
15. Phylogenetic signal and non-phylogenetic signal
[Figure: signal supporting alternatives 1, 2 and 3 (phylogenetic signal = true history) with 12,000 positions, shown next to the ML / MP bootstrap support values 99/56, 94/55, 97/51, 72/36, 84/54, 100/75, 43/74]
16. Phylogenomics yields incongruent results
[Figure: trees published in PLoS Biology, Nature and Current Biology]
17-19. Single gene phylogeny of Schierwater et al. (2009)
20. [Figure: two trees of the same taxa: Excavata, Ciliophora, Amoebozoa, Ascomycota, Basidiomycota, Choanoflagellata, Placozoa, Porifera (Calcarea, Demospongiae, Hexactinellida), Ctenophora, Cnidaria, Bilateria]
First tree: contaminated dataset, Schierwater et al. (2009) PLoS Biol 7(1): e1000020 (support values shown: 4, 9, 98, 53, 27, 62)
Second tree: clean dataset, Philippe et al. (2011) PLoS Biol, in press (support values shown: 36, 44, 40, 4, 9, 23, 38)
21. Dunn et al.: 150 genes, 24,708 positions
Contaminations (number of genes, species, contaminant source where given):
- 13 Symsagittifera (including 6 Chlorophyta, 2 Ciliophora, 2 Bacteria)
- 4 Neochildia (Microsporidia)
- 2 Saccoglossus (Mus)
- 2 Acanthoscurria (angiosperm)
- 2 Hydra (Artemia)
- 1 Oscarella (Pseudomonas)
- 1 Asterina (Bacteria)
- 1 Dugesia (Gallus)
- 1 Xiphinema (Lumbricus)
- 1 Monosiga (Rhizopus)
- 1 Macrostomum
- 2 Trichinella
- 2 Priapulus
- 1 Branchiostoma
22. Dunn et al.: 150 genes, 24,708 positions
Frameshifts: 63 species affected
Drosophila 2, Paraplanocera 3, Echinoderes 4, Xenoturbella 4, Chaetopterus 5, Cyanea 5, Cristatella 6, Platynereis 6, Spinochordodes 6, Cryptococcus 8, Spadella 8, Mnemiopsis 9, Bugula 10, Gnathostomula 10, Hydra 10, Sphaeroforma 10, Turbanella 10, Chaetoderma 15, Myzostoma 15, Scutigera 16, Carcinus 18, Lumbricus 20, Ptychodera 20, Euperipatoides 21, Carcinoscorpius 22, Symsagittifera 22, Chaetopleura 23, Homo 25, Boophilus 30, Hypsibius 30, Richtersius 30, Daphnia 32, Asterina 35, Anoplodactylus 40, Argopecten 43, Xiphinema 43, Acropora 45, Dugesia 46, Brachionus 50, Ciona 50, Branchiostoma 52, Hydractinia 53, Haementeria 54, Flaccisagitta 55, Strongylocentrotus 55, Acanthoscurria 58, Aplysia 58, Saccoglossus 60, Capsaspora 68, Gallus 73, Phoronis 87, Capitella 93, Echinococcus 100, Fenneropenaeus 112, Monosiga 118, Schmidtea 129, Oscarella 141, Mytilus 151, Euprymna 201, Trichinella 281, Crassostrea 296, Macrostomum 382, Biomphalaria 384
23. Frameshifts: 3868 invented amino acids
5 introns: Anoplodactylus, Chaetopterus, Ciona, Themiste, Trichinella
Many single point errors: a total of 970 errors (in large part due to the use of an erroneous mitochondrial genetic code!)
Several genes with paralogy issues: 2-5 intractable problems, 10-20 tractable problems
DUNN: 150 genes, 21,152 positions, 55.6% missing data
UPDUNN: 150 genes, 18,463 positions, 35.6% missing data
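A missing-data percentage of this kind can be computed directly from an alignment. The sketch below assumes that gaps (`-`), unknown characters (`?`) and the ambiguous amino acid `X` all count as missing, which may differ from the convention behind the figures above. The function `missing_fraction` and the toy alignment are hypothetical.

```python
def missing_fraction(alignment, missing="-?X"):
    """Fraction of alignment cells that are gaps or unknown states."""
    total = sum(len(seq) for seq in alignment.values())
    if total == 0:
        raise ValueError("empty alignment")
    absent = sum(seq.count(ch) for seq in alignment.values() for ch in missing)
    return absent / total


# Hypothetical toy alignment, not taken from the Dunn et al. data.
toy = {
    "taxon_a": "MAEIG-LIEF",
    "taxon_b": "MAE??RLVEY",
    "taxon_c": "----GKLIDY",
}

print(f"{100 * missing_fraction(toy):.1f}% missing data")
```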
24. Clean Dunn et al. dataset
[Figure: tree showing Porifera, Ctenophora, Cnidaria and Bilateria]
CAT+Γ model
150 genes, 18,463 positions, 35.6% missing data
25. Model of sequence evolution
190 relative rates (ρij = ρji)
20 stationary probabilities (πi)
[Figure: WAG matrix, a 20 × 20 amino acid replacement matrix with rows and columns ordered A C D E F G H I K L M N P Q R S T V W Y]
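The parameter counts above can be made concrete with a small sketch. It assumes the usual time-reversible parameterisation, in which the rate from amino acid i to j is ρij · πj. The slide lists only the counts, so this form is an assumption. The values used are flat placeholders, not the WAG estimates, and `build_rate_matrix` is a hypothetical helper.

```python
import itertools

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_rate_matrix(rho, pi):
    """Return Q with Q[i][j] = rho[i][j] * pi[j] for i != j and rows summing to zero."""
    n = len(pi)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                q[i][j] = rho[i][j] * pi[j]
        q[i][i] = -sum(q[i])  # diagonal is still 0.0 here, so this is minus the off-diagonal sum
    return q


n = len(AMINO_ACIDS)
# One symmetric relative rate per unordered pair of amino acids.
n_pairs = len(list(itertools.combinations(range(n), 2)))

# Placeholder values only: flat relative rates and flat frequencies.
rho = [[1.0] * n for _ in range(n)]
pi = [1.0 / n] * n
Q = build_rate_matrix(rho, pi)

print(n_pairs, "relative rates;", n, "stationary probabilities")
```

Because ρ is symmetric, πi · Qij equals πj · Qji, which is what makes the process reversible.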
26. The CAT model of sequence evolution
Man          M A E I G R L I E F S A M V D F W Q N R C
Frog         M A E I G R L V E Y S A M V D F W Q N R C
Zebrafish    M A D L G K L I D Y S A L V D F W Q N R C
Fly          M S D I G K L V E F S P M V E F W Q Q K C
Yeast        M S E I G R L V E F T P M V E F W Q N R C
Amoeba       L S E L G R L V D F T A M V D F W N N R C
Paramecium   L A E L G K L V E Y A P M I D F W Q A R C
Green alga   L S D L G K L I D F S A M I N F W Q N K C
Homogeneous (WAG) model: 1 substitution matrix
Heterogeneous (CAT) model: K distinct amino acid profiles (categories, or modes, 1, 2, 3, ..., K), each defined over ACD...VWY
Lartillot & Philippe (2004) Mol Biol Evol 21: 1095-1109
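The idea that different positions use different amino acid profiles can be illustrated on the alignment shown on this slide. The sketch below simply tabulates observed frequencies in each column. This is a crude stand-in: the CAT model infers its profiles and the assignment of sites to categories within a Bayesian mixture model rather than by counting. The function name `column_profiles` is hypothetical.

```python
from collections import Counter

# Alignment transcribed from the slide (spaces removed).
alignment = {
    "Man":        "MAEIGRLIEFSAMVDFWQNRC",
    "Frog":       "MAEIGRLVEYSAMVDFWQNRC",
    "Zebrafish":  "MADLGKLIDYSALVDFWQNRC",
    "Fly":        "MSDIGKLVEFSPMVEFWQQKC",
    "Yeast":      "MSEIGRLVEFTPMVEFWQNRC",
    "Amoeba":     "LSELGRLVDFTAMVDFWNNRC",
    "Paramecium": "LAELGKLVEYAPMIDFWQARC",
    "Green alga": "LSDLGKLIDFSAMINFWQNKC",
}


def column_profiles(aln):
    """Observed amino acid frequencies for each alignment column."""
    seqs = list(aln.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must have equal length")
    profiles = []
    for column in zip(*seqs):
        counts = Counter(column)
        profiles.append({aa: c / len(column) for aa, c in sorted(counts.items())})
    return profiles


profiles = column_profiles(alignment)
for position, profile in enumerate(profiles, start=1):
    print(position, profile)
```

Columns such as position 3 (only D and E) or position 8 (only I and V) use a small subset of the 20 amino acids, which is the pattern a site-specific profile is meant to capture.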
27. The CAT model of sequence evolution
To keep the number of parameters low, a category is defined only by a set of stationary probabilities (the relative rates are uniform), and the number of categories is inferred from the alignment.
Uniform relative rates (ρij = ρji)
20 stationary probabilities (πi)
[Figure: 20 × 20 amino acid replacement matrix with uniform relative rates, rows and columns ordered A C D E F G H I K L M N P Q R S T V W Y]
Lartillot & Philippe (2004) Mol Biol Evol 21: 1095-1109
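With uniform relative rates, the substitution process of one category depends only on its profile, and the transition probabilities have a simple closed form. The sketch below assumes all relative rates are set to 1 and time is left unnormalised, so the rate matrix has off-diagonal entries πj and P(t) = e^(-t) I + (1 - e^(-t)) 1πᵀ. The profile used is hypothetical and is not one of the inferred CAT categories.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def category_transition_matrix(pi, t):
    """P(t) for one category with uniform relative rates.

    P[i][j] = exp(-t) * [i == j] + (1 - exp(-t)) * pi[j]
    """
    stay = math.exp(-t)
    n = len(pi)
    return [
        [(stay if i == j else 0.0) + (1.0 - stay) * pi[j] for j in range(n)]
        for i in range(n)
    ]


# Hypothetical acidic profile: D and E dominate, the other 18 share the remainder.
pi = [0.45 if aa in "DE" else 0.1 / 18 for aa in AMINO_ACIDS]

P = category_transition_matrix(pi, t=1.0)
d, e = AMINO_ACIDS.index("D"), AMINO_ACIDS.index("E")
w = AMINO_ACIDS.index("W")
print(f"P(D -> E) = {P[d][e]:.3f}, P(D -> W) = {P[d][w]:.3f}")
```

Under such a profile nearly all change at the site is between D and E, so repeated substitutions are expected to revisit the same few states.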
28. Stable categories inferred by the CAT model
[Figure: amino acid profiles of the categories, with letters such as E, Q, D and N. The size of an amino acid is proportional to its stationary probability.]
29. Multiple substitutions between two amino acids
30. What is predicted by evolutionary models?
[Figure: number of substitutions (axis from 0 to 7) under the WAG, GTR and CAT models]
31. Multiple substitutions between two amino acids
32. What is predicted by the WAG replacement matrix?
[Figure: number of substitutions predicted by the WAG replacement matrix]
33. Further reduction of non-phylogenetic signal
[Figure: tree with bootstrap support values 100, 100, 100, 95, 63, 100, 49, 99, 52, 98, 76, 78, 76, 55, 74; scale bar 0.02]
Alignment of Rokas et al. (2005): 50 genes (12,060 amino acid positions). Model CAT+Γ, inferred using PhyloBayes. 100 bootstrap replicates.
34. Reduction of non-phylogenetic signal
[Figure: bootstrap support (0 to 100) for Chordates, Protostomes, Ecdysozoa, Lophotrochozoa, Bilaterians, Cnidarians and Poriferans under MP, rtREV+Γ and CAT+Γ]
35. 128 genes, 30,257 positions. Philippe et al. (2009) Curr. Biol.
[Figure: trees inferred under model CAT+Γ and under model WAG+Γ; scale bar 0.1]
Philippe et al. (2011) PLOS Biol.
36. Improvement of phylogenetic resolution
true history
Phylogenomics: phylogenetic signal as well as non-phylogenetic signal are abundant.
To improve resolution, one has to use the same methods as those used to avoid systematic errors:
- Complex model of sequence evolution
- Rich taxon sampling
- Removal of fast-evolving positions and taxa
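The last item in the list above, removal of fast-evolving positions, can be sketched as a simple column filter. The ranking criterion used here, the number of distinct states in a column, is only a crude proxy chosen for illustration. Site rates are normally estimated under a model of sequence evolution, and the slides do not say which criterion was used. The function name and toy alignment are hypothetical.

```python
def remove_fastest_columns(aln, n_drop):
    """Drop the n_drop columns that show the most distinct states.

    Returns the reduced alignment and the indices of the columns kept.
    """
    names = list(aln)
    columns = list(zip(*(aln[name] for name in names)))
    # Rank columns from most to least variable; sorted() is stable, so ties keep alignment order.
    ranked = sorted(range(len(columns)), key=lambda i: -len(set(columns[i])))
    dropped = set(ranked[:n_drop])
    kept = [i for i in range(len(columns)) if i not in dropped]
    reduced = {name: "".join(aln[name][i] for i in kept) for name in names}
    return reduced, kept


# Hypothetical toy alignment: column 2 (A, S, T, P) is the most variable.
toy = {"a": "MAEIG", "b": "MSDIG", "c": "MTELG", "d": "MPDLG"}

reduced, kept = remove_fastest_columns(toy, n_drop=1)
print(kept, reduced)
```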
37. Model CAT+Γ
47 species, 128 genes, 30,257 positions. Philippe et al. (2009) Curr. Biol.
Model CAT+Γ
18 species, same sampling as Schierwater et al. Philippe et al. (2011) PLOS Biol.
[Figure: tree of Choanoflagellata, Ctenophora, Cnidaria, Calcarea, Placozoa, Demospongiae, Hexactinellida and Bilateria; support values shown: 3, 70, 94, 44, 56, 9, 86, 53]
38. Heterogeneity in models
M A D I G R L I E F S A M V D F W
M G E I G R L V E Y S A M V D F W
M A E L G K L I D Y S A L V D F W
M T D I G K L V E F S P M V E F W
M W D I G R L V E F T P M V E Y W
M S D L A R L V D F T A M V D F W
M Y D L G K L I D F S A M I N F W
M A D I G R L I E F S A M V D Y W
M E D I G R L V E Y S A M V D F W
M R D L G K L I D Y S A L V D F W
- Heterogeneity among character states
  - exchange matrices: Dayhoff, WAG, LG, GTR
- Heterogeneity among sites
  - gamma distribution, CAT model
- Heterogeneity over time
  - covarion model, change points
39. Hypothesis
Heteropecilly: variation over time of the amino acid substitution process at a given site (from the Greek poikillō, "to vary")
40. Progressive removal of heteropecillous sites
Protocol
Data: 13 mitochondrial proteins, 68 species
- Sites removed in order of increasing PIPn value

PIPn     Sites removed (number)   Sites removed (%)   Alignment size
-        -                        -                   1927
0        168                      8.7                 1759
e-12     165                      8.6                 1594
e-8      177                      9.2                 1417
e-6      177                      9.2                 1240
e-4.5    201                      10.4                1039

- Inference with CAT+Γ4 on the reduced datasets
[Figure: tree with Bilateria and Choanoflagellata labelled, model CAT+Γ4]
Roure & Philippe (2011) BMC Evol Biol 11: 17
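The bookkeeping of this protocol can be checked with a short sketch. It takes the number of sites removed at each PIPn threshold from the slide and recomputes the remaining alignment size and the percentage removed, expressed relative to the full 1927 positions. That reading is an assumption, but it reproduces the quoted percentages. The function name `removal_schedule` is hypothetical.

```python
FULL_LENGTH = 1927  # positions in the complete alignment (from the slide)

# (PIPn threshold label, number of sites removed at this step), as given on the slide.
steps = [("0", 168), ("e-12", 165), ("e-8", 177), ("e-6", 177), ("e-4.5", 201)]


def removal_schedule(full_length, steps):
    """Return (label, removed, percent of full alignment, remaining size) per step."""
    rows, remaining = [], full_length
    for label, n_removed in steps:
        remaining -= n_removed
        percent = round(100 * n_removed / full_length, 1)
        rows.append((label, n_removed, percent, remaining))
    return rows


schedule = removal_schedule(FULL_LENGTH, steps)
for label, n_removed, percent, remaining in schedule:
    print(f"PIPn {label:>6}: removed {n_removed} ({percent}%), {remaining} positions left")
```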
41-44. Progressive removal of heteropecillous sites
[Figure, built up over four slides: posterior probability as a function of alignment size; Bilateria labelled]
Roure & Philippe (2011) BMC Evol Biol 11: 17
45. Progressive removal of sites
[Figure: two panels, one of them for fast-evolving sites, each showing posterior probability as a function of alignment size]
The incorrect grouping of Cnidaria and Porifera is not due to the presence of fast-evolving sites, but to the presence of heteropecillous sites, which are erroneously interpreted as synapomorphies grouping Cnidaria and Porifera.
Roure & Philippe (2011) BMC Evol Biol 11: 17