Title: Advances and Limitations of Maximum Likelihood Phylogenetics
1Advances and Limitationsof Maximum Likelihood
Phylogenetics
- Olivier Gascuel
- LIRMM-CNRS, Montpellier, France
2StéphaneGuindon
WimHordijk
Quang Le Si
MariaAnisimova
NicolasLartillot
Jean-FrançoisDufayard
3- Most of the talk will be about proteins
4- The data is a set of aligned sequences
5- We aim to reconstruct the phylogeny of the
sequences in the alignment
6- We assume a substitution model, denoted as M
- The likelihood of data D, given M and T, is
- We search for the tree T that maximizes data
likelihood
- Statistical modeling
- An improved replacement matrix
- Accounting for the structure
- Results
- Algorithmics
- Simultaneous NNIs
- Fast SPRs
- Results
7(No Transcript)
8 9NNI
10 11 12SPR
13- PHYML-NNI
- Start with a reasonnable tree with branch lengths
(BIONJ) - Compute all subtree partial likelihoods
- Independently compute all optimal branch-lengths
and optimal NNI configurations (i.e. local
changes) - When no local change significantly increases the
likelihood, return the current tree - Else, apply to the current tree all local
changes if the tree likelihood increases go to
(b), else (5 of the cases) apply as many as
possible of these changes and go to (b)
14- Comments
- Simultaneous NNIs can change the tree
dramatically, and are not included in (single)
SPR or TBR - The algorithm is very fast and able to deal
with large datasets (up to 500-1000 taxa with DNA
sequences) - High topological accuracy with simulated data
- But real data tend to be harder than simulated
data, specially the multiple-gene, concatenated
datasets
15- Fast SPRs
- SPRs are non-local moves
- We start from a phylogeny with ML branch length
estimates - The SPR procedure involves testing all
(subtree, edge) pairs - This cannot be achieved in an exact way (i.e.
with optimal branch lengths), thus the game is to
focus on the most promising pairs (PHYML 3.0 uses
a parsimony approach) and to minimize the number
of length optimizations and partial likelihod
calculations. - As soon as an improving SPR is found, we fully
optimize all branch lengths, compute all partial
likelihoods and iterate the procedure.
16- Results
- 60 Treebase protein alignments (i.e. all
available datasets, only removing redundancies
and incomplete data). - average of 25 sequences and 1000 sites
- 2 genomic datasets (e.g. 12.000 sites and 64
sequences) - WAGG4I, with PHYML 3.0
- SPR is about twice slower than NNI, ranging
from a few seconds to a few hours
p-valuelt0.01
17- Results
- 60 Treebase protein alignments (i.e. all
available datasets, only removing redundancies
and incomplete data). - average of 25 sequences and 1000 sites
- 2 genomic datasets (e.g. 12.000 sites and 64
sequences) - WAGG4I, with PHYML 3.0
- RAXML is in between in LLK values, and 2-3
times slower than PHYML SPR
18- Comments
- Fast with this representative, relatively small
alignments - Output trees are not statistically different
(in most cases, 52/60) - SPR trees do not depend (much) on the starting
trees - Some more intensive search strategy could be
envisaged, e.g. based on tabu - Genetic algorithms (e.g. MetaPIGA, GARLI) also
perform well.
I do not expect high gains from further
algorithmic developments (with such datasets)
19- Statistical modeling
- An improved, general AA replacement matrix
- Accounting for structure and exposition to
solvent - Results
20- AA time-reversible replacement matrices
- is the instaneous rate of changes
from x to y - Key role in protein phylogenetics (and
alignment) -
-
- M is defined by
- Global rate
- 1 in estimation and
- when using several models
Equilibrium frequency
21- Estimating replacement matrices
- Counting approach of Dayhoff et al. (1972),
using pairwise alignments of closely related
proteins (PAM, JTT, ). - Logarithmic (Gonnet et al 1992) and resolvent
(Muller et al 2000) counting approaches to deal
with pairs of remote proteins - A strong tendency is to estimate different
matrices for different protein groups
(mitochondrial, prokaryotic, viral, arthropoda
). - But general matrices (e.g., JTT, WAG) are
widely used, e.g. to build deep phylogenies or to
analyze concatenated datasets.
22- ML estimation of replacement matrices
- Counting methods are not able to deal with
multiple alignments, which contain much more
information than protein pairs - ML methods exploit multiple alignments and
phylogenies - a set of multiple alignments,
we aim to maximize - But we cannot simultaneously estimate a number
of trees and M. This full maximization was only
used with unique concatenated alignments (e.g.
AdachiHasegawa 1996, with mitochondrial genes,
3350 sites and 20 taxa).
23- ML estimation of replacement matrices,
WhelanGoldman 2001 - First step approximate trees are inferred
using NJ and ML branch length estimation - Second step M is estimated using an EM
algorithm maximizing - WAG was estimated using BRKALN (186 aligments,
51.000 sites, 900.000 AAs) - WAG is much better than JTT (also estimated
from BRKALN)
24- ML estimation of replacement matrices,
WhelanGoldman 2001 - Variability of rates across sites (RAS) was not
incorporated in likelihood calculations. - It is now recognized that RAS is essential.
Some sites are slow (invariant) due to strong
evolutionary constraints, while others are very
fast. - RAS is usually implemented with a discrete
gamma distribution of rates and invariant sites
(G4I), and used to infer most of trees. - Moreover, BRKALN is limited regarding current
databases, and likely biased toward proteins
being easy to cristallize, with well defined 3D
structure.
25- Lee G., 2007 (submission next week !)
- We used the seed alignments of Pfam, which are
manually verified multiple alignments of
representative sets of sequences, and selected
3,913 large enough alignments (600.000 sites,
6.5 millions AAs). - The trees were inferred by PHYML with
WAGG4I - Each site i was categorized in the rate
category with maximum a posteriori
probability, and rate - The LG replacement matrix was estimated using
XRATE (Holmes et al 06) EM-based software, with
site likelihood
26- Lee G., 2007 (submission next week !)
- We used the seed alignments of Pfam, which are
manually verified multiple alignments of
representative sets of sequences, and selected
3,913 large enough alignments (600.000 sites,
6.5 millions AAs). - The trees were inferred by PHYML with
WAGG4I - Each site i was categorized in the rate
category with maximum a posteriori
probability, and rate - The replacement matrix was estimated using
XRATE (Holmes 06) EM-based software, with site
likelihood
Convergence problems
27- LG/WAG matrices
- AA frequencies relatively close, very low
influence on likelihood values when inferring
trees - Exchangeabilities strongly correlated
28(No Transcript)
29- LG/WAG matrices
- Our estimation procedure has better ability to
distinguish among the substitution events that
are very rare (likely occuring in fast sites
only) and those being not so rare (possibly
occuring in slow sites). - LG exchangeabilities are much more contrasted
than WAGs - But LG cannot be viewed as a constrasted
version of WAG -
ratio ?
0.6
Asparagine?Tyrosine
0.69
LG
1.14
WAG
30- LG/WAG matrices
- Our estimation procedure has better ability to
distinguish among the substitution events that
are very rare (likely occuring in fast sites
only) and those being not so rare (possibly
occuring in slow sites). - LG exchangeabilities are much more contrasted
than WAGs - But LG cannot be viewed as a constrasted
version of WAG -
ratio ? 2.0
Cystein?Tyrosine
1.15
LG
0.57
WAG
31- LG/WAG in tree inference
- We analyzed the 60 Treebase alignments using
PHYML_SPR with WAGG4I, LGG4I, and JTTG4I. - We measured the tree length, the gama parameter
value (a) and the loglikelihood. We also compared
the tree topologies.
p-valuelt0.01
32- LG/WAG in tree inference
- LG trees are longer than WAG trees
- Topologies of the inferred trees differ with
half of the data sets. - Clear improvement in likelihood values
- Similar results with Pfam test aligments
33- Accounting for exposition and secondary structure
- Substitutions clearly depend on secondary
structure and exposition e.g., buried sites are
and remain hydrophobic. - Overington et al.1990 Lüthy et al. 1991
Topham et al. 1993 Wako and Blundell 1994
Goldman et al. 1996 (to infer both the structure
and the phylogeny). - Not (or rarely) used today in phylogenetics,
though the structure of dozens of thousands of
proteins is now available. - We revisited the question thanks to (1) our
improved ML-based estimation procedure, (2) the
huge, current databases.
34- Learning and testing data
- We extracted from HSSP ((homology-derived
structures of proteins) 4,889 non-redundant
(sub)alignments. - 290,000 sequences, 1,250,000 sites and 71
billions AAs. - Secondary structure (Helix, Sheet, Turn, Coil)
and exposition (Exposed, Buried) are available
for all the sites, but not fully reliable (80-90
of conservation). - We randomly selected 500 alignments as a test
set, leaving 4,389 alignments to learn
substitution matrices for various site categories
( E, B H, S, T, C EH, ES,
ET ).
35Computing the tree likelihood using site partition
Each category is associated to a replacement
matrix the category and corresponding matrix
are known for every site i
No extra parameter, regarding single-matrix
models
Extra parameters gamma, proportion of invariant
sites, etc.
36Mixture model
Site category is unknown. We have a set of
replacement matrices corresponding to
various categories with probabilities
37Confidence-based combination
Site category is known, but not fully reliable
One more parameter than mixture
Confidence coefficient, estimated separately for
each alignment c ? 1 useful site assignments,
c ? 0 useless site assignments
38- Results of buried/exposed model (LG_EX)
- We analyzed the 60 Treebase and 300 HSSP test
alignments with various models, all using G4I
option.
39- Results
- Likelihood gain is lower when using the
secondary structure (LG_SS, 0.85) and higher
when combining both secondary structure and
exposition (LG_EX_SS, 1.6). - The difference between LG_EX_SSG4I and
WAGG4I, is of the same range as the difference
between WAGG4I and WAG (2.0).
40- Discussion
- We revisited questions and models which were
proposed and explored by N. Goldman, Z. Yang,
their collaborators, others, using today - concepts, e.g. RAS MUST be accounted for in
tree inference AND replacement matrix estimation, - tools (XRATE, PHYML),
- and databases (Pfam, HSSP).
41- Discussion
- We revisited questions and models which were
proposed and explored by N. Goldman, Z. Yang,
their collaborators, others, using today - concepts, e.g. RAS MUST be accounted for in
tree inference AND replacement matrix estimation, - tools (XRATE, PHYML),
- and databases (Pfam, HSSP),
- and computers !
42Elegant HMM model to account for secondary
structure and exposition, but not incoporating
any RAS (Lio et al, 98)
43ML estimate
Counting estimate
44ML estimation with RAS and larger database
45Accounting for solvent exposition of residues
46 47- Warm up conclusions
- Statistical modelling provides much higher gains
than algorithmics !
48- Warm up conclusions
- Statistical modelling provides much higher gains
than algorithmics ! - This should continue in the next years, as
current models are still rejected for a number
of alignments .
49- Number of AA per site (Lartillot et al 2004, 2007)
50- Warm up conclusions
- Statistical modelling provides much higher gains
than algorithmics ! - This should continue in the next years, as
current models are still rejected for a number
of alignments ..
51- Thank you all, the organizers and the Isaac
Newton Institute
52(No Transcript)
53(No Transcript)
54(No Transcript)
55- Independence assumption
- Stationary distribution of AA
56- The tree likelihood is recursively computed
from the root
Partial likelihood of rooted tree U (L(U) for
short)
Probability of change from x to y in time lU
57- With time reversible models, the tree
likelihood can be obtained from any branch, using
partial likelihoods L(U) and L(V), and branch
length l(u,v).
58- (Relatively) time consuming
- Computing the partial likelihood of all
subtrees - Optimizing the branch lengths and computing the
likelihood of a given topology - Very time consuming
- Searching the topology space in an
hill-climbing, exact way. - Efficient algorithms simultaneously modify the
branch lengths and the tree topology, thus
searching the space of phylogenies with
branch-lengths.
59- Silmutaneous NNIs two (relatively) fast and
easy operations (when all partial likelihoods are
known) - Independently computing all optimal branch
lengths - Independently computing all optimal NNI
configurations
Evaluate ACBD and ADBC, optimizing l(e)or all
five branches
60- Orchestrating calculations (RAXML, PHYML .)
- Step0 - All partial likelihoods are available
61- Orchestrating calculations
- Step1 Pruning the subtree and estimating the
branch being left
62- Orchestrating calculations
- Step2 Computing 1 partial likelihood,
estimating the 3 new branch lengths and computing
the tree likelihood
63- Orchestrating calculations
- Step3 Computing 1 partial likelihood,
estimating the 3 new branch lengths and computing
the tree likelihood etc.
64- Progressive filtering strategy (PHYML)
- All possible SPRs are first filtered by a fast
distance-based (or parsimony) algorithm
typically, we retain for every subtree the 20
most promising edges for regraphting. - Previous scheme is run several times with
increasingly sophisticated branch-length
estimations when an improving SPR is found, it
is returned and the procedure restart from the
beginning else, results are used to rank and
filter remaining SPRs. - This strategy allows considerable gain in
computing time, without loss on the resulting
tree.