Advances and Limitations of Maximum Likelihood Phylogenetics - PowerPoint PPT Presentation

1 / 64
About This Presentation
Title:

Advances and Limitations of Maximum Likelihood Phylogenetics

Description:

The SPR procedure involves testing all (subtree, edge) pairs ... was estimated using XRATE (Holmes et al 06) EM-based software, with site likelihood ... – PowerPoint PPT presentation

Number of Views:76
Avg rating:3.0/5.0
Slides: 65
Provided by: le07
Category:

less

Transcript and Presenter's Notes

Title: Advances and Limitations of Maximum Likelihood Phylogenetics


1
Advances and Limitationsof Maximum Likelihood
Phylogenetics
  • Olivier Gascuel
  • LIRMM-CNRS, Montpellier, France

2
StéphaneGuindon
WimHordijk
Quang Le Si
MariaAnisimova
NicolasLartillot
Jean-FrançoisDufayard
3
  • Most of the talk will be about proteins

4
  • The data is a set of aligned sequences

5
  • We aim to reconstruct the phylogeny of the
    sequences in the alignment

6
  • We assume a substitution model, denoted as M
  • The likelihood of data D, given M and T, is
  • We search for the tree T that maximizes data
    likelihood
  • Statistical modeling
  • An improved replacement matrix
  • Accounting for the structure
  • Results
  • Algorithmics
  • Simultaneous NNIs
  • Fast SPRs
  • Results

7
(No Transcript)
8
  • Algorithmics

9
  • Algorithmics

NNI
10
  • Algorithmics

11
  • Algorithmics

12
  • Algorithmics

SPR
13
  • PHYML-NNI
  • Start with a reasonnable tree with branch lengths
    (BIONJ)
  • Compute all subtree partial likelihoods
  • Independently compute all optimal branch-lengths
    and optimal NNI configurations (i.e. local
    changes)
  • When no local change significantly increases the
    likelihood, return the current tree
  • Else, apply to the current tree all local
    changes if the tree likelihood increases go to
    (b), else (5 of the cases) apply as many as
    possible of these changes and go to (b)

14
  • Comments
  • Simultaneous NNIs can change the tree
    dramatically, and are not included in (single)
    SPR or TBR
  • The algorithm is very fast and able to deal
    with large datasets (up to 500-1000 taxa with DNA
    sequences)
  • High topological accuracy with simulated data
  • But real data tend to be harder than simulated
    data, specially the multiple-gene, concatenated
    datasets

15
  • Fast SPRs
  • SPRs are non-local moves
  • We start from a phylogeny with ML branch length
    estimates
  • The SPR procedure involves testing all
    (subtree, edge) pairs
  • This cannot be achieved in an exact way (i.e.
    with optimal branch lengths), thus the game is to
    focus on the most promising pairs (PHYML 3.0 uses
    a parsimony approach) and to minimize the number
    of length optimizations and partial likelihod
    calculations.
  • As soon as an improving SPR is found, we fully
    optimize all branch lengths, compute all partial
    likelihoods and iterate the procedure.

16
  • Results
  • 60 Treebase protein alignments (i.e. all
    available datasets, only removing redundancies
    and incomplete data).
  • average of 25 sequences and 1000 sites
  • 2 genomic datasets (e.g. 12.000 sites and 64
    sequences)
  • WAGG4I, with PHYML 3.0
  • SPR is about twice slower than NNI, ranging
    from a few seconds to a few hours

p-valuelt0.01
17
  • Results
  • 60 Treebase protein alignments (i.e. all
    available datasets, only removing redundancies
    and incomplete data).
  • average of 25 sequences and 1000 sites
  • 2 genomic datasets (e.g. 12.000 sites and 64
    sequences)
  • WAGG4I, with PHYML 3.0
  • RAXML is in between in LLK values, and 2-3
    times slower than PHYML SPR

18
  • Comments
  • Fast with this representative, relatively small
    alignments
  • Output trees are not statistically different
    (in most cases, 52/60)
  • SPR trees do not depend (much) on the starting
    trees
  • Some more intensive search strategy could be
    envisaged, e.g. based on tabu
  • Genetic algorithms (e.g. MetaPIGA, GARLI) also
    perform well.

I do not expect high gains from further
algorithmic developments (with such datasets)
19
  • Statistical modeling
  • An improved, general AA replacement matrix
  • Accounting for structure and exposition to
    solvent
  • Results

20
  • AA time-reversible replacement matrices
  • is the instaneous rate of changes
    from x to y
  • Key role in protein phylogenetics (and
    alignment)
  • M is defined by
  • Global rate
  • 1 in estimation and
  • when using several models

Equilibrium frequency
21
  • Estimating replacement matrices
  • Counting approach of Dayhoff et al. (1972),
    using pairwise alignments of closely related
    proteins (PAM, JTT, ).
  • Logarithmic (Gonnet et al 1992) and resolvent
    (Muller et al 2000) counting approaches to deal
    with pairs of remote proteins
  • A strong tendency is to estimate different
    matrices for different protein groups
    (mitochondrial, prokaryotic, viral, arthropoda
    ).
  • But general matrices (e.g., JTT, WAG) are
    widely used, e.g. to build deep phylogenies or to
    analyze concatenated datasets.

22
  • ML estimation of replacement matrices
  • Counting methods are not able to deal with
    multiple alignments, which contain much more
    information than protein pairs
  • ML methods exploit multiple alignments and
    phylogenies
  • a set of multiple alignments,
    we aim to maximize
  • But we cannot simultaneously estimate a number
    of trees and M. This full maximization was only
    used with unique concatenated alignments (e.g.
    AdachiHasegawa 1996, with mitochondrial genes,
    3350 sites and 20 taxa).

23
  • ML estimation of replacement matrices,
    WhelanGoldman 2001
  • First step approximate trees are inferred
    using NJ and ML branch length estimation
  • Second step M is estimated using an EM
    algorithm maximizing
  • WAG was estimated using BRKALN (186 aligments,
    51.000 sites, 900.000 AAs)
  • WAG is much better than JTT (also estimated
    from BRKALN)

24
  • ML estimation of replacement matrices,
    WhelanGoldman 2001
  • Variability of rates across sites (RAS) was not
    incorporated in likelihood calculations.
  • It is now recognized that RAS is essential.
    Some sites are slow (invariant) due to strong
    evolutionary constraints, while others are very
    fast.
  • RAS is usually implemented with a discrete
    gamma distribution of rates and invariant sites
    (G4I), and used to infer most of trees.
  • Moreover, BRKALN is limited regarding current
    databases, and likely biased toward proteins
    being easy to cristallize, with well defined 3D
    structure.

25
  • Lee G., 2007 (submission next week !)
  • We used the seed alignments of Pfam, which are
    manually verified multiple alignments of
    representative sets of sequences, and selected
    3,913 large enough alignments (600.000 sites,
    6.5 millions AAs).
  • The trees were inferred by PHYML with
    WAGG4I
  • Each site i was categorized in the rate
    category with maximum a posteriori
    probability, and rate
  • The LG replacement matrix was estimated using
    XRATE (Holmes et al 06) EM-based software, with
    site likelihood

26
  • Lee G., 2007 (submission next week !)
  • We used the seed alignments of Pfam, which are
    manually verified multiple alignments of
    representative sets of sequences, and selected
    3,913 large enough alignments (600.000 sites,
    6.5 millions AAs).
  • The trees were inferred by PHYML with
    WAGG4I
  • Each site i was categorized in the rate
    category with maximum a posteriori
    probability, and rate
  • The replacement matrix was estimated using
    XRATE (Holmes 06) EM-based software, with site
    likelihood

Convergence problems
27
  • LG/WAG matrices
  • AA frequencies relatively close, very low
    influence on likelihood values when inferring
    trees
  • Exchangeabilities strongly correlated

28
(No Transcript)
29
  • LG/WAG matrices
  • Our estimation procedure has better ability to
    distinguish among the substitution events that
    are very rare (likely occuring in fast sites
    only) and those being not so rare (possibly
    occuring in slow sites).
  • LG exchangeabilities are much more contrasted
    than WAGs
  • But LG cannot be viewed as a constrasted
    version of WAG

  • ratio ?
    0.6

Asparagine?Tyrosine
0.69
LG
1.14
WAG
30
  • LG/WAG matrices
  • Our estimation procedure has better ability to
    distinguish among the substitution events that
    are very rare (likely occuring in fast sites
    only) and those being not so rare (possibly
    occuring in slow sites).
  • LG exchangeabilities are much more contrasted
    than WAGs
  • But LG cannot be viewed as a constrasted
    version of WAG

  • ratio ? 2.0

Cystein?Tyrosine
1.15
LG
0.57
WAG
31
  • LG/WAG in tree inference
  • We analyzed the 60 Treebase alignments using
    PHYML_SPR with WAGG4I, LGG4I, and JTTG4I.
  • We measured the tree length, the gama parameter
    value (a) and the loglikelihood. We also compared
    the tree topologies.

p-valuelt0.01
32
  • LG/WAG in tree inference
  • LG trees are longer than WAG trees
  • Topologies of the inferred trees differ with
    half of the data sets.
  • Clear improvement in likelihood values
  • Similar results with Pfam test aligments

33
  • Accounting for exposition and secondary structure
  • Substitutions clearly depend on secondary
    structure and exposition e.g., buried sites are
    and remain hydrophobic.
  • Overington et al.1990 Lüthy et al. 1991
    Topham et al. 1993 Wako and Blundell 1994
    Goldman et al. 1996 (to infer both the structure
    and the phylogeny).
  • Not (or rarely) used today in phylogenetics,
    though the structure of dozens of thousands of
    proteins is now available.
  • We revisited the question thanks to (1) our
    improved ML-based estimation procedure, (2) the
    huge, current databases.

34
  • Learning and testing data
  • We extracted from HSSP ((homology-derived
    structures of proteins) 4,889 non-redundant
    (sub)alignments.
  • 290,000 sequences, 1,250,000 sites and 71
    billions AAs.
  • Secondary structure (Helix, Sheet, Turn, Coil)
    and exposition (Exposed, Buried) are available
    for all the sites, but not fully reliable (80-90
    of conservation).
  • We randomly selected 500 alignments as a test
    set, leaving 4,389 alignments to learn
    substitution matrices for various site categories
    ( E, B H, S, T, C EH, ES,
    ET ).

35
Computing the tree likelihood using site partition
Each category is associated to a replacement
matrix the category and corresponding matrix
are known for every site i
No extra parameter, regarding single-matrix
models
Extra parameters gamma, proportion of invariant
sites, etc.
36
Mixture model
Site category is unknown. We have a set of
replacement matrices corresponding to
various categories with probabilities
37
Confidence-based combination
Site category is known, but not fully reliable
One more parameter than mixture
Confidence coefficient, estimated separately for
each alignment c ? 1 useful site assignments,
c ? 0 useless site assignments
38
  • Results of buried/exposed model (LG_EX)
  • We analyzed the 60 Treebase and 300 HSSP test
    alignments with various models, all using G4I
    option.

39
  • Results
  • Likelihood gain is lower when using the
    secondary structure (LG_SS, 0.85) and higher
    when combining both secondary structure and
    exposition (LG_EX_SS, 1.6).
  • The difference between LG_EX_SSG4I and
    WAGG4I, is of the same range as the difference
    between WAGG4I and WAG (2.0).

40
  • Discussion
  • We revisited questions and models which were
    proposed and explored by N. Goldman, Z. Yang,
    their collaborators, others, using today
  • concepts, e.g. RAS MUST be accounted for in
    tree inference AND replacement matrix estimation,
  • tools (XRATE, PHYML),
  • and databases (Pfam, HSSP).

41
  • Discussion
  • We revisited questions and models which were
    proposed and explored by N. Goldman, Z. Yang,
    their collaborators, others, using today
  • concepts, e.g. RAS MUST be accounted for in
    tree inference AND replacement matrix estimation,
  • tools (XRATE, PHYML),
  • and databases (Pfam, HSSP),
  • and computers !

42
  • Discussion

Elegant HMM model to account for secondary
structure and exposition, but not incoporating
any RAS (Lio et al, 98)
43
  • Discussion

ML estimate
Counting estimate
44
  • Discussion

ML estimation with RAS and larger database
45
  • Discussion

Accounting for solvent exposition of residues
46
  • Discussion

47
  • Warm up conclusions
  • Statistical modelling provides much higher gains
    than algorithmics !

48
  • Warm up conclusions
  • Statistical modelling provides much higher gains
    than algorithmics !
  • This should continue in the next years, as
    current models are still rejected for a number
    of alignments .

49
  • Number of AA per site (Lartillot et al 2004, 2007)

50
  • Warm up conclusions
  • Statistical modelling provides much higher gains
    than algorithmics !
  • This should continue in the next years, as
    current models are still rejected for a number
    of alignments ..

51
  • Thank you all, the organizers and the Isaac
    Newton Institute

52
(No Transcript)
53
(No Transcript)
54
(No Transcript)
55
  • Independence assumption
  • Stationary distribution of AA

56
  • The tree likelihood is recursively computed
    from the root

Partial likelihood of rooted tree U (L(U) for
short)
Probability of change from x to y in time lU
57
  • With time reversible models, the tree
    likelihood can be obtained from any branch, using
    partial likelihoods L(U) and L(V), and branch
    length l(u,v).

58
  • (Relatively) time consuming
  • Computing the partial likelihood of all
    subtrees
  • Optimizing the branch lengths and computing the
    likelihood of a given topology
  • Very time consuming
  • Searching the topology space in an
    hill-climbing, exact way.
  • Efficient algorithms simultaneously modify the
    branch lengths and the tree topology, thus
    searching the space of phylogenies with
    branch-lengths.

59
  • Silmutaneous NNIs two (relatively) fast and
    easy operations (when all partial likelihoods are
    known)
  • Independently computing all optimal branch
    lengths
  • Independently computing all optimal NNI
    configurations

Evaluate ACBD and ADBC, optimizing l(e)or all
five branches
60
  • Orchestrating calculations (RAXML, PHYML .)
  • Step0 - All partial likelihoods are available

61
  • Orchestrating calculations
  • Step1 Pruning the subtree and estimating the
    branch being left

62
  • Orchestrating calculations
  • Step2 Computing 1 partial likelihood,
    estimating the 3 new branch lengths and computing
    the tree likelihood

63
  • Orchestrating calculations
  • Step3 Computing 1 partial likelihood,
    estimating the 3 new branch lengths and computing
    the tree likelihood etc.

64
  • Progressive filtering strategy (PHYML)
  • All possible SPRs are first filtered by a fast
    distance-based (or parsimony) algorithm
    typically, we retain for every subtree the 20
    most promising edges for regraphting.
  • Previous scheme is run several times with
    increasingly sophisticated branch-length
    estimations when an improving SPR is found, it
    is returned and the procedure restart from the
    beginning else, results are used to rank and
    filter remaining SPRs.
  • This strategy allows considerable gain in
    computing time, without loss on the resulting
    tree.
Write a Comment
User Comments (0)
About PowerShow.com