Title: Maximum likelihood and Bayesian methods in Molecular phylogeny
1Maximum likelihood and Bayesian methods in
Molecular phylogeny
Marco Salemi, Ph.D.
Dept. Pathology U.F. Gainesville, FL (U.S.A.)
2The Likelihood function
- Likelihood function
- conditional probability of the data (aligned
sequences) given a hypothesis (a model of
substitution with a set of parameters ?, and the
tree ?, including topology and branch lengths) - L(?,?) Prob(Data??,?)
- or
- Prob(Aligned sequences?tree, model of evolution)
- The maximum likelihood estimates (MLEs) of ? and
? are those making the likelihood function as
large as possible - t , ? max L(?,?)
- Therefore, what we usually call the likelihood
of the tree IS NOT the likelihood of the tree but
the probability of the data given that the tree
is the true tree!
3Calculating the likelihood of a known tree with a
known evolutionary model
1
j
n
1)
(1)
. . C G G A C A C G T T T A C . .
(2)
. . C A G A C A C C T C T A C . .
A
(3)
. . C G G A T A A G T T A A C . .
(4)
. . C G G A T A G C C T A G C . .
A
?
A
A
2)
1
3
C
P(A) gt P(C) gt P(G) P(T)
2
4
4Calculating the likelihood known tree (contd)
3)
C
C
A
G
C
C
A
G
A
C
L(j) Prob
Prob
3)
L L(1) L(2) L(n) ? L(j)
A
A
C
C
A
G
4)
ln L ln L(1) ln L(2) ln L(n) ? L(j)
G
Prob
.
C
22n-2 internal nodes for a tree of n taxa 22n-2
terms in the summation! (More efficient
way Felsenstein Pruning Algorithm)
C
C
A
G
Prob
.
T
T
5Finding the ML tree
- Exhaustive search guaranteed to find the minimum
tree because all tree topologies are evaluated.
Not possible for more than 10 sequences - Branch and bound guaranteed to find the minimum
tree without evaluating all tree topologies a
larger number of taxa can be evaluated but still
limited (depends on the dataset) - Heuristic searches not guaranteed to find the
minimal tree. Uses stepwise addition of taxa plus
a rearrangement process (branch swapping)
6Trees combinatorial explosion
- Number of unrooted trees for n taxa
- NU(2n-5)!/2n-3(n-3)!
- Number of rooted trees for n taxa
- NR(2n-3)!/2n-2(n-2)!
7Heuristic Search strategy (1)choosing a
starting tree
8Stepwise addition
A
Start tree construction with first 3 sequences
C
B
L3 gt L2 , L1
B
A
A
B
A
C
Add 4th
L3
L1
L2
D
C
C
D
B
D
E
E
A
B
A
B
A
B
E
Add 5th
D
C
D
C
D
C
E
B
B
A
A
E
Add next sequence
Max L
D
C
D
C
9Tree landscape and input order
- During stepwise addition, in spite of branch and
bound or other rearrangement algorithms, one can
get stuck into a sub optimal tree peak. - Tree input order juggling can explore more than
one tree peak, with better chances at finding the
optimal tree.
10Heuristic Search strategy (1)choosing a
starting tree
- Stepwise addition with random repeats (at least
5 to 10) - ADVANTAGE exploring different portion of the
tree-space - DISADVANTAGE many random replicates necessary
(computationally slow for more than 35-40 taxa)
11Global Maximum Likelihood tree
local Maximum Likelihood tree
Likelihood
Trees
Starting tree of the heuristic search
Starting tree of the heuristic search
12Heuristic Search strategy (1)choosing a
starting tree
- Simple Stepwise addition
- Stepwise addition with random repeats (at least
5 to 10) - ADVANTAGE exploring different portion of the
tree-space - DISADVANTAGE many random replicates necessary
(computationally slow for more than 35-40 taxa)
- NJ tree
- ADVANTAGE starting tree probably not too wrong
(too far from the global maximum) - DISADVANTAGE possible to get stuck in local
maxima
13Heuristic Search strategy (2)Re-arranging the
tree
- The starting tree is re-arranged in order to find
a better (higher likelihood) tree with a Pruning
algorithm - Neighbor Nearest Interchange (NNI) subset of SPR
- Subtree Pruning Regrafting (SPR) subset of
TBR - Tree Bisection Reconnection (TBR)
14Branch swapping
C
D
A
E
Tree bisection and reconnection (TBR)
E
F
B
A
D
G
F
C
B
G
C
D
E
A
B
G
F
A
F
B
G
D
C
Evaluate Likelihood again ...
E
15Maximum Likelihood Estimates (MLEs) of
substitution model parameters
- Each free parameter of a given nt substitution
model can be estimated by ML - The MLEs of the parameters are those making the
likelihood of the tree the largest - In theory parameters and tree topology could be
estimated simultaneously - In practice the parameters are estimated using a
reasonable starting tree (for example NJ or ML
obtained with arbitrarily fixed parameters).
16Nucleotide substitution models (again!)
- is the mean substitution rate a, b, c, d, are
relative rate parameters ?A, ?C, ?G, and ?T are
frequency parameters - How do we decide which model best-fit our data?
17The Likelihood Ratio Test (LRT)
- The Likelihood function offers a natural way of
comparing nested evolutionary hypothesis using
the Likelihood Ratio (LR) statistic - ? 2 (loge L1 loge L0)
- L1 maximum likelihood under the more
parameter-rich, complex model (alternative
hypothesis, H1) - L0 maximum likelihood under the less
parameter-rich simple model (null hypothesis, H0) - If the models are nested and the null hypothesis
(H0) is correct, D is asymptotically distributed
as ?2 with a number of degrees of freedom equal
to the difference in number of free parameters
between the two models
18The Likelihood Ratio Test (LRT) (2)
- If LRT is significant (i.e. p lt 0.05, or lt0.01)
the inclusion of additional parameters in the
alternative model increases significantly the
likelihood of the data - When D is close to zero (p gt 0.05) the
alternative hypothesis does not fit the data
significantly better than the null hypothesis,
i.e. adding parameters to the null model does not
give us a better explanation of the data!
19The Likelihood Ratio Test (LRT) (3)
Nested evolutionary hypotheses That two models
are nested when one model (H0, null model or
constrained model) is equivalent to restrict the
possible values that one or more parameters can
take in the other model (H1, alternative,
unconstrained or full model
JC69 no Ti/Tv ratio, pA pT pC pG 0
K80 Ti/Tv ratio, pA pT pC pG 1
F84/HKY85 Ti/Tv ratio, pA ? pT ? pC ? pG 4
20The Likelihood Ratio Test (LRT) (4)
- Estimate a tree from the data (the "base tree").
This tree has been shown to not have influence in
the final model selected as far as it is not a
random tree . A neighbor-joining tree with the
JC69 or K80 model will be fast and will do fine. - Estimate the likelihood of the candidate models,
for the given data set and the "base tree". - Compare the likelihood of the candidate models
through a hierarchy of LRTs to select the
best-fit model among the candidate ones.
21 Model free parameters (in
the Q matrix) GTR 8 a?b?c?d?e?f,
pi TN93 5 pi ? pj, Ti/Tv ratio,
R/Y ratio (b, e, acdf) 4 pi ?
pj Ti/Tv ratio (be, acdf)
3 pi ? pj (a bcdef) 1 pA
pT pC pG0.25 Ti/Tv ratio (be,
acdf) JC69 0 pA
pT pC pG0.25 (a bcdef)
HKY85 F84 F81 (?Tajima-Nei) K80
22Testing rate heterogeneity over sites with
G-models (1)
- Each one of the models discussed above assumes
rate homogeneity over sites - A discrete G-distribution (Yang, 1994) is
commonly used to model rate heterogeneity over
sites - The a parameter of the G-distribution can be
estimated, as any other parameter of the
nucleotide substitution model, via maximum
likelihood
23Testing rate heterogeneity over sites with
G-models (2)
- Any given G-model with some value of a is nested
with the corresponding non G-model (which is the
special case of the G-model for a?) D.o.f.1 - For example, we can compare in a LRT the
likelihood of a tree under the JC model and the
JCG model. Since only one additional parameter
is estimated by the more complex (JCG) model,
the hypothesis of rate homogeneity over sites is
rejected if ? gt ?20.05 for 1 d.o.f.
24The classic molecular clock
- The molecular clock hypothesis postulates that
for any given macromolecule (a protein or a DNA
sequence) the rate of evolution is approximately
constant over time in all evolutionary lineages
(Zuckerkandl and and Pauling 1962 and 1965)
25Mutation and evolutionary rate
- Mutation Rate (m) error rate (biochemical
concept), rate of polymerase nucleotide
misincorporation - Evolutionary rate (r) number of mutants arising
per generation x probability of fixation - r 2Nm 1/2N m (Kimura, 1968)
26A global molecular clock?
The hypothesis known as global clock was based on
the observation that a linear relation seems to
exist between the number of amino acid
substitutions between homologous proteins of
different species, and the species divergence
times estimated from archaeological data.
27Why is the molecular clock attractive ?
- If macromolecules evolve at constant rates, they
can be used to date species-divergence times and
other types of evolutionary events, similar to
the dating of geological time using radioactive
elements - Phylogenetic reconstruction is much simpler under
constant rates that under nonconstant rates - The degree of rate variation among lineages may
provide much insight into the mechanisms of
molecular evolution (e.g. Kimura 1983 Gillespie
1991 Salemi et al., 1999).
28Estimating ancestral divergence times under the
clock hypothesis
knowing a divergence time T r dac/2T all the
other divergence dates in the internal nodes of
the tree can be estimated using the rate r
29Global vs Local clocks
- The molecular clock hypothesis is in perfect
agreement with the neutral theory of evolution
(Kimura 1968 Kimura 1983). - The existence of a clock seems to be a major
support of the neutral theory against natural
selection - The global clock hypothesis is today known to be
untrue. - Factors like different metabolic rates, different
lifespan, and different efficiency in the DNA
repair mechanisms between distantly related
species are responsible for different
evolutionary rates of homologous genes.
30Evolutionary rates of organisms
nucleotide substitutions per site per year
10 - 9
10 - 8
10 - 7
10 - 6
10 - 5
10 - 4
10 - 3
10 - 2
10 - 1
cellular genes
RNA viruses
DNA viruses
Human mtDNA
31Clock-like phylogenies
When the molecular clock holds the different
lineages in a phylogenetic tree accumulate
mutations more or less at the same rate during
evolution they have a constant evolutionary rate
r (nucleotide substitutions/site/year).
?
time
clock-like tree
non clock-like tree
n-1 2n-3 independent branch
lengths to be estimated (n number of TAXA)
32Likelihood ratio test for clock hypothesis
- Given a phylogenetic tree topology, branch
lengths proportional to the sequence divergence
can be estimated via maximum likelihood assuming
or not a constant evolutionary rate for each
lineage of the tree. - The likelihood ratio statistics
- ? 2 (loge LnoClock loge Lclock)
- is c2 distributed with (2n-3)-(n-1)n-2 degrees
of freedom - n number of TAXA (usually the clock-hypothesis
is rejected when D lt c20.05).
33LRT and non-parametric bootstrapping
- The ?2 approximation to assess the significance
of the LRT is not appropriate when the two
competing hypothesis are not nested. - The LRT may perform poorly when the data include
very short sequences relative to the number of
parameters to be estimated. - The null distribution of the LRT statistic can be
approximated by Monte Carlo simulation - Advantage statistically sound
- Disadvantage computational expensiveness
34Non-parametric bootstrapping (1)
- Select the competing models one for the null
hypothesis H0 and one for the alternative
hypothesis H1. - Estimate the tree and the parameters of the model
under the null hypothesis. - Use the tree and the estimated parameters to
simulate 200-1000 replicate data sets of the same
size as the original.
35Non-parametric bootstrapping (2)
- Infer the distribution of LRT statistic ?2 (loge
L1 loge L0) from the simulated data sets - H0 is rejected if the original ? is higher than
95 (99) percentile of the ? values in the
distribution,.
36Bayes formula
- Let A and B be events
- The intersection A?B AB is the event of A and B
both occurring - If B occurs with probability P(B) and, if B has
occurred, A can occur with probability P(AB)
prob. of A given that B has occurred, then - P(AB) P(AB)P(B) (1)
- Similarly, P(AB) P(BA)P(A). By substituting
P(AB) in equation (1) and solving for P(AB)
37A silly example
- The probability of getting 5 by rolling a fair
die is P(5) 1/6 - The conditional probability of getting 5 knowing
that rolling the die has given as outcome an odd
number is P(5odd)1/3
38Bayes formula in practice (1)
- In practical applications P(BA) and P(A) are
usually easy to calculate, whereas calculating
P(B) requires a sum over all possible ways in
which event B can occur - An example
- A lab blood test has 95 prob. of detecting the
disease and gives 1 false positives. 0.5 of the
population is known to have the disease. - What is the (conditional) probability of a person
actually having the disease if her test resulted
positive?
39Bayes formula in practice (2)
- P(Dis positive test) P(positive test
Dis)P(Dis) / P(positive test) - P(positive test Dis) 0.95
- P(Dis) 0.005
- P(positive test) P(positive test Dis)P(Dis)
P(positive test no Dis)P(no Dis) - P(positive test) (0.95)
(0.005) (0.01)
(0.995) - P(Dis positive test) 0.323
40Likelihood vs Bayesian in Molecular phylogeny
- The Likelihood function gives the probability of
the aligned sequences if the tree (and the model)
is true - Prob (Aligned sequences?tree)
- The Posterior probability of the tree is the
probability of a particular tree given the
observed data set - Prob (tree?Aligned sequences)
41 Bayesian Inference in phylogeny (1)
- Instead of searching for the ML tree, trees are
sampled according to their posterior probability - Bayesian inference takes full advantage of the
information contained in the data - Computationally faster (does not require an
exhaustive search to find the ML tree) - Can be used to select an evolutionary model
without knowing the topology of the optimal (or
nearly optimal) tree
42Posterior probability of a phylogenetic tree
- P(Aligned sequences?Tree) likelihood
- P(Tree) a priori probability of any particular
phylogeny easy to calculate by assuming that,
before any observation has been made, any
phylogenetic tree for the data has equal chance
of being the true tree - P(Aligned sequences) requires summation over all
trees and integration over all possible branch
lengths and model parameter values!!!
43The Markov Chain Monte Carlo (MCMC) sampler
- Metropolis et al. (1953) and Hastings (1970)
proposed an algorithm (a variant of MCMC) that
allows to visit a number of trees in the tree
space and (if run properly and long enough) to
visit more more often the trees with the higher
posterior probability - The algorithm starts from a random tree and
obtains a second tree by randomly perturbing the
first tree - The new tree is then accepted or rejected with a
probability that depends upon the ratio of the
posterior probabilities of the two trees being
compared - P(Tree2?Aligned sequences) / P(Tree1?Aligned
sequences). Therefore the term P(Aligned
sequences) does not need to be calculated!
44Inferring a high probability tree with
Metropolis-Hastings algorithm
- Obtain a random starting tree, Tree
- Randomly perturb Tree to obtain Tree. Let
q(Tree) the probability of proposing the new
tree from the previous one and q(Tree) the
probability of proposing the old tree conditional
on starting at the new tree (Tree) q(X) can be
picked for some proposal distribution - Calculate the probability R of accepting Tree
over Tree
- R min 1 , P(Aligned sequences?Tree)P(Tree
)q(Tree) - P(Aligned
sequences?Tree)P(Tree)q(Tree)
- Generate a random variable U uniformly
distributed in the interval (0,1) - If UltR accept Tree, let TreeTree, and go back
to step 2 otherwise continue with the current
tree - If the process is repeated for a sufficiently
large number of iterations the long-run
frequencies of states visited by the Markov chain
WILL approximate the posterior distribution
45 Bayesian Inference of phylogeny (2)
- The Bayesian framework can be used also
- Evaluating uncertainty in phylogenies instead of
classinc bootstrap methods (Larget Simon, Mol.
Biol. Evol. 1999) - Detecting positive selection (Nielsen Yang,
Genetics 1998) - Testing evolutionary models (Huelsenbeck et al.,
Mol. Biol. Evol. 2004) - Testing the molecular clock (Surchard et al.,
Mol. Biol. Evol. 2001) - Software for Bayesian inference programs include
- Mr. Bayes