Title: Maximum Likelihood James McInerney,
1Maximum LikelihoodJames McInerney,
- This presentation is based almost entirely on
Peter G. Fosters - "The Idiots Guide to the Zen
of Likelihood in a Nutshell in Seven Days for
Dummies, Unleashed.http//www.bioinf.org/molsys/d
ata/idiots.pdf (Peter cannot be blamed for any
inaccuracies in this presentation)
2Maximum Likelihood
- Could equally be called maximum probability.
- Historically the newest method.
- Popularized by Joseph Felsenstein, Seattle,
Washington. - Its slow uptake by the scientific community has
to do with the difficulty of understanding the
theory and also the absence (initially) of good
quality software with choice of models and ease
of interaction with data. - Also, at the time, it was computationally
intractable to analyse large datasets (consider
that in the mid-1980s a typical desktop computer
had a processor speed of less than 30 MHz). - In recent times, software, models and computer
hardware have become sufficiently sophisticated
that ML is becoming a method of choice.
3ML comparison with other methods.
- ML is similar to many other methods in many ways
- In many ways it is fundamentally different.
- ML assumes a model of sequence evolution (so does
Maximum Parsimony and so do distance matrix
methods). - ML attempts to answer the question What is the
probability that I would observe these data (a
multiple sequence alignment), given a particular
model of evolution (a tree and a process). - ML uses a model. This is justifiable, since
molecular sequence data can be shown to have
arisen according to a stochastic process.
4Maximum Likelihood - goal
- To estimate the probability that we would observe
a particular dataset, given a phylogenetic tree
and some notion of how the evolutionary process
worked over time.
)
(
Probability of
given
5What is the probability of observing a datum?
- If we flip a coin and get a head and we think the
coin is unbiased, then the probability of
observing this head is 0.5. - If we think the coin is biased so that we expect
to get a head 80 of the time, then the
likelihood of observing this datum (a head) is
0.8. - Therefore The likelihood of making some
observation is entirely dependent on the model
that underlies our assumption.
Lesson The datum has not changed, our model has.
Therefore under the new model the likelihood of
observing the datum has changed.
p
?
6What is the probability of observing a 'G'
nucleotide?
- QuestionIf we have a DNA sequence of one
nucleotide in length and the identity of this
nucleotide is 'G', what is the likelihood that we
would observe this 'G'? - Answer In the same way as the coin-flipping
observation, the likelihood of observing this 'G'
is dependent on the model of sequence evolution
that is thought to underlie the data. - E.g.
- Model 1 frequency of G 0.4 gt likelihood(G)
0.4 - Model 2 frequency of G 0.1 gt likelihood(G)
0.1 - Model 3 frequency of G 0.25 gt likelihood(G)
0.25
7One rulethe rule of 1.
- The sum of the likelihoods of all the
possibilities will always equal 1. - E.g. for DNA p(a)p(c)p(g)p(t)1
8What about longer sequences?
- If we consider a gene of length 2
- Gene 1 ga
- The the probability of observing this gene is the
product of the probabilities of observing each
character. - E.g
- p(g) 0.4 p(a)0.15 (for instance)
- likelihood(ga) 0.4 x 0.15 0.06
9or even longer sequences?
- Gene 1 gactagctagacagatacgaattac
- Model (simple base frequency model)
- p(a)0.15 p(c)0.2 p(g)0.4 p(t)0.25
- (the sum of all probabilities must equal 1)
- Like(Gene 1) 0.000000000000000018452813
10Note about models
- You might notice that our model of base frequency
is not the optimal model for our observed data.
If we had used the following model - p(a)0.4 p(c) 0.2 p(g) 0.2 p(t) 0.2
- The likelihood of observing the gene is
- Like(gene 1) 0.000000000000335544320000
- (a value that is almost 10,000 times higher)
Lesson The datum has not changed, our model has.
Therefore under the new model the likelihood of
observing the datum has changed.
11How does this relate to phylogenetic trees?
- Consider an alignment of two sequences
- Gene 1 gaac
- Gene 2 gacc
- We assume these genes are related by a (simple)
phylogenetic tree with branch lengths.
12Increase in model sophistication
- It is no longer possible to simply invoke a model
that encompasses base composition, we must also
include the mechanism of sequence change and
stasis. - There are two parts to this model - the tree and
the process (the latter is confusingly referred
to as the model, although both parts really
compose the model).
Note We will stay with the confusing notation -
to avoid further confusion.
13The model
- The two parts of the model are the tree and the
process (the model). - The model is composed of the composition and the
substitution process -rate of change from one
character state to another character state.
Model
14Simple time-reversible model
- A simple model is that the rate of change from a
to c or vice versa is 0.4, the composition of a
is 0.25 and the composition of c is 0.25 (a
simplified version of the Jukes and Cantor 1969
model)
P
15Probability of the third nucleotide position in
our current alignment
- p(a) 0.25 p(c) 0.25
- Starting with a, the likelihood of the nucleotide
is 0.25 and the likelihood of the substitution
(branch) is 0.4. So the likelihood of observing
these data is - Likelihood(DM) 0.25 x 0.4 0.01
Note you will get the same result if you start
with c, since this model is reversible
The likelihood of the data, given the model.
16Substitution matrix
- For nucleotide sequences, there are 16 possible
ways to describe substitutions - a 4x4 matrix.
Convention dictates that the order of the
nucleotides is a,c,g,t
Note for amino acids, the matrix is a 20 x 20
matrix and for codon-based models, the matrix is
61 x 61
17Substitution matrix - an example
- In this matrix, the probability of an a changing
to a c is 0.01 and the probability of a c
remaining the same is 0.979, etc.
Note The rows of this matrix sum to 1 - meaning
that for every nucleotide, we have covered all
the possibilities of what might happen to it. The
columns do not sum to anything in particular.
18To calculate the likelihood of the entire
dataset, given a substitution matrix, base
composition and a branch length of one "certain
evolutionary distance" or "ced"
Gene 1 ccat Gene 2 ccgt
Likelihood of given
p0.1,0.4,0.2,0.3
19Likelihood of a two-sequence alignment.
0.4x0.983x0.4x0.983x0.1x0.007x0.3x0.979 0.000030
0
Likelihood of going from the first to the second
sequence is 0.0000300
20Different Branch Lengths
- For very short branch lengths, the probability of
a character staying the same is high and the
probability of it changing is low (for our
particular matrix). - For longer branch lengths, the probability of
character change becomes higher and the
probability of staying the same is lower. - The previous calculations are based on the
assumption that the branch length describes one
Certain Evolutionary Distance or CED. - If we want to consider a branch length that is
twice as long (2 CED), then we can multiply the
substitution matrix by itself (matrix2).
212 CED model
X
Which gives a likelihood of 0.0000559
Note the higher likelihood
22For 3 CED
This gives a likelihood of 0.0000782
Note that as the branch lengths increase, the
values on diagonal decrease and the values on the
off-diagonals increase.
23For higher values of CED units
Likelihood
- 1 0.0000300
- 2 0.0000559
- 3 0.0000782
- 10 0.0001620
- 15 0.0001770
- 20 0.0001750
- 30 0.0001520
Branch Length
24Raising P to a large power
- If we raise P to a very large power, we find that
the ML base composition pops out. - So the base composition is built into the
probability matrix P.
25Rate Matrices
This does make sense doesnt it??
Consider the following equation
- In the same way, raising a matrix to the power of
a number can be calculated by taking the log of
the matrix, multiplying it by branch length and
taking the exponent of the product. - In this way, you can exponentiate the matrix by a
number that is not a whole number (e.g. 4.5698 or
whatever). - E.g. The log of the previous matrix, P is
Note the sum of each row is zero.
26- This matrix corresponds to one CED. What we want
is to derive a matrix so that when we
exponentiate it, the values correspond to
substitutions per site. - We must therefore scale logP so that when the
rows of logP are multiplied by the
off-diagonal elements sum to 1. - The resulting scaled logP (lets call it Q), when
its exponent is taken gives a P corresponding to
1 substitution per site.
27Converting to substitutions per site.
- For a branch length of v
- If we scale the logP appropriately, we will get a
Q matrix. If we multiply this Q matrix by a
diagonal matrix of the composition we get a
matrix where the off-diagonal elements sum to 1
and the diagonal elements sum to -1.
28Scaling logP appropriately.
LogP scaled by a factor of 50 (for instance)
Off-diagonal elements sum to 1, diagonal elements
sum to -1
(diagonal matrix of the composition)
Ps generated from this Q will give branch
lengths in substitutions per site.
29Separating composition from rates.
- If we divide the columns of Q by the
composition is separated from the rates. You can
then use the exact same rate matrix with
different matrices of base composition. - For the model we have been using, the rate matrix
R is - The diagonal elements do not matter and are left
out. The model is symmetrical (time reversible).
30Relationships between R, Q and P matrices.
Multiply columns by the composition, scale so
that the off-diagonals of sum to 1
Multiply by branch length, then exponentiate
R
Q
P
Log, then scale so that off-diagonals of sum
to 1
Divide columns by the composition
31Likelihood of the alignment at various branch
lengths
The maximum likelihood value is 0.0001777 at a
branch length of 0.330614
32Likelihood of a two-branch tree
A
0.1
O
0.2
B
- O is the origin or root, the numbers represent
branch lengths. The likelihood can be calculated
in three ways - from A to B in one step (this amounts to the
previous method) - from A to B in two steps (through O)
- in two parts starting at O.
33Lesson about O
- O is an unknown sequence.
- We can only speculate what each position in the
alignment would be if we could observe the
sequence of O. - What we do know is that the sum of all
possibilities is equal to 1. - Therefore we must sum the likielihoods for all
possibilities of O. - This becomes computationally intensive.
c
A
0.1
0.1
a,c,g,t
O
For position 1
0.2
0.2
c
B
34A three branch tree
A
0.1
0.3
C
B
0.2
The tree can be rooted anywhere and the
substitutions calculated accordingly. There are
many ways of doing this and this is left as an
exercise for the student.
35Increasing the sophistication of models
- So far, the models we have dealt with assume that
change is equally likely at all positions and
that the rate of change is constant for the
entire duration of the phylogeny. - This is not a realistic model for all sequences
(it is a neutral model with a constant molecular
clock).
CORRECT TREE
WRONG TREE
p gt q
36Small subunit ribosomal RNA
18S or 16S rRNA
37The molecular clock for alpha-globinEach point
represents the number of substitutions separating
each animal from humans
shark
carp
number of substitutions
platypus
chicken
cow
Time to common ancestor (millions of years)
38Invariable sites
- For a given dataset we can assume that a certain
proportion of sites are not free to vary -
purifying selection (related to function)
prevents these sites from changing). - We can therefore observe invariable positions
either because they are under this selective
constraint or because they have not had a chance
to vary or because there is homoplasy in the
dataset and a reversal (say) has caused the site
to appear constant. - The likelihood that a site is invariable can be
calculated by incorporating this possibility into
our model and calculating for every site the
likelihood that it is an invariable site. - It might improve the likelihood of the dataset if
we remove a certain proportion of invariable
sites (in a way that is analogous to the
preceding discussion). - The PAUP software can estimate the proportion of
invariable sites using Maximum Likelihood.
39Variable sites
- Obviously other sites in the dataset are free to
vary. - Selection intensity on these sites is rarely
uniform, so it is desirable to model site-by-site
rate variation. - This is done in two ways
- site specific (codon position, or alpha helix
etc.) - using a discrete approximation to a continuous
distribution (gamma distribution). - Again, these variables are modeled over all
possibilities of sequence change over all
possibilities of branch length over all
possibilities of tree topology.
40The shape of the gamma distribution for different
values of alpha.
41Does changing a model affect the outcome?
There are different models Jukes and Cantor
(JC69) All base compositions equal (0.25 each),
rate of change from one base to another is the
same Kimura 2-Parameter (K2P) All base
compositions equal (0.25 each), different
substitution rate for transitions and
transversions). Hasegawa-Kishino-Yano (HKY) Like
the K2P, but with base composition free to
vary. General Time Reversible (GTR) Base
composition free to vary, all possible
substitutions can differ. All these models can
be extended to accommodate invariable sites and
site-to-site rate variation.
42(No Transcript)
43Long-branch attraction (LBA)
- In the case below, the wrong tree is often
selected. ML will not be prone to this problem,
if the correct model of sequence evolution is
used. - This is often called the felsenstein-zone.
CORRECT TREE
WRONG TREE
p gt q
44Strengths of ML
- Does not try to make an observation of sequence
change and then a correction for superimposed
substitutions. There is no need to correct for
anything, the models take care of superimposed
substitutions. - Accurate branch lengths.
- Each site has a likelihood.
- If the model is correct, we should retrieve the
correct tree. - You can use a model that fits the data.
- ML uses all the data (no selection of sites based
on informativeness, all sites are informative). - ML can not only tell you about the phylogeny of
the sequences, but also the process of evolution
that led to the observations of todays sequences.
If we have long-enough sequences and a
sophisticated-enough model.
45Models
- You can use models that
- Deal with different transition/transversion
ratios. - Deal with unequal base composition.
- Deal with heterogeneity of rates across sites.
- Deal with heterogeneity of the substitution
process (different rates across lineages,
different rates at different parts of the tree). - The more free parameters, the better your model
fits your data (good). - The more free parameters, the higher the variance
of the estimate (bad). - Use a model that fits your data.
46Over-fitting a model to your data.
47Weaknesses of ML
- Can be inconsistent if we use models that are not
accurate. - Model might not be sophisticated enough (you can
max-out on models). - Very computationally-intensive. Might not be
possible to examine all models (substitution
matrices, tree topologies, etc.).
48Recommendations
- Interact with the data.
- If you have collected enough data, you might get
a good picture of the underlying model of
sequence evolution. - Use a test of alternative models (implemented in
the modeltest software). - Dont just choose a model, use a model that
fits your data. - Dont over-fit a model to your data.
49How often have I said to you that when you have
eliminated the impossible, whatever remains,
however improbable, must be the truth. Sherlock
Holmes to Dr. Watson in The Sign of Four, by A.
Conan Doyle.