Maximum Likelihood James McInerney, - PowerPoint PPT Presentation

1 / 49
About This Presentation
Title:

Maximum Likelihood James McInerney,

Description:

Popularized by Joseph Felsenstein, Seattle, Washington. Its slow uptake by the scientific community has to do with the difficulty of ... – PowerPoint PPT presentation

Number of Views:57
Avg rating:3.0/5.0
Slides: 50
Provided by: drjamesom
Category:

less

Transcript and Presenter's Notes

Title: Maximum Likelihood James McInerney,


1
Maximum LikelihoodJames McInerney,
  • This presentation is based almost entirely on
    Peter G. Fosters - "The Idiots Guide to the Zen
    of Likelihood in a Nutshell in Seven Days for
    Dummies, Unleashed.http//www.bioinf.org/molsys/d
    ata/idiots.pdf (Peter cannot be blamed for any
    inaccuracies in this presentation)

2
Maximum Likelihood
  • Could equally be called maximum probability.
  • Historically the newest method.
  • Popularized by Joseph Felsenstein, Seattle,
    Washington.
  • Its slow uptake by the scientific community has
    to do with the difficulty of understanding the
    theory and also the absence (initially) of good
    quality software with choice of models and ease
    of interaction with data.
  • Also, at the time, it was computationally
    intractable to analyse large datasets (consider
    that in the mid-1980s a typical desktop computer
    had a processor speed of less than 30 MHz).
  • In recent times, software, models and computer
    hardware have become sufficiently sophisticated
    that ML is becoming a method of choice.

3
ML comparison with other methods.
  • ML is similar to many other methods in many ways
  • In many ways it is fundamentally different.
  • ML assumes a model of sequence evolution (so does
    Maximum Parsimony and so do distance matrix
    methods).
  • ML attempts to answer the question What is the
    probability that I would observe these data (a
    multiple sequence alignment), given a particular
    model of evolution (a tree and a process).
  • ML uses a model. This is justifiable, since
    molecular sequence data can be shown to have
    arisen according to a stochastic process.

4
Maximum Likelihood - goal
  • To estimate the probability that we would observe
    a particular dataset, given a phylogenetic tree
    and some notion of how the evolutionary process
    worked over time.

)
(
Probability of
given
5
What is the probability of observing a datum?
  • If we flip a coin and get a head and we think the
    coin is unbiased, then the probability of
    observing this head is 0.5.
  • If we think the coin is biased so that we expect
    to get a head 80 of the time, then the
    likelihood of observing this datum (a head) is
    0.8.
  • Therefore The likelihood of making some
    observation is entirely dependent on the model
    that underlies our assumption.

Lesson The datum has not changed, our model has.
Therefore under the new model the likelihood of
observing the datum has changed.
p
?
6
What is the probability of observing a 'G'
nucleotide?
  • QuestionIf we have a DNA sequence of one
    nucleotide in length and the identity of this
    nucleotide is 'G', what is the likelihood that we
    would observe this 'G'?
  • Answer In the same way as the coin-flipping
    observation, the likelihood of observing this 'G'
    is dependent on the model of sequence evolution
    that is thought to underlie the data.
  • E.g.
  • Model 1 frequency of G 0.4 gt likelihood(G)
    0.4
  • Model 2 frequency of G 0.1 gt likelihood(G)
    0.1
  • Model 3 frequency of G 0.25 gt likelihood(G)
    0.25

7
One rulethe rule of 1.
  • The sum of the likelihoods of all the
    possibilities will always equal 1.
  • E.g. for DNA p(a)p(c)p(g)p(t)1

8
What about longer sequences?
  • If we consider a gene of length 2
  • Gene 1 ga
  • The the probability of observing this gene is the
    product of the probabilities of observing each
    character.
  • E.g
  • p(g) 0.4 p(a)0.15 (for instance)
  • likelihood(ga) 0.4 x 0.15 0.06

9
or even longer sequences?
  • Gene 1 gactagctagacagatacgaattac
  • Model (simple base frequency model)
  • p(a)0.15 p(c)0.2 p(g)0.4 p(t)0.25
  • (the sum of all probabilities must equal 1)
  • Like(Gene 1) 0.000000000000000018452813

10
Note about models
  • You might notice that our model of base frequency
    is not the optimal model for our observed data.
    If we had used the following model
  • p(a)0.4 p(c) 0.2 p(g) 0.2 p(t) 0.2
  • The likelihood of observing the gene is
  • Like(gene 1) 0.000000000000335544320000
  • (a value that is almost 10,000 times higher)

Lesson The datum has not changed, our model has.
Therefore under the new model the likelihood of
observing the datum has changed.
11
How does this relate to phylogenetic trees?
  • Consider an alignment of two sequences
  • Gene 1 gaac
  • Gene 2 gacc
  • We assume these genes are related by a (simple)
    phylogenetic tree with branch lengths.

12
Increase in model sophistication
  • It is no longer possible to simply invoke a model
    that encompasses base composition, we must also
    include the mechanism of sequence change and
    stasis.
  • There are two parts to this model - the tree and
    the process (the latter is confusingly referred
    to as the model, although both parts really
    compose the model).

Note We will stay with the confusing notation -
to avoid further confusion.
13
The model
  • The two parts of the model are the tree and the
    process (the model).
  • The model is composed of the composition and the
    substitution process -rate of change from one
    character state to another character state.


Model
14
Simple time-reversible model
  • A simple model is that the rate of change from a
    to c or vice versa is 0.4, the composition of a
    is 0.25 and the composition of c is 0.25 (a
    simplified version of the Jukes and Cantor 1969
    model)

P
15
Probability of the third nucleotide position in
our current alignment
  • p(a) 0.25 p(c) 0.25
  • Starting with a, the likelihood of the nucleotide
    is 0.25 and the likelihood of the substitution
    (branch) is 0.4. So the likelihood of observing
    these data is
  • Likelihood(DM) 0.25 x 0.4 0.01

Note you will get the same result if you start
with c, since this model is reversible
The likelihood of the data, given the model.
16
Substitution matrix
  • For nucleotide sequences, there are 16 possible
    ways to describe substitutions - a 4x4 matrix.

Convention dictates that the order of the
nucleotides is a,c,g,t
Note for amino acids, the matrix is a 20 x 20
matrix and for codon-based models, the matrix is
61 x 61
17
Substitution matrix - an example
  • In this matrix, the probability of an a changing
    to a c is 0.01 and the probability of a c
    remaining the same is 0.979, etc.

Note The rows of this matrix sum to 1 - meaning
that for every nucleotide, we have covered all
the possibilities of what might happen to it. The
columns do not sum to anything in particular.
18
To calculate the likelihood of the entire
dataset, given a substitution matrix, base
composition and a branch length of one "certain
evolutionary distance" or "ced"
Gene 1 ccat Gene 2 ccgt
Likelihood of given
p0.1,0.4,0.2,0.3
19
Likelihood of a two-sequence alignment.
  • ccat
  • ccgt

0.4x0.983x0.4x0.983x0.1x0.007x0.3x0.979 0.000030
0
Likelihood of going from the first to the second
sequence is 0.0000300
20
Different Branch Lengths
  • For very short branch lengths, the probability of
    a character staying the same is high and the
    probability of it changing is low (for our
    particular matrix).
  • For longer branch lengths, the probability of
    character change becomes higher and the
    probability of staying the same is lower.
  • The previous calculations are based on the
    assumption that the branch length describes one
    Certain Evolutionary Distance or CED.
  • If we want to consider a branch length that is
    twice as long (2 CED), then we can multiply the
    substitution matrix by itself (matrix2).

21
2 CED model
X

Which gives a likelihood of 0.0000559
Note the higher likelihood
22
For 3 CED
This gives a likelihood of 0.0000782
Note that as the branch lengths increase, the
values on diagonal decrease and the values on the
off-diagonals increase.
23
For higher values of CED units
Likelihood
  • 1 0.0000300
  • 2 0.0000559
  • 3 0.0000782
  • 10 0.0001620
  • 15 0.0001770
  • 20 0.0001750
  • 30 0.0001520

Branch Length
24
Raising P to a large power
  • If we raise P to a very large power, we find that
    the ML base composition pops out.
  • So the base composition is built into the
    probability matrix P.

25
Rate Matrices
This does make sense doesnt it??
Consider the following equation
  • In the same way, raising a matrix to the power of
    a number can be calculated by taking the log of
    the matrix, multiplying it by branch length and
    taking the exponent of the product.
  • In this way, you can exponentiate the matrix by a
    number that is not a whole number (e.g. 4.5698 or
    whatever).
  • E.g. The log of the previous matrix, P is

Note the sum of each row is zero.
26
  • This matrix corresponds to one CED. What we want
    is to derive a matrix so that when we
    exponentiate it, the values correspond to
    substitutions per site.
  • We must therefore scale logP so that when the
    rows of logP are multiplied by the
    off-diagonal elements sum to 1.
  • The resulting scaled logP (lets call it Q), when
    its exponent is taken gives a P corresponding to
    1 substitution per site.

27
Converting to substitutions per site.
  • For a branch length of v
  • If we scale the logP appropriately, we will get a
    Q matrix. If we multiply this Q matrix by a
    diagonal matrix of the composition we get a
    matrix where the off-diagonal elements sum to 1
    and the diagonal elements sum to -1.

28
Scaling logP appropriately.
LogP scaled by a factor of 50 (for instance)
Off-diagonal elements sum to 1, diagonal elements
sum to -1
(diagonal matrix of the composition)
Ps generated from this Q will give branch
lengths in substitutions per site.
29
Separating composition from rates.
  • If we divide the columns of Q by the
    composition is separated from the rates. You can
    then use the exact same rate matrix with
    different matrices of base composition.
  • For the model we have been using, the rate matrix
    R is
  • The diagonal elements do not matter and are left
    out. The model is symmetrical (time reversible).

30
Relationships between R, Q and P matrices.
Multiply columns by the composition, scale so
that the off-diagonals of sum to 1
Multiply by branch length, then exponentiate
R
Q
P
Log, then scale so that off-diagonals of sum
to 1
Divide columns by the composition
31
Likelihood of the alignment at various branch
lengths
  • ccat
  • ccgt

The maximum likelihood value is 0.0001777 at a
branch length of 0.330614
32
Likelihood of a two-branch tree
A
0.1
O
0.2
B
  • O is the origin or root, the numbers represent
    branch lengths. The likelihood can be calculated
    in three ways
  • from A to B in one step (this amounts to the
    previous method)
  • from A to B in two steps (through O)
  • in two parts starting at O.

33
Lesson about O
  • O is an unknown sequence.
  • We can only speculate what each position in the
    alignment would be if we could observe the
    sequence of O.
  • What we do know is that the sum of all
    possibilities is equal to 1.
  • Therefore we must sum the likielihoods for all
    possibilities of O.
  • This becomes computationally intensive.

c
A
0.1
0.1
a,c,g,t
O
For position 1
0.2
0.2
c
B
34
A three branch tree
A
0.1
0.3
C
B
0.2
The tree can be rooted anywhere and the
substitutions calculated accordingly. There are
many ways of doing this and this is left as an
exercise for the student.
35
Increasing the sophistication of models
  • So far, the models we have dealt with assume that
    change is equally likely at all positions and
    that the rate of change is constant for the
    entire duration of the phylogeny.
  • This is not a realistic model for all sequences
    (it is a neutral model with a constant molecular
    clock).

CORRECT TREE
WRONG TREE
p gt q
36
Small subunit ribosomal RNA
18S or 16S rRNA
37
The molecular clock for alpha-globinEach point
represents the number of substitutions separating
each animal from humans
shark
carp
number of substitutions
platypus
chicken
cow
Time to common ancestor (millions of years)
38
Invariable sites
  • For a given dataset we can assume that a certain
    proportion of sites are not free to vary -
    purifying selection (related to function)
    prevents these sites from changing).
  • We can therefore observe invariable positions
    either because they are under this selective
    constraint or because they have not had a chance
    to vary or because there is homoplasy in the
    dataset and a reversal (say) has caused the site
    to appear constant.
  • The likelihood that a site is invariable can be
    calculated by incorporating this possibility into
    our model and calculating for every site the
    likelihood that it is an invariable site.
  • It might improve the likelihood of the dataset if
    we remove a certain proportion of invariable
    sites (in a way that is analogous to the
    preceding discussion).
  • The PAUP software can estimate the proportion of
    invariable sites using Maximum Likelihood.

39
Variable sites
  • Obviously other sites in the dataset are free to
    vary.
  • Selection intensity on these sites is rarely
    uniform, so it is desirable to model site-by-site
    rate variation.
  • This is done in two ways
  • site specific (codon position, or alpha helix
    etc.)
  • using a discrete approximation to a continuous
    distribution (gamma distribution).
  • Again, these variables are modeled over all
    possibilities of sequence change over all
    possibilities of branch length over all
    possibilities of tree topology.

40
The shape of the gamma distribution for different
values of alpha.
41
Does changing a model affect the outcome?
There are different models Jukes and Cantor
(JC69) All base compositions equal (0.25 each),
rate of change from one base to another is the
same Kimura 2-Parameter (K2P) All base
compositions equal (0.25 each), different
substitution rate for transitions and
transversions). Hasegawa-Kishino-Yano (HKY) Like
the K2P, but with base composition free to
vary. General Time Reversible (GTR) Base
composition free to vary, all possible
substitutions can differ. All these models can
be extended to accommodate invariable sites and
site-to-site rate variation.
42
(No Transcript)
43
Long-branch attraction (LBA)
  • In the case below, the wrong tree is often
    selected. ML will not be prone to this problem,
    if the correct model of sequence evolution is
    used.
  • This is often called the felsenstein-zone.

CORRECT TREE
WRONG TREE
p gt q
44
Strengths of ML
  • Does not try to make an observation of sequence
    change and then a correction for superimposed
    substitutions. There is no need to correct for
    anything, the models take care of superimposed
    substitutions.
  • Accurate branch lengths.
  • Each site has a likelihood.
  • If the model is correct, we should retrieve the
    correct tree.
  • You can use a model that fits the data.
  • ML uses all the data (no selection of sites based
    on informativeness, all sites are informative).
  • ML can not only tell you about the phylogeny of
    the sequences, but also the process of evolution
    that led to the observations of todays sequences.

If we have long-enough sequences and a
sophisticated-enough model.
45
Models
  • You can use models that
  • Deal with different transition/transversion
    ratios.
  • Deal with unequal base composition.
  • Deal with heterogeneity of rates across sites.
  • Deal with heterogeneity of the substitution
    process (different rates across lineages,
    different rates at different parts of the tree).
  • The more free parameters, the better your model
    fits your data (good).
  • The more free parameters, the higher the variance
    of the estimate (bad).
  • Use a model that fits your data.

46
Over-fitting a model to your data.
47
Weaknesses of ML
  • Can be inconsistent if we use models that are not
    accurate.
  • Model might not be sophisticated enough (you can
    max-out on models).
  • Very computationally-intensive. Might not be
    possible to examine all models (substitution
    matrices, tree topologies, etc.).

48
Recommendations
  • Interact with the data.
  • If you have collected enough data, you might get
    a good picture of the underlying model of
    sequence evolution.
  • Use a test of alternative models (implemented in
    the modeltest software).
  • Dont just choose a model, use a model that
    fits your data.
  • Dont over-fit a model to your data.

49
How often have I said to you that when you have
eliminated the impossible, whatever remains,
however improbable, must be the truth. Sherlock
Holmes to Dr. Watson in The Sign of Four, by A.
Conan Doyle.
Write a Comment
User Comments (0)
About PowerShow.com