Maximum Likelihood - PowerPoint PPT Presentation

1 / 24
About This Presentation
Title:

Maximum Likelihood

Description:

Its slow uptake by the scientific community has to do ... of C is 0.25 (a simplified version of the Jukes and Cantor 1969 model) ... Jukes and Cantor (JC69) ... – PowerPoint PPT presentation

Number of Views:66
Avg rating:3.0/5.0
Slides: 25
Provided by: martin79
Category:

less

Transcript and Presenter's Notes

Title: Maximum Likelihood


1
Maximum Likelihood
2
Maximum Likelihood
Historically the newest method. Popularized by
Joseph Felsenstein, Seattle, Washington. Its slow
uptake by the scientific community has to do with
the difficulty of understanding the theory and
also the absence (initially) of good quality
software with choice of models and ease of
interaction with data. Also, at the time, it was
computationally intractable to analyse large
datasets (consider that in the mid-1980s a
typical desktop computer had a processor speed of
less than 30 MHz). In recent times, software,
models and computer hardware have become
sufficiently sophisticated that ML is becoming a
method of choice.
3
ML comparison with other methods
- ML is similar to many other methods in many
ways, but fundamentally different. - ML assumes a
model of sequence evolution (so does Maximum
Parsimony and so do distance matrix methods). -
ML attempts to answer the question What is the
probability that I would observe these data (a
multiple sequence alignment), given a particular
model of evolution (a tree and a process).
Pr(DH) is the probability of getting the data D
given hypothesis H.
L Pr (DH)
4
Principle of Likelihood
Pr(DH) is the probability of getting the data D
given hypothesis H.
L Pr (DH)
In the context of molecular phylogenetics, D is
the set of sequences being compared, and H is a
phylogenetic tree and process, hence we want to
find the likelihood of obtaining the observed
sequences given a particular tree based on a
process. The tree that makes our data the most
probable evolutionary outcome is the maximum
likelihood estimate of the phylogeny.
5
Maximum Likelihood
Probability of
given
6
What is the probability of observing a datum?
If we flip a coin and get a head and we think the
coin is unbiased, then the probability of
observing this head is 0.5. If we think the coin
is biased so that we expect to get a head 80 of
the time, then the likelihood of observing this
datum (a head) is 0.8. Therefore The likelihood
of making some observation is entirely dependent
on the model that underlies our assumption.
Lesson The datum has not changed, our model has.
Therefore under the new model the likelihood of
observing the datum has changed.
p
?
7
What is the probability of observing a 'G'
nucleotide?
  • Question If we have a DNA sequence of one
    nucleotide in length and the identity of this
    nucleotide is 'G', what is the likelihood that we
    would observe this 'G'?
  • Answer In the same way as the coin-flipping
    observation, the likelihood of observing this 'G'
    is dependent on the model of sequence evolution
    that is thought to underlie the data.
  • e.g.
  • Model 1 frequency of G 0.4 gt likelihood (G)
    0.4
  • Model 2 frequency of G 0.1 gt likelihood (G)
    0.1
  • Model 3 frequency of G 0.25 gt likelihood (G)
    0.25

8
What about longer sequences?
If we consider a gene of length 2 Gene 1
GA The probability of observing this gene is
the product of the probabilities of observing
each character. e.g. p(G) 0.4 p(A) 0.15
(for instance) likelihood (GA) 0.4 x 0.15 0.06
9
or even longer sequences?
Gene 1 GACTAGCTAGACAGATACGAATTAC Model (simple
base frequency model) p(A) 0.15 p(C) 0.2
p(G) 0.4 p(T) 0.25 (the sum of all
probabilities must equal 1) Like (Gene 1)
0.000000000000000018452813
10
Note About Models
You might notice that our model of base frequency
is not the optimal model for our observed data.
If we had used the following model p(A) 0.4
p(C) 0.2 p(G) 0.2 p(T) 0.2 The
likelihood of observing the gene is Like (gene
1) 0.000000000000335544320000 (a value that is
almost 10,000 times higher)
Lesson The datum has not changed, our model
has. Therefore under the new model the likelihood
of observing the datum has changed.
11
The Model
- The two parts of the model are the tree and
the process (the model). - The model is composed
of the composition and the substitution process
-rate of change from one character state to
another character state.
Model

12
Simple time-reversible Model
A simple model is that the rate of change from A
to C or vice versa is 0.4, the composition of A
is 0.25 and the composition of C is 0.25 (a
simplified version of the Jukes and Cantor 1969
model)
P
13
Probability of the third nucleotide position in
our current alignment
  • p(A) 0.25 p(C) 0.25
  • Starting with A, the likelihood of the nucleotide
    is 0.25 and the likelihood of the substitution
    (branch) is 0.4. So the likelihood of observing
    these data is
  • Likelihood(DM) 0.25 x 0.4 0.01

Note you will get the same result if you start
with c, since this model is reversible
The likelihood of the data, given the model.
14
Substitution Matrix
For nucleotide sequences, there are 16 possible
ways to describe substitutions - a 4x4 matrix.
Convention dictates that the order of the
nucleotides is A,C,G,T
Note for amino acids, the matrix is a 20 x 20
matrix and for codon-based models, the matrix is
61 x 61
15
Substitution matrix - an example
In this matrix, the probability of an A changing
to a C is 0.01 and the probability of a C
remaining the same is 0.983, etc.
Note The rows of this matrix sum to 1 - meaning
that for every nucleotide, we have covered all
the possibilities of what might happen to it. The
columns do not sum to anything in particular.
16
To calculate the likelihood of the entire
dataset, given a substitution matrix, base
composition and a branch length.
Gene 1 CCAT Gene 2 CCGT
Likelihood of given
? 0.1,0.4,0.2,0.3
17
Likelihood of a two-sequence alignment
  • CCAT
  • CCGT

0.4 x 0.983 x 0.4 x 0.983 x 0.1 x 0.007 x 0.3 x
0.979 0.0000300
Likelihood of going from the first to the second
sequence is 0.0000300
18
Likelihood of a two-branch tree
A
O
B
  • O is the origin or root. The likelihood can be
    calculated in three ways
  • from A to B in one step (this amounts to the
    previous method)
  • from A to B in two steps (through O)
  • in two parts starting at O.

19
Lesson about O
  • O is an unknown sequence.
  • We can only speculate what each position in the
    alignment would be if we could observe the
    sequence of O.
  • What we do know is that the sum of all
    possibilities is equal to 1.
  • Therefore we must sum the likielihoods for all
    possibilities of O.
  • This becomes computationally intensive.

C
A
For position 1 A,C,G,T
O
B
C
20
Does changing a model affect the outcome?
There are different models Jukes and Cantor
(JC69) All base compositions equal (0.25 each),
rate of change from one base to another is the
same Kimura 2-Parameter (K2P) All base
compositions equal (0.25 each), different
substitution rate for transitions and
transversions). Hasegawa-Kishino-Yano
(HKY) Like the K2P, but with base composition
free to vary. General Time Reversible
(GTR) Base composition free to vary, all
possible substitutions can differ.
21
(No Transcript)
22
Long-Branch Attraction
  • In the case below, the wrong tree is often
    selected. ML will not be prone to this problem,
    if the correct model of sequence evolution is
    used.

p gt q
CORRECT TREE
WRONG TREE
23
Strengths of ML
  • There is no need to correct for anything, the
    models take care of superimposed substitutions.
  • Accurate branch lengths.
  • Each site has a likelihood.
  • If the model is correct, we should retrieve the
    correct tree (if we have long enough sequences
    and a sophisticated enough model).
  • ML uses all the data (no selection of sites based
    on informativeness, all sites are informative).
  • ML not only tells you about the phylogeny of the
    sequences, but also the process of evolution that
    led to the observations of current sequences.

24
Weaknesses of ML
  • Can be inconsistent if we use models that are not
    accurate.
  • Model might not be sophisticated enough.
  • Very computationally-intensive. Might not be
    possible to examine all models (substitution
    matrices, tree topologies, etc.).
Write a Comment
User Comments (0)
About PowerShow.com