Title: Molecular Systematics
1Introduction to characters and parsimony analysis
2Character evolution
- Heritable changes in features (morphology, gene sequences, etc.) provide the basis for inferring phylogeny
- Such changes delimit what are usually referred to as the states of characters (e.g. presence or absence of a feature, or different nucleotide bases at specific sites in a sequence)
- The utility of characters depends on how often the changes that produce the different character states occur independently (homoplasy)
3Unique and unreversed characters
- Given a heritable evolutionary change that is unique and unreversed (e.g. the origin of hair) in an ancestral species, the presence of the novelty in any taxon must be due to inheritance from the ancestor
- Similarly, absence in any taxon must be because that taxon is not a descendant of the ancestor
- The novelty will be a homology acting as a badge or marker for the descendants of the ancestor
- The taxa with the novelty will be a clade (e.g. Mammalia)
4Unique and unreversed characters- Hair
- Because hair evolved only once and is unreversed
it is homologous and provides unambiguous
evidence for the clade Mammalia
[Figure: tree of Human, Dog, Lizard and Frog showing the character HAIR (absent or present), with the single change (step) marked]
5Homoplasy - Independent evolution
- Homoplasy is similarity that is not homologous (not due to common ancestry)
- Homoplasy is the result of independent evolution (convergence, parallelism, reversal)
- Homoplasy can provide misleading evidence of phylogenetic relationships
6Homoplasy - independent evolution- Tails
- Loss of tails evolved independently in humans and
frogs - there are two steps on the true tree
[Figure: true tree of Human, Dog, Lizard and Frog showing the character TAIL (absent or present)]
7Homoplasy - misleading evidence of phylogeny
- If misinterpreted as homology, the absence of
tails would be evidence for a wrong tree grouping
humans with frogs and lizards with dogs
[Figure: wrong tree grouping Human with Frog and Lizard with Dog, showing the character TAIL (absent or present)]
8Homoplasy - reversal
- Reversals are evolutionary changes back to an ancestral condition
- As with any homoplasy, reversals can provide misleading evidence of relationships
[Figure: true tree and wrong tree for taxa 1 to 10]
9Homoplasy - a fundamental problem of phylogenetic
inference
- If there were no homoplastic similarities, inferring phylogeny would be easy: all the pieces of the jigsaw would fit together neatly
- Distinguishing the misleading evidence of homoplasy from the reliable evidence of homology is a fundamental problem of phylogenetic inference
10Homoplasy and Incongruence
- If we assume that there is a single correct phylogenetic tree, then:
- When characters support conflicting phylogenetic trees, we know that there must be some misleading evidence of relationships among the incongruent or incompatible characters
- Incongruence between two characters implies that at least one of the characters is homoplastic and that at least one of the trees the characters support is wrong
11Incongruence or Incompatibility
[Figure: tree of Human, Dog, Lizard and Frog supported by the character HAIR (absent or present)]
- These trees and characters are incongruent. Both
trees cannot be correct and at least one
character must be homoplastic
[Figure: conflicting tree of Human, Dog, Lizard and Frog supported by the character TAIL (absent or present)]
12Distinguishing homology and homoplasy
- Morphologists use a variety of techniques to distinguish homoplasy and homology
- Homologous features are expected to display detailed similarity (in position, structure, development), whereas homoplastic similarities are more likely to be superficial
- As recognised by Darwin, congruence with other characters provides the most compelling evidence for homology
13The importance of congruence
- "The importance, for classification, of trifling characters, mainly depends on their being correlated with several other characters of more or less importance. The value indeed of an aggregate of characters is very evident ... a classification founded on any single character, however important that may be, has always failed."
- Charles Darwin, Origin of Species
14Congruence - 4
- We prefer the true tree because it is supported
by multiple congruent characters
[Figure: tree of Human, Dog, Lizard and Frog; the clade MAMMALIA is supported by three congruent characters: hair, a single bone in the lower jaw, and lactation]
15Homoplasy in molecular data
- Incongruence, and therefore homoplasy, can be common in molecular sequence data
- One reason is that characters have a limited number of alternative character states (e.g. A, G, C and T)
- In addition, a given state is chemically identical wherever it occurs, so homologous and homoplastic similarities look exactly alike and cannot be distinguished through detailed study of structure or development
16Parsimony analysis
- Parsimony methods provide one way of choosing among alternative phylogenetic hypotheses
- The parsimony criterion favours hypotheses that maximise congruence and minimise homoplasy
- It depends on the idea of the fit of a character to a tree
17Character Fit
- Initially, we can define the fit of a character to a tree as the minimum number of steps required to explain the observed distribution of character states among taxa
- This is determined by parsimonious character optimization
- Characters differ in their fit to different trees
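The fit of a single character can be sketched in code. The example below uses Fitch's algorithm, one common way of doing parsimonious optimization for unordered characters (the slides do not name a specific algorithm). The tree shapes are assumptions based only on the groupings described in the hair and tail examples, and the function name `fitch` is hypothetical.

```python
# Minimal sketch of Fitch optimization for one character on a rooted binary tree.
# A tree is a nested tuple of taxon names; `states` maps each taxon to its state.

def fitch(tree, states):
    """Return (possible ancestral states, minimum number of steps) for `tree`."""
    if isinstance(tree, str):                  # a leaf contributes its observed state
        return {states[tree]}, 0
    left_set, left_steps = fitch(tree[0], states)
    right_set, right_steps = fitch(tree[1], states)
    shared = left_set & right_set
    if shared:                                 # children agree: no extra step needed
        return shared, left_steps + right_steps
    # children disagree: one change (step) is required on this part of the tree
    return left_set | right_set, left_steps + right_steps + 1

hair = {"Human": "present", "Dog": "present", "Lizard": "absent", "Frog": "absent"}
tail = {"Human": "absent", "Dog": "present", "Lizard": "present", "Frog": "absent"}

# Assumed tree shapes: only the groupings are taken from the text.
true_tree = (("Human", "Dog"), ("Lizard", "Frog"))
wrong_tree = (("Human", "Frog"), ("Lizard", "Dog"))

hair_steps_true = fitch(true_tree, hair)[1]    # hair fits the true tree with 1 step
tail_steps_true = fitch(true_tree, tail)[1]    # tail needs 2 steps on the true tree
tail_steps_wrong = fitch(wrong_tree, tail)[1]  # tail needs only 1 step on the wrong tree
print(hair_steps_true, tail_steps_true, tail_steps_wrong)
```

The tail character fits the wrong tree better than the true tree, which is exactly why a single homoplastic character can mislead.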
18Character Fit - Amniota
[Figure: the same character mapped onto two alternative trees of ray-finned fish, lungfish, salamanders, frogs, mammals, turtles, lizards, snakes, crocodiles and birds; it requires 3 steps on one tree and 1 step on the other]
19Parsimony Analysis
- Given a set of characters, such as aligned sequences, parsimony analysis works by determining the fit (number of steps) of each character on a given tree
- The sum over all characters is called Tree Length
- Most parsimonious trees (MPTs) have the minimum tree length needed to explain the observed distributions of all the characters
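As a rough illustration of tree length and the search for most parsimonious trees, the sketch below scores the three possible unrooted groupings of four taxa using the hair, jaw, lactation and tail characters from the earlier slides. The step counting uses Fitch's algorithm, which is an assumption here, and all function and variable names are hypothetical. Real analyses use heuristic searches because the number of trees grows very quickly with the number of taxa.

```python
# Sketch: tree length = sum of per-character steps; the MPT has the minimum length.

def steps(tree, states):
    """Fitch step count for one character; returns (state set, steps)."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_steps = steps(tree[0], states)
    right_set, right_steps = steps(tree[1], states)
    shared = left_set & right_set
    if shared:
        return shared, left_steps + right_steps
    return left_set | right_set, left_steps + right_steps + 1

def tree_length(tree, matrix):
    """Sum the minimum number of steps over every character in the matrix."""
    return sum(steps(tree, character)[1] for character in matrix.values())

# Character matrix (1 = present, 0 = absent), taken from the slide examples.
matrix = {
    "hair":                  {"Human": 1, "Dog": 1, "Lizard": 0, "Frog": 0},
    "single lower jaw bone": {"Human": 1, "Dog": 1, "Lizard": 0, "Frog": 0},
    "lactation":             {"Human": 1, "Dog": 1, "Lizard": 0, "Frog": 0},
    "tail":                  {"Human": 0, "Dog": 1, "Lizard": 1, "Frog": 0},
}

# The three possible groupings of four taxa (root placement does not affect length).
candidate_trees = {
    "Human+Dog | Lizard+Frog": (("Human", "Dog"), ("Lizard", "Frog")),
    "Human+Frog | Lizard+Dog": (("Human", "Frog"), ("Lizard", "Dog")),
    "Human+Lizard | Dog+Frog": (("Human", "Lizard"), ("Dog", "Frog")),
}

lengths = {name: tree_length(tree, matrix) for name, tree in candidate_trees.items()}
shortest = min(lengths.values())
most_parsimonious = [name for name, length in lengths.items() if length == shortest]
print(lengths, most_parsimonious)
```

Even the shortest tree here needs one extra step for the tail character, so the most parsimonious tree still implies some homoplasy.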
20Parsimony in practice
Of these two trees, Tree 1 has the shorter length and is the more parsimonious. Both trees require some homoplasy (extra steps).
21Results of parsimony analysis
- One or more most parsimonious trees
- Hypotheses of character evolution associated with each tree (where and how changes have occurred); this may be very useful
- Branch lengths (amounts of change associated with branches)
- Various tree and character statistics describing the fit between tree and data
- Suboptimal trees (optional)
22Parsimony - advantages
- is a simple method with an easily understood operation
- does not seem to depend on an explicit model of evolution
- gives both trees and associated hypotheses of character evolution
- should give reliable results if the data are well structured and homoplasy is either rare or widely (randomly) distributed on the tree
23Parsimony - disadvantages
- May give misleading results if homoplasy is common or concentrated in particular parts of the tree, e.g.:
  - thermophilic convergence
  - base composition biases
  - long branch attraction
- Underestimates branch lengths
24Parsimony can be inconsistent
- Felsenstein (1978) developed a simple model phylogeny including four taxa and a mixture of short and long branches
- Under this model parsimony will give the wrong tree: long branches are attracted, but the similarity is homoplastic
- With more data, the certainty that parsimony will give the wrong tree increases, so parsimony is statistically inconsistent
- Advocates of parsimony initially responded by claiming that Felsenstein's result showed only that his model was unrealistic
- It is now recognised that long-branch attraction (in the Felsenstein Zone) is one of the most serious problems in phylogenetic inference
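The inconsistency can be sketched numerically. The code below assumes a two-state symmetric model on the four-taxon tree ((A,B),(C,D)), with long branches leading to A and C and short branches elsewhere; the branch change probabilities 0.4 and 0.05 are hypothetical values chosen for illustration, not figures from Felsenstein's paper. It computes the expected frequency of each parsimony-informative site pattern.

```python
from itertools import product

def pattern_prob(pattern, p_long, q_short):
    """Probability of the tip states (a, b, c, d) on the tree ((A,B),(C,D)).

    A and C sit on long branches (change probability p_long); B, D and the
    internal branch are short (change probability q_short).
    """
    def edge(start, end, change):
        # probability that a branch ends in `end` given that it starts in `start`
        return change if start != end else 1.0 - change

    a, b, c, d = pattern
    total = 0.0
    for u, v in product((0, 1), repeat=2):      # states at the two internal nodes
        total += (0.5                            # either state equally likely at node u
                  * edge(u, a, p_long) * edge(u, b, q_short)
                  * edge(u, v, q_short)
                  * edge(v, c, p_long) * edge(v, d, q_short))
    return total

p_long, q_short = 0.4, 0.05                      # hypothetical branch change probabilities

# Each informative pattern and its mirror image (0 and 1 swapped) are equally likely.
support_true_tree = 2 * pattern_prob((0, 0, 1, 1), p_long, q_short)      # groups A with B
support_long_branches = 2 * pattern_prob((0, 1, 0, 1), p_long, q_short)  # groups A with C
support_third_tree = 2 * pattern_prob((0, 1, 1, 0), p_long, q_short)     # groups A with D

print(support_true_tree, support_long_branches, support_third_tree)
```

Under these assumed values the pattern uniting the two long branches is expected more often than the pattern supporting the true tree, so adding more sites only strengthens parsimony's preference for the wrong grouping.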
25Methods other than parsimony
26Phylogenetic analysis - different methods
- Character-based methods
  - Maximum parsimony
  - Maximum likelihood
- Distance-based methods
27Maximum Likelihood 1
- Maximum likelihood methods of phylogenetic
inference evaluate a hypothesis about
evolutionary history (the branching order and
branch lengths of a tree) in terms of a
probability that a proposed model of the
evolutionary process and the hypothesised history
(tree) would give rise to the data we observe -
so given a model and data we can estimate a tree
(the maximum likelihood tree)
28Maximum Likelihood 2
- Maximum likelihood estimates a parameter from observed data under an explicit model
- There is an explicit link between model, tree and data (poor model, poor tree?)
- Likelihood also provides ways of comparing models by their log likelihoods (likelihood ratio tests), provided they are nested, i.e. one model is a special case of the other
29Maximum likelihood tree reconstruction 1
Data:
1 CGAGAC
2 AGCGAC
3 AGATTA
4 GGATAG
[Figure: unrooted Tree A joining sequences 1 to 4]
What is the probability that unrooted Tree A (rather than another tree) could have generated the data shown under our chosen model?
30Maximum likelihood tree reconstruction 1
[Figure: Tree A for the same four sequences, drawn with a root; note that the rooting is arbitrary]
31Maximum likelihood tree reconstruction 2
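Data, with site j (the last column) set apart:
1 CGAGA C
2 AGCGA C
3 AGATT A
4 GGATA G
[Figure: Tree A with the possible states A, C, G, T at each of its two internal nodes, giving 4 x 4 possible reconstructions for site j]
The likelihood for a particular site j is the sum of the probabilities of every possible reconstruction of ancestral states under a chosen model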
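The sum over the 4 x 4 ancestral reconstructions can be written out directly. The sketch below makes several assumptions that are not in the slide: the chosen model is Jukes-Cantor, Tree A groups sequence 1 with 2 and sequence 3 with 4, and every branch has the same hypothetical length of 0.1 expected substitutions per site. The function names are illustrative.

```python
import math
from itertools import product

BASES = "ACGT"

def jc_prob(start, end, t):
    """Jukes-Cantor probability of `end` after a branch of length t, starting at `start`."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if start == end else 0.25 - 0.25 * decay

def site_likelihood(site, t=0.1):
    """Likelihood of one site on the assumed tree ((1,2),(3,4)), all branches of length t."""
    s1, s2, s3, s4 = site
    total = 0.0
    for u, v in product(BASES, repeat=2):        # 4 x 4 ancestral reconstructions
        total += (0.25                            # equilibrium frequency of state u
                  * jc_prob(u, s1, t) * jc_prob(u, s2, t)
                  * jc_prob(u, v, t)              # change along the internal branch
                  * jc_prob(v, s3, t) * jc_prob(v, s4, t))
    return total

site_j = ("C", "C", "A", "G")                     # last column of the four sequences
likelihood_j = site_likelihood(site_j)
print(likelihood_j)
```

For larger trees the same sum is computed far more efficiently with Felsenstein's pruning algorithm rather than by enumerating every reconstruction.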
32Maximum likelihood tree reconstruction 3
- The likelihood of Tree A is the product of the likelihoods at each site
- The likelihood is usually evaluated by summing the logs of the site likelihoods (because the product of many small probabilities is vanishingly small) and reported as the log likelihood of the full tree
- The maximum likelihood tree is the one with the highest likelihood (it might not be Tree A, i.e. it could be another tree topology)
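The reason for working with logs can be seen in a small numerical sketch. The per-site likelihood of 1e-4 and the alignment length of 2000 sites are hypothetical values chosen only to show the effect.

```python
import math

# Hypothetical per-site likelihoods for an alignment of 2000 sites.
site_likelihoods = [1e-4] * 2000

# Multiplying the raw likelihoods underflows to zero in double precision.
raw_product = 1.0
for value in site_likelihoods:
    raw_product *= value

# Summing the logs gives a usable number: the log likelihood of the tree.
log_likelihood = sum(math.log(value) for value in site_likelihoods)

print(raw_product, log_likelihood)
```

Because the logarithm is monotonic, the tree with the highest log likelihood is also the tree with the highest likelihood.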
33Maximum likelihood tree reconstruction 4
- How are the probabilities of change calculated ?
- The probabilities used to calculate likelihoods
depend on the assumed model
34Typical assumptions of ML substitution models
- The probability of any change is independent of the prior history of the site (a Markov model)
- Substitution probabilities do not change with time or over the tree (a homogeneous Markov process)
- Change is time reversible: the process looks the same forwards and backwards in time (with equal base frequencies, for example, the rate of change of A to T is the same as T to A)
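Time reversibility is usually formalised as detailed balance: the frequency of base i times the rate from i to j equals the frequency of j times the rate from j to i. The individual rates of A to T and T to A are equal only when the two base frequencies are equal. The sketch below builds a rate matrix from hypothetical exchangeabilities and base frequencies and checks that condition; every number and function name is an assumption for illustration.

```python
BASES = "ACGT"

def build_rate_matrix(exchange, freqs):
    """Rate matrix with q[i, j] = exchange{i,j} * freqs[j]; each row sums to zero."""
    q = {}
    for i in BASES:
        for j in BASES:
            if i != j:
                q[i, j] = exchange[frozenset((i, j))] * freqs[j]
        q[i, i] = -sum(q[i, j] for j in BASES if j != i)
    return q

def is_time_reversible(q, freqs, tol=1e-12):
    """Check detailed balance: freqs[i] * q[i, j] == freqs[j] * q[j, i] for every pair."""
    return all(abs(freqs[i] * q[i, j] - freqs[j] * q[j, i]) <= tol
               for i in BASES for j in BASES if i != j)

# Hypothetical base frequencies and symmetric exchangeabilities.
freqs = {"A": 0.1, "C": 0.3, "G": 0.4, "T": 0.2}
exchange = {frozenset(pair): rate for pair, rate in [
    ("AC", 0.5), ("AG", 1.5), ("AT", 0.8), ("CG", 1.2), ("CT", 3.5), ("GT", 1.0)]}

q = build_rate_matrix(exchange, freqs)
print(q["A", "T"], q["T", "A"], is_time_reversible(q, freqs))

# Changing a single off-diagonal rate breaks detailed balance.
q_broken = dict(q)
q_broken["A", "T"] *= 2.0
print(is_time_reversible(q_broken, freqs))
```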
35Maximum likelihood models 1
- The model incorporates information about the rates at which each nucleotide is replaced by each alternative nucleotide
- For DNA this can be expressed as a 4 x 4 rate matrix
- Other model parameters may include:
  - Site by site rate variation, often modelled as a statistical distribution, for example a gamma distribution
36Maximum likelihood models 2
- Model parameters can be:
  - estimated from the data (using maximum likelihood in PAUP)
  - pre-set based upon assumptions about the data (for example, that for all sequences all sites change at the same rate and all substitutions are equally likely, e.g. the widely used Jukes and Cantor model)
- Wherever possible, avoid assumptions which are violated by the data because they can lead to incorrect trees
37The true tree for Deinococcus and Thermus
[Figure: the true tree for Thermus, Deinococcus, Aquifex and Bacillus]
38The Jukes and Cantor model is the simplest model
The JC model is a one-parameter model:
1) it assumes that all changes are equally probable (p = 0.25)
2) unless modified, it assumes all sites can change and that they do so at the same rate
39Output of JC ML analysis for (Thermus,
Deinococcus, Bacillus, Aquifex)
Tree 3 -log likelihood -4132
Tree 2 -log likelihood -4101 True tree
Tree 1 -log likelihood -4090 Best tree
The Jukes and Cantor model in ML is unable to
recover the true tree for this data set
40The 16S rRNA genes of Aquifex, Bacillus,
Deinococcus and Thermus
Exclude characters command in PAUP - exclude constant sites

Character-exclusion status changed: 859 of 1273 characters excluded
Total number of characters now excluded: 859
Number of included characters: 414

Does the JC model fit these data?

Base frequencies command in PAUP

Taxon        A         C         G         T         sites
----------------------------------------------------------
Aquifex      0.12319   0.38164   0.38164   0.11353   414
Deinococc    0.23188   0.22222   0.27295   0.27295   414
Thermus      0.13317   0.35835   0.37530   0.13317   413
Bacillus     0.23188   0.22705   0.26570   0.27536   414
----------------------------------------------------------
Mean         0.18006   0.29728   0.32387   0.19879   413.75
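One rough way to ask whether base composition is the same in all four sequences, as the JC model assumes, is a chi-square test of homogeneity on the base counts. The sketch below converts the frequencies in the table back into counts and computes the statistic by hand. Treating sites and sequences as independent is a simplifying assumption (related sequences share history), so the result is only indicative; the critical value 16.919 is the standard 5% point of the chi-square distribution with 9 degrees of freedom.

```python
# Base frequencies (A, C, G, T) and site counts copied from the table.
table = {
    "Aquifex":   ((0.12319, 0.38164, 0.38164, 0.11353), 414),
    "Deinococc": ((0.23188, 0.22222, 0.27295, 0.27295), 414),
    "Thermus":   ((0.13317, 0.35835, 0.37530, 0.13317), 413),
    "Bacillus":  ((0.23188, 0.22705, 0.26570, 0.27536), 414),
}

# Convert frequencies back to whole-number base counts.
counts = {taxon: [round(f * n) for f in freqs] for taxon, (freqs, n) in table.items()}

column_totals = [sum(row[k] for row in counts.values()) for k in range(4)]
grand_total = sum(column_totals)

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells.
chi_square = 0.0
for row in counts.values():
    row_total = sum(row)
    for k in range(4):
        expected = row_total * column_totals[k] / grand_total
        chi_square += (row[k] - expected) ** 2 / expected

degrees_of_freedom = (len(counts) - 1) * (4 - 1)
CRITICAL_5_PERCENT_DF9 = 16.919                  # standard chi-square table value

print(chi_square, degrees_of_freedom, chi_square > CRITICAL_5_PERCENT_DF9)
```

The statistic is far above the critical value, consistent with the visibly higher C and G content of Aquifex and Thermus in the table.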
41Models can be made more parameter rich to
increase their realism 1
- The most common additional parameters are:
  - A correction to allow different substitution rates for each type of nucleotide change
  - A correction for the proportion of sites which are unable to change
  - A correction for variable site rates at those sites which can change
- PAUP will estimate the values of these additional parameters for you
42A gamma distribution can be used to model site
rate heterogeneity
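A short simulation gives a feel for how the gamma shape parameter controls rate heterogeneity. The sketch assumes the usual convention of a gamma distribution with mean 1 (shape alpha, scale 1/alpha); the two alpha values, the number of sites and the "slow site" threshold are hypothetical choices for illustration.

```python
import random

def sample_site_rates(alpha, n_sites, seed=1):
    """Draw relative rates for n_sites from a gamma distribution with mean 1."""
    rng = random.Random(seed)
    # shape alpha and scale 1/alpha give mean alpha * (1/alpha) = 1
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]

def fraction_slow(rates, threshold=0.1):
    """Fraction of sites evolving at less than `threshold` times the average rate."""
    return sum(rate < threshold for rate in rates) / len(rates)

strong_heterogeneity = sample_site_rates(alpha=0.5, n_sites=50_000)  # small alpha
weak_heterogeneity = sample_site_rates(alpha=5.0, n_sites=50_000)    # large alpha

for label, rates in [("alpha=0.5", strong_heterogeneity), ("alpha=5.0", weak_heterogeneity)]:
    mean_rate = sum(rates) / len(rates)
    print(label, round(mean_rate, 3), round(fraction_slow(rates), 3))
```

With a small shape parameter most sites change very slowly while a few change quickly; with a large shape parameter the rates cluster around the average.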
43The GTR model of sequence evolution
The general time reversible model (GTR) is the most general time-reversible substitution model because it allows a different rate for each type of substitution. For example, for the 16S ribosomal RNA data for Deinococcus, Thermus, Aquifex and Bacillus:

Tree number 1, -Ln likelihood: 3985.30400

Estimated R-matrix:
-2.7325625    0.4419956    1.42028      0.87028688
 0.4419956   -5.2448524    1.2621698    3.540687
 1.42028      1.2621698   -3.6824498    1
 0.87028688   3.540687     1           -5.4109739

Estimated value of proportion of invariable sites: 0.228318
Estimated value of gamma shape parameter: 0.610459
44Models can be made more parameter rich to
increase their realism 2
- But the more parameters you estimate from the data, the more time is needed for an analysis and the more sampling error accumulates
- One might have a realistic model but large sampling errors
- Realism comes at a cost in time and precision!
- Fewer parameters may give an inaccurate estimate, but more parameters decrease the precision of the estimate
- In general, use the simplest model which fits the data
- Use PAUP to compare the likelihoods of nested models incorporating additional parameters
45Models can be made more parameter rich to
increase their realism 3
JC ML tree: -4090
JC + invariable sites: -4030
JC + invariable sites + gamma correction for variable sites: -4029
GTR + invariable sites + gamma correction for variable sites: -3985
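The four log likelihoods above can be compared with likelihood ratio tests, a standard way of comparing nested models. The sketch below uses the rounded values from the slide, so the statistics are approximate. The model labels, the counts of extra free parameters (1 for invariable sites, 1 for the gamma shape, 8 for GTR over JC assuming five relative rates and three free base frequencies) and the use of plain chi-square critical values are assumptions; parameters at a boundary, such as a zero proportion of invariable sites, strictly call for a modified test.

```python
# Rounded log likelihoods from the slide, keyed by hypothetical model labels.
log_likelihood = {
    "JC": -4090.0,
    "JC+I": -4030.0,
    "JC+I+G": -4029.0,
    "GTR+I+G": -3985.0,
}

# (simpler model, richer model, extra free parameters, 5% chi-square critical value)
comparisons = [
    ("JC", "JC+I", 1, 3.841),
    ("JC+I", "JC+I+G", 1, 3.841),
    ("JC+I+G", "GTR+I+G", 8, 15.507),
]

results = {}
for simple, rich, extra_parameters, critical in comparisons:
    # Likelihood ratio statistic: twice the gain in log likelihood.
    statistic = 2.0 * (log_likelihood[rich] - log_likelihood[simple])
    results[(simple, rich)] = (statistic, statistic > critical)
    print(f"{simple} vs {rich}: 2*dlnL = {statistic:.1f}, "
          f"df = {extra_parameters}, significant at 5%: {statistic > critical}")
```

On these rounded numbers, adding invariable sites and moving to GTR both give large improvements, while the gain from adding the gamma correction to JC with invariable sites is small.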
46The 16S rRNA genes of Aquifex, Bacillus,
Deinococcus, Thermus and Thermus ruber
Exclude characters command in PAUP - exclude constant sites

Character-exclusion status changed: 837 characters excluded
Total number of characters now excluded: 837
Number of included characters: 436

Base frequencies command in PAUP

Taxon        A         C         G         T         sites
----------------------------------------------------------
ruber        0.19725   0.27294   0.29587   0.23394   436
Aquifex      0.12156   0.38073   0.38532   0.11239   436
Deinococc    0.22477   0.22936   0.28211   0.26376   436
Thermus      0.13103   0.35862   0.37931   0.13103   435
Bacillus     0.22477   0.23394   0.27523   0.26606   436
----------------------------------------------------------
Mean         0.17990   0.29509   0.32354   0.20147   435.80
47Output of GTR-inv sites ML analysis for (Deinococcus, Bacillus, Aquifex, Thermus and Thermus ruber)
Tree 3 -log likelihood 4437 Best tree, True tree
Tree 2 -log likelihood 4447
Tree 1 -log likelihood 4439
With the addition of Thermus ruber, which has a base composition intermediate between those of thermophiles and mesophiles, GTR-inv sites ML recovers the Thermus-Deinococcus relationship
48Estimation of ML substitution model parameters
- Yang (1995) has shown that parameter estimates
are reasonably stable across tree topologies
provided trees are not too wrong. Thus one can
obtain a tree using a quick method (useful when
many sequences are being analysed) and then
estimate parameters on that tree. These
parameters can then be used in a search for the
Maximum Likelihood Tree.
49Parameter estimates using the tree scores
command in PAUP
Use the PAUP tree scores command to estimate, by ML over this tree:
1) Proportion of invariant sites
2) Gamma shape parameter for variable sites
3) Substitution parameters for all types of change
[Figure: maximum parsimony tree]
50ML Parameter estimates over a parsimony tree
using tree scores in PAUP
Tree number 1, -Ln likelihood: 4432.16903

Estimated R-matrix:
-2.992539     0.53399075   1.6835489    0.77499941
 0.53399075  -6.0877637    1.0048052    4.5489678
 1.6835489    1.0048052   -3.6883541    1
 0.77499941   4.5489678    1           -6.3239672

Corresponding Q-matrix:
-0.77509276   0.12637319   0.52569668   0.12302289
 0.11475088  -1.1506065    0.31375553   0.72210013
 0.36178289   0.2377952   -0.75831742   0.15873934
 0.16654196   1.0765496    0.31225508  -1.5553467

Estimated value of proportion of invariable sites: 0.302946
Estimated value of gamma shape parameter: 0.629797

These values can then be used as the starting parameters for a full likelihood search
51Maximum Likelihood Tree
52Maximum Likelihood -advantages
- Mathematically rigorous; performs well in computer simulations
- Allows investigation of the fit between model and data
- Provides a simple way of comparing trees according to their likelihoods (difference tests, e.g. the Kishino-Hasegawa test)
53Maximum Likelihood -disadvantages
- Maximum likelihood will only be consistent (converge on the true tree) if evolution proceeds according to the assumed model. How well does the model fit the data?
- Becomes computationally infeasible with many taxa or many model parameters