Molecular Systematics - PowerPoint PPT Presentation

Provided by: marti283
Title: Molecular Systematics


1
Introduction to characters and parsimony analysis
2
Character evolution
  • Heritable changes in features (morphology, gene
    sequences, etc.) provide the basis for inferring
    phylogeny
  • Such changes delimit what are usually referred to
    as the states of characters (e.g. presence or
    absence of a feature or different nucleotide
    bases at specific sites in a sequence)
  • The utility of characters depends on how often
    the changes that produce the different character
    states occur independently (homoplasy)

3
Unique and unreversed characters
  • Given a heritable evolutionary change that is
    unique and unreversed (e.g. the origin of hair)
    in an ancestral species, the presence of the
    novelty in any taxon must be due to inheritance
    from the ancestor.
  • Similarly, absence in any taxon must be because
    that taxon is not a descendant of that ancestor
  • The novelty will be a homology acting as a badge
    or marker for the descendants of the ancestor
  • The taxa with the novelty will be a clade (e.g.
    Mammalia)

4
Unique and unreversed characters- Hair
  • Because hair evolved only once and is unreversed
    it is homologous and provides unambiguous
    evidence for the clade Mammalia

[Figure: tree of Human, Lizard, Dog and Frog with the character HAIR (absent/present) mapped on it. Key: change or step]
5
Homoplasy - Independent evolution
  • Homoplasy is similarity that is not homologous
    (not due to common ancestry)
  • Homoplasy is the result of independent evolution
    (convergence, parallelism, reversal)
  • Homoplasy can provide misleading evidence of
    phylogenetic relationships

6
Homoplasy - independent evolution- Tails
  • Loss of tails evolved independently in humans and
    frogs - there are two steps on the true tree

[Figure: tree of Human, Lizard, Frog and Dog with the character TAIL (absent/present) mapped on it]
7
Homoplasy - misleading evidence of phylogeny
  • If misinterpreted as homology, the absence of
    tails would be evidence for a wrong tree grouping
    humans with frogs and lizards with dogs

[Figure: wrong tree of Lizard, Human, Dog and Frog with the character TAIL (absent/present) mapped on it]
8
Homoplasy - reversal
  • Reversals are evolutionary changes back to an
    ancestral condition
  • As with any homoplasy, reversals can provide
    misleading evidence of relationships

[Figure: a true tree and a wrong tree for taxa 1-10]
9
Homoplasy - a fundamental problem of phylogenetic
inference
  • If there were no homoplastic similarities
    inferring phylogeny would be easy - all the
    pieces of the jig-saw would fit together neatly
  • Distinguishing the misleading evidence of
    homoplasy from the reliable evidence of homology
    is a fundamental problem of phylogenetic inference

10
Homoplasy and Incongruence
  • If we assume that there is a single correct
    phylogenetic tree then
  • When characters support conflicting phylogenetic
    trees we know that there must be some misleading
    evidence of relationships among the incongruent
    or incompatible characters
  • Incongruence between two characters implies that
    at least one of the characters is homoplastic and
    that at least one of the trees the character
    supports is wrong

11
Incongruence or Incompatibility
[Figure: tree of Human, Lizard, Dog and Frog with the character HAIR (absent/present) mapped on it]
  • These trees and characters are incongruent. Both
    trees cannot be correct and at least one
    character must be homoplastic

[Figure: tree of Lizard, Human, Dog and Frog with the character TAIL (absent/present) mapped on it]
12
Distinguishing homology and homoplasy
  • Morphologists use a variety of techniques to
    distinguish homoplasy and homology
  • Homologous features are expected to display
    detailed similarity (in position, structure,
    development) whereas homoplastic similarities are
    more likely to be superficial
  • As recognised by Darwin, congruence with other
    characters provides the most compelling evidence
    for homology

13
The importance of congruence
  • The importance, for classification, of trifling
    characters, mainly depends on their being
    correlated with several other characters of more
    or less importance. The value indeed of an
    aggregate of characters is very evident ........
    a classification founded on any single character,
    however important that may be, has always failed.
  • Charles Darwin, Origin of Species

14
Congruence - 4
  • We prefer the true tree because it is supported
    by multiple congruent characters

[Figure: tree of Human, Lizard, Frog and Dog, with MAMMALIA supported by three congruent characters: hair, single bone in lower jaw, lactation]
15
Homoplasy in molecular data
  • Incongruence and therefore homoplasy can be
    common in molecular sequence data
  • One reason is that characters have a limited
    number of alternative character states (e.g. A,
    G, C and T)
  • In addition, a given state (e.g. an A) is
    chemically identical wherever it occurs, so
    homologous and homoplastic similarities look
    exactly alike and cannot be distinguished
    through detailed study of structure or
    development

16
Parsimony analysis
  • Parsimony methods provide one way of choosing
    among alternative phylogenetic hypotheses
  • The parsimony criterion favours hypotheses that
    maximise congruence and minimise homoplasy
  • It depends on the idea of the fit of a character
    to a tree

17
Character Fit
  • Initially, we can define the fit of a character
    to a tree as the minimum number of steps required
    to explain the observed distribution of character
    states among taxa
  • This is determined by parsimonious character
    optimization
  • Characters differ in their fit to different trees
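
The fit of a single character can be counted with Fitch's algorithm, one standard way of doing parsimonious optimization for unordered states. The sketch below is illustrative only: the nested-tuple tree encoding and both topologies are assumptions (the "true tree" is taken to be Frog, then Lizard, then Human plus Dog; the "wrong tree" groups humans with frogs and lizards with dogs, as on slide 7), and the hair and tail codings follow the earlier slides.

```python
# Fitch's small-parsimony count for one character on a rooted binary tree.
# Trees are nested tuples of taxon names; both topologies are assumptions.

def fitch_steps(tree, states):
    """Return (state_set, steps) for `tree` given a taxon -> state mapping."""
    if isinstance(tree, str):                 # a tip: its observed state
        return {states[tree]}, 0
    left, right = tree
    left_set, left_steps = fitch_steps(left, states)
    right_set, right_steps = fitch_steps(right, states)
    shared = left_set & right_set
    if shared:                                # children agree: no extra step
        return shared, left_steps + right_steps
    # children disagree: take the union and count one step
    return left_set | right_set, left_steps + right_steps + 1

true_tree = ("Frog", ("Lizard", ("Human", "Dog")))
wrong_tree = (("Human", "Frog"), ("Lizard", "Dog"))

hair = {"Frog": "absent", "Lizard": "absent", "Human": "present", "Dog": "present"}
tail = {"Frog": "absent", "Lizard": "present", "Human": "absent", "Dog": "present"}

for name, character in (("hair", hair), ("tail", tail)):
    for label, tree in (("true tree", true_tree), ("wrong tree", wrong_tree)):
        print(name, label, fitch_steps(tree, character)[1])
```

Under these assumptions hair fits the true tree better than the wrong tree, while tail does the opposite, which is the sense in which characters differ in their fit to different trees.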

18
Character Fit - Amniota
[Figure: two trees of the same taxa (ray-finned fish, lungfish, salamanders, frogs, turtles, lizards, snakes, crocodiles, birds, mammals); the character requires 3 steps on one tree and 1 step on the other]
19
Parsimony Analysis
  • Given a set of characters, such as aligned
    sequences, parsimony analysis works by
    determining the fit (number of steps) of each
    character on a given tree
  • The sum over all characters is called Tree Length
  • Most parsimonious trees (MPTs) have the minimum
    tree length needed to explain the observed
    distributions of all the characters
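
Tree length can be sketched by summing per-character step counts over a small matrix and keeping the tree with the smallest total. This is a toy illustration, not the presenter's analysis: the three candidate topologies (named here by the group that contains Human), the nested-tuple encoding and the 0/1 codings of the four characters from the earlier slides are all assumptions.

```python
def fitch_steps(tree, states):
    """Minimum number of steps for one character (Fitch's algorithm)."""
    def visit(node):
        if isinstance(node, str):             # tip
            return {states[node]}, 0
        (lset, lsteps), (rset, rsteps) = visit(node[0]), visit(node[1])
        shared = lset & rset
        if shared:
            return shared, lsteps + rsteps
        return lset | rset, lsteps + rsteps + 1
    return visit(tree)[1]

# Character matrix (1 = present, 0 = absent); codings are assumptions
# based on the hair, jaw, lactation and tail examples in the slides.
matrix = {
    "hair":              {"Frog": 0, "Lizard": 0, "Human": 1, "Dog": 1},
    "single jaw bone":   {"Frog": 0, "Lizard": 0, "Human": 1, "Dog": 1},
    "lactation":         {"Frog": 0, "Lizard": 0, "Human": 1, "Dog": 1},
    "tail":              {"Frog": 0, "Lizard": 1, "Human": 0, "Dog": 1},
}

# The three possible groupings of four taxa (hypothetical labels).
candidate_trees = {
    "Human+Dog":    ("Frog", ("Lizard", ("Human", "Dog"))),
    "Human+Frog":   (("Human", "Frog"), ("Lizard", "Dog")),
    "Human+Lizard": (("Human", "Lizard"), ("Frog", "Dog")),
}

# Tree length = sum of the steps of every character on that tree.
lengths = {
    name: sum(fitch_steps(tree, character) for character in matrix.values())
    for name, tree in candidate_trees.items()
}
best = min(lengths.values())
mpts = [name for name, length in lengths.items() if length == best]
print(lengths, mpts)
```

Even the shortest tree here needs one extra step for the tail character, so the most parsimonious tree minimises homoplasy without eliminating it.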

20
Parsimony in practice
Of these two trees, Tree 1 has the shorter
length and is the more parsimonious. Both trees
require some homoplasy (extra steps).
21
Results of parsimony analysis
  • One or more most parsimonious trees
  • Hypotheses of character evolution associated with
    each tree (where and how changes have occurred) -
    this may be very useful
  • Branch lengths (amounts of change associated with
    branches)
  • Various tree and character statistics describing
    the fit between tree and data
  • Suboptimal trees - optional

22
Parsimony - advantages
  • is a simple method - easily understood operation
  • does not seem to depend on an explicit model of
    evolution
  • gives both trees and associated hypotheses of
    character evolution
  • should give reliable results if the data is well
    structured and homoplasy is either rare or widely
    (randomly) distributed on the tree

23
Parsimony - disadvantages
  • May give misleading results if homoplasy is
    common or concentrated in particular parts of the
    tree, e.g.
      • thermophilic convergence
      • base composition biases
      • long branch attraction
  • Underestimates branch lengths

24
Parsimony can be inconsistent
  • Felsenstein (1978) developed a simple model
    phylogeny including four taxa and a mixture of
    short and long branches
  • Under this model parsimony will give the wrong
    tree

Long branches are attracted but the
similarity is homoplastic
  • With more data the certainty that parsimony will
    give the wrong tree increases - so that parsimony
    is statistically inconsistent
  • Advocates of parsimony initially responded by
    claiming that Felsenstein's result showed only
    that his model was unrealistic
  • It is now recognised that the long-branch
    attraction (in the Felsenstein Zone) is one of
    the most serious problems in phylogenetic
    inference

25
Methods other than parsimony
26
Phylogenetic analysis - different methods
  • Character-based methods
      • Maximum parsimony
      • Maximum likelihood
  • Distance-based methods

27
Maximum Likelihood 1
  • Maximum likelihood methods of phylogenetic
    inference evaluate a hypothesis about
    evolutionary history (the branching order and
    branch lengths of a tree) in terms of the
    probability that a proposed model of the
    evolutionary process and the hypothesised history
    (tree) would give rise to the data we observe
  • So, given a model and data, we can estimate a tree
    (the maximum likelihood tree)

28
Maximum Likelihood 2
  • Maximum likelihood estimates a parameter from
    observed data under an explicit model
  • There is an explicit link between model, tree
    and data (poor model, poor tree?)
  • Likelihood also provides ways of evaluating
    models in terms of their log likelihoods,
    provided they are nested i.e. one model is a
    special case of the other

29
Maximum likelihood tree reconstruction 1
1 CGAGAC
2 AGCGAC
3 AGATTA
4 GGATAG
[Figure: unrooted Tree A relating sequences 1-4]
What is the probability that unrooted Tree A
(rather than another tree) could have generated
the data shown under our chosen model?
30
Maximum likelihood tree reconstruction 1
1 CGAGAC
2 AGCGAC
3 AGATTA
4 GGATAG
[Figure: Tree A for sequences 1-4 (note: rooting is arbitrary)]
What is the probability that unrooted Tree A
(rather than another tree) could have generated
the data shown under our chosen model?
31
Maximum likelihood tree reconstruction 2
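1 CGAGA C
2 AGCGA C
3 AGATT A
4 GGATA G
[Figure: Tree A with site j (the last alignment column) at its tips; the unobserved ancestral states can each be A, C, G or T, giving 4 x 4 possibilities]
The likelihood for a particular site j is the sum
of the probabilities of every possible
reconstruction of ancestral states under a chosen
model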
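
The sum over ancestral reconstructions can be written out directly for four sequences. The sketch below is illustrative and rests on several assumptions that are not in the slides: the Jukes and Cantor model is used for the branch probabilities, Tree A is taken to group sequences 1 and 2 against 3 and 4, and every branch length is set to an arbitrary 0.1 expected substitutions per site.

```python
import math
from itertools import product

BASES = "ACGT"

def jc_prob(a, b, t):
    """JC69 probability that base a is replaced by base b along a branch
    of length t (expected substitutions per site)."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay

def site_likelihood(tips, t_tip=0.1, t_internal=0.1):
    """Likelihood of one site on the assumed tree ((1,2),(3,4)).
    x is the node joining tips 1 and 2, y the node joining tips 3 and 4."""
    s1, s2, s3, s4 = tips
    total = 0.0
    for x, y in product(BASES, repeat=2):      # the 4 x 4 reconstructions
        total += (0.25                           # equilibrium frequency of x
                  * jc_prob(x, s1, t_tip) * jc_prob(x, s2, t_tip)
                  * jc_prob(x, y, t_internal)
                  * jc_prob(y, s3, t_tip) * jc_prob(y, s4, t_tip))
    return total

alignment = ["CGAGAC", "AGCGAC", "AGATTA", "GGATAG"]
site_j = tuple(seq[-1] for seq in alignment)     # last column: C, C, A, G
print(site_likelihood(site_j))
```

A useful sanity check on any such function is that the likelihoods of all 256 possible tip patterns add up to 1.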
32
Maximum likelihood tree reconstruction 3
  • The likelihood of Tree A is the product of the
    likelihoods at each site
  • The likelihood is usually evaluated by summing
    the log of the likelihoods at each site (because
    the site likelihoods, and especially their
    product, are so small) and is reported as the
    log likelihood of the full tree
  • The Maximum likelihood tree is the one with the
    highest likelihood (might not be Tree A i.e. it
    could be another tree topology)
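
The reason for working with logs can be seen with made-up numbers. In this sketch the 100 per-site likelihoods of 1e-5 are hypothetical values chosen only to show the effect; they do not come from any analysis in the slides.

```python
import math

# Hypothetical per-site likelihoods (illustrative values only).
site_likelihoods = [1e-5] * 100

# Multiplying them directly underflows double precision and gives 0.0.
direct_product = math.prod(site_likelihoods)

# Summing the logs keeps the same information in a representable number.
log_likelihood = sum(math.log(value) for value in site_likelihoods)

print(direct_product)
print(log_likelihood)
```

Because the log is monotonic, the tree with the highest log likelihood is also the tree with the highest likelihood.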

33
Maximum likelihood tree reconstruction 4
  • How are the probabilities of change calculated?
  • The probabilities used to calculate likelihoods
    depend on the assumed model

34
Typical assumptions of ML substitution models
  • The probability of any change is independent of
    the prior history of the site (a Markov Model)
  • Substitution probabilities do not change with
    time or over the tree (a homogeneous Markov
    process)
  • Change is time reversible, e.g. the rate of change
    of A to T is the same as T to A (once base
    frequencies are taken into account)

35
Maximum likelihood models 1
  • The model incorporates information about the
    rates at which each nucleotide is replaced by
    each alternative nucleotide
  • For DNA this can be expressed as a 4 x 4 rate
    matrix
  • Other model parameters may include
      • Site by site rate variation - often modelled
        as a statistical distribution - for example a
        gamma distribution

36
Maximum likelihood models 2
  • Model parameters can be
      • estimated from the data (using maximum
        likelihood in PAUP)
      • pre-set based upon assumptions about the
        data (for example that for all sequences all
        sites change at the same rate and all
        substitutions are equally likely - e.g. the
        widely used Jukes and Cantor Model)
  • Wherever possible avoid assumptions which are
    violated by the data because they can lead to
    incorrect trees

37
The true tree for Deinococcus and Thermus
[Figure: the true tree for Thermus, Aquifex, Deinococcus and Bacillus]
38
The Jukes and Cantor model is the simplest model
The JC model is a one parameter model:
1) it assumes that all changes are equally
probable (p = 0.25)
2) unless modified it assumes all sites can change
and that they do so at the same rate
39
Output of JC ML analysis for (Thermus,
Deinococcus, Bacillus, Aquifex)
Tree 3: log likelihood = -4132
Tree 2: log likelihood = -4101 (true tree)
Tree 1: log likelihood = -4090 (best tree)
The Jukes and Cantor model in ML is unable to
recover the true tree for this data set
40
The 16S rRNA genes of Aquifex, Bacillus,
Deinococcus and Thermus
Exclude characters command in PAUP - exclude
constant sites
Character-exclusion status changed: 859 of 1273 characters excluded
Total number of characters now excluded: 859
Number of included characters: 414
Does the JC model fit these data?
Base frequencies command in PAUP
Taxon        A        C        G        T        sites
------------------------------------------------------
Aquifex      0.12319  0.38164  0.38164  0.11353  414
Deinococc    0.23188  0.22222  0.27295  0.27295  414
Thermus      0.13317  0.35835  0.37530  0.13317  413
Bacillus     0.23188  0.22705  0.26570  0.27536  414
------------------------------------------------------
Mean         0.18006  0.29728  0.32387  0.19879  413.75
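
One quick way to look at the question "Does the JC model fit these data?" is to compare the reported base frequencies with the equal frequencies (0.25 each) that JC assumes. The sketch below only re-uses the four rows of frequencies from the PAUP output above; the summary statistics it prints (G+C content and the largest departure from 0.25) are choices made for this illustration, not part of the PAUP output.

```python
# Base frequencies (A, C, G, T) copied from the PAUP output above.
base_frequencies = {
    "Aquifex":   (0.12319, 0.38164, 0.38164, 0.11353),
    "Deinococc": (0.23188, 0.22222, 0.27295, 0.27295),
    "Thermus":   (0.13317, 0.35835, 0.37530, 0.13317),
    "Bacillus":  (0.23188, 0.22705, 0.26570, 0.27536),
}

# G+C content per taxon.
gc_content = {taxon: c + g for taxon, (a, c, g, t) in base_frequencies.items()}

# Largest departure of any base from the 0.25 assumed by JC.
max_deviation = {
    taxon: max(abs(freq - 0.25) for freq in freqs)
    for taxon, freqs in base_frequencies.items()
}

for taxon in base_frequencies:
    print(f"{taxon:10s} G+C = {gc_content[taxon]:.3f}  "
          f"max |freq - 0.25| = {max_deviation[taxon]:.3f}")
```

Aquifex and Thermus come out far richer in G and C than Deinococcus and Bacillus, so a model with equal, shared base frequencies is a poor description of these variable sites.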
41
Models can be made more parameter rich to
increase their realism 1
  • The most common additional parameters are
      • A correction to allow different substitution
        rates for each type of nucleotide change
      • A correction for the proportion of sites which
        are unable to change
      • A correction for variable site rates at those
        sites which can change
  • PAUP will estimate the values of these additional
    parameters for you

42
A gamma distribution can be used to model site
rate heterogeneity
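
What a gamma distribution of site rates looks like can be explored by drawing rates from it. This is a rough sketch under stated assumptions: the distribution is scaled to a mean rate of 1 (a common convention, assumed here), the shape value 0.61 is close to the estimate reported for the 16S data later in this presentation, and the threshold of 0.1 for a "slow" site is arbitrary.

```python
import random
import statistics

def sample_site_rates(shape, n_sites, seed=0):
    """Draw relative rates from a gamma distribution with mean 1
    (scale = 1 / shape), one rate per site."""
    rng = random.Random(seed)
    return [rng.gammavariate(shape, 1.0 / shape) for _ in range(n_sites)]

rates = sample_site_rates(shape=0.61, n_sites=100_000)

mean_rate = statistics.fmean(rates)
median_rate = statistics.median(rates)
slow_fraction = sum(rate < 0.1 for rate in rates) / len(rates)

print(mean_rate, median_rate, slow_fraction)
```

With a shape parameter below 1 the distribution is strongly skewed: most sites change slowly and a minority change fast, which is why the median rate falls below the mean.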
43
The GTR model of sequence evolution
The general time reversible model (GTR) is the
most general time-reversible substitution model
because it assigns a different rate to each type
of substitution. For example, for the 16S
ribosomal RNA data for Deinococcus, Thermus,
Aquifex and Bacillus:
Tree number 1: -Ln likelihood = 3985.30400
Estimated R-matrix
  -2.7325625   0.4419956   1.42028     0.87028688
   0.4419956  -5.2448524   1.2621698   3.540687
   1.42028     1.2621698  -3.6824498   1
   0.87028688  3.540687    1          -5.4109739
Estimated value of proportion of invariable sites = 0.228318
Estimated value of gamma shape parameter = 0.610459
44
Models can be made more parameter rich to
increase their realism 2
  • But the more parameters you estimate from the
    data the more time needed for an analysis and the
    more sampling error accumulates
  • One might have a realistic model but large
    sampling errors
  • Realism comes at a cost in time and precision!
  • Fewer parameters may give an inaccurate estimate,
    but more parameters decrease the precision of the
    estimate
  • In general use the simplest model which fits the
    data
  • Use PAUP to compare nested models incorporating
    additional parameters for their likelihoods

45
Models can be made more parameter rich to
increase their realism 3
JC ML tree: -4090
JC + invariable sites: -4030
JC + invariable sites + gamma correction for variable sites: -4029
GTR + invariable sites + gamma correction for variable sites: -3985
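
Nested models such as JC and JC with invariable sites are usually compared with a likelihood ratio test. The sketch below uses the two log likelihoods listed on this slide and assumes they were obtained on the same data and tree (the slide does not say). It also uses the plain chi-square approximation with one degree of freedom, which is only approximate when the extra parameter sits on a boundary, as a proportion of invariable sites of zero does.

```python
import math

def lrt_one_parameter(lnl_simple, lnl_complex):
    """Likelihood ratio test for nested models differing by one parameter.
    Returns (statistic, p_value); the p-value is the chi-square tail with
    1 degree of freedom, which equals erfc(sqrt(statistic / 2))."""
    statistic = max(2.0 * (lnl_complex - lnl_simple), 0.0)
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

lnl_jc = -4090.0        # JC, value from the slide
lnl_jc_inv = -4030.0    # JC + invariable sites, value from the slide

statistic, p_value = lrt_one_parameter(lnl_jc, lnl_jc_inv)
print(statistic, p_value)
```

A large statistic with a very small p-value suggests the extra parameter is worth its cost; the further gain from adding the gamma correction (-4030 to -4029) is much smaller by comparison.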
46
The 16S rRNA genes of Aquifex, Bacillus,
Deinococcus, Thermus and Thermus ruber
Exclude characters command in PAUP - exclude
constant sites
Character-exclusion status changed: 837 characters excluded
Total number of characters now excluded: 837
Number of included characters: 436
Base frequencies command in PAUP
Taxon        A        C        G        T        sites
------------------------------------------------------
ruber        0.19725  0.27294  0.29587  0.23394  436
Aquifex      0.12156  0.38073  0.38532  0.11239  436
Deinococc    0.22477  0.22936  0.28211  0.26376  436
Thermus      0.13103  0.35862  0.37931  0.13103  435
Bacillus     0.22477  0.23394  0.27523  0.26606  436
------------------------------------------------------
Mean         0.17990  0.29509  0.32354  0.20147  435.80
47
Output of GTR-inv sites ML analysis for
(Deinococcus, Bacillus, Aquifex, Thermus and
Thermus ruber)
Tree 3: -log likelihood = 4437 (best tree, true tree)
Tree 2: -log likelihood = 4447
Tree 1: -log likelihood = 4439
With the addition of Thermus ruber, whose base
composition is intermediate between those of
thermophiles and mesophiles, GTR-inv sites ML
recovers the Thermus-Deinococcus relationship
48
Estimation of ML substitution model parameters
  • Yang (1995) has shown that parameter estimates
    are reasonably stable across tree topologies
    provided trees are not too wrong. Thus one can
    obtain a tree using a quick method (useful when
    many sequences are being analysed) and then
    estimate parameters on that tree. These
    parameters can then be used in a search for the
    Maximum Likelihood Tree.

49
Parameter estimates using the tree scores
command in PAUP
Use PAUP tree scores to use ML to estimate over
this tree:
1) Proportion of invariant sites
2) Gamma shape parameter for variable sites
3) Substitution parameters for all types of change
[Figure: maximum parsimony tree]
50
ML Parameter estimates over a parsimony tree
using tree scores in PAUP
Tree number 1: -Ln likelihood = 4432.16903
Estimated R-matrix
  -2.992539    0.53399075  1.6835489   0.77499941
   0.53399075 -6.0877637   1.0048052   4.5489678
   1.6835489   1.0048052  -3.6883541   1
   0.77499941  4.5489678   1          -6.3239672
Corresponding Q-matrix
  -0.77509276  0.12637319  0.52569668  0.12302289
   0.11475088 -1.1506065   0.31375553  0.72210013
   0.36178289  0.2377952  -0.75831742  0.15873934
   0.16654196  1.0765496   0.31225508 -1.5553467
Estimated value of proportion of invariable sites = 0.302946
Estimated value of gamma shape parameter = 0.629797
These values can then be used as the starting
parameters for a full likelihood search
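
The relationship between the two matrices can be checked numerically. In the usual GTR parameterisation each off-diagonal Q entry is the matching R entry multiplied by the frequency of the target base (times a common scale), so dividing Q by R should give one constant per column. The sketch below tests that on the values above; the A, C, G, T ordering of rows and columns and the reading of the column constants as scaled base frequencies are assumptions, not something stated in the PAUP output.

```python
# Matrices copied from the PAUP output above (assumed order: A, C, G, T).
R = [
    [-2.992539,   0.53399075,  1.6835489,  0.77499941],
    [ 0.53399075, -6.0877637,  1.0048052,  4.5489678],
    [ 1.6835489,   1.0048052, -3.6883541,  1.0],
    [ 0.77499941,  4.5489678,  1.0,       -6.3239672],
]
Q = [
    [-0.77509276,  0.12637319,  0.52569668,  0.12302289],
    [ 0.11475088, -1.1506065,   0.31375553,  0.72210013],
    [ 0.36178289,  0.2377952,  -0.75831742,  0.15873934],
    [ 0.16654196,  1.0765496,   0.31225508, -1.5553467],
]
n = len(R)

# Off-diagonal ratios Q[i][j] / R[i][j], grouped by column j.
column_ratios = [
    [Q[i][j] / R[i][j] for i in range(n) if i != j] for j in range(n)
]
# How much the ratios disagree within any one column.
spread = max(max(ratios) - min(ratios) for ratios in column_ratios)

# Column constants, rescaled to sum to 1 (implied base frequencies).
column_factors = [sum(ratios) / len(ratios) for ratios in column_ratios]
implied_freqs = [factor / sum(column_factors) for factor in column_factors]

# Time reversibility: freq_i * Q[i][j] should equal freq_j * Q[j][i].
imbalance = max(
    abs(implied_freqs[i] * Q[i][j] - implied_freqs[j] * Q[j][i])
    for i in range(n) for j in range(n) if i != j
)

print(spread, implied_freqs, imbalance)
```

If the spread and the imbalance are both close to zero, the Q-matrix is consistent with a symmetric R-matrix combined with a single set of base frequencies, which is what time reversibility requires.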
51
Maximum Likelihood Tree
52
Maximum Likelihood -advantages
  • Mathematically rigorous; performs well in
    computer simulations
  • Allows investigation of the fit between model and
    data
  • Provides a simple way of comparing trees
    according to their likelihoods (difference tests,
    e.g. the Kishino-Hasegawa test)

53
Maximum Likelihood -disadvantages
  • Maximum likelihood will only be consistent
    (converge on the true tree) if evolution proceeds
    according to the assumed model. How well does
    the model fit the data?
  • Becomes computationally infeasible with many taxa
    or many model parameters