Title: Lecture 1: Overview of Phylogenetic methods and applications
1Lecture 1 Overview of Phylogenetic methods and
applications
Allan Wilson
2Charles Darwin and Alfred Russel Wallace
Evolution as descent with modification, implying relationships between organisms by unbroken genetic lines.
Phylogenetics seeks to determine these genetic relationships.
Alfred Russel Wallace
Darwin's sketch: the first phylogenetic tree?
Charles Darwin
3Interpretation of morphological characters is
often subjective, so open to personal biases
e.g. Jaw rotation: weak (0), moderate (1), strong (2), as indicated by vertical wear facets on molars.
Character states shown on the tree: Cynodonts (0), Morganucodonts (1), Eutriconodonts (1), Spalacotheriids (2), Eupantotheres (2), Archaic therians (2), Modern therians (2)
Figure: Opalized lower jaw of the monotreme Steropodon
Hu et al. (Nature, 1997) and Ji et al. (Nature, 1999) coded Steropodon as (1) and (2) respectively, helping to account for their alternative placements of monotremes.
4Deoxyribonucleic acid (DNA): Watson, Crick, Wilkins and Franklin
5Early molecular phylogenetics
- Immunological distances
- DNA-DNA hybridization
- Without access to the actual sequences, it is difficult to apply corrections or statistical significance tests to these distances
6Phylogenetics is now dominated by the clearly
defined 4 nucleotides and 20 amino acids
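Purines: A, G
Pyrimidines: C, T
Transitions: changes within a class (A to G, C to T, or the reverse)
Transversions: changes between a purine and a pyrimidine
Figure: Hominid phylogeny from DNA (time axis: millions of years)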
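The purine/pyrimidine classes above are enough to label any nucleotide difference as a transition or a transversion. A minimal Python sketch follows; the function names and the short test sequences are illustrative assumptions, not part of the lecture.

```python
# Purines and pyrimidines as named on the slide.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(a: str, b: str) -> str:
    """Label a pair of nucleotides as 'none', 'transition' or 'transversion'."""
    a, b = a.upper(), b.upper()
    if a == b:
        return "none"
    # A transition stays within one chemical class; a transversion crosses classes.
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def ti_tv_counts(seq1: str, seq2: str) -> tuple:
    """Count transitions and transversions between two aligned sequences."""
    ti = tv = 0
    for a, b in zip(seq1, seq2):
        kind = classify_substitution(a, b)
        if kind == "transition":
            ti += 1
        elif kind == "transversion":
            tv += 1
    return ti, tv


# Hypothetical aligned sequences, for illustration only.
print(ti_tv_counts("ACGT", "GCTT"))
```

The observed counts are only a starting point: models with a ti/tv parameter estimate the ratio while correcting for multiple substitutions at the same site.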
7Tree terminology
Figure: a rooted tree and an unrooted tree for Taxon 1 to Taxon 8, with labels: node, internode, internal edge/branch, external edge/branch
8outgroup
ingroup
polyphyly
Sister taxa
paraphyly
polytomy
bifurcating
9Overview of phylogenetic procedure - by example
1. Biological problem (the question)
2. Which data to obtain (data sampling)
3. Finding the best tree (search strategy)
4. Defining the best tree (optimality criterion)
10 1. Biological problem (the question)
What is the relationship of the extinct American Cheetah (Miracinonyx trumani) to other cats?
Two main sister group hypotheses:
- A. Cheetahs (Acinonyx jubatus): limb, skull and vertebrae morphology
- B. Pumas (Felis concolor): geography, early fossils less cheetah-like
See Barnett et al. (Curr. Biol., 2005)
11 2. Which data to obtain (data sampling)
- Mitochondrial (mt) DNA
- High mtDNA copy number is important because ancient DNA is degraded
- Inferring relatively recent (2-10 million year) divergences, so substantial sequence variation is required
Best time range for each marker (figure axes: observed divergence against time):
- mt control region: best < 2 million years
- mt protein/RNA coding: best 2-25 million years
- Nuclear protein-coding: best > 25 million years
12Mitochondrial partial NADH1 alignment for birds
#NEXUS
Begin DATA;
Dimensions ntax=29 nchar=10692;
Format datatype=dna gap=-;
Matrix
Tinamou        AACTATCTATTCATATCCTTATCATACATCATTCCTATTCTTATTGCA..
Emu            AACCATCTCACTATATCACTCTCCTATGCAATCCCCATTCTAATCGCA..
Cassowary      AACCACCTCACCATATCCCTGTCCTATGCAATCCCAATTCTAATCGCA..
Kiwi           AACTACCTCACTATATCACTATCATATGTCATCCCAATTCTGATTGCA..
Rhea           AACTACCTAATTATGTCCCTGTCATATGCTATCCCAATTCTAATCGCA..
Ostrich        ACACACCTGACTATAGCACTCTCATACGCTGTTCCAATCCTAATTGCA..
Chicken        AACCTTCTAATCATAACCTTATCCTATATTCTCCCCATCCTAATCGCC..
BrushTurkey    AAACACCTCATCATATCCCTATCCTATGTTCTCCCAATTTTAATCGCC..
MagpieGoose    AATCACCTCATTATAACCCTATCGTATGCCATCCCAATCCTAATCGCC..
Duck           AGCTACCTCATTATATCCCTCCTATACGCCATCCCCATTCTAATCGCC..
Broadbill      ACTAACCTTACCATATCCCTATCCTACGCCATCCCCGTCCTAGTTGCC..
Flycatcher     ACCCACCTCATTATATCACTATCCTATGCCGTACCCATCCTAATTGCT..
ZebraFinch     ATTAACCTCATCATAGCCCTCTCCTATGCCCTCCCAATCCTGATCGCA..
Rook           GTCAACCTCATTATAGCACTTTCTTATGCTATCCCTATTCTAATCGCC..
Oystercatcher  ACCTATCTCATTATATCCCTATCCTATGCCATCCCAATCCTGATCGCA..
Turnstone      ACCTACTTCATCATATCCCTATCCTATGCAATCCCAATTCTAATTGCA..
Penguin        GCTCACTTAGCCATATCCCTATCCTATGCCATCCCAATCCTCATTGCA..
Albatross      ACCTATCTTGTCATGTCCCTATCATATGCCATCCCAATCCTAATCGCC..
;
End;
13Tree reconstruction
Type of data: distances or discrete characters (e.g. nucleotides)
Converting sequences to distances loses information, and often statistical power
Tree-building method, from faster to slower:
- Clustering algorithm (distances): unweighted pair group method with arithmetic means (UPGMA), neighbour-joining (NJ)
- Optimality criterion: minimum evolution (ME) for distances; maximum parsimony (MP) and maximum likelihood (ML) for discrete data
14 3. Finding the best tree (search strategy)
Number of possible trees (where n is the number of taxa):
Unrooted trees = (2n-5) × (2n-7) × ... × 3 × 1
Rooted trees = (2n-3) × (2n-5) × ... × 3 × 1
For the 11-taxon cat phylogeny:
Unrooted = 17 × 15 × 13 × 11 × 9 × 7 × 5 × 3 × 1 = 34,459,425
Rooted = Unrooted × (2n-3) = 34,459,425 × 19 = 654,729,075
An exhaustive search will examine all trees, but is not practical for n > 12
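The tree-count formulas can be checked numerically. The sketch below multiplies the odd numbers from 3 up to 2n-5 for unrooted trees and then applies the extra (2n-3) factor for rooted trees; the function names are illustrative assumptions.

```python
def count_unrooted_trees(n: int) -> int:
    """Number of unrooted bifurcating trees for n taxa: (2n-5) x (2n-7) x ... x 3 x 1."""
    if n < 3:
        raise ValueError("need at least 3 taxa for an unrooted bifurcating tree")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5 (the loop is empty for n = 3).
    for k in range(3, 2 * n - 5 + 1, 2):
        count *= k
    return count


def count_rooted_trees(n: int) -> int:
    """Rooting adds one factor: any of the 2n-3 branches can carry the root."""
    return count_unrooted_trees(n) * (2 * n - 3)


for taxa in (4, 8, 11, 12):
    print(taxa, count_unrooted_trees(taxa), count_rooted_trees(taxa))
```

For the 11-taxon cat data this reproduces the two figures quoted above, and adding a twelfth taxon multiplies the unrooted count by a further 19, which is why exhaustive search stops being practical so quickly.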
15Reducing the time for searching tree space
Heuristic search
Find an initial tree, and move within near-by
tree-space, discarding worse alternatives
Only a small amount of tree-space is searched and there is no guarantee of finding the optimal tree: the search can be trapped in local optima
Figure: a search landscape marking the starting point, local optima and the global optimum
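The trapping behaviour can be illustrated with a toy hill-climb. This is a sketch under strong simplifying assumptions: the "tree-space" is a one-dimensional list of hypothetical scores rather than real tree topologies, and a move goes only to the better adjacent position.

```python
def hill_climb(scores, start):
    """Greedy search: move to the better neighbour until no neighbour improves."""
    position = start
    while True:
        neighbours = [i for i in (position - 1, position + 1) if 0 <= i < len(scores)]
        best = max(neighbours, key=lambda i: scores[i])
        if scores[best] <= scores[position]:
            return position  # no improving neighbour: a local (maybe global) optimum
        position = best


# Hypothetical landscape with a local optimum at index 2 and the global optimum at index 6.
landscape = [1, 3, 5, 4, 2, 6, 9, 7]

print(hill_climb(landscape, start=0))  # stops on the local optimum
print(hill_climb(landscape, start=4))  # reaches the global optimum
```

Phylogenetic programs reduce this risk by restarting from several different starting trees and by using rearrangements that make larger moves through tree-space.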
16Branch and Bound search
As trees are built and branches added, if the addition of a taxon to a particular branch results in a tree-length greater than a previously determined upper bound for the tree, then this topology and all those derived from it are ignored and the search continues with a new placement for that taxon.
Branch and bound guarantees finding globally optimal trees.
Figure: a search landscape marking the starting point, local optima and the global optimum
17 4. Defining the best tree (optimality criteria)
Distance methods
Absolute distance matrix
                   1    2    3    4    5    6    7    8    9   10   11
 1 Mongoose        -
 2 Hyena         156    -
 3 Sabretooth    207  147    -
 4 Am.Cheetah    192  140  159    -
 5 Lion          186  134  148  131    -
 6 Tiger         160  143  132  111   64    -
 7 Puma          194  139  162   70  124  100    -
 8 House.Cat     206  133  163  124  118  100  117    -
 9 Cheetah       192  139  162  108  127  109   96  110    -
10 Ocelot        206  123  165  116  116   98  111   98  113    -
11 Jaguarundi    204  147  177  123  143  121  101  119  128  131    -
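As a quick way to read the matrix, the sketch below stores its lower triangle and looks up the nearest neighbour of a taxon. The values are copied from the slide; the helper names are illustrative assumptions, and a nearest-neighbour lookup is only a reading aid, not a tree-building method.

```python
TAXA = ["Mongoose", "Hyena", "Sabretooth", "Am.Cheetah", "Lion", "Tiger",
        "Puma", "House.Cat", "Cheetah", "Ocelot", "Jaguarundi"]

# Lower triangle of the absolute distance matrix, one row per taxon.
LOWER = [
    [],
    [156],
    [207, 147],
    [192, 140, 159],
    [186, 134, 148, 131],
    [160, 143, 132, 111, 64],
    [194, 139, 162, 70, 124, 100],
    [206, 133, 163, 124, 118, 100, 117],
    [192, 139, 162, 108, 127, 109, 96, 110],
    [206, 123, 165, 116, 116, 98, 111, 98, 113],
    [204, 147, 177, 123, 143, 121, 101, 119, 128, 131],
]


def distance(a: str, b: str) -> int:
    """Look up the distance between two taxa in the lower triangle."""
    i, j = TAXA.index(a), TAXA.index(b)
    if i == j:
        return 0
    if i < j:
        i, j = j, i  # the triangle is indexed [larger][smaller]
    return LOWER[i][j]


def nearest(taxon: str) -> str:
    """Return the taxon with the smallest distance to the given one."""
    others = [t for t in TAXA if t != taxon]
    return min(others, key=lambda t: distance(taxon, t))


print(nearest("Am.Cheetah"), distance("Am.Cheetah", nearest("Am.Cheetah")))
print(distance("Am.Cheetah", "Cheetah"))
```

On these numbers the American cheetah sits closer to the puma (70 differences) than to the cheetah (108), in line with the conclusion the lecture reaches later.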
18Early phenetics (distance/similarity) studies
would note that taxon X and taxon Z are the most
similar
Taxon Y  TCAGCTA
Taxon X  ACATGTG
Taxon Z  ACGTCAG
X-Z: 3 differences; Y-Z: 5 differences; X-Y: 4 differences
Figure: phenetic tree grouping Taxon X with Taxon Z, with Taxon Y outside
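The three pairwise counts can be reproduced directly from the sequences. A minimal sketch, with illustrative function and variable names:

```python
from itertools import combinations

# Sequences from the slide.
SEQS = {"Taxon X": "ACATGTG", "Taxon Y": "TCAGCTA", "Taxon Z": "ACGTCAG"}


def count_differences(a: str, b: str) -> int:
    """Number of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


pairs = {
    (p, q): count_differences(SEQS[p], SEQS[q])
    for p, q in combinations(sorted(SEQS), 2)
}
closest = min(pairs, key=pairs.get)

for pair, diffs in pairs.items():
    print(pair, diffs)
print("most similar:", closest)
```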
19Cladistic methods, rather than being concerned
with similarity, are concerned with the nature of
changes (apomorphies)
Taxon Y   TC A GCTA
Taxon X   AC A TGTG
Taxon Z   AC G TCAG
Outgroup  AA G TCTG
Figure labels on the alignment: synapomorphy (the derived A shared by Taxon X and Taxon Y at character 3), autapomorphy, symplesiomorphy
Synapomorphies are shared derived characters and
so are considered to define clades (relationship
groupings)
20Maximum Parsimony chooses the tree topology that
minimises the number of changes required
Character 3 changes G to A
- Phenetic tree (Taxon X with Taxon Z): 8 steps, sub-optimal; the change at character 3 is a homoplasy
- MP tree (Taxon X with Taxon Y): 7 steps; the change at character 3 is a synapomorphy
Both trees are rooted with the Outgroup
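The 7-step and 8-step tree lengths can be reproduced with Fitch's parsimony algorithm, which counts the minimum number of changes per site on a fixed tree. The sketch below assumes unordered, equally weighted characters and writes each tree as nested tuples; the names are illustrative, and the sequences (including the outgroup) are those from the cladistics slide.

```python
# Sequences from the slides, including the outgroup.
SEQS = {"X": "ACATGTG", "Y": "TCAGCTA", "Z": "ACGTCAG", "Out": "AAGTCTG"}


def fitch_site(tree, site):
    """Return (possible ancestral states, number of changes) for one site."""
    if isinstance(tree, str):  # a tip: its observed state, no changes yet
        return {SEQS[tree][site]}, 0
    left_set, left_steps = fitch_site(tree[0], site)
    right_set, right_steps = fitch_site(tree[1], site)
    common = left_set & right_set
    if common:  # the children agree on at least one state: no extra change
        return common, left_steps + right_steps
    # The children disagree: take the union and count one change.
    return left_set | right_set, left_steps + right_steps + 1


def tree_length(tree) -> int:
    """Total number of changes over all sites."""
    n_sites = len(next(iter(SEQS.values())))
    return sum(fitch_site(tree, site)[1] for site in range(n_sites))


MP_TREE = ((("X", "Y"), "Z"), "Out")        # X grouped with Y
PHENETIC_TREE = ((("X", "Z"), "Y"), "Out")  # X grouped with Z

print(tree_length(MP_TREE), tree_length(PHENETIC_TREE))
```

The single-step difference comes entirely from character 3, where grouping X with Z forces the G to A change to occur twice.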
21Maximum Likelihood: the explanation that makes the observed outcome the most likely
L = Pr(D | H)
Probability of the data, given a hypothesis. The hypothesis is a tree topology, its branch-lengths and a model under which the data evolved.
First use in phylogenetics: Cavalli-Sforza and Edwards (1967) for gene frequency data; Felsenstein (1981) for DNA sequences
22Model of rate change, e.g. Hasegawa-Kishino-Yano (HKY, 1985): 4 base frequencies, transition/transversion (ti/tv) ratio
Figure: a four-taxon tree with tip states A, A, G and G and branch lengths of 0.5, 0.5, 0.6, 0.4 and 0.4 substitutions per site, redrawn for each of the 16 combinations of A, G, C and T at its two internal nodes
Sum the probabilities for each of the 16 internal node combinations to get the likelihood for this single nucleotide site
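The sum over the 16 internal-node combinations can be written out directly. The sketch below makes several assumptions that are not from the slide: it uses the simpler Jukes-Cantor (JC69) model with equal base frequencies in place of the model named above, and the assignment of the five branch lengths to particular branches is a guess, since the figure layout is not recoverable.

```python
import math
from itertools import product

BASES = "ACGT"


def jc69_prob(a: str, b: str, t: float) -> float:
    """JC69 probability of observing b after time t (substitutions/site), starting from a."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e


def site_likelihood(tips: str, tip_lengths, internal_length: float) -> float:
    """Likelihood of one site on the tree ((tip1,tip2),(tip3,tip4))."""
    t1, t2, t3, t4 = tips
    b1, b2, b3, b4 = tip_lengths
    total = 0.0
    # x and y are the unknown states at the two internal nodes: 4 x 4 = 16 combinations.
    for x, y in product(BASES, repeat=2):
        total += (0.25  # equal base frequency for the state at node x
                  * jc69_prob(x, t1, b1) * jc69_prob(x, t2, b2)
                  * jc69_prob(x, y, internal_length)
                  * jc69_prob(y, t3, b3) * jc69_prob(y, t4, b4))
    return total


# Hypothetical assignment of the slide's branch lengths to branches.
TIP_LENGTHS = (0.5, 0.5, 0.4, 0.4)
INTERNAL_LENGTH = 0.6

print(site_likelihood("AAGG", TIP_LENGTHS, INTERNAL_LENGTH))
```

A useful sanity check on any such implementation is that the likelihoods of all 256 possible tip patterns sum to 1.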
23The likelihood of a tree is the product of the site likelihoods. Taken as natural logs, the site likelihoods can be summed to give the log likelihood (lnL). The tree with the highest lnL is the ML tree.
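One practical reason for working with logs is numerical: the product of thousands of small site likelihoods underflows ordinary floating-point numbers, while the sum of their logs does not. A small sketch with hypothetical site likelihoods:

```python
import math

# Hypothetical data: 2000 sites, each with a site likelihood of 0.001.
site_likelihoods = [1e-3] * 2000

# The raw product is far too small for a double-precision float and collapses to zero.
raw_product = math.prod(site_likelihoods)

# The log likelihood is the sum of the natural logs of the site likelihoods.
log_likelihood = sum(math.log(lik) for lik in site_likelihoods)

print(raw_product)
print(log_likelihood)
```

Because the log is monotonic, the tree with the highest lnL is also the tree with the highest likelihood.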
- ML is computationally intensive (slow)
- If branch-lengths are long, such that
substitutions occur multiple times along the same
branch for the same site, ML will be more
consistent than MP if the evolutionary process
is sufficiently well modelled.
24Bayesian Inference: the explanation with the highest posterior probability
Bayes' Theorem
Pr(H | D) = Pr(H) Pr(D | H) / Pr(D)
- Pr(H | D): posterior probability, the probability of the hypothesis given the data
- Pr(H): prior probability, the probability of the hypothesis on previous knowledge
- Pr(D | H): likelihood function, the probability of the data given the hypothesis
- Pr(D): unconditional probability of the data, a normalizing constant ensuring the posterior probabilities sum to 1.00
First use in phylogenetics: Li (1996, PhD thesis), Rannala and Yang (1996)
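Bayes' theorem can be applied to a small discrete set of hypotheses by multiplying each prior by its likelihood and dividing by the total. In the sketch below the three candidate trees, the flat prior and the likelihood values are all hypothetical, chosen only to make the arithmetic easy to follow.

```python
def posterior(priors: dict, likelihoods: dict) -> dict:
    """Pr(H | D) = Pr(H) Pr(D | H) / Pr(D) for a discrete set of hypotheses."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    evidence = sum(joint.values())  # Pr(D), the normalizing constant
    return {h: joint[h] / evidence for h in joint}


# Hypothetical flat prior over three candidate trees.
priors = {"Tree 1": 1 / 3, "Tree 2": 1 / 3, "Tree 3": 1 / 3}
# Hypothetical likelihoods Pr(D | H) for the same trees.
likelihoods = {"Tree 1": 4e-5, "Tree 2": 1e-5, "Tree 3": 1e-5}

post = posterior(priors, likelihoods)
for tree, p in post.items():
    print(tree, round(p, 3))
```

With real data the set of trees is far too large to enumerate, so Pr(D) cannot be computed this way and sampling methods are used instead.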
25- Bayesian inference in phylogenetics is essentially a likelihood method, but may more closely reflect the way humans think.
- It is informed by prior knowledge (e.g. fossil data)
- Emphasis is placed on Pr(H | D) instead of Pr(D | H)
Markov chain Monte Carlo (MCMC) is used to approximate Bayesian posterior probabilities (BPP) over 1,000s to 1,000,000s of generations
Figure: an MCMC chain over generations 1 to 6, in which each proposed new state is accepted or rejected; the chain visits Tree 1, Tree 2 and Tree 3, and Tree 1 is sampled in 4 of the 6 generations, so BPP(tree 1) = 4/6
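The idea of estimating a posterior probability as the fraction of generations spent on a tree can be sketched with a toy Metropolis sampler. Everything here is a simplifying assumption: there are only three "trees", their unnormalized posterior weights are hypothetical, and a proposal simply picks one of the other two trees at random rather than rearranging a topology.

```python
import random
from collections import Counter


def metropolis(weights: dict, generations: int, seed: int = 0) -> dict:
    """Estimate posterior probabilities as visit frequencies of a Metropolis chain."""
    rng = random.Random(seed)
    states = list(weights)
    current = states[0]
    visits = Counter()
    for _ in range(generations):
        # Symmetric proposal: any other state, chosen uniformly.
        proposal = rng.choice([s for s in states if s != current])
        # Accept with probability min(1, posterior ratio); otherwise keep the current state.
        if rng.random() < min(1.0, weights[proposal] / weights[current]):
            current = proposal
        visits[current] += 1
    return {s: visits[s] / generations for s in states}


# Hypothetical unnormalized posterior weights (prior times likelihood).
weights = {"Tree 1": 4.0, "Tree 2": 1.0, "Tree 3": 1.0}

bpp = metropolis(weights, generations=50_000)
print(bpp)
```

The chain only ever needs ratios of posterior weights, so the normalizing constant Pr(D) never has to be computed.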
26Posterior probabilities are integrated over all trees in the posterior distribution, providing density distributions rather than the optimization of likelihood
Figure: two density plots on an axis from 0 to 1.0: a flat prior for a parameter value (e.g. proportion of invariant sites), and the posterior for the proportion of invariant sites
27The American cheetah is related to the puma; its morphological similarity to the cheetah is convergence
Figure: two trees of Mongoose, Hyena, Sabretooth, Am.Cheetah, Puma, Jaguarundi, Cheetah, Cat, Ocelot, Lion and Tiger, with a group labelled American felids:
- Maximum parsimony and neighbour-joining (distance) cladogram
- Maximum likelihood and Bayesian inference phylogram (scale bar: 0.05 substitutions/site)
28Applications
The tree of life and inferring our origins
29 146-gene phylogeny: Delsuc et al. (Nature, 2006)
Little evidence from fossils
30Identifying selection
ACA GAG CGC: Threonine - Glutamic acid - Arginine
ACG GAG AGC: Threonine - Glutamic acid - Serine
ACA to ACG is a synonymous (S) substitution; CGC to AGC is a non-synonymous (N) substitution
Decreased dN/dS suggests purifying selection
Increased dN/dS suggests positive selection
The dN/dS ratio can be estimated along branches of phylogenetic trees (e.g. Guindon et al. PNAS, 2004). Here dN/dS is indicated by branch width
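The distinction between synonymous and non-synonymous changes can be checked against the standard genetic code. The sketch below builds the codon table from the conventional TCAG ordering and classifies the two codon changes in the example; the function name is an illustrative assumption, and counting changes this way is not itself an estimate of dN/dS, which also needs the numbers of synonymous and non-synonymous sites and a correction for multiple hits.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order (third position fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(before: str, after: str) -> str:
    """Label a codon change as 'none', 'synonymous' or 'non-synonymous'."""
    if before == after:
        return "none"
    same_amino_acid = CODON_TABLE[before] == CODON_TABLE[after]
    return "synonymous" if same_amino_acid else "non-synonymous"


# The two sequences from the slide, codon by codon.
ancestral = ["ACA", "GAG", "CGC"]
derived = ["ACG", "GAG", "AGC"]

for old, new in zip(ancestral, derived):
    print(old, new, CODON_TABLE[old], CODON_TABLE[new], classify_codon_change(old, new))
```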
31Cohen (Molec. Biol. Evol., 2002) found increased
positive selection at binding sites in the MHC
proteins of estuarine fish Fundulus heteroclitus
populations subject to severe chemical pollution.
Non-synonymous/synonymous ratios for peptide
binding regions and non-peptide binding regions
MHC (Major histocompatibility complex) binds
antigens and presents them to T-cells as part of
the immune response.
Positive selection at binding sites provides high
MHC variability with which to confront new
pathogenic threats.
32Fish from the Hot Spot and Gloucester populations
are genetically adapted to severe chemical
pollution and show novel patterns of DNA
substitution for Mhc class II B locus including
strong signals of positive selection at inferred
antigen-binding sites
Mhc class II B with inferred locations of
population-specific amino acid changes for
Gloucester and Hot Spot.
33Stanhope et al. (Infect. Genet. Evol., 2004): Severe Acute Respiratory Syndrome coronavirus (SARS-CoV) has a recombinant history with lineages of types I and III coronavirus
34Using more sophisticated models of sequence evolution, Holmes and Rambaut (Phil. Trans. Roy. Soc. B, 2004) could not reject a single history across the SARS genome
Figure: tree of coronavirus groups I, II and III with SARS-TOR2
Understanding sequence evolution, and the biases that may result from models (which are necessarily simplifications), is of vital importance in phylogenetic inference
35Host-parasite coevolution/co-speciation
Etherington et al. (J. Gen. Virol., 2006)
Carnivoran strains
Artiodactyl strains
Caliciviruses infect diverse mammalian hosts and
include Norovirus, the major cause of food-borne
viral gastroenteritis in humans. Host switching
by caliciviruses is rare, although pigs have
strains from co-speciation (artiodactyl strain)
and host switching (carnivoran strain).
36Fig (Ficus) and fig wasp mutualism is reflected by co-speciation patterns: Machado et al. (PNAS, 2006)
37Biogeography: vicariance and dispersal
38Most frequent area cladograms, mapping taxa onto landmasses
Many plants follow wind dispersal patterns
Many land animals follow continental break-up
Africa
S. South America
Australia
midges
New Zealand
Southern beech
Cushion herb
Marsupial mammals
From Sanmartín and Ronquist (Syst. Biol., 2004)
39Conservation genetics: Amur leopard (Panthera pardus orientalis)
Relict population of 25-40 individuals in the Russian Far East.
Nuclear microsatellites and mtDNA, Uphyrkina et al. (J. Hered., 2002):
- validates subspecies distinctiveness
- extreme reduction in genetic diversity in the wild
- captive population genetically mixed with the Chinese subspecies
40Macroevolutionary inference
Cretaceous
Tertiary
65 Ma
Present
Does the 65 Ma meteor impact (Alvarez et al.
Science, 1980) fully explain the great reptile
extinction and the rise of modern birds and
mammals?
41Molecular clock: DNA/protein divergence between organisms is a function of time
Figure labels: K/T boundary; intervals 144-83 Ma, 83-71 Ma, 71-68 Ma and 68-65 Ma; time axis from 95 Ma to 65 Ma
42Megafaunal extinctions (human induced or climate
change)
Macrauchenia
Bison (Lascaux, France)
43Arrival of humans in North America
The distribution of coalescence events over time on the tree allows inference of relative population size
Last glacial maximum