Title: Phylogenetic Analysis Unit 16
1Phylogenetic Analysis Unit 16
- BIOL221T Advanced Bioinformatics for Biotechnology
Irene Gabashvili, PhD
2BO, chapter 14
- Bioinformatics analyses should be interpreted in an evolutionary context
- Good-quality sequence alignments are important for evolutionary analysis
- Common phylogenetic methods and software differ: be cautious when using them and interpreting your results
3Terminology and the Basics
- Phylogenetics is sometimes called cladistics
- Clade: a set of descendants from a single ancestor (Greek for branch).
- 3 basic assumptions
- Any group of organisms is descended from a common ancestor
- Bifurcating pattern of cladogenesis
- Change in characteristics occurs in lineages over time
4Brief Introduction to the Theory of Evolution
5Classification Linnaeus
Carl Linnaeus 1707-1778
6Classification Linnaeus
- Hierarchical system
- Kingdom (Rige)
- Phylum (Række)
- Class (Klasse)
- Order (Orden)
- Family (Familie)
- Genus (Slægt)
- Species (Art)
7Classification depicted as a tree
8Classification depicted as a tree
[Figure: classification tree with levels Species, Genus, Family, Order, Class]
9Theory of evolution
Charles Darwin 1809-1882
10Phylogenetic basis of systematics
- Linnaeus: ordering principle is God.
- Darwin: ordering principle is shared descent from common ancestors.
- Today, systematics is explicitly based on phylogeny.
11Darwin's four postulates
- More young are produced each generation than can survive to reproduce.
- Individuals in a population vary in their characteristics.
- Some differences among individuals are based on genetic differences.
- Individuals with favorable characteristics have higher rates of survival and reproduction.
- Evolution by means of natural selection
- Presence of design-like features in organisms
- Quite often features are there for a reason
12Theory of evolution as the basis of biological
understanding
Nothing in biology makes sense, except in the
light of evolution. Without that light it
becomes a pile of sundry facts - some of them
interesting or curious but making no meaningful
picture as a whole
T. Dobzhansky
13Phylogenetic Reconstruction: Distance Matrix Methods
14Trees terminology
15Terminology
- Clades: monophyletic taxa
- Taxa: any named groups of organisms
- Branches: divergence (length may indicate the degree)
- Nodes: any bifurcating branch point
16Trees terminology
17Trees representations
Three different representations of the same tree
18Trees rooted vs. unrooted
- A rooted tree has a single node (the root) that represents a point in time that is earlier than any other node in the tree.
- A rooted tree has directionality (nodes can be ordered in terms of earlier or later).
- In the rooted tree, distance between two nodes is represented along the time-axis only (the second axis just helps spread out the leaves)
[Figure: rooted tree drawn along a time axis running from Early to Late]
21Trees rooted vs. unrooted
- In unrooted trees there is no directionality: we do not know if a node is earlier or later than another node
- Distance along branches directly represents node distance
23Reconstructing a tree using non-contemporaneous
data
24Reconstructing a tree using present-day data
25Data molecular phylogeny
- DNA sequences
- genomic DNA
- mitochondrial DNA
- chloroplast DNA
- Protein sequences
- Restriction site polymorphisms
- DNA/DNA hybridization
- Immunological cross-reaction
26Morphology vs. molecular data
African white-backed vulture (Old World vulture)
Andean condor (New World vulture)
New World and Old World vultures seem to be closely related based on morphology. Molecular data indicate that Old World vultures are related to birds of prey (falcons, hawks, etc.), while New World vultures are more closely related to storks. The similar features are presumably the result of convergent evolution.
27Molecular data single-celled organisms
Molecular data are useful for analyzing single-celled organisms (which have only a few prominent morphological features).
28Distance Matrix Methods
Gorilla    ACGTCGTA
Human      ACGTTCCT
Chimpanzee ACGTTTCG
- Construct multiple alignment of sequences
- Construct table listing all pairwise differences (distance matrix)
- Construct tree from pairwise distances
[Figure: resulting tree with branch lengths 1 (Chimpanzee), 1 (Human), 1 (internal branch) and 2 (Gorilla)]
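The first two steps on this slide can be sketched in a few lines of Python. This is a minimal illustration that assumes the three sequences are already aligned and contain no gaps; the function names `count_differences` and `distance_matrix` are made up for this example.

```python
from itertools import combinations

# Aligned sequences from the slide (equal length, no gaps).
alignment = {
    "Gorilla": "ACGTCGTA",
    "Human": "ACGTTCCT",
    "Chimpanzee": "ACGTTTCG",
}

def count_differences(seq_a, seq_b):
    """Number of aligned positions at which the two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(seq_a, seq_b) if x != y)

def distance_matrix(seqs):
    """Return {(name_a, name_b): differences} for every unordered pair."""
    return {
        (a, b): count_differences(seqs[a], seqs[b])
        for a, b in combinations(seqs, 2)
    }

matrix = distance_matrix(alignment)
for (a, b), d in matrix.items():
    print(f"{a:>10} - {b:<10} {d}")
```

Human and Chimpanzee differ at 2 positions, while Gorilla differs from each of them at 4 positions, which is why Human and Chimpanzee end up as neighbors in the tree.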
29Finding Optimal Branch Lengths
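[Figure: unrooted tree with tips S1, S2, S3, S4 and branch lengths a, b, c, d, e]
Observed distance (Dij) and distance along tree (dij):
D12 ≈ d12 = a + b + c
D13 ≈ d13 = a + d
D14 ≈ d14 = a + b + e
D23 ≈ d23 = d + b + c
D24 ≈ d24 = c + e
D34 ≈ d34 = d + b + e
Goal: distances along the tree (dij) as close as possible to the observed distances (Dij)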
30Optimal Branch Lengths Least Squares
- Fit between given tree and observed distances can be expressed as sum of squared differences
- Q = Σ (Dij - dij)^2 (sum over all pairs i, j)
- Find branch lengths that minimize Q: this is the optimal set of branch lengths for this tree.
[Figure: the four-tip tree (S1 to S4, branch lengths a to e) with the relations Dij ≈ dij from the previous slide]
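As a small worked example of the criterion, the sketch below evaluates Q for the four-tip tree on this slide. The observed distances and the two candidate sets of branch lengths are hypothetical numbers chosen for illustration; only the path sums (d12 = a + b + c, and so on) come from the slide. It evaluates Q for given branch lengths and does not perform the minimization itself.

```python
def tree_distances(a, b, c, d, e):
    """Path lengths between tips S1..S4 on the slide's tree, given its branch lengths."""
    return {
        (1, 2): a + b + c,
        (1, 3): a + d,
        (1, 4): a + b + e,
        (2, 3): d + b + c,
        (2, 4): c + e,
        (3, 4): d + b + e,
    }

def sum_of_squares(observed, branch_lengths):
    """Q = sum over all pairs of (Dij - dij)^2."""
    fitted = tree_distances(**branch_lengths)
    return sum((observed[pair] - fitted[pair]) ** 2 for pair in observed)

# Hypothetical observed distances Dij (not from the slides).
observed = {(1, 2): 7.0, (1, 3): 3.0, (1, 4): 8.0,
            (2, 3): 6.0, (2, 4): 5.0, (3, 4): 7.0}

# Two hypothetical candidate sets of branch lengths for the same topology.
candidate_1 = {"a": 2.0, "b": 3.0, "c": 2.0, "d": 1.0, "e": 3.0}
candidate_2 = {"a": 2.0, "b": 2.0, "c": 2.0, "d": 1.0, "e": 3.0}

q1 = sum_of_squares(observed, candidate_1)
q2 = sum_of_squares(observed, candidate_2)
print(q1, q2)
```

The candidate with the smaller Q fits the observed distances better. A full implementation would search over branch lengths (for example with a linear least-squares solver) rather than compare two fixed candidates.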
31Least Squares Optimality Criterion
- Search through all (or many) tree topologies
- For each investigated tree, find best branch lengths using least squares criterion
- Among all investigated trees, the best tree is the one with the smallest sum of squared errors.
32Exhaustive search impossible for large data sets
33Heuristic search
1. Construct initial tree; determine sum of squares
2. Construct set of neighboring trees by making small rearrangements of initial tree; determine sum of squares for each neighbor
3. If any of the neighboring trees are better than the initial tree, then select it/them and use as starting point for new round of rearrangements. (Possibly several neighbors are equally good)
4. Repeat steps 2-3 until you have found a tree that is better than all of its neighbors.
5. This tree is a local optimum (not necessarily a global optimum!)
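The search loop itself is independent of what a "tree" or a "rearrangement" is, so it can be sketched generically. In the toy example below the states are positions in a list of scores rather than trees, and the neighbors of a position are the positions next to it; both are stand-ins chosen for illustration, and `hill_climb` is a hypothetical name. In a real analysis `score` would be the sum of squares of a tree and `neighbors` would produce rearranged trees.

```python
def hill_climb(start, neighbors, score):
    """Greedy local search: move to the best neighbor until no neighbor improves."""
    current, current_score = start, score(start)
    while True:
        best = min(neighbors(current), key=score, default=None)
        if best is None or score(best) >= current_score:
            # No neighbor is better: current is a local optimum.
            return current, current_score
        current, current_score = best, score(best)

# Toy landscape: lower is better (like a sum of squares).
landscape = [5, 3, 4, 6, 2, 1, 7]

def score(i):
    return landscape[i]

def neighbors(i):
    return [j for j in (i - 1, i + 1) if 0 <= j < len(landscape)]

from_left = hill_climb(0, neighbors, score)
from_middle = hill_climb(3, neighbors, score)
print(from_left, from_middle)
```

Starting at position 0 the search stops at position 1 (score 3), a local optimum, while starting at position 3 it reaches position 5 (score 1), the global optimum. This is why heuristic tree searches are often repeated from several starting trees.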
34Clustering Algorithms
- Starting point: distance matrix
- Cluster least different pair of sequences
- Tree: pair connected to common ancestral node; compute branch lengths from ancestral node to both descendants
- Distance matrix: combine two entries into one. Compute new distance matrix by finding distance from new node to all other nodes
- Repeat until all nodes are linked
- Results in only one tree; there is no measure of tree-goodness.
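One concrete instance of this clustering scheme is UPGMA, where the distance from a new node to every other node is the size-weighted average of the two merged entries and the ancestral node is placed at half the distance between the merged pair. The sketch below is a minimal version under those assumptions, applied to the Gorilla/Human/Chimpanzee distances (4, 4 and 2) that follow from the sequences on the Distance Matrix Methods slide. The function name `upgma` and the nested-tuple tree format are choices made for this example.

```python
def upgma(names, dist):
    """Average-linkage clustering on a symmetric distance matrix.

    Returns (tree, root_height). A leaf is its name; an internal node is a
    pair ((child, branch_length), (child, branch_length)).
    """
    # cluster id -> (subtree, number of tips, height of its ancestral node)
    clusters = {i: (name, 1, 0.0) for i, name in enumerate(names)}
    d = {frozenset((i, j)): dist[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        pair = min(d, key=d.get)              # least different pair
        i, j = sorted(pair)
        (ti, ni, hi), (tj, nj, hj) = clusters.pop(i), clusters.pop(j)
        height = d[pair] / 2                  # ancestral node sits halfway
        # Distance from the new node to every remaining cluster.
        new_d = {k: (ni * d[frozenset((i, k))] + nj * d[frozenset((j, k))]) / (ni + nj)
                 for k in clusters}
        d = {p: v for p, v in d.items() if i not in p and j not in p}
        for k, v in new_d.items():
            d[frozenset((next_id, k))] = v
        clusters[next_id] = (((ti, height - hi), (tj, height - hj)), ni + nj, height)
        next_id += 1
    tree, _, root_height = next(iter(clusters.values()))
    return tree, root_height

names = ["Gorilla", "Human", "Chimpanzee"]
dist = [[0, 4, 4],
        [4, 0, 2],
        [4, 2, 0]]
tree, root_height = upgma(names, dist)
print(tree, root_height)
```

The result joins Human and Chimpanzee with branch lengths of 1 each, then attaches Gorilla with a branch of length 2. As the slide notes, the procedure returns exactly one tree and gives no measure of how good it is.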
35Neighbor Joining Algorithm
- For each tip compute ui = Σj Dij / (n - 2)
- (this is essentially the average distance to all other tips, except the denominator is n-2 instead of n-1)
- Find the pair of tips, i and j, where Dij - ui - uj is smallest
- Connect the tips i and j, forming a new ancestral node. The branch lengths from the ancestral node to i and j are:
- vi = 0.5 Dij + 0.5 (ui - uj)
- vj = 0.5 Dij + 0.5 (uj - ui)
- Update the distance matrix: compute distance between new node and each remaining tip k as follows:
- D(ij),k = (Dik + Djk - Dij) / 2
- Replace tips i and j by the new node, which is now treated as a tip
- Repeat until only two nodes remain.
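The steps above translate fairly directly into code. The following is a minimal sketch, not an optimized implementation: the function name `neighbor_joining`, the edge-list output and the internal node labels `node0`, `node1` are choices made for this example, and the four-tip distance matrix is hypothetical (it is exactly additive on a tree with tip branches 2, 2, 1, 3 and an internal branch of 3).

```python
def neighbor_joining(names, dist):
    """Return a list of (node, neighbor, branch_length) edges of the NJ tree."""
    nodes = list(names)
    d = {(a, b): dist[i][j]
         for i, a in enumerate(names) for j, b in enumerate(names) if i != j}
    edges = []
    counter = 0
    while len(nodes) > 2:
        n = len(nodes)
        # ui = sum_j Dij / (n - 2)
        u = {a: sum(d[a, b] for b in nodes if b != a) / (n - 2) for a in nodes}
        # Pair with the smallest Dij - ui - uj.
        pairs = [(a, b) for idx, a in enumerate(nodes) for b in nodes[idx + 1:]]
        i, j = min(pairs, key=lambda p: d[p] - u[p[0]] - u[p[1]])
        vi = 0.5 * d[i, j] + 0.5 * (u[i] - u[j])
        vj = 0.5 * d[i, j] + 0.5 * (u[j] - u[i])
        new = f"node{counter}"
        counter += 1
        edges += [(i, new, vi), (j, new, vj)]
        # D(ij),k = (Dik + Djk - Dij) / 2 for every remaining tip k.
        for k in nodes:
            if k not in (i, j):
                d[new, k] = d[k, new] = (d[i, k] + d[j, k] - d[i, j]) / 2
        nodes = [k for k in nodes if k not in (i, j)] + [new]
    a, b = nodes
    edges.append((a, b, d[a, b]))  # connect the last two nodes
    return edges

names = ["S1", "S2", "S3", "S4"]
dist = [[0, 7, 3, 8],   # hypothetical, exactly additive distances
        [7, 0, 6, 5],
        [3, 6, 0, 7],
        [8, 5, 7, 0]]
edges = neighbor_joining(names, dist)
for edge in edges:
    print(edge)
```

Because the input distances are exactly additive, the branch lengths that generated them are recovered: S1 and S3 are joined first (branches 2 and 1), then S2 and S4 (branches 2 and 3), and the two internal nodes are connected by a branch of length 3.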
36Superimposed Substitutions
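- Actual number of evolutionary events: 5
- Observed number of differences: 2
- Distance is (almost) always underestimated
[Figure: sequences ACGGTGC and GCGGTGA, with intermediate substitutions (C, T) that are no longer visible in the final sequences]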
37Model-based correction for superimposed
substitutions
- Goal: try to infer the real number of evolutionary events (the real distance) based on:
- Observed data (sequence alignment)
- A model of how evolution occurs
38Jukes and Cantor Model
- Four nucleotides assumed to be equally frequent (f = 0.25)
- All 12 substitution rates assumed to be equal
- Under this model the corrected distance is:
- DJC = -0.75 x ln(1 - 1.33 x DOBS)
- For instance:
- DOBS = 0.43 => DJC = 0.64
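The correction is easy to check numerically. The sketch below uses the exact factor 4/3 in place of the rounded 1.33 from the slide; the function name `jukes_cantor` and the range check are additions made for this example (the formula is undefined once the observed proportion of differences reaches 0.75).

```python
import math

def jukes_cantor(d_obs):
    """Jukes-Cantor corrected distance from an observed proportion of differences."""
    if not 0 <= d_obs < 0.75:
        raise ValueError("observed distance must be in [0, 0.75)")
    return -0.75 * math.log(1 - (4.0 / 3.0) * d_obs)

d_jc = jukes_cantor(0.43)
print(round(d_jc, 2))
```

For DOBS = 0.43 this rounds to 0.64, in agreement with the slide. The corrected distance is always at least as large as the observed one, and the gap widens as DOBS grows.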
39Other models of evolution
40Homologs
- Orthologs - speciation
- Paralogs - duplication
- Xenologs - horizontal transfer
41Clustering Algorithms
- Clustering algorithms use distances to calculate phylogenetic trees. These trees are based solely on the relative numbers of similarities and differences between a set of sequences.
- Start with a matrix of pairwise distances
- Cluster methods construct a tree by linking the least distant pairs of taxa, followed by successively more distant taxa.
42From Multiple Sequence Alignment
43Example
A Cladogram or a Phylogram?
[Figure: tree relating the sequences ATCC, ATGC, TTCG and TCGG, with branch lengths 1.5, 1.5, 0.5, 0.5, 1 and 1]
44Cladistic Methods
- Evolutionary relationships are documented by creating a branching structure, termed a phylogeny or tree, that illustrates the relationships between the sequences.
- Cladistic methods construct a tree (cladogram) by considering the various possible pathways of evolution and choosing from among these the best possible tree.
- A phylogram is a tree with branches that are proportional to evolutionary distances.
46Hamming distance
- Between two strings of equal length: the number of positions for which the corresponding symbols are different. Equivalently, the number of substitutions required to change one into the other, or the number of errors that transformed one string into the other.
47Hamming distance
- The Hamming distance between 1011101 and 1001001 is 2.
- The Hamming distance between 2143896 and 2233796 is 3.
- The Hamming distance between "toned" and "roses" is 3.
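These three examples can be reproduced with a short function. The name `hamming` is chosen for this sketch; the only assumption is that the two inputs have equal length, as the definition requires.

```python
def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(1 for a, b in zip(s, t) if a != b)

print(hamming("1011101", "1001001"))
print(hamming("2143896", "2233796"))
print(hamming("toned", "roses"))
```

The calls return 2, 3 and 3, matching the values on the slide.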
48Levenshtein Distance
- A measure of the similarity between two strings: the minimum number of deletions, insertions, or substitutions needed to transform one string into the other
- For example,
- If s is "test" and t is "test", then LD(s,t) = 0,
- If s is "test" and t is "tent", then LD(s,t) = 1, because one substitution (change "s" to "n") is sufficient to transform s into t.
- If s is "test" and t is "attempt", then LD(s,t) = 4
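A standard way to compute this distance is dynamic programming over prefixes of the two strings. The sketch below keeps only two rows of the table at a time; the function name `levenshtein` is chosen for this example, and every edit (insertion, deletion, substitution) is assumed to cost 1.

```python
def levenshtein(s, t):
    """Minimum number of insertions, deletions and substitutions turning s into t."""
    # previous[j] = distance between the first i-1 characters of s and the first j of t
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]  # turning i characters of s into "" takes i deletions
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(
                previous[j] + 1,         # delete cs
                current[j - 1] + 1,      # insert ct
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]

print(levenshtein("test", "test"))
print(levenshtein("test", "tent"))
print(levenshtein("test", "attempt"))
```

The three calls return 0, 1 and 4. Unlike the Hamming distance, the strings may have different lengths, which is what makes this measure usable for sequences with insertions and deletions.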
49Levenshtein distance
- The Levenshtein distance algorithm has been used in:
- Spell checking
- Speech recognition
- DNA analysis
- Plagiarism detection
50DNA Distances
- Distances between pairs of DNA sequences are usually computed as the sum of all base pair differences between the two sequences.
- This requires sequences that are similar enough to be aligned
- Generally all base changes are considered equal
- Insertions/deletions are generally given a larger weight than replacements (gap penalties).
- It is also possible to correct for multiple substitutions at a single site, which is common in distant relationships and for rapidly evolving sites.
51Phylogenetic methods (1)
- Distance matrix/cluster (UPGMA, NJ): Bacterial taxonomy based on morphological, chemical, biochemical and physiological characters did not allow natural relationships to be deduced. Numerical taxonomy (Sneath and Sokal, 1963, 1973).
- Parsimony (maximum parsimony): The taxonomy of animals shall reflect their natural relationships. Phylogenetic Systematics (Willi Hennig 1950, 1966). Without direction (e.g. Wiley 1980).
52UPGMA
- The simplest of the distance methods is the UPGMA (Unweighted Pair Group Method using Arithmetic averages)
- The PHYLIP programs DNADIST and PROTDIST calculate absolute pairwise distances between a group of sequences. Then the GCG program GROWTREE uses UPGMA to build a tree.
- Many multiple alignment programs such as PILEUP use a variant of UPGMA to create a dendrogram of DNA sequences which is then used to guide the multiple alignment algorithm.
53Neighbor Joining
- The Neighbor Joining method is the most popular way to build trees from distance measurements
- (Saitou and Nei 1987, Mol. Biol. Evol. 4:406)
- Neighbor Joining corrects the UPGMA method for its (frequently invalid) assumption that the same rate of evolution applies to each branch of a tree.
- The distance matrix is adjusted for differences in the rate of evolution of each taxon (branch).
- Neighbor Joining has given the best results in simulation studies and it is the most computationally efficient of the distance algorithms (N. Saitou and T. Imanishi, Mol. Biol. Evol. 6:514 (1989))
54Cladistic methods
- Cladistic methods are based on the assumption that a set of sequences evolved from a common ancestor by a process of mutation and selection without mixing (hybridization or other horizontal gene transfers).
- These methods work best if a specific tree, or at least an ancestral sequence, is already known so that comparisons can be made between a finite number of alternate trees rather than calculating all possible trees for a given set of sequences.
55Parsimony
- Parsimony is the most popular method for reconstructing ancestral relationships.
- Parsimony allows the use of all known evolutionary information in building a tree
- In contrast, distance methods compress all of the differences between pairs of sequences into a single number
56Building Trees with Parsimony
- Parsimony involves evaluating all possible trees and giving each a score based on the number of evolutionary changes that are needed to explain the observed data.
- The best tree is the one that requires the fewest base changes for all sequences to derive from a common ancestor.
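The scoring step can be illustrated with Fitch's algorithm, one common way to count the minimum number of base changes a given tree requires (the slides do not name a specific algorithm, so this is a swapped-in choice). The sketch scores the three possible groupings of the four sequences ATCC, ATGC, TTCG and TCGG from the earlier example slide; the labels `s1` to `s4`, the nested-tuple tree format and the function names are assumptions made for this illustration.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum changes) for one alignment column."""
    if isinstance(tree, str):            # leaf: its observed base, no changes
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    common = left_set & right_set
    if common:                           # children agree: no extra change
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, alignment):
    """Total number of changes over all columns of the alignment."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[k] for name, seq in alignment.items()})[1]
        for k in range(length)
    )

alignment = {"s1": "ATCC", "s2": "ATGC", "s3": "TTCG", "s4": "TCGG"}
trees = {
    "((s1,s2),(s3,s4))": (("s1", "s2"), ("s3", "s4")),
    "((s1,s3),(s2,s4))": (("s1", "s3"), ("s2", "s4")),
    "((s1,s4),(s2,s3))": (("s1", "s4"), ("s2", "s3")),
}
scores = {label: parsimony_score(tree, alignment) for label, tree in trees.items()}
for label, value in scores.items():
    print(label, value)
```

The grouping ((s1,s2),(s3,s4)) needs 5 changes, against 6 and 7 for the other two, so it is the most parsimonious tree for these data. With four sequences all three topologies can be scored; for larger data sets the number of trees grows too fast for this.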
57Methods
- Distance-based: UPGMA, NJ, FM, ME
- Other: Maximum Parsimony, ML, etc.
- Neighbor Joining methods generally produce just one tree, which can help to validate a tree built with the parsimony or maximum likelihood method
58Phylogenetic methods: Maximum likelihood methods
- Phylogenies should be formulated in a probabilistic framework and be statistically testable. Protein and DNA sequence data are extraordinarily good for phylogenetic interpretation and can withstand such treatment.
- Cavalli-Sforza and Edwards 1967 (theory)
- Felsenstein 1981: first practically useful algorithms.
59Phylogenetic analysis. Comparison of phylogenetic methods
- Consistency: a phylogenetic method is consistent for an evolutionary model if the method converges on the correct tree as the amount of data approaches infinity.
- Efficiency: a phylogenetic method has high efficiency if it quickly converges on the correct solution as more data are applied to the problem.
- Robustness: a phylogenetic method is robust if it converges on the correct solution despite violations of the assumptions about the evolutionary model.
Hillis 1995. Syst. Biol. 44, 3-16.
60Phylogenetic analysis. Test of robustness. Bootstrap
- Purpose. To show how well supported the nodes are by the data.
- Performance. The original data are resampled by drawing columns randomly with replacement 100 or 1000 times. The phylogenetic analysis is repeated for each replicate, and the number of times each node occurs among the 100 or 1000 trees is summarized.
- Example.
              Original data   1st replicate   2nd replicate
  Species 1   AGGA            AAGA            GGAA
  Species 2   ACGT            AACT            CGTT
  Species 3   ACGT            AACT            CGTT
  Species 4   ACTT            AACT            CTTT
  Species 5   CCGT            CCCT            CGTT
  linear form (((2,3)4)5)1    (((2,3)4)5)1    (((2,3)5)4)1
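The resampling step can be sketched as follows, using the five-species alignment from the example. This only generates bootstrap replicates; building a tree from each replicate and counting how often each node recurs is left out. The function name `bootstrap_replicate` and the fixed random seed are choices made for this illustration.

```python
import random

def bootstrap_replicate(alignment, rng):
    """Draw alignment columns at random, with replacement, to form one replicate."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    replicate = {
        name: "".join(seq[c] for c in columns)
        for name, seq in alignment.items()
    }
    return replicate, columns

original = {
    "Species 1": "AGGA",
    "Species 2": "ACGT",
    "Species 3": "ACGT",
    "Species 4": "ACTT",
    "Species 5": "CCGT",
}

rng = random.Random(16)  # seeded only so the run is repeatable
replicates = [bootstrap_replicate(original, rng) for _ in range(100)]
first_replicate, first_columns = replicates[0]
print(first_columns)
for name, seq in first_replicate.items():
    print(name, seq)
```

Each replicate has the same number of columns as the original alignment, but some original columns appear more than once and others not at all, exactly as in the two replicates shown on the slide.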
61Are there Correct trees??
- Despite all of these caveats, it is actually quite simple to use computer programs to calculate phylogenetic trees for data sets.
- Provided the data are clean, outgroups are correctly specified, appropriate algorithms are chosen, no assumptions are violated, etc., can the true, correct tree be found and proven to be scientifically valid?
- Unfortunately, it is impossible to ever conclusively state what is the "true" tree for a group of sequences (or a group of organisms); taxonomy is constantly under revision as new data are gathered (example: the 1980s revision of the seals and sea lions tree)