Phylogenetic Analysis Unit 16

Slides: 62

Provided by: irenegab

Category:

more less

Transcript and Presenter's Notes

1
Phylogenetic Analysis, Unit 16

BIOL221T Advanced Bioinformatics for
Biotechnology

Irene Gabashvili, PhD
2
BO, chapter 14

Bioinformatics analyses should be interpreted in
evolutionary context
Good-quality sequence alignments important for
evolutionary analysis
Common phylogenetic methods and software
differ: be cautious when using them and
interpreting your results

3
Terminology and the Basics

Phylogenetics is sometimes called cladistics
Clade: a set of descendants from a single
ancestor (from the Greek for branch).
3 basic assumptions:
Any group of organisms is descended from a common
ancestor
Bifurcating pattern of cladogenesis
Change in characteristics occurs in lineages over
time

4
Brief Introduction to the Theory of Evolution
5
Classification: Linnaeus
Carl Linnaeus 1707-1778
6
Classification: Linnaeus

Hierarchical system
Kingdom (Rige)
Phylum (Række)
Class (Klasse)
Order (Orden)
Family (Familie)
Genus (Slægt)
Species (Art)

7
Classification depicted as a tree
8
Classification depicted as a tree
Species
Genus
Family
Order
Class
9
Theory of evolution
Charles Darwin 1809-1882
10
Phylogenetic basis of systematics

Linnaeus
Ordering principle is God.
Darwin
Ordering principle is shared descent from common
ancestors.
Today, systematics is explicitly based on
phylogeny.

11
Darwin's four postulates

More young are produced each generation than can
survive to reproduce.
Individuals in a population vary in their
characteristics.
Some differences among individuals are based on
genetic differences.
Individuals with favorable characteristics have
higher rates of survival and reproduction.
Result: evolution by means of natural selection
Presence of design-like features in organisms:
quite often features are there for a reason

12
Theory of evolution as the basis of biological
understanding
Nothing in biology makes sense, except in the
light of evolution. Without that light it
becomes a pile of sundry facts - some of them
interesting or curious but making no meaningful
picture as a whole
T. Dobzhansky
13
Phylogenetic Reconstruction: Distance Matrix
Methods
14
Trees: terminology
15
Terminology

Clade: a monophyletic taxon
Taxon: any named group of organisms
Branch: divergence (length may indicate the
degree)
Node: any bifurcating branch point

16
Trees: terminology
17
Trees: representations
Three different representations of the same tree
18
Trees: rooted vs. unrooted

A rooted tree has a single node (the root) that
represents a point in time that is earlier than
any other node in the tree.
A rooted tree has directionality (nodes can be
ordered in terms of earlier or later).
In the rooted tree, distance between two nodes is
represented along the time-axis only (the second
axis just helps spread out the leaves).

Early
Late
21
Trees: rooted vs. unrooted

In unrooted trees there is no directionality: we
do not know if a node is earlier or later than
another node.
Distance along branches directly represents node
distance.


23
Reconstructing a tree using non-contemporaneous
data
24
Reconstructing a tree using present-day data
25
Data: molecular phylogeny

DNA sequences
genomic DNA
mitochondrial DNA
chloroplast DNA
Protein sequences
Restriction site polymorphisms
DNA/DNA hybridization
Immunological cross-reaction

26
Morphology vs. molecular data
African white-backed vulture (Old World vulture)
Andean condor (New World vulture)
New and Old World vultures seem to be closely
related based on morphology. Molecular data
indicate that Old World vultures are related to
birds of prey (falcons, hawks, etc.), while New
World vultures are more closely related to
storks. The similar features are presumably the
result of convergent evolution.
27
Molecular data: single-celled organisms
Molecular data are useful for analyzing single-celled
organisms (which have only a few prominent
morphological features).
28
Distance Matrix Methods

Gorilla     ACGTCGTA
Human       ACGTTCCT
Chimpanzee  ACGTTTCG

Construct multiple alignment of sequences
Construct table listing all pairwise differences
(distance matrix)
Construct tree from pairwise distances

[Figure: tree with tips Ch, Hu, and Go; branch labels 1, 1, 1, and 2]
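
The middle step (building the table of pairwise differences) can be sketched in Python for the toy alignment on this slide. This is only an illustration: the function name `pairwise_differences` is made up here, and the sketch assumes the sequences are already aligned and of equal length.

```python
from itertools import combinations

def pairwise_differences(alignment):
    """Count mismatched columns for every pair of aligned sequences.

    `alignment` maps a taxon name to its aligned sequence (hypothetical layout).
    """
    matrix = {}
    for (name_a, seq_a), (name_b, seq_b) in combinations(alignment.items(), 2):
        if len(seq_a) != len(seq_b):
            raise ValueError("aligned sequences must have equal length")
        matrix[(name_a, name_b)] = sum(x != y for x, y in zip(seq_a, seq_b))
    return matrix

alignment = {
    "Gorilla": "ACGTCGTA",
    "Human": "ACGTTCCT",
    "Chimpanzee": "ACGTTTCG",
}
distances = pairwise_differences(alignment)
for pair, count in distances.items():
    print(pair, count)
```

For this alignment, Human and Chimpanzee differ at 2 positions, and each differs from Gorilla at 4.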
29
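Finding Optimal Branch Lengths
[Figure: four-taxon tree with tips S1, S2, S3, S4 and branches a, b, c, d, e]
$D_{ij}$: observed distance; $d_{ij}$: distance along tree
$D_{12} \approx d_{12} = a + b + c$
$D_{13} \approx d_{13} = a + d$
$D_{14} \approx d_{14} = a + b + e$
$D_{23} \approx d_{23} = d + b + c$
$D_{24} \approx d_{24} = c + e$
$D_{34} \approx d_{34} = d + b + e$
Goal: $D_{ij} \approx d_{ij}$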
30
Optimal Branch Lengths: Least Squares

Fit between given tree and observed distances can
be expressed as sum of squared differences
$Q = \sum_{j>i} (D_{ij} - d_{ij})^2$
Find branch lengths that minimize Q - this is the
optimal set of branch lengths for this tree.

[Figure: the four-taxon tree (tips S1-S4, branches a-e) and the distance equations from the previous slide]
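
A minimal sketch of this least-squares fit for the four-taxon tree, assuming the path equations listed on the previous slide (d12 = a + b + c, d13 = a + d, and so on). The observed distances below are made-up numbers, the function names are illustrative, and the fit is unconstrained, so real data could yield negative branch lengths.

```python
def solve_linear(matrix, rhs):
    """Solve a square linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in reversed(range(n)):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution

# For each pair of tips: which of the branches (a, b, c, d, e) lie on the path.
PATHS = {
    (1, 2): [1, 1, 1, 0, 0],  # d12 = a + b + c
    (1, 3): [1, 0, 0, 1, 0],  # d13 = a + d
    (1, 4): [1, 1, 0, 0, 1],  # d14 = a + b + e
    (2, 3): [0, 1, 1, 1, 0],  # d23 = d + b + c
    (2, 4): [0, 0, 1, 0, 1],  # d24 = c + e
    (3, 4): [0, 1, 0, 1, 1],  # d34 = d + b + e
}

def least_squares_branches(observed):
    """Return ([a, b, c, d, e], Q) minimizing Q = sum (D_ij - d_ij)^2."""
    pairs = sorted(PATHS)
    design = [PATHS[p] for p in pairs]
    target = [observed[p] for p in pairs]
    # Normal equations: (A^T A) x = A^T D
    ata = [[sum(r[i] * r[j] for r in design) for j in range(5)] for i in range(5)]
    atd = [sum(r[i] * t for r, t in zip(design, target)) for i in range(5)]
    lengths = solve_linear(ata, atd)
    q = sum((t - sum(c * x for c, x in zip(r, lengths))) ** 2
            for r, t in zip(design, target))
    return lengths, q

# Hypothetical observed distances (chosen to fit the tree exactly).
observed = {(1, 2): 6.0, (1, 3): 5.0, (1, 4): 8.0,
            (2, 3): 9.0, (2, 4): 8.0, (3, 4): 11.0}
lengths, q = least_squares_branches(observed)
print([round(x, 6) for x in lengths], round(q, 6))
```

Because the example distances are perfectly tree-like, Q comes out at (numerically) zero; with noisy data Q would be positive and would serve as the score for this topology.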
31
Least Squares Optimality Criterion

Search through all (or many) tree topologies
For each investigated tree, find best branch
lengths using least squares criterion
Among all investigated trees, the best tree is
the one with the smallest sum of squared errors.

32
Exhaustive search impossible for large data sets
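
To give a feel for why exhaustive search breaks down, the sketch below counts the distinct unrooted bifurcating trees for n taxa using the standard formula (2n - 5)!! = 1 x 3 x 5 x ... x (2n - 5). The formula is not stated on the slide, and the function name is illustrative.

```python
def count_unrooted_trees(n_taxa):
    """Number of distinct unrooted bifurcating trees for n_taxa >= 3 tips."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count

for n in (4, 5, 10, 20):
    print(n, count_unrooted_trees(n))
```

Already at 10 taxa there are 2,027,025 candidate topologies, and the count keeps growing faster than exponentially.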
33
Heuristic search

1. Construct an initial tree; determine its sum of squares.
2. Construct a set of neighboring trees by making
small rearrangements of the initial tree; determine
the sum of squares for each neighbor.
3. If any of the neighboring trees are better than
the initial tree, select it/them and use as the
starting point for a new round of rearrangements.
(Possibly several neighbors are equally good.)
4. Repeat steps 2-3 until you have found a tree
that is better than all of its neighbors.
5. This tree is a local optimum (not necessarily a
global optimum!)

34
Clustering Algorithms

Starting point: distance matrix
Cluster the least different pair of sequences
Tree: connect the pair to a common ancestral node;
compute branch lengths from the ancestral node to
both descendants
Distance matrix: combine the two entries into one.
Compute a new distance matrix by finding the distance
from the new node to all other nodes
Repeat until all nodes are linked
Results in only one tree; there is no measure of
tree-goodness.

35
Neighbor Joining Algorithm

For each tip compute $u_i = \sum_j D_{ij}/(n-2)$
(this is essentially the average distance to all
other tips, except the denominator is n-2 instead
of n-1)
Find the pair of tips, i and j, where
$D_{ij} - u_i - u_j$ is smallest
Connect the tips i and j, forming a new ancestral
node. The branch lengths from the ancestral node
to i and j are
$v_i = 0.5 D_{ij} + 0.5 (u_i - u_j)$
$v_j = 0.5 D_{ij} + 0.5 (u_j - u_i)$
Update the distance matrix: compute the distance
between the new node and each remaining tip k as
follows
$D_{(ij),k} = (D_{ik} + D_{jk} - D_{ij})/2$
Replace tips i and j by the new node, which is now
treated as a tip
Repeat until only two nodes remain.
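
The steps on this slide can be sketched as follows. This is a rough illustration rather than a reference implementation: the data layout (a dict keyed by `frozenset` pairs), the internal node names `N1`, `N2`, ... and the example distances are all assumptions made for the sketch, and ties are broken by taking the first pair found.

```python
from itertools import combinations

def neighbor_joining(names, dist):
    """Return a list of (node, neighbor, branch_length) edges.

    `dist` maps frozenset({x, y}) to the distance between tips x and y.
    """
    d = dict(dist)
    nodes = list(names)
    edges = []
    counter = 0
    while len(nodes) > 2:
        n = len(nodes)
        # u_i: sum of distances to all other tips, divided by n - 2
        u = {i: sum(d[frozenset((i, j))] for j in nodes if j != i) / (n - 2)
             for i in nodes}
        # Pair with the smallest D_ij - u_i - u_j
        i, j = min(combinations(nodes, 2),
                   key=lambda p: d[frozenset(p)] - u[p[0]] - u[p[1]])
        dij = d[frozenset((i, j))]
        vi = 0.5 * dij + 0.5 * (u[i] - u[j])
        vj = 0.5 * dij + 0.5 * (u[j] - u[i])
        counter += 1
        new = f"N{counter}"  # hypothetical name for the new ancestral node
        edges.append((new, i, vi))
        edges.append((new, j, vj))
        # Distance from the new node to every remaining tip k
        for k in nodes:
            if k not in (i, j):
                d[frozenset((new, k))] = (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - dij) / 2
        nodes = [k for k in nodes if k not in (i, j)] + [new]
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges

# Made-up, perfectly tree-like distances for four tips.
raw = {("S1", "S2"): 6, ("S1", "S3"): 5, ("S1", "S4"): 8,
       ("S2", "S3"): 9, ("S2", "S4"): 8, ("S3", "S4"): 11}
dist = {frozenset(pair): value for pair, value in raw.items()}
edges = neighbor_joining(["S1", "S2", "S3", "S4"], dist)
for edge in edges:
    print(edge)
```

On these example distances the sketch pairs S1 with S3 and S2 with S4, and recovers branch lengths 1, 4, 3 and 5 for the tips plus an internal branch of length 2.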

36
Superimposed Substitutions

Actual number of evolutionary events: 5
Observed number of differences: 2
Distance is (almost) always underestimated

[Figure: sequence ACGGTGC changing to GCGGTGA, with intermediate states C and T]
37
Model-based correction for superimposed
substitutions

Goal: try to infer the real number of
evolutionary events (the real distance) based on:
Observed data (sequence alignment)
A model of how evolution occurs

38
Jukes and Cantor Model

Four nucleotides assumed to be equally frequent
(f = 0.25)
All 12 substitution rates assumed to be equal
Under this model the corrected distance is
$D_{JC} = -0.75 \ln(1 - 1.33\, D_{OBS})$
For instance
$D_{OBS} = 0.43 \Rightarrow D_{JC} = 0.64$
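
The correction can be checked with a few lines of Python. The function name is illustrative, and the sketch uses the exact factor 4/3 where the slide rounds to 1.33.

```python
import math

def jukes_cantor_distance(d_obs):
    """Jukes-Cantor corrected distance for an observed proportion of differences."""
    if not 0 <= d_obs < 0.75:
        # At 0.75 or above the logarithm is undefined: the sequences look random.
        raise ValueError("observed distance must be in [0, 0.75)")
    return -0.75 * math.log(1 - (4.0 / 3.0) * d_obs)

print(round(jukes_cantor_distance(0.43), 2))
```

The corrected value is always at least as large as the observed one, and the gap widens as the observed distance approaches 0.75.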

39
Other models of evolution
40
Homologs

Orthologs - speciation
Paralogs - duplication
Xenologs - horizontal transfer

41
Clustering Algorithms

Clustering algorithms use distances to calculate
phylogenetic trees. These trees are based solely
on the relative numbers of similarities and
differences between a set of sequences.
Start with a matrix of pairwise distances
Cluster methods construct a tree by linking the
least distant pairs of taxa, followed by
successively more distant taxa.

42
From Multiple Sequence Alignment

Best cluster ATCC,ATGC

Best cluster TTCG,TCGG

43
Example
A Cladogram or a Phylogram?
[Figure: tree with tips ATCC, ATGC, TTCG, TCGG; branch labels 1.5, 1.5, 0.5, 0.5, 1, 1]
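
As a rough illustration of clustering from pairwise distances, the sketch below applies UPGMA-style averaging to the four example sequences. It assumes distances are simple mismatch counts and that all sequences are distinct; the function names are made up, and cluster distances are recomputed from the original sequences rather than by updating a matrix, which gives the same averages.

```python
from itertools import combinations

def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def average_distance(cluster_a, cluster_b):
    """Mean distance over all pairs of members from the two clusters."""
    total = sum(hamming(x, y) for x in cluster_a for y in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

def upgma(sequences):
    """Return the merges as (cluster_a, cluster_b, node_height) tuples."""
    clusters = [(seq,) for seq in sequences]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_distance(*pair))
        height = average_distance(a, b) / 2  # node sits halfway between clusters
        merges.append((a, b, height))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

merges = upgma(["ATCC", "ATGC", "TTCG", "TCGG"])
for a, b, height in merges:
    print(a, b, height)
```

The first two merges match the "best clusters" picked from the alignment (ATCC with ATGC, then TTCG with TCGG), at node heights 0.5 and 1.0; the final merge joins the two pairs at height 1.5.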
44
Cladistic Methods

Evolutionary relationships are documented by
creating a branching structure, termed a
phylogeny or tree, that illustrates the
relationships between the sequences.
Cladistic methods construct a tree (cladogram) by
considering the various possible pathways of
evolution and choose from among these the best
possible tree.
A phylogram is a tree with branches that are
proportional to evolutionary distances.

46
Hamming distance

(Between two strings of equal length) The
number of positions at which the corresponding
symbols differ. Equivalently, the minimum number of
substitutions required to change one into the
other, or the number of errors that transformed
one string into the other.

47
Hamming distance

The Hamming distance between 1011101 and 1001001
is 2.
The Hamming distance between 2143896 and 2233796
is 3.
The Hamming distance between "toned" and "roses"
is 3.

48
Levenshtein Distance

A measure of the difference between two strings:
the minimum number of deletions, insertions, or
substitutions needed to turn one into the other
For example,
If s is "test" and t is "test", then LD(s,t) = 0,
If s is "test" and t is "tent", then LD(s,t) = 1,
because one substitution (change "s" to "n") is
sufficient to transform s into t.
If s is "test" and t is "attempt", then LD(s,t) = 4.
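
The examples above can be reproduced with the usual dynamic-programming formulation, sketched here with unit cost for every insertion, deletion and substitution. The function name is illustrative.

```python
def levenshtein(s, t):
    """Minimum number of single-character edits turning s into t."""
    # previous[j] holds the distance between the processed prefix of s and t[:j]
    previous = list(range(len(t) + 1))
    for i, char_s in enumerate(s, start=1):
        current = [i]
        for j, char_t in enumerate(t, start=1):
            cost = 0 if char_s == char_t else 1
            current.append(min(previous[j] + 1,          # delete char_s
                               current[j - 1] + 1,       # insert char_t
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous[-1]

for s, t in [("test", "test"), ("test", "tent"), ("test", "attempt")]:
    print(s, t, levenshtein(s, t))
```

Unlike the Hamming distance, this measure is defined for strings of different lengths, which is why it can compare "test" with "attempt".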

49
Levenshtein distance

The Levenshtein distance algorithm has been used
in
Spell checking
Speech recognition
DNA analysis
Plagiarism detection

50
DNA Distances

Distances between pairs of DNA sequences are
usually computed as the sum of all base pair
differences between the two sequences,
provided the sequences are similar enough to be aligned.
Generally all base changes are considered equal.
Insertions/deletions are generally given a larger
weight than replacements (gap penalties).
It is also possible to correct for multiple
substitutions at a single site, which are common
in distant relationships and at rapidly evolving
sites.

51
Phylogenetic methods (1)
Distance matrix/cluster (UPGMA, NJ)
Bacterial taxonomy based on morphological,
chemical, biochemical and physiological
characters did not allow natural relationships
to be deduced.
Numerical taxonomy (Sneath and Sokal, 1963, 1973)
Parsimony (maximum parsimony)
The taxonomy of animals shall reflect their
natural relationships.
Phylogenetic Systematics (Willi Hennig 1950, 1966)
Without direction (e.g. Wiley 1980)
52
UPGMA

The simplest of the distance methods is the UPGMA
(Unweighted Pair Group Method using Arithmetic
averages)
The PHYLIP programs DNADIST and PROTDIST
calculate absolute pairwise distances between a
group of sequences. Then the GCG program GROWTREE
uses UPGMA to build a tree.
Many multiple alignment programs such as PILEUP
use a variant of UPGMA to create a dendrogram of
DNA sequences which is then used to guide the
multiple alignment algorithm.

53
Neighbor Joining

The Neighbor Joining method is the most popular
way to build trees from distance measurements
(Saitou and Nei 1987, Mol. Biol. Evol. 4:406)
Neighbor Joining corrects the UPGMA method for
its (frequently invalid) assumption that the same
rate of evolution applies to each branch of a
tree.
The distance matrix is adjusted for differences
in the rate of evolution of each taxon (branch).
Neighbor Joining has given the best results in
simulation studies and it is the most
computationally efficient of the distance
algorithms (N. Saitou and T. Imanishi, Mol.
Biol. Evol. 6:514 (1989)).

54
Cladistic methods

Cladistic methods are based on the assumption
that a set of sequences evolved from a common
ancestor by a process of mutation and selection
without mixing (hybridization or other horizontal
gene transfers).
These methods work best if a specific tree, or at
least an ancestral sequence, is already known so
that comparisons can be made between a finite
number of alternate trees rather than calculating
all possible trees for a given set of sequences.

55
Parsimony

Parsimony is the most popular method for
reconstructing ancestral relationships.
Parsimony allows the use of all known
evolutionary information in building a tree.
In contrast, distance methods compress all of the
differences between pairs of sequences into a
single number.

56
Building Trees with Parsimony

Parsimony involves evaluating all possible trees
and giving each a score based on the number of
evolutionary changes that are needed to explain
the observed data.
The best tree is the one that requires the fewest
base changes for all sequences to derive from a
common ancestor.
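
One common way to count the changes a given tree requires is Fitch's algorithm, which is not named on the slide and is used here only as an illustration. The sketch scores the three possible groupings of four taxa, reusing the four short sequences from the earlier clustering example; the taxon names A-D and the nested-tuple tree layout are assumptions.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, minimum changes) for one column."""
    if isinstance(tree, str):          # a leaf: its state is known
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch_site(left, states)
    right_set, right_changes = fitch_site(right, states)
    shared = left_set & right_set
    if shared:                         # children agree: no extra change
        return shared, left_changes + right_changes
    # children disagree: one change is needed on this part of the tree
    return left_set | right_set, left_changes + right_changes + 1

def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over all alignment columns."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length))

alignment = {"A": "ATCC", "B": "ATGC", "C": "TTCG", "D": "TCGG"}
topologies = [(("A", "B"), ("C", "D")),
              (("A", "C"), ("B", "D")),
              (("A", "D"), ("B", "C"))]
scores = {tree: parsimony_score(tree, alignment) for tree in topologies}
for tree, score in scores.items():
    print(tree, score)
best = min(scores, key=scores.get)
```

For this toy alignment the grouping ((A,B),(C,D)) needs 5 changes, against 6 and 7 for the alternatives, so it is the most parsimonious of the three.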

57
Methods

Distance-based: UPGMA, NJ, FM, ME
Other: Maximum Parsimony, ML, etc.
Neighbor Joining methods generally produce just
one tree, which can help to validate a tree built
with the parsimony or maximum likelihood method

58
Phylogenetic methods: Maximum likelihood
methods
Phylogenies should be formulated in a
probabilistic framework and be statistically
testable. Protein and DNA sequence data are
extraordinarily good for phylogenetic
interpretation and can withstand such treatment.
Cavalli-Sforza and Edwards 1967 (theory)
Felsenstein 1981: first practically
useful algorithms.
59
Phylogenetic analysis. Comparison of phylogenetic
methods
Consistency: a phylogenetic method is
consistent for an evolutionary model if the
method converges on the correct tree as the data
become infinite.
Efficiency: a phylogenetic
method has high efficiency if it quickly
converges on the correct solution as more data
are applied to the problem.
Robustness: a
phylogenetic method is robust if it converges on the
correct solution despite violations of the
assumptions about the evolutionary model.
Hillis 1995. Syst. Biol. 44, 3-16.
60

Phylogenetic analysis. Test of robustness.
Bootstrap
Purpose. To show how well supported the nodes are
by the data.
Procedure. The original data are resampled by
drawing alignment columns randomly with replacement,
100 or 1000 times. The phylogenetic analysis is
repeated on each replicate, and the number of
replicate trees in which each node occurs is
summarized.
Example.
             Original data      1st replicate      2nd replicate
Species 1    AGGA               AAGA               GGAA
Species 2    ACGT               AACT               CGTT
Species 3    ACGT               AACT               CGTT
Species 4    ACTT               AACT               CTTT
Species 5    CCGT               CCCT               CGTT
Linear form  ((((2,3),4),5),1)  ((((2,3),4),5),1)  ((((2,3),5),4),1)
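
The resampling step can be sketched as below, using the five example sequences. Only the column resampling is shown; rebuilding a tree for each replicate and tallying node support are left out. The function name and the fixed random seed are assumptions made so the sketch is reproducible.

```python
import random

def bootstrap_replicate(alignment, rng):
    """Draw alignment columns at random, with replacement, to build one replicate."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    # The same column choices are applied to every sequence.
    return {name: "".join(seq[c] for c in columns)
            for name, seq in alignment.items()}

alignment = {
    "Species 1": "AGGA",
    "Species 2": "ACGT",
    "Species 3": "ACGT",
    "Species 4": "ACTT",
    "Species 5": "CCGT",
}
rng = random.Random(0)  # seeded only to make the example repeatable
replicates = [bootstrap_replicate(alignment, rng) for _ in range(100)]
print(replicates[0])
```

Each replicate has the same number of columns as the original alignment, but some original columns appear several times and others not at all.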

61
Are there Correct trees??

Despite all of these caveats, it is actually
quite simple to use computer programs to calculate
phylogenetic trees for data sets.
Provided the data are clean, outgroups are
correctly specified, appropriate algorithms are
chosen, no assumptions are violated, etc., can
the true, correct tree be found and proven to be
scientifically valid?
Unfortunately, it is impossible to ever
conclusively state what is the "true" tree for a
group of sequences (or a group of organisms);
taxonomy is constantly under revision as new data
are gathered (example: the 1980s revision of the seals
and sea lions tree).