Title: Motivation
2 Motivation
- Tree basics
- Homoplasy
- Molecular clock hypothesis
- Prediction methods
  - Character-based: perfect phylogeny, maximum parsimony
  - Distance-based: ultrametric trees, additive trees (e.g., Fitch-Margoliash, Nearest Neighbors), unweighted pair group method with arithmetic mean (UPGMA)
  - Maximum likelihood
- Evaluating trees and data
3 Evolution
- Recall that DNA encodes the blueprint of life.
- Living things pass their DNA information to their children.
- Due to mutations, the DNA changes a little bit.
- After a long time, different species evolve.
- Phylogenetics studies the genetic relationships between different species.
4 Similarity searches and multiple alignments of sequences naturally lead to the question "How are these sequences related?" and, more generally, "How are the organisms from which these sequences come related?"
5 Phylogenetic systematics is a method of taxonomic classification of organisms based on their evolutionary history (Willi Hennig, a German entomologist, 1950).
6 Phenetic versus cladistic analysis
Phenetics is the study of relationships among a
group of organisms on the basis of the degree of
similarity between them, be that similarity
molecular, phenotypic, or anatomical. A tree-like
network expressing phenetic relationships is
called a phenogram. Cladistics can be defined
as the study of the pathways of evolution. In
other words, cladists are interested in such
questions as how many branches there are among a
group of organisms, which branch connects to
which other branch, and what the branching
sequence is. A tree-like network that expresses such
ancestor-descendant relationships is called a
cladogram. Thus, a cladogram refers to the
topology of a rooted phylogenetic tree.
The maximum parsimony method is a typical
representative of the cladistic approach, whereas
the UPGMA method is a typical phenetic method.
7 Character-based approach
- Trees are constructed on the basis of gain or loss of characters (or traits), NOT connected explicitly to a measure of distance.
- Best for small sets of sequences with high similarity.
- Distance measures are not necessary.
- Traditionally, morphological features were used:
  - Has backbone
  - Has feathers
8 Other examples of characters:
- Has a certain amino acid at position i
- Whether a certain gap is present in a multiple sequence alignment
- Whether or not protein X regulates protein Y
9 Character-based trees interpreted as evolutionary trees
- The root represents an ancestral object with none of the present m characters (0 0 0 0).
- Each of the characters changes from 0 to 1 exactly once and never changes back.
- Each character labels one edge.
- Evolutionary history is described by mutation events, not time.
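The conditions above (all-zero root, each character gained exactly once and never lost) describe a perfect phylogeny. A minimal sketch of a compatibility test follows, assuming a binary taxon-by-character matrix; it relies on the standard result that such a tree exists exactly when, for every pair of characters, the sets of taxa carrying them are either disjoint or nested. The function name and the example matrices are illustrative, not from the slides.

```python
from itertools import combinations

def admits_perfect_phylogeny(matrix):
    """matrix[t][c] is 1 if taxon t has character c, else 0."""
    n_chars = len(matrix[0])
    # For each character, collect the set of taxa that carry it.
    carriers = [{t for t, row in enumerate(matrix) if row[c]}
                for c in range(n_chars)]
    for s1, s2 in combinations(carriers, 2):
        # Two characters conflict when their carrier sets overlap
        # without one containing the other.
        if s1 & s2 and not (s1 <= s2 or s2 <= s1):
            return False
    return True

# Hypothetical data: rows are taxa, columns are characters.
compatible = [[1, 1, 0],
              [1, 0, 0],
              [0, 0, 1]]
conflicting = [[1, 1],
               [1, 0],
               [0, 1]]
print(admits_perfect_phylogeny(compatible))
print(admits_perfect_phylogeny(conflicting))
```

A conflict means at least one character must have arisen more than once or been lost, which is the homoplasy discussed later in the deck.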
10Independent evolution of tails
12 Distance based approach
Cladistic Methods
- Evolutionary relationships are documented by creating a branching structure, termed a phylogeny or tree, that illustrates the relationships between the sequences.
- Cladistic methods construct a tree (cladogram) by considering the various possible pathways of evolution and choosing from among these the best possible tree.
- A phylogram is a tree with branch lengths that are proportional to evolutionary distances.
14 Types of data used in phylogenetic inference
Character-based methods: use the aligned characters, such as DNA or protein sequences, directly during tree inference.

Taxa        Characters
Species A   ATGGCTATTCTTATAGTACG
Species B   ATCGCTAGTCTTATATTACA
Species C   TTCACTAGACCTGTGGTCCA
Species D   TTGACCAGACCTGTGGTCCG
Species E   TTGACCAGTTCTCTAGTTCG

Distance-based methods: transform the sequence data into pairwise distances (dissimilarities), and then use the matrix during tree building.

            A      B      C      D      E
Species A   ----   0.20   0.50   0.45   0.40
Species B   0.23   ----   0.40   0.55   0.50
Species C   0.87   0.59   ----   0.15   0.40
Species D   0.73   1.12   0.17   ----   0.25
Species E   0.59   0.89   0.61   0.31   ----
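As a sketch of the transformation from characters to distances, the following computes the proportion of mismatched positions for every pair of the five sequences on this slide. For the pairs checked by hand (A-B, A-C, A-D, C-D) the result agrees with the upper triangle of the table; the lower triangle appears to hold corrected distances, and which correction was used is not stated. The function name is illustrative.

```python
seqs = {
    "A": "ATGGCTATTCTTATAGTACG",
    "B": "ATCGCTAGTCTTATATTACA",
    "C": "TTCACTAGACCTGTGGTCCA",
    "D": "TTGACCAGACCTGTGGTCCG",
    "E": "TTGACCAGTTCTCTAGTTCG",
}

def p_distance(s1, s2):
    """Proportion of aligned positions at which two sequences differ."""
    if len(s1) != len(s2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(s1, s2)) / len(s1)

names = sorted(seqs)
matrix = {(a, b): p_distance(seqs[a], seqs[b]) for a in names for b in names}

# Print the full symmetric matrix, one row per species.
for a in names:
    print(a, " ".join(f"{matrix[(a, b)]:.2f}" for b in names))
```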
15 PHYLOGENETIC ANALYSIS
- Evolution at a molecular level: Linus Pauling and Emile Zuckerkandl, 1965 (mutation rate)
- The branch of taxonomy that deals with numerical data such as DNA sequences is known as phylogenetics.
- Mutations: random (?), accumulate (?)
- Ancestor
  - Genetic drift (identical genes in different species)
  - Gene duplication
  - Recombination
  - Exchange
16 Assumptions of Phylogenies
- All sequences are homologous.
- No duplicate sequences are present.
- Back mutation/reversal
- Optimal alignments
- Reproductive isolation
- Limited horizontal gene transfer
17 Purpose of phylogenetic predictions
- Understand the lineage of different species
- Organizing principle to sort species into a taxonomy
- Understand how various functions evolved
- Understand forces and constraints on evolution
- Perform multiple sequence alignment
18 Homoplasy
- Homoplasy is similarity that is not homologous (not due to common ancestry).
- Homoplasy is the result of independent evolution (convergence, parallelism, reversal).
- It can provide misleading evidence of phylogenetic relationships.
19 Homoplasy
- Significantly similar molecular sequences are very unlikely to arise by chance, i.e., homoplasy at the molecular level is very unlikely.
- Horizontal transfer of sequences from one organism to another (?)
20 Orthologs vs. Paralogs
- When comparing gene sequences, it is important to distinguish between identical vs. merely similar genes in different organisms.
- Orthologs are homologous genes in different species with analogous functions.
- Paralogs are similar genes that are the result of a gene duplication.
- A phylogeny that includes both orthologs and paralogs is likely to be incorrect.
- Sometimes phylogenetic analysis is the best way to determine whether a new gene is an ortholog or a paralog of other known genes.
21 Steps:
1. Alignment
2. Substitution model building
3. Tree building
4. Tree evaluation
22 Progressive alignment
- Closely related sequences versus distantly related sequences
- Independent (RNA?)
- Gaps?
23 Alignment parameter estimation
- Placement of indels (insertion/deletion events) in an alignment of length-variable sequences depends on all parameters of the evolutionary model and should be consistent with those observed in a tree inferred from the alignment.
- An extreme approach is to delete from the analysis all sites that include gaps (phylogenetic signal in these regions will be lost).
- Another approach is to incorporate gaps as characters (an additional state, or independent of the base substitution states).
- Parameters should vary dynamically with evolutionary divergence.
24 PHYLOGENETIC ANALYSIS
Attributes and options
- Computer dependence: none, partial, complete
- Phylogeny invocation: none, a priori, recursive
- Alignment parameter estimation: a priori, dynamic, recursive
- Alignment features: primary structure, higher-order structures
- Mathematical optimization: statistical, nonstatistical
25 PHYLOGENETIC ANALYSIS: CLUSTAL W
1. Partial computational dependence
2. Phylogeny criteria invoked a priori (guide tree)
3. Alignment parameter estimation a priori or dynamically (optional)
4. Alignment of primary structure (partial structural basis in the case of hydrophilic amino acids)
5. Mathematical optimization: nonstatistical
26 Alignment of primary versus higher-order structures
- Aligning according to secondary or higher-order structures is more reliable.
TREE BUILDING PROGRAMS IN ALIGNMENT PACKAGES ARE NOT RIGOROUS!
27 Similarity vs. Evolutionary Relationship
Similarity and relationship are not the same thing, even though evolutionary relationship is inferred from certain types of similarity.
- Similar: having likeness or resemblance (an observation)
- Related: genetically connected (a historical fact)
Two taxa can be most similar without being most closely related. C is more similar in sequence to A (d = 3) than to B (d = 7), but C and B are most closely related (that is, C and B shared a common ancestor more recently than either did with A).
28 Substitution model building
- Very important, since it influences alignment as well as tree building.
- For nucleotide sequences, two kinds of models:
  - Substitutions between particular bases
  - Substitutions among different sites in a sequence
- Transitions (A-G/G-A/C-T/T-C) are more frequent than transversions (A-C/A-T/C-G/G-T).
- Rates of substitution take the form of a square matrix: 4 bases, 20 amino acids, or 61 codons.
- Fixed cost matrices are used in the weighted parsimony method.
29 Character weight matrix and application in phylogenetic analysis
[Figure: reconstruction of evolution from 8 sequences (G G A A C C T T), with the changes C-T, G-A, G-C, and A-T marked on the branches.]
- If unweighted: 3 steps. If weighted: ?
- Distance matrices are much more complicated.
30 Basic Considerations
- Codon bias
  - Amino-acid codons are degenerate, with wobble in the third position.
  - Yeasts, protozoa, and animals have different codon preferences, which result in differences in DNA sequence that are related to codon bias and not to evolution.
  - Also, some protozoa (ciliates) use the codons TAA and TAG to encode glutamine rather than STOP, and in mitochondria the codon TGA encodes tryptophan rather than STOP.
- Relationships between genes are not necessarily the same as the relationships between whole organisms.
- Phylogenies based on DNA are better than those based on proteins, due to the degeneracy of the genetic code and the associated masking of mutations.
- DNA is under less selective pressure than protein.
- DNA comparison is more sensitive in picking up divergence between closely related sequences.
- But DNA sequence alignment is less reliable than protein sequence alignment.
31 Distance Measurements
Distance score: the score between two sequences, representing the number of mismatched positions in the alignment (the number of positions that would have to be changed).
- It is often useful to measure the genetic distance between two species, between two populations, or even between two individuals.
- The entire concept of numerical taxonomy is based on computing phylogenies from a table of distances.
- In the case of sequence data, pairwise distances must be calculated between all sequences that will be used to build the tree, thus creating a distance matrix.
- Distance methods give a single measurement of the amount of evolutionary change between two sequences since divergence from a common ancestor.
32 Finding Distance Between Two Species
- Consider two species with these DNA fragments:
  - Species i: (A, C, G, C, T)
  - Species j: (C, C, A, C, T)
- There are 2 mismatches, so the distance can be estimated as 2.
- This looks reasonable, as 2 mismatches can be thought of as 2 mutations.
- However, this fails to capture multiple mutations at the same site.
- In practice, some corrective distance transformation needs to be applied.
33 Conversion of Alignment Scores to Distances
- Alignment scores are large for similar sequences.
- Distance methods require that the distances between similar sequences are smaller than the distances between less similar sequences.
- Large alignment scores need to be mapped to small distances and vice versa.
34 Computing a Distance Matrix
Reading sequences...
  gtr1_human   548 total, 548 read
  gtr2_human   548 total, 548 read
  gtr3_human   548 total, 548 read
  gtr4_human   548 total, 548 read
  gtr5_human   548 total, 548 read
Computing distances using Kimura method...
  1 x 2    48.61
  1 x 3    45.50
  1 x 4    65.74
  1 x 5   107.70
  2 x 3    61.53
  2 x 4    74.57
  2 x 5   113.82
  3 x 4    68.93
  3 x 5   104.43
  4 x 5   110.86

Matrix 1
          1        2        3        4        5
  1    0.00    48.61    45.50    65.74   107.70
  2             0.00    61.53    74.57   113.82
  3                      0.00    68.93   104.43
  4                               0.00   110.86
  5                                        0.00
35 DNA Distances
- Distances between pairs of DNA sequences are relatively simple to compute as the sum of all base pair differences between the two sequences.
- This type of algorithm can only work for pairs of sequences that are similar enough to be aligned.
- Generally, all base changes are considered equal.
- Insertions/deletions are generally given a larger weight than replacements (gap penalties).
- It is also possible to correct for multiple substitutions at a single site, which are common in distant relationships and at rapidly evolving sites.
36 Mutation rate?
37 Genetic distance: an attempt to answer the question of how much evolutionary change has occurred between sequences.
Jukes-Cantor distance: mutation occurs at a constant rate, and each nucleotide is equally likely to mutate into any other nucleotide with rate a.
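Under the Jukes-Cantor assumptions, an observed proportion of differing sites p can be converted into an estimated number of substitutions per site with the standard formula d = -(3/4) ln(1 - 4p/3). The formula itself is not written on the slide; the sketch below simply applies it, and the function name is illustrative.

```python
import math

def jukes_cantor(p):
    """Jukes-Cantor corrected distance from the observed proportion p of differing sites."""
    if not 0 <= p < 0.75:
        # At p >= 0.75 the logarithm is undefined (sequences look random).
        raise ValueError("p must be in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)

# An observed difference of 20% corresponds to slightly more than 0.2
# substitutions per site, because some sites changed more than once.
print(round(jukes_cantor(0.20), 4))
```

The correction grows quickly as p approaches 0.75, which is why uncorrected mismatch counts underestimate long evolutionary distances.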
38 Kimura two-parameter distance: allows different rates for transitions and transversions.
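A minimal sketch of the Kimura two-parameter distance, assuming two gap-free aligned DNA sequences over A, C, G, T. It uses the standard formula d = -(1/2) ln((1 - 2P - Q) sqrt(1 - 2Q)), where P and Q are the observed proportions of transitions and transversions; the formula and the function name are not taken from the slides.

```python
import math

PURINES = {"A", "G"}

def kimura_2p(s1, s2):
    """Kimura two-parameter distance between two aligned, gap-free DNA sequences."""
    if len(s1) != len(s2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = 0
    for x, y in zip(s1, s2):
        if x == y:
            continue
        # Purine<->purine or pyrimidine<->pyrimidine is a transition.
        if (x in PURINES) == (y in PURINES):
            transitions += 1
        else:
            transversions += 1
    P = transitions / len(s1)
    Q = transversions / len(s1)
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))

# Hypothetical pair: one transition (A->G) and one transversion (A->C) in 10 sites.
print(round(kimura_2p("AAAAAAAAAA", "GCAAAAAAAA"), 4))
```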
39 Amino Acid Distances
- Distances between amino acid sequences are a bit more complicated to calculate.
- Some amino acids can replace one another with relatively little effect on the structure and function of the final protein, while other replacements can be functionally devastating.
- From the standpoint of the genetic code, some amino acid changes can be made by a single DNA mutation, while others require two or even three changes in the DNA sequence.
- In practice, what has been done is to calculate tables of frequencies of all amino acid replacements within families of related protein sequences in the databanks, i.e., PAM and BLOSUM.
40 EVOLUTIONARY TIME?
41 Molecular clock hypothesis
Proposed by Zuckerkandl and Pauling (1965); the neutral theory behind it was proposed in 1968 by Motoo Kimura.
- The controversial hypothesis of the molecular clock (MC) is a consequence of the neutral theory of evolution.
- It holds that in any given DNA or protein sequence, mutations accumulate at an approximately constant rate as long as the sequence retains its original function.
- The difference between the sequences of a DNA segment (or protein) in two species would then be proportional to the time since the species diverged from a common ancestor (coalescence time).
- This time may be measured in arbitrary units, and it can then be calibrated in millions of years for any given gene if the fossil record of that species happens to be rich.
43 The rate of evolution: $k = \nu_T f_0$, where $k$ is the rate of nucleotide substitutions, $\nu_T$ is the mutation rate, and $f_0$ is the fraction of new alleles that are selectively neutral.
Under a molecular clock, the divergence between two populations is $2\mu t$, where $\mu$ is the mutation rate and $t$ is the time since the last common ancestor.
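Reading the clock relation on this slide as d = 2 * mu * t (each of the two lineages accumulates mu * t changes), the divergence time follows by rearranging to t = d / (2 * mu). The sketch below does only that; the numerical values are hypothetical and chosen purely for illustration.

```python
def divergence_time(d, mu):
    """Time since the last common ancestor, assuming a strict molecular clock.

    d  : divergence between the two sequences (substitutions per site)
    mu : mutation rate (substitutions per site per year)
    """
    if mu <= 0:
        raise ValueError("mutation rate must be positive")
    return d / (2 * mu)

# Hypothetical numbers: 2% divergence at 1e-9 substitutions per site per year.
years = divergence_time(0.02, 1e-9)
print(f"{years:.3g} years")
```

In practice d would first be corrected for multiple substitutions, and mu would be calibrated against the fossil record as described on the molecular clock slide.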
44 Neutral theory of evolution
- Most variation that is observed is of no interest to natural selection (fitness).
- Most mutations are so nearly selectively neutral in their effects that their fate is determined largely through random genetic drift; other alleles are deleterious and are removed by selection.
- Silent substitutions and substitutions in noncoding regions will occur more often because they are likely to be selectively neutral.
- Replacement substitutions will occur less often because of selective pressure.
45
- The rate of accepted mutations may be different for different proteins (depending on their tolerance for mutations).
- Different parts of a protein may evolve at different rates.
46 Clustering Algorithms
Distances -> Tree
- Clustering algorithms use distances to calculate phylogenetic trees. These trees are based solely on the relative numbers of similarities and differences between a set of sequences.
- Start with a matrix of pairwise distances.
- Cluster methods construct a tree by linking the least distant pairs of taxa, followed by successively more distant taxa.
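The clustering procedure described above can be sketched with UPGMA, one common clustering method (the choice of UPGMA here is an assumption; the slide does not name a specific algorithm). The sketch repeatedly joins the least distant pair of clusters and averages distances weighted by cluster size. It returns only the topology as nested tuples and does not compute branch heights. The input is the "Matrix 1" distance table from the Kimura-method output earlier in the deck.

```python
from itertools import combinations

def upgma(names, dist):
    """Return a nested-tuple tree. dist maps frozenset({x, y}) to a distance."""
    size = {name: 1 for name in names}      # cluster -> number of leaves
    d = dict(dist)
    while len(size) > 1:
        # Link the least distant pair of current clusters.
        a, b = min(combinations(size, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        for c in size:
            if c != a and c != b:
                # Size-weighted average distance to every other cluster.
                d[frozenset((merged, c))] = (
                    size[a] * d[frozenset((a, c))] + size[b] * d[frozenset((b, c))]
                ) / (size[a] + size[b])
        size[merged] = size.pop(a) + size.pop(b)
    return next(iter(size))

names = ["1", "2", "3", "4", "5"]
pairs = {("1", "2"): 48.61, ("1", "3"): 45.50, ("1", "4"): 65.74, ("1", "5"): 107.70,
         ("2", "3"): 61.53, ("2", "4"): 74.57, ("2", "5"): 113.82,
         ("3", "4"): 68.93, ("3", "5"): 104.43, ("4", "5"): 110.86}
dist = {frozenset(k): v for k, v in pairs.items()}
tree = upgma(names, dist)
print(tree)
```

Sequences 1 and 3 are joined first, then 2, then 4, and 5 joins last, which mirrors "least distant pairs first, successively more distant taxa afterwards".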
47 TREE BUILDING
Species tree or gene tree
A tree is a 2-dimensional graph showing
evolutionary relationships among organisms, or
in our case, in certain genes from separate
organisms. We refer to these separate sources
of sequences as taxa (singular taxon), defined
as phylogenetically distinct units on the tree.
The tree is composed of nodes representing the
taxa and branches representing the relationships
among the taxa. The lengths of the branches are
often drawn proportional to the number of
sequence changes in the branch.
48 The sum of all branch lengths = tree length. The tree is a bifurcating, or binary, tree.
49 When taxa are too close and hard to resolve, several branches may leave the same node.
50 PROPERTIES OF TREES
- A unique path leads from the root node to any other node, and the direction indicates evolutionary time.
- The root is the common ancestor of all taxa.
- The root is defined by including a taxon which we are reasonably sure branched off earlier than the other taxa under study, but which should be related to the remaining taxa.
- If we do not have a taxon to define the root, we can predict relationships by an UNROOTED TREE.
51 Rooted trees
- Single common ancestor
- Requires more information
Unrooted trees
- Insufficient information to tell whether or not a given internal node is a common ancestor of any 2 leaves
THERE IS A LARGE NUMBER OF POSSIBLE TREES, AND ONLY ONE TREE IS THE CORRECT ONE. THE OBJECTIVE OF THE ANALYSIS IS TO FIND THE CORRECT TREE.

TAXA   # OF ROOTED TREES   # OF UNROOTED TREES
3      3                   1
4      15                  3
5      105                 15
7      10,395              945
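The counts in this table follow the standard formulas for fully bifurcating trees: (2n - 3)!! rooted trees and (2n - 5)!! unrooted trees for n taxa, where !! is the product of odd numbers. A short sketch that reproduces the table (function names are illustrative):

```python
def odd_double_factorial(k):
    """Product of the odd numbers 1 * 3 * 5 * ... * k (1 if k < 1)."""
    product = 1
    for odd in range(3, k + 1, 2):
        product *= odd
    return product

def rooted_trees(n):
    """Number of rooted bifurcating trees for n >= 2 taxa: (2n - 3)!!"""
    return odd_double_factorial(2 * n - 3)

def unrooted_trees(n):
    """Number of unrooted bifurcating trees for n >= 3 taxa: (2n - 5)!!"""
    return odd_double_factorial(2 * n - 5)

for n in (3, 4, 5, 7, 10):
    print(n, rooted_trees(n), unrooted_trees(n))
```

The rapid growth is the reason exhaustive enumeration of trees is only feasible for small numbers of taxa.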
52 Phylogenetic Tree Construction
- Processes
- Topology construction
- Length estimation
- Methods
- Distance methods
- Maximum parsimony methods
- Maximum likelihood methods
53 2 methods
54 Method 1: OUTGROUP
- The outgroup sequence should be closely related to the rest of the sequences, but there should also be significantly more difference between the outgroup and the rest of the sequences.
- An outgroup that is too distant may lead to an incorrect tree because of the more random, complex nature of the differences between the outgroup and the rest of the sequences.
- In choosing an outgroup, one assumes that the evolutionary history of the gene is the same as that of the rest of the sequences. If this assumption is incorrect (e.g., horizontal gene transfer has occurred), an incorrect analysis could result.
55 Method 2
Use statistical tools that root trees automatically (e.g., mid-point rooting).
This must involve assumptions. BEWARE!
56 METRIC DISTANCES between any two or three taxa (a, b, and c) have the following properties:
- Property 1 (non-negativity): $d(a, b) \ge 0$
- Property 2 (symmetry): $d(a, b) = d(b, a)$
- Property 3 (distinctness): $d(a, b) = 0$ if and only if $a = b$
- Property 4 (triangle inequality): $d(a, c) \le d(a, b) + d(b, c)$
57 ULTRAMETRIC DISTANCES must satisfy the previous four conditions, plus:
- Property 5: $d(a, b) \le \max[d(a, c), d(b, c)]$
This implies that the two largest distances are equal, so that they define an isosceles triangle.
Similarity = relationship if the distances are ultrametric!
If distances are ultrametric, then the sequences are evolving in a perfectly clock-like manner, and thus can be used in UPGMA trees and for the most precise calculations of divergence dates.
58 ADDITIVE DISTANCES
- Property 6: $d(a, b) + d(c, d) \le \max[d(a, c) + d(b, d),\ d(a, d) + d(b, c)]$
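Both conditions can be checked mechanically. The sketch below tests the three-point (ultrametric) condition as "the two largest of the three pairwise distances are equal" and the four-point (additive) condition as "the two largest of the three pairwise sums are equal", which are the usual equivalent forms of Properties 5 and 6. The example distances are the step counts nAB = 3, nAC = 7, nAD = 8, nBC = 6, nBD = 7, nCD = 3 from the distance table later in the deck; function names and the tolerance are illustrative.

```python
from itertools import combinations

def is_ultrametric(taxa, d, tol=1e-9):
    """Three-point condition: in every triple the two largest distances are equal."""
    for a, b, c in combinations(taxa, 3):
        low, mid, high = sorted([d[frozenset((a, b))],
                                 d[frozenset((a, c))],
                                 d[frozenset((b, c))]])
        if abs(high - mid) > tol:
            return False
    return True

def is_additive(taxa, d, tol=1e-9):
    """Four-point condition: in every quartet the two largest pair sums are equal."""
    for a, b, c, e in combinations(taxa, 4):
        sums = sorted([d[frozenset((a, b))] + d[frozenset((c, e))],
                       d[frozenset((a, c))] + d[frozenset((b, e))],
                       d[frozenset((a, e))] + d[frozenset((b, c))]])
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True

taxa = ["A", "B", "C", "D"]
steps = {("A", "B"): 3, ("A", "C"): 7, ("A", "D"): 8,
         ("B", "C"): 6, ("B", "D"): 7, ("C", "D"): 3}
d = {frozenset(k): v for k, v in steps.items()}
print(is_additive(taxa, d), is_ultrametric(taxa, d))
```

These distances fit a tree exactly (additive) but do not fit a clock-like tree (not ultrametric), so a method that assumes a molecular clock would be a poor match for them.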
59 METHODS OF PHYLOGENETIC ANALYSIS (phenetic vs. cladistic; phenograms vs. cladograms)

Character-based methods: derive trees that optimize the distribution of the data patterns for each character (distances are not fixed).
- Maximum parsimony method: a multiple sequence alignment is produced in order to predict which sequence positions are likely to correspond. These positions will appear in vertical columns in the multiple sequence alignment. For each aligned position, phylogenetic trees that require the smallest number of evolutionary changes to produce the observed sequence changes are identified. This analysis is continued for every position in the sequence alignment. Finally, those trees which produce the smallest number of changes overall for all sequence positions are identified.
- Maximum likelihood method: like the maximum parsimony method, the maximum likelihood method depends upon first obtaining a reliable multiple sequence alignment and then examining the changes in each column in the alignment. In this case, however, the likelihood of a particular tree is calculated using an expected model of change in the sequences (Swofford and Olsen 1990). For example, in the Jukes-Cantor model all nucleotides are assumed to be equally frequent, and the probability of change of any nucleotide to any other nucleotide is assumed to be the same. For each possible tree, the likelihood of finding the actual sequence changes at each column in the aligned sequences is calculated. The probabilities for each aligned position are then multiplied to provide a likelihood for each tree. The tree which provides the maximum likelihood value is the most probable tree.

Distance-based methods: compute pairwise distances according to some measure, then discard the actual data (distances are fixed).
- All possible pairs of sequences are aligned to determine which pairs are the most similar or closely related. These alignments provide a measure of the genetic distance between the sequences. These distance measurements are then used to predict the evolutionary relationship.
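61 Flowchart for choosing a method
1. Choose a set of related sequences.
2. Obtain a multiple sequence alignment.
3. Is there strong sequence similarity? If yes, use maximum parsimony methods.
4. If not, is there clearly recognizable sequence similarity? If yes, use distance methods.
5. If not, use the maximum likelihood method.
6. Analyze how well the data support the prediction.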
62 Character-based: maximum parsimony method
- Find the tree which minimizes the number of changes needed to explain the data.
- A multiple sequence alignment is produced in order to predict which sequence positions are likely to correspond.
- These positions will appear in vertical columns in the multiple sequence alignment.
- For each aligned position, phylogenetic trees that require the smallest number of evolutionary changes to produce the observed sequence changes are identified.
- This analysis is continued for every position in the sequence alignment.
- Finally, those trees which produce the smallest number of changes overall for all sequence positions are identified.
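The column-by-column counting described above can be sketched with Fitch's small-parsimony algorithm, which gives the minimum number of changes for one column on a fixed tree when every change costs one step. The four sequences are the ones used in the next slide's example; the taxon labels s1 to s4, the tuple encoding of trees, and the function names are assumptions made for this sketch.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, minimum changes) for one alignment column.

    tree is either a leaf name or a pair (left_subtree, right_subtree).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # No shared state: one change is needed on one of the two branches.
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, seqs):
    """Sum the per-column minimum change counts over the whole alignment."""
    length = len(next(iter(seqs.values())))
    return sum(fitch_site(tree, {name: s[i] for name, s in seqs.items()})[1]
               for i in range(length))

seqs = {"s1": "ATCG", "s2": "TTCG", "s3": "ATCC", "s4": "TCCG"}
grouping_1 = (("s1", "s3"), ("s2", "s4"))   # ATCG with ATCC, TTCG with TCCG
grouping_2 = (("s1", "s2"), ("s3", "s4"))   # ATCG with TTCG, ATCC with TCCG
print(parsimony_score(grouping_1, seqs), parsimony_score(grouping_2, seqs))
```

The first grouping needs fewer changes than the second, so under the parsimony criterion it would be preferred between these two.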
63 Character-based
A subset of all possible trees is examined. The most parsimonious tree is the one that requires the fewest evolutionary changes for all sequences to derive from a common ancestor (minimum evolution).
- Consider four sequences: ATCG, TTCG, ATCC, and TCCG.
- Imagine a tree that branches at the first position, grouping ATCG and ATCC on one branch, and TTCG and TCCG on the other branch.
- Then each branch splits, for a total of 3 nodes on the tree (Tree 1).
- Compare Tree 1 with one that first divides ATCC onto its own branch, then splits off ATCG, and finally divides TTCG from TCCG (Tree 2).
- Trees 1 and 2 both have three nodes, but when all of the distances back to the root (number of nodes crossed) are summed, the total is equal to 8 for Tree 1 and 9 for Tree 2.
[Figure: Tree 1 and Tree 2]
64 How do you search through all trees?
- Enumerate all trees (too many).
- Use techniques that limit the search space (e.g., branch and bound).
- Use heuristics (many possibilities), e.g., nearest neighbor interchange: start with a tree and consider neighboring trees. If any neighboring tree has fewer changes, take it as the current tree. Stop when there are no improvements.
65 Character-based: informative sites
RULES
- 4 taxa: three possible unrooted trees
- Some sites are informative, some are not.
- Only informative sites need to be analyzed.
Cost of change?
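A sketch of how informative sites can be picked out, using the usual definition of a parsimony-informative site (at least two different states, each present in at least two taxa); the slide does not spell the definition out, so treat it as an assumption. The four sequences are sequences a to d from the maximum likelihood example later in the deck.

```python
from collections import Counter

def informative_sites(seqs):
    """Return 1-based positions of parsimony-informative columns."""
    sites = []
    for position, column in enumerate(zip(*seqs), start=1):
        counts = Counter(column)
        # Informative: at least two states that each occur in two or more taxa.
        if sum(1 for n in counts.values() if n >= 2) >= 2:
            sites.append(position)
    return sites

alignment = ["ACGCGTTGGG",   # sequence a
             "ACGCGTTGGG",   # sequence b
             "ACGCAATGAA",   # sequence c
             "ACACAGGGAA"]   # sequence d
print(informative_sites(alignment))
```

Constant columns and columns where only one taxon differs require the same number of changes on all three unrooted trees, so they cannot favor one tree over another.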
66 Character-based: maximum parsimony scoring
- Step matrices
- Consistency Index (CI): CI = (minimum possible tree length) / (actual tree length)
- Codon position: variable weighting
- Only mutations leading to amino acid changes are scored
67 Implementation of step matrices
- Character-state trees describing possible pathways explicitly assign a weight (cost) to a particular sort of change.
- Parsimony methods attempt to minimize the summed cost of all changes.
- Summed cost = number of steps in a character (step = unit of cost).
- One of the key assumptions used in parsimony analysis is the assignment of relative weights or costs to each type of change, summarized in a cost or step matrix.
- The structure of the step matrix depends on the types of rules you think the characters are evolving under.
- In programs, you must choose a general step matrix (or the default chooses one for you).
68 Types of characters
1. Unordered characters: a change from any state to any other is counted as one step (Fitch parsimony; Fitch, 1971). Used for nucleotide sequence data, but one may want transversions (TV) to have a higher cost.
2. Ordered characters: the number of steps from one state to another is the difference between the state numbers (Wagner parsimony; Farris, 1970). Steps correspond to lines in the path 0-1-2-3-4 (e.g., morphological characters on a continuum).
3. Irreversible characters: the number of steps between states is the difference between the state numbers, where decreases in state number do not occur (Camin-Sokal parsimony; Camin and Sokal, 1965). Multiple gains are allowed, no losses: 0 -> 1 -> 2 -> 3 -> 4.
69 Step matrices (rows: from state, columns: to state; ∞ marks a change that is not allowed)

     Unordered      Ordered        Irreversible
     0  1  2  3     0  1  2  3     0  1  2  3
0    0  1  1  1     0  1  2  3     0  1  2  3
1    1  0  1  1     1  0  1  2     ∞  0  1  2
2    1  1  0  1     2  1  0  1     ∞  ∞  0  1
3    1  1  1  0     3  2  1  0     ∞  ∞  ∞  0

Step matrices can be elaborated for any number of transformation or weighting schemes. A common one is weighting transversions more heavily in molecular data:

     A  C  G  T
A    -  2  1  2
C    2  -  2  1
G    1  2  -  2
T    2  1  2  -
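To show how a step matrix changes the count, here is a minimal sketch of weighted parsimony for a single column using Sankoff-style dynamic programming with the transition/transversion matrix above (diagonal entries taken as 0, an assumption). The tree, the taxon labels a to d, and the column T T A G are borrowed from the maximum likelihood example later in the deck; the function names are illustrative.

```python
INF = float("inf")
BASES = "ACGT"
# Transitions (A<->G, C<->T) cost 1, transversions cost 2, no change costs 0.
COST = {"A": {"A": 0, "C": 2, "G": 1, "T": 2},
        "C": {"A": 2, "C": 0, "G": 2, "T": 1},
        "G": {"A": 1, "C": 2, "G": 0, "T": 2},
        "T": {"A": 2, "C": 1, "G": 2, "T": 0}}

def sankoff(tree, states):
    """Return {base: minimum cost of the subtree if its root carries that base}."""
    if isinstance(tree, str):
        return {b: (0 if b == states[tree] else INF) for b in BASES}
    left = sankoff(tree[0], states)
    right = sankoff(tree[1], states)
    return {b: min(COST[b][x] + left[x] for x in BASES)
               + min(COST[b][x] + right[x] for x in BASES)
            for b in BASES}

def weighted_length(tree, states):
    """Minimum total cost of the column over all root assignments."""
    return min(sankoff(tree, states).values())

column = {"a": "T", "b": "T", "c": "A", "d": "G"}
tree = (("a", "b"), ("c", "d"))
print(weighted_length(tree, column))
```

With unit costs this column would need 2 steps on this tree; with the weighted matrix the best reconstruction costs 3 (one transition plus one transversion).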
70 Character-based
- PAUP (Phylogenetic Analysis Using Parsimony), GCG: http://evolution.genetics.washington.edu/phylip/software.pars.html#PAUP (no web interface)
- MACCLADE: Macintosh program that contains many tools for entering and editing data, producing trees, and giving diagnostic feedback
71 Character-based: Maximum parsimony
Provides misleading information when rates of sequence change differ among the branches of the tree represented by the sequence data.
[Figure: real versus predicted tree for Taxa 1 to 4 with bases a and g at one position; if rates of change are assumed to be equal, an incorrect tree is predicted.]
72 Character-based: Maximum parsimony
Approaches to solve the problem:
- Break up long branches by adding taxa closely related to the taxa in question.
- Lake's method (PAUP)
  - Only transversions are scored: (A, G) <-> (C, T).
  - Transversions are assumed to occur at a constant rate.
  - Also, independent of position.
[Figure: trees A and B for Taxa 1 to 4 at another position; evolutionary change or by chance?]
73 Minimum evolution (ME) methods
- Optimality criterion: the tree(s) with the shortest sum of the branch lengths (or overall tree length) is chosen as the best tree.
- Advantages
  - Can be used on indirectly measured distances (immunological, hybridization).
  - Distances can be corrected for unseen events.
  - Usually faster than character-based methods.
  - Can be used for some rate analyses.
  - Has an objective function (as compared to clustering methods).
- Disadvantages
  - Information is lost when characters are transformed to distances.
  - Slower than clustering methods.
74 Character-based: Maximum Likelihood (ML)
- The term Maximum Likelihood does not refer to a single statistical method, but rather to a general approach.
- ML methods take what has been described as an "inside-out" approach. In their simplest form, they begin by listing all possible models, and then calculate the probability that each model would generate the data actually observed.
- The model with the highest probability of generating the observed data is chosen as the best model.
- Joe Felsenstein's application of ML to phylogeny is implemented in DNAML in the PHYLIP package, and in a modified version of DNAML called fastDNAml, written by Gary Olsen.
- Uses an explicit model of evolution; therefore more diverse sequences may be analyzed.
- Uses probability calculations to find a tree; similar to the parsimony method in that the analysis is performed on each column of the multiple alignment.
- All possible trees are considered.
- Trees with the least number of changes are considered.
75 The Maximum Likelihood approach resembles the MP method but offers an additional opportunity to evaluate trees: variations in mutation rates can be modeled (Jukes-Cantor and Kimura models).
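76 Maximum likelihood example
Sequence a   A C G C G T T G G G
Sequence b   A C G C G T T G G G
Sequence c   A C G C A A T G A A
Sequence d   A C A C A G G G A A

[Figure: one of the three unrooted trees for taxa A, B, C, D at the column T T A G, and one of its five rooted trees for a, b, c, d, with branches labeled L0 to L6.]
- L0 to L6 are the likelihood values (probabilities) for the root and the branches.
- Consider every possible base assignment at the internal nodes: 64 in total for three node positions.
- Example: rooted tree with base assignments T, G, and T at the internal nodes, with transition probability $2 \times 10^{-6}$ and transversion probability $10^{-6}$:
  $L = L_0 \times L_1 \times L_2 \times L_3 \times L_4 \times L_5 \times L_6 = 0.25 \times 1 \times (2 \times 10^{-6}) \times 1 \times 1 \times 1 \times 10^{-6} = 5 \times 10^{-13}$
- Next tree, and so on: $L(\text{Tree}) = L(\text{Tree}_1) + L(\text{Tree}_2) + \dots$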
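The arithmetic for the single base assignment shown on this slide can be reproduced directly. The sketch multiplies the seven factors L0 to L6 as given (0.25, 1, 2e-6, 1, 1, 1, 1e-6) and also reports the log-likelihood, which is what programs typically work with to avoid numerical underflow; the variable names are illustrative.

```python
import math

# L0 (probability of the root base) followed by L1..L6 for the six branches,
# exactly as listed on the slide for this one base assignment.
branch_likelihoods = [0.25, 1, 2e-6, 1, 1, 1, 1e-6]

likelihood = math.prod(branch_likelihoods)
log_likelihood = sum(math.log(x) for x in branch_likelihoods)

print(f"L = {likelihood:.1e}, ln L = {log_likelihood:.2f}")
```

The likelihood of the tree for this column would then be obtained by repeating the calculation for each of the 64 internal base assignments and combining the results, and the column likelihoods are multiplied across the alignment as described earlier.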
77 Maximum likelihood (ML) methods
Optimality criterion: ML methods evaluate phylogenetic hypotheses in terms of the probability that a proposed model of the evolutionary process and the proposed unrooted tree would give rise to the observed data. The tree found to have the highest ML value is considered to be the preferred tree.
- Advantages
  - Are inherently statistical and evolutionary model-based.
  - Usually the most consistent of the methods available.
  - Can be used for character (can infer the exact substitutions) and rate analysis.
  - Can be used to infer the sequences of the extinct (hypothetical) ancestors.
  - Can help account for branch-length effects in unbalanced trees.
  - Can be applied to nucleotide or amino acid sequences, and other types of data.
- Disadvantages
  - Are not as simple and intuitive as many other methods.
  - Are computationally very intense (limits the number of taxa and the length of sequence).
  - Like parsimony, can be fooled by high levels of homoplasy.
  - Violations of the assumed model can lead to incorrect trees.
78 Distance-based methods
DISTANCES -> TREE
- The tree is constructed using distances between species (number of mutations, time, other distance measures).
- Neighbors: sequence pairs with the smallest number of changes.
- Trees are rooted, i.e., sequences share a common ancestor.
- The first step is producing MSAs (e.g., CLUSTALW).
- DNA
  - A distance matrix is created.
  - Relatively simple for pairs of homologous sequences that can be aligned without large insertions, deletions, etc.
- Proteins
  - Matrices such as PAM are used.
- Multiple substitutions at one site are always a problem.
79 Distance method applications
- CLUSTALW
- PAUP
- PHYLIP

PHYLIP programs
- Distance tables (written to an outfile, which becomes the infile for tree estimation):
  - DNADIST: distances among nucleic acid sequences
  - PROTDIST: distances for amino acid sequences
- Methods of phylogenetic tree estimation (from distance matrices):
  - FITCH: Fitch-Margoliash method, does not assume a molecular clock
  - KITSCH: Fitch-Margoliash method, assumes a molecular clock
  - NEIGHBOR: neighbor-joining (does not assume a molecular clock, unrooted tree) or unweighted pair group method (UPGMA)

Distance score: the score between two sequences, representing the number of mismatched positions in the alignment (the number of positions that would have to be changed).
A distance method will be successful if the distances between the sequences can be made additive on a predicted tree.
80 DISTANCE-BASED
Sequence A   xxxxxxxxxxxxxxxxxxxxxx
Sequence B   xxxxxxxxxxxxxxxxxxxxxx
Sequence C   xxxxxxxxxxxxxxxxxxxxxx
Sequence D   xxxxxxxxxxxxxxxxxxxxxx

Distances: the number of steps required to change one sequence into another.

Distance table
nAB = 3   nAC = 7   nAD = 8
nBC = 6   nBD = 7   nCD = 3

[Figure: phylogenetic tree]
Principle of additivity for this tree: $d_{AB} + d_{CD} < d_{AC} + d_{BD} = d_{AD} + d_{BC}$
Each change occurs once?
81 Additive Trees
- Generalization of ultrametric trees.
- In ultrametric trees, the number of mutations was assumed to be proportional to the temporal distance of a node to its ancestor.
- It was also assumed that mutations took place at the same rate in all branches.
- Additive trees model different rates of mutation along different branches.
82Fitch-Margoliash method
DISTANCE-BASED
- Draw unrooted tree
- Calculate the length of tree branches algebraically
From A to B: a + b = 22 (1)
From A to C: a + c = 39 (2)
From B to C: b + c = 41 (3)
Subtract (3) from (2): a - b = -2 (4)
Add (1) and (4): 2a = 20, a = 10
From (1) and (2): b = 12, c = 29
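The three-taxon algebra has a closed form: each branch is half of (the two distances that include it, minus the one that does not). A small sketch with a hypothetical function name, using the distances from this slide:

```python
def three_taxon_branches(d_ab, d_ac, d_bc):
    """Solve a+b=d_ab, a+c=d_ac, b+c=d_bc for the branch lengths a, b, c
    of an unrooted three-taxon tree."""
    a = (d_ab + d_ac - d_bc) / 2
    b = (d_ab + d_bc - d_ac) / 2
    c = (d_ac + d_bc - d_ab) / 2
    return a, b, c

a, b, c = three_taxon_branches(22, 39, 41)  # (10.0, 12.0, 29.0)
```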
83Fitch-Margoliash method
DISTANCE BASED
- Uses distance table
- Calculates the length of tree branches algebraically
- Draws unrooted tree
From A to B: a + b = 22 (1)
From A to C: a + c = 39 (2)
From B to C: b + c = 41 (3)
Subtract (3) from (2): a - b = -2 (4)
Add (1) and (4): 2a = 20, a = 10
From (1) and (2): b = 12, c = 29
For n sequences
- Simple extension of the 3-sequence method
- The closest sequence pair is chosen
- The rest of the sequences are agglomerated
- The distance between the pair is computed
- The matrix is recomputed with the sequence pair combined into a single node
- The process is repeated until all the sequences are combined
84Fitch-Margoliash method
DISTANCE -BASED
- Advantages
- tests more than one tree
- still pretty fast
- can use empirical substitution scoring methods
- global optimization of tree by statistical criteria
- Disadvantages
- requires longer execution time than Neighbor Joining, but is still quite practical on most computers for typical datasets
- does not consider intermediate ancestors, meaning that there is no requirement for an internally-consistent evolutionary model
- misses homoplasies, especially over long distances: long evolutionary distances will be underestimated
85The Neighbor-joining method
DISTANCE -BASED
Similar to Fitch-Margoliash, but the choice of which sequences to pair is determined by a different algorithm: sequences are paired based on the effect of the pairing on the sum of the branch lengths of the tree.
- The distances between the sequences are used to calculate the sum of branch lengths in a star-like tree
- Decompose/modify the tree by combining pairs of sequences
- The sum of the branch lengths of the new tree is calculated
- A new distance table is made by combining A with B (composite sequence)
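One way to sketch the pairing step is with the Q criterion commonly used in neighbor-joining implementations, Q(i,j) = (n-2)·d(i,j) - r(i) - r(j), where r is a taxon's total distance to all others; the pair with the smallest Q is the one whose joining most reduces the total branch length. The slides do not give this formula, so treat it and the function name as assumptions. The distances are the four-taxon values from the earlier slide (nAB=3, nAC=7, nAD=8, nBC=6, nBD=7, nCD=3).

```python
def nj_first_join(names, dist):
    """Pick the first pair to join and the lengths of their two new branches."""
    n = len(names)

    def d(i, j):
        return dist[frozenset((i, j))]

    # r(i): sum of distances from taxon i to every other taxon
    totals = {i: sum(d(i, j) for j in names if j != i) for i in names}
    best = None
    for idx, i in enumerate(names):
        for j in names[idx + 1:]:
            q = (n - 2) * d(i, j) - totals[i] - totals[j]
            if best is None or q < best[0]:   # ties keep the first pair found
                best = (q, i, j)
    _, i, j = best
    limb_i = d(i, j) / 2 + (totals[i] - totals[j]) / (2 * (n - 2))
    limb_j = d(i, j) - limb_i
    return i, j, limb_i, limb_j

dist = {
    frozenset("AB"): 3, frozenset("AC"): 7, frozenset("AD"): 8,
    frozenset("BC"): 6, frozenset("BD"): 7, frozenset("CD"): 3,
}
joined = nj_first_join(["A", "B", "C", "D"], dist)  # ('A', 'B', 2.0, 1.0)
```

Note that the two branches get different lengths (2.0 and 1.0), which reflects that this method does not force equal rates on the two lineages.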
86The Neighbor-joining method
- Advantages
- fastest tree building method
- can use empirical substitution scoring methods
- not influenced by variations in the rates of change along the branches of the tree
- Disadvantages
- tests only a single tree
- does not consider intermediate ancestors, meaning that there is no requirement for an internally-consistent evolutionary model
- misses homoplasies, especially over long distances: long evolutionary distances will be underestimated
87DISTANCE -BASED
Unweighted pair group method with arithmetic mean (UPGMA)
- Assumes that the rate of change along the branches of the tree is constant
- Assumes that distances are approximately ultrametric
- Simplest method
- Can lead to a wrong tree if the rates of mutation are not uniform
88Distance Matrix
89- dAB is the smallest distance
- Group A and B
- Branch length = dAB/2 (here we assume the evolution rate is constant)
- Recalculate distances from AB to other taxa as the average
- d(AB)C = (dAC + dBC)/2
(Tree diagram: A and B joined, each on a branch of length 0.15/2)
90- new distance matrix
- Find smallest distance and continue as before
- Repeat until all taxa are on tree
(Tree diagram: A and B join at height dAB/2, C joins at d(AB)C/2, D joins at d(ABC)D/2)
http://www.icp.ucl.ac.be/opperd/private/upgma.html
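The UPGMA steps can be sketched in a few lines. The distance matrix below is hypothetical (only dAB = 0.15 echoes the diagram), and the function name is an assumption. When a merged cluster contains more than one taxon, the average is weighted by cluster size; for the first merge this reduces to (dAC + dBC)/2.

```python
def upgma(names, dist):
    """Build a rooted tree by UPGMA and return it as a Newick string.
    dist maps frozenset({x, y}) -> distance between taxa x and y."""
    d = dict(dist)
    size = {n: 1 for n in names}        # number of taxa in each cluster
    height = {n: 0.0 for n in names}    # height of each cluster's root
    active = list(names)
    while len(active) > 1:
        # 1. find the closest pair of active clusters
        a, b = min(
            ((x, y) for i, x in enumerate(active) for y in active[i + 1:]),
            key=lambda p: d[frozenset(p)],
        )
        h = d[frozenset((a, b))] / 2.0  # constant rate: both sides reach height h
        new = "({}:{:.3f},{}:{:.3f})".format(a, h - height[a], b, h - height[b])
        # 2. size-weighted average distance from the new cluster to the rest
        for c in active:
            if c not in (a, b):
                d[frozenset((new, c))] = (
                    size[a] * d[frozenset((a, c))] + size[b] * d[frozenset((b, c))]
                ) / (size[a] + size[b])
        size[new] = size[a] + size[b]
        height[new] = h
        # 3. replace the pair by the merged cluster and repeat
        active = [c for c in active if c not in (a, b)] + [new]
    return active[0] + ";"

# Hypothetical ultrametric distances, for illustration only.
example = {
    frozenset("AB"): 0.15, frozenset("AC"): 0.30, frozenset("BC"): 0.30,
    frozenset("AD"): 0.50, frozenset("BD"): 0.50, frozenset("CD"): 0.50,
}
tree = upgma(["A", "B", "C", "D"], example)
```

With these values A and B join first at height 0.075, C joins at 0.15 and D at 0.25, matching the order of the steps listed on the slides.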
91Clustering methods (UPGMA & N-J)
- Optimality criterion: NONE. The algorithm itself builds the tree.
- Advantages
- Can be used on indirectly-measured distances (immunological, hybridization).
- Distances can be corrected for unseen events.
- The fastest of the methods available (N-J is screamingly fast!).
- Can therefore analyze very large datasets quickly (needed for HIV, etc.).
- Can be used for some types of rate and date analysis.
- Disadvantages
- Similarity and relationship are not necessarily the same thing, so clustering by similarity does not necessarily give an evolutionary tree.
- Cannot be used for character analysis!
- Have no explicit optimization criteria, so one cannot even know if the program worked properly to find the correct tree for the method.
92GrowTree
When you run GrowTree, SeqWeb seamlessly links together these programs (in the order given) to perform the analysis:
1. PileUp
2. Distances
3. GrowTree
For the alignment, a simplification of the progressive alignment method of Feng & Doolittle (1987) is used (clusters are created).
GrowTree reconstructs a phylogenetic tree from a distance matrix such as the one created by Distances. Two methods are available for reconstructing the tree: UPGMA (unweighted pair group method using arithmetic averages; assumes the same rate of evolution) and neighbor-joining.
#NEXUS
[ Trees from file hum_gtr.distances ]
begin trees;
utree Tree_1 = ((('Gtr1_Human':18.43,'Gtr3_Human':30.18):4.34,'Gtr4_Human':24.87):3.19,('Gtr2_Human':35.98,'Gtr5_Human':74.88):3.19):0.00;
endblock;
93The NEXUS file is your actual ToL-MacClade data
file. It is the file you edit when working with
ToL-MacClade, and it is the file you write when
choosing Save from ToL-MacClade's File menu. The
function of the NEXUS file is to store the
information necessary to build your tree.
ToL-MacClade uses a special format, the NEXUS
format, to store your list of taxa, your
phylogenetic tree, and the information you
entered in the various windows and boxes. The
NEXUS format has been created to allow for
compatibility between a number of different
programs for phylogenetic analysis. You will be
able to view and edit NEXUS files in ToL-MacClade
or in a word processor
GCG/SEQWEB
94New Hampshire (Newick) Format
Human, Mouse, Drosophila, Honey bee, Fern, Wheat, Pine
(((Human, Mouse), (Dros, Bee)), (Fern, (Wheat, Pine)));
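To show how the nested parentheses encode the tree, here is a minimal sketch of a Newick reader that turns a string into nested tuples. It is a simplified, hypothetical parser: it drops whitespace, ignores branch lengths after a colon, and does not handle quoted names or labels on internal nodes.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested tuples of leaf names."""
    s = "".join(text.split()).rstrip(";")
    pos = 0

    def node():
        nonlocal pos
        if pos < len(s) and s[pos] == "(":
            pos += 1                      # consume "("
            children = [node()]
            while pos < len(s) and s[pos] == ",":
                pos += 1                  # consume "," and read the next child
                children.append(node())
            if pos >= len(s) or s[pos] != ")":
                raise ValueError("missing ')' in Newick string")
            pos += 1                      # consume ")"
            result = tuple(children)
        else:
            start = pos
            while pos < len(s) and s[pos] not in ",():":
                pos += 1
            result = s[start:pos]         # a leaf name
        if pos < len(s) and s[pos] == ":":
            pos += 1                      # skip an optional ":length"
            while pos < len(s) and s[pos] not in ",()":
                pos += 1
        return result

    tree = node()
    if pos != len(s):
        raise ValueError("unbalanced or trailing text in Newick string")
    return tree

tree = parse_newick("(((Human, Mouse), (Dros, Bee)), (Fern, (Wheat, Pine)));")
```

Each pair of parentheses becomes one tuple, so the animals and the plants end up as the two children of the root. A string with a missing closing parenthesis raises `ValueError` instead of returning a partial tree.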
95Q08832 P25123 P34903 P18505 O14764 P78334 Q99928 O05591 P24046 P23415
96UPGMA
Neighbor joining
Kimura distance
Uncorrected distance
97PAUP (Phylogenetic Analysis Using Parsimony) - GCG - version 10 has an option to perform phylogenetic analysis using distance methods
PHYLIP (Phylogeny Inference Package)
http://evolution.genetics.washington.edu/phylip/phylipweb.html
FITCH: estimates a phylogenetic tree assuming additivity of branch lengths using the Fitch-Margoliash method (no molecular clock; rates of evolution along branches can vary)
KITSCH: the same, but with a molecular clock
NEIGHBOR: neighbor-joining method or unweighted pair group method with arithmetic mean (UPGMA)
98Maximum likelihood method
Computationally intensive and time consuming
PHYLIP (Phylogeny Inference Package)
http://bioweb.pasteur.fr/seqanal/phylogeny/phylip-uk.html
DNAML - allows for variable frequencies of the 4 nucleotides and for unequal rates of transitions and transversions
DNAMLK - the same, but a molecular clock is taken into account
99- There are several phylogenetics servers available on the Web
- some of these will change or disappear in the near future
- these programs can be very slow, so keep your sample sets small
- The Institut Pasteur, Paris has a PHYLIP server at
- http://bioweb.pasteur.fr/seqanal/phylogeny/phylip-uk.html
- The Belozersky Institute at Moscow State University has their own "GeneBee" phylogenetics server
- http://www.genebee.msu.su/services/phtree_reduced.html
- The Phylodendron website is a tree drawing program with a nice user interface and a lot of options; however, the output is limited to GIFs at 72 dpi, which is not publication quality.
- http://iubio.bio.indiana.edu/treeapp/treeprint-form.html
- the most important factor is the quality of the input data
- use each of the three methods and compare trees
- different results can be obtained depending on the order in which sequences are in the input file (jumble option)
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=search&DB=taxonomy
http://phylogeny.arizona.edu/tree/phylogeny.html
http://evolution.genetics.washington.edu/phylip/software.html#Plotting
100- Introduction to Phylogenetic Systematics, Peter H. Weston & Michael D. Crisp, Society of Australian Systematic Biologists
- http://www.science.uts.edu.au/sasb/WestonCrisp.html
- University of California, Berkeley Museum of Paleontology (UCMP)
- http://www.ucmp.berkeley.edu/clad/clad4.html
- Formats conversion
- http://www.swbic.org/products/bioinfo/transform/transform_help.php
101Alignment in FASTA format
http://prodes.toulouse.inra.fr/multalin/multalin.html
Alignment in FASTA format
http://bioweb.pasteur.fr/seqanal/phylogeny/phylip-uk.html
Distances from protein sequences (protdist)
http://bioweb.pasteur.fr/seqanal/interfaces/protdist-simple.html
Tree from the same file
Outfile Neighbor Neighbor joining
protpars
Alignment in FASTA format
http://prodes.toulouse.inra.fr/multalin/multalin.html
READSEQ in PHYLIP format
http://bioweb.pasteur.fr/seqanal/interfaces/readseq-simple.html
Parsimony
http://bioweb.pasteur.fr/seqanal/phylogeny/phylip-uk.html
http://www.med.nyu.edu/rcr/nccu/phylogen-ex.txt
102Evaluating Phylogenies
Non-random sequence order might introduce a bias
into the dataset.
We have no way of knowing whether the tree
inferred from the data is truly representative of
the evolutionary history of the gene family.
103Evaluating Phylogenies
1. Jumbling sequence addition order
- Most methods for phylogeny construction are sensitive to the order in which sequences are added to the tree. Consequently, the simplest way to test a phylogeny is to repeat the analysis several times with different addition orders.
- Many PHYLIP programs, and most other phylogeny programs, have an option called JUMBLE that uses a random number generator to choose which sequence to add at each step, rather than adding them in the order in which they appear in the file. The user is asked to supply a random number to use as a "seed" in generating a random number chain.
- Therefore, even when doing only one run on a phylogeny, it is probably a good idea to jumble the order of sequences.
2. Bootstrap and Jackknife replicates
Assumption: the statistical properties of a sample should be similar to the statistical properties of the population from which that sample was drawn. The larger the sample, the more representative it should be of the population. Conversely, if the original sample was large enough, it should also be possible to take smaller samples from the larger sample, and expect that the smaller samples would also retain most of the statistical properties of the original population.
104- Phylogenies: if we create smaller alignments containing only some of the positions from the total alignment, and use these mini-alignments to construct a tree, we should still get the same tree each time.
- If we get the same tree each time the data is sampled, then we can be strongly confident that all the data is consistent with the tree.
- If we get a different tree with each sample, then no tree is strongly supported by the data.
- Jackknife resampling has the drawback that the subreplicates are of a smaller size than the original dataset, which may change the statistical properties of the samples. For that reason, Jackknife resampling has largely been replaced by bootstrap resampling.
- Bootstrap resampling is sampling with replacement. In the case of a multiple sequence alignment, sites are sampled at random until the dataset is equal in length to the original alignment.
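Bootstrap resampling of an alignment can be sketched as drawing column indices with replacement until the replicate is as long as the original. The toy alignment and function name below are hypothetical; a real analysis would then rebuild a tree from each replicate.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate of an alignment (dict: name -> sequence)
    by sampling columns with replacement."""
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[n]) != length for n in names):
        raise ValueError("all aligned sequences must have the same length")
    # The same column indices are used for every sequence, so sites stay intact.
    columns = [rng.randrange(length) for _ in range(length)]
    return {n: "".join(alignment[n][i] for i in columns) for n in names}

# Hypothetical toy alignment, not from the slides.
toy = {"A": "ACGTAC", "B": "ACGTTC", "C": "TCGTTA"}
replicates = [bootstrap_alignment(toy, random.Random(seed)) for seed in range(100)]
```

Each replicate has the original length, but some columns appear several times and others not at all. A jackknife replicate would instead keep a random subset of columns without replacement, giving a shorter alignment.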
105(No Transcript)
106Assessing Reliability Bootstrap
107(No Transcript)
108For bootstrap resampling of a sequence alignment,
it is best to create at least 100 bootstrapped
datasets, and redo the phylogeny for each one.
A consensus tree can then be built which
indicates, for each branch in the tree, how often
it occurs in the population of replicate samples.
Certain positions are overrepresented in each replicate, while others are underrepresented. However, with enough replicates, all sites will be weighted equally.
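The per-branch support in a consensus tree can be sketched by counting how often each group of taxa (clade) appears among the replicate trees. The replicate trees below are hypothetical nested tuples standing in for trees rebuilt from bootstrapped datasets; the function names are assumptions, and the trees are treated as rooted.

```python
from collections import Counter

def clades(tree):
    """Return the leaf set (as a frozenset) of every internal node of a
    nested-tuple tree whose leaves are strings."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    leaves(tree)
    return found

def clade_support(replicate_trees):
    """Fraction of replicate trees in which each clade occurs."""
    counts = Counter()
    for t in replicate_trees:
        counts.update(clades(t))
    return {c: counts[c] / len(replicate_trees) for c in counts}

# Four hypothetical replicate trees; a real run would use 100 or more.
replicate_trees = [
    (("A", "B"), ("C", "D")),
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
    ((("A", "B"), "C"), "D"),
]
support = clade_support(replicate_trees)
ab_support = support[frozenset("AB")]   # 0.75: A and B group together in 3 of 4 trees
```

Multiplying the fractions by 100 gives the bootstrap percentages usually written on the branches of a consensus tree.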
109Simulations have shown that "bootstrap values greater than 70% correspond to a probability greater than 95%"
110PROBLEM: The disadvantage of bootstrap resampling is that it drastically increases the time required to construct a phylogeny.
Where large numbers of sequences must be used, it is only practical with distance methods.
111Are there Correct trees??
112Are there Correct trees??
- Despite all of these caveats, it is actually quite simple to use computer programs to calculate phylogenetic trees for data sets.
- Provided the data are clean, outgroups are correctly specified, appropriate algorithms are chosen, no assumptions are violated, etc., can the true, correct tree be found and proven to be scientifically valid?
- Unfortunately, it is impossible to ever conclusively state what is the "true" tree for a group of sequences (or a group of organisms); taxonomy is constantly under revision as new data is gathered.
113Some simple practical considerations
- The most important factor is not the method but the quality of the input data
- Use each of the three methods and compare trees for consistency (though consistency does not mean that the result is statistically significant)
- The choice of outgroup taxa can have as much influence on the analysis as the choice of ingroup taxa
- Different answers can be obtained depending on the order in which sequences are in the input file (jumble option)
- Put problematic sequences at the end
114Application of Phylogeny
Understanding history of life
Understanding rapidly mutating viruses (like HIV)
Help to predict protein/RNA structure
Help to do multiple sequence alignment
Explaining and predicting gene expression
Explaining and predicting ligands
Help to design enhanced organisms
Help to design drugs
115
>Clostridium_perfringens
MKGIYSALLVSFDKDGNINEKGLRE
IIRHNIDVCKIDGLYVGGSTGENFMLSTDEKKRIFEIAMDEAKGQ VKLI
AQVGSVNLKEAVELAKFTTDLGYDAISAVTPFYYKFDFNEIKHYYETIIN
SVDNKLIIYSIPFLTG VNMSIEQFAELFENDKIIGVKFTAADFYLLERM
RKAFPDKLIFAGFDEMMLPATVLGVDGAIGSTFNVNG VRARQIFEAAQK
GDIETALEVQHVTNDLITDILNNGLYQTIKLILQEQGVDAGYCRQPMKEA
TEEMIAKA KEINKKYF
>Mus_musculus
MAFPKKKLRGLVAATITP
MTENGEINFPVIGQYVDYLVKEQGVKNIFVNGTTGEGLSLSVSERRQVAE
EW VNQGRNKLDQVVIHVGALNVKESQELAQHAAEIGADGIAVIAPFFFK
SQNKDALISFLREVAAAAPTLPF YYYHMPSMTGVKIRAEELLDGIQDKI
PTFQGLKFTDTDLLDFGQCVDQNHQRQFALLFGVDEQLLSALVM GATGA
VGSTYNYLGKKTNQMLEAFEQKDLASALSYQFRIQRFINYVIKLGFGVSQ
TKAIMTLVSGIPMGP PRLPLQKATQEFTAKAEAKLKSLDFLSSPSVKEG
KPLASA
>Sinorhizobium_meliloti
MKLEGIYSALLTPFSEDES
IDRQAIGALVDFQVRLGIDGVYVGGSSGEAMLQSLDERADYLSDVAAAAS
G RLTLIAHVGTIATRDALRLSQHAAKSGYQAISAIPPFYYDFSRPEVMA
HYRELADVSALPLIVYNFPART SGFTLPELVELLSHPNIIGIKHTSSDM
FQLERIRHAVPDAIVYNGYDEMCLAGFAMGAQGAIGTTYNFMG DLFVAL
RDCAAAGRIEEARRLQAMANRVIQVLIKVGVMPGSKALLGIMGLPGGPSR
RPFRKVEEADLAAL REAVAPVLAWRESTSRKSM gtBacillus_subti
lis MNFGNVSTAMITPFDNKGNVDFQKLSTLIDYLLKNGTDSLVVAGTT
GESPTLSTEEKIALFEYTVKEVNG RVPVIAGTGSNNTKDSIKLTKKAEE
AGVDAVMLVTPYYNKPSQEGMYQHFKAIAAETSLPVMLYNVPGRT VASL
APETTIRLAADIPNVVAIKEASGDLEAITKIIAETPEDFYVYSGDDALTL
PILSVGGRGVVSVASH IAGTDMQQMIKNYTNGQTANAALIHQKLLPIMK
ELFKAPNPAPVKTALQLRGLDVGSVRLPLVPLTEDER LSLSSTISEL
>Escherichia_coli_O157
MATNLRGVMAALLTPFDQQQALDKASLR
RLVQFNIQQGIDGLYVGGSTGEAFVQSLSEREQVLEIVAEEA KGKIKLI
AHVGCVSTAESQQLAASAKRYGFDAVSAVTPFYYPFSFEEHCDHYRAIID
SADGLPMVVYNIP ALSGVKLTLDQINTLVTLPGVGALKQTSGDLYQMEQ
IRREHPDLVLYNGYDEIFASGLLAGADGGIGSTY NIMGWRYQGIVKALK
EGDIQTAQKLQTECNKVIDLLIKTGVFRGLKTVLHYMDVVSVPLCRKPFG
PVDEK YLPELKALAQQLMQERG
>Pasteurella_multocida
MKN
LKGIFSALLVSFNADGSINEKGLRQIVRYNIDKMKVDGLYVGGSTGENFM
LSTEEKKEIFRIAKDEA KDEIALIAQVGSVNLQEAIELGKYATELGYDS
LSAVTPFYYKFSFPEIKHYYDSIIEATGNYMIVYSIPF LTGVNIGVEQF
GELYKNPKVLGVKFTAGDFYLLERLKKAYPNHLIWAGFDEMMLPAASLGV
DGAIGSTFN VNGVRARQIFELTQAGKLKEALEIQHVTNDLIEGILANGL
YLTIKELLKLDGVEAGYCREPMTKELSPEK VAFAKELKAKYLS gtYers
inia_pestis MKKLTGLIAAPHTPFDEQGEVNYPVIDQIAEHLINDGV
KGVYVCGTTGEGIHCSVDERKKIAERWVNAAQ GKLSITLHTGALSIKDA
VDLSRHAETLDIFATSAIGPCFFKPGNLDDLIAYCQAIAAAAPSKGFYYY
HSG MSGVNLDMEQFLIKAESKIPNLSGIKFNNADLYEFQRCLRVSGGKF
DIPFGVDEHLPGGLAVGAIGAVGS TYNYAAPLFHKIIADFNAGDQVAVQ
RGMDHVIALIRVLVEFGGVAAGKAAMQLHGIDAGNPRLPLRALTK EQKQ
TVVNRMRDAITLQ
>E._coli
MATNLRGVMAALLTPFDQQQALDKASL
RRLVQFNIQQGIDGLYVGGSTGEAFVQSLSEREQVLEIVAEEA KGKIKL
IAHVGCVSTAESQQLAASAKRYGFDAVSAVTPFYYPFSFEEHCDHYRAII
DSADGLPMVVYNIP ALSGVKLTLDQINTLVTLPGVGALKQTSGDLYQME
QIRREHPDLVLYNGYDEIFASGLLAGADGGIGSTY NIMGWRYQGIVKAL
KEGDIQTAQKLQTECNKVIDLLIKTGVFRGLKTVLHYMDVVSVPLCRKPF
GPVDEK YLPELKALAQQLMQERG
>Vibrio_cholerae
MKKLTGLI
AAPHTPFTKDNKVNFAAIDQIAELLIEQGVKGAYVCGTTGEGIHCSVEER
KAIAERWVKAVD GKLDVILHTGALSIVDTINLTEHAETLDIFATSAIGP
CFFKPGSVDDLVEYCAQVAAAAPSKGFYYYHSG MSGVNLDLEQFLIKGE
QRIPNLYGAKFNNADLYEYQRCVRVSNRKFDIPFGVDEFLPAGLAVGAVG
AVGS TYNYAAPLYLKIIEAFNHGKHDEVAALMDKVIAIIRVLVEYGGVA
AGKVAMQLHGIDAGDPRLPIRSLND KQKADVLAKMRDAGFLSI
>Homo_sapiens
MAFPKKKLQGLVAATITPMTENGEINFSVIGQYVDYLVKEQ
GVKNIFVNGTTGEGLSLSVSERRQVAEEW VTKGKDKLDQVIIHVGALSL
KESQELAQHAAEIGADGIAVIAPFFLKPWTKDILINFLKEVAAAAPALPF
YYYHIPALTGVKIRAEELLDGILDKIPTFQGLKFSDTDLLDFGQCVDQN
RQQQFAFLFGVDEQLLSALVM GATGAVGSTYNYLGKKTNQMLEAFEQKD
FSLALNYQFCIQRFINFVVKLGFGVSQTKAIMTLVSGIPMGP PRLPLQK
ASREFTDSAEAKLKSLDFLSFTDLKDGNLEAGS
>Neisseria_meningitidis
MLQGSLVALITPMNQDGSIHYEQLRDLIDWHIENGTDGIVAV
GTTGESATLSVEEHTAVIEAVVKHVAKR VPVIAGTGANNTVEAIALSQA
AEKAGADYTLSVVPYYNKPSQEGMYRHFKAVAEAAAIPMILYNVPGRTV
VSMNNETILRLAEIPNIVGVKEASGNIGSNIELINRAPEGFVVLSGD