Motivation presentation

Title: Motivation

1
(No Transcript)
2
Motivation
Tree Basics
Homoplasy
Molecular Clock Hypothesis
Prediction Methods
Character-based: perfect phylogeny, maximum parsimony
Distance-based: ultrametric trees, additive trees (e.g. Fitch-Margoliash, Nearest Neighbors), unweighted pair group method with arithmetic mean (UPGMA)
Maximum Likelihood
Evaluating trees and data
3
Evolution
Recall that DNA encodes the blueprint of life
Living things pass DNA information to their children
Due to mutations, the DNA is changed a little bit
After a long time, different species evolve
Phylogenetics studies the genetic relationships between different species
4
Similarity searches and multiple alignments of sequences naturally lead to the question "How are these sequences related?" and, more generally, "How are the organisms from which these sequences come related?"
5
Phylogenetic systematics is a method of taxonomic classification of organisms based on their evolutionary history

Willi Hennig, a German entomologist, 1950.

6
Phenetic versus cladistic analysis
Phenetics is the study of relationships among a
group of organisms on the basis of the degree of
similarity between them, be that similarity
molecular, phenotypic, or anatomical. A tree-like
network expressing phenetic relationships is
called a phenogram. Cladistics can be defined
as the study of the pathways of evolution. In
other words, cladists are interested in such
questions as how many branches there are among a
group of organisms which branch connects to
which other branch and what is the branching
sequence. A tree-like network that expresses such
ancestor-descendant relationships is called a
cladogram. Thus, a cladogram refers to the
topology of a rooted phylogenetic tree.
The maximum parsimony method is a typical
representative of the cladistic approach, whereas
the UPGMA method is a typical phenetic method.
7
Character-based approach
Trees are constructed on the basis of gain or loss of characters (or traits)
NOT connected explicitly to a measure of distance
Best for small sets of sequences with high similarity

Distance measures are not necessary
Traditionally, morphological features were used
Has backbone
Has feathers

8
Has a certain amino acid at position i
Whether a certain gap is present in a multiple sequence alignment
Whether or not protein X regulates protein Y
9
Character-based trees interpreted as evolutionary trees
Root represents an ancestral object with none of the present m characters: (0, 0, 0, 0)
Each of the characters changes from 0 to 1 exactly once and never changes back
Each character labels one edge
Evolutionary history is measured by mutation events, not time
10
Independent evolution of tails
11
Independent evolution of tails
12
Distance based approach
Cladistic Methods

Evolutionary relationships are documented by creating a branching structure, termed a phylogeny or tree, that illustrates the relationships between the sequences.
Cladistic methods construct a tree (cladogram) by considering the various possible pathways of evolution and choosing from among these the best possible tree.
A phylogram is a tree with branches that are proportional to evolutionary distances.

13
(No Transcript)
14
Types of data used in phylogenetic inference

Character-based methods: use the aligned characters, such as DNA or protein sequences, directly during tree inference.

Taxa         Characters
Species A    ATGGCTATTCTTATAGTACG
Species B    ATCGCTAGTCTTATATTACA
Species C    TTCACTAGACCTGTGGTCCA
Species D    TTGACCAGACCTGTGGTCCG
Species E    TTGACCAGTTCTCTAGTTCG

Distance-based methods: transform the sequence data into pairwise distances (dissimilarities), and then use the matrix during tree building.

             A      B      C      D      E
Species A    ----   0.20   0.50   0.45   0.40
Species B    0.23   ----   0.40   0.55   0.50
Species C    0.87   0.59   ----   0.15   0.40
Species D    0.73   1.12   0.17   ----   0.25
Species E    0.59   0.89   0.61   0.31   ----
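
As a minimal sketch of the "transform the sequence data into pairwise distances" step, the snippet below computes the uncorrected proportion of differing sites (often called the p-distance) for the five aligned sequences on this slide. The function names are illustrative, and the choice of p-distance is an assumption; the slide does not say which measure produced each half of its matrix.

```python
from itertools import combinations

# Aligned sequences copied from the slide.
ALIGNMENT = {
    "A": "ATGGCTATTCTTATAGTACG",
    "B": "ATCGCTAGTCTTATATTACA",
    "C": "TTCACTAGACCTGTGGTCCA",
    "D": "TTGACCAGACCTGTGGTCCG",
    "E": "TTGACCAGTTCTCTAGTTCG",
}


def p_distance(seq1, seq2):
    """Proportion of aligned sites at which two equal-length sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(1 for x, y in zip(seq1, seq2) if x != y)
    return mismatches / len(seq1)


def distance_matrix(alignment):
    """Return {(taxon_i, taxon_j): p-distance} for every unordered pair of taxa."""
    return {
        (i, j): p_distance(alignment[i], alignment[j])
        for i, j in combinations(sorted(alignment), 2)
    }


if __name__ == "__main__":
    for (i, j), dist in distance_matrix(ALIGNMENT).items():
        print(f"{i}-{j}: {dist:.2f}")
```

For species A and B this gives 4 differences over 20 sites, i.e. 0.20, which agrees with the A-B entry in the upper half of the matrix.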
15
PHYLOGENETIC ANALYSIS
- evolution at a molecular level: Linus Pauling and Emile Zuckerkandl, 1965 (mutation rate)
The branch of taxonomy that deals with numerical data such as DNA sequences is known as phylogenetics
Mutations: Random (?) Accumulate (?) Ancestor

Genetic drift (identical genes in different
species)
Gene duplication
Recombination
Exchange

16
Assumptions of Phylogenies

All sequences are homologous.
No duplicate sequences are present.
Back mutation/reversal
Optimal alignments
Reproductive isolation
Limited horizontal gene transfer

17
Purpose of phylogenetic predictions

Understand the lineage of different species
Organizing principle to sort species into a
taxonomy
Understand how various functions evolved
Understand forces and constraints on evolution
Perform multiple sequence alignment

18
Homoplasy

Homoplasy is similarity that is not homologous
(not due to common ancestry)
Homoplasy is the result of independent evolution
(convergence, parallelism, reversal)
Can provide misleading evidence of phylogenetic
relationships

19
Homoplasy

Homoplasy is similarity that is not homologous
(not due to common ancestry)
Homoplasy is the result of independent evolution
(convergence, parallelism, reversal)
Can provide misleading evidence of phylogenetic
relationships

Significantly similar molecular sequences are very unlikely to arise by chance, i.e. homoplasy on the molecular level is very unlikely
Horizontal transfer of sequences from one organism to another (?)
20
Orthologs vs. Paralogs

When comparing gene sequences, it is important to
distinguish between identical vs. merely similar
genes in different organisms.
Orthologs are homologous genes in different
species with analogous functions.
Paralogs are similar genes that are the result of
a gene duplication.
A phylogeny that includes both orthologs and
paralogs is likely to be incorrect.
Sometimes phylogenetic analysis is the best way
to determine if a new gene is an ortholog or
paralog to other known genes.

21
1. Alignment
2. Substitution model building
3. Tree building
4. Tree evaluation
22
Progressive alignment
Closely related sequences first, then distantly related sequences
Independent (RNA?)
Gaps?
23

Alignment parameter estimation
Placement of indels (insertion/deletion events) in an alignment of length-variable sequences depends on all parameters of the evolutionary model and should be consistent with those observed in a tree inferred from the alignment
An extreme approach: delete from the analysis all sites that include gaps (phylogenetic signal in these regions will be lost)
Another approach: incorporate gaps as characters (an additional state, or independent of the base substitution states)
Parameters should vary dynamically with evolutionary divergence

24
PHYLOGENETIC ANALYSIS

Alignment

Attributes and options
Computer dependence: none, partial, complete
Phylogeny invocation: none, a priori, recursive
Alignment parameter estimation: a priori, dynamic, recursive
Alignment features: primary structure, higher-order structures
Mathematical optimization: statistical, nonstatistical

25
CLUSTAL W

1. Partial computational dependence
2. Phylogeny criteria invoked a priori (guide tree)
3. Alignment parameter estimation a priori or dynamically (optional)
4. Alignment of primary structure (partial structural basis in the case of hydrophilic amino acids)
5. Mathematical optimization: nonstatistical

Alignment of primary versus higher-order structures
Aligning according to secondary or higher-order structures is more reliable

TREE-BUILDING PROGRAMS IN ALIGNMENT PACKAGES ARE NOT RIGOROUS!
27
Similarity vs. Evolutionary Relationship
Similarity and relationship are not the same thing, even though evolutionary relationship is inferred from certain types of similarity.
Similar: having likeness or resemblance (an observation)
Related: genetically connected (a historical fact)
Two taxa can be most similar without being most closely related.
C is more similar in sequence to A (d = 3) than to B (d = 7), but C and B are most closely related (that is, C and B shared a common ancestor more recently than either did with A).
28

Substitution model building
Very important, since it will influence alignment as well as tree building
For nucleotide sequences, two models:
Substitutions between particular bases
Substitutions among different sites in a sequence

Transitions (A-G, G-A, C-T, T-C) are more frequent than transversions (A-C, A-T, C-G, G-T)
Rates of substitution take the form of a square matrix: 4 x 4 for bases, 20 x 20 for amino acids, 61 x 61 for codons
Fixed cost matrices are used in the weighted parsimony method
29
Character weight matrix and application in phylogenetic analysis
[Figure: reconstructions of evolution from 8 sequences (G G A A C C T T), with branch changes labelled C-T, G-A, G-C and A-T]
If unweighted: 3 steps. If weighted: ?
Distance matrices are much more complicated
30
Basic Considerations

Codon bias
The genetic code is degenerate, with wobble in the third codon position.
Yeasts, protozoa, and animals have different codon preferences, which result in differences in DNA sequence that are related to codon bias and not to evolution.
Also, some protozoa (ciliates) use the codons TAA and TAG to encode glutamine rather than STOP, and in mitochondria the codon TGA encodes tryptophan rather than STOP.

Relationships between genes are not necessarily the same as the relationships between whole organisms.

Phylogenies based on DNA can be better than those based on proteins, because the degeneracy of the genetic code masks some mutations at the protein level
DNA is under less selective pressure than protein
DNA comparison is more sensitive for picking up divergence between closely related sequences.
But DNA sequence alignment is less reliable than protein sequence alignment

31
Distances Measurements
Distance score: the score between two sequences, representing the number of mismatched positions in the alignment (the number of positions that would have to be changed)

It is often useful to measure the genetic
distance between two species, between two
populations, or even between two individuals.
The entire concept of numerical taxonomy is based
on computing phylogenies from a table of
distances.
In the case of sequence data, pairwise distances
must be calculated between all sequences that
will be used to build the tree - thus creating a
distance matrix.
Distance methods give a single measurement of the
amount of evolutionary change between two
sequences since divergence from a common
ancestor.

32
Finding Distance Between Two Species
Consider two species with these DNA fragments:
Species i: (A, C, G, C, T)
Species j: (C, C, A, C, T)
2 mismatches, so the distance can be estimated to be 2
This looks reasonable, as 2 mismatches can be thought of as 2 mutations
However, this fails to capture multiple mutations at the same site
In practice, some corrective distance transformation needs to be applied
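
One common corrective transformation is the Jukes-Cantor formula, d = -(3/4) ln(1 - 4p/3), where p is the observed fraction of mismatched sites. The sketch below applies it to the two fragments above; the slide does not say which correction it has in mind, so the choice of Jukes-Cantor here is an assumption.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for an observed mismatch fraction p."""
    if not 0 <= p < 0.75:
        raise ValueError("Jukes-Cantor distance is undefined for p >= 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


species_i = "ACGCT"
species_j = "CCACT"

# Count mismatched positions in the (already aligned) fragments.
mismatches = sum(x != y for x, y in zip(species_i, species_j))
p = mismatches / len(species_i)   # 2 / 5 = 0.4
d = jukes_cantor(p)               # about 0.57 substitutions per site

print(mismatches, p, round(d, 3))
```

The corrected value (about 0.57 substitutions per site) is larger than the raw fraction 0.4, reflecting mutations hidden by repeated changes at the same site.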
33
Conversion of Alignment Scores to Distances

Alignment scores are large for similar sequences.
Distance methods require that the distances
between similar sequences are smaller than the
distances between less similar sequences.
Large alignment scores need to be mapped to small
distances and vice versa.

34
Computing a Distance Matrix

Reading sequences...
  gtr1_human   548 total, 548 read
  gtr2_human   548 total, 548 read
  gtr3_human   548 total, 548 read
  gtr4_human   548 total, 548 read
  gtr5_human   548 total, 548 read
Computing distances using Kimura method...
  1 x 2    48.61
  1 x 3    45.50
  1 x 4    65.74
  1 x 5   107.70
  2 x 3    61.53
  2 x 4    74.57
  2 x 5   113.82
  3 x 4    68.93
  3 x 5   104.43
  4 x 5   110.86

Matrix 1
         1        2        3        4        5
  1     0.00    48.61    45.50    65.74   107.70
  2              0.00    61.53    74.57   113.82
  3                       0.00    68.93   104.43
  4                                0.00   110.86
  5                                         0.00
35
DNA Distances

Distances between pairs of DNA sequences are
relatively simple to compute as the sum of all
base pair differences between the two sequences.
This type of algorithm can only work for pairs of sequences that are similar enough to be aligned
Generally all base changes are considered equal
Insertions/deletions are generally given a larger weight than replacements (gap penalties).
It is also possible to correct for multiple substitutions at a single site, which are common in distant relationships and at rapidly evolving sites.

36
Mutation rate?
37
Genetic distance: an attempt to answer the question of how much evolutionary change has occurred between sequences
Jukes-Cantor distance: mutation occurs at a constant rate, and each nucleotide is equally likely to mutate into any other nucleotide, with rate a.
38
Kimura two-parameter distance: allows different rates for transitions and transversions.
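
A minimal sketch of the Kimura two-parameter distance, using the standard formula d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), where P and Q are the observed fractions of transitions and transversions. The function name and the example sequences (species A and B from the earlier alignment table) are chosen here for illustration.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def kimura_2p(seq1, seq2):
    """Kimura two-parameter distance between two aligned DNA sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = 0
    for x, y in zip(seq1, seq2):
        if x == y:
            continue
        # A transition stays within purines or within pyrimidines.
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    n = len(seq1)
    P = transitions / n
    Q = transversions / n
    return -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)


species_a = "ATGGCTATTCTTATAGTACG"
species_b = "ATCGCTAGTCTTATATTACA"
print(round(kimura_2p(species_a, species_b), 2))
```

For these two sequences (one transition and three transversions in 20 sites) the distance rounds to 0.23, slightly above the uncorrected 0.20.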
39
Amino Acid Distances

Distances between amino acid sequences are a bit
more complicated to calculate.
Some amino acids can replace one another with
relatively little effect on the structure and
function of the final protein while other
replacements can be functionally devastating.
From the standpoint of the genetic code, some
amino acid changes can be made by a single DNA
mutation while others require two or even three
changes in the DNA sequence.
In practice, what has been done is to calculate tables of frequencies of all amino acid replacements within families of related protein sequences in the databanks, e.g. PAM and BLOSUM

40
EVOLUTIONARY TIME?
41
Molecular clock hypothesis
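First proposed by Emile Zuckerkandl and Linus Pauling in the 1960s; given a theoretical basis by Motoo Kimura's neutral theory (1968).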

The controversial hypothesis of molecular clock
(MC) is a consequence of the neutral theory of
evolution.
It holds that in any given DNA /protein sequence,
mutations accumulate at an approximately constant
rate as long as the DNA sequence retains its
original functions.
The difference between the sequences of a DNA
segment (or protein) in two species would then be
proportional to the time since the species
diverged from a common ancestor (coalescence
time).
This time may be measured in arbitrary units and
then it can be calibrated in millions of years
for any given gene if the fossil record of that
species happens to be rich.

42
(No Transcript)
43
The rate of evolution: $k = \nu_T f_0$, where $k$ = rate of nucleotide substitution, $\nu_T$ = the mutation rate, and $f_0$ = the fraction of new alleles that are selectively neutral.
Under a molecular clock, the divergence between two populations is $2\mu t$, where $\mu$ = mutation rate and $t$ = time since the last common ancestor.
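
As a small worked example of the clock relation, the sketch below solves d = 2 * mu * t for the divergence time t. The numeric values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def divergence_time(distance, mutation_rate):
    """Solve d = 2 * mu * t for t (two lineages each accumulate mu * t changes)."""
    if mutation_rate <= 0:
        raise ValueError("mutation rate must be positive")
    return distance / (2 * mutation_rate)


# Hypothetical inputs, for illustration only.
d = 0.02     # substitutions per site separating the two sequences
mu = 1e-9    # substitutions per site per year along one lineage

t = divergence_time(d, mu)
print(f"{t:.3g} years")
```

The factor of 2 appears because both lineages have been accumulating changes since the split.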
44

Neutral theory of evolution: most of the variation that is observed is of no interest to natural selection (it does not affect fitness).
Most mutations are so nearly selectively neutral in their effects that their fate is determined largely by random genetic drift; the other alleles are deleterious and are removed by selection.
Silent substitutions and substitutions in noncoding regions will occur more often because they are likely to be selectively neutral.
Replacement substitutions will occur less often because of selective pressure.

The rate of accepted mutations may be different for different proteins (depending on their tolerance for mutations)
Different parts of a protein may evolve at different rates

46
Clustering Algorithms
Distances
Tree

Clustering algorithms use distances to calculate
phylogenetic trees. These trees are based solely
on the relative numbers of similarities and
differences between a set of sequences.
Start with a matrix of pairwise distances
Cluster methods construct a tree by linking the
least distant pairs of taxa, followed by
successively more distant taxa.
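
The clustering idea can be sketched with UPGMA, one of the clustering methods named in the outline. This is a simplified illustration that returns only the topology (as nested tuples) and no branch lengths; the function name and the distance values are hypothetical.

```python
def upgma(names, distances):
    """Cluster taxa by repeatedly linking the least distant pair (UPGMA).

    names: list of taxon labels.
    distances: {(a, b): distance} for every unordered pair of taxa.
    Returns the tree topology as nested 2-tuples.
    """
    size = {name: 1 for name in names}            # cluster -> number of taxa in it
    dist = {frozenset(pair): d for pair, d in distances.items()}
    while len(size) > 1:
        closest = min(dist, key=dist.get)         # least distant pair of clusters
        a, b = sorted(closest, key=str)           # fixed order for reproducible output
        merged = (a, b)
        new_size = size[a] + size[b]
        for c in size:
            if c in (a, b):
                continue
            # Distance to the new cluster: size-weighted mean of the old distances.
            d_ac = dist.pop(frozenset((a, c)))
            d_bc = dist.pop(frozenset((b, c)))
            dist[frozenset((merged, c))] = (size[a] * d_ac + size[b] * d_bc) / new_size
        del dist[closest]
        del size[a], size[b]
        size[merged] = new_size
    return next(iter(size))


# Hypothetical distance table for four taxa.
example = {
    ("A", "B"): 2, ("A", "C"): 6, ("A", "D"): 10,
    ("B", "C"): 6, ("B", "D"): 10, ("C", "D"): 10,
}
print(upgma(["A", "B", "C", "D"], example))
```

With this table A and B are linked first, then C joins them, and D is added last.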

47
TREE BUILDING
Species or genes tree
A tree is a 2-dimensional graph showing
evolutionary relationships among organisms, or
in our case, in certain genes from separate
organisms. We refer to these separate sources
of sequences as taxa (singular taxon), defined
as phylogenetically distinct units on the tree.
The tree is composed of nodes representing the
taxa and branches representing the relationships
among the taxa. The lengths of the branches are
often drawn proportional to the number of
sequence changes in the branch.
48
The sum of all branch lengths = tree length
The tree is a bifurcating, or binary, tree
49
The sum of all branch lengths = tree length
The tree is a bifurcating, or binary, tree (when taxa are too close to resolve, several branches may leave the same node)
50

PROPERTIES OF TREES
A unique path leads from the root node to any other node, and the direction indicates evolutionary time.
The root is the common ancestor of all taxa.
The root is defined by including a taxon which we are reasonably sure branched off earlier than the other taxa under study, but which is still related to the remaining taxa.
If we do not have a taxon to define the root, we can predict relationships by an UNROOTED TREE.

Rooted trees
Single common ancestor
Requires more information

Unrooted trees
Insufficient information to tell whether or not a given internal node is a common ancestor of any 2 leaves
THERE ARE A LARGE POSSIBLE NUMBER OF TREES AND ONLY ONE TREE IS THE CORRECT ONE. THE OBJECTIVE OF THE ANALYSIS IS TO FIND THE CORRECT TREE.

TAXA    # OF ROOTED TREES    # OF UNROOTED TREES
3       3                    1
4       15                   3
5       105                  15
7       10,395               945
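
The tree counts follow the standard formulas for strictly bifurcating trees: (2n - 3)!! rooted trees and (2n - 5)!! unrooted trees for n taxa, where !! is the double factorial. A short sketch (function names are illustrative):

```python
def double_factorial(n):
    """Product n * (n - 2) * (n - 4) * ... down to 1 (returns 1 for n <= 1)."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result


def rooted_trees(n_taxa):
    """Number of distinct rooted bifurcating trees: (2n - 3)!!"""
    if n_taxa < 2:
        raise ValueError("need at least two taxa")
    return double_factorial(2 * n_taxa - 3)


def unrooted_trees(n_taxa):
    """Number of distinct unrooted bifurcating trees: (2n - 5)!!"""
    if n_taxa < 3:
        raise ValueError("need at least three taxa")
    return double_factorial(2 * n_taxa - 5)


if __name__ == "__main__":
    for n in (3, 4, 5, 7, 10):
        print(n, rooted_trees(n), unrooted_trees(n))
```

The counts grow so quickly that exhaustive enumeration is only practical for a handful of taxa.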
52
Phylogenetic Tree Construction

Processes
Topology construction
Length estimation
Methods
Distance methods
Maximum parsimony methods
Maximum likelihood methods

53
2 methods
54
Method 1
OUTGROUP

The outgroup sequence should be closely related to the rest of the sequences, but there should also be significantly more difference between the outgroup and the rest of the sequences than among those sequences themselves
An outgroup that is too distant may lead to an incorrect tree, because the differences between the outgroup and the rest of the sequences become more random and complex
In choosing an outgroup, one assumes that the evolutionary history of the outgroup gene is the same as that of the rest of the sequences. If this assumption is incorrect (e.g., horizontal gene transfer has occurred), an incorrect analysis could result

55
Method 2
Statistical tools can root trees automatically (e.g. mid-point rooting)
This must involve assumptions. BEWARE!
56
METRIC DISTANCES between any two or three taxa (a, b, and c) have the following properties:
Property 1: $d(a, b) \ge 0$ (non-negativity)
Property 2: $d(a, b) = d(b, a)$ (symmetry)
Property 3: $d(a, b) = 0$ if and only if $a = b$ (distinctness)
and...
Property 4: $d(a, c) \le d(a, b) + d(b, c)$ (triangle inequality)
57
ULTRAMETRIC DISTANCES must satisfy the previous four conditions, plus
Property 5: $d(a, b) \le \max[d(a, c),\ d(b, c)]$
This implies that the two largest distances are equal, so that they define an isosceles triangle.
Similarity = Relationship if the distances are ultrametric!
If distances are ultrametric, then the sequences are evolving in a perfectly clock-like manner; they can therefore be used in UPGMA trees and for the most precise calculations of divergence dates.
58
ADDITIVE DISTANCES
Property 6: $d(a, b) + d(c, d) \le \max[d(a, c) + d(b, d),\ d(a, d) + d(b, c)]$
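
Properties 5 and 6 can be checked mechanically. The sketch below tests the three-point (ultrametric) and four-point (additive) conditions on a distance table; the helper names and both example tables are illustrative assumptions, not part of the slides.

```python
from itertools import combinations


def dist_between(table, a, b):
    """Look up a symmetric distance stored under an unordered pair."""
    return 0.0 if a == b else table[frozenset((a, b))]


def is_ultrametric(taxa, table, tol=1e-9):
    """Three-point condition: in every triple the two largest distances are equal."""
    for a, b, c in combinations(taxa, 3):
        trio = sorted((dist_between(table, a, b),
                       dist_between(table, a, c),
                       dist_between(table, b, c)))
        if abs(trio[2] - trio[1]) > tol:
            return False
    return True


def is_additive(taxa, table, tol=1e-9):
    """Four-point condition: of the three pairwise sums, the two largest are equal."""
    for a, b, c, e in combinations(taxa, 4):
        sums = sorted((
            dist_between(table, a, b) + dist_between(table, c, e),
            dist_between(table, a, c) + dist_between(table, b, e),
            dist_between(table, a, e) + dist_between(table, b, c),
        ))
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


def make_table(pairs):
    return {frozenset(k): v for k, v in pairs.items()}


taxa = ["A", "B", "C", "D"]
# Illustrative tables: the first is additive but not clock-like, the second is ultrametric.
additive_only = make_table({("A", "B"): 3, ("A", "C"): 7, ("A", "D"): 8,
                            ("B", "C"): 6, ("B", "D"): 7, ("C", "D"): 3})
clock_like = make_table({("A", "B"): 2, ("A", "C"): 6, ("A", "D"): 10,
                         ("B", "C"): 6, ("B", "D"): 10, ("C", "D"): 10})

print(is_additive(taxa, additive_only), is_ultrametric(taxa, additive_only))
print(is_additive(taxa, clock_like), is_ultrametric(taxa, clock_like))
```

Every ultrametric table is also additive, but not the other way round, which is why additive trees are described as a generalization of ultrametric trees.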
59
METHODS OF PHYLOGENETIC ANALYSIS
(Phenetic vs. cladistic; phenograms vs. cladograms)

Character-based methods

Maximum parsimony method: a multiple sequence alignment is produced in order to predict which sequence positions are likely to correspond. These positions will appear in vertical columns in the multiple sequence alignment. For each aligned position, phylogenetic trees that require the smallest number of evolutionary changes to produce the observed sequence changes are identified. This analysis is continued for every position in the sequence alignment. Finally, those trees which produce the smallest number of changes overall for all sequence positions are identified.

Maximum likelihood method: like the maximum parsimony method, the maximum likelihood method depends upon first obtaining a reliable multiple sequence alignment and then examining the changes in each column in the alignment. In this case, however, the likelihood of a particular tree is calculated using an expected model of change in the sequences (Swofford and Olsen 1990). For example, in the Jukes-Cantor model all nucleotides are assumed to be equally frequent, and the probability of change of any nucleotide to any other nucleotide is assumed to be the same. For each possible tree, the likelihood of finding the actual sequence changes at each column in the aligned sequences is calculated. The probabilities for each aligned position are then multiplied to provide a likelihood for each tree. The tree which provides the maximum likelihood value is the most probable tree.

Distance-based methods: all possible pairs of sequences are aligned to determine which pairs are the most similar or closely related. These alignments provide a measure of the genetic distance between the sequences. These distance measurements are then used to predict the evolutionary relationship.

Derive trees that optimize the distribution of the data patterns for each character (not-fixed distances)
Compute pairwise distances according to some measure, then discard the actual data (fixed distances)
60
(No Transcript)
61

[Flowchart]
Choose set of related sequences
Obtain multiple sequence alignment
Is there strong sequence similarity? Yes: maximum parsimony methods
No: is there clearly recognizable sequence similarity? Yes: distance methods
No: maximum likelihood method
Analyze how well data support prediction
62
Character-based

maximum parsimony method
Find tree which minimizes number of changes
needed to explain data
A multiple sequence alignment is produced in
order to predict which sequence positions are
likely to correspond.
These positions will appear in vertical columns
in the multiple sequence alignment.
For each aligned position, phylogenetic trees
that require the smallest number of evolutionary
changes to produce the observed sequence changes
are identified.
This analysis is continued for every position in
the sequence alignment.
Finally, those trees which produce the smallest
number of changes overall for all sequence
positions are identified.

63
Character-based
A subset of all possible trees is examined. The
most parsimonious tree is the one that requires
the fewest evolutionary changes for all
sequences to derive from a common ancestor
(minimum evolution)

Consider four sequences ATCG, TTCG, ATCC, and
TCCG
Imagine a tree that branches at the first
position, grouping ATCG and ATCC on one branch,
TTCG and TCCG on the other branch.
Then each branch splits, for a total of 3 nodes
on the tree (Tree 1)

Compare Tree 1 with one that first divides ATCC onto its own branch, then splits off ATCG, and finally divides TTCG from TCCG (Tree 2).
Trees 1 and 2 both have three nodes, but when all of the distances back to the root (number of nodes crossed) are summed, the total is equal to 8 for Tree 1 and 9 for Tree 2.

[Figure: Tree 1 and Tree 2]
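
A common way to count the changes a tree requires is Fitch's small-parsimony algorithm. The sketch below applies it to the four sequences above; note that it counts substitutions, which is a different quantity from the node-crossing totals (8 and 9) quoted on the slide. The labels s1 to s4, the tuple encoding of the trees, and the third tree are assumptions made for illustration.

```python
def fitch_site(tree, states):
    """Fitch's algorithm for one alignment column.

    tree: a leaf name (str) or a 2-tuple of subtrees.
    states: {leaf name: character at this column}.
    Returns (candidate ancestral states, minimum number of changes).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # No shared state: one change is needed on a branch below this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the per-column Fitch costs over the whole alignment."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for site in range(n_sites):
        column = {name: seq[site] for name, seq in alignment.items()}
        total += fitch_site(tree, column)[1]
    return total


alignment = {"s1": "ATCG", "s2": "TTCG", "s3": "ATCC", "s4": "TCCG"}
tree1 = (("s1", "s3"), ("s2", "s4"))      # ATCG with ATCC, TTCG with TCCG
tree2 = ("s3", ("s1", ("s2", "s4")))      # ATCC splits first, then ATCG
tree3 = (("s1", "s2"), ("s3", "s4"))      # a third grouping, for comparison

for name, tree in (("Tree 1", tree1), ("Tree 2", tree2), ("Tree 3", tree3)):
    print(name, parsimony_score(tree, alignment))
```

Under this count Tree 1 and Tree 2 both need 3 substitutions, while the alternative grouping needs 4, so substitution counting alone does not separate the first two trees.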
64
How do you search through all trees?
Enumerate all trees (too many)
Can use techniques to try to limit the search space (e.g., branch and bound)
Or use heuristics (many possibilities), e.g., nearest neighbor interchange: start with a tree and consider neighboring trees. If any neighboring tree has fewer changes, take it as the current tree. Stop when there are no improvements.
65
Character-based

Informative sites
RULES

4 taxa: three unrooted trees
Some sites are informative, some not
Only informative sites need to be analyzed

Cost of change?
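
A site is usually called parsimony-informative when at least two different character states each occur in at least two taxa; the slide does not spell out its rules, so this definition is an assumption. A minimal sketch, using a small four-sequence alignment as illustrative data:

```python
from collections import Counter


def is_informative(column):
    """True if at least two states each appear in at least two taxa."""
    counts = Counter(column)
    return sum(1 for n in counts.values() if n >= 2) >= 2


def informative_sites(sequences):
    """Return the 0-based indices of parsimony-informative columns."""
    return [i for i, column in enumerate(zip(*sequences)) if is_informative(column)]


# Illustrative alignment of four taxa.
sequences = [
    "ACGCGTTGGG",
    "ACGCGTTGGG",
    "ACGCAATGAA",
    "ACACAGGGAA",
]
print(informative_sites(sequences))
```

Constant columns and columns where only one taxon differs cost the same on every tree, so they cannot help choose among the three unrooted trees.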
66
Character-based
Maximum parsimony - scoring

Step matrices
Consistency Index (CI)
CI = (minimum possible tree length) / (actual tree length)
Codon position: variable weighting
Only mutations leading to amino acid changes are scored
67
Implementation of step matrices
- Character-state trees describing possible pathways explicitly assign a weight (cost) to a particular sort of change
- Parsimony methods will attempt to minimize the summed cost of all changes
- Summed cost = number of steps in a character (step = unit of cost)
- One of the key assumptions used in parsimony analysis is the assignment of relative weights or costs to each type of change, summarized in a cost or step matrix
The structure of the step matrix depends on the types of rules you think characters are evolving under.
In programs, you must choose (or the default chooses for you) a general step matrix
68
1. Unordered characters
Change from any state to any other is counted as one step (Fitch parsimony; Fitch, 1971). Used for nucleotide sequence data, but one may want transversions (TV) to have a higher cost.
[Diagram: states 0, 1, 2 and 3, each connected to every other]
2. Ordered characters
Number of steps from one state to another = difference between state numbers (Wagner parsimony; Farris, 1970). Steps = lines in the path (e.g. morphological characters on a continuum).
0-1-2-3-4
3. Irreversible characters
Number of steps between states = difference between state numbers, where decreases in state number do not occur (Camin-Sokal parsimony; Camin and Sokal, 1965). Multiple gains allowed, no losses.
0→1→2→3→4
69
Unordered
     0  1  2  3
0    0  1  1  1
1    1  0  1  1
2    1  1  0  1
3    1  1  1  0

Ordered
     0  1  2  3
0    0  1  2  3
1    1  0  1  2
2    2  1  0  1
3    3  2  1  0

Irreversible
     0  1  2  3
0    0  1  2  3
1    ∞  0  1  2
2    ∞  ∞  0  1
3    ∞  ∞  ∞  0

Can elaborate on step matrices for any number of transformation or weighting schemes.
A common one is weighting transversions more heavily in molecular data:
     A  C  G  T
A    -  2  1  2
C    2  -  2  1
G    1  2  -  2
T    2  1  2  -
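
To show how a step matrix enters the score, here is a minimal sketch of weighted parsimony for one site using Sankoff's dynamic-programming algorithm with the transversion-weighted matrix above (the "-" diagonal is taken as cost 0). The tree encoding, leaf names and example column are assumptions made for illustration.

```python
INF = float("inf")
BASES = "ACGT"

# Step matrix from the slide: transitions cost 1, transversions cost 2.
COST = {
    "A": {"A": 0, "C": 2, "G": 1, "T": 2},
    "C": {"A": 2, "C": 0, "G": 2, "T": 1},
    "G": {"A": 1, "C": 2, "G": 0, "T": 2},
    "T": {"A": 2, "C": 1, "G": 2, "T": 0},
}


def sankoff(tree, states):
    """Return {base: minimum cost of the subtree if its root carries that base}."""
    if isinstance(tree, str):
        # A leaf can only carry its observed base.
        return {b: (0 if b == states[tree] else INF) for b in BASES}
    child_costs = [sankoff(child, states) for child in tree]
    return {
        parent: sum(
            min(COST[parent][b] + costs[b] for b in BASES)
            for costs in child_costs
        )
        for parent in BASES
    }


def weighted_score(tree, states):
    """Minimum summed cost of all changes for one site on the given tree."""
    return min(sankoff(tree, states).values())


tree = (("a", "b"), ("c", "d"))
column = {"a": "T", "b": "T", "c": "A", "d": "G"}   # illustrative site
print(weighted_score(tree, column))
```

For this column the minimum is 3: one transition between A and G (cost 1) plus one transversion to T (cost 2), whereas an unweighted count would give 2 steps.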
70
Character-based
PAUP (Phylogenetic Analysis Using Parsimony) - GCG
http://evolution.genetics.washington.edu/phylip/software.pars.html#PAUP
(no web interface)
MacClade: Macintosh program; contains many tools for entering and editing data, producing trees, and giving diagnostic feedback
71
Character-based
Maximum parsimony
Provides misleading information when rates of sequence change differ among the branches of the tree represented by the sequence data
[Figure: real tree vs. predicted tree for Taxa 1 to 4, with bases a and g on the branches. If rates of change are assumed to be equal, the predicted tree is incorrect.]
72
Character-based
Maximum parsimony
Provides misleading information when rates of sequence change differ among the branches of the tree represented by the sequence data
[Figure: real vs. predicted tree for Taxa 1 to 4; the predicted tree is incorrect]

Approaches to solve the problem
Break up long branches by adding taxa closely related to the taxa in question
Lake's method (PAUP)

Only transversions are scored: (A, G) <-> (C, T)
Transversions are assumed to occur at a constant rate
Also, independent of position

[Figure: two four-taxon trees (A and B) for Taxa 1 to 4, showing bases a, a, a, g at one position and c, c, c, c at another position. Evolutionary change or by chance?]
73
Minimum evolution (ME) methods

Optimality criterion The tree(s) with the
shortest sum of the branch lengths (or overall
tree length) is chosen as the best tree.
Advantages
Can be used on indirectly-measured distances
(immunological, hybridization).
Distances can be corrected for unseen events.
Usually faster than character-based methods.
Can be used for some rate analyses.
Has an objective function (as compared to
clustering methods).
Disadvantages
Information lost when characters transformed to
distances.
Slower than clustering methods.

74
Character-based
Maximum Likelihood (ML)

The term Maximum Likelihood does not refer to a single statistical method, but rather to a general approach.
ML methods take what has been described as an "inside out" approach. In their simplest form, they begin by listing all possible models, and then calculate the probability that each model would generate the data actually observed.
The model with the highest probability of generating the observed data is chosen as the best model.
Joe Felsenstein's application of ML to phylogeny is implemented in DNAML in the PHYLIP package, and in a modified version of DNAML called fastDNAml, written by Gary Olsen.

- Uses an explicit model of evolution, therefore more diverse sequences may be analyzed
- Uses probability calculations to find a tree; similar to the parsimony method in that the analysis is performed on each column of the multiple alignment
- All possible trees are considered
- Trees with the least number of changes are considered


75
The Maximum Likelihood approach resembles the MP method but presents an additional opportunity to evaluate trees with variations in mutation rates (Jukes-Cantor and Kimura models)
76
Sequence a   A C G C G T T G G G
Sequence b   A C G C G T T G G G
Sequence c   A C G C A A T G A A
Sequence d   A C A C A G G G A A

[Figure: unrooted tree (one of three) joining A, B, C and D, and a rooted tree (one of five) with leaves a, b, c, d carrying bases T, T, A, G and branch likelihoods L0 to L6]
L = likelihood values for the probability
Consider every possible base assignment: 64 in total for the three node positions

Rooted tree with base assignments (leaves T, T, A, G; internal nodes T, G, T)
Transition: $2 \times 10^{-6}$; transversion: $10^{-6}$
$L = L_0 \times L_1 \times L_2 \times L_3 \times L_4 \times L_5 \times L_6 = 0.25 \times 1 \times 2 \times 10^{-6} \times 1 \times 1 \times 1 \times 10^{-6} = 5 \times 10^{-13}$
Next tree and so on: $L(\text{Tree}) = L(\text{Tree}_1) + L(\text{Tree}_2) + \dots$
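
The per-site calculation can be sketched as follows, under the slide's simplified numbers: probability 0.25 for the root base, 1 for a branch with no change, 2 x 10^-6 for a transition and 10^-6 for a transversion. The rooted topology ((a, b), (c, d)) and the function names are assumptions made for illustration; a real ML program would derive branch probabilities from a substitution model and branch lengths.

```python
import math
from itertools import product

BASES = "ACGT"
PURINES = {"A", "G"}


def branch_prob(parent, child, transition=2e-6, transversion=1e-6):
    """Simplified probability of seeing `child` below `parent` on one branch."""
    if parent == child:
        return 1.0
    same_class = (parent in PURINES) == (child in PURINES)
    return transition if same_class else transversion


def assignment_likelihood(root, left, right, leaves):
    """Likelihood of one assignment of bases to the three internal nodes."""
    a, b, c, d = leaves
    return (0.25
            * branch_prob(root, left) * branch_prob(root, right)
            * branch_prob(left, a) * branch_prob(left, b)
            * branch_prob(right, c) * branch_prob(right, d))


def site_likelihood(leaves):
    """Sum over all 4^3 = 64 assignments of bases to the internal nodes."""
    return sum(
        assignment_likelihood(root, left, right, leaves)
        for root, left, right in product(BASES, repeat=3)
    )


leaves = "TTAG"   # bases of sequences a, b, c, d at the column shown
one_assignment = assignment_likelihood("T", "T", "G", leaves)
print(one_assignment, site_likelihood(leaves))
```

The single assignment (root T, left node T, right node G) reproduces the 5 x 10^-13 value on the slide; the site likelihood adds the contributions of the other 63 assignments.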
77
Maximum likelihood (ML) methods
Optimality criterion ML methods evaluate
phylogenetic hypotheses in terms of the
probability that a proposed model of the
evolutionary process and the proposed unrooted
tree would give rise to the observed data. The
tree found to have the highest ML value
is considered to be the preferred tree.

Advantages
Are inherently statistical and evolutionary
model-based.
Usually the most consistent of the methods
available.
Can be used for character (can infer the exact
substitutions) and rate analysis.
Can be used to infer the sequences of the
extinct (hypothetical) ancestors.
Can help account for branch-length effects in
unbalanced trees.
Can be applied to nucleotide or amino acid
sequences, and other types of data.
Disadvantages
Are not as simple and intuitive as many other
methods.
Are computationally very intense (limits number of taxa and length of sequence).
Like parsimony, can be fooled by high levels of
homoplasy.
Violations of the assumed model can lead to
incorrect trees.

78
Distance-based methods
DISTANCES TREE

Tree is constructed using distances between
species (number of mutations, time, other
distance measures)
Neighbors: sequence pairs with the smallest number of changes
Trees are rooted, i.e. sequences share a common ancestor
First step is producing MSAs (e.g. CLUSTALW)

DNA
Distance matrix is created
Relatively simple for pairs of homologous
sequences that can be aligned
without large insertions, deletions etc.
Proteins
Matrices, such as PAM are used
Multiple substitutions at one site is always a
problem

Distance method applications
CLUSTALW
PAUP
PHYLIP

Methods of phylogenetic tree estimation
[Diagram: infile -> DNADIST / PROTDIST -> outfile with a distance table (distance matrices) -> FITCH / KITSCH / NEIGHBOR]
DNADIST: distances among nucleic acid sequences
PROTDIST: distances for amino acid sequences
FITCH: Fitch-Margoliash method, does not assume a molecular clock
KITSCH: Fitch-Margoliash method, assumes a molecular clock
NEIGHBOR: neighbor-joining (does not assume a molecular clock, unrooted tree) or unweighted pair group method (UPGMA)

Distance score: the score between two sequences, representing the number of mismatched positions in the alignment (the number of positions that would have to be changed)
Distance methods will be successful if the distances between the sequences can be made additive on a predicted tree.
80
DISTANCE-BASED
Sequence A   xxxxxxxxxxxxxxxxxxxxxx
Sequence B   xxxxxxxxxxxxxxxxxxxxxx
Sequence C   xxxxxxxxxxxxxxxxxxxxxx
Sequence D   xxxxxxxxxxxxxxxxxxxxxx
Distances: the number of steps required to change one sequence into another
Distance table: $n_{AB} = 3$, $n_{AC} = 7$, $n_{AD} = 8$, $n_{BC} = 6$, $n_{BD} = 7$, $n_{CD} = 3$
[Figure: phylogenetic tree]
$d_{AB} + d_{CD} < d_{AC} + d_{BD} = d_{AD} + d_{BC}$
Principle of additivity for this tree
Each change occurs once?
81
Additive Trees

Generalization of ultrametric trees
In ultrametric trees, the number of mutations was assumed to be proportional to the temporal distance of a node from its ancestor
It was also assumed that mutations took place at the same rate in all branches
Additive trees model different rates of mutation along different branches

82
Fitch-Margoliash method
DISTANCE-BASED

Draw unrooted tree
Calculate the length of tree branches
algebraically

Branches a, b, c connect A, B, C to the central node
From A to B: a + b = 22 (1)
From A to C: a + c = 39 (2)
From B to C: b + c = 41 (3)
Subtract (3) from (2): a - b = -2 (4)
Add (1) and (4): 2a = 20, a = 10
From (1) and (2): b = 12, c = 29
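The algebra on this slide has a closed form: each branch is half of (the two distances that involve its tip, minus the distance that does not). A minimal sketch, with a hypothetical function name, using the slide's distances 22, 39 and 41:

```python
def three_taxon_branches(d_ab, d_ac, d_bc):
    """Branch lengths a, b, c of the unrooted three-taxon tree.

    Solves a + b = d_ab, a + c = d_ac, b + c = d_bc.
    """
    a = (d_ab + d_ac - d_bc) / 2
    b = (d_ab + d_bc - d_ac) / 2
    c = (d_ac + d_bc - d_ab) / 2
    return a, b, c


a, b, c = three_taxon_branches(22, 39, 41)
print(a, b, c)
```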
83
Fitch-Margoliash method
DISTANCE BASED

Uses distance table
Calculates the length of tree branches
algebraically
Draws unrooted tree

for n-sequences

Simple extension of 3-sequence method
Closest sequence pair is chosen
The rest of the sequences are agglomerated
Distance between the pair is computed
The matrix is recomputed with the sequence pair
combined into single node
Process is repeated until all sequences are
combined

84
Fitch-Margoliash method
DISTANCE -BASED

Advantages
tests more than one tree
still pretty fast
can use empirical substitution scoring methods
global optimization of tree by statistical
criteria
Disadvantages
Requires longer execution time than Neighbor
Joining, but still quite practical on most
computers, for typical datasets.
does not consider intermediate ancestors, meaning
that there is no requirement for an
internally-consistent evolutionary model
misses homoplasies, especially over long
distances; long evolutionary distances will be
underestimated.

85
The Neighbor-joining method
DISTANCE -BASED
Similar to Fitch-Margoliash
Choice of which sequences to pair is determined by a different
algorithm
Pairs sequences based on the effect of pairing on the sum of the
branch lengths of the tree

The distances between the sequences are used to
calculate the sum of branch lengths
in a star-like tree
Decompose/modify the tree by combining pairs of
sequences
The sum of the branch lengths of a new tree is
calculated
A new distance table is made by combining A with
B (composite sequence)
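One common way to express the pairing step is the Q criterion: for n sequences, Q(i, j) = (n - 2) d(i, j) - r(i) - r(j), where r(i) is the sum of distances from i to all other sequences, and the pair with the smallest Q is joined. The sketch below only picks the first pair to join; it is not the NEIGHBOR program, and the function name is hypothetical. The distances are the four-sequence table from the earlier distance-based slide.

```python
from itertools import combinations


def nj_pick_pair(names, d):
    """Return the pair of names with the smallest neighbor-joining Q value.

    d maps frozenset({x, y}) to the distance between x and y.
    """
    n = len(names)
    # r[i]: sum of distances from i to every other sequence
    r = {i: sum(d[frozenset((i, k))] for k in names if k != i) for i in names}

    def q(pair):
        i, j = pair
        return (n - 2) * d[frozenset(pair)] - r[i] - r[j]

    # min() returns the first pair encountered among ties
    return min(combinations(names, 2), key=q)


pairwise = {"AB": 3, "AC": 7, "AD": 8, "BC": 6, "BD": 7, "CD": 3}
d = {frozenset(pair): value for pair, value in pairwise.items()}
first_pair = nj_pick_pair(list("ABCD"), d)
print(first_pair)
```

For this table the pairs A-B and C-D tie with the lowest Q, which is expected because they are the two cherries of the same four-taxon tree.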

86
The Neighbor-joining method

Advantages
fastest tree building method
can use empirical substitution scoring methods
not influenced by variations in the rates of
change along the branches of the tree
Disadvantages
tests only a single tree
does not consider intermediate ancestors, meaning
that there is no requirement for an
internally-consistent evolutionary model
misses homoplasies, especially over long
distances; long evolutionary distances will be
underestimated.

87
DISTANCE -BASED
Unweighted pair group method with arithmetic mean
(UPGMA)

The rate of change along the branches of the tree
is constant
Distances are approximately ultrametric

Simplest method
Can lead to wrong tree, if the rates of mutations
are not uniform

88
Distance Matrix
89

dAB is the smallest distance
Group A and B
Branch length = dAB/2 (here we assume the evolution rate
is constant)
Recalculate distances from AB to other taxa as the
average:
d(AB)C = (dAC + dBC)/2

[Tree figure: A and B joined by branches of length 0.15/2 each]
90

new distance matrix
Find smallest distance and continue as before
Repeat until all taxa are on tree

[Tree figure: A and B joined at height dAB/2, then C at d(AB)C/2, then D at d(ABC)D/2]
http://www.icp.ucl.ac.be/opperd/private/upgma.html
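The UPGMA steps on the last two slides can be sketched as below. This is an illustrative implementation, not the one in PHYLIP or GrowTree: the function name and the example distances are invented, and the averaging is weighted by cluster size, which reduces to (dAC + dBC)/2 when A and B are single taxa.

```python
from itertools import combinations


def upgma(names, pair_distances):
    """Build a rooted tree by UPGMA and return it as a Newick string.

    pair_distances maps (name1, name2) tuples to distances.
    """
    d = {frozenset(pair): value for pair, value in pair_distances.items()}
    # cluster label -> (number of taxa, height of the cluster's node)
    clusters = {name: (1, 0.0) for name in names}
    while len(clusters) > 1:
        # find the closest pair of current clusters
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        height = d[frozenset((a, b))] / 2  # constant rate: node sits halfway
        size_a, h_a = clusters.pop(a)
        size_b, h_b = clusters.pop(b)
        merged = f"({a}:{height - h_a:g},{b}:{height - h_b:g})"
        # average distance from the merged cluster to every other cluster
        for c in clusters:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = (size_a + size_b, height)
    return next(iter(clusters)) + ";"


# Invented ultrametric distances for four taxa.
example = {
    ("A", "B"): 2, ("A", "C"): 4, ("B", "C"): 4,
    ("A", "D"): 6, ("B", "D"): 6, ("C", "D"): 6,
}
tree = upgma(["A", "B", "C", "D"], example)
print(tree)
```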
91
Clustering methods (UPGMA and N-J)

Optimality criterion: NONE. The algorithm
itself builds the tree.
Advantages
Can be used on indirectly-measured distances
(immunological, hybridization).
Distances can be corrected for unseen events.
The fastest of the methods available (N-J is
screamingly fast!).
Can therefore analyze very large datasets
quickly (needed for HIV, etc.).
Can be used for some types of rate and date
analysis.
Disadvantages
Similarity and relationship are not necessarily
the same thing, so clustering by similarity does
not necessarily give an evolutionary tree.
Cannot be used for character analysis!
Have no explicit optimization criteria, so one
cannot even know if the program worked properly
to find the correct tree for the method.

92
GrowTree
When you run GrowTree, SeqWeb seamlessly links together these
programs (in the order given) to perform the analysis:
1. PileUp
2. Distances
3. GrowTree
For alignment, a simplification of the progressive alignment
method of Feng and Doolittle (1987) is used (clusters are created).
GrowTree reconstructs a phylogenetic tree from a distance matrix
such as the one created by Distances. Two methods are available
for reconstructing the tree: UPGMA (unweighted pair group method
using arithmetic averages; assumes the same rate of evolution)
and neighbor-joining.
#NEXUS
[Trees from file: hum_gtr.distances]
begin trees;
utree Tree_1 = ((('Gtr1_Human':18.43,'Gtr3_Human':30.18):4.34,'Gtr4_Human':24.87):3.19,('Gtr2_Human':35.98,'Gtr5_Human':74.88):3.19):0.00;
endblock;
93
The NEXUS file is your actual ToL-MacClade data
file. It is the file you edit when working with
ToL-MacClade, and it is the file you write when
choosing Save from ToL-MacClade's File menu. The
function of the NEXUS file is to store the
information necessary to build your tree.
ToL-MacClade uses a special format, the NEXUS
format, to store your list of taxa, your
phylogenetic tree, and the information you
entered in the various windows and boxes. The
NEXUS format has been created to allow for
compatibility between a number of different
programs for phylogenetic analysis. You will be
able to view and edit NEXUS files in ToL-MacClade
or in a word processor.
GCG/SEQWEB
94
New Hampshire (Newick) Format
Human Mouse Drosophila Honey bee Fern Wheat Pine
(((Human, Mouse), (Dros, Bee)), (Fern, (Wheat,
Pine)));
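A Newick string must have balanced parentheses and end with a semicolon. The sketch below checks both and lists the leaf names; it assumes a topology-only string (no branch lengths or quoted labels), and the function name is hypothetical.

```python
import re


def newick_leaves(newick):
    """Validate a topology-only Newick string and return its leaf names."""
    text = newick.strip()
    if not text.endswith(";"):
        raise ValueError("Newick string must end with ';'")
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("closing parenthesis without an opening one")
    if depth != 0:
        raise ValueError("unbalanced parentheses")
    # every run of characters that is not punctuation or space is a label
    return re.findall(r"[^\s(),;:]+", text)


tree_text = "(((Human, Mouse), (Dros, Bee)), (Fern, (Wheat, Pine)));"
leaves = newick_leaves(tree_text)
print(leaves)
```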
95
Q08832 P25123 P34903 P18505 o14764 P78334 Q99928
o05591 P24046 p23415
96
UPGMA
Neighbor joining
Kimura distance
Uncorrected distance
97
PAUP (Phylogenetic Analysis Using Parsimony): GCG
version 10 has an option to perform phylogenetic
analysis using distance methods
PHYLIP (Phylogeny Inference Package)
http://evolution.genetics.washington.edu/phylip/phylipweb.html
FITCH: estimates a phylogenetic tree assuming additivity of
branch lengths using the Fitch-Margoliash method (no
molecular clock; rates of evolution along branches can vary)
KITSCH: the same, but with a molecular clock
NEIGHBOR: neighbor-joining method or unweighted pair group
method with arithmetic mean (UPGMA)
98
Maximum likelihood method
Computationally intensive and time consuming
PHYLIP (Phylogeny Inference Package)
http://bioweb.pasteur.fr/seqanal/phylogeny/phylip-uk.html
DNAML: allows for variable frequencies of the 4
nucleotides and for unequal rates
of transitions and transversions
DNAMLK: the same, but a molecular clock is
taken into account
99

There are several phylogenetics servers available
on the Web
some of these will change or disappear in the
near future
these programs can be very slow so keep your
sample sets small

The Institut Pasteur, Paris has a PHYLIP server
at
http://bioweb.pasteur.fr/seqanal/phylogeny/phylip-uk.html
The Belozersky Institute at Moscow State
University has its own
"GeneBee" phylogenetics server
http://www.genebee.msu.su/services/phtree_reduced.html
The Phylodendron website is a tree drawing
program with a nice user
interface and a lot of options;
however, the output is limited to GIFs at
72 dpi, which is not publication quality.
http://iubio.bio.indiana.edu/treeapp/treeprint-form.html

the most important factor is the quality of the input
data
use each of the three methods and compare trees
different results can be obtained depending on the
order in which sequences are
in the input file (jumble option)

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=search&DB=taxonomy
http://phylogeny.arizona.edu/tree/phylogeny.html
http://evolution.genetics.washington.edu/phylip/software.html#Plotting
100

Introduction to Phylogenetic Systematics,
Peter H. Weston and Michael D. Crisp, Society of
Australian Systematic Biologists
http://www.science.uts.edu.au/sasb/WestonCrisp.html
University of California, Berkeley Museum of
Paleontology (UCMP)
http://www.ucmp.berkeley.edu/clad/clad4.html
Format conversion
http://www.swbic.org/products/bioinfo/transform/transform_help.php

101
Alignment in FASTA
http://prodes.toulouse.inra.fr/multalin/multalin.html
Alignment in a FASTA format
http://bioweb.pasteur.fr/seqanal/phylogeny/phylip-uk.html
Distances from protein sequences (protdist).
http://bioweb.pasteur.fr/seqanal/interfaces/protdist-simple.html
Tree from the same file
Outfile Neighbor Neighbor joining
protpars
Alignment in FASTA
http://prodes.toulouse.inra.fr/multalin/multalin.html
READSEQ in PHYLIP format
http://bioweb.pasteur.fr/seqanal/interfaces/readseq-simple.html
Parsimony
http://bioweb.pasteur.fr/seqanal/phylogeny/phylip-uk.html
http://www.med.nyu.edu/rcr/nccu/phylogen-ex.txt
102
Evaluating Phylogenies
Non-random sequence order might introduce a bias
into the dataset.
We have no way of knowing whether the tree
inferred from the data is truly representative of
the evolutionary history of the gene family.
103
Evaluating Phylogenies

1. Jumbling sequence addition order
Most methods for phylogeny construction are
sensitive to the order in which sequences are
added to the tree. Consequently, the simplest
way to test a phylogeny is to repeat the analysis
several times with different addition orders.
Most PHYLIP tree-building programs, and many other phylogeny
programs, have an option called JUMBLE that uses
a random number generator to choose which
sequence to add at each step, rather than adding
them in the order in which they appear in the
file. The user is asked to supply a random number
to use as a "seed" in generating a random number
chain.
Therefore, even when doing only one run on a
phylogeny, it is probably a good idea to jumble
the order of sequences.

2. Bootstrap and jackknife replicates
Assumption: the statistical properties of a
sample should be similar to the statistical
properties of the population from which that
sample was drawn. The larger the sample, the
more representative it should be of the
population. Conversely, if the original sample
was large enough, it should also be possible to
take smaller samples from the larger sample, and
expect that the smaller samples would also retain
most of the statistical properties of the
original population.
104

Phylogenies: if we create smaller alignments
containing only some of the positions from the
total alignment, and use these mini-alignments to
construct a tree, we should still get the same
tree each time.
If we get the same tree each time the data is
sampled, then we are strongly confident that all
the data is consistent with the tree.
If we get a different tree with each sample, then
no tree is strongly supported by the data.
Jackknife resampling has the drawback that the
subreplicates are of a smaller size than the
original dataset, which may change the
statistical properties of the samples. For that
reason, jackknife resampling has largely been
replaced by bootstrap resampling.
Bootstrap resampling is sampling with
replacement. In the case of a multiple sequence
alignment, sites are sampled at random until the
dataset is equal in length to the original
alignment.
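A minimal sketch of bootstrap resampling of alignment columns, assuming the alignment is held as a dict of equal-length strings. The function name, the toy alignment and the seed are illustrative; this is not PHYLIP's SEQBOOT.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Sample columns with replacement until the original length is reached."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    # the same column choices are applied to every sequence
    return {
        name: "".join(seq[c] for c in columns)
        for name, seq in alignment.items()
    }


alignment = {"A": "ACGTACGT", "B": "ACGTTCGT", "C": "ACGAACGA"}
rng = random.Random(42)  # fixed seed so the run is repeatable
replicates = [bootstrap_alignment(alignment, rng) for _ in range(100)]
print(replicates[0])
```

Each replicate keeps the columns intact across sequences, so some sites appear more than once and others not at all.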

105
(No Transcript)
106
Assessing Reliability Bootstrap
107
(No Transcript)
108
For bootstrap resampling of a sequence alignment,
it is best to create at least 100 bootstrapped
datasets, and redo the phylogeny for each one.
A consensus tree can then be built which
indicates, for each branch in the tree, how often
it occurs in the population of replicate samples.
Certain positions are overrepresented in each replicate,
while others are underrepresented. However, with
enough replicates, all sites will on average be weighted
equally.
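The per-branch counting behind a consensus tree can be sketched as follows. As a simplifying assumption, replicate trees are rooted and written as nested tuples of leaf names, and support is counted per clade (the set of leaves under a node) rather than per unrooted bipartition. All names and the four example trees are invented.

```python
from collections import Counter


def clades(tree):
    """Return (leaves under this node, set of all clades in the subtree)."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves, found = frozenset(), set()
    for child in tree:
        child_leaves, child_found = clades(child)
        leaves = leaves | child_leaves
        found |= child_found
    found.add(leaves)
    return leaves, found


def clade_support(replicate_trees):
    """Percentage of replicate trees in which each clade occurs."""
    counts = Counter()
    for tree in replicate_trees:
        counts.update(clades(tree)[1])
    total = len(replicate_trees)
    return {clade: 100.0 * n / total for clade, n in counts.items()}


trees = [
    (("A", "B"), ("C", "D")),
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
    (("A", "B"), ("C", "D")),
]
support = clade_support(trees)
print(support[frozenset({"A", "B"})])
```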
109
Simulations have shown that "bootstrap values
greater than 70% correspond to a probability
greater than 95%"
110
PROBLEM: The disadvantage of bootstrap
resampling is that it drastically increases the
time required to construct a phylogeny.
Where large numbers of sequences must be used, it is
only practical with distance methods
111
Are there Correct trees??
112
Are there Correct trees??

Despite all of these caveats, it is actually
quite simple to use computer programs to calculate
phylogenetic trees for data sets.
Provided the data are clean, outgroups are
correctly specified, appropriate algorithms are
chosen, no assumptions are violated, etc., can
the true, correct tree be found and proven to be
scientifically valid?
Unfortunately, it is impossible to ever
conclusively state what is the "true" tree for a
group of sequences (or a group of organisms);
taxonomy is constantly under revision as new data
are gathered.

113
Some simple practical considerations

The most important factor is not the method but
the quality of the input data
Use each of the three methods (distance, parsimony,
maximum likelihood) and compare trees for consistency
(though consistency does not mean that the result is
statistically significant)
The choice of outgroup taxa can have as much
influence on the analysis as the choice of ingroup taxa
Different answers can be obtained depending on
the order in which sequences are in the input file
(jumble option); put problematic sequences at the end

114
Application of Phylogeny
Understanding history of life
Understanding rapidly mutating viruses (like HIV)
Help to predict protein/RNA structure
Help to do multiple sequence alignment
Explaining and predicting gene expression
Explaining and predicting ligands
Help to design enhanced organisms
Help to design drugs
115
>Clostridium_perfringens
MKGIYSALLVSFDKDGNINEKGLREIIRHNIDVCKIDGLYVGGSTGENFMLSTDEKKRIFEIAMDEAKGQ
VKLIAQVGSVNLKEAVELAKFTTDLGYDAISAVTPFYYKFDFNEIKHYYETIINSVDNKLIIYSIPFLTG
VNMSIEQFAELFENDKIIGVKFTAADFYLLERMRKAFPDKLIFAGFDEMMLPATVLGVDGAIGSTFNVNG
VRARQIFEAAQKGDIETALEVQHVTNDLITDILNNGLYQTIKLILQEQGVDAGYCRQPMKEATEEMIAKA
KEINKKYF
>Mus_musculus
MAFPKKKLRGLVAATITPMTENGEINFPVIGQYVDYLVKEQGVKNIFVNGTTGEGLSLSVSERRQVAEEW
VNQGRNKLDQVVIHVGALNVKESQELAQHAAEIGADGIAVIAPFFFKSQNKDALISFLREVAAAAPTLPF
YYYHMPSMTGVKIRAEELLDGIQDKIPTFQGLKFTDTDLLDFGQCVDQNHQRQFALLFGVDEQLLSALVM
GATGAVGSTYNYLGKKTNQMLEAFEQKDLASALSYQFRIQRFINYVIKLGFGVSQTKAIMTLVSGIPMGP
PRLPLQKATQEFTAKAEAKLKSLDFLSSPSVKEGKPLASA
>Sinorhizobium_meliloti
MKLEGIYSALLTPFSEDESIDRQAIGALVDFQVRLGIDGVYVGGSSGEAMLQSLDERADYLSDVAAAASG
RLTLIAHVGTIATRDALRLSQHAAKSGYQAISAIPPFYYDFSRPEVMAHYRELADVSALPLIVYNFPART
SGFTLPELVELLSHPNIIGIKHTSSDMFQLERIRHAVPDAIVYNGYDEMCLAGFAMGAQGAIGTTYNFMG
DLFVALRDCAAAGRIEEARRLQAMANRVIQVLIKVGVMPGSKALLGIMGLPGGPSRRPFRKVEEADLAAL
REAVAPVLAWRESTSRKSM
>Bacillus_subtilis
MNFGNVSTAMITPFDNKGNVDFQKLSTLIDYLLKNGTDSLVVAGTTGESPTLSTEEKIALFEYTVKEVNG
RVPVIAGTGSNNTKDSIKLTKKAEEAGVDAVMLVTPYYNKPSQEGMYQHFKAIAAETSLPVMLYNVPGRT
VASLAPETTIRLAADIPNVVAIKEASGDLEAITKIIAETPEDFYVYSGDDALTLPILSVGGRGVVSVASH
IAGTDMQQMIKNYTNGQTANAALIHQKLLPIMKELFKAPNPAPVKTALQLRGLDVGSVRLPLVPLTEDER
LSLSSTISEL
>Escherichia_coli_O157
MATNLRGVMAALLTPFDQQQALDKASLRRLVQFNIQQGIDGLYVGGSTGEAFVQSLSEREQVLEIVAEEA
KGKIKLIAHVGCVSTAESQQLAASAKRYGFDAVSAVTPFYYPFSFEEHCDHYRAIIDSADGLPMVVYNIP
ALSGVKLTLDQINTLVTLPGVGALKQTSGDLYQMEQIRREHPDLVLYNGYDEIFASGLLAGADGGIGSTY
NIMGWRYQGIVKALKEGDIQTAQKLQTECNKVIDLLIKTGVFRGLKTVLHYMDVVSVPLCRKPFGPVDEK
YLPELKALAQQLMQERG
>Pasteurella_multocida
MKNLKGIFSALLVSFNADGSINEKGLRQIVRYNIDKMKVDGLYVGGSTGENFMLSTEEKKEIFRIAKDEA
KDEIALIAQVGSVNLQEAIELGKYATELGYDSLSAVTPFYYKFSFPEIKHYYDSIIEATGNYMIVYSIPF
LTGVNIGVEQFGELYKNPKVLGVKFTAGDFYLLERLKKAYPNHLIWAGFDEMMLPAASLGVDGAIGSTFN
VNGVRARQIFELTQAGKLKEALEIQHVTNDLIEGILANGLYLTIKELLKLDGVEAGYCREPMTKELSPEK
VAFAKELKAKYLS
>Yersinia_pestis
MKKLTGLIAAPHTPFDEQGEVNYPVIDQIAEHLINDGVKGVYVCGTTGEGIHCSVDERKKIAERWVNAAQ
GKLSITLHTGALSIKDAVDLSRHAETLDIFATSAIGPCFFKPGNLDDLIAYCQAIAAAAPSKGFYYYHSG
MSGVNLDMEQFLIKAESKIPNLSGIKFNNADLYEFQRCLRVSGGKFDIPFGVDEHLPGGLAVGAIGAVGS
TYNYAAPLFHKIIADFNAGDQVAVQRGMDHVIALIRVLVEFGGVAAGKAAMQLHGIDAGNPRLPLRALTK
EQKQTVVNRMRDAITLQ
>E._coli
MATNLRGVMAALLTPFDQQQALDKASLRRLVQFNIQQGIDGLYVGGSTGEAFVQSLSEREQVLEIVAEEA
KGKIKLIAHVGCVSTAESQQLAASAKRYGFDAVSAVTPFYYPFSFEEHCDHYRAIIDSADGLPMVVYNIP
ALSGVKLTLDQINTLVTLPGVGALKQTSGDLYQMEQIRREHPDLVLYNGYDEIFASGLLAGADGGIGSTY
NIMGWRYQGIVKALKEGDIQTAQKLQTECNKVIDLLIKTGVFRGLKTVLHYMDVVSVPLCRKPFGPVDEK
YLPELKALAQQLMQERG
>Vibrio_cholerae
MKKLTGLIAAPHTPFTKDNKVNFAAIDQIAELLIEQGVKGAYVCGTTGEGIHCSVEERKAIAERWVKAVD
GKLDVILHTGALSIVDTINLTEHAETLDIFATSAIGPCFFKPGSVDDLVEYCAQVAAAAPSKGFYYYHSG
MSGVNLDLEQFLIKGEQRIPNLYGAKFNNADLYEYQRCVRVSNRKFDIPFGVDEFLPAGLAVGAVGAVGS
TYNYAAPLYLKIIEAFNHGKHDEVAALMDKVIAIIRVLVEYGGVAAGKVAMQLHGIDAGDPRLPIRSLND
KQKADVLAKMRDAGFLSI
>Homo_sapiens
MAFPKKKLQGLVAATITPMTENGEINFSVIGQYVDYLVKEQGVKNIFVNGTTGEGLSLSVSERRQVAEEW
VTKGKDKLDQVIIHVGALSLKESQELAQHAAEIGADGIAVIAPFFLKPWTKDILINFLKEVAAAAPALPF
YYYHIPALTGVKIRAEELLDGILDKIPTFQGLKFSDTDLLDFGQCVDQNRQQQFAFLFGVDEQLLSALVM
GATGAVGSTYNYLGKKTNQMLEAFEQKDFSLALNYQFCIQRFINFVVKLGFGVSQTKAIMTLVSGIPMGP
PRLPLQKASREFTDSAEAKLKSLDFLSFTDLKDGNLEAGS
>Neisseria_meningitidis
MLQGSLVALITPMNQDGSIHYEQLRDLIDWHIENGTDGIVAVGTTGESATLSVEEHTAVIEAVVKHVAKR
VPVIAGTGANNTVEAIALSQAAEKAGADYTLSVVPYYNKPSQEGMYRHFKAVAEAAAIPMILYNVPGRTV
VSMNNETILRLAEIPNIVGVKEASGNIGSNIELINRAPEGFVVLSGD
