Title: Building Phylogenetic Trees
1Building Phylogenetic Trees
Yaw-Ling Lin
Dept. of Computer Science and Information Management, Providence University, Taiwan
E-mail: yllin_at_pu.edu.tw
WWW: http://www.cs.pu.edu.tw/yawlin
2Phylogenetic Tree
- Topology: bifurcating
- Leaves: $1 \ldots N$
- Internal nodes: $N+1 \ldots 2N-2$
3Orthologues / Paralogues
4Rooted / Unrooted Tree
5Counting Trees
6Counting Trees
$(2N-5)!!$ unrooted trees for $N$ taxa
$(2N-3)!!$ rooted trees for $N$ taxa
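To get a feel for how fast these counts grow, the two double-factorial formulas can be evaluated directly. This is a minimal sketch; the function names are chosen here for illustration only.

```python
def double_factorial(k: int) -> int:
    """Product k * (k - 2) * (k - 4) * ... down to 1 (taken as 1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def count_unrooted(n_taxa: int) -> int:
    """(2N - 5)!! unrooted bifurcating trees for N >= 3 labeled taxa."""
    return double_factorial(2 * n_taxa - 5)


def count_rooted(n_taxa: int) -> int:
    """(2N - 3)!! rooted bifurcating trees for N >= 2 labeled taxa."""
    return double_factorial(2 * n_taxa - 3)


for n in (3, 4, 5, 10):
    print(n, count_unrooted(n), count_rooted(n))
```

Already at N = 10 there are about two million unrooted and thirty-four million rooted topologies, which is why exhaustive search over trees is rarely feasible.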
7Rooting the tree
To root a tree mentally, imagine that the tree is made of string. Grab the string at the root and tug on it until the ends of the string (the taxa) fall opposite the root.
Unrooted tree
8UPGMA -- Unweighted Pair Group Method with
Arithmetic mean
Simplest method: uses a sequential clustering algorithm (assumes rate constancy among lineages, which is often violated).
[Figure: distance matrix and tree. Step 1 joins A and B into (AB); step 2 joins (AB) with C.]
$d_{(AB)C} = (d_{AC} + d_{BC}) / 2$
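The sequential clustering idea can be sketched as follows. This is an illustrative implementation, not the one from the slides: the function name, the input layout (a dict keyed by `frozenset` pairs) and the toy distances are assumptions. Cluster distances are averaged with weights equal to cluster sizes, which reduces to $(d_{AC} + d_{BC})/2$ when A and B are single taxa.

```python
from itertools import combinations


def upgma(names, dist):
    """names: list of taxon labels; dist: {frozenset({a, b}): distance}.
    Returns (tree, root_height) with the tree as nested tuples."""
    clusters = {name: (name, 1) for name in names}  # key -> (subtree, size)
    d = dict(dist)
    height = 0.0
    while len(clusters) > 1:
        # Join the closest pair of current clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        height = d[frozenset((a, b))] / 2  # new node sits halfway between a and b
        new = f"({a},{b})"
        # Size-weighted average distance from the merged cluster to the rest.
        for k in clusters:
            d[frozenset((new, k))] = (
                size_a * d[frozenset((a, k))] + size_b * d[frozenset((b, k))]
            ) / (size_a + size_b)
        clusters[new] = ((tree_a, tree_b), size_a + size_b)
    (tree, _), = clusters.values()
    return tree, height


# Hypothetical ultrametric (clock-like) distances, for illustration only.
names = ["A", "B", "C", "D"]
pairs = {("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
         ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10}
dist = {frozenset(p): v for p, v in pairs.items()}
tree, height = upgma(names, dist)
print(tree, height)
```

Because the toy distances are clock-like, the clustering order recovers the generating tree; with unequal rates among lineages the same procedure can join the wrong pair.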
9UPGMA Step 1: combine B and C
10UPGMA Step 2: combine BC and D
(10 + 12) / 2
(4 + 6) / 2
11UPGMA Step 3: combine A and E
12UPGMA Step 4: combine AE and BCD
13UPGMA Result
14UPGMA Result
15UPGMA(1)
16UPGMA(2)
17UPGMA -- Illustrations
18When UPGMA fails
19Neighbor Joining
- Very popular method
- Does not make molecular clock assumption: modified distance matrix constructed to adjust for differences in evolution rate of each taxon
- Produces unrooted tree
- Assumes additivity: distance between pairs of leaves = sum of lengths of edges connecting them
- Like UPGMA, constructs tree by sequentially joining subtrees
20Additivity
21Naïve NJ by Additivity?
- $O(n^2)$ $(i,j)$ pairs
- $O(n^2)$ $(k,l)$ pairs
- $(k,l)$ rejects $(i,j)$ whenever additivity fails
- $O(n^4)$ to pick an $(i,j)$ neighbor pair!
- Picking a pair is repeated $O(n)$ times, so $O(n^5)$ time suffices in total
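The rejection test can be sketched with the four-point condition, which is assumed here to be what "additivity fails" refers to: on an additive tree with positive internal edge lengths, $(i,j)$ are neighbors exactly when $d_{ij} + d_{kl}$ is strictly the smallest of the three pairings for every other pair $(k,l)$. The function names and the small distance matrix are illustrative.

```python
from itertools import combinations


def is_neighbor_pair(d, i, j):
    """True if no pair (k, l) rejects (i, j) under the four-point condition."""
    others = [m for m in range(len(d)) if m not in (i, j)]
    return all(
        d[i][j] + d[k][l] < min(d[i][k] + d[j][l], d[i][l] + d[j][k])
        for k, l in combinations(others, 2)
    )


def find_neighbor_pairs(d):
    """O(n^2) candidate pairs, each checked against O(n^2) pairs (k, l)."""
    return [(i, j) for i, j in combinations(range(len(d)), 2)
            if is_neighbor_pair(d, i, j)]


# Additive distances from a hypothetical tree: leaves 0 and 2 share a parent,
# leaves 1 and 3 share a parent, leaf edges are 1, 1, 4, 4, internal edge is 1.
D = [
    [0, 3, 5, 6],
    [3, 0, 6, 5],
    [5, 6, 0, 9],
    [6, 5, 9, 0],
]
print(find_neighbor_pairs(D))
```

Note that the closest pair in this matrix is (0, 1), which is not a neighbor pair; the test correctly rejects it.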
22Neighbor Joining: once we know the correct (i,j) pair
23Neighbour Joining: why not pick the smallest (i,j) pair?
24Neighbour Joining(3)
25Neighbour Joining Algorithm
26Neighbor-Joining Algorithm
27Neighbor-Joining Complexity
- Each iteration searches for the pair to join in $O(n^2)$ time and updates the distance matrix in at most $O(n^2)$ time.
- With $O(n)$ iterations, this gives a total time complexity of $O(n^3)$ and a space complexity of $O(n^2)$.
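A compact sketch of the whole loop shows where the $O(n^2)$ search per iteration comes from. The slides' own formulas did not survive extraction, so this sketch assumes the commonly used Studier and Keppler form: pick the pair minimizing $(n-2)\,d_{ij} - R_i - R_j$ with $R_i = \sum_k d_{ik}$. The function name, the node labels `U0`, `U1`, ... and the input matrix are illustrative.

```python
def neighbor_joining(names, d):
    """names: leaf labels; d: symmetric distance matrix (list of lists).
    Returns the edges (node, node, length) of an unrooted tree."""
    nodes = list(names)
    dist = {(a, b): d[x][y] for x, a in enumerate(names) for y, b in enumerate(names)}
    edges, next_id = [], 0
    while len(nodes) > 2:
        n = len(nodes)
        r = {a: sum(dist[a, b] for b in nodes) for a in nodes}  # row sums R_i
        # Search step: examine all O(n^2) pairs for the minimal criterion.
        i, j = min(
            ((a, b) for x, a in enumerate(nodes) for b in nodes[x + 1:]),
            key=lambda p: (n - 2) * dist[p] - r[p[0]] - r[p[1]],
        )
        u = f"U{next_id}"
        next_id += 1
        # Branch lengths from i and j to the new internal node u.
        li = dist[i, j] / 2 + (r[i] - r[j]) / (2 * (n - 2))
        edges += [(i, u, li), (j, u, dist[i, j] - li)]
        # Update step: distances from u to every remaining node.
        for k in nodes:
            if k not in (i, j):
                dist[u, k] = dist[k, u] = (dist[i, k] + dist[j, k] - dist[i, j]) / 2
        dist[u, u] = 0.0
        nodes = [k for k in nodes if k not in (i, j)] + [u]
    a, b = nodes
    edges.append((a, b, dist[a, b]))
    return edges


# Additive toy matrix: leaves 1 and 3 are neighbors, as are 2 and 4,
# although the smallest distance is between 1 and 2.
D = [
    [0, 3, 5, 6],
    [3, 0, 6, 5],
    [5, 6, 0, 9],
    [6, 5, 9, 0],
]
edges = neighbor_joining(["1", "2", "3", "4"], D)
for edge in edges:
    print(edge)
```

On this input the sketch joins 1 with 3 first rather than the closest pair 1 and 2, and it recovers the leaf edges 1, 4, 1, 4 and the internal edge 1 of the generating tree.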
28Reasoning the NJ Method
- Where do the ideas of $S_{i,j}$ and $R_i$ come from?
- How correct is the algorithm?
- Heuristic or exact solution?
29The 1-star Sum of the Branch Lengths
- $D$ and $L$ denote the distance between OTUs and the branch length between nodes, respectively
- Each branch is counted $N-1$ times when all distances are added
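The counting argument can be written out explicitly. The notation below is an assumption, since the slide's own formula is not preserved: $X$ is the central node of the star tree, $L_{iX}$ is the branch from OTU $i$ to $X$, and $S_0$ is the sum of all branch lengths.

```latex
D_{ij} = L_{iX} + L_{jX}
\quad\Longrightarrow\quad
\sum_{i<j} D_{ij} = (N-1) \sum_{i=1}^{N} L_{iX}
\quad\Longrightarrow\quad
S_0 = \sum_{i=1}^{N} L_{iX} = \frac{1}{N-1} \sum_{i<j} D_{ij}
```

Each $L_{iX}$ appears once for every partner $j \neq i$, which gives the factor $N-1$.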
30The paired-2-star Sum of the Branch Lengths
31The paired-2-star Tree Size
32The Distance and Branch Lengths between a
Combined OTU and another One
33Before the proof
34Before the proof (Cont.)
35Neighbor-Joining The proof
36Lemma
37Lemma (Cont.)
38Proof
39Proof of the Theorem by contradiction
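[Figure: tree containing OTUs i, j, k, l, r, s and nodes u, v, w; a fifth OTU m joins the tree at a point x along one of the arcs marked A or B.]
Type 1 (A): $-2D_{ux} - 2D_{uv}$. Type 2 (B): $-4D_{vx} + 2D_{uv}$. For the sum in formula (b) to be nonnegative, there should be more terms of type 2 than of type 1.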
Suppose that i and j are not neighbors. Let k and l be any pair of neighbors, so that i, j, k, and l are distinct and are represented in the tree. Consider the sum in formula (b), which is nonnegative. If m is a fifth OTU, then it joins the tree at point x along one of the indicated arcs. Say that m is of type 1 if it joins the path from i to j at any node different from u, and that m is of type 2 if it joins the path from i to j at node u.
40Proof of the theorem (Cont.)
If m is of type 1, then the corresponding summand in formula (b) is $-2D_{ux} - 2D_{uv}$. If m is of type 2, then the corresponding summand in formula (b) is $-4D_{vx} + 2D_{uv}$. For the sum in formula (b) to be nonnegative, there must be at least as many terms corresponding to OTUs m of type 2 as there are terms corresponding to OTUs m of type 1. It follows that there are more OTUs that join the path from i to j at u than there are OTUs that join that path at all other nodes combined. Because neither i nor j has a neighbor, there must be a pair r, s of neighbors that joins the path from i to j at a node w that is different from u. By the above argument applied to w, there are more OTUs that join the path from i to j at w than there are OTUs that join that path at all other nodes combined. The conclusions about u and w contradict each other, and the theorem follows.
41Speeding up Neighbor-Joining Tree Construction
- In this paper, the authors present several heuristics for speeding up the NJ method.
- The heuristics attempt to reduce the search time by using a quad-tree.
- The worst-case time complexity remains $O(n^3)$, and the space complexity after adding the quad-tree is still $O(n^2)$.
- The authors have implemented a tool, QuickJoin.
42Previous Work
- The neighbor-joining method was introduced by Saitou and Nei.
- The algorithm was later amended by Studier and Keppler, with a running time of $O(n^3)$.
- BIONJ -- Gascuel et al. produced an $O(n^3)$ implementation of a variant of the NJ algorithm that produces more accurate trees in many cases.
- QuickTree -- Durbin et al. produced a code-optimized implementation of the NJ algorithm.
43Appendix: Proof of neighbour-joining
44+/- of distance methods
- Advantages
- easy to perform
- quick calculation
- fit for sequences having high similarity scores
- Disadvantages
- the sequences are not considered as such (loss of information)
- all sites are generally treated equally (differences in substitution rates are not taken into account)
- not applicable to distantly divergent sequences
45Parsimony
46Maximum Parsimony Method
Principle: search for the tree that requires the smallest number of character state changes between the OTUs.
Informative sites: those that favor some trees over others. Operationally, a site is informative if it has at least two different kinds of residues, each of which is found in at least two of the OTU sequences.
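The operational definition translates directly into a column scan. This is a minimal sketch; the function name is illustrative, and the five sequences are the ones shown on the later Example slide (A to E).

```python
from collections import Counter


def informative_sites(seqs):
    """Return 0-based positions with at least two residue kinds,
    each present in at least two sequences."""
    sites = []
    for pos, column in enumerate(zip(*seqs)):
        counts = Counter(column)
        if sum(1 for c in counts.values() if c >= 2) >= 2:
            sites.append(pos)
    return sites


seqs = ["CAGGTA", "CAGACA", "CGGGTA", "TGCACT", "TGCGTA"]  # A, B, C, D, E
print(informative_sites(seqs))
```

The last column (A, A, A, T, A) is not informative: the single T costs one change on every tree, so it cannot favor one topology over another.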
47Evaluating Parsimony Scores
- How do we compute the Parsimony score for a given tree?
- Traditional Parsimony
- Each base change has a cost of 1
- Weighted Parsimony
- Each change is weighted by the score c(a,b)
48Traditional Parsimony
- Solved independently for each position
- Linear time solution
[Figure: example tree for one position, with node labels a, a, {a,g}, a]
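The linear-time solution is assumed here to be Fitch's set-based algorithm, which the {a,g} label in the figure suggests: at each internal node take the intersection of the children's state sets if it is non-empty, otherwise take the union and count one change. The function names and the tree shape are hypothetical; the sequences are those of the Example slide.

```python
def fitch_site(tree, column):
    """tree: leaf name (str) or a pair (left, right); column: {leaf: residue}.
    Returns (candidate_states, minimal_number_of_changes) for one position."""
    if isinstance(tree, str):
        return {column[tree]}, 0
    (left_set, left_cost), (right_set, right_cost) = (
        fitch_site(tree[0], column), fitch_site(tree[1], column))
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, seqs):
    """Sum the per-position scores; positions are treated independently."""
    length = len(next(iter(seqs.values())))
    return sum(
        fitch_site(tree, {name: s[pos] for name, s in seqs.items()})[1]
        for pos in range(length)
    )


SEQS = {"A": "CAGGTA", "B": "CAGACA", "C": "CGGGTA", "D": "TGCACT", "E": "TGCGTA"}
TREE = ((("A", "B"), "C"), ("D", "E"))  # hypothetical topology for illustration
print(parsimony_score(TREE, SEQS))
```

Each node is visited once per position, which gives the linear running time in the number of nodes.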
49Traditional Parsimony
50Evaluating Weighted Parsimony
- Dynamic programming on the tree
- Initialization
- For each leaf $i$ set $S(i,a) = 0$ if $i$ is labeled by $a$, otherwise $S(i,a) = \infty$
- Iteration
- If $k$ is a node with children $i$ and $j$, then $S(k,a) = \min_b \big(S(i,b) + c(a,b)\big) + \min_b \big(S(j,b) + c(a,b)\big)$
- Termination
- Cost of the tree is $\min_a S(r,a)$, where $r$ is the root
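The recursion above can be sketched as a post-order traversal. The function names, the tree shape and the transition/transversion cost function are assumptions made for illustration; the sequences are those of the Example slide. As written, the sketch does $O(k^2)$ work per node and position for an alphabet of size $k$.

```python
import math

ALPHABET = "ACGT"


def unit_cost(a, b):
    """Traditional parsimony: every change costs 1."""
    return 0 if a == b else 1


def ts_tv_cost(a, b):
    """Hypothetical weights: transitions cost 1, transversions cost 2."""
    if a == b:
        return 0
    purines = {"A", "G"}
    return 1 if (a in purines) == (b in purines) else 2


def sankoff(tree, column, cost):
    """Return {a: S(k, a)} for the root k of `tree` at one position."""
    if isinstance(tree, str):  # leaf: 0 for its own label, infinity otherwise
        return {a: 0 if column[tree] == a else math.inf for a in ALPHABET}
    left = sankoff(tree[0], column, cost)
    right = sankoff(tree[1], column, cost)
    return {
        a: min(left[b] + cost(a, b) for b in ALPHABET)
        + min(right[b] + cost(a, b) for b in ALPHABET)
        for a in ALPHABET
    }


def weighted_parsimony(tree, seqs, cost):
    """Sum over positions of min_a S(root, a)."""
    length = len(next(iter(seqs.values())))
    return sum(
        min(sankoff(tree, {n: s[pos] for n, s in seqs.items()}, cost).values())
        for pos in range(length)
    )


SEQS = {"A": "CAGGTA", "B": "CAGACA", "C": "CGGGTA", "D": "TGCACT", "E": "TGCGTA"}
TREE = ((("A", "B"), "C"), ("D", "E"))  # hypothetical topology for illustration
print(weighted_parsimony(TREE, SEQS, unit_cost))
print(weighted_parsimony(TREE, SEQS, ts_tv_cost))
```

With unit costs the result coincides with the traditional parsimony score of the same tree; other cost functions only change `cost`.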
51Example
A CAGGTA
B CAGACA
C CGGGTA
D TGCACT
E TGCGTA
52Cost of Evaluating Parsimony
- Score is evaluated on each position independently. Scores are then summed over all positions.
- If there are n nodes, m characters, and k possible values for each character, then complexity is O(nmk)
- By keeping traceback information, we can reconstruct most parsimonious values at each ancestor node
53Weighted Parsimony
54Traditional Parsimony is not complete
55Parsimony: Searching over all trees by Branch and Bound
56Assessing the trees: the bootstrap
58Simultaneous alignment and phylogeny(1)
59Inferring trees: Maximum Likelihood method
- Maximum likelihood supposes a model of evolution along tree branches.
- Strategy
- Find parameters (tree, branch lengths, substitution rate) that maximize the likelihood assigned to the data.
- Note: the model of evolution does not include indels!
- In the Phylip package: program PROTML
60Probabilistic Methods
- The phylogenetic tree represents a generative probabilistic model (like HMMs) for the observed sequences.
- Background probabilities: $q(a)$
- Mutation probabilities: $P(a \to b, t)$
- Models for evolutionary mutations
- Jukes Cantor
- Kimura 2-parameter model
- Such models are used to derive the probabilities
61Jukes Cantor model
- A model for mutation rates
- Mutation occurs at a constant rate
- Each nucleotide is equally likely to mutate into any other nucleotide, with rate $\alpha$.
62Kimura 2-parameter model
- Allows a different rate for transitions and
transversions.
63Mutation Probabilities
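- The rate matrix R is used to derive the mutation probability matrix S
- S is obtained by integration. For Jukes Cantor with rate $\alpha$, the diagonal entries of $S(t)$ are $\frac{1}{4}(1 + 3e^{-4\alpha t})$ and the off-diagonal entries are $\frac{1}{4}(1 - e^{-4\alpha t})$
- q can be obtained by setting t to infinity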
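A small numerical sketch of the Jukes Cantor case, assuming the standard closed form for a rate matrix with off-diagonal rate $\alpha$. It shows that each row of $S(t)$ is a probability distribution, that $S(t_1)\,S(t_2) = S(t_1 + t_2)$, and that the entries tend to the uniform background $q(a) = 1/4$ as $t$ grows. The function names are illustrative.

```python
import math


def jukes_cantor(t, alpha=1.0):
    """4x4 matrix S(t); entry [a][b] is the probability of a -> b in time t."""
    e = math.exp(-4 * alpha * t)
    same, diff = (1 + 3 * e) / 4, (1 - e) / 4
    return [[same if a == b else diff for b in range(4)] for a in range(4)]


def matmul(x, y):
    """Plain matrix product for small square matrices."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


s1, s2, s12 = jukes_cantor(0.1), jukes_cantor(0.25), jukes_cantor(0.35)
product = matmul(s1, s2)
print(sum(s1[0]))                # each row sums to 1
print(product[0][1], s12[0][1])  # S(t1) S(t2) agrees with S(t1 + t2)
print(jukes_cantor(1000.0)[0])   # long times: uniform background 1/4
```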
64Mutation Probabilities
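- Both models satisfy the following properties
- Lack of memory
- $P(a \to b, t + t') = \sum_c P(a \to c, t)\, P(c \to b, t')$
- Reversibility
- There exist stationary probabilities $P_a$ such that $P_a\, P(a \to b, t) = P_b\, P(b \to a, t)$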
65Probabilistic Approach
- Given P,q, the tree topology and branch lengths,
we can compute
66Computing the Tree Likelihood
- We are interested in the probability of the observed data given the tree and branch lengths
- Computed by summing over internal nodes
- This can be done efficiently using an upward traversal pass of the tree.
67Tree Likelihood Computation
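- Define $P(L_k \mid a)$ = prob. of leaves below node $k$ given that $x_k = a$
- Init: for leaves, $P(L_k \mid a) = 1$ if $x_k = a$, 0 otherwise
- Iteration: if $k$ is a node with children $i$ and $j$, then $P(L_k \mid a) = \sum_b P(a \to b, t_i)\, P(L_i \mid b) \cdot \sum_c P(a \to c, t_j)\, P(L_j \mid c)$
- Termination: likelihood is $\sum_a q(a)\, P(L_{root} \mid a)$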
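The upward pass can be sketched for a single site as follows. The sketch assumes the Jukes Cantor model with uniform background $q(a) = 1/4$; the tree encoding (each internal node is a pair of `(child, branch_length)` entries), the function names and the branch lengths are hypothetical.

```python
import math

ALPHABET = "ACGT"


def jc_prob(a, b, t, alpha=1.0):
    """Jukes Cantor probability of a -> b along a branch of length t."""
    e = math.exp(-4 * alpha * t)
    return (1 + 3 * e) / 4 if a == b else (1 - e) / 4


def conditional(tree, column):
    """Return {a: P(L_k | a)} for the root k of `tree`.
    tree: leaf name (str) or ((left, t_left), (right, t_right))."""
    if isinstance(tree, str):  # leaf: indicator of the observed residue
        return {a: 1.0 if column[tree] == a else 0.0 for a in ALPHABET}
    (left, t_left), (right, t_right) = tree
    p_left, p_right = conditional(left, column), conditional(right, column)
    return {
        a: sum(jc_prob(a, b, t_left) * p_left[b] for b in ALPHABET)
        * sum(jc_prob(a, c, t_right) * p_right[c] for c in ALPHABET)
        for a in ALPHABET
    }


def site_likelihood(tree, column, q=0.25):
    """Termination step: sum over root states weighted by the background q."""
    return sum(q * p for p in conditional(tree, column).values())


# Hypothetical three-leaf tree with illustrative branch lengths.
TREE = (((("x", 0.1), ("y", 0.2)), 0.3), ("z", 0.4))
print(site_likelihood(TREE, {"x": "A", "y": "A", "z": "G"}))
```

For several independent positions the per-site likelihoods are multiplied, or their logarithms summed.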
68Maximum Likelihood (ML)
- Score each tree by
- Assumption of independent positions
- Branch lengths t can be optimized
- Gradient ascent
- EM
- We look for the highest scoring tree
- Exhaustive
- Sampling methods (Metropolis)
69Optimal Tree Search
- Perform search over possible topologies
[Figure labels: parameter space, parametric optimization (EM), local maxima]
70Computational Problem
- Such procedures are computationally expensive!
- Computation of optimal parameters, per candidate, requires a non-trivial optimization step.
- We spend non-negligible computation on a candidate, even if it is a low-scoring one.
- In practice, such learning procedures can only consider small sets of candidate structures.
71Max Likelihood versus Parsimony
[Figure: tree T on leaves 1-4 with unequal branch lengths (0.3, 0.1, 0.09, 0.1, 0.3), and the three candidate topologies T1, T2, T3]
- (Example from BSA p. 225)
- Choose tree T, with unequal branch lengths.
- Generate 1000 sequences of length N according to the probabilistic model
- (A) Reconstruction by ML; (B) Reconstruction by Parsimony

(A) ML
N     T1   T2   T3
20    419  339  242
100   638  204  158
500   904  61   35
2000  997  3    0

(B) Parsimony
N     T1   T2   T3
20    396  378  224
100   405  515  79
500   404  594  2
2000  353  646  0

Conclusion: ML infers the right tree as N gets larger; Parsimony does not necessarily.
72Max Likelihood versus NJ
- (Example from BSA p. 225)
- Choose tree T, with unequal branch lengths.
- Generate 1000 sequences of length N according to the probabilistic model
- (A) Reconstruction by ML; (B) Reconstruction by NJ

N     T1   T2   T3
20    419  339  242
100   638  204  158
500   904  61   35
2000  997  3    0

Conclusion: ML infers the right tree as N gets larger. If the probabilistic model is correct, the ML distances should be very close to additive, and therefore the NJ method predicts the correct tree.
73Phylip - practicalities
- Menu-driven, no command line
- Input file format
- First line: <number of sequences> <number of letters per sequence>
- Next lines: sequences
- First ten characters are the sequence name
- Then the sequence follows. Spaces and newlines are allowed.
- Dashes (-) signify gaps
- Example
4 46
hba1      MV-LSPADKTNVKAAWGKVG AHAGEYGAEALERM FLSFPTTKTYFP
beta      MVHLTPEEKSAVTALWGKVN VDEVGG EALGRLLVVYPWTQRFFESF
Myoglobin MGLSDGEWQLVLNVWGKV E ADIPGHGQEVLIRLFKGHPETLEKFD
Leghemogl MGAFSEKQESLVKSSWEAFK QNVPHHSAVFYTLILEKAPAAQNMFS
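The format rules above can be sketched as a small reader. This is not part of Phylip; `parse_phylip` is a hypothetical helper that follows only the rules listed on this slide (header line, ten-character names, spaces and newlines allowed inside a sequence), and the two-sequence input is made up for illustration.

```python
def parse_phylip(text):
    """Parse the sequential layout described above into {name: sequence}."""
    lines = iter(text.splitlines())
    n_seqs, n_letters = map(int, next(lines).split())
    records = {}
    for _ in range(n_seqs):
        line = next(lines)
        # First ten characters are the name; the rest is sequence data.
        name, seq = line[:10].strip(), line[10:].replace(" ", "")
        # A sequence may continue on following lines until it is complete.
        while len(seq) < n_letters:
            seq += next(lines).replace(" ", "")
        if len(seq) != n_letters:
            raise ValueError(f"{name}: expected {n_letters} letters, got {len(seq)}")
        records[name] = seq
    return records


# Made-up input: two sequences of 12 letters, the first split over two lines.
text = "\n".join([
    "2 12",
    "seq_one".ljust(10) + "ACGT-ACG TA",
    "GC",
    "seq_two".ljust(10) + "ACGTTACGTAGC",
])
records = parse_phylip(text)
print(records)
```

Gaps are kept as dashes in the returned strings, so downstream code can decide how to treat them.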
74The End