Building Phylogenetic Trees - PowerPoint PPT Presentation

1
Building Phylogenetic Trees
Yaw-Ling Lin, Dept. of Computer Science and Information Management, Providence University, Taiwan
E-mail: yllin_at_pu.edu.tw
WWW: http://www.cs.pu.edu.tw/yawlin
2
Phylogenetic Tree

Topology: bifurcating
Leaves: 1 ... N
Internal nodes: N+1 ... 2N-2

3
Orthologues / Paralogues
4
Rooted / Unrooted Tree
5
Counting Trees
6
Counting Trees
(2N-5)!! unrooted trees for N taxa
(2N-3)!! rooted trees for N taxa
7
Rooting the tree
To root a tree mentally, imagine that the tree is
made of string. Grab the string at the root
and tug on it until the ends of the string (the
taxa) fall opposite the root
Unrooted tree
8
UPGMA -- Unweighted Pair Group Method with
Arithmetic mean
Simplest method: uses a sequential clustering algorithm (assumes rate constancy among lineages, which is often violated)
[Figure: step 1 and step 2, each showing a distance matrix and the corresponding tree; cluster (AB) is compared with C.]
$d_{(AB)C} = (d_{AC} + d_{BC}) / 2$
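The sequential clustering can be sketched in a few lines. This is a minimal illustration, not the slides' own implementation: the distance matrix below is hypothetical, and the update uses the cluster-size-weighted average, which reduces to the two-taxon average when both merged clusters are single taxa.

```python
from itertools import combinations


def upgma(labels, dist):
    """Cluster taxa by UPGMA.

    labels: list of taxon names.
    dist:   dict mapping frozenset({a, b}) -> distance between a and b.
    Returns (tree, merges): tree is a nested tuple, merges is a list of
    (cluster_a, cluster_b, height) in the order the clusters were joined.
    """
    sizes = {name: 1 for name in labels}   # cluster -> number of taxa in it
    d = dict(dist)                         # working copy of the distances
    merges = []
    while len(sizes) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        height = d[frozenset((a, b))] / 2  # node height under rate constancy
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        new = (a, b)
        # Distance from the new cluster to every remaining cluster:
        # average over all taxon pairs, i.e. a size-weighted mean.
        for c in sizes:
            d[frozenset((new, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        sizes[new] = size_a + size_b
        merges.append((a, b, height))
    (tree,) = sizes
    return tree, merges


# Hypothetical ultrametric distances, chosen only for illustration.
example = {("A", "B"): 2, ("A", "C"): 4, ("B", "C"): 4,
           ("A", "D"): 6, ("B", "D"): 6, ("C", "D"): 6}
dist = {frozenset(pair): value for pair, value in example.items()}
tree, merges = upgma(["A", "B", "C", "D"], dist)
print(tree)
print(merges)
```

Because the example distances are ultrametric, the merge heights come out as half of the pairwise distances between the joined clusters.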
9
UPGMA step 1: combine B and C
10
UPGMA step 2: combine BC and D
(10 + 12)/2
(4 + 6)/2
11
UPGMA step 3: combine A and E
12
UPGMA step 4: combine AE and BCD
13
UPGMA Result
14
UPGMA Result
15
UPGMA(1)
16
UPGMA(2)
17
UPGMA -- Illustrations
18
When UPGMA fails
19
Neighbor Joining

Very popular method
Does not make the molecular clock assumption
A modified distance matrix is constructed to adjust for differences in the evolution rate of each taxon
Produces an unrooted tree
Assumes additivity: the distance between a pair of leaves equals the sum of the lengths of the edges connecting them
Like UPGMA, constructs the tree by sequentially joining subtrees

20
Additivity
21
Naïve NJ by Additivity?

$O(n^2)$ (i,j) pairs
$O(n^2)$ (k,l) pairs
(k,l) rejects (i,j) whenever additivity fails
$O(n^4)$ to pick an (i,j) neighbor pair!
So $O(n^5)$ time in total suffices
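The rejection test can be sketched with the four-point condition. This is one reading of the slide, assuming an additive distance matrix with strictly positive internal edge lengths: (i,j) is a neighbor pair when, for every other pair (k,l), the sum d(i,j) + d(k,l) is strictly the smallest of the three pairings. The matrix below is a hypothetical example.

```python
from itertools import combinations


def is_neighbor_pair(d, i, j):
    """Return True if no pair (k, l) rejects (i, j).

    d is a symmetric matrix (list of lists) of additive distances.
    (k, l) rejects (i, j) when d[i][j] + d[k][l] is not strictly smaller
    than both alternative pairings of the four taxa.
    """
    others = [x for x in range(len(d)) if x not in (i, j)]
    for k, l in combinations(others, 2):          # O(n^2) (k, l) pairs
        alternatives = min(d[i][k] + d[j][l], d[i][l] + d[j][k])
        if d[i][j] + d[k][l] >= alternatives:
            return False
    return True


def find_neighbor_pairs(d):
    """Test all O(n^2) candidate pairs: O(n^4) work to pick a neighbor pair."""
    return [(i, j) for i, j in combinations(range(len(d)), 2)
            if is_neighbor_pair(d, i, j)]


# Hypothetical additive distances for the tree ((0,1),(2,3)) with leaf
# edges 1, 2, 3, 4 and an internal edge of length 5.
d = [
    [0, 3, 9, 10],
    [3, 0, 10, 11],
    [9, 10, 0, 7],
    [10, 11, 7, 0],
]
print(find_neighbor_pairs(d))
```

Repeating this search after each join, for O(n) joins, gives the O(n^5) total stated above.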

22
Neighbor Joining: once we know the correct (i,j) pair
23
Neighbour Joining: why not pick the smallest (i,j) pair?
24
Neighbour Joining(3)
25
Neighbour Joining Algorithm
26
Neighbor-Joining Algorithm
27
Neighbor-Joining Complexity

Each iteration searches for the pair to join in $O(n^2)$ time and updates the distance matrix in $O(n^2)$ time.
Over the $O(n)$ iterations this gives a total time complexity of $O(n^3)$ and a space complexity of $O(n^2)$.
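The cost structure is visible in a direct implementation. The sketch below uses the standard Studier-Keppler selection criterion, $(n-2) d_{ij} - R_i - R_j$ with $R_i$ the row sum of taxon i; it is an assumption that this matches the quantities on the algorithm slides, which are not in the transcript. The input distances and the internal node names (`node0`, `node1`, ...) are hypothetical.

```python
def neighbor_joining(names, d):
    """Build an unrooted tree by neighbor joining.

    names: list of taxon labels.
    d:     dict of dicts with symmetric distances, d[a][b] == d[b][a].
    Returns a list of edges (child, parent, branch_length).
    """
    d = {a: dict(row) for a, row in d.items()}    # working copy
    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Row sums R_i over the currently active nodes.
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        # O(n^2) search for the pair minimizing (n-2) d_ij - R_i - R_j.
        i, j = min(
            ((a, b) for idx, a in enumerate(nodes) for b in nodes[idx + 1:]),
            key=lambda p: (n - 2) * d[p[0]][p[1]] - r[p[0]] - r[p[1]],
        )
        u = f"node{next_id}"
        next_id += 1
        # Branch lengths from i and j to the new internal node u.
        len_i = 0.5 * d[i][j] + (r[i] - r[j]) / (2 * (n - 2))
        len_j = d[i][j] - len_i
        edges += [(i, u, len_i), (j, u, len_j)]
        # Distances from u to every remaining node.
        d[u] = {}
        for k in nodes:
            if k not in (i, j):
                d[u][k] = d[k][u] = 0.5 * (d[i][k] + d[j][k] - d[i][j])
        nodes = [k for k in nodes if k not in (i, j)] + [u]
    a, b = nodes
    edges.append((a, b, d[a][b]))                 # last remaining branch
    return edges


# Hypothetical additive distances: leaf edges A=1, B=2, C=3, D=4,
# internal edge 5, topology ((A,B),(C,D)).
pairs = {("A", "B"): 3, ("A", "C"): 9, ("A", "D"): 10,
         ("B", "C"): 10, ("B", "D"): 11, ("C", "D"): 7}
names = ["A", "B", "C", "D"]
d = {a: {} for a in names}
for (a, b), value in pairs.items():
    d[a][b] = d[b][a] = value
for edge in neighbor_joining(names, d):
    print(edge)
```

On additive input like this, the recovered branch lengths equal the edge lengths used to generate the distances.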

28
Reasoning the NJ Method

Where do the ideas of $S_{i,j}$ and $R_i$ come from?
How correct is the algorithm?
Heuristic or exact solution?

29
The 1-star Sum of the Branch Lengths

Let D denote the distance between OTUs and L the branch length between nodes
Each branch is counted N-1 times when all distances are added
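The counting argument can be written out for the star tree. The notation is an assumption, since the slide's own formula is not in the transcript: X denotes the central node and $L_{iX}$ the branch from OTU i to X, so that $D_{ij} = L_{iX} + L_{jX}$. Every branch $L_{iX}$ lies on the paths from i to each of the other N-1 OTUs, which gives:

```latex
\sum_{i<j} D_{ij} = (N-1) \sum_{i=1}^{N} L_{iX}
\quad\Longrightarrow\quad
S_0 = \sum_{i=1}^{N} L_{iX} = \frac{1}{N-1} \sum_{i<j} D_{ij}
```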

30
The paired-2-star Sum of the Branch Lengths
31
The paired-2-star Tree Size
32
The Distance and Branch Lengths between a
Combined OTU and another One
33
Before the proof
34
Before the proof (Cont.)
35
Neighbor-Joining: the proof
36
Lemma
37
Lemma (Cont.)
38
Proof
39
Proof of the Theorem by contradiction
[Figure: tree with OTUs i, j, k, l, r, s, internal nodes u, v, w, and attachment points x; region A marks type 1 and region B marks type 2.]
Type 1 (A): $-2D_{ux} - 2D_{uv}$. Type 2 (B): $-4D_{vx} + 2D_{uv}$. For the sum in formula (b) to be nonnegative, type 2 terms must be at least as numerous as type 1 terms.
Suppose that i and j are not neighbors. Let k and l be any pair of neighbors, so that i, j, k, and l are distinct and are represented in the tree. Consider the sum in formula (b), which is nonnegative. If m is a fifth OTU, then it joins the tree at a point x along one of the indicated arcs. Say that m is of type 1 if it joins the path from i to j at any node different from u, and that m is of type 2 if it joins the path from i to j at node u.
40
Proof of the theorem (Cont.)
If m is of type 1, then the corresponding summand in formula (b) is $-2D_{ux} - 2D_{uv}$. If m is of type 2, then the corresponding summand in formula (b) is $-4D_{vx} + 2D_{uv}$. For the sum in formula (b) to be nonnegative, there must be at least as many terms corresponding to OTUs m of type 2 as there are terms corresponding to OTUs m of type 1. It follows that there are more OTUs that join the path from i to j at u than there are OTUs that join that path at all other nodes combined.
Because neither i nor j has a neighbor, there must be a pair r, s of neighbors that joins the path from i to j at a node w that is different from u. By the above argument applied to w, there are more OTUs that join the path from i to j at w than there are OTUs that join that path at all other nodes combined. The conclusions about u and w contradict each other, and the theorem follows.
41
Speeding up Neighbor-Joining Tree Construction

In this paper, the authors present several
heuristics for speeding up the NJ method.
The heuristics attempt to reduce the search time
by using a quad-tree.
The worst-case time complexity remains $O(n^3)$, and the space complexity after adding the quad-tree is still $O(n^2)$.
The authors have implemented a tool, QuickJoin.

42
Previous Work

The neighbor-joining method was introduced by Saitou and Nei.
The algorithm was later amended by Studier and Keppler, with a running time of $O(n^3)$.
BIONJ -- Gascuel et al. produced an $O(n^3)$ implementation of a variant of the NJ algorithm that produces more accurate trees in many cases.
QuickTree -- Durbin et al. produced a code-optimized implementation of the NJ algorithm.

43
Appendix: Proof of neighbour-joining
44
+/- of distance methods

Advantages
easy to perform
quick calculation
suited to sequences with high similarity scores
Disadvantages
the sequences are not considered as such (loss of information)
all sites are generally treated equally (differences in substitution rates are not taken into account)
not applicable to distantly divergent sequences

45
Parsimony
46
Maximum Parsimony Method
Principle: search for the tree that requires the smallest number of character state changes between the OTUs.
Informative sites: those that favor some trees over others. Operationally: a site with at least two different kinds of residues, each of which is found in at least two of the OTU sequences.
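The operational definition of an informative site translates directly into code. The sketch below applies it to the five example sequences from this presentation's Example slide (A: CAGGTA, B: CAGACA, C: CGGGTA, D: TGCACT, E: TGCGTA); the function names are illustrative.

```python
from collections import Counter


def is_informative(column):
    """A site is informative if at least two different residues each
    occur in at least two of the sequences."""
    counts = Counter(column)
    return sum(1 for c in counts.values() if c >= 2) >= 2


def informative_sites(seqs):
    """Return the 0-based positions of informative sites in aligned seqs."""
    return [pos for pos, column in enumerate(zip(*seqs))
            if is_informative(column)]


seqs = ["CAGGTA", "CAGACA", "CGGGTA", "TGCACT", "TGCGTA"]
print(informative_sites(seqs))  # [0, 1, 2, 3, 4]
```

The last column (A, A, A, T, A) is excluded because T occurs only once, so that site does not favor any topology over another.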
47
Evaluating Parsimony Scores

How do we compute the Parsimony score for a given
tree?
Traditional Parsimony
Each base change has a cost of 1
Weighted Parsimony
Each change is weighted by the score c(a,b)

48
Traditional Parsimony
Solved independently for each position
Linear time solution
[Figure: example tree with node labels a, a, {a, g}, a.]
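The linear-time solution for unit costs is usually attributed to Fitch; it is an assumption that this is the procedure the slide illustrates. A minimal sketch follows: each node keeps the set of candidate states, and a change is counted whenever the two child sets are disjoint. The tree topology used below is hypothetical, and the sequences are the five from the Example slide.

```python
def fitch_site(tree, states):
    """Return (state_set, cost) for one alignment position.

    tree:   nested tuples of leaf names, e.g. (("A", "B"), "C").
    states: dict mapping leaf name -> residue at this position.
    """
    if isinstance(tree, str):                 # leaf
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    common = left_set & right_set
    if common:                                # children agree: no change
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, seqs):
    """Sum the per-position costs; positions are treated independently."""
    length = len(next(iter(seqs.values())))
    return sum(
        fitch_site(tree, {name: s[pos] for name, s in seqs.items()})[1]
        for pos in range(length)
    )


seqs = {"A": "CAGGTA", "B": "CAGACA", "C": "CGGGTA",
        "D": "TGCACT", "E": "TGCGTA"}
tree = ((("A", "B"), "C"), ("D", "E"))        # hypothetical topology
print(parsimony_score(tree, seqs))            # 8
```

Each node is visited once per position and does a constant amount of set work for a fixed alphabet, which is where the linear running time comes from.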
49
Traditional Parsimony
50
Evaluating Weighted Parsimony

Dynamic programming on the tree
Initialization
For each leaf i, set $S(i,a) = 0$ if i is labeled by a, otherwise $S(i,a) = \infty$
Iteration
If k is a node with children i and j, then $S(k,a) = \min_b (S(i,b) + c(a,b)) + \min_b (S(j,b) + c(a,b))$
Termination
The cost of the tree is $\min_a S(r,a)$, where r is the root
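The dynamic program above can be sketched for a single alignment position. The nested-tuple tree encoding, the DNA alphabet, and the two cost functions are assumptions made for the illustration.

```python
import math

ALPHABET = "ACGT"


def sankoff(tree, cost):
    """Return {a: S(node, a)} for the root of `tree`.

    tree: a single residue (leaf) or a pair (left_subtree, right_subtree).
    cost: function c(a, b) giving the cost of changing a into b.
    """
    if isinstance(tree, str):
        # Initialization: 0 for the observed residue, infinity otherwise.
        return {a: (0 if a == tree else math.inf) for a in ALPHABET}
    left, right = (sankoff(child, cost) for child in tree)
    # Iteration: S(k,a) = min_b(S(i,b)+c(a,b)) + min_b(S(j,b)+c(a,b)).
    return {
        a: min(left[b] + cost(a, b) for b in ALPHABET)
        + min(right[b] + cost(a, b) for b in ALPHABET)
        for a in ALPHABET
    }


def weighted_parsimony(tree, cost):
    """Termination: the cost of the tree is min_a S(root, a)."""
    return min(sankoff(tree, cost).values())


def unit_cost(a, b):
    """Traditional parsimony: every change costs 1."""
    return 0 if a == b else 1


def double_cost(a, b):
    """Hypothetical weighting: every change costs 2."""
    return 0 if a == b else 2


site = ((("G", "A"), "G"), ("A", "G"))    # residues at one position
print(weighted_parsimony(site, unit_cost))     # 2
print(weighted_parsimony(site, double_cost))   # 4
```

With unit costs the result coincides with the traditional parsimony score for the same position, which is a useful sanity check on any cost matrix implementation.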

51
Example
A: CAGGTA
B: CAGACA
C: CGGGTA
D: TGCACT
E: TGCGTA
52
Cost of Evaluating Parsimony

Score is evaluated on each position independently.
Scores are then summed over all positions.
If there are n nodes, m characters, and k
possible values for each character, then
complexity is O(nmk)
By keeping traceback information, we can
reconstruct most parsimonious values at each
ancestor node

53
Weighted Parsimony
54
Traditional Parsimony is not complete
55
Parsimony: searching over all trees by Branch and Bound
56
Assessing the trees: the bootstrap
57
(No Transcript)
58
Simultaneous alignment and phylogeny(1)
59
Inferring trees: Maximum Likelihood method

Maximum likelihood supposes a model of evolution along tree branches.
Strategy
Find the parameters (tree, branch lengths, substitution rate) that maximize the likelihood assigned to the data.
Note: the model of evolution does not include indels!
In the PHYLIP package: program PROTML

60
Probabilistic Methods

The phylogenetic tree represents a generative
probabilistic model (like HMMs) for the observed
sequences.
Background probabilities: $q(a)$
Mutation probabilities: $P(a \to b, t)$
Models for evolutionary mutations
Jukes-Cantor
Kimura 2-parameter model
Such models are used to derive the probabilities

61
Jukes-Cantor model

A model for mutation rates

Mutation occurs at a constant rate
Each nucleotide is equally likely to mutate into any other nucleotide, with rate α.

62
Kimura 2-parameter model

Allows a different rate for transitions and
transversions.

63
Mutation Probabilities

The rate matrix R is used to derive the mutation probability matrix S
S is obtained by integration. For Jukes-Cantor: (formula not transcribed)
q can be obtained by letting t go to infinity
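For the Jukes-Cantor model the integration has a well-known closed form, sketched below under the assumption that the rate parameter (called `alpha` here) is the per-target substitution rate: the probability of observing the same nucleotide after time t is $(1 + 3e^{-4\alpha t})/4$, and the probability of each specific different nucleotide is $(1 - e^{-4\alpha t})/4$.

```python
import math

NUCLEOTIDES = "ACGT"


def jc_matrix(alpha, t):
    """Jukes-Cantor substitution probabilities after time t.

    Returns a dict of dicts m with m[a][b] = P(a -> b, t).
    """
    decay = math.exp(-4.0 * alpha * t)
    same = 0.25 * (1.0 + 3.0 * decay)
    diff = 0.25 * (1.0 - decay)
    return {a: {b: (same if a == b else diff) for b in NUCLEOTIDES}
            for a in NUCLEOTIDES}


short = jc_matrix(alpha=1.0, t=0.1)
long_run = jc_matrix(alpha=1.0, t=50.0)
print(short["A"]["A"], short["A"]["G"])
print(long_run["A"]["A"], long_run["A"]["G"])
```

As t grows, every entry approaches 1/4, which is the uniform background distribution q for this model.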

64
Mutation Probabilities

Both models satisfy the following properties
Lack of memory
Reversibility
Exist stationary probabilities Pa s.t.

65
Probabilistic Approach

Given P,q, the tree topology and branch lengths,
we can compute

66
Computing the Tree Likelihood

We are interested in the probability of observed
data given tree and branch lengths
Computed by summing over internal nodes
This can be done efficiently using a tree upward
traversal pass.

67
Tree Likelihood Computation

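Define $P(L_k \mid a)$ = probability of the leaves below node k given that $x_k = a$
Init: for leaves, $P(L_k \mid a) = 1$ if $x_k = a$, 0 otherwise
Iteration: if k is a node with children i and j, then $P(L_k \mid a) = \sum_b P(a \to b, t_i)\, P(L_i \mid b) \cdot \sum_c P(a \to c, t_j)\, P(L_j \mid c)$
Termination: the likelihood is $\sum_a q(a)\, P(L_{root} \mid a)$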
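The upward pass can be sketched for one alignment position. This is an illustration under stated assumptions: the substitution model is Jukes-Cantor with a hypothetical rate `alpha`, the background distribution is uniform, and a tree is encoded as nested pairs `((left, t_left), (right, t_right))` with single residues at the leaves.

```python
import math

ALPHABET = "ACGT"


def jc_prob(a, b, t, alpha=1.0):
    """P(a -> b, t) under Jukes-Cantor (assumed model for this sketch)."""
    decay = math.exp(-4.0 * alpha * t)
    return 0.25 * (1.0 + 3.0 * decay) if a == b else 0.25 * (1.0 - decay)


def conditional_likelihoods(node):
    """Return {a: P(L_k | a)} for the subtree rooted at `node`."""
    if isinstance(node, str):
        # Init: 1 for the observed residue, 0 otherwise.
        return {a: (1.0 if a == node else 0.0) for a in ALPHABET}
    (left, t_left), (right, t_right) = node
    p_left = conditional_likelihoods(left)
    p_right = conditional_likelihoods(right)
    # Iteration: sum over the child states on each branch, then multiply.
    return {
        a: sum(jc_prob(a, b, t_left) * p_left[b] for b in ALPHABET)
        * sum(jc_prob(a, c, t_right) * p_right[c] for c in ALPHABET)
        for a in ALPHABET
    }


def site_likelihood(root, q=None):
    """Termination: sum over root states weighted by the background q."""
    q = q or {a: 0.25 for a in ALPHABET}
    cond = conditional_likelihoods(root)
    return sum(q[a] * cond[a] for a in ALPHABET)


# Hypothetical three-leaf tree with branch lengths.
tree = ((("A", 0.1), ("G", 0.2)), 0.05), ("A", 0.3)
print(site_likelihood(tree))
```

Under the independence assumption, the likelihood of a whole alignment is the product of these per-position values, usually accumulated as a sum of logarithms.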

68
Maximum Likelihood (ML)

Score each tree by
Assumption of independent positions
Branch lengths t can be optimized
Gradient ascent
EM
We look for the highest scoring tree
Exhaustive
Sampling methods (Metropolis)

69
Optimal Tree Search

Perform search over possible topologies

Parameter space
Parametric optimization (EM)
Local Maxima
70
Computational Problem

Such procedures are computationally expensive!
Computation of optimal parameters, per candidate,
requires non-trivial optimization step.
Spend non-negligible computation on a candidate,
even if it is a low scoring one.
In practice, such learning procedures can only
consider small sets of candidate structures

71
Max Likelihood versus Parsimony
[Figure: tree T on leaves 1-4 with unequal branch lengths (0.3, 0.1, 0.09, 0.1, 0.3), and the three candidate topologies T1, T2, T3.]

(Example from BSA p. 225)
Choose tree T, with unequal branch lengths.
Generate 1000 sequences of length N according to the probabilistic model.

(A) Reconstruction by ML
N      T1    T2    T3
20     419   339   242
100    638   204   158
500    904   61    35
2000   997   3     0

(B) Reconstruction by Parsimony
N      T1    T2    T3
20     396   378   224
100    405   515   79
500    404   594   2
2000   353   646   0

Conclusion: ML infers the right tree as N gets larger; Parsimony does not necessarily.
72
Max Likelihood versus NJ

(Example from BSA p. 225)
Choose tree T, with unequal branch lengths.
Generate 1000 sequences of length N according to the probabilistic model.

(A) Reconstruction by ML
N      T1    T2    T3
20     419   339   242
100    638   204   158
500    904   61    35
2000   997   3     0

(B) Reconstruction by NJ (No Transcript)

Conclusion: ML infers the right tree as N gets larger. If the probabilistic model is correct, the ML distances should be very close to additive, and therefore the NJ method predicts the correct tree.
73
Phylip - practicalities

Menu-driven, no command line
Input file format
First line: <number of sequences> <number of letters per sequence>
Next lines: sequences
The first ten characters are the sequence name
Then the sequence follows. Spaces and newlines are allowed.
Dashes (-) signify gaps
Example

    4 46
    hba1      MV-LSPADKTNVKAAWGKVG AHAGEYGAEALERMFLSFPTTKTYFP
    beta      MVHLTPEEKSAVTALWGKVN VDEVGGEALGRLLVVYPWTQRFFESF
    Myoglobin MGLSDGEWQLVLNVWGKVE ADIPGHGQEVLIRLFKGHPETLEKFD
    Leghemogl MGAFSEKQESLVKSSWEAFK QNVPHHSAVFYTLILEKAPAAQNMFS
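The input format can be read with a short parser. This is a minimal sketch of the rules listed on this slide only (sequential layout, ten-character name field, spaces and newlines ignored inside sequences); it is not PHYLIP's own reader, and `parse_phylip` is a hypothetical helper name. The usage example takes the first two sequences from the slide.

```python
def parse_phylip(text):
    """Parse sequential PHYLIP-style text into {name: sequence}.

    The first line gives the number of sequences and their length.
    Each record starts with a 10-character name field; the sequence may
    continue over several lines and may contain spaces.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    n_seqs, seq_len = (int(field) for field in lines[0].split())
    records = {}
    idx = 1
    for _ in range(n_seqs):
        name = lines[idx][:10].strip()
        residues = lines[idx][10:].replace(" ", "")
        idx += 1
        # Keep reading continuation lines until the sequence is complete.
        while len(residues) < seq_len:
            residues += lines[idx].replace(" ", "")
            idx += 1
        if len(residues) != seq_len:
            raise ValueError(f"{name}: expected {seq_len} letters, "
                             f"got {len(residues)}")
        records[name] = residues
    return records


text = "\n".join([
    "2 46",
    "hba1".ljust(10) + "MV-LSPADKTNVKAAWGKVG AHAGEYGAEALERM",
    " " * 10 + "FLSFPTTKTYFP",
    "beta".ljust(10) + "MVHLTPEEKSAVTALWGKVN VDEVGGEALGRLLVVYPWTQRFFESF",
])
for name, sequence in parse_phylip(text).items():
    print(name, len(sequence), sequence.count("-"))
```

Checking each sequence against the declared length catches truncated or mis-wrapped records early, before they reach a tree-building program.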
74
The End
