Title: Phylogenetic trees
1 Phylogenetic trees
- Jana Sperschneider
- Institute for Computer Science
- Albert-Ludwigs-Universität, Freiburg
2 Review from last talk
- All organisms share a common ancestor
- Phylogenetic tree displays evolutionary distances between objects
- Molecular phylogenetic analysis
- Choose homologous sequences
- Build distance matrix
- UPGMA or Neighbor-Joining algorithm
3 Contents
- 1. Properties of phylogenetic trees
- Ultrametric trees
- Additive trees
- 2. Distance based methods
- Least-squares method
- UPGMA
- Neighbor-Joining
- Maximum Parsimony methods
- Maximum Likelihood methods
- 3. Tree construction using partial distance matrices
- 4. Summary
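4 Metric spaces
- A metric space is a set of objects X such that to every pair i, j in X we associate a nonnegative real number d(i,j) with the following properties
- d(i,j) > 0 for i ≠ j, and d(i,i) = 0
- d(i,j) = d(j,i)
- d(i,j) ≤ d(i,k) + d(k,j) (triangle inequality)
- An ultrametric space has the special ultrametric inequality
- d(i,j) ≤ max{d(i,k), d(k,j)}
- An additive space has the special additive inequality
- d(i,j) + d(k,l) ≤ max{d(i,k) + d(j,l), d(i,l) + d(j,k)}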
5 Ultrametric trees
- An ultrametric tree is characterized by the 3-point condition
- For all objects i, j, k, two of the distances d(i,j), d(i,k), d(j,k) are equal and the third one is smaller or equal.
- [Figure: the three possible cases for the triple i, j, k]
- An ultrametric tree assumes a constant molecular clock.
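The 3-point condition can be checked directly on a distance matrix. The following is a minimal sketch, assuming the matrix is a symmetric list of lists; the function name `is_ultrametric` and the two example matrices are illustrative and not part of the talk.

```python
from itertools import combinations

def is_ultrametric(d, tol=1e-9):
    """Return True if every triple satisfies the 3-point condition."""
    n = len(d)
    for i, j, k in combinations(range(n), 3):
        smallest, middle, largest = sorted((d[i][j], d[i][k], d[j][k]))
        # The two largest distances of the triple must be equal.
        if abs(largest - middle) > tol:
            return False
    return True

# Hypothetical example data: the first matrix fits a clock-like tree.
clock_like = [
    [0, 2, 6, 6],
    [2, 0, 6, 6],
    [6, 6, 0, 4],
    [6, 6, 4, 0],
]
not_clock_like = [
    [0, 3, 7],
    [3, 0, 6],
    [7, 6, 0],
]

print(is_ultrametric(clock_like))      # True
print(is_ultrametric(not_clock_like))  # False
```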
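6 Additive trees
- An additive tree is characterized by the 4-point condition
- Given any four objects of X we can label them i, j, k, l such that
- d(i,j) + d(k,l) ≤ d(i,k) + d(j,l) = d(i,l) + d(j,k)
- two of the sums are equal and the third sum is smaller than or equal to the first two.
- [Figure: unrooted tree on four leaves, with i and j on one side of the internal branch and l and k on the other]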
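The 4-point condition can be tested the same way: for every quadruple, the two largest of the three pairwise sums must coincide. This is a sketch under the assumption of a symmetric list-of-lists matrix; `is_additive` and the example matrix (a quartet tree with leaf branches 1, 2, 3, 4 and an internal branch of length 5) are made up for illustration.

```python
from itertools import combinations

def is_additive(d, tol=1e-9):
    """Return True if every quadruple satisfies the 4-point condition."""
    n = len(d)
    for i, j, k, l in combinations(range(n), 4):
        sums = sorted((
            d[i][j] + d[k][l],
            d[i][k] + d[j][l],
            d[i][l] + d[j][k],
        ))
        # The two largest sums must be equal.
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True

# Distances read off a tree ((0,1),(2,3)) with leaf branches 1, 2, 3, 4
# and internal branch 5 (hypothetical values).
tree_like = [
    [0, 3, 9, 10],
    [3, 0, 10, 11],
    [9, 10, 0, 7],
    [10, 11, 7, 0],
]
perturbed = [row[:] for row in tree_like]
perturbed[0][3] = perturbed[3][0] = 12

print(is_additive(tree_like))  # True
print(is_additive(perturbed))  # False
```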
8 Distance based methods
- In reality, input data is neither additive nor ultrametric.
- Find the tree that minimizes the squared error between the distance from the input matrix d(i,j) and the distance in the tree, written t(i,j) here
- This optimization problem is NP-hard!
- Least-squares criterion
- SSQ = Σ_{i<j} (d(i,j) - t(i,j))²
- Weighted least-squares criterion
- SSQ_w = Σ_{i<j} w(i,j) (d(i,j) - t(i,j))²
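Both criteria compare the input distances d(i,j) with the path lengths in a candidate tree. A minimal sketch of the two sums, assuming the tree distances are already available as a matrix `t` (a name chosen here, as are the example values) and that `w` is an optional weight matrix:

```python
def ssq(d, t, w=None):
    """Sum of (weighted) squared differences over all pairs i < j."""
    n = len(d)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            weight = 1.0 if w is None else w[i][j]
            total += weight * (d[i][j] - t[i][j]) ** 2
    return total

# Hypothetical input distances and tree distances for three objects.
d = [[0, 3, 5], [3, 0, 6], [5, 6, 0]]
t = [[0, 4, 5], [4, 0, 7], [5, 7, 0]]
w = [[0, 0.5, 1], [0.5, 0, 1], [1, 1, 0]]

print(ssq(d, t))     # 2.0
print(ssq(d, t, w))  # 1.5
```

Setting a weight to 0 removes that pair from the sum, which is how missing entries can be ignored.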
9 Distance based methods
- Polynomial heuristics
- UPGMA
- very fast
- drawback: always produces an ultrametric, rooted tree
- Neighbor-Joining
- fast
- works well in practice
- does not assume a molecular clock
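To make the UPGMA heuristic concrete, here is a compact sketch. It assumes a complete symmetric matrix and represents internal nodes as tuples `(left, right, height)`. This representation and the example data are choices made here, not something taken from the talk.

```python
def upgma(d, names):
    """Build a rooted tree by UPGMA; internal nodes are (left, right, height)."""
    n = len(names)
    clusters = {i: (names[i], 1) for i in range(n)}  # id -> (subtree, size)
    dist = {(i, j): d[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        # Merge the two closest clusters.
        (a, b), d_ab = min(dist.items(), key=lambda item: item[1])
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        # Distance to each remaining cluster is the size-weighted average.
        merged = {}
        for c in clusters:
            d_ac = dist[(min(a, c), max(a, c))]
            d_bc = dist[(min(b, c), max(b, c))]
            merged[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        dist = {pair: v for pair, v in dist.items() if a not in pair and b not in pair}
        dist.update(merged)
        # The new node sits at half the merge distance (molecular clock).
        clusters[next_id] = ((tree_a, tree_b, d_ab / 2), size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0]

# Hypothetical clock-like distances for four objects.
matrix = [
    [0, 2, 6, 6],
    [2, 0, 6, 6],
    [6, 6, 0, 4],
    [6, 6, 4, 0],
]
print(upgma(matrix, ["A", "B", "C", "D"]))
# (('A', 'B', 1.0), ('C', 'D', 2.0), 3.0)
```

Because every merge places the new node at equal height above its two children, the result is always rooted and ultrametric, which is the drawback noted above when the data are not clock-like.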
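10 Maximum Parsimony
- Choose tree that explains data using the minimal number of substitutions.
- Two computational subproblems
- 1. Find the parsimony cost of a given tree (easy).
- 2. Search through all the tree topologies (NP-hard).
- Number of possible trees
- unrooted binary trees on n leaves: (2n-5)!! = 3 · 5 · ... · (2n-5)
- rooted binary trees on n leaves: (2n-3)!! = 3 · 5 · ... · (2n-3)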
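The search space grows very quickly. For n labelled leaves there are (2n-5)!! unrooted and (2n-3)!! rooted binary trees, a standard counting result. The short sketch below evaluates these formulas; the function names are chosen here for illustration.

```python
def num_unrooted_trees(n):
    """(2n-5)!! = 3 * 5 * ... * (2n-5) unrooted binary trees, n >= 3."""
    count = 1
    for k in range(3, 2 * n - 4, 2):
        count *= k
    return count

def num_rooted_trees(n):
    """(2n-3)!! rooted binary trees: root position acts like an extra leaf."""
    return num_unrooted_trees(n + 1)

for n in (4, 5, 12):
    print(n, num_unrooted_trees(n), num_rooted_trees(n))
# 4 3 15
# 5 15 105
# 12 654729075 13749310575
```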
11 Maximum Parsimony
- Example: sequences for 4 objects, 3 possible unrooted trees
- 1. AC
- 2. TC
- 3. TG
- 4. TG
- Output: most parsimonious tree (Tree 1)
- Tree 1: AC and TC on one side, TG and TG on the other, 2 mutations
- Trees 2 and 3: AC and TG on one side, TC and TG on the other, 3 mutations
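The mutation counts in this example can be reproduced in a few lines. The talk does not say how the cost of a given tree is computed; the sketch below assumes Fitch's small-parsimony algorithm (unrelated to the Fitch distance program discussed later) and roots each quartet on its internal branch, which does not change the count. The tuple encoding of the trees is a choice made here.

```python
def fitch(node, site):
    """Return (possible states, substitutions) for one alignment column."""
    if isinstance(node, str):
        return {site[node]}, 0
    left, right = node
    left_states, left_cost = fitch(left, site)
    right_states, right_cost = fitch(right, site)
    common = left_states & right_states
    if common:
        return common, left_cost + right_cost
    # No shared state: one substitution is needed at this node.
    return left_states | right_states, left_cost + right_cost + 1

def parsimony_cost(tree, seqs):
    """Sum the Fitch cost over all alignment columns."""
    length = len(next(iter(seqs.values())))
    return sum(
        fitch(tree, {name: s[pos] for name, s in seqs.items()})[1]
        for pos in range(length)
    )

seqs = {"1": "AC", "2": "TC", "3": "TG", "4": "TG"}
trees = {
    "Tree 1": (("1", "2"), ("3", "4")),
    "Tree 2": (("1", "3"), ("2", "4")),
    "Tree 3": (("1", "4"), ("2", "3")),
}
for name, tree in trees.items():
    print(name, parsimony_cost(tree, seqs))
# Tree 1 2
# Tree 2 3
# Tree 3 3
```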
12 Maximum Parsimony
- Searching through all tree topologies is not possible for more than 12 objects
- Heuristic methods
- useful for small sequences that are quite similar
- drawback: results often not optimal
- Branch and Bound
- Does not have to evaluate every possible tree
- drawback: method is limited
13 Maximum Likelihood
- Search through all trees to find the one with highest probability
- Assumed to be NP-complete
- For a given tree, we can calculate its likelihood score
- Markov model
- Dynamic programming
- Very time-consuming
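The likelihood score of one fixed tree can be computed by dynamic programming over the tree. The sketch below is one possible instance, not the talk's implementation. It assumes the Jukes-Cantor substitution model, a uniform base distribution at the root, and Felsenstein's pruning recursion. Internal nodes are encoded as tuples of `(child, branch_length)` pairs, and the branch lengths are arbitrary illustrative values.

```python
import math

BASES = "ACGT"

def jc_prob(a, b, t):
    """Jukes-Cantor probability of base a becoming base b along a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def conditional(node, site):
    """Likelihood of the data below node, for each possible base at node."""
    if isinstance(node, str):  # leaf: the observed base is certain
        return {x: 1.0 if x == site[node] else 0.0 for x in BASES}
    result = {x: 1.0 for x in BASES}
    for child, t in node:      # internal node: (child, branch length) pairs
        below = conditional(child, site)
        for x in BASES:
            result[x] *= sum(jc_prob(x, y, t) * below[y] for y in BASES)
    return result

def log_likelihood(tree, seqs):
    """Sum of per-site log-likelihoods, assuming independent sites."""
    n_sites = len(next(iter(seqs.values())))
    total = 0.0
    for pos in range(n_sites):
        site = {name: s[pos] for name, s in seqs.items()}
        cond = conditional(tree, site)
        total += math.log(sum(0.25 * cond[x] for x in BASES))
    return total

def quartet(a, b, c, d, t=0.1):
    """Tree ((a,b),(c,d)) with all leaf branches t, rooted on the internal branch."""
    return ((((a, t), (b, t)), t / 2), (((c, t), (d, t)), t / 2))

seqs = {"1": "AC", "2": "TC", "3": "TG", "4": "TG"}
ll_tree1 = log_likelihood(quartet("1", "2", "3", "4"), seqs)
ll_tree2 = log_likelihood(quartet("1", "3", "2", "4"), seqs)
print(ll_tree1 > ll_tree2)  # True
```

This covers only the scoring of a given tree with given branch lengths; optimizing branch lengths and searching over topologies is what makes the method slow.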
14 Which method to use?
- Distance based
- fast
- Maximum Parsimony
- Strong sequence similarity
- Maximum Likelihood
- Very slow
- Use only for small number of sequences
- Most packages (Phylip, TRex) include software for all three methods.
16 Incomplete distance matrix
- How do we construct a phylogenetic tree from an incomplete distance matrix?
- 1. Estimate missing cells
- Ultrametric procedure (De Soete, 1984)
- Additive procedure (Landry, 1996)
- 2. Constructive tree algorithms
- Triangle (Guénoche, 2001)
- Fitch (Felsenstein, 1997)
- Method of weights (MW) (Makarenkov, 2004)
17 1. Estimate missing cells
- Ultrametric procedure
- Use ultrametric inequality
- Given a missing entry d(i,j), look for an index k with known entries d(i,k) and d(j,k).
- The entry d(i,j) is set to the greater of the two others if and only if they are different.
- does not always return a complete distance matrix
- time complexity O(n³)
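The rule follows from the 3-point condition: if d(i,k) and d(j,k) differ, the missing distance must equal the larger of the two. A single-sweep sketch, assuming missing entries are stored as `None` (a convention chosen here); entries filled earlier in the sweep may be used for later ones.

```python
def fill_ultrametric(d):
    """Estimate missing entries (None) with the ultrametric rule, one sweep."""
    n = len(d)
    d = [row[:] for row in d]  # work on a copy
    for i in range(n):
        for j in range(i + 1, n):
            if d[i][j] is not None:
                continue
            for k in range(n):
                if k in (i, j):
                    continue
                a, b = d[i][k], d[j][k]
                # Only two known, different distances determine d(i,j).
                if a is not None and b is not None and a != b:
                    d[i][j] = d[j][i] = max(a, b)
                    break
    return d

# Hypothetical clock-like matrix with d(0,2) missing.
partial = [
    [0, 2, None, 6],
    [2, 0, 6, 6],
    [None, 6, 0, 4],
    [6, 6, 4, 0],
]
print(fill_ultrametric(partial)[0][2])  # 6

# Here d(0,1) stays missing: all usable triples have equal distances.
stuck = [
    [0, None, 6, 6],
    [None, 0, 6, 6],
    [6, 6, 0, 4],
    [6, 6, 4, 0],
]
print(fill_ultrametric(stuck)[0][1])  # None
```

The second example shows why the procedure does not always return a complete matrix.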
18 1. Estimate missing cells
- Additive procedure
- Use additive inequality
- Given a missing entry d(i,j), look for indices k, l such that d(i,k), d(j,k), d(i,l), d(j,l), and d(k,l) are known entries.
- The entry d(i,j) is set to the greater of the two sums (minus d(k,l)) if and only if they are different.
- does not always return a complete distance matrix
- time complexity O(n⁴)
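Here the two sums are d(i,k) + d(j,l) and d(i,l) + d(j,k). If they differ, the 4-point condition forces d(i,j) + d(k,l) to equal the larger one. A single-sweep sketch, again assuming `None` marks a missing entry and using exact comparison, which is only appropriate for integer or otherwise exact data:

```python
from itertools import combinations

def fill_additive(d):
    """Estimate missing entries (None) with the additive rule, one sweep."""
    n = len(d)
    d = [row[:] for row in d]  # work on a copy
    for i, j in combinations(range(n), 2):
        if d[i][j] is not None:
            continue
        others = [x for x in range(n) if x not in (i, j)]
        for k, l in combinations(others, 2):
            needed = (d[i][k], d[j][k], d[i][l], d[j][l], d[k][l])
            if any(v is None for v in needed):
                continue
            s1 = d[i][k] + d[j][l]
            s2 = d[i][l] + d[j][k]
            if s1 != s2:
                d[i][j] = d[j][i] = max(s1, s2) - d[k][l]
                break
    return d

# Hypothetical additive matrix (tree ((0,1),(2,3))) with d(0,2) missing.
partial = [
    [0, 3, None, 10],
    [3, 0, 10, 11],
    [None, 10, 0, 7],
    [10, 11, 7, 0],
]
print(fill_additive(partial)[0][2])  # 9

# d(0,1) cannot be recovered: the two sums are equal (20 and 20).
stuck = [
    [0, None, 9, 10],
    [None, 0, 10, 11],
    [9, 10, 0, 7],
    [10, 11, 7, 0],
]
print(fill_additive(stuck)[0][1])  # None
```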
19 2. Constructive tree algorithms
- PHYLIP package contains Fitch algorithm
- Triangle method
- TRex package contains Method of weights (MW)
20 2. Constructive tree algorithms: Fitch
- branch lengths are estimated by minimizing the weighted SSQ for a given tree topology
- Fitch-Margoliash criterion: w(i,j) = 1/d(i,j)²
- Greater distances are given less weight
21 2. Constructive tree algorithms: Fitch
- The algorithm
- Start with three species in an unrooted tree.
- Calculate least squares branch lengths.
- Where can the fourth species be added?
- 3 possible places, try them all
- Calculate least squares branch lengths for each topology -> choose the one with the smallest SSQ
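For the starting tree of three species the least-squares problem has a closed-form solution: the unrooted tree is a star with three branches, and three branch lengths can fit three distances exactly, so SSQ = 0. A small sketch with hypothetical names and numbers:

```python
def three_taxon_branches(d12, d13, d23):
    """Branch lengths (a, b, c) of the star tree joining species 1, 2 and 3."""
    a = (d12 + d13 - d23) / 2  # branch leading to species 1
    b = (d12 + d23 - d13) / 2  # branch leading to species 2
    c = (d13 + d23 - d12) / 2  # branch leading to species 3
    return a, b, c

a, b, c = three_taxon_branches(3, 9, 10)
print(a, b, c)              # 1.0 2.0 8.0
print(a + b, a + c, b + c)  # 3.0 9.0 10.0
```

From four species on, the branch lengths generally have to be found by solving a least-squares system for each candidate topology.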
22 2. Constructive tree algorithms: Fitch
- Continue in this fashion, add each species to all possible places. Pick the placement that minimizes the SSQ.
- After adding a species, a series of rearrangements is carried out, where branch lengths are recalculated.
- Complexity O(n⁴)
23 2. Constructive tree algorithms: Fitch
- What if we have missing entries in the distance matrix?
- Use the following weight option
- n(i,j) ∈ [0, 1]
- Missing entries: n(i,j) = 0
- Known entries: n(i,j) = 1
- Known entry i, unknown entry j: n(i,j) = 0.5
24 2. Constructive tree algorithms: Triangle
- Uses only a subset of the distance matrix
- A new element is placed in the tree according to two previously examined elements (triangle)
- Quite complicated algorithm
- For partial distance matrices O(n³)
25 2. Constructive tree algorithms: Method of weights (MW)
- Constructs a tree from a distance matrix with missing entries
- Step A: Use ultrametric or additive procedure to estimate missing entries.
- Compute weight matrix W.
- Step B: Apply weighted least-squares fitting algorithm.
26 2. Constructive tree algorithms: Method of weights (MW)
- Step A: Use ultrametric or additive procedure to estimate missing entries.
- Ultrametric for high percentages of missing entries and small distance matrices
- Additive for low number of missing entries and bigger distance matrices
- 24x24 matrix, < 40 missing entries -> additive procedure
27 2. Constructive tree algorithms: Method of weights (MW)
- Step A: Compute weight matrix W.
- Neither the ultrametric nor the additive procedure necessarily returns a complete distance matrix.
- Compute weight matrix W
28 2. Constructive tree algorithms: Method of weights (MW)
- Step B: Apply weighted least-squares fitting algorithm. Given distance matrix D and weight matrix W.
- 1. Choose taxa i and j, such that d(i,j) is a known distance in D. This gives tree T2 with leaves i and j.
- 2. Add the taxon k that maximizes the sum of weights w(i,k) + w(j,k); this gives tree T3. If we have more than one candidate, choose the one that minimizes SSQ. If we still have more than one candidate, choose the one that has the greatest score.
- 3. Continue in this manner
29 2. Constructive tree algorithms: Method of weights (MW)
- No fixed weighting function
- Time complexity O(n³)
- If we carry out the algorithm for all possible pairs of taxa in the first step, we get O(n⁵).
30 Data set and experiment
- Two distance matrices D
- 20 mammals (20x20 matrix)
- 34 species (34x34 matrix)
- Random deletion of fixed number of entries from D
- 50 to 100 known values
- Apply
- Triangle
- Fitch
- MW
- Ultrametric procedure followed by MW with weights set to 1
- Additive procedure followed by MW with weights set to 1
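The deletion step of such an experiment can be sketched as follows. This is an illustration only: the function name, the use of `None` for a deleted cell, and the fixed seed are assumptions made here, and the toy matrix is not the data set from the talk.

```python
import random

def delete_random_entries(d, n_missing, seed=0):
    """Return a copy of d with n_missing off-diagonal pairs set to None."""
    n = len(d)
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    result = [row[:] for row in d]
    for i, j in rng.sample(pairs, n_missing):
        # Delete symmetrically so the matrix stays consistent.
        result[i][j] = result[j][i] = None
    return result

# Toy 5 x 5 matrix: 10 pairs, delete 4 of them.
full = [[abs(i - j) for j in range(5)] for i in range(5)]
sparse = delete_random_entries(full, 4)
missing = sum(sparse[i][j] is None for i in range(5) for j in range(i + 1, 5))
print(missing)  # 4
```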
31 Results: Computational time, 34 x 34 matrix
32 Results: Tree construction, 20 x 20 matrix
- Lower Robinson-Foulds (RF) value indicates better tree recovery
33 Results: Tree construction, 34 x 34 matrix
- Lower RF value indicates better tree recovery
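The RF (Robinson-Foulds) value counts the bipartitions that occur in one tree but not in the other, so 0 means identical topologies. A minimal sketch for trees on the same leaf set, encoded as nested tuples of leaf names (an encoding chosen here for illustration):

```python
def leaf_set(node):
    """All leaf names at or below node."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaf_set(child) for child in node))

def splits(tree):
    """Non-trivial bipartitions, each stored as the side without the reference leaf."""
    taxa = leaf_set(tree)
    ref = min(taxa)
    found = set()

    def visit(node):
        below = leaf_set(node)
        # Skip trivial splits (a single leaf against all the others).
        if 1 < len(below) < len(taxa) - 1:
            found.add(below if ref not in below else taxa - below)
        if not isinstance(node, str):
            for child in node:
                visit(child)

    visit(tree)
    return found

def rf_distance(t1, t2):
    """Number of bipartitions present in exactly one of the two trees."""
    return len(splits(t1) ^ splits(t2))

true_tree = (("1", "2"), ("3", "4"))
same_tree_other_root = ("1", ("2", ("3", "4")))
wrong_tree = (("1", "3"), ("2", "4"))

print(rf_distance(true_tree, same_tree_other_root))  # 0
print(rf_distance(true_tree, wrong_tree))            # 2
```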
34 Results
- Triangle
- very fast
- drawback: worst results, never recovered the correct tree
- Fitch
- best results for 34 x 34 matrix
- drawback: slow
- MW
- best results for 20 x 20 matrix
- fast for a high percentage of missing entries
- Additive procedure + MW
- good results for 20 x 20 matrix
- drawback: worst performance for high number of missing cells
- Ultrametric procedure + MW
- good results for higher number of missing cells
35 Summary
- Phylogenetic trees are an important tool in understanding how objects evolve through time
- Real data is not perfect, which leads to optimization problems
- Can be NP-hard, use heuristics
- Methods
- Distance based
- Maximum Parsimony
- Maximum Likelihood
- No method is superior under all conditions
36 References
- Böckenhauer, Bongartz. Algorithmische Grundlagen der Bioinformatik. Teubner 2003.
- Makarenkov, Lapointe. (2004). A weighted least-squares approach for inferring phylogenies from incomplete distance matrices. Bioinformatics 20, 2113-2121.
- Felsenstein. (1997). An alternating least squares approach to inferring phylogenies from pairwise distances. Syst. Zool., 46, 101-111.
- Guénoche, Leclerc. (2001). The triangle method to build X-trees from incomplete distance matrices. RAIRO Oper. Res., 35, 283-300.
- Ron Shamir, Phylogenetic Trees, www.math.tau.ac.il/rshamir/algmb/01/scribe08/lec08.pdf
- Mona Singh, Phylogenetics, http://www.cs.princeton.edu/mona/Lecture/phylogeny-slides.pdf
- Phylip Software Package, http://evolution.genetics.washington.edu/phylip.html
37 The end
- Thanks for listening!
- Any questions?