Title: CS 394C: Computational Biology Algorithms
1. CS 394C: Computational Biology Algorithms
- Tandy Warnow
- Department of Computer Sciences
- University of Texas at Austin
2. DNA Sequence Evolution
3. Molecular Systematics
[Figure: a phylogenetic tree relating five taxa (U, V, W, X, Y), shown with the sequences TAGCCCA, TAGACTT, TGCACAA, TGCGCTT, and AGGGCAT.]
4. Phylogeny estimation methods
- Distance-based methods (neighbor joining, the Naïve Quartet Method (NQM), and others): mostly statistically consistent and polynomial time
- Maximum parsimony and maximum compatibility: NP-hard and not statistically consistent
- Maximum likelihood: NP-hard and usually statistically consistent (if solved exactly)
- Bayesian methods: statistically consistent if run long enough
5. Distance-based methods
- Theorem: Let $(T, \theta)$ be a Cavender-Farris model tree with additive matrix $\lambda(i,j)$, and let $\varepsilon > 0$ be given. The sequence length that suffices for NJ (neighbor joining) and the Naïve Quartet Method to be accurate with probability at least $1 - \varepsilon$ is $O(\log n \cdot e^{O(\max_{i,j} \lambda(i,j))})$.
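The distance-based setting of the theorem can be illustrated with a minimal Python sketch. It is not the procedure analyzed in the theorem: it only shows the standard Cavender-Farris distance correction, -1/2 ln(1 - 2p) for an observed fraction p of differing sites in two-state sequences, together with the four-point condition that quartet-based methods build on. The function names and the toy sequences are hypothetical.

```python
import math
from itertools import combinations


def cf_distance(s, t):
    """Corrected Cavender-Farris distance between two equal-length binary sequences."""
    if len(s) != len(t):
        raise ValueError("sequences must have the same length")
    p = sum(a != b for a, b in zip(s, t)) / len(s)  # observed fraction of differing sites
    if p >= 0.5:
        return math.inf  # saturated: the correction is undefined
    return -0.5 * math.log(1 - 2 * p)


def four_point_split(d, a, b, c, e):
    """Return the pairing of four taxa with the smallest sum of within-pair distances."""
    pairings = [((a, b), (c, e)), ((a, c), (b, e)), ((a, e), (b, c))]
    return min(pairings, key=lambda pr: d[frozenset(pr[0])] + d[frozenset(pr[1])])


# Hypothetical two-state sequences of length 20 (not taken from the slides).
seqs = {
    "A": "00000000000000000000",
    "B": "10000000000000000000",
    "C": "00000011110000000000",
    "D": "00000011111000000000",
}
dist = {frozenset((x, y)): cf_distance(seqs[x], seqs[y])
        for x, y in combinations(seqs, 2)}
print(four_point_split(dist, "A", "B", "C", "D"))
```

On these toy sequences the sketch selects the pairing (A, B) with (C, D). With short sequences and large distances the estimated p approaches 0.5 and the correction becomes unstable, which is the intuition behind the exponential dependence on the maximum distance in the bound.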
6. Neighbor joining (although statistically consistent) has poor performance on large-diameter trees (Nakhleh et al. ISMB 2001)
- Simulation study based upon fixed edge lengths, the K2P model of evolution, and sequence lengths fixed to 1000 nucleotides.
- Error rates reflect the proportion of incorrect edges in inferred trees.
[Figure: error rate (0 to 0.8) of NJ plotted against the number of taxa (0 to 1600).]
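One common way to make "proportion of incorrect edges" concrete is to compare the bipartitions (splits) of the taxon set induced by the internal edges of each tree. The sketch below assumes the missing-edge (false negative) convention, counting true-tree splits absent from the inferred tree; the study may have used a different variant. The function names and the two five-taxon trees are hypothetical.

```python
def split(side, taxa):
    """Canonical representation of the bipartition {side | rest of taxa}."""
    side = frozenset(side)
    return frozenset({side, frozenset(taxa) - side})


def missing_edge_rate(true_splits, inferred_splits):
    """Fraction of internal edges of the true tree that the inferred tree lacks."""
    true_splits = set(true_splits)
    if not true_splits:
        return 0.0
    return len(true_splits - set(inferred_splits)) / len(true_splits)


taxa = ["U", "V", "W", "X", "Y"]
# Hypothetical true tree ((U,V),W,(X,Y)) and inferred tree ((U,W),V,(X,Y)).
true_tree = [split({"U", "V"}, taxa), split({"X", "Y"}, taxa)]
inferred_tree = [split({"U", "W"}, taxa), split({"X", "Y"}, taxa)]
print(missing_edge_rate(true_tree, inferred_tree))  # 0.5
```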
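7. Maximum Parsimony
- Input: Set S of n aligned sequences of length k
- Output: A phylogenetic tree T
  - leaf-labeled by the sequences in S
  - with additional sequences of length k labeling the internal nodes of T
  - such that $\sum_{(i,j) \in E(T)} H(i,j)$ is minimized, where $H(i,j)$ is the Hamming distance between the sequences at the endpoints of edge $(i,j)$.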
8. Maximum parsimony (example)
- Input: Four sequences
  - ACT
  - ACA
  - GTT
  - GTA
- Question: which of the three trees has the best MP score?
9. Maximum Parsimony
[Figure: the three possible unrooted trees on the leaves ACT, ACA, GTT, and GTA.]
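10. Maximum Parsimony
[Figure: the three trees with sequences assigned to the internal nodes and the number of changes marked on each edge. The labelings shown have MP scores 7, 5, and 4; the tree with score 4, which groups ACT with ACA and GTT with GTA, is the optimal MP tree.]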
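The score of a fixed tree can be computed without guessing internal labels. The sketch below uses Fitch's algorithm (a standard technique, not named on the slides) on the four example sequences; the nested-tuple tree encoding and the function names are assumptions made for illustration. Each unrooted tree is rooted on its middle edge, which does not change the parsimony score.

```python
def first_leaf(node):
    """Return the leftmost leaf (a sequence string) of a nested-tuple tree."""
    while not isinstance(node, str):
        node = node[0]
    return node


def fitch_score(tree):
    """Minimum number of changes on a rooted binary tree whose leaves are sequences."""
    total = 0
    for site in range(len(first_leaf(tree))):

        def visit(node):
            nonlocal total
            if isinstance(node, str):
                return {node[site]}          # a leaf contributes its observed state
            left, right = (visit(child) for child in node)
            common = left & right
            if common:
                return common                # children agree: no change needed
            total += 1                       # children disagree: one change
            return left | right

        visit(tree)
    return total


trees = {
    "ACT,ACA | GTT,GTA": (("ACT", "ACA"), ("GTT", "GTA")),
    "ACT,GTT | ACA,GTA": (("ACT", "GTT"), ("ACA", "GTA")),
    "ACT,GTA | ACA,GTT": (("ACT", "GTA"), ("ACA", "GTT")),
}
for name, tree in trees.items():
    print(name, fitch_score(tree))  # prints 4, 5, and 6 respectively
```

Fitch's algorithm minimizes over all internal labelings, so its value for a topology can be lower than the score of one particular labeling drawn for that topology. For the tree pairing ACT with GTA the minimum is 6, so the score of 7 reported above presumably reflects the specific internal sequences shown on the slide. The ranking of the three trees is unchanged.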
11. Maximum Parsimony
12. Solving NP-hard problems exactly is unlikely
leaves    trees
4         3
5         15
6         105
7         945
8         10395
9         135135
10        2027025
20        2.2 x 10^20
100       1.7 x 10^182
1000      1.9 x 10^2860
- Number of (unrooted) binary trees on n leaves is (2n-5)!!
- If each tree on 1000 taxa could be analyzed in 0.001 seconds, we would find the best tree in about 6 x 10^2846 millennia
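The count (2n-5)!! and the running-time estimate can be checked directly with exact integer arithmetic. The following is a small sketch; the function names, the truncating formatter, and the 365.25-day year are assumptions made for illustration.

```python
SECONDS_PER_MILLENNIUM = 1000 * 31_557_600  # assumes 365.25-day years


def num_unrooted_binary_trees(n):
    """(2n-5)!!, the number of unrooted binary trees on n labeled leaves (n >= 3)."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    for odd in range(3, 2 * n - 4, 2):  # odd factors 3, 5, ..., 2n-5
        count *= odd
    return count


def sci(value):
    """Format a large positive integer as 'd.d x 10^e' (mantissa truncated, not rounded)."""
    digits = str(value)
    if len(digits) <= 8:
        return digits
    return f"{digits[0]}.{digits[1]} x 10^{len(digits) - 1}"


for n in (4, 5, 6, 7, 8, 9, 10, 20, 100, 1000):
    print(n, sci(num_unrooted_binary_trees(n)))

# Exhaustive search over all trees on 1000 taxa at 0.001 seconds per tree.
trees = num_unrooted_binary_trees(1000)
print("millennia:", sci(trees // (1000 * SECONDS_PER_MILLENNIUM)))
```

Python integers are arbitrary precision, so the value for n = 1000 is computed exactly rather than overflowing a floating-point type.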
13. Approaches for solving MP and ML (and other NP-hard problems in phylogeny)
- Hill-climbing heuristics (which can get stuck in local optima)
- Randomized algorithms for getting out of local optima
- Approximation algorithms for MP (based upon Steiner Tree approximation algorithms); however, the approximation ratio that is needed is probably 1.01 or smaller!
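The first two approaches can be illustrated on a toy problem. In the sketch below a short list of costs stands in for the space of trees, and "neighbors" are adjacent positions rather than tree rearrangements; random restarts are used as one simple example of a randomized escape from local optima. The landscape, the function names, and the seed are all hypothetical.

```python
import random

# Hypothetical cost landscape (lower is better): a local optimum at index 1
# (cost 4) and the global optimum at index 6 (cost 1).
SCORES = [5, 4, 6, 7, 3, 2, 1, 2, 8, 9]


def neighbors(i):
    """Positions adjacent to i; a stand-in for trees one rearrangement away."""
    return [j for j in (i - 1, i + 1) if 0 <= j < len(SCORES)]


def hill_climb(start):
    """Move to the best neighbor until no neighbor improves the cost."""
    current = start
    while True:
        best = min(neighbors(current), key=SCORES.__getitem__)
        if SCORES[best] >= SCORES[current]:
            return current  # local optimum reached
        current = best


def random_restarts(num_starts, seed=0):
    """Run hill climbing from several random starts and keep the best result."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(SCORES)) for _ in range(num_starts)]
    return min((hill_climb(s) for s in starts), key=SCORES.__getitem__)


print(hill_climb(0))        # 1: stuck at the local optimum with cost 4
print(hill_climb(4))        # 6: reaches the global optimum with cost 1
print(random_restarts(50))
```

A single run started at index 0 never leaves the local optimum, while repeated starts make it likely that at least one run begins in the basin of the global optimum.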
14. Problems with techniques for MP and ML
Shown here is the performance of a TNT heuristic
maximum parsimony analysis on a real dataset of
almost 14,000 sequences. (Optimal here means
best score to date, using any method for any
amount of time.) Acceptable error is below 0.01.
[Figure: performance of TNT with time.]
15. MP and Cavender-Farris
- Consider a tree (AB,CD) with two very long branches leading to A and C, and all other branches very short.
- MP will be statistically inconsistent (and positively misleading) on this tree.
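The inconsistency can be checked numerically on this four-taxon tree. Under the two-state Cavender-Farris model, the only parsimony-informative site patterns are those that split the taxa two against two, and with four taxa MP prefers the split supported by the most informative sites. The sketch below computes the expected frequency of each pattern on the true tree (AB,CD); the change probabilities 0.4 (long edges) and 0.05 (short edges) and the function names are hypothetical choices for illustration.

```python
from itertools import product


def pattern_probability(pattern, p_long, q_short):
    """Probability of leaf states (A, B, C, D) on the tree (AB,CD) under the CF model.

    Edges to A and C change state with probability p_long; the edges to B and D
    and the internal edge change with probability q_short.
    """
    a, b, c, d = pattern

    def edge(x, y, change):
        return change if x != y else 1 - change

    total = 0.0
    for u, v in product((0, 1), repeat=2):  # states of the two internal nodes
        total += (0.5                        # uniform distribution at internal node u
                  * edge(u, a, p_long) * edge(u, b, q_short)
                  * edge(u, v, q_short)
                  * edge(v, c, p_long) * edge(v, d, q_short))
    return total


def split_support(p_long, q_short):
    """Expected frequency of informative sites supporting each of the three splits."""
    patterns = {"AB|CD": (0, 0, 1, 1), "AC|BD": (0, 1, 0, 1), "AD|BC": (0, 1, 1, 0)}
    # Factor 2 accounts for the complementary pattern (all states flipped).
    return {name: 2 * pattern_probability(pat, p_long, q_short)
            for name, pat in patterns.items()}


support = split_support(p_long=0.4, q_short=0.05)
print(support)
print(max(support, key=support.get))  # AC|BD
```

With these values the wrong split AC|BD, which groups the two long branches, has the largest expected support, so adding more sites only makes MP more confident in the wrong tree. Setting both probabilities to 0.05 makes the true split AB|CD the best supported.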
16. Problems with existing phylogeny reconstruction methods
- Polynomial time methods (generally based upon distances) have poor accuracy with large diameter datasets.
- Heuristics for NP-hard optimization problems take too long (months to reach acceptable local optima).
17. Warnow et al.: Meta-algorithms for phylogenetics
- Basic technique: determine the conditions under which a phylogeny reconstruction method does well (or poorly), and design a divide-and-conquer strategy (specific to the method) to improve its performance.
- Warnow et al. developed a class of divide-and-conquer methods, collectively called DCMs (Disk-Covering Methods). These are based upon chordal graph theory to give fast decompositions and provable performance guarantees.
18. Disk-Covering Method (DCM)
19. Improving phylogeny reconstruction methods using DCMs
- Improving the theoretical convergence rate and performance of polynomial time distance-based methods using DCM1
- Speeding up heuristics for NP-hard optimization problems (Maximum Parsimony and Maximum Likelihood) using Rec-I-DCM3
20. DCM1 (Warnow, St. John, and Moret, SODA 2001)
[Figure: diagram with the labels "Exponentially converging method", "DCM", "SQS", and "Absolute fast converging method".]
- A two-phase procedure which reduces the sequence length requirement of methods. The DCM phase produces a collection of trees, and the SQS phase picks the best tree.
- The base method is applied to subsets of the original dataset. When the base method is NJ, you get DCM1-NJ.
21. DCM1-boosting distance-based methods (Nakhleh et al. ISMB 2001)
- Theorem: DCM1-NJ converges to the true tree from polynomial length sequences
[Figure: error rate (0 to 0.8) of NJ and DCM1-NJ plotted against the number of taxa (0 to 1600).]
22. Rec-I-DCM3 significantly improves performance (Roshan et al. CSB 2004)
[Figure: comparison of TNT to Rec-I-DCM3(TNT) on one large dataset, with curves labeled "Current best techniques" and "DCM boosted version of best techniques".]
Similar improvements obtained for RAxML (maximum likelihood).
23. Summary (so far)
- Optimization problems in biology are almost all NP-hard, and heuristics may run for months before finding local optima.
- The challenge here is to find better heuristics, since exact solutions are very unlikely to ever be achievable on large datasets.
24. Summary
- NP-hard optimization problems abound in phylogeny reconstruction, and in computational biology in general, and need very accurate solutions
- Many real problems have beautiful and natural combinatorial and graph-theoretic formulations