CS 394C: Computational Biology Algorithms - PowerPoint PPT Presentation

Slides: 25

Provided by: tandyw

Learn more at: https://www.cs.utexas.edu

Transcript and Presenter's Notes

1
CS 394C Computational Biology Algorithms

Tandy Warnow
Department of Computer Sciences
University of Texas at Austin

2
DNA Sequence Evolution
3
Molecular Systematics
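[Figure: five taxa U, V, W, X, Y with the DNA sequences TAGCCCA, TAGACTT, TGCACAA, TGCGCTT, AGGGCAT, and a tree whose leaves appear in the order X, U, Y, V, W.]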
4
Phylogeny estimation methods

Distance-based methods (neighbor joining, the Naïve Quartet Method (NQM), and others): mostly statistically consistent and polynomial time
Maximum parsimony and maximum compatibility: NP-hard and not statistically consistent
Maximum likelihood: NP-hard and usually statistically consistent (if solved exactly)
Bayesian methods: statistically consistent if run long enough

5
Distance-based methods

Theorem: Let $(T, \theta)$ be a Cavender-Farris model
tree with additive matrix $\lambda(i,j)$, and let $\epsilon > 0$ be
given. The sequence length that suffices for NJ
(neighbor joining) and the Naïve Quartet Method to be
accurate with probability at least $1 - \epsilon$
is $O(\log n \cdot e^{O(\max \lambda(i,j))})$.
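
To make "additive matrix" concrete, here is a minimal sketch that builds the path-length matrix of a four-leaf tree and applies the four-point condition to it, the per-quartet test that quartet-based methods build on. The edge lengths, the leaf names, and the function names are assumptions chosen for illustration; none of them come from the lecture.

```python
from itertools import combinations

# Hypothetical edge lengths for the quartet tree ((A,B),(C,D)).
PENDANT = {"A": 0.9, "B": 0.1, "C": 0.8, "D": 0.2}
INTERNAL = 0.05
# Which end of the internal edge each leaf hangs from.
SIDE = {"A": 0, "B": 0, "C": 1, "D": 1}


def additive_matrix():
    """Return path-length distances between all pairs of leaves."""
    dist = {}
    for x, y in combinations("ABCD", 2):
        length = PENDANT[x] + PENDANT[y]
        if SIDE[x] != SIDE[y]:
            length += INTERNAL  # the path crosses the internal edge
        dist[frozenset((x, y))] = length
    return dist


def four_point_split(dist):
    """Return the pairing of leaves whose two distances have the smallest sum."""
    pairings = [
        (("A", "B"), ("C", "D")),
        (("A", "C"), ("B", "D")),
        (("A", "D"), ("B", "C")),
    ]

    def total(pairing):
        return sum(dist[frozenset(pair)] for pair in pairing)

    return min(pairings, key=total)


print(four_point_split(additive_matrix()))
```

For an additive matrix the two larger sums are equal and exceed the smallest by twice the internal edge length, so the smallest sum identifies the true split even when the pendant edges to A and C are long.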

6
Neighbor joining (although statistically
consistent) has poor performance on large-diameter
trees [Nakhleh et al. ISMB 2001]

Simulation study based upon fixed edge lengths,
K2P model of evolution, sequence lengths fixed to
1000 nucleotides.
Error rates reflect proportion of incorrect edges
in inferred trees.
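
One way to compute such an error rate is to compare the internal edges (bipartitions of the leaf set) of the inferred tree with those of the true tree. The sketch below assumes trees are written as nested tuples of leaf names and counts the fraction of inferred internal edges that do not appear in the true tree; the representation and the function names are illustrative assumptions, not the code used in the study.

```python
def leaf_set(tree):
    """Return the set of leaf names below a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def internal_splits(tree):
    """Return the nontrivial bipartitions (internal edges) of a tree.

    Each bipartition is stored as the side that does not contain a fixed
    reference leaf, so the result does not depend on where the tree is rooted.
    """
    everything = leaf_set(tree)
    reference = min(everything)
    splits = set()

    def visit(node):
        below = leaf_set(node)
        side = everything - below if reference in below else below
        # Keep only splits with at least two leaves on each side.
        if 2 <= len(side) <= len(everything) - 2:
            splits.add(side)
        if not isinstance(node, str):
            for child in node:
                visit(child)

    visit(tree)
    return splits


def edge_error_rate(true_tree, inferred_tree):
    """Fraction of the inferred tree's internal edges that are absent from the true tree."""
    inferred = internal_splits(inferred_tree)
    if not inferred:
        return 0.0
    return len(inferred - internal_splits(true_tree)) / len(inferred)


true_tree = (("A", "B"), ("C", ("D", "E")))
inferred_tree = (("A", "C"), ("B", ("D", "E")))
print(edge_error_rate(true_tree, inferred_tree))  # 0.5
```

In the example the inferred tree recovers the edge separating D and E from the rest but not the edge separating A and B, so one of its two internal edges is wrong.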

[Figure: NJ error rate (vertical axis, 0 to 0.8) against number of taxa (horizontal axis, 0 to 1600).]
7
Maximum Parsimony

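Input: Set S of n aligned sequences of length k
Output: A phylogenetic tree T
leaf-labeled by sequences in S,
with additional sequences of length k labeling the
internal nodes of T,
such that $\sum_{(i,j) \in E(T)} H(i,j)$ is minimized, where $H(i,j)$ is the Hamming distance between the sequences labeling nodes i and j.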
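
The quantity being minimized is the length of a fully labeled tree: the sum of Hamming distances over its edges. A minimal sketch, assuming the tree is given as a list of edges whose endpoints are already labeled by sequences; the example labeling is one hypothetical choice of internal sequences for four short sequences.

```python
def hamming(s, t):
    """Number of positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(s, t))


def tree_length(edges):
    """Sum of Hamming distances over the edges of a fully labeled tree."""
    return sum(hamming(u, v) for u, v in edges)


# Leaves ACT and ACA attach to an internal node labeled ACA (hypothetical),
# leaves GTT and GTA attach to an internal node labeled GTA (hypothetical),
# and the two internal nodes are joined by the middle edge.
edges = [
    ("ACT", "ACA"),
    ("ACA", "ACA"),
    ("ACA", "GTA"),
    ("GTT", "GTA"),
    ("GTA", "GTA"),
]
print(tree_length(edges))  # 4
```

Maximum parsimony asks for the topology and the internal labels that make this sum as small as possible.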

8
Maximum parsimony (example)

Input: Four sequences
ACT
ACA
GTT
GTA
Question: Which of the three trees has the best
MP score?

9
Maximum Parsimony
[Figure: the three unrooted binary trees on the four sequences, pairing ACT with ACA, with GTT, or with GTA.]
10
Maximum Parsimony
[Figure: the same three trees with sequences at the internal nodes and Hamming distances on the edges. The trees shown have MP score 7, MP score 5, and MP score 4; the tree with MP score 4 is the optimal MP tree.]
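
The best score for a fixed topology can be computed site by site with Fitch's algorithm, which minimizes over all internal labelings. The sketch below applies it to the three topologies on ACT, ACA, GTT, GTA; the nested-tuple tree encoding and the function names are assumptions made for the example. Note that the minimum for the topology pairing ACT with GTA is 6, so a length of 7 arises only under a non-optimal internal labeling.

```python
def fitch(tree, site):
    """Return (candidate state set, number of changes) for one site.

    `tree` is a leaf sequence (str) or a pair of subtrees; an unrooted
    quartet tree is rooted on its internal edge, which does not change
    the parsimony score.
    """
    if isinstance(tree, str):
        return {tree[site]}, 0
    left, right = tree
    left_states, left_cost = fitch(left, site)
    right_states, right_cost = fitch(right, site)
    common = left_states & right_states
    if common:
        return common, left_cost + right_cost
    # Disjoint state sets force one extra change on this subtree.
    return left_states | right_states, left_cost + right_cost + 1


def first_leaf(tree):
    """Return any leaf sequence, used to read off the sequence length."""
    while not isinstance(tree, str):
        tree = tree[0]
    return tree


def parsimony_score(tree):
    """Minimum number of changes over all internal labelings of the tree."""
    return sum(fitch(tree, site)[1] for site in range(len(first_leaf(tree))))


topologies = {
    "ACT,ACA | GTT,GTA": (("ACT", "ACA"), ("GTT", "GTA")),
    "ACT,GTT | ACA,GTA": (("ACT", "GTT"), ("ACA", "GTA")),
    "ACT,GTA | ACA,GTT": (("ACT", "GTA"), ("ACA", "GTT")),
}
for name, tree in topologies.items():
    print(name, parsimony_score(tree))
```

The tree that pairs ACT with ACA scores 4 and is the optimal MP tree, in agreement with the slide.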
11
Maximum Parsimony
12
Solving NP-hard problems exactly is unlikely
Leaves   Trees
4        3
5        15
6        105
7        945
8        10395
9        135135
10       2027025
20       $2.2 \times 10^{20}$
100      $1.7 \times 10^{182}$
1000     $1.9 \times 10^{2860}$

Number of (unrooted) binary trees on n leaves is
$(2n-5)!!$
If each tree on 1000 taxa could be analyzed in
0.001 seconds, we would find the best tree in
about $6 \times 10^{2846}$ millennia
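
The counts follow directly from the double factorial and can be checked with exact integer arithmetic. This is a small sketch; the function names and the 365.25-day year used for the time estimate are assumptions of the example, while the rate of one tree per 0.001 seconds is the slide's.

```python
def num_unrooted_binary_trees(n):
    """Return (2n-5)!!, the number of unrooted binary trees on n >= 3 labeled leaves."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    for odd in range(3, 2 * n - 4, 2):  # odd factors 3, 5, ..., 2n-5
        count *= odd
    return count


def scientific(value):
    """Format a positive integer as 'd.d x 10^e', truncating (not rounding) the mantissa."""
    digits = str(value)
    if len(digits) < 8:
        return digits
    return f"{digits[0]}.{digits[1]} x 10^{len(digits) - 1}"


# Rate taken from the slide: one tree is scored every 0.001 seconds.
TREES_PER_SECOND = 1000
# Assumption: a millennium is 1000 Julian years of 365.25 days.
SECONDS_PER_MILLENNIUM = 1000 * 31_557_600

for n in (4, 5, 6, 7, 8, 9, 10, 20, 100, 1000):
    print(n, scientific(num_unrooted_binary_trees(n)))

millennia = num_unrooted_binary_trees(1000) // (TREES_PER_SECOND * SECONDS_PER_MILLENNIUM)
print("millennia for 1000 taxa:", scientific(millennia))
```

Exact evaluation gives a 183-digit count for 100 leaves and a 2861-digit count for 1000 leaves, which is why exhaustive search is out of the question well before datasets reach realistic sizes.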

13
Approaches for solving MP and ML (and other
NP-hard problems in phylogeny)

Hill-climbing heuristics (which can get stuck in
local optima)
Randomized algorithms for getting out of local
optima
Approximation algorithms for MP (based upon
Steiner Tree approximation algorithms) --
however, the approx. ratio that is needed is
probably 1.01 or smaller!

14
Problems with techniques for MP and ML
Shown here is the performance of a TNT heuristic
maximum parsimony analysis on a real dataset of
almost 14,000 sequences. (Optimal here means
best score to date, using any method for any
amount of time.) Acceptable error is below 0.01.
[Figure: Performance of TNT with time]
15
MP and Cavender-Farris

Consider a tree (AB,CD) with two very long
branches leading to A and C, and all other
branches very short.
MP will be statistically inconsistent (and
positively misleading) on this tree.
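
The effect can be checked exactly for a four-leaf tree. On a quartet with binary characters, the MP tree is the topology supported by the most parsimony-informative sites, so it suffices to compare the expected frequencies of the three informative site patterns. The sketch below enumerates all combinations of state flips on the five edges under the Cavender-Farris model; the substitution probabilities (0.4 on the long edges, 0.05 elsewhere) and the function name are assumptions chosen for illustration.

```python
from itertools import product


def pattern_probabilities(p_a, p_b, p_c, p_d, p_internal):
    """Probability of each parsimony-informative pattern on the tree ((A,B),(C,D)).

    Each argument is the probability that the binary state changes along
    the corresponding edge (Cavender-Farris model).
    """
    probs = {"AB|CD": 0.0, "AC|BD": 0.0, "AD|BC": 0.0}
    flip_probs = (p_a, p_b, p_c, p_d, p_internal)
    for flips in product((0, 1), repeat=5):
        weight = 1.0
        for flipped, p in zip(flips, flip_probs):
            weight *= p if flipped else 1.0 - p
        f_a, f_b, f_c, f_d, f_i = flips
        # Leaf states relative to the internal node adjacent to A and B.
        a, b = f_a, f_b
        c, d = f_i ^ f_c, f_i ^ f_d
        if a == b and c == d and a != c:
            probs["AB|CD"] += weight
        elif a == c and b == d and a != b:
            probs["AC|BD"] += weight
        elif a == d and b == c and a != b:
            probs["AD|BC"] += weight
    return probs


long_branches = pattern_probabilities(0.4, 0.05, 0.4, 0.05, 0.05)
short_branches = pattern_probabilities(0.05, 0.05, 0.05, 0.05, 0.05)
print(max(long_branches, key=long_branches.get))    # AC|BD
print(max(short_branches, key=short_branches.get))  # AB|CD
```

With long branches to A and C, the pattern supporting the wrong tree AC|BD is the most frequent, so longer sequences make MP more, not less, likely to return the wrong tree. This region of parameter space is often called the Felsenstein zone.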

16
Problems with existing phylogeny reconstruction
methods

Polynomial time methods (generally based upon
distances) have poor accuracy with large diameter
datasets.
Heuristics for NP-hard optimization problems take
too long (months to reach acceptable local
optima).

17
Warnow et al.: Meta-algorithms for phylogenetics

Basic technique: determine the conditions under
which a phylogeny reconstruction method does well
(or poorly), and design a divide-and-conquer
strategy (specific to the method) to improve its
performance.
Warnow et al. developed a class of
divide-and-conquer methods, collectively called
DCMs (Disk-Covering Methods). These are based
upon chordal graph theory to give fast
decompositions and provable performance
guarantees.

18
Disk-Covering Method (DCM)
19
Improving phylogeny reconstruction methods using
DCMs

Improving the theoretical convergence rate and
performance of polynomial time distance-based
methods using DCM1
Speeding up heuristics for NP-hard optimization
problems (Maximum Parsimony and Maximum
Likelihood) using Rec-I-DCM3

20
DCM1 [Warnow, St. John, and Moret, SODA 2001]
[Figure: diagram relating an exponentially converging method, the DCM and SQS phases, and an absolute fast converging method.]

A two-phase procedure which reduces the sequence
length requirement of methods. The DCM phase
produces a collection of trees, and the SQS phase
picks the best tree.
The base method is applied to subsets of the
original dataset. When the base method is NJ,
you get DCM1-NJ.

21
DCM1-boosting distance-based methods [Nakhleh et
al. ISMB 2001]

Theorem: DCM1-NJ converges to the true tree from
polynomial length sequences

[Figure: error rate (vertical axis, 0 to 0.8) of NJ and DCM1-NJ against number of taxa (horizontal axis, 0 to 1600).]
22
Rec-I-DCM3 significantly improves performance
(Roshan et al. CSB 2004)
[Figure legend: current best techniques; DCM-boosted version of best techniques.]
Comparison of TNT to Rec-I-DCM3(TNT) on one large
dataset. Similar improvements obtained for RAxML
(maximum likelihood).
23
Summary (so far)

Optimization problems in biology are almost all
NP-hard, and heuristics may run for months before
finding local optima.
The challenge here is to find better heuristics,
since exact solutions are very unlikely to ever
be achievable on large datasets.

24
Summary

NP-hard optimization problems abound in phylogeny
reconstruction, and in computational biology in
general, and need very accurate solutions
Many real problems have beautiful and natural
combinatorial and graph-theoretic formulations
