CS 394C: Computational Biology Algorithms - PowerPoint PPT Presentation

Slides: 25
Provided by: tandyw

1
CS 394C Computational Biology Algorithms
  • Tandy Warnow
  • Department of Computer Sciences
  • University of Texas at Austin

2
DNA Sequence Evolution
3
Molecular Systematics
[Figure: a phylogenetic tree on the five taxa U, V, W, X, Y. Leaf sequences shown: TAGCCCA, TAGACTT, TGCACAA, TGCGCTT, AGGGCAT]
4
Phylogeny estimation methods
  • Distance-based (Neighbor joining, NQM, and
    others): mostly statistically consistent and
    polynomial time
  • Maximum parsimony and maximum compatibility:
    NP-hard and not statistically consistent
  • Maximum likelihood: NP-hard and usually
    statistically consistent (if solved exactly)
  • Bayesian methods: statistically consistent if run
    long enough

5
Distance-based methods
  • Theorem: Let (T, θ) be a Cavender-Farris model
    tree, with additive matrix λ(i,j). Let ε > 0 be
    given. The sequence length that suffices for
    accuracy with probability at least 1 - ε of NJ
    (neighbor joining) and the Naïve Quartet Method
    is O(log n · e^(O(max λ(i,j)))).

6
Neighbor joining (although statistically
consistent) has poor performance on large
diameter trees (Nakhleh et al., ISMB 2001)
  • Simulation study based upon fixed edge lengths,
    K2P model of evolution, sequence lengths fixed to
    1000 nucleotides.
  • Error rates reflect proportion of incorrect edges
    in inferred trees.
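
One common way to make "proportion of incorrect edges" concrete is to compare the bipartitions (internal edges) of the true tree and the inferred tree. The sketch below assumes that reading, specifically the fraction of true-tree edges that are missing from the inferred tree. The function names and the five-taxon example are illustrative and are not taken from the study.

```python
def normalize(split, taxa):
    """Represent a bipartition by the side that omits the smallest taxon."""
    side = frozenset(split)
    reference = min(taxa)
    return frozenset(taxa) - side if reference in side else side


def missing_edge_rate(true_splits, inferred_splits, taxa):
    """Fraction of the true tree's internal edges absent from the inferred tree."""
    true_set = {normalize(s, taxa) for s in true_splits}
    inferred_set = {normalize(s, taxa) for s in inferred_splits}
    return len(true_set - inferred_set) / len(true_set)


taxa = "ABCDE"
true_tree = [{"A", "B"}, {"D", "E"}]      # internal edges of ((A,B),C,(D,E))
inferred_tree = [{"A", "B"}, {"C", "D"}]  # internal edges of ((A,B),E,(C,D))
print(missing_edge_rate(true_tree, inferred_tree, taxa))  # 0.5
```

Each internal edge is stored as one side of the split it induces, so two trees on the same taxa can be compared with plain set operations.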

[Figure: error rate (0 to 0.8) of NJ versus number of taxa (0 to 1600)]
7
Maximum Parsimony
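  • Input: Set S of n aligned sequences of length k
  • Output: A phylogenetic tree T
  • leaf-labeled by sequences in S
  • with additional sequences of length k labeling the
    internal nodes of T
  • such that the sum, over all edges (u,v) of T, of
    H(u,v) is minimized, where H(u,v) is the Hamming
    distance between the sequences labeling u and v.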
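
For a tree whose internal nodes are already labeled, the parsimony length is just the total Hamming distance summed over the edges. A minimal sketch, assuming the tree is supplied as a list of edges between sequences; the leaf sequences and the internal labels ACA and GTA are chosen only for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def tree_length(edges):
    """Parsimony length of a fully labeled tree given as (sequence, sequence) edges."""
    return sum(hamming(u, v) for u, v in edges)


# Quartet tree with leaves ACT, ACA on one side and GTT, GTA on the other.
# The two internal nodes are labeled ACA and GTA (an illustrative choice).
edges = [
    ("ACT", "ACA"),  # leaf to first internal node
    ("ACA", "ACA"),  # leaf to first internal node
    ("ACA", "GTA"),  # internal edge
    ("GTA", "GTT"),  # second internal node to leaf
    ("GTA", "GTA"),  # second internal node to leaf
]
print(tree_length(edges))  # 4
```

Maximum parsimony asks for the topology and internal labels that make this total as small as possible.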

8
Maximum parsimony (example)
  • Input: Four sequences
  • ACT
  • ACA
  • GTT
  • GTA
  • Question: which of the three trees has the best
    MP score?
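
The question can be checked mechanically. The sketch below scores each of the three quartet topologies with Fitch's algorithm, which is a swapped-in standard technique and not something the slides describe. Fitch's algorithm minimizes over all internal labelings, so a topology's score can be lower than the length of any one hand-labeled tree. The function names are illustrative.

```python
def fitch_quartet(a, b, c, d):
    """Minimum number of changes on the unrooted tree ab|cd, summed over sites."""
    score = 0
    for w, x, y, z in zip(a, b, c, d):
        left = {w} & {x}
        if not left:            # a and b disagree: one change on their side
            left = {w, x}
            score += 1
        right = {y} & {z}
        if not right:           # c and d disagree: one change on their side
            right = {y, z}
            score += 1
        if not (left & right):  # the two sides share no state: one more change
            score += 1
    return score


def quartet_scores(s1, s2, s3, s4):
    """Fitch score of each of the three unrooted topologies on four sequences."""
    return {
        f"{s1},{s2} | {s3},{s4}": fitch_quartet(s1, s2, s3, s4),
        f"{s1},{s3} | {s2},{s4}": fitch_quartet(s1, s3, s2, s4),
        f"{s1},{s4} | {s2},{s3}": fitch_quartet(s1, s4, s2, s3),
    }


scores = quartet_scores("ACT", "ACA", "GTT", "GTA")
best = min(scores, key=scores.get)
print(best, scores[best])  # ACT,ACA | GTT,GTA 4
```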

9
Maximum Parsimony
[Figure: the three possible unrooted trees on the leaves ACT, ACA, GTT, GTA]
10
Maximum Parsimony
[Figure: the three trees with internal nodes labeled and edge costs marked. As labeled, they have MP scores 7, 5, and 4; the tree with score 4 is the optimal MP tree.]
11
Maximum Parsimony
12
Solving NP-hard problems exactly is unlikely
No. of leaves   No. of trees
4               3
5               15
6               105
7               945
8               10395
9               135135
10              2027025
20              2.2 x 10^20
100             1.7 x 10^182
1000            1.9 x 10^2860
  • Number of (unrooted) binary trees on n leaves is
    (2n-5)!!
  • If each tree on 1000 taxa could be analyzed in
    0.001 seconds, we would find the best tree in
    about 6 x 10^2846 millennia
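
The counts follow directly from the double factorial (2n-5)!! = 1 · 3 · 5 · ... · (2n-5). A short sketch that evaluates it exactly with Python's arbitrary-precision integers; the function names and the `sci` formatting helper are illustrative.

```python
def num_unrooted_binary_trees(n):
    """(2n-5)!!, the number of unrooted binary trees on n >= 3 labeled leaves."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    for k in range(2 * n - 5, 1, -2):  # 2n-5, 2n-7, ..., 3
        count *= k
    return count


def sci(x):
    """Format a positive integer as 'd.d x 10^e', truncating (not rounding)."""
    digits = str(x)
    if len(digits) < 8:
        return digits
    return f"{digits[0]}.{digits[1]} x 10^{len(digits) - 1}"


for n in (4, 5, 10, 20, 100, 1000):
    print(n, sci(num_unrooted_binary_trees(n)))
```

Exact integers are used because the count for 1000 leaves is far too large for a floating-point value.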

13
Approaches for solving MP and ML (and other
NP-hard problems in phylogeny)
  1. Hill-climbing heuristics (which can get stuck in
    local optima)
  2. Randomized algorithms for getting out of local
    optima
  3. Approximation algorithms for MP (based upon
    Steiner Tree approximation algorithms) --
    however, the approx. ratio that is needed is
    probably 1.01 or smaller!
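
The first two approaches can be illustrated on a toy problem. The sketch below is not a tree search: it assumes a made-up one-dimensional landscape of scores (lower is better) where each position has at most two neighbors. It shows plain hill climbing stopping at a local optimum, and random restarts as one simple randomized way to escape it. All names and values are illustrative.

```python
import random


def hill_climb(score, neighbors, start):
    """Move to the best neighbor until no neighbor improves the score."""
    current = start
    while True:
        best = min(neighbors(current), key=score, default=current)
        if score(best) >= score(current):
            return current  # local optimum: no improving neighbor
        current = best


def random_restarts(score, neighbors, candidates, restarts, rng):
    """Run hill climbing from several random starts and keep the best result."""
    starts = [rng.choice(candidates) for _ in range(restarts)]
    return min((hill_climb(score, neighbors, s) for s in starts), key=score)


LANDSCAPE = [5, 3, 4, 6, 7, 4, 2, 1, 2, 6]  # toy scores; lower is better


def score(i):
    return LANDSCAPE[i]


def neighbors(i):
    return [j for j in (i - 1, i + 1) if 0 <= j < len(LANDSCAPE)]


stuck = hill_climb(score, neighbors, 0)
print(stuck, score(stuck))  # 1 3  (a local optimum; the global optimum scores 1)

found = random_restarts(score, neighbors, range(len(LANDSCAPE)), 50, random.Random(0))
print(found, score(found))
```

In phylogenetic heuristics the neighbors would be trees reachable by a rearrangement move and the score would be the MP or ML score, but the failure mode is the same.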

14
Problems with techniques for MP and ML
Shown here is the performance of a TNT heuristic
maximum parsimony analysis on a real dataset of
almost 14,000 sequences. ("Optimal" here means
best score to date, using any method for any
amount of time.) Acceptable error is below 0.01.
[Figure: Performance of TNT with time]
15
MP and Cavender-Farris
  • Consider a tree (AB,CD) with two very long
    branches leading to A and C, and all other
    branches very short.
  • MP will be statistically inconsistent (and
    positively misleading) on this tree.

16
Problems with existing phylogeny reconstruction
methods
  • Polynomial time methods (generally based upon
    distances) have poor accuracy with large diameter
    datasets.
  • Heuristics for NP-hard optimization problems take
    too long (months to reach acceptable local
    optima).

17
Warnow et al.: Meta-algorithms for phylogenetics
  • Basic technique: determine the conditions under
    which a phylogeny reconstruction method does well
    (or poorly), and design a divide-and-conquer
    strategy (specific to the method) to improve its
    performance
  • Warnow et al. developed a class of
    divide-and-conquer methods, collectively called
    DCMs (Disk-Covering Methods). These are based
    upon chordal graph theory to give fast
    decompositions and provable performance
    guarantees.

18
Disk-Covering Method (DCM)
19
Improving phylogeny reconstruction methods using
DCMs
  • Improving the theoretical convergence rate and
    performance of polynomial time distance-based
    methods using DCM1
  • Speeding up heuristics for NP-hard optimization
    problems (Maximum Parsimony and Maximum
    Likelihood) using Rec-I-DCM3

20
DCM1 (Warnow, St. John, and Moret, SODA 2001)
[Figure: diagram labeled "Exponentially converging method", "DCM", "SQS", and "Absolute fast converging method"]
  • A two-phase procedure which reduces the sequence
    length requirement of methods. The DCM phase
    produces a collection of trees, and the SQS phase
    picks the best tree.
  • The base method is applied to subsets of the
    original dataset. When the base method is NJ,
    you get DCM1-NJ.

21
DCM1-boosting distance-based methods (Nakhleh et
al., ISMB 2001)
  • Theorem: DCM1-NJ converges to the true tree from
    polynomial length sequences

[Figure: error rate (0 to 0.8) of NJ and DCM1-NJ versus number of taxa (0 to 1600)]
22
Rec-I-DCM3 significantly improves performance
(Roshan et al. CSB 2004)
[Figure legend: current best techniques; DCM-boosted version of best techniques]
Comparison of TNT to Rec-I-DCM3(TNT) on one large
dataset. Similar improvements obtained for RAxML
(maximum likelihood).
23
Summary (so far)
  • Optimization problems in biology are almost all
    NP-hard, and heuristics may run for months before
    finding local optima.
  • The challenge here is to find better heuristics,
    since exact solutions are very unlikely to ever
    be achievable on large datasets.

24
Summary
  • NP-hard optimization problems abound in phylogeny
    reconstruction, and in computational biology in
    general, and need very accurate solutions
  • Many real problems have beautiful and natural
    combinatorial and graph-theoretic formulations