Computational and mathematical challenges involved in very large-scale phylogenetics

1
Computational and mathematical challenges
involved in very large-scale phylogenetics
  • Tandy Warnow
  • The University of Texas at Austin

2
Species phylogeny
From the Tree of Life Website, University of
Arizona
Orangutan
Human
Gorilla
Chimpanzee
3
How did life evolve on earth?
  • An international effort to understand how life
    evolved on earth
  • Biomedical applications: drug design, protein
    structure and function prediction, biodiversity
  • Phylogenetic estimation is a Grand Challenge:
    millions of taxa, NP-hard optimization problems
  • Courtesy of the Tree of Life project

4
DNA Sequence Evolution
5
Step 1: Gather data
S1  AGGCTATCACCTGACCTCCA
S2  TAGCTATCACGACCGC
S3  TAGCTGACCGC
S4  TCACGACCGACA
6
Step 2: Multiple Sequence Alignment
S1  AGGCTATCACCTGACCTCCA
S2  TAGCTATCACGACCGC
S3  TAGCTGACCGC
S4  TCACGACCGACA
Aligned:
S1  -AGGCTATCACCTGACCTCCA
S2  TAG-CTATCAC--GACCGC--
S3  TAG-CT-------GACCGC--
S4  -------TCAC--GACCGACA
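A small sanity check for the alignment above, as a sketch: every aligned row should have the same length, and deleting the gap characters should give back the original sequence. The function name `is_valid_alignment` is hypothetical; the data is copied from the slide.

```python
# Unaligned sequences and their alignment, copied from the slide.
raw = {
    "S1": "AGGCTATCACCTGACCTCCA",
    "S2": "TAGCTATCACGACCGC",
    "S3": "TAGCTGACCGC",
    "S4": "TCACGACCGACA",
}
aligned = {
    "S1": "-AGGCTATCACCTGACCTCCA",
    "S2": "TAG-CTATCAC--GACCGC--",
    "S3": "TAG-CT-------GACCGC--",
    "S4": "-------TCAC--GACCGACA",
}


def is_valid_alignment(raw, aligned):
    """True if all rows have equal length and removing gaps restores each input."""
    same_length = len({len(row) for row in aligned.values()}) == 1
    restores = all(aligned[name].replace("-", "") == seq for name, seq in raw.items())
    return same_length and restores


print(is_valid_alignment(raw, aligned))
```

This only checks that the alignment is well formed; it says nothing about whether the alignment is accurate.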
7
Step 3: Construct tree
S1  AGGCTATCACCTGACCTCCA
S2  TAGCTATCACGACCGC
S3  TAGCTGACCGC
S4  TCACGACCGACA
Aligned:
S1  -AGGCTATCACCTGACCTCCA
S2  TAG-CTATCAC--GACCGC--
S3  TAG-CT-------GACCGC--
S4  -------TCAC--GACCGACA
[Figure: tree with leaves S1, S2, S4, S3]
8
Standard problem: Maximum Parsimony (Hamming
distance Steiner Tree)
  • Input: Set S of n aligned sequences of length k
  • Output: A phylogenetic tree T
  • leaf-labeled by sequences in S
  • additional sequences of length k labeling the
    internal nodes of T
  • such that $\sum_{(i,j) \in E(T)} H(i,j)$ is minimized,
    where H(i,j) is the Hamming distance between the
    sequences labeling nodes i and j.
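As a minimal sketch of the objective, the length of a fully labeled tree can be computed by summing Hamming distances over its edges. The node names, the edge list, and the function names `hamming` and `tree_length` are hypothetical, chosen only for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def tree_length(edges, labels):
    """Sum of Hamming distances over the edges of a fully labeled tree."""
    return sum(hamming(labels[u], labels[v]) for u, v in edges)


# Hypothetical tree: leaves a, b, c, d and internal nodes x, y.
labels = {"a": "ACT", "b": "ACA", "c": "GTT", "d": "GTA", "x": "ACA", "y": "GTA"}
edges = [("a", "x"), ("b", "x"), ("x", "y"), ("c", "y"), ("d", "y")]
print(tree_length(edges, labels))  # prints 4
```

Maximum parsimony asks for the tree topology and internal labels that make this sum as small as possible; the sketch only evaluates one given candidate.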

9
Maximum parsimony (example)
  • Input: Four sequences
  • ACT
  • ACA
  • GTT
  • GTA
  • Question: which of the three trees has the best
    MP score?
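One way to answer the question is Fitch's small-parsimony algorithm, which computes, for a fixed topology, the minimum number of changes over all internal labelings. The sketch below roots each four-leaf tree on its middle edge (the parsimony score does not depend on the root) and uses hypothetical leaf names w, x, y, z. A tree drawn with one particular, non-optimal internal labeling can show a larger score than the minimum computed here.

```python
def fitch_site(node, seqs, site):
    """Return (candidate state set, minimum changes) for one site of a subtree."""
    if isinstance(node, str):  # leaf: its observed state, no changes
        return {seqs[node][site]}, 0
    left, left_cost = fitch_site(node[0], seqs, site)
    right, right_cost = fitch_site(node[1], seqs, site)
    common = left & right
    if common:  # children can agree: no extra change needed
        return common, left_cost + right_cost
    return left | right, left_cost + right_cost + 1  # one change forced


def fitch_score(tree, seqs):
    """Minimum parsimony score of a rooted binary tree given as nested tuples."""
    k = len(next(iter(seqs.values())))
    return sum(fitch_site(tree, seqs, site)[1] for site in range(k))


seqs = {"w": "ACT", "x": "ACA", "y": "GTT", "z": "GTA"}
trees = {
    "wx|yz": (("w", "x"), ("y", "z")),
    "wy|xz": (("w", "y"), ("x", "z")),
    "wz|xy": (("w", "z"), ("x", "y")),
}
scores = {name: fitch_score(tree, seqs) for name, tree in trees.items()}
print(scores)  # {'wx|yz': 4, 'wy|xz': 5, 'wz|xy': 6}
```

The topology that pairs ACT with ACA and GTT with GTA attains the smallest score, 4.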

10
Maximum Parsimony
[Figure: the three possible trees on the leaves ACT, ACA, GTT, and GTA]
11
Maximum Parsimony
[Figure: the three trees with labeled internal nodes and edge weights. MP scores: 7, 5, and 4; the tree with MP score 4 is the optimal MP tree.]
12
Maximum Parsimony computational complexity
13
Approaches for solving MP (and other NP-hard
problems in phylogeny)
  • Hill-climbing heuristics (which can get stuck in
    local optima)
  • Randomized algorithms for getting out of local
    optima
  • Approximation algorithms for MP (based upon
    Steiner Tree approximation algorithms).

14
Problems with current techniques for MP
Shown here is the performance of a TNT heuristic
maximum parsimony analysis on a real dataset of
almost 14,000 sequences. ("Optimal" here means the
best score to date, using any method for any
amount of time.) Acceptable error is below 0.01.
[Figure: Performance of TNT with time]
15
  • FN: false negative (missing edge)
  • FP: false positive (incorrect edge)
[Figure: example trees marking an FN edge and an FP edge; 50% error rate]
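A sketch of how FN and FP rates can be computed by comparing the internal edges (bipartitions of the leaf set) of a true tree and an estimated tree. Trees are given as nested tuples of leaf names. The function names, the example trees, and the choice to normalize each rate by the number of internal edges in the corresponding tree are assumptions made for illustration.

```python
def bipartitions(tree):
    """Nontrivial bipartitions (internal edges) of a tree given as nested tuples."""
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        subtrees.append(below)
        return below

    subtrees = []
    all_leaves = visit(tree)
    anchor = min(all_leaves)
    for below in subtrees:
        # Keep only edges with at least two leaves on each side.
        if 1 < len(below) < len(all_leaves) - 1:
            # Store the side without the anchor leaf so each edge has one key.
            splits.add(all_leaves - below if anchor in below else below)
    return splits


def fn_fp_rates(true_tree, estimated_tree):
    """FN rate: true edges missing from the estimate. FP rate: estimated edges not in the truth."""
    true_splits = bipartitions(true_tree)
    est_splits = bipartitions(estimated_tree)
    if not true_splits or not est_splits:
        raise ValueError("both trees need at least one internal edge")
    fn = len(true_splits - est_splits) / len(true_splits)
    fp = len(est_splits - true_splits) / len(est_splits)
    return fn, fp


true_tree = ((("A", "B"), "C"), (("D", "E"), "F"))
estimated = ((("A", "C"), "B"), (("D", "E"), "F"))
fn, fp = fn_fp_rates(true_tree, estimated)
print(fn, fp)
```

In this example each tree has three internal edges and the two trees disagree on one of them, so both rates are one third. For two fully resolved trees on the same leaves the FN and FP rates are always equal.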
16
Performance criteria
  • Estimated alignments are evaluated with respect
    to the true alignment. Studied both in
    simulation and on real data.
  • Estimated trees are evaluated for topological
    accuracy with respect to the true tree.
    Typically studied in simulation.
  • Methods for these problems can also be evaluated
    with respect to an optimization criterion (e.g.,
    maximum likelihood score) as a function of
    running time. Typically studied on real data.
    (Reasonably valid for phylogeny but not yet for
    alignment.)
  • Issues: Simulation studies need to be based upon
    realistic models, and truth is often not known
    for real data.

17
Statistical consistency, exponential convergence,
and absolute fast convergence (afc)
18
Distance-based Phylogenetic Methods
19
  • Theorem: Neighbor joining (and some other
    distance-based methods) will return the true tree
    with high probability provided sequence lengths
    are exponential in the diameter of the tree
    (Erdős et al., Atteson).
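For readers unfamiliar with the method named in the theorem, here is a compact sketch of neighbor joining. It returns only the topology (no branch lengths) as nested tuples, ending with the three groups that meet at the final internal node. The distance matrix is a hypothetical additive example, not data from the talk, and the function name and input format are assumptions.

```python
def neighbor_joining(names, dist):
    """NJ topology as nested tuples; dist maps frozenset({a, b}) to a distance."""
    nodes = list(names)
    d = dict(dist)
    while len(nodes) > 3:
        n = len(nodes)
        # Row sums of the current distance matrix.
        r = {a: sum(d[frozenset((a, b))] for b in nodes if b != a) for a in nodes}
        # Join the pair minimizing Q(a, b) = (n - 2) d(a, b) - r(a) - r(b).
        a, b = min(
            ((a, b) for i, a in enumerate(nodes) for b in nodes[i + 1:]),
            key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]],
        )
        new = (a, b)
        for c in nodes:
            if c not in (a, b):
                # Distance from the new internal node to every remaining node.
                d[frozenset((new, c))] = 0.5 * (
                    d[frozenset((a, c))] + d[frozenset((b, c))] - d[frozenset((a, b))]
                )
        nodes = [c for c in nodes if c not in (a, b)] + [new]
    return tuple(nodes)


# Hypothetical additive distances for a tree with cherries (A, B) and (D, E).
pairs = {
    ("A", "B"): 5, ("A", "C"): 7, ("A", "D"): 7, ("A", "E"): 6, ("B", "C"): 8,
    ("B", "D"): 8, ("B", "E"): 7, ("C", "D"): 8, ("C", "E"): 7, ("D", "E"): 3,
}
dist = {frozenset(pair): value for pair, value in pairs.items()}
tree = neighbor_joining("ABCDE", dist)
print(tree)
```

On exact additive distances like these, NJ recovers the generating topology. The theorem concerns how long the sequences must be for distances estimated from data to be close enough to additive for this to happen.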

20
Neighbor joining has poor performance on large
diameter trees (Nakhleh et al., ISMB 2001)
  • Simulation study based upon fixed edge lengths,
    K2P model of evolution, sequence lengths fixed to
    1000 nucleotides.
  • Error rates reflect proportion of incorrect edges
    in inferred trees.

[Figure: error rate of NJ (0 to 0.8) versus number of taxa (0 to 1600)]
21
Observations
  • The best current multiple sequence alignment
    methods can produce highly inaccurate alignments
    on large datasets (with the result that trees
    estimated on these alignments are also
    inaccurate).
  • The fast (polynomial time) methods produce highly
    inaccurate trees for many datasets.
  • Heuristics for NP-hard optimization problems
    often produce highly accurate trees, but can take
    months to reach solutions on large datasets.

22
Meta-algorithms for phylogenetics
  • Basic technique: determine the conditions under
    which a phylogeny reconstruction method does well
    (or poorly), and design a divide-and-conquer
    strategy to improve the performance
  • The divide-and-conquer technique is specific to
    the method.

23
DCM (cartoon)
24
Graph-theoretic
divide-and-conquer (DCMs)
  • Define a triangulated (i.e. chordal) graph so
    that its vertices correspond to the input taxa
  • Compute a decomposition of the graph into
    overlapping subgraphs, thus defining a
    decomposition of the taxa into overlapping
    subsets.
  • Apply the base method to each subset of taxa,
    to construct a subtree
  • Merge the subtrees into a single tree on the full
    set of taxa.
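A rough sketch of the first two steps only, under assumptions the slides do not state: the graph is a threshold graph on a pairwise distance matrix (two taxa are joined when their distance is at most a hypothetical threshold q), and the overlapping subsets are its maximal cliques. No triangulation is performed, so the sketch relies on the threshold graph already being chordal. The distances, the threshold, and the function names are made up for illustration, and the base-method and merge steps are omitted.

```python
from itertools import combinations


def threshold_graph(taxa, dist, q):
    """Adjacency sets: two taxa are joined when their distance is at most q."""
    adj = {t: set() for t in taxa}
    for a, b in combinations(taxa, 2):
        if dist[frozenset((a, b))] <= q:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def maximal_cliques(adj):
    """All maximal cliques, found with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


# Hypothetical pairwise distances on five taxa.
pairs = {
    ("A", "B"): 5, ("A", "C"): 7, ("A", "D"): 7, ("A", "E"): 6, ("B", "C"): 8,
    ("B", "D"): 8, ("B", "E"): 7, ("C", "D"): 8, ("C", "E"): 7, ("D", "E"): 3,
}
dist = {frozenset(pair): value for pair, value in pairs.items()}
subsets = maximal_cliques(threshold_graph("ABCDE", dist, q=7))
print(sorted("".join(sorted(s)) for s in subsets))  # ['ABE', 'ACE', 'ADE']
```

Each subset is smaller than the full taxon set and the subsets share taxa (here A and E), which is what lets the subtrees built on them be merged afterwards.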

25
DCM1-boosting distance-based methods (Nakhleh et
al., ISMB 2001)
  • Theorem: DCM1-NJ converges to the true tree from
    polynomial length sequences

[Figure: error rates of NJ and DCM1-NJ (0 to 0.8) versus number of taxa (0 to 1600)]
26
The DCM3 decomposition
  • DCM3 decompositions
  • (1) can be obtained in O(n) time (the
    short subtree graph is triangulated)
  • (2) yield small subproblems
  • (3) can be used iteratively

27
Iterative-DCM3
[Figure: a tree T is passed through DCM3 and the base method to produce a new tree T]
28
Rec-I-DCM3 significantly improves performance
(Roshan et al.)
[Figure: comparison of TNT to Rec-I-DCM3(TNT) on one large dataset; curves labeled "Current best techniques" and "DCM boosted version of best techniques"]
29
Other problems
  • Multiple sequence alignment
  • Horizontal gene transfer (and other types of
    reticulation) detection and reconstruction
  • Inferring species trees from gene trees
  • Whole genome phylogenies
  • Reconstructing evolutionary histories of languages

30
For more information
  • My webpage: www.cs.utexas.edu/users/tandy
  • The CIPRES webpage: www.phylo.org
  • Historical linguistics: www.cs.rice.edu/nakhleh/CPHL

31
Perfect Phylogenetic Network (all characters
compatible)