395T: Algorithms for Computational Biology

1
395T Algorithms for Computational Biology
  • Tandy Warnow
  • Dept. of Computer Science
  • The University of Texas at Austin

2
Phylogeny
[Figure: phylogenetic tree of Orangutan, Human, Gorilla, and Chimpanzee. From the Tree of Life Website, University of Arizona]
3
Evolution informs about everything in biology
  • Big genome sequencing projects just produce data.
    So what?
  • Evolutionary history relates all organisms and
    genes, and helps us understand and predict:
      • interactions between genes (genetic networks)
      • drug design
      • predicting functions of genes
      • influenza vaccine development
      • origins and spread of disease
      • origins and migrations of humans

4
Reconstructing the Tree of Life
  • Handling large datasets: millions of species
  • NP-hard problems
  • Lots of computer science research to do
5
Steps in a phylogenetic analysis
  • Gather data
  • Align sequences
  • Estimate phylogeny on the multiple alignment
  • Estimate the reliable aspects of the evolutionary
    history (using bootstrapping, consensus trees, or
    other methods)
  • Perform post-tree analyses.

6
DNA Sequence Evolution
7
Phylogeny Problem
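[Figure: five DNA sequences (TAGCCCA, TAGACTT, TGCACAA, TGCGCTT, AGGGCAT) observed at the leaves U, V, W, X, Y of an unknown tree; the problem is to infer the tree on U, V, W, X, Y from the sequences]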
8
Phylogenetic reconstruction methods
  • Hill-climbing heuristics for hard optimization
    criteria (Maximum Parsimony and Maximum
    Likelihood)
  • Polynomial time distance-based methods: UPGMA,
    Neighbor Joining, FastME, Weighbor, etc.

9
Performance criteria
  • Running time.
  • Space.
  • Statistical performance issues (e.g., statistical
    consistency) with respect to a Markov model of
    evolution.
  • Topological accuracy with respect to the
    underlying true tree. Typically studied in
    simulation.
  • Accuracy with respect to a particular criterion
    (e.g. tree length or likelihood score), on real
    data.

10
How can we infer evolution?
  • While there are more than two sequences, DO:
      • Find the closest pair of sequences and make
        them siblings
      • Replace the pair by a single sequence

11
That was called UPGMA
  • Advantages: UPGMA is polynomial time and works
    well under the strong molecular clock
    hypothesis.
  • Disadvantages: UPGMA does not work well in
    simulations, perhaps because the molecular clock
    hypothesis does not generally apply.
  • Other polynomial time methods, also
    distance-based, work better. One of the best of
    these is Neighbor Joining.
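
The loop described above can be sketched as follows. This is a minimal illustration, assuming that distances are Hamming distances between aligned sequences and that a merged pair is replaced by a cluster whose distance to every other cluster is the size-weighted average of its parts (the UPGMA rule). The function name `upgma`, the nested-tuple tree representation, and the four short example sequences are choices made for this sketch, not taken from the slides.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def upgma(seqs):
    """Build a tree (nested tuples of names) from a dict name -> aligned sequence."""
    names = list(seqs)
    trees = {i: name for i, name in enumerate(names)}  # cluster id -> subtree
    sizes = {i: 1 for i in trees}                      # cluster id -> number of leaves
    dist = {(i, j): float(hamming(seqs[names[i]], seqs[names[j]]))
            for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)

    while len(trees) > 2:
        # Find the closest pair of clusters and make them siblings.
        i, j = min(dist, key=dist.get)
        new = next_id
        next_id += 1
        # Replace the pair by one cluster; distances are size-weighted averages.
        for k in trees:
            if k in (i, j):
                continue
            d_ik = dist[(min(i, k), max(i, k))]
            d_jk = dist[(min(j, k), max(j, k))]
            dist[(k, new)] = (sizes[i] * d_ik + sizes[j] * d_jk) / (sizes[i] + sizes[j])
        dist = {pair: d for pair, d in dist.items() if i not in pair and j not in pair}
        trees[new] = (trees.pop(i), trees.pop(j))
        sizes[new] = sizes.pop(i) + sizes.pop(j)

    return tuple(trees.values())


example = {"ACT": "ACT", "ACA": "ACA", "GTT": "GTT", "GTA": "GTA"}
print(upgma(example))
```

On this input the two closest pairs (ACT, ACA) and (GTT, GTA) are merged in turn. Ties are broken by input order here, which is an arbitrary choice.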

12
Quantifying Error
  • FN: false negative (missing edge)
  • FP: false positive (incorrect edge)
  • Example in the figure: 50% error rate
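
The FN and FP rates can be computed by comparing the internal edges of the two trees. The sketch below assumes each internal edge is represented by the bipartition of the taxa it induces. The helper names `split` and `error_rates` and the five-taxon example are hypothetical.

```python
def split(side, taxa):
    """Bipartition induced by an internal edge, as an unordered pair of taxon sets."""
    side = frozenset(side)
    return frozenset({side, frozenset(taxa) - side})


def error_rates(true_splits, inferred_splits):
    """Return (FN rate, FP rate) for an inferred tree against the true tree."""
    fn = len(true_splits - inferred_splits)   # true edges missing from the inferred tree
    fp = len(inferred_splits - true_splits)   # inferred edges not in the true tree
    return fn / len(true_splits), fp / len(inferred_splits)


taxa = "abcde"
true_tree = {split("ab", taxa), split("de", taxa)}   # ((a,b),c,(d,e))
inferred = {split("ab", taxa), split("ce", taxa)}    # ((a,b),d,(c,e))
print(error_rates(true_tree, inferred))  # one of two edges is wrong each way
```

When both trees are binary they have the same number of internal edges, so the FN and FP rates coincide.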
13
Neighbor joining has poor performance on large
diameter trees [Nakhleh et al. ISMB 2001]
  • Simulation study based upon fixed edge lengths,
    K2P model of evolution, sequence lengths fixed to
    1000 nucleotides.
  • Error rates reflect proportion of incorrect edges
    in inferred trees.

[Figure: error rate of NJ (y-axis, 0 to 0.8) against number of taxa (x-axis, 0 to 1600)]
14
  • Other standard polynomial time methods don't
    improve substantially on NJ (and have the same
    problem with large diameter datasets).
  • What about trying to solve maximum parsimony or
    maximum likelihood?

15
Maximum Parsimony
  • Input: Set S of n aligned sequences of length k
  • Output:
      • A phylogenetic tree T leaf-labeled by sequences
        in S
      • additional sequences of length k labeling the
        internal nodes of T
    such that $\sum_{(i,j) \in E(T)} H(i,j)$ is minimized,
    where E(T) is the edge set of T and H(i,j) denotes
    the Hamming distance between sequences at nodes i
    and j
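
The quantity being minimized is the sum of Hamming distances over the edges of a fully labeled tree. The sketch below evaluates it for one tree under two labelings. The node names (`l1` to `l4`, `x`, `y`), the function names, and the four short example sequences are hypothetical choices for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def tree_length(edges, labels):
    """Sum of H(i, j) over all edges (i, j) of a tree with every node labeled."""
    return sum(hamming(labels[i], labels[j]) for i, j in edges)


# Hypothetical quartet tree: leaves l1..l4, internal nodes x and y.
edges = [("l1", "x"), ("l2", "x"), ("x", "y"), ("l3", "y"), ("l4", "y")]
leaves = {"l1": "ACT", "l2": "ACA", "l3": "GTT", "l4": "GTA"}

good_labeling = {**leaves, "x": "ACA", "y": "GTA"}
poor_labeling = {**leaves, "x": "GTT", "y": "ACA"}

print(tree_length(edges, good_labeling))  # 1 + 0 + 2 + 1 + 0 = 4
print(tree_length(edges, poor_labeling))  # 2 + 3 + 3 + 3 + 2 = 13
```

The same tree can have very different lengths depending on the internal labels, which is why the output of MP includes the internal sequences as well as the tree.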

16
Maximum parsimony (example)
  • Input: Four sequences
      • ACT
      • ACA
      • GTT
      • GTA
  • Question: which of the three trees has the best
    MP score?

17
Maximum Parsimony
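[Figure: the three possible unrooted trees on the four sequences: ((ACT, GTA), (ACA, GTT)), ((ACT, GTT), (ACA, GTA)), and ((ACT, ACA), (GTT, GTA))]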
18
Maximum Parsimony
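[Figure: the three trees with labeled internal nodes and edge lengths]
  • ((ACT, GTA), (ACA, GTT)), edge lengths 3, 1, 3:
    MP score 7 as labeled (labeling both internal
    nodes GTA gives 6)
  • ((ACT, GTT), (ACA, GTA)), edge lengths 2, 1, 2:
    MP score 5
  • ((ACT, ACA), (GTT, GTA)), edge lengths 1, 2, 1:
    MP score 4, the optimal MP tree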
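
For a fixed tree, the best internal labeling can be scored site by site with Fitch's algorithm. The sketch below applies it to the three trees on ACT, ACA, GTT, GTA. It assumes each unrooted quartet is written as a nested tuple rooted on its internal edge, which does not change the parsimony score. The function names are hypothetical. Because Fitch's algorithm finds the best labeling for each tree, its score can be lower than the length of a tree drawn with particular internal labels.

```python
def fitch_site(tree, site):
    """Return (possible states at this node, minimum changes below it) for one column."""
    if isinstance(tree, str):            # leaf: its state is fixed
        return {tree[site]}, 0
    left, right = tree
    left_states, left_cost = fitch_site(left, site)
    right_states, right_cost = fitch_site(right, site)
    common = left_states & right_states
    if common:                           # children can agree: no extra change
        return common, left_cost + right_cost
    return left_states | right_states, left_cost + right_cost + 1


def mp_score(tree, k):
    """Parsimony score of a tree whose leaves are sequences of length k."""
    return sum(fitch_site(tree, site)[1] for site in range(k))


quartets = [
    (("ACT", "GTA"), ("ACA", "GTT")),
    (("ACT", "GTT"), ("ACA", "GTA")),
    (("ACT", "ACA"), ("GTT", "GTA")),
]
for tree in quartets:
    print(tree, mp_score(tree, 3))
```

The scores come out as 6, 5, and 4, so ((ACT, ACA), (GTT, GTA)) is the optimal MP tree.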
19
Maximum Parsimony: computational complexity
20
Solving NP-hard problems exactly is unlikely
  • Number of (unrooted) binary trees on n leaves is
    (2n-5)!!
  • If each tree on 1000 taxa could be analyzed in
    0.001 seconds, an exhaustive search for the best
    tree would take about 10^2847 millennia
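
The growth of (2n-5)!! can be checked directly with exact integer arithmetic. The sketch below counts the trees and converts the count into search time at 0.001 seconds per tree. The function name and the 365.25-day year are assumptions made for this illustration.

```python
import math


def num_unrooted_binary_trees(n):
    """(2n-5)!! = 1 * 3 * 5 * ... * (2n-5), the number of unrooted binary trees on n leaves."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    for odd in range(3, 2 * n - 4, 2):   # 3, 5, ..., 2n-5
        count *= odd
    return count


SECONDS_PER_TREE = 0.001
SECONDS_PER_MILLENNIUM = 1000 * 365.25 * 86400

for n in (4, 5, 6, 10, 20):
    print(n, num_unrooted_binary_trees(n))

# For n = 1000 the count is too large for a float, so work with logarithms.
log10_trees = math.log10(num_unrooted_binary_trees(1000))
log10_millennia = (log10_trees
                   + math.log10(SECONDS_PER_TREE)
                   - math.log10(SECONDS_PER_MILLENNIUM))
print(f"trees: about 10^{log10_trees:.0f}")
print(f"exhaustive search: about 10^{log10_millennia:.0f} millennia")
```

Under these assumptions the count for 1000 taxa is near 10^2860 trees, so exhaustive search is out of reach at any realistic per-tree cost.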

21
Approaches for solving MP/ML
  • Hill-climbing heuristics (which can get stuck in
    local optima)
  • Randomized algorithms for getting out of local
    optima
  • Approximation algorithms for MP (based upon
    Steiner Tree approximation algorithms).
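
How hill-climbing gets stuck, and how restarts help, can be seen on a toy problem. The sketch below minimizes a score over positions on a line, where each position has at most two neighbors. In tree search the neighbors would instead be trees one rearrangement away, and the starting points would typically be chosen at random. The landscape, function names, and fixed starting points here are hypothetical.

```python
def hill_climb(start, neighbors, score):
    """Move to the best neighbor until no neighbor improves the score (minimization)."""
    current = start
    while True:
        candidates = list(neighbors(current))
        best = min(candidates, key=score) if candidates else current
        if score(best) >= score(current):
            return current               # local optimum: no neighbor is better
        current = best


def best_of_restarts(starts, neighbors, score):
    """Run hill-climbing from several starting points and keep the best result."""
    return min((hill_climb(s, neighbors, score) for s in starts), key=score)


landscape = [5, 3, 4, 2, 6, 1, 7]        # toy scores; the global optimum is at index 5


def score(i):
    return landscape[i]


def neighbors(i):
    return [j for j in (i - 1, i + 1) if 0 <= j < len(landscape)]


print(hill_climb(0, neighbors, score))                 # stuck at index 1 (score 3)
print(best_of_restarts([0, 3, 6], neighbors, score))   # reaches index 5 (score 1)
```

A single run from index 0 stops at a local optimum, while restarting from other points finds the global one. Nothing in a single run signals that a better solution exists.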

22
Problems with current techniques for MP
Shown here is the performance of a heuristic
maximum parsimony analysis on a real dataset of
almost 14,000 sequences. (Optimal here means
best score to date, using any method for any
amount of time.) Acceptable error is below 0.01.
[Figure: performance of TNT with time]
23
Observations
  • The best MP heuristics cannot get acceptably good
    solutions within 24 hours on most of these large
    datasets.
  • Datasets of these sizes may need months (or
    years) of further analysis to reach reasonable
    solutions.
  • Apparent convergence can be misleading.

24
Empirical problems with existing methods
  • Heuristics for Maximum Parsimony (MP) and Maximum
    Likelihood (ML) cannot handle large datasets
    (take too long!), so we need new heuristics for
    MP/ML that can analyze large datasets
  • Polynomial time methods have poor topological
    accuracy on large diameter datasets, so we need
    better polynomial time methods

25
What happens after the analysis?
  • The result of a phylogenetic analysis is often
    thousands (or tens of thousands) of equally good
    trees. What to do?
  • Biologists use consensus methods, as well as
    other techniques, to try to infer the likely
    characteristics of the true tree.
    Current techniques lack sufficient power.
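
Two common consensus methods, strict and majority-rule consensus, can be sketched in terms of bipartitions. The sketch assumes each tree is given as the set of bipartitions induced by its internal edges, and that each bipartition is recorded by the side that does not contain a fixed reference taxon (here "a"). The function names and the three example trees are hypothetical.

```python
from collections import Counter


def strict_consensus(trees):
    """Bipartitions present in every input tree."""
    return set.intersection(*(set(tree) for tree in trees))


def majority_consensus(trees):
    """Bipartitions present in more than half of the input trees."""
    counts = Counter(s for tree in trees for s in set(tree))
    return {s for s, c in counts.items() if 2 * c > len(trees)}


cde, de, ce = frozenset("cde"), frozenset("de"), frozenset("ce")
trees = [
    {cde, de},   # ((a,b),c,(d,e))
    {cde, ce},   # ((a,b),d,(c,e))
    {cde, de},   # ((a,b),c,(d,e)) again
]
print(strict_consensus(trees))     # only the split separating a, b from c, d, e
print(majority_consensus(trees))   # also keeps the split for (d, e)
```

The strict consensus keeps only what all trees agree on, so with thousands of equally good trees it can be almost unresolved. The majority consensus retains more structure.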

26
This course
  • Focus on the design and analysis of algorithms
    for phylogeny reconstruction, multiple sequence
    alignment, and consensus of sets of trees.
  • Objective: the design of new algorithms with
    better performance than existing algorithms, as
    evidenced by theory, experiment, or empirical
    studies.
  • No background in biology or statistics is
    required. (Some background in computer science is
    presumed.)

27
General comments
  • There is interesting computer science research to
    be done in computational phylogenetics, with a
    tremendous potential for impact.
  • Algorithm development must be tested on both real
    and simulated data.
  • The interplay between data, stochastic models of
    evolution, optimization problems, and algorithms
    is important and instructive.