395T: Algorithms for Computational Biology

Provided by: tandyw

Transcript and Presenter's Notes

Title: 395T: Algorithms for Computational Biology

1
395T: Algorithms for Computational Biology

Tandy Warnow
Dept. of Computer Science
The University of Texas at Austin

2
Phylogeny
From the Tree of Life Website, University of
Arizona
[Figure: tree with leaves Orangutan, Human, Gorilla, Chimpanzee]
3
Evolution informs about everything in biology

Big genome sequencing projects just produce data.
So what?
Evolutionary history relates all organisms and
genes, and helps us understand and predict:
interactions between genes (genetic networks)
drug design
predicting functions of genes
influenza vaccine development
origins and spread of disease
origins and migrations of humans

4
Reconstructing the Tree of Life
Handling large datasets: millions of species.
NP-hard problems.
Lots of computer science research to do.
5
Steps in a phylogenetic analysis

Gather data
Align sequences
Estimate phylogeny on the multiple alignment
Estimate the reliable aspects of the evolutionary
history (using bootstrapping, consensus trees, or
other methods)
Perform post-tree analyses.

6
DNA Sequence Evolution
7
Phylogeny Problem
[Figure: five taxa labeled U, V, W, X, Y; sequences shown: TAGCCCA, TAGACTT, TGCACAA, TGCGCTT, AGGGCAT; the tree to be inferred has leaves X, U, Y, V, W]
8
Phylogenetic reconstruction methods

Hill-climbing heuristics for hard optimization
criteria (Maximum Parsimony and Maximum
Likelihood)

Polynomial time distance-based methods: UPGMA,
Neighbor Joining, FastME, Weighbor, etc.

9
Performance criteria

Running time.
Space.
Statistical performance issues (e.g., statistical
consistency) with respect to a Markov model of
evolution.
Topological accuracy with respect to the
underlying true tree. Typically studied in
simulation.
Accuracy with respect to a particular criterion
(e.g. tree length or likelihood score), on real
data.

10
How can we infer evolution?

While there are more than two sequences, DO
Find the closest pair of sequences and make
them siblings
Replace the pair by a single sequence

11
That was called UPGMA

Advantages: UPGMA is polynomial time and works
well under the strong molecular clock
hypothesis.
Disadvantages: UPGMA does not work well in
simulations, perhaps because the molecular clock
hypothesis does not generally apply.
Other polynomial time methods, also
distance-based, work better. One of the best of
these is Neighbor Joining.
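
The closest-pair loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: it assumes Hamming distance between aligned sequences as the distance measure (the slides do not fix one), uses the average distance between cluster members when a pair is replaced, and runs until a single rooted tree remains. The taxon names `a` to `d` are hypothetical.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def upgma(seqs):
    """Build a nested-tuple tree by repeatedly joining the closest pair.

    `seqs` maps taxon name -> aligned sequence. The distance between two
    clusters is the average Hamming distance over all cross-cluster pairs
    (an assumption; any distance matrix could be used instead).
    """
    # Each cluster is (subtree, list of member taxon names).
    clusters = [(name, [name]) for name in sorted(seqs)]

    def dist(c1, c2):
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(hamming(seqs[a], seqs[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        # Find the closest pair of clusters and make them siblings.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        # Replace the pair by the single merged cluster.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]

tree = upgma({"a": "ACT", "b": "ACA", "c": "GTT", "d": "GTA"})
print(tree)
```

Recomputing every cluster distance from scratch keeps the sketch short but is slower than maintaining a distance matrix, which is what a practical implementation would do.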

12
Quantifying Error
FN: false negative (missing edge)
FP: false positive (incorrect edge)
[Figure: example with a 50% error rate]
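
One common way to compute these rates is to compare the leaf bipartitions induced by the internal edges of the two trees. The sketch below assumes trees are written as nested tuples of leaf names, that both trees share the same leaf set, and that each has at least one internal edge; the function names are illustrative only.

```python
def leaves(tree):
    """Set of leaf names below a nested-tuple node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def bipartitions(tree):
    """Non-trivial leaf bipartitions, one per internal edge of the tree."""
    all_leaves = leaves(tree)
    anchor = min(all_leaves)  # fixed leaf used to pick a canonical side
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaves(node)
        if 1 < len(side) < len(all_leaves) - 1:
            # Store the side that does not contain the anchor leaf.
            splits.add(side if anchor not in side else all_leaves - side)
        for child in node:
            visit(child)

    visit(tree)
    return splits

def error_rates(true_tree, inferred_tree):
    """Return (FN rate, FP rate) of the inferred tree against the true tree."""
    true_splits = bipartitions(true_tree)
    inferred_splits = bipartitions(inferred_tree)
    fn = len(true_splits - inferred_splits) / len(true_splits)
    fp = len(inferred_splits - true_splits) / len(inferred_splits)
    return fn, fp

true_tree = (("a", "b"), "c", ("d", "e"))
inferred_tree = (("a", "c"), "b", ("d", "e"))
print(error_rates(true_tree, inferred_tree))  # (0.5, 0.5)
```

In this toy case each tree has two internal edges and they share one, so both rates are 50%.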
13
Neighbor joining has poor performance on large
diameter trees [Nakhleh et al., ISMB 2001]

Simulation study based upon fixed edge lengths,
K2P model of evolution, sequence lengths fixed to
1000 nucleotides.
Error rates reflect proportion of incorrect edges
in inferred trees.

[Plot: NJ error rate (y-axis, 0 to 0.8) against number of taxa (x-axis, 0 to 1600)]
14

Other standard polynomial time methods don't
improve substantially on NJ (and have the same
problem with large diameter datasets).
What about trying to solve maximum parsimony or
maximum likelihood?

15
Maximum Parsimony

Input: Set S of n aligned sequences of length k
Output:
A phylogenetic tree T leaf-labeled by sequences
in S
additional sequences of length k labeling the
internal nodes of T
such that $\sum_{(i,j) \in E(T)} H(i,j)$ is minimized,
where E(T) is the edge set of T and H(i,j) denotes
the Hamming distance between sequences at nodes i
and j
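
For a tree whose internal nodes are already labeled, the quantity being minimized (taken here as the sum of H(i,j) over the edges of T) is straightforward to evaluate. The sketch below assumes the tree is given as an edge list plus a dictionary of node labels; the node names `l1` to `l4`, `x`, and `y` are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def tree_length(edges, labels):
    """Sum of Hamming distances across all edges of a fully labeled tree."""
    return sum(hamming(labels[u], labels[v]) for u, v in edges)

# Four leaves (l1..l4) and two internal nodes (x, y), all labeled.
labels = {"l1": "ACT", "l2": "ACA", "l3": "GTT", "l4": "GTA",
          "x": "ACA", "y": "GTA"}
edges = [("l1", "x"), ("l2", "x"), ("x", "y"), ("l3", "y"), ("l4", "y")]
print(tree_length(edges, labels))  # 4
```

Evaluating one labeling is easy; the hard part of maximum parsimony is searching over tree topologies.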

16
Maximum parsimony (example)

Input: Four sequences
ACT
ACA
GTT
GTA
Question: which of the three trees has the best
MP score?

17
Maximum Parsimony
[Figure: the three possible unrooted trees on the leaves ACT, ACA, GTT, GTA]
18
Maximum Parsimony
[Figure: the three trees with labeled internal nodes and edge lengths. Tree lengths shown: MP score 7, MP score 5, and MP score 4 (optimal MP tree)]
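
The three trees can also be scored mechanically. The sketch below uses Fitch's algorithm, which is not named in the slides and is brought in here as an assumption: for a fixed binary tree it finds the minimum number of changes over all internal labelings, one alignment column at a time. Leaves are written directly as their sequences, and each unrooted tree is rooted on its internal edge, which does not affect the count.

```python
def first_leaf(tree):
    """Leftmost leaf sequence of a nested-tuple tree."""
    while not isinstance(tree, str):
        tree = tree[0]
    return tree

def fitch_site(tree, site):
    """Return (possible states, minimum changes) for one alignment column."""
    if isinstance(tree, str):
        return {tree[site]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, site)
    right_set, right_cost = fitch_site(right, site)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # No shared state: one change is needed between the two subtrees.
    return left_set | right_set, left_cost + right_cost + 1

def mp_score(tree):
    """Minimum tree length over all internal labelings of a binary tree."""
    k = len(first_leaf(tree))
    return sum(fitch_site(tree, site)[1] for site in range(k))

trees = [
    (("ACT", "ACA"), ("GTT", "GTA")),
    (("ACT", "GTT"), ("ACA", "GTA")),
    (("ACT", "GTA"), ("ACA", "GTT")),
]
scores = [mp_score(t) for t in trees]
print(scores)  # [4, 5, 6]
```

The tree pairing ACT with ACA is optimal with score 4. For the tree pairing ACT with GTA, the minimum over all labelings is 6, so a length of 7 for that tree corresponds to one particular labeling of its internal nodes rather than the best one.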
19
Maximum Parsimony: computational complexity
20
Solving NP-hard problems exactly is unlikely

Number of (unrooted) binary trees on n leaves is
(2n-5)!!
If each tree on 1000 taxa could be analyzed in
0.001 seconds, finding the best tree by exhaustive
search would take on the order of 10^2847 millennia
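
The growth of (2n-5)!! can be checked directly with exact integer arithmetic. The sketch below assumes 0.001 seconds per tree, as on the slide, and a millennium of 1000 Julian years; the function name is illustrative.

```python
import math

def num_unrooted_binary_trees(n):
    """(2n-5)!! = 1 * 3 * 5 * ... * (2n-5), for n >= 3 leaves."""
    count = 1
    for odd in range(3, 2 * n - 4, 2):
        count *= odd
    return count

SECONDS_PER_MILLENNIUM = 1000 * 365.25 * 24 * 3600

count = num_unrooted_binary_trees(1000)
log10_trees = math.log10(count)
# 0.001 s per tree, converted to millennia, all in log10 to avoid overflow.
log10_millennia = log10_trees - 3 - math.log10(SECONDS_PER_MILLENNIUM)

print(num_unrooted_binary_trees(5))            # 15
print(f"about 10^{log10_trees:.0f} trees")
print(f"about 10^{log10_millennia:.0f} millennia")
```

For 1000 taxa the count is a number with roughly 2,860 decimal digits, which is why exhaustive search is ruled out regardless of hardware.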

21
Approaches for solving MP/ML

Hill-climbing heuristics (which can get stuck in
local optima)
Randomized algorithms for getting out of local
optima
Approximation algorithms for MP (based upon
Steiner Tree approximation algorithms).

22
Problems with current techniques for MP
Shown here is the performance of a heuristic
maximum parsimony analysis on a real dataset of
almost 14,000 sequences. ("Optimal" here means
best score to date, using any method for any
amount of time.) Acceptable error is below 0.01.
[Plot: Performance of TNT with time]
23
Observations

The best MP heuristics cannot get acceptably good
solutions within 24 hours on most of these large
datasets.
Datasets of these sizes may need months (or
years) of further analysis to reach reasonable
solutions.
Apparent convergence can be misleading.

24
Empirical problems with existing methods

Heuristics for Maximum Parsimony (MP) and Maximum
Likelihood (ML) cannot handle large datasets
(they take too long!): we need new heuristics for
MP/ML that can analyze large datasets.
Polynomial time methods have poor topological
accuracy on large diameter datasets: we need
better polynomial time methods.

25
What happens after the analysis?

The result of a phylogenetic analysis is often
thousands (or tens of thousands) of equally good
trees. What to do?
Biologists use consensus methods, as well as
other techniques, to try to infer what are likely
to be the characteristics of the true tree.
Current techniques lack sufficient power.
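
As one illustration of a consensus method, the sketch below computes the majority-rule consensus: the bipartitions that appear in more than half of the input trees. The slides do not say which consensus method is meant, so majority rule is an assumption here, as are the nested-tuple tree format and the function names. All trees are assumed to share the same leaf set.

```python
from collections import Counter

def leaves(tree):
    """Set of leaf names below a nested-tuple node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def bipartitions(tree):
    """Non-trivial leaf bipartitions, one per internal edge of the tree."""
    all_leaves = leaves(tree)
    anchor = min(all_leaves)  # fixed leaf used to pick a canonical side
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaves(node)
        if 1 < len(side) < len(all_leaves) - 1:
            splits.add(side if anchor not in side else all_leaves - side)
        for child in node:
            visit(child)

    visit(tree)
    return splits

def majority_splits(trees, threshold=0.5):
    """Bipartitions present in more than `threshold` of the trees."""
    counts = Counter(s for t in trees for s in bipartitions(t))
    return {s for s, c in counts.items() if c / len(trees) > threshold}

trees = [
    (("a", "b"), "c", ("d", "e")),
    (("a", "b"), "d", ("c", "e")),
    (("a", "c"), "b", ("d", "e")),
]
consensus = majority_splits(trees)
print(sorted(sorted(s) for s in consensus))  # [['c', 'd', 'e'], ['d', 'e']]
```

Each split is stored as the side that excludes leaf `a`, so `{c, d, e}` stands for the edge separating `a, b` from `c, d, e`. Splits supported by only one of the three trees are dropped.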

26
This course

Focus on the design and analysis of algorithms
for phylogeny reconstruction, multiple sequence
alignment, and consensus of sets of trees.
Objective: the design of new algorithms with
better performance than existing algorithms, as
evidenced by theory, experiment, or empirical
studies.
No background in biology or statistics is
required. (Some background in computer science is
presumed.)

27
General comments

There is interesting computer science research to
be done in computational phylogenetics, with a
tremendous potential for impact.
Algorithm development must be tested on both real
and simulated data.
The interplay between data, stochastic models of
evolution, optimization problems, and algorithms
is important and instructive.
