GRAPPA: Large-scale whole genome phylogenies based upon gene order evolution presentation

Title: GRAPPA: Large-scale whole genome phylogenies based upon gene order evolution

1
GRAPPA: Large-scale whole genome phylogenies
based upon gene order evolution

Tandy Warnow, UT-Austin
Department of Computer Sciences
Institute for Cellular and Molecular Biology
Program in Evolution, Ecology, and Behavior
Center for Computational Biology and
Bioinformatics

2
Whole-Genome Phylogenetics
3
Genomes As Signed Permutations
1 -5 3 4 -2 -6 or 6 2 -4 -3 5 -1 etc.
4
Genomes Evolve by Rearrangements
1 2 3 4 5 6 7 8 9 10

Inversion (Reversal)

Transposition

Inverted Transposition
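
The three operations named on this slide can be sketched on a signed permutation stored as a Python list. This is a minimal illustration, not GRAPPA's code; the function names and the 0-based, half-open index convention are assumptions made here.

```python
def inversion(genome, i, j):
    """Reverse genome[i:j] and flip the sign of every gene in it."""
    segment = [-g for g in reversed(genome[i:j])]
    return genome[:i] + segment + genome[j:]


def transposition(genome, i, j, k):
    """Move the segment genome[i:j] so that it follows genome[j:k] (i < j < k)."""
    return genome[:i] + genome[j:k] + genome[i:j] + genome[k:]


def inverted_transposition(genome, i, j, k):
    """Like transposition, but the moved segment is also inverted."""
    moved = [-g for g in reversed(genome[i:j])]
    return genome[:i] + genome[j:k] + moved + genome[k:]


genome = list(range(1, 11))  # 1 2 3 4 5 6 7 8 9 10, as on the slide
print(inversion(genome, 3, 8))               # genes 4..8 reversed and negated
print(transposition(genome, 1, 3, 6))        # genes 2 3 moved after 4 5 6
print(inverted_transposition(genome, 1, 3, 6))
```

Each function returns a new list, so the same starting genome can be reused for all three calls.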

5
Genome Rearrangement Has A Huge State Space

DNA sequences: 4 states per site
Signed circular genomes with n genes:
2^(n-1) (n-1)! states, 1 site
Circular genomes (1 site)
with 37 genes: about 2.56 x 10^52 states
with 120 genes: about 3.70 x 10^232 states
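
To give a feel for the size of this state space, the sketch below evaluates the standard count of signed circular gene orders on n genes, 2^(n-1) (n-1)!, for the two genome sizes named on the slide. The function name is an assumption made for this illustration.

```python
from math import factorial


def signed_circular_genome_count(n):
    """Number of distinct signed circular gene orders on n genes.

    Fixing the position and orientation of one gene removes the rotation and
    reading-direction symmetries, leaving (n - 1)! orders and 2**(n - 1) signs.
    """
    return 2 ** (n - 1) * factorial(n - 1)


for n in (37, 120):
    # Python integers are exact; the format call only rounds for display.
    print(n, f"{signed_circular_genome_count(n):.2e}")
```

By comparison, a DNA site has only 4 states, which is why a whole gene order is treated as a single character with an enormous number of states.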

6
Why use gene orders?

Rare genomic changes: the huge state space and the
relative infrequency of events (compared to site
substitutions) could make the inference of deep
evolution easier, or more accurate.
Our research shows this is true, but accurate
analysis of gene order data is computationally
very intensive!

7
Phylogeny reconstruction from gene orders

Distance-based reconstruction: estimate pairwise
distances, and apply methods like
Neighbor-Joining or Weighbor
Maximum Parsimony: find the tree with the minimum
length (inversions, transpositions, or other edit
distances)
Maximum Likelihood: find the tree and parameters of
evolution most likely to generate the observed
data

8
This talk

Distance-based methods: we show how to estimate
genomic distances appropriately for the
Generalized Nadeau-Taylor model
Parsimony-style methods: we can find very good
solutions to NP-hard problems (inversion and
breakpoint phylogeny) quite quickly
Validation of these approaches on real and
simulated data

9
Distance-based methods
10
Genomic Distance Estimators

Standard:
Breakpoint distance
(Minimum) Inversion distance
Our estimators: we attempt to estimate the actual
number of events (the true evolutionary
distance)
EDE (Moret et al., ISMB 2001)
Approx-IEBP (Wang and Warnow, STOC 2001)
Exact-IEBP (Wang, WABI 2001)

11
Breakpoint Distance

Breakpoint distance = 5

1 2 3 4 5 6 7 8 9 10
1 3 2 4 5 9 6 7 8 10
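
The count on this slide can be reproduced with a short function. This is a sketch under stated assumptions rather than GRAPPA's implementation: the function name and its `signed` and `circular` flags are introduced here, and because the example is printed without signs, it is evaluated as an unsigned comparison.

```python
def breakpoint_distance(g, h, signed=False, circular=True):
    """Count adjacencies of g that do not occur in h."""

    def adjacencies(genome):
        pairs = list(zip(genome, genome[1:]))
        if circular:
            pairs.append((genome[-1], genome[0]))
        return pairs

    if signed:
        # (a, b) read forwards is the same adjacency as (-b, -a) read backwards.
        seen = {p for a, b in adjacencies(h) for p in ((a, b), (-b, -a))}
        return sum((a, b) not in seen for a, b in adjacencies(g))

    # Unsigned: an adjacency is just an unordered pair of neighbouring genes.
    seen = {frozenset((abs(a), abs(b))) for a, b in adjacencies(h)}
    return sum(frozenset((abs(a), abs(b))) not in seen for a, b in adjacencies(g))


reference = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
rearranged = [1, 3, 2, 4, 5, 9, 6, 7, 8, 10]
print(breakpoint_distance(rearranged, reference))  # 5, matching the slide
```

With `signed=True` the same printed example scores 6, because the pair 3 2 then counts as a breakpoint unless both genes carry negative signs.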
12
Minimum Inversion Distance

Inversion distance = 3

1 2 3 4 5 6 7 8 9 10
1 2 3 8 7 6 5 4 9 10
1 8 3 2 7 6 5 4 9 10
1 8 3 7 2 6 5 4 9 10
13
Measured Distance vs. Actual Number of Events
[Figure: two panels, Breakpoint Distance and Inversion Distance; 120 genes, inversion-only evolution]
14
Generalized Nadeau-Taylor Model

Three types of events
Inversions
Transpositions
Inverted Transpositions
Events of the same type are equiprobable
The probabilities of the three types have a fixed ratio
Inv : Trp : Inv.Trp = (1-a-b) : a : b
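
As a small illustration of the fixed ratio (1-a-b) : a : b, the sketch below draws a sequence of event types with those probabilities. The function name, the string labels, and the `seed` parameter are assumptions made here; choosing where each event acts on the genome is left out.

```python
import random


def sample_event_types(k, a, b, seed=None):
    """Draw k event types with P(inv) = 1-a-b, P(trp) = a, P(inv_trp) = b."""
    if a < 0 or b < 0 or a + b > 1:
        raise ValueError("need a >= 0, b >= 0 and a + b <= 1")
    rng = random.Random(seed)
    kinds = ["inversion", "transposition", "inverted_transposition"]
    # max() guards against a tiny negative weight from floating-point rounding.
    weights = [max(0.0, 1 - a - b), a, b]
    return rng.choices(kinds, weights=weights, k=k)


# Inversion-only evolution corresponds to a = b = 0.
print(sample_event_types(5, 0.0, 0.0))
# All three event types equiprobable corresponds to a = b = 1/3.
print(sample_event_types(5, 1 / 3, 1 / 3, seed=0))
```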

15
Estimating True Evolutionary Distances for Genomes

Given fixed probabilities for each type of
event, we estimate the expected breakpoint
distance after k random events, or the expected
inversion distance after k random events.
Inverting these functions gives us a better
estimate of true evolutionary distances.
Approx-IEBP (Wang and Warnow 2001)
Exact-IEBP (Wang 2001)
EDE (Moret et al., ISMB 2001; inversion based)
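
The idea of inverting an expected-distance curve can be sketched with a deliberately simplified stand-in. The formula below is not the IEBP or EDE estimator. It assumes inversion-only evolution on a circular genome of n genes, that each inversion breaks any given adjacency with probability 2/n, and that broken adjacencies never re-form. Both function names are introduced here for illustration.

```python
import math


def expected_breakpoints(n, k):
    """Simplified expected breakpoint distance after k random inversions.

    Assumption (not the IEBP or EDE formula): each inversion breaks any given
    adjacency with probability 2/n, and broken adjacencies never re-form.
    """
    return n * (1.0 - (1.0 - 2.0 / n) ** k)


def estimate_events(n, observed_breakpoints):
    """Invert expected_breakpoints: the k whose expectation equals the observation."""
    if observed_breakpoints >= n:
        return math.inf  # saturated: the observation carries no distance signal
    return math.log(1.0 - observed_breakpoints / n) / math.log(1.0 - 2.0 / n)


n = 120
for k in (10, 40, 80):
    b = expected_breakpoints(n, k)
    # The expected breakpoint count falls below 2k as k grows; inverting recovers k.
    print(k, round(b, 1), round(estimate_events(n, b), 1))
```

The estimate grows without bound as the observed distance approaches n, which is one way to see why accuracy degrades near saturation.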

16
Estimating True Evolutionary Distances for
Genomes (cont.)

Estimating the expected inversion distance
EDE (Moret, Wang, Warnow, Wyman 2001)
Closed-form formula based upon an empirical
estimation of the expected inversion distance
after k random events (based upon 120 genes and
inversion only, but robust to errors in the
model).
Polynomial time, fastest of the three.

17
Absolute Difference

[Figure: 120 genes, inversion-only evolution (similar relative performance under other models)]

18
Accuracy of Neighbor Joining Using Distance
Estimators

120 genes
All three event types equiprobable
10, 20, 40, 80, and 160 genomes
Similar relative performance under other models

19
Summary of Distance-based Reconstruction Methods

Statistically-based estimation of genomic
distances improves NJ analyses.
Our IEBP estimators assume knowledge of the
probabilities of each type of event, but are
robust to model violations.
EDE is based upon an inversion-only evolutionary
model, but is robust.
Best performing method: Weighbor(EDE); second
best is NJ(EDE); both are robust to model
violations.
Worst performing is NJ(BP).
Accuracy is very good, except when very close to
saturation.

20
Maximum Parsimony on Rearranged Genomes (MPRG)

The leaves are rearranged genomes.
Find the tree that minimizes the total number of
rearrangement events

21
Optimization problems for gene order phylogeny

Breakpoint phylogeny: find the phylogeny that
minimizes the total number of breakpoints
(NP-hard, even to find the median of three
genomes)
Inversion phylogeny: find the phylogeny that
minimizes the sum of inversion distances on the
edges (NP-hard, even to find the median of three
genomes)
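
To make the median-of-three problem concrete, the sketch below solves the breakpoint version by exhaustive search. It is not how GRAPPA works: it enumerates all 2^n n! signed linear orders, so it is only usable for a handful of genes. The function names and the choice of linear genomes are assumptions made for this illustration.

```python
from itertools import permutations, product


def breakpoints(g, h):
    """Signed breakpoint distance between two linear genomes."""
    kept = set()
    for a, b in zip(h, h[1:]):
        kept.add((a, b))
        kept.add((-b, -a))  # the same adjacency read in the other direction
    return sum((a, b) not in kept for a, b in zip(g, g[1:]))


def brute_force_median(g1, g2, g3):
    """Return (median, score) minimizing the summed distance to g1, g2, g3."""
    genes = sorted(abs(x) for x in g1)
    best, best_score = None, None
    for order in permutations(genes):
        for signs in product((1, -1), repeat=len(genes)):
            candidate = tuple(s * x for s, x in zip(signs, order))
            score = sum(breakpoints(candidate, g) for g in (g1, g2, g3))
            if best_score is None or score < best_score:
                best, best_score = candidate, score
    return best, best_score


median, score = brute_force_median((1, 2, 3, 4), (1, -3, -2, 4), (2, 1, 3, 4))
print(median, score)
```

Scoring a whole tree requires solving a median problem like this at every internal node, which is why the search is so expensive.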

22
Inversion and Breakpoint phylogenies

When the data are close to saturated, even
Weighbor(EDE) analyses are insufficiently
accurate. In these cases, our initial
investigations suggest that the inversion and
breakpoint phylogeny approaches may be superior.
Problem: finding the best trees is enormously
hard, since even the point estimation problem
is hard (worse than estimating branch lengths in
ML).

[Figure: MP score over the space of phylogenetic trees, showing a local optimum and the global optimum]
23
GRAPPA (Genome Rearrangement Analysis under
Parsimony and other Phylogenetic Algorithms)

http://www.cs.unm.edu/~moret/GRAPPA/
Heuristics for NP-hard optimization problems
Fast polynomial time distance-based methods
Contributors: U. New Mexico, U. Texas at Austin,
Università di Bologna, Italy
Freely available in source code at this site.
Project leader: Bernard Moret (UNM)
(moret_at_cs.unm.edu)

26
Benchmark gene order dataset: Campanulaceae

12 genomes + 1 outgroup (Tobacco), 105 gene
segments
NP-hard optimization problems: breakpoint and
inversion phylogenies (techniques score every
tree)
1997: BPAnalysis (Blanchette and Sankoff): 200
years (est.)
2000: Using GRAPPA v1.1 on the 512-processor Los
Lobos Supercluster machine: 2 minutes
(200,000-fold speedup per processor)
2003: Using latest version of GRAPPA: 2 minutes
on a single processor (1-billion-fold speedup per
processor)

27
Summary

Weighbor(EDE) and NJ(EDE) are highly accurate
polynomial time distance-based reconstructions,
except when datasets are close to saturated
GRAPPA (inversion phylogeny or breakpoint
phylogeny) produces highly accurate estimates of
trees, and even of ancestral gene orders,
acceptably fast

28
DCM-boosting MP and ML

Idea: it may be faster to run a computationally
expensive method on a few overlapping subproblems
of somewhat smaller size
Challenge: how to pick the best decomposition?

29
Addressing the accuracy/time issues
Disk-Covering Methods
DCM1 decomposition: lots of small-diameter
subproblems. (Used for NJ.)
DCM2 decomposition: very few subproblems, each
somewhat smaller. (Used for MP or ML.)
30
DCM-boosting: Speeding up MP/ML heuristics
[Figure (fake study): MP score of best trees vs. time, comparing the performance of the hill-climbing heuristic with the desired performance]
31
The DCM2 technique for speeding up MP/ML searches
32
GRAPPA (Genome Rearrangement Analysis under
Parsimony and other Phylogenetic Algorithms)

http://www.cs.unm.edu/~moret/GRAPPA/
Heuristics for NP-hard optimization problems
Fast polynomial time distance-based methods
Contributors: U. New Mexico, U. Texas at Austin,
Università di Bologna, Italy
Freely available in source code
Project leader: Bernard Moret (UNM)
(moret_at_cs.unm.edu)

33
Limitations and ongoing research

Current methods are limited to single chromosomes
with equal gene content (or very small amounts of
deletions and duplications); we are working on
developing reliable techniques for genomes with
unequal gene content

34
Acknowledgements

Funding:
The David and Lucile Packard Foundation, and
The National Science Foundation.
Collaborators:
Bernard Moret (UNM), David Bader (UNM), Bob
Jansen (UT), Linda Raubeson (CWU)
Students: Li-San Wang (now postdoc of Junhyong
Kim at Penn), Jijun Tang (UNM), and others at UNM

35
Phylolab, U. Texas
Please visit us at http://www.cs.utexas.edu/users/phylo/
