HighPerformance Computing for Reconstructing Phylogenies from GeneOrder Data presentation

About This Presentation

Transcript and Presenter's Notes

Title: HighPerformance Computing for Reconstructing Phylogenies from GeneOrder Data

1
High-Performance Computing forReconstructing
Phylogenies fromGene-Order Data

David A. Bader
Electrical Computer Engineering
University of New Mexico
dbader_at_eece.unm.edu
http//hpc.eece.unm.edu/

2
Acknowledgment of Support

National Science Foundation
CAREER High-Performance Algorithms for
Scientific Applications (00-93039)
ITR Algorithms for Irregular Discrete
Computations on SMPs (00-81404)
DEB Ecosystem Studies Self-Organization of
Semi-Arid Landscapes Test of Optimality
Principles (99-10123)
ITR/AP Reconstructing Complex Evolutionary
Histories (01-21377)
DEB Comparative Chloroplast Genomics Integrating
Computational Methods, Molecular Evolution, and
Phylogeny (01-20709)
ITR/AP(DEB) Computing Optimal Phylogenetic Trees
under Genome Rearrangement Metrics (01-13095)
PACI NCSA/Alliance, NPACI/SDSC, PSC
Sun Microsystems

3
Algorithms that Scale from the Blade to the Fire
4
Commercial Aspects of Phylogeny Reconstruction

Identification of microorganisms
public health entomology
sequence motifs for groups are patented
example differentiating tuberculosis strains
Dynamics of microbial communities
pesticide exposure identify and quantify
microbes in soil
Vaccine development
variants of a cell wall or protein coat component
porcine reproductive and respiratory syndrome
virus isolates from US and Europe were separate
populations
HIV studied through DNA markers
Biochemical pathways
antibacterials and herbicides
Glyphosate (Roundup?, Rodeo ?, and Pondmaster ?)
first herbicide targeted at a pathway not present
in mammals
phylogenetic distribution of a pathway is studied
by the pharmaceutical industry before a drug is
developed
Pharmaceutical industry
predicting the natural ligands for cell surface
receptors which are potential drug targets
a single family, G protein coupled receptors
(GPCRs), contains 40 of the targets of most
pharm. companies

5
GRAPPA Genome Rearrangements Analysis

Genome Rearrangements Analysis under Parsimony
and other Phylogenetic Algorithms
http//www.cs.unm.edu/moret/GRAPPA/
Open-source
already used by other computational phylogeny
groups, Caprara, Pevzner, LANL, FBI, PharmCos.
Gene-order Phylogeny Reconstruction
Breakpoint Median
Inversion Median
over one-million fold speedup from previous codes
Parallelism
Scales linearly with the number of processors
Developed using Sun Forte C

6
Molecular Data for Phylogeny

simple DNA sequence nucleotides
low-level functionality amino acids, etc.
genomic level genes
(next is functional level proteomics, etc.)
Biologists now have full gene sequences for
many single-chromosome organisms and organelles
(e.g., mitochondria, chloroplasts) and for more
and more larger organisms

7
Gene Order Phylogeny

Many organelles appear to evolve mostly through
processes that simply rearrange gene ordering
(inversion, transposition) and perhaps alter gene
content (duplication, loss).
Chloroplast have a single, typically circular,
chromosome and appear to evolve mostly through
inversion

The sequence of genes i, i1, , j is inverted
and every gene is flipped.
8
Gene Order Phylogeny (contd)

The real problem
Reconstruct the true tree, identify the true
ancestral genomes, and recover on each edge the
true sequence of evolutionary changes
The optimization problem (parsimony)
Reconstruct a tree and ancestral genomes so as to
minimize the sum, over all tree edges, of the
inferred evolutionary distance along each edge
The surrogate problem
Do the optimization problem with a measure of
inferred evolutionary distance that lends itself
to analysis

9
Breakpoint AnalysisA Surrogate for Gene Order

Breakpoint
an adjacent pair of genes present in one genome,
but absent in the other
Breakpoint distance
the total number of breakpoints between two
genomes (a true metric, similar to Hamming
distance)
Breakpoint phylogeny
the tree and ancestral genomes that minimize the
sum, over all edges of the tree, of the
breakpoint distances
Naturally, it is an NP-hard problem, even with
just 3 leaves.

10
Breakpoint Analysis(Sankoff Blanchette 1998)

For each tree topology do
somehow assign initial genomes to the internal
nodes
repeat
for each internal node do
compute a new genome that minimizes the distances
to its three neighbors
replace old genome by new if distance is reduced
until no change
Sankoff Blanchette implemented this in a C
package

(2n-5)!! (2n-5) (2n-7) ? 5 ? 3 trees
unknown iterative heuristic
NP-hard
11
Algorithm Engineering Works!

We reimplemented everything
the original code is too slow and not as flexible
as we wanted.
Our main dataset is a collection of chloroplast
data from the flowering plant family
Campanulaceae (bluebells)
13 genomes of 105 gene segments each
On our old workstation
BPAnalysis processes 10-12 trees/minute
Our implementation processes over 50,000
trees/minute
Speedup ratio is over 5,000!!
On synthetic datasets, we see speedups from 300
to over 50,000

12
So What did we do?!

( 10x)

Absolutely no high-level algorithmic changes
Three low-level algorithmic changes
better bounding
strong upper bound initialization
condensing
Completely different data representation
Two low-level algorithmic changes
all memory is pre-allocated
some loops are hand-unrolled
Written in C instead of C

10x
10x
6x
? (convenience)
Well, so I lied just a little bit
13
One high-level algorithmic change

(Ok, so I lied a little)
Avoid labeling the tree if possible
Use current best score as an upper bound.
Compute lower bound prune tree away if lower
bound gt upper bound
Lower bound
Get circular ordering of leaves, x1 x2 xn
Compute D d(x1,x2) d(x2,x3) d(xn,x1)
Then ½ D is a lower bound because
d(.) obeys the triangle inequality
every tree edge is used twice in a tree-based
version of D

14
Tree
Tree
Tree version (paths)
d(e,a)
d(d,e)
d(a,b)
d(c,d)
d(b,c)
(Same trick as in the twice around the tree
approximation for the TSP with triangle
inequality.)
D d(a,b) d(b,c) d(c,d) d(d,e)
d(e,a)
15
Algorithmic Changes ( 10x)

Better bounding skip edges that would cause
degree 3 or premature cycle
Condensing whenever the same
gene subsequence appears in all genomes, it can
be condensed into a single superfragment
done as static processing and on the fly before
each TSP
Initializing the new median with the best of the
old one and its three neighbors.
Condensing is very effective on real data within
families, but easily defeated by large
evolutionary distances.
(1) and (3) cause over half of the TSP instances
(for finding computing median-of-three updated
internal nodes) to be pruned away instantly.

16
Data Representation ( 10x)

No distance matrix for reduction of
median-of-three to Traveling Salesperson
Problem (TSP) at most 4n edges can be of
interest the others are treated as an
undifferentiated pool. The adjacency lists have
length ? 4.
thus, linear time at each step and reduced
storage.
Backtracking search has a small list of edges and
only searches among edges of cost 1 and 2 (? and
0 are always included) still NP-hard, but often
easy
When search runs out of edges, tour is completed
in linear time from the pool of edges of cost 3
Many auxiliary arrays (á la Fortran!) to carry
information on flags, degrees, other end of
chains,

17
Low-Level Coding Changes ( 6x)

All storage allocated at start, with large s of
pointers passed to subroutines (no globals, to
allow parallel execution).
Avoids malloc/free overhead Improves cache
locality
Avoid recomputations.
Use local variables for intermediate pointers
Hand unroll loops on adjacencies to preserve
locality (and to avoid mod operations with
circular genomes)
Speeds up addressing never deference!
Improves cache locality
BPAnalysis uses 65MB and has a real memory
footprint of 12MB on our real data
Our reimplementation uses 1.6MB with a footprint
of 0.6MB

18
And How did we do it?

3 strategies Profile, Profile, Profile
(and use your engineering sense/nose/ )
Sun Forte 6 Analyzer
We began with 4 main culprits
preparing adjacency lists for the TSP
computing breakpoint distances
computing lower bounds in TSP
backtracking in TSP
Over 10 12 major iterations, each of which
yielded a 1.5 2 fold speed-up, these four
switched places over and over.

19
Profiling

And our final tally (still on the Campanulaceae
dataset) is
30 backtracking (excl. LB)
20 preparing adjacency lists
20 condensing expanding
15 computing LB
8 computing distances
7 miscellaneous overhead
(no obvious culprits left)

20
High-Performance Computing Techniques

Availability of hundreds of powerful processors
Standard parallel programming interfaces (Sun
HPC)
Message passing interface (MPI)
OpenMP or POSIX threads
Algorithmic libraries for SMP clusters
SIMPLE
Goal make efficient use of parallelism for
exploring candidate tree topologies
sharing of improved bounds

21
Parallelization of the Phylogeny Algorithm

Enumerating tree topologies is pleasantly
parallel and allows multiple processors to
independently search the tree space with little
or no overhead
Improved bounds can be broadcast to other
processors without interrupting work
Load is evenly balanced when trees are cyclically
assigned (e.g. in a round-robin fashion) to the
processors
Linear speedup

22
Final Remarks

Our reimplementation led to numerous extensions
as well as to new theoretical results
GRAPPA has been extended to inversion phylogeny,
with linear-time algorithms for inversion
distance and a new approach to exact inversion
median-of-three.
Better bounding in the next version of GRAPPA
yields two more orders of magnitude speedup.
These insights and improvements are made possible
by mature development tools (Forte)
Algorithmic engineering techniques are widely
applicable
We may not always get 6 orders of magnitude, but
3 4 orders should be nearly routine with most
codes. (We are starting work on TBR and exact
parsimony solvers.)

23
Final Remarks (contd)

High-performance implementations enable
better approximations for difficult problems (MP,
ML)
true optimization for larger instances
realistic data exploration (e.g., testing
evolutionary scenarios, assessing answers
obtained through other means, etc.)
Our analysis of the Campanulaceae dataset
confirmed the conjecture of Robert Jansen et al.
that inversion is the principal process of
genome evolution in cpDNA for this group.

24
Work-In-Progress and Future Work

Tree enumeration using circular ordering
Handle unequal gene content and duplicate genes
using exemplars
Parallel branch and bound techniques (optimized
for Sun HPC Servers) for searching tree space
Improved SPR and TBR techniques (local searches
around good trees)
Exact Algorithm for Maximum Parsimony

25
Recent publications (2001)

A New Implementation and Detailed Study of
Breakpoint Analysis, B.M.E. Moret, S. Wyman, D.A.
Bader, T. Warnow, M. Yan, Sixth Pacific Symposium
on Biocomputing 2001, pp. 583-594, Hawaii,
January 2001.
High-Performance Algorithm Engineering for
Gene-Order Phylogenies, D.A. Bader, B. M.E.
Moret, T. Warnow, S.K. Wyman, and M. Yan, DIMACS
Workshop on Whole Genome Comparison, DIMACS
Center, Rutgers University, Piscataway, NJ, March
2001.
Variation in vegetation growth rates
Implications for the evolution of semi-arid
landscapes, C. Restrepo, B.T. Milne, D. Bader, W.
Pockman, and A. Kerkhoff, 16th Annual Symposium
of the US-International Association of Landscape
Ecology, Arizona State University, Tempe, April
2001.
High-Performance Algorithm Engineering for
Computational Phylogeny, B. M.E. Moret, D.A.
Bader, and T. Warnow, 2001 International
Conference on Computational Science, San
Francisco, CA, May 2001.
Cluster Computing Applications, David A. Bader
and Robert Pennington, The International Journal
of High Performance Computing, 15(2)181-185, May
2001.
New approaches for using gene order data in
phylogeny reconstruction, R.K. Jansen, D.A.
Bader, B. M. E. Moret, L.A. Raubeson, L.-S. Wang,
T. Warnow, and S. Wyman. Botany 2001,
Albuquerque, NM, August 2001.
GRAPPA a high-performance computational tool for
phylogeny reconstruction from gene-order data, B.
M.E. Moret, D.A. Bader, T. Warnow, S.K. Wyman,
and M. Yan. Botany 2001, Albuquerque, NM, August
2001.
Inferring phylogenies of photosynthetic organisms
from chloroplast gene orders, L.A. Raubeson, D.A.
Bader, B. M.E. Moret, L.-S. Wang, T. Warnow, and
S.K. Wyman. Botany 2001, Albuquerque, NM, August
2001.
Industrial Applications of High-Performance
Computing for Phylogeny Reconstruction, D.A.
Bader, B. M.E. Moret, and L. Vawter, SPIE ITCom
Commercial Applications for High-Performance
Computing, Denver, CO, SPIE Vol. 4528, pp.
159-168, August 2001.
Using PRAM Algorithms on a Uniform-Memory-Access
Shared-Memory Architecture, D.A. Bader, A.
Illendula, B. M.E. Moret, and N.R.
Weisse-Bernstein, Fifth Workshop on Algorithm
Engineering, Springer-Verlag LNCS 2141, 129-144,
Aarhus, Denmark, August 2001.
A Linear-Time Algorithm for Computing Inversion
Distance Between Two Signed Permutations with an
Experimental Study, D.A. Bader, B. M.E. Moret,
and M. Yan, Journal of Computational Biology,
8(5)483-491, October 2001.

Write a Comment

User Comments (0)

About PowerShow.com

HighPerformance Computing for Reconstructing Phylogenies from GeneOrder Data PowerPoint PPT Presentation