Title: A Parallel Geometric Buildup
1. A Parallel Geometric Buildup
- Vladimir Sukhoy
- Graduate Student
- Department of Mathematics
- Iowa State University
- 2006
2. The Problem
- Find the coordinates for a set of points given the distances between some (or all) of them.
- Do it fast enough.
- And numerically stable enough.
3. Is it important?
(Figure: HIV Retrotranscriptase, 4200 atoms, 554 amino acids)
- Proteins are building blocks of life and key ingredients of biological processes.
- A biological system may have up to hundreds of thousands of different proteins, each with a specific role in the system.
- A protein is formed by a polypeptide chain with typically several hundred amino acids and tens of thousands of atoms.
- A protein has a unique 3D structure, which determines in many ways the function of the protein.
- To figure out the 3D structure, we sometimes need to recover coordinates from distances.
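4. A More Formal Problem Description
- Given n atoms a_1, ..., a_n and a set of distances d_{i,j} between a_i and a_j, (i,j) in S, find coordinates x_1, ..., x_n in R^3 such that ||x_i - x_j|| = d_{i,j} for all (i,j) in S.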
5. Related Material
- Problems with all distances: solvable in O(n^3) using SVD
- Problems with sparse sets of distances: NP-complete (Saxe 1979)
- Problems with probability distributions of distances: stochastic multidimensional scaling, structure prediction
- Problems with distance ranges (NMR results): NP-complete (Moré and Wu 1997), if the ranges are small
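For the full-distance case, a standard route to the O(n^3) bound is the Gram matrix construction from classical distance geometry. The sketch below states it for reference; the symbols G, U, and \Sigma are notation introduced here and do not appear in the slides.

```latex
% Law of cosines, taking x_1 as the origin:
G_{ij} = \tfrac{1}{2}\left(d_{i1}^2 + d_{j1}^2 - d_{ij}^2\right)
       = (x_i - x_1)\cdot(x_j - x_1), \qquad i, j = 2, \dots, n

% With exact distances in R^3, G is symmetric positive semidefinite of rank <= 3:
G = U \Sigma U^{T}, \qquad X = U_3\, \Sigma_3^{1/2}

% The rows of X are the coordinates of x_i - x_1 (up to rotation and reflection);
% U_3 and \Sigma_3 keep the three largest singular values.
```

The decomposition of the (n-1) x (n-1) matrix G is where the cubic cost comes from.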
6. Current Approaches
- Embed Algorithm by Crippen and Havel
- CNS Partial Metrization by Brünger et al.
- Graph Reduction by Hendrickson
- Alternating Projection by Glunt and Hayden
- Global Optimization by Moré and Wu
- Multidimensional Scaling by Trosset, et al.
7. Geometric Buildup
- Independent Points: A set of k+1 points in R^k is called independent if it is not contained in a (k-1)-dimensional subspace, i.e. it is not a set of points in R^(k-1).
- Metric Basis: A set of points B in a space S is a metric basis of S provided each point of S is uniquely determined by its distances from the points in B.
- Fundamental Theorem: Any k+1 independent points in R^k form a metric basis for R^k.
Blumenthal, 1953, Theory and Applications of Distance Geometry
8. 2D Geometric Buildup
in two dimensions
In 2D we need distances to three known points (provided they are not on the same line) to determine the coordinates of an unknown point.
9. 3D Geometric Buildup
in three dimensions
In 3D we need distances to four known points (provided they are not on the same plane) to determine the coordinates of an unknown point.
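10. 3D Geometric Buildup (detailed)
Known points: x_1 = (u_1, v_1, w_1), x_2 = (u_2, v_2, w_2), x_3 = (u_3, v_3, w_3), x_4 = (u_4, v_4, w_4)
Unknown point: x_i = (u_i, v_i, w_i) = ?
||x_i - x_1|| = d_{i,1}, ||x_i - x_2|| = d_{i,2}, ||x_i - x_3|| = d_{i,3}, ||x_i - x_4|| = d_{i,4}
Unknown point: x_j = (u_j, v_j, w_j) = ?
||x_j - x_1|| = d_{j,1}, ||x_j - x_2|| = d_{j,2}, ||x_j - x_3|| = d_{j,3}, ||x_j - x_4|| = d_{j,4}
(Figure: unknown points i and j, each linked by known distances to the base points 1, 2, 3, 4.)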
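Squaring the four distance equations and subtracting the first from the other three cancels the quadratic term and leaves a 3x3 linear system, 2 (x_k - x_1) . x = ||x_k||^2 - ||x_1||^2 - d_k^2 + d_1^2 for k = 2, 3, 4. The following is a minimal Python sketch of one such buildup step. The function names `solve3` and `locate` and the pivot tolerance are illustrative choices, not taken from the project code.

```python
import math


def solve3(a, b):
    """Solve a 3x3 system a x = b by Gaussian elimination with partial pivoting."""
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:  # assumed tolerance, tune for real data
            raise ValueError("base points are (nearly) coplanar")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            f = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= f * m[col][c]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, 3))
        x[r] = (m[r][3] - s) / m[r][r]
    return x


def locate(base, dists):
    """Coordinates of the point at distances `dists` from the four `base` points."""
    x1, d1 = base[0], dists[0]
    n1 = sum(c * c for c in x1)
    a, b = [], []
    for xk, dk in zip(base[1:], dists[1:]):
        a.append([2.0 * (xk[t] - x1[t]) for t in range(3)])
        b.append(sum(c * c for c in xk) - n1 - dk * dk + d1 * d1)
    return solve3(a, b)


# Usage: recover a point from its distances to four non-coplanar base points.
base = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
target = (0.3, -0.2, 0.5)
dists = [math.dist(p, target) for p in base]
found = locate(base, dists)
```

Each step costs a constant number of operations, which is consistent with a linear total cost when every unknown point has four usable base points.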
11. Performance
The geometric build-up algorithm solves a molecular distance geometry problem in O(n) time when distances between all (or at least enough!) pairs of atoms are given, while the singular value decomposition algorithm always requires O(n^2) to O(n^3) computing time!
12. Some Pictures
The X-ray crystallography structure (left) of the
HIV-1 RT p66 protein (4200 atoms) and the
structure (right) determined by the geometric
build-up algorithm using the distances for all
pairs of atoms in the protein. The algorithm took
only 188,859 floating-point operations to obtain
the structure, while a conventional
singular-value decomposition algorithm required
1,268,200,000 floating-point operations. The RMSD
of the two structures is 10^-4 Å.
13. Parallelization
- If the set of distances is full, then the geometric buildup algorithm is easily parallelized: each processor is made responsible for a particular subset of unknown points, and no communication is necessary! The performance increase is linear (or superlinear when everything fits into a cache?) with respect to the number of processors. And, by the way, this case is of no interest.
- If the set of distances is not full: everything gets a bit complicated.
14. Geometric Buildup with Sparse Set of Distances
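One way to picture a sparse buildup is as a loop that repeatedly looks for an undetermined point with distances to at least four already determined, non-coplanar points. The Python sketch below is an assumption about how such a loop could look, not the project's implementation. The names `locate` and `buildup`, the tolerance `eps`, and the rule "generation = 1 + highest generation in the base" are all introduced here for illustration.

```python
import itertools
import math


def det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def locate(base, dists, eps=1e-9):
    """Point at distances `dists` from four `base` points; None if nearly coplanar."""
    x1, d1 = base[0], dists[0]
    n1 = sum(c * c for c in x1)
    a = [[2.0 * (xk[t] - x1[t]) for t in range(3)] for xk in base[1:]]
    b = [sum(c * c for c in xk) - n1 - dk * dk + d1 * d1
         for xk, dk in zip(base[1:], dists[1:])]
    d = det3(a)
    if abs(d) < eps:
        return None
    # Cramer's rule: replace column t of the matrix with the right-hand side.
    return [det3([[b[r] if c == t else a[r][c] for c in range(3)]
                  for r in range(3)]) / d for t in range(3)]


def buildup(start, dist):
    """start: {index: point} for the initial base; dist: {(i, j): distance}."""
    coords = dict(start)
    generation = {i: 0 for i in coords}
    nbr = {}
    for (i, j), d in dist.items():
        nbr.setdefault(i, {})[j] = d
        nbr.setdefault(j, {})[i] = d
    progress = True
    while progress:
        progress = False
        for i in sorted(nbr):
            if i in coords:
                continue
            # Prefer low-generation base points to keep error chains short.
            known = sorted((j for j in nbr[i] if j in coords),
                           key=lambda j: (generation[j], j))
            for quad in itertools.combinations(known, 4):
                p = locate([coords[j] for j in quad], [nbr[i][j] for j in quad])
                if p is not None:
                    coords[i] = p
                    generation[i] = 1 + max(generation[j] for j in quad)
                    progress = True
                    break
    return coords, generation


# Usage: points 4, 5, 6 each have distances to only four other points.
true = {0: (0, 0, 0), 1: (1, 0, 0), 2: (0, 1, 0), 3: (0, 0, 1),
        4: (1, 1, 1), 5: (2, 1, 0), 6: (2, 2, 2)}
pairs = [(0, 4), (1, 4), (2, 4), (3, 4), (1, 5), (2, 5), (3, 5), (4, 5),
         (2, 6), (3, 6), (4, 6), (5, 6)]
dist = {(i, j): math.dist(true[i], true[j]) for i, j in pairs}
coords, generation = buildup({i: true[i] for i in range(4)}, dist)
```

In this toy example point 6 can only be reached through points 4 and 5, so it ends up three generations away from the initial base. Trying all 4-tuples is fine for a sketch but would need pruning on real data.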
15. Rounding Errors and Stability
Unfortunately, numerical error tends to become too high in a multistage geometric buildup. The critical number of points (in my current implementation using double precision) seems to be around 70, and it also depends on the sparsity of the data. When the data is sparser, the error tends to become intolerable sooner, because on average more steps are required to get the coordinates of a particular point, and more steps mean more rounding errors.
16. More on Rounding Errors
- Rounding error increases exponentially with the number of generations!
- Even if a nice long-double LU-factorization linear solver is used, everything dies after a handful of generations!
- One can try to solve an optimization problem instead, or try some mixture. But it is likely to be much slower. And the worst-case error does not become better at all.
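One rough way to see where exponential growth can come from is a simple recurrence. It assumes, hypothetically, that each generation amplifies the error inherited from its base points by at most a factor c > 1 and adds a fresh rounding error u. Both c and u are symbols introduced here, not quantities measured in the project.

```latex
e_g \le c\, e_{g-1} + u
\quad\Longrightarrow\quad
e_g \le c^{g} e_0 + u\,\frac{c^{g} - 1}{c - 1}
```

With u of about 1.1 x 10^-16 in double precision, an illustrative c = 100 already gives an error of order 10^-16 x 100^7 = 10^-2 after seven generations. Extra precision only shrinks u, not c, so under this model it buys just a few more generations.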
17. The Project
- A specific version of the geometric buildup algorithm was implemented on hpc-class.
- The algorithm is able to deal with a sparse set of distances (to some extent).
- The algorithm was used to solve the geometric buildup problem for the HIV RT p66 protein and showed a near-linear performance increase with respect to the number of processors used.
18. Parallelization Idea
- The parallelization implemented is based on a supposed locality property of the input dataset.
- Locality property: if some point has index i1, another point has index i2, and i1 is close to i2, then, probably, these two points are close enough to each other.
19. Parallelization Idea: Example
(Figure: regions of the point set assigned to Processor 1, Processor 2, and Processor 3.)
Each processor solves its own buildup problem.
The results are integrated into a single dataset.
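Under the locality assumption, splitting the work amounts to handing each processor a contiguous range of point indices together with the distances that fall inside that range. The Python sketch below shows one possible split. The slides do not say how the per-processor results are integrated, so the overlap between neighbouring blocks (shared points that could serve as a common base) is purely an assumption, as are the names `partition` and `local_distances`.

```python
def partition(n, workers, overlap=4):
    """Split indices 0..n-1 into contiguous blocks, one per worker.

    Every block after the first also includes the last `overlap` indices
    of the previous block (an assumed way to share base points).
    """
    size, extra = divmod(n, workers)
    blocks, start = [], 0
    for w in range(workers):
        stop = start + size + (1 if w < extra else 0)
        blocks.append(range(max(0, start - overlap), stop))
        start = stop
    return blocks


def local_distances(dist, block):
    """Keep only the distances whose two endpoints both lie in `block`."""
    return {(i, j): d for (i, j), d in dist.items() if i in block and j in block}


# Usage: 10 points, 3 workers, 2 shared points between neighbouring blocks.
blocks = partition(10, 3, overlap=2)
dist = {(0, 1): 1.0, (3, 6): 2.0, (5, 9): 3.0}
middle = local_distances(dist, blocks[1])
```

Distances between points whose indices are far apart are simply dropped by this split, which is harmless only to the extent that the locality property actually holds for the dataset.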
20. Performance
- The performance was measured for HIV-1 RT p66 protein buildup problems. The distances were filtered to those shorter than 15 Å prior to launching the algorithm (about 0.5 of all distances were actually used).
- It looks like the results are encouraging.
21. Struggle Against Instability
- Tried to minimize the number of generations when possible. This helps a lot (e.g. reduces RMSD from 800 to 100?) but is obviously not enough.
- Do not use base tuples whose points lie on the same line/plane, or nearly so.
- Anyway, couldn't get far past the 7th generation.
22. Notes
- The problem is more algorithmic than numerical.
- Therefore numerical optimizations do not make much difference in performance (although they may matter for stability!).
- Data structures are very important.
- Although in general the problem tends to be NP-complete, for most datasets it is expected to be O(N) or close to it.
- There is quite a bit of code.
- So, there may be bugs.
23. Possible Directions
- Independently try to use rational mathematics (calculations in rational numbers rather than floating point). Maybe I am clueless, but I still think there is a chance to get some kind of result here.
- Improve the algorithm's intelligence: try to make and verify assumptions about coordinates when basis points are not immediately available or belong to the same line/plane. Use all information on hand!
- Find a way to parallelize without the need for locality.
- Improve code performance (use more efficient data structures, optimize locally within a single node).
- Try to solve an optimization problem via Newton's method to figure out new coordinates, instead of or together with quads. Maybe use weights for distances to known points of different generations.
- Try to shake the system to some extent with respect to generations, e.g. points of an older generation should be more influential than younger points.
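The rational-arithmetic direction can be tried cheaply because a buildup step only needs squared distances and a linear solve, so no square roots are required. Below is a minimal Python sketch using the standard `fractions` module. The function name `locate_exact` is hypothetical, and the sketch assumes the base coordinates and squared distances are available as exact rationals, which measured data would not be.

```python
from fractions import Fraction


def locate_exact(base, sq_dists):
    """Exact buildup step: four base points and squared distances as Fractions."""
    x1, s1 = base[0], sq_dists[0]
    n1 = sum(c * c for c in x1)
    rows = []
    for xk, sk in zip(base[1:], sq_dists[1:]):
        row = [2 * (xk[t] - x1[t]) for t in range(3)]
        row.append(sum(c * c for c in xk) - n1 - sk + s1)
        rows.append(row)
    # Gauss-Jordan elimination; exact, so any nonzero pivot will do.
    for col in range(3):
        pivot = next((r for r in range(col, 3) if rows[r][col] != 0), None)
        if pivot is None:
            raise ValueError("base points are coplanar")
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [v / rows[col][col] for v in rows[col]]
        for r in range(3):
            if r != col:
                f = rows[r][col]
                rows[r] = [v - f * p for v, p in zip(rows[r], rows[col])]
    return [rows[r][3] for r in range(3)]


# Usage: the unknown point is recovered with no rounding error at all.
F = Fraction
base = [(F(0), F(0), F(0)), (F(1), F(0), F(0)), (F(0), F(1), F(0)), (F(0), F(0), F(1))]
target = (F(1, 3), F(-2, 7), F(5, 11))
sq_dists = [sum((p - t) ** 2 for p, t in zip(b, target)) for b in base]
found = locate_exact(base, sq_dists)
```

The catch is that numerators and denominators tend to grow with every generation, so exactness is paid for in time and memory.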
24. Tools/Libraries Used
- Programming Language: C++.
- GNU Make to streamline compilation.
- GNU Emacs for editing source code.
- LinAlg C++ library by Oleg Kiselyov (for SVD computation in order to compute RMSD to verify results of buildup).
- Some of the Boost libraries (http://boost.org/) for type-safe multidimensional arrays and anonymous tuple support in C++.
- STL was used extensively. Especially useful were associative sorted containers like std::map, std::set and std::multimap.
- Version control with Subversion (too much code to do otherwise).
25. Questions? Comments?
26. Related Reading
- Wu, Z., Linear Algebra in Biomolecular Modeling, Handbook of Linear Algebra, Chapman/Hall CRC Press, 2006.
- Dong, Q. and Wu, Z., A geometric build-up algorithm for solving the molecular distance geometry problem with sparse distance data, J. Global Optim. 26, 321-333, 2003.
27. There is No Buildup
- Spoon Boy: Do not try and bend the buildup. That's impossible. Instead, only try to realize the truth.
- Neo: What truth?
- Spoon Boy: There is no buildup.
- Neo: There is no buildup?
- Spoon Boy: Then you'll see that it is not the buildup that bends, it is only yourself.