A Parallel Geometric Buildup

1
A Parallel Geometric Buildup
  • Vladimir Sukhoy
  • Graduate Student
  • Department of Mathematics
  • Iowa State University
  • 2006

2
The Problem
  • Find the coordinates for a set of points given
    the distances between some (or all) of them.
  • Do it fast enough.
  • And numerically stable enough.

3
Is it important?
HIV Retrotranscriptase
  • Proteins are building blocks of life and key
    ingredients of biological processes.
  • A biological system may have up to hundreds of
    thousands of different proteins, each with a
    specific role in the system.
  • A protein is formed by a polypeptide chain with
    typically several hundreds of amino acids and
    tens of thousands of atoms.
  • A protein has a unique 3D structure, which
    determines in many ways the function of the
    protein.
  • To determine the 3D structure, we sometimes need
    to recover coordinates from distances.

Figure: HIV reverse transcriptase (4200 atoms, 554 amino acids)
4
A More Formal Problem Description
  • Given n atoms a1, ..., an and a set of distances
    dij between ai and aj, (i,j) in S, find
    coordinates x1, ..., xn in R^3 such that
    ||xi - xj|| = dij for all (i,j) in S.

5
Related Material
  • Problems with all distances: solvable in O(n^3)
    using SVD.
  • Problems with sparse sets of distances:
    NP-complete (Saxe 1979).
  • Problems with probability distributions of
    distances: stochastic multidimensional scaling,
    structure prediction.
  • Problems with distance ranges (NMR results):
    NP-complete (Moré and Wu 1997), if the ranges
    are small.
6
Current Approaches
  • Embed Algorithm by Crippen and Havel
  • CNS Partial Metrization by Brünger et al
  • Graph Reduction by Hendrickson
  • Alternating Projection by Glunt and Hayden
  • Global Optimization by Moré and Wu
  • Multidimensional Scaling by Trosset, et al.

7
Geometric Buildup
  • Independent Points: A set of k+1 points in R^k is
    called independent if it is not a set of points
    in R^(k-1), that is, the points do not all lie in
    a (k-1)-dimensional affine subspace.
  • Metric Basis: A set of points B in a space S is a
    metric basis of S provided each point of S is
    uniquely determined by its distances from the
    points in B.
  • Fundamental Theorem: Any k+1 independent points
    in R^k form a metric basis for R^k.

Source: Blumenthal (1953), Theory and Applications of
Distance Geometry.
8
2D Geometric Buildup
In 2D, distances to three known points (provided
they are not on the same line) are enough to
determine the coordinates of an unknown point.
9
3D Geometric Buildup
In 3D, distances to four known points (provided
they are not on the same plane) are enough to
determine the coordinates of an unknown point.
10
3D Geometric Buildup (detailed)
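Known basis points (labelled 1 to 4 in the figure):
x1 = (u1, v1, w1), x2 = (u2, v2, w2),
x3 = (u3, v3, w3), x4 = (u4, v4, w4)

Unknown point xi = (ui, vi, wi), determined by
||xi - x1|| = di,1, ||xi - x2|| = di,2,
||xi - x3|| = di,3, ||xi - x4|| = di,4

Unknown point xj = (uj, vj, wj), determined by
||xj - x1|| = dj,1, ||xj - x2|| = dj,2,
||xj - x3|| = dj,3, ||xj - x4|| = dj,4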
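A single buildup step can be sketched in a few lines of Python. This is an illustration only, not the code from the project: squaring each distance equation and subtracting the first one from the other three cancels the quadratic term and leaves a 3x3 linear system, 2 (xk - x1) . x = |xk|^2 - |x1|^2 - (dk^2 - d1^2) for k = 2, 3, 4. The names `solve_linear` and `locate` and the pivot tolerance 1e-12 are assumptions made for this sketch.

```python
import math

def solve_linear(a, b):
    """Solve a small dense system a x = b by Gaussian elimination
    with partial pivoting (hypothetical helper for this sketch)."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:  # assumed tolerance
            raise ValueError("basis points are (nearly) coplanar")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def locate(basis, dists):
    """basis: four known 3D points; dists: distances from the
    unknown point to each of them, in the same order."""
    x1, d1 = basis[0], dists[0]
    n1 = sum(c * c for c in x1)
    rows, rhs = [], []
    for xk, dk in zip(basis[1:], dists[1:]):
        rows.append([2.0 * (p - q) for p, q in zip(xk, x1)])
        rhs.append(sum(c * c for c in xk) - n1 - (dk * dk - d1 * d1))
    return solve_linear(rows, rhs)

# Usage: recover a point from its distances to four non-coplanar points.
basis = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
target = (0.3, -1.2, 2.5)
dists = [math.dist(target, b) for b in basis]
print(locate(basis, dists))
```

The same idea applies in 2D with three non-collinear points and a 2x2 system.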
11
Performance
The geometric build-up algorithm solves a
molecular distance geometry problem in O(n) time
when distances between all (or at least enough!)
pairs of atoms are given, while the singular
value decomposition algorithm always requires
O(n^2) to O(n^3) computing time!
12
Some Pictures
The X-ray crystallography structure (left) of the
HIV-1 RT p66 protein (4200 atoms) and the
structure (right) determined by the geometric
build-up algorithm using the distances for all
pairs of atoms in the protein. The algorithm took
only 188,859 floating-point operations to obtain
the structure, while a conventional
singular-value decomposition algorithm required
1,268,200,000 floating-point operations. The RMSD
of the two structures is 10^-4 Å.
13
Parallelization
  • If the set of distances is full, the geometric
    buildup algorithm is easily parallelized: each
    processor is made responsible for a particular
    subset of the unknown points, and no
    communication is necessary! The performance
    increase is linear (or superlinear when
    everything fits into a cache?) with respect to
    the number of processors. And, by the way, it is
    of no interest.
  • If the set of distances is not full, everything
    gets a bit complicated.

14
Geometric Buildup with Sparse Set of Distances
15
Rounding Errors and Stability
Unfortunately, numerical error tends to become too
high when doing multistage geometric buildup. The
critical number of points (in my current
implementation using double precision) seems to
be around 70, and it also depends on the sparsity
of the data. When the data is sparser, the error
tends to become intolerable sooner, because more
steps are required on average to get the
coordinates of a particular point, and more steps
mean more rounding errors.
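To make the link between sparsity and the number of steps concrete, here is a small Python sketch that only counts buildup rounds and computes no coordinates. It assumes a definition the slides do not spell out: the four starting points are generation 0, and a point placed in round r is generation r. It also ignores the requirement that the four points used must not be coplanar. The function name `generations` is hypothetical.

```python
def generations(n, pairs, basis):
    """Return {point: generation} for every point reachable by buildup.

    n      -- number of points, indexed 0..n-1
    pairs  -- iterable of (i, j) index pairs whose distance is known
    basis  -- four starting points assumed already placed (generation 0)
    """
    neighbors = [set() for _ in range(n)]
    for i, j in pairs:
        neighbors[i].add(j)
        neighbors[j].add(i)

    gen = {i: 0 for i in basis}
    placed = set(basis)
    current = 0
    while True:
        # A point can be placed once it has distances to four placed points.
        ready = [i for i in range(n)
                 if i not in placed and len(neighbors[i] & placed) >= 4]
        if not ready:
            return gen
        current += 1
        for i in ready:
            gen[i] = current
        placed.update(ready)

# Full set of distances among 7 points versus a chain-like sparse set
# in which point k only knows its distances to points k-1 .. k-4.
dense = [(i, j) for i in range(7) for j in range(i + 1, 7)]
chain = [(k, k - s) for k in range(1, 7) for s in range(1, 5) if k - s >= 0]
print(generations(7, dense, [0, 1, 2, 3]))
print(generations(7, chain, [0, 1, 2, 3]))
```

With the full set every remaining point is placed in generation 1. With the chain-like set the last point only becomes available in generation 3, so its coordinates inherit the rounding errors of two earlier solves.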
16
More on Rounding Errors
  • Rounding error increases exponentially with the
    number of generations!
  • Even if a nice long-double LU-factorization
    linear solver is used, everything dies after a
    handful of generations!
  • One can try to solve an optimization problem
    instead, or try some mixture. But it is likely
    to be much slower, and the worst-case error does
    not get any better.

17
The Project
  • A specific version of the geometric buildup
    algorithm was implemented on hpc-class.
  • The algorithm is able to deal with a sparse set
    of distances (to some extent).
  • The algorithm was used to solve the geometric
    buildup problem for the HIV RT p66 protein and
    showed a near-linear performance increase with
    respect to the number of processors used.

18
Parallelization Idea
  • The parallelization implemented is based on an
    assumed locality property of the input dataset.
  • Locality property: if one point has index i1,
    another point has index i2, and i1 is close to
    i2, then these two points are probably close to
    each other in space.

19
Parallelization Idea Example
Figure: the points are split into three groups, labelled
Processor 1, Processor 2, and Processor 3.
Each processor solves its own buildup problem.
The results are integrated into a single dataset.
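One way to picture the split is the Python sketch below, which is an assumption-laden illustration rather than the project's code. It cuts the index range into contiguous blocks, which is where the locality property is used, and hands each block only the distances between its own points. The names `split_blocks` and `local_distances` are hypothetical. How the separately built fragments are brought into one coordinate frame is not described in the slides and is left out here.

```python
def split_blocks(n, p):
    """Split indices 0..n-1 into p contiguous, nearly equal blocks."""
    base, extra = divmod(n, p)
    blocks, start = [], 0
    for r in range(p):
        size = base + (1 if r < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks

def local_distances(dist, block):
    """Keep only the distances whose two endpoints both fall in the block."""
    return {(i, j): d for (i, j), d in dist.items()
            if i in block and j in block}

# Usage: 10 points on 3 processors, with a few known distances.
dist = {(0, 1): 1.0, (3, 4): 2.0, (5, 6): 1.5, (8, 9): 0.7}
for rank, block in enumerate(split_blocks(10, 3)):
    print(rank, list(block), local_distances(dist, block))
```

Distances that straddle two blocks, such as (3, 4) in the usage example, are dropped by this simple scheme. If index locality holds, few short distances are lost that way.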
20
Performance
  • The performance was measured for HIV-1 RT p66
    protein buildup problems. The distances were
    filtered to be less than 15 angstroms prior to
    launching the algorithm (about 0.5 of all
    distances were actually used).
  • It looks like the results are encouraging.
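The filtering step can be sketched as follows. The sketch assumes that the test distances are computed from a known reference structure and then thinned with a cutoff, which is a guess about the setup. `filter_by_cutoff` is a hypothetical name, and the three sample points are made up for illustration.

```python
import math

def filter_by_cutoff(coords, cutoff):
    """Return {(i, j): distance} for all pairs i < j closer than cutoff."""
    kept = {}
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            d = math.dist(coords[i], coords[j])
            if d < cutoff:
                kept[(i, j)] = d
    return kept

# Usage: three points on a line, 10 units apart, with a cutoff of 15.
coords = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
sparse = filter_by_cutoff(coords, 15.0)
total_pairs = len(coords) * (len(coords) - 1) // 2
print(sorted(sparse), len(sparse) / total_pairs)
```

The double loop is quadratic in the number of atoms, which is acceptable for a one-off preprocessing pass over a few thousand atoms.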

21
Struggle Against Instability
  • Tried to minimize the number of generations
    when possible. It helps a lot (e.g. reduces RMSD
    from 800 to 100) but obviously not enough.
  • Do not use basis tuples whose points lie on the
    same line/plane, or nearly so.
  • Anyway, couldn't get far past the 7th generation.

22
Notes
  • The problem is more algorithmic than numerical.
  • Therefore numerical optimizations do not make
    much difference in performance (although they
    may matter for stability!).
  • Data structures are very important.
  • Although in general the problem tends to be
    NP-complete, for most datasets it is expected to
    be O(N) or nearly so.
  • There is quite a bit of code.
  • So, there may be bugs.

23
Possible Directions
  • Independently, try to use rational arithmetic
    (calculations in rational numbers rather than
    floating point). Maybe I am clueless, but I still
    think there is a chance to get some kind of
    result here.
  • Improve the algorithm's intelligence: try to make
    and verify assumptions about coordinates when
    basis points are not immediately available or
    belong to the same line/plane. Use all the
    information on hand!
  • Find a way to parallelize without the need for
    locality.
  • Improve code performance (use more efficient data
    structures, optimize locally within a single
    node).
  • Try to solve an optimization problem via Newton's
    method to figure out new coordinates, instead of
    or together with quads. Maybe use weights for
    distances to known points of different
    generations.
  • Try to shake the system to some extent with
    respect to generations, e.g. points of an older
    generation should be more influential than
    younger points.
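The rational-arithmetic direction in the first bullet can be tried out with Python's `fractions.Fraction`. The sketch below is an illustration under stated assumptions, not a result from the project: it performs one buildup step exactly, using Cramer's rule on the linear system obtained by subtracting the first squared-distance equation from the other three. It takes squared distances as input, because the distances themselves are generally irrational even when the coordinates are rational. `det3` and `locate_exact` are hypothetical names.

```python
from fractions import Fraction as F

def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def locate_exact(basis, sq_dists):
    """Exact position from four basis points and squared distances.
    All inputs are expected to be Fraction values."""
    x1, s1 = basis[0], sq_dists[0]
    n1 = sum(c * c for c in x1)
    a, b = [], []
    for xk, sk in zip(basis[1:], sq_dists[1:]):
        a.append([2 * (p - q) for p, q in zip(xk, x1)])
        b.append(sum(c * c for c in xk) - n1 - (sk - s1))
    d = det3(a)
    if d == 0:
        raise ValueError("basis points are coplanar")
    result = []
    for col in range(3):
        # Cramer's rule: replace one column of a by the right-hand side.
        m = [row[:col] + [b[r]] + row[col + 1:] for r, row in enumerate(a)]
        result.append(det3(m) / d)
    return result

# Usage: the rational point is recovered with no rounding error at all.
basis = [(F(0), F(0), F(0)), (F(1), F(0), F(0)),
         (F(0), F(1), F(0)), (F(0), F(0), F(1))]
target = (F(1, 3), F(-2, 7), F(5, 2))
sq_dists = [sum((t - c) ** 2 for t, c in zip(target, p)) for p in basis]
print(locate_exact(basis, sq_dists))
```

Exactness comes at a price: numerators and denominators can grow with every generation, so whether this is practical for thousands of atoms remains an open question.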

24
Tools/Libraries Used
  • Programming language: C++.
  • GNU Make to streamline compilation.
  • GNU Emacs for editing source code.
  • LinAlg C++ library by Oleg Kiselyov (for SVD
    computation in order to compute RMSD to verify
    results of buildup).
  • Some of the Boost libraries (http://boost.org/)
    for type-safe multidimensional arrays and
    anonymous tuple support in C++.
  • STL was used extensively. Especially useful were
    associative sorted containers like std::map,
    std::set and std::multimap.
  • Version control with Subversion (too much code to
    do otherwise).

25
Questions? Comments?
26
Related Reading
  • Wu, Z., Linear Algebra in Biomolecular Modeling,
    in Handbook of Linear Algebra, Chapman & Hall/CRC
    Press, 2006.
  • Dong, Q. and Wu, Z., A geometric build-up
    algorithm for solving the molecular distance
    geometry problem with sparse distance data, J.
    Global Optim. 26, 321-333, 2003.

27
There is No Buildup
  • Spoon Boy: Do not try and bend the buildup.
    That's impossible. Instead only try to realize
    the truth.
  • Neo: What truth?
  • Spoon Boy: There is no buildup.
  • Neo: There is no buildup?
  • Spoon Boy: Then you'll see that it is not the
    buildup that bends, it is only yourself.