Title: Introduction to Computational Structural Biology
1 A Novel Geometric Build-Up Algorithm for Solving
the Distance Geometry Problem and Its Application
to Multidimensional Scaling
Zhijun Wu, Department of Mathematics, Program on
Bio-Informatics and Computational Biology, Iowa
State University
Joint work with Tauqir Bibi, Feng Cui, Qunfeng Dong, Peter Vedell, Di Wu
2
Multidimensional Scaling: data classification, geometric mapping of data
Distance Geometry: mapping from semi-metric to metric spaces, Euclidean and non-Euclidean
- fundamental problem: find the coordinates for a set of points, given the distances for all pairs of points
- Cayley-Menger determinant: necessary and sufficient conditions of embedding
- singular-value decomposition method, strain/stress minimization
Molecular Conformation: embedding in 3D Euclidean space, protein structure prediction and determination
- sparse, inexact distances, bounds on the distances, probability distributions
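As an illustration of the Cayley-Menger determinant mentioned above, the following is a minimal sketch (not taken from the slides) that builds the bordered matrix of squared distances and converts its determinant into a squared simplex volume. The function names are hypothetical. A negative or non-real volume would indicate that the distances cannot be embedded as a simplex.

```python
import math
import numpy as np

def cayley_menger_det(D):
    """Determinant of the bordered matrix of squared pairwise distances."""
    D = np.asarray(D, dtype=float)
    m = D.shape[0]
    B = np.ones((m + 1, m + 1))
    B[0, 0] = 0.0
    B[1:, 1:] = D ** 2
    return np.linalg.det(B)

def simplex_volume_squared(D):
    """Squared volume of the simplex on k + 1 points, from distances only."""
    D = np.asarray(D, dtype=float)
    k = D.shape[0] - 1
    sign = (-1) ** (k + 1)
    return sign * cayley_menger_det(D) / (2 ** k * math.factorial(k) ** 2)

# 3-4-5 right triangle: area 6, so the squared area should be 36.
tri = np.array([[0, 3, 4], [3, 0, 5], [4, 5, 0]], dtype=float)
area_sq = simplex_volume_squared(tri)

# Three collinear points: the "triangle" is flat, so the squared area is 0.
flat = np.array([[0, 1, 2], [1, 0, 1], [2, 1, 0]], dtype=float)
flat_sq = simplex_volume_squared(flat)
```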
3
Proteins are building blocks of life and key
ingredients of biological processes. A
biological system may have up to hundreds of
thousands of different proteins, each with a
specific role in the system. A protein is formed
by a polypeptide chain with typically several
hundreds of amino acids and tens of thousands of
atoms. A protein has a unique 3D structure,
which determines in many ways the function of the
protein.
HIV Retrotranscriptase
an example
554 amino acids
4200 atoms
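4 Molecular Distance Geometry Problem
Given n atoms $a_1, \ldots, a_n$ and a set of distances
$d_{i,j}$ between $a_i$ and $a_j$, $(i,j) \in S$, find coordinates
$x_1, \ldots, x_n$ in $R^3$ such that $\|x_i - x_j\| = d_{i,j}$ for all $(i,j) \in S$.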
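To make the problem statement concrete, here is a small sketch of how a candidate solution could be checked against the given distances. The function name and the dictionary representation of S are assumptions made for illustration only.

```python
import numpy as np

def max_distance_violation(X, distances):
    """Largest | ||x_i - x_j|| - d_ij | over the given pairs.

    X: (n, 3) array of coordinates.
    distances: dict mapping (i, j) in S to the distance d_ij.
    """
    X = np.asarray(X, dtype=float)
    return max(
        abs(np.linalg.norm(X[i] - X[j]) - d)
        for (i, j), d in distances.items()
    )

# Three atoms with a sparse set S of three known distances.
X = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
S = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 2.0 ** 0.5}
violation = max_distance_violation(X, S)
```

A set of coordinates solves the problem when this violation is zero (up to rounding).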
5 Problems and Complexity
- problems with all distances: solvable in $O(n^3)$ using SVD
- problems with sparse sets of distances: NP-complete (Saxe 1979)
- problems with distance ranges (NMR results): NP-complete (Moré and Wu 1997), if the ranges are small
- problems with probability distributions of distances: stochastic multidimensional scaling, structure prediction
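For the first case, where all distances are given, the SVD approach can be sketched as follows. This is the standard double-centering construction from classical multidimensional scaling, written here as an illustration rather than as the exact procedure used in the talk. The function name is hypothetical.

```python
import numpy as np

def embed_from_full_distances(D, dim=3):
    """Coordinates from a complete, exact distance matrix via SVD."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    G = -0.5 * J @ (D ** 2) @ J           # Gram matrix of centered coordinates
    U, s, _ = np.linalg.svd(G)            # dense SVD, O(n^3) work
    return U[:, :dim] * np.sqrt(s[:dim])

# Synthetic check: recover a point set from its own distance matrix.
rng = np.random.default_rng(0)
P = rng.normal(size=(6, 3))
D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
X = embed_from_full_distances(D)
D_rec = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
```

The recovered coordinates differ from the originals by a rigid motion (and possibly a reflection), so the comparison is made on distances rather than on coordinates.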
6 Current Approaches
- Embed Algorithm by Crippen and Havel
- CNS Partial Metrization by Brünger et al
- Graph Reduction by Hendrickson
- Alternating Projection by Glunt and Hayden
- Global Optimization by Moré and Wu
- Multidimensional Scaling by Trosset, et al
7 Embed Algorithm
time consuming, in $O(n^3)$ to $O(n^4)$
- bound smoothing: keep distances consistent
- distance metrization: estimate the missing distances
- repeat (say 1000 times):
  - randomly generate D in between L and U
  - find X using SVD with D
  - if X is found, stop
- select the best approximation X
- refine X with simulated annealing
- final optimization
costly, in $O(n^2)$ to $O(n^3)$
Crippen and Havel 1988 (DGII, DGEOM); Brünger et al 1992, 1998 (XPLOR, CNS)
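The first and third steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the DGII or CNS implementation: bound smoothing uses only the triangle inequality (one pass, $O(n^3)$), metrization is omitted, and trial distances are drawn uniformly and independently between the smoothed bounds. All function names are hypothetical.

```python
import numpy as np

def triangle_bound_smoothing(L, U):
    """Tighten lower (L) and upper (U) bound matrices, triangle inequality only."""
    L = np.array(L, dtype=float)
    U = np.array(U, dtype=float)
    n = U.shape[0]
    for k in range(n):
        # Upper bounds: u_ij <= u_ik + u_kj (all-pairs shortest paths).
        U = np.minimum(U, U[:, k:k + 1] + U[k:k + 1, :])
    for k in range(n):
        # Lower bounds: l_ij >= l_ik - u_kj and l_ij >= l_kj - u_ik.
        L = np.maximum(L, L[:, k:k + 1] - U[k:k + 1, :])
        L = np.maximum(L, L[k:k + 1, :] - U[:, k:k + 1])
    return L, U

def sample_trial_distances(L, U, rng):
    """Draw a symmetric trial distance matrix D with L <= D <= U entrywise."""
    T = np.triu(rng.uniform(L, U), 1)
    return T + T.T

# Three atoms: d01 = 5 and d12 = 1 are known, d02 is unknown (bounds 0..100).
BIG = 100.0
L0 = np.array([[0, 5, 0], [5, 0, 1], [0, 1, 0]], dtype=float)
U0 = np.array([[0, 5, BIG], [5, 0, 1], [BIG, 1, 0]], dtype=float)
Ls, Us = triangle_bound_smoothing(L0, U0)
T = sample_trial_distances(Ls, Us, np.random.default_rng(0))
```

In this example the unknown distance d02 is confined to the interval from 4 to 6 after smoothing.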
8 Geometric Build-Up
Independent Points: A set of $k+1$ points in $R^k$ is
called independent if it is not a set of points
in $R^{k-1}$.
Metric Basis: A set of points B in a space S is a
metric basis of S provided each point of S is
uniquely determined by its distances from the
points in B.
Fundamental Theorem: Any $k+1$ independent points
in $R^k$ form a metric basis for $R^k$.
Blumenthal 1953, Theory and Applications of
Distance Geometry
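A numerical test for independence can be sketched as below, assuming the points are given by coordinates: k + 1 points in R^k are independent exactly when the k edge vectors from one of them span R^k. The function name and the tolerance are illustrative assumptions.

```python
import numpy as np

def are_independent(points, tol=1e-9):
    """True if the k + 1 given points in R^k do not lie in a (k-1)-dimensional subspace."""
    P = np.asarray(points, dtype=float)
    k = P.shape[1]
    if P.shape[0] != k + 1:
        raise ValueError("expected k + 1 points in R^k")
    edges = P[1:] - P[0]   # k edge vectors from the first point
    return bool(np.linalg.matrix_rank(edges, tol=tol) == k)

# A tetrahedron is independent in R^3; four coplanar points are not.
tetra = [[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
coplanar = [[0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0]]
tetra_ok = are_independent(tetra)
coplanar_ok = are_independent(coplanar)
```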
9 Geometric Build-Up
in two dimensions
10 Geometric Build-Up
in three dimensions
11 Geometric Build-Up
in three dimensions
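12 Geometric Build-Up
Four determined atoms (labeled 1 to 4 in the figure):
$x_1 = (u_1, v_1, w_1)$, $x_2 = (u_2, v_2, w_2)$, $x_3 = (u_3, v_3, w_3)$, $x_4 = (u_4, v_4, w_4)$
Undetermined atom i: $x_i = (u_i, v_i, w_i)$ = ?
$\|x_i - x_1\| = d_{i,1}$, $\|x_i - x_2\| = d_{i,2}$, $\|x_i - x_3\| = d_{i,3}$, $\|x_i - x_4\| = d_{i,4}$
Undetermined atom j: $x_j = (u_j, v_j, w_j)$ = ?
$\|x_j - x_1\| = d_{j,1}$, $\|x_j - x_2\| = d_{j,2}$, $\|x_j - x_3\| = d_{j,3}$, $\|x_j - x_4\| = d_{j,4}$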
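One way to solve the four distance equations for a single undetermined atom is to square them and subtract the first from the other three, which cancels the quadratic term and leaves a 3-by-3 linear system. The sketch below assumes the four base atoms are independent. The function name is hypothetical.

```python
import numpy as np

def locate_from_four(base, d):
    """Position of an atom from its distances d to four determined base atoms.

    Squaring ||x - x_m|| = d_m and subtracting the equation for m = 1 gives
    2 (x_m - x_1) . x = ||x_m||^2 - ||x_1||^2 - d_m^2 + d_1^2,  m = 2, 3, 4.
    """
    base = np.asarray(base, dtype=float)   # shape (4, 3)
    d = np.asarray(d, dtype=float)         # shape (4,)
    A = 2.0 * (base[1:] - base[0])
    sq = np.sum(base ** 2, axis=1)
    b = sq[1:] - sq[0] - d[1:] ** 2 + d[0] ** 2
    return np.linalg.solve(A, b)

# Base atoms 1 to 4 and a target atom whose position we pretend not to know.
base = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0],
                 [0.3, 1.2, 0.0], [0.4, 0.5, 1.1]])
target = np.array([0.7, -0.2, 0.9])
d = np.linalg.norm(base - target, axis=1)
x_i = locate_from_four(base, d)
```

Because the system has fixed size, each atom costs a constant number of floating-point operations once its four base atoms are known.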
13
The geometric build-up algorithm solves a
molecular distance geometry problem in $O(n)$ when
distances between all pairs of atoms are given,
while the singular value decomposition algorithm
requires $O(n^2)$ to $O(n^3)$ computing time!
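The linear running time can be seen in a minimal sketch of the whole procedure for the all-distances case. It assumes the first four atoms are independent and uses them as the base for every other atom. The closed-form placement of the base and the function name are choices made for illustration, not necessarily those of the talk.

```python
import numpy as np

def geometric_build_up(D):
    """Coordinates from a full, exact distance matrix in O(n) operations."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    X = np.zeros((n, 3))
    # Place the base: atom 0 at the origin, atom 1 on the u-axis,
    # atom 2 in the uv-plane, atom 3 above it (mirror image excluded).
    d01 = D[0, 1]
    X[1] = [d01, 0.0, 0.0]
    u2 = (D[0, 2] ** 2 - D[1, 2] ** 2 + d01 ** 2) / (2 * d01)
    v2 = np.sqrt(D[0, 2] ** 2 - u2 ** 2)
    X[2] = [u2, v2, 0.0]
    u3 = (D[0, 3] ** 2 - D[1, 3] ** 2 + d01 ** 2) / (2 * d01)
    v3 = (D[0, 3] ** 2 - D[2, 3] ** 2 + u2 ** 2 + v2 ** 2 - 2 * u2 * u3) / (2 * v2)
    w3 = np.sqrt(D[0, 3] ** 2 - u3 ** 2 - v3 ** 2)
    X[3] = [u3, v3, w3]
    # Every remaining atom needs one fixed-size linear solve.
    A = 2.0 * (X[1:4] - X[0])
    sq = np.sum(X[:4] ** 2, axis=1)
    for i in range(4, n):
        b = sq[1:] - sq[0] - D[i, 1:4] ** 2 + D[i, 0] ** 2
        X[i] = np.linalg.solve(A, b)
    return X

# Synthetic check with a well-conditioned base and random remaining atoms.
rng = np.random.default_rng(1)
P = rng.normal(size=(10, 3))
P[:4] = [[0, 0, 0], [1.5, 0, 0], [0.3, 1.2, 0], [0.4, 0.5, 1.1]]
D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
X = geometric_build_up(D)
D_rec = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
```

Only 4(n - 4) + 6 of the given distances are actually read, which is why the cost grows linearly with n.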
14
The X-ray crystallography structure (left) of the
HIV-1 RT p66 protein (4200 atoms) and the
structure (right) determined by the geometric
build-up algorithm using the distances for all
pairs of atoms in the protein. The algorithm took
only 188,859 floating-point operations to obtain
the structure, while a conventional
singular-value decomposition algorithm required
1,268,200,000 floating-point operations. The RMSD
of the two structures is $10^{-4}$ Å.
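An RMSD between two structures is usually reported after optimal superposition. The sketch below uses the Kabsch method (proper rotations only); whether the figure quoted above was computed this way is an assumption, and the function name is hypothetical.

```python
import numpy as np

def rmsd_after_superposition(P, Q):
    """RMSD between two (n, 3) coordinate sets after optimal rigid alignment."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    Pc = P - P.mean(axis=0)            # remove translation
    Qc = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)
    s = np.sign(np.linalg.det(U @ Vt))  # -1 would mean a reflection
    R = U @ np.diag([1.0, 1.0, s]) @ Vt  # proper rotation with Pc @ R ~ Qc
    diff = Pc @ R - Qc
    return np.sqrt((diff ** 2).sum() / len(P))

# A rotated and shifted copy of a structure should give an RMSD of zero.
rng = np.random.default_rng(2)
P = rng.normal(size=(20, 3))
t = 0.7
Rz = np.array([[np.cos(t), -np.sin(t), 0.0],
               [np.sin(t), np.cos(t), 0.0],
               [0.0, 0.0, 1.0]])
Q = P @ Rz.T + np.array([1.0, -2.0, 0.5])
rmsd = rmsd_after_superposition(P, Q)
```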
15 Problems with Sparse Sets of Distances
16 Control of Rounding Errors
17 Control of Rounding Errors
18 Tolerate Distance Errors
19 Tolerate Distance Errors
(i,j) in S, where the x_j are determined.
20
The objective function is convex and the problem
can be solved using a standard Newton method.
Each function evaluation requires on the order of n
floating-point operations, where n is the number
of atoms. (The formula on this slide ranges over
(i,j) in S for which the x_j are determined.)
In the ideal case when every atom can be
determined, n atoms require $O(n^2)$ floating-point
operations.
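The objective function itself did not survive on this slide. As a purely illustrative stand-in, one convex choice is the linear least-squares problem obtained by using all determined neighbors of an atom, rather than exactly four, in the linearized distance equations. This is an assumption, not necessarily the author's formulation. With l determined neighbors the work per atom is proportional to l, in line with the operation counts quoted above.

```python
import numpy as np

def locate_least_squares(neighbors, d):
    """Least-squares position of an atom from l >= 4 determined neighbors.

    Uses the linearized equations
    2 (x_m - x_1) . x = ||x_m||^2 - ||x_1||^2 - d_m^2 + d_1^2,  m = 2..l,
    solved in the least-squares sense so that small distance errors average out.
    """
    nb = np.asarray(neighbors, dtype=float)   # shape (l, 3)
    d = np.asarray(d, dtype=float)            # shape (l,)
    A = 2.0 * (nb[1:] - nb[0])
    sq = np.sum(nb ** 2, axis=1)
    b = sq[1:] - sq[0] - d[1:] ** 2 + d[0] ** 2
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x

# Eight determined neighbors, exact and slightly perturbed distances.
rng = np.random.default_rng(3)
nb = rng.normal(size=(8, 3))
target = np.array([0.2, -0.4, 0.6])
d_exact = np.linalg.norm(nb - target, axis=1)
d_noisy = d_exact + 1e-4 * rng.normal(size=8)
x_exact = locate_least_squares(nb, d_exact)
x_noisy = locate_least_squares(nb, d_noisy)
```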
21 NMR Structure Determination
The distances are given with their possible
ranges.
22
(i, j) in S
23 Computational Results
The structure of 4MBA (red lines) determined by
using a geometric build-up algorithm with a
subset of all pairs of inter-atomic distances.
The X-ray crystallography structure is shown in
blue lines.
24 Computational Results
The total distance errors (red) for the partial
structures of a polypeptide chain obtained by
using a geometric build-up are all smaller than 1
Å, while those (blue) by using CNS (Brünger et
al) grow quickly with increasing numbers of atoms
in the chain.
25 Extension to Statistical Distance Data
- the distributions of the distances in structure database
- structure prediction