Title: Nonlinear Dimensionality Reduction Applied to Problems in Structural Bioinformatics
Forbes Burkowski, Shirley Hui
School of Computer Science, University of Waterloo
2Dimensionality Reduction
- Introduction
- Dimensionality reduction is a technique that may be used to organize high dimensional data by discovering a more compact representation of the data.
- We regard the data as a set of points in a high dimensional space.
- This strategy is suitable for data that are generated by some process that situates the points in a lower dimensional manifold of that space.
3Data Set Example 1
- The Swiss roll data set is a 2D manifold situated in a higher dimensional 3D space.
- The goal of a dimensionality reduction exercise would be the discovery of the 2D intrinsic coordinates that would allow us to see the data as it is displayed in panel (c) of the figure.
4Data Set Example 2
- Suppose we have a camera that takes pictures of a face that is rotated through a continuous range of angles.
- If a picture is represented using 256 grey scale values in a 32 by 32 array then each picture may be represented as a point in a space of 1024 dimensions.
- Since the angle of rotation is a single parameter, the sequence of pictures corresponds to a sequence of points that trace out a 1D curve in the 1024D space.
5Motivation
- Why is this important?
- Visualization of high dimensional data
- Data compression
- Recognition of similarity in data sets
- It can be very useful to extract meaningful dimensions.
- We hope to determine an intrinsic coordinate system for the data.
- We want to unroll the manifold.
- Interpolation
6Working with the geodesic distance
- Often the distance between two points is more
meaningful when calculated by going through
adjacent points in the lower dimensional space
rather than simply using the Euclidean distance
in the high dimensional space.
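A minimal sketch of this idea, assuming the manifold is sampled by a finite set of points: connect each point to its k nearest neighbours and take the shortest path through that graph as an approximation of the geodesic distance. The function names, the choice of k, and the semicircle test data are illustrative assumptions, not part of the original slides.

```python
import heapq
import math

def euclidean(p, q):
    """Straight-line distance between two points given as coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_graph(points, k):
    """Symmetric k-nearest-neighbour graph as {i: {j: edge_length}}."""
    graph = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        dists = sorted((euclidean(p, q), j) for j, q in enumerate(points) if j != i)
        for d, j in dists[:k]:
            graph[i][j] = d
            graph[j][i] = d
    return graph

def geodesic_distance(graph, source, target):
    """Length of the shortest path through the graph (Dijkstra's algorithm)."""
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > best.get(u, math.inf):
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < best.get(v, math.inf):
                best[v] = nd
                heapq.heappush(heap, (nd, v))
    return math.inf  # target not reachable

# Illustrative data: 51 points on a unit semicircle (a 1D manifold in 2D).
arc = [(math.cos(math.pi * t / 50), math.sin(math.pi * t / 50)) for t in range(51)]
graph = knn_graph(arc, k=2)
straight = euclidean(arc[0], arc[-1])      # cuts across the semicircle
along = geodesic_distance(graph, 0, 50)    # follows the sampled curve
```

Here the straight-line distance between the two ends is the diameter of the circle, while the graph distance stays close to the arc length, which is the more meaningful measure of how far apart the two points are on the manifold.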
7LLE Overview
- We assume that a point Xi in the high dimensional space resides in a hyperplane that is determined by Xi and its k closest neighbouring points.
- Calculate the best weights Wij that would linearly reconstruct Xi from its neighbours.
- This is done by solving a constrained least-squares problem, minimizing the error
  $$\varepsilon(W) = \sum_i \Big| X_i - \sum_j W_{ij} X_j \Big|^2$$
- Weights are zero if Xj does not belong to the neighbour set of Xi.
- We also need, for each i,
  $$\sum_j W_{ij} = 1$$
8LLE Overview (cont.)
- Compute the d-dimensional embedding vectors Yi, in the low dimensional space, that are best reconstructed by the weights Wij.
- The Yi minimize
  $$\Phi(Y) = \sum_i \Big| Y_i - \sum_j W_{ij} Y_j \Big|^2$$
- This is done by finding the eigenvectors with the smallest eigenvalues of the sparse symmetric matrix
  $$M = (I - W)^T (I - W)$$
- Reference
- Roweis, S.T. and Saul, L.K., Nonlinear Dimensionality Reduction by Locally Linear Embedding, Science, Vol. 290, Dec. 22, 2000, pp. 2323-2326.
9LLE Overview (cont.)
- Summary of the 3 steps for LLE
- High dimensional space
- Low dimensional space
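The three LLE steps can be sketched in NumPy as follows. This is a small dense illustration rather than the authors' implementation: the regularisation term `reg`, the function names, and the helix test data are assumptions added here, and a practical implementation would use sparse matrices and a sparse eigensolver.

```python
import numpy as np

def lle_weights(X, k, reg=1e-3):
    """Steps 1 and 2: find k nearest neighbours, then reconstruction weights Wij."""
    n = X.shape[0]
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(sq_dist, np.inf)            # a point is not its own neighbour
    neighbours = np.argsort(sq_dist, axis=1)[:, :k]

    W = np.zeros((n, n))
    for i in range(n):
        Z = X[neighbours[i]] - X[i]              # neighbours relative to Xi
        C = Z @ Z.T                              # local Gram matrix, shape (k, k)
        C = C + reg * np.trace(C) * np.eye(k)    # assumed regulariser: C can be singular
        w = np.linalg.solve(C, np.ones(k))
        W[i, neighbours[i]] = w / w.sum()        # enforce sum_j Wij = 1
    return W

def lle_embedding(W, d):
    """Step 3: eigenvectors of M = (I - W)^T (I - W) with the smallest eigenvalues."""
    n = W.shape[0]
    I = np.eye(n)
    M = (I - W).T @ (I - W)
    eigvals, eigvecs = np.linalg.eigh(M)         # eigenvalues in ascending order
    return eigvecs[:, 1:d + 1]                   # skip the constant eigenvector

# Illustrative data: 100 points along a helix, a 1D curve in 3D.
t = np.linspace(0.0, 3.0 * np.pi, 100)
X = np.column_stack([np.cos(t), np.sin(t), 0.3 * t])
W = lle_weights(X, k=6)
Y = lle_embedding(W, d=1)
```

The bottom eigenvector is discarded because the rows of W sum to one, so the constant vector always has eigenvalue zero and carries no information about the data.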
10Images of Lips Mapped Via LLE
Images were mapped into the embedding space
described by the first two coordinates of LLE.
- Saul, L.K. and Roweis, S.T.
- An Introduction to Locally Linear Embedding
- http://www.cs.toronto.edu/~roweis/lle/papers/lleintro.pdf
11Images of Faces Mapped Via LLE
- These images were mapped into the embedding space
described by the first two coordinates of LLE.
12HIV Protease 2D Manifold
13Dimensionality Reduction and Protein Flexibility
- The study of protein motion is difficult because of the many degrees of freedom.
- A medium sized protein may have a few thousand degrees of freedom.
- Can dimensionality reduction be used to obtain a reduced basis representation of protein flexibility?
- Teodoro et al. answer in the affirmative:
- Teodoro, M.L., Phillips, G.N. Jr., Kavraki, L.E., A Dimensionality Reduction Approach to Modeling Protein Flexibility, RECOMB '02, Washington, Apr. 2002, pp. 299-308.
14Modeling Flexibility of HIV Protease
- Teodoro et al. use a molecular dynamics program to generate 14,000 conformations of the protein.
- They get a vector set that is subjected to dimensionality reduction via PCA (Principal Component Analysis).
- The reduced basis representation retains the critical information about the directions of preferred motion of the protein.
- They have a convincing animation that shows the flap movement in 4HVP.
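As a rough illustration of the PCA step, the sketch below treats each conformation as a row vector of 3N Cartesian coordinates and extracts a reduced basis with a singular value decomposition. The function names and the synthetic "trajectory" (a five-atom toy system moving along two made-up modes) are assumptions for demonstration only; they are not the molecular dynamics data used by Teodoro et al.

```python
import numpy as np

def pca_basis(conformations, n_components):
    """Return the mean conformation, the leading principal directions,
    and the fraction of total variance each direction explains."""
    mean = conformations.mean(axis=0)
    centred = conformations - mean
    _, s, vt = np.linalg.svd(centred, full_matrices=False)
    variance = s ** 2
    return mean, vt[:n_components], variance[:n_components] / variance.sum()

def project(conformations, mean, components):
    """Coordinates of each conformation in the reduced basis."""
    return (conformations - mean) @ components.T

def reconstruct(coords, mean, components):
    """Map reduced coordinates back to full 3N-dimensional conformations."""
    return mean + coords @ components

# Synthetic stand-in for a trajectory: 200 frames of 5 atoms (15 coordinates)
# that move along two hypothetical collective modes.
rng = np.random.default_rng(0)
n_atoms = 5
base = rng.normal(size=3 * n_atoms)
mode_a = rng.normal(size=3 * n_atoms)
mode_b = rng.normal(size=3 * n_atoms)
time = np.linspace(0.0, 2.0 * np.pi, 200)
trajectory = (base
              + np.outer(np.sin(time), mode_a)
              + 0.3 * np.outer(np.cos(3 * time), mode_b))

mean, components, explained = pca_basis(trajectory, n_components=2)
reduced = project(trajectory, mean, components)
restored = reconstruct(reduced, mean, components)
```

Because the toy motion has only two modes, two principal components capture essentially all of the variance. Real trajectories need more components, and the explained-variance fractions indicate how many are worth keeping.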
15Nonlinear Dimension Reduction for Proteins
- Can we use LLE in lieu of PCA?
- S. Hui and U. Shakeel (University of Waterloo, School of Computer Science graduate students) repeated the HIV protease study using LLE instead of PCA.
  - very good results
  - but quite computationally expensive
- Results will be presented in poster form at this year's ISMB in Glasgow.
16Coordinate Systems for Molecules
- A point in the high dimensional space will have coordinates. Some possibilities are:
  - The 3N numbers representing the Cartesian coordinates of N atoms in the molecule.
  - The phi/psi angles of the alpha carbons.
  - The interatomic distances of particular atoms in the molecule (useful in drug design).
- We now look at the use of LLE for another application dealing with proteins: structure alignment.
17Structure Alignment
- From a biological perspective, demonstrating the similarity of two proteins by comparing their 3D structure is very important because protein functionality is strongly related to structure.
- From an evolutionary viewpoint, structure is more conserved than sequence.
18What is Structure Alignment?
- Objective
- We are given the 3D coordinates of all atoms for two proteins P and Q.
- So we know their 3D conformations.
- How do we find the translation and rotation operations that lead to the most significant amount of superimposition?
- If flexibility of the protein is to be allowed, then the transformation operations may include some local deformations of either P or Q or both.
19Variations of the Problem
- There are variations of the problem depending on whether we allow flexibility and on whether we have the same or different sequences.
- Flexible, same sequence
  - Trivial: we simply assume that P and Q can adopt the same shape.
- Rigid, same sequence
  - We find the rotation and translation that minimize the distance between corresponding alpha carbons in P and Q.
  - Measure via RMSD.
- Rigid, different sequence
  - More difficult: it is necessary to find an alignment between P and Q that provides a matching between a subset of the amino acids in P and those of Q.
  - Then we calculate the rotation and translation that minimize the RMSD measured only with respect to the matching alpha carbons.
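For the rigid cases, once the matching alpha carbons are known, the optimal rotation and translation can be computed in closed form. The sketch below uses the Kabsch algorithm, a standard technique that the slides do not name explicitly, so treat it as one reasonable choice rather than the method used in the talk. The function name and the random test coordinates are illustrative.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """Superimpose P onto Q, where both are (n, 3) arrays of matched
    alpha-carbon coordinates. Returns (rmsd, R, t) with P @ R.T + t ~ Q."""
    p_centroid = P.mean(axis=0)
    q_centroid = Q.mean(axis=0)
    P0 = P - p_centroid
    Q0 = Q - q_centroid

    H = P0.T @ Q0                                # 3x3 covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = q_centroid - R @ p_centroid

    diff = P @ R.T + t - Q
    rmsd = np.sqrt((diff ** 2).sum() / len(P))
    return rmsd, R, t

# Illustrative check: Q is P rotated about the z axis and then shifted.
rng = np.random.default_rng(1)
P = rng.normal(size=(10, 3))
angle = 0.7
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([1.0, -2.0, 0.5])
Q = P @ R_true.T + t_true

rmsd, R, t = kabsch_rmsd(P, Q)
```

In the different-sequence case the same routine applies, but P and Q would contain only the rows for the matched alpha carbons produced by the alignment step.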
20Variations of the Problem (cont.)
- Flexible, different sequence
  - A similar approach to the last case, complicated by protein deformations that should be made with due attention to energy considerations.
- Other concerns
  - For both of the last two cases, the alignment (matching amino acids) may change as the rotation, translation, and deformation transformations are performed.
  - So many of the algorithms used are heuristic and iterative.
  - It is possible that a less than optimal alignment may lead to a structural alignment that is more biologically relevant.
  - There is no universally accepted structural alignment definition.
21Prior Art
- In all currently available methodologies developed and applied to the comparisons of protein structures, the molecules are considered to be rigid objects.
- Verbitsky, G., Nussinov, R., Wolfson, H., Flexible Structural Comparison Allowing Hinge-Bending, Swiveling Motions. Proteins: Structure, Function, and Genetics 34, 1999, pp. 232-254.
- Their paper goes on to describe an algorithm that does structure comparison allowing molecular parts such as domains, subdomains, and loops, to rotate around preselected point-hinges.
22Reverse Chiropraxis
- In our approach we wish to derive a structural comparison that involves automatically selected bonds chosen as the places where rotation takes place.
- So we wish to take the protein backbone from its native conformation to some nearby conformation that has small rotational alterations to its backbone.
- This is to be done with due attention to potential energy considerations.
- Introducing subluxations of the protein backbone?
23A Structural Alignment Heuristic (1)
- Each protein will have a set of conformations, each conformation being a point in its space of conformations.
- Assuming we have an alignment algorithm that establishes the alpha carbons that are to be put into correspondence, we can create a coordinate system that is based on the positions of these atoms.
- In this way, the points representing conformations will have the same frames of reference for both proteins P and Q.
24A Structural Alignment Heuristic (2)
- If the two manifolds intersect then a point of intersection will specify conformations for P and Q that are the same (at least with respect to the alpha carbons in the alignment).
- It would remain to investigate whether the energies of these conformations of P and Q were acceptable.
- Ideally, a point of intersection is close to the points in the manifolds that represent the native conformations of P and Q.
25A Structural Alignment Heuristic (3)
- If the two manifolds do not intersect (or even when they do) we would want to find two points, one in the manifold for P and the other in the manifold for Q, such that the points are close to one another and correspond to reasonably low energy levels.
- We essentially have a search problem in the low dimensional space.
26Interpolation (1)
- Calculating the intersection of two manifolds is based on the observation that the manifolds are defined by discrete sets of points corresponding to the conformations that were fed to the software doing the dimension reduction.
- A simpler heuristic is to assume that one protein (say P) is rigid and so its conformation space is really a single point corresponding to the native conformation.
27Interpolation (2)
- We find the line through P that is normal to the
manifold at a point that is on a line between the
point representing the native structure of Q and
some neighbouring point.
[Figure labels: Native Structure of P; Next interpolated conformation for Q; Conformational Neighbour of Native Structure of Q; Native Structure of Q]
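One way to read this construction: the next interpolated conformation for Q is the foot of the perpendicular dropped from P onto the line segment joining Q's native structure and one of its conformational neighbours. The sketch below computes that point for coordinates in the low dimensional space. The function name is hypothetical, and clamping the result to the segment is an added assumption so that the interpolated point stays between the two sampled conformations.

```python
def closest_point_on_segment(p, q_native, q_neighbour):
    """Return (point, t), where point = q_native + t * (q_neighbour - q_native)
    is the point of the segment nearest to p and t lies in [0, 1]."""
    direction = [b - a for a, b in zip(q_native, q_neighbour)]
    offset = [c - a for a, c in zip(q_native, p)]
    length_sq = sum(x * x for x in direction)
    if length_sq == 0.0:
        return list(q_native), 0.0               # degenerate segment
    t = sum(o * x for o, x in zip(offset, direction)) / length_sq
    t = max(0.0, min(1.0, t))                    # assumption: stay on the segment
    point = [a + t * x for a, x in zip(q_native, direction)]
    return point, t

# Illustrative 2D coordinates in the reduced space.
point, t = closest_point_on_segment(p=(1.0, 2.0),
                                    q_native=(0.0, 0.0),
                                    q_neighbour=(4.0, 0.0))
# point is [1.0, 0.0] and t is 0.25
```

When t falls strictly between 0 and 1, the vector from the returned point to P is perpendicular to the segment, which matches the "normal to the manifold" condition for a locally linear piece of the manifold.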
28Some Empirical Results
- My grad student Shirley Hui has done various
experiments using PCA and LLE for dimensionality
reduction applied to structural alignment.
29Experiments
- The goal of the following experiment was to align two protein chains where one is flexible and one is rigid.
- Aligned fragments of the two proteins are obtained.
- The alignment of these fragments is improved upon by flexing the flexible protein.
- Same data as used in Shindyalov IN, Bourne PE (1998) Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Engineering 11(9): 739-747.
30Method
- The flexible structural alignment algorithm developed consists of 3 steps:
- 1) Perform a Rigid Alignment
  - The initial rigid alignment provides an idea of which areas of the structures should be aligned. When the rigid alignment is complete, we know which areas of the structures are aligned and which areas are gaps in both of the chains.
- 2) Achieve a Tighter Fit
  - A tighter fit may be achieved if the flexible molecule is flexed in the aligned areas. We do not flex in the gap areas.
- 3) Achieve a Feasible Fit
  - Since the chain moves as a system, flexing the chain in individual areas in step two may distort the structure so much that although it is a better fit, it is not one that is energetically feasible. In order to achieve a feasible fit, the flexible protein's flexibility manifold is searched for a conformation that is as close as possible to the flexed chain generated from step 2. The resulting conformation is one that is feasible since it lies on the manifold.
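Step 3 can be pictured as a nearest-neighbour search. The sketch below assumes the flexibility manifold is represented by a finite set of sampled conformations in some common coordinate system, and simply returns the sample closest to the flexed chain. The slides do not say how the search was actually carried out, so the function name, the Euclidean metric, and the tiny example data are all assumptions.

```python
import numpy as np

def nearest_feasible(flexed, sampled):
    """Index of, and distance to, the sampled conformation closest to `flexed`.
    `sampled` is an (m, n) array of conformations assumed to lie on the manifold;
    `flexed` is a length-n vector for the chain produced by step 2."""
    dists = np.linalg.norm(sampled - flexed, axis=1)
    idx = int(np.argmin(dists))
    return idx, float(dists[idx])

# Illustrative data: three sampled conformations in a 3D reduced space.
sampled = np.array([[0.0, 0.0, 0.0],
                    [1.0, 0.0, 0.0],
                    [2.0, 0.0, 0.0]])
flexed = np.array([1.2, 0.1, 0.0])
idx, dist = nearest_feasible(flexed, sampled)
```

A finer search could interpolate between neighbouring samples instead of snapping to the nearest one, at the cost of re-checking that the interpolated conformation is still energetically reasonable.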
31Data Used
- The protein chains used are as follows:
- Example One
  - 1) Flexible Chain - C-phycocyanin L Chain (atoms 408 - 716)
  - 2) Rigid Chain - Colicin A Chain (atoms 508 - 747)
- Example Two
  - 1) Flexible Chain - C-phycocyanin L Chain (atoms 2040 - 2315)
  - 2) Rigid Chain - Colicin A Chain (atoms 955 - 1372)
32Example 1 Results
Blue: COL A (assumed rigid)
Green: CPC L (assumed rigid)
Red: CPC L (assumed flexible)
Flexibility allowed more superposition, with an improvement in RMSD of 0.606 Angstrom.
33Example 2 Results
Blue: COL A (assumed rigid)
Green: CPC L (assumed rigid)
Red: CPC L (assumed flexible)
Flexibility allowed more superposition, with an improvement in RMSD of 0.489 Angstrom.