Nonlinear Dimensionality Reduction Applied to Problems in Structural Bioinformatics

1
Nonlinear Dimensionality Reduction Applied to
Problems in Structural Bioinformatics
Forbes Burkowski, Shirley Hui
School of Computer Science, University of Waterloo
2
Dimensionality Reduction

Introduction
Dimensionality reduction is a technique that may
be used to organize high dimensional data by
discovering a more compact representation of the
data.
We regard the data as a set of points in a high
dimensional space.
This strategy is suitable for data that are
generated by some process that situates the
points in a lower dimensional manifold of that
space.

3
Data Set Example 1

The Swiss roll data set is a 2D manifold
situated in a higher dimensional 3D space.
The goal of a dimensionality reduction exercise
would be the discovery of the 2D intrinsic
coordinates that would allow us to see the data
as it is displayed in figure (c).

4
Data Set Example 2

Suppose we have a camera that takes pictures of a
face that is rotated through a continuous range
of angles.
If each picture is a 32 by 32 array of pixels,
each holding one of 256 grey scale values, then
the picture may be represented as a point in a
space of 1024 dimensions.
Since the angle of rotation is a single parameter
the sequence of pictures corresponds to a
sequence of points that trace out a 1D curve in
the 1024D space.

5
Motivation

Why is this important?
Visualization of high dimensional data
Data compression
Recognition of similarity in data sets
It can be very useful to extract meaningful
dimensions.
We hope to determine an intrinsic coordinate
system for the data.
We want to unroll the manifold.
Interpolation

6
Working with the geodesic distance

Often the distance between two points is more
meaningful when calculated along the manifold,
by going through adjacent points, rather than
simply using the Euclidean distance in the high
dimensional space.
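
One common way to approximate such a distance (the approach used by Isomap, not something these slides specify) is to link each point to its nearest neighbours and take shortest paths through that graph. The sketch below assumes a k-nearest-neighbour graph and Dijkstra's algorithm; the function name and the half-circle test data are illustrative only.

```python
import heapq
import math

def geodesic_distances(points, k, source):
    """Approximate geodesic distances from points[source] to every point.

    Builds a k-nearest-neighbour graph with Euclidean edge lengths and runs
    Dijkstra's algorithm, so each distance is the length of the shortest
    path that hops between neighbouring points.
    """
    n = len(points)
    # Link each point to its k nearest neighbours (edges are made symmetric).
    graph = [dict() for _ in range(n)]
    for i in range(n):
        nearest = sorted((math.dist(points[i], points[j]), j)
                         for j in range(n) if j != i)[:k]
        for d, j in nearest:
            graph[i][j] = d
            graph[j][i] = d
    # Dijkstra's shortest paths from the source point.
    best = [math.inf] * n
    best[source] = 0.0
    queue = [(0.0, source)]
    while queue:
        d, i = heapq.heappop(queue)
        if d > best[i]:
            continue  # stale queue entry
        for j, w in graph[i].items():
            if d + w < best[j]:
                best[j] = d + w
                heapq.heappush(queue, (d + w, j))
    return best

# Points sampled along a half circle of radius 1 (a 1D curve in 2D).
curve = [(math.cos(math.pi * t / 50), math.sin(math.pi * t / 50))
         for t in range(51)]
along_curve = geodesic_distances(curve, k=2, source=0)[-1]
straight_line = math.dist(curve[0], curve[-1])
```

For the two ends of the half circle, the straight-line distance is the diameter (2), while the graph distance comes out close to the arc length (pi), which is the quantity that reflects how far apart the points are along the curve.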

7
LLE Overview

We assume that a point Xi in the high dimensional
space resides in a hyperplane that is determined
by Xi and its k closest neighbouring points.
Calculate the best weights Wij that would
linearly reconstruct Xi from its neighbours.
This is done by solving a constrained
least-squares problem, minimizing the error
$\varepsilon(W) = \sum_i \| X_i - \sum_j W_{ij} X_j \|^2$
Weights are zero if Xj does not belong to the
neighbour set of Xi.
We also need
$\sum_j W_{ij} = 1$ for every i.

8
LLE Overview (cont.)

Compute the low dimensional (d-dimensional)
embedding vectors Yi that are best reconstructed
by the weights Wij.
The Yi minimize
$\Phi(Y) = \sum_i \| Y_i - \sum_j W_{ij} Y_j \|^2$
This is done by finding the eigenvectors with the
smallest eigenvalues of the sparse symmetric matrix
$M = (I - W)^T (I - W)$
Reference
Roweis, S.T. and Saul, L.K., Nonlinear
Dimensionality Reduction by Locally Linear
Embedding, Science, Vol. 290, Dec. 22, 2000, pp.
2323-2326.

9
LLE Overview (cont.)

Summary of the 3 steps for LLE
[Figure labels: High dimensional space, Low dimensional space]
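
The three steps (select neighbours, compute reconstruction weights, compute the embedding) can be sketched in a few lines of NumPy. This is a minimal illustration following the Roweis and Saul formulation, with the embedding taken from the eigenvectors of (I - W)^T (I - W). It is not the code used for the results in this talk. The function name `lle`, the regularisation parameter `reg`, the dense matrices, and the helix test data are all assumptions made for the example.

```python
import numpy as np

def lle(X, k, d, reg=1e-3):
    """Minimal LLE sketch. X has shape (n_points, D); returns Y of shape (n_points, d)."""
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        # Step 1: the k nearest neighbours of X_i (index 0 of the sort is X_i itself,
        # assuming there are no duplicate points).
        dist = np.linalg.norm(X - X[i], axis=1)
        nbrs = np.argsort(dist)[1:k + 1]
        # Step 2: weights that best reconstruct X_i, constrained to sum to one.
        Z = X[nbrs] - X[i]                      # neighbours centred on X_i
        C = Z @ Z.T                             # local Gram matrix (k x k)
        C += reg * np.trace(C) * np.eye(k)      # regularise in case C is singular
        w = np.linalg.solve(C, np.ones(k))
        W[i, nbrs] = w / w.sum()
    # Step 3: eigenvectors of M = (I - W)^T (I - W) with the smallest eigenvalues.
    I_minus_W = np.eye(n) - W
    M = I_minus_W.T @ I_minus_W
    eigvals, eigvecs = np.linalg.eigh(M)        # eigenvalues in ascending order
    # Skip the first eigenvector (eigenvalue close to zero) and keep the next d.
    return eigvecs[:, 1:d + 1]

# Illustrative data: 200 points on a helix, a 1D curve sitting in 3D.
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 3 * np.pi, 200))
helix = np.column_stack([np.cos(t), np.sin(t), 0.3 * t])
Y = lle(helix, k=8, d=1)                        # one coordinate per point
```

A practical implementation would store W and M as sparse matrices and use a sparse eigensolver, since each row of W has only k nonzero entries.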

10
Images of Lips Mapped Via LLE
Images were mapped into the embedding space
described by the first two coordinates of LLE.

Saul, L.K. and Roweis, S.T.,
An Introduction to Locally Linear Embedding,
http://www.cs.toronto.edu/~roweis/lle/papers/lleintro.pdf

11
Images of Faces Mapped Via LLE

These images were mapped into the embedding space
described by the first two coordinates of LLE.

12
HIV Protease 2D Manifold
13
Dimensionality Reduction and Protein Flexibility

The study of protein motion is difficult because
of the many degrees of freedom.
A medium sized protein may have a few thousand
degrees of freedom.
Can dimensionality reduction be used to obtain a
reduced basis representation of protein
flexibility?
Teodoro et al. answer in the affirmative:
Teodoro, M.L., Phillips, G.N. Jr., Kavraki, L.E.,
A Dimensionality Reduction Approach to Modeling
Protein Flexibility, RECOMB '02, Washington, Apr.
2002, pp. 299-308.

14
Modeling Flexibility of HIV Protease

Teodoro et al. use a molecular dynamics program to
generate 14,000 conformations of the protein.
The resulting set of vectors is subjected to
dimensionality reduction via PCA (Principal
Component Analysis).
The reduced basis representation retains the
critical information about the directions of
preferred motion of the protein.
They have a convincing animation that shows the
flap movement in 4HVP.
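
As a rough illustration of how a reduced basis can be obtained with PCA, the sketch below applies a singular value decomposition to a matrix with one flattened conformation per row. The trajectory here is synthetic stand-in data (10 hypothetical atoms moving mostly along one collective direction), not the molecular dynamics data of Teodoro et al., and `pca_basis` is a name chosen for this example.

```python
import numpy as np

def pca_basis(conformations, n_components):
    """conformations: (n_conformations, 3N) array, one flattened structure per row."""
    mean = conformations.mean(axis=0)
    centered = conformations - mean
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, s, Vt = np.linalg.svd(centered, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    return mean, Vt[:n_components], explained[:n_components]

# Synthetic stand-in for a trajectory: 500 "conformations" of 10 atoms whose
# motion is dominated by a single collective direction plus small noise.
rng = np.random.default_rng(0)
direction = rng.normal(size=30)
direction /= np.linalg.norm(direction)
amplitude = rng.normal(scale=2.0, size=(500, 1))
trajectory = amplitude * direction + rng.normal(scale=0.05, size=(500, 30))

mean, basis, explained = pca_basis(trajectory, n_components=2)
reduced = (trajectory - mean) @ basis.T      # coordinates in the reduced basis
```

In this toy case the first principal direction lines up with the planted collective motion and accounts for nearly all of the variance, which is the sense in which a few components can capture the preferred directions of motion.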

15
Nonlinear Dimension Reduction for Proteins

Can we use LLE in lieu of PCA?
S. Hui and U. Shakeel (University of Waterloo,
School of Computer Science graduate students)
repeated the HIV protease study using LLE instead
of PCA.
Very good results,
but quite computationally expensive.
Results will be presented in poster form at this
year's ISMB in Glasgow.

16
Coordinate Systems for Molecules

A point in the high dimensional space will have
coordinates. Some possibilities are:
The 3N numbers representing the Cartesian
coordinates of N atoms in the molecule.
The phi/psi angles of the alpha carbons.
The interatomic distances of particular atoms in
the molecule (useful in drug design).
We now look at the use of LLE for another
application dealing with proteins: structure
alignment.

17
Structure Alignment

From a biological perspective, demonstrating the
similarity of two proteins by comparing their 3D
structures is very important because protein
functionality is strongly related to structure.
From an evolutionary viewpoint, structure is more
conserved than sequence.

18
What is Structure Alignment?

Objective
We are given the 3D coordinates of all atoms for
two proteins P and Q.
So we know their 3D conformations.
How do we find the translation and rotation
operations that lead to the most significant
amount of superimposition?
If flexibility of the protein is to be allowed,
then the transformation operations may include
some local deformations of either P or Q or both.

19
Variations of the Problem

There are variations of the problem depending on
whether we allow flexibility and on whether we
have the same or different sequences.
Flexible, same sequence:
Trivial. We simply assume that P and Q can adopt
the same shape.
Rigid, same sequence:
We find the rotation and translation that
minimize the distance between corresponding
alpha carbons in P and Q.
Measure via RMSD.
Rigid, different sequence:
More difficult. It is necessary to find an
alignment between P and Q that provides a
matching between a subset of the amino acids in P
and those of Q.
Then we calculate the rotation and translation
that minimize the RMSD measured only with
respect to the matching alpha carbons.
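
For the rigid cases, once the matching alpha carbons are known, the optimal rotation and translation can be computed in closed form. The sketch below uses the Kabsch algorithm, a standard choice that the slides do not name explicitly; `superpose` and the random test coordinates are illustrative.

```python
import numpy as np

def superpose(P, Q):
    """Rotate and translate P (n x 3) onto Q (n x 3); rows are matched alpha carbons.

    Returns the transformed copy of P and the resulting RMSD.
    """
    p_center, q_center = P.mean(axis=0), Q.mean(axis=0)
    P0, Q0 = P - p_center, Q - q_center
    # Kabsch algorithm: SVD of the 3 x 3 covariance matrix.
    U, _, Vt = np.linalg.svd(P0.T @ Q0)
    # Flip one axis if needed so that R is a proper rotation (det = +1).
    sign = np.sign(np.linalg.det(U @ Vt))
    D = np.diag([1.0, 1.0, sign])
    R = U @ D @ Vt
    P_fit = P0 @ R + q_center                   # row-vector convention
    rmsd = np.sqrt(np.mean(np.sum((P_fit - Q) ** 2, axis=1)))
    return P_fit, rmsd

# Check on made-up coordinates: P is Q rotated about z and shifted,
# so a perfect superposition exists.
rng = np.random.default_rng(1)
Q = rng.normal(size=(20, 3))
theta = 0.7
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0,            0.0,           1.0]])
P = Q @ Rz + np.array([5.0, -2.0, 1.0])
P_fit, rmsd = superpose(P, Q)
```

In the different-sequence case the same routine would be called on only the rows of P and Q that the alignment has put into correspondence.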

20
Variations of the Problem (cont.)

Flexible, different sequence:
A similar approach to the last case, complicated
by protein deformations that should be made with
due attention to energy considerations.
Other concerns:
For both of the last two cases, the alignment
(matching amino acids) may change as the
rotation, translation, and deformation
transformations are performed.
So many of the algorithms used are heuristic and
iterative.
It is possible that a less than optimal alignment
may lead to a structural alignment that is more
biologically relevant.
There is no universally accepted structural
alignment definition.

21
Prior Art

In all currently available methodologies
developed and applied to the comparisons of
protein structures, the molecules are considered
to be rigid objects.
Verbitsky, G., Nussinov, R., Wolfson, H.,
Flexible Structural Comparison Allowing
Hinge-Bending, Swiveling Motions. Proteins:
Structure, Function, and Genetics 34, 1999, pp.
232-254.
Their paper goes on to describe an algorithm that
does structure comparison allowing molecular
parts such as domains, subdomains, and loops, to
rotate around preselected point-hinges.

22
Reverse Chiropraxis

In our approach we wish to derive a structural
comparison that involves automatically selected
bonds chosen as the places where rotation takes
place.
So we wish to take the protein backbone from its
native conformation to some nearby conformation
that has small rotational alterations to its
backbone.
This is to be done with due attention to potential
energy considerations.
Introducing subluxations of the protein
backbone?

23
A Structural Alignment Heuristic (1)

Each protein will have a set of conformations,
each conformation being a point in its space of
conformations.
Assuming we have an alignment algorithm that
establishes the alpha carbons that are to be put
into correspondence, we can create a
coordinate system that is based on the positions
of these atoms.
In this way, the points representing
conformations will have the same frames of
reference for both proteins P and Q.

24
A Structural Alignment Heuristic (2)

If the two manifolds intersect then a point of
intersection will specify conformations for P and
Q that are the same (at least with respect to the
alpha carbons in the alignment).
It would remain to investigate whether the
energies of these conformations of P and Q were
acceptable.
Ideally, a point of intersection is close to the
points in the manifolds that represent the native
conformations of P and Q.

25
A Structural Alignment Heuristic (3)

If the two manifolds do not intersect (or even
when they do), we would want to find two points,
one in the manifold for P and the other in the
manifold for Q, such that the points are close to
one another and correspond to reasonably low
energy levels.
We essentially have a search problem in the low
dimensional space.
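
Since each manifold is known only through a finite set of sampled points, the simplest form of this search is a brute-force scan over pairs. The sketch below assumes that each sampled point carries a precomputed energy and that a single threshold `max_energy` defines "reasonably low"; both are assumptions for illustration, as is the function name.

```python
import numpy as np

def closest_low_energy_pair(YP, EP, YQ, EQ, max_energy):
    """YP, YQ: (n, d) and (m, d) low dimensional points; EP, EQ: their energies.

    Returns (i, j, distance) for the closest pair whose energies are both
    at most max_energy, or None if no pair qualifies.
    """
    ok_p = np.flatnonzero(EP <= max_energy)
    ok_q = np.flatnonzero(EQ <= max_energy)
    if ok_p.size == 0 or ok_q.size == 0:
        return None
    # All pairwise distances between the admissible points (brute force).
    diff = YP[ok_p][:, None, :] - YQ[ok_q][None, :, :]
    dist = np.linalg.norm(diff, axis=2)
    a, b = np.unravel_index(np.argmin(dist), dist.shape)
    return int(ok_p[a]), int(ok_q[b]), float(dist[a, b])

# Made-up example: the geometrically closest pair (P point 1, Q point 0)
# is rejected because P point 1 has too high an energy.
YP = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
EP = np.array([1.0, 9.0, 1.0])
YQ = np.array([[1.0, 0.1], [4.0, 5.0]])
EQ = np.array([1.0, 1.0])
result = closest_low_energy_pair(YP, EP, YQ, EQ, max_energy=5.0)
```

The scan costs on the order of n times m distance evaluations, which is affordable precisely because the points live in the low dimensional space.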

26
Interpolation (1)

Calculating the intersection of two manifolds is
based on the observation that the manifolds are
defined by discrete sets of points corresponding
to the conformations that were fed to the
software doing the dimension reduction.
A simpler heuristic is to assume that one protein
(say P) is rigid and so its conformation space is
really a single point corresponding to the native
conformation.

27
Interpolation (2)

We find the line through P that is normal to the
manifold at a point that is on a line between the
point representing the native structure of Q and
some neighbouring point.

[Figure labels: Native Structure of P; Next interpolated conformation for Q; Conformational Neighbour of Native Structure of Q; Native Structure of Q]
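
One reading of this construction is that the next interpolated conformation for Q is the foot of the perpendicular dropped from P onto the line joining the native structure of Q and its neighbour. The sketch below computes that point. Clipping the parameter t to the interval [0, 1], so that the result stays between the two known conformations, is an added assumption, and the names are illustrative.

```python
import numpy as np

def interpolate_towards(p, q_native, q_neighbour):
    """Foot of the perpendicular from p onto the line q_native -> q_neighbour.

    Returns (t, point) with point = q_native + t * (q_neighbour - q_native);
    t is clipped to [0, 1] so the result stays between the two known points.
    """
    direction = q_neighbour - q_native
    t = np.dot(p - q_native, direction) / np.dot(direction, direction)
    t = float(np.clip(t, 0.0, 1.0))
    return t, q_native + t * direction

# Made-up 2D coordinates in the reduced space.
p = np.array([1.0, 2.0])
q_native = np.array([0.0, 0.0])
q_neighbour = np.array([4.0, 0.0])
t, q_next = interpolate_towards(p, q_native, q_neighbour)
```

Here t is 0.25 and the interpolated point is (1, 0): a quarter of the way from the native structure of Q towards its neighbour, directly "below" P.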
28
Some Empirical Results

My grad student Shirley Hui has done various
experiments using PCA and LLE for dimensionality
reduction applied to structural alignment.

29
Experiments

The goal of the following experiment was to align
two protein chains where one is flexible and one
is rigid.
Aligned fragments of the two proteins are
obtained.
The alignment of these fragments is improved upon
by flexing the flexible protein.
Same data as used in:
Shindyalov, I.N., Bourne, P.E. (1998) Protein
structure alignment by incremental combinatorial
extension (CE) of the optimal path.
Protein Engineering 11(9): 739-747.
30
Method

The flexible structural alignment algorithm
developed consists of 3 steps:
1) Perform a Rigid Alignment
The initial rigid alignment provides an idea of
which areas of the structures should be aligned.
When the rigid alignment is complete, we know
which areas of the structures are aligned and
which areas are gaps in both of the chains.
2) Achieve a Tighter Fit
A tighter fit may be achieved if the flexible
molecule is flexed in the aligned areas. We do
not flex in the gap areas.
3) Achieve a Feasible Fit
Since the chain moves as a system, flexing the
chain in individual areas in step two may distort
the structure so much that although it is a
better fit, it is not one that is energetically
feasible. In order to achieve a feasible fit,
the flexible protein's flexibility manifold is
searched for a conformation that is as close as
possible to the flexed chain generated from step
2. The resulting conformation is one that is
feasible since it lies on the manifold.
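
Step 3 amounts to a nearest-neighbour search. A minimal sketch, assuming the manifold is represented by a finite set of sampled conformations that have already been superposed in the same frame of reference as the flexed chain, and assuming closeness is measured by RMSD over corresponding atoms (the slides do not state the metric):

```python
import numpy as np

def nearest_feasible(flexed, manifold_samples):
    """Pick the sampled conformation closest (by RMSD) to the flexed chain.

    flexed: (n_atoms, 3); manifold_samples: (n_samples, n_atoms, 3), assumed
    to be already superposed in the same frame of reference as `flexed`.
    """
    sq_dev = np.sum((manifold_samples - flexed) ** 2, axis=2)   # (n_samples, n_atoms)
    rmsd = np.sqrt(sq_dev.mean(axis=1))
    best = int(np.argmin(rmsd))
    return best, float(rmsd[best])

# Made-up data: three samples that are rigid shifts of a base structure.
rng = np.random.default_rng(2)
base = rng.normal(size=(12, 3))
shift = np.array([1.0, 0.0, 0.0])
samples = np.stack([base + s * shift for s in (0.0, 1.0, 2.0)])
flexed_chain = base + 0.9 * shift
index, distance = nearest_feasible(flexed_chain, samples)
```

With these inputs the second sample (index 1) is selected, at an RMSD of 0.1 from the flexed chain.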

31
Data Used

The protein chains used are as follows:
Example One
1) Flexible Chain - C-phycocyanin L Chain (atoms 408 - 716)
2) Rigid Chain - Colicin A Chain (atoms 508 - 747)
Example Two
1) Flexible Chain - C-phycocyanin L Chain (atoms 2040 - 2315)
2) Rigid Chain - Colicin A Chain (atoms 955 - 1372)

32
Example 1 Results
Blue: COL A (assumed rigid)
Green: CPC L (assumed rigid)
Red: CPC L (assumed flexible)
Flexibility allowed more superposition, with an
improvement in RMSD of 0.606 Angstrom.
33
Example 2 Results
Blue: COL A (assumed rigid)
Green: CPC L (assumed rigid)
Red: CPC L (assumed flexible)
Flexibility allowed more superposition, with an
improvement in RMSD of 0.489 Angstrom.
