Nonlinear Dimensionality Reduction Applied to Problems in Structural Bioinformatics

Provided by: forbesbu

1
Nonlinear Dimensionality Reduction Applied to
Problems in Structural Bioinformatics
Forbes Burkowski, Shirley Hui
School of Computer Science, University of Waterloo
2
Dimensionality Reduction
  • Introduction
  • Dimensionality reduction is a technique that may
    be used to organize high dimensional data by
    discovering a more compact representation of the
    data.
  • We regard the data as a set of points in a high
    dimensional space.
  • This strategy is suitable for data that are
    generated by some process that situates the
    points in a lower dimensional manifold of that
    space.

3
Data Set Example 1
  • The Swiss roll data set is a 2D manifold
    situated in a higher dimensional 3D space.
  • The goal of a dimensionality reduction exercise
    would be the discovery of the 2D intrinsic
    coordinates that would allow us to see the data
    as it is displayed in figure (c).

4
Data Set Example 2
  • Suppose we have a camera that takes pictures of a
    face that is rotated through a continuous range
    of angles.
  • If a picture is represented using 256 grey scale
    values in a 32 by 32 array then each picture may
    be represented as a point in a space of 1024
    dimensions.
  • Since the angle of rotation is a single parameter
    the sequence of pictures corresponds to a
    sequence of points that trace out a 1D curve in
    the 1024D space.
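
The representation described above can be sketched in a few lines of Python. The function `render_face` is a hypothetical stand-in for the camera (it just draws a bright bar whose position depends on the angle); only the flattening of a 32 by 32 image into a 1024-component point reflects the slide.

```python
SIZE = 32  # each picture is a SIZE x SIZE array of grey values (0..255)

def render_face(angle_deg):
    """Hypothetical stand-in for the camera: a bright vertical bar whose
    column depends on the rotation angle."""
    col = int(round((angle_deg % 180) / 180 * (SIZE - 1)))
    return [[255 if c == col else 0 for c in range(SIZE)] for _ in range(SIZE)]

def image_to_point(image):
    """Flatten a 2D image into one point in a SIZE * SIZE dimensional space."""
    return [float(pixel) for row in image for pixel in row]

# One point per rotation angle: a sequence of samples along a 1D curve
# that lives in the 1024-dimensional space.
angles = list(range(0, 180, 10))
points = [image_to_point(render_face(a)) for a in angles]
print(len(points), "points of dimension", len(points[0]))
```

Every point has 1024 coordinates, yet the whole set is indexed by the single parameter `angle_deg`.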

5
Motivation
  • Why is this important?
  • Visualization of high dimensional data
  • Data compression
  • Recognition of similarity in data sets
  • It can be very useful to extract meaningful
    dimensions.
  • We hope to determine an intrinsic coordinate
    system for the data.
  • We want to unroll the manifold.
  • Interpolation

6
Working with the geodesic distance
  • Often the distance between two points is more
    meaningful when calculated by going through
    adjacent points in the lower dimensional space
    rather than simply using the Euclidean distance
    in the high dimensional space.
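
As a minimal sketch of this idea, the code below approximates the geodesic distance by the shortest path through a k-nearest-neighbour graph. The graph construction, the choice of Dijkstra's algorithm, and the semicircle test data are assumptions made for illustration; the slides do not prescribe a method.

```python
import heapq
import math

def knn_graph(points, k):
    """Undirected graph linking each point to its k nearest neighbours,
    with Euclidean edge lengths."""
    n = len(points)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        for j in others[:k]:
            d = math.dist(points[i], points[j])
            graph[i][j] = d
            graph[j][i] = d
    return graph

def geodesic_distance(graph, source, target):
    """Shortest path length through the neighbour graph (Dijkstra)."""
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > best.get(u, math.inf):
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < best.get(v, math.inf):
                best[v] = nd
                heapq.heappush(heap, (nd, v))
    return math.inf  # target not reachable

# Illustrative data: 50 points along a unit semicircle (a 1D manifold in 2D).
n = 50
arc = [(math.cos(math.pi * t / (n - 1)), math.sin(math.pi * t / (n - 1)))
       for t in range(n)]
graph = knn_graph(arc, k=2)
straight = math.dist(arc[0], arc[-1])        # cuts across the semicircle
along = geodesic_distance(graph, 0, n - 1)   # follows the curve
print(straight, along)
```

The straight-line distance between the two ends is the diameter, while the graph distance stays close to the arc length, which is the more meaningful separation along the manifold.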

7
LLE Overview
  • We assume that a point Xi in the high dimensional
    space resides in a hyperplane that is determined
    by Xi and its k closest neighbouring points.
  • Calculate the best weights Wij that would
    linearly reconstruct Xi from its neighbours.
  • This is done by solving a constrained
    least-squares problem, minimizing the error
    $\varepsilon(W) = \sum_i \left\| X_i - \sum_j W_{ij} X_j \right\|^2$
  • Weights are zero if Xj does not belong to the
    neighbour set of Xi.
  • We also need $\sum_j W_{ij} = 1$ for each i.

8
LLE Overview (cont.)
  • Compute the low d-dimensional embedding vectors
    Yi that are best reconstructed by the weights
    Wij.
  • The Yi minimize
    $\Phi(Y) = \sum_i \left\| Y_i - \sum_j W_{ij} Y_j \right\|^2$
  • This is done by finding the eigenvectors with the
    smallest nonzero eigenvalues of the sparse
    symmetric matrix $M = (I - W)^T (I - W)$
  • Reference
  • Roweis, S.T. and Saul, L.K., Nonlinear
    Dimensionality Reduction by Locally Linear
    Embedding, Science, Vol. 290, Dec. 22, 2000, pp.
    2323-2326.

9
LLE Overview (cont.)
  • Summary of the 3 steps for LLE
  • High dimensional space
  • Low dimensional space
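
The three LLE steps can be sketched with NumPy as follows. This is an illustrative version under stated assumptions, not the authors' implementation: the regularisation term `reg`, the dense eigen-solver, and the synthetic Swiss roll data are choices made here for brevity (a practical version would use a sparse solver).

```python
import numpy as np

def lle(X, k, d, reg=1e-3):
    """Locally linear embedding sketch. X is (n, D); returns (Y, W) where
    Y is the (n, d) embedding and W the (n, n) reconstruction weights."""
    n = X.shape[0]

    # Step 1: find the k nearest neighbours of every point.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbours = np.argsort(dists, axis=1)[:, 1:k + 1]  # column 0 is the point itself

    # Step 2: weights that best reconstruct each X_i from its neighbours,
    # subject to the weights of each row summing to one.
    W = np.zeros((n, n))
    for i in range(n):
        Z = X[neighbours[i]] - X[i]            # neighbours centred on X_i
        C = Z @ Z.T                            # local Gram matrix (k x k)
        C += reg * np.trace(C) * np.eye(k)     # assumption: regularise, C may be singular
        w = np.linalg.solve(C, np.ones(k))
        W[i, neighbours[i]] = w / w.sum()

    # Step 3: embedding from the bottom eigenvectors of M = (I - W)^T (I - W).
    I_minus_W = np.eye(n) - W
    M = I_minus_W.T @ I_minus_W
    eigvals, eigvecs = np.linalg.eigh(M)       # eigenvalues in ascending order
    Y = eigvecs[:, 1:d + 1]                    # skip the constant eigenvector
    return Y, W

# Illustrative data: 200 points sampled from a Swiss roll in 3D.
rng = np.random.default_rng(0)
t = 1.5 * np.pi * (1 + 2 * rng.random(200))
height = 20 * rng.random(200)
X = np.column_stack([t * np.cos(t), height, t * np.sin(t)])
Y, W = lle(X, k=10, d=2)
print(Y.shape)
```

The eigenvector with eigenvalue zero is constant because every row of W sums to one, which is why the sketch discards it and keeps the next d eigenvectors.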

10
Images of Lips Mapped Via LLE
Images were mapped into the embedding space
described by the first two coordinates of LLE.
  • Saul, L.K. and Roweis, S.T.
  • An Introduction to Locally Linear Embedding
  • http://www.cs.toronto.edu/~roweis/lle/papers/lleintro.pdf

11
Images of Faces Mapped Via LLE
  • These images were mapped into the embedding space
    described by the first two coordinates of LLE.

12
HIV Protease 2D Manifold
13
Dimensionality Reduction and Protein Flexibility
  • The study of protein motion is difficult because
    of the many degrees of freedom.
  • A medium sized protein may have a few thousand
    degrees of freedom.
  • Can dimensionality reduction be used to obtain a
    reduced basis representation of protein
    flexibility?
  • Teodoro et al. answer in the affirmative
  • Teodoro, M.L., Phillips, G.N.Jr., Kavraki, L.E.,
    A Dimensionality Reduction Approach to Modeling
    Protein Flexibility, RECOMB 02, Washington, Apr.
    2002, 299-308.

14
Modeling Flexibility of HIV Protease
  • Teodoro et al. use a molecular dynamics program to
    generate 14,000 conformations of the protein.
  • They get a vector set that is subjected to
    dimensionality reduction via PCA (Principal
    Component Analysis).
  • The reduced basis representation retains the
    critical information about the directions of
    preferred motion of the protein.
  • They have a convincing animation that shows the
    flap movement in 4HVP.
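
A minimal sketch of PCA on a set of conformation vectors is given below. The synthetic data (500 conformations of a 10-atom molecule moving mainly along one collective direction) is an assumption standing in for a molecular dynamics trajectory, and the SVD route is one common way to compute PCA rather than the procedure of Teodoro et al.

```python
import numpy as np

def pca_basis(conformations, n_components):
    """conformations is (m, 3N): one flattened structure per row.
    Returns the mean structure, the leading principal directions, and the
    fraction of total variance each of them explains."""
    mean = conformations.mean(axis=0)
    centred = conformations - mean
    # Rows of Vt are the principal directions, ordered by singular value.
    _, S, Vt = np.linalg.svd(centred, full_matrices=False)
    variance = S ** 2 / (len(conformations) - 1)
    return mean, Vt[:n_components], variance[:n_components] / variance.sum()

# Assumed synthetic data: one dominant collective motion plus small noise.
rng = np.random.default_rng(0)
direction = rng.normal(size=30)
direction /= np.linalg.norm(direction)
amplitudes = rng.normal(scale=5.0, size=(500, 1))
conformations = amplitudes * direction + rng.normal(scale=0.1, size=(500, 30))

mean, basis, explained = pca_basis(conformations, n_components=2)
print(explained)

# Reduced representation: coordinates of each conformation in the new basis.
reduced = (conformations - mean) @ basis.T
```

Here the first principal direction recovers the dominant motion, so a single coordinate retains most of the information about the preferred direction of movement.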

15
Nonlinear Dimension Reduction for Proteins
  • Can we use LLE in lieu of PCA?
  • S. Hui and U. Shakeel (University of Waterloo,
    School of Computer Science graduate students)
    repeated the HIV protease study using LLE instead
    of PCA.
  • Very good results,
  • but quite computationally expensive.
  • Results will be presented in poster form at this
    year's ISMB in Glasgow.

16
Coordinate Systems for Molecules
  • A point in the high dimensional space will have
    coordinates. Some possibilities are
  • The 3N numbers representing the Cartesian
    coordinates of N atoms in the molecule.
  • The phi/psi angles of the alpha carbons.
  • The interatomic distances of particular atoms in
    the molecule (useful in drug design).
  • We now look at the use of LLE for another
    application dealing with proteins: structure
    alignment.
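
Two of the coordinate choices listed above can be illustrated with a short sketch. The four atom positions are made-up values, and the helper names are hypothetical.

```python
import itertools
import math

# Assumed toy molecule: four atoms with made-up Cartesian coordinates.
atoms = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0), (0.0, 1.5, 1.0)]

def cartesian_vector(atoms):
    """The 3N numbers giving the Cartesian coordinates of N atoms."""
    return [c for atom in atoms for c in atom]

def distance_vector(atoms):
    """Interatomic distances for every pair of atoms, N(N-1)/2 numbers."""
    return [math.dist(a, b) for a, b in itertools.combinations(atoms, 2)]

print(len(cartesian_vector(atoms)))  # 3N = 12
print(len(distance_vector(atoms)))   # N(N-1)/2 = 6
```

One design consideration: the distance vector does not change when the molecule is rotated or translated, whereas the Cartesian vector does, so Cartesian conformations need a common frame of reference before they are compared.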

17
Structure Alignment
  • From a biological perspective demonstrating the
    similarity of two proteins by comparing their 3D
    structure is very important because protein
    functionality is strongly related to structure.
  • From an evolutionary viewpoint, structure is more
    conserved than sequence.

18
What is Structure Alignment?
  • Objective
  • We are given the 3D coordinates of all atoms for
    two proteins P and Q.
  • So we know their 3D conformations.
  • How do we find the translation and rotation
    operations that lead to the most significant
    amount of superimposition?
  • If flexibility of the protein is to be allowed,
    then the transformation operations may include
    some local deformations of either P or Q or both.

19
Variations of the Problem
  • There are variations of the problem depending on
    whether we allow flexibility and on whether we
    have the same or different sequences.
  • Flexible same sequence
  • Trivial: we simply assume that P and Q can adopt
    the same shape.
  • Rigid same sequence
  • We find the rotation and translation that
    minimize the distance between corresponding
    alpha carbons in P and Q.
  • Measure via RMSD.
  • Rigid different sequence
  • More difficult: it is necessary to find an
    alignment between P and Q that provides a
    matching between a subset of the amino acids in P
    and those of Q.
  • Then we calculate the rotation and translation
    that minimize the RMSD measured only with
    respect to the matching alpha carbons.
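
For the rigid cases above, once the matching alpha carbons are known, the optimal rotation and translation can be computed in closed form. The sketch below uses the Kabsch method (an SVD of the covariance of the two centred point sets); the slides do not name an algorithm, so this choice and the random test coordinates are assumptions.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """P and Q are (n, 3) arrays of matched alpha-carbon coordinates.
    Returns (rmsd, R, t) where applying R and t to P best superimposes it on Q."""
    p_mean, q_mean = P.mean(axis=0), Q.mean(axis=0)
    P0, Q0 = P - p_mean, Q - q_mean
    H = P0.T @ Q0                               # 3 x 3 covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against a reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = q_mean - R @ p_mean
    aligned = P @ R.T + t
    rmsd = np.sqrt(np.mean(np.sum((aligned - Q) ** 2, axis=1)))
    return rmsd, R, t

# Assumed test data: Q is a rotated and translated copy of P.
rng = np.random.default_rng(1)
P = rng.normal(size=(20, 3))
theta = 0.7
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
Q = P @ R_true.T + np.array([1.0, -2.0, 0.5])

rmsd, R, t = kabsch_rmsd(P, Q)
print(rmsd)
```

For the different-sequence case, the same function would be called on only the rows of P and Q that the alignment puts into correspondence.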

20
Variations of the Problem (cont.)
  • Flexible and different sequence
  • A similar approach to the last case, complicated
    by protein deformations that should be made with
    due attention to energy considerations.
  • Other concerns
  • For both of the last two cases, the alignment
    (matching amino acids) may change as the
    rotation, translation, and deformation
    transformations are performed.
  • So many of the algorithms used are heuristic and
    iterative.
  • It is possible that a less than optimal alignment
    may lead to a structural alignment that is more
    biologically relevant.
  • There is no universally accepted structural
    alignment definition.

21
Prior Art
  • "In all currently available methodologies
    developed and applied to the comparisons of
    protein structures, the molecules are considered
    to be rigid objects."
  • Verbitsky, G., Nussinov, R., Wolfson, H.,
    Flexible Structural Comparison Allowing
    Hinge-Bending, Swiveling Motions. Proteins:
    Structure, Function, and Genetics 34, 1999, pp.
    232-254.
  • Their paper goes on to describe an algorithm that
    does structure comparison allowing molecular
    parts such as domains, subdomains, and loops, to
    rotate around preselected point-hinges.

22
Reverse Chiropraxis
  • In our approach we wish to derive a structural
    comparison that involves automatically selected
    bonds chosen as the places where rotation takes
    place.
  • So we wish to take the protein backbone from its
    native conformation to some nearby conformation
    that has small rotational alterations to its
    backbone.
  • This is to be done with due attention to potential
    energy considerations.
  • Introducing subluxations of the protein
    backbone?

23
A Structural Alignment Heuristic (1)
  • Each protein will have a set of conformations,
    each conformation being a point in its space of
    conformations.
  • Assuming we have an alignment algorithm that
    establishes the alpha carbons that are to be put
    into correspondence then we can create a
    coordinate system that is based on the positions
    of these atoms.
  • In this way, the points representing
    conformations will have the same frames of
    reference for both proteins P and Q.

24
A Structural Alignment Heuristic (2)
  • If the two manifolds intersect then a point of
    intersection will specify conformations for P and
    Q that are the same (at least with respect to the
    alpha carbons in the alignment).
  • It would remain to investigate whether the
    energies of these conformations of P and Q were
    acceptable.
  • Ideally, a point of intersection is close to the
    points in the manifolds that represent the native
    conformations of P and Q.

25
A Structural Alignment Heuristic (3)
  • If the two manifolds do not intersect (or even
    when they do) we would want to find two points,
    one in the manifold for P and the other in the
    manifold for Q such that the points are close to
    one another and correspond to reasonably low
    energy levels.
  • We essentially have a search problem in the low
    dimensional space.
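
One simple way to picture this search is a brute-force scan over the sampled points of both manifolds, keeping only conformations below an energy cutoff. The points, energies, and the cutoff `max_energy` below are made-up values, and exhaustive search is an assumption chosen for clarity rather than efficiency.

```python
import math

def closest_low_energy_pair(points_p, energies_p, points_q, energies_q, max_energy):
    """Return (distance, i, j) for the closest pair of sampled points, one from
    each manifold, where both conformations have energy <= max_energy.
    Returns None if no pair qualifies."""
    best = None
    for i, (p, ep) in enumerate(zip(points_p, energies_p)):
        if ep > max_energy:
            continue
        for j, (q, eq) in enumerate(zip(points_q, energies_q)):
            if eq > max_energy:
                continue
            d = math.dist(p, q)
            if best is None or d < best[0]:
                best = (d, i, j)
    return best

# Assumed low dimensional samples for P and Q, with hypothetical energies.
points_p = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
energies_p = [1.0, 5.0, 2.0]
points_q = [(1.0, 0.1), (2.0, 1.0), (0.0, 2.0)]
energies_q = [1.0, 1.0, 1.0]

result = closest_low_energy_pair(points_p, energies_p, points_q, energies_q,
                                 max_energy=3.0)
print(result)  # (1.0, 2, 1)
```

In this toy data the geometrically closest pair involves the second point of P, but its energy exceeds the cutoff, so the search settles on the next closest admissible pair.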

26
Interpolation (1)
  • Calculating the intersection of two manifolds is
    based on the observation that the manifolds are
    defined by discrete sets of points corresponding
    to the conformations that were fed to the
    software doing the dimension reduction.
  • A simpler heuristic is to assume that one protein
    (say P) is rigid and so its conformation space is
    really a single point corresponding to the native
    conformation.

27
Interpolation (2)
  • We find the line through P that is normal to the
    manifold at a point that is on a line between the
    point representing the native structure of Q and
    some neighbouring point.

[Figure labels: Native Structure of P; Next interpolated conformation for Q; Conformational Neighbour of Native Structure of Q; Native Structure of Q]
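
The interpolation step can be read as dropping a perpendicular from the point for P onto the segment joining the native structure of Q and one of its neighbours. The sketch below computes that foot point; clamping the parameter to the segment and the 2D sample values are assumptions made for illustration.

```python
def project_onto_segment(p, q_native, q_neighbour):
    """Foot of the perpendicular from p onto the segment q_native -> q_neighbour.
    Returns (t, foot) where t in [0, 1] is the interpolation parameter."""
    direction = [b - a for a, b in zip(q_native, q_neighbour)]
    offset = [x - a for a, x in zip(q_native, p)]
    denom = sum(d * d for d in direction)
    t = sum(o * d for o, d in zip(offset, direction)) / denom
    t = max(0.0, min(1.0, t))  # assumption: stay between the two sampled points
    foot = [a + t * d for a, d in zip(q_native, direction)]
    return t, foot

# Assumed coordinates in the reduced space.
p = (1.0, 2.0)            # native structure of P (treated as rigid)
q_native = (0.0, 0.0)     # native structure of Q
q_neighbour = (4.0, 0.0)  # a conformational neighbour of Q

t, foot = project_onto_segment(p, q_native, q_neighbour)
print(t, foot)  # 0.25 [1.0, 0.0]
```

The returned point is the next interpolated conformation for Q, expressed in the reduced coordinates; mapping it back to atomic coordinates is a separate step not shown here.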
28
Some Empirical Results
  • My grad student Shirley Hui has done various
    experiments using PCA and LLE for dimensionality
    reduction applied to structural alignment.

29
Experiments
  • The goal of the following experiment was to align
    two protein chains where one is flexible and one
    is rigid.  
  • Aligned fragments of the two proteins are
    obtained.
  • The alignment of these fragments is improved upon
    by flexing the flexible protein.
  • Same data as used in:
    Shindyalov, I.N. and Bourne, P.E. (1998), Protein
    structure alignment by incremental combinatorial
    extension (CE) of the optimal path,
    Protein Engineering 11(9), 739-747.
30
Method
  • The flexible structural alignment algorithm
    consists of 3 steps:
  • 1)  Perform a Rigid Alignment
  • The initial rigid alignment provides an idea of
    which areas of the structures should be aligned.
    When the rigid alignment is complete, we know
    which areas of the structures are aligned and
    which areas are gaps in both of the chains.
  • 2)  Achieve a Tighter Fit
  • A tighter fit may be achieved if the flexible
    molecule is flexed in the aligned areas. We do
    not flex in the gap areas.
  • 3)  Achieve a Feasible Fit
  • Since the chain moves as a system, flexing the
    chain in individual areas in step two may distort
    the structure so much that although it is a
    better fit, it is not one that is energetically
    feasible. In order to achieve a feasible fit,
    the flexible protein's flexibility manifold is
    searched for a conformation that is as close as
    possible to the flexed chain generated from step
    2. The resulting conformation is one that is
    feasible since it lies on the manifold.
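
Step 3 can be sketched as a nearest-neighbour search: among the sampled conformations that define the manifold, pick the one closest to the flexed chain. The coordinate RMSD used as the distance, the function names, and the tiny two-atom conformations are assumptions for illustration only.

```python
import math

def coordinate_rmsd(a, b):
    """RMSD between two flattened coordinate vectors of length 3N,
    assumed to be already expressed in the same frame of reference."""
    n_atoms = len(a) // 3
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / n_atoms)

def nearest_feasible(flexed, sampled):
    """Index of the sampled (feasible) conformation closest to the flexed chain."""
    return min(range(len(sampled)),
               key=lambda i: coordinate_rmsd(flexed, sampled[i]))

# Assumed data: three sampled conformations of a two-atom fragment.
sampled = [
    [0.0, 0.0, 0.0, 1.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.4, 0.5, 0.0],
    [0.0, 0.0, 0.0, 1.0, 1.1, 0.0],
]
flexed = [0.0, 0.0, 0.0, 1.35, 0.6, 0.0]  # output of step 2, possibly infeasible

index = nearest_feasible(flexed, sampled)
print(index, coordinate_rmsd(flexed, sampled[index]))
```

Restricting the answer to sampled conformations trades some tightness of fit for feasibility, which is the point of this step.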

31
Data Used
  • The protein chains used are as follows:
  • Example One
  • 1) Flexible Chain - C-phycocyanin L Chain
    (atoms 408 - 716)
  • 2) Rigid Chain - Colicin A Chain
    (atoms 508 - 747)
  • Example Two
  • 1) Flexible Chain - C-phycocyanin L Chain
    (atoms 2040 - 2315)
  • 2) Rigid Chain - Colicin A Chain
    (atoms 955 - 1372)

32
Example 1 Results
  • Blue: COL A (assumed rigid)
  • Green: CPC L (assumed rigid)
  • Red: CPC L (assumed flexible)
  • Flexibility allowed more superposition, with an
    improvement in RMSD of 0.606 Angstrom.
33
Example 2 Results
  • Blue: COL A (assumed rigid)
  • Green: CPC L (assumed rigid)
  • Red: CPC L (assumed flexible)
  • Flexibility allowed more superposition, with an
    improvement in RMSD of 0.489 Angstrom.