Approximation of Protein Structure for Fast Similarity Measures

Transcript and Presenter's Notes

1
Approximation of Protein Structure for Fast
Similarity Measures
  • Fabian Schwarzer
  • Itay Lotan
  • Stanford University

2
Comparing Protein Structures
Same protein
vs.
Analysis of MDS and MCS trajectories
Graph-based methods
Structure prediction applications
  • Evaluating decoy sets
  • Clustering predictions (Shortle et al,
    Biophysics '98)

Stochastic Roadmap Simulation (Apaydin et al,
RECOMB '02)
http://folding.stanford.edu
3
k Nearest-Neighbors Problem
Given a set S of conformations of a protein and a
query conformation c, find the k conformations
in S most similar to c.
Can be done in O(N · L) time by brute force, where
N = size of S and L = time to compare two conformations
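A minimal sketch of the brute-force scan, assuming each conformation is a list of (x, y, z) tuples. The names `k_nearest`, `dist` and `toy_dist` are illustrative and do not come from the slides; `toy_dist` is only a stand-in measure with no alignment step.

```python
import heapq
import math

def k_nearest(S, c, k, dist):
    """Return indices of the k conformations in S closest to c.

    Brute force: one call to dist per conformation, so N * L work
    plus heap bookkeeping.
    """
    scored = ((dist(s, c), i) for i, s in enumerate(S))
    return [i for _, i in heapq.nsmallest(k, scored)]

def toy_dist(a, b):
    # Placeholder measure (assumption): RMS deviation without alignment.
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(a, b)) / len(a))
```

Any similarity measure with the same two-argument signature can be passed as `dist`.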
4
k Nearest-Neighbors Problem
What if needed for all c in S?
Brute force takes O(N² · L): too much time
Can be improved by
  • Reducing L
  • A more efficient algorithm

5
Our Solution
  • Reduce structure description
    → approximate but fast similarity measures
  • Reduce description further
    → efficient nearest-neighbor algorithms can be used
6
Description of a Protein's Structure
3n coordinates of Cα atoms (n = number of
residues)
7
Similarity Measures - cRMS
  • The RMS of the distances between corresponding
    atoms after the two conformations are optimally
    aligned

Computed in O(n) time
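In symbols, the definition above can be written as follows. The notation is introduced here for illustration: p_i and q_i are the corresponding atom positions of the two conformations, R ranges over rotations and t over translations.

```latex
\mathrm{cRMS}(P, Q) = \min_{R,\,t} \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left\lVert p_i - (R\,q_i + t) \right\rVert^{2}}
```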
8
Similarity Measures - dRMS
  • The Euclidean distance between the
    intra-molecular distance matrices of the two
    conformations

Computed in O(n²) time
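A minimal sketch of dRMS, assuming conformations are lists of (x, y, z) tuples. Dividing by the number of atom pairs is a common normalization and is an assumption here, since the slide only says "Euclidean distance"; the function names are illustrative.

```python
import math
from itertools import combinations

def pairwise_distances(conf):
    """Upper triangle of the intra-molecular distance matrix, as a flat list."""
    return [math.dist(p, q) for p, q in combinations(conf, 2)]

def drms(a, b):
    """dRMS between two conformations with the same number of atoms (>= 2)."""
    da, db = pairwise_distances(a), pairwise_distances(b)
    # Normalizing by the number of pairs is an assumption, see the note above.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(da, db)) / len(da))
```

Because only internal distances are used, no alignment step is needed, at the price of n(n-1)/2 distances per conformation.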
9
m-Averaged Approximation
  • Cut chain into m pieces
  • Replace each sequence of n/m Ca atoms by its
    centroid

3n coordinates → 3m coordinates
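The two steps above can be sketched as follows, assuming a conformation is a list of (x, y, z) tuples. The name `m_average` and the handling of chains whose length is not divisible by m are assumptions; the slides only speak of pieces of n/m atoms.

```python
def m_average(conf, m):
    """Replace each of m consecutive pieces of the chain by its centroid."""
    n = len(conf)
    if not 1 <= m <= n:
        raise ValueError("m must be between 1 and the number of atoms")
    reduced = []
    for j in range(m):
        # Piece j covers indices [start, end); piece sizes differ by at most
        # one when n is not divisible by m (an assumption).
        start, end = j * n // m, (j + 1) * n // m
        piece = conf[start:end]
        reduced.append(tuple(sum(p[axis] for p in piece) / len(piece)
                             for axis in range(3)))
    return reduced
```

The reduced conformation can be fed to the same similarity measure as the full one, which is where the speed-up comes from.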
10
Why m-Averaging?
  • Averaging reduces description of random chains
    with small error
      • Demonstrated through Haar wavelet analysis
  • Protein backbones behave on average like random
    chains
      • Chain topology
      • Limited compactness

11
Evaluation Test Sets
9 structurally diverse proteins of size 38 to 76
residues
  • Decoy sets: conformations from the Park-Levitt
    set (Park & Levitt, JMB '96), N = 10,000
  • Random sets: conformations generated by the
    program FOLDTRAJ (Feldman & Hogue, Proteins '00),
    N = 5000

12
Decoy Sets Correlation
0.37 0.73
0.40 0.86
0.84 0.98
0.70 0.94
0.98 0.99
0.92 0.96
0.98 0.99
0.92 0.98
0.98 0.99
0.93 0.97
Higher Correlation for random sets!
13
Speed-up for Decoy Sets
  • Between 5X and 8X for cRMS (m = 8)
  • Between 9X and 36X for dRMS (m = 12)
  • with very small error

For random sets the speed-up for dRMS was between
25X and 64X (m = 8)
14
Efficient Nearest-Neighbor Algorithms
  • There are efficient nearest-neighbor algorithms,
    but they are not compatible with similarity
    measures
      • cRMS is not a Euclidean metric
      • dRMS uses a space of dimensionality
        n(n-1)/2

15
Further Dimensionality Reduction of dRMS
  • kd-trees require dimension ≤ 20
  • m-averaging with dRMS is not enough
  • Reduce further using SVD

SVD: a tool for principal component analysis.
Computes directions of greatest variance.
16
Reduction Using SVD
  • Stack m-averaged distance matrices as vectors
  • Compute the SVD of entire set
  • Project onto most important singular vectors

dRMS is thus reduced to ≤ 20 dimensions
Without m-averaging, SVD can be too costly
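The projection can be written as follows. The notation is assumed here: x_i is the stacked distance vector of conformation i (one row of X), and V_d holds the first d right singular vectors. Whether the vectors are mean-centered first is not stated in the slides.

```latex
X = U \Sigma V^{\top}, \qquad
y_i = V_d^{\top} x_i, \qquad
\lVert y_i - y_j \rVert \le \lVert x_i - x_j \rVert
```

Since the projection is orthogonal, distances in the reduced space never exceed the original ones. They are close to them when the d retained directions capture most of the variance.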
17
Testing the Method
  • Use decoy sets (N = 10,000)
  • m-averaging with m = 16
  • Project onto 20 largest PCs (more than 95% of
    variance)
  • Each conformation represented by 20 numbers

18
Results
  • For k = 10, 25, 100
      • Decoy sets: 80% correct, furthest
        NN off by 10% - 20% (0.7 Å - 1.5 Å)
      • 1CTF, with N = 100,000 → similar results
      • Random sets → 90% correct with smaller error
        (5% - 10%)

When precision is important, use as pre-filter
with larger k than needed
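A minimal sketch of the pre-filter idea: retrieve more candidates than needed with the fast approximate measure, then re-rank only those with the exact one. The function and parameter names (`prefiltered_knn`, `k_filter`, `approx_dist`, `exact_dist`) are hypothetical.

```python
import heapq

def prefiltered_knn(S, c, k, k_filter, approx_dist, exact_dist):
    """k nearest neighbors of c: cheap filter first, exact re-ranking second."""
    if k_filter < k:
        raise ValueError("k_filter must be at least k")
    # Stage 1: keep the k_filter best candidates under the approximate measure.
    candidates = heapq.nsmallest(
        k_filter, range(len(S)), key=lambda i: approx_dist(S[i], c))
    # Stage 2: re-rank only those candidates with the exact measure.
    return heapq.nsmallest(k, candidates, key=lambda i: exact_dist(S[i], c))
```

The exact measure is evaluated only `k_filter` times per query instead of N times. A true neighbor is still missed if the approximate measure ranks it outside the first `k_filter` candidates.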
19
Running Time
  • N = 100,000
  • k = 100, for each conformation

Brute-force: 84 hours
Brute-force + m-averaging: 4.8 hours
Brute-force + m-averaging + SVD: 41 minutes
kd-tree + m-averaging + SVD: 19 minutes

kd-trees will have more impact for larger sets
20
Structural Classification
Computing the similarity between structures of
two different proteins is more involved
2MM1 vs. 1IRD
The correspondence problem: which parts of the
two structures should be compared?
21
STRUCTAL (Gerstein & Levitt '98)
  • Compute optimal correspondence using dynamic
    programming
  • Optimally align the corresponding parts in space
    to minimize cRMS
  • Repeat until convergence

O(n1 · n2) time
Result depends on initial correspondence!
22
STRUCTAL + m-averaging
  • Compute similarity for structures of same SCOP
    super-family with and without m-averaging

n/m    correlation    speed-up
3      0.60 0.66      7
5      0.44 0.58      19
8      0.35 0.57      46

NN results were disappointing
23
Conclusion
  • Fast computation of similarity measures
  • Trade-off between speed and precision
  • Exploits chain topology and limited compactness
    of proteins
  • Allows use of efficient nearest-neighbor
    algorithms
  • Can be used as pre-filter when precision is
    important

24
Random Chains
[Figure: a random chain with points c0, c1, c2, ..., cn-1]
  • The dimensions are uncorrelated
  • Average behavior can be approximated by normal
    variables

25
1-D Haar Wavelet Transform
  • Recursive averaging and differencing of the values

Level  Averages           Detail Coefficients
3      9 7 2 6 5 1 4 6
2      8 4 3 5            1 -2 2 -1
1      6 4                2 -1
0      5                  1

Input:     9 7 2 6 5 1 4 6
Transform: 5 1 2 -1 1 -2 2 -1
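The recursive averaging and differencing can be sketched as below. This assumes the unnormalized convention average = (a + b) / 2 and detail = (a - b) / 2, which matches the level-2 row of the example; the function name is illustrative.

```python
def haar_transform(values):
    """Unnormalized 1-D Haar transform by recursive averaging and differencing.

    Returns [overall average, coarsest detail, ..., finest details].
    The input length must be a power of two.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    averages = list(values)
    details = []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        # Finer levels are computed first, so prepend to keep coarse-to-fine order.
        details = [(a - b) / 2 for a, b in pairs] + details
        averages = [(a + b) / 2 for a, b in pairs]
    return averages + details
```

Under this convention the input 9 7 2 6 5 1 4 6 yields 5 1 2 -1 1 -2 2 -1.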
26
Haar Wavelets and Compression
When discarding detail coefficients, the
approximation error is the root of the sum of the
squares of the discarded (normalized) coefficients
  • Compress by discarding smallest coefficients
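The error statement holds exactly when the transform is orthonormal, that is, when sums and differences are divided by √2 rather than 2. A minimal sketch under that assumption, with illustrative function names:

```python
import math

def haar_orthonormal(values):
    """Orthonormal 1-D Haar transform (length must be a power of two)."""
    approx, details = list(values), []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details = [(a - b) / math.sqrt(2) for a, b in pairs] + details
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return approx + details

def inverse_haar_orthonormal(coeffs):
    """Inverse of haar_orthonormal."""
    approx, rest = list(coeffs[:1]), list(coeffs[1:])
    while rest:
        # Each level has as many detail coefficients as current averages.
        details, rest = rest[:len(approx)], rest[len(approx):]
        approx = [v for a, d in zip(approx, details)
                  for v in ((a + d) / math.sqrt(2), (a - d) / math.sqrt(2))]
    return approx
```

Setting some detail coefficients to zero and inverting gives an approximation whose Euclidean error equals the root of the sum of squares of the zeroed coefficients.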

27
Transform of Random Chains
For random chains the pdf of the detail
coefficients is
Coefficients expected to be ordered!
Discard coefficients starting at lowest level
28
Random Chains and Proteins