Title: Approximation of Protein Structure for Fast Similarity Measures
1. Approximation of Protein Structure for Fast Similarity Measures
- Fabian Schwarzer
- Itay Lotan
- Stanford University
2. Comparing Protein Structures
[Image: two conformations of the same protein, compared side by side]
- Analysis of MDS and MCS trajectories
- Graph-based methods
- Structure prediction applications
  - Evaluating decoy sets
  - Clustering predictions (Shortle et al., Biophysics 98)
- Stochastic Roadmap Simulation (Apaydin et al., RECOMB 02)
- http://folding.stanford.edu
3. k Nearest-Neighbors Problem
Given a set S of conformations of a protein and a
query conformation c, find the k conformations
in S most similar to c.
Can be done in O(N · L) time, where N is the size of S and L is the time to compare two conformations (brute-force sketch below).
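A minimal brute-force sketch of this query (Python/numpy is used for all sketches in this transcript and is an assumption; `dist` stands for whichever similarity measure is plugged in):

```python
import heapq

def knn_brute_force(S, c, k, dist):
    """Return the k conformations in S most similar to the query c.

    Runs in O(N * L) time: N = len(S), L = cost of one dist() call.
    """
    scored = ((dist(c, s), i) for i, s in enumerate(S))   # compare c to every s
    best = heapq.nsmallest(k, scored)                      # keep the k smallest
    return [S[i] for _, i in best]
```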
4. k Nearest-Neighbors Problem
What if it is needed for all c in S?
- Brute force then takes O(N² · L) time: too much
- Can be improved by
  - Reducing L
  - A more efficient nearest-neighbor algorithm
5. Our Solution
- Reduce the structure description → approximate but fast similarity measures
- Reduce the description further → efficient nearest-neighbor algorithms can be used
6. Description of a Protein's Structure
3n coordinates of the Cα atoms (n = number of residues)
7. Similarity Measures - cRMS
- The RMS of the distances between corresponding atoms after the two conformations are optimally aligned (sketch below)
- Computed in O(n) time
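A minimal numpy sketch of cRMS; the Kabsch/SVD superposition used here is one standard way to realize the "optimal alignment" and is an assumption, as is the per-atom RMS normalization:

```python
import numpy as np

def crms(X, Y):
    """cRMS between two conformations X, Y: (n, 3) arrays of C-alpha coordinates."""
    # Center both conformations on their centroids.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Optimal rotation taking Yc onto Xc (Kabsch method via SVD).
    H = Yc.T @ Xc
    U, _, Vt = np.linalg.svd(H)
    d = 1.0 if np.linalg.det(Vt.T @ U.T) > 0 else -1.0   # avoid reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    # RMS of the distances between corresponding atoms after alignment.
    diff = Xc - Yc @ R.T
    return np.sqrt((diff ** 2).sum() / len(X))
```

Since the covariance matrix H is 3x3, the whole computation is O(n), as stated on the slide.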
8. Similarity Measures - dRMS
- The Euclidean distance between the intra-molecular distance matrices of the two conformations (sketch below)
- Computed in O(n²) time
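A minimal numpy sketch of dRMS, under the assumption that the "Euclidean distance" is taken as an RMS over the n(n-1)/2 unique intra-molecular distances (the exact normalization is not given on the slide):

```python
import numpy as np

def distance_matrix(X):
    """All pairwise C-alpha distances of a conformation X with shape (n, 3)."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def drms(X, Y):
    """dRMS: difference between the two intra-molecular distance matrices,
    averaged over the n(n-1)/2 unique pairs (normalization is an assumption)."""
    n = len(X)
    DX, DY = distance_matrix(X), distance_matrix(Y)
    iu = np.triu_indices(n, k=1)              # unique pairs i < j
    sq = (DX[iu] - DY[iu]) ** 2
    return np.sqrt(sq.sum() / len(sq))
```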
9. m-Averaged Approximation
- Cut the chain into m pieces
- Replace each run of n/m Cα atoms by its centroid (sketch below)
- 3n coordinates → 3m coordinates
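A minimal sketch of the m-averaging step; splitting the chain into nearly equal consecutive pieces with numpy's `array_split` is an assumption about how leftover residues are distributed:

```python
import numpy as np

def m_average(X, m):
    """Replace each of m consecutive pieces of the chain by its centroid.

    X: (n, 3) array of C-alpha coordinates; returns an (m, 3) array.
    """
    pieces = np.array_split(X, m, axis=0)        # m consecutive chunks
    return np.vstack([p.mean(axis=0) for p in pieces])
```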
10. Why m-Averaging?
- Averaging reduces the description of random chains with small error
  - Demonstrated through Haar wavelet analysis
- Protein backbones behave on average like random chains
  - Chain topology
  - Limited compactness
11. Evaluation Test Sets
9 structurally diverse proteins of size 38-76 residues
- Decoy sets: conformations from the Park-Levitt set (Park & Levitt, JMB 96), N = 10,000
- Random sets: conformations generated by the program FOLDTRAJ (Feldman & Hogue, Proteins 00), N = 5,000
12. Decoy Sets: Correlation
[Table: per-protein correlations between the m-averaged and exact measures; values range from 0.37 to 0.99]
Higher correlation for random sets!
13. Speed-up for Decoy Sets
- Between 5X and 8X for cRMS (m = 8)
- Between 9X and 36X for dRMS (m = 12)
- With very small error

For random sets, the speed-up for dRMS was between 25X and 64X (m = 8)
14. Efficient Nearest-Neighbor Algorithms
- There are efficient nearest-neighbor algorithms, but they are not compatible with the similarity measures
  - cRMS is not a Euclidean metric
  - dRMS uses a space of dimensionality n(n-1)/2
15. Further Dimensionality Reduction of dRMS
- kd-trees require dimension ≤ 20
- m-averaging with dRMS is not enough
- Reduce further using SVD

SVD: a tool for principal component analysis; it computes the directions of greatest variance.
16. Reduction Using SVD
- Stack the m-averaged distance matrices as vectors
- Compute the SVD of the entire set
- Project onto the most important singular vectors (sketch below)

dRMS is thus reduced to ≤ 20 dimensions
Without m-averaging, the SVD can be too costly
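A minimal sketch of this reduction, reusing the `distance_matrix` and `m_average` helpers from the sketches above; mean-centering before the SVD (standard for PCA) is an assumption, and the 20-component cutoff follows the slides:

```python
import numpy as np

def svd_reduce(conformations, m, n_components=20):
    """Project m-averaged distance matrices onto their top singular vectors.

    conformations: list of (n, 3) C-alpha coordinate arrays.
    Returns the reduced vectors and the projection basis.
    """
    iu = np.triu_indices(m, k=1)                 # unique pairs of the m centroids
    # One row per conformation: upper triangle of its m-averaged distance matrix.
    A = np.vstack([distance_matrix(m_average(X, m))[iu] for X in conformations])
    mean = A.mean(axis=0)                        # centering is an assumption
    _, _, Vt = np.linalg.svd(A - mean, full_matrices=False)
    basis = Vt[:n_components]                    # directions of greatest variance
    return (A - mean) @ basis.T, basis
```

Euclidean distances between the reduced vectors then stand in approximately for dRMS when querying nearest neighbors.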
17. Testing the Method
- Use decoy sets (N = 10,000)
- m-averaging with m = 16
- Project onto the 20 largest PCs (more than 95% of the variance)
- Each conformation is represented by 20 numbers
18. Results
- For k = 10, 25, 100
- Decoy sets: ~80% correct; furthest NN off by 10-20% (0.7 Å - 1.5 Å)
  - 1CTF, with N = 100,000: similar results
- Random sets: ~90% correct, with smaller error (5-10%)

When precision is important, use as a pre-filter with a larger k than needed
19. Running Time
- N = 100,000
- k = 100, for each conformation

Method                               Time
Brute-force                          84 hours
Brute-force + m-averaging            4.8 hours
Brute-force + m-averaging + SVD      41 minutes
kd-tree + m-averaging + SVD          19 minutes
kd-trees will have more impact for larger sets
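A minimal sketch of the kd-tree step on the reduced 20-dimensional vectors; SciPy's `cKDTree` is an assumed implementation choice, not necessarily the one used in the talk:

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_all(reduced, k=100):
    """k nearest neighbors for every conformation in the reduced space.

    reduced: (N, d) array of SVD-projected vectors (see svd_reduce above).
    Returns an (N, k) array of neighbor indices, excluding each point itself.
    """
    tree = cKDTree(reduced)
    # Query k+1 neighbors because the closest hit of each point is itself.
    _, idx = tree.query(reduced, k=k + 1)
    return idx[:, 1:]
```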
20. Structural Classification
Computing the similarity between structures of
two different proteins is more involved
[Image: structures 2MM1 vs. 1IRD]
The correspondence problem: which parts of the two structures should be compared?
21. STRUCTAL (Gerstein & Levitt, 98)
- Compute an optimal correspondence using dynamic programming
- Optimally align the corresponding parts in space to minimize cRMS
- Repeat until convergence (schematic sketch below)

O(n₁·n₂) time
Result depends on the initial correspondence!
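A schematic sketch of the iterate-until-convergence loop described above, not of STRUCTAL itself: the distance-based residue score, the gap penalty, and the helper structure are illustrative assumptions.

```python
import numpy as np

def superpose(A, B, pairs):
    """Rigidly move B so that the paired residues fit A best (Kabsch/SVD)."""
    ai, bj = zip(*pairs)
    Q, P = A[list(ai)], B[list(bj)]
    Pc, Qc = P - P.mean(axis=0), Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)
    d = 1.0 if np.linalg.det(Vt.T @ U.T) > 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return (B - P.mean(axis=0)) @ R.T + Q.mean(axis=0)

def dp_correspondence(A, B, cutoff=5.0, gap=1.0):
    """Dynamic-programming correspondence in O(n1 * n2) time.

    Rewards residue pairs that are spatially close in the current frame;
    this score and gap penalty are simple stand-ins, not STRUCTAL's scoring.
    """
    n1, n2 = len(A), len(B)
    score = cutoff - np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    S = np.zeros((n1 + 1, n2 + 1))
    S[:, 0] = -gap * np.arange(n1 + 1)
    S[0, :] = -gap * np.arange(n2 + 1)
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            S[i, j] = max(S[i - 1, j - 1] + score[i - 1, j - 1],
                          S[i - 1, j] - gap,
                          S[i, j - 1] - gap)
    # Traceback: recover the matched (i, j) residue pairs.
    pairs, i, j = [], n1, n2
    while i > 0 and j > 0:
        if S[i, j] == S[i - 1, j - 1] + score[i - 1, j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif S[i, j] == S[i - 1, j] - gap:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]

def iterative_alignment(A, B, init_pairs, max_iters=50):
    """Align, re-derive the correspondence by DP, repeat until it is stable."""
    pairs = list(init_pairs)
    for _ in range(max_iters):
        B = superpose(A, B, pairs)             # minimize cRMS over current pairs
        new_pairs = dp_correspondence(A, B)    # O(n1 * n2) DP step
        if new_pairs == pairs:                 # correspondence stable: converged
            break
        pairs = new_pairs
    return pairs, B
```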
22. STRUCTAL + m-Averaging
- Compute the similarity for structures of the same SCOP super-family, with and without m-averaging

n/m    correlation    speed-up
3      0.60  0.66     7
5      0.44  0.58     19
8      0.35  0.57     46

NN results were disappointing
23. Conclusion
- Fast computation of similarity measures
- Trade-off between speed and precision
- Exploits the chain topology and limited compactness of proteins
- Allows the use of efficient nearest-neighbor algorithms
- Can be used as a pre-filter when precision is important
24. Random Chains
[Figure: a random chain with vertices c0, c1, ..., cn-1]
- The dimensions are uncorrelated
- Average behavior can be approximated by normal variables (see the sketch below)
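A minimal sketch of one common idealization of such a random chain (independent Gaussian steps per coordinate), included only to illustrate the "normal variables" remark; the specific step model is an assumption:

```python
import numpy as np

def random_chain(n, step_std=1.0, seed=0):
    """Random chain c_0 .. c_{n-1}: cumulative sum of i.i.d. Gaussian steps.

    The three coordinate dimensions are generated independently, matching
    the 'dimensions are uncorrelated' remark on the slide.
    """
    rng = np.random.default_rng(seed)
    steps = rng.normal(scale=step_std, size=(n - 1, 3))
    return np.vstack([np.zeros((1, 3)), np.cumsum(steps, axis=0)])
```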
25. 1-D Haar Wavelet Transform
- Recursive averaging and differencing of the values

Level    Averages             Detail coefficients
3        9 7 2 6 5 1 4 6
2        8 4 3 5              1 -2 2 -1
1        6 4                  2 -1
0        5                    1

9 7 2 6 5 1 4 6  →  5 1 2 -1 1 -2 2 -1
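A minimal sketch of this recursive averaging and differencing, reproducing the worked example above (the output ordering, coarsest average first, is one common convention):

```python
import numpy as np

def haar_transform(values):
    """1-D Haar wavelet transform by recursive averaging and differencing.

    Input length must be a power of two. Returns [overall average, detail
    coefficients from coarsest to finest level], e.g.
    [9, 7, 2, 6, 5, 1, 4, 6] -> [5, 1, 2, -1, 1, -2, 2, -1].
    """
    v = np.asarray(values, dtype=float)
    details = []
    while len(v) > 1:
        averages = (v[0::2] + v[1::2]) / 2.0     # pairwise averages
        diffs = (v[0::2] - v[1::2]) / 2.0        # pairwise half-differences
        details.append(diffs)
        v = averages
    return np.concatenate([v] + details[::-1])
```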
26. Haar Wavelets and Compression
- When discarding detail coefficients, the approximation error is the square root of the sum of the squares of the discarded coefficients (see below)
- Compress by discarding the smallest coefficients
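Written out for an orthonormal Haar basis (the orthonormal scaling is an assumption; the slides do not state a normalization), with coefficients $c_j$ of a signal $f$ and kept index set $K$, this is the usual Parseval-type identity:

\[
\bigl\| f - \tilde{f} \bigr\|_2 \;=\; \sqrt{\sum_{j \notin K} c_j^{2}},
\]

where $\tilde{f}$ is the approximation reconstructed from the kept coefficients only.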
27. Transform of Random Chains
- For random chains, the pdf of the detail coefficients is [equation on slide]
- Coefficients are expected to be ordered!
- Discard coefficients starting at the lowest level
28. Random Chains and Proteins