Title: Algorithms Exploiting the Chain Structure of Proteins
1Algorithms Exploiting the Chain Structure of
Proteins
- Itay Lotan
- Computer Science
2Proteins 101
- Involved in all functions of our body
metabolism, motion, defense, etc.
© Michael Levitt
3Protein representation
- Torsion angle model
- Cα model
4Structure determination
X-ray crystallography
© Bernhard Rupp
5Outline
- Fast energy computation during Monte Carlo simulation
- Model completion for protein X-ray crystallography
- Large scale computation of similarity
Exploit specific properties of proteins to perform the computation efficiently
6Outline
- Fast energy computation during Monte Carlo simulation
- Model completion for protein X-ray crystallography
- Large scale computation of similarity
Lotan, Schwarzer, Halperin and Latombe. J. Comput. Biol. 2004 (to appear)
CS Department, Tel-Aviv University
7Monte Carlo simulation (MCS)
Popular method for sampling the conformation
space of proteins
- Estimate thermodynamic quantities
- Search for low-energy conformations and the
folded structure
8MCS: How it works
- Propose random change in conformation
- Compute energy E of new conformation
- Accept with probability $\min(1, e^{-\Delta E / k_B T})$ (Metropolis criterion)
Requires $\gg 10^6$ steps to sample adequately
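The three steps above can be sketched as a short loop. This is only an illustration, assuming the standard Metropolis acceptance rule min(1, exp(-ΔE/kT)); `toy_energy` and `perturb_one_angle` are hypothetical stand-ins, not the energy function or move set used in the talk.

```python
import math
import random

def metropolis_step(conf, energy, propose, kT, rng):
    """One MCS step: propose a change, compute the energy, accept or reject."""
    candidate = propose(conf, rng)
    delta = energy(candidate) - energy(conf)
    # Downhill moves are always accepted; uphill moves with prob. exp(-delta/kT).
    if delta <= 0 or rng.random() < math.exp(-delta / kT):
        return candidate, True
    return conf, False

def toy_energy(angles):
    # Hypothetical energy: lowest when every torsion angle is zero.
    return sum(1.0 - math.cos(a) for a in angles)

def perturb_one_angle(angles, rng, max_step=0.3):
    # Hypothetical move: change a single randomly chosen angle (a 1-DoF change).
    i = rng.randrange(len(angles))
    new = list(angles)
    new[i] += rng.uniform(-max_step, max_step)
    return new

rng = random.Random(0)
conf = [rng.uniform(-math.pi, math.pi) for _ in range(10)]
accepted = 0
for _ in range(2000):
    conf, ok = metropolis_step(conf, toy_energy, perturb_one_angle, kT=0.5, rng=rng)
    accepted += ok
```

The sketch recomputes the full energy twice per step, which is exactly the cost the rest of this part of the talk sets out to avoid.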
9Energy function
- Bonded terms
  - Bond lengths
  - Bond angles
  - Dihedral angles
- Non-bonded terms
  - Van der Waals
  - Electrostatic
- Heuristic: Go models, HP models, etc.
10Pair-wise interactions
- Cutoff distance (6-12 Å)
- Linear number of interactions contribute to energy (Halperin & Overmars 98)
Challenge: Find all interacting pairs without enumerating all pairs
11Related work
- Biology
  - Neighbor lists
    - Verlet 67
    - Brooks et al. 83
  - Grid
    - Quentrec & Brot 73
    - Hockney et al. 74
    - Van Gunsteren et al. 84
  - Neighbor lists + grid
    - Yip & Elber 89
    - Petrella 02
- Computer Science
  - Bounding volume hierarchies for collision detection
    - Gottschalk et al. 96
    - Larsen et al. 00
    - Guibas et al. 02
  - Space partition methods for collision detection
    - Faverjon 84
    - Halperin & Overmars 98
  - Collision detection for chains
    - Halperin et al. 97
    - Guibas et al. 02
12Grid method
d: cutoff distance
- Linear complexity
- Optimal in worst case
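A minimal sketch of the grid idea, assuming cubic cells whose side equals the cutoff distance d, so that any interacting partner of an atom lies in its own cell or one of the 26 surrounding cells. The function name and the dictionary-based grid are illustrative choices, not the implementation benchmarked in the talk.

```python
import math
from collections import defaultdict
from itertools import product

def pairs_within_cutoff(points, d):
    """Return all index pairs (i, j), i < j, whose distance is at most d."""
    grid = defaultdict(list)
    for idx, p in enumerate(points):
        cell = tuple(math.floor(c / d) for c in p)
        grid[cell].append(idx)

    found = set()
    for cell, members in grid.items():
        # Only the 27 cells around `cell` can hold partners within distance d.
        for offset in product((-1, 0, 1), repeat=3):
            other = tuple(c + o for c, o in zip(cell, offset))
            for i in members:
                for j in grid.get(other, ()):
                    if i < j and math.dist(points[i], points[j]) <= d:
                        found.add((i, j))
    return found
```

The running time is linear only if each cell holds a bounded number of atoms, which is the packing property of proteins the earlier slide attributes to Halperin & Overmars 98. Note that the whole grid is rebuilt and rescanned even when a single torsion angle changes.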
13Contributions
- Efficient maintenance and self-collision detection for kinematic chains
- Efficient computation of pair-wise interactions in MCS of proteins
- Scheme for caching and reusing partial energy sums during MCS
- MCS software
Much faster than existing algorithm (grid method)
Download at http://robotics.stanford.edu/itayl/mcs
14Properties of kinematic chains
- Small changes → large effects
16Properties of kinematic chains
- Small changes → large effects
- Local changes → global effects
17Properties of kinematic chains
- Small changes → large effects
- Local changes → global effects
- Few DoF changes → long rigid sub-chains
19ChainTree: A tale of two hierarchies
- Transform hierarchy: approximates kinematics of protein backbone at successive resolutions
- Bounding volume hierarchy: approximates geometry of protein at successive resolutions
20Hierarchy of transforms
21Hierarchy of transforms
22Hierarchy of bounding volumes
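23The ChainTree
[Figure: binary tree in which each node stores a transform T and a bounding volume B]
Level 3 (root): T_AI, B_AH
Level 2: T_AE, B_AD | T_EI, B_EH
Level 1: T_AC, B_AB | T_CE, B_CD | T_EG, B_EF | T_GI, B_GH
Level 0 (leaves): T_AB, B_A | T_BC, B_B | T_CD, B_C | T_DE, B_D | T_EF, B_E | T_FG, B_F | T_GH, B_G | T_HI, B_H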
24Updating the ChainTree
[Figure: the same tree as on slide 23, from the root T_AI, B_AH down to the leaves T_AB, B_A through T_HI, B_H; the highlighting of the updated nodes is not preserved]
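The transform half of the hierarchy can be sketched as a balanced binary tree in which every internal node caches the product of its children's transforms, so that changing one leaf only requires recomputing its ancestors. The sketch below makes several simplifying assumptions that are not from the talk: planar rigid transforms stored as `(angle, tx, ty)`, a chain length that is a power of two, and the hypothetical names `compose` and `TransformTree`. Bounding volumes are left out.

```python
import math

def compose(a, b):
    """Planar rigid transform 'apply b, then a', each given as (angle, tx, ty)."""
    ang_a, ax, ay = a
    ang_b, bx, by = b
    c, s = math.cos(ang_a), math.sin(ang_a)
    return (ang_a + ang_b, ax + c * bx - s * by, ay + s * bx + c * by)

class TransformTree:
    """Array-based binary tree; node i has children 2i and 2i + 1."""

    def __init__(self, leaf_transforms):
        n = len(leaf_transforms)
        assert n > 0 and (n & (n - 1)) == 0, "sketch assumes a power-of-two length"
        self.n = n
        self.node = [None] * n + list(leaf_transforms)
        for i in range(n - 1, 0, -1):
            self.node[i] = compose(self.node[2 * i], self.node[2 * i + 1])

    def update(self, leaf, transform):
        """Change one leaf transform; return how many ancestors were recomputed."""
        i = self.n + leaf
        self.node[i] = transform
        touched = 0
        while i > 1:
            i //= 2
            self.node[i] = compose(self.node[2 * i], self.node[2 * i + 1])
            touched += 1
        return touched

    def root(self):
        """Transform spanning the whole chain (T_AI in the slide's figure)."""
        return self.node[1]
```

With eight leaves, as in the figure, a single change touches three ancestors, that is, a number of nodes logarithmic in the chain length.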
25Computing the energy
Recursively search ChainTree for interactions
- Pruning rules
  - Prune search when distance between bounding volumes is more than cutoff distance
  - Do not search inside rigid sub-chains
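A minimal sketch of the recursive search with both pruning rules, under assumptions that are not from the talk: bounding spheres as the bounding volumes, a hierarchy built by halving the atom index range, and a hand-set `rigid` flag on each node. The names `Node` and `search` are hypothetical. The sketch only handles rigidity within a single node; it does not detect that two different sub-chains kept their relative position.

```python
import math

class Node:
    """Node of a balanced bounding-sphere hierarchy over atoms[lo:hi]."""

    def __init__(self, atoms, lo, hi):
        self.lo, self.hi = lo, hi
        pts = atoms[lo:hi]
        self.center = tuple(sum(axis) / len(pts) for axis in zip(*pts))
        self.radius = max(math.dist(self.center, p) for p in pts)
        self.rigid = False      # set True when no DoF inside this sub-chain changed
        self.left = self.right = None
        if hi - lo > 1:
            mid = (lo + hi) // 2
            self.left = Node(atoms, lo, mid)
            self.right = Node(atoms, mid, hi)

def search(atoms, u, v, cutoff, out):
    """Append every pair (i, j), i < j, within cutoff, skipping rigid sub-chains."""
    if u is v:
        if u.rigid or u.left is None:
            return              # rule 2: do not search inside a rigid sub-chain
        search(atoms, u.left, u.left, cutoff, out)
        search(atoms, u.right, u.right, cutoff, out)
        search(atoms, u.left, u.right, cutoff, out)
        return
    if math.dist(u.center, v.center) - u.radius - v.radius > cutoff:
        return                  # rule 1: bounding volumes farther apart than cutoff
    if u.left is None and v.left is None:
        out.append((u.lo, v.lo))   # two single atoms that passed the distance test
        return
    # Descend into the larger of the two volumes (or the only one that can split).
    if v.left is None or (u.left is not None and u.radius >= v.radius):
        search(atoms, u.left, v, cutoff, out)
        search(atoms, u.right, v, cutoff, out)
    else:
        search(atoms, u, v.left, cutoff, out)
        search(atoms, u, v.right, cutoff, out)
```

Calling `search(atoms, root, root, cutoff, out)` with no node marked rigid returns every pair within the cutoff, including chain neighbors; marking a sub-tree rigid removes exactly the pairs that lie entirely inside it.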
26Computing the energy
[Figure sequence, slides 26-32: a tree of ChainTree node pairs built up step by step. It starts at the root P, adds the children N and O with the pair N-O, then the next level (J, K, L, M with pairs such as J-K, K-L, J-M), and finally the leaves A through H with their pairs (A-B, C-D, E-F, ...). The last slide adds the label E(O).]
33Computing the energy
- Only changed interactions are found
- Reuse unaffected partial sums
- Better performance for
- Longer proteins
- Fewer simultaneous changes
34Computational complexity
worst case bound
Much faster in practice
35Test
[Plots: two panels, 1-DoF change and 5-DoF change, each covering proteins of 68, 144, 374 and 755 residues]
36Simulation of α-Synuclein
- 140 res. protein implicated in Parkinson's disease
- Multi-canonical Replica-exchange MC regime
- Over 1000 CPU days of simulation
- Study conformations at room temp.
- Joint work with Vijay Pande
37Outline
- Fast energy computation during Monte Carlo simulation
- Model completion for protein X-ray crystallography
- Large scale computation of similarity
Lotan, van den Bedem, Deacon and Latombe, WAFR 2004
van den Bedem, Lotan, Latombe and Deacon, submitted to Acta Cryst. D
Joint Center for Structural Genomics (JCSG) at SSRL
38Protein Structure Initiative
152K sequenced genes (30K/year)
25K determined structures (3.6K/year)
- Reduce cost and time to determine protein
structure
- Develop software to automatically interpret the
electron density map (EDM)
39EDM
- 3-D image of atomic structure
- High value (electron density) at atom centers
- Density falls off exponentially away from center
40Automated model building
- 90% built at high resolution (2 Å)
- 66% built at medium to low resolution (2.5-2.8 Å)
- Gaps left at noisy areas in EDM (blurred density)
Gaps need to be resolved manually
41The Fragment completion problem
- Input
- EDM
- Partially resolved structure
- 2 Anchor residues
- Length of missing fragment
- Output
- A small number of candidate structures for
missing fragment
A robotics inverse kinematics (IK) problem
42Related work
- Biology/Crystallography
  - Exact IK solvers
    - Wedemeyer & Scheraga 99
    - Coutsias et al. 04
  - Optimization IK solvers
    - Fine et al. 86
    - Canutescu & Dunbrack Jr. 03
  - Ab-initio loop closure
    - Fiser et al. 00
    - Kolodny et al. 03
  - Database search loop closure
    - Jones & Thirup 86
    - Van Vlijmen & Karplus 97
  - Semi-automatic tools
    - Jones & Kjeldgaard 97
    - Oldfield 01
- Computer Science
  - Exact IK solvers
    - Manocha & Canny 94
    - Manocha et al. 95
  - Optimization IK solvers
    - Wang & Chen 91
  - Redundant manipulators
    - Khatib 87
    - Burdick 89
  - Motion planning for closed loops
    - Han & Amato 00
    - Yakey et al. 01
    - Cortes et al. 02, 04
43Contributions
- Sampling of gap-closing fragments biased by the EDM
- Refinement of fit to density without breaking closure
- Fully automatic fragment completion software for X-ray Crystallography
Novel application of a combination of inverse kinematics techniques
44Two-stage IK method
- Candidate generation: Optimize density fit while closing the gap
- Refinement: Optimize closed fragments without breaking closure
45Stage 1: candidate generation
- Generate random conformation
- Close using Cyclic Coordinate Descent (CCD) (Wang & Chen 91, Canutescu & Dunbrack Jr. 03)
49Stage 1: candidate generation
- Generate random conformation
- Close using Cyclic Coordinate Descent (CCD) (Wang & Chen 91, Canutescu & Dunbrack 03)
CCD moves biased toward high density
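To illustrate the closing step, here is a minimal CCD sketch on a planar chain of revolute joints. It is a deliberate simplification and not the method used in the talk: the real problem is three-dimensional, acts on backbone torsion angles, closes onto an anchor residue rather than a single point, and biases moves toward high density. The names `forward` and `ccd_close` are hypothetical.

```python
import math

def forward(angles, lengths):
    """Joint positions of a planar chain rooted at the origin."""
    pts, x, y, heading = [(0.0, 0.0)], 0.0, 0.0, 0.0
    for a, length in zip(angles, lengths):
        heading += a
        x += length * math.cos(heading)
        y += length * math.sin(heading)
        pts.append((x, y))
    return pts

def ccd_close(angles, lengths, target, tol=1e-6, max_sweeps=200):
    """Adjust one joint at a time so that the chain end approaches target."""
    angles = list(angles)
    for _ in range(max_sweeps):
        for i in reversed(range(len(angles))):
            pts = forward(angles, lengths)
            end, pivot = pts[-1], pts[i]
            # Rotating joint i by this amount points the pivot->end ray at the target,
            # which is the best that a change of this single angle can achieve.
            to_end = math.atan2(end[1] - pivot[1], end[0] - pivot[0])
            to_target = math.atan2(target[1] - pivot[1], target[0] - pivot[0])
            angles[i] += to_target - to_end
        if math.dist(forward(angles, lengths)[-1], target) < tol:
            break
    return angles
```

Each single-joint update cannot increase the distance between the chain end and the target, which is what makes the method a coordinate descent; starting from different random conformations yields different closed candidates.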
50Stage 2 refinement
- Target function T (goodness of fit to EDM)
- Minimize T while retaining closure
- Closed conformations lie on Self-motion manifold
of lower dimension
1-D manifold
51Stage 2: null-space minimization
- Jacobian: linear relation between joint velocities $\dot{q}$ and end-effector linear and angular velocity $\dot{x}$, i.e. $\dot{x} = J \dot{q}$
Compute minimizing move using N, an orthonormal basis of the null space of J
52Stage 2: minimization with closure
- Choose sub-fragment with n > 6 DOFs
- Compute N using SVD
- Project the gradient of T onto the null space
- Move until minimum is reached or closure is broken
Escape from local minima using Monte Carlo with
simulated annealing
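A minimal NumPy sketch of one null-space step, assuming the move is the negative gradient of the target function T projected onto the null space of the Jacobian J, that is, dq = -N Nᵀ ∇T. The exact formula on the slide is not preserved, so this form, the function name `null_space_step`, and its parameters are assumptions.

```python
import numpy as np

def null_space_step(J, grad, step=1.0, tol=1e-10):
    """Move against grad using only joint motions dq with J @ dq = 0."""
    _, s, Vt = np.linalg.svd(J)          # full SVD: Vt is n x n for a 6 x n Jacobian
    rank = int(np.sum(s > tol * s.max())) if s.size else 0
    N = Vt[rank:].T                      # columns: orthonormal basis of null(J)
    return -step * N @ (N.T @ grad)
```

Because J is only a first-order description of the end-effector motion, a finite step along dq drifts away from exact closure, which is consistent with the slide's rule to move until the minimum is reached or closure is broken.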
53Test: artificial gaps
- Completed structure (gold standard)
- Good density (1.6 Å res.)
- Remove fragment and rebuild
Length  High (2.0 Å)    Medium (2.5 Å)  Low (2.8 Å)
4       100% (0.14 Å)   100% (0.19 Å)   100% (0.32 Å)
8       100% (0.18 Å)   100% (0.23 Å)   100% (0.36 Å)
12      91% (0.51 Å)    96% (0.41 Å)    91% (0.52 Å)
15      91% (0.53 Å)    88% (0.63 Å)    83% (0.76 Å)
Produced by H. van den Bedem
54Test: true gaps
- Completed structure (gold standard)
- O.K. density (2.4 Å res.)
- 6 gaps left by model builder (RESOLVE)
Length  Top scorer  Lowest error
4       0.44 Å      0.40 Å
4       0.22 Å      0.22 Å
5       0.78 Å      0.78 Å
5       0.36 Å      0.36 Å
7       0.72 Å      0.66 Å
10      0.43 Å      0.43 Å
Produced by H. van den Bedem
55Example: TM0423
- PDB 1KQ3, 376 res.
- 2.0 Å resolution
- 12 residue gap
- Best: 0.3 Å aaRMSD
56Example: TM0813
- PDB 1J5X, 342 res.
- 2.8 Å resolution
- 12 residue gap
- Best: 0.6 Å aaRMSD
GLU-83
GLY-96
59Outline
- Fast energy computation during Monte Carlo simulation
- Model completion for protein X-ray crystallography
- Large scale computation of similarity
Lotan and Schwarzer, J. Comput. Biol. 11(2-3): 299-317, 2004
60Large scale similarity
- Analysis of simulation trajectories
  - Molecular dynamics simulation
  - Monte Carlo simulation
- Clustering of decoy sets (e.g., Shortle et al. 98)
- Stochastic Roadmap Simulation (Apaydin et al. 03)
Fast similarity measures are needed for analyzing large sets of conformations
61Contributions
- Uniform simplification of protein structure for similarity computation
- Speed-up of existing similarity measures
- Method offers trade-off between speed and precision
- Efficient computation of nearest neighbors
62m-Averaged approximation
- Cut chain into pieces of length m
- Replace each sequence of m Cα atoms by its centroid
3n coordinates → 3n/m coordinates
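The two steps above amount to a few lines of code. In this sketch the handling of a final piece shorter than m (it is averaged as is) is an assumption, and `m_average` is a hypothetical name.

```python
def m_average(coords, m):
    """Replace each run of m consecutive points by its centroid."""
    averaged = []
    for start in range(0, len(coords), m):
        piece = coords[start:start + m]   # the last piece may hold fewer than m points
        averaged.append(tuple(sum(axis) / len(piece) for axis in zip(*piece)))
    return averaged
```

For a chain of n Cα atoms the result has about n/m points, so a measure whose cost is linear in the number of points becomes roughly m times cheaper and one that is quadratic roughly m² times cheaper.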
63Chains and distances
- Proximity along the chain entails spatial
proximity
- Far away links along the chain are spatially
distant (on average)
ci
cj
64Similarity measures
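The content of this slide is not preserved. Since the evaluation that follows compares cRMS and dRMS, here is a minimal sketch of dRMS under its commonly used definition, the root mean square difference between corresponding intra-molecular distances of two conformations; that the talk uses exactly this normalization is an assumption. cRMS, the coordinate RMS deviation after optimal rigid superposition, is not shown.

```python
import math
from itertools import combinations

def drms(p, q):
    """dRMS between two conformations given as equal-length lists of 3-D points."""
    n = len(p)
    assert n == len(q) and n >= 2
    total = sum((math.dist(p[i], p[j]) - math.dist(q[i], q[j])) ** 2
                for i, j in combinations(range(n), 2))
    return math.sqrt(2.0 * total / (n * (n - 1)))
```

dRMS needs no superposition because it compares internal distances only, but it touches n(n-1)/2 distances per conformation pair, which is why shortening the chain pays off quadratically.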
65Evaluation: test sets
8 structurally diverse proteins (54-76 residues)
- Decoy sets: conformations from the Park-Levitt set (Park et al. 97), N = 10,000
- Random sets: conformations generated by the program FOLDTRAJ (Feldman & Hogue 00), N = 5,000
66Evaluation results: decoy sets
m    cRMS        dRMS
3    0.99        0.96-0.98
4    0.98-0.99   0.94-0.97
6    0.92-0.99   0.78-0.93
9    0.81-0.98   0.65-0.96
12   0.54-0.92   0.52-0.69
- 9x speed-up for cRMS (m = 9)
- 36x speed-up for dRMS (m = 6)
Higher correlation for random sets!
67k Nearest-neighbors problem
Given a set S of conformations of a protein and a
query conformation c, find the k conformations in
S most similar to c
- Brute force complexity
-
- for all
N: size of S
L: time to compute similarity
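A minimal brute-force sketch of the problem statement: evaluate the measure against every member of S and keep the k best. The function name and the `dissimilarity` callback are hypothetical; any measure where smaller means more similar can be plugged in.

```python
import heapq

def k_nearest(conformations, query, k, dissimilarity):
    """Indices of the k members of `conformations` most similar to `query`."""
    scored = ((dissimilarity(c, query), idx) for idx, c in enumerate(conformations))
    return [idx for _, idx in heapq.nsmallest(k, scored)]
```

Each query makes one similarity evaluation per member of S, so answering the query for every conformation in S makes a number of evaluations quadratic in the size of S.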
68Efficient nearest neighbor search
- kd-tree time per query
- Limitations
- Requires Minkowski metric
- Less efficient when d > 20
cRMS is not a Minkowski metric
dRMS has dimensionality of
Reduce dRMS dimensionality using SVD
69Reduction using SVD
- Stack m-averaged distance matrices as vectors
- Compute the SVD of entire set
- Project onto principal components
dRMS is reduced to ≤ 20 dimensions
Complexity of SVD
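The three steps above can be sketched with NumPy, assuming each row of the input is one flattened m-averaged distance matrix. Whether the data are centered before the SVD is not stated on the slide; this sketch does not center them. `fit_projection` and `project` are hypothetical names.

```python
import numpy as np

def fit_projection(vectors, n_components):
    """Principal directions (as rows) of a set of row vectors, via SVD."""
    _, _, Vt = np.linalg.svd(vectors, full_matrices=False)
    return Vt[:n_components]

def project(vectors, components):
    """Coordinates of each row vector in the reduced space."""
    return vectors @ components.T
```

Euclidean distances between the projected rows approximate the distances between the original rows, which is what lets a kd-tree built on the reduced vectors act as a fast filter for dRMS nearest neighbors.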
70Testing the method
- Use decoy sets (N = 10,000) and random sets (N = 5,000)
- m-averaging with m = 4
- Project onto 16 PCs for decoys, 12 PCs for random sets
- Find k = 10, 25, 100 NNs for 250 conformations in each set
71Results
- Decoy sets
  - 77% correct
  - Furthest NN off by 10-15% (0.7 Å - 1.5 Å)
  - 4k approximate NNs contain all true k NNs
- Random sets: slightly better results
Use reduction as fast filter
72Running Time
- N = 100,000, m = 4, PC = 16
- Find k = 100 for each conformation
Brute-force: 84 hours
Brute-force + m-averaging: 4.8 hours
Brute-force + m-averaging + SVD: 41 minutes
kd-tree + m-averaging + SVD: 19 minutes
kd-tree has more impact for larger sets
73Contributions
- Energy computation in MCS
  - Efficient maintenance and self-collision detection for kinematic chains
  - Efficient computation of pair-wise interactions in MCS of proteins
  - Caching scheme for partial energy sums during MCS
  - MCS software
- Model completion in X-ray crystallography
  - Sampling of gap-closing fragments biased towards the EDM
  - Refinement of fit to density without breaking closure
  - Fully automatic fragment completion software
- Similarity computation for large conformation sets
  - Uniform simplification of protein structure for similarity computation
  - Speed-up of existing similarity measures
  - Method offers trade-off between speed and precision
  - Efficient computation of nearest neighbors
74Take-home message
- Taking into account physical properties of
proteins can lead to efficient algorithms for a
wide variety of applications in structural biology
75Outlook
computer scientist
biophysicist/biochemist
Models that simplify the physics and chemistry of
proteins
Algorithms that exploit properties of protein
models
Develop simplified protein models that lend
themselves to efficient computations
76Acknowledgements
- Jean-Claude Latombe
- Vijay Pande
- Michael Levitt
- Leo Guibas
- Axel Brunger, Balaji Prabhakar, Serafim Batzoglou
- Fabian Schwarzer, Henry van den Bedem, Dan Halperin
- Carlo Tomasi
- Daniel Russakoff, Rachel Kolodny
- Latombe group
  - Serkan Apaydin, Tim Bretl, Joel Brown, Phil Fong, Mitul Saha, Pekka Isto, Kris Hauser
- Pande group
  - Bojan Zagrovic, Stefan Larson, Lillian Chong, Young Min Rhee, Sidney Elmer, Chris Snow, Guha Jayachandran, Eric Sorin, Sung-Joo Lee, Jim Cladwell, Michael Shirts, Nina Singhal, Relly Brandman, Vishal Vaidyanathan, Nick Kelley, Mark Engelhardt
- Levitt Group
  - Patrice Koehl, Tanya Raschke, Erik Lindahl
77Thank you!