Protein Structure Similarity - PowerPoint PPT Presentation

About This Presentation
Title:

Protein Structure Similarity

Slides: 72
Provided by: latombe
Learn more at: http://web.stanford.edu

1
Protein Structure Similarity
2
Computation of Best Matches
  • Two simultaneous subproblems
  • Find maximal correspondence set C
  • Find alignment transform T
  • Chicken-and-egg issue
  • Each subproblem is relatively simple
  • If we knew C, we could compute T
  • If we knew T, we could get C by proximity
  • But the combination is hard !!!

4
Find Alignment Transform
  • Two sets of points $A = \{a_1, \ldots, a_n\}$ and
    $B = \{b_1, \ldots, b_n\}$
  • Correspondence pairs $(a_i, b_i)$
  • Find $T = \arg\min_T \mathrm{RMSD}(A, T(B))$
  • O(n) closed-form solution [Arun, Huang, and
    Blostein, 87] [Horn, 87] [Horn, Hilden, and
    Negahdaripour, 88]

5
O(n) SVD-Based Algorithm
  • T combines translation t and rotation R, such
    that $T(b_i) = t + R(b_i)$
  • $\bar{b} = \left(\sum_{i=1}^{n} b_i\right)/n$ is the mean of the $b_i$'s
  • Place the origin of the coordinate system at $\bar{b}$
  • $\min_T \mathrm{RMSD}(A, T(B))$ simplifies to (up to some
    constants) [equation not in transcript]
  • t and R can be computed separately
  • $t = \bar{a}$, the mean of the $a_i$'s

[Arun, Huang, and Blostein, 87]
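
The simplified objective itself did not survive in the transcript. As a hedged reconstruction (not necessarily the form shown on the original slide), assume RMSD is the root of the mean squared distance between paired points and that the origin sits at $\bar{b}$, so that $\sum_i b_i = 0$. Expanding the squared error then shows why t and R decouple:

```latex
\begin{aligned}
n\,\mathrm{RMSD}(A, T(B))^2
  &= \sum_{i=1}^{n} \lVert a_i - t - R\,b_i \rVert^2 \\
  &= \sum_{i=1}^{n} \lVert a_i - t \rVert^2
     - 2 \sum_{i=1}^{n} (a_i - t)^{T} R\, b_i
     + \sum_{i=1}^{n} \lVert b_i \rVert^2 \\
  &= \underbrace{\sum_{i=1}^{n} \lVert a_i - t \rVert^2}_{\text{depends only on } t}
     - 2 \underbrace{\sum_{i=1}^{n} (a_i - \bar{a})^{T} R\, b_i}_{\text{depends only on } R}
     + \text{const}
\end{aligned}
```

The last step uses $\sum_i b_i = 0$, which removes every term that couples a constant vector ($t$ or $\bar{a}$) with $R\,b_i$, and $\lVert R\,b_i \rVert = \lVert b_i \rVert$. The first sum is minimized by $t = \bar{a}$, and the second sum equals $\mathrm{tr}(R\,BA^T)$ for the centered matrices, which is the quantity the SVD step maximizes.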
6
O(n) SVD-Based Algorithm
  • $A_{3 \times n} = [a_1 - \bar{a}, \ldots, a_n - \bar{a}]$,
    $B_{3 \times n} = [b_1 - \bar{b}, \ldots, b_n - \bar{b}]$
  • Compute the SVD of the 3×3 correlation
    matrix $BA^T$: $BA^T = UDV^T$,
    where D is a diagonal matrix with decreasing
    non-negative entries (singular values) along the
    diagonal
  • If $\det(U)\det(V) = 1$ then $S = I$,
    else $S = \mathrm{diag}(1, 1, -1)$
  • $R = VSU^T$

[Arun, Huang, and Blostein, 87]
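
A minimal NumPy sketch of the SVD-based computation follows. It assumes paired points stored as 3 × n arrays and takes the SVD of the centered correlation matrix $BA^T = UDV^T$; with that convention, the rotation that carries B onto A is $VSU^T$. The function name `align_svd` is illustrative, and the translation is written as $\bar{a} - R\bar{b}$ so the result is valid in the original coordinate frame (it reduces to $\bar{a}$ when the origin is placed at $\bar{b}$).

```python
import numpy as np

def align_svd(A, B):
    """Return (R, t) minimizing sum_i ||a_i - (R b_i + t)||^2 for paired 3 x n arrays."""
    a_mean = A.mean(axis=1, keepdims=True)
    b_mean = B.mean(axis=1, keepdims=True)
    Ac, Bc = A - a_mean, B - b_mean       # centered point sets

    H = Bc @ Ac.T                         # 3 x 3 correlation matrix B A^T
    U, D, Vt = np.linalg.svd(H)           # H = U diag(D) Vt, D in decreasing order
    V = Vt.T

    # Guard against a reflection: flip the axis of the smallest singular value.
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(V) < 0:
        S[2, 2] = -1.0

    R = V @ S @ U.T                       # rotation taking centered B onto centered A
    t = a_mean - R @ b_mean               # translation in the original frame
    return R, t
```

The cost is dominated by forming the 3 × 3 matrix, which is linear in n; the SVD itself works on a fixed-size matrix.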
8
  • [Arun, Huang, and Blostein, 87] → rotation matrix
  • [Horn, 87] → quaternion

10
→ Trial-and-Error Approach to Protein Structure
Comparison
  1. Set CS to a seed correspondence set (small set
     sufficient to generate an alignment transform)
  2. Compute the alignment transform T for CS and
     apply T to the second protein B
  3. Update CS to include all pairs of features that
     are close to each other
  4. If CS has changed, then return to Step 2, else
     return (CS,T)

11
→ Trial-and-Error Approach to Protein Structure
Comparison
  • result ← nil
  • Iterate N times:
  1. Set CS to a seed correspondence set (small set
     sufficient to generate an alignment transform)
  2. Compute the alignment transform T for CS and
     apply T to the second protein B
  3. Update CS to include all pairs of features that
     are close to each other
  4. If CS has changed, then return to Step 2, else
     result ← result ∪ {(CS,T)}
  • Return result
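
The loop above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the implementation behind the slides: features are points stored as n × 3 arrays, "close" means the nearest point of the transformed B lies within a hypothetical threshold `eps`, the seeds are supplied by the caller, and a `max_iter` cap is added so the inner loop always terminates. All function names are illustrative.

```python
import numpy as np

def fit_transform(P, Q):
    """Least-squares rigid transform (R, t) with R @ q_i + t ~ p_i, for paired n x 3 arrays."""
    p0, q0 = P.mean(axis=0), Q.mean(axis=0)
    U, _, Vt = np.linalg.svd((Q - q0).T @ (P - p0))
    # The last diagonal entry is -1 when a reflection must be avoided.
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(U) * np.linalg.det(Vt))])
    R = Vt.T @ S @ U.T
    return R, p0 - R @ q0

def close_pairs(A, B_moved, eps):
    """Pairs (i, j) where j is the nearest point of B_moved to a_i and lies within eps."""
    d = np.linalg.norm(A[:, None, :] - B_moved[None, :, :], axis=2)
    i = np.arange(len(A))
    j = d.argmin(axis=1)
    keep = d[i, j] < eps
    return list(zip(i[keep].tolist(), j[keep].tolist()))

def refine(A, B, seed, eps=3.0, max_iter=50):
    """Steps 1 to 4 for one seed: alternate between fitting T and updating CS."""
    cs = list(seed)
    for _ in range(max_iter):
        ia, ib = map(list, zip(*cs))
        R, t = fit_transform(A[ia], B[ib])          # Step 2
        new_cs = close_pairs(A, B @ R.T + t, eps)   # Step 3
        if new_cs == cs or len(new_cs) < 3:         # Step 4: stop when CS is stable
            break
        cs = new_cs
    return cs, (R, t)

def trial_and_error(A, B, seeds, eps=3.0):
    """Outer loop: run the refinement from every seed and collect the (CS, T) results."""
    return [refine(A, B, seed, eps) for seed in seeds]
```

Each seed needs at least three non-collinear pairs for the fitted transform to be well defined, which matches the "small set sufficient to generate an alignment transform" requirement.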

12
  • How to get seed correspondences?

13
Seed Generation from Fragment
  • From distance matrices
  • E.g., DALI [Holm and Sander, 1996]

14
Using Distance Matrices (DALI)
  • Distances are invariant to rigid-body
    transformations
  • DALI [Holm and Sander, 1996] looks for similar
    hexapeptides by searching for similar 7×7 Cα-Cα
    distance matrices
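
The invariance property is easy to check numerically. The sketch below only illustrates that idea; it is not DALI's scoring function, and the names `distance_matrix` and `fragment_dissimilarity` are hypothetical. Fragments are assumed to be n × 3 arrays of Cα coordinates.

```python
import numpy as np

def distance_matrix(X):
    """Pairwise Euclidean distances between the rows of an n x 3 array."""
    diff = X[:, None, :] - X[None, :, :]
    return np.linalg.norm(diff, axis=2)

def fragment_dissimilarity(X, Y):
    """Mean absolute difference between the distance matrices of two equal-length fragments."""
    return float(np.abs(distance_matrix(X) - distance_matrix(Y)).mean())
```

Because the comparison never looks at absolute coordinates, two copies of the same fragment score (numerically) zero regardless of how one of them has been rotated or translated, so similar fragments can be found before any alignment transform is known.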

15
Seed Generation from Fragment
  • From distance matrices
  • E.g., DALI [Holm and Sander, 1996]
  • From secondary structure elements (SSEs)
  • E.g., LOCK [Singh and Brutlag, 1997]
  • From voting scheme (using geometric hashing)
  • E.g., 3dSEARCH [Singh and Brutlag, 2000]

16
LOCK
  • A.P. Singh and D.L. Brutlag. Hierarchical
    Protein Structure Superposition Using Both
    Secondary and Atomic Representations. Proc. ISMB,
    pp. 284-293, 1997.
  • LOCK2: J. Shapiro and D.L. Brutlag. FoldMiner:
    Structural Motif Discovery Using an Improved
    Superposition Algorithm. Protein Science,
    13:278-294, 2004.
  • http://motif.stanford.edu/lock2/

17
LOCK
  • Two levels of features: SSEs and Cα atoms
  • Stage 1 (SSE alignment): Initial alignment is
    computed using SSEs represented as vectors
  • Stage 2 (atom alignment): Alignment is refined
    using Cα atoms represented as points

18
Rationale for LOCK
  • Using types of features is an effective way to
    reduce combinatorial explosion and computation
  • SSEs, which are responsible for most of the
    stability and functionality of the proteins, are
    more meaningful and better conserved than types
    of atoms and amino-acids
  • If 2 structures are similar, some of their SSEs
    should form similar substructures
  • Drawback: It narrows down the set of possible
    applications, e.g., it can't find small motifs at
    the atomic level

19
Vector-Based Representation
[Figure: β-strands, loops, and α-helices]
One vector per SSE (helix, strand, loop)
20
Vector-Based Representation
  • DSSP [Kabsch and Sander, 1983] classifies
    residues into helices/strands
  • For an α-helix starting at residue i:
    $X_{origin} = (0.74 X_i + X_{i+1} + X_{i+2} + 0.74 X_{i+3})/3.48$,
    where $X_i$ is the position of the Cα atom of residue i
  • (the angle between two consecutive residues is 100°
    → factor 0.74)
  • Similar computation for $X_{end}$ and for β-strands
21
Scoring Similarity
  • Position-independent differences
  • angle(i,k) - angle(p,r)
  • angle(i,j) - angle(p,q)
  • angle(j,k) - angle(q,r)
  • distance(i,k) - distance(p,r)
  • length(k) - length(r)
  • Position-dependent differences
  • angle(k,r)
  • distance(k,r)
  • Scores are additive: Score $S = \sum_i S(d_i)$

[Figure: score $S(d_i)$ as a function of the difference $d_i$, marking the maximal score and the value of $d_i$ for which the score is 0]
22
Stage 1: SSE Alignment
  • For every pair of SSE vectors of protein A, find
    all pairs of vectors in B that align well using
    orientation-independent scores → seed
    correspondence sets
  • For each correspondence set
  • Find alignment transform and apply it to B
  • Find correspondence set with maximal score
  • (record the transform T and correspondence set CS
    that yield the maximal score)

23
Stage 1: SSE Alignment
  • A = (i, j, k, l, m)
  • B = (p, q, r, s, t)
  • Seed correspondence: (i,p),(j,q)
  • Simultaneous gaps in both structures are not
    allowed (not in SCOP2)
  • Terminate a path when the score of a new
    correspondence is negative
  • Re-compute a new transform with each new
    correspondence (?)

24
Stage 2: Atom (Core) Alignment
  1. Construct correspondence pairs of atoms.
     Atom i of A corresponds to atom j of T(B) iff:
     - i is the closest atom in A to j and j is the
       closest atom in T(B) to i
     - the distance between i and T(j) is < ε (3 Å)
  2. Prune the correspondence set to the largest subset of
     correspondence pairs that follow the backbone
     alignment constraint
  3. Re-compute T to be the transform that minimizes
     the RMSD of the atoms in the correspondence set
  4. Iterate 1-2-3 until the RMSD converges
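
Steps 1 and 2 can be sketched as follows. Two assumptions are made here and are not confirmed by the slides: atoms are n × 3 arrays indexed in backbone order, and the "backbone alignment constraint" is read as requiring the matched indices to increase together in both chains, which turns the pruning into a longest-increasing-subsequence problem. Function names are illustrative.

```python
import numpy as np

def mutual_nearest_pairs(A, TB, eps=3.0):
    """Pairs (i, j) that are each other's nearest atom and closer than eps (in angstroms)."""
    d = np.linalg.norm(A[:, None, :] - TB[None, :, :], axis=2)
    nearest_in_TB = d.argmin(axis=1)   # for each atom i of A
    nearest_in_A = d.argmin(axis=0)    # for each atom j of T(B)
    pairs = []
    for i, j in enumerate(nearest_in_TB):
        if nearest_in_A[j] == i and d[i, j] < eps:
            pairs.append((i, int(j)))
    return pairs

def prune_to_backbone_order(pairs):
    """Largest subset of pairs whose indices increase together in both chains (O(k^2) DP)."""
    pairs = sorted(pairs)
    if not pairs:
        return []
    best = [1] * len(pairs)    # best[k]: longest ordered chain ending at pairs[k]
    prev = [-1] * len(pairs)   # back-pointers for reconstructing that chain
    for k in range(len(pairs)):
        for m in range(k):
            if pairs[m][1] < pairs[k][1] and best[m] + 1 > best[k]:
                best[k], prev[k] = best[m] + 1, m
    k = max(range(len(pairs)), key=best.__getitem__)
    chain = []
    while k != -1:
        chain.append(pairs[k])
        k = prev[k]
    return chain[::-1]
```

The mutual-nearest test guarantees that each atom appears in at most one pair, so sorting by the index in A leaves only the order of the indices in T(B) to check.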

25
Experimental Results
  • 685 protein structures from PDB such that each
    pair has less than 25% sequence identity
  • 3 families of folds (based on SCOP
    classification):
    - myoglobins (11 structures), 20% amino acid identity
    - TIM barrels (50 structures)
    - immunoglobulins (38 structures)
  • Goal: Given one query protein in each family,
    find the other members of the family (3 × 685 =
    2055 alignments)
  • Method: For each query, sort the 685 structures
    by score (computed by LOCK). Select the top k
    proteins. Count members of the family (true
    positives) and non-members (false positives)
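
The counting step of this protocol is simple enough to sketch. The function name, the dictionary layout, and the assumption that a higher score means a better match are all illustrative; the structure IDs in any call are placeholders, not data from the experiment.

```python
def count_hits(scores, family, k):
    """Return (true positives, false positives) among the k best-scoring structures.

    scores: dict mapping structure ID -> alignment score (higher is assumed better)
    family: set of structure IDs that belong to the query's family
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    top = ranked[:k]
    true_pos = sum(1 for sid in top if sid in family)
    return true_pos, len(top) - true_pos
```

Sweeping k from 1 to the size of the database and recording both counts at each step yields the points of an ROC-style curve for one query.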

26
Family                 True positives   False positives
Myoglobins (11)        11               0
TIM-barrels (50)       40               0
                       45               1
                       50               5
Immunoglobulins (38)   20               0
                       25               1
                       30               2
                       35               11
                       38               383
27
Alignment of 11 Myoglobins
28
Alignment of 50 TIM barrels
α-helices in red, β-strands in yellow
29
Alignments of 31 Immunoglobulins
Only b-strands are shown
30
ROC Curves
31
Running Time
  • 1ms per seed correspondence
  • 1h to search 10,000 protein structures
  • 100s of days to compare all pairs of proteins
    in PDB
  • → Geometric hashing to speed up Stage 1

32
Seed Generation from Fragment
  • From distance matrices
  • E.g., DALI [Holm and Sander, 1996]
  • From secondary structure elements (SSEs)
  • E.g., LOCK [Singh and Brutlag, 1997]
  • From voting scheme (using geometric hashing)
  • E.g., 3dSEARCH [Singh and Brutlag, 2000]

33
Voting Scheme with Hash Table
  • Many-to-many comparison requires a better
    organization of computation to avoid repeating
    the same computation again and again
  • Pre-computation: Index proteins in hash table
  • Query phase: Voting scheme using hash table
  • Several variants on this theme:
    3d-Lookup [Holm and Sander, 1995]
    3dSEARCH [Singh, 2002]

35
Indexing Target Structures in Hash Table
(3dSEARCH [Singh, 2002])
  • Hash table: 3-D regular grid of cubic bins (2 Å)
  • For each target structure
  • For each pair of vectors (i,j)
  • Compute a coordinate system
  • Place an entry for each other vector k into the
    bin containing the coordinates of the midpoint of
    the vector (or the average of the coordinates of its origin,
    middle, and end points). Store the ID of the coordinate
    system and k's orientation and type (α or β) in the
    entry.

36
[Figure: coordinate systems with axes u and v]
Grid is the same for all coordinate systems
38
Indexing Target Structures in Hash Table
(3dSEARCH [Singh, 2002])
  • Hash table: 3-D regular grid of cubic bins (2 Å)
  • For each target structure
  • For each pair of vectors (i,j)
  • Compute a coordinate system
  • Place an entry for each other vector k into the
    bin containing the coordinates of the midpoint of
    the vector (or the average of the coordinates of its origin,
    middle, and end points). Store the ID of the coordinate
    system and k's orientation and type (α or β) in
    the entry.
  • Grid is sparsely occupied → hash table
  • A structure with n SSEs contributes n(n-1)(n-2)
    entries. Each vector is represented (n-1)(n-2)
    times
  • 10,000 structures with 10 SSEs each yield 7M
    entries

39
Voting Using Hash Table
  • Given a query structure
  • For each pair of vectors (i,j)
  • Compute a coordinate system
  • For each other vector k
  • Retrieve the bin accessed by this vector and the
    neighboring bins
  • For every entry (vector) in those bins that has
    the same orientation and type as k, add a vote
    for the coordinate system stored in the entry
  • Sort target structures based on max number of
    votes received by any of its coordinate systems
  • → Small number of target structures. Use LOCK for
    better alignment
  • Hours of pure LOCK are reduced to seconds
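
The indexing and voting phases can be sketched together. This is a simplified illustration, not the 3dSEARCH implementation: each SSE is assumed to be a tuple `(start, end, kind)` with NumPy 3-vectors and a type string, the coordinate system built from a pair (i, j) is one hypothetical choice, and entries are matched on type only (the orientation check described above is omitted). All names are illustrative.

```python
import itertools
from collections import defaultdict
import numpy as np

BIN = 2.0  # bin edge length in angstroms, as on the slide

def local_frame(sse_i, sse_j):
    """Assumed frame: origin at the midpoint of i, first axis along i,
    second axis toward the midpoint of j (which must not lie on i's axis)."""
    (p0, p1, _), (q0, q1, _) = sse_i, sse_j
    origin = (p0 + p1) / 2
    u = (p1 - p0) / np.linalg.norm(p1 - p0)
    v = (q0 + q1) / 2 - origin
    v = v - (v @ u) * u
    v = v / np.linalg.norm(v)
    return origin, np.stack([u, v, np.cross(u, v)])

def entries(sses):
    """Yield (frame id, type of k, bin of k) for every ordered pair (i, j) and other vector k."""
    for i, j in itertools.permutations(range(len(sses)), 2):
        origin, axes = local_frame(sses[i], sses[j])
        for k, (start, end, kind) in enumerate(sses):
            if k not in (i, j):
                coords = axes @ ((start + end) / 2 - origin)   # midpoint of k in the frame
                yield (i, j), kind, tuple(np.floor(coords / BIN).astype(int).tolist())

def build_index(targets):
    """Pre-computation: hash every entry of every target structure."""
    table = defaultdict(list)
    for target_id, sses in targets.items():
        for frame_id, kind, cell in entries(sses):
            table[cell].append((target_id, frame_id, kind))
    return table

def vote(table, query):
    """Query phase: (target id, votes) pairs, sorted by each target's best-supported frame."""
    best = defaultdict(int)
    for _, group in itertools.groupby(entries(query), key=lambda e: e[0]):
        votes = defaultdict(int)   # votes cast from one query coordinate system
        for _, kind, cell in group:
            for off in itertools.product((-1, 0, 1), repeat=3):   # the bin and its neighbors
                near = tuple(c + o for c, o in zip(cell, off))
                for target_id, frame_id, target_kind in table.get(near, ()):
                    if target_kind == kind:
                        votes[(target_id, frame_id)] += 1
        for (target_id, _), count in votes.items():
            best[target_id] = max(best[target_id], count)
    return sorted(best.items(), key=lambda item: -item[1])
```

Because every entry is expressed in a frame attached to the structure itself, a rigidly moved copy of a target produces the same bins, so the index is built once and reused for every query.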

40
Advantages of Voting System
  • Very efficient in practice for many-to-many
    comparisons
  • Can establish correspondence between partial,
    disconnected substructures
  • Parallel implementation is straightforward
  • Independent of the order in which vectors are
    considered
  • Drawback (?): May establish correspondences that
    do not satisfy the backbone sequence constraint

41
Problem 4: Find Pharmacophore in Ligands
  • Given
  • Collection of N (5 to 10) small flexible
    ligands with similar activity (binding at the same
    sites)

Benzamidine binding to beta-Trypsin (3ptb)
Inhibitor binding to HIV protease
44
Problem 4: Find Pharmacophore in Ligands
  • Given
  • Collection of N (5 to 10) small flexible
    ligands with similar activity (binding at the same
    sites)
  • A set of low-energy conformations (dozens to few
    hundreds) for each ligand
  • Find a substructure (pharmacophore) that has a
    match in at least one conformation of each ligand

49
Pharmacophore and Rational Drug Design
  • Pharmacophore identification is a form of
    reverse engineering to get a model of a binding
    site
  • A pharmacophore can be used to modify ligands
    into more potent drugs and/or to screen large
    databases of ligands for leads

50
Three Simultaneous Problems
  • Conformations?
  • Correspondence?
  • Transform?
  • But ligands are small molecules

51
Software
  • DISCO [Martin et al., 1993]
  • DISCOtech and GASP [Tripos, Inc.]
  • CATALYST and HIPHOP [Accelrys et al.] [Green et
    al., 1994] [Barnum et al., 1996]
  • RAPID: P.W. Finn, L.E. Kavraki, J.C. Latombe, R.
    Motwani, C. Shelton, S. Venkatasubramanian, and
    A. Yao. RAPID: Randomized Pharmacophore
    Identification for Drug Design. Computational
    Geometry: Theory and Applications, 10, pp.
    263-272, 1998

53
Pairwise Comparison
  • Multi-Probe(M1, ..., MN)
  1. Extract invariants from M1 and M2 by calling
     Pair-Probe(P1,P2) on every pair of conformations
     of the two ligands
  2. Test each candidate invariant S obtained at Step
     1 against every ligand Mi, i = 3, ..., N, by calling
     Pair-Probe(S,P) on S and each conformation P of Mi

54
Pair-Probe
  • n: smallest number of atoms/features in a
    ligand
  • α: a given constant (0 < α ≤ 1)
  • P1 and P2: conformations of two distinct ligands (or
    candidate invariant)
  • Pair-Probe(P1,P2)
  • Perform s times:
  1. Pick a triplet of atoms at random from P1
  2. Determine three atoms in P2 congruent to this
     triplet; compute the alignment transform T
  3. Iterate: apply T to P2; determine the atoms in P1
     matching those in P2; update T
  4. If the number of matching atoms exceeds αn, then
     return this atom set as a candidate invariant S

55
Magnitude of s
  • Pr[picking 3 atoms in invariant] ≈ $\alpha^3$
  • Pr[failing to find invariant] ≈ $(1 - \alpha^3)^s$
  • We want $(1 - \alpha^3)^s \le \gamma$ (γ is the acceptable
    probability of failure)
  • $s \ge \ln(\gamma)/\ln(1 - \alpha^3)$
  • Since $x < -\ln(1 - x)$ for $0 < x < 1$, it suffices to take
    $s \ge \ln(1/\gamma)/\alpha^3$
  • For $\gamma = 10^{-2}$ and $\alpha = 0.3$, we get s ≈ 180
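
The two expressions for the number of trials are easy to evaluate. The short sketch below (function names are illustrative) computes the exact threshold and the simpler sufficient value for the example parameters; both come out slightly below the rounded figure quoted above.

```python
import math

def trials_exact(alpha, gamma):
    """Smallest real s with (1 - alpha**3)**s <= gamma."""
    return math.log(gamma) / math.log(1.0 - alpha ** 3)

def trials_bound(alpha, gamma):
    """Simpler sufficient value ln(1/gamma) / alpha**3, never smaller than the exact one."""
    return math.log(1.0 / gamma) / alpha ** 3

s_exact = trials_exact(0.3, 1e-2)
s_bound = trials_bound(0.3, 1e-2)
print(math.ceil(s_exact), math.ceil(s_bound))
```

The cubic dependence on α dominates the cost: halving α multiplies the required number of trials by roughly eight, whereas tightening γ only adds a logarithmic factor.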

56
Some Results
  • 63 to 69 atoms with 10 to 15 torsional degrees
    of freedom
  • Feature: every non-H atom → 30 features of 6
    types (atom types)
  • Invariant in active conformations: 7-atom
    pharmacophore + 7-atom scaffolding

conf   t(s)   4    5    6    7   8   9   10   11   12   13   14
11     800    44   20   10   5   2   1   0    0    1    0    0
57
Fuel for Thoughts
58
Idea: Many-to-many correspondence may be more
robust
Example: Hausdorff distance
59
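Hausdorff Distance
  • Two sets of points $A = \{a_1, \ldots, a_n\}$ and
    $B = \{b_1, \ldots, b_m\}$ in $\mathbb{R}^k$
  • $d_H(A,B) = \max_{a \in A} \min_{b \in B} \lVert a - b \rVert$
  • $D_H(A,B) = \max\{d_H(A,B), d_H(B,A)\}$
  • Variation for shape similarity: $\Delta_H(A,B) = \min_T
    D_H(A, T(B))$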
  • But efficient algorithms only exist for planar
    sets of points
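
For fixed point sets (no minimization over T), the two distances above reduce to a few array operations. This brute-force sketch assumes point sets stored as arrays with one point per row and costs O(nm) time and memory; the function names are illustrative.

```python
import numpy as np

def directed_hausdorff(A, B):
    """max over a in A of the distance from a to its nearest point in B."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return float(d.min(axis=1).max())

def hausdorff(A, B):
    """Symmetric Hausdorff distance: the larger of the two directed distances."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))
```

The directed distance is not symmetric: if A is a subset of B, the distance from A to B is zero while the distance from B to A can be arbitrarily large, which is why the symmetric version takes the maximum of both.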

60
Other Idea: Minimize cost of transforming A into
B
  • Old idea
  • Graphics: Morphing distance
  • Computer vision: Earth Mover's distance [Rubner,
    Tomasi, and Guibas, 1998]
  • Protein similarity
  • Isotopic distance [Erdmann, 2004]

61
Structure Alignment Isotopies
  • Two curves are isotopic if one can be deformed
    into the other without self-collision
  • Example: Polygonal curve with n vertices
  • One may think of structure alignment as an
    isotopy deforming one structure into the other
  • Two structures are similar if the isotopy is
    small

M.A. Erdmann. Protein Similarity from Knot
Theory: Geometric Convolution and Line Weavings,
CMU Tech. Rep. CMU-CS-04-138.
62
Small Isotopy
  • Model a structure as a set of polygonal lines
    (e.g., vertices are Ca atoms)
  • Two structures A and B are (T,d)-isotopic if
    there exists an isotopy deforming A into T(B) in
    such a way that no vertex of A moves farther
    than some d from its initial or final
    location

[Erdmann, 2004]
63
Similarity Measure
  • $d_T(A,B) = \inf\{d : A$ is $(T,d)$-isotopic to $B\}$
  • $d(A,B) = \inf_T d_T(A,B)$
  • d is computable [Erdmann, 2004]
  • But as complex as path planning, hence
    exponential in the number of degrees of freedom
  • Possibility of approximating d using
    probabilistic roadmaps?

64
Topology of Line Weavings
[Figure: α-helix axes of 1xis and 1nar]
M.A. Erdmann. Protein Similarity from Knot
Theory: Geometric Convolution and Line Weavings,
CMU Tech. Rep. CMU-CS-04-138.
66
→ 2 topologically equivalent line weavings
3 equivalence classes for 4 lines
[Erdmann, 2004]
68
Another (incorrect) alignment of 1xis and 1nar
69
→ 2 non-equivalent line weavings
70
Why is topology interesting?
  • Two conformations may be geometrically close
    (small RMSD) but may require a long continuous
    deformation to map one into the other (without
    steric clashes)

71
Conclusion
  • Automatic computation of structure similarity is
    essential due to the rapid growth of the PDB and
    other molecule (e.g., ligand) libraries
  • As the growth of new protein structures outpaces
    that of new folds, detecting structural
    similarity will have to be much more fine-grained
    than it is today
  • Biological discoveries will likely lie in local,
    possibly rare structure similarities, rather than
    in global fold-level classification
  • Need for better understanding of applications
    and radically new approaches
  • Still a lot of work ...