Title: Protein Structure Similarity
1Protein Structure Similarity
2Secondary Structure Elements: α-helices, β-strands/sheets, loops
3Structure Prediction/Determination
- Computational tools
  - Homology, threading
  - Molecular dynamics
- Experimental tools
  - X-ray crystallography
4Protein Structure Determination (1)
- X-ray diffraction crystallography
5Protein Structure Determination (2)
- Nuclear magnetic resonance spectroscopy
6Protein Data Bank
- 1990: 250 new structures
- 1999: 2,500 new structures
- 2000: >20,000 structures total
- 2004: 30,000 structures total
7Protein Data Bank
Structures have been determined for only about 10% of known protein sequences → Protein Structure Initiative (PSI)
8Structure Similarity
- Refers to how well (or poorly) the 3D folded structures of proteins can be aligned
- Expected to reflect functional similarities (interaction with other molecules)
[Figure: proteins in the TIM barrel fold family]
9Alignment of 1xis and 1nar (TIM-Barrels)
[Figure: 1xis and 1nar shown in ribbon format, in backbone format, and as α-helix axes. Alignment computed by DALI.]
Visualization: Sayle, R. RasMol: a protein visualization tool. http://www.umass.edu/microbio/rasmol/index2.htm
13Structure Similarity
- Refers to how well (or poorly) the 3D folded structures of proteins can be aligned
- Is expected to reflect functional similarities (interaction with other molecules)
- 2000: 20,000 structures in PDB, 4,000 different folds (1:5 ratio)
- Three possible reasons:
  - evolution
  - physical constraints (e.g., few ways to maximize hydrophobic interactions)
  - limits in techniques used for structure determination
- Given a new structure, the probability is high that it is similar to an existing one
14Why Compare Protein Folded Structures?
Sequence → Structure → Function
- Low sequence similarity may yield very similar structures
- Sometimes high sequence similarity yields different structures
15Alignment of 1xis and 1nar (TIM-Barrels)
1xis and 1nar have only 7% sequence identity, but approximately 70% of the residues are structurally similar
16Why Compare Protein Folded Structures?
Sequence → Structure → Function
- Low sequence similarity may yield very similar structures
- Sometimes high sequence similarity yields different structures
- Structure comparison is expected to provide more pertinent information about functional (dis-)similarity among proteins, especially with non-evolutionary relationships or non-detectable evolutionary relationships
17Ill-Posed Problem? Multiple Terminology
- (Dis-)similarity analysis
- Structure comparison
- Alignment, superposition, matching
- Classification
- Applications
- Definitions and issues
- Methods
18A Few Web Sites
- Protein Data Bank (PDB): http://www.rcsb.org/pdb/
- Protein classification
  - SCOP: http://scop.berkeley.edu/
  - CATH: http://www.biochem.ucl.ac.uk/bsm/cath/
- Protein alignment
  - DALI: http://www.ebi.ac.uk/dali/
  - LOCK: http://motif.stanford.edu/lock2/
19Application 1: Find Global Similarities Among Protein Structures
- Given two protein structures, find the largest similar substructures
- For example, a substructure is a subset of Cα atoms or a subset of secondary structure elements in each molecule
- Several possible similarity measures
- Variants: 1-to-1, 1-to-many, many-to-many (PDB)
- Must be automatic (and fast)
21Application 2: Classify Proteins
- Many proteins, but relatively few distinct fold families [Chothia, 1992; Holm and Sander, 1996; Brenner et al., 1997]
- Hierarchical classification
  - Class: similar secondary structure content
  - Fold: SSEs in similar arrangement
  - Family: clear evolutionary relationship
- Insight into functions and structure stabilization
- Basis for homology and threading
- Manual classification → SCOP [Murzin et al., 1995]
- Increasing size of PDB → automatic classifiers: CATH [Orengo et al., 1997], Pclass [Singh et al.], FSSP [Holm and Sander]
22Manual vs. Automatic Classification
23Application 3: Find Motif in Protein Structure
- Given a protein structure and a motif (e.g., a small collection of atoms corresponding to a binding site)
- Find whether the motif matches a substructure of the protein
- Variant: one motif against many proteins
[Figure: active sites of 1PIP and 5PAD. Only 3 amino acids participate in the motif.]
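One simple way to picture the motif problem is a brute-force search that compares residue types and pairwise distances, since distances are unchanged by any rigid-body transform. The sketch below is only an illustration under stated assumptions: the function name `find_motif`, the `(type, (x, y, z))` data layout, and the coordinates in the usage note are hypothetical, and the slides do not say which algorithm is actually used. It is exponential in the motif size and cannot tell a motif from its mirror image.

```python
import math
from itertools import permutations

def find_motif(protein, motif, tol):
    """Return index tuples of protein atoms matching the motif.

    protein, motif: lists of (type, (x, y, z)).
    A tuple matches if the types agree position by position and every
    pairwise distance agrees with the motif's within tol.
    """
    k = len(motif)
    hits = []
    for idx in permutations(range(len(protein)), k):
        # Typed features: each chosen atom must have the motif's type.
        if any(protein[i][0] != motif[m][0] for m, i in enumerate(idx)):
            continue
        # Rigid-body invariance: compare all pairwise distances.
        ok = all(
            abs(math.dist(protein[idx[p]][1], protein[idx[q]][1])
                - math.dist(motif[p][1], motif[q][1])) <= tol
            for p in range(k) for q in range(p + 1, k)
        )
        if ok:
            hits.append(idx)
    return hits
```

For example, a three-residue motif `[("CYS", (0, 0, 0)), ("HIS", (3, 0, 0)), ("ASN", (0, 4, 0))]` (made-up coordinates) would be found in any protein that contains those three residue types at the same mutual distances, wherever the protein sits in space.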
24Application 4: Find Pharmacophore
- Given
  - Small collection (5-10) of small flexible ligands with similar activity (hence, assumed to bind at the same protein site)
  - Low-energy conformations (several dozen to a few hundred) for each ligand
- Find a substructure (pharmacophore) that occurs in at least one conformation of each ligand
- Key problem in drug design when the binding site is unknown
25Application 4: Find Pharmacophore
[Figure: inhibitors of thermolysin]
26Application 5: Search for Ligands Containing a Pharmacophore
- Given
  - Database containing several hundred thousand, or more, small ligands
  - A pharmacophore P
- Find all ligands that have a low-energy conformation containing P
- Data mining of pharmaceutical databases (lead generation)
S.M. LaValle, P.W. Finn, L.E. Kavraki, and J.C. Latombe. A Randomized Kinematics-Based Approach to Pharmacophore-Constrained Conformational Search and Database Screening. J. of Computational Chemistry, 21(9):731-747, July 2000.
27- Applications
- Definitions and issues
- Methods
283D Molecular Structure
- Collection of (possibly typed) atoms or groups of atoms in some given 3D relative placement
- The placement of a group of atoms is defined by the position of a reference point (e.g., the center of an atom) and the orientation of a reference direction
- The type can be the atom ID, the amino-acid ID, etc.
29Matching of Structures
- Two structures A and B match iff:
  - Correspondence: there is a one-to-one map between their elements
  - Alignment: there exists a rigid-body transform T such that the RMSD between the elements in A and those in T(B) is less than some threshold ε
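The matching test can be sketched in a few lines. This is a minimal illustration, assuming each structure is a list of 3D points already in one-to-one correspondence by index, and that a candidate transform T is supplied as a 3×3 rotation matrix `R` plus a translation `t`. The names `apply_transform`, `rmsd`, and `structures_match` are hypothetical.

```python
import math

def apply_transform(R, t, p):
    """Apply the rigid-body transform T(p) = t + R p to one 3D point."""
    return tuple(t[r] + sum(R[r][c] * p[c] for c in range(3)) for r in range(3))

def rmsd(A, B):
    """Root-mean-square deviation between paired points A[i] <-> B[i]."""
    if len(A) != len(B) or not A:
        raise ValueError("need two non-empty point lists of equal length")
    total = sum(sum((a[k] - b[k]) ** 2 for k in range(3)) for a, b in zip(A, B))
    return math.sqrt(total / len(A))

def structures_match(A, B, R, t, eps):
    """True if RMSD(A, T(B)) < eps for the given transform T = (R, t)."""
    transformed_b = [apply_transform(R, t, b) for b in B]
    return rmsd(A, transformed_b) < eps
```

Note that this only checks one given T. Deciding whether any suitable T exists requires computing the best transform for the correspondence.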
30Complete Match
31Alignment of 3adk and 1gky
- Both matching and non-matching secondary
structure elements
32Partial Match
- Notion of support s of the match: the match is between s(A) and s(B)
- → Dual problem:
  - What is the support?
  - What is the transform?
- Often several (many) possible supports
- Small supports → motifs
33Mathematical Relative
[Figure: two functions f and g compared by ‖f − g‖² over a support s]
Over which support?
35Multiple Partial Matches
36Distributed Support
37What is Best?
Should gaps be penalized?
38What About This?
Sequence along backbone is not preserved
39Similarity measure is unlikely to satisfy the triangle inequality for partial matches
40Scoring Issues
- Trade-off between size of s and RMSD
- How should gaps be counted?
- Is there a quality of the correspondence?
  - The correspondence may, or may not, satisfy type and/or backbone sequence preferences
- Should accessible surface be given more importance?
- → Similarity measure may be different from the inverse of RMSD (though no consensus on best measure!)
- But RMSD is computationally very convenient!
41Examples
RMSD: dissimilarity measure → emphasizes differences → smaller support
STRUCTAL's similarity measure → emphasizes similarities → larger support
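A small numerical sketch can make this contrast concrete. The score below is a distance-based similarity in the spirit of STRUCTAL. Its functional form and the constants `m`, `d0`, and `gap_penalty` are assumptions chosen for illustration, not values taken from these slides, and the distance lists are made up.

```python
import math

def rmsd_of(distances):
    """RMSD computed from the distances of the matched pairs."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))

def structal_like_score(distances, n_gaps=0, m=20.0, d0=5.0, gap_penalty=10.0):
    """Sum of m / (1 + (d / d0)^2) over matched pairs, minus a penalty per gap.

    The constants are illustrative assumptions.
    """
    return sum(m / (1.0 + (d / d0) ** 2) for d in distances) - gap_penalty * n_gaps

# Hypothetical pair distances (in angstroms) after superposition.
core = [0.5, 0.8, 1.0]          # a small, tight support
extended = core + [3.0, 3.5]    # the same core plus two looser pairs

# RMSD gets worse when the looser pairs are added, so it favors the small support.
print(rmsd_of(core) < rmsd_of(extended))
# The additive score still gains from the looser pairs, so it favors the larger support.
print(structal_like_score(extended) > structal_like_score(core))
```

Every extra pair can only raise an RMSD that was already low, while an additive score rewards each reasonably close pair, which is why the two measures pull the support in opposite directions.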
42Comparison of Similarity Measures
- A.C.M. May. Toward more meaningful hierarchical classification of amino acids scoring functions. Protein Engineering, 12:707-712, 1999: reviews 37 protein structure similarity measures
- The difficulty of defining a similarity score is probably due to the fact that structure comparison is an ill-posed problem with multiple solutions
43Bottom Line
- Finding an optimal partial match is NP-hard
- No fast algorithm is guaranteed to give an optimal answer for any given measure [Godzik, 1996]
- → Heuristic/approximate algorithms
- → Probably not a single solution, but application-dependent solutions
- → But there exist general algorithmic principles
44Computational Questions
- Given a (dis)similarity measure and two proteins, compute the best match
  - Which support?
  - Which correspondence?
  - Which alignment transform?
45- Applications
- Definitions and issues
- Methods
46Find Global Similarities Among Protein Structures
- Input: two sets of features (atoms or groups of atoms) $\{a_1, \ldots, a_n\}$ and $\{b_1, \ldots, b_m\}$ belonging to two different proteins A and B
- Output:
  - Maximal correspondence set C of pairs $(a_i, b_j)$, where all $a_i$ and all $b_j$ are distinct
  - Alignment transform T such that the RMSD of the pairs $(a_i, T(b_j))$ is less than a given $\varepsilon$
- Several possible outputs
Variant of the Largest Common Point Set problem [Akutsu and Halldorsson, 1994]
47Possible Correspondence Constraints
- Typed features: $(a_i, b_j)$ is a possible correspondence pair iff $\mathrm{Type}(a_i) = \mathrm{Type}(b_j)$
- Ordered features: $(a_i, b_j)$ and $(a_{i'}, b_{j'})$, where $i' > i$, are possible correspondence pairs iff $j' > j$ (e.g., sequence along the backbone)
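These two constraints are easy to check on a candidate correspondence set. The sketch below assumes the set is given as `(i, j)` index pairs into the feature lists of A and B, with per-feature type labels. The function name and the data layout are hypothetical.

```python
def satisfies_constraints(pairs, types_a, types_b, typed=True, ordered=True):
    """Check a correspondence set given as (i, j) index pairs into A and B."""
    idx_a = [i for i, _ in pairs]
    idx_b = [j for _, j in pairs]
    # One-to-one: no feature of A or B may be used twice.
    if len(set(idx_a)) != len(pairs) or len(set(idx_b)) != len(pairs):
        return False
    # Typed features: paired features must carry the same type.
    if typed and any(types_a[i] != types_b[j] for i, j in pairs):
        return False
    # Ordered features: i' > i must imply j' > j.
    if ordered:
        by_i = sorted(pairs)
        if any(j2 <= j1 for (_, j1), (_, j2) in zip(by_i, by_i[1:])):
            return False
    return True
```

Turning off `ordered` allows matches in which the sequence along the backbone is not preserved.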
48Some Existing Software
- Cα atoms
  - DALI [Holm and Sander, 1993]
  - STRUCTAL [Gerstein and Levitt, 1996]
  - MINAREA [Falicov and Cohen, 1996]
  - CE [Shindyalov and Bourne, 1998]
  - ProtDex [Aung, Fu and Tan, 2003]
- Secondary structure elements and Cα atoms
  - VAST [Gibrat et al., 1996]
  - LOCK [Singh and Brutlag, 1996]
  - 3dSEARCH [Singh and Brutlag, 1999]
49RMSD ? Similarity
- But matches and RMSDs are not exactly what we need
- In general, we need to compute a similarity measure of the form $\max_T S(A, T(B))$, where S is more complex than RMSD
- Two-step approach:
  1. Compute best matches using RMSD
  2. Adjust transform to maximize similarity measure
50Computation of Best Matches
- Two simultaneous subproblems
- Find maximal correspondence set C
- Find alignment transform T
- Chicken-and-egg issue
- Each subproblem is relatively simple
- If we knew C, we could compute T
- If we knew T, we could get C by proximity
- But the combination is hard !!!
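The "if we knew T, we could get C by proximity" half of the problem can be sketched as a greedy pairing. This assumes B has already been transformed by T and that "proximity" means a distance of at most a threshold `eps`. The function name is hypothetical, and a greedy pass is only a heuristic: it is not guaranteed to return the largest possible one-to-one set.

```python
import math

def correspondences_by_proximity(A, TB, eps):
    """Greedy one-to-one pairing of points of A with points of T(B) within eps."""
    candidates = []
    for i, a in enumerate(A):
        for j, b in enumerate(TB):
            d = math.dist(a, b)
            if d <= eps:
                candidates.append((d, i, j))
    candidates.sort()  # closest candidate pairs are accepted first
    used_a, used_b, pairs = set(), set(), []
    for _, i, j in candidates:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            pairs.append((i, j))
    return sorted(pairs)
```

The quadratic scan over all point pairs is fine for a sketch. A spatial grid or k-d tree would be the usual choice for large structures.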
52Find Alignment Transform
- Two sets of points $A = \{a_1, \ldots, a_n\}$ and $B = \{b_1, \ldots, b_n\}$
- Correspondence pairs $(a_i, b_i)$
- Find $T = \arg\min_T \mathrm{RMSD}(A, T(B))$
- → O(n) closed-form solution [Arun, Huang, and Blostein, 87] [Horn, 87] [Horn, Hilden, and Negahdaripour, 88]
53O(n) SVD-Based Algorithm
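- T combines translation t and rotation R, such that $T(b_i) = t + R(b_i)$
- $\bar{b} = \frac{1}{n}\sum_{i=1}^{n} b_i$ is the mean of the $b_i$'s
- Place the origin of the coordinate system at $\bar{b}$
- $\min_T \mathrm{RMSD}(A, T(B))$ simplifies to (up to some constants) $\min_{t,R} \sum_{i=1}^{n} \lVert a_i - t - R(b_i) \rVert^2$
- t and R can be computed separately
- $t = \bar{a}$, the mean of the $a_i$'s
[Arun, Huang, and Blostein, 87]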
54O(n) SVD-Based Algorithm
- $A_{3 \times n} = [a_1 - \bar{a}, \ldots, a_n - \bar{a}]$, $B_{3 \times n} = [b_1 - \bar{b}, \ldots, b_n - \bar{b}]$
- Compute the SVD of the $3 \times 3$ correlation matrix $BA^T$: $BA^T = UDV^T$, where D is a diagonal matrix with decreasing non-negative entries (singular values) along the diagonal
- If $\det(U)\det(V) = 1$ then $S = I$, else $S = \mathrm{diag}(1, 1, -1)$
- $R = VSU^T$
[Arun, Huang, and Blostein, 87]
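For readers who want the reasoning behind these steps, here is a short derivation sketch. It is the standard least-squares argument rather than something taken from the slides, and it uses $T(b_i) = t + R\,b_i$ with centered coordinates $a_i' = a_i - \bar a$ and $b_i' = b_i - \bar b$ as the columns of $A$ and $B$.

```latex
\begin{aligned}
E(R,t) &= \sum_{i=1}^{n} \lVert a_i - t - R\,b_i \rVert^2 \\
\frac{\partial E}{\partial t} = 0 \;&\Rightarrow\; t = \bar a - R\,\bar b
   \qquad (\text{so } t = \bar a \text{ once the origin is placed at } \bar b) \\
E(R) &= \sum_{i} \lVert a_i' - R\,b_i' \rVert^2
      = \sum_{i} \lVert a_i' \rVert^2 + \sum_{i} \lVert b_i' \rVert^2
        - 2\,\operatorname{tr}\!\left(R\,B A^{T}\right) \\
\operatorname{tr}\!\left(R\,B A^{T}\right)
     &= \operatorname{tr}\!\left(R\,U D V^{T}\right)
      = \operatorname{tr}\!\left(V^{T} R\,U\,D\right)
\end{aligned}
```

Because $V^T R\,U$ is orthogonal and $D$ is diagonal with non-negative entries, the trace is largest when $V^T R\,U = I$, that is $R = VU^T$. When that product has determinant $-1$ (a reflection), flipping the sign associated with the smallest singular value via $S = \mathrm{diag}(1, 1, -1)$ gives the best proper rotation, $R = VSU^T$. If the SVD is taken of $AB^T$ instead of $BA^T$, the roles of $U$ and $V$ swap and the same argument yields $R = USV^T$.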
55- [Arun, Huang, and Blostein, 87] → rotation matrix
- [Horn, 87] → quaternion
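As an illustration of the quaternion route, the sketch below follows the construction in Horn's 1987 paper: build a symmetric 4×4 matrix from the cross-covariance sums, take the eigenvector of its largest eigenvalue as a unit quaternion, and convert it to a rotation matrix. To stay within the Python standard library, the eigenvector is found by shifted power iteration. That is a swapped-in simplification, not part of the method as published, and it can converge slowly when the two largest eigenvalues are close. The function name and the iteration count are assumptions.

```python
import math

def horn_alignment(A, B, iterations=500):
    """Return (R, t) minimizing sum_i ||a_i - (t + R b_i)||^2 for paired 3D points."""
    n = len(A)
    ca = [sum(a[k] for a in A) / n for k in range(3)]  # centroid of A
    cb = [sum(b[k] for b in B) / n for k in range(3)]  # centroid of B
    # S[x][y] = sum over pairs of (centered b)_x * (centered a)_y
    S = [[sum((b[x] - cb[x]) * (a[y] - ca[y]) for a, b in zip(A, B))
          for y in range(3)] for x in range(3)]
    (Sxx, Sxy, Sxz), (Syx, Syy, Syz), (Szx, Szy, Szz) = S
    # Horn's symmetric 4x4 matrix; its top eigenvector is the unit quaternion.
    N = [[Sxx + Syy + Szz, Syz - Szy, Szx - Sxz, Sxy - Syx],
         [Syz - Szy, Sxx - Syy - Szz, Sxy + Syx, Szx + Sxz],
         [Szx - Sxz, Sxy + Syx, -Sxx + Syy - Szz, Syz + Szy],
         [Sxy - Syx, Szx + Sxz, Syz + Szy, -Sxx - Syy + Szz]]
    # Shift by the Frobenius norm so the largest eigenvalue is also the
    # largest in magnitude, then run plain power iteration.
    shift = math.sqrt(sum(v * v for row in N for v in row)) or 1.0
    q = [1.0, 0.5, 0.25, 0.125]  # arbitrary start vector
    for _ in range(iterations):
        q = [sum(N[r][c] * q[c] for c in range(4)) + shift * q[r] for r in range(4)]
        norm = math.sqrt(sum(v * v for v in q))
        q = [v / norm for v in q]
    q0, qx, qy, qz = q
    # Rotation matrix of the unit quaternion (q0, qx, qy, qz).
    R = [[q0*q0 + qx*qx - qy*qy - qz*qz, 2*(qx*qy - q0*qz), 2*(qx*qz + q0*qy)],
         [2*(qy*qx + q0*qz), q0*q0 - qx*qx + qy*qy - qz*qz, 2*(qy*qz - q0*qx)],
         [2*(qz*qx - q0*qy), 2*(qz*qy + q0*qx), q0*q0 - qx*qx - qy*qy + qz*qz]]
    # Translation maps the rotated centroid of B onto the centroid of A.
    t = [ca[r] - sum(R[r][c] * cb[c] for c in range(3)) for r in range(3)]
    return R, t
```

A quaternion always encodes a proper rotation, so no separate reflection check is needed here.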
57→ Trial-and-Error Approach to Protein Structure Comparison
1. Set CS to a seed correspondence set (a small set sufficient to generate an alignment transform)
2. Compute the alignment transform T for CS and apply T to the second protein B
3. Update CS to include all pairs of features that are close to each other
4. If CS has changed, then return to Step 2; else return (CS, T)
58→ Trial-and-Error Approach to Protein Structure Comparison
- result ← nil
- Iterate N times:
  1. Set CS to a seed correspondence set (a small set sufficient to generate an alignment transform)
  2. Compute the alignment transform T for CS and apply T to the second protein B
  3. Update CS to include all pairs of features that are close to each other
  4. If CS has changed, then return to Step 2; else result ← result ∪ {(CS, T)}
- Return result
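The seeded loop can be sketched end to end in a simplified setting. To keep the example short and dependency-free, it works in 2D, where the least-squares rigid transform has a one-line closed form. It also replaces CS by a greedy one-to-one set of close pairs at each round and caps the number of rounds. All function names, the threshold `eps`, and these simplifications are assumptions for illustration. A 3D version would swap in an SVD- or quaternion-based fit.

```python
import math

def fit_rigid_2d(pairs):
    """Least-squares rotation + translation mapping b -> a for 2D pairs (a, b)."""
    n = len(pairs)
    ax = sum(a[0] for a, _ in pairs) / n
    ay = sum(a[1] for a, _ in pairs) / n
    bx = sum(b[0] for _, b in pairs) / n
    by = sum(b[1] for _, b in pairs) / n
    s_cos = s_sin = 0.0
    for a, b in pairs:
        ux, uy = b[0] - bx, b[1] - by  # centered b
        vx, vy = a[0] - ax, a[1] - ay  # centered a
        s_cos += ux * vx + uy * vy
        s_sin += ux * vy - uy * vx
    theta = math.atan2(s_sin, s_cos)
    c, s = math.cos(theta), math.sin(theta)
    return c, s, ax - (c * bx - s * by), ay - (s * bx + c * by)

def transform(T, p):
    """Apply T = (cos, sin, tx, ty) to a 2D point."""
    c, s, tx, ty = T
    return (c * p[0] - s * p[1] + tx, s * p[0] + c * p[1] + ty)

def close_pairs(A, TB, eps):
    """Greedy one-to-one pairing (i, j) of A[i] with TB[j] at distance <= eps."""
    cands = []
    for i, a in enumerate(A):
        for j, b in enumerate(TB):
            d = math.dist(a, b)
            if d <= eps:
                cands.append((d, i, j))
    used_i, used_j, cs = set(), set(), []
    for _, i, j in sorted(cands):
        if i not in used_i and j not in used_j:
            used_i.add(i)
            used_j.add(j)
            cs.append((i, j))
    return sorted(cs)

def refine(A, B, seed, eps, max_rounds=50):
    """Steps 1-4: alternate between fitting T and updating CS until CS is stable."""
    cs = sorted(seed)
    T = fit_rigid_2d([(A[i], B[j]) for i, j in cs])
    for _ in range(max_rounds):
        new_cs = close_pairs(A, [transform(T, b) for b in B], eps)
        if new_cs == cs or len(new_cs) < 2:
            break
        cs = new_cs
        T = fit_rigid_2d([(A[i], B[j]) for i, j in cs])
    return cs, T

def trial_and_error(A, B, seeds, eps):
    """Outer loop: run the refinement from each seed and collect the results."""
    return [refine(A, B, seed, eps) for seed in seeds]
```

A caller would typically keep the result with the largest correspondence set, or rescore all results with a richer similarity measure.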
59- How to get seed correspondences?