Title: A model for Statistical Significance of Local Similarities in Structure' Alexander Stark, Shamil Sun
1A model for Statistical Significance of Local
Similarities in Structure. Alexander Stark,
Shamil Sunyaev and Robert Russell.
- Presented by
- Viacheslav Fofanov
2Outline
- RMSD measure
- Comparison Procedure
- Model assuming independence of atoms
- Incorporating dependence of covalently linked
atoms
- Final P-value
- Conclusions
3Sequence → Structure → Function
- The usual methods
- Sequence comparisons
- Structure comparisons
- Alternative method
- Try to discover functionally important patterns,
such as 3D structural patterns from known
functional sites (e.g. active sites of a known
protein)
- RMSD method (purely geometric)
- Evolutionary trace
4Why?
- Want to detect functionally important sites (from
a purely geometric point of view).
- Once a site with a similar geometric configuration
has been detected, want to know how likely this
match is to happen by chance.
5RMS Deviation
- Root-mean-square deviation of the atomic positions
in a protein from the native coordinates (another
protein). Measures the difference between the
two sets of atoms.
- Initially used to evaluate protein structure
prediction accuracy, so native coordinates were
coordinates from an experimentally determined
structure.
- Here, the query is considered to have ideal
coordinates and proteins from the database are
compared to it.
6RMS Deviation (2)
- RMSD computed by interatomic distance method
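The formula from this slide did not survive extraction. As a rough sketch only, one common way to compute an RMSD from interatomic distances (often called distance RMSD) is shown below. The function name `distance_rmsd`, the normalization by the number of atom pairs, and the example coordinates are assumptions, not taken from the paper.

```python
import math
from itertools import combinations


def distance_rmsd(query, match):
    """RMSD between two equally sized atom sets, from interatomic distances.

    Every within-set distance in `query` is compared with the corresponding
    distance in `match`, so no superposition (rotation/translation) is needed.
    """
    if len(query) != len(match) or len(query) < 2:
        raise ValueError("need two atom sets of equal size (at least 2 atoms)")
    pairs = list(combinations(range(len(query)), 2))
    total = 0.0
    for i, j in pairs:
        d_query = math.dist(query[i], query[j])
        d_match = math.dist(match[i], match[j])
        total += (d_query - d_match) ** 2
    # Assumed normalization: mean over all atom pairs.
    return math.sqrt(total / len(pairs))


# Hypothetical coordinates (in Å): a pattern and a translated copy of it.
query = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
shifted = [(x + 10.0, y - 2.0, z + 5.0) for x, y, z in query]
print(distance_rmsd(query, shifted))  # essentially zero: only the position changed
```

Because only internal distances are compared, this measure is unchanged by translation and rotation. It also cannot distinguish a pattern from its mirror image.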
7Querying process
- We have
- 3D coordinates (of Cα) of some functionally
important pattern/configuration (e.g. binding
site of a protein)
- A database of protein structures, each stored as
a set of coordinates of its residues.
- We need
- To query our pattern against those of the
database and find close or identical matches
- To rank proteins based on how well the query
pattern matched a pattern on the protein.
- We get
- A protein whose function is possibly similar to
that of the one that contained the query pattern.
8Querying process
- Brute force approach is not feasible.
- Suppose we are searching for an 8 residue query
pattern in a 150 residue protein.
- That leads to a combinatorially large number of comparisons
- Times thousands of proteins in the database
- Solution: restrict comparisons to plausible
matches, i.e. ones sufficiently close to the
query pattern.
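The comparison count from this slide is not recoverable. As a rough illustration, the sketch below counts how many ways 8 query residues could be assigned to residues of a 150 residue protein. It assumes that any residue may match any query position (residue types are ignored), which is a simplification.

```python
import math

query_size = 8      # residues in the query pattern (from the slide)
protein_size = 150  # residues in the target protein (from the slide)

# Ordered assignments: each query residue mapped to a distinct protein residue.
ordered = math.perm(protein_size, query_size)
# Unordered subsets: which 8 residues are picked, ignoring their order.
unordered = math.comb(protein_size, query_size)

print(f"ordered assignments: {ordered:.3e}")
print(f"unordered subsets:   {unordered:.3e}")
```

Either count is far too large to enumerate for every protein in a database, which motivates pruning implausible matches early.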
9Querying process
- Transformations allowed are translation and
rotation.
- First atom can be moved into ideal position
without any constraints
- Second atom can be placed anywhere in a shell
defined by two spheres
- Third can lie in a ring-like volume (shown)
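The idea of restricting the search to plausible matches can be sketched as a depth-first search that drops a partial match as soon as one interatomic distance disagrees with the query. This is a minimal illustration, not the authors' implementation. The name `find_matches`, the tolerance `tol`, and the example coordinates are assumptions, and residue types are not checked.

```python
import math


def find_matches(query, target, tol):
    """Yield tuples of target indices whose pairwise distances match the query.

    A partial assignment is extended only if the new target atom lies at the
    right distance (within `tol`) from every atom already assigned, so most
    combinations are discarded long before all query atoms are placed.
    """
    def extend(assigned):
        k = len(assigned)  # index of the query atom to place next
        if k == len(query):
            yield tuple(assigned)
            return
        for t in range(len(target)):
            if t in assigned:
                continue
            consistent = all(
                abs(math.dist(target[t], target[a])
                    - math.dist(query[k], query[q])) <= tol
                for q, a in enumerate(assigned)
            )
            if consistent:
                yield from extend(assigned + [t])

    yield from extend([])


# Hypothetical data: a 3-4-5 triangle, and a target holding a translated copy
# of it plus one unrelated atom.
query = [(0, 0, 0), (3, 0, 0), (3, 4, 0)]
target = [(10, 10, 10), (13, 10, 10), (13, 14, 10), (50, 50, 50)]
print(list(find_matches(query, target, tol=0.1)))  # [(0, 1, 2)]
```

The first atom is placed without any check, the second is constrained by one distance, and the third by two, mirroring the free position, shell, and ring described on this slide.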
10Comparison of RMSD
- An obvious problem arises in comparing RMSD scores,
as they are highly dependent on the number of
atoms being compared.
- An RMSD of 2 Å over 150 residues is not comparable
to an RMSD of 2 Å over 3 residues.
- Similar to sequence comparison, e.g. a given
10-mer in a viral genome may be significant but
in a eukaryotic genome (e.g. human) may easily
appear by chance.
11Rationale for a statistical model of RMSD
- We cannot compare two RMSDs directly, so we need
some measure of how significant (unusual) a
particular RMSD score is.
- Want to get something like a P-value, i.e. the
probability of finding, by chance, a score equal
to or better than the one we found.
12Rationale (2)
- For database searches statistical significance is
generally assessed by an extreme value
distribution.
- Cumulative distribution of scores
- P(x) = 1 - e^{-EF(x)}, where
- EF is an expectation function that predicts the
number of matches with an equally good or better
score found in the database.
- P(x) is the probability of finding a score equal to or
better than x by chance.
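The relation between the expected number of chance matches and the P-value can be sketched in a few lines. The function name `p_value` and the example expectation values are illustrative assumptions. The expectation function EF itself is not implemented here.

```python
import math


def p_value(expected_matches):
    """P(x) = 1 - exp(-EF(x)) for an expected number of chance matches EF(x)."""
    if expected_matches < 0:
        raise ValueError("expected number of matches cannot be negative")
    # -expm1(-E) equals 1 - exp(-E) and stays accurate for very small E.
    return -math.expm1(-expected_matches)


for expected in (0.001, 0.1, 1.0, 10.0):
    print(f"EF = {expected:>6}: P = {p_value(expected):.6f}")
```

When EF is small, P is close to EF itself. When EF is large, P approaches 1, meaning a match this good is all but certain to occur by chance.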
13Why EVD
- When querying a pattern against a much larger
protein, there are many possible matches of
varying RMSD.
- We are looking for the minimum of this set of
RMSDs to represent the entire protein, hence the
Extreme Value Distribution (EVD).
- What is known about EVD?
14EVD
- For any cumulative distribution (CD) there are three
possible models for the asymptotic behavior of the EVD
- Double exponent (CD decreasing quickly with good
scores)
- Exponent of a power function (CD has a slowly
decreasing tail)
- Exponent of a power function w/ different sign (CD
has a finite terminal, i.e. bound).
- Correct choice of the asymptotic model is crucial
for accurate statistics
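For reference, the three asymptotic models are sketched below in their textbook standardized forms for maxima. The names Gumbel, Fréchet and Weibull and the shape parameter `alpha` are the usual textbook labels, not taken from the slides. A minimum such as the best RMSD can be treated as the maximum of the negated scores.

```python
import math


def gumbel_cdf(x):
    """Double exponent: parent distribution with a quickly decaying tail."""
    return math.exp(-math.exp(-x))


def frechet_cdf(x, alpha):
    """Exponent of a power function: parent with a slowly decaying tail."""
    return math.exp(-(x ** -alpha)) if x > 0 else 0.0


def weibull_cdf(x, alpha):
    """Exponent of a power function with the other sign: bounded parent."""
    return math.exp(-((-x) ** alpha)) if x < 0 else 1.0


print(gumbel_cdf(0.0), frechet_cdf(1.0, 2.0), weibull_cdf(-1.0, 2.0))
```

All three calls evaluate exp(-1), which shows how the three families share the same outer exponential and differ only in the inner function.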
15EVD (2)
- Authors have empirically determined the
distribution of RMSD by searching query patterns
against an existing structural database (SCOP
version 1.55).
- Slow increase for small RMSDs is typical for
power but not exponential functions.
Example distribution of RMSDs for a typical
query. Plot of number of matches versus RMSD for
a search with a random pattern of Cα atoms (A15,
I30, D60 from PDB entry 1a6m) in a background,
non-redundant database (one member of each of the
723 folds in SCOP).
16Model assuming independence of atoms
- Assumptions
- Only one atom per residue (Cα)
- Residues are independent and randomly distributed
in space
17Model assuming independence of atoms
- Probability that a residue from the database matches
one from the query
- increases with allowed volume (R_M)
- proportional to the database size (D)
- proportional to residue abundance(?)
18Independence model.
- Schematic two-dimensional representation of
allowed volumes (green) for placing atoms (grey)
of a pattern within an RMSD limit (R_M). RMSD ≤ R_M
restricts atoms on average to an allowed volume
described by a sphere of radius R_M around the
ideal positions (V = 4/3 π R_M^3 ∝ R_M^3, top).
19Independence model.
- The first atom is not restricted, but the second
(middle) must be placed within a shell defined by
two spheres around the first in a volume
V_2 = 4/3 π (2 R_M^3 + 6 d_intra^2 R_M) ≈ 8π d_intra^2 R_M ∝ R_M
(for R_M << d_intra).
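The shell volume can be checked numerically. The sketch below compares the exact volume between two spheres of radii d_intra + R_M and d_intra - R_M with the expanded form and with the linear approximation. The values 10 Å and 0.5 Å are illustrative assumptions only.

```python
import math


def shell_volume(d_intra, r_m):
    """Exact volume between spheres of radius d_intra + r_m and d_intra - r_m."""
    outer = d_intra + r_m
    inner = d_intra - r_m
    return 4.0 / 3.0 * math.pi * (outer ** 3 - inner ** 3)


def shell_volume_expanded(d_intra, r_m):
    """Same volume after expanding the cubes: 4/3 pi (2 R^3 + 6 d^2 R)."""
    return 4.0 / 3.0 * math.pi * (2 * r_m ** 3 + 6 * d_intra ** 2 * r_m)


def shell_volume_approx(d_intra, r_m):
    """Leading term for r_m << d_intra: 8 pi d^2 R, linear in R."""
    return 8.0 * math.pi * d_intra ** 2 * r_m


d_intra, r_m = 10.0, 0.5  # hypothetical distances in Å
exact = shell_volume(d_intra, r_m)
approx = shell_volume_approx(d_intra, r_m)
print(f"exact = {exact:.2f}, approx = {approx:.2f}, "
      f"relative error = {(exact - approx) / exact:.5f}")
```

The neglected term is 2 R_M^3, so the relative error is R_M^2 / (3 d_intra^2 + R_M^2), which is well under 1% for these values.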
20Independence model.
- The third atom (bottom) can lie in a ring-like
volume V_3 ≈ d'_intra (2 R_M)^2 = 4 d'_intra R_M^2 ∝ R_M^2.
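Taken together, the three panels suggest how the allowed volume scales with R_M for a whole pattern. The sketch below simply adds up the exponents, assuming that every atom from the fourth onward is confined to a sphere (∝ R_M^3, as in the top panel). This is an idealized count, not the authors' fitted expectation function, and the name `volume_exponent` is hypothetical.

```python
def volume_exponent(n_atoms):
    """Power of R_M in the product of allowed volumes for n_atoms atoms."""
    if n_atoms < 1:
        raise ValueError("need at least one atom")
    # First atom: unrestricted (R_M^0). Second: shell (R_M^1).
    # Third: ring (R_M^2). Fourth and later: sphere (R_M^3), by assumption.
    first_three = [0, 1, 2]
    return sum(first_three[i] if i < 3 else 3 for i in range(n_atoms))


for n in (3, 4, 8):
    print(f"{n} atoms: allowed volume scales as R_M^{volume_exponent(n)}")
```

For three or more atoms the sum equals 3N - 6, so under these assumptions the number of chance matches grows as a power of R_M rather than exponentially.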
21Disadvantages of independence model
- Not very realistic
- Representing a residue by only one atom is not
sufficient
- Correct relative orientation of residues rather
than their simple presence is crucial for
activity
- Depends on density (rho). As density increases
it is easier to find better matches
22Dependence model
- Possible to consider multiple atoms per residue
when calculating RMSD
- In which case covalent bonds violate the assumption
of random and independent atom distribution
- Hence the need to develop a model that accounts
for dependence.
23Dependence model
- First residue is unconstrained
- For the second not only the position of Cα has to
match but also Cβ, Cγ, etc. Note that positions
of Cβ, Cγ are relatively fixed.
24Modified EF
- Where S and T are the numbers of query residues
where two and three atoms are used, respectively.
25P-value
- In square brackets are corrections depending on
how many atoms were used per residue.
26Points of interest
- No way to tell if the matched pattern is indeed on
the surface
- The way proteins appear in these databases
is not random, hence certain patterns may
appear there more often than others.
- If a query pattern appears on the surface of the
protein at random it can still be functionally
important.
27Points of interest
- Which three atoms in the pattern are chosen first
might affect RMSD.
- A good score may not guarantee functionality,
e.g. small deviations in all atoms vs. ideal
matching of all but one.
- Would be interesting to know at what RMSD
functionality is lost.
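The point about small deviations in all atoms versus one large deviation can be made concrete. The sketch below uses hypothetical per-atom deviations (in Å) and the helper name `rmsd_from_deviations` is an assumption. Both profiles give the same RMSD.

```python
import math


def rmsd_from_deviations(deviations):
    """RMSD given the per-atom distances between matched atoms."""
    return math.sqrt(sum(d * d for d in deviations) / len(deviations))


uniform = [1.0, 1.0, 1.0, 1.0]      # every atom is off by 1 Å
one_outlier = [0.0, 0.0, 0.0, 2.0]  # three perfect atoms, one off by 2 Å

print(rmsd_from_deviations(uniform), rmsd_from_deviations(one_outlier))  # 1.0 1.0
```

The score alone cannot tell these cases apart, even though a single badly placed atom might matter more for function than a uniform small shift.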
28References
- S.A. Teichmann, A.G. Murzin and C. Chothia,
Determination of protein function, evolution and
interactions by structural genomics. Curr. Opin.
Struct. Biol. 11 (2001), pp. 354-363.
- F.E. Cohen and M.J. Sternberg, On the prediction
of protein structure: the significance of the
root-mean-square deviation. J. Mol. Biol. 138
(1980), pp. 321-333.
- S. Dietmann and L. Holm, Identification of
homology in protein structure classification.
Nature Struct. Biol. 8 (2001), pp. 953-957.
- A. Stark, S. Sunyaev and R.B. Russell, A model for
statistical significance of local similarities in
structure. J. Mol. Biol. 326 (2003), pp. 1307-1316.