A model for Statistical Significance of Local Similarities in Structure' Alexander Stark, Shamil Sun - PowerPoint PPT Presentation

1
A model for Statistical Significance of Local
Similarities in Structure. Alexander Stark,
Shamil Sunyaev and Robert Russell.

Presented by
Viacheslav Fofanov

2
Outline

RMSD measure
Comparison Procedure
Model assuming independence of atoms
Incorporating dependence of covalently linked
atoms
Final P-value
Conclusions

3
Sequence → Structure → Function

The usual methods
Sequence comparisons
Structure comparisons
Alternative method
Try to discover functionally important patterns,
such as 3D structural patterns from known
functional sites (e.g. active sites of a known
protein)
RMSD method (purely geometric)
Evolutionary trace

4
Why?

Want to detect functionally important sites (from
a purely geometric point of view).
Once a site with a similar geometric configuration
has been detected, we want to know how likely this
match is to occur by chance.

5
RMS Deviation

Root-mean-square deviation of the atomic positions
in a protein from the native coordinates (here, those
of another protein). Measures the difference between
the two sets of atoms.
Initially used to evaluate protein structure
prediction accuracy, so the native coordinates were
the coordinates from an experimentally determined
structure.
Here, the query is considered to have the ideal
coordinates, and proteins from the database are
compared to it.

6
RMS Deviation (2)

RMSD computed by interatomic distance method
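
The formula on this slide did not survive in the transcript. As a rough illustration only, the sketch below computes one common distance-based RMSD: the root mean square of the differences between corresponding interatomic distances in the two atom sets. Whether this matches the slide's exact definition is an assumption; the function name `distance_rmsd` and the sample coordinates are hypothetical.

```python
import math
from itertools import combinations


def distance_rmsd(coords_a, coords_b):
    """Distance-based RMSD between two equally sized, equally ordered atom sets.

    Compares every interatomic distance in set A with the corresponding
    distance in set B, so no superposition (rotation/translation) is needed.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("both atom sets must contain the same number of atoms")
    if len(coords_a) < 2:
        raise ValueError("at least two atoms are needed to form a distance")

    squared_diffs = []
    for i, j in combinations(range(len(coords_a)), 2):
        d_a = math.dist(coords_a[i], coords_a[j])
        d_b = math.dist(coords_b[i], coords_b[j])
        squared_diffs.append((d_a - d_b) ** 2)
    return math.sqrt(sum(squared_diffs) / len(squared_diffs))


# Hypothetical three-atom pattern and a translated copy of it.
query = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
shifted = [(x + 1.0, y + 2.0, z + 3.0) for x, y, z in query]
print(distance_rmsd(query, shifted))  # close to zero: translation keeps distances
```

Because only internal distances enter the score, a measure of this kind cannot tell a pattern from its mirror image, which is one reason to treat it as a sketch rather than a full comparison method.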

7
Querying process

We have
3D coordinates (of Cα atoms) of some functionally
important pattern/configuration (e.g. binding
site of a protein)
A database of protein structures, each stored as
a set of coordinates of its residues.
We need
To query our pattern against those of the
database and find close or identical matches
To rank proteins by how well the query
pattern matches a pattern on the protein.
We get
A protein with a possible similar function to the
one that contained the query pattern.

8
Querying process

A brute-force approach is not feasible.
Suppose we are searching for an 8-residue query
pattern in a 150-residue protein.
That leads to a combinatorially large number of
comparisons,
times the thousands of proteins in the database.
Solution: restrict comparisons to plausible
matches, i.e. ones sufficiently close to the
query pattern.
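
The slide's own count was not preserved in the transcript. To give a sense of scale, the sketch below counts ordered assignments of 8 distinct protein residues to the 8 query positions. That counting convention is an assumption (it ignores residue-type restrictions, which would reduce the number), and `ordered_assignments` is a hypothetical helper name.

```python
import math


def ordered_assignments(protein_residues, pattern_residues):
    """Number of ways to assign distinct protein residues to ordered pattern positions."""
    return math.perm(protein_residues, pattern_residues)


per_protein = ordered_assignments(150, 8)
print(f"{per_protein:.2e} candidate assignments per protein")
```

Under this assumption the count for a single protein is already on the order of 10^17, before multiplying by the number of proteins in the database.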

9
Querying process

Transformations allowed are translation and
rotation.
First atom can be moved into ideal position
without any constraints
Second atom can be placed anywhere in a shell
defined by two spheres
Third atom can lie in a ring-like volume

10
Comparison of RMSD

Obvious problem arises in comparing RMSD scores,
as they are highly dependent on the number of
atoms being compared.
An RMSD of 2 Å over 150 residues is not comparable
to an RMSD of 2 Å over 3 residues.
Similar to sequence comparison, e.g. a given
10-mer in a viral genome may be significant but
in a eukaryotic genome (e.g. human) may easily
appear by chance.
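
The 10-mer analogy can be made concrete with a back-of-the-envelope estimate. The sketch below assumes uniformly random, independent bases on a single strand, and the two genome lengths are illustrative round numbers, not values from the slides.

```python
def expected_kmer_hits(genome_length, k):
    """Expected chance occurrences of one specific k-mer in a random sequence."""
    positions = max(genome_length - k + 1, 0)  # number of possible start positions
    return positions / 4 ** k                  # each start matches with probability 4^-k


viral = expected_kmer_hits(10_000, 10)          # assumed small viral genome
human = expected_kmer_hits(3_000_000_000, 10)   # assumed human-sized genome
print(f"viral: {viral:.4f} expected hits, human: {human:.0f} expected hits")
```

With these assumptions a specific 10-mer is expected far less than once in the small genome but thousands of times in the large one, which is the same size dependence that makes raw RMSD scores hard to compare.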

11
Rationale for a statistical model of RMSD

We cannot compare two RMSDs directly, so we need
some measure of how significant (unusual) a
particular RMSD score is.
We want something like a P-value, i.e. the
probability of finding, by chance, a score equal
to or better than the one we found.

12
Rationale (2)

For database searches statistical significance is
generally assessed by an extreme value
distribution.
Cumulative distribution of scores:
$P(x) = 1 - e^{-\mathrm{EF}(x)}$, where
EF is an expectation function that predicts the
number of matches with an equally good or better
score found in database.
P(x) is probability of finding a score equal or
better than x by chance.
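
A minimal sketch of the conversion from an expected number of chance matches to a P-value, using the relation P = 1 - exp(-EF) stated on this slide. The function name is hypothetical, and the expectation value is taken as given rather than computed from any model.

```python
import math


def p_value_from_expectation(expected_matches):
    """P-value for at least one chance match, given the expected number of matches."""
    if expected_matches < 0:
        raise ValueError("the expected number of matches cannot be negative")
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-expected_matches)


for ef in (1e-6, 0.1, 1.0, 5.0):
    print(f"EF = {ef:g} -> P = {p_value_from_expectation(ef):.6g}")
```

For small expectations the P-value is nearly equal to EF itself, while for large expectations it saturates at 1.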

13
Why EVD

When querying a pattern against a much larger
protein, there are many possible matches of
varying RMSD.
We take the minimum of this set of RMSDs to
represent the entire protein, hence the
extreme value distribution (EVD).
What is known about EVD?

14
EVD

For any cumulative distribution (CD) there are three
possible models for the asymptotic behavior of the EVD:
Double exponent (CD decreasing quickly with good
scores)
Exponent of power function (CD has slowly
decreasing tail)
Exponent of power function w/ different sign (CD
has finite terminal, i.e. bound).
Correct choice of the asymptotic model is crucial
for accurate statistics
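
For reference, the three limiting families from classical extreme value theory are sketched below in their standard textbook forms for maxima (for minima such as the best RMSD, the same forms apply to the negated variable). Mapping the slide's three descriptions onto the Gumbel, Fréchet and Weibull types, in that order, is an assumed reading; the shape parameter $\alpha > 0$ is generic and not taken from the slides.

```latex
\begin{align*}
\text{Gumbel (double exponential):}\quad
  & G(x) = \exp\!\left(-e^{-x}\right), && x \in \mathbb{R} \\
\text{Fr\'echet (slowly decreasing tail):}\quad
  & G(x) = \exp\!\left(-x^{-\alpha}\right), && x > 0 \\
\text{Weibull (finite bound):}\quad
  & G(x) = \exp\!\left(-(-x)^{\alpha}\right), && x < 0
\end{align*}
```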

15
EVD (2)

Authors have empirically determined the
distribution of RMSD by searching query patterns
against existing structural database (SCOP
version 1.55).
Slow increase for small RMSDs is typical for
power but not exponential functions.

Figure: Example distribution of RMSDs for a typical
query. Plot of number of matches versus RMSD for
a search with a random pattern of Cα atoms (A15,
I30, D60 from PDB entry 1a6m) in a background,
non-redundant database (one member of each of the
723 folds in SCOP).

16
Model assuming independence of atoms

Assumptions
Only one atom per residue (Cα)
Residues are independent and randomly distributed
in space

17
Model assuming independence of atoms

The probability that a residue from the database
matches one from the query
increases with allowed volume (RM)
proportional to the database size (D)
proportional to residue abundance(?)

18
Independence model.

Schematic two-dimensional representation of
allowed volumes (green) for placing atoms (grey)
of a pattern within an RMSD limit ($R_M$). RMSD ≤ $R_M$
restricts atoms on average to an allowed volume
described by a sphere of radius $R_M$ around the
ideal positions ($V = \frac{4}{3}\pi R_M^3 \propto R_M^3$, top).

19
Independence model.

The first atom is not restricted, but the second
(middle) must be placed within a shell defined by
two spheres around the first, in a volume
$V_2 = \frac{4}{3}\pi\,(2R_M^3 + 6\,d_{intra}^2 R_M) \approx 8\pi\, d_{intra}^2 R_M \propto R_M$
(for $R_M \ll d_{intra}$).

20
Independence model.

The third atom (bottom) can lie in a ring-like
volume $V_3 = d'_{intra}\,(2R_M)^2 = 4\, d'_{intra} R_M^2 \propto R_M^2$.
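
The shell volume quoted for the second atom can be checked numerically. The sketch below treats the shell as the region between two concentric spheres of radii d_intra - R_M and d_intra + R_M, which is an assumed reading of the geometry. It compares the exact volume with the thin-shell approximation 8π d_intra² R_M. The function names and sample distances are hypothetical.

```python
import math


def sphere_volume(radius):
    """Volume of a sphere of the given radius."""
    return 4.0 / 3.0 * math.pi * radius ** 3


def shell_volume(d_intra, r_m):
    """Exact volume between spheres of radius d_intra + r_m and d_intra - r_m."""
    if not 0 <= r_m <= d_intra:
        raise ValueError("requires 0 <= r_m <= d_intra")
    return sphere_volume(d_intra + r_m) - sphere_volume(d_intra - r_m)


def shell_volume_expanded(d_intra, r_m):
    """The same volume after expanding the cubes: 4/3 * pi * (2 r^3 + 6 d^2 r)."""
    return 4.0 / 3.0 * math.pi * (2 * r_m ** 3 + 6 * d_intra ** 2 * r_m)


def shell_volume_thin(d_intra, r_m):
    """Thin-shell approximation, valid when r_m is much smaller than d_intra."""
    return 8.0 * math.pi * d_intra ** 2 * r_m


d, r = 10.0, 0.5  # illustrative values in Å
print(shell_volume(d, r), shell_volume_expanded(d, r), shell_volume_thin(d, r))
```

The first two values agree to rounding error, and the thin-shell value differs from them by well under one percent for this choice of d and r, consistent with the linear dependence on R_M.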

21
Disadvantages of independence model

Not very realistic
Representing a residue by only one atom is not
sufficient
Correct relative orientation of residues rather
than their simple presence is crucial for
activity
Depends on density (rho). As density increases
it is easier to find better matches

22
Dependence model

Possible to consider multiple atoms per residue
when calculating RMSD
In which case covalent bonds violate assumption
of random and independent atom distribution
→ hence the need to develop a model that accounts
for dependence.

23
Dependence model

First residue is unconstrained
For the second, not only the position of Cα has to
match but also Cβ, Cγ, etc. Note that the positions
of Cβ and Cγ are relatively fixed.

24
Modified EF

Where S and T are the numbers of query residues
where two and three atoms are used, respectively.

25
P-value

In square brackets are corrections depending on
how many atoms were used per residue.

26
Points of interest

No way to tell if matched pattern is indeed on
the surface
The proteins in these databases are not a random
sample, and hence certain patterns may
appear there more often than others.
If a query pattern appears on the surface of the
protein at random it can still be functionally
important.

27
Points of interest

Which three atoms in the pattern are chosen first
might affect RMSD.
A good score may not guarantee functionality,
e.g. small deviations in all atoms versus ideal
matching of all but one.
Would be interesting to know at what RMSD
functionality is lost.
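
The point about small deviations in every atom versus one badly placed atom can be shown with made-up numbers. The sketch below assumes the per-atom deviations after superposition are already known; the values are hypothetical and chosen so that both cases give the same RMSD.

```python
import math


def rmsd_from_deviations(deviations):
    """RMSD computed directly from per-atom deviations (in Å)."""
    if not deviations:
        raise ValueError("need at least one deviation")
    return math.sqrt(sum(d * d for d in deviations) / len(deviations))


all_slightly_off = [1.0, 1.0, 1.0, 1.0]  # every atom deviates by 1 Å
one_far_off = [0.0, 0.0, 0.0, 2.0]       # three perfect atoms, one off by 2 Å

print(rmsd_from_deviations(all_slightly_off))
print(rmsd_from_deviations(one_far_off))
```

Both patterns yield the same RMSD, so the score alone cannot tell whether a single functionally critical atom is badly misplaced.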

28
References

S.A. Teichmann, A.G. Murzin and C. Chothia,
Determination of protein function, evolution and
interactions by structural genomics. Curr. Opin.
Struct. Biol. 11 (2001), pp. 354-363.
F.E. Cohen and M.J. Sternberg, On the prediction
of protein structure: the significance of the
root-mean-square deviation. J. Mol. Biol. 138
(1980), pp. 321-333.
S. Dietmann and L. Holm, Identification of
homology in protein structure classification.
Nature Struct. Biol. 8 (2001), pp. 953-957.
A. Stark, S. Sunyaev and R.B. Russell, A model for
statistical significance of local similarities in
structure. J. Mol. Biol. 326 (2003), pp. 1307-1316.