A model for Statistical Significance of Local Similarities in Structure' Alexander Stark, Shamil Sun - PowerPoint PPT Presentation

1
A model for Statistical Significance of Local
Similarities in Structure. Alexander Stark,
Shamil Sunyaev and Robert Russell.
  • Presented by
  • Viacheslav Fofanov

2
Outline
  • RMSD measure
  • Comparison Procedure
  • Model assuming independence of atoms
  • Incorporating dependence of covalently linked
    atoms
  • Final P-value
  • Conclusions

3
Sequence → Structure → Function
  • The usual methods
  • Sequence comparisons
  • Structure comparisons
  • Alternative method
  • Try to discover functionally important patterns,
    such as 3D structural patterns from known
    functional sites (e.g. active sites of a known
    protein)
  • RMSD method (purely geometric)
  • Evolutionary trace

4
Why?
  • Want to detect functionally important sites (from
    a purely geometric point of view).
  • Once a site with a similar geometric configuration
    has been detected, we want to know how likely it is
    that the match occurred by chance.

5
RMS Deviation
  • Root-mean-square deviation of the atomic positions
    in a protein from the native coordinates (another
    protein). Measures the difference between the
    two sets of atoms.
  • Initially used to evaluate protein structure
    prediction accuracy, so the native coordinates were
    the coordinates from an experimentally determined
    structure.
  • Here, the query is considered to have the ideal
    coordinates, and proteins from the database are
    compared to it.

6
RMS Deviation (2)
  • RMSD computed by interatomic distance method
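
The formula shown on this slide did not survive in the transcript. As a rough sketch only, one common distance-based definition compares every pairwise interatomic distance within the query to the corresponding distance within the candidate, so no superposition is needed. Whether this is exactly the variant the authors use is an assumption; the function name and the coordinates below are hypothetical.

```python
import math
from itertools import combinations

def distance_rmsd(coords_a, coords_b):
    """Distance-based RMSD between two equally sized, equally ordered atom sets.

    Assumed definition: root mean square of the differences between
    corresponding pairwise interatomic distances in the two sets.
    """
    if len(coords_a) != len(coords_b) or len(coords_a) < 2:
        raise ValueError("need two atom sets of equal size (at least 2 atoms)")
    squared_diffs = []
    for i, j in combinations(range(len(coords_a)), 2):
        d_a = math.dist(coords_a[i], coords_a[j])  # distance i-j in set A
        d_b = math.dist(coords_b[i], coords_b[j])  # distance i-j in set B
        squared_diffs.append((d_a - d_b) ** 2)
    return math.sqrt(sum(squared_diffs) / len(squared_diffs))

# Hypothetical three-atom pattern (coordinates in Angstrom).
query = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
# Same pattern translated: internal distances are unchanged.
shifted = [(x + 10.0, y - 2.0, z + 5.0) for x, y, z in query]
# Pattern with one interatomic distance stretched by 1 Angstrom.
stretched = [(0.0, 0.0, 0.0), (4.8, 0.0, 0.0), (4.8, 3.8, 0.0)]

print(distance_rmsd(query, shifted))
print(distance_rmsd(query, stretched))
```

Because only internal distances enter, the value is unchanged by translation and rotation. A distance-based measure of this kind also cannot tell a pattern from its mirror image.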

7
Querying process
  • We have
  • 3D coordinates (of Cα) of some functionally
    important pattern/configuration (e.g. binding
    site of a protein)
  • A database of protein structures, each stored as
    a set of coordinates of its residues.
  • We need
  • To query our pattern against those of the
    database and find close or identical matches
  • Rank proteins by how well the query
    pattern matches a pattern on the protein.
  • We get
  • A protein with a possible similar function to the
    one that contained the query pattern.

8
Querying process
  • A brute-force approach is not feasible.
  • Suppose we are searching for an 8-residue query
    pattern in a 150-residue protein.
  • That leads to comparisons
  • Times thousands of proteins in the database
  • Solution: restrict comparisons to plausible
    matches, i.e. ones sufficiently close to the
    query pattern.
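
To get a feel for the scale of the brute-force search, the sketch below counts candidate matches under one assumed reading of "brute force": every ordered assignment of 8 distinct protein residues to the 8 query positions. The counting convention is an assumption, not a figure taken from the slides, and the function name is hypothetical.

```python
import math

def ordered_assignments(protein_residues, pattern_residues):
    """Number of ways to assign distinct protein residues to ordered query positions."""
    return math.perm(protein_residues, pattern_residues)

per_protein = ordered_assignments(150, 8)
print(f"ordered assignments per protein: {per_protein:.2e}")

# If residue order did not matter, the count would be smaller by a factor of 8!.
unordered = math.comb(150, 8)
print(f"unordered residue subsets per protein: {unordered:.2e}")
```

Either way the count per protein is far too large to enumerate across thousands of database entries, which is why the search is pruned to geometrically plausible partial matches.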

9
Querying process
  • Transformations allowed are translation and
    rotation.
  • The first atom can be moved into its ideal position
    without any constraints
  • The second atom can be placed anywhere in a shell
    defined by two spheres
  • The third can lie in a ring-like volume
10
Comparison of RMSD
  • An obvious problem arises in comparing RMSD scores:
    they depend strongly on the number of
    atoms being compared.
  • An RMSD of 2 Å over 150 residues is not comparable
    to an RMSD of 2 Å over 3 residues.
  • Similar to sequence comparison: a given
    10-mer may be significant in a viral genome but
    may easily appear by chance in a eukaryotic
    genome (e.g. human).
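
The 10-mer analogy can be made concrete with a back-of-the-envelope calculation. The sketch assumes independent, equally likely bases and counts one strand only; the genome lengths (10 kb for a virus, 3 Gb for human) are illustrative round numbers, not values from the slides.

```python
def expected_kmer_hits(genome_length, k, alphabet_size=4):
    """Expected number of chance occurrences of one fixed k-mer in a random sequence."""
    positions = genome_length - k + 1      # number of possible start positions
    return positions / alphabet_size ** k  # each position matches with prob 4^-k

viral = expected_kmer_hits(10_000, 10)
human = expected_kmer_hits(3_000_000_000, 10)
print(f"expected chance hits, 10 kb genome: {viral:.4f}")
print(f"expected chance hits, 3 Gb genome:  {human:.0f}")
```

The same match is surprising in the small search space and expected many times over in the large one. RMSD scores have the same problem: the raw score cannot be interpreted without reference to the size of the search.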

11
Rationale for a statistical model of RMSD
  • We cannot compare two RMSDs directly, so we need
    some measure of how significant (unusual) a
    particular RMSD score is.
  • We want something like a P-value: the
    probability of finding, by chance, a score equal to
    or better than the one we found.

12
Rationale (2)
  • For database searches, statistical significance is
    generally assessed by an extreme value
    distribution (EVD).
  • Cumulative distribution of scores:
  • P(x) = 1 - e^{-EF(x)}, where
  • EF is an expectation function that predicts the
    number of matches with an equally good or better
    score found in the database.
  • P(x) is the probability of finding a score equal to or
    better than x by chance.
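
The relation between the expected number of matches and the P-value can be sketched in a few lines. The function name is hypothetical, and the expectation values fed in are arbitrary illustrations rather than outputs of the authors' expectation function.

```python
import math

def p_value_from_expectation(expected_matches):
    """P = 1 - exp(-E): chance of at least one equally good or better match."""
    if expected_matches < 0:
        raise ValueError("expected number of matches cannot be negative")
    # expm1 keeps precision when E is very small (P is then close to E).
    return -math.expm1(-expected_matches)

for e_value in (0.001, 0.1, 1.0, 5.0):
    print(e_value, p_value_from_expectation(e_value))
```

For small E the P-value is nearly equal to E, and as E grows the P-value saturates toward 1.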

13
Why EVD
  • When querying a pattern against a much larger
    protein, there are many possible matches of
    varying RMSD.
  • We take the minimum of this set of
    RMSDs to represent the entire protein, hence the
    extreme value distribution
  • What is known about EVD?

14
EVD
  • For any cumulative distribution (CD), there are three
    possible models for the asymptotic behavior of the EVD:
  • Double exponential (CD decreasing quickly with good
    scores)
  • Exponential of a power function (CD has a slowly
    decreasing tail)
  • Exponential of a power function with the opposite sign (CD
    has a finite terminus, i.e. is bounded).
  • The correct choice of asymptotic model is crucial
    for accurate statistics
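
For reference, the three cases listed above appear to correspond to the classical Gumbel, Fréchet and Weibull-type limits. The block below gives their standard textbook forms for the maximum of many independent draws, with shape parameter \alpha > 0; this mapping is offered as an assumption, since the slides do not name the families.

```latex
% Standard limiting forms for the maximum of many i.i.d. draws (\alpha > 0)
\begin{align*}
\text{Gumbel (double exponential):} \quad
  & G(x) = \exp\!\left(-e^{-x}\right) \\
\text{Fr\'echet (slowly decaying tail):} \quad
  & G(x) = \exp\!\left(-x^{-\alpha}\right), \quad x > 0 \\
\text{Weibull type (bounded support):} \quad
  & G(x) = \exp\!\left(-(-x)^{\alpha}\right), \quad x \le 0
\end{align*}
```

RMSD is bounded below by zero and the search keeps the minimum, so the bounded case is the natural candidate here. Mirrored for minima, it gives a distribution of the form 1 - \exp(-c\,x^{\alpha}), which has the same shape as P(x) = 1 - e^{-EF(x)} when EF is a power function of the RMSD.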

15
EVD (2)
  • The authors empirically determined the
    distribution of RMSD by searching query patterns
    against an existing structural database (SCOP
    version 1.55).
  • A slow increase at small RMSDs is typical of
    power functions, not exponential functions.

Example distribution of RMSDs for a typical
query. Plot of number of matches versus RMSD for
a search with a random pattern of Cα atoms (A15,
I30, D60 from PDB entry 1a6m) in a background,
non-redundant database (one member of each of the
723 folds in SCOP).

16
Model assuming independence of atoms
  • Assumptions
  • Only one atom per residue (Cα)
  • Residues are independent and randomly distributed
    in space

17
Model assuming independence of atoms
  • The probability that a residue from the database
    matches one from the query
  • increases with the allowed volume (R_M)
  • is proportional to the database size (D)
  • proportional to residue abundance(?)

18
Independence model.
  • Schematic two-dimensional representation of
    allowed volumes (green) for placing atoms (grey)
    of a pattern within an RMSD limit (R_M). RMSD ≤ R_M
    restricts atoms on average to an allowed volume
    described by a sphere of radius R_M around the
    ideal positions (V = 4/3 π R_M^3 ∝ R_M^3, top).

19
Independence model.
  • The first atom is not restricted, but the second
    (middle) must be placed within a shell defined by
    two spheres around the first, in a volume
    V_2 = 4/3 π (2 R_M^3 + 6 d_intra^2 R_M) ≈ 8π d_intra^2 R_M ∝ R_M
    (for R_M << d_intra).

20
Independence model.
  • The third atom (bottom) can lie in a ring-like
    volume V_3 = d'_intra (2 R_M)^2 = 4 d'_intra R_M^2 ∝ R_M^2.
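
The allowed volumes on the last three slides can be checked numerically. The sketch below codes the sphere, shell and ring volumes as transcribed and confirms that the shell approximation holds when R_M is much smaller than the interatomic distance. The function names and the example values (5 Å spacing, 0.5 Å RMSD limit) are hypothetical. The final helper is a derived illustration rather than a formula from the slides: it assumes the first atom contributes no R_M dependence, the second a factor of R_M, the third R_M^2, and each further atom R_M^3.

```python
import math

def sphere_volume(r_m):
    """Volume available to an atom confined to a sphere of radius R_M."""
    return 4.0 / 3.0 * math.pi * r_m ** 3

def shell_volume(d_intra, r_m):
    """Shell between spheres of radius d_intra - R_M and d_intra + R_M."""
    return 4.0 / 3.0 * math.pi * (2.0 * r_m ** 3 + 6.0 * d_intra ** 2 * r_m)

def shell_volume_approx(d_intra, r_m):
    """Leading term of the shell volume, valid for R_M << d_intra."""
    return 8.0 * math.pi * d_intra ** 2 * r_m

def ring_volume(d_ring, r_m):
    """Ring-like volume for the third atom, as transcribed: d' * (2 R_M)^2."""
    return d_ring * (2.0 * r_m) ** 2

def rmsd_exponent(n_atoms):
    """Assumed power of R_M in the product of allowed volumes (n_atoms >= 3)."""
    return 0 + 1 + 2 + 3 * (n_atoms - 3)

d_intra, r_m = 5.0, 0.5  # illustrative values in Angstrom
exact = shell_volume(d_intra, r_m)
approx = shell_volume_approx(d_intra, r_m)
print(f"shell exact {exact:.2f}, approx {approx:.2f}")
print(f"ring volume {ring_volume(d_intra, r_m):.2f}")
print(f"R_M exponent for an 8-atom pattern: {rmsd_exponent(8)}")
```

Under these assumptions the number of chance matches grows as a steep power of the RMSD limit, roughly R_M^(3N-6) for N atoms, which is the kind of power-law behavior the empirical RMSD distribution suggests.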

21
Disadvantages of independence model
  • Not very realistic
  • Representing a residue by only one atom is not
    sufficient
  • Correct relative orientation of residues rather
    than their simple presence is crucial for
    activity
  • Depends on density (rho). As density increases
    it is easier to find better matches

22
Dependence model
  • Possible to consider multiple atoms per residue
    when calculating RMSD
  • In that case, covalent bonds violate the assumption
    of random and independent atom distribution
  • Hence the need to develop a model that accounts
    for dependence.

23
Dependence model
  • First residue is unconstrained
  • For the second, not only the position of Cα has to
    match but also Cβ, Cγ, etc. Note that the positions
    of Cβ and Cγ are relatively fixed.

24
Modified EF
  • Where S and T are the numbers of query residues
    where two and three atoms are used, respectively.

25
P-value
  • In square brackets are corrections depending on
    how many atoms were used per residue.

26
Points of interest
  • No way to tell if matched pattern is indeed on
    the surface
  • The way proteins appear in these databases
    is not random, and hence certain patterns may
    appear there more often than others.
  • If a query pattern appears on the surface of the
    protein at random it can still be functionally
    important.

27
Points of interest
  • Which three atoms in the pattern are chosen first
    might affect RMSD.
  • A good score may not guarantee functionality,
    e.g. small deviations in all atoms vs. ideal
    matching of all but one.
  • Would be interesting to know at what RMSD
    functionality is lost.
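
The point about small deviations everywhere versus one badly placed atom can be illustrated with made-up numbers. The sketch assumes the per-atom deviations (in Å) are already known after superposition. Both lists are hypothetical.

```python
import math

def rmsd_from_deviations(deviations):
    """RMSD from a list of per-atom displacement magnitudes."""
    return math.sqrt(sum(d * d for d in deviations) / len(deviations))

uniform = [1.0, 1.0, 1.0, 1.0]         # every atom off by 1 Angstrom
single_outlier = [0.0, 0.0, 0.0, 2.0]  # three perfect atoms, one off by 2 Angstrom

print(rmsd_from_deviations(uniform))         # 1.0
print(rmsd_from_deviations(single_outlier))  # 1.0
```

Both patterns score the same RMSD, so the score alone cannot say whether a match is slightly loose everywhere or has one atom out of place, which may matter very differently for function.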

28
References
  • S.A. Teichmann, A.G. Murzin and C. Chothia,
    Determination of protein function, evolution and
    interactions by structural genomics. Curr. Opin.
    Struct. Biol. 11 (2001), pp. 354-363.
  • F.E. Cohen and M.J. Sternberg, On the prediction
    of protein structure: the significance of the
    root-mean-square deviation. J. Mol. Biol. 138
    (1980), pp. 321-333.
  • S. Dietmann and L. Holm, Identification of
    homology in protein structure classification.
    Nature Struct. Biol. 8 (2001), pp. 953-957.
  • A. Stark, S. Sunyaev and R.B. Russell, A model for
    statistical significance of local similarities in
    structure. J. Mol. Biol. 326 (2003), pp. 1307-1316.