A unified statistical framework for sequence comparison and structure comparison

1
A unified statistical framework for sequence
comparison and structure comparison
  • Michael Levitt and Mark Gerstein

2
Statistics Introduction
  • Statistics is the discipline which deals with
    inference in the presence of variation
  • Given a score, how significant is it?
  • H0, HA, critical region, p-value
  • Extreme value distribution: the maximum over all
    sequence scores follows an extreme value
    distribution
  • Why the extreme value distribution is useful:
    the score is maximized over all possible random
    alignments

3
Introduction
  • Given sequence and structural scores, develop
    hypothesis testing framework
  • H0: the two proteins compared are unrelated
  • Distribution of scores of unrelated proteins
    determined empirically using PDB data at 40%
    sequence identity
  • No assumption of background distribution

4
Sequence Comparison Framework
  • Sequence score determined by SSEARCH with the
    BLOSUM50 substitution matrix
  • The p.d.f. depends on Sseq (sequence score) and on
    n and m (lengths of the two sequences compared)
  • All possible pairs were compared to determine
    the p.d.f. empirically

5
P.D.F. for Sequence Score

6
Cross Section of p.d.f. for constant ln(nm)

7
Density Distribution for constant ln(nm)
  • Density distribution follows the extreme value
    distribution $\rho_{seq}(Z) = \exp(-Z - \exp(-Z))$
  • $Z = (S_{seq} - \mu_{seq}) / \sigma_{seq}$
  • $\mu_{seq} = a \ln(nm) + b$ is the model average;
    a and b are fitted to the observed density by
    least squares
  • $\sigma_{seq} = a$
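
The standardization and density on this slide can be sketched in a few lines of Python. This is only an illustration: the function names are made up here, and the values of a and b in the usage lines are placeholders, not the fitted constants from the talk.

```python
import math

def z_score(s_seq, n, m, a, b):
    """Standardize a raw sequence score: Z = (S_seq - mu_seq) / sigma_seq."""
    mu_seq = a * math.log(n * m) + b   # mu_seq = a ln(nm) + b
    sigma_seq = a                      # sigma_seq = a, as stated on the slide
    return (s_seq - mu_seq) / sigma_seq

def evd_density(z):
    """Extreme value density rho(Z) = exp(-Z - exp(-Z))."""
    return math.exp(-z - math.exp(-z))

# Placeholder parameters for illustration only (NOT the fitted values).
A_DEMO, B_DEMO = 5.0, -20.0
z = z_score(s_seq=60.0, n=150, m=200, a=A_DEMO, b=B_DEMO)
print(round(z, 3), round(evd_density(z), 4))
```

Because the score is standardized with ln(nm), pairs of very different lengths can be compared on the same Z scale.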

8
Comparison to BLAST and FASTA statistics
  • Critical region used to determine the model
    p-value: $P_{seq}(z > Z)$
  • Comparing model p-values with BLAST p-values
    found the BLAST p-values higher than the model's
  • FASTA statistic: better coverage but more errors
    than the model
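
For a standard extreme value density exp(-Z - exp(-Z)), the upper-tail probability has the closed form P(z > Z) = 1 - exp(-exp(-Z)). A minimal sketch of that tail computation follows; the function name and the overflow guard are choices made here, not taken from the talk.

```python
import math

def evd_p_value(z):
    """Upper-tail probability P(z > Z) = 1 - exp(-exp(-Z))."""
    if z < -30.0:
        # exp(-z) is huge here, so the tail probability is 1.0 in floating
        # point; returning early also avoids overflow in math.exp.
        return 1.0
    # -expm1(-x) equals 1 - exp(-x) and stays accurate when x is tiny.
    return -math.expm1(-math.exp(-z))

for z in (0.0, 2.0, 5.0, 10.0):
    print(z, evd_p_value(z))
```

For large Z the p-value approaches exp(-Z), so each additional unit of Z lowers the p-value by roughly a factor of e.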

9
Structure Comparison Algorithm

10
Structure Comparison Framework
  • The score obtained from the structure comparison
    algorithm is Sstr
  • The p.d.f. for Sstr was built as a function of N
    (number of residues matched) and Sstr, after
    removing pairs which scored high
  • For each fixed N, an extreme value distribution
    was fitted to the density; this was done for all N
11
Comparison with RMS
  • RMS deviation of alpha-carbon positions after a
    least-squares fit is the traditional measure
  • The RMS score was used to determine a p.d.f. in
    ln(RMS score) and N
  • Comparing RMS with Sstr found RMS worse than
    Sstr in coverage and accuracy

12
Comparison with RMS (cont.)
  • Three reasons:
  • Sstr depends most strongly on the best-fitting
    atoms; RMS depends most on the worst-fitting atoms
  • Sstr penalizes gaps; RMS does not
  • Sstr is analogous to Sseq in the sense that both
    use dynamic programming
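
The first two reasons can be seen in a toy calculation. The slides do not give the formula for Sstr, so the score below is an assumed stand-in: each matched pair contributes 1/(1 + (d/d0)^2) and each gap subtracts a fixed penalty. The constants d0 = 5.0 and gap_penalty = 0.5 are illustrative only.

```python
import math

def rms(distances):
    """Root-mean-square of the inter-atom distances after superposition."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))

def similarity_score(distances, n_gaps=0, d0=5.0, gap_penalty=0.5):
    """Assumed score: close pairs add nearly 1, distant pairs add nearly 0."""
    matched = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return matched - gap_penalty * n_gaps

good = [1.0] * 10                  # ten well-fitting pairs
one_outlier = [1.0] * 9 + [20.0]   # same, but one badly fitting pair

print(rms(good), rms(one_outlier))
print(similarity_score(good), similarity_score(one_outlier))
print(similarity_score(good, n_gaps=2))
```

A single outlier multiplies the RMS several times over, while the assumed score loses only about one pair's contribution. The gap count has no effect on RMS at all.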

13
Comparison of Structure and Sequence Comparison

14
Concluding Remarks
  • Significance of sequence structure score can be
    calculated from any structural alignment program
  • This method of assessing statistical significance
    lies between the FASTA and BLAST methods

15
Efficient Detection of Three-Dimensional
Structural Motifs in Biological Macromolecules By
Computer Vision Techniques
  • Ruth Nussinov and Haim J. Wolfson

16
Introduction
  • One of the earlier papers addressing structure
    comparison
  • Based on computer vision techniques (the geometric
    hashing paradigm)
  • No a priori predefined motif assumed
  • Advantage: can be parallelized

17
Problem
  • Given 3D coordinates of atoms of two molecules,
    find a rigid transformation (rotation and
    translation allowed) so that a large number of
    atoms of one molecule match the atoms of the
    other molecule
  • Closely related to 3D rigid object recognition

18
Geometric Hashing Paradigm: Representation of
Geometric Constraints
  • Proteins represented as points using coordinate
    frames (minimal representation of coordinate
    frames)
  • Pick three noncollinear points to define a plane
    (RS) and construct an orthogonal 3D coordinate
    system based on RS

19
Representation of Geometric Constraints (cont.)
  • Define orthonormal vectors w.r.t. RS so that any
    point can be represented as a linear combination
    of the orthonormal vectors
  • To remove dependence on a particular RS (which may
    preclude recognition if at least one of the RS
    points does not match the input substructure),
    represent the m points in all basis triplets
    (i.e. all sets of orthonormal vectors) over all
    possible RS
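
One way to build such a frame from three noncollinear points is sketched below. The helper names are hypothetical, and the choice of the first point as origin and the first edge as the x-axis is an assumption; the slides only require that the frame be determined by the RS.

```python
import math

def sub(u, v):
    return tuple(a - b for a, b in zip(u, v))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def unit(u):
    norm = math.sqrt(dot(u, u))
    if norm < 1e-9:
        raise ValueError("RS points are collinear or coincident")
    return tuple(a / norm for a in u)

def frame_from_rs(p0, p1, p2):
    """Return (origin, (e1, e2, e3)) for the RS triplet p0, p1, p2."""
    e1 = unit(sub(p1, p0))               # along the first edge
    e3 = unit(cross(e1, sub(p2, p0)))    # normal to the RS plane
    e2 = cross(e3, e1)                   # completes the right-handed basis
    return p0, (e1, e2, e3)

def to_frame(point, origin, axes):
    """Coordinates of point as a combination of the orthonormal vectors."""
    rel = sub(point, origin)
    return tuple(dot(rel, e) for e in axes)

pts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (2.0, 3.0, 4.0)]
# A rigid motion: rotate 90 degrees about z, then translate by (1, 2, 3).
moved = [(-y + 1.0, x + 2.0, z + 3.0) for x, y, z in pts]

before = to_frame(pts[3], *frame_from_rs(*pts[:3]))
after = to_frame(moved[3], *frame_from_rs(*moved[:3]))
print(before, after)
```

The fourth point has the same coordinates in both frames, which is the invariance under rotation and translation that the hashing scheme relies on.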

20
Algorithm for Representation of Geometric
Constraints
  • For each RS:
  • Compute the orthonormal 3D basis associated with
    that RS
  • Compute the coordinates of all other points in the
    coordinate frame defined by the 3D basis
  • For each point, define a hash table address from
    its labels and measurements
  • Use each address to enter the pair (model, RS)
    into the hash table
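
The representation stage above might look like the following sketch. Several details are assumptions made for the example: every ordered triplet of atoms is treated as an RS, the address is the atom label plus frame coordinates rounded to a bin of size bin_size, and the table is a plain dictionary.

```python
import math
from collections import defaultdict
from itertools import permutations

def _sub(u, v):
    return tuple(a - b for a, b in zip(u, v))

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def _cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def _unit(u):
    norm = math.sqrt(_dot(u, u))
    return None if norm < 1e-9 else tuple(a / norm for a in u)

def frame(p0, p1, p2):
    """Orthonormal frame for one RS, or None if the points are collinear."""
    e1 = _unit(_sub(p1, p0))
    if e1 is None:
        return None
    e3 = _unit(_cross(e1, _sub(p2, p0)))
    if e3 is None:
        return None
    return p0, (e1, _cross(e3, e1), e3)

def address(label, xyz, origin, axes, bin_size):
    """Hash address: the atom label plus its binned frame coordinates."""
    rel = _sub(xyz, origin)
    return (label, tuple(round(_dot(rel, e) / bin_size) for e in axes))

def build_hash_table(models, bin_size=1.0):
    """models maps a model name to a list of (label, (x, y, z)) atoms."""
    table = defaultdict(list)
    for name, atoms in models.items():
        for rs in permutations(range(len(atoms)), 3):   # every ordered RS
            fr = frame(*(atoms[i][1] for i in rs))
            if fr is None:
                continue                                # skip collinear RS
            origin, axes = fr
            for k, (label, xyz) in enumerate(atoms):
                if k not in rs:
                    key = address(label, xyz, origin, axes, bin_size)
                    table[key].append((name, rs))
    return table

m1 = [("A", (0.0, 0.0, 0.0)), ("B", (1.0, 0.0, 0.0)),
      ("C", (0.0, 1.0, 0.0)), ("D", (0.0, 0.0, 1.0))]
table = build_hash_table({"M1": m1})
print(sum(len(entries) for entries in table.values()))
```

Binning by rounding is the simplest choice; points that fall near a bin boundary can land in different bins after small coordinate noise, so a practical version would also probe neighbouring bins.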

21
Determining Hash Table Entries with Model M1 and
Points 4 and 1 as Basis

22
Locations of Hash Table Entries for Model M1
after all bases, RS

23
Geometric Hashing Matching
  • Given an observed object:
  • 1. Choose an RS and compute the 3D basis associated
    with that RS
  • 2. Compute the coordinates of the other observed
    object points in that 3D basis
  • 3. For each point, enter the hash table at the
    address defined by the label and coordinates of
    that point

24
Geometric Hashing Matching (cont.)
  • For step 3: tally a vote for the (model, RS) pair of
    each entry found at the address; can histogram all
    hash table entries which received one or more
    votes
  • 4. If no pair scores high (as determined by a
    threshold), go back to 1 and begin with a
    different RS of the observed object

25
Geometric Hashing Matching (cont.)
  • 5. Consider all the models from step 4 and find the
    rigid motion that gives the best least-squares match
  • 6. Transform the model point set according to the
    transformation of step 5 and check consistency of
    all biological information (i.e. match labeling)
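
The voting in steps 3 and 4 can be sketched as follows. The example assumes the addresses of the observed points have already been computed for one chosen RS (steps 1 and 2), and it uses a small hand-written table in place of a real one. The least-squares fit and label check of steps 5 and 6 are not shown.

```python
from collections import Counter

def tally_votes(table, observed_addresses):
    """Step 3: one vote per (model, RS) entry found at each address."""
    votes = Counter()
    for addr in observed_addresses:
        for model_rs in table.get(addr, ()):
            votes[model_rs] += 1
    return votes

def candidates(votes, threshold):
    """Step 4: keep the (model, RS) pairs whose vote count meets the threshold."""
    return [(pair, n) for pair, n in votes.most_common() if n >= threshold]

# Toy table: address -> list of (model, RS) pairs. Contents are made up.
table = {
    ("CA", (1, 0, 0)): [("M1", (0, 1, 2))],
    ("CA", (0, 2, 0)): [("M1", (0, 1, 2)), ("M2", (3, 1, 0))],
    ("N", (1, 1, 1)): [("M1", (0, 1, 2))],
}
observed = [("CA", (1, 0, 0)), ("CA", (0, 2, 0)),
            ("N", (1, 1, 1)), ("O", (5, 5, 5))]

votes = tally_votes(table, observed)
hits = candidates(votes, threshold=2)
print(hits)
```

If the candidate list is empty, the procedure returns to step 1 with a different RS of the observed object.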

26
Modifications to Algorithm
  • Could modify the voting scheme
  • Could change the representation of coordinate axes
    to 2D coordinate axes (reduces the worst-case
    running time)
  • Could apply the representation to alpha-carbons
    only (no labeling allowed)
  • Could group atoms together into a single unit and
    analyze structures using these atom groups

27
Algorithm Performance
  • Experimented with bacterial proteins, bovine
    pancreas protein, calcium binding protein, bovine
    liver protein, and protein from hen egg
  • All experiments gave favorable to excellent
    results in terms of fit

28
Conclusion
  • Algorithm needs O(N × m^4) for the hash table (can
    be big for large N, m)
  • Running time of the algorithm can also be long
  • Can be parallelized (i.e. the representation stage
    is independent of the matching stage)
  • Sequence-order independent (i.e. insensitive to
    gaps, insertions, deletions)
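
The m^4 growth can be checked by counting entries directly. This count assumes N is the number of models, each model has m atoms, every ordered triplet of atoms is used as an RS, and each RS stores one entry for each remaining atom; the slides do not spell out these definitions.

```python
def hash_table_entries(num_models, m):
    """Upper bound on table entries: N * m(m-1)(m-2) bases * (m-3) points."""
    bases_per_model = m * (m - 1) * (m - 2)
    return num_models * bases_per_model * (m - 3)

for m in (10, 50, 100):
    print(m, hash_table_entries(num_models=10, m=m), 10 * m ** 4)
```

The exact count stays below N × m^4 and approaches it as m grows, which is why the table becomes large for big models.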