A unified statistical framework for sequence comparison and structure comparison

Provided by: Arken

Learn more at: http://www.cs.cornell.edu

1
A unified statistical framework for sequence
comparison and structure comparison

Michael Levitt and Mark Gerstein

2
Statistics Introduction

Statistics is the discipline that deals with
inference in the presence of variation
Given a score, how significant is it?
Key concepts: H0, HA, critical region, p-value
Extreme value distribution: the maximum over all
sequence scores follows an extreme value
distribution
Why the extreme value distribution is useful: the
score is maximized over all possible random
alignments

3
Introduction

Given sequence and structural scores, develop a
hypothesis-testing framework
H0: the two proteins compared are unrelated
Distribution of scores of unrelated proteins
determined empirically using PDB data at 40%
sequence identity
No assumption is made about the background distribution

4
Sequence Comparison Framework

Sequence score determined by SSEARCH with the
BLOSUM50 substitution matrix
The p.d.f. depends on Sseq (sequence score) and on
n and m (lengths of the two sequences compared)
All possible pairs were compared to determine the
p.d.f. empirically

5
P.D.F. for Sequence Score
6
Cross Section of p.d.f for constant ln(nm)
7
Density Distribution for constant ln(nm)

Density distribution follows the extreme value
distribution $\rho_{seq}(Z) = \exp(-Z - \exp(-Z))$
$Z = (S_{seq} - \mu_{seq}) / \sigma_{seq}$
$\mu_{seq} = a \ln(nm) + b$ (model average), with $a$ and $b$
fitted to the observed density by least squares
$\sigma_{seq} = a$
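
The standardization and density on this slide can be sketched in a few lines of Python. This is only an illustration: the function names `z_score` and `ev_density` are made up here, and the values of `a` and `b` below are placeholders, not the fitted parameters from the presentation.

```python
import math

def z_score(s_seq, n, m, a, b):
    """Standardize a raw score: Z = (S_seq - mu_seq) / sigma_seq,
    with mu_seq = a*ln(n*m) + b and sigma_seq = a."""
    mu = a * math.log(n * m) + b
    sigma = a
    return (s_seq - mu) / sigma

def ev_density(z):
    """Extreme value density rho(Z) = exp(-Z - exp(-Z))."""
    return math.exp(-z - math.exp(-z))

# Placeholder parameters (assumed for illustration, NOT the fitted values).
a, b = 5.0, -20.0
z = z_score(s_seq=60.0, n=150, m=200, a=a, b=b)
print(z, ev_density(z))
```

The density peaks at Z = 0, so a raw score equal to a ln(nm) + b is the most likely score for an unrelated pair of those lengths.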

8
Comparison to BLAST and FASTA statistics

Critical region used to determine the model
p-value: $P_{seq}(z > Z)$
Comparison of model p-values with BLAST p-values
found BLAST p-values higher than the model's
FASTA statistics: better coverage but more error
than the model
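
For the standardized extreme value density exp(-Z - exp(-Z)), integrating from Z to infinity gives the tail probability P(z > Z) = 1 - exp(-exp(-Z)). A minimal Python sketch of that p-value follows; the function name `ev_p_value` and the overflow guard are choices made for this example, not something taken from the slides.

```python
import math

def ev_p_value(z):
    """Tail probability P(z > Z) = 1 - exp(-exp(-Z)) of the standardized
    extreme value distribution."""
    if z < -700.0:
        # exp(-z) would overflow a float; the tail probability is 1 here.
        return 1.0
    # -expm1(-x) equals 1 - exp(-x) and stays accurate when x is tiny.
    return -math.expm1(-math.exp(-z))

for z in (0.0, 2.0, 5.0, 10.0):
    print(z, ev_p_value(z))
```

For large Z the p-value approaches exp(-Z), so each additional unit of Z lowers it by roughly a factor of e.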

9
Structure Comparison Algorithm
10
Structure Comparison Framework

The score obtained from the structure comparison
algorithm is Sstr
The p.d.f. for Sstr depends on N (number of residues
matched) and Sstr (pairs that scored high were
removed)
Kept N fixed and fitted an extreme value
distribution to the density, using all N

11
Comparison with RMS

RMS deviation of alpha-carbons after least-squares
fit is the traditional method
RMS score used to determine a p.d.f. in ln(RMS
score) and N
Comparison of RMS with Sstr found RMS worse than
Sstr in coverage and accuracy

12
Comparison with RMS (cont.)

Three reasons:
1. Sstr depends most strongly on the best-fitting
atoms; RMS depends most on the worst-fitting atoms
2. Sstr penalizes gaps; RMS does not
3. Sstr is analogous to Sseq in the sense that both
use dynamic programming

13
Comparison of Structure and Sequence Comparison
14
Concluding Remarks

Significance of sequence structure score can be
calculated from any structural alignment program
This method of statistical significance is
between FASTA and BLAST methods

15
Efficient Detection of Three-Dimensional
Structural Motifs in Biological Macromolecules By
Computer Vision Techniques

Ruth Nussinov Haim J. Wolfson

16
Introduction

One of the earlier papers addressing structure
comparison
Based on computer vision techniques (geometric
hashing paradigm)
No a priori predefined motif is assumed
Advantage: can be parallelized

17
Problem

Given 3D coordinates of atoms of two molecules,
find a rigid transformation (rotation and
translation allowed) so that a large number of
atoms of one molecule match the atoms of the
other molecule
Closely related to 3D rigid object recognition

18
Geometric Hashing Paradigm: Representation of
Geometric Constraints

Proteins represented as points using coordinate
frames (minimal representation of coordinate
frames)
Pick three noncollinear points to define a plane
(RS) and construct an orthogonal 3D coordinate
system based on RS

19
Representation of Geometric Constraints (cont.)

Define orthonormal vectors w.r.t. RS so that any
point can be represented as a linear combination
of the orthonormal vectors
To remove dependence on a particular RS (which may
preclude recognition if at least one of the RS
points does not match the input substructure),
represent the m points in all basis triplets
(i.e., all orthonormal vectors) with all possible
RS

20
Algorithm for Representation of Geometric
Constraints

For each RS:
Compute the orthonormal 3D basis associated with
the RS
Compute the coordinates of all other points in the
coordinate frame defined by the 3D basis
For each point, define a hash table address from
its labels and measurements
Use each address to enter the pair (model, RS)
into the hash table
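
The preprocessing loop above can be sketched in plain Python. This is a simplified illustration under several assumptions: an RS is taken to be an ordered triple of point indices, the hash address is just the quantized coordinates (atom labels are left out), and the names `frame`, `key`, `build_table` and the `cell` size are invented for this example.

```python
import math
from collections import defaultdict
from itertools import permutations

def sub(p, q):
    return tuple(a - b for a, b in zip(p, q))

def dot(p, q):
    return sum(a * b for a, b in zip(p, q))

def cross(p, q):
    return (p[1] * q[2] - p[2] * q[1],
            p[2] * q[0] - p[0] * q[2],
            p[0] * q[1] - p[1] * q[0])

def unit(p):
    n = math.sqrt(dot(p, p))
    return tuple(a / n for a in p)

def frame(p0, p1, p2, eps=1e-9):
    """Orthonormal frame (origin, ex, ey, ez) built from three points,
    or None if the points are (nearly) collinear."""
    u = sub(p1, p0)
    normal = cross(u, sub(p2, p0))
    if dot(u, u) < eps or dot(normal, normal) < eps:
        return None
    ex = unit(u)
    ez = unit(normal)
    ey = cross(ez, ex)  # completes a right-handed basis
    return p0, ex, ey, ez

def key(point, fr, cell=1.0):
    """Hash address: coordinates of the point in the frame, rounded to a grid."""
    origin, ex, ey, ez = fr
    d = sub(point, origin)
    return tuple(round(dot(d, e) / cell) for e in (ex, ey, ez))

def build_table(models, cell=1.0):
    """For every model and every RS, store (model, RS) at the address of
    each remaining point."""
    table = defaultdict(list)
    for name, pts in models.items():
        for i, j, k in permutations(range(len(pts)), 3):
            fr = frame(pts[i], pts[j], pts[k])
            if fr is None:
                continue
            for idx, p in enumerate(pts):
                if idx not in (i, j, k):
                    table[key(p, fr, cell)].append((name, (i, j, k)))
    return table

models = {"M1": [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]}
table = build_table(models)
print(len(table), "addresses filled")
```

Because the coordinates are expressed in a frame attached to the molecule itself, the addresses do not change under rotation or translation. With m points there are on the order of m^3 reference sets and m entries per set, which is why restricting the point set (for example to alpha-carbons) matters in practice.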

21
Determining Hash Table Entries with Model M1 and
Points 4 and 1 as Basis
22
Locations of Hash Table Entries for Model M1
after all bases, RS
23
Geometric Hashing Matching

Given an observed object:
1. Choose an RS and compute the 3D basis associated
with the RS
2. Compute the coordinates of the other observed
object points in the 3D basis
3. For each point, enter the hash table at the
address defined by labels and measurements and by
the label and coordinates of the new point

24
Geometric Hashing Matching (cont.)

For step 3: tally a vote for the (model, RS) pair
of each entry found at the address; can histogram
all hash table entries that received one or more
votes
4. If no pair scores high (as determined by a
threshold), go to step 1 and begin with a
different RS of the observed object

25
Geometric Hashing Matching (cont.)

5. Consider all the models from step 4 and find
the rigid motion that gives the best least-squares
match
6. Transform the model point set according to the
transformation of step 5 and check consistency of
all biological information (i.e., match labeling)
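
The vote tallying of steps 3 and 4 can be sketched as follows. The sketch assumes the hash table is a plain dictionary from address to a list of (model, RS) pairs and that the addresses of the observed points have already been computed for the chosen RS; the function name `vote` and the toy data are illustrative only. The least-squares fit and label check of steps 5 and 6 are not shown.

```python
from collections import Counter

def vote(table, scene_keys, threshold):
    """Tally one vote per (model, RS) entry found at each observed address
    and return the pairs whose vote count reaches the threshold."""
    votes = Counter()
    for address in scene_keys:
        for entry in table.get(address, ()):
            votes[entry] += 1
    return [(entry, n) for entry, n in votes.most_common() if n >= threshold]

# Toy hash table and observed addresses (made-up values).
table = {
    (0, 0, 1): [("M1", (0, 1, 2)), ("M2", (2, 0, 1))],
    (1, 1, 0): [("M1", (0, 1, 2))],
}
scene_keys = [(0, 0, 1), (1, 1, 0), (9, 9, 9)]
candidates = vote(table, scene_keys, threshold=2)
print(candidates)
```

An empty result corresponds to step 4: no pair scored high enough, so another RS of the observed object is chosen and the voting is repeated.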

26
Modifications to Algorithm

Possible modifications:
Modify the voting scheme
Change the representation of coordinate axes to 2D
coordinate axes (reduces worst-case running time)
Apply the representation to alpha-carbons only (no
labeling allowed)
Group atoms together into single units and analyze
structures using these atom groups

27
Algorithm Performance

Experimented with bacterial proteins, a bovine
pancreas protein, a calcium-binding protein, a
bovine liver protein, and a protein from hen egg
All experiments gave favorable to excellent
results in terms of fit

28
Conclusion