Statistical Significance Assessment of Local Sequence Alignments

1
Statistical Significance Assessment of Local
Sequence Alignments
Nicholas Chia, Department of Physics, Ohio State University
2
  • Gapped Local Alignment Statistics
  • Prevalent form of alignment in use,
    e.g., Smith-Waterman, BLAST
  • Hybrid PSI-BLAST
  • Well understood statistics
  • Position-specific gap costs improve homology
    detection

3
A Practical Approach to Significance Assessment
in Alignments with Gaps
  • Nicholas Chia and Ralf Bundschuh
  • Department of Physics
  • Ohio State University

4
Outline
  • Gapped Alignment
  • Statistical Significance Assessment
  • Markov Model Approach
  • Sequence Length Dependence
  • Numerical Results
  • Conclusion

5
Gapped Alignment
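[Figure: an example gapped alignment (G A T C G over G T A C -) and an equation for the alignment score, with labels for the scoring matrix, the gap cost, and the number of gaps.]
Smith-Waterman local alignment for [symbol lost] > 1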
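The slide names Smith-Waterman local alignment with a scoring matrix and a gap cost. A minimal sketch of that recursion follows, assuming a simple match/mismatch scoring matrix and a linear gap cost; the function name and the default values are illustrative choices, not parameters taken from the talk.

```python
def smith_waterman(a, b, match=1.0, mismatch=-1.0, gap=2.0):
    """Return the best local alignment score of strings a and b.

    Assumed scoring: `match` or `mismatch` for each aligned pair,
    minus `gap` for every gapped position (linear gap cost).
    """
    prev = [0.0] * (len(b) + 1)
    best = 0.0
    for i in range(1, len(a) + 1):
        curr = [0.0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The 0.0 option lets a local alignment restart at any cell.
            curr[j] = max(0.0,
                          prev[j - 1] + s,    # align a[i-1] with b[j-1]
                          prev[j] - gap,      # gap in b
                          curr[j - 1] - gap)  # gap in a
            best = max(best, curr[j])
        prev = curr
    return best
```

Only two rows of the dynamic programming table are kept, so memory grows with the shorter dimension while the run time stays proportional to the product of the two lengths.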
6
Significance Assessment
Studying the distribution of alignment scores among random sequences tells us how rare, and hence how biologically significant, a given alignment score is.
Evidence suggests that gapped alignment scores are distributed according to the Gumbel (extreme value) distribution, but statistically characterizing the slow exponential tail of the Gumbel (λ) requires a large number of simulations.
[Equation: the Gumbel distribution of the maximum alignment score, with labels for the score and the Gumbel parameters.]
Can we find a better method to understand gapped alignment distributions?
Can we evaluate λ faster?
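To make the role of the Gumbel parameters concrete, here is a small sketch that turns a score into a p-value. It assumes the standard Gumbel form P(S_max < s) = exp(-K m n exp(-λ s)), with the two parameters conventionally written λ and K and with m and n the sequence lengths; the exact form shown on the slide was lost, and the numbers in the usage line are made up for illustration.

```python
import math

def gumbel_pvalue(score, lam, K, m, n):
    """P(S_max >= score) for random sequences of lengths m and n.

    Assumes P(S_max < s) = exp(-K * m * n * exp(-lam * s)).
    """
    expected_hits = K * m * n * math.exp(-lam * score)  # the E-value
    # 1 - exp(-E), computed stably when E is very small.
    return -math.expm1(-expected_hits)

# Illustrative values only; lam and K are not taken from the talk.
p = gumbel_pvalue(score=50, lam=0.3, K=0.1, m=300, n=300)
```

Because the tail falls off as exp(-λ s), a small error in λ changes the p-value of a high score by a large factor, which is why the talk focuses on evaluating λ accurately.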
7
Small Change in Geometry
[Figure: alignment lattice for two example sequences, drawn in a fixed-width geometry.]
Needleman-Wunsch global alignment score
If we can solve for [symbol lost], we know [symbol lost]!
But how do we solve for [symbol lost]? That is, how do we account for the length dependence of the score?
Change geometry!
  • Fixing the width also fixes the state space,
    allowing us to model alignment as a Markov
    process.
  • This replaces a 2D length dependence with a 1D
    width dependence, where we can use a Markov
    matrix to solve for the infinite-t behavior.

8
The Markov Model
Using score differences between lattice sites as our elements, we can write a Markov matrix describing the transition from t to t+1. In this way we can describe the fixed-width dynamics for infinite t.
[Figure: lattice columns at t and t+1.]
To obtain information about the relevant quantity, the largest eigenvalue of the modified Markov matrix, we modify our Markov matrix to include the necessary [symbol lost] dependence.
[Equation lost in extraction; one of its symbols was labeled "all alignment parameters".]
9
Technique for Calculating the Largest Eigenvalue at Width W
The System Parameters
  • match-mismatch scoring matrix
  • linear gap costs
  • technical condition to reduce state space
  • periodic boundary conditions
  • Bernoulli randomness (uncorrelated bonds)
    approximation

Calculating the Largest Eigenvalue
  • Construct the modified Markov matrices
    symbolically.
  • The eigenvalue is simply too difficult to solve
    for symbolically: the matrices are large (10^5).
  • ARPACK solves for the largest eigenvalue with
    precision and speed, O(N log N), since the
    matrices are sparse (Lehoucq et al., SIAM 1997).
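The slide relies on ARPACK, which uses an implicitly restarted Arnoldi method. As a simpler stand-in for illustration only, the sketch below finds the largest eigenvalue of a sparse non-negative matrix by plain power iteration. The sparse format (one dict per row) and the function name are assumptions made for this example; this is not the procedure used in the talk.

```python
def largest_eigenvalue(rows, n, tol=1e-12, max_iter=10_000):
    """Power iteration on a sparse non-negative n x n matrix.

    rows[i] is a dict {j: value} holding the non-zero entries of row i.
    Only non-zero entries are visited, so one iteration costs time
    proportional to the number of non-zeros.
    """
    v = [1.0 / n] * n  # non-negative start vector with L1 norm 1
    estimate = 0.0
    for _ in range(max_iter):
        w = [sum(val * v[j] for j, val in rows[i].items()) for i in range(n)]
        norm = sum(abs(x) for x in w)  # tends to the top eigenvalue
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
        if abs(norm - estimate) < tol * max(1.0, norm):
            return norm
        estimate = norm
    return estimate

# Toy 2 x 2 example, matrix [[2, 1], [1, 3]].
rho = largest_eigenvalue([{0: 2.0, 1: 1.0}, {0: 1.0, 1: 3.0}], 2)
```

Power iteration converges at a rate set by the ratio of the two largest eigenvalue magnitudes, so Krylov methods such as the one in ARPACK are usually preferred for matrices of the size mentioned on the slide.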

10
Understanding the W-dependence
Definition
[Equation lost in extraction.]
... not quite that simple
  • cannot solve for extremely large W
  • non-trivial width dependence

So, what can we do?
Kardar-Parisi-Zhang (KPZ) Systems
[Figure labels: ASEP, KPZ, Sequence Alignment, Gapped Alignment]
KPZ systems all share properties on a coarse-grained level.
From Derrida and Lebowitz (PRL 1998) comes the scaling function G, which gives the form of the width dependence as follows:
[Equation lost in extraction.]
By understanding the W-dependence, we can solve for [symbol lost] and [symbol lost]!
11
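Calculating [symbol lost]
[Equations lost in extraction: one for [symbol lost], one for its width-W counterpart, and the W-dependence with parameter-dependent scaling factors a and b (subscripts lost).]
We can solve for [symbol lost] if we know the scaling factors a and b!
Fit the difference in order to obtain the scaling factors.
[Equation lost in extraction: the solution for [symbol lost].]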
12
Convergence of [symbol lost]
[Plot: [symbol lost] versus width W; vertical axis from 0.55 to 0.65, W from 0 to 12.]
13
Results
14
Conclusion
  • Succeeded in calculating λ with precision on
    fast timescales
  • Demonstrated a non-sampling method for
    calculating λ
  • Successful synthesis of computational biology,
    high performance numerics, and statistical
    physics
  • In principle, can be generalized to more complex
    scoring systems (affine gap costs, correlated
    bonds, PAM or BLOSUM matrices)

15
Detecting Protein Homologs using Hybrid
PSI-BLAST
Nicholas Chia, Yuheng Li, Mario Lauria, and Ralf Bundschuh
Department of Physics, Department of Computer Science and Engineering, and Biophysics Graduate Program, The Ohio State University
16
Outline
  • Protein Classification
  • Probabilistic Alignment
  • Hybrid Alignment
  • Statistical Significance Assessment
  • Model Building
  • PSI-BLAST vs. Hybrid PSI-BLAST
  • Conclusion

17
Protein Classification
[Figure: a query sequence searched against all known proteins, which are grouped into individual protein families, using BLAST and PSI-BLAST (position-specific scoring matrix).]
Reliable protein classification requires the use of structural information in order to detect distant family members.
[Plot: growth rate of protein structures versus sequences.]
The number of sequences is growing far faster than the number of structures.
How can we improve homology detection?
18
Probabilistic Alignment
[Figure: an example alignment (C A S E T over C S A - E) and an equation for the alignment weight, with labels for the scoring matrix (BLOSUM62), the gap weight, and the number of gaps.]
Typically, deterministic alignments measure homology by scoring only the optimal alignment path: one path, one score.
Probabilistic alignment, however, measures homology by summing the weights over all alignment paths: many paths, one score.
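The contrast between "one path, one score" and "many paths, one score" can be sketched with a single global dynamic program that tracks both quantities. This example assumes a match/mismatch score, a linear gap cost, and path weights equal to exp(score); all three are illustrative choices and not the BLOSUM62-based weights named on the slide.

```python
import math

def global_alignment_scores(a, b, match=1.0, mismatch=-1.0, gap=2.0):
    """Return (best path score, log of the summed weights of all paths).

    Assumption: a path with score S carries weight exp(S).
    """
    n, m = len(a), len(b)
    best = [[0.0] * (m + 1) for _ in range(n + 1)]   # max over paths
    total = [[0.0] * (m + 1) for _ in range(n + 1)]  # sum over paths
    total[0][0] = 1.0
    gap_weight = math.exp(-gap)
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            steps = []  # (score, weight) for each step entering (i, j)
            if i > 0:
                steps.append((best[i - 1][j] - gap,
                              total[i - 1][j] * gap_weight))
            if j > 0:
                steps.append((best[i][j - 1] - gap,
                              total[i][j - 1] * gap_weight))
            if i > 0 and j > 0:
                s = match if a[i - 1] == b[j - 1] else mismatch
                steps.append((best[i - 1][j - 1] + s,
                              total[i - 1][j - 1] * math.exp(s)))
            best[i][j] = max(score for score, _ in steps)
            total[i][j] = sum(weight for _, weight in steps)
    return best[n][m], math.log(total[n][m])
```

The log of the summed weight is never smaller than the best path score, because the optimal path is one of the terms in the sum. This simple recursion counts adjacent gaps taken in different orders as distinct paths.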
19
Hybrid Alignment
Deterministic Alignment
  • Global alignment searches for the highest
    scoring path from upper-left to lower-right and
    reports its score.
  • Local alignment searches all alignment segments
    beginning or ending anywhere and returns the
    highest scoring segment.
  • Local alignment scores for random sequences are
    distributed according to the Gumbel distribution.
  • Used in NCBI PSI-BLAST.

vs.

Hybrid Alignment
  • Global alignment sums weights over all alignment
    paths from upper-left to lower-right and reports
    a final score.
  • Local alignment sums over alignment segments
    beginning everywhere and returns the highest
    scoring endpoint.
  • The logarithms of local alignment scores for
    random sequences are distributed according to the
    Gumbel distribution.
  • Used in Hybrid PSI-BLAST.

O(N^2) computational complexity
[Figure: alignment lattices for the example sequences CASET and CASE.]
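The hybrid local rule described on this slide (sum over segments beginning everywhere, then take the highest scoring endpoint and its logarithm) can be sketched as follows. This is a simplified illustration with one pair weight per match or mismatch and a single linear gap weight; the recursion, boundary values, and default weights are assumptions for the example, not the affine scheme used in Hybrid PSI-BLAST.

```python
import math

def hybrid_score(a, b, match_w=3.0, mismatch_w=0.3, gap_w=0.0125):
    """Log of the largest endpoint sum Z[i][j] over the lattice.

    Z[i][j] adds up the weights of all alignment segments that end at
    (i, j), whatever their starting point. The leading 1.0 in the
    recursion is the empty segment that starts at (i, j).
    """
    prev = [1.0] * (len(b) + 1)  # boundary: only the empty segment
    z_max = 1.0
    for i in range(1, len(a) + 1):
        curr = [1.0] + [0.0] * len(b)
        for j in range(1, len(b) + 1):
            w = match_w if a[i - 1] == b[j - 1] else mismatch_w
            curr[j] = 1.0 + w * prev[j - 1] + gap_w * (prev[j] + curr[j - 1])
            z_max = max(z_max, curr[j])
        prev = curr
    return math.log(z_max)
```

The default weights were picked so that, for uniformly random four-letter sequences, the average pair weight (0.975) plus the two gap weights (0.025) adds up to 1 in this simplified recursion. The time cost is proportional to the product of the two sequence lengths, matching the O(N^2) noted on the slide.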
20
Significance Assessment
Studying the distribution of alignment scores among random sequences tells us how rare, and hence how biologically significant, a given alignment score is.
We need to solve for the two Gumbel parameters for Hybrid Alignment, while also incorporating position-specific scoring matrices and gap costs.
[Equation: the Gumbel distribution of the log of the maximum alignment score.]
Even easier?
If the average total weight entering a point equals 1, i.e., [equation lost in extraction], then [equation lost in extraction].
How can this property be used to benefit homology detection?
21
Position-Specific Gap Costs
Since λ = 1 for any [symbol lost], we may freely choose position-dependent gap weights without the usual difficulty of significance assessment.
[Figure labels: Hybrid PSI-BLAST, PSI-BLAST]
  • Choosing gap weights for a sequence model
  • Use existing alignments
  • Need to know gap probabilities in alignment

22
Measuring Probabilities
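[Figure: forward and backward alignment lattices for the example sequences CASET and CASE.]
Max score measures alignment likelihood
Forward-backward algorithm
[Equations lost in extraction: the probability of a gap transition, the probability of a match transition, and the probability of any transition, built from the forward and backward sums and the max score. The surviving fragment (1-2?) has lost its symbol.]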
23
Iterative Model Building
  1. Score the query against the sequences in the
     database. Select close homologs for inclusion
     in model building.
  2. Extract transition and amino acid emission
     probabilities of the homologs with the
     forward-backward algorithm.
  3. Construct an alignment model from the average
     transition and amino acid emission
     probabilities.
  4. Search the sequence database again, this time
     scoring the sequences in the database against
     the newly built model.
  5. Iterate steps 2-4 until convergence or until
     the maximum number of iterations is reached.
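The iteration above can be written as a short control loop. In this sketch, `score` and `build_model` are hypothetical callables supplied by the caller (standing in for the hybrid scoring step and the forward-backward model construction), and convergence is taken to mean that the selected set of homologs stops changing; both are assumptions made for illustration.

```python
def iterative_search(query_model, database, score, build_model,
                     threshold, max_rounds=5):
    """Repeat search and model building until the hit set is stable.

    database    : dict mapping a sequence name to its sequence
    score       : hypothetical callable (model, sequence) -> float
    build_model : hypothetical callable (list of sequences) -> model
    """
    model = query_model
    selected = frozenset()
    for _ in range(max_rounds):
        # Search: score every database sequence against the current model.
        hits = frozenset(name for name, seq in database.items()
                         if score(model, seq) >= threshold)
        if hits == selected:
            break  # converged: no change in the selected homologs
        selected = hits
        # Model building from the selected homologs.
        model = build_model([database[name] for name in sorted(selected)])
    return model, selected
```

A toy call might pass `score=lambda model, seq: len(set(model) & set(seq))` and `build_model=lambda seqs: set("".join(seqs))`, which grows a set of letters round by round until no new sequences are pulled in.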

24
Technical Details
The System Parameters
  • affine gap scheme
    transitions: match→match, match→insertion,
    match→deletion, insertion→match,
    insertion→insertion, deletion→match,
    deletion→deletion
    insertion [symbol lost] deletion,
    deletion [symbol lost] insertion
  • finite size correction of the Gumbel parameters
    λ and K
  • initial scoring scheme BLOSUM62 (11,1)
  • 5 rounds of iteration
  • pseudocount
  • weighting scheme
  • adaptive model inclusion threshold

The Database
  • Superfamily PDB90 v1.69
  • Fuzzy counting: SUPERFAMILY counting rules

25
Results
26
Conclusions
  • Hybrid combines the features of probabilistic
    alignment with well understood significance
    assessments and can be used to improve sequence
    homology detection
  • Position-specific gap weights improve homology
    detection

Future Work
  • Refine method for choosing gap weights
  • Using suboptimal alignment information for
    identifying "corrupt models": models built from
    nonmembers