Statistical Significance Assessment of Local Sequence Alignments


1
Statistical Significance Assessment of Local Sequence Alignments
Nicholas Chia
Department of Physics
Ohio State University
2

Gapped Local Alignment Statistics
Prevalent form of alignment used, e.g., Smith-Waterman, BLAST
Hybrid PSI-BLAST
Well understood statistics
Position-specific gap costs improve homology detection

3
A Practical Approach to Significance Assessment in Alignments with Gaps

Nicholas Chia and Ralf Bundschuh
Department of Physics
Ohio State University

4
Outline

Gapped Alignment
Statistical Significance Assessment
Markov Model Approach
Sequence Length Dependence
Numerical Results
Conclusion

5
Gapped Alignment
[Equation labels: alignment score, scoring matrix, gap cost, number of gaps]
[Figure: example gapped alignment with rows G A T C G and G T A C -]
Smith-Waterman Local Alignment for ? > 1
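The slide names Smith-Waterman local alignment with a scoring matrix and a gap cost. A minimal sketch of that recursion follows, assuming a simple match/mismatch scoring matrix and a linear gap cost; the function name and the default parameter values are illustrative choices, not values taken from the talk.

```python
def smith_waterman(a, b, match=1.0, mismatch=-1.0, gap=2.0):
    """Best local alignment score of a and b.

    Assumptions (not from the talk): match/mismatch scoring and a
    linear cost of `gap` per gapped letter.
    """
    prev = [0.0] * (len(b) + 1)
    best = 0.0
    for i in range(1, len(a) + 1):
        cur = [0.0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Restart at 0, extend diagonally, or open a gap in either sequence.
            cur[j] = max(0.0, prev[j - 1] + s, prev[j] - gap, cur[j - 1] - gap)
            best = max(best, cur[j])
        prev = cur
    return best


print(smith_waterman("ACGTACGT", "ACGACGT"))
```

With these defaults the best local alignment of the two example strings matches seven letters and pays for one gap.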
6
Significance Assessment
Studying the distribution of alignment scores among random sequences yields information about the rarity, a.k.a., biological import, of a given alignment score
Evidence suggests that gapped alignment scores are distributed according to the Gumbel or extreme value distribution
[Equation labels: score, Gumbel parameters, maximum alignment score]
but statistically characterizing the slow exponential tail of the Gumbel (λ) requires a large number of simulations.
Can we find a better method to understand gapped alignment distributions?
Can we evaluate λ faster?
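To make the sampling cost concrete, here is a sketch of the conventional route: collect maximum scores from many random sequence pairs, then fit a Gumbel law. The method-of-moments fit, the (lam, loc) parametrization, and all function names are assumptions chosen for illustration; the sampler draws directly from a Gumbel law as a stand-in for real alignment scores.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329


def fit_gumbel_moments(scores):
    """Method-of-moments estimate of the Gumbel decay rate and location."""
    sigma = statistics.stdev(scores)
    lam = math.pi / (sigma * math.sqrt(6.0))
    loc = statistics.mean(scores) - EULER_GAMMA / lam
    return lam, loc


def gumbel_p_value(score, lam, loc):
    """P(S >= score) for the CDF exp(-exp(-lam * (s - loc)))."""
    return -math.expm1(-math.exp(-lam * (score - loc)))


def sample_gumbel(rng, lam, loc):
    """Inverse-CDF draw; stands in for one simulated maximum score."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return loc - math.log(-math.log(u)) / lam


rng = random.Random(0)
scores = [sample_gumbel(rng, 0.6, 10.0) for _ in range(20000)]
lam_hat, loc_hat = fit_gumbel_moments(scores)
print(lam_hat, loc_hat, gumbel_p_value(20.0, lam_hat, loc_hat))
```

In a real study each sample would require a full alignment of two random sequences, which is why many thousands of samples become expensive.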
7
Small Change in Geometry
[Figure: alignment lattice with DNA sequence letters along its axes; labels: Needleman-Wunsch global alignment score, t+1]
If we can solve for ?, we know ?!
but how do we solve for ??
i.e., how do we account for the length dependence of the score?
Change geometry!

fixing the width also fixes the state space, allowing us to model alignment as a Markov process

replaces a 2D length dependence with a 1D width dependence, where we can use a Markov matrix to solve the infinite-t behavior
8
The Markov Model
Using score differences between lattice sites as our elements, we can write a Markov matrix describing the transition from t to t+1
In this way we can describe the fixed-width dynamics for infinite t
In order to obtain information about the relevant quantity, we modify our Markov matrix to include the necessary ? dependence.
[Equation labels: largest eigenvalue of the modified Markov matrix, all alignment parameters]
[Figure labels: t, t+1]
9
Technique for Calculating ?w(??)
The System Parameters - ?

match-mismatch scoring matrix

linear gap costs

technical condition to reduce state space

periodic boundary conditions

Bernoulli randomness (uncorrelated bonds)
approximation

Calculating ?w(??)

construct the modified Markov matrices symbolically
eigenvalue simply too difficult to solve for symbolically: large matrices (10^5)
ARPACK solves for the largest eigenvalue ?w with precision and speed since the matrices are sparse (N log N) - Lehoucq et al., SIAM 1997
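The numerical step above is finding the largest eigenvalue of a large sparse matrix. The sketch below swaps in plain power iteration for ARPACK's implicitly restarted Arnoldi method, so it only illustrates the idea and is not the solver used in the talk. The sparse row-list layout and the function name are assumptions, and the method as written presumes a non-negative matrix with a single dominant eigenvalue.

```python
def largest_eigenvalue(rows, iters=1000, tol=1e-12):
    """Power iteration for the dominant eigenvalue of a sparse matrix.

    rows[i] is a list of (column, value) pairs for the nonzeros of row i.
    Assumes non-negative entries and a unique dominant eigenvalue.
    """
    n = len(rows)
    v = [1.0 / n] * n  # start vector, normalized so its entries sum to 1
    lam = 0.0
    for _ in range(iters):
        w = [sum(val * v[j] for j, val in row) for row in rows]
        new_lam = sum(w)  # growth of the L1 norm estimates the eigenvalue
        if new_lam == 0.0:
            return 0.0
        v = [x / new_lam for x in w]
        if abs(new_lam - lam) < tol:
            return new_lam
        lam = new_lam
    return lam


# Toy 2x2 example; the exact answer is (5 + sqrt(33)) / 2.
toy = [[(0, 1.0), (1, 2.0)], [(0, 3.0), (1, 4.0)]]
print(largest_eigenvalue(toy))
```

Only the nonzero entries are touched in each sweep, which is what makes sparse matrices of the size quoted on the slide tractable. In Python, `scipy.sparse.linalg.eigs` wraps ARPACK and would be the closer match to the slide.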

10
Understanding the W-dependence
Definition
... not quite that simple

cannot solve for extremely large W
non-trivial width dependence

So, what can we do?
Kardar-Parisi-Zhang Systems
[Diagram labels: ASEP, KPZ, Sequence Alignment, Gapped Alignment]
KPZ systems all share properties on a coarse-grained level
Derrida & Lebowitz, PRL 1998
From Derrida and Lebowitz comes the scaling function G, which gives the form of the width dependence as follows
By understanding the W-dependence, we can solve for ? and ?!
11
Calculating ?
[Equation labels: Equation for ?, Equation for ?W, W-dependence, parameter dependent scaling factors]
Can solve for ? if we know the scaling factors a? and b?!
Fit difference in order to obtain scaling factors
Solution for ?
12
Convergence of ?
[Plot: ? versus W, with the vertical axis running from 0.55 to 0.65 and W from 0 to 12]
13
Results
14
Conclusion

Succeeded in calculating λ with precision on fast timescales
Demonstrated a non-sampling method for calculating λ
Successful synthesis of computational biology, high performance numerics, and statistical physics
In principle, can be generalized to more complex scoring systems (affine gap costs, correlated bonds, PAM or BLOSUM matrices)

15
Detecting Protein Homologs using Hybrid PSI-BLAST
Nicholas Chia, Yuheng Li, Mario Lauria, and Ralf Bundschuh
Department of Physics, Department of Computer Science and Engineering, and Biophysics Graduate Program
The Ohio State University
16
Outline

Protein Classification
Probabilistic Alignment
Hybrid Alignment
Statistical Significance Assessment
Model Building
PSI-BLAST vs. Hybrid PSI-BLAST
Conclusion

17
Protein Classification
[Figure labels: individual protein families, all known proteins, query sequence]
BLAST
PSI-BLAST (position specific scoring matrix)
Reliable protein classification requires the use of structural information in order to detect distant family members
Number of sequences growing far faster than number of structures
[Plot labels: growth rate, protein structures, sequences]
How can we improve homology detection?
18
Probabilistic Alignment
[Equation labels: alignment weight, scoring matrix (BLOSUM62), gap weight, number of gaps]
[Figure: example alignment with rows C A S E T and C S A - E]
Typically, deterministic alignments measure homology by scoring only the optimal alignment path: one path, one score
However, probabilistic alignment measures homology by summing the weights over all alignment paths: many paths, one score
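The contrast between "one path, one score" and "many paths, one score" can be sketched with two global alignment recursions that differ only in replacing max with a sum. The match/mismatch scores, the linear gap cost, and the choice of path weight exp(path score) are assumptions for illustration; the talk itself uses BLOSUM62.

```python
import math


def pair_score(x, y, match=1.0, mismatch=-1.0):
    """Toy scoring matrix (an assumption; the slides use BLOSUM62)."""
    return match if x == y else mismatch


def best_path_score(a, b, gap=2.0):
    """Deterministic global alignment: score of the single best path."""
    prev = [-gap * j for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [-gap * i] + [0.0] * len(b)
        for j in range(1, len(b) + 1):
            cur[j] = max(prev[j - 1] + pair_score(a[i - 1], b[j - 1]),
                         prev[j] - gap,
                         cur[j - 1] - gap)
        prev = cur
    return prev[-1]


def all_paths_log_weight(a, b, gap=2.0):
    """Probabilistic global alignment: log of the summed path weights,
    where each path carries weight exp(path score)."""
    mu = math.exp(-gap)  # weight of one gapped letter
    prev = [mu ** j for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [mu ** i] + [0.0] * len(b)
        for j in range(1, len(b) + 1):
            w = math.exp(pair_score(a[i - 1], b[j - 1]))
            cur[j] = prev[j - 1] * w + mu * (prev[j] + cur[j - 1])
        prev = cur
    return math.log(prev[-1])


print(best_path_score("CASET", "CASE"), all_paths_log_weight("CASET", "CASE"))
```

Because the best path is one of the terms in the sum, the log of the total weight can never fall below the best path score.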
19
Hybrid Alignment
Deterministic Alignment vs. Hybrid Alignment
[Figure: alignment lattices with sequence letters C, A, S, E, T along the axes]

Deterministic Alignment (NCBI PSI-BLAST)
Global alignment algorithm searches for the highest scoring path from upper-left to lower-right and reports its score
Local alignment searches all alignment segments beginning or ending anywhere and returns the highest scoring segment
Local alignment scores for random sequences are distributed according to the Gumbel distribution

Hybrid Alignment (Hybrid PSI-BLAST)
Global alignment sums weights over all alignment paths from upper-left to lower-right and reports a final score
Local alignment sums over alignment segments beginning everywhere and returns the highest scoring endpoint
The logarithm of the local alignment score for random sequences is distributed according to the Gumbel distribution

O(N^2) computational complexity
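One plausible reading of the hybrid local rule (sum over segments beginning everywhere, then take the highest scoring endpoint) is sketched below. The recursion, the boundary value of 1, the single linear gap weight `mu`, and the toy weight function are all assumptions made for illustration, not the exact scheme used in Hybrid PSI-BLAST.

```python
import math


def toy_weight(x, y):
    """Hypothetical pair weight: e for a match, 1/e for a mismatch."""
    return math.e if x == y else 1.0 / math.e


def hybrid_score(a, b, weight, mu):
    """Max over endpoints of ln Z, where Z sums the weights of all
    local paths ending at that endpoint. The '1 +' term lets a new
    alignment begin at every lattice point."""
    prev = [1.0] * (len(b) + 1)  # boundary: only the empty alignment
    best = 0.0
    for i in range(1, len(a) + 1):
        cur = [1.0] + [0.0] * len(b)
        for j in range(1, len(b) + 1):
            cur[j] = (1.0
                      + weight(a[i - 1], b[j - 1]) * prev[j - 1]
                      + mu * (prev[j] + cur[j - 1]))
            best = max(best, math.log(cur[j]))
        prev = cur
    return best


print(hybrid_score("CASET", "CASE", toy_weight, 0.05))
```

The double loop visits each lattice point once, consistent with the O(N^2) complexity quoted on the slide.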
20
Significance Assessment
Studying the distribution of alignment scores
among random sequences yields information about
the rarity, a.k.a., biological import, of a given
alignment score
Need to solve for the two Gumbel parameters for Hybrid Alignment, but while incorporating position-specific scoring matrices and gap costs.
[Equation label: log of maximum alignment score]
even easier ???
If the average total weight entering a point equals 1, then λ = 1
How can this property be used to benefit homology detection?
21
Position-Specific Gap Costs
Since λ = 1 for any ?, we may freely choose position-dependent gap weights without undergoing the usual difficulty of the significance assessment.
Hybrid PSI-BLAST
PSI-BLAST

Choosing gap weights for a sequence model
Use existing alignments
Need to know gap probabilities in alignment

22
Measuring Probabilities
[Figure: alignment lattice with sequence letters C, A, S, E, T; labels: forward, backward, x, max score; transition weights ?, ?, ?, (1-2?)]
max score measures alignment likelihood
forward-backward algorithm
Probability of a gap transition
Probability of a match transition
Probability of any transition
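As an illustration of how a forward-backward pass yields transition probabilities, the sketch below computes the posterior probability of each match transition in a simple global model. The model (one pair weight per aligned letter pair and one linear gap weight `mu`), the toy weights, and every function name are assumptions; the talk's model uses affine gaps and its own normalization.

```python
def toy_weight(x, y):
    """Hypothetical pair weight: 2 for a match, 0.5 for a mismatch."""
    return 2.0 if x == y else 0.5


def forward(a, b, w, mu):
    """F[i][j]: summed weight of all paths from (0, 0) to (i, j)."""
    n, m = len(a), len(b)
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    F[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i and j:
                F[i][j] += F[i - 1][j - 1] * w(a[i - 1], b[j - 1])
            if i:
                F[i][j] += mu * F[i - 1][j]
            if j:
                F[i][j] += mu * F[i][j - 1]
    return F


def backward(a, b, w, mu):
    """B[i][j]: summed weight of all paths from (i, j) to (n, m)."""
    n, m = len(a), len(b)
    B = [[0.0] * (m + 1) for _ in range(n + 1)]
    B[n][m] = 1.0
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i < n and j < m:
                B[i][j] += w(a[i], b[j]) * B[i + 1][j + 1]
            if i < n:
                B[i][j] += mu * B[i + 1][j]
            if j < m:
                B[i][j] += mu * B[i][j + 1]
    return B


def match_posteriors(a, b, w, mu):
    """P[i][j]: probability that a[i] is aligned to b[j]."""
    F, B = forward(a, b, w, mu), backward(a, b, w, mu)
    total = F[len(a)][len(b)]
    return [[F[i][j] * w(a[i], b[j]) * B[i + 1][j + 1] / total
             for j in range(len(b))] for i in range(len(a))]


P = match_posteriors("CASET", "CASE", toy_weight, 0.1)
print([round(P[i][i], 3) for i in range(4)])
```

Gap transition probabilities follow the same pattern, with the gap weight in place of the pair weight and the backward value taken one step down or one step right.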
23
Iterative Model Building

1. Score query against sequences in database. Select close homologs for inclusion in model building.
2. Extract transition and amino acid emission probabilities of homologs with the forward-backward algorithm.
3. Construct an alignment model from the average transition and amino acid emission probabilities.
4. Search sequence database again, this time scoring sequences in database against the newly built model.
5. Iterate steps 2-4 until convergence or maximum iterations are reached.

[Figure: alignment lattice with sequence letters C, A, S, E, T]
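The control flow of the iterative procedure can be sketched as a generic loop. Everything here is hypothetical scaffolding: the callables `score` and `build_model`, the fixed inclusion threshold, and the convergence test (the selected set stops changing) are assumptions, and the toy usage substitutes numbers for sequences and a mean for the alignment model.

```python
def iterative_search(initial_model, database, score, build_model,
                     threshold, max_rounds=5):
    """Repeat search, select, and rebuild until the selected set is stable.

    score(model, item) and build_model(selected) are caller-supplied
    placeholders for the scoring and model-building steps.
    """
    model = initial_model
    selected = frozenset()
    for round_number in range(1, max_rounds + 1):
        hits = frozenset(x for x in database if score(model, x) >= threshold)
        if hits == selected:  # convergence: no change in the selected set
            return model, selected, round_number
        selected = hits
        model = build_model(selected)
    return model, selected, max_rounds


# Toy usage: items are numbers and the "model" is their mean.
toy_db = [0, 1, 2, 3, 4, 10]
model, chosen, rounds = iterative_search(
    0.0, toy_db,
    score=lambda m, x: -abs(m - x),
    build_model=lambda s: sum(s) / len(s),
    threshold=-2.0,
)
print(model, sorted(chosen), rounds)
```

The default of five rounds mirrors the "5 rounds of iteration" listed under the technical details; a real implementation would also need the pseudocounts, weighting scheme, and adaptive inclusion threshold mentioned there.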
24
Technical Details
The System Parameters
affine gap scheme
match → match, match → insertion, match → deletion
insertion → match, insertion → insertion
deletion → match, deletion → deletion
insertion ? deletion
deletion ? insertion

finite size correction of the Gumbel parameters λ and K

initial scoring scheme BLOSUM62 (11,1)

5 rounds of iteration

pseudocount

weighting scheme

adaptive model inclusion threshold

The Database

Superfamily PDB90 v1.69

Fuzzy counting: SUPERFAMILY counting rules

25
Results
26
Conclusions

Hybrid combines the features of probabilistic alignment with well understood significance assessments and can be used to improve sequence homology detection
Position specific gap weights improve homology detection

Future Work