Title: Statistical Significance Assessment of Local Sequence Alignments
1 Statistical Significance Assessment of Local Sequence Alignments
Nicholas Chia, Department of Physics, Ohio State University
2
- Gapped Local Alignment Statistics
  - Prevalent form of alignment used
  - e.g., Smith-Waterman, BLAST
- Hybrid PSI-BLAST
  - Well understood statistics
  - Position-specific gap costs improve homology detection
3 A Practical Approach to Significance Assessment in Alignments with Gaps
- Nicholas Chia and Ralf Bundschuh
- Department of Physics
- Ohio State University
4 Outline
- Gapped Alignment
- Statistical Significance Assessment
- Markov Model Approach
- Sequence Length Dependence
- Numerical Results
- Conclusion
5 Gapped Alignment
[Figure: example gapped alignment with letters G A T C G and G T A C -; equation labels: alignment score, scoring matrix, gap cost, number of gaps]
Smith-Waterman local alignment for ? > 1
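The gapped local alignment score named on this slide can be illustrated with a minimal Smith-Waterman sketch. It assumes a linear gap cost and a simple match/mismatch scoring matrix; the function name `smith_waterman` and the default values are illustrative assumptions, not the parameters used in the talk.

```python
def smith_waterman(a, b, match=1.0, mismatch=-1.0, gap=2.0):
    """Best local alignment score of strings a and b with a linear gap cost.

    The score of an alignment is the sum of the scoring-matrix entries of the
    aligned letter pairs minus `gap` times the number of gaps.
    """
    prev = [0.0] * (len(b) + 1)  # row i-1 of the dynamic-programming table
    best = 0.0
    for i in range(1, len(a) + 1):
        curr = [0.0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0.0,               # start a new local alignment here
                prev[j - 1] + s,   # align a[i-1] with b[j-1]
                prev[j] - gap,     # gap in b
                curr[j - 1] - gap  # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best
```

Only two rows of the table are kept in memory, since the local score needs the maximum entry and not the alignment path itself.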
6 Significance Assessment
Studying the distribution of alignment scores among random sequences yields information about the rarity, and hence the biological import, of a given alignment score.
Evidence suggests that gapped alignment scores are distributed according to the Gumbel (extreme value) distribution, but statistically characterizing the slow exponential tail of the Gumbel distribution (λ) requires a large number of simulations.
[Equation labels: score, Gumbel parameters, maximum alignment score]
Can we find a better method to understand gapped alignment distributions?
Can we evaluate λ faster?
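To make the role of the Gumbel parameters concrete, here is a small sketch of how a score is turned into a p-value once the parameters are known. It assumes the standard Karlin-Altschul form P(S >= x) = 1 - exp(-K m n exp(-λ x)) for sequences of lengths m and n; the function name and the example parameter values are hypothetical, not results from the talk.

```python
import math

def gumbel_p_value(score, lam, K, m, n):
    """P-value of a local alignment score under a Gumbel distribution.

    lam and K are the two Gumbel parameters; m and n are the sequence lengths.
    """
    expected_hits = K * m * n * math.exp(-lam * score)  # the E-value
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for tiny x
    return -math.expm1(-expected_hits)

# Hypothetical parameter values, for illustration only
p = gumbel_p_value(score=40.0, lam=0.6, K=0.1, m=300, n=300)
```

Because the tail decays as exp(-λ x), a small error in λ changes the p-value of a high score by a large factor, which is why an accurate value of λ matters.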
7 Small Change in Geometry
[Figure: alignment lattice for two example nucleotide sequences; axis label t+1]
Needleman-Wunsch global alignment score
If we can solve for ?, we know ?!
but how do we solve for ??
i.e., how do we account for the length dependence of the score?
Change geometry!
- fixing the width also fixes the state space, allowing us to model alignment as a Markov process
- replaces a 2D length dependence with a 1D width dependence, where we can use a Markov matrix to solve for the infinite-t behavior
8 The Markov Model
Using score differences between lattice sites as our elements, we can write a Markov matrix describing the transition from t to t+1.
In this way we can describe the fixed-width dynamics for infinite t.
[Figure: lattice at steps t and t+1]
In order to obtain information about the relevant quantity, we modify our Markov matrix to include the necessary ? dependence.
[Equation labels: largest eigenvalue of the modified Markov matrix, all alignment parameters]
9 Technique for Calculating ?w(??)
The System Parameters - ?
- match-mismatch scoring matrix
- technical condition to reduce state space
- periodic boundary conditions
- Bernoulli randomness (uncorrelated bonds) approximation
Calculating ?w(??)
- construct the modified Markov matrices symbolically
- eigenvalue simply too difficult to solve symbolically for large matrices (10^5)
- ARPACK solves for the largest eigenvalue ?w with precision and speed since the matrices are sparse (N log N); Lehoucq et al., SIAM 1997
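The numerical step on this slide, finding the largest eigenvalue of a sparse non-negative matrix, can be illustrated with plain power iteration. This is a deliberately simple stand-in: ARPACK uses the implicitly restarted Arnoldi method, which is not what the sketch implements. The sparse row format and the function name are assumptions made for the example.

```python
def largest_eigenvalue(rows, n, tol=1e-12, max_iter=10000):
    """Largest eigenvalue of a sparse non-negative n x n matrix.

    rows maps a row index i to a list of (j, value) pairs, one per
    nonzero entry in row i. Power iteration, not ARPACK's Arnoldi method.
    """
    v = [1.0 / n] * n  # non-negative start vector with entries summing to 1
    estimate = 0.0
    for _ in range(max_iter):
        w = [0.0] * n
        for i, entries in rows.items():
            w[i] = sum(value * v[j] for j, value in entries)
        # v sums to 1, so sum(M v) converges to the largest eigenvalue
        new_estimate = sum(w)
        if new_estimate == 0.0:
            return 0.0
        v = [x / new_estimate for x in w]
        if abs(new_estimate - estimate) < tol:
            return new_estimate
        estimate = new_estimate
    return estimate

# Small dense example stored in the sparse row format
example = {0: [(0, 1.0), (1, 2.0)], 1: [(0, 3.0), (1, 4.0)]}
rho = largest_eigenvalue(example, 2)
```

Each iteration costs one sparse matrix-vector product, so the work scales with the number of nonzero entries and not with the square of the matrix size.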
10 Understanding the W-dependence
Definition
... not quite that simple
- cannot solve for extremely large W
- non-trivial width dependence
So, what can we do?
Kardar-Parisi-Zhang Systems
[Diagram labels: ASEP, KPZ, Sequence Alignment, Gapped Alignment]
KPZ systems all share properties on a coarse-grained level.
From Derrida and Lebowitz (PRL 1998) comes the scaling function G, which gives the form of the width dependence.
By understanding the W-dependence, we can solve for ? and ?!
11 Calculating λ
parameter dependent scaling factors
Equation for λ
Equation for ?W
W-dependence
Can solve for λ if we know the scaling factors a? and b?!
Fit difference in order to obtain scaling factors
Solution for λ
12 Convergence of λ
[Plot: λ (vertical axis, ticks from 0.55 to 0.65) versus W (horizontal axis, 0 to 12)]
13 Results
14 Conclusion
- Succeeded in calculating λ with precision on fast timescales
- Demonstrated a non-sampling method for calculating λ
- Successful synthesis of computational biology, high performance numerics, and statistical physics
- In principle, can be generalized to more complex scoring systems (affine gap costs, correlated bonds, PAM or BLOSUM matrices)
15 Detecting Protein Homologs using Hybrid PSI-BLAST
Nicholas Chia, Yuheng Li, Mario Lauria, and Ralf Bundschuh
Department of Physics, Department of Computer Science and Engineering, Biophysics Graduate Program, The Ohio State University
16 Outline
- Protein Classification
- Probabilistic Alignment
- Hybrid Alignment
- Statistical Significance Assessment
- Model Building
- PSI-BLAST vs. Hybrid PSI-BLAST
- Conclusion
17 Protein Classification
[Diagram labels: query sequence, all known proteins, individual protein families, BLAST, PSI-BLAST (position specific scoring matrix)]
Reliable protein classification requires the use of structural information in order to detect distant family members.
Number of sequences growing far faster than number of structures.
[Plot: growth rate of protein structures and sequences]
How can we improve homology detection?
18 Probabilistic Alignment
[Figure: example alignment with letters C A S E T and C S A - E; equation labels: alignment weight, scoring matrix (BLOSUM62), gap weight, number of gaps]
Typically, deterministic alignments measure homology by scoring only the optimal alignment path: one path, one score.
However, probabilistic alignment measures homology by summing the weights over all alignment paths: many paths, one score.
19 Hybrid Alignment
Deterministic Alignment (NCBI PSI-BLAST)
- Global alignment algorithm searches for the highest scoring path from upper-left to lower-right and reports its score
- Local alignment searches all alignment segments beginning or ending anywhere and returns the highest scoring segment
- Local alignment scores for random sequences are distributed according to the Gumbel distribution
vs.
Hybrid Alignment (Hybrid PSI-BLAST)
- Global alignment sums weights over all alignment paths from upper-left to lower-right and reports a final score
- Local alignment sums over alignment segments beginning everywhere and returns the highest scoring endpoint
- Logarithms of local alignment scores for random sequences are distributed according to the Gumbel distribution
O(N^2) computational complexity
[Figure: alignment lattices for the example sequence C A S E T]
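The contrast between the two columns can be sketched in code: where a deterministic local alignment takes a maximum over incoming paths, the hybrid version sums their weights and then reports the logarithm of the largest endpoint. The sketch below uses a simplified single-state recursion with one position-independent gap weight, which is an assumption made for brevity; it does not enforce any normalization condition on the weights, and the function and parameter names are illustrative.

```python
import math

def hybrid_score(a, b, weight, gap_weight):
    """Log of the largest endpoint weight in a summed local alignment.

    weight(x, y) is the multiplicative weight for aligning letters x and y;
    gap_weight is the multiplicative weight of a single gap.
    """
    prev = [1.0] * (len(b) + 1)  # boundary weight 1 plays the role of score 0
    best = 1.0
    for i in range(1, len(a) + 1):
        curr = [1.0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            curr[j] = (
                1.0                                        # segment starting here
                + weight(a[i - 1], b[j - 1]) * prev[j - 1]  # extend by a match
                + gap_weight * (prev[j] + curr[j - 1])      # extend by a gap
            )
            best = max(best, curr[j])
        prev = curr
    return math.log(best)
```

Like the deterministic version, this fills an N x N table once, which matches the O(N^2) complexity noted on the slide.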
20 Significance Assessment
Studying the distribution of alignment scores among random sequences yields information about the rarity, and hence the biological import, of a given alignment score.
Need to solve for the two Gumbel parameters for Hybrid Alignment, but while incorporating position-specific scoring matrices and gap costs.
[Equation label: log of maximum alignment score]
even easier ???
If the average total weight entering a point equals 1, then λ = 1.
How can this property be used to benefit homology detection?
21 Position-Specific Gap Costs
Since λ = 1 for any ?, we may freely choose position-dependent gap weights without undergoing the usual difficulty of the significance assessment.
[Figure labels: Hybrid PSI-BLAST, PSI-BLAST]
- Choosing gap weights for a sequence model
- Use existing alignments
- Need to know gap probabilities in alignment
22 Measuring Probabilities
[Figure: alignment lattice with letters C A S E T and C A S E; labels: forward x backward, max score, transition weights ?, ?, ?, (1-2?)]
max score measures alignment likelihood
forward-backward algorithm
- Probability of a gap transition
- Probability of a match transition
- Probability of any transition
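As an illustration of the forward-backward idea named on this slide, the sketch below computes the posterior probability of each match transition in a global probabilistic alignment: the summed weight of all paths through that match divided by the summed weight of all paths. It assumes a simplified single-state model with one gap weight; the function name and the example weights are hypothetical.

```python
def match_posteriors(a, b, weight, gap_weight):
    """Posterior probability that a[i-1] is aligned with b[j-1].

    Returns (post, total), where post[i][j] is the match probability and
    total is the summed weight of all global alignment paths.
    """
    n, m = len(a), len(b)
    # forward[i][j]: summed weight of all paths from (0, 0) to (i, j)
    forward = [[0.0] * (m + 1) for _ in range(n + 1)]
    forward[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                forward[i][j] += weight(a[i - 1], b[j - 1]) * forward[i - 1][j - 1]
            if i > 0:
                forward[i][j] += gap_weight * forward[i - 1][j]
            if j > 0:
                forward[i][j] += gap_weight * forward[i][j - 1]
    # backward[i][j]: summed weight of all paths from (i, j) to (n, m)
    backward = [[0.0] * (m + 1) for _ in range(n + 1)]
    backward[n][m] = 1.0
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i < n and j < m:
                backward[i][j] += weight(a[i], b[j]) * backward[i + 1][j + 1]
            if i < n:
                backward[i][j] += gap_weight * backward[i + 1][j]
            if j < m:
                backward[i][j] += gap_weight * backward[i][j + 1]
    total = forward[n][m]
    if total == 0.0:
        raise ValueError("no alignment path has nonzero weight")
    post = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            through = forward[i - 1][j - 1] * weight(a[i - 1], b[j - 1]) * backward[i][j]
            post[i][j] = through / total
    return post, total
```

Gap transition probabilities follow the same pattern, with the gap weight in place of the match weight and the forward value taken from the neighboring site in the same row or column.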
23 Iterative Model Building
1. Score query against sequences in database. Select close homologs for inclusion in model building.
2. Extract transition and amino acid emission probabilities of homologs with the forward-backward algorithm.
3. Construct an alignment model from the average transition and amino acid emission probabilities.
4. Search sequence database again, this time scoring sequences in database against the newly built model.
5. Iterate steps 2-4 until convergence or maximum iterations are reached.
[Figure: alignment lattice with letters C A S E T and C A S E]
24 Technical Details
The System Parameters
- match→match, match→insertion, match→deletion
- insertion→match, insertion→insertion
- deletion→match, deletion→deletion
- insertion ? deletion
- deletion ? insertion
- finite size correction of the Gumbel parameters λ and K
- initial scoring scheme BLOSUM62 (11,1)
- adaptive model inclusion threshold
The Database
- Fuzzy counting: SUPERFAMILY counting rules
25 Results
26 Conclusions
- Hybrid combines the features of probabilistic alignment with well understood significance assessments and can be used to improve sequence homology detection
- Position specific gap weights improve homology detection
Future Work
- Refine method for choosing gap weights
- Using suboptimal alignment information for identifying "corrupt models": models built from nonmembers