Title: Statistical Significance Assesment of Local Sequence Alignment
1Statistical Significance Assesment of Local
Sequence Alignment
Nicholas Chia
2Outline
- Why should you care?
- How does it work?
- Statistics of gapped alignments the need for
speed - A Markov Model approach
- Where does the physics come in?
- And does it solve the problem?
- Conclusion
3What is Sequence Alignment?
- Comparing sequences of letters or data strings
- Essential for computational tasks from file and
web searches to image processing - One of the most widely used tool in molecular
biology
4The Biology of Sequences I
Fundamental biology is based on sequences of
nucleic and amino acids!
DNA
RNA
Protein
mRNA
tRNA
rRNA
siRNA
miRNA
smRNA
5The Biology of Sequence II
DNA
RNA
Proteins
mRNA
tRNA
rRNA
siRNA
miRNA
smRNA
DNA
Replication is not perfect. Three types of errors
occur.
Mutation
GCCTATACCGTA
GCCTATTCCGTA
Insertion
GCCTATACCGTA
GCCTATAGCCGTA
Deletion
GCCTATACCGTA
GCCTATCCGTA
Basis for evolution
6Role of Alignment in Biology
- Identifying function and structure
protein families
all known proteins
- Discerning evolutionary relationships
Tree of Life web project
7Gapped Alignment
scoring matrix
number of gaps
alignment score
gap cost
G
A
T
C
G
G
T
A
C
-
Smith-Waterman Local Alignment for ? gt 1
8Choosing a Scoring System
In theory, the optimal mismatch and gap costs are
related to the "true" number of mutations,
insertions, and deletions between a pair of
sequences.
Number of "true" Mutation
1 / Mismatch cost, ?
Number of "true" Gaps
1 / Gap cost, ?
In practice, one does not observe the
evolutionary path at each step and thus the
"true" path is lost.
Instead, one judges the optimality of a scoring
system by the number of true hits versus the
number of false hits and its ability to reliably
find future true hits. Ultimately, sequence
relationships are determined by experimental
assays probing for function and structure.
What distinguishes a true hit from a false hit?
9Significance Assessment
Studying the distribution of alignment scores
among random sequences yields information about
the rarity, a.k.a., biological import, of a given
alignment score
Evidence suggests gapped alignment scores are
distributed according to the extreme value
distribution
but statistically characterizing the slow
exponential tail of this distribution requires a
large number of simulations.
score
maximum alignment score
Can we find a better method to understand gapped
distributions?
alignment
Can we evaluate ? faster?
10Small Change in Geometry
T
C
G
T
A
Needleman-Wunsch global alignment score
G
C
T
If we can solve for ?, we know ?!
G
C
T
C
G
but how do we solve for ??
A
i.e., how do we account for the length dependence
of the score?
Change geometry!
- fixing the width also fixes the state space
allowing us to model alignment as a Markov
process
- replaces a 2D length dependence with a 1D width
dependence where we can use a Markov matrix to
solve infinite t behavior
t1
11The Markov Model
Using score differences between lattice sites as
our elements, we can write a Markov matrix
describing the transition from t to t1
In this way we can describe the fixed width
dynamics for infinite t
t
t1
largest eigenvalue of the modified Markov matrix
In order to obtain information about the relevant
quantity
we modify our Markov matrix to include the
necessary ?
all alignment parameters
dependence.
12Understanding the W-dependence
Definition
... not quite that simple
- cannot solve for extremely large W
- non-trivial width dependence
So, what can we do?
Kardar-Parisi-Zhang Systems
ASEP Asymmetric Exclusion Process
KPZ systems all share properties on a course
grained level
KPZ
Sequence Alignment
13KPZ Surface Growth
Describes, on a course grained level, tumor
growth, bacterial growth, flame fronts, forest
formations, asymmetric exclusion processes,
geological formations, avalanches, crystal
defects, directed polymers, sputter deposition,
electrochemical deposition, shoreline
growth/recession, sequence alignment, and vicious
imbibitions.
height profile
noise
surface tension
surface relaxation
lowest order non-linear growth term
The set of processes which are governed by the
KPZ equation are collectively known as the KPZ
universality class. Properties found for any
system apply to all systems in a given
universality class.
Thus, just by studying the dynamics of a single
member of the KPZ universality class, we can
uncover truths for the large number of dynamic
processes.
M. Kardar, G. Parisi, and Y. Zhang, Phys. Rev.
Lett. 56 889 (1986).
14Universal Properties
T. Haplin-Healy and Y. Zhang, Phys. Rep. 254 215
(1995).
15Totally ASymmetric Exclusion Process
- Single Particle TASEP
- Periodic boundary
- Particles move only to the right (1) if space
exists (2) with probability q - Sublattice-parallel update scheme - inherently
discrete time scheme
- Multiple Particle TASEP
- Only one particle per site considered for
hopping, up to n particles per site
16Background
Generating Function
Particle current or hopping per site
Summarizes system parameters (num. of part.,
hopping prob., part. per site)
Derrida-Lebowitz Scaling Function
Proposed universal scaling function, i.e.,
completely independent of system, solved
analytically
How long does this solution take to converge?
Applies to the continuous time ASEP
17Finite Size Effects
Once again, this solution converges very slowly.
Direct calculation of
is possible via matrix method. However, slow
convergence toward the limits N??, t?? make this
method difficult.
Extremely time consuming calculation
18Reformulated Solution
Define Infinite Limit
Can calculate directly via a matrix method for
small system sizes (N 22)
Analytical solution exists for single particle
case for all negative ?
Reformulation
Can measure G directly! Not using cumulant ratios.
Additional term arises from our newly defined
left hand side which must go to zero for fixed N
as ??-?
Does this agree with the observed behavior?
Scaling Function N ? ?
19Continuous Time ASEP
Since Derrida et al. have already solved
analytically for a continuous time ASEP for ? and
?N, this method can be tested directly using
their solutions.
When plotted in this manner, the rescaled small
system size solutions quickly converge to the
universal scaling function. This can be done much
faster and more easily than with previous methods.
20Understanding the W-dependence
Definition
... not quite that simple
- cannot solve for extremely large W
- non-trivial width dependence
Kardar-Parisi-Zhang Systems
So, what can we do?
Derrida Lebowitz, PRL 1998
From Derrida and Lebowitz comes the scaling
function G, which gives the form of the width
dependence as follows
ASEP
KPZ
Sequence Alignment
Gapped Alignment
KPZ systems all share properties on a course
grained level
By understanding the W-dependence, we can solve
for ? and ?!
21Single Particle TASEP
Solution for the Single Particle TASEP
Plot
Since there is a solution for l in the single
particle TASEP, one can plot the left hand side
versus the right hand side and rescale
appropriately in order to isolate the DL proposed
universal behavior.
22Multi-Particle TASEP
Plot
Once a and b are fitted for one N, these scaling
factors give good agreement for size systems,
i.e., the scaling factors a and b are independent
of system size!
23Calculating ?
parameter dependent scaling factors
Equation for ?
Equation for ?W
W-dependence
Can solve for ? if we know the scaling factors a?
and b?!
Fit difference in order to obtain scaling factors
Solution for ?
24Convergence of ?
0.65
?
0.6
0.55
0
2
4
6
8
10
12
W
25Results
26Conclusion
- Succeeded in calculating ? with precision on
fast timescales - Demonstrated a non-sampling method for
calculating ? - Successful synthesis of computational biology
and statistical physics - In principle, can be generalized to more complex
scoring systems - Uncovered a method for numerically verifying the
Derrida-Lebowitz Scaling Function and for using
it to overcome finite size effects
27Acknowledgments
Ralf Bundschuh
Nigel Goldenfeld Carl Woese
Eduardo Fradkin
28Technique for Calculating ?w(??)
The System Parameters - ?
- match-mismatch scoring matrix
- technical condition to reduce state space
- periodic boundary conditions
- Bernoulli randomness (uncorrelated bonds)
approximation
Calculating ?w(??)
- construct the modified Markov matrices
symbolically - eigenvalue simply too difficult to solve
symbolically large matrices (105) - ARPACK solves for the largest eigenvalue ?w with
precision and speed since the matrices are sparse
(N log N) - Lehoucq et al., SIAM 1997
29Mapping
30Detecting Protein Homologs using Hybrid
PSI-BLAST
Nicholas Chia, Yuheng Li?, Mario Lauria?, and
Ralf Bundschuh?
Department of Physics Department of Computer
Science and Engineering ?Biophysics Graduate
Program The Ohio State University
31Protein Classification
individual protein families
all known proteins
query sequence
BLAST
PSI-BLAST (position specific scoring matrix)
Reliable protein classification requires the use
of structural information in order to detect
distant family members
How can we improve homology detection?
Number of sequences growing far faster than
number of structures
growth rate
Protein structures
sequences
32Probabilistic Alignment
number of gaps
scoring matrix
BLOSUM62
alignment weight
gap weight
C
A
S
E
T
C
S
A
-
E
Typically, deterministic alignments measure
homology by scoring only the optimal alignment
path one path one score
However, probabilistic alignment measures
homology by summing the weights over all
alignment paths many paths one score
33Hybrid Alignment
Deterministic Alignment
vs.
Global alignment algorithm searches for the
highest scoring path from upper-left to
lower-right and reports its score
Global alignment sums weights over all alignment
paths from upper-left to lower-right and reports
a final score
C
A
S
E
T
C
A
S
E
T
C
C
A
A
S
S
E
E
Local alignment searches all alignment segments
beginning or ending anywhere and returns highest
scoring segment
Local alignment sums over alignment segments
beginning everywhere and returns the highest
scoring endpoint
Local alignment scores for random sequences are
distributed according to the Gumbel distribution
Logarithm of local alignment scores for random
sequences are distributed according to the Gumbel
distribution
O(N2) computational complexity
Hybrid PSI-BLAST
NCBI PSI-BLAST
34Significance Assessment
Studying the distribution of alignment scores
among random sequences yields information about
the rarity, a.k.a., biological import, of a given
alignment score
Need to solve for the two Gumbel parameters for
Hybrid Alignment
but while incorporating position-specific
scoring matrices and gap costs.
log of maximum alignment score
even easier ???
If the average total weight entering a point
equals 1, i.e.,
How can this property be used to benefit homology
detection?
then
35Position-Specific Gap Costs
Since ?1 for any ?, we may freely choose
position dependent gap weights without undergoing
the usual difficulty of the significance
assessment.
Hybrid PSI-BLAST
PSI-BLAST
- Choosing gap weights for a sequence model
- Use existing alignments
- Need to know gap probabilities in alignment
36Measuring Probabilities
C
A
S
E
T
C
A
S
E
max score measures alignment likelihood
forward-backward algorithm
Probability of a gap transition
Probability of a match transition
Probability of any transition
?
?
?
(1-2?)
backward
forward
x
max score
37Iterative Model Building
- Score query against sequences in database. Select
close homologs for inclusion in model building. - Extract transition and amino acid emission
probabilities of homologs with the
forward-backward algorithm. - Construct an alignment model from the average
transition and amino acid emission probabilities. - Search sequence database again, this time scoring
sequences in database against the newly built
model. - Iterate steps 2-4 until convergence or maximum
iterations are reached.
C
A
S
E
T
C
A
S
E
38Technical Details
The System Parameters
match?match match?insertion match?deletion
insertion?match insertion?insertion
deletion?match deletion?deletion
insertion ? deletion
deletion ? insertion
- finite size correction of the Gumbel parameters
? and ?
- initial scoring scheme BLOSUM62 (11,1)
- adaptive model inclusion threshold
The Database
- Fuzzy countingSUPERFAMILY counting rules
39Results
40Conclusions
- Hybrid combines the features of probabilistic
alignment with well understood significance
assessments and can be used to improve sequence
homology detection - Position specific gap weights improve homology
detection
Future Work
- Refine method for choosing gap weights
- Using suboptimal alignment information for
identifying "corrupt models"models built from
nonmembers