Title: Fasta, Blast, Probabilities
1Fasta, Blast, Probabilities
2Reminder
- Last classes we discussed dynamic programming
algorithms for - global alignment
- local alignment
- Multiple alignment
- All of these assumed a scoring rulethat
determines the quality of perfect matches,
substitutions, insertions, and deletions.
3Alignment in Real Life
- One of the major uses of alignments is to find
sequences in a database. - The current protein database contains about 100
millions (i.e.,108) residues ! So searching a
1000 long target sequence requires to evaluate
about 1011 matrix cells which will take about
three hours in the rate of 10 millions
evaluations per second. - Quite annoying when, say, one thousand target
sequences need to be searched because it will
take about four months to run.
4Heuristic Fast Search
- Instead, most searches rely on heuristic
procedures - These are not guaranteed to find the best match
- Sometimes, they will completely miss a
high-scoring match - We now describe the main ideas used by the
best known of these heuristic procedures.
5Basic Intuition
- Almost all heuristic search procedures are based
on the observation that real-life matches often
contain long strings with gap-less matches. - These heuristic try to find significant gap-less
matches and then extend them.
6Banded DP
- Suppose that we have two strings s1..n and
t1..m such that n?m - If the optimal alignment of s and t has few gaps,
then path of the alignment will be close to
diagonal
s
t
7Banded DP
- To find such a path, it suffices to search in a
diagonal region of the matrix. - If the diagonal band has width k, then the
dynamic programming step takes O(kn). - Much faster than O(n2) of standard DP.
8Banded DP for local alignment
- Problem Where is the banded diagonal ? It
need not be the main diagonal when looking for a
good local alignment.
How do we select which subsequences to align
using banded DP?
We heuristically find potential diagonals and
evaluate them using Banded DP. This is the main
idea of FASTA.
9Finding Potential Diagonals
- Suppose that we have a relatively long gap-less
match - AGCGCCATGGATTGAGCGA
- TGCGACATTGATCGACCTA
- Can we find clues that will let us find it
quickly? - Each such sequence defines a potential diagonal
(which is then evaluated using Banded DP.
10Signature of a Match
- Assumption good matches contain several
patches of perfect matches - AGCGCCATGGATTGAGCTA
- TGCGACATTGATCGACCTA
- Since this is a gap-less alignment, all perfect
match regionsshould be on one diagonal
11FASTA-finding ungapped matches
- Input strings s and t, and a parameter ktup
- Find all pairs (i,j) such that si..iktuptj..j
ktup - Locate sets of pairs that are on the same
diagonal - By sorting according to the difference i-j
- Compute the score for the diagonal that contains
all these pairs
12FASTA-finding ungapped matches
- Input strings s and t, and a parameter ktup
- Find all pairs (i,j) such that si..iktuptj..j
ktup - Step one prepare an index of the database and
the query sequence such that given a sequence of
length ktup, one gets the list of positions.
(Linear time). - Step two for each ktup from the query add 1 in
the diagonal (i-j) in which it appears. Then find
contiguous (possibly with mismatch) ktup in
diagonals.
13FASTA- using banded DP
- Step 3
- Select the ten high scoring contiguous segments
- Try and score all combinations of these ten
segments in order to constitute a pass into the
matrix - Step 4
- Run banded DP on the region containing the best
scoring pass (say with width 12). - Hence, the algorithm may combine some diagonals
into gapped matches (in the example below combine
diagonals 2 and 3).
14FASTA- practical choices
Most applications of FASTA use very small ktup
(1-2 for proteins, and 4-6 for DNA). Higher
values are faster, yielding less diagonal to
search around, but increase the chance to miss
the optimal local alignment.
Some implementation choices /tricks have not been
explicated herein.
15FASTA-summary
- Input strings s and t, and a parameter ktup
1,2,4,5, or 6 depending on the application. - Output A highly scored local alignment
- Find pairs of matching substrings
si..iktuptj..jktup - Extend to ungapped diagonals
- Extend to gapped matches using banded DP
16BLAST Overview(Basic Local Alignment Search
Tool)
- Input strings s and t, and a parameter T
threshold value - Output A highly scored local alignment
- Definition Two strings s and t of length k are
a high scoring pair (HSP) if d(s,t) gt T (usually
consider un-gapped alignments only). - Find high scoring pairs of substrings such that
d(s,t) gt T - These words serve as seeds for finding longer
matches - Extend to ungapped diagonals (as in FASTA)
- Extend to gapped matches
17BLAST Overview (cont.)
- Step 1 Find high scoring pairs of substrings
such that d(s,t) gt T (The seeds) - Find all strings of length k which score at least
T with substrings of s in a gapless alignment (k
4 for proteins, 11 for DNA) - (note possibly, not all k-words must be
tested, e.g. when such a word scores less than T
with itself). - Find in t all exact matches with each of the
above strings.
18Extending Potential Matches
Once a seed is found, BLAST attempts to find a
local alignment that extends the seed. Seeds on
the same diagonal are combined (as in FASTA),
then extended as far as possible in a greedy
manner without gap.
s
During the extension phase, the search stops when
the score passes below some lower bound computed
by BLAST (to save time). For the best ungap
alignment do a banded SW an assign a
probabilistic score.
t
19(No Transcript)
20(No Transcript)
21(No Transcript)
22(No Transcript)
23(No Transcript)
24(No Transcript)
25(No Transcript)
26Why use probability to define and/or interpret a
scoring function ?
- Similarity is probabilistic in nature because
biological changes like mutation, recombination,
and selection, are not deterministic. - We could answer questions such as
- How probable two sequences are similar?
- Is the similarity found significant or random?
- How to change a similarity score when, say,
mutation rate of a specific area on the
chromosome becomes known ?
27A Probabilistic Model
- For now, we will focus on alignment without
indels. - For now, we assume each position (nucleotide
/amino-acid) is independent of other positions. - We consider two options
- M the sequences are Matched (related)
- R the sequences are Random (unrelated)
28Unrelated Sequences
- Our random model of unrelated sequences is simple
- Each position is sampled independently from a
distribution over the alphabet ? - We assume there is a distribution q(?) that
describes the probability of letters in such
positions. - Then
29Related Sequences
- We assume that each pair of aligned positions
(si,ti) evolved from a common ancestor - Let p(a,b) be a distribution over pairs of
letters. - p(a,b) is the probability that some ancestral
letter evolved into this particular pair of
letters.
30Odd-Ratio Test for Alignment
If Q gt 1, then the two strings s and t are more
likely to be related (M) than unrelated (R). If
Q lt 1, then the two strings s and t are more
likely to be unrelated (R) than related (M).
31Log Odd-Ratio Test for Alignment
- Taking logarithm of Q yields
Score(si,ti)
If log Q gt 0, then s and t are more likely to be
related. If log Q lt 0, then they are more likely
to be unrelated. How can we relate this
quantity to a score function ?
32Probabilistic Interpretation of Scores
- We define the scoring function via
- Then, the score of an alignment is the log-ratio
between the two models - Score gt 0 ? Model is more likely
- Score lt 0 ? Random is more likely
33Estimating Probabilities
- Suppose we are given a long string s1..n of
letters from ? - We want to estimate the distribution q() that
generated the sequence - How should we go about this?
- We build on the theory of parameter estimation in
statistics using either maximum likelihood
estimation or the Bayesian approach .
34Estimating q(?)
- Suppose we are given a long string s1..n of
letters from ? - s can be the concatenation of all sequences in
our database - We want to estimate the distribution q(?)
- That is, q is defined per letter
Likelihood function
35Estimating q(?) (cont.)
Likelihood function
ML parameters (Maximum Likelihood)
MAP parameters (Maximum A posteriori Probability)
36Estimating p(,)
- Intuition
- Find pair of aligned sequences s1..n, t1..n,
- Estimate probability of pairs
- Again, s and t can be the concatenation of many
aligned pairs from the database
Number of times a is aligned with b in (s,t)
37Problems in Estimating p(,)
- How do we find pairs of aligned sequences?
- How far is the ancestor ?
- earlier divergence ? low sequence similarity
- later divergence ? high sequence similarity
- Does one letter mutate to the other or are they
both mutations of a common ancestor having yet
another residue/nucleotide acid ?
38Estimating p(,) for proteins
39PAM-1 matrices
For PAM-1 it is assumed that 1 of all amino
acids are mutated.
Proportion of mutation that concern a times
number of mutation
40PAM-1 matrices
Define Mab to be the probability matrix for
switching from a to b via a mutation
41Properties of PAM-1 matrices
Note that
Namely, the probability of not changing and
changing sums to 1.
42Model of Evolution
- Again, we need to make some assumptions
- Each position changes independently of the rest
- The probability of mutations is the same in each
positions - Evolution does not remember
Time
t
t?
t2?
t3?
t4?
43Model of Evolution
- How do we model such a process?
- This process is called a Markov Chain
- A chain is defined by the transition probability
- P(Xt? bXta) - the probability that the next
state is b given that the current state is a - We often describe these probabilities by a
matrixM?ab P(Xt? bXta)
44Multi-Step Changes
- Based on Mab, we can compute the probabilities of
changes over two time periods
- Thus M2? M?M?
- By induction (HMW exercise) Mk? M? k
45Longer Term Changes
- Estimate M? (PAM-1 matrices)
- Use Mk? M? k (PAM-k matrices)
- Define
46Using PAM
- Historically researchers use PAM-250. (The only
one published in the original paper.) - Original PAM matrices were based on small number
of proteins (circa 1978). Later versions use many
more examples. - Used to be the most popular scoring rule, but
there are some problems with PAM matrices.
47(No Transcript)
48Problems with PAM
Normalization step is quite arbitrary. If, for
example, we define relative mutability using the
constant 50 rather than 100, we get
We will get
Now we have two different ways to estimate the
matrix M4? M4? M? 4 as we did before or
M4? M2? 2
M250? for example does not reflect well long
period changes.
49BLOSUM
- Idea use aligned ungapped regions of protein
families.These are assumed to have a common
ancestor. Similar ideas but better statistics
and modeling. - Procedure
- Cluster together sequences in a family whenever
more than L identical residues are shared. - Count number of substitutions across different
clusters in the same family. - Estimate frequencies as before.
- Practice Blosum50 and Blosum62 are wildly used.
(See page 43-44 in the text book).
Considered state of the art nowadays.
50BLOSUM 62