Title: Bioinformatics Algorithms and Data Structures
1Bioinformatics Algorithms and Data Structures
- Chapter 11, sections 4-7
- Lecturer Dr. Rose
- Slides by Dr. Rose
- February 4 6, 2003
2Edit Graphs
- Key idea: the weighted edit graph.
- Defn. Given strings S1 and S2 of lengths n and m, respectively, a weighted edit graph has $(n+1) \times (m+1)$ nodes, labelled $(i,j)$, $0 \le i \le n$, $0 \le j \le m$. The edges and edge weights are problem specific.
3Edit Graphs
- Example: the edit distance problem.
- The weighted graph for the edit distance problem has directed edges from node $(i,j)$ to the nodes $(i+1,j)$, $(i,j+1)$, and $(i+1,j+1)$, provided they exist.
- The weight of the directed edges to nodes $(i+1,j)$ and $(i,j+1)$ is 1.
- The weight of the directed edge to $(i+1,j+1)$ is $t(i+1,j+1)$.
- Figure 11.4 in the textbook shows an edit graph.
4Edit Graphs
- Thm. An edit transcript for strings S1 and S2 has the minimum number of edit operations $\iff$ it corresponds to a shortest path from (0,0) to (n,m) in the edit graph.
- Cor. The set of all shortest paths from (0,0) to (n,m) in the edit graph specifies all optimal edit transcripts of S1 to S2.
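The theorem can be checked on small inputs with a minimal Python sketch. It assumes unit-cost edges and a diagonal weight t of 0 for a match and 1 for a mismatch; the function name is illustrative and does not come from the slides or the textbook.

```python
def edit_graph_shortest_path(s1, s2):
    """Length of a shortest (0,0) -> (n,m) path in the unit-cost edit graph."""
    n, m = len(s1), len(s2)
    inf = float("inf")
    dist = [[inf] * (m + 1) for _ in range(n + 1)]
    dist[0][0] = 0
    # Row-major order is a topological order of the graph, because every
    # edge increases i, j, or both.
    for i in range(n + 1):
        for j in range(m + 1):
            here = dist[i][j]
            if i < n:  # edge to (i+1, j), weight 1
                dist[i + 1][j] = min(dist[i + 1][j], here + 1)
            if j < m:  # edge to (i, j+1), weight 1
                dist[i][j + 1] = min(dist[i][j + 1], here + 1)
            if i < n and j < m:  # diagonal edge, weight t(i+1, j+1)
                t = 0 if s1[i] == s2[j] else 1
                dist[i + 1][j + 1] = min(dist[i + 1][j + 1], here + t)
    return dist[n][m]
```

For example, `edit_graph_shortest_path("kitten", "sitting")` returns 3, the edit distance of the two strings.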
5Weight Edit Distance
- There are two ways of assigning weights or costs to calculate edit distance:
  - By edit operation
  - By alphabet, i.e., different costs for different characters
- Our initial approach was to assign weight by edit operation, i.e., 1 for insert, delete, replace, and 0 for match.
- We can generalize our approach by assigning the weight d for an insertion or deletion, r for a replacement, and e for a match.
6Weight Edit Distance
- Q: What values for d, r, and e have we been using?
- A: d = 1, r = 1, and e = 0.
- Q: What would happen if r > 2d?
- A: Replacements would never occur, because a deletion followed by an insertion achieves the same effect for only 2d.
- Defn. The operation-weight distance problem entails finding an edit transcript transforming S1 to S2 with the minimum total operation weight.
7Weight Edit Distance
- Q: What changes should we make to the definition of edit distance, D(i,j), to reflect operation weight?
- We have to specify an operation-specific definition.
- The base conditions become
  - $D(i,0) = i \cdot d$. Why?
  - $D(0,j) = j \cdot d$. Why?
8Weight Edit Distance
- The general recurrence becomes
  $D(i,j) = \min[\, D(i,j-1) + d,\ D(i-1,j) + d,\ D(i-1,j-1) + t(i,j) \,]$
- where $t(i,j) = e$ if $S_1(i) = S_2(j)$, otherwise $t(i,j) = r$.
- Q: Why?
- A: The cost of
- Delete (from i-1,j) is d
- Insert (from i,j-1) is d
- Match (from i-1,j-1) is e
- Replace (from i-1,j-1) is r
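The base conditions and recurrence for operation-weight edit distance can be sketched in Python as follows. The function name and the default weights (d = 1, r = 1, e = 0) are assumptions chosen to match the unweighted case.

```python
def operation_weight_distance(s1, s2, d=1, r=1, e=0):
    """Operation-weight edit distance D(n, m) for weights d, r and e."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * d  # delete the first i characters of S1
    for j in range(1, m + 1):
        D[0][j] = j * d  # insert the first j characters of S2
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = e if s1[i - 1] == s2[j - 1] else r
            D[i][j] = min(
                D[i][j - 1] + d,      # insert S2(j)
                D[i - 1][j] + d,      # delete S1(i)
                D[i - 1][j - 1] + t,  # match or replace
            )
    return D[n][m]
```

With the default weights, "kitten" and "sitting" are at distance 3. With d = 1 and r = 3 (so r > 2d) the distance rises to 5, because each replacement is then simulated by a deletion plus an insertion.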
9Weight Edit Distance
- The alternative to operation-weight edit distance is alphabet-weight edit distance.
- Idea: different characters have different costs.
- Q: How would we modify the edit distance function, D(i,j), to support alphabet-weight edit distance?
- A: Let weight(x) denote the weight associated with character x for all x in the alphabet.
- Then $D(i,0) = \sum_{1 \le k \le i} \mathrm{weight}(S_1(k))$
- And $D(0,j) = \sum_{1 \le k \le j} \mathrm{weight}(S_2(k))$
- Q: What about the general recurrence D(i,j)?
10Weight Edit Distance
- A: $D(i,j) = \min[\, D(i,j-1) + \mathrm{weight}(S_2(j)),\ D(i-1,j) + \mathrm{weight}(S_1(i)),\ D(i-1,j-1) + t(i,j) \,]$
- where $t(i,j) = \mathrm{weight}(S_2(j))$ if $S_1(i) \ne S_2(j)$, otherwise 0.
- Note: for proteins, edit distance usually refers to alphabet-weight edit distance.
- As the text mentions, the weights are usually derived from the PAM matrices of Dayhoff or the BLOSUM matrices of Henikoff.
- Edit distance for DNA strings is usually either unweighted or operation-weighted edit distance.
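A minimal Python sketch of the alphabet-weight recurrence as stated on these slides. The `weight` dictionary and its values are purely illustrative assumptions; real protein work would derive the costs from PAM or BLOSUM matrices.

```python
def alphabet_weight_distance(s1, s2, weight):
    """Alphabet-weight edit distance; weight maps each character to its cost."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + weight[s1[i - 1]]  # running sum over S1
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + weight[s2[j - 1]]  # running sum over S2
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a, b = s1[i - 1], s2[j - 1]
            t = weight[b] if a != b else 0
            D[i][j] = min(
                D[i][j - 1] + weight[b],  # insert S2(j)
                D[i - 1][j] + weight[a],  # delete S1(i)
                D[i - 1][j - 1] + t,      # match or replace
            )
    return D[n][m]
```

With the hypothetical weights `{"a": 2, "b": 1}`, transforming "ab" into "b" costs 2, the weight of the deleted "a".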
11String Similarity
- The relatedness of two strings can be expressed in terms of similarity.
- This similarity is usually expressed in terms of alignment rather than in terms of edit distance.
- Defn. Let $\Sigma$ be the alphabet for strings S1 and S2. Let $\Sigma'$ be $\Sigma$ with the additional character "-" (also written "_") denoting a space. Let s(x,y) denote the value obtained by aligning character x with character y.
12String Similarity
- Defn. The value of alignment A is defined as
  $\sum_{i=1}^{l} s(S_1'(i), S_2'(i))$
  where $S_1'$ and $S_2'$ denote the strings after the insertion of spaces and their length is denoted by l.
- If s(x,y) is greater than or equal to zero when x and y match and negative when they mismatch, then we look for the alignment with the largest score.
13String Similarity
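- Example: $\Sigma = \{a, g, c, t\}$. Let s(x,y) be defined by the scoring matrix on the original slide (the matrix did not survive in this transcript).
- Q: What is the value of the following alignment?
    a t a - a c t g t
    g t a g a c - g t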
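Since the slide's scoring matrix is not available here, the sketch below evaluates an alignment under a hypothetical scheme (match +2, mismatch -1, space -1). It also assumes the alignment in the question reads as two rows of nine columns. Both function names are illustrative.

```python
def alignment_value(row1, row2, score):
    """Sum of s(x, y) over the columns of an alignment given as two rows."""
    if len(row1) != len(row2):
        raise ValueError("both rows of an alignment must have the same length")
    return sum(score(x, y) for x, y in zip(row1, row2))


def example_score(x, y):
    """Hypothetical scores: match +2, mismatch -1, space -1 (not from the slide)."""
    if x == "-" or y == "-":
        return -1
    return 2 if x == y else -1
```

Under this assumed scheme, `alignment_value("ata-actgt", "gtagac-gt", example_score)` is 9: six matches, one mismatch, and two spaces. The slide's own matrix may give a different value.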
14String Similarity
- Defn. Given a scoring matrix over $\Sigma'$, define the similarity of two strings S1 and S2 as the value of the alignment A that maximizes the total alignment value of S1 and S2.
- This also defines the optimal alignment value of the strings S1 and S2.
15Computing Similarity
- Q: How can we compute the optimal alignment value of the strings S1 and S2?
- A: Use dynamic programming.
- Defn. Let V(i,j) denote the value of the optimal alignment of prefixes S1[1..i] and S2[1..j].
- If strings S1 and S2 have lengths n and m, respectively, then the value of the optimal alignment of these strings is given by V(n,m).
- Q: What do you guess the time complexity will be?
- A: O(nm)
16Computing Similarity
The optimal alignment value relation is defined similarly to the edit distance relation.
- Base conditions:
  $V(0,j) = \sum_{1 \le k \le j} s(\_, S_2(k))$ and $V(i,0) = \sum_{1 \le k \le i} s(S_1(k), \_)$
- Define the general recurrence relation as
  $V(i,j) = \max[\, V(i-1,j-1) + s(S_1(i), S_2(j)),\ V(i-1,j) + s(S_1(i), \_),\ V(i,j-1) + s(\_, S_2(j)) \,]$
17Computing Similarity
- $V(i,j) = \max[\, V(i-1,j-1) + s(S_1(i), S_2(j)),\ V(i-1,j) + s(S_1(i), \_),\ V(i,j-1) + s(\_, S_2(j)) \,]$
- Q: What does this recurrence relation say?
- A: The optimal alignment of the prefixes S1[1..i] and S2[1..j] is the maximum of
  - The optimal alignment of S1[1..i-1] and S2[1..j-1] extended by aligning S1(i) and S2(j).
  - The optimal alignment of S1[1..i-1] and S2[1..j] extended by aligning S1(i) with a space.
  - The optimal alignment of S1[1..i] and S2[1..j-1] extended by aligning a space with S2(j).
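The recurrence translates directly into a table computation. The Python sketch below assumes the space character is passed to the scoring function as "_" and uses a hypothetical scheme (match +2, mismatch -1, space -1). The base conditions charge every leading space, as in global alignment.

```python
def global_alignment_value(s1, s2, score):
    """V(n, m): value of an optimal global alignment of s1 and s2."""
    n, m = len(s1), len(s2)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + score(s1[i - 1], "_")  # S1 prefix against spaces
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + score("_", s2[j - 1])  # spaces against S2 prefix
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + score(s1[i - 1], s2[j - 1]),
                V[i - 1][j] + score(s1[i - 1], "_"),
                V[i][j - 1] + score("_", s2[j - 1]),
            )
    return V[n][m]


def example_score(x, y):
    """Hypothetical scores: match +2, mismatch -1, space -1."""
    if x == "_" or y == "_":
        return -1
    return 2 if x == y else -1
```

For instance, `global_alignment_value("abc", "ac", example_score)` is 3: two matches and one space. The two nested loops make the O(nm) running time visible.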
18Longest Common Subsequence
- Defn. A subsequence of a string S is a subset of the characters of S arranged in their original relative order.
- Example:
  - S = interdepartmentaladministratorstaskforce
  - subsequence => idiots
  - InterDepartmentaladmInistratOrsTaSkforce (one way to pick the characters, shown in capitals)
- Obviously every substring of S is also a subsequence of S.
- Defn. A common subsequence of two strings is a subsequence that appears in both strings.
19Longest Common Subsequence
- Defn. The longest common subsequence problem entails finding a longest common subsequence (lcs) of two strings.
- Thm. The matched characters of an optimal alignment A form a longest common subsequence, if a scoring scheme is used in which each matching pair of characters scores 1 and a mismatch or space scores 0.
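A small Python sketch of the theorem: fill the alignment table with the 1/0 scoring scheme, then read the matched characters off a traceback. The function name is illustrative. When several longest common subsequences exist, the one returned depends on the tie-breaking order in the traceback.

```python
def lcs(s1, s2):
    """Return one longest common subsequence of s1 and s2."""
    n, m = len(s1), len(s2)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 1 if s1[i - 1] == s2[j - 1] else 0  # mismatch and space score 0
            V[i][j] = max(V[i - 1][j - 1] + match, V[i - 1][j], V[i][j - 1])
    # Trace back from (n, m), collecting the characters of matching columns.
    out = []
    i, j = n, m
    while i > 0 and j > 0:
        if s1[i - 1] == s2[j - 1] and V[i][j] == V[i - 1][j - 1] + 1:
            out.append(s1[i - 1])
            i, j = i - 1, j - 1
        elif V[i][j] == V[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))
```

Applied to "idiots" and "interdepartmentaladministratorstaskforce" it returns "idiots", confirming the subsequence example above.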
20Alignment Graphs
- Like distance, similarity can be viewed as a path problem; the graph that is analogous to the edit graph (section 11.4) is called an alignment graph.
- Defn. An alignment graph is a DAG similar to an edit graph in which the edge weights correspond to costs for aligning specific character pairs.
- The optimal alignment corresponds to the longest path, in terms of sum of edge costs, from (0,0) to (n,m) of the dynamic programming table.
- The longest paths (optimal alignments) can be found in O(nm).
21End-Space Free Alignment
- End-space free alignment: an alignment variant in which leading and trailing spaces contribute zero weight.
- Example:
    e x a m p l e - h e c o u l d a - - - h a d a - - b e e r
    - - - - - - - - h e w o u l d n t a s h o t h i s d e a r
- The first eight spaces are free.
- This encourages (biases towards)
  - Alignment of one string inside the other, or
  - Alignment of the prefix of one string with the suffix of the other
22End-Space Free Alignment
- Q: When should interior or prefix/suffix matching be preferred?
- A: When it matches the nature of the problem being modeled.
- An example is shotgun sequence assembly. Explain!
  - Start with a large collection of partially overlapping substrings that come from multiple copies of one original, but unknown, string.
  - Use comparisons of pairs of substrings to infer the original string.
23End-Space Free Alignment
- Q: Would you expect substrings that overlap in the original string to show significant alignment?
- A: Perhaps. In any case, with some slop for sequencing errors, either
  - one string would align inside the other, or
  - the prefix of one string would align with the suffix of the other.
- In contrast, a significant alignment of randomly selected substrings from this collection is unlikely.
- An end-space free alignment would detect this difference and score overlapping substrings higher.
24End-Space Free Alignment
- We can deduce candidate neighbor pairs by
  - Computing the end-space free alignment for every pair of substrings.
  - High scoring alignments are likely neighbors.
- To compute this
  - Use a recurrence for global alignment where spaces count.
  - Change the definition of V(i,0), V(0,j) to address leading spaces: $V(i,0) = V(0,j) = 0$ for all i and j.
  - Compute the alignment graph in O(mn). How?
25End-Space Free Alignment
- Unlike global alignment, the value of the optimal alignment is not necessarily in cell (n,m).
- The optimal alignment value will now be found in
  - A cell in row n, if the last character of S1 contributes to the value of the alignment but the last characters of S2 do not.
  - A cell in column m, if the last character of S2 contributes to the value of the alignment but the last characters of S1 do not.
- The optimal alignment value is the largest value found in row n or column m.
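A minimal Python sketch of end-space free alignment as described: zero base conditions, the ordinary global recurrence inside the table, and a final maximum over row n and column m. The scoring scheme (match +2, mismatch -1, space -1) and the names are assumptions for illustration.

```python
def end_space_free_value(s1, s2, score):
    """Best alignment value when leading and trailing spaces are free."""
    n, m = len(s1), len(s2)
    # V[i][0] = V[0][j] = 0: leading spaces are not charged.
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + score(s1[i - 1], s2[j - 1]),
                V[i - 1][j] + score(s1[i - 1], "_"),
                V[i][j - 1] + score("_", s2[j - 1]),
            )
    # Trailing spaces are free: take the best cell in row n or column m.
    return max(max(V[n]), max(V[i][m] for i in range(n + 1)))


def example_score(x, y):
    """Hypothetical scores: match +2, mismatch -1, space -1."""
    if x == "_" or y == "_":
        return -1
    return 2 if x == y else -1
```

The suffix/prefix overlap of "abcde" and "cdefg" scores 6 (the three matches of "cde"), as does "abc" aligned inside "xxabcxx". This is the behavior wanted for detecting neighboring fragments in shotgun assembly.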
26- And now for something completely different
- Approximate Matching
27Approximate Matching
- Basic idea: threshold-defined similarity.
- Defn. A substring T' of T is an approximate occurrence of P $\iff$ the optimal alignment of P to T' has value at least $\delta$, the threshold parameter.
- Approach
  - Use the standard recurrence for global alignment.
  - Do not charge for the part of T that precedes the occurrence: $V(0,j) = 0$ for all j. The entries V(i,0) keep their global-alignment values, so that every character of P is accounted for.
  - Leave backpointers while computing the table.
28Approximate Matching
- Q: How can we recognize an approximate occurrence of P in T from the table computation?
- A: If the length of P is n, then for some j, $V(n,j) \ge \delta$.
- More specifically
- Thm. An approximate occurrence of P in T ends at position j of T $\iff V(n,j) \ge \delta$.
- This tells us where in T the approximate occurrence ends. Where in T does it start?
29Approximate Matching
- Thm. (version 0) An approximate occurrence of P in T ends at position j of T $\iff V(n,j) \ge \delta$.
- This tells us where in T the approximate occurrence ends. Where in T does it start?
- We can find the start by following the path of backpointers from cell (n,j) back to (0,k). k is the starting position in T.
- Thm. (version 1) T[k..j] is an approximate occurrence of P in T $\iff V(n,j) \ge \delta$ and there is a path of backpointers from (n,j) to (0,k).
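The end positions in the theorem can be listed with a short Python sketch. It rests on two assumptions: row 0 of the table is zero (text of T before the occurrence is free), while column 0 is charged as in global alignment so that all of P is always aligned. The scoring scheme and names are hypothetical.

```python
def approximate_occurrence_ends(p, t, score, delta):
    """Positions j (1-based) of t with V(n, j) >= delta."""
    n, m = len(p), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]  # V[0][j] = 0 for all j
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + score(p[i - 1], "_")  # P against the empty string
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + score(p[i - 1], t[j - 1]),
                V[i - 1][j] + score(p[i - 1], "_"),
                V[i][j - 1] + score("_", t[j - 1]),
            )
    return [j for j in range(1, m + 1) if V[n][j] >= delta]


def example_score(x, y):
    """Hypothetical scores: match +2, mismatch -1, space -1."""
    if x == "_" or y == "_":
        return -1
    return 2 if x == y else -1
```

For P = "abc" and T = "xxabcxx", a threshold of 6 reports only position 5 (the exact occurrence). Lowering the threshold to 5 also reports position 6, where "abcx" aligns to "abc" with one space.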
30Approximate Matching
- The table computation takes O(nm).
- Consider: depending on the threshold $\delta$, T may contain a great many approximate occurrences of P.
- Q: Can all approximate occurrences be explicitly output in O(nm)?
- A: Perhaps not.
- The textbook suggests locating all j s.t. $V(n,j) \ge \delta$ and, for each such j, explicitly outputting a shortest approximate occurrence ending there.
  - Traverse backpointers from (n,j) until reaching (0,k)
  - Choose vertical pointers over diagonal pointers
  - Choose diagonal pointers over horizontal pointers.
31Approximate Matching
- How does this particular preference produce a shortest path?
  - Choose vertical pointers over diagonal pointers
  - Choose diagonal pointers over horizontal pointers.
- Recall
  - Horizontal edges correspond to inserting a space in P; this lengthens the path. Clearly this is to be avoided.
  - Diagonal edges correspond to matches or mismatches.
  - Vertical edges correspond to inserting a space in T.
- There is no obvious reason for choosing vertical over diagonal edges; however, some preference must be made for tie-breaking.
- Except that choosing vertical results in a match that is shortest in T, since a vertical edge consumes no character of T.
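The tie-breaking rule can be sketched in Python without storing backpointers, by re-deriving each pointer from the table. The sketch assumes integer scores (so equality tests are exact), a zero row 0, and a column 0 charged as in global alignment. Names and the scoring scheme are illustrative.

```python
def shortest_occurrence(p, t, score, j_end):
    """Substring of t ending at position j_end (1-based) found by preferring
    vertical over diagonal over horizontal moves during the traceback."""
    n, m = len(p), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + score(p[i - 1], "_")
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + score(p[i - 1], t[j - 1]),
                V[i - 1][j] + score(p[i - 1], "_"),
                V[i][j - 1] + score("_", t[j - 1]),
            )
    i, j = n, j_end
    while i > 0:
        if V[i][j] == V[i - 1][j] + score(p[i - 1], "_"):
            i -= 1                # vertical: consumes no character of t
        elif j > 0 and V[i][j] == V[i - 1][j - 1] + score(p[i - 1], t[j - 1]):
            i, j = i - 1, j - 1   # diagonal: match or mismatch
        else:
            j -= 1                # horizontal: one more character of t
    return t[j:j_end]             # row 0 was reached at column k = j


def example_score(x, y):
    """Hypothetical scores: match +2, mismatch -1, space -1."""
    if x == "_" or y == "_":
        return -1
    return 2 if x == y else -1
```

With P = "abc" and T = "xxabcxx", the traceback from end position 5 returns "abc", and from end position 6 it returns "abcx".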
32- Global Alignment vs Local Alignment
33Local Alignment
- So far we have focused on global alignment. This makes sense if
  - We expect one string to be contained in the other, or
  - We expect the strings to be closely related.
- Example: comparing amino acid sequences from the same protein family.
34Local Alignment
- Local alignment exposes regions of high similarity.
- This may be interesting even if we expect the strings to be globally dissimilar.
- Can you think of examples?
  - Comparing proteins from different protein families
  - How about searching for lateral gene transfer from prokaryotic genomes to eukaryotic genomes?
  - Huh????
35Local Alignment
- Local alignment problem: find maximally similar (optimal global alignment) substrings α and β of S1 and S2, respectively.
- Example from text: S1 = pqraxabcstvq, S2 = xyaxbacsll
  - α = a x a b - c s
  - β = a x - b a c s
- This global alignment is predicated on
  - a score of 2 for a match
  - a score of -2 for a mismatch
  - a score of -1 for a space
- Resulting in a value of 8 (five matches and two spaces).
36Computing Local Alignment
- Q: How can local alignment be computed?
- Q: Can global alignment be used to find local alignment?
- A: Not efficiently. Global alignment effectively averages out local similarity.
- Use explicit search for local similarity.
37Computing Local Alignment
- Q: Assuming S1 and S2 have respective lengths n and m, how many pairs of substrings are there?
- A: There are $O(n^2 m^2)$ pairs of substrings.
- Q: If we wanted to, how could we show there are this many substrings?
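One way to see the bound is to count substrings by their start and end positions. This short worked count is a sketch and is not taken from the slides:

$$\#\{\text{nonempty substrings of } S_1\} = \binom{n+1}{2} = \frac{n(n+1)}{2}, \qquad \#\{\text{nonempty substrings of } S_2\} = \frac{m(m+1)}{2},$$

so the number of pairs is $\frac{n(n+1)}{2} \cdot \frac{m(m+1)}{2} = \Theta(n^2 m^2)$.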
38Computing Local Alignment
- Observation: computing a global alignment separately for each of the $O(n^2 m^2)$ pairs of substrings takes far more than O(nm) time.
- Surprisingly, we can compute local alignment in O(nm) even though there are $O(n^2 m^2)$ pairs of substrings.
- Assumption: the global alignment of two empty strings has value zero.
39Computing Local Alignment
- First consider a restricted version of local alignment.
- Defn. The local suffix alignment problem entails finding a suffix α of S1[1..i] and a suffix β of S2[1..j] s.t. V(α,β) is the maximum over all pairs of suffixes of S1[1..i] and S2[1..j].
- Let v(i,j) denote the value of the optimal local suffix alignment for the index pair i,j.
40Computing Local Alignment
- Local suffix alignment example
- S1 = abcxdex, S2 = xxcxdeabc. Score 2 for matches and -1 for mismatches or spaces.
- v(3,4) = 1, how?
  - The c's match but there is an additional "-" aligned with x.
- v(4,4) = 4, how?
  - The c's match and the final x's match.
- v(5,4) = 3, how?
  - Same as v(4,4) but extended with d aligned with "-".
41Computing Local Alignment
- Observation: $v(i,j) \ge 0$.
- Q: Why is this true?
- A: We can always choose α and/or β to be the empty string.
- Let $v^*$ denote the value of the optimal local alignment for strings of length n and m.
- Thm. $v^* = \max\{\, v(i,j) : i \le n,\ j \le m \,\}$
42Computing Local Alignment
- We need to understand why this theorem, $v^* = \max\{\, v(i,j) : i \le n,\ j \le m \,\}$, is true.
- Proof ($\ge$):
- $v^* \ge \max\{\, v(i,j) : i \le n,\ j \le m \,\}$ since any optimal local suffix alignment is also a local alignment.
43Computing Local Alignment
- ($\le$):
- WLOG assume $v^*$ is derived from the optimal solution involving substrings α and β with end indices i and j. Then α and β define a local suffix alignment for indices i and j, thus $v^* \le v(i,j) \le \max\{\, v(i,j) : i \le n,\ j \le m \,\}$.
- From this it is clear that a solution to the local suffix alignment problem also solves the local alignment problem.
44Computing Local Alignment
- Thm. $v(i,j) = \max[\, 0,\ v(i-1,j-1) + s(S_1(i), S_2(j)),\ v(i-1,j) + s(S_1(i), \_),\ v(i,j-1) + s(\_, S_2(j)) \,]$
- where $v(i,0) = 0$ and $v(0,j) = 0$ for all i, j.
- Q: What does this recurrence say?
- A: The solution to the local suffix alignment problem v(i,j) is the largest of
  - 0: punt and choose α and β to be empty strings
  - v(i-1, j-1) extended by aligning S1(i) and S2(j)
  - v(i-1, j) extended by aligning S1(i) with _
  - v(i, j-1) extended by aligning _ with S2(j)
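The recurrence can be checked against the earlier local suffix alignment example (S1 = abcxdex, S2 = xxcxdeabc, match 2, mismatch or space -1) with a small Python sketch. The function names are illustrative, and "_" stands for the space character.

```python
def local_suffix_table(s1, s2, score):
    """Table of v(i, j), the optimal local suffix alignment values."""
    n, m = len(s1), len(s2)
    v = [[0] * (m + 1) for _ in range(n + 1)]  # v(i,0) = v(0,j) = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            v[i][j] = max(
                0,  # choose empty suffixes
                v[i - 1][j - 1] + score(s1[i - 1], s2[j - 1]),
                v[i - 1][j] + score(s1[i - 1], "_"),
                v[i][j - 1] + score("_", s2[j - 1]),
            )
    return v


def example_score(x, y):
    """Scores of the slide example: match +2, mismatch -1, space -1."""
    if x == "_" or y == "_":
        return -1
    return 2 if x == y else -1
```

For the example strings the table gives v(3,4) = 1, v(4,4) = 4 and v(5,4) = 3, matching the values worked out by hand.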
45Computing Local Alignment
- Q: What is the difference between the equations for global alignment and local suffix alignment?
- A: There are two differences
  - The inclusion of 0 in the local suffix alignment recurrence
  - The base conditions for local suffix alignment: v(i,0) = 0 and v(0,j) = 0 for all i, j. This is similar to finding approximate occurrences but not to general global alignment.
46Computing Local Alignment
- Approach to computing $v^*$
  - Compute the table for v(i,j).
  - Search the entire table for the largest value; let $(i^*, j^*)$ denote the cell containing the largest value.
  - Follow backpointers from cell $(i^*, j^*)$ to a cell $(i', j')$ which has the value zero. This gives the optimal local alignment.
  - The optimal local alignment substrings are then α = S1[i'..i*] and β = S2[j'..j*].
47Computing Local Alignment
- Analysis of computing $v^*$
  - We know that computing the table to solve $v^*$ takes time O(nm).
  - The table contains all optimal local alignments for v(i,j). An alignment can be found by locating a cell with value $v^*$ and tracing back from it.
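Putting the steps together, a minimal Python sketch of the whole procedure: fill the table, locate the largest cell, and trace back to a zero cell. The names are illustrative. Pointers are re-derived from the table instead of being stored, which assumes integer scores. The slices use 0-based Python indexing, so they cover the characters consumed after the zero cell.

```python
def local_alignment(s1, s2, score):
    """Return (v*, alpha, beta) for an optimal local alignment."""
    n, m = len(s1), len(s2)
    v = [[0] * (m + 1) for _ in range(n + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            v[i][j] = max(
                0,
                v[i - 1][j - 1] + score(s1[i - 1], s2[j - 1]),
                v[i - 1][j] + score(s1[i - 1], "_"),
                v[i][j - 1] + score("_", s2[j - 1]),
            )
            if v[i][j] > best:
                best, bi, bj = v[i][j], i, j
    # Trace back from the largest cell until a cell with value zero.
    i, j = bi, bj
    while v[i][j] > 0:
        if v[i][j] == v[i - 1][j - 1] + score(s1[i - 1], s2[j - 1]):
            i, j = i - 1, j - 1
        elif v[i][j] == v[i - 1][j] + score(s1[i - 1], "_"):
            i -= 1
        else:
            j -= 1
    return best, s1[i:bi], s2[j:bj]


def example_score(x, y):
    """Scores of the textbook example: match +2, mismatch -2, space -1."""
    if x == "_" or y == "_":
        return -1
    return 2 if x == y else -2
```

On the textbook example, `local_alignment("pqraxabcstvq", "xyaxbacsll", example_score)` returns `(8, "axabcs", "axbacs")`.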