Title: Inexact Matching
1. Inexact Matching
- General Problem
  - Input
    - Strings S and T
  - Questions
    - How distant is S from T?
    - How similar is S to T?
- Solution Technique
  - Dynamic programming with a cost/similarity/scoring matrix
2. Biological Motivation
- Read pages 210-214 in textbook
- First Fact of Biological Sequence Analysis
  - In biomolecular sequences (DNA, RNA, amino acid sequences), high sequence similarity usually implies significant functional or structural similarity
  - Sequence similarity implies functional/structural similarity
  - The converse is NOT true
- Evolution reuses, builds upon, duplicates, and modifies successful structures
3. Measuring Distance of S and T
- Consider S and T
- We can transform S into T using the following four operations
  - insertion of a character into S
  - deletion of a character from S
  - substitution (replacement) of a character in S by another character (typically in T)
  - matching (no operation)
4. Example
- S = vintner
- T = writers
- Transformation
  - vintner
  - wintner (Replace v with w)
  - wrintner (Insert r)
  - writner (Delete first n)
  - writer (Delete second n)
  - writers (Insert s)
5. Example
- Edit Transcript (or just transcript)
  - a string that describes the transformation of one string into the other
  - R = replace, I = insert, M = match, D = delete
- Example
  - R I M D M D M M I
  - v   i n t n e r
  - w r i   t   e r s
6. Edit Distance
- Edit distance of strings S and T
  - The minimum number of edit operations (insertion, deletion, replacement) needed to transform string S into string T
  - Also called Levenshtein distance [299]; Levenshtein appears to have been the first to define this concept
- Optimal transcript
  - An edit transcript of S and T that has the minimum number of edit operations
  - There may be several cooptimal transcripts
7. Alignment
- A global alignment of strings S and T is obtained
  - by inserting spaces (dashes) into S and T
    - they should have the same number of characters (including dashes) at the end
  - then placing the two strings over each other, matching one character (or dash) in S with a unique character (or dash) in T
- Note: ALL positions in both S and T are involved
  - Later, we will consider local alignments
8. Alignments and Edit Transcripts
- Example Alignment
  - v-intner-
  - wri-t-ers
- Alignments and edit transcripts are interrelated
  - edit transcript emphasizes process
    - the specific mutational events
  - alignment emphasizes product
    - the relationship between the two strings
- Alignments are often easier to work with and visualize
  - they also generalize better to more than 2 strings
9. Edit Distance Problem
- Input
  - 2 strings S and T
- Task
  - Output edit distance of S and T
  - Output optimal edit transcript
  - Output optimal alignment
- Solution method
  - Dynamic Programming
10. Definition of D(i,j)
- Let D(i,j) be the edit distance of S[1..i] and T[1..j]
  - The edit distance of the first i characters of S with the first j characters of T
- Let |S| = n, |T| = m
- D(n,m) = edit distance of S and T
- We will compute D(i,j) for all i and j such that 0 ≤ i ≤ n, 0 ≤ j ≤ m
11. Recurrence Relation
- Base Case
  - For 0 ≤ i ≤ n, D(i,0) = i
  - For 0 ≤ j ≤ m, D(0,j) = j
- Recursive Case
  - For 0 < i ≤ n, 0 < j ≤ m
  - D(i,j) = min of
    - D(i-1,j) + 1
    - D(i,j-1) + 1
    - D(i-1,j-1) + d(i,j)
  - d(i,j) = 0 if S(i) = T(j) and is 1 otherwise
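The recurrence above can be sketched directly as a table fill. This is a minimal illustration, not a reference implementation: the function name `edit_distance` is hypothetical, and Python strings are 0-indexed, so S(i) corresponds to `s[i - 1]`.

```python
def edit_distance(s: str, t: str) -> int:
    """Fill the (n+1) x (m+1) table D row by row and return D(n, m)."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]

    # Base cases: D(i,0) = i and D(0,j) = j.
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j

    # Recursive case: minimum over deletion, insertion, match/replacement.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,
                          D[i][j - 1] + 1,
                          D[i - 1][j - 1] + d)
    return D[n][m]


print(edit_distance("vintner", "writers"))  # 5
```

The result for vintner/writers agrees with the five edit operations used in the earlier example.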
12. What the various cases mean
- D(i,j) = min of
  - D(i-1,j) + 1
    - Align S[1..i-1] with T[1..j] optimally
    - Match S(i) with a dash in T
  - D(i,j-1) + 1
    - Align S[1..i] with T[1..j-1] optimally
    - Match a dash in S with T(j)
  - D(i-1,j-1) + d(i,j)
    - Align S[1..i-1] with T[1..j-1] optimally
    - Match S(i) with T(j)
13. Computing D(i,j) values
14. Initialization: Base Case
15. Row i = 1
16. Entry i = 2, j = 3
17. Calculation methodologies
- Location of edit distance
  - D(n,m)
- Example was to calculate row by row
- Can also calculate column by column
- Can also use antidiagonals
- Key is to build from upper left corner
18. Traceback
- Using table to construct optimal transcript
- Pointers in cell D(i,j)
  - Set a pointer from cell (i,j) to
    - cell (i,j-1) if D(i,j) = D(i,j-1) + 1
    - cell (i-1,j) if D(i,j) = D(i-1,j) + 1
    - cell (i-1,j-1) if D(i,j) = D(i-1,j-1) + d(i,j)
- Follow path of pointers from (n,m) back to (0,0)
- Example: Figure 11.3 on page 222
19. What the pointers mean
- horizontal pointer: cell (i,j) to cell (i,j-1)
  - Align T(j) with a space in S
  - Insert T(j) into S
- vertical pointer: cell (i,j) to cell (i-1,j)
  - Align S(i) with a space in T
  - Delete S(i) from S
- diagonal pointer: cell (i,j) to cell (i-1,j-1)
  - Align S(i) with T(j)
  - Replace S(i) with T(j), or match if S(i) = T(j)
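The pointer rules can be sketched as a traceback that rechecks the recurrence at each cell instead of storing pointers. This is one possible sketch: the name `optimal_transcript` and the tie-breaking order (diagonal, then vertical, then horizontal) are assumptions, so it returns just one of possibly several cooptimal transcripts.

```python
def optimal_transcript(s: str, t: str) -> str:
    """Return one optimal edit transcript over the letters M, R, I, D."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + d)

    # Walk from (n, m) back to (0, 0), following any pointer that would be set.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (s[i - 1] != t[j - 1]):
            ops.append("M" if s[i - 1] == t[j - 1] else "R")  # diagonal pointer
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            ops.append("D")  # vertical pointer: delete S(i)
            i -= 1
        else:
            ops.append("I")  # horizontal pointer: insert T(j)
            j -= 1
    return "".join(reversed(ops))


print(optimal_transcript("vintner", "writers"))
```

Rechecking the recurrence trades a little extra work per step for not having to store up to three pointers per cell.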
20. Table and transcripts
- The pointers represent all optimal transcripts
- Theorem
  - Any path from (n,m) to (0,0) following the pointers specifies an optimal transcript.
  - Conversely, any optimal transcript is specified by such a path.
  - The correspondence between paths and transcripts is one to one.
21. Running Time
- Initialization of table (row 0 and column 0)
  - O(n+m)
- Calculating table and pointers
  - O(nm)
- Traceback for one optimal transcript or optimal alignment
  - O(n+m)
22. Operation-Weight Edit Distance
- Consider S and T
- We can assign weights to the various operations
  - insertion/deletion of a character: cost d
  - substitution (replacement) of a character: cost r
  - matching: cost e
- Previous case: d = r = 1, e = 0
23. Modified Recurrence Relation
- Base Case
  - For 0 ≤ i ≤ n, D(i,0) = i × d
  - For 0 ≤ j ≤ m, D(0,j) = j × d
- Recursive Case
  - For 0 < i ≤ n, 0 < j ≤ m
  - D(i,j) = min of
    - D(i-1,j) + d
    - D(i,j-1) + d
    - D(i-1,j-1) + d(i,j)
  - d(i,j) = e if S(i) = T(j) and is r otherwise
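As a sketch of the operation-weight recurrence, the only change from the unit-cost table fill is that the three constants become parameters. The function name `operation_weight_distance` is hypothetical; the parameter names `d`, `r`, `e` follow the slide.

```python
def operation_weight_distance(s: str, t: str, d=1, r=1, e=0):
    """Edit distance with indel cost d, replacement cost r, match cost e."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i * d          # delete the first i characters of S
    for j in range(m + 1):
        D[0][j] = j * d          # insert the first j characters of T
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = e if s[i - 1] == t[j - 1] else r
            D[i][j] = min(D[i - 1][j] + d,
                          D[i][j - 1] + d,
                          D[i - 1][j - 1] + diag)
    return D[n][m]


# With d = r = 1, e = 0 this is the plain edit distance.
print(operation_weight_distance("vintner", "writers"))   # 5
# If a replacement costs more than a deletion plus an insertion,
# the optimum avoids replacements.
print(operation_weight_distance("a", "b", d=1, r=3))      # 2
```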
24. Alphabet-Weight Edit Distance
- Define the weight of each possible substitution
  - r(a,b), where a is being replaced by b, for all a,b in the alphabet
  - For example, with DNA, maybe r(A,T) > r(A,G)
- Likewise, the insertion/deletion cost I(a) may vary by character
- Operation-weight edit distance is a special case of this variation
- Weighted edit distance refers to this alphabet-weight setting
25. Modified Recurrence Relation
- Base Case
  - For 0 ≤ i ≤ n, D(i,0) = Σ_{1 ≤ k ≤ i} I(S(k))
  - For 0 ≤ j ≤ m, D(0,j) = Σ_{1 ≤ k ≤ j} I(T(k))
- Recursive Case
  - For 0 < i ≤ n, 0 < j ≤ m
  - D(i,j) = min of
    - D(i-1,j) + I(S(i))
    - D(i,j-1) + I(T(j))
    - D(i-1,j-1) + d(i,j)
  - d(i,j) = r(S(i), T(j))
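A small sketch of the alphabet-weight recurrence, taking r and I as functions. The DNA weights below are purely illustrative assumptions (transitions A/G and C/T cost 1, other substitutions cost 2, every indel costs 2), as are the names `alphabet_weight_distance`, `dna_r`, and `dna_indel`.

```python
def alphabet_weight_distance(s, t, r, indel):
    """Edit distance where r(a, b) and indel(a) depend on the characters."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + indel(s[i - 1])   # running sum of I(S(k))
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + indel(t[j - 1])   # running sum of I(T(k))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j] + indel(s[i - 1]),
                          D[i][j - 1] + indel(t[j - 1]),
                          D[i - 1][j - 1] + r(s[i - 1], t[j - 1]))
    return D[n][m]


# Hypothetical DNA weights, chosen only so that r(A,T) > r(A,G).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def dna_r(a, b):
    if a == b:
        return 0
    return 1 if (a, b) in TRANSITIONS else 2

def dna_indel(a):
    return 2


print(alphabet_weight_distance("A", "G", dna_r, dna_indel))  # 1
print(alphabet_weight_distance("A", "T", dna_r, dna_indel))  # 2
```

Passing `r = lambda a, b: 0 if a == b else 1` and `indel = lambda a: 1` recovers the unit-cost edit distance, which illustrates the special-case relationship noted above.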
26. Measuring Similarity of S and T
- Definitions
  - Let Σ be the alphabet for strings S and T
  - Let Σ' be the alphabet Σ with the character - added
  - For any two characters x,y in Σ', s(x,y) denotes the value (or score) obtained by aligning x with y
  - For a given alignment A of S and T, let S' and T' denote the strings after the chosen insertion of spaces, and l their new length
  - The value of alignment A is Σ_{1 ≤ i ≤ l} s(S'(i), T'(i))
27. Example
- S': a  b  a  a  -  b  a  b
- T': a  a  a  a  a  b  -  b
- Value: 1 - 2 + 1 + 1 + 0 + 2 + 0 + 2 = 5
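The value of an alignment is just a column-by-column sum, which can be sketched as follows. The scoring function here is an assumption inferred from the per-column values in the example (a with a scores 1, b with b scores 2, a with b scores -2, any space scores 0); the slides do not state the matrix explicitly. The names `example_score` and `alignment_value` are hypothetical.

```python
# Assumed scores, read off the column values of the example alignment.
PAIR_SCORES = {("a", "a"): 1, ("b", "b"): 2, ("a", "b"): -2, ("b", "a"): -2}

def example_score(x: str, y: str) -> int:
    """s(x, y) over the alphabet {a, b, -}; a space scores 0 here."""
    if x == "-" or y == "-":
        return 0
    return PAIR_SCORES[(x, y)]

def alignment_value(s_prime: str, t_prime: str, score) -> int:
    """Sum s(S'(i), T'(i)) over all l columns of the alignment."""
    if len(s_prime) != len(t_prime):
        raise ValueError("aligned strings must have the same length")
    return sum(score(x, y) for x, y in zip(s_prime, t_prime))


print(alignment_value("abaa-bab", "aaaaab-b", example_score))  # 5
```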
28. String Similarity Problem
- Input
  - 2 strings S and T
  - Scoring matrix s for alphabet Σ'
- Task
  - Output optimal alignment value of S and T
    - The alignment of S and T with maximal, not minimal, value
  - Output this alignment
29. Modified Recurrence Relation
- Base Case
  - For 0 ≤ i ≤ n, V(i,0) = Σ_{1 ≤ k ≤ i} s(S(k),-)
  - For 0 ≤ j ≤ m, V(0,j) = Σ_{1 ≤ k ≤ j} s(-,T(k))
- Recursive Case
  - For 0 < i ≤ n, 0 < j ≤ m
  - V(i,j) = max of
    - V(i-1,j) + s(S(i),-)
    - V(i,j-1) + s(-,T(j))
    - V(i-1,j-1) + s(S(i), T(j))
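A minimal sketch of the similarity recurrence, with the scoring matrix passed in as a function over Σ'. The names `global_alignment_value` and `negated_edit_score` are hypothetical. The example scoring (0 for a match, -1 for a mismatch or a space) is an assumption chosen so that the optimal value is the negated edit distance.

```python
def global_alignment_value(s: str, t: str, score):
    """Return V(n, m), the maximum value of a global alignment of s and t."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + score(s[i - 1], "-")
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + score("-", t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(V[i - 1][j] + score(s[i - 1], "-"),
                          V[i][j - 1] + score("-", t[j - 1]),
                          V[i - 1][j - 1] + score(s[i - 1], t[j - 1]))
    return V[n][m]


def negated_edit_score(x: str, y: str) -> int:
    """Hypothetical scoring: match 0, mismatch -1, space -1."""
    return 0 if x == y else -1


print(global_alignment_value("vintner", "writers", negated_edit_score))  # -5
```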
30. Longest Common Subsequence Problem
- Given 2 strings S and T, a common subsequence is a subsequence that appears in both S and T.
- The longest common subsequence problem is to find a longest common subsequence (lcs) of S and T
  - subsequence characters need not be contiguous
  - different from a substring
- O(nm) solution
  - Make the scoring matrix 1 for a match, 0 for a mismatch or a space
  - The matched characters in an alignment of maximal value form a longest common subsequence
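The reduction to alignment can be sketched as below: score 1 for a match and 0 otherwise, then collect the matched characters during traceback. Scoring spaces as 0 is an assumption here (the slide only mentions matches and mismatches), and the function name is hypothetical.

```python
def longest_common_subsequence(s: str, t: str) -> str:
    """Return one longest common subsequence of s and t."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]   # spaces score 0, so row/col 0 stay 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 1 if s[i - 1] == t[j - 1] else 0
            V[i][j] = max(V[i - 1][j], V[i][j - 1], V[i - 1][j - 1] + match)

    # Traceback: matched columns on an optimal path spell out an lcs.
    chars = []
    i, j = n, m
    while i > 0 and j > 0:
        if s[i - 1] == t[j - 1] and V[i][j] == V[i - 1][j - 1] + 1:
            chars.append(s[i - 1])
            i, j = i - 1, j - 1
        elif V[i][j] == V[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(chars))


print(longest_common_subsequence("vintner", "writers"))  # iter
```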
31. Similarity and Distance
- If we are focused on aligning both entire strings, maximizing similarity is essentially identical to minimizing distance
  - Just need to modify scoring matrices appropriately
- When we consider substrings of uncertain length, maximizing similarity often makes more sense than minimizing distance
  - Overlapping strings
  - Local alignment
32. Overlapping Strings
- Find best alignment where the two strings overlap without penalizing for the unmatched ends
- Application: sequence assembly problem
  - strings are likely to overlap without being substrings of each other
- Solution method
  - End-space free variant of dynamic programming
  - Change base conditions so that V(i,0) = V(0,j) = 0
  - Need to search over row n and column m for optimal value
    - Optimal value may not be in entry (n,m)
- Why is max similarity better than min distance?
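The end-space free variant can be sketched with two changes to the global fill: zero base conditions, and a final search over the last row and last column. The function name and the example scoring (match +1, mismatch -1, space -1) are assumptions for illustration.

```python
def end_space_free_value(s: str, t: str, score):
    """Best alignment value when spaces at either end of s or t are free."""
    n, m = len(s), len(t)
    # Base conditions V(i,0) = V(0,j) = 0: leading end spaces cost nothing.
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(V[i - 1][j] + score(s[i - 1], "-"),
                          V[i][j - 1] + score("-", t[j - 1]),
                          V[i - 1][j - 1] + score(s[i - 1], t[j - 1]))
    # Trailing end spaces are free too: search row n and column m.
    best_in_row_n = max(V[n])
    best_in_col_m = max(V[i][m] for i in range(n + 1))
    return max(best_in_row_n, best_in_col_m)


def unit_score(x: str, y: str) -> int:
    """Hypothetical scoring: match +1, mismatch -1, space -1."""
    if x == "-" or y == "-":
        return -1
    return 1 if x == y else -1


# The suffix "abcd" of the first string overlaps the prefix "abcd" of the second.
print(end_space_free_value("xxxabcd", "abcdyyy", unit_score))  # 4
```

In this example the optimum sits in row n at column 4, not in entry (n,m).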
33. Maximally Similar Substrings
- Local alignment problem
  - Input
    - Two strings S and T
  - Task
    - Find substrings s and t of S and T that have the maximum possible alignment value, and report this value.
    - Let v* denote this value.
- Why is max similarity better than min distance?
  - Read pages 230-231 for motivation
34. Local suffix alignments
- Define v(i,j) to be the value of the optimal alignment of any of the i+1 suffixes of S[1..i] with any of the j+1 suffixes of T[1..j].
  - We bound v(i,j) to be at least 0 by scoring the alignment of two empty suffixes as 0
- Theorem
  - v* (the value of the optimal local alignment) = max { v(i,j) : 1 ≤ i ≤ n, 1 ≤ j ≤ m }
35. Recurrences for local suffix alignments
- Base Case
  - For 0 ≤ i ≤ n, v(i,0) = 0
  - For 0 ≤ j ≤ m, v(0,j) = 0
- Recursive Case
  - For 0 < i ≤ n, 0 < j ≤ m
  - v(i,j) = max of
    - 0
    - v(i-1,j) + s(S(i),-)
    - v(i,j-1) + s(-,T(j))
    - v(i-1,j-1) + s(S(i), T(j))
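A sketch of the local suffix alignment recurrence that also records where the maximum occurs, since that cell is where a traceback would start. The function name, the returned cell, and the example scoring (match +1, mismatch -1, space -1) are assumptions.

```python
def local_alignment_value(s: str, t: str, score):
    """Return (best value, (i, j)) where v(i, j) is maximal over the table."""
    n, m = len(s), len(t)
    v = [[0] * (m + 1) for _ in range(n + 1)]   # v(i,0) = v(0,j) = 0
    best, best_cell = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            v[i][j] = max(0,
                          v[i - 1][j] + score(s[i - 1], "-"),
                          v[i][j - 1] + score("-", t[j - 1]),
                          v[i - 1][j - 1] + score(s[i - 1], t[j - 1]))
            if v[i][j] > best:
                best, best_cell = v[i][j], (i, j)
    return best, best_cell


def local_score(x: str, y: str) -> int:
    """Hypothetical scoring: match +1, mismatch -1, space -1."""
    if x == "-" or y == "-":
        return -1
    return 1 if x == y else -1


print(local_alignment_value("xxxabcdxxx", "yyabcdyy", local_score))  # (4, (7, 6))
```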
36. Comments
- Traceback
  - No longer start from cell (n,m)
  - Search whole table for max value and start from there
  - Still O(mn) running time
- Terminology
  - In the literature, the distinction between problem statements and solution methods is not clear
  - Global alignment is often referred to as Needleman-Wunsch alignment
    - Their solution method was cubic in terms of m and n
  - Smith-Waterman is often used to refer to both local alignments and their solution method
37. Comments continued
- Scoring schemes
  - The utility of optimal local alignments is highly dependent on the scoring scheme
  - Examples
    - matches = 1, mismatches = spaces = 0 leads to longest common subsequence
    - mismatches and spaces as big negatives leads to longest common substring
  - Average score in the matrix must be negative, otherwise local alignments tend to be global
  - There is a theory developed about scoring schemes that we will cover later.
38. Aligning with Gaps
- Gap: any maximal run of spaces in a single string of a given alignment
- Example
  - S = aaabbbcccdddeeefff
  - T = aaabbbdddeeefffggg
- Alignment
  - aaabbbcccdddeeefff---
  - aaabbb---dddeeefffggg
39. Scoring with gaps
- Example Scoring
  - aaabbbcccdddeeefff---
  - aaabbb---dddeeefffggg
  - (1+1+1+1+1+1) - 1 + (1+1+1+1+1+1+1+1+1) - 1 = 13
- Why include gaps in scoring schemes?
  - Read pages 236-240
  - When an insertion/deletion event occurs, often more than a single character is inserted or deleted.
  - A single gap cost helps model the fact that a sequence of insertions/deletions is really one mutational event
40. Constant gap weight model
- We present a series of possible gap weight models, each of which is a special case of the next one
- Constant gap weight model
  - each individual space is free (Ws = 0)
  - each gap has constant cost Wg
- Alignment problem boils down to finding an alignment that maximizes
  - (match scores) - (mismatch scores) - Wg × (# of gaps)
- Dynamic programming can still solve in O(nm) time
41. Affine gap weight model
- Gap opening versus gap extension penalties
  - each gap has constant cost Wg
  - each individual space has cost Ws (typically Ws < Wg)
- Alignment problem boils down to finding an alignment that maximizes
  - (match scores) - (mismatch scores) - Wg × (# of gaps) - Ws × (# of spaces)
- Dynamic programming can still solve in O(nm) time
- Probably the most commonly used model because of its efficiency and generality
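To make the objective concrete, the following sketch scores a given alignment under the affine model. Setting `ws=0` gives the constant gap weight model. The function name and the default values (match +1, mismatch penalty 1, Wg = 1) are assumptions chosen to reproduce the value 13 from the earlier aaabbb example.

```python
import re

def gapped_alignment_score(s_prime: str, t_prime: str,
                           match=1, mismatch=1, wg=1, ws=0):
    """Match scores - mismatch scores - wg * (# gaps) - ws * (# spaces)."""
    if len(s_prime) != len(t_prime):
        raise ValueError("aligned strings must have the same length")
    total = 0
    for x, y in zip(s_prime, t_prime):
        if x == "-" or y == "-":
            continue                      # spaces are charged below
        total += match if x == y else -mismatch
    # A gap is a maximal run of spaces within a single string.
    gaps = len(re.findall(r"-+", s_prime)) + len(re.findall(r"-+", t_prime))
    spaces = s_prime.count("-") + t_prime.count("-")
    return total - wg * gaps - ws * spaces


a = "aaabbbcccdddeeefff---"
b = "aaabbb---dddeeefffggg"
print(gapped_alignment_score(a, b))            # 13 (constant gap weight, Ws = 0)
print(gapped_alignment_score(a, b, ws=1))      # 7  (affine: 15 - 2*1 - 6*1)
```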
42. Convex gap weight model
- Extension penalty should not be a constant but rather decrease as the length of the gap increases
- One example
  - each gap has cost Wg + log q, where q is the length of the gap
- Solution now requires more than O(nm) time
  - Chapter 13 gives an O(nm log m) time solution
  - Further improvement is possible, but costly
43. Arbitrary gap weight model
- Gap cost is an arbitrary function of gap length
  - each gap has cost w(q), where q is the length of the gap
  - no properties of w(q) are assumed, such as its second derivative being negative
- Solution time is now O(nm^2 + n^2 m)
  - cubic cost, similar to the original Needleman-Wunsch solution
44. Recurrences for arbitrary gap weights
- Base Case
  - For 0 ≤ i ≤ n, V(i,0) = -w(i)
  - For 0 ≤ j ≤ m, V(0,j) = -w(j)
- Recursive Case
  - For 0 < i ≤ n, 0 < j ≤ m
  - V(i,j) = max of
    - V(i-1,j-1) + s(S(i),T(j))
    - max over 0 ≤ k ≤ j-1 of V(i,k) - w(j-k)
      - Match S[1..i] with T[1..k] and gap of length j-k at end of T
    - max over 0 ≤ k ≤ i-1 of V(k,j) - w(i-k)
      - Match S[1..k] with T[1..j] and gap of length i-k at end of S
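A direct sketch of these recurrences shows where the cubic cost comes from: each cell scans a whole row prefix and a whole column prefix. The function name is hypothetical, and the sketch assumes w(0) = 0 so that V(0,0) = 0.

```python
def arbitrary_gap_value(s: str, t: str, score, w):
    """V(n, m) for gap cost w(q); runs in O(nm^2 + n^2 m) time."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        V[i][0] = -w(i)                    # assumes w(0) == 0
    for j in range(m + 1):
        V[0][j] = -w(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            g = V[i - 1][j - 1] + score(s[i - 1], t[j - 1])
            # Last j-k characters of T sit opposite one gap.
            e = max(V[i][k] - w(j - k) for k in range(j))
            # Last i-k characters of S sit opposite one gap.
            f = max(V[k][j] - w(i - k) for k in range(i))
            V[i][j] = max(g, e, f)
    return V[n][m]


def pm_one(x: str, y: str) -> int:
    """Hypothetical character scoring: match +1, mismatch -1."""
    return 1 if x == y else -1


# Constant gap weight as a special case: every nonempty gap costs 1.
constant_gap = lambda q: 0 if q == 0 else 1
print(arbitrary_gap_value("aaabbbcccdddeeefff", "aaabbbdddeeefffggg",
                          pm_one, constant_gap))  # 13
```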
45. Recurrences for affine gap weights
- Base Case
  - V(0,0) = 0
  - For 0 < i ≤ n, V(i,0) = E(i,0) = -Wg - i Ws
  - For 0 < j ≤ m, V(0,j) = F(0,j) = -Wg - j Ws
- Recursive Case
  - For 0 < i ≤ n, 0 < j ≤ m
  - V(i,j) = max { E(i,j), F(i,j), G(i,j) }
  - G(i,j) = V(i-1,j-1) + s(S(i),T(j))
  - E(i,j) = max { E(i,j-1), V(i,j-1) - Wg } - Ws
    - max checks if gap begins at S(i) or if it began earlier
  - F(i,j) = max { F(i-1,j), V(i-1,j) - Wg } - Ws
    - max checks if gap begins at T(j) or if it began earlier
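The affine recurrences can be sketched with three tables filled together in O(nm) time. One deliberate departure, stated here as an assumption: this sketch starts E(i,0) and F(0,j) at negative infinity, a common initialization, so that a gap in one string followed directly by a gap in the other pays two opening costs. With E(i,0) = V(i,0) the second opening cost would be skipped. The function name is hypothetical.

```python
def affine_gap_value(s: str, t: str, score, wg, ws):
    """V(n, m) under gap opening cost wg and per-space cost ws."""
    NEG_INF = float("-inf")
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    E = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # ends with T(j) opposite a space
    F = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # ends with S(i) opposite a space
    for i in range(1, n + 1):
        V[i][0] = F[i][0] = -wg - i * ws
    for j in range(1, m + 1):
        V[0][j] = E[0][j] = -wg - j * ws
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            G = V[i - 1][j - 1] + score(s[i - 1], t[j - 1])
            # Either extend an existing gap or open a new one, then pay for the space.
            E[i][j] = max(E[i][j - 1], V[i][j - 1] - wg) - ws
            F[i][j] = max(F[i - 1][j], V[i - 1][j] - wg) - ws
            V[i][j] = max(E[i][j], F[i][j], G)
    return V[n][m]


def plus_minus_one(x: str, y: str) -> int:
    """Hypothetical character scoring: match +1, mismatch -1."""
    return 1 if x == y else -1


# Wg = 1, Ws = 0 is the constant gap weight model on the earlier example.
print(affine_gap_value("aaabbbcccdddeeefff", "aaabbbdddeeefffggg",
                       plus_minus_one, wg=1, ws=0))  # 13
```

With `wg=0` and `ws=1` the recurrences collapse to the single-table similarity recurrence with a space score of -1.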