Inexact Matching - PowerPoint PPT Presentation

Provided by: erict9

Learn more at: http://www.cse.msu.edu

Transcript and Presenter's Notes

1
Inexact Matching

General Problem
Input
Strings S and T
Questions
How distant is S from T?
How similar is S to T?
Solution Technique
Dynamic programming with cost/similarity/scoring
matrix

2
Biological Motivation

Read pages 210-214 in textbook
First Fact of Biological Sequence Analysis
In biomolecular sequences (DNA, RNA, amino acid
sequences), high sequence similarity usually
implies significant functional or structural
similarity
sequence similarity implies functional/structural
similarity
Converse is NOT true
Evolution reuses, builds upon, duplicates, and
modifies successful structures

3
Measuring Distance of S and T

Consider S and T
We can transform S into T using the following
four operations
insertion of a character into S
deletion of a character from S
substitution (replacement) of a character in S by
another character (typically in T)
matching (no operation)

4
Example

S = vintner
T = writers
vintner
wintner (Replace v with w)
wrintner (Insert r)
writner (Delete first n)
writer (Delete second n)
writers (Insert s)

5
Example

Edit Transcript (or just transcript)
a string that describes the transformation of one
string into the other
Example
R I M D M D M M I
v   i n t n e r
w r i   t   e r s
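As a rough illustration of how a transcript relates the two strings, the sketch below expands a transcript into the two rows of the alignment it describes. The function name `transcript_to_alignment` and the use of `-` for a space are assumptions made for this example, not part of the slides.

```python
def transcript_to_alignment(transcript, s, t):
    """Expand an edit transcript (letters I, D, R, M) into two aligned rows.

    Hypothetical helper: I consumes a character of t, D consumes a character
    of s, and R/M consume one character of each string.
    """
    i = j = 0
    row_s, row_t = [], []
    for op in transcript:
        if op == "I":                 # insert t[j] into s: space in the s row
            row_s.append("-")
            row_t.append(t[j])
            j += 1
        elif op == "D":               # delete s[i]: space in the t row
            row_s.append(s[i])
            row_t.append("-")
            i += 1
        elif op in ("M", "R"):        # match or replace: one column, no space
            row_s.append(s[i])
            row_t.append(t[j])
            i += 1
            j += 1
        else:
            raise ValueError(f"unknown operation {op!r}")
    if i != len(s) or j != len(t):
        raise ValueError("transcript does not consume both strings exactly")
    return "".join(row_s), "".join(row_t)


rows = transcript_to_alignment("RIMDMDMMI", "vintner", "writers")
print(rows[0])
print(rows[1])
```

For the transcript RIMDMDMMI this yields the rows `v-intner-` and `wri-t-ers`.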

6
Edit Distance

Edit distance of strings S and T
The minimum number of edit operations (insertion,
deletion, replacement) needed to transform string
S into string T
Also called Levenshtein distance [299]; Levenshtein appears
to have been the first to define this concept
Optimal transcript
An edit transcript of S and T that has the
minimum number of edit operations
cooptimal transcripts

7
Alignment

A global alignment of strings S and T is obtained
by inserting spaces (dashes) into S and T
they should have the same number of characters
(including dashes) at the end
then placing two strings over each other matching
one character (or dash) in S with a unique
character (or dash) in T
Note ALL positions in both S and T are involved
Later, we will consider local alignments

8
Alignments and Edit transcripts

Example Alignment
v-intner-
wri-t-ers
Alignments and edit transcripts are interrelated
edit transcript emphasizes process
the specific mutational events
alignment emphasizes product
the relationship between the two strings
Alignments are often easier to work with and
visualize
also generalize better to more than 2 strings

9
Edit Distance Problem

Input
2 strings S and T
Task
Output edit distance of S and T
Output optimal edit transcript
Output optimal alignment
Solution method
Dynamic Programming

10
Definition of D(i,j)

Let D(i,j) be the edit distance of S[1..i] and
T[1..j]
The edit distance of the first i characters of S
with the first j characters of T
Let |S| = n, |T| = m
D(n,m) = edit distance of S and T
We will compute D(i,j) for all i and j such that
0 ≤ i ≤ n, 0 ≤ j ≤ m

11
Recurrence Relation

Base Case
For 0 ≤ i ≤ n, D(i,0) = i
For 0 ≤ j ≤ m, D(0,j) = j
Recursive Case
For 0 < i ≤ n, 0 < j ≤ m
D(i,j) = min of
D(i-1,j) + 1
D(i,j-1) + 1
D(i-1,j-1) + d(i,j)
where d(i,j) = 0 if S(i) = T(j) and 1 otherwise
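The recurrence can be sketched directly as a table fill. This is a minimal illustration that assumes unit costs as on the slide; the function names are made up for the example, and Python strings are 0-indexed, so S(i) corresponds to `s[i - 1]`.

```python
def edit_distance_table(s, t):
    """Fill the (n+1) x (m+1) table D row by row using the recurrence."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):          # base case: D(i,0) = i
        D[i][0] = i
    for j in range(m + 1):          # base case: D(0,j) = j
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(
                D[i - 1][j] + 1,        # S(i) against a space
                D[i][j - 1] + 1,        # T(j) against a space
                D[i - 1][j - 1] + d,    # S(i) against T(j)
            )
    return D


def edit_distance(s, t):
    """Edit distance is the bottom-right entry D(n,m)."""
    return edit_distance_table(s, t)[len(s)][len(t)]


print(edit_distance("vintner", "writers"))  # 5
```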

12
What the various cases mean

D(i,j) = min of
D(i-1,j) + 1
Align S[1..i-1] with T[1..j] optimally
Match S(i) with a dash in T
D(i,j-1) + 1
Align S[1..i] with T[1..j-1] optimally
Match a dash in S with T(j)
D(i-1,j-1) + d(i,j)
Align S[1..i-1] with T[1..j-1] optimally
Match S(i) with T(j)

13
Computing D(i,j) values
14
Initialization: Base Case
15
Row i = 1
16
Entry i = 2, j = 3
17
Calculation methodologies

Location of edit distance
D(n,m)
Example was to calculate row by row
Can also calculate column by column
Can also use antidiagonals
Key is to build from upper left corner

18
Traceback

Using table to construct optimal transcript
Pointers in cell D(i,j)
Set a pointer from cell (i,j) to
cell (i, j-1) if D(i,j) = D(i, j-1) + 1
cell (i-1,j) if D(i,j) = D(i-1,j) + 1
cell (i-1,j-1) if D(i,j) = D(i-1,j-1) + d(i,j)
Follow path of pointers from (n,m) back to (0,0)
Example: Figure 11.3 on page 222

19
What the pointers mean

horizontal pointer: cell (i,j) to cell (i, j-1)
Align T(j) with a space in S
Insert T(j) into S
vertical pointer: cell (i,j) to cell (i-1, j)
Align S(i) with a space in T
Delete S(i) from S
diagonal pointer: cell (i,j) to cell (i-1, j-1)
Align S(i) with T(j)
Replace S(i) with T(j), or match if S(i) = T(j)
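A traceback along these lines might look like the sketch below. It assumes unit costs and does not store pointers explicitly; instead it re-checks at each cell which neighbor could have produced the value. The tie-breaking order (diagonal, then vertical, then horizontal) is an arbitrary choice for this example, so it returns just one of possibly several co-optimal transcripts.

```python
def optimal_transcript(s, t):
    """Return one optimal edit transcript (letters M, R, I, D) for s -> t."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + d)

    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and s[i - 1] == t[j - 1]
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (0 if same else 1):
            ops.append("M" if same else "R")   # diagonal pointer
            i -= 1
            j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            ops.append("D")                    # vertical pointer
            i -= 1
        else:
            ops.append("I")                    # horizontal pointer
            j -= 1
    return "".join(reversed(ops))


print(optimal_transcript("vintner", "writers"))
```

Whatever transcript is returned, the number of letters other than M equals the edit distance.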

20
Table and transcripts

The pointers represent all optimal transcripts
Theorem
Any path from (n,m) to (0,0) following the
pointers specifies an optimal transcript.
Conversely, any optimal transcript is specified
by such a path.
The correspondence between paths and transcripts
is one to one.

21
Running Time

Initialization of table
O(nm)
Calculating table and pointers
O(nm)
Traceback for one optimal transcript or optimal
alignment
O(n + m) once the table and pointers are filled in (O(nm) overall)

22
Operation-Weight Edit Distance

Consider S and T
We can assign weights to the various operations
insertion/deletion of a character cost d
substitution (replacement) of a character cost r
matching cost e
Previous case: d = r = 1, e = 0

23
Modified Recurrence Relation

Base Case
For 0 ≤ i ≤ n, D(i,0) = i × d
For 0 ≤ j ≤ m, D(0,j) = j × d
Recursive Case
For 0 < i ≤ n, 0 < j ≤ m
D(i,j) = min of
D(i-1,j) + d
D(i,j-1) + d
D(i-1,j-1) + d(i,j)
where d(i,j) = e if S(i) = T(j) and r otherwise

24
Alphabet-Weight Edit Distance

Define weight of each possible substitution
r(a,b) where a is being replaced by b for all a,b
in the alphabet
For example, with DNA, maybe r(A,T) > r(A,G)
Likewise, the insertion/deletion cost I(a) may vary by character a
Operation-weight edit distance is a special case
of this variation
Weighted edit distance refers to this
alphabet-weight setting

25
Modified Recurrence Relation

Base Case
For 0 ≤ i ≤ n, D(i,0) = Σ_{1 ≤ k ≤ i} I(S(k))
For 0 ≤ j ≤ m, D(0,j) = Σ_{1 ≤ k ≤ j} I(T(k))
Recursive Case
For 0 < i ≤ n, 0 < j ≤ m
D(i,j) = min of
D(i-1,j) + I(S(i))
D(i,j-1) + I(T(j))
D(i-1,j-1) + d(i,j)
where d(i,j) = r(S(i), T(j))
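One way to sketch the alphabet-weight recurrence is to pass the costs in as functions. The names `indel_cost`, `sub_cost`, and `operation_weight` are assumptions for this example; `sub_cost(a, a)` is expected to return the match cost. The operation-weight model then falls out as the special case where the costs ignore which characters are involved.

```python
def weighted_edit_distance(s, t, indel_cost, sub_cost):
    """Alphabet-weight edit distance.

    indel_cost(a)  : cost I(a) of inserting or deleting character a
    sub_cost(a, b) : cost r(a, b) of aligning a with b (match cost if a == b)
    """
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):       # D(i,0) = sum of I(S(k)) for k <= i
        D[i][0] = D[i - 1][0] + indel_cost(s[i - 1])
    for j in range(1, m + 1):       # D(0,j) = sum of I(T(k)) for k <= j
        D[0][j] = D[0][j - 1] + indel_cost(t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j] + indel_cost(s[i - 1]),
                D[i][j - 1] + indel_cost(t[j - 1]),
                D[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]),
            )
    return D[n][m]


def operation_weight(d, r, e):
    """Build cost functions for the operation-weight special case."""
    return (lambda a: d), (lambda a, b: e if a == b else r)


indel, sub = operation_weight(d=1, r=1, e=0)    # plain edit distance
print(weighted_edit_distance("vintner", "writers", indel, sub))  # 5
```

Raising the replacement cost to r = 2 makes a substitution no cheaper than a deletion plus an insertion, which changes the optimal value for the same pair of strings.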

26
Measuring Similarity of S and T

Definitions
Let Σ be the alphabet for strings S and T
Let Σ' be the alphabet Σ with the character - (space) added
For any two characters x,y in Σ', s(x,y) denotes
the value (or score) obtained by aligning x with
y
For a given alignment A of S and T, let S' and T'
denote the strings after the chosen insertion of
spaces and l their new length
The value of alignment A is Σ_{1 ≤ i ≤ l} s(S'(i),T'(i))

27
Example

a b a a - b a b
a a a a a b - b
Column scores: 1 + (-2) + 1 + 1 + 0 + 2 + 0 + 2 = 5
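The value of a given alignment is just a column-by-column sum, which the sketch below computes. The scoring function `example_score` is an assumption reverse-engineered from the column scores in this example (a with a scores 1, b with b scores 2, a mismatch scores -2, anything against a space scores 0); the slide does not state the full matrix.

```python
def alignment_value(row_s, row_t, score):
    """Sum the column scores of an alignment given as two equal-length rows."""
    if len(row_s) != len(row_t):
        raise ValueError("aligned rows must have the same length")
    return sum(score(x, y) for x, y in zip(row_s, row_t))


def example_score(x, y):
    """Assumed scoring matrix, inferred from the example's column scores."""
    if x == "-" or y == "-":
        return 0                     # any character against a space
    if x != y:
        return -2                    # mismatch
    return 2 if x == "b" else 1      # b/b scores 2, a/a scores 1


print(alignment_value("abaa-bab", "aaaaab-b", example_score))  # 5
```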

28
String Similarity Problem

Input
2 strings S and T
Scoring matrix s for the alphabet (with the space character - added)
Task
Output optimal alignment value of S and T
The alignment of S and T with maximal, not
minimal, value
Output this alignment

29
Modified Recurrence Relation

Base Case
For 0 ≤ i ≤ n, V(i,0) = Σ_{1 ≤ k ≤ i} s(S(k),-)
For 0 ≤ j ≤ m, V(0,j) = Σ_{1 ≤ k ≤ j} s(-,T(k))
Recursive Case
For 0 < i ≤ n, 0 < j ≤ m
V(i,j) = max of
V(i-1,j) + s(S(i),-)
V(i,j-1) + s(-,T(j))
V(i-1,j-1) + s(S(i), T(j))
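A minimal sketch of this similarity recurrence follows. The scoring function `simple_score` (match +1, mismatch -1, space -1) is an assumption chosen for the demonstration, not a scheme given on the slides; any function over the alphabet plus `-` could be passed instead.

```python
def global_alignment_value(s, t, score):
    """Optimal global alignment value V(n,m) under scoring function score."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):       # V(i,0): all of S[1..i] against spaces
        V[i][0] = V[i - 1][0] + score(s[i - 1], "-")
    for j in range(1, m + 1):       # V(0,j): all of T[1..j] against spaces
        V[0][j] = V[0][j - 1] + score("-", t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(
                V[i - 1][j] + score(s[i - 1], "-"),
                V[i][j - 1] + score("-", t[j - 1]),
                V[i - 1][j - 1] + score(s[i - 1], t[j - 1]),
            )
    return V[n][m]


def simple_score(x, y):
    """Assumed scheme: +1 match, -1 mismatch, -1 for a space."""
    if x == "-" or y == "-":
        return -1
    return 1 if x == y else -1


print(global_alignment_value("GATTACA", "GCATGCU", simple_score))  # 0
```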

30
Longest Common Subsequence Problem

Given 2 strings S and T, a common subsequence is
a subsequence that appears in both S and T.
The longest common subsequence problem is to find
a longest common subsequence (lcs) of S and T
subsequence characters need not be contiguous
different than substring
O(nm) solution
Make scoring matrix 1 for match, 0 for mismatch and for spaces
The matched characters in an alignment of maximal
value form a longest common subsequence
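The reduction can be sketched as follows: fill the similarity table with match = 1 and everything else = 0, then walk back from (n,m) and collect the matched characters. The function name and the tie-breaking order in the walk are choices made for this example; when several longest common subsequences exist, only one is returned.

```python
def longest_common_subsequence(s, t):
    """Return one longest common subsequence of s and t in O(nm) time."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]   # spaces score 0, so borders stay 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 1 if s[i - 1] == t[j - 1] else 0
            V[i][j] = max(V[i - 1][j], V[i][j - 1], V[i - 1][j - 1] + match)

    # Walk back from (n,m); every diagonal step over equal characters
    # contributes one character of the subsequence.
    out = []
    i, j = n, m
    while i > 0 and j > 0:
        if s[i - 1] == t[j - 1] and V[i][j] == V[i - 1][j - 1] + 1:
            out.append(s[i - 1])
            i -= 1
            j -= 1
        elif V[i][j] == V[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(longest_common_subsequence("vintner", "writers"))  # iter
```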

31
Similarity and Distance

If we are focused on aligning both entire
strings, maximizing similarity is essentially
identical to minimizing distance
Just need to modify scoring matrices
appropriately
When we consider substrings of uncertain length,
maximizing similarity often makes more sense than
minimizing distance
Overlapping strings
Local alignment

32
Overlapping Strings

Find best alignment where the two strings overlap
without penalizing for the unmatched ends
Application sequence assembly problem
strings are likely to overlap without being
substrings of each other
Solution method
End-space free variant of dynamic programming
Change base conditions so that V(i,0) = V(0,j) = 0
Need to search over row n and column m for
optimal value
Optimal value may not be in entry (n,m)
Why is max similarity better than min distance?
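The end-space free variant could be sketched like this: zero base conditions, the usual recurrence, and a final search over the last row and last column. The scoring function `simple_score` (match +1, mismatch -1, space -1) is an assumption for the demonstration.

```python
def overlap_alignment_value(s, t, score):
    """End-space free alignment value: unmatched ends are not penalized."""
    n, m = len(s), len(t)
    # Base conditions V(i,0) = V(0,j) = 0: leading spaces are free.
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(
                V[i - 1][j] + score(s[i - 1], "-"),
                V[i][j - 1] + score("-", t[j - 1]),
                V[i - 1][j - 1] + score(s[i - 1], t[j - 1]),
            )
    # Trailing spaces are free too, so the optimum may sit anywhere in
    # row n or column m, not only in cell (n,m).
    last_row = max(V[n])
    last_col = max(V[i][m] for i in range(n + 1))
    return max(last_row, last_col)


def simple_score(x, y):
    """Assumed scheme: +1 match, -1 mismatch, -1 for a space."""
    if x == "-" or y == "-":
        return -1
    return 1 if x == y else -1


print(overlap_alignment_value("xxxACGT", "ACGTyyy", simple_score))  # 4
```

Here the suffix ACGT of the first string overlaps the prefix ACGT of the second, and the unmatched x and y ends cost nothing.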

33
Maximally Similar Substrings

Local alignment problem
Input
Two strings S and T
Task
Find substrings s and t of S and T that have the
maximum possible alignment value as well as this
value.
Let v* denote this value.
Why is max similarity better than min distance?
Read pages 230-231 for motivation

34
Local suffix alignments

Define v(i,j) to be the value of the optimal
alignment of any of the i+1 suffixes of S[1..i]
with any of the j+1 suffixes of T[1..j].
We bound v(i,j) to be at least 0 by scoring the
alignment of two empty suffixes to be 0
Theorem
v* (the value of the optimal local alignment) =
max { v(i,j) : 1 ≤ i ≤ n, 1 ≤ j ≤ m }

35
Recurrences for local suffix alignments

Base Case
For 0 ≤ i ≤ n, v(i,0) = 0
For 0 ≤ j ≤ m, v(0,j) = 0
Recursive Case
For 0 < i ≤ n, 0 < j ≤ m
v(i,j) = max of
0
v(i-1,j) + s(S(i),-)
v(i,j-1) + s(-,T(j))
v(i-1,j-1) + s(S(i), T(j))
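The local suffix recurrence can be sketched as a table fill that also tracks where the maximum occurs, since that cell is where a traceback would start. The scoring function `simple_score` (match +1, mismatch -1, space -1) is an assumed scheme with a negative average score, used only for illustration.

```python
def local_alignment_value(s, t, score):
    """Return (best value, cell (i, j) where it is first reached)."""
    n, m = len(s), len(t)
    v = [[0] * (m + 1) for _ in range(n + 1)]   # v(i,0) = v(0,j) = 0
    best, best_cell = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            v[i][j] = max(
                0,                                   # two empty suffixes
                v[i - 1][j] + score(s[i - 1], "-"),
                v[i][j - 1] + score("-", t[j - 1]),
                v[i - 1][j - 1] + score(s[i - 1], t[j - 1]),
            )
            if v[i][j] > best:
                best, best_cell = v[i][j], (i, j)
    return best, best_cell


def simple_score(x, y):
    """Assumed scheme: +1 match, -1 mismatch, -1 for a space."""
    if x == "-" or y == "-":
        return -1
    return 1 if x == y else -1


print(local_alignment_value("xxxACGTxxx", "yyACGTyy", simple_score))  # (4, (7, 6))
```

The returned cell (7, 6) marks the ends of the two aligned substrings, using 1-based positions as on the slides.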

36
Comments

Traceback
No longer start from cell (n,m)
Search whole table for max value and start from
there
Still O(mn) running time
Terminology
In the literature, the distinction between
problem statements and solution methods is not
always clear
Global alignment is often referred to as
Needleman-Wunsch alignment
Their original solution method was cubic in terms of m,n
Smith-Waterman often used to refer to both local
alignments and their solution method

37
Comments continued

Scoring schemes
The utility of optimal local alignments is highly
dependent on the scoring scheme
Examples
matches = 1, mismatches = spaces = 0 leads to longest
common subsequence
mismatches and spaces = large negative values leads to
longest common substring
Average score in matrix must be negative,
otherwise local alignments tend to be global
There is a theory developed about scoring schemes
that we will cover later.

38
Aligning with Gaps

Gap: any maximal run of consecutive spaces in a single
string of a given alignment
Example
S = aaabbbcccdddeeefff
T = aaabbbdddeeefffggg
Alignment
aaabbbcccdddeeefff---
aaabbb---dddeeefffggg

39
Scoring with gaps

Example Scoring
aaabbbcccdddeeefff---
aaabbb---dddeeefffggg
Score: 6 matches (+1 each), one gap (-1), 9 matches (+1 each), one gap (-1): 6 - 1 + 9 - 1 = 13
Why include gaps in scoring schemes?
Read pages 236-240
When an insertion/deletion event occurs, often
more than a single character is inserted or
deleted.
A single gap cost helps model the fact that a
sequence of insertions/deletions is really one
mutational event
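To make the gap-based score concrete, the sketch below counts gaps as maximal runs of `-` in each row and charges one penalty per gap rather than per space. The match value +1 and gap cost 1 follow the example; the mismatch value -1 and the function names are assumptions, and the example alignment contains no mismatches.

```python
import re


def count_gaps(row):
    """Number of gaps in one row: maximal runs of consecutive '-' characters."""
    return len(re.findall(r"-+", row))


def gap_scored_value(row_s, row_t, match=1, mismatch=-1, gap_cost=1):
    """Score an alignment with one penalty per gap and free individual spaces."""
    if len(row_s) != len(row_t):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row_s, row_t):
        if x == "-" or y == "-":
            continue                 # spaces are paid for once per gap below
        total += match if x == y else mismatch
    return total - gap_cost * (count_gaps(row_s) + count_gaps(row_t))


print(gap_scored_value("aaabbbcccdddeeefff---", "aaabbb---dddeeefffggg"))  # 13
```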

40
Constant gap weight model

We present a series of possible gap weight
models, each of which is a special case of the
next one
Constant gap weight model
each individual space is free (Ws = 0)
each gap has constant cost Wg
Alignment problem boils down to finding an
alignment that maximizes
Match scores - mismatch scores - Wg × (# of gaps)
Dynamic programming can still solve in O(nm) time

41
Affine gap weight model

Gap opening versus gap extension penalties
each gap has constant (opening) cost Wg
each individual space has (extension) cost Ws; typically Ws < Wg
Alignment problem boils down to finding an
alignment that maximizes
Match scores - mismatch scores - Wg × (# of gaps) -
Ws × (# of spaces)
Dynamic programming can still solve in O(nm) time
Probably most commonly used model because of
efficiency and generality of model
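One common way to keep the O(nm) bound under affine gap weights is to track three tables, one per type of final column. The sketch below is that three-table formulation, not necessarily the one the course presents; the table names `M`, `X`, `Y` and the scoring function `sub_score` are assumptions. Setting Ws = 0 recovers the constant gap weight model.

```python
def affine_gap_alignment_value(s, t, score, wg, ws):
    """Global alignment value maximizing
    sum of column scores - wg * (# of gaps) - ws * (# of spaces)."""
    NEG = float("-inf")
    n, m = len(s), len(t)
    # M: last column pairs S(i) with T(j)
    # X: last column pairs S(i) with a space (gap in the T row)
    # Y: last column pairs a space with T(j) (gap in the S row)
    M = [[NEG] * (m + 1) for _ in range(n + 1)]
    X = [[NEG] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = -wg - i * ws      # one gap covering S[1..i]
    for j in range(1, m + 1):
        Y[0][j] = -wg - j * ws      # one gap covering T[1..j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = score(s[i - 1], t[j - 1]) + max(
                M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            X[i][j] = max(M[i - 1][j] - wg - ws,    # open a new gap
                          Y[i - 1][j] - wg - ws,    # open after a gap in S
                          X[i - 1][j] - ws)         # extend the current gap
            Y[i][j] = max(M[i][j - 1] - wg - ws,
                          X[i][j - 1] - wg - ws,
                          Y[i][j - 1] - ws)
    return max(M[n][m], X[n][m], Y[n][m])


def sub_score(x, y):
    """Assumed character scores: +1 match, -1 mismatch."""
    return 1 if x == y else -1


# With ws = 0 and wg = 1 this is the constant gap weight model.
print(affine_gap_alignment_value(
    "aaabbbcccdddeeefff", "aaabbbdddeeefffggg", sub_score, wg=1, ws=0))  # 13
```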

42
Convex gap weight model

Extension penalty should not be a constant but
rather decrease as length of gap increases
One example
each gap has cost Wg + log q where q is the
length of the gap
Solving this model requires more than O(nm) time
Chapter 13 gives an O(nm log m) time solution
Further improvement is possible, but costly

43
Arbitrary gap weight model