Inexact Matching - PowerPoint PPT Presentation

Provided by: erict9
Learn more at: http://www.cse.msu.edu
Transcript and Presenter's Notes


1
Inexact Matching
  • General Problem
  • Input
  • Strings S and T
  • Questions
  • How distant is S from T?
  • How similar is S to T?
  • Solution Technique
  • Dynamic programming with cost/similarity/scoring
    matrix

2
Biological Motivation
  • Read pages 210-214 in textbook
  • First Fact of Biological Sequence Analysis
  • In biomolecular sequences (DNA, RNA, amino acid
    sequences), high sequence similarity usually
    implies significant functional or structural
    similarity
  • sequence similarity implies functional/structural
    similarity
  • Converse is NOT true
  • Evolution reuses, builds upon, duplicates, and
    modifies successful structures

3
Measuring Distance of S and T
  • Consider S and T
  • We can transform S into T using the following
    four operations
  • insertion of a character into S
  • deletion of a character from S
  • substitution (replacement) of a character in S by
    another character (typically in T)
  • matching (no operation)

4
Example
  • S = vintner
  • T = writers
  • vintner
  • wintner (Replace v with w)
  • wrintner (Insert r)
  • writner (Delete first n)
  • writer (Delete second n)
  • writers (Insert s)

5
Example
  • Edit Transcript (or just transcript)
  • a string that describes the transformation of one
    string into the other
  • Example (R = replace, I = insert, M = match, D = delete;
    - marks a position with no character)
  • RIMDMDMMI
  • v-intner-
  • wri-t-ers

6
Edit Distance
  • Edit distance of strings S and T
  • The minimum number of edit operations (insertion,
    deletion, replacement) needed to transform string
    S into string T
  • Also called Levenshtein distance [299]; Levenshtein appears
    to have been the first to define this concept
  • Optimal transcript
  • An edit transcript of S and T that has the
    minimum number of edit operations
  • cooptimal transcripts

7
Alignment
  • A global alignment of strings S and T is obtained
  • by inserting spaces (dashes) into S and T
  • they should have the same number of characters
    (including dashes) at the end
  • then placing two strings over each other matching
    one character (or dash) in S with a unique
    character (or dash) in T
  • Note ALL positions in both S and T are involved
  • Later, we will consider local alignments

8
Alignments and Edit transcripts
  • Example Alignment
  • v-intner-
  • wri-t-ers
  • Alignments and edit transcripts are interrelated
  • edit transcript emphasizes process
  • the specific mutational events
  • alignment emphasizes product
  • the relationship between the two strings
  • Alignments are often easier to work with and
    visualize
  • also generalize better to more than 2 strings

9
Edit Distance Problem
  • Input
  • 2 strings S and T
  • Task
  • Output edit distance of S and T
  • Output optimal edit transcript
  • Output optimal alignment
  • Solution method
  • Dynamic Programming

10
Definition of D(i,j)
  • Let $D(i,j)$ be the edit distance of $S[1..i]$ and
    $T[1..j]$
  • The edit distance of the first i characters of S
    with the first j characters of T
  • Let $|S| = n$, $|T| = m$
  • $D(n,m)$ = edit distance of S and T
  • We will compute $D(i,j)$ for all i and j such that
    $0 \le i \le n$, $0 \le j \le m$

11
Recurrence Relation
  • Base Case
  • For $0 \le i \le n$, $D(i,0) = i$
  • For $0 \le j \le m$, $D(0,j) = j$
  • Recursive Case
  • $1 \le i \le n$, $1 \le j \le m$
  • $D(i,j) = \min$ of
  • $D(i-1,j) + 1$
  • $D(i,j-1) + 1$
  • $D(i-1,j-1) + d(i,j)$
  • $d(i,j) = 0$ if $S(i) = T(j)$ and is 1 otherwise
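
The recurrence above can be sketched directly as a table fill. This is a minimal illustration, not the lecture's own code; the function name `edit_distance` and the row-by-row fill order are choices made here.

```python
def edit_distance(s: str, t: str) -> int:
    """Return D(n, m) for unit-cost insert/delete/replace."""
    n, m = len(s), len(t)
    # D[i][j] = edit distance of the first i chars of s and first j chars of t.
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i          # delete all i characters of s
    for j in range(m + 1):
        D[0][j] = j          # insert all j characters of t
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 0 if s[i - 1] == t[j - 1] else 1   # d(i, j) in the slides
            D[i][j] = min(D[i - 1][j] + 1,         # S(i) against a dash
                          D[i][j - 1] + 1,         # dash against T(j)
                          D[i - 1][j - 1] + d)     # S(i) against T(j)
    return D[n][m]


print(edit_distance("vintner", "writers"))  # 5
```

The result for the running example agrees with the five non-match operations in the transcript RIMDMDMMI.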

12
What the various cases mean
  • $D(i,j) = \min$ of
  • $D(i-1,j) + 1$
  • Align $S[1..i-1]$ with $T[1..j]$ optimally
  • Match S(i) with a dash in T
  • $D(i,j-1) + 1$
  • Align $S[1..i]$ with $T[1..j-1]$ optimally
  • Match a dash in S with T(j)
  • $D(i-1,j-1) + d(i,j)$
  • Align $S[1..i-1]$ with $T[1..j-1]$ optimally
  • Match S(i) with T(j)

13
Computing D(i,j) values
14
Initialization: Base Case
15
Row i = 1
16
Entry i = 2, j = 3
17
Calculation methodologies
  • Location of edit distance
  • D(n,m)
  • Example was to calculate row by row
  • Can also calculate column by column
  • Can also use antidiagonals
  • Key is to build from upper left corner

18
Traceback
  • Using table to construct optimal transcript
  • Pointers in cell D(i,j)
  • Set a pointer from cell (i,j) to
  • cell (i, j-1) if $D(i,j) = D(i,j-1) + 1$
  • cell (i-1,j) if $D(i,j) = D(i-1,j) + 1$
  • cell (i-1,j-1) if $D(i,j) = D(i-1,j-1) + d(i,j)$
  • Follow path of pointers from (n,m) back to (0,0)
  • Example: Figure 11.3 on page 222

19
What the pointers mean
  • horizontal pointer: cell (i,j) to cell (i, j-1)
  • Align T(j) with a space in S
  • Insert T(j) into S
  • vertical pointer: cell (i,j) to cell (i-1, j)
  • Align S(i) with a space in T
  • Delete S(i) from S
  • diagonal pointer: cell (i,j) to cell (i-1, j-1)
  • Align S(i) with T(j)
  • Replace S(i) with T(j), or match if S(i) = T(j)

20
Table and transcripts
  • The pointers represent all optimal transcripts
  • Theorem
  • Any path from (n,m) to (0,0) following the
    pointers specifies an optimal transcript.
  • Conversely, any optimal transcript is specified
    by such a path.
  • The correspondence between paths and transcripts
    is one to one.
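
As a rough sketch of the traceback, the code below recomputes the pointer conditions on the fly instead of storing pointers, and follows one path from (n,m) to (0,0). The helper names `fill_table`, `optimal_transcript`, and `apply_transcript` are hypothetical, and preferring the diagonal pointer when several exist is an arbitrary tie-break that selects one of the co-optimal transcripts.

```python
def fill_table(s, t):
    """Unit-cost edit distance table D, as in the recurrence slides."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + d)
    return D


def optimal_transcript(s, t):
    """Follow one pointer path from (n, m) back to (0, 0)."""
    D = fill_table(s, t)
    i, j = len(s), len(t)
    ops = []
    while i > 0 or j > 0:
        d = 0 if i > 0 and j > 0 and s[i - 1] == t[j - 1] else 1
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + d:
            ops.append("M" if d == 0 else "R")   # diagonal pointer
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            ops.append("D")                      # vertical pointer
            i -= 1
        else:
            ops.append("I")                      # horizontal pointer
            j -= 1
    return "".join(reversed(ops))


def apply_transcript(s, t, transcript):
    """Replay a transcript on s; the result should equal t."""
    out, i, j = [], 0, 0
    for op in transcript:
        if op == "M":
            out.append(s[i]); i += 1; j += 1
        elif op == "R":
            out.append(t[j]); i += 1; j += 1
        elif op == "I":
            out.append(t[j]); j += 1
        else:  # "D"
            i += 1
    return "".join(out)
```

Replaying the returned transcript on S is a cheap sanity check that the path really describes a transformation of S into T.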

21
Running Time
  • Initialization of table
  • O(nm)
  • Calculating table and pointers
  • O(nm)
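  • Traceback for one optimal transcript or optimal
    alignment
  • O(n + m) once the table and pointers are computed,
    since every step decreases i or j (so also within O(nm))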

22
Operation-Weight Edit Distance
  • Consider S and T
  • We can assign weights to the various operations
  • insertion/deletion of a character: cost d
  • substitution (replacement) of a character: cost r
  • matching: cost e
  • Previous case: d = r = 1, e = 0

23
Modified Recurrence Relation
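  • Base Case
  • For $0 \le i \le n$, $D(i,0) = i \cdot d$
  • For $0 \le j \le m$, $D(0,j) = j \cdot d$
  • Recursive Case
  • $1 \le i \le n$, $1 \le j \le m$
  • $D(i,j) = \min$ of
  • $D(i-1,j) + d$
  • $D(i,j-1) + d$
  • $D(i-1,j-1) + d(i,j)$
  • $d(i,j) = e$ if $S(i) = T(j)$ and is r otherwise
    (the term d(i,j) is distinct from the indel cost d)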

24
Alphabet-Weight Edit Distance
  • Define weight of each possible substitution
  • r(a,b) where a is being replaced by b for all a,b
    in the alphabet
  • For example, with DNA, maybe $r(A,T) > r(A,G)$
  • Likewise, the insertion/deletion cost I(a) may vary by character
  • Operation-weight edit distance is a special case
    of this variation
  • Weighted edit distance refers to this
    alphabet-weight setting

25
Modified Recurrence Relation
  • Base Case
  • For $0 \le i \le n$, $D(i,0) = \sum_{1 \le k \le i} I(S(k))$
  • For $0 \le j \le m$, $D(0,j) = \sum_{1 \le k \le j} I(T(k))$
  • Recursive Case
  • $1 \le i \le n$, $1 \le j \le m$
  • $D(i,j) = \min$ of
  • $D(i-1,j) + I(S(i))$
  • $D(i,j-1) + I(T(j))$
  • $D(i-1,j-1) + d(i,j)$
  • $d(i,j) = r(S(i), T(j))$
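
A small sketch of the alphabet-weight recurrence follows. The cost functions are passed in as callables; the DNA costs in the example (transitions cheaper than transversions, indel cost 2) are invented for illustration and are not values from the lecture.

```python
def weighted_edit_distance(s, t, indel_cost, sub_cost):
    """Alphabet-weight edit distance.

    indel_cost(a)   plays the role of I(a)
    sub_cost(a, b)  plays the role of r(a, b), with sub_cost(a, a) the match cost
    """
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + indel_cost(s[i - 1])   # running sum of I(S(k))
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + indel_cost(t[j - 1])   # running sum of I(T(k))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j] + indel_cost(s[i - 1]),
                          D[i][j - 1] + indel_cost(t[j - 1]),
                          D[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]))
    return D[n][m]


# Hypothetical DNA costs: purine<->purine or pyrimidine<->pyrimidine
# substitutions (transitions) cost 1, all other substitutions cost 2.
PURINES = {"A", "G"}

def dna_sub_cost(a, b):
    if a == b:
        return 0
    return 1 if (a in PURINES) == (b in PURINES) else 2

def dna_indel_cost(a):
    return 2
```

Passing `indel_cost=lambda a: 1` and `sub_cost=lambda a, b: 0 if a == b else 1` recovers the unit-cost case, which illustrates the slide's remark that operation-weight edit distance is a special case.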

26
Measuring Similarity of S and T
  • Definitions
  • Let $\Sigma$ be the alphabet for strings S and T
  • Let $\Sigma'$ be the alphabet $\Sigma$ with character - added
  • For any two characters x,y in $\Sigma'$, s(x,y) denotes
    the value (or score) obtained by aligning x with
    y
  • For a given alignment A of S and T, let S' and T'
    denote the strings after the chosen insertion of
    spaces and l their new length
  • The value of alignment A is $\sum_{i=1}^{l} s(S'(i), T'(i))$

27
Example
  • a b a a - b a b
  • a a a a a b - b
  • 1 - 2 + 1 + 1 + 0 + 2 + 0 + 2 = 5
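
The column-by-column sum can be checked with a few lines of code. The scoring table below is inferred from the per-column terms of this example (a/a = 1, b/b = 2, a/b = -2, a against a space = 0); treating the table as symmetric and scoring b against a space as 0 are assumptions made here.

```python
def alignment_value(s_aligned, t_aligned, score):
    """Sum s(S'(i), T'(i)) over the columns of a given alignment."""
    if len(s_aligned) != len(t_aligned):
        raise ValueError("aligned strings must have the same length")
    return sum(score[(x, y)] for x, y in zip(s_aligned, t_aligned))


# Scores inferred from the example; entries for b against '-' are assumed.
score = {
    ("a", "a"): 1, ("b", "b"): 2,
    ("a", "b"): -2, ("b", "a"): -2,
    ("a", "-"): 0, ("-", "a"): 0,
    ("b", "-"): 0, ("-", "b"): 0,
}

print(alignment_value("abaa-bab", "aaaaab-b", score))  # 5
```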

28
String Similarity Problem
  • Input
  • 2 strings S and T
  • Scoring matrix s for the alphabet of S and T, extended with the space character -
  • Task
  • Output optimal alignment value of S and T
  • The alignment of S and T with maximal, not
    minimal, value
  • Output this alignment

29
Modified Recurrence Relation
  • Base Case
  • For $0 \le i \le n$, $V(i,0) = \sum_{1 \le k \le i} s(S(k),-)$
  • For $0 \le j \le m$, $V(0,j) = \sum_{1 \le k \le j} s(-,T(k))$
  • Recursive Case
  • $1 \le i \le n$, $1 \le j \le m$
  • $V(i,j) = \max$ of
  • $V(i-1,j) + s(S(i),-)$
  • $V(i,j-1) + s(-,T(j))$
  • $V(i-1,j-1) + s(S(i), T(j))$
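
One way to sketch the similarity recurrence is shown below; the scoring function is supplied by the caller and receives "-" for a space. The function name and the example scores (match +2, everything else -1) are illustrative assumptions.

```python
def global_alignment_value(s, t, score):
    """Return V(n, m), the maximum value of a global alignment of s and t."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + score(s[i - 1], "-")
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + score("-", t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(V[i - 1][j] + score(s[i - 1], "-"),
                          V[i][j - 1] + score("-", t[j - 1]),
                          V[i - 1][j - 1] + score(s[i - 1], t[j - 1]))
    return V[n][m]


def simple_score(x, y):
    """Assumed example scoring: +2 for a match, -1 for mismatch or space."""
    return 2 if x == y else -1


print(global_alignment_value("abc", "abd", simple_score))  # 3
```

With scores set to the negated unit edit costs (0 for a match, -1 otherwise), the maximum value is the negated edit distance, which is the sense in which maximizing similarity and minimizing distance coincide for global alignment.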

30
Longest Common Subsequence Problem
  • Given 2 strings S and T, a common subsequence is
    a subsequence that appears in both S and T.
  • The longest common subsequence problem is to find
    a longest common subsequence (lcs) of S and T
  • subsequence characters need not be contiguous
  • different than substring
  • O(nm) solution
  • Make scoring matrix 1 for match, 0 for mismatch
  • The matched characters in an alignment of maximal
    value form a longest common subsequence
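
A minimal sketch of the LCS special case: matches score 1, mismatches and spaces score 0, and the matched characters on one traceback path are collected. The function name is a choice made here.

```python
def longest_common_subsequence(s, t):
    """Return one longest common subsequence of s and t."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 1 if s[i - 1] == t[j - 1] else 0
            V[i][j] = max(V[i - 1][j],              # space costs 0
                          V[i][j - 1],              # space costs 0
                          V[i - 1][j - 1] + match)  # match 1, mismatch 0
    # Traceback: collect the characters of matched columns.
    out = []
    i, j = n, m
    while i > 0 and j > 0:
        if s[i - 1] == t[j - 1] and V[i][j] == V[i - 1][j - 1] + 1:
            out.append(s[i - 1])
            i, j = i - 1, j - 1
        elif V[i][j] == V[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(longest_common_subsequence("vintner", "writers"))  # iter
```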

31
Similarity and Distance
  • If we are focused on aligning both entire
    strings, maximizing similarity is essentially
    identical to minimizing distance
  • Just need to modify scoring matrices
    appropriately
  • When we consider substrings of uncertain length,
    maximizing similarity often makes more sense than
    minimizing distance
  • Overlapping strings
  • Local alignment

32
Overlapping Strings
  • Find best alignment where the two strings overlap
    without penalizing for the unmatched ends
  • Application sequence assembly problem
  • strings are likely to overlap without being
    substrings of each other
  • Solution method
  • End-space free variant of dynamic programming
  • Change base conditions so that $V(i,0) = V(0,j) = 0$
  • Need to search over row n and column m for
    optimal value
  • Optimal value may not be in entry (n,m)
  • Why is max similarity better than min distance?
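
The end-space free variant changes only the base conditions and where the answer is read. The sketch below assumes a simple scoring scheme (match +1, mismatch -1, space -1); those numbers and the function name are illustrative, not from the lecture.

```python
def overlap_alignment_value(s, t, match=1, mismatch=-1, space=-1):
    """Best alignment value when leading and trailing spaces are free."""
    n, m = len(s), len(t)
    # Base conditions V(i,0) = V(0,j) = 0: unmatched prefixes cost nothing.
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = V[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            V[i][j] = max(diag, V[i - 1][j] + space, V[i][j - 1] + space)
    # Unmatched suffixes are free too, so search row n and column m.
    best_in_last_row = max(V[n])
    best_in_last_col = max(V[i][m] for i in range(n + 1))
    return max(best_in_last_row, best_in_last_col)


print(overlap_alignment_value("xxxabc", "abcyyy"))  # 3
```

In the usage line the suffix "abc" of the first string overlaps the prefix "abc" of the second, and the unmatched ends contribute nothing to the value.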

33
Maximally Similar Substrings
  • Local alignment problem
  • Input
  • Two strings S and T
  • Task
  • Find substrings s and t of S and T that have the
    maximum possible alignment value as well as this
    value.
  • Let v denote this value.
  • Why is max similarity better than min distance?
  • Read pages 230-231 for motivation

34
Local suffix alignments
  • Define v(i,j) to be the value of the optimal
    alignment of any of the i+1 suffixes of $S[1..i]$
    with any of the j+1 suffixes of $T[1..j]$.
  • We bound v(i,j) to be at least 0 by scoring the
    alignment of two empty suffixes to be 0
  • Theorem
  • v (the value of the optimal local alignment)
    $= \max \{ v(i,j) : 1 \le i \le n,\ 1 \le j \le m \}$

35
Recurrences for local suffix alignments
  • Base Case
  • For $0 \le i \le n$, $v(i,0) = 0$
  • For $0 \le j \le m$, $v(0,j) = 0$
  • Recursive Case
  • $1 \le i \le n$, $1 \le j \le m$
  • $v(i,j) = \max$ of
  • 0
  • $v(i-1,j) + s(S(i),-)$
  • $v(i,j-1) + s(-,T(j))$
  • $v(i-1,j-1) + s(S(i), T(j))$
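
A compact sketch of the local suffix alignment recurrence is given below. It returns the best value together with the cell where it occurs, which is where a traceback would start. The function names and the example scoring (match +2, mismatch and space -1) are assumptions for illustration.

```python
def local_alignment_value(s, t, score):
    """Return (best value, (i, j)) over all cells of the local table v."""
    n, m = len(s), len(t)
    v = [[0] * (m + 1) for _ in range(n + 1)]   # v(i,0) = v(0,j) = 0
    best, best_cell = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            v[i][j] = max(0,                                   # empty suffixes
                          v[i - 1][j] + score(s[i - 1], "-"),
                          v[i][j - 1] + score("-", t[j - 1]),
                          v[i - 1][j - 1] + score(s[i - 1], t[j - 1]))
            if v[i][j] > best:
                best, best_cell = v[i][j], (i, j)
    return best, best_cell


def local_score(x, y):
    """Assumed example scoring: +2 for a match, -1 for mismatch or space."""
    return 2 if x == y else -1


print(local_alignment_value("xxabcxx", "yyabcyy", local_score))  # (6, (5, 5))
```

The negative mismatch and space scores matter: with non-negative scores everywhere, the zero option would never win and the result would drift toward a global alignment.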

36
Comments
  • Traceback
  • No longer start from cell (n,m)
  • Search whole table for max value and start from
    there
  • Still O(mn) running time
  • Terminology
  • In the literature, the distinction between
    problem statements and solution methods is not
    always clear
  • Global alignment often referred to as
    Needleman-Wunsch alignment
  • Their solution method was cubic in terms of m,n
  • Smith-Waterman often used to refer to both local
    alignments and their solution method

37
Comments continued
  • Scoring schemes
  • The utility of optimal local alignments is highly
    dependent on the scoring scheme
  • Examples
  • matches = 1, mismatches = spaces = 0 leads to longest
    common subsequence
  • mismatches and spaces = big negatives leads to
    longest common substring
  • Average score in matrix must be negative,
    otherwise local alignments tend to be global
  • There is a theory developed about scoring schemes
    that we will cover later.

38
Aligning with Gaps
  • Gap: any maximal run of spaces in a single
    string of a given alignment
  • Example
  • S = aaabbbcccdddeeefff
  • T = aaabbbdddeeefffggg
  • Alignment
  • aaabbbcccdddeeefff---
  • aaabbb---dddeeefffggg

39
Scoring with gaps
  • Example Scoring
  • aaabbbcccdddeeefff---
  • aaabbb---dddeeefffggg
  • 1+1+1+1+1+1 - 1 + 1+1+1+1+1+1+1+1+1 - 1 = 13
    (each match scores 1, each gap costs 1)
  • Why include gaps in scoring schemes?
  • Read pages 236-240
  • When an insertion/deletion event occurs, often
    more than a single character is inserted or
    deleted.
  • A single gap cost helps model the fact that a
    sequence of insertions/deletions is really one
    mutational event

40
Constant gap weight model
  • We present a series of possible gap weight
    models, each of which is a special case of the
    next one
  • Constant gap weight model
  • each individual space is free ($W_s = 0$)
  • each gap has constant cost $W_g$
  • Alignment problem boils down to finding an
    alignment that maximizes
  • (match scores) - (mismatch scores) - $W_g \cdot$ (# of gaps)
  • Dynamic programming can still solve in O(nm) time

41
Affine gap weight model
  • Gap opening versus gap extension penalties
  • each gap has constant cost $W_g$
  • each individual space has cost $W_s$; typically $W_s < W_g$
  • Alignment problem boils down to finding an
    alignment that maximizes
  • (match scores) - (mismatch scores) - $W_g \cdot$ (# of gaps) -
    $W_s \cdot$ (# of spaces)
  • Dynamic programming can still solve in O(nm) time
  • Probably most commonly used model because of
    efficiency and generality of model

42
Convex gap weight model
  • Extension penalty should not be a constant but
    rather decrease as length of gap increases
  • One example
  • each gap has cost $W_g + \log q$ where q is the
    length of the gap
  • Solution now requires more than O(nm) time
  • In chapter 13 is an $O(nm \log m)$ time solution
  • Further improvement is possible, but costly

43
Arbitrary gap weight model
  • Gap cost is an arbitrary function of gap length
  • each gap has cost w(q) where q is the length of
    the gap
  • no properties of w(q) are assumed, such as its
    second derivative being negative
  • Solution time is now $O(nm^2 + n^2 m)$
  • cubic cost, similar to original Needleman-Wunsch
    solution

44
Recurrences for arbitrary gap weights
  • Base Case
  • $V(0,0) = 0$
  • For $1 \le i \le n$, $V(i,0) = -w(i)$
  • For $1 \le j \le m$, $V(0,j) = -w(j)$
  • Recursive Case
  • $1 \le i \le n$, $1 \le j \le m$
  • $V(i,j) = \max$ of
  • $V(i-1,j-1) + s(S(i),T(j))$
  • $\max_{0 \le k \le j-1} [\, V(i,k) - w(j-k) \,]$
  • Match $S[1..i]$ with $T[1..k]$, then a gap of length j-k
    opposite $T[k+1..j]$
  • $\max_{0 \le k \le i-1} [\, V(k,j) - w(i-k) \,]$
  • Match $S[1..k]$ with $T[1..j]$, then a gap of length i-k
    opposite $S[k+1..i]$
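
The arbitrary gap weight recurrence translates almost line for line into code; the inner maximizations over k are what make it cubic. In this sketch `w` is any caller-supplied function of the gap length, and the names and example scoring are assumptions made here.

```python
def arbitrary_gap_alignment_value(s, t, score, w):
    """Global alignment value where a gap of length q costs w(q)."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = -w(i)               # s[0:i] against one gap
    for j in range(1, m + 1):
        V[0][j] = -w(j)               # t[0:j] against one gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Last column aligns S(i) with T(j).
            G = V[i - 1][j - 1] + score(s[i - 1], t[j - 1])
            # Last j-k characters of t sit opposite a gap.
            E = max(V[i][k] - w(j - k) for k in range(j))
            # Last i-k characters of s sit opposite a gap.
            F = max(V[k][j] - w(i - k) for k in range(i))
            V[i][j] = max(G, E, F)
    return V[n][m]


def unit_score(x, y):
    """Assumed example scoring: +1 for a match, -1 for a mismatch."""
    return 1 if x == y else -1


# Constant gap weight (every gap costs 1 regardless of length).
print(arbitrary_gap_alignment_value("aaabbbcccdddeeefff",
                                    "aaabbbdddeeefffggg",
                                    unit_score, lambda q: 1))  # 13
```

With a constant `w` this reproduces the value 13 of the gap-scoring example, at a much higher cost than the O(nm) methods available for constant and affine weights.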

45
Recurrences for affine gap weights
  • Base Case
  • $V(0,0) = 0$
  • For $1 \le i \le n$, $V(i,0) = F(i,0) = -W_g - i W_s$
  • For $1 \le j \le m$, $V(0,j) = E(0,j) = -W_g - j W_s$
  • Recursive Case
  • $1 \le i \le n$, $1 \le j \le m$
  • $V(i,j) = \max \{ E(i,j),\ F(i,j),\ G(i,j) \}$
  • $G(i,j) = V(i-1,j-1) + s(S(i),T(j))$
  • $E(i,j) = \max \{ E(i,j-1),\ V(i,j-1) - W_g \} - W_s$
  • max checks whether the gap opposite T(j) began
    earlier or begins at T(j)
  • $F(i,j) = \max \{ F(i-1,j),\ V(i-1,j) - W_g \} - W_s$
  • max checks whether the gap opposite S(i) began
    earlier or begins at S(i)
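
A sketch of the affine recurrences with three tables follows. Here E(i,j) covers alignments whose last column puts T(j) opposite a space, F(i,j) those whose last column puts S(i) opposite a space, and G is the diagonal case. Boundary entries that no alignment can realize are set to minus infinity, which is an implementation assumption, as are the function names and example scores.

```python
NEG_INF = float("-inf")


def affine_gap_alignment_value(s, t, score, Wg, Ws):
    """Global alignment value with gap cost Wg + q * Ws for a gap of length q."""
    n, m = len(s), len(t)
    V = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    E = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    F = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    V[0][0] = 0
    for i in range(1, n + 1):
        V[i][0] = F[i][0] = -Wg - i * Ws   # s[0:i] opposite a single gap
    for j in range(1, m + 1):
        V[0][j] = E[0][j] = -Wg - j * Ws   # t[0:j] opposite a single gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            G = V[i - 1][j - 1] + score(s[i - 1], t[j - 1])
            # Extend a gap that began earlier, or open one at T(j).
            E[i][j] = max(E[i][j - 1], V[i][j - 1] - Wg) - Ws
            # Extend a gap that began earlier, or open one at S(i).
            F[i][j] = max(F[i - 1][j], V[i - 1][j] - Wg) - Ws
            V[i][j] = max(G, E[i][j], F[i][j])
    return V[n][m]


def unit_score(x, y):
    """Assumed example scoring: +1 for a match, -1 for a mismatch."""
    return 1 if x == y else -1


# Wg = 1, Ws = 0 is the constant gap weight model as a special case.
print(affine_gap_alignment_value("aaabbbcccdddeeefff",
                                 "aaabbbdddeeefffggg",
                                 unit_score, Wg=1, Ws=0))  # 13
```

Each cell takes constant work, so the running time stays O(nm), unlike the cubic recurrence for arbitrary gap weights.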