Title: Chapter 3 String Matching
Chapter 3 String Matching
- 3.1 Basic Terminologies of Strings
- 3.2 The KMP Algorithm
- 3.3 The Boyer-Moore Algorithm
- 3.4 Suffix Trees and Suffix Arrays
- 3.5 Approximate String Matching
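3.1 Basic Terminologies of Strings
- $\Sigma$: an alphabet set
- $S_i$: the $i$th character of a string $S$
- $S_{i,j} = S_i S_{i+1} \cdots S_j$
- $S'$: a substring of string $S$
- prefix of $S$: $S_{1,s}$ for some $1 \le s \le |S|$
- suffix of $S$: $S_{|S|-s+1,|S|}$ for some $1 \le s \le |S|$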
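Brute-force matching, which compares the pattern P against every window of the text T, takes O(nm) time, where n is the length of T and m is the length of P.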
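The O(nm) bound can be seen in a minimal brute-force sketch. The function name `naive_search` and the choice to report 1-based start positions are assumptions made for illustration, not part of the original slides.

```python
def naive_search(text, pattern):
    """Return the 1-based start positions of pattern in text (brute force)."""
    n, m = len(text), len(pattern)
    hits = []
    # Try every window position; each comparison costs up to m steps,
    # so the whole loop costs O(nm) in the worst case.
    for s in range(n - m + 1):
        if text[s:s + m] == pattern:
            hits.append(s + 1)  # convert 0-based index to 1-based position
    return hits


print(naive_search("ATCACATCATCA", "TCAT"))
```

Every window is re-examined from scratch, which is exactly the waste that KMP and Boyer-Moore avoid.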
The KMP Algorithm
- Case 1
- The first mismatch occurs at $P_4$ and $T_4$.
- Slide the window all the way to $T_4$ and match $P_1$ with $T_4$.
- Case 2
- The first mismatch occurs at $P_7$ and $T_7$.
- We cannot slide the window all the way to match $P_1$ with $T_7$.
- Slide the window to match $P_1$ with $T_6$.
- Case 3
- The first mismatch occurs at $P_8$ and $T_8$.
- Go back to $P_7$ and find that $P_{6,7}$ is equal to the prefix $P_{1,2}$.
- We can slide the window to align $P_1$ with $T_6$ and $P_3$ with $T_8$.
- The KMP algorithm consists of two phases.
- The first phase computes the prefix function for the pattern $P$.
- The second phase searches for the pattern.
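The two phases can be sketched as follows. This uses the common textbook formulation of the prefix function (length of the longest proper prefix that is also a suffix), which may differ in indexing from the definition on the original slides; the names `prefix_function` and `kmp_search` are assumptions.

```python
def prefix_function(pattern):
    """Phase 1: f[i] = length of the longest proper prefix of
    pattern[:i+1] that is also a suffix of it."""
    f = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = f[k - 1]          # fall back to a shorter border
        if pattern[i] == pattern[k]:
            k += 1
        f[i] = k
    return f


def kmp_search(text, pattern):
    """Phase 2: return the 1-based start positions of pattern in text."""
    if not pattern:
        return []
    f = prefix_function(pattern)
    hits, k = [], 0               # k = number of pattern characters matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = f[k - 1]          # slide the window without re-reading text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 2)  # 1-based start of this occurrence
            k = f[k - 1]
    return hits


print(kmp_search("ATCACATCATCA", "TCAT"))
```

The text index `i` never moves backwards, which is why the search phase runs in O(n) time after the O(m) preprocessing.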
The Boyer-Moore Algorithm
- This algorithm compares the pattern with the substring within a sliding window in right-to-left order.
- Assume that the first mismatch occurs when comparing $T_{s+j-1}$ with $P_j$.
- Bad character rule
- Align $T_{s+j-1}$ with $P_{j'}$, where $j'$ is the rightmost position of $T_{s+j-1}$ in $P$.
- Only one character is used.
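The table behind the bad character rule can be sketched as a single pass over the pattern. The name `bad_character_table` is hypothetical, positions are 1-based, and treating a character that is absent from the pattern as position 0 is an assumed convention.

```python
def bad_character_table(pattern):
    """Map each character to its rightmost 1-based position in pattern."""
    table = {}
    for pos, ch in enumerate(pattern, start=1):
        table[ch] = pos           # later positions overwrite earlier ones
    return table


B = bad_character_table("ATCACATCATCA")
print(B)
print(B.get("G", 0))              # assumed convention: 0 if not in pattern
```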
- Good Suffix Rule 1
- Align $T_{s+j-1}$ with $P_{j'-m+j}$.
- $j'$ is the largest position such that $P_{j+1,m}$ is a suffix of $P_{1,j'}$ and $P_{j'-m+j} \neq P_j$.
- Good Suffix Rule 2
- Align $T_{s+m-j'}$ with $P_1$.
- $j'$ is the largest position such that $P_{1,j'}$ is a suffix of $P_{j+1,m}$.
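- Let us consider $g_1(7)$. Note that $g_1(7) = 9$. This means that $P_{8,12} = \text{CATCA}$ must be equal to $P_{5,9}$ and $P_7 \neq P_4$.
- Consider $g_2(4)$. Note that $g_2(4) = 4$. This means that $P_{1,4}$ is a suffix of $P_{5,12}$. That is, $P_{1,4}$ must be equal to $P_{9,12}$.
- $G(j) = m - \max\{g_1(j), g_2(j)\}$, where $g_1(j)$ and $g_2(j)$ are the largest positions defined by Good Suffix Rules 1 and 2.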
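The values above can be checked with a brute-force sketch that follows the two good suffix rules literally. It assumes the pattern is ATCACATCATCA (consistent with $P_{8,12}$ = CATCA), assumes that $g_1(j)$ or $g_2(j)$ is 0 when no valid position exists, and restricts Rule 1 to positions below $m$ whose preceding character exists. The function name is hypothetical, and a practical implementation would compute these tables in linear time instead.

```python
def good_suffix_tables(p):
    """Return (g1, g2, G) as 1-based lists (index 0 unused)."""
    m = len(p)
    P = " " + p                       # pad so that P[j] is the j-th character
    g1 = [0] * (m + 1)
    g2 = [0] * (m + 1)
    for j in range(1, m + 1):
        length = m - j                # length of the matched suffix P[j+1..m]
        suffix = p[j:]
        # Rule 1: largest j' < m with P[j+1..m] a suffix of P[1..j']
        # and a different character in front of that occurrence.
        for jp in range(m - 1, length, -1):
            if p[jp - length:jp] == suffix and P[jp - length] != P[j]:
                g1[j] = jp
                break
        # Rule 2: largest j' with P[1..j'] a suffix of P[j+1..m].
        for jp in range(length, 0, -1):
            if p[:jp] == p[m - jp:]:
                g2[j] = jp
                break
    G = [0] + [m - max(g1[j], g2[j]) for j in range(1, m + 1)]
    return g1, g2, G


g1, g2, G = good_suffix_tables("ATCACATCATCA")
print(g1[7], g2[4], G[12])
```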
- $n = 24$ and $m = 12$.
- The first mismatch occurs at $P_j = P_{12}$.
- $s = s + \max\{G(j),\ m - B(T_{s+j-1})\} = 1 + \max\{G(12),\ 12 - B(T_{12})\} = 1 + \max\{1, 2\} = 3$
- Then we shift the window to position $s = 3$.
Suffix Trees
If S is ATCACATCATCA, its 12 suffixes are listed
in Table 3.1.
For Example
Consider the suffix tree for S = ATCACATCATCA.
Suppose P = TCAT. Since $P_1$ is T, we follow the
branch of TCA. Then we match $P_{1,3}$ with TCA.
Since $P_4$ is T, we follow the branch of TCA.
We can now report that P occurs at position 7 in S
because $P_4$ matches the first symbol of TCA and
leaf 7 is reached along the branch of TCA.
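The lookup above can be imitated with an uncompressed suffix trie, a simpler stand-in for the compressed suffix tree in the figure. This is only a sketch: the trie takes O(n^2) space, the terminator "$" is assumed not to occur in S, and the names `build_suffix_trie` and `occurrences` are hypothetical.

```python
def build_suffix_trie(s):
    """Insert every suffix of s (plus a terminator) into a nested-dict trie."""
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:] + "$":
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1      # 1-based start position of this suffix
    return root


def occurrences(root, p):
    """Follow p from the root, then collect every leaf below that node."""
    node = root
    for ch in p:
        if ch not in node:
            return []
        node = node[ch]
    found, stack = [], [node]
    while stack:
        cur = stack.pop()
        for key, child in cur.items():
            if key == "leaf":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


trie = build_suffix_trie("ATCACATCATCA")
print(occurrences(trie, "TCAT"))
```

Every leaf below the node reached by P is an occurrence, so following TCA alone would report leaves 2, 7 and 10.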
Suffix Array
For example, the non-decreasing lexical order
of the suffixes of S = ATCACATCATCA is S(12), S(4),
S(9), S(1), S(6), S(11), S(3), S(8),
S(5), S(10), S(2) and S(7). Table 3.2 shows the
suffix array A.
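The ordering above can be reproduced with a short sketch that sorts the suffix start positions directly and then binary-searches the array for a pattern. Sorting whole suffixes is a simple but slow construction chosen here for clarity; the function names are assumptions.

```python
def suffix_array(s):
    """Return the 1-based start positions of the suffixes in sorted order."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def find_occurrences(s, sa, p):
    """Binary-search the suffix array for suffixes that start with p."""
    def head(k):
        # First len(p) characters of the k-th smallest suffix.
        return s[sa[k] - 1:sa[k] - 1 + len(p)]

    lo, hi = 0, len(sa)
    while lo < hi:                    # find the first suffix with head >= p
        mid = (lo + hi) // 2
        if head(mid) < p:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    while lo < len(sa) and head(lo) == p:
        hits.append(sa[lo])
        lo += 1
    return sorted(hits)


A = suffix_array("ATCACATCATCA")
print(A)
print(find_occurrences("ATCACATCATCA", A, "TCAT"))
```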
The longest common substring (1)
The longest common substring of strings X and
Y is a common substring of X and Y that has the
greatest length. For example, PAT is the
longest common substring of X = APAT and
Y = PATT. We can create a suffix tree for X
and Y to find the longest common substring.
The suffixes of X and Y are described in Table
3.3. Figure 3.25 shows the suffix tree for
X and Y.
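As a quick check of the example, the sketch below finds the longest common substring with a straightforward O(|X||Y|) dynamic program. This is a swapped-in technique, not the suffix tree method described above, and the function name is hypothetical.

```python
def longest_common_substring(x, y):
    """Return one longest common substring of x and y (DP, not a suffix tree)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(y) + 1)
    for i in range(1, len(x) + 1):
        cur = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                # Extend the common substring ending at x[i-2], y[j-2].
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return x[best_end - best_len:best_end]


print(longest_common_substring("APAT", "PATT"))
```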
Approximate String Matching
Given a text string T of length n, a pattern
string P of length m and a maximal number of
errors allowed k, we want to find the positions
in T where P matches with up to k errors. For
instance, if T = pttapa, P = patt and k = 2, the
substrings $T_{1,2}$, $T_{1,4}$ and $T_{5,6}$ all
match P with up to 2 errors.
The suffix edit distance
The suffix edit distance between two strings
$S_1$ and $S_2$ is the minimum number of
substitutions, insertions and deletions that
transform some suffix of $S_1$ into $S_2$.
Consider $S_1$ = p and $S_2$ = p. The suffix edit
distance between $S_1$ and $S_2$ is 0. Consider
$S_1$ = ptt and $S_2$ = p. The suffix edit distance
between $S_1$ and $S_2$ is 1, as we can replace
the last character t by p.
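The two examples can be verified with a small dynamic-programming sketch. It assumes the usual boundary conditions (the empty suffix of $S_1$ matches the empty string at cost 0, and matching the first i characters of $S_2$ against an empty $S_1$ costs i insertions); the function name is hypothetical.

```python
def suffix_edit_distance(s1, s2):
    """Minimum number of edits turning some suffix of s1 into s2."""
    # col[i] = suffix edit distance between the part of s1 read so far
    # and the first i characters of s2.
    col = list(range(len(s2) + 1))    # nothing read yet: i insertions
    for ch in s1:
        new = [0]                     # the empty suffix matches the empty string
        for i in range(1, len(s2) + 1):
            cost = 0 if ch == s2[i - 1] else 1
            new.append(min(col[i - 1] + cost,   # match or substitute
                           col[i] + 1,          # delete ch from s1
                           new[i - 1] + 1))     # insert s2[i-1]
        col = new
    return col[-1]


print(suffix_edit_distance("p", "p"), suffix_edit_distance("ptt", "p"))
```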
Algorithm
Dynamic Programming
For Example
Consider E(4, 3). The arrows traced are E(4,
3) to E(3, 2) to E(2, 1) to E(1, 1). We ignore
E(2, 1) to E(1, 1). Thus we have obtained an
occurrence of approximate matching, $T_{1,3}$ = ptt.
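The table entries in this trace can be reproduced with the sketch below. It assumes T = pttapa and P = patt from the earlier example, that E(i, j) is the suffix edit distance between $T_{1,j}$ and $P_{1,i}$, and the boundary conditions E(0, j) = 0 and E(i, 0) = i. The original algorithm slide did not survive extraction, so this is a standard formulation rather than a transcription of it.

```python
def edit_table(text, pattern):
    """E[i][j] = suffix edit distance between text[:j] and pattern[:i]."""
    n, m = len(text), len(pattern)
    E = [[0] * (n + 1) for _ in range(m + 1)]   # row 0 stays all zeros
    for i in range(1, m + 1):
        E[i][0] = i                             # i insertions into empty text
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pattern[i - 1] == text[j - 1] else 1
            E[i][j] = min(E[i - 1][j - 1] + cost,   # diagonal arrow
                          E[i - 1][j] + 1,          # vertical arrow
                          E[i][j - 1] + 1)          # horizontal arrow
    return E


def approximate_end_positions(text, pattern, k):
    """1-based text positions where a match with at most k errors ends."""
    E = edit_table(text, pattern)
    return [j for j in range(1, len(text) + 1) if E[len(pattern)][j] <= k]


E = edit_table("pttapa", "patt")
print(E[4][3], approximate_end_positions("pttapa", "patt", 2))
```

Under these assumptions the step from E(2, 1) to E(1, 1) is a vertical arrow, so it consumes no text character, and the reported occurrence still starts at $T_1$.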