Chapter 3 String Matching - PowerPoint PPT Presentation


Provided by: dann60


1
Chapter 3 String Matching

3.1 Basic Terminologies of Strings
3.2 The KMP Algorithm
3.3 The Boyer-Moore Algorithm
3.4 Suffix Trees and Suffix Arrays
3.5 Approximate String Matching

2
3.1 Basic Terminologies of Strings

$\Sigma$: the alphabet set
$S_i$: the $i$th character of string $S$
$S_{i,j} = S_i S_{i+1} \cdots S_j$: a substring of string $S$
prefix of $S$: $S_{1,s}$
suffix of $S$: $S_{|S|-s+1,|S|}$

3
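Brute-force matching takes $O(nm)$ time, where $n$ is the length of the text T and $m$ is the length of the pattern P.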
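
The O(nm) bound can be illustrated with a minimal brute-force sketch. The function name `naive_search` and the 1-based reporting of positions are assumptions made for this illustration; they do not come from the slides.

```python
def naive_search(text, pattern):
    """Return the 1-based start positions of every occurrence of pattern in text."""
    n, m = len(text), len(pattern)
    positions = []
    for s in range(n - m + 1):              # try every window start (0-based)
        j = 0
        while j < m and text[s + j] == pattern[j]:
            j += 1                          # compare left to right inside the window
        if j == m:                          # the whole pattern matched
            positions.append(s + 1)
    return positions


print(naive_search("ATCACATCATCA", "TCAT"))
```

Each of the n - m + 1 windows may need up to m comparisons, which is where the O(nm) worst case comes from.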
4
The KMP Algorithm

Case 1
The first mismatch occurs at $P_4$ and $T_4$.
Slide the window all the way to $T_4$ and match $P_1$ with $T_4$.

5
The KMP Algorithm

Case 2
The first mismatch occurs at $P_7$ and $T_7$.
We cannot slide the window all the way to match $P_1$ with $T_7$.
Instead, slide the window to match $P_1$ with $T_6$.

6
The KMP Algorithm

Case 3
The first mismatch occurs at $P_8$ and $T_8$.
Go back to $P_7$ and observe that $P_{6,7}$ is equal to the prefix $P_{1,2}$.
We can slide the window to align $P_1$ with $T_6$ and $P_3$ with $T_8$.

9
The KMP Algorithm

The KMP algorithm consists of two phases.
The first phase computes the prefix function for the pattern P.
The second phase searches for the pattern in the text.
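
The two phases can be sketched as follows. The slides do not reproduce the definition of the prefix function, so this sketch assumes the common textbook one: `f[j]` is the length of the longest proper prefix of `p[:j+1]` that is also a suffix of it. The function names are illustrative.

```python
def prefix_function(p):
    """Phase 1: f[j] = length of the longest proper prefix of p[:j+1]
    that is also a suffix of p[:j+1] (assumed definition)."""
    f = [0] * len(p)
    k = 0                                   # length of the current matched prefix
    for j in range(1, len(p)):
        while k > 0 and p[j] != p[k]:
            k = f[k - 1]                    # fall back to a shorter border
        if p[j] == p[k]:
            k += 1
        f[j] = k
    return f


def kmp_search(text, p):
    """Phase 2: return the 1-based start positions of p in text."""
    if not p:
        return []
    f = prefix_function(p)
    positions = []
    k = 0                                   # number of pattern characters matched so far
    for i, c in enumerate(text):
        while k > 0 and c != p[k]:
            k = f[k - 1]                    # slide the window without re-reading text
        if c == p[k]:
            k += 1
        if k == len(p):
            positions.append(i - len(p) + 2)
            k = f[k - 1]                    # keep searching for further occurrences
    return positions


print(prefix_function("ATCACATCATCA"))
print(kmp_search("ATCACATCATCA", "TCAT"))
```

Because the text index never moves backwards, the search phase runs in time linear in the text length once the prefix function is known.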

12
The Boyer-Moore Algorithm

This algorithm compares the pattern with the substring within a sliding window in right-to-left order.
Assume that the first mismatch occurs when comparing $T_{s+j-1}$ with $P_j$.

13
The Boyer-Moore Algorithm

Bad Character Rule
Align $T_{s+j-1}$ with $P_{j'}$, where $j'$ is the rightmost position of $T_{s+j-1}$ in P.
Only one character is used.
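
A minimal sketch of the table behind the bad character rule. The name `bad_character_table`, the use of 0 for characters that do not occur in P, and the sample pattern are assumptions made for this illustration.

```python
def bad_character_table(p):
    """Map each character of p to its rightmost 1-based position in p."""
    table = {}
    for j, c in enumerate(p, start=1):
        table[c] = j                        # later positions overwrite earlier ones
    return table


def rightmost_position(p, c):
    """Rightmost 1-based position of c in p, or 0 if c does not occur (assumed convention)."""
    return bad_character_table(p).get(c, 0)


print(bad_character_table("ATCACATCATCA"))
```

Only the mismatching text character is looked up, so the table costs one pass over the pattern and each lookup is constant time.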

14
The Boyer-Moore Algorithm

Good Suffix Rule 1
Align $T_{s+j-1}$ with $P_{j'-m+j}$, where
$j'$ is the largest position such that $P_{j+1,m}$ is a suffix of $P_{1,j'}$, and
$P_{j'-m+j} \ne P_j$.

15
The Boyer-Moore Algorithm

Good Suffix Rule 2
Align $T_{s+m-j'}$ with $P_1$, where
$j'$ is the largest position such that $P_{1,j'}$ is a suffix of $P_{j+1,m}$.

17
The Boyer-Moore Algorithm

Let us consider $g_1(7)$. Note that $g_1(7) = 9$. This means that $P_{8,12} = \text{CATCA}$ must be equal to $P_{5,9}$ and $P_7 \ne P_4$.
Consider $g_2(4)$. Note that $g_2(4) = 4$. This means that $P_{1,4}$ is a suffix of $P_{5,12}$. That is, $P_{1,4}$ must be equal to $P_{9,12}$.
$G(j) = m - \max\{g_1(j), g_2(j)\}$
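
The values above can be checked with a brute-force sketch. It assumes that g1(j) and g2(j) are the positions selected by Good Suffix Rules 1 and 2, that both are 0 when no position qualifies, and that the pattern is ATCACATCATCA (the slide does not state the pattern, but this string agrees with the substrings quoted). The quadratic scan is for illustration only and is not the preprocessing a real implementation would use.

```python
def g1(p, j):
    """Largest j' < m such that P[j+1..m] is a suffix of P[1..j'] and the
    character before that copy differs from P[j]; 0 if none (assumption)."""
    m = len(p)
    tail = p[j:]                            # P[j+1..m] as a 0-based slice
    for jp in range(m - 1, 0, -1):
        k = jp - m + j                      # 1-based index of the character before the copy
        if k < 1:
            break                           # no preceding character exists any more
        if p[:jp].endswith(tail) and p[k - 1] != p[j - 1]:
            return jp
    return 0


def g2(p, j):
    """Largest j' such that P[1..j'] is a suffix of P[j+1..m]; 0 if none (assumption)."""
    m = len(p)
    tail = p[j:]
    for jp in range(m - j, 0, -1):
        if tail.endswith(p[:jp]):
            return jp
    return 0


def good_suffix_shift(p, j):
    """G(j) = m - max(g1(j), g2(j))."""
    return len(p) - max(g1(p, j), g2(p, j))


P = "ATCACATCATCA"                          # assumed pattern
print(g1(P, 7), g2(P, 4), good_suffix_shift(P, 12))
```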

23
The Boyer-Moore Algorithm

$n = 24$ and $m = 12$.
The first mismatch occurs at $P_j = P_{12}$.
$s = s + \max\{G(j),\ m - B(T_{s+j-1})\}$
$= 1 + \max\{G(12),\ 12 - B(T_{12})\}$
$= 1 + \max\{1, 2\}$
$= 3$
Then we shift the window to position $s = 3$.

24
Suffix Trees
If S is ATCACATCATCA, its 12 suffixes are listed
in Table 3.1.
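
As a small illustration, the suffixes can be enumerated directly. The helper name `suffixes` and the (position, suffix) pair layout are assumptions for this sketch, not the layout of Table 3.1.

```python
def suffixes(s):
    """Return (1-based start position, suffix) pairs for every suffix of s."""
    return [(i + 1, s[i:]) for i in range(len(s))]


for position, suffix in suffixes("ATCACATCATCA"):
    print(position, suffix)
```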
25
Suffix Tree
26
For Example
Consider the suffix tree for S = ATCACATCATCA. Suppose P = TCAT. Since $P_1$ is T, we follow the branch of TCA. Then we match $P_{1,3}$ with TCA. Since $P_4$ is T, we follow the branch of TCA. We can now report that P is at position 7 in S because $P_4$ matches the first symbol of TCA and leaf 7 is reached along the branch of TCA.
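
The lookup can be imitated with a much simpler structure. The sketch below builds an uncompressed suffix trie (one node per character, not the compact suffix tree with multi-character branches such as TCA) and records at each node the start positions of the suffixes passing through it. The node layout and function names are assumptions for this illustration, and the structure uses quadratic space.

```python
def new_node():
    return {"children": {}, "starts": []}


def build_suffix_trie(s):
    """Insert every suffix of s; each node stores the 1-based starts of suffixes through it."""
    root = new_node()
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node["children"].setdefault(ch, new_node())
            node["starts"].append(start + 1)
    return root


def find(root, p):
    """Follow p from the root; the node reached lists every occurrence of p."""
    node = root
    for ch in p:
        node = node["children"].get(ch)
        if node is None:
            return []                       # p falls off the trie: no occurrence
    return node["starts"]


trie = build_suffix_trie("ATCACATCATCA")
print(find(trie, "TCAT"))
```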
27
Suffix Array
For example, the non-decreasing lexical order of the suffixes of S = ATCACATCATCA is S(12), S(4), S(9), S(1), S(6), S(11), S(3), S(8), S(5), S(10), S(2) and S(7). Table 3.2 shows the suffix array A.
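
The ordering can be reproduced with a short sketch. Sorting the suffixes directly is the simplest construction, not an efficient one, and the binary search over the array is one common way to query it; both function names are illustrative assumptions.

```python
def suffix_array(s):
    """1-based start positions of the suffixes of s in lexical order."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def find_occurrences(s, sa, p):
    """Binary-search the block of suffixes that start with p; return their positions."""
    def head(rank):
        return s[sa[rank] - 1:][:len(p)]    # suffix truncated to the pattern length

    lo, hi = 0, len(sa)
    while lo < hi:                          # first rank whose head is >= p
        mid = (lo + hi) // 2
        if head(mid) < p:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:                          # first rank whose head is > p
        mid = (lo + hi) // 2
        if head(mid) <= p:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


S = "ATCACATCATCA"
A = suffix_array(S)
print(A)
print(find_occurrences(S, A, "TCAT"))
```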
28
The longest common substring (1)
The longest common substring of strings X and Y is a common substring of X and Y that has the longest length. For example, PAT is the longest common substring of X = APAT and Y = PATT. We can create a suffix tree for X and Y to find the longest common substring. The suffixes of X and Y are described in Table 3.3. Figure 3.25 shows the suffix tree for X and Y.
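
The example can be checked with a short sketch. Note that it swaps in a different technique: a dynamic-programming table over common suffix lengths in O(|X||Y|) time, not the suffix tree construction described above. The function name is an illustrative assumption.

```python
def longest_common_substring(x, y):
    """Return one longest common substring of x and y (dynamic programming,
    not the suffix tree method)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(y) + 1)               # prev[j]: common suffix length of x[:i-1], y[:j]
    for i in range(1, len(x) + 1):
        cur = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                cur[j] = prev[j - 1] + 1    # extend the common suffix by one character
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return x[best_end - best_len:best_end]


print(longest_common_substring("APAT", "PATT"))
```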
30
Approximate String Matching
Given a text string T of length n, a pattern string P of length m and a maximal number of allowed errors k, find all text positions where the pattern matches the text with up to k errors. For instance, if T = pttapa, P = patt and k = 2, the substrings $T_{1,2}$, $T_{1,4}$ and $T_{5,6}$ all match P with up to 2 errors.
31
The suffix edit distance
The suffix edit distance is the minimum number of substitutions, insertions and deletions that will transform some suffix of $S_1$ into $S_2$. Consider $S_1$ = p and $S_2$ = p. The suffix edit distance between $S_1$ and $S_2$ is 0. Consider $S_1$ = ptt and $S_2$ = p. The suffix edit distance between $S_1$ and $S_2$ is 1, as we can replace the last character t by p.
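
The two examples can be verified with a direct, unoptimized sketch that takes the definition literally: compute the ordinary edit distance from every suffix of S1 (including the empty one) to S2 and keep the minimum. The function names are assumptions for this illustration.

```python
def edit_distance(a, b):
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    prev = list(range(len(b) + 1))          # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j - 1] + (ca != cb),   # substitute or match
                           prev[j] + 1,                # delete ca
                           cur[j - 1] + 1))            # insert cb
        prev = cur
    return prev[-1]


def suffix_edit_distance(s1, s2):
    """Minimum edit distance between any suffix of s1 and s2."""
    return min(edit_distance(s1[i:], s2) for i in range(len(s1) + 1))


print(suffix_edit_distance("p", "p"), suffix_edit_distance("ptt", "p"))
```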
32
Algorithm
33
Dynamic Programming
34
For Example
Consider $E(4, 3)$. The arrows traced are $E(4, 3)$ to $E(3, 2)$ to $E(2, 1)$ to $E(1, 1)$. We ignore $E(2, 1)$ to $E(1, 1)$. Thus we have obtained an occurrence of approximate matching, $T_{1,3}$ = ptt.
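
The table behind this trace can be sketched as follows. The recurrence is not reproduced in the transcript, so the sketch assumes the usual one: $E(i, j)$ is the suffix edit distance between $T_{1,j}$ and $P_{1,i}$, with $E(0, j) = 0$ and $E(i, 0) = i$. It also assumes T = pttapa, P = patt and k = 2 from the earlier example; the function names are illustrative.

```python
def suffix_edit_table(t, p):
    """E[i][j]: suffix edit distance between t[:j] and p[:i] (assumed recurrence)."""
    n, m = len(t), len(p)
    e = [[0] * (n + 1) for _ in range(m + 1)]   # row 0 is all zeros: a match may start anywhere
    for i in range(1, m + 1):
        e[i][0] = i                             # i pattern characters against an empty text
        for j in range(1, n + 1):
            e[i][j] = min(e[i - 1][j - 1] + (p[i - 1] != t[j - 1]),  # diagonal arrow
                          e[i - 1][j] + 1,                           # vertical arrow
                          e[i][j - 1] + 1)                           # horizontal arrow
    return e


def approximate_match_ends(t, p, k):
    """1-based text positions where an occurrence with at most k errors ends."""
    last_row = suffix_edit_table(t, p)[len(p)]
    return [j for j in range(1, len(t) + 1) if last_row[j] <= k]


E = suffix_edit_table("pttapa", "patt")
print(E[4][3])
print(approximate_match_ends("pttapa", "patt", 2))
```

Under these assumptions the end positions reported include 2, 3, 4 and 6, which agrees with the occurrences $T_{1,2}$, $T_{1,3}$, $T_{1,4}$ and $T_{5,6}$ mentioned in the slides.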