Title: Two exact string matching algorithms using the suffix to prefix rule
1 Two exact string matching algorithms using the
suffix to prefix rule
Advisor: Prof. R. C. T. Lee   Speaker: G. W. Cheng
2 Speeding up two string matching algorithms
CROCHEMORE, M., CZUMAJ, A., GASIENIEC, L., JAROMINEK, S.,
LECROQ, T., PLANDOWSKI, W. and RYTTER, W.,
Algorithmica, Vol. 12, 1994, pp. 247-267
3 Problem Definition
We are given a text string T = t_1 t_2 ... t_n
and a pattern string P = p_1 p_2 ... p_m,
and we want to find all occurrences of P in T.
4 Consider the following example
There are two occurrences of P in T.
[Figure not recovered: T and P with the two occurrences marked.]
5 Rule 1: The Suffix to Prefix Rule
- For a window to have any chance of matching the
pattern, some suffix of the window must be equal
to a prefix of the pattern.
[Figure: text T and pattern P.]
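6 Basic Ideas
- Open a window W of size |P| in the text.
- Find the longest suffix of W that is also a prefix
of the pattern.
Case 1: The suffix is the whole window, so W is equal to P.
Match!
7 Case 2: The longest such suffix is shorter than W.
We move W so that this suffix is aligned with the
equal prefix of P.
Case 3: If there is no such suffix, we move W by
|P| positions.
[Figures: window W in text T, and pattern P, for each case.]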
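The three cases above can be sketched in a few lines of Python. This is only an illustration of the rule by direct string comparison; the function names `suffix_prefix_length` and `classify` are made up for this sketch and do not come from the slides.

```python
def suffix_prefix_length(window: str, pattern: str) -> int:
    """Length of the longest suffix of `window` that is also a prefix of `pattern`."""
    for k in range(min(len(window), len(pattern)), 0, -1):
        if window[-k:] == pattern[:k]:
            return k
    return 0


def classify(window: str, pattern: str) -> str:
    """Name the case (1, 2 or 3) that applies to a window of size |P|."""
    m = len(pattern)
    k = suffix_prefix_length(window, pattern)
    if k == m:
        return "case 1: match"
    if k > 0:
        return f"case 2: shift by {m - k}"
    return f"case 3: shift by {m}"


print(classify("GCATCGCA", "GCAGAGAG"))  # case 2: shift by 5
```

The brute-force comparison costs O(m^2) per window; it is meant to show what is computed, not how to compute it quickly.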
8 Preprocessing phase
- T = GCATCGCAGAGAGTATACAGTACG
- P = GCAGAGAG
We construct the suffix automaton of P.
[Figure: suffix automaton with states 0 to 8 and transitions labeled A, C and G.]
9 Preprocessing: construct a suffix tree of the
reverse of the pattern.
P^R: the reversal of the string P.
[Figure: suffix tree of P^R with nodes numbered 1 to 8.]
10-11 When there is a match, how do we move the window?
T: G C A T C G C A G A G A G T A T A C A G T A C G
P: G C A G A G A G
12-13 Find the longest suffix of W that is also a prefix
of the pattern.
T: G C A T C G C A G G C A G T A T A C A G T A C G
P: G C A G A G A G
14 A Whole Example
- T = GCATCGCAGAGAGTATACAGTACG
- P = GCAGAGAG
- First attempt
T: G C A T C G C A G A G A G T A T A C A G T A C G
P: G C A G A G A G
Shift by 5 (8 - 3)
15 Second attempt
T: G C A T C G C A G A G A G T A T A C A G T A C G
P:           G C A G A G A G
Shift by 7 (8 - 1)
16 Third attempt
T: G C A T C G C A G A G A G T A T A C A G T A C G
P:                         G C A G A G A G
Shift by 7 (8 - 1)
17 Third attempt
T: G C A T C G C A G A G A G T A T A C A G T A C G
P: G C A G A G A G   (alignment not recovered)
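The shifts in this example can be reproduced with a short Python sketch. It applies the suffix to prefix rule by plain string comparison and does not build the suffix automaton, so it illustrates the shifts only and is not the algorithm of the paper. The function name `search` and the 0-based positions are assumptions of this sketch.

```python
def search(text: str, pattern: str):
    """Slide a window of size |P| over the text using the suffix to prefix rule.

    Returns (0-based match positions, list of shifts).
    """
    m, n = len(pattern), len(text)
    pos, hits, shifts = 0, [], []
    while pos <= n - m:
        window = text[pos:pos + m]
        if window == pattern:
            hits.append(pos)
        # Longest suffix of the window, shorter than m, that is a prefix of P.
        k = next((k for k in range(m - 1, 0, -1)
                  if window[-k:] == pattern[:k]), 0)
        shifts.append(m - k)
        pos += m - k
    return hits, shifts


hits, shifts = search("GCATCGCAGAGAGTATACAGTACG", "GCAGAGAG")
print(hits, shifts)  # [5] [5, 7, 7]
```

The printed shifts 5, 7 and 7 agree with the three attempts above; the match found in the second attempt starts at 0-based position 5.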
18 Conclusion
- The preprocessing phase takes O(m) time.
- The searching phase takes O(mn) time in the worst case.
21 A Bit-Parallel Approach to Suffix Automata: Fast
Extended String Matching
NAVARRO, G. and RAFFINOT, M.,
In Proceedings of the 9th Annual Symposium on
Combinatorial Pattern Matching, Lecture Notes in
Computer Science 1448, Springer-Verlag, Berlin,
pp. 14-31, 1998.
22 Problem Definition
We are given a text string T = t_1 t_2 ... t_n
and a pattern string P = p_1 p_2 ... p_m,
and we want to find all occurrences of P in T.
23-25 This algorithm compares the pattern P with the text T
inside a sliding window, and the sliding window
moves from left to right. Example
Text    ABDAACDGAEEGGGGJJ
Pattern ACDAAC
[Figure: the sliding window drawn at successive positions over the text.]
26 Basic idea
- In this algorithm, we want to find the longest
prefix of the pattern that is equal to a
suffix of the window.
27 Example
Text    ABDDCACDADEGGGGJJ
Pattern ACDADCEAD
We want to find the longest suffix of BDDCACDAD that
is also a prefix of the pattern.
Slides 28 to 36 use the same Text and Pattern.
28 We find all occurrences of the substring D in the pattern.
29 [Figure: two copies of the pattern ACDADCEAD aligned under the text.]
In effect, we compare the window with the pattern at the
alignments above.
30 mismatch
Then we try to find all occurrences of the substring AD in
the pattern.
31 We succeed in finding all occurrences of AD in the
pattern.
32 mismatch
We try to find all occurrences of the substring DAD in the
pattern.
33 We find all occurrences of DAD in the pattern.
34 We try to find all occurrences of CDAD in the
pattern.
35 We try to find all occurrences of ACDAD in the
pattern.
36 We can align the longest prefix of the pattern with
the equal suffix of the window.
37 Why do we want to find the longest suffix of the
text in the sliding window that is also a prefix
of the pattern? The following cases explain why.
38 Case 1: u is not a prefix of P, and no prefix of
P is equal to a suffix of the window.
[Figure: text T and pattern P with u marked.]
39 So, we can shift the pattern past the window.
[Figure: T and P after the shift.]
40 Example
Text    ABDDCCDDADEGGGGJJ
Pattern ACDADCEAD
P must be shifted so that no part of P is
compared with DDAD.
41 So, we can shift the pattern past DDAD.
42 Case 2: u is not a prefix of P.
[Figure: text T and pattern P with u marked.]
43 But a suffix v of the window of T may be a prefix
of P.
[Figure: T and P with u and v marked.]
44 So, we can shift the pattern so that its prefix v is
aligned with the suffix v of the window.
[Figure: T and P after the shift.]
45 Example
Text    ABCABCABA
Pattern CABBCAD
BCA is the longest suffix of ABCABCA that
is also a substring of the pattern.
CA is a suffix of BCA that is a prefix of
the pattern.
46 So we can shift the pattern so that its prefix CA is
aligned with the CA at the end of the window.
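The difference between the longest suffix that is a substring of the pattern and the longest suffix that is a prefix of the pattern can be checked directly. The following is a small sketch by brute force; both function names are invented for the illustration.

```python
def longest_suffix_factor(window: str, pattern: str) -> str:
    """Longest suffix of `window` that occurs anywhere in `pattern`."""
    for k in range(len(window), 0, -1):
        if window[-k:] in pattern:
            return window[-k:]
    return ""


def longest_suffix_prefix(window: str, pattern: str) -> str:
    """Longest suffix of `window` that is a prefix of `pattern`."""
    for k in range(min(len(window), len(pattern)), 0, -1):
        if pattern.startswith(window[-k:]):
            return window[-k:]
    return ""


print(longest_suffix_factor("ABCABCA", "CABBCAD"))  # BCA
print(longest_suffix_prefix("ABCABCA", "CABBCAD"))  # CA
```

Only the second value decides the shift: here the pattern moves by 7 - 2 = 5 positions.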
47 - The idea explained above is the main idea of this
algorithm, and we will use the bit-parallel method to
implement it.
48 Here, we explain how to use bit-parallelism to find
the substrings of a pattern that are equal to a
suffix of the window.
Example
Text    ABCABCCBA
Σ = {A, B, C}
Pattern ACBCCBB
49 Example
Text    ABCABCCBA
Pattern ACBCCBB
For every character that occurs in both Text and
Pattern, we build a bit mask with a 1 at each position
where the pattern has that character.
Pattern ACBCCBB
A       1000000
B       0010011
C       0101100
others  0000000
50-51 We use a mask D to record the positions in the pattern
where the part of the window read so far occurs.
D       1111111
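The mask table can be built with one pass over the pattern. The sketch below stores each mask as a Python integer whose leftmost (most significant) bit stands for pattern position 1, which is how the masks are written on the slides; the helper names `build_masks` and `bits` are assumptions of this sketch.

```python
def build_masks(pattern: str) -> dict:
    """Map each pattern character to a bit mask of the positions where it occurs."""
    m = len(pattern)
    masks = {}
    for j, ch in enumerate(pattern):          # j is 0-based
        masks[ch] = masks.get(ch, 0) | (1 << (m - 1 - j))
    return masks


def bits(value: int, m: int) -> str:
    """Format a mask as an m-digit binary string."""
    return format(value, f"0{m}b")


masks = build_masks("ACBCCBB")
for ch in "ABC":
    print(ch, bits(masks[ch], 7))
# A 1000000
# B 0010011
# C 0101100
```

Characters that do not occur in the pattern have no entry; `masks.get(ch, 0)` then plays the role of the all-zero "others" row.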
52 Example
Text    ABCABCCBA
Pattern ACBCCBB
A 1000000   B 0010011   C 0101100   other 0000000
The window is ABCABCC, and we read it from right to left.
We read C.
D       1111111
C       0101100
D AND C 0101100
Where there is a 1, there is an occurrence of the substring C
in Pattern.
We set D = 0101100 << 1 = 1011000
53 We read C.
D       1011000
C       0101100
D AND C 0001000
Where there is a 1, there is an occurrence of the substring CC
in Pattern.
We set D = 0001000 << 1 = 0010000
54 We read B.
D       0010000
B       0010011
D AND B 0010000
Where there is a 1, there is an occurrence of the substring BCC
in Pattern.
We set D = 0010000 << 1 = 0100000
55 We read A.
D       0100000
A       1000000
D AND A 0000000
There is no substring ABCC in Pattern.
The leftmost bit of D was never 1, so there is no prefix of
Pattern that is equal to a suffix of the window.
56 We can shift Pattern by its whole length.
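The four steps above can be replayed mechanically. This is a minimal sketch that records the value of D after each AND; the function name `trace_window` is hypothetical, and the bit layout (leftmost bit = pattern position 1) follows the slides.

```python
def trace_window(window: str, pattern: str) -> list:
    """Read `window` right to left and return D after each AND, as bit strings."""
    m = len(pattern)
    full = (1 << m) - 1                       # m ones
    masks = {}
    for j, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << (m - 1 - j))
    d, steps = full, []
    for ch in reversed(window):
        d &= masks.get(ch, 0)                 # keep positions that still match
        steps.append(format(d, f"0{m}b"))
        if d == 0:                            # the suffix read is not in the pattern
            break
        d = (d << 1) & full                   # shift left, drop the overflow bit
    return steps


print(trace_window("ABCABCC", "ACBCCBB"))
# ['0101100', '0001000', '0010000', '0000000']
```

Python integers are unbounded, so the result of the shift is masked with `full` to keep D at m bits.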
57 We give another example
Text    ABCABCABA
Σ = {A, B, C}
Pattern CABBCAB
A       0100010
B       0011001
C       1000100
other   0000000
The window is ABCABCA. We read A.
D       1111111
A       0100010
D AND A 0100010
D = 0100010 << 1 = 1000100
58 We read C.
D       1000100
C       1000100
D AND C 1000100
The leftmost bit is 1, so CA is a substring of the pattern
that starts at position 1 of the pattern, and
this means that CA is a prefix of the pattern.
D = 1000100 << 1 = 0001000
59 We read B.
D       0001000
B       0011001
D AND B 0001000
So, we know BCA is a substring of the pattern.
D = 0001000 << 1 = 0010000
60 We read A.
D       0010000
A       0100010
D AND A 0000000
There is no substring ABCA in Pattern.
61 BCA is the longest suffix of ABCABCA that
is also a substring of the pattern, but the longest
prefix of the pattern that is equal to a
suffix of the window is CA.
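The test "is the leftmost bit of D equal to 1?" is what detects a prefix. A small sketch of one window scan follows; it returns whether the whole window matched and the shift to apply. The name `scan_window` and the return convention are assumptions made for this illustration.

```python
def scan_window(window: str, pattern: str):
    """Scan one window right to left; return (matched, shift)."""
    m = len(pattern)
    full, top = (1 << m) - 1, 1 << (m - 1)    # m ones; leftmost bit only
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << (m - 1 - i))
    d, j, last = full, m, m
    while d != 0:
        d &= masks.get(window[j - 1], 0)
        j -= 1
        if d & top:                           # the suffix read is a prefix of P
            if j > 0:
                last = j                      # remember where this prefix starts
            else:
                return True, last             # the whole window was read: match
        d = (d << 1) & full
    return False, last


print(scan_window("ABCABCA", "CABBCAB"))  # (False, 5)
```

The shift 5 equals 7 - 2, where 2 is the length of the prefix CA found on slide 58.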
62 We now work through an example of the whole algorithm.
63 We use read to store the suffix of the sliding window
in the text that we have already read, and we use
pre-temp to store the longest suffix of the current
read that is also a prefix of the pattern.
64-65 Example
P = ATATA
T = AGATACGATATATAC
Preprocessing: B[A] = 10101, B[T] = 01010, others = 00000
Initial D = 11111
read = empty
pre-temp = empty
66-74 Example: first attempt
P = ATATA
T = AGATACGATATATAC
Preprocessing: B[A] = 10101, B[T] = 01010, others = 00000
The window is AGATA. D = 11111, read = empty, pre-temp = empty.
Reading A: 11111 AND 10101 = 10101. read = A.
  The leftmost bit is 1, so we set pre-temp = A, which is a
  prefix of the pattern.
  D = 10101 << 1 = 01010
Reading T: 01010 AND 01010 = 01010. read = TA. pre-temp = A.
  D = 01010 << 1 = 10100
Reading A: 10100 AND 10101 = 10100. read = ATA.
  The leftmost bit is 1, so we set pre-temp = ATA, which is a
  prefix of the pattern.
  D = 10100 << 1 = 01000
Reading G: 01000 AND 00000 = 00000. read = GATA. pre-temp = ATA.
D = 00000, so we stop. We find that ATA is the longest suffix of
AGATA that is also a prefix of the pattern.
So, we can shift the pattern by 2 positions, which aligns its
prefix ATA with this suffix.
75-78 Example: second attempt
P = ATATA
T = AGATACGATATATAC
Preprocessing: B[A] = 10101, B[T] = 01010, others = 00000
We reset D = 11111, read = empty and pre-temp = empty.
The window is ATACG.
Reading G: 11111 AND 00000 = 00000. read = G. pre-temp = empty.
There is no substring G in the pattern.
So, we can shift the pattern to the right by the length of P.
We reset D = 11111, read = empty and pre-temp = empty.
79-89 Example: third attempt
P = ATATA
T = AGATACGATATATAC
Preprocessing: B[A] = 10101, B[T] = 01010, others = 00000
The window is ATATA. D = 11111, read = empty, pre-temp = empty.
Reading A: 11111 AND 10101 = 10101. read = A.
  We set pre-temp = A, which is a prefix of the pattern.
  D = 10101 << 1 = 01010
Reading T: 01010 AND 01010 = 01010. read = TA. pre-temp = A.
  D = 01010 << 1 = 10100
Reading A: 10100 AND 10101 = 10100. read = ATA.
  We set pre-temp = ATA, which is a prefix of the pattern.
  D = 10100 << 1 = 01000
Reading T: 01000 AND 01010 = 01000. read = TATA. pre-temp = ATA.
  D = 01000 << 1 = 10000
Reading A: 10000 AND 10101 = 10000. read = ATATA. pre-temp = ATA.
We find ATATA, which is the longest prefix of
the pattern that is equal to a suffix of the
window and has length m, so an exact match occurs.
Apart from the full pattern, ATA is the longest prefix of
ATATA that is equal to a suffix of the window of T.
So, we can shift the pattern by 2 positions.
We reset D = 11111, read = empty and pre-temp = empty.
Repeat the above steps until the window slides out
of Text.
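The attempts above can be traced with a short Python sketch that keeps the slide vocabulary: for each attempt it reports the 1-based window start, the string read, the final pre-temp, whether a match occurred, and the shift. The function name `trace_attempts` and the tuple layout are choices made for this sketch only.

```python
def trace_attempts(text: str, pattern: str) -> list:
    """Return (start, read, pre_temp, match, shift) for every attempt."""
    m, n = len(pattern), len(text)
    full, top = (1 << m) - 1, 1 << (m - 1)
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << (m - 1 - i))
    attempts, pos = [], 0
    while pos <= n - m:
        d, j, last, match = full, m, m, False
        while d != 0:
            d &= masks.get(text[pos + j - 1], 0)
            j -= 1
            if d & top:
                if j > 0:
                    last = j          # a proper prefix of P ends the window
                else:
                    match = True      # the whole window equals P
            d = (d << 1) & full
        read = text[pos + j:pos + m]          # characters read in this attempt
        pre_temp = pattern[:m - last]         # longest proper prefix found
        attempts.append((pos + 1, read, pre_temp, match, last))
        pos += last
    return attempts


for row in trace_attempts("AGATACGATATATAC", "ATATA"):
    print(row)
# (1, 'GATA', 'ATA', False, 2)
# (3, 'G', '', False, 5)
# (8, 'ATATA', 'ATA', True, 2)
# (10, 'ATATA', 'ATA', True, 2)
```

The first three rows correspond to the three attempts walked through above; the fourth is the next repetition, which finds a second occurrence starting at position 10.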
90 - We give an extreme example to show the worst case
of the algorithm.
91 Example
P = AAAAA
T = AAAAAAAA
Preprocessing: B[A] = 11111, others = 00000
Initial D = 11111
read = empty
pre-temp = empty
92-100 First attempt. The window is AAAAA.
Reading A: 11111 AND 11111 = 11111. read = A. pre-temp = A.
  D = 11111 << 1 = 11110
Reading A: 11110 AND 11111 = 11110. read = AA. pre-temp = AA.
  D = 11110 << 1 = 11100
Reading A: 11100 AND 11111 = 11100. read = AAA. pre-temp = AAA.
  D = 11100 << 1 = 11000
Reading A: 11000 AND 11111 = 11000. read = AAAA. pre-temp = AAAA.
  D = 11000 << 1 = 10000
Reading A: 10000 AND 11111 = 10000. read = AAAAA. pre-temp = AAAA.
We find AAAAA, which is the longest prefix of
the pattern that is equal to a suffix of the
window and has length m, so an exact match occurs.
101-110 Second attempt. We shift by 1 position and reset
D = 11111, read = empty and pre-temp = empty.
The window is again AAAAA, and the five reading steps are the
same as in the first attempt:
D = 11111, 11110, 11100, 11000, 10000
pre-temp = A, AA, AAA, AAAA
We find AAAAA again, so another exact match occurs.
Every attempt reads all m characters of the window and then
shifts by only 1 position.
111 Time Complexity
If the length of the text is n and the length of the
pattern is m, the time complexity of this algorithm
is O(mn) in the worst case.
112 Reference
M. Crochemore, A. Czumaj, L. Gasieniec, S. Jarominek,
T. Lecroq, W. Plandowski, and W. Rytter. Speeding
up two string matching algorithms. Algorithmica,
12(4/5):247-267, 1994.
G. Navarro and M. Raffinot. Fast and flexible
string matching by combining bit-parallelism and
suffix automata. ACM Journal of Experimental
Algorithmics, 5, 2000.
W. I. Chang and E. L. Lawler. Sublinear approximate
string matching and biological applications.
Algorithmica, 12(4/5):327-344, 1994.
113 Thanks for your attention.
114 Algorithm
Preprocessing
  For c ∈ Σ Do B[c] ← 0^m
  For j ∈ 1..m Do B[p_j] ← B[p_j] | 0^(j-1) 1 0^(m-j)
Searching
  pos ← 0
  while pos ≤ n-m Do
    j ← m, last ← m
    D ← 1^m
    while D ≠ 0^m Do
      D ← D & B[t_(pos+j)]
      j ← j-1
      If D & 1 0^(m-1) ≠ 0^m Then
        If j > 0 Then last ← j
        Else report an occurrence at pos+1
      End of if
      D ← D << 1
    End of while
    pos ← pos + last
  End of while
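The pseudocode translates almost line by line into Python. The version below is a sketch under a few assumptions: masks are Python integers with the leftmost of m bits standing for pattern position 1, D is masked to m bits after each shift, and a counter `reads` (not part of the pseudocode) is added to show the worst case. Occurrences are reported 1-based, as in "pos+1".

```python
def bndm(text: str, pattern: str):
    """Return (1-based occurrences of pattern in text, number of characters read)."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return [], 0
    full, top = (1 << m) - 1, 1 << (m - 1)    # 1^m and 1 0^(m-1)
    B = {}
    for j, ch in enumerate(pattern):          # preprocessing
        B[ch] = B.get(ch, 0) | (1 << (m - 1 - j))
    occurrences, reads, pos = [], 0, 0
    while pos <= n - m:                       # searching
        j, last, d = m, m, full
        while d != 0:
            d &= B.get(text[pos + j - 1], 0)
            reads += 1
            j -= 1
            if d & top:
                if j > 0:
                    last = j
                else:
                    occurrences.append(pos + 1)
            d = (d << 1) & full
        pos += last
    return occurrences, reads


print(bndm("AGATACGATATATAC", "ATATA"))  # ([8, 10], 15)
print(bndm("AAAAAAAA", "AAAAA"))         # ([1, 2, 3, 4], 20)
```

In the second call every one of the n - m + 1 = 4 windows is read in full (m = 5 characters each), which gives 20 reads and illustrates the O(mn) worst case. The sketch assumes m fits comfortably in an integer; Python integers are unbounded, whereas a machine-word implementation would need m to be at most the word size.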