Advisor: Prof. R. C. T. Lee - PowerPoint PPT Presentation

1
Two exact string matching algorithms using the
suffix to prefix rule
Advisor: Prof. R. C. T. Lee  Speaker: G. W. Cheng
2
Speeding up two string matching algorithms
Algorithmica, Vol. 12, 1994, pp. 247-267. CROCHEMORE,
M., CZUMAJ, A., GASIENIEC, L., JAROMINEK, S.,
LECROQ, T., PLANDOWSKI, W. and RYTTER, W.
3
Problem Definition: We are given a text string T
of length n and a pattern string P of length m,
and we want to find all
occurrences of P in T.
4
Consider the following example
There are two occurrences of P in T as shown
below
5
Rule 1 The Suffix to Prefix Rule
  • For the pattern to match at a position that
    starts inside the current window, there must be
    a suffix of the window which is equal to a
    prefix of the pattern.
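
Rule 1 can be checked directly with a brute-force comparison. The sketch below is only an illustration of the rule, not the method of the paper; the function name longest_suffix_prefix is an assumption introduced here.

```python
def longest_suffix_prefix(window: str, pattern: str) -> int:
    """Length of the longest suffix of `window` that is also a prefix of `pattern`.

    Brute force: try the longest candidate first and stop at the first hit.
    """
    for k in range(min(len(window), len(pattern)), 0, -1):
        if window[-k:] == pattern[:k]:
            return k
    return 0  # no suffix of the window is a prefix of the pattern


# Window and pattern taken from the examples in these slides.
print(longest_suffix_prefix("GCATCGCA", "GCAGAGAG"))
print(longest_suffix_prefix("BDDCACDAD", "ACDADCEAD"))
```

A result of 0 means the window can be skipped entirely; a result equal to the pattern length means the window is a match.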

6
Basic Ideas
  • Open a window W of size |P| in the text.
  • Find the longest suffix of W that is also a
    prefix of the pattern.

Case 1
[Figure: window W in T and pattern P with prefix p]
Match!
7
Case 2
[Figure: window W in T and pattern P with prefix p, before and after the shift]
Case 3
If there is no such suffix, we shift W by the
length of P.
[Figure: window W in T and pattern P, before and after the shift]
8
Preprocessing phase
  • T = GCATCGCAGAGAGTATACAGTACG
  • P = GCAGAGAG

We construct the suffix automaton of P.
[Figure: Suffix Automaton with states 0 to 8 and transitions labelled A, C and G]
9
Preprocessing: Construct a suffix tree of the
reverse of the pattern.
P^R: the reversal of P.
[Figure: suffix tree of P^R with nodes numbered 1 to 8]
10
When there is a match, how do we move the window?
G C A T C G C A G A G A G T A T A C A G T A C G
T
P
G C A G A G A G
11
G C A T C G C A G A G A G T A T A C A G T A C G
T
P
G C A G A G A G
12
Find the longest suffix of W that is also a prefix
of the pattern.
G C A T C G C A G G C A G T A T A C A G T A C G
T
G C A G A G A G
P
13
G C A T C G C A G G C A G T A T A C A G T A C G
T
G C A G A G A G
P
14
A Whole Example
  • T = GCATCGCAGAGAGTATACAGTACG
  • P = GCAGAGAG
  • First attempt

T: G C A T C G C A G A G A G T A T A C A G T A C G
P: G C A G A G A G
Shift by 5 (8 - 3)
15
Second attempt
T: G C A T C G C A G A G A G T A T A C A G T A C G
P:           G C A G A G A G
Shift by 7 (8 - 1)
16
Third attempt
T: G C A T C G C A G A G A G T A T A C A G T A C G
P:                         G C A G A G A G
Shift by 7 (8 - 1)
17
Third attempt
T
G C A T C G C A G A G A G T A T A C A G T A C G
P
G C A G A G A G
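
The shifts of the whole example can be reproduced with a small brute-force sketch. It replaces the suffix automaton by a direct string comparison, so it only illustrates the shift rule (shift = |P| minus the length of the longest proper suffix of the window that is a prefix of P); the name search_with_shifts is an assumption made for this sketch.

```python
def search_with_shifts(text: str, pattern: str):
    """Return (match positions, shifts), positions counted from 0.

    The longest suffix of the window that is a proper prefix of the pattern
    is found by brute force instead of with a suffix automaton.
    """
    m, n = len(pattern), len(text)
    if m == 0:
        return [], []

    def overlap(window: str) -> int:
        # Longest suffix of the window equal to a prefix of the pattern,
        # excluding the full pattern so that the window always moves.
        for k in range(m - 1, 0, -1):
            if window[-k:] == pattern[:k]:
                return k
        return 0

    matches, shifts = [], []
    pos = 0
    while pos <= n - m:
        window = text[pos:pos + m]
        if window == pattern:
            matches.append(pos)
        shift = m - overlap(window)
        shifts.append(shift)
        pos += shift
    return matches, shifts


matches, shifts = search_with_shifts("GCATCGCAGAGAGTATACAGTACG", "GCAGAGAG")
print(matches, shifts)  # [5] [5, 7, 7]
```

The three shifts 5, 7 and 7 agree with the first, second and third attempts above.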
18
Conclusion
  • Preprocessing phase is O(m).
  • Searching phase is O(mn).

21
A Bit-Parallel Approach to Suffix Automata: Fast
Extended String Matching
In Proceedings of the 9th Annual Symposium on
Combinatorial Pattern Matching, Lecture Notes in
Computer Science 1448, Springer-Verlag, Berlin,
pp. 14-31, 1998. NAVARRO, G. and RAFFINOT, M.
22
Problem Definition: We are given a text string T
of length n and a pattern string P of length m,
and we want to find all
occurrences of P in T.
23
This algorithm compares the pattern P with T
within a sliding window. The sliding window
moves from left to right. Example
Text ABDAACDGAEEGGGGJJ
Pattern ACDAAC
sliding window
24
Example
Text ABDAACDGAEEGGGGJJ
Pattern ACDAAC
sliding window
25
Example
Text ABDAACDGAEEGGGGJJ
Pattern ACDAAC
sliding window
26
Basic idea
  • In this algorithm, we want to find the longest
    prefix of the pattern which is equal to the
    suffix of the window.

27
Example
Text ABDDCACDADEGGGGJJ
Pattern ACDADCEAD
We want to find the longest suffix of BDDCACDAD
which is also a prefix of the pattern.
28
Example
Text ABDDCACDADEGGGGJJ
Pattern ACDADCEAD
We find all occurrences of the substring D in the pattern.
29
Example
Text ABDDCACDADEGGGGJJ
Pattern ACDADCEAD
ACDADCEAD
ACDADCEAD
Actually, it means that we compare the windows as
above.
30
Example
Text ABDDCACDADEGGGGJJ
Pattern ACDADCEAD
mismatch
Then we try to find all occurrences of the
substring AD in the pattern.
31
Example
Text ABDDCACDADEGGGGJJ
Pattern ACDADCEAD
We succeed in finding all occurrences of the
substring AD in the pattern.
32
Example
Text ABDDCACDADEGGGGJJ
Pattern ACDADCEAD
mismatch
We try to find all occurrences of the substring
DAD in the pattern.
33
Example
Text ABDDCACDADEGGGGJJ
Pattern ACDADCEAD
We find all occurrences of the substring DAD in the pattern.
34
Example
Text ABDDCACDADEGGGGJJ
Pattern ACDADCEAD
We try to find all occurrences of the substring
CDAD in the pattern.
35
Example
Text ABDDCACDADEGGGGJJ
Pattern ACDADCEAD
We try to find all occurrences of the substring
ACDAD in the pattern.
36
Example
Text ABDDCACDADEGGGGJJ
Pattern ACDADCEAD
We can shift the pattern so that its longest
prefix which is equal to a suffix of the window
is aligned with that suffix in the text.
37
Why do we want to find the longest suffix of the
text in the sliding window which is also a prefix
of pattern? We will explain this by the
following idea.
38
Case 1: u is not a prefix of P, and no prefix of
P is equal to a suffix of the window.
[Figure: u in the window of T and u inside P]
39
So, we can shift the pattern as below.
[Figure: u in T and the pattern P after the shift]
40
Example
Text ABDDCCDDADEGGGGJJ
Pattern ACDADCEAD
P must be shifted in such a way to avoid
comparing any part of P with DDAD.
41
Example
Text ABDDCCDDADEGGGGJJ
Pattern ACDADCEAD
So, we can shift the pattern as above.
42
Case 2 u is not a prefix of P.
[Figure: u in the window of T and u inside P]
43
But a suffix v of the window of T may be a prefix
of P.
[Figure: u with suffix v in the window of T; v as a prefix of P and u inside P]
44
So, we can shift pattern as below.
[Figure: P shifted so that its prefix v is aligned with the suffix v of the window of T]
45
Example
Text ABCABCABA
Pattern CABBCAD
BCA is the longest suffix of ABCABCA which
is also a substring of the pattern.
CA is a suffix of BCA which is a prefix of
the pattern.
46
Example
Text ABCABCABA
Pattern CABBCAD
So we can shift as above.
47
  • The idea that we explained above is the main idea
    of this algorithm, and we will use the
    bit-parallel method to implement this algorithm.

48
Here, we explain how to use bit-parallel to find
the substring of a pattern which is equal to a
suffix of the window.
Example
Text ABCABCCBA
Σ = {A, B, C}
Pattern ACBCCBB
49
Example
Text ABCABCCBA
Pattern ACBCCBB
For every character of the alphabet, we build a
bit mask B that marks where the character occurs
in the pattern.
B[A]      1000000
B[B]      0010011
B[C]      0101100
B[other]  0000000
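
A minimal sketch of how such masks could be built follows. It assumes the convention of these slides, where the leftmost bit stands for the first pattern position; build_masks and bits are names introduced only for this sketch.

```python
def build_masks(pattern: str) -> dict:
    """Map each pattern character to an integer whose binary form marks its positions.

    The leftmost of the m bits corresponds to the first character of the pattern.
    """
    m = len(pattern)
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << (m - 1 - i))
    return masks


def bits(value: int, m: int) -> str:
    """Render an integer as an m-bit string, as written on the slides."""
    return format(value, f"0{m}b")


pattern = "ACBCCBB"
masks = build_masks(pattern)
for ch in "ABC":
    print(ch, bits(masks[ch], len(pattern)))
# Characters that do not occur in the pattern get the all-zero mask.
print("other", bits(masks.get("D", 0), len(pattern)))
```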
50
Example
Text ABCABCCBA
Pattern ACBCCBB
B[A]      1000000
B[B]      0010011
B[C]      0101100
B[other]  0000000
D = 1111111
We use a mask D to record the positions of the
pattern that are still compatible with the
characters read so far.
51
Example
Text ABCABCCBA
Pattern ACBCCBB
B[A]      1000000
B[B]      0010011
B[C]      0101100
B[other]  0000000
D = 1111111
52
Example
Text ABCABCCBA
Pattern ACBCCBB
B[A]      1000000
B[B]      0010011
B[C]      0101100
B[other]  0000000
D = 1111111
Reading C: D & B[C] = 1111111 & 0101100 = 0101100
Where there is a 1, there is an occurrence of the
substring C in the pattern.
We set D = 0101100 << 1 = 1011000
53
Example
Text ABCABCCBA
Pattern ACBCCBB
B[A]      1000000
B[B]      0010011
B[C]      0101100
B[other]  0000000
D = 1011000
Reading C: D & B[C] = 1011000 & 0101100 = 0001000
Where there is a 1, there is an occurrence of the
substring CC in the pattern.
We set D = 0001000 << 1 = 0010000
54
Example
Text ABCABCCBA
Pattern ACBCCBB
B[A]      1000000
B[B]      0010011
B[C]      0101100
B[other]  0000000
D = 0010000
Reading B: D & B[B] = 0010000 & 0010011 = 0010000
Where there is a 1, there is an occurrence of the
substring BCC in the pattern.
We set D = 0010000 << 1 = 0100000
55
Example
Text ABCABCCBA
Pattern ACBCCBB
B[A]      1000000
B[B]      0010011
B[C]      0101100
B[other]  0000000
D = 0100000
Reading A: D & B[A] = 0100000 & 1000000 = 0000000
There is no substring ABCC in the pattern.
The leftmost bit of D was never 1, so we can say
that there is no prefix of the pattern which is
equal to a suffix of the window.
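
The four steps above can be replayed with a short sketch. It assumes that the window is the first seven characters of the text (ABCABCC) and that the leftmost bit stands for the first pattern position; scan_window is a name chosen only for this illustration.

```python
def scan_window(window: str, pattern: str) -> list:
    """Read the window from right to left and return D after each AND, as bit strings."""
    m = len(pattern)
    full = (1 << m) - 1                      # D = 1^m
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << (m - 1 - i))

    d = full
    trace = []
    for ch in reversed(window):
        d &= masks.get(ch, 0)                # keep positions that still match
        trace.append(format(d, f"0{m}b"))
        if d == 0:                           # the suffix read is not a substring of the pattern
            break
        d = (d << 1) & full                  # D << 1, truncated to m bits
    return trace


print(scan_window("ABCABCC", "ACBCCBB"))
# ['0101100', '0001000', '0010000', '0000000']
```

The scan stops after four characters because ABCC is not a substring of the pattern, exactly as on the slides.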
56
Example
Text ABCABCCBA
Pattern ACBCCBB
We can shift Pattern as above.
57
We give another example
Text ABCABCABA
Σ = {A, B, C}
Pattern   CABBCAB
B[A]      0100010
B[B]      0011001
B[C]      1000100
B[other]  0000000
D = 1111111
Reading A: D & B[A] = 1111111 & 0100010 = 0100010
D = 0100010 << 1 = 1000100
58
Text ABCABCABA
Σ = {A, B, C, D}
Pattern   CABBCAB
B[A]      0100010
B[B]      0011001
B[C]      1000100
B[other]  0000000
D = 1000100
Reading C: D & B[C] = 1000100 & 1000100 = 1000100
The leftmost bit is 1, so CA is a substring of the
pattern which starts from position 1 in the
pattern, and this means that CA is a prefix of the
pattern.
D = 1000100 << 1 = 0001000
59
Text ABCABCABA
Σ = {A, B, C, D}
Pattern   CABBCAB
B[A]      0100010
B[B]      0011001
B[C]      1000100
B[other]  0000000
D = 0001000
Reading B: D & B[B] = 0001000 & 0011001 = 0001000
So, we know BCA is a substring of the pattern.
D = 0001000 << 1 = 0010000
60
Text ABCABCABA
Σ = {A, B, C, D}
Pattern   CABBCAB
B[A]      0100010
B[B]      0011001
B[C]      1000100
B[other]  0000000
D = 0010000
Reading A: D & B[A] = 0010000 & 0100010 = 0000000
There is no substring ABCA in the pattern.
61
Text ABCABCABA
Σ = {A, B, C, D}
Pattern CABBCAB
BCA is the longest suffix of ABCABCA which
is also a substring of the pattern, but the longest
prefix of the pattern which is equal to a
suffix of the window is CA.
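
The difference between the longest substring and the longest prefix can be made explicit in code. The sketch below assumes the same bit convention as the slides (leftmost bit = pattern position 1) and uses a hypothetical helper name, suffix_factor_and_prefix.

```python
def suffix_factor_and_prefix(window: str, pattern: str):
    """Return (longest suffix of window that is a substring of pattern,
               longest suffix of window that is a prefix of pattern)."""
    m = len(pattern)
    full = (1 << m) - 1
    top = 1 << (m - 1)                       # the bit for pattern position 1
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << (m - 1 - i))

    d = full
    factor_len = prefix_len = 0
    for k, ch in enumerate(reversed(window), start=1):
        d &= masks.get(ch, 0)
        if d == 0:
            break
        factor_len = k                       # the last k characters occur in the pattern
        if d & top:
            prefix_len = k                   # ... and they occur at position 1
        d = (d << 1) & full
    suffix = lambda length: window[len(window) - length:]
    return suffix(factor_len), suffix(prefix_len)


print(suffix_factor_and_prefix("ABCABCA", "CABBCAB"))  # ('BCA', 'CA')
```

Only the second value decides the shift: here the pattern can be shifted by 7 - 2 = 5 positions.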
62
We take an example of the whole algorithm.
63
We use read to store the suffix of the sliding
window which we have already read, and use
pre-temp to store the longest suffix of the current
read which is also a prefix of the pattern.
64
Example
P = ATATA
T = AGATACGATATATAC
Preprocessing: B[A] = 10101, B[T] = 01010, B[other] = 00000
65
Example
P = ATATA
T = AGATACGATATATAC
Preprocessing: B[A] = 10101, B[T] = 01010, B[other] = 00000
Initial D = 11111
read = empty
pre-temp = empty
66
Example
P = ATATA
T = AGATACGATATATAC
Preprocessing: B[A] = 10101, B[T] = 01010, B[other] = 00000
D = 11111
read = A
pre-temp = empty
Reading A: D & B[A] = 11111 & 10101 = 10101
The leftmost bit is 1, so A is a prefix of the
pattern. We set pre-temp = A.
67
Example
P = ATATA
T = AGATACGATATATAC
Preprocessing: B[A] = 10101, B[T] = 01010, B[other] = 00000
D = 11111
read = A
pre-temp = A
Reading A: D & B[A] = 11111 & 10101 = 10101
D = 10101 << 1 = 01010
68
Example
P = ATATA
T = AGATACGATATATAC
Preprocessing: B[A] = 10101, B[T] = 01010, B[other] = 00000
D = 01010
read = TA
pre-temp = A
Reading T: D & B[T] = 01010 & 01010 = 01010
69
Example
P = ATATA
T = AGATACGATATATAC
Preprocessing: B[A] = 10101, B[T] = 01010, B[other] = 00000
D = 01010
read = TA
pre-temp = A
Reading T: D & B[T] = 01010 & 01010 = 01010
D = 01010 << 1 = 10100
70
Example
P = ATATA
T = AGATACGATATATAC
Preprocessing: B[A] = 10101, B[T] = 01010, B[other] = 00000
D = 10100
read = ATA
pre-temp = A
Reading A: D & B[A] = 10100 & 10101 = 10100
The leftmost bit is 1, so ATA is a prefix of the
pattern. We set pre-temp = ATA.
71
Example
P = ATATA
T = AGATACGATATATAC
Preprocessing: B[A] = 10101, B[T] = 01010, B[other] = 00000
D = 10100
read = ATA
pre-temp = ATA
Reading A: D & B[A] = 10100 & 10101 = 10100
D = 10100 << 1 = 01000
72
Example
P = ATATA
T = AGATACGATATATAC
Preprocessing: B[A] = 10101, B[T] = 01010, B[other] = 00000
D = 01000
read = GATA
pre-temp = ATA
Reading G: D & B[G] = 01000 & 00000 = 00000
73
Example
P = ATATA
T = AGATACGATATATAC
Preprocessing: B[A] = 10101, B[T] = 01010, B[other] = 00000
D = 00000
read = GATA
pre-temp = ATA
T: AGATACGATATATAC
P: ATATA
We find that ATA is the longest suffix of
AGATA which is also a prefix of the pattern.
74
Example
P = ATATA
T = AGATACGATATATAC
Preprocessing: B[A] = 10101, B[T] = 01010, B[other] = 00000
D = 00000
read = GATA
pre-temp = ATA
T: AGATACGATATATAC
P:   ATATA
So, we can shift as above.
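
The bookkeeping of read, D and pre-temp for one window can be sketched as follows. The variable names follow the slides; the function name trace_window and the tuple layout are assumptions made for this sketch. As on the slides, pre-temp is not updated when the whole window has been read.

```python
def trace_window(window: str, pattern: str) -> list:
    """Scan one window right to left; return (read, D after AND, pre-temp) per step."""
    m = len(pattern)
    full = (1 << m) - 1
    top = 1 << (m - 1)
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << (m - 1 - i))

    d, read, pre_temp = full, "", ""
    rows = []
    for ch in reversed(window):
        read = ch + read
        d &= masks.get(ch, 0)
        # Leftmost bit set: `read` is a prefix of the pattern.
        # A full-length prefix is a match and is not stored in pre-temp.
        if d & top and len(read) < m:
            pre_temp = read
        rows.append((read, format(d, f"0{m}b"), pre_temp))
        if d == 0:
            break
        d = (d << 1) & full
    return rows


rows = trace_window("AGATA", "ATATA")
for row in rows:
    print(row)
shift = len("ATATA") - len(rows[-1][2])
print("shift by", shift)  # shift by 2
```

The shift is the pattern length minus the length of pre-temp, which gives 5 - 3 = 2 for the window AGATA.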
75
Example
P = ATATA
T = AGATACGATATATAC
Preprocessing: B[A] = 10101, B[T] = 01010, B[other] = 00000
Initial D = 11111
read = empty
pre-temp = empty
T: AGATACGATATATAC
P:   ATATA
Then we reset D = 11111, read = empty and
pre-temp = empty.
76
Example
P = ATATA
T = AGATACGATATATAC
Preprocessing: B[A] = 10101, B[T] = 01010, B[other] = 00000
D = 11111
read = G
pre-temp = empty
Reading G: D & B[G] = 11111 & 00000 = 00000
77
Example
P = ATATA
T = AGATACGATATATAC
Preprocessing: B[A] = 10101, B[T] = 01010, B[other] = 00000
D = 00000
read = G
pre-temp = empty
T: AGATACGATATATAC
P:   ATATA
There is no substring G in the pattern.
78
Example
P = ATATA
T = AGATACGATATATAC
Preprocessing: B[A] = 10101, B[T] = 01010, B[other] = 00000
Initial D = 11111
read = empty
pre-temp = empty
T: AGATACGATATATAC
P:        ATATA
So, we can shift P to the right by its full
length. Then we reset D = 11111, read = empty and
pre-temp = empty.
79
Example
P = ATATA
T = AGATACGATATATAC
Preprocessing: B[A] = 10101, B[T] = 01010, B[other] = 00000
D = 11111
read = A
pre-temp = empty
Reading A: D & B[A] = 11111 & 10101 = 10101
The leftmost bit is 1, so A is a prefix of the
pattern. We set pre-temp = A.
80
Example
P = ATATA
T = AGATACGATATATAC
Preprocessing: B[A] = 10101, B[T] = 01010, B[other] = 00000
D = 11111
read = A
pre-temp = A
Reading A: D & B[A] = 11111 & 10101 = 10101
D = 10101 << 1 = 01010
81
Example
P = ATATA
T = AGATACGATATATAC
Preprocessing: B[A] = 10101, B[T] = 01010, B[other] = 00000
D = 01010
read = TA
pre-temp = A
Reading T: D & B[T] = 01010 & 01010 = 01010
D = 01010 << 1 = 10100
82
Example
P = ATATA
T = AGATACGATATATAC
Preprocessing: B[A] = 10101, B[T] = 01010, B[other] = 00000
D = 10100
read = ATA
pre-temp = A
Reading A: D & B[A] = 10100 & 10101 = 10100
The leftmost bit is 1, so ATA is a prefix of the
pattern. We set pre-temp = ATA.
83
Example
P = ATATA
T = AGATACGATATATAC
Preprocessing: B[A] = 10101, B[T] = 01010, B[other] = 00000
D = 10100
read = ATA
pre-temp = ATA
Reading A: D & B[A] = 10100 & 10101 = 10100
D = 10100 << 1 = 01000
84
Example
P = ATATA
T = AGATACGATATATAC
Preprocessing: B[A] = 10101, B[T] = 01010, B[other] = 00000
D = 01000
read = TATA
pre-temp = ATA
Reading T: D & B[T] = 01000 & 01010 = 01000
D = 01000 << 1 = 10000
85
Example
P = ATATA
T = AGATACGATATATAC
Preprocessing: B[A] = 10101, B[T] = 01010, B[other] = 00000
D = 10000
read = ATATA
pre-temp = ATA
Reading A: D & B[A] = 10000 & 10101 = 10000
86
Example
P = ATATA
T = AGATACGATATATAC
Preprocessing: B[A] = 10101, B[T] = 01010, B[other] = 00000
D = 10000
read = ATATA
pre-temp = ATA
T: AGATACGATATATAC
P:        ATATA
We find ATATA, a prefix of the pattern of length m
which is equal to the whole window, so an exact
match occurs.
87
Example
P = ATATA
T = AGATACGATATATAC
Preprocessing: B[A] = 10101, B[T] = 01010, B[other] = 00000
D = 10000
read = ATATA
pre-temp = ATA
T: AGATACGATATATAC
P:        ATATA
Besides the full pattern, ATA is the longest
suffix of the window ATATA which is also a prefix
of the pattern.
88
Example
P = ATATA
T = AGATACGATATATAC
Preprocessing: B[A] = 10101, B[T] = 01010, B[other] = 00000
D = 10000
read = ATATA
pre-temp = ATA
T: AGATACGATATATAC
P:          ATATA
So, we can shift as above.
89
Example
P = ATATA
T = AGATACGATATATAC
Preprocessing: B[A] = 10101, B[T] = 01010, B[other] = 00000
Initial D = 11111
read = empty
pre-temp = empty
T: AGATACGATATATAC
P:          ATATA
Repeat the above steps until the window slides out
of the text.
90
  • We give an extreme example to show the worst case
    of the algorithm.

91
Example
P = AAAAA
T = AAAAAAAA
Preprocessing: B[A] = 11111, B[other] = 00000
Initial D = 11111
read = empty
pre-temp = empty
92
Example
P = AAAAA
T = AAAAAAAA
Preprocessing: B[A] = 11111, B[other] = 00000
D = 11111
read = A
pre-temp = empty
Reading A: D & B[A] = 11111 & 11111 = 11111
93
Example
P = AAAAA
T = AAAAAAAA
Preprocessing: B[A] = 11111, B[other] = 00000
D = 11111
read = A
pre-temp = A
Reading A: D & B[A] = 11111 & 11111 = 11111
D = 11111 << 1 = 11110
94
Example
P = AAAAA
T = AAAAAAAA
Preprocessing: B[A] = 11111, B[other] = 00000
D = 11110
read = AA
pre-temp = A
Reading A: D & B[A] = 11110 & 11111 = 11110
95
Example
P = AAAAA
T = AAAAAAAA
Preprocessing: B[A] = 11111, B[other] = 00000
D = 11110
read = AA
pre-temp = AA
Reading A: D & B[A] = 11110 & 11111 = 11110
D = 11110 << 1 = 11100
96
Example
P = AAAAA
T = AAAAAAAA
Preprocessing: B[A] = 11111, B[other] = 00000
D = 11100
read = AAA
pre-temp = AA
Reading A: D & B[A] = 11100 & 11111 = 11100
97
Example
P = AAAAA
T = AAAAAAAA
Preprocessing: B[A] = 11111, B[other] = 00000
D = 11100
read = AAA
pre-temp = AAA
Reading A: D & B[A] = 11100 & 11111 = 11100
D = 11100 << 1 = 11000
98
Example
P = AAAAA
T = AAAAAAAA
Preprocessing: B[A] = 11111, B[other] = 00000
D = 11000
read = AAAA
pre-temp = AAA
Reading A: D & B[A] = 11000 & 11111 = 11000
99
Example
P = AAAAA
T = AAAAAAAA
Preprocessing: B[A] = 11111, B[other] = 00000
D = 11000
read = AAAA
pre-temp = AAAA
Reading A: D & B[A] = 11000 & 11111 = 11000
D = 11000 << 1 = 10000
100
Example
P = AAAAA
T = AAAAAAAA
Preprocessing: B[A] = 11111, B[other] = 00000
D = 10000
read = AAAAA
pre-temp = AAAA
Reading A: D & B[A] = 10000 & 11111 = 10000
We find AAAAA, a prefix of the pattern of length m
which is equal to the whole window, so an exact
match occurs.
101
Example
P = AAAAA
T = AAAAAAAA
Preprocessing: B[A] = 11111, B[other] = 00000
D = 11111
read = empty
pre-temp = empty
102
Example
P = AAAAA
T = AAAAAAAA
Preprocessing: B[A] = 11111, B[other] = 00000
D = 11111
read = A
pre-temp = empty
Reading A: D & B[A] = 11111 & 11111 = 11111
103
Example
P = AAAAA
T = AAAAAAAA
Preprocessing: B[A] = 11111, B[other] = 00000
D = 11111
read = A
pre-temp = A
Reading A: D & B[A] = 11111 & 11111 = 11111
D = 11111 << 1 = 11110
104
Example
P = AAAAA
T = AAAAAAAA
Preprocessing: B[A] = 11111, B[other] = 00000
D = 11110
read = AA
pre-temp = A
Reading A: D & B[A] = 11110 & 11111 = 11110
105
Example
P = AAAAA
T = AAAAAAAA
Preprocessing: B[A] = 11111, B[other] = 00000
D = 11110
read = AA
pre-temp = AA
Reading A: D & B[A] = 11110 & 11111 = 11110
D = 11110 << 1 = 11100
106
Example
P = AAAAA
T = AAAAAAAA
Preprocessing: B[A] = 11111, B[other] = 00000
D = 11100
read = AAA
pre-temp = AA
Reading A: D & B[A] = 11100 & 11111 = 11100
107
Example
P = AAAAA
T = AAAAAAAA
Preprocessing: B[A] = 11111, B[other] = 00000
D = 11100
read = AAA
pre-temp = AAA
Reading A: D & B[A] = 11100 & 11111 = 11100
D = 11100 << 1 = 11000
108
Example
P = AAAAA
T = AAAAAAAA
Preprocessing: B[A] = 11111, B[other] = 00000
D = 11000
read = AAAA
pre-temp = AAA
Reading A: D & B[A] = 11000 & 11111 = 11000
109
Example
P = AAAAA
T = AAAAAAAA
Preprocessing: B[A] = 11111, B[other] = 00000
D = 11000
read = AAAA
pre-temp = AAAA
Reading A: D & B[A] = 11000 & 11111 = 11000
D = 11000 << 1 = 10000
110
Example
P = AAAAA
T = AAAAAAAA
Preprocessing: B[A] = 11111, B[other] = 00000
D = 10000
read = AAAAA
pre-temp = AAAA
Reading A: D & B[A] = 10000 & 11111 = 10000
We find AAAAA, a prefix of the pattern of length m
which is equal to the whole window, so an exact
match occurs.
111
Time Complexity
If the length of the text is n and the length of
pattern is m,
the time complexity of this algorithm is O(mn) in
the worst case.
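
The worst case can be observed by counting how many text characters are read. The sketch below is a plain Python rendering of the bit-parallel search with an added counter; the name bndm_count_reads and the counter itself are assumptions added for illustration, and positions are counted from 0.

```python
def bndm_count_reads(text: str, pattern: str):
    """Bit-parallel search that also counts the text characters it reads."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return [], 0
    full = (1 << m) - 1
    top = 1 << (m - 1)
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << (m - 1 - i))

    occurrences, reads, pos = [], 0, 0
    while pos <= n - m:
        j, last, d = m, m, full
        while d != 0:
            d &= masks.get(text[pos + j - 1], 0)
            reads += 1
            j -= 1
            if d & top:
                if j > 0:
                    last = j                 # a proper prefix was recognised
                else:
                    occurrences.append(pos)  # the whole window matches
            d = (d << 1) & full
        pos += last
    return occurrences, reads


# Extreme example of the slides: every window is read completely.
print(bndm_count_reads("AAAAAAAA", "AAAAA"))  # ([0, 1, 2, 3], 20)
# Favourable case: one read, then a shift by the whole pattern length.
print(bndm_count_reads("BBBBBBBB", "AAAAA"))  # ([], 1)
```

For P = AAAAA and T = AAAAAAAA the count is m(n - m + 1) = 5 × 4 = 20 reads, which is where the O(mn) bound comes from.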
112
Reference
M. Crochemore, A. Czumaj, L. Gasieniec, S. Jarominek,
T. Lecroq, W. Plandowski, and W. Rytter. Speeding
up two string matching algorithms. Algorithmica,
12(4/5):247-267, 1994.
G. Navarro and M. Raffinot. Fast and flexible
string matching by combining bit-parallelism and
suffix automata. ACM Journal of Experimental
Algorithmics, 5, 2000.
W. I. Chang and E. L. Lawler. Sublinear approximate
string matching and biological applications.
Algorithmica, 12(4/5):327-344, 1994.
113
  • Thanks for your attention.

114
Algorithm
Preprocessing
  For c ∈ Σ Do B[c] ← 0^m
  For j ∈ 1..m Do B[p_j] ← B[p_j] | 0^(j-1) 1 0^(m-j)
Searching
  pos ← 0
  While pos ≤ n-m Do
    j ← m, last ← m
    D ← 1^m
    While D ≠ 0^m Do
      D ← D & B[t_(pos+j)]
      j ← j-1
      If D & 1 0^(m-1) ≠ 0^m Then
        If j > 0 Then last ← j
        Else report an occurrence at pos+1
      End of if
      D ← D << 1
    End of while
    pos ← pos + last
  End of while
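
A possible Python transcription of the pseudocode is sketched below. It keeps the variable names pos, j, last and D, but reports positions counted from 0 instead of pos+1, and it relies on Python's unbounded integers, so the shift is masked to m bits explicitly; the function name bndm_search is an assumption of this sketch.

```python
def bndm_search(text: str, pattern: str) -> list:
    """Return the 0-based start positions of all occurrences of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []

    # Preprocessing: B[c] has a 1 at each position where c occurs in the pattern
    # (leftmost bit = first pattern position).
    full = (1 << m) - 1                      # 1^m
    top = 1 << (m - 1)                       # 1 0^(m-1)
    B = {}
    for j, ch in enumerate(pattern):
        B[ch] = B.get(ch, 0) | (1 << (m - 1 - j))

    # Searching
    occurrences = []
    pos = 0
    while pos <= n - m:
        j, last = m, m
        D = full
        while D != 0:
            D &= B.get(text[pos + j - 1], 0)   # read the window right to left
            j -= 1
            if D & top:
                if j > 0:
                    last = j                   # remember the longest prefix seen so far
                else:
                    occurrences.append(pos)    # the whole window matches
            D = (D << 1) & full
        pos += last
    return occurrences


print(bndm_search("AGATACGATATATAC", "ATATA"))  # [7, 9]
```

In the example of the slides the windows start at positions 0, 2, 7 and 9 (counted from 0), which corresponds to the shifts by 2, 5 and 2 shown earlier.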