Title: A Pre-Processing Algorithm for String Pattern Matching
1 A Pre-Processing Algorithm for String Pattern Matching
Laurence Boxer
Department of Computer and Information Sciences, Niagara University, and Department of Computer Science and Engineering, SUNY at Buffalo
2 The Problem
- Given a text T of n characters and a pattern P of m characters, 1 < m < n,
  - find every substring P' of T that's a copy of P.
- Applications
  - "Find" operations of word processors, Web browsers
  - molecular biologists search for DNA fragments in genomes or proteins in protein complexes
- Note: amount of input is Θ(m + n) = Θ(n).
- Examples are known that require examination of every character of T. Hence, the worst-case running time of any solution is Ω(n).
- There exist algorithms that run in Θ(n) time, which is therefore optimal in the worst case.
- So, what do I have that's new and interesting?
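As a baseline for the problem statement above, here is a minimal brute-force sketch. It is not the method of these slides and runs in Θ(mn) time in the worst case; the function name `find_all` is a hypothetical helper chosen for illustration.

```python
def find_all(text, pattern):
    """Return the start index of every substring of text that equals pattern."""
    n, m = len(text), len(pattern)
    # Try every alignment of the pattern against the text.
    return [i for i in range(n - m + 1) if text[i:i + m] == pattern]


print(find_all("abracadabra", "abra"))  # [0, 7]
```

Overlapping copies are reported too, e.g. "aa" occurs in "aaaa" at positions 0, 1 and 2.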
3 Boyer-Moore algorithm
- This well-known algorithm has a worst-case running time that's ω(n).
- In practice, it often runs in Θ(n) time with a low constant of proportionality.
- There is a large class of examples for which Boyer-Moore runs in o(n) time (best case Θ(n / m): an example of more input, i.e. larger m, resulting in a faster solution). This is because the algorithm recognizes "bad" characters that enable skipping blocks of characters of T.
- Therefore,
  - Use Boyer-Moore methods as a pre-processing step to reduce the amount of data in T that need be considered, in O(n) time.
  - Apply another, linear-time algorithm to the reduced amount of data.
4 Analysis
- In the worst case, there's no data reduction, so the resulting algorithm takes Θ(n) time with a higher constant of proportionality than had we omitted pre-processing.
- When T and P are ordinary English, with P using less of the alphabet than T (which is common), the expected running time is Θ(n) with a smaller constant of proportionality than if we don't pre-process as described.
- Best case: Θ(n / m) time.
5 Start by finding characters in T that can't be last characters of matches
- In Θ(m) time, scan the characters of P, marking which characters of the alphabet appear in P.
- Boyer-Moore "bad character" rule: if the character of T aligned with the last character of P isn't in P, then none of the m characters of T starting with this one can align with the last character of P in a substring match.
- For a case-insensitive search, examine positions 2, 5, 8, 9, 12, 13, 14, 15, 18, 19, 20; conclude that positions 0-13, 15-18, 20-22 cannot be last positions of matching substrings. Note that among the eliminated is "t" at position 6.
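The scan described on this slide might be sketched as follows. This is a minimal illustration assuming n >= m > 1, not the author's implementation; the function name `ruled_out_segments` is hypothetical. It applies only the bad-character rule as stated, so it may eliminate fewer positions than the slide's own example, and it is case-sensitive.

```python
def ruled_out_segments(text, pattern):
    """Return inclusive (start, end) segments of positions in text that
    cannot be the last position of a match, using the bad-character rule."""
    n, m = len(text), len(pattern)
    in_pattern = set(pattern)  # Theta(m): mark which characters appear in P
    # Positions 0..m-2 are too close to the start of T to end a match.
    segments = [(0, m - 2)]
    i = m - 1  # first position that could align with the last character of P
    while i < n:
        if text[i] not in in_pattern:
            # Bad character: none of the m positions starting here can end a match.
            segments.append((i, min(i + m - 1, n - 1)))
            i += m
        else:
            i += 1  # position i is not ruled out by this rule
    return segments


print(ruled_out_segments("xxabxxxab", "ab"))
# [(0, 0), (1, 2), (4, 5), (6, 7)]
```

In this toy run the "a" at position 2 occurs in the pattern but is still eliminated, because it lies inside the block skipped after the bad character at position 1. Only positions 3 and 8 remain as possible final positions.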
6 Next, find positions in T not yet ruled out as final positions of substring matches
- This is done in O(n) time by computing the complement of the union of segments determined in the previous step.
- In the example, only positions 14, 19 remain.
- Expand the intervals of possible final positions by m-1 positions to the left to obtain intervals containing possible matches; in the example, [12,14] ∪ [17,19].
- Apply a linear-time algorithm to these remaining segments of T.
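The complement-and-expand step can be sketched as below. This is an illustrative sketch rather than the author's code; the function name `candidate_intervals` is hypothetical. The values n = 23 and m = 3 are inferred from the slide's example (positions 0-22, and 14 expanding to [12,14]) and are assumptions.

```python
def candidate_intervals(ruled_out, n, m):
    """ruled_out: sorted, disjoint inclusive (start, end) segments within [0, n-1].
    Returns (remaining, expanded): the possible final positions of matches,
    and the intervals of T that may contain matches."""
    # Complement of the union of ruled-out segments.
    remaining = []
    next_free = 0
    for start, end in ruled_out:
        if next_free < start:
            remaining.append((next_free, start - 1))
        next_free = max(next_free, end + 1)
    if next_free < n:
        remaining.append((next_free, n - 1))
    # Expand each interval m-1 positions to the left, merging overlaps.
    expanded = []
    for start, end in remaining:
        start = max(0, start - (m - 1))
        if expanded and start <= expanded[-1][1]:
            expanded[-1] = (expanded[-1][0], end)
        else:
            expanded.append((start, end))
    return remaining, expanded


# Segments from the slide's example; n = 23 and m = 3 are inferred assumptions.
remaining, expanded = candidate_intervals([(0, 13), (15, 18), (20, 22)], n=23, m=3)
print(remaining)  # [(14, 14), (19, 19)]
print(expanded)   # [(12, 14), (17, 19)]
```

Each expanded interval can then be passed separately to a linear-time matcher.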
7 Experimental results
- Thanks to Stephen Englert, who wrote the test program
- Used the Z algorithm
- Implementation in C, Unix
- Time units are C "clock" units
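For readers unfamiliar with the Z algorithm named above, here is a textbook-style sketch of linear-time matching with it. It is written in Python for brevity and is not the C test program used in the experiments; the names `z_array` and `z_search`, and the separator character, are assumptions made for illustration.

```python
def z_array(s):
    """z[i] = length of the longest common prefix of s and s[i:]."""
    n = len(s)
    z = [0] * n
    if n:
        z[0] = n
    left = right = 0  # [left, right) is the rightmost prefix-matching window found
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])  # reuse earlier comparisons
        while i + z[i] < n and s[i + z[i]] == s[z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def z_search(text, pattern, sep="\0"):
    """Start indices of all copies of pattern in text.
    Assumes sep occurs in neither text nor pattern."""
    m = len(pattern)
    z = z_array(pattern + sep + text)
    return [i - (m + 1) for i in range(m + 1, len(z)) if z[i] >= m]


print(z_search("abracadabra", "abra"))  # [0, 7]
```

In the pre-processing scheme, a matcher like this would be applied only to the segments of T that survive the reduction step, with the reported indices offset by each segment's start.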
8 Experimental results: best case experiment
- T: ordinary English text, file "test2.txt", n = 2,350,367
- The pattern character does not occur in T, so all characters of T are "bad".

P        With Preprocessing    Without Preprocessing
""^4     8                     167
""^8     5                     167
""^16    3                     166
""^32    2                     168
""^64    1                     167
9 Artificial best case experiment
text = ""^m, m = 2^k

      pattern "12345678"              pattern "1234567890123456"
k     Preprocessed    Not Preproc.    Preprocessed    Not Preproc.
19    1               37              0               37
20    2               76              1               73
21    4               150             2               151
22    8               307             5               303
23    18              621             11              622
10 Worst case experiment: preprocessing doesn't reduce data
T of length n, n = 2^k; P of length m. Here, preprocessing slows running time (by about 12-16%).

      m = 4                     m = 8                     m = 16
k     Preproc.   Not Preproc.   Preproc.   Not Preproc.   Preproc.   Not Preproc.
19    159        138            158        138            159        138
20    319        278            318        277            318        276
21    648        570            644        567            644        567
22    1,303      1,148          1,299      1,153          1,289      1,147
23    2,631      2,321          2,625      2,327          2,613      2,318
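The slowdown quoted on this slide can be checked directly against the table. The short sketch below recomputes the relative overhead of preprocessing for every (k, m) pair; the variable names are illustrative and the figures are copied from the table above.

```python
# (k, with preprocessing, without preprocessing) in C clock units, keyed by m.
timings = {
    4:  [(19, 159, 138), (20, 319, 278), (21, 648, 570), (22, 1303, 1148), (23, 2631, 2321)],
    8:  [(19, 158, 138), (20, 318, 277), (21, 644, 567), (22, 1299, 1153), (23, 2625, 2327)],
    16: [(19, 159, 138), (20, 318, 276), (21, 644, 567), (22, 1289, 1147), (23, 2613, 2318)],
}

# Percentage by which preprocessing increases the running time.
slowdowns = [
    100.0 * (pre - plain) / plain
    for rows in timings.values()
    for _, pre, plain in rows
]
print(f"{min(slowdowns):.1f}% to {max(slowdowns):.1f}%")
```

The overhead stays within a narrow band across all fifteen measurements, consistent with preprocessing adding a constant factor to a Θ(n) running time.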
11 Ordinary English text pattern experiment 1
T: file "test2.txt", n = 2,350,367

P                 Preproc.    Not Preproc.
"algorithm"       41          180
"algorithm"^2     4           177
"algorithm"^4     4           178
"algorithm"^8     2           179

- Superlinear speedup is likely due to matches vs. no matches.
12 Ordinary English text pattern experiment 2
T: file "test2.txt", n = 2,350,367

P                Preproc.    Not Preproc.
"parallel"       9           169
"parallel"^2     4           170
"parallel"^4     3           170
"parallel"^8     1           170

- 9 vs. 41 for "algorithm" is likely due to more "bad" characters, since "parallel" uses fewer distinct letters.
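The remark about distinct letters is easy to verify. The sketch below counts the distinct characters of each pattern; the helper name `distinct_letters` is hypothetical. The fewer distinct letters P uses, the more characters of T count as "bad" for it.

```python
def distinct_letters(word):
    """Number of distinct characters appearing in word."""
    return len(set(word))


for word in ("algorithm", "parallel"):
    print(word, len(word), distinct_letters(word))
# algorithm 9 9
# parallel 8 5
```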