A Pre-Processing Algorithm for String Pattern Matching - PowerPoint PPT Presentation


Provided by: laurenc4

Learn more at: https://purple.niagara.edu


Transcript and Presenter's Notes

Title: A Pre-Processing Algorithm for String Pattern Matching

1
A Pre-Processing Algorithm for String Pattern Matching
Laurence Boxer
Department of Computer and Information Sciences, Niagara University, and
Department of Computer Science and Engineering, SUNY at Buffalo
2
The Problem

Given a text T of n characters and a pattern P of m characters, 1 < m < n, find every substring of T that's a copy of P.
Applications:
"Find" operations of word processors, Web browsers
Molecular biologists search for DNA fragments in genomes or proteins in protein complexes
Note: amount of input is Θ(m + n) = Θ(n).
Examples are known that require examination of every character of T. Hence, worst-case running time of a solution is Ω(n).
There exist algorithms that run in Θ(n) time, which is therefore optimal in the worst case.
So, what do I have that's new and interesting?

3
Boyer-Moore algorithm

This well-known algorithm has a worst-case running time that's ω(n).
In practice, it often runs in Θ(n) time with a low constant of proportionality.
There is a large class of examples for which Boyer-Moore runs in o(n) time (best case Θ(n / m): an example of more input, that is, larger m, resulting in a faster solution). This is because the algorithm recognizes "bad" characters that enable skipping blocks of characters of T.
Therefore:
Use Boyer-Moore methods as a pre-processing step to reduce the amount of data in T that need be considered, in O(n) time.
Apply another, linear-time algorithm to the reduced amount of data.

4
Analysis

In the worst case, there's no data reduction, so the resulting algorithm takes Θ(n) time with a higher constant of proportionality than had we omitted pre-processing.
When T and P are ordinary English with P using less of the alphabet than T (which is common), expected running time is Θ(n) with a smaller constant of proportionality than if we don't pre-process as described.
Best case: Θ(n / m) time.
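
The best and worst cases can be illustrated by counting how many characters of T a bad-character scan looks at. This is a minimal sketch, not the author's program: the function name `count_examined` and the test strings are hypothetical, and it assumes the scan jumps m positions whenever the examined character does not occur in P.

```python
def count_examined(text: str, pattern: str) -> int:
    """Count the characters of text that a bad-character scan looks at."""
    m = len(pattern)
    in_pattern = set(pattern)   # characters that occur in P
    examined = 0
    i = m - 1                   # first position where a match could end
    while i < len(text):
        examined += 1
        if text[i] in in_pattern:
            i += 1              # no skip: text[i] might be part of a match
        else:
            i += m              # bad character: skip a block of m positions
    return examined


# Best case: no character of P occurs in T, so about n / m characters are examined.
best = count_examined("a" * 1000, "bbbb")
# Worst case: every character of T occurs in P, so nothing is skipped.
worst = count_examined("b" * 1000, "bbbb")
print(best, worst)
```

With n = 1000 and m = 4, the first call examines n / m positions, while the second examines every position from m - 1 onward.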

5
Start by finding characters in T that can't be last characters of matches

In Θ(m) time, scan the characters of P, marking which characters of the alphabet appear in P.
Boyer-Moore "bad character" rule: if the character of T aligned with the last character of P isn't in P, then none of the m characters of T starting with this one can align with the last character of P in a substring match.

For a case-insensitive search, examine positions 2, 5, 8, 9, 12, 13, 14, 15, 18, 19, 20; conclude that positions 0-13, 15-18, 20-22 cannot be last positions of matching substrings. Note: among those eliminated is the "t" at position 6.
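
The marking step and the bad-character scan might be sketched as follows. The function name `eliminated_segments`, its `ignore_case` parameter, and the sample strings are assumptions made for illustration. The sketch applies only the bad-character rule as stated on this slide, so it may keep positions that the author's program would also rule out.

```python
def eliminated_segments(text, pattern, ignore_case=False):
    """Inclusive (start, end) segments of text that cannot end a match."""
    if ignore_case:
        text, pattern = text.lower(), pattern.lower()
    n, m = len(text), len(pattern)
    in_pattern = set(pattern)   # mark the characters that appear in P
    segments = []
    if m > 1:
        # Positions 0 .. m-2 are too close to the start of T to end a match.
        segments.append((0, min(m - 2, n - 1)))
    i = m - 1
    while i < n:
        if text[i] in in_pattern:
            i += 1              # text[i] is not bad; examine the next position
        else:
            # Bad character: positions i .. i+m-1 cannot end a match.
            segments.append((i, min(i + m - 1, n - 1)))
            i += m
    return segments


# Hypothetical input: "the" occurs at positions 2 and 8 of the text.
print(eliminated_segments("xxTHExxxthe", "the", ignore_case=True))
```

Each bad character rules out a block of m positions at the cost of one lookup, which is where the skipping comes from.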

6
Next, find positions in T not yet ruled out as final positions of substring matches

This is done in O(n) time by computing the complement of the union of the segments determined in the previous step.
In the example, only positions 14 and 19 remain.
Expand the intervals of possible final positions by m-1 positions to the left to obtain intervals containing possible matches; in the example, [12,14] ∪ [17,19].
Apply a linear-time algorithm to these remaining segments of T.
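
The steps above can be sketched end to end. This is an illustration under stated assumptions, not the author's implementation: the names `candidate_segments` and `find_all` are hypothetical, 1 < m < n is assumed, and Python's built-in `str.find` stands in for the linear-time algorithm applied to each remaining segment.

```python
def candidate_segments(text, pattern):
    """Inclusive (start, end) segments of text that may contain a match."""
    n, m = len(text), len(pattern)
    in_pattern = set(pattern)
    # Possible final positions: the complement of the eliminated positions.
    possible_end = [False] * n
    i = m - 1
    while i < n:
        if text[i] in in_pattern:
            possible_end[i] = True
            i += 1
        else:
            i += m              # positions i .. i+m-1 are ruled out
    # Group possible final positions into runs, expand each run m-1 positions
    # to the left, and merge runs whose expanded intervals overlap.
    segments = []
    i = 0
    while i < n:
        if not possible_end[i]:
            i += 1
            continue
        j = i
        while j + 1 < n and possible_end[j + 1]:
            j += 1
        start = i - (m - 1)     # never negative, because i >= m - 1
        if segments and start <= segments[-1][1]:
            segments[-1] = (segments[-1][0], j)
        else:
            segments.append((start, j))
        i = j + 1
    return segments


def find_all(text, pattern):
    """Start positions of all matches, searching only the candidate segments."""
    matches = []
    for start, end in candidate_segments(text, pattern):
        segment = text[start:end + 1]
        pos = segment.find(pattern)     # stand-in for a linear-time matcher
        while pos != -1:
            matches.append(start + pos)
            pos = segment.find(pattern, pos + 1)
    return matches


print(candidate_segments("xxthexxxthe", "the"))
print(find_all("xxthexxxthe", "the"))
```

Because the segments are disjoint after merging and every match lies entirely inside one segment, no match is reported twice and none is missed.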

7
Experimental results

Thanks to Stephen Englert, who wrote the test program
Used the Z algorithm
Implementation in C, Unix
Time units are C clock units
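
For reference, a minimal sketch of matching with the Z algorithm follows. The test program was written in C and is not reproduced here; the names `z_array` and `z_search` are hypothetical, and the sketch assumes the separator character occurs in neither the pattern nor the text.

```python
def z_array(s):
    """z[i] = length of the longest common prefix of s and s[i:]."""
    n = len(s)
    z = [0] * n
    if n:
        z[0] = n
    left = right = 0            # s[left:right] is the rightmost known prefix match
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])   # reuse earlier comparisons
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def z_search(text, pattern, sep="\x00"):
    """Start positions of all matches of pattern in text, in linear time."""
    m = len(pattern)
    z = z_array(pattern + sep + text)
    # A Z value of m at index i means the pattern occurs at text position i - m - 1.
    return [i - m - 1 for i in range(m + 1, len(z)) if z[i] == m]


print(z_search("abababa", "aba"))
```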

8
Experimental results: best-case experiment, ordinary English text
The character repeated in P does not occur in T, so all characters of T are "bad".
T: file "test2.txt", n = 2,350,367

P        With Preprocessing   Without Preprocessing
""^4     8                    167
""^8     5                    167
""^16    3                    166
""^32    2                    168
""^64    1                    167
9
Artificial best-case experiment
text = ""^m, m = 2^k

      pattern "12345678"             pattern "1234567890123456"
k     Preprocessed   Not Preproc.    Preprocessed   Not Preproc.
19    1              37              0              37
20    2              76              1              73
21    4              150             2              151
22    8              307             5              303
23    18             621             11             622
10
Worst-case experiment: preprocessing doesn't reduce data
T has n = 2^k characters, P has m characters. Here, preprocessing slows running time (by about 12 - 16%).

      m = 4                     m = 8                     m = 16
k     Preproc.   Not Preproc.   Preproc.   Not Preproc.   Preproc.   Not Preproc.
19    159        138            158        138            159        138
20    319        278            318        277            318        276
21    648        570            644        567            644        567
22    1,303      1,148          1,299      1,153          1,289      1,147
23    2,631      2,321          2,625      2,327          2,613      2,318
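
The stated slowdown can be checked directly from the timings on this slide. The short calculation below is only an arithmetic illustration: the variable names are hypothetical and the numbers are copied from the table.

```python
# (preprocessed, not preprocessed) clock units for k = 19 .. 23, keyed by m
timings = {
    4:  [(159, 138), (319, 278), (648, 570), (1303, 1148), (2631, 2321)],
    8:  [(158, 138), (318, 277), (644, 567), (1299, 1153), (2625, 2327)],
    16: [(159, 138), (318, 276), (644, 567), (1289, 1147), (2613, 2318)],
}

# Relative slowdown, in percent, caused by preprocessing in each experiment.
slowdowns = [
    100.0 * (pre - plain) / plain
    for rows in timings.values()
    for pre, plain in rows
]
lowest, highest = min(slowdowns), max(slowdowns)
print(f"slowdown ranges from {lowest:.1f}% to {highest:.1f}%")
```

Every measured slowdown falls inside the 12 - 16% band quoted on the slide.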
11
Ordinary English text and pattern: experiment 1
T: file "test2.txt", n = 2,350,367

P                Preproc.   Not Preproc.
"algorithm"      41         180
"algorithm"^2    4          177
"algorithm"^4    4          178
"algorithm"^8    2          179

Superlinear speedup likely due to matches vs. no matches.
12
Ordinary English text and pattern: experiment 2
T: file "test2.txt", n = 2,350,367

P               Preproc.   Not Preproc.
"parallel"      9          169
"parallel"^2    4          170
"parallel"^4    3          170
"parallel"^8    1          170

9 vs. 41 for "algorithm" likely due to more bad characters, since "parallel" uses fewer distinct letters.
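
The distinct-letter counts behind this remark are easy to confirm. The helper name `distinct_letters` is hypothetical; the snippet only counts letters and says nothing about the contents of "test2.txt".

```python
def distinct_letters(pattern: str) -> int:
    """Number of distinct characters in the pattern (case-insensitive)."""
    return len(set(pattern.lower()))


# Fewer distinct letters in P means more characters of T count as "bad".
for word in ("algorithm", "parallel"):
    print(word, distinct_letters(word))
```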
