A Pre-Processing Algorithm for String Pattern Matching

1
A Pre-Processing Algorithm for String Pattern Matching
Laurence Boxer
Department of Computer and Information Sciences, Niagara University
and Department of Computer Science and Engineering, SUNY at Buffalo
2
The Problem
  • Given a text T of n characters and a pattern P of m characters, 1 < m < n,
  • find every substring of T that is a copy of P.
  • Applications:
  • Find operations of word processors, Web browsers
  • molecular biologists' searches for DNA fragments in genomes or for proteins in protein complexes
  • Note the amount of input is Θ(m + n) = Θ(n).
  • Examples are known that require examination of every character of T. Hence, the worst-case running time of any solution is Ω(n).
  • There exist algorithms that run in Θ(n) time, which is therefore optimal in the worst case.
  • So, what do I have that's new and interesting?

3
Boyer-Moore algorithm
  • This well-known algorithm has a worst-case running time that's Ω(n).
  • In practice, it often runs in Θ(n) time with a low constant of proportionality.
  • There is a large class of examples for which Boyer-Moore runs in o(n) time (best case Θ(n/m): an example of more input, i.e., larger m, resulting in a faster solution). This is because the algorithm recognizes "bad characters" that enable skipping blocks of characters of T.
  • Therefore:
  • Use Boyer-Moore methods as a pre-processing step to reduce, in O(n) time, the amount of data in T that need be considered.
  • Apply another, linear-time algorithm to the reduced amount of data.

4
Analysis
  • In the worst case, there's no data reduction, so the resulting algorithm takes Θ(n) time with a higher constant of proportionality than had we omitted pre-processing.
  • When T and P are ordinary English with P using less of the alphabet than T (which is common), the expected running time is Θ(n) with a smaller constant of proportionality than if we don't pre-process as described.
  • Best case: Θ(n/m) time.

5
Start by finding characters in T that can't be last characters of matches
  • In Θ(m) time, scan the characters of P, marking which characters of the alphabet appear in P.
  • Boyer-Moore "bad character" rule: if the character of T aligned with the last character of P isn't in P, then none of the m characters of T starting with this one can align with the last character of P in a substring match.
  • For a case-insensitive search, examine positions 2, 5, 8, 9, 12, 13, 14, 15, 18, 19, 20 and conclude that positions 0-13, 15-18, 20-22 cannot be last positions of matching substrings. Note that among those eliminated is the 't' at position 6.

6
Next, find positions in T not yet ruled out as
final positions of substring matches
  • This is done in O(n) time by computing the complement of the union of the segments determined in the previous step.
  • In the example, only positions 14 and 19 remain.
  • Expand the intervals of possible final positions by m-1 positions to the left to obtain intervals containing possible matches: in the example, [12,14] ∪ [17,19].
  • Apply a linear-time algorithm to these remaining segments of T.

7
Experimental results
  • Thanks to Stephen Englert, who wrote the test program
  • Used the Z algorithm as the linear-time matcher
  • Implementation in C, under Unix
  • Time units are C clock units

8
Experimental Results: best case experiment
T is ordinary English text; the pattern character does not occur in T, so all characters of T are "bad". P = the quoted character repeated m times.
T = file "test2.txt", n = 2,350,367

P        With Preprocessing   Without Preprocessing
""^4     8                    167
""^8     5                    167
""^16    3                    166
""^32    2                    168
""^64    1                    167
9
Artificial best case experiment
T = a single character (occurring in neither pattern) repeated 2^k times.

      pattern "12345678" (m = 8)      pattern "1234567890123456" (m = 16)
k     Preprocessed   Not Preproc.     Preprocessed   Not Preproc.
19    1              37               0              37
20    2              76               1              73
21    4              150              2              151
22    8              307              5              303
23    18             621              11             622
10
Worst case experiment: preprocessing doesn't reduce data
T = a single character repeated n times, n = 2^k; P = the same character repeated m times. Here, preprocessing slows the running time (by about 12-16%).

      m = 4                     m = 8                     m = 16
k     Preproc.   Not Preproc.   Preproc.   Not Preproc.   Preproc.   Not Preproc.
19    159        138            158        138            159        138
20    319        278            318        277            318        276
21    648        570            644        567            644        567
22    1,303      1,148          1,299      1,153          1,289      1,147
23    2,631      2,321          2,625      2,327          2,613      2,318
11
Ordinary English text pattern experiment 1
T = file "test2.txt", n = 2,350,367

P                  Preproc.   Not Preproc.
"algorithm"        41         180
"algorithm"^2      4          177
"algorithm"^4      4          178
"algorithm"^8      2          179

Superlinear speedup likely due to matches vs. no matches.
12
Ordinary English text pattern experiment 2
T = file "test2.txt", n = 2,350,367

P                  Preproc.   Not Preproc.
"parallel"         9          169
"parallel"^2       4          170
"parallel"^4       3          170
"parallel"^8       1          170

9 vs. 41 for "algorithm" is likely due to more bad characters, since "parallel" uses fewer distinct letters.