A Fast String Matching Algorithm

Provided by: banyanCm9

1
A Fast String Matching Algorithm
  • The Boyer-Moore Algorithm

2
The obvious search algorithm
  • Considers each character position of str and
    determines whether the successive patlen
    characters of str match pat.
  • In the worst case, the number of comparisons is on
    the order of patlen * stringlen.
  • Ex. pat = aab, str = ...aaaaac...
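The obvious algorithm on this slide can be sketched in Python as follows. This is a minimal illustration, not code from the presentation; the function name `naive_search` and the comparison counter are assumptions added to make the worst-case cost visible.

```python
def naive_search(pat, text):
    """Try every start position; return (0-based match index or -1, comparisons)."""
    comparisons = 0
    patlen, textlen = len(pat), len(text)
    for start in range(textlen - patlen + 1):
        j = 0
        # Compare pat against text[start:start + patlen], left to right.
        while j < patlen:
            comparisons += 1
            if text[start + j] != pat[j]:
                break
            j += 1
        if j == patlen:
            return start, comparisons
    return -1, comparisons


# Every alignment matches "aa" before failing on the third character.
print(naive_search("aab", "aaaaac"))  # (-1, 12)
```

With pat = aab and str = aaaaac, each of the four alignments costs patlen comparisons, which is the behaviour the worst-case bound describes.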

3
Knuth-Morris-Pratt Algorithm
  • Linear search algorithm.
  • Preprocesses pat in time linear in patlen
    and searches str in time linear in the
    length of str.
  • Ex. pat = EXAMPLE,
    str = HERE IS A SIMPLE EXAMPLE

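For comparison with what follows, here is a minimal Python sketch of the Knuth-Morris-Pratt idea: a failure table built in time linear in patlen, then a single left-to-right pass over str. The names `failure_table` and `kmp_search` are illustrative assumptions, not taken from the slides.

```python
def failure_table(pat):
    """fail[i] = length of the longest proper prefix of pat[:i+1] that is also its suffix."""
    fail = [0] * len(pat)
    k = 0
    for i in range(1, len(pat)):
        while k > 0 and pat[i] != pat[k]:
            k = fail[k - 1]
        if pat[i] == pat[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(pat, text):
    """Return the 0-based index of the first match of pat in text, or -1."""
    if not pat:
        return 0
    fail = failure_table(pat)
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pat[k]:
            k = fail[k - 1]  # fall back without re-reading text
        if ch == pat[k]:
            k += 1
        if k == len(pat):
            return i - k + 1
    return -1


print(kmp_search("EXAMPLE", "HERE IS A SIMPLE EXAMPLE"))  # 17
```

Each character of str is read once and the pointer into str never moves backwards, which is where the linear search time comes from.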
4
Characteristics of the Boyer-Moore Algorithm
  • Basic idea: match the pattern against the string from the
    right rather than from the left.
  • Preprocess pat and compute two tables, delta1 and delta2,
    for shifting pat and
    the pointer into str.
  • Ex. pat = AT-THAT,
    str = WHICH-FINALLY-HALTS.--AT-THAT-POINT

5
Informal Description
  • Compare the last char of pat with the
    patlen-th char of str.
  • AT-THAT
  • WHICH-FINALLY-HALTS.--AT-THAT-POINT
  • Observation 1: if char is known not to occur in pat,
    skip patlen chars of str.

6
Informal Description
  • Observation 2: if char does occur in pat, slide pat down
    delta1(char) positions so that char is aligned with the
    corresponding character in pat.
  • delta1(char) = patlen if char does not occur in
    pat, else patlen - j, where j is the maximum
    integer such that pat(j) = char.

AT-THAT
WHICH-FINALLY-HALTS.--AT-THAT-POINT
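The delta1 rule from Observations 1 and 2 can be sketched in Python as below. This is an illustration under stated assumptions: the table is stored sparsely in a dict (rather than one entry per alphabet character), and the helper names `delta1_table` and `delta1` are hypothetical.

```python
def delta1_table(pat):
    """Map each character of pat to patlen - j, where j is its rightmost 1-based position."""
    patlen = len(pat)
    table = {}
    for j, ch in enumerate(pat, start=1):
        table[ch] = patlen - j  # later (more rightward) occurrences overwrite earlier ones
    return table


def delta1(table, patlen, ch):
    """Characters that do not occur in pat get the full shift patlen."""
    return table.get(ch, patlen)


table = delta1_table("AT-THAT")
print(table)                   # {'A': 1, 'T': 0, '-': 4, 'H': 2}
print(delta1(table, 7, "F"))   # 7: F does not occur in pat, so skip patlen chars
```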
7
Informal Description
  • Observation 3a: the last m chars of pat match str, and
    a mismatch occurs at some new char.
    Move strptr by delta1(char) (pat is shifted by
    delta1(char) - m).
  • AT-THAT
  • FINALLY-HALTS.--AT-THAT-POINT

8
Informal Description
  • Observation 3b: the final m chars of pat (a
    subpat) have been matched. Find the rightmost plausible
    reoccurrence of the subpat in pat and align it with the
    matched m chars of str by sliding pat down;
    strptr then moves by delta2(j), where j is the
    position in pat at which the mismatch occurred.
  • AT-THAT
  • FINALLY-HALTS.--AT-THAT-POINT

9
The delta1 delta2 tables
  • The delta1 table has as many entries as there are
    chars in the alphabet.
  • Ex. pat:    a b c d e        a t - t h a t
        delta1: 4 3 2 1 0        1 0 4 0 2 1 0
        else:   5                7
  • The delta2 table has as many entries as there are
    chars in pat.
  • Ex. pat:    a b c d e        a  t  -  t  h  a  t
        delta2: 9 8 7 6 1        11 10 9  8  7  8  1

10
  • Ex.: we compute for j = 5
  • j:   1 2 3 4 5 6 7
  • pat: e d b c a b c
  • pat (shifted copy): e d b c a b c
  • positions: -2 -1 0 1 2 3 4 5 6 7
  • Then
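A brute-force Python sketch of how such a delta2 entry might be computed is given below. It assumes the definition from Boyer and Moore's original formulation, which the slides do not spell out: rpr(j) is the greatest k such that the matched suffix pat(j+1..patlen) unifies with pat(k..) (positions left of the pattern match anything) and either k <= 1 or pat(k-1) differs from pat(j); then delta2(j) = patlen + 1 - rpr(j). The names `rpr` and `delta2_table` are illustrative.

```python
def rpr(pat, j):
    """Rightmost plausible reoccurrence of pat(j+1..patlen); j is 1-based."""
    patlen = len(pat)
    m = patlen - j  # length of the matched terminal substring

    def at(pos):
        # 1-based access; positions left of the pattern unify with anything.
        return pat[pos - 1] if pos >= 1 else None

    for k in range(min(patlen, j + 1), -m, -1):
        unifies = all(
            at(k + t) is None or at(k + t) == at(j + 1 + t) for t in range(m)
        )
        plausible = k <= 1 or at(k - 1) != at(j)
        if unifies and plausible:
            return k
    raise AssertionError("unreachable: k = 1 - m always qualifies")


def delta2_table(pat):
    """delta2 entries for j = 1..patlen, returned as a 0-based list."""
    patlen = len(pat)
    return [patlen + 1 - rpr(pat, j) for j in range(1, patlen + 1)]


print(rpr("edbcabc", 5))          # 3
print(delta2_table("edbcabc")[4]) # 5, the entry for j = 5
print(delta2_table("abcde"))      # [9, 8, 7, 6, 1]
```

For `abcde` this reproduces the delta2 row shown on the previous slide. Under this definition the entry for `at-that` at j = 6 comes out as 4 rather than the 8 listed there, so that table may rest on a different reading of "plausible"; treat the sketch as illustrative.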

11
The algorithm
  • stringlen <- length of string.
  • i <- patlen.
  • top: if i > stringlen then return false.
  • j <- patlen.
  • loop: if j = 0 then return i + 1.
  • if string(i) = pat(j)
  • then
  • j <- j - 1
  • i <- i - 1
  • goto loop.
  • close
  • i <- i + max(delta1(string(i)), delta2(j))
  • goto top.
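The control flow on this slide can be sketched in Python as follows. Two points are assumptions rather than part of the slide: indices are 0-based (so the function returns a 0-based match position instead of i + 1), and when no delta2 table is supplied a conservative stand-in is used that only guarantees pat slides forward by one position. A real delta2 table can be passed in as a list indexed by 0-based j.

```python
def boyer_moore_search(pat, text, delta2=None):
    """Right-to-left matching with delta1/delta2 shifts; returns 0-based index or -1."""
    patlen, textlen = len(pat), len(text)
    if patlen == 0:
        return 0
    # delta1: distance of the rightmost occurrence of each char from the end of pat.
    delta1 = {ch: patlen - 1 - pos for pos, ch in enumerate(pat)}
    if delta2 is None:
        # Stand-in (assumption): move strptr just far enough to shift pat by one.
        delta2 = [patlen - pos for pos in range(patlen)]

    i = patlen - 1                      # strptr, aligned with the last char of pat
    while i < textlen:                  # "top"
        j = patlen - 1
        while j >= 0 and text[i] == pat[j]:   # "loop"
            i -= 1
            j -= 1
        if j < 0:
            return i + 1                # whole pattern matched
        i += max(delta1.get(text[i], patlen), delta2[j])
    return -1


print(boyer_moore_search("AT-THAT", "WHICH-FINALLY-HALTS.--AT-THAT-POINT"))  # 22
```

On the running example the first two probes hit F and -, giving shifts of 7 and 4, as in the informal description.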

12
Implementation Consideration
13
Loops: fast, undo, slow
  • Fast: scans down string, effectively looking for
    the last character in pat,
    skipping according to delta1.
  • About 80% of the time is spent in it.
  • Undo: decides whether this situation arose because
    all of string has been scanned or because
    the last character of pat was hit.
  • Slow: backs up checking for matches.
  • It is easy to implement on a byte-addressable
    machine.
  • char <- string(i), etc.
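One way the three loops might be arranged is sketched below in Python. Several details are assumptions not stated on the slide: a modified table `delta0` equals delta1 except that the last character of pat maps to a sentinel `large`, so the fast loop exits either by running off the string or by jumping past `large`; and a conservative one-position shift stands in for delta2 in the slow loop.

```python
def fast_loop_search(pat, text):
    """Fast/undo/slow arrangement; returns 0-based index of first match or -1."""
    patlen, textlen = len(pat), len(text)
    if patlen == 0:
        return 0
    large = textlen + patlen + 1        # sentinel, larger than any reachable index
    delta1 = {ch: patlen - 1 - pos for pos, ch in enumerate(pat)}
    delta0 = dict(delta1)
    delta0[pat[-1]] = large             # hitting the last char of pat jumps past `large`

    i = patlen - 1
    while True:
        # fast: skip along text until the last char of pat is seen or text ends
        while i < textlen:
            i += delta0.get(text[i], patlen)
        # undo: which of the two exits was it?
        if i < large:
            return -1                   # ran off the end of text
        i -= large                      # restore the position where pat[-1] was hit
        # slow: back up, checking for a full match
        k, j = i, patlen - 1
        while j >= 0 and text[k] == pat[j]:
            k -= 1
            j -= 1
        if j < 0:
            return k + 1
        # Stand-in for delta2 (assumption): patlen - j shifts pat by at least one.
        i = k + max(delta1.get(text[k], patlen), patlen - j)


print(fast_loop_search("AT-THAT", "WHICH-FINALLY-HALTS.--AT-THAT-POINT"))  # 22
```

The fast loop does one table lookup and one addition per probe, which is why most of the running time can be concentrated there.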

14
Measured the cost of each search
  • Three strings: binary alphabet, English, random
    alphabet.
  • Fig. 1: the number of references made to string.
  • Fig. 2: the total number of machine instructions
    that actually got executed.

15
Performance (empirical evidence)
16
Boyer-Moore vs. the Knuth, Morris, and Pratt
algorithm
  • For English text:
  • Boyer-Moore
  • Every reference to string passes about 4
    characters for a pattern of length 5.
  • For sufficiently large alphabets and sufficiently
    long patterns, it executes fewer than 1 instruction
    per character passed.
  • K.M.P.
  • References string about 1.1 times per
    character.
  • The cost per character can be expected to be at least 3.3
    instructions.

17
Conclusion
  • Requires fewer CPU cycles.
  • Most efficient on a byte-addressable machine.
  • Unadvisable for finding the first of several possible
    substrings or for identifying a location in string
    defined by a regular expression.
  • The Aho and Corasick algorithm is more suitable for those cases.

18
Conclusion
  • Can be improved by fetching larger bytes (several
    characters at a time) in the fast loop
    and using a hash array to encode the extended
    delta1 table.
  • This exponentially increases the effective size of the
    alphabet and reduces the frequency of common
    characters.