String Searching and Matching

Slides: 29
Provided by: joh133
1
String Searching and Matching
2
Naive String Matching
  • Problem: check whether a given string occurs
    in a given text
  • Common problem in text-processing programs and
    many other applications
  • We consider first the problem of matching strings
    without wildcards
  • Naive algorithm:
  • match the string against the text character by character
  • when there is a mismatch, shift the whole string
    down by one character against the text, and start
    again at the beginning of the string.
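
The naive method can be sketched as follows. This is a minimal illustration, not code from the slides: the class and method names are made up here, and returning the text length when nothing is found is borrowed from the convention the later listings use.

```java
class NaiveSearch {
    // Returns the index of the first occurrence of pattern in text,
    // or text.length() if there is none.
    static int find(String pattern, String text) {
        int m = pattern.length(), n = text.length();
        for (int start = 0; start + m <= n; start++) {
            int j = 0;
            // match character by character from the beginning of the pattern
            while (j < m && text.charAt(start + j) == pattern.charAt(j)) {
                j++;
            }
            if (j == m) {
                return start;   // every pattern character matched
            }
            // mismatch: the for loop shifts the pattern down by one character
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(find("LEAN", "CARPETS NEED CLEANING")); // prints 14
    }
}
```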

3
Worst Case of Naive Method
  • Consider the following text
  • aaaaaaaaa........aaab (total text length N)
  • Consider the following search string
  • aaa....b (M-1 times a followed by one b)
  • Using the naive method, we match up to the Mth
    character (b) and after each mismatch, restart
    at the 1st character a.
  • The mismatch occurs N-M times.
  • The match succeeds at position N-M+1.
  • Total number of comparisons: M(N-M+1).
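
The count M(N-M+1) can be checked by instrumenting the naive method. The sketch below is only an illustration; the helper names `countComparisons` and `worstCase` are assumptions introduced here.

```java
class NaiveWorstCase {
    // Builds a string of (length - 1) a's followed by one b.
    static String worstCase(int length) {
        StringBuilder sb = new StringBuilder();
        for (int k = 0; k < length - 1; k++) {
            sb.append('a');
        }
        return sb.append('b').toString();
    }

    // Counts the character comparisons the naive method makes
    // up to and including the first successful match.
    static long countComparisons(String pattern, String text) {
        int m = pattern.length(), n = text.length();
        long comparisons = 0;
        for (int start = 0; start + m <= n; start++) {
            int j = 0;
            while (j < m) {
                comparisons++;
                if (text.charAt(start + j) != pattern.charAt(j)) {
                    break;              // mismatch at this alignment
                }
                j++;
            }
            if (j == m) {
                return comparisons;     // match found
            }
        }
        return comparisons;
    }

    public static void main(String[] args) {
        // N = 20, M = 5: M(N-M+1) = 5 * 16
        System.out.println(countComparisons(worstCase(5), worstCase(20))); // prints 80
    }
}
```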

4
Optimisations
  • We consider three improved string matching
    algorithms.
  • Knuth-Morris-Pratt algorithm
  • solves the above problem with a number of
    comparisons proportional to N
  • Boyer-Moore algorithm
  • needs about N comparisons on the above problem,
    but on typical text can often get by with only
    about N/M comparisons
  • Rabin-Karp algorithm
  • about M+N comparisons

5
Knuth-Morris-Pratt algorithm
  • When a mismatch is detected, say at position j in
    the pattern string, we have already successfully
    matched j-1 characters.
  • We try to take advantage of this to decide where
    to restart matching
  • Example: suppose the pattern 10100 mismatches at
    the 5th character.
  • Then we know that the text matched so far
    consists of 1010X (where X is unknown)
  • We can then restart by matching the 3rd character
    (1) of the pattern against the X.
  • (since the second 10 already matched in the
    text could rematch with the first 10 in the
    pattern)

6
Where to re-start matching
  • Consider two pointers i and j.
  • i gives the current position of the text
    (starting at 0)
  • j gives the current position of the pattern
    (starting at 0).
  • The key feature of the KMP algorithm is that i
    never decreases. We never have to backtrack on
    the text.
  • When a mismatch occurs at pattern position j, we
    look at the j characters already matched.
  • If some non-empty prefix of length k (k < j) of
    those characters also occurs at the end of them,
    take the longest such prefix and restart the
    matching on position i in the text and k in the
    pattern.
  • If there is no such prefix, then restart the
    matching on position i in the text and 0 in the
    pattern.

7
KMP preprocessing
  • Given a string pattern, we can precompute, for
    each string position j, where to restart matching
    when a mismatch occurs at j.
  • Slide a copy of the first j characters of the
    pattern over itself.
  • Move from left to right and stop when all
    overlapping characters match (or the pattern
    slides past position j)
  • The number of overlapping characters tells us
    where to restart comparing the pattern.
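
The sliding procedure above can be written down almost literally. The sketch below is a brute-force illustration, not the efficient construction; the class name and the use of `String.regionMatches` are choices made here.

```java
import java.util.Arrays;

class KmpNextBySliding {
    // next[j] = number of overlapping characters when a copy of the first j
    // pattern characters is slid right over itself until everything that
    // still overlaps agrees. next[0] = -1 by convention.
    static int[] next(String p) {
        int m = p.length();
        int[] next = new int[m];
        if (m == 0) {
            return next;
        }
        next[0] = -1;
        for (int j = 1; j < m; j++) {
            for (int shift = 1; shift <= j; shift++) {
                int overlap = j - shift;
                // compare p[0..overlap) with p[shift..j)
                if (p.regionMatches(0, p, shift, overlap)) {
                    next[j] = overlap;
                    break;              // first (smallest) shift that works
                }
            }
        }
        return next;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(next("10100111")));
        // prints [-1, 0, 0, 1, 2, 0, 1, 1]
    }
}
```

A shift equal to j leaves no overlap, so the inner loop always terminates with a value.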

8
KMP preprocessing - example
  • Let the pattern be 10100111

    J   next[J]   overlapping prefix
    1   0         (none)
    2   0         (none)
    3   1         1
    4   2         10
    5   0         (none)
    6   1         1
    7   1         1

next[J] is the position in the pattern at which to restart when a mismatch occurs at position J (note J index starts at 0). Also, define next[0] = -1.
The last column shows the overlapping prefix of the already matched characters at the point of the mismatch (underlined on the original slide).
9
Example using the nextj table
  • Consider the example on the previous slide.
  • Suppose a mismatch occurs on j = 4.
  • Consulting the table we see that next[4] = 2.
  • Hence re-start the match with j = 2.
  • Thus the character in the text marked X is next
    matched with the 3rd character of the pattern.
    Note that the first two characters 10 are
    already matched up.
  • Note that if next[j] = 0, it means that there is no
    overlapping pattern. In this case we will
    re-start with j = 0.

    Text     .....1010X
    Pattern       10100111     (mismatch at j = 4)

    Text     .....1010X
    Pattern         10100111   (restart with j = 2)
10
Constructing the nextJ array
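int[] next;                          // shared with KMP()

void initNext(String P) {
    int i = 0, j = -1, M = P.length();
    next = new int[M + 1];           // next[M] is also written below
    next[0] = -1;
    while (i < M) {                  // match pattern with itself
        while (j >= 0 && P.charAt(i) != P.charAt(j)) j = next[j];
        i++; j++;
        next[i] = j;
    }
}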
11
KMP algorithm
int KMP(String P, String T) {
    int i = 0, j = 0, M = P.length(), N = T.length();
    initNext(P);
    while (j < M && i < N) {
        while (j >= 0 && P.charAt(j) != T.charAt(i))
            j = next[j];             // after mismatch
        i++; j++;
    }
    if (j == M) return i - M;        // match starts at i - M
    else return i;                   // i == N: no match
}
12
KMP - analysis
  • The KMP algorithm never needs to backtrack on the
    text string.
  • This is an advantage for matching on some
    continuous stream input from an external
    device.
  • The number of comparisons is linear in M+N (length
    of pattern + length of text); the search loop
    itself makes at most 2N.
  • The nextj array could be hard-wired into the
    KMP algorithm (see next slide).
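
Because the text position never moves backwards, the match can run directly on a stream without buffering it. The following is a sketch under assumptions made here: the class name, the use of `java.io.Reader`, and returning -1 when the stream ends (its length is not known in advance) are not from the slides.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class KmpStream {
    // next[j]: pattern position to restart at after a mismatch at j; next[0] = -1.
    static int[] buildNext(String p) {
        int m = p.length();
        int[] next = new int[m + 1];
        next[0] = -1;
        int i = 0, j = -1;
        while (i < m) {
            while (j >= 0 && p.charAt(i) != p.charAt(j)) j = next[j];
            i++; j++;
            next[i] = j;
        }
        return next;
    }

    // Reads each character exactly once. Returns the offset of the first
    // match, or -1 if the stream ends before a match is found.
    static long find(String p, Reader in) {
        int m = p.length();
        if (m == 0) return 0;
        int[] next = buildNext(p);
        long pos = 0;      // characters consumed so far
        int j = 0;         // current position in the pattern
        try {
            int c;
            while ((c = in.read()) != -1) {
                while (j >= 0 && p.charAt(j) != (char) c) j = next[j];
                j++; pos++;
                if (j == m) return pos - m;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(find("10100111", new StringReader("0010100111"))); // prints 2
    }
}
```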

13
State Machine KMP match
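// Hard-wired for the pattern 10100111 (pseudocode: uses C-style goto)
int KMP (String T) {
    int i = -1;
s0: i++;
s1: if (T.charAt(i) != '1') goto s0; i++;
s2: if (T.charAt(i) != '0') goto s1; i++;
s3: if (T.charAt(i) != '1') goto s1; i++;
s4: if (T.charAt(i) != '0') goto s2; i++;
s5: if (T.charAt(i) != '0') goto s3; i++;
s6: if (T.charAt(i) != '1') goto s1; i++;
s7: if (T.charAt(i) != '1') goto s2; i++;
s8: if (T.charAt(i) != '1') goto s2; i++;
    return i-8;
}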
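
Java has no goto, so the listing above cannot be compiled as written. One possible runnable equivalent uses a `switch` over an explicit state variable. This is a sketch: the state numbering (pattern characters matched so far, 0 to 7) and the end-of-text check are additions made here, and the method returns the text length when nothing is found.

```java
class KmpStateMachine {
    // Finds the first occurrence of the hard-wired pattern 10100111.
    static int search(String t) {
        int i = 0, state = 0;              // state = pattern characters matched so far
        while (i < t.length()) {
            char expected;
            int onMismatch;                // state to fall back to; -1 = advance the text
            switch (state) {
                case 0: expected = '1'; onMismatch = -1; break;
                case 1: expected = '0'; onMismatch = 0; break;
                case 2: expected = '1'; onMismatch = 0; break;
                case 3: expected = '0'; onMismatch = 1; break;
                case 4: expected = '0'; onMismatch = 2; break;
                case 5: expected = '1'; onMismatch = 0; break;
                case 6: expected = '1'; onMismatch = 1; break;
                case 7: expected = '1'; onMismatch = 1; break;
                default: throw new IllegalStateException("unreachable");
            }
            if (t.charAt(i) == expected) {
                i++;
                if (++state == 8) return i - 8;   // all 8 characters matched
            } else if (onMismatch < 0) {
                i++;                              // nothing matched: move on in the text
            } else {
                state = onMismatch;               // retry the same text character
            }
        }
        return t.length();                        // no match
    }

    public static void main(String[] args) {
        System.out.println(search("1010100111")); // prints 2
    }
}
```

The fall-back states are the next values for this pattern, so the text index is never decreased.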
14
The Boyer-Moore String Algorithm
  • This method can give substantially faster
    searches where the language contains a large
    number of symbols
  • E.g. Normal text (128 or 256 character alphabet)
    rather than binary strings
  • BM method incorporates two main ideas
  • start matching at the right of the pattern so as
    to find the rightmost mismatch
  • use information about the possible alphabet of
    the text, as well as the characters in the pattern

15
Example
Search for LEAN in CARPETS NEED CLEANING REGULARLY

    CARPETS NEED CLEANING REGULARLY
    LEAN

N and P mismatch. Furthermore, P does not occur
anywhere in the string LEAN. Hence move string
all the way past P and compare with N again.

    CARPETS NEED CLEANING REGULARLY
        LEAN
            LEAN
                LEAN
                  LEAN

N and E mismatch, but E occurs in LEAN, so
we move the E of LEAN to this position.
16
Boyer-Moore preprocessing
  • In order to implement the above idea, consider
    the characters in the alphabet which makes up the
    text.
  • C0, C1, ..., Ck (k+1 characters in the alphabet)
  • Initialise an array skip such that
  • for each Cj in the pattern string, set skip[j] to
    the distance of the rightmost occurrence of Cj
    from the right hand end of the pattern
  • skip[j] = M otherwise, where M is the length of
    the pattern.

17
The skip array - example
  • Suppose pattern is LEAN and alphabet is <blank>,
    A, B, ..., Z (C0, C1, ..., C26).
  • skip[12] = 3 (L)
  • skip[5] = 2 (E)
  • skip[1] = 1 (A)
  • skip[14] = 0 (N)
  • skip[X] = 4 (otherwise)
  • skip[C] is the number of characters to move the
    pattern to the right after a mismatch in the text
    with the character with index C
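
A skip array like this one could be built as follows. The sketch assumes the 27-character alphabet of the example (blank, A to Z); the names `index` and `initSkip` are chosen here for illustration.

```java
import java.util.Arrays;

class SkipTable {
    // Alphabet index as in the example: blank = 0, A = 1, ..., Z = 26.
    // Assumes the input contains only blanks and upper-case letters.
    static int index(char c) {
        return c == ' ' ? 0 : c - 'A' + 1;
    }

    static int[] initSkip(String p) {
        int m = p.length();
        int[] skip = new int[27];
        Arrays.fill(skip, m);                      // characters not in the pattern
        for (int j = 0; j < m; j++) {
            // later assignments overwrite earlier ones, so the rightmost
            // occurrence of each character determines its distance
            skip[index(p.charAt(j))] = m - 1 - j;
        }
        return skip;
    }

    public static void main(String[] args) {
        int[] skip = initSkip("LEAN");
        System.out.println(skip[index('L')] + " " + skip[index('E')] + " "
                + skip[index('A')] + " " + skip[index('N')] + " " + skip[index('P')]);
        // prints 3 2 1 0 4
    }
}
```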

18
Using the skip array
  • Try to match the pattern from right to left
  • Mismatch occurs between Cn with index n and the
    (M-j)th position of the pattern, i.e. after j
    characters have already matched.
  • Get value of skip[n]
  • If j > skip[n] then shift pattern by 1 (since
    we have already passed the rightmost occurrence
    of Cn in the pattern).
  • Else shift pattern skip[n]-j positions, to try to
    align Cn in the text with the rightmost
    occurrence of Cn in the pattern.

19
Example - shifting using skip
  • Pattern: X X A X X X Z Z Z Z
  • M = 10 (length of pattern)
  • skip[1] = 7 (distance of rightmost A from right)
  • mismatch at position 10-4

    Y Y Y Y Y A Z Z Z Z Z Z Z Z Z     text
    X X A X X X Z Z Z Z               mismatched pattern
          X X A X X X Z Z Z Z         shift 3 positions

  • Shift pattern by 7-4 = 3 positions
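
The shift rule can be tried out in isolation. In the sketch below, `matched` is the number of characters already matched at the right-hand end of the pattern when the mismatch occurs; the method names and the list of tried alignments are illustrative assumptions, not part of the slides.

```java
import java.util.ArrayList;
import java.util.List;

class SkipShiftDemo {
    // Distance of the rightmost occurrence of c from the right end of p,
    // or p.length() if c does not occur in p.
    static int skip(String p, char c) {
        int last = p.lastIndexOf(c);
        return last < 0 ? p.length() : p.length() - 1 - last;
    }

    // Shift after a mismatch on text character c.
    static int shift(String p, char c, int matched) {
        int s = skip(p, c);
        return matched > s ? 1 : s - matched;
    }

    // Returns the alignments tried, in order; if the pattern occurs,
    // the last entry is the position of the match.
    static List<Integer> alignments(String p, String t) {
        List<Integer> tried = new ArrayList<>();
        int m = p.length(), start = 0;
        while (start + m <= t.length()) {
            tried.add(start);
            int matched = 0;   // compare right to left
            while (matched < m
                    && p.charAt(m - 1 - matched) == t.charAt(start + m - 1 - matched)) {
                matched++;
            }
            if (matched == m) break;
            start += shift(p, t.charAt(start + m - 1 - matched), matched);
        }
        return tried;
    }

    public static void main(String[] args) {
        System.out.println(shift("XXAXXXZZZZ", 'A', 4));                        // prints 3
        System.out.println(alignments("LEAN", "CARPETS NEED CLEANING REGULARLY"));
        // prints [0, 4, 8, 12, 14]
    }
}
```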

20
Boyer-Moore Algorithm (1)
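int boyermoore1(String P, String T) {
    int i, j, t, M = P.length(), N = T.length();
    if (M > N) return N;               // pattern longer than text
    initskip(P);                       // initialise skip array
    i = M-1; j = M-1;
    while (j >= 0) {                   // j >= 0 so that P[0] is compared too
        while (T.charAt(i) != P.charAt(j)) {
            t = skip[index(T.charAt(i))];
            if ((M-j) > t) i = i+M-j; else i = i+t;
            if (i >= N) return N;      // no match
            j = M-1;
        }
        i--; j--;
    }
    return i+1;                        // successful match
}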
21
Refinement to B-M Algorithm
  • We can apply the idea of the KMP algorithm
    right-to-left
  • Sometimes this gives a larger skip value than the
    skip index used above
  • E.g. Pattern: BBAAA
  • skip[1] = 0 (skip value for A)
  • skip[2] = 3 (skip value for B)
  •   AAAAAAA
  •   BBAAA     mismatch on A in text
  • boyermoore1 algorithm shifts only one position
  • However it's clear that AAA does not occur
    anywhere to the left of positions 3,4,5 in the
    pattern

22
Boyer-Moore Refinement (2)
  • Build KMP next array from right to left
  • j = position of mismatch (from right, starting at
    0)
  • next[j] = no. of positions to shift pattern to
    right
  • Pattern: BBAAA

        j   next[j]
        1   1
        2   1
        3   5
        4   5

  • Using the next array, a mismatch on B results in
    a shift of 5 positions
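
One way to reproduce this table is a brute-force search for the smallest shift that keeps the already matched suffix consistent. The sketch below is an assumption about how the table is defined (it agrees with the four values shown for BBAAA); the names are invented here and no attempt is made at efficiency.

```java
import java.util.Arrays;

class RightToLeftNext {
    // shift[j] = number of positions to move the pattern right after a
    // mismatch at position j counted from the right (the last j characters
    // matched). Entry 0 is left unused.
    static int[] build(String p) {
        int m = p.length();
        int[] shift = new int[m];
        for (int j = 1; j < m; j++) {
            for (int s = 1; s <= m; s++) {
                if (suffixStillMatches(p, j, s)) {
                    shift[j] = s;       // smallest consistent shift
                    break;
                }
            }
        }
        return shift;
    }

    // True if, after moving the pattern s places right, every pattern
    // character that lands under the matched suffix agrees with it.
    static boolean suffixStillMatches(String p, int j, int s) {
        int m = p.length();
        for (int k = m - j; k < m; k++) {
            if (k - s >= 0 && p.charAt(k - s) != p.charAt(k)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(build("BBAAA"))); // prints [0, 1, 1, 5, 5]
    }
}
```

A shift of M always passes the check, so every entry from 1 to M-1 receives a value.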

23
Refined Boyer-Moore Algorithm
  • Initialise both the skip and the next arrays
    (right-to-left).
  • Whenever a mismatch occurs, get the skip value
    for the mismatched character and the next value
    for the position of the mismatch.
  • Shift the pattern right by whichever gives the
    greater value.

24
Rabin-Karp String Matching
  • Consider a text and pattern consisting of
    characters represented by b bits each
  • e.g. 7-bit ASCII characters
  • We can regard a sequence of characters as a
    (large) binary number (as with keys when using
    hash tables)
  • Idea: compute a hash value for an M-character
    pattern and compare it successively with the hash
    values of each successive sequence of M
    characters in the text.

25
Rabin-Karp matching - basic idea
  • Example. Consider the string
  • CARPETS NEED CLEANING
  • and the search string LEAN.
  • Then we compare h(LEAN) first against h(CARP),
    then against h(ARPE), h(RPET), h(PETS), and so
    on.
  • Clearly h(LEAN) need be computed only once.
  • The key to efficient comparison is to compute the
    successive hash values efficiently.
  • We can exploit the fact that successive keys
    overlap, e.g. ARPE and RPET share 3 characters.

26
Rabin-Karp - computing hash values
  • Let us use $h(K) = K \bmod P$ as our hash function as
    before, where P is a large prime number
  • Let d = max number of characters (e.g. $d = 2^b$)
  • Suppose $K = C_1,\dots,C_n$, where $C_1,\dots,C_n$ is a
    sequence of characters in the text, and $h(K) = X$
  • It can be shown that
  • $h(C_2,\dots,C_{n+1}) = h((X - C_1 d^{n-1})\,d + C_{n+1})$, since
  • $C_2,\dots,C_{n+1}$ can be rewritten as
    $(C_1,\dots,C_n - C_1 d^{n-1})\,d + C_{n+1}$
  • E.g. (d = 10): $45678 = (34567 - 3 \cdot 10^4) \cdot 10 + 8$
  • Then use some properties such as
    $h(X + Y) = h(h(X) + Y)$ and $h(XY) = h(h(X)\,Y)$
  • Hence $h(45678) = h((h(34567) - 3 \cdot 10^4) \cdot 10 + 8)$
  • Thus, successive values for h are efficiently
    computed, since we can reuse the previous hash
    value to compute the next one.
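
The d = 10 example can be checked numerically. The sketch below is illustrative only: the small prime 97 and the names `hash` and `roll` are assumptions made here, and the characters are taken to be decimal digits.

```java
class RollingHash {
    // Hash of a string of decimal digits read as a base-d number, mod q.
    static long hash(String s, int d, long q) {
        long h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = (h * d + (s.charAt(i) - '0')) % q;
        }
        return h;
    }

    // Given x = h(C1..Cn) and dPow = d^(n-1) mod q, returns h(C2..Cn+1)
    // without rescanning the window.
    static long roll(long x, int c1, int cNext, long dPow, int d, long q) {
        long withoutFirst = ((x - (c1 * dPow) % q) % q + q) % q; // drop C1 * d^(n-1)
        return (withoutFirst * d + cNext) % q;                   // shift and append Cn+1
    }

    public static void main(String[] args) {
        int d = 10;
        long q = 97;                       // small prime, for illustration only
        long dPow = 1;
        for (int i = 0; i < 4; i++) {
            dPow = (dPow * d) % q;         // 10^4 mod 97
        }
        long x = hash("34567", d, q);
        System.out.println(roll(x, 3, 8, dPow, d, q)); // prints 88
        System.out.println(hash("45678", d, q));       // prints 88
    }
}
```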

27
Rabin-Karp Algorithm
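int rabinkarp(String P, String T) {
    int q = 33554393;                  // a large prime
    int d = 32;                        // size of alphabet
    int i, dM = 1, h1 = 0, h2 = 0;
    int M = P.length(), N = T.length();
    if (M > N) return N;               // pattern longer than text
    for (i = 1; i < M; i++) dM = (d*dM) % q;       // dM = d^(M-1) mod q
    for (i = 0; i < M; i++) {
        h1 = (h1*d + val(P.charAt(i))) % q;        // hash P
        h2 = (h2*d + val(T.charAt(i))) % q;        // hash first M characters of T
    }
    for (i = 0; h1 != h2; i++) {
        if (i >= N-M) return N;                    // not found
        h2 = (h2 + d*q - val(T.charAt(i))*dM) % q; // remove T[i]
        h2 = (h2*d + val(T.charAt(i+M))) % q;      // append T[i+M]
    }
    return i;
}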
28
Rabin-Karp - analysis
  • In the above algorithm, val(P[i]) is the number
    corresponding to the character P[i].
  • h1 is the hash value of the pattern
  • h2 takes the hash value of successive sequences
    of M characters in the text.
  • Strictly, if h1 = h2, we might not have a match,
    since a hash collision could occur. We still
    need to make a final comparison on the strings
    themselves.
  • We can use a very large prime, since we do not
    actually have to store the hash table; this
    makes collisions extremely unlikely.
  • Average number of comparisons: N+M
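
The final comparison mentioned above could be added along these lines. This is a sketch, not the slides' listing: the `val` mapping (blank = 0, A to Z = 1 to 26) and the class name are assumptions, and the prime and alphabet size are the ones used in the algorithm above.

```java
class RabinKarpVerified {
    static final int Q = 33554393;   // large prime
    static final int D = 32;         // alphabet size

    // Assumed character mapping; values must stay below D.
    static int val(char c) {
        return c == ' ' ? 0 : c - 'A' + 1;
    }

    // Returns the index of the first match, or t.length() if there is none.
    static int search(String p, String t) {
        int m = p.length(), n = t.length();
        if (m > n) return n;
        int dM = 1, h1 = 0, h2 = 0;
        for (int i = 1; i < m; i++) dM = (D * dM) % Q;       // D^(m-1) mod Q
        for (int i = 0; i < m; i++) {
            h1 = (h1 * D + val(p.charAt(i))) % Q;            // hash of the pattern
            h2 = (h2 * D + val(t.charAt(i))) % Q;            // hash of the first window
        }
        for (int i = 0; ; i++) {
            // equal hashes only make a candidate: confirm on the strings themselves
            if (h1 == h2 && t.regionMatches(i, p, 0, m)) return i;
            if (i >= n - m) return n;                        // last window checked
            h2 = (h2 + D * Q - val(t.charAt(i)) * dM) % Q;   // remove t[i]
            h2 = (h2 * D + val(t.charAt(i + m))) % Q;        // append t[i+m]
        }
    }

    public static void main(String[] args) {
        System.out.println(search("LEAN", "CARPETS NEED CLEANING")); // prints 14
    }
}
```

With these constants every intermediate value stays below 2^31, so plain `int` arithmetic does not overflow.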