Chapter 11: String Matching

Provided by: foxr

1
Chapter 11: String Matching
  • Problem: given a string (some text, possibly a
    large file) and a substring, find the first
    (or next) occurrence of the substring in the string
  • The straightforward solution has a worst-case
    complexity of Θ(mn), where m is the size of the
    substring and n is the size of the string
  • For text files, this is too large a price to pay
  • Can we improve? Yes
  • We will explore the Knuth-Morris-Pratt (KMP) and
    Boyer-Moore (BM) algorithms
  • We will also take a brief look at approximate
    string matching, which is used for spell checking
    and other applications

2
Straightforward Algorithm
  • Let s be our string and sub be our substring
  • Start searching s and sub at position 0
  • If we have a match, increment both string
    pointers
  • If we have a mismatch, start over in s one
    position after where the previous attempt began,
    and start sub at 0

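    // n = size of s, m = size of sub
    current1 = 0; current2 = 0; current = 0; found = false;
    while (!found && current + m <= n) {
        // advance both pointers while the characters agree
        while (current2 < m &&
               s.charAt(current1) == sub.charAt(current2)) {
            current1++; current2++;
        }
        if (current2 == m) found = true;   // matched all of sub
        else {
            current++;                     // realign one position later
            current1 = current; current2 = 0;
        }
    }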

[Figure: sub aligned under s at position current. Search until a mismatch, then realign sub to start at current + 1]
3
Example
  • Consider s = AAAAAAAAAAAAAAB and sub = AAAB
  • Start with current = 0, current1 = 0, current2 = 0
  • We immediately have a match, so increment
    current1 and current2, and again we have a match
  • Increment current1 and current2, and again we
    have a match
  • Increment current1 and current2 again, but now
    we have a mismatch
  • So, set current to 1, reset current1 to 1 and
    current2 to 0, and start again; we will have 3
    more matches before the next mismatch
  • And so on
  • When do we finally get a full match? Only on the
    12th iteration of our outer while loop
  • Number of comparisons = 48, size of s = 15, size
    of sub = 4; in the worst case the number of
    comparisons is (n - m + 1) * m
  • So, this search is Θ(mn), much too inefficient
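The comparison count in this example can be checked with a small sketch. This is only an illustration in Java, assuming 0-based positions; the class name NaiveSearchDemo and the comparisons counter are hypothetical names, not taken from the slides or the book.

```java
class NaiveSearchDemo {
    // Number of character comparisons made by the most recent call to indexOf
    static long comparisons;

    // Straightforward search: returns the 0-based index of the first
    // occurrence of sub in s, or -1 if there is none.
    static int indexOf(String s, String sub) {
        comparisons = 0;
        int n = s.length();
        int m = sub.length();
        for (int current = 0; current + m <= n; current++) {
            int k = 0;
            while (k < m) {
                comparisons++;                        // one character comparison
                if (s.charAt(current + k) != sub.charAt(k)) break;
                k++;
            }
            if (k == m) return current;               // matched all of sub
        }
        return -1;
    }
}
```

For s made of 14 As followed by a B and sub = AAAB, indexOf returns 11 (the 12th alignment) and comparisons ends at 48, which is (n - m + 1) * m for n = 15 and m = 4.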

4
Knuth-Morris-Pratt Algorithm
  • Observation: in our previous example, if we
    matched our substring AAAB up to the B, what does
    that tell us?
  • It tells us that the string has a sequence of
    AAAx (where x is some character other than B)
  • Once we found a mismatch, did we have to start
    our substring matching over at 0? No, we could
    actually have realigned our matching to start at
    the last A in the substring
  • we know that the string had AAAx
  • if we start comparing our substring again at the
    string's next position, we are comparing against
    AAx, but we already know that AA matched, so the
    first two As do not need to be compared again
  • so, let's realign our substring at a clever
    position and start matching from there; if we do
    this successfully, it will reduce the complexity
    of our string searching

5
Creating an Align Array
  • In order to figure out where to start searching
    in our substring, we need to figure out how to
    realign it to the string after a failure
  • Note: the book presents this idea as a finite
    automaton, but this makes the idea somewhat more
    complicated than we need to think about
  • In truth, we can just create an align array
  • If we have a mismatch at character k in our
    substring, at what character do we start over
    again in the substring?
  • The align array will tell us this
  • Note: the book calls this array fail, so we will
    call it that from now on
  • If we have a mismatch at k, we realign sub to
    position fail[k]

6
Character Repetition
  • The fail array is generated by only considering
    the substring itself, looking for repetition
  • Once repetition is found, we build on that
    repetition
  • Reconsider our substring AAAB
  • If we have a mismatch after the second A, it
    means that the preceding character in the string
    was A (for instance, we matched against AABB),
    and we do not need to start our substring over
    from the start, we already have 1 matching A
  • If we have a mismatch after the third A, it means
    that two preceding characters have already
    matched (for instance, if the string is AAAC) so
    we already have 2 matching As
  • So, we look for repetition in the substring
  • What about ABCABC, is there repetition there?
    Yes: consider matching it against ABCABF. After
    we match 5 characters, we have a mismatch, but we
    do not have to start our substring over; we
    already know there is a match up to AB

7
Generating the Fail Array
  • This algorithm is somewhat tricky to understand
  • We have a substring sub of length m
  • fail[1] is always 0 (we cannot fail before we
    have 1 comparison)
  • If we fail after the first character, we must
    realign to the beginning and thus fail[2] = 1
  • Beyond this point, however, where we realign
    after a failure depends on the repetition found

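    fail[1] = 0;
    for (k = 2; k <= m; k++) {
        s = fail[k-1];
        more = true;
        while (s >= 1 && more) {
            if (sub.charAt(s) != sub.charAt(k-1))
                s = fail[s];       // back up to an earlier repetition
            else more = false;     // repetition found, stop backing up
        }
        fail[k] = s + 1;
    }
Note that this algorithm iterates from 1..m whereas
Java and C arrays start at 0, so we will have to make
a modification when dealing with Java/C/C++ code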
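One possible way to make that modification is sketched below. It is an assumption of this sketch, not the book's code: the fail array keeps the slides' 1-based indices (index 0 is left unused) and only the charAt calls are shifted by one. The class name FailArray and method name build are hypothetical.

```java
class FailArray {
    // Returns fail[1..m] in the slides' 1-based convention.
    // fail[0] is unused and stays 0.
    static int[] build(String sub) {
        int m = sub.length();
        int[] fail = new int[m + 1];
        // fail[1] = 0 already, since Java zero-initializes int arrays
        for (int k = 2; k <= m; k++) {
            int s = fail[k - 1];
            // Java's charAt is 0-based, so "character s" of the slides is
            // charAt(s - 1) and "character k-1" is charAt(k - 2).
            while (s >= 1 && sub.charAt(s - 1) != sub.charAt(k - 2)) {
                s = fail[s];           // back up to an earlier repetition
            }
            fail[k] = s + 1;
        }
        return fail;
    }
}
```

With this layout, build("AAAB") holds 0, 1, 2, 3 in positions 1 to 4, and build("ABCABC") holds 0, 1, 1, 1, 2, 3 in positions 1 to 6.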
8
Complexity of Constructing Fail Array
  • The complexity is at most O(m^2)
  • there is a for-loop that iterates m times and,
    within the loop, we iterate while there is no
    repetition between the previous character (k-1)
    and the character at position s
  • However, is it really as bad as Θ(m^2)?
  • If there is a match between characters (k-1) and
    s, then we exit the inner loop
  • If there is no match, then we back up to fail[s],
    which itself is a value based on earlier
    repetition (or the lack of it)
  • If there is no repetition for some character j,
    then fail[j] = 1, so if there is no repetition,
    we are all the way back at the beginning of the
    substring and we exit the inner while loop
  • So ultimately, the number of times through the
    inner while loop is bounded by the amount of
    repetition: s grows by at most 1 for each value
    of k and every failed comparison shrinks it, so
    there are fewer than m failed comparisons in total
  • This will be more apparent when we look at some
    examples next
  • The complexity of this code is then Θ(m)
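The linear bound can also be checked empirically. The sketch below counts the character comparisons made while building the fail array. The class FailArrayCost and its method name are hypothetical, and the indices are shifted for Java's 0-based charAt as an assumption of this sketch.

```java
class FailArrayCost {
    // Counts the character comparisons made while building fail[1..m]
    // with the algorithm from the previous slide.
    static int countComparisons(String sub) {
        int m = sub.length();
        int[] fail = new int[m + 1];   // 1-based; fail[1] = 0
        int comparisons = 0;
        for (int k = 2; k <= m; k++) {
            int s = fail[k - 1];
            while (s >= 1) {
                comparisons++;
                if (sub.charAt(s - 1) == sub.charAt(k - 2)) break; // repetition found
                s = fail[s];           // failed comparison: s strictly decreases
            }
            fail[k] = s + 1;           // s grows by at most 1 per value of k
        }
        return comparisons;
    }
}
```

For SSSSHH (m = 6) this counts 7 comparisons: one each for k = 3, 4, 5 and four for k = 6. That is well under m^2 = 36.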

9
Some Examples of the Fail Array
  • AAAB
  • fail[1] = 0 and fail[2] = 1 by default
  • for k = 3, let s = fail[2] = 1
  • does sub.charAt(s) == sub.charAt(k-1)? Yes, A ==
    A, so set fail[k] = s + 1 = 2
  • for k = 4, let s = fail[3] = 2
  • does sub.charAt(s) == sub.charAt(k-1)? Yes, A ==
    A, so set fail[k] = s + 1 = 3
  • So AAAB has a fail array of 0, 1, 2, 3
  • ABACAB
  • fail[1] = 0, fail[2] = 1
  • for k = 3, let s = fail[2] = 1; since
    sub.charAt(s) != sub.charAt(k-1) (no repetition
    between 1st and 2nd characters), set s = fail[s]
    = fail[1] = 0, so fail[3] = s + 1 = 1
  • for k = 4, let s = fail[3] = 1; since
    sub.charAt(s) == sub.charAt(k-1) (there is
    repetition between 1st and 3rd characters), set
    fail[k] = s + 1 = 2
  • for k = 5, let s = fail[4] = 2; since
    sub.charAt(s) != sub.charAt(k-1) (no repetition
    between 2nd and 4th characters), set s = fail[s]
    = fail[2] = 1; the 1st and 4th characters differ
    as well, so s = fail[1] = 0 and fail[5] = s + 1 = 1
  • for k = 6, let s = fail[5] = 1; since
    sub.charAt(s) == sub.charAt(k-1) (repetition
    between 1st and 5th characters), set fail[k] =
    s + 1 = 2
  • So ABACAB has a fail array of 0, 1, 1, 2, 1, 2

10
More Fail Array Examples
  • SSSSHH
  • fail[1] = 0, fail[2] = 1
  • k = 3, s = fail[2] = 1, repetition between 1st
    and 2nd characters, so fail[3] = s + 1 = 2
  • k = 4, s = fail[3] = 2, repetition between 2nd
    and 3rd characters, so fail[4] = s + 1 = 3
  • k = 5, s = fail[4] = 3, repetition between 3rd
    and 4th characters, so fail[5] = s + 1 = 4
  • k = 6, s = fail[5] = 4, no repetition between 4th
    and 5th characters, so s = fail[s] = fail[4] = 3;
    no repetition between 3rd and 5th characters
    either, so s = fail[s] = fail[3] = 2; no
    repetition between 2nd and 5th characters, so s =
    fail[s] = fail[2] = 1; no repetition between 1st
    and 5th characters, so s = fail[1] = 0 and
    fail[6] = s + 1 = 1
  • SSSSHH has a fail array of 0, 1, 2, 3, 4, 1
  • ABCABC
  • will have a fail array of 0, 1, 1, 1, 2, 3
  • ABCDE
  • will have a fail array of 0, 1, 1, 1, 1

11
The KMP Scan Algorithm
  • Now that we have seen how to generate the fail
    array, how do we use it?
  • In the straightforward matching algorithm, for
    any mismatch, we started over at both the
    beginning of sub and at the next position in s
    after the first match
  • Here, however, we have scanned successfully up
    through some point in s until we have a mismatch
  • We need to realign sub appropriately using the
    fail array and continue searching s at the point
    of failure
  • Code given below
  • The complexity of this code is Θ(n)
  • Note: KMP's overall complexity is Θ(n + m)
    because it requires first computing the fail array

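    // n = size of s, m = size of sub, positions are 1-based
    match = -1; j = 1; k = 1;
    while (j <= n || k > m) {
        if (k > m) { match = j - m; break; }     // success
        else if (k == 0) { j++; k = 1; }         // start pattern over
        else if (s.charAt(j) == sub.charAt(k)) { j++; k++; }
        else k = fail[k];                        // realign sub
    }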
12
Example
  • s = AABACAABACABAAB
  • sub = ABACAB with fail = 0, 1, 1, 2, 1, 2
  • Start with j = 1, k = 1, match, increment j and k
  • At j = 2, k = 2, we have a mismatch, so reset k =
    fail[2] = 1 (start sub over)
  • At j = 2, k = 1, match, increment j and k
  • At j = 3, k = 2, match, increment j and k
  • At j = 4, k = 3, match, increment j and k
  • At j = 5, k = 4, match, increment j and k
  • At j = 6, k = 5, match, increment j and k
  • At j = 7, k = 6, mismatch (bummer), reset k =
    fail[6] = 2
  • At j = 7, k = 2, mismatch, reset k = fail[2] = 1
  • At j = 7, k = 1, match, increment j and k
  • At j = 8, k = 2, match, increment j and k
  • matches at j = 9, k = 3, and j = 10, k = 4, and
    j = 11, k = 5
  • At j = 12, k = 6, match, increment j and k
  • k = 7 > m so we have found our substring at j - m
    (13 - 6 = 7)
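The trace above can be reproduced with a runnable sketch. This is one possible Java rendering, not the book's code: it keeps the slides' 1-based j, k and fail values and subtracts 1 only inside charAt, and it checks for success after the loop rather than inside it. The names KmpSearch, buildFail and scan are hypothetical.

```java
class KmpSearch {
    // fail[1..m] in the slides' 1-based convention (index 0 unused)
    static int[] buildFail(String sub) {
        int m = sub.length();
        int[] fail = new int[m + 1];
        for (int k = 2; k <= m; k++) {
            int s = fail[k - 1];
            while (s >= 1 && sub.charAt(s - 1) != sub.charAt(k - 2)) {
                s = fail[s];
            }
            fail[k] = s + 1;
        }
        return fail;
    }

    // Returns the 1-based position of the first occurrence of sub in s,
    // or -1 if sub does not occur.
    static int scan(String s, String sub) {
        int n = s.length();
        int m = sub.length();
        int[] fail = buildFail(sub);
        int j = 1;   // position in s
        int k = 1;   // position in sub
        while (j <= n && k <= m) {
            if (k == 0) {                       // start pattern over
                j++;
                k = 1;
            } else if (s.charAt(j - 1) == sub.charAt(k - 1)) {
                j++;
                k++;
            } else {
                k = fail[k];                    // realign sub, j stays put
            }
        }
        return (k > m) ? j - m : -1;            // k > m means all of sub matched
    }
}
```

Calling scan("AABACAABACABAAB", "ABACAB") returns 7, the same position found by hand above. Note that j never moves backward, which is where the Θ(n) scan cost comes from.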

13
Another Idea: Boyer-Moore
  • Consider if our substring is "hello"
  • Rather than starting at position 1 in the string,
    let's look ahead to position 5 to see if it is an
    "o"
  • If so, then I have to back up and check the rest
    of "hello", but what if it is not an "o"? Does
    that tell me something?
  • If I also determine that the 5th character is not
    an "h", "e", or "l", I can jump ahead 5 more
    characters and try again
  • If I can apply this idea correctly, I can improve
    over KMP by a factor of m; that is, instead of
    taking Θ(n + m), I should be able to perform my
    scanning in about Θ(n/m + m)
  • Of course there are problems: how do I know if I
    can skip over m characters? I have to compare
    the character at position j of the string to all
    m characters of sub, which defeats the whole
    purpose unless I can be clever
  • Also, if there are partial matches, I have to
    back up, so that reduces the improvement, but it
    will usually still be better than KMP

14
Jump Array
  • Like KMP's fail array, we need an array to tell
    us where to reposition our substring
  • But in this case, we will jump back in the string
    if we have a match of a character in s
  • And jump ahead in the string if we have a
    mismatch
  • The amount to jump back depends on the character
    matched in sub
  • The amount to jump ahead depends on how large sub
    is
  • We compute this array in a similar way to the
    KMP fail array; see the code below
  • Assign each letter of the alphabet to have a jump
    of m (jump over m characters)
  • unless the character is an element of sub, then
    jump backward an appropriate amount to be at the
    potential beginning of sub in the string

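    // alphabetSize: number of characters in the alphabet
    char ch; int k;
    for (ch = 0; ch < alphabetSize; ch++) jump[ch] = m;
    for (k = 1; k <= m; k++) jump[sub.charAt(k)] = m - k;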
  • Notice that the first loop iterates once for
    every character in the alphabet
  • ASCII has 128 characters
  • A 16-bit Unicode char (as in Java) has 65,536
    possible values
  • So, this algorithm is Θ(|alphabet| + m)
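As a concrete illustration, the jump computation might look like this in Java. The class name JumpArray and the alphabetSize parameter are assumptions of this sketch, and it requires every character of sub to have a code below alphabetSize.

```java
class JumpArray {
    // jump[c] = m for characters that do not occur in sub; otherwise
    // m minus the 1-based position of the last occurrence of c in sub.
    // Assumes every character of sub has a code below alphabetSize.
    static int[] computeJumps(String sub, int alphabetSize) {
        int m = sub.length();
        int[] jump = new int[alphabetSize];
        java.util.Arrays.fill(jump, m);          // default: jump over m characters
        for (int k = 1; k <= m; k++) {
            jump[sub.charAt(k - 1)] = m - k;     // later occurrences overwrite earlier ones
        }
        return jump;
    }
}
```

For sub = "hello" with alphabetSize = 128, this gives jumps of 4 for h, 3 for e, 1 for l, 0 for o and 5 for every other character.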

15
Enhancement
  • Notice that if we have a match
  • we may not have to compare all characters, only
    those until we have a mismatch
  • and if we have a mismatch, we can take advantage
    of knowing the repetition in our substring to
    jump ahead some intermediate amount
  • So, we enhance the jump array by including the
    KMP fail information
  • This makes the algorithm to compute jump
    extremely complicated; we will skip the details
  • the algorithm itself is given on page 501
  • but the idea is that the jump array now
    represents how far to jump ahead on a miss and
    jump behind on a match
  • The algorithm to compute the backward skips is
    Θ(|alphabet| + m)
  • The algorithm to compute the forward jumps is
    Θ(m)
  • So, the overall complexity of computing jump in
    the Boyer-Moore algorithm is Θ(|alphabet| + m)

16
Boyer-Moore Scan Algorithm
  • Like the KMP algorithm, the BM algorithm simply
    searches through both the string and substring
    looking for matches
  • Upon a match however, we scan backwards
  • On a mismatch, we jump either forward or backward
    depending on whether the character in the string
    exists in the substring at all or not
  • This algorithm is given on page 503, but is
    omitted here
  • The complexity of BM is, in the worst case, Θ(n)
    but can improve to be as good as Θ(n/m)
  • Based on empirical studies, if m >= 5, BM only
    has to examine 24-30% of the text, a vast
    improvement over KMP, which must examine 100% of
    the text!
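Since the full algorithm is omitted here, the sketch below shows only a simplified variant: it scans sub right to left and uses the per-character jump array alone, without the enhanced jumps that draw on the KMP fail information. It is not the algorithm from page 503. The names SimpleBoyerMoore, computeJumps and scan, and the choice of 65,536 table entries (one per Java char value), are assumptions of this sketch.

```java
class SimpleBoyerMoore {
    // Per-character jump table: m for characters absent from sub, otherwise
    // m minus the 1-based position of the character's last occurrence.
    static int[] computeJumps(String sub, int alphabetSize) {
        int m = sub.length();
        int[] jump = new int[alphabetSize];
        java.util.Arrays.fill(jump, m);
        for (int k = 1; k <= m; k++) {
            jump[sub.charAt(k - 1)] = m - k;
        }
        return jump;
    }

    // Returns the 1-based position of the first occurrence of sub in s, or -1.
    static int scan(String s, String sub) {
        int n = s.length();
        int m = sub.length();
        if (m == 0) return 1;                        // empty pattern matches at once
        int[] jump = computeJumps(sub, 65536);       // one entry per Java char value
        int j = m;                                   // position in s (1-based)
        int k = m;                                   // position in sub (1-based)
        while (j <= n) {
            char c = s.charAt(j - 1);
            if (c == sub.charAt(k - 1)) {
                if (k == 1) return j;                // matched all of sub, scanning backward
                j--;
                k--;
            } else {
                // Jump ahead by the table value, but always far enough that
                // sub moves at least one position to the right.
                j += Math.max(jump[c], m - k + 1);
                k = m;
            }
        }
        return -1;
    }
}
```

On s = "say hello" and sub = "hello", the first comparison looks at position 5 (an h), jumps ahead 4, and then matches backward from the o, so only 6 of the 9 text characters are examined.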

17
Approximate String Matching
  • A variation of the string matching application is
    to determine if a string partially matches
  • This is much more difficult: what is a partial
    match?
  • Consider the string "approximate". Which of
    these are partial matches?
  • aproximate, approximately, appropriate,
    proximate, approx, approximat,
    apropos, approxximate
  • A partial match can be thought of as one that has
    k differences from the string, where k is some
    small integer (for instance 1 or 2)
  • A difference occurs if string1.charAt(j) !=
    string2.charAt(j), or if string1.charAt(j) does
    not appear in string2 (or vice versa)
  • The former case is known as a revise difference;
    the latter is a delete or insert difference (a
    character has been deleted or inserted)
  • What about two characters that appear out of
    position? For instance, approximate vs.
    apporximate? Should that be 1 or 2 differences?
  • A spell checker will determine the number of
    differences between the given word and all
    closely matching words in the dictionary and
    order the words based on number of differences
    (fewer differences at the top, starting with 1
    difference and going up to some maximum k such as
    k = 3)

18
Determining Differences: a DP Solution
  • Comparing two strings to determine the number of
    differences is challenging
  • A dynamic programming example is given in section
    11.5 in which D[i][j] is the number of
    differences that occur between characters 1..i of
    string1 and 1..j of string2
  • We define 4 costs
  • matchCost = D[i-1][j-1] if string1.charAt(i) ==
    string2.charAt(j)
  • reviseCost = D[i-1][j-1] + 1 if string1.charAt(i)
    != string2.charAt(j)
  • insertCost = D[i-1][j] + 1
  • deleteCost = D[i][j-1] + 1
  • D[i][j] = minimum(matchCost, reviseCost,
    insertCost, deleteCost)
  • Now, compute D for all i starting at 0 up to the
    length of string1 and for all j starting at 0 up
    to the length of string2
  • If n is the size of the larger string (string2),
    and we are searching for up to k differences, the
    algorithm is Θ(kn)
  • A brief example is shown on page 508, table 11.2
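The recurrence above can be sketched in Java as follows. Two things here are assumptions rather than content from the slides: the base cases D[i][0] = i and D[0][j] = j (comparing two whole words), and filling the entire table, which costs Θ(length1 * length2) rather than the Θ(kn) quoted above. The names EditDistance and differences are hypothetical.

```java
class EditDistance {
    // Number of differences (revise, insert, delete) between two whole words.
    static int differences(String string1, String string2) {
        int len1 = string1.length();
        int len2 = string2.length();
        int[][] d = new int[len1 + 1][len2 + 1];
        // Assumed base cases: turning a prefix into the empty string
        // costs one difference per character.
        for (int i = 0; i <= len1; i++) d[i][0] = i;
        for (int j = 0; j <= len2; j++) d[0][j] = j;
        for (int i = 1; i <= len1; i++) {
            for (int j = 1; j <= len2; j++) {
                boolean same = string1.charAt(i - 1) == string2.charAt(j - 1);
                int matchOrRevise = d[i - 1][j - 1] + (same ? 0 : 1); // matchCost or reviseCost
                int insertCost = d[i - 1][j] + 1;
                int deleteCost = d[i][j - 1] + 1;
                d[i][j] = Math.min(matchOrRevise, Math.min(insertCost, deleteCost));
            }
        }
        return d[len1][len2];
    }
}
```

Under this recurrence, "approximate" vs. "aproximate" is 1 difference, while "approximate" vs. "apporximate" is 2, because two characters out of position are counted as two revisions.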