Title: Chapter 11: String Matching
1Chapter 11 String Matching
- Problem: given a string (some text, possibly a large file) and a substring, find the first (next) occurrence of the substring in the string
- The straightforward solution has a complexity of Θ(mn), where m is the size of the substring and n is the size of the string
- For text files, this is too large a price to pay
- Can we improve? Yes
- We will explore the Knuth-Morris-Pratt (KMP) and Boyer-Moore (BM) algorithms
- We will also take a brief look at approximate string matching, which is used for spell checking and other applications
2Straightforward Algorithm
- Let s be our string and sub be our substring
- Start searching s and sub at position 0
- If we have a match, increment both string pointers
- If we have a mismatch, start over in s at the position after the one where the previous match attempt began, and start sub at 0
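    current1 = 0; current2 = 0; current = 0; found = false;
    while (!found && current + m <= n) {
        // scan while characters match and sub is not exhausted
        while (current2 < m && s.charAt(current1) == sub.charAt(current2)) {
            current1++; current2++;
        }
        if (current2 == m) found = true;   // all of sub matched, starting at current
        else {
            current++;                     // realign sub one position further along s
            current1 = current; current2 = 0;
        }
    }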
[Figure: sub aligned beneath s at position current. Search until mismatch, then realign to start at current1]
3Example
- Consider s = AAAAAAAAAAAAAAB and sub = AAAB
- Start with current = 0, current1 = 0, current2 = 0
- We immediately have a match, so increment current1 and current2; again, we have a match
- Increment current1 and current2; again, we have a match
- Increment current1 and current2 again, but now we have a mismatch
- So, reset current to 1, reset current1 to 1, reset current2 to 0 and start again; we will have 3 more matches
- And so on
- When do we finally get a full match? Only on the 12th iteration of our outer while loop
- Number of comparisons = 48, size of s = 15, size of sub = 4; the complexity is (n - m + 1) * m
- So, this search is Θ(mn), much too inefficient
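The counts in this example can be checked with a small Java sketch of the straightforward search, instrumented to count character comparisons. The class name NaiveSearch, the comparisons counter and the exampleString helper are assumptions made for this illustration; they are not names from the book.

```java
final class NaiveSearch {
    // Number of character comparisons made by the most recent call to find().
    static int comparisons;

    // Returns the 0-based index of the first occurrence of sub in s, or -1.
    static int find(String s, String sub) {
        int n = s.length(), m = sub.length();
        comparisons = 0;
        for (int current = 0; current + m <= n; current++) {
            int current2 = 0;
            while (current2 < m) {
                comparisons++;
                if (s.charAt(current + current2) != sub.charAt(current2)) break;
                current2++;
            }
            if (current2 == m) return current;   // full match starting at current
        }
        return -1;
    }

    // Builds the example string: 14 A's followed by one B (15 characters).
    static String exampleString() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 14; i++) sb.append('A');
        return sb.append('B').toString();
    }

    public static void main(String[] args) {
        int index = find(exampleString(), "AAAB");
        // Prints "11 48": the match starts at 0-based index 11 (the 12th alignment)
        // after (15 - 4 + 1) * 4 = 48 comparisons.
        System.out.println(index + " " + comparisons);
    }
}
```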
4Knuth-Morris-Pratt Algorithm
- Observation: in our previous example, if we had a match in our substring AAAB up to the B, what does that tell us?
- It tells us that the string has a sequence of AAAx (where x is some character other than B)
- Once we found a mismatch, did we have to start our substring matching over at 0? No, we could have realigned our matching to start at the last A in the substring
  - we know that the string had AAAx
  - if we compare our substring again starting at the string's next position, we are looking at AAx, but we already know that AA matched, so the first two As do not need to be compared again
  - so, let's realign our substring at a clever position and start matching from there; if we do this successfully, it will reduce the complexity of our string searching
5Creating an Align Array
- In order to figure out where to start searching in our substring, we need to figure out how to realign it to the string after a failure
- Note: the book presents this idea as a finite automaton, but this makes the idea somewhat more complicated than we need to think about
- In truth, we can just create an align array
- If we have a mismatch at character k in our substring, at what character do we start over again in the substring?
- The align array will tell us this
- Note: the book calls this array fail, so we will call it that from now on
- If we have a mismatch at k, we realign sub to position fail[k]
6Character Repetition
- The fail array is generated by considering only the substring itself, looking for repetition
- Once repetition is found, we build on that repetition
- Reconsider our substring AAAB
- If we have a mismatch after the second A, it means that the preceding character in the string was A (for instance, we matched against AABB), so we do not need to start our substring over from the start; we already have 1 matching A
- If we have a mismatch after the third A, it means that the two preceding characters have already matched (for instance, if the string is AAAC), so we already have 2 matching As
- So, we look for repetition in the substring
- What about ABCABC, is there repetition there? Yes: consider matching against ABCABF. After we match 5 characters we have a mismatch, but we do not have to start our substring over; we already know there is a match up to AB
7Generating the Fail Array
- This algorithm is somewhat tricky to understand
- We have a substring sub of length m
- fail[1] is always 0 (we cannot fail before we have 1 comparison)
- If we fail after the first character, we must realign to the beginning and thus fail[2] = 1
- Beyond this point, however, where we realign after a failure is based on the repetition found
    fail[1] = 0;
    for (k = 2; k <= m; k++) {
        s = fail[k-1]; more = true;
        while (s >= 1 && more)
            if (sub.charAt(s) != sub.charAt(k-1)) s = fail[s];
            else more = false;
        fail[k] = s + 1;
    }
Note: this algorithm indexes characters and array elements from 1..m, whereas Java and C arrays start at 0, so we will have to make a modification when dealing with Java/C/C++ code
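One possible way to make that modification in Java is sketched below. It keeps the slides' 1-based fail values and only shifts the character lookups by one. The class name FailArray and the unused slot fail[0] are assumptions of this sketch, not the book's code.

```java
final class FailArray {
    // Returns fail[1..m] in the slides' 1-based convention; fail[0] is unused.
    static int[] build(String sub) {
        int m = sub.length();
        int[] fail = new int[m + 1];          // fail[1] = 0 by default
        for (int k = 2; k <= m; k++) {
            int s = fail[k - 1];
            // The slides' 1-based character p is sub.charAt(p - 1) in Java.
            while (s >= 1 && sub.charAt(s - 1) != sub.charAt(k - 2)) {
                s = fail[s];                  // no repetition here, back up
            }
            fail[k] = s + 1;
        }
        return fail;
    }

    public static void main(String[] args) {
        // Prints [0, 0, 1, 1, 2, 1, 2]; the leading 0 is the unused slot fail[0].
        System.out.println(java.util.Arrays.toString(build("ABACAB")));
    }
}
```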
8Complexity of Constructing Fail Array
- The complexity is no more than Θ(m^2)
  - there is a for-loop that iterates m times, and within the loop we iterate while there is no repetition between the previous character (k-1) and the character at s
- However, is it really as bad as Θ(m^2)?
  - If there is a match between the characters at (k-1) and s, then we exit the inner loop
  - If there is no match, then we back up to fail[s], which itself is a value based on a lack of previous repetition
  - If there is no repetition for some character j, then fail[j] = 1, so if there is no repetition, we are all the way back to the beginning of the substring and we exit the inner while loop
- So ultimately, the number of times through the inner while loop is bounded by the amount of repetition
- This will be more apparent when we look at some examples next
- The complexity of this code is then Θ(m)
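The linear bound can also be observed empirically. The Java sketch below repeats the fail-array construction but counts how many character comparisons the inner loop makes. The class name FailCost is hypothetical, and the limit of 2m mentioned in the comment comes from the usual amortized argument (s rises by at most 1 per value of k and falls on every failed comparison); it is an assumption of this sketch, not a figure from the slides.

```java
final class FailCost {
    // Counts character comparisons made while building the fail array for sub.
    static int comparisons(String sub) {
        int m = sub.length();
        int[] fail = new int[m + 1];          // 1-based, fail[0] unused
        int count = 0;
        for (int k = 2; k <= m; k++) {
            int s = fail[k - 1];
            while (s >= 1) {
                count++;
                if (sub.charAt(s - 1) == sub.charAt(k - 2)) break;  // repetition found
                s = fail[s];                                        // back up
            }
            fail[k] = s + 1;
        }
        return count;
    }

    public static void main(String[] args) {
        String[] patterns = {"AAAB", "ABACAB", "SSSSHH", "ABCABC", "ABCDE"};
        for (String p : patterns) {
            // Under the amortized argument each count should stay below 2 * p.length().
            System.out.println(p + ": " + comparisons(p) + " comparisons, m = " + p.length());
        }
    }
}
```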
9Some Examples of the Fail Array
- AAAB
  - fail[1] = 0 and fail[2] = 1 by default
  - for k = 3, let s = fail[2] = 1
  - does sub.charAt(s) == sub.charAt(k-1)? Yes, A == A, so set fail[k] = s + 1 = 2
  - for k = 4, let s = fail[3] = 2
  - does sub.charAt(s) == sub.charAt(k-1)? Yes, A == A, so set fail[k] = s + 1 = 3
  - So AAAB has a fail array of 0, 1, 2, 3
- ABACAB
  - fail[1] = 0, fail[2] = 1
  - for k = 3, let s = fail[2] = 1; since sub.charAt(s) != sub.charAt(k-1) (no repetition between 1st and 2nd characters), set s = fail[s] = fail[1] = 0, so fail[3] = s + 1 = 1
  - for k = 4, let s = fail[3] = 1; since sub.charAt(s) == sub.charAt(k-1) (there is repetition between 1st and 3rd characters), set fail[k] = s + 1 = 2
  - for k = 5, let s = fail[4] = 2; since sub.charAt(s) != sub.charAt(k-1) (no repetition between 2nd and 4th characters), set s = fail[s] = fail[2] = 1; the 1st and 4th characters do not match either, so s = fail[1] = 0 and fail[5] = s + 1 = 1
  - for k = 6, let s = fail[5] = 1; since sub.charAt(s) == sub.charAt(k-1) (repetition between 1st and 5th characters), set fail[k] = s + 1 = 2
  - So ABACAB has a fail array of 0, 1, 1, 2, 1, 2
10More Fail Array Examples
- SSSSHH
  - fail[1] = 0, fail[2] = 1
  - k = 3, s = fail[2] = 1, repetition between 1st and 2nd characters, so fail[3] = s + 1 = 2
  - k = 4, s = fail[3] = 2, repetition between 2nd and 3rd characters, so fail[4] = s + 1 = 3
  - k = 5, s = fail[4] = 3, repetition between 3rd and 4th characters, so fail[5] = s + 1 = 4
  - k = 6, s = fail[5] = 4, no repetition between 4th and 5th characters, so s = fail[s] = fail[4] = 3; no repetition between 3rd and 5th characters either, so s = fail[s] = fail[3] = 2; no repetition between 2nd and 5th characters, so s = fail[s] = fail[2] = 1; no repetition between 1st and 5th characters, so s = fail[1] = 0 and fail[6] = s + 1 = 1
  - SSSSHH has a fail array of 0, 1, 2, 3, 4, 1
- ABCABC
  - will have a fail array of 0, 1, 1, 1, 2, 3
- ABCDE
  - will have a fail array of 0, 1, 1, 1, 1
11The KMP Scan Algorithm
- Now that we have seen how to generate the fail array, how do we use it?
- In the straightforward matching algorithm, for any mismatch, we started over at both the beginning of sub and at the next position in s after the position where the match attempt began
- Here, however, we have scanned successfully up through some point in s until we have a mismatch
- We need to realign sub appropriately using the fail array and continue searching s at the point of failure
- Code given below
- The complexity of this code is Θ(n)
- Note: KMP's overall complexity is Θ(n + m) because it requires first computing the fail array
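    match = -1; j = 1; k = 1;
    while (j <= n || k > m) {
        if (k > m) { match = j - m; break; }                  // success
        else if (k == 0) { j++; k = 1; }                      // start pattern over
        else if (s.charAt(j) == sub.charAt(k)) { j++; k++; }  // match, advance both
        else k = fail[k];                                     // mismatch, realign sub
    }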
12Example
- s = AABACAABACABAAB
- sub = ABACAB with fail = 0, 1, 1, 2, 1, 2
- Start with j = 1, k = 1: match, increment j and k
- At j = 2, k = 2, we have a mismatch, so reset k = fail[2] = 1 (start sub over)
- At j = 2, k = 1: match, increment j and k
- At j = 3, k = 2: match, increment j and k
- At j = 4, k = 3: match, increment j and k
- At j = 5, k = 4: match, increment j and k
- At j = 6, k = 5: match, increment j and k
- At j = 7, k = 6: mismatch (bummer), reset k = fail[6] = 2
- At j = 7, k = 2: mismatch, reset k = fail[2] = 1
- At j = 7, k = 1: match, increment j and k
- At j = 8, k = 2: match, increment j and k
- Matches at j = 9, k = 3, and j = 10, k = 4, and j = 11, k = 5
- At j = 12, k = 6: match, increment j and k
- k = 7 > m, so we have found our substring at j - m (13 - 6 = 7)
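The trace above can be reproduced with a Java sketch of the scan. It keeps the slides' 1-based j and k internally and converts only when reading characters and when reporting the result, so it returns a 0-based index in the style of String.indexOf. The class and method names are assumptions made for this illustration.

```java
final class KmpScan {
    // Fail array in the slides' 1-based convention; fail[0] is unused.
    static int[] buildFail(String sub) {
        int m = sub.length();
        int[] fail = new int[m + 1];
        for (int k = 2; k <= m; k++) {
            int s = fail[k - 1];
            while (s >= 1 && sub.charAt(s - 1) != sub.charAt(k - 2)) s = fail[s];
            fail[k] = s + 1;
        }
        return fail;
    }

    // Returns the 0-based index of the first occurrence of sub in s, or -1.
    static int indexOf(String s, String sub) {
        int n = s.length(), m = sub.length();
        int[] fail = buildFail(sub);
        int j = 1, k = 1;                       // 1-based positions, as in the slides
        while (j <= n || k > m) {
            if (k > m) return j - m - 1;        // slides' match position j - m, made 0-based
            if (k == 0) { j++; k = 1; }         // start pattern over at the next text character
            else if (s.charAt(j - 1) == sub.charAt(k - 1)) { j++; k++; }
            else k = fail[k];                   // realign sub, j stays where it is
        }
        return -1;
    }

    public static void main(String[] args) {
        // Prints 6, which is position 7 in the slides' 1-based numbering.
        System.out.println(indexOf("AABACAABACABAAB", "ABACAB"));
    }
}
```

Note that j never moves backwards in the text, which is where the Θ(n) scan cost comes from.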
13Another Idea Boyer-Moore
- Consider if our substring is hello
- Rather than starting at position 1 in the string, let's look ahead to position 5 to see if it is an o
- If so, then I have to back up and check more of hello out, but what if it is not an o; does that tell me something?
- If I also determine that the 5th character is not an h, e, or l, I can jump ahead 5 more characters and try again
- If I can apply this idea correctly, I can improve over KMP by a factor of m; that is, instead of taking Θ(n + m), I should be able to perform my scanning in about Θ(n/m + m)
- Of course there are problems: how do I know if I can skip over m characters? I have to compare the character at position j of the string to all m characters of sub, which defeats the whole purpose unless I can be clever
- Also, if there are partial matches, I have to back up, so that reduces the improvement, but it will still be better than KMP
14Jump Array
- Like KMP's fail array, we need an array to tell us where to reposition our substring
- But in this case, we will jump back in the string if we have a match of a character in s
- And jump ahead in the string if we have a mismatch
  - The amount to jump back depends on the character matched in sub
  - The amount to jump ahead depends on how large sub is
- We compute this array in a similar way to the KMP fail array; see the code below
- Assign each letter of the alphabet a jump of m (jump over m characters)
  - unless the character is an element of sub; then jump backward an appropriate amount to be at the potential beginning of sub in the string
    char ch; int k;
    for (ch = 0; ch < alphabetSize; ch++) jump[ch] = m;
    for (k = 1; k <= m; k++) jump[sub.charAt(k)] = m - k;
- Notice that the first loop iterates once for every character in the alphabet
  - ASCII has 128 characters
  - 16-bit Unicode (as in a Java char) has 65,536
- So, this algorithm is Θ(|alphabet| + m)
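As a rough Java sketch, the two loops might look as follows. The 128-entry table assumes 7-bit ASCII input, and the names JumpArray, compute and ALPHABET_SIZE are hypothetical. Because later positions overwrite earlier ones, a repeated character such as the l in hello keeps the jump for its rightmost occurrence.

```java
final class JumpArray {
    static final int ALPHABET_SIZE = 128;   // assumption: input is 7-bit ASCII only

    // jump[c] = m for characters not in sub, otherwise m - k for the
    // rightmost 1-based position k at which c occurs in sub.
    static int[] compute(String sub) {
        int m = sub.length();
        int[] jump = new int[ALPHABET_SIZE];
        java.util.Arrays.fill(jump, m);              // first loop: one pass over the alphabet
        for (int k = 1; k <= m; k++) {
            jump[sub.charAt(k - 1)] = m - k;         // second loop: one pass over sub
        }
        return jump;
    }

    public static void main(String[] args) {
        int[] jump = compute("hello");
        // Prints "4 3 1 0 5" for h, e, l, o and a character that is not in hello.
        System.out.println(jump['h'] + " " + jump['e'] + " " + jump['l'] + " "
                + jump['o'] + " " + jump['x']);
    }
}
```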
15Enhancement
- Notice that if we have a match
  - we may not have to compare all characters, only those until we have a mismatch
  - and if we have a mismatch, we can take advantage of knowing the repetition in our substring to jump ahead some intermediate amount
- So, we enhance the jump array by including the KMP fail information
- This makes the algorithm to compute jump extremely complicated, so we will skip the details
  - the algorithm itself is given on page 501
  - but the idea is that the jump array now represents how far to jump ahead on a miss and jump behind on a match
- The algorithm to compute the backward skips is Θ(|alphabet| + m)
- The algorithm to compute the forward jumps is Θ(m)
- So, the overall complexity of computing jump in the Boyer-Moore algorithm is Θ(|alphabet| + m)
16Boyer-Moore Scan Algorithm
- Like the KMP algorithm, the BM algorithm simply searches through both the string and substring looking for matches
- Upon a match, however, we scan backwards
- On a mismatch, we jump either forward or backward depending on whether the character in the string exists in the substring at all or not
- This algorithm is given on page 503, but is omitted here
- The complexity of BM is, in the worst case, Θ(n) but can improve to be as good as Θ(n/m)
- Based on empirical studies, if m ≥ 5, BM only has to examine 24-30% of the text, a vast improvement over KMP, which must examine 100% of the text!
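To give a feel for the backward scan, here is a simplified Java sketch. It is not the algorithm from page 503: it uses only the character-based jump array (no KMP-style fail information), assumes 7-bit ASCII input, and on a mismatch advances by the larger of the character's jump and m - k + 1 so that the pattern always moves forward. All names in it are hypothetical.

```java
final class SimpleBoyerMoore {
    static final int ALPHABET_SIZE = 128;   // assumption: input is 7-bit ASCII only

    static int[] computeJump(String sub) {
        int m = sub.length();
        int[] jump = new int[ALPHABET_SIZE];
        java.util.Arrays.fill(jump, m);
        for (int k = 1; k <= m; k++) jump[sub.charAt(k - 1)] = m - k;
        return jump;
    }

    // Returns the 0-based index of the first occurrence of sub in s, or -1.
    static int indexOf(String s, String sub) {
        int n = s.length(), m = sub.length();
        if (m == 0) return 0;
        int[] jump = computeJump(sub);
        int j = m, k = m;                   // 1-based; start at the END of sub
        while (j <= n) {
            if (k < 1) return j;            // all of sub matched; 1-based start is j + 1
            char c = s.charAt(j - 1);
            if (c == sub.charAt(k - 1)) {
                j--; k--;                   // match: scan backwards
            } else {
                // mismatch: move j to where the end of sub will sit after the shift
                j += Math.max(jump[c], m - k + 1);
                k = m;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Prints 4. The first probe looks at 'h' under the 'o' of hello and jumps 4 ahead.
        System.out.println(indexOf("say hello world", "hello"));
    }
}
```

In this run only six text characters are examined before the match is reported, which illustrates why the scan can get away with looking at a fraction of the text.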
17Approximate String Matching
- A variation of the string matching application is to determine if a string partially matches
- This is much more difficult: what is a partial match?
- Consider the string approximate. Which of these are partial matches?
  - aproximate, approximately, appropriate, proximate, approx, approximat, apropos, approxximate
- A partial match can be thought of as one that has k differences from the string, where k is some small integer (for instance 1 or 2)
- A difference occurs if string1.charAt(j) != string2.charAt(j) or if string1.charAt(j) does not appear in string2 (or vice versa)
  - The former case is known as a revise difference, the latter is a delete or insert difference (a character has been deleted or inserted)
- What about two characters that appear out of position? For instance, approximate vs. apporximate? Should that be 1 or 2 differences?
- A spell checker will determine the number of differences between the given word and all closely matching words in the dictionary and order the words based on number of differences (fewer differences at the top, starting with 1 difference and going to some maximum k such as k = 3)
18Determining Differences a DP Solution
- Comparing two strings to determine the number of differences is challenging
- A dynamic programming example is given in section 11.5, in which D[i][j] is the number of differences that occur between characters 1..i of string1 and 1..j of string2
- We define 4 costs
  - matchCost = D[i-1][j-1] if string1.charAt(i) == string2.charAt(j)
  - reviseCost = D[i-1][j-1] + 1 if string1.charAt(i) != string2.charAt(j)
  - insertCost = D[i-1][j] + 1
  - deleteCost = D[i][j-1] + 1
- D[i][j] = minimum(matchCost, reviseCost, insertCost, deleteCost)
- Now, compute D for all i starting at 0 up to the length of string1 and for all j starting at 0 up to the length of string2
- If n is the size of the larger string (string2), and we are searching for up to k differences, the algorithm is Θ(kn)
- A brief example is shown on page 508, table 11.2
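A minimal Java sketch of this recurrence follows. The slides do not state the base cases, so the sketch assumes D[i][0] = i and D[0][j] = j (comparing against an empty prefix costs one insert or delete per character). It also fills the whole table rather than restricting the work to k differences, so it illustrates the recurrence and not the Θ(kn) bound. The class and method names are hypothetical.

```java
final class EditDistance {
    // Number of differences between string1 and string2 under the four costs above.
    static int differences(String string1, String string2) {
        int len1 = string1.length(), len2 = string2.length();
        int[][] d = new int[len1 + 1][len2 + 1];
        // Assumed base cases: an empty prefix differs by one insert/delete per character.
        for (int i = 0; i <= len1; i++) d[i][0] = i;
        for (int j = 0; j <= len2; j++) d[0][j] = j;
        for (int i = 1; i <= len1; i++) {
            for (int j = 1; j <= len2; j++) {
                // The slides' 1-based charAt(i) is charAt(i - 1) in Java.
                boolean same = string1.charAt(i - 1) == string2.charAt(j - 1);
                int matchOrRevise = d[i - 1][j - 1] + (same ? 0 : 1);
                int insertCost = d[i - 1][j] + 1;
                int deleteCost = d[i][j - 1] + 1;
                d[i][j] = Math.min(matchOrRevise, Math.min(insertCost, deleteCost));
            }
        }
        return d[len1][len2];
    }

    public static void main(String[] args) {
        // Prints "1 2": a dropped letter is 1 difference, while two swapped
        // letters count as 2 differences under this cost model.
        System.out.println(differences("approximate", "aproximate") + " "
                + differences("approximate", "apporximate"));
    }
}
```

Under these costs the out-of-position case (approximate vs. apporximate) comes out as 2 differences; treating a swap as a single difference would need an extra cost that this recurrence does not define.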