Chapter 11: String Matching

Slides: 19

Provided by: foxr

1
Chapter 11: String Matching

Problem: given a string (some text, possibly a
large file) and a substring, find the first
(or next) occurrence of the substring in the string
The straightforward solution has a worst-case
complexity of Θ(mn), where m is the size of the
substring and n is the size of the string
For text files, this is too high a price to pay
Can we improve? Yes
We will explore the Knuth-Morris-Pratt (KMP) and
Boyer-Moore (BM) algorithms
We will also take a brief look at approximate
string matching, which is used for spell checking
and other applications

2
Straightforward Algorithm

Let s be our string and sub be our substring
Start searching s and sub at position 0
If we have a match, increment both string
pointers
If we have a mismatch, start over at s from the
next position after the previous match began and
start sub at 0

current1 = 0; current2 = 0; current = 0; found = false;
while (!found && current + m <= n) {
    while (current2 < m &&
           s.charAt(current1) == sub.charAt(current2)) {
        current1++; current2++;
    }
    if (current2 == m) found = true;
    else {
        current++;
        current1 = current; current2 = 0;
    }
}

[Figure: sub aligned under s at position current. Search until a mismatch, then realign sub to start at current + 1]
3
Example

Consider s = AAAAAAAAAAAAAAB and sub = AAAB
Start with current = 0, current1 = 0, current2 = 0
We immediately have a match, so increment
current1 and current2, and again we have a match
Increment current1 and current2, and again we
have a match
Increment current1 and current2 again, but
now we have a mismatch
So, reset current to 1, reset current1 to 1,
reset current2 to 0 and start again; we will
have 3 more matches
And so on
When do we finally get a full match? Only on the
12th iteration of our outer while loop
Number of comparisons = 48, size of s = 15, size
of sub = 4; the number of comparisons is
(n - m + 1) * m = 12 * 4 = 48
So, this search is Θ(mn), much too inefficient
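
The comparison count in this example can be checked with a small Java sketch of the straightforward algorithm. The class name NaiveSearch, the method find, and the comparisons counter are illustrative names, not taken from the slides, and positions are 0-based as in Java.

```java
// Sketch only: names are illustrative, not from the slides.
final class NaiveSearch {
    /** Character comparisons made by the most recent call to find(). */
    static int comparisons;

    /** Returns the 0-based index of the first occurrence of sub in s, or -1. */
    static int find(String s, String sub) {
        int n = s.length(), m = sub.length();
        comparisons = 0;
        // Try every alignment of sub against s, left to right.
        for (int current = 0; current + m <= n; current++) {
            int current2 = 0;
            while (current2 < m) {
                comparisons++;
                if (s.charAt(current + current2) != sub.charAt(current2)) break;
                current2++;
            }
            if (current2 == m) return current;   // all m characters matched
        }
        return -1;
    }
}
```

Calling find on a string of fourteen A's followed by B, with sub AAAB, should return index 11 (the 12th alignment) after 48 comparisons, which agrees with (n - m + 1) * m.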

4
Knuth-Morris-Pratt Algorithm

Observation: from our previous example, notice
that if we had a match in our substring AAAB
up to the B, what does that tell us?
It tells us that the string has a sequence of
AAAx (where x is some character other than B)
Once we found a mismatch, did we have to start
our substring matching over at 0? No, we could
actually have realigned our matching to start at
the last A in the substring
We know that the string had AAAx
If we start comparing our substring again
starting at the string's next position, we see
AAx, but we already know that AA matched, so the
first two A's do not need to be compared again
So, let's realign our substring at a clever
position and start matching from there; if we do
this successfully, it will reduce the complexity
of our string searching

5
Creating an Align Array

In order to figure out where to start searching
in our substring, we need to figure out how to
realign it to the string after a failure
Note: the book presents this idea as a finite
automaton, but this makes the idea somewhat more
complicated than we need
In truth, we can just create an align array
If we have a mismatch at character k in our
substring, at what character do we start over
again in the substring?
The align array will tell us this
Note: the book calls this array fail, so we will
call it that from now on
If we have a mismatch at k, we realign sub to
position fail[k]

6
Character Repetition

The fail array is generated by considering only
the substring itself, looking for repetition
Once repetition is found, we build on that
repetition
Reconsider our substring AAAB
If we have a mismatch after the second A, it
means that the preceding character in the string
was A (for instance, we matched against AABB),
and we do not need to start our substring over
from the start; we already have 1 matching A
If we have a mismatch after the third A, it means
that two preceding characters have already
matched (for instance, if the string is AAAC), so
we already have 2 matching A's
So, we look for repetition in the substring
What about ABCABC, is there repetition there?
Yes. Consider a mismatch against ABCABF: after
we match 5 characters, we have a mismatch, but we
do not have to start our substring over, because
we already know there is a match up to AB

7
Generating the Fail Array

This algorithm is somewhat tricky to understand
We have a substring sub of length m
fail[1] is always 0 (we cannot fail before we
have 1 comparison)
If we fail after the first character, we must
realign to the beginning and thus fail[2] = 1
Beyond this point, where we realign after a
failure depends on the repetition found

fail[1] = 0;
for (k = 2; k <= m; k++) {
    s = fail[k-1]; more = true;
    while (s >= 1 && more)
        if (sub.charAt(s) != sub.charAt(k-1)) s = fail[s];
        else more = false;
    fail[k] = s + 1;
}
Note that this algorithm indexes positions from
1..m, whereas Java and C arrays start at 0, so
we will have to make a modification when dealing
with Java/C/C++ code
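
One possible way to make that modification in Java is sketched below. It keeps the slides' 1-based meaning of fail by allocating one extra array slot and reading pattern position p as sub.charAt(p - 1). The class name FailArray and the method name compute are illustrative, not from the book.

```java
// Sketch only: names are illustrative, not from the slides or the book.
final class FailArray {
    /**
     * Returns fail[1..m] with the slides' 1-based meaning; fail[0] is unused.
     * The 1-based pattern position p is read as sub.charAt(p - 1).
     */
    static int[] compute(String sub) {
        int m = sub.length();
        int[] fail = new int[m + 1];   // fail[1] = 0 by default
        for (int k = 2; k <= m; k++) {
            int s = fail[k - 1];
            // Back up while positions s and k-1 (1-based) hold different characters.
            while (s >= 1 && sub.charAt(s - 1) != sub.charAt(k - 2)) {
                s = fail[s];
            }
            fail[k] = s + 1;
        }
        return fail;
    }
}
```

For AAAB this should yield 0, 1, 2, 3 in positions 1 through 4.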
8
Complexity of Constructing Fail Array

The complexity is no more than O(m^2)
There is a for-loop that iterates m times and,
within the loop, we iterate while there is no
repetition between the previous character (k-1)
and the character at s, backing up to fail[s]
However, is it really as bad as Θ(m^2)?
If there is a match between the characters at
(k-1) and s, then we exit the inner loop
If there is no match, then we back up to fail[s],
which itself is a value based on a lack of
previous repetition
If there is no repetition for some character j,
then fail[j] = 1, so if there is no repetition,
we are all the way back to the beginning of the
substring and we exit the inner while loop
So ultimately, the number of times through the
inner while loop is bounded by the amount of
repetition
This will be more apparent when we look at some
examples next
The complexity of this code is then Θ(m)
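
The linear bound can be observed by counting comparisons. The reasoning is that s grows by at most 1 per pass of the for-loop and every failed comparison shrinks s, so failed comparisons cannot exceed m in total, and there is at most one successful comparison per pass. The Java sketch below instruments the construction under that reasoning; FailCost and comparisons are illustrative names, and positions are shifted to Java's 0-based charAt.

```java
// Sketch only: an instrumented copy of the fail-array construction.
final class FailCost {
    /** Counts character comparisons made while building the fail array for sub. */
    static int comparisons(String sub) {
        int m = sub.length();
        int[] fail = new int[m + 1];   // 1-based; fail[0] unused, fail[1] = 0
        int count = 0;
        for (int k = 2; k <= m; k++) {
            int s = fail[k - 1];
            while (s >= 1) {
                count++;
                if (sub.charAt(s - 1) == sub.charAt(k - 2)) break;   // repetition found
                s = fail[s];                                         // back up
            }
            fail[k] = s + 1;
        }
        return count;
    }
}
```

For SSSSHH (m = 6) this counts 7 comparisons, and for ABCDE (m = 5) it counts 3, both well under 2m.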

9
Some Examples of the Fail Array

AAAB
fail[1] = 0 and fail[2] = 1 by default
for k = 3, let s = fail[2] = 1
does sub.charAt(s) == sub.charAt(k-1)? Yes, A ==
A, so set fail[k] = s + 1 = 2
for k = 4, let s = fail[3] = 2
does sub.charAt(s) == sub.charAt(k-1)? Yes, A ==
A, so set fail[k] = s + 1 = 3
So AAAB has a fail array of 0, 1, 2, 3
ABACAB
fail[1] = 0, fail[2] = 1
for k = 3, let s = fail[2] = 1, since
sub.charAt(s) != sub.charAt(k-1) (no repetition
between 1st and 2nd characters), set s = fail[s]
= fail[1] = 0, so fail[3] = s + 1 = 1
for k = 4, let s = fail[3] = 1, since
sub.charAt(s) == sub.charAt(k-1) (there is
repetition between 1st and 3rd characters), set
fail[k] = s + 1 = 2
for k = 5, let s = fail[4] = 2, since sub.charAt(s)
!= sub.charAt(k-1) (no repetition between 2nd and
4th characters), set s = fail[s] = fail[2] = 1;
no repetition between 1st and 4th characters
either, so s = fail[1] = 0 and fail[5] = s + 1 = 1
for k = 6, let s = fail[5] = 1, since sub.charAt(s)
== sub.charAt(k-1) (repetition between 1st and
5th characters), set fail[k] = s + 1 = 2
So ABACAB has a fail array of 0, 1, 1, 2, 1, 2

10
More Fail Array Examples

SSSSHH
fail[1] = 0, fail[2] = 1
k = 3, s = fail[2] = 1, repetition between 1st and
2nd characters, so fail[3] = s + 1 = 2
k = 4, s = fail[3] = 2, repetition between 2nd and
3rd characters, so fail[4] = s + 1 = 3
k = 5, s = fail[4] = 3, repetition between 3rd and
4th characters, so fail[5] = s + 1 = 4
k = 6, s = fail[5] = 4, no repetition between 4th
and 5th characters, so s = fail[s] = fail[4] = 3,
no repetition between 3rd and 5th characters
either, so s = fail[s] = fail[3] = 2, no
repetition between 2nd and 5th characters, so s
= fail[s] = fail[2] = 1, no repetition between
1st and 5th characters, so s = fail[1] = 0 and
fail[6] = s + 1 = 1
SSSSHH has a fail array of 0, 1, 2, 3, 4, 1
ABCABC
will have a fail array of 0, 1, 1, 1, 2, 3
ABCDE
will have a fail array of 0, 1, 1, 1, 1

11
The KMP Scan Algorithm

Now that we have seen how to generate the fail
array, how do we use it?
In the straightforward matching algorithm, for
any mismatch, we started over at both the
beginning of sub and at the next position in s
after the first match
Here, however, we have scanned successfully up
through some point in s until we have a mismatch
We need to realign sub appropriately using the
fail array and continue searching s at the point
of failure
The code is given below
The complexity of this code is Θ(n)
Note: KMP's overall complexity is Θ(n + m) because
it requires first computing the fail array

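match = -1; j = 1; k = 1;   // s and sub are indexed from 1 here
while (j <= n || k > m) {
    if (k > m) { match = j - m; break; }    // success
    else if (k == 0) { j++; k = 1; }        // start pattern over
    else if (s.charAt(j) == sub.charAt(k)) { j++; k++; }
    else k = fail[k];
}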
12
Example

s = AABACAABACABAAB
sub = ABACAB with fail = 0, 1, 1, 2, 1, 2
Start with j = 1, k = 1, match, increment j and k
At j = 2, k = 2, we have a mismatch, so reset k =
fail[2] = 1 (start sub over)
At j = 2, k = 1, match, increment j and k
At j = 3, k = 2, match, increment j and k
At j = 4, k = 3, match, increment j and k
At j = 5, k = 4, match, increment j and k
At j = 6, k = 5, match, increment j and k
At j = 7, k = 6, mismatch (bummer), reset k =
fail[6] = 2
At j = 7, k = 2, mismatch, reset k = fail[2] = 1
At j = 7, k = 1, match, increment j and k
At j = 8, k = 2, match, increment j and k
Matches at j = 9, k = 3, and j = 10, k = 4, and
j = 11, k = 5
At j = 12, k = 6, match, increment j and k
k = 7 > m, so we have found our substring at j - m
(13 - 6 = 7)
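
The trace above can be reproduced with a Java sketch of the scan. It keeps j and k 1-based, as in the slides, and subtracts 1 only when calling charAt. KmpSearch, computeFail, and find are illustrative names; the loop condition is written so that a match ending at the very end of s is still reported.

```java
// Sketch only: names are illustrative, not from the slides or the book.
final class KmpSearch {
    /** fail[1..m] with the slides' 1-based meaning; fail[0] is unused. */
    static int[] computeFail(String sub) {
        int m = sub.length();
        int[] fail = new int[m + 1];
        for (int k = 2; k <= m; k++) {
            int s = fail[k - 1];
            while (s >= 1 && sub.charAt(s - 1) != sub.charAt(k - 2)) s = fail[s];
            fail[k] = s + 1;
        }
        return fail;
    }

    /** Returns the 1-based position where sub first occurs in s, or -1. */
    static int find(String s, String sub) {
        int n = s.length(), m = sub.length();
        int[] fail = computeFail(sub);
        int j = 1, k = 1;
        while (j <= n && k <= m) {
            if (k == 0) { j++; k = 1; }                       // start pattern over
            else if (s.charAt(j - 1) == sub.charAt(k - 1)) { j++; k++; }
            else k = fail[k];                                 // realign sub, keep j
        }
        return (k > m) ? j - m : -1;                          // k > m means all of sub matched
    }
}
```

With s = AABACAABACABAAB and sub = ABACAB, find should return 7, the position reached at the end of the trace.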

13
Another Idea Boyer-Moore

Consider the case where our substring is "hello"
Rather than starting at position 1 in the string,
let's look ahead to position 5 to see if it is an
o
If so, then I have to back up and check the rest
of "hello", but what if it is not an o? Does
that tell me something?
If I also determine that the 5th character is not
an h, e, or l, I can jump ahead 5 more
characters and try again
If I can apply this idea correctly, I can improve
over KMP by a factor of m; that is, instead of
taking Θ(n + m) I should be able to perform my
scanning in about Θ(n/m + m)
Of course there are problems: how do I know if I
can skip over m characters? I have to compare
the character at position j of the string to all
m characters of sub, which defeats the whole
purpose unless I can be clever
Also, if there are partial matches, I have to
back up, so that reduces the improvement, but it
will still be better than KMP

14
Jump Array

Like KMP's fail array, we need an array to tell
us where to reposition our substring
But in this case, we will jump back in the string
if we have a match of a character in s
And jump ahead in the string if we have a
mismatch
The amount to jump back depends on the character
matched in sub
The amount to jump ahead depends on how large sub
is
We compute this array in a way similar to the
KMP fail array; see the code below

Assign each letter of the alphabet a jump
of m (jump over m characters)
unless the character is an element of sub, in
which case jump an appropriate amount to be at
the potential beginning of sub in the string

char ch; int k; for (ch = 0; ch < alphabetSize; ch++) jump[ch] = m;
for (k = 1; k <= m; k++) jump[sub.charAt(k)] = m - k;

Notice that the first loop iterates once for
every character in the alphabet
ASCII has 128 characters
A 16-bit Unicode char (as in Java) has 65,536 values
So, this algorithm is Θ(|alphabet| + m)
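
A Java sketch of this jump computation follows, assuming the alphabet is the full range of Java char values. JumpArray, ALPHABET, and compute are illustrative names; the slide's 1-based position k is read as sub.charAt(k - 1).

```java
// Sketch only: names are illustrative, not from the slides or the book.
final class JumpArray {
    /** Assumption: the alphabet is every possible Java char value (65,536 of them). */
    static final int ALPHABET = Character.MAX_VALUE + 1;

    /**
     * jump[c] is the distance from the last occurrence of c in sub
     * to the end of sub, or m if c does not occur in sub at all.
     */
    static int[] compute(String sub) {
        int m = sub.length();
        int[] jump = new int[ALPHABET];
        for (int ch = 0; ch < ALPHABET; ch++) jump[ch] = m;
        // Later positions overwrite earlier ones, so the last occurrence wins.
        for (int k = 1; k <= m; k++) jump[sub.charAt(k - 1)] = m - k;
        return jump;
    }
}
```

For the substring hello, this gives a jump of 4 for h, 3 for e, 1 for l, 0 for o, and 5 for any other character.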

15
Enhancement

Notice that if we have a match
we may not have to compare all characters, only
those until we have a mismatch
And if we have a mismatch, we can take advantage
of knowing the repetition in our substring to
jump ahead some intermediate amount
So, we enhance the jump array by including the
KMP fail information
This makes the algorithm to compute jump
extremely complicated, so we will skip the details
The algorithm itself is given on page 501
But the idea is that the jump array now
represents how far to jump ahead on a miss and
jump behind on a match
The algorithm to compute the backward skips is
Θ(|alphabet| + m)
The algorithm to compute the forward jumps is
Θ(m)
So, the overall complexity of computing jump in
the Boyer-Moore algorithm is Θ(|alphabet| + m)

16
Boyer-Moore Scan Algorithm

Like the KMP algorithm, the BM algorithm simply
searches through both the string and substring
looking for matches
Upon a match, however, we scan backwards
On a mismatch, we jump either forward or backward
depending on whether the character in the string
exists in the substring at all or not
This algorithm is given on page 503, but is
omitted here
The complexity of BM is, in the worst case, Θ(n)
but can improve to be as good as Θ(n/m)
Based on empirical studies, if m ≥ 5, BM only
has to examine 24-30% of the text, a vast
improvement over KMP, which must examine 100% of
the text!
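
Since the scan itself is omitted here, the following Java sketch shows a simplified variant that uses only the character-based jump array, without the fail-based enhancement. It is not the algorithm from page 503, and without that enhancement it can degrade to roughly m times n comparisons on unfavorable inputs. SimpleBoyerMoore, computeJump, and find are illustrative names, and positions are 0-based.

```java
// Sketch only: a simplified scan using just the character jump array.
final class SimpleBoyerMoore {
    static int[] computeJump(String sub) {
        int m = sub.length();
        int[] jump = new int[Character.MAX_VALUE + 1];
        for (int ch = 0; ch < jump.length; ch++) jump[ch] = m;
        for (int k = 0; k < m; k++) jump[sub.charAt(k)] = m - 1 - k;
        return jump;
    }

    /** Returns the 0-based index of the first occurrence of sub in s, or -1. */
    static int find(String s, String sub) {
        int n = s.length(), m = sub.length();
        if (m == 0) return 0;
        int[] jump = computeJump(sub);
        int j = m - 1, k = m - 1;          // start at the END of sub
        while (j < n) {
            if (k < 0) return j + 1;       // scanned backwards past the start: full match
            if (s.charAt(j) == sub.charAt(k)) { j--; k--; }
            else {
                // Move ahead by the character jump, but always far enough
                // that sub ends at least one position later than before.
                j += Math.max(jump[s.charAt(j)], m - k);
                k = m - 1;
            }
        }
        return -1;
    }
}
```

Searching for hello in "say hello world" inspects the h at index 4 first, jumps 4 positions to the o at index 8, and then confirms the match by scanning backwards.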

17
Approximate String Matching

A variation of the string matching application is
to determine if a string partially matches
This is much more difficult: what is a partial
match?
Consider the string "approximate". Which of
these are partial matches?
aproximate approximately appropriate
proximate approx approximat
apropos approxximate
A partial match can be thought of as one that has
k differences from the string, where k is some
small integer (for instance 1 or 2)
A difference occurs if string1.charAt(j) !=
string2.charAt(j) or if string1.charAt(j) does
not appear in string2 (or vice versa)
The former case is known as a revise difference,
the latter is a delete or insert difference (a
character has been deleted or inserted)
What about two characters that appear out of
position? For instance, approximate vs.
apporximate? Should that be 1 or 2 differences?
A spell checker will determine the number of
differences between the given word and all
closely matching words in the dictionary and
order the words based on number of differences
(fewer differences at the top, starting with 1
difference and going to some maximum k such as
k = 3)

18
Determining Differences a DP Solution

Comparing two strings to determine the number of
differences is challenging
A dynamic programming example is given in section
11.5 in which D[i][j] is the number of
differences that occur between characters 1..i of
string1 and 1..j of string2
We define 4 costs
matchCost = D[i-1][j-1] if string1.charAt(i) ==
string2.charAt(j)
reviseCost = D[i-1][j-1] + 1 if string1.charAt(i)
!= string2.charAt(j)
insertCost = D[i-1][j] + 1
deleteCost = D[i][j-1] + 1
D[i][j] = minimum(matchCost, reviseCost,
insertCost, deleteCost)
Now, compute D for all i starting at 0 up to the
length of string1 and for all j starting at 0 up
to the length of string2
If n is the size of the larger string (string2),
and we are searching for up to k differences, the
algorithm is Θ(kn)
A brief example is shown on page 508, table 11.2
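
The recurrence can be sketched in Java for the spell-checking case, where two whole words are compared. The base cases D[i][0] = i and D[0][j] = j are an assumption suited to whole-word comparison and may differ from the book's setup. This version fills the entire table, so it takes time proportional to the product of the two lengths and does not apply the cutoff at k differences. Differences and count are illustrative names.

```java
// Sketch only: whole-word comparison with assumed base cases.
final class Differences {
    /** Minimum number of revise, insert, and delete differences between two words. */
    static int count(String string1, String string2) {
        int n1 = string1.length(), n2 = string2.length();
        int[][] d = new int[n1 + 1][n2 + 1];
        // Assumption: comparing a prefix against nothing costs one difference per character.
        for (int i = 0; i <= n1; i++) d[i][0] = i;
        for (int j = 0; j <= n2; j++) d[0][j] = j;
        for (int i = 1; i <= n1; i++) {
            for (int j = 1; j <= n2; j++) {
                boolean same = string1.charAt(i - 1) == string2.charAt(j - 1);
                int matchOrRevise = d[i - 1][j - 1] + (same ? 0 : 1);   // matchCost or reviseCost
                int insertCost = d[i - 1][j] + 1;
                int deleteCost = d[i][j - 1] + 1;
                d[i][j] = Math.min(matchOrRevise, Math.min(insertCost, deleteCost));
            }
        }
        return d[n1][n2];
    }
}
```

Under these costs, approximate vs. aproximate is 1 difference, and the out-of-position pair approximate vs. apporximate comes out as 2, since this scheme has no single-step swap.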