Title: String Searching and Matching
2 Naive String Matching
- Problem: check whether a given string occurs in a given text.
- This is a common problem in text-processing programs and many other applications.
- We first consider the problem of matching strings without wildcards.
- Naive algorithm:
  - Match the string against the text character by character.
  - When there is a mismatch, shift the whole string down by one character against the text, and start again at the beginning of the string.
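The naive algorithm above can be sketched in Java as follows. This is a minimal illustration, not code from the slides: the class name NaiveSearch and the convention of returning -1 when there is no match are assumptions.

```java
final class NaiveSearch {
    /** Returns the index of the first occurrence of pattern in text, or -1 if there is none. */
    static int indexOf(String text, String pattern) {
        int n = text.length(), m = pattern.length();
        // Try every alignment of the pattern against the text, left to right.
        for (int start = 0; start + m <= n; start++) {
            int j = 0;
            // Match character by character until a mismatch or the end of the pattern.
            while (j < m && text.charAt(start + j) == pattern.charAt(j)) j++;
            if (j == m) return start;   // all m characters matched
            // Otherwise shift the pattern by one and start again at its first character.
        }
        return -1;
    }
}
```

For example, NaiveSearch.indexOf("CARPETS NEED CLEANING", "LEAN") returns 14.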
3 Worst Case of Naive Method
- Consider the following text:
  - aaaaaaaaa........aaab (total text length N)
- Consider the following search string:
  - aaa....b (M-1 times a, followed by one b)
- Using the naive method, we match up to the Mth character (b) and, after each mismatch, restart at the 1st character (a).
- The mismatch occurs N-M times.
- The match succeeds at position N-M+1.
- Total number of comparisons: M(N-M+1).
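The comparison count for this worst case can be checked with a small experiment. The sketch below is an illustration under assumptions: the class and method names are made up here, positions are counted from 0 (so the match index is N-M), and String.repeat requires Java 11 or later.

```java
final class NaiveWorstCase {
    /** Naive search that also counts character comparisons; returns {matchIndex, comparisons}. */
    static long[] searchCounting(String text, String pattern) {
        int n = text.length(), m = pattern.length();
        long comparisons = 0;
        for (int start = 0; start + m <= n; start++) {
            int j = 0;
            while (j < m) {
                comparisons++;   // one text/pattern character comparison
                if (text.charAt(start + j) != pattern.charAt(j)) break;
                j++;
            }
            if (j == m) return new long[] {start, comparisons};
        }
        return new long[] {-1, comparisons};
    }

    public static void main(String[] args) {
        int n = 20, m = 5;
        String text = "a".repeat(n - 1) + "b";      // aaa...ab, length N
        String pattern = "a".repeat(m - 1) + "b";   // M-1 times a, then b
        long[] r = searchCounting(text, pattern);
        System.out.println("match at " + r[0] + ", comparisons " + r[1]
                + ", M(N-M+1) = " + (long) m * (n - m + 1));
    }
}
```

With N = 20 and M = 5 the search makes 80 comparisons, which agrees with M(N-M+1) = 5 x 16.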
4 Optimisations
- We consider three improved string matching algorithms.
- Knuth-Morris-Pratt algorithm
  - could solve the above problem with a number of comparisons on the order of N+M
- Boyer-Moore algorithm
  - could solve the above problem with a number of comparisons on the order of N+M
  - but on typical text can often get by with only about N/M comparisons
- Rabin-Karp algorithm
  - about M+N comparisons on average
5 Knuth-Morris-Pratt algorithm
- When a mismatch is detected, say at the jth character of the pattern string, we have already successfully matched j-1 characters.
- We try to take advantage of this to decide where to restart matching.
- Example: suppose the string 10100 mismatches at the 5th character.
- Then we know that the text matched so far consists of 1010X (where X is unknown).
- We can then restart by matching the 3rd character (1) of the string against the X
  - (since the second 10 already matched in the text could rematch with the first 10 in the pattern).
6 Where to re-start matching
- Consider two pointers i and j.
  - i gives the current position in the text (starting at 0).
  - j gives the current position in the pattern (starting at 0).
- The key feature of the KMP algorithm is that i never decreases. We never have to backtrack on the text.
- When a mismatch occurs at pattern position j, we look at the j characters already matched.
- If some non-empty prefix of length k (k < j) of those characters also occurs at the end of them, take the longest such prefix and restart the matching at position i in the text and k in the pattern.
- If there is no such prefix, then restart the matching at position i in the text and 0 in the pattern.
7 KMP preprocessing
- Given a string pattern, we can precompute, for each string position j, where to restart matching when a mismatch occurs at j.
- Slide a copy of the first j characters of the pattern over itself.
- Move from left to right and stop when all overlapping characters match (or the copy slides past position j).
- The number of overlapping characters tells us where to restart comparing the pattern.
8 KMP preprocessing - example
- Let the pattern be 10100111

  J   next[J]   characters matched   overlapping prefix
  1   0         1                    (none)
  2   0         10                   (none)
  3   1         101                  1
  4   2         1010                 10
  5   0         10100                (none)
  6   1         101001               1
  7   1         1010011              1

next[J] is the position in the pattern at which to restart when a mismatch occurs at position J (note that the J index starts at 0). Also, define next[0] = -1.
The overlapping prefix is the prefix of the already matched characters that overlaps their end at the point of the mismatch.
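The sliding procedure can be turned directly into a brute-force computation of the table. The sketch below is only an illustration of that idea (it is not the efficient construction): the class name NextBySliding and the method name nextTable are assumptions.

```java
final class NextBySliding {
    /**
     * next[j] = number of overlapping characters when a copy of the first j pattern
     * characters is slid right over itself until all overlapping characters match;
     * next[0] = -1 by definition.
     */
    static int[] nextTable(String p) {
        int m = p.length();
        int[] next = new int[m];
        if (m == 0) return next;
        next[0] = -1;
        for (int j = 1; j < m; j++) {
            // Sliding by 1, 2, ... leaves an overlap of k = j-1, j-2, ... characters.
            int k = j - 1;
            // Stop at the first (largest) k for which p[0..k) equals p[j-k..j).
            while (k > 0 && !p.regionMatches(0, p, j - k, k)) k--;
            next[j] = k;
        }
        return next;
    }
}
```

For the pattern 10100111 this yields -1, 0, 0, 1, 2, 0, 1, 1 for J = 0 to 7, in agreement with the table.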
9 Example using the next[j] table
- Consider the example on the previous slide.
- Suppose a mismatch occurs at j = 4.
- Consulting the table, we see that next[4] = 2.
- Hence re-start the match with j = 2.
- Thus the character in the text marked X is next matched with the 3rd character of the pattern. Note that the first two characters 10 are already matched up.
- Note that if next[j] = 0, there is no overlapping prefix. In this case we re-start with j = 0.

  Before the shift (mismatch at j = 4):
    Text     .....1010X
    Pattern       10100111
  After the shift (re-start with j = 2):
    Text     .....1010X
    Pattern         10100111
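10 Constructing the next[J] array
  int[] initNext(String P) {
    int i, j, M = P.length();
    int[] next = new int[M + 1];  // the loop below also writes next[M]
    next[0] = -1; i = 0; j = -1;
    while (i < M) {  // match pattern with itself
      while (j >= 0 && P.charAt(i) != P.charAt(j))
        j = next[j];  // fall back to a shorter overlapping prefix
      i++; j++;
      next[i] = j;
    }
    return next;
  }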
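11 KMP algorithm
  int KMP(String P, String T) {
    int i, j, M = P.length(), N = T.length();
    int[] next = initNext(P);
    i = 0; j = 0;
    while (j < M && i < N) {
      while (j >= 0 && P.charAt(j) != T.charAt(i))
        j = next[j];  // after mismatch
      i++; j++;
    }
    if (j == M) return i - M;  // match starts at position i-M
    else return i;             // no match: i equals N
  }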
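For reference, the two routines above can be put together into one self-contained class. This is a sketch under assumptions: the class name Kmp and the method name search are made up here, and the test text in main is an arbitrary binary string chosen for illustration.

```java
final class Kmp {
    /** Builds the restart table; next[0] = -1, and the array has length M+1. */
    static int[] initNext(String p) {
        int m = p.length();
        int[] next = new int[m + 1];
        next[0] = -1;
        int i = 0, j = -1;
        while (i < m) {
            // Match the pattern against itself, falling back on mismatch.
            while (j >= 0 && p.charAt(i) != p.charAt(j)) j = next[j];
            i++; j++;
            next[i] = j;
        }
        return next;
    }

    /** Returns the index of the first match, or t.length() if there is none. */
    static int search(String p, String t) {
        int m = p.length(), n = t.length();
        int[] next = initNext(p);
        int i = 0, j = 0;
        while (j < m && i < n) {
            // i never decreases; only the pattern position j falls back.
            while (j >= 0 && p.charAt(j) != t.charAt(i)) j = next[j];
            i++; j++;
        }
        return (j == m) ? i - m : n;
    }

    public static void main(String[] args) {
        System.out.println(search("10100111", "100111010010100010100111000111"));
    }
}
```

Returning the text length for "no match" follows the convention of the routine above; returning -1 instead would be an equally reasonable design choice.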
12 KMP - analysis
- The KMP algorithm never needs to backtrack on the text string.
  - This is an advantage when matching on a continuous input stream from an external device.
- The number of comparisons is linear: on the order of M+N (length of pattern + length of text).
- The next[j] array could be hard-wired into the KMP algorithm (see next slide).
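Because the text pointer never moves backwards, the text can be consumed one character at a time without buffering. The following sketch illustrates this with a java.io.Reader; the class name KmpStream, the -1 "not found" result, and the wrapping of IOException in UncheckedIOException are assumptions made for the example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

final class KmpStream {
    /** Returns the offset of the first match in the stream, or -1 if the stream ends first. */
    static long search(String p, Reader in) {
        int m = p.length();
        if (m == 0) return 0;
        // Build the restart table exactly as in initNext.
        int[] next = new int[m + 1];
        next[0] = -1;
        for (int i = 0, k = -1; i < m; ) {
            while (k >= 0 && p.charAt(i) != p.charAt(k)) k = next[k];
            i++; k++;
            next[i] = k;
        }
        long pos = 0;   // number of characters consumed so far
        int j = 0;      // current position in the pattern
        try {
            int c;
            while ((c = in.read()) != -1) {
                // Each character is read once and never revisited.
                while (j >= 0 && p.charAt(j) != (char) c) j = next[j];
                j++; pos++;
                if (j == m) return pos - m;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return -1;
    }
}
```

A call such as KmpStream.search("LEAN", new java.io.StringReader("CARPETS NEED CLEANING")) returns 14.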
13 State Machine KMP match
  // Pseudocode for the pattern 10100111 (Java itself has no goto statement)
  int KMP(String T) {
    int i = -1;
    s0: i++;
    s1: if (T.charAt(i) != '1') goto s0; i++;
    s2: if (T.charAt(i) != '0') goto s1; i++;
    s3: if (T.charAt(i) != '1') goto s1; i++;
    s4: if (T.charAt(i) != '0') goto s2; i++;
    s5: if (T.charAt(i) != '0') goto s3; i++;
    s6: if (T.charAt(i) != '1') goto s1; i++;
    s7: if (T.charAt(i) != '1') goto s2; i++;
    s8: if (T.charAt(i) != '1') goto s2; i++;
    return i - 8;
  }
14 The Boyer-Moore String Algorithm
- This method can give substantially faster searches where the alphabet contains a large number of symbols.
  - E.g. normal text (128 or 256 character alphabet) rather than binary strings.
- The BM method incorporates two main ideas:
  - start matching at the right of the pattern, so as to find the rightmost mismatch
  - use information about the possible alphabet of the text, as well as the characters in the pattern
15 Example
Search for LEAN in CARPETS NEED CLEANING REGULARLY

  CARPETS NEED CLEANING REGULARLY
  LEAN

N and P mismatch. Furthermore, P does not occur anywhere in the string LEAN. Hence move the string all the way past P and compare with N again.

  CARPETS NEED CLEANING REGULARLY
  LEAN
      LEAN
          LEAN
              LEAN

N and E mismatch, but E occurs in LEAN, so we move the E of LEAN to this position.
16 Boyer-Moore preprocessing
- In order to implement the above idea, consider the characters in the alphabet which makes up the text.
  - C0, C1, ..., Ck (k+1 characters in the alphabet)
- Initialise an array skip such that
  - for each Cj in the pattern string, skip[j] is the distance of the rightmost occurrence of Cj from the right-hand end of the pattern
  - skip[j] = M otherwise, where M is the length of the pattern.
17 The skip array - example
- Suppose the pattern is LEAN and the alphabet is <blank>, A, B, ..., Z (C0, C1, ..., C26).
  - skip[12] = 3 (L)
  - skip[5] = 2 (E)
  - skip[1] = 1 (A)
  - skip[14] = 0 (N)
  - skip[X] = 4 (otherwise)
- skip[C] is the number of characters to move the pattern to the right after a mismatch against the text character with index C.
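A possible way to build this array for the 27-character alphabet is sketched below. The names SkipArray, index and initSkip are illustrative assumptions, and the index function assumes the input contains only blanks and upper-case letters.

```java
final class SkipArray {
    /** Assumed index function: blank -> 0, 'A'..'Z' -> 1..26. */
    static int index(char c) {
        return c == ' ' ? 0 : c - 'A' + 1;
    }

    /** skip[k] = distance of the rightmost occurrence of character k from the right end, or M if absent. */
    static int[] initSkip(String p) {
        int m = p.length();
        int[] skip = new int[27];
        java.util.Arrays.fill(skip, m);   // characters not in the pattern
        // Later (further right) occurrences overwrite earlier ones,
        // so each entry ends up describing the rightmost occurrence.
        for (int j = 0; j < m; j++) skip[index(p.charAt(j))] = m - 1 - j;
        return skip;
    }
}
```

For the pattern LEAN this gives skip[12] = 3, skip[5] = 2, skip[1] = 1, skip[14] = 0 and 4 for every other entry.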
18 Using the skip array
- Try to match the pattern from right to left.
- Suppose j characters have matched, and a mismatch occurs between the text character Cn (with index n) and the (M-j)th position of the pattern.
- Get the value of skip[n].
- If j >= skip[n], then shift the pattern by 1 (since we have already passed the rightmost occurrence of Cn in the pattern).
- Else shift the pattern skip[n]-j positions, to try to align Cn in the text with the rightmost occurrence of Cn in the pattern.
19 Example - shifting using skip
- Pattern: X X A X X X Z Z Z Z
- M = 10 (length of pattern)
- skip[1] = 7 (distance of rightmost A from right)
- mismatch at position 10-4 = 6

  Y Y Y Y Y A Z Z Z Z Z Z Z Z Z    text
  X X A X X X Z Z Z Z              mismatched pattern
        X X A X X X Z Z Z Z        pattern shifted 3 positions

- Shift the pattern by 7-4 = 3 positions
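The shift computation in this example can be reproduced with a few lines of Java. This is a sketch under assumptions: the names SkipShiftExample, skip and shiftAt are made up, the skip value is computed on demand with lastIndexOf rather than from a precomputed array, and the shift is taken to be 1 whenever the number of matched characters is at least the skip value.

```java
final class SkipShiftExample {
    /** Distance of the rightmost occurrence of c from the right-hand end of p, or p.length() if absent. */
    static int skip(String p, char c) {
        int last = p.lastIndexOf(c);
        return last < 0 ? p.length() : p.length() - 1 - last;
    }

    /** Shift for the alignment with p's first character at text offset 'at'; 0 means a full match. */
    static int shiftAt(String p, String t, int at) {
        int m = p.length();
        int matched = 0;   // characters matched so far, counted from the right
        while (matched < m && p.charAt(m - 1 - matched) == t.charAt(at + m - 1 - matched)) matched++;
        if (matched == m) return 0;
        int s = skip(p, t.charAt(at + m - 1 - matched));
        // Already past the rightmost occurrence: shift by 1; otherwise align that occurrence.
        return (matched >= s) ? 1 : s - matched;
    }
}
```

With pattern XXAXXXZZZZ and text YYYYYAZZZZZZZZZ at offset 0, four characters match, the skip value for A is 7, and shiftAt returns 7-4 = 3.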
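20 Boyer-Moore Algorithm (1)
  int boyermoore1(String P, String T) {
    int i, j, t, M = P.length(), N = T.length();
    if (M > N) return N;             // pattern longer than text: no match
    initskip(P);                     // initialise skip array
    i = M - 1; j = M - 1;
    while (j >= 0) {
      while (T.charAt(i) != P.charAt(j)) {
        t = skip[index(T.charAt(i))];
        if ((M - j) > t) i = i + M - j; else i = i + t;
        if (i >= N) return N;        // no match
        j = M - 1;
      }
      i--; j--;
    }
    return i + 1;                    // successful match: i stopped one before the start
  }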
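A self-contained variant of the routine above is sketched here so that it can be run directly. The assumptions are: the class name BoyerMooreSkip, a 256-entry skip array indexed by the character code (so every character is assumed to be below 256), and the text length as the "no match" result.

```java
final class BoyerMooreSkip {
    static final int ALPHABET = 256;   // assumption: all characters have codes below 256

    static int[] initSkip(String p) {
        int m = p.length();
        int[] skip = new int[ALPHABET];
        java.util.Arrays.fill(skip, m);
        for (int j = 0; j < m; j++) skip[p.charAt(j)] = m - 1 - j;   // rightmost occurrence wins
        return skip;
    }

    /** Returns the index of the first match, or t.length() if there is none. */
    static int search(String p, String t) {
        int m = p.length(), n = t.length();
        if (m == 0) return 0;
        if (m > n) return n;
        int[] skip = initSkip(p);
        int i = m - 1, j = m - 1;   // i: text position, j: pattern position, both scanned right to left
        while (j >= 0) {
            while (t.charAt(i) != p.charAt(j)) {
                // Move i to the new right-hand end of the pattern.
                i += Math.max(m - j, skip[t.charAt(i)]);
                if (i >= n) return n;
                j = m - 1;
            }
            i--; j--;
        }
        return i + 1;
    }
}
```

For example, BoyerMooreSkip.search("LEAN", "CARPETS NEED CLEANING REGULARLY") returns 14 after inspecting the text characters P, blank, D and E before reaching the matching alignment.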
21 Refinement to B-M Algorithm
- We can apply the idea of the KMP algorithm right-to-left.
- Sometimes this gives a larger skip value than the skip array used above.
- E.g. pattern BBAAA
  - skip[1] = 0 (skip value for A)
  - skip[2] = 3 (skip value for B)

  AAAAAAA
  BBAAA      mismatch on A in text

- The boyermoore1 algorithm shifts only one position.
- However, it is clear that AAA does not occur anywhere in the pattern to the left of positions 3, 4, 5.
22 Boyer-Moore Refinement (2)
- Build the KMP next array from right to left.
  - j = position of mismatch (from the right, starting at 0)
  - next[j] = no. of positions to shift the pattern to the right

  j   next[j]   pattern   characters matched
  1   1         BBAAA     A
  2   1         BBAAA     AA
  3   5         BBAAA     AAA
  4   5         BBAAA     BAAA

- Using the next array, a mismatch on B results in a shift of 5 positions.
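One way to compute such a right-to-left table is by brute force, trying shifts 1, 2, ... until the shifted pattern agrees with the characters already matched. The sketch below makes two assumptions that are not stated in the slides: it only requires agreement with the matched characters (it does not also require the mismatched position to change), and the names SuffixShift, shiftTable and agrees are invented for the example.

```java
final class SuffixShift {
    /** shift[j] = smallest shift s >= 1 consistent with the j rightmost characters having matched. */
    static int[] shiftTable(String p) {
        int m = p.length();
        int[] shift = new int[m];
        for (int j = 0; j < m; j++) {
            int s = 1;
            // A shift of m (sliding completely past) is always consistent.
            while (s < m && !agrees(p, j, s)) s++;
            shift[j] = s;
        }
        return shift;
    }

    /** True if p moved right by s equals p's last j characters wherever the two overlap. */
    private static boolean agrees(String p, int j, int s) {
        int m = p.length();
        // Positions m-j .. m-1 hold matched text; the shifted pattern covers positions >= s.
        for (int k = Math.max(m - j, s); k < m; k++) {
            if (p.charAt(k - s) != p.charAt(k)) return false;
        }
        return true;
    }
}
```

For BBAAA the table is 1, 1, 1, 5, 5 for j = 0 to 4, which agrees with the values listed above for j = 1 to 4.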
23 Refined Boyer-Moore Algorithm
- Initialise both the skip and the next arrays (right-to-left).
- Whenever a mismatch occurs, get the skip value for the mismatched character and the next value for the position of the mismatch.
- Shift the pattern right by whichever gives the greater value.
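The combined rule might be sketched as follows. This is an illustration under several assumptions rather than a definitive implementation: the class name BoyerMooreRefined is made up, characters are assumed to have codes below 256, the right-to-left table is built by brute force, and the skip-based shift is taken as the skip value minus the number of characters already matched.

```java
final class BoyerMooreRefined {
    /** Returns the index of the first match, or t.length() if there is none. */
    static int search(String p, String t) {
        int m = p.length(), n = t.length();
        if (m == 0) return 0;
        // skip[c]: distance of the rightmost c from the right end, or m if c is absent.
        int[] skip = new int[256];
        java.util.Arrays.fill(skip, m);
        for (int j = 0; j < m; j++) skip[p.charAt(j)] = m - 1 - j;
        // next[j]: smallest shift consistent with the j rightmost characters having matched.
        int[] next = new int[m];
        for (int j = 0; j < m; j++) {
            int s = 1;
            while (s < m && !agrees(p, j, s)) s++;
            next[j] = s;
        }
        int start = 0;   // text offset of the pattern's first character
        while (start + m <= n) {
            int matched = 0;
            while (matched < m && p.charAt(m - 1 - matched) == t.charAt(start + m - 1 - matched)) matched++;
            if (matched == m) return start;
            int bySkip = skip[t.charAt(start + m - 1 - matched)] - matched;
            start += Math.max(bySkip, next[matched]);   // next[matched] is at least 1
        }
        return n;
    }

    private static boolean agrees(String p, int j, int s) {
        int m = p.length();
        for (int k = Math.max(m - j, s); k < m; k++) {
            if (p.charAt(k - s) != p.charAt(k)) return false;
        }
        return true;
    }
}
```

On the text AAAAAAA with pattern BBAAA, the first mismatch occurs after three matched characters, so the pattern is shifted by next[3] = 5 instead of 1 and the search ends immediately.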
24 Rabin-Karp String Matching
- Consider a text and pattern consisting of characters represented by b bits each
  - e.g. 7-bit ASCII characters
- We can regard a sequence of characters as a (large) binary number (as with keys when using hash tables).
- Idea: compute a hash value for an M-character pattern and compare it successively with the hash values of each successive sequence of M characters in the text.
25 Rabin-Karp matching - basic idea
- Example. Consider the string
  - CARPETS NEED CLEANING
- and the search string LEAN.
- Then we compare h(LEAN) first against h(CARP), then against h(ARPE), h(RPET), h(PETS), and so on.
- Clearly h(LEAN) need be computed only once.
- The key to efficient comparison is to compute the successive hash values efficiently.
- We can exploit the fact that successive keys overlap, e.g. ARPE and RPET share 3 characters.
26 Rabin-Karp - computing hash values
- Let us use $h(K) = K \bmod P$ as our hash function as before, where $P$ is a large prime number.
- Let $d$ = max number of characters (e.g. $d = 2^b$).
- Suppose $K = C_1 \ldots C_n$, where $C_1, \ldots, C_n$ is a sequence of characters in the text, and $h(K) = X$.
- It can be shown that
  - $h(C_2 \ldots C_{n+1}) = h((X - C_1 d^{n-1})\,d + C_{n+1})$, since
  - $C_2 \ldots C_{n+1}$ can be rewritten as $(C_1 \ldots C_n - C_1 d^{n-1})\,d + C_{n+1}$
  - E.g. ($d = 10$): $45678 = (34567 - 3 \cdot 10^4) \cdot 10 + 8$
- Then use some properties such as $h(X + Y) = h(h(X) + Y)$ and $h(XY) = h(h(X)\,Y)$.
- Hence $h(45678) = h((h(34567) - 3 \cdot 10^4) \cdot 10 + 8)$.
- Thus, successive values for h are efficiently computed, since we can reuse the previous hash value to compute the next one.
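The update rule can be checked numerically on the decimal example. The sketch below is an illustration only: the class and method names are assumptions, and the small prime 97 used in the usage note is chosen purely to keep the numbers readable.

```java
final class RollingHash {
    /** h(K) = K mod q, where K is digits[from .. from+len) read as a base-d number. */
    static long hash(int[] digits, int from, int len, long d, long q) {
        long h = 0;
        for (int i = from; i < from + len; i++) h = (h * d + digits[i]) % q;
        return h;
    }

    /**
     * Given x = h(C1..Cn), returns h(C2..Cn+1) = ((x - C1 * d^(n-1)) * d + Cn+1) mod q.
     * dPow must be d^(n-1) mod q.
     */
    static long roll(long x, int first, int next, long dPow, long d, long q) {
        long withoutFirst = ((x - (first * dPow) % q) % q + q) % q;   // keep the value non-negative
        return (withoutFirst * d + next) % q;
    }
}
```

With d = 10 and q = 97: h(34567) = 35 and 10^4 mod 97 = 9, so roll(35, 3, 8, 9, 10, 97) gives ((35 - 27) * 10 + 8) mod 97 = 88, which is 45678 mod 97.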
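27 Rabin-Karp Algorithm
  int rabinkarp(String P, String T) {
    int q = 33554393;  // a large prime
    int d = 32;        // size of alphabet
    int i, dM = 1, h1 = 0, h2 = 0;
    int M = P.length(), N = T.length();
    if (M > N) return N;                               // not found
    for (i = 1; i < M; i++) dM = (d * dM) % q;         // dM = d^(M-1) mod q
    for (i = 0; i < M; i++) {
      h1 = (h1 * d + val(P.charAt(i))) % q;            // hash P
      h2 = (h2 * d + val(T.charAt(i))) % q;            // hash first M characters of T
    }
    for (i = 0; h1 != h2; i++) {
      if (i >= N - M) return N;                        // not found
      h2 = (h2 + d * q - val(T.charAt(i)) * dM) % q;   // remove the leading character
      h2 = (h2 * d + val(T.charAt(i + M))) % q;        // append the next character
    }
    return i;
  }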
28 Rabin-Karp - analysis
- In the above algorithm, val(...) gives the number corresponding to a character, e.g. the ith character of P.
- h1 is the hash value of the pattern.
- h2 takes the hash value of successive sequences of M characters in the text.
- Strictly, if h1 = h2, we might not have a match, since a hash collision could occur. We still need to make a final comparison on the strings themselves.
- We can use a very large prime, since we do not actually have to store the hash table; this makes collisions extremely unlikely.
- Average number of comparisons: N+M