String Searching and Matching

About This Presentation

Slides: 29

Provided by: joh133

Transcript and Presenter's Notes

1
String Searching and Matching
2
Naive String Matching

Problem: to check whether a given string occurs
in a given text
Common problem in text-processing programs and
many other applications
We consider first the problem of matching strings
without wildcards
Naive algorithm:
match the string character by character
when there is a mismatch, shift the whole string
down by one character against the text, and start
again at the beginning of the string.
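
The naive method described above can be sketched in Java as follows. The class and method names are illustrative assumptions, and this sketch returns -1 when there is no match, whereas the later slides return the text length N.

```java
// Illustrative sketch of the naive method; names are assumptions, not from the slides.
final class NaiveSearch {
    /** Returns the index of the first occurrence of pattern in text, or -1 if there is none. */
    static int find(String pattern, String text) {
        int m = pattern.length(), n = text.length();
        for (int start = 0; start + m <= n; start++) {
            int j = 0;
            // Match the pattern character by character at this alignment.
            while (j < m && text.charAt(start + j) == pattern.charAt(j)) {
                j++;
            }
            if (j == m) {
                return start; // every pattern character matched
            }
            // Mismatch: the loop shifts the pattern by one and restarts at its first character.
        }
        return -1;
    }
}
```

For example, `NaiveSearch.find("LEAN", "CARPETS NEED CLEANING")` locates the pattern at index 14, counting from 0.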

3
Worst Case of Naive Method

Consider the following text:
aaaaaaaaa........aaab (total text length N)
Consider the following search string:
aaa....b (M-1 times a, followed by one b)
Using the naive method, we match up to the Mth
character (b) and after each mismatch, restart
at the 1st character a.
The mismatch occurs N-M times.
The match succeeds at position N-M+1.
Total number of comparisons: M(N-M+1).
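
The count M(N-M+1) can be checked with a small experiment. The sketch below instruments the naive method with a comparison counter; the class name and the helper `asThenB` are hypothetical, introduced only for this illustration.

```java
// Illustrative sketch: counts the comparisons made by the naive method.
final class NaiveWorstCase {
    /** Counts character comparisons until the first match is found (or the text is exhausted). */
    static long countComparisons(String pattern, String text) {
        int m = pattern.length(), n = text.length();
        long comparisons = 0;
        for (int start = 0; start + m <= n; start++) {
            int j = 0;
            while (j < m) {
                comparisons++;
                if (text.charAt(start + j) != pattern.charAt(j)) {
                    break; // mismatch ends this alignment
                }
                j++;
            }
            if (j == m) {
                break; // match found, stop counting
            }
        }
        return comparisons;
    }

    /** Hypothetical helper: builds 'count' copies of 'a' followed by a single 'b'. */
    static String asThenB(int count) {
        StringBuilder sb = new StringBuilder();
        for (int k = 0; k < count; k++) {
            sb.append('a');
        }
        return sb.append('b').toString();
    }
}
```

With M = 4 and N = 10 (pattern `asThenB(3)`, text `asThenB(9)`), the counter should agree with the formula, 4 * (10 - 4 + 1) = 28.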

4
Optimisations

We consider three improved string matching
algorithms.
Knuth-Morris-Pratt algorithm
could solve the above problem with N+M
comparisons
Boyer-Moore algorithm
could solve the above problem with N+M
comparisons - and can often get by with only
about N/M comparisons
Rabin-Karp algorithm
about M+N comparisons

5
Knuth-Morris-Pratt algorithm

When a mismatch is detected, say at position j in
the pattern string, we have already successfully
matched j-1 characters.
We try to take advantage of this to decide where
to restart matching
Example: suppose the string 10100 mismatches at the
5th character.
Then we know that the text so far matched
consists of 1010X (where X is unknown)
We can then restart by matching the 3rd character
(1) of the string against the X
(since the second 10 already matched in the
text could rematch with the first 10 in the
pattern).

6
Where to re-start matching

Consider two pointers i and j.
i gives the current position of the text
(starting at 0)
j gives the current position of the pattern
(starting at 0).
The key feature of the KMP algorithm is that i
never decreases. We never have to backtrack on
the text.
When a mismatch occurs at position j, we look at the
j characters already matched (pattern positions 0
to j-1).
If some non-empty proper prefix of those characters,
of length k, is also a suffix of them, take the
longest such k and restart the matching on position
i in the text and k in the pattern.
If there is no such prefix, then restart the
matching on position i in the text and 0 in the
pattern.

7
KMP preprocessing

Given a string pattern, we can precompute, for
each string position j, where to restart matching
when a mismatch occurs at j.
Slide a copy of the first j characters of the
pattern over itself.
Move from left to right and stop when all
overlapping characters match (or the pattern
slides past position j)
The number of overlapping characters tells us
where to restart comparing the pattern.
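
The sliding procedure above can be written directly as a brute-force computation. This is a minimal sketch for checking small examples by hand, not the efficient construction; the class and method names are assumptions.

```java
// Illustrative brute-force version of the "slide the pattern over itself" rule.
final class NextBySliding {
    /** next[j] = number of overlapping characters when a mismatch occurs at position j; next[0] = -1. */
    static int[] compute(String p) {
        int m = p.length();
        int[] next = new int[m];
        if (m == 0) {
            return next;
        }
        next[0] = -1;
        for (int j = 1; j < m; j++) {
            next[j] = 0; // default: the copy slides past position j with no overlap
            // Slide a copy of the first j characters right by 1, 2, ... positions.
            for (int shift = 1; shift < j; shift++) {
                int overlap = j - shift;
                // Stop at the first shift where all overlapping characters match.
                if (p.regionMatches(0, p, shift, overlap)) {
                    next[j] = overlap;
                    break;
                }
            }
        }
        return next;
    }
}
```

Because smaller shifts are tried first, the first hit is also the longest overlap, which is the restart position wanted.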

8
KMP preprocessing - example

Let the pattern be 10100111

J   next[J]   overlapping prefix
1   0         (none)
2   0         (none)
3   1         1
4   2         10
5   0         (none)
6   1         1
7   1         1

next[J] is the position in the pattern at which to
restart when a mismatch occurs at position J (note
the J index starts at 0). Also, define next[0] = -1.
The overlapping prefix is the prefix of the pattern
that lines up with the end of the already matched
characters at the point of the mismatch.
9
Example using the next[j] table

Consider the example on the previous slide.
Suppose a mismatch occurs on j = 4.
Consulting the table we see that next[4] = 2.
Hence re-start the match with j = 2.
Thus the character in the text marked X is next
matched with the 3rd character of the pattern.
Note that the first two characters 10 are
already matched up.
Note that if next[j] = 0, it means that there is no
overlapping prefix. In this case we will
re-start with j = 0.

Text     .....1010X
Pattern       10100111     (mismatch at j = 4)
Pattern         10100111   (re-start at j = 2)
10
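Constructing the next[J] array

int[] initNext(String P) {
    int i = 0, j = -1, M = P.length();
    int[] next = new int[M + 1];   // next[M] is also written below
    next[0] = -1;
    while (i < M) {   // match pattern with itself
        while (j >= 0 && P.charAt(i) != P.charAt(j))
            j = next[j];
        i++; j++;
        next[i] = j;
    }
    return next;
}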
11
KMP algorithm

int KMP (String P, String T) {
    int i, j, M = P.length(), N = T.length();
    int[] next = initNext(P);
    i = 0; j = 0;
    while (j < M && i < N) {
        while (j >= 0 && P.charAt(j) != T.charAt(i))
            j = next[j];   // after mismatch
        i++; j++;
    }
    if (j == M) return i-M;   // match starts at i-M
    else return i;            // no match: i equals N here
}
12
KMP - analysis

The KMP algorithm never needs to backtrack on the
text string.
This is an advantage for matching on some
continuous stream input from an external
device.
Maximum number of comparisons is M+N (length of
pattern + length of text).
The next[j] array could be hard-wired into the
KMP algorithm (see next slide).
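
To illustrate the point about stream input, here is a hedged sketch of KMP reading from a `java.io.Reader`, consuming each character exactly once and never re-reading. The class name and the choice to return -1 when nothing is found are assumptions of this sketch.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Illustrative sketch: KMP over a character stream, with no backtracking on the input.
final class StreamKmp {
    /** Builds the next table by matching the pattern against itself. */
    static int[] initNext(String p) {
        int m = p.length();
        int[] next = new int[m + 1];
        next[0] = -1;
        int i = 0, j = -1;
        while (i < m) {
            while (j >= 0 && p.charAt(i) != p.charAt(j)) {
                j = next[j];
            }
            i++;
            j++;
            next[i] = j;
        }
        return next;
    }

    /** Returns the stream offset where the first occurrence of p starts, or -1 if none. */
    static long find(String p, Reader in) {
        int m = p.length();
        if (m == 0) {
            return 0;
        }
        int[] next = initNext(p);
        long i = 0; // number of characters consumed so far
        int j = 0;  // current position in the pattern
        try {
            int c;
            while ((c = in.read()) != -1) {
                while (j >= 0 && p.charAt(j) != (char) c) {
                    j = next[j]; // fall back in the pattern only
                }
                i++;
                j++;
                if (j == m) {
                    return i - m;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return -1;
    }
}
```

A call such as `StreamKmp.find("10100111", new java.io.StringReader(text))` works on any `Reader`, including ones that cannot be rewound.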

13
State Machine KMP match

int KMP (String T) {   // pseudocode: Java itself has no goto
    int i = -1;
s0: i++;
s1: if (T.charAt(i) != '1') goto s0; i++;
s2: if (T.charAt(i) != '0') goto s1; i++;
s3: if (T.charAt(i) != '1') goto s1; i++;
s4: if (T.charAt(i) != '0') goto s2; i++;
s5: if (T.charAt(i) != '0') goto s3; i++;
s6: if (T.charAt(i) != '1') goto s1; i++;
s7: if (T.charAt(i) != '1') goto s2; i++;
s8: if (T.charAt(i) != '1') goto s2; i++;
    return i-8;
}
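
Since Java has no goto, one way to express the same hard-wired machine is a loop over a state number with a mismatch table. The sketch below assumes the pattern 10100111 from the earlier slides; the class name, the table name and the end-of-text check returning -1 are additions of this sketch.

```java
// Illustrative sketch: the hard-wired KMP machine for 10100111 without goto.
final class KmpStateMachine {
    private static final String PATTERN = "10100111";
    // ON_MISMATCH[s] = state to jump to when the test in state s fails (s = 1..8).
    // Index 0 is unused padding so that the index equals the state number.
    private static final int[] ON_MISMATCH = {0, 0, 1, 1, 2, 3, 1, 2, 2};

    /** Returns the index of the first occurrence of 10100111 in t, or -1 if none. */
    static int find(String t) {
        int i = -1;
        int state = 0;
        while (true) {
            if (state == 0) { // s0: advance the text and fall through to s1
                i++;
                state = 1;
            }
            if (i >= t.length()) {
                return -1; // ran off the end of the text
            }
            if (t.charAt(i) != PATTERN.charAt(state - 1)) {
                state = ON_MISMATCH[state]; // the "goto" of the slide
            } else {
                i++;
                if (state == 8) {
                    return i - 8; // all eight characters matched
                }
                state++;
            }
        }
    }
}
```

The text index only ever moves forward here, which mirrors the no-backtracking property of the goto version.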
14
The Boyer-Moore String Algorithm

This method can give substantially faster
searches where the language contains a large
number of symbols
E.g. Normal text (128 or 256 character alphabet)
rather than binary strings
BM method incorporates two main ideas
start matching at the right of the pattern so as
to find the rightmost mismatch
use information about the possible alphabet of
the text, as well as the characters in the pattern

15
Example

Search for LEAN in CARPETS NEED CLEANING REGULARLY

CARPETS NEED CLEANING REGULARLY
LEAN

N and P mismatch. Furthermore, P does not occur
anywhere in the string LEAN. Hence move the string
all the way past P and compare with N again.

CARPETS NEED CLEANING REGULARLY
LEAN
    LEAN
        LEAN
            LEAN

N and E mismatch, but E occurs in LEAN, so
we move the E of LEAN to this position.
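
The sequence of alignments in this example can be reproduced with a short trace. The sketch below uses only the mismatched-character idea and looks up the rightmost occurrence with `String.lastIndexOf` instead of a precomputed table; the class and method names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: records where the pattern is placed during a right-to-left search.
final class SkipTrace {
    /** Returns the start positions at which the pattern is aligned, in order. */
    static List<Integer> alignments(String p, String t) {
        int m = p.length(), n = t.length();
        List<Integer> starts = new ArrayList<>();
        int start = 0;
        while (start + m <= n) {
            starts.add(start);
            int k = m - 1;
            // Compare from the right-hand end of the pattern.
            while (k >= 0 && p.charAt(k) == t.charAt(start + k)) {
                k--;
            }
            if (k < 0) {
                break; // full match at this alignment
            }
            char c = t.charAt(start + k);  // mismatched text character
            int last = p.lastIndexOf(c);   // rightmost occurrence in the pattern, -1 if absent
            int shift = k - last;          // puts that occurrence under c, or moves past c
            start += Math.max(1, shift);
        }
        return starts;
    }
}
```

For LEAN in CARPETS NEED CLEANING REGULARLY the alignments are 0, 4, 8, 12 and finally 14, where the match succeeds.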
16
Boyer-Moore preprocessing

In order to implement the above idea, consider
the characters in the alphabet which makes up the
text.
C0, C1, ..., Ck (k+1 characters in the alphabet)
Initialise an array skip such that:
for each Cj in the pattern string, set skip[j] to
the distance of the rightmost occurrence of Cj from
the right hand end of the pattern
skip[j] = M otherwise, where M is the length of
the pattern.

17
The skip array - example

Suppose the pattern is LEAN and the alphabet is <blank>,
A, B, ..., Z (C0, C1, ..., C26).
skip[12] = 3 (L)
skip[5] = 2 (E)
skip[1] = 1 (A)
skip[14] = 0 (N)
skip[X] = 4 (otherwise)
skip[C] is the number of characters to move the
pattern to the right after a mismatch in the text
with the character with index C
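
A minimal sketch of building this table, assuming the 27-character alphabet above (blank plus upper-case A to Z). The names `index` and `initSkip` echo the slides' `index` and `initskip`, but their bodies here are assumptions.

```java
import java.util.Arrays;

// Illustrative sketch of the skip table for the alphabet <blank>, A..Z.
final class SkipTable {
    /** Maps blank to 0 and 'A'..'Z' to 1..26 (assumes no other characters occur). */
    static int index(char c) {
        return c == ' ' ? 0 : c - 'A' + 1;
    }

    /** skip[index(c)] = distance of the rightmost c from the right end of p, or M if c is absent. */
    static int[] initSkip(String p) {
        int m = p.length();
        int[] skip = new int[27];
        Arrays.fill(skip, m);
        for (int k = 0; k < m; k++) {
            // Later assignments overwrite earlier ones, so the rightmost occurrence wins.
            skip[index(p.charAt(k))] = m - 1 - k;
        }
        return skip;
    }
}
```

For LEAN this yields 3, 2, 1 and 0 at the indices for L, E, A and N, and 4 everywhere else.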

18
Using the skip array

Try to match the pattern from right to left
Suppose j characters have matched when a mismatch
occurs between the text character Cn (with index n)
and the (M-j)th position of the pattern.
Get the value of skip[n]
If j >= skip[n] then shift the pattern by 1 (since
we have already passed the rightmost occurrence
of Cn in the pattern).
Else shift the pattern skip[n]-j positions, to try to
align Cn in the text with the rightmost
occurrence of Cn in the pattern.

19
Example - shifting using skip

Pattern X X A X X X Z Z Z Z
M = 10 (length of pattern)
skip[1] = 7 (distance of rightmost A from right)
mismatch at position 10-4 = 6 (j = 4 characters matched)

Y Y Y Y Y A Z Z Z Z Z Z Z Z Z    text
X X A X X X Z Z Z Z              mismatched pattern
      X X A X X X Z Z Z Z        shift 3 positions

Shift pattern by 7-4 = 3 positions
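
The shift rule can be captured in a few lines. In this sketch `matched` is the number of characters already matched from the right (the 4 in the example), and the skip value is derived on the fly with `String.lastIndexOf`; the class and method names are hypothetical.

```java
// Illustrative sketch of the shift computed after a mismatch.
final class SkipShift {
    /**
     * p          the pattern
     * mismatched the text character that failed to match
     * matched    how many characters had matched, counting from the right
     */
    static int shiftFor(String p, char mismatched, int matched) {
        int last = p.lastIndexOf(mismatched);
        // Distance of the rightmost occurrence from the right end, or M if absent.
        int skip = (last < 0) ? p.length() : p.length() - 1 - last;
        // If the rightmost occurrence has already been passed, fall back to a shift of 1.
        return skip > matched ? skip - matched : 1;
    }
}
```

With pattern XXAXXXZZZZ, mismatched character A and four matched characters, the result is 7 - 4 = 3, as on the slide.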

20
Boyer-Moore Algorithm (1)

int boyermoore1(String P, String T) {
    int i, j, t, M = P.length(), N = T.length();
    initskip(P);   // initialise skip array
    i = M-1; j = M-1;
    while (j >= 0) {
        while (T.charAt(i) != P.charAt(j)) {
            t = skip[index(T.charAt(i))];
            if ((M-j) > t) i = i+M-j; else i = i+t;
            if (i >= N) return N;   // no match
            j = M-1;
        }
        i--; j--;
    }
    return i+1;   // successful match starts at i+1
}
21
Refinement to B-M Algorithm

We can apply the idea of the KMP algorithm
right-to-left
Sometimes this gives a larger skip value than the
skip index used above
E.g. Pattern BBAAA
skip[1] = 0 (skip value for A)
skip[2] = 3 (skip value for B)

AAAAAAA
BBAAA      mismatch on A in text

The boyermoore1 algorithm shifts only one position
However it is clear that AAA does not occur
anywhere to the left of positions 3,4,5

22
Boyer-Moore Refinement (2)

Build KMP next array from right to left
j = position of mismatch (from right, starting at 0)
next[j] = no. of positions to shift pattern to the
right
Pattern BBAAA:

j   next[j]
1   1
2   1
3   5
4   5

Using the next array, a mismatch on B results in
a shift of 5 positions
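
One way to reproduce this table is a brute-force search for the smallest shift that stays consistent with the characters already matched. This is a sketch under that interpretation of the slide; the class and method names are assumptions, and the entry for j = 0, which the slide omits, is simply 1.

```java
// Illustrative brute-force construction of the right-to-left next table.
final class SuffixShift {
    /** next[j] = smallest shift consistent with the j rightmost pattern characters having matched. */
    static int[] compute(String p) {
        int m = p.length();
        int[] next = new int[m];
        for (int j = 0; j < m; j++) {
            int s = 1;
            while (s < m && !compatible(p, j, s)) {
                s++;
            }
            next[j] = s; // s == m means the pattern moves completely past the matched part
        }
        return next;
    }

    /** True if shifting p right by s keeps it equal to itself on the matched suffix where they overlap. */
    private static boolean compatible(String p, int matched, int s) {
        int m = p.length();
        for (int k = m - matched; k < m; k++) {
            if (k - s >= 0 && p.charAt(k - s) != p.charAt(k)) {
                return false;
            }
        }
        return true;
    }
}
```

For BBAAA this gives shifts 1, 1, 5, 5 for j = 1 to 4, in agreement with the table above.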

23
Refined Boyer-Moore Algorithm

Initialise both the skip and the next arrays
(right-to-left).
Whenever a mismatch occurs, get the skip value
for the mismatched character and the next value
for the position of the mismatch.
Shift the pattern right by whichever gives the
greater value.
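
Putting the two shifts together might look like the sketch below. It is an illustration of the "take the greater shift" rule rather than a tuned implementation: the character shift is computed with `String.lastIndexOf` instead of a skip table, the next table is built by brute force, and -1 (not N) signals no match. All names are assumptions.

```java
// Illustrative sketch: right-to-left matching using the larger of the two shifts.
final class RefinedBoyerMoore {
    /** Returns the index of the first occurrence of p in t, or -1 if there is none. */
    static int find(String p, String t) {
        int m = p.length(), n = t.length();
        if (m == 0) {
            return 0;
        }
        int[] next = suffixShifts(p);
        int start = 0;
        while (start + m <= n) {
            int k = m - 1;
            while (k >= 0 && p.charAt(k) == t.charAt(start + k)) {
                k--;
            }
            if (k < 0) {
                return start; // whole pattern matched
            }
            int matched = m - 1 - k; // characters matched from the right
            // Shift suggested by the mismatched text character (may be zero or negative).
            int skipShift = k - p.lastIndexOf(t.charAt(start + k));
            // next[matched] is always at least 1, so the pattern always moves forward.
            start += Math.max(skipShift, next[matched]);
        }
        return -1;
    }

    /** Brute-force right-to-left next table: smallest shift consistent with the matched suffix. */
    static int[] suffixShifts(String p) {
        int m = p.length();
        int[] next = new int[m];
        for (int j = 0; j < m; j++) {
            int s = 1;
            while (s < m && !compatible(p, j, s)) {
                s++;
            }
            next[j] = s;
        }
        return next;
    }

    private static boolean compatible(String p, int matched, int s) {
        int m = p.length();
        for (int k = m - matched; k < m; k++) {
            if (k - s >= 0 && p.charAt(k - s) != p.charAt(k)) {
                return false;
            }
        }
        return true;
    }
}
```

On the text AAAAAAABBAAA with pattern BBAAA, the first mismatch shifts the pattern by 5 rather than 1, and the match is then found at index 7.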

24
Rabin-Karp String Matching

Consider a text and pattern consisting of
characters represented by b bits each
e.g. 7-bit ASCII characters
We can regard a sequence of characters as a
(large) binary number (as with keys when using
hash tables)
Idea - compute a hash value for an M-character
pattern and compare it successively with the hash
values of each successive sequence of M
characters in the text.

25
Rabin-Karp matching - basic idea

Example. Consider the string
CARPETS NEED CLEANING
and the search string LEAN.
Then we compare h(LEAN) first against h(CARP),
then against h(ARPE), h(RPET), h(PETS), and so
on.
Clearly h(LEAN) need be computed only once.
The key to efficient comparison is to compute the
successive hash values efficiently.
We can exploit the fact that successive keys
overlap, e.g. ARPE and RPET share 3 characters.

26
Rabin-Karp - computing hash values

Let us use $h(K) = K \bmod P$ as our hash function as
before, where $P$ is a large prime number
Let $d$ = max number of characters (e.g. $d = 2^b$)
Suppose $K = C_1, \ldots, C_n$ where $C_1, \ldots, C_n$ is a sequence
of characters in the text, and $h(K) = X$
It can be shown that
$h(C_2, \ldots, C_{n+1}) = h((X - C_1 d^{n-1}) d + C_{n+1})$, since
$C_2, \ldots, C_{n+1}$ can be rewritten as
$(C_1, \ldots, C_n - C_1 d^{n-1}) d + C_{n+1}$
E.g. ($d = 10$): $45678 = (34567 - 3 \cdot 10^4) \cdot 10 + 8$
Then use some properties such as $h(X + Y) = h(h(X) + Y)$
and $h(XY) = h(h(X) Y)$
Hence $h(45678) = h((h(34567) - 3 \cdot 10^4) \cdot 10 + 8)$
Thus, successive values for h are efficiently
computed, since we can reuse the previous hash
value to compute the next one.
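
The decimal example can be checked mechanically. The sketch below assumes d = 10 and, purely for illustration, a small prime modulus of 97; the class, method and parameter names are not from the slides.

```java
// Illustrative sketch of the rolling hash update, using decimal digits as characters.
final class RollingHash {
    /** h(K) = K mod q, where K is read as a base-d number from the digit string. */
    static long hash(String digits, long d, long q) {
        long h = 0;
        for (int k = 0; k < digits.length(); k++) {
            h = (h * d + (digits.charAt(k) - '0')) % q;
        }
        return h;
    }

    /**
     * Given x = h(C1..Cn), returns h(C2..Cn+1).
     * dPow must be d^(n-1) mod q; c1 is the outgoing character, cNext the incoming one.
     */
    static long roll(long x, int c1, int cNext, long d, long dPow, long q) {
        // Remove the leading character; adding q keeps the value non-negative.
        long withoutFirst = (x + q - (c1 * dPow) % q) % q;
        // Shift left by one position and append the new character.
        return (withoutFirst * d + cNext) % q;
    }
}
```

Rolling h(34567) forward with outgoing digit 3 and incoming digit 8 gives the same value as hashing 45678 from scratch, which is 45678 mod 97 = 88.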

27
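Rabin-Karp Algorithm

int rabinkarp(String P, String T) {
    int q = 33554393;   // a large prime
    int d = 32;         // size of alphabet
    int i, dM = 1, h1 = 0, h2 = 0;
    int M = P.length(), N = T.length();
    for (i = 1; i < M; i++) dM = (d*dM) % q;   // dM = d^(M-1) mod q
    for (i = 0; i < M; i++) {
        h1 = (h1*d + val(P.charAt(i))) % q;    // hash P
        h2 = (h2*d + val(T.charAt(i))) % q;    // hash first M characters of T
    }
    for (i = 0; h1 != h2; i++) {
        if (i >= N-M) return N;                // not found
        h2 = (h2 + d*q - val(T.charAt(i))*dM) % q;
        h2 = (h2*d + val(T.charAt(i+M))) % q;
    }
    return i;
}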
28
Rabin-Karp - analysis

In the above algorithm, val(c) is the number
corresponding to the character c, such as the
pattern character P[i].
h1 is the hash value of the pattern
h2 takes the hash value of successive sequences
of M characters in the text.
Strictly, if h1 = h2, we might not have a match,
since a hash collision could occur. We still
need to make a final comparison on the strings
themselves.
We can use a very large prime, since we do not
actually have to store the hash table; this
makes collisions extremely unlikely.
Average number of comparisons: N+M
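
A hedged sketch of Rabin-Karp with the final string comparison added is shown below. It reuses the prime from the earlier slide, but the radix of 256, the use of raw character codes in place of val, the `long` arithmetic and the -1 return value are assumptions of this sketch.

```java
// Illustrative sketch: Rabin-Karp that confirms each hash match against the text.
final class RabinKarpVerified {
    private static final long Q = 33554393L; // the large prime used on the earlier slide
    private static final long D = 256;       // assumed radix; character codes serve as val

    /** Returns the index of the first occurrence of p in t, or -1 if there is none. */
    static int find(String p, String t) {
        int m = p.length(), n = t.length();
        if (m > n) {
            return -1;
        }
        long dM = 1; // D^(m-1) mod Q
        for (int k = 1; k < m; k++) {
            dM = (dM * D) % Q;
        }
        long h1 = 0, h2 = 0;
        for (int k = 0; k < m; k++) {
            h1 = (h1 * D + p.charAt(k)) % Q; // hash of the pattern
            h2 = (h2 * D + t.charAt(k)) % Q; // hash of the first window
        }
        for (int i = 0; ; i++) {
            // Equal hashes are only a candidate; compare the strings to rule out a collision.
            if (h1 == h2 && t.regionMatches(i, p, 0, m)) {
                return i;
            }
            if (i + m >= n) {
                return -1; // no further window to slide to
            }
            h2 = (h2 + Q - (t.charAt(i) * dM) % Q) % Q; // drop the leading character
            h2 = (h2 * D + t.charAt(i + m)) % Q;        // append the next character
        }
    }
}
```

The extra comparison runs only when the hashes agree, so with a large prime it rarely costs more than one pass over the M characters of a true match.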
