Title: String Searching
1String Searching
- CSCI 2720
- Spring 2005
- Eileen Kraemer
2String Search
- A common word processor facility is to search for
a given word in a document. Generally, the
problem is to search for occurrences of a short
string in a long string.
- Example: document "Do the first then do the other one", word "the"
3History of String Search
- The brute force algorithm
- invented in the dawn of computer history
- re-invented many times, still common
- Knuth and Pratt invented a better one in 1970
- invented independently by Morris
- published 1976 as Knuth-Morris-Pratt
- Boyer and Moore found a better one before 1976
- found independently by Gosper
- Karp and Rabin found a better one in 1980
4The Brute Force Algorithm
- The obvious algorithm is to try the word at each possible place, and compare all the characters
- for i = 0 to n-m do (doc length n)
-     for j = 0 to m-1 do (word length m)
-         compare word[j] with doc[i+j]
-         if not equal, exit the inner loop
- The complexity is at worst O(mn) and at best O(n).
5Improving String Search
- Surprisingly, there is a faster algorithm where
you compare the last characters first
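- Example: document "Do the first then do the other one", word "the"
- With the word at the start of the document, compare e with the space after "Do": fail, so move along 3 places
- When the document character under the e is a t, which occurs in the word, we can only move along 2 places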
6Improved string search, continued
- In every case where the document character is not
one of the characters in the word, we can move
along m places. Sometimes, it is less.
7Problem Definition, terminology
- Let p be the pattern string
- Let t be the target string
- Let k be the index of the character in the target string that lies over the first character of the pattern
- Given two strings, p and t, over the alphabet Σ, determine whether p occurs as a substring of t
- That is, determine whether there exists k such that p = Substring(t, k, |p|).
8Straightforward string searching
- function SimpleStringSearch(string p, t): integer
- {Find p in t; return its location or -1 if p is not a substring of t}
- for k from 0 to Length(t) - Length(p) do
-     i <- 0
-     while i < Length(p) and p[i] = t[k+i] do
-         i <- i + 1
-     if i = Length(p) then return k
- return -1
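A minimal Python sketch of SimpleStringSearch as given above. It keeps the slide's names p, t, k and i; the function name simple_string_search is only an illustrative choice.

```python
def simple_string_search(p, t):
    """Return the first index k at which p occurs in t, or -1 if it does not."""
    for k in range(len(t) - len(p) + 1):      # k = 0 .. Length(t) - Length(p)
        i = 0
        while i < len(p) and p[i] == t[k + i]:
            i += 1                            # characters agree, keep going
        if i == len(p):                       # the whole pattern matched
            return k
    return -1


print(simple_string_search("the", "Do the first then do the other one"))
```

If p is longer than t the range is empty, so the function falls through to -1 without comparing anything.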
9SimpleStringSearch
- [Figure: target boxes t0 to t10 with pattern p0 to p3 aligned at k = 0; p0, p1 and p2 match (Y), p3 does not (N)]
10SimpleStringSearch
- [Figure: pattern moved one box to the right, k = 1; mismatch (N)]
11SimpleStringSearch
- [Figure: pattern at k = 2; mismatch (N)]
12SimpleStringSearch
- [Figure: pattern at k = 3; mismatch (N)]
13SimpleStringSearch
- [Figure: pattern at k = 4; mismatch (N)]
14SimpleStringSearch
- [Figure: pattern at k = 5; mismatch (N)]
15SimpleStringSearch
- [Figure: pattern at k = 6; mismatch (N)]
16SimpleStringSearch
- [Figure: pattern at k = 7, under t7 to t10; p0 to p3 all match (Y Y Y Y)]
17Straightforward string searching
- Worst case
- Pattern string always matches completely except for last character
- Example: search for XXXXXXY in target string of XXXXXXXXXXXXXXXXXXXX
- Outer loop executed once for every character in target string
- Inner loop executed once for every character in pattern
- Θ(|p| · |t|)
- Okay if patterns are short, but better algorithms exist
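To make the worst case concrete, the sketch below counts character comparisons for the slide's example (XXXXXXY against twenty X's). The helper count_comparisons is hypothetical and exists only for this illustration; it is the straightforward search with a counter added.

```python
def count_comparisons(p, t):
    """Straightforward search that also counts character comparisons.

    Returns (location or -1, number of comparisons made).
    """
    comparisons = 0
    for k in range(len(t) - len(p) + 1):
        i = 0
        while i < len(p):
            comparisons += 1
            if p[i] != t[k + i]:
                break
            i += 1
        if i == len(p):
            return k, comparisons
    return -1, comparisons


location, work = count_comparisons("XXXXXXY", "X" * 20)
print(location, work)   # 14 alignments, 7 comparisons each: -1 98
```

Every one of the 20 - 7 + 1 = 14 alignments matches six X's before failing on the Y, which is the |p| · |t| behaviour described above.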
18Knuth-Morris-Pratt
- Θ(|p| + |t|)
- Key idea
- if pattern fails to match, slide pattern to right by as many boxes as possible without permitting a match to go unnoticed
19Knuth-Morris-Pratt
- [Figure: pattern p0 to p4 aligned under target t0 to t10; p0 to p3 match (Y Y Y Y) and p4 does not (N). A second row repeats Y Y Y Y with a ? in the last position.]
20Knuth-Morris-Pratt
- Correct motion of pattern depends on both location of mismatch and the mismatching character
- If c = X, move 2 boxes to right
- If c = E, move 5 boxes to right
- If c = Z, target found; algorithm terminates
21Knuth-Morris-Pratt
- Goal: determine d, the number of boxes to the right the pattern should move: the smallest d such that
- p[0] = t[k+d]
- p[1] = t[k+d+1]
- p[2] = t[k+d+2]
- ...
- p[i-d] = t[k+i]
22Knuth-Morris-Pratt
- Note: this can be stated largely in terms of the pattern alone.
- Value of d depends only on
- The pattern
- The value of i
- The mismatching character c (at t[k+i])
23Knuth-Morris-Pratt
- Can define a function KMPskip(p, i, c) to give the correct d
- Return the smallest integer d such that 0 <= d <= i, p[i-d] = c, and p[j] = p[j+d] for each 0 <= j <= i-d-1 (d = 0 means that c matches p[i])
- Return i+1 if no such d exists
- Calculate all values of KMPskip for pattern p and store them in KMPskiparray
- Do a lookup at each mismatch
24Knuth-Morris-Pratt
- [Table: KMPskiparray for the pattern ABCD, with entries for the characters A, B, C, D and other]
25Knuth-Morris-Pratt
- [Table: KMPskiparray for the pattern XYXYZ, with entries for the characters X, Y, Z and other]
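The definition of KMPskip(p, i, c) can be evaluated directly. The Python sketch below does this by brute force, trying d = 0, 1, ..., i in turn. It is meant only to show what the entries for the patterns ABCD and XYXYZ are, not how an efficient ComputeKMPskiparray would build them. The names kmp_skip and kmp_skip_rows are illustrative, and None stands in for the "other" character.

```python
def kmp_skip(p, i, c):
    """Smallest d with 0 <= d <= i such that p[i-d] == c and
    p[j] == p[j+d] for 0 <= j <= i-d-1; i+1 if no such d exists."""
    for d in range(i + 1):
        if p[i - d] == c and all(p[j] == p[j + d] for j in range(i - d)):
            return d
    return i + 1


def kmp_skip_rows(p):
    """One row per character of p plus an 'other' row; one column per i."""
    rows = {c: [kmp_skip(p, i, c) for i in range(len(p))] for c in sorted(set(p))}
    # None equals no character of p, so it plays the role of "other".
    rows["other"] = [kmp_skip(p, i, None) for i in range(len(p))]
    return rows


for pattern in ("ABCD", "XYXYZ"):
    print(pattern)
    for c, row in kmp_skip_rows(pattern).items():
        print(f"  {c:>5}: {row}")
```

For XYXYZ with the mismatch at i = 4, this gives 2 for c = X, 5 for an unrelated character such as E, and 0 for Z (a match), in line with the three cases listed on the earlier slide.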
26Knuth-Morris-Pratt
- function KMPSearch(string p, t): integer
- {Find p in t; return its location or -1 if p is not a substring of t}
- KMPskiparray <- ComputeKMPskiparray(p)
- k <- 0
- i <- 0
- while k <= Length(t) - Length(p) do
-     if i = Length(p) then return k
-     d <- KMPskiparray[i, t[k+i]]
-     k <- k + d
-     i <- i + 1 - d
- return -1
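A Python sketch of KMPSearch, following the pseudocode above line by line. One assumption to flag: the skip table here is filled in naively from the definition of KMPskip, which is far slower than a proper ComputeKMPskiparray, so only the search loop reflects the Θ(|p| + |t|) behaviour. Characters that do not occur in the pattern are handled by defaulting to i + 1. The names kmp_search and build_skip_array are illustrative.

```python
def build_skip_array(p):
    """Naive table: skip[i][c] for every position i and every character c of p."""
    def skip(i, c):
        for d in range(i + 1):
            if p[i - d] == c and all(p[j] == p[j + d] for j in range(i - d)):
                return d
        return i + 1
    return [{c: skip(i, c) for c in set(p)} for i in range(len(p))]


def kmp_search(p, t):
    """Return the first location of p in t, or -1 if p is not a substring."""
    skip_array = build_skip_array(p)
    k = 0   # index in t under p[0]
    i = 0   # index in p of the next character to compare
    while k <= len(t) - len(p):
        if i == len(p):
            return k
        # Characters absent from the pattern always give d = i + 1.
        d = skip_array[i].get(t[k + i], i + 1)
        k += d          # slide the pattern d boxes to the right
        i += 1 - d      # d = 0 is a match, so simply advance i
    return -1


print(kmp_search("XYXYZ", "XYXYXYXYZ"))
```

In this example the pattern slides by 2 twice after mismatching on X at i = 4, and no target character is compared again after a successful match.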
27The Boyer-Moore Algorithm
28Building a Skip Table
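- To work out how far to skip when the last character does not match, build a table. Care is needed with repeated letters
- skip[c] = distance of last occurrence of c from end
- No entry is shown (-) for the last letter of the word, since the table is used only when the last character does not match
word: cab
letter: a b c d e ...
skip:   1 - 2 3 3 ...
word: abba
letter: a b c d e ...
skip:   - 1 4 4 4 ...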
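The table can be built in one pass over the word. The sketch below is one possible way to do it in Python, not necessarily how the original course code did. build_skip_table is a hypothetical name, and letters that are absent from the word are represented by falling back to the word length at lookup time.

```python
def build_skip_table(word):
    """Map each letter to the distance of its last occurrence from the end.

    Only positions 0 .. m-2 are scanned, because the table is consulted when
    the document character differs from the word's last character.
    """
    m = len(word)
    skip = {}
    for j in range(m - 1):
        # A repeated letter is overwritten, so the last occurrence wins.
        skip[word[j]] = m - 1 - j
    return skip


for word in ("cab", "abba"):
    table = build_skip_table(word)
    # Letters not in the table skip the whole word length.
    print(word, [table.get(c, len(word)) for c in "abcde"])
```

For abba the repeated b ends up with distance 1, which is the care with repeated letters mentioned above. The value this sketch stores for the final a comes from the earlier a in the word and is never used by the search, because an a in the document matches the last character.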
29The Skip Table algorithm
- The algorithm becomes
- i = 0
- while i <= n-m do
-     if word[m-1] = doc[i+m-1] then
-         for j = 0 to m-1 do
-             compare word[j] with doc[i+j]
-         i = i + 1
-     else i = i + skip[doc[i+m-1]]
- This is still O(nm) in the worst case, but now it is O(n/m) in the best case, because m characters may be skipped at each stage.
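A Python sketch of the skip-table search above, run on the earlier example document. The function name skip_search and the choice to return every match position are assumptions made for the illustration; the loop itself follows the slide.

```python
def skip_search(word, doc):
    """Return all positions where word occurs in doc, comparing last characters first."""
    m, n = len(word), len(doc)
    if m == 0:
        return []
    # Distance of the last occurrence of each letter from the end of the word
    # (later duplicates overwrite earlier ones).
    skip = {c: m - 1 - j for j, c in enumerate(word[:-1])}
    found = []
    i = 0
    while i <= n - m:
        last = doc[i + m - 1]
        if word[m - 1] == last:
            if doc[i:i + m] == word:      # compare all the characters
                found.append(i)
            i += 1
        else:
            i += skip.get(last, m)        # letters not in the word skip m places
    return found


print(skip_search("the", "Do the first then do the other one"))  # [3, 13, 21, 26]
```

On this document the first step compares e with the space after "Do" and moves along 3 places, as in the earlier example.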
30The Boyer-Moore Algorithm
- The last-character algorithm can be generalised by making the skip table work for partial matches, and by adding a secondary table. The result is the Boyer-Moore algorithm.
- It is possible to show that the complexity of the Boyer-Moore algorithm is guaranteed to be only O(n) in the worst case, as well as O(n/m) in the best case.
- It has generally been regarded as too difficult to understand, and so has not been used much.
31The Karp-Rabin Algorithm Idea
- Karp and Rabin found an algorithm which is
- almost as fast as Boyer-Moore
- simple enough to understand easily
- can be adapted for 2-dimensional searches for patterns in pictures
- Go back to the brute force idea, but now use a single number to represent the word you are searching for, and a single number for the current portion of the document you are comparing against.
32The Karp-Rabin Algorithm
- Suppose we are searching for 4-letter words. Then the whole (English) word fits in one (computer) word w of 4 bytes. If the current 4 bytes of the document are also in one word d, a single comparison can match the two in one step. To move along the document, shift d and add in the next character.
- For longer words, use hashing. The characters of the word and the document are combined into single hash numbers wh and dh. The hash number dh can be updated by doing a suitable sum and adding in the code for the next character.
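The hashing idea can be sketched in Python as below. The names wh and dh follow the slide. Everything else is an assumption made for the illustration: the polynomial hash, the base 256, the modulus 1000003 and the function name karp_rabin_search are not taken from the original slides.

```python
def karp_rabin_search(word, doc, base=256, modulus=1_000_003):
    """Return the first position of word in doc, or -1, using a rolling hash."""
    m, n = len(word), len(doc)
    if m == 0:
        return 0
    if m > n:
        return -1
    high = pow(base, m - 1, modulus)   # weight of the leftmost character
    wh = dh = 0
    for j in range(m):                 # hash the word and the first window
        wh = (wh * base + ord(word[j])) % modulus
        dh = (dh * base + ord(doc[j])) % modulus
    for i in range(n - m + 1):
        # Equal hashes can be a coincidence, so confirm with a real comparison.
        if wh == dh and doc[i:i + m] == word:
            return i
        if i < n - m:
            # Remove doc[i], shift by the base, add in the next character.
            dh = ((dh - ord(doc[i]) * high) * base + ord(doc[i + m])) % modulus
    return -1


print(karp_rabin_search("other", "Do the first then do the other one"))
```

Each step updates dh in constant time, so the document is scanned once. The explicit comparison on a hash match guards against two different strings sharing a hash value.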