String Searching - PowerPoint PPT Presentation

Provided by: eileenk2

Learn more at: https://csci.franklin.uga.edu


1
String Searching

CSCI 2720
Spring 2005
Eileen Kraemer

2
String Search

A common word processor facility is to search for
a given word in a document. Generally, the
problem is to search for occurrences of a short
string in a long string.

Document: Do the first then do the other one
Word: the

3
History of String Search

The brute force algorithm
invented in the dawn of computer history
re-invented many times, still common
Knuth & Pratt invented a better one in 1970
invented independently by Morris
published 1977 as Knuth-Morris-Pratt
Boyer & Moore found a better one before 1976
found independently by Gosper
Karp & Rabin found a better one in 1980

The obvious algorithm is to try the word at each
possible place, and compare all the characters:
for i = 0 to n-m do (doc length n)
  for j = 0 to m-1 do (word length m)
    compare word[j] with doc[i+j]
    if not equal, exit the inner loop
The complexity is at worst O(mn) and at best O(n).
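
The loop above can be sketched in Python roughly as follows. This is a minimal illustration, not part of the original slides: the function name brute_force_search and the choice to collect every match position in a list are assumptions made for the example.

```python
def brute_force_search(word: str, doc: str) -> list[int]:
    """Return every index i at which word occurs in doc (hypothetical helper)."""
    n, m = len(doc), len(word)
    matches = []
    for i in range(n - m + 1):          # i = 0 .. n-m
        for j in range(m):              # j = 0 .. m-1
            if word[j] != doc[i + j]:
                break                   # not equal: exit the inner loop
        else:
            matches.append(i)           # inner loop ran to the end: all m characters matched
    return matches


print(brute_force_search("the", "Do the first then do the other one"))
# prints [3, 13, 21, 26]
```

The matches at 13 and 26 are the "the" inside "then" and "other", since the search looks at characters only and knows nothing about word boundaries.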

5
Improving String Search

Surprisingly, there is a faster algorithm where
you compare the last characters first

Document: Do the first then do the other one
Word: the
Compare "e" with " " (space): fail, so move along 3 places.

Document: Do the first then do the other one
Word: the
Here the word can only move along 2 places.

6
Improved string search, continued

In every case where the document character is not
one of the characters in the word, we can move
along m places. Sometimes, it is less.

7
Problem Definition, terminology

Let p be the pattern string
Let t be the target string
Let k be the index of the character in the target
string that lies over the first character of
the pattern
Given two strings, p and t, over the alphabet Σ,
determine whether p occurs as a substring of t
That is, determine whether there exists k such
that p = Substring(t, k, |p|).

8
Straightforward string searching

function SimpleStringSearch(string p, t): integer
{Find p in t; return its location or -1 if p is
not a substring of t}
for k from 0 to Length(t) - Length(p) do
  i <- 0
  while i < Length(p) and p[i] = t[k+i] do
    i <- i+1
  if i = Length(p) then return k
return -1

9
SimpleStringSearch

[Figure: target boxes t0 to t10 above pattern boxes p0 to p3. Comparison results: Y, Y, Y, N]

10
SimpleStringSearch

[Figure: same layout, pattern moved along. Comparison results: N]

11
SimpleStringSearch

[Figure: same layout, pattern moved along. Comparison results: N]

12
SimpleStringSearch

[Figure: same layout, pattern moved along. Comparison results: N]

13
SimpleStringSearch

[Figure: same layout, pattern moved along. Comparison results: N]

14
SimpleStringSearch

[Figure: same layout, pattern moved along. Comparison results: N]

15
SimpleStringSearch

[Figure: same layout, pattern moved along. Comparison results: N]

16
SimpleStringSearch

[Figure: same layout, pattern moved along. Comparison results: Y, Y, Y, Y]

17
Straightforward string searching

Worst case:
Pattern string always matches completely except
for last character
Example: search for XXXXXXY in target string of
XXXXXXXXXXXXXXXXXXXX
Outer loop executed once for every character in
target string
Inner loop executed once for every character in
pattern
Θ(|p| · |t|)
Okay if patterns are short, but better algorithms
exist
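
To make the worst case concrete, the following sketch counts character comparisons for the XXXXXXY example. It is an illustration only: the function name count_comparisons and the returned (location, comparisons) pair are assumptions for this example, not part of the slides.

```python
def count_comparisons(p: str, t: str) -> tuple[int, int]:
    """Run the straightforward search; return (location or -1, comparisons made)."""
    comparisons = 0
    for k in range(len(t) - len(p) + 1):
        i = 0
        while i < len(p):
            comparisons += 1            # one comparison of p[i] with t[k+i]
            if p[i] != t[k + i]:
                break
            i += 1
        if i == len(p):
            return k, comparisons
    return -1, comparisons


print(count_comparisons("XXXXXXY", "X" * 20))
# prints (-1, 98): 14 alignments, each failing only on the 7th character
```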

18
Knuth-Morris-Pratt

Θ(|p| + |t|)
Key idea:
if pattern fails to match, slide pattern to
right by as many boxes as possible without
permitting a match to go unnoticed

19
Knuth-Morris-Pratt

[Figure: target boxes t0 to t10 above pattern boxes p0 to p4. Comparison results shown: Y, Y, Y, Y, N and Y, Y, Y, Y, ?]

20
Knuth-Morris-Pratt

Correct motion of pattern depends on both
location of mismatch and the mismatching
character c
If c = X, move 2 boxes to right
If c = E, move 5 boxes to right
If c = Z, target found; algorithm terminates

21
Knuth-Morris-Pratt

Goal: determine d, the number of boxes to the right
the pattern should move: the smallest d such that
p[0] = t[k+d]
p[1] = t[k+d+1]
p[2] = t[k+d+2]
...
p[i-d] = t[k+i]

22
Knuth-Morris-Pratt

Note: can be stated largely in terms of pattern
alone.
Value of d depends only on:
The pattern
The value of i
The mismatching character c (at t[k+i])

23
Knuth-Morris-Pratt

Can define a function KMPskip(p,i,c) to give
correct d
Return smallest integer d such that 0 <= d <= i,
such that p[i-d] = c and p[j] = p[j+d] for each
0 <= j <= i-d-1
Return i+1 if no such d exists
Calculate all values of KMPskip for pattern p and
store them in KMPskiparray
Do a lookup at each mismatch
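
The definition of KMPskip can be tried out with a direct, unoptimised sketch. The Python name kmp_skip and the brute-force search over d are assumptions made for illustration; a real implementation would normally build the table more cleverly.

```python
def kmp_skip(p: str, i: int, c: str) -> int:
    """Smallest d in 0..i with p[i-d] == c and p[j] == p[j+d] for 0 <= j <= i-d-1.

    Returns i + 1 if no such d exists.
    """
    for d in range(i + 1):
        if p[i - d] == c and all(p[j] == p[j + d] for j in range(i - d)):
            return d
    return i + 1


# Position i = 4 of the pattern XYXYZ, with three possible target characters c
for c in "XEZ":
    print(c, kmp_skip("XYXYZ", 4, c))
# prints X 2, then E 5, then Z 0
```

A result of 0 means the character matched, so the pattern stays where it is and the comparison moves on to the next position.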

24
Knuth-Morris-Pratt

For pattern ABCD

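[Table: KMPskiparray for pattern ABCD. Columns: pattern positions A, B, C, D. Rows: mismatching character A, B, C, D, other.]
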
25
Knuth-Morris-Pratt

For pattern XYXYZ

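[Table: KMPskiparray for pattern XYXYZ. Columns: pattern positions X, Y, X, Y, Z. Rows: mismatching character X, Y, Z, other.]
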
26
Knuth-Morris-Pratt

function KMPSearch(string p, t): integer
{Find p in t; return its location or -1 if p is
not a substring of t}
KMPskiparray <- ComputeKMPskiparray(p)
k <- 0
i <- 0
while k <= Length(t) - Length(p) do
  if i = Length(p) then return k
  d <- KMPskiparray[i, t[k+i]]
  k <- k + d
  i <- i + 1 - d
return -1
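
A runnable sketch of KMPSearch, under some assumptions: the skip array is stored as a list of dictionaries, characters that do not occur in the pattern are not stored (their skip is always i + 1), and the table is built by brute force from the KMPskip definition rather than by an efficient method. The Python names are hypothetical.

```python
def compute_kmp_skip_array(p: str) -> list[dict[str, int]]:
    """table[i][c] = KMPskip(p, i, c) for every character c that occurs in p."""
    table = []
    for i in range(len(p)):
        row = {}
        for c in set(p):
            row[c] = next(
                (d for d in range(i + 1)
                 if p[i - d] == c and all(p[j] == p[j + d] for j in range(i - d))),
                i + 1,                  # no suitable d: move past position i entirely
            )
        table.append(row)
    return table


def kmp_search(p: str, t: str) -> int:
    """Return the location of p in t, or -1 if p is not a substring of t."""
    skip = compute_kmp_skip_array(p)
    k = 0   # index in t that lies over the first pattern character
    i = 0   # number of pattern characters already known to match
    while k <= len(t) - len(p):
        if i == len(p):
            return k
        d = skip[i].get(t[k + i], i + 1)   # characters outside the pattern skip i + 1
        k += d
        i += 1 - d
    return -1


print(kmp_search("XYXYZ", "XYXYXYZ"))   # prints 2
```

Note that t[k+i] is never examined twice after a match on it: a skip of d leaves i + 1 - d characters already accounted for, which is where the linear running time comes from.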

27
The Boyer-Moore Algorithm

Coming soon.

28
Building a Skip Table

To work out how far to skip when the last
character does not match, build a table. Care is
needed with repeated letters.
skip[c] = distance of last occurrence of c from
end

word: cab
char: a b c d e ...
skip: 1 0 2 3 3 ...

word: abba
char: a b c d e ...
skip: 0 1 4 4 4 ...

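
The rule "distance of last occurrence of c from end" can be sketched as below. The function name build_skip_table is an assumption, as is the choice to store only characters that occur in the word and to fall back to the word length for all others.

```python
def build_skip_table(word: str) -> dict[str, int]:
    """skip[c] = distance of the last occurrence of c from the end of word."""
    m = len(word)
    skip = {}
    for index, c in enumerate(word):
        skip[c] = m - 1 - index         # a later occurrence overwrites an earlier one
    return skip


for word in ("cab", "abba"):
    table = build_skip_table(word)
    # Characters absent from the word get a full skip of len(word)
    print(word, [table.get(c, len(word)) for c in "abcde"])
# prints cab [1, 0, 2, 3, 3]
# prints abba [0, 1, 4, 4, 4]
```

Scanning left to right and overwriting is what handles repeated letters: for abba, the entry for a ends up as 0 (the final a), not 3.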
29
The Skip Table algorithm

The algorithm becomes:
i = 0
while i <= n-m do
  if word[m-1] = doc[i+m-1] then
    for j = 0 to m-1 do
      compare word[j] with doc[i+j]
    i = i + 1
  else i = i + skip[doc[i+m-1]]
This is still O(nm) in the worst case, but now
it is O(n/m) in the best case, because m
characters may be skipped at each stage.
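
A possible Python rendering of the skip table algorithm, given as a sketch. The name skip_search, the list of match positions it returns, and the use of a slice comparison in place of the explicit inner loop are all assumptions for the example.

```python
def skip_search(word: str, doc: str) -> list[int]:
    """Return every index at which word occurs in doc, comparing last characters first."""
    n, m = len(doc), len(word)
    if m == 0 or m > n:
        return []
    # skip[c] = distance of the last occurrence of c from the end of word
    skip = {c: m - 1 - index for index, c in enumerate(word)}
    matches = []
    i = 0
    while i <= n - m:
        last = doc[i + m - 1]
        if word[m - 1] == last:
            if doc[i:i + m] == word:    # compare all m characters
                matches.append(i)
            i += 1
        else:
            i += skip.get(last, m)      # characters not in word allow a skip of m
    return matches


print(skip_search("the", "Do the first then do the other one"))
# prints [3, 13, 21, 26]
```

On this document the first step already shows the benefit: the character under the "e" is a space, which does not occur in "the", so the word moves along 3 places after a single comparison.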

30
The Boyer-Moore Algorithm

The last-character algorithm can be generalised
by making the skip table work for partial
matches, and by adding a secondary table. The
result is the Boyer-Moore algorithm.
It is possible to show that the complexity of the
Boyer-Moore algorithm is guaranteed to be only
O(n) in the worst case, as well as O(n/m) in the
best case.
It has generally been regarded as too difficult
to understand, and so has not been used much.

31
The Karp-Rabin Algorithm Idea

Karp & Rabin found an algorithm which is:
almost as fast as Boyer-Moore
simple enough to understand easily
can be adapted for 2-dimensional searches for
patterns in pictures
Go back to the brute force idea, but now use a
single number to represent the word you are
searching for, and a single number for the
current portion of the document you are comparing
against.

32
The Karp-Rabin Algorithm

Suppose we are searching for 4-letter words. Then
the whole (English) word fits in one (computer)
word w of 4 bytes. If the current 4 bytes of the
document are also in one word d, a single
comparison can match the two in one step. To
move along the document, shift d and add in the
next character.
For longer words, use hashing. The characters of
the word and the document are combined into
single hash numbers wh and dh. The hash number
dh can be updated by doing a suitable sum and
adding in the code for the next character.
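
The hashing idea can be sketched as follows. The slides do not specify a hash function, so the polynomial hash, the base of 256, the modulus 1_000_003 and the function name karp_rabin_search are all assumptions chosen for illustration. Because different strings can share a hash number, the sketch confirms each hash match with a direct comparison.

```python
def karp_rabin_search(word: str, doc: str, base: int = 256, modulus: int = 1_000_003) -> int:
    """Return the first index at which word occurs in doc, or -1."""
    n, m = len(doc), len(word)
    if m > n:
        return -1
    high = pow(base, max(m - 1, 0), modulus)    # weight of the leading character
    wh = dh = 0
    for j in range(m):                          # hash the word and the first window
        wh = (wh * base + ord(word[j])) % modulus
        dh = (dh * base + ord(doc[j])) % modulus
    for i in range(n - m + 1):
        if wh == dh and doc[i:i + m] == word:   # confirm, in case of a hash collision
            return i
        if i < n - m:
            dh = (dh - ord(doc[i]) * high) % modulus          # remove the leading character
            dh = (dh * base + ord(doc[i + m])) % modulus      # add in the next character
    return -1


print(karp_rabin_search("other", "Do the first then do the other one"))   # prints 25
```

Each step along the document costs a constant amount of arithmetic regardless of the word length, which is the point of representing the current portion of the document by a single number.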