String Searching

Provided by: eileenk2

Transcript and Presenter's Notes

Title: String Searching


1
String Searching
  • CSCI 2720
  • Spring 2005
  • Eileen Kraemer

2
String Search
  • A common word processor facility is to search for
    a given word in a document. Generally, the
    problem is to search for occurrences of a short
    string in a long string.

Document: Do the first then do the other one
Word:     the
3
History of String Search
  • The brute force algorithm
  •     invented in the dawn of computer history
  •     re-invented many times, still common
  • Knuth & Pratt invented a better one in 1970
  •     invented independently by Morris
  •     published 1977 as Knuth-Morris-Pratt
  • Boyer & Moore found a better one before 1976
  •     found independently by Gosper
  • Karp & Rabin found a better one in 1980

4
  • The obvious algorithm is to try the word at each
    possible place, and compare all the characters
  • for i = 0 to n-m do (doc length n)
  •     for j = 0 to m-1 do (word length m)
  •         compare word[j] with doc[i+j]
  •         if not equal, exit the inner loop
  • The complexity is at worst O(mn) and best O(n).
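
The loop above can be sketched in Python as follows. This is a minimal illustration, not part of the original slides: the function name `brute_force_search` and the choice to collect every match position are assumptions made for the example.

```python
def brute_force_search(word: str, doc: str) -> list[int]:
    """Return every index i at which word occurs in doc."""
    n, m = len(doc), len(word)
    matches = []
    for i in range(n - m + 1):          # i = 0 .. n-m
        for j in range(m):              # j = 0 .. m-1
            if word[j] != doc[i + j]:
                break                   # not equal: exit the inner loop
        else:
            matches.append(i)           # inner loop ended without a break
    return matches


doc = "Do the first then do the other one"
print(brute_force_search("the", doc))   # [3, 13, 21, 26]
```

The matches at 13 and 26 are inside "then" and "other": the search looks for a substring, not a whole word.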

5
Improving String Search
  • Surprisingly, there is a faster algorithm where
    you compare the last characters first

Document: Do the first then do the other one
Word:     the
Compare "e" with " " (a space): fail, so move along 3 places.

Document: Do the first then do the other one
Word:     the
Here the word can only move along 2 places.
6
Improved string search, continued
  • In every case where the document character is not
    one of the characters in the word, we can move
    along m places. Sometimes, it is less.

7
Problem Definition, terminology
  • Let p be the pattern string
  • Let t be the target string
  • Let k be the index of the character in the target
    string that lies over the first character of
    the pattern
  • Given two strings, p and t, over the alphabet Σ,
    determine whether p occurs as a substring of t
  • That is, determine whether there exists k such
    that p = Substring(t, k, |p|).

8
Straightforward string searching
  • function SimpleStringSearch(string p, t): integer
  • {Find p in t; return its location or -1 if p is
    not a substring of t}
  • for k from 0 to Length(t) - Length(p) do
  •     i <- 0
  •     while i < Length(p) and p[i] = t[k+i] do
  •         i <- i + 1
  •     if i = Length(p) then return k
  • return -1

9
SimpleStringSearch
  • t0  t1  t2  t3  t4  t5  t6  t7  t8  t9  t10
    p0  p1  p2  p3
    Y   Y   Y   N
10
SimpleStringSearch
  • t0  t1  t2  t3  t4  t5  t6  t7  t8  t9  t10
        p0  p1  p2  p3
        N
11
SimpleStringSearch
  • t0  t1  t2  t3  t4  t5  t6  t7  t8  t9  t10
            p0  p1  p2  p3
            N
12
SimpleStringSearch
  • t0  t1  t2  t3  t4  t5  t6  t7  t8  t9  t10
                p0  p1  p2  p3
                N
13
SimpleStringSearch
  • t0  t1  t2  t3  t4  t5  t6  t7  t8  t9  t10
                    p0  p1  p2  p3
                    N
14
SimpleStringSearch
  • t0  t1  t2  t3  t4  t5  t6  t7  t8  t9  t10
                        p0  p1  p2  p3
                        N
15
SimpleStringSearch
  • t0  t1  t2  t3  t4  t5  t6  t7  t8  t9  t10
                            p0  p1  p2  p3
                            N
16
SimpleStringSearch
  • t0  t1  t2  t3  t4  t5  t6  t7  t8  t9  t10
                                p0  p1  p2  p3
                                Y   Y   Y   Y
17
Straightforward string searching
  • Worst case
  •     Pattern string always matches completely except
        for last character
  •     Example: search for XXXXXXY in target string of
        XXXXXXXXXXXXXXXXXXXX
  •     Outer loop executed once for every character in
        target string
  •     Inner loop executed once for every character in
        pattern
  •     Θ(|p| · |t|)
  • Okay if patterns are short, but better algorithms
    exist
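
To make the worst case concrete, the sketch below instruments the straightforward search with a comparison counter and runs it on the example from this slide. The counter, the snake_case function name, and the assumption that the target is a run of exactly 20 X characters are additions for illustration only.

```python
def simple_string_search(p: str, t: str) -> tuple[int, int]:
    """Return (location of p in t or -1, number of character comparisons)."""
    comparisons = 0
    for k in range(len(t) - len(p) + 1):
        i = 0
        while i < len(p):
            comparisons += 1
            if p[i] != t[k + i]:
                break
            i += 1
        if i == len(p):
            return k, comparisons
    return -1, comparisons


pattern = "XXXXXXY"      # 7 characters
target = "X" * 20        # assumed length of the slide's target string
print(simple_string_search(pattern, target))   # (-1, 98)
```

There are 20 - 7 + 1 = 14 alignments and each one costs 7 comparisons before failing on the final Y, which gives 98 comparisons in total.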

18
Knuth-Morris-Pratt
  • Θ(|p| + |t|)
  • Key idea:
  •     if pattern fails to match, slide pattern to
        right by as many boxes as possible without
        permitting a match to go unnoticed

19
Knuth-Morris-Pratt
  • t0  t1  t2  t3  t4  t5  t6  t7  t8  t9  t10
    p0  p1  p2  p3  p4
    Y   Y   Y   Y   N
    Y   Y   Y   Y   ?
20
Knuth-Morris-Pratt
  • Correct motion of pattern depends on both
    location of mismatch and the mismatching
    character
  • If c = X, move 2 boxes to right
  • If c = E, move 5 boxes to right
  • If c = Z, target found; algorithm terminates

21
Knuth-Morris-Pratt
  • Goal: determine d, number of boxes to right
    pattern should move; smallest d such that
  •     p[0] = t[k+d]
  •     p[1] = t[k+d+1]
  •     p[2] = t[k+d+2]
  •     ...
  •     p[i-d] = t[k+i]

22
Knuth-Morris-Pratt
  • Note: can be stated largely in terms of pattern
    alone.
  • Value of d depends only on
  •     The pattern
  •     The value of i
  •     The mismatching character c (at t[k+i])

23
Knuth-Morris-Pratt
  • Can define a function KMPskip(p,i,c) to give
    correct d
  •     Return smallest integer d such that 0 <= d <= i,
        such that p[i-d] = c and p[j] = p[j+d] for each
        0 <= j <= i-d-1
  •     Return i+1 if no such d exists
  • Calculate all values of KMPskip for pattern p and
    store them in KMPskiparray
  •     do lookup at each mismatch
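
The definition of KMPskip can be transcribed almost literally into Python. The sketch below is a direct, unoptimised reading of the definition (it tries every d in turn) and is not claimed to be the construction used in the lecture. The names `kmp_skip` and `compute_kmp_skip_array`, and the choice to store each row as a dictionary, are assumptions made for the example.

```python
def kmp_skip(p: str, i: int, c: str) -> int:
    """Smallest d with 0 <= d <= i, p[i-d] == c and p[j] == p[j+d]
    for 0 <= j <= i-d-1; returns i+1 if there is no such d."""
    for d in range(i + 1):
        if p[i - d] == c and all(p[j] == p[j + d] for j in range(i - d)):
            return d
    return i + 1


def compute_kmp_skip_array(p: str) -> list[dict[str, int]]:
    """Row i maps each character of p to kmp_skip(p, i, c).
    Characters that do not occur in p are not stored: they always skip i+1."""
    return [{c: kmp_skip(p, i, c) for c in set(p)} for i in range(len(p))]


table = compute_kmp_skip_array("XYXYZ")
# Mismatch position i = 4 with c = X, Z and a character outside the pattern:
print(table[4]["X"], table[4]["Z"], table[4].get("E", 4 + 1))   # 2 0 5
```

For the pattern XYXYZ with a mismatch at position 4, this reproduces the moves of 2 boxes for X, 5 boxes for E, and 0 (a match) for Z.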

24
Knuth-Morris-Pratt
  • For pattern ABCD

[Table: rows A B C D (pattern characters); columns A B C D other (mismatching character)]
25
Knuth-Morris-Pratt
  • For pattern XYXYZ

[Table: rows X Y X Y Z (pattern characters); columns X Y Z other (mismatching character)]
26
Knuth-Morris-Pratt
  • function KMPSearch(string p, t): integer
  • {Find p in t; return its location or -1 if p is
    not a substring of t}
  • KMPskiparray <- ComputeKMPskiparray(p)
  • k <- 0
  • i <- 0
  • while k <= Length(t) - Length(p) do
  •     if i = Length(p) then return k
  •     d <- KMPskiparray[i, t[k+i]]
  •     k <- k + d
  •     i <- i + 1 - d
  • return -1
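
A runnable Python sketch of this search loop follows. ComputeKMPskiparray is not spelled out in the slides, so the version here is an assumption built from the KMPskip definition: row i records, for each character, the smallest d in 0..i such that p[i-d] is that character and the first i-d characters of p agree with p[d..i-1]. Characters absent from a row skip i+1 places.

```python
def compute_kmp_skip_array(p: str) -> list[dict[str, int]]:
    """skip[i][c] = smallest d in 0..i with p[i-d] == c and p[:i-d] == p[d:i]."""
    skip = []
    for i in range(len(p)):
        row = {}
        for d in range(i, -1, -1):      # larger d first, so a smaller d overwrites it
            if p[: i - d] == p[d:i]:
                row[p[i - d]] = d
        skip.append(row)
    return skip


def kmp_search(p: str, t: str) -> int:
    """Return the location of p in t, or -1 if p is not a substring of t."""
    skip = compute_kmp_skip_array(p)
    k = i = 0
    while k <= len(t) - len(p):
        if i == len(p):
            return k
        d = skip[i].get(t[k + i], i + 1)   # d == 0 means the characters matched
        k += d
        i += 1 - d
    return -1


print(kmp_search("XYXYZ", "XYXYXYZ"))   # 2
```

In the example the first four characters match at k = 0, the fifth (X against Z) gives d = 2, and the search resumes at k = 2 with i = 3 without re-reading the target characters already matched.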

27
The Boyer-Moore Algorithm
  • Coming soon.

28
Building a Skip Table
  • To work out how far to skip when the last
    character does not match, build a table. Care is
    needed with repeated letters.
  • skip[c] = distance of last occurrence of c from
    end

word: cab
        a  b  c  d  e  ...
skip    1  -  2  3  3  ...

word: abba
        a  b  c  d  e  ...
skip    -  1  4  4  4  ...
29
The Skip Table algorithm
  • The algorithm becomes
  •     i = 0
  •     while i <= n-m do
  •         if word[m-1] = doc[i+m-1] then
  •             for j = 0 to m-1 do
  •                 compare word[j] with doc[i+j]
  •             i = i + 1
  •         else i = i + skip[doc[i+m-1]]
  • This is still O(nm) in the worst case, but now
    it is O(n/m) in the best case, because m
    characters may be skipped at each stage.
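
The skip-table search can be sketched in Python as below. This is an illustrative reading of the slide, with some assumptions: the names `build_skip_table` and `skip_search` are invented here, the table leaves out the final position of the word (it is consulted only after the last character has failed to match), and characters that are not in the table skip the full m places.

```python
def build_skip_table(word: str) -> dict[str, int]:
    """Distance from the end of word to the last occurrence of each character,
    ignoring the final position of the word."""
    m = len(word)
    return {c: m - 1 - idx for idx, c in enumerate(word[:-1])}


def skip_search(word: str, doc: str) -> int:
    """Return the first index of word in doc, or -1 if it does not occur."""
    n, m = len(doc), len(word)
    if m == 0:
        return 0
    skip = build_skip_table(word)
    i = 0
    while i <= n - m:
        last = doc[i + m - 1]
        if last == word[m - 1]:
            if doc[i:i + m] == word:     # compare all the characters
                return i
            i += 1
        else:
            i += skip.get(last, m)       # character not in the word: skip m places
    return -1


print(build_skip_table("cab"))                                   # {'c': 2, 'a': 1}
print(skip_search("the", "Do the first then do the other one"))  # 3
```

In the second call the first comparison is "e" against the space at index 2. The space is not in the word, so the search moves along 3 places and finds the match at once.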

30
The Boyer-Moore Algorithm
  • The last-character algorithm can be generalised
    by making the skip table work for partial
    matches, and by adding a secondary table. The
    result is the Boyer-Moore algorithm.
  • It is possible to show that the complexity of the
    Boyer-Moore algorithm is guaranteed to be only
    O(n) in the worst case, as well as O(n/m) in the
    best case.
  • It has generally been regarded as too difficult
    to understand, and so has not been used much.

31
The Karp-Rabin Algorithm Idea
  • Karp & Rabin found an algorithm which
  •     is almost as fast as Boyer-Moore
  •     is simple enough to understand easily
  •     can be adapted for 2-dimensional searches for
        patterns in pictures
  • Go back to the brute force idea, but now use a
    single number to represent the word you are
    searching for, and a single number for the
    current portion of the document you are comparing
    against.

32
The Karp-Rabin Algorithm
  • Suppose we are searching for 4-letter words. Then
    the whole (English) word fits in one (computer)
    word w of 4 bytes. If the current 4 bytes of the
    document are also in one word d, a single
    comparison can match the two in one step. To
    move along the document, shift d and add in the
    next character.
  • For longer words, use hashing. The characters of
    the word and the document are combined into
    single hash numbers wh and dh. The hash number
    dh can be updated by doing a suitable sum and
    adding in the code for the next character.
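
The hashing idea can be sketched in Python as follows. The slides do not specify a hash function, so the polynomial hash, the `base` and `mod` values, and the function name `karp_rabin_search` are all assumptions chosen for illustration. The variables `wh` and `dh` follow the names used above.

```python
def karp_rabin_search(word: str, doc: str,
                      base: int = 256, mod: int = 1_000_003) -> int:
    """Return the first index of word in doc, or -1 if it does not occur."""
    n, m = len(doc), len(word)
    if m == 0:
        return 0
    if m > n:
        return -1
    high = pow(base, m - 1, mod)         # weight of the window's first character
    wh = dh = 0
    for j in range(m):                   # hash the word and the first window
        wh = (wh * base + ord(word[j])) % mod
        dh = (dh * base + ord(doc[j])) % mod
    for i in range(n - m + 1):
        # Equal hashes may be a coincidence, so confirm with a real comparison.
        if wh == dh and doc[i:i + m] == word:
            return i
        if i < n - m:
            # Remove doc[i], shift, and add in the code for the next character.
            dh = ((dh - ord(doc[i]) * high) * base + ord(doc[i + m])) % mod
    return -1


print(karp_rabin_search("other", "Do the first then do the other one"))   # 25
```

Each step updates `dh` in constant time, so the document is scanned once and the full character comparison runs only when the two hash numbers agree.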