Transcript and Presenter's Notes

Title: String Matching


1
String Matching
2
String Matching
  • The problem: determine whether a pattern p of length m
    occurs within a text t of length n.
  • Simple solution: naïve string matching.
  • Match each position in the pattern to each
    position in the text.
  • t: AAAAAAAAAAAAAA
  • p: AAAAAB
  •      AAAAAB (shift by one and try again)
  • etc.
  • O(m(n-m+1)) worst case; average-case performance
    is surprisingly good, provided the strings are neither
    long nor full of repeated letters.
    Unfortunately, such strings do occur in DNA sequences and
    images.

3
Brute Force Approach
    public static int search(String pattern, String text)
    {
        int m = pattern.length();
        int n = text.length();
        for (int i = 0; i <= n - m; i++)
        {
            int j;
            for (j = 0; j < m; j++)
                if (text.charAt(i + j) != pattern.charAt(j))
                    break;
            if (j == m) return i;   // match starts at text offset i
        }
        return -1;                  // no match found
    }

4
Rabin-Karp Fingerprint
  • Idea: before spending a lot of time comparing
    characters for a match, do some pre-processing to
    eliminate locations that could not possibly match.
  • If we can quickly eliminate most of the
    positions, then we can run the naïve algorithm on
    what's left.
  • Eliminate enough so that, hopefully, the pre-processing
    plus the remaining run time is O(n) overall. Obviously we
    want the preprocessing itself to be fast as well.

5
Rabin-Karp Idea
  • To get a feel for the idea, say that our text and
    pattern are sequences of bits.
  • For example,
  • p = 010111
  • t = 0010110101001010011
  • The parity of a binary string counts the
    number of ones: if odd, the parity is 1; if
    even, the parity is 0. Since our pattern is six
    bits long, let's compute the parity for each
    position i in t, counting six bits ahead. Call
    this f_i, where f_i is the parity of the substring
    t[i..i+5].

6
Parity
  • t = 0010110101001010011
  • p = 010111

Since the parity of our pattern is 0, we only
need to check positions 2, 4, 6, 8, 10, and 11 in
the text. By the way, how do we compute the parity of
all substrings of length m in just O(n) time? If we
handle all n-m+1 substrings separately, that
already costs us m(n-m+1) units of time. (A sketch of
the incremental computation appears below.)
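One way to answer the question above: update the previous window's parity
instead of recounting from scratch. The sketch below is a minimal
illustration, assuming the bit strings are stored as Java Strings of '0'/'1'
characters; the method name windowParities is just for illustration.

    // Compute the parity of every length-m window of t in O(n) total time.
    static int[] windowParities(String t, int m) {
        int n = t.length();
        int[] f = new int[n - m + 1];
        int parity = 0;
        for (int k = 0; k < m; k++)              // parity of the first window
            parity ^= t.charAt(k) - '0';
        f[0] = parity;
        for (int i = 1; i + m <= n; i++) {
            parity ^= t.charAt(i - 1) - '0';     // drop the bit leaving the window
            parity ^= t.charAt(i + m - 1) - '0'; // add the bit entering the window
            f[i] = parity;
        }
        return f;
    }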
7
Rabin-Karp Parity Fingerprint
  • On average we expect the parity check to reject
    half the inputs.
  • To get a better speed-up, by a factor of q, we
    need a fingerprint function that maps m-bit
    strings to q different fingerprint values.
  • Rabin and Karp proposed using a hash function
    that interprets the next m bits in the text as the
    binary expansion of an unsigned integer and then
    takes the remainder after division by q.
  • A good value of q is a prime number greater than
    m.

8
Rabin-Karp Fingerprint Example
  • More precisely, if the m bits are s_0 s_1 s_2 ... s_(m-1),
    then we compute the fingerprint value
    f = (s_0·2^(m-1) + s_1·2^(m-2) + ... + s_(m-1)·2^0) mod q.
  • For the previous example, f_i is the value of the window
    t[i..i+5], read as a 6-bit binary number, reduced mod q.
  • Consider how to compute f incrementally.

For our pattern 010111, its hash value is 23 mod
7, or 2. This means that we would only run the
naïve algorithm at positions where f_i = 2. (A tiny
check of this value appears below.)
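A minimal check of the slide's arithmetic, assuming q = 7 (a prime greater
than m = 6, as suggested earlier); the Horner-style loop reads the pattern's
bits left to right and reduces mod q at each step.

    public static void main(String[] args) {
        String p = "010111";
        int q = 7, f = 0;
        for (int k = 0; k < p.length(); k++)
            f = (2 * f + (p.charAt(k) - '0')) % q;  // f = value of p mod q
        System.out.println(f);                      // prints 2 (= 23 mod 7)
    }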
9
Rabin-Karp Discussion
  • But we want to compare text, not bits!
  • Text is represented using bits
  • For a textual pattern and text, we simply use
    their ASCII sequences.
  • We can compute f_i in O(m) time, giving us an
    expected runtime of O(m+n) given a good hash
    function. The worst case is still m·n if we
    get many hash collisions. Of course, we could
    settle for a purely probabilistic answer instead.

10
Monte Carlo Rabin-Karp
    public static int MonteCarloRabinKarp(String p, String t)
    {
        int m = p.length();
        int n = t.length();
        int dM = 1, h1 = 0, h2 = 0;
        int q = 3355439;                    // table size
        int d = 256;                        // radix
        for (int j = 1; j < m; j++)         // precompute dM = d^(m-1) mod q
            dM = (d * dM) % q;
        for (int j = 0; j < m; j++)
        {
            h1 = (h1 * d + p.charAt(j)) % q;    // hash of pattern
            h2 = (h2 * d + t.charAt(j)) % q;    // hash of first text window
        }
        if (h1 == h2) return 0;             // match found, we hope
        for (int i = m; i < n; i++)
        {
            h2 = (h2 + q - dM * t.charAt(i - m) % q) % q;   // remove high-order digit
            h2 = (h2 * d + t.charAt(i)) % q;                // insert low-order digit
            if (h1 == h2) return i - m + 1; // match found, we hope
        }
        return -1;                          // not found
    }
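A hypothetical usage example of the method above (the pattern and text are
made up for illustration): because no character check is done, the reported
offset could, in rare cases, be a false positive.

    public static void main(String[] args) {
        // "par" first occurs in "pappar" at offset 3
        System.out.println(MonteCarloRabinKarp("par", "pappar"));  // prints 3
    }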

11
Randomized Again
  • Las Vegas algorithms
  • Expected to be fast
  • Guaranteed to be correct (if it halts and gives an
    answer)
  • Ex: quicksort, Rabin-Karp with a match check (sketched
    below)
  • Monte Carlo algorithms
  • Guaranteed to be fast
  • Expected to be correct
  • Ex: Rabin-Karp without a match check
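A minimal sketch of the match check that turns the Monte Carlo version into
a Las Vegas one (assumption: this helper is called whenever the hashes agree,
before a match is reported; the name verify is just for illustration).

    // Returns true only if p really occurs in t at the given offset.
    static boolean verify(String p, String t, int offset) {
        for (int j = 0; j < p.length(); j++)
            if (t.charAt(offset + j) != p.charAt(j)) return false;
        return true;
    }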

12
Run Times so Far
Scheme        Expected   Worst Case
Brute         n          m·n
Rabin-Karp    m + n      n (Monte Carlo) or m·n (Las Vegas)

Monte Carlo sometimes reports success when it is not true. Las Vegas does a
match check (it is really unlikely to take this long, as most checks are cut
off early).
13
KMP: Knuth-Morris-Pratt
  • This is a commonly used linear-time
    string matching algorithm that achieves O(m+n)
    running time (worst case and expected).
  • Uses an auxiliary function pi[1..m], pre-computed
    from p in time O(m).

14
Pi Function
  • This function contains knowledge about how the
    pattern shifts against itself.
  • If we know how the pattern matches against
    itself, we can slide the pattern more characters
    ahead than just one character as in the naïve
    algorithm. (A sketch of computing this function
    appears below.)
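A minimal sketch of one standard way to compute the pi (prefix/failure)
function in O(m); assumption: 0-indexed arrays, with pi[j] = the length of
the longest proper prefix of p[0..j] that is also a suffix of p[0..j]. The
slides' next[] mismatch table can be derived from this information.

    static int[] computePi(String p) {
        int m = p.length();
        int[] pi = new int[m];
        int k = 0;                           // length of the current matched prefix
        for (int j = 1; j < m; j++) {
            while (k > 0 && p.charAt(k) != p.charAt(j))
                k = pi[k - 1];               // fall back to a shorter prefix
            if (p.charAt(k) == p.charAt(j))
                k++;                         // extend the matched prefix
            pi[j] = k;
        }
        return pi;
    }

For example, computePi("pappar") returns [0, 0, 1, 1, 2, 0].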

15
Pi Function Example
Naive:
p = pappar    t = pappappapparrassanuaragh
p = pappar    t = pappappapparrassanuaragh
Smarter technique: we can slide the pattern
ahead so that the longest PREFIX of p that we
have already processed matches the longest SUFFIX
of t that we have already matched.
p = pappar    t = pappappapparrassanuaragh
16
KMP Example
p = pappar    t = pappappapparrassanuaragh
p = pappar    t = pappappapparrassanuaragh
p = pappar    t = pappappapparrassanuaragh
The characters mismatch, so we shift over one
character for both the text and the pattern:
p = pappar    t = pappappapparrassanuaragh
We continue in this fashion until we reach the end
of the text.
17
KMP Example
18
KMP Analysis
  • Runtime
  • O(m) to compute the Pi values
  • O(n) to compare the pattern to the text
  • Total: O(n + m) runtime

19
KMP's DFA
  • KMP algorithm.
  • Use knowledge of how search pattern repeats
    itself.
  • Build DFA from pattern.
  • Run DFA on text.

20
DFA Linear Property
  • The DFA used in KMP has a special property.
  • Upon a character match, go forward one state.
  • We only need to keep track of where to go upon a
    character mismatch:
  • go to state next[j] if the character mismatches in
    state j.

21
2nd Example of KMP next
22
Computing KMP next Function
23
DFA for three-letter alphabet
24
KMP Algorithm
    for (int i = 0, j = 0; i < n; i++)
    {
        if (t.charAt(i) == p.charAt(j)) j++;   // match: advance in the pattern
        else j = next[j];                      // mismatch: follow the next[] link
        if (j == m) return i - m + 1;          // found: match starts here
    }
    return -1;                                 // not found

25
Run Times so Far
Scheme               Expected   Worst Case
Brute                n          m·n
Rabin-Karp           m + n      n (Monte Carlo) or m·n (Las Vegas)
Knuth-Morris-Pratt   m + n      2n

Monte Carlo sometimes reports success when it is not true. Las Vegas does a
match check (it is really unlikely to take this long, as most checks are cut
off early).
26
Horspool's Algorithm
  • It is possible in some cases to search a text of
    length n using fewer than n comparisons!
  • Horspool's algorithm is a relatively simple
    technique that achieves this distinction for many
    (but not all) input patterns. The idea is to
    perform the comparison from right to left instead
    of left to right.

27
Horspool's Algorithm
  • Consider searching:
  • T = BARBUGABOOTOOMOOBARBERONI
  • P = BARBER
  • There are four cases to consider:
  • 1. There is no occurrence of the character from T
    in P. In this case there is no use shifting over
    by one, since we'll eventually compare against this
    character in T that is not in P. Consequently,
    we can shift the pattern all the way over by the
    entire length of the pattern (m).

28
Horspool's Algorithm
  • 2. There is an occurrence of the character from
    T in P. Horspool's algorithm then shifts the
    pattern so the rightmost occurrence of that
    character in P lines up with the current
    character in T.

29
Horspool's Algorithm
  • 3. We've done some matching until we hit a
    character in T that is not in P. Then we shift
    as in case 1: we move the entire pattern over by
    m.

30
Horspool's Algorithm
  • 4. We've done some matching until we hit a
    character that doesn't match in P, but that exists
    among its first m-1 characters. In this case,
    the shift should be as in case 2, where we line up
    the last-compared character in T with the rightmost
    corresponding character in P.

31
Horspool's Algorithm
  • More on case 4

32
Horspool Implementation
  • We first precompute the shifts and store them in
    a table. The table is indexed by all
    possible characters that can appear in a text.
    To compute the shift T(c) for some character c we
    use the formula:
  • T(c) = the pattern's length m, if c is not among
    the first m-1 characters of P; otherwise, the distance
    from the rightmost occurrence of c among the first m-1
    characters of P to the end of P. (A sketch appears below.)
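A minimal sketch of Horspool's search using that shift table (assumptions:
character codes 0..255 are used directly as table indices, and the method
name horspool is just for illustration).

    static int horspool(String p, String t) {
        int m = p.length(), n = t.length();
        int[] shift = new int[256];
        for (int c = 0; c < 256; c++) shift[c] = m;   // c not in first m-1 chars: full shift
        for (int j = 0; j < m - 1; j++)
            shift[p.charAt(j)] = m - 1 - j;           // distance from rightmost occurrence to end
        int i = m - 1;                                // text index aligned with p's last character
        while (i < n) {
            int k = 0;
            while (k < m && p.charAt(m - 1 - k) == t.charAt(i - k))
                k++;                                  // compare right to left
            if (k == m) return i - m + 1;             // full match found
            i += shift[t.charAt(i)];                  // shift by table entry for the aligned text char
        }
        return -1;                                    // not found
    }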

33
Pseudocode for Horspool
34
Horspool Example
In this run we make only 12 comparisons, fewer than
the length of the text! (24 chars)
35
Boyer-Moore
  • Similar in idea to Horspool's algorithm in that
    comparisons are made right to left, but more
    sophisticated in how it shifts the pattern.
  • Using the bad-symbol heuristic, we jump to the
    next rightmost character in P matching the character
    in T.

36
Boyer-Moore Heuristic 1
  • Advance offset i using the "bad character" rule.
  • Upon a mismatch at text character c, look up
    index[c].
  • Increase offset i so that the j-th character of the
    pattern lines up with text character c. (A sketch of
    the table appears below.)
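A hedged sketch of the bad-character table only (assumption: right[c] holds
the rightmost index of character c in the pattern, or -1 if it is absent).
Upon a mismatch at pattern position j against text character c, the pattern
can safely be shifted by max(1, j - right[c]).

    static int[] badCharTable(String p) {
        int[] right = new int[256];
        java.util.Arrays.fill(right, -1);     // characters not in the pattern
        for (int j = 0; j < p.length(); j++)
            right[p.charAt(j)] = j;           // remember the rightmost occurrence
        return right;
    }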

37
Boyer-Moore Heuristic 2
  • Use KMP-like suffix rule.
  • effective with small alphabets
  • different rules lead to different worst-case
    behavior

38
Run Times
Scheme               Expected    Worst Case
Brute                n           m·n
Rabin-Karp           m + n       n (Monte Carlo) or m·n (Las Vegas)
Knuth-Morris-Pratt   m + n       2n
Boyer-Moore          m + n/m     4n

Monte Carlo sometimes reports success when it is not true. Las Vegas does a
match check (it is really unlikely to take this long, as most checks are cut
off early).