String Matching - PowerPoint PPT Presentation

About This Presentation
Title:

String Matching

Description:

Knuth-Morris-Pratt Method (linear time algorithm) A better idea In step 3, when there is a mismatch we move forward one position (i=i+1). – PowerPoint PPT presentation

Slides: 31
Provided by: csCityuE
Category:
Tags: matching | string

Transcript and Presenter's Notes


1
String Matching
  • The problem
  • Input: a text T (a very long string) and a pattern
    P (a short string).
  • Output: the index in T where a copy of P begins.

2
Some Notations and Terminologies
  • |P| and |T|: the lengths of P and T.
  • P[i]: the i-th letter of P.
  • Prefix of P: a substring of P starting with P[1].
  • P[1..i]: the prefix containing the first i
    letters of P.
  • Example: P = abcabbccaa.
  • Prefixes: a, ab, abc, abca, abcab, abcabb, ...

3
Some Notations and Terminologies
  • Suffix of P[1..i]: a substring of P[1..i] ending
    at P[i], e.g. P[3..i], P[5..i] (i > 4).
  • Example: P[1..5] = abcaa.
  • Suffixes of P[1..3]: c, bc, abc.
  • Suffixes of P[1..4]: a, ca, bca, abca.

4
Straightforward method
  • Basic idea
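  • 1. i = 1
  • 2. Start with T[i] and match P[1], P[2], ..., P[|P|]
    against T[i], T[i+1], ..., T[i+|P|-1].
  • 3. Whenever a mismatch is found, set
    i = i + 1 and go to step 2, as long as
    i + |P| - 1 ≤ |T|.
  • Example 1: T = ABABABCCA and P = ABABC.
  • i = 1: P is compared with T[1..5] = ABABA; mismatch
    at the fifth letter.
  • i = 2: P[1] = A is compared with T[2] = B; mismatch
    at the first letter.
  • i = 3: P matches T[3..7] = ABABC.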
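
The three steps above can be sketched in Python as follows. This is only an illustration: the function name `naive_search` is an assumption, not something from the slides, and the sketch reports every occurrence rather than stopping at the first. Indices are 1-based to match the slide notation.

```python
def naive_search(text, pattern):
    """Return the 1-based start index of every occurrence of pattern in text."""
    n, m = len(text), len(pattern)
    hits = []
    # i is the position in T aligned with P[1]; it runs while i + |P| - 1 <= |T|.
    for i in range(1, n - m + 2):
        j = 0
        # Step 2: compare P[1..|P|] with T[i..i+|P|-1], stopping at a mismatch.
        while j < m and text[i - 1 + j] == pattern[j]:
            j += 1
        if j == m:
            hits.append(i)
        # Step 3: on a mismatch (or after a hit) simply move one position on.
    return hits

print(naive_search("ABABABCCA", "ABABC"))  # [3]
```

The inner loop can run up to |P| times for each of the roughly |T| alignments, which is where the product in the running time comes from.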

5
Analysis
  • Step 2 takes O(|P|) comparisons in the worst
    case.
  • Step 2 could be repeated O(|T|) times.
  • Total running time is O(|T||P|).

6
Knuth-Morris-Pratt Method (linear time algorithm)
  • A better idea
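  • In step 3 of the straightforward method, when there
    is a mismatch we move forward one position
    (i = i + 1).
  • We may move more than one position at a time when
    a mismatch occurs (carefully study the pattern
    P).
  • For example, with P = ABABC and T = ABABABCCA, the
    first attempt matches ABAB and fails at P[5].
  • P can then be moved forward two positions at once,
    so that it is aligned with T[3]; its first letters
    ABA match T[3..5].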

7
  • Questions
  • How to decide how many positions we should jump
    when a mismatch occurs?
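  • How much can we benefit? The running time becomes
    O(|T| + |P|).
  • Example 2
  • P = abcabcabcaa
  • T = abcabcabcabcaa
  • The first attempt matches abcabcabca and fails at
    P[11]. P is then moved forward three positions:
    abcabca is already known to match, and the
    comparison continues at T[11], giving abcabcab.
  • (Figure: the shifted pattern drawn under T, with a
    marker labelled "back here".)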

8
  • We can move forward more than one position.
    Reason?
  • Study of Pattern P
  • P[1..7] = abcabca
  • P[1..10] = abcabcabca (when trying to match P[11],
    we have a mismatch)
  • P[1..7] = abcabca
  • P[1..4] = abca
  • P[1..7] is the longest prefix that is also a
    suffix of P[1..10].
  • P[1..4] is a prefix that is a suffix of P[1..10],
    but not the longest.
  • Key: when a mismatch occurs at P[i+1], we want to
    find the longest prefix of P[1..i], other than
    P[1..i] itself, which is also a suffix of P[1..i].

9
Failure function
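  • f(i) is the largest r with r < i such that
  • P[1] P[2] ... P[r] = P[i-r+1] P[i-r+2] ...
    P[i].
  • The left side is the prefix of length r; the right
    side is the suffix of P[1] P[2] ... P[i] of length r.
  • That is, P[1..f(i)] is the longest prefix that is
    a suffix of P[1..i].
  • Example 3: P = ababaccc and i = 5.
  • P[1] P[2] P[3] = a b a is a prefix of
    P[1..5] = a b a b a.
  • P[3] P[4] P[5] = a b a is a suffix of P[1..5], so
    r = 3 and f(5) = 3.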

10
  • Example 4
  • P = abcabbabcabbaa
  • It is easy to verify that
  • f(1) = 0, f(2) = 0, f(3) = 0, f(4) = 1, f(5) = 2,
  • f(6) = 0, f(7) = 1, f(8) = 2, f(9) = 3, f(10) = 4,
  • f(11) = 5, f(12) = 6, f(13) = 7, f(14) = 1.
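
The values in Example 4 can be checked directly from the definition of f. The sketch below is a deliberately slow check that tries every r < i from largest to smallest; it is not the linear-time construction. The name `failure_by_definition` is an assumption made for illustration.

```python
def failure_by_definition(pattern):
    """f[i] = largest r < i with P[1..r] == P[i-r+1..i]; f[0] is unused padding."""
    m = len(pattern)
    f = [0] * (m + 1)
    for i in range(1, m + 1):
        prefix = pattern[:i]              # P[1..i]
        for r in range(i - 1, 0, -1):     # try the longest candidate first
            if prefix[:r] == prefix[i - r:]:
                f[i] = r
                break
    return f

print(failure_by_definition("abcabbabcabbaa")[1:])
# [0, 0, 0, 1, 2, 0, 1, 2, 3, 4, 5, 6, 7, 1]
```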

11
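The Scan Algorithm (draw a figure to show)
  • i indicates that T[i] is the next character in T
    to be compared with the right end of the
    pattern.
  • q indicates that P[q+1] is the next character in
    P to be compared with T[i].
  • Step 1: i = 1 and q = 0.
  • Step 2: compare T[i] with P[q+1].
    Case 1: T[i] = P[q+1]. If q + 1 = |P| then
    print "P occurs at i + 1 - |P|". Then set
    i = i + 1 and q = q + 1; if q = |P|, also set
    q = f(q) so that the scan can continue.
    Case 2: T[i] ≠ P[q+1] and q ≠ 0. Set q = f(q).
    Case 3: T[i] ≠ P[q+1] and q = 0. Set i = i + 1.
  • Repeat step 2 until i > |T|.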

12
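  • Example 5: P = abcabbabcabbaa
  • T = abcabcabbabbabcabbabcabbaa
  • abcabb: P is aligned with T[1]; mismatch at P[6],
    so q = f(5) = 2.
  • abcabbabc: P is aligned with T[4]; mismatch at
    P[9], so q = f(8) = 2.
  • abc: P is aligned with T[10]; mismatch at P[3], so
    q = f(2) = 0.
  • a: P is aligned with T[12]; mismatch at P[1] with
    q = 0, so i = i + 1.
  • abcabbabcabbaa: P is aligned with T[13] and matches
    completely (q + 1 = |P|).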
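
The trace in Example 5 can be reproduced with a small sketch of the three-case scan. Two things here are assumptions for illustration: the function name `scan`, and the choice to pass the failure values in as a ready-made list (here copied from Example 4) instead of computing them. The pattern is assumed to be non-empty.

```python
def scan(text, pattern, f):
    """Three-case scan; f[q] holds f(q) for q = 1..|P| (f[0] is unused padding)."""
    n, m = len(text), len(pattern)
    hits = []
    i, q = 1, 0                          # T[i] is compared with P[q+1]
    while i <= n:
        if text[i - 1] == pattern[q]:    # case 1: T[i] = P[q+1]
            if q + 1 == m:
                hits.append(i + 1 - m)   # P occurs at i + 1 - |P|
            i += 1
            q += 1
            if q == m:
                q = f[q]                 # keep scanning after a full match
        elif q != 0:                     # case 2: mismatch with q != 0
            q = f[q]
        else:                            # case 3: mismatch with q = 0
            i += 1
    return hits

# Failure values from Example 4, padded so that F[q] is f(q).
F = [0, 0, 0, 0, 1, 2, 0, 1, 2, 3, 4, 5, 6, 7, 1]
print(scan("abcabcabbabbabcabbabcabbaa", "abcabbabcabbaa", F))  # [13]
```

Note that i never decreases: after a mismatch only q changes, which is what makes the linear bound possible.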

13
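Running time complexity (hard)
  • The running time of the scan algorithm is O(|T|).
  • Proof
  • There are two pointers, i and p.
  • i: the next character in T to be compared.
  • p: the position in T that is aligned with P[1].
    (See figure below)
  • (Figure: P = abcabcabcaa drawn under
    T = abcabcabcabcaa, with p marking the start of P
    and i marking the next character of T to compare;
    a second copy of P is drawn shifted to the right,
    with p moved forward.)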

14
  • Facts
  • Fact 1: When a match is found, move i forward.
  • Fact 2: When a mismatch is found, move p forward
    until p and i are the same. (When p = i and a
    mismatch occurs, move both i and p forward.)
  • From facts 1 and 2, it is easy to see that the
    total number of comparisons is at most 2|T|.
  • Thus, the time complexity is O(|T|).

15
Another version of scan algorithm (code)
  • n = |T|
  • m = |P|
  • q = 0
  • for i = 1 to n
  •     while q > 0 and P[q+1] ≠ T[i] do
  •         q = f(q)
  •     if P[q+1] = T[i] then
  •         q = q + 1
  •     if q = m then
  •         print "pattern occurs at i - m + 1"
  •         q = f(q)

16
Failure Function Construction
  • Basic idea
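  • Case 1: f(1) is always 0.
  • Case 2: if P[q] = P[f(q-1)+1] then
    f(q) = f(q-1) + 1.
  • Example: P = abcabcc
  • The prefix abc reappears at positions 4 to 6.
  • f(1) = 0, f(2) = 0, f(3) = 0, f(4) = 1, f(5) = 2,
    f(6) = 3, f(7) = 0
  • P[4] = P[f(4-1)+1], so f(4) = f(4-1) + 1 = 1.
  • P[5] = P[f(5-1)+1], so f(5) = f(5-1) + 1 = 1 + 1 = 2.
  • P[6] = P[f(6-1)+1], so f(6) = f(6-1) + 1 = 2 + 1 = 3.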

17
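  • Case 3: if P[q] ≠ P[f(q-1)+1] and f(q-1) ≠ 0 then
    compare P[q] with P[f(f(q-1))+1] (do it
    recursively).
  • Case 4: if P[q] ≠ P[f(q-1)+1] and f(q-1) = 0 then
    f(q) = 0.
  • Example: P = abcabcabb, computing f(9).
  • f(8) = 5: P[9] = b is compared with P[6] = c;
    mismatch.
  • f(5) = 2: P[9] = b is compared with P[3] = c;
    mismatch.
  • f(2) = 0: P[9] = b is compared with P[1] = a;
    mismatch, so f(9) = 0.
  • i:    1 2 3 4 5 6 7 8 9
  • f(i): 0 0 0 1 2 3 4 5 0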

18
The algorithm (code) to compute failure function
  • 1. m = |P|
  • 2. f(1) = 0
  • 3. k = 0
  • 4. for q = 2 to |P| do
  • 5.     k = f(q-1)
  • 6.     if (k > 0 and P[k+1] != P[q])
  •            k = f(k); goto 6
  • 7.     if (k > 0 and P[k+1] = P[q])
  •            f(q) = k + 1
  • 8.     if (k = 0)
  •            if (P[k+1] = P[q]) f(q) = 1
  •            else f(q) = 0

19
Another version
  • 1. m = |P|
  • 2. f(1) = 0
  • 3. k = 0
  • 4. for q = 2 to |P| do
  • 5.     k = f(q-1)
  • 6.     while (k > 0 and P[k+1] != P[q]) do
  • 7.         k = f(k)
  • 8.     if (P[k+1] = P[q]) then k = k + 1
  • 9.     f(q) = k
20
  • Example 3
  • i:    1 2 3 4 5 6 7 8 9 10 11 12
  • P[i]: a b c a b c a b c a  a  c
  • f(1) = 0, f(2) = 0, f(3) = 0, f(4) = 1, f(5) = 2,
  • f(6) = 3, f(7) = 4, f(8) = 5, f(9) = 6, f(10) = 7,
  • f(11) = 1.
  • (The computation of f(11) is very interesting.)
  • Question: Do we need to compute f(12)?
  • Yes, if you want to find ALL occurrences of P.
  • No, if you just want to find the first occurrence
    of P.

21
  • Example
  • P = abcabc
  • T = abcabcabc
  • abcabc occurs at position 1 of T.
  • abcabc occurs again at position 4 of T, overlapping
    the first occurrence.
  • When a match is found at the end of P, set
    q = f(|P|) and continue.
  • Running time complexity (Fun Part, not required)
  • The running time of the failure function
    construction algorithm is O(|P|). (The proof is
    similar to that for the scan algorithm.)
  • Total running time complexity
  • The total complexity for failure function
    construction and the scan algorithm is O(|P| + |T|).
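
Putting the two phases together, the sketch below builds the failure function and then runs the for-loop version of the scan, reporting all occurrences. It is an illustration under assumptions: the name `kmp_search` is hypothetical, the pattern is assumed non-empty, and indices are 1-based as on the slides.

```python
def kmp_search(text, pattern):
    """Return the 1-based start index of every occurrence of pattern in text."""
    n, m = len(text), len(pattern)

    # Phase 1: failure function, O(|P|). f[q] = f(q); f[0] is unused padding.
    f = [0] * (m + 1)
    k = 0
    for q in range(2, m + 1):
        while k > 0 and pattern[k] != pattern[q - 1]:
            k = f[k]
        if pattern[k] == pattern[q - 1]:
            k += 1
        f[q] = k

    # Phase 2: scan, O(|T|). q = number of pattern letters currently matched.
    hits = []
    q = 0
    for i in range(1, n + 1):
        while q > 0 and pattern[q] != text[i - 1]:
            q = f[q]
        if pattern[q] == text[i - 1]:
            q += 1
        if q == m:
            hits.append(i - m + 1)   # pattern occurs at i - m + 1
            q = f[q]                 # needs f(|P|) to find overlapping matches
    return hits

print(kmp_search("abcabcabc", "abcabc"))  # [1, 4]
```

The second hit is found only because q is reset to f(|P|) = 3 rather than 0, which is why f must be computed for the last position when all occurrences are wanted.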

22
Linear Time Algorithm for Multiple Patterns (Fun Part)
  • Input: a string T (very long) and a set of
    patterns P1, P2, ..., Pk.
  • Output: all the occurrences of the Pi's in T.
  • Let us consider the set of patterns {he, she,
    his, hers}. We can construct an automaton as
    follows.

23
(Figure: the automaton for he, she, his, hers. Edge labels: h, e, r, s, i, plus a combined label e,i,r.)

24
  • g(s, a) = s' means that at state s, if the next
    input letter is a, then the next state is s'.
  • The states of the automaton are organized column
    by column.
  • Each state corresponds to a prefix of some
    pattern Pi.
  • F: the set of final states (dark circled),
    corresponding to the ends of patterns.
  • For the starting state 0, add g(0, a) = 0 if
    g(0, a) is originally fail.

25
  • Exercise: write down the g() function for the
    above automaton.
  • Failure function
  • f(s): the state for the longest prefix of some
    pattern Pi that is a proper suffix of the string
    on the path from 0 (the starting state) to s.
  • Example
  • he is the longest prefix (of hers) that is a
    suffix of the string she.

26
The scan algorithm
  • Text: T[1] T[2] ... T[n]
  • s = 0
  • for i = 1 to n do
  •     while g(s, T[i]) = fail do s = f(s)
  •     s = g(s, T[i])
  •     if s is in F then return "yes"
  • return "no"

27
  • Theorem: The scan algorithm takes O(|T|) time.
  • Proof: Again, the two-pointer argument.
  • When a match is found, move the first pointer
    forward (s = g(s, T[i])).
  • When a mismatch is found (g(s, T[i]) = fail), move
    the second pointer forward (s = f(s)).
  • When a final state is met, declare that a pattern
    has been found (if s is in F then return "yes").

28
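  • Example
  • i:    1 2 3 4 5 6 7 8
  • T[i]: s h e r s h i i
  • (Figure: the states visited during the scan, drawn
    as stacked rows: 3 4 5, then 2 8 9, then 3 4, then
    1, then 0, then 0 0. Each drop to a new row is a
    failure transition.)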

29
Failure function construction
  • Basic idea: similar to that for one pattern.
  • for each state s of depth 1 do
  •     f(s) = 0
  • for each depth d ≥ 1 do
  •     for each state sd of depth d and character a
        such that g(sd, a) = s' do
  •         s = f(sd)
  •         while g(s, a) = fail do
  •             s = f(s)
  •         f(s') = g(s, a)
30
  • g(0, c) ≠ fail for any possible character c.
  • The failure function for he, she, his, hers is
    (table not preserved in the transcript).
  • Time complexity: O(|P1| + |P2| + ... + |Pk|).
  • Proof: Two-pointer argument.
  • Leave it for assignment (optional)