String Matching Algorithms Based upon the Uniqueness Property

1
String Matching Algorithms Based upon the
Uniqueness Property
C. W. Lu and R. C. T. Lee, 2007, "String Matching
Algorithms Based upon the Uniqueness Property,"
The 24th Workshop on Combinatorial Mathematics
and Computation Theory, pp. 385-392.
  • Advisor: Prof. R. C. T. Lee
  • Speaker: C. W. Lu

2
  • String matching problem
  • Given a text string T of length n and a pattern
    string P of length m, find all occurrences of P
    in T.
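
As a baseline for the algorithms that follow, the problem statement can be sketched as a brute-force scan. The function name `find_all` and the 0-based positions are illustrative assumptions, not part of the slides.

```python
def find_all(text: str, pattern: str) -> list[int]:
    """Return every 0-based position at which pattern occurs in text.

    Overlapping occurrences are reported too. This is the naive O(n*m)
    baseline, not one of the algorithms from the presentation.
    """
    n, m = len(text), len(pattern)
    return [i for i in range(n - m + 1) if text[i:i + m] == pattern]


print(find_all("ACACACGT", "ACAC"))  # [0, 2]
```

Every faster algorithm in this presentation should return the same positions as this scan; only the number of window positions examined differs.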

3
Rule 1: The Suffix to Prefix Rule
  • Suppose u is the longest suffix of a window
    that is also a prefix of P. We can move P in
    such a way that the prefix u of P matches
    the suffix u of the window.
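
Rule 1 can be sketched as a small function that returns the resulting shift. The name `suffix_prefix_shift` is hypothetical, and the sketch assumes u must be shorter than the window, so that the shift is at least one step.

```python
def suffix_prefix_shift(window: str, pattern: str) -> int:
    """Shift that aligns the prefix u of pattern with the suffix u of window.

    u is the longest suffix of window that is also a prefix of pattern.
    Assumption: only overlaps shorter than the window are considered,
    so the returned shift is always >= 1.
    """
    m = len(pattern)
    # Try candidate overlap lengths from longest to shortest.
    for length in range(min(len(window), m) - 1, 0, -1):
        if window.endswith(pattern[:length]):
            return m - length
    # No overlap at all: the pattern can jump past the whole window.
    return m


print(suffix_prefix_shift("GCATCA", "CAGTCA"))  # 4, because u = "CA"
```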

4
The Uniqueness Property of a String
  • For any substring V of P, if V occurs in P only
    once, V is a unique substring.
  • When V matches some substring of T, we can
    move P in such a way that a prefix of P matches
    a suffix of V.

5
Example
P = catagtagcct. Suppose we use the
substring cc as the unique substring.
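
The uniqueness test in this example can be checked mechanically. The helper names `count_occurrences` and `is_unique` are illustrative assumptions; occurrences are counted with overlaps allowed.

```python
def count_occurrences(s: str, v: str) -> int:
    """Count occurrences of v in s, overlapping ones included."""
    return sum(1 for k in range(len(s) - len(v) + 1) if s.startswith(v, k))


def is_unique(pattern: str, v: str) -> bool:
    """True if v occurs in pattern exactly once."""
    return count_occurrences(pattern, v) == 1


P = "catagtagcct"
print(is_unique(P, "cc"))   # True
print(is_unique(P, "tag"))  # False, "tag" occurs twice
```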
6
Algorithm 1: The Longest Prefix with Unique
Suffix Matching Algorithm
  • We further relax the uniqueness property by
    noting that the substring does not have to be
    unique in the entire pattern P. In fact, a
    substring that is unique in a prefix of P suffices.
  • Therefore, we only have to find the longest
    prefix of P whose suffix is unique within that
    prefix.

7
Example
P = CACTAGCCACTCTC. The substring TC occurs twice
in P, but it is unique in the prefix CACTAGCCACTC.
Move P 11 steps.
8
Example
P = CACTAGCCACTCTC. The substring G is also unique
in the prefix CACTAG.
Move P 6 steps.
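
The two shifts above are consistent with the following reading, which is an interpretation rather than a formula given on the slides: the shift equals the prefix length minus the length of the longest proper suffix of the unique substring that is also a prefix of P (the Suffix to Prefix Rule applied to the unique substring). A minimal sketch, with hypothetical function names:

```python
def count_occurrences(s: str, v: str) -> int:
    """Count occurrences of v in s, overlapping ones included."""
    return sum(1 for k in range(len(s) - len(v) + 1) if s.startswith(v, k))


def unique_suffix_shift(pattern: str, prefix_len: int, size: int):
    """Shift for the size-`size` suffix of pattern[:prefix_len].

    Returns None if that suffix is not unique within the prefix.
    Assumes 1 <= size <= prefix_len <= len(pattern).
    """
    prefix = pattern[:prefix_len]
    v = prefix[-size:]
    if count_occurrences(prefix, v) != 1:
        return None
    # Longest proper suffix of v that is also a prefix of the pattern.
    for length in range(size - 1, 0, -1):
        if pattern.startswith(v[-length:]):
            return prefix_len - length
    return prefix_len


P = "CACTAGCCACTCTC"
print(unique_suffix_shift(P, 12, 2))  # 11, TC in the prefix CACTAGCCACTC
print(unique_suffix_shift(P, 6, 1))   # 6, G in the prefix CACTAG
```

For TC the suffix C is also the first character of P, so the shift is 12 - 1 = 11; for G there is no overlap, so the shift is the full prefix length 6.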
9
P = CACTAGCCACTCTC
In the above example, using the unique substring
TC, we could move P 11 steps if TC matches
TC in T; using the unique substring G, we could
move P 6 steps if G matches G in T.
Is the unique substring TC better than the unique
substring G?
10
  • Note that the more often the unique substring
    matches in T, the more efficient our algorithm
    is.
  • In general, the probability that TC in P matches
    the aligned characters of T is 1/16 (supposing
    the alphabet size is 4), and the probability that
    G in P matches the aligned character of T is 1/4.
  • Thus, the size of the unique substring is also
    important.

11
P = CACTAGCCACTCTC
  • For each time the substring TC in P matches TC
    in T and moves P by 11 steps, the substring G
    in P may match G in T four times, moving P by
    6 steps each time. So we expect the substring G
    to be better than the substring TC in general.

12
  • We now define a ratio s to determine which
    substring is better.
  • Let Σ be the alphabet.
  • The larger s is, the better the efficiency that
    can be achieved in the searching phase.
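
The formula for the ratio is not reproduced in this transcript. One form that is consistent with the surrounding examples, and that is only an assumption here, is the shift length divided by |Σ| raised to the size of the unique substring, that is, the shift weighted by the probability that the substring matches.

```python
def shift_ratio(shift: int, size: int, alphabet_size: int) -> float:
    """Assumed ratio: shift / alphabet_size ** size.

    This form is inferred from the examples in the slides and is not
    confirmed as the authors' definition.
    """
    return shift / alphabet_size ** size


print(shift_ratio(11, 2, 4))  # 0.6875 for TC (shift 11, size 2)
print(shift_ratio(6, 1, 4))   # 1.5 for G (shift 6, size 1)
```

Under this assumed ratio G scores higher than TC, which agrees with the expectation stated earlier that G is the better choice in general.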

13
Preprocessing Phase
P = CAGACGACCCCAACAGC, Σ = {A, C, G, T}, |Σ| = 4.
Find the longest prefix with a unique suffix
whose size is one.
14
Preprocessing Phase
  • We have found the unique substring with size 1,
    and we could use it to move P 3 steps.
  • Next, we try to find a unique substring with
    size 2 such that we could use this substring to
    move P more than 3 × 4 = 12 steps.
  • Thus, we only consider the substrings of
    p12p13...p16.
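
The preprocessing steps above can be sketched as a search over every prefix for the best unique suffix of a given size. The function names are hypothetical, the shift rule (prefix length minus the overlap with the start of P) is an interpretation of the earlier examples, and the sketch scans all prefixes instead of pruning to the last few positions as the slide suggests.

```python
def count_occurrences(s: str, v: str) -> int:
    """Count occurrences of v in s, overlapping ones included."""
    return sum(1 for k in range(len(s) - len(v) + 1) if s.startswith(v, k))


def best_unique_suffix(pattern: str, size: int):
    """Return (shift, substring, prefix_length) with the largest shift.

    Considers every prefix of pattern whose last `size` characters are
    unique within that prefix. Assumes 1 <= size <= len(pattern).
    """
    best = None
    for end in range(size, len(pattern) + 1):
        v = pattern[end - size:end]
        if count_occurrences(pattern[:end], v) != 1:
            continue
        # Longest proper suffix of v that is also a prefix of the pattern.
        overlap = max(
            (k for k in range(1, size) if pattern.startswith(v[-k:])),
            default=0,
        )
        shift = end - overlap
        if best is None or shift > best[0]:
            best = (shift, v, end)
    return best


P = "CAGACGACCCCAACAGC"
print(best_unique_suffix(P, 1))  # (3, 'G', 3)
print(best_unique_suffix(P, 2))  # (16, 'GC', 17)
```

The size-2 candidate GC moves P 16 steps, which exceeds the 12-step threshold, so it would be preferred over the size-1 candidate G.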

15
Searching Phase
If the unique substring mismatches, move P one
step.
Move 1 step.
16
Searching Phase
If the unique substring GC matches with GC in T,
move P 16 steps.
Move 16 steps.
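
The searching phase can be sketched as follows. The slides only state the two shift rules; verifying the whole window whenever the unique substring matches is an assumption added here so that occurrences are actually reported. All function names are hypothetical.

```python
def count_occurrences(s: str, v: str) -> int:
    """Count occurrences of v in s, overlapping ones included."""
    return sum(1 for k in range(len(s) - len(v) + 1) if s.startswith(v, k))


def preprocess(pattern: str, size: int):
    """Return (start, shift) for the best unique suffix of the given size.

    start is the 0-based offset of the unique substring inside pattern.
    Assumes 1 <= size <= len(pattern), so a candidate always exists.
    """
    best = None
    for end in range(size, len(pattern) + 1):
        v = pattern[end - size:end]
        if count_occurrences(pattern[:end], v) != 1:
            continue
        overlap = max(
            (k for k in range(1, size) if pattern.startswith(v[-k:])),
            default=0,
        )
        if best is None or end - overlap > best[1]:
            best = (end - size, end - overlap)
    return best


def search(text: str, pattern: str, size: int = 2) -> list[int]:
    """Report all 0-based occurrences of pattern in text."""
    start, shift = preprocess(pattern, size)
    v = pattern[start:start + size]
    hits, pos = [], 0
    while pos + len(pattern) <= len(text):
        if text.startswith(v, pos + start):
            # Unique substring matched: verify the window, then jump.
            if text.startswith(pattern, pos):
                hits.append(pos)
            pos += shift
        else:
            # Unique substring mismatched: move one step.
            pos += 1
    return hits


P = "CAGACGACCCCAACAGC"
print(search("AAGC" + P, P))  # [4]
```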
17
  • As discussed above, the size of the unique
    substring is important.
  • In the following, we introduce another
    algorithm, which uses a unique substring of
    size one.

18
Algorithm 2: Longest Substring with Unique
Character Matching Algorithm
  • In the window, let x be any character. In order
    to have any meaningful matching of P with T, we
    must find the same x in P located to the left
    of x in T.

19
  • In the preprocessing phase, we try to find the
    longest substring p of P such that x occurs in p
    only once. That is, p = p_i p_{i+1} ... p_j,
    and p_j occurs in p only once.

20
  • If the unique character x matches x in T, we
    can move P |p| steps.

21
Example
In this example, we would find the longest
substring p4p5...p10 with a unique character p10.
If the character p10 matches T, we can move
P 7 steps.
22
Searching Phase
If p10 mismatches, move P one step.
Move 1 step.
23
Searching Phase
If p10 matches with T, move P 7 steps.
Move 7 steps.
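
Algorithm 2 can be sketched in a few lines. The pattern of the slide's example is not reproduced in this transcript, so the sketch reuses the pattern from Algorithm 1 purely for illustration. It assumes the unique character is the last character of the substring, uses 0-based indices, and verifies the whole window on a match; the function names are hypothetical.

```python
def longest_unique_char_substring(pattern: str):
    """Return (length, start, end) of the longest substring whose last
    character occurs in it only once (0-based, end inclusive)."""
    last_seen = {}
    best = (0, 0, 0)
    for j, ch in enumerate(pattern):
        # The substring may start right after the previous occurrence of ch.
        start = last_seen.get(ch, -1) + 1
        length = j - start + 1
        if length > best[0]:
            best = (length, start, j)
        last_seen[ch] = j
    return best


def search_unique_char(text: str, pattern: str) -> list[int]:
    """Report all 0-based occurrences; pattern must be non-empty."""
    shift, _, j = longest_unique_char_substring(pattern)
    hits, pos = [], 0
    while pos + len(pattern) <= len(text):
        if text[pos + j] == pattern[j]:
            # Unique character matched: verify the window, then jump.
            if text.startswith(pattern, pos):
                hits.append(pos)
            pos += shift
        else:
            pos += 1
    return hits


P = "CAGACGACCCCAACAGC"
print(longest_unique_char_substring(P))  # (10, 6, 15)
print(search_unique_char("AAGC" + P, P))  # [4]
```

Here the chosen substring is ACCCCAACAG, whose final G does not occur among the nine characters before it, so a match on that G allows a move of 10 steps.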
24
Algorithm 3: The Unique Pairwise Substring
Algorithm
  • The substring p_i p_{i+1} ... p_{j-1} p_j is called a
    unique pairwise substring if it satisfies the
    condition that p_i p_{i+1} ... p_{j-1} p_j occurs in the
    prefix p_1 p_2 ... p_{j-1} p_j of P exactly once, and no
    p_k p_{k+1} ... p_{k+j-i} exists in p_1 p_2 ... p_{j-1} such
    that p_k = p_i and p_{k+j-i} = p_j.

25
Example
The substring TCG is a unique pairwise substring
because no p_k p_{k+1} p_{k+2} exists in p1p2...p12 such
that p_k = p11 = T and p_{k+2} = p13 = G.
The substring CAC is not a unique pairwise
substring because there exists a substring p2p3p4
in p1p2...p9 such that p2 = p8 = C and p4 = p10 = C.
26
  • Suppose p_i p_{i+1} ... p_{j-1} p_j is a unique pairwise
    substring.
  • If p_i and p_j match T, we have two cases for
    moving P.

Case 1: There exists k such that p_j = p_k, where
0 ≤ k ≤ j-i-1. We can move P j-k steps.
27
Case 2: p_j ≠ p_k for every k, where 0 ≤ k ≤ j-i-1.
We can move P j+1 steps.
28
Example
If we choose p11p12p13 as the unique pairwise
substring, we can move P 14 steps when p11 and
p13 match with T.
29
  • There may be many unique pairwise substrings in
    the pattern.
  • We select the one that is located
    rightmost in the pattern.

Example
The substrings p5p6, p7p8p9 and p11p12p13 are all
unique pairwise substrings. We would select
p11p12p13 because it will have the largest move.
30
Example
If p11 or p13 mismatches, move P one step.
31
Example
If p11 and p13 match with T, move P 14 steps.
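
Algorithm 3 can be sketched as below. The sketch uses 0-based indices, the convention under which the bound 0 ≤ k ≤ j-i-1 and the j+1 shift of the two cases are consistent. It takes the largest qualifying k in Case 1 (the smallest safe shift), picks the pair with the largest shift and prefers the rightmost one on ties, and verifies the whole window on a match. These choices and all function names are assumptions; the pattern of the slide's example is not reproduced in this transcript, so a pattern from Algorithm 1 is reused for illustration.

```python
def is_unique_pair(p: str, i: int, j: int) -> bool:
    """True if no k < i has p[k] == p[i] and p[k + j - i] == p[j]."""
    d = j - i
    return not any(p[k] == p[i] and p[k + d] == p[j] for k in range(i))


def pair_shift(p: str, i: int, j: int) -> int:
    """Shift when p[i] and p[j] both match the text."""
    # Case 1: largest k in [0, j-i-1] with p[k] == p[j] gives shift j - k.
    for k in range(j - i - 1, -1, -1):
        if p[k] == p[j]:
            return j - k
    # Case 2: no such k, so the pattern moves past position j entirely.
    return j + 1


def choose_pair(p: str):
    """Return (shift, i, j) with the largest shift; needs len(p) >= 2."""
    best = None
    for j in range(1, len(p)):
        for i in range(j):
            if is_unique_pair(p, i, j):
                s = pair_shift(p, i, j)
                if best is None or s >= best[0]:
                    best = (s, i, j)
    return best


def search_pair(text: str, p: str) -> list[int]:
    """Report all 0-based occurrences of p in text."""
    shift, i, j = choose_pair(p)
    hits, pos = [], 0
    while pos + len(p) <= len(text):
        if text[pos + i] == p[i] and text[pos + j] == p[j]:
            if text.startswith(p, pos):
                hits.append(pos)
            pos += shift
        else:
            pos += 1
    return hits


print(pair_shift("GACG", 1, 3))       # 3, Case 1 with k = 0
print(is_unique_pair("CACAC", 2, 4))  # False, the pair C..C occurs earlier
print(search_pair("AAGC" + "CACTAGCCACTCTC", "CACTAGCCACTCTC"))  # [4]
```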
34
Thanks for your attention.