Pattern Matching - PowerPoint PPT Presentation

About This Presentation
Title:

Pattern Matching

Description:

Knuth-Morris-Pratt algorithm. Boyer-Moore algorithm. 3. Problem description ... Knuth-Morris-Pratt's algorithm preprocesses the pattern to find matches of ... – PowerPoint PPT presentation

Slides: 29
Provided by: csieNc

Transcript and Presenter's Notes

Title: Pattern Matching


1
Pattern Matching
2
Outline
  • Problem description
  • Brute Force algorithm
  • Knuth-Morris-Pratt algorithm
  • Boyer-Moore algorithm

3
Problem description
  • What is pattern matching?
  • Pattern: quchjfng
  • Text:
  • kadogqlxwwenkpjkmolpevnmqcsjakoheorplkzb
  • zfjliar Iacwzjlvkxromavcdquchjfnganqxharnhkw
  • eweg tkrmwm fjduwhqvhwpfg pymyajsaywf aas
  • bwxwcmrncmslgeqztyoygdbnxjqoctslbrvngoyxe

4
Problem description
  • Given two strings P (pattern) and T (text), find
    the first occurrence of P in T.
  • Input: two strings, P and T
  • Output: the starting index of the first occurrence of P
    in T, or Null if P does not occur in T
  • Applications:
    • Text editors
    • Search engines
    • Biological research

5
Strings 1/2
  • A string is a sequence of data elements, such as
    numbers or characters.
  • An alphabet S is a set of basic symbols for a
    category of strings
  • Examples of alphabets:
    • Characters: A, B, ..., Z, a, b, ..., z
    • Numbers: 0, 1, 2, ..., 9
    • DNA: A, C, G, T

6
Strings 2/2
  • Let P be a string of size m
  • A substring P[i .. j] is the string containing the
    characters of P with ranks between i and j
    (inclusive); P = P[0 .. m - 1]
  • A prefix of P is a substring of the type P[0 .. i]
  • A suffix of P is a substring of the type P[i .. m - 1]
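
The substring, prefix, and suffix definitions above can be sketched with Python slices. This is only an illustration: the helper names and the sample string are assumptions, and the slide's ranges are inclusive while Python slices exclude the end index.

```python
def substring(p, i, j):
    """Return P[i .. j] with both endpoints inclusive, as in the slide's notation."""
    return p[i:j + 1]


def prefixes(p):
    """All prefixes P[0 .. i] for i = 0 .. m - 1."""
    return [substring(p, 0, i) for i in range(len(p))]


def suffixes(p):
    """All suffixes P[i .. m - 1] for i = 0 .. m - 1."""
    return [substring(p, i, len(p) - 1) for i in range(len(p))]


P = "TTAGC"  # sample string, chosen only for illustration
print(substring(P, 1, 3))  # TAG
print(prefixes(P))         # ['T', 'TT', 'TTA', 'TTAG', 'TTAGC']
print(suffixes(P))         # ['TTAGC', 'TAGC', 'AGC', 'GC', 'C']
```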

7
Brute Force Algorithm 1/2
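P = TTAGC, T = TTATTAGTTAGC
[Figure: P is slid along T one position at a time; the character comparisons are numbered in order.]
Pattern size m, text size n. Time complexity: O(m(n - m + 1)) = O(mn).
Example of worst case: P = AAAAG, T = AAAAAAAAAAAAAAAAG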
8
Brute Force Algorithm 2/2
Algorithm BruteForceMatch(P, T)
  Input: pattern P of size m and text T of size n
  Output: starting index of the first substring of T equal to P,
          or -1 if no such substring exists
begin
  for i ← 0 to n - m do
    j ← 0
    while j < m and P[j] = T[i + j] do
      j ← j + 1
    if j = m then
      return i    /* match at i */
  return -1       /* no match anywhere */
end
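
The BruteForceMatch pseudocode above translates almost line for line into Python. The sketch below is one possible rendering, not the presenter's code. The function names are assumptions, and `count_comparisons` is a hypothetical helper added only to check the worst-case example (P = AAAAG), where every alignment costs m comparisons.

```python
def brute_force_match(p, t):
    """Return the start index of the first occurrence of p in t, or -1."""
    m, n = len(p), len(t)
    for i in range(n - m + 1):              # i = 0 .. n - m
        j = 0
        while j < m and p[j] == t[i + j]:
            j += 1
        if j == m:
            return i                        # match at i
    return -1                               # no match anywhere


def count_comparisons(p, t):
    """Count character comparisons made by the same loops (hypothetical helper)."""
    m, n = len(p), len(t)
    comparisons = 0
    for i in range(n - m + 1):
        j = 0
        while j < m:
            comparisons += 1
            if p[j] != t[i + j]:
                break
            j += 1
        if j == m:
            break                           # stop at the first match
    return comparisons


print(brute_force_match("TTAGC", "TTATTAGTTAGC"))  # 7

worst_t = "AAAAAAAAAAAAAAAAG"
# Every alignment needs m = 5 comparisons, so the total is m * (n - m + 1).
print(count_comparisons("AAAAG", worst_t) == 5 * (len(worst_t) - 5 + 1))  # True
```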
9
How to speed up pattern matching
  • Shift P by more than one position when a mismatch
    occurs
  • Never miss an occurrence of P in T
  • Preprocessing approach
    • Preprocessing P
      • Knuth-Morris-Pratt algorithm
      • Boyer-Moore algorithm
    • Preprocessing T
      • Suffix trees

10
The Knuth-Morris-Pratt algorithm
  • Best-known linear-time algorithm for exact matching
  • Preprocesses the pattern P
  • Compares from left to right

11
The KMP Ideas
  • Shift by more than one position
  • Reduce the number of comparisons

12
The KMP Algorithm - Motivation
  • When a mismatch occurs, what is the most we can
    shift the pattern?
  • Answer: the largest prefix of P[0..j-1] that is a
    suffix of P[1..j-1]

[Figure: P aligned under T with a mismatch at position j of P. Labels: "No need to repeat these comparisons" and "Resume comparing here".]
13
KMP Failure Function
  • Knuth-Morris-Pratt's algorithm preprocesses the
    pattern to find matches of prefixes of the
    pattern with the pattern itself
  • The failure function F(j) is defined as the size
    of the largest prefix of P[0..j] that is also a
    suffix of P[1..j]

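[Figure: two tables of F(j) for six-character example patterns P.]
  F(j) for j = 0..5:  0 0 1 0 1 2
  F(j) for j = 0..5:  0 0 1 1 2 3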
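
The definition of F(j) can be checked directly with a deliberately naive sketch that tries every candidate size. This is not the O(m) construction used by KMP. The function name is an assumption, and so are the two example patterns: "abacab" and "abaaba" were picked because they produce the value rows 0 0 1 0 1 2 and 0 0 1 1 2 3.

```python
def failure_by_definition(p):
    """F[j] = size of the largest prefix of p[0..j] that is also a suffix of p[1..j]."""
    f = []
    for j in range(len(p)):
        best = 0
        # A suffix of p[1..j] has at most j characters: try sizes j, j-1, ..., 1.
        for k in range(j, 0, -1):
            if p[:k] == p[j - k + 1:j + 1]:
                best = k
                break
        f.append(best)
    return f


print(failure_by_definition("abacab"))  # [0, 0, 1, 0, 1, 2]
print(failure_by_definition("abaaba"))  # [0, 0, 1, 1, 2, 3]
```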
14
The KMP Algorithm
Algorithm KMPMatch(T, P)
begin
  F ← failureFunction(P)
  i ← 0
  j ← 0
  while i < n do
    if T[i] = P[j] then
      if j = m - 1 then
        return i - j    /* match */
      else
        i ← i + 1
        j ← j + 1
    else if j > 0 then
      j ← F[j - 1]
    else
      i ← i + 1
  return -1    /* no match */
end
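
A Python rendering of KMPMatch might look like the sketch below. It bundles a failure-function helper so that it runs on its own. The names are assumptions, and the empty-pattern guard is an addition that the pseudocode does not have.

```python
def failure_function(p):
    """Failure function of p as a list, built in O(m) time."""
    m = len(p)
    f = [0] * m
    i, j = 1, 0
    while i < m:
        if p[i] == p[j]:
            f[i] = j + 1        # we have matched j + 1 characters
            i += 1
            j += 1
        elif j > 0:
            j = f[j - 1]        # fall back to a shorter prefix
        else:
            f[i] = 0            # no match
            i += 1
    return f


def kmp_match(t, p):
    """Return the start index of the first occurrence of p in t, or -1."""
    n, m = len(t), len(p)
    if m == 0:
        return 0                # assumption: an empty pattern matches at index 0
    f = failure_function(p)
    i = j = 0
    while i < n:
        if t[i] == p[j]:
            if j == m - 1:
                return i - j    # match
            i += 1
            j += 1
        elif j > 0:
            j = f[j - 1]        # shift the pattern, keep i where it is
        else:
            i += 1
    return -1                   # no match


print(kmp_match("TTATTAGTTAGC", "TTAGC"))            # 7
print(kmp_match("abacaabaccabacabaabb", "abacab"))   # 10
```

Note that i never moves backwards, which is why the text is scanned only once.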
15
Example
[Figure: worked example of KMP matching a pattern P against a text T.]
16
The KMP Algorithm
  • The failure function can be represented by an
    array and can be computed in O(m) time
  • At each iteration of the while-loop, either
    • i increases by one, or
    • the shift amount i - j increases by at least one
      (observe that F(j - 1) < j)
  • Hence, there are no more than 2n iterations of
    the while-loop
  • Thus, KMP's algorithm runs in optimal time O(m + n)

17
Computing the Failure Function
  • The failure function can be represented by an
    array and can be computed in O(m) time
  • The construction is similar to the KMP algorithm
    itself
  • At each iteration of the while-loop, either
    • i increases by one, or
    • the shift amount i - j increases by at least one
      (observe that F(j - 1) < j)
  • Hence, there are no more than 2m iterations of
    the while-loop

Algorithm failureFunction(P)
begin
  F[0] ← 0
  i ← 1
  j ← 0
  while i < m do
    if P[i] = P[j] then
      /* we have matched j + 1 chars */
      F[i] ← j + 1
      i ← i + 1
      j ← j + 1
    else if j > 0 then
      /* use failure function to shift P */
      j ← F[j - 1]
    else
      F[i] ← 0    /* no match */
      i ← i + 1
  return F
end
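
The bound of 2m iterations can be checked empirically with an instrumented variant of failureFunction. The sketch below is an assumption-laden illustration: the function name, the returned iteration counter, and the test patterns are not from the slides.

```python
def failure_function_counted(p):
    """Return (F, iterations): the failure function and the while-loop iteration count."""
    m = len(p)
    f = [0] * m
    i, j, iterations = 1, 0, 0
    while i < m:
        iterations += 1
        if p[i] == p[j]:
            f[i] = j + 1        # i increases by one
            i += 1
            j += 1
        elif j > 0:
            j = f[j - 1]        # i - j increases by at least one
        else:
            f[i] = 0            # i increases by one
            i += 1
    return f, iterations


for pattern in ("abacab", "abaaba", "aaaab", "TTAGC"):
    f, iterations = failure_function_counted(pattern)
    # The count should never exceed 2m.
    print(pattern, f, iterations, iterations <= 2 * len(pattern))
```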
18
Boyer-Moore's Algorithm
  • The Boyer-Moore pattern matching algorithm is
    based on two principles:
    • Right-to-left scan
    • Bad character shift
  • Reference:
    • http://www.cs.utexas.edu/users/moore/best-ideas/string-searching

19
Right to left Scan Rule
[Figure: P aligned under T; characters are compared from right to left, with the comparisons numbered.]
20
Bad Character Shift Rule 1/3
  • Shift more than one position at a time.
  • When a mismatch occurs at T[i] = c:
    • If P contains c: shift P to align the last
      occurrence of c in P with T[i]

[Figure: P shifted under T so that the last occurrence of c in P lines up with T[i].]
21
Bad Character Shift Rule 2/3
  • When a mismatch occurs at T[i] = c:
    • If P contains c: shift P to align the last
      occurrence of c in P with T[i]
    • Else, if P does not contain c: shift P to
      align P[0] with T[i + 1]

[Figure: P shifted under T so that P[0] lines up with T[i + 1].]
22
Bad Character Shift Rule 3/3
  • When a mismatch occurs at T[i] = c:
    • If P contains c: shift P to align the last
      occurrence of c in P with T[i]
    • Else, if P does not contain c: shift P to
      align P[0] with T[i + 1]
  • What if we have already crossed the last
    occurrence?

[Figure: the last occurrence of c in P lies to the right of the mismatch, so shift one char only.]
23
Example of BM Algorithm
[Figure: worked example of the Boyer-Moore algorithm matching a pattern P against a text T.]
24
Last-Occurrence Function
  • Preprocess the pattern P and the alphabet S to
    build the last-occurrence function L mapping S to
    integers, where L(c) is defined as the largest
    index i such that P[i] = c, or -1 if c does not
    occur in P
  • Example:
    • S = {a, b, c, d}
    • P = abacab
  • The last-occurrence function can be computed in
    time O(m + s), where m is the size of P and s is
    the size of S

  c      a   b   c   d
  L(c)   4   5   3   -1
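
A minimal sketch of the last-occurrence function follows, assuming it is stored as a dictionary keyed by character. The slides do not say how L is represented, and the function name is an assumption. One pass over S plus one pass over P gives the O(m + s) cost.

```python
def last_occurrence_function(p, alphabet):
    """Map each character c of the alphabet to the last index of c in p, or -1."""
    last = {c: -1 for c in alphabet}   # O(s): default for characters not in p
    for i, c in enumerate(p):          # O(m): later indices overwrite earlier ones
        last[c] = i
    return last


print(last_occurrence_function("abacab", "abcd"))
# {'a': 4, 'b': 5, 'c': 3, 'd': -1}
```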
25
The Boyer-Moore Algorithm
Case 1 (figure label): j < l

Algorithm BoyerMooreMatch(T, P, S)
  L ← lastOccurenceFunction(P, S)
  i ← m - 1
  j ← m - 1
  repeat
    if T[i] = P[j] then
      if j = 0 then
        return i    /* match at i */
      else
        i ← i - 1
        j ← j - 1
    else
      /* character-jump */
      l ← L[T[i]]
      i ← i + m - min(j, 1 + l)
      j ← m - 1
  until i > n - 1
  return -1    /* no match */
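
The BoyerMooreMatch pseudocode, with the bad-character rule only, can be sketched in Python as follows. The function name, the dictionary used for L, and the guards for an empty or over-long pattern are assumptions. The `repeat ... until` loop becomes a `while` loop, which is equivalent once m <= n is guaranteed.

```python
def boyer_moore_match(t, p, alphabet):
    """Return the start index of an occurrence of p in t, or -1 (bad-character rule only)."""
    n, m = len(t), len(p)
    if m == 0:
        return 0                          # assumption: empty pattern matches at 0
    if m > n:
        return -1
    last = {c: -1 for c in alphabet}      # last-occurrence function L
    for k, c in enumerate(p):
        last[c] = k
    i = j = m - 1                         # start at the right end of the pattern
    while i <= n - 1:
        if t[i] == p[j]:
            if j == 0:
                return i                  # match at i
            i -= 1
            j -= 1
        else:
            l = last.get(t[i], -1)        # character-jump
            i = i + m - min(j, 1 + l)
            j = m - 1
    return -1                             # no match


print(boyer_moore_match("abacaabadcabacabaabb", "abacab", "abcd"))  # 10
print(boyer_moore_match("TTATTAGTTAGC", "TTAGC", "ACGT"))           # 7
```

When j < l the term min(j, 1 + l) equals j, so the pattern moves right by exactly one position. This is the "shift one char only" situation from the bad character shift rule.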
26
Time complexity Analysis
  • Boyer-Moore's algorithm is significantly faster
    than the Brute-Force Algorithm on English text
  • Pattern size m, text size n, alphabet size s
    ⇒ time complexity O(nm + s)
  • Example of worst case: T = aaaaaaaa..a, P =
    caaaaa

[Figure: P = caaaaa aligned under a text T consisting only of a's.]
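
The worst case can be reproduced with a version of the matcher that counts comparisons. The sketch below uses the bad-character rule only. The function name and the text length of 20 are assumptions chosen for illustration. Every alignment compares all m characters before failing on the leading c, and each shift is by a single position.

```python
def boyer_moore_comparisons(t, p):
    """Count character comparisons (bad-character rule only). Assumes 0 < len(p) <= len(t)."""
    n, m = len(t), len(p)
    last = {}
    for k, c in enumerate(p):
        last[c] = k                       # last occurrence of each pattern character
    comparisons = 0
    i = j = m - 1
    while i <= n - 1:
        comparisons += 1
        if t[i] == p[j]:
            if j == 0:
                break                     # match found
            i -= 1
            j -= 1
        else:
            i = i + m - min(j, 1 + last.get(t[i], -1))
            j = m - 1
    return comparisons


n, p = 20, "caaaaa"
count = boyer_moore_comparisons("a" * n, p)
print(count, count == len(p) * (n - len(p) + 1))  # 90 True
```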
27
Extended Boyer-Moore algorithm
  • Extended Bad character shift rule
    • When a mismatch occurs at position i of P and the
      mismatched character in T is c, then shift P to
      the right so that the closest c to the left of
      position i in P is below the mismatched c in T.
  • The original Boyer-Moore algorithm uses the
    simpler bad character rule

28
Extended Boyer-Moore algorithm
  • Extended Bad character shift rule

[Figure: P aligned under T with a mismatch at position j of P; label R(E)0.]
The positions of E are 0, 3, 4. Find the top number < j.
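
One way to realize the extended rule is to store, for every character, the sorted list of its positions in P and look up the closest position to the left of the mismatch. The sketch below is an assumption-based illustration: the function names and the pattern "EABEEAB" are hypothetical, chosen only so that E occurs at positions 0, 3, 4, and the binary search via `bisect_left` is one possible lookup strategy.

```python
from bisect import bisect_left


def occurrence_positions(p):
    """Map each character of p to the ascending list of its positions."""
    positions = {}
    for k, c in enumerate(p):
        positions.setdefault(c, []).append(k)
    return positions


def extended_bad_char_shift(positions, c, j):
    """Shift for a mismatch at P[j] against text character c."""
    occ = positions.get(c, [])
    k = bisect_left(occ, j)      # number of occurrences of c strictly left of j
    if k == 0:
        return j + 1             # no c to the left: move P past the mismatch
    return j - occ[k - 1]        # align the closest c on the left with the text


positions = occurrence_positions("EABEEAB")     # hypothetical pattern
print(positions["E"])                           # [0, 3, 4]
print(extended_bad_char_shift(positions, "E", 6))  # 2 (closest E left of 6 is at 4)
print(extended_bad_char_shift(positions, "E", 2))  # 2 (closest E left of 2 is at 0)
print(extended_bad_char_shift(positions, "X", 2))  # 3 (X does not occur in P)
```

Unlike the simpler rule, this shift is always at least one position, so the "already crossed the last occurrence" case does not arise.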