Pattern Matching presentation

Title: Pattern Matching

1
Pattern Matching
2
Outline

Problem description
Brute Force algorithm
Knuth-Morris-Pratt algorithm
Boyer-Moore algorithm

3
Problem description

What is pattern matching?
Pattern: quchjfng
Text:
kadogqlxwwenkpjkmolpevnmqcsjakoheorplkzb
zfjliar Iacwzjlvkxromavcdquchjfnganqxharnhkw
eweg tkrmwm fjduwhqvhwpfg pymyajsaywf aas
bwxwcmrncmslgeqztyoygdbnxjqoctslbrvngoyxe

4
Problem description

Given two strings P (pattern) and T (text), find
the first occurrence of P in T.
Input: two strings, P and T
Output: the starting index of the first occurrence of P
in T, or Null if P does not occur in T
Applications
Text editors
Search engines
Biological research

5
Strings 1/2

A string is a sequence of data elements, such as
numbers or characters.
An alphabet S is a set of basic symbols for a
category of strings.
Examples of alphabets:
Characters: A, B, ..., Z, a, b, ..., z
Numbers: 0, 1, 2, ..., 9
DNA: A, C, G, T

6
Strings 2/2

Let P be a string of size m.
A substring P[i..j] is the string containing the
characters of P with ranks between i and j, inclusive,
so P = P[0..m-1].
A prefix of P is a substring of the type P[0..i].
A suffix of P is a substring of the type P[i..m-1].
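
The definitions above map onto Python slices. The following is a minimal sketch, assuming 0-based ranks and inclusive bounds as in P[0..m-1]; the helper names substring, prefixes and suffixes are illustrative and do not come from the slides.

```python
def substring(p, i, j):
    """Return P[i..j]: the characters of p with ranks i through j, inclusive."""
    # Python slices exclude the upper bound, hence j + 1.
    return p[i:j + 1]


def prefixes(p):
    """All prefixes P[0..i] for i = 0 .. m-1."""
    return [substring(p, 0, i) for i in range(len(p))]


def suffixes(p):
    """All suffixes P[i..m-1] for i = 0 .. m-1."""
    m = len(p)
    return [substring(p, i, m - 1) for i in range(m)]


if __name__ == "__main__":
    print(substring("TTAGC", 1, 3))
    print(prefixes("abc"))
    print(suffixes("abc"))
```

With these conventions, P itself is both the longest prefix and the longest suffix of P.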

7
Brute Force Algorithm 1/2
Example: P = TTAGC, T = TTATTAGTTAGC
[Figure: P aligned under T at successive positions]
Pattern size m, text size n
Time complexity: O(m(n-m))
Example of worst case: P = AAAAG, T = AAAAAAAAAAAAAAAAG
8
Brute Force Algorithm 2/2
Algorithm BruteForceMatch(P, T)
  Input: pattern P of size m and text T of size n
  Output: starting index of the first substring of T
    equal to P, or -1 if no such substring exists
begin
  for i ← 0 to n - m do
    j ← 0
    while j < m and P[j] = T[i + j] do
      j ← j + 1
    if j = m then
      return i      /* match at i */
  return -1         /* no match anywhere */
end
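
The pseudocode above translates almost line for line into Python. This is a minimal sketch rather than a tuned implementation; the function name brute_force_match and the sample strings are chosen here for illustration (the second follows the shape of the worst-case example, a run of A's ending in G).

```python
def brute_force_match(pattern, text):
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    # Try every alignment i of the pattern against the text.
    for i in range(n - m + 1):
        j = 0
        # Compare left to right until a mismatch or the end of the pattern.
        while j < m and pattern[j] == text[i + j]:
            j += 1
        if j == m:
            return i  # match at i
    return -1  # no match anywhere


if __name__ == "__main__":
    print(brute_force_match("AAAAG", "A" * 16 + "G"))
    print(brute_force_match("TTAGC", "TTATTAGTTAGC"))
    print(brute_force_match("xyz", "TTATTAGTTAGC"))
```

On the run of A's, every alignment matches four characters before failing on the fifth, which is what makes this input expensive for the brute-force approach.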
9
How to Speed Up Pattern Matching

Shift P by more than one position when a mismatch
occurs
Never miss an occurrence of P in T
Preprocessing approach
Preprocessing P
Knuth-Morris-Pratt Algorithm
Boyer-Moore Algorithm
Preprocessing T
Suffix trees

10
The Knuth-Morris-Pratt algorithm

Best-known linear-time algorithm for exact matching
Preprocesses the pattern P
Compares from left to right

11
The KMP Ideas

Shift more than one position
Reduce the number of comparisons

12
The KMP Algorithm - Motivation

When a mismatch occurs, what is the most we can
shift the pattern?
Answer: the largest prefix of P[0..j-1] that is a
suffix of P[1..j-1]

[Figure: P aligned under T with a mismatch at position j of P. Annotations: "No need to repeat these comparisons" and "Resume comparing here"]
13
KMP Failure Function

The Knuth-Morris-Pratt algorithm preprocesses the
pattern to find matches of prefixes of the
pattern with the pattern itself.
The failure function F(j) is defined as the size
of the largest prefix of P[0..j] that is also a
suffix of P[1..j].
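
The definition above can be evaluated directly, one value of j at a time. The sketch below does exactly that and nothing cleverer: it is a slow, definition-checking version, not the O(m) construction, and the name failure_function_naive and the sample pattern "abacab" are assumptions made for illustration.

```python
def failure_function_naive(pattern):
    """F(j) = size of the largest prefix of P[0..j] that is a suffix of P[1..j]."""
    f = []
    for j in range(len(pattern)):
        best = 0
        # P[1..j] has j characters, so candidate sizes run from 1 to j.
        for k in range(1, j + 1):
            prefix = pattern[:k]                # P[0..k-1]
            suffix = pattern[j - k + 1:j + 1]   # last k characters of P[1..j]
            if prefix == suffix:
                best = k
        f.append(best)
    return f


if __name__ == "__main__":
    print(failure_function_naive("abacab"))
```

For "abacab", the final value is 2 because "ab" is both a prefix of the pattern and the last two characters of P[1..5].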

[Figure: two example patterns P with their failure-function values, 0 0 1 0 1 2 and 0 0 1 1 2 3]
14
The KMP Algorithm
Algorithm KMPMatch(T, P)
begin
  F ← failureFunction(P)
  i ← 0
  j ← 0
  while i < n do
    if T[i] = P[j] then
      if j = m - 1 then
        return i - j      /* match */
      else
        i ← i + 1
        j ← j + 1
    else if j > 0 then
      j ← F[j - 1]
    else
      i ← i + 1
  return -1               /* no match */
end
15
Example
[Figure: worked example of KMP matching a pattern P against a text T]
16
The KMP Algorithm

The failure function can be represented by an
array and can be computed in O(m) time
At each iteration of the while-loop, either
i increases by one, or
the shift amount i - j increases by at least one
(observe that F(j - 1) < j)
Hence, there are no more than 2n iterations of
the while-loop
Thus, the KMP algorithm runs in optimal time
O(m + n)

17
Computing the Failure Function

The failure function can be represented by an
array and can be computed in O(m) time
The construction is similar to the KMP algorithm
itself
At each iteration of the while-loop, either
i increases by one, or
the shift amount i - j increases by at least one
(observe that F(j - 1) < j)
Hence, there are no more than 2m iterations of
the while-loop

Algorithm failureFunction(P)
begin
  F[0] ← 0
  i ← 1
  j ← 0
  while i < m do
    if P[i] = P[j] then
      /* we have matched j + 1 chars */
      F[i] ← j + 1
      i ← i + 1
      j ← j + 1
    else if j > 0 then
      /* use failure function to shift P */
      j ← F[j - 1]
    else
      F[i] ← 0      /* no match */
      i ← i + 1
end
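
The failureFunction pseudocode above, together with the KMPMatch loop from the earlier slide, can be sketched in Python as follows. The names failure_function and kmp_match are illustrative, the empty-pattern guard is an added assumption the slides do not discuss, and the sample text "abacaabadcabacabaabb" is chosen here purely as test input.

```python
def failure_function(pattern):
    """Compute F in O(m) time by matching the pattern against itself."""
    m = len(pattern)
    f = [0] * m
    i, j = 1, 0
    while i < m:
        if pattern[i] == pattern[j]:
            f[i] = j + 1      # we have matched j + 1 characters
            i += 1
            j += 1
        elif j > 0:
            j = f[j - 1]      # use the failure function to shift P
        else:
            f[i] = 0          # no match
            i += 1
    return f


def kmp_match(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0              # assumption: an empty pattern matches at index 0
    f = failure_function(pattern)
    i = j = 0
    while i < n:
        if text[i] == pattern[j]:
            if j == m - 1:
                return i - j  # match
            i += 1
            j += 1
        elif j > 0:
            j = f[j - 1]      # reuse the prefix already known to match
        else:
            i += 1
    return -1                 # no match


if __name__ == "__main__":
    print(failure_function("abacab"))
    print(kmp_match("abacaabadcabacabaabb", "abacab"))
```

Note that i never moves backwards in kmp_match, which is the property behind the 2n bound on loop iterations.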
18
Boyer-Moore's Algorithm

Boyer-Moore's pattern matching algorithm is
based on two principles:
Right-to-left scan
Bad character shift
Reference:
http://www.cs.utexas.edu/users/moore/best-ideas/string-searching

19
Right to left Scan Rule
[Figure: P aligned under T; characters are compared from right to left, starting at the last character of P]
20
Bad Character Shift Rule 1/3

Shift more than one position at a time.
When a mismatch occurs at T[i] = c:
If P contains c, shift P to align the last
occurrence of c in P with T[i]

[Figure: P shifted right under T so that its last occurrence of c lines up with T[i]]
21
Bad Character Shift Rule 2/3

When a mismatch occurs at T[i] = c:
If P contains c, shift P to align the last
occurrence of c in P with T[i]
Otherwise, if P does not contain c, shift P to
align P[0] with T[i + 1]

[Figure: P shifted right under T so that P[0] lines up with T[i + 1]]
22
Bad Character Shift Rule 3/3

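When a mismatch occurs at T[i] = c:
If P contains c, shift P to align the last
occurrence of c in P with T[i]
Otherwise, if P does not contain c, shift P to
align P[0] with T[i + 1]
What if we have already crossed the last
occurrence of c in P?

[Figure: P aligned under T, with the last occurrence of c in P to the right of the mismatch position]
Shift P by one character only.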
23
Example of BM Algorithm
[Figure: worked example of the Boyer-Moore algorithm matching a pattern P against a text T]
24
Last-Occurrence Function

Preprocess the pattern P and the alphabet S to
build the last-occurrence function L mapping S to
integers, where L(c) is defined as
  the largest index i such that P[i] = c, or
  -1 if no such index exists
Example:
S = {a, b, c, d}
P = abacab

c      a    b    c    d
L(c)   4    5    3   -1

The last-occurrence function can be computed in
time O(m + s), where m is the size of P and s is
the size of S
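
One way to build the last-occurrence function is a single pass over the alphabet followed by a single pass over the pattern. The sketch below assumes the alphabet is given as an iterable of characters and uses a dictionary as the table; the name last_occurrence_function is illustrative.

```python
def last_occurrence_function(pattern, alphabet):
    """Map each character c to the largest index i with pattern[i] == c, or -1."""
    # O(s): every symbol of the alphabet starts out as "does not occur".
    last = {c: -1 for c in alphabet}
    # O(m): later positions overwrite earlier ones, leaving the last index.
    for i, c in enumerate(pattern):
        last[c] = i
    return last


if __name__ == "__main__":
    print(last_occurrence_function("abacab", "abcd"))
```

The two passes account for the s and m terms in the stated running time.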
25
The Boyer-Moore Algorithm
Algorithm BoyerMooreMatch(T, P, S)
  L ← lastOccurenceFunction(P, S)
  i ← m - 1
  j ← m - 1
  repeat
    if T[i] = P[j] then
      if j = 0 then
        return i      /* match at i */
      else
        i ← i - 1
        j ← j - 1
    else
      /* character-jump */
      l ← L[T[i]]
      i ← i + m - min(j, 1 + l)
      j ← m - 1
  until i > n - 1
  return -1           /* no match */

[Figure: Case 1, j < l: the last occurrence of T[i] in P lies to the right of the mismatch position j]
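
The pseudocode above, using only the bad character rule, can be sketched in Python as follows. The name boyer_moore_match is illustrative; the last-occurrence table is built inline from the pattern alone, with a default of -1 standing in for characters of the alphabet that do not occur in P, and the empty-pattern guard is an added assumption.

```python
def boyer_moore_match(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0  # assumption: an empty pattern matches at index 0
    # Last-occurrence table; characters absent from the pattern map to -1.
    last = {c: k for k, c in enumerate(pattern)}
    i = j = m - 1  # start comparing at the right end of the pattern
    while i < n:
        if text[i] == pattern[j]:
            if j == 0:
                return i  # match at i
            i -= 1
            j -= 1
        else:
            # Character jump: align the last occurrence of text[i] with
            # position i, or shift by one if that occurrence was already crossed.
            l = last.get(text[i], -1)
            i += m - min(j, 1 + l)
            j = m - 1
    return -1  # no match


if __name__ == "__main__":
    print(boyer_moore_match("abacaabadcabacabaabb", "abacab"))
    print(boyer_moore_match("TTATTAGTTAGC", "TTAGC"))
```

The min(j, 1 + l) term covers both cases of the rule: when l < j the pattern jumps so that P[l] sits under the mismatched character, and when l > j it moves by a single position.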
26
Time complexity Analysis

Boyer-Moore's algorithm is significantly faster
than the brute-force algorithm on English text
Pattern size m, text size n, alphabet size s
⇒ Time complexity: O(nm + s)
Example of worst case: T = aaaaaaaa...a, P = caaaaa

[Figure: P aligned under a text T consisting entirely of a's]
27
Extended Boyer-Moore algorithm

Extended bad character shift rule:
When a mismatch occurs at position i of P and the
mismatched character in T is c, then shift P to the
right so that the closest c to the left of position
i in P is below the mismatched c in T.
The original Boyer-Moore algorithm uses the
simpler bad character rule.

28
Extended Boyer-Moore algorithm

Extended Bad character shift rule

[Figure: P aligned under T for a mismatch on the character E; label R(E)]
The positions of E in P are 0, 3, 4. Find the largest position less than j.
