Title: Pattern Matching
1Pattern Matching
2Outline
- Problem description
- Brute Force algorithm
- Knuth-Morris-Pratt algorithm
- Boyer-Moore algorithm
3Problem description
- What is pattern matching?
- Pattern: quchjfng
- Text:
- kadogqlxwwenkpjkmolpevnmqcsjakoheorplkzb
- zfjliar Iacwzjlvkxromavcdquchjfnganqxharnhkw
- eweg tkrmwm fjduwhqvhwpfg pymyajsaywf aas
- bwxwcmrncmslgeqztyoygdbnxjqoctslbrvngoyxe
4Problem description
- Given two strings P (pattern) and T (text), find the first occurrence of P in T.
- Input: two strings, P and T
- Output: the starting index of the first occurrence of P in T, or Null if there is none
- Applications
- Text editors
- Search engines
- Biological research
5Strings 1/2
- A string is a sequence of data elements, such as numbers or characters.
- An alphabet S is a set of basic symbols for a category of strings.
- Examples of alphabets
- Characters: {A, B, ..., Z, a, b, ..., z}
- Numbers: {0, 1, 2, ..., 9}
- DNA: {A, C, G, T}
6Strings 2/2
- Let P be a string of size m
- A substring P[i .. j] is the string made of the characters of P with ranks between i and j, inclusive. P itself is P[0 .. m-1].
- A prefix of P is a substring of the type P[0 .. i]
- A suffix of P is a substring of the type P[i .. m-1]
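The index convention above can be tried out directly. The following is a minimal Python sketch; the helper name `substring` and the example string are assumptions made for illustration, not part of the slides. Note that P[i .. j] is inclusive at both ends, whereas Python slices exclude the upper bound.

```python
def substring(p: str, i: int, j: int) -> str:
    """Return P[i..j]: the characters of p with ranks i through j, inclusive."""
    return p[i:j + 1]


P = "TTAGCT"                      # example string (assumed), size m = 6
m = len(P)

whole = substring(P, 0, m - 1)    # P = P[0..m-1]
prefix = substring(P, 0, 2)       # P[0..2] is a prefix of P
suffix = substring(P, 3, m - 1)   # P[3..m-1] is a suffix of P

print(whole, prefix, suffix)      # TTAGCT TTA GCT
```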
7Brute Force Algorithm 1/2
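[Figure: brute-force matching, sliding P one position at a time under T. Strings on the slide: TTAGCT, TTATTAGTTAGC]
Pattern size m, text size n
Time complexity: O(m(n - m + 1)), i.e. O(mn)
Example of worst case: P = AAAAG, T = AAAAAAAAAAAAAAAAG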
8Brute Force Algorithm 2/2
Algorithm BruteForceMatch(P, T)
  Input: pattern P of size m and text T of size n
  Output: starting index of the first substring of T equal to P, or -1 if no such substring exists
begin
  for i ← 0 to n - m do
    j ← 0
    while j < m and P[j] = T[i + j] do
      j ← j + 1
    if j = m then
      return i    /* match at i */
  return -1    /* no match anywhere */
end
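As a rough illustration, BruteForceMatch translates almost line for line into Python. The sketch below also includes `count_comparisons`, a hypothetical helper (not in the slides) that counts character comparisons, so the worst-case behaviour can be checked on inputs such as P = AAAAG against a text of A's ending in G.

```python
def brute_force_match(p: str, t: str) -> int:
    """Return the index of the first occurrence of p in t, or -1 if none."""
    m, n = len(p), len(t)
    for i in range(n - m + 1):            # try every alignment of p under t
        j = 0
        while j < m and p[j] == t[i + j]:
            j += 1
        if j == m:
            return i                      # match at i
    return -1                             # no match anywhere


def count_comparisons(p: str, t: str) -> int:
    """Count the character comparisons made up to the first match (helper for illustration)."""
    m, n = len(p), len(t)
    comparisons = 0
    for i in range(n - m + 1):
        j = 0
        while j < m:
            comparisons += 1
            if p[j] != t[i + j]:
                break
            j += 1
        if j == m:
            break                         # stop at the first match
    return comparisons


P = "AAAAG"
T = "AAAAAAAAAAAAAAAAG"
print(brute_force_match(P, T), count_comparisons(P, T))
```

On this input every alignment costs m comparisons, so the count equals m(n - m + 1).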
9How to speed up the Pattern Matching.
- Shift P by more than one position when a mismatch occurs
- Never miss an occurrence of P in T
- Preprocessing approach
- Preprocessing P
- Knuth-Morris-Pratt Algorithm
- Boyer-Moore Algorithm
- Preprocessing T
- Suffix trees
10The Knuth-Morris-Pratt algorithm
- Best-known linear-time algorithm for exact matching
- Preprocessing pattern P
- Compares from Left to Right
11The KMP Ideas
- Shift more than one position
- Reduce the number of comparisons
12The KMP Algorithm - Motivation
- When a mismatch occurs, what is the most we can shift the pattern?
- Answer: the largest prefix of P[0..j-1] that is a suffix of P[1..j-1]
[Figure: P aligned under T with a mismatch at position j of P. Labels: "No need to repeat these comparisons", "Resume comparing here"]
13KMP Failure Function
- Knuth-Morris-Pratt's algorithm preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself
- The failure function F(j) is defined as the size of the largest prefix of P[0..j] that is also a suffix of P[1..j]
[Figure: two example failure-function tables for patterns P; the surviving F(j) rows are 0 0 1 0 1 2 and 0 0 1 1 2 3]
14The KMP Algorithm
Algorithm KMPMatch(T, P)
begin
  F ← failureFunction(P)
  i ← 0
  j ← 0
  while i < n do
    if T[i] = P[j] then
      if j = m - 1 then
        return i - j    /* match */
      else
        i ← i + 1
        j ← j + 1
    else if j > 0 then
      j ← F[j - 1]
    else
      i ← i + 1
  return -1    /* no match */
end
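A minimal Python sketch of KMPMatch follows, with a compact failure-function helper included so the block runs on its own. The empty-pattern guard and the example strings are assumptions added for illustration; the slides do not specify them.

```python
def failure_function(p: str) -> list:
    """f[j] = size of the largest prefix of p[0..j] that is also a suffix of p[1..j]."""
    f = [0] * len(p)
    i, j = 1, 0
    while i < len(p):
        if p[i] == p[j]:
            f[i] = j + 1
            i += 1
            j += 1
        elif j > 0:
            j = f[j - 1]
        else:
            f[i] = 0
            i += 1
    return f


def kmp_match(t: str, p: str) -> int:
    """Return the index of the first occurrence of p in t, or -1 if none."""
    n, m = len(t), len(p)
    if m == 0:
        return 0                  # assumption: the empty pattern matches at 0
    f = failure_function(p)
    i = j = 0
    while i < n:
        if t[i] == p[j]:
            if j == m - 1:
                return i - j      # match
            i += 1
            j += 1
        elif j > 0:
            j = f[j - 1]          # reuse the prefix already matched
        else:
            i += 1
    return -1                     # no match


print(kmp_match("abacaabaccabacabaabb", "abacab"))  # 10
```

The text index i never moves backwards; only j is reset through the failure function.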
15Example
[Figure: step-by-step trace of KMPMatch on an example text T and pattern P]
16The KMP Algorithm
- The failure function can be represented by an array and can be computed in O(m) time
- At each iteration of the while-loop, either
- i increases by one, or
- the shift amount i - j increases by at least one (observe that F(j - 1) < j)
- Hence, there are no more than 2n iterations of the while-loop
- Thus, KMP's algorithm runs in optimal time O(m + n)
17Computing the Failure Function
- The failure function can be represented by an array and can be computed in O(m) time
- The construction is similar to the KMP algorithm itself
- At each iteration of the while-loop, either
- i increases by one, or
- the shift amount i - j increases by at least one (observe that F(j - 1) < j)
- Hence, there are no more than 2m iterations of the while-loop
Algorithm failureFunction(P)
begin
  F[0] ← 0
  i ← 1
  j ← 0
  while i < m do
    if P[i] = P[j] then
      /* we have matched j + 1 chars */
      F[i] ← j + 1
      i ← i + 1
      j ← j + 1
    else if j > 0 then
      /* use failure function to shift P */
      j ← F[j - 1]
    else
      F[i] ← 0    /* no match */
      i ← i + 1
end
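To make the construction concrete, the sketch below implements failureFunction in Python and cross-checks it against a direct, quadratic reading of the definition of F(j). The checker `failure_by_definition` and the two example patterns are assumptions chosen for illustration.

```python
def failure_function(p: str) -> list:
    """Linear-time construction, following the failureFunction pseudocode."""
    f = [0] * len(p)
    i, j = 1, 0
    while i < len(p):
        if p[i] == p[j]:
            f[i] = j + 1          # we have matched j + 1 chars
            i += 1
            j += 1
        elif j > 0:
            j = f[j - 1]          # use the failure function to shift p
        else:
            f[i] = 0              # no match
            i += 1
    return f


def failure_by_definition(p: str) -> list:
    """F(j) = size of the largest prefix of p[0..j] that is also a suffix of p[1..j]."""
    f = []
    for j in range(len(p)):
        best = 0
        for k in range(1, j + 1):                 # candidate sizes 1..j
            if p[:k] == p[j - k + 1:j + 1]:
                best = k
        f.append(best)
    return f


print(failure_function("abaaba"))   # [0, 0, 1, 1, 2, 3]
print(failure_function("abacab"))   # [0, 0, 1, 0, 1, 2]
```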
18Boyer-Moore's Algorithm
- The Boyer-Moore pattern matching algorithm is based on two principles
- Right to Left Scan
- Bad character shift
- Reference
- http://www.cs.utexas.edu/users/moore/best-ideas/string-searching
19Right to left Scan Rule
[Figure: P aligned under T; at each alignment the characters of P are compared with T from right to left]
20Bad Character Shift Rule 1/3
- Shift more than one position at a time.
- When a mismatch occurs at T[i] = c
- If P contains c, shift P to align the last occurrence of c in P with T[i]
[Figure: P shifted under T so that the last occurrence of c in P lines up with T[i]]
21Bad Character Shift Rule 2/3
- When a mismatch occurs at T[i] = c
- If P contains c, shift P to align the last occurrence of c in P with T[i]
- Else, if P doesn't contain c, shift P to align P[0] with T[i + 1]
[Figure: P shifted under T so that P[0] lines up with T[i + 1]]
22Bad Character Shift Rule 3/3
- When a mismatch occurs at T[i] = c
- If P contains c, shift P to align the last occurrence of c in P with T[i]
- Else, if P doesn't contain c, shift P to align P[0] with T[i + 1]
- What if we have already crossed the last occurrence?
[Figure: the last occurrence of c in P lies to the right of the mismatch position]
- Shift one char only
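The three cases of the bad character rule can be summarised as a single shift computation. The following Python sketch is one way to express it; the function name `bad_character_shift` and its signature are assumptions for illustration, with j the index in P where the mismatch occurred and c the mismatched text character T[i].

```python
def bad_character_shift(p: str, j: int, c: str) -> int:
    """Positions to shift p right after p[j] mismatches the text character c."""
    l = p.rfind(c)        # last occurrence of c in p, or -1 if c is absent
    if l == -1:
        return j + 1      # c not in p: align p[0] with T[i + 1]
    if l < j:
        return j - l      # align the last occurrence of c in p with T[i]
    return 1              # last occurrence already crossed: shift one char only


P = "abacab"
print(bad_character_shift(P, 5, "c"))   # 2: the c at index 3 moves under T[i]
print(bad_character_shift(P, 5, "d"))   # 6: d does not occur in P
print(bad_character_shift(P, 2, "b"))   # 1: last b (index 5) is right of j
```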
23Example of BM Algorithm
[Figure: step-by-step trace of the Boyer-Moore algorithm on an example text T and pattern P]
24Last-Occurrence Function
- Preprocess the pattern P and the alphabet S to build the last-occurrence function L mapping S to integers, where L(c) is defined as
- the largest index i such that P[i] = c, or
- -1 if no such index exists
- Example
- S = {a, b, c, d}
- P = abacab

  c     a   b   c   d
  L(c)  4   5   3   -1

- The last-occurrence function can be computed in time O(m + s), where m is the size of P and s is the size of S
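The example table can be reproduced with a short Python sketch. The function name and the choice of a dictionary to represent L are assumptions made for illustration; the two loops correspond to the O(s) and O(m) parts of the running time.

```python
def last_occurrence_function(p: str, alphabet: str) -> dict:
    """Map each symbol c of the alphabet to the largest i with p[i] == c, or -1."""
    last = {c: -1 for c in alphabet}    # O(s): every symbol starts as "absent"
    for i, c in enumerate(p):           # O(m): later indices overwrite earlier ones
        last[c] = i
    return last


L = last_occurrence_function("abacab", "abcd")
print(L)   # {'a': 4, 'b': 5, 'c': 3, 'd': -1}
```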
25The Boyer-Moore Algorithm
Algorithm BoyerMooreMatch(T, P, S)
  L ← lastOccurenceFunction(P, S)
  i ← m - 1
  j ← m - 1
  repeat
    if T[i] = P[j] then
      if j = 0 then
        return i    /* match at i */
      else
        i ← i - 1
        j ← j - 1
    else
      /* character-jump */
      l ← L[T[i]]
      i ← i + m - min(j, 1 + l)
      j ← m - 1
  until i > n - 1
  return -1    /* no match */
[Figure: Case 1, j < l, showing the positions i, j and l]
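A minimal Python sketch of BoyerMooreMatch with the simple bad character rule is given below. It assumes a dictionary for L built from the pattern alone (symbols not in P default to -1) and treats the empty pattern as matching at 0; both choices are assumptions, not something the slides specify.

```python
def boyer_moore_match(t: str, p: str) -> int:
    """Return the index of the first occurrence of p in t, or -1 if none."""
    n, m = len(t), len(p)
    if m == 0:
        return 0                            # assumption: empty pattern matches at 0
    last = {c: k for k, c in enumerate(p)}  # last occurrence of each symbol in p
    i = j = m - 1                           # start at the right end of p
    while i <= n - 1:
        if t[i] == p[j]:
            if j == 0:
                return i                    # match at i
            i -= 1                          # keep scanning right to left
            j -= 1
        else:
            l = last.get(t[i], -1)          # character jump
            i += m - min(j, 1 + l)
            j = m - 1
    return -1                               # no match


print(boyer_moore_match("abacaabadcabacabaabb", "abacab"))  # 10
```

The `while` loop replaces the pseudocode's repeat/until so that a pattern longer than the text returns -1 without indexing past the end of T.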
26Time complexity Analysis
- Boyer-Moore's algorithm is significantly faster than the brute-force algorithm on English text
- Pattern size m, text size n, alphabet size s: time complexity O(nm + s)
- Example of worst case: T = aaaaaaaa...a, P = caaaaa
[Figure: P = caaaaa sliding under a text T made only of a's]
27Extended Boyer-Moore algorithm
- Extended Bad character shift rule
- When a mismatch occurs at position i of P and the mismatched character in T is c, then shift P to the right so that the closest c to the left of position i in P is below the mismatched c in T.
- The original Boyer-Moore algorithm uses the simpler bad character rule
28Extended Boyer-Moore algorithm
- Extended Bad character shift rule
[Figure: P aligned under T with a mismatch at position j of P; figure label: R(E)0]
- The positions of E in P are 0, 3, 4. Find the largest of these that is less than j.
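One way to support the extended rule is to store, for each character, the sorted list of its positions in P and search that list for the closest position to the left of the mismatch. The sketch below uses Python's `bisect` module for the lookup; the function names and the pattern "EABEEC" (chosen only because E occurs at positions 0, 3 and 4) are hypothetical.

```python
from bisect import bisect_left


def occurrence_positions(p: str) -> dict:
    """Map each character of p to the increasing list of its positions in p."""
    positions = {}
    for k, c in enumerate(p):
        positions.setdefault(c, []).append(k)
    return positions


def closest_left(positions: dict, c: str, j: int) -> int:
    """Largest position of c in P that is less than j, or -1 if there is none."""
    occ = positions.get(c, [])
    k = bisect_left(occ, j)             # number of positions strictly less than j
    return occ[k - 1] if k > 0 else -1


def extended_shift(positions: dict, c: str, j: int) -> int:
    """Shift that places the closest c to the left of j under the mismatched c."""
    return j - closest_left(positions, c, j)


pos = occurrence_positions("EABEEC")    # E occurs at positions 0, 3, 4
print(closest_left(pos, "E", 5))        # 4
print(closest_left(pos, "E", 2))        # 0
print(extended_shift(pos, "E", 2))      # 2
```

When the character has no occurrence to the left of j, the shift is j + 1, which moves P entirely past the mismatched text character.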