Title: String Matching
1. String Matching
Input: Strings P (pattern) and T (text), with |P| = m and |T| = n.
Output: Indices of all occurrences of P in T.
Example
T = discombobulate
P        Output
combo    4 (i.e., with shift 3)
ate      12
later    15 > |T| (no occurrence of P)
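The input/output convention above can be checked with a short Python sketch. It leans on the built-in `str.find`; the helper name `occurrences` is an assumption made for illustration. Indices are reported 1-based to match the example, and an empty list is returned when P does not occur in T.

```python
def occurrences(text: str, pattern: str) -> list[int]:
    """Return the 1-based start indices of every occurrence of pattern in text."""
    hits = []
    start = text.find(pattern)                 # str.find returns -1 when nothing is found
    while start != -1:
        hits.append(start + 1)                 # convert 0-based position to 1-based index
        start = text.find(pattern, start + 1)  # step by one so overlapping hits are kept
    return hits

print(occurrences("discombobulate", "combo"))  # [4]
print(occurrences("discombobulate", "ate"))    # [12]
print(occurrences("discombobulate", "later"))  # []
```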
2. Applications
Text retrieval
Computational biology
- DNA is a one-dimensional (1-D) string of the characters A, G, C, and T.
- All information for 3-D protein folding is contained in the protein sequence itself and is independent of the environment.
Searching for DNA patterns
Comparing two or more DNA strings for similarities
Reconstructing DNA strings from overlapping
fragments.
3. Sliding the Pattern Template
T = b i o l o g y    P = l o g i c
n = 7, m = 5

b i o l o g y
l o g i c              T1 ≠ P1

b i o l o g y
  l o g i c            T2 ≠ P1

b i o l o g y
    l o g i c          T3 ≠ P1

b i o l o g y
      l o g i c        T4 = P1, T5 = P2, T6 = P3, but T7 ≠ P4

No match!
4. Another Example
T = b i o l o g i c a l    P = l o g i c
n = 10, m = 5

b i o l o g i c a l
      l o g i c

Match found! Return 4.
5. The Naive Matcher
Pattern P[1..m], Text T[1..n]

Naive-String-Matcher(T, P)   // find all occurrences of P in T
    for s ← 1 to n − m + 1
        do if P[1..m] = T[s..s+m−1]
            then print "Pattern occurs at index" s

(Figure: P[1..m] aligned under T[s..s+m−1].)
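The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not a tuned implementation: the function name `naive_string_matcher` is chosen here, Python strings are 0-based, so the loop variable is a shift and the reported index is the shift plus one.

```python
def naive_string_matcher(text: str, pattern: str) -> list[int]:
    """Try every shift s and compare P[1..m] with T[s..s+m-1]."""
    n, m = len(text), len(pattern)
    matches = []
    for s in range(n - m + 1):        # 0-based shifts 0 .. n-m
        if text[s:s + m] == pattern:  # block comparison, up to m character tests
            matches.append(s + 1)     # report the 1-based index, as on the slides
    return matches

print(naive_string_matcher("biological", "logic"))  # [4]
print(naive_string_matcher("biology", "logic"))     # []
```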
6. Time Complexity
m(n − m + 1) comparisons in the worst case:
n − m + 1 blocks of m chars, each requiring m comparisons.
Time complexity is O(mn)!
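The worst-case count can be checked empirically. The sketch below assumes a naive matcher that abandons a block at the first mismatch, and uses an all-`a` text and pattern as the worst-case input (both are choices made for this illustration); every block then needs all m comparisons.

```python
def count_comparisons(text: str, pattern: str) -> int:
    """Count character comparisons of a naive matcher that stops a block at the first mismatch."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for s in range(n - m + 1):        # n - m + 1 blocks
        for j in range(m):            # at most m comparisons per block
            comparisons += 1
            if text[s + j] != pattern[j]:
                break
    return comparisons

n, m = 10, 4
print(count_comparisons("a" * n, "a" * m), m * (n - m + 1))  # 28 28
```

On an input where the first character of every block already mismatches, the same function reports only n − m + 1 comparisons, so the O(mn) figure is a worst-case bound.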
7. Finite Automaton
A finite automaton consists of
- a finite set Q of states
- a start state
- a set A of accepting states
- a finite input alphabet Σ
- a transition function δ: Q × Σ → Q.

(Figure: state diagram with the start state and the accepting state labeled.)
8. Accepting a String
input    state sequence    accepts?
aabba    010001            Yes
bbabb    000100            No
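The two rows of the table can be reproduced by simulating a small automaton. The transition table `DELTA` below is an assumption: it is inferred from the two state sequences (start state 0, accepting state 1), since the state diagram itself is not reproduced in the text. The function name `run_automaton` is likewise chosen for illustration.

```python
def run_automaton(delta, start, accepting, word):
    """Return (state sequence, accepted?) for a deterministic finite automaton."""
    state = start
    trace = [state]
    for ch in word:
        state = delta[(state, ch)]   # one transition per input character
        trace.append(state)
    return trace, state in accepting

# Assumed transition table, inferred from the state sequences in the table.
DELTA = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 0}

for word in ("aabba", "bbabb"):
    trace, ok = run_automaton(DELTA, 0, {1}, word)
    print(word, "".join(map(str, trace)), "Yes" if ok else "No")
```

A word is accepted exactly when the last state in the sequence is an accepting state, which is why `aabba` (ending in state 1) is accepted and `bbabb` (ending in state 0) is not.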
9. A String Matching Automaton
Pattern P = a a b a
Ex.
T      =    a b b a a a b a a b a
states = 0  1 0 0 1 2 2 3 4 2 3 4
"aba" is not rescanned, due to the transition 4 → 2.
Pattern occurs at indices 5 and 8!
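The slides do not show how the transition function is built. One standard construction, sketched here as an assumption rather than as the slides' method, sets δ(q, c) to the length of the longest prefix of P that is a suffix of P[1..q] followed by c. The brute-force version below (function name `compute_transition_function` chosen for illustration) is slow but easy to check against the example.

```python
def compute_transition_function(pattern: str, alphabet: str) -> dict:
    """delta[(q, c)] = length of the longest prefix of pattern that is a suffix of pattern[:q] + c."""
    m = len(pattern)
    delta = {}
    for q in range(m + 1):                # states 0 .. m
        for c in alphabet:
            seen = pattern[:q] + c        # text suffix known after reading c in state q
            k = min(m, q + 1)             # the next state can be at most q + 1
            while k > 0 and not seen.endswith(pattern[:k]):
                k -= 1
            delta[(q, c)] = k
    return delta

delta = compute_transition_function("aaba", "ab")
print(delta[(4, "a")])  # 2

state, trace = 0, [0]
for ch in "abbaaabaaba":
    state = delta[(state, ch)]
    trace.append(state)
print(trace)  # [0, 1, 0, 0, 1, 2, 2, 3, 4, 2, 3, 4]
```

State 4 (= m) is reached after the 8th and 11th characters, which corresponds to occurrences starting at indices 5 and 8.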
10. Key Ideas of Automaton Matching
Slide pattern forward by more than one position
if possible.
Do not rescan chars of T that have already been
examined.
11. The Automaton Matcher
Finite-Automaton-Matcher(T, δ, m)
    n ← length[T]
    q ← 0                        // current state
    for i ← 1 to n
        do q ← δ(q, T[i])        // δ function precomputed
            if q = m             // match succeeds
                then print "Pattern occurs at index" i − m + 1

Running time is O(n) if the state transition function δ is available.
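A Python sketch of the matcher follows. The transition function is passed in as a dictionary, and the table `DELTA_AABA` for P = aaba over the alphabet {a, b} is written out by hand for this demo (an assumption, since the slides treat δ as given). The names `finite_automaton_matcher` and `DELTA_AABA` are illustrative.

```python
def finite_automaton_matcher(text: str, delta: dict, m: int) -> list[int]:
    """Scan text once; report the 1-based index of each occurrence."""
    matches = []
    q = 0                                   # current state
    for i, ch in enumerate(text, start=1):  # i runs from 1 to n
        q = delta[(q, ch)]                  # delta is assumed precomputed
        if q == m:                          # state m: all m pattern chars matched
            matches.append(i - m + 1)
    return matches

# Hand-written transition table for P = "aaba" over {a, b} (an assumption for this demo).
DELTA_AABA = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 2, (1, "b"): 0,
    (2, "a"): 2, (2, "b"): 3,
    (3, "a"): 4, (3, "b"): 0,
    (4, "a"): 2, (4, "b"): 0,
}

print(finite_automaton_matcher("abbaaabaaba", DELTA_AABA, 4))  # [5, 8]
```

Each text character triggers exactly one dictionary lookup, so the scan does a constant amount of work per character and never revisits earlier positions of T.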