Title: Design and Analysis of Computer Algorithm Lecture 8
1Design and Analysis of Computer Algorithm Lecture 8
- Pradondet Nilagupta
- Department of Computer Engineering
This lecture note has been modified from lecture notes by Prof. Somchai Prasitjutrakul and Prof. Dimitris Papadias.
2String Matching
3Notation
- P The pattern being searched for
- T The text in which P is sought
- m The length of P
- n The length of T
- p_i, t_i The ith characters of P and T are denoted with lower-case letters and subscripts.
- j Current position within T
- k Current position within P
4string matching
- Naive string matching
    for (i=0; T[i] != '\0'; i++)
    {
        for (j=0; T[i+j] != '\0' && P[j] != '\0' && T[i+j]==P[j]; j++) ;
        if (P[j] == '\0') found a match
    }
- There are two nested loops: the inner one takes O(m) iterations and the outer one takes O(n) iterations, so the total time is the product, O(mn). This is slow; we'd like to speed it up.
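The nested loops above can be written out as a compilable C function. This is a minimal sketch: the name `naive_search` and the choice to return the index of the first match (or -1) are assumptions made for illustration, since the slide only says "found a match".

```c
/* Naive string matching: try every start position i in T.
   Returns the index of the first occurrence of P in T, or -1 if none.
   Both strings must be NUL-terminated. (Hypothetical helper name.) */
int naive_search(const char *T, const char *P)
{
    for (int i = 0; T[i] != '\0'; i++) {
        int j = 0;
        /* Inner loop: extend the match while the characters agree. */
        while (T[i + j] != '\0' && P[j] != '\0' && T[i + j] == P[j])
            j++;
        if (P[j] == '\0')
            return i;   /* all of P matched starting at T[i] */
    }
    return -1;
}
```

The outer loop runs at most n times and the inner loop at most m times per start position, which is where the O(mn) bound comes from.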
5Example
- Suppose we're looking for pattern "nano" in text "banananobano".
- Each row represents an iteration of the outer loop, with each character in the row representing the result of a comparison (X if the comparison was unequal).
6Example
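A trace of the kind described (one row per outer iteration, matched characters followed by X on a mismatch) can be regenerated with a short program. This is a sketch under assumptions: `trace_row` and `print_trace` are hypothetical helper names, and the layout is only an approximation of the slide's figure.

```c
#include <stdio.h>

/* Writes the comparison results for outer iteration i into out:
   each matched character, then 'X' if a comparison failed.
   out needs room for at least strlen(P) + 1 bytes. */
void trace_row(const char *T, const char *P, int i, char *out)
{
    int j = 0, k = 0;
    while (T[i + j] != '\0' && P[j] != '\0') {
        if (T[i + j] != P[j]) {
            out[k++] = 'X';      /* unequal comparison ends the row */
            break;
        }
        out[k++] = P[j];         /* equal comparison */
        j++;
    }
    out[k] = '\0';
}

/* Prints the text, then one row per outer iteration, indented so that
   each result sits under the text character it was compared with. */
void print_trace(const char *T, const char *P)
{
    char row[64];                /* assumes strlen(P) < 64 */
    printf("     %s\n", T);
    for (int i = 0; T[i] != '\0'; i++) {
        trace_row(T, P, i, row);
        printf("i=%-2d %*s%s\n", i, i, "", row);
    }
}
```

Calling `print_trace("banananobano", "nano")` gives the row `nanX` for i=2 (a partial match) and `nano` for i=4 (the full match).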
7Note
- Some of these comparisons are wasted work!
- For instance, after iteration i=2, we know from the comparisons we've done that T[3]="a", so there is no point comparing it to "n" in iteration i=3.
- And we also know that T[4]="n", so there is no point making the same comparison in iteration i=4.
8Skipping outer iterations
- Try overlapping the partial match you've found with the new match you want to find:
    i=2: n a n
    i=3:   n a n o
- We know from the i=2 iteration that T[3] and T[4] are "a" and "n", so they can't be the "n" and "a" that the i=3 iteration is looking for. We can keep skipping positions until we find one that doesn't conflict:
    i=2: n a n
    i=4:     n a n o
9String matching with skipped iterations
    i=0;
    while (i<n)
    {
        for (j=0; T[i+j] != '\0' && P[j] != '\0' && T[i+j]==P[j]; j++) ;
        if (P[j] == '\0') found a match;
        i = i + max(1, j-overlap(P[0..j-1],P[0..m]));
    }
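The loop above relies on an overlap function that the slide does not spell out. The sketch below assumes that overlap(P[0..j-1], P[0..m]) means the length of the longest string that is both a suffix of P[0..j-1] and a prefix of P, excluding P[0..j-1] itself. It computes this by brute force, and the names `overlap_len` and `skip_search` are hypothetical.

```c
#include <string.h>

/* Length of the longest string that is both a suffix of P[0..j-1] and a
   prefix of P, excluding the whole of P[0..j-1] (assumed definition).
   Brute force, for illustration only. */
int overlap_len(const char *P, int j)
{
    for (int len = j - 1; len > 0; len--)
        if (strncmp(P, P + j - len, (size_t)len) == 0)
            return len;
    return 0;
}

/* Naive inner loop, but the outer loop skips start positions that the
   last partial match has already ruled out.
   Returns the index of the first match, or -1 if there is none. */
int skip_search(const char *T, const char *P)
{
    int n = (int)strlen(T);
    int i = 0;
    while (i < n) {
        int j = 0;
        while (T[i + j] != '\0' && P[j] != '\0' && T[i + j] == P[j])
            j++;
        if (P[j] == '\0')
            return i;
        int shift = j - overlap_len(P, j);
        i += (shift > 1) ? shift : 1;      /* max(1, j - overlap) */
    }
    return -1;
}
```

For pattern "nano" and a partial match of length j=3 ("nan"), the overlap is 1 (the trailing "n"), so the search moves from i=2 straight to i=4.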
10Skipping inner iterations
- The other optimization that can be done is to skip some iterations in the inner loop. Let's look at the same example, in which we skipped from i=2 to i=4:
    i=2: n a n
    i=4:     n a n o
- The "n" that overlaps has already been tested by the i=2 iteration. There's no need to test it again in the i=4 iteration. In general, if we have a nontrivial overlap with the last partial match, we can avoid testing a number of characters equal to the length of the overlap.
11KMP, version 1
    i=0;
    o=0;
    while (i<n)
    {
        for (j=o; T[i+j] != '\0' && P[j] != '\0' && T[i+j]==P[j]; j++) ;
        if (P[j] == '\0') found a match;
        o = overlap(P[0..j-1],P[0..m]);
        i = i + max(1, j-o);
    }
The only remaining detail is how to compute the overlap function. This is a function only of j, and not of the characters in T, so we can compute it once in a preprocessing stage.
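One way to do that preprocessing is to fill a table `ov[0..m]` in which `ov[j]` holds the overlap for a partial match of length j, reusing earlier entries to compute later ones. The sketch below is one possible realization, not necessarily the one intended by the slides. The names `compute_overlap` and `kmp_search` are hypothetical, and returning the first match index is an assumption.

```c
#include <stdlib.h>
#include <string.h>

/* ov[j] = length of the longest proper suffix of P[0..j-1] that is also
   a prefix of P, for j = 0..m. ov must have room for m + 1 ints. */
void compute_overlap(const char *P, int m, int *ov)
{
    ov[0] = 0;
    if (m >= 1)
        ov[1] = 0;
    for (int j = 2; j <= m; j++) {
        int o = ov[j - 1];
        /* Shrink the candidate overlap until P[j-1] can extend it. */
        while (o > 0 && P[o] != P[j - 1])
            o = ov[o];
        ov[j] = (P[o] == P[j - 1]) ? o + 1 : 0;
    }
}

/* KMP, version 1: skips both outer and inner iterations.
   Returns the index of the first match, or -1 if there is none
   (also -1 if the table cannot be allocated). */
int kmp_search(const char *T, const char *P)
{
    int n = (int)strlen(T), m = (int)strlen(P);
    int *ov = malloc((size_t)(m + 1) * sizeof *ov);
    if (ov == NULL)
        return -1;
    compute_overlap(P, m, ov);

    int i = 0, o = 0, found = -1;
    while (i < n) {
        int j = o;   /* the first o characters are already known to match */
        while (T[i + j] != '\0' && P[j] != '\0' && T[i + j] == P[j])
            j++;
        if (P[j] == '\0') {
            found = i;
            break;
        }
        o = ov[j];
        i += (j - o > 1) ? j - o : 1;      /* max(1, j - o) */
    }
    free(ov);
    return found;
}
```

For P = "nano" the table is {0, 0, 0, 1, 0}: only the partial match "nan" has a nontrivial overlap.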
12KMP time analysis (1/2)
- We still have an outer loop and an inner loop, so it looks like the time might still be O(mn). But we can count it a different way to see that it's actually always less than that.
- We split the comparisons into two groups: those that return true, and those that return false.
- If a comparison returns true, we've determined the value of T[i+j]. Then in future iterations, as long as there is a nontrivial overlap involving T[i+j], we'll skip past that overlap and not make a comparison with that position again.
13KMP time analysis (2/2)
- So each position of T is only involved in one true comparison, and there can be at most n such comparisons in total.
- On the other hand, there is at most one false comparison per iteration of the outer loop, so there can also only be n of those. As a result we see that this part of the KMP algorithm makes at most 2n comparisons and takes time O(n).
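The 2n bound can be checked empirically by instrumenting the search loop with a counter. The sketch below is illustrative only: `kmp_count_comparisons` and `MAX_M` are hypothetical names, and the function scans all of T without stopping at a match so that every outer iteration is counted.

```c
#include <string.h>

#define MAX_M 64   /* assumed upper bound on the pattern length */

/* Runs the KMP loops over the whole text and returns the number of
   character comparisons T[i+j] == P[j] performed (true and false),
   or -1 if the pattern is longer than MAX_M. */
long kmp_count_comparisons(const char *T, const char *P)
{
    int n = (int)strlen(T), m = (int)strlen(P);
    int ov[MAX_M + 1];
    if (m > MAX_M)
        return -1;

    /* Preprocessing: overlap table, ov[j] for a partial match of length j. */
    ov[0] = 0;
    if (m >= 1)
        ov[1] = 0;
    for (int j = 2; j <= m; j++) {
        int o = ov[j - 1];
        while (o > 0 && P[o] != P[j - 1])
            o = ov[o];
        ov[j] = (P[o] == P[j - 1]) ? o + 1 : 0;
    }

    long count = 0;
    int i = 0, o = 0;
    while (i < n) {
        int j = o;
        while (T[i + j] != '\0' && P[j] != '\0') {
            count++;                   /* one character comparison */
            if (T[i + j] != P[j])
                break;                 /* the single false comparison */
            j++;
        }
        o = ov[j];
        i += (j - o > 1) ? j - o : 1;  /* max(1, j - o) */
    }
    return count;
}
```

For the running example, `kmp_count_comparisons("banananobano", "nano")` returns 14, under the bound 2n = 24.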
14Finite State Machine
Finite automaton for P = AABC
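The figure for this slide is not reproduced here. As a rough sketch of what such an automaton does, assume that state s (0..m) means "the last s characters read equal P[0..s-1]" and that state m is accepting. The code below computes transitions on demand by brute force rather than tabulating them, and `next_state` and `automaton_search` are hypothetical names.

```c
#include <string.h>

/* Returns the state reached from state s on character c: the length of
   the longest prefix of P that is a suffix of P[0..s-1] followed by c.
   Brute force, for illustration only. */
int next_state(const char *P, int m, int s, char c)
{
    int k = (s + 1 < m) ? s + 1 : m;        /* longest candidate prefix */
    for (; k > 0; k--) {
        if (P[k - 1] != c)
            continue;                        /* prefix must end in c */
        /* Its first k-1 characters must match the end of P[0..s-1]. */
        if (strncmp(P, P + s - (k - 1), (size_t)(k - 1)) == 0)
            return k;
    }
    return 0;
}

/* Feeds T through the automaton one character at a time.
   Returns the index of the first match, or -1 if there is none. */
int automaton_search(const char *T, const char *P)
{
    int m = (int)strlen(P), s = 0;
    if (m == 0)
        return 0;                            /* empty pattern matches at 0 */
    for (int i = 0; T[i] != '\0'; i++) {
        s = next_state(P, m, s, T[i]);
        if (s == m)
            return i - m + 1;                /* match ends at T[i] */
    }
    return -1;
}
```

For P = AABC, reading "A" in state 2 stays in state 2 (the text ends in "AA"), and reading "A" in state 3 falls back to state 1.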
15KMP Flow chart
16KMP Flow chart
17Example Action of KMP flowchart
P = ABABCB, T = ACABAABABA