Title: Chapter 13 String Matching
1 Chapter 13: String Matching
2 Objectives
- Discuss the following topics
- Exact String Matching
- Approximate String Matching
- Case Study: Longest Common Substring
3 Exact String Matching
- Stringology's major area of interest is pattern matching
- Exact string matching consists of finding an exact copy of pattern P in text T
4 Exact String Matching (continued)
- bruteForceStringMatching(pattern P, text T)
-   i = 0
-   while i <= |T| - |P|
-     j = 0
-     while j < |P| and T[i] == P[j]
-       i++   // try to match all characters in P
-       j++
-     if j == |P|
-       return match at i - |P|   // success if the end of P is reached
-     // if there is a mismatch,
-     i = i - j + 1   // shift P to the right by one position
-   return no match   // failure if fewer characters left in T than P
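The brute-force pseudocode can be sketched in Python as follows. This is a minimal illustration, not the chapter's own listing; the function name `brute_force_search` and the convention of returning -1 for "no match" are assumptions made for this example.

```python
def brute_force_search(pattern: str, text: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    i = 0
    while i <= n - m:
        j = 0
        # Try to match all characters of the pattern at the current position.
        while j < m and text[i] == pattern[j]:
            i += 1
            j += 1
        if j == m:
            return i - m          # the end of the pattern was reached
        i = i - j + 1             # mismatch: shift the pattern right by one
    return -1                     # fewer characters left in text than in pattern
```

In the worst case the inner loop rescans almost the whole pattern at every position, which is the inefficiency the later algorithms try to avoid.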
5 Exact String Matching (continued)
- Hancart(pattern P, text T)
-   if P[0] == P[1]
-     sEqual = 1
-     sDiff = 2
-   else sEqual = 2
-     sDiff = 1
-   i = 0
-   while i <= |T| - |P|
-     if T[i+1] != P[1]
-       i = i + sDiff
-     else j = 1
-       while j < |P| and T[i+j] == P[j]
-         j++
-       if j == |P| and P[0] == T[i]
-         return match at i
-       i = i + sEqual
-   return no match
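A possible Python rendering of the Hancart pseudocode is shown below. It is a sketch under two assumptions not stated on the slide: the pattern has at least two characters (shorter patterns fall back to `str.find`), and "no match" is reported as -1. The name `hancart_search` is hypothetical.

```python
def hancart_search(pattern: str, text: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m < 2:
        # Assumption: the slide's algorithm needs P[0] and P[1].
        return text.find(pattern)
    if pattern[0] == pattern[1]:
        s_equal, s_diff = 1, 2
    else:
        s_equal, s_diff = 2, 1
    i = 0
    while i <= n - m:
        if text[i + 1] != pattern[1]:
            i += s_diff               # second character differs
        else:
            j = 1
            while j < m and text[i + j] == pattern[j]:
                j += 1
            # The first character is compared last.
            if j == m and pattern[0] == text[i]:
                return i
            i += s_equal
    return -1
```

The two shift values come from comparing P[0] with P[1] once, before the scan: knowing whether T[i+1] equals P[1] tells the algorithm whether position i+1 can possibly start a match.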
6 The Knuth-Morris-Pratt Algorithm
- The Knuth-Morris-Pratt algorithm can be obtained from bruteForceStringMatching()
- KnuthMorrisPratt(pattern P, text T)
-   findNext(P, next)
-   i = j = 0
-   while i - j <= |T| - |P|
-     while j == -1 or (j < |P| and T[i] == P[j])
-       i++   // increment i only for matched character
-       j++
-     if j == |P|
-       return a match at i - |P|
-     j = next[j]   // in the case of a mismatch, i does not change
-   return no match
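The Knuth-Morris-Pratt pseudocode can be sketched in Python as below. The slide does not show findNext(), so `find_next` here is an assumed, basic version: `nxt[j]` is the length of the longest proper prefix of `pattern[:j]` that is also its suffix, with `nxt[0] = -1` as a sentinel. The outer loop is bounded by the alignment start `i - j`, and -1 stands for "no match"; both are choices made for this sketch.

```python
def find_next(pattern: str) -> list[int]:
    """Assumed helper: basic KMP table with nxt[0] == -1."""
    m = len(pattern)
    nxt = [-1] * m
    k = -1
    for j in range(1, m):
        # Fall back until pattern[j-1] extends the current border.
        while k >= 0 and pattern[j - 1] != pattern[k]:
            k = nxt[k]
        k += 1
        nxt[j] = k
    return nxt


def kmp_search(pattern: str, text: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    nxt = find_next(pattern)
    i = j = 0
    while i - j <= n - m:             # i - j is where the pattern currently starts
        while j == -1 or (j < m and text[i] == pattern[j]):
            i += 1                    # i advances only on a match (or sentinel)
            j += 1
        if j == m:
            return i - m
        j = nxt[j]                    # mismatch: i does not change
    return -1
```

Because i never moves backward, each text character is examined a bounded number of times, in contrast to the brute-force version, which resets i after every mismatch.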
7 The Boyer-Moore Algorithm
- The Boyer-Moore algorithm tries to match P with T by comparing them from right to left, not from left to right
8 The Boyer-Moore Algorithm (continued)
- BoyerMooreSimple(pattern P, text T)
-   initialize all cells of delta1 to |P|
-   for j = 0 to |P| - 1
-     delta1[P[j]] = |P| - j - 1
-   i = |P| - 1
-   while i < |T|
-     j = |P| - 1
-     while j >= 0 and P[j] == T[i]
-       i--
-       j--
-     if j == -1
-       return match at i+1
-     i = i + max(delta1[T[i]], |P| - j)
-   return no match
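The simplified Boyer-Moore pseudocode translates fairly directly to Python. In this sketch, `delta1` is stored as a dictionary rather than an array over the whole alphabet (characters not in the pattern default to |P|), and -1 signals "no match"; these are implementation choices for the example, not part of the slide.

```python
def boyer_moore_simple(pattern: str, text: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    # delta1[c] = distance from the rightmost occurrence of c to the end of P.
    # Later indices overwrite earlier ones, so the rightmost occurrence wins.
    delta1 = {ch: m - j - 1 for j, ch in enumerate(pattern)}
    i = m - 1
    while i < n:
        j = m - 1
        # Compare from right to left.
        while j >= 0 and pattern[j] == text[i]:
            i -= 1
            j -= 1
        if j == -1:
            return i + 1
        # m - j guarantees the pattern always moves right by at least one.
        i += max(delta1.get(text[i], m), m - j)
    return -1
```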
9 The Sunday Algorithms
- Daniel Sunday (1990) observed that in the case of a mismatch with a text character T[i], the pattern shifts to the right by at least one position; thus, the character T[i+|P|] is included in the next match attempt
- It is more advantageous to build delta1 with respect to character T[i+|P|]
- Sunday introduced two more algorithms, based on a generalized delta2 table
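Sunday's observation can be illustrated with a short sketch of the variant usually called Quick Search. Here `i` is taken to be the start of the current window, so `text[i + m]` is the character just past it. The table name `shift`, the use of slicing for the comparison, and the -1 return value are assumptions of this example; the delta2-based variants mentioned on the slide are not shown.

```python
def sunday_quick_search(pattern: str, text: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    # Shift for the character just past the window: align it with its
    # rightmost occurrence in the pattern, or jump past it (m + 1) if absent.
    shift = {ch: m - j for j, ch in enumerate(pattern)}
    i = 0
    while i <= n - m:
        if text[i:i + m] == pattern:
            return i
        if i + m >= n:
            break                      # no character follows the window
        i += shift.get(text[i + m], m + 1)
    return -1
```

Since the shift depends only on `text[i + m]`, the order in which the window itself is compared does not matter, which is why a plain slice comparison suffices in this sketch.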
10 Multiple Searches
- All preceding algorithms find an occurrence of a pattern in text and discontinue after finding the first
- Modifying the Boyer-Moore algorithm allows for multiple searches
11 Multiple Searches (continued)
- BoyerMooreSimple(pattern P, text T)
-   initialize all cells of delta1 to |P|
-   for j = 0 to |P| - 1
-     delta1[P[j]] = |P| - j - 1
-   compute delta2
-   i = |P| - 1
-   while i < |T|
-     j = |P| - 1
-     while j >= 0 and P[j] == T[i]
-       i--
-       j--
-     if j == -1
-       output match at i+1
-       i = i + |P| + 1   // shift P by one position to the right
-     else i = i + max(delta1[T[i]], delta2[j])
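The idea of reporting every occurrence can be sketched as follows. Note the substitution: this example does not compute delta2; on a mismatch it uses the simpler shift max(delta1, |P| - j) from the single-match version, which is still correct but may shift less far. The function name `boyer_moore_all` is hypothetical.

```python
def boyer_moore_all(pattern: str, text: str) -> list[int]:
    """Return the start indices of all (possibly overlapping) occurrences."""
    m, n = len(pattern), len(text)
    matches: list[int] = []
    if m == 0:
        return matches
    delta1 = {ch: m - j - 1 for j, ch in enumerate(pattern)}
    i = m - 1
    while i < n:
        j = m - 1
        while j >= 0 and pattern[j] == text[i]:
            i -= 1
            j -= 1
        if j == -1:
            matches.append(i + 1)
            i += m + 1                 # shift the pattern right by one position
        else:
            # Assumption: delta2 is replaced here by the m - j fallback.
            i += max(delta1.get(text[i], m), m - j)
    return matches
```

For example, searching for "aa" in "aaaa" reports the overlapping occurrences at 0, 1, and 2, because after each match the pattern moves by only one position.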
12 Bit-Oriented Approach
- Each state of the search is represented as a number (that is, a string of bits), and a transition from one state to the next is the result of a small number of bitwise operations
- A shift-and algorithm that uses a bit-oriented approach for string matching was proposed by Baeza-Yates and Gonnet (1992)
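A minimal sketch of the shift-and idea is given below. Bit j of `state` is set when the first j+1 pattern characters match the text ending at the current position. The names `masks`, `state`, and `accept` are chosen for this example, and Python's unbounded integers stand in for a machine word.

```python
def shift_and_search(pattern: str, text: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    m = len(pattern)
    if m == 0:
        return 0
    # masks[c] has bit j set exactly when pattern[j] == c.
    masks: dict[str, int] = {}
    for j, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << j)
    accept = 1 << (m - 1)              # bit that signals a full match
    state = 0
    for i, ch in enumerate(text):
        # Extend every partial match by one character and start a new one.
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            return i - m + 1
    return -1
```

Each text character costs one shift, one OR, and one AND, which is the "small number of bitwise operations" per transition.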
13 Matching Sets of Words
- To considerably improve run time by considering all relevant words at the same time during the match process, Aho and Corasick (1975) constructed a string-matching automaton
- The goto function is constructed in the form of a trie, or multiway tree, in which consecutive characters of a string are used to navigate the search in the tree
14 Matching Sets of Words (continued)
- AhoCorasick(set keywords, text T)
-   computeGotoFunction(keywords, g, output)   // the output function is computed
-   computeFailureFunction(g, output, f)       // in these two functions
-   state = 0
-   for i = 0 to |T| - 1
-     while g(state, T[i]) == fail
-       state = f(state)
-     state = g(state, T[i])
-     if output(state) is not empty
-       output a match ending at i:
-       output(state)
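The Aho-Corasick scan can be sketched in Python as below. The slide leaves computeGotoFunction() and computeFailureFunction() unspecified, so `build_automaton` is an assumed implementation that merges them: the goto function is a trie stored as a list of dictionaries, and failure links are filled in breadth-first. All names are hypothetical.

```python
from collections import deque


def build_automaton(keywords):
    """Assumed helper: build goto (a trie), output sets, and failure links."""
    goto = [{}]                        # goto[state][char] -> next state
    output = [set()]
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                output.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].add(word)
    fail = [0] * len(goto)
    queue = deque(goto[0].values())    # depth-1 states fail to the root
    while queue:
        r = queue.popleft()
        for ch, s in goto[r].items():
            queue.append(s)
            f = fail[r]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[s] = goto[f].get(ch, 0)
            output[s] |= output[fail[s]]   # inherit keywords ending here
    return goto, output, fail


def aho_corasick(keywords, text):
    """Return (end_index, keyword) pairs for every keyword occurrence."""
    goto, output, fail = build_automaton(keywords)
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in output[state]:
            matches.append((i, word))
    return matches
```

With the keyword set from Figure 13-1 and the text outinputting, the scan reports out, in, input, put, and a second in, each with the index of its last character.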
15 Matching Sets of Words (continued)
Figure 13-1 (a) A trie for the string inner, (b) for the strings inner and input, and (c) for the set of keywords inner, input, in, outer, output, out, put, outing, tint
16 Matching Sets of Words (continued)
Figure 13-1 (d) The trie (c) with failure links, (e) scanning the trie (d) for the text T = outinputting (continued)
17 Regular Expression Matching
- All letters of the alphabet are regular expressions
- If r and s are regular expressions, then r|s, (r), r*, and rs are regular expressions
- Regular expression r|s represents regular expression r or s
- Regular expression r* (where the star is called a Kleene closure) represents any finite sequence of r's: r, rr, rrr, . . . (in the standard definition, also the empty sequence)
18 Regular Expression Matching (continued)
- Regular expression rs represents the concatenation of r and s
- (r) represents regular expression r
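The four constructions (alternation, grouping, Kleene closure, concatenation) can be tried out with Python's standard `re` module, whose syntax for these operators coincides with the notation above. This is only an illustration of the operators, not of the automaton construction; the helper `matches` is a name introduced for this example.

```python
import re


def matches(regex: str, s: str) -> bool:
    """True if the whole string s is described by regex."""
    return re.fullmatch(regex, s) is not None


alternation = matches("a|b", "b")          # r|s: either r or s
closure = matches("a*", "aaa")             # r*: a finite sequence of r's
concatenation = matches("ab", "ab")        # rs: r followed by s
grouping = matches("(a)", "a")             # (r): same language as r
combined = matches("(a|b)*c", "abbac")     # operators compose
rejected = matches("(a|b)*c", "abca")      # trailing a is not allowed
```

Note that in `re`, as in the standard definition, `a*` also matches the empty string.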
19 Regular Expression Matching (continued)
Figure 13-2 (a) An automaton representing one letter c; an automaton representing a regular expression (b) r|s
20 Regular Expression Matching (continued)
Figure 13-2 (c) rs, (d) r* (continued)
21 Regular Expression Matching (continued)
Figure 13-3 The Thompson automaton for the regular expression a(bcd)ef
22 Suffix Tries and Trees
- A suffix trie for a text T is a tree structure in
which each edge is labeled with one letter of T
and each suffix of T is represented in the trie
as a concatenation of edge labels from the root
to some node of the trie
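The definition can be made concrete with a naive construction that inserts every suffix character by character. This is a quadratic-time and quadratic-space sketch for illustration only, not Ukkonen's construction; nested dictionaries stand in for trie nodes, and the function names are hypothetical.

```python
def build_suffix_trie(text: str) -> dict:
    """Insert every suffix of text into a trie of nested dictionaries."""
    root: dict = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            # Each edge is labeled with one letter of the text.
            node = node.setdefault(ch, {})
    return root


def contains(trie: dict, pattern: str) -> bool:
    """A pattern occurs in the text iff it labels a path from the root."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("caracas")
```

Because every substring is a prefix of some suffix, a substring query takes time proportional to the pattern length, independent of the text length.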
23 Suffix Tries and Trees (continued)
Figure 13-4 (a) A suffix trie for the string caracas
24 Suffix Tries and Trees (continued)
Figure 13-4 (b) A suffix tree for the substring caraca and (c) for the string caracas (continued)
25 Suffix Tries and Trees (continued)
Figure 13-5 Creating an Ukkonen suffix trie for the string pepper
26 Suffix Tries and Trees (continued)
Figure 13-5 Creating an Ukkonen suffix trie for the string pepper (continued)
27 Suffix Tries and Trees (continued)
Figure 13-5 Creating an Ukkonen suffix trie for the string pepper (continued)
28 Suffix Tries and Trees (continued)
Figure 13-6 Creating an Ukkonen suffix tree for the string pepper
29 Suffix Tries and Trees (continued)
Figure 13-6 Creating an Ukkonen suffix tree for the string pepper (continued)
30 Suffix Arrays
- If suffix trees require too much space, a simple alternative is a suffix array (Manber and Myers, 1993)
- Suffix array pos is an array of the positions 0 through |T| - 1 of suffixes taken in lexicographic order
- The suffix array can be created from an existing suffix tree on which an ordered depth-first traversal is performed
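A suffix array can also be built directly by sorting suffix start positions, which is enough to illustrate the definition. The sketch below uses plain sorting rather than the suffix-tree traversal mentioned above or the Manber-Myers construction, and adds an assumed helper that binary-searches the array for a pattern.

```python
def suffix_array(text: str) -> list[int]:
    """Positions 0..len(text)-1 ordered by the suffixes that start there."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_with_suffix_array(text: str, pos: list[int], pattern: str) -> int:
    """Return a position where pattern occurs in text, or -1 (binary search)."""
    m = len(pattern)
    lo, hi = 0, len(pos)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[pos[mid]:pos[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(pos) and text.startswith(pattern, pos[lo]):
        return pos[lo]
    return -1


pos = suffix_array("caracas")
```

For caracas the suffixes in lexicographic order are acas, aracas, as, caracas, cas, racas, s, so `pos` is [3, 1, 5, 0, 4, 2, 6].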
31 Approximate String Matching
- A popular measure of the similarity of two strings is the number of elementary edit operations that are needed to transform one string into another
- The difference between two strings is sought in terms of insertion (I), deletion (D), and substitution (S)
- The difference can be represented as a trace, an alignment (matching), or a listing (derivation)
32 String Similarity
- The string similarity problem can be approached by reducing the problem of finding the minimum distance for a particular i and j to the problem of finding the minimum distance for values not larger than i and j
- There are four possibilities: deletion, insertion, substitution, and match
- The Wagner and Fischer algorithm (1974) solves the string similarity problem with dynamic programming, filling a table of minimum distances
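The reduction to smaller values of i and j is the classic dynamic-programming recurrence. The sketch below assumes unit cost for each insertion, deletion, and substitution (the slide does not state the costs), and takes `dist[i][j]` to be the minimum distance between the first i characters of `s` and the first j characters of `t`.

```python
def edit_distance(s: str, t: str) -> int:
    """Minimum number of insertions, deletions, and substitutions (unit costs)."""
    m, n = len(s), len(t)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i                 # delete all i characters of s
    for j in range(n + 1):
        dist[0][j] = j                 # insert all j characters of t
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # match (0) or substitution (1)
            )
    return dist[m][n]
```

The three arguments of `min` correspond to the four possibilities listed above, with match and substitution sharing the diagonal cell.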
33 String Matching with k Errors
- To determine all substrings of text T for which the Levenshtein distance does not exceed k, perform string matching with k errors or k differences
- All the possibilities for matching P(0...j) with a substring of T that ends at position i with e <= k errors can be summarized using Match, Substitution, Insertion, and Deletion; in the Match case, P[j] = T[i] and there is a match with e errors between P(0...j - 1) and a substring ending at T[i-1]
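One way to realize matching with k differences is to run the edit-distance recurrence column by column over the text while letting a match start anywhere, so the first cell of every column stays 0. The sketch below follows that approach under unit edit costs; it reports end positions only, and the name `search_with_k_errors` is hypothetical.

```python
def search_with_k_errors(pattern: str, text: str, k: int) -> list[int]:
    """Return end indices i such that some substring of text ending at i
    is within Levenshtein distance k of pattern."""
    m = len(pattern)
    # col[j] = fewest errors matching pattern[:j] against a substring
    # ending just before the current text position.
    col = list(range(m + 1))
    ends: list[int] = []
    for i, ch in enumerate(text):
        prev_diag = col[0]             # col[0] stays 0: a match may start anywhere
        for j in range(1, m + 1):
            old = col[j]
            cost = 0 if pattern[j - 1] == ch else 1
            col[j] = min(
                prev_diag + cost,      # match or substitution
                old + 1,               # extra text character
                col[j - 1] + 1,        # missing pattern character
            )
            prev_diag = old
        if col[m] <= k:
            ends.append(i)
    return ends
```

With k = 0 this degenerates to exact matching; for pattern abc in text xabcx it reports only end index 3, while k = 1 also accepts the substrings ending at 2 and 4.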
34 Case Study: Longest Common Substring
Figure 13-7 (a-h) Creating an Ukkonen suffix tree for the string abaabaac
35 Case Study: Longest Common Substring (continued)
Figure 13-7 (a-h) Creating an Ukkonen suffix tree for the string abaabaac (continued)
36 Case Study: Longest Common Substring (continued)
Figure 13-7 (i) A data structure used for implementation of the Ukkonen tree (h) (continued)
37 Case Study: Longest Common Substring (continued)
Figure 13-8 Listing of the program to find the longest common substring
50 Summary
- Stringology's major area of interest is pattern matching
- Exact string matching consists of finding an exact copy of pattern P in text T
- The bruteForceStringMatching algorithm is an example of exact string matching
- The Knuth-Morris-Pratt algorithm can be obtained from bruteForceStringMatching()
- The Boyer-Moore algorithm tries to match P with T by comparing them from right to left, not from left to right
51 Summary (continued)
- Daniel Sunday (1990) observed that in the case of a mismatch with a text character T[i], the pattern shifts to the right by at least one position; thus, the character T[i+|P|] is included in the next match attempt
- Modifying the Boyer-Moore algorithm allows for multiple searches
- A shift-and algorithm that uses a bit-oriented approach for string matching was proposed by Baeza-Yates and Gonnet (1992)
- To considerably improve run time by considering all relevant words at the same time during the match process, Aho and Corasick (1975) constructed a string-matching automaton
52 Summary (continued)
- All letters of the alphabet are regular expressions
- A suffix trie for a text T is a tree structure in which each edge is labeled with one letter of T and each suffix of T is represented in the trie as a concatenation of edge labels from the root to some node of the trie
- If suffix trees require too much space, a simple alternative is a suffix array (Manber and Myers, 1993)
53 Summary (continued)
- A popular measure of the similarity of two strings is the number of elementary edit operations that are needed to transform one string into another
- The difference between two strings is sought in terms of insertion (I), deletion (D), and substitution (S)
- The string similarity problem can be approached by reducing the problem of finding the minimum distance for a particular i and j to the problem of finding the minimum distance for values not larger than i and j
54 Summary (continued)
- The Wagner and Fischer algorithm (1974) solves the string similarity problem with dynamic programming
- To determine all substrings of text T for which the Levenshtein distance does not exceed k, perform string matching with k errors or k differences