Chapter 13 String Matching - PowerPoint PPT Presentation

About This Presentation

Title:

Chapter 13 String Matching

Description:

Stringology's major area of interest is pattern matching ... (c) or the set keywords = {inner, input, in, outer, output, out, put, outing, tint } ... – PowerPoint PPT presentation

Slides: 55

Provided by: cynd4

Transcript and Presenter's Notes

Title: Chapter 13 String Matching

1
Chapter 13: String Matching
2
Objectives

Discuss the following topics
Exact String Matching
Approximate String Matching
Case Study: Longest Common Substring

3
Exact String Matching

Stringology's major area of interest is pattern
matching
Exact string matching consists of finding an
exact copy of pattern P in text T

4
Exact String Matching (continued)

bruteForceStringMatching(pattern P, text T)
    i = 0
    while i <= |T| - |P|
        j = 0
        while j < |P| and T[i] == P[j]
            i++    // try to match all characters in P
            j++
        if j == |P|
            return match at i - |P|    // success if the end of P is reached
        // if there is a mismatch,
        i = i - j + 1    // shift P to the right by one position
    return no match    // failure if fewer characters left in T than P
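
A minimal Python sketch of the brute-force idea, for illustration only. The function name `brute_force_search` and the convention of returning -1 for "no match" are assumptions of this sketch, not part of the slides. It tracks the start of each alignment directly instead of rewinding i.

```python
def brute_force_search(pattern: str, text: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    for start in range(n - m + 1):          # every alignment that fits in the text
        j = 0
        while j < m and text[start + j] == pattern[j]:
            j += 1                          # extend the match one character
        if j == m:                          # the end of the pattern was reached
            return start
    return -1                               # fewer characters left than in pattern
```

In the worst case every alignment compares almost the whole pattern, so the running time is proportional to |T| times |P|.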

5
Exact String Matching (continued)

Hancart(pattern P, text T)
    if P[0] == P[1]
        sEqual = 1
        sDiff = 2
    else sEqual = 2
        sDiff = 1
    i = 0
    while i <= |T| - |P|
        if T[i+1] != P[1]
            i = i + sDiff
        else j = 1
            while j < |P| and T[i+j] == P[j]
                j++
            if j == |P| and P[0] == T[i]
                return match at i
            i = i + sEqual
    return no match
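
The Hancart pseudocode above can be sketched in Python roughly as follows. The name `hancart_search`, the -1 return value, and the fallback to `str.find` for patterns shorter than two characters are assumptions of this sketch; the pseudocode itself presumes that P[1] exists.

```python
def hancart_search(pattern: str, text: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m < 2:
        return text.find(pattern)           # assumption: the method needs P[0] and P[1]
    if pattern[0] == pattern[1]:
        s_equal, s_diff = 1, 2              # shifts used after T[i+1] == P[1] / != P[1]
    else:
        s_equal, s_diff = 2, 1
    i = 0
    while i <= n - m:
        if text[i + 1] != pattern[1]:       # the second character is checked first
            i += s_diff
        else:
            j = 1
            while j < m and text[i + j] == pattern[j]:
                j += 1
            if j == m and pattern[0] == text[i]:
                return i                    # P[1..] matched and so did P[0]
            i += s_equal
    return -1
```

The two shift values rely only on whether the first two pattern characters are equal, which is why no preprocessing table is needed.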

6
The Knuth-Morris-Pratt Algorithm

The Knuth-Morris-Pratt algorithm can be obtained
from bruteForceStringMatching()
KnuthMorrisPratt(pattern P, text T)
    findNext(P, next)
    i = j = 0
    while i - j <= |T| - |P|    // i - j is where the current alignment of P starts
        while j == -1 or (j < |P| and T[i] == P[j])
            i++    // increment i only for matched character
            j++
        if j == |P|
            return a match at i - |P|
        j = next[j]    // in the case of a mismatch, i does not change
    return no match
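
A hedged Python sketch of the Knuth-Morris-Pratt search. The slides do not show findNext(), so `find_next` here is an assumption: it builds the common table in which next[0] = -1 and next[j] is the length of the longest proper prefix of P[0..j-1] that is also its suffix. The names and the -1 return value are likewise illustrative.

```python
def find_next(pattern: str) -> list:
    """Assumed table: nxt[0] = -1, nxt[j] = longest border of pattern[:j]."""
    m = len(pattern)
    nxt = [-1] * m
    i, j = 0, -1
    while i < m - 1:
        while j >= 0 and pattern[i] != pattern[j]:
            j = nxt[j]                      # fall back to a shorter border
        i += 1
        j += 1
        nxt[i] = j
    return nxt


def kmp_search(pattern: str, text: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    nxt = find_next(pattern)
    i = j = 0
    while i - j <= n - m:                   # the current alignment still fits in text
        while j == -1 or (j < m and text[i] == pattern[j]):
            i += 1                          # the text index never moves backward
            j += 1
        if j == m:
            return i - m
        j = nxt[j]                          # on a mismatch only j changes
    return -1
```

Because i is never decreased, the search phase makes a number of comparisons linear in |T|.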

7
The Boyer-Moore Algorithm

The Boyer-Moore algorithm tries to match P with T
by comparing them from right to left, not from
left to right

8
The Boyer-Moore Algorithm (continued)

BoyerMooreSimple(pattern P, text T)
    initialize all cells of delta1 to |P|
    for j = 0 to |P| - 1
        delta1[P[j]] = |P| - j - 1
    i = |P| - 1
    while i < |T|
        j = |P| - 1
        while j >= 0 and P[j] == T[i]
            i--
            j--
        if j == -1
            return match at i + 1
        i = i + max(delta1[T[i]], |P| - j)
    return no match
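
The simple Boyer-Moore variant above might look like this in Python. This is a sketch under assumptions: delta1 is kept in a dictionary whose missing entries default to |P|, and the name `boyer_moore_simple` and the -1 return value are illustrative.

```python
def boyer_moore_simple(pattern: str, text: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    # delta1[c]: distance from the last occurrence of c in P to the end of P
    delta1 = {ch: m - j - 1 for j, ch in enumerate(pattern)}
    i = m - 1                               # text index aligned with the end of P
    while i < n:
        j = m - 1
        while j >= 0 and pattern[j] == text[i]:
            i -= 1                          # compare from right to left
            j -= 1
        if j == -1:
            return i + 1
        # m - j guarantees that P moves right by at least one position
        i += max(delta1.get(text[i], m), m - j)
    return -1
```

When the mismatched text character does not occur in P at all, the pattern jumps past it entirely, which is the source of the algorithm's sublinear behavior on typical input.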

9
The Sunday Algorithms

Daniel Sunday (1990) observed that in the case of
a mismatch with a text character T[i], the pattern
shifts to the right by at least one position;
thus, the character T[i + |P|] is included
It is more advantageous to build delta1 with
respect to character T[i + |P|]
Sunday introduced two more algorithms, based on a
generalized delta2 table
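
One way to sketch Sunday's first idea (often called quick search) in Python. Assumptions of this sketch: i denotes the start of the current alignment, the table `shift` plays the role of delta1 built with respect to T[i + |P|], and the names and -1 return value are illustrative. The two delta2-based variants are not shown.

```python
def sunday_quick_search(pattern: str, text: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    # shift[c]: distance from the last occurrence of c in P to one past the end of P
    shift = {ch: m - j for j, ch in enumerate(pattern)}
    i = 0                                   # start of the current alignment
    while i <= n - m:
        if text[i:i + m] == pattern:        # comparison order does not matter here
            return i
        if i + m >= n:                      # no character follows the window
            break
        # align the last occurrence of T[i + m] in P with T[i + m];
        # characters absent from P allow a shift of m + 1
        i += shift.get(text[i + m], m + 1)
    return -1
```

The shift depends only on the character just past the window, so the window itself can be compared in any order.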

10
Multiple Searches

All the algorithms presented so far find the
first occurrence of a pattern in the text and
then stop
Modifying the Boyer-Moore algorithm allows for
multiple searches

11
Multiple Searches (continued)

BoyerMooreSimple(pattern P, text T)
    initialize all cells of delta1 to |P|
    for j = 0 to |P| - 1
        delta1[P[j]] = |P| - j - 1
    compute delta2
    i = |P| - 1
    while i < |T|
        j = |P| - 1
        while j >= 0 and P[j] == T[i]
            i--
            j--
        if j == -1
            output match at i + 1
            i = i + |P| + 1    // shift P by one position to the right
        else i = i + max(delta1[T[i]], delta2[j])
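
A simplified Python sketch of the multiple-search loop. Note the substitution: the slides use a delta2 table whose construction is not shown, so this sketch replaces delta2[j] with the |P| - j term from the single-match version. It still reports every occurrence, including overlapping ones, but may shift less far than a full delta2 would. The name `boyer_moore_all` is illustrative.

```python
def boyer_moore_all(pattern: str, text: str) -> list:
    """Return the start indexes of all occurrences of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0:
        return []
    delta1 = {ch: m - j - 1 for j, ch in enumerate(pattern)}
    matches = []
    i = m - 1
    while i < n:
        j = m - 1
        while j >= 0 and pattern[j] == text[i]:
            i -= 1
            j -= 1
        if j == -1:
            matches.append(i + 1)
            i += m + 1                      # shift P by one position to the right
        else:
            # assumption: m - j stands in for the delta2[j] entry of the slides
            i += max(delta1.get(text[i], m), m - j)
    return matches
```

For example, `boyer_moore_all("ana", "banana")` reports the two overlapping occurrences at positions 1 and 3.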

12
Bit-Oriented Approach

Each state of the search is represented as a
number, that is, a string of bits, and a transition
from one state to the next is the result of a
small number of bitwise operations
A shift-and algorithm that uses a bit-oriented
approach for string matching was proposed by
Baeza-Yates and Gonnet (1992)
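
A minimal sketch of a shift-and search in Python, to make the bit-oriented idea concrete. Bit j of `state` is set when P[0..j] matches the text ending at the current character. The names are illustrative, and Python's unbounded integers stand in for the machine words a production version would use.

```python
def shift_and_search(pattern: str, text: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    m = len(pattern)
    if m == 0:
        return 0
    masks = {}                              # masks[c] has bit j set when P[j] == c
    for j, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << j)
    accept = 1 << (m - 1)                   # bit that signals a complete match
    state = 0
    for i, ch in enumerate(text):
        # shift extends every partial match, OR 1 starts a new one,
        # AND keeps only the candidates consistent with the current character
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            return i - m + 1
    return -1
```

Each text character costs one shift, one OR, and one AND, regardless of how many partial matches are alive.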

13
Matching Sets of Words

To considerably improve run time by considering
all relevant words at the same time during the
match process, Aho and Corasick (1975)
constructed a string-matching automaton
The goto function is constructed in the form of a
trie, or multiway tree, in which consecutive
characters of a string are used to navigate the
search in the tree

14
Matching Sets of Words (continued)

AhoCorasick(set keywords, text T)
    computeGotoFunction(keywords, g, output)    // the output function is computed
    computeFailureFunction(g, output, f)        // in these two functions
    state = 0
    for i = 0 to |T| - 1
        while g(state, T[i]) == fail
            state = f(state)
        state = g(state, T[i])
        if output(state) is not empty
            output a match ending at i: output(state)
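
The following Python sketch shows one way the goto, failure, and output functions could be built and used. The list-of-dictionaries representation, the names `build_automaton` and `aho_corasick`, and the choice to return (start, word) pairs are assumptions of this sketch rather than the layout used in the chapter.

```python
from collections import deque


def build_automaton(keywords):
    goto = [{}]                             # goto[state][char] -> next state (a trie)
    output = [set()]                        # keywords recognized on reaching a state
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                output.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].add(word)
    fail = [0] * len(goto)
    queue = deque(goto[0].values())         # depth-1 states fail to the root
    while queue:                            # breadth-first over the trie
        r = queue.popleft()
        for ch, s in goto[r].items():
            queue.append(s)
            state = fail[r]
            while state != 0 and ch not in goto[state]:
                state = fail[state]
            fail[s] = goto[state].get(ch, 0)
            output[s] |= output[fail[s]]    # inherit keywords that end here as suffixes
    return goto, fail, output


def aho_corasick(keywords, text):
    """Return sorted (start, keyword) pairs for every occurrence in text."""
    goto, fail, output = build_automaton(keywords)
    matches = []
    state = 0
    for i, ch in enumerate(text):
        while state != 0 and ch not in goto[state]:
            state = fail[state]             # follow failure links on a mismatch
        state = goto[state].get(ch, 0)
        for word in output[state]:
            matches.append((i - len(word) + 1, word))
    return sorted(matches)
```

Scanning the text outinputting with the nine keywords inner, input, in, outer, output, out, put, outing, and tint reports out at 0, in and input at 3, put at 5, and in at 9.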

15
Matching Sets of Words (continued)
Figure 13-1 (a) A trie for the string inner, (b)
for the strings inner and input, and (c) for the
set keywords = {inner, input, in, outer, output,
out, put, outing, tint}
16
Matching Sets of Words (continued)
Figure 13-1 (d) the trie (c) with failure links;
(e) scanning the trie (d) for the text T =
outinputting (continued)
17
Regular Expression Matching

All letters of the alphabet are regular
expressions
If r and s are regular expressions, then r|s,
(r), r*, and rs are regular expressions.
Regular expression r|s represents regular
expression r or s
Regular expression r* (where the star is called
a Kleene closure) represents any finite sequence
of r's: r, rr, rrr, . . . .

18
Regular Expression Matching (continued)

Regular expression rs represents the
concatenation of r and s
(r) represents regular expression r
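
As a quick illustration, the four constructions can be tried with Python's standard `re` module, whose syntax is a superset of these textbook operators. The helper `matches` and the sample expressions are assumptions made for this sketch. Note that in `re`, as in the usual definition, the closure also accepts the empty string.

```python
import re

union = re.compile(r"a|b")                  # r|s: either r or s
closure = re.compile(r"a*")                 # r*: Kleene closure of r
concatenation = re.compile(r"ab")           # rs: r followed by s
combined = re.compile(r"(a|b)*c")           # parentheses group a subexpression


def matches(regex, string):
    """True when the whole string belongs to the language of regex."""
    return regex.fullmatch(string) is not None
```

`fullmatch` is used so that the entire string, not just a prefix, must be described by the expression.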

19
Regular Expression Matching (continued)
Figure 13-2 (a) An automaton representing one
letter c; an automaton representing a
regular expression (b) r|s
20
Regular Expression Matching (continued)
Figure 13-2 (c) rs, (d) r* (continued)
21
Regular Expression Matching (continued)
Figure 13-3 The Thompson automaton for the
regular expression a(bcd
)ef
22
Suffix Tries and Trees

A suffix trie for a text T is a tree structure in
which each edge is labeled with one letter of T
and each suffix of T is represented in the trie
as a concatenation of edge labels from the root
to some node of the trie
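
A naive Python sketch of a suffix trie built from nested dictionaries, where each key is one edge label. It inserts every suffix separately and so takes quadratic time and space; it is not the Ukkonen construction shown in the figures. The function names are illustrative.

```python
def build_suffix_trie(text: str) -> dict:
    """Insert every suffix of text into a trie of nested dictionaries."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:             # one edge per letter of the suffix
            node = node.setdefault(ch, {})
    return root


def contains_substring(trie: dict, pattern: str) -> bool:
    """A pattern occurs in the text exactly when it labels a path from the root."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True
```

Since every substring is a prefix of some suffix, a substring query costs time proportional to the length of the pattern only.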

23
Suffix Tries and Trees (continued)
Figure 13-4 (a) A suffix trie for the string
caracas
24
Suffix Tries and Trees (continued)
Figure 13-4 (b) a suffix tree for the substring
caraca and (c) for the
string caracas (continued)
25
Suffix Tries and Trees (continued)
Figure 13-5 Creating an Ukkonen suffix trie for
the string pepper
26
Suffix Tries and Trees (continued)
Figure 13-5 Creating an Ukkonen suffix trie for
the string pepper (continued)
27
Suffix Tries and Trees (continued)
Figure 13-5 Creating an Ukkonen suffix trie for
the string pepper (continued)
28
Suffix Tries and Trees (continued)
Figure 13-6 Creating an Ukkonen suffix tree for
the string pepper
29
Suffix Tries and Trees (continued)
Figure 13-6 Creating an Ukkonen suffix tree for
the string pepper (continued)
30
Suffix Arrays

If suffix trees require too much space, a simple
alternative is a suffix array (Manber and Myers,
1993)
The suffix array pos is the array of positions 0
through |T| - 1 of suffixes taken in
lexicographic order
The suffix array can be created from an existing
suffix tree on which an ordered depth-first
traversal is performed
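
A small Python sketch of a suffix array. For brevity it sorts the suffixes directly, which is a different (and slower) construction than the suffix-tree traversal mentioned above, and it then locates a pattern with two binary searches. The function names are illustrative.

```python
def build_suffix_array(text: str) -> list:
    """Positions 0 .. |T| - 1 ordered by the suffixes that start there."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, pos: list, pattern: str) -> list:
    """Return the sorted start positions of pattern, using binary search on pos."""
    m = len(pattern)
    lo, hi = 0, len(pos)
    while lo < hi:                          # first suffix whose prefix is >= pattern
        mid = (lo + hi) // 2
        if text[pos[mid]:pos[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(pos)
    while lo < hi:                          # first suffix whose prefix is > pattern
        mid = (lo + hi) // 2
        if text[pos[mid]:pos[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(pos[first:lo])
```

All suffixes that begin with the pattern are adjacent in the array, which is why two boundary searches are enough.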

31
Approximate String Matching

A popular measure of the similarity of two
strings is the number of elementary edit
operations that are needed to transform one
string into another
The difference between two strings is sought in
terms of insertion (I), deletion (D), and
substitution (S)
The difference can be represented as a trace, an
alignment (matching), or a listing (derivation)

32
String Similarity

The string similarity problem can be approached
by reducing the problem of finding the minimum
distance for a particular i and j to the problem
of finding the minimum distance for values not
larger than i and j
There are four possibilities: deletion,
insertion, substitution, and match
The Wagner-Fischer algorithm (1974) solves the
string similarity problem with dynamic programming
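
The reduction described above can be sketched as the usual dynamic-programming table, where dist[i][j] is the minimum number of edit operations between the first i characters of one string and the first j characters of the other. Unit costs for all three operations and the name `edit_distance` are assumptions of this sketch.

```python
def edit_distance(source: str, target: str) -> int:
    """Minimum number of insertions, deletions, and substitutions (unit costs)."""
    m, n = len(source), len(target)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i                      # delete all i characters of the prefix
    for j in range(n + 1):
        dist[0][j] = j                      # insert all j characters of the prefix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # match (cost 0) or substitution (cost 1)
            )
    return dist[m][n]
```

Each cell depends only on three neighbors with indexes not larger than i and j, which is exactly the reduction stated on the slide.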

33
String Matching with k Errors

To determine all substrings of text T for which
the Levenshtein distance does not exceed k,
perform string matching with k errors or k
differences
All the possibilities for matching P(0...j) with
a substring of T that ends at position i with
e <= k errors can be summarized by four cases:
match, substitution, insertion, and deletion
For example, in the match case T[i] = P[j] and
there is a match with e errors between
P(0...j - 1) and a substring ending at T[i - 1]
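
A hedged sketch of matching with k differences, using the standard variant of the edit-distance table in which a match may begin at any text position (so the entry for the empty pattern prefix is always 0). Only one column is kept at a time. The name `search_with_k_errors` and the choice to report end positions are assumptions of this sketch.

```python
def search_with_k_errors(pattern: str, text: str, k: int) -> list:
    """Return every text index i at which some substring ending at i
    is within Levenshtein distance k of pattern."""
    m = len(pattern)
    col = list(range(m + 1))                # column for the empty text prefix
    ends = []
    for i, ch in enumerate(text):
        prev_diag = col[0]                  # value one row up in the previous column
        col[0] = 0                          # a match may start at any text position
        for j in range(1, m + 1):
            cost = 0 if pattern[j - 1] == ch else 1
            new = min(
                prev_diag + cost,           # match or substitution
                col[j] + 1,                 # extra text character
                col[j - 1] + 1,             # missing pattern character
            )
            prev_diag = col[j]
            col[j] = new
        if col[m] <= k:
            ends.append(i)
    return ends
```

With k = 0 the function degenerates to exact matching and reports the last index of each occurrence.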

34
Case Study: Longest Common Substring
Figure 13-7 (a-h) Creating an Ukkonen suffix tree
for the string abaabaac
35
Case Study: Longest Common Substring (continued)
Figure 13-7 (a-h) Creating an Ukkonen suffix tree
for the string abaabaac
(continued)
36
Case Study: Longest Common Substring (continued)
Figure 13-7 (i) a data structure used for
implementation of the
Ukkonen tree (h) (continued)
37
Case Study: Longest Common Substring (continued)
Figure 13-8 Listing of the program to find
longest common substring
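
The program listing of Figure 13-8 did not survive in this transcript. As a stand-in, here is a short Python sketch that solves the same problem with a plain dynamic-programming table. This is a different and slower technique than the Ukkonen suffix tree used in the case study, and the function name is illustrative.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return one longest string that occurs contiguously in both a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)               # prev[j]: common suffix length for a[:i-1], b[:j]
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1   # extend the common suffix by one character
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]
```

The table approach takes time proportional to |a| times |b|, whereas a suffix-tree solution can run in time linear in the total length of the two strings.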
50
Summary

Stringology's major area of interest is pattern
matching
Exact string matching consists of finding an
exact copy of pattern P in text T
The bruteForceStringMatching algorithm is an
example of exact string matching
The Knuth-Morris-Pratt algorithm can be obtained
from bruteForceStringMatching()
The Boyer-Moore algorithm tries to match P with T
by comparing them from right to left, not from
left to right

51
Summary (continued)

Daniel Sunday (1990) observed that in the case of
a mismatch with a text character T[i], the pattern
shifts to the right by at least one position;
thus, the character T[i + |P|] is included.
Modifying the Boyer-Moore algorithm allows for
multiple searches
A shift-and algorithm that uses a bit-oriented
approach for string matching was proposed by
Baeza-Yates and Gonnet (1992)
To considerably improve run time by considering
all relevant words at the same time during the
match process, Aho and Corasick (1975)
constructed a string-matching automaton

52
Summary (continued)

All letters of the alphabet are regular
expressions
A suffix trie for a text T is a tree structure in
which each edge is labeled with one letter of T
and each suffix of T is represented in the trie
as a concatenation of edge labels from the root
to some node of the trie
If suffix trees require too much space, a simple
alternative is a suffix array (Manber and Myers,
1993)

53
Summary (continued)

A popular measure of the similarity of two
strings is the number of elementary edit
operations that are needed to transform one
string into another
The difference between two strings is sought in
terms of insertion (I), deletion (D), and
substitution (S)
The string similarity problem can be approached
by reducing the problem of finding the minimum
distance for a particular i and j to the problem
of finding the minimum distance for values not
larger than i and j

54
Summary (continued)

The Wagner-Fischer algorithm (1974) solves the
string similarity problem with dynamic programming
To determine all substrings of text T for which
the Levenshtein distance does not exceed k,
perform string matching with k errors or k
differences
