Title: Evolutionary Computing
1 CS4413 Matching Algorithms (These materials
are used in the classroom only)
2 Two important concepts
- Finite automata.
- Character strings.
3 String Matching with Finite Automata
- Finite Automata
- M = (Q, q0, A, Σ, δ)
- Q: a finite set of states
- q0 ∈ Q: the initial state
- A ⊆ Q: the set of accepting states
- Σ: the input alphabet
- δ: the transition function from Q × Σ to Q
- M accepts (or rejects) an input string
- M induces a final-state function φ from Σ* to Q
- M scans the string w and ends in state φ(w); M accepts w if and only if φ(w) ∈ A
4 Simple Automata
A simple two-state finite automaton with state set Q = {0, 1}, start state q0 = 0, and input alphabet Σ = {a, b}.
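As a minimal sketch of how such an automaton processes its input, the Python below simulates a two-state machine over the alphabet {a, b}. The transition table `DELTA` and the accepting set `ACCEPTING` are assumptions chosen for illustration, since the slide text does not give them; with these choices the machine accepts exactly the strings that end in an odd number of a's.

```python
# Hypothetical automaton: the slide gives Q, q0 and the alphabet,
# but not the transition function or the accepting states, so both are assumed.
DELTA = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 0, (1, "b"): 0,
}
START = 0
ACCEPTING = {1}


def final_state(w, delta=DELTA, start=START):
    """Return the state reached after scanning every character of w."""
    state = start
    for ch in w:
        state = delta[(state, ch)]
    return state


def accepts(w):
    """The automaton accepts w exactly when it ends in an accepting state."""
    return final_state(w) in ACCEPTING


print(accepts("abaaa"))  # True: ends in three a's
print(accepts("abbaa"))  # False: ends in two a's
```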
5 Automata Example
6 String Matching
- PROBLEM: Find the occurrences of a given substring, called the pattern, in another string, called the text.
- Applications
- Text processing of character strings.
- Matching a string of bytes containing graphical data or machine code.
- Virus checking: scanning files for computer viruses.
- Searching for particular patterns in DNA sequences.
7 Notation
- T: the text in which we search for a pattern.
- n: the length of the text T.
- P: the pattern being searched for.
- m: the length of the pattern P.
- P[i], T[i]: the i-th characters of P and T, respectively.
8 String Matching
- We formalize the string-matching problem as follows.
- We assume that the text is an array T[1..n] of length n and that the pattern is an array P[1..m] of length m.
- We further assume that the elements of P and T are characters drawn from a finite alphabet Σ. For example, we may have Σ = {0, 1} or Σ = {a, b, ..., z}. The character arrays P and T are often called strings of characters.
9 String Matching
- We say that pattern P occurs with shift s in text T (or, equivalently, that pattern P occurs beginning at position s+1 in text T) if 0 ≤ s ≤ n-m and T[s+1..s+m] = P[1..m] (that is, if T[s+j] = P[j] for 1 ≤ j ≤ m).
- If P occurs with shift s in T, then we call s a valid shift; otherwise, we call s an invalid shift. The string-matching problem is the problem of finding all valid shifts with which a given pattern P occurs in a given text T.
10 String Matching
- Finite Automata
- A finite automaton M is a 5-tuple (Q, q0, A, Σ, δ), where
- Q is a finite set of states
- q0 ∈ Q is the start state
- A ⊆ Q is a distinguished set of accepting states
- Σ is a finite input alphabet
- δ is a function from Q × Σ into Q, called the transition function of M.
11 Simple String Matching
- INPUT: P of length m and T of length n.
- PRECONDITION: P is nonempty.
- OUTPUT: The index in T where a copy of P begins, or -1 if no match for P is found.
12 Naïve String-Matching Algorithm
- The naïve algorithm finds all valid shifts using a loop that checks the condition P[1..m] = T[s+1..s+m] for each of the n-m+1 possible values of s.
13 Naïve String-Matching Algorithm
- Naïve-String-Matcher(T, P)
- n ← length[T]
- m ← length[P]
- for j ← 0 to n-m
-     compare T[j+1], T[j+2], ..., T[j+m] to P[1], P[2], ..., P[m]
-     if all m characters match
-         return j // or print "pattern occurs with shift j" and continue
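The pseudocode above can be sketched in Python as follows. This is an illustrative version, not a definitive implementation: it uses Python's 0-based indexing, so a shift s means the pattern starts at `text[s]`, and it collects every valid shift rather than stopping at the first one. The function name is an assumption.

```python
def naive_string_matcher(text, pattern):
    """Return every valid shift s (0-based) at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):           # the n - m + 1 candidate shifts
        if text[s:s + m] == pattern:     # compares up to m characters
            shifts.append(s)
    return shifts


print(naive_string_matcher("abababa", "aba"))  # [0, 2, 4]
```

To get the "first index or -1" output described for simple string matching, take the first element of the returned list, or -1 when the list is empty.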
14 Examples
- Example: How many comparisons (both successful and unsuccessful) will be made by the brute-force string-matching algorithm in searching for each of the following patterns in the binary text of 1000 zeros?
- 00001
- 10000
- 01010
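One way to check a hand count for this example is to instrument the brute-force search with a counter. The sketch below assumes the pattern is compared left to right, that each alignment stops at its first mismatch, and that the search stops at the first full match; the function name is hypothetical.

```python
def count_comparisons(text, pattern):
    """Brute-force search; return (index of first match or -1, comparisons made)."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for s in range(n - m + 1):
        k = 0
        while k < m:
            comparisons += 1             # one character comparison
            if text[s + k] != pattern[k]:
                break                    # mismatch: move to the next shift
            k += 1
        if k == m:
            return s, comparisons
    return -1, comparisons


text = "0" * 1000
for pattern in ("00001", "10000", "01010"):
    print(pattern, count_comparisons(text, pattern))
```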
15 Worst Case
- The worst case happens when, at every shift, the first m-1 characters match and the last one does not.
- P: a a a b
- T: a a a a a a a a a a a a a a a a a a a a a a b
- Θ((n-m+1) m) comparisons in the worst case.
16 Analysis
- The worst case is not one that occurs often in natural-language text.
- Empirical studies show that the algorithm makes only 1.1 comparisons for each character in T (up to the point where a match is found).
17 Analysis
- The naïve string matcher is inefficient because information gained about the text for one value of s is totally ignored in considering other values of s.
- Such information can be very valuable, however.
- For example, if P = aaab and we find that s = 0 is valid, then none of the shifts 1, 2, or 3 are valid, since T[4] = b.
18 Input Enhancement in String Matching
- The Knuth-Morris-Pratt algorithm compares characters left to right.
- The Boyer-Moore algorithm compares characters right to left; a simplified version of it is Horspool's algorithm.
19 Horspool's Algorithm
- Example
- s0 ...           c ... sn-1
-        B A R B E R
- Case 1: If there are no c's in the pattern (e.g., c is the letter S in our example), we can shift the pattern by its entire length.
- s0 ...           S ... sn-1
-        B A R B E R
-                    B A R B E R
20 Horspool's Algorithm (contd.)
- Case 2: If there are occurrences of character c in the pattern but it is not the last one there (e.g., c is the letter B in our example), the shift should align the rightmost occurrence of c in the pattern with the c in the text.
- s0 ...           B ... sn-1
-        B A R B E R
-            B A R B E R
21 Horspool's Algorithm (contd.)
- Case 3: If c happens to be the last character in the pattern but there are no c's among its other m-1 characters, the shift should be the entire pattern length m.
- s0 ...       M E R ... sn-1
-        L E A D E R
-                    L E A D E R
22 Horspool's Algorithm (contd.)
- Case 4: Finally, if c happens to be the last character in the pattern and there are other c's among its first m-1 characters, the shift should be such that the rightmost occurrence of c among the first m-1 characters is aligned with the text's c.
- s0 ...           O R ... sn-1
-        R E O R D E R
-              R E O R D E R
23 Horspool's Algorithm (contd.)
- Compute the shift values as follows:
- t(c) = the pattern's length m, if c is not among the first m-1 characters of the pattern
- t(c) = the distance from the rightmost c among the first m-1 characters of the pattern to its last character, otherwise
- ALGORITHM ShiftTable(P[0..m-1])
- // Fills the shift table used by Horspool's and Boyer-Moore algorithms
- // Input: Pattern P[0..m-1] and an alphabet of possible characters
- // Output: Table[0..size-1] indexed by the alphabet's characters and
- //         filled with shift sizes computed by formula (7.1)
- initialize all the elements of Table with m
- for j ← 0 to m-2 do Table[P[j]] ← m-1-j
- return Table
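The ShiftTable pseudocode translates almost line for line into Python. The sketch below is one possible rendering: it stores the table in a dictionary keyed by character, and the function name and the explicit `alphabet` argument are assumptions made for illustration.

```python
def shift_table(pattern, alphabet):
    """Build Horspool's shift table for pattern over the given alphabet."""
    m = len(pattern)
    table = {ch: m for ch in alphabet}    # default shift: the whole pattern length
    for j in range(m - 1):                # j = 0 .. m-2, so the last character is skipped
        table[pattern[j]] = m - 1 - j     # later j overwrites earlier: rightmost occurrence wins
    return table


table = shift_table("BARBER", "ABCDEFGHIJKLMNOPQRSTUVWXYZ_")
print(table["A"], table["B"], table["E"], table["R"], table["S"])  # 4 2 1 3 6
```

Scanning left to right and overwriting is what makes each entry reflect the rightmost occurrence among the first m-1 characters.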
24 Horspool's Algorithm (contd.)
- Horspool's algorithm
- Step 1: For a given pattern of length m and the alphabet used in both the pattern and text, construct the shift table as described above.
- Step 2: Align the pattern against the beginning of the text.
- Step 3: Repeat the following until either a matching substring is found or the pattern reaches beyond the last character of the text. Starting from the last character in the pattern, compare the corresponding characters in the pattern and text until either all m characters are matched or a mismatching pair is encountered. On a mismatch, shift the pattern right by t(c), where c is the text character currently aligned with the last character of the pattern.
25 Horspool's Algorithm (contd.)
- ALGORITHM HorspoolMatching(P[0..m-1], T[0..n-1])
- // Implements Horspool's algorithm for string matching
- // Input: Pattern P[0..m-1] and text T[0..n-1]
- // Output: The index of the left end of the first matching
- //         substring or -1 if there are no matches
- ShiftTable(P[0..m-1]) // generate Table of shifts
- i ← m-1 // position of the pattern's right end
- while i ≤ n-1 do
-     k ← 0 // number of matched characters
-     while k ≤ m-1 and P[m-1-k] = T[i-k] do
-         k ← k+1
-     if k = m
-         return i-m+1
-     else i ← i + Table[T[i]]
- return -1
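A minimal Python sketch of HorspoolMatching follows. It keeps the pseudocode's variable names, but as a simplifying assumption it builds the shift table as a dictionary over the pattern's own characters and treats any character missing from it as having shift m, so no explicit alphabet is needed. The sample text is illustrative only.

```python
def horspool_matching(pattern, text):
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    # Shift table: rightmost occurrence among the first m-1 characters wins.
    table = {pattern[j]: m - 1 - j for j in range(m - 1)}
    i = m - 1                                 # text index under the pattern's right end
    while i <= n - 1:
        k = 0                                 # number of matched characters
        while k <= m - 1 and pattern[m - 1 - k] == text[i - k]:
            k += 1
        if k == m:
            return i - m + 1                  # left end of the match
        i += table.get(text[i], m)            # characters not in the table shift by m
    return -1


print(horspool_matching("BARBER", "JIM_SAW_ME_IN_A_BARBERSHOP"))  # 16
```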
26 Horspool's Algorithm
- Exercise
- Apply Horspool's algorithm to search for the pattern BAOBAB in the text
- BESS_KNEW_ABOUT_BAOBABS
27 Horspool's Algorithm
- Exercise: Consider the problem of searching for genes in DNA sequences using Horspool's algorithm. A DNA sequence is represented by a text on the alphabet {A, C, G, T}, and the gene or gene segment is the pattern.
- (a) Construct the shift table for the following gene segment of your chromosome 10: TCCTATTCTT
- (b) Apply Horspool's algorithm to locate the pattern in the following DNA sequence:
- TTATAGATCTCGTATTCTTTTATAGATCTCCTATTCTT
28 Horspool's Algorithm
- Exercise
- How many character comparisons will be made by Horspool's algorithm in searching for the following patterns in the binary text of 1000 zeros?
- 00001
- 10000
- 01010
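A hand count for this exercise can be checked by adding a comparison counter to Horspool's search. The sketch below is an instrumented variant written for that purpose only; the function name is hypothetical, and it assumes that each right-to-left scan stops at its first mismatch and that characters absent from the shift table shift by m.

```python
def horspool_comparisons(pattern, text):
    """Run Horspool's search; return (index of first match or -1, comparisons made)."""
    m, n = len(pattern), len(text)
    table = {pattern[j]: m - 1 - j for j in range(m - 1)}
    comparisons = 0
    i = m - 1                                 # text index under the pattern's right end
    while i <= n - 1:
        k = 0
        while k < m:
            comparisons += 1                  # one character comparison
            if pattern[m - 1 - k] != text[i - k]:
                break
            k += 1
        if k == m:
            return i - m + 1, comparisons
        i += table.get(text[i], m)
    return -1, comparisons


text = "0" * 1000
for pattern in ("00001", "10000", "01010"):
    print(pattern, horspool_comparisons(pattern, text))
```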
29 Prestructuring
- Hashing and B-trees are examples of prestructuring.
- In general, a hash function needs to satisfy two somewhat conflicting requirements:
- 1) A hash function needs to distribute keys among the cells of the hash table as evenly as possible.
- 2) A hash function has to be easy to compute.
30 Hashing
- Hashing
- Hash Table
- Hash Function
- Hash Address
- Collisions
- Open Hashing (Separate Chaining)
- Closed Hashing (Open Addressing)
- All keys are stored in the hash table itself, which implies that the table size m must be at least as large as the number of keys n.
- (Example: linear probing checks the cell following the one where the collision occurs.)
31 Hashing
Letter codes: A = 1, B = 2, C = 3, D = 4, ..., Z = 26
Hash function: h(K) = (sum of the codes of K's letters) mod 13
Keys:           A  FOOL  AND  HIS  MONEY  ARE  SOON  PARTED
Hash addresses: 1  9     6    10   7      11   11    12
Cells:          0  1  2  3  4  5  6  7  8  9  10  11  12
32 Open Hashing
Keys:           A  FOOL  AND  HIS  MONEY  ARE  SOON  PARTED
Hash addresses: 1  9     6    10   7      11   11    12
Cells: 0  1  2  3  4  5  6    7      8  9     10   11    12
Lists:    A              AND  MONEY     FOOL  HIS  ARE   PARTED
                                                   SOON
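The table above can be reproduced with a short separate-chaining sketch. The hash function here sums the letter codes A = 1, ..., Z = 26 and reduces the sum mod 13, which yields the hash addresses listed on the slide; the function names and the use of Python lists for the chains are illustrative assumptions.

```python
def letter_hash(key, m=13):
    """Sum the alphabet positions of the letters (A=1 ... Z=26), mod m."""
    return sum(ord(ch) - ord("A") + 1 for ch in key) % m


def build_open_table(keys, m=13):
    """Separate chaining: each cell holds a list of the keys hashed to it."""
    table = [[] for _ in range(m)]
    for key in keys:
        table[letter_hash(key, m)].append(key)
    return table


def search_open(table, key):
    """Look for key only in the chain of the cell it hashes to."""
    return key in table[letter_hash(key, len(table))]


keys = "A FOOL AND HIS MONEY ARE SOON PARTED".split()
table = build_open_table(keys)
print(table[11])                   # ['ARE', 'SOON']
print(search_open(table, "KID"))   # False: KID hashes to cell 11 but is not in its list
```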
33 Closed Hashing
Keys:           A  FOOL  AND  HIS  MONEY  ARE  SOON  PARTED
Hash addresses: 1  9     6    10   7      11   11    12
Table contents after each insertion (linear probing):
Cells: 0       1  2  3  4  5  6    7      8  9     10   11   12
               A
               A                             FOOL
               A              AND            FOOL
               A              AND            FOOL  HIS
               A              AND  MONEY     FOOL  HIS
               A              AND  MONEY     FOOL  HIS  ARE
               A              AND  MONEY     FOOL  HIS  ARE  SOON
       PARTED  A              AND  MONEY     FOOL  HIS  ARE  SOON
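The sequence of insertions above can be sketched with linear probing in Python. The hash function sums the letter codes A = 1, ..., Z = 26 mod 13, matching the hash addresses on the slide; the function names and the use of `None` for an empty cell are assumptions for illustration.

```python
def letter_hash(key, m=13):
    """Sum the alphabet positions of the letters (A=1 ... Z=26), mod m."""
    return sum(ord(ch) - ord("A") + 1 for ch in key) % m


def insert_closed(table, key):
    """Insert key with linear probing, wrapping around at the end of the table."""
    m = len(table)
    i = letter_hash(key, m)
    for _ in range(m):
        if table[i] is None:
            table[i] = key
            return i
        i = (i + 1) % m                # collision: try the next cell
    raise OverflowError("hash table is full")


def search_closed(table, key):
    """Return the cell holding key, or -1 once an empty cell is reached."""
    m = len(table)
    i = letter_hash(key, m)
    for _ in range(m):
        if table[i] is None:
            return -1
        if table[i] == key:
            return i
        i = (i + 1) % m
    return -1


table = [None] * 13
for key in "A FOOL AND HIS MONEY ARE SOON PARTED".split():
    insert_closed(table, key)
print(table[12], table[0])             # SOON PARTED
```

SOON collides with ARE in cell 11 and moves to cell 12; PARTED then finds cell 12 taken and wraps around to cell 0.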
34 Hashing
- If the hash function distributes n keys among m cells of the hash table evenly, each list will be about n/m keys long.
- Load factor: α = n/m
- Efficiency of hashing (Open Hashing)
- Efficiency of hashing (Closed Hashing)
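The formulas for these two bullets are not included in the text. For reference, the standard textbook estimates are sketched below, assuming a hash function that spreads keys uniformly and a load factor α = n/m. Here S and U denote the average number of list nodes or table cells examined in a successful and an unsuccessful search, respectively; the closed-hashing figures are the classical approximations for linear probing.

```latex
% Open hashing (separate chaining)
S \approx 1 + \frac{\alpha}{2}, \qquad U = \alpha

% Closed hashing with linear probing
S \approx \frac{1}{2}\left(1 + \frac{1}{1-\alpha}\right), \qquad
U \approx \frac{1}{2}\left(1 + \frac{1}{(1-\alpha)^{2}}\right)
```

For example, at α = 0.5 the linear-probing estimates give S ≈ 1.5 and U ≈ 2.5, and both grow quickly as α approaches 1.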
35 Hashing
- Exercise: For the input 30, 20, 56, 75, 31, 19 and hash function h(K) = K mod 11
- (a) Construct the open hash table.
- (b) Find the largest number of key comparisons in a successful search in this table.
- (c) Find the average number of key comparisons in a successful search in this table.
36 Hashing
- Exercise: For the input 30, 20, 56, 75, 31, 19 and hash function h(K) = K mod 11
- (a) Construct the closed hash table.
- (b) Find the largest number of key comparisons in a successful search in this table.
- (c) Find the average number of key comparisons in a successful search in this table.
37 END