Title: Fundamental Data Structures and Algorithms
1- 15-211
- Fundamental Data Structures and Algorithms
String Matching
March 28, 2006 Ananda Gunawardena
2In this lecture
- String Matching Problem
- Concept
- Regular expressions
- brute force algorithm
- complexity
- Finite State Machines
- Knuth-Morris-Pratt(KMP) Algorithm
- Pre-processing
- complexity
3Pattern Matching Algorithms
4The Problem
- Given a text T and a pattern P, check whether P
occurs in T
- e.g. T = aabbcbbcabbbcbccccabbabbccc
- Find all occurrences of pattern P = bbc
- There are variations of pattern matching
- Finding approximate matchings
- Finding multiple patterns etc..
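The "find all occurrences" task can be sketched with the standard `String.indexOf(String, int)` method. This is only an illustration, not the algorithm developed later in the lecture; the class and method names (`AllOccurrences`, `findAll`) are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

class AllOccurrences {
    // Returns every index at which p starts in t (overlapping matches included).
    static List<Integer> findAll(String t, String p) {
        List<Integer> hits = new ArrayList<>();
        if (p.isEmpty()) return hits;          // assumption: an empty pattern yields no hits
        int from = t.indexOf(p);
        while (from >= 0) {
            hits.add(from);
            from = t.indexOf(p, from + 1);     // resume one position after the last hit
        }
        return hits;
    }

    public static void main(String[] args) {
        // Text and pattern from the slide; prints [2, 5, 10, 22]
        System.out.println(findAll("aabbcbbcabbbcbccccabbabbccc", "bbc"));
    }
}
```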
5Why String Matching?
- Applications in Computational Biology
- DNA sequence is a long word (or text) over a
4-letter alphabet
- GTTTGAGTGGTCAGTCTTTTCGTTTCGACGGAGCCCCCAATTAATAAACTCATAAGCAGACCTCAGTTCGCTTAGAGCAGCCGAAA..
- Find a Specific pattern W
- Finding patterns in documents formed using a
large alphabet
- Word processing
- Web searching
- Desktop search (Google, MSN)
- Matching strings of bytes containing
- Graphical data
- Machine code
- grep in unix
- grep searches for lines matching a pattern.
6String Matching
- Text string T[0..N-1]: T = abacaabaccabacabaabb
- Pattern string P[0..M-1]: P = abacab
- Where is the first instance of P in T? T[10..15] = P[0..5]
- Typically N >>> M
7Java Pattern Matching Utilities
- Java provides an API for pattern matching with
regular expressions
- java.util.regex
- Regular expressions describe a set of strings
based on some common characteristics shared by
each string in the set, e.g. a* describes the
empty string, a, aa, aaa, ...
- Regular expressions can be used as a tool to
search, edit or manipulate text or data
- perl, java, C
8Java Pattern Matching Utilities
- java.util.regex
- Pattern
- Is a compiled representation of a regular
expression.
- Eg Pattern p = Pattern.compile("ab");
- Matcher
- A machine that performs match operations on a
character sequence by interpreting a pattern.
- Eg Matcher m = p.matcher("aabbb");
- Example
- import java.util.regex.*;
- public static void main(String[] args) {
-   Pattern p = Pattern.compile("(aabb)");
-   Matcher m = p.matcher("aabbb");
-   boolean b = m.matches(); // match the entire input sequence against the pattern
-   // or boolean b = m.find(); // scan the input for the next subsequence that matches the pattern
-   System.out.println("The value is " + b);
- }
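The difference between `Matcher.matches()` and `Matcher.find()` can be seen in a small sketch. The helper names `wholeMatch` and `firstMatch` are hypothetical, and the expression `a*b*` is an extra illustration that does not come from the slides.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchesVersusFind {
    // True only if the whole input is matched by the regular expression.
    static boolean wholeMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // Start index of the first substring matched by the expression, or -1.
    static int firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        System.out.println(wholeMatch("ab", "aabbb"));    // false: "ab" is not the whole input
        System.out.println(firstMatch("ab", "aabbb"));    // 1: "ab" occurs starting at index 1
        System.out.println(wholeMatch("a*b*", "aabbb"));  // true: some a's followed by some b's
    }
}
```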
9String Matching
- abacaabaccabacabaabb
- abacab
-  abacab
-   abacab
-    abacab
-     abacab
-      abacab
-       abacab
-        abacab
-         abacab
-          abacab
-           abacab
- The brute force algorithm
- 22 + 6 = 28 comparisons.
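The comparison count can be checked with a short sketch that counts one comparison per pair of characters examined and stops at the first match. The class and method names are assumptions made for this illustration.

```java
class BruteForceCount {
    // Number of character comparisons made by brute-force search,
    // up to and including the first match (or until the text is exhausted).
    static int comparisons(String t, String p) {
        int count = 0;
        for (int i = 0; i + p.length() <= t.length(); i++) {
            int j = 0;
            while (j < p.length()) {
                count++;                              // compare T[i+j] with P[j]
                if (t.charAt(i + j) != p.charAt(j)) break;
                j++;
            }
            if (j == p.length()) return count;        // first match starts at i
        }
        return count;
    }

    public static void main(String[] args) {
        // Text and pattern from the slide; prints 28
        System.out.println(comparisons("abacaabaccabacabaabb", "abacab"));
    }
}
```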
10Naïve Algorithm(or Brute Force)
- Assume |T| = n and |P| = m
- Compare until a match is found. If so return the
index where match occurs
- else return -1
[Figure: text T with pattern P shown at three successive positions beneath it]
11Brute Force Version 1
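- static int match(char[] T, char[] P) {
-   for (int i = 0; i + P.length <= T.length; i++) {
-     boolean flag = true;
-     for (int j = 0; j < P.length; j++)
-       if (T[i + j] != P[j]) { flag = false; break; }
-     if (flag) return i;  // P occurs at T[i..i+m-1]
-   }
-   return -1;  // no match
- }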
- What is the complexity of the code?
12Brute Force, Version 2
- static int match(char[] T, char[] P) {
-   int n = T.length;
-   int m = P.length;
-   int i = 0;
-   int j = 0;
-   // rewrite the brute-force code with only one loop
-   do {
-     // Homework
-   } while (j < m && i < n);
- }
13A bad case
- 00000000000000001
- 0000-
-  0000-
-   0000-
-    0000-
-     0000-
-      0000-
-       0000-
-        0000-
-         0000-
-          0000-
-           0000-
-            0000-
-             00001
- 60 + 5 = 65 comparisons are needed
- How many of them could be avoided?
15Typical text matching
- This is a sample sentence
- -
-  -
-   -
-    s-
-     -
-      -
-       s-
-        -
-         -
-          -
-           s-
-            -
-             -
-              -
-               -
-                -
-                 -
-                  sente
- 20 + 5 = 25 comparisons are needed
- (The match is near the same point in the target
string as the previous example.)
- In practice, 0 ≤ j ≤ 2
16String Matching
- Brute force worst case
- O(MN)
- Expensive for long patterns in repetitive text
- How to improve on this?
- Intuition
- Remember what is learned from previous matches
17Finite State Machines
18Finite State Machines (FSM)
- FSM is a computing machine that takes
- A string as an input
- Outputs YES/NO answer
- That is, the machine accepts or rejects the
string
[Figure: Input String -> FSM -> Yes / No]
19FSM Model
- Input to a FSM
- Strings built from a fixed alphabet {a,b,c}
- Possible inputs: aa, aabbcc, a, etc.
- The Machine
- A directed graph
- Nodes = States of the machine
- Edges = Transitions from one state to another
20FSM Model
- Special States
- Start (q0) and Final (or Accepting) (q2)
- Assume the alphabet is {a,b}
- Which strings are accepted by this FSM?
21FSM Model
- Exercise: draw a finite automaton that accepts
any string with an even number of 1s
- Exercise: draw a finite automaton that accepts
any string with an even number of consecutive 1s
followed by an odd number of consecutive zeros
22Why Study FSMs
- Useful Algorithm Design Technique
- Lexical Analysis (tokenization)
- Control Systems
- Elevators, Soda Machines.
- Modeling a problem with FSM is
- Simple
- Elegant
23State Transitions
- Let Q be the set of states and Σ be the alphabet.
Then the transition function T is given by
- T: Q × Σ → Q
- Σ could be
- {0,1} binary
- {C,G,T,A} nucleotide base
- {0,1,2,..,9,a,b,c,d,e,f} hexadecimal
- etc..
- Eg Consider Σ = {a,b,c} and P = aabc
- set of states are all prefixes of P
- Q = {ε, a, aa, aab, aabc} or
- Q = {0, 1, 2, 3, 4}
- State transitions T(0,a) = 1; T(1,a) = 2,
etc
- What about failure transitions?
24Failure Transitions
- Where do we go when a failure occurs?
- P = aabc
- initial state 0
- end state 4
- How to store state transition table?
- as a matrix
Q (current state)   Σ (input)   Q (next state)
0                   a           1
0                   b, c        0
1                   a           2
1                   b, c        0
2                   b           3
2                   a           2
2                   c           0
3                   c           4
3                   a           1
3                   b           0
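Storing the transitions "as a matrix" might look like the following sketch, which hard-codes the table for P = aabc and runs the machine over a text. The names `AabcAutomaton`, `DELTA` and `find` are assumptions for illustration, and the text is assumed to contain only the letters a, b and c.

```java
class AabcAutomaton {
    // DELTA[state][symbol] is the next state; columns 0, 1, 2 stand for a, b, c.
    // Rows are states 0..3 of the table; state 4 is the end (accepting) state.
    static final int[][] DELTA = {
        {1, 0, 0},   // state 0: a -> 1, b -> 0, c -> 0
        {2, 0, 0},   // state 1: a -> 2, b -> 0, c -> 0
        {2, 3, 0},   // state 2: a -> 2, b -> 3, c -> 0
        {1, 0, 4},   // state 3: a -> 1, b -> 0, c -> 4
    };

    // Returns the index where the first occurrence of "aabc" starts, or -1.
    static int find(String text) {
        int state = 0;
        for (int i = 0; i < text.length(); i++) {
            state = DELTA[state][text.charAt(i) - 'a'];
            if (state == 4) return i - 3;   // the match ends at i and has length 4
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(find("abaaabcab"));   // prints 3
    }
}
```

Each text character is read exactly once, so the scan takes time proportional to the text length once the table exists.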
25Using FSM concept in Pattern Matching
- Consider the alphabet {a,b,c}
- Suppose we are looking for pattern aabc
- Construct a finite automaton for aabc as follows
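[Figure: automaton for aabc with states 0 (Start) to 4; forward edges 0 -a-> 1 -a-> 2 -b-> 3 -c-> 4, plus failure edges labelled a, b, c and bc]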
26Knuth Morris Pratt (KMP) Algorithm
27KMP The Big Idea
- Retain information from prior attempts.
- Compute in advance how far to jump in P when a
match fails.
- Suppose the match fails at P[j] ≠ T[i+j].
- Then we know P[0..j-1] = T[i..i+j-1].
- We must next try P[0] =? T[i+1].
- But we know T[i+1] = P[1]
- What if we compare P[1] =? P[0]
- If so, increment j by 1. No need to look at T.
- What if P[1] = P[0] and P[2] = P[1]?
- Then increment j by 2. Again, no need to look at
T.
- In general, we can determine how far to jump
without any knowledge of T!
28Implementing KMP
- Never decrement i, ever.
- Comparing T[i] with P[j].
- Compute a table f of how far to jump j forward
when a match fails.
- The next match will compare T[i] with
P[f[j-1]]
- Do this by matching P against itself in all
positions.
29Building the Table for f
- P = 1010011
- Find self-overlaps
Prefix    Overlap   j   f
1         .         1   0
10        .         2   0
101       1         3   1
1010      10        4   2
10100     .         5   0
101001    1         6   1
1010011   1         7   1
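The self-overlap column can be reproduced straight from its definition: for each prefix, find the longest proper prefix that is also a suffix. The sketch below does this by direct search. It is deliberately simple and slow, not the linear-time pre-processing, and the names `SelfOverlap`, `overlap` and `table` are made up for the example.

```java
import java.util.Arrays;

class SelfOverlap {
    // Length of the longest proper prefix of s that is also a suffix of s.
    static int overlap(String s) {
        for (int len = s.length() - 1; len > 0; len--) {
            if (s.startsWith(s.substring(s.length() - len))) return len;
        }
        return 0;
    }

    // f[j-1] holds the overlap of the prefix of p that has length j.
    static int[] table(String p) {
        int[] f = new int[p.length()];
        for (int j = 1; j <= p.length(); j++) {
            f[j - 1] = overlap(p.substring(0, j));
        }
        return f;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(table("1010011")));   // prints [0, 0, 1, 2, 0, 1, 1]
    }
}
```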
30What f means
- f non-zero implies there is a self-match.
- E.g., f = 2 means P[0..1] = P[j-2..j-1]
- Hence must start new comparison at j-2, since we
know T[i-2..i-1] = P[0..1]
- In general
- Set j = f[j-1]
- Do not change i.
- The next match is
- T[i] =? P[f[j-1]]
- If f is zero, there is no self-match.
- Set j = 0
- Do not change i.
- The next match is
- T[i] =? P[0]
Prefix    Overlap   j   f
1         .         1   0
10        .         2   0
101       1         3   1
1010      10        4   2
10100     .         5   0
101001    1         6   1
1010011   1         7   1
31Favorable conditions
- P = 1234567
- Find self-overlaps
Prefix    Overlap   j   f
1         .         1   0
12        .         2   0
123       .         3   0
1234      .         4   0
12345     .         5   0
123456    .         6   0
1234567   .         7   0
32Mixed conditions
- P = 1231234
- Find self-overlaps
Prefix    Overlap   j   f
1         .         1   0
12        .         2   0
123       .         3   0
1231      1         4   1
12312     12        5   2
123123    123       6   3
1231234   .         7   0
33Poor conditions
- P = 1111110
- Find self-overlaps
Prefix    Overlap   j   f
1         .         1   0
11        1         2   1
111       11        3   2
1111      111       4   3
11111     1111      5   4
111111    11111     6   5
1111110   .         7   0
34KMP pre-process Algorithm
- m = |P|
- Define a table F of size m
- F[0] = 0
- i = 1; j = 0
- while (i < m) {
-   compare P[i] and P[j]
-   if (P[j] == P[i]) {
-     F[i] = j+1
-     i++; j++
-   }
-   else if (j > 0) j = F[j-1]   // use previous values of F
-   else { F[i] = 0; i++ }
- }
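The pre-processing pseudocode translates almost line for line into Java. The following is a sketch under that reading of the slide; the class name `FailureTable` and method name `build` are assumptions.

```java
import java.util.Arrays;

class FailureTable {
    // F[i] = length of the longest proper prefix of P[0..i] that is also a suffix of it.
    static int[] build(char[] P) {
        int m = P.length;
        int[] F = new int[m];            // F[0] = 0 by default
        int i = 1, j = 0;
        while (i < m) {
            if (P[j] == P[i]) {          // an overlap of length j extends to j+1
                F[i] = j + 1;
                i++;
                j++;
            } else if (j > 0) {
                j = F[j - 1];            // fall back to a shorter overlap
            } else {
                F[i] = 0;                // no overlap ends at position i
                i++;
            }
        }
        return F;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(build("1010011".toCharArray())));  // [0, 0, 1, 2, 0, 1, 1]
        System.out.println(Arrays.toString(build("1231234".toCharArray())));  // [0, 0, 0, 1, 2, 3, 0]
        System.out.println(Arrays.toString(build("1111110".toCharArray())));  // [0, 1, 2, 3, 4, 5, 0]
    }
}
```

The three printed tables agree with the f columns worked out by hand for the same patterns.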
35KMP Algorithm
- input: Text T and Pattern P
- |T| = n
- |P| = m
- Compute Table F for Pattern P
- i = j = 0
- while (i < n) {
-   if (P[j] == T[i]) {
-     if (j == m-1) return i-m+1
-     i++; j++
-   }
-   else if (j > 0) j = F[j-1]   // use F to determine next value for j
-   else i++
- }
- output: first occurrence of P in T
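A Java sketch of the search loop follows. To keep it short, the table F is passed in rather than computed, and the example uses the table worked out on the slides for P = 1010011. The text `0110101001100` is an invented test input, and returning -1 when there is no match (or 0 for an empty pattern) is an assumption the pseudocode leaves open.

```java
class KmpSearch {
    // Returns the index of the first occurrence of P in T, or -1 if there is none.
    // F is the failure table for P: F[k] is the self-overlap of P[0..k].
    static int match(char[] T, char[] P, int[] F) {
        int n = T.length, m = P.length;
        if (m == 0) return 0;                     // assumption: empty pattern matches at 0
        int i = 0, j = 0;
        while (i < n) {
            if (P[j] == T[i]) {
                if (j == m - 1) return i - m + 1; // whole pattern matched
                i++;
                j++;
            } else if (j > 0) {
                j = F[j - 1];                     // reuse the known overlap; i is unchanged
            } else {
                i++;                              // no overlap to reuse; move on in T
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] F = {0, 0, 1, 2, 0, 1, 1};          // table for P = 1010011
        char[] P = "1010011".toCharArray();
        System.out.println(match("0110101001100".toCharArray(), P, F));   // prints 4
    }
}
```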
36Specializing the matcher
Prefix    Overlap   j   f
1         .         1   0
10        .         2   0
101       1         3   1
1010      10        4   2
10100     .         5   0
101001    1         6   1
1010011   1         7   1
[Figure: the matcher for P = 1010011 specialized into a state machine, with transitions labelled 0 and 1]
37Brute Force KMP
- A worse case example
- Brute force
- 0000000000000000000000000001
- 0000000000000-
-  0000000000000-
-   0000000000000-
-    0000000000000-
-     0000000000000-
-      0000000000000-
-       0000000000000-
-        0000000000000-
-         0000000000000-
-          0000000000000-
-           0000000000000-
-            0000000000000-
-             0000000000000-
-              0000000000000-
-               00000000000001
- 196 + 14 = 210 comparisons
- KMP
- 0000000000000000000000000001
- 0000000000000-
-              0-
-               0-
-                0-
-                 0-
-                  0-
-                   0-
-                    0-
-                     0-
-                      0-
-                       0-
-                        0-
-                         0-
-                          0-
-                           01
- 28 + 14 = 42 comparisons
38Brute Force KMP
- Brute force
- abcdeabcdeabcedfghijkl
- -
-  bc-
-   -
-    -
-     -
-      -
-       bc-
-        -
-         -
-          -
-           -
-            bcedfg
- KMP
- abcdeabcdeabcedfghijkl
- -
-  bc-
-    -
-     -
-      -
-       bc-
-         -
-          -
-           -
-            bcedfg
- 19 comparisons + 5 preparation comparisons
39KMP Performance
- Pre-processing needs O(M) operations.
- At each iteration, one of three cases
- T[i] = P[j]
- i increases
- T[i] ≠ P[j] and j > 0
- i-j increases
- T[i] ≠ P[j] and j = 0
- i increases and i-j increases
- Hence, maximum of 2N iterations.
- Thus worst case performance is O(N+M).
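The 2N bound can be checked on the worst-case example from the earlier slide (27 zeros followed by a 1, searched for 13 zeros followed by a 1). The sketch below counts one comparison per loop iteration. The class and method names are assumptions, and a non-empty pattern is assumed.

```java
import java.util.Arrays;

class KmpComparisonCount {
    // Failure table for P, following the pre-processing pseudocode.
    static int[] table(char[] P) {
        int[] F = new int[P.length];
        int i = 1, j = 0;
        while (i < P.length) {
            if (P[j] == P[i]) F[i++] = ++j;
            else if (j > 0) j = F[j - 1];
            else F[i++] = 0;
        }
        return F;
    }

    // Number of T[i]-versus-P[j] comparisons KMP makes up to the first match.
    // Assumes P is non-empty.
    static int comparisons(char[] T, char[] P) {
        int[] F = table(P);
        int i = 0, j = 0, count = 0;
        while (i < T.length) {
            count++;                                  // one comparison per iteration
            if (P[j] == T[i]) {
                if (j == P.length - 1) return count;
                i++;
                j++;
            } else if (j > 0) {
                j = F[j - 1];
            } else {
                i++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        char[] T = new char[28];
        Arrays.fill(T, '0');
        T[27] = '1';                                  // 27 zeros, then a 1
        char[] P = new char[14];
        Arrays.fill(P, '0');
        P[13] = '1';                                  // 13 zeros, then a 1
        // prints "42 comparisons, 2N = 56"
        System.out.println(comparisons(T, P) + " comparisons, 2N = " + 2 * T.length);
    }
}
```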
40Exercises
- Suppose we are given the pattern P = 10010001 and
- text T = 000100100100010111
- do the following:
- Draw a FSM for pattern P
- Construct the KMP table for P
- Trace the KMP algorithm with T