Fundamental Data Structures and Algorithms - PowerPoint PPT Presentation

About This Presentation
Title:

Fundamental Data Structures and Algorithms

Description:

Title: PowerPoint Presentation Last modified by: Carnegie Mellon University Created Date: 1/1/1601 12:00:00 AM Document presentation format: On-screen Show – PowerPoint PPT presentation

Slides: 41
Provided by: cmu132
Learn more at: http://www.cs.cmu.edu

Transcript and Presenter's Notes


1
  • 15-211
  • Fundamental Data Structures and Algorithms

String Matching
March 28, 2006 Ananda Gunawardena
2
In this lecture
  • String Matching Problem
  • Concept
  • Regular expressions
  • brute force algorithm
  • complexity
  • Finite State Machines
  • Knuth-Morris-Pratt(KMP) Algorithm
  • Pre-processing
  • complexity

3
Pattern Matching Algorithms
4
The Problem
  • Given a text T and a pattern P, check whether P
    occurs in T
  • e.g., T = aabbcbbcabbbcbccccabbabbccc
  • Find all occurrences of pattern P = bbc
  • There are variations of pattern matching
  • Finding approximate matchings
  • Finding multiple patterns, etc.
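
The "find all occurrences" task above can be sketched with the standard String.indexOf method. This is only an illustration: the class name OccurrenceFinder and the method findAll are made up for this example and are not from the lecture.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the original slides.
final class OccurrenceFinder {
    /** Returns the start index of every (possibly overlapping) occurrence of p in t. */
    static List<Integer> findAll(String t, String p) {
        List<Integer> hits = new ArrayList<>();
        if (p.isEmpty()) {
            return hits; // assumption: an empty pattern reports no occurrences
        }
        // indexOf returns -1 once there is no further occurrence.
        for (int i = t.indexOf(p); i >= 0; i = t.indexOf(p, i + 1)) {
            hits.add(i);
        }
        return hits;
    }

    public static void main(String[] args) {
        // Text and pattern from the slide.
        System.out.println(findAll("aabbcbbcabbbcbccccabbabbccc", "bbc")); // prints [2, 5, 10, 22]
    }
}
```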

5
Why String Matching?
  • Applications in Computational Biology
  • DNA sequence is a long word (or text) over a
    4-letter alphabet
  • GTTTGAGTGGTCAGTCTTTTCGTTTCGACGGAGCCCCCAATTAATAAACT
    CATAAGCAGACCTCAGTTCGCTTAGAGCAGCCGAAA..
  • Find a Specific pattern W
  • Finding patterns in documents formed using a
    large alphabet
  • Word processing
  • Web searching
  • Desktop search (Google, MSN)
  • Matching strings of bytes containing
  • Graphical data
  • Machine code
  • grep in unix
  • grep searches for lines matching a pattern.

6
String Matching
  • Text string T[0..N-1]: T = abacaabaccabacabaabb
  • Pattern string P[0..M-1]: P = abacab
  • Where is the first instance of P in T?
    T[10..15] = P[0..5]
  • Typically N >>> M

7
Java Pattern Matching Utilities
  • Java provides an API for pattern matching with
    regular expressions
  • java.util.regex
  • Regular expressions describe a set of strings
    based on some common characteristics shared by
    each string in the set, e.g., a* describes the
    empty string, a, aa, aaa, ...
  • Regular expressions can be used as a tool to
    search, edit or manipulate text or data
  • perl, java, C

8
Java Pattern Matching Utilities
  • java.util.regex
  • Pattern
  • Is a compiled representation of a regular
    expression.
  • E.g., Pattern p = Pattern.compile("ab");
  • Matcher
  • A machine that performs match operations on a
    character sequence by interpreting a pattern.
  • E.g., Matcher m = p.matcher("aabbb");
  • Example
  • public static void main(String[] args) {
  •   Pattern p = Pattern.compile("(aabb)");
  •   Matcher m = p.matcher("aabbb");
  •   boolean b = m.matches(); // match the entire
    input sequence against the pattern
  •   // or: boolean b = m.find(); // look for the
    next subsequence of the input that matches
  •   System.out.println("The value is " + b);
  • }
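
To make the difference between matches() and find() concrete, here is a small sketch using only java.util.regex.Pattern and java.util.regex.Matcher. The class RegexDemo, its two helper methods, and the extra pattern "a*b*" are assumptions introduced for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
final class RegexDemo {
    /** True only if the whole input matches the regular expression. */
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    /** Start index of the first substring that matches the regex, or -1 if there is none. */
    static int firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("ab", "aabbb"));   // prints false
        System.out.println(firstMatch("ab", "aabbb"));     // prints 1
        System.out.println(matchesWhole("a*b*", "aabbb")); // prints true
    }
}
```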

9
String Matching
  • abacaabaccabacabaabb
  • abacab
  • abacab
  • abacab
  • abacab
  • abacab
  • abacab
  • abacab
  • abacab
  • abacab
  • abacab
  • abacab
  • The brute force algorithm:
  • 22 + 6 = 28 comparisons.

10
Naïve Algorithm(or Brute Force)
  • Assume |T| = n and |P| = m
  • Compare until a match is found. If so return the
    index where match occurs
  • else return -1

[Figure: text T with pattern P aligned at successive positions]
11
Brute Force Version 1
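  • static int match(char[] T, char[] P) {
  •   for (int i = 0; i + P.length <= T.length; i++) {
  •     boolean flag = true;
  •     for (int j = 0; j < P.length; j++)
  •       if (T[i + j] != P[j]) { flag = false; break; }
  •     if (flag) return i; // match starts at index i
  •   }
  •   return -1; // no match
  • }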
  • What is the complexity of the code?

12
Brute Force, Version 2
  • static int match(char[] T, char[] P) {
  •   int n = T.length;
  •   int m = P.length;
  •   int i = 0;
  •   int j = 0;
  •   // rewrite the brute-force code with only one
    loop
  •   do {
  •     // Homework
  •   } while (j < m && i < n);
  • }
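
One possible way to fill in the single-loop version is sketched below. It is not the official homework solution. It uses a while loop rather than do/while so that empty inputs are handled, and the class name BruteForceOneLoop is an assumption.

```java
// A sketch of single-loop brute-force matching; one of several possible answers.
final class BruteForceOneLoop {
    /** Returns the index of the first occurrence of P in T, or -1 if there is none. */
    static int match(char[] T, char[] P) {
        int n = T.length;
        int m = P.length;
        int i = 0; // current position in the text
        int j = 0; // current position in the pattern
        while (j < m && i < n) {
            if (T[i] == P[j]) {
                i++;
                j++;
            } else {
                i = i - j + 1; // back up to one past the start of this attempt
                j = 0;
            }
        }
        return (j == m) ? i - m : -1;
    }

    public static void main(String[] args) {
        System.out.println(match("abacaabaccabacabaabb".toCharArray(), "abacab".toCharArray())); // prints 10
    }
}
```

Note that i moves backwards on a mismatch, which is exactly the behaviour KMP avoids.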

13
A bad case
  • 00000000000000001
  • 0000-
  • 0000-
  • 0000-
  • 0000-
  • 0000-
  • 0000-
  • 0000-
  • 0000-
  • 0000-
  • 0000-
  • 0000-
  • 0000-
  • 00001
  • 60 + 5 = 65 comparisons are needed
  • How many of them could be avoided?
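
The comparison counts quoted for brute force can be checked with a small instrumented matcher. This is a sketch: the class ComparisonCounter is hypothetical, and the bad-case text is assumed to be 16 zeros followed by a 1, which is what the count of 60 + 5 implies.

```java
// Hypothetical instrumented brute-force matcher; counts character comparisons.
final class ComparisonCounter {
    /** Comparisons made by brute force up to and including the first match (or until the text is exhausted). */
    static int bruteForceComparisons(String t, String p) {
        int count = 0;
        for (int i = 0; i + p.length() <= t.length(); i++) {
            int j = 0;
            while (j < p.length()) {
                count++; // one comparison of t[i + j] with p[j]
                if (t.charAt(i + j) != p.charAt(j)) {
                    break;
                }
                j++;
            }
            if (j == p.length()) {
                return count; // first match found
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(bruteForceComparisons("abacaabaccabacabaabb", "abacab")); // prints 28
        System.out.println(bruteForceComparisons("0".repeat(16) + "1", "00001"));    // prints 65
    }
}
```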

15
Typical text matching
  • This is a sample sentence
  • -
  • -
  • -
  • s-
  • -
  • -
  • s-
  • -
  • -
  • -
  • s-
  • -
  • -
  • -
  • -
  • -
  • -
  • 20 + 5 = 25 comparisons are needed
  • (The match is near the same point in the target
    string as the previous example.)
  • In practice, 0 ≤ j ≤ 2

16
String Matching
  • Brute force worst case
  • O(MN)
  • Expensive for long patterns in repetitive text
  • How to improve on this?
  • Intuition
  • Remember what is learned from previous matches

17
Finite State Machines
18
Finite State Machines (FSM)
  • FSM is a computing machine that takes
  • A string as an input
  • Outputs YES/NO answer
  • That is, the machine accepts or rejects the
    string

[Figure: Input String → FSM → Yes / No]
19
FSM Model
  • Input to a FSM
  • Strings built from a fixed alphabet {a, b, c}
  • Possible inputs: aa, aabbcc, a, etc.
  • The Machine
  • A directed graph
  • Nodes = States of the machine
  • Edges = Transitions from one state to another

20
FSM Model
  • Special States
  • Start (q0) and Final (or Accepting) (q2)
  • Assume the alphabet is {a, b}
  • Which strings are accepted by this FSM?

21
FSM Model
  • Exercise: draw a finite automaton that accepts
    any string with an even number of 1s
  • Exercise: draw a finite automaton that accepts
    any string with an even number of consecutive 1s
    followed by an odd number of consecutive zeros
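
As a hint for the first exercise, a two-state machine is enough. The sketch below encodes it as a transition table; the class EvenOnes and the restriction to the alphabet {0, 1} are assumptions made for this example.

```java
// Sketch of a two-state FSM accepting binary strings with an even number of 1s.
final class EvenOnes {
    // DELTA[state][symbol]: state 0 = even number of 1s seen so far, state 1 = odd.
    private static final int[][] DELTA = {
        {0, 1}, // from state 0: on '0' stay in 0, on '1' go to 1
        {1, 0}, // from state 1: on '0' stay in 1, on '1' go to 0
    };

    /** Returns true if s contains an even number of 1s; s must use only '0' and '1'. */
    static boolean accepts(String s) {
        int state = 0; // start state, which is also the accepting state
        for (char c : s.toCharArray()) {
            if (c != '0' && c != '1') {
                throw new IllegalArgumentException("symbol outside alphabet: " + c);
            }
            state = DELTA[state][c - '0'];
        }
        return state == 0;
    }

    public static void main(String[] args) {
        System.out.println(accepts("0110")); // prints true
        System.out.println(accepts("0111")); // prints false
    }
}
```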

22
Why Study FSMs
  • Useful Algorithm Design Technique
  • Lexical Analysis (tokenization)
  • Control Systems
  • Elevators, Soda Machines.
  • Modeling a problem with FSM is
  • Simple
  • Elegant

23
State Transitions
  • Let Q be the set of states and Σ be the alphabet.
    Then the transition function T is given by
  • T: Q × Σ → Q
  • Σ could be
  • {0, 1}: binary
  • {C, G, T, A}: nucleotide bases
  • {0, 1, 2, ..., 9, a, b, c, d, e, f}: hexadecimal
  • etc.
  • E.g., consider Σ = {a, b, c} and P = aabc
  • The set of states is the set of all prefixes of P
  • Q = {ε, a, aa, aab, aabc}, or
  • Q = {0, 1, 2, 3, 4}
  • State transitions: T(0, a) = 1, T(1, a) = 2,
    etc.
  • What about failure transitions?

24
Failure Transitions
  • Where do we go when a failure occurs?
  • P = aabc
  • Q: current state
  • Q': next state
  • initial state 0
  • end state 4
  • How to store state transition table?
  • as a matrix

Current state Q   Input Σ   Next state Q'
0                 a         1
0                 b, c      0
1                 a         2
1                 b, c      0
2                 b         3
2                 a         2
2                 c         0
3                 c         4
3                 a         1
3                 b         0
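
Stored as a matrix, the transition table for P = aabc can drive a matcher directly. The following is a minimal sketch: the class AabcMatcher and the mapping a=0, b=1, c=2 are assumptions, and the text is assumed to contain only the letters a, b and c.

```java
// Sketch: the transition table for pattern "aabc" stored as a matrix.
final class AabcMatcher {
    // DELTA[state][symbol] with symbols a=0, b=1, c=2; state 4 is the end state.
    private static final int[][] DELTA = {
        {1, 0, 0}, // state 0
        {2, 0, 0}, // state 1 (seen "a")
        {2, 3, 0}, // state 2 (seen "aa")
        {1, 0, 4}, // state 3 (seen "aab")
    };

    /** Returns the index of the first occurrence of "aabc" in text, or -1. */
    static int find(String text) {
        int state = 0;
        for (int i = 0; i < text.length(); i++) {
            state = DELTA[state][text.charAt(i) - 'a'];
            if (state == 4) {
                return i - 3; // the match ends at i, so it starts 3 positions earlier
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(find("aaabcab")); // prints 1
        System.out.println(find("abcabc"));  // prints -1
    }
}
```

Each text character is read exactly once, so the scan takes time proportional to the length of the text.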
25
Using FSM Concept in Pattern Matching
  • Consider the alphabet {a, b, c}
  • Suppose we are looking for pattern aabc
  • Construct a finite automaton for aabc as follows

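[Figure: finite automaton for aabc. States 0 (Start) through 4; forward transitions labeled a, a, b, c; the remaining edges are failure transitions.]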
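
The automaton for an arbitrary pattern can be built straight from the definition: from state q on symbol c, move to the longest prefix of P that is a suffix of P[0..q-1] followed by c. The sketch below does this by brute force. The class AutomatonBuilder is hypothetical, and this simple construction is not claimed to be the one used in the lecture.

```java
// Sketch: build the matching automaton for a pattern directly from the definition.
final class AutomatonBuilder {
    /**
     * Returns delta where delta[q][a] is the next state from state q on the a-th
     * symbol of alphabet. States are 0..p.length(); state p.length() means "matched".
     */
    static int[][] build(String p, String alphabet) {
        int m = p.length();
        int[][] delta = new int[m + 1][alphabet.length()];
        for (int q = 0; q <= m; q++) {
            for (int a = 0; a < alphabet.length(); a++) {
                String seen = p.substring(0, q) + alphabet.charAt(a);
                // Find the longest prefix of p that is a suffix of what has been seen.
                int k = Math.min(m, seen.length());
                while (k > 0 && !seen.endsWith(p.substring(0, k))) {
                    k--;
                }
                delta[q][a] = k;
            }
        }
        return delta;
    }

    public static void main(String[] args) {
        int[][] delta = build("aabc", "abc");
        System.out.println(java.util.Arrays.toString(delta[2])); // prints [2, 3, 0]
        System.out.println(java.util.Arrays.toString(delta[3])); // prints [1, 0, 4]
    }
}
```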
26
Knuth Morris Pratt (KMP) Algorithm
27
KMP: The Big Idea
  • Retain information from prior attempts.
  • Compute in advance how far to jump in P when a
    match fails.
  • Suppose the match fails at P[j] ≠ T[i+j].
  • Then we know P[0..j-1] = T[i..i+j-1].
  • We must next try comparing P[0] with T[i+1].
  • But we know T[i+1] = P[1].
  • What if we compare P[1] with P[0] instead?
  • If they are equal, increment j by 1. No need to
    look at T.
  • What if P[1] = P[0] and P[2] = P[1]?
  • Then increment j by 2. Again, no need to look at
    T.
  • In general, we can determine how far to jump
    without any knowledge of T!

28
Implementing KMP
  • Never decrement i, ever.
  • Comparing T[i] with P[j].
  • Compute a table f of how far to jump j forward
    when a match fails.
  • The next match will compare T[i] with
    P[f[j-1]]
  • Do this by matching P against itself in all
    positions.

29
Building the Table for f
  • P = 1010011
  • Find self-overlaps
  • Prefix    Overlap  j  f
  • 1         .        1  0
  • 10        .        2  0
  • 101       1        3  1
  • 1010      10       4  2
  • 10100     .        5  0
  • 101001    1        6  1
  • 1010011   1        7  1
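
The f column can be reproduced by checking the definition directly: for each prefix, find the longest proper prefix that is also a suffix. The sketch below is deliberately naive and is not the linear-time KMP pre-processing step; the class SelfOverlap is an assumption.

```java
import java.util.Arrays;

// Sketch: compute the self-overlap table by brute force, straight from the definition.
final class SelfOverlap {
    /** f[j-1] is the length of the longest proper prefix of p[0..j-1] that is also its suffix. */
    static int[] overlaps(String p) {
        int[] f = new int[p.length()];
        for (int j = 1; j <= p.length(); j++) {
            String prefix = p.substring(0, j);
            int k = j - 1; // a proper overlap is shorter than the prefix itself
            while (k > 0 && !prefix.endsWith(p.substring(0, k))) {
                k--;
            }
            f[j - 1] = k;
        }
        return f;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(overlaps("1010011"))); // prints [0, 0, 1, 2, 0, 1, 1]
    }
}
```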

30
What f means
  • f non-zero implies there is a self-match.
  • E.g., f = 2 means P[0..1] = P[j-2..j-1]
  • Hence must start new comparison at j-2, since we
    know T[i-2..i-1] = P[0..1]
  • In general
  • Set j = f[j-1]
  • Do not change i.
  • The next match is
  • T[i] ? P[f[j-1]]
  • Prefix    Overlap  j  f
  • 1         .        1  0
  • 10        .        2  0
  • 101       1        3  1
  • 1010      10       4  2
  • 10100     .        5  0
  • 101001    1        6  1
  • 1010011   1        7  1
  • If f is zero, there is no self-match.
  • Set j = 0
  • Do not change i.
  • The next match is
  • T[i] ? P[0]

31
Favorable conditions
  • P = 1234567
  • Find self-overlaps
  • Prefix    Overlap  j  f
  • 1         .        1  0
  • 12        .        2  0
  • 123       .        3  0
  • 1234      .        4  0
  • 12345     .        5  0
  • 123456    .        6  0
  • 1234567   .        7  0

32
Mixed conditions
  • P = 1231234
  • Find self-overlaps
  • Prefix    Overlap  j  f
  • 1         .        1  0
  • 12        .        2  0
  • 123       .        3  0
  • 1231      1        4  1
  • 12312     12       5  2
  • 123123    123      6  3
  • 1231234   .        7  0

33
Poor conditions
  • P = 1111110
  • Find self-overlaps
  • Prefix    Overlap  j  f
  • 1         .        1  0
  • 11        1        2  1
  • 111       11       3  2
  • 1111      111      4  3
  • 11111     1111     5  4
  • 111111    11111    6  5
  • 1111110   .        7  0

34
KMP pre-process Algorithm
  • m = |P|
  • Define a table F of size m
  • F[0] = 0
  • i = 1; j = 0
  • while (i < m) {
  •   compare P[i] and P[j]
  •   if (P[j] == P[i]) {
  •     F[i] = j + 1
  •     i++; j++
  •   }
  •   else if (j > 0) j = F[j-1]
  •   else { F[i] = 0; i++ }
  • }
Use previous values of f
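
A runnable Java rendering of the pre-processing step might look as follows. It follows the pseudocode on this slide; the class name KmpTable and the method name failure are assumptions.

```java
import java.util.Arrays;

// Sketch of the KMP pre-processing step, following the slide's pseudocode.
final class KmpTable {
    /** F[i] is the length of the longest proper prefix of P[0..i] that is also a suffix of it. */
    static int[] failure(char[] P) {
        int m = P.length;
        int[] F = new int[m]; // F[0] is 0 by default
        int i = 1;
        int j = 0;
        while (i < m) {
            if (P[j] == P[i]) {
                F[i] = j + 1; // the overlap grows by one character
                i++;
                j++;
            } else if (j > 0) {
                j = F[j - 1]; // fall back using previously computed values
            } else {
                F[i] = 0;
                i++;
            }
        }
        return F;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(failure("1010011".toCharArray()))); // prints [0, 0, 1, 2, 0, 1, 1]
        System.out.println(Arrays.toString(failure("1111110".toCharArray()))); // prints [0, 1, 2, 3, 4, 5, 0]
    }
}
```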
35
KMP Algorithm
  • input: Text T and Pattern P
  • |T| = n
  • |P| = m
  • Compute Table F for Pattern P
  • i = j = 0
  • while (i < n) {
  •   if (P[j] == T[i]) {
  •     if (j == m-1) return i-m+1
  •     i++; j++
  •   }
  •   else if (j > 0) j = F[j-1]
  •   else i++
  • }
  • output: first occurrence of P in T

Use F to determine next value for j.
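
Putting the two pieces together gives a complete matcher. The sketch below follows the pseudocode above. Returning -1 when P does not occur and 0 for an empty pattern are assumptions, and the class name Kmp is illustrative.

```java
// Sketch of the full KMP search, following the slide's pseudocode.
final class Kmp {
    /** Failure table: F[i] = longest proper prefix of P[0..i] that is also a suffix of it. */
    static int[] failure(char[] P) {
        int[] F = new int[P.length];
        int i = 1;
        int j = 0;
        while (i < P.length) {
            if (P[j] == P[i]) { F[i] = j + 1; i++; j++; }
            else if (j > 0) { j = F[j - 1]; }
            else { F[i] = 0; i++; }
        }
        return F;
    }

    /** Returns the index of the first occurrence of P in T, or -1 if there is none. */
    static int search(char[] T, char[] P) {
        int n = T.length;
        int m = P.length;
        if (m == 0) {
            return 0; // assumption: the empty pattern matches at index 0
        }
        int[] F = failure(P);
        int i = 0; // position in the text; never decremented
        int j = 0; // position in the pattern
        while (i < n) {
            if (P[j] == T[i]) {
                if (j == m - 1) {
                    return i - m + 1; // whole pattern matched
                }
                i++;
                j++;
            } else if (j > 0) {
                j = F[j - 1]; // reuse the overlap; i stays put
            } else {
                i++;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(search("abacaabaccabacabaabb".toCharArray(), "abacab".toCharArray())); // prints 10
    }
}
```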
36
Specializing the matcher
  • Prefix    Overlap  j  f
  • 1         .        1  0
  • 10        .        2  0
  • 101       1        3  1
  • 1010      10       4  2
  • 10100     .        5  0
  • 101001    1        6  1
  • 1010011   1        7  1

[Figure: state diagram of the matcher specialized to P = 1010011, with edges labeled 0 and 1]
37
Brute Force vs. KMP
  • A worst-case example
  • Brute force:
  • 000000000000000000000000001
  • 0000000000000-
  • 0000000000000-
  • 0000000000000-
  • 0000000000000-
  • 0000000000000-
  • 0000000000000-
  • 0000000000000-
  • 0000000000000-
  • 0000000000000-
  • 0000000000000-
  • 0000000000000-
  • 0000000000000-
  • 0000000000000-
  • 0000000000000-
  • 196 + 14 = 210 comparisons

KMP:
0000000000000000000000000001
0000000000000-
0- (repeated 13 times, one text position further right each time)
01
28 + 14 = 42 comparisons

38
Brute Force vs. KMP
  • Brute force:
  • abcdeabcdeabcedfghijkl
  • -
  • bc-
  • -
  • -
  • -
  • -
  • bc-
  • -
  • -
  • -
  • -
  • bcedfg
  • 21 comparisons

KMP:
abcdeabcdeabcedfghijkl
-
bc-
-
-
-
bc-
-
-
-
bcedfg
19 comparisons + 5 preparation comparisons
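
The comparison counts on the last two slides can be reproduced with instrumented versions of both algorithms. This is a sketch under assumptions: the class CompareCounts is hypothetical, only text/pattern comparisons are counted (not KMP's preparation comparisons), and the worst-case text is taken to be 27 zeros followed by a 1 with a pattern of 13 zeros followed by a 1, which is what the counts imply.

```java
// Sketch: count text/pattern character comparisons for brute force and for KMP.
final class CompareCounts {
    static int bruteForce(String t, String p) {
        int count = 0;
        for (int i = 0; i + p.length() <= t.length(); i++) {
            int j = 0;
            while (j < p.length()) {
                count++;
                if (t.charAt(i + j) != p.charAt(j)) break;
                j++;
            }
            if (j == p.length()) return count; // stop at the first match
        }
        return count;
    }

    static int kmp(String t, String p) {
        int m = p.length();
        // Failure table, built as in the pre-processing slide.
        int[] F = new int[m];
        for (int i = 1, j = 0; i < m; ) {
            if (p.charAt(j) == p.charAt(i)) { F[i] = j + 1; i++; j++; }
            else if (j > 0) { j = F[j - 1]; }
            else { F[i] = 0; i++; }
        }
        int count = 0;
        int i = 0;
        int j = 0;
        while (m > 0 && i < t.length()) {
            count++; // one comparison of p[j] with t[i]
            if (p.charAt(j) == t.charAt(i)) {
                if (j == m - 1) return count; // stop at the first match
                i++;
                j++;
            } else if (j > 0) {
                j = F[j - 1];
            } else {
                i++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String t = "abcdeabcdeabcedfghijkl";
        System.out.println(bruteForce(t, "bcedfg")); // prints 21
        System.out.println(kmp(t, "bcedfg"));        // prints 19
        String zeros = "0".repeat(27) + "1";
        String pat = "0".repeat(13) + "1";
        System.out.println(bruteForce(zeros, pat));  // prints 210
        System.out.println(kmp(zeros, pat));         // prints 42
    }
}
```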

39
KMP Performance
  • Pre-processing needs O(M) operations.
  • At each iteration, one of three cases
  • T[i] = P[j]
  • i increases
  • T[i] ≠ P[j] and j > 0
  • i-j increases
  • T[i] ≠ P[j] and j = 0
  • i increases and i-j increases
  • Hence, maximum of 2N iterations.
  • Thus worst case performance is O(N+M).

40
Exercises
  • Suppose we are given the pattern P = 10010001 and
  • text T = 000100100100010111
  • do the following:
  • Draw an FSM for pattern P
  • Construct the KMP table for P
  • Trace the KMP algorithm with T