Exact and Approximate Pattern Matching in the Streaming Model

1
Exact and Approximate Pattern Matching in the Streaming Model
Benny Porat and Ely Porat, FOCS 2009
  • Presented by Tanushree Mitra

2
Problem Statement
  • Find all occurrences of a pattern P of length m as a
    contiguous substring of a text string T of
    length n, where m < n.

3
Contributions
  • Exact pattern matching - A fully online
    randomized algorithm for the classical pattern
    matching problem
  • Time complexity - O(log m) per character that
    arrives
  • Space complexity - O(log m), breaking the O(m)
    barrier that held for this problem for a long
    time.
  • Approximate pattern matching - An algorithm for
    the pattern matching with k mismatches problem.
  • Time complexity - O(k^2 poly(log m)) per
    character
  • Space complexity - O(k^3 poly(log m))

4
Applications
  • Monitoring Internet traffic
  • Computational Biology
  • Large Scale web searching
  • Viruses and Malware detection
  • Automatic Stock market analysis
  • Robotics

5
Background
  • Brute-force algorithm
  • Slide the pattern along the text and
    compare it to the corresponding portion of the
    text
  • Time complexity: O(mn)
  • A speedup is possible in each of these 2 steps.
  • Sliding step: speedup by pre-processing the
    pattern
  • Knuth-Morris-Pratt algorithm
  • Boyer-Moore algorithm
  • Ukkonen's algorithm to construct suffix trees
  • Comparison step: speedup
  • Rabin-Karp algorithm
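
The two steps above can be contrasted in a short sketch. It is only an illustration, not code from the presentation: the function names, the base 256 and the modulus 2^61 - 1 are assumptions chosen for the example.

```python
def brute_force(text, pattern):
    """Slide the pattern along the text and compare at every shift: O(mn)."""
    n, m = len(text), len(pattern)
    return [i for i in range(n - m + 1) if text[i:i + m] == pattern]


def rabin_karp(text, pattern, base=256, mod=(1 << 61) - 1):
    """Speed up the comparison step: compare hashes, verify only on a hit."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)  # weight of the character leaving the window
    hp = ht = 0
    for i in range(m):
        hp = (hp * base + ord(pattern[i])) % mod
        ht = (ht * base + ord(text[i])) % mod
    hits = []
    for i in range(n - m + 1):
        # Equal hashes may be a false positive, so confirm by comparison.
        if hp == ht and text[i:i + m] == pattern:
            hits.append(i)
        if i + m < n:
            # Roll the window: drop text[i], append text[i + m].
            ht = ((ht - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return hits
```

Both functions keep the whole window of m text characters available, which is the O(m) space that the streaming algorithm aims to avoid.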

6
Quick History

7
The Intuition
  • When the Rabin-Karp algorithm is done with the i-th
    character and advances to the next position in
    the text, it does not use any of the information
    gathered.
  • The KMP algorithm, on the other hand, puts that
    information to good use.

The Idea
  • Combine the key features of KMP and the
    Rabin-Karp algorithms to achieve an online
    algorithm that uses less space.

8
Definitions - Fingerprints
Fingerprint
  • String S → φ(S)

Sliding Fingerprint
  • Polynomial fingerprint
  • φ_{r,p}(S) = s_1 r + s_2 r^2 + … + s_l r^l mod p, where p ∈ Θ(n^4),
    r ∈ F_p
  • False positives
  • If S_1 ≠ S_2, then the probability that φ_{r,p}(S_1) =
    φ_{r,p}(S_2) is < 1/n^3
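
A minimal sketch of the polynomial fingerprint and its sliding operations, under stated assumptions: the class and method names are hypothetical, the fixed prime 2^61 - 1 stands in for a prime p ∈ Θ(n^4), and r is assumed to be nonzero so that it has an inverse mod p. It relies on the identity φ(AB) = φ(A) + r^|A| φ(B) mod p, which follows directly from the definition.

```python
P = (1 << 61) - 1  # assumed prime modulus, standing in for p


class Fingerprint:
    """phi_{r,p}(S) = s_1 r + s_2 r^2 + ... + s_l r^l mod p."""

    def __init__(self, r, value=0, length=0):
        self.r, self.value, self.length = r, value, length

    def extend(self, ch):
        """Append one character to the fingerprinted string."""
        self.length += 1
        self.value = (self.value + ord(ch) * pow(self.r, self.length, P)) % P

    def drop_prefix(self, prefix):
        """Slide: remove a prefix whose fingerprint is already known."""
        # Inverse of r^|prefix| via Fermat's little theorem (P is prime).
        inv = pow(pow(self.r, prefix.length, P), P - 2, P)
        self.value = (self.value - prefix.value) * inv % P
        self.length -= prefix.length


def fingerprint(s, r):
    """Build the fingerprint of a whole string, one character at a time."""
    fp = Fingerprint(r)
    for ch in s:
        fp.extend(ch)
    return fp
```

Note that drop_prefix needs only the fingerprint of the removed prefix, not its characters, so each fingerprint is stored as a constant number of machine words.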

9
Definitions - period_{P_l}
  • Period - A prefix S_p = s_1, s_2, …, s_l of a string S
    is defined to be a period of S iff s_i = s_{i+l}
    for 0 ≤ i ≤ n - l
  • period_{P_l} - For a pattern P = p_1, p_2, …, p_m, a prefix
    is P_l = p_1, p_2, …, p_l, 0 ≤ l ≤ m. The shortest
    period of P_l is period_{P_l}

Put the information to good use
  • If P_l matches the text at a given index i, then
    there cannot be a match between i and i +
    period_{P_l}
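
As an illustration of the definition, the length of the shortest period can be computed with the KMP failure function. This is a sketch with hypothetical function names and 0-based indexing, not code from the presentation.

```python
def is_period(s, l):
    """Check the definition directly: s[i] == s[i + l] wherever both exist."""
    return all(s[i] == s[i + l] for i in range(len(s) - l))


def shortest_period(s):
    """Length of the shortest period of s, via the KMP failure function."""
    n = len(s)
    fail = [0] * (n + 1)  # fail[j] = length of the longest proper border of s[:j]
    k = 0
    for j in range(1, n):
        while k > 0 and s[j] != s[k]:
            k = fail[k]
        if s[j] == s[k]:
            k += 1
        fail[j + 1] = k
    # The longest border of s leaves the shortest period as the remainder.
    return n - fail[n]
```

For example, "abcabcab" has shortest period "abc", so after a match at index i the next possible match of that string starts at i + 3.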

10
The Idea
False positives? Slide over period_{P_l} positions
that could be a match. Very low probability of
false positives.
  • A match at the i-th index indicates that we know the
    last m characters, so is there any point in saving them?
  • Preprocessing phase: calculate the sliding
    fingerprint of the pattern, φ_P, and of its shortest
    period, φ_{period_P}
  • Online phase: slide the fingerprint φ over the
    entire text.
  • While φ = φ_P, slide φ by period_P characters
  • If we do not reach the end of the text, abort

The text and pattern must satisfy stringent
restrictions.

11
Go for subpatterns
  • log m subpatterns

[Figure: the pattern p_1, p_2, p_3, …, p_{m-3}, p_{m-2}, p_{m-1}, p_m divided into subpatterns P_1, P_2, P_4, …, P_{m/2}]

  • Starting point: find a position in which the
    smallest subpattern matches the text. The smallest
    subpattern has length 1, so this can be easily
    found.

12
Algorithm
  • Guidelines
  • Find a position where P_i is a match, then try to
    match P_{i+1} from the same starting point as P_i
  • If P_{i+1} does not match, use the information
    that P_i is a match.
  • Check in jumps of period_{P_i} until there is no
    overlap with the area where P_i matches.
  • PROCESS
  • Initialize an empty sliding fingerprint φ.
  • For each character that arrives:
  • Extend φ to include the new character
  • If |φ| = 2^i and φ ≠ φ_i for some 0 ≤ i ≤ log m:
  • If φ overlaps the last match by at least
    period_{P_{i-1}} characters, slide φ by period_{P_{i-1}}
    characters.
  • Else, abort.

What if there is a match that starts in the substring
of the 1st process and ends in the substring of the 2nd
process?

13
Exact_PM Final Algorithm: Introduce Checkpoints
  • Checkpoint - Start a new process in the
    last checkpoint of each process
  • Algorithm
  • Preprocessing -
  • Initialize an empty sliding fingerprint φ.
  • For each 0 ≤ i ≤ log m, calculate the sliding
    fingerprints
  • φ_i of P_i and
  • φ_{i,period} of the period of P_i
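
The preprocessing step can be sketched as follows. This is a hedged illustration only: it assumes that P_i is the prefix of the pattern of length 2^i and that the pattern length is a power of two, and the names preprocess, fingerprint and naive_shortest_period, as well as the modulus, are hypothetical.

```python
P_MOD = (1 << 61) - 1  # assumed prime modulus for the fingerprints


def fingerprint(s, r):
    """phi(S) = s_1 r + s_2 r^2 + ... + s_l r^l mod P_MOD."""
    value, power = 0, 1
    for ch in s:
        power = power * r % P_MOD
        value = (value + ord(ch) * power) % P_MOD
    return value


def naive_shortest_period(s):
    """Smallest l >= 1 such that s[i] == s[i + l] for all valid i."""
    n = len(s)
    return next(l for l in range(1, n + 1)
                if all(s[i] == s[i + l] for i in range(n - l)))


def preprocess(pattern, r):
    """One table row per subpattern P_i (assumed: prefix of length 2^i)."""
    table = []
    length = 1
    while length <= len(pattern):
        prefix = pattern[:length]
        period = naive_shortest_period(prefix)
        table.append({
            "length": length,
            "phi": fingerprint(prefix, r),                   # phi_i
            "period": period,
            "phi_period": fingerprint(prefix[:period], r),   # phi_{i,period}
        })
        length *= 2
    return table
```

The table has one row per subpattern, each holding a constant number of values, which matches the O(log m) space stated for the fingerprints from preprocessing.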

14
Final Algorithm: Online Phase
  • Online Phase
  • Start a new process
  • For any character that arrives, send it to all the
    processes
  • If some process aborts, start a new process
  • If some process A reaches a checkpoint:
  • Stop the son process of A (if it has one)
  • Start a new son process of A

15
Complexity
  • Space
  • All fingerprints from preprocessing use O(log m)
    space.
  • Each process saves another fingerprint and there
    can be at most log m processes in parallel
  • OVERALL usage: O(log m) space
  • Time
  • Each process spends O(1) time for each new
    character that arrives
  • At any time there are at most 3 log m processes
    running (process A, the son process of A, and the
    grandson process of A; A has to die when the
    great-grandson of A is created)
  • OVERALL running time: O(log m) per character

16
Pattern Matching (1 Mismatch)
  • Partition the pattern and the text
  • We need to align every partition of the pattern,
    P_{q_i,j}, to q_i text shifts

17
Intuition
  • For each P_{q_i,j}, run q_i processes of Exact_PM.
  • Process_{q_i,j,s} - the s-th process of the subpattern
    P_{q_i,j}, for 0 ≤ s < q_i. This will try to match
    P_{q_i,j} to the text by considering the text as
    if it starts from the s-th character (t mod q_i = j +
    s).
  • If for all q_i,
  • numOfNotMatch_{q_i,s} = 0: match.
  • numOfNotMatch_{q_i,s} = 1: exactly
    1 mismatch.
  • Otherwise, more than 1 mismatch.
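
The counting rule above can be illustrated offline for a single alignment. This sketch makes several assumptions that are not in the slides: it compares the partitioned subsequences directly instead of running Exact_PM processes on a stream, it picks the smallest primes whose product is at least m, and all function names are hypothetical. Two distinct mismatch positions less than m apart cannot be congruent modulo every such prime, so at least one prime separates them into two non-matching partitions.

```python
def primes_with_product_at_least(m):
    """Smallest primes 2, 3, 5, ... whose product is at least m."""
    primes, product, candidate = [], 1, 2
    while product < m or not primes:
        if all(candidate % p for p in primes):
            primes.append(candidate)
            product *= candidate
        candidate += 1
    return primes


def classify(text, pattern, start):
    """Classify the alignment of pattern against text[start:start + m]."""
    m = len(pattern)
    window = text[start:start + m]
    if len(window) < m:
        raise ValueError("alignment runs past the end of the text")
    counts = []
    for q in primes_with_product_at_least(m):
        # Partition j holds the positions congruent to j mod q.
        not_match = sum(1 for j in range(q) if pattern[j::q] != window[j::q])
        counts.append(not_match)
    if all(c == 0 for c in counts):
        return "match"
    if all(c == 1 for c in counts):
        return "1-mismatch"
    return "more than 1 mismatch"
```

For a pattern of length 8 the primes are 2, 3 and 5. Mismatches at positions 0 and 6 fall into the same partition for q = 2 and q = 3, and only q = 5 reveals that there are two of them.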

18
Complexity
  • FACTS
  • Run Σ_{i=1}^{l} q_i^2 processes of Exact_PM
  • There exists a constant c such that for any x,
    there exist (x / log m) prime numbers between x
    and cx
  • We have q_1, q_2, …, q_l groups of partitions.
    Each q_i is a prime number
  • Space - O(log^4 m / log log m)
  • Time - O(log^3 m / log log m)

19
Pattern Matching (k Errors)
  • Preprocessing phase: initialize a process
    Process_{q_i,j,s} of 1-mismatch for each q_i ∈
    {q_1, q_2, …, q_l}, 0 ≤ j < q_i and 0 ≤ s < q_i
  • Online phase: send the t-th character to each
    Process_{q_i,j,s} such that t mod q_i = j + s
  • d = all mismatches from all processes that return
    exactly 1 mismatch
  • d > k: more than k mismatches

20
Complexity
  • Space
  • Run Σ_{i=1}^{k log m} q_i^2 ∈ O(k^3 log^4 m / log log m)
    processes of 1-mismatch in parallel.
  • Each process requires log^4 m space.
  • OVERALL - O(k^3 poly(log m))
  • Time
  • The number of processes of the 1-mismatch algorithm is
    bounded by Σ_{i=1}^{k log m} q_i^2 ∈ O(k^3 log^4 m / log log m)
  • Running time for each character: O(log^3 m)
  • OVERALL - O(k^2 poly(log m))

21
Concluding Discussion
  • The Two-Dimensional String-Matching Problem
  • The String-Matching Problem with Wild Characters
    Example: pattern P = abcabc is found in
    texts T1 = abcdcadbaccabc, T2 = abcabc
  • String matching with weighted mismatch