Online Approximate String Matching with Bounded Errors - PowerPoint PPT Presentation

About This Presentation

Title:

Online Approximate String Matching with Bounded Errors

Description:

Marcos Kiwi, Gonzalo Navarro and Claudio Telha. Universidad de Chile. CPM '08. The Approximate String Matching (ASM), applications and state of the art ...

Transcript and Presenter's Notes

1
On-line Approximate String Matching with Bounded
Errors

Marcos Kiwi, Gonzalo Navarro and Claudio Telha
Universidad de Chile

CPM 08
2
Outline
The Approximate String Matching (ASM),
applications and state of the art
The ASM with bounded errors problem, and one
algorithm that solves it
Complexity results of the proposed algorithms.
Open questions and final comments
3
Introduction
The Approximate String Matching (ASM),
applications and state of the art
4

Approximate String Matching Problem (ASM)
Find the positions where a pattern P appears
approximately in a text T

Example (allowing one mismatch)
Pattern: ACT
Text: ACTGATAACGTTAG
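
The example above can be reproduced with a naive scan. This is only a sketch: it assumes "one mismatch" means substitutions only (Hamming distance), and the function name find_with_mismatches is made up for illustration.

```python
def find_with_mismatches(pattern, text, k):
    """Return the 0-based start positions where pattern matches text
    with at most k mismatching characters (substitutions only)."""
    m = len(pattern)
    hits = []
    for i in range(len(text) - m + 1):
        # Count the positions where the aligned characters differ.
        mismatches = sum(1 for a, b in zip(pattern, text[i:i + m]) if a != b)
        if mismatches <= k:
            hits.append(i)
    return hits


print(find_with_mismatches("ACT", "ACTGATAACGTTAG", 1))  # [0, 7]
```

Position 0 is the exact occurrence ACT, and position 7 is ACG, which differs from the pattern in one character.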
5

Applications of ASM

Sequence alignment of DNA and Proteins
Search Engines
Unreliable transmission of data
Data Mining
Information Retrieval
6

State of the art in ASM
There are algorithms that are theoretically optimal and
simultaneously efficient in practice

When the approximation measure is edit distance:
1985: optimal in the worst case, but with space exponential
in the length of P
1994: optimal on average, less practical (Chang &
Marr)
2004: optimal on average, practical (Fredriksson &
Navarro)
Average: the characters of P and T are chosen uniformly
and independently from an
alphabet Σ.
Most of the key ASM algorithmic questions have
been solved
7
Our method
The ASM with bounded errors problem, and one
algorithm that solves it
8

Our proposal: ASM with Bounded Errors
Break the average-case lower bound by allowing an
additional factor of error

Not a bad deal: ASM is already an approximate model

ASM with bounded errors (ASM-BE)
Algorithms for ASM-BE are allowed to miss each
occurrence with probability at most ε
New parameter: error threshold ε, with 0 < ε < 1
9
Typical execution of ASM-BE algorithms
ASM-BE algorithms are allowed to miss each
occurrence with probability at most ε

ASM algorithm (one mismatch allowed)
Pattern: ACT   Text: ACTGATAACGTTAG
ASM-BE algorithm, first execution
Pattern: ACT   Text: ACTGATAACGTTAG
10
Typical execution of ASM-BE algorithms (continued)
ASM-BE algorithm, second execution
Pattern: ACT   Text: ACTGATAACGTTAG
11
Typical execution of ASM-BE algorithms (continued)
ASM-BE algorithm, third execution
Pattern: ACT   Text: ACTGATAACGTTAG
12
Main Contributions

Break the average-case lower bound by allowing an
additional factor of error

Theoretically interesting, and potentially
practical
Not a bad deal, since ASM is already an approximate
model
Novel approach for ASM
New framework for ASM, with room for improvement
Non-exact ASM algorithm with a probabilistic
guarantee
Easy to extend to other distances and to multiple-pattern ASM
Some notation and facts

Edit Distance
Minimum number of insertions, deletions and
substitutions needed to make two strings equal
edit(fellow, follows) = 2 (fellow → follow →
follows)

n is the length of the text T
m is the length of the pattern P
k is the maximum edit distance we allow in a match
σ is the size of the alphabet Σ
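
The edit distance defined above can be computed with the standard dynamic-programming recurrence. The sketch below is a textbook version, not code from the presentation, and the name edit_distance is chosen here for illustration.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn string a into string b."""
    # prev[j] holds the distance between the current prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


print(edit_distance("fellow", "follows"))  # 2
```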
14
Some notation and facts

ASM results (worst-case time)
Lower bound: Ω(n)
Achievable with automata, but space is
exponential in m
Complexity is not known when polynomial space is
required

ASM results (average-case time)
Lower bound: Ω((k + log_σ m) n / m)
Achievable with filtering algorithms
From a practical perspective, the average case is
the better measure
This is the on-line setting: the text cannot be
preprocessed
15
Building an ASM-BE algorithm

Filter algorithms for ASM usually lead to an
ASM-BE algorithm

Conceptually, an ASM filter is a two-stage process
Intuitively, filters work well on average because in
the average case it's easy to tell that a pattern doesn't
appear
16
How a filter works

First, the filtering sub-process discards areas
where the pattern cannot appear

[Figure: the filtering process over the text]
The filtering sub-process is faster than any traditional
ASM algorithm
The pattern cannot appear in discarded areas
17
How a filter works

Second, the verification sub-process finds the
matches in non-discarded areas

[Figure: the verification process locating occurrences of P in the text]
The verification sub-process can be any non-filtering
algorithm
Actually, the filtering and verification processes run
in parallel
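
As an illustration of a non-filtering verifier, here is a sketch of the classic dynamic-programming search usually attributed to Sellers. The slides do not say which verifier is used, so this choice is an assumption, and the name sellers_search is made up here. It costs O(m) per text character, so checking an area of length O(m) costs O(m²).

```python
def sellers_search(pattern, text, k):
    """Return every end position j (exclusive, 1-based count of characters
    read) such that some substring of text ending at j is within edit
    distance k of pattern."""
    m = len(pattern)
    col = list(range(m + 1))  # column for the empty text prefix
    ends = []
    for j, tc in enumerate(text, start=1):
        new = [0]  # row 0 is always 0: an occurrence may start anywhere
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == tc else 1
            new.append(min(col[i - 1] + cost,  # match or substitute
                           col[i] + 1,         # extra character in the text
                           new[i - 1] + 1))    # pattern character missing
        col = new
        if col[m] <= k:
            ends.append(j)
    return ends


print(sellers_search("ACT", "ACTGATAACGTTAG", 0))  # [3]
```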
18
An ASM filter: the q-grams algorithm

More definitions
A q-gram is a string of length q
A window is a substring of the text (usually, its
size is O(m))
The q-grams of a string S are all the
|S| − q + 1 substrings of S of length q. Multiplicity
counts.
Ukkonen, in 1992, used q-grams as the basis of a
filter algorithm
Its running time is O(n) on average
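
A small sketch of these definitions follows. It reads "multiplicity counts" as treating the q-grams of a string as a multiset, which is an interpretation, and the helper names qgrams and shared_qgrams are hypothetical.

```python
from collections import Counter


def qgrams(s, q):
    """Multiset of the len(s) - q + 1 substrings of s of length q."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def shared_qgrams(a, b, q):
    """Number of q-grams common to a and b, counted with multiplicity."""
    common = qgrams(a, q) & qgrams(b, q)  # multiset intersection
    return sum(common.values())


print(sum(qgrams("ACTGA", 2).values()))  # 4, that is 5 - 2 + 1
print(shared_qgrams("ACTG", "ACTA", 2))  # 2 (AC and CT)
```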
19
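An ASM filter: the q-grams algorithm

Key ideas of the algorithm
The pattern and any occurrence always share at least m − q + 1 − kq
q-grams
On average, a window and the pattern share
fewer q-grams
Why kq? Each of the at most k errors in an occurrence affects at most q of its q-grams
[Figure: an occurrence, its q-grams, and the affected q-grams]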
20
An ASM filter: the q-grams algorithm

The q-grams algorithm
The filter process counts the q-grams shared with the
pattern, for every window of size m − k
The verification process is executed only to find
occurrences containing windows with at least
m − q + 1 − kq shared q-grams
Average-case analysis
The filter process can be implemented in O(n) time
(naïve: O(nq))
The verification process takes O(m²) per verified
window
Windows are verified with probability O(1/m²)
when q = 2 log_σ m
Total time: O(n)
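
The counting filter can be sketched as a sliding sum over the text, so that moving the window by one position updates the count in constant time. Several details are assumptions of this sketch: a text q-gram counts as shared if it occurs anywhere in the pattern, the threshold is the conservative value (number of q-grams in the window) − kq rather than the slide's figure, string slicing still costs O(q) per q-gram, and the name qgram_filter is hypothetical.

```python
def qgram_filter(pattern, text, k, q):
    """Return start positions of the windows of length m - k whose count of
    q-grams occurring in the pattern reaches the threshold."""
    m = len(pattern)
    w = m - k                        # window length
    per_window = w - q + 1           # q-grams inside one window
    threshold = per_window - k * q   # conservative threshold (assumption)
    if per_window <= 0 or len(text) < w:
        return []
    pattern_qgrams = {pattern[i:i + q] for i in range(m - q + 1)}
    # hit[i] is 1 when the text q-gram starting at i occurs in the pattern.
    hit = [1 if text[i:i + q] in pattern_qgrams else 0
           for i in range(len(text) - q + 1)]
    count = sum(hit[:per_window])    # count for the window starting at 0
    survivors = []
    for start in range(len(text) - w + 1):
        if count >= threshold:
            survivors.append(start)  # this window must be verified
        nxt = start + per_window
        if nxt < len(hit):
            count += hit[nxt] - hit[start]  # slide the window by one
    return survivors
```

Windows not returned are discarded. When the threshold is zero or negative every window survives and the filter saves nothing, which is why q has to be small relative to (m − k)/k.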
21
Building an ASM-BE algorithm

Filter algorithms for ASM usually lead to an
ASM-BE algorithm

Conceptually, an ASM-BE filter is a two-stage
process: a probabilistic filter, after which the pattern
probably cannot appear in discarded areas, followed by verification
Essentially, the only difference is that the
filter can fail and discard some occurrences
22
A small change to the q-grams algorithm
Windows are now disjoint and their length is (m − k)/2
Any window included in an occurrence shares at least
(m − k)/2 − q + 1 − kq q-grams with the pattern
Any occurrence necessarily includes at least one
window
[Figure: flowchart over the text. The q-grams filter examines the current window; on "Discard" it goes to the next window, on "Verify" it runs the verification process.]
23
A small change to the q-grams algorithm

The modified q-grams algorithm (still ASM
oriented)
Partition the text into disjoint windows of size
(m − k)/2
The filter counts the q-grams shared with the pattern
per window
The verification process finds any occurrence
containing windows with at least (m − k)/2 − q + 1 − kq
shared q-grams
The average-case analysis remains the same
The filter process still takes O(n) time
The verification process still takes O(m²) per
verified window
Windows are verified with probability O(1/m²)
when q = 2 log_σ m
Total time: O(n)
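
A short worked bound shows why the total stays linear. It is a sketch that takes the per-window figures above at face value and assumes m − k ≥ 2, so that windows are non-empty.

```latex
\begin{align*}
\#\text{windows} &= \frac{n}{(m-k)/2} = \frac{2n}{m-k},\\
\mathbb{E}[\text{verification cost}]
  &= \frac{2n}{m-k}\cdot O\!\left(\frac{1}{m^{2}}\right)\cdot O\!\left(m^{2}\right)
   = O\!\left(\frac{n}{m-k}\right) \subseteq O(n),\\
\text{total} &= \underbrace{O(n)}_{\text{filter}} + O\!\left(\frac{n}{m-k}\right) = O(n).
\end{align*}
```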
24
The ASM-BE q-grams algorithm

We approximate the q-gram counting by using
random sampling

The q-grams algorithm for ASM-BE
Partition the text into disjoint windows of size
(m − k)/2
The filter estimates the fraction of q-grams
shared with the pattern per window
The verification process finds any occurrence
containing windows whose estimated fraction of
shared q-grams is at least a given threshold
The fraction of shared q-grams is estimated by
randomly choosing c q-grams of the window and
computing the fraction of shared q-grams in this
subset.
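
The sampling filter can be sketched as follows. Several details are assumptions of this sketch and are not stated on the slide: sampling is done with replacement, a q-gram counts as shared if it occurs anywhere in the pattern, the window size is rounded down, and the threshold is passed in as a plain parameter. The name sampled_filter and its signature are hypothetical.

```python
import random


def sampled_filter(pattern, text, k, q, c, threshold, rng=None):
    """Return start positions of the disjoint windows whose estimated
    fraction of q-grams shared with the pattern is at least threshold."""
    rng = rng or random.Random()
    m = len(pattern)
    w = (m - k) // 2            # window length, rounded down (assumption)
    per_window = w - q + 1      # q-grams inside one window
    if per_window <= 0:
        return []
    pattern_qgrams = {pattern[i:i + q] for i in range(m - q + 1)}
    candidates = []
    for start in range(0, len(text) - w + 1, w):
        shared = 0
        for _ in range(c):
            # Pick one q-gram of the window uniformly at random.
            pos = start + rng.randrange(per_window)
            if text[pos:pos + q] in pattern_qgrams:
                shared += 1
        if shared / c >= threshold:
            candidates.append(start)  # window goes on to verification
    return candidates
```

Only c q-grams are inspected per window, so the filter reads a fraction of the text. The price is that a window inside a true occurrence can be discarded when the sample happens to be unrepresentative.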
25
Results
Complexity results of the proposed algorithms.
26
The ASM-BE algorithms
Complexity of ASM is O((k + log_σ m) n / m)
27
Conclusion
Open questions and future work
28

What did we do?
We just scratched the surface of this new area

Introduced the framework of ASM with Bounded
Errors
Proved the existence of natural algorithms for
this framework that are easy to extend to
related problems (multiple-pattern string
matching, for example)
Showed that it is possible to break the lower
bound for ASM in this framework, with a
reasonable error probability
29

Open questions

So far the error ε is mainly a dependent
parameter. Is there a natural way to introduce
this parameter into the algorithms?
What are the best possible tradeoffs (error vs.
time) that we can achieve?