Title: Survey: String Matching with k Mismatches


1
Survey: String Matching with k Mismatches
Moshe Lewenstein, Bar Ilan University
2
String Matching with k Mismatches
Landau - Vishkin 1986
Galil - Giancarlo 1986
Abrahamson 1987
Amir - Lewenstein - Porat 2000
7
Exact String Matching
Input: T = t1 . . . tn, P = p1 . . . pm
Output: All locations i of T where P appears
Example: P = A B C A A B
T = A B A B C A A B C A A B C A A B A A
Answer: 3, 7, 11, ...
8
Exact String Matching
  • Problem: Matching is not exact in applications of
  • Computational Biology
  • Musicology
  • Text Editing
  • Meteorology
  • etc.
  • Need other definitions of string matching!

9
Approximate String Matching
Idea: Find all text locations where the distance from the pattern is sufficiently small.
Distance metric: HAMMING DISTANCE
Let S = s1 s2 ... sm, R = r1 r2 ... rm
Ham(S,R) = the number of locations j where sj ≠ rj
Example: S = ABCABC, R = ABBAAC
Ham(S,R) = 2
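The definition translates almost directly into code. The following is a minimal sketch; the function name `hamming` is chosen here for illustration and does not come from the slides.

```python
def hamming(s: str, r: str) -> int:
    """Number of positions j where s[j] != r[j]; both strings must have length m."""
    if len(s) != len(r):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(1 for a, b in zip(s, r) if a != b)


print(hamming("ABCABC", "ABBAAC"))  # 2, as in the example on this slide
```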
15
String Matching with Mismatches
Input: T = t1 . . . tn, P = p1 . . . pm
Output: For each i in T, Ham(P, ti ti+1 ... ti+m-1)
Example: P = A B B A A C
T = A B C A A B C A C
Output: 2, 4, 6, 2, ...
18
String Matching with k Mismatches
Input: T = t1 . . . tn, P = p1 . . . pm
Output: Every i in T s.t. Ham(P, ti ti+1 ... ti+m-1) ≤ k
Example: k = 2, P = A B B A A C
T = A B C A A B C A C
Hamming distances: 2, 4, 6, 2, ...
Within k mismatches: Y, N, N, Y, ...

19
Naïve Algorithm (for counting mismatches or k-mismatches problem)
- Go to each location i of the text and compute the Hamming distance of P and Ti
Running Time: O(nm), where n = |T|, m = |P|
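A short sketch of the naive algorithm, using the example strings from the earlier slides. The function names are illustrative assumptions, and locations are reported 1-based to agree with the slides.

```python
def mismatch_counts(text: str, pattern: str):
    """Hamming distance between the pattern and every length-m window of the text."""
    n, m = len(text), len(pattern)
    counts = []
    for i in range(n - m + 1):          # each text location
        d = sum(1 for j in range(m) if text[i + j] != pattern[j])
        counts.append(d)
    return counts                       # O(nm) character comparisons overall


def k_mismatch_locations(text: str, pattern: str, k: int):
    """1-based locations whose window is within k mismatches of the pattern."""
    return [i + 1 for i, d in enumerate(mismatch_counts(text, pattern)) if d <= k]


print(mismatch_counts("ABCAABCAC", "ABBAAC"))          # [2, 4, 6, 2]
print(k_mismatch_locations("ABCAABCAC", "ABBAAC", 2))  # [1, 4]
```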
20
The Kangaroo Method (for k-mismatches)
Landau - Vishkin 1986
Galil - Giancarlo 1986
21
Trie
  • A tree representing a set of strings.

{ aeef, ad, bbfe, bbfg, c }
[Figure: trie of the five strings, edges labeled by single characters]
22
Trie (Cont)
  • Assume no string is a prefix of another

Each string corresponds to a leaf.
[Figure: the trie of the same strings, one leaf per string]
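A trie can be sketched with nested dictionaries, one dictionary per node. This is an illustrative representation (the names `build_trie` and `count_leaves` are assumptions, not from the slides); it builds the trie of the string set shown earlier and checks that a prefix-free set yields one leaf per string.

```python
def build_trie(strings):
    """Each node is a dict mapping a character to the child node."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})   # follow or create the edge labeled ch
    return root


def count_leaves(node):
    """A node without children is a leaf."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


trie = build_trie(["aeef", "ad", "bbfe", "bbfg", "c"])
print(sorted(trie))        # ['a', 'b', 'c'], the edges leaving the root
print(count_leaves(trie))  # 5, one leaf per string
```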
23
Compressed Trie
  • Compress unary nodes, label edges by strings

[Figure: the trie before and after compression; unary paths become single edges labeled by strings such as bbf and eef]
24
Suffix tree
Suffix tree of string s: a compressed trie of all suffixes of s.
Prefix-free: add a special character, say $, at the end of s.
25
Suffix tree (Example)
Let s = abab. A suffix tree of s is a compressed trie of all suffixes of s = abab$:
$, b$, ab$, bab$, abab$
[Figure: suffix tree of abab$]
26
Suffix Tree properties
  • Succinct in space - O(n).
  • Can be built in O(n) time [McCreight, Weiner, Ukkonen, Farach-Colton].

27
Exact string matching
s = abab$
[Figure: suffix tree of abab$ with leaves numbered 1-5 by suffix start position]
Given a pattern P = ab we traverse the tree according to the pattern.
28
Exact string matching
s = abab$, P = ab: occurrences at 1 and 3
[Figure: the leaves below the node reached by P are 1 and 3]
Leaves correspond to locations of appearance!
29
Exact string matching
Prepare tree: O(n) time
Find matches: O(m + occ) time, where occ = # of matches
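The traversal can be illustrated with an uncompressed suffix trie. Note the assumptions: this sketch inserts every suffix character by character, so it takes O(n²) time and space rather than the O(n) of a real suffix tree, and the names `build_suffix_trie` and `occurrences` are made up for the example. The query logic (walk down along P, then report all leaves below) is the part it is meant to show.

```python
def build_suffix_trie(s: str):
    """Uncompressed trie of all suffixes of s + '$' (quadratic; illustration only)."""
    s += "$"
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1        # 1-based start of this suffix
    return root


def occurrences(root, pattern: str):
    """Walk down along the pattern, then collect every leaf below that node."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []                   # pattern does not occur
        node = node[ch]
    found, stack = [], [node]
    while stack:
        cur = stack.pop()
        for key, child in cur.items():
            if key == "leaf":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


print(occurrences(build_suffix_trie("abab"), "ab"))  # [1, 3]
```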
30
Lowest common ancestors
A lot more can be gained from the suffix tree if we preprocess it so that we can answer LCA queries on it.
34
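Why?
The LCA of two leaves represents the longest common prefix (LCP) of these 2 suffixes.
Example: s = abbaab$. The suffixes aab$ and abbaab$ have LCP a.
[Figure: suffix tree of abbaab$ with leaves numbered 1-7; the LCA of the two leaves is the node labeled a]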
35
LCA/LCP properties
Preprocessing time: O(n)
Query time: O(1)
[Harel, Tarjan 1984], [Schieber, Vishkin 1988], [Berkman, Vishkin 1993]
36
The Kangaroo Method (for k-mismatches)
  • Create suffix tree for s = P#T
  • Check P at each location i of T by kangarooing
  • Example:
  • P = A B A B A A B A C A B
  • T = A B B A C A B A B A B C A B B C A B C A

45
The Kangaroo Method (for k-mismatches)
Preprocess:
- Build suffix tree of both P and T - O(n+m) time
- LCA preprocessing - O(n+m) time
Check P at given text location:
- Kangaroo jump till next mismatch - O(k) time
Overall time: O(nk)
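The jumping step can be sketched as follows. One loud assumption: instead of a suffix tree with LCA preprocessing, this sketch answers "longest common prefix of P[j:] and T[t:]" from a precomputed O(nm) table, which is simpler to write but gives up the O(n+m) preprocessing bound. The jump loop itself does O(k) work per location, as on the slide. The function name and the 1-based output are illustrative choices.

```python
def kangaroo_k_mismatch(text: str, pattern: str, k: int):
    n, m = len(text), len(pattern)
    # lce[j][t] = length of the longest common prefix of pattern[j:] and text[t:].
    # Stand-in for suffix tree + LCA queries (assumption of this sketch).
    lce = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(m - 1, -1, -1):
        for t in range(n - 1, -1, -1):
            if pattern[j] == text[t]:
                lce[j][t] = lce[j + 1][t + 1] + 1

    hits = []
    for i in range(n - m + 1):
        j, mismatches = 0, 0
        while j < m:
            j += lce[j][i + j]          # kangaroo jump over the matching run
            if j >= m:
                break
            mismatches += 1             # pattern[j] != text[i + j]
            if mismatches > k:
                break
            j += 1                      # step past the mismatch
        if mismatches <= k:
            hits.append(i + 1)          # 1-based location
    return hits


print(kangaroo_k_mismatch("ABCAABCAC", "ABBAAC", 2))  # [1, 4]
```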
53
Boolean Convolutions (FFT) Method
P       = a b a c c a c b a c a b a c c
P_a     = 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0      (a-mask of P)
T       = a b b b c c c a a a a b a c b ...
T_not-a = 0 1 1 1 1 1 1 0 0 0 0 1 0 1 1 ...  (not-a mask of T)
Multiply P_a and T_not-a to count mismatches (use FFT).
56
Boolean Convolutions (FFT) Method
Running Time:
One boolean convolution - O(n log m) time
# of matches of all symbols - O(|Σ| n log m) time
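A sketch of the per-symbol convolution idea in pure Python. Assumptions to note: the small recursive `fft` below is written for this example rather than taken from a library, and the whole text is convolved at once, so the cost per symbol is O((n+m) log(n+m)) instead of the O(n log m) obtained by cutting the text into blocks of length 2m. For each symbol, the pattern mask is reversed so that the convolution computes the sliding product of P_a and T_not-a.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return a[:]
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = -1 if invert else 1
    out = [0] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out


def convolve(x, y):
    """Integer convolution of two 0/1 lists via FFT."""
    size = 1
    while size < len(x) + len(y):
        size *= 2
    fx = fft(list(x) + [0] * (size - len(x)))
    fy = fft(list(y) + [0] * (size - len(y)))
    prod = fft([u * v for u, v in zip(fx, fy)], invert=True)
    return [round((c / size).real) for c in prod[: len(x) + len(y) - 1]]


def mismatch_counts_fft(text: str, pattern: str):
    n, m = len(text), len(pattern)
    total = [0] * (n - m + 1)
    for sym in set(pattern):            # one convolution per symbol
        p_mask = [1 if c == sym else 0 for c in reversed(pattern)]
        t_mask = [0 if c == sym else 1 for c in text]
        conv = convolve(p_mask, t_mask)
        for i in range(n - m + 1):
            total[i] += conv[m - 1 + i]  # mismatches caused by sym at location i
    return total


print(mismatch_counts_fft("ABCAABCAC", "ABBAAC"))  # [2, 4, 6, 2]
```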
57
Counting Method
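Input:
Text T = t1 ... tn, Pattern P = p1 ... pm, Max # of allowed mismatches k
Assumption: Each pattern element is distinct.
Count matches (instead of mismatches).
[Figure: P slid along T = ... b g d e f h d c c a b g h h ...; each text symbol increments the counter of the one location where it matches P]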
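Under the distinctness assumption, each text symbol can match the pattern at only one offset, so a single pass over the text suffices. A minimal sketch (the function name and the 0-based counter indexing are assumptions of this example):

```python
def match_counts_distinct(text: str, pattern: str):
    """counter[i] = number of positions where pattern matches text at location i (0-based)."""
    n, m = len(text), len(pattern)
    pos = {c: j for j, c in enumerate(pattern)}   # the unique position of each symbol in P
    if len(pos) != m:
        raise ValueError("this method assumes all pattern symbols are distinct")
    counter = [0] * max(n - m + 1, 0)
    for t, c in enumerate(text):
        j = pos.get(c)
        if j is None:
            continue                    # symbol does not occur in P
        i = t - j                       # the only location where text[t] lines up with P[j]
        if 0 <= i <= n - m:
            counter[i] += 1
    return counter


counts = match_counts_distinct("abcdxbcdab", "abcd")
print(counts)                                        # [4, 0, 0, 0, 3, 0, 0]
print([i + 1 for i, c in enumerate(counts) if 4 - c <= 1])  # locations with at most k = 1 mismatches
```

A location has at most k mismatches exactly when its counter is at least m - k, and the pass takes O(n) time.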
58
O(n√k log m) Algorithm
Frequent symbol: a symbol that appears at least 2√k times in P.
We distinguish between two cases:
Case 1: At least √k frequent symbols.
Case 2: Fewer than √k frequent symbols.
Case 1: At least √k frequent symbols.
- Consider the first √k frequent symbols.
- For each of them, construct a mask of its first 2√k appearances.
59
Example of Masked Counting
k = 4, 2√k = 4
P = a b a c c a c b a c a b a c c
T = a b b b c c c a a a a b a c b ...
[Figure: the a-mask and c-mask of P aligned against T; for a text symbol a, use the a-mask to increment the counters of the corresponding text locations]
62
Counting Stage
Run through text and count occurrences of all marks.
Time: O(n√k).
Important Observations:
1) Sum of all counters ≤ 2√k · n
2) Every counter whose value is less than k already has more than k errors.
Why? The total # of elements in all masks is 2√k · √k = 2k.
For location i of T, if counter_i < k then no match at location i.
63
How many locations remain?
Sum of all counters ≤ 2n√k
Value of potential matches ≥ k
# of potential matches ≤ 2n√k / k = 2n/√k
How do we check these locations?
Use the Kangaroo Method.
Kangaroo Method Time: O(k) per location
Overall Time: O((n/√k) · k) = O(n√k)
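The counting and filtering stages of Case 1 can be sketched as below. Several assumptions are made for the sake of a runnable example: square roots are rounded up so the code also works when k is not a perfect square, the filter threshold is written as (total mask size - k), which equals k when the masks hold exactly 2k elements, and surviving candidates are verified by direct comparison rather than by kangaroo jumps. All function names are illustrative.

```python
import math
from collections import Counter


def filtered_candidates(text: str, pattern: str, k: int):
    """0-based locations that survive masked counting (Case 1 only)."""
    n, m = len(text), len(pattern)
    mask_len = math.ceil(2 * math.sqrt(k))     # appearances kept per mask
    num_syms = math.ceil(math.sqrt(k))         # frequent symbols used
    freq = Counter(pattern)
    frequent = [c for c in dict.fromkeys(pattern) if freq[c] >= mask_len][:num_syms]
    if len(frequent) < num_syms:
        raise ValueError("too few frequent symbols: this is Case 2")
    masks = {c: [j for j, p in enumerate(pattern) if p == c][:mask_len] for c in frequent}
    total = num_syms * mask_len                # 2k when k is a perfect square

    counter = [0] * (n - m + 1)
    for t, c in enumerate(text):               # each symbol touches at most mask_len counters
        for j in masks.get(c, ()):
            i = t - j
            if 0 <= i <= n - m:
                counter[i] += 1
    # A location with at most k mismatches matches at least total - k mask elements.
    return [i for i in range(n - m + 1) if counter[i] >= total - k]


def k_mismatch_case1(text: str, pattern: str, k: int):
    """1-based k-mismatch locations: filter, then verify the few survivors."""
    m = len(pattern)
    hits = []
    for i in filtered_candidates(text, pattern, k):
        d = sum(1 for j in range(m) if text[i + j] != pattern[j])  # direct check in this sketch
        if d <= k:
            hits.append(i + 1)
    return hits


P = "abaccacbacabacc"                          # pattern from the masked counting example
T = "cb" + "acacbacbacabacc" + "ab"            # a copy of P with two changed symbols, padded
print(k_mismatch_case1(T, P, 4))
```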
64
Case 2
x frequent symbols, x < √k
a) Count all matches of frequent symbols - one boolean convolution per symbol.
Time: O(x n log m) = O(√k n log m)
b) For non-frequent symbols, build full masks.
Symbol non-frequent ⇒ appears < 2√k times in P ⇒ mask size < 2√k
Count time: O(n√k)
65
c) Add results of a) and b) to get the total number of matches at every text location.
Time:
a) O(n√k log m)   b) O(n√k)   c) O(n)
So, Case 2 is O(n√k log m)
Overall Algo. Time: O(n√k log m)
66
Additional Points
1. O(n log k)
For there is a linear
time algorithm - O(
)
2. O( n )
Better tradeoff Define frequent symbol gt

67
O( ) time algorithm
Outline:
1. Find 2k special substrings of pattern.
2. Construct forest data structure combining info of special pattern substrings and text.
3. Use local counting arguments and quick queries to forest data structure to prune candidates.
4. Use kangaroo method to check leftover potential candidates.
68
k-Mismatches and Matrix Multiplication
Or-And matrix multiplication: A × B = C, c_ij = ∨_k (a_ik ∧ b_kj)
Pattern all-mismatch problem: Find all text locations where the pattern mismatches at every character.
[Indyk] If there is an algorithm faster than O(n ) for the Pattern all-mismatch problem, then there is a new method for solving Or-And matrix multiplication faster than O(n³).
69
OPEN PROBLEMS
Hamming Distance in O(n log m) time?
Edit Distance?
Other metrics?