Title: Accelerating Boyer Moore Searches on Binary Texts
1 Accelerating Boyer Moore Searches on Binary Texts
- Shmuel Tomi Klein
- Miri Kopel Ben-Nissan
- Bar Ilan University, ISRAEL
2 Outline
Background and motivation
Boyer Moore algorithm
New binary variant
Analysis
Experiments
Summary
3 Important application of Automata
PATTERN MATCHING
KMP, BDM, BM
Boyer Moore
Match backwards!
this-is-a-sample-text---
pattern
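4 Boyer Moore Algorithm
Mismatch case 1 (delta1)
b does not occur in x
[Figure: text y with the matched segment u preceded by the mismatching character b; pattern x with suffix u preceded by a]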
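5 Boyer Moore Algorithm
Mismatch case 2 (delta1)
b occurs in x
[Figure: text y with the matched segment u preceded by the mismatching character b; pattern x with suffix u preceded by a]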
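The two delta1 cases can be sketched as a single table lookup. The following is a minimal illustration, not the speakers' code: the function names are hypothetical, and delta1 is taken in the classical sense as the distance from the rightmost occurrence of a character to the end of the pattern, or the full pattern length when the character does not occur (case 1).

```python
def delta1_table(pattern):
    """Map each pattern character to its distance from the pattern's end.

    Later (more rightward) occurrences overwrite earlier ones, so the
    stored value always refers to the rightmost occurrence (case 2).
    """
    m = len(pattern)
    table = {}
    for i, ch in enumerate(pattern):
        table[ch] = m - 1 - i
    return table


def delta1(table, ch, m):
    """Case 1: a character absent from the pattern yields the full length m."""
    return table.get(ch, m)


table = delta1_table("example")
print(table)
print(delta1(table, "z", len("example")))
```

For the pattern "example" this gives 0 for e, 1 for l, 2 for p, 3 for m, 4 for a, 5 for x, and 7 for every other character.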
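6 Boyer Moore Algorithm
Mismatch case 3 (delta2)
u reoccurs in x preceded by c ≠ a
[Figure: text y with the matched segment u preceded by the mismatching character b; pattern x with suffix u preceded by a]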
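7 Boyer Moore Algorithm
Mismatch case 4 (delta2)
Only a suffix v of u reoccurs in x
[Figure: text y with the matched segment u preceded by the mismatching character b; pattern x with suffix u preceded by a; v is a suffix of u]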
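Cases 3 and 4 can be covered by one brute-force search for the smallest admissible shift. This is a sketch under assumptions: the function name is hypothetical, indices are 0-based, and the quadratic scan stands in for the linear-time preprocessing a real implementation would use. The conversion at the end follows the original Boyer Moore convention, in which delta2 is the increase of the text pointer (shift plus number of characters already matched).

```python
def good_suffix_shift(pattern, j):
    """Smallest shift s after a mismatch at 0-based index j.

    The matched suffix u = pattern[j+1:] must agree with the pattern
    s positions to the left wherever the two overlap (case 3, or case 4
    when only a suffix v of u still overlaps), and the character that
    lands on position j must differ from pattern[j].
    """
    m = len(pattern)
    for s in range(1, m + 1):
        suffix_ok = all(k - s < 0 or pattern[k - s] == pattern[k]
                        for k in range(j + 1, m))
        differs = j - s < 0 or pattern[j - s] != pattern[j]
        if suffix_ok and differs:
            return s
    return m  # not reached: s == m always satisfies both conditions


pattern = "example"
m = len(pattern)
# Text-pointer increase: shift plus the m - 1 - j characters already matched.
delta2 = [good_suffix_shift(pattern, j) + (m - 1 - j) for j in range(m)]
print(delta2)
```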
8 Boyer Moore Example
Pattern: example
char     a  e  l  m  p  x  rest
delta1   4  0  1  3  2  5  7
Pat      e  x  a  m  p  l  e
delta2  12 11 10  9  8  7  1
9 Problems of Binary Boyer Moore
In standard Boyer Moore, most of the work is done by delta1
On a binary alphabet, delta1 is useless
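A quick way to see why delta1 contributes so little on a binary alphabet is to average it over all binary patterns of a given length. The sketch below is illustrative only; the function names and the choice m = 8 are assumptions, and delta1 is again the distance from the rightmost occurrence of a bit to the pattern's end.

```python
from itertools import product


def delta1(pattern, ch):
    """Distance from the rightmost ch to the pattern's end, or len(pattern) if absent."""
    pos = pattern.rfind(ch)
    return len(pattern) if pos < 0 else len(pattern) - 1 - pos


def average_binary_delta1(m):
    """Average delta1 over every binary pattern of length m and both bit values."""
    values = [delta1("".join(bits), ch)
              for bits in product("01", repeat=m)
              for ch in "01"]
    return sum(values) / len(values)


print(average_binary_delta1(8))
print(delta1("1010101011101101", "0"), delta1("1010101011101101", "1"))
```

The average works out to 1 - 2^-m, so it stays below one position however long the pattern is, whereas on a large alphabet many characters do not occur in the pattern at all and allow a shift by the full pattern length.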
10 Need for Binary Boyer Moore
Compressed Matching
Given E(T) and P, look for E(P) in E(T) rather than P in D(E(T))
(E = encoding, D = decoding)
Suggested Solution: BBBMM (Blocked Binary Boyer Moore Matching)
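Compressed matching can be illustrated with a toy prefix code. Everything in this sketch is an assumption made for illustration: the three-symbol code table, the function names, and the use of Python's built-in substring search in place of a Boyer Moore variant. It also shows a known caveat of searching E(P) in E(T): a bit-level hit need not start on a codeword boundary, so hits may have to be verified.

```python
CODE = {"a": "0", "b": "10", "c": "11"}  # hypothetical toy prefix code


def encode(s, code=CODE):
    """E(s): concatenate the codewords of the characters of s."""
    return "".join(code[ch] for ch in s)


def codeword_starts(s, code=CODE):
    """Bit offsets in E(s) at which a codeword begins."""
    starts, pos = set(), 0
    for ch in s:
        starts.add(pos)
        pos += len(code[ch])
    return starts


def compressed_find(text, pattern, code=CODE):
    """All bit offsets at which E(pattern) occurs in E(text)."""
    et, ep = encode(text, code), encode(pattern, code)
    hits, i = [], et.find(ep)
    while i != -1:
        hits.append(i)
        i = et.find(ep, i + 1)
    return hits


hits = compressed_find("bab", "a")
aligned = [i for i in hits if i in codeword_starts("bab")]
print(hits, aligned)
```

In this example E("bab") is 10010 and E("a") is 0, so the bit-level search reports three hits while only one of them is aligned with a codeword boundary.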
11 BBBMM
12 BBBMM
More information in binary case
ASCII: ffghabdgttiocb sbgghj
BINARY: 01100010 01101010
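One way to picture the blocked view is to cut the bit string into k-bit blocks and handle each block as a single integer, so that one comparison covers k bits. The helper below is a hypothetical sketch; zero-padding the last block is an assumption, and k = 8 is chosen to match byte-sized blocks.

```python
def pack_bits(bits, k=8):
    """Split a bit string into k-bit blocks and return them as integers.

    The last block is padded with zeros on the right if needed.
    """
    padded = bits + "0" * (-len(bits) % k)
    return [int(padded[i:i + k], 2) for i in range(0, len(padded), k)]


blocks = pack_bits("0110001001101010")
print(blocks, bytes(blocks))
```

The two blocks on the slide, 01100010 and 01101010, are the byte values of the ASCII characters b and j.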
13 BBBMM
extended delta1
14 BBBMM
Total size of delta1 tables
If too large, use limit value
Size of delta1 tables reduced to
15 BBBMM
Original delta1: increase of text pointer
BBBMM delta1: shift size
Mismatch not in last block
Correct sh,j
16 BBBMM
delta2
j          1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Pat[j]     1  0  1  0  1  0  1  0  1  1  1  0  1  1  0  1
delta2[j] 13 13 13 13 13 13 13 13 13 13 13  3  7 15  2  1
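The delta2 values in this table can be reproduced, and used in a plain bit-wise search that relies on delta2 alone. The sketch below makes several assumptions: delta2 is read as a pure shift size (not a text-pointer increase), the pattern read from j = 1 to j = 16 is 1010101011101101, the function names are hypothetical, and the table is built by brute force rather than by a linear-time method. It is a bit-wise baseline, not the blocked BBBMM algorithm.

```python
def delta2_shifts(pat):
    """Shift size for a mismatch at each 0-based index j (brute force)."""
    m = len(pat)

    def admissible(j, s):
        # Matched suffix must agree with the pattern moved s places left,
        # and the character landing on j must differ from pat[j].
        suffix_ok = all(k < s or pat[k - s] == pat[k] for k in range(j + 1, m))
        return suffix_ok and (j < s or pat[j - s] != pat[j])

    return [next(s for s in range(1, m + 1) if admissible(j, s))
            for j in range(m)]


def period(pat):
    """Smallest shift that maps the pattern onto itself (used after a full match)."""
    m = len(pat)
    return next(s for s in range(1, m + 1)
                if all(pat[k - s] == pat[k] for k in range(s, m)))


def bitwise_bm(text, pat):
    """Right-to-left matching on a bit string, shifting by delta2 only."""
    m, shifts, after_match = len(pat), delta2_shifts(pat), period(pat)
    hits, i = [], 0
    while i + m <= len(text):
        j = m - 1
        while j >= 0 and text[i + j] == pat[j]:
            j -= 1
        if j < 0:
            hits.append(i)
            i += after_match
        else:
            i += shifts[j]
    return hits


PAT = "1010101011101101"
print(delta2_shifts(PAT))
print(bitwise_bm("00000" + PAT + "11" + PAT, PAT))
```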
17 Analysis
Assumption: random input
Reasonable for compressed text
Expected comparisons till mismatch
Bit-wise
Blocked
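Under the random-input assumption, the expected number of comparisons until a mismatch can be worked out directly. The sketch below is not the formula from the talk; it assumes independent, uniformly distributed bits, so that a single bit matches with probability 1/2 and a full k-bit block with probability 2^-k, and it ignores any special treatment of partial blocks. The function name and the parameter values are hypothetical.

```python
def expected_comparisons(p_match, units):
    """Expected comparisons before stopping at a mismatch or the pattern's end.

    The i-th comparison (i = 0, 1, ...) takes place only if the previous
    i comparisons all matched, which happens with probability p_match ** i.
    """
    return sum(p_match ** i for i in range(units))


m, k = 16, 8
bitwise = expected_comparisons(1 / 2, m)            # one bit per comparison
blocked = expected_comparisons(2 ** -k, -(-m // k))  # one k-bit block per comparison
print(bitwise, blocked)
```

With these assumptions the bit-wise value approaches 2 as the pattern grows, while the blocked value stays close to 1.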
18 Analysis
Expected bits shifted after mismatch
Bit-wise M
Blocked M
19 Experiments
English Bible (2.5MB)
World Factbook (1.5MB)
Text: Huffman encoded
k = 8
Patterns: random substrings of lengths 10 to 500
20 Experiments
Average comparisons between shifts
21 Experiments
Average size of shifts
Bit-wise
22 Experiments
Average comparisons for 1000 bits
23 Experiments
Time to locate first occurrence (ms)
24 Summary
Blocked variant of BM
Faster than alternatives; overhead 1-10 K
Extensions: ASCII, words instead of characters
25 Thank you!