An Online Algorithm for Finding the Longest Previous Factors presentation

Title: An Online Algorithm for Finding the Longest Previous Factors

1
An Online Algorithm for Finding the Longest
Previous Factors
ESA 2008 at Universität Karlsruhe, Sep 15, 2008
Kunihiko Sadakane, Kyushu University

Daisuke Okanohara, University of Tokyo

2
Problem: Finding the longest previous factors
(matching)

Input: a text T[0..n-1]
At every position k, report the longest substring
T[k-len+1..k] that also occurs at a previous
position (in the history): T[pos..pos+len-1] =
T[k-len+1..k]
cf. LZ77, LZ-factorization

(pos, len) = (0, 4)
(pos, len) = (5, 2)
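
The problem statement above can be checked against a brute-force reference. The following is a minimal sketch, not the algorithm of the talk: the function name and the convention of returning (-1, 0) when nothing matches are assumptions made for illustration, and the previous occurrence is assumed to be any occurrence that ends before position k (overlaps allowed).

```python
def longest_previous_match(text, k):
    """Return (pos, length) of the longest substring ending at k
    that also ends at some earlier position j < k."""
    best_pos, best_len = -1, 0
    for j in range(k):
        # Extend the match backwards from j and k at the same time.
        length = 0
        while length <= j and text[j - length] == text[k - length]:
            length += 1
        if length > best_len:
            best_pos, best_len = j - length + 1, length
    return best_pos, best_len


print(longest_previous_match("abaababaa", 8))
```

For the text abaababaa used later in the slides, the call at the last position finds the match abaa starting at position 0.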
3
Applications

Data Compression
LZ77, Prediction by Partial Matching
Pattern Analysis
Log analysis
Data Mining

4
Previous approaches

Sequential search on the fly
O(n^2) time for a text of length n
Offline index approach
Read the whole text beforehand and build an index
(suffix array/tree) for it.
Search for the match using the index [Chen 07, Chen
08, Crochemore 08, Kolpakov 01, Larsson 99]
6n bytes and O(n log n) time [Chen 08]: suffix
arrays with range minimum queries

5
New problem: finding the longest previous factors
online

Report match information just after reading each
character
This covers the case where we don't know the length of the data
beforehand, e.g. streaming data
Previous approaches cannot deal with this problem

6
Our approach to the new problem

Online construction of enhanced prefix arrays
Update the index just after reading each character
Although many methods used in LZ77 cannot report
the longest match, our method can.
Succinct data structures
Keep all information very compactly, using about
the same space as the original text

7
Prefix arrays

Keep prefix arrays (PA), NOT suffix arrays (SA),
because when a character is appended to the end of
a text, SA may undergo Ω(n) changes, but PA does not
In PA, prefixes are sorted in
reverse-lexicographic order

T = aaaa, Tnew = aaaaz
SA for T:    0, 1 a, 2 aa, 3 aaa, 4 aaaa
SA for Tnew: 0, 4 aaaaz, 3 aaaz, 2 aaz, 1 az, 5 z
PA for T:    0, 1 a, 2 aa, 3 aaa, 4 aaaa
PA for Tnew: 0, 1 a, 2 aa, 3 aaa, 4 aaaa, 5 aaaaz
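
The effect of appending a character can be reproduced with naive sorting. This is only an illustrative sketch: the helper names are hypothetical, suffixes are identified by their start position and prefixes by their length, and no sentinel character is used.

```python
def suffix_order(text):
    """Start positions of all suffixes, in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def prefix_order(text):
    """Lengths of all non-empty prefixes, in reverse-lexicographic order
    (prefixes are compared from their last character backwards)."""
    return sorted(range(1, len(text) + 1), key=lambda e: text[:e][::-1])


# Appending z reverses the relative order of the old suffixes ...
print(suffix_order("aaaa"), suffix_order("aaaaz"))
# ... while the old prefixes keep their order and the new one is added.
print(prefix_order("aaaa"), prefix_order("aaaaz"))
```

In the suffix ordering every old entry changes its rank after the append, whereas the prefix ordering only gains one new entry.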
8
Our idea

Weiner's suffix tree construction algorithm
Insert the suffixes from the shortest ones
Modify it to insert the prefixes from the
shortest ones
A similar idea is used for the incremental
construction of compressed suffix arrays [Chan
et al. 2007, Lippert 2005]
We extend this work to the succinct version
Our algorithm reports matching information as a
by-product of construction
It does not require a tree representation; we use only
array information

9
Preliminary: Dynamic Rank/Select Dictionary (DRSD)

For a text T[0..n-1], DRSD supports
rank(T, c, i): return the number of occurrences of c in T[0..i]
select(T, c, i): return the position of the i-th c in
T
insert(T, c, i): insert c at T[i]
delete(T, i): delete T[i]
These operations can be supported in [...] time
(O(log n) time if σ < log n) and [...] bits of space,
where σ is the alphabet size [Lee et al. 07]
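
The four operations can be illustrated with a plain Python list. This is a naive sketch with linear-time operations, not the succinct structure cited above; the class name is hypothetical, and counting occurrences from 0 in select is an assumption, since the slide does not state the convention.

```python
class NaiveDictionary:
    """List-backed stand-in for a dynamic rank/select dictionary."""

    def __init__(self, text=""):
        self.items = list(text)

    def rank(self, c, i):
        """Number of occurrences of c in items[0..i] (inclusive)."""
        return self.items[: i + 1].count(c)

    def select(self, c, i):
        """Position of the i-th occurrence of c, counting from 0 (assumed)."""
        seen = 0
        for pos, ch in enumerate(self.items):
            if ch == c:
                if seen == i:
                    return pos
                seen += 1
        return None

    def insert(self, c, i):
        """Insert c so that it becomes items[i]."""
        self.items.insert(i, c)

    def delete(self, i):
        """Remove items[i]."""
        del self.items[i]


d = NaiveDictionary("abbab")
print(d.rank("b", 2), d.select("b", 2))
```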
10
Preliminary: Range Minimum Query (RMQ)

Given an array E[0..n-1] of elements from a totally
ordered set, rmq(E, l, r) returns the index of
the smallest element in E[l..r],
i.e. rmq(E, l, r) = argmin_{k ∈ [l, r]} E[k];
in case of a tie, return the leftmost such element
In the static case, RMQ can be supported in O(1)
time using 2n + o(n) bits of space [Fischer 2007]
In the dynamic case, RMQ/insert/delete can be
supported in O(T log n) time using O(n) bits if the
lookup cost of E[i] is O(T)
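
The definition of rmq, including the leftmost tie-breaking rule, can be written as a linear scan. This is a reference sketch only; it has none of the time or space guarantees quoted above.

```python
def rmq(values, left, right):
    """Index of the smallest element in values[left..right] (inclusive);
    the leftmost index wins when the minimum occurs more than once."""
    best = left
    for k in range(left + 1, right + 1):
        if values[k] < values[best]:  # strict comparison keeps the leftmost
            best = k
    return best


print(rmq([3, 1, 4, 1, 5], 0, 4))
```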

11
Data structures

Keep the following data structures for T[0..k]
Assume T[0] = $ is the unique smallest character
B[0..k]: (prefix-) BW-transformed text
B[i] = T[PA[i]+1], and B[i] = $ if PA[i] = k
H[0..k]: height array
(explained on the following slides)
C[0..σ-1]: cumulative array (σ is the alphabet size)
C[c] = the total number of characters c' in T such that c'
< c
s: the position of the next prefix to be inserted

12
T = abaababa
13
T = abaababa
PA stores the end position of each prefix (we
will omit this)
Prefix stores the prefixes sorted in
reverse-lexicographic order (neither PA nor
Prefix is stored explicitly)
We can examine PA[i] with an SAlookup operation in O(log^2 n)
time, as in the FM-index [Ferragina 00]
14
B stores the next character for each
prefix (Burrows-Wheeler transform for prefix
arrays)
T = abaababa
15
H stores the length of the longest common suffix
between adjacent prefixes
T = abaababa
16
T = abaababa
s = 4: s denotes the position in B where $, and hence
the longest prefix, is placed.
17
T = abaababa
C[c]: the number of characters c' in T (B) that are
smaller than c
C[$] = 0
C[a] = 1
C[b] = 6
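
The arrays described on the preceding slides can be built naively by sorting, which may help when checking small examples by hand. This sketch rests on several assumptions that are not stated in the transcript: the text is written with a leading sentinel $, rows are numbered from 0 with the one-character prefix $ as row 0, and H[i] compares row i with row i-1. Row numbers therefore need not coincide with those in the slide figures.

```python
def common_suffix_len(x, y):
    """Length of the longest common suffix of strings x and y."""
    n = 0
    while n < min(len(x), len(y)) and x[-1 - n] == y[-1 - n]:
        n += 1
    return n


def build_arrays(text):
    """Naively build PA, B, H and C for a text that starts with '$'."""
    k = len(text) - 1
    # PA: end positions of the prefixes in reverse-lexicographic order.
    pa = sorted(range(k + 1), key=lambda e: text[: e + 1][::-1])
    # B: the character following each prefix ('$' for the longest prefix).
    b = [text[e + 1] if e < k else "$" for e in pa]
    # H: common suffix length of each prefix with the one sorted before it.
    h = [0] + [
        common_suffix_len(text[: pa[i - 1] + 1], text[: pa[i] + 1])
        for i in range(1, k + 1)
    ]
    # C: for each character, how many characters of the text are smaller.
    c = {ch: sum(1 for x in text if x < ch) for ch in sorted(set(text))}
    return pa, b, h, c


pa, b, h, c = build_arrays("$abaababa")
print(pa, "".join(b), h, c)
```

Under these assumptions the cumulative counts come out as C[$] = 0, C[a] = 1, and C[b] = 6.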
18
T = abaababaa
The next character a arrives!
19
T = abaababaa
Replace $ in B[s] with a (because $ is placed at
the position of the longest prefix)
20
T = abaababaa
Find the position for the new prefix abaababaa
Count the number of a in B[0..s-1]: rank(B, a,
s-1) = 2
21
T = abaababaa
Insert abaababaa at the 3rd position:
C[a] + rank(B, a, s-1) = 3
s = C[a] + rank(B, a, s-1), insert(B, s, $)
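
The update of B for one new character can be sketched with a plain list. This is an illustration under assumptions, not the succinct implementation: rows are numbered from 0, the structure starts from the single prefix $, C[c] is recomputed from a dictionary of character counts, and the function names are hypothetical.

```python
def append_char(b, s, counts, c):
    """Update B in place for a new character c; return the new s."""
    b[s] = c                                                 # replace '$' with c
    smaller = sum(n for ch, n in counts.items() if ch < c)   # C[c]
    new_s = smaller + b[:s].count(c)                         # C[c] + rank(B, c, s-1)
    b.insert(new_s, "$")                                     # row of the new prefix
    counts[c] = counts.get(c, 0) + 1
    return new_s


def build_b_online(text):
    """Feed text character by character; return (B as a string, s)."""
    b, s, counts = ["$"], 0, {"$": 1}
    for c in text:
        s = append_char(b, s, counts, c)
    return "".join(b), s


print(build_b_online("abaababa"))
print(build_b_online("abaababaa"))
```

With these conventions, appending the final a moves $ to row 3, which agrees with C[a] + rank(B, a, s-1) = 1 + 2.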
22
T = abaababaa
Update H. This is actually the length of the
longest match in the history
23
T = abaababa
Recall that in the previous step, abaa and
aba, the neighbors of the new prefix, were formed from
the prefixes whose B is a
These positions can be found by using rank and
select, cf. succ(T, c, s) = select(T, c,
rank(T, c, s))
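
Finding the nearest rows above and below s whose B entry is a given character can be expressed with rank and select alone. In this sketch rank counts occurrences in the inclusive range [0, s] and select counts occurrences from 0; both conventions are assumptions chosen so that the succ formula can be used literally, and pred is a hypothetical companion function that does not appear in the transcript.

```python
def rank(seq, c, i):
    """Number of occurrences of c in seq[0..i] (inclusive)."""
    return seq[: i + 1].count(c)


def select(seq, c, i):
    """Position of the i-th occurrence of c, counting from 0; None if absent."""
    seen = 0
    for pos, ch in enumerate(seq):
        if ch == c:
            if seen == i:
                return pos
            seen += 1
    return None


def succ(seq, c, s):
    """First position after s that holds c."""
    return select(seq, c, rank(seq, c, s))


def pred(seq, c, s):
    """Last position before s that holds c."""
    r = rank(seq, c, s - 1)
    return select(seq, c, r - 1) if r > 0 else None


b = list("abbab$aaa")  # an example B with '$' at position 5
print(pred(b, "a", 5), succ(b, "a", 5))
```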
24
T = abaababa
RMQ(H, 4, 6) = 5, H[5] = 0. Therefore H[RMQ(H, 4,
6)] + 1 = 1 is the new value for the next H entry
25
T = abaababa
RMQ(H, 3, 3) = 3, H[3] = 3. Therefore H[RMQ(H, 3,
3)] + 1 = 4 is the new value for the next H entry
26
T = abaababaa
H[rmq(H, 3, 3)] + 1
H[rmq(H, 4, 6)] + 1
27
T = abaababaa
Report max(4, 1) = 4 as the length of
the longest factor, and report the position of
abaa as SAlookup[2] - len = 0. Report (pos = 0,
len = 4) as the maximal match
28
Overall algorithm
All operations are rank, select, RMQ
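
The whole update loop can be sketched with plain lists standing in for the succinct structures. Everything below is an assumption-laden illustration rather than the implementation from the talk: PA is stored explicitly instead of being recovered by SAlookup, rank, select and RMQ are replaced by linear scans, rows are numbered from 0, H[i] compares row i with row i-1, ties between the two neighbors go to the preceding row, and (-1, 0) is reported when there is no match.

```python
def online_lpf(text):
    """Report (pos, len) after each character of text ('$' must not occur)."""
    pa, b, h = [0], ["$"], [0]   # a single row: the prefix "$"
    s = 0                        # row of the longest prefix ('$' in B)
    counts = {"$": 1}            # character counts, used to derive C[c]
    reports = []
    for k, c in enumerate(text, start=1):
        b[s] = c
        # Nearest rows above and below s whose B entry is c.
        pred = max((i for i in range(s) if b[i] == c), default=None)
        succ = min((i for i in range(s + 1, len(b)) if b[i] == c), default=None)
        # Range minima over H give the common suffix with each neighbor.
        len_prev = 1 + min(h[pred + 1 : s + 1]) if pred is not None else 0
        len_next = 1 + min(h[s + 1 : succ + 1]) if succ is not None else 0
        # Row of the new prefix: C[c] + rank(B, c, s-1).
        new_s = sum(n for ch, n in counts.items() if ch < c) + b[:s].count(c)
        b.insert(new_s, "$")
        pa.insert(new_s, k)
        h.insert(new_s, len_prev)
        if new_s + 1 < len(h):
            h[new_s + 1] = len_next
        counts[c] = counts.get(c, 0) + 1
        s = new_s
        # The longer of the two neighbor matches is the answer.
        if len_prev >= len_next:
            length, row = len_prev, new_s - 1
        else:
            length, row = len_next, new_s + 1
        reports.append((pa[row] - length, length) if length else (-1, 0))
    return reports


print(online_lpf("abaababaa"))
```

For the running example the last report is (0, 4), the match abaa at position 0; positions refer to the text without the sentinel.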
29
Overall Analysis

H is stored in 2n bits [Sadakane, SODA 02]
A naïve representation requires O(n log n) bits
Decoding an entry requires one SA lookup operation
B is stored in n log σ + o(n log σ) bits (σ: alphabet size)
by using a dynamic rank/select dictionary
The bottleneck of our algorithm is rmq(H, l, r),
which requires O(log^3 n) time
SAlookup requires O(log^2 n) time

30
Overall Analysis (cont.)

We can solve the online longest previous factor
problem in O(log^3 n) time for each character,
using n log_2 σ + o(n log σ) + O(n) bits of space,
where σ is the alphabet size and n is the length
of the text

31
Simulating window buffer

If the working space is limited, we often
discard the history starting from the oldest part
We can simulate this by using almost the same
operations as in the insertion operation
We do not actually discard a character but ignore
it
If we actually discarded the oldest character, it
might cause Ω(n) changes in B and H
The effect of a discarded character remains
(prefixes are still sorted according to the discarded
characters)
But this does not cause a problem if we only
report matching information up to the history
size

32
Experiments

In the experiments, we used a simpler data structure
(the algorithm is the same)
B and H are stored in a balanced binary tree
Each leaf stores a small block of B and H
We call this implementation OS
Compare OS with offline algorithms
These require reading the whole text beforehand
CPSa, CPSd: SA+LCP with stack [Chen et al. 07]
CPS6n: SA with RMQ [Chen et al. 08]
kk-lz: mreps, specialized for alphabet size σ = 4 [Kolpakov 01]

33
Peak memory usage in bytes per input symbol

The space of OS is the smallest on many real data sets,
especially when the values in H are small

34
Runtime in milliseconds for searching the longest
previous factors

OS is about 2 to 10 times slower than the fastest
ones due to the dynamic operations

35
Conclusion

Solves the online longest matching problem by using
enhanced prefix arrays
Simple and easy to implement
Requires about 3 to 6 times the space of the input text
Actually, this is a by-product of the construction of
compressed suffix trees, cf. Weiner's algorithm
Simple, with much room for improvement
by using better rank/select/rmq implementations

36
Future work

Construction of compressed suffix trees
Update the parenthesis tree efficiently
Actually, the time complexity for this is smaller
Practical improvements
Currently, dynamic succinct data structures are not
efficient due to cache misses and memory
fragmentation
Approximate version of the longest matching problem
Enough for many applications

Thank you for your attention!
38
Weiner's suffix tree construction algorithm
a abraca abracada abra ab abracadab abracad abr
a abraca abracada abra ab abracad abr
a abraca abracada abra ab abracadab abracad abr abracadabr
[Figure: suffix tree diagrams for the example above]
