Title: An Online Algorithm for Finding the Longest Previous Factors
1. An Online Algorithm for Finding the Longest Previous Factors
ESA 2008 at Universität Karlsruhe, Sep 15, 2008
- Kunihiko Sadakane, Kyushu University
- Daisuke Okanohara, University of Tokyo
2. Problem: Finding the longest previous factors (matching)
- Input: a text T[0..n-1]
- At every position k, report the longest substring T[k-len+1..k] that also occurs at a previous position (in the history): T[pos..pos+len-1] = T[k-len+1..k]
- c.f. LZ77, LZ-factorization
(pos, len) = (0, 4)
(pos, len) = (5, 2)
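The problem statement can be illustrated with a brute-force sketch. It assumes 0-based indexing and that an earlier occurrence only has to end before position k, so it may overlap the current substring. The function name `longest_previous_factor` is hypothetical, and this is a naive baseline, not the method of these slides.

```python
def longest_previous_factor(text, k):
    """Return (pos, length) of the longest substring ending at k
    that also ends at some earlier position j < k."""
    best_pos, best_len = None, 0
    for j in range(k):  # candidate earlier end position
        length = 0
        # extend the match backwards from j and k at the same time
        while length <= j and text[j - length] == text[k - length]:
            length += 1
        if length > best_len:
            best_pos, best_len = j - length + 1, length
    return best_pos, best_len


# For T = abaababaa, the suffix "abaa" ending at k = 8 also occurs at position 0.
print(longest_previous_factor("abaababaa", 8))  # (0, 4)
```

When no earlier occurrence exists, the sketch returns `(None, 0)`.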
3. Applications
- Data Compression
- LZ77, Prediction by Partial Matching
- Pattern Analysis
- Log analysis
- Data Mining
4. Previous approaches
- Sequential search on the fly
  - O(n^2) time for a text of length n
- Offline index approach
  - Read the whole text beforehand and build an index (suffix arrays/trees) for it
  - Search for the match using the index [Chen 07, Chen 08, Crochemore 08, Kolpakov 01, Larsson 99]
  - 6n bytes and O(n log n) time [Chen 08]: suffix arrays with range minimum queries
5. New problem: Online finding of the longest previous factors
- Report match information just after reading each character
- A case where we don't know the length of the data beforehand, e.g. streaming data
- Previous approaches cannot deal with this problem
6. Our approach for the new problem
- Online construction of enhanced prefix arrays
  - Update an index just after reading each character
  - Although many methods used in LZ77 cannot report the longest match, our method can
- Succinct data structures
  - Keep all information very compactly, using about the same space as the original text
7. Prefix arrays
- Keep NOT suffix arrays (SA), but prefix arrays (PA)
  - because when a character is appended to the end of a text, SA may undergo Ω(n) changes, but PA does not
- In PA, prefixes are sorted in reverse-lexicographic order

T = aaaa, Tnew = aaaaz

SA for T:    0 | 1 a | 2 aa | 3 aaa | 4 aaaa
SA for Tnew: 0 | 4 aaaaz | 3 aaaz | 2 aaz | 1 az | 5 z
PA for T:    0 | 1 a | 2 aa | 3 aaa | 4 aaaa
PA for Tnew: 0 | 1 a | 2 aa | 3 aaa | 4 aaaa | 5 aaaaz
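The aaaa / aaaaz example can be replayed with a small sketch. It assumes plain sorting instead of any index structure; `suffix_order` and `prefix_order` are hypothetical helper names, and suffixes are identified by start position while prefixes are identified by length.

```python
def suffix_order(text):
    """Start positions of all suffixes, sorted lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def prefix_order(text):
    """Lengths of all non-empty prefixes, sorted by reading each prefix backwards."""
    return sorted(range(1, len(text) + 1), key=lambda j: text[:j][::-1])


old, new = "aaaa", "aaaaz"

# Appending z reverses the relative order of every old suffix.
print(suffix_order(old))                                   # [3, 2, 1, 0]
print([i for i in suffix_order(new) if i < len(old)])      # [0, 1, 2, 3]

# The old prefixes keep their order; the new prefix is simply added.
print(prefix_order(old))                                   # [1, 2, 3, 4]
print(prefix_order(new))                                   # [1, 2, 3, 4, 5]
```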
8. Our idea
- Weiner's suffix tree construction algorithm
  - Inserts the suffixes from the shortest one
  - Modify it to insert the prefixes from the shortest one
- A similar idea is used for the incremental construction of compressed suffix arrays [Chan et al. 2007, Lippert 2005]
  - We extend this work to the succinct version
- Our algorithm reports matching information as a by-product of construction
- It does not require a tree representation; we just use array information
9. Preliminary: Dynamic Rank/Select Dictionary (DRSD)
- For a text T[0..n-1], DRSD supports
  - rank(T, c, i): return the number of occurrences of c in T[0..i]
  - select(T, c, i): return the position of the i-th c in T
  - insert(T, c, i): insert c at T[i]
  - delete(T, i): delete T[i]
- These operations can be supported in [...] time (O(log n) time if σ < log n) and [...] bits of space, where σ is the alphabet size [Lee et al. 07]
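The four operations can be sketched with a plain Python list. This only illustrates the interface: each call takes linear time, unlike the succinct dictionaries cited on the slide. The class name `NaiveDRSD` is hypothetical, and counting the i-th occurrence from 1 in `select` is an assumption.

```python
class NaiveDRSD:
    """List-backed stand-in for a dynamic rank/select dictionary."""

    def __init__(self, text=""):
        self.data = list(text)

    def rank(self, c, i):
        """Number of occurrences of c in data[0..i] (inclusive)."""
        return self.data[: i + 1].count(c)

    def select(self, c, i):
        """Position of the i-th occurrence of c (i counted from 1), or -1."""
        seen = 0
        for pos, ch in enumerate(self.data):
            if ch == c:
                seen += 1
                if seen == i:
                    return pos
        return -1

    def insert(self, c, i):
        """Insert c so that it becomes data[i]."""
        self.data.insert(i, c)

    def delete(self, i):
        """Remove data[i]."""
        del self.data[i]


d = NaiveDRSD("abracadabra")
print(d.rank("a", 3))    # 2  (a, b, r, a)
print(d.select("a", 2))  # 3
```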
10. Preliminary: Range Minimum Query (RMQ)
- Given an array E[0..n-1] of elements from a totally ordered set, rmq(E, l, r) returns the index of the smallest element in E[l..r]
  - i.e. rmq(E, l, r) = argmin_{k ∈ [l, r]} E[k]
  - returns the leftmost such element in case of a tie
- In the static case, RMQ can be supported in O(1) time using 2n + o(n) bits of space [Fischer, 2007]
- In the dynamic case, RMQ/insert/delete can be supported in O(T log n) time using O(n) bits if the lookup cost of E[i] is O(T)
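A linear-scan sketch of the rmq definition, included only to pin down the semantics (index of the minimum, leftmost on ties). It is not the constant-time or dynamic structure cited above, and the array `E` is an illustrative input.

```python
def rmq(E, l, r):
    """Index of the smallest element in E[l..r] (inclusive), leftmost on ties."""
    best = l
    for k in range(l + 1, r + 1):
        if E[k] < E[best]:  # strict comparison keeps the leftmost minimum
            best = k
    return best


E = [0, 1, 1, 3, 3, 0, 2, 2]
print(rmq(E, 4, 6))  # 5, because E[5] = 0
print(rmq(E, 3, 3))  # 3
print(rmq(E, 1, 2))  # 1, the leftmost of the two equal values
```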
11. Data structures
- Keep the following data structures for T[0..k]
  - Assume T[0] = $ is the unique smallest character
- B[0..k]: (Prefix-) BW-transformed text
  - B[i] = T[PA[i]+1], and B[i] = $ if PA[i] = k
- H[0..k]: Height array
  - will be explained in the next slide
- C[0..σ-1]: Cumulative array
  - C[c]: the total number of characters c' in T such that c' < c
- s: the position for the next prefix to be inserted
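A from-scratch sketch of these arrays, built by direct sorting rather than online. Several conventions here are assumptions: rows are numbered from 0 and include a row for the empty prefix (standing in for $), PA holds prefix lengths, H[i] is the common-suffix length of rows i and i+1, and `build_structures` is a hypothetical name. Row numbers may therefore be offset from the slide figures.

```python
def build_structures(text):
    n = len(text)
    # PA: prefix lengths 0..n, ordered by comparing prefixes from their last character backwards
    PA = sorted(range(n + 1), key=lambda j: text[:j][::-1])
    # B: the character that follows each prefix; "$" marks the longest prefix
    B = [text[j] if j < n else "$" for j in PA]

    def lcs(x, y):
        """Length of the longest common suffix of x and y."""
        length = 0
        while length < min(len(x), len(y)) and x[-1 - length] == y[-1 - length]:
            length += 1
        return length

    # H: common-suffix length between adjacent rows
    H = [lcs(text[:PA[i]], text[:PA[i + 1]]) for i in range(n)]
    # C: number of smaller characters, counting "$" once
    C = {"$": 0}
    for c in sorted(set(text)):
        C[c] = 1 + sum(1 for x in text if x < c)
    s = B.index("$")  # row of the longest prefix
    return PA, B, H, C, s


PA, B, H, C, s = build_structures("abaababa")
print("".join(B))  # abbab$aaa
print(H)           # [0, 1, 1, 3, 3, 0, 2, 2]
print(C)           # {'$': 0, 'a': 1, 'b': 6}
```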
12. T = abaababa
13. T = abaababa
PA stores the end position of each prefix (we will omit this).
Prefix stores the prefixes sorted in reverse-lexicographic order.
(Neither PA nor Prefix is stored explicitly.)
We can examine PA[i] with the SAlookup operation in O(log^2 n) time, as in the FM-index [Ferragina 00].
14. T = abaababa
B stores the next character for each prefix (Burrows-Wheeler transform for prefix arrays).
15. T = abaababa
H stores the length of the longest common suffix between adjacent prefixes.
16. T = abaababa
s = 4. s denotes the position in B where $, and thus the longest prefix, is placed.
17. T = abaababa
C[c]: the number of characters c' in T (B) that are smaller than c
C[$] = 0
C[a] = 1
C[b] = 6
18. T = abaababaa
The next character a comes!
19. T = abaababaa
Replace $ in B[s] with a (because $ is placed at the position of the longest prefix).
20. T = abaababaa
Find the position for the new prefix abaababaa. Count the number of a's in B[0..s-1]: rank(B, a, s-1) = 2
21. T = abaababaa
Insert abaababaa at the 3rd position, among the prefixes ending with a: C[a] + rank(B, a, s-1) = 3
s = C[a] + rank(B, a, s-1), insert(B, $, s)
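The position computation C[a] + rank(B, a, s-1) can be checked against direct sorting. This sketch assumes rows are numbered from 0 with the empty prefix as row 0, so the value of s used here follows that numbering and may be offset from the figures. `insertion_row` and `row_by_sorting` are hypothetical names.

```python
def insertion_row(B, C, s, c):
    """Row of the new longest prefix after appending c: C[c] + rank(B, c, s-1)."""
    return C[c] + B[:s].count(c)


def row_by_sorting(text):
    """Row of the full text among all its prefixes, found by sorting from scratch."""
    rows = sorted(range(len(text) + 1), key=lambda j: text[:j][::-1])
    return rows.index(len(text))


# State for T = abaababa under the assumed numbering; "$" sits in row 5.
B = list("abbab$aaa")
C = {"$": 0, "a": 1, "b": 6}
print(insertion_row(B, C, 5, "a"))  # 3  (= 1 + 2)
print(row_by_sorting("abaababaa"))  # 3
```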
22. T = abaababaa
Update H. This is actually the length of the longest match in the history.
23. T = abaababa
Recall that in the previous step, abaa and aba correspond to the prefixes whose B is a. These positions can be found by using rank and select,
c.f. succ(T, c, s) = select(T, c, rank(T, c, s))
24. T = abaababa
RMQ(H, 4, 6) = 5, H[5] = 0. Therefore H[RMQ(H, 4, 6)] + 1 = 1 is the new value for the next H entry.
25. T = abaababa
RMQ(H, 3, 3) = 3, H[3] = 3. Therefore H[RMQ(H, 3, 3)] + 1 = 4 is the new value for the next H entry.
26. T = abaababaa
H[rmq(H, 3, 3)] + 1 = 4
H[rmq(H, 4, 6)] + 1 = 1
27. T = abaababaa
Report max(4, 1) = 4 as the length of the longest factor, and report the position of abaa as SAlookup(2) - len = 0. Report (pos = 0, len = 4) as the maximum match.
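The H update for the new prefix can be sketched as two range minima over H, one towards the nearest earlier row whose B entry is the new character and one towards the nearest later row. The indexing is an assumption of this sketch (rows from 0 including the empty prefix, H[i] relating rows i and i+1), so the ranges it scans need not coincide with the ranges printed on the slides; the resulting values 4 and 1 do. `neighbor_lcs` is a hypothetical name.

```python
def neighbor_lcs(B, H, s, c):
    """Common-suffix lengths between the new prefix (row s extended by c)
    and its two neighbours in the updated prefix array."""
    # nearest rows before / after s whose next character is c
    pred = max((i for i in range(s) if B[i] == c), default=None)
    succ = min((i for i in range(s + 1, len(B)) if B[i] == c), default=None)
    # one matched character c, plus the minimum of H between the two rows
    lcs_prev = 1 + min(H[pred:s]) if pred is not None else 0
    lcs_next = 1 + min(H[s:succ]) if succ is not None else 0
    return lcs_prev, lcs_next


# State for T = abaababa under the assumed numbering, then the character a arrives.
B = list("abbab$aaa")
H = [0, 1, 1, 3, 3, 0, 2, 2]
print(neighbor_lcs(B, H, 5, "a"))       # (4, 1)
print(max(neighbor_lcs(B, H, 5, "a")))  # 4, the reported match length
```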
28. Overall algorithm
- All operations are rank, select, and RMQ
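The whole per-character step can be put together in a minimal sketch that uses plain Python lists in place of the succinct rank/select/RMQ structures, so every operation is linear time rather than polylogarithmic. Assumptions: rows are numbered from 0 and include the empty prefix, PA is stored explicitly as prefix lengths (the slides derive it with SAlookup instead), the input never contains "$", and the class name `OnlineLPF` is hypothetical.

```python
class OnlineLPF:
    def __init__(self):
        self.PA = [0]     # prefix lengths in sorted order (explicit only in this sketch)
        self.B = ["$"]    # next character of each prefix, "$" for the longest one
        self.H = []       # H[i] = common-suffix length of rows i and i+1
        self.s = 0        # row of the longest prefix
        self.counts = {}  # character frequencies, used to derive C
        self.n = 0

    def append(self, c):
        """Read one character and return (pos, length) of the longest previous factor."""
        B, H, s = self.B, self.H, self.s
        # nearest rows around s whose next character is c (rank/select in the slides)
        pred = max((i for i in range(s) if B[i] == c), default=None)
        succ = min((i for i in range(s + 1, len(B)) if B[i] == c), default=None)
        # range minima over H (RMQ in the slides), plus one for the matched c
        lcs_prev = 1 + min(H[pred:s]) if pred is not None else 0
        lcs_next = 1 + min(H[s:succ]) if succ is not None else 0
        # new row = C[c] + rank(B, c, s-1); the leading 1 accounts for "$"
        smaller = 1 + sum(v for ch, v in self.counts.items() if ch < c)
        new_s = smaller + B[:s].count(c)
        B[s] = c
        if new_s < len(B):
            H[new_s - 1] = lcs_prev   # old entry is split into two
            H.insert(new_s, lcs_next)
        else:
            H.append(lcs_prev)        # new row is the last one
        B.insert(new_s, "$")
        self.n += 1
        self.PA.insert(new_s, self.n)
        self.counts[c] = self.counts.get(c, 0) + 1
        self.s = new_s
        length = max(lcs_prev, lcs_next)
        if length == 0:
            return None, 0
        row = new_s - 1 if lcs_prev >= lcs_next else new_s + 1
        return self.PA[row] - length, length


lpf = OnlineLPF()
for ch in "abaababaa":
    result = lpf.append(ch)
print(result)  # (0, 4), matching the (pos = 0, len = 4) reported on slide 27
```

Keeping PA as an explicit list is the main simplification: it makes the position lookup trivial at the cost of the space savings that motivate the succinct version.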
29. Overall analysis
- H is stored in 2n bits [Sadakane, SODA 02]
  - naïve representation requires O(n log n) bits
  - requires one SAlookup operation to decode
- B is stored in n log σ + o(n log σ) bits
  - by using a dynamic rank/select dictionary
- The bottleneck of our algorithm is rmq(H, l, r), which requires O(log^3 n) time
  - SAlookup requires O(log^2 n) time
30. Overall analysis (cont.)
- We can solve the online longest previous factor problem in O(log^3 n) time per character, using n log_2 σ + o(n log σ) + O(n) bits of space
  - where σ is the alphabet size and n is the length of the text
31. Simulating a window buffer
- If the working space is limited, we often discard the history starting from the oldest characters
  - We can simulate this using almost the same operations as in the insertion operation
- We do not actually discard a character, but ignore it
  - If we actually discarded the oldest character, it might cause Ω(n) changes in B and H
  - The effect of a discarded character remains (prefixes are sorted according to the discarded characters)
  - But this does not cause a problem if we only report the matching information up to the history size
32. Experiments
- In the experiments, we used a simpler data structure (the algorithm is the same)
  - B and H are stored in a balanced binary tree
  - Each leaf stores a small block of B and H
  - We call this implementation OS
- Compare OS with other offline algorithms
  - These require reading the whole text beforehand
  - CPSa, CPSd: SA+LCP with stack [Chen et al. 07]
  - CPS6n: SA with RMQ [Chen et al. 08]
  - kk-lz: mreps, specialized for σ = 4 [Kolpakov 01]
33. Peak memory usage in bytes per input symbol
- The space of OS is the smallest on many real data sets, especially when the values in H are small
34. Runtime in milliseconds for searching the longest previous factors
- OS is about 2 to 10 times slower than the fastest ones due to the dynamic operations
35. Conclusion
- Solve the online longest matching problem by using enhanced prefix arrays
  - Simple and easy to implement
  - Requires about 3 to 6 times the space of the input text
- Actually this is a by-product of the construction of compressed suffix trees, c.f. Weiner's algorithm
- Simple, with much room for improvement
  - by using better rank/select/rmq implementations
36. Future work
- Construction of compressed suffix trees
  - Update the parenthesis tree efficiently
  - Actually, the time complexity for this is smaller
- Practical improvements
  - Currently, dynamic succinct data structures are not efficient due to cache misses and memory fragmentation
- An approximate version of the longest matching problem is enough for many applications
Thank you for your attention!
38. Weiner's suffix tree construction algorithm
[Figure: tree diagrams over the prefixes of abracadabr (a, ab, abr, abra, abrac, abraca, abracad, abracada, abracadab, abracadabr).]