Title: An Online Algorithm for Finding the Longest Previous Factors


1
An Online Algorithm for Finding the Longest
Previous Factors
ESA 2008 at Universität Karlsruhe, Sep 15, 2008
  • Kunihiko Sadakane, Kyushu University
  • Daisuke Okanohara, University of Tokyo

2
Problem: Finding the longest previous factors
(matching)
  • Input: a text T[0..n-1]
  • At every position k, report the longest substring
    T[k-len+1..k] that also occurs at a previous
    position (history): T[pos..pos+len-1] =
    T[k-len+1..k]
  • c.f. LZ77, LZ-factorization

(pos, len) = (0, 4)
(pos, len) = (5, 2)
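
The problem statement can be made concrete with a brute-force reference. This is only a minimal sketch, not the method of the talk: the function name is hypothetical, and it assumes that an earlier occurrence must end before position k (it may overlap the current one) and that (-1, 0) stands for "no match".

```python
def longest_previous_factors(text):
    """For each end position k, return (pos, length) of the longest
    substring ending at k that also ends at some earlier position."""
    result = []
    for k in range(len(text)):
        best_pos, best_len = -1, 0  # (-1, 0) means "no earlier occurrence"
        for j in range(k):  # j = end position of a candidate earlier occurrence
            length = 0
            # extend the match backwards while the characters agree
            while length <= j and text[j - length] == text[k - length]:
                length += 1
            if length > best_len:
                best_pos, best_len = j - length + 1, length
        result.append((best_pos, best_len))
    return result


factors = longest_previous_factors("abaababaa")
print(factors[8])  # (0, 4): "abaa" ends at position 8 and also starts at position 0
```

It rescans the whole history for every character, which is the cost the index-based approaches avoid.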
3
Applications
  • Data Compression
  • LZ77, Prediction by Partial Matching
  • Pattern Analysis
  • Log analysis
  • Data Mining

4
Previous approach
  • Sequential search on the fly
  • O(n^2) time for a text of length n
  • Offline index approach
  • Read the whole text beforehand, and build an index
    (suffix array/tree) for it.
  • Search for the match using the index [Chen 07] [Chen
    08] [Crochemore 08] [Kolpakov 01] [Larsson 99]
  • 6n bytes and O(n log n) time [Chen 08]: suffix
    arrays with range minimum query

5
New problem: finding the longest previous factors
online
  • Report match information just after reading each
    character
  • A case where we don't know the length of the data
    beforehand, e.g. streaming data
  • Previous approaches cannot deal with this problem

6
Our approach to the new problem
  • Online construction of enhanced prefix arrays
  • Update an index just after reading each character
  • Although many methods used in LZ77 cannot report
    the longest match, our method can.
  • Succinct data structures
  • Keep all information very compactly, using about
    the same space as the original text

7
Prefix arrays
  • Keep NOT suffix arrays (SA), but prefix arrays
    (PA)
  • because when a character is appended to the end of
    a text, SA may undergo Ω(n) changes, but PA does not
  • In PA, prefixes are sorted in
    reverse-lexicographic order

T = aaaa, Tnew = aaaaz
SA for T:    0, 1 a, 2 aa, 3 aaa, 4 aaaa
SA for Tnew: 0, 4 aaaaz, 3 aaaz, 2 aaz, 1 az, 5 z
PA for T:    0, 1 a, 2 aa, 3 aaa, 4 aaaa
PA for Tnew: 0, 1 a, 2 aa, 3 aaa, 4 aaaa, 5 aaaaz
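
The contrast between SA and PA for T = aaaa and Tnew = aaaaz can be reproduced with plain sorting. This is a small sketch with hypothetical helper names; it lists the sorted strings themselves rather than array indices.

```python
def sorted_suffixes(text):
    """All suffixes of text in lexicographic order."""
    return sorted(text[i:] for i in range(len(text)))


def sorted_prefixes(text):
    """All prefixes of text, ordered by their reversed spelling."""
    return sorted((text[:i + 1] for i in range(len(text))),
                  key=lambda p: p[::-1])


print(sorted_suffixes("aaaa"))   # ['a', 'aa', 'aaa', 'aaaa']
print(sorted_suffixes("aaaaz"))  # ['aaaaz', 'aaaz', 'aaz', 'az', 'z']
print(sorted_prefixes("aaaa"))   # ['a', 'aa', 'aaa', 'aaaa']
print(sorted_prefixes("aaaaz"))  # ['a', 'aa', 'aaa', 'aaaa', 'aaaaz']
```

Appending z changes every suffix and reverses their relative order, while the old prefixes are unchanged and the new prefix is inserted in a single place.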
8
Our idea
  • Weiner's suffix tree construction algorithm
  • Inserts the suffixes from the shortest ones
  • Modify it to insert the prefixes from the
    shortest ones
  • A similar idea is used for the incremental
    construction of compressed suffix arrays [Chan
    et al. 2007], [Lippert 2005]
  • We extend this work to the succinct version
  • Our algorithm reports matching information as a
    by-product of construction
  • No tree representation is required; we use only
    array information

9
Preliminary: Dynamic Rank/Select Dictionary (DRSD)
  • For a text T[0..n-1], DRSD supports
  • rank(T, c, i): return the number of c in T[0..i]
  • select(T, c, i): return the position of the i-th c in
    T
  • insert(T, c, i): insert c at T[i]
  • delete(T, i): delete T[i]
  • These operations can be supported in [...] time
    (O(log n) time if σ < log n), using
    n log σ + o(n log σ) bits of space,
    where σ is the alphabet size [Lee et al. 07]
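
The four DRSD operations can be illustrated on a plain Python list. This is a naive sketch for reading the later slides, not the succinct structure of [Lee et al. 07]: every operation here takes linear time, and it assumes rank is inclusive of position i and select counts occurrences from 1.

```python
def rank(seq, c, i):
    """Number of occurrences of c in seq[0..i] (inclusive)."""
    return seq[:i + 1].count(c)


def select(seq, c, j):
    """Position of the j-th occurrence of c (j counts from 1), or -1."""
    seen = 0
    for pos, ch in enumerate(seq):
        if ch == c:
            seen += 1
            if seen == j:
                return pos
    return -1


def insert(seq, c, i):
    """Insert c so that it becomes seq[i]."""
    seq.insert(i, c)


def delete(seq, i):
    """Remove seq[i]."""
    del seq[i]


t = list("abaababa")
print(rank(t, "a", 3))    # 3: a, a, a in "abaa"
print(select(t, "b", 2))  # 4: the second b
insert(t, "z", 2)
print("".join(t))         # abzaababa
delete(t, 2)
print("".join(t))         # abaababa
```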
10
Preliminary: Range Minimum Query (RMQ)
  • Given an array E[0..n-1] of elements from a totally
    ordered set, rmq(E, l, r) returns the index of
    the smallest element in E[l..r]
  • i.e. rmq(E, l, r) = argmin_{k in [l, r]} E[k]
  • returns the leftmost such element in case of a tie
  • In the static case, RMQ can be supported in O(1)
    time using 2n + o(n) bits of space [Fischer 2007]
  • In the dynamic case, RMQ/insert/delete can be
    supported in O(T log n) time using O(n) bits if the
    lookup cost of E[i] is O(T)
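
The rmq definition, including the leftmost tie-break, can be written as a linear scan. This is a reference sketch only; it has none of the time or space guarantees quoted above.

```python
def rmq(E, l, r):
    """Index of the smallest element of E[l..r]; leftmost one on ties."""
    best = l
    for k in range(l + 1, r + 1):
        if E[k] < E[best]:  # strict comparison keeps the leftmost minimum
            best = k
    return best


E = [3, 1, 4, 1, 5]
print(rmq(E, 0, 4))  # 1 (the first of the two 1s)
print(rmq(E, 2, 4))  # 3
print(rmq(E, 2, 2))  # 2
```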

11
Data structures
  • Keep the following data structures for T[0..k]
  • Assume T[0] = $ is the unique smallest character
  • B[0..k]: (prefix-) BW-transformed text
  • B[i] = T[PA[i]+1], and B[i] = $ if PA[i] = k
  • H[0..k]: height array
  • will be explained in the next slides
  • C[0..σ-1]: cumulative array
  • C[c]: the total number of characters c' s.t. c'
    < c in T
  • s: the position for the next prefix to be inserted

12
T = abaababa
13
T = abaababa
PA stores the end position of each prefix (we will omit this).
Prefix stores the prefixes sorted in reverse-lexicographic order.
(Neither PA nor Prefix is stored explicitly.)
We can examine PA[i] with the SAlookup operation in O(log^2 n)
time, as in the FM-index [Ferragina 00].
14
B stores the next character for each prefix
(Burrows-Wheeler transform for prefix arrays)
T = abaababa
15
H stores the length of the longest common suffix
between adjacent prefixes
T = abaababa
16
T = abaababa
s = 4: s denotes the position in B where $ is stored, which is
where the longest prefix is placed.
17
T = abaababa
C[c]: the number of characters c' that are
smaller than c in T (B)
C[$] = 0
C[a] = 1
C[b] = 6
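
The tables for T = abaababa can be rebuilt directly from the definitions. This is a minimal sketch with a hypothetical function name. It assumes the sentinel $ sorts below every text character, counts the sentinel-only prefix as row 0, and takes H[i] as the common-suffix length of rows i-1 and i, so row numbers and H offsets may be shifted relative to the slide figures.

```python
def build_tables(text, sentinel="$"):
    """Build PA, B, H, C and s naively from their definitions."""
    t = sentinel + text              # T[0] is the sentinel
    k = len(t) - 1
    # PA: end positions of all prefixes, ordered by reversed prefix
    PA = sorted(range(len(t)), key=lambda e: t[:e + 1][::-1])
    # B[i]: character following prefix PA[i]; sentinel for the longest prefix
    B = [t[e + 1] if e < k else sentinel for e in PA]

    def common_suffix(x, y):
        n = 0
        while n < min(len(x), len(y)) and x[-1 - n] == y[-1 - n]:
            n += 1
        return n

    # H[i]: longest common suffix of the prefixes in rows i-1 and i
    H = [0] + [common_suffix(t[:PA[i - 1] + 1], t[:PA[i] + 1])
               for i in range(1, len(PA))]
    # C[c]: number of characters in T smaller than c
    C = {c: sum(1 for ch in t if ch < c) for c in sorted(set(t))}
    s = PA.index(k)                  # row of the longest prefix
    return PA, B, H, C, s


PA, B, H, C, s = build_tables("abaababa")
print("".join(B))  # abbab$aaa
print(H)           # [0, 0, 1, 1, 3, 3, 0, 2, 2]
print(C)           # {'$': 0, 'a': 1, 'b': 6}
```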
18
T = abaababaa
The next character a arrives!
19
T = abaababaa
Replace $ in B[s] with a (because $ is placed in
the position of the longest prefix)
20
T = abaababaa
Find the position for the new prefix abaababaa.
Count the number of a in B[0..s-1]: rank(B, a,
s-1) = 2
21
T = abaababaa
Insert abaababaa at the 3rd position:
C[a] + rank(B, a, s-1) = 3.
Set s = C[a] + rank(B, a, s-1), then insert(B, $, s).
22
T = abaababaa
Update H. This is actually the length of the
longest match in the history.
23
T = abaababa
Recall that in the previous step, abaa and
aba correspond to prefixes whose B entry is a.
These positions can be found by using rank and
select, c.f. succ(T, c, s) = select(T, c,
rank(T, c, s) + 1)
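
The neighbor search via rank and select can be sketched on a plain list. The helper names succ and pred are illustrative, rank is assumed to be inclusive, and select counts from 1. The example B is the prefix BWT of abaababa after $ at row 5 has been replaced with a, with the sentinel-only prefix counted as row 0, so row numbers may differ from the slide figures.

```python
def rank(seq, c, i):
    """Number of occurrences of c in seq[0..i] (inclusive)."""
    return seq[:i + 1].count(c)


def select(seq, c, j):
    """Position of the j-th occurrence of c (j counts from 1), or -1."""
    seen = 0
    for pos, ch in enumerate(seq):
        if ch == c:
            seen += 1
            if seen == j:
                return pos
    return -1


def succ(seq, c, s):
    """Nearest position after s holding c, or -1."""
    return select(seq, c, rank(seq, c, s) + 1)


def pred(seq, c, s):
    """Nearest position before s holding c, or -1."""
    return select(seq, c, rank(seq, c, s - 1))


B = list("abbabaaaa")   # row 5 held $ before the new character a arrived
print(pred(B, "a", 5))  # 3
print(succ(B, "a", 5))  # 6
```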
24
T = abaababa
RMQ(H, 4, 6) = 5, H[5] = 0. Therefore H[RMQ(H, 4,
6)] + 1 = 1 is the new value for the next H entry
25
T = abaababa
RMQ(H, 3, 3) = 3, H[3] = 3. Therefore H[RMQ(H, 3,
3)] + 1 = 4 is the new value for the next H entry
26
T = abaababaa
H[rmq(H, 3, 3)] + 1 = 4
H[rmq(H, 4, 6)] + 1 = 1
27
T = abaababaa
Report max(4, 1) = 4 as the length of the
longest factor, and report the position of
abaa as SAlookup[2] - len = 0.
Report (pos = 0, len = 4) as the maximal match.
28
Overall algorithm
All operations are rank, select, RMQ
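
The update steps of the walkthrough can be put together as one append routine. This is a sketch under stated assumptions, not the authors' implementation: the class name is hypothetical, PA is stored explicitly, and rank, select and RMQ are replaced by linear scans over Python lists, so each step costs O(n). Row 0 is the sentinel-only prefix, H[i] is the common-suffix length of rows i-1 and i, and None plays the role of $ in B.

```python
class OnlineLPF:
    """Naive-list sketch of the online longest-previous-factor index."""

    def __init__(self):
        self.PA = [0]      # end positions of prefixes (sentinel at position 0)
        self.B = [None]    # next character of each prefix; None stands for $
        self.H = [0]       # H[i]: common-suffix length of rows i-1 and i
        self.counts = {}   # occurrences of each character read so far
        self.s = 0         # row of the longest prefix
        self.k = 0         # position of the last character read

    def append(self, c):
        """Read one character; return (pos, length) of the longest match."""
        PA, B, H, s = self.PA, self.B, self.H, self.s
        B[s] = c  # the longest prefix is now followed by c
        # nearest rows above and below s that are also followed by c
        above = next((i for i in range(s - 1, -1, -1) if B[i] == c), None)
        below = next((i for i in range(s + 1, len(B)) if B[i] == c), None)
        # range minima over H give the common suffix with each neighbor
        len_above = 1 + min(H[above + 1:s + 1]) if above is not None else 0
        len_below = 1 + min(H[s + 1:below + 1]) if below is not None else 0
        if len_above >= len_below:
            length, row = len_above, above
        else:
            length, row = len_below, below
        # start of the earlier occurrence, in text coordinates without $
        pos = PA[row] + 1 - length if length else -1
        # new row: C[c] + rank(B, c, s-1); the leading 1 counts the sentinel
        smaller = 1 + sum(n for ch, n in self.counts.items() if ch < c)
        new_s = smaller + sum(1 for i in range(s) if B[i] == c)
        self.k += 1
        PA.insert(new_s, self.k)
        B.insert(new_s, None)
        H.insert(new_s, len_above)
        if new_s + 1 < len(H):
            H[new_s + 1] = len_below
        self.counts[c] = self.counts.get(c, 0) + 1
        self.s = new_s
        return pos, length


index = OnlineLPF()
reports = [index.append(c) for c in "abaababaa"]
print(reports[-1])  # (0, 4)
```

When several earlier occurrences have the maximal length, the reported position is that of a neighboring row in the prefix order, so it need not be the leftmost occurrence.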
29
Overall Analysis
  • H is stored in 2n bits [Sadakane, SODA 02]
  • a naïve representation requires O(n log n) bits
  • requires one SAlookup operation to decode
  • B is stored in n log σ + o(n log σ) bits
  • by using a dynamic rank/select dictionary
  • The bottleneck of our algorithm is rmq(H, l, r),
    which requires O(log^3 n) time
  • SAlookup requires O(log^2 n) time

30
Overall Analysis (cont.)
  • We can solve the online longest previous factor
    problem in O(log^3 n) time for each character,
    using n log_2 σ + o(n log σ) + O(n) bits of space
  • where σ is the alphabet size, and n is the length
    of the text

31
Simulating window buffer
  • If the working space is limited, we often
    discard the history, starting from the oldest characters
  • We can simulate this by using almost the same
    operations as in the insertion operation
  • We do not actually discard a character but ignore
    it
  • If we actually discarded the oldest character, it
    might cause Ω(n) changes in B and H
  • The effect of a discarded character remains
    (prefixes are sorted according to the discarded
    characters)
  • But this does not cause a problem if we only
    report matching information within the history
    size

32
Experiments
  • In the experiments, we used a simpler data structure
    (the algorithm is the same)
  • B and H are stored in a balanced binary tree
  • Each leaf stores a small block of B and H
  • We call this implementation OS
  • Compare OS with other, offline algorithms
  • These require reading the whole text beforehand
  • CPSa, CPSd: SA + LCP with stack [Chen et al. 07]
  • CPS6n: SA with RMQ [Chen et al. 08]
  • kk-lz: mreps, specialized for σ = 4 [Kolpakov 01]

33
Peak memory usage in bytes per input symbol
  • The space of OS is the smallest on many real data sets,
    especially when the values in H are small

34
Runtime in milliseconds for searching the longest
previous factors
  • OS is about 2-10 times slower than the fastest
    ones due to the dynamic operations

35
Conclusion
  • We solve the online longest matching problem by using
    enhanced prefix arrays
  • Simple and easy to implement
  • Requires about 3-6 times the space of the input text
  • Actually this is a by-product of the construction of
    compressed suffix trees, c.f. Weiner's algorithm
  • Simple, with much room for improvement
  • by using better rank/select/rmq implementations

36
Future work
  • Construction of compressed suffix trees
  • Update the parenthesis tree efficiently
  • Actually, the time complexity for this is smaller
  • Practical improvements
  • Currently, dynamic succinct data structures are not
    efficient due to cache misses and memory
    fragmentation
  • An approximate version of the longest matching problem
    is enough for many applications

Thank you for your attention!
38
Weiner's suffix tree construction alg.
(Figure: node labels include a, ab, abr, abra, abrac, abraca,
abracad, abracada, abracadab, abracadabr.)