Dynamic Rank-Select Structures with Applications to Run-Length Encoded Texts

1
Dynamic Rank-Select Structures with Applications
to Run-Length Encoded Texts
  • Sunho Lee and Kunsoo Park
  • Seoul National Univ.

2
Contents
  • Introduction
  • Rank/select problem
  • Relations to compressed full-text indices
  • Dynamic rank-select structure
  • Extensions of the structure
  • For a large alphabet text
  • For a run-length encoded text

3
Rank-select problem
  • For a given text T over an alphabet of size σ, our
    structures support
  • rank_T(c, i): the number of occurrences of character c
    up to position i in T
  • select_T(c, k): the position of the k-th c in T
  • E.g. T = acabbc
  • rank_T(a, 5) = 2
  • select_T(a, 2) = 3
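The two queries can be pinned down with a naive reference version. This is only a sketch of the semantics (1-based positions, linear scans), not the structure the slides propose; the function names are illustrative.

```python
def rank(text, c, i):
    """Number of occurrences of c in text[1..i] (1-based, inclusive)."""
    return text[:i].count(c)


def select(text, c, k):
    """1-based position of the k-th occurrence of c, or 0 if there is none."""
    seen = 0
    for pos, ch in enumerate(text, start=1):
        if ch == c:
            seen += 1
            if seen == k:
                return pos
    return 0


T = "acabbc"
print(rank(T, "a", 5))    # 2
print(select(T, "a", 2))  # 3
```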

4
Rank-select problem
  • Our structures support additional update
    operations
  • insert_T(c, i) inserts character c between T[i]
    and T[i+1]
  • delete_T(i) deletes T[i] from T
  • E.g. T = acabbc → aababc
  • rank_T(a, 5) = 2 → rank_T(a, 5) = 3
  • select_T(a, 2) = 3 → select_T(a, 2) = 2
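A naive sketch of the two update operations on a Python string follows. The slide does not say which updates turn acabbc into aababc; one delete followed by one insert is assumed here as one sequence that produces it.

```python
def insert(text, c, i):
    """Insert c between text[i] and text[i+1] (1-based); i = 0 prepends."""
    return text[:i] + c + text[i:]


def delete(text, i):
    """Delete text[i] (1-based)."""
    return text[:i - 1] + text[i:]


T = "acabbc"
T = delete(T, 2)        # "aabbc"
T = insert(T, "a", 3)   # "aababc"
print(T)
print(T[:5].count("a"))  # rank_T(a, 5) is now 3
```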

5
Why rank-select problem?
  • In compressed full-text index
  • Rank-select structures are built on
    Burrows-Wheeler Transform (BWT)
  • Rank: backward search (Ferragina & Manzini)
  • Select: Ψ-function in CSA (Grossi & Vitter)
  • Dynamic BWT
  • Index for a collection of texts (Chan, Hon & Lam)
  • Add or remove a text from the collection

6
Example of select on BWT
  • T = mississippi
  • Ψ-function
  • Order of the suffix at the next position
  • E.g. Ψ[4] = 11, the order of ssippi

7
Example of select on BWT
  • T = mississippi
  • Ψ-function
  • Order of the suffix at the next position
  • E.g. Ψ[4] = 11, the order of ssippi
  • Duality between Ψ-function and BWT
  • (Hon, Sadakane & Sung)
  • BWT[i] = T[SA[i] - 1]
  • Ψ[i] = select_BWT(C[i], i - F[C[i]])
  • C[i] = T[SA[i]]
  • F[c] = the number of characters x < c in T
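The duality can be checked on the slide's example with a naive suffix sort. This sketch assumes a `$` end marker that sorts before every letter and 1-based orders, which is what makes Ψ[4] = 11 come out; the helper names are illustrative.

```python
def suffix_array(text):
    """1-based start positions of all suffixes, in lexicographic order (naive sort)."""
    return sorted(range(1, len(text) + 1), key=lambda p: text[p - 1:])


def select(seq, c, k):
    """1-based position of the k-th occurrence of c in seq."""
    pos = 0
    for _ in range(k):
        pos = seq.index(c, pos) + 1
    return pos


T = "mississippi$"   # assumption: '$' end marker, smaller than every letter
n = len(T)
SA = suffix_array(T)                  # SA[i - 1] holds the slides' SA[i]
BWT = "".join(T[p - 2] for p in SA)   # BWT[i] = T[SA[i] - 1]; p = 1 wraps to the last character
rank_of = {p: i for i, p in enumerate(SA, start=1)}


def psi_direct(i):
    """Order of the suffix starting one position after the i-th smallest suffix."""
    p = SA[i - 1]
    return rank_of[p + 1 if p < n else 1]


def psi_by_select(i):
    """Psi[i] = select_BWT(C[i], i - F[C[i]])."""
    c = T[SA[i - 1] - 1]              # C[i] = T[SA[i]]
    f = sum(1 for x in T if x < c)    # F[c] = number of characters smaller than c
    return select(BWT, c, i - f)


print(psi_direct(4), psi_by_select(4))  # 11 11
```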

8
Our results
  • Dynamic rank-select on texts over a small
    alphabet (σ < log n)
  • Improves the binary-alphabet version by Mäkinen &
    Navarro
  • O(log n) time and n log σ + o(n log σ) bits
  • Dynamic rank-select for a large alphabet (σ < n)
  • Use wavelet trees to extend our small-alphabet
    structure
  • O(log n · log σ / log log n) time and n log σ +
    o(n log σ) bits
  • Application to RLE texts

9
Static rank-select

10
Dynamic rank-select

11
Dynamic rank-select preliminary
  • We assume RAM model with
  • Word size w = Θ(log n) bits
  • +, -, *, / and bitwise operations in O(1) time
  • We process a word-size text of Θ(log n / log σ)
    characters in O(1) time

12
Dynamic rank-select preliminary
  • Partition of text
  • Blocks of sizes from ½ log n words to 2 log n
    words
  • Bit vector representation I (a 1 marks the start of
    each block)
  • Gives block number b and offset r for position i
  • Employ binary rank-select by Mäkinen & Navarro
  • O(log n) time, O(n) bits
  • E.g. for i = 10
  • T = babc abab abca
  • I = 1000 1000 1000
  • b = rank_I(1, 10) = 3
  • r = 10 - select_I(1, 3) + 1 = 2
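The mapping from a text position to a block number and offset can be sketched with linear-time stand-ins for the binary rank-select structure. The bit vector is kept as a Python string purely for readability; `locate` is an illustrative name.

```python
def rank1(bits, i):
    """Number of 1s in bits[1..i] (1-based, inclusive)."""
    return bits[:i].count("1")


def select1(bits, k):
    """1-based position of the k-th 1 in bits."""
    pos = 0
    for _ in range(k):
        pos = bits.index("1", pos) + 1
    return pos


def locate(I, i):
    """Block number b and in-block offset r of text position i."""
    b = rank1(I, i)
    r = i - select1(I, b) + 1
    return b, r


I = "100010001000"   # blocks babc | abab | abca
print(locate(I, 10))  # (3, 2)
```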

13
Dynamic rank-select preliminary
  • Over-block/in-block operation
  • rank_T(c, i)
  • rank-over_T(c, b): the number of c's before the
    b-th block
  • rank_{T_b}(c, r): the number of c's up to position r
    in block T_b
  • E.g.
  • T = babc abab abca
  • I = 1000 1000 1000
  • rank_T(a, 10) = rank-over_T(a, 3) + rank_{T_3}(a, 2)
    = 3 + 1 = 4
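The decomposition of rank into an over-block part and an in-block part can be sketched over a plain list of blocks. Both parts are naive counts here, and the block lookup walks block lengths instead of querying I; this shows only how the two answers add up.

```python
def rank_over(blocks, c, b):
    """Number of c's in the blocks before the b-th block (1-based b)."""
    return sum(blk.count(c) for blk in blocks[:b - 1])


def rank_in_block(blocks, c, b, r):
    """Number of c's in the first r characters of the b-th block."""
    return blocks[b - 1][:r].count(c)


def rank(blocks, c, i):
    """rank_T(c, i) = rank-over_T(c, b) + rank_{T_b}(c, r)."""
    b, r = 1, i
    while r > len(blocks[b - 1]):   # stand-in for the rank/select lookup on I
        r -= len(blocks[b - 1])
        b += 1
    return rank_over(blocks, c, b) + rank_in_block(blocks, c, b, r)


blocks = ["babc", "abab", "abca"]
print(rank_over(blocks, "a", 3), rank_in_block(blocks, "a", 3, 2))  # 3 1
print(rank(blocks, "a", 10))                                        # 4
```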

14
Dynamic rank-select preliminary
  • Over-block/in-block operation
  • select_T(c, k)
  • select-over_T(c, k): the block number containing
    the k-th c
  • select_{T_b}(c, k): the offset of the k-th c in T_b
  • Update operation
  • In-block update: changes the text itself
  • Over-block update: changes the statistics of the
    text

15
Over-block structures
  • Sorted character-block pair
  • Character-block pair (T[i], b): T[i] is in the b-th
    block
  • E.g. T = babc abab abca
  • (b,1)(a,1)(b,1)(c,1) (a,2)(b,2)(a,2)(b,2)
    (a,3)(b,3)(c,3)(a,3)

16
Over-block structures
  • Sorted character-block pair
  • Character-block pair (T[i], b): T[i] is in the b-th
    block
  • Sorted pairs: partially non-decreasing
  • (Hon, Sadakane & Sung)
  • E.g. T = babc abab abca
  • (b,1)(a,1)(b,1)(c,1) (a,2)(b,2)(a,2)(b,2)
    (a,3)(b,3)(c,3)(a,3)
  • (a,1)(a,2)(a,2)(a,3)(a,3) (b,1)(b,1)(b,2)(b,2)(b,3)
    (c,1)(c,3)
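The sorted pair list for the example can be reproduced in a few lines. This only illustrates the ordering; the actual structure never materializes the pairs.

```python
blocks = ["babc", "abab", "abca"]

# One (character, block number) pair per text position, in text order.
pairs = [(ch, b) for b, blk in enumerate(blocks, start=1) for ch in blk]

# Sorting groups the pairs by character; inside a group the block numbers never decrease.
sorted_pairs = sorted(pairs)
print(sorted_pairs)
```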

17
Over-block structures
  • Differential encoding of sorted pairs
  • A bit vector B of O(n) bits
  • For each distinct pair
  • 1s: the difference in block number, in unary
  • 0s: the number of identical pairs, in unary
  • E.g.
  • T = ... babc abab bbbb abcc
  • (c,5)(c,8)(c,8) → 11111 0 111 00
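A sketch of the differential encoding for one character group follows. `encode_group` is an illustrative helper that takes the sorted block numbers of a single character and measures the first gap from block 0, as the example suggests.

```python
from collections import Counter


def encode_group(block_numbers):
    """Per distinct block: 1s for the gap to the previous block, then one 0 per pair."""
    out = []
    prev = 0
    for b, count in sorted(Counter(block_numbers).items()):
        out.append("1" * (b - prev) + "0" * count)
        prev = b
    return "".join(out)


print(encode_group([5, 8, 8]))        # 11111011100
print(encode_group([1, 2, 2, 3, 3]))  # 10100100, the a group of babc abab abca
```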

18
Over-block structures
  • Differential encoding of sorted pairs
  • A bit vector B of O(n) bits
  • For each distinct pair
  • 1s: the difference in block number, in unary
  • 0s: the number of identical pairs, in unary
  • E.g.
  • T = babc abab abca
  • B = 10100100 10010010 10110 (a group, b group, c group)

19
Over-block rank-select
  • rank-over_T(c, b)
  • Find the position of the b-th 1 in the group of
    c
  • Count the 0s representing c up to that position
  • E.g.
  • T = babc abab abca
  • B = 10100100 10010010 10110
  • rank-over_T(b, 3): count the 0s up to the 3rd 1 in the
    b group, which gives 4
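The over-block rank query can be sketched directly on one group of B. The groups are kept as separate strings here for clarity, which is an assumption of this sketch; the slides store them in a single bit vector and answer the query with binary rank-select.

```python
def rank_over_group(group_bits, b):
    """Count the 0s that precede the b-th 1 of one character group."""
    ones = 0
    zeros = 0
    for bit in group_bits:
        if bit == "1":
            ones += 1
            if ones == b:
                return zeros
        else:
            zeros += 1
    return zeros   # fewer than b 1s: every occurrence lies before block b


groups = {"a": "10100100", "b": "10010010", "c": "10110"}
print(rank_over_group(groups["b"], 3))  # 4
```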

20
Over-block updates
  • If the number of blocks is fixed
  • Insert or delete 0s at the b-th block in I and B
  • Rank-select remains correct
  • E.g.
  • T = babc abab abca → babc aabaaabb abca
  • I = 1000 1000 1000 → 1000 10000000 1000
  • B = 10100100 10010010 10110
  • → 10100000100 100100010 10110
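With a fixed number of blocks, each inserted character adds one 0 to block b in I and one 0 to block b in its character group of B. A minimal sketch, assuming the 0 is placed right after the b-th 1 and that groups are held as separate strings:

```python
def select1(bits, k):
    """1-based position of the k-th 1 in bits."""
    pos = 0
    for _ in range(k):
        pos = bits.index("1", pos) + 1
    return pos


def add_zero(bits, b):
    """Insert one 0 immediately after the b-th 1."""
    p = select1(bits, b)
    return bits[:p] + "0" + bits[p:]


# abab -> aabaaabb inserts three a's and one b into block 2.
I = "100010001000"
a_group, b_group = "10100100", "10010010"
for _ in range(4):
    I = add_zero(I, 2)
for _ in range(3):
    a_group = add_zero(a_group, 2)
b_group = add_zero(b_group, 2)
print(I, a_group, b_group)
```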

21
Over-block updates
  • If the number of blocks is changing
  • Split or merge the b-th block in I and B
  • Call O(σ) queries on B → amortized (σ < log n)
  • E.g.
  • T = babc aabaaabb abca → babc aaba aabb abca
  • I = 1000 10000000 1000 → 1000 1000 1000 1000
  • B = 10100000100 100100010 10110
  • → 101000100100 1001010010 101110

22
In-block structures
  • We use the same hierarchy as Mäkinen & Navarro:
    word, sub-block and block
  • Rank/select on a word-size text w
  • Convert w to a bit vector representing the
    occurrences of c
  • E.g. w = abaacbab, mask = bbbbbbbb (log σ bits per
    character)
  • w XOR mask = x0xxx0x0 (x = nonzero field) →
    01000101 (binary)
  • O(1)-time rank-select using tables of o(n) bits
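The XOR trick can be sketched on a packed integer. The 2-bit character codes and the field-by-field loop are assumptions made for illustration; the slides replace the loop with table lookups to reach O(1) time.

```python
BITS = 2                          # bits per character, assuming at most 4 symbols
CODE = {"a": 0, "b": 1, "c": 2}   # hypothetical character codes


def pack(chars):
    """Pack characters into one integer, first character in the highest field."""
    word = 0
    for ch in chars:
        word = (word << BITS) | CODE[ch]
    return word


def occurrences(word, length, c):
    """Bit string with a 1 wherever the packed word holds c."""
    x = word ^ pack(c * length)   # fields equal to c become all-zero
    field_mask = (1 << BITS) - 1
    bits = []
    for j in range(length - 1, -1, -1):   # leftmost character first
        field = (x >> (j * BITS)) & field_mask
        bits.append("1" if field == 0 else "0")
    return "".join(bits)


w = pack("abaacbab")
print(occurrences(w, 8, "b"))   # 01000101
```

Rank and select on w for character b then reduce to binary rank and select on this bit string.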

23
In-block structures
  • Linked list over sub-blocks
  • A block contains ½ log n to 2 log n words
  • A sub-block contains √(log n) words
  • One extra sub-block is a buffer for updates
  • Red-black tree over blocks
  • Leaf node: pointer to a block (list of sub-blocks)
  • Internal node: the number of blocks in its subtree

24
In-block rank-select
  • rank_{T_b}(c, r) in O(log n) time
  • Traverse the tree to find the b-th block
  • Scan the b-th block of Θ(log n) words

25
In-block updates
  • Update words in the list in O(log n) time
  • Process carry characters using the extra space in
    a block

26
In-block updates
  • Split or merge a block whose size is out of range
  • Update tree nodes from leaf to root

[Figure: red-black tree over blocks; node counts 5, 3, 2, 2; leaf blocks ab, bc, ac, ba, bc]

27
In-block updates
  • Split or merge a block whose size is out of range
  • Update tree nodes from leaf to root

[Figure: red-black tree over blocks; node counts 6, 2, 4, 2, 2; leaf blocks ab, bc, ac, ba, bc]

28
Extension of our structure
  • Dynamic rank-select on plain texts over a large
    alphabet, σ < n
  • Use k-ary wavelet trees
  • O(log n · log σ / log log n) time, n log σ +
    O(n log σ / log log n) bits
  • Application to run-length encoded texts
  • Start from RLFM (Mäkinen & Navarro)
  • Support dynamic BWT

29
Application to RLE
  • Run-Length Encoding (RLE) of T
  • Characters of runs: text T'
  • Lengths of runs: bit vector L
  • E.g. T = aaabbaacccc → T' = abac, L = 10010101000
  • RLE of BWT (Mäkinen & Navarro)
  • Run-Length based FM-index
  • The number of runs in BWT(T) ≤ min(n, nH_k) + σ^k
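The run-length split into a character text and a length bit vector can be sketched as follows. The function name is illustrative, and each run of length m is written as a 1 followed by m - 1 zeros, matching the example.

```python
from itertools import groupby


def rle(text):
    """Return (T', L): one character per run, and run lengths as 1 followed by 0s."""
    runs = [(ch, len(list(group))) for ch, group in groupby(text)]
    t_prime = "".join(ch for ch, _ in runs)
    L = "".join("1" + "0" * (length - 1) for _, length in runs)
    return t_prime, L


print(rle("aaabbaacccc"))   # ('abac', '10010101000')
```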

30
Application to RLE
  • Assume rank/select on L and T'
  • Total size of structure: O(n + n log σ) bits
  • Operation time: O(log n + log n · log σ / log log n)
  • Some additional vectors
  • Sorted length vector L'
  • Frequency table F: counts characters in T'
  • E.g.
  • T = bb aa bbbb cc aaa → aa aaa bb bbbb cc (runs
    sorted by character)
  • L = 10 10 1000 10 100 → L' = 10 100 10 1000 10
  • T' = babca, F = 001 001 01
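The additional vectors of the example can be derived from the runs with a stable sort by character. The unary layout of F (one 0 per run of a character, closed by a 1) is read off the slide's example and is an assumption of this sketch, as are the function names.

```python
from itertools import groupby


def unary(length):
    """Run length as a 1 followed by length - 1 zeros."""
    return "1" + "0" * (length - 1)


def rle_vectors(text):
    """Return (T', L, L', F) for the run-length encoding of text."""
    runs = [(ch, len(list(group))) for ch, group in groupby(text)]
    t_prime = "".join(ch for ch, _ in runs)
    L = "".join(unary(length) for _, length in runs)
    by_char = sorted(runs, key=lambda run: run[0])   # stable: keeps text order per character
    L_sorted = "".join(unary(length) for _, length in by_char)
    F = "".join("0" * t_prime.count(c) + "1" for c in sorted(set(t_prime)))
    return t_prime, L, L_sorted, F


print(rle_vectors("bbaabbbbccaaa"))
# ('babca', '1010100010100', '1010010100010', '00100101')
```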

31
Conclusion
  • Rank-select structure is an essential ingredient
    of compressed full-text indices
  • We propose dynamic rank-select for a small
    alphabet and its large-alphabet version
  • We can apply our structures to indices that use the
    BWT, such as RLFM and indices for text collections