Title: Succinct Data Structures: Upper, Lower
1 Succinct Data Structures: Upper, Lower & Middle Bounds
- Ian Munro
- University of Waterloo
- Joint work with/of Arash Farzan, Alex Golynski, Meng He
- How do we encode a large combinatorial object (e.g. a tree, string, graph, group), even a static one, in a small amount of space and still perform required operations in constant time?
2 Example of a Succinct Data Structure: The (Static) Bounded Subset
- Given: Universe of n elements 0,...,n-1
- and m arbitrary elements from this universe
- Create a static structure to support search in constant time (lg n bit word, usual ops)
- Using essentially the minimum possible number of bits, $\lceil \lg \binom{n}{m} \rceil$
- Operation: Member query in O(1) time
- (Brodnik & M.)
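To give a feel for what "essentially minimum possible bits" means, the following sketch computes the information-theoretic minimum for storing m elements out of a universe of n (the ceiling of lg C(n, m)) and compares it with two obvious encodings. The function names are illustrative and are not from the talk; the sketch only counts bits and does not build the constant-time structure.

```python
from math import comb

def min_subset_bits(n: int, m: int) -> int:
    """ceil(lg C(n, m)): bits needed to tell apart all m-subsets of n items."""
    # (x - 1).bit_length() equals ceil(lg x) exactly for every integer x >= 1.
    return (comb(n, m) - 1).bit_length()

def bit_vector_bits(n: int) -> int:
    """One bit per universe element."""
    return n

def sorted_list_bits(n: int, m: int) -> int:
    """m explicit indices of ceil(lg n) bits each."""
    return m * (n - 1).bit_length()

if __name__ == "__main__":
    n, m = 1_000_000, 1_000
    print("minimum    :", min_subset_bits(n, m))
    print("bit vector :", bit_vector_bits(n))
    print("sorted list:", sorted_list_bits(n, m))
```

For sparse sets the minimum sits well below both the bit vector and the list of indices, which is the gap a succinct structure aims to close.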
3 Careful .. Lower Bounds
- Beame-Fich: Find largest less than i is tough in some ranges of m (e.g. $m$ around $2^{\sqrt{\lg n}}$)
- But OK if i is present; this can be added (Raman, Raman, Rao)
4 Focus on Trees
.. Because Computer Science is .. Arbophilic
- Directories (Unix, all the rest)
- Search trees (B-trees, binary search trees, digital trees or tries)
- Graph structures (we do a tree based search)
- Search indices for text (including DNA)
5 A Big Patricia Trie / Suffix Trie
- Given a large text file; treat it as bit vector
- Construct a trie with leaves pointing to unique locations in text that match path in trie (paths must start at character boundaries)
- Skip the nodes where there is no branching (n-1 internal nodes)
[Figure: trie with branches labelled 0 and 1 over the bit string 1 0 0 0 1 1]
6 Space for Trees
- Abstract data type: binary tree
- Size: n-1 internal nodes, n leaves
- Operations: child, parent, subtree size, leaf data
- Motivation: Obvious representation of an n node tree takes about 6n words of lg n bits each (up, left, right, size, memory manager, leaf reference)
- i.e. full suffix tree takes about 5 or 6 times the space of suffix array (i.e. leaf references only)
7 Succinct Representations of Trees
- Start with Jacobson, then others
- There are about $4^n/(\sqrt{\pi}\, n^{3/2})$ ordered rooted trees, and same number of binary trees
- Lower bound on specifying is about 2n bits
- What are the natural representations?
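The "about 2n bits" figure can be checked numerically. The sketch below counts binary trees on n nodes with the Catalan numbers (a standard fact, not stated on the slide) and prints how many bits are needed to pick one of them; the helper names are illustrative.

```python
from math import comb, log2

def catalan(n: int) -> int:
    """Number of binary trees with n nodes (the n-th Catalan number)."""
    return comb(2 * n, n) // (n + 1)

def bits_to_specify(n: int) -> float:
    """lg of the number of trees: the information-theoretic lower bound."""
    return log2(catalan(n))  # math.log2 accepts arbitrarily large ints

if __name__ == "__main__":
    for n in (10, 100, 1000, 10000):
        b = bits_to_specify(n)
        print(f"n={n:6d}  bits={b:10.1f}  bits/n={b / n:.4f}")
```

The bits-per-node ratio climbs toward 2 as n grows, with a lower-order deficit of roughly 1.5 lg n bits.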
8 Arbitrary Ordered Trees
- Use parenthesis notation
- Represent the tree as the binary string (((())())((())()())): traverse tree as "(" for node, then subtrees, then ")"
- Each node takes 2 bits
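A minimal sketch of the parenthesis encoding, assuming a tree is given as a nested Python list in which each node is the list of its children (this input format is chosen here for illustration only). The example tree is one whose encoding is the 20-character string on the slide.

```python
def to_parens(children: list) -> str:
    """Emit "(" for the node, then each subtree in order, then ")"."""
    return "(" + "".join(to_parens(child) for child in children) + ")"

def from_parens(s: str) -> list:
    """Rebuild the nested-list tree from a balanced parenthesis string."""
    sentinel: list = []
    stack = [sentinel]
    for ch in s:
        if ch == "(":
            node: list = []
            stack[-1].append(node)  # attach to the current parent
            stack.append(node)      # descend into the new node
        else:
            stack.pop()             # finished this node, return to parent
    return sentinel[0]

# Root with two children; the first has 2 children, the second has 3.
tree = [[[[]], []], [[[]], [], []]]

if __name__ == "__main__":
    encoded = to_parens(tree)
    print(encoded, len(encoded), "symbols for", len(encoded) // 2, "nodes")
```

Writing "(" as 1 and ")" as 0 gives the 2-bits-per-node binary string.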
9 Heap-like Notation for a Binary Tree
- Add external nodes
- Enumerate level by level
- Store vector 11110111001000000, length 2n+1
- (Here we don't know the size of subtrees; this can be overcome. Could use an isomorphism to flip between notations.)
[Figure: binary tree with external nodes, each node labelled in level order with its bit from the vector above]
10 How do we Navigate?
- Jacobson's key suggestion: Operations on a bit vector
- rank(x) = number of 1s up to and including x
- select(x) = position of xth 1
- So in the binary tree
- leftchild(x) = 2 rank(x)
- rightchild(x) = 2 rank(x) + 1
- parent(x) = select($\lfloor x/2 \rfloor$)
11 Rank & Select
- Rank: Auxiliary storage about $2n \lg\lg n / \lg n$ bits
- Number of 1s up to each $(\lg n)^2$-th bit
- Number of 1s within these, too, up to each $\lg n$-th bit
- Table lookup after that
- Select: More complicated (especially to get this lower order term) but similar notions
- Key issue: Rank & Select take O(1) time with lg n bit word (M. et al)
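The two-level directory for rank can be sketched as follows. This is an illustration under simplifying assumptions, not the exact structure from the talk: the bit vector is a Python string, block size is floor(lg n), a superblock is (block size) squared bits, and the final "table lookup" step is replaced by directly counting inside one block. The class and attribute names are made up for this example.

```python
from math import log2

class RankDirectory:
    def __init__(self, bits: str):
        self.bits = bits
        n = max(len(bits), 2)
        self.block = max(1, int(log2(n)))   # about lg n bits per block
        # A superblock is self.block consecutive blocks, about (lg n)^2 bits.
        self.super_counts = []  # 1s before each superblock (absolute)
        self.block_counts = []  # 1s before each block, within its superblock
        total = within = 0
        for k, start in enumerate(range(0, len(bits), self.block)):
            if k % self.block == 0:          # a new superblock begins here
                self.super_counts.append(total)
                within = 0
            self.block_counts.append(within)
            ones = bits[start:start + self.block].count("1")
            total += ones
            within += ones

    def rank(self, x: int) -> int:
        """Number of 1s among the first x bits (positions 1..x)."""
        x = min(x, len(self.bits))
        if x <= 0:
            return 0
        blk = (x - 1) // self.block          # block holding position x
        sup = blk // self.block              # superblock holding that block
        # Stand-in for the table lookup: count inside a single block.
        in_block = self.bits[blk * self.block:x].count("1")
        return self.super_counts[sup] + self.block_counts[blk] + in_block

if __name__ == "__main__":
    d = RankDirectory("11110111001000000" * 5)
    print(d.rank(8), d.rank(17), d.rank(85))
```

Each query touches one superblock count, one block count and one block of raw bits, which is what makes constant time possible once the in-block count is a table lookup on a lg n bit word.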
12 Aside: Dynamic Rank & Select
- Rank/Select Structures: Raw data plus some cumulative arrays
- Model: We keep a finger at a position and can insert/delete/change at that spot or move 1 spot left/right
- When at position i, maintain structures up to i and backwards from n down to i+1.
- Problem: in most (tree) applications rank/select updates are all over
13 Lower Bound for Rank & for Select
- Theorem (Golynski): Given a bit vector of length n and an index (extra data) of size r bits, let t be the number of bits probed to perform rank (or select); then $r = \Omega(n (\lg t)/t)$.
- Proof idea: Argue by reconstructing the entire string with too few rank queries (similarly for select)
- Corollary (Golynski): Under the lg n bit RAM model, an index of size $\Theta(n \lg\lg n / \lg n)$ is necessary and sufficient to perform the rank and the select operations.
14 More on Trees
- Updating trees: simple mapping plus rank/select does not work well
- Other kinds of trees: free trees (no root or ordering on children), a simple mapping may not exist
- So: break tree into little hunks (say $(1-\epsilon) \lg n$ size), small enough to explicitly keep in a table, with special constraints (e.g. few edges going out of a hunk)
15 More on Trees
- Keep most nodes in these little hunks (or a
couple of levels of hunk size classes), a limited
number can be in a core tree with real pointers
16 Hunks Lead to
- Updates on binary trees (M., Raman & Storm), more general trees (Farzan & M.)
- Also representing special classes of trees optimally (Farzan & M.)
- e.g. free trees 1.56..n bits,
- free binary trees 1.31..n bits
17 Other Combinatorial Objects
- Planar Graphs (Lu et al, Barbay et al)
- Permutations $[n] \to [n]$
- Or more generally
- Functions $[n] \to [n]$. But what operations?
- Clearly $\pi(i)$, but also $\pi^{-1}(i)$
- And then $\pi^{k}(i)$ and $\pi^{-k}(i)$
- Suffix Arrays (special permutations) in linear space
- Arbitrary Graphs (Farzan & M.)
18 Permutations: Backpointer Notation
- Let P be a simple array giving $\pi$: $P[i] = \pi(i)$
- Also have $B[i]$ be a pointer $t$ positions back in (the cycle of) the permutation: $B[i] = \pi^{-t}(i)$ .. But only define B for every $t$-th position in cycle. ($t$ is a constant; ignore cycle length round-off)
- So array representation:
  i: 1 2 3  4 5  6 7 8 9 10 11 12 13
  P: 8 4 12 5 13 x x 3 x 2  x  10 1
[Figure: the cycle through 2, 4, 5, 13, 1, 8, 3, 12, 10 drawn as a ring]
19 Representing Shortcuts
- In a cycle there is a B every t positions
- But these positions can be in arbitrary order
- Which i's have a B, and how do we store it?
- Keep a vector of all positions: 0 = no B, 1 = B
- Rank gives the position of $B[i]$ in B array
- So: $\pi(i)$ and $\pi^{-1}(i)$ in O(1) time with $(1+\epsilon)\, n \lg n$ bits
- Theorem: Under a pointer machine model with space $(1+\epsilon)\, n$ references, we need time $1/\epsilon$ to answer $\pi$ and $\pi^{-1}$ queries, i.e. this is as good as it gets in the pointer model.
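The backpointer scheme can be sketched as below. Assumptions made for this illustration: indices are 0-based, every t-th element of each cycle (starting from the smallest unvisited index) gets a back pointer to the previous marked element of the same cycle, and a plain prefix-sum array stands in for a succinct rank structure over the mark vector. The class name and layout are hypothetical.

```python
class BackpointerPermutation:
    def __init__(self, perm, t: int):
        self.perm = list(perm)               # perm[i] = pi(i)
        n = len(self.perm)
        self.marked = [False] * n            # 1 = position has a back pointer
        back = {}
        seen = [False] * n
        for start in range(n):
            if seen[start]:
                continue
            cycle, x = [], start
            while not seen[x]:               # collect one cycle in order
                seen[x] = True
                cycle.append(x)
                x = self.perm[x]
            marks = cycle[::t]               # every t-th element of the cycle
            for k, m in enumerate(marks):
                self.marked[m] = True
                back[m] = marks[k - 1]       # k = 0 wraps to the last mark
        # prefix[i] = number of marked positions before i (naive rank).
        self.prefix = [0]
        for flag in self.marked:
            self.prefix.append(self.prefix[-1] + flag)
        # Compact array B: back pointers of marked positions, in index order.
        self.B = [back[i] for i in range(n) if self.marked[i]]

    def forward(self, i: int) -> int:
        return self.perm[i]

    def inverse(self, i: int) -> int:
        x = i
        while not self.marked[x]:            # at most t - 1 forward steps
            x = self.perm[x]
        x = self.B[self.prefix[x]]           # jump back at most t positions
        while self.perm[x] != i:             # walk forward to the predecessor
            x = self.perm[x]
        return x

if __name__ == "__main__":
    p = BackpointerPermutation([2, 0, 3, 1, 5, 4], t=2)
    print([p.inverse(i) for i in range(6)])
```

Only about n/t back pointers are stored, so a larger t trades space for a longer walk, which mirrors the space/time tradeoff in the theorem.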
20 Aside: Extending to powers of $\pi$
- Consider the cycles of $\pi$
- ( 2 6 8)( 3 5 9 10)( 4 1 7)
- Bit vector indicates start of each cycle
- ( 2 6 8 3 5 9 10 4 1 7)
- Ignore parens, view as new permutation, $\psi$.
- Note $\psi^{-1}(i)$ is position containing i
- So we have $\psi$ and $\psi^{-1}$ as before
- Use $\psi^{-1}(i)$ to find i, then bit vector (rank, select) to find $\pi^{k}$ or $\pi^{-k}$
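Using the slide's cycles (2 6 8)(3 5 9 10)(4 1 7), powers of the permutation can be computed by locating i in the concatenated cycle list and stepping k places around its cycle. In this sketch a dictionary stands in for the inverse of the concatenated permutation and a binary search over cycle start positions stands in for rank/select on the bit vector; both substitutions are assumptions made to keep the example short.

```python
from bisect import bisect_right

cycles = [[2, 6, 8], [3, 5, 9, 10], [4, 1, 7]]

# Concatenate the cycles, ignoring the parentheses.
flat = [x for cycle in cycles for x in cycle]

# Start position of each cycle (what the bit vector marks with a 1).
starts = []
pos = 0
for cycle in cycles:
    starts.append(pos)
    pos += len(cycle)

# Position containing each element (the inverse of the flattened permutation).
position_of = {x: p for p, x in enumerate(flat)}

def power(i: int, k: int) -> int:
    """Return pi^k(i); k may be negative."""
    p = position_of[i]
    c = bisect_right(starts, p) - 1          # which cycle (rank)
    s = starts[c]                            # where it starts (select)
    end = starts[c + 1] if c + 1 < len(starts) else len(flat)
    length = end - s
    return flat[s + (p - s + k) % length]    # step k places around the cycle

if __name__ == "__main__":
    print(power(2, 1), power(3, 2), power(4, -1))
```

The cost does not depend on k, since the step around the cycle is a single modular addition.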
21 Aside: Functions
- Consider an arbitrary function, $f: [n] \to [n]$
- Note $f^{-1}(i)$ is a set
- All tree edges lead to a cycle
- A function is just a hairy permutation
- Deal with level ancestors, result holds
22 Back to $\pi$ & $\pi^{-1}$ in Fewer Bits
- This is the best we can do for O(1) operations
- But using Benes networks:
- 1-Benes network is a 2 input/2 output switch
- (r+1)-Benes network: join tops to tops
- $\mathrm{bits}(n) = 2\,\mathrm{bits}(n/2) + n - 1 = n \lg n - n + 1 = \min + O(n)$
[Figure: an (r+1)-Benes network with inputs 1 2 3 4 5 6 7 8 and outputs 3 5 7 8 1 6 4 2, built around two r-Benes networks]
23 A Benes Network
- Realizing the permutation (std $\pi(i)$ notation)
- (3 5 7 8 1 6 4 2)
- Note: O(n) bits more than necessary
[Figure: Benes network with inputs 1 2 3 4 5 6 7 8 and outputs 3 5 7 8 1 6 4 2]
24 What can we do with it?
- Divide into blocks of lg lg n gates; encode their actions in a word, taking advantage of regularity of address mechanism
- and also
- Modify approach to avoid power of 2 issue
- Can trace a path in time O(lg n / lg lg n)
- Beats previous lower bound by using micropointers
25 Backpointers & Benes: Both are Best
- Recall: Benes method violates the pointer machine lower bound by using micropointers.
- Indeed: With (a lot of) care, space required is $\lg(n!) + O(n (\lg\lg n)^2/\lg n)$ bits
- But more general
- Lower Bound (Golynski): Both methods are optimal for their respective extra space constraints
26 Permutation Lower Bound
- Operations $\pi(i)$, $\pi^{-1}(i)$ with times $t$ and $t'$
- Backpointers: natural index
- Benes: just a pile of bits, in lg n bit words
- General Model: memory ($\lg(n!) + r$ bits) in words
- Lower bound: $r$ (extra space) $= \Omega(\lg n!/(t\,t'))$
- It works out: both Backpointers and Benes are optimal
27 Proof of Lower Bound: Model
- Model: Tree program
- Separate tree for each $\pi(i)$ or $\pi^{-1}(i)$
- Start at root, look at memory location (word) based on value required
- At depth d take appropriate branch based on which of n values is read
28 Proof of Lower Bound: Set up
- Fix the permutation (for now)
- Consider table of locations inspected at every step for every query
[Table: locations inspected (rows) for each query (columns)]
location
r 4 6 9 3 6 8
m 5 3 4 9 2
o 7 5 9 1 8
p 9 8 3 2 3
q 8 1 3 4 7
s 3 7 3 8
t 3 7 1 2 4
query: $\pi(1)$ $\pi(2)$ $\pi(3)$ $\pi(4)$ ... $\pi^{-1}(n)$
29 Proof of Lower Bound cont'd
- Take the least used cell (over all queries for this permutation)
30 Proof of Lower Bound cont'd
- Take the least used cell (over all queries for this permutation)
- And NUKE (eliminate) it
31 Proof of Lower Bound cont'd
- And continue removing cells for a while ..
- This means some queries may become unanswerable (no matter how many probes made) but others are still OK
- e.g. removing a cell for $\pi(6)$ (= 56) and $\pi^{-1}(56)$ (= 6) makes these unanswerable, versus a cell for $\pi(9)$ (= 52) but not $\pi^{-1}(52)$ (= 9)
- We do have to remember what we removed (though not the order)
32 Proof of Lower Bound: Saving Space
- So we save the space for the values we no longer need, but we do have to remember which are destroyed
- d locations destroyed, order doesn't matter
- d lg(n/d) bits used to say what is gone
- But
- d lg(n) bits saved
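Taking the two counts on this slide at face value (both are approximations, and treating the number of candidate cells as about n is an assumption made here), the net saving from destroying d cells works out as:

```latex
\[
  \underbrace{d \lg n}_{\text{bits saved}}
  \;-\;
  \underbrace{d \lg (n/d)}_{\text{bits to record what is gone}}
  \;=\; d \lg d .
\]
```

So the saving is positive as soon as d > 1 and grows with d.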
33 Proof of Lower Bound: Finishing
- Now some queries don't work
- $\pi(i_s)$, $s = 1, \ldots, c$; $\pi^{-1}(j_s)$, $s = 1, \ldots, c$
- We know the $i_s$ and $j_s$ but not their correspondence
- encode it
- After reduction we still need lg(n!) bits (averaging over all permutations)
- So reduce to that point .. Do arithmetic, bound follows
34 Text Search Lower Bound
- Key point: reciprocal relation
- Text search operations
- F: access substring of length p starting in position $ip+1$, $i = 0, \ldots, n/p$
- I: search(X, j), the jth (aligned) occurrence of X
- Theorem (Golynski): $r\,t\,t' = \Omega(n p (\lg \sigma)^2 / w^2)$
- $r$ = extra space in words, $\sigma$ = alphabet size, $w$ = word size
- For lg n length substrings, linear extra space is needed; same as Demaine & Lopez-Ortiz, but better model
35 Conclusion
- Interesting, and useful, combinatorial objects can be
- Stored succinctly: lower bound + o()
- So that
- Natural queries are performed in O(1) time (or at least very close)
- Indeed our o() terms are often optimal
- But border on operations is subtle