Title: Succinct Data Structures
1 Succinct Data Structures
- Ian Munro
- University of Waterloo
- Joint work with David Benoit, Andrej Brodnik, D. Clark, F. Fich, M. He, J. Horton, A. López-Ortiz, S. Srinivasa Rao, Rajeev Raman, Venkatesh Raman, Adam Storm
- How do we encode a large tree or other combinatorial object of specialized information, even a static one,
- in a small amount of space
- and still perform queries in constant time?
2 Example of a Succinct Data Structure: The (Static) Bounded Subset
- Given a universe of n elements 0, ..., n-1
- and m arbitrary elements from this universe
- Create a static structure to support search in constant time (lg n bit word and usual operations)
- Using essentially the minimum possible number of bits ...
- Operation: member query in O(1) time
- (Brodnik and M.)
3 Focus on Trees
.. because Computer Science is .. arbophilic
- Directories (Unix, all the rest)
- Search trees (B-trees, binary search trees, digital trees or tries)
- Graph structures (we do a tree-based search)
- Search indices for text (including DNA)
4 A Big Patricia Trie / Suffix Trie
- Given a large text file, treat it as a bit vector
- Construct a trie with leaves pointing to unique locations in the text that match the path in the trie (paths must start at character boundaries)
- Skip the nodes where there is no branching (n-1 internal nodes)
[Figure: binary trie with edges labeled 0 and 1 over the bits 1 0 0 0 1 1]
5 Space for Trees
- Abstract data type: binary tree
- Size: n-1 internal nodes, n leaves
- Operations: child, parent, subtree size, leaf data
- Motivation: the obvious representation of an n-node tree takes about 6n words of lg n bits each (up, left, right, size, memory manager, leaf reference)
- i.e. a full suffix tree takes about 5 or 6 times the space of a suffix array (i.e. leaf references only)
6 Succinct Representations of Trees
- Start with Jacobson, then others
- There are about $4^n / (\sqrt{\pi}\, n^{3/2})$ ordered rooted trees, and the same number of binary trees
- Lower bound on specifying one is about 2n bits
- What are the natural representations?
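As a rough check of the 2n-bit bound, the sketch below counts binary trees on n nodes exactly with the Catalan number C(2n, n)/(n+1) and compares the number of bits needed to tell them apart with 2n. The function names are illustrative and not from the talk.

```python
import math

def catalan(n):
    """Number of binary trees on n nodes: C(2n, n) / (n + 1)."""
    return math.comb(2 * n, n) // (n + 1)

def lower_bound_bits(n):
    """Smallest b with 2**b >= catalan(n), i.e. ceil(lg(catalan(n)))."""
    return (catalan(n) - 1).bit_length()

# Compare the information-theoretic bound with 2n for a few sizes.
for n in (10, 100, 1000):
    print(n, lower_bound_bits(n), 2 * n)
```

The gap between the two printed columns grows only logarithmically in n, which is why 2n bits is described as essentially optimal.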
7 Arbitrary Ordered Trees
- Use parenthesis notation
- Represent the tree
- as the binary string (((())())((())()())): traverse the tree, writing ( for a node, then its subtrees, then )
- Each node takes 2 bits
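A minimal sketch of the parenthesis encoding, assuming a tree is given as a nested Python list in which each node is the list of its children (this input format is an assumption made for illustration). The sample tree is one whose encoding is the 20-character string quoted above.

```python
def encode(node):
    """Write '(' for the node, then its subtrees in order, then ')'."""
    return "(" + "".join(encode(child) for child in node) + ")"

def count_nodes(node):
    """Number of nodes in the subtree rooted at node."""
    return 1 + sum(count_nodes(child) for child in node)

# A 10-node ordered tree; a leaf is the empty list.
tree = [[[[]], []], [[[]], [], []]]
print(encode(tree), count_nodes(tree))
```

Each node contributes exactly one opening and one closing parenthesis, so the string has length 2n.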
8 Heap-like Notation for a Binary Tree
- Add external nodes
- Enumerate level by level
- Store vector 11110111001000000, length 2n+1
- (Here we don't know the size of subtrees; this can be overcome. Could use isomorphism to flip between notations)
[Figure: binary tree with internal nodes labeled 1 and external nodes labeled 0]
9 How do we Navigate?
- Jacobson's key suggestion: operations on a bit vector
- rank(x) = number of 1s up to and including x
- select(x) = position of the xth 1
- So in the binary tree
- leftchild(x) = 2 rank(x)
- rightchild(x) = 2 rank(x) + 1
- parent(x) = select(⌊x/2⌋)
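The navigation formulas can be tried out on the level-order vector 11110111001000000 from the heap-like notation. This is only a sketch: rank and select are computed by linear scans here, whereas the point of the talk is that they can be answered in O(1) time with a small auxiliary structure. Positions are 1-indexed, as the formulas require.

```python
BITS = [int(c) for c in "11110111001000000"]  # 1 = internal node, 0 = external node

def rank(x):
    """Number of 1s in positions 1..x (inclusive)."""
    return sum(BITS[:x])

def select(k):
    """Position of the k-th 1."""
    seen = 0
    for pos, bit in enumerate(BITS, start=1):
        seen += bit
        if seen == k:
            return pos
    raise ValueError("the vector has fewer than k ones")

def left_child(x):
    return 2 * rank(x)

def right_child(x):
    return 2 * rank(x) + 1

def parent(x):
    return select(x // 2)
```

For example, the node at position 6 is the fifth 1, so its children sit at positions 10 and 11, and `parent(11)` leads back to 6.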
10 Rank and Select
- Rank: auxiliary storage 2n lg lg n / lg n bits
- Number of 1s up to each $(\lg n)^2$-th bit
- Number of 1s within these, too, up to each $\lg n$-th bit
- Table lookup after that
- Select: more complicated, but similar notions
- Key issue: Rank and Select take O(1) time with lg n bit word (M. et al.)
- Aside: interesting data type by itself
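The two-level rank directory can be sketched as follows. The class name and the rounding of block sizes are assumptions made for illustration, and the final table lookup is replaced by summing the at most lg n bits inside one block, so this shows the layout rather than a true O(1) implementation.

```python
import math

class RankDirectory:
    def __init__(self, bits):
        self.bits = bits
        n = max(len(bits), 2)
        self.block = max(1, int(math.log2(n)))   # block length, about lg n
        self.superblock = self.block ** 2        # superblock length, about (lg n)^2
        self.super_counts = []   # 1s before each superblock
        self.block_counts = []   # 1s before each block, counted within its superblock
        total = within = 0
        for i, bit in enumerate(bits):
            if i % self.superblock == 0:
                self.super_counts.append(total)
                within = 0
            if i % self.block == 0:
                self.block_counts.append(within)
            total += bit
            within += bit

    def rank(self, x):
        """Number of 1s in positions 1..x (1-indexed, inclusive)."""
        if x <= 0:
            return 0
        i = x - 1                                  # 0-indexed last position
        start = (i // self.block) * self.block     # first position of its block
        tail = sum(self.bits[start:i + 1])         # stands in for the table lookup
        return (self.super_counts[i // self.superblock]
                + self.block_counts[i // self.block]
                + tail)
```

Because the superblock length is a multiple of the block length, every superblock boundary is also a block boundary, and the within-superblock counts stay small enough to store in about lg lg n bits each.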
11 Other Combinatorial Objects
- Planar graphs (Lu et al.)
- Permutations $[n] \to [n]$
- Or more generally
- Functions $[n] \to [n]$
- But what are the operations?
- Clearly $\pi(i)$, but also $\pi^{-1}(i)$
- And then $\pi^{k}(i)$ and $\pi^{-k}(i)$
- Suffix arrays (special permutations) in linear space
12 Permutations: a Shortcut Notation
- Let P be a simple array giving $\pi$: $P[i] = \pi(i)$
- Also have $B[i]$ be a pointer t positions back in (the cycle of) the permutation
- $B[i] = \pi^{-t}(i)$ .. but only define B for every t-th position in a cycle. (t is a constant; ignore cycle length round-off)
- So array representation:
  i:    1  2  3  4  5  6  7  8  9 10 11 12 13
  P[i]: 8  4 12  5 13  x  x  3  x  2  x 10  1
[Figure: the cycle 2 → 4 → 5 → 13 → 1 → 8 → 3 → 12 → 10 → 2 of P]
13 Representing Shortcuts
- In a cycle there is a B every t positions
- But these positions can be in arbitrary order
- Which i's have a B, and how do we store it?
- Keep a bit vector over all positions
- 0 indicates no B, 1 indicates a B
- Rank gives the position of $B[i]$ in the B array
- So $\pi(i)$ and $\pi^{-1}(i)$ in O(1) time and $(1+\varepsilon)\, n \lg n$ bits
- Theorem: Under a pointer machine model with space $(1+\varepsilon)\, n$ references, we need time $1/\varepsilon$ to answer $\pi$ and $\pi^{-1}$ queries, i.e. this is as good as it gets.
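A minimal sketch of the shortcut idea, assuming a 0-indexed permutation stored as a Python list. A dictionary stands in for the bit vector plus rank lookup into the B array, and the function names are hypothetical. The inverse walks forward until it meets an element with a back pointer, jumps back once, and then walks forward to the predecessor, so it takes about 2t steps.

```python
def build_back_pointers(perm, t):
    """Map every t-th element of each cycle to the element t steps behind it."""
    seen = [False] * len(perm)
    back = {}
    for start in range(len(perm)):
        if seen[start]:
            continue
        cycle = []
        j = start
        while not seen[j]:
            seen[j] = True
            cycle.append(j)
            j = perm[j]
        for pos in range(0, len(cycle), t):
            back[cycle[pos]] = cycle[(pos - t) % len(cycle)]
    return back

def inverse(perm, back, i):
    """Return j with perm[j] == i."""
    j = i
    jumped = False
    while perm[j] != i:
        if not jumped and j in back:
            j = back[j]       # a single jump t positions back in the cycle
            jumped = True
        else:
            j = perm[j]       # otherwise walk forward along the cycle
    return j
```

Choosing a larger t stores fewer back pointers and makes the inverse slower, which is the space/time trade-off stated in the theorem.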
14 Getting n lg n Bits: an Aside
- This is the best we can do for O(1) operations
- But using Benes networks
- A 1-Benes network is a 2 input / 2 output switch
- An (r+1)-Benes network: join tops to tops
[Figure: an (r+1)-Benes network built from two r-Benes networks, with inputs 1 2 3 4 5 6 7 8 and outputs 3 5 7 8 1 6 4 2]
15 A Benes Network
- Realizing the permutation
- (3 5 7 8 1 6 4 2)
[Figure: Benes network routing inputs 1 2 3 4 5 6 7 8 to outputs 3 5 7 8 1 6 4 2]
16 What can we do with it?
- Divide into blocks of lg lg n gates and encode their actions in a word, taking advantage of the regularity of the address mechanism
- and also
- Modify the approach to avoid the power of 2 issue
- Can trace a path in time O(lg n / lg lg n)
- This is the best time we are able to get for $\pi$ and $\pi^{-1}$ in minimum space.
- Observe: this method violates the pointer machine lower bound by using micropointers.
17 Back to the main track: Powers of $\pi$
- Consider the cycles of $\pi$
- ( 2 6 8)( 3 5 9 10)( 4 1 7)
- Keep a bit vector to indicate the start of each cycle
- ( 2 6 8 3 5 9 10 4 1 7)
- Ignoring parentheses, view as a new permutation, $\psi$.
- Note $\psi^{-1}(i)$ is the position containing i
- So we have $\psi$ and $\psi^{-1}$ as before
- Use $\psi^{-1}(i)$ to find i, then the bit vector (rank, select) to find $\pi^{k}$ or $\pi^{-k}$
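The cycle layout can be sketched on the example permutation with cycles (2 6 8)(3 5 9 10)(4 1 7). This is an illustration under simplifying assumptions: a dictionary plays the role of the inverse of the concatenated permutation, and a sorted list of cycle starts searched with `bisect` stands in for the bit vector with rank and select. The class name is hypothetical.

```python
from bisect import bisect_right

class PermutationPowers:
    def __init__(self, cycles):
        # Concatenate the cycles, ignoring the parentheses.
        self.layout = [x for cycle in cycles for x in cycle]
        # Position of each element in the concatenated layout.
        self.position = {x: p for p, x in enumerate(self.layout)}
        # Start position of each cycle, plus a sentinel at the end.
        self.starts = []
        p = 0
        for cycle in cycles:
            self.starts.append(p)
            p += len(cycle)
        self.starts.append(p)

    def power(self, i, k):
        """Apply the permutation k times to i (k may be negative)."""
        p = self.position[i]
        c = bisect_right(self.starts, p) - 1        # which cycle holds position p
        start, end = self.starts[c], self.starts[c + 1]
        return self.layout[start + (p - start + k) % (end - start)]

pp = PermutationPowers([[2, 6, 8], [3, 5, 9, 10], [4, 1, 7]])
```

Once the position of i and the bounds of its cycle are known, any power is a single modular index computation, independent of k.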
18 Functions
- Now consider arbitrary functions $[n] \to [n]$
- A function is just a hairy permutation
- All tree edges lead to a cycle
19 Challenges here
- Essentially write down the components in a convenient order and use the n lg n bits to describe the mapping (as per permutations)
- To get $f^{k}(i)$:
- Find the level ancestor (k levels up) in a tree
- Or
- Go up to the root and apply f the remaining number of steps around a cycle
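The two cases for computing the k-th power of f can be sketched as below, assuming f is a 0-indexed Python list. The names are hypothetical, the preprocessing uses plain arrays rather than a succinct encoding, and the level-ancestor step is done by walking k edges, where the talk uses a constant-time level-ancestor structure instead.

```python
def preprocess(f):
    """Compute each node's distance to its cycle, the cycle node it reaches, and the cycles."""
    n = len(f)
    state = [0] * n            # 0 = unvisited, 1 = on current path, 2 = finished
    depth = [0] * n            # edges from the node up to its cycle
    root = list(range(n))      # cycle node reached from the node
    cycles, where = [], {}     # where[u] = (cycle index, position in cycle)
    for s in range(n):
        path, v = [], s
        while state[v] == 0:
            state[v] = 1
            path.append(v)
            v = f[v]
        if state[v] == 1:      # the walk closed a new cycle starting at v
            idx = path.index(v)
            for p, u in enumerate(path[idx:]):
                where[u] = (len(cycles), p)
                state[u] = 2
            cycles.append(path[idx:])
            path = path[:idx]
        for u in reversed(path):   # remaining nodes are tree nodes
            depth[u] = depth[f[u]] + 1
            root[u] = root[f[u]]
            state[u] = 2
    return depth, root, cycles, where

def power(f, pre, i, k):
    """Apply f to i exactly k times, for k >= 0."""
    depth, root, cycles, where = pre
    if k < depth[i]:
        for _ in range(k):     # naive stand-in for a level-ancestor query
            i = f[i]
        return i
    c, p = where[root[i]]      # reach the cycle, then go around it
    cycle = cycles[c]
    return cycle[(p + k - depth[i]) % len(cycle)]
```

For instance, with `f = [1, 2, 0, 0, 3, 4, 6]`, node 5 hangs three edges below the cycle 0 → 1 → 2 → 0, so `power(f, preprocess(f), 5, 4)` climbs to the cycle and then takes one step around it.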
20 Level Ancestors
- There are several level ancestor techniques using O(1) time and O(n) WORDS.
- Adapt Bender and Farach-Colton to work in O(n) bits
- But going the other way
21 $f^{-k}$ is a set
- Moving down the tree requires care
- $f^{-3}$( ) = ( )
- The trick:
- Report all nodes on a given level of a tree in time proportional to the number of nodes, and
- Don't waste time on trees with no answers
22 Final Function Result
- Given an arbitrary function $f: [n] \to [n]$
- With an n lg n + O(n) bit representation we can compute $f^{k}(i)$ in O(1) time and $f^{-k}(i)$ in time O(1 + size of answer).
23 Back to Text and Suffix Arrays
- Text $T[1..n]$ over (a, b), ending in #, with $a < \# < b$
- There are $2^{n-1}$ such texts; which of the n! suffix arrays are valid?
- i: 1 2 3 4 5 6 7 8
- SA: 4 7 5 1 8 3 6 2
- is
- a b b a a b a #
- $SA^{-1}$: 4 8 6 1 3 7 2 5
- M: 4 7 1 5 8 2 3 6 isn't .. why?
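The example can be reproduced by sorting suffixes directly, assuming the text ends in a terminator # that sorts between a and b. This is a naive construction for illustration only, and the function names are hypothetical.

```python
ORDER = {"a": 0, "#": 1, "b": 2}   # assumed character order a < # < b

def suffix_array(text):
    """1-indexed starting positions of the suffixes of text, in sorted order."""
    def key(i):
        return [ORDER[c] for c in text[i - 1:]]
    return sorted(range(1, len(text) + 1), key=key)

def inverse(sa):
    """inv[i-1] is the rank of the suffix starting at position i."""
    inv = [0] * len(sa)
    for rank, i in enumerate(sa, start=1):
        inv[i - 1] = rank
    return inv

sa = suffix_array("abbaaba#")
print(sa, inverse(sa))
```

All suffixes starting with a come first, then the suffix consisting of # alone, then the suffixes starting with b.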
24 Ascending to Max
- M is a permutation, so $M^{-1}$ is its inverse
- i.e. $M^{-1}[i]$ says where i is in M
- Ascending-to-Max: $\forall\, 1 \le i \le n-2$
- $M^{-1}[i] < M^{-1}[n]$ and $M^{-1}[i+1] < M^{-1}[n]$ $\Rightarrow$ $M^{-1}[i] < M^{-1}[i+1]$
- $M^{-1}[i] > M^{-1}[n]$ and $M^{-1}[i+1] > M^{-1}[n]$ $\Rightarrow$ $M^{-1}[i] > M^{-1}[i+1]$
- 4 7 5 1 8 3 6 2 OK
- 4 7 1 5 8 2 3 6 NO
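25 Non-Nesting
- Non-Nesting: $\forall\, 1 \le i, j \le n-1$ and $M^{-1}[i] < M^{-1}[j]$
- $M^{-1}[i] < M^{-1}[i+1]$ and $M^{-1}[j] < M^{-1}[j+1]$ $\Rightarrow$ $M^{-1}[i+1] < M^{-1}[j+1]$
- $M^{-1}[i] > M^{-1}[i+1]$ and $M^{-1}[j] > M^{-1}[j+1]$ $\Rightarrow$ $M^{-1}[i+1] < M^{-1}[j+1]$
- 4 7 5 1 8 3 6 2 OK
- 4 7 1 5 8 2 3 6 NO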
26 Characterization Theorem for Suffix Arrays on Binary Texts
- Theorem: Ascending to Max and Non-nesting $\Leftrightarrow$ Suffix Array
- Corollary: clean method of breaking SA into segments
- Corollary: linear time algorithm to check whether SA is valid
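Both conditions can be checked directly from the inverse permutation. The sketch below is a straightforward checker written for illustration. It sorts pairs and so runs in O(n log n) time, which is not the linear-time algorithm of the corollary. Permutations are given as Python lists of 1-based values, and the function names are hypothetical.

```python
def inverse(m):
    """inv[v] is the 1-based position of value v in m; inv[0] is unused."""
    inv = [0] * (len(m) + 1)
    for pos, v in enumerate(m, start=1):
        inv[v] = pos
    return inv

def ascending_to_max(m):
    n, inv = len(m), inverse(m)
    top = inv[n]
    for i in range(1, n - 1):                  # i = 1 .. n-2
        a, b = inv[i], inv[i + 1]
        if a < top and b < top and not a < b:
            return False
        if a > top and b > top and not a > b:
            return False
    return True

def non_nesting(m):
    n, inv = len(m), inverse(m)
    # Pairs (M^-1[i], M^-1[i+1]) for i = 1 .. n-1, ordered by first component.
    pairs = sorted((inv[i], inv[i + 1]) for i in range(1, n))
    rising = [q for p, q in pairs if p < q]
    falling = [q for p, q in pairs if p > q]
    # Within each group the second components must increase with the first.
    return rising == sorted(rising) and falling == sorted(falling)

def looks_like_suffix_array(m):
    return ascending_to_max(m) and non_nesting(m)
```

On the two permutations from the slides, the first satisfies both properties and the second fails both.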
27 Cardinality Queries
- T = a b a a a b b a a a b a a b b #
- Remember the lengths of the longest run of a's and of b's
- SA (broken by runs, but not stored explicitly):
- 8 3 9 4 12 1 10 5 13 16 7 2 11 15 6 14
- $B_a$: bit vector .. if $SA^{-1}[i-1]$ is in an a section, store 1 in $B_a[SA^{-1}[i]]$, else 0
- $B_a$ = 0 0 1 1 0 0 1 1 1 0 0 1 1 0 1 1
- Create a rank structure on $B_a$, and similarly $B_b$ (note these are reversed except at #)
- Algorithm Count(T, P)
- $s \leftarrow 1$; $e \leftarrow n$; $i \leftarrow m$
- while $i > 0$ and $s \le e$ do
- if $P[i] = a$ then
- $s \leftarrow \mathrm{rank}_1(B_a, s-1) + 1$; $e \leftarrow \mathrm{rank}_1(B_a, e)$
- else
- $s \leftarrow n_a + 2 + \mathrm{rank}_1(B_b, s-1)$; $e \leftarrow n_a + 1 + \mathrm{rank}_1(B_b, e)$
- $i \leftarrow i-1$
- Return $\max(e-s+1, 0)$
- Time: O(length of query)
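The counting loop can be sketched end to end as follows. Several things here are assumptions for illustration: the text is terminated with # ordered between a and b, the suffix array is built by naive sorting and then discarded, and rank is answered from prefix-sum lists rather than a succinct rank structure. The function names are hypothetical.

```python
ORDER = {"a": 0, "#": 1, "b": 2}   # assumed character order a < # < b

def build_index(text):
    """Return (n, number of a's, prefix ranks of B_a, prefix ranks of B_b)."""
    t = text + "#"
    n = len(t)
    sa = sorted(range(1, n + 1), key=lambda i: [ORDER[c] for c in t[i - 1:]])

    def prefix_ranks(ch):
        # ranks[j] = number of the first j suffixes (in SA order) preceded by ch
        ranks = [0]
        for i in sa:
            prev = t[i - 2] if i > 1 else None
            ranks.append(ranks[-1] + (prev == ch))
        return ranks

    return n, text.count("a"), prefix_ranks("a"), prefix_ranks("b")

def count(index, pattern):
    """Number of occurrences of pattern (over a, b) in the indexed text."""
    n, na, rank_a, rank_b = index
    s, e = 1, n
    for ch in reversed(pattern):               # process the pattern right to left
        if s > e:
            break
        if ch == "a":
            s, e = rank_a[s - 1] + 1, rank_a[e]
        else:
            s, e = na + 2 + rank_b[s - 1], na + 1 + rank_b[e]
    return max(e - s + 1, 0)

index = build_index("abaaabbaaabaabb")
print(count(index, "aab"))
```

Each pattern character costs two rank queries, so the running time is proportional to the length of the query and does not depend on the length of the text.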
28 Listing Queries
- Complex methods
- Key idea: for queries of length at least d, index every dth position .. for T and for T (reversed)
- So we have matches for $T[i..n]$ and $T[1..i-1]$
- View these as points in 2-space (Ferragina and Manzini; Grossi and Vitter)
- Do a range query (Alstrup et al.)
- Variety of results follow
29 General Conclusion
- Interesting, and useful, combinatorial objects can be
- Stored succinctly: O(lower bound) + o()
- So that
- Natural queries are performed in O(1) time (or at least very close)
- This can make the difference between using them and not