Succinct Data Structures - PowerPoint PPT Presentation

1
Succinct Data Structures
  • Ian Munro
  • University of Waterloo
  • Joint work with David Benoit, Andrej Brodnik, D.
    Clark, F. Fich, M. He, J. Horton, A. López-Ortiz,
    S. Srinivasa Rao, Rajeev Raman, Venkatesh Raman,
    Adam Storm
  • How do we encode a large tree, or another
    combinatorial object carrying specialized information,
  • even a static one,
  • in a small amount of space
  • and still perform queries in constant time?

2
Example of a Succinct Data Structure: The
(Static) Bounded Subset
  • Given: a universe of n elements {0, ..., n-1}
  • and m arbitrary elements from this universe
  • Create a static structure that supports search in
    constant time (lg n-bit words and the usual
    operations)
  • Using essentially the minimum possible number of bits ...
  • Operation: member query in O(1) time
  • (Brodnik & M.)
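
One way to read "essentially minimum possible bits": there are C(n, m) subsets of size m, so no encoding can use fewer than ⌈lg C(n, m)⌉ bits. The sketch below only compares that bound with two obvious encodings; the function names are illustrative and it does not implement the Brodnik & Munro structure.

```python
import math

def min_bits(n: int, m: int) -> int:
    """Information-theoretic minimum: ceil(lg C(n, m)) bits."""
    return (math.comb(n, m) - 1).bit_length()

def bitmap_bits(n: int, m: int) -> int:
    """Plain bit vector: one bit per universe element."""
    return n

def sorted_list_bits(n: int, m: int) -> int:
    """Sorted array of the m elements, ceil(lg n) bits each."""
    return m * (n - 1).bit_length()

# Sparse example: 1000 elements out of a universe of 2^20.
n, m = 1 << 20, 1000
print(min_bits(n, m), sorted_list_bits(n, m), bitmap_bits(n, m))
```

For sparse sets the bound sits well below both naive encodings, which is the gap a succinct structure aims to close while keeping O(1) membership.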

3
Focus on Trees
  • ... because Computer Science is ... arbophilic
  • Directories (Unix, all the rest)
  • Search trees (B-trees, binary search trees, digital trees or
    tries)
  • Graph structures (we do a tree-based search)
  • Search indices for text (including DNA)

4
A Big Patricia Trie / Suffix Trie
  • Given a large text file, treat it as a bit vector
  • Construct a trie with leaves pointing to unique
    locations in the text that match the path in the trie
    (paths must start at character boundaries)
  • Skip the nodes where there is no branching (leaving n-1
    internal nodes)

5
Space for Trees
  • Abstract data type: binary tree
  • Size: n-1 internal nodes, n leaves
  • Operations: child, parent, subtree size, leaf
    data
  • Motivation: the obvious representation of an n-node
    tree takes about 6n words of lg n bits each (up, left,
    right, size, memory manager, leaf reference)
  • i.e. a full suffix tree takes about 5 or 6 times
    the space of a suffix array (i.e. leaf references
    only)

6
Succinct Representations of Trees
  • Start with Jacobson, then others
  • There are about 4^n / (√π · n^(3/2)) ordered rooted trees,
    and the same number of binary trees
  • Lower bound on specifying one is about 2n bits
  • What are the natural representations?
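
The 2n-bit figure can be checked numerically: binary trees on n nodes are counted by the Catalan numbers, and lg of the n-th Catalan number is 2n minus a Θ(lg n) term. A small sketch (function names are illustrative):

```python
import math

def catalan(n: int) -> int:
    """Number of binary trees on n nodes: C(2n, n) / (n + 1)."""
    return math.comb(2 * n, n) // (n + 1)

def bits_needed(n: int) -> int:
    """ceil(lg catalan(n)): bits needed to tell all such trees apart."""
    return (catalan(n) - 1).bit_length()

# Compare the exact bound with 2n for a few sizes.
for n in (10, 100, 1000):
    print(n, bits_needed(n), 2 * n)
```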

7
Arbitrary Ordered Trees
  • Use parenthesis notation
  • Represent the tree
  • as the binary string (((())())((())()())):
    traverse the tree writing "(" for a node, then its
    subtrees, then ")"
  • Each node takes 2 bits
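
A minimal sketch of the parenthesis encoding, assuming a tree is given as a nested Python list of child subtrees (a leaf is `[]`). The representation and the names `encode`, `decode`, `size` and `slide_tree` are illustrative choices, not part of the talk.

```python
def encode(tree) -> str:
    """Write "(" for a node, then its subtrees in order, then ")"."""
    return "(" + "".join(encode(child) for child in tree) + ")"

def decode(s: str):
    """Inverse of encode, using an explicit stack of open nodes."""
    stack = [[]]                  # sentinel list that will hold the root
    for ch in s:
        if ch == "(":
            node = []
            stack[-1].append(node)
            stack.append(node)
        else:
            stack.pop()
    return stack[0][0]

def size(tree) -> int:
    """Number of nodes in the tree."""
    return 1 + sum(size(child) for child in tree)

# A 10-node tree whose encoding is the string shown on the slide.
slide_tree = [[[[]], []], [[[]], [], []]]
print(encode(slide_tree), size(slide_tree))
```

The string has exactly two characters per node, which is where the 2 bits per node come from.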

8
Heap-like Notation for a Binary Tree
  • Add external nodes
  • Enumerate level by level
  • Store the vector 11110111001000000, of length 2n+1
  • (Here we don't know the size of subtrees; this can
    be overcome. Could use an isomorphism to flip
    between notations.)

9
How do we Navigate?
  • Jacobson's key suggestion: operations on a bit
    vector
  • rank(x) = number of 1s up to and including position x
  • select(x) = position of the x-th 1
  • So in the binary tree:
  • leftchild(x) = 2 rank(x)
  • rightchild(x) = 2 rank(x) + 1
  • parent(x) = select(⌊x/2⌋)
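
The navigation formulas can be tried directly on the 17-bit vector from the heap-like notation slide. This is a sketch with linear-time rank and select (positions are 1-indexed, as on the slide); a real structure would answer both in O(1).

```python
BITS = [int(c) for c in "11110111001000000"]   # level-order vector, 1 = internal node

def rank(x: int) -> int:
    """Number of 1s in positions 1..x."""
    return sum(BITS[:x])

def select(k: int) -> int:
    """Position of the k-th 1 (linear scan for clarity)."""
    seen = 0
    for pos, bit in enumerate(BITS, start=1):
        seen += bit
        if seen == k:
            return pos
    raise ValueError("fewer than k ones")

def is_internal(x: int) -> bool:
    return BITS[x - 1] == 1

def leftchild(x: int) -> int:
    return 2 * rank(x)

def rightchild(x: int) -> int:
    return 2 * rank(x) + 1

def parent(x: int) -> int:
    return select(x // 2)

print(leftchild(1), rightchild(1), parent(11))
```

Only internal nodes (bits equal to 1) have children; the root at position 1 has no parent.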

10
Rank & Select
  • Rank: auxiliary storage about 2n lg lg n / lg n bits
  • Number of 1s up to each (lg n)^2-th bit
  • Number of 1s within these blocks, up to each (lg n)-th bit
  • Table lookup after that
  • Select: more complicated, but similar notions
  • Key issue: Rank & Select take O(1) time with lg n-bit
    words (M. et al.)
  • Aside: an interesting data type by itself
11
Other Combinatorial Objects
  • Planar graphs (Lu et al.)
  • Permutations [n] → [n]
  • Or more generally
  • Functions [n] → [n]
  • But what are the operations?
  • Clearly π(i), but also π^-1(i)
  • And then π^k(i) and π^-k(i)
  • Suffix arrays (special permutations) in linear
    space

12
Permutations: a Shortcut Notation
  • Let P be a simple array giving π: P[i] = π[i]
  • Also let B[i] be a pointer t positions back in
    (the cycle of) the permutation
  • B[i] = π^-t[i] ... but only define B for every t-th
    position in the cycle (t is a constant; ignore cycle
    length round-off)
  • So the array representation is
  •    i:  1  2  3  4  5  6  7  8  9 10 11 12 13
  • P[i]:  8  4 12  5 13  x  x  3  x  2  x 10  1
(Figure: the cycle 2 → 4 → 5 → 13 → 1 → 8 → 3 → 12 → 10 → 2.)

13
Representing Shortcuts
  • In a cycle there is a B every t positions
  • But these positions can be in arbitrary order
  • Which positions have a B, and how do we store it?
  • Keep a bit vector over all positions:
  • 0 indicates no B, 1 indicates a B
  • Rank gives the position of B[i] in the B array
  • So π(i) and π^-1(i) in O(1) time with (1+ε) n lg n
    bits
  • Theorem: under a pointer machine model with space
    (1+ε) n references, we need time 1/ε to answer π
    and π^-1 queries, i.e. this is as good as it gets.
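
A minimal sketch of the shortcut scheme, assuming a 1-indexed array P (index 0 unused), a naive linear-time rank, and an illustrative 10-element permutation. The function names are hypothetical; the point is only that an inverse query walks at most about 2t steps.

```python
def build_shortcuts(P, t):
    """Mark every t-th element of each cycle and give it a back pointer.

    Returns (marked, B): marked is a 0/1 vector over positions 1..n and
    B holds the back pointers of marked positions in increasing position order.
    """
    n = len(P) - 1
    marked = [0] * (n + 1)
    back = {}
    seen = [False] * (n + 1)
    for start in range(1, n + 1):
        if seen[start]:
            continue
        cycle, j = [], start
        while not seen[j]:
            seen[j] = True
            cycle.append(j)
            j = P[j]
        marks = cycle[::t]                   # every t-th element of the cycle
        for k, node in enumerate(marks):
            marked[node] = 1
            back[node] = marks[k - 1]        # previous marked element (wraps around)
    B = [back[i] for i in range(1, n + 1) if marked[i]]
    return marked, B

def inverse(P, marked, B, i):
    """Return j with P[j] == i: forward to a marked node, jump back, forward again."""
    j = i
    while not marked[j]:
        j = P[j]
    j = B[sum(marked[1:j + 1]) - 1]          # rank gives the index into B
    while P[j] != i:
        j = P[j]
    return j

P = [0, 7, 6, 5, 1, 9, 8, 4, 2, 10, 3]       # illustrative permutation on 1..10
marked, B = build_shortcuts(P, 2)
print([inverse(P, marked, B, i) for i in range(1, 11)])
```

Larger t means fewer back pointers (less space) and longer walks, which is the space/time trade-off stated in the theorem.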

14
Getting n lg n Bits: an Aside
  • This is the best we can do for O(1) operations
  • But using Beneš networks:
  • A 1-Beneš network is a 2-input/2-output switch
  • An (r+1)-Beneš network: join tops to tops
(Figure: an (r+1)-Beneš network with inputs 1 2 3 4 5 6 7 8 and outputs
3 5 7 8 1 6 4 2, built around two r-Beneš networks.)

15
A Benes Network
  • Realizing the permutation
  • (3 5 7 8 1 6 4 2)

(Figure: a Beneš network mapping inputs 1 2 3 4 5 6 7 8 to outputs 3 5 7 8 1 6 4 2.)

16
What can we do with it?
  • Divide into blocks of lg lg n gates; encode
    their actions in a word, taking advantage of the
    regularity of the address mechanism
  • and also
  • Modify the approach to avoid the power-of-2 issue
  • Can trace a path in time O(lg n / lg lg n)
  • This is the best time we are able to get for π and
    π^-1 in minimum space.
  • Observe: this method violates the pointer
    machine lower bound by using micropointers.

17
Back to the Main Track: Powers of π
  • Consider the cycles of π
  • (2 6 8)(3 5 9 10)(4 1 7)
  • Keep a bit vector to indicate the start of each
    cycle
  • (2 6 8 3 5 9 10 4 1 7)
  • Ignoring parentheses, view this as a new permutation, ψ.
  • Note ψ^-1(i) is the position containing i
  • So we have ψ and ψ^-1 as before
  • Use ψ^-1(i) to find i, then the bit vector (rank,
    select) to find π^k(i) or π^-k(i)
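
The cycle layout can be sketched on the slide's own example. The flattened cycle list is called `psi` here (an assumed name), the inverse is a plain dictionary, and rank/select are linear scans; positions are 0-indexed inside the code.

```python
CYCLES = [[2, 6, 8], [3, 5, 9, 10], [4, 1, 7]]        # cycles of the permutation

psi = [x for cyc in CYCLES for x in cyc]              # cycles with parentheses dropped
starts = [int(k == 0) for cyc in CYCLES for k in range(len(cyc))]  # 1 = cycle start
psi_inv = {value: pos for pos, value in enumerate(psi)}            # position holding i

def rank(pos: int) -> int:
    """Number of cycle starts in positions 0..pos."""
    return sum(starts[:pos + 1])

def select(k: int) -> int:
    """Position of the k-th cycle start; len(starts) if there is no k-th start."""
    seen = 0
    for pos, bit in enumerate(starts):
        seen += bit
        if seen == k:
            return pos
    return len(starts)

def power(i: int, k: int) -> int:
    """The k-th power of the permutation applied to i; k may be negative."""
    pos = psi_inv[i]
    c = rank(pos)                                     # which cycle i lies in
    begin, end = select(c), select(c + 1)             # that cycle's span in psi
    return psi[begin + (pos - begin + k) % (end - begin)]

print(power(2, 1), power(3, -1), power(9, 5))
```

Once the position of i and its cycle boundaries are known, any power is a single modular offset, so the cost does not depend on k.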

18
Functions
  • Now consider arbitrary functions [n] → [n]
  • A function is just a hairy permutation
  • All tree edges lead to a cycle

19
Challenges here
  • Essentially write down the components in a
    convenient order and use the n lg n bits to
    describe the mapping (as per permutations)
  • To get f^k(i):
  • Find the level ancestor (k levels up) in a tree
  • Or
  • Go up to the root and apply f the remaining number of
    steps around a cycle
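
The two cases for f^k(i) can be sketched with ordinary arrays. This is an illustration only: the example function F is made up, the level-ancestor step is a naive walk, and nothing here is succinct.

```python
def preprocess(f):
    """For a 1-indexed function f (f[0] unused) compute, per node, its depth
    below the cycle, the cycle node it drains into, and cycle positions."""
    n = len(f) - 1
    state = [0] * (n + 1)            # 0 = unvisited, 1 = on current path, 2 = done
    depth = [0] * (n + 1)
    root = [0] * (n + 1)
    cyc = {}                         # cycle node -> (cycle as list, index in it)
    for s in range(1, n + 1):
        path, j = [], s
        while state[j] == 0:
            state[j] = 1
            path.append(j)
            j = f[j]
        if state[j] == 1:            # closed a new cycle starting at j
            k = path.index(j)
            cycle = path[k:]
            for idx, v in enumerate(cycle):
                cyc[v] = (cycle, idx)
                root[v] = v
                state[v] = 2
            path = path[:k]
        for v in reversed(path):     # remaining nodes hang off finished nodes
            depth[v] = depth[f[v]] + 1
            root[v] = root[f[v]]
            state[v] = 2
    return depth, root, cyc

def f_power(f, depth, root, cyc, i, k):
    """f applied k times to i."""
    if k < depth[i]:
        for _ in range(k):           # naive stand-in for a level-ancestor query
            i = f[i]
        return i
    cycle, idx = cyc[root[i]]        # reach the cycle, then move around it
    return cycle[(idx + k - depth[i]) % len(cycle)]

def naive_power(f, i, k):
    for _ in range(k):
        i = f[i]
    return i

F = [0, 2, 3, 1, 1, 4, 6, 6, 7]      # made-up example: cycles (1 2 3) and (6)
DEPTH, ROOT, CYC = preprocess(F)
print(f_power(F, DEPTH, ROOT, CYC, 5, 4))
```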

20
Level Ancestors
  • There are several level ancestor techniques using
  • O(1) time and O(n) WORDS.
  • Adapt Bender & Farach-Colton to work in O(n) bits
  • But going the other way

21
f^-k is a set
  • Moving down the tree requires care
  • f^-3( ) ( )
  • The trick:
  • Report all nodes on a given level of a tree in
    time proportional to the number of nodes, and
  • Don't waste time on trees with no answers
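
To make the definition concrete, the set f^-k(i) can be computed by descending k levels of preimages. This naive sketch shows what is being reported, not how to do it efficiently: it may explore branches that die out before level k, which is exactly the waste the trick above avoids. The example function F is made up.

```python
def preimages(f):
    """children[v] = all u with f[u] == v; f is 1-indexed (f[0] unused)."""
    children = [[] for _ in f]
    for u in range(1, len(f)):
        children[f[u]].append(u)
    return children

def inverse_power(f, i, k):
    """The set of all j such that applying f to j exactly k times gives i."""
    children = preimages(f)
    level = [i]
    for _ in range(k):
        level = [u for v in level for u in children[v]]
    return set(level)

F = [0, 2, 3, 1, 1, 4, 6, 6, 7]      # made-up example: cycles (1 2 3) and (6)
print(inverse_power(F, 1, 2))
```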

22
Final Function Result
  • Given an arbitrary function f: [n] → [n]
  • With an n lg n + O(n) bit representation we can
    compute f^k(i) in O(1) time and f^-k(i) in time O(1 +
    size of answer).

23
Back to Text And Suffix Arrays
  • Text T[1..n] over {a, b}, ending in # (a < # < b)
  • There are 2^(n-1) such texts; which of the n! suffix
    arrays are valid?
  •     i: 1 2 3 4 5 6 7 8
  •    SA: 4 7 5 1 8 3 6 2
  • is the suffix array of
  •     T: a b b a a b a #
  • SA^-1: 4 8 6 1 3 7 2 5
  •     M: 4 7 1 5 8 2 3 6 isn't ... why?
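
The example can be reproduced by direct sorting. This sketch assumes the text ends with a terminator, written '#' here, that sorts between a and b; with that assumption the 7 letters on the slide yield exactly the 8-entry array shown. Function names are illustrative.

```python
ORDER = {"a": 0, "#": 1, "b": 2}      # assumed terminator '#', with a < # < b

def suffix_array(text: str) -> list:
    """1-indexed suffix array by direct sorting (fine for small examples)."""
    def key(i):
        return [ORDER[c] for c in text[i - 1:]]
    return sorted(range(1, len(text) + 1), key=key)

def inverse(perm: list) -> list:
    """inverse(perm)[v - 1] is the 1-indexed position of v in perm."""
    inv = [0] * len(perm)
    for pos, value in enumerate(perm, start=1):
        inv[value - 1] = pos
    return inv

SA = suffix_array("abbaaba#")
print(SA, inverse(SA))
```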

24
Ascending to Max
  • M is a permutation, so M^-1 is its inverse
  • i.e. M^-1[i] says where i is in M
  • Ascending-to-Max: ∀ 1 ≤ i ≤ n-2
  • M^-1[i] < M^-1[n] and M^-1[i+1] < M^-1[n] ⇒ M^-1[i]
    < M^-1[i+1]
  • M^-1[i] > M^-1[n] and M^-1[i+1] > M^-1[n] ⇒ M^-1[i]
    > M^-1[i+1]
  • 4 7 5 1 8 3 6 2 OK
  • 4 7 1 5 8 2 3 6 NO

25
Non-Nesting
  • Non-Nesting: ∀ 1 ≤ i, j ≤ n-1 with M^-1[i] < M^-1[j]
  • M^-1[i] < M^-1[i+1] and M^-1[j] < M^-1[j+1] ⇒
    M^-1[i+1] < M^-1[j+1]
  • M^-1[i] > M^-1[i+1] and M^-1[j] > M^-1[j+1] ⇒
    M^-1[i+1] < M^-1[j+1]
  • 4 7 5 1 8 3 6 2 OK
  • 4 7 1 5 8 2 3 6 NO

26
Characterization Theorem for Suffix Arrays on
Binary Texts
  • Theorem: Ascending-to-Max & Non-Nesting ⇔ Suffix
    Array
  • Corollary: clean method of breaking SA into
    segments
  • Corollary: linear-time algorithm to check whether
    SA is valid
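
Both conditions can be checked mechanically on the two example permutations. The sketch below tests the definitions literally, so the non-nesting check is quadratic; it is not the linear-time algorithm of the corollary. Function names are illustrative.

```python
def inverse(M):
    """1-indexed inverse: inv[v] is the position of v in M (inv[0] unused)."""
    inv = [0] * (len(M) + 1)
    for pos, value in enumerate(M, start=1):
        inv[value] = pos
    return inv

def ascending_to_max(M) -> bool:
    n, inv = len(M), inverse(M)
    for i in range(1, n - 1):                      # 1 <= i <= n-2
        below = inv[i] < inv[n] and inv[i + 1] < inv[n]
        above = inv[i] > inv[n] and inv[i + 1] > inv[n]
        if below and not inv[i] < inv[i + 1]:
            return False
        if above and not inv[i] > inv[i + 1]:
            return False
    return True

def non_nesting(M) -> bool:
    n, inv = len(M), inverse(M)
    for i in range(1, n):                          # 1 <= i, j <= n-1
        for j in range(1, n):
            if inv[i] < inv[j]:
                both_up = inv[i] < inv[i + 1] and inv[j] < inv[j + 1]
                both_down = inv[i] > inv[i + 1] and inv[j] > inv[j + 1]
                if (both_up or both_down) and not inv[i + 1] < inv[j + 1]:
                    return False
    return True

def is_valid_suffix_array(M) -> bool:
    return ascending_to_max(M) and non_nesting(M)

print(is_valid_suffix_array([4, 7, 5, 1, 8, 3, 6, 2]),
      is_valid_suffix_array([4, 7, 1, 5, 8, 2, 3, 6]))
```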

27
Cardinality Queries
  • T = a b a a a b b a a a b a a b b #
  • Remember the lengths of the longest run of a's and of b's
  • SA (broken by runs, but not stored explicitly):
  • 8 3 9 4 12 1 10 5 13 16 7 2 11
    15 6 14
  • B_a: bit vector ... if SA^-1[i-1] is in an "a"
    section, store 1 in B_a[SA^-1[i]], else 0
  • B_a = 0 0 1 1 0 0 1 1 1 0 0
    1 1 0 1 1
  • Create a rank structure on B_a, and similarly B_b
    (note these are reversed except at #)
  • Algorithm Count(T, P)
  •   s ← 1; e ← n; i ← m
  •   while i > 0 and s ≤ e do
  •     if P[i] = a then
  •       s ← rank1(B_a, s-1) + 1; e ← rank1(B_a, e)
  •     else
  •       s ← n_a + 2 + rank1(B_b, s-1); e ← n_a + 1 +
          rank1(B_b, e)
  •     i ← i-1
  •   return max(e-s+1, 0)
  • Time: O(length of query)
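
A runnable sketch of Count on the slide's text. It assumes a terminator '#' with a < # < b appended to the 15 letters shown, builds the suffix array by sorting, and uses a naive rank; n_a is the number of a's in T. Under those assumptions the computed B_a matches the vector on the slide.

```python
ORDER = {"a": 0, "#": 1, "b": 2}        # assumed terminator '#', with a < # < b
T = "abaaabbaaabaabb#"                  # slide text plus the assumed terminator

n = len(T)
SA = sorted(range(1, n + 1), key=lambda i: [ORDER[c] for c in T[i - 1:]])
n_a = T.count("a")

def preceded_by(ch):
    """Bit k is 1 iff the suffix of rank k is preceded by ch in T."""
    return [int(p > 1 and T[p - 2] == ch) for p in SA]

B_a, B_b = preceded_by("a"), preceded_by("b")

def rank1(bits, k):
    """Number of 1s among the first k bits (naive; an index would do this in O(1))."""
    return sum(bits[:k])

def count(P: str) -> int:
    """Number of occurrences of P in T, scanning P from its last character."""
    s, e, i = 1, n, len(P)
    while i > 0 and s <= e:
        if P[i - 1] == "a":
            s, e = rank1(B_a, s - 1) + 1, rank1(B_a, e)
        else:
            s, e = n_a + 2 + rank1(B_b, s - 1), n_a + 1 + rank1(B_b, e)
        i -= 1
    return max(e - s + 1, 0)

print(count("ab"), count("aab"), count("bb"), count("bab"))
```

Each pattern character costs two rank queries, which gives the O(length of query) bound once rank is constant time.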

28
Listing Queries
  • Complex methods
  • Key idea: for queries of length at least d, index
    every d-th position ... for T and for T reversed
  • So we have matches for T[i..n] and T[1..i-1]
  • View these as points in 2-space (Ferragina &
    Manzini and Grossi & Vitter)
  • Do a range query (Alstrup et al.)
  • Variety of results follow

29
General Conclusion
  • Interesting, and useful, combinatorial objects
    can be
  • Stored succinctly: O(lower bound) + o()
  • So that
  • Natural queries are performed in O(1) time (or at
    least very close)
  • This can make the difference between using them
    and not