Title: Compressing and Indexing Strings and labeled Trees
1Compressing and Indexing Strings and (labeled)
Trees
- Paolo Ferragina
- Dipartimento di Informatica, Università di Pisa
2Two types of data
- String: raw sequence of symbols from an alphabet Σ
- Texts
- DNA sequences
- Executables
- Audio files
- ...
- Labeled tree: tree of arbitrary shape and depth whose nodes are labeled with strings drawn from an alphabet Σ
- XML files
- Parse trees
- Tries and Suffix Trees
- Compiler intermediate representations
- Execution traces
- ...
3What do we mean by Indexing ?
- Word-based indexes: here a notion of word must be devised !
- Inverted files, Signature files, Bitmaps.
- Full-text indexes: no constraint on text and queries !
- Suffix Array, Suffix tree, ...
- Path indexes that also support navigational operations !
- see next...
Subset of XPath W3C
4What do we mean by Compression ?
- Data compression has two positive effects
- Space saving (or, enlarge memory at the same cost)
- Performance improvement
- Better use of memory levels closer to CPU
- Increased network, disk and memory bandwidth
- Reduced (mechanical) seek time
8Study the interplay of Compression and Indexing
- Do we witness a paradoxical situation ?
- An index injects redundant data, in order to speed up the pattern searches
- Compression removes redundancy, in order to squeeze the space occupancy
- NO, new results proved a mutual reinforcement behaviour !
- Better indexes can be designed by exploiting compression techniques
- Better compressors can be designed by exploiting indexing techniques
- More surprisingly, strings and labeled trees are closer than expected !
- Labeled-tree compression can be reduced to string compression
- Labeled-tree indexing can be reduced to special string indexing problems
9Our journey over string data
Index design (Weiner 73)
Compressor design (Shannon 48)
Burrows-Wheeler Transform (1994)
Suffix Array (87 and 90)
Wavelet Tree [Grossi-Gupta-Vitter, Soda 03]
Improved indexes and compressors for strings [Ferragina-Manzini-Makinen-Navarro, 04]
And many other papers of many other authors...
10The Suffix Array [Baeza-Yates-Gonnet, 87 and Manber-Myers, 90]
T = mississippi#
P = si
- The suffix permutation cannot be an arbitrary permutation of {1, ..., N}
- number of binary texts = 2^N ≪ N! = number of permutations on {1, 2, ..., N}
- Ω(N) bits is the worst-case lower bound
- Ω(N H(T)) bits for compressible texts
Several papers on characterizing the SA's permutation [Duval et al, 02] [Bannai et al, 03] [Munro et al, 05] [Stoye et al, 05]
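To make the suffix array concrete, here is a minimal sketch in Python. It assumes the text is terminated by a marker '#' that sorts before every letter, builds the array by naive sorting (not one of the efficient constructions cited in these slides), and finds a pattern by binary search. The function names `suffix_array` and `sa_range` are illustrative only.

```python
def suffix_array(text):
    """1-based start positions of the suffixes of text, in lexicographic order.

    Naive construction by direct sorting, for illustration only."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def sa_range(text, sa, pattern):
    """Half-open range [lo, hi) of sa whose suffixes start with pattern."""
    def prefix(r):
        start = sa[r] - 1
        return text[start:start + len(pattern)]

    lo, hi = 0, len(sa)
    while lo < hi:  # first row whose prefix is >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:  # first row whose prefix is > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return first, lo


T = "mississippi#"  # '#' is an assumed end-of-text marker, smaller than any letter
SA = suffix_array(T)
first, last = sa_range(T, SA, "si")
occurrences = sorted(SA[first:last])
print(SA)           # the suffix permutation of T
print(occurrences)  # 1-based positions where "si" occurs
```

All suffixes prefixed by the pattern are contiguous in the array, which is why two binary searches are enough to delimit them.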
11Can we compress the Suffix Array ?
[Ferragina-Manzini, Focs 00]
[Ferragina-Manzini, JACM 05]
- The FM-index is a data structure that mixes the best of
- Suffix array data structure
- Burrows-Wheeler Transform
- The theoretical result
- Query complexity: O(p + occ log^ε N) time
- Space occupancy: O(N Hk(T)) + o(N) bits, which is o(N) if T is compressible
- The corollary is that
- The Suffix Array is compressible
- It is a self-index
The index does not depend on k. The bound holds for all k, simultaneously.
New concept: the FM-index is an opportunistic data structure that takes advantage of repetitiveness in the input data to achieve compressed space occupancy while still offering efficient query performance.
12The Burrows-Wheeler Transform (1994)
Let us be given a text T = mississippi#
Rotations of T (sorting them gives the matrix with first column F and last column L):
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi
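The construction above can be sketched in a few lines of Python. This is a naive version that materializes and sorts all rotations, assuming '#' is an end-of-text marker smaller than every letter; a practical implementation would go through the suffix array instead.

```python
def bwt(text):
    """Return (F, L): first and last columns of the sorted rotation matrix."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    first_col = "".join(row[0] for row in rotations)
    last_col = "".join(row[-1] for row in rotations)
    return first_col, last_col


F, L = bwt("mississippi#")  # '#' is an assumed terminator
print(F)
print(L)
```

F is simply the sorted text, while L is the Burrows-Wheeler Transform that gets compressed.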
13Why is L so interesting for compression ?
Three rows of the sorted matrix, shown as F, middle part, L:
# mississipp i
i #mississip p
i ppi#missis s
- A key observation
- L is locally homogeneous
- Bzip vs. Gzip: 20% vs. 33% compression ratio !
Some theory behind it [Manzini, JACM 01]
Building the BWT reduces to SA construction; inverting the BWT reduces to an array visit. Overall Θ(N) time, but slower than gzip...
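14Is L helpful for full-text searching ?
Sorted rotations of T = mississippi#, each shown without its last character (the last characters form L):
#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m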
15A useful tool: L → F mapping
Three rows of the sorted matrix, shown as F, middle part, L:
# mississipp i
i #mississip p
i ppi#missis s
To implement the LF-mapping we need an oracle occ(c, j) = rank of char c in L[1, j]
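A minimal sketch of the LF-mapping in Python follows. Here `occ` is implemented by a naive linear scan rather than a succinct rank structure, the helper array `C` (count of characters smaller than c) and all function names are illustrative assumptions, and '#' is an assumed terminator that sorts first. Repeated LF steps walk the text backwards, which is enough to invert the BWT.

```python
def bwt_last_column(text):
    """Last column L of the sorted rotation matrix (naive construction)."""
    rows = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rows)


def occ(L, c, j):
    """Rank of character c in the prefix L[1, j] (1-based j; j = 0 gives 0)."""
    return L[:j].count(c)


def build_C(L):
    """C[c] = number of characters of L strictly smaller than c."""
    C, total = {}, 0
    for c in sorted(set(L)):
        C[c] = total
        total += L.count(c)
    return C


def lf(L, C, i):
    """Row (1-based) whose F-character is the same text character as L[i]."""
    c = L[i - 1]
    return C[c] + occ(L, c, i)


def invert_bwt(L, end="#"):
    """Rebuild the text from L by following LF steps from row 1."""
    C = build_C(L)
    i, out = 1, []  # row 1 is the rotation that starts with the terminator
    for _ in range(len(L) - 1):
        out.append(L[i - 1])  # L[i] is the character preceding F[i] in T
        i = lf(L, C, i)
    return "".join(reversed(out)) + end


L = bwt_last_column("mississippi#")
C = build_C(L)
print(lf(L, C, 1))    # row reached from row 1 by one LF step
print(invert_bwt(L))  # the original text
```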
16Substring search in T (count the pattern occurrences)
- Find the first c in L[fr, lr]
- Find the last c in L[fr, lr]
- L-to-F mapping of these chars
The occ() oracle is enough (i.e. Rank/Select primitives over L)
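The counting procedure can be sketched as a backward search that scans the pattern from its last character to its first. The sketch below is in Python, uses a naive `occ` in place of real Rank/Select structures, and assumes the text ends with a terminator '#' smaller than any letter; the names `count`, `build_C` and `bwt_last_column` are illustrative.

```python
def bwt_last_column(text):
    """Last column L of the sorted rotation matrix (naive construction)."""
    rows = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rows)


def occ(L, c, j):
    """Rank of character c in L[1, j] (1-based j; j = 0 gives 0)."""
    return L[:j].count(c)


def build_C(L):
    """C[c] = number of characters of L strictly smaller than c."""
    C, total = {}, 0
    for c in sorted(set(L)):
        C[c] = total
        total += L.count(c)
    return C


def count(L, C, pattern):
    """Number of occurrences of pattern in the text whose BWT is L."""
    fr, lr = 1, len(L)  # current range of rows, 1-based and inclusive
    for c in reversed(pattern):
        if c not in C:
            return 0
        # L-to-F mapping of the first and last c inside L[fr, lr]
        fr = C[c] + occ(L, c, fr - 1) + 1
        lr = C[c] + occ(L, c, lr)
        if fr > lr:
            return 0
    return lr - fr + 1


L = bwt_last_column("mississippi#")  # '#' is an assumed terminator
C = build_C(L)
for P in ("si", "ssi", "issi", "mississippi", "xyz"):
    print(P, count(L, C, P))
```

Each pattern character costs two occ() calls, so the number of steps depends on the pattern length and not on the text length.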
17Many details are missing...
- What about a large Σ ?
- Wavelet Tree and variations [Grossi et al, Soda 03] [F.M.-Makinen-Navarro, Spire 04]
- New approaches to Rank/Select primitives [Munro et al., Soda 06]
- Efficient and succinct index construction [Hon et al., Focs 03]
- In practice, Lightweight Algorithms: (5+ε)N bytes of space
- see [Manzini-Ferragina, Algorithmica 04]
18Five years of history...
FM-index [Ferragina-Manzini, Focs 00]
Space: 5 N Hk(T) + o(N) bits, for any k. Search: O(p + occ log^ε N)
Compact Suffix Array [Grossi-Vitter, Stoc 00]
Space: Θ(N) bits + text. Search: O(p + polylog(N) + occ log^ε N); o(p) time with Patricia Tree, O(occ) for short P
Look at the survey by Gonzalo Navarro and Veli Makinen
Wavelet Tree
WT variant
q-gram index [Kärkkäinen-Ukkonen, 96]
Succinct Suffix Tree: N log N + Θ(N) bits [Munro et al., 97ss]
LZ-index: Θ(N) bits and fast occ retrieval [Navarro, 03]
Variations over CSA and FM-index [Navarro, Makinen]
19What's next ?
20What about their practicality ?
December 2003
January 2005
22Is this a technological breakthrough ?
24Where we are...
Labeled Trees ?
Data type
Indexing
Compressed Indexing
25Why do we care about labeled trees ?
26An XML excerpt
<dblp>
  <book>
    <author> Donald E. Knuth </author>
    <title> The TeXbook </title>
    <publisher> Addison-Wesley </publisher>
    <year> 1986 </year>
  </book>
  <article>
    <author> Donald E. Knuth </author>
    <author> Ronald W. Moore </author>
    <title> An Analysis of Alpha-Beta Pruning </title>
    <pages> 293-326 </pages>
    <year> 1975 </year>
    <volume> 6 </volume>
    <journal> Artificial Intelligence </journal>
  </article>
  ...
</dblp>
27A tree interpretation...
- XML document exploration ≡ Tree navigation
- XML document search ≡ Labeled subpath searches
Subset of XPath W3C
28Our problem
- Consider a rooted, ordered, static tree T of arbitrary shape, whose t nodes are labeled with symbols from an alphabet Σ.
- We wish to devise a succinct representation for T that efficiently supports some operations over T's structure
- Navigational operations: parent(u), child(u, i), child(u, i, c)
- Subpath searches over a sequence of k labels
- Seminal work by Jacobson [Focs 90] dealt with binary unlabeled trees, achieving O(1) time per navigational operation and 2t + o(t) bits.
- Munro-Raman [Focs 97], then many others, extended to unlabeled trees of arbitrary degree and a richer set of navigational ops: subtree size, ancestor,...
- Geary et al [Soda 04] were the first to deal with labeled trees and navigational operations, but the space is Θ(t |Σ|) bits.
Yet, subpath searches are unexplored
29Our journey over labeled trees [Ferragina et al, Focs 05]
- We propose the XBW-transform that mimics on trees the nice structural properties of the BW-transform on strings.
- The XBW-transform linearizes the tree T in such a way that
- the indexing of T reduces to implementing simple rank/select operations over a string of symbols from Σ.
- the compression of T reduces to using any k-th order entropy compressor (gzip, bzip,...) over a string of symbols from Σ.
30The XBW-Transform
Sα   Sπ
C    ε
B    C
D    B C
c    D B C
a    D B C
c    B C
A    C
b    A C
a    A C
D    A C
c    D A C
B    C
D    B C
b    D B C
a    B C
Step 1. Visit the tree in pre-order. For each node, write down its label and the labels on its upward path
31The XBW-Transform
Sα   Sπ
C    ε
b    A C
a    A C
D    A C
D    B C
c    B C
D    B C
a    B C
B    C
A    C
B    C
c    D A C
c    D B C
a    D B C
b    D B C
Step 2. Stably sort according to Sπ
32The XBW-Transform
Slast  Sα   Sπ
1      C    ε
0      b    A C
0      a    A C
1      D    A C
0      D    B C
1      c    B C
0      D    B C
1      a    B C
0      B    C
0      A    C
1      B    C
1      c    D A C
0      c    D B C
1      a    D B C
1      b    D B C
Step 3. Add a binary array Slast marking the rows corresponding to last children
Key facts: nodes correspond to items in <Slast, Sα>. Node numbering has useful properties for compression and indexing.
XBW can be built and inverted in optimal O(t) time
XBW takes optimal t log |Σ| + 2t bits
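Steps 1 to 3 can be sketched in Python as follows. The nested-tuple tree below is a reconstruction inferred from the Sα and Sπ columns of these slides (the tree figure itself is not in the transcript), the function name `xbw` is illustrative, and the sort relies on Python's stable `sort` over tuples rather than on the linear-time construction mentioned above.

```python
def xbw(tree):
    """Return rows (last, alpha, pi) of the XBW-transform of tree.

    tree is a pair (label, list_of_children); pi is the upward path of a
    node, stored as a tuple that starts at its parent and ends at the root."""
    rows = []

    def visit(node, upward_path, is_last):
        label, children = node
        rows.append((1 if is_last else 0, label, upward_path))  # Step 1: pre-order
        for k, child in enumerate(children):
            visit(child, (label,) + upward_path, k == len(children) - 1)

    visit(tree, (), True)            # the root counts as a last child
    rows.sort(key=lambda r: r[2])    # Step 2: stable sort according to pi
    return rows                      # Step 3: the 'last' flag travels with each row


def leaf(label):
    return (label, [])


# Tree reconstructed from the slide's columns (an assumption, see the lead-in).
T = ("C", [
    ("B", [("D", [leaf("c"), leaf("a")]), leaf("c")]),
    ("A", [leaf("b"), leaf("a"), ("D", [leaf("c")])]),
    ("B", [("D", [leaf("b")]), leaf("a")]),
])

rows = xbw(T)
S_last = "".join(str(r[0]) for r in rows)
S_alpha = "".join(r[1] for r in rows)
S_pi = [" ".join(r[2]) for r in rows]
print(S_last)
print(S_alpha)
print(S_pi)
```

Because the sort is stable, nodes with the same upward path keep their pre-order, which is the property the later slides rely on.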
33The XBW-Transform is highly compressible
Slast = 1 0 0 1 0 1 0 1 0 0 1 1 0 1 1
Sα = C b a D D c D a B A B c c a b
Sπ = ε, A C, A C, A C, B C, B C, B C, B C, C, C, C, D A C, D B C, D B C, D B C
- XBW is highly compressible
- Sα is locally homogeneous (like BWT for strings)
- Slast has some structure (because of T's structure)
34XML Compression: XBW + PPMdi !
String compressors are not so bad !?!
35Structural properties of XBW
Slast = 1 0 0 1 0 1 0 1 0 0 1 1 0 1 1
Sα = C b a D D c D a B A B c c a b
Sπ = ε, A C, A C, A C, B C, B C, B C, B C, C, C, C, D A C, D B C, D B C, D B C
- Properties
- Relative order among nodes having the same leading path reflects the pre-order visit of T
- Children are contiguous in XBW (delimited by 1s)
- Children reflect the order of their parents
36The XBW is searchable
Slast = 1 0 0 1 0 1 0 1 0 0 1 1 0 1 1
Sα = C b a D D c D a B A B c c a b
Sπ = ε, A C, A C, A C, B C, B C, B C, B C, C, C, C, D A C, D B C, D B C, D B C
SS = 0 1 0 0 1 0 0 0 1 0 0 1 0 0 0 (the 1s mark the first row whose Sπ starts with A, B, C, D, respectively)
- XBW indexing: reduction to string indexing
- Store succinct and efficient Rank and Select data structures over these three arrays
37Subpath search in XBW
Slast = 1 0 0 1 0 1 0 1 0 0 1 1 0 1 1
Sα = C b a D D c D a B A B c c a b
Sπ = ε, A C, A C, A C, B C, B C, B C, B C, C, C, C, D A C, D B C, D B C, D B C
SS = 0 1 0 0 1 0 0 0 1 0 0 1 0 0 0
P = B D
- Inductive step
- Pick the next char P[i+1], i.e. D
- Search for the first and last D in Sα[fr, lr]
- → Jump to their children
Their children have upward path D B
38Subpath search in XBW
Slast = 1 0 0 1 0 1 0 1 0 0 1 1 0 1 1
Sα = C b a D D c D a B A B c c a b
Sπ = ε, A C, A C, A C, B C, B C, B C, B C, C, C, C, D A C, D B C, D B C, D B C
SS = 0 1 0 0 1 0 0 0 1 0 0 1 0 0 0
P = B D
- Inductive step
- Pick the next char P[i+1], i.e. D
- Search for the first and last D in Sα[fr, lr]
- → Jump to their children
Look at Slast to find the 2nd and 3rd group of children
Their children have upward path D B
Two occurrences because of two 1s
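The search above can be traced with a small Python sketch. The arrays are copied from the slide; the table `F` (first row whose Sπ starts with each internal label) is read off the 1s of the slide's fourth array; `rank` and `select` are naive linear scans standing in for succinct structures; and the sketch assumes every symbol of the pattern is an internal-node label. All function names are illustrative.

```python
S_LAST = [1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1]
S_ALPHA = list("CbaDDcDaBABccab")
F = {"A": 2, "B": 5, "C": 9, "D": 12}  # first row (1-based) of each S_pi block


def rank(seq, x, i):
    """Number of occurrences of x in seq[1, i] (1-based, i = 0 gives 0)."""
    return seq[:i].count(x)


def select(seq, x, k):
    """1-based position of the k-th x in seq; 0 when k == 0."""
    if k == 0:
        return 0
    seen = 0
    for pos, y in enumerate(seq, 1):
        if y == x:
            seen += 1
            if seen == k:
                return pos
    raise ValueError("fewer than k occurrences")


def children(i):
    """Rows (first, last) holding the children of the node at row i."""
    c = S_ALPHA[i - 1]
    r = rank(S_ALPHA, c, i)          # this is the r-th node labeled c
    z = rank(S_LAST, 1, F[c] - 1)    # groups of children that end before c's block
    return select(S_LAST, 1, z + r - 1) + 1, select(S_LAST, 1, z + r)


def subpath_search(pattern):
    """Rows whose upward path starts with the reversed pattern, or None."""
    first = F[pattern[0]]
    later = [start for start in F.values() if start > first]
    last = min(later) - 1 if later else len(S_ALPHA)
    for c in pattern[1:]:
        k1, k2 = rank(S_ALPHA, c, first - 1), rank(S_ALPHA, c, last)
        if k1 == k2:
            return None              # no c inside S_alpha[first, last]
        z1, z2 = select(S_ALPHA, c, k1 + 1), select(S_ALPHA, c, k2)
        first, last = children(z1)[0], children(z2)[1]
    return first, last


fr, lr = subpath_search("BD")
occurrences = sum(S_LAST[fr - 1:lr])  # one 1 per group of children
print((fr, lr), occurrences)
```

Counting the 1s of Slast in the final range gives the number of matching paths, since each matching node contributes exactly one group of children.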
39XML Compressed Indexing
What about XPress and XGrind ? XPress ≈ 30% (dblp 50%), XGrind ≈ 50%; no software running
40In summary [Ferragina et al, Focs 05]
- The XBW-transform takes optimal space 2t + t log |Σ| bits, and can be computed in optimal linear time.
- We can compress and index the XBW-transform so that
- its space occupancy is the optimal t H0(T) + 2t + o(t) bits
- navigational operations take O(log |Σ|) time
- subpath searches take O(p log |Σ|) time
If |Σ| = polylog(t), there is no log |Σ| factor (loglog |Σ| for general Σ [Munro et al, Soda 06])
New bread for Rank/Select people !!
- It is possible to extend these ideas to other XPath queries, like
- //path[text()=substring]
- //path1//path2
- ...
41The overall picture on Compressed Indexing...
Data type
Indexing
Kosaraju, Focs 89
Strong connection
Compressed Indexing
42Mutual reinforcement relationship...
We investigated the reinforcement relation: Compression ideas → Index design
Let's now turn to the other direction: Indexing ideas → Compressor design
Booster
43Compression Boosting for strings [Ferragina et al., J.ACM 2005]
- Qualitatively, the booster offers various properties
- The more compressible s is, the shorter c' is wrt c
- It deploys compressor A as a black-box, hence no change to A's structure is needed
- No loss in time efficiency, actually it is optimal
- Its performance holds for any string s, and it performs better than Gzip and Bzip
- It is fully combinatorial, hence it does not require any parameter estimation
44An interesting compression paradigm
PPC paradigm (Permutation, Partition, Compression)
- Problem 1. Fix a permutation P. Find a partitioning strategy and a compressor that minimize the number of compressed bits.
- If P = Id, this is classic data compression !
- Problem 2. Fix a compressor C. Find a permutation P and a partitioning strategy that minimize the number of compressed bits.
- Taking P = Id, PPC cannot be worse than compressor C alone.
- Our booster showed that a good P can make PPC far better.
- Other contexts: Tables [AT&T people], Graphs [Boldi-Vigna, WWW 04]
Theory is missing, here!
45Compression of labeled trees [Ferragina et al., Focs 05]
Extend the definition of Hk to labeled trees by taking as k-context of a node its leading path of length k (related to Markov random fields over trees)
A new paradigm for compressing the tree T: XBW(T)
46Thanks !!
47Where we are ...
We investigated the reinforcement relation: Compression ideas → Index design
Let's now turn to the other direction: Indexing ideas → Compressor design
Booster
48What do we mean by boosting ?
A memoryless compressor is poor in that it assigns codewords to symbols according only to their frequencies (e.g. Huffman).
It incurs some obvious limitations:
T = a^n b^n (highly compressible)
T = random string of n a's and n b's (incompressible)
49The empirical entropy Hk
H_k(T) = (1/|T|) \sum_{|w|=k} |T_w| H_0(T_w)
- T_w = string of symbols that precede w in T
Example: Given T = mississippi, we have T_i = mssp (the symbols preceding the four occurrences of i)
- Problems with this approach
- How to go from all T_w back to the string T ?
- How do we choose efficiently the best k ?
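The definition of Hk can be checked numerically with a short Python sketch. It follows the convention stated above (T_w collects the symbols that precede w in T); the function names `H0`, `contexts` and `Hk` are illustrative, and no terminator is appended to the text.

```python
from collections import Counter, defaultdict
from math import log2


def H0(s):
    """Zero-th order empirical entropy of s, in bits per symbol."""
    n = len(s)
    if n == 0:
        return 0.0
    return -sum(c / n * log2(c / n) for c in Counter(s).values())


def contexts(T, k):
    """Map each length-k substring w of T to T_w, the symbols preceding w."""
    Tw = defaultdict(str)
    for i in range(1, len(T) - k + 1):
        Tw[T[i:i + k]] += T[i - 1]
    return dict(Tw)


def Hk(T, k):
    """k-th order empirical entropy: (1/|T|) * sum over w of |T_w| * H0(T_w)."""
    return sum(len(s) * H0(s) for s in contexts(T, k).values()) / len(T)


T = "mississippi"
print(contexts(T, 1))
print(H0(T), Hk(T, 1), Hk(T, 2))

# The memoryless limitation: a^n b^n has the same H0 as a shuffled string,
# but its higher-order entropy is much smaller.
print(H0("aaaabbbb"), Hk("aaaabbbb", 1))
```

With k = 0 the only context is the empty string and T_w is T itself, so Hk reduces to H0.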
50Use BWT to approximate Hk
Bwt(T)
→ compress pieces of bwt(T) up to H0
Remember that...
51Finding the best pieces to compress...
Leaf cover ?
12 11 8 5 2 1 10 9 7 4 6 3
L1
L2
H1(T)
H2(T)
Goal find the best BWT-partition induced by a
Leaf Cover !!
Some leaf covers are related to Hk !!!
52A compression booster [Ferragina et al., JACM 05]
- Let Compr be the compressor we wish to boost
- Let LC1, ..., LCr be the partition of BWT(T) induced by a leaf cover LC, and let us define the cost of LC as cost(LC, Compr) = \sum_j |Compr(LCj)|
- Goal: find the leaf cover LC of minimum cost
- A post-order visit of the suffix tree (suffix array) suffices, in optimal time
- We have Cost(LC, Compr) ≤ Cost(Hk, Compr), which is related to Hk(T), for all k
This is purely combinatorial. We do not need any knowledge of the statistical properties of the source, no parameter estimation, no training,...
542001
55Locate the pattern occurrences in T
T = mississippi#
4
From s's position we get 4 + 3 = 7, ok !!
56What about their practicality ?
- We have a library that currently offers
- The FM-index: build, search, display,...
- The Suffix Array: construction in space (5+ε) n bytes
- The LCP Array: construction in space (6+ε) n bytes
57What about word-based searches ?
P = bzip
T = ...bzip...bzip2...unbzip2...unbzip...
...the post-processing phase can be time consuming !
- The FM-index can be adapted to support word-based searches
- Preprocess T and transform it into a digested text DT
Word-search in T = Substring-search in DT
- Use the FM-index over the digested DT
58The WFM-index
- Digested-T derived from a Huffman variant [Moura et al, 98]
- Symbols of the Huffman tree are the words of T
- The Huffman tree has fan-out 128
- Codewords are byte-aligned and tagged
Any word
P = bzip
1. Dictionary of words
3. FM-index built on DT
59A historical perspective
- Shannon showed a narrower result for a stationary ergodic source S
- Idea: compress groups of k chars in the string T
- Result: compression ratio → the entropy of S, for k → ∞
- Various limitations
- It works for a source S
- It must modify A's structure, because of the alphabet change
- For a given string T, the best k is found by trying k = 0, 1, ..., |T|
- Ω(|T|^2) time slowdown
- k is eventually fixed and this is not an optimal choice !
Any string s
Black-box
O(|s|) time
Variable length contexts
Two Key Components: Burrows-Wheeler Transform and Suffix Tree
60How do we find the best partition (i.e. k) ?
- Approximate via MTF [Burrows-Wheeler, 94]
- MTF is efficient in practice [bzip2]
- Theory and practice showed that we can aim for more !
- Use Dynamic Programming [Giancarlo-Sciortino, CPM 03]
- It finds the optimal partition
- Very slow, the time complexity is cubic in |T|
Surprisingly, full-text indexes help in finding the optimal partition in optimal linear time !!
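For reference, the MTF (move-to-front) step mentioned in the first bullet can be sketched in Python as below. This shows only the basic encoding and its inverse, not bzip2's full pipeline or the optimal-partition algorithm; the input string is the BWT of mississippi# under the assumption that '#' is the terminator, and the function names are illustrative.

```python
def mtf_encode(s, alphabet):
    """Replace each symbol by its current position in a list, then move it to the front."""
    table = list(alphabet)
    out = []
    for c in s:
        i = table.index(c)
        out.append(i)
        table.insert(0, table.pop(i))
    return out


def mtf_decode(codes, alphabet):
    """Inverse of mtf_encode, replaying the same list updates."""
    table = list(alphabet)
    out = []
    for i in codes:
        c = table.pop(i)
        out.append(c)
        table.insert(0, c)
    return "".join(out)


L = "ipssm#pissii"           # BWT of mississippi# (assumed terminator '#')
alphabet = sorted(set(L))    # ['#', 'i', 'm', 'p', 's']
codes = mtf_encode(L, alphabet)
print(codes)
print(mtf_decode(codes, alphabet))
```

Runs of equal symbols in L turn into runs of zeros, which is how MTF adapts to the local homogeneity of the BWT without choosing k explicitly.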
61Example: not one k
x^s y^n z^n > y x^s y^(n-1), z x^s z^(n-1)