Compressing and Indexing Strings and labeled Trees

Slides: 62

Provided by: paol94

Transcript and Presenter's Notes

1
Compressing and Indexing Strings and (labeled)
Trees

Paolo Ferragina
Dipartimento di Informatica, Università di Pisa

2
Two types of data

String: raw sequence of symbols from an alphabet Σ
Texts
DNA sequences
Executables
Audio files
...

Labeled tree: tree of arbitrary shape and depth whose nodes are labeled with strings drawn from an alphabet Σ
XML files
Parse trees
Tries and Suffix Trees
Compiler intermediate representations
Execution traces
...

3
What do we mean by Indexing ?

Word-based indexes, here a notion of word must
be devised !
Inverted files, Signature files, Bitmaps.

Full-text indexes, no constraint on text and
queries !
Suffix Array, Suffix tree, ...

Path indexes, which also support navigational operations !
see next... (subset of XPath, W3C)

4
What do we mean by Compression ?

Data compression has two positive effects
Space saving (or, enlarge memory at the same
cost)
Performance improvement
Better use of memory levels closer to CPU
Increased network, disk and memory bandwidth
Reduced (mechanical) seek time

8
Study the interplay of Compression and Indexing

Do we witness a paradoxical situation ?
An index injects redundant data, in order to
speed up the pattern searches
Compression removes redundancy, in order to
squeeze the space occupancy

NO, new results proved a mutual reinforcement
behaviour !
Better indexes can be designed by exploiting
compression techniques
Better compressors can be designed by exploiting
indexing techniques

More surprisingly, strings and labeled trees are
closer than expected !
Labeled-tree compression can be reduced to string
compression
Labeled-tree indexing can be reduced to special
string indexing problems

9
Our journey over string data
Index design (Weiner 73)
Compressor design (Shannon 48)
Burrows-Wheeler Transform (1994)
Suffix Array (87 and 90)
Wavelet Tree (Grossi-Gupta-Vitter, Soda 03)
Improved indexes and compressors for strings (Ferragina-Manzini-Makinen-Navarro, 04)
And many other papers of many other authors...
10
The Suffix Array (Baeza-Yates-Gonnet, 87 and Manber-Myers, 90)
T = mississippi#
P = si

The suffix permutation cannot be an arbitrary permutation of {1, ..., N}:
number of binary texts = 2^N ≪ N! = number of permutations on {1, 2, ..., N}
Ω(N) bits is the worst-case lower bound
Ω(N H(T)) bits for compressible texts

Several papers on characterizing the SA's permutation (Duval et al, 02; Bannai et al, 03; Munro et al, 05; Stoye et al, 05)
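
The suffix array idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the construction used in the cited papers: suffixes are sorted naively, the text is assumed to end with a terminator # that sorts before every letter, and the names suffix_array and sa_search are chosen here for the example.

```python
def suffix_array(text):
    """Return the starting positions (0-based) of the suffixes of text,
    listed in lexicographic order. Naive sort, for illustration only."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def sa_search(text, sa, pattern):
    """Return the sorted 0-based positions where pattern occurs in text,
    found with two binary searches over the suffix array."""
    m = len(pattern)

    # Lower bound: first suffix whose m-char prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-char prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])
```

With T = mississippi# and P = si, the suffixes prefixed by P form one contiguous range of the array, which is what the two binary searches delimit.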
11
Can we compress the Suffix Array ?
(Ferragina-Manzini, Focs 00)
(Ferragina-Manzini, JACM 05)

The FM-index is a data structure that mixes the best of
Suffix array data structure
Burrows-Wheeler Transform

The theoretical result
Query complexity: O(p + occ log^ε N) time
Space occupancy: O(N Hk(T)) + o(N) bits

→ o(N) if T compressible

The corollary is that
The Suffix Array is compressible
It is a self-index

The index does not depend on k: the bound holds for all k, simultaneously
New concept: The FM-index is an opportunistic data structure that takes advantage of repetitiveness in the input data to achieve compressed space occupancy, and still efficient query performance.
12
The Burrows-Wheeler Transform (1994)
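Let us be given a text T = mississippi#
Rotations of T (F and L denote the first and last columns of the sorted rotation matrix):
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi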
13
Why is L so interesting for compression ?
F, L = first and last columns; the rest of each row is unknown
# mississipp i
i #mississip p
i ppi#missis s

A key observation
L is locally homogeneous

Bzip vs. Gzip: 20% vs. 33% compression ratio !
Some theory behind: Manzini, JACM 01

Building the BWT → SA construction
Inverting the BWT → array visit
...overall Θ(N) time, but slower than gzip...
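
The two reductions above (building the BWT from the suffix array, inverting it by visiting an array) can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes the text ends with a unique terminator # that sorts before every other symbol, it sorts suffixes naively rather than in linear time, and the function names are chosen here.

```python
def bwt_via_suffix_array(text):
    """Last column L of the sorted rotation matrix, read off the suffix
    array: L[i] is the symbol that precedes the i-th smallest suffix."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # For the suffix starting at 0, text[-1] is the terminator, as wanted.
    return "".join(text[i - 1] for i in sa)


def inverse_bwt(last, end_marker="#"):
    """Rebuild the text from L by following the LF-mapping."""
    n = len(last)
    # A stable sort of L's positions pairs each row of F with a row of L:
    # the k-th occurrence of a symbol in F is its k-th occurrence in L.
    order = sorted(range(n), key=lambda i: last[i])
    lf = [0] * n
    for f_row, l_row in enumerate(order):
        lf[l_row] = f_row

    out = []
    row = 0  # row 0 is the rotation that starts with the end marker
    for _ in range(n - 1):
        out.append(last[row])  # symbols come out from right to left
        row = lf[row]
    return "".join(reversed(out)) + end_marker
```

For T = mississippi# the transform groups equal symbols into runs, which is the local homogeneity that a compressor such as bzip exploits.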
14
Is L helpful for full-text searching ?
Sorted rotations of T = mississippi# (last column is L):
#mississipp i
i#mississip p
ippi#missis s
issippi#mis s
ississippi# m
mississippi #
pi#mississi p
ppi#mississ i
sippi#missi s
sissippi#mi s
ssippi#miss i
ssissippi#m i
15
A useful tool: L → F mapping
F, L = first and last columns; the rest of each row is unknown
# mississipp i
i #mississip p
i ppi#missis s
To implement the LF-mapping we need an oracle
occ(c, j) = rank of char c in L[1, j]
16
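Substring search in T (count the pattern occurrences)
Scan the pattern backward; let c be the next character and L[fr, lr] the rows prefixed by the pattern suffix matched so far

Find the first c in L[fr, lr]

Find the last c in L[fr, lr]

L-to-F mapping of these chars

Occ() oracle is enough (i.e. Rank/Select primitives over L)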
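
The counting procedure can be sketched in Python as below. It is a minimal sketch under assumptions: occ is answered by scanning L (a real index would use succinct rank structures), rows are 0-based with half-open ranges, and the names backward_count and occ are chosen for this example.

```python
def backward_count(last, pattern):
    """Count the occurrences of pattern in the text whose BWT is last,
    using only the L column and an occ(c, j) oracle."""
    # first_row[c] = number of symbols in L smaller than c, i.e. the first
    # row of the sorted matrix that starts with c.
    first_row = {}
    total = 0
    for c in sorted(set(last)):
        first_row[c] = total
        total += last.count(c)

    def occ(c, j):
        # Rank of c in L[0:j]; naive scan standing in for a rank structure.
        return last[:j].count(c)

    fr, lr = 0, len(last)  # half-open range of rows: initially all rows
    for c in reversed(pattern):
        if c not in first_row:
            return 0
        # LF-mapping of the first and last c inside L[fr:lr].
        fr = first_row[c] + occ(c, fr)
        lr = first_row[c] + occ(c, lr)
        if fr >= lr:
            return 0
    return lr - fr
```

Each pattern character costs two occ queries, so the count takes O(p) oracle calls regardless of the number of occurrences.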
17
Many details are missing...

What about a large Σ ?
Wavelet Tree and variations (Grossi et al, Soda 03; F.M.-Makinen-Navarro, Spire 04)
New approaches to Rank/Select primitives (Munro et al., Soda 06)

Efficient and succinct index construction (Hon et al., Focs 03)
In practice, Lightweight Algorithms: (5+ε)N bytes of space
see Manzini-Ferragina, Algorithmica 04

18
Five years of history...
FM-index (Ferragina-Manzini, Focs 00)
Space: 5 N Hk(T) + o(N) bits, for any k
Search: O(p + occ log^ε N)
Compact Suffix Array (Grossi-Vitter, Stoc 00)
Space: Θ(N) bits + text
Search: O(p + polylog(N) + occ log^ε N)
o(p) time with Patricia Tree, O(occ) for short P
Look at the survey by Gonzalo Navarro and Veli Makinen
Wavelet Tree
WT variant
q-gram index (Kärkkäinen-Ukkonen, 96)
Succinct Suffix Tree: N log N + Θ(N) bits (Munro et al., 97ss)
LZ-index: Θ(N) bits and fast occ retrieval (Navarro, 03)
Variations over CSA and FM-index (Navarro, Makinen)
19
What's next ?
20
What about their practicality ?
December 2003
January 2005
22
Is this a technological breakthrough ?
24
Where we are...
Labeled Trees ?
Data type
Indexing
Compressed Indexing
25
Why do we care about labeled trees ?
26
An XML excerpt
<dblp>
  <book>
    <author> Donald E. Knuth </author>
    <title> The TeXbook </title>
    <publisher> Addison-Wesley </publisher>
    <year> 1986 </year>
  </book>
  <article>
    <author> Donald E. Knuth </author>
    <author> Ronald W. Moore </author>
    <title> An Analysis of Alpha-Beta Pruning </title>
    <pages> 293-326 </pages>
    <year> 1975 </year>
    <volume> 6 </volume>
    <journal> Artificial Intelligence </journal>
  </article>
  ...
</dblp>
27
A tree interpretation...

XML document exploration ≡ Tree navigation
XML document search ≡ Labeled subpath searches

Subset of XPath (W3C)
28
Our problem

Consider a rooted, ordered, static tree T of arbitrary shape, whose t nodes are labeled with symbols from an alphabet Σ.
We wish to devise a succinct representation for T that efficiently supports some operations over T's structure
Navigational operations: parent(u), child(u, i), child(u, i, c)
Subpath searches over a sequence of k labels

Seminal work by Jacobson (Focs 90) dealt with binary unlabeled trees, achieving O(1) time per navigational operation and 2t + o(t) bits.

Munro-Raman (Focs 97), then many others, extended to unlabeled trees of arbitrary degree and a richer set of navigational ops: subtree size, ancestor,...

Geary et al (Soda 04) were the first to deal with labeled trees and navigational operations, but the space is Θ(t |Σ|) bits.

Yet, subpath searches are unexplored
29
Our journey over labeled trees (Ferragina et al, Focs 05)

We propose the XBW-transform that mimics on trees the nice structural properties of the BW-transform on strings.

The XBW-transform linearizes the tree T in such a way that
the indexing of T reduces to implementing simple rank/select operations over a string of symbols from Σ.
the compression of T reduces to using any k-th order entropy compressor (gzip, bzip,...) over a string of symbols from Σ.

30
The XBW-Transform
Step 1. Visit the tree in pre-order. For each node, write down its label (Sα) and the labels on its upward path (Sπ)
Sα: C B D c a c A b a D c B D b a
Sπ: ε | C | B C | D B C | D B C | B C | C | A C | A C | A C | D A C | C | B C | D B C | B C
31
The XBW-Transform
Step 2. Stably sort according to Sπ
Sα: C b a D D c D a B A B c c a b
Sπ: ε | A C | A C | A C | B C | B C | B C | B C | C | C | C | D A C | D B C | D B C | D B C
32
The XBW-Transform
Step 3. Add a binary array Slast marking the rows corresponding to last children
Slast: 1 0 0 1 0 1 0 1 0 0 1 1 0 1 1
Sα: C b a D D c D a B A B c c a b
Sπ: ε | A C | A C | A C | B C | B C | B C | B C | C | C | C | D A C | D B C | D B C | D B C
XBW takes optimal t log |Σ| + 2t bits
XBW can be built and inverted in optimal O(t) time
Key facts: Nodes correspond to items in <Slast, Sα>. Node numbering has useful properties for compression and indexing
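
The three steps can be sketched in Python as below. This is an illustrative sketch, not the linear-time construction of the paper: it sorts the rows with a general-purpose stable sort, assumes single-character labels, and represents a node as a hypothetical (label, children) pair. EXAMPLE_TREE is the tree implied by the pre-order listing on the slides.

```python
# Each node is (label, list_of_children); labels are single characters.
EXAMPLE_TREE = (
    "C", [
        ("B", [("D", [("c", []), ("a", [])]), ("c", [])]),
        ("A", [("b", []), ("a", []), ("D", [("c", [])])]),
        ("B", [("D", [("b", [])]), ("a", [])]),
    ],
)


def xbw(tree):
    """Return (s_last, s_alpha, s_pi) for the given labeled tree."""
    rows = []  # one (upward_path, is_last_child, label) triple per node

    def visit(node, upward_path, is_last):
        label, children = node
        # Step 1: pre-order visit, recording label and upward path.
        rows.append((upward_path, 1 if is_last else 0, label))
        for i, child in enumerate(children):
            visit(child, label + upward_path, i == len(children) - 1)

    visit(tree, "", True)  # the root has an empty upward path

    # Step 2: stable sort by upward path (Python's sort is stable, so
    # nodes with equal paths keep their pre-order).
    rows.sort(key=lambda row: row[0])

    # Step 3: split the sorted rows into the three arrays.
    s_pi = [path for path, _, _ in rows]
    s_last = [last for _, last, _ in rows]
    s_alpha = "".join(label for _, _, label in rows)
    return s_last, s_alpha, s_pi
```

Only Slast and Sα need to be stored; Sπ is kept here to make the sorted order visible.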
33
The XBW-Transform is highly compressible
Slast: 1 0 0 1 0 1 0 1 0 0 1 1 0 1 1
Sα: C b a D D c D a B A B c c a b
Sπ: ε | A C | A C | A C | B C | B C | B C | B C | C | C | C | D A C | D B C | D B C | D B C

XBW is highly compressible
Sα is locally homogeneous (like BWT for strings)
Slast has some structure (because of T's structure)


34
XML Compression: XBW + PPMdi !
String compressors are not so bad !?!
35
Structural properties of XBW
Slast: 1 0 0 1 0 1 0 1 0 0 1 1 0 1 1
Sα: C b a D D c D a B A B c c a b
Sπ: ε | A C | A C | A C | B C | B C | B C | B C | C | C | C | D A C | D B C | D B C | D B C

Properties
Relative order among nodes having the same leading path reflects the pre-order visit of T
Children are contiguous in XBW (delimited by 1s)
Children reflect the order of their parents

36
The XBW is searchable
Slast: 1 0 0 1 0 1 0 1 0 0 1 1 0 1 1
Sα: C b a D D c D a B A B c c a b
Sπ: ε | A C | A C | A C | B C | B C | B C | B C | C | C | C | D A C | D B C | D B C | D B C
SS: 0 1 0 0 1 0 0 0 1 0 0 1 0 0 0 (the 1s mark the first rows whose Sπ starts with A, B, C, D)

XBW indexing: reduction to string indexing
Store succinct and efficient Rank and Select data structures over these three arrays

37
Subpath search in XBW
P = B D
Slast: 1 0 0 1 0 1 0 1 0 0 1 1 0 1 1
Sα: C b a D D c D a B A B c c a b
Sπ: ε | A C | A C | A C | B C | B C | B C | B C | C | C | C | D A C | D B C | D B C | D B C
SS: 0 1 0 0 1 0 0 0 1 0 0 1 0 0 0
Their children have upward path D B

Inductive step
Pick the next char in P[i+1], i.e. D
Search for the first and last D in Sα[fr, lr]
→ Jump to their children

38
Subpath search in XBW
P = B D
Slast: 1 0 0 1 0 1 0 1 0 0 1 1 0 1 1
Sα: C b a D D c D a B A B c c a b
Sπ: ε | A C | A C | A C | B C | B C | B C | B C | C | C | C | D A C | D B C | D B C | D B C
SS: 0 1 0 0 1 0 0 0 1 0 0 1 0 0 0
Look at Slast to find the 2nd and 3rd group of children
Their children have upward path D B

Inductive step
Pick the next char in P[i+1], i.e. D
Search for the first and last D in Sα[fr, lr]
→ Jump to their children

Two occurrences because of two 1s
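
The inductive step can be sketched in Python over the arrays shown on the slide. This is a hedged sketch under assumptions: rank and select are naive scans, Sπ is consulted directly to find where the block of a symbol starts (an index would use a small auxiliary array instead), labels of internal nodes are assumed disjoint from leaf labels as in the example, the pattern is non-empty, and all names are chosen here. For the last pattern symbol the sketch counts the matching entries of Sα, which for internal nodes equals the number of 1s in the children range used on the slide.

```python
S_LAST = [1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1]
S_ALPHA = "CbaDDcDaBABccab"
S_PI = ["", "AC", "AC", "AC", "BC", "BC", "BC", "BC",
        "C", "C", "C", "DAC", "DBC", "DBC", "DBC"]


def rank(seq, sym, j):
    """Number of occurrences of sym in seq[0:j]."""
    return sum(1 for x in seq[:j] if x == sym)


def select(seq, sym, k):
    """0-based index of the k-th occurrence of sym in seq (k >= 1)."""
    seen = 0
    for i, x in enumerate(seq):
        if x == sym:
            seen += 1
            if seen == k:
                return i
    raise ValueError("fewer than k occurrences")


def block_start(s_pi, c):
    """First row whose upward path starts with c, or None."""
    for i, path in enumerate(s_pi):
        if path.startswith(c):
            return i
    return None


def subpath_count(s_last, s_alpha, s_pi, pattern):
    """Count the downward paths in the tree labeled by pattern."""
    if len(pattern) == 1:
        return s_alpha.count(pattern)
    fr = block_start(s_pi, pattern[0])
    if fr is None:
        return 0
    lr = max(i for i, p in enumerate(s_pi) if p.startswith(pattern[0]))
    for i in range(1, len(pattern)):
        c = pattern[i]
        k1 = rank(s_alpha, c, fr) + 1   # rank of the first c in rows fr..lr
        k2 = rank(s_alpha, c, lr + 1)   # rank of the last c in rows fr..lr
        if k1 > k2:
            return 0
        if i == len(pattern) - 1:
            return k2 - k1 + 1
        start = block_start(s_pi, c)
        if start is None:
            return 0  # c labels only leaves, so the path cannot continue
        # Jump to the children: the k-th c owns the k-th group of rows in
        # c's block, and groups are delimited by the 1s of s_last.
        ones_before = rank(s_last, 1, start)
        fr = start if k1 == 1 else select(s_last, 1, ones_before + k1 - 1) + 1
        lr = select(s_last, 1, ones_before + k2)
    return 0
```

On the slide's example, subpath_count(S_LAST, S_ALPHA, S_PI, "BD") finds the 2nd and 3rd D of Sα inside the B block, matching the two occurrences reported above.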
39
XML Compressed Indexing
What about XPress and XGrind ? XPress ≈ 30% (dblp 50%), XGrind ≈ 50%; no software running
40
In summary (Ferragina et al, Focs 05)

The XBW-transform takes optimal space 2t + t log |Σ|, and can be computed in optimal linear time.

We can compress and index the XBW-transform so that
its space occupancy is the optimal t H0(T) + 2t + o(t) bits
navigational operations take O(log |Σ|) time
subpath searches take O(p log |Σ|) time

If |Σ| = polylog(t), no log |Σ| factor (loglog |Σ| for general Σ; Munro et al, Soda 06)
New bread for Rank/Select people !!

It is possible to extend these ideas to other XPath queries, like
//path[text()=substring]
//path1//path2
...

41
The overall picture on Compressed Indexing...
Data type
Indexing
Kosaraju, Focs 89
Strong connection
Compressed Indexing
42
Mutual reinforcement relationship...
We investigated the reinforcement relation: Compression ideas → Index design
Let's now turn to the other direction: Indexing ideas → Compressor design
Booster
43
Compression Boosting for strings (Ferragina et al., J.ACM 2005)

Qualitatively, the booster offers various properties
The more compressible s is, the shorter c' is w.r.t. c
It deploys compressor A as a black-box, hence no change to A's structure is needed
No loss in time efficiency, actually it is optimal
Its performance holds for any string s, and it turns out to be better than Gzip and Bzip
It is fully combinatorial, hence it does not require any parameter estimation

44
An interesting compression paradigm
PPC paradigm (Permutation, Partition, Compression)

Problem 1. Fix a permutation P. Find a partitioning strategy and a compressor that minimize the number of compressed bits.
If P = Id, this is classic data compression !

Problem 2. Fix a compressor C. Find a permutation P and a partitioning strategy that minimize the number of compressed bits.
Taking P = Id, PPC cannot be worse than compressor C alone.
Our booster showed that a good P can make PPC far better.
Other contexts: Tables (AT&T people), Graphs (Boldi-Vigna, WWW 04)

Theory is missing, here!
45
Compression of labeled trees (Ferragina et al., Focs 05)
Extend the definition of Hk to labeled trees by taking as k-context of a node its leading path of length k (related to Markov random fields over trees)
A new paradigm for compressing the tree T
XBW(T)
46
Thanks !!
47
Where we are ...
We investigated the reinforcement relation: Compression ideas → Index design
Let's now turn to the other direction: Indexing ideas → Compressor design
Booster
48
What do we mean by boosting ?
A memoryless compressor is poor in that it assigns codewords to symbols according only to their frequencies (e.g. Huffman)
It incurs some obvious limitations
T = a^n b^n (highly compressible)
T = random string of n a's and n b's (uncompressible)
49
The empirical entropy Hk
Hk(T) = (1/|T|) Σ_{|w|=k} |T_w| H0(T_w)

T_w = string of symbols that precede w in T

Example: Given T = mississippi, we have

Problems with this approach
How to go from all T_w back to the string T ?
How do we choose efficiently the best k ?
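
The definition of Hk can be made concrete with a short Python sketch. It follows the slide's convention that T_w collects the symbols preceding each occurrence of w; the function names h0, contexts and hk are chosen here for illustration, and logarithms are base 2 so that entropies are in bits per symbol.

```python
from collections import Counter, defaultdict
from math import log2


def h0(s):
    """Zero-order empirical entropy of s, in bits per symbol."""
    n = len(s)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(s).values())


def contexts(text, k):
    """Map each length-k substring w to T_w, the string of symbols that
    precede the occurrences of w in text."""
    t_w = defaultdict(str)
    for i in range(1, len(text) - k + 1):
        t_w[text[i:i + k]] += text[i - 1]
    return dict(t_w)


def hk(text, k):
    """k-th order empirical entropy: (1/|T|) * sum of |T_w| * H0(T_w)."""
    pieces = contexts(text, k).values()
    return sum(len(p) * h0(p) for p in pieces) / len(text)
```

For T = mississippi, contexts(T, 1)["i"] is "mssp" and contexts(T, 2)["is"] is "ms". On a^n b^n the order-1 value drops well below H0, while H0 alone cannot tell that string apart from a random arrangement of the same symbols.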

50
Use BWT to approximate Hk
Bwt(T)
→ compress pieces of bwt(T) up to H0
Remember that...
51
Finding the best pieces to compress...
Leaf cover ?
12 11 8 5 2 1 10 9 7 4 6 3
L1
L2
H1(T)
H2(T)
Goal find the best BWT-partition induced by a
Leaf Cover !!
Some leaf covers are related to Hk !!!
52
A compression booster (Ferragina et al., JACM 05)

Let Compr be the compressor we wish to boost
Let LC1, ..., LCr be the partition of BWT(T) induced by a leaf cover LC, and let us define the cost of LC as cost(LC, Compr) = Σ_j |Compr(LCj)|
Goal: Find the leaf cover LC of minimum cost
A post-order visit of the suffix tree (suffix array) suffices, in optimal time
We have: Cost(LC, Compr) ≤ Cost(Hk, Compr), which is related to Hk(T), for all k ≥ 0

This is purely combinatorial. We do not need any knowledge of the statistical properties of the source, no parameter estimation, no training,...
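
The post-order choice of the cheapest leaf cover can be sketched as a small recursion. Everything concrete here is an assumption made for illustration: the tree is given as nested Python lists whose leaves are single BWT symbols (a stand-in for the suffix tree), and toy_cost is a made-up cost function standing in for the size of Compr's output.

```python
def best_leaf_cover(node, cost):
    """Post-order visit returning (piece, pieces, total_cost), where piece
    is the BWT substring spanned by node and pieces is the cheapest
    partition of it induced by a leaf cover of node's subtree."""
    if isinstance(node, str):  # a leaf: one BWT symbol
        return node, [node], cost(node)

    piece, split, split_cost = "", [], 0
    for child in node:
        child_piece, child_split, child_cost = best_leaf_cover(child, cost)
        piece += child_piece
        split += child_split
        split_cost += child_cost

    # Either compress the whole span at this node, or keep the best
    # covers already chosen inside the children.
    whole_cost = cost(piece)
    if whole_cost <= split_cost:
        return piece, [piece], whole_cost
    return piece, split, split_cost


def toy_cost(piece):
    """Hypothetical cost: fixed overhead of 3, plus a penalty that is zero
    when the piece contains a single distinct symbol."""
    return 3 + len(piece) * (len(set(piece)) - 1)
```

With the tree [["a", "a", "a", "a"], ["b", "b", "b", "b"]] and toy_cost, the visit merges each homogeneous run into one piece but keeps the two runs apart, since compressing them together would cost more.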
54
2001
55
Locate the pattern occurrences in T
T = mississippi#
4
From s's position we get 4 + 3 = 7, ok !!
56
What about their practicality ?

We have a library that currently offers
The FM-index: build, search, display,...
The Suffix Array construction in space (5+ε) n bytes
The LCP Array construction in space (6+ε) n bytes

57
What about word-based searches ?
P = bzip
T = bzipbzip2unbzip2unbzip
...the post-processing phase can be time consuming !

The FM-index can be adapted to support word-based searches
Preprocess T and transform it into a digested text DT

Word-search in T = Substring-search in DT

Use the FM-index over the digested DT

58
The WFM-index

Digested-T derived from a Huffman variant (Moura et al, 98)
Symbols of the Huffman tree are the words of T
The Huffman tree has fan-out 128
Codewords are byte-aligned and tagged

Any word
P = bzip
1. Dictionary of words
3. FM-index built on DT
59
A historical perspective

Shannon showed a narrower result for a stationary ergodic source S
Idea: Compress groups of k chars in the string T
Result: Compress ratio → the entropy of S, for k → ∞
Various limitations
It works for a source S
It must modify A's structure, because of the alphabet change
For a given string T, the best k is found by trying k = 0, 1, ..., |T|
Ω(|T|^2) time slowdown
k is eventually fixed and this is not an optimal choice !

Any string s
Black-box
O(|s|) time
Variable length contexts
Two Key Components: Burrows-Wheeler Transform and Suffix Tree
60
How do we find the best partition (i.e. k)