XML Compression and Indexing

1
XML Compression and Indexing
The Future of Web Search Barcelona, May 2006

Paolo Ferragina
Dipartimento di Informatica, Università di Pisa
Joint with F. Luccio, G. Manzini, S. Muthukrishnan

Under patenting by Pisa-Rutgers Univ.
2
Compressed Permuterm Index

Paolo Ferragina, Rossano Venturini
Dipartimento di Informatica, Università di Pisa

Under Y!-patenting
3
A basic problem

Given a dictionary D of variable-length strings,
design a compressed data structure that supports:
string ↔ id
Prefix(a): find all strings in D that are prefixed by a
Suffix(b): find all strings in D that are suffixed by b
Substring(g): find all strings in D that contain g
PrefixSuffix(a,b) = Prefix(a) ∩ Suffix(b)

IR book of Manning-Raghavan-Schütze:
Tolerant Retrieval Problem (wildcards)
Prefix(a) = a*
Suffix(b) = *b
Substring(g) = *g*
PrefixSuffix(a,b) = a*b
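
As a reference point for the four operations above, here is a minimal uncompressed baseline in Python. It only pins down what each query should return; the function names and the sample dictionary are illustrative choices, and it scans all of D for every query, which is exactly what the index is meant to avoid.

```python
def prefix(D, a):
    """All strings in D that start with a (wildcard a*)."""
    return {s for s in D if s.startswith(a)}

def suffix(D, b):
    """All strings in D that end with b (wildcard *b)."""
    return {s for s in D if s.endswith(b)}

def substring(D, g):
    """All strings in D that contain g (wildcard *g*)."""
    return {s for s in D if g in s}

def prefix_suffix(D, a, b):
    """Intersection of the prefix and suffix answers, as defined on the slide."""
    return prefix(D, a) & suffix(D, b)

D = ["hat", "hip", "hop", "hot"]
print(sorted(prefix_suffix(D, "ho", "p")))
```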
4
A basic problem

Given a dictionary D of variable-length strings,
design a compressed data structure that supports:
string ↔ id
Prefix(a): find all s in D that are prefixed by a
Suffix(b): find all s in D that are suffixed by b
Substring(g): find all s in D that contain g
PrefixSuffix(a,b) = Prefix(a) ∩ Suffix(b)

Hashing: supports exact searches only, so none of
the queries above
5
A basic problem

Given a dictionary D of variable-length strings,
design a compressed data structure that supports:
string ↔ id
Prefix(a): find all s in D that are prefixed by a
Suffix(b): find all s in D that are suffixed by b
Substring(g): find all s in D that contain g
PrefixSuffix(a,b) = Prefix(a) ∩ Suffix(b)

(Compacted) Trie
Two versions, for D and for D^R (the reversed strings); intersect the answers
No substring search (unless using a Suffix Trie)
Need to store D for resolving edge labels
6
A basic problem

Given a dictionary D of variable-length strings,
design a compressed data structure that supports:
string ↔ id
Prefix(a): find all s in D that are prefixed by a
Suffix(b): find all s in D that are suffixed by b
Substring(g): find all s in D that contain g
PrefixSuffix(a,b) = Prefix(a) ∩ Suffix(b)

Front coding...
7
Front-coding
uk-2002 crawl: 250 Mb
bzip: 10% (we will come back to this later on!)

Two versions, for D and for D^R (the reversed strings); intersect the answers
Needs some extra data structures for bucket identification
No substring search
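
For readers who have not seen front coding, the sketch below shows the basic idea in Python: each string of the sorted dictionary is stored as the length of the prefix it shares with its predecessor plus the remaining tail. The function names and the sample words are illustrative, and the sketch ignores the bucketing mentioned above (practical variants restart the encoding at the head of each bucket so that decoding does not have to begin from the first string).

```python
def front_encode(sorted_words):
    """Encode each word as (shared-prefix length with the previous word, tail)."""
    out, prev = [], ""
    for w in sorted_words:
        k = 0
        while k < min(len(prev), len(w)) and prev[k] == w[k]:
            k += 1
        out.append((k, w[k:]))
        prev = w
    return out

def front_decode(pairs):
    """Rebuild the words by re-attaching each tail to the previous word's prefix."""
    words, prev = [], ""
    for k, tail in pairs:
        prev = prev[:k] + tail
        words.append(prev)
    return words

words = ["google", "goose", "yahoo", "yahooligans"]
encoded = front_encode(words)
print(encoded)
```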

8
A basic problem

Given a dictionary D of variable-length strings,
compress them in a way that efficiently supports:
string ↔ id
Prefix(a): find all s in D that are prefixed by a
Suffix(b): find all s in D that are suffixed by b
Substring(g): find all s in D that contain g
PrefixSuffix(a,b) = Prefix(a) ∩ Suffix(b)

Permuterm Index (Garfield, 1976): reduce
any query to a prefix query over a larger
dictionary
9
Permuterm Index (Garfield, 1976)

Take a dictionary D = {yahoo, google}
Append a special char $ to the end of each
string
Generate all rotations of these strings
yahoo$
ahoo$y
hoo$ya
oo$yah
o$yaho
$yahoo
google$
oogle$g
ogle$go
gle$goo
le$goog
e$googl
$google

Prefix(ya) → Prefix($ya)
Suffix(oo) → Prefix(oo$)
Substring(oo) → Prefix(oo)
PrefixSuffix(y,o) → Prefix(o$y)
Permuterm Dictionary (PD): space problems
Any query on D reduces to a prefix-query on PD
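
The reduction to prefix queries can be sketched in a few lines of Python. This is an uncompressed illustration only, assuming `$` does not occur in the dictionary strings; `permuterm`, `prefix_query` and `reduce_query` are names made up for the example. It stores every rotation explicitly, which is precisely the space problem noted above.

```python
from bisect import bisect_left

def permuterm(D):
    """Sorted list of (rotation of s+'$', s) for every string s in D."""
    rows = []
    for s in D:
        t = s + "$"
        rows.extend((t[i:] + t[:i], s) for i in range(len(t)))
    return sorted(rows)

def prefix_query(pd, p):
    """Original strings having some rotation that starts with p."""
    keys = [rot for rot, _ in pd]          # rebuilt per call, for brevity
    i = bisect_left(keys, p)               # first rotation >= p
    hits = set()
    while i < len(keys) and keys[i].startswith(p):
        hits.add(pd[i][1])
        i += 1
    return hits

def reduce_query(kind, a="", b=""):
    """Map a query on D to the prefix pattern to look up in the permuterm dictionary."""
    return {"prefix": "$" + a, "suffix": b + "$",
            "substring": a, "prefixsuffix": b + "$" + a}[kind]

pd = permuterm(["yahoo", "google"])
print(prefix_query(pd, reduce_query("prefixsuffix", a="y", b="o")))
```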
10
Compressed Permuterm Index
SIGIR 07

It deploys two ingredients
Permuterm index
Compressed full-text index

Theoretically
Query ops take optimal time, proportional to the
pattern length
Space occupancy is |D| H_k(D) + o(|D| log |Σ|)
bits

Technically
A simple reduction step: Permuterm → Compressed
index
Re-use known machinery on compressed indexes
Achieve bzip-compression at Front-coding speed

11
The Burrows-Wheeler Transform (1994)
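Take the text T = mississippi# and list all its rotations
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi
Sort the rotations: F is the first column and L is the last column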
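
A direct, quadratic way to compute the transform is to sort the rotations explicitly, as in the Python sketch below. It assumes an end marker `#` that does not occur in the text and sorts before every other character (true for ASCII letters); the function name `bwt` is illustrative, and practical tools build the transform through suffix sorting instead.

```python
def bwt(text, end="#"):
    """Return the first column F and last column L of the sorted rotation matrix."""
    t = text + end
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    F = "".join(r[0] for r in rotations)
    L = "".join(r[-1] for r in rotations)
    return F, L

F, L = bwt("mississippi")
print(F)  # #iiiimppssss
print(L)  # ipssm#pissii
```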
12
Compressing L is effective

Key observation
L is locally homogeneous

Bzip vs. Gzip: 20% vs. 33%, but bzip is slower in
(de)compression!

13
The FM-index
Ferragina-Manzini, JACM '05
The survey of Navarro-Mäkinen contains many other
indexes

The result
Count(P): O(p) time
Locate(P): O(occ polylog(|T|)) time
Display(T[i, i+L]): O(L + polylog(|T|)) time
Space occupancy: |T| H_k(T) + o(|T| log |Σ|) bits

New concept: the FM-index is an opportunistic
data structure
The Compressed Permuterm index builds upon the best
two features of the FM-index
14
First ingredient: L → F mapping
Top rows of the sorted rotation matrix of T = mississippi#
F            L
# mississipp i
i #mississip p
i ppi#missis s
15
First ingredient: L → F mapping
Top rows of the sorted rotation matrix of T = mississippi#
F            L
# mississipp i
i #mississip p
i ppi#missis s
The FM-index is actually a Rank data structure over the BWT
O(1) time and H_k space
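
The L → F mapping can be computed from L alone: the j-th occurrence of a character c in L corresponds to the j-th occurrence of c in F. The Python sketch below tabulates the whole mapping with plain counters; it is only an illustration of the definition (the name `lf_mapping` is made up), whereas the FM-index answers the same rank queries in O(1) time over a compressed L.

```python
from collections import Counter

def lf_mapping(L):
    """lf[i] = row whose first character is the same text symbol as L[i]."""
    counts = Counter(L)
    C, total = {}, 0
    for c in sorted(counts):          # C[c] = number of symbols smaller than c
        C[c] = total
        total += counts[c]
    seen, lf = Counter(), []
    for c in L:
        lf.append(C[c] + seen[c])     # seen[c] = rank of this occurrence of c in L
        seen[c] += 1
    return lf

L = "ipssm#pissii"                    # BWT of mississippi#
F = "".join(sorted(L))
lf = lf_mapping(L)
print(lf)
```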
16
Second ingredient: Backward step
Top rows of the sorted rotation matrix of T = mississippi#
F            L
# mississipp i
i #mississip p
i ppi#missis s
T is scanned backward by using the LF-mapping
[Figure: successive LF steps on the matrix]
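
To make the backward scan concrete, the following Python sketch rebuilds T from L by repeated LF steps. It assumes the end marker is the smallest symbol, so that row 0 is the rotation starting with it and L[0] is the last character of T; the name `inverse_bwt` is illustrative.

```python
from collections import Counter

def inverse_bwt(L):
    """Recover the text (without its end marker) by walking LF from row 0."""
    counts = Counter(L)
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    seen, lf = Counter(), []
    for c in L:
        lf.append(C[c] + seen[c])
        seen[c] += 1
    row, out = 0, []
    for _ in range(len(L) - 1):       # every symbol except the end marker
        out.append(L[row])            # L[row] precedes the symbol F[row] in T
        row = lf[row]
    return "".join(reversed(out))     # symbols were collected right to left

print(inverse_bwt("ipssm#pissii"))    # mississippi
```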
17
Third ingredient: substring search
Sorted rotation matrix of T = mississippi# (first 11 characters of each row, then L)
             L
#mississipp  i
i#mississip  p
ippi#missis  s
issippi#mis  s
ississippi#  m
mississippi  #
pi#mississi  p
ppi#mississ  i
sippi#missi  s
sissippi#mi  s
ssippi#miss  i
ssissippi#m  i
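
Substring search on the BWT is usually presented as backward search: the pattern is consumed right to left while a range of rows prefixed by the current pattern suffix is maintained. The Python sketch below counts occurrences this way; rank is computed naively with `str.count`, which stands in for the constant-time rank structure of a real index, and the name `count` is illustrative.

```python
def count(L, pattern):
    """Number of occurrences of pattern in the text whose BWT is L."""
    C = {c: sum(x < c for x in L) for c in set(L)}   # symbols smaller than c
    lo, hi = 0, len(L)                               # half-open row range
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + L.count(c, 0, lo)                # rank of c before row lo
        hi = C[c] + L.count(c, 0, hi)                # rank of c before row hi
        if lo >= hi:
            return 0
    return hi - lo

L = "ipssm#pissii"                                   # BWT of mississippi#
print(count(L, "ssi"))                               # 2
```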
18
The Compressed Permuterm
Z = $hat$hip$hop$hot$#
Some queries are trivial...
Prefix(a) → Substring search($a) within Z
Suffix(b) → Substring search(b$) within Z
Substr(g) → Substring search(g) within Z
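
The three easy reductions can be checked with ordinary string search, as in the Python sketch below. It assumes Z is the sorted dictionary with `$` before every string and `$#` at the end, and that neither symbol occurs in the strings; `build_z` and `search` are illustrative names, and the linear scan stands in for the compressed index's substring search.

```python
def build_z(D):
    """Concatenate the sorted dictionary as $s1$s2...$sm$#."""
    return "$" + "$".join(sorted(D)) + "$#"

def search(z, pattern):
    """Dictionary strings touched by an occurrence of pattern in z."""
    words = z.strip("$#").split("$")
    hits = set()
    for pos in range(len(z)):
        if z.startswith(pattern, pos):
            # first position of the hit that lies inside a word
            start = pos + 1 if z[pos] == "$" else pos
            if z[start] != "#":
                hits.add(words[z.count("$", 0, start) - 1])
    return sorted(hits)

z = build_z(["hat", "hip", "hop", "hot"])
print(search(z, "$ho"))   # Prefix(ho)
print(search(z, "p$"))    # Suffix(p)
print(search(z, "o"))     # Substring(o)
```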
19
PrefixSuffix search
20
PrefixSuffix(ho,p)
[Figure: backward search for ho with LF steps, followed by a CLF step]
No change in time/space bounds of compressed
indexes
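
One way to realise a PrefixSuffix search on a single BWT is to make the backward step "cyclic" within each dictionary string: whenever the current row starts with `$s_i`, jump to the row whose L symbol is the last character of `s_i` before taking the next step. The Python sketch below is a reconstruction under stated assumptions, not the authors' code: `$` is smaller and the end marker `~` (a stand-in chosen here) is larger than every dictionary character, the rotations are sorted naively, an explicit row-to-position array replaces a compressed locate, and `build`, `jump2end` and `prefix_suffix` are names chosen for the example.

```python
END = "~"  # stand-in end marker, assumed larger than every dictionary character

def build(words):
    """Naively sort the rotations of Z = $s1$s2...$sm$~ (demo only)."""
    words = sorted(set(words))
    z = "$" + "$".join(words) + "$" + END
    sa = sorted(range(len(z)), key=lambda i: z[i:] + z[:i])
    L = "".join(z[i - 1] for i in sa)                 # last column (cyclic)
    C = {c: sum(x < c for x in z) for c in set(z)}    # symbols smaller than c
    return words, z, sa, L, C

def prefix_suffix(index, a, b):
    """Strings with prefix a and suffix b, via backward search for b$a."""
    words, z, sa, L, C = index
    m = len(words)

    def jump2end(row):
        # Rows 0..m-1 start with "$s_i"; row+1 has the last char of s_i in L.
        return row + 1 if row < m else row

    first, last = 0, len(z) - 1                       # inclusive row range
    for c in reversed(b + "$" + a):
        if c not in C:
            return []
        first, last = jump2end(first), jump2end(last)
        first = C[c] + L.count(c, 0, first)
        last = C[c] + L.count(c, 0, last + 1) - 1
        if first > last:
            return []
    # Word id = number of '$' up to the row's text position, minus one.
    ids = {z.count("$", 0, sa[r] + 1) - 1 for r in range(first, last + 1)}
    return [words[i] for i in sorted(ids) if i < m]

index = build(["hat", "hip", "hop", "hot"])
print(prefix_suffix(index, "ho", "p"))                # ['hop']
```

Without the jump, the step that follows `$ho` would inspect the characters preceding `$hop` and `$hot` in Z, which belong to the previous strings, so the query would miss `hop`.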
21
Rank and Select of strings
Z = $hat$hip$hop$hot$#
Other queries...
Rank(s): the row of $s$
Select(i): scan backward from L[i+1]
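
Both conversions between strings and ids can be expressed with the same BWT machinery. In the Python sketch below, which is an illustration under assumptions rather than the authors' implementation, ids are 0-based, `$` is smaller and the end marker `~` (a stand-in chosen here) is larger than every dictionary character, and rank over L is computed naively; the names `build`, `lf`, `rank` and `select` are chosen for the example.

```python
END = "~"  # stand-in end marker, assumed larger than every dictionary character

def build(words):
    """BWT column L and symbol counts C for Z = $s1$s2...$sm$~ (naive sort)."""
    words = sorted(set(words))
    z = "$" + "$".join(words) + "$" + END
    rows = sorted(z[i:] + z[:i] for i in range(len(z)))
    L = "".join(r[-1] for r in rows)
    C = {c: sum(x < c for x in z) for c in set(z)}
    return L, C

def lf(L, C, row):
    """One backward step: the row that starts with the symbol L[row]."""
    c = L[row]
    return C[c] + L.count(c, 0, row)

def rank(L, C, s):
    """0-based id of s: the row prefixed by $s$, or None if s is not in D."""
    lo, hi = 0, len(L)
    for c in reversed("$" + s + "$"):
        if c not in C:
            return None
        lo, hi = C[c] + L.count(c, 0, lo), C[c] + L.count(c, 0, hi)
        if lo >= hi:
            return None
    return lo

def select(L, C, i):
    """String with 0-based id i: read L backward from row i+1 until '$'."""
    row, chars = i + 1, []
    while L[row] != "$":
        chars.append(L[row])
        row = lf(L, C, row)
    return "".join(reversed(chars))

L, C = build(["hat", "hip", "hop", "hot"])
print(rank(L, C, "hop"), select(L, C, 2))
```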
22
Experiments

Three dictionaries
Term dictionary: Trec WT10G
Host dictionary (reversed): UK-2005
Url dictionary (host reversed): first 190 Mb of
UK-2005

          Term     Host     Url
size      118 Mb   34 Mb    190 Mb
strings   10 Mil   2 Mil    3 Mil
FC        40%      45%      30%
bzip      33%      25%      10%
PrefixSuffix search needs 2
24
A test on URLs
Choose your trade-off
The MRS book says: "one disadvantage of the PI is
that its dictionary becomes quite large,
including as it does all rotations of each term."
Now, they mention the CPI
Trade-off

Time of 20-60 μsec/char, and space close to bzip
Time close to Front-Coding (4 μsec/char), but
<50% of its space

25
We proposed an approach for dictionary storage
Theory: optimal time and entropy bounds for space
Practice: trades time vs. space, thus fitting
user needs
