Title: XML Compression and Indexing
1 XML Compression and Indexing
The Future of Web Search, Barcelona, May 2006
- Paolo Ferragina
- Dipartimento di Informatica, Università di Pisa
- Joint with F. Luccio, G. Manzini, S. Muthukrishnan
Under patenting by Pisa-Rutgers Univ.
2 Compressed Permuterm Index
- Paolo Ferragina, Rossano Venturini
- Dipartimento di Informatica, Università di Pisa
Under Y!-patenting
3 A basic problem
- Given a dictionary D of strings, having variable length, design a compressed data structure that supports:
  - string ↔ id
  - Prefix(a): find all strings in D that are prefixed by a
  - Suffix(b): find all strings in D that are suffixed by b
  - Substring(g): find all strings in D that contain g
  - PrefixSuffix(a,b) = Prefix(a) ∩ Suffix(b)
IR book of Manning-Raghavan-Schutze: Tolerant Retrieval Problem (wildcards)
Prefix(a) = a*, Suffix(b) = *b, Substring(g) = *g*, PrefixSuffix(a,b) = a*b
4 A basic problem
Hashing → no: it supports only exact searches
5 A basic problem
(Compacted) Trie
- Two versions, one for D and one for D^R (the reversed strings); intersect the answers
- No substring search (unless using a Suffix Trie)
- Need to store D for resolving edge-labels
6 A basic problem
Front coding...
7 Front-coding
uk-2002 crawl: 250 Mb
bzip: 10% (be back on this, later on!)
- Two versions, one for D and one for D^R (the reversed strings); intersect the answers
- Need some extra data structures for bucket identification
- No substring search
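As a rough illustration of front coding, the sketch below stores each string of a sorted dictionary as the length of the prefix it shares with its predecessor plus the remaining suffix. The function names and the sample host names are illustrative assumptions, and the bucketing mentioned on the slide is left out for brevity.

```python
from os.path import commonprefix


def front_encode(sorted_strings):
    """Encode each string as (length of prefix shared with predecessor, rest)."""
    encoded, prev = [], ""
    for s in sorted_strings:
        lcp = len(commonprefix([prev, s]))
        encoded.append((lcp, s[lcp:]))
        prev = s
    return encoded


def front_decode(encoded):
    """Rebuild the strings by re-using the prefix of the previously decoded one."""
    decoded, prev = [], ""
    for lcp, suffix in encoded:
        prev = prev[:lcp] + suffix
        decoded.append(prev)
    return decoded


# Hypothetical reversed host names, sorted so that neighbours share long prefixes.
hosts = sorted(["uk.ac.ox", "uk.ac.cam", "uk.co.bbc", "uk.co.bbc.news"])
codes = front_encode(hosts)
```

Decoding a string requires scanning from the start of the list (or of its bucket, in a bucketed variant), which is why extra structures for bucket identification are needed in practice.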
8 A basic problem
Permuterm Index (Garfield, 76) → reduce any query to a prefix query over a larger dictionary
9 Permuterm Index (Garfield, 1976)
- Take a dictionary D = {yahoo, google}
- Append a special char $ to the end of each string
- Generate all rotations of these strings
  - yahoo$
  - ahoo$y
  - hoo$ya
  - oo$yah
  - o$yaho
  - $yahoo
  - google$
  - oogle$g
  - ogle$go
  - gle$goo
  - le$goog
  - e$googl
  - $google
Prefix(ya) → Prefix($ya)
Suffix(oo) → Prefix(oo$)
Substring(oo) → Prefix(oo)
PrefixSuffix(y,o) → Prefix(o$y)
Permuterm Dictionary (PD)
Space problems
Any query on D reduces to a prefix-query on PD
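The reduction can be sketched with a plain, uncompressed permuterm dictionary: a sorted list of rotations searched by binary search. This is a minimal sketch, assuming `$` as the end marker and using illustrative function names; it shows the query rewriting only, not the space-efficient structure.

```python
from bisect import bisect_left

END = "$"  # assumed end marker; must not occur inside dictionary strings


def build_permuterm(dictionary):
    """Return the sorted list of (rotation of s+END, s) pairs."""
    entries = []
    for s in dictionary:
        t = s + END
        for i in range(len(t)):
            entries.append((t[i:] + t[:i], s))
    entries.sort()
    return entries


def prefix_query(entries, p):
    """All original strings owning a rotation that starts with p."""
    i = bisect_left(entries, (p,))  # first rotation >= p
    found = set()
    while i < len(entries) and entries[i][0].startswith(p):
        found.add(entries[i][1])
        i += 1
    return found


def prefix(entries, a):
    return prefix_query(entries, END + a)


def suffix(entries, b):
    return prefix_query(entries, b + END)


def substring(entries, g):
    return prefix_query(entries, g)


def prefix_suffix(entries, a, b):
    return prefix_query(entries, b + END + a)


pd = build_permuterm(["yahoo", "google"])
```

The list holds one rotation per character of every string, which is the space problem the slide points at.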
10 Compressed Permuterm Index
SIGIR 07
- It deploys two ingredients
  - Permuterm index
  - Compressed full-text index
- Theoretically
  - Query ops take optimal time, proportional to the pattern length
  - Space occupancy is |D| Hk(D) + o(|D| log |Σ|) bits
- Technically
  - A simple reduction step: Permuterm → Compressed index
  - Re-use known machinery on compressed indexes
  - Achieve bzip-compression at Front-coding speed
11 The Burrows-Wheeler Transform (1994)
Take the text T = mississippi#
All rotations of T:
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi
Sort the rows: F is the first column, L is the last column
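The transform can be sketched by sorting the rotations explicitly. This is a quadratic-space illustration rather than how a real index is built, and the end marker `#` (taken to be smaller than every letter) is an assumption of the sketch.

```python
def bwt(text):
    """Return the first (F) and last (L) columns of the sorted rotation matrix."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    first = "".join(row[0] for row in rotations)
    last = "".join(row[-1] for row in rotations)
    return first, last


F, L = bwt("mississippi#")
```

Here `F` is simply the sorted characters of the text, while `L` is the string that gets compressed and indexed.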
12 Compressing L is effective
- Key observation
  - L is locally homogeneous
- Bzip vs. Gzip: 20% vs. 33%, but bzip is slower in (de)compression!
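One common way to exploit local homogeneity, used in bzip2-style pipelines, is move-to-front coding: runs of equal or recently seen characters turn into small integers that an entropy coder handles well. The slide does not name this step, so the sketch below is an illustrative assumption about how the property can be exploited.

```python
def mtf_encode(s, alphabet):
    """Replace each char by its position in a list kept in most-recently-used order."""
    table = list(alphabet)
    codes = []
    for c in s:
        i = table.index(c)
        codes.append(i)
        table.insert(0, table.pop(i))  # move the char to the front
    return codes


def mtf_decode(codes, alphabet):
    """Invert mtf_encode by replaying the same list updates."""
    table = list(alphabet)
    chars = []
    for i in codes:
        c = table.pop(i)
        chars.append(c)
        table.insert(0, c)
    return "".join(chars)


run_codes = mtf_encode("aaabbb", "ab")
```

On a string made of long runs, most emitted codes are 0, which is what makes the later coding stages effective.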
13 The FM-index
Ferragina-Manzini, JACM 05
Survey of Navarro-Makinen contains many other indexes
- The result
  - Count(P): O(p) time, where p = |P|
  - Locate(P): O(occ polylog(|T|)) time, where occ is the number of occurrences
  - Display(T[i,i+L]): O(L + polylog(|T|)) time
  - Space occupancy: |T| Hk(T) + o(|T| log |Σ|) bits
New concept: the FM-index is an opportunistic data structure
Compressed Permuterm index builds upon the best two features of the FM-index
14 First ingredient: L → F mapping
[Figure: sorted rotations of T; the first column F and the last column L are known, the middle of the matrix is unknown]
15 First ingredient: L → F mapping
[Figure: sorted rotations of T; the first column F and the last column L are known, the middle of the matrix is unknown]
The FM-index is actually a Rank data structure over the BWT: O(1) time and Hk-space
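The mapping follows from one fact: the k-th occurrence of a character in L and the k-th occurrence of the same character in F are the same text character. A minimal sketch, assuming the BWT of mississippi# as input and using a linear precomputation in place of the constant-time compressed Rank structure:

```python
from collections import Counter


def build_lf(L):
    """LF[i] = row whose first char (in F) is the same text char as L[i]."""
    counts = Counter(L)
    C, total = {}, 0
    for c in sorted(counts):      # C[c] = number of chars in L smaller than c
        C[c] = total
        total += counts[c]
    seen = Counter()
    lf = []
    for c in L:
        lf.append(C[c] + seen[c])  # seen[c] = rank of c in L before this row
        seen[c] += 1
    return lf


L = "ipssm#pissii"                 # BWT of mississippi#, assumed end marker '#'
F = "".join(sorted(L))
lf = build_lf(L)
```

In a real FM-index only `C` and a rank structure over `L` are stored, so `LF[i]` is computed on demand instead of being tabulated.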
16 Second ingredient: Backward step
[Figure: sorted rotations of T with columns F and L; arrows labeled LF link a char in L to the same char in F]
T is scanned backward by using the LF-mapping
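Repeating the backward step recovers the whole text from L alone. The sketch below is an illustration under two assumptions: the text ends with a unique marker `#` that is smaller than every other character, and the LF table is built by a stable sort instead of a compressed rank structure.

```python
def inverse_bwt(L, end="#"):
    """Rebuild the text from its BWT by walking it backward with LF steps."""
    assert L.count(end) == 1 and min(L) == end
    # Stable sort of row indexes by char: the k-th c in L is the k-th c in F.
    order = sorted(range(len(L)), key=lambda i: (L[i], i))
    lf = [0] * len(L)
    for f_row, l_row in enumerate(order):
        lf[l_row] = f_row
    row, chars = 0, []            # row 0 starts with the end marker
    for _ in range(len(L) - 1):
        chars.append(L[row])      # the char preceding the current row's first char
        row = lf[row]             # move to the row that starts with that char
    return "".join(reversed(chars)) + end


text = inverse_bwt("ipssm#pissii")
```

Each iteration emits one character of T, from the last one before the marker down to the first.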
17 Third ingredient: substring search
Sorted rotations of T = mississippi#, with the last column L set apart:
#mississipp  i
i#mississip  p
ippi#missis  s
issippi#mis  s
ississippi#  m
mississippi  #
pi#mississi  p
ppi#mississ  i
sippi#missi  s
sissippi#mi  s
ssippi#miss  i
ssissippi#m  i
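Substring search proceeds backward over the pattern, keeping the range of sorted rows prefixed by the pattern suffix read so far. The sketch below counts occurrences this way; the linear-scan `rank` is a stand-in for the constant-time compressed Rank structure, and the input is assumed to be the BWT of mississippi#.

```python
from collections import Counter


def backward_count(L, pattern):
    """Count occurrences of pattern in the text whose BWT is L."""
    counts = Counter(L)
    C, total = {}, 0
    for c in sorted(counts):      # C[c] = number of chars in L smaller than c
        C[c] = total
        total += counts[c]

    def rank(c, i):
        """Occurrences of c in L[:i] (linear scan, for illustration only)."""
        return L[:i].count(c)

    lo, hi = 0, len(L)            # half-open range of candidate rows
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + rank(c, lo)
        hi = C[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


L = "ipssm#pissii"
```

The loop performs one step per pattern character, which is where the O(p) bound for Count(P) comes from once `rank` takes constant time.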
18 The Compressed Permuterm
Z = the strings hat, hip, hop, hot concatenated, delimited by the special char #
Some queries are trivial...
- Prefix(a) → Substring search(#a) within Z
- Suffix(b) → Substring search(b#) within Z
- Substr(g) → Substring search(g) within Z
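These three reductions can be tried on an uncompressed string, with `str.find` standing in for the compressed index's substring search. The exact layout of Z used here (a `#` before the first string and after every string), the `#` separator itself, and all function names are assumptions of the sketch; patterns are assumed non-empty.

```python
from bisect import bisect_right


def build_z(dictionary, sep="#"):
    """Return Z = sep+s1+sep+s2+...+sep and the start position of each string."""
    z, starts = sep, []
    for s in sorted(dictionary):
        starts.append(len(z))
        z += s + sep
    return z, starts


def occurrences(z, pattern):
    """All starting positions of pattern in z."""
    positions, i = [], z.find(pattern)
    while i != -1:
        positions.append(i)
        i = z.find(pattern, i + 1)
    return positions


def string_ids(z, starts, pattern, offset=0):
    """Ids (ranks in sorted order) of the strings hit by the occurrences."""
    return sorted({bisect_right(starts, p + offset) - 1
                   for p in occurrences(z, pattern)})


def prefix(z, starts, a, sep="#"):
    return string_ids(z, starts, sep + a, offset=1)  # skip the leading separator


def suffix(z, starts, b, sep="#"):
    return string_ids(z, starts, b + sep)


def substring(z, starts, g):
    return string_ids(z, starts, g)


Z, STARTS = build_z(["hat", "hip", "hop", "hot"])
```

PrefixSuffix is not covered by this sketch: a pattern b#a would match the end of one string followed by the start of the next, which is why that query needs separate treatment.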
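19 PrefixSuffix search
[Figure: sorted rotation matrix of Z, middle columns unknown]
20 PrefixSuffix(ho,p)
[Figure: backward search involving ho over the sorted rotation matrix of Z, with the mappings LF and CLF]
No change in time/space bounds of compressed indexes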
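21 Rank and Select of strings
[Figure: sorted rotation matrix of Z, middle columns unknown]
Z = the strings hat, hip, hop, hot concatenated, delimited by the special char #
Other queries...
- Rank(s): row of s
- Select(i): backward from L[i+1]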
22 Experiments
- Three dictionaries
  - Term dictionary: Trec WT10G
  - Host dictionary (reversed): UK-2005
  - Url dictionary (host reversed): first 190 Mb of UK-2005

          Term     Host     Url
size      118 Mb   34 Mb    190 Mb
strings   10 Mil   2 Mil    3 Mil
FC        40%      45%      30%
bzip      33%      25%      10%
PrefixSuffix search needs 2
24 A test on URLs
Choose your trade-off
MRS book says: "one disadvantage of the PI is that its dictionary becomes quite large, including as it does all rotations of each term."
[Figure axis label: dict-size]
Now, they mention CPI
Trade-off
- Time of 20-60 msec/char, and space close to bzip
- Time close to Front-Coding (4 msec/char), but <50% of its space
25 We proposed an approach for dictionary storage
Theory: optimal time and entropy-bounds for space
Practice: trades time vs space, thus fitting user needs