Title: Advanced Seminar in Data Structures
1. Advanced Seminar in Data Structures
- 28/12/2004
- An Analysis of the Burrows-Wheeler Transform (Giovanni Manzini)
- Presented by Assaf Oren
2. Topics
- Introduction
- Burrows-Wheeler Transform
- Move-to-Front
- Empirical Entropy
- Order-0 coder
- Analysis of the BW0 algorithm
- Run-Length Encoding
- Analysis of the BW0RL algorithm
3. Introduction
- A BWT-based algorithm:
- Takes the input string s
- Transforms it to bwt(s)
- |bwt(s)| = |s|
- Compresses bwt(s) with a compressor A
- The compressed string is A(bwt(s))
4. Introduction (cont.)
- Notation
- Recoding scheme tra(·)
- A transformation with no compression
- Coding scheme A(·)
- An algorithm designed to reduce the size of the input
5. BWT-based Algorithm Properties
- Even when using a simple compression algorithm, A(bwt(s)) achieves a good compression ratio
- The very simple and clean algorithm from [Nelson 1996] outperforms the PkZip package
- Other, more advanced BWT compressors are Bzip [Seward 1997] and Szip [Schindler 1997]
- BWT-based compressors achieve a very good compression ratio using relatively small resources [Arnold and Bell 2000, Fenwick 1996a]
6. bzip2 man page

nova 25> man bzip2

NAME
    bzip2, bunzip2 - a block-sorting file compressor, v1.0.2
    bzcat - decompresses files to stdout
    bzip2recover - recovers data from damaged bzip2 files

SYNOPSIS
    bzip2 [ -cdfkqstvzVL123456789 ] [ filenames ... ]
    bunzip2 [ -fkvsVL ] [ filenames ... ]
    bzcat [ -s ] [ filenames ... ]
    bzip2recover filename

DESCRIPTION
    bzip2 compresses files using the Burrows-Wheeler block sorting
    text compression algorithm, and Huffman coding. Compression is
    generally considerably better than that achieved by more
    conventional LZ77/LZ78-based compressors, and approaches the
    performance of the PPM family of statistical compressors.
7. BWT-based Algorithm Properties (cont.)
- They work very well in practice, but no satisfactory proof had been given for their compression ratio
- Previous proofs were done:
- Assuming the input string is generated by a finite-order Markov source [Sadakane 1997, 1998]
- To get bounds on the speed at which the average compression ratio approaches the entropy [Effros 1999]
8. The Burrows-Wheeler Transform
- Background
- Part of research at DIGITAL, released in 1994
- Based on a previously unpublished transformation discovered by Wheeler in 1983
- Technical
- The resulting output block contains exactly the same data elements it started with
- Performed on an entire block of data at once
- Reversible
9. The Burrows-Wheeler Transform (cont.)
- Append $ to the end of s
- $ is unique and smaller than any other character
- Form a matrix M whose rows are the cyclic shifts of s$
- Sort the rows right-to-left
10. The Burrows-Wheeler Transform (cont.)
- The output of BWT is the column F = msspipissii and the number 3 (the position of $)
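The transform described on these slides can be sketched in Python (a sketch; the function name `bwt` and sentinel `$` follow the slides, and sorting each rotation by its reversed form implements the right-to-left row comparison):

```python
def bwt(s, eos="$"):
    """BWT as defined on the slides: append a sentinel, form all
    cyclic shifts, sort the rows right-to-left, and output the
    first column F plus the (1-based) position of the sentinel."""
    s = s + eos
    rotations = [s[i:] + s[:i] for i in range(len(s))]
    rotations.sort(key=lambda row: row[::-1])  # compare rows right-to-left
    first_column = "".join(row[0] for row in rotations)
    pos = first_column.index(eos) + 1          # 1-based position of $
    return first_column.replace(eos, ""), pos

print(bwt("mississippi"))  # → ('msspipissii', 3)
```

This reproduces the slide's example output for mississippi.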
11. The Burrows-Wheeler Transform (cont.)
- Observations
- Every column of M is a permutation of s$.
- Each character in the last column L is followed in s by the corresponding character in F (every row is a cyclic shift).
- For any character c, the ith occurrence of c in F corresponds to the ith occurrence of c in L.
- How to reconstruct s
- Sort bwt(s) to get the column L (column F is bwt(s)).
- F[1] is the first character of s.
- F[1] = m is the first m in F; by applying Observation 3 we get that it is the same m found at L[6], and Observation 2 tells us that F[6] is the next character of s.
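The reconstruction procedure can be sketched as follows (a minimal sketch; the names `ibwt`, `rank`, and `occ_in_L` are illustrative, and the occurrence-matching step is Observation 3 from the slide):

```python
from collections import defaultdict

def ibwt(f, pos, eos="$"):
    """Invert the transform: f is column F without the sentinel,
    pos is the sentinel's 1-based position within F."""
    F = f[:pos - 1] + eos + f[pos - 1:]  # reinsert $ into F
    L = "".join(sorted(F))               # sorting bwt(s) gives column L
    # For each character, list the rows where it occurs in L ...
    occ_in_L = defaultdict(list)
    for row, c in enumerate(L):
        occ_in_L[c].append(row)
    # ... and record, for each row, which occurrence of F[row] it is.
    rank, seen = [], defaultdict(int)
    for c in F:
        rank.append(seen[c])
        seen[c] += 1
    # Row 0 ends with $, so F[0] is the first character of s.
    out, row = [], 0
    for _ in range(len(F) - 1):
        c = F[row]
        out.append(c)
        row = occ_in_L[c][rank[row]]  # Observation 3: ith c in F -> ith c in L
    return "".join(out)

print(ibwt("msspipissii", 3))  # → mississippi
```

Each hop uses Observation 3 to move from a character in F to its copy in L, then Observation 2 (L[row] is followed in s by F[row]) to read off the next character.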
12. The Burrows-Wheeler Transform (cont.)
13. The Burrows-Wheeler Transform (cont.)
- Why is this transform so helpful?
- BWT collects together the symbols following a given context
- Formally:
- For each substring w of s, the characters following w in s are grouped together inside bwt(s)
- More formally
14. Move-to-Front (mtf)
- Another recoding scheme
- Suggested by Burrows and Wheeler to be used after applying BWT on the string s
- s′ = mtf(bwt(s))
- |mtf(bwt(s))| = |bwt(s)| = |s|
- If s is over {a1, a2, ..., ah} then s′ is over {0, 1, ..., h−1}
15. Move-to-Front (cont.)
- For each letter (left-to-right):
- Write the number of distinct other letters seen since the last time the current letter appeared
- Example

  s      = a a b a c a a c c b a
  mtf(s) = 0 0 1 1 2 1 0 1 0 2 2
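The example above can be reproduced with a short sketch (assuming, as the example does, that the recency list starts out as the alphabet in sorted order):

```python
def mtf(s, alphabet):
    """Move-to-Front: emit each symbol's position in a recency
    list, then move that symbol to the front of the list."""
    lst = list(alphabet)
    out = []
    for c in s:
        i = lst.index(c)
        out.append(i)
        lst.insert(0, lst.pop(i))  # move c to the front
    return out

print(mtf("aabacaaccba", "abc"))  # → [0, 0, 1, 1, 2, 1, 0, 1, 0, 2, 2]
```

The emitted position equals the number of distinct other letters seen since the symbol's last occurrence, matching the slide's description.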
16. Move-to-Front (cont.)
- Why is this transform helpful?
- It turns the local homogeneity of bwt(s) into global homogeneity
- Formally: if we had two different locally homogeneous segments, then after mtf both strings will probably consist of the same small numbers
17. Huffman coding
- Assigns binary codewords to letters according to their frequency
- For example:
- A = {a, b, c}
- In our string the frequencies are: a: 300, b: 150, c: 150
- The resulting coding: a → 0, b → 10, c → 11
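The example code can be derived with the standard Huffman construction, repeatedly merging the two least frequent subtrees (a sketch; the tie-break counter is an implementation detail that makes the result deterministic, and a different tie-break could swap the codewords of b and c without changing their lengths):

```python
import heapq

def huffman_codes(freqs):
    """Build a Huffman code: repeatedly merge the two least
    frequent subtrees, prefixing their codewords with 0 and 1."""
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    n = len(heap)
    while len(heap) > 1:
        f0, _, c0 = heapq.heappop(heap)
        f1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (f0 + f1, n, merged))
        n += 1
    return heap[0][2]

print(huffman_codes({"a": 300, "b": 150, "c": 150}))
# → {'a': '0', 'b': '10', 'c': '11'}
```

The frequent symbol a gets a 1-bit codeword while b and c share the 2-bit codewords, exactly as on the slide.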
18. Arithmetic coding
19. The Empirical Entropy of a string
- s — our string
- n = |s|
- A — our alphabet
- h = |A|
- ni — the number of occurrences of the symbol ai inside s
- H0(s) — the zeroth order empirical entropy of s:
  H0(s) = − Σ_{i=1..h} (ni/n) · log2(ni/n)
20. Intuition for the Empirical Entropy
- For each symbol, for each appearance of this symbol in the text: the number of bits needed to represent it with an ultimate uniquely decodable code
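The definition of H0 translates directly into code (a sketch; the function name `h0` is mine):

```python
from collections import Counter
from math import log2

def h0(s):
    """Zeroth order empirical entropy:
    H0(s) = -sum_i (n_i/n) * log2(n_i/n)."""
    n = len(s)
    return -sum(ni / n * log2(ni / n) for ni in Counter(s).values())

print(round(h0("mississippi"), 3))  # → 1.823
```

So an order-0 coder could spend about 1.82 bits per symbol on mississippi.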
21. The kth order Empirical Entropy
- We can achieve greater compression if the codeword depends on the k symbols that precede the coded symbol
- For example, for s = abcabcabd, the codewords for the symbols following ab can be chosen based on ab_s = ccd
- And formally we can define
  Hk(s) = (1/n) · Σ_{w ∈ A^k} |w_s| · H0(w_s)
  where w_s is the string of characters following the occurrences of w in s
22. Examples of Hk(s)
- Example 1
- k = 1, s = mississippi
- m_s = i, i_s = ssp, s_s = sisi, p_s = pi
- Example 2
- k = 1, s = cc(ab)^n
- a_s = b^n, b_s = a^(n−1), c_s = ca
- H0(a_s) = 0, H0(b_s) = 0, H0(c_s) = 1
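Hk can be computed directly from its definition (a sketch; `h0` from the earlier slide is restated here so the snippet runs on its own):

```python
from collections import Counter, defaultdict
from math import log2

def h0(s):
    n = len(s)
    return -sum(ni / n * log2(ni / n) for ni in Counter(s).values())

def hk(s, k):
    """kth order empirical entropy:
    Hk(s) = (1/n) * sum over contexts w of |w_s| * H0(w_s),
    where w_s collects the characters following w in s."""
    contexts = defaultdict(list)
    for i in range(len(s) - k):
        contexts[s[i:i + k]].append(s[i + k])
    return sum(len(ws) * h0("".join(ws)) for ws in contexts.values()) / len(s)

print(round(hk("mississippi", 1), 3))  # → 0.796
```

For mississippi this gives H1 ≈ 0.80 bits per symbol, well below H0 ≈ 1.82: conditioning on the previous character pays off.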
23. The modified Empirical Entropy
- Modified in order to avoid cases where nH0(s) = 0 even though encoding s still requires bits (e.g. s = a^n, whose length must still be encoded)
24. Empirical Entropy and BWT
- We saw that the characters following each context w are grouped together in bwt(s)
- We know the definition of Hk(s) in terms of the H0 of these groups
- If we had an ideal algorithm A compressing each piece to its zeroth order entropy
- ⇒ We get a compression bound in terms of nHk(s)
- ⇒ We reduced the problem of compressing up to the kth order entropy to the problem of compressing distinct portions of the input string up to their zeroth order entropy
25. An Order-0 coder
- A coder with a compression ratio that is close to the zeroth order empirical entropy
- Formally: |A(s)| ≤ |s|·H0(s) + μ·|s| for a constant μ
- For static Huffman coding, μ = 1
- For a simple arithmetic coder, μ ≈ 10⁻² [Howard and Vitter 1992a]
26. Analysis of the BW0 algorithm
- BW0(s) = Order0(mtf(bwt(s)))
- We would like to bound |BW0(s)| in terms of nHk(s)
- For now, let us assume Theorem 4.1, which bounds the output of mtf(s)
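The BW0 pipeline can be sketched by composing the stages from the earlier slides (bwt and mtf are restated minimally so the snippet runs on its own; here the sentinel is kept in the transformed string, and the Order0 stage, e.g. Huffman or arithmetic coding, would then encode the resulting list of small numbers):

```python
def bwt(s, eos="$"):
    # Append sentinel, sort cyclic shifts right-to-left, take column F.
    s += eos
    rot = sorted((s[i:] + s[:i] for i in range(len(s))),
                 key=lambda r: r[::-1])
    return "".join(r[0] for r in rot)

def mtf(s, alphabet):
    # Emit recency-list positions, moving each symbol to the front.
    lst, out = list(alphabet), []
    for c in s:
        i = lst.index(c)
        out.append(i)
        lst.insert(0, lst.pop(i))
    return out

# The intermediate string an Order0 coder would compress:
print(mtf(bwt("mississippi"), "$imps"))
```

Because bwt groups symbols by context and mtf turns each homogeneous group into small numbers, the final stream is heavily skewed toward 0s and 1s, which is exactly what an order-0 coder exploits.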
27. Proof of BW0
- We saw that the bound holds for t = h^k
- Combined with Theorem 4.1
- And with our knowledge of the Order-0 coder
- ⇒ we get the desired bound
28. Proof of Theorem 4.1
29. Proof of Theorem 4.1 (cont.)
- Lemma 4.5
- Lemma 4.6
- Lemma 4.7
30. Proof of Theorem 4.1 (cont.)
- Lemma 4.8
- It is sufficient to prove that
31. Proof of Theorem 4.1 (cont.)
- By applying Lemmas 4.3 and 4.5 we get
- And
- And
32. Analysis of the BW0RL algorithm
- BW0RL(s) = Order0(RLE(mtf(bwt(s))))
- RLE(s):
- Let 0′ and 1′ be two symbols that do not belong to the alphabet
- For m ≥ 1, B(m) = m+1 written in binary with 0′ and 1′, discarding the MSB
- B(1) = 0′, B(2) = 1′, B(3) = 0′0′, B(4) = 0′1′, B(5) = 1′0′
- RLE(s) replaces each run of m zeros in s with B(m)
- Given s = 110022013000, RLE(s) = 1 1 1′ 2 2 0′ 1 3 0′ 0′
- |RLE(s)| ≤ |s|, since ⌈log(m+1)⌉ ≤ m
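The run-length encoding above can be sketched directly (a sketch; the primed out-of-alphabet symbols are represented here as the strings "0'" and "1'"):

```python
def rle(s):
    """Replace each run of m zeros with B(m): the binary
    representation of m+1, MSB discarded, written with the
    out-of-alphabet symbols 0' and 1'."""
    out, i = [], 0
    while i < len(s):
        if s[i] == "0":
            m = 0
            while i < len(s) and s[i] == "0":
                m += 1
                i += 1
            for d in bin(m + 1)[3:]:   # "0b1xx..." -> drop prefix and MSB
                out.append(d + "'")
        else:
            out.append(s[i])
            i += 1
    return out

print("".join(rle("110022013000")))  # → 111'220'130'0'
```

The run of two zeros becomes B(2) = 1′, the single zero becomes B(1) = 0′, and the run of three zeros becomes B(3) = 0′0′, matching the slide's example.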
33. Analysis of BW0RL (cont.)
34. Analysis of BW0RL (cont.)
- Locally λ-Optimal Algorithm
- For all t > 0, there exists a constant c_t such that for any partition s1, s2, ..., st of the string s we have
- A locally λ-optimal algorithm combined with BWT is bounded by
35. A bit of practicality
- A nice article by Mark Nelson
- http://www.dogma.net/markn/articles/bwt/bwt.htm
- Includes source code and measurements
- Usage:
- RLE input-file | BWT | MTF | RLE | ARI > output-file
- UNARI input-file | UNRLE | UNMTF | UNBWT | UNRLE > output-file
36. A bit of practicality (cont.)
BWT Bits/Byte BWT Size PKZIP Bits/Byte PKZIP Size Raw Size File Name
2.13 29,567 2.58 35,821 111,261 bib
2.87 275,831 3.29 315,999 768,771 book1
2.44 186,592 2.74 209,061 610,856 book2
4.85 62,120 5.38 68,917 102,400 geo
2.85 134,174 3.10 146,010 377,109 news
4.04 10,857 3.84 10,311 21,504 obj1
2.66 81,948 2.65 81,846 246,814 obj2
2.67 17,724 2.80 18,624 53,161 paper1
2.62 26,956 2.90 29,795 82,199 paper2
2.92 16,995 3.11 18,106 46,526 paper3
3.33 5,529 3.32 5,509 13,286 paper4
3.44 5,136 3.32 4,962 11,954 paper5
2.76 13,159 2.80 13,331 38,105 paper6
0.79 50,829 0.84 54,188 513,216 pic
2.69 13,312 2.69 13,340 39,611 progc
1.86 16,688 1.81 16,227 71,646 progl
1.85 11,404 1.82 11,248 49,379 progp
1.65 19,301 1.68 19,691 93,695 trans
2.41 978,122 2.64 1,072,986 3,251,493 total
37. A bit of practicality (cont.)
The End