Title: Combinatorial aspects of the Burrows-Wheeler transform
1 Combinatorial aspects of the Burrows-Wheeler transform
Sabrina Mantaci, Antonio Restivo, Marinella Sciortino
University of Palermo
2 Burrows-Wheeler Transform
- In 1994 M. Burrows and D. Wheeler introduced a new data compression method based on preprocessing the input string. This preprocessing, named the Burrows-Wheeler Transform (BWT) after them, produces a permutation of the letters of the input string such that:
  - the transformed string is easier to compress than the original one;
  - the original string can be recovered.
- This preprocessing made it possible to define a class of lossless data compression algorithms that:
  - achieve speed comparable to algorithms based on the techniques of Lempel and Ziv;
  - obtain a compression ratio close to that of the best statistical modelling techniques.
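3 How does BWT work?
- Lexicographically sort the cyclic rotations of $w$. Let $F$ and $L$ denote the first and last columns of the sorted matrix of rotations; $L$ is the transformed string $BWT(w)$.
- The following properties hold:
  - the character $L_i$ is followed (cyclically) in $w$ by $F_i$;
  - for each character $ch$, the $i$-th occurrence of $ch$ in $F$ corresponds to the $i$-th occurrence of $ch$ in $L$.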
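The sorting step can be sketched in a few lines of Python. This is a minimal, quadratic-space illustration rather than an efficient implementation; the function name `bwt` and the convention of returning the 0-based row index of $w$ as `I` are assumptions made for the example.

```python
def bwt(w):
    """Return (L, I): the last column of the sorted rotation matrix
    and the 0-based index of the row equal to w."""
    n = len(w)
    # All cyclic rotations of w, sorted lexicographically.
    rotations = sorted(w[i:] + w[:i] for i in range(n))
    last = "".join(r[-1] for r in rotations)
    return last, rotations.index(w)


L, I = bwt("abraca")
F = "".join(sorted(L))   # first column = sorted letters of L
print(L, I, F)           # caraab 1 aaabcr
```

The word `abraca` is chosen because its transform is the string `caraab` with index 1.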
4 Reversibility
The Burrows-Wheeler transform is reversible, in
the sense that given BWT(w) and an index I, it is
possible to recover w.
- Given $L = BWT(w) = caraab$ and $I = 1$:
- Construct $F$ by alphabetically sorting the letters in $L$.
- Define a permutation $\tau$ on $\{0, 1, \ldots, n-1\}$ establishing a correspondence between the positions of the same letters in $F$ and in $L$.
- Starting from position $I$, we can recover $w = w_0 \cdots w_{n-1}$ as follows:
  - $w_i = F_{\tau^i(I)}$, where $\tau^0(x) = x$ and $\tau^{i+1}(x) = \tau(\tau^i(x))$.
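The recovery steps above can be sketched as follows. The name `inverse_bwt` is hypothetical, and the permutation (called `tau` here) is built with a stable sort so that the $i$-th occurrence of a letter in $F$ is mapped to the position of its $i$-th occurrence in $L$; indices are assumed to be 0-based.

```python
def inverse_bwt(L, I):
    """Recover w from L = BWT(w) and the row index I of w."""
    n = len(L)
    F = sorted(L)
    # tau[x] = position in L of the letter occurrence sitting at F[x].
    # Python's sort is stable, so equal letters keep their order in L.
    tau = sorted(range(n), key=lambda j: L[j])
    letters = []
    x = I
    for _ in range(n):
        letters.append(F[x])   # w_i = F[tau^i(I)]
        x = tau[x]
    return "".join(letters)


print(inverse_bwt("caraab", 1))   # abraca
```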
5 We can deduce that
Therefore we can study combinatorial properties
of the BWT by studying the conjugacy classes of
primitive words.
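6 Standard Words
Let $d_1, d_2, \ldots, d_n, \ldots$ be a sequence of natural numbers with $d_1 \ge 0$ and $d_i > 0$ for $i = 2, \ldots, n, \ldots$ Consider the sequence $\{s_n\}_{n \ge 0}$ defined as
$s_0 = b$, $s_1 = a$, $s_{n+1} = s_n^{d_n} s_{n-1}$ for $n \ge 1$.
- The limit $s = \lim_{n \to \infty} s_n$ is a characteristic Sturmian word.
- $\{s_n\}_{n \ge 0}$ is called the approximating sequence of $s$.
- $(d_1, d_2, \ldots, d_n, \ldots)$ is the directive sequence of $s$.
- Each finite word $s_n$ is a standard word.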
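As an illustration, the approximating sequence can be generated directly from a finite prefix of the directive sequence. The sketch below assumes the usual recurrence $s_0 = b$, $s_1 = a$, $s_{n+1} = s_n^{d_n} s_{n-1}$; the function name `standard_words` is hypothetical.

```python
def standard_words(directive):
    """Return [s_0, s_1, ..., s_{k+1}] for a directive prefix (d_1, ..., d_k),
    assuming s_0 = b, s_1 = a, s_{n+1} = s_n^{d_n} s_{n-1}."""
    prev, cur = "b", "a"
    words = [prev, cur]
    for d in directive:
        prev, cur = cur, cur * d + prev
        words.append(cur)
    return words


# The directive sequence (1, 1, 1, ...) gives the Fibonacci words.
print(standard_words([1, 1, 1, 1]))
# ['b', 'a', 'ab', 'aba', 'abaab', 'abaababa']
```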
7 Characterization of standard words
- A word $w$ is standard if and only if it is a letter or $w = vab$ (or equivalently $w = vba$) and $v$ has periods $p, q$ such that $\gcd(p, q) = 1$ and $|v| = p + q - 2$ (extremal case of the Fine and Wilf theorem).
- A word $w$ is standard if and only if it is a letter or there exist palindrome words $P, Q, R$ such that $w = QR = Pxy$, where $\{x, y\} = \{a, b\}$.
- Standard words correspond to an extremal case of the Knuth-Morris-Pratt algorithm.
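The first characterization can be checked mechanically for a given pair of candidate periods. The sketch below is only an illustration: the helper names are hypothetical, and the caller is assumed to supply $p$ and $q$ (the statement itself only asks that such periods exist).

```python
from math import gcd


def has_period(v, p):
    """True if v[i] == v[i + p] wherever both positions exist."""
    return all(v[i] == v[i + p] for i in range(len(v) - p))


def fine_wilf_extremal(w, p, q):
    """Check w = v ab or v ba, where v has coprime periods p, q
    and |v| = p + q - 2."""
    v, tail = w[:-2], w[-2:]
    return (
        tail in ("ab", "ba")
        and gcd(p, q) == 1
        and len(v) == p + q - 2
        and has_period(v, p)
        and has_period(v, q)
    )


print(fine_wilf_extremal("abaababa", 3, 5))   # True
print(fine_wilf_extremal("abababab", 3, 5))   # False
```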
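8 Rotations
Standard words can also be generated by rotations. Let $p, q \ge 2$ be such that $\gcd(p, q) = 1$ and $n = p + q$. The rotation $\rho_p : \{0, 1, \ldots, n-1\} \to \{0, 1, \ldots, n-1\}$ is defined as
$\rho_p(z) = z + p \pmod{n}$.
Example: if $n = 8$, $p = 3$, $q = 5$, then $w = abaababa$.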
9 A new characterization of standard words
10 Idea of the proof
The permutation $\tau$ giving the correspondence between the positions of characters in $F$ and $L$ is $\tau(z) = z + p \pmod{n}$. Starting, for example, from the position $I = p$, we can recover the word $u$: $u_i = F_{\tau^i(p)}$.
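A small sketch makes this recovery concrete. It assumes that the transform has the form $b^p a^q$, so that the sorted column is $F = a^q b^p$; this assumption, the function name `word_from_rotation`, and 0-based positions are choices made for the example.

```python
def word_from_rotation(p, q):
    """Read F = a^q b^p along the orbit of z -> z + p (mod n), starting at p."""
    n = p + q
    F = "a" * q + "b" * p      # assumed sorted column
    x = p                      # starting position I = p
    letters = []
    for _ in range(n):
        letters.append(F[x])   # u_i = F[tau^i(p)]
        x = (x + p) % n
    return "".join(letters)


print(word_from_rotation(3, 5))   # abaababa
```

With $p = 3$ and $q = 5$ the positions visited are 3, 6, 1, 4, 7, 2, 5, 0, which spells out the word `abaababa` of length 8.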
11 Further Research
- Study the extremal case of the BWT for $k$-letter alphabets with $k > 2$.
  - For instance, for $k = 3$, characterize the words $w$ such that $BWT(w)$ belongs to $c^* a^* b^*$ or $b^* c^* a^*$.
  - This property holds neither for 3-standard words nor for balanced words.
- Does a relation between the complexity function
of a word w and the structure of BWT(w) exist?
- Given a language $L$, one can define $BWT(L) = \{BWT(w) \mid w \in L\}$. One can ask whether BWT preserves some properties of a language $L$, such as belonging to a certain family of languages in the Chomsky hierarchy.
  - We found negative results:
    - $L_1 = (ab)^*$, $BWT(L_1) = \{b^n a^n \mid n \ge 0\}$, a context-free language;
    - $L_2 = (abc)^*$, $BWT(L_2) = \{c^n a^n b^n \mid n \ge 0\}$, a context-sensitive language.
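The two examples can be checked for small $n$ with a short script. This is only a sanity check under the sorted-rotations definition of the transform; the helper `bwt_last` is a hypothetical name and returns only the transformed string, without the index.

```python
def bwt_last(w):
    """Last column of the lexicographically sorted cyclic rotations of w."""
    rotations = sorted(w[i:] + w[:i] for i in range(len(w)))
    return "".join(r[-1] for r in rotations)


for n in range(1, 6):
    # (ab)^n is mapped to b^n a^n
    assert bwt_last("ab" * n) == "b" * n + "a" * n
    # (abc)^n is mapped to c^n a^n b^n
    assert bwt_last("abc" * n) == "c" * n + "a" * n + "b" * n

print(bwt_last("ababab"), bwt_last("abcabc"))   # bbbaaa ccaabb
```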
12 Further Research
- Is it possible to characterize interesting
families of words in terms of their BWT?