The Burrows-Wheeler Transform: Theory and Practice - PowerPoint PPT Presentation

Description: Presentation by eran, created 11/14/2002.

Transcript and Presenter's Notes
1
The Burrows-Wheeler Transform: Theory and Practice
  • Article by Giovanni Manzini
  • Original algorithm by M. Burrows and D. J. Wheeler
  • Lecturer: Eran Vered

2
Overview
  • The Burrows-Wheeler transform (bwt)
  • Statistical compression overview
  • Compressing using bwt
  • Analysis of the compression results

3
General
  • bwt transforms the order of the symbols of a text.
  • The bwt output can be compressed very easily.
  • Used by the compressor bzip2.

4
Calculating bw(s)
  • Add an end-of-string symbol ($) to s
  • Generate a matrix of all the cyclic shifts of s$
  • Sort the matrix rows in right-to-left lexicographic order
  • bw(s) is the first column of the matrix
  • The $ sign is dropped; its location is saved

5
BWT Example
  • s = mississippi

Cyclic shifts of s$:

mississippi$
ississippi$m
ssissippi$mi
sissippi$mis
issippi$miss
ssippi$missi
sippi$missis
ippi$mississ
ppi$mississi
pi$mississip
i$mississipp
$mississippi

Rows sorted in right-to-left lexicographic order:

mississippi$
ssissippi$mi
$mississippi
ssippi$missi
ppi$mississi
ississippi$m
pi$mississip
i$mississipp
sissippi$mis
sippi$missis
issippi$miss
ippi$mississ

Sorting the rows of the matrix is equivalent to sorting the suffixes of s^r (ippississim).
bw(s) = (msspipissii, 3)
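The steps above can be sketched directly in Python (a naive O(n² log n) version for illustration; the $ sentinel and the right-to-left row order follow the slides):

```python
def bwt(s, sentinel="$"):
    """Naive BWT as on the slides: append a sentinel, build all
    cyclic shifts, sort rows right-to-left lexicographically
    (i.e. by their reversal), and take the FIRST column."""
    t = s + sentinel
    rows = [t[i:] + t[:i] for i in range(len(t))]
    rows.sort(key=lambda r: r[::-1])        # right-to-left order
    first = "".join(r[0] for r in rows)     # bw(s) = first column
    pos = first.index(sentinel)             # 0-based sentinel location
    return first.replace(sentinel, ""), pos
```

bwt("mississippi") returns ("msspipissii", 2); the slide reports the sentinel position as 3 because it counts from 1.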
6
BWT Matrix Properties
  • Sorting F gives L
  • s1 = F1
  • Fi follows Li in s
  • Equal symbols in L are ordered the same as in F

F              L
m ississippi  $
s sissippi$m  i
$ mississipp  i
s sippi$miss  i
p pi$mississ  i
i ssissippi$  m
p i$mississi  p
i $mississip  p
s issippi$mi  s
s ippi$missi  s
i ssippi$mis  s
i ppi$missis  s
7
Reconstructing s
  • Add $ (at the saved location) to get F
  • Sort F to get L
  • s1 = F1
  • Fi follows Li in s
  • Equal symbols keep their order of appearance

s = m, mi, mis, miss, missi, ... (rebuilt one symbol at a time)
8
Reconstructing s
  • L = sort(F)
  • s = F1
  • j = 1
  • for i = 2 to n
  •   a = number of appearances of Fj in F1, F2, ..., Fj
  •   j = index of the a-th appearance of Fj in L
  •   s = s Fj
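The pseudocode above can be rendered in Python (a sketch; first_col and idx are the two components of bw(s), with idx 0-based):

```python
def inverse_bwt(first_col, idx, sentinel="$"):
    """Rebuild s from bw(s): reinsert the sentinel into the first
    column F, sort F to get L, then repeatedly map an occurrence
    of F[j] to the equally-ranked occurrence in L."""
    F = first_col[:idx] + sentinel + first_col[idx:]
    L = "".join(sorted(F))               # sorting F gives L
    n = len(F)
    out = [F[0]]                         # s1 = F1
    j = 0
    for _ in range(n - 2):               # the sentinel is not emitted
        c = F[j]
        a = F[:j + 1].count(c)           # rank of this occurrence in F
        # j <- index of the a-th occurrence of c in L
        seen = 0
        for k, ch in enumerate(L):
            if ch == c:
                seen += 1
                if seen == a:
                    j = k
                    break
        out.append(F[j])                 # F[j] follows in s
    return "".join(out)
```

inverse_bwt("msspipissii", 2) recovers "mississippi".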

9
What's good about bwt?
  • bwt(s) is locally homogeneous
  • For every substring w of s, all the symbols following w in s are grouped together.

mississippi$
ssissippi$mi
$mississippi
ssippi$missi
ppi$mississi
ississippi$m
pi$mississip
i$mississipp
sissippi$mis
sippi$missis
issippi$miss
ippi$mississ
  • These symbols will usually be homogeneous.

10
What's good about bwt?
  • s = miss_mississippi_misses_miss_missouri
  • bw(s) = mmmmmssssss_spiiiiiupii_ssssss_e_ioir
  • The runs in the output are contexts: the symbols following "m", "mi", "mis", and "_" are each grouped together.
11
Statistical Compression
  • We will discuss lossless statistical compression with the following notations:
  • s: input string over the alphabet S
  • S = {a1, a2, a3, ..., ah}
  • h = |S|
  • n = |s|
  • ni = number of appearances of ai in s
  • log x = log2 x

12
Zeroth Order Encoding
e → 0   a → 10   c → 111
  • Every input symbol is replaced by the same codeword for all its appearances
  • ai → ci

Kraft's Inequality: Σi 2^(-|ci|) ≤ 1
Output size: Σi ni·|ci| bits
Minimum achieved for |ci| = log(n/ni)
13
Zeroth Order Encoding
  • Output size is bounded by |s|·H0(s), where

H0(s) = Σi (ni/n)·log(n/ni)

is the Empirical Entropy (zeroth order) of s.
  • Compressing a string using Huffman Coding or Arithmetic Coding produces an output whose size is close to |s|·H0(s) bits.
  • Specifically: |huffman(s)| ≤ |s|·(H0(s) + 1), and |arit(s)| ≤ |s|·H0(s) + O(1)
14
Zeroth order Entropy Example
  • n1 = n2 = ... = nh  →  H0(s) = log h
  • n1 >> n2, n3, ..., nh  →  H0(s) ≈ 0
  • s = mississippi  →  H0(s) ≈ 1.82
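The zeroth-order entropy is easy to compute; a minimal sketch:

```python
import math
from collections import Counter

def h0(s):
    """Zeroth-order empirical entropy in bits per symbol:
    H0(s) = sum over symbols a of (n_a / n) * log2(n / n_a)."""
    n = len(s)
    return sum((c / n) * math.log2(n / c) for c in Counter(s).values())
```

h0("mississippi") gives approximately 1.82, matching the example above.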

15
k-th Order Encoding
  • The codeword that encodes an input symbol is determined by that symbol and its k preceding symbols.
  • Output size is bounded by |s|·Hk(s) bits

k-th Order Empirical Entropy of s:
Hk(s) = (1/|s|) · Σ over w ∈ S^k of |ws|·H0(ws)
ws: a string containing all the symbols following w in s.
16
k-th order Entropy Example
  • s = mississippi (k = 1)
  • w = m: ws = i → H0(i) = 0
  • w = i: ws = ssp → H0(ssp) ≈ 0.92
  • w = s: ws = sisi → H0(sisi) = 1
  • w = p: ws = pi → H0(pi) = 1
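The definition Hk(s) = (1/|s|) Σw |ws|·H0(ws) can be sketched the same way, grouping the symbol following each length-k context:

```python
import math
from collections import Counter, defaultdict

def h0(s):
    """Zeroth-order empirical entropy of s in bits per symbol."""
    n = len(s)
    return sum((c / n) * math.log2(n / c) for c in Counter(s).values())

def hk(s, k):
    """k-th order empirical entropy: collect the symbols following
    each length-k context w into the string ws, then average the
    zeroth-order entropies weighted by |ws|."""
    contexts = defaultdict(list)
    for i in range(len(s) - k):
        contexts[s[i:i + k]].append(s[i + k])
    return sum(len(ws) * h0(ws) for ws in contexts.values()) / len(s)
```

For s = mississippi and k = 1 this reproduces the four context strings i, ssp, sisi, pi from the example above, giving H1(s) ≈ 0.80.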

17
k-th Order Encoding and bwt
  • After applying bwt, for every substring w of s, all the symbols following w in s are grouped together.
  • Did we get an optimal k-th order compressor?
  • Not yet:
  • Local homogeneity instead of global homogeneity.

18
k-th Order Encoding and bwt
  • For example:
  • s = ababab...ab
  • bwt(s) = abb...bbaa...aa

Partition bwt(s) into w1, w2 (the symbols following a), w3 (the symbols following b).
H1(s) ≈ 0 (since wa = bbb...b and wb = aaa...a), so H0(wi) ≈ 0 for each piece,
but encoding the pieces together with one zeroth-order model gives H0(w1 w2 w3) = H0(s) = 1.
19
Compressing bwt
  • bwt
  • Move-to-front
  • Arithmetic coding

20
MoveToFront Compression
  • Every input symbol is encoded by the number of distinct symbols that occurred since the last appearance of this symbol.
  • Implemented using a list of symbols sorted by recency of usage.
  • The output contains many small numbers if the text is locally homogeneous.

⇒ Transforms local homogeneity into global homogeneity.
21
MoveToFront Compression
  • S = {d, e, h, l, o, r, w}
  • s = h e l l o w o r l d

mtf-list               mtf(s)
d, e, h, l, o, r, w
h, d, e, l, o, r, w    2   (h)
e, h, d, l, o, r, w    2   (e)
l, e, h, d, o, r, w    3   (l)
l, e, h, d, o, r, w    0   (l)
o, l, e, h, d, r, w    4   (o)
w, o, l, e, h, d, r    6   (w)
...                    1   (o)

mtf(s) = 2 2 3 0 4 6 1 6 3 6

  • Initial list may be either:
  • Ordered alphabetically
  • Symbols in order of appearance in the string (then the list must be added to the output)
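The move-to-front step above, as a sketch (the initial list is the alphabet in sorted order, as in the example):

```python
def mtf_encode(s, alphabet):
    """Encode each symbol by its current position in a recency
    list, then move that symbol to the front of the list."""
    lst = list(alphabet)
    out = []
    for ch in s:
        i = lst.index(ch)
        out.append(i)
        lst.insert(0, lst.pop(i))   # move to front
    return out
```

mtf_encode("helloworld", "dehlorw") starts 2, 2, 3, 0, 4, 6, 1, matching the table above.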

22
bwt0 Compression
  • bwt0(s) := arit( mtf( bw(s) ) )

Theorem 1: For any k (h = size of the alphabet),
|bwt0(s)| ≤ 8·|s|·Hk(s) + lower-order terms
23
Notations
  • x̂ = mtf(x)
  • For a string w over {0, 1, 2, ..., m} define:
  • w01 = w, with all the non-zeros replaced by 1
  • x̂01 = x̂, with all the non-zeros replaced by 1
  • Note: |bw(x)| = |x| and |mtf(x)| = |x|

24
Theorem 1 - Proof
Lemma 1: Let s = s1 s2 ... st and ŝ = mtf(s). Then the encoding of ŝ is bounded by 8·Σi |si|·H0(si) plus lower-order terms.
25
Theorem 1 - Proof
  • bw(s) can be partitioned into at most h^k substrings w1, w2, ..., wl such that

Σi |wi|·H0(wi) = |s|·Hk(s)
  • ŝ = mtf(bw(s)). By Lemma 1, the encoding of ŝ is bounded in terms of Σi |wi|·H0(wi) = |s|·Hk(s).

26
Lemma 1 - Proof
s = s1 s2 ... st , ŝ = mtf(s).
  • Encoding of ŝ:
  • For each symbol: is it 0 or not?
  • For the non-zeros: encode one of 1, 2, 3, ..., h-1
  • Note: some inter-substring problems are ignored here.

27
Encoding non-zeros of s
  • Use a prefix code (i → ci): ŝ → pcnz(ŝ)
  • c1 = 10
  • c2 = 11
  • ci = 0...0 B(i+1)   (i > 2)

B(i+1): the binary representation of i+1, preceded by |B(i+1)| - 2 zeros
|ci| ≤ 2·log(i+1)   (c0 = 0)
mi = number of occurrences of i in ŝ.
28
Encoding non-zeros of ŝ = mtf(s)
For any string s, summing over all its symbols:
Σa Σj log(pj − pj−1) ≤ Σa Na·log(n/Na) = |s|·H0(s)
Proof: let Na = the number of occurrences of symbol a in s, at positions p1, p2, ..., pNa. The gaps pj − pj−1 sum to at most n, so the bound follows from the concavity of log.
29
Encoding non-zeros of s
s = s1 s2 ... st
  • For every i: apply the bound above to the non-zeros of ŝi
  • Summing over all substrings gives the bound for ŝ

30
Encoding of s
  • For the non-zeros: encode one of 1, 2, 3, ..., h-1
  • ⇒ no more than the number of bits bounded above
  • For each symbol: is it 0 or not?
  • ⇒ Encode ŝ01

31
Encoding s01
  • If for every ŝi01 the number of 0's is at least as large as the number of 1's:

and
It follows that
  • Otherwise

32
Encoding s01 (second case)
  • If ŝi01 has more 1's than 0's for i = 1, 2, ..., l:

If there are more 1's than 0's in ŝi01, then
It follows that
33
Encoding of s
  • For the non-zeros: encode one of 1, 2, 3, ..., h-1
  • ⇒ bounded as above
  • For each symbol: is it 0 or not? (encode ŝ01)
  • ⇒ bounded as above
  • Total (after fixing some inaccuracies):
  • ⇒ no more than the bound claimed in Lemma 1, in bits
34
Improvement
  • Use RLE (run-length encoding)
  • bw0RL(s) := arit( rle( mtf( bw(s) ) ) )
  • Better performance
  • Better theoretical bound
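The rle stage exploits the long runs of zeros that mtf produces on bwt output. A minimal sketch; the slides do not show the actual run-length code used by bw0RL, so plain (0, run-length) pairs stand in here as an illustration:

```python
def rle_zeros(seq):
    """Collapse each maximal run of zeros in an mtf output into a
    (0, run_length) pair; non-zero values pass through unchanged."""
    out, i = [], 0
    while i < len(seq):
        if seq[i] == 0:
            j = i
            while j < len(seq) and seq[j] == 0:
                j += 1                  # scan to the end of the run
            out.append((0, j - i))
            i = j
        else:
            out.append(seq[i])
            i += 1
    return out
```

For example, rle_zeros([2, 0, 0, 0, 1, 0]) yields [2, (0, 3), 1, (0, 1)].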

35
Notes
  • Compressor implementation:
  • Use blocks of text. Sort using one of:
  • Compact suffix trees (long average LCP)
  • Suffix arrays (medium average LCP)
  • General string sorter (short average LCP)
  • Searching in a compressed text: extract the suffix array from bwt(s).
  • Empirical results