The Burrows-Wheeler Transform - PowerPoint PPT Presentation

Title: The BW-Transform. Last modified by: zhangs. Created: 10/30/2002 2:31:43 AM. Presentation format: On-screen Show (4:3).

1
The Burrows-Wheeler Transform
  • Sen Zhang

2
Transform
  • What is the definition of "transform"?
  • To change the nature, function, or condition of;
    to convert.
  • To change markedly the appearance or form of.
  • Lossless and reversible
  • By the way, to transform is simple; a kid can do
    it.
  • To put things back is the problem.
  • Think of a three-year-old: he can transform or
    disassemble pretty much anything, but cannot put
    it back together.
  • For the BWT, there exist efficient reverse
    algorithms that can retrieve the original text
    from the transformed text.

3
What is BWT?
  • The Burrows-Wheeler transform (BWT) is a
    block-sorting, lossless, and reversible data
    transform.
  • The BWT permutes a text into a new sequence
    which is usually more compressible.
  • It was published in 1994 by Michael Burrows
    and David Wheeler.
  • The transformed text can be better compressed
    with fast locally-adaptive algorithms, such as
    run-length encoding (or move-to-front coding) in
    combination with Huffman coding (or arithmetic
    coding).
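
Move-to-front coding, mentioned above, can be sketched as follows. This is a minimal illustration, not part of the original slides; the function names and the two-letter alphabet are assumptions made for the example. Runs of identical characters turn into runs of zeros, which a later Huffman or arithmetic coder can exploit.

```python
def mtf_encode(text, alphabet):
    """Emit each symbol's current position in the table, then move it to the front."""
    table = list(alphabet)
    codes = []
    for ch in text:
        pos = table.index(ch)
        codes.append(pos)
        table.insert(0, table.pop(pos))
    return codes


def mtf_decode(codes, alphabet):
    """Invert mtf_encode by replaying the same table updates."""
    table = list(alphabet)
    out = []
    for pos in codes:
        ch = table.pop(pos)
        out.append(ch)
        table.insert(0, ch)
    return "".join(out)


print(mtf_encode("aaabbb", "ab"))  # [0, 0, 0, 1, 0, 0]
```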

4
Outline
  • What does BWT stand for?
  • Why BWT?
  • Data Compression algorithms
  • RLE
  • Huffman coding
  • Combine them
  • What is left out?
  • Bringing reality closer to the ideal
  • Steps of BWT
  • BWT is reversible and lossless
  • Steps to inverse
  • Variants of BWT
  • ST
  • When was BWT initially proposed?
  • Where are the inventors of the algorithms?
  • Your homework!

5
Why BWT?
  • Run-length encoding (RLE)
  • Replaces a long series of a repeated character
    with a count of the repetition, squeezing the
    run into a number and a character.
  • AAAAAAA
  • becomes A7, plus a flag
  • Ideally, the longer the run of the same
    character, the better.
  • In reality, however, the input data does not
    necessarily contain the long runs that RLE
    expects.
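
A minimal sketch of run-length encoding, assuming a simple list of (character, count) pairs as the output format (the slides do not specify one, and the flag is left out here):

```python
from itertools import groupby


def rle_encode(text):
    """Collapse each run of identical characters into a (character, count) pair."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]


def rle_decode(pairs):
    """Expand (character, count) pairs back into the original text."""
    return "".join(ch * count for ch, count in pairs)


print(rle_encode("AAAAAAA"))           # [('A', 7)]
print(len(rle_encode("mississippi")))  # 8 runs for 11 characters: little to gain
```

The second call shows the problem the slide describes: ordinary text rarely contains long runs, so RLE alone saves very little.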

6
Bridging reality and the ideal
  • BWT can transform a text into a sequence that is
    easier to compress.
  • The result is closer to the ideal (what RLE
    expects).
  • Compressing the transformed text instead of the
    original improves compression performance.

7
Preliminaries
  • Alphabet Σ
  • {a, b, c, $}
  • We assume
  • an order on the alphabet
  • a < b < c < $
  • A character is available to be used as the
    sentinel, denoted as $.

8
How to transform?
  • Three steps
  • Form an N×N matrix by cyclically rotating (left)
    the given text to form the rows of the matrix.
  • Sort the rows of the matrix in lexicographic
    order.
  • Extract the last column of the matrix.
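
The three steps can be sketched directly in code. This is a naive illustration rather than an efficient implementation; it assumes the sentinel is written `$` and ranks above every other character, and the names `SENTINEL`, `sort_key`, and `bwt_forward` are chosen only for this example.

```python
SENTINEL = "$"  # assumed end marker, treated as greater than every other character


def sort_key(row):
    # (False, c) sorts before (True, "$"), so the sentinel ranks last.
    return [(c == SENTINEL, c) for c in row]


def bwt_forward(text):
    """Return (L, primary_index) for a text that ends with a unique sentinel."""
    n = len(text)
    rotations = [text[i:] + text[:i] for i in range(n)]  # step 1: cyclic left shifts
    matrix = sorted(rotations, key=sort_key)             # step 2: sort the rows
    last_column = "".join(row[-1] for row in matrix)     # step 3: take the last column
    return last_column, matrix.index(text)


print(bwt_forward("mississippi$"))  # ('ssmp$pissiii', 4)
```

The second value returned is the row where the original text ends up after sorting, counted from 0.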

9
One example
  • How the BWT transforms mississippi.
  • T = mississippi$

10
Step 1 form the matrix
  • The N×N symmetric matrix OM is originally
    constructed from the texts obtained by rotating
    the text T.
  • The matrix OM has T as its first row, i.e.
    OM[1, 1..N] = T.
  • The remaining rows of OM are constructed by
    applying successive cyclic left-shifts to T, i.e.
    for each of the remaining rows, a new text T_i is
    obtained by cyclically shifting the previous text
    T_{i-1} one column to the left.
  • The matrix OM obtained is shown in the next slide.

11
  • A text T is a sequence of characters drawn from
    the alphabet.
  • Without loss of generality, a text T of length
    N is denoted as x_1 x_2 x_3 ... x_{N-1} $, where
    every character x_i is in the alphabet Σ for i
    in [1, N-1]. The last character of the text is a
    sentinel, which is the lexicographically greatest
    character in the alphabet and occurs exactly once
    in the text.
  • Appending a sentinel to the original text is not
    required, but it simplifies the explanation and
    makes every text non-repeating.
  • Example: abcababac$

12
Step 1 form the matrix
First, treat the input string as a cyclic string
and construct an N×N matrix from it.
13
Step 1 form the matrix
  • m i s s i s s i p p i $
  • i s s i s s i p p i $ m
  • s s i s s i p p i $ m i
  • s i s s i p p i $ m i s
  • i s s i p p i $ m i s s
  • s s i p p i $ m i s s i
  • s i p p i $ m i s s i s
  • i p p i $ m i s s i s s
  • p p i $ m i s s i s s i
  • p i $ m i s s i s s i p
  • i $ m i s s i s s i p p
  • $ m i s s i s s i p p i

14
Step 2 transform the matrix
  • Now, we sort all the rows of the matrix OM in
    ascending order with the leftmost element of each
    row being the most significant position.
  • Consequently, we obtain the transformed matrix M
    as given below.
  • i p p i $ m i s s i s s
  • i s s i p p i $ m i s s
  • i s s i s s i p p i $ m
  • i $ m i s s i s s i p p
  • m i s s i s s i p p i $
  • p i $ m i s s i s s i p
  • p p i $ m i s s i s s i
  • s i p p i $ m i s s i s
  • s i s s i p p i $ m i s
  • s s i p p i $ m i s s i
  • s s i s s i p p i $ m i
  • $ m i s s i s s i p p i

Completely sorted from the leftmost column to the
rightmost column.
15
Step 3 get the transformed text
  • The Burrows Wheeler transform is the last column
    in the sorted list, together with the row number
    where the original string ends up.

16
Step 3 get the transformed text
  • From the above transform, L is easily obtained by
    taking the transpose of the last column of M,
    together with the primary index.
  • Primary index: 4
  • L = s s m p $ p i s s i i i
  • Notice the run of three i's and the two pairs
    of consecutive s's. This makes the text easier
    to compress than the original string mississippi.

17
What is the benefit?
  • The transformed text is more amenable to
    subsequent compression algorithms.

18
Any problem?
  • It sounds cool, but
  • Is the transformation reversible?

19
BWT is reversible and lossless
  • The remarkable thing about the BWT is not only
    that it generates a more easily compressible
    output, but also that it is reversible, i.e. it
    allows the original text to be re-generated from
    the last column data and the primary index.

20
BWT is reversible and lossless
  • mississippi$ → BWT → index 4 and ssmp$pissiii
  • index 4 and ssmp$pissiii → inverse BWT → mississippi$
  • How to achieve the goal?
21
The intuition
  • Assume you are standing in a line of 1000 people.
  • For some reason, the people are dispersed.
  • Now we need to restore the line.
  • What should you (the people in line) do?
  • What is your strategy?
  • Centralized?
  • A bookkeeper or ticket numbers; this requires
    centralized extra bookkeeping space.
  • Distributed?
  • Every person can point out who stood
    immediately in front of him, so the bookkeeping
    space is distributed.

22
For IBWT
  • The order is distributed and hidden in the output
    itself!

23
The trick is
  • Where to start? Who is the first one to ask?
  • The last one.
  • Find the immediately preceding character
  • by finding the row that immediately precedes the
    current row.
  • A loop is needed to recover all characters.
  • Each iteration involves two matters:
  • Recover the current person (by index).
  • In addition, point out the next person
    (by index) to keep the loop running.

24
  • Two matters
  • Recover the current person (by index)
  • L[currentindex]; so what is currentindex?
  • In addition, point out the next person
    (by index)
  • currentindex = new index
  • How do we update currentindex? We need an
    update rule.

25
We want to know where the preceding character
of a given character is.
  • [Figure: the sorted matrix M, with row 4 (the
    primary index) marked.]

Based on the already known primary index, 4, we
know that L[4], i.e. $, is the first character to
retrieve, working backwards. Our question is: which
character is the next one to retrieve?
26
We want to know where the preceding character
of a given character is.
  • [Figure: the sorted matrix M, with row 4 (the
    primary index) marked.]

We know that the next character is going to be
i. But L[6] = L[9] = L[10] = L[11] = i. Which
index should be chosen? Any of 6, 9, 10, and 11
gives us the right character i, but the
correct strategy also has to determine which
index continues the restoration.
28
The solution
  • The solution turns out to be very simple
  • Using LF mapping!
  • Read on to see what the LF mapping is.

29
Inverse BW-Transform
  • Assume we know the complete sorted matrix M.
  • Using L and F, construct an LF-mapping LF[1..N]
    which maps each character in L to its
    corresponding occurrence in F.
  • Using the LF-mapping and L, reconstruct T
    backwards by threading through the LF-mapping and
    reading the characters off of L.

30
L and F
  • F (first column of M): i i i i m p p s s s s $
  • L (last column of M): s s m p $ p i s s i i i
  • Primary index: 4
31
LF mapping
  • Row i: 0 1 2 3 4 5 6 7 8 9 10 11
  • F[i]: i i i i m p p s s s s $
  • L[i]: s s m p $ p i s s i i i
  • LF[i]: 7 8 4 5 11 6 0 9 10 1 2 3
  • Primary index: 4
32
Inverse BW-Transform: Reconstruction of T
  • Start with T blank.
  • Let u = N. Initialize s = index = the primary
    index (4 in our case).
  • T[u] = L[index]. We know that L[index] is the
    last character of T because M[index], the row at
    the primary index, ends with $.
  • For each i = u-1, ..., 1 do: s = LF[s]
    (threading backwards); T[i] = L[s] (read off the
    next letter back).
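
The loop above can be sketched as follows, taking the LF-mapping as given. Rows are numbered from 0 to match the LF values shown for the mississippi example; the sentinel symbol `$` and the function name `reconstruct` are assumptions made for this illustration.

```python
def reconstruct(last_column, lf, primary_index):
    """Rebuild T back to front by threading through a precomputed LF-mapping."""
    n = len(last_column)
    text = [""] * n
    s = primary_index
    text[n - 1] = last_column[s]   # L[index] is the last character of T
    for i in range(n - 2, -1, -1):
        s = lf[s]                  # thread backwards
        text[i] = last_column[s]   # read off the next letter back
    return "".join(text)


L = "ssmp$pissiii"
LF = [7, 8, 4, 5, 11, 6, 0, 9, 10, 1, 2, 3]
print(reconstruct(L, LF, 4))  # mississippi$
```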

33
Inverse BW-Transform: Reconstruction of T
  • First step
  • s = 4, T = [.._ _ _ _ _ $]
  • Second step
  • s = LF[4] = 11, T = [.._ _ _ _ i $]
  • Third step
  • s = LF[11] = 3, T = [.._ _ _ p i $]
  • Fourth step
  • s = LF[3] = 5, T = [.._ _ p p i $]
  • And so on

34
Who can retrieve the data?
  • Please complete it!

35
Why does LF mapping work?
  • [Figure: the sorted matrix M with the LF-mapping
    7 8 4 5 11 6 0 9 10 1 2 3 and primary index 4.]
  • Which one?
36
Why does LF mapping work?
  • [Figure: the sorted matrix M with the LF-mapping
    7 8 4 5 11 6 0 9 10 1 2 3 and primary index 4.]
  • Why not this?
37
Why does LF mapping work?
  • [Figure: the sorted matrix M with the LF-mapping
    7 8 4 5 11 6 0 9 10 1 2 3 and primary index 4.]
  • Why this?
40
The mathematic explanation
  • T_1 = S_1 P
  • T_2 = S_2 P
  • If T_1 < T_2, then S_1 < S_2
  • Now, let us swap S and P
  • T_1' = P S_1
  • T_2' = P S_2
  • Since S_1 < S_2, we know T_1' < T_2'
41
  • The secret is hidden in the sorting strategy of
    the forward transform.
  • Sorting preserves the relative order of identical
    characters in both the last column and the first
    column.
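
The order-preservation claim can be checked empirically. The sketch below is an illustration only, assuming the sentinel is `$` and ranks last; the function name is hypothetical. For each character c, it takes the rows of M that end with c, moves that c to the front, and confirms that the result is exactly the list of rows starting with c, in the same order.

```python
SENTINEL = "$"  # assumed sentinel symbol, ranked above every other character


def sort_key(row):
    return [(c == SENTINEL, c) for c in row]


def order_is_preserved(text):
    """Check that rows ending in c and rows starting with c share one relative order."""
    rows = sorted((text[i:] + text[:i] for i in range(len(text))), key=sort_key)
    for ch in set(text):
        starting = [r for r in rows if r[0] == ch]
        # Move the final ch of each row to the front: S + ch becomes ch + S.
        ending = [r[-1] + r[:-1] for r in rows if r[-1] == ch]
        if starting != ending:
            return False
    return True


print(order_is_preserved("mississippi$"))  # True
```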

42
  • We had assumed that we have the matrix, but
    actually we don't.
  • Observation, we only need two columns.
  • Amazingly, the information contained in the
    Burrows-Wheeler transform (L) is enough to
    reconstruct F, and hence the mapping, hence the
    original message!

43
  • First, we know all of the characters in the
    original message, even if they're permuted in the
    wrong order. This enables us to reconstruct the
    first column.

44
  • Given only this information, you can easily
    reconstruct the first column. The last column
    tells you all the characters in the text, so just
    sort these characters to get the first column.

45
Inverse BW-Transform: Construction of C
  • Store in C[c] the number of occurrences in T of
    the characters {1, ..., c-1}, i.e. of all
    characters that precede c in the alphabet.
  • In our example
  • T = mississippi$
  • Counts: i = 4, m = 1, p = 2, s = 4, $ = 1
  • C = [0, 4, 5, 7, 11]
  • Notice that C[c] + m is the position of the m-th
    occurrence of c in F (if any), counting rows
    from 1.
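
A small sketch of building C from the last column alone, which works because L contains exactly the characters of T. It assumes the sentinel is `$` and is the greatest character; `build_c` is a name chosen for this example.

```python
from collections import Counter

SENTINEL = "$"  # assumed sentinel symbol, ranked above every other character


def build_c(last_column):
    """Map each character to the number of characters in the text that sort before it."""
    counts = Counter(last_column)
    c_table, total = {}, 0
    for ch in sorted(counts, key=lambda c: (c == SENTINEL, c)):
        c_table[ch] = total
        total += counts[ch]
    return c_table


print(build_c("ssmp$pissiii"))  # {'i': 0, 'm': 4, 'p': 5, 's': 7, '$': 11}
```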

46
Inverse BW-Transform: Constructing the LF-mapping
  • Why and how the LF-mapping?
  • Notice that for every row of M, L[i] directly
    precedes F[i] in the text (thanks to the cyclic
    shifts).
  • Let L[i] = c, let r_i be the number of occurrences
    of c in the prefix L[1..i], and let M[j] be the
    r_i-th row of M that starts with c. Then the
    character in the first column F corresponding to
    L[i] is located at F[j].
  • How to use this fact in the LF-mapping?

47
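Inverse BW-Transform: Constructing the LF-mapping
  • So, define LF[1..N] as
  • LF[i] = C[L[i]] + r_i.
  • C[L[i]] gets us the proper offset to the zeroth
    occurrence of L[i], and the addition of r_i gets
    us the r_i-th row of M that starts with c = L[i].
  • With rows numbered from 0, as in the mississippi
    example, r_i counts only the occurrences of c
    before row i.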
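
The definition can be sketched with rows numbered from 0, so that the rank counts only earlier occurrences. The C table is passed in as a dictionary; this representation and the name `build_lf` are assumptions made for the example.

```python
def build_lf(last_column, c_table):
    """Compute LF[i] = C[L[i]] + (occurrences of L[i] strictly before row i)."""
    seen = {}
    lf = []
    for ch in last_column:
        rank = seen.get(ch, 0)
        lf.append(c_table[ch] + rank)
        seen[ch] = rank + 1
    return lf


C = {"i": 0, "m": 4, "p": 5, "s": 7, "$": 11}
print(build_lf("ssmp$pissiii", C))  # [7, 8, 4, 5, 11, 6, 0, 9, 10, 1, 2, 3]
```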

48
Inverse BW-Transform
  • Construct C[1..|Σ|], which stores in C[i] the
    cumulative number of occurrences in T of the
    characters that precede character i.
  • Construct an LF-mapping LF[1..N] which maps each
    character in L to its occurrence in F, using
    only L and C.
  • Reconstruct T backwards by threading through the
    LF-mapping and reading the characters off of L.
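
Putting the three steps together gives a complete inverse that needs only L and the primary index. This is a minimal sketch under the same assumptions as before (sentinel `$`, ranked last, rows numbered from 0); `bwt_inverse` is a name chosen for illustration.

```python
from collections import Counter

SENTINEL = "$"  # assumed sentinel symbol, ranked above every other character


def bwt_inverse(last_column, primary_index):
    """Recover the original text from L and the primary index."""
    # Step 1: build C from the character counts of L.
    counts = Counter(last_column)
    c_table, total = {}, 0
    for ch in sorted(counts, key=lambda c: (c == SENTINEL, c)):
        c_table[ch] = total
        total += counts[ch]
    # Step 2: build the LF-mapping from L and C.
    seen, lf = Counter(), []
    for ch in last_column:
        lf.append(c_table[ch] + seen[ch])
        seen[ch] += 1
    # Step 3: thread backwards through LF, reading characters off L.
    out, s = [], primary_index
    for _ in range(len(last_column)):
        out.append(last_column[s])
        s = lf[s]
    return "".join(reversed(out))


print(bwt_inverse("ssmp$pissiii", 4))  # mississippi$
```

Each step is a single pass over L, apart from sorting the distinct characters of the alphabet.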

49
Another example
  • You are given an input string ababc
  • (a) Using Burrows-Wheeler, create all cyclic
    shifts of the string
  • (b) sorted order
  • (b) Output L and the primary index.
  • (g) Given L, determine F and LF (and show how you
    do it).
  • (h) Decode the original string using the index,
    L, and LF (and show how you do it).

50
Pros and cons of BWT
  • Pros
  • The transformed text does enjoy a
    compression-favorable property which tends to
    group identical characters together so that the
    probability of finding a character close to
    another instance of the same character is
    increased substantially.
  • More importantly, there exist efficient and smart
    algorithms to restore the original string from
    the transformed result.
  • Cons
  • The need to sort all the contexts up to their
    full length N is the main cause of the
    super-linear time complexity of BWT.
  • Super-linear time algorithms are not hardware
    friendly.

51
Block wise
  • BWT works on blocks of a certain typical size.

52
An improved algorithm: the Schindler Transform (ST)
  • To address the above drawbacks, a slightly
    different transform, called ST, was proposed.
  • ST sorts the texts using only their first k
    characters (where k can be far less than N),
    yet remains reversible.
  • The key idea of ST is a two-level priority
    sorting scheme, which can easily be achieved
    using radix sort:
  • first, the lexicographical sorting criterion;
  • second, the positional sorting criterion.

53
ST transform
  • Let OM be the same matrix as defined for the BWT.
  • Under k-order ST, OM is transformed to M_k by
    sorting all its rows according to their first k
    leftmost characters, i.e. k-order contexts, only.
  • In case that any two k-order contexts are equal,
    the tie is resolved by their relative positions
    in the original OM.
  • i p p i $ m i s s i s s
  • i s s i s s i p p i $ m
  • i s s i p p i $ m i s s
  • i $ m i s s i s s i p p
  • m i s s i s s i p p i $
  • p i $ m i s s i s s i p
  • p p i $ m i s s i s s i
  • s i s s i p p i $ m i s
  • s i p p i $ m i s s i s
  • s s i s s i p p i $ m i
  • s s i p p i $ m i s s i
  • $ m i s s i s s i p p i

Only partially sorted on the leftmost two columns
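
The k-order sort can be sketched with a stable sort on the first k characters. Python's `sorted` is stable, so rows with equal k-order contexts keep their relative positions from OM, which is the tie-breaking rule described above. The sentinel `$` (ranked last) and the name `st_forward` are assumptions for this illustration, and only the forward direction is shown.

```python
SENTINEL = "$"  # assumed sentinel symbol, ranked above every other character


def st_forward(text, k):
    """Sort the rotations by their first k characters only and return the last column."""
    rotations = [text[i:] + text[:i] for i in range(len(text))]
    # Stable sort: ties on the k-character prefix keep their order in OM.
    matrix = sorted(rotations, key=lambda row: [(c == SENTINEL, c) for c in row[:k]])
    return "".join(row[-1] for row in matrix)


print(st_forward("mississippi$", 2))  # smsp$pissiii
```

With k equal to the text length, the result coincides with the last column of the fully sorted BWT matrix.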
54
Pros and Cons of ST
  • Pros
  • Faster than BWT
  • Hardware implementation friendly
  • Cons
  • The currently known approach to inverse ST is
    based on a hashing function.
  • The relationship between inverse ST and inverse
    BWT is not well studied.

55
An application scheme in a data communication system
56
Conclusions
  • The BW transform makes the text (string) more
    amenable to compression.
  • BWT in itself does not modify the data stream. It
    just reorders the symbols inside the data blocks.
  • Evaluating the performance depends on the
    information model assumed, which is another topic.
  • The transform is lossless and reversible

57
BW Transform Summary
  • A naive implementation of the transform can take
    O(n^3) time.
  • The best solutions run in O(n) time but are
    tricky to implement.
  • We can reverse it to reconstruct the original
    text in O(n) time, using O(n) space.
  • Once we obtain L, we can compress L in a provably
    efficient manner.

58
Issues left out
  • What if all characters in the alphabet appear in
    the text, i.e. no sentinel can be used?
  • Do you need to compare all N positions?
  • What if the input data is not ASCII text, but an
    image or a biological sequence (DNA, RNA, or
    protein)?
  • Why the last column and not the first column?
  • In BWT, the last column, L, of the sorted matrix
    contains concentrations of identical characters,
    which is why L is easy to compress. However, the
    first column, F, of the same matrix is even
    easier to compress since it is sorted. Why select
    column L and not column F?

59
homework
  • The BWT algorithms
  • Forward Transform
  • Backward Transform
  • Either in the Windows environment or the Linux
    environment

60
Examples of running your program in the command
line
  • bwt f text1 text2
  • Transforms text1 into text2
  • bwt i text2 text3
  • Inverts text2 into text3

61
How to verify the correctness of your algorithms.
  • Because the BWT is reversible and lossless, if
    your implementation is correct, text3 should be
    the same as text1.
  • You can manually compare text1 and text3.
  • Alternatively, you can run the diff command in
    Linux to report any differences between the two
    files.

62
Requirements
  • Stage 1: use a fixed string or accept a string
    from the keyboard to test the correctness of your
    algorithms. (80 points)
  • Stage 2: then expand your solution to read the
    string from a given file. (20 points) Notice that
    text2 should be a binary file, because the first
    item is the index, followed by the ASCII codes.

63
How to sort the matrix
  • 1. The simplest way
  • Use whatever sorting algorithm you feel
    comfortable with.
  • Make each row a string, then do string comparison.
  • C strings: you need to know the functions for
    string comparison.
  • C++ strings: you need to know how to use the
    string class.
  • Use whichever way you feel the most
    comfortable with.
  • 2. radix sort
  • 3. suffix array
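
The suffix-array route can be sketched as follows. Because the text ends with a unique sentinel, sorting its suffixes gives the same row order as sorting its rotations, so the full matrix never has to be built. The suffix sort below is itself naive (it compares whole suffixes), the sentinel is assumed to be `$` and to rank last, and `bwt_via_suffix_array` is a name chosen for this example.

```python
SENTINEL = "$"  # assumed sentinel symbol, ranked above every other character


def bwt_via_suffix_array(text):
    """Compute (L, primary_index) from a suffix array instead of the rotation matrix."""
    n = len(text)
    # Suffix array: starting positions of the suffixes in sorted order.
    sa = sorted(range(n), key=lambda i: [(c == SENTINEL, c) for c in text[i:]])
    # The character before each suffix is the last character of that rotation;
    # for position 0, text[-1] wraps around to the sentinel.
    last_column = "".join(text[i - 1] for i in sa)
    return last_column, sa.index(0)


print(bwt_via_suffix_array("mississippi$"))  # ('ssmp$pissiii', 4)
```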

64
Knowledge to be practiced for the homework
  • Array
  • Dynamic memory allocation
  • String manipulation
  • Sorting
  • File operation
  • Data compression algorithms