Compression of Concatenated Web Pages Using XBW

1

Compression of Concatenated Web Pages Using XBW
Radovan Šesták, Jan Lánský
radofan, zizelevak_at_gmail.com
Dept. of Software Engineering
Faculty of Mathematics and Physics
Charles University
SOFSEM 2008

2
Synopsis

Motivation
Project XBW
Structure
Implemented methods
Results
surprising results
Conclusion

3
Motivation

EGOTHOR
15-20 MB files
1000 HTML pages
Non-well-formed XML in UNICODE
Text compression XML compression
Large alphabet (words, syllables)
XML tags processing

4
Structure

Parser
Dictionary / Trie
BWT
MTF

RLE
LZC, LZSS
PPM
AC, HC

All methods work over a large alphabet
5
Structure
6
Parser

UNICODE and other encodings
XML structure
Can process non-well-formed XML files
Dictionaries (D1) of used tag and attribute names
We do not need to encode them explicitly
Large alphabet of elements
Element: symbol, syllable, or word
Dictionaries (D2) of used elements
We need to encode them explicitly
D1 ? D2

7
Dictionary of Elements

Element: symbol, syllable, or word
Dictionary
Represented in memory by a trie structure
Each element gets a unique number
Decompression into a linked list
Compression by methods TD2 and TD3

8
Dictionary - TD2

Trie: we traverse recursively (depth-first search) from the root to the nodes.
For each node we encode:
The number of children,
Whether the string (from the root to this node) is a member of the string set.
For the first child of each node we encode its value.
For the second and following children we encode the difference of values from the previous sibling.
Coded values are small and repetitive
The same prefix of two words is written only once
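
The traversal described above can be sketched as follows. This is only an illustration under assumptions: the order of the fields per node (child count, membership flag, then each child's value followed by its subtree), the use of character codes as "values", and the names build_trie, td2_encode and td2_decode are all hypothetical and not taken from XBW.

```python
def build_trie(words):
    """Build a trie; each node records whether a word ends there."""
    root = {"end": False, "kids": {}}
    for word in words:
        node = root
        for ch in word:
            node = node["kids"].setdefault(ch, {"end": False, "kids": {}})
        node["end"] = True
    return root


def td2_encode(node, out):
    """Depth-first encoding: child count, membership flag, child values."""
    out.append(len(node["kids"]))
    out.append(1 if node["end"] else 0)
    prev = 0
    for ch in sorted(node["kids"]):
        # First child: full value (prev is 0). Later children: difference
        # from the previous sibling, which tends to be a small number.
        out.append(ord(ch) - prev)
        prev = ord(ch)
        td2_encode(node["kids"][ch], out)
    return out


def td2_decode(codes):
    """Rebuild the word list from the code sequence (assumed field order)."""
    it = iter(codes)
    words = []

    def walk(prefix):
        n_kids = next(it)
        if next(it):
            words.append(prefix)
        value = 0
        for _ in range(n_kids):
            value += next(it)
            walk(prefix + chr(value))

    walk("")
    return words


codes = td2_encode(build_trie(["car", "cat", "cab", "do"]), [])
print(codes)
print(td2_decode(codes))
```

The shared prefix "ca" contributes its two symbols only once to the code sequence, and the siblings b, r, t are stored as one value plus two small differences.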

9
Dictionary - TD2
10
BWT

Burrows-Wheeler transform
Output is a permutation of the input
Similar prefixes are grouped together
Effective only for large blocks
Combination BWT + MTF + RLE
Bzip
ABRACADABRACA is transformed into CDARRCAAAAABB
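
A minimal sketch of the transform, assuming the textbook definition (sort all cyclic rotations, output the last column plus the row of the original string). It sorts rotations naively, so it is far slower than the suffix sorting methods a real implementation would use; the function names bwt and inverse_bwt are illustrative only.

```python
def bwt(s):
    """Return (last column, row index of the original string)."""
    n = len(s)
    # Sort rotation start positions by the rotation text (naive approach).
    rotations = sorted(range(n), key=lambda i: s[i:] + s[:i])
    last = "".join(s[(i - 1) % n] for i in rotations)
    return last, rotations.index(0)


def inverse_bwt(last, index):
    """Invert the transform in one pass after a stable sort of the symbols."""
    n = len(last)
    # order[j] = position in the last column of the symbol that is
    # j-th in the (stably sorted) first column.
    order = sorted(range(n), key=lambda i: last[i])
    out = []
    pos = index
    for _ in range(n):
        pos = order[pos]
        out.append(last[pos])
    return "".join(out)


transformed, row = bwt("ABRACADABRACA")
print(transformed, row)                 # CDARRCAAAAABB 2
print(inverse_bwt(transformed, row))    # ABRACADABRACA
```

The decoder needs the row index in addition to the permuted text, which is why the sketch returns both.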

11
BWT

Encoding
Compression speed depends on the repetitiveness of the input file
We have implemented different sorting methods (see more in DCC 2008)
Seward O(n^2 log n)
Sadakane O(n log n)
Kärkkäinen O(n)
Decompression runs in linear time

12
MTF

Move to Front
Init: ordered list (L) of all symbols of the input alphabet
Step: one symbol is read from the input file and its position in L is written to output. The symbol is then moved to the front of list L.
With L initially A, B, C, D, the string ABCADAD is encoded as 0122311
Data structure: splay tree
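
The init and step rules can be sketched with a plain Python list. This is an illustration only: the list search is linear in the alphabet size, which is exactly what a splay tree would avoid for a large alphabet, and the function names are hypothetical.

```python
def mtf_encode(data, alphabet):
    """Replace each symbol by its current position, then move it to front."""
    order = list(alphabet)
    out = []
    for sym in data:
        rank = order.index(sym)
        out.append(rank)
        order.insert(0, order.pop(rank))
    return out


def mtf_decode(ranks, alphabet):
    """Inverse: look up the symbol at each position, then move it to front."""
    order = list(alphabet)
    out = []
    for rank in ranks:
        sym = order.pop(rank)
        out.append(sym)
        order.insert(0, sym)
    return "".join(out)


ranks = mtf_encode("ABCADAD", "ABCD")
print(ranks)                      # [0, 1, 2, 2, 3, 1, 1]
print(mtf_decode(ranks, "ABCD"))  # ABCADAD
```

Recently seen symbols get small numbers, which is why MTF works well on BWT output, where equal symbols cluster.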

13
RLE

Run-length coding
Three or more identical symbols in a row in the file are replaced by one triplet:
Special symbol
Repeated symbol
Length of repetition
RLE versions 1, 2, 3 differ in how this triplet is encoded
Compression and decompression run in linear time.

14
RLE

Versions have different memory requirements
First version: O(1)
Second version: O(n), where n is the alphabet size
String 0001000011112222222 -> 19 symbols
Version 1: one special symbol (written # here)
0001#04#14#27 -> 13 symbols
Version 2: special symbols 3, 4, 5 as shortcuts for 0, 1, 2
331344457 -> 9 symbols
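
A minimal sketch of the single-special-symbol variant (version 1), under assumptions: the escape symbol "#", the min_run threshold parameter and the function names are hypothetical, and the output is a plain token list rather than an encoded byte stream. The escape symbol is assumed not to occur in the input.

```python
ESC = "#"  # hypothetical special symbol; must not appear in the data


def rle_encode(data, min_run=3):
    """Replace runs of at least min_run symbols by (ESC, symbol, length)."""
    out = []
    i = 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i]:
            j += 1
        run = j - i
        if run >= min_run:
            out.extend([ESC, data[i], run])
        else:
            out.extend(data[i] * run)  # short runs are copied unchanged
        i = j
    return out


def rle_decode(tokens):
    """Expand every (ESC, symbol, length) triplet back into a run."""
    out = []
    i = 0
    while i < len(tokens):
        if tokens[i] == ESC:
            out.append(tokens[i + 1] * tokens[i + 2])
            i += 3
        else:
            out.append(tokens[i])
            i += 1
    return "".join(out)


tokens = rle_encode("0001000011112222222")
print(len(tokens))         # 13
print(rle_decode(tokens))  # 0001000011112222222
```

A run of exactly three symbols costs three tokens either way, so the token count for this input is 13 with min_run=3 and with min_run=4.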

15
LZC

Dictionary compression method, based on LZ78
Init: the dictionary is filled with all symbols of the alphabet
Step:
The longest string S that can be found in the dictionary is read from input.
String SC is inserted into the dictionary, where C is the symbol that follows S in the input file.
The index of phrase S in the dictionary is written to output.
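
The init and step rules can be sketched as below. Several things are assumptions for illustration: phrase indices start at 0, the alphabet is passed in explicitly, symbols are single characters, and the dictionary grows without bound. A real LZC coder also manages code widths and dictionary resets, which are omitted here.

```python
def lzc_encode(data, alphabet):
    """Emit the dictionary index of each longest known phrase."""
    dictionary = {sym: i for i, sym in enumerate(alphabet)}
    out = []
    s = ""
    for c in data:
        if s + c in dictionary:
            s += c                      # keep extending the match
        else:
            out.append(dictionary[s])   # output index of phrase S
            dictionary[s + c] = len(dictionary)  # insert phrase SC
            s = c
    if s:
        out.append(dictionary[s])
    return out


def lzc_decode(codes, alphabet):
    """Rebuild the same dictionary while decoding."""
    phrases = list(alphabet)
    if not codes:
        return ""
    prev = phrases[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code < len(phrases):
            cur = phrases[code]
        else:
            # The phrase is the one being defined in this very step.
            cur = prev + prev[0]
        phrases.append(prev + cur[0])
        out.append(cur)
        prev = cur
    return "".join(out)


codes = lzc_encode("CABBADCABBADCAB", "ABCD")
print(codes)                      # [2, 0, 1, 1, 0, 3, 4, 6, 8, 10]
print(lzc_decode(codes, "ABCD"))  # CABBADCABBADCAB
```

The decoder never receives the dictionary; it reconstructs each new phrase one step behind the encoder.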

16
LZC

Input file: CABBADCABBADCAB
Step: number of the algorithm step
Index: index of the phrase in the dictionary
Position: position in the input file

17
LZSS

Dictionary compression method, based on LZ77
Repetitive sequences of symbols are stored in the dictionary. Their occurrences in the sliding window are replaced by a pointer into the dictionary.
The dictionary is the part of the file already compressed
The sliding window is a prefix of the part not yet compressed
Init: the sliding window is filled with spaces

18
LZSS

Step:
We search the dictionary for a string equal to a prefix of the sliding window.
We encode a one-bit indicator followed by either a pair (P, L) (position and length of the match) or a symbol C
Data structure: binary search tree
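
A minimal sketch of this step, with several assumptions: the match search is a naive scan instead of a binary search tree, the history is prefilled with spaces, tokens are Python tuples instead of packed bits, and the window, min_len and max_len parameters are hypothetical values chosen for readability.

```python
def lzss_encode(data, window=16, min_len=2, max_len=8):
    """Emit (1, offset, length) for matches and (0, symbol) for literals."""
    buf = " " * window + data   # history starts out filled with spaces
    i = window
    tokens = []
    while i < len(buf):
        best_len, best_off = 0, 0
        for start in range(i - window, i):
            length = 0
            while (length < max_len and i + length < len(buf)
                   and buf[start + length] == buf[i + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, i - start
        if best_len >= min_len:
            tokens.append((1, best_off, best_len))
            i += best_len
        else:
            tokens.append((0, buf[i]))
            i += 1
    return tokens


def lzss_decode(tokens, window=16):
    """Replay the tokens against the same space-filled history."""
    out = [" "] * window
    for tok in tokens:
        if tok[0] == 1:
            _, off, length = tok
            for _ in range(length):
                out.append(out[-off])   # symbol-wise copy allows overlap
        else:
            out.append(tok[1])
    return "".join(out[window:])


tokens = lzss_encode("CABBADCABBADCAB")
print(tokens)
print(lzss_decode(tokens))
```

The first element of each tuple plays the role of the one-bit indicator: 1 announces an (offset, length) pair, 0 announces a literal symbol.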

19
PPM

Lossless compression method
Good results for compression of natural-language text
High memory requirements
For word and syllable versions very high
Experiment in XBW: PPM applied to the output of BWT

20
PPM

Based on an adaptive statistical model
Context length 0 to N (N is a parameter)
Step:
We start in the context of length N.
If the current symbol occurs there, it is encoded by AC.
Otherwise we try the context of length N-1
The last context is -1, where all symbols have probability 1/(alphabet size)
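
The context fallback alone can be sketched as below. This is not a PPM compressor: there is no arithmetic coder and no escape probability estimate, and the count update policy (every context length is updated after each symbol) is an assumption. The function ppm_orders is a hypothetical helper that only reports at which context length each symbol would be coded, with -1 standing for the uniform fallback.

```python
from collections import defaultdict


def ppm_orders(data, max_order=2):
    """For each symbol, return the context length at which it is found."""
    # counts[context][symbol] = how often symbol followed that context
    counts = defaultdict(lambda: defaultdict(int))
    result = []
    for i, sym in enumerate(data):
        coded_at = -1  # order -1: every symbol equally likely
        # Start with the longest available context and shorten it.
        for order in range(min(max_order, i), -1, -1):
            ctx = data[i - order:i]
            if counts.get(ctx, {}).get(sym, 0) > 0:
                coded_at = order
                break
        result.append(coded_at)
        # Update statistics for all context lengths (assumed policy).
        for order in range(min(max_order, i) + 1):
            counts[data[i - order:i]][sym] += 1
    return result


print(ppm_orders("ABABAB", max_order=2))  # [-1, -1, 0, 1, 2, 2]
```

As the model learns, symbols move from the uniform fallback to longer contexts, where they receive higher probabilities and therefore shorter codes.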

21
Corpus
22
Results

In the paper: 12 tables, 3 pages
Compression ratio, compression speed and decompression speed for
BWT + MTF + RLE + AC (best results)
LZSS, LZC, PPM
Over alphabets of symbols, syllables, words
Parser: text mode, XML mode, binary

23
Results

XBW: text mode, words, Kao for BWT

24
Results

Comparison of XBW and bzip2
XBW: words, 20 MB blocks, Kao for BWT
XBW has a compression ratio twice as good as bzip2
XBW's compression time is only twice that of bzip2; its decompression time is 2.5 times that of bzip2.

25
Surprising results

For large text files:
Words and syllables have a better compression ratio than symbols
Words and syllables have a similar compression ratio.
Compression speed is best for words.

26
Surprising results

The effect of the XML parser decreases with increasing block size. At 20 MB blocks it was very small.
Influence of MTF before LZ
Using MTF after BWT has a different effect on LZSS and LZC
The compression ratio drops sharply when we use MTF without BWT

27
Conclusion