Compression of Concatenated Web Pages Using XBW

1
  • Compression of Concatenated Web Pages Using XBW
  • Radovan Šesták, Jan Lánský
  • radofan, zizelevak_at_gmail.com
  • Dept. of Software Engineering
  • Faculty of Mathematics and Physics
  • Charles University
  • SOFSEM 2008

2
Synopsis
  • Motivation
  • Project XBW
  • Structure
  • Implemented methods
  • Results
  • Surprising results
  • Conclusion

3
Motivation
  • EGOTHOR
  • 15-20 MB files
  • 1000 HTML pages
  • Non-well-formed XML in UNICODE
  • Text compression XML compression
  • Large alphabet (words, syllables)
  • XML tags processing

4
Structure
  • Parser
  • Dictionary / Trie
  • BWT
  • MTF
  • RLE
  • LZC, LZSS
  • PPM
  • AC, HC

All methods work over a large alphabet
5
Structure
6
Parser
  • UNICODE and other encodings
  • XML structure
  • Can process non-well-formed XML files
  • Dictionaries (D1) of used tag and attribute names
  • We do not need to encode these explicitly
  • Large alphabet of elements
  • Element: symbol, syllable, or word
  • Dictionaries of used elements (D2)
  • We need to encode these explicitly
  • D1 ? D2

7
Dictionary of Elements
  • Element: symbol, syllable, or word
  • Dictionary
  • Represented in memory by a trie structure
  • Each element gets a unique number
  • Decompression into a linked list
  • Compression by methods TD2 and TD3
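
The idea that each element gets a unique number can be illustrated with a minimal sketch. It assumes word elements split by a simple regular expression and uses a plain Python dict instead of the trie mentioned on the slide; the function names are hypothetical and are not part of XBW.

```python
import re


def elements_to_ids(text):
    """Split text into word and non-word elements and number them.

    Assumption: a word is a maximal run of \\w characters; everything
    between words is kept as a separate element so the text can be
    restored exactly.
    """
    elements = re.findall(r"\w+|\W+", text)
    dictionary = {}  # element -> unique number, in order of first occurrence
    ids = [dictionary.setdefault(e, len(dictionary)) for e in elements]
    return ids, dictionary


def ids_to_text(ids, dictionary):
    """Invert elements_to_ids using the same dictionary."""
    by_number = {number: element for element, number in dictionary.items()}
    return "".join(by_number[i] for i in ids)


ids, dictionary = elements_to_ids("to be or not to be")
print(ids)
print(ids_to_text(ids, dictionary))
```

The sequence of numbers, not the raw characters, is what the later stages (BWT, MTF, RLE, and so on) would then process as their large alphabet.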

8
Dictionary - TD2
  • Trie: we traverse recursively (depth-first search) from
    the root to the nodes.
  • For each node we encode:
  • The number of sons,
  • Whether the string (from the root to this node)
    represents a member of the string set.
  • For the first son of each node we encode its value.
    For the second and following sons we encode the
    difference of their values from the previous brother.
  • Coded values are small and repetitive
  • The common prefix of two words is written only once
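
The traversal described above can be sketched roughly as follows. This is only an illustration under assumptions: the exact order in which TD2 emits its fields is not given on the slide, node values are taken to be character codes, and the helper names (`build_trie`, `td2_encode`, `td2_decode`) are hypothetical.

```python
def build_trie(words):
    """Build a trie of nested dicts from a set of strings."""
    root = {"children": {}, "member": False}
    for word in words:
        node = root
        for ch in word:
            node = node["children"].setdefault(
                ch, {"children": {}, "member": False}
            )
        node["member"] = True
    return root


def td2_encode(node, out):
    """Depth-first encoding: number of sons, membership flag, son values."""
    sons = sorted(node["children"].items())
    out.append(len(sons))
    out.append(1 if node["member"] else 0)
    previous = 0
    for position, (ch, son) in enumerate(sons):
        value = ord(ch)
        # First son: its value. Later sons: difference from the previous brother.
        out.append(value if position == 0 else value - previous)
        previous = value
        td2_encode(son, out)
    return out


def td2_decode(stream):
    """Rebuild the string set from the number stream."""
    numbers = iter(stream)
    words = []

    def walk(prefix):
        son_count = next(numbers)
        if next(numbers):
            words.append(prefix)
        previous = 0
        for position in range(son_count):
            value = next(numbers)
            if position > 0:
                value += previous
            previous = value
            walk(prefix + chr(value))

    walk("")
    return words


stream = td2_encode(build_trie(["car", "cat", "cab", "do"]), [])
print(stream)
print(td2_decode(stream))
```

Because brothers are visited in sorted order, the differences are small positive numbers, and a shared prefix such as "ca" appears in the stream only once.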

9
Dictionary - TD2
10
BWT
  • Burrows-Wheeler transform
  • Output is permuted input
  • Symbols with similar contexts are grouped together
  • Effective only for large blocks
  • Combination BWT + MTF + RLE
  • Bzip
  • ABRACADABRACA is transformed into CDARRCAAAAABB
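
The ABRACADABRACA example can be reproduced with a naive sketch that sorts all rotations of the input. This is for illustration only: it is quadratic in memory, it is not one of the suffix sorting algorithms used in XBW, and the function name `bwt` is an assumption.

```python
def bwt(text):
    """Naive Burrows-Wheeler transform of a non-empty string.

    Returns the last column of the sorted rotation matrix and the row
    index of the original string, which a decoder needs.
    """
    n = len(text)
    # Sort rotation start positions by the rotation they denote.
    rotations = sorted(range(n), key=lambda i: text[i:] + text[:i])
    # The last symbol of the rotation starting at i is text[i - 1].
    last_column = "".join(text[i - 1] for i in rotations)
    return last_column, rotations.index(0)


print(bwt("ABRACADABRACA"))  # ('CDARRCAAAAABB', 2)
```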

11
BWT
  • Encoding
  • Compression speed depends on the repetitiveness
    of the input file
  • We have implemented different sorting methods
    (see more in DCC 2008)
  • Seward O(n^2 log n)
  • Sadakane O(n log n)
  • Kärkkäinen O(n)
  • Decompression runs in linear time
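
To show why decoding needs no suffix sorting, here is a minimal sketch of the inverse transform. It assumes the encoder also stored the row index of the original string (2 for the ABRACADABRACA example); the name `inverse_bwt` is hypothetical. The sketch uses Python's stable `sorted`, so it is O(n log n) as written; replacing that call with a counting sort over the alphabet would give the linear bound.

```python
def inverse_bwt(last_column, primary_index):
    """Invert the BWT given the last column and the original row index."""
    n = len(last_column)
    # Stable sort of positions by symbol: follow[k] is the row whose
    # rotation is the rotation in row k shifted left by one symbol.
    follow = sorted(range(n), key=lambda i: last_column[i])
    out = []
    row = primary_index
    for _ in range(n):
        row = follow[row]
        out.append(last_column[row])
    return "".join(out)


print(inverse_bwt("CDARRCAAAAABB", 2))  # ABRACADABRACA
```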

12
MTF
  • Move to Front
  • Init: ordered list (L) of all symbols of the input
    alphabet
  • Step: one symbol is read from the input file and its
    position in L is written to the output. The symbol is
    then moved to the front of L.
  • With L initialized to A, B, C, D, the string ABCADAD is
    encoded as 0122311
  • Data structure: splay tree
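
The init and step rules can be sketched as below. The sketch assumes a plain Python list for L, which costs time proportional to the alphabet size per symbol; the splay tree named on the slide is what makes this practical for large alphabets. The function names are hypothetical.

```python
def mtf_encode(symbols, alphabet):
    """Move-to-front: output each symbol's current position in L."""
    order = list(alphabet)  # the ordered list L
    out = []
    for symbol in symbols:
        position = order.index(symbol)
        out.append(position)
        order.insert(0, order.pop(position))  # move the symbol to the front
    return out


def mtf_decode(positions, alphabet):
    """Inverse transform: replay the same list updates."""
    order = list(alphabet)
    out = []
    for position in positions:
        symbol = order.pop(position)
        out.append(symbol)
        order.insert(0, symbol)
    return out


encoded = mtf_encode("ABCADAD", "ABCD")
print(encoded)
print("".join(mtf_decode(encoded, "ABCD")))
```

Recently seen symbols get small numbers, which is why MTF pairs well with the grouped output of BWT.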

13
RLE
  • Run-length coding
  • Three or more identical symbols in a row in the file
    are replaced by one triplet:
  • Special symbol
  • Repeated symbol
  • Length of repetition
  • RLE versions 1, 2, 3 differ in how they encode this
    triplet
  • Compression and decompression run in
    linear time.

14
RLE
  • Versions have different memory requirements
  • First version: O(1)
  • Second version: O(n), where n is the alphabet size
  • string 0001000011112222222 -> 19 symbols
  • version 1: one special symbol (written here as #)
  • #03 1 #04 #14 #27 -> 13 symbols
  • version 2: special symbols 3, 4, 5 as shortcuts
    for 0, 1, 2
  • 33 1 34 44 57 -> 9 symbols
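
The two versions can be compared with a small sketch that reproduces the symbol counts of the example. Several things here are assumptions: the special symbol of version 1 is written as "#", version 2 maps symbol s to the special symbol s + n for alphabet size n, run lengths are emitted as single numbers, and the function names are hypothetical. A real coder would also have to keep run lengths distinguishable from data symbols.

```python
def runs(symbols):
    """Yield (symbol, run_length) pairs for consecutive equal symbols."""
    i = 0
    while i < len(symbols):
        j = i
        while j < len(symbols) and symbols[j] == symbols[i]:
            j += 1
        yield symbols[i], j - i
        i = j


def rle_v1(symbols, special="#", min_run=3):
    """Version 1: triplet (special symbol, repeated symbol, length)."""
    out = []
    for symbol, length in runs(symbols):
        if length >= min_run:
            out += [special, symbol, length]
        else:
            out += [symbol] * length
    return out


def rle_v2(symbols, alphabet_size, min_run=3):
    """Version 2: one special symbol per alphabet symbol, then the length."""
    out = []
    for symbol, length in runs(symbols):
        if length >= min_run:
            out += [symbol + alphabet_size, length]
        else:
            out += [symbol] * length
    return out


data = [int(c) for c in "0001000011112222222"]
print(len(data), len(rle_v1(data)), len(rle_v2(data, 3)))  # 19 13 9
```

Version 2 saves one symbol per run but needs a special symbol for every alphabet symbol, which matches the O(n) memory figure on the slide.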

15
LZC
  • Dictionary compression method, based on LZ78
  • Init: the dictionary is filled with all symbols of
    the alphabet
  • Step
  • The longest string S that can be found in the
    dictionary is read from the input.
  • The string SC is inserted into the dictionary, where
    C is the symbol that follows S in the input file.
  • The index of the phrase S in the dictionary is sent
    to the output.

16
LZC
  • Input file: CABBADCABBADCAB
  • Step: number of the algorithm step
  • Index: index of the phrase in the dictionary
  • Position: position in the input file
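
The init and step rules of LZC can be traced on the input CABBADCABBADCAB with the following sketch. It covers only the phrase dictionary logic; the variable-width codes and dictionary resets of a full LZC coder are left out, the initial indices A=0, B=1, C=2, D=3 are an assumption, and the function names are hypothetical.

```python
def lzc_encode(text, alphabet):
    """Emit dictionary indices of the longest known phrases."""
    dictionary = {symbol: i for i, symbol in enumerate(alphabet)}
    out = []
    phrase = ""
    for symbol in text:
        if phrase + symbol in dictionary:
            phrase += symbol  # keep extending the match
        else:
            out.append(dictionary[phrase])
            dictionary[phrase + symbol] = len(dictionary)  # new phrase
            phrase = symbol
    if phrase:
        out.append(dictionary[phrase])
    return out


def lzc_decode(codes, alphabet):
    """Rebuild the text, growing the same dictionary one step behind."""
    if not codes:
        return ""
    phrases = list(alphabet)
    previous = phrases[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code < len(phrases):
            current = phrases[code]
        else:
            # The phrase being defined right now: previous + its first symbol.
            current = previous + previous[0]
        out.append(current)
        phrases.append(previous + current[0])
        previous = current
    return "".join(out)


codes = lzc_encode("CABBADCABBADCAB", "ABCD")
print(codes)
print(lzc_decode(codes, "ABCD"))
```

The second half of the input repeats the first, so later steps emit indices of multi-symbol phrases and the output is shorter than the input.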

17
LZSS
  • Dictionary compression method, based on LZ77
  • Repetitive sequences of symbols are stored in the
    dictionary. Their occurrences in the sliding window
    are replaced by a pointer to the dictionary.
  • The dictionary is part of the compressed file
  • The sliding window is a prefix of the uncompressed file
  • Init: the sliding window is filled with spaces

18
LZSS
  • Step
  • We search the dictionary for the longest string that
    equals a prefix of the sliding window.
  • We encode a one-bit indicator followed by either a
    pair (P, L) or a symbol C
  • Data structure: binary search tree
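
The step can be sketched as follows. This is a simplified illustration under assumptions: the longest match is found by a naive scan rather than a binary search tree, P is stored as a distance back from the current position, the buffer is not pre-filled with spaces, and the window size and length limits are arbitrary. Tokens are kept as Python tuples instead of packed bits, and the function names are hypothetical.

```python
def lzss_encode(data, window=4096, min_len=3, max_len=18):
    """Return tokens (1, distance, length) or (0, symbol)."""
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - window), i):
            length = 0
            while (length < max_len and i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_len:
            tokens.append((1, best_dist, best_len))  # indicator 1: pair (P, L)
            i += best_len
        else:
            tokens.append((0, data[i]))  # indicator 0: symbol C
            i += 1
    return tokens


def lzss_decode(tokens):
    """Replay the tokens; copies may overlap the text being written."""
    out = []
    for token in tokens:
        if token[0] == 1:
            _, distance, length = token
            for _ in range(length):
                out.append(out[-distance])
        else:
            out.append(token[1])
    return "".join(out)


tokens = lzss_encode("CABBADCABBADCAB")
print(tokens)
print(lzss_decode(tokens))
```

Short matches are emitted as plain symbols because a pair (P, L) would cost more bits than the symbols it replaces.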

19
PPM
  • Lossless compression method
  • Good results for compression of text in natural
    languages
  • High memory requirements
  • Very high for the word and syllable versions
  • Experiment in XBW: applying PPM to the output of BWT

20
PPM
  • Based on an adaptive statistical model
  • Context length 0 to N (N is a parameter)
  • Step
  • We start in the context of length N.
  • If the current symbol occurs there, it is encoded by
    AC.
  • Otherwise we try the context of length N-1, and so on
  • The last context is -1, where all symbols have
    probability 1/(alphabet size)
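
The context fallback can be sketched as a model that only computes the probability an arithmetic coder would be given. The escape estimate used here (distinct symbols over total plus distinct, as in the PPMC variant) is an assumption, since the slide does not say which escape method XBW uses; exclusions are omitted, every context order is updated after each symbol, and the class name `PPMSketch` is hypothetical.

```python
from collections import Counter, defaultdict


class PPMSketch:
    def __init__(self, alphabet, max_order=2):
        self.alphabet = list(alphabet)
        self.max_order = max_order  # the parameter N
        self.counts = defaultdict(Counter)  # context tuple -> symbol counts

    def probability(self, history, symbol):
        """Probability of symbol after history, falling back N, N-1, ..., -1."""
        p = 1.0
        for order in range(min(self.max_order, len(history)), -1, -1):
            context = tuple(history[len(history) - order:])
            seen = self.counts.get(context)
            if not seen:
                continue  # context never seen: fall back at no cost
            total = sum(seen.values())
            distinct = len(seen)
            if symbol in seen:
                return p * seen[symbol] / (total + distinct)
            p *= distinct / (total + distinct)  # cost of the escape
        return p / len(self.alphabet)  # context -1: uniform distribution

    def update(self, history, symbol):
        """Count the symbol in every context of length 0 to N."""
        for order in range(min(self.max_order, len(history)) + 1):
            context = tuple(history[len(history) - order:])
            self.counts[context][symbol] += 1


model = PPMSketch("ABCD", max_order=2)
history = []
for symbol in "ABABAB":
    print(symbol, model.probability(history, symbol))
    model.update(history, symbol)
    history.append(symbol)
```

The table of contexts grows with every distinct context seen, which gives a feel for why memory use becomes very high once the symbols are words or syllables.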

21
Corpus
22
Results
  • In the paper: 12 tables, 3 pages
  • Compression ratio, compression speed, and
    decompression speed for
  • BWT + MTF + RLE + AC (best results)
  • LZSS, LZC, PPM
  • Over alphabets of symbols, syllables, words
  • Parser: text mode, XML mode, binary mode

23
Results
  • XBW: text mode, words, Kao's algorithm for BWT

24
Results
  • Comparison of XBW and bzip2
  • XBW: words, blocks of 20 MB, Kao's algorithm for BWT
  • XBW achieves a compression ratio twice as good as
    bzip2
  • XBW compression takes only about twice as long as
    bzip2; decompression takes 2.5 times as long as
    bzip2.

25
Surprising results
  • For large text files
  • Words and syllables have a better compression ratio
    than symbols
  • Words and syllables have a similar compression
    ratio.
  • Compression speed is best for words.

26
Surprising results
  • The effect of the XML parser decreases with
    increasing block size. With 20 MB blocks it was
    very small.
  • Influence of MTF before LZ
  • Using MTF after BWT has a different effect on
    LZSS and LZC
  • The compression ratio deteriorates sharply when we
    use MTF without BWT

27
Conclusion
  • For compression of concatenated HTML pages we
    recommend (default XBW settings):
  • BWT + MTF + RLE + AC,
  • BWT: Kao's algorithm; RLE: version 2
  • Parser: XML or text mode, words
  • XBW
  • http://xbw.sourceforge.net/
  • High modularity, good testing tool
  • We are still improving XBW