2. Text Compression

Transcript and Presenter's Notes

1
2. Text Compression
  • (Chapter 2)

2
Why compress text?
  • Storage
  • Text collections grow to fill whatever storage (disk, memory) is
    available (Parkinson's Law)
  • Ever-larger data: full-text databases, genome data, digital
    libraries, multimedia
  • Buying more storage alone does not solve the problem!
  • A long history of compression: Morse code, Braille

3
Benefits and costs
  • Storage: fewer disks on PCs and in RAID arrays, smaller full-text
    databases
  • Transmission: less data to move (network bandwidth, backups)
  • "Network is computing!"
  • Cost: compression and decompression take processing time, so space
    and bandwidth are traded against computation

4
Scope
  • Text compression must be lossless: the original text is recovered
    exactly
  • Multimedia compression can be lossy, trading some fidelity for size

5
History
  • 1950s: Huffman coding
  • 1970s: Ziv-Lempel (the basis of Lempel-Ziv-Welch, used in GIF) and
    arithmetic coding
  • Typical rates on English text:
  • Huffman: about 5 bits/character
  • Adaptive Ziv-Lempel (1970s): about 4 bits/character
  • Arithmetic coding: about 2 bits/character
  • PPM (Prediction by Partial Matching), early 1980s
  • Slow, and requires a large amount of memory
  • Became practical as memory grew cheaper; with only modest memory
    (on the order of 0.1 to 0.5-1 Mbytes) Ziv-Lempel is the better fit
  • Even a saving of 1 bit/character is significant over a large text
    collection
  • Handling of words and spaces also affects the rate

6
Overview
  • Models
  • Adaptive models
  • Coding
  • Symbolwise models
  • Dictionary models
  • Synchronization
  • Performance comparison

7
Compression methods
  • Two families of methods
  • Symbol-wise methods
  • Dictionary methods
  • Two components
  • Models: static or adaptive
  • Coding

8
Symbol-wise methods
  • Estimate the probabilities of symbols
  • Statistical methods
  • Huffman coding or arithmetic coding
  • Modeling: estimating probabilities
  • Coding: converting the probabilities into a bitstream

9
Dictionary methods
  • Code references to entries in a dictionary
  • Several symbols coded as one output codeword
  • Groups of symbols form the dictionary entries
  • Ziv-Lempel codes by referencing (pointing to) previous occurrences
    of strings, so it is adaptive
  • Hybrid schemes exist, but the best symbol-wise schemes compress
    better than dictionary schemes

10
Models
  • Prediction
  • To predict symbols, which amounts to providing a probability
    distribution for the next symbol to be coded
  • The same model must drive both coding and decoding
  • Information content
  • I(s) = -log2 Pr[s] (bits)
  • Average information content: the entropy (Claude Shannon)
  • H = sum over s of Pr[s] * I(s) = -sum over s of Pr[s] * log2 Pr[s]
  • (a lower bound on compression)
  • When every probability is a negative power of 2, Huffman coding
    achieves the entropy exactly (see the numeric sketch below)
  • A symbol of probability 0 cannot be coded at all (its information
    content is unbounded), so the model must give every possible symbol
    a nonzero probability

11
Pr
  • A symbol of probability 1 needs no bits at all
  • A symbol of probability 0 cannot be coded
  • If u has probability 2%, it costs -log2 0.02, about 5.6 bits
  • After a q, u may have probability 95%, costing only about 0.074
    bits
  • Better prediction means fewer bits!

12
Kinds of model
  • Finite-context model of order m
  • Prediction is based on the m preceding symbols
  • Finite-state model
  • See Figure 2.2
  • The decoder works with an identical probability distribution
  • Synchronization: on a transmission error, synchronization would be
    lost
  • Formal languages such as C or Java can be modeled with grammars

13
Estimation of probabilities in a Model
  • Static modeling
  • The same fixed model is used for every text
  • Simple and fast, but compresses poorly when the text does not match
    the model
  • Semi-static (semi-adaptive) modeling
  • A first pass builds a model for the particular text, a second pass
    encodes with it; the model must be sent to the decoder
  • Drawbacks: two passes over the text, plus the cost of transmitting
    the model
  • Adaptive modeling
  • Start from a bland initial model and update it as the text is
    processed
  • Each symbol's probability is based on the text seen so far

14
Adaptive models
  • Zero-order model: character-by-character counts
  • The zero-frequency problem
  • A character not yet seen (say z) has count 0 and cannot be given a
    probability from counts alone
  • Example: of the 128 ASCII characters, 82 have occurred and 46 have
    not
  • Give each unseen character probability 1/(46 x (768,078 + 1)):
    about 25.07 bits
  • Or add 1 to every count, giving 1/(768,078 + 128): about 19.6 bits
  • For a long text, the choice between such schemes has little effect
    on overall compression
  • Higher-order models
  • The zero-frequency problem becomes more severe
  • First-order model: h occurs 37,526 times and is followed by t 1,139
    times; 1,139/37,526 is about 3.03%, costing about 5.05 bits
    (compare the zero-order estimate)
  • Second-order model: after gh, t follows 64% of the time
    (0.636 bits)
  • Encoder and decoder must update identical models in step
    (synchronization)

15
adaptive modeling
  • Advantages
  • Robust, reliable, flexible
  • Disadvantages
  • Random access is impossible
  • Fragile on communication errors
  • Good for general compression utilities, but not good for full-text
    retrieval

16
Coding
  • The task of coding
  • Convert the probability distribution supplied by the model into a
    bitstream, assigning each symbol a codeword
  • Requirements of a coder
  • Compression
  • Short codewords for likely symbols
  • Long codewords for rare symbols
  • This is where the compression actually comes from
  • Speed
  • Speed can be traded against compression performance
  • In a symbol-wise scheme the coder largely determines the speed
  • Huffman coding: fast
  • Arithmetic coding: better compression, but slower

17
Huffman Coding
  • Fast for both encoding and decoding when used with a static model
  • Adaptive Huffman coding
  • Needs considerable memory and is slow
  • Well suited to full-text retrieval applications
  • Random access is possible (with a static or semi-static model)

18
Examples
  • a  0000  (0.05)
  • b  0001  (0.005)
  • c  001   (0.1)
  • d  01    (0.2)
  • e  10    (0.3)
  • f  110   (0.2)
  • g  111   (0.1)
  • eefggfed
  • 10101101111111101001
  • Prefix(-free) code: no codeword is a prefix of another (verified in
    the sketch below)
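The table can be checked mechanically. A small sketch that encodes
eefggfed with the slide's codewords and decodes the bitstream back,
relying only on the prefix-free property:

  CODE = {'a': '0000', 'b': '0001', 'c': '001', 'd': '01',
          'e': '10', 'f': '110', 'g': '111'}

  def encode(text):
      return ''.join(CODE[ch] for ch in text)

  def decode(bits):
      # Prefix-free codes decode greedily: extend the buffer one bit at
      # a time and emit a symbol as soon as the buffer is a codeword
      inverse = {v: k for k, v in CODE.items()}
      out, buf = [], ''
      for bit in bits:
          buf += bit
          if buf in inverse:
              out.append(inverse[buf])
              buf = ''
      return ''.join(out)

  bits = encode('eefggfed')
  print(bits)          # 10101101111111101001, as on the slide
  print(decode(bits))  # eefggfed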

19
Huffman coding Algorithm
  • See Fig. 2.6
  • Fast for both encoding and decoding
  • Adaptive Huffman coding offers little advantage over arithmetic
    coding
  • With a static or semi-static model, random access is possible
  • Main drawback: every codeword must be a whole number of bits
  • A word-based approach gives very good compression

20
Canonical Huffman Coding I
  • A static zero-order word-level canonical Huffman code: Table 2.2
  • Same codeword lengths as a Huffman code, but the codewords are
    assigned differently
  • Codewords of the same length are consecutive binary numbers
  • Within each length, codewords follow the ordering of the symbols
  • Encoding needs no tree: the codeword lengths and the first codeword
    of each length are enough
  • Example from Table 2.2: said is the 10th of the 7-bit codewords;
    the first 7-bit codeword is 1010100, so its code is
    1010100 + 1001 = 1011101

21
Canonical Huffman Coding II
  • For decoding, only the first codeword of each length (plus a count
    per length) is needed, not the codewords themselves
  • Example: on input 1100000101..., the first 6 bits 110000 fall below
    the first 6-bit codeword (110001), so take 7 bits: 1100000 is
    offset 12 from the first 7-bit codeword 1010100, decoding to with
  • No decoding tree is required

22
Canonical Huffman Coding III
  • Codewords are assigned to the words in a fixed order
  • See Table 2.3
  • A canonical Huffman code need not be produced by the Huffman
    algorithm itself!
  • It is any prefix-free assignment of codewords where the length of
    each code is equal to the depth of that symbol in a Huffman tree
  • Only the codeword lengths matter; the Huffman tree itself is never
    built or stored!
  • With n symbols, swapping children at the n-1 internal nodes gives
    2^(n-1) equivalent codes
  • Exactly one of them is the canonical Huffman code

23
Canonical Huffman code IV
  • Advantages
  • No tree is stored, saving memory
  • No tree traversal, so decoding is fast
  • Decoding needs only the first codeword of each length and the
    number of codewords of each length
  • One comparison per length finds the codeword boundary!
  • Within one length, each codeword is 1 greater than the previous one
  • Example: 4 codewords of 5 bits, 1 of 3 bits, and 3 of 2 bits give
    00000, 00001, 00010, 00011, 001, 01, 10, 11 (see the sketch below)
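That assignment rule is mechanical. A sketch that derives the codewords
from the length counts alone, following the slide's convention (longest
codewords numerically first):

  def canonical_codes(counts):
      # counts maps codeword length -> number of codewords of that
      # length; the lengths are assumed to satisfy Kraft's inequality
      codes, code, prev = [], 0, None
      for length in sorted(counts, reverse=True):   # longest first
          if prev is not None:
              code = (code + 1) >> (prev - length)  # drop unused low bits
          for _ in range(counts[length]):
              codes.append(format(code, f'0{length}b'))
              code += 1
          code -= 1                                 # last value used
          prev = length
      return codes

  print(canonical_codes({5: 4, 3: 1, 2: 3}))
  # ['00000', '00001', '00010', '00011', '001', '01', '10', '11']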

24
Memory requirements
  • An explicit decode tree costs about 24n bytes
  • Each node holds two pointers, and there are about 2n nodes
    (internal plus leaf)
  • A heap-style array of 2n entries needs only about 8n bytes
  • Canonical coding does better still: no tree at all, just the
    codeword lengths and the first codeword of each length

25
Arithmetic Coding
  • Can spend a fractional number of bits on a symbol
  • Compression is limited only by the quality of the model
  • Coding approaches the entropy of the model
  • A very likely symbol can cost far less than 1 bit, whereas Huffman
    must spend at least 1 bit per symbol
  • No code tree is stored, so memory use is low
  • For static and semi-static applications Huffman coding is usually
    preferable
  • Random access is difficult

26
Huffman Code vs. Arithmetic Code
27
Worked example
  • Two symbols with probabilities 0.99 and 0.01
  • Arithmetic coding: -log2 0.99, about 0.015 bits for the likely
    symbol
  • Huffman coding: the per-symbol inefficiency is bounded by
    Pr[s1] + log2(2 log2(e) / e) = Pr[s1] + 0.086, where s1 is the
    most likely symbol: here 1.076 bits
  • English text entropy is about 5 bits per character (zero-order
    character level)
  • The most likely character has probability about 0.18, so the bound
    is 0.18 + 0.086 = 0.266 bits
  • 0.266/5 bits is about a 5.3% inefficiency
  • For highly skewed (e.g., two-symbol) alphabets, use arithmetic
    coding

28
Transmission of output
  • Suppose low = 0.6334 and high = 0.6667
  • Both begin with 6, so the digit 6 can be output now, leaving 0.334
    and 0.667
  • This incremental renormalization lets the coder run in fixed
    (e.g., 32-bit) precision (a full sketch follows)
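To make the interval narrowing concrete, here is a minimal arithmetic
coder sketch using exact fractions; the alphabet and probabilities are
made up, and the renormalization and bit output of a real fixed-precision
implementation are deliberately omitted:

  from fractions import Fraction

  PROBS = {'a': Fraction(2, 3), 'b': Fraction(1, 3)}   # hypothetical model

  def intervals(probs):
      # Give each symbol a subinterval of [0, 1) of width Pr[s]
      lo, table = Fraction(0), {}
      for s, p in probs.items():
          table[s] = (lo, lo + p)
          lo += p
      return table

  def encode(msg, probs):
      # Narrow [low, high) once per symbol; any number in the final
      # interval identifies the message, given its length
      table = intervals(probs)
      low, high = Fraction(0), Fraction(1)
      for s in msg:
          width = high - low
          s_lo, s_hi = table[s]
          low, high = low + width * s_lo, low + width * s_hi
      return low, high

  def decode(x, n, probs):
      table = intervals(probs)
      low, high = Fraction(0), Fraction(1)
      out = []
      for _ in range(n):
          width = high - low
          for s, (s_lo, s_hi) in table.items():
              if low + width * s_lo <= x < low + width * s_hi:
                  out.append(s)
                  low, high = low + width * s_lo, low + width * s_hi
                  break
      return ''.join(out)

  low, high = encode('aaba', PROBS)
  print(decode(low, 4, PROBS))   # aaba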

29
Arithmetic Coding (Static Model)
30
Decoding (Static Model)
31
Arithmetic Coding (Adaptive Model)
32
Decoding (Adaptive Model)
33
Cumulative Count Calculation
  • The model must supply cumulative counts quickly
  • Heap-structured array of counts
  • Encoding symbol 101101 touches entries 101101, 1011, 101, 1 on the
    path toward the root
  • Update and lookup cost O(log n) per symbol (see the sketch below)
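A Fenwick (binary indexed) tree is one standard structure for this job:
the same strip-the-bits walk as above, with O(log n) update and
cumulative count. A minimal sketch (not the book's exact heap layout):

  class Fenwick:
      def __init__(self, n):
          self.tree = [0] * (n + 1)        # implicit tree over counts 1..n

      def add(self, i, delta=1):
          # count[i] += delta (1-based symbol index)
          while i < len(self.tree):
              self.tree[i] += delta
              i += i & (-i)                # climb to the next covering node

      def cumulative(self, i):
          # Sum of the counts of symbols 1..i
          total = 0
          while i > 0:
              total += self.tree[i]
              i -= i & (-i)                # strip the lowest set bit
          return total

  f = Fenwick(8)
  for sym in [3, 3, 5, 1]:
      f.add(sym)
  print(f.cumulative(3), f.cumulative(5))  # 3 4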

34
Symbolwise models
  • Symbolwise model plus coder (arithmetic or Huffman)
  • Three approaches
  • PPM (Prediction by Partial Matching)
  • DMC (Dynamic Markov Compression)
  • Word-based compression
35
PPM ( Prediction by Partial Matching )
  • Finite-context models of characters
  • Uses variable-length contexts, partially matching the current
    context against the preceding text
  • Zero-frequency problem handled with an escape symbol
  • PPMA: escape method A gives the escape symbol a count of 1
36
Escape method
  • Escape method A (PPMA): the escape symbol gets a count of 1
  • Exclusion: symbols already predicted by a longer context are
    removed from the lower-order counts; e.g., in a context seen 201
    times (202 with the escape), excluding 22 already-predicted
    occurrences leaves 180, so a symbol counted 19 times gets 19/180
    rather than 19/202
  • Method C: escape probability r/(n+r) and symbol probability
    ci/(n+r), for n symbols in total and r distinct symbols; about 2.5
    bits per character on Hardy's book
  • Method D: escape probability r/(2n)
  • Method X: based on t1, the number of symbols seen exactly once;
    escape probability (t1+1)/(n+t1+1)
  • PPMZ and Swiss Army Knife Data Compression (SAKDC, 1991) refine
    these ideas with many tuned parameters
  • See Figure 2.24 (compared in the sketch below)
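The escape formulas are simple to compare on a toy context. A sketch
with made-up counts (n symbols seen, r of them distinct, t1 seen exactly
once):

  from fractions import Fraction

  counts = {'e': 10, 't': 5, 'a': 1}      # hypothetical context counts
  n = sum(counts.values())                 # 16 symbols seen in this context
  r = len(counts)                          # 3 distinct symbols
  t1 = sum(1 for c in counts.values() if c == 1)   # 1 symbol of frequency 1

  escape = {
      'A': Fraction(1, n + 1),             # escape symbol given a count of 1
      'C': Fraction(r, n + r),
      'D': Fraction(r, 2 * n),
      'X': Fraction(t1 + 1, n + t1 + 1),
  }
  for method, p in escape.items():
      print(method, p)                     # A 1/17, C 3/19, D 3/32, X 1/9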

37
Block-sorting compression
  • Published in 1994
  • Gives very good compression
  • Transforms the input first, much as image compression uses the
    discrete cosine transform or Fourier transform
  • The input is processed block by block, and each block is permuted
    by sorting! (a sketch follows)
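Block sorting is the Burrows-Wheeler transform. A naive sketch of the
forward transform (sorting every rotation; real implementations use
suffix arrays, and the inverse transform and follow-on coding stages are
omitted):

  def bwt(block):
      # Sort all rotations of the block and keep the last column;
      # a sentinel marks the end so the transform is invertible
      s = block + '\0'
      rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
      return ''.join(rot[-1] for rot in rotations)

  print(repr(bwt('banana')))   # 'annb\x00aa': like characters cluster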

38
DMC ( Dynamic Markov Compression )
  • Finite-state model
  • Adaptive model: both the probabilities and the structure of the
    finite-state machine adapt
  • See Figure 2.13
  • Avoids the zero-frequency problem
  • See Figure 2.14
  • Cloning: the heuristic by which a DMC adapts the structure of its
    state machine
39
Word-based Compression
  • Parse a document into words and nonwords (as sketched below)
  • Textual and non-textual streams are modeled separately, e.g., a
    zero-order model of the words
  • Suitable for large full-text databases
  • Low-frequency words inflate the vocabulary, e.g., digits and page
    numbers
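The parse itself is a one-liner. A sketch, assuming the usual convention
that a word is a run of alphanumerics and a nonword is everything in
between:

  import re

  def parse(text):
      # Alternating words and nonwords; every character lands in exactly
      # one token, so the document can be reassembled exactly
      return re.findall(r'[A-Za-z0-9]+|[^A-Za-z0-9]+', text)

  tokens = parse('See page 2, then page 3.')
  print(tokens)                       # ['See', ' ', 'page', ' ', '2', ...]
  assert ''.join(tokens) == 'See page 2, then page 3.'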
40
Dictionary Models
  • Principle: replace substrings of the text with codewords that
    reference dictionary entries
  • Adaptive dictionary compression: LZ77 and LZ78
  • Approaches
  • LZ77 and its descendant Gzip
  • LZ78 and its descendant LZW
41
Dictionary Model - LZ77
  • Adaptive dictionary model
  • Characteristics: easy to implement, quick decoding, small memory
    footprint
  • See Figure 2.16
  • Output is a stream of triples <offset, length of phrase, next
    character> (see the sketch below)
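A minimal LZ77 sketch with a brute-force match search (a real coder
limits and indexes the window; the window size here is arbitrary):

  def lz77_encode(text, window=16):
      # Emit <offset, length, next character> triples; the dictionary is
      # simply the last `window` characters already encoded
      i, out = 0, []
      while i < len(text):
          best_off = best_len = 0
          for j in range(max(0, i - window), i):
              k = 0
              # Matches may run past i, overlapping the new text
              while i + k < len(text) - 1 and text[j + k] == text[i + k]:
                  k += 1
              if k > best_len:
                  best_off, best_len = i - j, k
          out.append((best_off, best_len, text[i + best_len]))
          i += best_len + 1
      return out

  def lz77_decode(triples):
      out = []
      for off, length, ch in triples:
          for _ in range(length):
              out.append(out[-off])     # copy, possibly self-overlapping
          out.append(ch)
      return ''.join(out)

  triples = lz77_encode('abababc')
  print(triples)               # [(0, 0, 'a'), (0, 0, 'b'), (2, 4, 'c')]
  print(lz77_decode(triples))  # abababc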
42
Dictionary Model - LZ77(continue)
  • Improvements
  • Offset: shorter codewords for recent matches
  • Match length: a variable-length code
  • Character: sent raw only when no match is found
  • See Figure 2.17
43
Dictionary Model - Gzip
  • Based on LZ77
  • Hash table to find matches quickly
  • Output pairs <offset, matched length>
  • Uses Huffman codes: semi-static canonical Huffman codes over blocks
    of up to 64K, with a code table sent per block
44
Dictionary Model - LZ78
  • Adaptive dictionary model
  • References previously parsed phrases
  • See Figure 2.18
  • Output tuples <phrase number, character>, where phrase 0 is the
    empty string
  • See Figure 2.19
45
Dictionary Model - LZ78(continue)
  • Characteristics
  • A hash table keeps encoding simple and fast
  • Decoding is slower
  • The trie (phrase dictionary) can consume much memory
46
Dictionary Model - LZW
  • A variant of LZ78
  • Encodes only the phrase number: no explicit characters in the
    output
  • Each new phrase is the previous phrase with the first character of
    the next phrase appended
  • See Figure 2.20
  • Characteristics: good compression, easy to implement (see the
    sketch below)
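A minimal LZW encoder sketch (byte alphabet; the matching decoder and
the variable-width codeword output of real implementations are omitted):

  def lzw_encode(text):
      # The dictionary starts with every single character; each output
      # is a phrase number only. The phrase just matched, extended by
      # the first character of the next phrase, becomes a new entry.
      dictionary = {chr(i): i for i in range(256)}
      phrase, out = '', []
      for ch in text:
          if phrase + ch in dictionary:
              phrase += ch                 # keep extending the match
          else:
              out.append(dictionary[phrase])
              dictionary[phrase + ch] = len(dictionary)
              phrase = ch
      if phrase:
          out.append(dictionary[phrase])
      return out

  print(lzw_encode('abababa'))   # [97, 98, 256, 258]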
47
Synchronization
  • Random access requires synchronization
  • Random access is impossible with variable-length codes and adaptive
    models alone
  • Insert synchronization points
  • To synchronize an adaptive model, break a large file into small
    sections and compress each independently
48
Creating synchronization point
  • The main text consists of a number of documents
  • Bit offset: store each document's starting position as a bit offset
    into the compressed file
  • Byte offset alternatives
  • An end-of-document symbol
  • The length of each document at its beginning
  • An end-of-file marker
49
Self-synchronizing codes
  • Not especially useful for full-text retrieval
  • Motivation
  • Decoding can regain synchronization within a few symbols (a
    synchronizing cycle) even when part of the text is corrupted or the
    beginning is missing
  • Fixed-length codes are not self-synchronizing
  • See Table 2.3
  • See Figure 2.22
50
Performance comparisons
  • Considerations: compression speed, compression performance,
    computing resources
  • See Table 2.4
51
Compression Performance
  • Calgary corpus: English text, program source code, bilevel
    facsimile images, geological data, program object code
  • Measured in bits per character
  • See Figure 2.24
52
Compression speed
  • Speed depends on the method of implementation, the machine
    architecture, and the compiler
  • Better compression generally means a slower program
  • Ziv-Lempel based methods: decoding is faster than encoding
  • See Table 2.6
53
Other Performance considerations
  • Memory usage: adaptive models need much memory; Ziv-Lempel uses far
    less than symbol-wise models
  • Random access requires synchronization points