Title: 2. Text Compression
Why compression?
- Motivation
  - Ever more text must be stored and transmitted, so the costs in
    space and time keep growing
  - Parkinson's Law: data expands to fill the space available
  - Examples of ever-growing collections: genome databases, among others
  - However much storage is added, it fills up!
- Compression is an old idea
  - Early examples: Morse code, Braille, shorthand
Is compression still worthwhile?
- Storage is cheap
  - PC storage prices keep falling; RAID makes large disks practical
  - Databases of enormous capacity are affordable
  - Hardware costs keep dropping (disks, memory)
  - Network is computing!
- But compression still pays
  - It cuts storage cost and, more importantly, transmission time
  - The less data moved across disks and networks, the faster the
    system runs, so compression grows more attractive, not less
Text vs. multimedia compression
- Text compression must be lossless: the original must be recovered exactly
- Multimedia compression can be lossy: small losses are tolerable
A brief history
- 1950s: Huffman coding
- 1970s: Ziv-Lempel (later Lempel-Ziv-Welch, used in GIF),
  arithmetic coding
- On English text:
  - Huffman coding: about 5 bits/character
  - adaptive variants exist
  - Ziv-Lempel: about 4 bits/character (1970s)
  - Arithmetic coding: about 2 bits/character
- PPM (Prediction by Partial Matching)
  - 1980s
  - slow and requires a large amount of memory
- Memory used to be the limiting resource: methods needing roughly
  0.5-1 Mbytes lost out to Ziv-Lempel at about 0.1 Mbytes
- Shannon estimated that English text could in principle be compressed
  to about 1 bit/character, so there is still room for improvement
- The practical trade-off is compression versus time and space (memory)
Outline
- Models
- Adaptive models
- Coding
- Symbolwise models
- Dictionary models
- Synchronization
- Performance comparison
Terminology
- Compression methods
  - Symbol-wise methods
  - Dictionary methods
- Components
  - Models
    - static vs. adaptive
  - Coding
Symbol-wise methods
- Estimate the probabilities of symbols
  - statistical methods
  - Huffman coding or arithmetic coding
- Modeling: estimating probabilities
- Coding: converting the probabilities into bitstreams
Dictionary methods
- Code references to entries in the dictionary
  - several symbols as one output codeword
- Groups of symbols are stored in a dictionary
- Ziv-Lempel coding references (points to) a previous occurrence of the
  string, so the dictionary is adaptive
- Hybrid schemes exist, but the best symbol-wise schemes still compress
  better
Models
- Prediction
  - to predict symbols, which amounts to providing a probability
    distribution for the next symbol to be coded
  - the same model must be used for coding and decoding
- Information content
  - I(s) = -log2 Pr[s] (bits)
- Average information content: entropy (Claude Shannon)
  - H = sum over s of Pr[s] I(s) = -sum over s of Pr[s] log2 Pr[s]
  - a lower bound on compression
- As one probability approaches 1, the entropy approaches 0, but
  Huffman coding still spends at least 1 bit per symbol!
- A symbol of probability 0 contributes nothing to the entropy (the
  limit is 0), but it cannot be assigned a codeword at all
Probabilities and information content
- A symbol of probability 1 needs no bits at all
- A symbol of probability 0 cannot be coded
- If u occurs with probability 2%, it costs -log2(0.02) = 5.6 bits
- If u follows q with probability 95%, it costs -log2(0.95) = 0.074 bits
- The same symbol can cost a very different number of bits depending on
  the context!
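The u figures above are just -log2 of the stated probabilities. A minimal Python sketch (the function names are mine, not from the book) that reproduces them and computes the entropy bound:

```python
import math

def information_content(p):
    """Bits needed for a symbol of probability p: I(s) = -log2 Pr[s]."""
    return -math.log2(p)

def entropy(dist):
    """H = -sum Pr[s] * log2 Pr[s], a lower bound on bits per symbol."""
    return sum(p * information_content(p) for p in dist if p > 0)

print(information_content(0.02))   # 'u' at 2%            -> about 5.64 bits
print(information_content(0.95))   # 'u' after 'q' at 95%  -> about 0.074 bits
print(entropy([0.05, 0.05, 0.1, 0.2, 0.3, 0.2, 0.1]))  # about 2.55 bits:
    # the distribution from the Huffman example later in these slides
```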
Kinds of models
- Finite-context model of order m
  - uses the preceding m symbols to condition the prediction
- Finite-state model
  - Figure 2.2
- The decoder works with an identical probability distribution
  - synchronization
  - on a transmission error, synchronization would be lost
- Grammar models suit formal languages such as C or Java
Estimating the probabilities in a model
- Static modeling
  - the same fixed model is used for every text
  - works well on texts like those it was built from, poorly on
    different kinds of text
  - e.g., an English model applied to a file of numbers
- Semi-static (semi-adaptive) modeling
  - a first pass builds a model for the particular file, which is
    transmitted before encoding
  - disadvantage: two passes over the text, and the model itself must
    be sent
- Adaptive modeling
  - starts from a bland initial model and updates it as encoding proceeds
  - the probability of each symbol is based on the text seen so far
Adaptive models
- A zero-order model built character by character
- The zero-frequency problem
  - a character that has not yet occurred (e.g., z) cannot be given
    probability 0
  - e.g., 82 of the 128 ASCII characters have appeared; 46 have not
  - option 1: share one count among the 46 unseen characters:
    Pr = 1/(46 x (768,078+1)), about 25.07 bits
  - option 2: add 1 to every character's count:
    Pr = 1/(768,078+128), about 19.6 bits
  - in practice the choice matters little, because novel characters
    are rare once much text has been seen
- Higher-order models
  - the zero-probability problem becomes more serious
  - first-order model: h occurred 37,526 times, followed by t 1,139
    times, so Pr = 1,139/37,526, about 3%, or 5.05 bits (no better
    than the zero-order estimate here)
  - second-order model: after gh, t has probability 64%, about
    0.636 bits
- Whatever the model, encoding and decoding must use identical
  probability distributions (synchronization)
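Both novel-character costs above can be checked directly. A small sketch using the slide's counts (768,078 characters seen, a 128-character alphabet, 46 characters still unseen):

```python
import math

seen_chars = 768_078          # characters coded so far (slide's example)
alphabet   = 128              # ASCII
unseen     = 46               # characters not yet encountered

# Option 1: share one count among all unseen characters.
p1 = 1 / (unseen * (seen_chars + 1))
# Option 2: add 1 to every character's count.
p2 = 1 / (seen_chars + alphabet)

print(-math.log2(p1))   # ~25.07 bits for a novel character
print(-math.log2(p2))   # ~19.6 bits
```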
Adaptive modeling: pros and cons
- Advantages
  - robust, reliable, flexible
- Disadvantages
  - random access is impossible
  - fragile on communication errors
- Good for general compression utilities, but not good for full-text
  retrieval
Coding
- The task of coding
  - to decide the output representation of a symbol, based on the
    probability distribution supplied by the model
- What makes a good coder
  - effectiveness
    - short codewords for likely symbols
    - long codewords for rare symbols
    - the average code length should approach the entropy
  - speed
    - a coder can trade compression effectiveness for speed
  - in a symbol-wise scheme, the coder determines how much of the
    model's predictive power is actually realized
- Huffman coding: fast
- Arithmetic coding: compression closer to the optimum
Huffman coding
- Fast for both encoding and decoding when used with a static model
- Adaptive Huffman coding
  - needs a lot of memory and is slow
- Good for full-text retrieval applications
  - random access is possible
Example
- a 0000 (0.05)
- b 0001 (0.05)
- c 001  (0.1)
- d 01   (0.2)
- e 10   (0.3)
- f 110  (0.2)
- g 111  (0.1)
- eefggfed
  - 10101101111111101001
- A prefix(-free) code: no codeword is a prefix of another
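A code like the one in this table comes from the usual Huffman construction: repeatedly merge the two least probable subtrees. A minimal sketch; with tied probabilities the bit patterns may differ from the table, but the expected code length (2.6 bits/symbol here) is the same:

```python
import heapq
from itertools import count

def huffman_code(freqs):
    """Build a prefix-free code from {symbol: probability}.

    Returns {symbol: bitstring}. Repeatedly merges the two lightest
    subtrees, prefixing '0' to one side and '1' to the other."""
    tiebreak = count()   # makes heap entries comparable when weights tie
    heap = [(p, next(tiebreak), {s: ''}) for s, p in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)
        p2, _, right = heapq.heappop(heap)
        merged = {s: '0' + c for s, c in left.items()}
        merged.update({s: '1' + c for s, c in right.items()})
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]

code = huffman_code({'a': 0.05, 'b': 0.05, 'c': 0.1, 'd': 0.2,
                     'e': 0.3, 'f': 0.2, 'g': 0.1})
print(code)
print(''.join(code[s] for s in 'eefggfed'))   # 20 bits with the table's
    # lengths; tie-breaking may swap equal-probability symbols
```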
Huffman coding algorithm
- See Fig. 2.6
- Fast for both encoding and decoding
- Adaptive Huffman coding has little advantage over arithmetic coding
  - in particular, random access is still impossible
  - it is complex and uses a lot of memory
- A word-based approach makes Huffman coding very effective
Canonical Huffman coding I
- A static zero-order word-level canonical Huffman code: Table 2.2
- Codewords are assigned in a standard order, by code length
  - codewords of the same length are consecutive binary numbers
  - words of equal code length are assigned in a fixed order
  - encoding needs no tree: the code lengths and the first codeword of
    each length are enough to compute any codeword
  - e.g., in Table 2.2, said is the 10th seven-bit codeword: 1010100
21Canonical Huffman Coding II
- Decoding ??? Codeword? ???? ?? ????? ?? ?? ?
?? - 1100000101 ? 7bits(1010100), 6bits(110001)
7bits?? 12?? ? (with) - decoding tree? ???? ??
Canonical Huffman coding III
- How are the codeword lengths chosen? By the Huffman algorithm
  (Table 2.3)
- A canonical Huffman code need not be one the Huffman algorithm itself
  would produce!
  - i.e., any prefix-free assignment of codewords where the length of
    each code is equal to the depth of that symbol in a Huffman tree
- The Huffman algorithm fixes only the code lengths, not the particular
  bit patterns!
  - with n symbols there are 2^(n-1) equivalent codes (each internal
    node of the tree can swap its branches)
  - the canonical Huffman code is just one of them
Canonical Huffman coding IV
- Advantages
  - no tree needs to be stored: large memory savings
  - decoding without a tree is fast
  - codewords of the same length are consecutive integers, so encoding
    and decoding reduce to additions and comparisons
- Moving from one code length to the next: just add 1!
  - after the last codeword of a length, add 1 and drop low-order bits
    to get the first codeword of the next shorter length
- Example: four 5-bit, one 3-bit, and three 2-bit codewords give
  00000, 00001, 00010, 00011, 001, 01, 10, 11
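The add-1-and-shift rule is short in code. A sketch that assigns canonical codewords from a list of code lengths, reproducing the example above (assumed here: lengths are handed out longest first, as in that example):

```python
def canonical_codes(lengths):
    """Assign canonical codewords given the codeword lengths.

    Codewords of one length are consecutive integers; moving to the
    next shorter length is 'add 1, then shift right'. A sketch of the
    scheme described above, not the book's exact routine."""
    code, prev_len, codes = 0, None, []
    for l in sorted(lengths, reverse=True):   # longest codes first
        if prev_len is not None:
            code = (code + 1) >> (prev_len - l)
        codes.append(format(code, '0%db' % l))
        prev_len = l
    return codes

# Four 5-bit, one 3-bit, three 2-bit codewords, as in the example above:
print(canonical_codes([5, 5, 5, 5, 3, 2, 2, 2]))
# ['00000', '00001', '00010', '00011', '001', '01', '10', '11']
```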
Memory requirements
- An explicit Huffman tree takes about 24n bytes
  - each node holds a count plus two pointers
  - internal nodes + leaf nodes = about 2n nodes
- About 8n bytes suffice instead
  - use a heap stored in a 2n-element array
  - repeatedly combine the two smallest weights
  - the code lengths can be computed in place
Arithmetic coding
- Can compress far better than Huffman coding on skewed distributions
- Achieves coding close to the entropy of whatever model drives it
- A symbol can consume less than 1 bit, so several symbols may share a
  single output bit
- No code tree is stored, so memory use stays low
- For static or semi-static applications, Huffman coding is preferable
  (faster)
- Random access is not possible
Huffman code vs. arithmetic code
A comparison
- Two symbols with probabilities 0.99 and 0.01
  - arithmetic coding spends -log2(0.99) = 0.015 bits on the likely one
  - Huffman coding must spend a whole bit: its inefficiency per symbol
    is at most Pr[s1] + log2(2 log2 e / e) = Pr[s1] + 0.086, where s1
    is the most likely symbol; here up to 1.076 bits
- English text: entropy about 5 bits per character (zero-order
  character level)
  - the most likely character has probability about 0.18, so the
    inefficiency is at most 0.18 + 0.086 = 0.266 bits
  - 0.266/5 bits: about 5.3% inefficiency
- For very skewed distributions or tiny alphabets (e.g., 2 symbols),
  use arithmetic coding
Transmission of output
- As soon as low and high share leading digits, those digits can be sent
- low 0.6334, high 0.6667: both start with 6, so transmit 6 and rescale
  to 0.334 and 0.667
- With 32-bit precision, arbitrarily long inputs can be coded
  incrementally
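A decimal sketch of that rescaling step (real coders do the same in binary with 32-bit integer arithmetic, but the slide's digit-at-a-time version is easier to see):

```python
def emit_common_digits(low, high):
    """Send the leading digits low and high share, rescaling afterwards.

    Decimal illustration of the slide's example, not a full coder."""
    out = []
    while True:
        d_low, d_high = int(low * 10), int(high * 10)
        if d_low != d_high:
            return ''.join(out), low, high
        out.append(str(d_low))
        low = low * 10 - d_low       # drop the transmitted digit
        high = high * 10 - d_high

digits, low, high = emit_common_digits(0.6334, 0.6667)
print(digits, low, high)   # '6', then an interval of ~[0.334, 0.667)
```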
Arithmetic Coding (Static Model)
Decoding (Static Model)
Arithmetic Coding (Adaptive Model)
Decoding (Adaptive Model)
Cumulative count calculation
- Adaptive arithmetic coding needs cumulative counts that change as the
  text is coded
- An implicit heap-shaped index makes lookup and update fast
- To accumulate the count up to index 101101, add the counts stored at
  101101, 1011(00), 101(000), 1(00000): clear the lowest set bit at
  each step
- Updating a count walks the corresponding path in the other direction
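This bit-clearing index is what is now usually called a Fenwick tree. A minimal sketch of both operations (the class name and the demo size are mine):

```python
class FenwickTree:
    """Implicit tree of cumulative symbol counts for adaptive coding.

    Clearing or setting the lowest set bit walks the implicit tree, so
    both operations take O(log n) array accesses."""
    def __init__(self, n):
        self.tree = [0] * (n + 1)

    def add(self, i, delta=1):
        """Increase the count of symbol i (1-based index)."""
        while i < len(self.tree):
            self.tree[i] += delta
            i += i & -i              # move to the next covering node

    def cumulative(self, i):
        """Count of symbols 1..i: e.g., i=101101 sums the nodes
        101101, 101100, 101000, 100000."""
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i              # clear the lowest set bit
        return total

counts = FenwickTree(64)
counts.add(0b101101)
print(counts.cumulative(0b101101))   # 1
```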
Symbolwise models
- Symbolwise model + coder (arithmetic or Huffman)
- Three approaches
  - PPM (Prediction by Partial Matching)
  - DMC (Dynamic Markov Compression)
  - Word-based compression
PPM (Prediction by Partial Matching)
- Finite-context models of characters
- Uses contexts of varying length, matched partially against the
  previous text, rather than one fixed order
- The zero-frequency problem is handled with an escape symbol
  - PPMA: escape method A gives the escape symbol a count of 1
Escape methods
- Escape method A (PPMA): the escape symbol gets a count of 1
- Exclusion: symbols already predicted by a higher-order context are
  excluded from the lower-order counts; e.g., a context seen 201 times,
  22 of them as lies: excluding lies leaves 179, so a symbol seen 19
  times (lie, lier) improves from 19/202 to 19/180
- Method C: escape probability r/(n+r), for total count n and r
  distinct symbols; each symbol gets c_i/(n+r); about 2.5 bits per
  character on Hardy's book
- Method D: escape probability r/(2n)
- Method X: based on t1, the number of symbols of frequency 1:
  escape probability (t1+1)/(n+t1+1)
- PPMZ; Swiss Army Knife Data Compression (SAKDC, 1991) parameterizes
  many of these choices
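The escape formulas are easy to tabulate side by side. A sketch returning the escape probability for methods A, C, D, and X exactly as stated on this slide (not a full PPM coder):

```python
def escape_probability(method, n, r, t1=0):
    """Escape probability under the methods above.

    n = total symbols seen in this context, r = distinct symbols seen,
    t1 = symbols seen exactly once (used by method X)."""
    if method == 'A':                 # escape gets a fixed count of 1
        return 1 / (n + 1)
    if method == 'C':                 # one extra count per distinct symbol
        return r / (n + r)
    if method == 'D':                 # half a count per distinct symbol
        return r / (2 * n)
    if method == 'X':                 # driven by once-seen symbols
        return (t1 + 1) / (n + t1 + 1)
    raise ValueError(method)

# Under method C, a symbol seen c_i times then has probability c_i/(n+r).
print(escape_probability('C', n=201, r=19))
```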
Block-sorting compression
- Published in 1994 (Burrows and Wheeler)
- Transforms the text by sorting, so characters with similar contexts
  are grouped together
- Analogous to image compression, where a discrete cosine or Fourier
  transformation is applied first
- The input is permuted block by block, and the transform is
  reversible!
DMC (Dynamic Markov Compression)
- Finite-state model
- Adaptive model: both the probabilities and the structure of the
  finite-state machine adapt
- Figure 2.13
- Avoids the zero-frequency problem
- Figure 2.14
- Cloning: the heuristic that adapts the structure of a DMC model by
  splitting heavily used states
Word-based compression
- Parse a document into words and nonwords
- Separate textual and non-textual streams, each with its own
  zero-order model
- Suitable for large full-text databases
- Low-frequency words are handled with an escape mechanism
  - e.g., rare digit strings, page numbers
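A minimal sketch of the words/nonwords parse (treating a "word" as an alphanumeric run is my assumption; real systems also bound word length):

```python
import re

def parse_words_nonwords(text):
    """Split text into an alternating sequence of words and nonwords.

    Words are alphanumeric runs; nonwords are the separators between
    them. Each stream then feeds its own zero-order model."""
    return re.findall(r'[A-Za-z0-9]+|[^A-Za-z0-9]+', text)

print(parse_words_nonwords('the page, 42 times.'))
# ['the', ' ', 'page', ', ', '42', ' ', 'times', '.']
```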
Dictionary models
- Principle: replace substrings in a text with codewords
- Adaptive dictionary compression models: LZ77, LZ78
- Approaches
  - LZ77, Gzip
  - LZ78, LZW
Dictionary model: LZ77
- Adaptive dictionary model
- Characteristics
  - easy to implement
  - quick decoding
  - uses a small amount of memory
- Figure 2.16
- Triples <offset, length of phrase, character>
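A minimal LZ77 sketch emitting exactly these triples: a greedy longest match over a sliding window (the window and match-length limits are illustrative, not gzip's):

```python
def lz77_encode(text, window=4096, max_len=255):
    """Greedy LZ77: emit <offset, length, next character> triples."""
    i, out = 0, []
    while i < len(text):
        best_off, best_len = 0, 0
        for j in range(max(0, i - window), i):    # scan the window
            l = 0
            while (l < max_len and i + l < len(text)
                   and text[j + l] == text[i + l]):
                l += 1
            if l > best_len:
                best_off, best_len = i - j, l
        nxt = text[i + best_len] if i + best_len < len(text) else ''
        out.append((best_off, best_len, nxt))
        i += best_len + 1
    return out

def lz77_decode(triples):
    s = []
    for off, length, ch in triples:
        for _ in range(length):                   # copy may overlap itself
            s.append(s[-off])
        s.append(ch)
    return ''.join(s)

t = lz77_encode('abracadabra')
print(t, lz77_decode(t))
```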
Dictionary model: LZ77 (continued)
- Improvements
  - offset: shorter codewords for recent matches
  - match length: variable-length code
  - character: transmitted explicitly only when needed (raw data)
- Figure 2.17
Dictionary model: Gzip
- Based on LZ77
- A hash table locates previous occurrences of strings
- Tuples <offset, matched length>
- Uses Huffman codes
  - semi-static / canonical Huffman code
  - 64K blocks
  - a code table is sent with each block
Dictionary model: LZ78
- Adaptive dictionary model
- Parsed phrases are referenced by number
- Figure 2.18
- Tuples <phrase number, character>
  - phrase 0 is the empty string
- Figure 2.19
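A minimal LZ78 sketch emitting <phrase number, character> pairs, with phrase 0 as the empty string; each output pair defines the next dictionary phrase:

```python
def lz78_encode(text):
    """LZ78: emit <phrase number, character> pairs."""
    dictionary = {'': 0}
    out, phrase = [], ''
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch                      # keep extending the match
        else:
            out.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ''
    if phrase:                                # flush a trailing match
        out.append((dictionary[phrase], ''))
    return out

def lz78_decode(pairs):
    phrases, out = [''], []
    for num, ch in pairs:
        p = phrases[num] + ch
        phrases.append(p)
        out.append(p)
    return ''.join(out)

pairs = lz78_encode('aaabbabaabaaabab')
print(pairs, lz78_decode(pairs))
```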
Dictionary model: LZ78 (continued)
- Characteristics
  - a hash table keeps it simple and fast
  - encoding is fast
  - decoding is slower
  - the trie can consume a lot of memory
Dictionary model: LZW
- Variant of LZ78
- Encodes only the phrase number; there are no explicit characters in
  the output
- Each new phrase is the previous phrase plus the first character of
  the next phrase
- Figure 2.20
- Characteristics
  - good compression
  - easy to implement
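A minimal LZW sketch. Note the classic decoder subtlety: a phrase can be referenced one step before its definition is complete, in which case it must be prev + prev's first character (the two-letter priming alphabet is just for the demo):

```python
def lzw_encode(text, alphabet='ab'):
    """LZW: output phrase numbers only; the dictionary starts with the
    alphabet, and each output implicitly defines <phrase + next char>."""
    dictionary = {c: i for i, c in enumerate(alphabet)}
    out, phrase = [], ''
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch
        else:
            out.append(dictionary[phrase])
            dictionary[phrase + ch] = len(dictionary)
            phrase = ch
    out.append(dictionary[phrase])
    return out

def lzw_decode(codes, alphabet='ab'):
    phrases = list(alphabet)
    prev = phrases[codes[0]]
    out = [prev]
    for code in codes[1:]:
        # A just-defined phrase may arrive before its last character
        # is known: it must then be prev + prev's first character.
        cur = phrases[code] if code < len(phrases) else prev + prev[0]
        phrases.append(prev + cur[0])
        out.append(cur)
        prev = cur
    return ''.join(out)

codes = lzw_encode('aabababaaa')
print(codes, lzw_decode(codes))
```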
Synchronization
- Random access
- Random access is impossible with
  - variable-length codes
  - adaptive models
- Synchronization points
- Synchronization with an adaptive model: break a large file into
  small sections
Creating synchronization points
- The main text consists of a number of documents
- Bit offset
  - store each document's starting position as a bit offset into the
    compressed text
- Byte offset: start each document on a byte boundary and mark its end
  - an end-of-document symbol
  - the length of each document at its beginning
  - or end of file
Self-synchronizing codes
- Not useful for full-text retrieval
- Motivation
  - decoding can start anywhere in the compressed text; after a short
    synchronizing cycle the decoder is back in step
  - useful when part of the text is corrupted or the beginning is
    missing
- Fixed-length codes are not self-synchronizing
- Table 2.3
- Figure 2.22
Performance comparisons
- Considerations
  - compression speed
  - compression performance
  - computing resources
- Table 2.4
Compression performance
- Calgary corpus
  - English text, program source code, bilevel facsimile images
  - geological data, program object code
- Measured in bits per character
- Figure 2.24
Compression speed
- Speed depends on
  - the method of implementation
  - the architecture of the machine
  - the compiler
- Better compression generally means a slower program
- Ziv-Lempel based methods: decoding is faster than encoding
- Table 2.6
Other performance considerations
- Memory usage
  - adaptive models need more memory
  - Ziv-Lempel uses far less than symbolwise models
- Random access
  - requires synchronization points