Title: Set No. 2
2. Compression Programs
- File compression: Gzip, Bzip
- Archivers: ARC, PKZIP, WinRAR
- File systems: NTFS
3. Multimedia
- HDTV (MPEG-4)
- Sound (MP3)
- Images (JPEG)
4. Compression Outline
- Introduction: Lossy vs. Lossless
- Information Theory: Entropy, etc.
- Probability Coding: Huffman and Arithmetic Coding
5. Encoding/Decoding
- We will use "message" in a generic sense to mean the data to be compressed.
[Diagram: Input Message -> Encoder -> Compressed Message -> Decoder -> Output Message; the encoder/decoder pair is called a CODEC]
- The encoder and decoder need to agree on a common compressed format.
6. Lossless vs. Lossy
- Lossless: input message = output message
- Lossy: input message ≠ output message
- Lossy does not necessarily mean loss of quality. In fact, the output could be better than the input:
- Drop random noise in images (dust on the lens)
- Drop background noise in music
- Fix spelling errors in text; put it into better form
- Writing is the art of lossy text compression.
7. Lossless Compression Techniques
- LZW (Lempel-Ziv-Welch) compression
- Build a dictionary
- Replace patterns with an index into the dictionary
- Burrows-Wheeler transform
- Block-sort data to improve compression
- Run-length encoding
- Find and compress repetitive sequences
- Huffman coding
- Use variable-length codes based on symbol frequency
8. How Much Can We Compress?
- For lossless compression, assuming all input messages are valid: if even one string is compressed, some other string must expand (see the counting sketch below).
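A minimal sketch of the standard counting argument behind this claim (not spelled out on the slide):

    \[
      \#\{\text{bit strings of length } n\} = 2^n,
      \qquad
      \#\{\text{bit strings of length} < n\} = \sum_{k=0}^{n-1} 2^k = 2^n - 1 .
    \]

Since 2^n > 2^n - 1, no uniquely decodable encoder can map every length-n input to a strictly shorter output; once the shorter outputs are used up, at least one input must map to an output that is no shorter, i.e. it expands.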
9. Model vs. Coder
- To compress we need a bias on the probability of messages. The model determines this bias.
- Example models:
- Simple: character counts, repeated strings
- Complex: models of a human face
[Diagram (encoder): Messages feed the Model, which supplies probabilities (Probs.) to the Coder, which outputs Bits]
10. Quality of Compression
- Runtime vs. compression ratio vs. generality
- Several standard corpora are used to compare algorithms
- Calgary Corpus: 2 books, 5 papers, 1 bibliography, 1 collection of news articles, 3 programs, 1 terminal session, 2 object files, 1 set of geophysical data, 1 black-and-white bitmap image
- The Archive Comparison Test maintains a comparison of just about all publicly available algorithms
11. Comparison of Algorithms
12. Entropy
- Entropy is the measurement of the average uncertainty of information:
  H(X) = sum over i of P(Xi) * lg(1 / P(Xi))
- H: entropy
- P: probability
- X: random variable with a discrete set of possible outcomes (X0, X1, X2, ..., Xn-1), where n is the total number of possibilities
13. Entropy
- Entropy is greatest when the probabilities of the outcomes are equal.
- Let's consider our fair coin experiment again:
- The entropy H = 1/2 lg 2 + 1/2 lg 2 = 1
- Since each outcome has self-information of 1, the average over the 2 outcomes is (1 + 1)/2 = 1
- Consider a biased coin, P(H) = 0.98, P(T) = 0.02:
- H = 0.98 lg(1/0.98) + 0.02 lg(1/0.02) ≈ 0.98 × 0.029 + 0.02 × 5.643 ≈ 0.0285 + 0.1129 = 0.1414
14. Entropy
- The estimate depends on our assumptions about the structure (read: pattern) of the source of information.
- Consider the following sequence:
- 1 2 3 2 3 4 5 4 5 6 7 8 9 8 9 10
- Obtaining the probabilities from the sequence:
- 16 digits; 1, 6, 7 and 10 each appear once, the rest appear twice
- The entropy H = 3.25 bits
- Since there are 16 symbols, we theoretically need 16 × 3.25 = 52 bits to transmit the information.
15. A Brief Introduction to Information Theory
- Consider the following sequence:
- 1 2 1 2 4 4 1 2 4 4 4 4 4 4 1 2 4 4 4 4 4 4
- Obtaining the probabilities from the sequence:
- 1 and 2 each appear four times (4/22 each); 4 appears fourteen times (14/22)
- The entropy H = 0.447 + 0.447 + 0.415 = 1.309 bits
- Since there are 22 symbols, we theoretically need 22 × 1.309 = 28.798 (29) bits to transmit the information.
- However, consider the pair symbols "1 2" and "4 4":
- "1 2" appears 4/11 of the time and "4 4" appears 7/11 of the time
- H = 0.530 + 0.415 = 0.945 bits
- 11 × 0.945 = 10.395 (11) bits to transmit the information (only about 38% of the 29 bits!)
- We might be able to find patterns with even less entropy (the sketch below checks both estimates).
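A short Python check of both estimates (my own illustration, not part of the slides): it computes the empirical entropy of a sequence and reproduces the 3.25 bits/digit figure from slide 14 and the 1.309 bits/digit and 0.945 bits/pair figures above.

    from collections import Counter
    from math import log2

    def empirical_entropy(symbols):
        """Entropy of the empirical distribution of `symbols`, in bits per symbol."""
        counts = Counter(symbols)
        total = len(symbols)
        return sum((c / total) * log2(total / c) for c in counts.values())

    # Slide 14: 16 digits, treated one at a time
    seq1 = [1, 2, 3, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9, 8, 9, 10]
    print(empirical_entropy(seq1))            # 3.25

    # Slide 15: 22 digits, first per digit, then grouped into pairs
    seq2 = [1, 2, 1, 2, 4, 4, 1, 2, 4, 4, 4, 4, 4, 4, 1, 2, 4, 4, 4, 4, 4, 4]
    print(empirical_entropy(seq2))            # about 1.309 bits/digit
    pairs = [tuple(seq2[i:i + 2]) for i in range(0, len(seq2), 2)]
    print(empirical_entropy(pairs))           # about 0.946 bits/pair (the slide rounds to 0.945)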
16. Revisiting the Entropy
- Entropy: a measure of information content
- Entropy of the English language: how much information does each character in typical English text contain?
17. Entropy of the English Language
- How can we measure the information per character?
- ASCII code = 7 bits
- Entropy = 4.5 bits (based on character probabilities)
- Huffman codes (average) = 4.7 bits
- Unix Compress = 3.5 bits
- Gzip = 2.5 bits
- BOA = 1.9 bits (currently close to the best text compressor)
- The entropy of English must therefore be less than 1.9 bits per character.
18. Shannon's Experiment
- Shannon asked humans to predict the next character given all of the previous text, and used these as conditional probabilities to estimate the entropy of the English language.
- He recorded the number of guesses required to get the right answer.
- From the experiment we can estimate the entropy of English.
19. Data Compression Model
Input data -> Reduce data redundancy -> Reduction of entropy -> Entropy encoding -> Compressed data
20. Coding
- How do we use the probabilities to code messages?
- Prefix codes and their relationship to entropy
- Huffman codes
- Arithmetic codes
- Implicit probability codes
21. Assumptions
- Communication (or a file) is broken up into pieces called messages.
- Adjacent messages might be of different types and come from different probability distributions.
- We will consider two types of coding:
- Discrete: each message is a fixed set of bits (Huffman coding, Shannon-Fano coding)
- Blended: bits can be shared among messages (arithmetic coding)
22. Uniquely Decodable Codes
- A variable-length code assigns a bit string (codeword) of variable length to every message value.
- e.g. a = 1, b = 01, c = 101, d = 011
- What if you receive the sequence of bits 1011?
- Is it aba, ca, or ad?
- A uniquely decodable code is a variable-length code in which every bit string can be uniquely decomposed into codewords.
23. Prefix Codes
- A prefix code is a variable-length code in which no codeword is a prefix of another codeword.
- e.g. a = 0, b = 110, c = 111, d = 10
- It can be viewed as a binary tree with message values at the leaves and 0s and 1s on the edges.
[Binary tree for the example: 0 -> a; 10 -> d; 110 -> b; 111 -> c]
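As a concrete sketch (my own code, reusing the slide's example code a = 0, b = 110, c = 111, d = 10), decoding a prefix code is just a walk down the binary tree that restarts at the root whenever a leaf is reached:

    def decode_prefix(bits, code):
        """Decode a bit string using a prefix code given as {symbol: codeword}."""
        # Build the binary tree: each internal node is a dict keyed by '0'/'1'.
        root = {}
        for symbol, word in code.items():
            node = root
            for bit in word[:-1]:
                node = node.setdefault(bit, {})
            node[word[-1]] = symbol            # leaf stores the symbol

        out, node = [], root
        for bit in bits:
            node = node[bit]
            if not isinstance(node, dict):     # reached a leaf
                out.append(node)
                node = root
        return "".join(out)

    code = {"a": "0", "b": "110", "c": "111", "d": "10"}
    print(decode_prefix("0110100111", code))   # abdac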
24. Huffman Coding
- Binary trees for compression
25. Huffman Code
- Approach
- Variable-length encoding of symbols
- Exploit the statistical frequency of symbols
- Efficient when symbol probabilities vary widely
- Principle
- Use fewer bits to represent frequent symbols
- Use more bits to represent infrequent symbols
[Example stream: A A B A A A A B, where A is frequent and B infrequent]
26. Huffman Codes
- Invented by Huffman as a class assignment in 1950.
- Used in many, if not most, compression algorithms: gzip, bzip, JPEG (as an option), fax compression, ...
- Properties:
- Generates optimal prefix codes
- Cheap to generate codes
- Cheap to encode and decode
- The average code length equals the entropy (la = H) if the probabilities are powers of 2
27. Huffman Code Example
- Expected code length:
- Original: 1/8 × 2 + 1/4 × 2 + 1/2 × 2 + 1/8 × 2 = 2 bits/symbol
- Huffman: 1/8 × 3 + 1/4 × 2 + 1/2 × 1 + 1/8 × 3 = 1.75 bits/symbol
28. Huffman Codes
- Huffman algorithm:
- Start with a forest of trees, each consisting of a single vertex corresponding to a message s and with weight p(s)
- Repeat:
- Select the two trees with minimum root weights p1 and p2
- Join them into a single tree by adding a root with weight p1 + p2 (see the sketch below)
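A compact Python sketch of this algorithm (illustration only; heap ties are broken arbitrarily, so the exact codewords can differ from the worked example on the next slide even though the codeword lengths are the same):

    import heapq

    def huffman_code(probs):
        """Build a Huffman code from {symbol: probability}; returns {symbol: codeword}."""
        # Forest of single-vertex trees; each heap entry is (weight, tie-break, symbols).
        heap = [(p, i, (s,)) for i, (s, p) in enumerate(probs.items())]
        heapq.heapify(heap)
        codes = {s: "" for s in probs}
        while len(heap) > 1:
            p1, _, t1 = heapq.heappop(heap)     # the two minimum-weight roots
            p2, i, t2 = heapq.heappop(heap)
            for s in t1:                        # one subtree gets a leading 0 ...
                codes[s] = "0" + codes[s]
            for s in t2:                        # ... the other a leading 1
                codes[s] = "1" + codes[s]
            heapq.heappush(heap, (p1 + p2, i, t1 + t2))
        return codes

    print(huffman_code({"a": .1, "b": .2, "c": .2, "d": .5}))
    # codeword lengths match the worked example: a and b get 3 bits, c gets 2, d gets 1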
29. Example
- p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5
- Step 1: join a(.1) and b(.2) into a tree of weight .3
- Step 2: join (.3) and c(.2) into a tree of weight .5
- Step 3: join (.5) and d(.5) into the final tree of weight 1.0
- Resulting codes: a = 000, b = 001, c = 01, d = 1
30. Encoding and Decoding
- Encoding: start at the leaf of the Huffman tree for the message and follow the path to the root. Reverse the order of the bits and send.
- Decoding: start at the root of the Huffman tree and take the branch for each bit received. When a leaf is reached, output its message and return to the root.
[Huffman tree from the previous example: d = 1, c = 01, b = 001, a = 000]
31. Adaptive Huffman Codes
- Huffman codes can be made adaptive without completely recalculating the tree on each step.
- They can account for changing probabilities.
- Small changes in probability typically cause only small changes to the Huffman tree.
- Used frequently in practice.
32. Huffman Coding Disadvantages
- Each code must use an integral number of bits.
- If the entropy of a given character is 2.2 bits, the Huffman code for that character must be either 2 or 3 bits, not 2.2.
33. Arithmetic Coding
- Huffman codes have to be an integral number of bits long, while the entropy of a symbol is almost always a fractional number, so the theoretically possible compressed size cannot be achieved.
- For example, if a statistical method assigns a 90% probability to a given character, the optimal code size would be 0.15 bits.
34. Arithmetic Coding
- Arithmetic coding bypasses the idea of replacing
an input symbol with a specific code. It replaces
a stream of input symbols with a single
floating-point output number. - Arithmetic coding is especially useful when
dealing with sources with small alphabets, such
as binary sources, and alphabets with highly
skewed probabilities.
35. Arithmetic Coding Example (1)

Character   Probability   Range
(space)     1/10          [0.0, 0.1)
A           1/10          [0.1, 0.2)
B           1/10          [0.2, 0.3)
E           1/10          [0.3, 0.4)
G           1/10          [0.4, 0.5)
I           1/10          [0.5, 0.6)
L           2/10          [0.6, 0.8)
S           1/10          [0.8, 0.9)
T           1/10          [0.9, 1.0)

Suppose that we want to encode the message BILL GATES.
36. Arithmetic Coding Example (1)
[Diagram: encoding BILL GATES by repeatedly narrowing the interval: [0.2, 0.3) after B, [0.25, 0.26) after I, [0.256, 0.258) after L, [0.2572, 0.2576) after the second L, [0.2572, 0.25724) after the space, and so on down to the final interval]
37. Arithmetic Coding Example (1)

New character   Low value      High value
B               0.2            0.3
I               0.25           0.26
L               0.256          0.258
L               0.2572         0.2576
(space)         0.25720        0.25724
G               0.257216       0.257220
A               0.2572164      0.2572168
T               0.25721676     0.25721680
E               0.257216772    0.257216776
S               0.2572167752   0.2572167756
38. Arithmetic Coding Example (1)
- The final value, called a tag, 0.2572167752, uniquely encodes the message BILL GATES.
- Any value between 0.2572167752 and 0.2572167756 can serve as a tag for the encoded message and can be uniquely decoded.
39. Arithmetic Coding
- Encoding algorithm for arithmetic coding:
- low = 0.0; high = 1.0
- while not EOF do
-   range = high - low; read(c)
-   high = low + range × high_range(c)
-   low  = low + range × low_range(c)
- end do
- output(low)
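A direct Python transcription of this pseudocode (floating point, for illustration only; a practical coder uses the integer rescaling discussed later). The low_range/high_range values are the cumulative bounds from the BILL GATES table:

    def arithmetic_encode(message, ranges):
        """Float arithmetic coder: `ranges` maps symbol -> (low_range, high_range)."""
        low, high = 0.0, 1.0
        for c in message:
            width = high - low
            low_c, high_c = ranges[c]
            high = low + width * high_c        # shrink the interval to the
            low = low + width * low_c          # sub-range assigned to c
        return low                             # any value in [low, high) works

    ranges = {" ": (0.0, 0.1), "A": (0.1, 0.2), "B": (0.2, 0.3), "E": (0.3, 0.4),
              "G": (0.4, 0.5), "I": (0.5, 0.6), "L": (0.6, 0.8), "S": (0.8, 0.9),
              "T": (0.9, 1.0)}
    print(arithmetic_encode("BILL GATES", ranges))   # about 0.2572167752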
40. Arithmetic Coding
- Decoding is the inverse process.
- Since 0.2572167752 falls between 0.2 and 0.3, the first character must be B.
- Remove the effect of B from 0.2572167752 by first subtracting the low value of B, 0.2, giving 0.0572167752.
- Then divide by the width of B's range, 0.1. This gives a value of 0.572167752.
41. Arithmetic Coding
- Then determine where that value lands, which is in the range of the next letter, I.
- The process repeats until the value reaches 0 or the known length of the message is reached.
42.

r              c         Low   High   Range
0.2572167752   B         0.2   0.3    0.1
0.572167752    I         0.5   0.6    0.1
0.72167752     L         0.6   0.8    0.2
0.6083876      L         0.6   0.8    0.2
0.041938       (space)   0.0   0.1    0.1
0.41938        G         0.4   0.5    0.1
0.1938         A         0.1   0.2    0.1
0.938          T         0.9   1.0    0.1
0.38           E         0.3   0.4    0.1
0.8            S         0.8   0.9    0.1
0.0
43. Arithmetic Coding
- Decoding algorithm:
- r = input_code
- repeat
-   search for c such that r falls in its range
-   output(c)
-   r = r - low_range(c)
-   r = r / (high_range(c) - low_range(c))
- until r equals 0
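The matching decoder in Python (again floating point only, and stopping after a known message length, because the "until r equals 0" test is fragile under floating-point rounding). Slide 38 notes that any value in the final interval is a valid tag, so the midpoint 0.2572167754 is used here for numerical safety:

    def arithmetic_decode(r, ranges, length):
        """Invert the float arithmetic coder: recover `length` symbols from the tag r."""
        out = []
        for _ in range(length):
            # find the symbol whose range contains r
            c = next(s for s, (lo, hi) in ranges.items() if lo <= r < hi)
            out.append(c)
            lo, hi = ranges[c]
            r = (r - lo) / (hi - lo)           # remove the effect of c
        return "".join(out)

    ranges = {" ": (0.0, 0.1), "A": (0.1, 0.2), "B": (0.2, 0.3), "E": (0.3, 0.4),
              "G": (0.4, 0.5), "I": (0.5, 0.6), "L": (0.6, 0.8), "S": (0.8, 0.9),
              "T": (0.9, 1.0)}
    print(arithmetic_decode(0.2572167754, ranges, 10))   # BILL GATES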
44. Arithmetic Coding Example (2)
Suppose that we want to encode the message 1 3 2 1.
45. Arithmetic Coding Example (2)
[Diagram: encoding 1 3 2 1 with cumulative boundaries 0.8, 0.82 and 1.0 for the symbols 1, 2 and 3; the interval narrows from [0.0, 1.0) to [0.0, 0.8), then [0.656, 0.8), then [0.7712, 0.77408), and finally [0.7712, 0.773504)]
46. Arithmetic Coding Example (2)
Encoding:

New character   Low value   High value
(start)         0.0         1.0
1               0.0         0.8
3               0.656       0.800
2               0.7712      0.77408
1               0.7712      0.773504
47. Arithmetic Coding Example (2)
Decoding
48. Arithmetic Coding
- In summary, the encoding process is simply one of narrowing the range of possible numbers with every new symbol.
- The new range is proportional to the predefined probability attached to that symbol.
- Decoding is the inverse procedure, in which the range is expanded in proportion to the probability of each symbol as it is extracted.
49. Arithmetic Coding
- The coding rate theoretically approaches the high-order entropy.
- Not as popular as Huffman coding because multiplications and divisions are needed.
- Average bits/byte on 14 files (program, object, text, etc.):

Huffman   LZW    LZ77/LZ78   Arithmetic
4.99      4.71   2.95        2.48
50. Generating a Binary Code for Arithmetic Coding
- Problem:
- The binary representation of some of the generated floating-point values (tags) would be infinitely long.
- We need increasing precision as the length of the sequence increases.
- Solution:
- Synchronized rescaling and incremental encoding.
51. Generating a Binary Code for Arithmetic Coding
- If the upper bound and the lower bound of the interval are both less than 0.5, rescale the interval and transmit a 0 bit.
- If the upper bound and the lower bound of the interval are both greater than 0.5, rescale the interval and transmit a 1 bit.
- Mapping rules: in the first case x -> 2x, in the second case x -> 2(x - 0.5), applied to both bounds (see the sketch below).
52. Arithmetic Coding Example (2)
[Diagram: encoding 1 3 2 1 with rescaling. Each time the interval falls entirely within [0, 0.5) or [0.5, 1) a bit is emitted and the interval is doubled; the interval passes through [0.656, 0.8), [0.312, 0.6), [0.5424, 0.54816), [0.0848, 0.09632), [0.1696, 0.19264), [0.3392, 0.38528), [0.6784, 0.77056), [0.3568, 0.54112) and ends at [0.3568, 0.504256)]
53. Encoding
Any binary value between the lower and upper bounds can be sent.

54. Decoding
- Decoding of the bit stream starts with 1100011.
- The number of bits needed to distinguish the different symbols is ... bits.
55. Revisiting Arithmetic Coding
- An example: consider sending a message of length 1000 in which each symbol has probability .999.
- Self-information of each symbol: -lg(.999) ≈ .00144 bits
- Sum of the self-information: ≈ 1.4 bits
- Huffman coding will take at least 1000 bits.
- Arithmetic coding: about 3 bits!
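These numbers can be checked directly (my own two-line illustration):

    from math import log2

    p, n = 0.999, 1000
    bits_per_symbol = -log2(p)        # self-information of one symbol: about 0.00144 bits
    print(bits_per_symbol * n)        # about 1.44 bits for the whole 1000-symbol message
    # Huffman must spend at least 1 bit per symbol, i.e. at least 1000 bits;
    # an arithmetic coder can get within a few bits of the 1.44-bit bound.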
56. Arithmetic Coding: Introduction
- Allows blending of bits in a message sequence.
- The total number of bits required can be bounded based on the sum of self-information.
- Used in PPM, JPEG/MPEG (as an option), DMM.
- More expensive than Huffman coding, but an integer implementation is not too bad.
57. Arithmetic Coding (message intervals)
- Assign each probability distribution to an interval range from 0 (inclusive) to 1 (exclusive).
- e.g. f(a) = .0, f(b) = .2, f(c) = .7
- The interval for a particular message will be called the message interval (e.g. for b the interval is [.2, .7)).
58. Arithmetic Coding (sequence intervals)
- To code a message sequence, each message narrows the interval by a factor of p_i.
- Final interval size: the product of the p_i.
- The interval for a message sequence will be called the sequence interval (the recurrence is written out below).
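Written out (a sketch of the standard recurrence, with f(m) denoting the start of message m's interval and p(m) its probability; the slide's own formulas did not survive the conversion):

    \[
    \begin{aligned}
      l_0 &= 0, \qquad s_0 = 1,\\
      l_i &= l_{i-1} + s_{i-1}\, f(m_i), \qquad s_i = s_{i-1}\, p(m_i),\\
      \text{sequence interval} &= [\, l_n,\; l_n + s_n \,), \qquad s_n = \prod_{i=1}^{n} p(m_i).
    \end{aligned}
    \]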
59. Arithmetic Coding: Encoding Example
- Coding the message sequence bac (with p(a) = .2, p(b) = .5, p(c) = .3 as above):
- Start with [0, 1); after b the interval is [.2, .7); after a it is [.2, .3); after c it is [.27, .3).
- The final interval is [.27, .3).
60. Vector Quantization
- How do we compress a color image (r, g, b)?
- Find k representative points for all colors.
- For every pixel, output the nearest representative (see the sketch below).
- If the points are clustered around the representatives, the residuals are small and hence probability coding will work well.
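A minimal sketch of the idea in Python (my own illustration; the k representatives are simply sampled at random rather than produced by a clustering step such as k-means):

    import random

    def vector_quantize(pixels, k=16):
        """Replace each (r, g, b) pixel by the index of its nearest of k representatives."""
        codebook = random.sample(pixels, k)            # crude codebook: k random pixels
        def nearest(p):
            return min(range(k),
                       key=lambda i: sum((a - b) ** 2 for a, b in zip(p, codebook[i])))
        return codebook, [nearest(p) for p in pixels]  # the indices are what gets coded

    # toy "image": 1000 random colours
    pixels = [tuple(random.randrange(256) for _ in range(3)) for _ in range(1000)]
    codebook, indices = vector_quantize(pixels)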
61. Transform Coding
- Transform the input into another space.
- One form of transform is to choose a set of basis functions.
- JPEG and MPEG both use this idea.
62. Other Transform Codes
- Wavelets
- Fractal-based compression
- Based on the idea of fixed points of functions.