Title: Set No. 2
2. Compression Programs
- File compression: Gzip, Bzip
- Archivers: ARC, PKZIP, WinRAR
- File systems: NTFS
3. Multimedia
- HDTV (MPEG-4)
- Sound (MP3)
- Images (JPEG)
4. Compression Outline
- Introduction: Lossy vs. Lossless
- Information Theory: Entropy, etc.
- Probability Coding: Huffman and Arithmetic Coding
5. Encoding/Decoding
- We will use "message" in a generic sense to mean the data to be compressed.
[Diagram: Input Message -> Encoder -> Compressed Message -> Decoder -> Output Message; the encoder/decoder pair is called a CODEC]
- The encoder and decoder need to agree on a common compressed format.
6. Lossless vs. Lossy
- Lossless: input message = output message
- Lossy: input message ≠ output message
- Lossy does not necessarily mean loss of quality. In fact, the output could be better than the input:
- Drop random noise in images (dust on the lens)
- Drop background noise in music
- Fix spelling errors in text; put it into better form
- Writing is the art of lossy text compression.
7. Lossless Compression Techniques
- LZW (Lempel-Ziv-Welch) compression
- Build a dictionary
- Replace patterns with an index into the dictionary
- Burrows-Wheeler transform
- Block-sort data to improve compression
- Run-length encoding
- Find and compress repetitive sequences
- Huffman coding
- Use variable-length codes based on symbol frequency
8. How Much Can We Compress?
- For lossless compression, assuming all input messages are valid: if even one string is compressed, some other string must expand (see the counting sketch below).
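A minimal sketch of the standard counting argument behind this claim (not spelled out on the slide):

    \[
      \#\{\text{bit strings of length } n\} = 2^n,
      \qquad
      \#\{\text{bit strings of length} < n\} = \sum_{k=0}^{n-1} 2^k = 2^n - 1 .
    \]

Since 2^n > 2^n - 1, no uniquely decodable encoder can map every length-n input to a strictly shorter output; once the shorter outputs are used up, at least one input must map to an output that is no shorter, i.e. it expands.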
9. Model vs. Coder
- To compress we need a bias on the probability of messages. The model determines this bias.
- Example models:
- Simple: character counts, repeated strings
- Complex: models of a human face
[Diagram (encoder): Messages feed the Model, which supplies probabilities (Probs.) to the Coder, which outputs Bits]
10. Quality of Compression
- Runtime vs. compression ratio vs. generality
- Several standard corpora are used to compare algorithms
- Calgary Corpus: 2 books, 5 papers, 1 bibliography, 1 collection of news articles, 3 programs, 1 terminal session, 2 object files, 1 set of geophysical data, 1 black-and-white bitmap image
- The Archive Comparison Test maintains a comparison of just about all publicly available algorithms
11. Comparison of Algorithms
12. Entropy
- Entropy is the measurement of the average uncertainty of information:
  H(X) = sum over i of P(Xi) * lg(1 / P(Xi))
- H: entropy
- P: probability
- X: random variable with a discrete set of possible outcomes (X0, X1, X2, ..., Xn-1), where n is the total number of possibilities
13. Entropy
- Entropy is greatest when the probabilities of the outcomes are equal.
- Let's consider our fair coin experiment again:
- The entropy H = 1/2 lg 2 + 1/2 lg 2 = 1
- Since each outcome has self-information of 1, the average over the 2 outcomes is (1 + 1)/2 = 1
- Consider a biased coin, P(H) = 0.98, P(T) = 0.02:
- H = 0.98 lg(1/0.98) + 0.02 lg(1/0.02) ≈ 0.98 × 0.029 + 0.02 × 5.643 ≈ 0.0285 + 0.1129 = 0.1414
14. Entropy
- The estimate depends on our assumptions about the structure (read: pattern) of the source of information.
- Consider the following sequence:
- 1 2 3 2 3 4 5 4 5 6 7 8 9 8 9 10
- Obtaining the probabilities from the sequence:
- 16 digits; 1, 6, 7 and 10 each appear once, the rest appear twice
- The entropy H = 3.25 bits
- Since there are 16 symbols, we theoretically need 16 × 3.25 = 52 bits to transmit the information.
15. A Brief Introduction to Information Theory
- Consider the following sequence:
- 1 2 1 2 4 4 1 2 4 4 4 4 4 4 1 2 4 4 4 4 4 4
- Obtaining the probabilities from the sequence:
- 1 and 2 each appear four times (4/22 each); 4 appears fourteen times (14/22)
- The entropy H = 0.447 + 0.447 + 0.415 = 1.309 bits
- Since there are 22 symbols, we theoretically need 22 × 1.309 = 28.798 (29) bits to transmit the information.
- However, consider the pair symbols "1 2" and "4 4":
- "1 2" appears 4/11 of the time and "4 4" appears 7/11 of the time
- H = 0.530 + 0.415 = 0.945 bits
- 11 × 0.945 = 10.395 (11) bits to transmit the information (only about 38% of the 29 bits!)
- We might be able to find patterns with even less entropy (the sketch below checks both estimates).
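A short Python check of both estimates (my own illustration, not part of the slides): it computes the empirical entropy of a sequence and reproduces the 3.25 bits/digit figure from slide 14 and the 1.309 bits/digit and 0.945 bits/pair figures above.

    from collections import Counter
    from math import log2

    def empirical_entropy(symbols):
        """Entropy of the empirical distribution of `symbols`, in bits per symbol."""
        counts = Counter(symbols)
        total = len(symbols)
        return sum((c / total) * log2(total / c) for c in counts.values())

    # Slide 14: 16 digits, treated one at a time
    seq1 = [1, 2, 3, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9, 8, 9, 10]
    print(empirical_entropy(seq1))            # 3.25

    # Slide 15: 22 digits, first per digit, then grouped into pairs
    seq2 = [1, 2, 1, 2, 4, 4, 1, 2, 4, 4, 4, 4, 4, 4, 1, 2, 4, 4, 4, 4, 4, 4]
    print(empirical_entropy(seq2))            # about 1.309 bits/digit
    pairs = [tuple(seq2[i:i + 2]) for i in range(0, len(seq2), 2)]
    print(empirical_entropy(pairs))           # about 0.946 bits/pair (the slide rounds to 0.945)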
16. Revisiting the Entropy
- Entropy: a measure of information content
- Entropy of the English language: how much information does each character in typical English text contain?
17. Entropy of the English Language
- How can we measure the information per character?
- ASCII code = 7 bits
- Entropy = 4.5 bits (based on character probabilities)
- Huffman codes (average) = 4.7 bits
- Unix Compress = 3.5 bits
- Gzip = 2.5 bits
- BOA = 1.9 bits (currently close to the best text compressor)
- The entropy of English must therefore be less than 1.9 bits per character.
18. Shannon's Experiment
- Shannon asked humans to predict the next character given all of the previous text, and used these as conditional probabilities to estimate the entropy of the English language.
- He recorded the number of guesses required to get the right answer.
- From the experiment we can estimate the entropy of English.
19. Data Compression Model
Input data -> Reduce data redundancy -> Reduction of entropy -> Entropy encoding -> Compressed data
20. Coding
- How do we use the probabilities to code messages?
- Prefix codes and their relationship to entropy
- Huffman codes
- Arithmetic codes
- Implicit probability codes
21. Assumptions
- Communication (or a file) is broken up into pieces called messages.
- Adjacent messages might be of different types and come from different probability distributions.
- We will consider two types of coding:
- Discrete: each message is a fixed set of bits (Huffman coding, Shannon-Fano coding)
- Blended: bits can be shared among messages (arithmetic coding)
22. Uniquely Decodable Codes
- A variable-length code assigns a bit string (codeword) of variable length to every message value.
- e.g. a = 1, b = 01, c = 101, d = 011
- What if you receive the sequence of bits 1011?
- Is it aba, ca, or ad?
- A uniquely decodable code is a variable-length code in which every bit string can be uniquely decomposed into codewords.
23. Prefix Codes
- A prefix code is a variable-length code in which no codeword is a prefix of another codeword.
- e.g. a = 0, b = 110, c = 111, d = 10
- It can be viewed as a binary tree with message values at the leaves and 0s and 1s on the edges.
[Binary tree for the example: 0 -> a; 10 -> d; 110 -> b; 111 -> c]
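As a concrete sketch (my own code, reusing the slide's example code a = 0, b = 110, c = 111, d = 10), decoding a prefix code is just a walk down the binary tree that restarts at the root whenever a leaf is reached:

    def decode_prefix(bits, code):
        """Decode a bit string using a prefix code given as {symbol: codeword}."""
        # Build the binary tree: each internal node is a dict keyed by '0'/'1'.
        root = {}
        for symbol, word in code.items():
            node = root
            for bit in word[:-1]:
                node = node.setdefault(bit, {})
            node[word[-1]] = symbol            # leaf stores the symbol

        out, node = [], root
        for bit in bits:
            node = node[bit]
            if not isinstance(node, dict):     # reached a leaf
                out.append(node)
                node = root
        return "".join(out)

    code = {"a": "0", "b": "110", "c": "111", "d": "10"}
    print(decode_prefix("0110100111", code))   # abdac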
24. Huffman Coding
- Binary trees for compression
25. Huffman Code
- Approach
- Variable-length encoding of symbols
- Exploit the statistical frequency of symbols
- Efficient when symbol probabilities vary widely
- Principle
- Use fewer bits to represent frequent symbols
- Use more bits to represent infrequent symbols
[Example stream: A A B A A A A B, where A is frequent and B infrequent]
26. Huffman Codes
- Invented by Huffman as a class assignment in 1950.
- Used in many, if not most, compression algorithms: gzip, bzip, JPEG (as an option), fax compression, ...
- Properties:
- Generates optimal prefix codes
- Cheap to generate codes
- Cheap to encode and decode
- The average code length equals the entropy (la = H) if the probabilities are powers of 2
27. Huffman Code Example
- Expected code length:
- Original: 1/8 × 2 + 1/4 × 2 + 1/2 × 2 + 1/8 × 2 = 2 bits/symbol
- Huffman: 1/8 × 3 + 1/4 × 2 + 1/2 × 1 + 1/8 × 3 = 1.75 bits/symbol
28. Huffman Codes
- Huffman algorithm:
- Start with a forest of trees, each consisting of a single vertex corresponding to a message s and with weight p(s)
- Repeat:
- Select the two trees with minimum root weights p1 and p2
- Join them into a single tree by adding a root with weight p1 + p2 (see the sketch below)
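A compact Python sketch of this algorithm (illustration only; heap ties are broken arbitrarily, so the exact codewords can differ from the worked example on the next slide even though the codeword lengths are the same):

    import heapq

    def huffman_code(probs):
        """Build a Huffman code from {symbol: probability}; returns {symbol: codeword}."""
        # Forest of single-vertex trees; each heap entry is (weight, tie-break, symbols).
        heap = [(p, i, (s,)) for i, (s, p) in enumerate(probs.items())]
        heapq.heapify(heap)
        codes = {s: "" for s in probs}
        while len(heap) > 1:
            p1, _, t1 = heapq.heappop(heap)     # the two minimum-weight roots
            p2, i, t2 = heapq.heappop(heap)
            for s in t1:                        # one subtree gets a leading 0 ...
                codes[s] = "0" + codes[s]
            for s in t2:                        # ... the other a leading 1
                codes[s] = "1" + codes[s]
            heapq.heappush(heap, (p1 + p2, i, t1 + t2))
        return codes

    print(huffman_code({"a": .1, "b": .2, "c": .2, "d": .5}))
    # codeword lengths match the worked example: a and b get 3 bits, c gets 2, d gets 1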
29. Example
- p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5
- Step 1: join a(.1) and b(.2) into a tree of weight .3
- Step 2: join (.3) and c(.2) into a tree of weight .5
- Step 3: join (.5) and d(.5) into the final tree of weight 1.0
- Resulting codes: a = 000, b = 001, c = 01, d = 1
30. Encoding and Decoding
- Encoding: start at the leaf of the Huffman tree for the message and follow the path to the root. Reverse the order of the bits and send.
- Decoding: start at the root of the Huffman tree and take the branch for each bit received. When a leaf is reached, output its message and return to the root.
[Huffman tree from the previous example: d = 1, c = 01, b = 001, a = 000]
31. Adaptive Huffman Codes
- Huffman codes can be made adaptive without completely recalculating the tree on each step.
- They can account for changing probabilities.
- Small changes in probability typically cause only small changes to the Huffman tree.
- Used frequently in practice.
32. Huffman Coding Disadvantages
- Each code must use an integral number of bits.
- If the entropy of a given character is 2.2 bits, the Huffman code for that character must be either 2 or 3 bits, not 2.2.
33. Arithmetic Coding
- Huffman codes have to be an integral number of bits long, while the entropy of a symbol is almost always a fractional number, so the theoretically possible compressed size cannot be achieved.
- For example, if a statistical method assigns a 90% probability to a given character, the optimal code size would be 0.15 bits.
34. Arithmetic Coding
- Arithmetic coding bypasses the idea of replacing
an input symbol with a specific code. It replaces
a stream of input symbols with a single
floating-point output number. - Arithmetic coding is especially useful when
dealing with sources with small alphabets, such
as binary sources, and alphabets with highly
skewed probabilities.
35. Arithmetic Coding Example (1)

Character   Probability   Range
(space)     1/10          [0.0, 0.1)
A           1/10          [0.1, 0.2)
B           1/10          [0.2, 0.3)
E           1/10          [0.3, 0.4)
G           1/10          [0.4, 0.5)
I           1/10          [0.5, 0.6)
L           2/10          [0.6, 0.8)
S           1/10          [0.8, 0.9)
T           1/10          [0.9, 1.0)

Suppose that we want to encode the message BILL GATES.
36. Arithmetic Coding Example (1)
[Diagram: encoding BILL GATES by repeatedly narrowing the interval: [0.2, 0.3) after B, [0.25, 0.26) after I, [0.256, 0.258) after L, [0.2572, 0.2576) after the second L, [0.2572, 0.25724) after the space, and so on down to the final interval]
37. Arithmetic Coding Example (1)

New character   Low value      High value
B               0.2            0.3
I               0.25           0.26
L               0.256          0.258
L               0.2572         0.2576
(space)         0.25720        0.25724
G               0.257216       0.257220
A               0.2572164      0.2572168
T               0.25721676     0.25721680
E               0.257216772    0.257216776
S               0.2572167752   0.2572167756
38. Arithmetic Coding Example (1)
- The final value, called a tag, 0.2572167752, uniquely encodes the message BILL GATES.
- Any value between 0.2572167752 and 0.2572167756 can serve as a tag for the encoded message and can be uniquely decoded.
39. Arithmetic Coding
- Encoding algorithm for arithmetic coding:
- low = 0.0; high = 1.0
- while not EOF do
-   range = high - low; read(c)
-   high = low + range × high_range(c)
-   low  = low + range × low_range(c)
- end do
- output(low)
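A direct Python transcription of this pseudocode (floating point, for illustration only; a practical coder uses the integer rescaling discussed later). The low_range/high_range values are the cumulative bounds from the BILL GATES table:

    def arithmetic_encode(message, ranges):
        """Float arithmetic coder: `ranges` maps symbol -> (low_range, high_range)."""
        low, high = 0.0, 1.0
        for c in message:
            width = high - low
            low_c, high_c = ranges[c]
            high = low + width * high_c        # shrink the interval to the
            low = low + width * low_c          # sub-range assigned to c
        return low                             # any value in [low, high) works

    ranges = {" ": (0.0, 0.1), "A": (0.1, 0.2), "B": (0.2, 0.3), "E": (0.3, 0.4),
              "G": (0.4, 0.5), "I": (0.5, 0.6), "L": (0.6, 0.8), "S": (0.8, 0.9),
              "T": (0.9, 1.0)}
    print(arithmetic_encode("BILL GATES", ranges))   # about 0.2572167752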
40. Arithmetic Coding
- Decoding is the inverse process.
- Since 0.2572167752 falls between 0.2 and 0.3, the first character must be B.
- Remove the effect of B from 0.2572167752 by first subtracting the low value of B, 0.2, giving 0.0572167752.
- Then divide by the width of B's range, 0.1. This gives a value of 0.572167752.
41. Arithmetic Coding
- Then determine where that value lands, which is in the range of the next letter, I.
- The process repeats until the value reaches 0 or the known length of the message is reached.
42.

r              c         Low   High   Range
0.2572167752   B         0.2   0.3    0.1
0.572167752    I         0.5   0.6    0.1
0.72167752     L         0.6   0.8    0.2
0.6083876      L         0.6   0.8    0.2
0.041938       (space)   0.0   0.1    0.1
0.41938        G         0.4   0.5    0.1
0.1938         A         0.1   0.2    0.1
0.938          T         0.9   1.0    0.1
0.38           E         0.3   0.4    0.1
0.8            S         0.8   0.9    0.1
0.0
43. Arithmetic Coding
- Decoding algorithm:
- r = input_code
- repeat
-   search for c such that r falls in its range
-   output(c)
-   r = r - low_range(c)
-   r = r / (high_range(c) - low_range(c))
- until r equals 0
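The matching decoder in Python (again floating point only, and stopping after a known message length, because the "until r equals 0" test is fragile under floating-point rounding). Slide 38 notes that any value in the final interval is a valid tag, so the midpoint 0.2572167754 is used here for numerical safety:

    def arithmetic_decode(r, ranges, length):
        """Invert the float arithmetic coder: recover `length` symbols from the tag r."""
        out = []
        for _ in range(length):
            # find the symbol whose range contains r
            c = next(s for s, (lo, hi) in ranges.items() if lo <= r < hi)
            out.append(c)
            lo, hi = ranges[c]
            r = (r - lo) / (hi - lo)           # remove the effect of c
        return "".join(out)

    ranges = {" ": (0.0, 0.1), "A": (0.1, 0.2), "B": (0.2, 0.3), "E": (0.3, 0.4),
              "G": (0.4, 0.5), "I": (0.5, 0.6), "L": (0.6, 0.8), "S": (0.8, 0.9),
              "T": (0.9, 1.0)}
    print(arithmetic_decode(0.2572167754, ranges, 10))   # BILL GATES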
44. Arithmetic Coding Example (2)
Suppose that we want to encode the message 1 3 2 1.
45. Arithmetic Coding Example (2)
[Diagram: encoding 1 3 2 1 with cumulative boundaries 0.8, 0.82 and 1.0 for the symbols 1, 2 and 3; the interval narrows from [0.0, 1.0) to [0.0, 0.8), then [0.656, 0.8), then [0.7712, 0.77408), and finally [0.7712, 0.773504)]
46. Arithmetic Coding Example (2)
Encoding:

New character   Low value   High value
(start)         0.0         1.0
1               0.0         0.8
3               0.656       0.800
2               0.7712      0.77408
1               0.7712      0.773504
47. Arithmetic Coding Example (2)
Decoding
48. Arithmetic Coding
- In summary, the encoding process is simply one of narrowing the range of possible numbers with every new symbol.
- The new range is proportional to the predefined probability attached to that symbol.
- Decoding is the inverse procedure, in which the range is expanded in proportion to the probability of each symbol as it is extracted.
49. Arithmetic Coding
- The coding rate theoretically approaches the high-order entropy.
- Not as popular as Huffman coding because multiplications and divisions are needed.
- Average bits/byte on 14 files (program, object, text, etc.):

Huffman   LZW    LZ77/LZ78   Arithmetic
4.99      4.71   2.95        2.48
50. Generating a Binary Code for Arithmetic Coding
- Problem:
- The binary representation of some of the generated floating-point values (tags) would be infinitely long.
- We need increasing precision as the length of the sequence increases.
- Solution:
- Synchronized rescaling and incremental encoding.
51. Generating a Binary Code for Arithmetic Coding
- If the upper bound and the lower bound of the interval are both less than 0.5, rescale the interval and transmit a 0 bit.
- If the upper bound and the lower bound of the interval are both greater than 0.5, rescale the interval and transmit a 1 bit.
- Mapping rules: in the first case x -> 2x, in the second case x -> 2(x - 0.5), applied to both bounds (see the sketch below).
52. Arithmetic Coding Example (2)
[Diagram: encoding 1 3 2 1 with rescaling. Each time the interval falls entirely within [0, 0.5) or [0.5, 1) a bit is emitted and the interval is doubled; the interval passes through [0.656, 0.8), [0.312, 0.6), [0.5424, 0.54816), [0.0848, 0.09632), [0.1696, 0.19264), [0.3392, 0.38528), [0.6784, 0.77056), [0.3568, 0.54112) and ends at [0.3568, 0.504256)]
53. Encoding
Any binary value between the lower and upper bounds can be sent.

54. Decoding
- Decoding of the bit stream starts with 1100011.
- The number of bits needed to distinguish the different symbols is ... bits.
55. Revisiting Arithmetic Coding
- An example: consider sending a message of length 1000 in which each symbol has probability .999.
- Self-information of each symbol: -lg(.999) ≈ .00144 bits
- Sum of the self-information: ≈ 1.4 bits
- Huffman coding will take at least 1000 bits.
- Arithmetic coding: about 3 bits!
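These numbers can be checked directly (my own two-line illustration):

    from math import log2

    p, n = 0.999, 1000
    bits_per_symbol = -log2(p)        # self-information of one symbol: about 0.00144 bits
    print(bits_per_symbol * n)        # about 1.44 bits for the whole 1000-symbol message
    # Huffman must spend at least 1 bit per symbol, i.e. at least 1000 bits;
    # an arithmetic coder can get within a few bits of the 1.44-bit bound.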
56. Arithmetic Coding: Introduction
- Allows blending of bits in a message sequence.
- The total number of bits required can be bounded based on the sum of self-information.
- Used in PPM, JPEG/MPEG (as an option), DMM.
- More expensive than Huffman coding, but an integer implementation is not too bad.
57. Arithmetic Coding (message intervals)
- Assign each probability distribution to an interval range from 0 (inclusive) to 1 (exclusive).
- e.g. f(a) = .0, f(b) = .2, f(c) = .7
- The interval for a particular message will be called the message interval (e.g. for b the interval is [.2, .7)).
58. Arithmetic Coding (sequence intervals)
- To code a message sequence, each message narrows the interval by a factor of p_i.
- Final interval size: the product of the p_i.
- The interval for a message sequence will be called the sequence interval (the recurrence is written out below).
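Written out (a sketch of the standard recurrence, with f(m) denoting the start of message m's interval and p(m) its probability; the slide's own formulas did not survive the conversion):

    \[
    \begin{aligned}
      l_0 &= 0, \qquad s_0 = 1,\\
      l_i &= l_{i-1} + s_{i-1}\, f(m_i), \qquad s_i = s_{i-1}\, p(m_i),\\
      \text{sequence interval} &= [\, l_n,\; l_n + s_n \,), \qquad s_n = \prod_{i=1}^{n} p(m_i).
    \end{aligned}
    \]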
59. Arithmetic Coding: Encoding Example
- Coding the message sequence bac (with p(a) = .2, p(b) = .5, p(c) = .3 as above):
- Start with [0, 1); after b the interval is [.2, .7); after a it is [.2, .3); after c it is [.27, .3).
- The final interval is [.27, .3).
60. Vector Quantization
- How do we compress a color image (r, g, b)?
- Find k representative points for all colors.
- For every pixel, output the nearest representative (see the sketch below).
- If the points are clustered around the representatives, the residuals are small and hence probability coding will work well.
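A minimal sketch of the idea in Python (my own illustration; the k representatives are simply sampled at random rather than produced by a clustering step such as k-means):

    import random

    def vector_quantize(pixels, k=16):
        """Replace each (r, g, b) pixel by the index of its nearest of k representatives."""
        codebook = random.sample(pixels, k)            # crude codebook: k random pixels
        def nearest(p):
            return min(range(k),
                       key=lambda i: sum((a - b) ** 2 for a, b in zip(p, codebook[i])))
        return codebook, [nearest(p) for p in pixels]  # the indices are what gets coded

    # toy "image": 1000 random colours
    pixels = [tuple(random.randrange(256) for _ in range(3)) for _ in range(1000)]
    codebook, indices = vector_quantize(pixels)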
61. Transform Coding
- Transform the input into another space.
- One form of transform is to choose a set of basis functions.
- JPEG and MPEG both use this idea.
62. Other Transform Codes
- Wavelets
- Fractal-based compression
- Based on the idea of fixed points of functions.