Title: Huffman coding
1. Huffman coding
2. Optimal codes - I
- A code is optimal if it has the shortest expected codeword length L
- This can be seen as an optimization problem
3. Optimal codes - II
- Let's make two simplifying assumptions:
- no integer constraint on the codeword lengths
- the Kraft inequality holds with equality
- Minimizing L under these assumptions is a Lagrange-multiplier problem (a sketch of the derivation follows the next slide)
4. Optimal codes - III
- Substituting the solution back into the Kraft inequality gives the optimal codeword lengths, that is l*_i = -log_D p_i
- Note that the resulting expected length is L* = sum_i p_i l*_i = H_D(X), the entropy when we use base D for logarithms
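The slides omit the intermediate algebra; the following is a compact LaTeX sketch of the Lagrange-multiplier step, using the usual (assumed) notation: p_i for the source probabilities, l_i for the codeword lengths, D for the code alphabet size and lambda for the multiplier.

% Sketch of the Lagrange-multiplier derivation (notation assumed, not from the slides).
\[
  J = \sum_i p_i l_i + \lambda \Big( \sum_i D^{-l_i} - 1 \Big),
  \qquad
  \frac{\partial J}{\partial l_i} = p_i - \lambda\, D^{-l_i} \ln D = 0
  \;\Rightarrow\;
  D^{-l_i} = \frac{p_i}{\lambda \ln D}.
\]
\[
  \text{Substituting into } \sum_i D^{-l_i} = 1 \text{ gives } \lambda = \tfrac{1}{\ln D},
  \text{ hence } p_i = D^{-l_i},\quad
  l_i^{*} = -\log_D p_i,\quad
  L^{*} = \sum_i p_i l_i^{*} = H_D(X).
\]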
5. Optimal codes - IV
- In practice the codeword lengths must be integer values, so the result above is only a lower bound
- Theorem: the expected length of any instantaneous D-ary code for a r.v. X satisfies L >= H_D(X)
- This fundamental result derives from the work of Shannon
6. Optimal codes - V
- What about the upper bound?
- Theorem: given a source alphabet (i.e. a r.v.) of entropy H(X), it is possible to find an instantaneous binary code whose length satisfies H(X) <= L < H(X) + 1
- A similar theorem can be stated if we use the wrong probabilities instead of the true ones; the only difference is a term which accounts for the relative entropy (see the statement below)
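For completeness, this is the textbook statement the last bullet alludes to, written in LaTeX; p is the true distribution, q the (wrong) distribution used to design the code, and D(p||q) the relative entropy. The slide text does not spell out the formula, so treat the exact constants as the standard ones rather than the author's.

% Shannon-code lengths designed for q, evaluated under the true distribution p:
\[
  H(p) + D(p\|q) \;\le\; \mathbb{E}_p\!\left[\, l(X) \,\right] \;<\; H(p) + D(p\|q) + 1 .
\]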
7. The redundancy
- It is defined as the average codeword length minus the entropy: R = L - H(X)
- Note that 0 <= R < 1 for an optimal code (why?)
8. Compression ratio
- It is the ratio between the average number of bits/symbol in the original message and the same quantity for the coded message, i.e. C = (average bits/symbol of the original message) / (average bits/symbol of the coded message)
9. Uniquely decodable codes
- The set of instantaneous codes is a small subset of the uniquely decodable codes.
- Is it possible to obtain a lower average code length L using a uniquely decodable code that is not instantaneous? NO
- So we use instantaneous codes, which are easier to decode
10. Summary
- Average codeword length: L >= H_D(X) for uniquely decodable codes (and for instantaneous codes)
- In practice, for each r.v. with entropy H(X) we can build a code with average codeword length that satisfies H(X) <= L < H(X) + 1
11. Shannon-Fano coding
- The main advantage of the Shannon-Fano technique is its simplicity (a sketch follows the list)
- Source symbols are listed in order of non-increasing probability.
- The list is divided in such a way as to form two groups of as nearly equal probabilities as possible
- Each symbol in the first group receives a 0 as the first digit of its codeword, while the others receive a 1
- Each of these groups is then divided according to the same criterion, and additional code digits are appended
- The process is continued until each group contains only one message
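A minimal Python sketch of the procedure just described; the function name shannon_fano and the example probabilities are illustrative, not taken from the slides.

# Minimal Shannon-Fano sketch (hypothetical helper, not from the slides).
def shannon_fano(probs):
    """probs: dict {symbol: probability}. Returns {symbol: codeword}."""
    # 1) list symbols in order of non-increasing probability
    items = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    codes = {sym: "" for sym, _ in items}

    def split(group):
        if len(group) <= 1:
            return
        total = sum(p for _, p in group)
        # 2) find the split that makes the two halves as nearly equal as possible
        best_i, best_diff, acc = 1, float("inf"), 0.0
        for i in range(1, len(group)):
            acc += group[i - 1][1]
            diff = abs(acc - (total - acc))
            if diff < best_diff:
                best_i, best_diff = i, diff
        first, second = group[:best_i], group[best_i:]
        # 3) first group gets a 0, second group gets a 1
        for sym, _ in first:
            codes[sym] += "0"
        for sym, _ in second:
            codes[sym] += "1"
        # 4) repeat on each group until it contains a single symbol
        split(first)
        split(second)

    split(items)
    return codes

# Example (assumed probabilities):
# shannon_fano({"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.125, "e": 0.125})
# -> {'a': '00', 'b': '01', 'c': '10', 'd': '110', 'e': '111'}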
12. Example
- H = 1.9375 bits
- L = 1.9375 bits
13. Shannon-Fano coding - exercise
- Encode, using the Shannon-Fano algorithm
14. Is Shannon-Fano coding optimal?
- L1 = 2.3 bits
15. Huffman coding - I
- There is another algorithm whose performance is slightly better than Shannon-Fano's: the famous Huffman coding
- It works by constructing a tree bottom-up, with the symbols in the leaves
- The two leaves with the smallest probabilities become siblings under a parent node whose probability is the sum of the two children's probabilities
16. Huffman coding - II
- The operation is then repeated, considering also the new parent node and ignoring its children
- The process continues until only one parent node, with probability 1, is left: it is the root of the tree
- Then the two branches of every non-leaf node are labeled 0 and 1 (typically 0 on the left branch, but the order is not important); a small sketch of the construction follows
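A minimal Python sketch of the construction described on the last two slides; the function name huffman_codes is illustrative, and the demo probabilities are those of the example on the next slide.

# Minimal Huffman-construction sketch (hypothetical helper, not from the slides).
import heapq
from itertools import count

def huffman_codes(probs):
    """probs: dict {symbol: probability}. Returns {symbol: codeword}."""
    tiebreak = count()  # keeps heap entries comparable when probabilities are equal
    # Each heap entry is (probability, tiebreak, tree); a tree is either a
    # symbol (leaf) or a (left, right) pair (internal node).
    heap = [(p, next(tiebreak), sym) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # merge the two smallest-probability nodes under a new parent whose
        # probability is the sum of the children's probabilities
        p1, _, t1 = heapq.heappop(heap)
        p2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tiebreak), (t1, t2)))

    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):       # internal node: label branches 0 and 1
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:                             # leaf: record the accumulated codeword
            codes[tree] = prefix or "0"
    walk(heap[0][2], "")
    return codes

# Demo with the probabilities of the following example slide:
# huffman_codes({"a": 0.05, "b": 0.05, "c": 0.1, "d": 0.2, "e": 0.3, "f": 0.2, "g": 0.1})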
17. Huffman coding - example
- [Huffman tree for the source a 0.05, b 0.05, c 0.1, d 0.2, e 0.3, f 0.2, g 0.1; the intermediate nodes have probabilities 0.1, 0.2, 0.3, 0.4, 0.6 and 1.0, and each pair of branches is labeled 0 and 1]
18. Huffman coding - example
- Exercise: evaluate H(X) and L(X) (a numeric check follows)
- H(X) = 2.5464 bits
- L(X) = 2.6 bits !!
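A quick numeric check of these two values. The probabilities are those of the example; the codeword lengths below follow the code used in the next exercise's solution (a = 0000, b = 0001, c = 001, d = 01, e = 10, f = 110, g = 111).

from math import log2

probs   = {"a": 0.05, "b": 0.05, "c": 0.1, "d": 0.2, "e": 0.3, "f": 0.2, "g": 0.1}
lengths = {"a": 4, "b": 4, "c": 3, "d": 2, "e": 2, "f": 3, "g": 3}

H = sum(p * log2(1 / p) for p in probs.values())    # entropy
L = sum(probs[s] * lengths[s] for s in probs)       # average codeword length
print(f"H(X) = {H:.4f} bits, L(X) = {L:.1f} bits")  # 2.5464 and 2.6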
19. Huffman coding - exercise
- Code the sequence aeebcddegfced and calculate the compression ratio
- Sol: 0000 10 10 0001 001 01 01 10 111 110 001 10 01
- Average original symbol length: 3 bits
- Average compressed symbol length: 34/13 bits
- C = .....
20. Huffman coding - exercise
- Decode the sequence 0111001001000001111110
- Sol: dfdcadgf (a small decoding sketch follows)
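A minimal prefix-code decoder in Python, just to make the mechanics of these exercises concrete. The code table is the one inferred from the previous exercise's solution; the function name decode is illustrative.

# Greedy prefix-code decoder (hypothetical helper, not from the slides).
CODE = {"a": "0000", "b": "0001", "c": "001", "d": "01",
        "e": "10", "f": "110", "g": "111"}
REVERSE = {cw: sym for sym, cw in CODE.items()}

def decode(bits):
    """Decode a bit string produced with CODE; works because the code is prefix-free."""
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in REVERSE:       # a complete codeword has been read
            out.append(REVERSE[buf])
            buf = ""
    return "".join(out)

print(decode("0111001001000001111110"))   # -> dfdcadgf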
21. Huffman coding - exercise
- Encode with Huffman the sequence 01cc0a02ba10 and evaluate the entropy, the average codeword length and the compression ratio
22. Huffman coding - exercise
- Decode (if possible) the Huffman-coded bit stream 01001011010011110101...
23. Huffman coding - notes
- In Huffman coding, if at any time there is more than one way to choose a smallest pair of probabilities, any such pair may be chosen
- Sometimes the list of probabilities is initialized to be non-increasing and reordered after each node creation. This detail doesn't affect the correctness of the algorithm, but it provides a more efficient implementation
24. Huffman coding - notes
- There are cases in which Huffman coding does not uniquely determine the codeword lengths, due to the arbitrary choice among equal minimum probabilities.
- For example, for a source with probabilities ... it is possible to obtain codeword lengths of ... and of ...
- It would be better to have a code whose codeword lengths have minimum variance, as this solution needs the minimum buffer space in the transmitter and in the receiver
25. Huffman coding - notes
- Schwarz defines a variant of the Huffman algorithm that allows one to build the code with minimum ...
- There are several other variants; we will explain the most important ones in a while.
26. Optimality of Huffman coding - I
- It is possible to prove that, in the case of character coding (one symbol, one codeword), Huffman coding is optimal
- In other terms, the Huffman code has minimum redundancy
- An upper bound for the redundancy has been found: R <= ..., where ... is the probability of the most likely symbol
27. Optimality of Huffman coding - II
- Why does the Huffman code suffer when there is one symbol with very high probability?
- Remember the notion of uncertainty... a symbol with probability 0.99 carries only about 0.0145 bits of information, yet it cannot receive a codeword shorter than 1 bit
- The main problem is the integer constraint on codeword lengths!!
- This consideration opens the way to a more powerful coding technique... we will see it later
28. Huffman coding - implementation
- A Huffman code can be generated in O(n) time, where n is the number of source symbols, provided that the probabilities have been presorted (however, this sort costs O(n log n)...); see the sketch below
- Nevertheless, encoding is very fast
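A sketch of how the linear-time construction can work once the probabilities are sorted, using the classic two-queue idea: leaves wait in one FIFO queue, merged nodes in another, so the smallest remaining node is always at the front of one of the two. The function name and demo values are illustrative, not from the slides.

# Two-queue Huffman construction for presorted probabilities (sketch).
from collections import deque

def huffman_lengths_presorted(sorted_probs):
    """sorted_probs: probabilities in non-decreasing order.
    Returns the Huffman codeword length of each symbol, in the same order."""
    n = len(sorted_probs)
    if n == 1:
        return [1]
    # Node ids: 0..n-1 are leaves, n..2n-2 are internal nodes.
    leaves = deque((p, i) for i, p in enumerate(sorted_probs))
    merged = deque()
    children = {}                         # internal node id -> (left id, right id)

    def pop_smallest():
        # Both queues stay sorted, so the minimum sits at the front of one of them.
        if not merged or (leaves and leaves[0][0] <= merged[0][0]):
            return leaves.popleft()
        return merged.popleft()

    for new_id in range(n, 2 * n - 1):    # exactly n-1 merges
        p1, id1 = pop_smallest()
        p2, id2 = pop_smallest()
        children[new_id] = (id1, id2)
        merged.append((p1 + p2, new_id))

    # Depths top-down: the root (the last node created) has depth 0.
    depth = [0] * (2 * n - 1)
    for node in range(2 * n - 2, n - 1, -1):
        for child in children[node]:
            depth[child] = depth[node] + 1
    return depth[:n]

# Demo with the probabilities of the earlier example, sorted:
# huffman_lengths_presorted([0.05, 0.05, 0.1, 0.1, 0.2, 0.2, 0.3]) -> [4, 4, 3, 3, 3, 2, 2]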
29. Huffman coding - implementation
- However, the spatial and temporal complexity of the decoding phase is far more important, because, on average, decoding happens more frequently.
- Consider a Huffman tree with n symbols:
- n leaves and n-1 internal nodes
- each leaf holds a pointer to its symbol plus the information that it is a leaf
- each internal node holds two pointers (to its children)
30. Huffman coding - implementation
- 1 million symbols → 16 MB of memory!
- Moreover, traversing the tree from root to leaf involves following a lot of pointers, with little locality of reference. This causes many page faults or cache misses.
- To solve this problem a variant of Huffman coding has been proposed: canonical Huffman coding
31. canonical Huffman coding - I
- [Code tree for the source a 0.11, b 0.12, c 0.13, d 0.14, e 0.24, f 0.26, with internal node probabilities 0.23, 0.27, 0.47, 0.53 and 1.0; each branch carries a 0/1 label, with an alternative label shown in parentheses]
32. canonical Huffman coding - II
- This code cannot be obtained through a Huffman tree!
- We still call it a Huffman code because it is instantaneous and its codeword lengths are the same as those of a valid Huffman code
- Numerical sequence property:
- codewords with the same length are ordered lexicographically
- when the codewords are sorted in lexical order, they are also in order from the longest to the shortest codeword
33. canonical Huffman coding - III
- The main advantage is that it is not necessary to store a tree in order to decode
- We only need:
- a list of the symbols, ordered according to the lexical order of the codewords
- an array with the first codeword of each distinct length
34. canonical Huffman coding - IV
- Encoding. Suppose there are n distinct symbols and that for symbol i we have calculated the Huffman codelength l_i; we then use the following arrays (a sketch of the construction follows):
- numl[k]: number of codewords with length k
- firstcode[k]: integer for the first code of length k
- nextcode[k]: integer for the next codeword of length k to be assigned
- symbol[-,-]: used for decoding
- codeword[i]: the rightmost l_i bits of this integer are the code for symbol i
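A Python sketch of one common way to fill these arrays from the codeword lengths. The convention used here (longest codes starting at 0, and firstcode[k] = (firstcode[k+1] + numl[k+1]) / 2) is the one the example on the next slides appears to follow, but the exact formula is not spelled out in the slide text, so treat it as an assumption.

# Canonical Huffman encoding tables from codeword lengths (sketch).
def canonical_tables(lengths):
    """lengths: dict {symbol: codeword length}.
    Returns (numl, firstcode, codeword, symbol)."""
    maxlen = max(lengths.values())
    numl = {k: 0 for k in range(1, maxlen + 1)}
    for l in lengths.values():
        numl[l] += 1

    # First code of each length, from the longest length up to length 1.
    firstcode = {maxlen: 0}
    for k in range(maxlen - 1, 0, -1):
        firstcode[k] = (firstcode[k + 1] + numl[k + 1]) // 2

    nextcode = dict(firstcode)            # next integer to assign for each length
    codeword, symbol = {}, {}
    for s in sorted(lengths):             # same-length codewords end up in lexical order
        k = lengths[s]
        codeword[s] = format(nextcode[k], f"0{k}b")   # the rightmost k bits
        symbol[(k, nextcode[k] - firstcode[k])] = s   # lookup table for decoding
        nextcode[k] += 1
    return numl, firstcode, codeword, symbol

# Demo with the lengths of the example on the next slide
# (a, e, h -> 2 bits; d -> 3 bits; b, c, f, g -> 5 bits):
# canonical_tables({"a": 2, "e": 2, "h": 2, "d": 3, "b": 5, "c": 5, "f": 5, "g": 5})
# gives firstcode = {5: 0, 4: 2, 3: 1, 2: 1, 1: 2} and codewords
# a=01, e=10, h=11, d=001, b=00000, c=00001, f=00010, g=00011.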
35. canonical Huffman - example
- 2. Evaluate array firstcode
- 3. Construct arrays codeword and symbol
- symbol[k, n] table (rows k = 1..5, columns n = 0..3):
  k=1:  -  -  -  -
  k=2:  a  e  h  -
  k=3:  d  -  -  -
  k=4:  -  -  -  -
  k=5:  b  c  f  g
36. canonical Huffman coding - V
- Decoding. We have the arrays firstcode and symbol, plus (a sketch follows):
- nextinputbit(): function that returns the next input bit
- firstcode[k]: integer for the first code of length k
- symbol[k, n]: returns symbol number n among those with codelength k
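A Python sketch of this decoding loop, under the same assumptions as the encoding sketch above; nextinputbit is modeled by iterating over a bit string, and the firstcode and symbol tables are the ones computed there.

# Canonical Huffman decoding with firstcode and symbol tables (sketch).
def canonical_decode(bits, firstcode, symbol):
    """bits: string of '0'/'1'. Decodes symbols until the input is exhausted."""
    out = []
    it = iter(bits)

    def nextinputbit():
        return int(next(it))

    try:
        while True:
            v, k = nextinputbit(), 1
            # Extend the current value until it reaches the first code of its length.
            while v < firstcode[k]:
                v = 2 * v + nextinputbit()
                k += 1
            out.append(symbol[(k, v - firstcode[k])])
    except StopIteration:
        return "".join(out)

# Demo (tables from the encoding sketch): decoding "01" gives a,
# "001" gives d, "00000" gives b, and so on.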
37. canonical Huffman - example
- [Worked decoding trace: input bits are read one at a time and, using the firstcode values and the symbol[k, n] table of the previous example, the lookups symbol[2,0] = a, symbol[3,0] = d, symbol[2,1] = e, symbol[2,2] = h and symbol[5,0] = b recover the decoded symbols]